Tutorial on MPI: The
Message-Passing Interface
William Gropp
Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, IL 60439
Course Outline
Background on Parallel Computing
Getting Started
MPI Basics
Intermediate MPI
Tools for writing libraries
Final comments
Thanks to Rusty Lusk for some of the material in this
tutorial.
This tutorial may be used in conjunction with
the book "Using MPI", which contains detailed
descriptions of the use of the MPI routines.
Material that begins with this symbol is `advanced'
and may be skipped on a first reading.
Background
Communicating with other processes
Cooperative operations
One-sided operations
The MPI process
Parallel Computing
Separate workers or processes
Interact by exchanging information
Types of parallel computing
All use different data for each worker
Data-parallel   Same operations on different data. Also called SIMD
SPMD            Same program, different data
MIMD            Different programs, different data
SPMD and MIMD are essentially the same
because any MIMD can be made SPMD.
SIMD is also equivalent, but in a less
practical sense.
MPI is primarily for SPMD/MIMD. HPF is
an example of a SIMD interface.
Communicating with other processes
Data must be exchanged with other workers
Cooperative: all parties agree to transfer data
One-sided: one worker performs the transfer of data
Cooperative operations
Message-passing is an approach that makes
the exchange of data cooperative.
Data must both be explicitly sent and received.
An advantage is that any change in the
receiver's memory is made with the receiver's
participation.

    Process 0: SEND( data )
    Process 1: RECV( data )
One-sided operations
One-sided operations between parallel
processes include remote memory reads and writes.
An advantage is that data can be accessed
without waiting for another process.

    PUT( data ): one process writes directly into another process's memory.
    GET( data ): one process reads directly from another process's memory.
Class Example
Take a pad of paper. Algorithm: Initialize with the
number of neighbors you have.
Compute the average of your neighbors' values and
subtract from your value. Make that your new value.
Repeat until done.
Questions
1. How do you get values from your neighbors?
2. Which step or iteration do they correspond to?
   Do you know? Do you care?
3. How do you decide when you are done?
Hardware models
The previous example illustrates the
hardware models by how data is exchanged
among workers.
Distributed memory (e.g., Paragon, IBM
SPx, workstation network)
Shared memory (e.g., SGI Power
Challenge, Cray T3D)
Either may be used with SIMD or MIMD
software models.
All memory is distributed.
What is MPI?
A message-passing library specification
  - message-passing model
  - not a compiler specification
  - not a specific product
For parallel computers, clusters, and heterogeneous
networks
Full-featured
Designed to permit (unleash?) the development of
parallel software libraries
Designed to provide access to advanced parallel
hardware for
  - end users
  - library writers
  - tool developers
Motivation for a New Design
Message Passing now mature as programming
paradigm
  - well understood
  - efficient match to hardware
  - many applications
Vendor systems not portable
Portable systems are mostly research projects
  - incomplete
  - lack vendor support
  - not at most efficient level
Motivation (cont.)
Few systems offer the full range of desired features.
  - modularity (for libraries)
  - access to peak performance
  - portability
  - heterogeneity
  - subgroups
  - topologies
  - performance measurement tools
The MPI Process
Began at Williamsburg Workshop in April, 1992
Organized at Supercomputing '92 (November)
Followed HPF format and process
Met every six weeks for two days
Extensive, open email discussions
Drafts, readings, votes
Pre-final draft distributed at Supercomputing '93
Two-month public comment period
Final version of draft in May, 1994
Widely available now on the Web, ftp sites, netlib
  http://www.mcs.anl.gov/mpi/index.html
Public implementations available
Vendor implementations coming soon
Who Designed MPI?
Broad participation
Vendors
  - IBM, Intel, TMC, Meiko, Cray, Convex, Ncube
Library writers
  - PVM, p4, Zipcode, TCGMSG, Chameleon,
    Express, Linda
Application specialists and consultants

    Companies    Laboratories    Universities
    ARCO         ANL             UC Santa Barbara
    Convex       GMD             Syracuse U
    Cray Res     LANL            Michigan State U
    IBM          LLNL            Oregon Grad Inst
    Intel        NOAA            U of New Mexico
    KAI          NSF             Miss. State U.
    Meiko        ORNL            U of Southampton
    NAG          PNL             U of Colorado
    nCUBE        Sandia          Yale U
    ParaSoft     SDSC            U of Tennessee
    Shell        SRC             U of Maryland
    TMC                          Western Mich U
                                 U of Edinburgh
                                 Cornell U.
                                 Rice U.
                                 U of San Francisco
Features of MPI
General
  - Communicators combine context and group for
    message security
  - Thread safety
Point-to-point communication
  - Structured buffers and derived datatypes,
    heterogeneity
  - Modes: normal (blocking and non-blocking),
    synchronous, ready (to allow access to fast
    protocols), buffered
Collective
  - Both built-in and user-defined collective
    operations
  - Large number of data movement routines
  - Subgroups defined directly or by topology
Features of MPI (cont.)
Application-oriented process topologies
  - Built-in support for grids and graphs (uses groups)
Profiling
  - Hooks allow users to intercept MPI calls to
    install their own tools
Environmental
  - inquiry
  - error control
Features not in MPI
Non-message-passing concepts not included:
  - process management
  - remote memory transfers
  - active messages
  - threads
  - virtual shared memory
MPI does not address these issues, but has tried to
remain compatible with these ideas (e.g., thread
safety as a goal, intercommunicators).
Is MPI Large or Small?
MPI is large (125 functions)
  - MPI's extensive functionality requires many functions
  - Number of functions not necessarily a measure of complexity
MPI is small (6 functions)
  - Many parallel programs can be written with just
    6 basic functions.
MPI is just right
  - One can access flexibility when it is required.
  - One need not master all parts of MPI to use it.
Where to use MPI?
You need a portable parallel program
You are writing a parallel library
You have irregular or dynamic data
relationships that do not fit a data
parallel model
Where not to use MPI:
You can use HPF or a parallel Fortran 90
You don't need parallelism at all
You can use libraries (which may be
written in MPI)
Why learn MPI?
Portable
Expressive
Good way to learn about subtle issues in
parallel computing
Getting started
Writing MPI programs
Compiling and linking
Running MPI programs
More information
  - Using MPI by William Gropp, Ewing Lusk,
    and Anthony Skjellum
  - The LAM companion to "Using MPI..." by
    Zdzislaw Meglicki
  - Designing and Building Parallel Programs by
    Ian Foster
  - A Tutorial/User's Guide for MPI by Peter Pacheco
    (ftp://math.usfca.edu/pub/MPI/mpi.guide.ps)
  - The MPI standard and other information is
    available at http://www.mcs.anl.gov/mpi. Also
    the source for several implementations.
Writing MPI programs
include "mpi.h"
include
int main argc, argv
int argc;
char **argv;
{
MPI_Init &argc, &argv ;
printf "Hello world\n" ;
MPI_Finalize;
return 0;
} 23
Commentary
include "mpi.h" provides basic MPI
de nitions and typ es
MPI_Init starts MPI
MPI_Finalize exits MPI
Note that all non-MPI routines are lo cal;
thus the printf run on each pro cess 24
Compiling and linking
For simple programs, special compiler
commands can be used. For large projects,
it is best to use a standard Makefile.
The MPICH implementation provides
the commands mpicc and mpif77
as well as `Makefile' examples in
`/usr/local/mpi/examples/Makefile.in'
Special compilation commands
The commands
    mpicc -o first first.c
    mpif77 -o firstf firstf.f
may be used to build simple programs when using
MPICH.
These provide special options that exploit the profiling
features of MPI:
    -mpilog     Generate log files of MPI calls
    -mpitrace   Trace execution of MPI calls
    -mpianim    Real-time animation of MPI (not available
                on all systems)
These are specific to the MPICH implementation;
other implementations may provide similar commands
(e.g., mpcc and mpxlf on IBM SP2).
Using Makefiles
The file `Makefile.in' is a template Makefile.
The program script `mpireconfig' translates
this to a Makefile for a particular system.
This allows you to use the same Makefile for
a network of workstations and a massively
parallel computer, even when they use
different compilers, libraries, and linker
options.
    mpireconfig Makefile
Note that you must have `mpireconfig' in
your PATH.
Sample Makefile.in

##### User configurable options #####

ARCH        = @ARCH@
COMM        = @COMM@
INSTALL_DIR = @INSTALL_DIR@
CC          = @CC@
F77         = @F77@
CLINKER     = @CLINKER@
FLINKER     = @FLINKER@
OPTFLAGS    = @OPTFLAGS@
LIB_PATH    = -L$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
FLIB_PATH   = @FLIB_PATH_LEADER@$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
LIB_LIST    = @LIB_LIST@
INCLUDE_DIR = @INCLUDE_PATH@ -I$(INSTALL_DIR)/include

### End User configurable options ###
Sample Makefile.in (con't)

CFLAGS  = @CFLAGS@ $(OPTFLAGS) $(INCLUDE_DIR) -DMPI_$(ARCH)
FFLAGS  = @FFLAGS@ $(INCLUDE_DIR) $(OPTFLAGS)
LIBS    = $(LIB_PATH) $(LIB_LIST)
FLIBS   = $(FLIB_PATH) $(LIB_LIST)
EXECS   = hello

default: hello

all: $(EXECS)

hello: hello.o $(INSTALL_DIR)/include/mpi.h
	$(CLINKER) $(OPTFLAGS) -o hello hello.o \
	$(LIB_PATH) $(LIB_LIST) -lm

clean:
	/bin/rm -f *.o *~ PI* $(EXECS)

.c.o:
	$(CC) $(CFLAGS) -c $*.c
.f.o:
	$(F77) $(FFLAGS) -c $*.f
Running MPI programs
    mpirun -np 2 hello
`mpirun' is not part of the standard, but
some version of it is common with several
MPI implementations. The version shown
here is for the MPICH implementation of MPI.
Just as Fortran does not specify how
Fortran programs are started, MPI does not
specify how MPI programs are started.
The option -t shows the commands that
mpirun would execute; you can use this to
find out how mpirun starts programs on your
system. The option -help shows all options
to mpirun.
Finding out about the environment
Two of the first questions asked in a parallel
program are: How many processes are there?
and Who am I?
How many is answered with MPI_Comm_size
and who am I is answered with MPI_Comm_rank.
The rank is a number between zero and size-1.
A simple program
include "mpi.h"
include
int main argc, argv
int argc;
char **argv;
{
int rank, size;
MPI_Init &argc, &argv ;
MPI_Comm_rank MPI_COMM_WORLD, &rank ;
MPI_Comm_size MPI_COMM_WORLD, &size ;
printf "Hello world! I'm d of d\n",
rank, size ;
MPI_Finalize;
return 0;
} 32
Caveats
These sample programs have been kept
as simple as possible by assuming that all
processes can do output. Not all parallel
systems provide this feature, and MPI
provides a way to handle this case.
Exercise - Getting Started
Objective: Learn how to login, write,
compile, and run a simple MPI program.
Run the "Hello world" programs. Try two
different parallel computers. What does the
output look like?
Sending and Receiving messages
    Process 0: buffer A, Send
    Process 1: buffer B, Recv

Questions:
To whom is data sent?
What is sent?
How does the receiver identify it?
Current Message-Passing
A typical blocking send looks like
    send( dest, type, address, length )
where
  - dest is an integer identifier representing the
    process to receive the message.
  - type is a nonnegative integer that the
    destination can use to selectively screen messages.
  - (address, length) describes a contiguous area in
    memory containing the message to be sent.
and
A typical global operation looks like:
    broadcast( type, address, length )
All of these specifications are a good match to
hardware, easy to understand, but too inflexible.
The Buffer
Sending and receiving only a contiguous array of bytes:
  - hides the real data structure from hardware which
    might be able to handle it directly
  - requires pre-packing dispersed data
      - rows of a matrix stored columnwise
      - general collections of structures
  - prevents communications between machines with
    different representations (even lengths) for the same
    data type
Generalizing the Buffer Description
Specified in MPI by starting address, datatype, and
count, where datatype is:
  - elementary (all C and Fortran datatypes)
  - contiguous array of datatypes
  - strided blocks of datatypes
  - indexed array of blocks of datatypes
  - general structure
Datatypes are constructed recursively.
Specification of elementary datatypes allows
heterogeneous communication.
Elimination of length in favor of count is clearer.
Specifying application-oriented layout of data
allows maximal use of special hardware.
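For example, a small C fragment (illustrative names, not from the original slides)
showing how the (start address, count, datatype) triple replaces the old
(address, length) pair when sending 100 doubles:

    double a[100];
    /* old style:  send( dest, tag, (char *)a, 100*sizeof(double) )      */
    /* MPI style:  the count and datatype describe the data, so the      */
    /* library can convert representations on heterogeneous systems      */
    MPI_Send( a, 100, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD );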
Generalizing the Type
A single type field is too constraining. Often
overloaded to provide needed flexibility.
Problems:
  - under user control
  - wild cards allowed (MPI_ANY_TAG)
  - library use conflicts with user and with other libraries
Sample Program using Library Calls
Sub1 and Sub2 are from different libraries.
    Sub1();
    Sub2();
Sub1a and Sub1b are from the same library
    Sub1a();
    Sub2();
    Sub1b();
Thanks to Marc Snir for the following four examples
Correct Execution of Library Calls
[Figure: time lines for Process 0, Process 1, and Process 2 showing the
sends and receives issued inside Sub1 and Sub2; each receive is matched
by the intended send.]
Incorrect Execution of Library Calls
[Figure: the same time lines, but a wild-card receive posted in Sub1
matches a message intended for Sub2, so the wrong messages are delivered.]
Correct Execution of Library Calls with Pending Communication
[Figure: time lines for Processes 0, 1, and 2 when Sub1a leaves a receive
pending; the pending receive is eventually matched by the send in Sub1b.]
Incorrect Execution of Library Calls with Pending Communication
[Figure: the same time lines, but a message sent inside Sub2 is consumed
by the receive left pending from Sub1a, so the operations match incorrectly.]
Solution to the type problem
A separate communication context for each family
of messages, used for queueing and matching.
(This has often been simulated in the past by
overloading the tag field.)
No wild cards allowed, for security
Allocated by the system, for security
Types (tags, in MPI) retained for normal use (wild cards OK)
Delimiting Scope of Communication
Separate groups of processes working on subproblems
  - Merging of process name space interferes with modularity
  - "Local" process identifiers desirable
Parallel invocation of parallel libraries
  - Messages from application must be kept
    separate from messages internal to library.
  - Knowledge of library message types interferes with modularity.
  - Synchronizing before and after library calls is undesirable.
Generalizing the Process Identifier
Collective operations typically operated on all
processes (although some systems provide subgroups).
This is too restrictive (e.g., need minimum over a
column or a sum across a row, of processes)
MPI provides groups of processes
  - initial "all" group
  - group management routines (build, delete groups)
All communication (not just collective operations)
takes place in groups.
A group and a context are combined in a communicator.
Source/destination in send/receive operations refer
to rank in group associated with a given
communicator. MPI_ANY_SOURCE permitted in a receive.
MPI Basic Send/Receive
Thus the basic (blocking) send has become:
    MPI_Send( start, count, datatype, dest, tag, comm )
and the receive:
    MPI_Recv( start, count, datatype, source, tag, comm, status )
The source, tag, and count of the message actually
received can be retrieved from status.
Two simple collective operations:
    MPI_Bcast( start, count, datatype, root, comm )
    MPI_Reduce( start, result, count, datatype, operation, root, comm )
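A minimal C sketch of a two-process exchange with these routines (the buffer,
tag, and ranks are illustrative, not from the original slides):

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        int        rank, i;
        double     a[10];
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        for (i = 0; i < 10; i++) a[i] = i;
        if (rank == 0) {
            /* process 0 sends 10 doubles to process 1 with tag 99 */
            MPI_Send( a, 10, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD );
        }
        else if (rank == 1) {
            /* process 1 receives them; source, tag, and count could
               also be recovered from status */
            MPI_Recv( a, 10, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status );
        }
        MPI_Finalize();
        return 0;
    }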
Getting information about a message

    MPI_Status status;
    MPI_Recv( ..., &status );
    ... status.MPI_TAG;
    ... status.MPI_SOURCE;
    MPI_Get_count( &status, datatype, &count );

MPI_TAG and MPI_SOURCE are primarily of use when
MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive.
MPI_Get_count may be used to determine how much
data of a particular type was received.
Simple Fortran example
      program main

      include 'mpif.h'

      integer rank, size, to, from, tag, count, i, ierr
      integer src, dest
      integer st_source, st_tag, st_count
      integer status(MPI_STATUS_SIZE)
      double precision data(100)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'Process ', rank, ' of ', size, ' is alive'
      dest = size - 1
      src  = 0
C
      if (rank .eq. src) then
         to    = dest
         count = 10
         tag   = 2001
         do 10 i=1, 10
 10         data(i) = i
         call MPI_SEND( data, count, MPI_DOUBLE_PRECISION, to,
     +                  tag, MPI_COMM_WORLD, ierr )
      else if (rank .eq. dest) then
         tag   = MPI_ANY_TAG
         count = 10
         from  = MPI_ANY_SOURCE
         call MPI_RECV( data, count, MPI_DOUBLE_PRECISION, from,
     +                  tag, MPI_COMM_WORLD, status, ierr )
Simple Fortran example (cont.)

         call MPI_GET_COUNT( status, MPI_DOUBLE_PRECISION,
     +                       st_count, ierr )
         st_source = status(MPI_SOURCE)
         st_tag    = status(MPI_TAG)
C
         print *, 'Status info: source = ', st_source,
     +            ' tag = ', st_tag, ' count = ', st_count
         print *, rank, ' received', (data(i), i=1,10)
      endif

      call MPI_FINALIZE( ierr )
      end
Six Function MPI
MPI is very simple. These six functions allow
you to write many programs:
    MPI_Init
    MPI_Finalize
    MPI_Comm_size
    MPI_Comm_rank
    MPI_Send
    MPI_Recv
A taste of things to come
The following examples show a C and
Fortran version of the same program.
This program computes PI (with a very
simple method) but does not use MPI_Send
and MPI_Recv. Instead, it uses collective
operations to send data to and from all of
the running processes. This gives a different
six-function MPI set:
    MPI_Init
    MPI_Finalize
    MPI_Comm_size
    MPI_Comm_rank
    MPI_Bcast
    MPI_Reduce
Broadcast and Reduction
The routine MPI_Bcast sends data from one
process to all others.
The routine MPI_Reduce combines data from
all processes (by adding them in this case)
and returns the result to a single process.
Fortran example: PI
      program main

      include "mpif.h"

      double precision  PI25DT
      parameter        (PI25DT = 3.141592653589793238462643d0)

      double precision  mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, rc
c                                function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

 10   if ( myid .eq. 0 ) then
         write(6,98)
 98      format('Enter the number of intervals: (0 quits)')
         read(5,99) n
 99      format(i10)
      endif

      call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr )
Fortran example (cont.)

c                                check for quit signal
      if ( n .le. 0 ) goto 30
c                                calculate the interval size
      h = 1.0d0/n
      sum  = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum
c                                collect all the partial sums
      call MPI_REDUCE( mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     $                 MPI_COMM_WORLD, ierr )
c                                node 0 prints the answer.
      if (myid .eq. 0) then
         write(6, 97) pi, abs(pi - PI25DT)
 97      format('  pi is approximately: ', F18.16,
     +          '  Error is: ', F18.16)
      endif
      goto 10
 30   call MPI_FINALIZE(rc)
      stop
      end
C example: PI
include "mpi.h"
include
int mainargc ,a rg v
int argc;
char *argv[];
{
int done = 0, n, myid, numprocs , i, rc;
double PI25DT = 3.14159 26 53 58 97 93 23 84 626 43 ;
double mypi, pi, h, sum, x, a;
MPI_Init &a rg c, &a rg v ;
MPI_Comm_ si ze M PI _C OM M_ WO RL D, &n um pr oc s;
MPI_Comm_ ra nk M PI _C OM M_ WO RL D, &m yi d ; 57
C example (cont.)

    while (!done)
    {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d",&n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;

        h   = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;

        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
Exercise - PI
Objective: Experiment with send/receive
Run either program for PI. Write new
versions that replace the calls to MPI_Bcast
and MPI_Reduce with MPI_Send and MPI_Recv.
The MPI broadcast and reduce operations
use at most log p send and receive operations
on each process, where p is the size of
MPI_COMM_WORLD. How many operations do
your versions use?
Exercise - Ring
Objective: Experiment with send/receive
Write a program to send a message around a
ring of processors. That is, processor 0 sends
to processor 1, who sends to processor 2,
etc. The last processor returns the message
to processor 0.
You can use the routine MPI_Wtime to time
code in MPI. The statement
    t = MPI_Wtime();
returns the time as a double (DOUBLE
PRECISION in Fortran).
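For reference, a minimal C sketch of timing a section of code with
MPI_Wtime (variable names are illustrative):

    double t1, t2;

    t1 = MPI_Wtime();
    /* ... code to be timed, e.g. the ring exchange ... */
    t2 = MPI_Wtime();
    printf( "Elapsed time is %f seconds\n", t2 - t1 );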
Topologies
MPI provides routines to provide structure to
collections of processes.
This helps to answer the question:
Who are my neighbors?
Cartesian Topologies
A Cartesian topology is a mesh
Example of a 3 x 4 Cartesian mesh with arrows
pointing at the right neighbors:

    (0,2) (1,2) (2,2) (3,2)
    (0,1) (1,1) (2,1) (3,1)
    (0,0) (1,0) (2,0) (3,0)
Defining a Cartesian Topology
The routine MPI_Cart_create creates a Cartesian
decomposition of the processes, with the number of
dimensions given by the ndim argument.

    dims(1)    = 4
    dims(2)    = 3
    periods(1) = .false.
    periods(2) = .false.
    reorder    = .true.
    ndim       = 2
    call MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims,
   $                      periods, reorder, comm2d, ierr )
Finding neighbors
MPI_Cart_create creates a new communicator with the
same processes as the input communicator, but with
the specified topology.
The question, Who are my neighbors, can now be
answered with MPI_Cart_shift:

    call MPI_CART_SHIFT( comm2d, 0, 1, nbrleft,   nbrright, ierr )
    call MPI_CART_SHIFT( comm2d, 1, 1, nbrbottom, nbrtop,   ierr )

The values returned are the ranks, in the
communicator comm2d, of the neighbors shifted by 1
in the two dimensions.
Who am I?
Can be answered with
    integer coords(2)
    call MPI_COMM_RANK( comm1d, myrank, ierr )
    call MPI_CART_COORDS( comm1d, myrank, 2,
   $                      coords, ierr )

Returns the Cartesian coordinates of the calling
process in coords.
Partitioning
When creating a Cartesian topology, one question is
"What is a good choice for the decomposition of the
processors?"
This question can be answered with MPI_Dims_create:

    integer dims(2)
    dims(1) = 0
    dims(2) = 0
    call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
    call MPI_DIMS_CREATE( size, 2, dims, ierr )
Other Topology Routines
MPI contains routines to translate between
Cartesian coordinates and ranks in a
communicator, and to access the properties
of a Cartesian topology.
The routine MPI_Graph_create allows the
creation of a general graph topology.
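For instance, a small C fragment (hypothetical names; comm2d is a Cartesian
communicator as defined above) translating between coordinates and ranks:

    int coords[2], rank;

    /* rank of the process at coordinates (1,2) in comm2d */
    coords[0] = 1;
    coords[1] = 2;
    MPI_Cart_rank( comm2d, coords, &rank );

    /* and back: coordinates of a given rank */
    MPI_Cart_coords( comm2d, rank, 2, coords );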
Why are these routines in MPI?
In many parallel computer interconnects,
some processors are closer to each other
than to others. These routines allow the MPI
implementation to provide an ordering of
processes in a topology that makes logical
neighbors close in the physical interconnect.
Some parallel programmers may remember
hypercubes and the effort that went into
assigning nodes in a mesh to processors
in a hypercube through the use of Gray
codes. Many new systems have different
interconnects; ones with multiple paths
may have notions of near neighbors that
change with time. These routines free
the programmer from many of these
considerations. The reorder argument is
used to request the best ordering.
The periods argument
Who are my neighbors if I am at the edge of
a Cartesian mesh?
Periodic Grids
Specify this in MPI_Cart_create with

    dims(1)    = 4
    dims(2)    = 3
    periods(1) = .TRUE.
    periods(2) = .TRUE.
    reorder    = .true.
    ndim       = 2
    call MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims,
   $                      periods, reorder, comm2d, ierr )
Nonperiodic Grids
In the nonperiodic case, a neighbor may
not exist. This is indicated by a rank of
MPI_PROC_NULL.
This rank may be used in send and receive
calls in MPI. The action in both cases is as if
the call was not made.
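A small C sketch (illustrative; assumes a nonperiodic comm2d as above) of why
this is convenient: the shifted exchange needs no special edge cases, since a
transfer involving MPI_PROC_NULL simply does nothing.

    int        nbrleft, nbrright;
    double     sendval = 1.0, recvval = 0.0;
    MPI_Status status;

    MPI_Cart_shift( comm2d, 0, 1, &nbrleft, &nbrright );

    /* At the edges nbrleft or nbrright is MPI_PROC_NULL and the
       corresponding part of the exchange is skipped automatically. */
    MPI_Sendrecv( &sendval, 1, MPI_DOUBLE, nbrright, 0,
                  &recvval, 1, MPI_DOUBLE, nbrleft,  0,
                  comm2d, &status );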
Collective Communications in MPI
Communication is coordinated among a group of processes.
Groups can be constructed "by hand" with MPI
group-manipulation routines or by using MPI
topology-definition routines.
Message tags are not used. Different
communicators are used instead.
No non-blocking collective operations.
Three classes of collective operations:
  - synchronization
  - data movement
  - collective computation
Synchronization
    MPI_Barrier( comm )
Function blocks until all processes in
comm call it.
Available Collective Patterns
[Figure: schematic representation of collective data movement in MPI,
showing Broadcast, Scatter/Gather, Allgather, and Alltoall across
processes P0-P3.]
Available Collective Computation Patterns
[Figure: schematic representation of collective computation in MPI,
showing Reduce and Scan across processes P0-P3.]
MPI Collective Routines
Many routines:
    Allgather     Allgatherv     Allreduce
    Alltoall      Alltoallv      Bcast
    Gather        Gatherv        Reduce
    ReduceScatter Scan           Scatter
    Scatterv
"All" versions deliver results to all participating processes.
"V" versions allow the chunks to have different sizes.
Allreduce, Reduce, ReduceScatter, and Scan take
both built-in and user-defined combination functions.
Built-in Collective Computation Operations

    MPI Name        Operation
    MPI_MAX         Maximum
    MPI_MIN         Minimum
    MPI_PROD        Product
    MPI_SUM         Sum
    MPI_LAND        Logical and
    MPI_LOR         Logical or
    MPI_LXOR        Logical exclusive or (xor)
    MPI_BAND        Bitwise and
    MPI_BOR         Bitwise or
    MPI_BXOR        Bitwise xor
    MPI_MAXLOC      Maximum value and location
    MPI_MINLOC      Minimum value and location
Defining Your Own Collective Operations

    MPI_Op_create( user_function, commute, op )
    MPI_Op_free( op )

    user_function( invec, inoutvec, len, datatype )

The user function should perform

    inoutvec[i] = invec[i] op inoutvec[i];

for i from 0 to len-1.
user_function can be non-commutative (e.g., matrix multiply).
Sample user function
For example, to create an operation that has the
same effect as MPI_SUM on Fortran double precision
values, use

      subroutine myfunc( invec, inoutvec, len, datatype )
      integer len, datatype
      double precision invec(len), inoutvec(len)
      integer i
      do 10 i=1,len
 10      inoutvec(i) = invec(i) + inoutvec(i)
      return
      end

To use, just

      integer myop
      call MPI_Op_create( myfunc, .true., myop, ierr )
      call MPI_Reduce( a, b, 1, MPI_DOUBLE_PRECISION, myop, ... )

The routine MPI_Op_free destroys user functions when
they are no longer needed.
Defining groups
All MPI communication is relative to a
communicator, which contains a context
and a group. The group is just a set of processes.
Subdividing a communicator
The easiest way to create communicators with new
groups is with MPI_COMM_SPLIT.
For example, to form groups of rows of processes
in a grid with rows 0-2 and columns 0-4, use

    MPI_Comm_split( oldcomm, row, 0, &newcomm );

To maintain the order by rank, use

    MPI_Comm_rank( oldcomm, &rank );
    MPI_Comm_split( oldcomm, row, rank, &newcomm );
Subdividing (con't)
Similarly, to form groups of columns of the same grid, use

    MPI_Comm_split( oldcomm, column, 0, &newcomm2 );

To maintain the order by rank, use

    MPI_Comm_rank( oldcomm, &rank );
    MPI_Comm_split( oldcomm, column, rank, &newcomm2 );
Manipulating Groups
Another way to create a communicator with specific
members is to use MPI_Comm_create.

    MPI_Comm_create( oldcomm, group, &newcomm );

The group can be created in many ways:
Creating Groups
All group creation routines create a group by
specifying the members to take from an existing group.
  - MPI_Group_incl specifies specific members
  - MPI_Group_excl excludes specific members
  - MPI_Group_range_incl and MPI_Group_range_excl
    use ranges of members
  - MPI_Group_union and MPI_Group_intersection
    create a new group from two existing groups.
To get an existing group, use

    MPI_Comm_group( oldcomm, &group );

Free a group with

    MPI_Group_free( &group );
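Putting these routines together, a minimal C sketch (not from the original
slides) that builds a communicator containing only the even-ranked processes
of MPI_COMM_WORLD:

    MPI_Group world_group, even_group;
    MPI_Comm  even_comm;
    int       i, size, nevens, *ranks;

    MPI_Comm_size( MPI_COMM_WORLD, &size );
    nevens = (size + 1) / 2;
    ranks  = (int *) malloc( nevens * sizeof(int) );
    for (i = 0; i < nevens; i++) ranks[i] = 2 * i;

    MPI_Comm_group( MPI_COMM_WORLD, &world_group );
    MPI_Group_incl( world_group, nevens, ranks, &even_group );
    /* processes not in even_group receive MPI_COMM_NULL */
    MPI_Comm_create( MPI_COMM_WORLD, even_group, &even_comm );

    MPI_Group_free( &world_group );
    MPI_Group_free( &even_group );
    free( ranks );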
Buffering issues
Where does data go when you send it? One
possibility is:

[Figure: the data in buffer A on Process 1 is copied into a local buffer,
moved across the network into a local buffer on Process 2, and finally
copied into buffer B.]
Better buffering
This is not very efficient. There are three
copies in addition to the exchange of data
between processes. We prefer the data to move
directly from buffer A on Process 1 into buffer B on Process 2.
But this requires either that MPI_Send
not return until the data has been delivered,
or that we allow a send operation to return
before completing the transfer. In this case,
we need to test for completion later.
Blocking and Non-Blocking communication
So far we have used blocking communication:
  - MPI_Send does not complete until the buffer is empty
    (available for reuse).
  - MPI_Recv does not complete until the buffer is full
    (available for use).
Simple, but can be "unsafe":

    Process 0         Process 1
    Send(1)           Send(0)
    Recv(1)           Recv(0)

Completion depends in general on size of message
and amount of system buffering.
Send works for small enough messages but fails
when messages get too large. Too large ranges from
zero bytes to 100's of Megabytes.
Some Solutions to the "Unsafe" Problem
Order the operations more carefully:

    Process 0         Process 1
    Send(1)           Recv(0)
    Recv(1)           Send(0)

Supply receive buffer at same time as send, with
MPI_Sendrecv:

    Process 0         Process 1
    Sendrecv(1)       Sendrecv(0)

Use non-blocking operations:

    Process 0         Process 1
    Isend(1)          Isend(0)
    Irecv(1)          Irecv(0)
    Waitall           Waitall

Use MPI_Bsend
MPI's Non-Blocking Operations
Non-blocking operations return (immediately)
"request handles" that can be waited on and queried:

    MPI_Isend( start, count, datatype, dest, tag, comm, request )
    MPI_Irecv( start, count, datatype, source, tag, comm, request )
    MPI_Wait( request, status )

One can also test without waiting:

    MPI_Test( request, flag, status )
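A small C sketch (hypothetical buffers and a neighbor rank `other`) of the
non-blocking exchange suggested earlier as a fix for the unsafe send/send
pattern:

    double      inbuf[10], outbuf[10];
    MPI_Request requests[2];
    MPI_Status  statuses[2];

    /* post the receive and the send; neither call blocks */
    MPI_Irecv( inbuf,  10, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &requests[0] );
    MPI_Isend( outbuf, 10, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &requests[1] );

    /* ... other work may be done here ... */

    /* wait for both the receive and the send to complete */
    MPI_Waitall( 2, requests, statuses );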
Multiple completions
It is often desirable to wait on multiple requests. An
example is a master/slave program, where the master
waits for one or more slaves to send it a message.

    MPI_Waitall( count, array_of_requests, array_of_statuses )
    MPI_Waitany( count, array_of_requests, index, status )
    MPI_Waitsome( incount, array_of_requests, outcount,
                  array_of_indices, array_of_statuses )

There are corresponding versions of test for each of these.
MPI_WAITSOME and MPI_TESTSOME may be used to
implement master/slave algorithms that provide fair
access to the master by the slaves.
Fairness
What happens with this program:

#include "mpi.h"
#include <stdio.h>

int main( argc, argv )
int argc;
char **argv;
{
    int rank, size, i, buf[1];
    MPI_Status status;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    if (rank == 0) {
        for (i=0; i<100*(size-1); i++) {
            MPI_Recv( buf, 1, MPI_INT, MPI_ANY_SOURCE,
                      MPI_ANY_TAG, MPI_COMM_WORLD, &status );
            printf( "Msg from %d with tag %d\n",
                    status.MPI_SOURCE, status.MPI_TAG );
        }
    }
    else {
        for (i=0; i<100; i++)
            MPI_Send( buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD );
    }
    MPI_Finalize();
    return 0;
}
Fairness in message-passing
A parallel algorithm is fair if no process
is effectively ignored. In the preceding
program, processes with low rank (like
process zero) may be the only ones whose
messages are received.
MPI makes no guarantees about fairness.
However, MPI makes it possible to write
efficient, fair programs.
Providing Fairness
One alternative is

#define large 128
MPI_Request requests[large];
MPI_Status  statuses[large];
int         indices[large];
int         buf[large];
for i=1; i MPI_Irecv buf+i, 1, MPI_INT, i, MPI_ANY_ TA G, MPI_COM M_ WO RLD , &request s[ i- 1] ; whilenot done { MPI_Waits om e size-1, request s, &ndone, indices, statuse s ; for i=0; i j = indices [i ]; printf "Msg from d with tag d\n", statuse s[ i] .M PI _S OU RC E, statuse s[ i] .M PI _T AG ; MPI_Ire cv buf+j, 1, MPI_INT, j, MPI_ANY_ TA G, MPI_COM M_W OR LD , &request s[ j] ; } } 93 Providing Fairness Fortran One alternative is paramete r large = 128 integer requests l ar ge ; integer statuses M PI _S TA TU S_ SI ZE ,l ar ge ; integer indices la rg e ; integer buflarg e ; logical done do 10 i = 1,size-1 10 call MPI_Ire cv bufi, 1, MPI_INT EG ER, i, * MPI_ANY_ TA G, MPI_COM M_ WO RLD , requests i , ierr 20 if .not. done then call MPI_Wait so me size-1, requests , ndone, indices, statuse s, ierr do 30 i=1, ndone j = indices i print *, 'Msg from ', statuse s MPI _S OU RC E, i , ' with tag', * statuses M PI _T AG ,i call MPI_Irec v bufj, 1, MPI_INTEG ER , j, MPI_ANY_ TA G, MPI_COM M_W OR LD , requests j , ierr done = ... 30 continue goto 20 endif 94 Exercise - Fairness Objective: Use nonblo cking communications Complete the program fragment on \providing fairness". Make sure that you leave no uncompleted requests. How would you test your program? 95 More on nonblo cking communication In applications where the time to send data b etween pro cesses is large, it is often helpful to cause communication and computation to overlap. This can easily b e done with MPI's non-blo cking routines. For example, in a 2-D nite di erence mesh, moving data needed for the b oundaries can b e done at the same time as computation on the interior. MPI_Irecv ... each ghost edge ... ; MPI_Isend ... data for each ghost edge ... ; ... compute on interior while still some uncompleted requests { MPI_Waitany ... requests ... if request is a receive ... compute on that edge ... } Note that we call MPI_Waitany several times. This exploits the fact that after a request is satis ed, it is set to MPI_REQUEST_NULL, and that this is a valid request object to the wait and test routines. 96 Communication Mo des MPI provides mulitple mo des for sending messages: Synchronous mo de MPI Ssend: the send do es not complete until a matching receive has b egun. Unsafe programs b ecome incorrect and usually deadlo ck within an MPI_Ssend. Bu ered mo de MPI Bsend: the user supplies the bu er to system for its use. User supplies enough memory to make unsafe program safe. Ready mo de MPI Rsend: user guarantees that matching receive has b een p osted. { allows access to fast proto cols { unde ned b ehavior if the matching receive is not p osted Non-blo cking versions: MPI Issend, MPI Irsend, MPI Ibsend Note that an MPI_Recv may receive messages sent with any send mo de. 97 Bu ered Send MPI provides a send routine that may b e used when MPI_Isend is awkward to use e.g., lots of small messages. MPI_Bsend makes use of a user-provided bu er to save any messages that can not b e immediately sent. int bufsize; char *buf = mallocbufsize; MPI_Buffer_attach buf, bufsize ; ... MPI_Bsend ... same as MPI_Send ... ; ... MPI_Buffer_detach &buf, &bufsize ; The MPI_Buffer_detach call do es not complete until all messages are sent. Bsend dep ends on the The p erformance of MPI implementation of MPI and may also dep end on the size of the message. For example, making a message one byte longer may cause a signi cant drop in p erformance. 98 Reusing the same bu er Consider a lo op MPI_Buffer_attach buf, bufsize ; while !done { ... MPI_Bsend ... 
; } where the buf is large enough to hold the message in the MPI_Bsend. This co de may fail b ecause the { void *buf; int bufsize; MPI_Buffer_detach &buf, &bufsize ; MPI_Buffer_attach buf, bufsize ; } 99 Other Point-to-Point Features MPI_SENDRECV, MPI_SENDRECV_REPLACE MPI_CANCEL Persistent communication requests 100 Datatyp es and Heterogenity MPI datatyp es have two main purp oses Heterogenity | parallel programs between di erent pro cessors Noncontiguous data | structures, vectors with non-unit stride, etc. Basic datatyp e, corresp onding to the underlying language, are prede ned. The user can construct new datatyp es at run time; these are called derived datatyp es. 101 Datatyp es in MPI Elementary: Language-de ned typ es e.g., MPI_INT or MPI_DOUBLE_PRECISION Vector: Separated by constant \stride" Contiguous: Vector with stride of one Hvector: Vector, with stride in bytes Indexed: Array of indices for scatter/gather Hindexed: Indexed, with indices in bytes Struct: General mixed typ es for C structs etc. 102 Basic Datatyp es Fortran MPI datatyp e Fortran datatyp e MPI_INTEGER INTEGER MPI_REAL REAL MPI_DOUBLE_PRECISION DOUBLE PRECISION MPI_COMPLEX COMPLEX MPI_LOGICAL LOGICAL MPI_CHARACTER CHARACTER1 MPI_BYTE MPI_PACKED 103 Basic Datatyp es C MPI datatyp e C datatyp e MPI_CHAR signed char MPI_SHORT signed short int MPI_INT signed int MPI_LONG signed long int MPI_UNSIGNED_CHAR unsigned char MPI_UNSIGNED_SHORT unsigned short int MPI_UNSIGNED unsigned int MPI_UNSIGNED_LONG unsigned long int MPI_FLOAT float MPI_DOUBLE double MPI_LONG_DOUBLE long double MPI_BYTE MPI_PACKED 104 Vectors 29 30 31 32 33 34 35 22 23 24 25 26 27 28 15 16 17 18 19 20 21 8 9 10 11 12 13 14 1234567 To sp ecify this row in C order, we can use MPI_Type_vector count, blocklen, stride, oldtype, &newtype ; MPI_Type_commit &newtype ; The exact co de for this is MPI_Type_vector 5, 1, 7, MPI_DOUBLE, &newtype ; MPI_Type_commit &newtype ; 105 Structures Structures are describ ed byarrays of numb er of elements array_of_len displacement or lo cation array_of_displs datatyp e array_of_types MPI_Type_structure count, array_of_len, array_of_displs, array_of_types, &newtype ; 106 Example: Structures struct { char display[ 50 ]; /* Name of display */ int maxiter; /* max of iteratio ns */ double xmin, ymin; /* lower left corner of rectangl e */ double xmax, ymax; /* upper right corner */ int width; /* of display in pixels */ int height; /* of display in pixels */ } cmdline; /* set up 4 blocks */ int blockcou nt s[ 4] = {50,1,4 ,2 }; MPI_Datat yp e types[4] ; MPI_Aint displs[4 ]; MPI_Datat yp e cmdtype; /* initiali ze types and displs with addresse s of items */ MPI_Addre ss &cmdline .d is pl ay , &displs[ 0] ; MPI_Addre ss &cmdline .m ax it er , &displs[ 1] ; MPI_Addre ss &cmdline .x mi n, &displs[ 2] ; MPI_Addre ss &cmdline .w id th , &displs[ 3] ; types[0] = MPI_CHAR ; types[1] = MPI_INT; types[2] = MPI_DOUB LE ; types[3] = MPI_INT; for i = 3; i >= 0; i-- displs[i] -= displs[0 ]; MPI_Type_ st ru ct 4, blockco un ts , displs, types, &cmdtype ; MPI_Type_ co mm it &cmdtype ; 107 Strides The extent of a datatyp e is normally the distance between the rst and last memb er. Memory locations specified by datatype EXTENT LB UB You can set an arti cial extent by using MPI_UB and MPI_LB in MPI_Type_struct. 108 Vectors revisited This co de creates a datatyp e for an arbitrary numb er of element in a row of an array stored in Fortran order column rst. 
int blens[2], displs[2]; MPI_Datatype types[2], rowtype; blens[0] = 1; blens[1] = 1; displs[0] = 0; displs[1] = number_in_column * sizeofdouble; types[0] = MPI_DOUBLE; types[1] = MPI_UB; MPI_Type_struct 2, blens, displs, types, &rowtype ; MPI_Type_commit &rowtype ; To send n elements, you can use MPI_Send buf, n, rowtype, ... ; 109 Structures revisited When sending an array of a structure, it is imp ortant to ensure that MPI and the C compiler have the same value for the size of each structure. The most portable way to do this is to add an MPI_UB to the structure de nition for the end of the structure. In the previous example, this is /* initiali ze types and displs with addresse s of items */ MPI_Addre ss &cmdline .d is pl ay , &displs[ 0] ; MPI_Addre ss &cmdline .m ax it er , &displs[ 1] ; MPI_Addre ss &cmdline .x mi n, &displs[ 2] ; MPI_Addre ss &cmdline .w id th , &displs[ 3] ; MPI_Addre ss &cmdline +1 , &displs[ 4] ; types[0] = MPI_CHAR ; types[1] = MPI_INT; types[2] = MPI_DOUB LE ; types[3] = MPI_INT; types[4] = MPI_UB; for i = 4; i >= 0; i-- displs[i] -= displs[0 ]; MPI_Type_ st ru ct 5, blockco un ts , displs, types, &cmdtype ; MPI_Type_ co mm it &cmdtype ; 110 Interleaving data By moving the UB inside the data, you can interleave data. Consider the matrix 0 8 16 24 32 1 9 17 25 33 2 10 18 26 34 3 11 19 27 35 4 12 20 28 36 5 13 21 29 37 6 14 22 30 38 7 15 23 31 39 We wish to send 0-3,8-11,16-19, and 24-27 to pro cess 0, 4-7,12-15,20-23, and 28-31 to pro cess 1, etc. How can we do this with MPI_Scatterv? 111 An interleaved datatyp e MPI_Type_vector 4, 4, 8, MPI_DOUBLE, &vec ; de nes a blo ck of this matrix. blens[0] = 1; blens[1] = 1; types[0] = vec; types[1] = MPI_UB; displs[0] = 0; displs[1] = sizeofdouble; MPI_Type_struct 2, blens, displs, types, &block ; de nes a blo ck whose extent is just 1 entries. 112 Scattering a Matrix We set the displacements for each blo ck as the lo cation of the rst element in the blo ck. This works b ecause MPI_Scatterv uses the extents to determine the start of each piece to send. scdispls[0] = 0; scdispls[1] = 4; scdispls[2] = 32; scdispls[3] = 36; MPI_Scatterv sendbuf, sendcounts, scdispls, block, recvbuf, nx * ny, MPI_DOUBLE, 0, MPI_COMM_WORLD ; Howwould use use the top ology routines to make this more general? 113 Exercises - datatyp es Objective: Learn ab out datatyp es 1. Write a program to send rows of a matrix stored in column-majorform to the other pro cessors. Let pro cessor 0 have the entire matrix, which has as many rows as pro cessors. Pro cessor 0 sends rowitopro cessori. Pro cessor i reads that row into a lo cal array that holds only that row. That is, pro cessor 0 has a matrix AN; M while the other pro cessors have a row B M . a Write the program to handle the case where the matrix is square. b Write the program to handle a numb er of columns read from the terminal. Cprogrammers may send columns of a matrix stored in row-majorform if they prefer. If you have time, try one of the following. If you don't have time, think ab out howyou would program these. 2. Write a program to transp ose a matrix, where each pro cessor hasapart of the matrix. Use top ologies to de ne a 2-Dimensional partitioning 114 of the matrix across the pro cessors, and assume that all pro cessors have the same size submatrix. a Use MPI_Send and MPI_Recv to send the blo ck, the transp ose the blo ck. b Use MPI_Sendrecv instead. c Create a datatyp e that allows you to receive the blo ck already transp osed. 3. 
Write a program to send the "ghostp oints" of a 2-Dimensional mesh to the neighb oring pro cessors. Assume that each pro cessor has the same size subblo ck. a Use top ologies to nd the neighb ors b De ne a datatyp e for the \rows" c Use MPI_Sendrecv or MPI_IRecv and MPI_Send with MPI_Waitall. d Use MPI_Isend and MPI_Irecv to start the communication, do some computation on the interior, and then use MPI_Waitany to pro cess the b oundaries as they arrive The same approach works for general datastructures, such as unstructured meshes. 4. Do 3, but for 3-Dimensional meshes. You will need MPI_Type_Hvector. To ols for writing libraries MPI is sp eci cally designed to make it easier to write message-passing librari es Communicators solve tag/source wild-card problem Attributes provide a way to attach information to a communicator 115 Private communicators One of the rst thing that a library should normally do is create private communicator. This allows the library to send and receive messages that are known only to the library. MPI_Comm_dup old_comm, &new_comm ; 116 Attributes Attributes are data that can be attached to one or more communicators. Attributes are referenced by keyval. Keyvals are created with MPI_KEYVAL_CREATE. Attributes are attached to a communicator with MPI_Attr_put and their values accessed by MPI_Attr_get. Op erations are de ned for what happ ens to an attribute when it is copied by creating one communicator from another or deleted by deleting a communicator when the keyval is created. 117 What is an attribute? In C, an attribute is a p ointer of typ e void *. You must allo cate storage for the attribute to p oint to make sure that you don't use the address of a lo cal variable. In Fortran, it is a single INTEGER. 118 Examples of using attributes Forcing sequential op eration Managing tags 119 Sequential Sections include "mpi.h" include static int MPE_Seq_ ke yv al = MPI_KEY VA L_ INV AL ID ; /*@ MPE_Seq_ be gi n - Begins a sequent ia l section of code. Input Paramete rs : . comm - Communi ca to r to sequent ia li ze . . ng - Number in group. This many processe s are allowed to execute at the same time. Usually one. @*/ void MPE_Seq_ be gi n comm, ng MPI_Comm comm; int ng; { int lidx, np; int flag; MPI_Comm local_co mm ; MPI_Statu s status; /* Get the private communic at or for the sequenti al operation s */ if MPE_Seq _k ey va l == MPI_KEY VA L_ IN VA LI D { MPI_Keyva l_ cr ea te MPI_NULL _C OP Y_ FN , MPI_NULL _D EL ET E_ FN, &MPE_Seq _k ey va l, NULL ; } 120 Sequential Sections I I MPI_Attr_ ge t comm, MPE_Seq _k ey va l, void *&local _c om m, &flag ; if !flag { /* This expects a communi ca to r to be a pointer */ MPI_Comm_ du p comm, &local_ co mm ; MPI_Attr_ pu t comm, MPE_Seq _k ey va l, void *local _c om m ; } MPI_Comm_ ra nk comm, &lidx ; MPI_Comm_ si ze comm, &np ; if lidx != 0 { MPI_Recv NULL, 0, MPI_INT, lidx-1, 0, local_c om m, &status ; } /* Send to the next process in the group unless we are the last process in the processo r set */ if lidx ng MPI_Send NULL, 0, MPI_INT, lidx + 1, 0, local_c om m ; } } 121 Sequential Sections I I I /*@ MPE_Seq_ en d - Ends a sequent ia l section of code. Input Paramete rs : . comm - Communi ca to r to sequent ia li ze . . ng - Number in group. 
@*/ void MPE_Seq_ en d comm, ng MPI_Comm comm; int ng; { int lidx, np, flag; MPI_Statu s status; MPI_Comm local_co mm ; MPI_Comm_ ra nk comm, &lidx ; MPI_Comm_ si ze comm, &np ; MPI_Attr_ ge t comm, MPE_Seq _k ey va l, void *&local _c om m, &flag ; if !flag MPI_Abort comm, MPI_ERR_ UN KN OW N ; /* Send to the first process in the next group OR to the first process in the process or set */ if lidx ng == ng - 1 || lidx == np - 1 { MPI_Send NULL, 0, MPI_INT, lidx + 1 np, 0, local_com m ; } if lidx == 0 { MPI_Recv NULL, 0, MPI_INT, np-1, 0, local_c om m, &status ; } } 122 Comments on sequential sections Note use of MPI_KEYVAL_INVALID to determine to create a keyval Note use of ag on MPI_Attr_get to discover that a communicator has no attribute for the keyval 123 Example: Managing tags Problem: A library contains many objects that need to communicate in ways that are not known until runtime. Messages between objects are kept separate by using di erent message tags. How are these tags chosen? Unsafe to use compile time values Must allo cate tag values at runtime Solution: Use a private communicator and use an attribute to keep track of available tags in that communicator. 124 Caching tags on communicator include "mpi.h" static int MPE_Tag_ ke yv al = MPI_KEY VA L_ INV AL ID ; /* Private routine to delete internal storage when a communica to r is freed. */ int MPE_DelTa g comm, keyval, attr_va l, extra_st at e MPI_Comm *comm; int *keyval; void *attr_va l, *extra_ st at e; { free attr_va l ; return MPI_SUCC ES S; } 125 Caching tags on communicator II /*@ MPE_GetTa gs - Returns tags that can be used in communica ti on with a communica to r Input Paramet er s: . comm_in - Input communi ca to r . ntags - Number of tags Output Paramete rs : . comm_out - Output communi ca to r. May be 'comm_in '. . first_tag - First tag availab le @*/ int MPE_GetTa gs comm_in, ntags, comm_out, first_t ag MPI_Comm comm_in, *comm_o ut ; int ntags, *first_ ta g; { int mpe_errno = MPI_SUC CE SS ; int tagval, *tagval p, *maxval , flag; if MPE_Tag _k ey va l == MPI_KEY VA L_ IN VA LI D { MPI_Keyva l_ cr ea te MPI_NULL _C OP Y_ FN , MPE_Del Ta g, &MPE_Tag _k ey va l, void *0 ; } 126 Caching tags on communicator III if mpe_err no = MPI_Att r_ ge t comm_in , MPE_Tag_k ey va l, &tagvalp, &flag return mpe_errn o; if !flag { /* This communi ca to r is not yet known to this system, so we dup it and setup the first value */ MPI_Comm_ du p comm_in , comm_out ; comm_in = *comm_o ut ; MPI_Attr_ ge t MPI_COM M_ WO RL D, MPI_TAG_ UB , &maxval, &flag ; tagvalp = int *malloc 2 * sizeofin t ; printf "Malloc in g address x\n", tagvalp ; if !tagval p return MPI_ERR_ EX HA US TED ; tagvalp = *maxval ; MPI_Attr_ pu t comm_in , MPE_Tag_ ke yv al, tagvalp ; return MPI_SUCC ES S; } 127 Caching tags on communicator IV *comm_out = comm_in ; if *tagval p < ntags { /* Error, out of tags. Another solution would be to do an MPI_Com m_ du p. */ return MPI_ERR_ IN TE RN ; } *first_ta g = *tagvalp - ntags; *tagvalp = *first_t ag ; return MPI_SUCC ES S; } 128 Caching tags on communicator V /*@ MPE_Retur nT ag s - Returns tags allocat ed with MPE_Get Ta gs . Input Paramet er s: . comm - Communic at or to return tags to . first_tag - First of the tags to return . ntags - Number of tags to return. 
@*/ int MPE_Retur nT ag s comm, first_t ag , ntags MPI_Comm comm; int first_ta g, ntags; { int *tagvalp, flag, mpe_err no ; if mpe_err no = MPI_Att r_ ge t comm, MPE_Tag_ ke yv al , &tagvalp, &flag return mpe_errn o; if !flag { /* Error, attribu te does not exist in this communi ca to r */ return MPI_ERR_ OT HE R; } if *tagval p == first_t ag *tagvalp = first_ta g + ntags; return MPI_SUCC ES S; } 129 Caching tags on communicator VI /*@ MPE_TagsE nd - Returns the private keyval. @*/ int MPE_TagsE nd { MPI_Keyva l_ fr ee &MPE_Tag _k ey va l ; MPE_Tag_k ey va l = MPI_KEYV AL _I NV AL ID ; } 130 Commentary Use MPI_KEYVAL_INVALID to detect when keyval must be created Use flag return from MPI_ATTR_GET to detect when a communicator needs to be initial i zed 131 Exercise - Writing libraries Objective: Use private communicators and attributes Write a routine to circulate data to the next pro cess, using a nonblo cking send and receive op eration. void Init_pipe comm void ISend_pipe comm, bufin, len, datatype, bufout void Wait_pipe comm Atypical use is Init_pipe MPI_COMM_WORLD for i=0; i ISend_pipe comm, bufin, len, datatype, bufout ; Do_Work bufin, len ; Wait_pipe comm ; t = bufin; bufin = bufout; bufout = t; } What happ ens if Do_Work calls MPI routines? What do you need to do to clean up Init pipe? How can you use a user-de ned top ology to Topo test determine the next pro cess? Hint: see MPI and MPI Cartdim get. 132 MPI Objects MPI has a variety of objects communicators, groups, datatyp es, etc. that can be created and destroyed. This section discusses the typ es of these data and how MPI manages them. This entire chapter may be skipp ed by b eginners. 133 The MPI Objects MPI Request Handle for nonblo cking communication, normally freed by MPI in a test or wait MPI Datatype MPI datatyp e. Free with MPI_Type_free. Op User-de ned op eration. Free with MPI MPI_Op_free. MPI Comm Communicator. Free with MPI_Comm_free. MPI Group Group of pro cesses. Free with MPI_Group_free. MPI Errhandler MPI errorhandler. Free with MPI_Errhandler_free. 134 When should objects b e freed? Consider this co de MPI_Type_vector ly, 1, nx, MPI_DOUBLE, &newx1 ; MPI_Type_hvector lz, 1, nx*ny*sizeofdouble, newx1, &newx ; MPI_Type_commit &newx ; This creates a datatyp e for one face of a 3-D decomp osition. When should newx1 b e freed? 135 Reference counting MPI keeps track of the use of an MPI object, and only truely destroys it when no-one is using it. newx1 is b eing used by the user the MPI_Type_vector that created it and by the MPI_Datatype newx that uses it. If newx1 is not needed after newx is de ned, it should b e freed: MPI_Type_vector ly, 1, nx, MPI_DOUBLE, &newx1 ; MPI_Type_hvector lz, 1, nx*ny*sizeofdouble, newx1, &newx ; MPI_Type_free &newx1 ; MPI_Type_commit &newx ; 136 Why reference counts Why not just free the object? Consider this library routine: void MakeDatatype nx, ny, ly, lz, MPI_Datatype *new { MPI_Datatype newx1; MPI_Type_vector ly, 1, nx, MPI_DOUBLE, &newx1 ; MPI_Type_hvector lz, 1, nx*ny*sizeofdouble, newx1, new ; MPI_Type_free &newx1 ; MPI_Type_commit new ; } Without the MPI_Type_free &newx1 ,itwould b e very awkward to later free newx1 when new was freed. 137 To ols for evaluating programs MPI provides some to ols for evaluating the p erformance of parallel programs. These are Timer Pro ling interface 138 The MPI Timer The elapsed wall-clo ck time between two p oints in an MPI program can be computed using MPI_Wtime: double t1, t2; t1 = MPI_Wtime; ... 
t2 = MPI_Wtime; printf "Elapsed time is f\n", t2 - t1 ; The value returned by a single call to MPI_Wtime has littl e value. The times are lo cal; the attribute MPI WTIME IS GLOBAL may b e used to determine if the times are also synchronized with each other for all pro cesses in MPI COMM WORLD. 139 Pro ling All routines have two entry p oints: MPI ... and PMPI .... This makes it easy to provide a single level of low-overhead routines to intercept MPI calls without any source co de mo di cations. Used to provide \automatic" generation of trace les. MPI_Send MPI_Send MPI_Send PMPI_Send PMPI_Send MPI_Bcast MPI_Bcast User ProgramProfile Library MPI Library static int nsend =0; int MPI_Send start, count, datatyp e, dest, tag, comm { nsend++; return PMPI_Sen d start, count, datatyp e, dest, tag, comm } 140 Writing pro ling routines The MPICH implementation contains a program for writing wrapp ers. This description will write out each MPI routine that is called.: ifdef MPI_BUILD_PROFILING undef MPI_BUILD_PROFILING endif include include "mpi.h" {{fnall fn_name}} {{vardecl int llrank}} PMPI_Comm_rank MPI_COMM_WORLD, &llrank ; printf "[d] Starting {{fn_name}}...\n", llrank ; fflush stdout ; {{callfn}} printf "[d] Ending {{fn_name}}\n", llrank ; fflush stdout ; {{endfnall}} The command wrappergen -w trace.w -o trace.c converts this to a C program. The complie the le `trace.c' and insert the resulting object le into your link line: cc -o a.out a.o ... trace.o -lpmpi -lmpi 141 Another pro ling example This version counts all calls and the numb er of bytes sent with MPI_Send, MPI_Bse nd ,or MPI_Ise nd . include "mpi.h" {{foreach fn fn_name MPI_Sen d MPI_Bsen d MPI_Isend }} static long {{fn_na me }} _n by te s_ {{ fi le no }}; {{ en df or ea ch fn }} {{forallf n fn_name MPI_Init MPI_Fin al iz e MPI_Wti me }} in t {{fn_name }} _n ca ll s_ {{ fi le no }} ; {{endfora ll fn }} {{fnall this_fn _n am e MPI_Fina li ze }} printf "{{this _f n_ na me }} is being called.\n " ; {{callfn} } {{this_fn _n am e} }_ nc al ls _{ {f il en o} }+ +; {{endfnal l} } {{fn fn_name MPI_Send MPI_Bse nd MPI_Ise nd} } {{vardecl int typesiz e} } {{callfn} } MPI_Type_ si ze {{dataty pe }} , MPI_Ain t *&{{ty pe si ze }} ; {{fn_name }} _n by te s_ {{ fi le no }} += {{ ty pe siz e} }* {{ co un t} } {{fn_name }} _n ca ll s_ {{ fi le no }} ++ ; {{endfn}} 142 Another pro ling example con't {{fn fn_name MPI_Fina li ze }} {{forallf n dis_fn}} if {{dis_f n} }_ nc al ls _{ {f il en o} } { printf "{{dis_ fn }} : d calls\n ", {{dis_fn }} _n ca ll s_ {{ fi le no} } ; } {{endfora ll fn }} if MPI_Sen d_ nc al ls _{ {f il en o} } { printf "d bytes sent in d calls with MPI_Sen d\ n" , MPI_Send _n by te s_ {{ fi le no }}, MPI_Send_ nc al ls _{ {f il en o} } ; } {{callfn} } {{endfn}} 143 Generating and viewing log les Log les that contain a history of a parallel computation can be very valuable in understanding a parallel program. The upshot and nupshot programs, provided in the MPICH and MPI-F implementations, may be used to view log les 144 Generating a log le This is very easy with the MPICH implementation of MPI. Simply replace -lmpi with -llmpi -lpmpi -lm in the link line for your program, and relink your program. You do not need to recompile. On some systems, you can get a real-time animation by using the librari es -lampi -lmpe -lm -lX11 -lpmpi. Alternately, you can use the -mpilog or -mpianim options to the mpicc or mpif77 commands. 
145 Connecting several programs together MPI provides supp ort for connection separate message-passing programs together through the use of intercommunicators. 146 Sending messages between di erent programs Programs share MPI_COMM_WORLD. Programs have separate and disjoint communicators. App1 App2 Comm1 Comm2 Comm_intercomm MPI_COMM_WORLD 147 Exchanging data between programs Form intercommunicator MPI_INTERCOMM_CREATE Send data MPI_Send ..., 0, intercomm MPI_Recv buf, ..., 0, intercomm ; MPI_Bcast buf, ..., localcomm ; More complex p oint-to-p oint op erations can also be used 148 Collective op erations Use MPI_INTERCOMM_MERGE to create an intercommunicator. 149 Final Comments Additional features of MPI not covered in this tutorial Persistent Communication Error handling 150 Sharable MPI Resources The Standard itself: { As a Technical rep ort: U. of Tennessee. rep ort { As p ostscript for ftp: at info.mcs.anl.gov in pub/mpi/mpi-report.ps. { As hyp ertext on the World Wide Web: http://www.mcs.anl.gov/mpi { As a journal article: in the Fall issue of the Journal of Sup ercomputing Applications MPI Forum discussions { The MPI Forum email discussions and b oth current and earlier versions of the Standard are available from netlib. Bo oks: { Using MPI: Portable Parallel Programming with the Message-Passing Interface,by Gropp, Lusk, and Skjellum, MIT Press, 1994 { MPI Annotated Reference Manual,by Otto, et al., in preparation. 151 Sharable MPI Resources, continued Newsgroup: { comp.parallel.mpi Mailing lists: { [email protected]: the MPI Forum discussion list. { [email protected]: the implementors' discussion list. Implementations availableby ftp: { MPICH is availableby anonymous ftp from info.mcs.anl.gov in the directory pub/mpi/mpich, le mpich.tar.Z. { LAM is available by anonymous ftp from tbag.osc.edu in the directory pub/lam. { The CHIMP version of MPI is availableby anonymous ftp from ftp.epcc.ed.ac.uk in the directory pub/chimp/release. Test co de rep ository: { ftp://info.mcs.anl.gov/pub/mpi/mpi-test 152 MPI-2 The MPI Forum with old and new participants has b egun a follow-on series of meetings. Goals { clarify existing draft { provide features users have requested { make extensions, not changes MajorTopics b eing considered { dynamic pro cess management { client/server { real-time extensions { \one-sided" communication put/get, active messages { portable access to MPI system state for debuggers { language bindings for C++ and Fortran-90 Schedule { Dynamic pro cesses, client/server by SC '95 { MPI-2 complete by SC '96 153 Summary The parallel computing community has co op erated to develop a full-featured standard message-passing library interface. Implementations ab ound Applications b eginning to b e develop ed orported MPI-2 pro cess b eginning Lots of MPI material available 154