Jack Dongarra
Computer Science Department
University of Tennessee
and
Mathematical Sciences Section
Oak Ridge National Laboratory
http://www.netlib.org/utk/people/JackDongarra.html

Outline

- Motivation for MPI
- Overview of PVM and MPI
- The process that produced MPI
- What is different about MPI?
  - the "usual" send/receive
  - the MPI send/receive
  - simple collective operations
- New in MPI; not in MPI
- Some simple complete examples, in Fortran and C
- Communication modes, more on collective operations
- Implementation status
- MPICH - a free, portable implementation
- MPI resources on the Net
- MPI-2

What is SPMD?

- Single Program, Multiple Data
- Same program runs everywhere.
- Restriction on the general message-passing model.
- Some vendors only support SPMD parallel programs.
- General message-passing model can be emulated.

Messages

- Messages are packets of data moving between sub-programs.
- The system has to be told the following information:
  - Sending processor
  - Source location
  - Data type
  - Data length
  - Receiving processor(s)
  - Destination location
  - Destination size

Access

- A sub-program needs to be connected to a message passing system.
- A message passing system is similar to:
  - Mail box
  - Phone line
  - Fax machine
  - etc.

Point-to-Point Communication

- Simplest form of message passing.
- One process sends a message to another.
- Different types of point-to-point communication.

Synchronous Sends

Provide information about the completion of the message.

Asynchronous Sends

Only know when the message has left.

Blocking Operations

- Relate to when the operation has completed.
- Only return from the subroutine call when the operation has completed.

Non-Blocking Operations

Return straight away and allow the sub-program to continue to perform other work. At some later time the sub-program can TEST or WAIT for the completion of the non-blocking operation.

Barriers

Synchronise processes. (Figure: all tasks wait at the barrier before any proceeds.)

Broadcast

A one-to-many communication.

Reduction Operations

Combine data from several processes to produce a single result.

Parallelization - Getting Started

- Starting with a large serial application
  - Look at the physics
  - Is the problem inherently parallel?
  - Examine loop structures: are any independent? Moderately so?
  - Tools like Forge90 can be helpful
- Look for the core linear algebra routines
  - Replace with parallelized versions
  - Already been done (check survey)

Popular Distributed Programming Schemes

- Master / Slave
  Master task starts all slave tasks and coordinates their work and I/O.
- SPMD (hostless)
  Same program executes on different pieces of the problem.
- Functional
  Several programs are written; each performs a different function in the application.

Parallel Programming Considerations

- Granularity of tasks
  Key measure is the communication/computation ratio of the machine: number of bytes sent divided by number of flops performed. Larger granularity gives higher speedups but often lower parallelism.
- Number of messages
  Desirable to keep the number of messages low, but depending on the algorithm it can be more efficient to break large messages up and pipeline the data when this increases parallelism.
- Functional vs. ...
  Which better suits the application? PVM allows either or both to be used.

Network Programming Considerations

- Message latency
  Network latency can be high. Algorithms should be designed to account for this, e.g. send data before it is needed.
- Different machine powers
  Virtual machines may be composed of computers whose performance varies over orders of magnitude. The algorithm must be able to handle this.
- Fluctuating machine and network loads
  Multiple users and other competing PVM tasks cause the machine and network loads to change dynamically. Load balancing is important.

Load Balancing Methods

- Static load balancing
  The problem is divided up and tasks are assigned to processors only once. The number or size of tasks may be varied to account for the different computational powers of the machines.
- Dynamic load balancing by pool of tasks
  Typically used with the master/slave scheme. The master keeps a queue of tasks and sends them to idle slaves until the queue is empty. Faster machines end up getting more tasks naturally. (See the xep example in the PVM distribution.)
- Dynamic load balancing by coordination
  Typically used in the SPMD scheme. All the tasks synchronize and redistribute their work either at fixed times or if some condition occurs, e.g. load imbalance exceeds some limit.

Communication Tips

- Limit size, number of outstanding messages
  - Can load imbalance cause too many outstanding messages?
  - May have to send very large data in parts
- Complex communication patterns
  - Network is deadlock-free, shouldn't hang
  - Still have to consider
    - Correct data distribution
    - Bottlenecks
- Consider using a library
  - ScaLAPACK: LAPACK for distributed-memory machines
  - BLACS: Communication primitives
    - Oriented towards linear algebra
    - Matrix distribution w/ no send-recv
    - Used by ScaLAPACK

Bag of Tasks

- Components
  - Job pool
  - Worker pool
  - Scheduler
  (Figure 1: Bag of tasks state machines - each job is Unstarted, Running, or Finished; each worker is Idle or Busy.)
- Possible improvements
  - Adjust size of jobs
    - To speed of workers
    - To turnaround time (granularity)
  - Start bigger jobs before smaller ones
  - Allow workers to communicate (more complex scheduling)

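A minimal C sketch of the bag-of-tasks scheme above, using PVM calls described later in these notes; the worker program name ("worker"), the message tags, and the integer job encoding are illustrative assumptions, not part of the PVM distribution.

/* Bag-of-tasks master: hand one integer "job" at a time to idle workers. */
#include <stdio.h>
#include "pvm3.h"

#define NWORK   4      /* number of worker tasks        */
#define NJOBS   20     /* total jobs in the pool        */
#define TAG_JOB 1      /* master -> worker: job index   */
#define TAG_RES 2      /* worker -> master: job result  */

int main(void)
{
    int tids[NWORK], next = 0, done = 0, wtid, job, i;
    double res;

    pvm_spawn("worker", (char **)0, PvmTaskDefault, "", NWORK, tids);

    /* prime every worker with one job */
    for (i = 0; i < NWORK && next < NJOBS; i++, next++) {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&next, 1, 1);
        pvm_send(tids[i], TAG_JOB);
    }
    /* as each result arrives, hand the sender the next job */
    while (done < NJOBS) {
        int bufid = pvm_recv(-1, TAG_RES);
        pvm_bufinfo(bufid, (int *)0, (int *)0, &wtid);
        pvm_upkint(&job, 1, 1);
        pvm_upkdouble(&res, 1, 1);
        printf("job %d -> %g\n", job, res);
        done++;
        if (next < NJOBS) {
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&next, 1, 1);
            pvm_send(wtid, TAG_JOB);
            next++;
        }
    }
    pvm_exit();
    return 0;
}

Faster workers return results sooner and are therefore handed more jobs, which is the natural load balancing the slides describe.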
PVM Is

PVM is a software package that allows a collection of serial, parallel and vector computers on a network to be managed as one large resource.

- Poor man's supercomputer
  - High performance from a network of workstations
  - Off-hours crunching
- Metacomputer linking multiple supercomputers
  - Very high performance
  - Computing elements adapted to subproblems
  - Visualization
- Educational tool
  - Simple to install
  - Simple to learn
  - Available (can be modified)

Physical and Logical Views of PVM

(Figure: Physical view - hosts, including a multiprocessor host, connected by an IP network (routers, bridges, ...). Logical view - a pvmd on each host, the tasks it manages, and one or more consoles.)

Parts of the PVM System

- PVM daemon (pvmd)
  - One manages each host of the virtual machine
  - Mainly a message router, also has kernel-like functions
  - Has message entry points where tasks request service
  - Inter-host point of contact
  - Authentication
  - Creates processes
  - Collects output printed by processes
  - Fault detection of processes, network
  - More robust than application components
- Interface library (libpvm)
  - Linked with each application component
  - 1. Functions to compose, send, receive messages
  - 2. PVM syscalls that send requests to pvmd
  - Machine-dependent communication part can be replaced
  - Kept as simple as possible
- PVM Console
  - Interactive control of the virtual machine
  - Kind of like a shell
  - Normal PVM task; several can be attached, to any host

Programming in PVM

- A simple message-passing environment
  - Hosts, Tasks, Messages
  - No enforced topology
  - Virtual machine can be composed of any mix of machine types
- Process Control
  - Tasks can be spawned/killed anywhere in the virtual machine
- Communication
  - Any task can communicate with any other
  - Data conversion is handled by PVM
- Dynamic Process Groups
  - Tasks can join/leave one or more groups at any time
- Fault Tolerance
  - Task can request notification of lost/gained resources
- Underlying operating system (usually Unix) is visible
- Supports C, C++ and Fortran
- Can use other languages (must be able to link with C)

Hello World

- Program hello1.c, the main program:

#include <stdio.h>
#include "pvm3.h"

main()
{
        int cc, tid;            /* tid of child */
        char buf[100];

        printf("I'm t%x\n", pvm_mytid());

        pvm_spawn("hello2", (char**)0, 0, "", 1, &tid);
        cc = pvm_recv(-1, -1);
        pvm_bufinfo(cc, (int*)0, (int*)0, &tid);
        pvm_upkstr(buf);
        printf("Message from t%x: %s\n", tid, buf);

        pvm_exit();
        exit(0);
}

- Program hello2.c, the slave program:

#include <string.h>
#include <unistd.h>
#include "pvm3.h"

main()
{
        int ptid;               /* tid of parent */
        char buf[100];

        ptid = pvm_parent();

        strcpy(buf, "hello, world from ");
        gethostname(buf + strlen(buf), 64);
        pvm_initsend(PvmDataDefault);
        pvm_pkstr(buf);
        pvm_send(ptid, 1);

        pvm_exit();
        exit(0);
}

Unique Features of PVM

- Software is highly portable
- Allows fully heterogeneous virtual machine (hosts, network)
- Dynamic process, machine configuration
- Support for fault tolerant programs
- System can be customized
- Large existing user base
- Some comparable systems
  - Portable message-passing
    - MPI
    - p4
    - Express
    - PICL
  - One-of-a-kind
    - NX
    - CMMD
  - Other types of communication
    - AM
    - Linda
  - Also DOSs, Languages, ...

Portability

- Configurations include:

    80386/486 (BSDI, NetBSD, FreeBSD)   Alliant FX/8
    80386/486 (Linux)                   BBN Butterfly TC2000
    DEC Alpha (OSF-1), Mips, uVAX       Convex C2, CSPP
    DG Aviion                           Cray T-3D, YMP, 2, C90 (Unicos)
    HP 68000, PA-Risc                   Encore Multimax
    IBM RS-6000, RT                     Fujitsu 780 (UXP/M)
    Mips                                IBM Power-4
    NeXT                                Intel Paragon, iPSC/860, iPSC/2
    Silicon Graphics                    Kendall Square
    Sun 3, 4x (SunOS, Solaris)          Maspar
                                        NEC SX-3
                                        Sequent
                                        Stardent Titan
                                        Thinking Machines CM-2, CM-5

- Very portable across Unix machines, usually just pick options
- Multiprocessors:
  - Distributed-memory: T-3D, iPSC/860, Paragon, CM-5, SP-2/MPI
  - Shared-memory: Convex/HP, SGI, Alpha, Sun, KSR, Symmetry
  - Source code largely shared with the generic version (80%)
- PVM is portable to non-Unix machines
  - VMS port has been done
  - OS/2 port has been done
  - Windows/NT port in progress
- PVM differences are almost transparent to the programmer
  - Some options may not be supported
  - Program runs in a different environment

How to Get PVM

- The PVM home page (URL) at Oak Ridge is
  http://www.epm.ornl.gov/pvm/pvm_home.html
- PVM source code, user's guide, examples and related material are published on Netlib, a software repository with several sites around the world.
  - To get started, send email to netlib:
      mail [email protected]
      Subject: send index from pvm3
    A list of files and instructions will be automatically mailed back
  - Using xnetlib: select directory pvm3
  - FTP: host netlib2.cs.utk.edu, login anonymous, directory /pvm3
  - URL: http://www.netlib.org/pvm3/index.html
- Bug reports, comments, questions can be mailed to:
  [email protected]
- Usenet newsgroup for discussion and support:
  comp.parallel.pvm
- Book:
  PVM: Parallel Virtual Machine -
  A Users' Guide and Tutorial for Networked Parallel Computing,
  MIT Press, 1994.

Installing PVM

- Package requires a few MB of disk + a few MB / architecture
- Don't need root privilege
- Libraries and executables can be shared between users
- PVM chooses the machine architecture name for you (more than 60 currently defined)
- Environment variable PVM_ROOT points to the installed path
  - E.g. /usr/local/pvm3.3.4 or $HOME/pvm3
  - If you use csh, add to your .cshrc:
      setenv PVM_ROOT /usr/local/pvm3
  - If you use sh or ksh, add to your .profile:
      PVM_ROOT=/usr/local/pvm3
      PVM_DPATH=$PVM_ROOT/lib/pvmd
      export PVM_ROOT PVM_DPATH
- Important directories below $PVM_ROOT
    include    Header files
    man        Manual pages
    lib        Scripts
    lib/ARCH   System executables
    bin/ARCH   System tasks

Building PVM Package

- Software comes with configurations for most Unix machines
- Installation is easy
- After the package is extracted
  - cd $PVM_ROOT
  - make
- Software automatically
  - Determines architecture type
  - Creates necessary subdirectories
  - Builds pvmd, console, libraries, group server and library
  - Installs executables and libraries in lib and bin

Starting PVM

- Three ways to start PVM
  - pvm [-ddebugmask] [-nhostname] [hostfile]
    PVM console starts pvmd, or connects to one already running
  - xpvm
    Graphical console, same as above
  - pvmd [-ddebugmask] [-nhostname] [hostfile]
    Manual start, used mainly for debugging or when it is necessary to enter passwords
- Some common error messages
  - Can't start pvmd
    Check PVM_ROOT is set, .rhosts correct, no garbage in .cshrc
  - Can't contact local daemon
    PVM crashed previously; socket file left over
  - Version mismatch
    Mixed versions of PVM installed or stale executables
  - No such host
    Can't resolve IP address
  - Duplicate host
    Host already in virtual machine or shared /tmp directory
  - failed to start group server
    Group option not built or ep= not correct
  - shmget: ... No space left on device
    Stale segments left from crash or not enough are configured

XPVM

- Graphical interface for PVM
  - Performs console-like functions
  - Real-time graphical monitor with
    - View of virtual machine configuration, activity
    - Space-time plot of task status
    - Host utilization plot
    - Call level debugger, showing the last libpvm call by each task
- Writes SDDF format trace files
  - Can be used for post-mortem analysis
- Built on top of PVM using
  - Group library
  - Libpvm trace system
  - Output collection system

Programming Interface

About 80 functions:

    Message buffer manipulation    Create, destroy buffers
                                   Pack, unpack data
    Message passing                Send, receive
                                   Multicast
    Process control                Create, destroy tasks
                                   Query task tables
                                   Find own tid, parent tid
    Dynamic process groups         Join, leave group
    (with optional group library)  Map group members <-> tids
                                   Broadcast
                                   Global reduce
    Machine configuration          Add, remove hosts
                                   Query host status
                                   Start, halt virtual machine
    Miscellaneous                  Get, set options
                                   Request notification
                                   Register special tasks
                                   Get host timeofday clock offsets

Process Control

- pvm_spawn(file, argv, flags, where, ntask, tids)
  Start new tasks
  - Placement options:
      PvmTaskDefault   Round-robin
      PvmTaskHost      Named host ("." is local)
      PvmTaskArch      Named architecture class
      PvmHostCompl     Complements host set
      PvmMppFront      Start on MPP service node
  - Other flags:
      PvmTaskDebug     Enable debugging (dbx)
      PvmTaskTrace     Enable tracing
  - Spawn can return partial success
- pvm_mytid()
  Find my task id / enroll as a task
- pvm_parent()
  Find parent's task id
- pvm_exit()
  Disconnect from PVM
- pvm_kill(tid)
  Terminate another PVM task
- pvm_pstat(tid)
  Query status of another PVM task

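A hedged C sketch of the placement flags above: spawn a few copies with the default round-robin placement and a few more on a named host, then check for partial success. The task name "worker" and host name "node1" are placeholders.

/* Illustrative use of pvm_spawn() placement flags. */
#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int tids[8];
    int started;

    /* four copies, scheduler chooses hosts round-robin */
    started = pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids);

    /* two more copies, but only on the named host */
    started += pvm_spawn("worker", (char **)0, PvmTaskHost, "node1",
                         2, tids + started);

    /* pvm_spawn returns the number actually started; the tail of
       tids[] holds negative error codes for the ones that failed */
    if (started < 6)
        fprintf(stderr, "only %d of 6 tasks started\n", started);

    pvm_exit();
    return 0;
}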
Basic PVM Communication

- Three-step send method
  - pvm_initsend(encoding)
    Initialize send buffer, clearing the current one.
    Encoding can be PvmDataDefault, PvmDataRaw, or PvmDataInPlace
  - pvm_pk<type>(data, num_items, stride) ...
    Pack buffer with various data
  - pvm_send(dest, tag)
    pvm_mcast(dests, count, tag)
    Sends buffer to other tasks, returns when safe to clear buffer
- To receive
  - pvm_recv(source, tag)
    pvm_nrecv(source, tag)
    Blocking or non-blocking receive
  - pvm_upk<type>(data, num_items, stride)
    Unpack message into user variables
- Can also pvm_probe(source, tag) for a message
- Another receive primitive: pvm_trecv(source, tag, timeout)
  Equivalent to pvm_nrecv if timeout set to zero
  Equivalent to pvm_recv if timeout set to null

Higher Performance Communication

- Two matched calls for high-speed, low-latency messages
  - pvm_psend(dest, tag, data, num_items, data_type)
  - pvm_precv(source, tag, data, num_items, data_type, asource, atag, alength)
- Pack and send a contiguous, single-typed data buffer
- As fast as native calls on multiprocessor machines

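A C sketch combining the calls above: the three-step pack/send and matching recv/unpack, followed by the same exchange with the matched pvm_psend/pvm_precv pair. The tag value 10 and the peer task id argument are illustrative.

/* Three-step send (initsend/pack/send) and matching recv/unpack,
   followed by the single-call psend/precv form. */
#include "pvm3.h"

void exchange(int other, int n, double *a, double *b)
{
    int src, tag, len;

    /* pack-and-send */
    pvm_initsend(PvmDataDefault);
    pvm_pkdouble(a, n, 1);
    pvm_send(other, 10);

    /* blocking receive, then unpack into user variables */
    pvm_recv(other, 10);
    pvm_upkdouble(b, n, 1);

    /* same exchange with the low-latency matched pair */
    pvm_psend(other, 10, a, n, PVM_DOUBLE);
    pvm_precv(other, 10, b, n, PVM_DOUBLE, &src, &tag, &len);
}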
Collective Communication

- Collective functions operate across all members of a group
  - pvm_barrier(group, count)
    Synchronize all tasks in a group
  - pvm_bcast(group, tag)
    Broadcast message to all tasks in a group
  - pvm_scatter(result, data, num_items, data_type, msgtag, rootinst, group)
    pvm_gather(result, data, num_items, data_type, msgtag, rootinst, group)
    Distribute and collect arrays across task groups
  - pvm_reduce((*func)(), data, num_items, data_type, msgtag, group, rootinst)
    Reduce distributed arrays. Predefined functions are
    - PvmMax
    - PvmMin
    - PvmSum
    - PvmProduct

Virtual Machine Control

- pvm_addhosts(hosts, num_hosts, tids)
  Add hosts to the virtual machine
- pvm_config(nhosts, narch, hosts)
  Get current VM configuration
- pvm_delhosts(hosts, num_hosts, results)
  Remove hosts from the virtual machine
- pvm_halt()
  Stop all pvmds and tasks (shutdown)
- pvm_mstat(host)
  Query status of a host
- pvm_start_pvmd(argc, argv, block)
  Start new master pvmd

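A C sketch of the group-based collectives listed above: join a group, synchronize, and reduce a local array onto group instance 0. Requires linking with the group library (libgpvm3.a); the group name "work", tag 7, and NTASK are illustrative.

/* Group collectives: join, barrier, global sum. */
#include "pvm3.h"

#define NTASK 4

void global_sum(double *v, int n)
{
    int inst = pvm_joingroup("work");       /* my instance number      */

    pvm_barrier("work", NTASK);             /* wait for all NTASK      */

    /* element-wise sum over the group; result lands on instance 0 */
    pvm_reduce(PvmSum, v, n, PVM_DOUBLE, 7, "work", 0);

    pvm_lvgroup("work");
}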
PVM Examples in Distribution

- Examples illustrate usage and serve as templates
- Examples include
    hello, hello_other     Hello world
    master, slave          Master/slave program
    spmd                   SPMD program
    gexample               Group and collective operations
    timing, timing_slave   Tests communication performance
    hitc, hitc_slave       Dynamic load balance example
    xep, mtile             Interactive X-Window example
- Examples come with Makefile.aimk files
- Both C and Fortran versions for some examples

Compiling Applications

- Header files
  - C programs should include
      pvm3.h       Always
      pvmtev.h     To manipulate trace masks
      pvmsdpro.h   For the resource manager interface
  - Specify the include directory: cc -I$PVM_ROOT/include ...
  - Fortran: INCLUDE '/usr/local/pvm3/include/fpvm3.h'
- Compiling and linking
  - C programs must be linked with
      libpvm3.a    Always
      libgpvm3.a   If using group library functions
      (possibly other libraries for socket or XDR functions)
  - Fortran programs must additionally be linked with libfpvm3.a

Compiling Applications, Cont'd

- Aimk
  - Shares a single makefile between architectures
  - Builds for different architectures in separate directories
  - Determines the PVM architecture
  - Runs make, passing it PVM_ARCH
  - Does one of three things
    - If $PVM_ARCH/[Mm]akefile exists:
      Runs make in the subdirectory, using that makefile
    - Else if Makefile.aimk exists:
      Creates the subdirectory, runs make using Makefile.aimk
    - Otherwise:
      Runs make in the current directory

Load Balancing

- Important for application performance
- Not done automatically (yet?)
- Static - Assignment of work or placement of tasks
  - Must predict algorithm time
  - May have different processor speeds
  - Externally imposed static machine loads
- Dynamic - Adapting to changing conditions
  - Make a simple scheduler, e.g. Bag of Tasks
    - Simple, often works well
    - Divide work into small jobs
    - Given to processors as they become idle
    - PVM comes with examples
      - C: xep
      - Fortran: hitc
    - Can include some fault tolerance
  - Work migration: Cancel / forward job
    - Poll for cancel message from master
    - Can interrupt with pvm_sendsig()
    - Kill worker (expensive)
  - Task migration: Not in PVM yet
- Even with load balancing, expect performance to be variable

Six Examples

- Circular messaging
- Inner product
- Matrix-vector multiply (row distribution)
- Matrix-vector multiply (column distribution)
- Integration to evaluate π
- Solve the 1-D heat equation

Circular Messaging

A vector circulates among the processors.
Each processor fills in a part of the vector.

(Figure: a ring of processes P1 - P5 passing the vector around.)

Solution:
- SPMD
- Uses the following PVM features:
  - spawn
  - group
  - barrier
  - send-recv
  - pack-unpack

      program spmd1
      include '/src/icl/pvm/pvm3/include/fpvm3.h'

      PARAMETER( NPROC=4 )
      integer rank, left, right, i, j, ierr
      integer tids(NPROC-1)
      integer data(NPROC)

C Group Creation
      call pvmfjoingroup( 'foo', rank )
      if( rank .eq. 0 ) then
         call pvmfspawn('spmd1',PVMDEFAULT,'*',NPROC-1,tids(1),ierr)
      endif
      call pvmfbarrier( 'foo', NPROC, ierr )

C compute the neighbours IDs
      call pvmfgettid( 'foo', MOD(rank+NPROC-1,NPROC), left )
      call pvmfgettid( 'foo', MOD(rank+1,NPROC), right )

      if( rank .eq. 0 ) then
C I am the first
         do 10 i=1,NPROC
 10         data(i) = 0
         call pvmfinitsend( PVMDEFAULT, ierr )
         call pvmfpack( INTEGER4, data, NPROC, 1, ierr )
         call pvmfsend( right, 1, ierr )
         call pvmfrecv( left, 1, ierr )
         call pvmfunpack( INTEGER4, data, NPROC, 1, ierr )
         write(*,*) ' Results received :'
         write(*,*) (data(j),j=1,NPROC)
      else
C I am an intermediate process
         call pvmfrecv( left, 1, ierr )
         call pvmfunpack( INTEGER4, data, NPROC, 1, ierr )
         data(rank+1) = rank
         call pvmfinitsend( PVMDEFAULT, ierr )
         call pvmfpack( INTEGER4, data, NPROC, 1, ierr )
         call pvmfsend( right, 1, ierr )
      endif
      call pvmflvgroup( 'foo', ierr )
      call pvmfexit(ierr)
      stop
      end

Inner Product

Problem: In parallel compute

    s = x^T y = sum_{i=1}^{n} x_i y_i

(Figure: x and y are split into pieces; each piece contributes a partial ddot.)

Solution:
- Master - Slave
- Uses the following PVM features:
  - spawn
  - group
  - barrier
  - send-recv
  - pack-unpack
- Master sends out the data, collects the partial solutions and computes the sum.
- Slaves receive the data, compute the partial inner product and send the results to the master.

47 48

Inner Pro duct - Pseudo co de

program inner

include '/src/icl/pvm/pvm3/include/fpvm3.h'

PARAMETER NPROC=7 

PARAMETER N = 100

 Master

double precision ddot

external ddot

integer remain, nb

Ddot =0

integer rank, i, ierr, bufid

fori=1to

integer tidsNPROC-1, slave, master

double precision xN,yN

send ith part of X to the ith slave

double precision result,partial

send ith part of Y to the ith slave

remain = MODN,NPROC-1

end for

nb = N-remain/NPROC-1

Ddot = Ddot + Ddotrema ini ng part of X and Y

call pvmfjoingroup 'foo', rank 

fori=1to

if rank .eq. 0  then

receive a partial result

call pvmfspawn'inner',PVMDEFAULT,'*',NPROC-1,tids,ierr

Ddot = Ddot + partial result

endif

end for

call pvmfbarrier 'foo', NPROC, ierr 

call pvmfgettid 'foo', 0, master   Slave

C MASTER

Receive a part of X

if rank .eq. 0  then

Receive a part of Y

C Set the values

partial = Ddotpart of X and part of Y

do 10 i=1,N

send partial to the master

xi = 1.0d0

10 yi = 1.0d0

C Send the data

count = 1

do 20 i=1,NPROC-1

call pvmfinitsend PVMDEFAULT, ierr 

call pvmfpack REAL8, xcount, nb, 1, ierr 

call pvmfpack REAL8, ycount, nb, 1, ierr 

call pvmfgettid 'foo', i, slave

call pvmfsend slave, 1, ierr 

count = count + nb

20 continue

49 50

Matrix - Vector Pro duct

Row Distribution

result = 0.d0

C Add the remainding part

Problem: In parallel compute y = y + Ax, where y is of length m, x is of

partial = ddotremain,xN-remain+1,1,yN-remain+1,1

result = result + partial

length n and A is an m  n matrix.

Solution:

C Get the result

do 30 i =1,NPROC-1

 Master - Slave

call pvmfrecv-1,1,bufid

call pvmfunpack REAL8, partial, 1, 1, ierr

result = result + partial

 Uses the following PVM features:

30 continue

print *, ' The ddot = ', result

{ spawn

C SLAVE

{ group

else

{ barrier

C Receive the data

{ send-recv

call pvmfrecv -1, 1, bufid 

call pvmfunpack REAL8, x, nb, 1, ierr 

{ pack-unpack

call pvmfunpack REAL8, y, nb, 1, ierr 

C Compute the partial product

partial = ddotnb,x1,1,y1,1 Send back the result

C n

call pvmfinitsend PVMDEFAULT, ierr

call pvmfpack REAL8, partial, 1, 1, ierr

call pvmfsend master, 1, ierr endif

P1

call pvmflvgroup 'foo', ierr 

call pvmfexitierr

stop P2 end m P3

P4

A X Y Y

51 52

Matrix - Vector Pro duct

program matvec_row

Row Distribution

include '/src/icl/pvm/pvm3/include/fpvm3.h'

Pseudo Co de

C

C y <--- y + A * x

C

C A : MxN visible only on the slaves

C X:N

C Y:M

C

 Master

PARAMETER NPROC = 4

PARAMETER M = 9,N=6

fori=1to

PARAMETER NBY = INTM/NPROC+1

send X to the ith slave

double precision XN, YM

send Y to the ith slave

integer tidsNPROC

end for

integer mytid, rank, i, ierr, from

fori=1to

receive a partial result from a slave

call pvmfmytid mytid 

update the correspond ing part of Y

call pvmfjoingroup 'foo', rank 

end for

if rank .eq. 0  then

call pvmfspawn'matvecslv_row',PVMDEFAULT,'*',NPROC,tids,ierr

endif

 Slave

call pvmfbarrier 'foo', NPROC+1, ierr 

C Data initialize for my part of the data

Receive X

Receive Y

do 10 i = 1,N

Compute my part of the product

xi = 1.d0

10 continue

and Update my part of Y

do 15 i = 1,M

Send back my part of Y

yi = 1.d0

15 continue

C Send X and Y to the slaves

call pvmfinitsend PVMDEFAULT, ierr 

call pvmfpackREAL8, X, N, 1, ierr

call pvmfpackREAL8, Y, M, 1, ierr

call pvmfbcast'foo', 1, ierr

53 54

program matvecslv_row

C I get the results include '/src/icl/pvm/pvm3/include/fpvm3.h'

do 20 i = 1, NPROC C

call pvmfrecv-1, 1, ierr C y <--- y + A * x

call pvmfunpack INTEGER4, from, 1, ierr C A : MxN visible only on the slaves

if from .EQ. NPROC then C X:N

call pvmfunpack REAL8, Yfrom-1*NBY+1, C Y:M

$ M-NBY*NPROC-1, 1, ierr C

else PARAMETER NPROC = 4

call pvmfunpack REAL8, Yfrom-1*NBY+1, PARAMETER M = 9,N=6

$ NBY, 1, ierr PARAMETER NBY = INTM/NPROC+1

endif

20 continue double precision ANBY,N

double precision XN, YM

write*,* 'Results received'

do 30 i=1,M integer rank, i, ierr, to

write*,* 'Y',i,' = ',Yi

30 continue external dgemv

call pvmflvgroup 'foo', ierr 

call pvmfexitierr call pvmfjoingroup 'foo', rank 

stop call pvmfbarrier 'foo', NPROC+1, ierr 

end

C Data initialize for my part of the data

do 10 j = 1,N

do 20 i = 1,NBY

Ai,j = 1.d0

20 continue

10 continue

C I receive X and Y

call pvmfrecv -1, 1, ierr 

call pvmfunpackREAL8, X, N, 1, ierr

call pvmfunpackREAL8, Y, M, 1, ierr

C I compute my part

if rank .NE. NPROC then

call dgemv'N', NBY, N, 1.d0, A, NBY,

$ X, 1, 1.d0, Yrank-1*NBY+1,1

else

call dgemv'N', M-NBY*NPROC-1, N, 1.d0, A, NBY,

$ X, 1, 1.d0, YNPROC-1*NBY+1, 1

endif

C I send back my part of Y
      call pvmfinitsend(PVMDEFAULT, ierr)
      call pvmfpack( INTEGER4, rank, 1, 1, ierr )
      if( rank .NE. NPROC ) then
         call pvmfpack( REAL8, Y((rank-1)*NBY+1), NBY, 1, ierr )
      else
         call pvmfpack( REAL8, Y((rank-1)*NBY+1), M-NBY*(NPROC-1), 1, ierr )
      endif
      call pvmfgettid('foo',0,to)
      call pvmfsend(to, 1, ierr)
C done
      call pvmflvgroup( 'foo', ierr )
      call pvmfexit(ierr)
      stop
      end

Matrix - Vector Product
Column Distribution

Problem: In parallel compute y = y + Ax, where y is of length m, x is of length n and A is an m x n matrix.

Solution:
- Master - Slave
- Uses the following PVM features:
  - spawn
  - group
  - barrier
  - reduce
  - send-recv
  - pack-unpack

(Figure: A is partitioned by columns among P1, P2, P3, P4, together with the vectors x and y.)

57 58

Matrix - Vector Pro duct

program matvec_col

Column Distribution

include '/src/icl/pvm/pvm3/include/fpvm3.h'

Pseudo Co de

C

C y <--- y + A * x

C

C A : MxN visible only on the slaves

C X:N

C Y:M

C

 Master

PARAMETER NPROC = 4

PARAMETER M = 9,N=6

fori=1to

double precision XN, YM

send X to the ith slave

end for

external PVMSUM

Global Sum on Y root

integer tidsNPROC

integer mytid, rank, i, ierr

 Slave

call pvmfmytid mytid 

Receive X

call pvmfjoingroup 'foo', rank 

Compute my Contributi on to Y

if rank .eq. 0  then

call pvmfspawn'matvecslv_col',PVMDEFAULT,'*',NPROC,tids,ierr

Global Sum on Y leaf

endif

call pvmfbarrier 'foo', NPROC+1, ierr 

C Data initialize for my part of the data

do 10 i = 1,N

xi = 1.d0

10 continue

do 15 i = 1,M

yi = 1.d0

15 continue

59 60

program matvecslv_col

C Send X include '/src/icl/pvm/pvm3/include/fpvm3.h'

call pvmfinitsend PVMDEFAULT, ierr  C

call pvmfpackREAL8, X, N, 1, ierr C y <--- y + A * x

call pvmfbcast'foo', 1, ierr C A : MxN visible only on the slaves

C X:N

C I get the results C Y:M

C

call pvmfreducePVMSUM,Y,M,REAL8,1,'foo',0,ierr PARAMETER NPROC = 4

PARAMETER M = 9,N=6

write*,* 'Results received' PARAMETER NBX = INTN/NPROC+1

do 30 i=1,M

write*,* 'Y',i,' = ',Yi external PVMSUM

30 continue

call pvmflvgroup 'foo', ierr  double precision AM,NBX

call pvmfexitierr double precision XN, YM

stop

end integer rank, i, ierr

external dgemv

call pvmfjoingroup 'foo', rank 

call pvmfbarrier 'foo', NPROC+1, ierr 

C Data initialize for my part of the data

do 10 j = 1,NBX

do 20 i = 1,M

Ai,j = 1.d0

20 continue

10 continue

C I receive X

call pvmfrecv -1, 1, ierr 

call pvmfunpackREAL8, X, N, 1, ierr

C I compute my part
      if( rank .NE. NPROC ) then
         call dgemv('N', M, NBX, 1.d0, A, M,
     $        X((rank-1)*NBX+1), 1, 1.d0, Y, 1)
      else
         call dgemv('N', M, N-NBX*(NPROC-1), 1.d0, A, M,
     $        X((NPROC-1)*NBX+1), 1, 1.d0, Y, 1)
      endif

C I send back my part of Y
      call pvmfreduce(PVMSUM,Y,M,REAL8,1,'foo',0,ierr)

C done
      call pvmflvgroup( 'foo', ierr )
      call pvmfexit(ierr)
      stop
      end

Integration to evaluate π

Compute approximations to π by using numerical integration.

We know that tan(45°) = 1, i.e. tan(π/4) = 1, so that 4 * arctan(1) = π.

From the integral tables we can find

    arctan(x) = ∫ 1/(1 + x^2) dx

or

    arctan(1) = ∫_0^1 1/(1 + x^2) dx

Use the mid-point rule with panels of uniform length h = 1/n, for various values of n. Evaluate the function at the midpoint of each subinterval (x_{i-1}, x_i); i*h - h/2 is the midpoint.

The formula for the integral is

    x = sum_{i=1}^{n} f(h*(i - 1/2))
    π ≈ h * x

where f(x) = 4/(1 + x^2).

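For reference, the serial mid-point rule is only a few lines of C (n = 100000 is an arbitrary choice); the parallel PVM version appears on the following slides.

/* Serial mid-point rule for pi = integral of 4/(1+x^2) on [0,1]. */
#include <stdio.h>

int main(void)
{
    int n = 100000, i;
    double h = 1.0 / n, sum = 0.0, x;

    for (i = 1; i <= n; i++) {
        x = h * (i - 0.5);             /* midpoint of subinterval i */
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi is approximately %.16f\n", h * sum);
    return 0;
}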
Integration to evaluate π (continued)

There are a number of ways to divide up the problem; each part of the sum is independent.
- Divide the interval into more or less equal parts and give each process a part.
- Let each processor take the p-th part.
- Compute part of the integral.
- Sum pieces.

The example given lets each processor take the p-th part. It uses the following PVM features:
- spawn
- group
- barrier
- bcast
- reduce

      program spmd2
      include '/src/icl/pvm/pvm3/include/fpvm3.h'

      PARAMETER( NPROC=3 )
      EXTERNAL PVMSUM

      integer mytid, rank, i, ierr
      integer tids(0:NPROC-1)

      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)
      double precision mypi,pi,h,sum,x,f,a

C function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call pvmfmytid( mytid )
      call pvmfjoingroup( 'foo', rank )

 10   if( rank .eq. 0 ) then
         call pvmfspawn('spmd2',PVMDEFAULT,'*',NPROC-1,tids(1),ierr)
      endif

      call pvmfbarrier( 'foo', NPROC, ierr )

      if( rank .eq. 0 ) then
         write(6,98)
 98      format('Enter the number of intervals: (0 quits)')
         read(5,99) n
 99      format(i10)
         if( n .GT. 100000 ) then
            print *, 'Too large value of pi'
            print *, 'Using 100000 instead'
            n = 100000
         endif
         call pvmfinitsend( PVMDEFAULT, ierr )
         call pvmfpack( INTEGER4, n, 1, 1, ierr )
         call pvmfbcast('foo',0,ierr)
      endif
      if( rank .ne. 0 ) then
         call pvmfrecv(-1,-1,ierr)
         call pvmfunpack(INTEGER4, n, 1, 1, ierr)
      endif

C check for quit signal
      if( n .le. 0 ) goto 30
C calculate the interval size
      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = rank+1, n, NPROC
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h*sum
C collect all the partial sums
      print *,' reduce'
      call pvmfreduce(PVMSUM,mypi,1,REAL8,0,'foo',0,ierr)
C node 0 prints the number
      if( rank .eq. 0 ) then
         pi = mypi
         write(6,97) pi, abs(pi - PI25DT)
 97      format(' pi is approximatively: ',F18.16,
     +          ' Error is: ',F18.16)
         goto 10
      endif

 30   call pvmflvgroup( 'foo', ierr )
      call pvmfexit(ierr)
      stop
      end

1-D Heat Equation

Problem: Calculating heat diffusion through a wire.

The one-dimensional heat equation on a thin wire is:

    ∂A/∂t = ∂²A/∂x²

and a discretization of the form:

    (A_{i+1,j} - A_{i,j}) / Δt = (A_{i,j+1} - 2 A_{i,j} + A_{i,j-1}) / Δx²

giving the explicit formula:

    A_{i+1,j} = A_{i,j} + (Δt/Δx²) (A_{i,j+1} - 2 A_{i,j} + A_{i,j-1})

initial and boundary conditions:

    A(t,0) = 0,  A(t,1) = 0  for all t
    A(0,x) = sin(πx)  for 0 ≤ x ≤ 1

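A C sketch of one time step of this explicit scheme for a single process holding the whole wire; the array length N and the ratio r = Δt/Δx² are illustrative, and the distributed PVM version follows on the next slides.

/* One explicit time step of the 1-D heat equation on a wire whose
   ends are held at zero. */
#define N 100

void heat_step(double a_old[N], double a_new[N], double r)
{
    int j;
    a_new[0] = 0.0;                 /* boundary condition A(t,0) = 0 */
    a_new[N-1] = 0.0;               /* boundary condition A(t,1) = 0 */
    for (j = 1; j < N - 1; j++)
        a_new[j] = a_old[j] + r * (a_old[j+1] - 2.0*a_old[j] + a_old[j-1]);
}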
1-D Heat Equation (continued)

(Figure: the wire is split among processes P1-P4; neighbouring pieces share boundary values.)

Solution:
- Master - Slave
- Slaves communicate their boundary values
- Uses the following PVM features:
  - spawn
  - group
  - barrier
  - send-recv
  - pack-unpack

1-D Heat Equation
Pseudo Code

- Master

    Set the initial temperatures
    for i = 1 to NPROC
        send the ith part of the initial temperatures to the ith slave
    end for
    for i = 1 to NPROC
        receive results from the ith slave
        update my data
    end for

- Slave

    Receive my part of the initial values
    for i = 1 to TIMESTEP
        send my left bound to my left neighbor
        send my right bound to my right neighbor
        receive my left neighbor's left bound
        receive my right neighbor's right bound
        compute the new temperatures
    end for
    send back my result to the master

69 70

C C enroll in pvm

C Use PVM to solve a simple heat diffusion differential equation, call pvmfmytid mytid

C using 1 master program and 5 slaves.

C C spawn the slave tasks

C The master program sets up the data, communicates it to the slaves call pvmfspawn'heatslv',PVMDEFAULT,'*',NPROC,task_ids,ierr

C and waits for the results to be sent from the slaves.

C Produces xgraph ready files of the results. C create the initial data set

C do 10 i = 1,SIZE

initi = SINPI * DBLEi-1 / DBLESIZE-1

program heat 10 continue

include '/src/icl/pvm/pvm3/include/fpvm3.h' init1 = 0.d0

initSIZE = 0.d0

integer NPROC, TIMESTEP, PLOTINC, SIZE

double precision PI C run the problem 4 times for different values of delta t

PARAMETERPI = 3.14159265358979323846 do 20 l=1,4

PARAMETERNPROC= 3

PARAMETERTIMESTEP = 10 deltax2 = deltatl/1.0/DBLESIZE**2.0

PARAMETERPLOTINC = 1 C start timing for this run

PARAMETERSIZE = 100

PARAMETERSLAVENAME = 'heatslv' eltimel = etimet0

integer num_data C send the initial data to the slaves.

integer mytid, task_idsNPROC,i,j C include neighbor info for exchanging boundary data

integer left, right, k, l do 30 i =1,NPROC

integer step call pvmfinitsendPVMDEFAULT,ierr

integer ierr IF i .EQ. 1 THEN

external wh left = 0

integer wh ELSE

left = task_idsi-1

double precision initSIZE ENDIF

double precision resultTIMESTEP*SIZE/NPROC call pvmfpackINTEGER4, left, 1, 1, ierr

double precision solutionTIMESTEP,SIZE IF i .EQ. NPROC THEN

character*20 filename4 right = 0

double precision deltat4, deltax2 ELSE

real etime right = task_idsi+1

real t02 ENDIF

real eltime4 call pvmfpackINTEGER4, right, 1, 1, ierr

call pvmfpackINTEGER4, INTstep, 1, 1, ierr

step = TIMESTEP call pvmfpackREAL8, deltax2, 1, 1, ierr

num_data = INTSIZE/NPROC call pvmfpackINTEGER4, INTnum_data, 1, 1, ierr

call pvmfpackREAL8, initnum_data*i-1+1,num_data,1,ierr

filename1 = 'graph1' call pvmfsendtask_idsi, 4, ierr

filename2 = 'graph2' 30 continue

filename3 = 'graph3'

filename4 = 'graph4' C wait for the results

deltat1 = 5.0E-1 do 40 i = 1,NPROC

deltat2 = 5.0E-3 call pvmfrecvtask_idsi, 7, ierr

deltat3 = 5.0E-6 call pvmfunpackREAL8, result, num_data*TIMESTEP, 1, ierr

deltat4 = 5.0E-9

71 72

C update the solution C

do50j=1,TIMESTEP C The slaves receive the initial data from the host,

do 60 k = 1, num_data C exchange boundary information with neighbors,

solutionj,num_data*i-1+1+k-1 = C and calculate the heat change in the wire.

$ resultwhj-1,k-1,num_data+1 C This is done for a number of iterations, sent by the master.

60 continue C

50 continue C

40 continue

program heatslv

C stop timing include '/src/icl/pvm/pvm3/include/fpvm3.h'

eltimel = etimet0 - eltimel

PARAMETERMAX1 = 1000

C produce the output PARAMETERMAX2 = 100000

write*,* 'Writing output to file ',filenamel

open23, FILE = filenamel integer mytid, left, right, i, j, master

write23,* 'TitleText: Wire Heat over Delta Time: ',deltatl integer timestep

write23,* 'XUnitText: Distance'

write23,* 'YUnitText: Heat' external wh

do 70 i=1,TIMESTEP,PLOTINC integer wh

write23,* '"Time index: ',i-1

do 80 j = 1,SIZE double precision initMAX1, AMAX2

write23,* j-1,REALsolutioni,j double precision leftdata, rightdata

81 FORMATI5,F10.4 double precision delta, leftside, rightside

80 continue

write23,* '' C enroll in pvm

70 continue

endfile 23 call pvmfmytidmytid

closeUNIT = 23, STATUS = 'KEEP' call pvmfparentmaster

20 continue C receive my data from the master program

10 continue

write*,* 'Problem size: ', SIZE

do 90 i = 1,4 call pvmfrecvmaster,4,ierr

write*,* 'Time for run ',i-1,': ',eltimei,' sec.' call pvmfunpackINTEGER4, left, 1, 1, ierr

90 continue call pvmfunpackINTEGER4, right, 1, 1, ierr

call pvmfunpackINTEGER4, timestep, 1, 1, ierr

C kill the slave processes call pvmfunpackREAL8, delta, 1, 1, ierr

do 100 i = 1,NPROC call pvmfunpackINTEGER4, num_data, 1, 1, ierr

call pvmfkilltask_idsi,ierr call pvmfunpackREAL8, init, num_data, 1, ierr

100 continue

call pvmfexitierr C copy the initial data into my working array

END

do 20 i = 1, num_data

integer FUNCTION whx,y,z Ai = initi

integer x,y,z 20 continue

wh=x*z+y do 22 i = num_data+1, num_data*timestep

Ai = 0

RETURN 22 continue END

73 74

C perform the calculation C send the results back to the master program

do 30 i = 1, timestep-1 call pvmfinitsendPVMDEFAULT, ierr

call pvmfpackREAL8, A, num_data*timestep, 1, ierr

C trade boundary info with my neighbors call pvmfsendmaster, 7, ierr

C send left, receive right

goto 10

IF left .NE. 0 THEN

call pvmfinitsendPVMDEFAULT, ierr C just for good measure

call pvmfpackREAL8, Awhi-1,0,num_data+1, 1, 1, ierr call pvmfexitierr

call pvmfsend left, 5, ierr

ENDIF END

IF right .NE. 0 THEN

call pvmfrecv right, 5, ierr integer FUNCTION whx,y,z

call pvmfunpackREAL8, rightdata, 1, 1, ierr

call pvmfinitsendPVMDEFAULT, ierr integer x,y,z

call pvmfpackREAL8, Awhi-1, num_data-1,num_data+1,

$ 1, 1, ierr wh = x*z + y

call pvmfsendright, 6, ierr

ENDIF RETURN

IF left .NE. 0 THEN END

call pvmfrecvleft, 6, ierr

call pvmfunpackREAL8, leftdata, 1, 1, ierr

ENDIF

C do the calculations for this iteration

do 40 j = 1, num_data

IF j .EQ. 1 THEN

leftside = leftdata

ELSE

leftside = Awhi-1,j-2,num_data+1

ENDIF

IF j .EQ. num_data THEN

rightside = rightdata

ELSE

rightside = Awhi-1,j,num_data+1

ENDIF

IF j. EQ. 1 .AND. left. EQ. 0 THEN

Awhi,j-1,num_data+1 = 0.d0

ELSE IF j .EQ. num_data .AND. right .EQ. 0 THEN

Awhi,j-1,num_data+1 = 0.d0

ELSE

Awhi,j-1,num_data+1 = Awhi-1,j-1,num_data+1 +

$ delta*rightside - 2*Awhi-1,j-1,num_data+1+leftside

ENDIF

40 continue

30 continue

Motivation for a New Design

- Message passing is now mature as a programming paradigm
  - well understood
  - efficient match to hardware
  - many applications
- Vendor systems are not portable
- Portable systems are mostly research projects
  - incomplete
  - lack vendor support
  - not at the most efficient level

Motivation (cont.)

Few systems offer the full range of desired features.
- modularity (for libraries)
- access to peak performance
- portability
- heterogeneity
- subgroups
- topologies
- performance measurement tools

The MPI Process

- Began at the Williamsburg Workshop in April, 1992
- Organized at Supercomputing '92 (November)
- Followed the HPF format and process
- Met every six weeks for two days
- Extensive, open email discussions
- Drafts, readings, votes
- Pre-final draft distributed at Supercomputing '93
- Two-month public comment period
- Final version of draft in May, 1994
- Widely available now on the Web, ftp sites, netlib
  http://www.netlib.org/mpi/index.html
- Public implementations available
- Vendor implementations coming soon

MPI Lacks...

- Mechanisms for process creation
- One-sided communication (put, get, active messages)
- Language bindings for Fortran 90 and C++

There are a fixed number of processes from start to finish of an application.

Many features were considered and not included
- Time constraint
- Not enough experience
- Concern that additional features would delay the appearance of implementations

Who Designed MPI?

- Broad participation
- Vendors
  - IBM, Intel, TMC, Meiko, Cray, Convex, Ncube
- Library writers
  - PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda
- Application specialists and consultants

    Companies    Laboratories   Universities
    ARCO         ANL            UC Santa Barbara
    Convex       GMD            Syracuse U
    Cray Res     LANL           Michigan State U
    IBM          LLNL           Oregon Grad Inst
    Intel        NOAA           U of New Mexico
    KAI          NSF            Miss. State U.
    Meiko        ORNL           U of Southampton
    NAG          PNL            U of Colorado
    nCUBE        Sandia         Yale U
    ParaSoft     SDSC           U of Tennessee
    Shell        SRC            U of Maryland
    TMC                         Western Mich U
                                U of Edinburgh
                                Cornell U.
                                Rice U.
                                U of San Francisco

What is MPI?

- A message-passing library specification
  - message-passing model
  - not a compiler specification
  - not a specific product
- For parallel computers, clusters, and heterogeneous networks
- Full-featured
- Designed to permit (unleash?) the development of parallel software libraries
- Designed to provide access to advanced parallel hardware for
  - end users
  - library writers
  - tool developers

New Features of MPI

- General
  - Communicators combine context and group for message security
  - Thread safety
- Point-to-point communication
  - Structured buffers and derived datatypes, heterogeneity
  - Modes: normal (blocking and non-blocking), synchronous, ready (to allow access to fast protocols), buffered
- Collective
  - Both built-in and user-defined collective operations
  - Large number of data movement routines
  - Subgroups defined directly or by topology

New Features of MPI (cont.)

- Application-oriented process topologies
  - Built-in support for grids and graphs (uses groups)
- Profiling
  - Hooks allow users to intercept MPI calls to install their own tools
- Environmental
  - inquiry
  - error control

Features not in MPI

- Non-message-passing concepts not included:
  - process management
  - remote memory transfers
  - active messages
  - threads
  - virtual shared memory
- MPI does not address these issues, but has tried to remain compatible with these ideas (e.g. thread safety as a goal, intercommunicators)

Is MPI Large or Small?

- MPI is large (125 functions)
  - MPI's extensive functionality requires many functions
  - The number of functions is not necessarily a measure of complexity
- MPI is small (6 functions)
  - Many parallel programs can be written with just 6 basic functions.
- MPI is just right
  - One can access flexibility when it is required.
  - One need not master all parts of MPI to use it.

Header files

- C:
    #include <mpi.h>
- Fortran:
    include 'mpif.h'

MPI Function Format

- C:
    error = MPI_Xxxxx(parameter, ...);
    MPI_Xxxxx(parameter, ...);
- Fortran:
    CALL MPI_XXXXX(parameter, ..., IERROR)

87 88

MPI_COMM_WORLD MPI

Initiali zi ng communicator

2 C

int MPI

Initint *argc, char ***argv

2 Fortran

MPI OR

INITIERR MPI_COMM_WORLD

INTEGER IERROR

0 1

2 Must b e rst routine called. 4 2 3 5 6

Size

- How many processes are contained within a communicator?

    MPI_Comm_size(MPI_Comm comm, int *size)

    MPI_COMM_SIZE(COMM, SIZE, IERROR)
    INTEGER COMM, SIZE, IERROR

Rank

- How do you identify different processes?

    MPI_Comm_rank(MPI_Comm comm, int *rank)

    MPI_COMM_RANK(COMM, RANK, IERROR)
    INTEGER COMM, RANK, IERROR

Exiting MPI

- C:
    int MPI_Finalize()
- Fortran:
    MPI_FINALIZE(IERROR)
    INTEGER IERROR
- Must be called last by all processes.

Messages

- A message contains a number of elements of some particular datatype.
- MPI datatypes:
  - Basic types.
  - Derived types.
- Derived types can be built up from basic types.
- C types are different from Fortran types.

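A minimal complete C program tying together the calls introduced above - MPI_Init, MPI_Comm_rank, MPI_Comm_size, and MPI_Finalize; the Fortran and C examples later in these notes follow the same skeleton.

/* Smallest useful MPI program: initialize, find rank and size,
   report, and shut down. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                    /* must be first  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* my rank        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* process count  */
    printf("Process %d of %d is alive\n", rank, size);
    MPI_Finalize();                            /* must be last   */
    return 0;
}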
MPI Basic Datatypes - Fortran

    MPI Datatype           Fortran Datatype
    MPI_INTEGER            INTEGER
    MPI_REAL               REAL
    MPI_DOUBLE_PRECISION   DOUBLE PRECISION
    MPI_COMPLEX            COMPLEX
    MPI_LOGICAL            LOGICAL
    MPI_CHARACTER          CHARACTER(1)
    MPI_BYTE
    MPI_PACKED

MPI Basic Datatypes - C

    MPI Datatype         C datatype
    MPI_CHAR             signed char
    MPI_SHORT            signed short int
    MPI_INT              signed int
    MPI_LONG             signed long int
    MPI_UNSIGNED_CHAR    unsigned char
    MPI_UNSIGNED_SHORT   unsigned short int
    MPI_UNSIGNED         unsigned int
    MPI_UNSIGNED_LONG    unsigned long int
    MPI_FLOAT            float
    MPI_DOUBLE           double
    MPI_LONG_DOUBLE      long double
    MPI_BYTE
    MPI_PACKED

Point-to-Point Communication

(Figure: a communicator containing processes 0-5; the source process sends a message to the destination process.)

- Communication between two processes.
- Source process sends a message to the destination process.
- Communication takes place within a communicator.
- Destination process is identified by its rank in the communicator.

Simple Fortran example

      program main

      include 'mpif.h'

      integer rank, size, to, from, tag, count, i, ierr
      integer src, dest
      integer st_source, st_tag, st_count
      integer status(MPI_STATUS_SIZE)
      double precision data(100)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'Process ', rank, ' of ', size, ' is alive'
      dest = size - 1
      src = 0
C
      if( rank .eq. src ) then
         to = dest
         count = 10
         tag = 2001
         do 10 i=1, 10
 10         data(i) = i
         call MPI_SEND( data, count, MPI_DOUBLE_PRECISION, to,
     +                  tag, MPI_COMM_WORLD, ierr )
      else if( rank .eq. dest ) then
         tag = MPI_ANY_TAG
         count = 10
         from = MPI_ANY_SOURCE
         call MPI_RECV( data, count, MPI_DOUBLE_PRECISION, from,
     +                  tag, MPI_COMM_WORLD, status, ierr )

Simple Fortran example (cont.)

         call MPI_GET_COUNT( status, MPI_DOUBLE_PRECISION,
     +                       st_count, ierr )
         st_source = status(MPI_SOURCE)
         st_tag = status(MPI_TAG)
C
         print *, 'Status info: source = ', st_source,
     +            ' tag = ', st_tag, ' count = ', st_count
         print *, rank, ' received', (data(i),i=1,10)
      endif

      call MPI_FINALIZE( ierr )
      end

Fortran example

      program main

      include "mpif.h"

      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)

      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, rc
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

 10   if ( myid .eq. 0 ) then
         write(6,98)
 98      format('Enter the number of intervals: (0 quits)')
         read(5,99) n
 99      format(i10)
      endif
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)

Fortran example (cont.)

c check for quit signal
      if ( n .le. 0 ) goto 30
c calculate the interval size
      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum
c collect all the partial sums
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
     $     MPI_COMM_WORLD,ierr)
c node 0 prints the answer.
      if (myid .eq. 0) then
         write(6, 97) pi, abs(pi - PI25DT)
 97      format('  pi is approximately: ', F18.16,
     +          '  Error is: ', F18.16)
      endif
      goto 10
 30   call MPI_FINALIZE(rc)
      stop
      end

C example

#include "mpi.h"
#include <math.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);

C example (cont.)

    while (!done)
    {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d",&n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;

        h   = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;

        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
}

Communication modes

    Sender mode       Notes
    Synchronous send  Only completes when the receive has started.
    Buffered send     Always completes (unless an error occurs),
                      irrespective of receiver.
    Standard send     Either synchronous or buffered.
    Ready send        Always completes (unless an error occurs),
                      irrespective of whether the receive has completed.
    Receive           Completes when a message has arrived.

MPI Sender Modes

    OPERATION         MPI CALL
    Standard send     MPI_SEND
    Synchronous send  MPI_SSEND
    Buffered send     MPI_BSEND
    Ready send        MPI_RSEND
    Receive           MPI_RECV

Sending a message

- C:
    int MPI_Ssend(void *buf, int count, MPI_Datatype datatype,
                  int dest, int tag, MPI_Comm comm)
- Fortran:
    MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG
    INTEGER COMM, IERROR

Synchronous Blocking Message-Passing

- Processes synchronize.
- Sender process specifies the synchronous mode.
- Blocking - both processes wait until the transaction has completed.

Receiving a message

- C:
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status *status)
- Fortran:
    MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM,
    STATUS(MPI_STATUS_SIZE), IERROR

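A minimal C fragment pairing MPI_Ssend with MPI_Recv as described above; the tag (99), the message length, and the choice of ranks 0 and 1 are illustrative.

/* Process 0 sends 10 doubles synchronously to process 1. */
#include "mpi.h"

void exchange_example(int rank)
{
    double data[10];
    MPI_Status status;

    if (rank == 0)
        MPI_Ssend(data, 10, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(data, 10, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
}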
For a communication to succeed:

- Sender must specify a valid destination rank.
- Receiver must specify a valid source rank.
- The communicator must be the same.
- Tags must match.
- Message types must match.
- Receiver's buffer must be large enough.

(Figure: schematic of the collective data-movement routines - broadcast, scatter, gather, allgather and alltoall - showing how data items A0, B0, ... are distributed across six processes.)

Message Order Preservation

(Figure: a communicator with processes 0-5 exchanging messages.)

- Messages do not overtake each other.
- This is true even for non-synchronous sends.

Wildcarding

- Receiver can wildcard.
- To receive from any source - MPI_ANY_SOURCE.
- To receive with any tag - MPI_ANY_TAG.
- Actual source and tag are returned in the receiver's status parameter.

Non-Blocking Communications

- Separate communication into three phases:
  - Initiate non-blocking communication.
  - Do some work (perhaps involving other communications?).
  - Wait for the non-blocking communication to complete.

Non-Blocking Send

(Figure: within a communicator, the send is initiated and completes later while the sending process continues to work.)

Non-Blocking Receive

(Figure: the receive is posted, the process continues working, and the incoming message completes the receive later.)

Non-blocking Synchronous Send

- C:
    MPI_Issend(buf, count, datatype, dest, tag, comm, handle)
    MPI_Wait(handle, status)
- Fortran:
    MPI_ISSEND(buf, count, datatype, dest, tag, comm, handle, ierror)
    MPI_WAIT(handle, status, ierror)

Non-blocking Receive

- C:
    MPI_Irecv(buf, count, datatype, src, tag, comm, handle)
    MPI_Wait(handle, status)
- Fortran:
    MPI_IRECV(buf, count, datatype, src, tag, comm, handle, ierror)
    MPI_WAIT(handle, status, ierror)

Blocking and Non-Blocking

- Send and receive can be blocking or non-blocking.
- A blocking send can be used with a non-blocking receive, and vice-versa.
- Non-blocking sends can use any mode - synchronous, buffered, standard, or ready.
- Synchronous mode affects completion, not initiation.

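A sketch in C of the three-phase non-blocking pattern (initiate, overlap work, complete); the tag, the helper name, and the use of MPI_Waitall to finish both requests at once are illustrative choices.

/* Non-blocking exchange: post the receive and the send, do other
   work, then wait for both to complete. */
#include "mpi.h"

void overlap(int other, double *sendbuf, double *recvbuf, int n)
{
    MPI_Request req[2];
    MPI_Status  stat[2];

    MPI_Irecv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req[1]);

    /* ... do useful computation here while the messages move ... */

    MPI_Waitall(2, req, stat);   /* or MPI_Wait / MPI_Test per request */
}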
Completion

- Waiting versus Testing.
- C:
    MPI_Wait(handle, status)
    MPI_Test(handle, flag, status)
- Fortran:
    MPI_WAIT(handle, status, ierror)
    MPI_TEST(handle, flag, status, ierror)

Communication Modes

    NON-BLOCKING OPERATION  MPI CALL
    Standard send           MPI_ISEND
    Synchronous send        MPI_ISSEND
    Buffered send           MPI_IBSEND
    Ready send              MPI_IRSEND
    Receive                 MPI_IRECV

Characteristics of Collective Communication

- Collective action over a communicator.
- All processes must communicate.
- Synchronisation may or may not occur.
- All collective operations are blocking.
- No tags.
- Receive buffers must be exactly the right size.

Barrier Synchronization

- C:
    int MPI_Barrier(MPI_Comm comm)
- Fortran:
    MPI_BARRIER(COMM, IERROR)
    INTEGER COMM, IERROR

Scatter

(Figure: the root's buffer A B C D E is split and one piece is delivered to each process.)

Broadcast

(Figure: the root's buffer A B C D E is copied to every process.)

- C:
    int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
                  int root, MPI_Comm comm)
- Fortran:
    MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

Gather

(Figure: one piece from each process is collected into the root's buffer A B C D E.)

Global Reduction Operations

- Used to compute a result involving data distributed over a group of processes.
- Examples:
  - global sum or product
  - global maximum or minimum
  - global user-defined operation

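A C sketch of the three data-movement calls pictured above; the buffer names and counts are illustrative, and the scatter/gather source and destination arrays are only significant at the root.

/* Root broadcasts n, scatters one double to each process, and
   gathers one double back from each process. */
#include "mpi.h"

void collectives(int root)
{
    int n = 0;
    double piece, all[64];          /* assumes at most 64 processes */

    MPI_Bcast(&n, 1, MPI_INT, root, MPI_COMM_WORLD);
    MPI_Scatter(all, 1, MPI_DOUBLE, &piece, 1, MPI_DOUBLE,
                root, MPI_COMM_WORLD);
    MPI_Gather(&piece, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE,
               root, MPI_COMM_WORLD);
}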
Example of Global Reduction

Integer global sum:

- C:
    MPI_Reduce(&x, &result, 1, MPI_INT,
               MPI_SUM, 0, MPI_COMM_WORLD)
- Fortran:
    CALL MPI_REDUCE(x, result, 1, MPI_INTEGER,
                    MPI_SUM, 0, MPI_COMM_WORLD, IERROR)
- The sum of all the x values is placed in result.
- The result is only placed there on processor 0.

Predefined Reduction Operations

    MPI Name     Function
    MPI_MAX      Maximum
    MPI_MIN      Minimum
    MPI_SUM      Sum
    MPI_PROD     Product
    MPI_LAND     Logical AND
    MPI_BAND     Bitwise AND
    MPI_LOR      Logical OR
    MPI_BOR      Bitwise OR
    MPI_LXOR     Logical exclusive OR
    MPI_BXOR     Bitwise exclusive OR
    MPI_MAXLOC   Maximum and location
    MPI_MINLOC   Minimum and location

MPI_REDUCE

(Figure: MPI_REDUCE over ranks 0-4 holding A B C D, E F G H, I J K L, M N O P, Q R S T; with combining operator o, the root process ends up with A o E o I o M o Q in its first slot.)

User-Defined Reduction Operators

- Reducing using an arbitrary operator o
- C - function of type MPI_User_function:
    void my_operator( void *invec, void *inoutvec, int *len,
                      MPI_Datatype *datatype )
- Fortran - function of type:
    FUNCTION MY_OPERATOR( INVEC(*), INOUTVEC(*), LEN, DATATYPE )
    <type> INVEC(LEN), INOUTVEC(LEN)
    INTEGER LEN, DATATYPE

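To use such a function, it is registered with MPI_Op_create and then passed to MPI_Reduce in place of a predefined operation. A C sketch (the elementwise integer product is just an illustration):

/* Register a user-defined reduction and use it in MPI_Reduce. */
#include "mpi.h"

static void my_prod(void *invec, void *inoutvec, int *len,
                    MPI_Datatype *datatype)
{
    int i, *in = (int *)invec, *inout = (int *)inoutvec;
    for (i = 0; i < *len; i++)
        inout[i] = in[i] * inout[i];
}

void reduce_with_user_op(int *x, int *result, int n)
{
    MPI_Op op;
    MPI_Op_create(my_prod, 1 /* commutative */, &op);
    MPI_Reduce(x, result, n, MPI_INT, op, 0, MPI_COMM_WORLD);
    MPI_Op_free(&op);
}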
MPI_SCAN

(Figure: MPI_SCAN over ranks 0-4 holding A B C D, E F G H, I J K L, M N O P, Q R S T; after the scan, rank 0 holds A, rank 1 holds A o E, rank 2 holds A o E o I, rank 3 holds A o E o I o M, and rank 4 holds A o E o I o M o Q.)

Summary

We have covered:
- Background and scope of MPI
- Some characteristic features of MPI (communicators, datatypes)
- Point-to-point communication
  - blocking and non-blocking
  - multiple modes
- Collective communication
  - data movement
  - collective computation

Summary

- The community has cooperated to develop a full-featured standard message-passing library interface.
- Implementations abound.
- Applications are beginning to be developed or ported.
- The MPI-2 process is beginning.
- Lots of MPI material is available.

Current MPI Implementation Efforts

Vendor implementations:
- IBM Research (MPI-F)
- IBM Kingston
- Intel SSD
- Cray Research
- Meiko, Inc.
- SGI
- Kendall Square Research
- NEC
- Fujitsu (AP1000)
- Convex
- Hughes Aircraft

Portable implementations:
- Argonne - Mississippi State (MPICH)
- Ohio Supercomputer Center (LAM)
- University of Edinburgh
- Technical University of Munich
- University of Illinois

Other interested groups: Sun, Hewlett-Packard, Myricom (makers of high-performance network switches), PALLAS (a German software company), and Sandia National Laboratory (Intel Paragon running SUNMOS).

MPI Implementation Projects

- Variety of implementations
  - Vendor proprietary
  - Free, portable
  - World wide
  - Real-time, embedded systems
  - All MPPs and networks
- Implementation strategies
  - Specialized
  - Abstract message-passing devices
  - Active-message devices

MPICH - A Freely-available Portable MPI Implementation

- Complete MPI implementation
- On MPPs: IBM SP1 and SP2, Intel iPSC/860 and Paragon, TMC CM-5, SGI, Meiko CS-2, NCube, KSR, Sequent Symmetry
- On workstation networks: Sun, DEC, HP, SGI, Linux, FreeBSD, NetBSD
- Includes multiple profiling libraries for timing, event logging, and animation of programs.
- Includes the upshot trace visualization program and a graphics library
- Efficiently implemented for shared-memory, high-speed switches, and network environments
- Man pages
- Source included
- Available at ftp.mcs.anl.gov in pub/mpi/mpich.tar.Z

Sharable MPI Resources

- The Standard itself:
  - As a Technical report: U. of Tennessee report
  - As postscript for ftp: at info.mcs.anl.gov in pub/mpi/mpi-report.ps
  - As hypertext on the World Wide Web: http://www.mcs.anl.gov/mpi
  - As a journal article: in the Fall issue of the Journal of Supercomputing Applications
- MPI Forum discussions
  - The MPI Forum email discussions and both current and earlier versions of the Standard are available from netlib.
- Books:
  - Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum, MIT Press, 1994
  - MPI Annotated Reference Manual, by Otto, Dongarra, Lederman, Snir, and Walker, MIT Press, 1995.

Sharable MPI Resources, continued

- Newsgroup: comp.parallel.mpi
- Mailing lists:
  - [email protected]: the MPI Forum discussion list.
  - [email protected]: the implementors' discussion list.
- Implementations available by ftp:
  - MPICH is available by anonymous ftp from info.mcs.anl.gov in the directory pub/mpi/mpich, file mpich.*.tar.Z.
  - LAM is available by anonymous ftp from tbag.osc.edu in the directory pub/lam.
  - The CHIMP version of MPI is available by anonymous ftp from ftp.epcc.ed.ac.uk in the directory pub/chimp/release.
- Test code repository (new):
  - ftp://info.mcs.anl.gov/pub/mpi-test

PVM and MPI Future

(Figure: today PVM provides system and process control and process creation, while MPI provides context-based communication; future systems are expected to merge these features and add active messages. MPI is available on the IBM SP, Intel Paragon, Cray T3D, Meiko CS-2, and over PVM/p4.)

MPI-2

The MPI Forum (with old and new participants) has begun a follow-on series of meetings.

- Goals
  - clarify the existing draft
  - provide features users have requested
  - make extensions, not changes
- Major topics being considered
  - dynamic process management
  - client/server
  - real-time extensions
  - "one-sided" communication (put/get, active messages)
  - portable access to MPI system state (for debuggers)
  - language bindings for C++ and Fortran-90
- Schedule
  - Dynamic processes, client/server by SC '95
  - MPI-2 complete by SC '96

Conclusions

- MPI is being adopted worldwide
- The Standard documentation is an adequate guide to implementation
- Implementations abound
- The implementation community is working together