Advanced Features of Message Passing Interface (MPI)

PCOPP-2002, June 03-06, 2002, C-DAC, Pune
Day 3: Classroom Lecture

Lecture Outline

• Basics of MPI
• MPI advanced point-to-point communication
• MPI communication modes
• MPI collective communication and computations
• MPI datatypes
• Cost of message passing
• Types of synchronization
• MPI-2 features
• Positive and negative features

MPI Goals

1. Understand the effect of different MPI functions that accomplish the same communication
2. Understand how MPI implementations work
3. Know how to use different MPI functions to solve performance problems
4. Understand the way in which the different MPI operations are implemented: this is critical in tuning for performance

Message-Passing Programming Paradigm

• Message Passing Architecture Model: processors are connected using a message passing interconnection network.

[Figure: processors, each with its own memory (M) and processor (P), attached to a communication network]

Is MPI Large or Small?

• MPI is large (125 functions)
  - MPI's extensive functionality requires many functions
  - Number of functions is not necessarily a measure of complexity
• MPI is small (6 functions)
  - Many parallel programs can be written with just 6 basic functions
• MPI is just right
  - One need not master all parts of MPI to use it
  - One can access flexibility when it is required
• MPI is a right candidate for message passing

Is MPI Large or Small? (contd.)

The MPI Message Passing Interface: small or large

• MPI can be small. One can begin programming with 6 MPI function calls:

    MPI_INIT         Initializes MPI
    MPI_COMM_SIZE    Determines number of processors
    MPI_COMM_RANK    Determines the label of the calling process
    MPI_SEND         Sends a message
    MPI_RECV         Receives a message
    MPI_FINALIZE     Terminates MPI

• MPI can be large. One can utilize any of the 125 functions in MPI.
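To make the six-call subset concrete, here is a minimal sketch of a complete MPI program that uses only these six functions; the payload value and the use of ranks 0 and 1 are illustrative, and it assumes at least two processes.

    /* Minimal sketch using only the six basic MPI calls. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int size, rank, value = 42;   /* illustrative payload */
        MPI_Status status;

        MPI_Init(&argc, &argv);                 /* initialize MPI       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */

        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("Rank 1 received %d\n", value);
        }

        MPI_Finalize();                         /* terminate MPI        */
        return 0;
    }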

MPI Send and Receive

Sending and receiving messages: fundamental questions answered

• To whom is data sent?
• What is sent?
• How does the receiver identify it?

[Figure: Process 0 executes Send; Process 1 executes Recv]

MPI Point-to-Point Communication

Information on MPI Send and Recv

• Communication between two processes
• Source process sends a message to a destination process
• Communication takes place within a communicator
• Destination process is identified by its rank in the communicator

MPI Point-to-Point Communication (contd.)

Communication Completion

• A communication operation is locally complete on a process if the process has completed its part in the operation
• A communication operation is globally complete if all the processes involved have completed their part in the operation
• A communication operation is globally complete if and only if it is locally complete for all processes

MPI Blocking Send and Receive

Blocking Send

• A typical blocking send looks like

    send ( dest, type, address, length )

• Where
  - dest is an integer identifier representing the process to receive the message
  - type is a nonnegative integer that the destination can use to selectively screen messages
  - (address, length) describes a contiguous area in memory containing the message to be sent
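In MPI itself, the generic send(dest, type, address, length) above maps onto MPI_Send: (address, length) becomes (buf, count, datatype), and the screening type is carried by the tag argument. A sketch, with buf, count, dest, and tag assumed declared elsewhere:

    /* Generic send(dest, type, address, length) expressed in MPI:
       (address, length) -> (buf, count, datatype); type -> tag. */
    MPI_Send(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);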

MPI Blocking Send and Receive (contd.)

Point-to-Point Communications

• The sending and receiving of messages between pairs of processors.
• BLOCKING SEND: MPI_Send returns only after the corresponding RECEIVE operation has been issued and the message has been transferred.
• BLOCKING RECEIVE: MPI_Recv returns only after the corresponding SEND has been issued and the message has been received.

MPI Blocking Send and Receive (contd.)

Blocking Sends and Receives

If we are sending a large message, most implementations of blocking send and receive use the following procedure (S = Sender, R = Receiver):

• MPI_SEND (blocking standard send), message size > threshold: the sending task waits
• Transfer doesn't begin until word has arrived that the corresponding MPI_RECV has been posted
• MPI_RECV: the receiving task continues when the data transfer to the user's buffer is complete

MPI Non-Blocking Send and Receive

• Non-blocking receive: MPI_Irecv does not wait for the message transfer to complete, but returns control back to the calling processor immediately.

  C:
    MPI_Isend (buf, count, dtype, dest, tag, comm, request);
    MPI_Irecv (buf, count, dtype, source, tag, comm, request);

  Fortran:
    MPI_Isend (buf, count, dtype, dest, tag, comm, request, ierror)
    MPI_Irecv (buf, count, dtype, source, tag, comm, request, ierror)

MPI Non-Blocking Send and Receive (contd.)

Non-Blocking Communications

• Separate communication into three phases (see the sketch below):
  - Initiate non-blocking communication.
  - Do some work (perhaps involving other communications?)
  - Wait for non-blocking communication to complete.
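A minimal sketch of the three phases for a pairwise exchange; the partner rank, tag, and buffer handling are illustrative:

    /* Three phases of non-blocking communication: initiate, work, wait. */
    #include <mpi.h>

    void exchange(double *sendbuf, double *recvbuf, int n, int partner)
    {
        MPI_Request reqs[2];
        MPI_Status  stats[2];

        /* 1. Initiate non-blocking communication */
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

        /* 2. Do some work that does not touch either buffer */

        /* 3. Wait for completion before reusing the buffers */
        MPI_Waitall(2, reqs, stats);
    }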


MPI Non-Blocking Send and Receive (contd.)

If we are sending a small message, most implementations of non-blocking send and receive use the following procedure: the message can be sent immediately and stored in a buffer on the receiving side (S = Sender, R = Receiver).

An MPI_Wait checks to see if a non-blocking operation has completed. In this case, the MPI_Wait on the sending side believes the message has already been received.

• MPI_ISEND (non-blocking standard send), message size ≤ threshold
• Transfer to the buffer on the receiving node can be avoided if MPI_IRECV is posted early enough
• MPI_WAIT causes no delay if it is issued late enough

MPI Non-Blocking Send and Receive (contd.)

If we are sending a large message, most implementations of non-blocking send and receive use the following procedure: the send is issued, but the data is not immediately sent. Computation is resumed after the send, but is halted later by an MPI_Wait (S = Sender, R = Receiver).

An MPI_Wait checks to see if a non-blocking operation has completed. In this case, the MPI_Wait on the sending side sees that the message has not been sent yet.

• MPI_ISEND (non-blocking standard send), message size > threshold: the task waits at MPI_WAIT
• Transfer doesn't begin until word has arrived that the corresponding MPI_IRECV has been posted
• On the receiving side there is no interruption if the wait is late enough; the task proceeds once the data transfer from the source is complete

MPI Communication Modes

• Synchronous mode
  - The same as standard mode, except the send will not complete until message delivery is guaranteed
• Buffered mode
  - Similar to standard mode, but completion is always independent of a matching receive, and the message may be buffered to ensure this

MPI Buffered Send and Receive

Buffered Sends and Receives

If the programmer allocates some memory (buffer space) for temporary storage on the sending processor, we can perform a type of non-blocking send (S = Sender, R = Receiver):

• MPI_BSEND (buffered send): data is copied to the buffer; the send completes when the data transfer to the user-supplied buffer is complete
• MPI_RECV: the receiving task waits until the message arrives
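A sketch of a buffered send; the caller's data array, its length, and the destination rank are assumed to come from elsewhere. MPI_Buffer_attach hands MPI the user-allocated buffer space, and MPI_Bsend completes as soon as the message has been copied into it:

    /* Buffered-mode send: the programmer supplies the buffer space. */
    #include <mpi.h>
    #include <stdlib.h>

    void buffered_send(double *data, int n, int dest)
    {
        int   bufsize = n * sizeof(double) + MPI_BSEND_OVERHEAD;
        char *buf     = malloc(bufsize);

        MPI_Buffer_attach(buf, bufsize);  /* give MPI the buffer space */
        MPI_Bsend(data, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        /* detach blocks until buffered messages have been delivered */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    }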

MPI Communication Modes (contd.)

• The mode of a point-to-point communication operation governs when a send operation is initiated, or when it completes
  - Standard mode: a send may be initiated even if a matching receive has not been initiated
  - Ready mode: a send may be initiated only if a matching receive has been initiated

MPI Communication Modes (contd.)

• Use of non-blocking and completion routines allows computation and communication to be overlapped
• Two basic ways of checking on non-blocking sends and receives:
  - Call a wait routine that blocks until completion:
      mpi_wait (request_id, return_status, ierr)
  - Call a test routine that returns a flag to indicate if complete:
      mpi_test (request_id, flag, return_status, ierr)
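A sketch contrasting the two completion checks in C; the MPI_Request variable is assumed to have been set by an earlier MPI_Isend or MPI_Irecv, and do_some_work is a hypothetical placeholder for overlapped computation:

    /* Polling with MPI_Test instead of blocking in MPI_Wait. */
    int flag = 0;
    MPI_Status status;

    MPI_Test(&request, &flag, &status);  /* returns immediately, sets flag */
    while (!flag) {
        do_some_work();                  /* hypothetical overlapped work   */
        MPI_Test(&request, &flag, &status);
    }
    /* equivalent blocking form: MPI_Wait(&request, &status); */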

MPI Communication Modes (contd.)

• Blocking send
  - Returns when the send is locally complete
  - Message buffer can be read from / written to after return
• Non-blocking send
  - Returns "immediately"
  - Message buffer should not be written to after return
  - Must check for local completion

MPI Communication Modes (contd.)

• Blocking receive
  - Returns when the receive is locally complete
  - Message buffer can be read from after return
• mpi_wait blocks until the communication is complete
• mpi_test returns "immediately", and sets flag to true if the communication is complete

MPI Communication Modes (contd.)

Sender mode        Notes
Synchronous send   Completes only when the receive has completed.
Buffered send      Always completes (unless an error occurs), irrespective of the receiver.
Standard send      Either synchronous or buffered.
Ready send         Always completes (unless an error occurs), irrespective of whether the receive has completed.
Receive            Completes when a message has arrived.

MPI Communication Modes (contd.)

MPI Sender Modes

OPERATION          MPI CALL
Standard send      MPI_SEND
Synchronous send   MPI_SSEND
Buffered send      MPI_BSEND
Ready send         MPI_RSEND
Receive            MPI_RECV

MPI Persistent Communication

• MPI: non-blocking operations, effective overlap
  - Isend, Irecv, Waitall
• MPI: persistent operations
  - Potential saving: allocation of MPI_Request
  - Variation of the example: sendinit, recvinit, startall, waitall
  - startall(recvs), sendrecv/barrier, startall(rsends), waitall
• Vendor implementations are buggy
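A sketch of the persistent pattern named above (sendinit, recvinit, startall, waitall); the buffers, message length, partner rank, and iteration count are assumed declared elsewhere:

    /* Persistent requests: set up once, restart every iteration. */
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    int iter;

    MPI_Recv_init(rbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Send_init(sbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    for (iter = 0; iter < niters; iter++) {
        MPI_Startall(2, reqs);        /* restart both transfers       */
        /* ... computation overlapped with communication ... */
        MPI_Waitall(2, reqs, stats);  /* complete before buffer reuse */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);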

MPI Cause of Deadlock

MPI Message Passing: Causes of Deadlock

• Deadlock occurs when all tasks are waiting for events that haven't been initiated yet.
• The diagram demonstrates that the two sends are each waiting on their corresponding receives in order to complete, but those receives are executed after the sends; so if the sends do not complete and return, the receives can never be executed, and both sets of communications will stall indefinitely.

[Figure: PE 0 and PE 1 each issue MPI_Send to the other first, then MPI_Recv]

MPI - Avoiding Deadlock

MPI Message Passing: Avoiding Deadlock

• Change ordering
  - Different ordering of calls between tasks: arrange for one task to post its receive first and for the other to post its send first.
• Non-blocking calls
  - Have each task post a non-blocking receive before it does any other communication.
  - This allows each message to be received, no matter what the task is working on when the message arrives or in what order the sends are posted.

MPI - Avoiding Deadlock (contd.)

• MPI_Sendrecv
  - Use MPI_Sendrecv (or MPI_Sendrecv_replace).
  - The send-receive combines in one call the sending of a message to a destination and the receiving of a message from a source (see the sketch below).
• Buffered mode
  - Use buffered sends so that computation can proceed after copying the message to the user-supplied buffer.
  - This will allow the send to complete and the subsequent receive to be executed.
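A sketch of a deadlock-free exchange with MPI_Sendrecv; sendbuf, recvbuf, n, and partner are assumed declared elsewhere:

    /* Combined send-receive: MPI orders the transfers internally,
       so neither process can deadlock waiting on the other. */
    MPI_Status status;

    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, partner, 0,  /* send to partner   */
                 recvbuf, n, MPI_DOUBLE, partner, 0,  /* recv from partner */
                 MPI_COMM_WORLD, &status);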

MPI Collective Communications

Collective Communications

• The sending and/or receiving of messages to/from groups of processors. A collective communication implies that all processors need to participate in the communication.
• Involves coordinated communication within a group of processes
• No message tags used
• All collective routines block until they are locally complete
• Two broad classes:
  - Data movement routines
  - Global computation routines

MPI Collective Communications (contd.)

Collective Communication

• Communications involving a group of processes.
• Called by all processes in a communicator.
• Examples:
  - Barrier synchronization.
  - Broadcast, scatter, gather.
  - Global sum, global maximum, etc.

MPI Collective Communications (contd.)

Characteristics of Collective Communication

• Collective action over a communicator
• All processes must communicate
• Synchronization may or may not occur
• All collective operations are blocking
• No tags
• Receive buffers must be exactly the right size

MPI Collective Communications (contd.)

• Communication is coordinated among a group of processes
  - Group can be constructed "by hand" with MPI group-manipulation routines or by using MPI topology-definition routines
  - Different communicators are used instead
• No non-blocking collective operations

MPI Collective Computations

Collective Computation Operations

MPI Name    Operation
MPI_LAND    Logical and
MPI_LOR     Logical or
MPI_LXOR    Logical exclusive or (xor)
MPI_BAND    Bitwise AND
MPI_BOR     Bitwise OR
MPI_BXOR    Bitwise exclusive OR

MPI Collective Computations (contd.)

Collective Computation Operations

MPI Name     Operation
MPI_MAX      Maximum
MPI_MIN      Minimum
MPI_PROD     Product
MPI_SUM      Sum
MPI_MAXLOC   Maximum and location
MPI_MINLOC   Minimum and location
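A sketch of a collective computation using one of the built-in operations above; the array length is illustrative:

    /* Element-wise global sum, result delivered to rank 0. */
    double local[4], global[4];
    /* ... fill local[] with this process's contribution ... */
    MPI_Reduce(local, global, 4, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    /* MPI_Allreduce with the same arguments (minus the root)
       would deliver the result to every process instead. */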

MPI Collective Communications and Computations (contd.)

• "All" versions deliver results to all participating processes
• V-versions allow the chunks to have different sizes (non-uniform data): Scatterv, Allgatherv, Gatherv
• Allreduce, Reduce, ReduceScatter, and Scan take both built-in and user-defined combination functions

Other collective calls in the library:
  Allgather, Allgatherv, Allreduce, Alltoall, Alltoallv,
  Bcast, Gather, Gatherv, Reduce, ReduceScatter, Scan,
  Scatter, Scatterv

MPI Collective Communications (contd.)

Representation of collective data movement in MPI

[Figure: Allgather - before, each process Pi holds one item (A, B, C, D); afterwards every process holds A, B, C, D.
 Alltoall - before, process Pi holds items Ai0..Ai3; afterwards process Pj holds the j-th item from every process.]

MPI Collective Communications (contd.)

All-to-All

• Performs a scatter and gather from all four processors to all other processors; every processor accumulates the final values
• All-to-All operation for an integer array of size 8 on 4 processors (see the sketch below)
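A sketch matching the slide's scenario (integer array of size 8 on 4 processes): each process scatters 2 ints to every process and gathers 2 ints from every process:

    /* All-to-all exchange: with 4 processes and 8 ints per process,
       elements 2i and 2i+1 of sendbuf go to process i. */
    int sendbuf[8], recvbuf[8];
    /* ... fill sendbuf[] ... */
    MPI_Alltoall(sendbuf, 2, MPI_INT,   /* 2 ints sent to each process   */
                 recvbuf, 2, MPI_INT,   /* 2 ints received from each one */
                 MPI_COMM_WORLD);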

MPI - Using Topology

MPI: Support for Regular Decompositions

• Using topology routines
  - MPI_Cart_create: the user can define a virtual topology
• Why use the topology routines?
  - Simple to use (why not?)
  - Allow the MPI implementation to provide a low expected contention layout of processes (contention can matter)
  - Remember, contention still matters; a good mapping can reduce contention effects
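A sketch of defining a virtual topology with MPI_Cart_create; the 4 x 4 grid (for 16 processes) and non-periodic boundaries are illustrative choices:

    /* 2-D virtual process topology. */
    MPI_Comm cart;
    int dims[2]    = {4, 4};   /* 4 x 4 process grid            */
    int periods[2] = {0, 0};   /* non-periodic in both dims     */
    int reorder    = 1;        /* let MPI choose a good mapping */

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cart);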

MPI Datatypes

Message type

• A message contains a number of elements of some particular datatype
• MPI datatypes:
  - Basic types
  - Derived data types (vectors; structs; others)
• Derived types can be built up from basic types
• C types are different from Fortran types

MPI Basic Datatypes - C

MPI Datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE              -
MPI_PACKED            -

MPI Derived Data Types

Contiguous Data

• The simplest derived datatype consists of a number of contiguous items of the same datatype

  C:
    int MPI_Type_contiguous (int count, MPI_Datatype oldtype,
                             MPI_Datatype *newtype);

  Fortran:
    MPI_Type_contiguous (count, oldtype, newtype)
    integer count, oldtype, newtype

MPI Derived Data Types (contd.)

• Vector Datatype example
  - Count = 2; blocklength = 3; stride = 5
• Struct Datatype example
  - Count = 2;
  - Array_of_blocklengths[0] = 1; Array_of_types[0] = MPI_INT;
  - Array_of_blocklengths[1] = 3; Array_of_types[1] = MPI_DOUBLE;

MPI Derived Data Types (contd.)

• Constructing a Vector Datatype

  C:
    int MPI_Type_vector (int count, int blocklength, int stride,
                         MPI_Datatype oldtype, MPI_Datatype *newtype);

  Fortran:
    MPI_Type_vector (count, blocklength, stride, oldtype, newtype, ierror)

• Extent of a Datatype

  C:
    int MPI_Type_extent (MPI_Datatype datatype, MPI_Aint *extent);

  Fortran:
    MPI_Type_extent (datatype, extent, ierror)
    integer datatype, extent, ierror

MPI Derived Data Types (contd.)

• Constructing a Struct Datatype

  C:
    int MPI_Type_struct (int count, int *array_of_blocklengths,
                         MPI_Aint *array_of_displacements,
                         MPI_Datatype *array_of_types,
                         MPI_Datatype *newtype);

  Fortran:
    MPI_Type_Struct (count, array_of_blocklengths,
                     array_of_displacements, array_of_types,
                     newtype, ierror)

MPI Derived Data Types (contd.)

Committing a datatype

• Once a datatype has been constructed, it needs to be committed before it is used.
• This is done using MPI_TYPE_COMMIT

  C:
    int MPI_Type_commit (MPI_Datatype *datatype);

  Fortran:
    MPI_Type_Commit (datatype, ierror)
    integer datatype, ierror
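Putting the pieces together, a sketch that constructs the vector type from the earlier example (count = 2, blocklength = 3, stride = 5), commits it, uses it once, and frees it; buf and dest are assumed declared elsewhere:

    /* Build, commit, use, and free a derived vector datatype. */
    MPI_Datatype vtype;

    MPI_Type_vector(2, 3, 5, MPI_DOUBLE, &vtype);  /* 2 blocks of 3, stride 5 */
    MPI_Type_commit(&vtype);                       /* required before use     */
    MPI_Send(buf, 1, vtype, dest, 0, MPI_COMM_WORLD);
    MPI_Type_free(&vtype);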

MPI Derived Data Types (contd.)

MPI: Performance of datatypes

• Handling non-contiguous data:
  - MPI_Type_vector and MPI_Type_struct; or
  - the user packs and unpacks by hand
• Test of a 1000-element vector of doubles with a stride of 24 doubles
• Performance very dependent on implementation; should improve with time
• Collect many small messages into a single large message
• Use collective bcast/gather when many copies are needed

Cost of Message Passing

• Message passing programs that exploit data parallelism often use point-to-point / collective communication to transfer required data among processors
• Common message-passing operations
  - Point-to-point / collective communication
  - Blocking and non-blocking type
  - Barrier, Broadcast, Reduction, Prefix, Gather, Scatter, All-to-All
  - Gather and Scatter operations
• Startup time has both a hardware as well as a software-related component

Cost of Message Passing (contd.)

Startup-time and transfer-time

• The time required to send a message can be divided into two parts: startup-time and transfer-time
• Every time a message is passed between processors, the processors involved must spend some time sending and receiving the message
• The cost depends on the size of the message, how far it has to travel, and the status of the network at the time of transmission

Cost of Message Passing (contd.)

Hardware:

• Time required to prepare the message for the underlying network (such as adding header, trailer, and error correction information)
• The time to execute the routing algorithm
• The time to establish an interface between the local processor and the router
• Startup time is fixed and does not depend on the size of the message being sent

Cost of Message Passing (contd.)

Software:

• Depends on the protocol followed by the underlying message passing library
• In general, this may include the time spent by the message-passing library for creating various internal data structures
• Copying the message to internal buffers
• Negotiating the transmission of data between the sending and receiving processors

Cost of Message Passing (contd.)

• The transfer time of a message depends on the size of the message and the bandwidth of the underlying interconnection network
• Transfer time of a message of n words is n * t_w, where t_w = 1/r is the per-word transfer time and r is the bandwidth
• In most cases, a message traveling from one processor to another will have to traverse many links
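As a worked illustration with hypothetical numbers: if the startup time is t_s = 50 microseconds and the bandwidth is r = 100 million words per second (so t_w = 1/r = 0.01 microseconds per word), then sending a message of n = 8192 words costs roughly t_s + n * t_w = 50 + 8192 * 0.01, i.e. about 132 microseconds; for small messages the startup term dominates.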

Cost of Message Passing (contd.)

Remarks:

• When different messages traverse the interconnection network concurrently, more than one message may want to traverse the same link of the interconnection network at the same time.
• In this case, only one message at a time will traverse the link, forcing the remaining messages to wait.
• As the message sizes and distances being traveled increase, there is a greater likelihood that link contention will arise.

Cost of Message Passing (contd.)

Remarks:

• The software-related startup time is usually much higher than that of the hardware end
• It may depend on the size of the message being sent, depending on the type of MPI communication operation performed
• The transfer time of a message depends on the size of the message and the bandwidth of the underlying interconnection network
• It also depends on the status of the receiving processor and other traffic in the network

Cost of Message Passing (contd.)

Cost of Collective Communication Operations

• The time required by an MPI collective communication operation varies greatly and depends on the type of communication call used:
  - Barrier operation
  - Element-wise Reduce operation of n words
  - BROADCAST of a message of size n words
• The algorithm is more important for efficient implementation of collective communication operations
• The time required by each operation depends on characteristics of the interconnection network

Cost of Message Passing (contd.)

Complexity of collective operations (cut-through routing, p processors):

Operation     Complexity
BARRIER       O(log p)
BROADCAST     O(n log p)
REDUCE        O(n log p)
PREFIX        O(np)
SCATTER       O(np)

For BROADCAST, n is the size of the message being broadcast; for REDUCE, n is the size of the message stored in each processor; for SCATTER, n is the size of the message received from each processor.

Types of Synchronization

Synchronization Speed

• Cost of message passing also depends on the synchronization speed of the system
• Synchronization refers to the time needed for all processors to agree that they have finished one step of a problem and are ready to go on to the next step
• Synchronization methods
  - Waiting until all processes finish a loop
  - Waiting until the first of any of the contributing processes finds a particular answer
  - Assigning a unique task to each processor from a list of tasks

MPI - Synchronization

• MPI synchronization delays
  - Message passing is a cooperative method: if the partner doesn't react quickly, a delay results
  - There is a performance tradeoff caused by reacting quickly: it requires devoting resources to checking
• Memory copies
  - Memory copies are the primary source of single-processor performance problems
  - Cost of non-contiguous datatypes
  - Measured memcpy performance: memcpy is often much slower than the hardware

Features of MPI

• Collective
  - Both built-in and user-defined collective operations
  - Large number of data movement routines
  - Subgroups defined directly or by topology
• Application-oriented process topologies
  - Built-in support for grids and graphs (uses groups)
• Profiling
  - Hooks allow users to intercept MPI calls
• Environmental
  - Inquiry and error control

Features of MPI (contd.)

• General
  - Communicators combine context and group for message security
  - Thread safety
• Point-to-point communication
  - Structured buffers and derived datatypes, heterogeneity
  - Modes: normal (blocking and non-blocking), synchronous, ready (to allow access to fast protocols), buffered
• Non-message-passing concepts included:
  - Active messages and threads
• Non-message-passing concepts not included:
  - Process management; remote memory transfers; virtual shared memory

Features of MPI (contd.)

Positives

• MPI is the de-facto standard for message-passing in a box
• Performance was a high priority in the design
• Simplified sending of messages
• Rich set of collective functions
• Does not require any daemon to start the application
• No language binding issues

Features of MPI (contd.)

Positives

• Best scaling seen in practice
• Simple memory model
• Simple to understand conceptually
• Can send messages using any kind of data
• Not limited to "shared-data"

Features of MPI (contd.)

Cons

• Debugging is not easy
• Development of production codes is difficult and time consuming
• Codes may be indeterministic in nature when using asynchronous communication
• Non-contiguous data handling must either use derived datatypes, which are error prone, or lots of messages, which is expensive

Features of MPI-2

MPI-2 Techniques - Positives

• MPI-I/O
• One-sided communication
  - put/get/barrier to reduce synchronization points
• Non-blocking collective communications
• Generalized requests (interrupt receive)
• Dynamic process spawning/deletion operations
• Thread safety is ensured
• Additional language bindings for Fortran 90/95 and C++

MPI - Performance

• Tuning performance (general techniques)
  - Aggregation
  - Decomposition
  - Load balancing
  - Changing the algorithm
• Tuning performance
  - Performance techniques
  - MPI-specific tuning
  - Pitfalls

Summary of MPI

• Summary of MPI advanced point-to-point communication
• Summary of MPI special features: data types, topologies
• Summary of MPI collective communication and computation operations
• Cost of message passing
• Features of MPI and MPI-2
