MPI: A Message-Passing Interface Standard

Message Passing Interface Forum

May 1994

This work was supported in part by ARPA and NSF under grant ASC, the National Science Foundation Science and Technology Center Cooperative Agreement No. CCR, and by the Commission of the European Community through Esprit project PPPE.

The Message Passing Interface Forum (MPIF), with participation from over 40 organizations, has been meeting since November 1992 to discuss and define a set of library interface standards for message passing. MPIF is not sanctioned or supported by any official standards organization.

The goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs. As such the interface should establish a practical, portable, efficient, and flexible standard for message passing.

This is the final report, Version 1.0, of the Message Passing Interface Forum. This document contains all the technical features proposed for the interface. This copy of the draft was processed by LaTeX on May 1994.

Please send comments on MPI to mpi-comments@cs.utk.edu. Your comment will be forwarded to MPIF committee members who will attempt to respond.

University of Tennessee, Knoxville, Tennessee. Permission to copy without fee all or part of this material is granted, provided the University of Tennessee copyright notice and the title of this document appear, and notice is given that copying is by permission of the University of Tennessee.

Contents

Acknowledgments

1  Introduction to MPI
   Overview and Goals
   Who Should Use This Standard?
   What Platforms Are Targets For Implementation?
   What Is Included In The Standard?
   What Is Not Included In The Standard?
   Organization of this Document

2  MPI Terms and Conventions
   Document Notation
   Procedure Specification
   Semantic Terms
   Data Types
      Opaque objects
      Array arguments
      State
      Named constants
      Choice
      Addresses
   Language Binding
      Fortran 77 Binding Issues
      C Binding Issues
   Processes
   Error Handling
   Implementation issues
      Independence of Basic Runtime Routines
      Interaction with signals in POSIX

3  Point-to-Point Communication
   Introduction
   Blocking Send and Receive Operations
      Blocking send
      Message data
      Message envelope
      Blocking receive
      Return status
   Data type matching and data conversion
      Type matching rules
      Data conversion
   Communication Modes
   Semantics of point-to-point communication
   Buffer allocation and usage
      Model implementation of buffered mode
   Nonblocking communication
      Communication Objects
      Communication initiation
      Communication Completion
      Semantics of Nonblocking Communications
      Multiple Completions
   Probe and Cancel
   Persistent communication requests
   Send-receive
   Null processes
   Derived datatypes
      Datatype constructors
      Address and extent functions
      Lower-bound and upper-bound markers
      Commit and free
      Use of general datatypes in communication
      Correct use of addresses
      Examples
   Pack and unpack

4  Collective Communication
   Introduction and Overview
   Communicator argument
   Barrier synchronization
   Broadcast
      Example using MPI_BCAST
   Gather
      Examples using MPI_GATHER, MPI_GATHERV
   Scatter
      Examples using MPI_SCATTER, MPI_SCATTERV
   Gather-to-all
      Examples using MPI_ALLGATHER, MPI_ALLGATHERV
   All-to-All Scatter/Gather
   Global Reduction Operations
      Reduce
      Predefined reduce operations
      MINLOC and MAXLOC
      User-Defined Operations
      All-Reduce
   Reduce-Scatter
   Scan
      Example using MPI_SCAN
   Correctness

5  Groups, Contexts, and Communicators
   Introduction
      Features Needed to Support Libraries
      MPI's Support for Libraries
   Basic Concepts
      Groups
      Contexts
      Intra-Communicators
      Predefined Intra-Communicators
   Group Management
      Group Accessors
      Group Constructors
      Group Destructors
   Communicator Management
      Communicator Accessors
      Communicator Constructors
      Communicator Destructors
   Motivating Examples
      Current Practice
      Current Practice
      (Approximate) Current Practice
      Example
      Library Example
      Library Example
   Inter-Communication
      Inter-communicator Accessors
      Inter-communicator Operations
      Inter-Communication Examples
   Caching
      Functionality
      Attributes Example
   Formalizing the Loosely Synchronous Model
      Basic Statements
      Models of Execution

6  Process Topologies
   Introduction
   Virtual Topologies
   Embedding in MPI
   Overview of the Functions
   Topology Constructors
      Cartesian Constructor
      Cartesian Convenience Function: MPI_DIMS_CREATE
      General (Graph) Constructor
      Topology inquiry functions
      Cartesian Shift Coordinates
      Partitioning of Cartesian structures
      Low-level topology functions
   An Application Example

7  MPI Environmental Management
   Implementation information
      Environmental Inquiries
   Error handling
   Error codes and classes
   Timers
   Startup

8  Profiling Interface
   Requirements
   Discussion
   Logic of the design
      Miscellaneous control of profiling
   Examples
      Profiler implementation
      MPI library implementation
      Complications
   Multiple levels of interception

Bibliography

A  Language Binding
   A.1  Introduction
   A.2  Defined Constants for C and Fortran
   A.3  C bindings for Point-to-Point Communication
   A.4  C Bindings for Collective Communication
   A.5  C Bindings for Groups, Contexts, and Communicators
   A.6  C Bindings for Process Topologies
   A.7  C bindings for Environmental Inquiry
   A.8  C Bindings for Profiling
   A.9  Fortran Bindings for Point-to-Point Communication
   A.10 Fortran Bindings for Collective Communication
   A.11 Fortran Bindings for Groups, Contexts, etc.
   A.12 Fortran Bindings for Process Topologies
   A.13 Fortran Bindings for Environmental Inquiry
   A.14 Fortran Bindings for Profiling

MPI Function Index

Acknowledgments

The technical development was carried out by subgroups, whose work was reviewed by the full committee. During the period of development of the Message Passing Interface (MPI), many people served in positions of responsibility and are listed below.

- Jack Dongarra, David Walker, Conveners and Meeting Chairs

- Ewing Lusk, Bob Knighten, Minutes

- Marc Snir, William Gropp, Ewing Lusk, Point-to-Point Communications

- Al Geist, Marc Snir, Steve Otto, Collective Communications

- Steve Otto, Editor

- Rolf Hempel, Process Topologies

- Ewing Lusk, Language Binding

- William Gropp, Environmental Management

- James Cownie, Profiling

- Tony Skjellum, Lyndon Clarke, Marc Snir, Richard Littlefield, Mark Sears, Groups, Contexts, and Communicators

- Steven Huss-Lederman, Initial Implementation Subset

The following list includes some of the active participants in the MPI process not mentioned above.

    Ed Anderson        Robert Babb        Joe Baron          Eric Barszcz
    Scott Berryman     Rob Bjornson       Nathan Doss        Anne Elster
    Jim Feeney         Vince Fernando     Sam Fineberg       Jon Flower
    Daniel Frye        Ian Glendinning    Adam Greenberg     Robert Harrison
    Leslie Hart        Tom Haupt          Don Heller         Tom Henderson
    Alex Ho            C.T. Howard Ho     Gary Howell        John Kapenga
    James Kohl         Susan Krauss       Bob Leary          Arthur Maccabe
    Peter Madams       Alan Mainwaring    Oliver McBryan     Phil McKinley
    Charles Mosher     Dan Nessett        Peter Pacheco      Howard Palmer
    Paul Pierce        Sanjay Ranka       Peter Rigsbee      Arch Robison
    Erich Schikuta     Ambuj Singh        Alan Sussman       Robert Tomlinson
    Robert G. Voigt    Dennis Weeks       Stephen Wheat      Steven Zenith

The University of Tennessee and Oak Ridge National Laboratory made the draft available by anonymous FTP mail servers and were instrumental in distributing the document.

MPI operated on a very tight budget (in reality, it had no budget when the first meeting was announced). ARPA and NSF have supported research at various institutions that have made a contribution towards travel for the U.S. academics. Support for several European participants was provided by ESPRIT.

Chapter 1

Introduction to MPI


Overview and Goals

Message passing is a paradigm used widely on certain classes of parallel machines, especially those with distributed memory. Although there are many variations, the basic concept of processes communicating through messages is well understood. Over the last ten years, substantial progress has been made in casting significant applications in this paradigm. Each vendor has implemented its own variant. More recently, several systems have demonstrated that a message passing system can be efficiently and portably implemented. It is thus an appropriate time to try to define both the syntax and semantics of a core of library routines that will be useful to a wide range of users and efficiently implementable on a wide range of computers.

In designing MPI we have sought to make use of the most attractive features of a number of existing message passing systems, rather than selecting one of them and adopting it as the standard. Thus, MPI has been strongly influenced by work at the IBM T. J. Watson Research Center, Intel's NX/2, Express, nCUBE's Vertex, p4, and PARMACS. Other important contributions have come from Zipcode, Chimp, PVM, Chameleon, and PICL.

The MPI standardization effort involved about 60 people from 40 organizations, mainly from the United States and Europe. Most of the major vendors of concurrent computers were involved in MPI, along with researchers from universities, government laboratories, and industry. The standardization process began with the Workshop on Standards for Message Passing in a Distributed Memory Environment, sponsored by the Center for Research on Parallel Computing, held in April 1992 in Williamsburg, Virginia. At this workshop the basic features essential to a standard message passing interface were discussed, and a working group established to continue the standardization process.

A preliminary draft proposal, known as MPI1, was put forward by Dongarra, Hempel, Hey, and Walker in November 1992, and a revised version was completed in February 1993. MPI1 embodied the main features that were identified at the Williamsburg workshop as being necessary in a message passing standard. Since MPI1 was primarily intended to promote discussion and "get the ball rolling," it focused mainly on point-to-point communications. MPI1 brought to the forefront a number of important standardization issues, but did not include any collective communication routines and was not thread-safe.

In November 1992, a meeting of the MPI working group was held in Minneapolis, at which it was decided to place the standardization process on a more formal footing, and to generally adopt the procedures and organization of the High Performance Fortran Forum. Subcommittees were formed for the major component areas of the standard, and an email discussion service established for each. In addition, the goal of producing a draft MPI standard by the Fall of 1993 was set. To achieve this goal the MPI working group met every 6 weeks for two days throughout the first 9 months of 1993, and presented the draft MPI standard at the Supercomputing 93 conference in November 1993. These meetings and the email discussion together constituted the MPI Forum, membership of which has been open to all members of the high performance computing community.

The main advantages of establishing a message-passing standard are portability and ease-of-use. In a distributed memory communication environment in which the higher level routines and/or abstractions are built upon lower level message passing routines, the benefits of standardization are particularly apparent. Furthermore, the definition of a message passing standard, such as that proposed here, provides vendors with a clearly defined base set of routines that they can implement efficiently, or in some cases provide hardware support for, thereby enhancing scalability.

The goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs. As such the interface should establish a practical, portable, efficient, and flexible standard for message passing.

A complete list of goals follows.

- Design an application programming interface (not necessarily for compilers or a system implementation library).

- Allow efficient communication: avoid memory-to-memory copying and allow overlap of computation and communication and offload to a communication co-processor, where available.

- Allow for implementations that can be used in a heterogeneous environment.

- Allow convenient C and Fortran 77 bindings for the interface.

- Assume a reliable communication interface: the user need not cope with communication failures. Such failures are dealt with by the underlying communication subsystem.

- Define an interface that is not too different from current practice, such as PVM, NX, Express, p4, etc., and provides extensions that allow greater flexibility.

- Define an interface that can be implemented on many vendors' platforms, with no significant changes in the underlying communication and system software.

- Semantics of the interface should be language independent.

- The interface should be designed to allow for thread-safety.


Who Should Use This Standard?

This standard is intended for use by all those who want to write portable message-passing programs in Fortran 77 and C. This includes individual application programmers, developers of software designed to run on parallel machines, and creators of environments and tools. In order to be attractive to this wide audience, the standard must provide a simple, easy-to-use interface for the basic user while not semantically precluding the high-performance message-passing operations available on advanced machines.

What Platforms Are Targets For Implementation?

The attractiveness of the message-passing paradigm at least partially stems from its wide portability. Programs expressed this way may run on distributed-memory multiprocessors, networks of workstations, and combinations of all of these. In addition, shared-memory implementations are possible. The paradigm will not be made obsolete by architectures combining the shared- and distributed-memory views, or by increases in network speeds. It thus should be both possible and useful to implement this standard on a great variety of machines, including those "machines" consisting of collections of other machines, parallel or not, connected by a communication network.

The interface is suitable for use by fully general MIMD programs, as well as those written in the more restricted style of SPMD. Although no explicit support for threads is provided, the interface has been designed so as not to prejudice their use. With this version of MPI no support is provided for dynamic spawning of tasks.

MPI provides many features intended to improve performance on scalable parallel computers with specialized interprocessor communication hardware. Thus, we expect that native, high-performance implementations of MPI will be provided on such machines. At the same time, implementations of MPI on top of standard Unix interprocessor communication protocols will provide portability to workstation clusters and heterogeneous networks of workstations. Several proprietary, native implementations of MPI, and a public domain, portable implementation of MPI are in progress at the time of this writing.


What Is Included In The Standard?

The standard includes:

- Point-to-point communication

- Collective operations

- Process groups

- Communication contexts

- Process topologies

- Bindings for Fortran 77 and C

- Environmental Management and inquiry

- Profiling interface


What Is Not Included In The Standard?

The standard does not specify:

- Explicit shared-memory operations

- Operations that require more operating system support than is currently standard; for example, interrupt-driven receives, remote execution, or active messages

- Program construction tools

- Debugging facilities

- Explicit support for threads

- Support for task management

- I/O functions

There are many features that have been considered and not included in this standard. This happened for a number of reasons, one of which is the time constraint that was self-imposed in finishing the standard. Features that are not included can always be offered as extensions by specific implementations. Perhaps future versions of MPI will address some of these issues.


Organization of this Document

The following is a list of the remaining chapters in this document, along with a brief description of each.

- Chapter 2, MPI Terms and Conventions, explains notational terms and conventions used throughout the MPI document.

- Chapter 3, Point to Point Communication, defines the basic, pairwise communication subset of MPI. Send and receive are found here, along with many associated functions designed to make basic communication powerful and efficient.

- Chapter 4, Collective Communications, defines process-group collective communication operations. Well known examples of this are barrier and broadcast over a group of processes (not necessarily all the processes).

- Chapter 5, Groups, Contexts, and Communicators, shows how groups of processes are formed and manipulated, how unique communication contexts are obtained, and how the two are bound together into a communicator.

- Chapter 6, Process Topologies, explains a set of utility functions meant to assist in the mapping of process groups (a linearly ordered set) to richer topological structures such as multi-dimensional grids.

- Chapter 7, MPI Environmental Management, explains how the programmer can manage and make inquiries of the current MPI environment. These functions are needed for the writing of correct, robust programs, and are especially important for the construction of highly-portable message-passing programs.

- Chapter 8, Profiling Interface, explains a simple name-shifting convention that any MPI implementation must support. One motivation for this is the ability to put performance profiling calls into MPI without the need for access to the MPI source code. The name shift is merely an interface; it says nothing about how the actual profiling should be done, and in fact the name shift can be useful for other purposes.

- Annex A, Language Bindings, gives specific syntax in Fortran 77 and C for all MPI functions, constants, and types.

- The MPI Function Index is a simple index showing the location of the precise definition of each MPI function, together with both C and Fortran bindings.

Chapter 2

MPI Terms and Conventions


This chapter explains notational terms and conventions used throughout the MPI document, some of the choices that have been made, and the rationale behind those choices.


Document Notation

Rationale. Throughout this document, the rationale for the design choices made in the interface specification is set off in this format. Some readers may wish to skip these sections, while readers interested in interface design may want to read them carefully. (End of rationale.)

Advice to users. Throughout this document, material that speaks to users and illustrates usage is set off in this format. Some readers may wish to skip these sections, while readers interested in programming in MPI may want to read them carefully. (End of advice to users.)

Advice to implementors. Throughout this document, material that is primarily commentary to implementors is set off in this format. Some readers may wish to skip these sections, while readers interested in MPI implementations may want to read them carefully. (End of advice to implementors.)


Procedure Specification

MPI procedures are specified using a language independent notation. The arguments of procedure calls are marked as IN, OUT or INOUT. The meanings of these are:

- the call uses but does not update an argument marked IN,

- the call may update an argument marked OUT,

- the call both uses and updates an argument marked INOUT.

There is one special case: if an argument is a handle to an opaque object (these terms are defined in the Data Types section below), and the object is updated by the procedure call, then the argument is marked OUT. It is marked this way even though the handle itself is not modified; we use the OUT attribute to denote that what the handle references is updated.

The definition of MPI tries to avoid, to the largest possible extent, the use of INOUT arguments, because such use is error-prone, especially for scalar arguments.

A common occurrence for MPI functions is an argument that is used as IN by some processes and OUT by other processes. Such an argument is, syntactically, an INOUT argument and is marked as such, although, semantically, it is not used in one call both for input and for output.

Another frequent situation arises when an argument value is needed only by a subset of the processes. When an argument is not significant at a process then an arbitrary value can be passed as argument.

Unless specified otherwise, an argument of type OUT or type INOUT cannot be aliased with any other argument passed to an MPI procedure. An example of argument aliasing in C appears below. If we define a C procedure like this,

void copyIntBuffer( int *pin, int *pout, int len )
{   int i;
    for (i=0; i<len; ++i) *pout++ = *pin++;
}

then a call to it in the following code fragment has aliased arguments.

int a[10];
copyIntBuffer( a, a+3, 7 );

Although the C language allows this, such usage of MPI procedures is forbidden unless otherwise specified. Note that Fortran prohibits aliasing of arguments.

All MPI functions are first specified in the language-independent notation. Immediately below this, the ANSI C version of the function is shown, and below this, a version of the same function in Fortran.


Semantic Terms

When discussing MPI procedures the following semantic terms are used. The first two are usually applied to communication operations.

nonblocking  If the procedure may return before the operation completes, and before the user is allowed to re-use resources (such as buffers) specified in the call.

blocking  If return from the procedure indicates the user is allowed to re-use resources specified in the call.

local  If completion of the procedure depends only on the local executing process. Such an operation does not require communication with another user process.

non-local  If completion of the operation may require the execution of some MPI procedure on another process. Such an operation may require communication occurring with another user process.

collective  If all processes in a process group need to invoke the procedure.
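For illustration of the first two terms (a sketch only, using calls defined in the point-to-point chapter, and assuming buf, n, dest, tag, and comm have already been set up), MPI_SEND is blocking, while MPI_ISEND paired with MPI_WAIT is nonblocking:

MPI_Request request;
MPI_Status  status;

MPI_Send(buf, n, MPI_INT, dest, tag, comm);             /* blocking: buf may be reused
                                                           as soon as the call returns  */

MPI_Isend(buf, n, MPI_INT, dest, tag, comm, &request);  /* nonblocking: returns at once */
/* buf must not be reused here; the operation may still be using it */
MPI_Wait(&request, &status);                            /* after completion, buf may be reused */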

Data Types

Opaque objects

MPI manages system memory that is used for buffering messages and for storing internal representations of various MPI objects such as groups, communicators, datatypes, etc. This memory is not directly accessible to the user, and objects stored there are opaque: their size and shape is not visible to the user. Opaque objects are accessed via handles, which exist in user space. MPI procedures that operate on opaque objects are passed handle arguments to access these objects. In addition to their use by MPI calls for object access, handles can participate in assignment and comparisons.

In Fortran, all handles have type INTEGER. In C, a different handle type is defined for each category of objects. These should be types that support assignment and equality operators.

In Fortran, the handle can be an index to a table of opaque objects in a system table; in C it can be such an index or a pointer to the object. More bizarre possibilities exist.

Opaque objects are allocated and deallocated by calls that are specific to each object type. These are listed in the sections where the objects are described. The calls accept a handle argument of matching type. In an allocate call this is an OUT argument that returns a valid reference to the object. In a call to deallocate this is an INOUT argument which returns with an "invalid handle" value. MPI provides an "invalid handle" constant for each object type. Comparisons to this constant are used to test for validity of the handle.

A call to deallocate invalidates the handle and marks the object for deallocation. The object is not accessible to the user after the call. However, MPI need not deallocate the object immediately. Any operation pending (at the time of the deallocate) that involves this object will complete normally; the object will be deallocated afterwards.

An opaque object and its handle are significant only at the process where the object was created, and cannot be transferred to another process.

MPI provides certain predefined opaque objects and predefined, static handles to these objects. Such objects may not be destroyed.
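As an illustration of this allocate/deallocate pattern (a sketch only, using the group calls defined in the chapter on groups, contexts, and communicators), a C fragment might look as follows:

MPI_Group group;

MPI_Comm_group(MPI_COMM_WORLD, &group);  /* allocation: OUT argument returns a valid handle */

/* ... use the group ... */

MPI_Group_free(&group);                  /* deallocation: INOUT argument; on return the
                                            handle has the invalid value MPI_GROUP_NULL   */
if (group == MPI_GROUP_NULL) {
    /* the handle is now invalid; the object itself persists until any
       operation pending at the time of the free call has completed    */
}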

Rationale. This design hides the internal representation used for MPI data structures, thus allowing similar calls in C and Fortran. It also avoids conflicts with the typing rules in these languages, and easily allows future extensions of functionality. The mechanism for opaque objects used here loosely follows the POSIX Fortran binding standard.

The explicit separation of handles in user space and objects in system space allows space-reclaiming, deallocation calls to be made at appropriate points in the user program. If the opaque objects were in user space, one would have to be very careful not to go out of scope before any pending operation requiring that object completed. The specified design allows an object to be marked for deallocation, the user program can then go out of scope, and the object itself still persists until any pending operations are complete.

The requirement that handles support assignment/comparison is made since such operations are common. This restricts the domain of possible implementations. The alternative would have been to allow handles to have been an arbitrary, opaque type. This would force the introduction of routines to do assignment and comparison, adding complexity, and was therefore ruled out. (End of rationale.)

Advice to users. A user may accidently create a dangling reference by assigning to a handle the value of another handle, and then deallocating the object associated with these handles. Conversely, if a handle variable is deallocated before the associated object is freed, then the object becomes inaccessible (this may occur, for example, if the handle is a local variable within a subroutine, and the subroutine is exited before the associated object is deallocated). It is the user's responsibility to avoid adding or deleting references to opaque objects, except as a result of calls that allocate or deallocate such objects. (End of advice to users.)

Advice to implementors. The intended semantics of opaque objects is that each opaque object is separate from each other; each call to allocate such an object copies all the information required for the object. Implementations may avoid excessive copying by substituting referencing for copying. For example, a derived datatype may contain references to its components, rather than copies of its components; a call to MPI_COMM_GROUP may return a reference to the group associated with the communicator, rather than a copy of this group. In such cases, the implementation must maintain reference counts, and allocate and deallocate objects such that the visible effect is as if the objects were copied. (End of advice to implementors.)


Array arguments

An MPI call may need an argument that is an array of opaque objects, or an array of handles. The array-of-handles is a regular array with entries that are handles to objects of the same type in consecutive locations in the array. Whenever such an array is used, an additional len argument is required to indicate the number of valid entries (unless this number can be derived otherwise). The valid entries are at the beginning of the array; len indicates how many of them there are, and need not be the entire size of the array. The same approach is followed for other array arguments.
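For example, the completion call MPI_WAITALL (defined in the point-to-point chapter) follows this convention: its count argument gives the number of valid request handles at the beginning of the array, which need not equal the declared size of the array. A minimal sketch, assuming the three nonblocking operations have in fact been started:

MPI_Request requests[8];
MPI_Status  statuses[8];
int         nvalid = 3;   /* only the first three entries of requests hold valid handles */

/* ... start three nonblocking operations, storing their request handles
       in requests[0], requests[1], requests[2] ... */

MPI_Waitall(nvalid, requests, statuses);   /* the len-style argument gives the number of
                                              valid entries, not the size of the array   */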

State

MPI procedures use at various places arguments with state types. The values of such a data type are all identified by names, and no operation is defined on them. For example, the MPI_ERRHANDLER_SET routine has a state type argument with values MPI_ERRORS_ARE_FATAL, MPI_ERRORS_RETURN, etc.


Named constants

MPI procedures sometimes assign a special meaning to a special value of a basic type argument; e.g., tag is an integer-valued argument of point-to-point communication operations, with a special wildcard value, MPI_ANY_TAG. Such arguments will have a range of regular values, which is a proper subrange of the range of values of the corresponding basic type; special values (such as MPI_ANY_TAG) will be outside the regular range. The range of regular values can be queried using environmental inquiry functions (see the chapter on MPI environmental management).
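For example, a receive posted with the wildcard value accepts any tag, while the tag actually carried by the arriving message, which is always a regular value, can be read from the returned status (MPI_RECV and the status object are defined in the point-to-point chapter); a minimal sketch:

int        buf;
MPI_Status status;

MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

/* status.MPI_TAG now holds the (regular) tag of the message that arrived;
   it is never the special value MPI_ANY_TAG                               */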

Choice

MPI functions sometimes use arguments with a choice (or union) data type. Distinct calls to the same routine may pass by reference actual arguments of different types. The mechanism for providing such arguments will differ from language to language. For Fortran, the document uses <type> to represent a choice variable; for C, we use (void *).


Addresses

Some MPI procedures use address arguments that represent an absolute address in the calling program. The datatype of such an argument is an integer of the size needed to hold any valid address in the execution environment.

Language Binding

This section defines the rules for MPI language binding in general and for Fortran 77 and ANSI C in particular. Defined here are various object representations, as well as the naming conventions used for expressing this standard. The actual calling sequences are defined elsewhere.

It is expected that any Fortran 77 and C implementations use the Fortran 77 and ANSI C bindings, respectively. Although we consider it premature to define other bindings to Fortran and C, the current bindings are designed to encourage, rather than discourage, experimentation with better bindings that might be adopted later.

Since the word PARAMETER is a keyword in the Fortran language, we use the word "argument" to denote the arguments to a subroutine. These are normally referred to as parameters in C; however, we expect that C programmers will understand the word "argument" (which has no specific meaning in C), thus allowing us to avoid unnecessary confusion for Fortran programmers.

There are several important language binding issues not addressed by this standard. This standard does not discuss the interoperability of message passing between languages. It is fully expected that many implementations will have such features, and that such features are a sign of the quality of the implementation.


Fortran 77 Binding Issues

All MPI names have an MPI_ prefix, and all characters are capitals. Programs must not declare variables or functions with names beginning with the prefix MPI_. This is mandated to avoid possible name collisions.

All MPI Fortran subroutines have a return code in the last argument. A few MPI operations are functions, which do not have the return code argument. The return code value for successful completion is MPI_SUCCESS. Other error codes are implementation dependent; see the chapter on MPI environmental management.

Handles are represented in Fortran as INTEGERs. Binary-valued variables are of type LOGICAL.

Array arguments are indexed from one.

Unless explicitly stated, the MPI F77 binding is consistent with ANSI standard Fortran 77. There are several points where this standard diverges from the ANSI Fortran 77 standard. These exceptions are consistent with common practice in the Fortran community. In particular:

- MPI identifiers are limited to thirty, not six, significant characters.

- MPI identifiers may contain underscores after the first character.

- An MPI subroutine with a choice argument may be called with different argument types. An example is shown in the figure below. This violates the letter of the Fortran standard, but such a violation is common practice. An alternative would be to have a separate version of MPI_SEND for each data type.

      double precision a
      integer b
      ...
      call MPI_send(a,...)
      call MPI_send(b,...)

  Figure: An example of calling a routine with mismatched formal and actual arguments.

- Although not required, it is strongly suggested that named MPI constants (PARAMETERs) be provided in an include file, called mpif.h. On systems that do not support include files, the implementation should specify the values of named constants.

- Vendors are encouraged to provide type declarations in the mpif.h file on Fortran systems that support user-defined types. One should define, if possible, the type MPI_ADDRESS, which is an INTEGER of the size needed to hold an address in the execution environment. On systems where type definition is not supported, it is up to the user to use an INTEGER of the right kind to represent addresses (i.e., INTEGER*4 on a 32-bit machine, INTEGER*8 on a 64-bit machine, etc.).

C Binding Issues

We use the ANSI C declaration format. All MPI names have an MPI_ prefix, defined constants are in all capital letters, and defined types and functions have one capital letter after the prefix. Programs must not declare variables or functions with names beginning with the prefix MPI_. This is mandated to avoid possible name collisions.

The definition of named constants, function prototypes, and type definitions must be supplied in an include file mpi.h.

Almost all C functions return an error code. The successful return code will be MPI_SUCCESS, but failure return codes are implementation dependent. A few C functions do not return values, so that they can be implemented as macros.

Type declarations are provided for handles to each category of opaque objects. Either a pointer or an integer type is used.

Array arguments are indexed from zero.

Logical flags are integers with value 0 meaning "false" and a non-zero value meaning "true."

Choice arguments are pointers of type void*.

Address arguments are of MPI-defined type MPI_Aint. This is defined to be an int of the size needed to hold any valid address on the target architecture.
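For illustration, the fragment below obtains an absolute address with MPI_ADDRESS (defined in the point-to-point chapter) and stores it in a variable of the MPI-defined type:

double   buffer[100];
MPI_Aint location;

MPI_Address(buffer, &location);   /* location now holds the absolute address of buffer,
                                     in an integer type wide enough for this architecture */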

45

46

47 48

CHAPTER MPI TERMS AND CONVENTIONS

Pro cesses

1

2

An MPI program consists of autonomous pro cesses executing their own co de in an MIMD

3

style The co des executed by each pro cess need not b e identical The pro cesses commu

4

nicate via calls to MPI communication primitives Typically each pro cess executes in its

5

own address space although sharedmemory implementations of MPI are p ossible This

6

do cument sp ecies the b ehavior of a parallel program assuming that only MPI calls are

7

used for communication The interaction of an MPI program with other p ossible means of

8

communication eg is not sp ecied

9

MPI do es not sp ecify the execution mo del for each pro cess A pro cess can b e sequential

10

or can b e multithreaded with threads p ossibly executing concurrently Care has b een taken

11

to make MPI threadsafe by avoiding the use of implicit state The desired interaction of

12

MPI with threads is that concurrent threads b e all allowed to execute MPI calls and calls

13

b e reentrant a blo cking MPI call blo cks only the invoking thread allowing the

14

of another thread

15

MPI do es not provide mechanisms to sp ecify the initial allo cation of pro cesses to an

16

MPI computation and their binding to physical pro cessors It is exp ected that vendors will

17

provide mechanisms to do so either at load time or at run time Such mechanisms will

18

allow the sp ecication of the initial numb er of required pro cesses the co de to b e executed

19

by each initial pro cess and the allo cation of pro cesses to pro cessors Also the current

20

prop osal do es not provide for dynamic creation or deletion of pro cesses during program

21

execution the total numb er of pro cesses is xed although it is intended to b e consistent

22

with such extensions Finally we always identify pro cesses according to their relative rank

23

in a group that is consecutive integers in the range groupsize
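For illustration, a process can query its own rank and the size of the group of MPI_COMM_WORLD with the calls sketched below (they are defined in later chapters); the rank returned always lies between 0 and size-1:

int rank, size;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process: 0 <= rank < size */
MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes in the group       */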

Error Handling

MPI provides the user with reliable message transmission. A message sent is always received correctly, and the user does not need to check for transmission errors, time-outs, or other error conditions. In other words, MPI does not provide mechanisms for dealing with failures in the communication system. If the MPI implementation is built on an unreliable underlying mechanism, then it is the job of the implementor of the MPI subsystem to insulate the user from this unreliability, or to reflect unrecoverable errors as failures. Whenever possible, such failures will be reflected as errors in the relevant communication call. Similarly, MPI itself provides no mechanisms for handling processor failures. The error handling facilities described in the chapter on MPI environmental management can be used to restrict the scope of an unrecoverable error, or design error recovery at the application level.

Of course, MPI programs may still be erroneous. A program error can occur when an MPI call is called with an incorrect argument (non-existing destination in a send operation, buffer too small in a receive operation, etc.). This type of error would occur in any implementation. In addition, a resource error may occur when a program exceeds the amount of available system resources (number of pending messages, system buffers, etc.). The occurrence of this type of error depends on the amount of available resources in the system and the resource allocation mechanism used; this may differ from system to system. A high-quality implementation will provide generous limits on the important resources so as to alleviate the portability problem this represents.

Almost all MPI calls return a code that indicates successful completion of the operation. Whenever possible, MPI calls return an error code if an error occurred during the call. By default, an error detected during the execution of the MPI library causes the parallel computation to abort. However, MPI provides mechanisms for users to change this default and to handle recoverable errors. The user may specify that no error is fatal, and handle error codes returned by MPI calls by himself or herself. Also, the user may provide his or her own error-handling routines, which will be invoked whenever an MPI call returns abnormally. The MPI error handling facilities are described in the chapter on MPI environmental management.
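As a sketch of this facility (the call and the predefined handler are defined in the chapter on MPI environmental management, and the buffer, destination, and tag below are arbitrary choices), a program can install the error handler MPI_ERRORS_RETURN on a communicator and then test return codes itself:

int buf[8], dest = 1, tag = 0, err;

MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);   /* errors no longer abort */

err = MPI_Send(buf, 8, MPI_INT, dest, tag, MPI_COMM_WORLD);
if (err != MPI_SUCCESS) {
    /* attempt recovery, or report the failure and shut down cleanly */
}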

Several factors limit the ability of MPI calls to return with meaningful error codes when an error occurs. MPI may not be able to detect some errors; other errors may be too expensive to detect in normal execution mode; finally, some errors may be "catastrophic" and may prevent MPI from returning control to the caller in a consistent state.

Another subtle issue arises because of the nature of asynchronous communications: MPI calls may initiate operations that continue asynchronously after the call returned. Thus, the operation may return with a code indicating successful completion, yet later cause an error exception to be raised. If there is a subsequent call that relates to the same operation (e.g., a call that verifies that an asynchronous operation has completed) then the error argument associated with this call will be used to indicate the nature of the error. In a few cases, the error may occur after all calls that relate to the operation have completed, so that no error value can be used to indicate the nature of the error (e.g., an error in a send with the ready mode). Such an error must be treated as fatal, since information cannot be returned for the user to recover from it.

This document does not specify the state of a computation after an erroneous MPI call has occurred. The desired behavior is that a relevant error code be returned, and the effect of the error be localized to the greatest possible extent. E.g., it is highly desirable that an erroneous receive call will not cause any part of the receiver's memory to be overwritten, beyond the area specified for receiving the message.

Implementations may go beyond this document in supporting in a meaningful manner MPI calls that are defined here to be erroneous. For example, MPI specifies strict type matching rules between matching send and receive operations: it is erroneous to send a floating point variable and receive an integer. Implementations may go beyond these type matching rules, and provide automatic type conversion in such situations. It will be helpful to generate warnings for such nonconforming behavior.


Implementation issues

There are a number of areas where an MPI implementation may interact with the operating environment and system. While MPI does not mandate that any services (such as I/O or signal handling) be provided, it does strongly suggest the behavior to be provided if those services are available. This is an important point in achieving portability across platforms that provide the same set of services.

Independence of Basic Runtime Routines

MPI programs require that library routines that are part of the basic language environment (such as date and write in Fortran and printf and malloc in ANSI C), and are executed after MPI_INIT and before MPI_FINALIZE, operate independently, and that their completion is independent of the action of other processes in an MPI program.

Note that this in no way prevents the creation of library routines that provide parallel services whose operation is collective. However, the following program is expected to complete in an ANSI C environment regardless of the size of MPI_COMM_WORLD (assuming that I/O is available at the executing nodes).

int rank;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
if (rank == 0) printf( "Starting program\n" );
MPI_Finalize();

The corresponding Fortran program is also expected to complete.

An example of what is not required is any particular ordering of the action of these routines when called by several tasks. For example, MPI makes neither requirements nor recommendations for the output from the following program (again assuming that I/O is available at the executing nodes).

MPI_Comm_rank( MPI_COMM_WORLD, &rank );
printf( "Output from task rank %d\n", rank );

In addition, calls that fail because of resource exhaustion or other error are not considered a violation of the requirements here (however, they are required to complete, just not to complete successfully).


Interaction with signals in POSIX

MPI does not specify either the interaction of processes with signals, in a UNIX environment, or with other events that do not relate to MPI communication. That is, signals are not significant from the view point of MPI, and implementors should attempt to implement MPI so that signals are transparent: an MPI call suspended by a signal should resume and complete after the signal is handled. Generally, the state of a computation that is visible or significant from the view-point of MPI should only be affected by MPI calls.

The intent of MPI to be thread and signal safe has a number of subtle effects. For example, on Unix systems, a catchable signal such as SIGALRM (an alarm signal) must not cause an MPI routine to behave differently than it would have in the absence of the signal. Of course, if the signal handler issues MPI calls or changes the environment in which the MPI routine is operating (for example, consuming all available memory space), the MPI routine should behave as appropriate for that situation (in particular, in this case, the behavior should be the same as for a multithreaded MPI implementation).

A second effect is that a signal handler that performs MPI calls must not interfere with the operation of MPI. For example, an MPI receive of any type that occurs within a signal handler must not cause erroneous behavior by the MPI implementation. Note that an implementation is permitted to prohibit the use of MPI calls from within a signal handler, and is not required to detect such use.

It is highly desirable that MPI not use SIGALRM, SIGFPE, or SIGIO. An implementation is required to clearly document all of the signals that the MPI implementation uses; a good place for this information is a Unix 'man' page on MPI.

Chapter 3

Point-to-Point Communication


Introduction

Sending and receiving of messages by processes is the basic MPI communication mechanism. The basic point-to-point communication operations are send and receive. Their use is illustrated in the example below.

#include "mpi.h"
main( argc, argv )
int argc;
char **argv;
{
    char message[20];
    int myrank;
    MPI_Status status;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
    if (myrank == 0)    /* code for process zero */
    {
        strcpy(message, "Hello, there");
        MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
    }
    else                /* code for process one */
    {
        MPI_Recv(message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
        printf("received %s\n", message);
    }
    MPI_Finalize();
}

In this example, process zero (myrank = 0) sends a message to process one using the send operation MPI_SEND. The operation specifies a send buffer in the sender memory from which the message data is taken. In the example above, the send buffer consists of the storage containing the variable message in the memory of process zero. The location, size and type of the send buffer are specified by the first three parameters of the send operation. The message sent will contain the characters of this variable. In addition, the send operation associates an envelope with the message. This envelope specifies the message destination and contains distinguishing information that can be used by the receive operation to select a particular message. The last three parameters of the send operation specify the envelope for the message sent.

Process one (myrank = 1) receives this message with the receive operation MPI_RECV. The message to be received is selected according to the value of its envelope, and the message data is stored into the receive buffer. In the example above, the receive buffer consists of the storage containing the string message in the memory of process one. The first three parameters of the receive operation specify the location, size and type of the receive buffer. The next three parameters are used for selecting the incoming message. The last parameter is used to return information on the message just received.

The next sections describe the blocking send and receive operations. We discuss send, receive, blocking communication semantics, type matching requirements, type conversion in heterogeneous environments, and more general communication modes. Nonblocking communication is addressed next, followed by channel-like constructs and send-receive operations. We then consider general datatypes that allow one to transfer efficiently heterogeneous and noncontiguous data. We conclude with the description of calls for explicit packing and unpacking of messages.

Blocking Send and Receive Operations

Blocking send

The syntax of the blocking send operation is given below.

MPI_SEND(buf, count, datatype, dest, tag, comm)

  IN  buf       initial address of send buffer (choice)
  IN  count     number of elements in send buffer (nonnegative integer)
  IN  datatype  datatype of each send buffer element (handle)
  IN  dest      rank of destination (integer)
  IN  tag       message tag (integer)
  IN  comm      communicator (handle)

int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)

MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

The blocking semantics of this call are described later in this chapter.


Message data

The send buffer specified by the MPI_SEND operation consists of count successive entries of the type indicated by datatype, starting with the entry at address buf. Note that we specify the message length in terms of number of elements, not number of bytes. The former is machine independent and closer to the application level.
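For example, a send of ten double-precision values specifies a count of 10 elements, not 10*sizeof(double) bytes (the destination rank 1 and tag 99 below are arbitrary choices):

double a[10];

MPI_Send(a, 10, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);   /* count is in elements, not bytes */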

The data part of the message consists of a sequence of count values, each of the type indicated by datatype. count may be zero, in which case the data part of the message is empty. The basic datatypes that can be specified for message data values correspond to the basic datatypes of the host language. Possible values of this argument for Fortran and the corresponding Fortran types are listed below.

    MPI datatype            Fortran datatype
    MPI_INTEGER             INTEGER
    MPI_REAL                REAL
    MPI_DOUBLE_PRECISION    DOUBLE PRECISION
    MPI_COMPLEX             COMPLEX
    MPI_LOGICAL             LOGICAL
    MPI_CHARACTER           CHARACTER(1)
    MPI_BYTE
    MPI_PACKED

Possible values for this argument for C and the corresponding C types are listed below.

    MPI datatype            C datatype
    MPI_CHAR                signed char
    MPI_SHORT               signed short int
    MPI_INT                 signed int
    MPI_LONG                signed long int
    MPI_UNSIGNED_CHAR       unsigned char
    MPI_UNSIGNED_SHORT      unsigned short int
    MPI_UNSIGNED            unsigned int
    MPI_UNSIGNED_LONG       unsigned long int
    MPI_FLOAT               float
    MPI_DOUBLE              double
    MPI_LONG_DOUBLE         long double
    MPI_BYTE
    MPI_PACKED

The datatypes MPI_BYTE and MPI_PACKED do not correspond to a Fortran or C datatype. A value of type MPI_BYTE consists of a byte (8 binary digits). A byte is uninterpreted and is different from a character. Different machines may have different representations for characters, or may use more than one byte to represent characters. On the other hand, a byte has the same binary value on all machines. The use of the type MPI_PACKED is explained in the section on pack and unpack, at the end of this chapter.

MPI requires support of the datatypes listed above, which match the basic datatypes of Fortran 77 and ANSI C. Additional MPI datatypes should be provided if the host language has additional data types: MPI_LONG_LONG_INT, for (64 bit) C integers declared to be of type longlong int; MPI_DOUBLE_COMPLEX, for double precision complex in Fortran declared to be of type DOUBLE COMPLEX; MPI_REAL2, MPI_REAL4 and MPI_REAL8 for Fortran reals, declared to be of type REAL*2, REAL*4 and REAL*8, respectively; MPI_INTEGER1, MPI_INTEGER2 and MPI_INTEGER4 for Fortran integers, declared to be of type INTEGER*1, INTEGER*2 and INTEGER*4, respectively; etc.

Rationale. One goal of the design is to allow for MPI to be implemented as a library, with no need for additional preprocessing or compilation. Thus, one cannot assume that a communication call has information on the datatype of variables in the communication buffer; this information must be supplied by an explicit argument. The need for such datatype information will become clear in the section on data conversion below. (End of rationale.)


Message envelope

In addition to the data part, messages carry information that can be used to distinguish messages and selectively receive them. This information consists of a fixed number of fields, which we collectively call the message envelope. These fields are

    source
    destination
    tag
    communicator

The message source is implicitly determined by the identity of the message sender. The other fields are specified by arguments in the send operation.

The message destination is specified by the dest argument.

The integer-valued message tag is specified by the tag argument. This integer can be used by the program to distinguish different types of messages. The range of valid tag values is 0,...,UB, where the value of UB is implementation dependent. It can be found by querying the value of the attribute MPI_TAG_UB, as described in the chapter on MPI environmental management. MPI requires that UB be no less than 32767.

The comm argument specifies the communicator that is used for the send operation. Communicators are explained in the chapter on groups, contexts, and communicators; below is a brief summary of their usage.

A communicator specifies the communication context for a communication operation. Each communication context provides a separate "communication universe:" messages are always received within the context they were sent, and messages sent in different contexts do not interfere.

The communicator also specifies the set of processes that share this communication context. This process group is ordered and processes are identified by their rank within this group. Thus, the range of valid values for dest is 0,...,n-1, where n is the number of processes in the group. (If the communicator is an inter-communicator, then destinations are identified by their rank in the remote group. See the chapter on groups, contexts, and communicators.)

A predefined communicator MPI_COMM_WORLD is provided by MPI. It allows communication with all processes that are accessible after MPI initialization, and processes are identified by their rank in the group of MPI_COMM_WORLD.
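For example, two kinds of messages exchanged by the same pair of processes over the same communicator can be kept apart by the tag alone. In the sketch below (ranks 0 and 1, tags 1 and 2, and the buffer and count variables are arbitrary choices, assumed to be declared and set up), each receive selects one kind of message:

/* on process 0: data messages carry tag 1, control messages carry tag 2 */
MPI_Send(data, ndata, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
MPI_Send(ctrl, nctrl, MPI_INT,    1, 2, MPI_COMM_WORLD);

/* on process 1: the tag argument selects which kind of message is received */
MPI_Recv(data, ndata, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
MPI_Recv(ctrl, nctrl, MPI_INT,    0, 2, MPI_COMM_WORLD, &status);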

45

46

Advice to users Users that are comfortable with the notion of a at name space

47

for pro cesses and a single communication context as oered by most existing com

48

munication libraries need only use the predened variable MPI COMM WORLD as the

BLOCKING SEND AND RECEIVE OPERATIONS

comm argument This will allow communication with all the pro cesses available at

1

initialization time

2

3

Users may dene new communicators as explained in Chapter Communicators

4

provide an imp ortant encapsulation mechanism for libraries and mo dules They allow

5

mo dules to have their own disjoint communication universe and their own pro cess

6

numb ering scheme End of advice to users

7

8

Advice to implementors The message envelop e would normally b e enco ded by a

9

xedlength message header However the actual enco ding is implementation dep en

10

dent Some of the information eg source or destination may b e implicit and need

11

not b e explicitly carried by messages Also pro cesses may b e identied by relative

12

ranks or absolute ids etc End of advice to implementors

13

14
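As a non-normative illustration of the tag range discussed above, the C sketch below queries the attribute MPI_TAG_UB on MPI_COMM_WORLD with MPI_Attr_get; the attribute caching mechanism itself is defined in a later chapter.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int *tag_ub_ptr, flag;

        MPI_Init(&argc, &argv);
        /* MPI_TAG_UB is a predefined attribute cached on MPI_COMM_WORLD;
           the call returns a pointer to the value and a flag that says
           whether the attribute is set. */
        MPI_Attr_get(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
        if (flag)
            printf("largest valid tag value: %d\n", *tag_ub_ptr);
        MPI_Finalize();
        return 0;
    }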

Blocking receive

The syntax of the blocking receive operation is given below.

MPI_RECV (buf, count, datatype, source, tag, comm, status)
  OUT  buf       initial address of receive buffer (choice)
  IN   count     number of elements in receive buffer (integer)
  IN   datatype  datatype of each receive buffer element (handle)
  IN   source    rank of source (integer)
  IN   tag       message tag (integer)
  IN   comm      communicator (handle)
  OUT  status    status object (Status)

int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm comm, MPI_Status *status)

MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

The blocking semantics of this call are described in the section on communication modes below.

The receive buffer consists of the storage containing count consecutive elements of the type specified by datatype, starting at address buf. The length of the received message must be less than or equal to the length of the receive buffer. An overflow error occurs if all incoming data does not fit, without truncation, into the receive buffer.

If a message that is shorter than the receive buffer arrives, then only those locations corresponding to the (shorter) message are modified.

Advice to users. The MPI_PROBE function, described in the section on probe and cancel, can be used to receive messages of unknown length. (End of advice to users.)

Advice to implementors. Even though no specific behavior is mandated by MPI for erroneous programs, the recommended handling of overflow situations is to return, in status, information about the source and tag of the incoming message. The receive operation will return an error code. A quality implementation will also ensure that no memory that is outside the receive buffer will ever be overwritten.

In the case of a message shorter than the receive buffer, MPI is quite strict in that it allows no modification of the other locations. A more lenient statement would allow for some optimizations, but this is not allowed. The implementation must be ready to end a copy into the receiver memory exactly at the end of the receive buffer, even if it is an odd address. (End of advice to implementors.)

The selection of a message by a receive operation is governed by the value of the message envelope. A message can be received by a receive operation if its envelope matches the source, tag and comm values specified by the receive operation. The receiver may specify a wildcard MPI_ANY_SOURCE value for source, and/or a wildcard MPI_ANY_TAG value for tag, indicating that any source and/or tag are acceptable. It cannot specify a wildcard value for comm. Thus, a message can be received by a receive operation only if it is addressed to the receiving process, has a matching communicator, has matching source unless source = MPI_ANY_SOURCE in the pattern, and has a matching tag unless tag = MPI_ANY_TAG in the pattern.

The message tag is specified by the tag argument of the receive operation. The argument source, if different from MPI_ANY_SOURCE, is specified as a rank within the process group associated with that same communicator (remote process group, for intercommunicators). Thus, the range of valid values for the source argument is {0, ..., n-1} ∪ {MPI_ANY_SOURCE}, where n is the number of processes in this group.

Note the asymmetry between send and receive operations: a receive operation may accept messages from an arbitrary sender; on the other hand, a send operation must specify a unique receiver. This matches a "push" communication mechanism, where data transfer is effected by the sender, rather than a "pull" mechanism, where data transfer is effected by the receiver.

Source = destination is allowed, that is, a process can send a message to itself. (However, it is unsafe to do so with the blocking send and receive operations described above, since this may lead to deadlock. See the section on semantics of point-to-point communication below.)

Advice to implementors. Message context and other communicator information can be implemented as an additional tag field. It differs from the regular message tag in that wild card matching is not allowed on this field, and that value setting for this field is controlled by communicator manipulation functions. (End of advice to implementors.)

Return status

The source or tag of a received message may not be known if wildcard values were used in the receive operation. The information is returned by the status argument of MPI_RECV. The type of status is MPI-defined. Status variables need to be explicitly allocated by the user, that is, they are not system objects.

In C, status is a structure that contains two fields named MPI_SOURCE and MPI_TAG, and the structure may contain additional fields. Thus, status.MPI_SOURCE and status.MPI_TAG contain the source and tag, respectively, of the received message.

In Fortran, status is an array of INTEGERs of size MPI_STATUS_SIZE. The two constants MPI_SOURCE and MPI_TAG are the indices of the entries that store the source and tag fields. Thus status(MPI_SOURCE) and status(MPI_TAG) contain, respectively, the source and the tag of the received message.

The status argument also returns information on the length of the message received. However, this information is not directly available as a field of the status variable and a call to MPI_GET_COUNT is required to "decode" this information.

MPI_GET_COUNT(status, datatype, count)
  IN   status    return status of receive operation (Status)
  IN   datatype  datatype of each receive buffer element (handle)
  OUT  count     number of received elements (integer)

int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)

MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERROR)
    INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR

Returns the number of elements received. (Again, we count elements, not bytes.) The datatype argument should match the argument provided by the receive call that set the status variable. We shall later see, in the section on derived datatypes, that MPI_GET_COUNT may return the value MPI_UNDEFINED in certain situations.

Rationale. Some message passing libraries use INOUT count, tag and source arguments, thus using them both to specify the selection criteria for incoming messages and to return the actual envelope values of the received message. The use of a separate status argument prevents errors that are often attached with INOUT arguments (e.g., using the MPI_ANY_TAG constant as the tag in a send). Some libraries use calls that refer implicitly to the "last message received." This is not thread safe.

The datatype argument is passed to MPI_GET_COUNT so as to improve performance. A message might be received without counting the number of elements it contains, and the count value is often not needed. Also, this allows the same function to be used after a call to MPI_PROBE. (End of rationale.)

All send and receive operations use the buf, count, datatype, source, dest, tag, comm and status arguments in the same way as the blocking MPI_SEND and MPI_RECV operations described in this section.
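The following non-normative C sketch illustrates the use of the status argument and MPI_GET_COUNT: it receives at most 100 integers from any source with any tag, then recovers the actual envelope values and element count. The buffer size and datatype are arbitrary choices made for the example.

    #include <stdio.h>
    #include <mpi.h>

    void receive_any(MPI_Comm comm)
    {
        int buf[100], count;
        MPI_Status status;

        /* Wildcards select any source and any tag; the actual envelope
           values are returned in status. */
        MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 comm, &status);

        /* The element count is not a field of status; it is decoded
           with MPI_Get_count, using the datatype of the receive. */
        MPI_Get_count(&status, MPI_INT, &count);
        printf("received %d ints from %d with tag %d\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
    }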

Data type matching and data conversion

Type matching rules

One can think of message transfer as consisting of the following three phases.

Data is pulled out of the send buffer and a message is assembled.

A message is transferred from sender to receiver.

Data is pulled from the incoming message and disassembled into the receive buffer.

Type matching has to be observed at each of these three phases: the type of each variable in the sender buffer has to match the type specified for that entry by the send operation; the type specified by the send operation has to match the type specified by the receive operation; and the type of each variable in the receive buffer has to match the type specified for that entry by the receive operation. A program that fails to observe these three rules is erroneous.

To define type matching more precisely, we need to deal with two issues: matching of types of the host language with types specified in communication operations, and matching of types at sender and receiver.

The types of a send and receive match (phase two) if both operations use identical names. That is, MPI_INTEGER matches MPI_INTEGER, MPI_REAL matches MPI_REAL, and so on. There is one exception to this rule, discussed later in this chapter: the type MPI_PACKED can match any other type.

The type of a variable in a host program matches the type specified in the communication operation if the datatype name used by that operation corresponds to the basic type of the host program variable. For example, an entry with type name MPI_INTEGER matches a Fortran variable of type INTEGER. A table giving this correspondence for Fortran and C appears earlier in this chapter. There are two exceptions to this last rule: an entry with type name MPI_BYTE or MPI_PACKED can be used to match any byte of storage (on a byte-addressable machine), irrespective of the datatype of the variable that contains this byte. The type MPI_PACKED is used to send data that has been explicitly packed, or to receive data that will be explicitly unpacked (see the section on pack and unpack). The type MPI_BYTE allows one to transfer the binary value of a byte in memory unchanged.

To summarize, the type matching rules fall into the three categories below.

Communication of typed values (e.g., with datatype different from MPI_BYTE), where the datatypes of the corresponding entries in the sender program, in the send call, in the receive call and in the receiver program must all match.

Communication of untyped values (e.g., of datatype MPI_BYTE), where both sender and receiver use the datatype MPI_BYTE. In this case, there are no requirements on the types of the corresponding entries in the sender and the receiver programs, nor is it required that they be the same.

Communication involving packed data, where MPI_PACKED is used.

The following examples illustrate the first two cases.

Example. Sender and receiver specify matching types.

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF(rank.EQ.0) THEN
        CALL MPI_SEND(a(1), 10, MPI_REAL, 1, tag, comm, ierr)
    ELSE
        CALL MPI_RECV(b(1), 15, MPI_REAL, 0, tag, comm, status, ierr)
    END IF

This code is correct if both a and b are real arrays of size at least 10. (In Fortran, it might be correct to use this code even if a or b have size smaller than 10: e.g., when a(1) can be equivalenced to an array with ten reals.)

Example. Sender and receiver do not specify matching types.

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF(rank.EQ.0) THEN
        CALL MPI_SEND(a(1), 10, MPI_REAL, 1, tag, comm, ierr)
    ELSE
        CALL MPI_RECV(b(1), 40, MPI_BYTE, 0, tag, comm, status, ierr)
    END IF

This code is erroneous, since sender and receiver do not provide matching datatype arguments.

Example. Sender and receiver specify communication of untyped values.

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF(rank.EQ.0) THEN
        CALL MPI_SEND(a(1), 40, MPI_BYTE, 1, tag, comm, ierr)
    ELSE
        CALL MPI_RECV(b(1), 60, MPI_BYTE, 0, tag, comm, status, ierr)
    END IF

This code is correct, irrespective of the type and size of a and b (unless this results in an out of bound memory access).

Advice to users. If a buffer of type MPI_BYTE is passed as an argument to MPI_SEND, then MPI will send the data stored at contiguous locations, starting from the address indicated by the buf argument. This may have unexpected results when the data layout is not as a casual user would expect it to be. For example, some Fortran compilers implement variables of type CHARACTER as a structure that contains the character length and a pointer to the actual string. In such an environment, sending and receiving a Fortran CHARACTER variable using the MPI_BYTE type will not have the anticipated result of transferring the character string. For this reason, the user is advised to use typed communications whenever possible. (End of advice to users.)
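The same distinction can be illustrated from C. In the non-normative sketch below (the array length, tags and ranks are arbitrary), the first send/receive pair communicates typed values with MPI_DOUBLE, while the second pair transfers the same storage as untyped MPI_BYTE data; the byte form is only meaningful between identical environments, as discussed in the data conversion section below.

    #include <mpi.h>

    void typed_and_untyped(MPI_Comm comm)
    {
        double a[10], b[10];
        int rank;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        if (rank == 0) {
            /* typed transfer: datatypes match at sender and receiver */
            MPI_Send(a, 10, MPI_DOUBLE, 1, 0, comm);
            /* untyped transfer: both sides use MPI_BYTE and a byte count */
            MPI_Send(a, (int)(10 * sizeof(double)), MPI_BYTE, 1, 1, comm);
        } else if (rank == 1) {
            MPI_Recv(b, 10, MPI_DOUBLE, 0, 0, comm, &status);
            MPI_Recv(b, (int)(10 * sizeof(double)), MPI_BYTE, 0, 1, comm, &status);
        }
    }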

Type MPI_CHARACTER

The type MPI_CHARACTER matches one character of a Fortran variable of type CHARACTER, rather than the entire character string stored in the variable. Fortran variables of type CHARACTER, or substrings, are transferred as if they were arrays of characters. This is illustrated in the example below.

Example. Transfer of Fortran CHARACTERs.

    CHARACTER*10 a
    CHARACTER*10 b

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF(rank.EQ.0) THEN
        CALL MPI_SEND(a, 5, MPI_CHARACTER, 1, tag, comm, ierr)
    ELSE
        CALL MPI_RECV(b(6:10), 5, MPI_CHARACTER, 0, tag, comm, status, ierr)
    END IF

The last five characters of string b at process 1 are replaced by the first five characters of string a at process 0.

Rationale. The alternative choice would be for MPI_CHARACTER to match a character of arbitrary length. This runs into problems.

A Fortran character variable is a constant length string, with no special termination symbol. There is no fixed convention on how to represent characters, and how to store their length. Some compilers pass a character argument to a routine as a pair of arguments, one holding the address of the string and the other holding the length of the string. Consider the case of an MPI communication call that is passed a communication buffer with type defined by a derived datatype (see the section on derived datatypes). If this communication buffer contains variables of type CHARACTER, then the information on their length will not be passed to the MPI routine.

This problem forces us to provide explicit information on character length with the MPI call. One could add a length parameter to the type MPI_CHARACTER, but this does not add much convenience, and the same functionality can be achieved by defining a suitable derived datatype. (End of rationale.)

Advice to implementors. Some compilers pass Fortran CHARACTER arguments as a structure with a length and a pointer to the actual string. In such an environment, the MPI call needs to dereference the pointer in order to reach the string. (End of advice to implementors.)

Data conversion

One of the goals of MPI is to support parallel computations across heterogeneous environments. Communication in a heterogeneous environment may require data conversions. We use the following terminology.

type conversion changes the datatype of a value, e.g., by rounding a REAL to an INTEGER.

representation conversion changes the binary representation of a value, e.g., from Hex floating point to IEEE floating point.

The type matching rules imply that MPI communication never entails type conversion. On the other hand, MPI requires that a representation conversion be performed when a typed value is transferred across environments that use different representations for the datatype of this value. MPI does not specify rules for representation conversion. Such conversion is expected to preserve integer, logical or character values, and to convert a floating point value to the nearest value that can be represented on the target system.

Overflow and underflow exceptions may occur during floating point conversions. Conversion of integers or characters may also lead to exceptions when a value that can be represented in one system cannot be represented in the other system. An exception occurring during representation conversion results in a failure of the communication. An error occurs either in the send operation, or the receive operation, or both.

If a value sent in a message is untyped (i.e., of type MPI_BYTE), then the binary representation of the byte stored at the receiver is identical to the binary representation of the byte loaded at the sender. This holds true whether sender and receiver run in the same or in distinct environments. No representation conversion is required. (Note that representation conversion may occur when values of type MPI_CHARACTER or MPI_CHAR are transferred, for example, from an EBCDIC encoding to an ASCII encoding.)

No conversion need occur when an MPI program executes in a homogeneous system, where all processes run in the same environment.

Consider the three examples above. The first program is correct, assuming that a and b are REAL arrays of size at least 10. If the sender and receiver execute in different environments, then the ten real values that are fetched from the send buffer will be converted to the representation for reals on the receiver site before they are stored in the receive buffer. While the number of real elements fetched from the send buffer equals the number of real elements stored in the receive buffer, the number of bytes stored need not equal the number of bytes loaded. For example, the sender may use a four byte representation and the receiver an eight byte representation for reals.

The second program is erroneous, and its behavior is undefined.

The third program is correct. The exact same sequence of forty bytes that were loaded from the send buffer will be stored in the receive buffer, even if sender and receiver run in a different environment. The message sent has exactly the same length (in bytes) and the same binary representation as the message received. If a and b are of different types, or if they are of the same type but different data representations are used, then the bits stored in the receive buffer may encode values that are different from the values they encoded in the send buffer.

Data representation conversion also applies to the envelope of a message: source, destination and tag are all integers that may need to be converted.

Advice to implementors. The current definition does not require messages to carry data type information. Both sender and receiver provide complete data type information. In a heterogeneous environment, one can either use a machine independent encoding such as XDR, or have the receiver convert from the sender representation to its own, or even have the sender do the conversion.

Additional type information might be added to messages in order to allow the system to detect mismatches between datatype at sender and receiver. This might be particularly useful in a slower but safer debug mode. (End of advice to implementors.)

MPI does not require support for inter-language communication. The behavior of a program is undefined if messages are sent by a C process and received by a Fortran process, or vice-versa.

Rationale. MPI does not handle inter-language communication because there are no agreed standards for the correspondence between C types and Fortran types. Therefore, MPI programs that mix languages would not port. (End of rationale.)

Advice to implementors. MPI implementors may want to support inter-language communication by allowing Fortran programs to use "C MPI types," such as MPI_INT, MPI_CHAR, etc., and allowing C programs to use Fortran types. (End of advice to implementors.)

Communication Modes

The send call described earlier in this chapter is blocking: it does not return until the message data and envelope have been safely stored away so that the sender is free to access and overwrite the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.

Message buffering decouples the send and receive operations. A blocking send can complete as soon as the message was buffered, even if no matching receive has been executed by the receiver. On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering. MPI offers the choice of several communication modes that allow one to control the choice of the communication protocol.

The send call described earlier used the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver.

Thus, a send in standard mode can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive.

Rationale. The reluctance of MPI to mandate whether standard sends are buffering or not stems from the desire to achieve portable programs. Since any system will run out of buffer resources as message sizes are increased, and some implementations may want to provide little buffering, MPI takes the position that correct (and therefore, portable) programs do not rely on system buffering in standard mode. Buffering may improve the performance of a correct program, but it doesn't affect the result of the program. If the user wishes to guarantee a certain amount of buffering, the user-provided buffer system described in the section on buffer allocation and usage should be used, along with the buffered-mode send. (End of rationale.)

There are three additional communication modes.

A buffered mode send operation can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. However, unlike the standard send, this operation is local, and its completion does not depend on the occurrence of a matching receive. Thus, if a send is executed and no matching receive is posted, then MPI must buffer the outgoing message, so as to allow the send call to complete. An error will occur if there is insufficient buffer space. The amount of available buffer space is controlled by the user (see the section on buffer allocation and usage below). Buffer allocation by the user may be required for the buffered mode to be effective.

A send that uses the synchronous mode can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message sent by the synchronous send. Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive. If both sends and receives are blocking operations, then the use of the synchronous mode provides synchronous communication semantics: a communication does not complete at either end before both processes rendezvous at the communication. A send executed in this mode is non-local.

A send that uses the ready communication mode may be started only if the matching receive is already posted. Otherwise, the operation is erroneous and its outcome is undefined. On some systems, this allows the removal of a hand-shake operation that is otherwise required, and results in improved performance. The completion of the send operation does not depend on the status of a matching receive, and merely indicates that the send buffer can be reused. A send operation that uses the ready mode has the same semantics as a standard send operation, or a synchronous send operation; it is merely that the sender provides additional information to the system (namely that a matching receive is already posted), that can save some overhead. In a correct program, therefore, a ready send could be replaced by a standard send with no effect on the behavior of the program other than performance.

Three additional send functions are provided for the three additional communication modes. The communication mode is indicated by a one letter prefix: B for buffered, S for synchronous, and R for ready.

MPI_BSEND (buf, count, datatype, dest, tag, comm)
  IN  buf       initial address of send buffer (choice)
  IN  count     number of elements in send buffer (integer)
  IN  datatype  datatype of each send buffer element (handle)
  IN  dest      rank of destination (integer)
  IN  tag       message tag (integer)
  IN  comm      communicator (handle)

int MPI_Bsend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)

MPI_BSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

Send in buffered mode.

MPI_SSEND (buf, count, datatype, dest, tag, comm)
  IN  buf       initial address of send buffer (choice)
  IN  count     number of elements in send buffer (integer)
  IN  datatype  datatype of each send buffer element (handle)
  IN  dest      rank of destination (integer)
  IN  tag       message tag (integer)
  IN  comm      communicator (handle)

int MPI_Ssend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)

MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

Send in synchronous mode.

MPI_RSEND (buf, count, datatype, dest, tag, comm)
  IN  buf       initial address of send buffer (choice)
  IN  count     number of elements in send buffer (integer)
  IN  datatype  datatype of each send buffer element (handle)
  IN  dest      rank of destination (integer)
  IN  tag       message tag (integer)
  IN  comm      communicator (handle)

int MPI_Rsend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)

MPI_RSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

Send in ready mode.

There is only one receive operation, which can match any of the send modes. The receive operation described in the last section is blocking: it returns only after the receive buffer contains the newly received message. A receive can complete before the matching send has completed (of course, it can complete only after the matching send has started).

In a multi-threaded implementation of MPI, the system may de-schedule a thread that is blocked on a send or receive operation, and schedule another thread for execution in the same address space. In such a case it is the user's responsibility not to access or modify a communication buffer until the communication completes. Otherwise, the outcome of the computation is undefined.

Rationale. We prohibit read accesses to a send buffer while it is being used, even though the send operation is not supposed to alter the content of this buffer. This may seem more stringent than necessary, but the additional restriction causes little loss of functionality and allows better performance on some systems (consider the case where data transfer is done by a DMA engine that is not cache-coherent with the main processor). (End of rationale.)

Advice to implementors. Since a synchronous send cannot complete before a matching receive is posted, one will not normally buffer messages sent by such an operation.

It is recommended to choose buffering over blocking the sender, whenever possible, for standard sends. The programmer can signal his or her preference for blocking the sender until a matching receive occurs by using the synchronous send mode.

A possible communication protocol for the various communication modes is outlined below.

ready send: The message is sent as soon as possible.

synchronous send: The sender sends a request-to-send message. The receiver stores this request. When a matching receive is posted, the receiver sends back a permission-to-send message, and the sender now sends the message.

standard send: The first protocol may be used for short messages, and the second protocol for long messages.

buffered send: The sender copies the message into a buffer and then sends it with a nonblocking send (using the same protocol as for standard send).

Additional control messages might be needed for flow control and error recovery. Of course, there are many other possible protocols.

Ready send can be implemented as a standard send. In this case there will be no performance advantage (or disadvantage) for the use of ready send.

A standard send can be implemented as a synchronous send. In such a case, no data buffering is needed. However, many (most?) users expect some buffering.

In a multi-threaded environment, the execution of a blocking communication should block only the executing thread, allowing the thread scheduler to de-schedule this thread and schedule another thread for execution. (End of advice to implementors.)
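A common way to satisfy the precondition of the ready mode is sketched below in C (non-normative; the counts, tags and ranks are arbitrary assumptions): the receiver posts a nonblocking receive, described later in this chapter, and then notifies the sender with a short message; only after that notification arrives does the sender issue MPI_RSEND.

    #include <mpi.h>

    void ready_mode_pattern(MPI_Comm comm)
    {
        float data[1000];          /* assumed already computed at the sender */
        int rank, dummy = 0;
        MPI_Request req;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        if (rank == 1) {
            /* post the receive first, then notify the sender */
            MPI_Irecv(data, 1000, MPI_FLOAT, 0, 7, comm, &req);
            MPI_Send(&dummy, 1, MPI_INT, 0, 8, comm);   /* "receive is posted" */
            MPI_Wait(&req, &status);
        } else if (rank == 0) {
            /* wait for the notification; only then is the ready send legal */
            MPI_Recv(&dummy, 1, MPI_INT, 1, 8, comm, &status);
            MPI_Rsend(data, 1000, MPI_FLOAT, 1, 7, comm);
        }
    }

Without the notification message the program would be erroneous, since the ready send could start before the matching receive is posted.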

Semantics of point-to-point communication

A valid MPI implementation guarantees certain general properties of point-to-point communication, which are described in this section.

Order. Messages are non-overtaking. If a sender sends two messages in succession to the same destination, and both match the same receive, then this operation cannot receive the second message if the first one is still pending. If a receiver posts two receives in succession, and both match the same message, then the second receive operation cannot be satisfied by this message, if the first one is still pending. This requirement facilitates matching of sends to receives. It guarantees that message-passing code is deterministic, if processes are single-threaded and the wildcard MPI_ANY_SOURCE is not used in receives. (Some of the calls described later, such as MPI_CANCEL or MPI_WAITANY, are additional sources of nondeterminism.)

If a process has a single thread of execution, then any two communications executed by this process are ordered. On the other hand, if the process is multi-threaded, then the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads. The operations are logically concurrent, even if one physically precedes the other. In such a case, the two messages sent can be received in any order. Similarly, if two receive operations that are logically concurrent receive two successively sent messages, then the two messages can match the two receives in either order.

Example. An example of non-overtaking messages.

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank.EQ.0) THEN
        CALL MPI_BSEND(buf1, count, MPI_REAL, 1, tag, comm, ierr)
        CALL MPI_BSEND(buf2, count, MPI_REAL, 1, tag, comm, ierr)
    ELSE    ! rank.EQ.1
        CALL MPI_RECV(buf1, count, MPI_REAL, 0, MPI_ANY_TAG, comm, status, ierr)
        CALL MPI_RECV(buf2, count, MPI_REAL, 0, tag, comm, status, ierr)
    END IF

The message sent by the first send must be received by the first receive, and the message sent by the second send must be received by the second receive.

Progress. If a pair of matching send and receives have been initiated on two processes, then at least one of these two operations will complete, independently of other actions in the system: the send operation will complete, unless the receive is satisfied by another message, and completes; the receive operation will complete, unless the message sent is consumed by another matching receive that was posted at the same destination process.

Example. An example of two, intertwined matching pairs.

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank.EQ.0) THEN
        CALL MPI_BSEND(buf1, count, MPI_REAL, 1, tag1, comm, ierr)
        CALL MPI_SSEND(buf2, count, MPI_REAL, 1, tag2, comm, ierr)
    ELSE    ! rank.EQ.1
        CALL MPI_RECV(buf1, count, MPI_REAL, 0, tag2, comm, status, ierr)
        CALL MPI_RECV(buf2, count, MPI_REAL, 0, tag1, comm, status, ierr)
    END IF

Both processes invoke their first communication call. Since the first send of process zero uses the buffered mode, it must complete, irrespective of the state of process one. Since no matching receive is posted, the message will be copied into buffer space. (If insufficient buffer space is available, then the program will fail.) The second send is then invoked. At that point, a matching pair of send and receive operations is enabled, and both operations must complete. Process one next invokes its second receive call, which will be satisfied by the buffered message. Note that process one received the messages in the reverse order they were sent.

Fairness. MPI makes no guarantee of fairness in the handling of communication. Suppose that a send is posted. Then it is possible that the destination process repeatedly posts a receive that matches this send, yet the message is never received, because it is each time overtaken by another message, sent from another source. Similarly, suppose that a receive was posted by a multi-threaded process. Then it is possible that messages that match this receive are repeatedly received, yet the receive is never satisfied, because it is overtaken by other receives posted at this node (by other executing threads). It is the programmer's responsibility to prevent starvation in such situations.

Resource limitations. Any pending communication operation consumes system resources that are limited. Errors may occur when lack of resources prevents the execution of an MPI call. A quality implementation will use a (small) fixed amount of resources for each pending send in the ready or synchronous mode and for each pending receive. However, buffer space may be consumed to store messages sent in standard mode, and must be consumed to store messages sent in buffered mode, when no matching receive is available. The amount of space available for buffering will be much smaller than program data memory on many systems. Then, it will be easy to write programs that overrun available buffer space.

MPI allows the user to provide buffer memory for messages sent in the buffered mode. Furthermore, MPI specifies a detailed operational model for the use of this buffer. An MPI implementation is required to do no worse than implied by this model. This allows users to avoid buffer overflows when they use buffered sends. Buffer allocation and use is described in the next section.

A buffered send operation that cannot complete because of a lack of buffer space is erroneous. When such a situation is detected, an error is signalled that may cause the program to terminate abnormally. On the other hand, a standard send operation that cannot complete because of lack of buffer space will merely block, waiting for buffer space to become available or for a matching receive to be posted. This behavior is preferable in many situations. Consider a situation where a producer repeatedly produces new values and sends them to a consumer. Assume that the producer produces new values faster than the consumer can consume them. If buffered sends are used, then a buffer overflow will result. Additional synchronization has to be added to the program so as to prevent this from occurring. If standard sends are used, then the producer will be automatically throttled, as its send operations will block when buffer space is unavailable.

In some situations, a lack of buffer space leads to deadlock situations. This is illustrated by the examples below.

Example. An exchange of messages.

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank.EQ.0) THEN
        CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
        CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
    ELSE    ! rank.EQ.1
        CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
        CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
    END IF

This program will succeed even if no buffer space for data is available. The standard send operation can be replaced, in this example, with a synchronous send.

Example. An attempt to exchange messages.

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank.EQ.0) THEN
        CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
        CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
    ELSE    ! rank.EQ.1
        CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
        CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
    END IF

The receive operation of the first process must complete before its send, and can complete only if the matching send of the second process is executed. The receive operation of the second process must complete before its send, and can complete only if the matching send of the first process is executed. This program will always deadlock. (The same holds for any other send mode.)

Example. An exchange that relies on buffering.

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank.EQ.0) THEN
        CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
        CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
    ELSE    ! rank.EQ.1
        CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
        CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    END IF

The message sent by each process has to be copied out before the send operation returns and the receive operation starts. For the program to complete, it is necessary that at least one of the two messages sent be buffered. Thus, this program can succeed only if the communication system can buffer at least count words of data.

Advice to users. When standard send operations are used, then a deadlock situation may occur where both processes are blocked because buffer space is not available. The same will certainly happen if the synchronous mode is used. If the buffered mode is used, and not enough buffer space is available, then the program will not complete either. However, rather than a deadlock situation, we shall have a buffer overflow error.

A program is "safe" if no message buffering is required for the program to complete. One can replace all sends in such a program with synchronous sends and the program will still run correctly. This conservative programming style provides the best portability, since program completion does not depend on the amount of buffer space available or on the communication protocol used.

Many programmers prefer to have more leeway and be able to use the "unsafe" programming style shown in the last example above. In such cases, the use of standard sends is likely to provide the best compromise between performance and robustness: quality implementations will provide sufficient buffering so that "common practice" programs will not deadlock. The buffered send mode can be used for programs that require more buffering, or in situations where the programmer wants more control. This mode might also be used for debugging purposes, as buffer overflow conditions are easier to diagnose than deadlock conditions.

Nonblocking message-passing operations, as described later in this chapter, can be used to avoid the need for buffering outgoing messages. This prevents deadlocks due to lack of buffer space, and improves performance, by allowing overlap of computation and communication, and avoiding the overheads of allocating buffers and copying messages into buffers. A sketch of such an exchange appears after this advice. (End of advice to users.)
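The nonblocking alternative mentioned in the advice above can be sketched as follows in C (non-normative; the buffer sizes, tag and the two-process assumption are arbitrary). Each process posts its receive first and completes it after the send, so the exchange succeeds without relying on system buffering.

    #include <mpi.h>

    void safe_exchange(MPI_Comm comm)
    {
        float sendbuf[100], recvbuf[100];
        int rank, peer;
        MPI_Request req;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        peer = (rank == 0) ? 1 : 0;   /* assume exactly two processes */

        /* posting the receive first means no message buffering is needed */
        MPI_Irecv(recvbuf, 100, MPI_FLOAT, peer, 0, comm, &req);
        MPI_Send(sendbuf, 100, MPI_FLOAT, peer, 0, comm);
        MPI_Wait(&req, &status);
    }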

Buffer allocation and usage

A user may specify a buffer to be used for buffering messages sent in buffered mode. Buffering is done by the sender.

MPI_BUFFER_ATTACH (buffer, size)
  IN  buffer  initial buffer address (choice)
  IN  size    buffer size, in bytes (integer)

int MPI_Buffer_attach(void* buffer, int size)

MPI_BUFFER_ATTACH(BUFFER, SIZE, IERROR)
    <type> BUFFER(*)
    INTEGER SIZE, IERROR

Provides to MPI a buffer in the user's memory to be used for buffering outgoing messages. The buffer is used only by messages sent in buffered mode. Only one buffer can be attached to a process at a time.

MPI_BUFFER_DETACH (buffer, size)
  OUT  buffer  initial buffer address (choice)
  OUT  size    buffer size, in bytes (integer)

int MPI_Buffer_detach(void* buffer, int* size)

MPI_BUFFER_DETACH(BUFFER, SIZE, IERROR)
    <type> BUFFER(*)
    INTEGER SIZE, IERROR

Detach the buffer currently associated with MPI. This operation will block until all messages currently in the buffer have been transmitted. Upon return of this function, the user may reuse or deallocate the space taken by the buffer.

The statements made in this section describe the behavior of MPI for buffered-mode sends. When no buffer is currently associated, MPI behaves as if a zero-sized buffer is associated with the process.

MPI must provide as much buffering for outgoing messages as if outgoing message data were buffered by the sending process, in the specified buffer space, using a circular, contiguous-space allocation policy. We outline below a model implementation that defines this policy. MPI may provide more buffering, and may use a better buffer allocation algorithm than described below. On the other hand, MPI may signal an error whenever the simple buffering allocator described below would run out of space. In particular, if no buffer is explicitly associated with the process, then any buffered send may cause an error.

MPI does not provide mechanisms for querying or controlling buffering done by standard mode sends. It is expected that vendors will provide such information for their implementations.

Rationale. There is a wide spectrum of possible implementations of buffered communication: buffering can be done at sender, at receiver, or both; buffers can be dedicated to one sender-receiver pair, or be shared by all communications; buffering can be done in real or in virtual memory; it can use dedicated memory, or memory shared by other processes; buffer space may be allocated statically or be changed dynamically; etc. It does not seem feasible to provide a portable mechanism for querying or controlling buffering that would be compatible with all these choices, yet provide meaningful information. (End of rationale.)
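A minimal, non-normative C sketch of buffered-mode usage follows. It sizes the attached buffer with MPI_Pack_size plus an arbitrary slack for the per-message bookkeeping described in the model implementation below; an actual program must make sure the slack is large enough for its implementation.

    #include <stdlib.h>
    #include <mpi.h>

    void buffered_send_example(double *msg, int n, int dest, MPI_Comm comm)
    {
        void *buf, *oldbuf;
        int packed, size, oldsize;

        /* size the buffer: packed length of one message plus some slack
           for the per-entry bookkeeping kept by the buffering layer */
        MPI_Pack_size(n, MPI_DOUBLE, comm, &packed);
        size = packed + 1024;                  /* 1024 is an arbitrary slack */
        buf = malloc(size);

        MPI_Buffer_attach(buf, size);
        MPI_Bsend(msg, n, MPI_DOUBLE, dest, 0, comm);

        /* detach blocks until the buffered message has been transmitted,
           after which the memory may be freed */
        MPI_Buffer_detach(&oldbuf, &oldsize);
        free(oldbuf);
    }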

Model implementation of buffered mode

The model implementation uses the packing and unpacking functions described later in this chapter and the nonblocking communication functions described in the next section.

We assume that a circular queue of pending message entries (PME) is maintained. Each entry contains a communication request handle that identifies a pending nonblocking send, a pointer to the next entry, and the packed message data. The entries are stored in successive locations in the buffer. Free space is available between the queue tail and the queue head.

A buffered send call results in the execution of the following steps (a fragmentary sketch of the central steps is given after the list).

Traverse sequentially the PME queue from head towards the tail, deleting all entries for communications that have completed, up to the first entry with an uncompleted request; update the queue head to point to that entry.

Compute the number, n, of bytes needed to store an entry for the new message (length of packed message, computed with MPI_PACK_SIZE, plus space for the request handle and pointer).

Find the next contiguous empty space of n bytes in the buffer (space following the queue tail, or space at the start of the buffer if the queue tail is too close to the end of the buffer). If space is not found then raise a buffer overflow error.

Append to the end of the PME queue, in the contiguous space, the new entry that contains the request handle, the next pointer and the packed message data; MPI_PACK is used to pack the data.

Post a nonblocking send (standard mode) for the packed data.

Return.
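The central steps of the list above can be sketched in C as follows. This is a fragmentary, non-normative illustration: the queue traversal and the circular free-space search are omitted, and the entry layout (request handle, next pointer, then packed data) is merely one possible choice.

    #include <mpi.h>

    /* one pending-message entry, laid out at some offset inside the
       attached buffer; this layout is an illustrative assumption */
    typedef struct entry {
        MPI_Request  req;     /* handle of the pending nonblocking send */
        struct entry *next;   /* next entry in the circular PME queue   */
        /* packed message data follows the header in the same space     */
    } entry_t;

    /* append one buffered-mode message; 'space' points into the attached
       buffer where the free-space search (not shown) found room */
    static int append_entry(entry_t *space, void *buf, int count,
                            MPI_Datatype type, int dest, int tag,
                            MPI_Comm comm)
    {
        char *data = (char *)(space + 1);
        int  packed_size, position = 0;

        MPI_Pack_size(count, type, comm, &packed_size);
        MPI_Pack(buf, count, type, data, packed_size, &position, comm);

        /* post a standard-mode nonblocking send for the packed data */
        MPI_Isend(data, position, MPI_PACKED, dest, tag, comm, &space->req);
        space->next = NULL;   /* linked onto the queue tail by the caller */
        return position;
    }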

Nonblocking communication

One can improve performance on many systems by overlapping communication and computation. This is especially true on systems where communication can be executed autonomously by an intelligent communication controller. Light-weight threads are one mechanism for achieving such overlap. An alternative mechanism that often leads to better performance is to use nonblocking communication. A nonblocking send start call initiates the send operation, but does not complete it. The send start call will return before the message was copied out of the send buffer. A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed. Similarly, a nonblocking receive start call initiates the receive operation, but does not complete it. The call will return before a message is stored into the receive buffer. A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed. The use of nonblocking receives may also avoid system buffering and memory-to-memory copying, as information is provided early on the location of the receive buffer.

Nonblocking send start calls can use the same four modes as blocking sends: standard, buffered, synchronous and ready. These carry the same meaning. Sends of all modes, ready excepted, can be started whether a matching receive has been posted or not; a nonblocking ready send can be started only if a matching receive is posted. In all cases, the send start call is local: it returns immediately, irrespective of the status of other processes. If the call causes some system resource to be exhausted, then it will fail and return an error code. Quality implementations of MPI should ensure that this happens only in "pathological" cases. That is, an MPI implementation should be able to support a large number of pending nonblocking operations.

The send-complete call returns when data has been copied out of the send buffer. It may carry additional meaning, depending on the send mode.

If the send mode is synchronous, then the send can complete only if a matching receive has started. That is, a receive has been posted, and has been matched with the send. In this case, the send-complete call is non-local. Note that a synchronous, nonblocking send may complete, if matched by a nonblocking receive, before the receive complete call occurs. (It can complete as soon as the sender "knows" the transfer will complete, but before the receiver "knows" the transfer will complete.)

If the send mode is buffered, then the message must be buffered if there is no pending receive. In this case, the send-complete call is local, and must succeed irrespective of the status of a matching receive.

If the send mode is standard, then the send-complete call may return before a matching receive occurred, if the message is buffered. On the other hand, the send-complete may not complete until a matching receive occurred, and the message was copied into the receive buffer.

Nonblocking sends can be matched with blocking receives, and vice-versa.

Advice to users. The completion of a send operation may be delayed, for standard mode, and must be delayed, for synchronous mode, until a matching receive is posted. The use of nonblocking sends in these two cases allows the sender to proceed ahead of the receiver, so that the computation is more tolerant of fluctuations in the speeds of the two processes.

Nonblocking sends in the buffered and ready modes have a more limited impact. A nonblocking send will return as soon as possible, whereas a blocking send will return after the data has been copied out of the sender memory. The use of nonblocking sends is advantageous in these cases only if data copying can be concurrent with computation.

The message-passing model implies that communication is initiated by the sender. The communication will generally have lower overhead if a receive is already posted when the sender initiates the communication (data can be moved directly to the receive buffer, and there is no need to queue a pending send request). However, a receive operation can complete only after the matching send has occurred. The use of nonblocking receives allows one to achieve lower communication overheads without blocking the receiver while it waits for the send. (End of advice to users.)

Communication Objects

Nonblocking communications use opaque request objects to identify communication operations and match the operation that initiates the communication with the operation that terminates it. These are system objects that are accessed via a handle. A request object identifies various properties of a communication operation, such as the send mode, the communication buffer that is associated with it, its context, the tag and destination arguments to be used for a send, or the tag and source arguments to be used for a receive. In addition, this object stores information about the status of the pending communication operation.

Communication initiation

We use the same naming conventions as for blocking communication: a prefix of B, S, or R is used for buffered, synchronous or ready mode. In addition, a prefix of I (for immediate) indicates that the call is nonblocking.

MPI_ISEND (buf, count, datatype, dest, tag, comm, request)
  IN   buf       initial address of send buffer (choice)
  IN   count     number of elements in send buffer (integer)
  IN   datatype  datatype of each send buffer element (handle)
  IN   dest      rank of destination (integer)
  IN   tag       message tag (integer)
  IN   comm      communicator (handle)
  OUT  request   communication request (handle)

int MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Start a standard mode, nonblocking send.

MPI_IBSEND (buf, count, datatype, dest, tag, comm, request)
  IN   buf       initial address of send buffer (choice)
  IN   count     number of elements in send buffer (integer)
  IN   datatype  datatype of each send buffer element (handle)
  IN   dest      rank of destination (integer)
  IN   tag       message tag (integer)
  IN   comm      communicator (handle)
  OUT  request   communication request (handle)

int MPI_Ibsend(void* buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)

MPI_IBSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Start a buffered mode, nonblocking send.

MPI_ISSEND (buf, count, datatype, dest, tag, comm, request)
  IN   buf       initial address of send buffer (choice)
  IN   count     number of elements in send buffer (integer)
  IN   datatype  datatype of each send buffer element (handle)
  IN   dest      rank of destination (integer)
  IN   tag       message tag (integer)
  IN   comm      communicator (handle)
  OUT  request   communication request (handle)

int MPI_Issend(void* buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)

MPI_ISSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Start a synchronous mode, nonblocking send.

MPI_IRSEND (buf, count, datatype, dest, tag, comm, request)
  IN   buf       initial address of send buffer (choice)
  IN   count     number of elements in send buffer (integer)
  IN   datatype  datatype of each send buffer element (handle)
  IN   dest      rank of destination (integer)
  IN   tag       message tag (integer)
  IN   comm      communicator (handle)
  OUT  request   communication request (handle)

int MPI_Irsend(void* buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)

MPI_IRSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Start a ready mode, nonblocking send.

MPI_IRECV (buf, count, datatype, source, tag, comm, request)
  OUT  buf       initial address of receive buffer (choice)
  IN   count     number of elements in receive buffer (integer)
  IN   datatype  datatype of each receive buffer element (handle)
  IN   source    rank of source (integer)
  IN   tag       message tag (integer)
  IN   comm      communicator (handle)
  OUT  request   communication request (handle)

int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)

MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

Start a nonblocking receive.

These calls allocate a communication request object and associate it with the request handle (the argument request). The request can be used later to query the status of the communication or wait for its completion.

A nonblocking send call indicates that the system may start copying data out of the send buffer. The sender should not access any part of the send buffer after a nonblocking send operation is called, until the send completes.

A nonblocking receive call indicates that the system may start writing data into the receive buffer. The receiver should not access any part of the receive buffer after a nonblocking receive operation is called, until the receive completes.

NONBLOCKING COMMUNICATION

Communication Completion

The functions MPI_WAIT and MPI_TEST are used to complete a nonblocking communication. The completion of a send operation indicates that the sender is now free to update the locations in the send buffer (the send operation itself leaves the content of the send buffer unchanged). It does not indicate that the message has been received; rather, it may have been buffered by the communication subsystem. However, if a synchronous mode send was used, the completion of the send operation indicates that a matching receive was initiated, and that the message will eventually be received by this matching receive.

The completion of a receive operation indicates that the receive buffer contains the received message, the receiver is now free to access it, and that the status object is set. It does not indicate that the matching send operation has completed (but indicates, of course, that the send was initiated).

We shall use the following terminology. A null handle is a handle with value MPI_REQUEST_NULL. A persistent request and the handle to it are inactive if the request is not associated with any ongoing communication (see the section on persistent communication requests). A handle is active if it is neither null nor inactive.

MPI_WAIT(request, status)

  INOUT  request   request (handle)
  OUT    status    status object (Status)

int MPI_Wait(MPI_Request *request, MPI_Status *status)

MPI_WAIT(REQUEST, STATUS, IERROR)
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

A call to MPI_WAIT returns when the operation identified by request is complete. If the communication object associated with this request was created by a nonblocking send or receive call, then the object is deallocated by the call to MPI_WAIT and the request handle is set to MPI_REQUEST_NULL. MPI_WAIT is a non-local operation.

The call returns, in status, information on the completed operation. The content of the status object for a receive operation can be accessed as described in the section on return status. The status object for a send operation may be queried by a call to MPI_TEST_CANCELLED (see the section on probe and cancel).

One is allowed to call MPI_WAIT with a null or inactive request argument. In this case the operation returns immediately. The status argument is set to return tag = MPI_ANY_TAG, source = MPI_ANY_SOURCE, and is also internally configured so that calls to MPI_GET_COUNT and MPI_GET_ELEMENTS return count = 0.

Rationale. This makes MPI_WAIT functionally equivalent to MPI_WAITALL with a list of length one, and adds some elegance. Status is set in this way so as to prevent errors due to accesses of stale information.

Successful return of MPI_WAIT after an MPI_IBSEND implies that the user send buffer can be reused; that is, data has been sent out or copied into a buffer attached with MPI_BUFFER_ATTACH. Note that, at this point, we can no longer cancel the send (see the section on probe and cancel). If a matching receive is never posted, then the buffer cannot be freed. This runs somewhat counter to the stated goal of MPI_CANCEL (always being able to free program space that was committed to the communication subsystem). (End of rationale.)

Advice to implementors. In a multi-threaded environment, a call to MPI_WAIT should block only the calling thread, allowing the thread scheduler to schedule another thread for execution. (End of advice to implementors.)

MPI_TEST(request, flag, status)

  INOUT  request   communication request (handle)
  OUT    flag      true if operation completed (logical)
  OUT    status    status object (Status)

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

MPI_TEST(REQUEST, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

A call to MPI_TEST returns flag = true if the operation identified by request is complete. In such a case, the status object is set to contain information on the completed operation; if the communication object was created by a nonblocking send or receive, then it is deallocated and the request handle is set to MPI_REQUEST_NULL. The call returns flag = false otherwise. In this case the value of the status object is undefined. MPI_TEST is a local operation.

The return status object for a receive operation carries information that can be accessed as described in the section on return status. The status object for a send operation carries information that can be accessed by a call to MPI_TEST_CANCELLED (see the section on probe and cancel).

One is allowed to call MPI_TEST with a null or inactive request argument. In such a case the operation returns flag = false.

The functions MPI_WAIT and MPI_TEST can be used to complete both sends and receives.

Advice to users. The use of the nonblocking MPI_TEST call allows the user to schedule alternative activities within a single thread of execution. An event-driven thread scheduler can be emulated with periodic calls to MPI_TEST. (End of advice to users.)
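A minimal C sketch of such a polling loop, assuming an application work routine do_some_work and illustrative message parameters; the sketch is not part of the standard text.

/* Sketch: emulate event-driven scheduling with periodic MPI_Test polls.
   do_some_work() and the message parameters are illustrative assumptions. */
#include <mpi.h>

extern void do_some_work(void);

void poll_and_work(MPI_Comm comm, double *buf, int n, int src, int tag)
{
    MPI_Request req;
    MPI_Status  status;
    int done = 0;

    MPI_Irecv(buf, n, MPI_DOUBLE, src, tag, comm, &req);
    while (!done) {
        do_some_work();                  /* progress the application */
        MPI_Test(&req, &done, &status);  /* local call; never blocks */
    }
    /* buf now holds the received message; status describes it */
}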

Example. Simple usage of nonblocking operations and MPI_WAIT.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(request, status, ierr)
ELSE
    CALL MPI_IRECV(a(1), 15, MPI_REAL, 0, tag, comm, request, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(request, status, ierr)
END IF

A request object can be deallocated without waiting for the associated communication to complete, by using the following operation.

MPI_REQUEST_FREE(request)

  INOUT  request   communication request (handle)

int MPI_Request_free(MPI_Request *request)

MPI_REQUEST_FREE(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

Mark the request object for deallocation and set request to MPI_REQUEST_NULL. An ongoing communication that is associated with the request will be allowed to complete. The request will be deallocated only after its completion.

Rationale. The MPI_REQUEST_FREE mechanism is provided for reasons of performance and convenience on the sending side. (End of rationale.)

Advice to users. Once a request is freed by a call to MPI_REQUEST_FREE, it is not possible to check for the successful completion of the associated communication with calls to MPI_WAIT or MPI_TEST. Also, if an error occurs subsequently during the communication, an error code cannot be returned to the user; such an error must be treated as fatal. Questions arise as to how one knows when the operations have completed when using MPI_REQUEST_FREE. Depending on the program logic, there may be other ways in which the program knows that certain operations have completed and this makes usage of MPI_REQUEST_FREE practical. For example, an active send request could be freed when the logic of the program is such that the receiver sends a reply to the message sent; the arrival of the reply informs the sender that the send has completed and the send buffer can be reused. An active receive request should never be freed, as the receiver will have no way to verify that the receive has completed and the receive buffer can be reused. (End of advice to users.)

Example. An example using MPI_REQUEST_FREE.

CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
IF (rank.EQ.0) THEN
    DO i=1, n
      CALL MPI_ISEND(outval, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, req, ierr)
      CALL MPI_REQUEST_FREE(req, ierr)
      CALL MPI_IRECV(inval, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, req, ierr)
      CALL MPI_WAIT(req, status, ierr)
    END DO
ELSE    ! rank.EQ.1
    CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, req, ierr)
    CALL MPI_WAIT(req, status, ierr)
    DO I=1, n-1
       CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, req, ierr)
       CALL MPI_REQUEST_FREE(req, ierr)
       CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, req, ierr)
       CALL MPI_WAIT(req, status, ierr)
    END DO
    CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, req, ierr)
    CALL MPI_WAIT(req, status, ierr)
END IF

Semantics of Nonblocking Communications

The semantics of nonblocking communication is defined by suitably extending the definitions given earlier for blocking communication.

Order. Nonblocking communication operations are ordered according to the execution order of the calls that initiate the communication. The non-overtaking requirement stated for blocking communication is extended to nonblocking communication, with this definition of order being used.

Example. Message ordering for nonblocking operations.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
      CALL MPI_ISEND(a, 1, MPI_REAL, 1, 0, comm, r1, ierr)
      CALL MPI_ISEND(b, 1, MPI_REAL, 1, 0, comm, r2, ierr)
ELSE    ! rank.EQ.1
      CALL MPI_IRECV(a, 1, MPI_REAL, 0, MPI_ANY_TAG, comm, r1, ierr)
      CALL MPI_IRECV(b, 1, MPI_REAL, 0, 0, comm, r2, ierr)
END IF
CALL MPI_WAIT(r1, status, ierr)
CALL MPI_WAIT(r2, status, ierr)

The first send of process zero will match the first receive of process one, even if both messages are sent before process one executes either receive.

Progress. A call to MPI_WAIT that completes a receive will eventually terminate and return if a matching send has been started, unless the send is satisfied by another receive. In particular, if the matching send is nonblocking, then the receive should complete even if no call is executed by the sender to complete the send. Similarly, a call to MPI_WAIT that completes a send will eventually return if a matching receive has been started, unless the receive is satisfied by another send, and even if no call is executed to complete the receive.

Example. An illustration of progress semantics.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
      CALL MPI_SSEND(a, 1, MPI_REAL, 1, 0, comm, ierr)
      CALL MPI_SEND(b, 1, MPI_REAL, 1, 1, comm, ierr)
ELSE    ! rank.EQ.1
      CALL MPI_IRECV(a, 1, MPI_REAL, 0, 0, comm, r, ierr)
      CALL MPI_RECV(b, 1, MPI_REAL, 0, 1, comm, status, ierr)
      CALL MPI_WAIT(r, status, ierr)
END IF

This code should not deadlock in a correct MPI implementation. The first synchronous send of process zero must complete after process one posts the matching (nonblocking) receive, even if process one has not yet reached the completing wait call. Thus, process zero will continue and execute the second send, allowing process one to complete execution.

If an MPI_TEST that completes a receive is repeatedly called with the same arguments, and a matching send has been started, then the call will eventually return flag = true, unless the send is satisfied by another receive. If an MPI_TEST that completes a send is repeatedly called with the same arguments, and a matching receive has been started, then the call will eventually return flag = true, unless the receive is satisfied by another send.

Multiple Completions

It is convenient to be able to wait for the completion of any, some, or all the operations in a list, rather than having to wait for a specific message. A call to MPI_WAITANY or MPI_TESTANY can be used to wait for the completion of one out of several operations. A call to MPI_WAITALL or MPI_TESTALL can be used to wait for all pending operations in a list. A call to MPI_WAITSOME or MPI_TESTSOME can be used to complete all enabled operations in a list.

MPI_WAITANY(count, array_of_requests, index, status)

  IN     count              list length (integer)
  INOUT  array_of_requests  array of requests (array of handles)
  OUT    index              index of handle for operation that completed (integer)
  OUT    status             status object (Status)

int MPI_Waitany(int count, MPI_Request *array_of_requests, int *index,
    MPI_Status *status)

MPI_WAITANY(COUNT, ARRAY_OF_REQUESTS, INDEX, STATUS, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), INDEX, STATUS(MPI_STATUS_SIZE),
    IERROR

Blocks until one of the operations associated with the active requests in the array has completed. If more than one operation is enabled and can terminate, one is arbitrarily chosen. Returns in index the index of that request in the array, and returns in status the status of the completing communication. (The array is indexed from zero in C, and from one in Fortran.) If the request was allocated by a nonblocking communication operation, then it is deallocated and the request handle is set to MPI_REQUEST_NULL.

The array_of_requests list may contain null or inactive handles. If the list contains no active handles (list has length zero or all entries are null or inactive), then the call returns immediately with index = MPI_UNDEFINED.

The execution of MPI_WAITANY(count, array_of_requests, index, status) has the same effect as the execution of MPI_WAIT(&array_of_requests[i], status), where i is the value returned by index. MPI_WAITANY with an array containing one active entry is equivalent to MPI_WAIT.

MPI_TESTANY(count, array_of_requests, index, flag, status)

  IN     count              list length (integer)
  INOUT  array_of_requests  array of requests (array of handles)
  OUT    index              index of operation that completed, or MPI_UNDEFINED
                            if none completed (integer)
  OUT    flag               true if one of the operations is complete (logical)
  OUT    status             status object (Status)

int MPI_Testany(int count, MPI_Request *array_of_requests, int *index,
    int *flag, MPI_Status *status)

MPI_TESTANY(COUNT, ARRAY_OF_REQUESTS, INDEX, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), INDEX, STATUS(MPI_STATUS_SIZE),
    IERROR

Tests for completion of either one or none of the operations associated with active handles. In the former case, it returns flag = true, returns in index the index of this request in the array, and returns in status the status of that operation; if the request was allocated by a nonblocking communication call then the request is deallocated and the handle is set to MPI_REQUEST_NULL. (The array is indexed from zero in C, and from one in Fortran.) In the latter case (no operation completed), it returns flag = false, returns a value of MPI_UNDEFINED in index, and status is undefined.

The array may contain null or inactive handles. If the array contains no active handles, then the call returns immediately with flag = false, index = MPI_UNDEFINED, and status undefined.

The execution of MPI_TESTANY(count, array_of_requests, index, flag, status) has the same effect as the execution of MPI_TEST(&array_of_requests[i], flag, status), for i = 0, 1, ..., count-1, in some arbitrary order, until one call returns flag = true, or all fail. In the former case, index is set to the last value of i, and in the latter case, it is set to MPI_UNDEFINED. MPI_TESTANY with an array containing one active entry is equivalent to MPI_TEST.

MPI_WAITALL(count, array_of_requests, array_of_statuses)

  IN     count              lists length (integer)
  INOUT  array_of_requests  array of requests (array of handles)
  OUT    array_of_statuses  array of status objects (array of Status)

int MPI_Waitall(int count, MPI_Request *array_of_requests,
    MPI_Status *array_of_statuses)

MPI_WAITALL(COUNT, ARRAY_OF_REQUESTS, ARRAY_OF_STATUSES, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*)
    INTEGER ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

Blocks until all communication operations associated with active handles in the list complete, and returns the status of all these operations (this includes the case where no handle in the list is active). Both arrays have the same number of valid entries. The i-th entry in array_of_statuses is set to the return status of the i-th operation. Requests that were created by nonblocking communication operations are deallocated and the corresponding handles in the array are set to MPI_REQUEST_NULL. The list may contain null or inactive handles. The call sets, in the status of each such entry, tag = MPI_ANY_TAG and source = MPI_ANY_SOURCE; each such status entry is also configured so that calls to MPI_GET_COUNT and MPI_GET_ELEMENTS return count = 0.

The execution of MPI_WAITALL(count, array_of_requests, array_of_statuses) has the same effect as the execution of MPI_WAIT(&array_of_requests[i], &array_of_statuses[i]), for i = 0, 1, ..., count-1, in some arbitrary order. MPI_WAITALL with an array of length one is equivalent to MPI_WAIT.
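A minimal C sketch of this usage, assuming a small neighbor list and illustrative counts and tag; the sketch is not part of the standard text.

/* Sketch: post one receive per neighbor and complete them together.
   The neighbor list, the bound of 16, counts, and tag are illustrative
   assumptions. */
#include <mpi.h>

void recv_from_neighbors(MPI_Comm comm, const int *neighbors, int nnbr,
                         double *bufs, int len)
{
    MPI_Request requests[16];            /* assumes nnbr <= 16 */
    MPI_Status  statuses[16];
    int i;

    for (i = 0; i < nnbr; i++)
        MPI_Irecv(&bufs[i * len], len, MPI_DOUBLE, neighbors[i],
                  0, comm, &requests[i]);

    /* Blocks until every posted receive has completed; statuses[i]
       then describes the i-th operation. */
    MPI_Waitall(nnbr, requests, statuses);
}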

MPI_TESTALL(count, array_of_requests, flag, array_of_statuses)

  IN     count              lists length (integer)
  INOUT  array_of_requests  array of requests (array of handles)
  OUT    flag               (logical)
  OUT    array_of_statuses  array of status objects (array of Status)

int MPI_Testall(int count, MPI_Request *array_of_requests, int *flag,
    MPI_Status *array_of_statuses)

MPI_TESTALL(COUNT, ARRAY_OF_REQUESTS, FLAG, ARRAY_OF_STATUSES, IERROR)
    LOGICAL FLAG
    INTEGER COUNT, ARRAY_OF_REQUESTS(*),
    ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

Returns flag = true if all communications associated with active handles in the array have completed (this includes the case where no handle in the list is active). In this case, each status entry that corresponds to an active handle request is set to the status of the corresponding communication; if the request was allocated by a nonblocking communication call then it is deallocated, and the handle is set to MPI_REQUEST_NULL. Each status entry that corresponds to a null or inactive handle is set to return tag = MPI_ANY_TAG and source = MPI_ANY_SOURCE, and is also configured so that calls to MPI_GET_COUNT and MPI_GET_ELEMENTS return count = 0.

Otherwise, flag = false is returned, no request is modified and the values of the status entries are undefined. This is a local operation.

MPI_WAITSOME(incount, array_of_requests, outcount, array_of_indices, array_of_statuses)

  IN     incount            length of array_of_requests (integer)
  INOUT  array_of_requests  array of requests (array of handles)
  OUT    outcount           number of completed requests (integer)
  OUT    array_of_indices   array of indices of operations that completed (array of integers)
  OUT    array_of_statuses  array of status objects for operations that completed (array of Status)

int MPI_Waitsome(int incount, MPI_Request *array_of_requests,
    int *outcount, int *array_of_indices, MPI_Status *array_of_statuses)

MPI_WAITSOME(INCOUNT, ARRAY_OF_REQUESTS, OUTCOUNT, ARRAY_OF_INDICES,
    ARRAY_OF_STATUSES, IERROR)
    INTEGER INCOUNT, ARRAY_OF_REQUESTS(*), OUTCOUNT, ARRAY_OF_INDICES(*),
    ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

Waits until at least one of the operations associated with active handles in the list have completed. Returns in outcount the number of requests from the list array_of_requests that have completed. Returns in the first outcount locations of the array array_of_indices the indices of these operations (index within the array array_of_requests; the array is indexed from zero in C and from one in Fortran). Returns in the first outcount locations of the array array_of_statuses the status for these completed operations. If a request that completed was allocated by a nonblocking communication call, then it is deallocated, and the associated handle is set to MPI_REQUEST_NULL.

If the list contains no active handles, then the call returns immediately with outcount = MPI_UNDEFINED.

MPI_TESTSOME(incount, array_of_requests, outcount, array_of_indices, array_of_statuses)

  IN     incount            length of array_of_requests (integer)
  INOUT  array_of_requests  array of requests (array of handles)
  OUT    outcount           number of completed requests (integer)
  OUT    array_of_indices   array of indices of operations that completed (array of integers)
  OUT    array_of_statuses  array of status objects for operations that completed (array of Status)

int MPI_Testsome(int incount, MPI_Request *array_of_requests,
    int *outcount, int *array_of_indices, MPI_Status *array_of_statuses)

MPI_TESTSOME(INCOUNT, ARRAY_OF_REQUESTS, OUTCOUNT, ARRAY_OF_INDICES,
    ARRAY_OF_STATUSES, IERROR)
    INTEGER INCOUNT, ARRAY_OF_REQUESTS(*), OUTCOUNT, ARRAY_OF_INDICES(*),
    ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

Behaves like MPI_WAITSOME, except that it returns immediately. If no operation has completed it returns outcount = 0.

MPI_TESTSOME is a local operation, which returns immediately, whereas MPI_WAITSOME will block until a communication completes, if it was passed a list that contains at least one active handle. Both calls fulfill a fairness requirement: if a request for a receive repeatedly appears in a list of requests passed to MPI_WAITSOME or MPI_TESTSOME, and a matching send has been posted, then the receive will eventually succeed, unless the send is satisfied by another receive; and similarly for send requests.

Advice to users. The use of MPI_TESTSOME is likely to be more efficient than the use of MPI_TESTANY. The former returns information on all completed communications; with the latter, a new call is required for each communication that completes.

A server with multiple clients can use MPI_WAITSOME so as not to starve any client. Clients send messages to the server with service requests. The server calls MPI_WAITSOME with one receive request for each client, and then handles all receives that completed. If a call to MPI_WAITANY is used instead, then one client could starve while requests from another client always sneak in first. (End of advice to users.)

Advice to implementors. MPI_TESTSOME should complete as many pending communications as possible. (End of advice to implementors.)

Example. Client-server code (starvation can occur).

CALL MPI_COMM_SIZE(comm, size, ierr)
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank .GT. 0) THEN        ! client code
    DO WHILE(.TRUE.)
       CALL MPI_ISEND(a, n, MPI_REAL, 0, tag, comm, request, ierr)
       CALL MPI_WAIT(request, status, ierr)
    END DO
ELSE                         ! rank=0: server code
    DO i=1, size-1
       CALL MPI_IRECV(a(1,i), n, MPI_REAL, i, tag,
                      comm, request_list(i), ierr)
    END DO
    DO WHILE(.TRUE.)
       CALL MPI_WAITANY(size-1, request_list, index, status, ierr)
       CALL DO_SERVICE(a(1,index))       ! handle one message
       CALL MPI_IRECV(a(1,index), n, MPI_REAL, index, tag,
                      comm, request_list(index), ierr)
    END DO
END IF

Example. Same code, using MPI_WAITSOME.

CALL MPI_COMM_SIZE(comm, size, ierr)
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank .GT. 0) THEN        ! client code
    DO WHILE(.TRUE.)
       CALL MPI_ISEND(a, n, MPI_REAL, 0, tag, comm, request, ierr)
       CALL MPI_WAIT(request, status, ierr)
    END DO
ELSE                         ! rank=0: server code
    DO i=1, size-1
       CALL MPI_IRECV(a(1,i), n, MPI_REAL, i, tag,
                      comm, request_list(i), ierr)
    END DO
    DO WHILE(.TRUE.)
       CALL MPI_WAITSOME(size-1, request_list, numdone,
                         index_list, status_list, ierr)
       DO i=1, numdone
          CALL DO_SERVICE(a(1,index_list(i)))
          CALL MPI_IRECV(a(1,index_list(i)), n, MPI_REAL, index_list(i), tag,
                         comm, request_list(index_list(i)), ierr)
       END DO
    END DO
END IF

Probe and Cancel

The MPI_PROBE and MPI_IPROBE operations allow incoming messages to be checked for, without actually receiving them. The user can then decide how to receive them, based on the information returned by the probe (basically, the information returned by status). In particular, the user may allocate memory for the receive buffer, according to the length of the probed message.

The MPI_CANCEL operation allows pending communications to be canceled. This is required for cleanup. Posting a send or a receive ties up user resources (send or receive buffers), and a cancel may be needed to free these resources gracefully.

MPI_IPROBE(source, tag, comm, flag, status)

  IN   source   source rank, or MPI_ANY_SOURCE (integer)
  IN   tag      tag value, or MPI_ANY_TAG (integer)
  IN   comm     communicator (handle)
  OUT  flag     (logical)
  OUT  status   status object (Status)

int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag,
    MPI_Status *status)

MPI_IPROBE(SOURCE, TAG, COMM, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

MPI_IPROBE(source, tag, comm, flag, status) returns flag = true if there is a message that can be received and that matches the pattern specified by the arguments source, tag, and comm. The call matches the same message that would have been received by a call to MPI_RECV(..., source, tag, comm, status) executed at the same point in the program, and returns in status the same value that would have been returned by MPI_RECV(). Otherwise, the call returns flag = false, and leaves status undefined.

If MPI_IPROBE returns flag = true, then the content of the status object can be subsequently accessed, as described in the section on return status, to find the source, tag and length of the probed message.

A subsequent receive executed with the same context, and the source and tag returned in status by MPI_IPROBE, will receive the message that was matched by the probe, if no other intervening receive occurs after the probe. If the receiving process is multi-threaded, it is the user's responsibility to ensure that the last condition holds.

The source argument of MPI_PROBE can be MPI_ANY_SOURCE, and the tag argument can be MPI_ANY_TAG, so that one can probe for messages from an arbitrary source and/or with an arbitrary tag. However, a specific communication context must be provided with the comm argument.

It is not necessary to receive a message immediately after it has been probed for, and the same message may be probed for several times before it is received.
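A minimal C sketch of this usage, assuming the element type, source, and tag are known but the message length is not; the sketch is not part of the standard text.

/* Sketch: size the receive buffer from a probed message.
   The element type and tag handling are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

void recv_unknown_length(MPI_Comm comm, int source, int tag)
{
    MPI_Status status;
    int count;
    double *buf;

    MPI_Probe(source, tag, comm, &status);          /* blocks until a match */
    MPI_Get_count(&status, MPI_DOUBLE, &count);     /* length of probed message */

    buf = (double *) malloc(count * sizeof(double));
    /* Use the source and tag from status so the probed message is received. */
    MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
             comm, &status);
    /* ... use buf ... */
    free(buf);
}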

MPI_PROBE(source, tag, comm, status)

  IN   source   source rank, or MPI_ANY_SOURCE (integer)
  IN   tag      tag value, or MPI_ANY_TAG (integer)
  IN   comm     communicator (handle)
  OUT  status   status object (Status)

int MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status)

MPI_PROBE(SOURCE, TAG, COMM, STATUS, IERROR)
    INTEGER SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

MPI_PROBE behaves like MPI_IPROBE except that it is a blocking call that returns only after a matching message has been found.

The MPI implementation of MPI_PROBE and MPI_IPROBE needs to guarantee progress: if a call to MPI_PROBE has been issued by a process, and a send that matches the probe has been initiated by some process, then the call to MPI_PROBE will return, unless the message is received by another concurrent receive operation (that is executed by another thread at the probing process). Similarly, if a process busy waits with MPI_IPROBE and a matching message has been issued, then the call to MPI_IPROBE will eventually return flag = true, unless the message is received by another concurrent receive operation.

Example. Use blocking probe to wait for an incoming message.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
     CALL MPI_SEND(i, 1, MPI_INTEGER, 2, 0, comm, ierr)
ELSE IF (rank.EQ.1) THEN
     CALL MPI_SEND(x, 1, MPI_REAL, 2, 0, comm, ierr)
ELSE   ! rank.EQ.2
    DO i=1, 2
       CALL MPI_PROBE(MPI_ANY_SOURCE, 0,
                      comm, status, ierr)
       IF (status(MPI_SOURCE) .EQ. 0) THEN
100        CALL MPI_RECV(i, 1, MPI_INTEGER, 0, 0, comm, status, ierr)
       ELSE
200        CALL MPI_RECV(x, 1, MPI_REAL, 1, 0, comm, status, ierr)
       END IF
    END DO
END IF

Each message is received with the right type.

Example. A similar program to the previous example, but now it has a problem.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
     CALL MPI_SEND(i, 1, MPI_INTEGER, 2, 0, comm, ierr)
ELSE IF (rank.EQ.1) THEN
     CALL MPI_SEND(x, 1, MPI_REAL, 2, 0, comm, ierr)
ELSE
    DO i=1, 2
       CALL MPI_PROBE(MPI_ANY_SOURCE, 0,
                      comm, status, ierr)
       IF (status(MPI_SOURCE) .EQ. 0) THEN
100        CALL MPI_RECV(i, 1, MPI_INTEGER, MPI_ANY_SOURCE,
                         0, comm, status, ierr)
       ELSE
200        CALL MPI_RECV(x, 1, MPI_REAL, MPI_ANY_SOURCE,
                         0, comm, status, ierr)
       END IF
    END DO
END IF

We slightly modified the previous example, using MPI_ANY_SOURCE as the source argument in the two receive calls in the statements labeled 100 and 200. The program is now incorrect: the receive operation may receive a message that is distinct from the message probed by the preceding call to MPI_PROBE.

Advice to implementors. A call to MPI_PROBE(source, tag, comm, status) will match the message that would have been received by a call to MPI_RECV(..., source, tag, comm, status) executed at the same point. Suppose that this message has source s, tag t and communicator c. If the tag argument in the probe call has value MPI_ANY_TAG then the message probed will be the earliest pending message from source s with communicator c and any tag; in any case, the message probed will be the earliest pending message from source s with tag t and communicator c (this is the message that would have been received, so as to preserve message order). This message continues as the earliest pending message from source s with tag t and communicator c, until it is received. A receive operation subsequent to the probe that uses the same communicator as the probe and uses the tag and source values returned by the probe, must receive this message, unless it has already been received by another receive operation. (End of advice to implementors.)

MPI_CANCEL(request)

  IN  request   communication request (handle)

int MPI_Cancel(MPI_Request *request)

MPI_CANCEL(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

A call to MPI_CANCEL marks for cancellation a pending, nonblocking communication operation (send or receive). The cancel call is local. It returns immediately, possibly before the communication is actually canceled. It is still necessary to complete a communication that has been marked for cancellation, using a call to MPI_REQUEST_FREE, MPI_WAIT or MPI_TEST (or any of the derived operations).

If a communication is marked for cancellation, then an MPI_WAIT call for that communication is guaranteed to return, irrespective of the activities of other processes (i.e., MPI_WAIT behaves as a local function); similarly, if MPI_TEST is repeatedly called in a busy wait loop for a canceled communication, then MPI_TEST will eventually be successful.

MPI_CANCEL can be used to cancel a communication that uses a persistent request (see the section on persistent communication requests), in the same way it is used for nonpersistent requests. A successful cancellation cancels the active communication, but not the request itself. After the call to MPI_CANCEL and the subsequent call to MPI_WAIT or MPI_TEST, the request becomes inactive and can be activated for a new communication.

The successful cancellation of a buffered send frees the buffer space occupied by the pending message.

Either the cancellation succeeds, or the communication succeeds, but not both. If a send is marked for cancellation, then it must be the case that either the send completes normally, in which case the message sent was received at the destination process, or that the send is successfully canceled, in which case no part of the message was received at the destination. Then, any matching receive has to be satisfied by another send. If a receive is marked for cancellation, then it must be the case that either the receive completes normally, or that the receive is successfully canceled, in which case no part of the receive buffer is altered. Then, any matching send has to be satisfied by another receive.

If the operation has been canceled, then information to that effect will be returned in the status argument of the operation that completes the communication.

MPI_TEST_CANCELLED(status, flag)

  IN   status   status object (Status)
  OUT  flag     (logical)

int MPI_Test_cancelled(MPI_Status *status, int *flag)

MPI_TEST_CANCELLED(STATUS, FLAG, IERROR)
    LOGICAL FLAG
    INTEGER STATUS(MPI_STATUS_SIZE), IERROR

Returns flag = true if the communication associated with the status object was canceled successfully. In such a case, all other fields of status (such as count or tag) are undefined. Returns flag = false, otherwise. If a receive operation might be canceled then one should call MPI_TEST_CANCELLED first, to check whether the operation was canceled, before checking on the other fields of the return status.
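A minimal C sketch of this pattern, with illustrative buffer size, source, and tag; the sketch is not part of the standard text.

/* Sketch: cancel a pending receive and check the outcome.
   Buffer size, source, and tag are illustrative assumptions. */
#include <mpi.h>

void cancel_pending_recv(MPI_Comm comm)
{
    double buf[64];
    MPI_Request req;
    MPI_Status  status;
    int cancelled;

    MPI_Irecv(buf, 64, MPI_DOUBLE, 0, 99, comm, &req);

    /* ... the program decides this message is no longer wanted ... */

    MPI_Cancel(&req);            /* local: marks the operation for cancellation */
    MPI_Wait(&req, &status);     /* still required to complete the request */

    MPI_Test_cancelled(&status, &cancelled);
    if (!cancelled) {
        /* The receive completed normally; buf holds a message
           described by status. */
    }
}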

Advice to users. Cancel can be an expensive operation that should be used only exceptionally. (End of advice to users.)

Advice to implementors. If a send operation uses an "eager" protocol (data is transferred to the receiver before a matching receive is posted), then the cancellation of this send may require communication with the intended receiver in order to free allocated buffers. On some systems this may require an interrupt to the intended receiver. Note that, while communication may be needed to implement MPI_CANCEL, this is still a local operation, since its completion does not depend on the code executed by other processes. If processing is required on another process, this should be transparent to the application (hence the need for an interrupt and an interrupt handler). (End of advice to implementors.)

Persistent communication requests

Often a communication with the same argument list is repeatedly executed within the inner loop of a parallel computation. In such a situation, it may be possible to optimize the communication by binding the list of communication arguments to a persistent communication request once and, then, repeatedly using the request to initiate and complete messages. The persistent request thus created can be thought of as a communication port or a "half-channel." It does not provide the full functionality of a conventional channel, since there is no binding of the send port to the receive port. This construct allows reduction of the overhead for communication between the process and communication controller, but not of the overhead for communication between one communication controller and another. It is not necessary that messages sent with a persistent request be received by a receive operation using a persistent request, or vice versa.

A persistent communication request is created using one of the five following calls. These calls involve no communication.

MPI_SEND_INIT(buf, count, datatype, dest, tag, comm, request)

  IN    buf        initial address of send buffer (choice)
  IN    count      number of elements sent (integer)
  IN    datatype   type of each element (handle)
  IN    dest       rank of destination (integer)
  IN    tag        message tag (integer)
  IN    comm       communicator (handle)
  OUT   request    communication request (handle)

int MPI_Send_init(void* buf, int count, MPI_Datatype datatype, int dest,
    int tag, MPI_Comm comm, MPI_Request *request)

MPI_SEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Creates a persistent communication request for a standard mode send operation, and binds to it all the arguments of a send operation.

MPI_BSEND_INIT(buf, count, datatype, dest, tag, comm, request)

  IN    buf        initial address of send buffer (choice)
  IN    count      number of elements sent (integer)
  IN    datatype   type of each element (handle)
  IN    dest       rank of destination (integer)
  IN    tag        message tag (integer)
  IN    comm       communicator (handle)
  OUT   request    communication request (handle)

int MPI_Bsend_init(void* buf, int count, MPI_Datatype datatype, int dest,
    int tag, MPI_Comm comm, MPI_Request *request)

MPI_BSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Creates a persistent communication request for a buffered mode send.

MPI_SSEND_INIT(buf, count, datatype, dest, tag, comm, request)

  IN    buf        initial address of send buffer (choice)
  IN    count      number of elements sent (integer)
  IN    datatype   type of each element (handle)
  IN    dest       rank of destination (integer)
  IN    tag        message tag (integer)
  IN    comm       communicator (handle)
  OUT   request    communication request (handle)

int MPI_Ssend_init(void* buf, int count, MPI_Datatype datatype, int dest,
    int tag, MPI_Comm comm, MPI_Request *request)

MPI_SSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Creates a persistent communication object for a synchronous mode send operation.

MPI_RSEND_INIT(buf, count, datatype, dest, tag, comm, request)

  IN    buf        initial address of send buffer (choice)
  IN    count      number of elements sent (integer)
  IN    datatype   type of each element (handle)
  IN    dest       rank of destination (integer)
  IN    tag        message tag (integer)
  IN    comm       communicator (handle)
  OUT   request    communication request (handle)

int MPI_Rsend_init(void* buf, int count, MPI_Datatype datatype, int dest,
    int tag, MPI_Comm comm, MPI_Request *request)

MPI_RSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Creates a persistent communication object for a ready mode send operation.

MPI_RECV_INIT(buf, count, datatype, source, tag, comm, request)

  OUT   buf        initial address of receive buffer (choice)
  IN    count      number of elements received (integer)
  IN    datatype   type of each element (handle)
  IN    source     rank of source, or MPI_ANY_SOURCE (integer)
  IN    tag        message tag, or MPI_ANY_TAG (integer)
  IN    comm       communicator (handle)
  OUT   request    communication request (handle)

int MPI_Recv_init(void* buf, int count, MPI_Datatype datatype, int source,
    int tag, MPI_Comm comm, MPI_Request *request)

MPI_RECV_INIT(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

Creates a persistent communication request for a receive operation. The argument buf is marked as OUT because the user gives permission to write on the receive buffer, by passing the argument to MPI_RECV_INIT.

A persistent communication request is inactive after it was created: no active communication is attached to the request.

A communication (send or receive) that uses a persistent request is initiated by the function MPI_START.

MPI_START(request)

  INOUT  request   communication request (handle)

int MPI_Start(MPI_Request *request)

MPI_START(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

The argument, request, is a handle returned by one of the previous five calls. The associated request should be inactive. The request becomes active once the call is made.

If the request is for a send with ready mode, then a matching receive should be posted before the call is made. The communication buffer should not be accessed after the call, and until the operation completes.

The call is local, with similar semantics to the nonblocking communication operations described earlier in this chapter. That is, a call to MPI_START with a request created by MPI_SEND_INIT starts a communication in the same manner as a call to MPI_ISEND; a call to MPI_START with a request created by MPI_BSEND_INIT starts a communication in the same manner as a call to MPI_IBSEND; and so on.

MPI_STARTALL(count, array_of_requests)

  IN     count              list length (integer)
  INOUT  array_of_requests  array of requests (array of handle)

int MPI_Startall(int count, MPI_Request *array_of_requests)

MPI_STARTALL(COUNT, ARRAY_OF_REQUESTS, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), IERROR

Start all communications associated with requests in array_of_requests. A call to MPI_STARTALL(count, array_of_requests) has the same effect as calls to MPI_START(&array_of_requests[i]), executed for i = 0, 1, ..., count-1, in some arbitrary order.

A communication started with a call to MPI_START or MPI_STARTALL is completed by a call to MPI_WAIT, MPI_TEST, or one of the derived functions described earlier in this section. The request becomes inactive after successful completion of such a call. The request is not deallocated, and it can be activated anew by an MPI_START or MPI_STARTALL call.

A persistent request is deallocated by a call to MPI_REQUEST_FREE. The call to MPI_REQUEST_FREE can occur at any point in the program after the persistent request was created. However, the request will be deallocated only after it becomes inactive. Active receive requests should not be freed. Otherwise, it will not be possible to check that the receive has completed. It is preferable, in general, to free requests when they are inactive. If this rule is followed, then the functions described in this section will be invoked in a sequence of the form,

    Create (Start Complete)* Free,

where * indicates zero or more repetitions. If the same communication object is used in several concurrent threads, it is the user's responsibility to coordinate calls so that the correct sequence is obeyed.

A send operation initiated with MPI_START can be matched with any receive operation and, likewise, a receive operation initiated with MPI_START can receive messages generated by any send operation.
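A minimal C sketch of the Create (Start Complete)* Free sequence, with illustrative neighbor ranks, counts, tag, and iteration count; the sketch is not part of the standard text.

/* Sketch: persistent requests reused across iterations.
   Ranks, counts, tag, and the iteration count are illustrative assumptions. */
#include <mpi.h>

void iterate(MPI_Comm comm, double *sendbuf, double *recvbuf, int n,
             int left, int right, int niter)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    int it;

    /* Create: bind the argument lists once. */
    MPI_Recv_init(recvbuf, n, MPI_DOUBLE, left,  7, comm, &reqs[0]);
    MPI_Send_init(sendbuf, n, MPI_DOUBLE, right, 7, comm, &reqs[1]);

    for (it = 0; it < niter; it++) {
        MPI_Startall(2, reqs);          /* Start: both requests become active */
        /* ... compute on data not involved in the transfer ... */
        MPI_Waitall(2, reqs, stats);    /* Complete: requests become inactive */
    }

    /* Free: the requests are inactive here, so deallocation is safe. */
    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
}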

Send-receive

The send-receive operations combine in one call the sending of a message to one destination and the receiving of another message, from another process. The two (source and destination) are possibly the same. A send-receive operation is very useful for executing a shift operation across a chain of processes. If blocking sends and receives are used for such a shift, then one needs to order the sends and receives correctly (for example, even processes send, then receive; odd processes receive first, then send), so as to prevent cyclic dependencies that may lead to deadlock. When a send-receive operation is used, the communication subsystem takes care of these issues. The send-receive operation can be used in conjunction with the functions described in the chapter on process topologies in order to perform shifts on various logical topologies. Also, a send-receive operation is useful for implementing remote procedure calls.

A message sent by a send-receive operation can be received by a regular receive operation or probed by a probe operation; a send-receive operation can receive a message sent by a regular send operation.
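A minimal C sketch of such a shift, here a circular shift around all processes of the communicator, with an illustrative element count and tag; the sketch is not part of the standard text.

/* Sketch: circular right-shift around a ring with MPI_SENDRECV.
   The element count and tag value are illustrative assumptions. */
#include <mpi.h>

void ring_shift(MPI_Comm comm, double *sendbuf, double *recvbuf, int n)
{
    int rank, size, left, right;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    right = (rank + 1) % size;            /* destination of our message */
    left  = (rank - 1 + size) % size;     /* source of the message we receive */

    /* No manual ordering of sends and receives is needed: the combined
       operation avoids the cyclic-dependency deadlock. */
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, 0,
                 recvbuf, n, MPI_DOUBLE, left,  0,
                 comm, &status);
}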

MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status)

  IN    sendbuf    initial address of send buffer (choice)
  IN    sendcount  number of elements in send buffer (integer)
  IN    sendtype   type of elements in send buffer (handle)
  IN    dest       rank of destination (integer)
  IN    sendtag    send tag (integer)
  OUT   recvbuf    initial address of receive buffer (choice)
  IN    recvcount  number of elements in receive buffer (integer)
  IN    recvtype   type of elements in receive buffer (handle)
  IN    source     rank of source (integer)
  IN    recvtag    receive tag (integer)
  IN    comm       communicator (handle)
  OUT   status     status object (Status)

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
    int dest, int sendtag, void *recvbuf, int recvcount,
    MPI_Datatype recvtype, int source, int recvtag,
    MPI_Comm comm, MPI_Status *status)

MPI_SENDRECV(SENDBUF, SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVBUF,
    RECVCOUNT, RECVTYPE, SOURCE, RECVTAG, COMM, STATUS, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVCOUNT, RECVTYPE,
    SOURCE, RECVTAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

Execute a blocking send and receive operation. Both send and receive use the same communicator, but possibly different tags. The send buffer and receive buffers must be disjoint, and may have different lengths and datatypes.

MPI_SENDRECV_REPLACE(buf, count, datatype, dest, sendtag, source, recvtag, comm, status)

  INOUT  buf       initial address of send and receive buffer (choice)
  IN     count     number of elements in send and receive buffer (integer)
  IN     datatype  type of elements in send and receive buffer (handle)
  IN     dest      rank of destination (integer)
  IN     sendtag   send message tag (integer)
  IN     source    rank of source (integer)
  IN     recvtag   receive message tag (integer)
  IN     comm      communicator (handle)
  OUT    status    status object (Status)

int MPI_Sendrecv_replace(void* buf, int count, MPI_Datatype datatype,
    int dest, int sendtag, int source, int recvtag, MPI_Comm comm,
    MPI_Status *status)

MPI_SENDRECV_REPLACE(BUF, COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG,
    COMM, STATUS, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG, COMM,
    STATUS(MPI_STATUS_SIZE), IERROR

Execute a blocking send and receive. The same buffer is used both for the send and for the receive, so that the message sent is replaced by the message received.

The semantics of a send-receive operation is what would be obtained if the caller forked two concurrent threads, one to execute the send, and one to execute the receive, followed by a join of these two threads.

Advice to implementors. Additional, intermediate buffering is needed for the "replace" variant. (End of advice to implementors.)

Null processes

In many instances, it is convenient to specify a "dummy" source or destination for communication. This simplifies the code that is needed for dealing with boundaries, for example, in the case of a non-circular shift done with calls to send-receive.

The special value MPI_PROC_NULL can be used instead of a rank wherever a source or a destination argument is required in a call. A communication with process MPI_PROC_NULL has no effect. A send to MPI_PROC_NULL succeeds and returns as soon as possible. A receive from MPI_PROC_NULL succeeds and returns as soon as possible with no modifications to the receive buffer. When a receive with source = MPI_PROC_NULL is executed then the status object returns source = MPI_PROC_NULL, tag = MPI_ANY_TAG and count = 0.
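A minimal C sketch of such a boundary case, a non-circular shift in which the end processes use MPI_PROC_NULL, with an illustrative element count and tag; the sketch is not part of the standard text.

/* Sketch: non-circular right-shift; boundary ranks use MPI_PROC_NULL.
   The element count and tag are illustrative assumptions. */
#include <mpi.h>

void line_shift(MPI_Comm comm, double *sendbuf, double *recvbuf, int n)
{
    int rank, size, left, right;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* At the ends of the line the send or receive is a no-op; the
       receive leaves recvbuf unmodified and sets count = 0 in status. */
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, 1,
                 recvbuf, n, MPI_DOUBLE, left,  1,
                 comm, &status);
}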

Derived datatypes

Up to here, all point to point communications have involved only contiguous buffers containing a sequence of elements of the same type. This is too constraining on two accounts. One often wants to pass messages that contain values with different datatypes (e.g., an integer count, followed by a sequence of real numbers); and one often wants to send noncontiguous data (e.g., a sub-block of a matrix). One solution is to pack noncontiguous data into a contiguous buffer at the sender site and unpack it back at the receiver site. This has the disadvantage of requiring additional memory-to-memory copy operations at both sites, even when the communication subsystem has scatter-gather capabilities. Instead, MPI provides mechanisms to specify more general, mixed, and noncontiguous communication buffers. It is up to the implementation to decide whether data should be first packed in a contiguous buffer before being transmitted, or whether it can be collected directly from where it resides.

The general mechanisms provided here allow one to transfer directly, without copying, objects of various shape and size. It is not assumed that the MPI library is cognizant of the objects declared in the host language. Thus, if one wants to transfer a structure, or an array section, it will be necessary to provide in MPI a definition of a communication buffer that mimics the definition of the structure or array section in question. These facilities can be used by library designers to define communication functions that can transfer objects defined in the host language, by decoding their definitions as available in a symbol table or a dope vector. Such higher-level communication functions are not part of MPI.

More general communication buffers are specified by replacing the basic datatypes that have been used so far with derived datatypes that are constructed from basic datatypes using the constructors described in this section. These methods of constructing derived datatypes can be applied recursively.

A general datatype is an opaque object that specifies two things:

  - a sequence of basic datatypes, and
  - a sequence of integer (byte) displacements.

The displacements are not required to be positive, distinct, or in increasing order. Therefore, the order of items need not coincide with their order in store, and an item may appear more than once. We call such a pair of sequences (or sequence of pairs) a type map. The sequence of basic datatypes (displacements ignored) is the type signature of the datatype.

Let

    Typemap = {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})}

be such a type map, where type_i are basic types, and disp_i are displacements. Let

    Typesig = {type_0, ..., type_{n-1}}

be the associated type signature. This type map, together with a base address buf, specifies a communication buffer: the communication buffer that consists of n entries, where the i-th entry is at address buf + disp_i and has type type_i. A message assembled from such a communication buffer will consist of n values, of the types defined by Typesig.

We can use a handle to a general datatype as an argument in a send or receive operation, instead of a basic datatype argument. The operation MPI_SEND(buf, 1, datatype, ...) will use the send buffer defined by the base address buf and the general datatype associated with datatype; it will generate a message with the type signature determined by the datatype argument. MPI_RECV(buf, 1, datatype, ...) will use the receive buffer defined by the base address buf and the general datatype associated with datatype.

General datatypes can be used in all send and receive operations. The case where the second argument count has value > 1 is discussed in a later subsection.

The basic datatypes presented earlier are particular cases of a general datatype, and are predefined. Thus, MPI_INT is a predefined handle to a datatype with type map {(int, 0)}, with one entry of type int and displacement zero. The other basic datatypes are similar.

The extent of a datatype is defined to be the span from the first byte to the last byte occupied by entries in this datatype, rounded up to satisfy alignment requirements. That is, if

    Typemap = {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

then

    lb(Typemap)     = min_j disp_j,
    ub(Typemap)     = max_j (disp_j + sizeof(type_j)) + epsilon, and
    extent(Typemap) = ub(Typemap) - lb(Typemap).

If type_i requires alignment to a byte address that is a multiple of k_i, then epsilon is the least nonnegative increment needed to round extent(Typemap) to the next multiple of max_i k_i.

Example. Assume that Type = {(double, 0), (char, 8)} (a double at displacement zero, followed by a char at displacement eight). Assume, furthermore, that doubles have to be strictly aligned at addresses that are multiples of eight. Then, the extent of this datatype is 16 (9 rounded to the next multiple of 8). A datatype that consists of a character immediately followed by a double will also have an extent of 16.

Rationale. The definition of extent is motivated by the assumption that the amount of padding added at the end of each structure in an array of structures is the least needed to fulfill alignment constraints. More explicit control of the extent is provided in a later subsection. Such explicit control is needed in cases where the assumption does not hold, for example, where union types are used. (End of rationale.)

Datatype constructors

Contiguous. The simplest datatype constructor is MPI_TYPE_CONTIGUOUS, which allows replication of a datatype into contiguous locations.

MPI_TYPE_CONTIGUOUS(count, oldtype, newtype)

  IN   count    replication count (nonnegative integer)
  IN   oldtype  old datatype (handle)
  OUT  newtype  new datatype (handle)

int MPI_Type_contiguous(int count, MPI_Datatype oldtype,
    MPI_Datatype *newtype)

MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, OLDTYPE, NEWTYPE, IERROR

newtype is the datatype obtained by concatenating count copies of oldtype. Concatenation is defined using extent as the size of the concatenated copies.

Example. Let oldtype have type map {(double, 0), (char, 8)}, with extent 16, and let count = 3. The type map of the datatype returned by newtype is

    {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40)};

i.e., alternating double and char elements, with displacements 0, 8, 16, 24, 32, 40.

In general, assume that the type map of oldtype is

    {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

with extent ex. Then newtype has a type map with count * n entries, defined by:

    {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1}),
     (type_0, disp_0 + ex), ..., (type_{n-1}, disp_{n-1} + ex),
     ...,
     (type_0, disp_0 + ex * (count-1)), ..., (type_{n-1}, disp_{n-1} + ex * (count-1))}.
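A minimal C sketch of constructing and using such a datatype, with an illustrative replication count, message length, destination, and tag; MPI_TYPE_COMMIT, described later in this chapter, must be called before the datatype is used in a communication. The sketch is not part of the standard text.

/* Sketch: build and use a contiguous derived datatype.
   The counts, destination rank, and tag are illustrative assumptions. */
#include <mpi.h>

void send_particle_block(MPI_Comm comm, double *particles, int dest)
{
    /* One "particle" is taken here to be 6 contiguous doubles. */
    MPI_Datatype particle_type;

    MPI_Type_contiguous(6, MPI_DOUBLE, &particle_type);
    MPI_Type_commit(&particle_type);

    /* Send 100 particles, i.e. 600 doubles, as 100 elements of the new type. */
    MPI_Send(particles, 100, particle_type, dest, 0, comm);

    MPI_Type_free(&particle_type);
}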

Vector. The function MPI_TYPE_VECTOR is a more general constructor that allows replication of a datatype into locations that consist of equally spaced blocks. Each block is obtained by concatenating the same number of copies of the old datatype. The spacing between blocks is a multiple of the extent of the old datatype.

MPI_TYPE_VECTOR(count, blocklength, stride, oldtype, newtype)

  IN   count        number of blocks (nonnegative integer)
  IN   blocklength  number of elements in each block (nonnegative integer)
  IN   stride       number of elements between start of each block (integer)
  IN   oldtype      old datatype (handle)
  OUT  newtype      new datatype (handle)

int MPI_Type_vector(int count, int blocklength, int stride,
    MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_TYPE_VECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR

Example. Assume, again, that oldtype has type map {(double, 0), (char, 8)}, with extent 16. A call to MPI_TYPE_VECTOR(2, 3, 4, oldtype, newtype) will create the datatype with type map

    {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40),
     (double, 64), (char, 72), (double, 80), (char, 88), (double, 96), (char, 104)}.

That is, two blocks with three copies each of the old type, with a stride of 4 elements (4 * 16 bytes) between the blocks.

Example. A call to MPI_TYPE_VECTOR(3, 1, -2, oldtype, newtype) will create the datatype

    {(double, 0), (char, 8), (double, -32), (char, -24), (double, -64), (char, -56)}.

In general, assume that oldtype has type map

    {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

with extent ex. Let bl be the blocklength. The newly created datatype has a type map with count * bl * n entries:

    {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1}),
     (type_0, disp_0 + ex), ..., (type_{n-1}, disp_{n-1} + ex),
     ...,
     (type_0, disp_0 + (bl-1) * ex), ..., (type_{n-1}, disp_{n-1} + (bl-1) * ex),
     (type_0, disp_0 + stride * ex), ..., (type_{n-1}, disp_{n-1} + stride * ex),
     ...,
     (type_0, disp_0 + (stride + bl - 1) * ex), ...,
     (type_{n-1}, disp_{n-1} + (stride + bl - 1) * ex),
     ...,
     (type_0, disp_0 + stride * (count-1) * ex), ...,
     (type_{n-1}, disp_{n-1} + stride * (count-1) * ex),
     ...,
     (type_0, disp_0 + (stride * (count-1) + bl - 1) * ex), ...,
     (type_{n-1}, disp_{n-1} + (stride * (count-1) + bl - 1) * ex)}.

A call to MPI_TYPE_CONTIGUOUS(count, oldtype, newtype) is equivalent to a call to MPI_TYPE_VECTOR(count, 1, 1, oldtype, newtype), or to a call to MPI_TYPE_VECTOR(1, count, n, oldtype, newtype), with n arbitrary.
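A minimal C sketch of a common use of this constructor, sending one column of a row-major C matrix, with illustrative dimensions, destination, and tag; the sketch is not part of the standard text.

/* Sketch: send column `col` of an NROWS x NCOLS row-major matrix.
   Dimensions, destination rank, and tag are illustrative assumptions. */
#include <mpi.h>

#define NROWS 10
#define NCOLS 20

void send_column(MPI_Comm comm, double a[NROWS][NCOLS], int col, int dest)
{
    MPI_Datatype column_type;

    /* NROWS blocks of 1 element each, with a stride of NCOLS elements
       between the starts of consecutive blocks. */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);

    MPI_Send(&a[0][col], 1, column_type, dest, 0, comm);

    MPI_Type_free(&column_type);
}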

Hvector. The function MPI_TYPE_HVECTOR is identical to MPI_TYPE_VECTOR, except that stride is given in bytes, rather than in elements. The use of both types of vector constructors is illustrated in the examples subsection below. (H stands for "heterogeneous".)

MPI_TYPE_HVECTOR(count, blocklength, stride, oldtype, newtype)

  IN   count        number of blocks (nonnegative integer)
  IN   blocklength  number of elements in each block (nonnegative integer)
  IN   stride       number of bytes between start of each block (integer)
  IN   oldtype      old datatype (handle)
  OUT  newtype      new datatype (handle)

int MPI_Type_hvector(int count, int blocklength, MPI_Aint stride,
    MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_TYPE_HVECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR

Assume that oldtype has type map

    {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

with extent ex. Let bl be the blocklength. The newly created datatype has a type map with count * bl * n entries:

    {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1}),
     (type_0, disp_0 + ex), ..., (type_{n-1}, disp_{n-1} + ex),
     ...,
     (type_0, disp_0 + (bl-1) * ex), ..., (type_{n-1}, disp_{n-1} + (bl-1) * ex),
     (type_0, disp_0 + stride), ..., (type_{n-1}, disp_{n-1} + stride),
     ...,
     (type_0, disp_0 + stride + (bl-1) * ex), ...,
     (type_{n-1}, disp_{n-1} + stride + (bl-1) * ex),
     ...,
     (type_0, disp_0 + stride * (count-1)), ...,
     (type_{n-1}, disp_{n-1} + stride * (count-1)),
     ...,
     (type_0, disp_0 + stride * (count-1) + (bl-1) * ex), ...,
     (type_{n-1}, disp_{n-1} + stride * (count-1) + (bl-1) * ex)}.

Indexed. The function MPI_TYPE_INDEXED allows replication of an old datatype into a sequence of blocks (each block is a concatenation of the old datatype), where each block can contain a different number of copies and have a different displacement. All block displacements are multiples of the old type extent.

MPI_TYPE_INDEXED(count, array_of_blocklengths, array_of_displacements, oldtype, newtype)

  IN   count                   number of blocks, also number of entries in
                               array_of_displacements and array_of_blocklengths
                               (nonnegative integer)
  IN   array_of_blocklengths   number of elements per block (array of nonnegative integers)
  IN   array_of_displacements  displacement for each block, in multiples of oldtype
                               extent (array of integer)
  IN   oldtype                 old datatype (handle)
  OUT  newtype                 new datatype (handle)

int MPI_Type_indexed(int count, int *array_of_blocklengths,
    int *array_of_displacements, MPI_Datatype oldtype,
    MPI_Datatype *newtype)

MPI_TYPE_INDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS,
    OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*),
    OLDTYPE, NEWTYPE, IERROR

Example Let oldtype have type map {(double, 0), (char, 8)}, with extent 16. Let B = (3, 1) and let D = (4, 0). A call to MPI_TYPE_INDEXED(2, B, D, oldtype, newtype) returns a datatype with type map

{(double, 64), (char, 72), (double, 80), (char, 88), (double, 96), (char, 104),
 (double, 0), (char, 8)}.

That is, three copies of the old type starting at displacement 64, and one copy starting at displacement 0.

In general, assume that oldtype has type map

{(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

with extent ex. Let B be the array_of_blocklength argument and D be the array_of_displacements argument. The newly created datatype has n · sum_{i=0}^{count-1} B[i] entries:

{(type_0, disp_0 + D[0]·ex), ..., (type_{n-1}, disp_{n-1} + D[0]·ex), ...,
 (type_0, disp_0 + (D[0]+B[0]-1)·ex), ..., (type_{n-1}, disp_{n-1} + (D[0]+B[0]-1)·ex), ...,
 (type_0, disp_0 + D[count-1]·ex), ..., (type_{n-1}, disp_{n-1} + D[count-1]·ex), ...,
 (type_0, disp_0 + (D[count-1]+B[count-1]-1)·ex), ...,
 (type_{n-1}, disp_{n-1} + (D[count-1]+B[count-1]-1)·ex)}.

A call to MPI_TYPE_VECTOR(count, blocklength, stride, oldtype, newtype) is equivalent to a call to MPI_TYPE_INDEXED(count, B, D, oldtype, newtype) where

D[j] = j · stride,  j = 0, ..., count-1,

and

B[j] = blocklength,  j = 0, ..., count-1.
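A common use of MPI_TYPE_INDEXED is to pick out an irregular subset of entries of an array in a single send. The following sketch is illustrative only; the selection array, its length bound, and the names are assumptions of the example.

#include <mpi.h>

/* Sketch: send the entries a[sel[0]], ..., a[sel[n-1]] of a double
   array in one call, using an indexed datatype with unit blocks.
   Assumes n <= 64 for this illustration. */
void send_selected(double *a, int *sel, int n, int dest, MPI_Comm comm)
{
    int blocklen[64], i;
    MPI_Datatype selection;

    for (i = 0; i < n; i++)
        blocklen[i] = 1;     /* one element per block; sel[i] is the   */
                             /* displacement in multiples of the extent */
    MPI_Type_indexed(n, blocklen, sel, MPI_DOUBLE, &selection);
    MPI_Type_commit(&selection);

    MPI_Send(a, 1, selection, dest, 0, comm);
    MPI_Type_free(&selection);
}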

Hindexed The function MPI_TYPE_HINDEXED is identical to MPI_TYPE_INDEXED, except that block displacements in array_of_displacements are specified in bytes, rather than in multiples of the oldtype extent.

MPI_TYPE_HINDEXED( count, array_of_blocklengths, array_of_displacements, oldtype, newtype)

  IN   count                    number of blocks, also number of entries in
                                array_of_displacements and array_of_blocklengths
                                (integer)
  IN   array_of_blocklengths    number of elements in each block (array of
                                nonnegative integers)
  IN   array_of_displacements   byte displacement of each block (array of integer)
  IN   oldtype                  old datatype (handle)
  OUT  newtype                  new datatype (handle)

int MPI_Type_hindexed(int count, int *array_of_blocklengths,
                      MPI_Aint *array_of_displacements, MPI_Datatype oldtype,
                      MPI_Datatype *newtype)

MPI_TYPE_HINDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS,
                  OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*),
            OLDTYPE, NEWTYPE, IERROR

Assume that oldtype has type map

{(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

with extent ex. Let B be the array_of_blocklength argument and D be the array_of_displacements argument. The newly created datatype has a type map with n · sum_{i=0}^{count-1} B[i] entries:

{(type_0, disp_0 + D[0]), ..., (type_{n-1}, disp_{n-1} + D[0]), ...,
 (type_0, disp_0 + D[0] + (B[0]-1)·ex), ..., (type_{n-1}, disp_{n-1} + D[0] + (B[0]-1)·ex), ...,
 (type_0, disp_0 + D[count-1]), ..., (type_{n-1}, disp_{n-1} + D[count-1]), ...,
 (type_0, disp_0 + D[count-1] + (B[count-1]-1)·ex), ...,
 (type_{n-1}, disp_{n-1} + D[count-1] + (B[count-1]-1)·ex)}.

Struct MPI_TYPE_STRUCT is the most general type constructor. It further generalizes the previous one in that it allows each block to consist of replications of different datatypes.

MPI_TYPE_STRUCT(count, array_of_blocklengths, array_of_displacements, array_of_types, newtype)

  IN   count                    number of blocks (integer), also number of entries
                                in arrays array_of_types, array_of_displacements
                                and array_of_blocklengths
  IN   array_of_blocklength     number of elements in each block (array of integer)
  IN   array_of_displacements   byte displacement of each block (array of integer)
  IN   array_of_types           type of elements in each block (array of handles
                                to datatype objects)
  OUT  newtype                  new datatype (handle)

int MPI_Type_struct(int count, int *array_of_blocklengths,
                    MPI_Aint *array_of_displacements,
                    MPI_Datatype *array_of_types, MPI_Datatype *newtype)

MPI_TYPE_STRUCT(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS,
                ARRAY_OF_TYPES, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*),
            ARRAY_OF_TYPES(*), NEWTYPE, IERROR

Example Let type1 have type map

{(double, 0), (char, 8)},

with extent 16. Let B = (2, 1, 3), D = (0, 16, 26), and T = (MPI_FLOAT, type1, MPI_CHAR). Then a call to MPI_TYPE_STRUCT(3, B, D, T, newtype) returns a datatype with type map

{(float, 0), (float, 4), (double, 16), (char, 24), (char, 26), (char, 27), (char, 28)}.

That is, two copies of MPI_FLOAT starting at 0, followed by one copy of type1 starting at 16, followed by three copies of MPI_CHAR, starting at 26. (We assume that a float occupies four bytes.)

In general, let T be the array_of_types argument, where T[i] is a handle to

typemap_i = {(type_0^i, disp_0^i), ..., (type_{n_i-1}^i, disp_{n_i-1}^i)},

with extent ex_i. Let B be the array_of_blocklength argument and D be the array_of_displacements argument. Let c be the count argument. Then the newly created datatype has a type map with sum_{i=0}^{c-1} B[i] · n_i entries:

{(type_0^0, disp_0^0 + D[0]), ..., (type_{n_0-1}^0, disp_{n_0-1}^0 + D[0]), ...,
 (type_0^0, disp_0^0 + D[0] + (B[0]-1)·ex_0), ..., (type_{n_0-1}^0, disp_{n_0-1}^0 + D[0] + (B[0]-1)·ex_0), ...,
 (type_0^{c-1}, disp_0^{c-1} + D[c-1]), ..., (type_{n_{c-1}-1}^{c-1}, disp_{n_{c-1}-1}^{c-1} + D[c-1]), ...,
 (type_0^{c-1}, disp_0^{c-1} + D[c-1] + (B[c-1]-1)·ex_{c-1}), ...,
 (type_{n_{c-1}-1}^{c-1}, disp_{n_{c-1}-1}^{c-1} + D[c-1] + (B[c-1]-1)·ex_{c-1})}.

A call to MPI_TYPE_HINDEXED(count, B, D, oldtype, newtype) is equivalent to a call to MPI_TYPE_STRUCT(count, B, D, T, newtype), where each entry of T is equal to oldtype.
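The typical use of MPI_TYPE_STRUCT is to describe a C structure. A minimal, non-normative C sketch follows; the structure itself is a hypothetical example, and the displacements are taken from offsetof here only for brevity (MPI_ADDRESS, described in the next subsection, can be used instead).

#include <mpi.h>
#include <stddef.h>

/* Hypothetical structure used only for this sketch. */
struct cell { int tag; double val[3]; };

/* Sketch: build an MPI datatype matching struct cell.
   Trailing padding, if any, is ignored here; see the MPI_UB
   discussion later in this section. */
void build_cell_type(MPI_Datatype *celltype)
{
    int          blocklen[2] = { 1, 3 };
    MPI_Aint     disp[2]     = { offsetof(struct cell, tag),
                                 offsetof(struct cell, val) };
    MPI_Datatype types[2]    = { MPI_INT, MPI_DOUBLE };

    MPI_Type_struct(2, blocklen, disp, types, celltype);
    MPI_Type_commit(celltype);
}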

Address and extent functions

The displacements in a general datatype are relative to some initial buffer address. Absolute addresses can be substituted for these displacements: we treat them as displacements relative to "address zero," the start of the address space. This initial address zero is indicated by the constant MPI_BOTTOM. Thus, a datatype can specify the absolute address of the entries in the communication buffer, in which case the buf argument is passed the value MPI_BOTTOM.
The address of a location in memory can be found by invoking the function MPI_ADDRESS.

MPI_ADDRESS(location, address)

  IN   location   location in caller memory (choice)
  OUT  address    address of location (integer)

int MPI_Address(void* location, MPI_Aint *address)

MPI_ADDRESS(LOCATION, ADDRESS, IERROR)
    <type> LOCATION(*)
    INTEGER ADDRESS, IERROR

Returns the (byte) address of location.

Example Using MPI_ADDRESS for an array.

REAL A(100,100)
INTEGER I1, I2, DIFF
CALL MPI_ADDRESS(A(1,1), I1, IERROR)
CALL MPI_ADDRESS(A(10,10), I2, IERROR)
DIFF = I2 - I1

The value of DIFF is 909*sizeofreal; the values of I1 and I2 are implementation dependent.

Advice to users. C users may be tempted to avoid the usage of MPI_ADDRESS and rely on the availability of the address operator &. Note, however, that & cast-expression is a pointer, not an address. ANSI C does not require that the value of a pointer (or the pointer cast to int) be the absolute address of the object pointed at, although this is commonly the case. Furthermore, referencing may not have a unique definition on machines with a segmented address space. The use of MPI_ADDRESS to "reference" C variables guarantees portability to such machines as well. (End of advice to users.)

The following auxiliary functions provide useful information on derived datatypes.

MPI_TYPE_EXTENT(datatype, extent)

  IN   datatype   datatype (handle)
  OUT  extent     datatype extent (integer)

int MPI_Type_extent(MPI_Datatype datatype, int *extent)

MPI_TYPE_EXTENT(DATATYPE, EXTENT, IERROR)
    INTEGER DATATYPE, EXTENT, IERROR

Returns the extent of a datatype, where extent is as defined by the extent equation given earlier in this section.

MPI_TYPE_SIZE(datatype, size)

  IN   datatype   datatype (handle)
  OUT  size       datatype size (integer)

int MPI_Type_size(MPI_Datatype datatype, int *size)

MPI_TYPE_SIZE(DATATYPE, SIZE, IERROR)
    INTEGER DATATYPE, SIZE, IERROR

MPI_TYPE_SIZE returns the total size, in bytes, of the entries in the type signature associated with datatype; i.e., the total size of the data in a message that would be created with this datatype. Entries that occur multiple times in the datatype are counted with their multiplicity.

MPI_TYPE_COUNT(datatype, count)

  IN   datatype   datatype (handle)
  OUT  count      datatype count (integer)

int MPI_Type_count(MPI_Datatype datatype, int *count)

MPI_TYPE_COUNT(DATATYPE, COUNT, IERROR)
    INTEGER DATATYPE, COUNT, IERROR

Returns the number of "top-level" entries in the datatype.
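The difference between the two query functions is easiest to see on a strided datatype: the size counts only the data that would actually be transferred, while the extent also spans the gaps. A short, non-normative sketch (the vector parameters are arbitrary) follows; note that this document's C binding declares the extent output as int*, whereas later MPI versions use MPI_Aint*.

#include <mpi.h>
#include <stdio.h>

/* Sketch: for a vector of 3 blocks of 2 doubles with stride 4,
   MPI_TYPE_SIZE reports 3*2*sizeof(double) = 48 bytes of data,
   while MPI_TYPE_EXTENT spans from the first element to one past
   the last, (2*4 + 2)*sizeof(double) = 80 bytes. */
void report_size_and_extent(void)
{
    MPI_Datatype vec;
    MPI_Aint extent;   /* int in this document's binding; MPI_Aint later */
    int size;

    MPI_Type_vector(3, 2, 4, MPI_DOUBLE, &vec);
    MPI_Type_extent(vec, &extent);
    MPI_Type_size(vec, &size);
    printf("size = %d bytes, extent = %ld bytes\n", size, (long)extent);
    MPI_Type_free(&vec);
}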

Lower-bound and upper-bound markers

It is often convenient to define explicitly the lower bound and upper bound of a type map, and override the definition of extent given earlier. This allows one to define a datatype that has "holes" at its beginning or its end, or a datatype with entries that extend above the upper bound or below the lower bound. Examples of such usage are provided later in this section. To achieve this, we add two additional "pseudo-datatypes," MPI_LB and MPI_UB, that can be used, respectively, to mark the lower bound or the upper bound of a datatype. These pseudo-datatypes occupy no space (extent(MPI_LB) = extent(MPI_UB) = 0). They do not affect the size or count of a datatype, and do not affect the content of a message created with this datatype. However, they do affect the definition of the extent of a datatype and, therefore, affect the outcome of a replication of this datatype by a datatype constructor.

Example Let D = (-3, 0, 6); T = (MPI_LB, MPI_INT, MPI_UB), and B = (1, 1, 1). Then a call to MPI_TYPE_STRUCT(3, B, D, T, type1) creates a new datatype that has an extent of 9 (from -3 to 5, 5 included), and contains an integer at displacement 0. This is the datatype defined by the sequence {(lb, -3), (int, 0), (ub, 6)}. If this type is replicated twice by a call to MPI_TYPE_CONTIGUOUS(2, type1, type2) then the newly created type can be described by the sequence {(lb, -3), (int, 0), (int, 9), (ub, 15)}. (Entries of type lb or ub can be deleted if they are not at the endpoints of the datatype.)

In general, if

Typemap = {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

then the lower bound of Typemap is defined to be

lb(Typemap) = min_j disp_j                          if no entry has basic type lb,
              min_j {disp_j such that type_j = lb}  otherwise.

Similarly, the upper bound of Typemap is defined to be

ub(Typemap) = max_j (disp_j + sizeof(type_j)) + e   if no entry has basic type ub,
              max_j {disp_j such that type_j = ub}  otherwise.

Then

extent(Typemap) = ub(Typemap) - lb(Typemap).

If type_i requires alignment to a byte address that is a multiple of k_i, then e is the least nonnegative increment needed to round extent(Typemap) to the next multiple of max_i k_i.
The formal definitions given for the various datatype constructors apply now, with the amended definition of extent.
The two functions below can be used for finding the lower bound and the upper bound of a datatype.

MPI_TYPE_LB( datatype, displacement)

  IN   datatype      datatype (handle)
  OUT  displacement  displacement of lower bound from origin, in bytes (integer)

int MPI_Type_lb(MPI_Datatype datatype, int* displacement)

MPI_TYPE_LB( DATATYPE, DISPLACEMENT, IERROR)
    INTEGER DATATYPE, DISPLACEMENT, IERROR

MPI_TYPE_UB( datatype, displacement)

  IN   datatype      datatype (handle)
  OUT  displacement  displacement of upper bound from origin, in bytes (integer)

int MPI_Type_ub(MPI_Datatype datatype, int* displacement)

MPI_TYPE_UB( DATATYPE, DISPLACEMENT, IERROR)
    INTEGER DATATYPE, DISPLACEMENT, IERROR

Rationale. Note that the rules given in the section on correct use of addresses imply that it is erroneous to call MPI_TYPE_EXTENT, MPI_TYPE_LB and MPI_TYPE_UB with a datatype argument that contains absolute addresses, unless all these addresses are within the same sequential storage. For this reason, the displacement for the C binding in MPI_TYPE_UB is an int, and not of type MPI_Aint. (End of rationale.)
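One frequent use of the MPI_UB marker is to stretch the extent of a structure datatype so that it covers compiler-inserted trailing padding. The sketch below is illustrative only; the structure and the counts are assumptions of the example.

#include <mpi.h>

/* Hypothetical structure, as in the earlier sketch. */
struct cell { int tag; double val[3]; };

/* Sketch: give `oldtype` the extent of one struct cell (including any
   trailing padding) by appending an MPI_UB marker at sizeof(struct cell). */
void pad_to_struct_extent(MPI_Datatype oldtype, MPI_Datatype *padded)
{
    int          blocklen[2] = { 1, 1 };
    MPI_Aint     disp[2]     = { 0, (MPI_Aint)sizeof(struct cell) };
    MPI_Datatype types[2]    = { oldtype, MPI_UB };

    MPI_Type_struct(2, blocklen, disp, types, padded);
    MPI_Type_commit(padded);
}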

Commit and free

A datatype object has to be committed before it can be used in a communication. A committed datatype can still be used as an argument in datatype constructors. There is no need to commit basic datatypes. They are "pre-committed."

MPI_TYPE_COMMIT(datatype)

  INOUT  datatype   datatype that is committed (handle)

int MPI_Type_commit(MPI_Datatype *datatype)

MPI_TYPE_COMMIT(DATATYPE, IERROR)
    INTEGER DATATYPE, IERROR

The commit operation commits the datatype, that is, the formal description of a communication buffer, not the content of that buffer. Thus, after a datatype has been committed, it can be repeatedly reused to communicate the changing content of a buffer or, indeed, the content of different buffers, with different starting addresses.

Advice to implementors. The system may "compile" at commit time an internal representation for the datatype that facilitates communication, e.g. change from a compacted representation to a flat representation of the datatype, and select the most convenient transfer mechanism. (End of advice to implementors.)

MPI_TYPE_FREE(datatype)

  INOUT  datatype   datatype that is freed (handle)

int MPI_Type_free(MPI_Datatype *datatype)

MPI_TYPE_FREE(DATATYPE, IERROR)
    INTEGER DATATYPE, IERROR

Marks the datatype object associated with datatype for deallocation and sets datatype to MPI_DATATYPE_NULL. Any communication that is currently using this datatype will complete normally. Derived datatypes that were defined from the freed datatype are not affected.

Example The following code fragment gives examples of using MPI_TYPE_COMMIT.

INTEGER type1, type2
CALL MPI_TYPE_CONTIGUOUS(5, MPI_REAL, type1, ierr)
              ! new type object created
CALL MPI_TYPE_COMMIT(type1, ierr)
              ! now type1 can be used for communication
type2 = type1
              ! type2 can be used for communication
              ! (it is a handle to same object as type1)
CALL MPI_TYPE_VECTOR(3, 5, 4, MPI_REAL, type1, ierr)
              ! new, uncommitted type object created
CALL MPI_TYPE_COMMIT(type1, ierr)
              ! now type1 can be used anew for communication

Freeing a datatype does not affect any other datatype that was built from the freed datatype. The system behaves as if input datatype arguments to derived datatype constructors are passed by value.

Advice to implementors. The implementation may keep a reference count of active communications that use the datatype, in order to decide when to free it. Also, one may implement constructors of derived datatypes so that they keep pointers to their datatype arguments, rather than copying them. In this case, one needs to keep track of active datatype definition references in order to know when a datatype object can be freed. (End of advice to implementors.)
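The freeing rule above means that a datatype handle can be released as soon as the communication that uses it has been started. A minimal, non-normative C sketch (buffer, stride and destination are assumptions of the example):

#include <mpi.h>

/* Sketch: the datatype may be marked for deallocation while a
   nonblocking send that uses it is still pending; the send still
   completes normally. */
void send_and_release(double *buf, int dest, MPI_Comm comm)
{
    MPI_Datatype vec;
    MPI_Request  req;
    MPI_Status   status;

    MPI_Type_vector(10, 1, 5, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    MPI_Isend(buf, 1, vec, dest, 0, comm, &req);
    MPI_Type_free(&vec);          /* vec is set to MPI_DATATYPE_NULL    */
    MPI_Wait(&req, &status);      /* communication completes normally   */
}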

Use of general datatypes in communication

Handles to derived datatypes can be passed to a communication call wherever a datatype argument is required. A call of the form MPI_SEND(buf, count, datatype, ...), where count > 1, is interpreted as if the call was passed a new datatype which is the concatenation of count copies of datatype. Thus, MPI_SEND(buf, count, datatype, dest, tag, comm) is equivalent to:

MPI_TYPE_CONTIGUOUS(count, datatype, newtype)
MPI_TYPE_COMMIT(newtype)
MPI_SEND(buf, 1, newtype, dest, tag, comm).

Similar statements apply to all other communication functions that have a count and datatype argument.

Suppose that a send operation MPI_SEND(buf, count, datatype, dest, tag, comm) is executed, where datatype has type map

{(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

and extent extent. (Empty entries of "pseudo-type" MPI_UB and MPI_LB are not listed in the type map, but they affect the value of extent.) The send operation sends n·count entries, where entry i·n + j is at location addr_{i,j} = buf + extent·i + disp_j and has type type_j, for i = 0, ..., count-1 and j = 0, ..., n-1. These entries need not be contiguous, nor distinct; their order can be arbitrary.
The variable stored at address addr_{i,j} in the calling program should be of a type that matches type_j, where type matching is defined as in the section on type matching rules. The message sent contains n·count entries, where entry i·n + j has type type_j.
Similarly, suppose that a receive operation MPI_RECV(buf, count, datatype, source, tag, comm, status) is executed, where datatype has type map

{(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

with extent extent. (Again, empty entries of "pseudo-type" MPI_UB and MPI_LB are not listed in the type map, but they affect the value of extent.) This receive operation receives n·count entries, where entry i·n + j is at location buf + extent·i + disp_j and has type type_j. If the incoming message consists of k elements, then we must have k ≤ n·count; the i·n + j-th element of the message should have a type that matches type_j.
Type matching is defined according to the type signature of the corresponding datatypes, that is, the sequence of basic type components. Type matching does not depend on some aspects of the datatype definition, such as the displacements (layout in memory) or the intermediate types used.

Example This example shows that type matching is defined in terms of the basic types that a derived type consists of.

...
CALL MPI_TYPE_CONTIGUOUS( 2, MPI_REAL, type2, ...)
CALL MPI_TYPE_CONTIGUOUS( 4, MPI_REAL, type4, ...)
CALL MPI_TYPE_CONTIGUOUS( 2, type2, type22, ...)
...
CALL MPI_SEND( a, 4, MPI_REAL, ...)
CALL MPI_SEND( a, 2, type2, ...)
CALL MPI_SEND( a, 1, type22, ...)
CALL MPI_SEND( a, 1, type4, ...)
...
CALL MPI_RECV( a, 4, MPI_REAL, ...)
CALL MPI_RECV( a, 2, type2, ...)
CALL MPI_RECV( a, 1, type22, ...)
CALL MPI_RECV( a, 1, type4, ...)

Each of the sends matches any of the receives.

A datatype may specify overlapping entries. If such a datatype is used in a receive operation, that is, if some part of the receive buffer is written more than once by the receive operation, then the call is erroneous.
Suppose that MPI_RECV(buf, count, datatype, source, tag, comm, status) is executed, where datatype has type map

{(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})}.

The received message need not fill all the receive buffer, nor does it need to fill a number of locations which is a multiple of n. Any number, k, of basic elements can be received, where 0 ≤ k ≤ count·n. The number of basic elements received can be retrieved from status using the query function MPI_GET_ELEMENTS.

MPI_GET_ELEMENTS( status, datatype, count)

  IN   status     return status of receive operation (Status)
  IN   datatype   datatype used by receive operation (handle)
  OUT  count      number of received basic elements (integer)

int MPI_Get_elements(MPI_Status *status, MPI_Datatype datatype, int *count)

MPI_GET_ELEMENTS(STATUS, DATATYPE, COUNT, IERROR)
    INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR

The previously defined function, MPI_GET_COUNT (see the section on return status), has a different behavior. It returns the number of "top-level" entries received. In the previous example, MPI_GET_COUNT may return any integer value k, where 0 ≤ k ≤ count. If MPI_GET_COUNT returns k, then the number of basic elements received (and the value returned by MPI_GET_ELEMENTS) is n·k. If the number of basic elements received is not a multiple of n, that is, if the receive operation has not received an integral number of datatype "copies," then MPI_GET_COUNT returns the value MPI_UNDEFINED.

Example Usage of MPI_GET_COUNT and MPI_GET_ELEMENTS.

...
CALL MPI_TYPE_CONTIGUOUS(2, MPI_REAL, Type2, ierr)
CALL MPI_TYPE_COMMIT(Type2, ierr)
...
CALL MPI_COMM_RANK(comm, rank, ierr)
IF(rank.EQ.0) THEN
      CALL MPI_SEND(a, 2, MPI_REAL, 1, 0, comm, ierr)
      CALL MPI_SEND(a, 3, MPI_REAL, 1, 0, comm, ierr)
ELSE
      CALL MPI_RECV(a, 2, Type2, 0, 0, comm, stat, ierr)
      CALL MPI_GET_COUNT(stat, Type2, i, ierr)     ! returns i=1
      CALL MPI_GET_ELEMENTS(stat, Type2, i, ierr)  ! returns i=2
      CALL MPI_RECV(a, 2, Type2, 0, 0, comm, stat, ierr)
      CALL MPI_GET_COUNT(stat, Type2, i, ierr)     ! returns i=MPI_UNDEFINED
      CALL MPI_GET_ELEMENTS(stat, Type2, i, ierr)  ! returns i=3
END IF

The function MPI_GET_ELEMENTS can also be used after a probe to find the number of elements in the probed message. Note that the two functions MPI_GET_COUNT and MPI_GET_ELEMENTS return the same values when they are used with basic datatypes.

Rationale. The extension given to the definition of MPI_GET_COUNT seems natural: one would expect this function to return the value of the count argument, when the receive buffer is filled. Sometimes datatype represents a basic unit of data one wants to transfer, for example, a record in an array of records (structures). One should be able to find out how many components were received without bothering to divide by the number of elements in each component. However, on other occasions, datatype is used to define a complex layout of data in the receiver memory, and does not represent a basic unit of data for transfers. In such cases, one needs to use the function MPI_GET_ELEMENTS. (End of rationale.)

Advice to implementors. The definition implies that a receive cannot change the value of storage outside the entries defined to compose the communication buffer. In particular, the definition implies that padding space in a structure should not be modified when such a structure is copied from one process to another. This would prevent the obvious optimization of copying the structure, together with the padding, as one contiguous block. The implementation is free to do this optimization when it does not impact the outcome of the computation. The user can "force" this optimization by explicitly including padding as part of the message. (End of advice to implementors.)

Correct use of addresses

Successively declared variables in C or Fortran are not necessarily stored at contiguous locations. Thus, care must be exercised that displacements do not cross from one variable to another. Also, in machines with a segmented address space, addresses are not unique and address arithmetic has some peculiar properties. Thus, the use of addresses, that is, displacements relative to the start address MPI_BOTTOM, has to be restricted.
Variables belong to the same sequential storage if they belong to the same array, to the same COMMON block in Fortran, or to the same structure in C. Valid addresses are defined recursively as follows:

- The function MPI_ADDRESS returns a valid address, when passed as argument a variable of the calling program.
- The buf argument of a communication function evaluates to a valid address, when passed as argument a variable of the calling program.
- If v is a valid address, and i is an integer, then v+i is a valid address, provided v and v+i are in the same sequential storage.
- If v is a valid address, then MPI_BOTTOM + v is a valid address.

A correct program uses only valid addresses to identify the locations of entries in communication buffers. Furthermore, if u and v are two valid addresses, then the (integer) difference u - v can be computed only if both u and v are in the same sequential storage. No other arithmetic operations can be meaningfully executed on addresses.
The rules above impose no constraints on the use of derived datatypes, as long as they are used to define a communication buffer that is wholly contained within the same sequential storage. However, the construction of a communication buffer that contains variables that are not within the same sequential storage must obey certain restrictions. Basically, a communication buffer with variables that are not within the same sequential storage can be used only by specifying in the communication call buf = MPI_BOTTOM, count = 1, and using a datatype argument where all displacements are valid (absolute) addresses.

Advice to users. It is not expected that MPI implementations will be able to detect erroneous, "out of bound" displacements (unless those overflow the user address space), since the MPI call may not know the extent of the arrays and records in the host program. (End of advice to users.)

Advice to implementors. There is no need to distinguish (absolute) addresses and (relative) displacements on a machine with contiguous address space: MPI_BOTTOM is zero, and both addresses and displacements are integers. On machines where the distinction is required, addresses are recognized as expressions that involve MPI_BOTTOM. (End of advice to implementors.)
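To make the restriction concrete, the following non-normative C sketch builds a datatype whose displacements are the absolute addresses of two separately declared variables, and sends both in one message with buf = MPI_BOTTOM and count = 1; the variable names are illustrative only.

#include <mpi.h>

int    n;        /* two variables, possibly in different sequential storage */
double x;

/* Sketch: send n and x together using absolute addresses and MPI_BOTTOM. */
void send_n_and_x(int dest, MPI_Comm comm)
{
    int          blocklen[2] = { 1, 1 };
    MPI_Aint     disp[2];
    MPI_Datatype types[2] = { MPI_INT, MPI_DOUBLE };
    MPI_Datatype pair;

    MPI_Address(&n, &disp[0]);    /* absolute addresses as displacements */
    MPI_Address(&x, &disp[1]);

    MPI_Type_struct(2, blocklen, disp, types, &pair);
    MPI_Type_commit(&pair);
    MPI_Send(MPI_BOTTOM, 1, pair, dest, 0, comm);
    MPI_Type_free(&pair);
}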

Examples

The following examples illustrate the use of derived datatypes.

Example Send and receive a section of a 3D array.

      REAL a(100,100,100), e(9,9,9)
      INTEGER oneslice, twoslice, threeslice, sizeofreal, myrank, ierr
      INTEGER status(MPI_STATUS_SIZE)

C     extract the section a(1:17:2, 3:11, 2:10)
C     and store it in e(:,:,:).

      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

      CALL MPI_TYPE_EXTENT( MPI_REAL, sizeofreal, ierr)

C     create datatype for a 1D section
      CALL MPI_TYPE_VECTOR( 9, 1, 2, MPI_REAL, oneslice, ierr)

C     create datatype for a 2D section
      CALL MPI_TYPE_HVECTOR( 9, 1, 100*sizeofreal, oneslice, twoslice, ierr)

C     create datatype for the entire section
      CALL MPI_TYPE_HVECTOR( 9, 1, 100*100*sizeofreal, twoslice,
     .                       threeslice, ierr)

      CALL MPI_TYPE_COMMIT( threeslice, ierr)
      CALL MPI_SENDRECV(a(1,3,2), 1, threeslice, myrank, 0, e, 9*9*9,
     .                  MPI_REAL, myrank, 0, MPI_COMM_WORLD, status, ierr)

Example Copy the (strictly) lower triangular part of a matrix.

      REAL a(100,100), b(100,100)
      INTEGER disp(100), blocklen(100), ltype, myrank, ierr
      INTEGER status(MPI_STATUS_SIZE)

C     copy lower triangular part of array a
C     onto lower triangular part of array b

      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

C     compute start and size of each column
      DO i=1, 100
        disp(i) = 100*(i-1) + i
        blocklen(i) = 100 - i
      END DO

C     create datatype for lower triangular part
      CALL MPI_TYPE_INDEXED( 100, blocklen, disp, MPI_REAL, ltype, ierr)

      CALL MPI_TYPE_COMMIT(ltype, ierr)
      CALL MPI_SENDRECV( a, 1, ltype, myrank, 0, b, 1,
     .                   ltype, myrank, 0, MPI_COMM_WORLD, status, ierr)

Example Transpose a matrix.

      REAL a(100,100), b(100,100)
      INTEGER row, xpose, sizeofreal, myrank, ierr
      INTEGER status(MPI_STATUS_SIZE)

C     transpose matrix a onto b

      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

      CALL MPI_TYPE_EXTENT( MPI_REAL, sizeofreal, ierr)

C     create datatype for one row
      CALL MPI_TYPE_VECTOR( 100, 1, 100, MPI_REAL, row, ierr)

C     create datatype for matrix in row-major order
      CALL MPI_TYPE_HVECTOR( 100, 1, sizeofreal, row, xpose, ierr)

      CALL MPI_TYPE_COMMIT( xpose, ierr)

C     send matrix in row-major order and receive in column major order
      CALL MPI_SENDRECV( a, 1, xpose, myrank, 0, b, 100*100,
     .                   MPI_REAL, myrank, 0, MPI_COMM_WORLD, status, ierr)

Example Another approach to the transpose problem:

      REAL a(100,100), b(100,100)
      INTEGER disp(2), blocklen(2), type(2), row, row1, sizeofreal
      INTEGER myrank, ierr
      INTEGER status(MPI_STATUS_SIZE)

      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

C     transpose matrix a onto b

      CALL MPI_TYPE_EXTENT( MPI_REAL, sizeofreal, ierr)

C     create datatype for one row
      CALL MPI_TYPE_VECTOR( 100, 1, 100, MPI_REAL, row, ierr)

C     create datatype for one row, with the extent of one real number
      disp(1) = 0
      disp(2) = sizeofreal
      type(1) = row
      type(2) = MPI_UB
      blocklen(1) = 1
      blocklen(2) = 1
      CALL MPI_TYPE_STRUCT( 2, blocklen, disp, type, row1, ierr)

      CALL MPI_TYPE_COMMIT( row1, ierr)

C     send 100 rows and receive in column major order
      CALL MPI_SENDRECV( a, 100, row1, myrank, 0, b, 100*100,
     .                   MPI_REAL, myrank, 0, MPI_COMM_WORLD, status, ierr)

Example We manipulate an array of structures.

struct Partstruct
   {
   int    class;   /* particle class */
   double d[6];    /* particle coordinates */
   char   b[7];    /* some additional information */
   };

struct Partstruct  particle[1000];

int                i, dest, rank;
MPI_Comm           comm;


/* build datatype describing structure */

MPI_Datatype Particletype;
MPI_Datatype type[3] = {MPI_INT, MPI_DOUBLE, MPI_CHAR};
int          blocklen[3] = {1, 6, 7};
MPI_Aint     disp[3];
int          base;


/* compute displacements of structure components */

MPI_Address( particle, disp);
MPI_Address( particle[0].d, disp+1);
MPI_Address( particle[0].b, disp+2);
base = disp[0];
for (i=0; i < 3; i++) disp[i] -= base;

MPI_Type_struct( 3, blocklen, disp, type, &Particletype);

   /* If compiler does padding in mysterious ways,
      the following may be safer */

MPI_Datatype type1[4] = {MPI_INT, MPI_DOUBLE, MPI_CHAR, MPI_UB};
int          blocklen1[4] = {1, 6, 7, 1};
MPI_Aint     disp1[4];

/* compute displacements of structure components */

MPI_Address( particle, disp1);
MPI_Address( particle[0].d, disp1+1);
MPI_Address( particle[0].b, disp1+2);
MPI_Address( particle+1, disp1+3);
base = disp1[0];
for (i=0; i < 4; i++) disp1[i] -= base;

/* build datatype describing structure */

MPI_Type_struct( 4, blocklen1, disp1, type1, &Particletype);


            /* send the entire array */

MPI_Type_commit( &Particletype);
MPI_Send( particle, 1000, Particletype, dest, tag, comm);


            /* send only the entries of class zero particles,
               preceded by the number of such entries */

MPI_Datatype Zparticles;   /* datatype describing all particles
                              with class zero (needs to be recomputed
                              if classes change) */
MPI_Datatype Ztype;

MPI_Aint     zdisp[1000];
int          zblock[1000], j, k;
int          zzblock[2] = {1, 1};
MPI_Aint     zzdisp[2];
MPI_Datatype zztype[2];

/* compute displacements of class zero particles */
j = 0;
for (i=0; i < 1000; i++)
  if (particle[i].class == 0)
     {
     zdisp[j] = i;
     zblock[j] = 1;
     j++;
     }

/* create datatype for class zero particles */
MPI_Type_indexed( j, zblock, zdisp, Particletype, &Zparticles);

/* prepend particle count */
MPI_Address(&j, zzdisp);
MPI_Address(particle, zzdisp+1);
zztype[0] = MPI_INT;
zztype[1] = Zparticles;
MPI_Type_struct(2, zzblock, zzdisp, zztype, &Ztype);

MPI_Type_commit( &Ztype);
MPI_Send( MPI_BOTTOM, 1, Ztype, dest, tag, comm);


       /* A probably more efficient way of defining Zparticles:
          consecutive particles with class zero are handled as one block */

j = 0;
for (i=0; i < 1000; i++)
  if (particle[i].class == 0)
    {
    for (k=i+1; (k < 1000) && (particle[k].class == 0); k++);
    zdisp[j] = i;
    zblock[j] = k-i;
    j++;
    i = k;
    }
MPI_Type_indexed( j, zblock, zdisp, Particletype, &Zparticles);


       /* send the first two coordinates of all entries */

MPI_Datatype Allpairs;   /* datatype for all pairs of coordinates */

MPI_Aint sizeofentry;

MPI_Type_extent( Particletype, &sizeofentry);

   /* sizeofentry can also be computed by subtracting the address
      of particle[0] from the address of particle[1] */

MPI_Type_hvector( 1000, 2, sizeofentry, MPI_DOUBLE, &Allpairs);
MPI_Type_commit( &Allpairs);
MPI_Send( particle[0].d, 1, Allpairs, dest, tag, comm);

   /* an alternative solution */

MPI_Datatype Onepair;    /* datatype for one pair of coordinates, with
                            the extent of one particle entry */
MPI_Aint     disp2[3];
MPI_Datatype type2[3] = {MPI_LB, MPI_DOUBLE, MPI_UB};
int          blocklen2[3] = {1, 2, 1};

MPI_Address( particle, disp2);
MPI_Address( particle[0].d, disp2+1);
MPI_Address( particle+1, disp2+2);
base = disp2[0];
for (i=0; i < 3; i++) disp2[i] -= base;

MPI_Type_struct( 3, blocklen2, disp2, type2, &Onepair);
MPI_Type_commit( &Onepair);
MPI_Send( particle[0].d, 1000, Onepair, dest, tag, comm);

Example The same manipulations as in the previous example, but use absolute addresses in datatypes.

struct Partstruct
   {
   int    class;
   double d[6];
   char   b[7];
   };

struct Partstruct particle[1000];

         /* build datatype describing first array entry */

MPI_Datatype Particletype;
MPI_Datatype type[3] = {MPI_INT, MPI_DOUBLE, MPI_CHAR};
int          block[3] = {1, 6, 7};
MPI_Aint     disp[3];

MPI_Address( particle, disp);
MPI_Address( particle[0].d, disp+1);
MPI_Address( particle[0].b, disp+2);
MPI_Type_struct( 3, block, disp, type, &Particletype);

/* Particletype describes first array entry -- using absolute
   addresses */

         /* send the entire array */

MPI_Type_commit( &Particletype);
MPI_Send( MPI_BOTTOM, 1000, Particletype, dest, tag, comm);


         /* send the entries of class zero,
            preceded by the number of such entries */

MPI_Datatype Zparticles, Ztype;

MPI_Aint     zdisp[1000];
int          zblock[1000], i, j, k;
int          zzblock[2] = {1, 1};
MPI_Datatype zztype[2];
MPI_Aint     zzdisp[2];

j = 0;
for (i=0; i < 1000; i++)
  if (particle[i].class == 0)
    {
    for (k=i+1; (k < 1000) && (particle[k].class == 0); k++);
    zdisp[j] = i;
    zblock[j] = k-i;
    j++;
    i = k;
    }
MPI_Type_indexed( j, zblock, zdisp, Particletype, &Zparticles);
/* Zparticles describe particles with class zero, using
   their absolute addresses */

/* prepend particle count */
MPI_Address(&j, zzdisp);
zzdisp[1] = MPI_BOTTOM;
zztype[0] = MPI_INT;
zztype[1] = Zparticles;
MPI_Type_struct(2, zzblock, zzdisp, zztype, &Ztype);

MPI_Type_commit( &Ztype);
MPI_Send( MPI_BOTTOM, 1, Ztype, dest, tag, comm);

Example Handling of unions.

union {
   int    ival;
   float  fval;
      } u[1000];

int    utype;

/* All entries of u have identical type; variable
   utype keeps track of their current type */

MPI_Datatype  type[2];
int           blocklen[2] = {1, 1};
MPI_Aint      disp[2];
MPI_Datatype  mpi_utype[2];
MPI_Aint      i, j;

/* compute an MPI datatype for each possible union type;
   assume values are left-aligned in union storage. */

MPI_Address( u, &i);
MPI_Address( u+1, &j);
disp[0] = 0;  disp[1] = j-i;
type[1] = MPI_UB;

type[0] = MPI_INT;
MPI_Type_struct(2, blocklen, disp, type, &mpi_utype[0]);

type[0] = MPI_FLOAT;
MPI_Type_struct(2, blocklen, disp, type, &mpi_utype[1]);

for(i=0; i<2; i++) MPI_Type_commit(&mpi_utype[i]);

/* actual communication */

MPI_Send(u, 1000, mpi_utype[utype], dest, tag, comm);

Pack and unpack

Some existing communication libraries provide pack/unpack functions for sending noncontiguous data. In these, the user explicitly packs data into a contiguous buffer before sending it, and unpacks it from a contiguous buffer after receiving it. Derived datatypes, which are described in the previous section, allow one, in most cases, to avoid explicit packing and unpacking. The user specifies the layout of the data to be sent or received, and the communication library directly accesses a noncontiguous buffer. The pack/unpack routines are provided for compatibility with previous libraries. Also, they provide some functionality that is not otherwise available in MPI. For instance, a message can be received in several parts, where the receive operation done on a later part may depend on the content of a former part. Another use is that outgoing messages may be explicitly buffered in user supplied space, thus overriding the system buffering policy. Finally, the availability of pack and unpack operations facilitates the development of additional communication libraries layered on top of MPI.

MPI_PACK(inbuf, incount, datatype, outbuf, outcount, position, comm)

  IN     inbuf      input buffer start (choice)
  IN     incount    number of input data items (integer)
  IN     datatype   datatype of each input data item (handle)
  OUT    outbuf     output buffer start (choice)
  IN     outcount   output buffer size, in bytes (integer)
  INOUT  position   current position in buffer, in bytes (integer)
  IN     comm       communicator for packed message (handle)

int MPI_Pack(void* inbuf, int incount, MPI_Datatype datatype, void *outbuf,
             int outcount, int *position, MPI_Comm comm)

MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTCOUNT, POSITION, COMM, IERROR)
    <type> INBUF(*), OUTBUF(*)
    INTEGER INCOUNT, DATATYPE, OUTCOUNT, POSITION, COMM, IERROR

Packs the message in the send buffer specified by inbuf, incount, datatype into the buffer space specified by outbuf and outcount. The input buffer can be any communication buffer allowed in MPI_SEND. The output buffer is a contiguous storage area containing outcount bytes, starting at the address outbuf (length is counted in bytes, not elements, as if it were a communication buffer for a message of type MPI_PACKED).
The input value of position is the first location in the output buffer to be used for packing. position is incremented by the size of the packed message, and the output value of position is the first location in the output buffer following the locations occupied by the packed message. The comm argument is the communicator that will be subsequently used for sending the packed message.

MPI_UNPACK(inbuf, insize, position, outbuf, outcount, datatype, comm)

  IN     inbuf      input buffer start (choice)
  IN     insize     size of input buffer, in bytes (integer)
  INOUT  position   current position in bytes (integer)
  OUT    outbuf     output buffer start (choice)
  IN     outcount   number of items to be unpacked (integer)
  IN     datatype   datatype of each output data item (handle)
  IN     comm       communicator for packed message (handle)

int MPI_Unpack(void* inbuf, int insize, int *position, void *outbuf,
               int outcount, MPI_Datatype datatype, MPI_Comm comm)

MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERROR)
    <type> INBUF(*), OUTBUF(*)
    INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR

Unpacks a message into the receive buffer specified by outbuf, outcount, datatype from the buffer space specified by inbuf and insize. The output buffer can be any communication buffer allowed in MPI_RECV. The input buffer is a contiguous storage area containing insize bytes, starting at address inbuf. The input value of position is the first location in the input buffer occupied by the packed message. position is incremented by the size of the packed message, so that the output value of position is the first location in the input buffer after the locations occupied by the message that was unpacked. comm is the communicator used to receive the packed message.

Advice to users. Note the difference between MPI_RECV and MPI_UNPACK: in MPI_RECV, the count argument specifies the maximum number of items that can be received. The actual number of items received is determined by the length of the incoming message. In MPI_UNPACK, the count argument specifies the actual number of items that are unpacked; the "size" of the corresponding message is the increment in position. The reason for this change is that the "incoming message size" is not predetermined since the user decides how much to unpack; nor is it easy to determine the "message size" from the number of items to be unpacked. In fact, in a heterogeneous system, this number may not be determined a priori. (End of advice to users.)

To understand the behavior of pack and unpack, it is convenient to think of the data part of a message as being the sequence obtained by concatenating the successive values sent in that message. The pack operation stores this sequence in the buffer space, as if sending the message to that buffer. The unpack operation retrieves this sequence from buffer space, as if receiving a message from that buffer. (It is helpful to think of internal Fortran files or sscanf in C, for a similar function.)
Several messages can be successively packed into one packing unit. This is effected by several successive related calls to MPI_PACK, where the first call provides position = 0, and each successive call inputs the value of position that was output by the previous call, and the same values for outbuf, outcount and comm. This packing unit now contains the equivalent information that would have been stored in a message by one send call with a send buffer that is the "concatenation" of the individual send buffers.
A packing unit can be sent using type MPI_PACKED. Any point to point or collective communication function can be used to move the sequence of bytes that forms the packing unit from one process to another. This packing unit can now be received using any receive operation, with any datatype: the type matching rules are relaxed for messages sent with type MPI_PACKED.
A message sent with any type (including MPI_PACKED) can be received using the type MPI_PACKED. Such a message can then be unpacked by calls to MPI_UNPACK.
A packing unit (or a message created by a regular, "typed" send) can be unpacked into several successive messages. This is effected by several successive related calls to MPI_UNPACK, where the first call provides position = 0, and each successive call inputs the value of position that was output by the previous call, and the same values for inbuf, insize and comm.
The concatenation of two packing units is not necessarily a packing unit; nor is a substring of a packing unit necessarily a packing unit. Thus, one cannot concatenate two packing units and then unpack the result as one packing unit; nor can one unpack a substring of a packing unit as a separate packing unit. Each packing unit, that was created by a related sequence of pack calls, or by a regular send, must be unpacked as a unit, by a sequence of related unpack calls.

Rationale. The restriction on "atomic" packing and unpacking of packing units allows the implementation to add at the head of packing units additional information, such as a description of the sender architecture, to be used for type conversion in a heterogeneous environment. (End of rationale.)

The following call allows the user to find out how much space is needed to pack a message and, thus, manage space allocation for buffers.

MPI_PACK_SIZE(incount, datatype, comm, size)

  IN   incount    count argument to packing call (integer)
  IN   datatype   datatype argument to packing call (handle)
  IN   comm       communicator argument to packing call (handle)
  OUT  size       upper bound on size of packed message, in bytes (integer)

int MPI_Pack_size(int incount, MPI_Datatype datatype, MPI_Comm comm,
                  int *size)

MPI_PACK_SIZE(INCOUNT, DATATYPE, COMM, SIZE, IERROR)
    INTEGER INCOUNT, DATATYPE, COMM, SIZE, IERROR

A call to MPI_PACK_SIZE(incount, datatype, comm, size) returns in size an upper bound on the increment in position that is effected by a call to MPI_PACK(inbuf, incount, datatype, outbuf, outcount, position, comm).

Rationale. The call returns an upper bound, rather than an exact bound, since the exact amount of space needed to pack the message may depend on the context (e.g., first message packed in a packing unit may take more space). (End of rationale.)
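In practice the bound returned by MPI_PACK_SIZE is used to size the packing buffer before the MPI_PACK calls, as in this minimal, non-normative sketch (the counts are arbitrary):

#include <mpi.h>
#include <stdlib.h>

/* Sketch: allocate a buffer large enough to pack one int followed by
   100 doubles, then pack both into it. */
void pack_header_and_data(int n, double *data, MPI_Comm comm)
{
    int s1, s2, bufsize, position = 0;
    char *buf;

    MPI_Pack_size(1, MPI_INT, comm, &s1);        /* bound for the int     */
    MPI_Pack_size(100, MPI_DOUBLE, comm, &s2);   /* bound for the doubles */
    bufsize = s1 + s2;
    buf = (char *)malloc(bufsize);

    MPI_Pack(&n, 1, MPI_INT, buf, bufsize, &position, comm);
    MPI_Pack(data, 100, MPI_DOUBLE, buf, bufsize, &position, comm);

    /* ... MPI_Send(buf, position, MPI_PACKED, ...) would follow ... */
    free(buf);
}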

Example An example using MPI_PACK.

int position, i, j, a[2];
char buff[1000];
MPI_Status status;

....

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{
   /* SENDER CODE */

  position = 0;
  MPI_Pack(&i, 1, MPI_INT, buff, 1000, &position, MPI_COMM_WORLD);
  MPI_Pack(&j, 1, MPI_INT, buff, 1000, &position, MPI_COMM_WORLD);
  MPI_Send( buff, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
}
else  /* RECEIVER CODE */
  MPI_Recv( a, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

Example An elaborate example.

int position, i;
float a[1000];
char buff[1000];

....

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{
  /* SENDER CODE */

  int len[2];
  MPI_Aint disp[2];
  MPI_Datatype type[2], newtype;

  /* build datatype for i followed by a[0]...a[i-1] */

  len[0] = 1;
  len[1] = i;
  MPI_Address( &i, disp);
  MPI_Address( a, disp+1);
  type[0] = MPI_INT;
  type[1] = MPI_FLOAT;
  MPI_Type_struct( 2, len, disp, type, &newtype);
  MPI_Type_commit( &newtype);

  /* Pack i followed by a[0]...a[i-1] */

  position = 0;
  MPI_Pack( MPI_BOTTOM, 1, newtype, buff, 1000, &position, MPI_COMM_WORLD);

  /* Send */

  MPI_Send( buff, position, MPI_PACKED, 1, 0,
            MPI_COMM_WORLD);

  /* *****
     One can replace the last three lines with
     MPI_Send( MPI_BOTTOM, 1, newtype, 1, 0, MPI_COMM_WORLD);
     ***** */
}
else  /* myrank == 1 */
{
  /* RECEIVER CODE */

  MPI_Status status;

  /* Receive */

  MPI_Recv( buff, 1000, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);

  /* Unpack i */

  position = 0;
  MPI_Unpack(buff, 1000, &position, &i, 1, MPI_INT, MPI_COMM_WORLD);

  /* Unpack a[0]...a[i-1] */
  MPI_Unpack(buff, 1000, &position, a, i, MPI_FLOAT, MPI_COMM_WORLD);
}

Example Each process sends a count, followed by count characters, to the root; the root concatenates all characters into one string.

int count, gsize, counts[64], totalcount, k1, k2, k,
    displs[64], position, concat_pos;
char chr[100], *lbuf, *rbuf, *cbuf;
...
MPI_Comm_size(comm, &gsize);
MPI_Comm_rank(comm, &myrank);

   /* allocate local pack buffer */
MPI_Pack_size(1, MPI_INT, comm, &k1);
MPI_Pack_size(count, MPI_CHAR, comm, &k2);
k = k1 + k2;
lbuf = (char *)malloc(k);

   /* pack count, followed by count characters */
position = 0;
MPI_Pack(&count, 1, MPI_INT, lbuf, k, &position, comm);
MPI_Pack(chr, count, MPI_CHAR, lbuf, k, &position, comm);

if (myrank != root) {
      /* gather at root sizes of all packed messages */
   MPI_Gather( &position, 1, MPI_INT, NULL, NULL,
               NULL, root, comm);

      /* gather at root packed messages */
   MPI_Gatherv( lbuf, position, MPI_PACKED, NULL,
                NULL, NULL, NULL, root, comm);

} else {   /* root code */
      /* gather sizes of all packed messages */
   MPI_Gather( &position, 1, MPI_INT, counts, 1,
               MPI_INT, root, comm);

      /* gather all packed messages */
   displs[0] = 0;
   for (i=1; i < gsize; i++)
     displs[i] = displs[i-1] + counts[i-1];
   totalcount = displs[gsize-1] + counts[gsize-1];
   rbuf = (char *)malloc(totalcount);
   cbuf = (char *)malloc(totalcount);
   MPI_Gatherv( lbuf, position, MPI_PACKED, rbuf,
                counts, displs, MPI_PACKED, root, comm);

      /* unpack all messages and concatenate strings */
   concat_pos = 0;
   for (i=0; i < gsize; i++) {
      position = 0;
      MPI_Unpack( rbuf+displs[i], totalcount-displs[i],
             &position, &count, 1, MPI_INT, comm);
      MPI_Unpack( rbuf+displs[i], totalcount-displs[i],
             &position, cbuf+concat_pos, count, MPI_CHAR, comm);
      concat_pos += count;
   }
   cbuf[concat_pos] = '\0';
}

Chapter

Collective Communication

Introduction and Overview

Collective communication is defined as communication that involves a group of processes. The functions of this type provided by MPI are the following:

- Barrier synchronization across all group members.
- Broadcast from one member to all members of a group. This is shown in the figure below.
- Gather data from all group members to one member. This is shown in the figure below.
- Scatter data from one member to all members of a group. This is shown in the figure below.
- A variation on Gather where all members of the group receive the result. This is shown as "allgather" in the figure below.
- Scatter/Gather data from all members to all members of a group (also called complete exchange or all-to-all). This is shown as "alltoall" in the figure below.
- Global reduction operations such as sum, max, min, or user-defined functions, where the result is returned to all group members, and a variation where the result is returned to only one member.
- A combined reduction and scatter operation.
- Scan across all members of a group (also called prefix).

A collective operation is executed by having all processes in the group call the communication routine, with matching arguments. The syntax and semantics of the collective operations are defined to be consistent with the syntax and semantics of the point-to-point operations. Thus, general datatypes are allowed and must match between sending and receiving processes as specified in the point-to-point chapter. One of the key arguments is a communicator that defines the group of participating processes and provides a context for the operation. Several collective routines such as broadcast and gather have a single originating or receiving process. Such processes are called the root.

[Figure: Collective move functions (broadcast, scatter/gather, allgather, alltoall) illustrated for a group of six processes. In each case, each row of boxes represents data locations in one process. Thus, in the broadcast, initially just the first process contains the data A_0, but after the broadcast all processes contain it.]

Some arguments in the collective functions are specified as "significant only at root," and are ignored for all participants except the root. The reader is referred to the point-to-point communication chapter for information concerning communication buffers, general datatypes and type matching rules, and to the chapter on groups and communicators for information on how to define groups and create communicators.
The type-matching conditions for the collective operations are more strict than the corresponding conditions between sender and receiver in point-to-point. Namely, for collective operations, the amount of data sent must exactly match the amount of data specified by the receiver. Distinct type maps (the layout in memory) between sender and receiver are still allowed.
Collective routine calls can (but are not required to) return as soon as their participation in the collective communication is complete. The completion of a call indicates that the caller is now free to access locations in the communication buffer. It does not indicate that other processes in the group have completed or even started the operation (unless otherwise indicated in the description of the operation). Thus, a collective communication call may, or may not, have the effect of synchronizing all calling processes. This statement excludes, of course, the barrier function.
Collective communication calls may use the same communicators as point-to-point communication; MPI guarantees that messages generated on behalf of collective communication calls will not be confused with messages generated by point-to-point communication. A more detailed discussion of correct use of collective routines is found later in this chapter.

Rationale. The equal-data restriction (on type matching) was made so as to avoid the complexity of providing a facility analogous to the status argument of MPI_RECV for discovering the amount of data sent. Some of the collective routines would require an array of status values.
The statements about synchronization are made so as to allow a variety of implementations of the collective functions.
The collective operations do not accept a message tag argument. If future revisions of MPI define non-blocking collective functions, then tags (or a similar mechanism) will need to be added so as to allow the disambiguation of multiple pending collective operations. (End of rationale.)

Advice to users. It is dangerous to rely on synchronization side-effects of the collective operations for program correctness. For example, even though a particular implementation may provide a broadcast routine with a side-effect of synchronization, the standard does not require this, and a program that relies on this will not be portable.
On the other hand, a correct, portable program must allow for the fact that a collective call may be synchronizing. Though one cannot rely on any synchronization side-effect, one must program so as to allow it. These issues are discussed further later in this chapter. (End of advice to users.)

Advice to implementors. While vendors may write optimized collective routines matched to their architectures, a complete library of the collective communication routines can be written entirely using the MPI point-to-point communication functions and a few auxiliary functions. If implementing on top of point-to-point, a hidden, special communicator must be created for the collective operation so as to avoid interference with any on-going point-to-point communication at the time of the collective call. This is discussed further later in this chapter. (End of advice to implementors.)

Communicator argument

The key concept of the collective functions is to have a "group" of participating processes. The routines do not have a group identifier as an explicit argument. Instead, there is a communicator argument. For the purposes of this chapter, a communicator can be thought of as a group identifier linked with a context. An inter-communicator, that is, a communicator that spans two groups, is not allowed as an argument to a collective function.

Barrier synchronization

MPI_BARRIER( comm )

  IN   comm   communicator (handle)

int MPI_Barrier(MPI_Comm comm )

MPI_BARRIER(COMM, IERROR)
    INTEGER COMM, IERROR

MPI_BARRIER blocks the caller until all group members have called it. The call returns at any process only after all group members have entered the call.
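A typical, non-normative use of the barrier is to separate phases of a computation, for example so that a timing measurement starts only after every process has finished its setup. A short sketch, assuming an already created communicator:

#include <mpi.h>
#include <stdio.h>

/* Sketch: time a phase that starts at (roughly) the same point on
   every process. */
void timed_phase(MPI_Comm comm)
{
    double t0, t1;

    MPI_Barrier(comm);          /* no process proceeds until all arrive */
    t0 = MPI_Wtime();

    /* ... work to be timed ... */

    t1 = MPI_Wtime();
    printf("local elapsed time: %f seconds\n", t1 - t0);
}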

Broadcast

MPI_BCAST( buffer, count, datatype, root, comm )

  INOUT  buffer     starting address of buffer (choice)
  IN     count      number of entries in buffer (integer)
  IN     datatype   data type of buffer (handle)
  IN     root       rank of broadcast root (integer)
  IN     comm       communicator (handle)

int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root,
              MPI_Comm comm )

MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

MPI_BCAST broadcasts a message from the process with rank root to all processes of the group, itself included. It is called by all members of group using the same arguments for comm, root. On return, the contents of root's communication buffer has been copied to all processes.
General, derived datatypes are allowed for datatype. The type signature of count, datatype on any process must be equal to the type signature of count, datatype at the root. This implies that the amount of data sent must be equal to the amount received, pairwise between each process and the root. MPI_BCAST and all other data-movement collective routines make this restriction. Distinct type maps between sender and receiver are still allowed.

Example using MPI_BCAST

Example Broadcast 100 ints from process 0 to every process in the group.

MPI_Comm comm;
int array[100];
int root = 0;
...
MPI_Bcast( array, 100, MPI_INT, root, comm);

As in many of our example code fragments, we assume that some of the variables (such as comm in the above) have been assigned appropriate values.

Gather

MPI_GATHER( sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)

  IN   sendbuf     starting address of send buffer (choice)
  IN   sendcount   number of elements in send buffer (integer)
  IN   sendtype    data type of send buffer elements (handle)
  OUT  recvbuf     address of receive buffer (choice, significant only at root)
  IN   recvcount   number of elements for any single receive (integer,
                   significant only at root)
  IN   recvtype    data type of recv buffer elements (significant only at root)
                   (handle)
  IN   root        rank of receiving process (integer)
  IN   comm        communicator (handle)

int MPI_Gather(void* sendbuf, int sendcount, MPI_Datatype sendtype,
               void* recvbuf, int recvcount, MPI_Datatype recvtype,
               int root, MPI_Comm comm)

MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
           ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR

Each process (root process included) sends the contents of its send buffer to the root process. The root process receives the messages and stores them in rank order. The outcome is as if each of the n processes in the group (including the root process) had executed a call to

    MPI_Send(sendbuf, sendcount, sendtype, root, ...),

and the root had executed n calls to

    MPI_Recv(recvbuf + i·recvcount·extent(recvtype), recvcount, recvtype, i, ...),

where extent(recvtype) is the type extent obtained from a call to MPI_Type_extent().
An alternative description is that the n messages sent by the processes in the group are concatenated in rank order, and the resulting message is received by the root as if by a call to MPI_RECV(recvbuf, recvcount·n, recvtype, ...).
The receive buffer is ignored for all non-root processes.
General, derived datatypes are allowed for both sendtype and recvtype. The type signature of sendcount, sendtype on process i must be equal to the type signature of recvcount, recvtype at the root. This implies that the amount of data sent must be equal to the amount of data received, pairwise between each process and the root. Distinct type maps between sender and receiver are still allowed.
All arguments to the function are significant on process root, while on other processes, only arguments sendbuf, sendcount, sendtype, root, comm are significant. The arguments root and comm must have identical values on all processes.
The specification of counts and types should not cause any location on the root to be written more than once. Such a call is erroneous.
Note that the recvcount argument at the root indicates the number of items it receives from each process, not the total number of items it receives.

MPI_GATHERV( sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root,
             comm )

  IN    sendbuf     starting address of send buffer (choice)
  IN    sendcount   number of elements in send buffer (integer)
  IN    sendtype    data type of send buffer elements (handle)
  OUT   recvbuf     address of receive buffer (choice, significant only at root)
  IN    recvcounts  integer array (of length group size) containing the number of
                    elements that are received from each process (significant only at root)
  IN    displs      integer array (of length group size). Entry i specifies the
                    displacement relative to recvbuf at which to place the incoming
                    data from process i (significant only at root)
  IN    recvtype    data type of recv buffer elements (significant only at root) (handle)
  IN    root        rank of receiving process (integer)
  IN    comm        communicator (handle)

int MPI_Gatherv(void* sendbuf, int sendcount, MPI_Datatype sendtype,
                void* recvbuf, int *recvcounts, int *displs,
                MPI_Datatype recvtype, int root, MPI_Comm comm)

MPI_GATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS,
            RECVTYPE, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, ROOT,
            COMM, IERROR

MPI_GATHERV extends the functionality of MPI_GATHER by allowing a varying count
of data from each process, since recvcounts is now an array. It also allows more flexibility
as to where the data is placed on the root, by providing the new argument, displs.

The outcome is as if each process, including the root process, sends a message to the
root,

    MPI_Send(sendbuf, sendcount, sendtype, root, ...),

and the root executes n receives,

    MPI_Recv(recvbuf + disp[i]*extent(recvtype), recvcounts[i], recvtype, i, ...).

Messages are placed in the receive buffer of the root process in rank order, that is, the
data sent from process j is placed in the jth portion of the receive buffer recvbuf on process
root. The jth portion of recvbuf begins at offset displs[j] elements (in terms of recvtype)
into recvbuf.

The receive buffer is ignored for all non-root processes.

The type signature implied by sendcount, sendtype on process i must be equal to the
type signature implied by recvcounts[i], recvtype at the root. This implies that the amount
of data sent must be equal to the amount of data received, pairwise between each process
and the root. Distinct type maps between sender and receiver are still allowed, as illustrated
in the examples below.

All arguments to the function are significant on process root, while on other processes
only arguments sendbuf, sendcount, sendtype, root, comm are significant. The arguments
root and comm must have identical values on all processes.

The specification of counts, types, and displacements should not cause any location on
the root to be written more than once. Such a call is erroneous.

Examples using MPI_GATHER, MPI_GATHERV

Example.  Gather 100 ints from every process in group to root. See the figure below.

    MPI_Comm comm;
    int gsize,sendarray[100];
    int root, *rbuf;
    ...
    MPI_Comm_size( comm, &gsize);
    rbuf = (int *)malloc(gsize*100*sizeof(int));
    MPI_Gather( sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

Figure: The root process gathers 100 ints from each process in the group.

Example.  Previous example modified -- only the root allocates memory for the receive
buffer.

    MPI_Comm comm;
    int gsize,sendarray[100];
    int root, myrank, *rbuf;
    ...
    MPI_Comm_rank( comm, &myrank);
    if ( myrank == root) {
        MPI_Comm_size( comm, &gsize);
        rbuf = (int *)malloc(gsize*100*sizeof(int));
        }
    MPI_Gather( sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

Example.  Do the same as the previous example, but use a derived datatype. Note
that the type cannot be the entire set of gsize*100 ints since type matching is defined
pairwise between the root and each process in the gather.

    MPI_Comm comm;
    int gsize,sendarray[100];
    int root, *rbuf;
    MPI_Datatype rtype;
    ...
    MPI_Comm_size( comm, &gsize);
    MPI_Type_contiguous( 100, MPI_INT, &rtype );
    MPI_Type_commit( &rtype );
    rbuf = (int *)malloc(gsize*100*sizeof(int));
    MPI_Gather( sendarray, 100, MPI_INT, rbuf, 1, rtype, root, comm);

Example.  Now have each process send 100 ints to root, but place each set (of 100)
stride ints apart at the receiving end. Use MPI_GATHERV and the displs argument to
achieve this effect. Assume stride >= 100. See the figure below.

    MPI_Comm comm;
    int gsize,sendarray[100];
    int root, *rbuf, stride;
    int *displs,i,*rcounts;

    ...

    MPI_Comm_size( comm, &gsize);
    rbuf = (int *)malloc(gsize*stride*sizeof(int));
    displs = (int *)malloc(gsize*sizeof(int));
    rcounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) {
        displs[i] = i*stride;
        rcounts[i] = 100;
        }
    MPI_Gatherv( sendarray, 100, MPI_INT, rbuf, rcounts, displs, MPI_INT,
                 root, comm);

Figure: The root process gathers 100 ints from each process in the group; each set is
placed stride ints apart.

Note that the program is erroneous if stride < 100.

Example.  Same as the previous example on the receiving side, but send the 100 ints
from the 0th column of a 100x150 int array, in C. See the figure below.

    MPI_Comm comm;
    int gsize,sendarray[100][150];
    int root, *rbuf, stride;
    MPI_Datatype stype;
    int *displs,i,*rcounts;

    ...

    MPI_Comm_size( comm, &gsize);
    rbuf = (int *)malloc(gsize*stride*sizeof(int));
    displs = (int *)malloc(gsize*sizeof(int));
    rcounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) {
        displs[i] = i*stride;
        rcounts[i] = 100;
        }
    /* Create datatype for 1 column of array
     */
    MPI_Type_vector( 100, 1, 150, MPI_INT, &stype);
    MPI_Type_commit( &stype );
    MPI_Gatherv( sendarray, 1, stype, rbuf, rcounts, displs, MPI_INT,
                 root, comm);

Figure: The root process gathers column 0 of a 100x150 C array, and each set is placed
stride ints apart.

Example.  Process i sends (100-i) ints from the ith column of a 100 x 150 int array, in
C. It is received into a buffer with stride, as in the previous two examples. See the figure
below.

    MPI_Comm comm;
    int gsize,sendarray[100][150],*sptr;
    int root, *rbuf, stride, myrank;
    MPI_Datatype stype;
    int *displs,i,*rcounts;

    ...

    MPI_Comm_size( comm, &gsize);
    MPI_Comm_rank( comm, &myrank );
    rbuf = (int *)malloc(gsize*stride*sizeof(int));
    displs = (int *)malloc(gsize*sizeof(int));
    rcounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) {
        displs[i] = i*stride;
        rcounts[i] = 100-i;    /* note change from previous example */
        }
    /* Create datatype for the column we are sending
     */
    MPI_Type_vector( 100-myrank, 1, 150, MPI_INT, &stype);
    MPI_Type_commit( &stype );
    /* sptr is the address of start of "myrank" column
     */
    sptr = &sendarray[0][myrank];
    MPI_Gatherv( sptr, 1, stype, rbuf, rcounts, displs, MPI_INT,
                 root, comm);

Note that a different amount of data is received from each process.

Figure: The root process gathers 100-i ints from column i of a 100x150 C array, and
each set is placed stride ints apart.

Example.  Same as the previous example, but done in a different way at the sending end.
We create a datatype that causes the correct striding at the sending end so that we read
a column of a C array (a similar approach was used in an earlier example in the section on
derived datatypes).

    MPI_Comm comm;
    int gsize,sendarray[100][150],*sptr;
    int root, *rbuf, stride, myrank, blocklen[2];
    MPI_Aint disp[2];
    MPI_Datatype stype,type[2];
    int *displs,i,*rcounts;

    ...

    MPI_Comm_size( comm, &gsize);
    MPI_Comm_rank( comm, &myrank );
    rbuf = (int *)malloc(gsize*stride*sizeof(int));
    displs = (int *)malloc(gsize*sizeof(int));
    rcounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) {
        displs[i] = i*stride;
        rcounts[i] = 100-i;
        }
    /* Create datatype for one int, with extent of entire row
     */
    disp[0] = 0;       disp[1] = 150*sizeof(int);
    type[0] = MPI_INT; type[1] = MPI_UB;
    blocklen[0] = 1;   blocklen[1] = 1;
    MPI_Type_struct( 2, blocklen, disp, type, &stype );
    MPI_Type_commit( &stype );
    sptr = &sendarray[0][myrank];
    MPI_Gatherv( sptr, 100-myrank, stype, rbuf, rcounts, displs, MPI_INT,
                 root, comm);

Example.  Same at the sending side as the example in which process i sends (100-i) ints
from the ith column, but at the receiving side we make the stride between received blocks
vary from block to block. See the figure below.

    MPI_Comm comm;
    int gsize,sendarray[100][150],*sptr;
    int root, *rbuf, *stride, myrank, bufsize;
    MPI_Datatype stype;
    int *displs,i,*rcounts,offset;

    ...

    MPI_Comm_size( comm, &gsize);
    MPI_Comm_rank( comm, &myrank );

    stride = (int *)malloc(gsize*sizeof(int));
    ...
    /* stride[i] for i = 0 to gsize-1 is set somehow
     */

    /* set up displs and rcounts vectors first
     */
    displs = (int *)malloc(gsize*sizeof(int));
    rcounts = (int *)malloc(gsize*sizeof(int));
    offset = 0;
    for (i=0; i<gsize; ++i) {
        displs[i] = offset;
        offset += stride[i];
        rcounts[i] = 100-i;
        }
    /* the required buffer size for rbuf is now easily obtained
     */
    bufsize = displs[gsize-1]+rcounts[gsize-1];
    rbuf = (int *)malloc(bufsize*sizeof(int));
    /* Create datatype for the column we are sending
     */
    MPI_Type_vector( 100-myrank, 1, 150, MPI_INT, &stype);
    MPI_Type_commit( &stype );
    sptr = &sendarray[0][myrank];
    MPI_Gatherv( sptr, 1, stype, rbuf, rcounts, displs, MPI_INT,
                 root, comm);

Figure: The root process gathers 100-i ints from column i of a 100x150 C array, and
each set is placed stride[i] ints apart (a varying stride).

Example.  Process i sends num ints from the ith column of a 100 x 150 int array, in
C. The complicating factor is that the various values of num are not known to root, so a
separate gather must first be run to find these out. The data is placed contiguously at the
receiving end.

    MPI_Comm comm;
    int gsize,sendarray[100][150],*sptr;
    int root, *rbuf, myrank, blocklen[2];
    MPI_Aint disp[2];
    MPI_Datatype stype,type[2];
    int *displs,i,*rcounts,num;

    ...

    MPI_Comm_size( comm, &gsize);
    MPI_Comm_rank( comm, &myrank );

    /* First, gather nums to root
     */
    rcounts = (int *)malloc(gsize*sizeof(int));
    MPI_Gather( &num, 1, MPI_INT, rcounts, 1, MPI_INT, root, comm);
    /* root now has correct rcounts, using these we set displs[] so
     * that data is placed contiguously (or concatenated) at receive end
     */
    displs = (int *)malloc(gsize*sizeof(int));
    displs[0] = 0;
    for (i=1; i<gsize; ++i) {
        displs[i] = displs[i-1]+rcounts[i-1];
        }
    /* And, create receive buffer
     */
    rbuf = (int *)malloc(gsize*(displs[gsize-1]+rcounts[gsize-1])
                         *sizeof(int));
    /* Create datatype for one int, with extent of entire row
     */
    disp[0] = 0;       disp[1] = 150*sizeof(int);
    type[0] = MPI_INT; type[1] = MPI_UB;
    blocklen[0] = 1;   blocklen[1] = 1;
    MPI_Type_struct( 2, blocklen, disp, type, &stype );
    MPI_Type_commit( &stype );
    sptr = &sendarray[0][myrank];
    MPI_Gatherv( sptr, num, stype, rbuf, rcounts, displs, MPI_INT,
                 root, comm);

Scatter

MPI_SCATTER( sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm )

  IN    sendbuf     address of send buffer (choice, significant only at root)
  IN    sendcount   number of elements sent to each process (integer, significant only at root)
  IN    sendtype    data type of send buffer elements (significant only at root) (handle)
  OUT   recvbuf     address of receive buffer (choice)
  IN    recvcount   number of elements in receive buffer (integer)
  IN    recvtype    data type of receive buffer elements (handle)
  IN    root        rank of sending process (integer)
  IN    comm        communicator (handle)

int MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype,
                void* recvbuf, int recvcount, MPI_Datatype recvtype, int root,
                MPI_Comm comm)

MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
            ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR

MPI_SCATTER is the inverse operation to MPI_GATHER.

The outcome is as if the root executed n send operations,

    MPI_Send(sendbuf + i*sendcount*extent(sendtype), sendcount, sendtype, i, ...),

and each process executed a receive,

    MPI_Recv(recvbuf, recvcount, recvtype, i, ...).

An alternative description is that the root sends a message with MPI_Send(sendbuf,
sendcount*n, sendtype, ...). This message is split into n equal segments, the ith segment is
sent to the ith process in the group, and each process receives this message as above.

The send buffer is ignored for all non-root processes.

The type signature associated with sendcount, sendtype at the root must be equal to
the type signature associated with recvcount, recvtype at all processes (however, the type
maps may be different). This implies that the amount of data sent must be equal to the
amount of data received, pairwise between each process and the root. Distinct type maps
between sender and receiver are still allowed.

All arguments to the function are significant on process root, while on other processes
only arguments recvbuf, recvcount, recvtype, root, comm are significant. The arguments root
and comm must have identical values on all processes.

The specification of counts and types should not cause any location on the root to be
read more than once.

Rationale.  Though not needed, the last restriction is imposed so as to achieve
symmetry with MPI_GATHER, where the corresponding restriction (a multiple-write
restriction) is necessary. (End of rationale.)

MPI_SCATTERV( sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root,
              comm )

  IN    sendbuf     address of send buffer (choice, significant only at root)
  IN    sendcounts  integer array (of length group size) specifying the number of
                    elements to send to each processor
  IN    displs      integer array (of length group size). Entry i specifies the
                    displacement (relative to sendbuf) from which to take the
                    outgoing data to process i
  IN    sendtype    data type of send buffer elements (handle)
  OUT   recvbuf     address of receive buffer (choice)
  IN    recvcount   number of elements in receive buffer (integer)
  IN    recvtype    data type of receive buffer elements (handle)
  IN    root        rank of sending process (integer)
  IN    comm        communicator (handle)

int MPI_Scatterv(void* sendbuf, int *sendcounts, int *displs,
                 MPI_Datatype sendtype, void* recvbuf, int recvcount,
                 MPI_Datatype recvtype, int root, MPI_Comm comm)

MPI_SCATTERV(SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE, RECVBUF, RECVCOUNT,
             RECVTYPE, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNTS(*), DISPLS(*), SENDTYPE, RECVCOUNT, RECVTYPE, ROOT,
            COMM, IERROR

MPI_SCATTERV is the inverse operation to MPI_GATHERV.

MPI_SCATTERV extends the functionality of MPI_SCATTER by allowing a varying
count of data to be sent to each process, since sendcounts is now an array. It also allows
more flexibility as to where the data is taken from on the root, by providing the new
argument, displs.

The outcome is as if the root executed n send operations,

    MPI_Send(sendbuf + displs[i]*extent(sendtype), sendcounts[i], sendtype, i, ...),

and each process executed a receive,

    MPI_Recv(recvbuf, recvcount, recvtype, i, ...).

The send buffer is ignored for all non-root processes.

The type signature implied by sendcounts[i], sendtype at the root must be equal to the
type signature implied by recvcount, recvtype at process i (however, the type maps may be
different). This implies that the amount of data sent must be equal to the amount of data
received, pairwise between each process and the root. Distinct type maps between sender
and receiver are still allowed.

All arguments to the function are significant on process root, while on other processes
only arguments recvbuf, recvcount, recvtype, root, comm are significant. The arguments root
and comm must have identical values on all processes.

The specification of counts, types, and displacements should not cause any location on
the root to be read more than once.

Examples using MPI_SCATTER, MPI_SCATTERV

Example.  The reverse of the first gather example above. Scatter sets of 100 ints from
the root to each process in the group. See the figure below.

    MPI_Comm comm;
    int gsize,*sendbuf;
    int root, rbuf[100];
    ...
    MPI_Comm_size( comm, &gsize);
    sendbuf = (int *)malloc(gsize*100*sizeof(int));
    ...
    MPI_Scatter( sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

Figure: The root process scatters sets of 100 ints to each process in the group.

Example.  The reverse of the earlier gather-with-stride example. The root process scatters
sets of 100 ints to the other processes, but the sets of 100 are stride ints apart in the sending
buffer. Requires use of MPI_SCATTERV. Assume stride >= 100. See the figure below.

    MPI_Comm comm;
    int gsize,*sendbuf;
    int root, rbuf[100], i, *displs, *scounts, stride;

    ...

    MPI_Comm_size( comm, &gsize);
    sendbuf = (int *)malloc(gsize*stride*sizeof(int));
    ...
    displs = (int *)malloc(gsize*sizeof(int));
    scounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) {
        displs[i] = i*stride;
        scounts[i] = 100;
        }
    MPI_Scatterv( sendbuf, scounts, displs, MPI_INT, rbuf, 100, MPI_INT,
                  root, comm);

Figure: The root process scatters sets of 100 ints, moving by stride ints from send to
send in the scatter.

Example.  The reverse of the varying-stride gather example. We have a varying stride
between blocks at the sending (root) side; at the receiving side we receive into the ith
column of a 100x150 C array. See the figure below.

    MPI_Comm comm;
    int gsize,recvarray[100][150],*rptr;
    int root, *sendbuf, myrank, bufsize, *stride;
    MPI_Datatype rtype;
    int i, *displs, *scounts, offset;
    ...
    MPI_Comm_size( comm, &gsize);
    MPI_Comm_rank( comm, &myrank );

    stride = (int *)malloc(gsize*sizeof(int));
    ...
    /* stride[i] for i = 0 to gsize-1 is set somehow
     * sendbuf comes from elsewhere
     */
    ...
    displs = (int *)malloc(gsize*sizeof(int));
    scounts = (int *)malloc(gsize*sizeof(int));
    offset = 0;
    for (i=0; i<gsize; ++i) {
        displs[i] = offset;
        offset += stride[i];
        scounts[i] = 100 - i;
        }
    /* Create datatype for the column we are receiving
     */
    MPI_Type_vector( 100-myrank, 1, 150, MPI_INT, &rtype);
    MPI_Type_commit( &rtype );
    rptr = &recvarray[0][myrank];
    MPI_Scatterv( sendbuf, scounts, displs, MPI_INT, rptr, 1, rtype,
                  root, comm);

Figure: The root scatters blocks of 100-i ints into column i of a 100x150 C array. At
the sending side, the blocks are stride[i] ints apart.

Gather-to-all

MPI_ALLGATHER( sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm )

  IN    sendbuf     starting address of send buffer (choice)
  IN    sendcount   number of elements in send buffer (integer)
  IN    sendtype    data type of send buffer elements (handle)
  OUT   recvbuf     address of receive buffer (choice)
  IN    recvcount   number of elements received from any process (integer)
  IN    recvtype    data type of receive buffer elements (handle)
  IN    comm        communicator (handle)

int MPI_Allgather(void* sendbuf, int sendcount, MPI_Datatype sendtype,
                  void* recvbuf, int recvcount, MPI_Datatype recvtype,
                  MPI_Comm comm)

MPI_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
              COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR

MPI_ALLGATHER can be thought of as MPI_GATHER, but where all processes receive
the result, instead of just the root. The jth block of data sent from each process is received
by every process and placed in the jth block of the buffer recvbuf.

The type signature associated with sendcount, sendtype at a process must be equal to
the type signature associated with recvcount, recvtype at any other process.

The outcome of a call to MPI_ALLGATHER is as if all processes executed n calls to

    MPI_GATHER(sendbuf,sendcount,sendtype,recvbuf,recvcount,
               recvtype,root,comm),

for root = 0, ..., n-1. The rules for correct usage of MPI_ALLGATHER are easily found
from the corresponding rules for MPI_GATHER.

MPI_ALLGATHERV( sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm )

  IN    sendbuf     starting address of send buffer (choice)
  IN    sendcount   number of elements in send buffer (integer)
  IN    sendtype    data type of send buffer elements (handle)
  OUT   recvbuf     address of receive buffer (choice)
  IN    recvcounts  integer array (of length group size) containing the number of
                    elements that are received from each process
  IN    displs      integer array (of length group size). Entry i specifies the
                    displacement (relative to recvbuf) at which to place the incoming
                    data from process i
  IN    recvtype    data type of receive buffer elements (handle)
  IN    comm        communicator (handle)

int MPI_Allgatherv(void* sendbuf, int sendcount, MPI_Datatype sendtype,
                   void* recvbuf, int *recvcounts, int *displs,
                   MPI_Datatype recvtype, MPI_Comm comm)

MPI_ALLGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS,
               RECVTYPE, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, COMM,
            IERROR

MPI_ALLGATHERV can be thought of as MPI_GATHERV, but where all processes receive
the result, instead of just the root. The jth block of data sent from each process is received
by every process and placed in the jth block of the buffer recvbuf. These blocks need not
all be the same size.

The type signature associated with sendcount, sendtype at process j must be equal to
the type signature associated with recvcounts[j], recvtype at any other process.

The outcome is as if all processes executed calls to

    MPI_GATHERV(sendbuf,sendcount,sendtype,recvbuf,recvcounts,displs,
                recvtype,root,comm),

for root = 0, ..., n-1. The rules for correct usage of MPI_ALLGATHERV are easily
found from the corresponding rules for MPI_GATHERV.

Examples using MPI_ALLGATHER, MPI_ALLGATHERV

Example.  The all-gather version of the first gather example above. Using MPI_ALLGATHER,
we will gather 100 ints from every process in the group to every process.

    MPI_Comm comm;
    int gsize,sendarray[100];
    int *rbuf;
    ...
    MPI_Comm_size( comm, &gsize);
    rbuf = (int *)malloc(gsize*100*sizeof(int));
    MPI_Allgather( sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, comm);

After the call, every process has the group-wide concatenation of the sets of data.
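No MPI_ALLGATHERV example appears above, so the following fragment is an illustrative
sketch only (the variable names are ours, not the standard's). Each process contributes
myrank+1 ints and every process receives the full concatenation; the counts themselves are
first made known everywhere with MPI_ALLGATHER, exactly as the displs/rcounts vectors
were built in the gather examples.

    MPI_Comm comm;
    int gsize, myrank, mycount, total, i;
    int *sbuf, *rbuf, *rcounts, *displs;
    ...
    MPI_Comm_size( comm, &gsize );
    MPI_Comm_rank( comm, &myrank );
    mycount = myrank+1;                    /* this process contributes myrank+1 ints */
    sbuf = (int *)malloc(mycount*sizeof(int));
    /* ... fill sbuf with mycount local values ... */
    rcounts = (int *)malloc(gsize*sizeof(int));
    displs  = (int *)malloc(gsize*sizeof(int));
    /* every process needs every count, so gather the counts everywhere */
    MPI_Allgather( &mycount, 1, MPI_INT, rcounts, 1, MPI_INT, comm );
    total = 0;
    for (i=0; i<gsize; ++i) { displs[i] = total; total += rcounts[i]; }
    rbuf = (int *)malloc(total*sizeof(int));
    MPI_Allgatherv( sbuf, mycount, MPI_INT, rbuf, rcounts, displs,
                    MPI_INT, comm );
    /* every process now holds the concatenation of all contributions */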

All-to-All Scatter/Gather

MPI_ALLTOALL( sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm )

  IN    sendbuf     starting address of send buffer (choice)
  IN    sendcount   number of elements sent to each process (integer)
  IN    sendtype    data type of send buffer elements (handle)
  OUT   recvbuf     address of receive buffer (choice)
  IN    recvcount   number of elements received from any process (integer)
  IN    recvtype    data type of receive buffer elements (handle)
  IN    comm        communicator (handle)

int MPI_Alltoall(void* sendbuf, int sendcount, MPI_Datatype sendtype,
                 void* recvbuf, int recvcount, MPI_Datatype recvtype,
                 MPI_Comm comm)

MPI_ALLTOALL(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
             COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR

MPI_ALLTOALL is an extension of MPI_ALLGATHER to the case where each process
sends distinct data to each of the receivers. The jth block sent from process i is received
by process j and is placed in the ith block of recvbuf.

The type signature associated with sendcount, sendtype at a process must be equal to
the type signature associated with recvcount, recvtype at any other process. This implies
that the amount of data sent must be equal to the amount of data received, pairwise between
every pair of processes. As usual, however, the type maps may be different.

The outcome is as if each process executed a send to each process (itself included) with
a call to,

    MPI_Send(sendbuf + i*sendcount*extent(sendtype), sendcount, sendtype, i, ...),

and a receive from every other process with a call to,

    MPI_Recv(recvbuf + i*recvcount*extent(recvtype), recvcount, recvtype, i, ...).

All arguments on all processes are significant. The argument comm must have identical
values on all processes.
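The standard text gives no example for MPI_ALLTOALL, so the fragment below is an
illustrative sketch only. Each process sends one distinct int to every process (as at the
start of a transpose or a personalized exchange): block j of sbuf goes to process j, and
block i of rbuf arrives from process i.

    MPI_Comm comm;
    int gsize, myrank, i;
    int *sbuf, *rbuf;
    ...
    MPI_Comm_size( comm, &gsize );
    MPI_Comm_rank( comm, &myrank );
    sbuf = (int *)malloc(gsize*sizeof(int));
    rbuf = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i)
        sbuf[i] = 100*myrank + i;       /* block i is destined for process i */
    MPI_Alltoall( sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, comm );
    /* rbuf[i] now holds the value 100*i + myrank sent by process i */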

MPI_ALLTOALLV( sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls,
               recvtype, comm )

  IN    sendbuf     starting address of send buffer (choice)
  IN    sendcounts  integer array equal to the group size specifying the number of
                    elements to send to each processor
  IN    sdispls     integer array (of length group size). Entry j specifies the
                    displacement (relative to sendbuf) from which to take the
                    outgoing data destined for process j
  IN    sendtype    data type of send buffer elements (handle)
  OUT   recvbuf     address of receive buffer (choice)
  IN    recvcounts  integer array equal to the group size specifying the number of
                    elements that can be received from each processor
  IN    rdispls     integer array (of length group size). Entry i specifies the
                    displacement (relative to recvbuf) at which to place the incoming
                    data from process i
  IN    recvtype    data type of receive buffer elements (handle)
  IN    comm        communicator (handle)

int MPI_Alltoallv(void* sendbuf, int *sendcounts, int *sdispls,
                  MPI_Datatype sendtype, void* recvbuf, int *recvcounts,
                  int *rdispls, MPI_Datatype recvtype, MPI_Comm comm)

MPI_ALLTOALLV(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPE, RECVBUF, RECVCOUNTS,
              RDISPLS, RECVTYPE, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNTS(*), SDISPLS(*), SENDTYPE, RECVCOUNTS(*), RDISPLS(*),
            RECVTYPE, COMM, IERROR

MPI_ALLTOALLV adds flexibility to MPI_ALLTOALL in that the location of data for the
send is specified by sdispls and the location of the placement of the data on the receive side
is specified by rdispls.

The jth block sent from process i is received by process j and is placed in the ith
block of recvbuf. These blocks need not all have the same size.

The type signature associated with sendcounts[j], sendtype at process i must be equal
to the type signature associated with recvcounts[i], recvtype at process j. This implies that
the amount of data sent must be equal to the amount of data received, pairwise between
every pair of processes. Distinct type maps between sender and receiver are still allowed.

The outcome is as if each process sent a message to every other process with,

    MPI_Send(sendbuf + sdispls[i]*extent(sendtype), sendcounts[i], sendtype, i, ...),

and received a message from every other process with a call to,

    MPI_Recv(recvbuf + rdispls[i]*extent(recvtype), recvcounts[i], recvtype, i, ...).

All arguments on all processes are significant. The argument comm must have identical
values on all processes.

Rationale.  The definitions of MPI_ALLTOALL and MPI_ALLTOALLV give as much
flexibility as one would achieve by specifying n independent, point-to-point communications,
with two exceptions: all messages use the same datatype, and messages are
scattered from (or gathered to) sequential storage. (End of rationale.)

Advice to implementors.  Although the discussion of collective communication in
terms of point-to-point operation implies that each message is transferred directly
from sender to receiver, implementations may use a tree communication pattern.
Messages can be forwarded by intermediate nodes where they are split (for scatter) or
concatenated (for gather), if this is more efficient. (End of advice to implementors.)

Global Reduction Operations

The functions in this section perform a global reduce operation (such as sum, max, logical
AND, etc.) across all the members of a group. The reduction operation can be either one of
a predefined list of operations, or a user-defined operation. The global reduction functions
come in several flavors: a reduce that returns the result of the reduction at one node, an
all-reduce that returns this result at all nodes, and a scan (parallel prefix) operation. In
addition, a reduce-scatter operation combines the functionality of a reduce and of a scatter
operation.

Reduce

MPI_REDUCE( sendbuf, recvbuf, count, datatype, op, root, comm )

  IN    sendbuf    address of send buffer (choice)
  OUT   recvbuf    address of receive buffer (choice, significant only at root)
  IN    count      number of elements in send buffer (integer)
  IN    datatype   data type of elements of send buffer (handle)
  IN    op         reduce operation (handle)
  IN    root       rank of root process (integer)
  IN    comm       communicator (handle)

int MPI_Reduce(void* sendbuf, void* recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, ROOT, COMM, IERROR

MPI_REDUCE combines the elements provided in the input buffer of each process in
the group, using the operation op, and returns the combined value in the output buffer of
the process with rank root. The input buffer is defined by the arguments sendbuf, count
and datatype; the output buffer is defined by the arguments recvbuf, count and datatype;
both have the same number of elements, with the same type. The routine is called by all
group members using the same arguments for count, datatype, op, root and comm. Thus, all
processes provide input buffers and output buffers of the same length, with elements of the
same type. Each process can provide one element, or a sequence of elements, in which case
the combine operation is executed element-wise on each entry of the sequence. For example,
if the operation is MPI_MAX and the send buffer contains two elements that are floating
point numbers (count = 2 and datatype = MPI_FLOAT), then recvbuf(1) = global
max(sendbuf(1)) and recvbuf(2) = global max(sendbuf(2)).

The section on predefined reduce operations, below, lists the set of predefined operations
provided by MPI; it also enumerates the datatypes each operation can be applied to. In
addition, users may define their own operations that can be overloaded to operate on several
datatypes, either basic or derived; this is further explained in the section on user-defined
operations below.

The operation op is always assumed to be associative. All predefined operations are also
assumed to be commutative. Users may define operations that are assumed to be associative,
but not commutative. The "canonical" evaluation order of a reduction is determined by the
ranks of the processes in the group. However, the implementation can take advantage of
associativity, or associativity and commutativity, in order to change the order of evaluation.
This may change the result of the reduction for operations that are not strictly associative
and commutative, such as floating point addition.

Advice to implementors.  It is strongly recommended that MPI_REDUCE be
implemented so that the same result be obtained whenever the function is applied on the
same arguments, appearing in the same order. Note that this may prevent optimizations
that take advantage of the physical location of processors. (End of advice to
implementors.)

The datatype argument of MPI_REDUCE must be compatible with op. Predefined
operators work only with the MPI types listed in the following two sections (predefined
reduce operations, and MINLOC and MAXLOC). User-defined operators may operate on
general, derived datatypes. In this case, each argument that the reduce operation is applied
to is one element described by such a datatype, which may contain several basic values.
This is further explained in the section on user-defined operations.
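To illustrate the element-wise reduction just described, here is a minimal sketch in C
(ours, not part of the standard text): each process supplies two floats and the root receives
the two global maxima.

    float sendbuf[2], recvbuf[2];
    int root = 0;
    ...
    /* sendbuf[0] and sendbuf[1] are set locally on each process */
    MPI_Reduce( sendbuf, recvbuf, 2, MPI_FLOAT, MPI_MAX, root, comm );
    /* on the root: recvbuf[0] = global max of all sendbuf[0],
     *              recvbuf[1] = global max of all sendbuf[1];
     * on the other processes recvbuf is not significant          */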

Predefined reduce operations

The following predefined operations are supplied for MPI_REDUCE and related functions
MPI_ALLREDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN. These operations are invoked
by placing the following in op.

    Name           Meaning

    MPI_MAX        maximum
    MPI_MIN        minimum
    MPI_SUM        sum
    MPI_PROD       product
    MPI_LAND       logical and
    MPI_BAND       bit-wise and
    MPI_LOR        logical or
    MPI_BOR        bit-wise or
    MPI_LXOR       logical xor
    MPI_BXOR       bit-wise xor
    MPI_MAXLOC     max value and location
    MPI_MINLOC     min value and location

The two operations MPI_MINLOC and MPI_MAXLOC are discussed separately in the next
section. For the other predefined operations, we enumerate below the allowed combinations
of op and datatype arguments. First, define groups of MPI basic datatypes in the following
way.

    C integer:          MPI_INT, MPI_LONG, MPI_SHORT,
                        MPI_UNSIGNED_SHORT, MPI_UNSIGNED,
                        MPI_UNSIGNED_LONG
    Fortran integer:    MPI_INTEGER
    Floating point:     MPI_FLOAT, MPI_DOUBLE, MPI_REAL,
                        MPI_DOUBLE_PRECISION, MPI_LONG_DOUBLE
    Logical:            MPI_LOGICAL
    Complex:            MPI_COMPLEX
    Byte:               MPI_BYTE

Now, the valid datatypes for each option is specified below.

    Op                               Allowed Types

    MPI_MAX, MPI_MIN                 C integer, Fortran integer, Floating point
    MPI_SUM, MPI_PROD                C integer, Fortran integer, Floating point, Complex
    MPI_LAND, MPI_LOR, MPI_LXOR      C integer, Logical
    MPI_BAND, MPI_BOR, MPI_BXOR      C integer, Fortran integer, Byte

Example.  A routine that computes the dot product of two vectors that are distributed
across a group of processes and returns the answer at node zero.

    SUBROUTINE PAR_BLAS1(m, a, b, c, comm)
    REAL a(m), b(m)        ! local slice of array
    REAL c                 ! result (at node zero)
    REAL sum
    INTEGER m, comm, i, ierr

    ! local sum
    sum = 0.0
    DO i = 1, m
       sum = sum + a(i)*b(i)
    END DO

    ! global sum
    CALL MPI_REDUCE(sum, c, 1, MPI_REAL, MPI_SUM, 0, comm, ierr)
    RETURN

Example.  A routine that computes the product of a vector and an array that are
distributed across a group of processes and returns the answer at node zero.

    SUBROUTINE PAR_BLAS2(m, n, a, b, c, comm)
    REAL a(m), b(m,n)      ! local slice of array
    REAL c(n)              ! result
    REAL sum(n)
    INTEGER n, comm, i, j, ierr

    ! local sum
    DO j= 1, n
      sum(j) = 0.0
      DO i = 1, m
        sum(j) = sum(j) + a(i)*b(i,j)
      END DO
    END DO

    ! global sum
    CALL MPI_REDUCE(sum, c, n, MPI_REAL, MPI_SUM, 0, comm, ierr)

    ! return result at node zero (and garbage at the other nodes)
    RETURN

MINLOC and MAXLOC

The operator MPI_MINLOC is used to compute a global minimum and also an index attached
to the minimum value. MPI_MAXLOC similarly computes a global maximum and index. One
application of these is to compute a global minimum (maximum) and the rank of the process
containing this value.

The operation that defines MPI_MAXLOC is

    (u, i) o (v, j) = (w, k)

where

    w = max(u, v)

and

    k = i            if u > v
      = min(i, j)    if u = v
      = j            if u < v

MPI_MINLOC is defined similarly:

    (u, i) o (v, j) = (w, k)

where

    w = min(u, v)

and

    k = i            if u < v
      = min(i, j)    if u = v
      = j            if u > v

Both operations are associative and commutative. Note that if MPI_MAXLOC is applied
to reduce a sequence of pairs (u_0, 0), (u_1, 1), ..., (u_{n-1}, n-1), then the value returned is
(u, r), where u = max_i u_i and r is the index of the first global maximum in the sequence.
Thus, if each process supplies a value and its rank within the group, then a reduce operation
with op = MPI_MAXLOC will return the maximum value and the rank of the first process
with that value. Similarly, MPI_MINLOC can be used to return a minimum and its index.
More generally, MPI_MINLOC computes a lexicographic minimum, where elements are
ordered according to the first component of each pair, and ties are resolved according to
the second component.

The reduce operation is defined to operate on arguments that consist of a pair: value
and index. For both Fortran and C, types are provided to describe the pair. The potentially
mixed-type nature of such arguments is a problem in Fortran. The problem is circumvented,
for Fortran, by having the MPI-provided type consist of a pair of the same type as value,
and coercing the index to this type also. In C, the MPI-provided pair type has distinct
types and the index is an int.

In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide
a datatype argument that represents a pair (value and index). MPI provides seven such
predefined datatypes. The operations MPI_MAXLOC and MPI_MINLOC can be used with each
of the following datatypes.

Fortran:
    Name                      Description
    MPI_2REAL                 pair of REALs
    MPI_2DOUBLE_PRECISION     pair of DOUBLE PRECISION variables
    MPI_2INTEGER              pair of INTEGERs
    MPI_2COMPLEX              pair of COMPLEXes

C:
    Name                      Description
    MPI_FLOAT_INT             float and int
    MPI_DOUBLE_INT            double and int
    MPI_LONG_INT              long and int
    MPI_2INT                  pair of int
    MPI_SHORT_INT             short and int
    MPI_LONG_DOUBLE_INT       long double and int

The datatype MPI_2REAL is as if defined by the following (see the section on datatype
constructors).

    MPI_TYPE_CONTIGUOUS(2, MPI_REAL, MPI_2REAL)

Similar statements apply for MPI_2INTEGER, MPI_2DOUBLE_PRECISION, and MPI_2INT.

The datatype MPI_FLOAT_INT is as if defined by the following sequence of instructions.

    type[0] = MPI_FLOAT
    type[1] = MPI_INT
    disp[0] = 0
    disp[1] = sizeof(float)
    block[0] = 1
    block[1] = 1
    MPI_TYPE_STRUCT(2, block, disp, type, MPI_FLOAT_INT)

Similar statements apply for MPI_LONG_INT and MPI_DOUBLE_INT.

Example.  Each process has an array of 30 doubles, in C. For each of the 30 locations,
compute the value and rank of the process containing the largest value.

    /* each process has an array of 30 double: ain[30]
     */
    double ain[30], aout[30];
    int  ind[30];
    struct {
        double val;
        int   rank;
    } in[30], out[30];
    int i, myrank, root;

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i=0; i<30; ++i) {
        in[i].val = ain[i];
        in[i].rank = myrank;
    }
    MPI_Reduce( in, out, 30, MPI_DOUBLE_INT, MPI_MAXLOC, root, comm );
    /* At this point, the answer resides on process root
     */
    if (myrank == root) {
        /* read ranks out
         */
        for (i=0; i<30; ++i) {
            aout[i] = out[i].val;
            ind[i] = out[i].rank;
        }
    }

Example.  Same example, in Fortran.

    ! each process has an array of 30 double: ain(30)

    DOUBLE PRECISION ain(30), aout(30)
    INTEGER ind(30)
    DOUBLE PRECISION in(2,30), out(2,30)
    INTEGER i, myrank, root, ierr

    CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    DO i = 1, 30
        in(1,i) = ain(i)
        in(2,i) = myrank      ! myrank is coerced to a double
    END DO

    CALL MPI_REDUCE( in, out, 30, MPI_2DOUBLE_PRECISION, MPI_MAXLOC, root,
                     comm, ierr )
    ! At this point, the answer resides on process root

    IF (myrank .EQ. root) THEN
        ! read ranks out
        DO i = 1, 30
            aout(i) = out(1,i)
            ind(i) = out(2,i)   ! rank is coerced back to an integer
        END DO
    END IF

Example.  Each process has a non-empty array of values. Find the minimum global
value, the rank of the process that holds it, and its index on this process.

    #define  LEN   1000

    float val[LEN];        /* local array of values */
    int count;             /* local number of values */
    int i, myrank, root, minrank, minindex;
    float minval;

    struct {
        float value;
        int   index;
    } in, out;

    /* local minloc */
    in.value = val[0];
    in.index = 0;
    for (i=1; i < count; i++)
        if (in.value > val[i]) {
            in.value = val[i];
            in.index = i;
        }

    /* global minloc */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    in.index = myrank*LEN + in.index;
    MPI_Reduce( &in, &out, 1, MPI_FLOAT_INT, MPI_MINLOC, root, comm );
    /* At this point, the answer resides on process root
     */
    if (myrank == root) {
        /* read answer out
         */
        minval = out.value;
        minrank = out.index / LEN;
        minindex = out.index % LEN;
    }

Rationale.  The definition of MPI_MINLOC and MPI_MAXLOC given here has the
advantage that it does not require any special-case handling of these two operations:
they are handled like any other reduce operation. A programmer can provide his or
her own definition of MPI_MAXLOC and MPI_MINLOC, if so desired. The disadvantage
is that values and indices have to be first interleaved, and that indices and values have
to be coerced to the same type, in Fortran. (End of rationale.)

User-Defined Operations

MPI_OP_CREATE( function, commute, op )

  IN    function   user defined function (function)
  IN    commute    true if commutative; false otherwise.
  OUT   op         operation (handle)

int MPI_Op_create(MPI_User_function *function, int commute, MPI_Op *op)

MPI_OP_CREATE( FUNCTION, COMMUTE, OP, IERROR)
    EXTERNAL FUNCTION
    LOGICAL COMMUTE
    INTEGER OP, IERROR

MPI_OP_CREATE binds a user-defined global operation to an op handle that can
subsequently be used in MPI_REDUCE, MPI_ALLREDUCE, MPI_REDUCE_SCATTER, and
MPI_SCAN. The user-defined operation is assumed to be associative. If commute = true,
then the operation should be both commutative and associative. If commute = false,
then the order of operations is fixed and is defined to be in ascending, process rank order,
beginning with process zero.

function is the user-defined function, which must have the following four arguments:
invec, inoutvec, len and datatype.

The ANSI-C prototype for the function is the following.

    typedef void MPI_User_function( void *invec, void *inoutvec, int *len,
                                    MPI_Datatype *datatype);

The Fortran declaration of the user-defined function appears below.

    FUNCTION USER_FUNCTION( INVEC(*), INOUTVEC(*), LEN, TYPE)
    <type> INVEC(LEN), INOUTVEC(LEN)
    INTEGER LEN, TYPE

The datatype argument is a handle to the data type that was passed into the call to
MPI_REDUCE. The user reduce function should be written such that the following holds:
let u[0], ..., u[len-1] be the len elements in the communication buffer described by the
arguments invec, len and datatype when the function is invoked; let v[0], ..., v[len-1] be len
elements in the communication buffer described by the arguments inoutvec, len and datatype
when the function is invoked; let w[0], ..., w[len-1] be len elements in the communication
buffer described by the arguments inoutvec, len and datatype when the function returns;
then w[i] = u[i] o v[i], for i = 0, ..., len-1, where o is the reduce operation that the function
computes.

Informally, we can think of invec and inoutvec as arrays of len elements that function
is combining. The result of the reduction over-writes values in inoutvec, hence the name.
Each invocation of the function results in the pointwise evaluation of the reduce operator
on len elements: i.e., the function returns in inoutvec[i] the value invec[i] o inoutvec[i], for
i = 0, ..., count-1, where o is the combining operation computed by the function.

Rationale.  The len argument allows MPI_REDUCE to avoid calling the function for
each element in the input buffer. Rather, the system can choose to apply the function
to chunks of input. In C, it is passed in as a reference for reasons of compatibility
with Fortran.

By internally comparing the value of the datatype argument to known, global handles,
it is possible to overload the use of a single user-defined function for several, different
data types. (End of rationale.)

General datatypes may be passed to the user function. However, use of datatypes that
are not contiguous is likely to lead to inefficiencies.

No MPI communication function may be called inside the user function. MPI_ABORT
may be called inside the function in case of an error.

Advice to users.  Suppose one defines a library of user-defined reduce functions that
are overloaded: the datatype argument is used to select the right execution path at each
invocation, according to the types of the operands. The user-defined reduce function
cannot "decode" the datatype argument that it is passed, and cannot identify, by itself,
the correspondence between the datatype handles and the datatypes they represent.
This correspondence was established when the datatypes were created. Before the
library is used, a library initialization preamble must be executed. This preamble
code will define the datatypes that are used by the library, and store handles to these
datatypes in global, static variables that are shared by the user code and the library
code.

The Fortran version of MPI_REDUCE will invoke a user-defined reduce function using
the Fortran calling conventions and will pass a Fortran-type datatype argument; the
C version will use C calling convention and the C representation of a datatype handle.
Users who plan to mix languages should define their reduction functions accordingly.
(End of advice to users.)

Advice to implementors.  We outline below a naive and inefficient implementation of
MPI_REDUCE.

    if (rank > 0) {
        RECV(tempbuf, count, datatype, rank-1,...)
        User_reduce( tempbuf, sendbuf, count, datatype)
    }
    if (rank < groupsize-1) {
        SEND( sendbuf, count, datatype, rank+1, ...)
    }
    /* answer now resides in process groupsize-1 ... now send to root
     */
    if (rank == groupsize-1) {
        SEND( sendbuf, count, datatype, root, ...)
    }
    if (rank == root) {
        RECV(recvbuf, count, datatype, groupsize-1,...)
    }

The reduction computation proceeds, sequentially, from process 0 to process
groupsize-1. This order is chosen so as to respect the order of a possibly non-commutative
operator defined by the function User_reduce. A more efficient implementation is
achieved by taking advantage of associativity and using a logarithmic tree reduction.
Commutativity can be used to advantage, for those cases in which the commute
argument to MPI_OP_CREATE is true. Also, the amount of temporary buffer required
can be reduced, and communication can be pipelined with computation, by transferring
and reducing the elements in chunks of size len < count.

The predefined reduce operations can be implemented as a library of user-defined
operations. However, better performance might be achieved if MPI_REDUCE handles
these functions as a special case. (End of advice to implementors.)

MPI_OP_FREE( op )

  IN    op    operation (handle)

int MPI_Op_free( MPI_Op *op)

MPI_OP_FREE( OP, IERROR)
    INTEGER OP, IERROR

Marks a user-defined reduction operation for deallocation and sets op to MPI_OP_NULL.

Example of User-defined Reduce

It is time for an example of user-defined reduction.

Example.  Compute the product of an array of complex numbers, in C.

    typedef struct {
        double real,imag;
    } Complex;

    /* the user-defined function
     */
    void myProd( Complex *in, Complex *inout, int *len, MPI_Datatype *dptr )
    {
        int i;
        Complex c;

        for (i=0; i< *len; ++i) {
            c.real = inout->real*in->real -
                       inout->imag*in->imag;
            c.imag = inout->real*in->imag +
                       inout->imag*in->real;
            *inout = c;
            in++; inout++;
        }
    }

    /* and, to call it...
     */
    ...

    /* each process has an array of 100 Complexes
     */
    Complex a[100], answer[100];
    MPI_Op myOp;
    MPI_Datatype ctype;

    /* explain to MPI how type Complex is defined
     */
    MPI_Type_contiguous( 2, MPI_DOUBLE, &ctype );
    MPI_Type_commit( &ctype );
    /* create the complex-product user-op
     */
    MPI_Op_create( myProd, 1 /* commutative */, &myOp );

    MPI_Reduce( a, answer, 100, ctype, myOp, root, comm );

    /* At this point, the answer, which consists of 100 Complexes,
     * resides on process root
     */

All-Reduce

MPI includes variants of each of the reduce operations where the result is returned to all
processes in the group. MPI requires that all processes participating in these operations
receive identical results.

MPI_ALLREDUCE( sendbuf, recvbuf, count, datatype, op, comm )

  IN    sendbuf    starting address of send buffer (choice)
  OUT   recvbuf    starting address of receive buffer (choice)
  IN    count      number of elements in send buffer (integer)
  IN    datatype   data type of elements of send buffer (handle)
  IN    op         operation (handle)
  IN    comm       communicator (handle)

int MPI_Allreduce(void* sendbuf, void* recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

MPI_ALLREDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, IERROR

Same as MPI_REDUCE except that the result appears in the receive buffer of all the
group members.

Advice to implementors.  The all-reduce operations can be implemented as a
reduce, followed by a broadcast. However, a direct implementation can lead to better
performance. (End of advice to implementors.)

Example.  A routine that computes the product of a vector and an array that are
distributed across a group of processes and returns the answer at all nodes (see also the
earlier example that returns the answer at node zero only).

    SUBROUTINE PAR_BLAS2(m, n, a, b, c, comm)
    REAL a(m), b(m,n)      ! local slice of array
    REAL c(n)              ! result
    REAL sum(n)
    INTEGER n, comm, i, j, ierr

    ! local sum
    DO j= 1, n
      sum(j) = 0.0
      DO i = 1, m
        sum(j) = sum(j) + a(i)*b(i,j)
      END DO
    END DO

    ! global sum
    CALL MPI_ALLREDUCE(sum, c, n, MPI_REAL, MPI_SUM, comm, ierr)

    ! return result at all nodes
    RETURN
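As a C counterpart to the Fortran example above, the following illustrative sketch (ours,
not part of the standard text) computes a global sum that every process receives, as one
would for a shared convergence test.

    double local_err, global_err;
    ...
    /* local_err is computed on each process */
    MPI_Allreduce( &local_err, &global_err, 1, MPI_DOUBLE, MPI_SUM, comm );
    /* every process now holds the same value of global_err and can take
     * the same termination decision without further communication        */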

Reduce-Scatter

MPI includes variants of each of the reduce operations where the result is scattered to all
processes in the group on return.

MPI_REDUCE_SCATTER( sendbuf, recvbuf, recvcounts, datatype, op, comm )

  IN    sendbuf     starting address of send buffer (choice)
  OUT   recvbuf     starting address of receive buffer (choice)
  IN    recvcounts  integer array specifying the number of elements in result
                    distributed to each process. Array must be identical on all
                    calling processes.
  IN    datatype    data type of elements of input buffer (handle)
  IN    op          operation (handle)
  IN    comm        communicator (handle)

int MPI_Reduce_scatter(void* sendbuf, void* recvbuf, int *recvcounts,
                       MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

MPI_REDUCE_SCATTER(SENDBUF, RECVBUF, RECVCOUNTS, DATATYPE, OP, COMM,
                   IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER RECVCOUNTS(*), DATATYPE, OP, COMM, IERROR

MPI_REDUCE_SCATTER first does an element-wise reduction on a vector of
count = sum over i of recvcounts[i] elements in the send buffer defined by sendbuf, count
and datatype. Next, the resulting vector of results is split into n disjoint segments, where
n is the number of members in the group. Segment i contains recvcounts[i] elements. The
ith segment is sent to process i and stored in the receive buffer defined by recvbuf,
recvcounts[i] and datatype.

Advice to implementors.  The MPI_REDUCE_SCATTER routine is functionally
equivalent to: an MPI_REDUCE operation with count equal to the sum of recvcounts[i],
followed by MPI_SCATTERV with sendcounts equal to recvcounts. However, a direct
implementation may run faster. (End of advice to implementors.)
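No example of MPI_REDUCE_SCATTER appears in this chapter, so the following fragment
is an illustrative sketch only (the variable names are ours). Each process contributes a
vector of gsize values; the element-wise sums are formed and element i of the result is
delivered to process i (recvcounts[i] = 1 for all i).

    MPI_Comm comm;
    int gsize, i;
    double *sendbuf, recv;
    int *recvcounts;
    ...
    MPI_Comm_size( comm, &gsize );
    sendbuf = (double *)malloc(gsize*sizeof(double));
    /* ... sendbuf[i] is this process' contribution to the value owned by process i ... */
    recvcounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) recvcounts[i] = 1;
    MPI_Reduce_scatter( sendbuf, &recv, recvcounts, MPI_DOUBLE, MPI_SUM, comm );
    /* process i now holds the sum, over all processes, of their sendbuf[i] */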

Scan

MPI_SCAN( sendbuf, recvbuf, count, datatype, op, comm )

  IN    sendbuf    starting address of send buffer (choice)
  OUT   recvbuf    starting address of receive buffer (choice)
  IN    count      number of elements in input buffer (integer)
  IN    datatype   data type of elements of input buffer (handle)
  IN    op         operation (handle)
  IN    comm       communicator (handle)

int MPI_Scan(void* sendbuf, void* recvbuf, int count,
             MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )

MPI_SCAN(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, IERROR

MPI_SCAN is used to perform a prefix reduction on data distributed across the group.
The operation returns, in the receive buffer of the process with rank i, the reduction of
the values in the send buffers of processes with ranks 0, ..., i (inclusive). The type of
operations supported, their semantics, and the constraints on send and receive buffers are
as for MPI_REDUCE.
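A minimal sketch (ours, not part of the standard text) of an inclusive prefix sum: the
process with rank i receives the sum of the ranks 0, ..., i.

    int myrank, prefix;
    ...
    MPI_Comm_rank( comm, &myrank );
    MPI_Scan( &myrank, &prefix, 1, MPI_INT, MPI_SUM, comm );
    /* on the process with rank i, prefix now equals 0 + 1 + ... + i */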

Rationale.  We have defined an inclusive scan, that is, the prefix reduction on process
i includes the data from process i. An alternative is to define scan in an exclusive
manner, where the result on i only includes data up to i-1. Both definitions are useful.
The latter has some advantages: the inclusive scan can always be computed from the
exclusive scan with no additional communication; for non-invertible operations such
as max and min, communication is required to compute the exclusive scan from the
inclusive scan. There is, however, a complication with exclusive scan since one must
define the "unit" element for the reduction in this case. That is, one must explicitly
say what occurs for process 0. This was thought to be complex for user-defined
operations and, hence, the exclusive scan was dropped. (End of rationale.)

Example using MPI_SCAN

Example.  This example uses a user-defined operation to produce a segmented scan.
A segmented scan takes, as input, a set of values and a set of logicals, and the logicals
delineate the various segments of the scan. For example,

    values     v1    v2       v3    v4       v5          v6    v7       v8
    logicals   0     0        1     1        1           0     0        1
    result     v1    v1+v2    v3    v3+v4    v3+v4+v5    v6    v6+v7    v8

The operator that produces this effect is

    (u, i) o (v, j) = (w, j)

where

    w = u + v    if i = j
      = v        if i <> j

Note that this is a non-commutative operator. C code that implements it is given
below.

    typedef struct {
        double val;
        int log;
    } SegScanPair;

    /* the user-defined function
     */
    void segScan( SegScanPair *in, SegScanPair *inout, int *len,
                  MPI_Datatype *dptr )
    {
        int i;
        SegScanPair c;

        for (i=0; i< *len; ++i) {
            if ( in->log == inout->log )
                c.val = in->val + inout->val;
            else
                c.val = inout->val;
            c.log = inout->log;
            *inout = c;
            in++; inout++;
        }
    }

Note that the inout argument to the user-defined function corresponds to the right-hand
operand of the operator. When using this operator, we must be careful to specify that
it is non-commutative, as in the following.

    int i;
    MPI_Aint      base;
    SegScanPair   a, answer;
    MPI_Op        myOp;
    MPI_Datatype  type[2] = {MPI_DOUBLE, MPI_INT};
    MPI_Aint      disp[2];
    int           blocklen[2] = { 1, 1};
    MPI_Datatype  sspair;

    /* explain to MPI how type SegScanPair is defined
     */
    MPI_Address( &a, &disp[0]);
    MPI_Address( &a.log, &disp[1]);
    base = disp[0];
    for (i=0; i<2; ++i) disp[i] -= base;
    MPI_Type_struct( 2, blocklen, disp, type, &sspair );
    MPI_Type_commit( &sspair );
    /* create the segmented-scan user-op
     */
    MPI_Op_create( segScan, 0 /* non-commutative */, &myOp );
    ...
    MPI_Scan( &a, &answer, 1, sspair, myOp, comm );

Correctness

A correct, portable program must invoke collective communications so that deadlock will
not occur, whether collective communications are synchronizing or not. The following
examples illustrate dangerous use of collective routines.

Example.  The following is erroneous.

    switch(rank) {
        case 0:
            MPI_Bcast(buf1, count, type, 0, comm);
            MPI_Bcast(buf2, count, type, 1, comm);
            break;
        case 1:
            MPI_Bcast(buf2, count, type, 1, comm);
            MPI_Bcast(buf1, count, type, 0, comm);
            break;
    }

We assume that the group of comm is {0,1}. Two processes execute two broadcast
operations in reverse order. If the operation is synchronizing then a deadlock will occur.

Collective operations must be executed in the same order at all members of the
communication group.

Example.  The following is erroneous.

    switch(rank) {
        case 0:
            MPI_Bcast(buf1, count, type, 0, comm0);
            MPI_Bcast(buf2, count, type, 0, comm2);
            break;
        case 1:
            MPI_Bcast(buf1, count, type, 0, comm1);
            MPI_Bcast(buf2, count, type, 0, comm0);
            break;
        case 2:
            MPI_Bcast(buf1, count, type, 0, comm2);
            MPI_Bcast(buf2, count, type, 0, comm1);
            break;
    }

Assume that the group of comm0 is {0,1}, of comm1 is {1,2} and of comm2 is {2,0}. If
the broadcast is a synchronizing operation, then there is a cyclic dependency: the broadcast
in comm2 completes only after the broadcast in comm0; the broadcast in comm0 completes
only after the broadcast in comm1; and the broadcast in comm1 completes only after the
broadcast in comm2. Thus, the code will deadlock.

Collective operations must be executed in an order so that no cyclic dependences occur.

Example. The following is erroneous.

    switch(rank) {
        case 0:
            MPI_Bcast(buf1, count, type, 0, comm);
            MPI_Send(buf2, count, type, 1, tag, comm);
            break;
        case 1:
            MPI_Recv(buf2, count, type, 0, tag, comm, &status);
            MPI_Bcast(buf1, count, type, 0, comm);
            break;
    }

Process zero executes a broadcast, followed by a blocking send operation. Process one first executes a blocking receive that matches the send, followed by a broadcast call that matches the broadcast of process zero. This program may deadlock. The broadcast call on process zero may block until process one executes the matching broadcast call, so that the send is not executed. Process one will definitely block on the receive and so, in this case, never executes the broadcast.

The relative order of execution of collective operations and point-to-point operations should be such that, even if the collective operations and the point-to-point operations are synchronizing, no deadlock will occur.

Example. A correct, but non-deterministic program.

    switch(rank) {
        case 0:
            MPI_Bcast(buf1, count, type, 0, comm);
            MPI_Send(buf2, count, type, 1, tag, comm);
            break;
        case 1:
            MPI_Recv(buf2, count, type, MPI_ANY_SOURCE, tag, comm, &status);
            MPI_Bcast(buf1, count, type, 0, comm);
            MPI_Recv(buf2, count, type, MPI_ANY_SOURCE, tag, comm, &status);
            break;
        case 2:
            MPI_Send(buf2, count, type, 1, tag, comm);
            MPI_Bcast(buf1, count, type, 0, comm);
            break;
    }

[Figure: "First Execution" and "Second Execution" of this example, showing the two possible matchings of sends and receives among processes 0, 1 and 2. One cannot rely on synchronization from a broadcast to make the program deterministic.]

All three processes participate in a broadcast. Process 0 sends a message to process 1 after the broadcast, and process 2 sends a message to process 1 before the broadcast. Process 1 receives before and after the broadcast, with a wildcard source argument.

Two possible executions of this program, with different matchings of sends and receives, are illustrated in the figure above. Note that the second execution has the peculiar effect that a send executed after the broadcast is received at another node before the broadcast. This example illustrates the fact that one should not rely on collective communication functions to have particular synchronization effects. A program that works correctly only when the first execution occurs (only when broadcast is synchronizing) is erroneous.

Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.

Advice to implementors. Assume that broadcast is implemented using point-to-point MPI communication. Suppose the following two rules are followed.

- All receives specify their source explicitly (no wildcards).
- Each process sends all messages that pertain to one collective call before sending any message that pertains to a subsequent collective call.

Then messages belonging to successive broadcasts cannot be confused, as the order of point-to-point messages is preserved.

It is the implementor's responsibility to ensure that point-to-point messages are not confused with collective messages. One way to accomplish this is, whenever a communicator is created, to also create a "hidden communicator" for collective communication. One could achieve a similar effect more cheaply, for example, by using a hidden tag or context bit to indicate whether the communicator is used for point-to-point or collective communication. (End of advice to implementors.)
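The hidden-context idea can be pictured with a small sketch (not from the standard; the structure and field names are invented for illustration): an implementation-internal communicator might carry one context value for point-to-point traffic and a second, hidden one for collective traffic, and stamp every internal message with the appropriate value.

    /* Illustrative only: an implementation-internal view of a communicator
       that keeps collective traffic in its own hidden context. */
    typedef struct {
        int pt2pt_context;       /* matched by ordinary sends and receives  */
        int collective_context;  /* hidden context used by MPI_Bcast, etc.  */
        /* group table, reference count, cached attributes, ...             */
    } impl_comm;

    /* Every internal send is tagged with one of the two contexts, so a
       point-to-point receive can never match a collective message.         */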

Chapter

Groups, Contexts, and Communicators

Introduction

This chapter introduces MPI features that support the development of parallel libraries. Parallel libraries are needed to encapsulate the distracting complications inherent in parallel implementations of key algorithms. They help to ensure consistent correctness of such procedures, and provide a "higher level" of portability than MPI itself can provide. As such, libraries prevent each programmer from repeating the work of defining consistent data structures, data layouts, and methods that implement key algorithms (such as matrix operations). Since the best libraries come with several variations on parallel systems (different data layouts, different strategies depending on the size of the system or problem, or type of floating point), this too needs to be hidden from the user.

We refer the reader to the literature for further information on writing libraries in MPI, using the features described in this chapter.

Features Needed to Support Libraries

The key features needed to support the creation of robust parallel libraries are as follows:

- Safe communication space, that guarantees that libraries can communicate as they need to, without conflicting with communication extraneous to the library,
- Group scope for collective operations, that allow libraries to avoid unnecessarily synchronizing uninvolved processes (potentially running unrelated code),
- Abstract process naming to allow libraries to describe their communication in terms suitable to their own data structures and algorithms,
- The ability to "adorn" a set of communicating processes with additional user-defined attributes, such as extra collective operations. This mechanism should provide a means for the user or library writer effectively to extend a message-passing notation.

In addition, a unified mechanism or object is needed for conveniently denoting communication context, the group of communicating processes, to house abstract process naming, and to store "adornments."

MPI's Support for Libraries

The corresponding concepts that MPI provides, specifically to support robust libraries, are as follows:

- Contexts of communication,
- Groups of processes,
- Virtual topologies,
- Attribute caching,
- Communicators.

Communicators (see below) encapsulate all of these ideas in order to provide the appropriate scope for all communication operations in MPI. Communicators are divided into two kinds: intra-communicators for operations within a single group of processes, and inter-communicators, for point-to-point communication between two groups of processes.

Caching. Communicators (see below) provide a "caching" mechanism that allows one to associate new attributes with communicators, on a par with MPI built-in features. This can be used by advanced users to adorn communicators further, and by MPI to implement some communicator functions. For example, the virtual-topology functions described in the chapter on process topologies are likely to be supported this way.

Groups Groups dene an ordered collection of pro cesses each with a rank and it is this

25

group that denes the lowlevel names for interpro cess communication ranks are used for

26

sending and receiving Thus groups dene a scop e for pro cess names in p ointtop oint

27

communication In addition groups dene the scop e of collective op erations Groups may

28

b e manipulated separately from communicators in MPI but only communicators can b e

29

used in communication op erations

30

31

Intracommunicators The most commonly used means for message passing in MPI is via

32

intracommunicators Intracommunicators contain an instance of a group contexts of

33

communication for b oth p ointtop oint and collective communication and the ability to

34

include virtual top ology and other attributes These features work as follows

35

36

Contexts provide the ability to have separate safe universes of message passing in

37

MPI A context is akin to an additional tag that dierentiates messages The system

38

manages this dierentiation pro cess The use of separate communication contexts

39

by distinct libraries or distinct library invo cations insulates communication internal

40

to the library execution from external communication This allows the invo cation of

41

the library even if there are p ending communications on other communicators and

42

avoids the need to synchronize entry or exit into library co de Pending p ointtop oint

43

communications are also guaranteed not to interfere with collective communications

44

within a single communicator

45

46

Groups dene the participants in the communication see ab ove of a communicator

47 48


A virtual top ology denes a sp ecial mapping of the ranks in a group to and from a

1

top ology Sp ecial constructors for communicators are dened in chapter to provide

2

this feature Intracommunicators as describ ed in this chapter do not have top ologies

3

4

Attributes dene the lo cal information that the user or library has added to a com

5

municator for later reference

6

7

Advice to users The current practice in many communication libraries is that there

8

is a unique predened communication universe that includes all pro cesses available

9

when the parallel program is initiated the pro cesses are assigned consecutive ranks

10

Participants in a p ointtop oint communication are identied by their rank a collec

11

tive communication such as broadcast always involves all pro cesses This practice

12

can b e followed in MPI by using the predened communicator MPI COMM WORLD

13

Users who are satised with this practice can plug in MPI COMM WORLD wherever

14

a communicator argument is required and can consequently disregard the rest of this

15

chapter End of advice to users

16

17

Inter-communicators. The discussion has dealt so far with intra-communication: communication within a group. MPI also supports inter-communication: communication between two non-overlapping groups. When an application is built by composing several parallel modules, it is convenient to allow one module to communicate with another using local ranks for addressing within the second module. This is especially convenient in a client-server computing paradigm, where either client or server are parallel. The support of inter-communication also provides a mechanism for the extension of MPI to a dynamic model where not all processes are preallocated at initialization time. In such a situation, it becomes necessary to support communication across "universes." Inter-communication is supported by objects called inter-communicators. These objects bind two groups together with communication contexts shared by both groups. For inter-communicators, these features work as follows:

- Contexts provide the ability to have a separate safe "universe" of message passing between the two groups. A send in the local group is always a receive in the remote group, and vice versa. The system manages this differentiation process. The use of separate communication contexts by distinct libraries (or distinct library invocations) insulates communication internal to the library execution from external communication. This allows the invocation of the library even if there are pending communications on "other" communicators, and avoids the need to synchronize entry or exit into library code. There is no general-purpose collective communication on inter-communicators, so contexts are used just to isolate point-to-point communication.
- A local and remote group specify the recipients and destinations for an inter-communicator.
- Virtual topology is undefined for an inter-communicator.
- As before, attribute caching defines the local information that the user or library has added to a communicator for later reference.

MPI provides mechanisms for creating and manipulating inter-communicators. They are used for point-to-point communication in a related manner to intra-communicators. Users who do not need inter-communication in their applications can safely ignore this extension. Users who need collective operations via inter-communicators must layer it on top of MPI. Users who require inter-communication between overlapping groups must also layer this capability on top of MPI.

Basic Concepts

In this section, we turn to a more formal definition of the concepts introduced above.

Groups

A group is an ordered set of process identifiers (henceforth processes); processes are implementation-dependent objects. Each process in a group is associated with an integer rank. Ranks are contiguous and start from zero. Groups are represented by opaque group objects, and hence cannot be directly transferred from one process to another. A group is used within a communicator to describe the participants in a communication "universe" and to rank such participants (thus giving them unique names within that "universe" of communication).

There is a special pre-defined group: MPI_GROUP_EMPTY, which is a group with no members. The pre-defined constant MPI_GROUP_NULL is the value used for invalid group handles.

Advice to users. MPI_GROUP_EMPTY, which is a valid handle to an empty group, should not be confused with MPI_GROUP_NULL, which in turn is an invalid handle. The former may be used as an argument to group operations; the latter, which is returned when a group is freed, is not a valid argument. (End of advice to users.)

Advice to implementors. A group may be represented by a virtual-to-real process-address-translation table. Each communicator object (see below) would have a pointer to such a table.

Simple implementations of MPI will enumerate groups, such as in a table. However, more advanced data structures make sense in order to improve scalability and memory usage with large numbers of processes. Such implementations are possible with MPI. (End of advice to implementors.)

Contexts

39

A context is a prop erty of communicators dened next that allows partitioning of the

40

communication space A message sent in one context cannot b e received in another context

41

Furthermore where p ermitted collective op erations are indep endent of p ending p ointto

42

p oint op erations Contexts are not explicit MPI ob jects they app ear only as part of the

43

realization of communicators b elow

44

45

Advice to implementors Distinct communicators in the same pro cess have distinct

46

contexts A context is essentially a systemmanaged tag or tags needed to make

47

a communicator safe for p ointtop oint and MPIdened collective communication 48


Safety means that collective and p ointtop oint communication within one commu

1

nicator do not interfere and that communication over distinct communicators dont

2

interfere

3

4

A p ossible implementation for a context is as a supplemental tag attached to messages

5

on send and matched on receive Each intracommunicator stores the value of its two

6

tags one for p ointtop oint and one for collective communication Communicator

7

generating functions use a collective communication to agree on a new groupwide

8

unique context

9

Analogously in intercommunication which is strictly p ointtop oint communication

10

two context tags are stored p er communicator one used by group A to send and group

11

B to receive and a second used by group B to send and for group A to receive

12

Since contexts are not explicit ob jects other implementations are also p ossible End

13

of advice to implementors

14

15

Intra-Communicators

Intra-communicators bring together the concepts of group and context. To support implementation-specific optimizations, and application topologies (defined in the next chapter), communicators may also "cache" additional information (see the section on caching, below). MPI communication operations reference communicators to determine the scope and the "communication universe" in which a point-to-point or collective operation is to operate.

Each communicator contains a group of valid participants; this group always includes the local process. The source and destination of a message is identified by process rank within that group.

For collective communication, the intra-communicator specifies the set of processes that participate in the collective operation (and their order, when significant). Thus, the communicator restricts the "spatial" scope of communication, and provides machine-independent process addressing through ranks.

Intra-communicators are represented by opaque intra-communicator objects, and hence cannot be directly transferred from one process to another.

Predefined Intra-Communicators

An initial intra-communicator MPI_COMM_WORLD of all processes the local process can communicate with after initialization (itself included) is defined once MPI_INIT has been called. In addition, the communicator MPI_COMM_SELF is provided, which includes only the process itself.

The predefined constant MPI_COMM_NULL is the value used for invalid communicator handles.

In a static-process-model implementation of MPI, all processes that participate in the computation are available after MPI is initialized. For this case, MPI_COMM_WORLD is a communicator of all processes available for the computation; this communicator has the same value in all processes. In an implementation of MPI where processes can dynamically join an MPI execution, it may be the case that a process starts an MPI computation without having access to all other processes. In such situations, MPI_COMM_WORLD is a communicator incorporating all processes with which the joining process can immediately communicate. Therefore, MPI_COMM_WORLD may simultaneously have different values in different processes.

All MPI implementations are required to provide the MPI_COMM_WORLD communicator. It cannot be deallocated during the life of a process. The group corresponding to this communicator does not appear as a pre-defined constant, but it may be accessed using MPI_COMM_GROUP (see below). MPI does not specify the correspondence between the process rank in MPI_COMM_WORLD and its (machine-dependent) absolute address. Neither does MPI specify the function of the host process, if any. Other implementation-dependent, predefined communicators may also be provided.

Group Management

This section describes the manipulation of process groups in MPI. These operations are local and their execution does not require interprocess communication.

Group Accessors

MPI_GROUP_SIZE(group, size)

    IN   group   group (handle)
    OUT  size    number of processes in the group (integer)

int MPI_Group_size(MPI_Group group, int *size)

MPI_GROUP_SIZE(GROUP, SIZE, IERROR)
    INTEGER GROUP, SIZE, IERROR

MPI_GROUP_RANK(group, rank)

    IN   group   group (handle)
    OUT  rank    rank of the calling process in group, or MPI_UNDEFINED if the process is not a member (integer)

int MPI_Group_rank(MPI_Group group, int *rank)

MPI_GROUP_RANK(GROUP, RANK, IERROR)
    INTEGER GROUP, RANK, IERROR

MPI_GROUP_TRANSLATE_RANKS(group1, n, ranks1, group2, ranks2)

    IN   group1   group1 (handle)
    IN   n        number of ranks in ranks1 and ranks2 arrays (integer)
    IN   ranks1   array of zero or more valid ranks in group1
    IN   group2   group2 (handle)
    OUT  ranks2   array of corresponding ranks in group2, MPI_UNDEFINED when no correspondence exists

int MPI_Group_translate_ranks(MPI_Group group1, int n, int *ranks1,
                              MPI_Group group2, int *ranks2)

MPI_GROUP_TRANSLATE_RANKS(GROUP1, N, RANKS1, GROUP2, RANKS2, IERROR)
    INTEGER GROUP1, N, RANKS1(*), GROUP2, RANKS2(*), IERROR

This function is important for determining the relative numbering of the same processes in two different groups. For instance, if one knows the ranks of certain processes in the group of MPI_COMM_WORLD, one might want to know their ranks in a subset of that group.
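As an illustration (a sketch, not part of the standard text; it assumes the program is run with at least three processes), the following program asks where the first three processes of MPI_COMM_WORLD ended up in a subgroup that excludes process 0; the excluded rank translates to MPI_UNDEFINED.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Group world_group, subgroup;
        int excl[1]   = {0};          /* drop rank 0 from the subgroup */
        int ranks1[3] = {0, 1, 2};    /* ranks in the world group      */
        int ranks2[3];                /* corresponding subgroup ranks  */

        MPI_Init(&argc, &argv);
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);
        MPI_Group_excl(world_group, 1, excl, &subgroup);

        MPI_Group_translate_ranks(world_group, 3, ranks1, subgroup, ranks2);
        /* ranks2 is now {MPI_UNDEFINED, 0, 1} */

        MPI_Group_free(&subgroup);
        MPI_Group_free(&world_group);
        MPI_Finalize();
        return 0;
    }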

MPI_GROUP_COMPARE(group1, group2, result)

    IN   group1   first group (handle)
    IN   group2   second group (handle)
    OUT  result   result (integer)

int MPI_Group_compare(MPI_Group group1, MPI_Group group2, int *result)

MPI_GROUP_COMPARE(GROUP1, GROUP2, RESULT, IERROR)
    INTEGER GROUP1, GROUP2, RESULT, IERROR

MPI_IDENT results if the group members and group order is exactly the same in both groups. This happens for instance if group1 and group2 are the same handle. MPI_SIMILAR results if the group members are the same but the order is different. MPI_UNEQUAL results otherwise.

Group Constructors

Group constructors are used to subset and superset existing groups. These constructors construct new groups from existing groups. These are local operations, and distinct groups may be defined on different processes; a process may also define a group that does not include itself. Consistent definitions are required when groups are used as arguments in communicator-building functions. MPI does not provide a mechanism to build a group from scratch, but only from other, previously defined groups. The base group, upon which all other groups are defined, is the group associated with the initial communicator MPI_COMM_WORLD (accessible through the function MPI_COMM_GROUP).

Rationale. In what follows, there is no group duplication function analogous to MPI_COMM_DUP, defined later in this chapter. There is no need for a group duplicator. A group, once created, can have several references to it by making copies of the handle. The following constructors address the need for subsets and supersets of existing groups. (End of rationale.)

Advice to implementors. Each group constructor behaves as if it returned a new group object. When this new group is a copy of an existing group, then one can avoid creating such new objects, using a reference-count mechanism. (End of advice to implementors.)

MPI_COMM_GROUP(comm, group)

    IN   comm    communicator (handle)
    OUT  group   group corresponding to comm (handle)

int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)

MPI_COMM_GROUP(COMM, GROUP, IERROR)
    INTEGER COMM, GROUP, IERROR

MPI_COMM_GROUP returns in group a handle to the group of comm.

MPI_GROUP_UNION(group1, group2, newgroup)

    IN   group1     first group (handle)
    IN   group2     second group (handle)
    OUT  newgroup   union group (handle)

int MPI_Group_union(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)

MPI_GROUP_UNION(GROUP1, GROUP2, NEWGROUP, IERROR)
    INTEGER GROUP1, GROUP2, NEWGROUP, IERROR

MPI_GROUP_INTERSECTION(group1, group2, newgroup)

    IN   group1     first group (handle)
    IN   group2     second group (handle)
    OUT  newgroup   intersection group (handle)

int MPI_Group_intersection(MPI_Group group1, MPI_Group group2,
                           MPI_Group *newgroup)

MPI_GROUP_INTERSECTION(GROUP1, GROUP2, NEWGROUP, IERROR)
    INTEGER GROUP1, GROUP2, NEWGROUP, IERROR

MPI_GROUP_DIFFERENCE(group1, group2, newgroup)

    IN   group1     first group (handle)
    IN   group2     second group (handle)
    OUT  newgroup   difference group (handle)

int MPI_Group_difference(MPI_Group group1, MPI_Group group2,
                         MPI_Group *newgroup)

MPI_GROUP_DIFFERENCE(GROUP1, GROUP2, NEWGROUP, IERROR)
    INTEGER GROUP1, GROUP2, NEWGROUP, IERROR

The set-like operations are defined as follows:

union: all elements of the first group (group1), followed by all elements of the second group (group2) not in the first;

intersect: all elements of the first group that are also in the second group, ordered as in the first group;

difference: all elements of the first group that are not in the second group, ordered as in the first group.

Note that for these operations the order of processes in the output group is determined primarily by order in the first group (if possible) and then, if necessary, by order in the second group. Neither union nor intersection are commutative, but both are associative.

The new group can be empty, that is, equal to MPI_GROUP_EMPTY.

MPI_GROUP_INCL(group, n, ranks, newgroup)

    IN   group      group (handle)
    IN   n          number of elements in array ranks (and size of newgroup) (integer)
    IN   ranks      ranks of processes in group to appear in newgroup (array of integers)
    OUT  newgroup   new group derived from above, in the order defined by ranks (handle)

int MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)

MPI_GROUP_INCL(GROUP, N, RANKS, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANKS(*), NEWGROUP, IERROR

The function MPI_GROUP_INCL creates a group newgroup that consists of the n processes in group with ranks ranks[0], ..., ranks[n-1]; the process with rank i in newgroup is the process with rank ranks[i] in group. Each of the n elements of ranks must be a valid rank in group and all elements must be distinct, or else the program is erroneous. If n = 0, then newgroup is MPI_GROUP_EMPTY. This function can, for instance, be used to reorder the elements of a group. See also MPI_GROUP_COMPARE.
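For illustration (a sketch, not from the standard text; it assumes the program is run with at least five processes), the following fragment builds a group containing processes 0, 2 and 4 of MPI_COMM_WORLD, in that order.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Group world_group, three_evens;
        int ranks[3] = {0, 2, 4};    /* must all be valid, distinct ranks */

        MPI_Init(&argc, &argv);
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);

        /* the process with rank i in three_evens is rank ranks[i] in world_group */
        MPI_Group_incl(world_group, 3, ranks, &three_evens);

        MPI_Group_free(&three_evens);
        MPI_Group_free(&world_group);
        MPI_Finalize();
        return 0;
    }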

MPI_GROUP_EXCL(group, n, ranks, newgroup)

    IN   group      group (handle)
    IN   n          number of elements in array ranks (integer)
    IN   ranks      array of integer ranks in group not to appear in newgroup
    OUT  newgroup   new group derived from above, preserving the order defined by group (handle)

int MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)

MPI_GROUP_EXCL(GROUP, N, RANKS, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANKS(*), NEWGROUP, IERROR

The function MPI_GROUP_EXCL creates a group of processes newgroup that is obtained by deleting from group those processes with ranks ranks[0], ..., ranks[n-1]. The ordering of processes in newgroup is identical to the ordering in group. Each of the n elements of ranks must be a valid rank in group and all elements must be distinct; otherwise, the program is erroneous. If n = 0, then newgroup is identical to group.

MPI_GROUP_RANGE_INCL(group, n, ranges, newgroup)

    IN   group      group (handle)
    IN   n          number of triplets in array ranges (integer)
    IN   ranges     an array of integer triplets, of the form (first rank, last rank, stride), indicating ranks in group of processes to be included in newgroup
    OUT  newgroup   new group derived from above, in the order defined by ranges (handle)

int MPI_Group_range_incl(MPI_Group group, int n, int ranges[][3],
                         MPI_Group *newgroup)

MPI_GROUP_RANGE_INCL(GROUP, N, RANGES, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANGES(3,*), NEWGROUP, IERROR

If ranges consists of the triplets

    (first_1, last_1, stride_1), \ldots, (first_n, last_n, stride_n)

then newgroup consists of the sequence of processes in group with ranks

    first_1, \; first_1 + stride_1, \; \ldots, \;
      first_1 + \left\lfloor \frac{last_1 - first_1}{stride_1} \right\rfloor stride_1, \; \ldots, \;
    first_n, \; first_n + stride_n, \; \ldots, \;
      first_n + \left\lfloor \frac{last_n - first_n}{stride_n} \right\rfloor stride_n.

Each computed rank must be a valid rank in group and all computed ranks must be distinct, or else the program is erroneous. Note that we may have first_i > last_i, and stride_i may be negative, but cannot be zero.

The functionality of this routine is specified to be equivalent to expanding the array of ranges to an array of the included ranks, and passing the resulting array of ranks and other arguments to MPI_GROUP_INCL. A call to MPI_GROUP_INCL is equivalent to a call to MPI_GROUP_RANGE_INCL with each rank i in ranks replaced by the triplet (i, i, 1) in the argument ranges.

MPI_GROUP_RANGE_EXCL(group, n, ranges, newgroup)

    IN   group      group (handle)
    IN   n          number of elements in array ranges (integer)
    IN   ranges     a one-dimensional array of integer triplets of the form (first rank, last rank, stride), indicating the ranks in group of processes to be excluded from the output group newgroup
    OUT  newgroup   new group derived from above, preserving the order in group (handle)

int MPI_Group_range_excl(MPI_Group group, int n, int ranges[][3],
                         MPI_Group *newgroup)

MPI_GROUP_RANGE_EXCL(GROUP, N, RANGES, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANGES(3,*), NEWGROUP, IERROR

Each computed rank must be a valid rank in group and all computed ranks must be distinct, or else the program is erroneous.

The functionality of this routine is specified to be equivalent to expanding the array of ranges to an array of the excluded ranks, and passing the resulting array of ranks and other arguments to MPI_GROUP_EXCL. A call to MPI_GROUP_EXCL is equivalent to a call to MPI_GROUP_RANGE_EXCL with each rank i in ranks replaced by the triplet (i, i, 1) in the argument ranges.

Advice to users. The range operations do not explicitly enumerate ranks, and therefore are more scalable if implemented efficiently. Hence, we recommend MPI programmers to use them whenever possible, as high-quality implementations will take advantage of this fact. (End of advice to users.)

Advice to implementors. The range operations should be implemented, if possible, without enumerating the group members, in order to obtain better scalability (time and space). (End of advice to implementors.)
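As a sketch (not part of the standard text; it assumes the program is run with at least seven processes), a single triplet can select every second process without enumerating the ranks: the following selects ranks 0, 2, 4 and 6.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Group world_group, evens;
        int ranges[1][3] = {{0, 6, 2}};   /* (first rank, last rank, stride) */

        MPI_Init(&argc, &argv);
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);

        /* equivalent to MPI_Group_incl with ranks {0, 2, 4, 6} */
        MPI_Group_range_incl(world_group, 1, ranges, &evens);

        MPI_Group_free(&evens);
        MPI_Group_free(&world_group);
        MPI_Finalize();
        return 0;
    }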

Group Destructors

MPI_GROUP_FREE(group)

    INOUT  group   group (handle)

int MPI_Group_free(MPI_Group *group)

MPI_GROUP_FREE(GROUP, IERROR)
    INTEGER GROUP, IERROR

This operation marks a group object for deallocation. The handle group is set to MPI_GROUP_NULL by the call. Any ongoing operation using this group will complete normally.

Advice to implementors. One can keep a reference count that is incremented for each call to MPI_COMM_CREATE and MPI_COMM_DUP, and decremented for each call to MPI_GROUP_FREE or MPI_COMM_FREE; the group object is ultimately deallocated when the reference count drops to zero. (End of advice to implementors.)

Communicator Management

This section describes the manipulation of communicators in MPI. Operations that access communicators are local and their execution does not require interprocess communication. Operations that create communicators are collective and may require interprocess communication.

Advice to implementors. High-quality implementations should amortize the overheads associated with the creation of communicators (for the same group, or subsets thereof) over several calls, by allocating multiple contexts with one collective communication. (End of advice to implementors.)

Communicator Accessors

The following are all local operations.

MPI_COMM_SIZE(comm, size)

    IN   comm   communicator (handle)
    OUT  size   number of processes in the group of comm (integer)

int MPI_Comm_size(MPI_Comm comm, int *size)

MPI_COMM_SIZE(COMM, SIZE, IERROR)
    INTEGER COMM, SIZE, IERROR

Rationale. This function is equivalent to accessing the communicator's group with MPI_COMM_GROUP (see below), computing the size using MPI_GROUP_SIZE, and then freeing the group temporary via MPI_GROUP_FREE. However, this function is so commonly used that this shortcut was introduced. (End of rationale.)

Advice to users. This function indicates the number of processes involved in a communicator. For MPI_COMM_WORLD, it indicates the total number of processes available (for this version of MPI, there is no standard way to change the number of processes once initialization has taken place).

This call is often used with the next call to determine the amount of concurrency available for a specific library or program. The following call, MPI_COMM_RANK, indicates the rank of the process that calls it, in the range from 0 to size-1, where size is the return value of MPI_COMM_SIZE. (End of advice to users.)

MPI_COMM_RANK(comm, rank)

    IN   comm   communicator (handle)
    OUT  rank   rank of the calling process in group of comm (integer)

int MPI_Comm_rank(MPI_Comm comm, int *rank)

MPI_COMM_RANK(COMM, RANK, IERROR)
    INTEGER COMM, RANK, IERROR

Rationale. This function is equivalent to accessing the communicator's group with MPI_COMM_GROUP (see below), computing the rank using MPI_GROUP_RANK, and then freeing the group temporary via MPI_GROUP_FREE. However, this function is so commonly used that this shortcut was introduced. (End of rationale.)

Advice to users. This function gives the rank of the process in the particular communicator's group. It is useful, as noted above, in conjunction with MPI_COMM_SIZE.

Many programs will be written with the master-slave model, where one process (such as the rank-zero process) will play a supervisory role, and the other processes will serve as compute nodes. In this framework, the two preceding calls are useful for determining the roles of the various processes of a communicator. (End of advice to users.)

MPI_COMM_COMPARE(comm1, comm2, result)

    IN   comm1    first communicator (handle)
    IN   comm2    second communicator (handle)
    OUT  result   result (integer)

int MPI_Comm_compare(MPI_Comm comm1, MPI_Comm comm2, int *result)

MPI_COMM_COMPARE(COMM1, COMM2, RESULT, IERROR)
    INTEGER COMM1, COMM2, RESULT, IERROR

MPI_IDENT results if and only if comm1 and comm2 are handles for the same object (identical groups and same contexts). MPI_CONGRUENT results if the underlying groups are identical in constituents and rank order; these communicators differ only by context. MPI_SIMILAR results if the group members of both communicators are the same but the rank order differs. MPI_UNEQUAL results otherwise.

Communicator Constructors

The following are collective functions that are invoked by all processes in the group associated with comm.

Rationale. Note that there is a "chicken-and-egg" aspect to MPI in that a communicator is needed to create a new communicator. The base communicator for all MPI communicators is predefined outside of MPI, and is MPI_COMM_WORLD. This model was arrived at after considerable debate, and was chosen to increase "safety" of programs written in MPI. (End of rationale.)

MPI_COMM_DUP(comm, newcomm)

    IN   comm      communicator (handle)
    OUT  newcomm   copy of comm (handle)

int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)

MPI_COMM_DUP(COMM, NEWCOMM, IERROR)
    INTEGER COMM, NEWCOMM, IERROR

MPI_COMM_DUP duplicates the existing communicator comm with associated key values. For each key value, the respective copy callback function determines the attribute value associated with this key in the new communicator; one particular action that a copy callback may take is to delete the attribute from the new communicator. Returns in newcomm a new communicator with the same group, any copied cached information, but a new context (see the section on caching, below).

Advice to users. This operation is used to provide a parallel library call with a duplicate communication space that has the same properties as the original communicator. This includes any attributes (see below) and topologies (see the chapter on process topologies). This call is valid even if there are pending point-to-point communications involving the communicator comm. A typical call might involve an MPI_COMM_DUP at the beginning of the parallel call, and an MPI_COMM_FREE of that duplicated communicator at the end of the call. Other models of communicator management are also possible.

This call applies to both intra- and inter-communicators. (End of advice to users.)

Advice to implementors. One need not actually copy the group information, but only add a new reference and increment the reference count. Copy on write can be used for the cached information. (End of advice to implementors.)

MPI_COMM_CREATE(comm, group, newcomm)

    IN   comm      communicator (handle)
    IN   group     group, which is a subset of the group of comm (handle)
    OUT  newcomm   new communicator (handle)

int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)

MPI_COMM_CREATE(COMM, GROUP, NEWCOMM, IERROR)
    INTEGER COMM, GROUP, NEWCOMM, IERROR

This function creates a new communicator newcomm with communication group defined by group and a new context. No cached information propagates from comm to newcomm. The function returns MPI_COMM_NULL to processes that are not in group. The call is erroneous if not all group arguments have the same value, or if group is not a subset of the group associated with comm. Note that the call is to be executed by all processes in comm, even if they do not belong to the new group. This call applies only to intra-communicators.

Rationale. The requirement that the entire group of comm participate in the call stems from the following considerations:

- It allows the implementation to layer MPI_COMM_CREATE on top of regular collective communications.
- It provides additional safety, in particular in the case where partially overlapping groups are used to create new communicators.
- It permits implementations sometimes to avoid communication related to context creation.

(End of rationale.)

Advice to users. MPI_COMM_CREATE provides a means to subset a group of processes for the purpose of separate MIMD computation, with separate communication space. newcomm, which emerges from MPI_COMM_CREATE, can be used in subsequent calls to MPI_COMM_CREATE (or other communicator constructors) further to subdivide a computation into parallel sub-computations. A more general service is provided by MPI_COMM_SPLIT, below. (End of advice to users.)

Advice to implementors. Since all processes calling MPI_COMM_DUP or MPI_COMM_CREATE provide the same group argument, it is theoretically possible to agree on a group-wide unique context with no communication. However, local execution of these functions requires use of a larger context name space and reduces error checking. Implementations may strike various compromises between these conflicting goals, such as bulk allocation of multiple contexts in one collective operation.

Important: If new communicators are created without synchronizing the processes involved, then the communication system should be able to cope with messages arriving in a context that has not yet been allocated at the receiving process. (End of advice to implementors.)

MPI_COMM_SPLIT(comm, color, key, newcomm)

    IN   comm      communicator (handle)
    IN   color     control of subset assignment (integer)
    IN   key       control of rank assignment (integer)
    OUT  newcomm   new communicator (handle)

int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERROR)
    INTEGER COMM, COLOR, KEY, NEWCOMM, IERROR

This function partitions the group associated with comm into disjoint subgroups, one for each value of color. Each subgroup contains all processes of the same color. Within each subgroup, the processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old group. A new communicator is created for each subgroup and returned in newcomm. A process may supply the color value MPI_UNDEFINED, in which case newcomm returns MPI_COMM_NULL. This is a collective call, but each process is permitted to provide different values for color and key.

A call to MPI_COMM_CREATE(comm, group, newcomm) is equivalent to a call to MPI_COMM_SPLIT(comm, color, key, newcomm), where all members of group provide color = 0 and key = rank in group, and all processes that are not members of group provide color = MPI_UNDEFINED. The function MPI_COMM_SPLIT allows more general partitioning of a group into one or more subgroups, with optional reordering. This call applies only to intra-communicators.

Advice to users. This is an extremely powerful mechanism for dividing a single communicating group of processes into k subgroups, with k chosen implicitly by the user (by the number of colors asserted over all the processes). Each resulting communicator will be non-overlapping. Such a division could be useful for defining a hierarchy of computations, such as for multigrid, or linear algebra.

Multiple calls to MPI_COMM_SPLIT can be used to overcome the requirement that any call have no overlap of the resulting communicators (each process is of only one color per call). In this way, multiple overlapping communication structures can be created. Creative use of the color and key in such splitting operations is encouraged.

Note that, for a fixed color, the keys need not be unique. It is MPI_COMM_SPLIT's responsibility to sort processes in ascending order according to this key, and to break ties in a consistent way. If all the keys are specified in the same way, then all the processes in a given color will have the same relative rank order as they did in their parent group. (In general, they will have different ranks.)

Essentially, making the key value zero for all processes of a given color means that one doesn't really care about the rank-order of the processes in the new communicator. (End of advice to users.)
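A common use (a sketch, not from the standard text; ncols is an assumed parameter that is taken to divide the number of processes) is to split the processes of a communicator into the rows of a logical process grid:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm row_comm;
        int rank, ncols = 4;    /* assumed number of columns per row */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* processes with the same color (row index) end up in one
           communicator; key = rank keeps the original relative order */
        MPI_Comm_split(MPI_COMM_WORLD, rank / ncols, rank, &row_comm);

        MPI_Comm_free(&row_comm);
        MPI_Finalize();
        return 0;
    }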

Communicator Destructors

MPI_COMM_FREE(comm)

    INOUT  comm   communicator to be destroyed (handle)

int MPI_Comm_free(MPI_Comm *comm)

MPI_COMM_FREE(COMM, IERROR)
    INTEGER COMM, IERROR

This collective operation marks the communication object for deallocation. The handle is set to MPI_COMM_NULL. Any pending operations that use this communicator will complete normally; the object is actually deallocated only if there are no other active references to it. This call applies to intra- and inter-communicators. The delete callback functions for all cached attributes (see the section on caching) are called in arbitrary order.

Advice to implementors. A reference-count mechanism may be used: the reference count is incremented by each call to MPI_COMM_DUP, and decremented by each call to MPI_COMM_FREE. The object is ultimately deallocated when the count reaches zero.

Though collective, it is anticipated that this operation will normally be implemented to be local, though the debugging version of an MPI library might choose to synchronize. (End of advice to implementors.)

Motivating Examples

Current Practice #1

Example #1a:

    main(int argc, char **argv)
    {
       int me, size;
       ...
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &me);
       MPI_Comm_size(MPI_COMM_WORLD, &size);

       (void)printf("Process %d size %d\n", me, size);
       ...
       MPI_Finalize();
    }

Example #1a is a do-nothing program that initializes itself legally, refers to the "all" communicator, and prints a message. It terminates itself legally too. This example does not imply that MPI supports printf-like communication itself.

Example #1b (supposing that size is even):

    main(int argc, char **argv)
    {
       int me, size;
       int SOME_TAG = 0;
       ...
       MPI_Init(&argc, &argv);

       MPI_Comm_rank(MPI_COMM_WORLD, &me);    /* local */
       MPI_Comm_size(MPI_COMM_WORLD, &size);  /* local */

       if((me % 2) == 0)
       {
          /* send unless highest-numbered process */
          if((me + 1) < size)
             MPI_Send(..., me + 1, SOME_TAG, MPI_COMM_WORLD);
       }
       else
          MPI_Recv(..., me - 1, SOME_TAG, MPI_COMM_WORLD);

       ...
       MPI_Finalize();
    }

Example #1b schematically illustrates message exchanges between "even" and "odd" processes in the "all" communicator.

Current Practice #2

    main(int argc, char **argv)
    {
      int me, count;
      void *data;
      ...

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &me);

      if(me == 0)
      {
          /* get input, create buffer ``data'' */
          ...
      }

      MPI_Bcast(data, count, MPI_BYTE, 0, MPI_COMM_WORLD);

      ...

      MPI_Finalize();
    }

This example illustrates the use of a collective communication.

Approximate Current Practice #3

    main(int argc, char **argv)
    {
      int me, count, count2;
      void *send_buf, *recv_buf, *send_buf2, *recv_buf2;
      MPI_Group MPI_GROUP_WORLD, grprem;
      MPI_Comm commslave;
      static int ranks[] = {0};
      ...

      MPI_Init(&argc, &argv);
      MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD);
      MPI_Comm_rank(MPI_COMM_WORLD, &me);     /* local */

      MPI_Group_excl(MPI_GROUP_WORLD, 1, ranks, &grprem);    /* local */
      MPI_Comm_create(MPI_COMM_WORLD, grprem, &commslave);

      if(me != 0)
      {
        /* compute on slave */
        ...
        MPI_Reduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM, 1, commslave);
        ...
      }
      /* zero falls through immediately to this reduce, others do later... */
      MPI_Reduce(send_buf2, recv_buf2, count2,
                 MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

      MPI_Comm_free(&commslave);
      MPI_Group_free(&MPI_GROUP_WORLD);
      MPI_Group_free(&grprem);
      MPI_Finalize();
    }

This example illustrates how a group consisting of all but the zeroth process of the "all" group is created, and then how a communicator is formed (commslave) for that new group. The new communicator is used in a collective call, and all processes execute a collective call in the MPI_COMM_WORLD context. This example illustrates how the two communicators (that inherently possess distinct contexts) protect communication. That is, communication in MPI_COMM_WORLD is insulated from communication in commslave, and vice versa.

In summary, "group safety" is achieved via communicators because distinct contexts within communicators are enforced to be unique on any process.

Example #4

The following example is meant to illustrate "safety" between point-to-point and collective communication. MPI guarantees that a single communicator can do safe point-to-point and collective communication.

    #define TAG_ARBITRARY 12345
    #define SOME_COUNT    50

    main(int argc, char **argv)
    {
      int me;
      MPI_Request request[2];
      MPI_Status status[2];
      MPI_Group MPI_GROUP_WORLD, subgroup;
      int ranks[] = {2, 4, 6, 8};
      MPI_Comm the_comm;
      ...
      MPI_Init(&argc, &argv);
      MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD);

      MPI_Group_incl(MPI_GROUP_WORLD, 4, ranks, &subgroup);   /* local */
      MPI_Group_rank(subgroup, &me);                           /* local */

      MPI_Comm_create(MPI_COMM_WORLD, subgroup, &the_comm);

      if(me != MPI_UNDEFINED)
      {
          MPI_Irecv(buff1, count, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_ARBITRARY,
                    the_comm, request);
          MPI_Isend(buff2, count, MPI_DOUBLE, (me+1)%4, TAG_ARBITRARY,
                    the_comm, request+1);

          for(i = 0; i < SOME_COUNT; i++)
            MPI_Reduce(..., the_comm);
          MPI_Waitall(2, request, status);
      }

      MPI_Comm_free(&the_comm);
      MPI_Group_free(&MPI_GROUP_WORLD);
      MPI_Group_free(&subgroup);
      MPI_Finalize();
    }

Library Example #1

The main program:

    main(int argc, char **argv)
    {
      int done = 0;
      user_lib_t *libh_a, *libh_b;
      void *dataset1, *dataset2;
      ...
      MPI_Init(&argc, &argv);
      ...
      init_user_lib(MPI_COMM_WORLD, &libh_a);
      init_user_lib(MPI_COMM_WORLD, &libh_b);
      ...
      user_start_op(libh_a, dataset1);
      user_start_op(libh_b, dataset2);
      ...
      while(!done)
      {
         /* work */
         ...
         MPI_Reduce(..., MPI_COMM_WORLD);
         ...
         /* see if done */
         ...
      }
      user_end_op(libh_a);
      user_end_op(libh_b);
      ...
      uninit_user_lib(libh_a);
      uninit_user_lib(libh_b);
      MPI_Finalize();
    }

The user library initialization code:

    void init_user_lib(MPI_Comm comm, user_lib_t **handle)
    {
      user_lib_t *save;

      user_lib_initsave(&save);             /* local */
      MPI_Comm_dup(comm, &(save -> comm));

      /* other inits */
      ...

      *handle = save;
    }

User start-up code:

    void user_start_op(user_lib_t *handle, void *data)
    {
      MPI_Irecv( ..., handle->comm, &(handle -> irecv_handle) );
      MPI_Isend( ..., handle->comm, &(handle -> isend_handle) );
    }

User communication clean-up code:

    void user_end_op(user_lib_t *handle)
    {
      MPI_Status status;

      MPI_Wait(&(handle -> isend_handle), &status);
      MPI_Wait(&(handle -> irecv_handle), &status);
    }

User object clean-up code:

    void uninit_user_lib(user_lib_t *handle)
    {
      MPI_Comm_free(&(handle -> comm));
      free(handle);
    }

Library Example #2

The main program:

    main(int argc, char **argv)
    {
      int ma, mb;
      MPI_Group MPI_GROUP_WORLD, group_a, group_b;
      MPI_Comm comm_a, comm_b;

      static int list_a[] = {0, 1};
    #if defined(EXAMPLE_2B) | defined(EXAMPLE_2C)
      static int list_b[] = {0, 2, 3};
    #else  /* EXAMPLE_2A */
      static int list_b[] = {0, 2};
    #endif
      int size_list_a = sizeof(list_a)/sizeof(int);
      int size_list_b = sizeof(list_b)/sizeof(int);

      ...
      MPI_Init(&argc, &argv);
      MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD);

      MPI_Group_incl(MPI_GROUP_WORLD, size_list_a, list_a, &group_a);
      MPI_Group_incl(MPI_GROUP_WORLD, size_list_b, list_b, &group_b);

      MPI_Comm_create(MPI_COMM_WORLD, group_a, &comm_a);
      MPI_Comm_create(MPI_COMM_WORLD, group_b, &comm_b);

      MPI_Comm_rank(comm_a, &ma);
      MPI_Comm_rank(comm_b, &mb);

      if(ma != MPI_UNDEFINED)
        lib_call(comm_a);

      if(mb != MPI_UNDEFINED)
      {
        lib_call(comm_b);
        lib_call(comm_b);
      }

      MPI_Comm_free(&comm_a);
      MPI_Comm_free(&comm_b);
      MPI_Group_free(&group_a);
      MPI_Group_free(&group_b);
      MPI_Group_free(&MPI_GROUP_WORLD);
      MPI_Finalize();
    }

The library:

    void lib_call(MPI_Comm comm)
    {
      int me, done = 0;

      MPI_Comm_rank(comm, &me);
      if(me == 0)
         while(!done)
         {
            MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, comm);
            ...
         }
      else
      {
        /* work */
        MPI_Send(..., 0, ARBITRARY_TAG, comm);
        ...
      }
    #ifdef EXAMPLE_2C
      /* include (resp, exclude) for safety (resp, no safety): */
      MPI_Barrier(comm);
    #endif
    }

The above example is really three examples, depending on whether or not one includes rank 3 in list_b, and whether or not a synchronize is included in lib_call. This example illustrates that, despite contexts, subsequent calls to lib_call with the same context need not be safe from one another (colloquially, "back-masking"). Safety is realized if the MPI_Barrier is added. What this demonstrates is that libraries have to be written carefully, even with contexts. When rank 3 is excluded, then the synchronize is not needed to get safety from back-masking.

Algorithms like "reduce" and "allreduce" have strong enough source selectivity properties so that they are inherently okay (no back-masking), provided that MPI provides basic guarantees. So are multiple calls to a typical tree-broadcast algorithm with the same root or different roots (see [ ]). Here we rely on two guarantees of MPI: pairwise ordering of messages between processes in the same context, and source selectivity; deleting either feature removes the guarantee that back-masking cannot be required.

Algorithms that try to do non-deterministic broadcasts or other calls that include wildcard operations will not generally have the good properties of the deterministic implementations of "reduce," "allreduce," and "broadcast." Such algorithms would have to utilize the monotonically increasing tags (within a communicator scope) to keep things straight.

All of the foregoing is a supposition of "collective calls" implemented with point-to-point operations. MPI implementations may or may not implement collective calls using point-to-point operations. These algorithms are used to illustrate the issues of correctness and safety, independent of how MPI implements its collective calls. See also section [ ].


InterCommunication

1

2

This section intro duces the concept of intercommunication and describ es the p ortions of

3

MPI that supp ort it It describ es supp ort for writing programs that contain userlevel

4

servers

5

All p ointtop oint communication describ ed thus far has involved communication b e

6

tween pro cesses that are memb ers of the same group This typ e of communication is called

7

intracommunication and the communicator used is called an intracommunicator as

8

we have noted earlier in the chapter

9

In mo dular and multidiscipl inary applications dierent pro cess groups execute distinct

10

mo dules and pro cesses within dierent mo dules communicate with one another in a pip eline

11

or a more general mo dule graph In these applications the most natural way for a pro cess

12

to sp ecify a target pro cess is by the rank of the target pro cess within the target group In

13

applications that contain internal userlevel servers each server may b e a pro cess group that

14

provides services to one or more clients and each client may b e a pro cess group that uses

15

the services of one or more servers It is again most natural to sp ecify the target pro cess

16

by rank within the target group in these applications This typ e of communication is called

17

intercommunication and the communicator used is called an intercommunicator as

18

intro duced earlier

19

An intercommunication is a p ointtop oint communication b etween pro cesses in dier

20

ent groups The group containing a pro cess that initiates an intercommunication op eration

21

is called the lo cal group that is the sender in a send and the receiver in a receive The

22

group containing the target pro cess is called the remote group that is the receiver in a

23

send and the sender in a receive As in intracommunication the target pro cess is sp ecied

24

using a communicator rank pair Unlike intracommunication the rank is relative to a

25

second remote group

26

All intercommunicator constructors are blo cking and require that the lo cal and remote

27

groups b e disjoint in order to avoid deadlo ck

28

Here is a summary of the prop erties of intercommunication and intercommunicators

29

30

- The syntax of point-to-point communication is the same for both inter- and intra-communication. The same communicator can be used both for send and for receive operations.
- A target process is addressed by its rank in the remote group, both for sends and for receives.
- Communications using an inter-communicator are guaranteed not to conflict with any communications that use a different communicator.
- An inter-communicator cannot be used for collective communication.
- A communicator will provide either intra- or inter-communication, never both.

The routine MPI_COMM_TEST_INTER may be used to determine if a communicator is an inter- or intra-communicator. Inter-communicators can be used as arguments to some of the other communicator access routines. Inter-communicators cannot be used as input to some of the constructor routines for intra-communicators (for instance, MPI_COMM_CREATE).

Advice to implementors. For the purpose of point-to-point communication, communicators can be represented in each process by a tuple consisting of:

- group
- send_context
- receive_context
- source

For inter-communicators, group describes the remote group, and source is the rank of the process in the local group. For intra-communicators, group is the communicator group (remote=local), source is the rank of the process in this group, and send_context and receive_context are identical. A group is represented by a rank-to-absolute-address translation table.

The inter-communicator cannot be discussed sensibly without considering processes in both the local and remote groups. Imagine a process P in group GP, which has an inter-communicator C_P, and a process Q in group GQ, which has an inter-communicator C_Q. Then

- C_P.group describes the group GQ and C_Q.group describes the group GP;
- C_P.send_context = C_Q.receive_context, and the context is unique in GQ; C_P.receive_context = C_Q.send_context, and this context is unique in GP;
- C_P.source is rank of P in GP and C_Q.source is rank of Q in GQ.

Assume that P sends a message to Q using the inter-communicator. Then P uses the group table to find the absolute address of Q; source and send_context are appended to the message.

Assume that Q posts a receive with an explicit source argument using the inter-communicator. Then Q matches receive_context to the message context and source argument to the message source.

The same algorithm is appropriate for intra-communicators as well.

In order to support inter-communicator accessors and constructors, it is necessary to supplement this model with additional structures, that store information about the local communication group, and additional safe contexts. (End of advice to implementors.)

Inter-Communicator Accessors

MPI_COMM_TEST_INTER(comm, flag)
  IN   comm    communicator (handle)
  OUT  flag    (logical)

int MPI_Comm_test_inter(MPI_Comm comm, int *flag)

MPI_COMM_TEST_INTER(COMM, FLAG, IERROR)
    INTEGER COMM, IERROR
    LOGICAL FLAG

This local routine allows the calling process to determine if a communicator is an inter-communicator or an intra-communicator. It returns true if it is an inter-communicator, otherwise false.

When an inter-communicator is used as an input argument to the communicator accessors described above under intra-communication, the following table describes behavior.

    MPI_COMM_* Function    Behavior (in Inter-Communication Mode)
    MPI_COMM_SIZE          returns the size of the local group
    MPI_COMM_GROUP         returns the local group
    MPI_COMM_RANK          returns the rank in the local group

Furthermore, the operation MPI_COMM_COMPARE is valid for inter-communicators. Both communicators must be either intra- or inter-communicators, or else MPI_UNEQUAL results. Both corresponding local and remote groups must compare correctly to get the results MPI_CONGRUENT and MPI_SIMILAR. In particular, it is possible for MPI_SIMILAR to result because either the local or remote groups were similar but not identical.

The following accessors provide consistent access to the remote group of an inter-communicator. The following are all local operations.
MPI_COMM_REMOTE_SIZE(comm, size)
  IN   comm    inter-communicator (handle)
  OUT  size    number of processes in the remote group of comm (integer)

int MPI_Comm_remote_size(MPI_Comm comm, int *size)

MPI_COMM_REMOTE_SIZE(COMM, SIZE, IERROR)
    INTEGER COMM, SIZE, IERROR

MPI_COMM_REMOTE_GROUP(comm, group)
  IN   comm    inter-communicator (handle)
  OUT  group   remote group corresponding to comm (handle)

int MPI_Comm_remote_group(MPI_Comm comm, MPI_Group *group)

MPI_COMM_REMOTE_GROUP(COMM, GROUP, IERROR)
    INTEGER COMM, GROUP, IERROR

Rationale. Symmetric access to both the local and remote groups of an inter-communicator is important, so this function, as well as MPI_COMM_REMOTE_SIZE, have been provided. (End of rationale.)
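As a small illustration of these accessors (a sketch, not one of the standard's own examples; comm and the variable names are arbitrary and assume an initialized MPI program with mpi.h included):

    int is_inter, local_size, remote_size;

    MPI_Comm_test_inter(comm, &is_inter);
    MPI_Comm_size(comm, &local_size);              /* size of the local group */
    if (is_inter)
        MPI_Comm_remote_size(comm, &remote_size);  /* size of the remote group */
    else
        remote_size = local_size;                  /* intra-communicator: only one group */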

Inter-Communicator Operations

This section introduces four blocking inter-communicator operations. MPI_INTERCOMM_CREATE is used to bind two intra-communicators into an inter-communicator; the function MPI_INTERCOMM_MERGE creates an intra-communicator by merging the local and remote groups of an inter-communicator. The functions MPI_COMM_DUP and MPI_COMM_FREE, introduced previously, duplicate and free an inter-communicator, respectively.

Overlap of local and remote groups that are bound into an inter-communicator is prohibited. If there is overlap, then the program is erroneous and is likely to deadlock. (If a process is multithreaded, and MPI calls block only a thread, rather than a process, then "dual membership" can be supported. It is then the user's responsibility to make sure that calls on behalf of the two "roles" of a process are executed by two independent threads.)

The function MPI_INTERCOMM_CREATE can be used to create an inter-communicator from two existing intra-communicators, in the following situation: At least one selected member from each group (the "group leader") has the ability to communicate with the selected member from the other group; that is, a "peer" communicator exists to which both leaders belong, and each leader knows the rank of the other leader in this peer communicator. (The two leaders could be the same process.) Furthermore, members of each group know the rank of their leader.

Construction of an inter-communicator from two intra-communicators requires separate collective operations in the local group and in the remote group, as well as a point-to-point communication between a process in the local group and a process in the remote group.

In standard MPI implementations (with static process allocation at initialization), the MPI_COMM_WORLD communicator (or preferably a dedicated duplicate thereof) can be this peer communicator. In dynamic MPI implementations, where, for example, a process may spawn new child processes during an MPI execution, the parent process may be the "bridge" between the old communication universe and the new communication world that includes the parent and its children.

The application topology functions described in the chapter on process topologies do not apply to inter-communicators. Users that require this capability should utilize MPI_INTERCOMM_MERGE to build an intra-communicator, then apply the graph or cartesian topology capabilities to that intra-communicator, creating an appropriate topology-oriented intra-communicator. Alternatively, it may be reasonable to devise one's own application topology mechanisms for this case, without loss of generality.
MPI_INTERCOMM_CREATE(local_comm, local_leader, peer_comm, remote_leader, tag, newintercomm)
  IN   local_comm     local intra-communicator (handle)
  IN   local_leader   rank of local group leader in local_comm (integer)
  IN   peer_comm      "peer" intra-communicator; significant only at the local_leader (handle)
  IN   remote_leader  rank of remote group leader in peer_comm; significant only at the local_leader (integer)
  IN   tag            "safe" tag (integer)
  OUT  newintercomm   new inter-communicator (handle)

int MPI_Intercomm_create(MPI_Comm local_comm, int local_leader,
                         MPI_Comm peer_comm, int remote_leader, int tag,
                         MPI_Comm *newintercomm)

MPI_INTERCOMM_CREATE(LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER, TAG,
    NEWINTERCOMM, IERROR)
    INTEGER LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER, TAG,
    NEWINTERCOMM, IERROR

This call creates an inter-communicator. It is collective over the union of the local and remote groups. Processes should provide identical local_comm and local_leader arguments within each group. Wildcards are not permitted for remote_leader, local_leader, and tag.

This call uses point-to-point communication with communicator peer_comm, and with tag tag, between the leaders. Thus, care must be taken that there be no pending communication on peer_comm that could interfere with this communication.

Advice to users. We recommend using a dedicated peer communicator, such as a duplicate of MPI_COMM_WORLD, to avoid trouble with peer communicators. (End of advice to users.)
MPI_INTERCOMM_MERGE(intercomm, high, newintracomm)
  IN   intercomm     Inter-Communicator (handle)
  IN   high          (logical)
  OUT  newintracomm  new intra-communicator (handle)

int MPI_Intercomm_merge(MPI_Comm intercomm, int high, MPI_Comm *newintracomm)

MPI_INTERCOMM_MERGE(INTERCOMM, HIGH, INTRACOMM, IERROR)
    INTEGER INTERCOMM, INTRACOMM, IERROR
    LOGICAL HIGH

This function creates an intra-communicator from the union of the two groups that are associated with intercomm. All processes should provide the same high value within each of the two groups. If processes in one group provided the value high = false and processes in the other group provided the value high = true, then the union orders the "low" group before the "high" group. If all processes provided the same high argument, then the order of the union is arbitrary. This call is blocking and collective within the union of the two groups.

[Figure: Three-group pipeline -- Group 0, Group 1, Group 2]

Advice to implementors. The implementation of MPI_INTERCOMM_MERGE, MPI_COMM_FREE and MPI_COMM_DUP are similar to the implementation of MPI_INTERCOMM_CREATE, except that contexts private to the input inter-communicator are used for communication between group leaders, rather than contexts inside a bridge communicator. (End of advice to implementors.)
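Before the fuller examples below, here is a minimal sketch of the create/merge pair for two disjoint groups. It assumes each process already holds its group's intra-communicator local_comm (for instance from MPI_COMM_SPLIT), that the leader of each group is rank 0 in local_comm, and that remote_leader_rank (the world rank of the other group's leader) and high are set appropriately on each side; these names are illustrative, not part of the standard interface.

    MPI_Comm intercomm, merged;

    /* bind the two intra-communicators; 99 is an arbitrary "safe" tag
       on the peer communicator MPI_COMM_WORLD */
    MPI_Intercomm_create(local_comm, 0, MPI_COMM_WORLD, remote_leader_rank,
                         99, &intercomm);

    /* merge local and remote groups into one intra-communicator;
       the group that passed high = 0 is ordered first */
    MPI_Intercomm_merge(intercomm, high, &merged);

    MPI_Comm_free(&intercomm);
    /* ... use merged ..., then MPI_Comm_free(&merged); */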

Inter-Communication Examples

Example: Three-Group "Pipeline"

Groups 0 and 1 communicate. Groups 1 and 2 communicate. Therefore, group 0 requires one inter-communicator, group 1 requires two inter-communicators, and group 2 requires one inter-communicator.
main(int argc, char **argv)
{
  MPI_Comm   myComm;       /* intra-communicator of local sub-group */
  MPI_Comm   myFirstComm;  /* inter-communicator */
  MPI_Comm   mySecondComm; /* second inter-communicator (group 1 only) */
  int membershipKey;
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* User code must generate membershipKey in the range [0, 1, 2] */
  membershipKey = rank % 3;

  /* Build intra-communicator for local sub-group */
  MPI_Comm_split(MPI_COMM_WORLD, membershipKey, rank, &myComm);

  /* Build inter-communicators.  Tags are hard-coded. */
  if (membershipKey == 0)
  {                     /* Group 0 communicates with group 1. */
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 1,
                          1, &myFirstComm);
  }
  else if (membershipKey == 1)
  {              /* Group 1 communicates with groups 0 and 2. */
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 0,
                          1, &myFirstComm);
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 2,
                          12, &mySecondComm);
  }
  else if (membershipKey == 2)
  {                     /* Group 2 communicates with group 1. */
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 1,
                          12, &myFirstComm);
  }

  /* Do work ... */

  switch(membershipKey)  /* free communicators appropriately */
  {
  case 1:
     MPI_Comm_free(&mySecondComm);
  case 0:
  case 2:
     MPI_Comm_free(&myFirstComm);
     break;
  }

  MPI_Finalize();
}

[Figure: Three-group ring -- Group 0, Group 1, Group 2]
Example: Three-Group "Ring"

Groups 0 and 1 communicate. Groups 1 and 2 communicate. Groups 0 and 2 communicate. Therefore, each group requires two inter-communicators.
main(int argc, char **argv)
{
  MPI_Comm   myComm;      /* intra-communicator of local sub-group */
  MPI_Comm   myFirstComm; /* inter-communicators */
  MPI_Comm   mySecondComm;
  MPI_Status status;
  int membershipKey;
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* User code must generate membershipKey in the range [0, 1, 2] */
  membershipKey = rank % 3;

  /* Build intra-communicator for local sub-group */
  MPI_Comm_split(MPI_COMM_WORLD, membershipKey, rank, &myComm);

  /* Build inter-communicators.  Tags are hard-coded. */
  if (membershipKey == 0)
  {                      /* Group 0 communicates with groups 1 and 2. */
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 1,
                          1, &myFirstComm);
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 2,
                          2, &mySecondComm);
  }
  else if (membershipKey == 1)
  {              /* Group 1 communicates with groups 0 and 2. */
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 0,
                          1, &myFirstComm);
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 2,
                          12, &mySecondComm);
  }
  else if (membershipKey == 2)
  {              /* Group 2 communicates with groups 0 and 1. */
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 0,
                          2, &myFirstComm);
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 1,
                          12, &mySecondComm);
  }

  /* Do some work ... */

  /* Then free communicators before terminating... */
  MPI_Comm_free(&myFirstComm);
  MPI_Comm_free(&mySecondComm);
  MPI_Comm_free(&myComm);
  MPI_Finalize();
}
Example: Building Name Service for Intercommunication

The following procedures exemplify the process by which a user could create name service for building inter-communicators via a rendezvous involving a server communicator, and a tag name selected by both groups.

After all MPI processes execute MPI_INIT, every process calls the example function, Init_server(), defined below. Then, if the new_world returned is NULL, the process getting NULL is required to implement a server function, in a reactive loop, Do_server(). Everyone else just does their prescribed computation, using new_world as the new effective "global" communicator. One designated process calls Undo_Server() to get rid of the server when it is not needed any longer.

Features of this approach include:

- Support for multiple name servers

- Ability to scope the name servers to specific processes

- Ability to make such servers come and go as desired
define INITSERVERTAG

19

define UNDOSERVERTAG

20

21

static int serverkeyval

22

23

for attribute management for servercomm copy callback

24

void handlecopyfnMPIComm oldcomm int keyval void extrastate

25

void attributevalin void attributevalout int flag

26

27

copy the handle

28

attributevalout attributevalin

29

flag indicate that copy to happen

30

31

32

int Initserverpeercomm rankofserver servercomm newworld

33

MPIComm peercomm

34

int rankofserver

35

MPIComm servercomm

36

MPIComm newworld new effective world sans server

37

38

MPIComm tempcomm lonecomm

39

MPIGroup peergroup tempgroup

40

int rankinpeercomm size color key

41

int peerleader peerleaderrankintempcomm

42

43

MPICommrankpeercomm rankinpeercomm

44

MPICommsizepeercomm size

45

46

if size rankofserver rankofserver size

47

return MPIERROTHER 48

CHAPTER GROUPS CONTEXTS AND COMMUNICATORS

1

create two communicators by splitting peercomm

2

into the server process and everyone else

3

4

peerleader rankofserver size arbitrary choice

5

6

if color rankinpeercomm rankofserver

7

8

MPICommsplitpeercomm color key lonecomm

9

10

MPIIntercommcreatelonecomm peercomm peerleader

11

INITSERVERTAG servercomm

12

13

MPICommfreelonecomm

14

newworld MPIComm

15

16

else

17

18

MPICommSplitpeercomm color key tempcomm

19

20

MPICommgrouppeercomm peergroup

21

MPICommgrouptempcomm tempgroup

22

MPIGrouptranslaterankspeergroup peerleader

23

tempgroup peerleaderrankintempcomm

24

25

MPIIntercommcreatetempcomm peerleaderrankintempcomm

26

peercomm rankofserver

27

INITSERVERTAG servercomm

28

29

attach newworld communication attribute to servercomm

30

31

CRITICAL SECTION FOR MULTITHREADING

32

ifserverkeyval MPIKEYVALINVALID

33

34

acquire the processlocal name for the server keyval

35

MPIAttrkeyvalcreatehandlecopyfn NULL

36

serverkeyval NULL

37

38

39

newworld tempcomm

40

41

Cache handle of intracommunicator on intercommunicator

42

MPIAttrputservercomm serverkeyval void newworld

43

44

45

return MPISUCCESS

46

47 48

INTERCOMMUNICATION

The actual server pro cess would commit to running the following co de

1

2

int Doserverservercomm

3

MPIComm servercomm

4

5

void initqueue

6

int enqueue dequeue keep triplets of integers

7

for later matching fns not shown

8

9

MPIComm comm

10

MPIStatus status

11

int clienttag clientsource

12

int clientrankinnewworld pairsrankinnewworld

13

int buffer count

14

15

void queue

16

initqueuequeue

17

18

19

for

20

21

MPIRecvbuffer count MPIINT MPIANYSOURCE MPIANYTAG

22

servercomm status accept from any client

23

24

determine client

25

clienttag statusMPITAG

26

clientsource statusMPISOURCE

27

clientrankinnewworld buffer

28

29

if clienttag UNDOSERVERTAG client that

30

terminates server

31

32

while dequeuequeue MPIANYTAG pairsrankinnewworld

33

pairsrankinserver

34

35

36

MPIIntercommfreeservercomm

37

break

38

39

40

if dequeuequeue clienttag pairsrankinnewworld

41

pairsrankinserver

42

43

matched pair with same tag tell them

44

about each other

45

buffer pairsrankinnewworld

46

MPISendbuffer MPIINT clientsrc clienttag

47

servercomm 48

CHAPTER GROUPS CONTEXTS AND COMMUNICATORS

1

buffer clientrankinnewworld

2

MPISendbuffer MPIINT pairsrankinserver clienttag

3

servercomm

4

5

else

6

enqueuequeue clienttag clientsource

7

clientrankinnewworld

8

9

10

11

12

A particular pro cess would b e resp onsible for ending the server when it is no longer

13

needed Its call to Undo server would terminate server function

14

15

int Undoserverservercomm example client that ends server

16

MPIComm servercomm

17

18

int buffer

19

MPISendbuffer MPIINT UNDOSERVERTAG servercomm

20

MPIIntercommfreeservercomm

21

22

The following is a blo cking nameservice for intercommunication with same semantic

23

Intercomm create but simplied syntax It uses the functionality just restrictions as MPI

24

dened to create the name service

25

26

int Intercommnamecreatelocalcomm servercomm tag comm

27

MPIComm localcomm servercomm

28

int tag

29

MPIComm comm

30

31

int error

32

int found attribute acquisition mgmt for newworld

33

comm in servercomm

34

void val

35

36

MPIComm newworld

37

38

int buffer rank

39

int localleader

40

41

MPIAttrgetservercomm serverkeyval val found

42

newworld MPICommval retrieve cached handle

43

44

MPICommrankservercomm rank rank in local group

45

46

if rank localleader

47

48

CACHING

buffer rank

1

MPISendbuffer MPIINT tag servercomm

2

MPIRecvbuffer MPIINT tag servercomm

3

4

5

error MPIIntercommcreatelocalleader localcomm buffer

6

newworld tag comm

7

8

returnerror

9

10

11

12

Caching

MPI provides a "caching" facility that allows an application to attach arbitrary pieces of information, called attributes, to communicators. More precisely, the caching facility allows a portable library to do the following:

- pass information between calls by associating it with an MPI intra- or inter-communicator,

- quickly retrieve that information, and

- be guaranteed that out-of-date information is never retrieved, even if the communicator is freed and its handle subsequently reused by MPI.

The caching capabilities, in some form, are required by built-in MPI routines such as collective communication and application topology. Defining an interface to these capabilities as part of the MPI standard is valuable because it permits routines like collective communication and application topologies to be implemented as portable code, and also because it makes MPI more extensible by allowing user-written routines to use standard MPI calling sequences.

Advice to users. The communicator MPI_COMM_SELF is a suitable choice for posting process-local attributes, via this attributing-caching mechanism. (End of advice to users.)

Functionality

Attributes are attached to communicators. Attributes are local to the process and specific to the communicator to which they are attached. Attributes are not propagated by MPI from one communicator to another except when the communicator is duplicated using MPI_COMM_DUP (and even then the application must give specific permission through callback functions for the attribute to be copied).

Advice to implementors. Attributes are scalar values, equal in size to, or larger than a C-language pointer. Attributes can always hold an MPI handle. (End of advice to implementors.)

The caching interface defined here requires that attributes be stored by MPI opaquely within a communicator. Accessor functions include the following:
- obtain a key value (used to identify an attribute); the user specifies "callback" functions by which MPI informs the application when the communicator is destroyed or copied;

- store and retrieve the value of an attribute.

Advice to implementors. Caching and callback functions are only called synchronously, in response to explicit application requests. This avoids problems that result from repeated crossings between user and system space. (This synchronous calling rule is a general property of MPI.)

The choice of key values is under control of MPI. This allows MPI to optimize its implementation of attribute sets. It also avoids conflict between independent modules caching information on the same communicators.

A much smaller interface, consisting of just a callback facility, would allow the entire caching facility to be implemented by portable code. However, with the minimal callback interface, some form of table searching is implied by the need to handle arbitrary communicators. In contrast, the more complete interface defined here permits rapid access to attributes through the use of pointers in communicators (to find the attribute table) and cleverly chosen key values (to retrieve individual attributes). In light of the efficiency hit inherent in the minimal interface, the more complete interface defined here is seen to be superior. (End of advice to implementors.)

MPI provides the following services related to caching. They are all process local.
MPI_KEYVAL_CREATE(copy_fn, delete_fn, keyval, extra_state)
  IN   copy_fn      Copy callback function for keyval
  IN   delete_fn    Delete callback function for keyval
  OUT  keyval       key value for future access (integer)
  IN   extra_state  Extra state for callback functions

int MPI_Keyval_create(MPI_Copy_function *copy_fn,
                      MPI_Delete_function *delete_fn, int *keyval,
                      void *extra_state)

MPI_KEYVAL_CREATE(COPY_FN, DELETE_FN, KEYVAL, EXTRA_STATE, IERROR)
    EXTERNAL COPY_FN, DELETE_FN
    INTEGER KEYVAL, EXTRA_STATE, IERROR

Generates a new attribute key. Keys are locally unique in a process, and opaque to the user, though they are explicitly stored in integers. Once allocated, the key value can be used to associate attributes and access them on any locally defined communicator.

The copy_fn function is invoked when a communicator is duplicated by MPI_COMM_DUP. copy_fn should be of type MPI_Copy_function, which is defined as follows:

typedef int MPI_Copy_function(MPI_Comm oldcomm, int keyval,
                              void *extra_state, void *attribute_val_in,
                              void *attribute_val_out, int *flag)
A Fortran declaration for such a function is as follows:

FUNCTION COPY_FUNCTION(OLDCOMM, KEYVAL, EXTRA_STATE, ATTRIBUTE_VAL_IN,
             ATTRIBUTE_VAL_OUT, FLAG)
    INTEGER OLDCOMM, KEYVAL, EXTRA_STATE, ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT
    LOGICAL FLAG

The copy callback function is invoked for each key value in oldcomm in arbitrary order. Each call to the copy callback is made with a key value and its corresponding attribute. If it returns flag = false, then the attribute is deleted in the duplicated communicator. Otherwise (flag = true), the new attribute value is set via attribute_val_out. The function returns MPI_SUCCESS on success and an error code on failure (in which case MPI_COMM_DUP will fail).

copy_fn may be specified as MPI_NULL_FN from either C or FORTRAN, in which case no copy callback occurs for keyval; MPI_NULL_FN is a function that does nothing other than returning flag = false. In C, the NULL function pointer has the same behavior as using MPI_NULL_FN. As a further convenience, MPI_DUP_FN is a simple-minded copy callback, available from C and FORTRAN: it sets flag = true and returns the value of attribute_val_in in attribute_val_out.

Note that the C version of MPI_COMM_DUP assumes that the callback functions follow the C prototype, while the corresponding FORTRAN version assumes the FORTRAN prototype.

Advice to users. A valid copy function is one that completely duplicates the information by making a full duplicate copy of the data structures implied by an attribute; another might just make another reference to that data structure, while using a reference-count mechanism. Other types of attributes might not copy at all (they might be specific to oldcomm only). (End of advice to users.)

Analogous to copy_fn is a callback deletion function, defined as follows. The delete_fn function is invoked when a communicator is deleted by MPI_COMM_FREE or when a call is made explicitly to MPI_ATTR_DELETE. delete_fn should be of type MPI_Delete_function, which is defined as follows:

typedef int MPI_Delete_function(MPI_Comm comm, int keyval,
                                void *attribute_val, void *extra_state);

A Fortran declaration for such a function is as follows:

FUNCTION DELETE_FUNCTION(COMM, KEYVAL, ATTRIBUTE_VAL, EXTRA_STATE)
    INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, EXTRA_STATE

This function is called by MPI_COMM_FREE and MPI_ATTR_DELETE to do whatever is needed to remove an attribute. It may be specified as the null function pointer in C or as MPI_NULL_FN from either C or FORTRAN, in which case no delete callback occurs for keyval.

The special key value MPI_KEYVAL_INVALID is never returned by MPI_KEYVAL_CREATE. Therefore, it can be used for static initialization of key values.
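To make the callback prototypes above concrete, the following is a hedged sketch (not part of the standard) of a matching copy and delete pair that shares one heap object among communicators through reference counting. The struct and function names are invented for illustration; the fragment assumes mpi.h and stdlib.h are included.

    struct my_attr { int refcount; /* ... payload ... */ };

    int my_copy_fn(MPI_Comm oldcomm, int keyval, void *extra_state,
                   void *attribute_val_in, void *attribute_val_out, int *flag)
    {
        struct my_attr *a = (struct my_attr *) attribute_val_in;
        a->refcount++;                     /* the duplicate shares the object */
        *(void **) attribute_val_out = a;  /* attach the same pointer */
        *flag = 1;                         /* request that the attribute be copied */
        return MPI_SUCCESS;
    }

    int my_delete_fn(MPI_Comm comm, int keyval, void *attribute_val,
                     void *extra_state)
    {
        struct my_attr *a = (struct my_attr *) attribute_val;
        if (--a->refcount == 0)
            free(a);                       /* last reference gone */
        return MPI_SUCCESS;
    }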

MPI_KEYVAL_FREE(keyval)
  INOUT  keyval   Frees the integer key value (integer)

int MPI_Keyval_free(int *keyval)

MPI_KEYVAL_FREE(KEYVAL, IERROR)
    INTEGER KEYVAL, IERROR

Frees an extant attribute key. This function sets the value of keyval to MPI_KEYVAL_INVALID. Note that it is not erroneous to free an attribute key that is in use, because the actual free does not transpire until after all references (in other communicators on the process) to the key have been freed. These references need to be explicitly freed by the program, either via calls to MPI_ATTR_DELETE that free one attribute instance, or by calls to MPI_COMM_FREE that free all attribute instances associated with the freed communicator.

Advice to implementors. The function MPI_NULL_FN need not be aliased to the null pointer in C, though this is fine. It could be a legitimately callable function that profiles, and so on. For FORTRAN, it is most convenient to have MPI_NULL_FN be a legitimate, do-nothing function call. (End of advice to implementors.)
MPI_ATTR_PUT(comm, keyval, attribute_val)
  IN   comm           communicator to which attribute will be attached (handle)
  IN   keyval         key value, as returned by MPI_KEYVAL_CREATE (integer)
  IN   attribute_val  attribute value

int MPI_Attr_put(MPI_Comm comm, int keyval, void *attribute_val)

MPI_ATTR_PUT(COMM, KEYVAL, ATTRIBUTE_VAL, IERROR)
    INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, IERROR

This function stores the stipulated attribute value attribute_val for subsequent retrieval by MPI_ATTR_GET. If the value is already present, then the outcome is as if MPI_ATTR_DELETE was first called to delete the previous value (and the callback function delete_fn was executed), and a new value was next stored. The call is erroneous if there is no key with value keyval; in particular, MPI_KEYVAL_INVALID is an erroneous key value.
MPI_ATTR_GET(comm, keyval, attribute_val, flag)
  IN   comm           communicator to which attribute is attached (handle)
  IN   keyval         key value (integer)
  OUT  attribute_val  attribute value, unless flag = false
  OUT  flag           true if an attribute value was extracted; false if no attribute is associated with the key

int MPI_Attr_get(MPI_Comm comm, int keyval, void *attribute_val, int *flag)

MPI_ATTR_GET(COMM, KEYVAL, ATTRIBUTE_VAL, FLAG, IERROR)
    INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, IERROR
    LOGICAL FLAG

Retrieves attribute value by key. The call is erroneous if there is no key with value keyval. On the other hand, the call is correct if the key value exists, but no attribute is attached on comm for that key; in such case, the call returns flag = false. In particular, MPI_KEYVAL_INVALID is an erroneous key value.

MPI_ATTR_DELETE(comm, keyval)
  IN   comm    communicator to which attribute is attached (handle)
  IN   keyval  The key value of the deleted attribute (integer)

int MPI_Attr_delete(MPI_Comm comm, int keyval)

MPI_ATTR_DELETE(COMM, KEYVAL, IERROR)
    INTEGER COMM, KEYVAL, IERROR

Delete attribute from cache by key. This function invokes the attribute delete function delete_fn specified when the keyval was created.

Whenever a communicator is replicated using the function MPI_COMM_DUP, all callback copy functions for attributes that are currently set are invoked (in arbitrary order). Whenever a communicator is deleted using the function MPI_COMM_FREE, all callback delete functions for attributes that are currently set are invoked.
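Independently of the fuller example that follows, a minimal life cycle of the caching calls looks like this sketch (not from the standard's own examples; comm is an arbitrary communicator, MPI_NULL_FN is passed so that no copy or delete callback fires, and the payload is a static value chosen only for illustration):

    int keyval, flag;
    static int payload = 3;                   /* illustrative attribute value */
    void *value;

    MPI_Keyval_create(MPI_NULL_FN, MPI_NULL_FN, &keyval, (void *) 0);

    MPI_Attr_put(comm, keyval, &payload);     /* cache the attribute on comm */

    MPI_Attr_get(comm, keyval, &value, &flag);
    /* flag is true here and value points to payload */

    MPI_Attr_delete(comm, keyval);            /* drop this attribute instance */
    MPI_Keyval_free(&keyval);                 /* release the key */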

Attributes Example

Advice to users. This example shows how to write a collective communication operation that uses caching to be more efficient after the first call. The coding style assumes that MPI function results return only error statuses. (End of advice to users.)
key for this modules stuff

44

static int gopkey MPIKEYVALINVALID

45

46

typedef struct

47

48

CHAPTER GROUPS CONTEXTS AND COMMUNICATORS

int refcount reference count

1

other stuff whatever else we want

2

gopstufftype

3

4

EfficientCollectiveOp comm

5

MPIComm comm

6

7

gopstufftype gopstuff

8

MPIGroup group

9

int foundflag

10

11

MPICommgroupcomm group

12

13

if gopkey MPIKEYVALINVALID get a key on first call ever

14

15

if MPIAttrkeyvalcreate gopstuffcopier

16

gopstuffdestructor

17

gopkey void

18

get the key while assigning its copy and delete callback

19

behavior

20

21

MPIAbort Insufficient keys available

22

23

24

MPIAttrget comm gopkey gopstuff foundflag

25

if foundflag

26

This module has executed in this group before

27

We will use the cached information

28

29

else

30

This is a group that we have not yet cached anything in

31

We will now do so

32

33

34

First allocate storage for the stuff we want

35

and initialize the reference count

36

37

gopstuff gopstufftype malloc sizeofgopstufftype

38

if gopstuff NULL abort on outofmemory error

39

40

gopstuff refcount

41

42

Second fill in gopstuff with whatever we want

43

This part isnt shown here

44

45

Third store gopstuff as the attribute value

46

MPIAttrput comm gopkey gopstuff

47

48

FORMALIZING THE LOOSELY SYNCHRONOUS MODEL

Then in any case use contents of gopstuff

1

to do the global op

2

3

4

The following routine is called by MPI when a group is freed

5

6

gopstuffdestructor comm keyval gopstuff extra

7

MPIComm comm

8

int keyval

9

gopstufftype gopstuff

10

void extra

11

12

if keyval gopkey abort programming error

13

14

The groups being freed removes one reference to gopstuff

15

gopstuff refcount

16

17

If no references remain then free the storage

18

if gopstuff refcount

19

freevoid gopstuff

20

21

22

23

The following routine is called by MPI when a group is copied

24

gopstuffcopier comm keyval gopstuff extra

25

MPIComm comm

26

int keyval

27

gopstufftype gopstuff

28

void extra

29

30

if keyval gopkey abort programming error

31

32

The new group adds one reference to this gopstuff

33

gopstuff refcount

34

35

36

37

Formalizing the Loosely Synchronous Model

In this section, we make further statements about the loosely synchronous model, with particular attention to intra-communication.

Basic Statements

When a caller passes a communicator (that contains a context and group) to a callee, that communicator must be free of side effects throughout execution of the subprogram: there should be no active operations on that communicator that might involve the process. This provides one model in which libraries can be written, and work "safely." For libraries so designated, the callee has permission to do whatever communication it likes with the communicator, and under the above guarantee knows that no other communications will interfere. Since we permit good implementations to create new communicators without synchronization (such as by preallocated contexts on communicators), this does not impose a significant overhead.

This form of safety is analogous to other common computer-science usages, such as passing a descriptor of an array to a library routine. The library routine has every right to expect such a descriptor to be valid and modifiable.
Models of Execution

In the loosely synchronous model, transfer of control to a parallel procedure is effected by having each executing process invoke the procedure. The invocation is a collective operation: it is executed by all processes in the execution group, and invocations are similarly ordered at all processes. However, the invocation need not be synchronized.

We say that a parallel procedure is active in a process if the process belongs to a group that may collectively execute the procedure, and some member of that group is currently executing the procedure code. If a parallel procedure is active in a process, then this process may be receiving messages pertaining to this procedure, even if it does not currently execute the code of this procedure.

Static communicator allocation

This covers the case where, at any point in time, at most one invocation of a parallel procedure can be active at any process, and the group of executing processes is fixed. For example, all invocations of parallel procedures involve all processes, processes are single-threaded, and there are no recursive invocations.

In such a case, a communicator can be statically allocated to each procedure. The static allocation can be done in a preamble, as part of initialization code. If the parallel procedures can be organized into libraries, so that only one procedure of each library can be concurrently active in each processor, then it is sufficient to allocate one communicator per library.
Dynamic communicator allocation

Calls of parallel procedures are well-nested if a new parallel procedure is always invoked in a subset of a group executing the same parallel procedure. Thus, processes that execute the same parallel procedure have the same execution stack.

In such a case, a new communicator needs to be dynamically allocated for each new invocation of a parallel procedure. The allocation is done by the caller. A new communicator can be generated by a call to MPI_COMM_DUP, if the callee execution group is identical to the caller execution group, or by a call to MPI_COMM_SPLIT, if the caller execution group is split into several subgroups executing distinct parallel routines. The new communicator is passed as an argument to the invoked routine.

The need for generating a new communicator at each invocation can be alleviated or avoided altogether in some cases: If the execution group is not split, then one can allocate a stack of communicators in a preamble, and next manage the stack in a way that mimics the stack of recursive calls.

One can also take advantage of the well-ordering property of communication to avoid confusing caller and callee communication, even if both use the same communicator. To do so, one needs to abide by the following two rules:

- messages sent before a procedure call (or before a return from the procedure) are also received before the matching call (or return) at the receiving end;

- messages are always selected by source (no use is made of MPI_ANY_SOURCE).

The General case

In the general case, there may be multiple concurrently active invocations of the same parallel procedure within the same group; invocations may not be well-nested. A new communicator needs to be created for each invocation. It is the user's responsibility to make sure that, should two distinct parallel procedures be invoked concurrently on overlapping sets of processes, then communicator creation be properly coordinated.
Chapter

Process Topologies

Introduction
This chapter discusses the MPI topology mechanism. A topology is an extra, optional attribute that one can give to an intra-communicator; topologies cannot be added to inter-communicators. A topology can provide a convenient naming mechanism for the processes of a group (within a communicator), and additionally, may assist the runtime system in mapping the processes onto hardware.

As stated in the previous chapter, a process group in MPI is a collection of n processes. Each process in the group is assigned a rank between 0 and n-1. In many parallel applications a linear ranking of processes does not adequately reflect the logical communication pattern of the processes (which is usually determined by the underlying problem geometry and the numerical algorithm used). Often the processes are arranged in topological patterns such as two- or three-dimensional grids. More generally, the logical process arrangement is described by a graph. In this chapter we will refer to this logical process arrangement as the "virtual topology."

A clear distinction must be made between the virtual process topology and the topology of the underlying, physical hardware. The virtual topology can be exploited by the system in the assignment of processes to physical processors, if this helps to improve the communication performance on a given machine. How this mapping is done, however, is outside the scope of MPI. The description of the virtual topology, on the other hand, depends only on the application, and is machine-independent. The functions that are proposed in this chapter deal only with machine-independent mapping.

Rationale. Though physical mapping is not discussed, the existence of the virtual topology information may be used as advice by the runtime system. There are well-known techniques for mapping grid/torus structures to hardware topologies such as hypercubes or grids. For more complicated graph structures good heuristics often yield nearly optimal results. On the other hand, if there is no way for the user to specify the logical process arrangement as a "virtual topology," a random mapping is most likely to result. On some machines, this will lead to unnecessary contention in the interconnection network. Some details about predicted and measured performance improvements that result from good process-to-processor mapping on modern wormhole-routing architectures can be found in the literature.

Besides possible performance benefits, the virtual topology can function as a convenient, process-naming structure, with tremendous benefits for program readability and notational power in message-passing programming. (End of rationale.)
Virtual Topologies

The communication pattern of a set of processes can be represented by a graph. The nodes stand for the processes, and the edges connect processes that communicate with each other. MPI provides message-passing between any pair of processes in a group. There is no requirement for opening a channel explicitly. Therefore, a "missing link" in the user-defined process graph does not prevent the corresponding processes from exchanging messages. It means rather that this connection is neglected in the virtual topology. This strategy implies that the topology gives no convenient way of naming this pathway of communication. Another possible consequence is that an automatic mapping tool (if one exists for the runtime environment) will not take account of this edge when mapping. Edges in the communication graph are not weighted, so that processes are either simply connected or not connected at all.

Rationale. Experience with similar techniques in PARMACS shows that this information is usually sufficient for a good mapping. Additionally, a more precise specification is more difficult for the user to set up, and it would make the interface functions substantially more complicated. (End of rationale.)

Specifying the virtual topology in terms of a graph is sufficient for all applications. However, in many applications the graph structure is regular, and the detailed set-up of the graph would be inconvenient for the user and might be less efficient at run time. A large fraction of all parallel applications use process topologies like rings, two- or higher-dimensional grids, or tori. These structures are completely defined by the number of dimensions and the numbers of processes in each coordinate direction. Also, the mapping of grids and tori is generally an easier problem than that of general graphs. Thus, it is desirable to address these cases explicitly.

Process coordinates in a cartesian structure begin their numbering at 0. Row-major numbering is always used for the processes in a cartesian structure. This means that, for example, the relation between group rank and coordinates for four processes in a (2 x 2) grid is as follows:

    coord (0,0):  rank 0
    coord (0,1):  rank 1
    coord (1,0):  rank 2
    coord (1,1):  rank 3

Embedding in MPI

The support for virtual topologies as defined in this chapter is consistent with other parts of MPI, and, whenever possible, makes use of functions that are defined elsewhere. Topology information is associated with communicators. It is added to communicators using the caching mechanism described in the previous chapter.
Overview of the Functions

The functions MPI_GRAPH_CREATE and MPI_CART_CREATE are used to create general (graph) virtual topologies and cartesian topologies, respectively. These topology creation functions are collective. As with other collective calls, the program must be written to work correctly, whether the call synchronizes or not.

The topology creation functions take as input an existing communicator comm_old, which defines the set of processes on which the topology is to be mapped. A new communicator comm_topol is created that carries the topological structure as cached information (see the previous chapter). In analogy to function MPI_COMM_CREATE, no cached information propagates from comm_old to comm_topol.

MPI_CART_CREATE can be used to describe cartesian structures of arbitrary dimension. For each coordinate direction one specifies whether the process structure is periodic or not. Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per coordinate direction. Thus, special support for hypercube structures is not necessary. The local auxiliary function MPI_DIMS_CREATE can be used to compute a balanced distribution of processes among a given number of dimensions.

Rationale. Similar functions are contained in EXPRESS and PARMACS. (End of rationale.)

The function MPI_TOPO_TEST can be used to inquire about the topology associated with a communicator. The topological information can be extracted from the communicator using the functions MPI_GRAPHDIMS_GET and MPI_GRAPH_GET, for general graphs, and MPI_CARTDIM_GET and MPI_CART_GET, for cartesian topologies. Several additional functions are provided to manipulate cartesian topologies: the functions MPI_CART_RANK and MPI_CART_COORDS translate cartesian coordinates into a group rank, and vice-versa; the function MPI_CART_SUB can be used to extract a cartesian subspace (analogous to MPI_COMM_SPLIT). The function MPI_CART_SHIFT provides the information needed to communicate with neighbors in a cartesian dimension. The two functions MPI_GRAPH_NEIGHBORS_COUNT and MPI_GRAPH_NEIGHBORS can be used to extract the neighbors of a node in a graph. The function MPI_CART_SUB is collective over the input communicator's group; all other functions are local.

Two additional functions, MPI_GRAPH_MAP and MPI_CART_MAP, are presented in the last section. In general these functions are not called by the user directly. However, together with the communicator manipulation functions presented in the previous chapter, they are sufficient to implement all other topology functions. A later section outlines such an implementation.
Topology Constructors

Cartesian Constructor

MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart)
  IN   comm_old   input communicator (handle)
  IN   ndims      number of dimensions of cartesian grid (integer)
  IN   dims       integer array of size ndims specifying the number of processes in each dimension
  IN   periods    logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each dimension
  IN   reorder    ranking may be reordered (true) or not (false) (logical)
  OUT  comm_cart  communicator with new cartesian topology (handle)

int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods,
                    int reorder, MPI_Comm *comm_cart)

MPI_CART_CREATE(COMM_OLD, NDIMS, DIMS, PERIODS, REORDER, COMM_CART, IERROR)
    INTEGER COMM_OLD, NDIMS, DIMS(*), COMM_CART, IERROR
    LOGICAL PERIODS(*), REORDER

MPI_CART_CREATE returns a handle to a new communicator to which the cartesian topology information is attached. If reorder = false then the rank of each process in the new group is identical to its rank in the old group. Otherwise, the function may reorder the processes (possibly so as to choose a good embedding of the virtual topology onto the physical machine). If the total size of the cartesian grid is smaller than the size of the group of comm, then some processes are returned MPI_COMM_NULL, in analogy to MPI_COMM_SPLIT. The call is erroneous if it specifies a grid that is larger than the group size.
Cartesian Convenience Function: MPI_DIMS_CREATE

For cartesian topologies, the function MPI_DIMS_CREATE helps the user select a balanced distribution of processes per coordinate direction, depending on the number of processes in the group to be balanced and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.

MPI_DIMS_CREATE(nnodes, ndims, dims)
  IN     nnodes  number of nodes in a grid (integer)
  IN     ndims   number of cartesian dimensions (integer)
  INOUT  dims    integer array of size ndims specifying the number of nodes in each dimension

int MPI_Dims_create(int nnodes, int ndims, int *dims)

MPI_DIMS_CREATE(NNODES, NDIMS, DIMS, IERROR)
    INTEGER NNODES, NDIMS, DIMS(*), IERROR

The entries in the array dims are set to describe a cartesian grid with ndims dimensions and a total of nnodes nodes. The dimensions are set to be as close to each other as possible, using an appropriate divisibility algorithm. The caller may further constrain the operation of this routine by specifying elements of array dims. If dims[i] is set to a positive number, the routine will not modify the number of nodes in dimension i; only those entries where dims[i] = 0 are modified by the call.

Negative input values of dims[i] are erroneous. An error will occur if nnodes is not a multiple of the product of dims[i] over all i with dims[i] != 0.

For dims[i] set by the call, dims[i] will be ordered in non-increasing order. Array dims is suitable for use as input to routine MPI_CART_CREATE. MPI_DIMS_CREATE is local.

Example:

    dims before call    function call                   dims on return
    (0,0)               MPI_DIMS_CREATE(6, 2, dims)     (3,2)
    (0,0)               MPI_DIMS_CREATE(7, 2, dims)     (7,1)
    (0,3,0)             MPI_DIMS_CREATE(6, 3, dims)     (2,3,1)
    (0,3,0)             MPI_DIMS_CREATE(7, 3, dims)     erroneous call
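As a combined sketch of the two constructors (not taken from the standard's own examples; the variable names and the choice of a 2-dimensional grid, periodic in the first dimension only, are illustrative assumptions), MPI_DIMS_CREATE can pick a balanced factorization that is handed directly to MPI_CART_CREATE:

    int nprocs;
    int dims[2]    = {0, 0};    /* let MPI_Dims_create choose both extents */
    int periods[2] = {1, 0};    /* periodic in dimension 0 only */
    MPI_Comm comm_cart;

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);       /* e.g. 12 processes -> dims = (4,3) */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* reorder */, &comm_cart);
    /* since dims[0]*dims[1] == nprocs here, no process receives MPI_COMM_NULL */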

General Graph Constructor

MPI_GRAPH_CREATE(comm_old, nnodes, index, edges, reorder, comm_graph)
  IN   comm_old    input communicator without topology (handle)
  IN   nnodes      number of nodes in graph (integer)
  IN   index       array of integers describing node degrees (see below)
  IN   edges       array of integers describing graph edges (see below)
  IN   reorder     ranking may be reordered (true) or not (false) (logical)
  OUT  comm_graph  communicator with graph topology added (handle)

int MPI_Graph_create(MPI_Comm comm_old, int nnodes, int *index, int *edges,
                     int reorder, MPI_Comm *comm_graph)

MPI_GRAPH_CREATE(COMM_OLD, NNODES, INDEX, EDGES, REORDER, COMM_GRAPH, IERROR)
    INTEGER COMM_OLD, NNODES, INDEX(*), EDGES(*), COMM_GRAPH, IERROR
    LOGICAL REORDER

MPI_GRAPH_CREATE returns a handle to a new communicator to which the graph topology information is attached. If reorder = false then the rank of each process in the new group is identical to its rank in the old group. Otherwise, the function may reorder the processes. If the size, nnodes, of the graph is smaller than the size of the group of comm, then some processes are returned MPI_COMM_NULL, in analogy to MPI_CART_CREATE and MPI_COMM_SPLIT. The call is erroneous if it specifies a graph that is larger than the group size of the input communicator.

The three parameters nnodes, index and edges define the graph structure. nnodes is the number of nodes of the graph. The nodes are numbered from 0 to nnodes-1. The i-th entry of array index stores the total number of neighbors of the first i graph nodes. The lists of neighbors of nodes 0, 1, ..., nnodes-1 are stored in consecutive locations in array edges. The array edges is a flattened representation of the edge lists. The total number of entries in index is nnodes and the total number of entries in edges is equal to the number of graph edges.

The definitions of the arguments nnodes, index, and edges are illustrated with the following simple example.

Example: Assume there are four processes 0, 1, 2, 3 with the following adjacency matrix:

    process    neighbors
    0          1, 3
    1          0
    2          3
    3          0, 2

Then, the input arguments are:

    nnodes =  4
    index  =  2, 3, 4, 6
    edges  =  1, 3, 0, 3, 0, 2

Thus, in C, index[0] is the degree of node zero, and index[i] - index[i-1] is the degree of node i, i = 1, ..., nnodes-1; the list of neighbors of node zero is stored in edges[j], for 0 <= j <= index[0] - 1, and the list of neighbors of node i, i > 0, is stored in edges[j], index[i-1] <= j <= index[i] - 1.

In Fortran, index(1) is the degree of node zero, and index(i+1) - index(i) is the degree of node i, i = 1, ..., nnodes-1; the list of neighbors of node zero is stored in edges(j), for 1 <= j <= index(1), and the list of neighbors of node i, i > 0, is stored in edges(j), index(i) + 1 <= j <= index(i+1).

Advice to implementors. The following topology information is likely to be stored with a communicator:

- Type of topology (cartesian/graph),

- For a cartesian topology:
    ndims (number of dimensions),
    dims (numbers of processes per coordinate direction),
    periods (periodicity information),
    own_position (own position in grid; could also be computed from rank and dims),

- For a graph topology:
    index,
    edges,
  which are the vectors defining the graph structure.

For a graph structure the number of nodes is equal to the number of processes in the group. Therefore, the number of nodes does not have to be stored explicitly. An additional zero entry at the start of array index simplifies access to the topology information. (End of advice to implementors.)
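In C, the four-process example above could be set up as in the following sketch (the array contents are copied from the example; the variable names and the use of MPI_COMM_WORLD are illustrative assumptions):

    int nnodes   = 4;
    int index[4] = {2, 3, 4, 6};          /* running sums of node degrees */
    int edges[6] = {1, 3, 0, 3, 0, 2};    /* flattened neighbor lists */
    MPI_Comm comm_graph;

    MPI_Graph_create(MPI_COMM_WORLD, nnodes, index, edges,
                     0 /* reorder = false */, &comm_graph);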

Topology inquiry functions

If a topology has been defined with one of the above functions, then the topology information can be looked up using inquiry functions. They all are local calls.

MPI_TOPO_TEST(comm, status)
  IN   comm    communicator (handle)
  OUT  status  topology type of communicator comm (choice)

int MPI_Topo_test(MPI_Comm comm, int *status)

MPI_TOPO_TEST(COMM, STATUS, IERROR)
    INTEGER COMM, STATUS, IERROR

The function MPI_TOPO_TEST returns the type of topology that is assigned to a communicator.

The output value status is one of the following:

    MPI_GRAPH       graph topology
    MPI_CART        cartesian topology
    MPI_UNDEFINED   no topology

MPI_GRAPHDIMS_GET(comm, nnodes, nedges)
  IN   comm    communicator for group with graph structure (handle)
  OUT  nnodes  number of nodes in graph (integer) (same as number of processes in the group)
  OUT  nedges  number of edges in graph (integer)

int MPI_Graphdims_get(MPI_Comm comm, int *nnodes, int *nedges)

MPI_GRAPHDIMS_GET(COMM, NNODES, NEDGES, IERROR)
    INTEGER COMM, NNODES, NEDGES, IERROR

Functions MPI_GRAPHDIMS_GET and MPI_GRAPH_GET retrieve the graph-topology information that was associated with a communicator by MPI_GRAPH_CREATE.

The information provided by MPI_GRAPHDIMS_GET can be used to dimension the vectors index and edges correctly for the following call to MPI_GRAPH_GET.
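For instance, the two calls are typically paired as in the following sketch (MPI_GRAPH_GET itself is specified next; the use of malloc and the variable names are illustrative assumptions, and stdlib.h is needed for malloc/free):

    int nnodes, nedges;
    int *index, *edges;

    MPI_Graphdims_get(comm, &nnodes, &nedges);
    index = (int *) malloc(nnodes * sizeof(int));   /* one entry per node */
    edges = (int *) malloc(nedges * sizeof(int));   /* one entry per edge entry */
    MPI_Graph_get(comm, nnodes, nedges, index, edges);
    /* ... use index/edges ..., then free(index); free(edges); */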

MPI_GRAPH_GET(comm, maxindex, maxedges, index, edges)
  IN   comm      communicator with graph structure (handle)
  IN   maxindex  length of vector index in the calling program (integer)
  IN   maxedges  length of vector edges in the calling program (integer)
  OUT  index     array of integers containing the graph structure (for details see the definition of MPI_GRAPH_CREATE)
  OUT  edges     array of integers containing the graph structure

int MPI_Graph_get(MPI_Comm comm, int maxindex, int maxedges, int *index,
                  int *edges)

MPI_GRAPH_GET(COMM, MAXINDEX, MAXEDGES, INDEX, EDGES, IERROR)
    INTEGER COMM, MAXINDEX, MAXEDGES, INDEX(*), EDGES(*), IERROR

MPI_CARTDIM_GET(comm, ndims)
  IN   comm   communicator with cartesian structure (handle)
  OUT  ndims  number of dimensions of the cartesian structure (integer)

int MPI_Cartdim_get(MPI_Comm comm, int *ndims)

MPI_CARTDIM_GET(COMM, NDIMS, IERROR)
    INTEGER COMM, NDIMS, IERROR

The functions MPI_CARTDIM_GET and MPI_CART_GET return the cartesian topology information that was associated with a communicator by MPI_CART_CREATE.

MPI_CART_GET(comm, maxdims, dims, periods, coords)
  IN   comm     communicator with cartesian structure (handle)
  IN   maxdims  length of vectors dims, periods, and coords in the calling program (integer)
  OUT  dims     number of processes for each cartesian dimension (array of integer)
  OUT  periods  periodicity (true/false) for each cartesian dimension (array of logical)
  OUT  coords   coordinates of calling process in cartesian structure (array of integer)

int MPI_Cart_get(MPI_Comm comm, int maxdims, int *dims, int *periods,
                 int *coords)

MPI_CART_GET(COMM, MAXDIMS, DIMS, PERIODS, COORDS, IERROR)
    INTEGER COMM, MAXDIMS, DIMS(*), COORDS(*), IERROR
    LOGICAL PERIODS(*)
MPI CART RANKcomm co ords rank

7

IN comm communicator with cartesian structure handle

8

IN co ords integer array of size ndims sp ecifying the cartesian

9

co ordinates of a pro cess

10

11

OUT rank rank of sp ecied pro cess integer

12

13

int MPI Cart rankMPI Comm comm int coords int rank

14

MPI CART RANKCOMM COORDS RANK IERROR

15

INTEGER COMM COORDS RANK IERROR

16

17

For a pro cess group with cartesian structure the function MPI CART RANK translates

18

the logical pro cess co ordinates to pro cess ranks as they are used by the p ointtop oint

19

routines

20

For dimension i with periodsi true if the co ordinate coordsi is out of

21

range that is coordsi or coordsi dimsi it is shifted back to the interval

22

coordsi dimsi automatically Outofrange co ordinates are erroneous for

23

nonp erio dic dimensions

MPI_CART_COORDS(comm, rank, maxdims, coords)

IN    comm     communicator with cartesian structure (handle)
IN    rank     rank of a process within group of comm (integer)
IN    maxdims  length of vector coords in the calling program (integer)
OUT   coords   integer array (of size ndims) containing the cartesian coordinates of specified process (array of integer)

int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)

MPI_CART_COORDS(COMM, RANK, MAXDIMS, COORDS, IERROR)
    INTEGER COMM, RANK, MAXDIMS, COORDS(*), IERROR

The inverse mapping, rank-to-coordinates translation, is provided by MPI_CART_COORDS.
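The coordinate and rank translations are often combined, as in the following C sketch; the function and variable names are illustrative only, and the call is valid only if the neighbor exists or the dimension is periodic.

    #include "mpi.h"

    /* Sketch: rank of the neighbor one step along dimension 0 of a
       2-D cartesian communicator comm2d created elsewhere.          */
    int neighbor_along_dim0(MPI_Comm comm2d)
    {
        int myrank, coords[2], nbr_rank;

        MPI_Comm_rank(comm2d, &myrank);
        MPI_Cart_coords(comm2d, myrank, 2, coords);  /* my coordinates */
        coords[0] = coords[0] + 1;                   /* step in dim 0  */
        MPI_Cart_rank(comm2d, coords, &nbr_rank);    /* back to a rank */
        return nbr_rank;  /* wraps around only if dimension 0 is periodic */
    }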

MPI_GRAPH_NEIGHBORS_COUNT(comm, rank, nneighbors)

IN    comm        communicator with graph topology (handle)
IN    rank        rank of process in group of comm (integer)
OUT   nneighbors  number of neighbors of specified process (integer)

int MPI_Graph_neighbors_count(MPI_Comm comm, int rank, int *nneighbors)

MPI_GRAPH_NEIGHBORS_COUNT(COMM, RANK, NNEIGHBORS, IERROR)
    INTEGER COMM, RANK, NNEIGHBORS, IERROR

MPI_GRAPH_NEIGHBORS_COUNT and MPI_GRAPH_NEIGHBORS provide adjacency information for a general graph topology.

MPI_GRAPH_NEIGHBORS(comm, rank, maxneighbors, neighbors)

IN    comm          communicator with graph topology (handle)
IN    rank          rank of process in group of comm (integer)
IN    maxneighbors  size of array neighbors (integer)
OUT   neighbors     ranks of processes that are neighbors to specified process (array of integer)

int MPI_Graph_neighbors(MPI_Comm comm, int rank, int maxneighbors, int *neighbors)

MPI_GRAPH_NEIGHBORS(COMM, RANK, MAXNEIGHBORS, NEIGHBORS, IERROR)
    INTEGER COMM, RANK, MAXNEIGHBORS, NEIGHBORS(*), IERROR

Example. Suppose that comm is a communicator with a shuffle-exchange topology. The group has 2^n members. Each process is labeled by a_1, ..., a_n with a_i in {0,1}, and has three neighbors: exchange(a_1, ..., a_n) = a_1, ..., a_(n-1), a'_n (where a' = 1 - a), shuffle(a_1, ..., a_n) = a_2, ..., a_n, a_1, and unshuffle(a_1, ..., a_n) = a_n, a_1, ..., a_(n-1). The graph adjacency list is illustrated below for n = 3.

    node    exchange        shuffle         unshuffle
            neighbors(1)    neighbors(2)    neighbors(3)
    0       1               0               0
    1       0               2               4
    2       3               4               1
    3       2               6               5
    4       5               1               2
    5       4               3               6
    6       7               5               3
    7       6               7               7

Suppose that the communicator comm has this topology associated with it. The following code fragment cycles through the three types of neighbors and performs an appropriate permutation for each.

C  assume: each process has stored a real number A.
C  extract neighborhood information
      CALL MPI_COMM_RANK(comm, myrank, ierr)
      CALL MPI_GRAPH_NEIGHBORS(comm, myrank, 3, neighbors, ierr)
C  perform exchange permutation
      CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, neighbors(1), 0,
     +     neighbors(1), 0, comm, status, ierr)
C  perform shuffle permutation
      CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, neighbors(2), 0,
     +     neighbors(3), 0, comm, status, ierr)
C  perform unshuffle permutation
      CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, neighbors(3), 0,
     +     neighbors(2), 0, comm, status, ierr)

Cartesian Shift Coordinates

If the process topology is a cartesian structure, an MPI_SENDRECV operation is likely to be used along a coordinate direction to perform a shift of data. As input, MPI_SENDRECV takes the rank of a source process for the receive, and the rank of a destination process for the send. If the function MPI_CART_SHIFT is called for a cartesian process group, it provides the calling process with the above identifiers, which then can be passed to MPI_SENDRECV. The user specifies the coordinate direction and the size of the step (positive or negative). The function is local.

23

24

25

MPI CART SHIFTcomm direction disp rank source rank dest

26

IN comm communicator with cartesian structure handle

27

28

IN direction co ordinate dimension of shift integer

29

IN disp displacement > upwards shift < downwards

30

shift integer

31

OUT rank source rank of source pro cess integer

32

OUT rank dest rank of destination pro cess integer

33

34

35

Cart shiftMPI Comm comm int direction int disp int rank source int MPI

36

dest int rank

37

MPI CART SHIFTCOMM DIRECTION DISP RANK SOURCE RANK DEST IERROR

38

INTEGER COMM DIRECTION DISP RANK SOURCE RANK DEST IERROR

39

40

Dep ending on the p erio dicity of the cartesian group in the sp ecied co ordinate direc

41

CART SHIFT provides the identiers for a circular or an endo shift In the case tion MPI

42

of an endo shift the value MPI PROC NULL may b e returned in rank source or rank dest

43

indicating that the source or the destination for the shift is out of range

Example. The communicator, comm, has a two-dimensional, periodic, cartesian topology associated with it. A two-dimensional array of REALs is stored one element per process, in variable A. One wishes to skew this array, by shifting column i (vertically, i.e., along the column) by i steps.

C find process rank
      CALL MPI_COMM_RANK(comm, rank, ierr)
C find cartesian coordinates
      CALL MPI_CART_COORDS(comm, rank, maxdims, coords, ierr)
C compute shift source and destination
      CALL MPI_CART_SHIFT(comm, 0, coords(2), source, dest, ierr)
C skew array
      CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, dest, 0, source, 0, comm,
     +     status, ierr)

Partitioning of Cartesian structures

MPI_CART_SUB(comm, remain_dims, newcomm)

IN    comm         communicator with cartesian structure (handle)
IN    remain_dims  the ith entry of remain_dims specifies whether the ith dimension is kept in the subgrid (true) or is dropped (false) (logical vector)
OUT   newcomm      communicator containing the subgrid that includes the calling process (handle)

int MPI_Cart_sub(MPI_Comm comm, int *remain_dims, MPI_Comm *newcomm)

MPI_CART_SUB(COMM, REMAIN_DIMS, NEWCOMM, IERROR)
    INTEGER COMM, NEWCOMM, IERROR
    LOGICAL REMAIN_DIMS(*)

If a cartesian topology has been created with MPI_CART_CREATE, the function MPI_CART_SUB can be used to partition the communicator group into subgroups that form lower-dimensional cartesian subgrids, and to build for each subgroup a communicator with the associated subgrid cartesian topology. (This function is closely related to MPI_COMM_SPLIT.)

Example. Assume that MPI_CART_CREATE(..., comm) has defined a 2 x 3 x 4 grid. Let remain_dims = (true, false, true). Then a call to

    MPI_CART_SUB(comm, remain_dims, comm_new)

will create three communicators, each with eight processes in a 2 x 4 cartesian topology. If remain_dims = (false, false, true), then the call to MPI_CART_SUB(comm, remain_dims, comm_new) will create six non-overlapping communicators, each with four processes, in a one-dimensional cartesian topology.
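In C the same partitioning reads as in the sketch below; the names grid2d and rowcomm are illustrative only.

    #include "mpi.h"

    /* Sketch: split a 2-D cartesian communicator into one communicator
       per row by dropping the first dimension.                        */
    void make_row_comms(MPI_Comm grid2d, MPI_Comm *rowcomm)
    {
        int remain_dims[2];
        remain_dims[0] = 0;   /* drop dimension 0                */
        remain_dims[1] = 1;   /* keep dimension 1 (the row)      */
        MPI_Cart_sub(grid2d, remain_dims, rowcomm);
    }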

Low-level topology functions

The two additional functions introduced in this section can be used to implement all other topology functions. In general they will not be called by the user directly, unless he or she is creating additional virtual topology capability other than that provided by MPI.

MPI_CART_MAP(comm, ndims, dims, periods, newrank)

IN    comm     input communicator (handle)
IN    ndims    number of dimensions of cartesian structure (integer)
IN    dims     integer array of size ndims specifying the number of processes in each coordinate direction
IN    periods  logical array of size ndims specifying the periodicity specification in each coordinate direction
OUT   newrank  reordered rank of the calling process; MPI_UNDEFINED if calling process does not belong to grid (integer)

int MPI_Cart_map(MPI_Comm comm, int ndims, int *dims, int *periods, int *newrank)

MPI_CART_MAP(COMM, NDIMS, DIMS, PERIODS, NEWRANK, IERROR)
    INTEGER COMM, NDIMS, DIMS(*), NEWRANK, IERROR
    LOGICAL PERIODS(*)

MPI_CART_MAP computes an "optimal" placement for the calling process on the physical machine. A possible implementation of this function is to always return the rank of the calling process, that is, not to perform any reordering.

Advice to implementors. The function MPI_CART_CREATE(comm, ndims, dims, periods, reorder, comm_cart), with reorder = true, can be implemented by calling MPI_CART_MAP(comm, ndims, dims, periods, newrank), then calling MPI_COMM_SPLIT(comm, color, key, comm_cart), with color = 0 if newrank is not MPI_UNDEFINED, color = MPI_UNDEFINED otherwise, and key = newrank.

The function MPI_CART_SUB(comm, remain_dims, comm_new) can be implemented by a call to MPI_COMM_SPLIT(comm, color, key, comm_new), using a single number encoding of the lost dimensions as color and a single number encoding of the preserved dimensions as key.

All other cartesian topology functions can be implemented locally, using the topology information that is cached with the communicator. (End of advice to implementors.)
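The first construction in the advice above can be written out as in the following sketch; it is only an illustration of that outline, with error handling and caching of the topology information omitted, and the function name is hypothetical.

    #include "mpi.h"

    /* Sketch: reordered cartesian communicator built from MPI_Cart_map
       followed by MPI_Comm_split, as described above.                  */
    int cart_create_reorder(MPI_Comm comm, int ndims, int *dims,
                            int *periods, MPI_Comm *comm_cart)
    {
        int newrank, color, key;

        MPI_Cart_map(comm, ndims, dims, periods, &newrank);
        color = (newrank != MPI_UNDEFINED) ? 0 : MPI_UNDEFINED;
        key   = newrank;
        /* Processes with color == MPI_UNDEFINED receive MPI_COMM_NULL. */
        return MPI_Comm_split(comm, color, key, comm_cart);
    }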

The corresponding new function for general graph structures is as follows.

MPI_GRAPH_MAP(comm, nnodes, index, edges, newrank)

IN    comm     input communicator (handle)
IN    nnodes   number of graph nodes (integer)
IN    index    integer array specifying the graph structure, see MPI_GRAPH_CREATE
IN    edges    integer array specifying the graph structure
OUT   newrank  reordered rank of the calling process; MPI_UNDEFINED if the calling process does not belong to graph (integer)

int MPI_Graph_map(MPI_Comm comm, int nnodes, int *index, int *edges, int *newrank)

MPI_GRAPH_MAP(COMM, NNODES, INDEX, EDGES, NEWRANK, IERROR)
    INTEGER COMM, NNODES, INDEX(*), EDGES(*), NEWRANK, IERROR

Advice to implementors. The function MPI_GRAPH_CREATE(comm, nnodes, index, edges, reorder, comm_graph), with reorder = true, can be implemented by calling MPI_GRAPH_MAP(comm, nnodes, index, edges, newrank), then calling MPI_COMM_SPLIT(comm, color, key, comm_graph), with color = 0 if newrank is not MPI_UNDEFINED, color = MPI_UNDEFINED otherwise, and key = newrank.

All other graph topology functions can be implemented locally, using the topology information that is cached with the communicator. (End of advice to implementors.)

An Application Example

Example. The example in the figure below shows how the grid definition and inquiry functions can be used in an application program. A partial differential equation, for instance the Poisson equation, is to be solved on a rectangular domain. First, the processes organize themselves in a two-dimensional structure. Each process then inquires about the ranks of its neighbors in the four directions (up, down, right, left). The numerical problem is solved by an iterative method, the details of which are hidden in the subroutine relax.

In each relaxation step each process computes new values for the solution grid function at all points owned by the process. Then the values at inter-process boundaries have to be exchanged with neighboring processes. For example, the exchange subroutine might contain a call like MPI_SEND(..., neigh_rank(1), ...) to send updated values to the left-hand neighbor (i-1, j).

      integer ndims, num_neigh
      logical reorder
      parameter (ndims=2, num_neigh=4, reorder=.true.)
      integer comm, comm_cart, dims(ndims), neigh_def(ndims), ierr
      integer neigh_rank(num_neigh), own_position(ndims), i, j
      logical periods(ndims)
      real u(0:101,0:101), f(0:101,0:101)
      data dims / ndims * 0 /
      comm = MPI_COMM_WORLD
C   Set process grid size and periodicity
      call MPI_DIMS_CREATE(comm, ndims, dims, ierr)
      periods(1) = .TRUE.
      periods(2) = .TRUE.
C   Create a grid structure in WORLD group and inquire about own position
      call MPI_CART_CREATE(comm, ndims, dims, periods, reorder, comm_cart, ierr)
      call MPI_CART_GET(comm_cart, ndims, dims, periods, own_position, ierr)
C   Look up the ranks for the neighbors.  Own process coordinates are (i,j).
C   Neighbors are (i-1,j), (i+1,j), (i,j-1), (i,j+1)
      i = own_position(1)
      j = own_position(2)
      neigh_def(1) = i-1
      neigh_def(2) = j
      call MPI_CART_RANK(comm_cart, neigh_def, neigh_rank(1), ierr)
      neigh_def(1) = i+1
      neigh_def(2) = j
      call MPI_CART_RANK(comm_cart, neigh_def, neigh_rank(2), ierr)
      neigh_def(1) = i
      neigh_def(2) = j-1
      call MPI_CART_RANK(comm_cart, neigh_def, neigh_rank(3), ierr)
      neigh_def(1) = i
      neigh_def(2) = j+1
      call MPI_CART_RANK(comm_cart, neigh_def, neigh_rank(4), ierr)
C   Initialize the grid functions and start the iteration
      call init (u, f)
      do 10 it=1,100
         call relax (u, f)
C      Exchange data with neighbor processes
         call exchange (u, comm_cart, neigh_rank, num_neigh)
10    continue
      call output (u)
      end

Figure: Set-up of process structure for two-dimensional parallel Poisson solver.

Chapter

MPI Environmental Management

This chapter discusses routines for getting and, where appropriate, setting various parameters that relate to the MPI implementation and the execution environment (such as error handling). The procedures for entering and leaving the MPI execution environment are also described here.

Implementation information

Environmental Inquiries

A set of attributes that describe the execution environment are attached to the communicator MPI_COMM_WORLD when MPI is initialized. The value of these attributes can be inquired by using the function MPI_ATTR_GET, described in the chapter on groups, contexts, and communicators. It is erroneous to delete these attributes, or free their keys.

The list of predefined attribute keys includes

MPI_TAG_UB   Upper bound for tag value.

MPI_HOST     Host process rank, if such exists, MPI_PROC_NULL, otherwise.

MPI_IO       rank of a node that has regular I/O facilities (possibly myrank). Nodes in the same communicator may return different values for this parameter.

Vendors may add implementation specific parameters (such as node number, real memory size, virtual memory size, etc.)

The required parameter values are discussed in more detail below.

Tag values

Tag values range from 0 to the value returned for MPI_TAG_UB, inclusive. These values are guaranteed to be unchanging during the execution of an MPI program. In addition, the tag upper bound value must be at least 32767. An MPI implementation is free to make the value of MPI_TAG_UB larger than this; for example, the value 2^30 - 1 is also a legal value for MPI_TAG_UB.
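The tag upper bound can be retrieved with the attribute-inquiry call, as in the sketch below; the helper name is illustrative only.

    #include <stdio.h>
    #include "mpi.h"

    /* Sketch: query the MPI_TAG_UB attribute cached on MPI_COMM_WORLD. */
    void print_tag_ub(void)
    {
        int *tag_ub_ptr, flag;

        MPI_Attr_get(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
        if (flag)
            printf("largest usable tag value: %d\n", *tag_ub_ptr);
    }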

Host rank

The value returned for MPI_HOST gets the rank of the HOST process in the group associated with communicator MPI_COMM_WORLD, if there is such. MPI_PROC_NULL is returned if there is no host. MPI does not specify what it means for a process to be a HOST, nor does it require that a HOST exists.

IO rank

The value returned for MPI_IO is the rank of a processor that can provide language-standard I/O facilities. For Fortran, this means that all of the Fortran I/O operations are supported (e.g., OPEN, REWIND, WRITE). For C, this means that all of the ANSI-C I/O operations are supported (e.g., fopen, fprintf, lseek).

If every process can provide language-standard I/O, then the value MPI_ANY_SOURCE must be returned. If no process can provide language-standard I/O, then the value MPI_PROC_NULL must be returned. If several processes can provide I/O, then any one of them may be returned. The same value (rank) need not be returned by all processes.

MPI_GET_PROCESSOR_NAME(name, resultlen)

OUT   name       A unique specifier for the actual (as opposed to virtual) node.
OUT   resultlen  Length (in printable characters) of the result returned in name

int MPI_Get_processor_name(char *name, int *resultlen)

MPI_GET_PROCESSOR_NAME(NAME, RESULTLEN, IERROR)
    CHARACTER*(*) NAME
    INTEGER RESULTLEN, IERROR

This routine returns the name of the processor on which it was called at the moment of the call. The name is a character string, for maximum flexibility. From this value it must be possible to identify a specific piece of hardware; possible values include "processor 9 in rack 4 of mpp.cs.org" and "231" (where 231 is the actual processor number in the running homogeneous system). The argument name must represent storage that is at least MPI_MAX_PROCESSOR_NAME characters long. MPI_GET_PROCESSOR_NAME may write up to this many characters into name.

The number of characters actually written is returned in the output argument, resultlen.

Rationale. This function allows MPI implementations that do process migration to return the current processor. Note that nothing in MPI requires or defines process migration; this definition of MPI_GET_PROCESSOR_NAME simply allows such an implementation. (End of rationale.)

Advice to users. The user must provide at least MPI_MAX_PROCESSOR_NAME space to write the processor name; processor names can be this long. The user should examine the output argument, resultlen, to determine the actual length of the name. (End of advice to users.)
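A minimal usage sketch in C (the printed message is illustrative only):

    #include <stdio.h>
    #include "mpi.h"

    void report_processor(void)
    {
        char name[MPI_MAX_PROCESSOR_NAME];
        int  resultlen;

        MPI_Get_processor_name(name, &resultlen);
        printf("running on %.*s\n", resultlen, name);
    }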

Error handling

An MPI implementation cannot or may choose not to handle some errors that occur during MPI calls. These can include errors that generate exceptions or traps, such as floating point errors or access violations. The set of errors that are handled by MPI is implementation-dependent. Each such error generates an MPI exception.

A user can associate an error handler with a communicator. The specified error handling routine will be used for any MPI exception that occurs during a call to MPI for a communication with this communicator. MPI calls that are not related to any communicator are considered to be attached to the communicator MPI_COMM_WORLD. The attachment of error handlers to communicators is purely local: different processes may attach different error handlers to the same communicator.

A newly created communicator inherits the error handler that is associated with the "parent" communicator. In particular, the user can specify a "global" error handler for all communicators by associating this handler with the communicator MPI_COMM_WORLD immediately after initialization.

Several predefined error handlers are available in MPI:

MPI_ERRORS_ARE_FATAL  The handler, when called, causes the program to abort on all executing processes. This has the same effect as if MPI_ABORT was called by the process that invoked the handler.

MPI_ERRORS_RETURN  The handler has no effect.

Implementations may provide additional predefined error handlers and programmers can code their own error handlers.

The error handler MPI_ERRORS_ARE_FATAL is associated by default with MPI_COMM_WORLD after initialization. Thus, if the user chooses not to control error handling, every error that MPI handles is treated as fatal. Since (almost) all MPI calls return an error code, a user may choose to handle errors in the main code, by testing the return code of MPI calls and executing a suitable recovery code when the call was not successful. In this case, the error handler MPI_ERRORS_RETURN will be used. Usually it is more convenient and more efficient not to test for errors after each MPI call, and have such an error handled by a non-trivial MPI error handler.

After an error is detected, the state of MPI is undefined. That is, using a user-defined error handler, or MPI_ERRORS_RETURN, does not necessarily allow the user to continue to use MPI after an error is detected. The purpose of these error handlers is to allow a user to issue user-defined error messages and to take actions unrelated to MPI (such as flushing I/O buffers) before a program exits. An MPI implementation is free to allow MPI to continue after an error but is not required to do so.

Advice to implementors. A good quality implementation will, to the greatest possible extent, circumscribe the impact of an error, so that normal processing can continue after an error handler was invoked. The implementation documentation will provide information on the possible effect of each class of errors. (End of advice to implementors.)

An MPI error handler is an opaque object, which is accessed by a handle. MPI calls are provided to create new error handlers, to associate error handlers with communicators, and to test which error handler is associated with a communicator.

MPI_ERRHANDLER_CREATE(function, errhandler)

IN    function    user defined error handling procedure
OUT   errhandler  MPI error handler (handle)

int MPI_Errhandler_create(MPI_Handler_function *function, MPI_Errhandler *errhandler)

MPI_ERRHANDLER_CREATE(FUNCTION, ERRHANDLER, IERROR)
    EXTERNAL FUNCTION
    INTEGER ERRHANDLER, IERROR

Register the user routine function for use as an MPI exception handler. Returns in errhandler a handle to the registered exception handler.

Advice to implementors. The handle returned may contain the address of the error handling routine. (This call is superfluous in C, which has a referencing operator, but is necessary in Fortran.) (End of advice to implementors.)

The user routine should be a C function of type MPI_Handler_function, which is defined as

typedef void (MPI_Handler_function)(MPI_Comm *, int *, ...);

The first argument is the communicator in use. The second is the error code to be returned by the MPI routine. The remaining arguments are "stdargs" arguments whose number and meaning is implementation-dependent. An implementation should clearly document these arguments. Addresses are used so that the handler may be written in Fortran.

Rationale. The variable argument list is provided because it provides an ANSI-standard hook for providing additional information to the error handler; without this hook, ANSI C prohibits additional arguments. (End of rationale.)
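A handler conforming to this type might, for example, print the error string and abort; the names below are illustrative, and the stdargs arguments are simply ignored.

    #include <stdio.h>
    #include "mpi.h"

    /* Sketch of a user-defined handler of type MPI_Handler_function. */
    void my_handler(MPI_Comm *comm, int *errcode, ...)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int  len;

        MPI_Error_string(*errcode, msg, &len);
        fprintf(stderr, "MPI error: %.*s\n", len, msg);
        MPI_Abort(*comm, *errcode);
    }

    /* Registration and attachment (illustrative only). */
    void install_handler(MPI_Comm comm)
    {
        MPI_Errhandler errh;
        MPI_Errhandler_create(my_handler, &errh);
        MPI_Errhandler_set(comm, errh);
        MPI_Errhandler_free(&errh);  /* deallocated once no communicator uses it */
    }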

MPI_ERRHANDLER_SET(comm, errhandler)

IN    comm        communicator to set the error handler for (handle)
IN    errhandler  new MPI error handler for communicator (handle)

int MPI_Errhandler_set(MPI_Comm comm, MPI_Errhandler errhandler)

MPI_ERRHANDLER_SET(COMM, ERRHANDLER, IERROR)
    INTEGER COMM, ERRHANDLER, IERROR

Associates the new error handler errhandler with communicator comm at the calling process. Note that an error handler is always associated with the communicator.

MPI_ERRHANDLER_GET(comm, errhandler)

IN    comm        communicator to get the error handler from (handle)
OUT   errhandler  MPI error handler currently associated with communicator (handle)

int MPI_Errhandler_get(MPI_Comm comm, MPI_Errhandler *errhandler)

MPI_ERRHANDLER_GET(COMM, ERRHANDLER, IERROR)
    INTEGER COMM, ERRHANDLER, IERROR

Returns in errhandler (a handle to) the error handler that is currently associated with communicator comm.

Example: A library function may register at its entry point the current error handler for a communicator, set its own private error handler for this communicator, and restore before exiting the previous error handler.
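In outline, such a library routine might look like the following sketch; the names are illustrative only.

    #include "mpi.h"

    void library_entry(MPI_Comm comm, MPI_Errhandler private_errh)
    {
        MPI_Errhandler saved;

        MPI_Errhandler_get(comm, &saved);        /* remember caller's handler */
        MPI_Errhandler_set(comm, private_errh);  /* install the library's own */

        /* ... body of the library routine ... */

        MPI_Errhandler_set(comm, saved);         /* restore on the way out    */
    }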

MPI_ERRHANDLER_FREE(errhandler)

IN    errhandler  MPI error handler (handle)

int MPI_Errhandler_free(MPI_Errhandler *errhandler)

MPI_ERRHANDLER_FREE(ERRHANDLER, IERROR)
    INTEGER ERRHANDLER, IERROR

Marks the error handler associated with errhandler for deallocation and sets errhandler to MPI_ERRHANDLER_NULL. The error handler will be deallocated after all communicators associated with it have been deallocated.

MPI_ERROR_STRING(errorcode, string, resultlen)

IN    errorcode  Error code returned by an MPI routine
OUT   string     Text that corresponds to the errorcode
OUT   resultlen  Length (in printable characters) of the result returned in string

int MPI_Error_string(int errorcode, char *string, int *resultlen)

MPI_ERROR_STRING(ERRORCODE, STRING, RESULTLEN, IERROR)
    INTEGER ERRORCODE, RESULTLEN, IERROR
    CHARACTER*(*) STRING

Returns the error string associated with an error code. The argument string must represent storage that is at least MPI_MAX_ERROR_STRING characters long.

The number of characters actually written is returned in the output argument, resultlen.

Rationale. The form of this function was chosen to make the Fortran and C bindings similar. A version that returns a pointer to a string has two difficulties. First, the return string must be statically allocated and different for each error message (allowing the pointers returned by successive calls to MPI_ERROR_STRING to point to the correct message). Second, in Fortran, a function declared as returning CHARACTER*(*) can not be referenced in, for example, a PRINT statement. (End of rationale.)

Error codes and classes

The error codes returned by MPI are left entirely to the implementation (with the exception of MPI_SUCCESS). This is done to allow an implementation to provide as much information as possible in the error code (for use with MPI_ERROR_STRING).

To make it possible for an application to interpret an error code, the routine MPI_ERROR_CLASS converts an error code into one of a small set of specified values, called error classes. Valid error classes include

    MPI_SUCCESS       No error
    MPI_ERR_BUFFER    Invalid buffer pointer
    MPI_ERR_COUNT     Invalid count argument
    MPI_ERR_TYPE      Invalid datatype argument
    MPI_ERR_TAG       Invalid tag argument
    MPI_ERR_COMM      Invalid communicator
    MPI_ERR_RANK      Invalid rank
    MPI_ERR_REQUEST   Invalid request (handle)
    MPI_ERR_ROOT      Invalid root
    MPI_ERR_GROUP     Invalid group
    MPI_ERR_OP        Invalid operation
    MPI_ERR_TOPOLOGY  Invalid topology
    MPI_ERR_DIMS      Invalid dimension argument
    MPI_ERR_ARG       Invalid argument of some other kind
    MPI_ERR_UNKNOWN   Unknown error
    MPI_ERR_TRUNCATE  Message truncated on receive
    MPI_ERR_OTHER     Known error not in this list
    MPI_ERR_INTERN    Internal MPI error
    MPI_ERR_LASTCODE  Last standard error code

An implementation is free to define more error classes; however, the standard error classes must be used where appropriate. The error classes satisfy

    0 = MPI_SUCCESS < MPI_ERR_... <= MPI_ERR_LASTCODE.

Rationale. The difference between MPI_ERR_UNKNOWN and MPI_ERR_OTHER is that MPI_ERROR_STRING can return useful information about MPI_ERR_OTHER. Note that MPI_SUCCESS = 0 is necessary to be consistent with C practice; the separation of error classes and error codes allows us to define the error classes this way. Having a known LASTCODE is often a nice sanity check as well. (End of rationale.)

MPI_ERROR_CLASS(errorcode, errorclass)

IN    errorcode   Error code returned by an MPI routine
OUT   errorclass  Error class associated with errorcode

int MPI_Error_class(int errorcode, int *errorclass)

MPI_ERROR_CLASS(ERRORCODE, ERRORCLASS, IERROR)
    INTEGER ERRORCODE, ERRORCLASS, IERROR
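A typical use, assuming the MPI_ERRORS_RETURN handler has been installed so that calls return error codes, is sketched below; the helper name is illustrative only.

    #include <stdio.h>
    #include "mpi.h"

    /* Sketch: decode a return code from an MPI call. */
    void report(int rc)
    {
        int  errclass, len;
        char msg[MPI_MAX_ERROR_STRING];

        if (rc != MPI_SUCCESS) {
            MPI_Error_class(rc, &errclass);
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "error class %d: %.*s\n", errclass, len, msg);
        }
    }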

Timers

MPI defines a timer. A timer is specified even though it is not "message-passing", because timing parallel programs is important in "performance debugging" and because existing timers (both in POSIX 1003.1-1988 and 1003.4D 14.1 and in Fortran 90) are either inconvenient or do not provide adequate access to high-resolution timers.

MPI_WTIME()

double MPI_Wtime(void)

DOUBLE PRECISION MPI_WTIME()

MPI_WTIME returns a floating-point number of seconds, representing elapsed wall-clock time since some time in the past.

The "time in the past" is guaranteed not to change during the life of the process. The user is responsible for converting large numbers of seconds to other units if they are preferred.

This function is portable (it returns seconds, not "ticks"), it allows high resolution, and carries no unnecessary baggage. One would use it like this:

{
    double starttime, endtime;
    starttime = MPI_Wtime();
     ....  stuff to be timed  ...
    endtime   = MPI_Wtime();
    printf("That took %f seconds\n", endtime - starttime);
}

The times returned are local to the node that called them. There is no requirement that different nodes return "the same time."

MPI_WTICK()

double MPI_Wtick(void)

DOUBLE PRECISION MPI_WTICK()

MPI_WTICK returns the resolution of MPI_WTIME in seconds. That is, it returns, as a double precision value, the number of seconds between successive clock ticks. For example, if the clock is implemented by the hardware as a counter that is incremented every millisecond, the value returned by MPI_WTICK should be 10^-3.

Startup

One goal of MPI is to achieve source code portability. By this we mean that a program written using MPI and complying with the relevant language standards is portable as written, and must not require any source code changes when moved from one system to another. This explicitly does not say anything about how an MPI program is started or launched from the command line, nor what the user must do to set up the environment in which an MPI program will run. However, an implementation may require some setup to be performed before other MPI routines may be called. To provide for this, MPI includes an initialization routine MPI_INIT.

MPI_INIT()

int MPI_Init(int *argc, char ***argv)

MPI_INIT(IERROR)
    INTEGER IERROR

This routine must be called before any other MPI routine. It must be called at most once; subsequent calls are erroneous (see MPI_INITIALIZED).

All MPI programs must contain a call to MPI_INIT; this routine must be called before any other MPI routine (apart from MPI_INITIALIZED) is called. The version for ANSI C accepts the argc and argv that are provided by the arguments to main:

    MPI_Init(&argc, &argv);

The Fortran version takes only IERROR.

MPI_FINALIZE()

int MPI_Finalize(void)

MPI_FINALIZE(IERROR)
    INTEGER IERROR

This routine cleans up all MPI state. Once this routine is called, no MPI routine (even MPI_INIT) may be called. The user must ensure that all pending communications involving a process complete before the process calls MPI_FINALIZE.

MPI_INITIALIZED(flag)

OUT   flag   Flag is true if MPI_INIT has been called and false otherwise.

int MPI_Initialized(int *flag)

MPI_INITIALIZED(FLAG, IERROR)
    LOGICAL FLAG
    INTEGER IERROR

This routine may be used to determine whether MPI_INIT has been called. It is the only routine that may be called before MPI_INIT is called.

MPI_ABORT(comm, errorcode)

IN    comm       communicator of tasks to abort
IN    errorcode  error code to return to invoking environment

int MPI_Abort(MPI_Comm comm, int errorcode)

MPI_ABORT(COMM, ERRORCODE, IERROR)
    INTEGER COMM, ERRORCODE, IERROR

This routine makes a "best attempt" to abort all tasks in the group of comm. This function does not require that the invoking environment take any action with the error code. However, a Unix or POSIX environment should handle this as a return errorcode from the main program or an abort(errorcode).

MPI implementations are required to define the behavior of MPI_ABORT at least for a comm of MPI_COMM_WORLD. MPI implementations may ignore the comm argument and act as if the comm was MPI_COMM_WORLD.
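The startup routines are typically combined as in the following minimal C sketch; the printed message is illustrative only.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char **argv)
    {
        int flag, rank;

        MPI_Initialized(&flag);        /* legal even before MPI_Init      */
        if (!flag)
            MPI_Init(&argc, &argv);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("process %d started\n", rank);

        MPI_Finalize();                /* no MPI calls are allowed after  */
        return 0;
    }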

Chapter

Profiling Interface

Requirements

To meet the MPI profiling interface, an implementation of the MPI functions must

- provide a mechanism through which all of the MPI defined functions may be accessed with a name shift. Thus all of the MPI functions (which normally start with the prefix "MPI_") should also be accessible with the prefix "PMPI_".

- ensure that those MPI functions which are not replaced may still be linked into an executable image without causing name clashes.

- document the implementation of different language bindings of the MPI interface if they are layered on top of each other, so that the profiler developer knows whether she must implement the profile interface for each binding, or can economise by implementing it only for the lowest level routines.

- where the implementation of different language bindings is done through a layered approach (e.g. the Fortran binding is a set of "wrapper" functions which call the C implementation), ensure that these wrapper functions are separable from the rest of the library.

  This is necessary to allow a separate profiling library to be correctly implemented, since (at least with Unix linker semantics) the profiling library must contain these wrapper functions if it is to perform as expected. This requirement allows the person who builds the profiling library to extract these functions from the original MPI library and add them into the profiling library without bringing along any other unnecessary code.

- provide a no-op routine MPI_PCONTROL in the MPI library.

Discussion

The objective of the MPI profiling interface is to ensure that it is relatively easy for authors of profiling (and other similar) tools to interface their codes to MPI implementations on different machines.

Since MPI is a machine independent standard with many different implementations, it is unreasonable to expect that the authors of profiling tools for MPI will have access to the source code which implements MPI on any particular machine. It is therefore necessary to provide a mechanism by which the implementors of such tools can collect whatever performance information they wish without access to the underlying implementation.

We believe that having such an interface is important if MPI is to be attractive to end users, since the availability of many different tools will be a significant factor in attracting users to the MPI standard.

The profiling interface is just that, an interface. It says nothing about the way in which it is used. There is therefore no attempt to lay down what information is collected through the interface, or how the collected information is saved, filtered, or displayed.

While the initial impetus for the development of this interface arose from the desire to permit the implementation of profiling tools, it is clear that an interface like that specified may also prove useful for other purposes, such as "internetworking" multiple MPI implementations. Since all that is defined is an interface, there is no objection to its being used wherever it is useful.

As the issues being addressed here are intimately tied up with the way in which executable images are built, which may differ greatly on different machines, the examples given below should be treated solely as one way of implementing the objective of the MPI profiling interface. The actual requirements made of an implementation are those detailed in the Requirements section above; the whole of the rest of this chapter is only present as justification and discussion of the logic for those requirements.

The examples below show one way in which an implementation could be constructed to meet the requirements on a Unix system (there are doubtless others which would be equally valid).

Logic of the design

Provided that an MPI implementation meets the requirements above, it is possible for the implementor of the profiling system to intercept all of the MPI calls which are made by the user program. She can then collect whatever information she requires before calling the underlying MPI implementation (through its name shifted entry points) to achieve the desired effects.

Miscellaneous control of profiling

There is a clear requirement for the user code to be able to control the profiler dynamically at run time. This is normally used for (at least) the purposes of

- Enabling and disabling profiling depending on the state of the calculation.

- Flushing trace buffers at non-critical points in the calculation.

- Adding user events to a trace file.

These requirements are met by use of the MPI_PCONTROL.

MPI_PCONTROL(level, ...)

IN    level   Profiling level

int MPI_Pcontrol(const int level, ...)

MPI_PCONTROL(LEVEL)
    INTEGER LEVEL

MPI libraries themselves make no use of this routine, and simply return immediately to the user code. However the presence of calls to this routine allows a profiling package to be explicitly called by the user.

Since MPI has no control of the implementation of the profiling code, we are unable to specify precisely the semantics which will be provided by calls to MPI_PCONTROL. This vagueness extends to the number of arguments to the function, and their datatypes.

However to provide some level of portability of user codes to different profiling libraries, we request the following meanings for certain values of level.

- level == 0  Profiling is disabled.

- level == 1  Profiling is enabled at a normal default level of detail.

- level == 2  Profile buffers are flushed. (This may be a no-op in some profilers.)

- All other values of level have profile library defined effects and additional arguments.

We also request that the default state after MPI_INIT has been called is for profiling to be enabled at the normal default level (i.e. as if MPI_PCONTROL had just been called with the argument 1). This allows users to link with a profiling library and obtain profile output without having to modify their source code at all.

The provision of MPI_PCONTROL as a no-op in the standard MPI library allows them to modify their source code to obtain more detailed profiling information, but still be able to link exactly the same code against the standard MPI library.
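For instance, user code might bracket the phase of interest as in the following sketch; the phase structure is illustrative only.

    #include "mpi.h"

    void profiled_phase(void)
    {
        MPI_Pcontrol(0);   /* disable profiling during setup             */
        /* ... setup communication ... */

        MPI_Pcontrol(1);   /* enable at the default level for the solve  */
        /* ... the communication phase of interest ... */

        MPI_Pcontrol(2);   /* flush trace buffers (may be a no-op)       */
        MPI_Pcontrol(0);   /* and switch profiling off again             */
    }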

Examples

Profiler implementation

Suppose that the profiler wishes to accumulate the total amount of data sent by the MPI_SEND function, along with the total elapsed time spent in the function. This could trivially be achieved thus

static int totalBytes;
static double totalTime;

int MPI_SEND(void * buffer, const int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double tstart = MPI_Wtime();         /* Pass on all the arguments */
    int    extent;
    int    result = PMPI_Send(buffer, count, datatype, dest, tag, comm);

    MPI_Type_size(datatype, &extent);    /* Accumulate byte count     */
    totalBytes += count * extent;

    totalTime  += MPI_Wtime() - tstart;  /* and time                  */

    return result;
}

MPI library implementation

On a Unix system, in which the MPI library is implemented in C, then there are various possible options, of which two of the most obvious are presented here. Which is better depends on whether the linker and compiler support weak symbols.

Systems with weak symbols

If the compiler and linker support weak external symbols (e.g. Solaris 2.x, other system V.4 machines), then only a single library is required through the use of #pragma weak thus

#pragma weak MPI_Example = PMPI_Example

int PMPI_Example(/* appropriate args */)
{
    /* Useful content */
}

The effect of this #pragma is to define the external symbol MPI_Example as a weak definition. This means that the linker will not complain if there is another definition of the symbol (for instance in the profiling library), however if no other definition exists, then the linker will use the weak definition.

Systems without weak symbols

In the absence of weak symbols then one possible solution would be to use the C macro pre-processor thus

#ifdef PROFILELIB
#    ifdef __STDC__
#        define FUNCTION(name) P##name
#    else
#        define FUNCTION(name) P/**/name
#    endif
#else
#    define FUNCTION(name) name
#endif

Each of the user visible functions in the library would then be declared thus

int FUNCTION(MPI_Example)(/* appropriate args */)
{
    /* Useful content */
}

The same source file can then be compiled to produce both versions of the library, depending on the state of the PROFILELIB macro symbol.

It is required that the standard MPI library be built in such a way that the inclusion of MPI functions can be achieved one at a time. This is a somewhat unpleasant requirement, since it may mean that each external function has to be compiled from a separate file. However this is necessary so that the author of the profiling library need only define those MPI functions which she wishes to intercept, references to any others being fulfilled by the normal MPI library. Therefore the link step can look something like this

% cc ... -lmyprof -lpmpi -lmpi

Here libmyprof.a contains the profiler functions which intercept some of the MPI functions, libpmpi.a contains the "name shifted" MPI functions, and libmpi.a contains the normal definitions of the MPI functions.

Complications

Multiple counting

Since parts of the MPI library may themselves be implemented using more basic MPI functions (e.g. a portable implementation of the collective operations implemented using point to point communications), there is potential for profiling functions to be called from within an MPI function which was called from a profiling function. This could lead to "double counting" of the time spent in the inner routine. Since this effect could actually be useful under some circumstances (e.g. it might allow one to answer the question "How much time is spent in the point to point routines when they're called from collective functions?"), we have decided not to enforce any restrictions on the author of the MPI library which would overcome this. Therefore the author of the profiling library should be aware of this problem, and guard against it herself. In a single threaded world this is easily achieved through use of a static variable in the profiling code which remembers if you are already inside a profiling routine. It becomes more complex in a multi-threaded environment (as does the meaning of the times recorded).

Linker oddities

The Unix linker traditionally operates in one pass: the effect of this is that functions from libraries are only included in the image if they are needed at the time the library is scanned. When combined with weak symbols, or multiple definitions of the same function, this can cause odd (and unexpected) effects.

Consider, for instance, an implementation of MPI in which the Fortran binding is achieved by using wrapper functions on top of the C implementation. The author of the profile library then assumes that it is reasonable only to provide profile functions for the C binding, since Fortran will eventually call these, and the cost of the wrappers is assumed to be small. However, if the wrapper functions are not in the profiling library, then none of the profiled entry points will be undefined when the profiling library is called. Therefore none of the profiling code will be included in the image. When the standard MPI library is scanned, the Fortran wrappers will be resolved, and will also pull in the base versions of the MPI functions. The overall effect is that the code will link successfully, but will not be profiled.

To overcome this we must ensure that the Fortran wrapper functions are included in the profiling version of the library. We ensure that this is possible by requiring that these be separable from the rest of the base MPI library. This allows them to be ar'ed out of the base library and into the profiling one.

Multiple levels of interception

The scheme given here does not directly support the nesting of profiling functions, since it provides only a single alternative name for each MPI function. Consideration was given to an implementation which would allow multiple levels of call interception, however we were unable to construct an implementation of this which did not have the following disadvantages

- assuming a particular implementation language,

- imposing a run time cost even when no profiling was taking place.

Since one of the objectives of MPI is to permit efficient, low latency implementations, and it is not the business of a standard to require a particular implementation language, we decided to accept the scheme outlined above.

Note, however, that it is possible to use the scheme above to implement a multi-level system, since the function called by the user may call many different profiling functions before calling the underlying MPI function.

Unfortunately, such an implementation may require more co-operation between the different profiling libraries than is required for the single level implementation detailed above.

Bibliography

V. Bala and S. Kipnis. Process groups: a mechanism for the coordination of and communication among processes in the Venus collective communication library. Technical report, IBM T. J. Watson Research Center, October. Preprint.

V. Bala, S. Kipnis, L. Rudolph, and Marc Snir. Designing efficient, scalable, and portable collective communication libraries. Technical report, IBM T. J. Watson Research Center, October. Preprint.

Purushotham V. Bangalore, Nathan E. Doss, and Anthony Skjellum. MPI++: Issues and Features. In OON-SKI, in press.

A. Beguelin, J. Dongarra, A. Geist, R. Manchek, and V. Sunderam. Visualization and debugging in a heterogeneous environment. IEEE Computer, June.

Luc Bomans and Rolf Hempel. The Argonne/GMD macros in FORTRAN for portable parallel programming and their implementation on the Intel iPSC/2. Parallel Computing.

R. Butler and E. Lusk. User's guide to the p4 programming system. Technical Report TM-ANL, Argonne National Laboratory.

Ralph Butler and Ewing Lusk. Monitors, messages, and clusters: the p4 parallel programming system. Journal of Parallel Computing, to appear. Also Argonne National Laboratory Mathematics and Computer Science Division preprint.

Robin Calkin, Rolf Hempel, Hans-Christian Hoppe, and Peter Wypior. Portable programming with the parmacs message-passing library. Parallel Computing, special issue on message-passing interfaces, to appear.

S. Chittor and R. J. Enbody. Performance evaluation of mesh-connected wormhole-routed networks for interprocessor communication in multicomputers. In Proceedings of the Supercomputing Conference.

S. Chittor and R. J. Enbody. Predicting the effect of mapping on the communication performance of large multicomputers. In Proceedings of the International Conference on Parallel Processing, vol. II (Software).

J. Dongarra, A. Geist, R. Manchek, and V. Sunderam. Integrated PVM framework supports heterogeneous network computing. Computers in Physics, April.

J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker. A proposal for a user-level, message passing interface in a distributed memory environment. Technical Report TM, Oak Ridge National Laboratory, February.

Nathan Doss, William Gropp, Ewing Lusk, and Anthony Skjellum. A model implementation of MPI. Technical report, Argonne National Laboratory.

Edinburgh Parallel Computing Centre, University of Edinburgh. CHIMP Concepts, June.

Edinburgh Parallel Computing Centre, University of Edinburgh. CHIMP Version Interface, May.

D. Feitelson. Communicators: Object-based multiparty interactions for parallel programming. Technical report, Dept. Computer Science, The Hebrew University of Jerusalem, November.

Hubertus Franke, Peter Hochschild, Pratap Pattnaik, and Marc Snir. An efficient implementation of MPI. In International Conference on Parallel Processing.

G. A. Geist, M. T. Heath, B. W. Peyton, and P. H. Worley. A users' guide to PICL: a portable instrumented communication library. Technical Report TM, Oak Ridge National Laboratory, October.

William D. Gropp and Barry Smith. Chameleon parallel programming tools users manual. Technical Report ANL, Argonne National Laboratory, March.

O. Krämer and H. Mühlenbein. Mapping strategies in message-based multiprocessor systems. Parallel Computing.

nCUBE Corporation. nCUBE Programmer's Guide, December.

Parasoft Corporation, Pasadena, CA. Express User's Guide.

Paul Pierce. The NX/2 operating system. In Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, ACM Press.

A. Skjellum and A. Leung. Zipcode: a portable multicomputer communication library atop the reactive kernel. In D. W. Walker and Q. F. Stout, editors, Proceedings of the Fifth Distributed Memory Concurrent Computing Conference, IEEE Press.

A. Skjellum, S. Smith, C. Still, A. Leung, and M. Morari. The Zipcode message passing system. Technical report, Lawrence Livermore National Laboratory, September.

Anthony Skjellum, Nathan E. Doss, and Purushotham V. Bangalore. Writing Libraries in MPI. In Anthony Skjellum and Donna S. Reese, editors, Proceedings of the Scalable Parallel Libraries Conference, IEEE Computer Society Press, October.

Anthony Skjellum, Steven G. Smith, Nathan E. Doss, Alvin P. Leung, and Manfred Morari. The Design and Evolution of Zipcode. Parallel Computing, invited paper, to appear in special issue on message passing.

Anthony Skjellum, Steven G. Smith, Nathan E. Doss, Charles H. Still, Alvin P. Leung, and Manfred Morari. Zipcode: A Portable Communication Layer for High Performance Multicomputing. Technical Report UCRL-JC (revised), Lawrence Livermore National Laboratory, March. To appear in Concurrency: Practice and Experience.

D. Walker. Standards for message passing in a distributed memory environment. Technical Report TM, Oak Ridge National Laboratory, August.

Annex A

Language Binding

A.1 Introduction

In this section we summarize the specific bindings for both Fortran and C. We present first the C bindings, then the Fortran bindings. Listings are alphabetical within chapter.

A.2 Defined Constants for C and Fortran

These are required defined constants, to be defined in the files mpi.h (for C) and mpif.h (for Fortran).

/* return codes (both C and Fortran) */
MPI_SUCCESS, MPI_ERR_BUFFER, MPI_ERR_COUNT, MPI_ERR_TYPE, MPI_ERR_TAG,
MPI_ERR_COMM, MPI_ERR_RANK, MPI_ERR_REQUEST, MPI_ERR_ROOT, MPI_ERR_GROUP,
MPI_ERR_OP, MPI_ERR_TOPOLOGY, MPI_ERR_DIMS, MPI_ERR_ARG, MPI_ERR_UNKNOWN,
MPI_ERR_TRUNCATE, MPI_ERR_OTHER, MPI_ERR_INTERN, MPI_ERR_LASTCODE

/* assorted constants (both C and Fortran) */
MPI_BOTTOM, MPI_PROC_NULL, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_UNDEFINED,
MPI_UB, MPI_LB

/* status size and reserved index values (Fortran) */
MPI_STATUS_SIZE, MPI_SOURCE, MPI_TAG

/* error-handling specifiers (C and Fortran) */
MPI_ERRORS_ARE_FATAL, MPI_ERRORS_RETURN

/* maximum sizes for strings */
MPI_MAX_PROCESSOR_NAME, MPI_MAX_ERROR_STRING

/* elementary datatypes (C) */
MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_UNSIGNED_CHAR,
MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE,
MPI_LONG_DOUBLE, MPI_BYTE, MPI_PACKED

/* elementary datatypes (Fortran) */
MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_COMPLEX,
MPI_DOUBLE_COMPLEX, MPI_LOGICAL, MPI_CHARACTER, MPI_BYTE, MPI_PACKED

/* datatypes for reduction functions (C) */
MPI_FLOAT_INT, MPI_DOUBLE_INT, MPI_LONG_INT, MPI_2INT, MPI_SHORT_INT,
MPI_LONG_DOUBLE_INT

/* datatypes for reduction functions (Fortran) */
MPI_2REAL, MPI_2DOUBLE_PRECISION, MPI_2INTEGER, MPI_2COMPLEX

/* optional datatypes (Fortran) */
MPI_INTEGER1, MPI_INTEGER2, MPI_INTEGER4, MPI_REAL2, MPI_REAL4, MPI_REAL8

/* optional datatypes (C) */
MPI_LONG_LONG_INT

/* reserved communicators (C and Fortran) */
MPI_COMM_WORLD, MPI_COMM_SELF

/* results of communicator and group comparisons */
MPI_IDENT, MPI_CONGRUENT, MPI_SIMILAR, MPI_UNEQUAL

/* environmental inquiry keys (C and Fortran) */
MPI_TAG_UB, MPI_IO, MPI_HOST

/* collective operations (C and Fortran) */
MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_MAXLOC, MPI_MINLOC, MPI_BAND,
MPI_BOR, MPI_BXOR, MPI_LAND, MPI_LOR, MPI_LXOR

/* null handles */
MPI_GROUP_NULL, MPI_COMM_NULL, MPI_DATATYPE_NULL, MPI_REQUEST_NULL,
MPI_OP_NULL, MPI_ERRHANDLER_NULL

/* empty group */
MPI_GROUP_EMPTY

/* topologies (C and Fortran) */
MPI_GRAPH, MPI_CART

The following are defined C type definitions, also included in the file mpi.h.

/* opaque types (C) */
MPI_Aint
MPI_Status

/* handles to assorted structures (C) */
MPI_Group
MPI_Comm
MPI_Datatype
MPI_Request
MPI_Op

/* prototypes for user-defined functions (C) */
typedef int MPI_Copy_function(MPI_Comm oldcomm, MPI_Comm newcomm,
                              int keyval, void *extra_state)
typedef int MPI_Delete_function(MPI_Comm comm, int keyval,
                                void *extra_state)
typedef void (MPI_Handler_function)(MPI_Comm *, int *, ...)
typedef void MPI_User_function(void *invec, void *inoutvec, int *len,
                               MPI_Datatype *datatype)

For Fortran, here are examples of how each of the user-defined functions should be declared.

The user-function argument to MPI_OP_CREATE should be declared like this:

FUNCTION USER_FUNCTION(INVEC(*), INOUTVEC(*), LEN, TYPE)
    <type> INVEC(LEN), INOUTVEC(LEN)
    INTEGER LEN, TYPE

The copy-function argument to MPI_KEYVAL_CREATE should be declared like this:

FUNCTION COPY_FUNCTION(OLDCOMM, KEYVAL, EXTRA_STATE, ATTRIBUTE_VAL_IN,
                       ATTRIBUTE_VAL_OUT, FLAG)
    INTEGER OLDCOMM, KEYVAL, EXTRA_STATE, ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT
    LOGICAL FLAG

The delete-function argument to MPI_KEYVAL_CREATE should be declared like this:

FUNCTION DELETE_FUNCTION(COMM, KEYVAL, ATTRIBUTE_VAL, EXTRA_STATE)
    INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, EXTRA_STATE

A C Bindings for Point-to-Point Communication

These are presented here in the order of their appearance in the chapter.

int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)

int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Status *status)

int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)

int MPI_Bsend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)

int MPI_Ssend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)

int MPI_Rsend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)

int MPI_Buffer_attach(void* buffer, int size)

int MPI_Buffer_detach(void* buffer, int* size)

int MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Ibsend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Issend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Irsend(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Wait(MPI_Request *request, MPI_Status *status)

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

int MPI_Request_free(MPI_Request *request)

int MPI_Waitany(int count, MPI_Request *array_of_requests, int *index,
              MPI_Status *status)

int MPI_Testany(int count, MPI_Request *array_of_requests, int *index,
              int *flag, MPI_Status *status)

int MPI_Waitall(int count, MPI_Request *array_of_requests,
              MPI_Status *array_of_statuses)

int MPI_Testall(int count, MPI_Request *array_of_requests, int *flag,
              MPI_Status *array_of_statuses)

int MPI_Waitsome(int incount, MPI_Request *array_of_requests, int *outcount,
              int *array_of_indices, MPI_Status *array_of_statuses)

int MPI_Testsome(int incount, MPI_Request *array_of_requests, int *outcount,
              int *array_of_indices, MPI_Status *array_of_statuses)

int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag,
              MPI_Status *status)

int MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status)

int MPI_Cancel(MPI_Request *request)

int MPI_Test_cancelled(MPI_Status *status, int *flag)

int MPI_Send_init(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Bsend_init(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Ssend_init(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Rsend_init(void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Recv_init(void* buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Start(MPI_Request *request)

int MPI_Startall(int count, MPI_Request *array_of_requests)

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
              int dest, int sendtag, void *recvbuf, int recvcount,
              MPI_Datatype recvtype, int source, int recvtag,
              MPI_Comm comm, MPI_Status *status)
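As an illustration of these bindings, the following is a minimal sketch (not part of the standard) that sends an array of integers from process 0 to process 1, first with a blocking standard-mode send and then with a nonblocking receive completed by MPI_Wait; the program layout and the tag value 99 are arbitrary choices.

    #include "mpi.h"

    int main(int argc, char **argv)
    {
        int rank, data[4] = {0, 1, 2, 3};
        MPI_Status  status;
        MPI_Request request;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* blocking standard-mode send to process 1, tag 99 */
            MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* nonblocking receive, completed with MPI_Wait */
            MPI_Irecv(data, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, &request);
            MPI_Wait(&request, &status);
        }

        MPI_Finalize();
        return 0;
    }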

int MPI_Sendrecv_replace(void* buf, int count, MPI_Datatype datatype,
              int dest, int sendtag, int source, int recvtag, MPI_Comm comm,
              MPI_Status *status)

int MPI_Type_contiguous(int count, MPI_Datatype oldtype,
              MPI_Datatype *newtype)

int MPI_Type_vector(int count, int blocklength, int stride,
              MPI_Datatype oldtype, MPI_Datatype *newtype)

int MPI_Type_hvector(int count, int blocklength, MPI_Aint stride,
              MPI_Datatype oldtype, MPI_Datatype *newtype)

int MPI_Type_indexed(int count, int *array_of_blocklengths,
              int *array_of_displacements, MPI_Datatype oldtype,
              MPI_Datatype *newtype)

int MPI_Type_hindexed(int count, int *array_of_blocklengths,
              MPI_Aint *array_of_displacements, MPI_Datatype oldtype,
              MPI_Datatype *newtype)

int MPI_Type_struct(int count, int *array_of_blocklengths,
              MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types,
              MPI_Datatype *newtype)

int MPI_Address(void* location, MPI_Aint *address)

int MPI_Type_extent(MPI_Datatype datatype, int *extent)

int MPI_Type_size(MPI_Datatype datatype, int *size)

int MPI_Type_count(MPI_Datatype datatype, int *count)

int MPI_Type_lb(MPI_Datatype datatype, int *displacement)

int MPI_Type_ub(MPI_Datatype datatype, int *displacement)

int MPI_Type_commit(MPI_Datatype *datatype)

int MPI_Type_free(MPI_Datatype *datatype)

int MPI_Get_elements(MPI_Status *status, MPI_Datatype datatype, int *count)

int MPI_Pack(void* inbuf, int incount, MPI_Datatype datatype, void *outbuf,
              int outsize, int *position, MPI_Comm comm)

int MPI_Unpack(void* inbuf, int insize, int *position, void *outbuf,
              int outcount, MPI_Datatype datatype, MPI_Comm comm)

int MPI_Pack_size(int incount, MPI_Datatype datatype, MPI_Comm comm,
              int *size)
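As an illustrative sketch (not part of the standard), the datatype constructors above can describe one column of a C matrix stored in row-major order: MPI_Type_vector captures the strided layout, and the committed type can then be used in any send or receive. The function name send_column and the matrix dimensions are assumptions made for the example.

    #include "mpi.h"

    #define ROWS 4
    #define COLS 5

    void send_column(double a[ROWS][COLS], int col, int dest, MPI_Comm comm)
    {
        MPI_Datatype column;

        /* ROWS blocks of 1 double, separated by a stride of COLS doubles */
        MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* a count of 1 transfers one entire column, starting at a[0][col] */
        MPI_Send(&a[0][col], 1, column, dest, 0, comm);

        MPI_Type_free(&column);
    }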

A C Bindings for Collective Communication

int MPI_Barrier(MPI_Comm comm)

int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root,
              MPI_Comm comm)

int MPI_Gather(void* sendbuf, int sendcount, MPI_Datatype sendtype,
              void* recvbuf, int recvcount, MPI_Datatype recvtype, int root,
              MPI_Comm comm)

int MPI_Gatherv(void* sendbuf, int sendcount, MPI_Datatype sendtype,
              void* recvbuf, int *recvcounts, int *displs,
              MPI_Datatype recvtype, int root, MPI_Comm comm)

int MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype,
              void* recvbuf, int recvcount, MPI_Datatype recvtype, int root,
              MPI_Comm comm)

int MPI_Scatterv(void* sendbuf, int *sendcounts, int *displs,
              MPI_Datatype sendtype, void* recvbuf, int recvcount,
              MPI_Datatype recvtype, int root, MPI_Comm comm)

int MPI_Allgather(void* sendbuf, int sendcount, MPI_Datatype sendtype,
              void* recvbuf, int recvcount, MPI_Datatype recvtype,
              MPI_Comm comm)

int MPI_Allgatherv(void* sendbuf, int sendcount, MPI_Datatype sendtype,
              void* recvbuf, int *recvcounts, int *displs,
              MPI_Datatype recvtype, MPI_Comm comm)

int MPI_Alltoall(void* sendbuf, int sendcount, MPI_Datatype sendtype,
              void* recvbuf, int recvcount, MPI_Datatype recvtype,
              MPI_Comm comm)

int MPI_Alltoallv(void* sendbuf, int *sendcounts, int *sdispls,
              MPI_Datatype sendtype, void* recvbuf, int *recvcounts,
              int *rdispls, MPI_Datatype recvtype, MPI_Comm comm)

int MPI_Reduce(void* sendbuf, void* recvbuf, int count,
              MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

int MPI_Op_create(MPI_User_function *function, int commute, MPI_Op *op)

int MPI_Op_free(MPI_Op *op)

int MPI_Allreduce(void* sendbuf, void* recvbuf, int count,
              MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

int MPI_Reduce_scatter(void* sendbuf, void* recvbuf, int *recvcounts,
              MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

int MPI_Scan(void* sendbuf, void* recvbuf, int count,
              MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
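The following sketch (illustrative only, not part of the standard) shows two of these collectives used together: the root broadcasts a problem size to every process, and the partial results are then combined at the root with the built-in sum operation. The function name broadcast_and_sum and the stand-in computation are assumptions of the example.

    #include "mpi.h"

    void broadcast_and_sum(MPI_Comm comm)
    {
        int rank, n = 0;
        double partial, total;

        MPI_Comm_rank(comm, &rank);
        if (rank == 0)
            n = 1000;                 /* problem size chosen at the root */

        /* every process receives the value of n held by the root */
        MPI_Bcast(&n, 1, MPI_INT, 0, comm);

        partial = (double) rank * n;  /* stand-in for a local computation */

        /* combine the partial results at the root with the predefined sum op */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    }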

A C Bindings for Groups, Contexts, and Communicators

int MPI_Group_size(MPI_Group group, int *size)

int MPI_Group_rank(MPI_Group group, int *rank)

int MPI_Group_translate_ranks(MPI_Group group1, int n, int *ranks1,
              MPI_Group group2, int *ranks2)

int MPI_Group_compare(MPI_Group group1, MPI_Group group2, int *result)

int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)

int MPI_Group_union(MPI_Group group1, MPI_Group group2,
              MPI_Group *newgroup)

int MPI_Group_intersection(MPI_Group group1, MPI_Group group2,
              MPI_Group *newgroup)

int MPI_Group_difference(MPI_Group group1, MPI_Group group2,
              MPI_Group *newgroup)

int MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)

int MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)

int MPI_Group_range_incl(MPI_Group group, int n, int ranges[][3],
              MPI_Group *newgroup)

int MPI_Group_range_excl(MPI_Group group, int n, int ranges[][3],
              MPI_Group *newgroup)

int MPI_Group_free(MPI_Group *group)

int MPI_Comm_size(MPI_Comm comm, int *size)

int MPI_Comm_rank(MPI_Comm comm, int *rank)

int MPI_Comm_compare(MPI_Comm comm1, MPI_Comm comm2, int *result)

int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)

int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)

int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

int MPI_Comm_free(MPI_Comm *comm)

int MPI_Comm_test_inter(MPI_Comm comm, int *flag)

int MPI_Comm_remote_size(MPI_Comm comm, int *size)

int MPI_Comm_remote_group(MPI_Comm comm, MPI_Group *group)

int MPI_Intercomm_create(MPI_Comm local_comm, int local_leader,
              MPI_Comm peer_comm, int remote_leader, int tag,
              MPI_Comm *newintercomm)

int MPI_Intercomm_merge(MPI_Comm intercomm, int high,
              MPI_Comm *newintracomm)

int MPI_Keyval_create(MPI_Copy_function *copy_fn,
              MPI_Delete_function *delete_fn, int *keyval, void* extra_state)

int MPI_Keyval_free(int *keyval)

int MPI_Attr_put(MPI_Comm comm, int keyval, void* attribute_val)

int MPI_Attr_get(MPI_Comm comm, int keyval, void* attribute_val, int *flag)

int MPI_Attr_delete(MPI_Comm comm, int keyval)
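A common use of MPI_Comm_split is to partition the processes of a communicator into independent subgroups. The sketch below (illustrative only) splits a communicator into row communicators of a logical process grid; the function name split_into_rows and the cols parameter describing the grid width are assumptions of the example.

    #include "mpi.h"

    /* split the processes of comm into independent row communicators,
       assuming a logical grid that is `cols` processes wide */
    void split_into_rows(MPI_Comm comm, int cols, MPI_Comm *row_comm)
    {
        int rank, row, row_rank;

        MPI_Comm_rank(comm, &rank);
        row = rank / cols;        /* processes with equal color land together */

        MPI_Comm_split(comm, row, rank, row_comm);
        MPI_Comm_rank(*row_comm, &row_rank);
        /* ... use *row_comm, then release it with MPI_Comm_free ... */
    }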

A C Bindings for Process Topologies

int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods,
              int reorder, MPI_Comm *comm_cart)

int MPI_Dims_create(int nnodes, int ndims, int *dims)

int MPI_Graph_create(MPI_Comm comm_old, int nnodes, int *index, int *edges,
              int reorder, MPI_Comm *comm_graph)

int MPI_Topo_test(MPI_Comm comm, int *status)

int MPI_Graphdims_get(MPI_Comm comm, int *nnodes, int *nedges)

int MPI_Graph_get(MPI_Comm comm, int maxindex, int maxedges, int *index,
              int *edges)

int MPI_Cartdim_get(MPI_Comm comm, int *ndims)

int MPI_Cart_get(MPI_Comm comm, int maxdims, int *dims, int *periods,
              int *coords)

int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank)

int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)

int MPI_Graph_neighbors_count(MPI_Comm comm, int rank, int *nneighbors)

int MPI_Graph_neighbors(MPI_Comm comm, int rank, int maxneighbors,
              int *neighbors)

int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source,
              int *rank_dest)

int MPI_Cart_sub(MPI_Comm comm, int *remain_dims, MPI_Comm *newcomm)

int MPI_Cart_map(MPI_Comm comm, int ndims, int *dims, int *periods,
              int *newrank)

int MPI_Graph_map(MPI_Comm comm, int nnodes, int *index, int *edges,
              int *newrank)
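The topology functions are typically used together, as in the following sketch (illustrative only): MPI_Dims_create chooses a balanced two-dimensional decomposition, MPI_Cart_create builds a periodic Cartesian communicator, and MPI_Cart_shift returns the ranks of the neighbors along one dimension. The function name make_2d_grid is an assumption of the example.

    #include "mpi.h"

    void make_2d_grid(MPI_Comm comm, MPI_Comm *grid_comm)
    {
        int nprocs, dims[2] = {0, 0}, periods[2] = {1, 1};
        int left, right;

        MPI_Comm_size(comm, &nprocs);

        /* let MPI choose a balanced 2-D decomposition of nprocs processes */
        MPI_Dims_create(nprocs, 2, dims);

        /* periodic in both dimensions; allow rank reordering */
        MPI_Cart_create(comm, 2, dims, periods, 1, grid_comm);

        /* ranks of the neighbors one step away along dimension 0 */
        MPI_Cart_shift(*grid_comm, 0, 1, &left, &right);
    }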

A C Bindings for Environmental Inquiry

int MPI_Get_processor_name(char *name, int *resultlen)

int MPI_Errhandler_create(MPI_Handler_function *function,
              MPI_Errhandler *errhandler)

int MPI_Errhandler_set(MPI_Comm comm, MPI_Errhandler errhandler)

int MPI_Errhandler_get(MPI_Comm comm, MPI_Errhandler *errhandler)

int MPI_Errhandler_free(MPI_Errhandler *errhandler)

int MPI_Error_string(int errorcode, char *string, int *resultlen)

int MPI_Error_class(int errorcode, int *errorclass)

double MPI_Wtime(void)

double MPI_Wtick(void)

int MPI_Init(int *argc, char ***argv)

int MPI_Finalize(void)

int MPI_Initialized(int *flag)

int MPI_Abort(MPI_Comm comm, int errorcode)
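MPI_Wtime returns elapsed wall-clock time in seconds and MPI_Wtick the resolution of that clock, which makes timing a code region straightforward, as in this minimal sketch (illustrative only; the function name time_region is an assumption, and the call must occur between MPI_Init and MPI_Finalize).

    #include <stdio.h>
    #include "mpi.h"

    /* time an arbitrary code region on the calling process */
    void time_region(void)
    {
        double start, elapsed;

        start = MPI_Wtime();
        /* ... work to be timed ... */
        elapsed = MPI_Wtime() - start;

        /* MPI_Wtick gives the resolution of the timer, in seconds */
        printf("elapsed %f s (clock tick %g s)\n", elapsed, MPI_Wtick());
    }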

A C Bindings for Profiling

int MPI_Pcontrol(const int level, ...)
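The profiling interface lets a tool intercept any MPI call by providing its own definition and forwarding to the name-shifted PMPI_ entry point. The sketch below (illustrative only; the counter send_calls is an assumption of the example) counts calls to MPI_Send in this way.

    #include "mpi.h"

    static int send_calls = 0;   /* counter kept by the profiling layer */

    /* a profiling library supplies its own MPI_Send ... */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
                 int tag, MPI_Comm comm)
    {
        send_calls++;
        /* ... and forwards to the name-shifted entry point */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }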

A Fortran Bindings for Point-to-Point Communication

MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE),
    IERROR

MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERROR)
    INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR

MPI_BSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

MPI_RSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

MPI_BUFFER_ATTACH(BUFFER, SIZE, IERROR)
    <type> BUFFER(*)
    INTEGER SIZE, IERROR

MPI_BUFFER_DETACH(BUFFER, SIZE, IERROR)
    <type> BUFFER(*)
    INTEGER SIZE, IERROR

MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

MPI_IBSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

MPI_ISSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

MPI_IRSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

MPI_WAIT(REQUEST, STATUS, IERROR)
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

MPI_TEST(REQUEST, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

MPI_REQUEST_FREE(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

MPI_WAITANY(COUNT, ARRAY_OF_REQUESTS, INDEX, STATUS, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), INDEX, STATUS(MPI_STATUS_SIZE),
    IERROR

MPI_TESTANY(COUNT, ARRAY_OF_REQUESTS, INDEX, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), INDEX, STATUS(MPI_STATUS_SIZE),
    IERROR

MPI_WAITALL(COUNT, ARRAY_OF_REQUESTS, ARRAY_OF_STATUSES, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*),
    ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

MPI_TESTALL(COUNT, ARRAY_OF_REQUESTS, FLAG, ARRAY_OF_STATUSES, IERROR)
    LOGICAL FLAG
    INTEGER COUNT, ARRAY_OF_REQUESTS(*),
    ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

MPI_WAITSOME(INCOUNT, ARRAY_OF_REQUESTS, OUTCOUNT, ARRAY_OF_INDICES,
        ARRAY_OF_STATUSES, IERROR)
    INTEGER INCOUNT, ARRAY_OF_REQUESTS(*), OUTCOUNT, ARRAY_OF_INDICES(*),
    ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

MPI_TESTSOME(INCOUNT, ARRAY_OF_REQUESTS, OUTCOUNT, ARRAY_OF_INDICES,
        ARRAY_OF_STATUSES, IERROR)
    INTEGER INCOUNT, ARRAY_OF_REQUESTS(*), OUTCOUNT, ARRAY_OF_INDICES(*),
    ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

MPI_IPROBE(SOURCE, TAG, COMM, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

MPI_PROBE(SOURCE, TAG, COMM, STATUS, IERROR)
    INTEGER SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

MPI_CANCEL(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

MPI_TEST_CANCELLED(STATUS, FLAG, IERROR)
    LOGICAL FLAG
    INTEGER STATUS(MPI_STATUS_SIZE), IERROR

MPI_SEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

MPI_BSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

MPI_SSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

MPI_RSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

MPI_RECV_INIT(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

MPI_START(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

MPI_STARTALL(COUNT, ARRAY_OF_REQUESTS, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), IERROR

MPI_SENDRECV(SENDBUF, SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVBUF,
        RECVCOUNT, RECVTYPE, SOURCE, RECVTAG, COMM, STATUS, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVCOUNT, RECVTYPE,
    SOURCE, RECVTAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

MPI_SENDRECV_REPLACE(BUF, COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG,
        COMM, STATUS, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG, COMM,
    STATUS(MPI_STATUS_SIZE), IERROR

MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, OLDTYPE, NEWTYPE, IERROR

MPI_TYPE_VECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR

MPI_TYPE_HVECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR

MPI_TYPE_INDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS,
        OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*),
    OLDTYPE, NEWTYPE, IERROR

MPI_TYPE_HINDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS,
        OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*),
    OLDTYPE, NEWTYPE, IERROR

MPI_TYPE_STRUCT(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS,
        ARRAY_OF_TYPES, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*),
    ARRAY_OF_TYPES(*), NEWTYPE, IERROR

MPI_ADDRESS(LOCATION, ADDRESS, IERROR)
    <type> LOCATION(*)
    INTEGER ADDRESS, IERROR

MPI_TYPE_EXTENT(DATATYPE, EXTENT, IERROR)
    INTEGER DATATYPE, EXTENT, IERROR

MPI_TYPE_SIZE(DATATYPE, SIZE, IERROR)
    INTEGER DATATYPE, SIZE, IERROR

MPI_TYPE_COUNT(DATATYPE, COUNT, IERROR)
    INTEGER DATATYPE, COUNT, IERROR

MPI_TYPE_LB(DATATYPE, DISPLACEMENT, IERROR)
    INTEGER DATATYPE, DISPLACEMENT, IERROR

MPI_TYPE_UB(DATATYPE, DISPLACEMENT, IERROR)
    INTEGER DATATYPE, DISPLACEMENT, IERROR

MPI_TYPE_COMMIT(DATATYPE, IERROR)
    INTEGER DATATYPE, IERROR

MPI_TYPE_FREE(DATATYPE, IERROR)
    INTEGER DATATYPE, IERROR

MPI_GET_ELEMENTS(STATUS, DATATYPE, COUNT, IERROR)
    INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR

MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTCOUNT, POSITION, COMM,
        IERROR)
    <type> INBUF(*), OUTBUF(*)
    INTEGER INCOUNT, DATATYPE, OUTCOUNT, POSITION, COMM, IERROR

MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM,
        IERROR)
    <type> INBUF(*), OUTBUF(*)
    INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR

MPI_PACK_SIZE(INCOUNT, DATATYPE, COMM, SIZE, IERROR)
    INTEGER INCOUNT, DATATYPE, COMM, SIZE, IERROR

A Fortran Bindings for Collective Communication

MPI_BARRIER(COMM, IERROR)
    INTEGER COMM, IERROR

MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
        ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR

MPI_GATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS,
        RECVTYPE, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, ROOT,
    COMM, IERROR

MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
        ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR

MPI_SCATTERV(SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE, RECVBUF, RECVCOUNT,
        RECVTYPE, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNTS(*), DISPLS(*), SENDTYPE, RECVCOUNT, RECVTYPE, ROOT,
    COMM, IERROR

MPI_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
        COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR

MPI_ALLGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS,
        RECVTYPE, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, COMM,
    IERROR

MPI_ALLTOALL(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
        COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR

MPI_ALLTOALLV(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPE, RECVBUF, RECVCOUNTS,
        RDISPLS, RECVTYPE, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNTS(*), SDISPLS(*), SENDTYPE, RECVCOUNTS(*), RDISPLS(*),
    RECVTYPE, COMM, IERROR

MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, ROOT, COMM, IERROR

MPI_OP_CREATE(FUNCTION, COMMUTE, OP, IERROR)
    EXTERNAL FUNCTION
    LOGICAL COMMUTE
    INTEGER OP, IERROR

MPI_OP_FREE(OP, IERROR)
    INTEGER OP, IERROR

MPI_ALLREDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, IERROR

MPI_REDUCE_SCATTER(SENDBUF, RECVBUF, RECVCOUNTS, DATATYPE, OP, COMM,
        IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER RECVCOUNTS(*), DATATYPE, OP, COMM, IERROR

MPI_SCAN(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, IERROR

A Fortran Bindings for Groups, Contexts, etc.

MPI_GROUP_SIZE(GROUP, SIZE, IERROR)
    INTEGER GROUP, SIZE, IERROR

MPI_GROUP_RANK(GROUP, RANK, IERROR)
    INTEGER GROUP, RANK, IERROR

MPI_GROUP_TRANSLATE_RANKS(GROUP1, N, RANKS1, GROUP2, RANKS2, IERROR)
    INTEGER GROUP1, N, RANKS1(*), GROUP2, RANKS2(*), IERROR

MPI_GROUP_COMPARE(GROUP1, GROUP2, RESULT, IERROR)
    INTEGER GROUP1, GROUP2, RESULT, IERROR

MPI_COMM_GROUP(COMM, GROUP, IERROR)
    INTEGER COMM, GROUP, IERROR

MPI_GROUP_UNION(GROUP1, GROUP2, NEWGROUP, IERROR)
    INTEGER GROUP1, GROUP2, NEWGROUP, IERROR

MPI_GROUP_INTERSECTION(GROUP1, GROUP2, NEWGROUP, IERROR)
    INTEGER GROUP1, GROUP2, NEWGROUP, IERROR

MPI_GROUP_DIFFERENCE(GROUP1, GROUP2, NEWGROUP, IERROR)
    INTEGER GROUP1, GROUP2, NEWGROUP, IERROR

MPI_GROUP_INCL(GROUP, N, RANKS, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANKS(*), NEWGROUP, IERROR

MPI_GROUP_EXCL(GROUP, N, RANKS, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANKS(*), NEWGROUP, IERROR

MPI_GROUP_RANGE_INCL(GROUP, N, RANGES, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANGES(3,*), NEWGROUP, IERROR

MPI_GROUP_RANGE_EXCL(GROUP, N, RANGES, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANGES(3,*), NEWGROUP, IERROR

MPI_GROUP_FREE(GROUP, IERROR)
    INTEGER GROUP, IERROR

MPI_COMM_SIZE(COMM, SIZE, IERROR)
    INTEGER COMM, SIZE, IERROR

MPI_COMM_RANK(COMM, RANK, IERROR)
    INTEGER COMM, RANK, IERROR

MPI_COMM_COMPARE(COMM1, COMM2, RESULT, IERROR)
    INTEGER COMM1, COMM2, RESULT, IERROR

MPI_COMM_DUP(COMM, NEWCOMM, IERROR)
    INTEGER COMM, NEWCOMM, IERROR

MPI_COMM_CREATE(COMM, GROUP, NEWCOMM, IERROR)
    INTEGER COMM, GROUP, NEWCOMM, IERROR

MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERROR)
    INTEGER COMM, COLOR, KEY, NEWCOMM, IERROR

MPI_COMM_FREE(COMM, IERROR)
    INTEGER COMM, IERROR

MPI_COMM_TEST_INTER(COMM, FLAG, IERROR)
    INTEGER COMM, IERROR
    LOGICAL FLAG

MPI_COMM_REMOTE_SIZE(COMM, SIZE, IERROR)
    INTEGER COMM, SIZE, IERROR

MPI_COMM_REMOTE_GROUP(COMM, GROUP, IERROR)
    INTEGER COMM, GROUP, IERROR

MPI_INTERCOMM_CREATE(LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER,
        TAG, NEWINTERCOMM, IERROR)
    INTEGER LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER, TAG,
    NEWINTERCOMM, IERROR

MPI_INTERCOMM_MERGE(INTERCOMM, HIGH, INTRACOMM, IERROR)
    INTEGER INTERCOMM, INTRACOMM, IERROR
    LOGICAL HIGH

MPI_KEYVAL_CREATE(COPY_FN, DELETE_FN, KEYVAL, EXTRA_STATE, IERROR)
    EXTERNAL COPY_FN, DELETE_FN
    INTEGER KEYVAL, EXTRA_STATE, IERROR

MPI_KEYVAL_FREE(KEYVAL, IERROR)
    INTEGER KEYVAL, IERROR

MPI_ATTR_PUT(COMM, KEYVAL, ATTRIBUTE_VAL, IERROR)
    INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, IERROR

MPI_ATTR_GET(COMM, KEYVAL, ATTRIBUTE_VAL, FLAG, IERROR)
    INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, IERROR
    LOGICAL FLAG

MPI_ATTR_DELETE(COMM, KEYVAL, IERROR)
    INTEGER COMM, KEYVAL, IERROR

A Fortran Bindings for Process Topologies

MPI_CART_CREATE(COMM_OLD, NDIMS, DIMS, PERIODS, REORDER, COMM_CART, IERROR)
    INTEGER COMM_OLD, NDIMS, DIMS(*), COMM_CART, IERROR
    LOGICAL PERIODS(*), REORDER

MPI_DIMS_CREATE(NNODES, NDIMS, DIMS, IERROR)
    INTEGER NNODES, NDIMS, DIMS(*), IERROR

MPI_GRAPH_CREATE(COMM_OLD, NNODES, INDEX, EDGES, REORDER, COMM_GRAPH,
        IERROR)
    INTEGER COMM_OLD, NNODES, INDEX(*), EDGES(*), COMM_GRAPH, IERROR
    LOGICAL REORDER

MPI_TOPO_TEST(COMM, STATUS, IERROR)
    INTEGER COMM, STATUS, IERROR

MPI_GRAPHDIMS_GET(COMM, NNODES, NEDGES, IERROR)
    INTEGER COMM, NNODES, NEDGES, IERROR

MPI_GRAPH_GET(COMM, MAXINDEX, MAXEDGES, INDEX, EDGES, IERROR)
    INTEGER COMM, MAXINDEX, MAXEDGES, INDEX(*), EDGES(*), IERROR

MPI_CARTDIM_GET(COMM, NDIMS, IERROR)
    INTEGER COMM, NDIMS, IERROR

MPI_CART_GET(COMM, MAXDIMS, DIMS, PERIODS, COORDS, IERROR)
    INTEGER COMM, MAXDIMS, DIMS(*), COORDS(*), IERROR
    LOGICAL PERIODS(*)

MPI_CART_RANK(COMM, COORDS, RANK, IERROR)
    INTEGER COMM, COORDS(*), RANK, IERROR

MPI_CART_COORDS(COMM, RANK, MAXDIMS, COORDS, IERROR)
    INTEGER COMM, RANK, MAXDIMS, COORDS(*), IERROR

MPI_GRAPH_NEIGHBORS_COUNT(COMM, RANK, NNEIGHBORS, IERROR)
    INTEGER COMM, RANK, NNEIGHBORS, IERROR

MPI_GRAPH_NEIGHBORS(COMM, RANK, MAXNEIGHBORS, NEIGHBORS, IERROR)
    INTEGER COMM, RANK, MAXNEIGHBORS, NEIGHBORS(*), IERROR

MPI_CART_SHIFT(COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR)
    INTEGER COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR

MPI_CART_SUB(COMM, REMAIN_DIMS, NEWCOMM, IERROR)
    INTEGER COMM, NEWCOMM, IERROR
    LOGICAL REMAIN_DIMS(*)

MPI_CART_MAP(COMM, NDIMS, DIMS, PERIODS, NEWRANK, IERROR)
    INTEGER COMM, NDIMS, DIMS(*), NEWRANK, IERROR
    LOGICAL PERIODS(*)

MPI_GRAPH_MAP(COMM, NNODES, INDEX, EDGES, NEWRANK, IERROR)
    INTEGER COMM, NNODES, INDEX(*), EDGES(*), NEWRANK, IERROR

A Fortran Bindings for Environmental Inquiry

MPI_GET_PROCESSOR_NAME(NAME, RESULTLEN, IERROR)
    CHARACTER*(*) NAME
    INTEGER RESULTLEN, IERROR

MPI_ERRHANDLER_CREATE(FUNCTION, ERRHANDLER, IERROR)
    EXTERNAL FUNCTION
    INTEGER ERRHANDLER, IERROR

MPI_ERRHANDLER_SET(COMM, ERRHANDLER, IERROR)
    INTEGER COMM, ERRHANDLER, IERROR

MPI_ERRHANDLER_GET(COMM, ERRHANDLER, IERROR)
    INTEGER COMM, ERRHANDLER, IERROR

MPI_ERRHANDLER_FREE(ERRHANDLER, IERROR)
    INTEGER ERRHANDLER, IERROR

MPI_ERROR_STRING(ERRORCODE, STRING, RESULTLEN, IERROR)
    INTEGER ERRORCODE, RESULTLEN, IERROR
    CHARACTER*(*) STRING

MPI_ERROR_CLASS(ERRORCODE, ERRORCLASS, IERROR)
    INTEGER ERRORCODE, ERRORCLASS, IERROR

DOUBLE PRECISION MPI_WTIME()

DOUBLE PRECISION MPI_WTICK()

MPI_INIT(IERROR)
    INTEGER IERROR

MPI_FINALIZE(IERROR)
    INTEGER IERROR

MPI_INITIALIZED(FLAG, IERROR)
    LOGICAL FLAG
    INTEGER IERROR

MPI_ABORT(COMM, ERRORCODE, IERROR)
    INTEGER COMM, ERRORCODE, IERROR

A Fortran Bindings for Profiling

MPI_PCONTROL(LEVEL)
    INTEGER LEVEL

MPI Function Index

MPI_ABORT
MPI_ADDRESS
MPI_ALLGATHER
MPI_ALLGATHERV
MPI_ALLREDUCE
MPI_ALLTOALL
MPI_ALLTOALLV
MPI_ATTR_DELETE
MPI_ATTR_GET
MPI_ATTR_PUT
MPI_BARRIER
MPI_BCAST
MPI_BSEND
MPI_BSEND_INIT
MPI_BUFFER_ATTACH
MPI_BUFFER_DETACH
MPI_CANCEL
MPI_CART_COORDS
MPI_CART_CREATE
MPI_CART_GET
MPI_CART_MAP
MPI_CART_RANK
MPI_CART_SHIFT
MPI_CART_SUB
MPI_CARTDIM_GET
MPI_COMM_COMPARE
MPI_COMM_CREATE
MPI_COMM_DUP
MPI_COMM_FREE
MPI_COMM_GROUP
MPI_COMM_RANK
MPI_COMM_REMOTE_GROUP
MPI_COMM_REMOTE_SIZE
MPI_COMM_SIZE
MPI_COMM_SPLIT
MPI_COMM_TEST_INTER
MPI_DIMS_CREATE
MPI_ERRHANDLER_CREATE
MPI_ERRHANDLER_FREE
MPI_ERRHANDLER_GET
MPI_ERRHANDLER_SET
MPI_ERROR_CLASS
MPI_ERROR_STRING
MPI_FINALIZE
MPI_GATHER
MPI_GATHERV
MPI_GET_COUNT
MPI_GET_ELEMENTS
MPI_GET_PROCESSOR_NAME
MPI_GRAPH_CREATE
MPI_GRAPH_GET
MPI_GRAPH_MAP
MPI_GRAPH_NEIGHBORS
MPI_GRAPH_NEIGHBORS_COUNT
MPI_GRAPHDIMS_GET
MPI_GROUP_COMPARE
MPI_GROUP_DIFFERENCE
MPI_GROUP_EXCL
MPI_GROUP_FREE
MPI_GROUP_INCL
MPI_GROUP_INTERSECTION
MPI_GROUP_RANGE_EXCL
MPI_GROUP_RANGE_INCL
MPI_GROUP_RANK
MPI_GROUP_SIZE
MPI_GROUP_TRANSLATE_RANKS
MPI_GROUP_UNION
MPI_IBSEND
MPI_INIT
MPI_INITIALIZED
MPI_INTERCOMM_CREATE
MPI_INTERCOMM_MERGE
MPI_IPROBE
MPI_IRECV
MPI_IRSEND
MPI_ISEND
MPI_ISSEND
MPI_KEYVAL_CREATE
MPI_KEYVAL_FREE
MPI_OP_CREATE
MPI_OP_FREE
MPI_PACK
MPI_PACK_SIZE
MPI_PCONTROL
MPI_PROBE
MPI_RECV
MPI_RECV_INIT
MPI_REDUCE
MPI_REDUCE_SCATTER
MPI_REQUEST_FREE
MPI_RSEND
MPI_RSEND_INIT
MPI_SCAN
MPI_SCATTER
MPI_SCATTERV
MPI_SEND
MPI_SEND_INIT
MPI_SENDRECV
MPI_SENDRECV_REPLACE
MPI_SSEND
MPI_SSEND_INIT
MPI_START
MPI_STARTALL
MPI_TEST
MPI_TEST_CANCELLED
MPI_TESTALL
MPI_TESTANY
MPI_TESTSOME
MPI_TOPO_TEST
MPI_TYPE_COMMIT
MPI_TYPE_CONTIGUOUS
MPI_TYPE_COUNT
MPI_TYPE_EXTENT
MPI_TYPE_FREE
MPI_TYPE_HINDEXED
MPI_TYPE_HVECTOR
MPI_TYPE_INDEXED
MPI_TYPE_LB
MPI_TYPE_SIZE
MPI_TYPE_STRUCT
MPI_TYPE_UB
MPI_TYPE_VECTOR
MPI_UNPACK
MPI_WAIT
MPI_WAITALL
MPI_WAITANY
MPI_WAITSOME
MPI_WTICK
MPI_WTIME