
A High-Performance, Portable Implementation of
the MPI Message-Passing Interface Standard

William Gropp
Ewing Lusk
Mathematics and Computer Science Division
Argonne National Laboratory

Nathan Doss
Anthony Skjellum
Department of Computer Science
NSF Engineering Research Center for Computational Field Simulation
Mississippi State University

Abstract

MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.

Introduction

The message-passing model of parallel computation has emerged as an expressive, efficient, and well-understood paradigm for parallel programming. Until recently, the syntax and precise semantics of each message-passing library implementation were different from the others, although many of the general semantics were similar. The proliferation of message-passing library designs from both vendors and users was appropriate for a while, but eventually it was seen that enough consensus on requirements and general semantics for message passing had been reached that an attempt at standardization might usefully be undertaken.

(This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Computational and Technology Research, U.S. Department of Energy, under Contract W-31-109-Eng-38. The work at Mississippi State University was supported in part by the NSF Engineering Research Center for Computational Field Simulation; additional support came from the NSF Career Program under grant ASC and from ARPA under Order D.)

The process of creating a standard to enable portability of message-passing application codes began at a workshop on message-passing standardization in April 1992, and the Message Passing Interface (MPI) Forum organized itself at the Supercomputing '92 conference. During the next eighteen months the MPI Forum met regularly, and Version 1.0 of the MPI Standard was completed in May 1994. Some clarifications and refinements were made in the spring of 1995, and Version 1.1 of the MPI Standard is now available. For a detailed presentation of the Standard itself, the reader is referred to the Standard document; several tutorial treatments of MPI are also available. In this paper we assume that the reader is relatively familiar with the MPI specification, but we provide a brief overview in the next section.

The project to provide a portable implementation of MPI began at the same time as the MPI definition process itself. The idea was to provide early feedback on decisions being made by the MPI Forum and to provide an early implementation that would allow users to experiment with the definitions even as they were being developed. Targets for the implementation were to include all systems capable of supporting the message-passing model. MPICH is a freely available, complete implementation of the MPI specification, designed to be both portable and efficient. The "CH" in MPICH stands for "Chameleon," symbol of adaptability to one's environment and thus of portability. Chameleons are fast, and from the beginning a secondary goal was to give up as little efficiency as possible for the portability.

MPICH is thus both a research project and a software-development project. As a research project, its goal is to explore methods for narrowing the gap between the programmer of a parallel computer and the performance deliverable by its hardware. In MPICH, we adopt the constraint that the programming interface will be MPI, reject constraints on the architecture of the target machine, and retain high performance (measured in terms of bandwidth and latency for message-passing operations) as a goal. As a software project, MPICH's goal is to promote the adoption of the MPI Standard by providing users with a free, high-performance implementation on a diversity of platforms, while aiding vendors in providing their own customized implementations. The extent to which these goals have been achieved is the main thrust of this paper.

The rest of this paper is organized as follows. We begin with a short overview of MPI and a brief description of the precursor systems that influenced MPICH and enabled it to come into existence so quickly. We then document the extent of MPICH's portability and present the results of a number of performance measurements. Next we describe in some detail the software architecture of MPICH, which comprises the results of our research in combining portability and performance, and we present several specific aspects of the implementation that merit more detailed analysis. We go on to describe a family of supporting programs that surround the core MPI implementation and turn MPICH into a portable environment for developing parallel applications, and we explain how we, as a small, distributed group, have combined a number of freely available tools in the Unix environment to develop, distribute, and maintain MPICH with a minimum of resources. In the course of developing MPICH, we have learned a number of lessons from the challenges posed, both accidentally and deliberately, for MPI implementors by the MPI specification; these lessons are discussed near the end of the paper. Finally, we describe the current status of MPICH (as of February 1996) and outline our plans for future development.

Background

In this section we give an overview of MPI itself, briefly describe the systems on which the first versions of MPICH were built, and review the history of the development of the project.

Precursor Systems

MPICH came into being quickly because it could build on stable code from existing systems. These systems prefigured, in various ways, the portability, performance, and some of the other features of MPICH. Although most of that original code has been extensively reworked, MPICH still owes some of its design to those earlier systems, which we briefly describe here.

p4 is a third-generation parallel programming library, including both message-passing and shared-memory components, portable to a great many parallel computing environments, including heterogeneous networks. Although p4 contributed much of the code for TCP/IP networks and shared-memory multiprocessors for the early versions of MPICH, most of that has been rewritten. p4 remains one of the devices on which MPICH can be built (see the discussion of the channel interface below), but in most cases more customized alternatives are available.

Chameleon is a high-performance portability package for message passing on parallel supercomputers. It is implemented as a thin layer (mostly C macros) over vendor message-passing systems (Intel's NX, TMC's CMMD, IBM's MPL) for performance, and over publicly available systems (p4 and PVM) for portability. A substantial amount of Chameleon technology is incorporated into MPICH, as detailed in the discussion of the MPICH architecture below.

Zipcode is a portable system for writing scalable libraries. It contributed several concepts to the design of the MPI Standard, in particular contexts, groups, and mailers (the equivalent of MPI communicators). Zipcode also contains extensive collective operations with group scope, as well as virtual topologies, and this code was heavily borrowed from in the first version of MPICH.

Brief Overview of MPI

MPI is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation (such as a message-buffering and message-delivery progress requirement). MPI includes point-to-point message passing and collective (global) operations, all scoped to a user-specified group of processes. Furthermore, MPI provides abstractions for processes at two levels. First, processes are named according to the rank of the group in which the communication is being performed. Second, virtual topologies allow for graph or Cartesian naming of processes that helps relate the application semantics to the message-passing semantics in a convenient, efficient way. Communicators, which house groups and communication context (scoping) information, provide an important measure of safety that is necessary and useful for building up library-oriented parallel code.

MPI also provides three additional classes of services: environmental inquiry, basic timing information for application performance measurement, and a profiling interface for external performance monitoring. MPI makes heterogeneous data conversion a transparent part of its services by requiring datatype specification for all communication operations. Both built-in and user-defined datatypes are provided.

MPI accomplishes its functionality with opaque objects, with well-defined constructors and destructors, giving MPI an object-based look and feel. Opaque objects include groups (the fundamental container for processes), communicators (which contain groups and are used as arguments to communication calls), and request objects for asynchronous operations. User-defined and predefined datatypes allow for heterogeneous communication and elegant description of gather/scatter semantics in send/receive operations as well as in collective operations.

MPI provides support for both the SPMD and MPMD modes of parallel programming. Furthermore, MPI can support interapplication computations through intercommunicator operations, which support communication between groups rather than within a single group. Dataflow-style computations also can be constructed from intercommunicators. MPI provides a thread-safe application programming interface (API), which will be useful in multithreaded environments as implementations mature and support thread safety themselves.
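To make these concepts concrete, the following minimal C program is an illustrative sketch of typical MPI usage (it is not code from the MPICH distribution): it uses a communicator and the ranks of its group, a typed point-to-point message, and a collective operation.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, token = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank within the group */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* size of the group */

        if (rank == 0) {
            token = 42;
            if (size > 1)   /* point-to-point send of typed data */
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }
        else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }

        /* collective operation scoped to the same communicator */
        MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("Process %d of %d has token %d\n", rank, size, token);

        MPI_Finalize();
        return 0;
    }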

Development History of MPICH

At the organizational meeting of the MPI Forum at the Supercomputing '92 conference, Gropp and Lusk volunteered to develop an immediate implementation that would track the Standard definition as it evolved. The purpose was to expose quickly any problems that the specification might pose for implementors and to provide early experimenters with an opportunity to try ideas being proposed for MPI before they became fixed. The first version of MPICH in fact implemented the preliminary MPI specification within a few days of its appearance. The speed with which this version was completed was due to the existing portable systems p4 and Chameleon. This first MPICH, which offered quite reasonable performance and portability, is described in an earlier report.

Starting in the following spring, this implementation was gradually modified to provide increased performance and portability. At the same time, the system was greatly expanded to include all of the MPI specification; code for the collective operations, topologies, and attribute management was borrowed from Zipcode and tuned as the months went by.

What made this project unique was that we had committed to following the MPI specification as it developed, and it changed at every MPI Forum meeting. Most system implementors wait for a stable specification; the goals of this project dictated that in the short term we deliberately chose a constantly changing specification. The payoff came, of course, when the MPI Standard was released in May 1994: the MPICH implementation was complete, portable, fast, and available immediately. It is worthwhile to contrast this situation with what happened in the case of the High Performance Fortran (HPF) Standard. The HPF Forum, which started and finished a year before the MPI Forum, produced its standard specification in much the same way that the MPI Forum did. However, since implementation was left entirely to the vendors, who naturally waited until the specification was complete before beginning to invest implementation effort, HPF implementations are only now (February 1996) becoming available, whereas a large community has been using MPI for over a year.

For the past year, with the MPI Standard stable, MPICH has continued to evolve in several directions. First, the abstract device interface (ADI) architecture, described below and central to the performance of MPICH, has developed and stabilized. Second, individual vendors and others have begun taking advantage of this interface to develop their own highly specialized implementations of it; as a result, extremely efficient implementations of MPI exist on a greater variety of machines than we would have been able to tune MPICH for ourselves. In particular, Convex, Intel, SGI, and Meiko have produced implementations of the ADI that deliver excellent performance on their own machines while taking advantage of the portability of the great majority of the code in MPICH above the ADI layer. Third, the set of tools that form part of the MPICH parallel programming environment has been extended; these are described later in this paper.

Related Work

The publication of the MPI Standard provided many implementation groups with a clear specification, and several freely available, partially portable implementations have appeared. Like MPICH, their initial versions were built on existing portable message-passing systems. They differ from MPICH in that they focus on the workstation environment, where software performance is necessarily limited by Unix socket functionality. Some of these systems are as follows.

LAM is available from the Ohio Supercomputer Center and runs on heterogeneous networks of Sun, DEC, SGI, IBM, and HP workstations.

CHIMP/MPI is available from the Edinburgh Parallel Computing Centre and runs on Sun, SGI, DEC, IBM, and HP workstations, the Meiko Computing Surface machines, and the Fujitsu AP1000. It is based on CHIMP.

At the Technical University of Munich, research has been done on a system for checkpointing message-passing jobs, including MPI jobs.

Unify, available from Mississippi State University, layers MPI on a version of PVM that has been modified to support contexts and static groups. Unify allows mixed MPI and PVM calls in the same program.

Proprietary and platform-specific implementations provided by vendors are discussed later in this paper.

Portability and Performance

The challenge of the MPICH project is to combine both portability and performance. In this section we first survey the range of environments in which MPICH can be used and then present performance data for a representative sample of those environments.

Portability of MPICH

The MPI Standard itself addresses the message-passing model of parallel computation: processes with separate address spaces (like Unix processes) communicate with one another by sending and receiving messages. A number of different hardware platforms support such a model.

Exploiting High-Performance Switches

The most obvious hardware platform for MPI is a distributed-memory parallel supercomputer, in which each process can be run on a separate node of the machine and communication occurs over a high-performance switch of some kind. In this category are the Intel Paragon, IBM SP, Meiko CS-2, Thinking Machines CM-5, nCUBE, and Cray T3D. Although the Cray T3D provides some hardware that allows one to treat it as a shared-memory machine, it falls primarily into this category (see the T3D measurements below). Details of how MPICH is implemented on each of these machines are given in the discussion of the MPICH architecture, and performance results for the Paragon and SP are given later in this section.

Exploiting Shared-Memory Architectures

A number of architectures support a shared-memory programming model, in which a memory location can be both read and written by multiple processes. Although this is not part of MPI's computational model, an MPI implementation may take advantage of capabilities in this area offered by the hardware/software combination to provide particularly efficient message-passing operations. Current machines offering this model include the SGI Onyx, Challenge, Power Challenge, and Power Challenge Array machines, IBM SMPs (symmetric multiprocessors), the Convex Exemplar, and the Sequent Symmetry. MPICH is implemented using shared memory for efficiency on all of these machines (details are given in the architecture discussion below). Performance measurements for the SGI are given later in this section.

Exploiting Networks of Workstations

One of the most common parallel computing environments is a network of workstations. Many institutions use Ethernet-connected personal workstations as a free computational resource, and at many universities, laboratories equipped with Unix workstations provide both shared Unix services for students and an inexpensive parallel computing environment for instruction. In many cases the workstation collection includes machines from multiple vendors; interoperability is provided by the TCP/IP standard. MPICH runs on workstations from Sun (both SunOS and Solaris), DEC, Hewlett-Packard, SGI, and IBM. Recently, Intel- and Pentium-compatible PCs have been able to join the Unix workstation family by running one of the common free implementations of Unix, such as FreeBSD, NetBSD, or Linux. MPICH runs on all of these workstations and on heterogeneous collections of them. Details of how heterogeneity is handled are presented later in this paper, and some performance figures for Ethernet-connected workstations are given at the end of this section.

An important family of non-Unix operating systems is supported by Microsoft. MPICH has been ported to Windows, where it simulates multiprocessing on a single processor; the resulting system is called WinMPI.

Performance of MPICH

The MPI specification was designed to allow high performance, in the sense that semantic restrictions on optimization were avoided wherever user convenience would not be severely impacted. Furthermore, a number of features were added to enable users to take advantage of optimizations that some systems offered, without affecting portability to other systems that did not have such optimizations available. In MPICH we have tried to take advantage of those features in the Standard that allow for extra optimization, but we have not done so in every possible case.

Performance on one's own application is, of course, what counts most. Nonetheless, useful predictions of application performance can be made based on the results of specially constructed benchmark programs. In this section we first describe some of the difficulties that arise in benchmarking message-passing systems, then discuss the programs we have developed to address these difficulties, and finally present results from running the benchmarks on a representative sample of the environments supported by MPICH.

The MPICH implementation includes two MPI programs, mpptest and goptest, that provide reliable tests of the performance of an MPI implementation. The program mpptest provides testing of both point-to-point and collective operations on a specified number of processors; the program goptest can be used to study the scalability of collective routines as a function of the number of processors.

Performance Measurement Problems and Pitfalls

One common problem with simple performance measurement programs is that the results are different each time the program is run, even on the same system. A number of factors are responsible, ranging from assuming that the clock calls have no cost and infinite resolution to the effects of other jobs running on the same machine. A good performance test will give the same (to the clock's precision) answer each time. The mpptest and goptest programs distributed with MPICH compute the average time for a number of iterations of an operation (thus handling the cost and finite resolution of the clock) and then run the same test several times and take the minimum of those times (thus reducing the effects of other jobs). The programs can also provide information about the mean and worst-case performance.

More subtle are issues of which test to run. The simplest ping-pong test, which sends the same data using the same data buffer between two processes, allows data to reside entirely in the memory cache. In many real applications, however, neither buffer will already be mapped into cache, and this situation can affect the performance of the operation. Similarly, data transfers that are not properly aligned on word boundaries can be more expensive than those that are. MPI also has noncontiguous datatypes; the performance of an implementation with these datatypes may be significantly slower than for contiguous data. Another parameter is the number of processors used, even if only two are communicating. Certain implementations will include a latency cost proportional to the number of processors; this gives the best performance on the two-processor ping-pong test at the cost of possibly lower performance on real applications. Mpptest and goptest include tests to measure these effects.

Benchmarks for Point-to-Point Operations

In this section we present some of the simplest benchmarks for the performance of MPICH on various platforms. The performance test programs mpptest and goptest can produce a wealth of information; the script basetest, provided with the MPICH implementation, can be used to get a more complete picture of the behavior of a particular system. Here we present only the most basic data: short- and long-message performance.

For the short-message graphs, the only options used with mpptest are the auto and size options. The auto option tells mpptest to choose the sizes of the messages so as to reveal the exact message size at which there is any sudden change in behavior, for example at an internal packet-size boundary. The size option selects messages within a given range of sizes at a fixed increment. The short-message graphs give a good picture of the latency of message passing.

For the long-message graphs, a few more options are used, some of which make the test runs more efficient. The size range of messages is again set with the size option, which here selects larger messages sampled at regular intervals. These tests provide a picture of the best achievable bandwidth performance. More realistic tests can be performed by using the cachesize option to force the use of different data areas, overlap to overlap communication and computation, async for nonblocking communications, and vector for noncontiguous communication. Using givedy gives information on the range of performance, displaying both the mean and worst-case performance.

Performance of MPICH Compared with Native Vendor Systems

One question that can be asked about MPI is how its performance compares with proprietary vendor systems. Fortunately, the mpptest program was designed to work with many message-passing systems and can be built to call a vendor's system directly. In the figure below we compare MPI with Intel's NX message passing. The MPICH implementation for the Intel Paragon, while implemented with a special ADI, still relies on message-passing services provided by NX. Despite this fact, the MPI performance is quite good and can probably be improved with the second-generation ADI planned for a later release of MPICH. We use this as a representative example to demonstrate that the apparently elaborate structure shown in the architecture figures later in this paper does not impose serious performance overheads beyond those of the underlying vendor-specific message-passing layer.

Figure: MPICH vs. NX on the Paragon.

Paragon Measurements

The Intel Paragon has a classic distributed-memory architecture, with a cut-through-routed 2-D mesh topology. Latency and bandwidth performance are shown in the figure below. These measurements were taken while other users were on the system; this explains why the right side of the long-message plot is rougher than the curve in the NX comparison above, although the peak bandwidth shown is similar.

Figure: Short and long messages on the Paragon.

IBM SP Measurements

The IBM SP at Argonne National Laboratory has Power 1 nodes (the same as in the IBM SP1) and the SP2 high-performance switch; measurements on IBM SP systems with Power2 nodes (thin or wide) will be different. The latencies shown in the figure below reflect the slower speed of the Power 1 nodes. Note the obvious packet boundaries in the short-message plot.

Figure: Short and long messages on the IBM SP.

SGI Power Challenge Measurements

The SGI Power Challenge is a symmetric multiprocessor. The latency and bandwidth performance shown in the figure below are for the ch_shmem device, a generic shared-memory device supplied with the MPICH implementation.

Figure: Short and long messages on the SGI Power Challenge.

Cray T3D Measurements

The Cray T3D supports a shared-memory interface, the shmem library. For MPICH, this library is used to support MPI message-passing semantics. The latency and bandwidth performance are shown in the figure below.

Figure: Short and long messages on the Cray T3D.

Workstation Network Measurements

Workstation networks connected by simple Ethernet are common. The performance of MPICH for two Sun SPARCstations on a shared Ethernet is shown in the figure below.

Figure: Short and long messages on a workstation network.

Architecture of MPICH

In this section we describe in detail how the software architecture of MPICH supports the conflicting goals of portability and high performance. The design was guided by two principles. First, we wished to maximize the amount of code that can be shared without compromising performance. A large amount of the code in any implementation is system independent: the implementation of most of the MPI opaque objects, including datatypes, groups, attributes, and even communicators, is platform independent, and many of the complex communication operations can be expressed portably in terms of lower-level ones. Second, we wished to provide a structure whereby MPICH could be ported to a new platform quickly and then gradually tuned for that platform by replacing parts of the shared code with platform-specific code. As an example, we present below a case study showing how MPICH was quickly ported and then incrementally tuned for peak performance on SGI shared-memory systems.

The central mechanism for achieving the goals of portability and performance is a specification we call the abstract device interface (ADI). All MPI functions are implemented in terms of the macros and functions that make up the ADI, and all such code is portable. Hence, MPICH contains many implementations of the ADI, which provide portability, ease of implementation, and an incremental approach to trading portability for performance. One implementation of the ADI is in terms of a lower-level (yet still portable) interface we call the channel interface. The channel interface can be extremely small (five functions at minimum) and provides the quickest way to port MPICH to a new environment. Such a port can then be expanded gradually to include specialized implementations of more of the ADI functionality. The key architectural decisions in MPICH are those that relegate the implementation of various functions to the channel interface, the ADI, or the application programmer interface (API), which in our case is MPI.

The Abstract Device Interface

The design of the ADI is complex because we wish to allow for, but not require, a range of possible functions of the device. For example, the device may implement its own message-queuing and data-transfer functions. In addition, the specific environment in which the device operates can strongly affect the choice of implementation, particularly with regard to how data is transferred to and from the user's memory space. For example, if the device code runs in the user's address space, then it can easily copy data to and from the user's space. If it runs as part of the user's process (for example, as library routines on top of a simple hardware device), then the device and the API can easily communicate, calling each other to perform services. If, on the other hand, the device is operating as a separate process and requires a context switch to exchange data or requests, then switching between processes can be very expensive, and it becomes important to minimize the number of such exchanges by providing all information needed with a single call.

Although MPI is a relatively large specification, the device-dependent parts are small. By implementing MPI using the ADI, we were able to provide code that can be shared among many implementations; efficiency could be obtained by vendor-specific, proprietary implementations of the abstract device. For this approach to be successful, the semantics of the ADI must not preclude maximally efficient instantiations using modern message-passing hardware. While the ADI has been designed to provide a portable MPI implementation, nothing about this part of the design is specific to the MPI library; our definition of an abstract device can be used to implement any high-level message-passing library.

To help in understanding the design, it is useful to look at abstract devices for other operations, for example for graphical display or for printing. Most graphical displays provide for drawing a single pixel at an arbitrary location; any other graphical function can be built by using this single, elegant primitive. However, high-performance graphical displays offer a wide variety of additional functions, ranging from block copy and line drawing to 3-D surface shading. One approach for allowing an API (application programmer interface) to access the full power of the most sophisticated graphics devices, without sacrificing portability to less capable devices, is to define an abstract device with a rich set of functions and then provide software emulations of any functions not implemented by the graphics device. We use the same approach in defining our message-passing ADI.

A message-passing ADI must provide four sets of functions: specifying a message to be sent or received, moving data between the API and the message-passing hardware, managing lists of pending messages (both sent and received), and providing basic information about the execution environment (e.g., how many tasks there are). The MPICH ADI provides all of these functions; however, many message-passing hardware systems may not provide list management or elaborate data-transfer abilities. These functions are then emulated through the use of auxiliary routines.

The abstract device interface is a set of function definitions (which may be realized as either C functions or macro definitions) in terms of which the user-callable standard MPI functions may be expressed. As such, it provides the message-passing protocols that distinguish MPICH from other implementations of MPI. In particular, the ADI layer contains the code for packetizing messages and attaching header information, managing multiple buffering policies, matching posted receives with incoming messages (or queuing them if necessary), and handling heterogeneous communications. The exact interface and the algorithms used are described in separate design documents.

Figure: Upper layers of MPICH. Sample routines at each layer are shown on the left: MPI_Reduce and MPI_Isend at the MPI (point-to-point) layer, MPID_Post_Send at the abstract device interface, and MPID_SendControl at the channel interface; Meiko, T3D, SGI(3), and NX appear as implementations of the channel interface, and SGI(4) as a direct implementation of the ADI.

A diagram of the upper layers of MPICH, showing the ADI, is given in the figure above; sample functions at each layer are shown on the left. Without going into details of the algorithms present in the ADI, one can expect the existence of a routine like MPID_SendControl, which communicates control information. The implementation of such a routine can be in terms of a vendor's own existing message-passing system, new code written for the purpose, or a further portable layer, the channel interface.

The Channel Interface

At the lowest level, what is really needed is just a way to transfer data, possibly in small amounts, from one process's address space to another's. Although many implementations are possible, the specification can be done with a small number of definitions. The channel interface, which is described in more detail in a separate design document, consists of only five required functions. Three routines send and receive envelope (or control) information: MPID_SendControl, MPID_RecvAnyControl, and MPID_ControlMsgAvail; two routines send and receive data: MPID_SendChannel and MPID_RecvFromChannel. Other routines, which might be available in specially optimized implementations, are defined and used when certain macros are defined to signal that they are available; these include various forms of blocking and nonblocking operations for both envelopes and data. (One can use MPID_SendControlBlock instead of, or along with, MPID_SendControl; it can be more efficient to use the blocking version when implementing blocking calls.)

These operations are based on a simple capability to send data from one process to another process. No more functionality is required than what is provided by Unix in the select, read, and write operations. The ADI code uses these simple operations to provide the operations, such as MPID_Post_recv, that are used by the MPI implementation.
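For orientation, the five required routines can be summarized as C prototypes. Only the routine names below come from the channel interface as described above; the argument lists are our own illustrative assumptions, not MPICH's actual signatures, which are given in the channel interface documentation.

    /* Envelope (control) information */
    void MPID_SendControl(void *pkt, int size, int dest);
    void MPID_RecvAnyControl(void *pkt, int size, int *from);
    int  MPID_ControlMsgAvail(void);

    /* Message data */
    void MPID_SendChannel(void *buf, int size, int dest);
    void MPID_RecvFromChannel(void *buf, int maxsize, int from);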

The issue of buffering is a difficult one. We could have defined an interface that assumed no buffering, requiring the ADI code that calls this interface to perform the necessary buffer management and flow control. The rationale for not making this choice is that many of the systems used for implementing the channel interface do maintain their own internal buffers and flow control, and implementing another layer of buffer management would impose an unnecessary performance penalty.

The channel interface implements three different data-exchange mechanisms.

Eager. In the eager protocol, data is sent to the destination immediately. If the destination is not expecting the data (e.g., no MPI_Recv has yet been issued for it), the receiver must allocate some space to store the data locally. This choice often offers the highest performance, particularly when the underlying implementation provides suitable buffering and handshakes. However, it can cause problems when large amounts of data are sent before their matching receives are posted, causing memory to be exhausted on the receiving processors. This is the default choice in MPICH.

Rendezvous. In the rendezvous protocol, data is sent to the destination only when requested (the control information describing the message is always sent). When a receive is posted that matches the message, the destination sends the source a request for the data; in addition, it provides a way for the sender to return the data. This choice is the most robust but, depending on the underlying system software, may be less efficient than the eager protocol. Some legacy programs may fail when run using a rendezvous protocol if an exchange is unsafely expressed in terms of MPI_Send. Such a program can be safely expressed in terms of MPI_Bsend, but at a possible cost in efficiency: the user may desire the semantics of an eager protocol (messages are buffered on the receiver) with the performance of the rendezvous protocol (no copying), but since buffer space is exhaustible, and MPI_Bsend may have to copy, the user may not always be satisfied. MPICH can be configured to use this protocol by specifying the use_rndv option during configuration.

Get. In the get protocol, data is read directly by the receiver. This choice requires a method to transfer data directly from one process's memory to another; a typical implementation might use memcpy. This choice offers the highest performance but requires special hardware support, such as shared memory or remote-memory operations. In many ways it functions like the rendezvous protocol but uses a different set of routines to transfer the data.

To implement this protocol, special routines must be provided to prepare the address for remote access and to perform the transfer. The implementation of this protocol allows data to be transferred in several pieces, for example allowing arbitrarily sized messages to be transferred using a limited amount of shared memory. The routine MPID_SetupGetAddress is called by the sender to determine the address to send to the destination. In shared-memory systems this may simply be the address of the data (if all memory is visible to all processes) or the address in shared memory where all or some of the data has been copied. In systems with special hardware for moving data between processors, it may be the appropriate handle or object.

MPICH includes multiple implementations of the channel interface (see the figure below).

Figure: Lower layers of MPICH, showing implementations of the channel interface: Chameleon (over nx, mpl, Nexus, CMMD, and p4), the p2 shared-memory layer (used on SGI(1), Convex, and Solaris systems), a direct SGI(2) implementation, tcp, and SCI; p4 in turn runs over tcp on Sun, DEC, HP, IBM RS/6000, and SGI(0) workstations. Sample routines at each layer include send_cntl_pkt, PIsend, isend, p4_send, and write.

Chameleon. Perhaps the most significant implementation is the Chameleon version, which was particularly important during the initial phase of MPICH implementation. By implementing the channel interface in terms of Chameleon macros, we provide portability to a number of systems at one stroke, with no additional overhead, since Chameleon macros are resolved at compile time. Chameleon macros exist for most vendor message-passing systems and also for p4, which in turn is portable to very many systems. A newer implementation of the channel interface is a direct TCP/IP interface, not involving p4.

Shared memory. A completely different implementation of the channel interface has been done, portably, for a shared-memory abstraction in terms of a shared-memory malloc and locks. There are, in turn, multiple macro implementations of this shared-memory implementation of the channel interface; this is represented as the p2 box in the figure above.

Specialized. Some vendors (SGI and HP/Convex at present) have implemented the channel interface directly, without going through the shared-memory portability layer. This approach takes advantage of particular memory models and operating-system features that the shared-memory implementation of the channel interface does not assume are present.

SCI. A specialized implementation of the channel interface has been developed for an implementation of the Scalable Coherent Interface from Dolphin Interconnect Solutions, which provides portability to a number of systems that use it.

Contrary to some descriptions of MPICH that have appeared elsewhere, MPICH has never relied on the p4 version of the channel interface for portability to massively parallel processors. From the beginning, the MPP (IBM SP, Intel Paragon, TMC CM-5) versions used the macros provided by Chameleon. We rely on the p4 implementation only for the workstation networks, and a p4-independent version for TCP/IP will be available soon.

A Case Study

One of the benefits of a system architecture like that shown in the figures above is the flexibility provided in choosing where to insert vendor-specific optimizations. One illustration of how this flexibility was used is given by the evolution of the SGI versions of MPICH.

Since Chameleon had been ported to p4, and p4 had been ported to SGI workstations long before the MPICH project began, MPICH ran on SGI machines from the very beginning. This is the box shown as SGI(0) in the lower-layers figure. This implementation used TCP/IP sockets between workstations and standard Unix System V shared-memory operations for message passing within a multiprocessor like the SGI Onyx.

The SGI(1) box in the lower-layers figure illustrates an enhanced version, achieved by using a simple, portable shared-memory interface we call p2 ("half of p4"). In this version, shared-memory operations use special SGI operations for shared-memory functions instead of the less robust System V operations.

The next SGI version is a direct implementation of the channel interface that we did in collaboration with SGI. It uses SGI-specific mechanisms for memory sharing that allow single-copy data movement between processes (as opposed to copying into and out of an intermediate shared buffer), and it uses lock-free shared-queue management routines that take advantage of special assembler-language instructions of the MIPS microprocessor.

SGI next developed a direct implementation of the ADI that did not use the channel interface model (shown at the ADI level in the upper-layers figure) and then bypassed the ADI altogether to produce a very high-performance MPI implementation for the Power Challenge Array product, combining both shared-memory operations and message passing over the HiPPI connections between shared-memory clusters. Even at this specialized level, it retains much of the upper levels of MPICH that are implemented either independently of, or completely on top of, the message-passing layer, such as the collective operations and topology functions.

At all times, SGI users had access to a complete MPI implementation, and their programs did not need to change in any way as the implementation improved.

Selected Subsystems

A detailed description of all the design decisions that went into MPICH would be tedious. Here we focus on several of the salient features of this implementation that distinguish it from other implementations of MPI.

Groups

The basis of an MPICH process group is an ordered list of process identifiers, stored as an integer array. A process's rank in a group refers to its index in this list. Stored in the list is an address in a format the underlying device can use and understand; this is often the rank in MPI_COMM_WORLD, but need not be.

Communicators

The Standard describes two types of communicators, intracommunicators and intercommunicators, which consist of two basic components, namely, process groups and communication contexts. MPICH intracommunicators and intercommunicators use this same structure.

The Standard describes how intracommunicators and intercommunicators are related (see the discussion of intercommunicators in the Standard). We take advantage of this similarity to reduce the complexity of functions that operate on both intracommunicators and intercommunicators (e.g., communicator accessors and point-to-point operations); most functions in the portable layer of MPICH do not need to distinguish between an intracommunicator and an intercommunicator. For example, each communicator has a local group (local_group) and a remote group (group), as described in the definition of an intercommunicator. For intracommunicators these two groups are identical (reference counting is used to reduce the overhead associated with keeping two copies of a group); for intercommunicators these two groups are disjoint.

Another similarity between intracommunicators and intercommunicators is the use of contexts. Each communicator has a send context (send_context) and a receive context (recv_context). For intracommunicators these two contexts are equal; for intercommunicators they may be different. Regardless of the type of communicator, MPI point-to-point operations attach the send_context to all outgoing messages and use the recv_context when matching contexts upon receipt of a message.

For most MPICH devices, contexts are integers. Contexts for new communicators are allocated through a collective operation over the group of processes involved in the communicator construction. Through this collective operation, all processes involved agree on a context that is currently not in use by any of the processes. One of the algorithms used to allocate contexts involves passing the highest context currently used by a process to an MPI_Allreduce with the MPI_MAX operation, to find the smallest context (an integer) unused by any of the participants.
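A sketch of this context-allocation idea (not the actual MPICH source) looks roughly as follows; the per-process variable highest_context_in_use is an assumption made for the illustration.

    #include <mpi.h>

    /* Each process contributes the highest context id it currently uses;
       the maximum over the group, plus one, is a context id that no
       participant is using.  Because MPI_Allreduce returns the same value
       everywhere, all processes agree on the new context. */
    static int highest_context_in_use = 0;   /* assumed bookkeeping */

    int allocate_context(MPI_Comm comm)
    {
        int global_max;
        MPI_Allreduce(&highest_context_in_use, &global_max, 1, MPI_INT,
                      MPI_MAX, comm);
        highest_context_in_use = global_max + 1;
        return highest_context_in_use;
    }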

In order to provide safe point-to-point communications within a collective operation, an additional collective context is allocated for each communicator. This collective context is used during communicator construction to create a hidden communicator (comm_coll) that cannot be accessed directly by the user. This is necessary so that point-to-point operations used to implement a collective operation do not interfere with user-initiated point-to-point operations.

Other important elements of the communicator data structure include the following (a sketch of the whole structure appears below).

np, local_rank, lrank_to_grank: Used to provide more convenient access to local group information.

collops: An array of pointers to functions implementing the collective operations for the communicator (see the next subsection).
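Putting these pieces together, a communicator structure along the lines described in this section might look roughly as follows. The field names follow the text above; the declaration itself is an illustrative sketch, not the actual MPICH definition (MPIR_GROUP and MPIR_COLLOPS stand for the internal group and collective-operations structures).

    typedef struct MPIR_GROUP   MPIR_GROUP;     /* internal group type     */
    typedef struct MPIR_COLLOPS MPIR_COLLOPS;   /* see the next subsection */

    typedef struct MPIR_COMMUNICATOR {
        MPIR_GROUP   *local_group;    /* local group                         */
        MPIR_GROUP   *group;          /* remote group (== local_group for an
                                         intracommunicator)                  */
        int           send_context;   /* attached to outgoing messages       */
        int           recv_context;   /* used to match incoming messages     */
        struct MPIR_COMMUNICATOR *comm_coll;  /* hidden communicator for
                                                 collective operations       */
        int           np;             /* number of processes                 */
        int           local_rank;     /* rank of this process in local group */
        int          *lrank_to_grank; /* local-to-global rank translation    */
        MPIR_COLLOPS *collops;        /* table of collective implementations */
        int           ref_count;      /* reference count                     */
    } MPIR_COMMUNICATOR;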

Collective Operations

As noted in the preceding section, MPICH collective operations are implemented on top of MPICH point-to-point operations. MPICH collective operations retrieve the hidden communicator from the communicator passed in the argument list and then use standard MPI point-to-point calls with this hidden communicator. We use straightforward power-of-two-based algorithms to provide scalability; however, considerable opportunities for further optimization remain.

Although the basic implementation of MPICH collective operations uses point-to-point operations, special versions of MPICH collective operations exist, including both vendor-supplied and shared-memory versions. In order to allow the use of these special versions on a communicator-by-communicator basis, each communicator contains a list of function pointers that point to the functions implementing the collectives for that particular communicator. Each such structure contains a reference count, so that communicators can share the same list of pointers:

    typedef struct MPIR_COLLOPS {
        int (*Barrier) (MPI_Comm comm);
        int (*Bcast) (void *buffer, int count, MPI_Datatype datatype,
                      int root, MPI_Comm comm);
        /* ... other function pointers ... */
        int ref_count;    /* So we can share it */
    } MPIR_COLLOPS;

Each MPI collective operation checks the validity of the input arguments and then forwards the function arguments to the dereferenced function for the particular communicator. This approach allows vendors to substitute system-specific implementations for all or some of the collective routines. Currently, Meiko, Intel, and Convex have provided vendor-specific collective implementations. These implementations follow system-specific strategies; for example, the Convex SPP collective routines make use both of shared memory and of the memory hierarchies in the SPP.
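As an illustration of this dispatch (a sketch only, not the actual MPICH source), an MPI-level broadcast reduces to a forwarded call through the table shown above; lookup_collops is a hypothetical accessor for the communicator's table.

    extern MPIR_COLLOPS *lookup_collops(MPI_Comm comm);   /* hypothetical */

    int illustrate_bcast(void *buffer, int count, MPI_Datatype datatype,
                         int root, MPI_Comm comm)
    {
        /* After the MPI-level routine has validated its arguments, it simply
           forwards them to the function installed for this communicator. */
        MPIR_COLLOPS *ops = lookup_collops(comm);
        return ops->Bcast(buffer, count, datatype, root, comm);
    }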

Attributes

Attribute caching on communicators is implemented by using a height-balanced tree (HBT, or AVL tree). Each communicator has an HBT associated with it, although initially the HBT may be an empty or null tree. Caching an attribute on a communicator is simply an insertion into the HBT; retrieving an attribute is simply searching the tree and returning the cached attribute.

MPI keyvals are created by passing the attribute's copy function and destructor, as well as any extra state needed, to the keyval constructor. Pointers to these are kept in the keyval structure that is passed to attribute functions. Additional elements of a keyval include a flag denoting whether C or Fortran calling conventions are to be used for the copy function; the attribute input argument to the copy function is passed by value in C and by reference in Fortran.

Caching on other types of MPI handles is being considered for inclusion in the MPI Standard. The MPICH HBT implementation of caching can be used almost exactly as is for implementing caching on other types of MPI handles, by simply adding an HBT to the other types of handles.
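From the user's point of view, this machinery is exercised through the standard MPI-1 caching calls; a minimal example follows (a sketch, not MPICH internals). Each MPI_Attr_put below results in an insertion into the communicator's HBT, and each MPI_Attr_get in a search of that tree.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int keyval, flag;
        static int my_value = 17;   /* attribute value to cache */
        int *retrieved;

        MPI_Init(&argc, &argv);

        MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &keyval, NULL);
        MPI_Attr_put(MPI_COMM_WORLD, keyval, &my_value);          /* cache   */
        MPI_Attr_get(MPI_COMM_WORLD, keyval, &retrieved, &flag);  /* look up */
        if (flag)
            printf("Cached attribute value: %d\n", *retrieved);

        MPI_Keyval_free(&keyval);
        MPI_Finalize();
        return 0;
    }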

Topologies

Support for topologies is layered on the communicator attribute mechanism. Because of this design, the code implementing topologies is almost entirely portable, even to other MPI implementations. For communicators with associated topology information, the communicator's cache contains a structure describing the topology, either a Cartesian topology or a graph topology. The MPI topology functions access the cached topology information as needed, using standard MPI calls, and then use this information to perform the requested operation.
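For example, the user-level topology calls below (a small, hedged example, not MPICH internals) create a 2 x 2 periodic Cartesian communicator and query the calling process's coordinates; the topology bookkeeping behind these calls is held in the communicator's attribute cache as described above. Run it with at least four processes.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int dims[2] = {2, 2}, periods[2] = {1, 1}, coords[2], rank;
        MPI_Comm cart;

        MPI_Init(&argc, &argv);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
        if (cart != MPI_COMM_NULL) {   /* extra processes get MPI_COMM_NULL */
            MPI_Comm_rank(cart, &rank);
            MPI_Cart_coords(cart, rank, 2, coords);
            printf("Rank %d has Cartesian coordinates (%d,%d)\n",
                   rank, coords[0], coords[1]);
            MPI_Comm_free(&cart);
        }
        MPI_Finalize();
        return 0;
    }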

The Profiling Interface

The MPI Forum wished to promote the development of tools for understanding program behavior but considered it premature to standardize any specific tool interface. The MPI specification provides instead a general mechanism for intercepting calls to MPI functions. Thus both end users and tool developers can develop portable performance analyzers and other tools without access to the MPI implementation source code. The only requirement is that every MPI function be callable, in both C and Fortran, by an alternate name, PMPI_Xxxx, as well as the usual MPI_Xxxx. In some environments (those supporting weak symbols), the additional entry points can be supplied in the source code. In MPICH we take the less elegant but more portable approach of building a duplicate MPI library in which all functions are known by their PMPI names; of course, only one copy of the source code is maintained. Users can interpose their own profiling wrappers for MPI functions by linking with their own wrappers, the standard version of the MPI library, and the profiling version of the MPI library, in the proper order. MPICH also supplies a number of prebuilt profiling libraries; these are described later in this paper.
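As a small example of this interception mechanism, the user-level wrapper below counts calls to MPI_Send and forwards each call to the real routine under its PMPI name; it works with any MPI implementation that provides the PMPI_ entry points, not only MPICH. (This is a sketch; a real tool would also report the count, for example from a wrapper for MPI_Finalize.)

    #include <mpi.h>

    static int send_count = 0;   /* user-defined instrumentation */

    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        send_count++;
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }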

The Fortran Interface

MPI is a language-independent specification with separate language bindings; the MPI Standard specifies a C and a Fortran binding. Since these bindings are quite similar, we decided to implement MPI in C, with the Fortran implementation simply calling the C routines. This strategy requires some care, however, because some C routines take arguments by value, while all Fortran routines take arguments by reference. In addition, the MPICH implementation uses pointers for the MPI opaque objects, such as MPI_Request and MPI_Comm; Fortran has no native pointer datatype, and the MPI Standard uses the Fortran INTEGER type for these objects. Rather than manually create each interface routine, we used a program that had been developed at Argonne for just this purpose.

The program bfort reads the C source file and uses structured comments to identify routines for which to generate interfaces. Special options allow it to handle opaque types, choose how to handle C pointers, and provide name mapping. In many cases this was all that was necessary to create the Fortran interfaces. In cases where routine-specific code was needed (for example, in MPI_Waitsome, where zero-origin indexing is used in C and one-origin indexing in Fortran), the automatically generated code was a good base for the custom code. Using the automatic tool also simplifies updating all of the interfaces when a system with a previously unknown Fortran/C interface is encountered. This situation arose the first time we ported MPICH to a system that used the program f2c to provide a Fortran compiler; f2c generates unusual external names for Fortran routines, and we needed only to rerun bfort to update the Fortran interfaces. This interface layer handles the issues of pointer conversion between C and Fortran as well as the mapping of Fortran external names to C external names. The determination of the name format (e.g., whether Fortran externals are upper or lower case and whether they have underscore characters appended to them) is handled by our configure program, which compiles a test program with the user's selected Fortran compiler and extracts the external name from the generated object file. This allows us to handle different Fortran compilers and options on the same platform.
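The flavor of the generated wrappers is suggested by the following sketch (not bfort's actual output). It assumes, for illustration, that the Fortran compiler maps MPI_COMM_SIZE to the external name mpi_comm_size_ (lower case with one trailing underscore), that Fortran INTEGER corresponds to C int, and that fortran_to_c_comm is a hypothetical helper that translates a Fortran INTEGER handle into the pointer-based MPI_Comm used internally.

    #include <mpi.h>

    static MPI_Comm fortran_to_c_comm(int handle)
    {
        /* Placeholder translation for the sketch; a real implementation
           would consult a handle table. */
        return (handle == 0) ? MPI_COMM_WORLD : MPI_COMM_NULL;
    }

    void mpi_comm_size_(int *comm, int *size, int *ierr)
    {
        /* Every Fortran argument arrives by reference; dereference the
           handle and forward to the C implementation, returning the
           error code through ierr. */
        *ierr = MPI_Comm_size(fortran_to_c_comm(*comm), size);
    }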

Job Startup

The MPI Forum did not standardize the mechanism for starting jobs. This decision was entirely appropriate; by way of comparison, the Fortran standard does not specify how to start Fortran programs. Nonetheless, the extreme diversity of the environments in which MPICH runs, and the diversity of job-starting mechanisms in those environments (special commands like prun, poe, or mexec; settings of various environment variables; or special command-line arguments to the program being started), suggested to us that we should encapsulate the knowledge of how to run a job on various machines in a single command. We named it mpirun. In all environments, an MPI program, say myprog, can be run with, say, four processes by issuing the command

    mpirun -np 4 myprog

Note that this might not be the only way to start a program, and additional arguments might usefully be passed to both mpirun and myprog (see the discussion of command-line arguments below), but the mpirun command will always work, even if the starting of a job requires complex interaction with a resource manager. For example, at Argonne we use a home-grown scheduler called EASY instead of IBM's LoadLeveler to start jobs on our IBM SP; the interaction with EASY is encapsulated in mpirun.

A number of other MPI implementations and environments have also decided to use the name mpirun to start MPI jobs. The MPI Forum is discussing whether this command can be at least partially standardized for MPI-2.

Building MPICH

An important component of MPICH's portability is the ability to build it in the same way in many different environments. We rely on the existence of a Bourne shell (sh or a superset) and a Unix-style make on the user's machine. The sh script that the user runs is constructed by GNU's autoconf, which we need in our development environment but which the user does not need. At least a vanilla version of MPICH can be built in any of MPICH's target environments by going to the top-level directory of the distribution and issuing the commands

    configure
    make

The configure script determines aspects of the environment (such as the location of certain include files), performs tests of the environment to ensure that all components required for the correct compilation and execution of MPICH programs are present, and constructs the appropriate Makefiles in many directories so that the make command will build MPICH. After being built and tested, MPICH can be installed in a publicly available location such as /usr/local with make install. Painless building and installation has become one of our pet goals for MPICH.

Documentation

MPICH comes with both an installation guide and a user's guide. Although there is some overlap, and therefore some duplication, we consider separating them to be a better approach than combining them. Although many users obtain and use MPICH just for their own use, an increasing number of them are linking their own programs to a system-wide copy of the libraries that has been installed in a publicly accessible place. For such users, the information in the installation guide is a distraction. Conversely, the user's guide contains a collection of helpful hints for users who may be experiencing difficulties getting applications to run; these difficulties might well never be encountered by systems administrators who merely install MPICH.

An important but frequently overlooked part of a software project, particularly for research software, is the generation of documentation, especially Unix-style man pages. (This is not to say that the format of man pages cannot be improved; rather, every Unix user knows how to get information this way and rightly expects man pages to be provided.) We use a tool called doctext that generates man pages, as well as WWW and LaTeX documentation, directly from simple structured comments in the source code. Using this tool allowed us to deliver MPICH with complete documentation from the beginning. Examples of the documentation can be accessed on the WWW at http://www.mcs.anl.gov/mpi/www/index.html.

Toward a Portable Parallel Programming Environment

Although MPI specifies a standard library interface, and therefore describes what a portable parallel program will look like, it says nothing about the environment in which the program will run. MPICH is a portable implementation of the MPI Standard, but it also attempts to provide more for programmers. We have already discussed mpirun, which provides a portable way to run programs. In this section we describe briefly some of the other tools provided in MPICH along with the basic MPI implementation.

The MPE Extension Library

MPE (Multi-Processing Environment) is a loosely structured library of routines designed to be handy for the parallel programmer in an MPI environment. That is, most of the MPE functions assume the presence of some implementation of MPI, but not necessarily MPICH. MPE routines fall into several categories.

Parallel X graphics. There are routines to provide all processes with access to a shared X display. These routines are easier to use than the corresponding native Xlib routines and make it quite convenient to provide graphical output for parallel programs. Routines are provided to set up the display (probably the hardest part) and to draw text, rectangles, circles, lines, and so forth on it. It is not the case that the various processes communicate with one process that draws on the display; rather, the display is shared by all the processes. This library is described in a separate document.

Logging. One of the most common tools for analyzing parallel program performance is a time-stamped event trace file. The MPE library provides simple calls to produce such a file. It uses MPI calls to obtain the timestamps and to merge separate log files together at the end of a job. It also automatically handles the misalignment and drift of clocks on multiple processors if the system does not provide a synchronized clock. The logfile format is that of upshot [ ]. This is the library for a user who wishes to define his own events and program states; automatic generation of events by MPI routines is described below under Profiling Libraries.

Sequential Sections. Sometimes a section of code that is executed on a set of processes must be executed by only one process at a time, in rank order. The MPE library provides functions to ensure that this type of execution occurs.




Error Handling. The MPI specification provides a mechanism whereby a user can control how the implementation responds to runtime errors, including the ability to install one's own error handler. One error handler that we found convenient for developing MPICH starts the dbx debugger in a popup xterm when an error is encountered. Thus the user can examine the stack trace and the values of program variables at the time of the error. To obtain this behavior, the user must

1. compile and link with the -g option, as usual when using dbx, and then either

   (a) link with the MPE library and call

       MPI_Errhandler_set( comm, MPE_Errors_call_dbx_in_xterm );

       early in the program (a sketch follows this list), or

   (b) pass the -mpedbg argument to mpirun (if MPICH was configured with -mpedbg).
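As an illustration, option (a) amounts to something like the following minimal sketch. It assumes an mpe.h header that declares the handler named above and that the program is linked with the MPE library; the details vary with the MPICH installation.

    #include "mpi.h"
    #include "mpe.h"    /* assumed to declare MPE_Errors_call_dbx_in_xterm */

    int main( int argc, char *argv[] )
    {
        MPI_Init( &argc, &argv );

        /* Install the MPE handler named in the text above: on any MPI error,
           dbx is started in a popup xterm so that the stack trace and program
           variables can be examined at the point of failure. */
        MPI_Errhandler_set( MPI_COMM_WORLD, MPE_Errors_call_dbx_in_xterm );

        /* ... normal application code; any MPI error now triggers dbx ... */

        MPI_Finalize();
        return 0;
    }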

Command-Line Arguments and Standard I/O

The MPI standard says little about command-line arguments to programs, other than that in C they are to be passed to MPI_Init, which removes the command-line arguments it recognizes. MPICH ensures that the command-line arguments returned from MPI_Init are the same on all processes, thus relieving the user of the necessity of broadcasting the command-line arguments to the rest of the processes from whichever process actually was passed them as arguments to main.

The MPI Standard also says little about I/O, other than that if at least one process has access to stdin, stdout, and stderr, the user can find out which process this is by querying the attribute MPI_IO on MPI_COMM_WORLD. In MPICH, all processes have access to stdin, stdout, and stderr, and on networks these I/O streams are routed back to the process with rank 0 in MPI_COMM_WORLD. On most systems these streams can also be redirected through mpirun, as follows:

    mpirun -np <n> myprog -myarg < data.in > results.out

Here we assume that -myarg represents command-line arguments processed by the application myprog. After MPI_Init, each process will have these arguments in its argv. (This is an MPICH feature, not an MPI requirement.) On batch systems, where stdin may not be available, one can use an argument to mpirun as follows:

    mpirun -np <n> -stdin data.in myprog -myarg > results.out

The latter form may always be used.
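To make the MPI_IO query mentioned above concrete, a process can discover who has I/O capability roughly as follows. This is a minimal sketch; the values MPI_ANY_SOURCE and MPI_PROC_NULL are those MPI-1 defines for this attribute, and under MPICH the answer is that every process can perform I/O.

    #include <stdio.h>
    #include "mpi.h"

    int main( int argc, char *argv[] )
    {
        int rank, flag;
        int *io_rank;   /* predefined attributes are returned as a pointer to int */

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );

        /* Query the predefined MPI_IO attribute on MPI_COMM_WORLD. */
        MPI_Attr_get( MPI_COMM_WORLD, MPI_IO, &io_rank, &flag );

        if (!flag)
            printf( "[%d] MPI_IO attribute is not set\n", rank );
        else if (*io_rank == MPI_ANY_SOURCE)
            printf( "[%d] every process can perform I/O\n", rank );
        else if (*io_rank == MPI_PROC_NULL)
            printf( "[%d] no process can perform I/O\n", rank );
        else
            printf( "[%d] process %d can perform I/O\n", rank, *io_rank );

        MPI_Finalize();
        return 0;
    }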

Support for Performance Analysis and Debugging

The MPI profiling interface allows the convenient construction of portable tools that rely on intercepting calls to the MPI library. Such tools are ultra-portable in the sense that they can be used with any MPI implementation, not just a specific portable MPI implementation.

Profiling Libraries

The MPI specification makes it possible (but not particularly convenient) for users to build their own profiling libraries, which intercept all MPI library calls. MPICH comes with three profiling libraries already constructed; we have found them useful in debugging and in performance analysis.

tracing. The tracing library simply prints (on stdout) a trace of each MPI library call. Each line is identified with its process number (rank in MPI_COMM_WORLD). Since stdout from all processes is collected, even on a network of workstations, all output comes out on the console. A sample is shown here:

    Starting MPI_Bcast
    Starting MPI_Bcast
    Ending   MPI_Bcast
    Starting MPI_Bcast
    Ending   MPI_Bcast
    Ending   MPI_Bcast

logging. The logging library uses the mpe logging routines (described above in the section on the MPE extension library) to write a logfile with events for entry to and exit from each MPI function. Then upshot (see the next subsection) can be used to display the computation; its colored bars show the frequency and duration of each MPI call. See Figure [ ].

animation. The animation library uses the mpe graphics routines to provide a simple animation of the message passing that occurs in an application, via a shared X display.

Further description of these libraries can be found in [ ].

Upshot

One of the most useful tools for understanding parallel program behavior is a graphical display of parallel timelines, with colored bars to indicate the state of each process at any given time. A number of tools developed by various groups do this. One of the earliest of these was upshot [ ]. Since then, upshot has been reimplemented in Tcl/Tk, and this version is distributed with MPICH. It can read logfiles generated either by ParaGraph [ ] or by the mpe logging routines, which are in turn used by the logging profiling library.

A sample screen dump is shown in Figure [ ].

[Figure: Upshot output, showing timelines for processes 0 through 5 over roughly the interval 10.22 to 10.51 seconds, with colored bars for the states Bcast, Compute, Recv, and Send.]

Support for Adding New Profiling Libraries

The most obvious way to use the profiling library is to choose some family of calls to intercept and then treat each of them in a special way. Typically one performs some action (adds to a counter, prints a message, writes a log record), calls the real MPI function using its alternate name PMPI_Xxxx, perhaps performs another action (e.g., writes another log record), and then returns to the application, propagating the return code from the PMPI routine.

MPICH includes a utility called wrappergen that lets a user specify templates for profiling routines and a list of routines to create, and then automatically creates the profiling versions of the specified routines. Thus the work required of a user to add a new profiling library is reduced to writing individual MPI_Init and MPI_Finalize routines and one template routine. The libraries described above are all produced in this way. Details of how to use wrappergen can be found in [ ].
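To make the interception pattern concrete, here is a minimal hand-written wrapper of the kind wrappergen generates from a template. This is a sketch only; the counter and the printed message are illustrative and are not part of any MPICH library.

    #include <stdio.h>
    #include "mpi.h"

    static int send_count = 0;   /* example statistic gathered by the wrapper */

    /* Intercept MPI_Send: perform an action, call the real routine through its
       PMPI_ name, and propagate its return code, as described above. */
    int MPI_Send( void *buf, int count, MPI_Datatype datatype,
                  int dest, int tag, MPI_Comm comm )
    {
        int err;
        send_count++;                                  /* action before the call */
        err = PMPI_Send( buf, count, datatype, dest, tag, comm );
        /* an action after the call could go here, e.g., writing a log record */
        return err;                                    /* propagate return code */
    }

    /* Wrap MPI_Finalize to report the statistic at the end of the run. */
    int MPI_Finalize( void )
    {
        int rank;
        PMPI_Comm_rank( MPI_COMM_WORLD, &rank );
        printf( "[%d] MPI_Send was called %d times\n", rank, send_count );
        return PMPI_Finalize();
    }

Linking such a file ahead of the MPI library causes the wrappers to be used, while the PMPI_ entry points still reach the real implementation.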

Useful Commands

Aspects of the environment required for correct compilation and linking are encapsulated in the Makefiles produced when the user runs configure. Users may set up a Makefile for their own applications by copying one from an MPICH examples directory and modifying it as needed. The resultant Makefile may not be portable, but this may not be a primary consideration.

An even easier, and more portable, way to build a simple application, and one that fits within existing complex Makefiles, is to use the commands mpicc and mpif77, constructed in the MPICH bin directory by configure. These scripts are used like the usual commands to invoke the C and Fortran compilers and the linker. Extra arguments to these commands link with the designated versions of the profiling libraries. For example,

    mpicc -c myprog.c

compiles a C program, automatically finding the include files and libraries that were configured when MPICH was installed. The command

    mpif77 -mpilog -o myprog myprog.f

compiles and links a Fortran program that, when run, will produce a logfile that can be examined with upshot. The command

    mpicc -mpitrace -o myprog myprog.c

compiles and links a C program that displays a trace of its execution on stdout.

The mpirun command has already been mentioned. It has more flexibility than we have described so far. In particular, in heterogeneous environments the command

    mpirun -arch sun4 -np 4 -arch rs6000 -np 3 myprog

starts myprog on four Suns and three RS/6000s, where the specific hosts have been stored in MPICH's machines file.

Special arguments for the application program can be used to make MPICH provide helpful debugging information. For example,

    mpirun -np <n> myprog -mpedbg -mpiqueue

automatically installs the error handler described earlier (the one that starts dbx on errors) and displays all message queues when MPI_Finalize is called. This latter option is useful in locating lost messages.

Details on all of these commands can be found in the users guide [ ].

Network Management Tools

Although not strictly part of MPICH itself, the Scalable Unix Tools (SUT) are a useful part of the MPICH programming environment on workstation clusters. Basically, SUT implements parallel versions of common Unix commands such as ls, ps, cp, or rm. Perhaps the most useful is a cross between find and ps that we call pfps (parallel find in the process space). For example, one can find and send a KILL signal to runaway jobs on a workstation network during a debugging session with

    pfps -all -tn myprog -kill KILL

or locate all of one's own jobs on the network that have been running for more than an hour with

    pfps -all -o me -and -rtime -print

Graphical displays also show the load on each workstation and can help one choose the subcollection of machines on which to run an MPICH job. Details can be found in [ ].

Example Programs

MPICH comes with a fairly rich collection of example programs to illustrate its features. In addition to the extensive test suite and benchmark programs, there are example programs for Mandelbrot computations, solving the Mastermind puzzle, and the game of Life that illustrate the use of the mpe library in an entertaining way. A number of simple examples illustrate specific features of the MPI Standard (topologies, for example) and have been developed for use in classes and tutorials. Many of the examples from [ ] are included. For all of these examples, configure prepares the appropriate Makefiles, but they have to be individually built as the user wishes. One example is a moderately large, complete nuclear physics Monte Carlo integration application in Fortran.

Software Management Techniques and Tools

MPICH was written by a small, distributed team sharing the workload. We had the expected problems of coordinating both development and maintenance of a moderately large body of C and Fortran code. We have worked to distribute new releases in an orderly fashion, track and respond to bug reports, and maintain contact with a growing body of users. In doing so, we have used existing tools, engineered some of our own, and developed procedures that have served us well. In this section we report on our experiences, in the hope that some of our tools and methods will be useful to other system developers. All software described here is freely available, either from well-known sources or included in MPICH.

Configuring for Different Systems

We have tried, as a sort of pet goal, to make building MPICH completely painless, despite the variety of target environments. This is a challenge. In earlier systems such as p4, Chameleon, and Zipcode, it was assumed that a particular vendor name or operating system version was enough to determine how to build the system. This is too simplistic a view:

- The same hardware may run multiple operating systems (Solaris or SunOS on Suns; Linux or FreeBSD on x86s).
- Different versions of the same operating system may differ radically (SGI IRIX 6 is 64-bit, whereas IRIX 5 is 32-bit; the number of parameters to some system calls in Solaris depends on the minor version number).
- Different compilers may use different include files, datatype sizes, and libraries.

In addition, it is rare that a system is completely free of bugs; in particular, since we distribute source code, it is imperative that the C compiler produce correct object code. In distributing MPICH, we found that many users did not have correctly functioning C compilers. It is best to detect this problem at configure time.

We use the GNU autoconf system to build a shell script, configure, which in turn executes various commands (including building and running programs) to determine the user's environment. It then creates Makefiles from Makefile templates, as well as creating some scripts that contain site-specific information, such as the location of the wish interpreter.

The autoconf system as distributed provides commands for checking the parts of a system that the GNU tools need; in MPICH we have defined an additional set of operations that we found we needed in this and other projects. These include commands to test that the compiler produces correct code and to choose a vendor's compiler with the correct options (this is particularly important for the massively parallel systems). In short, the configure script distributed with MPICH has evolved into a knowledge base about a wide variety of vendor environments.

Source Code Management

To allow us all to work on the code without interfering with one another, we used RCS, via the Emacs VC interface.

Testing

Testing is often a lengthy and boring process. MPICH contains a test suite that attempts to test the implementation of each MPI routine. While not as systematic as some commercial testing systems (which can dwarf the size of the original package), our test suite has proven invaluable to us in identifying problems before new releases. Many of the test programs originated as bug-demonstration programs sent to us by our users. In MPICH we automate the testing process through the use of scripts that build, test, and generate a document that summarizes the tests, including the configuration, correctness, and performance results. The testing system is provided as part of the distribution.

Tracking and Responding to Problem Reports

We realized early on that simply leaving bug reports in our e-mail would not work. We needed a system that would allow all of the developers to keep track of reports, including what had been done, dialog with the problem submitter, answers, etc. It also had to be simple for users to use and for us to install. We chose the req system [ ]. This system allows users to send mail, without any format restrictions, to mpi-bugs@mcs.anl.gov; the bug report is kept in a separate system as well as being forwarded to a list of developers. Both GUI and command-line access to the bug reports is provided.

Over time, it became clear that some problems were much more common than others. We developed a database of common problems, searchable by keyword, which is also integrated into the manual. When a user sends in a bug report, we can query the database for a standard response to the problem. For example, if a user complains about getting the message "Try Again", the command

    helpbinfmsg "Try Again"

gives the information from the users guide on the message "Try Again" (which comes not from MPICH but from rshd).

Announcements about new releases are sent to a mailing list (managed by majordomo) to which users are encouraged to subscribe when they first run the configure script.

Preparing a New Release

Preparing a new release for a package as portable as MPICH requires testing on a wide variety of platforms. To help test a new release of MPICH, we use several programs and scripts that build and test the release on a new platform. The docport program, included in the MPICH distribution, performs a build (checking for errors from the make), followed by both performance and correctness tests. The output from this program is a PostScript file that describes the results of the build and tests; this PostScript file is then made available on the WWW in a table at http://www.mcs.anl.gov/mpi/mpich/porting/port<version number>.html. Another program is then used to do an installation and to check that both MPICH and the other tools, such as upshot and mpicc, work correctly. For networks of workstations, additional tests (also managed by a separate program) exercise heterogeneous collections of workstations. By automating much of the testing, we ensure that the testing is reasonably complete and that the most glaring oversights are caught before a release goes out. Unfortunately, because the space of possible tests is so large, these programs and scripts have been built primarily by testing for past mistakes.

Lessons Learned

One of the purposes of doing an early implementation was to understand the implications of decisions made during the development of the Standard. As expected, the implementation process and early experiences of users shed light on the consequences of choices made at MPI Forum meetings.

Language Bindings

One of the earliest lessons learned had to do with the language bindings and the choices of C datatypes for some items in the MPI Standard. For example, the 1.0 version passed the MPI_Status structure itself, rather than a pointer to the structure, to the routines MPI_Get_count, MPI_Get_elements, and MPI_Test_cancelled. The C bindings used int in some places where an int might not be large enough to hold the result; most of these (except for MPI_Type_size) were changed to MPI_Aint.

In the MPI_Keyval_create function, the predefined null functions MPI_NULL_COPY_FN and MPI_NULL_DELETE_FN were originally both MPI_NULL_FN; unfortunately, neither of these is exactly a null function (both have mandatory return values, and the copy function also sets a flag). Experience with the implementation helped the MPI Forum to repair these problems in the 1.1 version of the MPI Standard.
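As a concrete illustration of the repaired interface, attribute caching with the predefined null callbacks looks roughly like this. This is a small sketch; the key and the value 42 are purely illustrative.

    #include <stdio.h>
    #include "mpi.h"

    int main( int argc, char *argv[] )
    {
        int keyval, flag;
        static int my_value = 42;   /* kept static so the cached pointer stays valid */
        int *result;

        MPI_Init( &argc, &argv );

        /* Create a key using the predefined null copy/delete functions
           (MPI_NULL_COPY_FN and MPI_NULL_DELETE_FN, as defined in MPI 1.1). */
        MPI_Keyval_create( MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &keyval, (void *)0 );

        /* Cache a value on MPI_COMM_WORLD and read it back. */
        MPI_Attr_put( MPI_COMM_WORLD, keyval, &my_value );
        MPI_Attr_get( MPI_COMM_WORLD, keyval, &result, &flag );
        if (flag)
            printf( "attribute value = %d\n", *result );

        MPI_Keyval_free( &keyval );
        MPI_Finalize();
        return 0;
    }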

A related issue was the desire of the Forum to make the same attribute copy and delete functions usable from both C and Fortran; for this reason, addresses were used in the standard for some items in C that were more naturally values. Unfortunately, when the size of the C int datatype is different from that of the Fortran INTEGER datatype, this approach does not work. In a surprise move, the MPI Forum exploited this in the 1.1 Standard, changing the bindings of the functions in C to use values instead of addresses.

Another issue is the Fortran bindings of all of the routines that take buffers. These buffers can be of any Fortran datatype (e.g., INTEGER, REAL, or CHARACTER). This was common practice in most previous message-passing systems but is in violation of the Fortran standard. The MPI Forum voted to follow standard practice. In most cases, Fortran compilers pass all items by reference, and few complain when a routine is called with different datatypes. Unfortunately, several standard-conforming Fortran implementations use a different representation for CHARACTER data than for numeric data, and in these cases it is difficult to build an MPI implementation that works with CHARACTER data in Fortran. The MPI Forum is attempting to address this problem in the MPI-2 proposal.

The MPI Forum provided a way for users to interrogate the environment to find out, for example, the largest valid message tag. This was done in an elegant fashion by using attributes, a general mechanism for users to attach information to a communicator. The system attributes are attached to the initial MPI_COMM_WORLD communicator. The problem is that, in general, users need to set as well as get attributes. Some users did in fact try to set MPI_TAG_UB; MPICH now detects this as an illegal operation, and the MPI Forum clarified this point in the 1.1 Standard.

Performance

One of the goals of MPI was to define the semantics of the message-passing operations so that no unnecessary data motion was required. The MPICH implementation has shown this goal to be achievable. On two different shared-memory systems, MPICH achieves a single copy directly from user buffer to user buffer. In both cases the operating system had to be modified slightly to allow a process to directly access the address space of another process. On some other systems, two vendors were able to achieve the same result by providing vendor-specific implementations of the ADI.

In actual use, some users have noticed performance irregularities; these indicate areas where more work needs to be done in implementations. For example, the implementation of MPI_Bsend in MPICH always copies data into the user-provided buffer; for small messages such copying is not always necessary, since it may be possible to deliver the message without blocking. This can have a significant effect on latency-sensitive calculations. Different methods for handling short, intermediate, and long messages are also needed and are under development.

Another source of some performance difficulties is seemingly innocuous requirements that affect the lowest levels of the implementation. For example, the following is legal in MPI:

    MPI_Isend( ..., &request );
    MPI_Request_free( &request );

The user need not (must not) actually use a wait or test on the request. This functionality can be complex to implement when well-separated software layers are used in the MPI implementation. In particular, it requires either that the completion of the operation started by the MPI_Isend change data maintained by the MPI implementation or that the MPI implementation periodically check to see whether some request has completed. The problem with this functionality is that it may not match well with the services that are implementing the actual data transport and can be the source of unanticipated latency.
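The pattern in full, as a user might write it (a sketch with a hypothetical helper name; the point is that no wait or test ever touches the request):

    #include "mpi.h"

    /* "Fire and forget": start a send and free the request immediately.
       Legal in MPI, but the implementation must now detect completion on
       its own, as discussed above.  (Hypothetical helper, not MPICH code.) */
    void send_and_forget( double *buf, int n, int dest, MPI_Comm comm )
    {
        MPI_Request request;

        MPI_Isend( buf, n, MPI_DOUBLE, dest, /* tag */ 0, comm, &request );
        MPI_Request_free( &request );
        /* No MPI_Wait or MPI_Test is ever called on request.  The caller must
           not modify or free buf until it knows by some other means (e.g., a
           reply message) that the data has been delivered. */
    }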

Despite these problems, the MPICH implementation does achieve its goal of high performance and portability. In particular, the use of a carefully layered design, where the layers can be implemented as macros or removed entirely (as one vendor has done), was key to the success of MPICH.

Resource Limits

The MPI specification is careful not to constrain implementations with specific resource guarantees. For many uses, programmers can work within the limits of any reasonable implementation. However, many existing message-passing systems provide some (usually unspecified) amount of buffering for messages sent but not yet received. This allows a user to send messages without worrying about the process blocking while waiting for the destination to receive them, or about waiting on nonblocking send operations. The problem with this approach is that if the system is responsible for managing the buffer space, user programs can fail in mysterious ways. A better approach is to allow the user to specify the amount of buffering desired. The MPI Forum, recognizing this need, added routines with user-provided buffer space: MPI_Bsend, MPI_Buffer_attach, and MPI_Buffer_detach (and nonblocking versions). These routines specify that all of the space needed by the MPI implementation can be found in the user-provided buffer, including the space used to manage the user's messages. Unfortunately, this made it impossible for users to determine how big a buffer they needed to provide, since there was no way to know how much space the MPI implementation needed to manage each message. The MPI Forum added MPI_BSEND_OVERHEAD to provide this information in the 1.1 version of the Standard.
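With MPI_BSEND_OVERHEAD, a user can size the attached buffer deterministically. A minimal sketch for a single buffered send of n doubles follows (hypothetical helper; real code would normally attach one buffer for many sends):

    #include <stdlib.h>
    #include "mpi.h"

    /* Attach a buffer just large enough for one buffered send of n doubles,
       send, then detach.  MPI_BSEND_OVERHEAD accounts for the bookkeeping
       space the implementation needs per message, as described above. */
    void buffered_send( double *data, int n, int dest, MPI_Comm comm )
    {
        void *buf, *oldbuf;
        int   size, oldsize;

        MPI_Pack_size( n, MPI_DOUBLE, comm, &size );
        size += MPI_BSEND_OVERHEAD;
        buf = malloc( size );

        MPI_Buffer_attach( buf, size );
        MPI_Bsend( data, n, MPI_DOUBLE, dest, /* tag */ 0, comm );
        MPI_Buffer_detach( &oldbuf, &oldsize );  /* completes pending buffered sends */
        free( oldbuf );
    }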

One remaining problem that some users are now experiencing is the limit on the number of outstanding MPI_Requests that are allowed. Currently, no a priori way exists to determine or provide the number of allowed requests.

Heterogeneity and Interoperability

Packed data needs to be sent with a "packed data" bit; this means that datatypes need to know whether any part of the datatype is MPI_PACKED. The only other option is to always use the same format (for example, network byte order), at the cost of maximum performance. By using byte swapping together with data extension (e.g., 32-bit to and from 64-bit integers), most systems can be handled. In some cases only floating point requires special treatment; in these cases XDR may be used where IEEE format is not guaranteed.
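As a small illustration of the kind of conversion involved, the following sketch swaps the bytes of a 32-bit integer and sign-extends it to 64 bits. It is illustrative only; MPICH's internal conversion code is more general and is driven by the datatype description.

    #include <stdint.h>

    /* Byte-swap a 32-bit integer and widen it to 64 bits, the sort of
       byte swapping with data extension described above for exchanging
       data between little- and big-endian hosts of different word sizes. */
    static int64_t swap_and_extend32( uint32_t x )
    {
        uint32_t swapped = ((x & 0x000000FFu) << 24) |
                           ((x & 0x0000FF00u) <<  8) |
                           ((x & 0x00FF0000u) >>  8) |
                           ((x & 0xFF000000u) >> 24);
        return (int64_t)(int32_t)swapped;   /* sign-extend to 64 bits */
    }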

The MPI specification provides MPI_Pack and MPI_Unpack functions; unfortunately, these are not the functions that are needed to implement the point-to-point operations. The reason is that these functions produce data that can be sent to anyone in a communicator, including the sender, whereas when sending to a single other process there is more freedom in choosing the data representation. (Strangely, the MPI Forum considered but did not accept the functions needed for packing and unpacking data sent between two specific processes; this decision may have been because there was less experience with heterogeneous environments.) The MPICH implementation uses internal versions of MPI_Pack and MPI_Unpack that work with data intended either for a specific process or for all members of a communicator.

64-Bit Issues

The development of MPICH coincided with the arrival of a number of 64-bit systems. Many programmers, remembering the problems of moving code from 16- to 32-bit platforms, expressed concern over the problem of porting applications to the 64-bit systems. Our experience with MPICH was that, with some care in using C properly (void * and not int for addresses, for example), there was little problem in porting MPICH from 32- to 64-bit systems. In fact, with the exception discussed below, MPICH has no special code for 32- or 64-bit systems.

The exception is in the Fortran/C interface, and this requires an understanding of the rules of the Fortran standard. While C makes few statements about the lengths of datatypes (for example, sizeof(int) and sizeof(float) are unrelated), Fortran defines the ratios of the sizes of the numeric datatypes. Specifically, the sizes of INTEGER and REAL data are the same and are half the size of DOUBLE PRECISION. This is important in Fortran, where there is no memory allocation in the language and programmers often have to reuse data areas for different types of data. Further, using 64-bit IEEE floating point for DOUBLE PRECISION requires that INTEGER be 32 bits. This is true even if sizeof(int) in C is 64 bits.

In the Fortran/C interface, this problem appears when we look at the representation of MPI opaque objects. In MPICH they are pointers; if these are 64 bits in size, then they cannot be stored in a Fortran INTEGER. (If opaque objects were ints, it would not help much; we would still need to convert from a 64-bit to a 32-bit integer.) Thus, on systems where addresses are 64 bits and Fortran INTEGERs are shorter, something must be done. The MPICH implementation handles this problem by translating the C pointers to and from small Fortran integers, which represent the index in a table that holds the pointer. This translation is inserted automatically into the Fortran interface code by the Fortran interface generator bfort, discussed earlier.
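The translation itself is straightforward; the following sketch conveys the idea (it is not MPICH's actual code, and it omits the free-list handling and error checking a real interface layer needs):

    #define MAX_HANDLES 1024

    /* Table mapping small integer indices (storable in a Fortran INTEGER)
       to C pointers such as MPI opaque-object handles. */
    static void *handle_table[MAX_HANDLES];
    static int   next_free = 1;            /* 0 reserved as a "null" index */

    /* Called by the Fortran wrapper when a C routine returns a handle. */
    static int pointer_to_index( void *ptr )
    {
        int idx = next_free++;             /* real code reuses freed slots */
        handle_table[idx] = ptr;
        return idx;                        /* fits in a 32-bit Fortran INTEGER */
    }

    /* Called by the Fortran wrapper before passing a handle back to C. */
    static void *index_to_pointer( int idx )
    {
        return handle_table[idx];
    }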

Another problem involves the routine MPI_Address, which returns the address of an item. This address may be used in only two ways: relative to another address from MPI_Address, or relative to the constant MPI_BOTTOM. In C, the obvious implementation is to set MPI_BOTTOM to zero and use something like (long)(char *)ptr to get the address that ptr represents. But in Fortran, the value MPI_BOTTOM is a variable at a known location. Since all arguments to routines in Fortran are passed by address (value-result, if one must be picky; in practice the addresses are passed), the best approach is to have the Fortran version of MPI_Address return addresses relative to the address of MPI_BOTTOM. The advantage of this approach is that even when absolute addresses are too large to fit in an INTEGER, in many cases the address relative to a location in the user's program (i.e., MPI_BOTTOM) will fit in an INTEGER. This is the approach used in MPICH; if the address does not fit, an error of class MPI_ERR_ARG is returned, with an error code indicating that the address won't fit.
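A sketch of the idea behind the Fortran wrapper follows. The helper name is hypothetical, MPICH's real interface code differs in detail, and the sketch assumes that a Fortran INTEGER has the size of a C int.

    #include <limits.h>
    #include "mpi.h"

    /* Hypothetical helper for a Fortran MPI_ADDRESS wrapper: compute the
       address of "location" relative to the Fortran variable MPI_BOTTOM,
       as described above, and check that it fits in a Fortran INTEGER. */
    static int fortran_relative_address( void *location, void *fortran_bottom,
                                         int *address_out )
    {
        long rel = (long)( (char *)location - (char *)fortran_bottom );

        if (rel > INT_MAX || rel < INT_MIN)
            return MPI_ERR_ARG;        /* address will not fit in an INTEGER */
        *address_out = (int)rel;
        return MPI_SUCCESS;
    }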

As a final step in ensuring portability to 64-bit systems, our configure program runs some programs to determine whether the system is 32-bit or 64-bit. This allows MPICH to port to unknown systems, or to systems like SGI's IRIX that change from 32-bit IRIX 5 to 64-bit IRIX 6, without any changes to the code.
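Such a probe can be as simple as the following, which configure could compile and run to learn the pointer width. This is a sketch of the kind of test used, not MPICH's exact script.

    #include <stdio.h>

    /* Compiled and run at configure time: report whether pointers (and hence
       MPI opaque-object handles in MPICH) are 32 or 64 bits wide. */
    int main( void )
    {
        printf( "%d\n", (int)(sizeof(void *) * 8) );
        return 0;
    }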

Unresolved Issues

The MPI Forum did not address any mixed-language programming issues. At least for MPI-1, Fortran programs must pass messages to Fortran programs, and the same holds for C. Yet it is clearly possible to support both C-to-C and Fortran-to-Fortran message passing in a single application. We call this "horizontal" mixed-language portability. As long as there is no interest in transferring anything other than user data between the Fortran and C strata of the parallel application, the horizontal model can be satisfied, provided that MPI_Init provides a consistent, single initialization of MPI for both languages, regardless of which language actually is used to initialize MPI. Current practice centers on this horizontal model, but it is clearly insufficient, as we have observed from user feedback.

Two additional levels of support are possible, staying still with the restriction of C and Fortran as the mixed languages. The first is the ability to pass MPI opaque objects (locally, within a process) between C and Fortran. As noted earlier, C and Fortran representations for MPI objects will often be arbitrarily different, as will addresses. Although user-accessible interoperability functions already are required in MPICH for the benefit of its Fortran interface, the MPI Standard does not require them. Such functionality is likely to appear in MPI-2 as a result of our users' experience, and with other MPI systems as well. Such functionality has the added benefit of enhancing the ability of third parties to provide add-on tools for both C and Fortran users without working with inside knowledge of the MPICH implementation (for instance, see [ ]).

The second level of "vertical" support is to allow a C routine to transmit data to a Fortran routine. This requires some correspondence between C and Fortran datatypes, as well as a common format for performing the MPI operations (e.g., the C and Fortran implementations must agree on how to send control information and perform collective operations). The MPI Forum is preparing a proposal that addresses the issues of interlanguage use of MPI datatypes for MPI-2.

Status and Plans

We begin this section by describing the current use of MPICH by vendors, as a component of their own MPI implementations, and by others. We then describe some of our plans for improving MPICH, both by optimizing some of its algorithms for better performance and by extending its portability into other environments.

Vendor Interactions

As described above, one of the motivations for MPICH's architecture was to allow vendors to use MPICH in developing their own proprietary MPI implementations. MPICH is copyrighted but freely given away and automatically licensed to anyone for further development; it is not restricted to noncommercial use. This approach has worked well, and vendor implementations are now appearing, many incorporating major portions of MPICH code.

- IBM obtained an explicit license for MPICH and collaborated with us in testing and debugging early versions. During this time MPI-F [ ] appeared. This IBM implementation does not use the ADI but maps MPI functions directly onto an internal IBM abstract device interface. Our contact at IBM was Hubertus Franke.

- SGI worked closely with us, as described above, to improve the implementation of the ADI for their Challenge and Power Challenge machines. Functions were added to IRIX to enable single-copy interprocess data movement, and SGI gave us lock-free queue-management routines in assembler language. Those involved at SGI were Greg Chesson and Eric Salo.

- Convex worked closely with us to optimize an implementation of the channel interface and then of the ADI. We worked with Paco Romero, Dan Golan, Gary Applegate, and Raja Daoud.

- Intel contributed a version of the ADI written directly for NX, bypassing the channel interface. The Intel person responsible was Joel Clarke.

- Meiko also contributed (to the publicly distributed version) a Meiko device, thanks to the efforts of Jim Cownie.

- Laurie Costello helped us adapt MPICH for the Cray vector machines.

- DEC has used MPICH as the foundation of a memory-channel-based MPI for Alpha clusters.

We obviously do not claim credit for the vendor implementations, but it does appear that we met our original goal of accelerating the adoption of MPI by vendors through providing them a running start on their implementations. The architecture of MPICH, which provided multiple layers without impact on performance, was the key.

Other Users

Since we make MPICH publicly available by ftp, we do not have precise counts of the number of users. It is downloaded many times per month from our ftp site, ftp.mcs.anl.gov, which is also mirrored at Mississippi State (ftp.erc.msstate.edu). Judging from the bug reports and subscriptions to the mpi-users mailing list, we estimate that between five hundred and one thousand people are currently active in their use of MPICH.

Planned Enhancements

We are pursuing several directions for future work based on MPICH.

New ADI. To further reduce latencies, particularly on systems where latency is already quite low, we plan an enhanced ADI that will enable MPICH to take advantage of low-level device capabilities.

Better collective algorithms. As mentioned earlier, the current collective operations are implemented in a straightforward way. We would like to incorporate some of the ideas in [ ] for improved performance.

Thread safety. The MPI specification is thread-safe, and considerable effort has gone into providing for thread safety in MPICH, but this has not been seriously tested. The primary obstacle here is the availability of a test suite for thread safety of MPI operations.

Dynamic, lighter-weight TCP/IP device. We are nearing completion of a portable device that will replace p4 as our primary device for TCP/IP networks. It will be lighter weight than p4 and will support dynamic process management, which p4 does not.

RDP/UDP device. We are working on a reliable-data-protocol device built on UDP/IP (User Datagram Protocol), an approach that extends and leverages the initial work done by D. Brightwell [ ].

Multiprotocol support. Currently, MPICH can use only one of its devices at a time. Although two of those devices (the one based on Nexus and the one based on p4) are to a certain extent multiprotocol devices, we need a general mechanism for allowing multiple devices to be active at the same time. We are designing such a mechanism now. This will allow, for example, two MPPs to be used at the same time, each using its own switches for internal communication and TCP/IP for communication between the two machines.

Ports to more machines. We are working with several groups to port MPICH to interesting new environments. These include

- the Parsytec machine,
- the NEC SX and Cenju,
- Microsoft Windows NT, both for multiprocessor servers and across the many different kinds of networks that NT will support, and
- network protocols that are more efficient than TCP/IP, both standard (for example, MessageWay [ ]) and proprietary (for example, Myrinet [ ]).

Parallel I/O. We have recently begun a project to determine whether the concepts of the ADI can be extended to include parallel I/O. If this proves successful, we will include an experimental implementation of parts of MPI-IO [ ] in MPICH.

MPI-2

In March 1995 the MPI Forum resumed meeting, with many of its original participants, to consider extensions to the original MPI Standard. The extensions fall into several categories:

- dynamic creation of processes (e.g., MPI_SPAWN),
- one-sided operations (e.g., MPI_PUT),
- extended collective operations, such as collective operations on intercommunicators,
- external interfaces (portable access to fields in MPI opaque objects),
- C++ and Fortran 90 bindings,
- extensions for real-time environments, and
- miscellaneous topics, such as the standardization of mpirun, new datatypes, and language interoperability.

The MPICH project began as a commitment to implement the MPI Standard, with the aim of assisting in the adoption of MPI by both vendors and users. In this goal it has been successful. The degree to which MPI-2 functionality will be incorporated into MPICH depends on several factors:

- the actual content of MPI-2, which is far from settled at this time;
- the degree to which the MPI-2 specification mandates features whose implementation would be feasible only with major changes to MPICH internals; and
- the enthusiasm of MPICH users for the individual MPI-2 features.

At this writing, it seems highly likely that we will extend MPICH to include dynamic process management as defined by the MPI Forum, at least for the workstation environment. This extension will not be difficult to do with the new implementation of the channel interface for TCP/IP networks, and it is the feature most desired by those developing workstation-network applications. We expect also to aid tool builders (including ourselves) by providing access to MPICH internals specified in the MPI-2 external interfaces specification. For the other parts of MPI-2, we will wait and see.

Summary

We have described MPICH, a portable implementation of the MPI Standard that offers performance close to what specialized vendor message-passing libraries have been able to deliver. We believe that MPICH has succeeded in popularizing the MPI Standard and encouraging vendors to provide MPI to their customers, first by helping to create demand and second by offering them a convenient starting point for proprietary implementations.

We have also described the programming environment that is distributed with MPICH. The simple commands

    configure
    make
    cd examples/basic
    mpicc -mpilog -o cpi cpi.c
    mpirun -np <n> cpi
    upshot cpi.log

provide a portable sequence of actions by which even the beginning user can install MPICH, run a program, and use a sophisticated tool to examine its behavior. These commands are the same, and the user's program is the same, on MPPs, SMPs, and workstation networks. MPICH demonstrates that such portability need not be achieved at the cost of performance.

Acknowledgments

We thank the entire MPICH user community for their patience, feedback, and support of this project. Furthermore, we thank Gail Pieper for her help with proofreading and for suggesting improvements to the presentation of the material.

We thank Ralph Butler (University of North Florida) for his help in adapting parts of p4 to support MPICH.

We acknowledge several students who contributed to this work: Thomas McMahon and Ron Brightwell (Mississippi State), Patrick Bridges (University of Arizona), and Ed Karrels (University of Wisconsin).

References

M. Barnett, S. Gupta, D. Payne, L. Shuler, R. van de Geijn, and J. Watts. Interprocessor collective communications library (InterCom). In Proceedings of the Scalable High Performance Computing Conference. IEEE Computer Society Press.

Nanette J. Boden et al. Myrinet: A gigabit-per-second local-area network. IEEE Micro.

David Brightwell. The Design and Implementation of a Datagram Protocol (UDP) Device for MPICH. Master's thesis, Dept. of Computer Science.

Ronald Brightwell and Anthony Skjellum. Design and implementation study of MPICH for the Cray T3D. In preparation.

R. Alasdair A. Bruce, James G. Mills, and A. Gordon Smith. CHIMP/MPI user guide. Technical report, Edinburgh Parallel Computing Centre.

Greg Burns and Raja Daoud. MPI Cubix: Collective POSIX I/O operations for MPI. Technical report, Ohio Supercomputer Center.

Greg Burns, Raja Daoud, and James Vaigl. LAM: An open cluster environment for MPI. In John W. Ross, editor, Proceedings of Supercomputing Symposium. University of Toronto.

Ralph Butler and Ewing Lusk. Monitors, messages, and clusters: The p4 parallel programming system. Parallel Computing.

Lyndon J. Clarke, Robert A. Fletcher, Shari M. Trewin, R. Alasdair A. Bruce, A. Gordon Smith, and Simon R. Chapple. Reuse, portability and parallel libraries. In Proceedings of the IFIP Working Conference on Programming Environments for Massively Parallel Distributed Systems.

Danny Cohen and Craig Lund. Proposed standard for the MessageWay inter-SAN routing. IETF Working Group Document.

Peter Corbett, Dror Feitelson, Yarsun Hsu, Jean-Pierre Prost, Marc Snir, Sam Fineberg, Bill Nitzberg, Bernard Traversat, and Parkson Wong. MPI-IO: A parallel file I/O interface for MPI. NAS technical report.

Peter Corbett, Yarsun Hsu, Jean-Pierre Prost, Marc Snir, Sam Fineberg, Bill Nitzberg, Bernard Traversat, Parkson Wong, and Dror Feitelson. MPI-IO: A parallel file I/O interface for MPI. http://lovelace.nas.nasa.gov/MPI-IO/.

Remy Evard. Managing the ever-growing to do list. In USENIX Proceedings of the Eighth Large Installation Systems Administration Conference.

S. I. Feldman and P. J. Weinberger. A portable Fortran 77 compiler. In UNIX Time-Sharing System Programmer's Manual. AT&T Bell Laboratories, tenth edition.

American National Standard Programming Language Fortran. ANSI X3.9-1978.

Message Passing Interface Forum. Document for a standard message-passing interface. Technical report (revised), University of Tennessee. Available on netlib.

The MPI Forum. The MPI message-passing interface standard. http://www.mcs.anl.gov/mpi/standard.html.

Ian Foster, Carl Kesselman, and Steven Tuecke. The Nexus task-parallel runtime system. In Proc. 1st Intl. Workshop on Parallel Processing. Tata McGraw Hill.

Hubertus Franke, Peter Hochschild, Pratap Pattnaik, Jean-Pierre Prost, and Marc Snir. MPI on IBM SP1/SP2: Current status and future directions. Technical report, IBM T. J. Watson Research Center.

Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Bob Manchek, and Vaidy Sunderam. PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Network Parallel Computing. MIT Press.

William Gropp. Users manual for bfort: Producing Fortran interfaces to C source code. Technical report, Argonne National Laboratory.

William Gropp. Users manual for doctext: Producing documentation from C source code. Technical report, Argonne National Laboratory.

William Gropp, Edward Karrels, and Ewing Lusk. MPE graphics: Scalable X graphics in MPI. In Proceedings of the Scalable Parallel Libraries Conference, Mississippi State University. IEEE Computer Society Press.

William Gropp and Ewing Lusk. An abstract device definition to support the implementation of a high-level point-to-point message-passing interface. Preprint, Argonne National Laboratory.

William Gropp and Ewing Lusk. Installation guide for mpich, a portable implementation of MPI. Technical report, Argonne National Laboratory.

William Gropp and Ewing Lusk. Scalable Unix tools on parallel processors. In Proceedings of the Scalable High-Performance Computing Conference. IEEE Computer Society Press.

William Gropp and Ewing Lusk. Users guide for mpich, a portable implementation of MPI. Technical report, Argonne National Laboratory.

William Gropp and Ewing Lusk. MPICH working note: Creating a new MPICH device using the channel interface. Technical report, Argonne National Laboratory.

William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press.

William D. Gropp and Ewing Lusk. A test implementation of the MPI draft message-passing standard. Technical report, Argonne National Laboratory.

William D. Gropp and Barry Smith. Chameleon parallel programming tools users manual. Technical report, Argonne National Laboratory.

M. T. Heath and J. A. Etheridge. Visualizing the performance of parallel programs. IEEE Software.

Virginia Herrarte and Ewing Lusk. Studying parallel program behavior with upshot. Technical report, Argonne National Laboratory.

Edward Karrels and Ewing Lusk. Performance analysis of MPI programs. In Jack J. Dongarra and Bernard Tourancheau, editors, Environments and Tools for Parallel Scientific Computing. SIAM.

Donald Knuth. The Art of Computer Programming. Addison-Wesley.

Message Passing Interface Forum. MPI: A message-passing interface standard. International Journal of Supercomputer Applications.

Joerg Meyer. Message passing interface for Microsoft Windows. Master's thesis, University of Nebraska at Omaha.

Joerg Meyer and Hesham El-Rewini. WinMPI: Message passing interface for Microsoft Windows. Available by ftp from csftp.unomaha.edu in pub/rewini/WinMPI.

Parallab, University of Bergen. Programmer's Guide to MPI for Dolphin's SBus-to-SCI Adapters. Available by ftp from ftp.ii.uib.no.

IEEE standard for Scalable Coherent Interface (SCI). IEEE.

Anthony Skjellum, Steven G. Smith, Nathan E. Doss, Alvin P. Leung, and Manfred Morari. The design and evolution of Zipcode. Parallel Computing.

Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MPI: The Complete Reference. MIT Press.

Georg Stellner. CoCheck: Checkpointing and process migration for MPI. In Proceedings of the IPPS. IEEE Computer Society Press.

Georg Stellner and Jim Pruyne. Resource management and checkpointing for PVM. In Proceedings of the 2nd European PVM Users' Group Meeting, Lyon. Editions Hermes.

Paula L. Vaughan, Anthony Skjellum, Donna S. Reese, and Fei Chen Cheng. Migrating from PVM to MPI, part I: The Unify system. In Fifth Symposium on the Frontiers of Massively Parallel Computation, McLean, Virginia. IEEE Computer Society Press.

David Walker. Standards for message passing in a distributed memory environment. Technical report, Oak Ridge National Laboratory.