A Manual for the CHAOS Runtime Library

Joel Saltz    Ravi Ponnusamy    Shamik Sharma    Bongki Moon
Yuan-Shin Hwang    Mustafa Uysal    Raja Das

UMIACS and Dept. of Computer Science
University of Maryland
College Park, MD

dybbuk@cs.umd.edu

Abstract

Procedures are presented that are designed to help users efficiently program irregular
problems (e.g. unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial
differential equation solvers) on distributed memory machines. These procedures are also
designed for use by compilers for distributed memory multiprocessors. The portable CHAOS
procedures are designed to support dynamic data distributions and to automatically
generate send and receive messages by capturing communication patterns at runtime.

The CHAOS project was sponsored in part by ARPA (NAG), NSF (ASC) and ONR (SC).

Fact Sheet

Principal Investigator
    Joel Saltz

The CHAOS Team
    Raja Das          Yuan-Shin Hwang
    Bongki Moon       Ravi Ponnusamy
    Shamik Sharma     Mustafa Uysal

    Computer Science Department
    University of Maryland
    College Park, MD

    {saltz,raja,shin,bkmoon,pravi,shamik,uysal}@cs.umd.edu

Research Collaborators
    Rice University
        Reinhard von Hanxleden    Seema Hiranandani
        Charles Koelbel           Ken Kennedy
    Syracuse University
        Alok Choudhary            Geoffrey Fox
        Sanjay Ranka

To Obtain CHAOS Software
    Anonymous ftp: hyena.cs.umd.edu
    email: {bkmoon,shamik,shin,uysal}@cs.umd.edu

Bug Reports
    email: {bkmoon,shamik,shin,uysal}@cs.umd.edu

Contents

Introduction
Getting the CHAOS Library
Sneak Preview: Problems, Data Structures and Procedures
Translation Table, Dereference and Schedule
Computing Schedules
Overview of CHAOS
Inspector
Communication Schedules
Single-phase Inspector
    subroutine localize
    subroutine reglocalize
Two-phase Inspector
    subroutine PARTI_schedule
    subroutine PARTI_hash
    function PARTI_create_hash_table
    subroutine PARTI_free_hash_table
    function PARTI_make_mask
Inspectors for Partially Modified Data Access Patterns
    subroutine PARTI_clear_mask
    subroutine PARTI_clear_stamp
Incremental Schedules
    subroutine PARTI_incremental_schedule
Data Exchangers
    subroutine PREFIXgather
    subroutine PREFIXscatter
    subroutine PREFIXscatter_FUNC
    subroutine PREFIXmultigather
    subroutine PREFIXmultiscatter
    subroutine PREFIXmultiscatter_FUNC
    subroutine PREFIXscatternc
    subroutine PREFIXmultiscatternc
    subroutine PREFIXscatter_FUNCnc
    subroutine PREFIXmultiscatter_FUNCnc
Data Exchangers For General Data Structures
    subroutine PARTI_gather
    subroutine PARTI_scatter
    subroutine PARTI_mulgather
    subroutine PARTI_mulscatter
CHAOS Procedures for Adaptive Problems
Introduction
    function schedule_proc
    subroutine PREFIXscatter_append
    subroutine PREFIXscatter_back
    subroutine PREFIXmultiscatter_append
    subroutine PREFIXmultiscatter_back
    subroutine PREFIXmultiarr_scatter_append
    subroutine PREFIXmultiarr_scatter_back
Example
Translation Table
    function build_translation_table
    subroutine dereference
    function build_reg_translation_table
    function build_dst_translation_table
    function init_ttable_with_proc
    subroutine free_table
    subroutine derefproc
    subroutine derefoffset
    subroutine remap_table
    function tableGetReplication
    function tableGetPageSize
    subroutine tableGetIndices
Miscellaneous Procedures
    subroutine PARTI_setup
    subroutine renumber
Runtime Data Distribution
Runtime Data Graph
Data Mapping Procedures
    subroutine eliminate_dup_edges
    subroutine generate_rdg
    function init_rdg_hash_table
Loop Iteration Partitioning Procedures
    subroutine dref_rig
    subroutine iteration_partitioner
Data Remapping
    subroutine remap
Parallel Partitioners
Geometry Based Partitioners
Recursive Bisection Coloring
Weighted Recursive Bisection Coloring
Recursive Bisection with Remapping
Weighted Recursive Bisection with Remapping
Recursive Spectral Partitioner
A   Example Program NEIGHBORCOUNT
    Sequential Version of Program
    Parallel Version of Program (Regular block Distribution)
    Parallel Version of Program (Irregular Distribution)
B   Example Program Simplified ISING
    Sequential Version of Program
    Parallel Version of Program

Introduction

A suite of procedures has been designed and developed to efficiently solve a class of problems
commonly known as irregular problems. Irregular concurrent problems are a class of irregular
problems which consist of a sequence of concurrent computational phases. The data access
patterns and computational cost of each phase of such problems cannot be predicted until
runtime, which prevents compile-time optimization. In this class of problems, however, once runtime
information is available the data access patterns are known in advance, making it possible to use a
variety of profitable preprocessing strategies. Runtime support and compilation methods are being
developed that are applicable to a variety of unstructured problems, including explicit multigrid
unstructured computational fluid dynamics solvers, molecular dynamics codes (CHARMM,
AMBER, GROMOS, etc.), diagonal or polynomial preconditioned iterative linear solvers, direct
simulation Monte Carlo (DSMC) codes, and particle-in-cell (PIC) codes. These problems
share two characteristics: arrays are accessed through one or more levels of indirection, and
the problem is formulated as a sequence of loop nests which prove to be parallelizable.

The CHAOS procedures described in this manual can be viewed as forming a portion of a
portable, machine-independent runtime support library. The CHAOS library has been written
in C; however, interfaces have been provided to call CHAOS library routines from Fortran
application programs as well. This manual presents the CHAOS procedures for Fortran application
codes.

Getting the CHAOS Library

The CHAOS procedures presented in this manual and related technical papers can be obtained
from the anonymous ftp site hyena.cs.umd.edu.

Example codes shown in this manual are distributed along with the CHAOS software.

Sneak Preview: Problems, Data Structures and Procedures

An example of an irregular problem is presented in this section, and the CHAOS data structures and
procedures used to parallelize this example are introduced. Figure 1 illustrates a simple sequential
Fortran irregular loop (loop L2) which is similar in form to loops found in unstructured
computational fluid dynamics (CFD) codes and molecular dynamics codes. In Figure 1, arrays x
and y are accessed by the indirection arrays edge1 and edge2. Note that the data access pattern
associated with the inner loop (loop L2) is determined by the integer arrays edge1 and edge2.
Because the arrays edge1 and edge2 are not modified within the loop nest, the data access pattern
of loop L2 can be anticipated prior to executing it. Consequently, edge1 and edge2 are used to carry
out the preprocessing needed to minimize communication volume and startup costs. Since large
data arrays are associated with a typical fluid dynamics problem, the first step in parallelizing
involves partitioning the data arrays x and y. The next step involves assigning equal amounts of
work to processors to maintain load balance.

C     Outer loop L1
L1    do istep = 1, n_step
C        Sweep over edges: inner loop L2
L2       do i = 1, nedge
            n1 = edge1(i)
            n2 = edge2(i)
            y(n1) = y(n1) + f(x(n1), x(n2))
            y(n2) = y(n2) + g(x(n1), x(n2))
         end do
      end do

              Figure 1: An example code with an irregular loop

Translation Table, Dereference and Schedule

On distributed memory machines, large data arrays may not fit in a single processor's memory;
hence they are divided among processors. Computational work is also divided among individual
processors to achieve parallelism. Once distributed arrays have been partitioned, each
processor ends up with a set of globally indexed distributed array elements. Each element in

a size N distributed array A is assigned to a particular (home) processor. In order for another
processor to be able to access a given element A(i) of the distributed array, the home processor
and local address of A(i) must be determined. Generally, unstructured problems solved with
irregular data distributions perform more efficiently than with a regular data distribution such
as BLOCK. In the case of an irregular data distribution, a lookup table, called the translation
table, is built that lists, for each array element, the home processor and the local address.

Memory considerations make it clear that it is not always feasible to place a copy of the
translation table on each processor, so the translation table must be distributed between
processors. This is accomplished by distributing the translation table by blocks, i.e. putting the
first N/P elements on the first processor, the second N/P elements on the second processor,
and so on, where P is the number of processors.

When an element A(m) of distributed array A is accessed, the home processor and local
offset are found in the portion of the distributed translation table stored in processor
⌈m/(N/P)⌉. A translation table lookup aimed at discovering the home processor and the
offset associated with a global distributed array index is referred to as a dereference request.
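The block lookup just described amounts to simple integer arithmetic. The sketch below is
illustrative only: these are not CHAOS procedures (the library's dereference routines described
later should be used in real codes), and it assumes 1-based indices and that N is a multiple of P.

c     Illustrative sketch (not CHAOS routines): locate the block of a
c     block-distributed translation table that holds entry m.
      integer function ttable_home(m, n, p)
      integer m, n, p, blk
      blk = n / p
c     processors are numbered 1..P; entry m lives on ceil(m/blk)
      ttable_home = (m - 1) / blk + 1
      end

      integer function ttable_offset(m, n, p)
      integer m, n, p, blk
      blk = n / p
c     1-based offset of entry m within its block
      ttable_offset = mod(m - 1, blk) + 1
      end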

Consider the irregular loop L2 in Figure 1 that sweeps over the edges of a mesh. In this case,
distributing the data arrays x and y corresponds to partitioning the mesh vertices; partitioning
loop iterations corresponds to partitioning the edges of the mesh. Hence each processor gets a
subset of loop iterations (edges) to work on. An edge i that has both end points edge1(i)
and edge2(i) inside the same partition (processor) requires no outside information. On the
other hand, edges which cross partition boundaries require data from other processors. Before
executing the computation for such an edge, processors must retrieve the required data from
other processors.

There is typically a non-trivial communication latency or message startup cost on distributed
memory machines: communication can be aggregated to reduce the effect of communication
latency, and software caching can be done to reduce communication volume. To carry
out either optimization it is helpful to have a priori knowledge of data access patterns. In
irregular problems it is generally not possible to predict data access patterns at compile time.
For example, the values of the indirection arrays edge1 and edge2 of loop L2 in Figure 1 are
known only at runtime, because they depend on the input mesh. During program execution,
the data references of distributed arrays are therefore preprocessed: on each processor, the data that need to be
exchanged are precomputed. The result of this preprocessing is stored in a data structure
called a communication schedule. The process of analyzing the indirection arrays and generating
schedules is called the inspector phase.

Each processor uses communication schedules to exchange the required data before and after
executing a loop. The same schedules can be used repeatedly as long as the data reference
patterns remain unchanged. The process of carrying out communication and computation is called
the executor phase.

Computing Schedules

This section discusses the process of generating and using schedules to carry
out communication vectorization and software caching. Consider the example shown in
Figure 1. The arrays x, y, edge1 and edge2 are partitioned between the processors of the distributed
memory machine. The local size of the data arrays x and y on each processor is local_nnode, and
the local size of the indirection arrays edge1 and edge2 is local_nedge. It is assumed that arrays x and y are
distributed in the same fashion, and the distribution is stored in a distributed translation table.
The partitioned indirection arrays edge1 and edge2 are called part_edge1 and part_edge2
respectively. To compute schedules, the local data array references (local indirection array
values) are collected in an array and passed to the procedure localize.

In loop L2 of Figure 1, values of array y are updated using the values stored in array x.
Hence a processor may need an off-processor array element of x to update an element of y; it
may also update an off-processor array element of y. The goal of the inspector is to prefetch
off-processor data items before executing the loop and to carry out off-processor updates after
executing the loop. Hence two sets of schedules are computed in the inspector: gather
schedules (communication schedules that can be used to fetch off-processor elements of x) and
scatter schedules (communication schedules that can be used to send updated off-processor
elements of y). However, the arrays x and y are referenced in an identical fashion in each
iteration of loop L2, and they are also identically distributed, so a single schedule that
represents the data references of either x or y can be used both for fetching off-processor elements
of x and for sending off-processor elements of y.

Figure 2 contains the preprocessing code for the simple irregular loop L2 shown in Figure 1.
The distribution of the data arrays x and y is stored in a translation table itable. The globally
indexed reference pattern used to access arrays x and y is collected (loop K1) in an array
ig_ref. The procedure localize dereferences the index array ig_ref to get the addresses and
translates ig_ref so that valid references are generated when the loop is executed.

Off-processor elements are stored in an on-processor buffer area. The buffer area for each
data array immediately follows the on-processor data for that array; for example, the buffer
for data array y begins at y(local_nnode+1). Hence, when localize translates ig_ref to
localized_edge, the off-processor references are modified to point to buffer addresses. The procedure
localize uses a hash table to remove any duplicate references to off-processor elements, so that
only a single copy of each off-processor datum is transmitted.

C Inspector Phase: build translation table, compute schedule and translate indices

S1    itable = build_translation_table(local_indices, local_nnode)
K1    do i = 1, local_nedge
         ig_ref(i) = local_edge1(i)
         ig_ref(local_nedge+i) = local_edge2(i)
      end do
S2    call localize(itable, i_sched, ig_ref, localized_edge, 2*local_nedge,
     &              n_off_proc, local_nnode, 1)
K2    do i = 1, local_nedge
         local_edge1(i) = localized_edge(i)
         local_edge2(i) = localized_edge(local_nedge+i)
      end do

C Executor Phase: carry out communication and computation

S3    call zero_out_buffer(y(local_nnode+1), n_off_proc)
S4    call gather(x(local_nnode+1), x, i_sched)
K3    do i = 1, local_nedge
         n1 = local_edge1(i)
         n2 = local_edge2(i)
         y(n1) = y(n1) + f(x(n1), x(n2))
         y(n2) = y(n2) + g(x(n1), x(n2))
      end do
S5    call scatter_add(y(local_nnode+1), y, i_sched)

              Figure 2: Node code for the simple irregular loop

Figure 3: Index translation by the localize mechanism (localize maps a partitioned global
reference list to a localized reference list; off-processor references point into a buffer area
that immediately follows the local data array, and the buffer receives the off-processor data).

When the off-processor data is collected into the buffer using the schedule returned by localize, the data is stored in such a way
that execution of the loop using local_edge1 and local_edge2 accesses the correct data.
A sketch of how the procedure localize works is shown in Figure 3.

The executor code starting at S3 in Figure 2 carries out the actual loop computation. In
this computation the values stored in the array y are updated using the values stored in x.
During the computation, accumulations to off-processor locations of array y are carried out in the
buffer associated with array y. This makes it necessary to initialize the buffer corresponding
to off-processor references of y; to perform this action the function zero_out_buffer is called.
After the loop's computation, data in the buffer locations of array y are communicated to the
home processors of these data elements (scatter_add). There are two potential communication
points in the executor code, i.e. the gather and the scatter_add calls. The gather on each
processor fetches all the necessary x references that reside off-processor; the scatter_add call
accumulates the off-processor y values.

Overview of CHAOS

Solving such concurrent irregular problems on distributed memory machines using CHAOS
runtime support usually involves six major phases (Figure 4). The first four phases concern
mapping data and computations onto processors. The next two phases concern analyzing data
access patterns in a loop and generating optimized communication calls. A brief description of
these phases follows; they will be discussed in detail in later sections.

      Phase A   Data Partitioning        Assign elements of data arrays to processors
      Phase B   Data Remapping           Redistribute data array elements
      Phase C   Iteration Partitioning   Allocate iterations to processors
      Phase D   Iteration Remapping      Redistribute indirection array elements
      Phase E   Inspector                Translate indices; generate schedules
      Phase F   Executor                 Use schedules for data transportation;
                                         perform computation

                    Figure 4: Solving irregular problems

Data Distribution: Phase A calculates how data arrays are to be partitioned by
making use of partitioners provided by CHAOS or by the user. CHAOS supports a
number of parallel partitioners that partition data arrays using heuristics based on spatial
positions, computational load, connectivity, etc. The partitioners return an irregular
assignment of array elements to processors, which is stored as a CHAOS construct called
the translation table. A translation table is a globally accessible data structure which
lists the home processor and offset address of each data array element. The translation
table may be replicated, distributed regularly, or stored in a paged fashion, depending on
storage requirements.

Data Remapping: Phase B remaps data arrays from the current distribution to the
newly calculated irregular distribution. A CHAOS procedure, remap, is used to generate
an optimized communication schedule for moving data array elements from their original
distribution to the new distribution. Other CHAOS procedures (gather, scatter and
scatter_append) use the communication schedule to perform the data movement.

Loop Iteration Partitioning: Phase C determines how loop iterations should be
partitioned across processors. There are a large number of possible schemes for assigning
loop iterations to processors based on optimizing load balance and communication volume.
CHAOS uses the almost-owner-computes rule to assign loop iterations to processors:
each iteration is assigned to the processor which owns a majority of the data array elements
accessed in that iteration. This heuristic is biased towards reducing communication costs.
CHAOS also allows the owner-computes rule.

Remapping Loop Iterations: Phase D is similar to Phase B. Indirection array elements
are remapped to conform with the loop iteration partitioning. For example, in Figure 1,
once loop L2 is partitioned, the indirection array elements edge1(i) and edge2(i) used in
iteration i are moved to the processor which executes that iteration.

Inspector: Phase E carries out the preprocessing needed for communication optimizations
and index translation.

Executor: Phase F uses information from the earlier phases to carry out the computation
and communication. Communication is carried out by CHAOS data transportation
primitives which use the communication schedules constructed in Phase E.

In static irregular problems, Phase F is executed many times, while Phases A through E
are executed only once. In some adaptive problems, data access patterns change periodically
but reasonable load balance is maintained; in such applications, Phase E must be repeated
whenever the data access pattern changes. In even more highly adaptive problems, the data
arrays may need to be repartitioned in order to maintain load balance. In such applications
all the phases described above are repeated.
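The pieces fit together roughly as sketched below for the loop of Figure 1. This is an
illustration only: phases A-D are shown as comments because the partitioning and remapping
procedures are described in later sections, the inspector and executor calls follow the
interfaces used in Figure 2, and the data type prefix d (double precision) is assumed.

c     Phase A: partition data arrays x and y (CHAOS parallel partitioner
c              or user-supplied); the result is a translation table.
c     Phase B: remap x and y to the new irregular distribution.
c     Phase C: partition the loop iterations (edges) over processors.
c     Phase D: remap the indirection arrays edge1 and edge2 accordingly.
c
c     Phase E: inspector -- translate indices and build a schedule
      call localize(itable, i_sched, ig_ref, loc_ref,
     &              ndata, n_off_proc, local_nnode, 1)
c
c     Phase F: executor -- move data, compute, send updates back;
c              i_sched is reused for every execution of the loop
      call dgather(x(local_nnode+1), x, i_sched)
c     ... sweep over the local edges as in Figure 2 ...
      call dscatteradd(y(local_nnode+1), y, i_sched)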

Inspector

The CHAOS procedures that can be used to generate the data structures (translation tables
and schedules) and to operate on these data structures are described in this section. The
functionalities of the procedures explained in this section are shown in Table 1.

Communication Schedules

To describe the inspector primitives provided by CHAOS we shall often refer to the example
code in Figure 1. In that example, one should notice that the arrays edge1 and edge2 are
used to index other arrays (x and y). Since this is an indirect indexing method, we call edge1
and edge2 indirection arrays. As we see from the parallelized node code version of the simple
loop (Figure 2), we need to do some preprocessing before we can actually execute the loop.
This preprocessing is aimed at achieving the following goals:

Global to local translation of references: Determine which of the references made by
an indirection array (e.g. edge1, edge2) are references to data that is now on-processor.
These references are changed so that they now point to the local address. For references to
data that is now off-processor, we need to assign on-processor buffer locations into which these
off-processor elements will be brought. The off-processor references of the indirection
array are changed so that they now point into this buffer location.

Communication schedule generation: The communication schedule specifies an optimized
way to gather off-processor data and to scatter back local copies after computation.
CHAOS primitives are used to scan the data access pattern and determine which
off-processor elements will be needed during computation. Duplicates among these references
are removed, and the off-processor references are merged so that fewer messages
will be needed for moving data. The communication schedule data structure typically
contains the following:

(a) send list: a list of local array elements that must be sent to other processors.

(b) permutation list: an array that specifies the data placement order of incoming
    off-processor elements in a local buffer area which is designated to receive incoming
    data.

(c) send size: an array that specifies the sizes of outgoing messages from processor p
    to other processors.

(d) fetch size: an array that specifies the sizes of incoming messages to processor p
    from other processors.

Some schedules do not need all of these entries; these variants are described below.

Table 1: CHAOS Inspector/Executor Procedures

Task            Functionality                         Procedure
-----------------------------------------------------------------------------
Inspector

Schedule        compute schedule                      localize
                compute schedule                      reglocalize
                compute schedule                      PARTI_schedule
                compute incremental schedule          PARTI_incremental_schedule
                hash global references                PARTI_hash
                create hash table                     PARTI_create_hash_table
                deallocate hash table                 PARTI_free_hash_table

Translation     build translation table               build_translation_table
Table           build translation table               build_reg_translation_table
                build paged translation table         build_dst_translation_table
                update a paged table                  remap_table
                get replication factor                getTableRepF
                get table page size                   getTablePageSize
                dereference                           dereference
                dereference (only processor)          derefproc
                dereference (only offset)             derefoffset
                get local indices                     getTableIndices
                deallocate translation table          free_table

Executor

Data            fetch off-processor data              PREFIXgather
Exchangers      scatter off-processor data            PREFIXscatter
                scatter off-processor data
                  with function                       PREFIXscatter_FUNC
                fetch off-processor data              PREFIXmultigather
                scatter off-processor data            PREFIXscatternc
                scatter off-processor data            PREFIXmultiscatternc
                scatter off-processor data            PREFIXmultiscatter_FUNCnc
                gather off-processor data             PARTI_gather
                gather off-processor data             PARTI_mulgather
                scatter off-processor data            PARTI_scatter
                scatter off-processor data            PARTI_mulscatter

Miscellaneous   initialize CHAOS environment          PARTI_setup
                renumber data arrays                  renumber

PREFIX = data type: d for double precision, f for real, i for integer.
FUNC = function: add for addition, mult for multiplication.

The following example further clarifies the process of index translation. Let us assume that
an array x has been distributed over two processors. Similarly,
the indirection arrays edge1 and edge2 are also distributed over the two processors. The local
portions of each array (x, edge1 and edge2) look like this:

local area local area

x x x x x

Local x x x x x

nedge and nedge

edge edge

edge edge

where the underlined references are off-processor.

As explained above, before we execute the loop we must bring in a copy of all
off-processor references that might be made inside the loop. In this case there are several off-processor
references, only some of which are distinct. We can assign these copies memory just at the end
of the local portion of the array. This new region is called the ghost area or the off-processor
buffer.

local areaghost cells local areaghost cells

x x x x x x x x x x

x x x x x x x x x x

Now we must change the global references in edge1 and edge2
so that they now point to the appropriate local references.
After index translation we have:

nedge nedge

edge edge

edge edge

The next task is to build a communication schedule. The fetch size and send size are
the amounts of data that must be transferred during a gather operation. The send list
specifies the data that must be sent to every other processor. The permutation list specifies
where incoming data from each processor should be placed in the ghost area. Note that a
communication schedule depends on the global-to-local index translation having been done,
since the permutation list can only be determined after each off-processor reference has been
assigned a ghost-area location during index translation. In the permutation list, the local indices
are stored as offsets from the beginning of the ghost area instead of being stored as absolute
local index values. For the example above, the communication schedule would look as follows:

ScheduleP ScheduleP

fetchsize fetchsize

sendsize sendsize

sendlist p NULL sendlist p

p p NULL

permlist p NULL permlist p

p p NULL

In the following sections, the CHAOS primitives for performing index translation and schedule
generation are described. Since index translation and schedule generation are usually
performed one after the other, CHAOS primitives can combine the two steps into a single
primitive which, given an indirection array, returns a translated indirection array as well as a
communication schedule. In some cases, however, it is possible to reuse the index translation
process to generate different schedules. For such applications CHAOS allows the user to use a
two-step inspector process by breaking the inspector step into index translation and schedule
generation stages.

Single-phase Inspector

Single-step inspector primitives perform both index translation and schedule generation
with respect to a given set of indirection arrays.

subroutine localize

localize is used to translate all the global indices in the given indirection array into local
indices. It also returns the schedule corresponding to that indirection array. The schedule
pointer returned by localize is used to gather data and store it at the end of the local array.
The schedule is created such that multiple copies of the same datum are not brought in during
the gather phase; the elimination of duplicates is achieved by using a hash table. localize
returns the local reference string corresponding to the global references which are passed as a
parameter to it. The number of off-processor data elements is also returned by localize, so
that one can allocate enough space at the end of the local array.

Synopsis

subroutine localize(itabptr, ilsched, iglobal_refs, ilocal_refs, ndata, n_off_proc,
                    my_size, repetition)

Parameter Declarations

integer itabptr: refers to the relevant translation table pointer.
integer ilsched: refers to the relevant schedule pointer (returned by localize).
integer iglobal_refs(): the array which stores all of the global indirection array references.
integer ilocal_refs(): the array which stores the local reference string corresponding to the
    global references (returned by localize).
integer ndata: number of global references.
integer n_off_proc: number of off-processor data (returned by localize).
integer my_size: the size of my local data array.
integer repetition: maximum number of columns or rows from which data will be gathered
    or scattered using the schedule returned by localize.

Return Value

None

Example

Assume that the data arrays x and y are identically but irregularly distributed between two
processors, and the processors take part in a computation that involves a loop which refers
to off-processor data. The indirect data array references are stored in globalref, and the
array myindex holds the local indices of the data arrays. The inspector and the executor
code for the loop is presented here.

integer indataindirection BUFSIZE

parameterBUFSIZE

integer myindexglobalreflocalref

double precision xyBUFSIZE

integer tabptrschedptr buildtranslationtable

c data initialization

ifMPImynode eq then

myindex

myindex

myindex

mysize

globalref

globalref

globalref

ndata

else ifMPImynode eq then

myindex

myindex

myindex

myindex

myindex

mysize

globalref

globalref

globalref

globalref

globalref

ndata

else

mysize

ndata

end if

do imysize

xi myindexi

yi myindexi

continue

C initialize Chaos environment

call PARTIsetup

c the following is the inspector code

tabptr buildtranslationtablemyindexmysize

call localizetabptrschedptrglobalref

localrefndatanoffprocmysize

c

do indata

globalrefi localrefi

continue

c end of the inspector and the executor begins

call dgatherymysizeyschedptr

do indata

indirection globalrefi

xi xi yindirection

continue

c end of the executor code

if MPImynode eq then

WRITE XI Imysize

WRITE globalrefI Indata

WRITE YglobalrefI Indata

end if

call MPIgsync

if MPImynode eq then

WRITE XI IMYSIZE

WRITE globalrefI Indata

WRITE YglobalrefI Indata

end if

end

The distribution of the data arrays is stored in a translation table (tabptr), and the communication
pattern between processors is stored in a schedule (schedptr). The procedure dgather brings
in the off-processor data. After the end of the computation, each processor holds the updated
values of x for its local elements.

subroutine reglocalize

The functionality of this procedure is similar to that of the localize procedure, except that in this
case, instead of passing a pointer to a translation table for an irregular data distribution,
we specify a regular distribution. At present, BLOCK and
CYCLIC distributions are supported. Other regular distributions can be supplied by the user by writing their
own regular dereference.

Synopsis

refs subroutine reglo calizedistributionsi zeil schedi global

ilo cal refsndatan o pro cmy sizerep etition

Parameter Declarations

integer distribution BLOCK CYCLIC or OWN

integer size The array size from which data is to b e gathered or scattered

integer ilsched refers to the relevant schedule p ointer returned by reglocalize

refs the array which stores all of the global indirection array references integer iglobal

refs the array which stores the lo cal reference string corresp onding to the integer ilo cal

global references returned by reglocalize

integer ndata numb er of global references

integer n o pro c numb er of opro cessor data returned by reglocalize

size parameter will return the size of the lo cal array integer my

integer rep etition maximum numb er of columns or rows from which data will b e gathered

or scattered using the schedule returned by lo calize

Return Value

None

Example

The example is similar to the one presented in the localize section, except that the data
arrays are regularly distributed (i.e. by BLOCK).

integer indataindirection BUFSIZE globalsize BLOCK

integer myindex

parameterBUFSIZE

integer iglobalrefilocalref

real buffer aloc

double precision xyBUFSIZE

integer itabptr ischedptr buildtranslationtable

BLOCK

globalsize

if MPInumnodes le globalsize then

mysize globalsizeMPInumnodes

else

if MPImynode lt globalsize then

mysize

else

mysize

endif

endif

ifMPImynode eq then

iglobalref

iglobalref

iglobalref

ndata

else if MPImynode eq then

iglobalref

iglobalref

iglobalref

iglobalref

ndata

else

ndata

end if

C initialize Chaos environment

call PARTIsetup

c the following is the inspector code

call reglocalizeBLOCKglobalsizeischedptriglobalref

ilocalrefndatanoffprocmysize

do indata

iglobalrefi ilocalrefi

continue

do i ndata

xi MPImynode i

enddo

do i mysize

yi MPImynode i

continue

c end of the inspector and the following is the executor code

call dgatherymysizeyischedptr

do indata

indirection iglobalrefi

xi xi yindirection

continue

c end of the executor code

if MPImynode le then

WRITE

MPImynode xXI Indata

WRITE

MPImynode grefiglobalrefI Indata

WRITE

MPImynode yYiglobalrefI Indata

end if

call MPIgsync

C Gather example

do imysize

aloci floatMPImynodei

continue

call fgatherbufferalocischedptr

if MPImynode le then

write Gather results

WRITE MPImynode bufferI Inoffproc

endif

end

After the end of the computation in pro cessor the values of x x x and x are

and resp ectively On pro cessor the values of x x x and x

are and resp ectively

Two-phase Inspector

Instead of using localize or reglocalize, the user may also choose to perform index translation
and schedule generation in separate steps. For this purpose CHAOS provides four primitives:
PARTI_create_hash_table, PARTI_hash, PARTI_schedule and PARTI_free_hash_table.
PARTI_hash is used to enter a data access pattern (i.e. an indirection array) into a hash table,
where duplicates are removed and global-to-local index translation is performed. PARTI_schedule
is used to build a communication schedule by inspecting the entries in the hash table.
PARTI_create_hash_table and PARTI_free_hash_table are used to allocate and deallocate
memory for the hash table.

subroutine PARTI_schedule

Synopsis

PARTI_schedule(hash_table, combo_mask, sched, maxdim)

Parameters

integer hash_table: the hash table.
integer combo_mask: the combination of indirection arrays for which a schedule is needed.
integer sched: the schedule that is returned.
integer maxdim: maximum number of columns or rows from which data will be gathered
    or scattered using the schedule returned by PARTI_schedule.

Return Value

None

subroutine PARTI_hash

Enters the global references into the heap (hash table) and converts them into references to a local array.

Synopsis

PARTI_hash(trans_table, hash_table, supermask, new_stamp, global_refs, ndata, my_size,
           no_off_proc, no_new_off_proc)

Parameters

integer trans_table: translation table used to determine the local address of any global address.
integer hash_table: the heap into which all the global references will be hashed. It must
    have been created earlier using PARTI_create_hash_table.
integer global_refs(): the global references.
integer ndata: size of the global_refs array.
integer my_size: the size of the local buffer; the off-processor global references will be
    modified so that they now point to addresses beginning from my_size onwards.
integer no_off_proc: contains the number of off-processor references in global_refs (returned
    by PARTI_hash).
integer no_new_off_proc: some of the off-processor references may have already been
    hashed into the heap by previous indirection arrays. This parameter will contain the
    number of new (unique) off-processor references (returned by PARTI_hash).
integer new_stamp: a heap (hash table) can be used to hash in many indirection arrays;
    all entries hashed in by a particular indirection array have a specific id. The parameter
    new_stamp returns the id assigned to this particular indirection array's elements
    (returned by PARTI_hash).
integer supermask: each indirection array can specify which of the previous indirection
    arrays' entries it wants to reuse; this is done by passing in a parameter called the
    supermask. A supermask is created by converting the ids of the previous indirection
    arrays into masks using PARTI_make_mask(id) and then using bitwise OR to combine
    these masks. If a user does not care about reusing previous index analyses, a zero should be
    passed.

Return Value

None
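For example, a supermask that reuses the entries hashed in by two earlier indirection arrays
might be formed as follows (a sketch; istampa and istampb are assumed to be stamps returned
by earlier PARTI_hash calls):

c     Sketch: combine two previously issued stamps into a supermask.
      integer PARTImakemask
      integer istampa, istampb, imaska, imaskb, isupermask
      imaska = PARTImakemask(istampa)
      imaskb = PARTImakemask(istampb)
c     bitwise OR of the individual masks (Fortran ior intrinsic)
      isupermask = ior(imaska, imaskb)
c     isupermask can now be passed as the supermask argument of PARTIhash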

function PARTI_create_hash_table

Creates a heap into which each processor will enter its global references. Any data range will
do, but a good estimate improves performance.

Synopsis

PARTI_create_hash_table(data_range)

Parameters

integer data_range: the maximum size of a global reference.

Return Value

an integer identifying the hash table

subroutine PARTI_free_hash_table

Deallocates the hash table created with PARTI_create_hash_table.

Synopsis

PARTI_free_hash_table(hash_table)

Parameters

integer hash_table: an integer identifying the hash table.

Return Value

None

function PARTI_make_mask

Converts an id (stamp) to a bitmask (see PARTI_hash).

Synopsis

PARTI_make_mask(stamp)

Parameters

integer stamp

Return Value

an integer representing the bitmask

integer indataindirection BUFSIZE mysize

parameterBUFSIZE

integer myindexglobalreflocalref

double precision xyBUFSIZE

integer tabptrschedptr buildtranslationtable

integer particreatehashtable

integer partimakemask

integer hashptr

integer istamp imask ndontcare noffproc

c data initialization

ifMPImynodeeq then

myindex

myindex

myindex

mysize

globalref

globalref

globalref

globalref

ndata

else ifMPImynodeeq then

myindex

myindex

myindex

myindex

myindex

mysize

globalref

globalref

globalref

globalref

globalref

ndata

else

mysize

ndata

end if

do imysize

xi myindexi

yi myindexi

continue

C initialize CHAOS environment

call PARTIsetup

c the following is the inspector code

tabptr buildtranslationtablemyindexmysize

hashptr PARTIcreatehashtable

call PARTIhashtabptrhashptristampglobalref

ndata mysize ndontcare noffproc

c

c The stamp returned by PARTIhash is an integer between and

c PARTImakemask maps this integer into a specific bit in an integer mask

c masks can then be bitORed if needed

c

imask PARTImakemaskistamp

call PARTIschedulehashptr imask schedptr

c

c end of the inspector and the executor begins

call dgatherymysizeyschedptr

do indata

indirection globalrefi

xi xi yindirection

continue

if MPImynodeeq then

WRITE Processor

WRITE XI Imysize

end if

call MPIgsync

if MPImynodeeq then

WRITE Processor

WRITE XI IMYSIZE

end if

end

Inspectors for Partially Modified Data Access Patterns

The previous section described how a two-step inspector can be used instead of a single
primitive such as localize. Both schemes provide the same functionality; indeed, localize
and reglocalize have been built on top of the more general primitives PARTI_hash and
PARTI_schedule.

We recommend that users use the one-step inspector routines whenever possible. Two-step
inspectors are usually needed when the indirection arrays are modified slightly during the course
of the computation. In such cases the index analysis for the old indirection array can be reused,
since the information is maintained in the hash table.

To clarify how this is done, we provide the following example.

integer indataindirection BUFSIZE

parameterBUFSIZE

integer myindexglobalreflocalref

double precision xyBUFSIZE

integer tabptrschedptr buildtranslationtable

integer hashptr

c data initialization

ifmynode eq then

myindex

myindex

myindex

mysize

globalref

globalref

globalref

globalref

ndata

else

myindex

myindex

myindex

myindex

myindex

mysize

globalref

globalref

globalref

globalref

globalref

ndata

end if

do imysize

xi myindexi

yi myindexi

continue

C initialize CHAOS environment

call PARTIsetup

c the following is the inspector code

tabptr buildtranslationtablemyindexmysize

call PARTIcreatehashtable

do k notimesteps

c SSS

if k eq

ioldmask

else

ioldmask PARTImakemaskistamp

endif

call PARTIhashtabptrhashptrioldmaskistampglobalref

ndata mysize ndontcare noffproc

imask PARTImakemaskistamp

call PARTIschedulehashptr imask schedptr

c EEE

c end of the inspector and the executor begins

call dgatherymysizeyschedptr

do indata

indirection globalrefi

xi xi yindirection

continue

c On some weird conditions change a few entries

c in the indirection array

do i ndata

if i eq randomk then

globalrefi globalrefi

endif

enddo

enddo

c end of the executor code

In the previous example the indirection array changes every time step; hence the schedule
must be regenerated each time. Since most of the index analysis can be reused, we call
PARTI_hash with a pointer to the old hash table and pass the mask of the previous indirection
array. Using this information, PARTI_hash hashes in each entry of the new indirection array
and reuses the index analysis information of matching pre-existing entries in the hash table.
Index analysis is performed only for those entries that are not found in the hash table. There
is a caveat to the previous example: the number of distinct stamps that PARTI_hash can issue
is limited (each stamp corresponds to a bit in an integer mask). This implies that the previous
example will work only when the number of time steps is less than this limit.

CHAOS provides two ways of working around this problem. PARTI_clear_stamp can clear
all entries with a particular stamp; PARTI_clear_mask is a more general primitive with which
any entry in the hash table with a particular combination of stamps can be removed. The second
method is to instruct PARTI_hash not to issue a new stamp every iteration. For instance, in
the previous example each indirection array is used only for one schedule; indirection arrays
from old time steps are never reused. By passing a special istamp value to PARTI_hash, the
user can direct it to reuse the last issued stamp value for all new entries.

The following code segments show how each of these can be done. These code segments
correspond to the section between SSS and EEE in the previous example.

c Using PARTIclearmask to clear entries in hash table

c SSS

if k eq

ioldmask

else

ioldmask PARTImakemaskistamp

endif

call PARTIhashtabptrhashptrioldmaskistampglobalref

globalref ndata mysize ndontcare noffproc

if k gt call PARTIclearmaskioldmask

imask PARTImakemaskistamp

call PARTIschedulehashptr imask schedptr

c EEE

c Using stamp reuse to limit stamp numbers

c SSS

if k eq

ioldmask

else

ioldmask PARTImakemaskistamp

endif

istamp

call PARTIhashtabptrhashptrioldmaskistampglobalref

ndata mysize ndontcare noffproc

imask PARTImakemaskistamp

call PARTIschedulehashptr imask schedptr

c EEE

subroutine PARTI_clear_mask

Removes all entries in a hash table which belong uniquely to this mask.

Synopsis

PARTI_clear_mask(hash_table, mask)

Parameters

integer hash_table
integer mask

Return Value

None

subroutine PARTI_clear_stamp

A wrapper around PARTI_clear_mask. You can directly give the id of the indirection array
that you want removed from the heap.

Synopsis

PARTI_clear_stamp(hash_table, stamp)

Parameters

integer hash_table
integer stamp

Return Value

None

Incremental Schedules

Incremental scheduling is a method by which users can take advantage of the similarities between
various data access patterns. Many of the off-processor references specified in an indirection
array may be the same as the references in a previously analysed indirection array. In such
cases some of the off-processor references of the second indirection array can be treated as local
during schedule generation. This effectively reduces the communication volume of gathered
elements.

To build incremental schedules, PARTI_hash must be passed masks corresponding to the stamps
of previous indirection arrays. This notifies the underlying layer which existing entries in the
hash table can be reused. After hashing, PARTI_incremental_schedule is used to build the
incremental schedule.

subroutine PARTI_incremental_schedule

Similar to PARTI_schedule, except that it builds a schedule only for entries in the hash table
which uniquely belong to this particular mask.

Synopsis

PARTI_incremental_schedule(hash_table, combo_mask, sched, maxdim)

Parameters

integer hash_table: the hash table.
integer combo_mask: the combination of indirection arrays for which a schedule is needed.
integer sched: refers to the relevant schedule pointer (returned by PARTI_incremental_schedule).
integer maxdim: maximum number of columns or rows from which data will be gathered or
    scattered using the returned schedule.

Return Value

None
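The fragment below sketches the intended usage (hypothetical names and sizes; the argument
order follows the two-phase inspector example given earlier, and a maxdim of 1 is assumed for
a one-dimensional array):

c     Sketch: the access pattern globalref2 overlaps a pattern that was
c     hashed in earlier with stamp istamp1.
      imask1 = PARTImakemask(istamp1)
c     hash the second indirection array, reusing the entries of the first
      call PARTIhash(tabptr, hashptr, imask1, istamp2, globalref2,
     &               ndata2, mysize, noffproc2, nnewoffproc2)
      imask2 = PARTImakemask(istamp2)
c     build a schedule only for the entries unique to the second pattern
      call PARTIincrementalschedule(hashptr, imask2, incschedptr, 1)
c     incschedptr now moves only the off-processor elements that were
c     not already covered by the first schedule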

Data Exchangers

The CHAOS data structure schedule stores the send/receive patterns; the CHAOS data
exchangers actually move data between processors using schedules.

subroutine PREFIXgather

PREFIX can be d (double precision), i (integer), f (real) or c (character). The PREFIXgather
procedure uses a schedule; copies of data values obtained from other processors are placed
in the memory pointed to by buffer. Also passed to PREFIXgather is a pointer to the location
from which data is to be fetched on the calling processor. This pointer is designated here as
aloc.

Synopsis

PREFIXgather(buffer, aloc, schedinfo)

Parameter Declarations

TYPE buffer: pointer to the buffer for copies of gathered data values.
TYPE aloc: location from which data is to be fetched on the calling processor.
integer schedinfo: refers to the relevant schedule.

Return Value

None

Assume that a schedule has already been obtained and that real numbers are to be gathered,
i.e. fgather is used. On each processor, aloc points to the array from which values are to
be obtained; buffer points to the location into which copies of data values
obtained from other processors will be placed.

real buffer aloc

integer ischedptr

do i

aloci floatmynode i

continue

call fgatherbufferalocischedptr

WRITE bufferi inoffproc

On each processor, buffer now holds the copies of the off-processor elements fetched by the gather.
The number of off-processor elements, n_off_proc, is returned by
the procedure reglocalize.

subroutine PREFIXscatter

PREFIX can be d (double precision), i (integer), f (real) or c (character). PREFIXscatter uses
a schedule produced by a call to another subroutine which creates a schedule (e.g. reglocalize).
Copies of data values to be scattered to other processors are placed in the memory pointed to by
buffer. Also passed to PREFIXscatter is a pointer to the location to which copies of data are
to be written on the calling processor. This pointer is designated here as aloc.

Synopsis

PREFIXscatter(buffer, aloc, schedinfo)

Parameter Declarations

TYPE buffer: points to the data values to be scattered from a given processor.
TYPE aloc: points to the first memory location on the calling processor where scattered data is
    to be placed.
integer schedinfo: refers to the relevant schedule.

Return Value

None

Example

We assume that a schedule has already been obtained by calling a subroutine such as reglocalize.
Our example will assume that we wish to scatter real numbers, i.e. that we
will be calling fscatter. On each processor, aloc points to the array to which values are to
be scattered; buffer points to the location from which data will be obtained to scatter.

real buffer aloc

integer ischedptr

do i

aloci

continue

ifmynodeeq then

buffer

endif

ifmynodeeq then

buffer

buffer

buffer

endif

call fscatterbufferalocischedptr

On pro cessor the rst four elements of alo c are and On

pro cessor the rst four elements of alo c are and

subroutine PREFIXscatter_FUNC

PREFIX can be d (double precision), i (integer), f (real) or c (character). FUNC can be add,
sub or mult. PREFIXscatter stores data values to specified locations; PREFIXscatter_FUNC
allows one processor to specify computations that are to be performed on the contents of
a given memory location of another processor. The procedure is in other respects analogous to
PREFIXscatter.

Synopsis

PREFIXscatter_FUNC(buffer, aloc, ischedinfo)

Parameter Declarations

TYPE buffer: points to the data values that will form operands for the specified type of
    remote operation.
TYPE aloc: points to the first memory location on the calling processor to be used as a target of
    remote operations.
integer schedinfo: refers to the relevant schedule.

Return Value

None

Example

We assume that a schedule has already been obtained by calling a subroutine such as reglocalize.
Our example will assume that we wish to scatter and add real numbers, i.e. that we
will be calling fscatteradd. On each processor, aloc points to the array to which values are
to be scattered and added; buffer points to the location from which values will be obtained
to scatter and add.

real buffer aloc

integer ischedptr

do i

aloci

continue

ifmynodeeq then

buffer

endif

ifmynodeeq then

buffer

buffer

buffer

endif

call fscatteraddbufferalocischedptr

On pro cessor the rst four elements of alo c are and On

pro cessor the rst three elements of alo c are and

subroutine PREFIXmultigather

This primitive is an extension of the regular gather primitive: it allows one to specify
multiple schedules to be used to gather data from the target array into the buffer. It also allows
multi-column or multi-row gathers if the columns are distributed in the same way. The different
schedules to be used are passed to the function in an array, and the number of
schedules must be specified as a parameter to the function.

Synopsis

PREFIXmultigather(buffer, aloc, n_scheds, scheds, base_shift, dilation, repetition)

Parameter Declarations

TYPE buffer: buffer for copies of gathered data values.
TYPE aloc: location from which data is to be fetched on the calling processor.
integer n_scheds: number of schedules passed.
integer scheds(): array of pointers, where each pointer points to a schedule.
integer base_shift: size of the distributed dimension.
integer dilation: the factor by which the distance between two data accesses on any
    dimension changes when that dimension is transposed.
integer repetition: maximum number of columns or rows from which data will be accessed.

Return Value

None
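For illustration, the fragment below gathers the off-processor elements described by two
schedules in a single call. This is a sketch with hypothetical names: isched1 and isched2 are
assumed to come from earlier inspector calls, y is a one-dimensional double precision array of
local size mysize, and the base shift, dilation and repetition values are illustrative choices for
this one-dimensional case.

c     Sketch: gather the data described by two schedules into the ghost
c     area of y with one call.
      integer ischeds(2)
      double precision y(MAXSIZE)
      ischeds(1) = isched1
      ischeds(2) = isched2
      call dmultigather(y(mysize+1), y, 2, ischeds, mysize, 1, 1)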

subroutine PREFIXmultiscatter

This primitive is an extension of the regular scatter primitive: it allows one to specify
multiple schedules to be used to scatter data from the buffer into the target array. It also
allows multi-column or multi-row scatters if the columns are distributed in the same way. The
different schedules to be used are passed to the function in an array, and the
number of schedules must be specified as a parameter to the function.

Synopsis

PREFIXmultiscatter(buffer, aloc, n_scheds, scheds, base_shift, dilation, repetition)

Parameter Declarations

TYPE buffer: buffer for copies of the data values to be scattered.
TYPE aloc: location on the calling processor to which data is to be scattered.
integer n_scheds: number of schedules passed.
integer scheds(): array of pointers, where each pointer points to a schedule.
integer base_shift: size of the distributed dimension.
integer dilation: the factor by which the distance between two data accesses on any
    dimension changes when that dimension is transposed.
integer repetition: maximum number of columns or rows to which data will be scattered.

Return Value

None

Example

Data can be scattered to multiple columns or rows using more than one schedule.

subroutine PREFIXmultiscatter_FUNC

FUNC can be add, sub or mult. This primitive is an extension of the regular scatter_FUNC
primitive: it allows one to specify a number of schedules to be used to perform computations
on the contents of given memory locations of another processor. It also allows multi-column
or multi-row operations if the columns or rows are distributed in the same way. The different
schedules to be used are passed to the function in an array, and the number of
schedules must be specified as a parameter to the function.

Synopsis

PREFIXmultiscatter_FUNC(buffer, aloc, n_scheds, scheds, base_shift, dilation, repetition)

Parameter Declarations

TYPE buffer: points to the data values that will form operands for the specified type of
    remote operation.
TYPE aloc: points to the first memory location on the calling processor to be used as a target of
    remote operations.
integer n_scheds: number of schedules passed.
integer scheds(): array of pointers, where each pointer points to a schedule.
integer base_shift: size of the distributed dimension.
integer dilation: the factor by which the distance between two data accesses on any
    dimension changes when that dimension is transposed.
integer repetition: maximum number of columns or rows from which data will be accessed.

Return Value

None

Example

Simple arithmetic operations can be performed on memory locations of another processor,
similar to the way a gather is performed.

subroutine PREFIXscatternc

This works just like a normal scatter function, except that it does an on-processor gather before it
does the scatter. The on-processor gather is done according to a pattern stored in the
schedule.

Synopsis

PREFIXscatternc(buffer, aloc, schedinfo)

Parameter Declarations

TYPE buffer: array from which data is to be scattered.
TYPE aloc: array to which data is to be scattered.
integer schedinfo: refers to the relevant schedule pointer.

Return Value

None

Example

This works similarly to, for example, the scatter_addnc function.

subroutine PREFIXmultiscatternc

This primitive is an extension of the regular multiscatter primitive such that an on-processor
gather is performed before the scatter occurs. The on-processor gather is done according to a
pattern stored in the schedule.

Synopsis

PREFIXmultiscatternc(buffer, aloc, n_scheds, scheds, base_shift, dilation, repetition)

Parameter Declarations

TYPE buffer: buffer for copies of the data values to be scattered.
TYPE aloc: location on the calling processor to which data is to be scattered.
integer n_scheds: number of schedules passed.
integer scheds(): array of pointers, where each pointer points to a schedule.
integer base_shift: size of the distributed dimension.
integer dilation: the factor by which the distance between two data accesses on any
    dimension changes when that dimension is transposed.
integer repetition: maximum number of columns or rows to which data will be scattered.

Return Value

None

Example

Data can be scattered to multiple columns or rows using more than one schedule.

subroutine PREFIXscatter_FUNCnc

This works just like a normal scatter_FUNC function, except that it does an on-processor gather
before it does the scatter_FUNC. The on-processor gather is done according to a pattern stored
in the schedule.

Synopsis

PREFIXscatter_FUNCnc(buffer, aloc, schedinfo)

Parameter Declarations

TYPE buffer: array from which data is to be scattered.
TYPE aloc: array to which data is to be scattered.
integer schedinfo: refers to the relevant schedule pointer.

Return Value

None

subroutine PREFIXmultiscatter_FUNCnc

FUNC can be add, sub or mult. This primitive is an extension of the regular
multiscatter_FUNC primitive such that it allows the user to perform an on-processor gather
before the scatter_FUNC operation.

Synopsis

subroutine PREFIXmultiscatter_FUNCnc(buffer, aloc, n_scheds, scheds, base_shift, dilation,
                                     repetition)

Parameter Declarations

TYPE buffer: points to the data values that will form operands for the specified type of
    remote operation.
TYPE aloc: points to the first memory location on the calling processor to be used as a target of
    remote operations.
integer n_scheds: number of schedules passed.
integer scheds(): array of pointers, where each pointer points to a schedule.
integer base_shift: size of the distributed dimension.
integer dilation: the factor by which the distance between two data accesses on any
    dimension changes when that dimension is transposed.
integer repetition: maximum number of columns or rows from which data will be accessed.

Return Value

None

Example

Simple arithmetic operations can be performed on memory locations of another processor,
similar to the way a gather is performed.

Data Exchangers For General Data Structures

subroutine PARTI_gather

Synopsis

subroutine PARTI_gather(sched, data, target, size)

Parameter Declarations

integer sched: refers to the relevant schedule.
(any type) data: points to the data values that will form operands for the specified type of
    remote operation.
(any type) target: points to the first memory location on the calling processor to be used as a
    target of remote operations.
integer size: size of the data structure.

Return Value

None

Similar to TYPEgather, except that this subroutine can be used to gather any type of data
structure. The size of the data structure must be provided.
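As a sketch of the intended use (the names, sizes and the unit of the size argument are
assumptions, not stated by the manual), PARTI_gather can move whole records, for example
three double precision coordinates per mesh node:

c     Sketch: gather 3-word records with a previously built schedule
c     isched; the record size is assumed to be given in bytes
c     (3 double precision words = 24 bytes).
      double precision coord(3, MAXNODE), coordbuf(3, MAXGHOST)
      call PARTIgather(isched, coord, coordbuf, 24)
c     coordbuf is assumed to receive the gathered off-processor records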

scatter subroutine PARTI

Synopsis

subroutine PARTI scattersched data target size func

Parameter Declarations

integer sched refers to the relevant schedule

data p oints to data values that will form op erands for the sp ecied typ e of remote

op eration

target p oints to rst memory lo cation on calling pro cessor to b e used as targets of remote

op erations

integer size size of the data structure

func function to b e used

Return Value

None

Similar to scatter FUNC subroutine but the function is provided by the user The parameter

func sp ecies the op eration for scatter routine For standard functions the following constants

can b e used

NULL       no-operation scatter function

PARTIadd   integer addition

PARTIsub   integer subtraction

PARTImul   integer multiplication

PARTIcadd  character addition

PARTIcsub  character subtraction

PARTIcmul  character multiplication

PARTIfadd  float addition

PARTIfsub  float subtraction

PARTIfmul  float multiplication

PARTIdadd  double addition

PARTIdsub  double subtraction

PARTIdmul  double multiplication
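As an illustration, the following minimal sketch (not taken from the library distribution) accumulates floating-point contributions into a target array with float addition; the schedule sched is assumed to have been built earlier (for example by PARTI_schedule), elem_size stands for whatever per-element size the application passes for its data structure, and the PARTIfadd constant is assumed to be provided by the CHAOS interface.

c     Sketch only: scatter local floating-point contributions into the
c     target array using float addition.  "sched" and "elem_size" are
c     assumed to have been set up earlier by the application.
      integer sched, elem_size
      real buffer(1000), target(1000)

      call PARTI_scatter(sched, buffer, target, elem_size, PARTIfadd)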

subroutine PARTI_mulgather

Synopsis

PARTI_mulgather(sched, size, narray, data_1, target_1, ..., data_n, target_n)

Parameter Declarations

integer sched    refers to the relevant schedule

integer size    size of the data structure

integer narray    number of arrays

data    points to the data values that will form operands for the specified type of remote operation

target    points to the first memory location on the calling processor to be used as the target of the remote operations

Return Value

None

The gather routine for multiple arrays.

subroutine PARTI_mulscatter

Synopsis

PARTI_mulscatter(sched, func, size, narray, data_1, target_1, ..., data_n, target_n)

Table: CHAOS Procedures for Adaptive Problems

Task                     Functionality       Procedure

Inspector
  Lightweight schedule   compute schedule    schedule_proc

Executor
  Data transportation    exchange data       PREFIXscatter_append
                         restore data        PREFIXscatter_back
                         exchange data       PREFIXmultiscatter_append
                         restore data        PREFIXmultiscatter_back
                         exchange data       PREFIXmultiarr_scatter_append
                         restore data        PREFIXmultiarr_scatter_back

Parameter Declarations

integer sched    refers to the relevant schedule

func    function to be used

integer size    size of the data structure

integer narray    number of arrays

data    points to the data values that will form operands for the specified type of remote operation

target    points to the first memory location on the calling processor to be used as the target of the remote operations

Return Value

None

The scatter routine for multiple arrays.

CHAOS Procedures for Adaptive Problems

Introduction

In a class of highly adaptive problems, the patterns of data access vary frequently. As a result, preprocessing (an inspector) must be carried out whenever the data access pattern changes. The implication is that the cost of the inspector can hardly be amortized, because the communication schedule produced by the inspector procedures for one time step may not be reusable in subsequent time steps. Inspector procedures for application codes in which data access patterns change only partially are discussed in the section on inspectors for partially modified data access patterns.

A set of primitives for highly adaptive problems has been developed. These procedures are particularly suitable for applications where the order of data storage and computation is not strictly maintained. The procedures developed for adaptive problems compute lightweight schedules at very low cost. These procedures compute space requirements for data migration on each processor and determine how collective communication can be performed. The data transportation routines developed for this purpose use these schedules to perform irregular communication efficiently.

function schedule_proc

This function returns a communication schedule, as PARTI_schedule does, but the schedule generated by schedule_proc does not contain information on the addresses in the destination processors for off-processor data items.

Synopsis

integer schedule_proc(proc, ndata, new_ndata, maxdim)

Parameter declarations

integer proc    a list of destination processor indices

integer ndata    the number of local array elements

integer new_ndata    number of new local array elements

integer maxdim    maximum number of columns or rows from which the distributed arrays will be exchanged

Return value

an integer value representing a lightweight communication schedule
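A minimal usage sketch (illustrative only; the destination-processor list proc would normally be produced by a user-supplied load balancer, as in the example at the end of this section, and maxdim is set to 1 here on the assumption of one-dimensional data):

c     Sketch: build a lightweight schedule from a destination-processor
c     list that has one entry per local array element.
      integer sched, ndata, new_ndata
      integer proc(1000)

c     ... fill proc(1:ndata) with destination processor indices ...

      sched = schedule_proc(proc, ndata, new_ndata, 1)
c     on return, new_ndata holds the number of elements this processor
c     will own after the data exchange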

subroutine PREFIXscatter_append

PREFIX can be d (double precision), i (integer), f (float), or c (character). The PREFIXscatter_append procedure uses a schedule produced by a call to schedule_proc. The ownership of distributed array data is exchanged among the participating processors, and a copy of the new local portion of the distributed array is placed in the memory pointed to by target.

Synopsis

subroutine PREFIXscatter_append(sched, data, target)

Parameter declarations

integer sched    a lightweight schedule for data exchange

TYPE data    a distributed array pointer

TYPE target    a pointer to the buffer where the new copy of the local portion of the array will be placed

Return value

None

subroutine PREFIXscatter_back

PREFIX can be d (double precision), i (integer), f (float), or c (character). The PREFIXscatter_back procedure uses the same schedule that was previously used by the PREFIXscatter_append function to reverse the effects of the data exchange done by PREFIXscatter_append. Each element of the distributed array is restored to its original position within the processor that owned it before the data exchange.

Synopsis

subroutine PREFIXscatter_back(sched, data, target)

Parameter declarations

integer sched    a lightweight schedule for data exchange

TYPE data    a distributed array pointer

TYPE target    a pointer to the buffer from which the copy of the local portion of the array is to be restored

Return value

None

subroutine PREFIXmultiscatter_append

This subroutine is an extension of PREFIXscatter_append that allows one to specify a number of schedules to be used to exchange distributed arrays; it also allows data exchange of multidimensional arrays. The different schedules to be used are passed to PREFIXmultiscatter_append in an array of schedules, and the number of schedules must be specified as a parameter to this subroutine.

Synopsis

subroutine PREFIXmultiscatter_append(nsched, scheds, data, target, base_shift, dilation, repetition)

Parameter declarations

integer nsched    the number of schedules passed

integer scheds    an array of schedules

TYPE data    a distributed array

TYPE target    a pointer to the buffer where the new copy of the local portion of the array will be placed

integer base_shift    size of the distributed dimension

integer dilation    Not used

integer repetition    Not used

Return value

None

subroutine PREFIXmultiscatter_back

The functionality of this subroutine is the same as that of PREFIXscatter_back, except that it allows one to use a number of schedules.

Synopsis

subroutine PREFIXmultiscatter_back(nsched, scheds, data, target, base_shift, dilation, repetition)

Parameter declarations

integer nsched    the number of schedules passed

integer scheds    an array of schedules

TYPE data    a distributed array whose ownership will be restored

TYPE target    a pointer to the buffer from which the copy of the array is to be restored

integer base_shift    size of the distributed dimension

integer dilation    Not used

integer repetition    Not used

Return value

None

subroutine PREFIXmultiarr_scatter_append

This subroutine is another extension of PREFIXscatter_append which can be used to exchange multiple arrays that are distributed in a similar manner.

Synopsis

subroutine PREFIXmultiarr_scatter_append(sched, narrays, data_1, target_1, ..., data_n, target_n)

Parameter declarations

integer sched    a lightweight schedule for data exchange

integer narrays    the number of (data_i, target_i) pairs

TYPE data_i    array to be distributed

TYPE target_i    a pointer to the buffer where the new copy of the local portion of the array will be placed

Return value

None
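For example, two identically distributed double-precision arrays can be moved with a single call (a sketch; the array names are placeholders, and the spelling of the entry point assumes PREFIX = d concatenated with the routine name, following the convention used in the example at the end of this section):

c     Sketch: exchange two identically distributed double precision
c     arrays with one lightweight schedule; the new local portions
c     are placed in xnew and ynew.
      integer sched
      double precision x(1000), xnew(1000), y(1000), ynew(1000)

      call dmultiarr_scatter_append(sched, 2, x, xnew, y, ynew)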

subroutine PREFIXmultiarr_scatter_back

This subroutine is the multiple-array counterpart of PREFIXscatter_back; it restores the data of multiple arrays that are distributed identically.

Synopsis

subroutine PREFIXmultiarr_scatter_back(sched, narrays, data_1, target_1, ..., data_n, target_n)

Parameter declarations

integer sched    a lightweight schedule for data exchange

integer narrays    the number of (data_i, target_i) pairs

TYPE data_i    a distributed array whose data will be restored

TYPE target_i    a pointer to the buffer from which the copy of the array is to be restored

Return value

None

Example

      integer ndata
      double precision x(ndata), y(ndata), load(ndata)

c     outer time-step loop
      do k = 1, timestep
c        computation: inner loop over the data points
         do i = 1, ndata
            do j = 1, load(i)
               x(i) = ... y(i) ...
            enddo
            load(i) = ... x(i) ...
         enddo
      end do

Consider the above sequential code. The inner computation loop is executed many times inside the outer time-step loop. The computational load of iteration i depends on the value of load(i), which is updated in every iteration k of the outer loop. Note that there is no communication involved. To run this code efficiently on a parallel machine, the load on the processors must be balanced for every iteration of the outer loop.

Let us see how to parallelize this code. Assume that the data arrays data, x, and y are identically (and possibly irregularly) distributed among several processors, and that they need to be remapped to balance the workload. The array workload stores the load information for each data point, and a user-specified load-balancing primitive, load_balancer, generates a list of destination processor indices from the load information passed to it. A lightweight schedule sched is computed using the procedure schedule_proc. Since the arrays x, y, and load are distributed identically, the same schedule can be used to transport all the arrays to their new locations.

      integer sched, ndata, newnd, BUFSIZE
      parameter (BUFSIZE = ...)
      integer proc(BUFSIZE)
      real workload(BUFSIZE), data(BUFSIZE)
      double precision x(BUFSIZE), y(BUFSIZE), load(BUFSIZE)

c     outer time-step loop
      do k = 1, timestep

c        compute the workload of each data point
         do i = 1, ndata
            workload(i) = ...
         enddo

c        initialize CHAOS environment
         call PARTI_setup

c        produce a list of destination processor indices
         call load_balancer(proc, workload, ndata)

c        build up a schedule for data exchange
         sched = schedule_proc(proc, ndata, newnd)

c        check the buffer size for the new local portion of arrays
         if (newnd .gt. BUFSIZE) then
            write (*,*) 'Exceeded the local buffer size'
            call exit
         endif

c        exchange distributed data arrays among participating processors
         call fscatter_append(sched, data, data)
         call dmultiarr_scatter_append(sched, 3, x, x, y, y, load, load)

c        local computation over the new local portion
         do i = 1, newnd
            ...
         enddo

c        restore distributed data arrays
         call fscatter_back(sched, data, data)
         call dmultiarr_scatter_back(sched, 3, x, x, y, y, load, load)

      end do

Translation Table

function build_translation_table

In order to allow a user to assign globally numbered indices to processors in an irregular pattern, it is useful to be able to define and access a distributed translation table. By using a distributed translation table it is possible to avoid replicating, on all processors, records of where distributed array elements are stored. The distributed table is itself partitioned in a very regular manner. A processor that seeks to access an element I of an irregularly distributed data array computes a simple function that designates a location in the distributed table; the location of the actual array element sought is then obtained from the distributed table.

The procedure build_translation_table constructs a distributed translation table. It assumes that distributed array elements are globally numbered. Each processor passes build_translation_table a set of indices for which it will be responsible. The distributed translation table may be striped or blocked across the processors. With a striped translation table, the translation table entry for global index I is stored on processor I modulo the number of processors, and the local index in the translation table is I divided by the number of processors. In a blocked translation table, translation table entries are partitioned into a number of equal-sized ranges of contiguous integers, and these ranges are placed on consecutively numbered processors. With blocked partitioning, the block corresponding to index I is I/B and the local index is I modulo B, where B is the size of the block. Let M be the maximum global index passed to build_translation_table by any processor and NP represent the number of processors; then B = ceiling(M/NP).
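The owner computation described above can be illustrated with a small sketch (illustrative only, assuming zero-based global indices for simplicity; the library performs the equivalent lookup internally when dereference is called):

c     Illustrative owner/offset arithmetic for a translation table.
c     NP is the number of processors and B the block size.
      integer function striped_owner(I, NP)
      integer I, NP
c     striped table: the entry for index I lives on processor I mod NP
      striped_owner = mod(I, NP)
      end

      integer function blocked_owner(I, B)
      integer I, B
c     blocked table: the entry for index I lives in block I/B
      blocked_owner = I / B
      end

      integer function blocked_offset(I, B)
      integer I, B
c     local offset of index I within its block
      blocked_offset = mod(I, B)
      end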

Synopsis

function build_translation_table(part, index_array, ndata)

Parameter Declarations

integer part    how the translation table will be mapped; may be BLOCKED or STRIPED

integer index_array    each processor P specifies a list of globally numbered indices for which P will be responsible

integer ndata    number of indices for which processor P will be responsible

Return Value

an integer which refers to the translation table corresponding to the input data

Example

For a detailed example, refer to the example following the description of dereference below.

subroutine dereference

The subroutine dereference accesses a translation table and determines the owner processors, and the local addresses on those owner processors, for a list of globally numbered array elements. dereference is passed a pointer to a translation table; this structure defines the irregularly distributed mapping that was created. dereference is also passed an array of global indices that need to be located in distributed memory. dereference returns arrays local and proc that contain the processors and local indices corresponding to the global indices.

Synopsis

subroutine dereference(global, local, proc, ndata, index_table)

Parameter declarations

integer global    list of global indices we wish to locate in distributed memory

integer local    local indices obtained from the distributed translation table that correspond to the global indices passed to dereference

integer proc    array of distributed translation table processor assignments for each global index passed to dereference

integer ndata    number of elements to be dereferenced

integer index_table    refers to the relevant translation table

Return value

None

Example

Table: Values obtained by dereference (the proc and local values returned on each processor; the specific values are omitted here).

A one-dimensional distributed array is partitioned in some irregular manner, so we need a distributed translation table to keep track of where one can find the value of a given element of the distributed array.

In the example below we show how a translation table is initialized. Each of the two processors calls build_translation_table and passes the two global indices for which it will be responsible. The translation table is partitioned between the processors in blocks.

Each processor then uses the translation table to dereference two global indices. On each processor, dereference carries out a translation table lookup. The values of proc and local returned by dereference are shown in the table above. The user gets to specify the processor to which each global index is assigned; note, however, that build_translation_table assigns the local indices.

program dref

c

integer size i indexarray

integer derefarray

integer local proc

logical buildtranslationtable

c initialize CHAOS environment

call PARTIsetup

c Assign indices and to processor

mysize

ifmynodeeq then

indexarray

indexarray

endif

c Assign indices and to processor

ifmynodeeq then

indexarray

indexarray

endif

c set up a translation table

itable buildtranslationtableindexarraymysize

c Processor seeks processor and local indices

c for global array indices and

ifmynodeeq then

derefarray

derefarray

endif

c Processor seeks processor and local indices

c for global array indices and

ifmynodeeq then

derefarray

derefarray

endif

c Dereference a set of global variables

call dereferencetablederefarraylocalprocsize

c local and proc return the processors and local indices where

c global array indices are stored

c In processor proc proc local local

c In processor proc proc local local

stop

end

Now assume that one processor needs to know the values of two distributed array elements while the other processor needs to know the value of a single element. We call dereference to find the processors and the local indices that correspond to each global index. At this point a schedule can be built (for example with PARTI_schedule) and the gathers and scatters carried out.

function build_reg_translation_table

Builds a translation table for a regular distribution and returns an id for the translation table built. The distribution can be blocked or cyclic in the current implementation; no other regular distribution is supported. As before, however, provisions for an analytic user function and for parameterized general block distributions have been made. The translation tables returned by this function do NOT explicitly list all global indices.

Synopsis

build_reg_translation_table(dist_type, data_size)

Parameter declarations

integer dist_type    distribution type: BLOCK or CYCLIC

integer data_size    global data array size

Return value

an integer representing a translation table

function build_dst_translation_table

Builds a translation table for a degenerate irregular distribution. index is the list of global indices for which the calling processor is responsible; nindex denotes the size of the index array. repf is a floating-point number between 0 and 1, inclusive, denoting the replication factor for the translation table storage (more on this below). psize is the page size for the page decomposition of the translation table contents. The function returns an ID for the table generated, to be used by the dereference functions.

Synopsis

build_dst_translation_table(index, nindex, repf, psize)

Parameter declarations

integer index    list of global indices for which the calling processor is responsible

integer nindex    size of the index array

real repf    replication factor

integer psize    page size

Return value

an integer representing a translation table

This type of translation table lists all the global indices and their location assignments in distributed memory. The global list of indices is decomposed into pages of size psize, and the storage is managed at the page level rather than at the level of individual indices. The pages are distributed among the processors using a block distribution. Furthermore, each processor replicates repf x N/P pages in its local memory, where N denotes the total number of pages and P is the number of processors in the system.

function init_ttable_with_proc

A partial translation table that stores only processor information for a distribution can be built using the function init_ttable_with_proc.

Synopsis

function init_ttable_with_proc(part, proc, ndata)

Parameter Declarations

integer part    how the translation table will be mapped; may be BLOCKED

integer proc    list of owner processor numbers for a global data array

integer ndata    size of the proc array

Return Value

an integer which refers to the translation table corresponding to the input data

Example
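A sketch of typical usage (the processor-assignment list is assumed to have been produced by one of the coloring partitioners described later in this manual, and BLOCKED is assumed to be a constant supplied by the CHAOS interface):

c     Sketch: build a partial (processor-only) translation table from a
c     processor-assignment list produced by a partitioner.
      integer tt, mysize
      integer maparr(1000)

c     ... maparr(1:mysize) filled with owner processors, e.g. by
c     CoorBisecMap ...

      tt = init_ttable_with_proc(BLOCKED, maparr, mysize)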

subroutine free_table

Deallocates the storage space associated with a translation table.

Synopsis

free_table(trans_table)

Parameter declarations

integer trans_table    translation table id

Return value

None

subroutine derefproc

Same as dereference, but the function does not return the offset assignments of the global indices. The translation table argument is generic: both irregular and regular distributions use the same function.

Synopsis

derefproc(trans_table, index_array, proc, ndata)

Parameter declarations

integer trans_table    translation table id

integer index_array    index list to be dereferenced

integer ndata    size of the index array

integer proc    list of processor numbers

Return value

None

subroutine derefoffset

Same as dereference, but the function does not return the processor assignments of the global indices. The translation table argument is generic: both irregular and regular distributions use the same function.

Synopsis

derefoffset(trans_table, index_array, local, ndata)

Parameter declarations

integer trans_table    translation table id

integer index_array    index list to be dereferenced

integer ndata    size of the index array

integer local    list of local offsets

Return value

None

subroutine remap_table

Updates the translation table entries to reflect changes in the distribution of data. This function does not perform any action if the old distribution was regular. oldt is the translation table containing the current distribution of data; newIndex is the new set of global indices for which the calling processor is responsible; nindex is the size of newIndex. The size of the global data space should be the same for the new distribution and the current distribution. The effect of this function is that it updates the translation table so that it contains entries for the new data distribution. Note: space is reused, so the user should not free the old translation table space.

Synopsis

remap_table(oldt, newIndex, nindex)

Parameter declarations

integer oldt    translation table id

integer newIndex    new index set

integer nindex    size of the new index list

Return value

None

function tableGetReplication

Returns the replication factor for a distributed translation table. For a regular distribution it returns the symbolic value TT_ERROR.

Synopsis

real tableGetReplication(table)

Parameter declarations

integer table    translation table id

Return value

a real number which gives the replication factor

function tableGetPageSize

Returns the page size for a distributed translation table. For a regular distribution it returns the symbolic value TT_ERROR.

Synopsis

integer tableGetPageSize(table)

Parameter declarations

integer table    translation table id

Return value

an integer which gives the page size of the table

subroutine tableGetIndices

Returns the list of indices owned by the calling processor, in the global address space.

Synopsis

tableGetIndices(table, indices, nindex)

Parameter declarations

integer table    translation table id

integer indices    local set of indices

integer nindex    size of the local index set

Return value

None
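A short sketch of querying a distributed translation table (illustrative; tt is assumed to be an id returned by build_dst_translation_table, and the buffer size is a placeholder):

c     Sketch: inspect a paged translation table built earlier.
      integer tt, psize, nown
      integer myindices(1000)
      real repf

      repf  = tableGetReplication(tt)
      psize = tableGetPageSize(tt)
c     retrieve the global indices owned by this processor; nown gives
c     the size of the returned local index set
      call tableGetIndices(tt, myindices, nown)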

Miscellaneous Procedures

subroutine PARTI_setup

Synopsis

PARTI_setup()

Parameter declarations

None

Return value

None

The procedure PARTI_setup is called once at the beginning of the application program. It sets up buffer space and environment variables.

subroutine renumber

Synopsis

renumber(table, index, array, size, rarray)

Parameter declarations

Table: CHAOS runtime data mapping procedures

Task                  Function                                 Primitive

Data partitioning     initializing the hash table              init_rdg_hash_table
                      generating the local graph               eliminate_dup_edges
                      generating the distributed graph         generate_rdg

Loop iteration        generating the loop iteration graph      dref_rig
partitioning          partitioning the loop iteration graph    iteration_partitioner

Remap                 generating a schedule to remap array     remap
                      indices and other aligned data arrays

integer table    translation table id

integer index    local index set, in global numbering

integer array    array to be renumbered

integer size    size of the array / index list

integer rarray    array with renumbered values

Return value

None

Runtime Data Distribution

In scalable multiprocessor systems, high performance demands that the computational load be balanced evenly among processors and that interprocessor communication be minimized. Over the past few years a great deal of work has been carried out in the area of mapping irregular problems onto distributed memory multicomputers. As a result, several general heuristics have been proposed for efficient data mapping. Currently these partitioners must be coupled to user programs manually. A standard interface can, however, be used to link these partitioners with programs at runtime. We use a distributed data structure to represent the array access patterns that arise in particular loops of a program. This data structure is passed to the graph partitioner, and the partitioner returns a data distribution. In this section we present the CHAOS procedures that can be used to generate the distributed data structure, link in the partitioners, and support data redistribution.

Runtime Data Graph

To implement the data partitioning we generate a distributed data structure called the Runtime Data Graph, or RDG. The runtime graph is generated from the distributed array access patterns in the user-selected loops. Here it is assumed that all distributed arrays considered for RDG generation are to be partitioned in the same way and that they are of the same size.

Figure: An example of runtime data graph generation. The figure shows a selected loop with an assignment of the form S: y(IA(i)) = ... x(IB(i)) ..., the input indirection arrays IA and IB, and the resulting runtime data graph, both as a graph and in compressed row format (adjacency list array and adjacency list pointer array). The specific index values are omitted here.

Node i of an RDG represents element i of all distributed arrays, if more than one distributed array is used for graph generation.

An RDG is constructed by executing a modified version of the loop which forms a list of edges instead of performing numerical calculations. The graph partitioners that we consider divide the graph into equal subgraphs with as few edges as possible between them. The intent is that two nodes in the RDG are linked if one is used to compute the other, so that they will be allocated to the same processor. There is an edge between nodes i and j in the RDG if, on some iteration of the loop, an assignment statement writes element i of an array (i.e., x(i) appears on the left-hand side) and references element j (i.e., y(j) appears on the right-hand side), or vice versa.

The figure shows a simple example of sequential generation of a runtime graph. The statement S in the figure has the indirections IA and IB on the left- and right-hand sides. In this example the arrays x and y, which are of the same size, are considered for graph generation. Each vertex in the graph represents an array element, so the runtime graph has one vertex per array element. The RDG is constructed by adding an undirected edge between the node pairs representing the left-hand side (IA(i)) and right-hand side (IB(i)) array indices for each loop iteration i. For instance, during the first loop iteration, edges between the vertices corresponding to the left-hand side and right-hand side indices of that iteration are added to the graph. The runtime graph is stored in a format closely related to compressed sparse row format.
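A minimal sketch of the sequential construction just described (illustrative only; add_edge is a placeholder for the edge-insertion step, which in CHAOS is carried out by the hash-table procedures described below):

c     Sketch: form the RDG by executing a modified version of the loop
c     that records index pairs instead of performing arithmetic.
      integer i, n
      integer IA(1000), IB(1000)

      do i = 1, n
c        an undirected edge links the written element and the read element
         call add_edge(IA(i), IB(i))
      enddo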

The next figure illustrates the parallel RDG generation steps. Initially, data arrays and loop iterations are divided among processors in uniform blocks. Each processor generates a local RDG using the array access patterns that occur in its local loop iterations. For clarity, the local RDG is shown as an adjacency matrix in the figure. The local graphs are then merged to form a distributed graph. During merging, if we view each local graph as an adjacency matrix stored in compressed row format, then processor P0 collects all entries of the first N/P rows of the matrix from all other processors, where N is the number of nodes (the array size) and P is the number of processors; processor P1 collects the next N/P rows of the matrix, and so on. Processors remove duplicate entries as they collect the adjacency list entries.

Figure: Parallel generation of the runtime data graph. Each processor first generates a local N x N graph representing the loop's array access pattern; the local graphs are then merged to produce a distributed graph in which each processor holds an N/P x N slice.

do i nedges

n ndei

n ndei

n ndei

S yn yn xn xn xn

S yn yn xn xn

enddo

Figure: An example sequential loop.

Data Mapping Procedures

In this section we present the CHAOS procedures that can be used for data mapping on MIMD machines. The kernel shown in the example-loop figure above is representative of the kind of loop that commonly occurs in a variety of sparse or unstructured codes; we use this kernel as a running example to illustrate the procedures. The CHAOS procedures carry out data mapping and loop iteration partitioning in parallel. To perform these operations in parallel, the loop iterations of the selected loops and the arrays to be partitioned are initially distributed among processors in uniform blocks. The next figure shows the modified version of the sequential loop and the initial data and loop distribution. The array local_ind_list in the figure holds the list of local array descriptors assigned to each processor, and n_local is the size of local_ind_list. The array local_iter_list holds the list of local iteration numbers. A subsequent figure shows parallel preprocessing code for the example, with the data mapping procedures incorporated.

do i nlocaledges

n localndei

n localndei

n localndei

S yn yn xn xn xn

S yn yn xn xn

enddo

For a small example, the original figure gives the sizes of the arrays x and y, the value of nedges, and the contents of nde, and then lists the initial data and loop iteration distributions on two processors: for each processor, its local_iter_list, n_local_edges, local_ind_list, n_local, and local_nde. The specific values are omitted here.

Figure: Modified example code for parallel preprocessing.

subroutine eliminate_dup_edges

The procedure eliminate_dup_edges generates a local runtime graph on each processor. This procedure is called once for each statement in the selected loops that accesses the distributed arrays on both the left-hand side and the right-hand side. For all such statements, the left-hand side (lhs_ind) array indices and the corresponding list of right-hand side (rhs_ind) array indices for each local iteration are generated separately. For example, in the modified example loop the statements S1 and S2 access the distributed arrays on both the left- and right-hand sides. Statement S1 has two distinct array indices on the right-hand side for each left-hand side index, whereas statement S2 has only one. The left-hand side and right-hand side index pairs are generated separately for each statement, and the procedure is called twice.

The procedure eliminate_dup_edges takes as input the lists lhs_ind and rhs_ind, the size of lhs_ind (n_count), and the number of unique rhs_ind indices per lhs_ind entry (n_dep). The procedure generates the local graph by adding undirected edges between the left-hand side and right-hand side index lists and stores it in a hash table. An undirected edge between nodes i and j is formed by adding j to the adjacency node list of node i, and vice versa.

The procedure eliminates any duplicate edges and also self edges, since these carry no potential for communication. The current version of the procedure does not distinguish between edges that connect the same pair of nodes but arise from different pairs of arrays; a future version of this procedure will take this case into account and will produce graphs with weighted edges.


btable buildtranslationtabletype localindlist nlocal

C Initialize hash table to store runtime data graph

ndep

ndep

hashindex initrdghashtablendepndepnlocaledges

jcount

icount

kcount

lcount

do i nlocaledges

n localndei

n localndei

n localndei

C S yn yn xn xn xn

lhsindicount n

rhsindjcount n

jcount jcount

rhsindjcount n

jcount jcount

C S yn yn xn xn

lhsindicount n

rhsindlcount n

lcount lcount

icount icount

enddo

CGenerate local run time graph

call eliminatedupedgeshashindex lhsind rhsind ndep icount

call eliminatedupedgeshashindex lhsind rhsind ndep icount

CGenerate distributed run time graph

call generaterdghashindex localindlist

nlocal csrptr csrcols ncols

C call parallel graph partitioner

call parallelrsblocalindlist nlocal csrptr csrcols ntable

C Remap array indices

call remapntable localindlist sched newindlist newsize

call dgatherx x sched

call dgathery y sched

Figure: Parallel preprocessing code for data mapping.

Synopsis

eliminate_dup_edges(hashindex, lhs_ind, rhs_ind, n_count, n_dep)

Parameter declarations

integer hashindex    an integer identifying the hash table that stores the RDG

integer lhs_ind    list of the left-hand side array indices of all local iterations

integer rhs_ind    list of the unique right-hand side indices for each local iteration

integer n_count    number of left-hand side indices

integer n_dep    number of unique right-hand side indices for each left-hand side index in the considered statement

Return value

None

Example

Assume that in the example kernel two processors are employed and that the arrays x and y are to be mapped in a conforming manner and are of the same size. The RDG is generated based on the access patterns of the arrays x and y in the modified loop shown earlier. The statements S1 and S2 in the kernel are not executed while generating the graph. Initially the arrays x and y and the loop iterations are distributed in uniform blocks, as shown in the figure giving the initial distributions. The modified loop is executed in parallel to form a local list of array access patterns on each processor, as shown in the parallel preprocessing figure. The lists lhs_ind1 and rhs_ind1 hold the left-hand side and right-hand side array access patterns of statement S1, and lhs_ind2 and rhs_ind2 hold the patterns of S2. All processors then call the procedure eliminate_dup_edges with these lists of left-hand side and right-hand side array indices. The RDG generated by the procedure is shown in the example-output figure below. The RDG is stored in the hash table identified by the integer hashindex. The RDG obtained using S1 is not altered by statement S2, since S2 contributes no new pairs of array indices.

subroutine generate_rdg

Once the array access patterns in a loop have been recorded in a hash table by the procedure eliminate_dup_edges, the procedure generate_rdg can be called on each processor to flatten its local adjacency lists in the hash table into an adjacency list data structure closely related to Compressed Sparse Row (CSR) format. This procedure performs a global scatter operation, resolving collisions by appending lists, and then combines these local lists into a complete graph, also represented in CSR format. This data structure is distributed so that each processor stores the adjacency lists for a subset of the array elements.

Synopsis

generate_rdg(hashindex, local_ind_list, n_local, csr_ptr, csr_col, ncols)

Parameter declarations

Figure: Example output of eliminate_dup_edges. The original figure lists, for each of the two processors, the contents of lhs_ind1, rhs_ind1, lhs_ind2, and rhs_ind2, together with the RDG held in the hash table after the eliminate_dup_edges call for statement S1 and after the call for statement S2. The specific index values are omitted here.

integer hashindex    an integer identifying the hash table which stores the RDG

integer local_ind_list    list of local array descriptors

integer n_local    size of local_ind_list

integer csr_ptr    list of CSR-format pointers pointing into csr_col

integer csr_col    adjacency lists for the indices occurring in local iterations

integer ncols    size of csr_col, returned by generate_rdg

Return value

None

Example

Figure: Example output of generate_rdg. The original figure lists, for each of the two processors, the csr_col and csr_ptr arrays after the flattening step and after the merging step; the specific values are omitted here.

The flattening process groups the edges for each array element together and stores them in a list csr_col. A list of pointers, csr_ptr, identifies the beginning of the edge list for each array element in csr_col. Since the flattening process uses only the local hash table, there is no communication between processors. In the merging process, processor P0 collects the adjacency lists for one block of array indices and processor P1 collects those for the next block, as shown in the figure.

function init_rdg_hash_table

The hash table used in the procedures eliminate_dup_edges and generate_rdg can be initialized by using the procedure init_rdg_hash_table. This procedure is called with the initial number of expected entries in the hash table, for memory allocation. Extra memory space is allocated automatically when the hash table entries overflow the initial allocation.

Synopsis

function init_rdg_hash_table(size)

Parameter declarations

integer size    number of expected entries in the hash table

Return value

An integer identifying the hash table

Loop Iteration Partitioning Procedures

Once the new array distribution has been identified, loop iteration partitioning procedures are called to distribute the loop iterations. By distributing the loop iterations, these procedures balance the computation among processors and reduce off-processor memory accesses. A later figure shows preprocessing code for mapping the loop iterations of the example kernel. In the following sections the loop iteration partitioning procedures dref_rig and iteration_partitioner are discussed.

subroutine dref_rig

To partition loop iterations we use the Runtime Iteration Graph (RIG), which lists, for each loop iteration, the distinct distributed array elements referenced. For example, in the example kernel each loop iteration accesses three different array indices (n1, n2, n3) of the distributed arrays; in this case the RIG holds the list of n1, n2, and n3 for every local iteration. Using the RIG, the procedure dref_rig generates a Runtime Iteration Processor Assignment graph (RIPA), which holds the list of processors that own the array elements in the RIG. A future version of this procedure will support a RIG with a weight associated with each entry in the graph.

Synopsis

dref_rig(ttable, rig, niter, n_ref, ripa)

C Create translation table with the current loop iteration list

ttable buildtranslationtable localiterlist nlocaledges

icount

nref

do i nlocaledges

n localndei

n localndei

n localndei

C S yn yn xn xn xn

C S yn yn xn xn

rigicount n

rigicount n

rigicount n

icount icount

enddo

C Generate runtime iteration graph

call drefrigntable rig nlocaledges nref ripa

C Partition runtime iteration graph

call iterationpartitionerripa nlocaledges nref ltable

C Remap loop iterations

call remapltable localiterlist sched newiterlist newloopsize

call igatherlocalnde localnde sched

call igatherlocalnde localnde sched

call igatherlocalnde localnde sched

Figure: Example code with loop iteration partitioning procedures.

Parameter declarations

integer ttable    an integer identifying the distribution translation table, which describes the distribution of the arrays returned by the partitioner

integer rig    list of distinct indices accessed by each local iteration

integer niter    number of local iterations

integer n_ref    number of unique indices accessed per iteration

integer ripa    list of processors that own each entry in rig

Return value

None

subroutine iteration_partitioner

The current version of the mapper procedure iteration_partitioner takes the RIPA as input and assigns iterations to processors, assigning each iteration to the processor that owns the most data referenced by that iteration.

Figure: Example output of dref_rig and iteration_partitioner. The original figure lists, for each of the two processors, new_ind_list and n_ref, the rig and ripa arrays after dref_rig, and new_iter_list after iteration_partitioner and remap followed by a gather. The specific values are omitted here.

The new loop iteration distribution is described by a translation table.

Synopsis

iteration_partitioner(ripa, niter, n_ref, itable)

Parameter declarations

integer ripa    list of processors that own each entry in rig

integer niter    number of local iterations

integer n_ref    number of unique indices accessed per iteration

integer itable    an integer identifying the distribution translation table

Return value

None

Example

Assume that, as in the preprocessing figures, one block of array elements is distributed to P0 and another block to P1, and that the loop iterations are partitioned initially as shown in the modified example code; the example-output figure above then shows the new loop distribution.

The procedure dref_rig returns a list of processor numbers (ripa) which own the indices referenced in each local iteration. This list is obtained by dereferencing the translation table ttable; ttable describes the new distribution of the arrays. Note that the loop iteration partitioning can only be done after the arrays have been remapped (see the Data Remapping section below), based on the new distribution returned by the partitioner. The procedure iteration_partitioner assigns each loop iteration to a processor that owns the maximum number of indices accessed in that iteration; ties are broken arbitrarily. The procedure returns the loop iteration mapping for its initial list of iterations, local_iter_list, in the form of a translation table, ltable. Each processor then calls the procedure remap to obtain a schedule sched. This schedule can be used to get the new iteration list new_iter_list using the CHAOS data exchanger gather.

Data Remapping

Once the new array distribution has been identified, the array index list must be remapped based on the new distribution. The data arrays associated with the distributed array must also be remapped. The procedure remap takes as input the translation table new_table, describing the new array mapping, and a list of initial array indices local_ind_list. It returns a schedule sched and a new index list new_ind_list. A schedule stores send/receive patterns and can be used to move data among processors using the CHAOS data exchangers described earlier. The returned schedule can be used to remap the data arrays associated with the distributed array descriptors.
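The typical calling sequence, mirroring the parallel preprocessing figure earlier in this section (the variable names, and the use of each array as both source and target of the gather, follow that figure), is:

c     Sketch: remap the local index list according to the new
c     distribution described by new_table, then move the associated
c     data arrays with the returned schedule.
      integer new_table, sched, nind
      integer local_ind_list(1000), new_ind_list(1000)
      double precision x(1000), y(1000)

      call remap(new_table, local_ind_list, sched, new_ind_list, nind)
c     transport the data arrays aligned with the remapped indices
      call dgather(x, x, sched)
      call dgather(y, y, sched)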

subroutine remap

Synopsis

remap(newtable, local_ind_list, sched, new_ind_list, nind)

Parameter declarations

integer newtable    translation table index describing the new mapping

integer local_ind_list    list of initial local indices, in global numbering

integer sched    schedule index, returned

integer new_ind_list    list of new local indices, in global numbering

integer nind    number of new local indices

Return value

None

Example

In the parallel preprocessing figure, the arrays x and y, and any data arrays associated with them, are initially distributed in a conforming manner. The procedure parallel_rsb on each processor returns a new distribution for the initial local array elements of x and y. This distribution is returned in the form of a translation table new_table. The procedure remap returns a new list of indices new_ind_list for the arrays x and y on each processor, based on the new partition. It also returns a schedule sched which is used to remap the data arrays. These associated arrays are actually transported using the CHAOS data exchange procedure gather.

Parallel Partitioners

In this section we present the parallel partitioners that are distributed along with the CHAOS runtime library. The first two partitioners, namely the recursive coordinate bisection and inertial bisection partitioners, use spatial information. The third parallel partitioner, recursive spectral bisection, uses graph connectivity information.

Table: Parallel Partitioners

Partitioner              Input                  Procedure

Recursive coordinate     geometry               CoorBisecMap
bisection                geometry               CoorBisec
                         geometry and weight    CoorWeighBisecMap
                         geometry and weight    CoorWeighBisec

Recursive inertial       geometry               InerBisecMap
bisection                geometry               InerBisec
                         geometry and weight    InerWeighBisecMap
                         geometry and weight    InerWeighBisec

Spectral bisection       connectivity           parallel_rsb

Geometry Based Partitioners

There are two types of geometry partitioners: ones which use computational load information and ones which do not. They are implemented in two fashions: coloring and remapping. In the coloring implementation only the processor assignment of each element is returned, whereas in the remapping implementation the arrays used to specify the geometry information are automatically redistributed, and the size of the data arrays and the indices of the new local elements are returned. Therefore there are four possible versions of each partitioner, as shown in the table above.

Recursive Bisection Coloring

The procedure PREFIXBisecMap returns a processor assignment list maparray. PREFIX can be Coor for the recursive coordinate partitioner or Iner for the inertial bisection partitioner.

PREFIXBisecMap(maparray, ndata, ndim, x, {arg})

Parameters

integer maparray    result of partitioning

integer ndata    number of elements

integer ndim    number of dimensions

double precision x, {arg}    coordinate arrays
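For example, a two-dimensional coordinate-bisection coloring can be requested as follows (a sketch mirroring the appendix example program; the array sizes are placeholders):

c     Sketch: color local elements with the recursive coordinate
c     bisection partitioner using two coordinate arrays (ndim = 2).
      integer nlocal
      integer maparr(1000)
      double precision x(1000), y(1000)

      call CoorBisecMap(maparr, nlocal, 2, x, y)
c     maparr(i) now holds the processor assigned to local element i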

Weighted Recursive Bisection Coloring

Here the partitioners use additional information to partition the data: the computational load at each point is also considered. The load is specified by an array load.

PREFIXWeighBisecMap(maparray, load, ndata, ndim, x, {arg})

Parameters

integer maparray    result of partitioning

integer load    loads of the elements

integer ndata    number of elements

integer ndim    number of dimensions

double precision x, {arg}    coordinate arrays

Recursive Bisection with Remapping

PREFIXBisec(remaplevel, myindex, ndata, ndim, x, {arg})

Parameters

integer remaplevel    remap arrays every remaplevel levels

integer myindex    indices of local elements

integer ndata    number of elements

integer ndim    number of dimensions

double precision x, {arg}    coordinate arrays

The arrays used to specify the geometry information are automatically remapped to the new distribution. The parameter remaplevel is used to specify how often the coordinate arrays should be remapped, i.e., the coordinate arrays will be remapped and moved every remaplevel levels of the bisection.

The parameter myindex is used to keep track of how elements are remapped and moved. Users supply the indices of the current local elements; the partitioner returns the indices of the new local elements after remapping.

Users place the current number of local elements in ndata; the partitioner returns the number of new local elements in ndata. A minimal call sketch is given below.
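A minimal call sketch (the remap level of 1 and the array sizes are illustrative choices only):

c     Sketch: recursive coordinate bisection with remapping.  On entry
c     myindex and ndata describe the current local elements; on return
c     they describe the new local elements, and x and y have been
c     redistributed to match.
      integer ndata
      integer myindex(1000)
      double precision x(1000), y(1000)

c     remap the coordinate arrays at every bisection level
      call CoorBisec(1, myindex, ndata, 2, x, y)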

Weighted Recursive Bisection with Remapping

PREFIXWeighBisec(remaplevel, myindex, load, ndata, ndim, x, {arg})

Parameters

integer remaplevel    remap arrays every remaplevel levels

integer myindex    indices of local elements

integer load    loads of the elements

integer ndata    number of elements

integer ndim    number of dimensions

double precision x, {arg}    coordinate arrays

Recursive Spectral Partitioner

parallel_rsb(myindex, n_local, csr_ptr, csr_col, ntable)

Parameters

integer myindex    indices of local elements

integer n_local    number of local elements

integer csr_ptr    list of CSR-format pointers pointing into csr_col

integer csr_col    adjacency lists for all local indices

integer ntable    distribution returned in the form of a translation table

The CSR-format representation of the RDG (see the section on the runtime data graph) is passed to the procedure parallel_rsb. The parallel partitioner returns a pointer to a translation table, ntable, which describes the new array distribution. The parallel version is based on the sequential single-level spectral partitioner provided by Horst Simon. For efficiency reasons the user might want to use the multilevel version of the partitioner.

References

R. Das, Y.-S. Hwang, M. Uysal, J. Saltz, and A. Sussman. Applying the CHAOS/PARTI library to irregular problems in computational chemistry and computational aerodynamics. In Proceedings of the Scalable Parallel Libraries Conference, Mississippi State University, Starkville, MS. IEEE Computer Society Press, October.

Raja Das, Joel Saltz, and Reinhard von Hanxleden. Slicing analysis and indirect access to distributed arrays. Technical Report CS-TR and UMIACS-TR, University of Maryland, Department of Computer Science and UMIACS, May. Appears in LCPC.

Raja Das, Mustafa Uysal, Joel Saltz, and Yuan-Shin Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. Technical Report CS-TR and UMIACS-TR, University of Maryland, Department of Computer Science and UMIACS, October. Submitted to the Journal of Parallel and Distributed Computing.

R. v. Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz. Compiler analysis for irregular problems in Fortran D. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August.

Seema Hiranandani, Joel Saltz, Piyush Mehrotra, and Harry Berryman. Performance of hashed cache data migration schemes on multicomputers. Journal of Parallel and Distributed Computing, August.

R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and Kay Crowley. Principles of runtime support for parallel processors. In Proceedings of the ACM International Conference on Supercomputing, July.

R. Ponnusamy, R. Das, J. Saltz, and D. Mavriplis. The dybbuk. In IEEE COMPCON, San Francisco, February.

Ravi Ponnusamy, Joel Saltz, and Alok Choudhary. Runtime-compilation techniques for data partitioning and communication schedule reuse. In Proceedings of Supercomputing. IEEE Computer Society Press, November. Also available as University of Maryland Technical Report CS-TR and UMIACS-TR.

Ravi Ponnusamy, Joel Saltz, Alok Choudhary, Yuan-Shin Hwang, and Geoffrey Fox. Runtime support and compilation methods for user-specified data distributions. Technical Report CS-TR and UMIACS-TR, University of Maryland, Department of Computer Science and UMIACS, November. Submitted to IEEE Transactions on Parallel and Distributed Systems.

J. Saltz, H. Berryman, and J. Wu. Runtime compilation for multiprocessors. Concurrency: Practice and Experience.

Joel Saltz, Kathleen Crowley, Ravi Mirchandaney, and Harry Berryman. Runtime scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, April.

Joel H. Saltz, Ravi Mirchandaney, and Kay Crowley. Runtime parallelization and scheduling of loops. IEEE Transactions on Computers, May.

Shamik D. Sharma, Ravi Ponnusamy, Bongki Moon, Yuan-Shin Hwang, Raja Das, and Joel Saltz. Runtime and compile-time support for adaptive irregular problems. Submitted to Supercomputing, April.

A. Sussman, J. Saltz, R. Das, S. Gupta, D. Mavriplis, R. Ponnusamy, and K. Crowley. PARTI primitives for unstructured and block structured problems. Computing Systems in Engineering. Papers presented at the Symposium on High-Performance Computing.

J. Wu, J. Saltz, S. Hiranandani, and H. Berryman. Runtime compilation methods for multicomputers. In Proceedings of the International Conference on Parallel Processing.

A Example Program NEIGHBORCOUNT

A Sequential Version of Program

PROGRAM NeighborCount

CC The purpose of this program is to take a graph and determine each node's
CC in/out degree.  The graph is represented by the arrays X and Y, which give
CC the vertices' placement on a grid, and by two edge-endpoint arrays that
CC hold the end points of each edge.

parameter idim

parameter NE

INTEGER edgeNE edgeNE

INTEGER IODegreeidim Xidim Yidim

C Initialize edges arrays

edge INTrandidim

edge INTrandidim

do i NE

edgeiINTrandidim

edgeiMODINTrandidimedgeiidim

continue

C Initialize XY coord arrays not used in program

X INTrand

Y INTrand

do i idim

Xi INTrand

Yi INTrand

continue

do i idim

IODegreei

continue

C Calculate neighbors

do i NE

IODegreeedgei IODegreeedgei

IODegreeedgei IODegreeedgei

continue

CC The output is in the form of

CC

CC Node IODegree

do i idim

print iIODegreei

tot tot IODegreei

continue

print Total tot

end

A Parallel Version of Program Regular blo ck Distribution

PROGRAM NeighborCount

CC The purpose of this program is to take a graph and determine each node's
CC in/out degree.  The graph is represented by the arrays X and Y, which give
CC the vertices' placement on a grid, and by two edge-endpoint arrays that
CC hold the end points of each edge.
CC In this version of the program the X, Y, edge-endpoint, and IODegree
CC arrays are initialized assuming a block distribution.

include mpiforth

integer idim ne buff

C idim is the number of vertices in the graph

parameter idim

C NE is the number of edges in the graph

parameter NE

C buff is the amount of buffer space being allocated for
C off-processor data storage

parameter buffNE

INTEGER buildregtranslationtable

INTEGER rand

C edgei and edgei hold the endpoints for edge i

INTEGER edgeNEedgeNE

C newedge and newedge hold the adjusted values of the endpoints

C adjusted based on which vertices my processor has and

C where it has them

INTEGER newedgeNE buffnewedgeNE buff

C edge and edge are combined into tempedgearr so tempedgearr can be

C passed to XXXXXXXXXXXX and the adjusted values are returned

C in newtempedgearr

INTEGER tempedgearrNE buff newtempedgearrNE buff

INTEGER IODegreeidim buff

C The X and Y arrays represent the xy coordinates of the vertices

INTEGER Xidim Yidim

INTEGER sch tt count tot

INTEGER BLOCKarridim

INTEGER size loops

call PARTIsetup

procsMPInumnodes

C Each processor must know how many everyone else has to create BLOCKarrs

do i procs

sizei idimprocs

if i lt idim sizeiprocs sizeisizei

continue

mysizesizeMPImynode

do i procs

loopsiNEprocs

if i lt NE loopsiprocs loopsiloopsi

continue

myloopsloopsMPImynode

C Initialize edges arrays

do i myloops

edgeiMODINTrandidim

edgeiMODMODINTrandidimedgeiidim

continue

C Initialize XY coord arrays not used in this program

do i idim

Xi MODINTrand

Yi MODINTrand

continue

C Initialize IODegree Array

do i myloops

IODegreei

continue

C Initialize BLOCK distribution arrays to represent initial

C distribution of the data and indirection arrays

offset

do i MPImynode

offsetoffset sizei

continue

do i mysize

BLOCKarri offset i

continue

C Partition Data

tt buildregtranslationtableidim

C Inspector

CC The values in the edge end point indirection arrays are adjusted using

CC localize to refer to local indices for onprocessor data segments and to

CC buffer space for offprocessor data segments

do i myloops

tempedgearri edgei

tempedgearrimyloops edgei

continue

call localizett sch tempedgearr newtempedgearr myloops

noff mysize

do i myloops

edgei newtempedgearri

edgei newtempedgearrimyloops

continue

C Executor

C Calculate neighbors

do i myloops

IODegreeedgei IODegreeedgei

IODegreeedgei IODegreeedgei

continue

CC After the local portion of the indirection arrays have been polled each

CC processor sends any information which it has stored in its buffer areas to

CC the processor which owns the associated node

call iscatteraddIODegreemysizeIODegreesch

CC The output is in the form of

CC

CC Home Processor Node IODegree

tot

do i mysize

print MPImynodeBLOCKarriIODegreei

tottotIODegreei

continue

CC The following information is printed out for debugging purposes to

CC make sure that the correct number of edge end points were counted

print MPImynodes TOTAL tot

call MPIgisumtotibuf

print Total Total tot

print Total should be NE

end

A Parallel Version of Program Irregular Distribution

PROGRAM NeighborCount

CC The purpose of this program is to take a graph and determine each node's
CC in/out degree.  The graph is represented by the arrays X and Y, which give
CC the vertices' placement on a grid, and by two edge-endpoint arrays that
CC hold the end points of each edge.
CC In this version of the program the X, Y, edge-endpoint, and IODegree
CC arrays are initialized assuming a block distribution.

include mpiforth

parameter idim

parameter NE

parameter buffNE

INTEGER buildtranslationtable

INTEGER edgeNEedgeNE

INTEGER newedgeNE buffnewedgeNE buff

INTEGER arrNE buff newedgeNE buff

INTEGER IODegreeidim buffIODegreeidim buff

DOUBLE PRECISION Xidim Yidim

INTEGER sch sch sch tt tt count tot

INTEGER BLOCKarridim

INTEGER BLOCKarrNE

INTEGER maparridim newdistarridim buff

INTEGER newdistarr NE buff

INTEGER rigNE ripaNE

INTEGER size loops

call PARTIsetup

procsMPInumnodes

C Each processor must know how many everyone else has to create BLOCKarrs

do i procs

sizei idimprocs

if i lt idim sizeiprocs sizeisizei

continue

mysizesizeMPImynode

do i procs

loopsiNEprocs

if i lt NE loopsiprocs loopsiloopsi

continue

myloopsloopsMPImynode

C Initialize edges arrays

edge INTrandidim

edge INTrandidim

do i myloops

edgeiINTrandidim

edgeiMODINTrandidim

edgeiidim

continue

C Initialize XY coord arrays

X INTrand

Y INTrand

do i idim

Xi INTrand

Yi INTrand

continue

C Initialize IODegree Array

do i myloops

IODegreei

continue

C Initialize BLOCK distribution arrays to represent initial

C distribution of the data and indirection arrays

offset

do i MPImynode

offsetoffset sizei

continue

do i mysize

BLOCKarri offset i

continue

offset

do i MPImynode

offsetoffset loopsi

continue

do i myloops

BLOCKarri offset i

continue

C Partition Data

CC The X and Y coordinates are used to create a partitioning scheme for the

CC IODegree data arrays The data arrays are then redistributed from block to

CC the new distribution

call PARTIsetup

call CoorBisecMapmaparr mysize X Y

tt initttablewithproc maparr mysize

C ReMap data from old distribution to new distribution

call remaptt BLOCKarr sch newdistarr newnumlocalind

tt buildtranslationtable newdistarr newnumlocalind

call igatherIODegree IODegree sch

mysizenewnumlocalind

C Partition Loops

CC Then the loops are partitioned using the new distribution information in

CC order to try to maximize onprocessor work The indirection arrays are then

CC gathered so that each processor has the portions of the indirection arrays

CC which correspond to the loop iterations they own

c collect all local ind array values for IODegree into rig

count

do i myloops

rigcount edgei

rigcount edgei

count count

continue

call drefrigtt rig myloops ripa

call iterationpartitionerripa myloops tt

call remaptt BLOCKarr sch newdistarr newmyloops

call igathernewedge edge sch

call igathernewedge edge sch

C Inspector

CC The values in the edge end point indirection arrays are adjusted using

CC localize to refer to local indices for onprocessor data segments and to

CC buffer space for offprocessor data segments

do i newmyloops

arri newedgei

arrinewmyloops newedgei

continue

call localizett sch arr newedge newmyloops

noff mysize

C Executor

C Calculate neighbors

do i newmyloops

IODegreenewedgei IODegreenewedgei

IODegreenewedgeinewmyloopsIODegreenewedgeinewmyloops

continue

CC After the local portion of the indirection arrays have been polled each

CC processor sends any information which it has stored in its buffer areas to

CC the processor which owns the associated node

call iscatteraddIODegreemysizeIODegreesch

CC The output is in the form of

CC

CC Home Processor Node IODegree

tot

do i mysize

print MPImynodenewdistarriIODegreei

tottotIODegreei

continue

CC The following information is printed out for debugging purposes to

CC make sure that the correct number of edge end points were counted

print MPImynodes TOTAL tot

call MPIgisumtotibuf

print Total Total tot

print Total should be NE

end
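
The listing above follows the usual CHAOS phase structure: partition the data, remap it, partition the loop iterations, run the inspector (localize) once, and then execute the loop and push the accumulated buffer values back with iscatter_add. The fragment below is a minimal sketch of just the inspector/executor kernel for a single indirection array, written as a hypothetical subroutine sweep so that the sizes, the work array local_ia and the translation table are supplied by the caller; the calls and their argument order follow the listing, and the double-precision gather is assumed to be named dgather by analogy with the igather call used above.

      subroutine sweep(tt, ia, local_ia, ndata, x, y, nlocal, nbuff)
C  Sketch of a CHAOS inspector/executor pair for the irregular loop
C  y(i) = y(i) + x(ia(i)), i = 1..ndata.  tt is the translation table
C  describing how x is distributed, ia holds global indices, and nbuff
C  is the buffer space reserved past x(nlocal) for off-processor copies.
C  (dgather is assumed by analogy with igather in the listing above.)
      integer tt, ndata, nlocal, nbuff
      integer ia(ndata), local_ia(ndata), sched, noff
      double precision x(nlocal+nbuff), y(ndata)

C  Inspector: translate global references into local and buffer indices
C  and build a communication schedule (done once per new access pattern)
      call localize(tt, sched, ia, local_ia, ndata, noff, nlocal)

C  Gather the off-processor x values into the buffer area past x(nlocal)
      call dgather(x(nlocal+1), x, sched)

C  Executor: every reference is now on-processor
      do 10 i = 1, ndata
         y(i) = y(i) + x(local_ia(i))
 10   continue
      end

When the same access pattern is reused over many sweeps, the schedule produced by localize can be saved and reused, so only the gather and the loop body are repeated.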

B   Example Program: Simplified ISING

B.1   Sequential Version of Program

      program isingsimulation

C  Grid dimensions, movement probability (in percent), particle count,
C  and number of time steps (the values are illustrative)
      parameter (ixdim = 64)
      parameter (iydim = 64)
      parameter (frac = 80)
      parameter (particles = ixdim*iydim)
      parameter (timeunits = 100)

C  This program will simulate a grid of cells and the energy particles
C  within those cells.  The grid will be initialized in a uniformly
C  distributed pattern.  The simulation will allow a particle to
C  move one cell left, right, up or down, or stay in place, per time
C  unit.  For this program it is assumed there is a one-way permeable
C  wall through which particles can leave but not enter the simulation.

      integer x(particles), y(particles)
      integer count
      INTEGER rand

C  Initialize grid, assuming perfect divisibility
      do 10 i = 1, particles
         x(i) = MOD(rand(0), ixdim)
         y(i) = MOD(rand(0), iydim)
 10   continue

C  Do timeunits particle movement cycles
      do 30 k = 1, timeunits
         do 20 i = 1, particles
            if (MOD(rand(0), 100) .lt. frac) then
               ival = MOD(rand(0), 4)
               if (ival .eq. 0) then
                  y(i) = y(i) + 1
               else
                  if (ival .eq. 1) then
                     y(i) = y(i) - 1
                  else
                     if (ival .eq. 2) then
                        x(i) = x(i) + 1
                     else
                        if (ival .eq. 3) x(i) = x(i) - 1
                     endif
                  endif
               endif
            endif
            if (x(i) .lt. 0) x(i) = ixdim - 1
            if (x(i) .gt. ixdim-1) x(i) = 0
            if (y(i) .lt. 0) y(i) = iydim - 1
            if (y(i) .gt. iydim-1) y(i) = 0
 20      continue
 30   continue

C  Calculate the number of particles left in the grid
C  (can do something else instead)
      print *, 'Particles in grid ', particles
      end

B.2   Parallel Version of Program

      program isingsimulation

      include 'mpifort.h'

C  Grid dimensions, movement probability (in percent), particle count,
C  and number of time steps (the values are illustrative)
      parameter (ixdim = 64)
      parameter (iydim = 64)
      parameter (frac = 80)
      parameter (particles = ixdim*iydim)
      parameter (timeunits = 100)

C  This program will simulate a grid of cells and the energy particles
C  within those cells.  The grid will be initialized in a uniformly
C  distributed pattern.  The simulation will allow a particle to
C  move one cell left, right, up or down, or stay in place, per time
C  unit.  For this program it is assumed that when a particle leaves
C  the top it goes to the bottom, and the same for left and right
C  and vice versa.

      integer x(particles), y(particles)
      integer sched, schedule_proc
      integer destproc(particles)
      integer count, part
      INTEGER rand

      call PARTI_setup()
      procs = MPI_numnodes()
      menode = MPI_mynode()
      part = INT(ixdim/procs)
      localloops = particles/procs

C  Random number hack so that all processors don't reuse the same random
C  numbers
      do 10 i = 1, menode*localloops
         l = rand(0)
 10   continue

C  Initialize grid, assuming perfect divisibility
C  NOTE: the x-coordinates are appropriate for the given processor
      do 20 i = 1, localloops
         x(i) = MOD(rand(0), part) + part*menode
         y(i) = MOD(rand(0), iydim)
 20   continue

C  Do timeunits particle movement cycles
      do 40 k = 1, timeunits
         do 30 i = 1, localloops
            if (iprocmap(x(i), ixdim) .ne. menode)
     &         print *, 'Wrong Processor'
            if (MOD(rand(0), 100) .lt. frac) then
               ival = MOD(rand(0), 4)
               if (ival .eq. 0) then
                  y(i) = y(i) + 1
               else
                  if (ival .eq. 1) then
                     y(i) = y(i) - 1
                  else
                     if (ival .eq. 2) then
                        x(i) = x(i) + 1
                     else
                        if (ival .eq. 3) x(i) = x(i) - 1
                     endif
                  endif
               endif
            endif
            if (x(i) .lt. 0) x(i) = ixdim - 1
            if (x(i) .gt. ixdim-1) x(i) = 0
            if (y(i) .lt. 0) y(i) = iydim - 1
            if (y(i) .gt. iydim-1) y(i) = 0

C  Create destination processor array listing which processor each
C  particle should be on for the next time iteration
            destproc(i) = iprocmap(x(i), ixdim)
 30      continue

C  Create a schedule for moving particles between processors
         sched = schedule_proc(destproc, localloops, newlocalloops)
         localloops = newlocalloops

C  Move particles' x and y coordinates to the processor which owns the
C  cell in which each particle resides
         call iscatter_append(x, x, sched)
         call iscatter_append(y, y, sched)

C  Free the memory used by the schedule, since this schedule is never
C  useful again
         call free_sched(sched)
 40   continue

C  Check the number of particles left in the grid
C  (can do something else instead)
      print *, 'Particles on machine ', menode, localloops
      call MPI_gisum(localloops, 1, itemp)
      if (menode .eq. 0) print *, 'Total ', localloops
      end


      function iprocmap(xcoord, xdim)
      integer xcoord, xdim
      iprocmap = INT(xcoord) / (INT(xdim)/MPI_numnodes())
      end
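
The parallel version uses the CHAOS procedures for adaptive problems: at each time step it computes a destination processor for every locally held particle, builds a schedule with schedule_proc, migrates the coordinate arrays with iscatter_append, and frees the schedule immediately because the communication pattern changes from step to step. The fragment below isolates that migration step as a hypothetical subroutine migrate; the calls and their argument order mirror the listing, and it is assumed, as in the listing, that x and y are dimensioned large enough to hold the arriving elements.

      subroutine migrate(destproc, x, y, nlocal)
C  Sketch of one particle-migration step, following the ISING listing
C  above.  destproc(i) names the processor that should own element i
C  after this step; x and y are moved in place and nlocal is updated
C  to the new local count.  The caller must dimension x and y
C  generously enough to hold the appended arrivals.
      integer destproc(*), x(*), y(*), nlocal
      integer sched, newnlocal, schedule_proc

C  Build a schedule from the per-element destination processors
      sched = schedule_proc(destproc, nlocal, newnlocal)
      nlocal = newnlocal

C  Move the coordinate arrays; arriving elements are appended after
C  the elements this processor keeps
      call iscatter_append(x, x, sched)
      call iscatter_append(y, y, sched)

C  The pattern changes every step, so the schedule is not reused
      call free_sched(sched)
      end

Unlike the NeighborCount sweep, no schedule is worth saving here, since the set of elements that move differs at every time step.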