
Syracuse University SURFACE

Electrical Engineering and Computer Science College of Engineering and Computer Science

1994

PASSION Runtime for Parallel I/O

Rajeev Thakur Syracuse University, Northeast Parallel Architectures Center, and Department of Electrical and Computer Engineering, [email protected]

Rajesh Bordawekar Syracuse University

Alok Choudhary Syracuse University, Northeast Parallel Architectures Center, and Department of Electrical and Computer Engineering

Ravi Ponnusamy Syracuse University, Northeast Parallel Architectures Center, and Department of Electrical and Computer Engineering, [email protected]


Recommended Citation: Thakur, Rajeev; Bordawekar, Rajesh; Choudhary, Alok; and Ponnusamy, Ravi, "PASSION Runtime Library for Parallel I/O" (1994). Electrical Engineering and Computer Science. 30. https://surface.syr.edu/eecs/30

This Working Paper is brought to you for free and open access by the College of Engineering and Computer Science at SURFACE. It has been accepted for inclusion in Electrical Engineering and Computer Science by an authorized administrator of SURFACE. For more information, please contact [email protected].

Scalable Parallel Libraries Conference, Oct. 1994

PASSION Runtime Library for Parallel I/O

Rajeev Thakur, Rajesh Bordawekar, Alok Choudhary, Ravi Ponnusamy, Tarvinder Singh

Dept. of Electrical and Computer Eng. and Northeast Parallel Architectures Center
Syracuse University, Syracuse, NY

{thakur, rajesh, choudhar, ravi, tpsingh}@npac.syr.edu

Abstract

We are developing a compiler and runtime support system called PASSION (Parallel And Scalable Software for Input-Output). PASSION provides software support for I/O intensive out-of-core loosely synchronous problems. This paper gives an overview of the PASSION Runtime Library and describes two of the optimizations incorporated in it, namely Data Prefetching and Data Sieving. Performance improvements provided by these optimizations on the Intel Touchstone Delta are discussed, together with an out-of-core Median Filtering application.

Introduction

There are a number of applications which deal with very large quantities of data. These applications exist in diverse areas such as large scale scientific computations, database applications, hypertext and multimedia systems, information retrieval and many other applications of the Information Age. The number of such applications and their data requirements keep increasing day by day. Consequently, it has become apparent that I/O performance, rather than CPU or communication performance, may be the limiting factor in future computing systems. Recent advances in high performance computing have resulted in computers which can provide many Gflops of computing power. However, the performance of the I/O systems of these machines has lagged far behind. It is still several orders of magnitude more expensive to read data from disk than to read it from local memory. Improvements are needed both in hardware as well as software to reduce the imbalance between CPU performance and I/O performance.

At Syracuse University, we consider the I/O problem from a language and runtime support point of view. We are developing a compiler and runtime support system called PASSION (Parallel And Scalable Software for Input-Output). PASSION provides support for compiling out-of-core data parallel programs, parallel input/output of data, communication of out-of-core data, redistribution of data stored on disks, many optimizations including data prefetching from disks, data sieving and data reuse, as well as support at the operating system level. We have also developed an initial framework for runtime support for out-of-core irregular problems. This paper gives an overview of PASSION and describes some of the main features of the PASSION Runtime Library. We explain the basic model of computation and I/O used by the runtime library. The runtime routines supported by PASSION are discussed. A number of optimizations have been incorporated in the runtime library to reduce the I/O cost. We describe in detail two of these optimizations, namely Data Prefetching and Data Sieving. Performance improvements provided by these optimizations on the Intel Touchstone Delta are discussed, together with an out-of-core Median Filtering application.

(This work was supported in part by an NSF Young Investigator Award (CCR), grants from Intel SSD and IBM Corp., and in part by a USRA CESDIS contract. This work was performed in part using the Intel Touchstone Delta System operated by Caltech on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided by CRPC.)

PASSION Overview

PASSION provides software support for I/O intensive loosely synchronous problems. It has a layered approach and provides support at the compiler, runtime and operating system levels, as shown in Figure 1. The PASSION compiler translates out-of-core HPF programs to message passing node programs with explicit parallel I/O. It extracts information from user directives about the data distribution, which is required by the PASSION runtime library. It restructures loops having out-of-core arrays and also decides the transformations on out-of-core data to map the distribution on disks with the usage in the loops. The PASSION compiler uses well known techniques such as loop stripmining and iteration blocking to generate

Figure 1: PASSION Rings (layers: I/O Intensive OOC Applications; Loosely Synchronous Computations; Compiler and Runtime Support; Compiler Support for HPF Directives; Support for prefetching etc.; Two-Phase Access Manager; Prefetch Manager; Cache and Buffer Manager; I/O Controller and Disk Subsystem)

efficient code for I/O intensive applications. It also embeds calls to appropriate PASSION runtime routines, which carry out I/O efficiently. The Compiler and Runtime Layers pass data distribution and access pattern information to the Two-Phase Access Manager and the Prefetch Manager. They optimize I/O using buffering, redistribution and prefetching strategies. At the operating system level, PASSION provides support to handle prefetching and buffering.

The PASSION runtime support system makes I/O optimizations transparent to users. The runtime procedures can either be used together with a compiler to translate out-of-core data parallel programs, or used directly by application programmers. The runtime library performs the following functions:

- hides disk data distribution from the user
- provides consistent I/O performance independent of data distribution
- reorders I/O requests to minimize seek time
- eliminates duplicate I/O requests to reduce I/O cost
- prefetches disk data to hide I/O latency

Writing message passing parallel programs with efficient parallel I/O is a tedious process. Instead, a program written in a high-level data parallel language like HPF can be translated into efficient code using the PASSION compiler and runtime system. A detailed description of all the features of PASSION is given in the PASSION technical report by Choudhary et al.

Model for Computation and I/O

In the SPMD (Single Program Multiple Data) programming model, each processor has a local array associated with it. In an in-core program, the local array resides in the local memory of the processor. For large data sets, however, local arrays cannot entirely fit in main memory. In such cases, parts of the local array have to be stored on disk. We refer to such a local array as an Out-of-core Local Array (OCLA). Parts of the OCLA need to be swapped between main memory and disk during the course of the computation.

The basic model for computation and I/O used by PASSION is shown in Figure 2. The simplest way to view this model is to think of each processor as having another level of memory which is much slower than main memory. Since the local arrays are out-of-core, they have to be stored in files on disk. The local array of each processor is stored in a separate file, called the Local Array File (LAF) of that processor. The node program explicitly reads from and writes into the file when required. If the I/O architecture of the system is such that each processor has its own disk,

Figure 2: Model for Computation and I/O. A global array is divided among processors P0 to P3; each processor holds an ICLA in main memory, and its portion of the array is stored in a Local Array File on the disks.

the LAF of each processor will be stored on the disk attached to that processor. If there is a common set of disks for all processors, the LAF will be distributed across one or more of these disks. In other words, we assume that each processor has its own logical disk, with the LAF stored on that disk. The mapping of the logical disk to the physical disks is system dependent. At any time, only a portion of the local array is fetched and stored in main memory. The size of this portion depends on the amount of memory available. The portion of the local array which is in main memory is called the In-Core Local Array (ICLA). All computations are performed on the data in the ICLA. Thus, during the course of the program, parts of the LAF are fetched into the ICLA, the new values are computed, and the ICLA is stored back into appropriate locations in the LAF.

Runtime Support in PASSION

During program execution, data needs to be moved back and forth between the LAF and the ICLA. Also, since the global array is distributed, a processor may need data from the local array of another processor. This requires data to be communicated between processors. Thus, runtime support is needed to perform I/O as well as communication. The PASSION Runtime Library consists of a set of high level specialized routines for parallel I/O and collective communication. These routines are built using the native communication and I/O primitives of the system and provide a high level abstraction which avoids the inconvenience of working directly with the lower layers. For example, the routines hide details such as buffering, mapping of files on disks, location of data in files, synchronization, optimum message size for communication, best communication algorithms, communication scheduling and I/O scheduling.

PASSION Runtime Library

The PASSION routines can be divided into four main categories based on their functionality: Array Management/Access Routines, Communication Routines, Mapping Routines and Generic Routines. Some of the basic routines and their functions are listed in Table 1.

Array Management/Access Routines

These routines handle the movement of data between the LAF and the ICLA. Any arbitrary regular section of the OCLA can be read, for an array stored in either row-major or column-major order. The information about the array, such as its shape, size, distribution and storage format, is passed to the routines using a

Array Management Routines
  PASSION_read_section      Read a regular section from LAF to ICLA
  PASSION_write_section     Write a regular section from ICLA to LAF
  PASSION_read_with_reuse   read_section with data reuse
  PASSION_prefetch_read     Asynchronous non-blocking read of a regular section
  PASSION_prefetch_wait     Wait for a prefetch to complete

Array Communication Routines
  PASSION_oc_shift          Shift type collective communication on out-of-core data
  PASSION_oc_multicast      Multicast communication on out-of-core data

Mapping Routines
  PASSION_oc_disk_map       Map disks to processors
  PASSION_oc_file_map       Generate local files from global files

Generic Routines
  PASSION_oc_transpose      Transpose an out-of-core array
  PASSION_oc_matmul         Perform out-of-core matrix multiplication

Table 1: Some of the PASSION Runtime Routines
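As an illustration of how an application drives routines of this kind, here is a minimal slab-at-a-time sketch. It is written in Python with hypothetical stand-ins for PASSION_read_section and PASSION_write_section (the real routines are library calls that take an array descriptor; the names, signatures and in-memory "file" below are invented for illustration only):

```python
# Hypothetical stand-ins for PASSION_read_section / PASSION_write_section:
# the LAF is modeled as an in-memory list of rows standing in for a file.
def read_section(laf, lo, hi):
    """Fetch rows lo..hi-1 of the local array file into an in-core slab."""
    return [row[:] for row in laf[lo:hi]]

def write_section(laf, lo, hi, icla):
    """Store the in-core slab back into rows lo..hi-1 of the file."""
    laf[lo:hi] = icla

def out_of_core_scale(laf, slab_rows, factor):
    """Process an out-of-core local array one ICLA-sized slab at a time."""
    for lo in range(0, len(laf), slab_rows):
        hi = min(lo + slab_rows, len(laf))
        icla = read_section(laf, lo, hi)                    # LAF -> ICLA
        icla = [[x * factor for x in row] for row in icla]  # compute in core
        write_section(laf, lo, hi, icla)                    # ICLA -> LAF
    return laf
```

For example, out_of_core_scale([[1, 2], [3, 4], [5, 6]], slab_rows=2, factor=2) keeps at most two rows of the "file" in memory at a time, which is exactly the OCLA/ICLA discipline described above.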

data structure called the Out-of-Core Array Descriptor (OCAD). The Data Sieving Method, described later in this paper, is used for improved performance.

Communication Routines

The Communication Routines perform collective communication of data in the OCLA. We use the Explicit Communication Method described in our earlier work. The communication is done for the entire OCLA, i.e. all the off-processor data needed by the OCLA is fetched during the communication. This requires interprocessor communication as well as disk accesses.

Mapping Routines

The Mapping Routines perform data and processor-disk mappings. Data mapping routines include routines to generate local array files from a global file. Disk mapping routines map physical disks onto logical disks.

Generic Routines

The Generic Routines perform computations on out-of-core arrays. Examples of these routines are out-of-core transpose and out-of-core matrix multiplication.

Two-Phase Approach

The performance of parallel file systems depends to a large extent on the way data is distributed on disks and processors. The performance is best when the data distribution on disks conforms to the data distribution on processors. Other distributions give much lower performance. To alleviate this problem, the Two-Phase Access Strategy has been proposed by del Rosario, Bordawekar and Choudhary. In the Two-Phase Approach, data is first read in a manner conforming to the distribution on disks and then redistributed among the processors. This is found to give consistently good performance for all distributions. The PASSION runtime library uses this Two-Phase Approach for parallel I/O. In the first phase, data is accessed using the data distribution, stripe size and set of reading nodes (possibly a subset of the computational array) which conforms with the distribution of data over the disks. In the second phase, the data is redistributed at runtime to match the application's desired data distribution.

The Two-Phase Approach provides the following advantages over the conventional Direct Access Method:

- The distribution of data on disks is effectively hidden from the user.
- It uses the higher bandwidth of the interconnection network.
- It uses collective communication and collective I/O operations.
- It provides software caching of the out-of-core data in main memory to exploit temporal and spatial locality.
- It aggregates I/O requests of compute nodes so that only one copy of each data item is transferred between disk and main memory.
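A minimal sketch of the two-phase idea, with an invented on-disk layout and without any of the real striping or scheduling machinery: the global array is assumed to be stored on disk in row blocks, while the application wants column blocks. Each reader first performs one large read that conforms to the disk layout, and the blocks are then redistributed in memory (standing in for the all-to-all exchange over the interconnection network):

```python
def two_phase_read(disk, nprocs):
    """disk: global array as a list of rows, stored in row-block order."""
    nrows, ncols = len(disk), len(disk[0])
    rb, cb = nrows // nprocs, ncols // nprocs
    # Phase 1: conforming reads. "Process" p reads rows p*rb..(p+1)*rb-1
    # as one large sequential request matching the disk layout.
    row_blocks = [disk[p * rb:(p + 1) * rb] for p in range(nprocs)]
    # Phase 2: redistribution. Process p collects columns p*cb..(p+1)*cb-1
    # from every row block, ending up with its desired column block.
    return [
        [row[p * cb:(p + 1) * cb] for blk in row_blocks for row in blk]
        for p in range(nprocs)
    ]
```

With a 4 x 4 global array and two processes, each process ends up with a 4 x 2 column block after the exchange, even though no reader ever issued a strided (non-conforming) disk request.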

Figure 3: Data Prefetching. (A) Without prefetch: the Read, Compute and Write phases for each slab are fully serialized. (B) With prefetch: the Read of the next slab overlaps the Compute and Write of the current slab.

Optimizations

A number of optimizations have been incorporated in the PASSION runtime library to reduce the I/O cost. One optimization, called Data Reuse, reduces the amount of I/O by reusing data already fetched into main memory instead of reading it again from disk. Two other optimizations, Data Prefetching and Data Sieving, are described in the following sections. In addition, some other optimizations, such as Software Caching to reduce the number of I/O requests and Access Reordering to reduce I/O latency time, have been incorporated.

Data Prefetching

In the model of computation and I/O described earlier, the OCLA is divided into a number of slabs, each of which can fit in the ICLA. Program execution proceeds as follows: a slab of data is fetched from the LAF to the ICLA, the computation is performed on this slab, and the slab is written back to the LAF. This is repeated on other slabs till the end of the program. Thus, I/O and computation form distinct phases in the program. A processor has to wait while each slab is being read or written, as there is no overlap between computation and I/O. This is illustrated in Figure 3(A), which shows the time taken for computation and I/O on three slabs. For simplicity, reading, writing and computation are shown to take the same amount of time, which may not be true in certain cases.

The time taken by the program can be reduced if it is possible to overlap computation with I/O in some fashion. A simple way of achieving this is to issue an asynchronous I/O read request for the next slab immediately after the current slab has been read. This is called Data Prefetching. Since the read request is asynchronous, the reading of the next slab can be overlapped with the computation being performed on the current slab. If the computation time is comparable to the I/O time, this can result in significant performance improvement. Figure 3(B) shows how prefetching can reduce the time taken for the example in Figure 3(A). Since the computation time is assumed to be the same as the read time, all reads other than the first one get overlapped with computation. The total reduction in program time is equal to the time for reading two slabs, as only two of the three reads can be overlapped in this example. Prefetching can be done using the routine PASSION_prefetch_read, and the routine PASSION_prefetch_wait can be used to wait for the prefetch to complete.

Performance

We use an out-of-core Median Filtering program to illustrate the performance of Data Prefetching. Median Filtering is frequently used in computer vision and image processing applications to smooth the input image. Each pixel is assigned the median of the values of its neighbors within a window of a particular size. We have implemented a parallel out-of-core Median Filtering program using PASSION runtime routines for I/O and communication. The image is distributed among processors in one dimension along columns and stored in local array files. Depending on the window size, each processor needs a few columns from its right and left neighbors. This requires a shift type communication, which is implemented using the routine PASSION_oc_shift.

Tables 2 and 3 show the performance of Median Filtering on the Intel Touchstone Delta for two different window sizes. The image is of size K x K pixels; we assume this to be out-of-core for the purpose of experimentation. The number of processors is varied from 4 to 64, and the size of the ICLA is varied in each case in such a way that the number of slabs also varies. Since the Touchstone Delta has enough disks, each processor's LAF can be stored on a separate disk.
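The prefetch pipeline described above can be sketched as a double-buffered loop. This is an illustrative Python sketch, not the PASSION code: a worker thread stands in for the asynchronous I/O request, submit plays the role of PASSION_prefetch_read, and result() plays the role of PASSION_prefetch_wait:

```python
from concurrent.futures import ThreadPoolExecutor

def process_with_prefetch(read_slab, compute, nslabs):
    """Read slab k+1 asynchronously while computing on slab k."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_slab, 0)      # only the first read stalls
        for k in range(nslabs):
            slab = pending.result()              # wait for the prefetched slab
            if k + 1 < nslabs:
                pending = pool.submit(read_slab, k + 1)  # issue the next read
            results.append(compute(slab))        # overlaps the pending read
    return results
```

As in the figure, only the read of the first slab is exposed; every later read proceeds in the background while the current slab is being processed.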

Table 2: Performance of Median Filtering, first window size (time in sec). Columns: number of processors; Prefetch and No Prefetch times for each of three slab counts.

Table 3: Performance of Median Filtering, second window size (time in sec). Columns: number of processors; Prefetch and No Prefetch times for each of three slab counts.

The following observations can be made from these tables:

- In all cases, prefetching improves performance considerably. Figures 4 and 5 show the relative performance with and without prefetching for a fixed number of slabs.
- Without prefetching, as the number of slabs is increased, the time taken increases. This is because more slabs mean a smaller slab size, which results in a larger number of I/O requests.
- With prefetching, as the number of slabs is increased, the time taken decreases in most cases. Since the first slab can never be prefetched, all processors have to wait for the first slab to be read. As the slab size is reduced, the wait time for the first slab is also reduced, and there is more overlap of computation and I/O. However, the number of I/O requests increases. When the slab size is large, a reduction in the slab size by half improves performance, because the saving in the wait time for the first slab is higher than the increase in time due to the larger number of I/O requests. But when the slab size is small, the higher number of I/O requests costs more than the decrease in wait time for the first slab, and performance actually degrades.

Figure 4: Median Filtering, first window size: time in seconds with and without prefetch, for 4 to 64 processors.

Figure 5: Median Filtering, second window size: time in seconds with and without prefetch, for 4 to 64 processors.

Data Sieving

All the PASSION runtime routines for reading or writing data from or to disks support the reading and writing of regular sections of arrays. We define a regular section of an array as any portion of an array which can be specified in terms of its lower bound, upper bound and stride in each dimension. The need for reading array sections from disks may arise for a number of reasons, for example FORALL or array assignment statements involving sections of out-of-core arrays.

Consider the 11 x 11 array shown in Figure 6, which is stored on disk. Suppose it is required to read a strided section of this array with lower bound (2,3) and upper bound (10,9). The elements to be read are circled in the figure.

Figure 6: Accessing out-of-core array sections. An 11 x 11 array, with corners (1,1), (1,11), (11,1) and (11,11), contains a strided section bounded by (2,3), (2,9), (10,3) and (10,9).

Since these elements are stored with a stride on disk, it is not possible to read them using one read call. A simple way of reading this array section is to explicitly move the file pointer to each element and read it individually, which requires as many reads as the number of elements. We call this the Direct Read Method. A major disadvantage of this method is the large number of I/O calls and the low granularity of data transfer. Since the I/O latency is very high, this method proves to be very expensive. For example, on the Intel Touchstone Delta, using one processor and one disk, reading a set of integers as one block takes far less time than reading the same integers individually.

Suppose it is required to read a section of a two-dimensional array, specified by (l1:u1:s1, l2:u2:s2). The number of array elements in this section is (⌊(u1 − l1)/s1⌋ + 1) × (⌊(u2 − l2)/s2⌋ + 1). Therefore, in the Direct Read Method:

No. of I/O requests = (⌊(u1 − l1)/s1⌋ + 1) × (⌊(u2 − l2)/s2⌋ + 1)
No. of array elements read per access = 1

Thus, in this method, the number of I/O requests is very high and the number of elements accessed per request is very low, which is undesirable.

We propose a much more efficient method, called Data Sieving, to read or write out-of-core array sections having strides in one or more dimensions. Data Sieving can be explained with the help of Figure 7. As explained earlier, each processor has an out-of-core local array (OCLA) associated with it. The OCLA is logically divided into slabs, each of which can fit in main memory (the ICLA). The OCLA shown in the figure has four slabs. Let us assume that it is necessary to read the array section shown in Figure 7, specified by (l1:u1:s1, l2:u2:s2), into the ICLA. Although this section spans three slabs of the OCLA, because of the stride all the data elements can fit in the ICLA.

In the Data Sieving Method, the entire block of data from column l2 to u2 (if the storage is column major), or the entire block from row l1 to u1 (if the storage is row major), is read into a temporary buffer in main memory using one read call. The required data is then extracted from this buffer and placed in the ICLA; hence the name Data Sieving. A major advantage of this method is that it requires only one I/O call, and the rest is data transfer within main memory. The main disadvantage is the high memory requirement. Another disadvantage is the extra amount of data that is read from disk. However, we have found that the saving in the number of I/O calls increases performance considerably. For this method, assuming column major storage (with nrows rows in the local array):

No. of I/O requests = 1
No. of array elements read per access = (u2 − l2 + 1) × nrows

Data Sieving is a way of combining multiple I/O requests into one request, so as to reduce the effect of the high I/O latency time. A similar method, called message coalescing, is used in interprocessor communication, where small messages are combined into a single large message in order to reduce the effect of communication latency. However, Data Sieving is different because, instead of coalescing the required data elements together, it actually reads even unwanted data elements, so that large contiguous blocks are read. The useful data is then filtered out by the runtime system in an intermediate step and passed on to the program. The unwanted data read into main memory is dynamically discarded.

Figure 7: Data Sieving. A contiguous block of the OCLA containing the section (l1,l2) to (u1,u2) is read into an in-core buffer with one request and then sieved into the ICLA.
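The sieving step itself can be sketched as follows. This is an illustrative Python sketch using 0-based indices and a flat list as the column-major "disk", not the PASSION implementation:

```python
def sieve_read_section(disk, nrows, l1, u1, s1, l2, u2, s2):
    """Read a strided 2-D section with one contiguous request, then sieve.

    disk -- column-major array as a flat list (element (i, j) at j*nrows + i)
    Returns (section, number_of_io_requests).
    """
    # One contiguous read covering all of columns l2..u2 (unwanted data included).
    buf = disk[l2 * nrows:(u2 + 1) * nrows]
    # Sieve in main memory: keep only the strided elements of the section.
    section = [[buf[(j - l2) * nrows + i] for j in range(l2, u2 + 1, s2)]
               for i in range(l1, u1 + 1, s1)]
    return section, 1

def direct_read_requests(l1, u1, s1, l2, u2, s2):
    """I/O requests needed by the Direct Read Method: one per element."""
    return ((u1 - l1) // s1 + 1) * ((u2 - l2) // s2 + 1)
```

For a 4 x 4 array, reading rows 0 and 2 of columns 1 and 3 costs four requests with Direct Read but a single request with sieving, at the price of reading three columns' worth of data into the buffer.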

Reducing the Memory Requirement

If the stride in the array section is large, the amount of memory required to read the entire block from column l2 to u2 will be quite large. There may not be enough main memory available to store this entire block. Since the amount of memory available to create a temporary buffer is not known, we make the assumption that there is always enough memory to create a buffer of size equal to that of the ICLA. The Data Sieving Method described above is modified as follows to take this fact into account. Instead of reading the entire block of data from column l2 to u2, we read only as many columns (or rows) at a time as can fit in a buffer of the same size as the ICLA. For each set of columns read, the data is sieved and passed on to the program. This reduces the memory requirements of the program considerably and increases the number of I/O requests only slightly. Let us assume that the array is stored in column major order on disk and that n columns of the OCLA can fit in the ICLA. Then, for this case:

No. of I/O requests = ⌈(u2 − l2 + 1)/n⌉
No. of array elements read per access = n × nrows

Writing Array Sections

Suppose it is required to write an array section (l1:u1:s1, l2:u2:s2) from the ICLA to the LAF. The issues involved here are similar to those described above for reading array sections. A Direct Write Method can be used to write each element individually, but it suffers from the same problems of a large number of I/O requests and low granularity of data transfer. In order to reduce the number of I/O requests, a method similar to the Data Sieving Method described above needs to be used. If we directly use Data Sieving in the reverse direction, i.e. elements from the ICLA are placed at appropriate locations in a temporary buffer with a stride and the buffer is written to disk, the data in the buffer between the strided elements will overwrite the corresponding data elements on disk. In order to maintain data consistency, it is necessary to first read the entire block from the LAF into the temporary buffer. Then data elements from the ICLA can be stored at appropriate locations in the buffer, and the entire buffer can be written back to disk.

This is similar to what happens in cache memories when there is a write miss. In that case, a whole line or block of data is fetched from main memory into the cache, and then the processor writes data into the cache. This is done in hardware in the case of caches; PASSION does this in software when writing array sections using Data Sieving. Thus, writing sections requires twice the amount of I/O compared to reading sections, because for each write to disk the corresponding block has to first be fetched into memory. Therefore, for writing array sections:

No. of I/O requests = 2 × ⌈(u2 − l2 + 1)/n⌉
No. of array elements transferred per access = n × nrows
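The read-modify-write cycle just described can be sketched as follows, again as an illustrative Python sketch with 0-based indices and a flat column-major "disk", not the PASSION code:

```python
def sieve_write_section(disk, nrows, l1, u1, s1, l2, u2, s2, icla):
    """Write a strided 2-D section via read-modify-write of a whole block."""
    start, stop = l2 * nrows, (u2 + 1) * nrows
    buf = disk[start:stop]            # transfer 1: fetch the block from "disk"
    for bi, i in enumerate(range(l1, u1 + 1, s1)):
        for bj, j in enumerate(range(l2, u2 + 1, s2)):
            buf[(j - l2) * nrows + i] = icla[bi][bj]  # scatter in memory
    disk[start:stop] = buf            # transfer 2: write the block back
    return 2                          # twice the I/O of a sieving read
```

Elements of the block that are not part of the section keep their on-disk values, which is exactly why the block must be read in before it is written back.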

Table 4: Performance of Direct Read/Write versus Data Sieving (time in sec), for a K x K global array distributed by columns. Columns: array section; Direct Read and Sieving times for PASSION_read_section; Direct Write and Sieving times for PASSION_write_section.

Table 5: I/O requirements of the Direct Read and Data Sieving Methods for the same array sections. Columns: array section; number of I/O requests (Direct Read, Sieving); number of array elements read (Direct Read, Sieving).

Performance

Table 4 gives the performance of Data Sieving versus the Direct Method for reading and writing array sections. An array of size K x K is distributed among processors in one dimension along columns. We measured the time taken by the PASSION_read_section and PASSION_write_section routines for reading and writing sections of the out-of-core local array on each processor. We observe that Data Sieving provides tremendous improvement over the Direct Method in all cases. The reason for this is the large number of I/O requests in the Direct Method, even though the total amount of data accessed is higher in Data Sieving. Table 5 gives the number of I/O requests and the total amount of data transferred for each of the array sections considered in Table 4. We observe that in the Data Sieving Method, the number of data elements transferred is more or less the same for all cases. This is because the total amount of data transferred depends only on the lower and upper bounds of the section and is independent of the stride. Hence, the time taken using Data Sieving does not vary much across the sections we have considered. However, there is a wide variation in time for the Direct Method, because only those elements belonging to the section are read. The time is lower for small sections and higher for large sections.

We observe that even for writing array sections, Data Sieving performs better than Direct Write, even though it requires reading the section before writing. As expected, PASSION_write_section takes about twice the time of PASSION_read_section when using Data Sieving. Comparing the Direct Write and Direct Read Methods, we find that writing takes slightly less time than reading data. This is due to the way I/O is done on the Intel Touchstone Delta: the cwrite call returns after data is written to the cache in the I/O node, without waiting for the data to be written to disk.

All PASSION routines involving array sections use Data Sieving for greater efficiency.

Related Work

There has been some related research in software support for high performance parallel I/O. The Two-Phase I/O read/write strategy was first proposed by Bordawekar et al. The effects of prefetching blocks of a file in a multiprocessor file system are studied by Kotz and Ellis. Prefetching for in-core problems is discussed by Callahan, Kennedy and Porterfield and by Mowry, Lam and Gupta. Vesta is a parallel file system designed and developed at the IBM T. J. Watson Research Center which supports logical partitioning of files. File declustering, where different blocks of a file are stored on distinct disks, is suggested by Livny, Khoshafian and Boral. This is used in the Bridge File System, in Intel's Concurrent File System (CFS), and in various RAID schemes. An overview of the various issues involved in high performance I/O is given by del Rosario and Choudhary.

Conclusions

The PASSION Runtime Library provides high-level runtime support for loosely synchronous out-of-core computations on distributed memory parallel computers. The routines perform efficient parallel I/O as well as interprocessor communication. The PASSION runtime procedures can either be used together with a compiler to translate out-of-core data parallel programs, or used directly by application programmers. A number of optimizations have been incorporated in the runtime library for greater efficiency. The two optimizations described in this paper, namely Data Prefetching and Data Sieving, provide considerable performance improvement. Data Prefetching overlaps computation with I/O, while Data Sieving improves the granularity of I/O accesses for reading or writing array sections.

The PASSION Runtime Library is currently available on the Intel Paragon, Touchstone Delta and iPSC using Intel's Concurrent File System. Efforts are underway to port it to the IBM SP-1 and SP-2 using the Vesta Parallel File System. Additional information about PASSION is available on the World Wide Web at http://www.cat.syr.edu/passion.html. PASSION related papers can also be obtained from the anonymous ftp site erc.cat.syr.edu.

Acknowledgments

We thank Geoffrey Fox, Ken Kennedy, Chuck Koelbel, Paul Messina and Joel Saltz for many fruitful discussions and helpful comments.

References

R. Bordawekar, A. Choudhary and R. Thakur. Data Access Reorganizations in Compiling Out-of-core Data Parallel Programs on Distributed Memory Machines. Technical Report, NPAC, Syracuse University.

R. Bordawekar, J. del Rosario and A. Choudhary. Design and Evaluation of Primitives for Parallel I/O. In Proceedings of Supercomputing.

D. Callahan, K. Kennedy and A. Porterfield. Software Prefetching. In Proceedings of ASPLOS.

A. Choudhary, R. Bordawekar, M. Harry, R. Krishnaiyer, R. Ponnusamy, T. Singh and R. Thakur. PASSION: Parallel and Scalable Software for Input-Output. Technical Report, NPAC, Syracuse University.

P. Corbett, S. Baylor and D. Feitelson. Overview of the Vesta Parallel File System. In Proceedings of the Workshop on I/O in Parallel Computer Systems at IPPS.

P. Corbett and D. Feitelson. Overview of the Vesta Parallel File System. In Proceedings of the Scalable High Performance Computing Conference.

P. Corbett, D. Feitelson, J. Prost and S. Baylor. Parallel Access to Files in the Vesta File System. In Proceedings of Supercomputing.

J. del Rosario, R. Bordawekar and A. Choudhary. A Two-Phase Strategy for Achieving High-Performance Parallel I/O. Technical Report, NPAC, Syracuse University.

J. del Rosario and A. Choudhary. High Performance I/O for Parallel Computers: Problems and Prospects. IEEE Computer.

P. Dibble, M. Scott and C. Ellis. Bridge: A High-Performance File System for Parallel Processors. In Proceedings of the International Conference on Distributed Computing Systems.

D. Kotz and C. Ellis. Prefetching in File Systems for MIMD Multiprocessors. IEEE Transactions on Parallel and Distributed Systems.

M. Livny, S. Khoshafian and H. Boral. Multi-Disk Management Algorithms. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems.

T. Mowry, M. Lam and A. Gupta. Design and Evaluation of a Compiler Algorithm for Prefetching. In Proceedings of ASPLOS.

D. Patterson, G. Gibson and R. Katz. A Case for Redundant Arrays of Inexpensive Disks. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

P. Pierce. A Concurrent File System for a Highly Parallel Mass Storage Subsystem. In Proceedings of the Conference on Hypercubes, Concurrent Computers and Applications.

R. Thakur, R. Bordawekar and A. Choudhary. Compiler and Runtime Support for Out-of-Core HPF Programs. In Proceedings of the ACM International Conference on Supercomputing.