University of Vienna

Institute Report

Institute for Software Technology and Parallel Systems

University of Vienna

Vienna

Preface

This report describes the scientific activities of the Institute for Software Technology and Parallel Systems at the University of Vienna for the three-year period from January through December.

The primary objectives of the Institute are

- to conduct research in programming languages, compilers, programming environments and software tools that support the user in the process of solving problems on high-performance computing systems,

- to achieve a transfer of technology by cooperating with application developers and industry, and

- to disseminate knowledge in the fields of parallel computing and software technology.

Our research activities focussed on the continuation of the Vienna Fortran language development, a contribution to the evolving de-facto standard High Performance Fortran (HPF), the consolidation and extension of the Vienna Fortran Compilation System (VFCS), and the development of new tools for performance prediction and analysis. Long-term research topics included knowledge-based support for parallel program development, high-level user interfaces, and integrated programming environments combining compilers, performance analysis systems, and automatic parallelization support tools.

Research at the Institute was performed in cooperation with many institutions across the world. Emphasis was placed on strengthening the collaboration with partners from industry, in particular in the context of ESPRIT projects. The most visible success resulting from these efforts was the founding of the Vienna Centre of Excellence for Parallel Computing (VCPC) in June by the European Union. The major objective of the VCPC, which provides a Meiko CS-2 architecture with processors to industrial and academic users across Europe, is the support of a technology transfer from academia to industry.

During the reporting period the Institute contributed to the ESPRIT III projects PPPE (A Portable Parallel Programming Environment) and PREPARE (Programming Environment for Parallel Architectures). The ESPRIT IV Long Term Research project HPF+, coordinated by the Institute, started in parallel with the industrial ESPRIT IV research project PHAROS, in which the VCPC is a partner.

The Institute coordinates two nationally funded cooperative projects conducted together with members of the Austrian Center for Parallel Computation (ACPC): the priority research program "Software for Parallel Systems", supported by the Austrian Research Foundation (FWF), and PACT, a project funded by the Austrian Ministry for Science, Research and the Arts (BMWFK), which involves partners from Central and East European countries.

We would like to express our thanks to the Austrian Ministry for Science, Research and the Arts (BMWFK), the Austrian Research Foundation (FWF), the European Commission, and the University of Vienna for their generous support and invaluable advice and guidance. The cooperation with the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, Virginia, has been instrumental in defining the direction of our work and reaching our goals in research.

I am grateful to the Rector of the University of Vienna and the Dean of the Faculty of Social Studies and Economics, as well as to the many colleagues at the University of Vienna who have supported us over the past years.

Last but not least, I would like to thank all members of the Institute for their hard work and continued enthusiasm.

Vienna, February

Hans P. Zima

Acknowledgement: I thank Bernd Wender, who co-edited this report and collected and organized the statistical information supplied in the later chapters.

Our Coordinates

Institute for Software Technology and Parallel Systems, University of Vienna
Liechtensteinstrasse, Vienna, Austria
Telephone:
Fax:
WWW Home Page: http://www.par.univie.ac.at
Anonymous FTP Server: ftp.par.univie.ac.at

Vienna Centre of Excellence for Parallel Computing (VCPC)
Liechtensteinstrasse, Vienna, Austria
Telephone:
Fax:
WWW Home Page: http://www.vcpc.univie.ac.at

Contents

Members of the Institute and Visiting Researchers
  Members of the Institute
  Members of the VCPC
  Visiting Researchers

Research Activities and Cooperations: An Overview
  Research Overview
    Background, Motivation and Objectives
    Languages and Compilation Technology
    Tools and Knowledge-Based Program Development
  Projects
  Major Cooperation Partners

Languages and Compiler Technology
  Vienna Fortran
    Overview
    Language Features
    Comparison to HPF
  OPUS: Abstractions for High-Performance Distributed Computing
    Overview
    Shared Abstractions
    OPUS Runtime Support
  HPF
  The Vienna Fortran Compilation System (VFCS)
    Overview
    Program Transformation and Analysis
    Parallelization Strategy
    Optimization and Code Generation
    Future Work
  PARTI
  Dynamic Data Distributions: Optimizations
    Reduction of Redistribution Costs
    Compilation Strategies for Multiple Reaching Distributions
  Compilation Techniques for Virtual Shared Memory Architectures
  Parallelization of Sparse Matrix Codes
    Overview
    Representing Sparse Matrices on DMMPs
    Language Extensions for the Support of Sparse Matrix Computations
    Sparse Matrix Product
    Implementation
  High Performance Access to Mass Storage Systems
    Introduction
    Language and Compiler Support
    Advanced Runtime Support
    Conclusion

Tools and Knowledge-Based Program Development
  ANALYST
    Summary
    Design of the Tool
  Concept Comprehension: PAP Recognizer
  Support for Data Alignment and Distribution
  P³T
  Simulation Tool
    Motivation and Objectives
    System Description
    Performance Indices and Monitoring Facilities
    Current State and Future Research
  VFPMS
  Post-Execution Performance Analysis
  Expert System Adviser
    Motivation
    Current State
    Future Work
  Debugging
  NICE

Algorithms for Massively Parallel Computer Systems
  Motivation
  Matrix Multiplication on SIMD Processor Arrays
  Singular Value Decomposition Algorithms for MIMD Hypercube Systems
  VLSI Algorithms for Special Linear Systems

Description of Major Projects
  The ESPRIT III Project PPPE
  The ESPRIT III Project PREPARE
  The ESPRIT IV Project HPF+
  The ESPRIT IV Project PHAROS (VCPC)
  The ACTS Project DIANE (VCPC)
  LACE (VCPC)
  MAGELLAN (VCPC)
  The CEI Project PACT
  FWF Priority Research Program "Software for Parallel Systems"

European Centre of Excellence for Parallel Computing at Vienna (VCPC)
  Parallel Programming Tools
  CSHA Hardware Resources
  Services
  Other Activities

Publications
  Books
  Chapters in Books
  Refereed Publications
  Technical Reports and Other Publications
  Editorial Activities
  Program Committee Memberships
  Habilitation Thesis
  PhD Theses

Lectures and Research Visits
  Lectures
  Research Visits

Appointments and Awards

Curricula and Courses

Research Facilities

Institute/VCPC Colloquium

Visitors

Chapter

Members of the Institute and Visiting Researchers

Members of the Institute

Chair

Prof. Hans P. Zima (zima@par.univie.ac.at)

Research Staff

Dr. Stefan Andel (andel@par.univie.ac.at)
Dr. Siegfried Benkner (sigi@par.univie.ac.at)
Dr. Roman Blasko (roman@par.univie.ac.at)
Dr. Peter Brezany (brezany@par.univie.ac.at)
Barbara M. Chapman, BSc(Hons) (barbara@vcpc.univie.ac.at)
Dipl.-Ing. Beniamino Di Martino (dimartin@par.univie.ac.at)
Dr. Thomas Fahringer (tf@par.univie.ac.at)
Dipl.-Ing. Ying Hou
Dipl.-Ing. Jan Hulman (hulman@par.univie.ac.at)
Dipl.-Ing. Bernhard Knaus
Dipl.-Ing. Peter Kutschera
Dr. Maria Lucka
Dipl.-Ing. Eduard Mehofer (mehofer@par.univie.ac.at)
Dipl.-Ing. Hans Moritsch (hm@vcpc.univie.ac.at)
Dr. Mario Pantano (pantano@par.univie.ac.at)
Dipl.-Ing. Robert Rippel
Kamram Sanjari, BSc (sanjari@par.univie.ac.at)
Dr. Viera Sipkova (sipka@par.univie.ac.at)
Dr. Marian Vajtersic (marian@par.univie.ac.at)
Dipl.-Ing. Bernd Wender (wender@par.univie.ac.at)

Administrative Staff

Edwin Cikan (cikan@par.univie.ac.at)
Maria Cherry (maria@par.univie.ac.at)
Elisabeth Obermaier (sec@par.univie.ac.at)
Martin Paul (martin@par.univie.ac.at)
Renate Schwinghammer

Members of the VCPC

Director

Barbara M. Chapman, BSc(Hons) (barbara@vcpc.univie.ac.at)

Research Staff

Dr. Ian Glendinning (ian@vcpc.univie.ac.at)
Dipl.-Ing. Hans Moritsch (hm@vcpc.univie.ac.at)
Dr. Guy Robinson (robinson@vcpc.univie.ac.at)

Administrative Staff

Markus Egg (markus@vcpc.univie.ac.at)
Helga Pfeifer (helga@vcpc.univie.ac.at)
Aleksandar Vestica

Visiting Researchers

Prof. Robert G. Babb, Univ. of Denver, USA
Prof. Shahid H. Bokhari, Univ. of Engineering and Technology, Lahore, Pakistan
Dipl.-Ing. Beniamino Di Martino, Univ. of Naples, Italy
Dr. Michael Gerndt, KFA Jülich, Germany
Dr. Matthew Haines, ICASE, Hampton, Virginia, USA
Dr. Piyush Mehrotra, ICASE, Hampton, Virginia, USA
Dr. Mario Pantano, Univ. of Pavia, Italy
Dipl.-Ing. Guillermo Trabado, Univ. of Malaga, Spain
Dipl.-Ing. Manuel Ujaldon, Univ. of Malaga, Spain
Dr. Maria Lucka, Slovak Academy of Sciences, Bratislava, Slovakia
Dr. Marian Vajtersic, Slovak Academy of Sciences, Bratislava, Slovakia

Chapter

Research Activities and Cooperations: An Overview

In this chapter we present an overview of the research activities at the Institute in the reporting period. We discuss the background and motivation of current work and outline the major research themes, followed by a short description of the projects performed during this period; the chapter closes with a list of the major cooperation partners of the Institute.

Research Overview

Background, Motivation and Objectives

High Performance Computing Systems (HPCs) have become a fundamental tool of science and engineering. They can be used to simulate highly complex processes occurring in nature, in industry, or in socio-economic systems. Such simulations are based upon mathematical models, typically governed by partial differential equations, whose solution in a discretized version suitable for computation may require trillions of operations. In engineering, the use of HPCs may save money as well as material resources: with their help it is possible to model the behavior of automobiles in a crash, or of airplanes in critical flight situations, in a manner which is so realistic that many situations can be simulated on a computer before an actual physical prototype is built. The Nobel prize winner Ken Wilson has claimed that HPCs are beginning to play a role in modern science similar to that of the telescope in seventeenth-century astronomy: the unprecedented increase in computing power at reasonable cost has opened up an exciting new avenue for the exploration of the universe, from dissecting the behavior of subatomic particles to simulating the collision of galaxies.

A more recent development, inspired at least in part by the increasing maturity of the HPCs themselves, is their use for information management. In particular, database technology is being adapted to multiprocessor systems, enabling huge amounts of data to be searched, rapid queries to be performed, and computation-intensive transactions to be processed. This development represents a qualitative leap in the capabilities of information technologies: the combination of new information technologies with parallel and distributed computing platforms will form the basis for the information society of the future. It will have a wide-ranging effect on all production and service industries, as well as the provision of societal services such as health, education and transport.

The two major sources contributing to the power of HPC hardware are circuit technology and multi-level architectural concurrency. Circuit technology, which has been progressively improving over a long period of time, is seen to approach inherent physical limits. In contrast, architectural concurrency allows the design of systems with a large potential for growth. Distributed-memory multiprocessing systems (DMMPs) provide scalability by connecting a potentially large number of processing nodes via a sophisticated network. A DMMP node has traditionally been a single off-the-shelf processor; in recent developments a trend towards hybrid systems can be observed, in which the nodes are symmetric shared-memory machines (SMPs). Massively parallel I/O subsystems are added to DMMPs in order to balance the I/O capabilities with their computational power. Furthermore, future DMMPs are seen to provide a hardware-supported virtual shared memory, thus alleviating the problems of dealing with partitioned memory in software.

An orthogonal trend in architectures, favored by the dramatic improvements in global network connections, has led to heterogeneous distributed multiprocessing systems, which combine subsystems such as workstation clusters, vector computers, SMPs or DMMPs into single integrated systems that may be distributed over many geographically distant locations.

One of the crucial, yet only partially solved, problems regarding HPCs is software, with the related research issues covering the range from models, applications and algorithms to languages, compilers and programming environments. These topics, in particular the last three, constitute the research agenda of the Institute. Our activities are part of a worldwide effort towards developing the models, theories and practical foundations to make HPCs accessible to a broad range of users at a high level of abstraction. The major goals of our work include

- research in high-level paradigms, languages and programming environments for HPCs,

- the study of new models, applications and algorithms,

- research in knowledge-based techniques supporting application, compiler and tool development for HPCs,

- the study of techniques for managing large amounts of data in parallel file systems and for accessing heterogeneous and distributed data,

- a transfer of technology to industry, and

- an active contribution to international standardization efforts.

Although the past three decades have seen a large amount of research on parallel languages, only a small fraction of this work had practical consequences for the programming of HPCs at a sufficiently high level of abstraction. Many computer scientists have found it hard to accept that designers of numerical applications still adhere to conventional programming languages such as Fortran, C and C++, and that therefore a broadly accepted programming methodology can only be the result of an evolutionary development that includes support for the use of these languages on HPCs.

This situation has changed during the past few years, and the computer science community has become increasingly aware of the intellectual challenge and economic importance of research in this area. New languages have been developed, and standardization efforts, such as those for High Performance Fortran (HPF) and High Performance C++ (HPC++), have been initiated. Our work, in particular the design of Vienna Fortran, the implementation of the Vienna Fortran Compilation System (VFCS), and the development of performance analysis techniques integrated into the compilation environment, has significantly contributed to international language standards and advanced compilation technology.

The scope of our research ranges from pure computer science projects to cooperative work with a strong interdisciplinary character. We take part in projects with European, American and Japanese partners from universities, research establishments and industrial companies. The membership of Austria in the European Union has strengthened our involvement in the European research scene: we participate in the European Commission's Third and Fourth Research Frameworks and the ACTS Programme, and are represented in the HPCN Advisory Panel of the European Commission. Collaborations with leading hardware manufacturers supply the Institute with first-hand knowledge about current and future architectural trends.

The relationships with other relevant Austrian research efforts, including cooperations with institutes at the University of Vienna, the Vienna University of Technology and the Graz University of Technology, and partners in the Austrian Center for Parallel Computation (ACPC), contribute to creating a strong international presence for Austria in the field of parallel computation.

Our research deeply affects the scope and contents of the educational activities offered by the Institute. Lectures, seminars and laboratories are offered for diploma and PhD students as well as for postdocs. Furthermore, diploma and PhD theses are integrated into the various research projects.

Languages and Compilation Technology

Background

Traditionally, HPCs have been programmed using a standard sequential programming language (Fortran or C), augmented with message passing constructs. In this paradigm, the user is forced to deal with all aspects of the distribution of data and work to the processors, and to control the program's execution by explicitly inserting message passing operations. The resulting programming style can be compared to assembly language programming for a sequential machine; it has led to slow software development cycles and high costs for software production. Moreover, although standards for message passing such as PARMACS and MPI have been developed, the portability of programs using these standards is limited, since the characteristics of the target architectures may require extensive restructuring of the code.

As a consequence, much research and development activity has been concentrated in recent years on providing higher-level programming paradigms for HPCs. A major direction of work in this area focussed on the data-parallel Single-Program-Multiple-Data (SPMD) model of computation. This model is characterized by a single thread of control, with all processors executing the same parameterized code in a loosely synchronous fashion. Parallelism is obtained by applying the computation to different parts of the data domain simultaneously. High-level support for this paradigm requires language features for the specification of data distribution and alignment. The idea is to provide the compiler with enough high-level information so that the user can be freed from dealing with the message-passing level explicitly, delegating this work to the compiler.

This development began in the mid-1980s with research at the California Institute of Technology, the GMD, and in the SUPERB compiler project at the University of Bonn, followed by related efforts elsewhere. The growing body of experience with compilation technology led to the formalization of high-level data-parallel languages, primarily based upon Fortran. Vienna Fortran, jointly developed by our Institute and ICASE, NASA Langley Research Center, was the first fully defined data-parallel language targeted towards DMMPs. The current de-facto standard High Performance Fortran (HPF) is an extension of Fortran based on Vienna Fortran, Fortran D and other languages. Typically, these languages allow the explicit specification of processor arrays to define the set of abstract processors used to execute a program. Distributions map data arrays to processor sets; the establishment of an alignment relation between arrays results in the mapping of corresponding array elements to the same processor, thus avoiding communication if such elements are jointly used in a computation.
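These mapping concepts can be made concrete with a small sketch. The following Python fragment is purely illustrative and not part of any system described in this report; the helper names `block_owner` and `cyclic_owner` are hypothetical. It shows how a BLOCK and a CYCLIC distribution, in the style of Vienna Fortran and HPF, assign the elements of a one-dimensional array to a set of abstract processors.

```python
import math

def block_owner(i, n, p):
    """Owner of element i (0-based) of an n-element array distributed
    BLOCK-wise over p processors: contiguous chunks of size ceil(n/p)
    go to consecutive processors."""
    chunk = math.ceil(n / p)
    return i // chunk

def cyclic_owner(i, p):
    """Owner of element i under a CYCLIC distribution:
    elements are dealt out to the processors round-robin."""
    return i % p

# A 10-element array on 4 processors:
n, p = 10, 4
print([block_owner(i, n, p) for i in range(n)])   # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
print([cyclic_owner(i, p) for i in range(n)])     # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```

Alignment can then be read directly off such owner functions: two arrays aligned with each other share the same mapping, so corresponding elements reside on the same processor and can be combined without communication.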

Research Overview

The research and development work of the Institute in programming languages and compilers focussed on extensions of Vienna Fortran for areas such as unstructured grid computations, sparse matrix codes, multidisciplinary applications and parallel I/O; the definition and implementation of a language version based on Fortran 90 (Vienna Fortran 90); and the continued development of related optimizing compiler technology.

We summarize the most important subprojects below.

- Vienna Fortran 90

The original Vienna Fortran language is a language extension of Fortran 77. Vienna Fortran 90 (VF 90) is an extension and enhancement of Vienna Fortran based upon Fortran 90. It is described in more detail in a later section.

- OPUS

Vienna Fortran, as well as VF 90 and HPF, are data-parallel languages built upon the SPMD programming model. None of these languages is well suited to express task parallelism and the related synchronization and communication mechanisms. OPUS is an object-based coordination language which can be combined with a data-parallel language to express asynchronous tasks and to integrate data with task parallelism. The major features of OPUS are outlined in a later section.

- HPF+

One of the significant differences between Vienna Fortran and the current version of HPF is the fact that Vienna Fortran is based upon a general framework for data distribution and alignment, while HPF permits only a small number of predefined regular data distributions. As a consequence, HPF does not adequately reflect the requirements of many advanced applications. The motivation for the development of HPF+ is to improve HPF in order to make it applicable to a broader range of applications. An outline of the language is given in a later section; the ESPRIT IV project in which a full language specification and implementation will be developed is also described later in this report.

- Vienna Fortran Compilation System (VFCS)

VFCS is an integrated compilation environment for Vienna Fortran, VF 90 and HPF. It provides the transformation and analysis facilities for the interactive translation of programs in these languages to target programs with explicit message passing. An overview of the current state of VFCS is given in a later section; more specialized topics are discussed below and in the corresponding sections of the following chapters.

- PARTI

Many algorithms, such as sparse matrix codes or sweeps over unstructured meshes, require irregular accesses to data structures. The PARTI library, developed at ICASE and the University of Maryland, was an important first step in providing runtime support for such algorithms; it was integrated at an early stage into VFCS. We have generalized and optimized PARTI in the framework of the ESPRIT III research project PREPARE. More details on the development of PARTI are given in later sections.

- Dynamic Data Distribution

Vienna Fortran, VF 90 and HPF all offer statements which can change the data distribution associated with an array at runtime. However, none of the currently existing commercial HPF compilers implements this feature, and many aspects of its implementation are still areas of active research. The reason for this is the disastrous effect that a naive implementation of dynamic data distribution can have upon the performance of the resulting program. We discuss dynamic data distribution in more detail in a later section, focussing on the required analysis and optimization features.
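Why a naive implementation is so costly can be sketched as follows. The fragment below is an independent illustration with hypothetical helper names, not VFCS code: it computes the set of elements that actually change owner when a one-dimensional array is redistributed from a BLOCK to a CYCLIC distribution. A naive implementation would communicate every element, whereas an optimizing compiler or runtime only needs to move this set.

```python
import math

def block_owner(i, n, p):
    # BLOCK distribution: contiguous chunks of ceil(n/p) elements
    return i // math.ceil(n / p)

def cyclic_owner(i, p):
    # CYCLIC distribution: round-robin assignment of elements
    return i % p

def redistribution_set(n, p):
    """Elements of an n-element array that change owner when
    redistributing BLOCK -> CYCLIC over p processors, as
    (element, old_owner, new_owner) triples."""
    return [(i, block_owner(i, n, p), cyclic_owner(i, p))
            for i in range(n)
            if block_owner(i, n, p) != cyclic_owner(i, p)]

moves = redistribution_set(12, 4)
print(f"{len(moves)} of 12 elements move")  # 8 of 12 elements move
```

The analysis a compiler performs on reaching distributions serves exactly this purpose: proving at compile time which elements stay put, so that the communication at a redistribution point can be reduced to the elements that actually change owner.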

- Compilation Techniques for Virtual Shared Memory Architectures

In a joint project with CONVEX Computers (Richardson, Texas, USA, and Germany), we will develop new language features and compilation strategies for virtual shared memory architectures. This will include the development of a code generator for the CONVEX Exemplar. A more detailed description of this project is given in a later section.

- Sparse Codes

In a joint project with the University of Malaga, we have developed specialized compilation and runtime techniques for the parallelization of sparse codes. The user specifies the representation and distribution of sparse data in a special annotation or directive; based upon this information, the compiler can generate code that optimizes the runtime behavior of the program. This technique is described in a later section.
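For background, a widely used representation for sparse data is compressed row storage (CRS), which keeps only the nonzero entries of a matrix. The sketch below is a generic illustration of this representation, not the annotation syntax or runtime code of the project: it builds the three CRS arrays from a dense matrix and performs a sparse matrix-vector product over them.

```python
def to_crs(dense):
    """Build the three CRS arrays (values, column indices, row
    pointers) from a dense matrix given as a list of rows."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def crs_matvec(values, col_idx, row_ptr, x):
    """y = A*x, touching only the stored nonzeros of A."""
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y

A = [[4, 0, 0], [0, 0, 2], [1, 0, 3]]
vals, cols, ptrs = to_crs(A)
print(crs_matvec(vals, cols, ptrs, [1, 1, 1]))  # [4, 2, 4]
```

The indirection through `col_idx` is precisely the kind of irregular access pattern that makes sparse codes hard to parallelize on DMMPs: which remote elements of `x` a processor needs is known only at runtime, once the structure of the matrix is known.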

- Parallel Input/Output

We develop language extensions and related compilation and runtime technology that support parallel input/output and out-of-core data structures. More details about this project are given in a later section.

Future Work

Although significant progress in languages and compilers for HPCs has been made over the past years, there are still unsolved issues, and new problems arise due to the rapid evolution of new architectures and systems. We summarize a few important points here:

- current de-facto language standards, in particular HPF, do not provide sufficient support for advanced algorithms;

- new HPC architecture models, such as networks of SMPs or heterogeneous systems, are not yet sufficiently reflected in actual languages and compilers; and

- the compilation and runtime technology for problems requiring dynamic load balancing and irregular data and work distribution has not yet reached sufficient maturity.

Future research must address these issues, in particular

- the improvement of HPF and the related implementation technology,

- the integration of data and task parallelism,

- the automatic coupling of dynamic data partitioners to compiled programs,

- the interoperability of Fortran/HPF and C++/HPC++ programs, and

- the study of new models, programming paradigms and languages.

Tools and Knowledge-Based Program Development

Research Overview

The long-term commercial success of HPCs will depend on the availability of high-level programming interfaces that combine user-friendliness with the generation of high-performance target code. Languages such as Vienna Fortran, VF 90 and HPF can be seen as first steps in this direction. However, conventional compiler technology alone does not suffice to produce high-quality parallel object code for these languages. It is necessary to develop integrated compilation environments which offer a range of sophisticated tools that support an automated translation process by providing the required analysis information. Furthermore, in the medium to long term, automatic support for data distribution and alignment, as well as for the selection of a parallelization strategy, will become mandatory.

Below we give a short overview of research conducted over the past three years at our institute.

- ANALYST

A set of analysis algorithms from VFCS, enhanced by new features, has been combined to form a separate tool, ANALYST, that can be applied to a program independent of the transformation facilities of VFCS. It provides a range of information about a source program, in either graphical or text format, at different levels of abstraction. An outline of the features of ANALYST is given in a later section.

- Concept Comprehension

Concept comprehension is based upon a set of predefined algorithmic patterns occurring in source programs. A special parsing process is applied to the program to detect such patterns, which can then be mapped to efficient object code depending on the particular target machine and associated libraries. Concept comprehension can support many aspects of automatic parallelization, such as automatic data distribution and the automatic selection of restructuring strategies. It is described in a later section.

- Support for Data Distribution and Alignment

Languages such as Vienna Fortran and HPF leave the intellectually most difficult problem, the choice of an efficient data distribution, to the user. A strategy for the automatic support of data distribution and alignment is discussed in a later section.

- P³T

P³T is a compile-time performance prediction tool. Applied to the message-passing program generated by VFCS and a set of input data, it produces a range of parameter values characterizing the behavior of the resulting object programs. These include the work distribution, the number of communication transfers, transfer volumes, and machine-dependent parameters such as the cache hit ratio. A detailed description is given in a later section.
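The kind of parameters P³T derives can be illustrated on a simple case. The following sketch is an independent, simplified illustration rather than P³T's actual method: for the one-dimensional stencil a(i) = b(i-1) + b(i+1) over BLOCK-distributed arrays, it statically estimates the work distribution, the number of communication transfers, and the transfer volume, assuming the number of processors evenly divides the array size.

```python
def stencil_estimate(n, p):
    """Static estimate for a(i) = b(i-1) + b(i+1), i = 1..n-2,
    with a and b BLOCK-distributed over p processors (chunk size
    n // p, assuming p divides n).
    Returns (iterations per processor, messages, transfer volume)."""
    chunk = n // p
    work = [chunk] * p
    work[0] -= 1      # boundary element i = 0 is not computed
    work[-1] -= 1     # boundary element i = n-1 is not computed
    # each interior chunk boundary needs one element in each direction
    messages = 2 * (p - 1)
    volume = messages  # one array element per message
    return work, messages, volume

work, msgs, vol = stencil_estimate(16, 4)
print(work, msgs, vol)  # [3, 4, 4, 3] 6 6
```

Even this toy estimate exposes the quantities a compiler or user needs for tuning: the work distribution is nearly balanced, and both the message count and the volume grow with the number of chunk boundaries rather than with the problem size.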

- Simulation Tool

The simulation tool PEPSY is based on the automatic modelling of parallel programs and performance analysis using discrete event simulation. It is described in a later section.

- VFPMS

The Vienna Fortran Performance Measuring System (VFPMS) provides a high-level facility for specifying measurements of performance-critical sections of Vienna Fortran or HPF programs. Its major features are outlined in a later section.

- Post-Execution Performance Analysis

In many cases, compile-time performance prediction or simulation cannot provide the user or compiler with sufficiently precise information about the performance of the program. In such a situation, the program may be run on the target architecture, and the analysis of the relevant performance parameters can provide feedback for a restructuring strategy with the objective of performance tuning. Support for post-execution analysis is discussed in a later section.

- Expert System Adviser

We have used expert system technology to develop a knowledge-based tool supporting the parallelization process. This is described in a later section.

- Debugging

We are working on a source-level debugging system that allows the programmer to observe program behavior at the level at which the program has been developed. This work is described in a later section.

Future Work

The development of intelligent, knowledge-based programming environments is a long-term research goal. Much of the work discussed above contributes to partial solutions, yet many problems remain currently unsolved. Below we summarize some of these issues and provide an outlook on future research.

- The first generation of commercial HPF compilers produces object code of only moderate quality. While this is partly due to the immaturity of early releases, another, inherent cause is the missing availability of performance information in the compiler.

- Many performance analysis tools have been developed in the past. While these tools generally provide a detailed account of the behavior of a parallel program, this information is usually specified at the message passing level. Furthermore, these systems are not connected to a compiler; therefore they can influence program tuning only indirectly, via a user-guided interpretation of the low-level information produced by the system.

- Sophisticated program restructuring systems such as VFCS perform an interactive source-to-source transformation from an HPF-like language level to message passing code. They offer extensive program analysis tools and restructuring transformations. However, strategic guidance of the program transformation process is essentially left to the user. Moreover, although these systems contain a wealth of information about the hardware/software environment and the program to be parallelized, this information is usually implicit, hidden in the code, and not immediately accessible to software tools or the user.

- Much research effort has been spent on automatic strategies supporting the selection of data distribution and alignment for regular problems. This work, performed at a number of academic sites and companies, resulted in a number of useful proposals which, however, still have to prove their applicability to real application problems.

In a later section we outline the design of a knowledge-based compilation environment which addresses many of the issues discussed above.

Projects

Below we give a short description of the major projects performed during the reporting period, including two projects that began recently. These projects are funded by the Austrian Ministry for Science, Research and the Arts (BMWFK), the Austrian Science Foundation (FWF), and the European Commission.

- Programming Environments for DMMPs
Description: Development of an interactive programming environment for data-parallel scientific applications; porting of the SUPERB parallelization system to the Intel iPSC multiprocessor system.
Time Period: November to April

- Software for Parallel Systems
Role of the Institute: Coordinator
Description: Cooperative Priority Research Program ("Schwerpunktprojekt") funded by the Austrian Research Foundation (FWF); described in more detail later in this report.
Partners: Members of the Austrian Center for Parallel Computation (ACPC)
Time Period: July to June

- Algorithms and Software for Parallel Systems
Description: Development of data-parallel numerical algorithms and related software support.
Partner: Slovak Academy of Sciences, Bratislava, Slovakia
Time Period: July to December

- Advanced Tools for Parallel Programming
Description: Design and development of tools for performance analysis and performance prediction.
Partner: Slovak Academy of Sciences, Bratislava, Slovakia
Time Period: April to December

- Porting the Vienna Fortran Compilation System to RISC Workstation Clusters
Description: Development of a PARMACS back end for the VFCS.
Partner: IBM Austria
Time Period: July to August

 Portable Parallel Programming Environments PPPE ESPRIT I I I Pro ject

Role of the Institute Partner

Description Research in language design advanced compilation metho ds and p erformance ana

lysis and prediction to ols see Section

Time Perio d July June

 Languages and Compilers for Parallel Programming PREPARE ESPRIT I I I Pro

ject

Role of the Institute Asso ciate Partner ACE

Description Development of a compiler and a to ol set for High Performance Fortran see Section

Time Perio d July March

 Automatic Data Distribution

Description Development of software to ols that supp ort data distribution for DMMPs

Time Perio d January Decemb er

 Performance Prediction and Exp ert Adviser for a Parallel Programming Environment

Description Development of software to ols for p erformance prediction of parallel programs and

knowledgebased supp ort for parallelization

Time Perio d February June

 PACT Programming Environments Algorithms Applications Compilers and To ols

for Parallel Computation

Role of the Institute Co ordinator

Description Co op erative pro ject funded by the Austrian Ministry for Research Science and the

Arts BMWFK see Section

Time Perio d Novemb er Decemb er

 Porting Vienna Fortran to the CONVEX Exemplar

Role of the Institute Partner

Description Development of language extensions and compilation techniques for shared virtual

memory architectures development of a co de generator for the CONVEX Exemplar

Time Perio d Octob er op en

 HPF ESPRIT IV LTR Pro ject

Role of the Institute Co ordinator

Description Development of HPF an extension of HPF and its implementation see Sections

and

Time Perio d January Decemb er

 PHAROS Op en HPF Programming Environment ESPRIT IV HPCN Pro ject

Role of the Institute Partner VCPC

Description Integration of to ols for HPF program development see Section

Time Perio d January Decemb er

 DIANE Distributed Annotation Environment ACTS Pro ject

Role of the Institute Partner VCPC

Description See Section

 LACE

Role of the Institute Partner VCPC

Description See Section

 MAGELLAN

Role of the Institute Partner VCPC

Description See Section

Major Cooperation Partners

- ACE (Associated Computer Experts bv)
- Argonne National Laboratory, Illinois, USA
- AVL, Graz, Austria
- California Institute of Technology, Pasadena, California, USA
- Convex Computers, Richardson, Texas, USA, and Munich, Germany
- Cray Research, Eagan, Minnesota, USA
- Engineering Systems International (ESI), Paris, France, and Frankfurt, Germany
- European Centre for Medium-Range Weather Forecasts (ECMWF), Reading, UK
- GMD, Birlinghoven, Germany
- GMD FIRST, Berlin, Germany
- University of Graz, Graz, Austria
- IBM Austria, Vienna, Austria
- University of Illinois at Urbana-Champaign, Illinois, USA
- INRIA, Rocquencourt, France
- Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, Virginia, USA
- University of Indiana, Bloomington, Indiana, USA
- IRISA, Rennes, France
- Kapsch, Vienna, Austria
- KFA Jülich, Jülich, Germany
- University of Kyoto, Kyoto, Japan
- University of Linz, Linz, Austria
- University of Malaga, Malaga, Spain
- University of Manchester, Manchester, UK
- University of Maryland, College Park, Maryland, USA
- Meiko, Bristol, UK
- Politecnico di Milano, Milan, Italy
- Technical University of Munich, Munich, Germany
- University of Naples, Naples, Italy
- NA Software, Liverpool, UK
- Pallas GmbH, Brühl, Germany
- University of Pavia, Pavia, Italy
- The Portland Group, Inc. (PGI), Portland, Oregon, USA
- RISC Linz, Hagenberg, Austria
- University of Salzburg, Salzburg, Austria
- University of Southampton, Southampton, UK
- Slovak Academy of Sciences, Bratislava, Slovakia
- University of Syracuse, Syracuse, New York, USA
- TNO-ITI, Delft, The Netherlands

Chapter

Languages and Compiler Technology

Vienna Fortran

Siegfried Benkner

Vienna Fortran is a machine-independent, data-parallel programming language targeted towards DMMPs. It is based upon Fortran 90, with additional language constructs to support the formulation of data-parallel algorithms. Many of these extensions are based on the original Vienna Fortran, an extension of FORTRAN 77. However, a number of language constructs and mechanisms have been redesigned or extended, and new features specific to Fortran 90 have been added. This includes the distribution of objects of derived type and of pointer objects, and a generalized framework for expressing arbitrary data distribution, alignment, and work distribution.

In the following we briefly discuss the advantages of using a language like Vienna Fortran for parallel programming, describe the main language features, and finally discuss its relation to HPF and the directions of future work.

Overview

Vienna Fortran provides the user with a global name space, assumes a single thread of control, and offers a wide range of facilities for distributing the data objects of a program. Data distribution is modeled by a mapping of data arrays to processor array sections. The user specifies the distribution of data at a high level; it is the task of the compiler to generate an explicitly parallel SPMD program by automatically determining the work distribution and the necessary communication. As a consequence, Vienna Fortran programs are not only simpler than hand-written message-passing programs but also considerably more flexible, enabling users to modify a data distribution without major reprogramming. Moreover, the features of Vienna Fortran also provide a powerful and convenient means for specifying parallel algorithms which may exploit the properties of the chosen data distribution in order to obtain efficient code. Thus Vienna Fortran combines the advantages of the shared-memory programming paradigm with mechanisms for explicit user control of those aspects of the program which have the greatest impact on efficiency.

Language Features

Vienna Fortran extends a subset of Fortran 90 with data-parallel language features. It includes those elements of Fortran 90 which are most relevant in the context of data parallelism, in particular the array sublanguage. The use of array syntax not only leads to a more intuitive programming style but can also significantly facilitate compilation and optimization. The fetch-before-store semantics of array assignment statements makes them major sources of data parallelism. Other important features included are dynamic memory allocation, derived types, pointers, modules, and procedure interfaces.

The following data-parallel language features are provided by Vienna Fortran:

- Processor Arrays
  Processor arrays establish an abstraction from the actual underlying machine topology and thus provide a means for achieving machine independence. The size of processor arrays may be specified either at compile time or at link or run time. Subsets of processor arrays may be referenced by means of regular or irregular Fortran 90 array sections, and different views of a processor array may be obtained by using Fortran 90 intrinsic functions.

- Data Distribution
  The distribution of arrays with respect to the user-defined processor structures can be specified by using additional attributes within the type declaration statement. Distributions may be specified either directly, or implicitly by referring to the distribution of another array. For the most common regular distributions, intrinsic functions are provided. Total as well as partial replication is supported. Irregular and indirect distributions can be defined by means of mapping arrays. The concept of indirect distributions of Vienna Fortran has been generalized such that it is possible to distribute either single array dimensions or whole arrays indirectly. Moreover, the user may specify arbitrary new distributions and alignments by means of mapping procedures.
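The regular distributions mentioned above can be read as ownership functions from global array indices to processors. The following Python sketch is illustrative only — the function names and the 0-based indexing are our own, not Vienna Fortran syntax — and shows the idea behind the BLOCK and CYCLIC mappings:

```python
# Illustrative sketch (not Vienna Fortran itself): regular distributions
# viewed as functions mapping a global array index to an owning processor.
def block_owner(i, n, p):
    """Owner of global index i (0-based) for n elements distributed
    blockwise over p processors."""
    block = -(-n // p)          # ceiling division: elements per processor
    return i // block

def cyclic_owner(i, p):
    """Owner of global index i under a cyclic (round-robin) distribution."""
    return i % p

# 16 elements on 4 processors: BLOCK yields contiguous chunks of 4,
# while CYCLIC deals indices out round-robin.
print([block_owner(i, 16, 4) for i in range(16)])
print([cyclic_owner(i, 4) for i in range(8)])
```

Irregular and indirect distributions generalize this picture by replacing the closed-form function with a user-supplied mapping array or mapping procedure.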

- Distribution of Derived Types and Pointers
  The concept of distributed derived types enables the user to distribute the components of objects of derived type in much the same way as objects of intrinsic types. Furthermore, Vienna Fortran provides special features which facilitate the mapping of hierarchically structured data objects to processor subsets, and provides support for distributing pointer objects.

- Mapping Procedures
  The mechanisms provided by Vienna Fortran for the specification of arbitrary user-defined alignment and distribution functions have been unified and generalized by introducing the concept of mapping procedures. Mapping procedures are similar to Fortran procedures: they may have arguments, may be recursive, and may be passed as arguments to other mapping procedures. Besides the specification of arbitrary user-defined distributions and alignments, they may be used to define arbitrary mappings of loop iterations to processor sections.

- Procedure Interfaces
  Dummy arrays may inherit the distribution from the actual arguments or may be distributed explicitly. Support for separate compilation is provided by extending interface blocks of external procedures with data distribution information.

- Modules
  Modules have been extended in such a way that it is possible to define globally accessible processor arrays and commonly used distributed data structures in a clean and portable way. These are called distribution modules.

- Dynamic Data Distribution
  Whereas the distribution of statically distributed objects is fixed within the scope of their declaration, the distribution of dynamically distributed arrays may be defined and changed at run time by special executable statements or as a result of procedure calls. Because of the significant impact of dynamic data distributions on the complexity of compilation and the performance of the target program, a clear syntactic distinction has been made between statically and dynamically distributed objects. Furthermore, a bundling mechanism is provided in order to define sets of arrays that are always distributed together.

- Parallel Loops
  Explicitly parallel loops, which must not contain any loop-carried dependences, may be specified. Private variables and special intrinsic reduction operators are provided. Moreover, it is possible to explicitly specify the distribution of the iterations of a parallel loop to the available processors by means of mapping procedures.

- Inquiry Procedures
  Vienna Fortran provides several intrinsic procedures for obtaining information about the actual distribution of a data object. This enables the user to write code finely tuned with respect to a certain data distribution.

- Local Procedures
  Local procedures, which must not have any side effects and which do not induce communication, may be used. They provide a suitable means for reusing sequential kernels and interfacing with other sequential languages.

Comparison to HPF

HPF was developed by the High Performance Fortran Forum, a consortium from industry, government labs, and academia, in order to provide a standard set of extensions to support programming on a wide variety of parallel architectures.

Vienna Fortran is similar to HPF in a number of features. These include the choice of Fortran 90 as the base language, abstract processor arrays, direct distribution and alignment of arrays, the distinction between static and dynamic distributions, the definition of the procedure interface (in particular, inherited and enforced distributions), and independent loops.

On the other hand, a number of advanced concepts of Vienna Fortran are currently not included in HPF. Among them are different processor views with an explicit specification of the equivalence, the implicit definition of a canonical one-dimensional processor array, distribution of arrays to processor sections, generalized block and indirect distributions, user-defined mapping procedures, the return of distributions from a procedure call, and the distribution of pointers and of objects of derived type.

OPUS: Abstractions for High-Performance Distributed Computing

Barbara Chapman, Matt Haines, Piyush Mehrotra, John Van Rosendale, and Hans P. Zima

Overview

Data-parallel programming languages are rapidly maturing and can now readily express the parallelism in a broad spectrum of scientific applications. However, with the imminent arrival of teraflop architectures, the complexity of simulations being tackled by scientists and engineers is increasing exponentially. Many of these simulations are of a complex, multidisciplinary nature, constructed by pasting together modules from a variety of related scientific disciplines. This raises a host of new software integration issues. While data-parallel languages are well suited to exploiting the parallelism in each module, they are not adequate for this software integration problem, or for exploiting the coarse-grained parallelism that such applications frequently provide.

One important example of a multidisciplinary application is environmental simulation. One might, for example, wish to couple a variety of environmental models, each given initially as a separate program: a plant biology model for the Florida Everglades, a model of the Gulf Stream dynamics, a climate model for North America, and a solar radiation model. The goal is then to interconnect these models into a single multidisciplinary model subsuming the original models together with their various couplings. At the same time, the parallelism both within and between the discipline models should be exposed and effectively exploited.

Analogous issues arise in multidisciplinary optimization. In designing a modern aircraft, for example, one has a wide variety of interacting disciplines: aerodynamics, propulsion, structural analysis and design, controls, and so forth. An optimal engineering design is necessarily an admixture of suboptimal designs in each discipline. The essential goal is to correctly couple a sequence of complex scientific and engineering programs from different disciplines, each designed and implemented by different groups, into a coherent whole capable of effective multidisciplinary optimization. Moreover, the collection of programs chosen must remain flexible, since the choice of programs tends to evolve rapidly as the simulation methodology changes or as unanticipated interactions force alteration of the mix of disciplines or programs being used. These programs must be effectively mapped to heterogeneous distributed multiprocessing systems. In such an environment, statically forming a task graph and coupling tasks via message plumbing appears virtually unworkable, since in a message-passing environment the design of each task requires intimate knowledge of the behavior of all coupled tasks.

In a joint effort of ICASE, NASA Langley Research Center, and our Institute, a coordination language called Opus has been designed which is targeted towards such applications. It provides a software layer on top of data-parallel languages, designed to address both the programming-in-the-large issues and the parallel performance issues arising in complex multidisciplinary applications.

The heart of Opus is a new mechanism called Shared Data Abstractions (SDAs). SDAs borrow from object-oriented systems in that they encapsulate data and the methods that act on the data, and from monitors in shared-memory languages in that an active method has exclusive access to the data of an SDA.

Tasks, i.e., asynchronously executing autonomous activities, are instantiated in Opus by creating instances of SDAs and invoking the associated methods. SDAs represent distinct address spaces; hence, Opus tasks do not directly share data. Instead, interaction between tasks is accomplished by invoking methods in other SDAs. Thus a set of tasks may share a pool of common data by creating an SDA of the appropriate type and making the data SDA available to all tasks in the set. Using SDAs and their associated synchronization facilities also allows the formulation of a range of coordination strategies for these tasks. This set of concepts forms a powerful tool which can be used for the hierarchical structuring of a complex body of code and a concise formulation of the associated coordination and control mechanisms.

The runtime system supporting Opus utilizes lightweight, user-level threads that are capable of supporting both intraprocessor and interprocessor communication primitives in the form of shared memory, message passing, and remote service requests. This allows the independently executing SDA tasks to freely share the underlying parallel resources.

Shared Data Abstractions

The basic new concept in Opus is the Shared Data Abstraction (SDA).

An SDA type specifies an object structure which contains data along with the methods (procedures) which manipulate this data. An SDA is generated by creating an instance of an SDA type. The creation of an SDA involves the allocation of the resources on which the SDA will execute, the allocation of data structures in memory, and any initializations that are necessary to establish a well-defined initial state. In a distributed environment, SDAs can be created on a remote machine. The lifetime of an SDA is the time interval between its creation and its termination; during this interval the SDA exists and can be accessed via method calls.

An SDA can be saved by copying it to external storage, thus generating an external SDA which is identified by a unique external name. External SDAs are persistent, having an a priori unlimited lifetime. Saving an SDA thus makes it accessible for later reuse by loading the external SDA into memory.

A method of an SDA can be called in two ways: asynchronously, by a non-blocking call, or synchronously, by a call blocking the caller until control returns. An asynchronous method execution may be associated with an event, which can be used for status inquiries and synchronization. No two method executions belonging to the same SDA can execute in parallel; as a consequence, each method has exclusive access to the data of its SDA. With proper implementation support this restriction could be relaxed to allow methods which do not alter an SDA's state to execute in parallel; however, this is currently not supported. A method may have an associated condition clause, specifying a logical expression which guards the method's activations.

Each SDA is associated with a unique SDA task, which is the locus of all control activity related to the SDA. The SDA task operates on the resources allocated to the SDA, provides an address space for the SDA's data, and manages the execution of calls to the SDA's methods. The execution of an Opus program can thus be thought of as a system of SDA tasks, in which a task executes a method of its SDA in response to a request from another SDA.
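As an analogy only — Opus itself is layered on Fortran/HPF, and the class and method names below are hypothetical — the monitor-like behavior of an SDA (encapsulated data, methods with exclusive access, a condition guarding activation) can be sketched in Python:

```python
# Illustrative Python analogue of an SDA: data plus methods with
# monitor-style exclusive access; the while-loop guards play the role
# of condition clauses that block a method until they hold.
import threading

class BoundedBufferSDA:
    def __init__(self, capacity):
        self._lock = threading.Lock()
        self._changed = threading.Condition(self._lock)
        self._items = []
        self._capacity = capacity

    def put(self, x):                      # guarded: len(items) < capacity
        with self._changed:
            while len(self._items) >= self._capacity:
                self._changed.wait()
            self._items.append(x)
            self._changed.notify_all()

    def get(self):                         # guarded: len(items) > 0
        with self._changed:
            while not self._items:
                self._changed.wait()
            x = self._items.pop(0)
            self._changed.notify_all()
            return x

# Two asynchronous "tasks" interact only through the SDA's methods.
buf = BoundedBufferSDA(2)
results = []
consumer = threading.Thread(target=lambda: results.extend(buf.get() for _ in range(3)))
consumer.start()
for v in (1, 2, 3):
    buf.put(v)
consumer.join()
print(results)
```

The essential correspondence is that callers never touch the buffer's data directly; all interaction goes through guarded methods with exclusive access, as with Opus tasks invoking SDA methods.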

OPUS Runtime Support

As discussed above, SDAs in Opus allow the encoding of both computation servers and data repositories for sharing data between the computational tasks. In general, the computation tasks and the data servers will utilize the same or overlapping resources. Thus, any given processor in the system might be responsible for the simultaneous execution of multiple independent SDAs. Execution of these units can be implemented on Unix-based systems by mapping each unit to a process, where each processor can execute multiple processes in some fashion. However, this process-based approach has several drawbacks, including the inability to control scheduling decisions, the inability to share address spaces between units, and costly context switching. In light of these disadvantages, our approach is to utilize lightweight, user-level threads to represent these various independent entities. A lightweight, user-level thread is a unit of computation with minimal context that executes within the domain of a kernel-level entity, such as a Unix process or Mach thread. Lightweight threads are becoming increasingly useful in supporting language implementations for both parallel and sequential machines by providing a level of concurrency within a kernel-level process.

The Opus runtime support, as depicted in the figure below, is constructed atop a runtime interface called Chant. Chant supports both a standardized interface for thread operations, as specified by the POSIX thread standard, and communication among threads using either point-to-point primitives, such as those defined in the MPI standard, or remote service requests. Chant also supports the concept of a rope, a group of threads for executing collective operations such as broadcasts and reductions. A data-parallel code, such as that produced by an HPF compiler, can be mapped to a rope with minimal changes. A description of Chant and its current status can be found in the literature.

The two major issues in the Opus runtime system are the management of SDAs and their interaction. In the initial design we have concentrated on the interaction, namely method invocation and argument handling, and have taken a simplified approach to resource management: we presume that all required resources are statically allocated and that the appropriate code is invoked where necessary. We will later extend the runtime system to support the dynamic acquisition of new resources. The interaction between SDAs requires runtime support for both method invocation and method argument handling.

[Figure: Runtime layers for SDA support — the Opus language/compiler sits atop the Opus runtime, which is split into a language-dependent and a language-independent layer and runs on the threaded runtime Chant.]

HPF+

Barbara Chapman, Piyush Mehrotra, and Hans Zima

A stated goal of the HPF Forum was to address the problems of writing data-parallel programs where the distribution of data affects performance. HPF, the current language standard, provides a first step towards such a portable programming interface by extending Fortran with a set of directives. However, many advanced algorithms require functionality which is missing from HPF. For this purpose we have proposed a language called HPF+ which adds new features to HPF and at the same time deletes a number of irrelevant or ill-defined language elements. Many of the new features are based upon Vienna Fortran and Vienna Fortran 90 (see the section on Vienna Fortran above) and on implementation experience in the framework of VFCS and the PPPE ESPRIT III project. Syntactically, HPF+ uses the conventions of HPF. In this section we discuss some requirements of advanced algorithms that are not met by HPF and outline a solution to the problem in HPF+.

In HPF, data can be either replicated or distributed across all processors of a DMMP via the predefined block, cyclic, and block-cyclic mappings. In addition, linear alignment functions can be specified. These language constructs suffice for the expression of a range of numerical applications operating on regular data structures. However, more complex applications pose serious difficulties. For example, modern codes in such important application areas as aircraft modelling and combustion engine simulation often employ multiblock grids. Even if these grids are to be distributed by block to processors, the constructs of HPF do not permit an efficient data mapping for these programs; this requires distributions to sections of the processor array. HPF is even less equipped to handle advanced algorithms such as particle-in-cell codes, adaptive multigrid solvers, or sweeps over unstructured meshes. Many of these problems need more complex data distributions if they are to be executed efficiently on a parallel machine. Some of them may require the user to control the execution of major do loops by specifying which processor should perform a specific iteration. These features are not provided by HPF. Some programs will require that data be distributed onto processor arrays of different dimensions; the current language definition does not permit the user to prescribe or assume any relationship between such processor arrays.

HPF+ offers a number of language extensions which provide the required functionality, including the following:

- Distribution to Processor Subsets
  This feature allows the specification of a distribution target as a subsection of the processors in a processor array, or as an individual processor. Examples for its use are multiblock and multigrid algorithms.

- Processor Views
  Processor views permit the interpretation of a given processor set according to different declarations, of possibly different ranks. The relationship between the processors in such declarations can be established in an unambiguous way, based upon Fortran equivalence. This feature is needed whenever different numbers of dimensions of data arrays are distributed to a given set of processors.

- General Block Distributions
  General block distributions, while maintaining the contiguity of distribution segments associated with regular block distributions, allow different segment lengths, thus providing a simple and efficient means of dealing with load balancing. If, for instance, the nodes of an unstructured mesh are partitioned prior to execution and then appropriately renumbered, then the resulting distribution can be described in this manner.
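A general block distribution can be modelled as a list of per-processor segment lengths. The sketch below is illustrative Python with hypothetical names and 0-based indices, not HPF+ syntax:

```python
# Sketch of a general (GEN_BLOCK-style) distribution: segments stay
# contiguous, but their lengths may differ per processor, e.g. to
# balance unevenly weighted, renumbered mesh nodes.
def gen_block_owner(i, segment_lengths):
    """Owner of global index i, given per-processor segment lengths."""
    start = 0
    for proc, length in enumerate(segment_lengths):
        if i < start + length:
            return proc
        start += length
    raise IndexError(i)

# 10 elements over 3 processors holding 5, 3, and 2 elements each:
owners = [gen_block_owner(i, [5, 3, 2]) for i in range(10)]
print(owners)
```

A regular block distribution is the special case where all segment lengths are equal.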

- Indirect Distributions
  General block distributions cannot deal with every irregular problem; for example, the distributions generated by dynamic partitioners cannot, in general, be represented in this way. Indirect distributions provide a means for specifying an arbitrary replication-free distribution by using a mapping array which, for each element of the array to be distributed, points to the corresponding processor. This mechanism is supported by the PARTI and PARTI+ libraries integrated into the VFCS (see below).
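In the same illustrative Python style (the mapping values below are invented purely for the example), an indirect distribution is simply a lookup in the mapping array, and each processor's local index set is the preimage of its number:

```python
# Sketch: an indirect distribution assigns element i to processor MAP(i).
# The mapping array would typically come from a mesh partitioner; these
# values are made up for illustration.
mapping = [2, 0, 0, 1, 2, 1, 0, 2]        # MAP(i) -> owning processor

def indirect_owner(i):
    return mapping[i]

# Each processor owns exactly the indices mapped to its number.
local = {p: [i for i in range(len(mapping)) if indirect_owner(i) == p]
         for p in range(3)}
print(local)
```

This generality is also why indirect distributions are costly: nothing in the mapping array tells the compiler about any regularity the distribution may have.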

- User-Defined Mapping Procedures
  Indirect distributions incur a considerable overhead, both at compile time and at run time. A difficulty with this approach is that when a distribution is described by means of a mapping array, any regularity of structure that may have existed in the distribution is lost; as a consequence, the compiler cannot exploit possible regularities. User-defined mapping procedures provide a facility for extending the set of intrinsic mappings in the language in a structured way, using a notation similar to that of Fortran procedures.

- On-Clauses
  On-clauses allow control of the work distribution in an INDEPENDENT loop by explicitly specifying a mapping from iterations to processors. In this way the work distribution of irregular algorithms can be adapted to the data distributions.
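The effect of an ON clause can be pictured as an explicit iteration-to-processor map. In this illustrative Python sketch (function names hypothetical, not HPF+ syntax), each processor executes only the iterations the map assigns to it:

```python
# Sketch of the ON-clause idea: the work distribution of an INDEPENDENT
# loop is given as an explicit mapping from iterations to processors.
def iterations_for(proc, n_iters, on_clause):
    """Iterations executed by `proc` when iteration i runs ON processor
    on_clause(i)."""
    return [i for i in range(n_iters) if on_clause(i) == proc]

# E.g. align iteration i with the block-distributed element it updates
# (16 iterations, 4 processors, blocks of 4):
on = lambda i: i // 4
schedule = [iterations_for(p, 16, on) for p in range(4)]
print(schedule)
```

Aligning iterations with the data they touch in this way is how the work distribution of an irregular algorithm is adapted to its data distribution.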

- Task Parallelism and its Integration with Data Parallelism
  Such features will be included along the lines discussed in the section on OPUS above.

A more complete description of HPF+ is given in the project literature. Beginning in January, a new ESPRIT IV project with the title HPF+ has been initiated. This project will develop a complete language definition of HPF+ and implement the language in the context of VFCS. More information on this project is given in the projects section above.

The Vienna Fortran Compilation System (VFCS)

Siegfried Benkner

Overview

The Vienna Fortran Compilation System (VFCS) is an integrated compilation environment that includes a parallelizing compiler for Vienna Fortran/HPF, program analysis and transformation tools, and tools for sequential profiling, performance prediction, and performance measurement. The figure below shows the main components of the VFCS.

[Figure: The Structure of VFCS — a Vienna Fortran/HPF frontend feeds a program database (syntax tree, symbol tables, flow graph, dependence graph, call graph, distribution and interprocedural information) shared by the analysis, transformation, parallelization, and performance-tool components, all accessible through a graphical user interface; a code generator emits Fortran message-passing code.]

The frontend reads the input program and generates the internal representation, which is kept in the program database. Input languages for VFCS are Fortran 77 and subsets of Vienna Fortran, Vienna Fortran 90, and HPF, including Fortran 90 array syntax. The analysis component performs data flow analysis, data dependence analysis, and distribution analysis, in an intraprocedural as well as an interprocedural mode. The transformation component provides normalization transformations, standard transformations, and various loop transformations. The parallelization component performs a source-to-source translation from data-parallel Fortran programs to explicitly parallel message-passing Fortran, providing automatic parallelization as well as vectorization. The performance tools include a sequential program profiler, a static performance prediction tool, a simulation tool, and a performance measurement tool; the performance tools of VFCS are described in more detail in later sections. The code generator generates Fortran message-passing code for various parallel target architectures, including the Intel iPSC hypercube, the Intel Paragon, and the Meiko CS, as well as for the PARMACS and MPI standards.

VFCS provides two different modes of operation to the user. In the automatic mode, as a command-line compiler, VFCS provides a set of easy-to-use directives for the creation of driver files, which are used to customize the system to the user's needs and to control the sequence of transformations that are applied during the restructuring process. In the interactive mode, the parallelization process may be directed by the user via a Motif-based graphical user interface offering a set of analysis services and a transformation catalog.

VFCS can also be used to analyze and transform sequential programs. Many of the transformations which improve the performance of parallel programs also help in the performance tuning of sequential code.

VFCS supports most of the Vienna Fortran and HPF data-parallel extensions. It currently provides regular block distributions, general block distributions, distribution to processor subsets, distribution of workspaces (which are often used in Fortran programs to simulate dynamic memory allocation), distribution by alignment, and indirect distributions using mapping arrays.

Program Transformation and Analysis

VFCS provides normalization transformations and program analysis services which enable the user to adapt the program in such a way that subsequent parallelization yields better results. These typically include loop normalization, if-conversion, and subscript standardization, and furthermore standard transformations such as constant propagation and dead code elimination.

Program analysis includes syntactic and semantic analysis, control flow analysis, data flow analysis, and data dependence analysis, on an intraprocedural as well as an interprocedural basis. Appropriate data structures which capture the analysis information are constructed and may be displayed to the user in graphical form. These include the unit flow graphs, the data dependence graph, and the call graph. The call graph is annotated with information concerning the bindings of actual and formal procedure parameters, which provides the basis for subsequent interprocedural analysis to determine the distribution of dummy arrays that inherit their distribution from actual arrays.

A set of standard transformations is provided and can be selected manually via the graphical user interface. These include:

- constant propagation
- inline expansion
- scalar forward substitution
- dead code elimination
- induction variable substitution
- loop transformations

The loop transformations include loop distribution, loop fusion, loop interchange, loop unrolling, unroll-and-jam, strip mining, reduction recognition, and the transformation of DO loops into parallel loops.
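To make one of these transformations concrete, the sketch below shows strip mining in Python (a toy example of our own, not VFCS output): a single loop is rewritten as an outer loop over strips and an inner loop within each strip, leaving the computed result unchanged:

```python
# Strip mining: rewrite one loop over n iterations as an outer loop over
# fixed-size strips and an inner loop within each strip. The computation
# is unchanged; only the iteration structure differs.
def summed(n):
    total = 0
    for i in range(n):                         # original loop
        total += i
    return total

def summed_strip_mined(n, strip=4):
    total = 0
    for s in range(0, n, strip):               # outer loop over strips
        for i in range(s, min(s + strip, n)):  # inner loop within a strip
            total += i
    return total

print(summed(10), summed_strip_mined(10))
```

In a parallelizing compiler the strips, rather than single iterations, become the units that are blocked, vectorized, or distributed.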

Parallelization Strategy

VFCS translates a data-parallel source program into an SPMD message-passing program which is parameterized such that, according to the data distribution, each processor operates on its local portion of the data. Access to non-local data is achieved by automatically inserting appropriate communication primitives. VFCS exploits the parallelism inherent in do loops, array statements, reduction operations, and parallel loops.

In order to transform a data-parallel source program into an explicitly parallel message-passing program, the distribution and alignment specifications of the input program are analyzed to determine the ownership of all data objects. Based on this information, each assignment statement involving distributed data objects is transformed by applying the owner-computes rule, or variants thereof, such that each processor only operates on its local portion of the data space. Declaration statements are modified in such a way that each processor allocates only those parts of a distributed object that have been mapped to it. All references to distributed objects are adapted accordingly by converting global indices into local indices. Potential accesses to non-local data are satisfied by inserting appropriate communication statements that copy these data into automatically allocated private variables of the processors. Aggressive optimization techniques applied subsequently aim to reduce the overhead of ownership computation, work distribution, and interprocessor communication.

In the early phases of parallelization, VFCS transforms the source program as follows:

• Splitting generates a host program and a node program. The host program will, after compilation, be executed on the host computer or a specially designated host node of the target system as the host process. It performs global management tasks such as requesting resources, loading code, and terminating program execution; it also handles all input/output. A copy of the node program will be executed on each processor of the target machine to perform the actual computation.

• Masking enforces the owner-computes paradigm by associating a boolean guard, called the mask, with each statement, in accordance with the ownership implied by the distributions. As a consequence, the computation is partitioned such that each processor only executes assignments to array elements mapped to it.

• Communication insertion generates, for all potentially non-local data accesses, high-level communication statements which copy non-local data items to private variables of the processor.
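The effect of masking under the owner-computes rule can be illustrated with a small sketch (in Python rather than the Fortran setting of VFCS; the BLOCK distribution, the helper names, and the array sizes are all invented for illustration):

```python
# Sketch: owner-computes masking for a 1-D BLOCK distribution.
# Not VFCS-generated code; all names are illustrative assumptions.

def block_owner(i, n, p):
    """Processor owning global index i of an n-element array
    distributed blockwise over p processors."""
    block = -(-n // p)          # ceiling division: block length
    return i // block

def masked_execution(n, p, rhs):
    """Every processor conceptually scans all iterations i, but the
    mask lets only the owner of a[i] perform the assignment."""
    local = [dict() for _ in range(p)]
    for i in range(n):
        owner = block_owner(i, n, p)
        local[owner][i] = rhs(i)   # mask: only the owner assigns
    return local

parts = masked_execution(8, 2, lambda i: i * i)
# processor 0 owns indices 0..3, processor 1 owns indices 4..7
```

The mask partitions the iteration space without any coordination: each processor evaluates the same guard and keeps only its own assignments.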

Optimization and Co de Generation

Since the reduction of communication costs is crucial in achieving good performance, subsequent optimization phases eliminate redundant communication, hoist communication statements out of loops, and combine small messages into larger ones in order to minimize the effects of latency. Overlap analysis detects certain simple regular communication patterns, reorganizes communication based upon this information, and determines the minimum amount of storage required for copies of non-local data. Mask optimization adapts loop bounds such that unnecessary loop iterations are eliminated. Runtime compilation techniques based on the PARTI primitives are used for the compilation of explicitly parallel loops with indirect array references.

In the final phase, an optimized parallel message passing Fortran target program is created. High-level communication statements are implemented by calls to a message passing library. Memory for non-local data is automatically allocated, and global references are translated into local references.

Future Work

The main directions for future development are a generalization of the current VFCS to a broader Fortran language base and to full Vienna Fortran and HPF. This includes the implementation of derived types, dynamic memory allocation, masked array assignments, procedure interfaces, and modules. Furthermore, VFCS will be extended to handle non-constant processor arrays. New concepts for the parallelization of array assignments involving arbitrary block and cyclic distributions, including linear alignment, which may depend on runtime data, are currently being implemented, as well as a runtime support system and optimization techniques needed for an efficient implementation of dynamic data distribution, irregular distributions, and user-defined mapping procedures.

Parallelization techniques for full Vienna Fortran or HPF must be general enough to compile programs for a parameterized number of processors, and must provide concepts for the management of distributed data, including runtime representation of distributions, efficient methods for index conversion, and static as well as dynamic schemes for handling the transfer of non-local data between the processors of the parallel target machine. Many of the basic compilation techniques discussed in the previous sections are only applicable if the data distribution and work distribution can be fully determined at compile time. If this is not the case, the compiler is forced to produce parameterized code and to defer ownership computation, work distribution, and communication generation to runtime. In the following we outline the main functionality that has to be provided by a runtime support system for Vienna Fortran or HPF.

Runtime Supp ort

The computation of the distribution of data objects and of the ownership of data is a prerequisite for performing memory allocation and for the computation of work distribution. These tasks, however, may not be possible at compile time, since the distribution of an object may depend on runtime data or may be changed dynamically. Furthermore, dummy arguments may inherit the distribution of their actual arguments and thus may be associated with distinct distributions in different incarnations of a procedure. As a consequence, an array reference in the code may be reached by different data distributions.

The goal of distribution analysis is to determine the set of distributions that may reach a certain reference to a distributed array. Distribution analysis has to be performed on an interprocedural basis in order to avoid high overheads due to making conservative assumptions at procedure boundaries. If interprocedural distribution analysis reveals that there is only a single distribution that may reach a certain array reference, more efficient code can be generated. However, in general, the precise distribution of all objects cannot be determined at compile time by any amount of analysis. As a consequence, a runtime support system is required that provides the following functionality:

• Representation of Distributions
Runtime representations of distributions, called layout descriptors, capture all the information that is needed to determine the distribution of an array at runtime. Layout descriptors are utilized in those cases where the actual distribution cannot be determined at compile time, e.g., if the distribution depends on runtime data, if the array is dynamically allocated or dynamically distributed, or if the number of processors on which the program will execute is not known.

• Data Organization Functions
These functions take layout descriptors as arguments and dynamically determine the local index set of a distributed object on a particular processor; they are therefore a prerequisite for the allocation of distributed objects. Furthermore, the runtime system contains functions for computing the ownership and local address of the elements of distributed arrays and for converting global indices into local indices.

• Work Distribution Functions
Based on the actual distribution of an object, each processor usually only performs assignments to those parts of the object that are allocated in its local memory. In the case of array assignments, work distribution functions determine, for each processor, those elements of the left-hand-side array section for which the assignment has to be executed. In the case of loops, appropriate functions partition and adapt the loop iteration space according to the data distribution of the accessed objects.

• Support for Communication Generation
Usually it will not be possible to determine at compile time the communication that is necessary in order to execute assignment statements. Therefore, each processor has to determine at runtime which data elements it has to send to, or receive from, other processors in order to guarantee correct program execution. The runtime system also contains functions to handle the buffering of non-local data. A suitable representation of communication sets is required in order to enable the extraction and reuse of certain communication patterns.
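The data organization functions above can be illustrated with a minimal layout descriptor for a CYCLIC(k) distribution (a Python sketch; the descriptor fields and function names are invented and do not reflect the actual VFCS runtime interface):

```python
# Sketch: a minimal layout descriptor for a 1-D CYCLIC(k) distribution,
# with data organization functions operating on it. Names are invented.
from dataclasses import dataclass

@dataclass
class Layout:
    n: int        # global extent of the array
    p: int        # number of processors
    k: int        # block length of the CYCLIC(k) distribution

def owner(layout, i):
    """Processor owning global index i."""
    return (i // layout.k) % layout.p

def global_to_local(layout, i):
    """Local index of global index i on its owning processor."""
    cycle, offset = divmod(i, layout.k * layout.p)
    return cycle * layout.k + (offset % layout.k)

def local_index_set(layout, proc):
    """Global indices mapped to processor proc (basis for allocation)."""
    return [i for i in range(layout.n) if owner(layout, i) == proc]

d = Layout(n=10, p=2, k=2)
# processor 0 owns {0,1,4,5,8,9}; processor 1 owns {2,3,6,7}
```

The same descriptor also serves the work distribution functions: the owning processor's local index set is exactly the set of left-hand-side elements for which it must execute an assignment.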

Communication Management and Optimization Techniques

The efficiency of the parallelized program can be improved by applying optimization techniques that reduce the preprocessing overhead, the memory overhead imposed by the need for temporary storage, and the communication and synchronization overhead. The preprocessing overhead for the computation of local index sets, work sets, and communication sets can be reduced by reusing these sets wherever possible, or by computing these sets, or parts thereof, at compile time where possible.

During parallelization, potential accesses to non-local data are satisfied by inserting high-level communication descriptors prior to each assignment statement that contains references to distributed data objects. A high-level communication descriptor contains all the information that is needed to determine which elements have to be communicated and how the communication is organized; its semantics ensures that non-local data items are copied into private variables of a processor. An important aspect of high-level communication descriptors is that they abstract from the details of the actual management of temporary storage and communication, and thus a number of important optimization techniques may be performed entirely at the level of communication descriptors. These include the elimination of unnecessary or redundant communication descriptors, the motion of communication descriptors out of loops or procedures, and the combination of messages from different communication statements. In the code generation phase, communication descriptors are expanded into calls to message passing primitives, with the goal of exploiting collective communication whenever possible.

In order to reduce the communication overhead, it is crucial to minimize or hide the effects of latency, which on most distributed-memory architectures is dominated by the message startup time. The effect of the message startup time, which is usually one or two orders of magnitude higher than the time required for the transmission of one byte, is reduced by applying communication vectorization, in order to avoid single-element messages, and by overlapping computation and communication.
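The payoff of communication vectorization follows directly from this cost model: when the startup time dominates the per-byte cost, combining element messages into one message per destination removes almost all of the startup overhead. A small sketch (the cost constants and helper names are invented for illustration):

```python
# Sketch: communication vectorization under a startup/per-byte cost
# model. Cost constants are illustrative, not measured values.
from collections import defaultdict

def vectorize(sends):
    """Combine per-element sends into one message per destination."""
    msgs = defaultdict(list)
    for dest, elem in sends:
        msgs[dest].append(elem)
    return dict(msgs)

def total_cost(num_msgs, num_elems, t_startup=100.0, t_byte=1.0):
    """Linear communication cost model: startups plus transfer time."""
    return num_msgs * t_startup + num_elems * t_byte

sends = [(1, e) for e in range(50)]     # 50 single-element messages
naive = total_cost(len(sends), 50)      # 50 startups
vect = total_cost(len(vectorize(sends)), 50)   # 1 startup
```

With a startup time two orders of magnitude above the per-byte cost, the vectorized version pays one startup instead of fifty, which is exactly the effect the optimization phases above aim for.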

PARTI

Kamran Sanjari and Peter Brezany

Processing irregular codes in Vienna Fortran or HPF compilers is a challenging problem of growing importance. In such codes, the access patterns to major data arrays may depend on runtime data, and therefore dependence analysis and the determination of communication patterns cannot be performed at compile time.

HPF and Vienna Fortran provide several constructs to express data parallelism in irregular codes. These include the FORALL statement and construct, the INDEPENDENT loop, and array statements. When processing these constructs, the task of the compiler is to match the parallelism expressed by the construct to that of the target system.

The standard strategy for processing loops with irregular accesses, developed by Saltz, Mehrotra, and Koelbel, generates three code phases, called the work distributor, the inspector, and the executor. The work distributor determines how to spread the work (iterations) among the available processors. The inspector performs a dynamic loop analysis; its task is to determine the necessary communication, described by a set of so-called schedules, and to establish an appropriate addressing scheme for the access to local elements and to copies of non-local elements on each processor. The executor performs the communication described by the schedules and, on each processor, executes the actual computations for all iterations assigned to that processor in the work distribution phase.
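The work distributor / inspector / executor scheme can be sketched for a loop a(i) = b(perm(i)) over BLOCK-distributed arrays (a Python sketch; the "communication" is simulated by direct access to b, and all names are invented):

```python
# Sketch: inspector/executor for a(i) = b(perm(i)), BLOCK distribution.
# Communication is simulated; in a real system the schedule drives
# message exchange between processors.

def block_owner(i, n, p):
    block = -(-n // p)          # ceiling division: block length
    return i // block

def inspector(perm, n, p, my):
    """Dynamic loop analysis: for the iterations assigned to processor
    `my` (owner-computes work distribution), record which non-local
    b-elements must be fetched from which owner (the 'schedule')."""
    schedule = {}               # source processor -> global indices
    for i in range(n):
        if block_owner(i, n, p) != my:
            continue            # iteration belongs to another processor
        src = block_owner(perm[i], n, p)
        if src != my:
            schedule.setdefault(src, []).append(perm[i])
    return schedule

def executor(perm, b, n, p, my, schedule):
    """Perform the communication described by the schedule (simulated
    here by reading b directly), then run the local iterations."""
    copies = {j: b[j] for idxs in schedule.values() for j in idxs}
    a = {}
    for i in range(n):
        if block_owner(i, n, p) == my:
            j = perm[i]
            a[i] = copies[j] if j in copies else b[j]
    return a
```

The schedule computed by the inspector can be reused across executor invocations as long as perm and the distribution do not change, which is the main source of efficiency of this scheme.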

The PARTI runtime library was the first system to support this scheme for DMMPs. It has been integrated into a number of compilers, including VFCS.

Our extended PARTI library extends and improves the original PARTI in a number of respects. PARTI and its successor Chaos can handle multidimensional arrays in which one dimension is distributed; there is no support for alignment or for the COMPLEX data type. In contrast, the extended library and the associated compiling techniques support all HPF distributions and alignments, and operations on all Fortran intrinsic data types. In this way, the extended library lifts several restrictions associated with PARTI.

The contributions of the extended library can be summarized as follows:

• Support for irregular codes containing arrays with multidimensional distributions.
The organization of communication and addressing for such an array is based on viewing it as a one-dimensional array whose distribution is determined by the dimensional distributions of the original array.

• Generic interface to the library procedures.

• Exact communication sets.
For multidimensional arrays, PARTI and Chaos approximate the communication sets by so-called rough schedules. In contrast, the extended library derives schedules which guarantee that only elements that are actually referenced will be communicated.

• Reduction of buffer space.
In the case of multidimensional arrays, the extended library squeezes copies of non-local references into a one-dimensional buffer and avoids the allocation of unnecessary buffer cells.

• Implementation on top of MPI.
This enhances portability and helps to avoid hidden deadlocks in global communication patterns, since MPI offers various collective data exchange routines with the capability of defining processor groups which exclude processors with empty communication sets from global communication. Currently the library runs on the Intel iPSC and on all platforms supporting MPI.

Future improvements will include a reduction of the complexity of schedule generation, support for partially replicated arrays, and the implementation of mechanisms for latency hiding.

The extended PARTI library has been integrated into VFCS and into the PREPARE HPF compilation environment.

Dynamic Data Distributions: Optimizations

Eduard Mehofer

The dynamic redistribution of arrays in languages such as Vienna Fortran or High Performance Fortran addresses the demands posed by advanced applications with dynamically varying processor workloads or varying array access patterns. Consider a particle-in-cell (PIC) code with particles moving in a domain which is divided into cells. The set of cells is distributed at the beginning in such a way that the workload of each processor is approximately the same. If particles move from one cell to another during the simulation, a load imbalance may occur, which in turn may result in a severe loss of performance. A redistribution of the cells across the processing nodes helps to restore the workload balance.

Another class of applications which can benefit from dynamic redistributions includes codes such as ADI (Alternating Direction Implicit). The ADI computational kernel consists of a computational wavefront along the rows followed by a computational wavefront along the columns. A static column-wise data distribution requires communication in the first phase but none in the second phase, and vice versa for a static row-wise distribution. A redistribution of the arrays between the computational phases has the effect that both phases are free of communication, at the price of the cost of the transposition between the phases.

An efficient implementation of dynamic redistributions is difficult, requiring a coherent approach involving language, compiler, and runtime issues. To reduce the overhead introduced by redistributions, a broad range of powerful optimizations must be performed in order to obtain reasonable performance results. The following key problems must be attacked in order to provide an efficient implementation:

• reduction of the overall redistribution costs, and

• efficient compilation of array references in the presence of multiple reaching distributions.

Reaching distribution analysis, which is defined in analogy to reaching definitions, plays a vital role in the solution of the above two problems. On the one hand, reaching distributions are used to identify redundant redistributions; on the other hand, reaching distribution information is required to find the distributions which may be associated with a given array reference. Hence we want to determine the set of reaching distributions as precisely as possible. This implies the need for interprocedural analysis.
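Reaching distribution analysis can be set up exactly like reaching definitions: a forward data-flow problem solved by fixpoint iteration over the control flow graph. A minimal sketch (the CFG encoding and distribution names are invented; a redistribution is modeled as a strong update that kills all other distributions of the array):

```python
# Sketch: reaching distribution analysis as a forward data-flow
# problem, in analogy to reaching definitions. Encodings are invented.

def reaching_distributions(cfg, entry, redist):
    """cfg: node -> list of successors.
    redist: node -> set of distributions generated there (killing all
    others), or None if the node leaves distributions unchanged.
    Returns, per node, the set of distributions reaching its entry."""
    reach_in = {n: set() for n in cfg}
    reach_in[entry] = {"d_init"}        # initial (declared) distribution
    changed = True
    while changed:                      # iterate to a fixpoint
        changed = False
        for n in cfg:
            out = redist[n] if redist[n] is not None else reach_in[n]
            for s in cfg[n]:            # propagate: union over predecessors
                if not out <= reach_in[s]:
                    reach_in[s] |= out
                    changed = True
    return reach_in

# Branch: one arm redistributes to dA, the other passes d_init through;
# at the merge node both distributions reach the reference.
cfg = {0: [1, 2], 1: [3], 2: [3], 3: []}
redist = {0: None, 1: {"dA"}, 2: None, 3: None}
```

A reference at the merge node is reached by more than one distribution, which is precisely the situation that forces the code generation strategies discussed below.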

Note that suitable language features can provide the compiler with information that cannot be derived statically. In particular, a user-specified distribution range, or distribution queries like the Vienna Fortran DCASE construct or the functions IDT and IDTA, can help to refine the set of reaching distributions.

Reduction of Redistribution Costs

The overall redistribution costs depend on the number of executed redistributions and on the time spent performing the individual redistributions. The number of redistributions performed during the execution of a program can be reduced by eliminating useless redistributions, or by code motion of redistributions out of loops or across procedure boundaries. Since the time required to perform an individual array redistribution is dominated by the physical remapping of the data items, an efficient redistribution communication library is important.

Elimination of Useless Redistributions

A redistribution statement at a program point p is called useless iff the requested distribution is not live at p, or it is redundant, which means that the requested distribution already exists. In many cases, useless redistribution statements are the result of compiler transformations or of implicitly inserted code. For example, Vienna Fortran provides the possibility to connect the distribution of a set of dynamically distributed arrays to a primary array in such a way that all connected arrays will be automatically redistributed whenever the distribution of the primary array changes. Consider the Vienna Fortran code fragment of the following example, where da denotes a distribution annotation, δ the associated distribution, and associate(A, δ) binds array A to δ.

Example: Redistributions of connected arrays in Vienna Fortran

        REAL A(N) DYNAMIC, DIST(da1)          REAL A(N) DYNAMIC, DIST(da1)
        REAL B(M) DYNAMIC, CONNECT(=A)        REAL B(M) DYNAMIC, CONNECT(=A)
        ...                                   ...
    s0: DISTRIBUTE A :: da0               s0: associate(A, δ0)
                                              associate(B, δ0)
        ...                                   ...
    s1: DISTRIBUTE A :: da1               s1: associate(A, δ1)
                                              associate(B, δ1)

             Original Code        →        Transformed Code

Array B is connected to array A and is thus automatically redistributed, as indicated by the example. If the distribution of B established in s0 is not live, the distribution association for array B in s0 can be eliminated. If no redistributions occur between s0 and s1 and the initial distribution is still valid, useless redistribution elimination may discover the distribution association for B at s1 as redundant.

Another source of optimizations arises at subroutine boundaries. Vienna Fortran provides a local and a non-local view of subroutines. To support a local view, remappings may be necessary at subroutine boundaries, as shown in the following example.

Example: Remapping at subroutine boundaries

        ...                                ...
    s1: associate(A, δ1)               s1: associate(A, δ1)
    s2: CALL F(A)                      s2: CALL F(A)
    s3: associate(A, δ0)
    s4: associate(A, δ1)
    s5: CALL G(A)                      s5: CALL G(A)
    s6: associate(A, δ0)               s6: associate(A, δ0)
        ...                                ...

             Transformed Code      →      Optimized Code

Since the distribution of A established in s3 is not live, statement s3 can be removed. If the distribution of array A is not changed by subroutine F, reaching distribution analysis recognizes that the distribution requested at s4 is identical to the distribution associated with A at s1. Hence s4 can be identified as useless.

Lo opInvariant Redistribution Motion

Since, on average, most of the execution time is spent inside loops, the optimization of loop bodies is of key importance. This is especially true for redistributions within loops. The following example shows the motion of distribution associations out of loop bodies.

Example: Detection of loop-invariant redistributions

        DO I = 1, N                    s1: associate(A, δ1)
    s1:   associate(A, δ1)                 DO I = 1, N
    s2:   CALL F(A)                    s2:   CALL F(A)
    s3:   associate(A, δ0)                 ENDDO
        ENDDO                          s3: associate(A, δ0)

             Original Code        →        Transformed Code

Loop-invariant code motion analysis, based on reaching distribution and live distribution information, recognizes that statement s1 can be moved before the loop, and statement s3 to just behind the loop.

Compilation Strategies for Multiple Reaching Distributions

If an array reference is reached by multiple distributions, two different code generation strategies are possible:

• generation of generic code which covers the set of all reaching distributions, or

• generation of a distribution-if cascade with highly optimized code for each distribution.

The first strategy may result in a loss of performance, whereas the second strategy increases the size of the generated code. Thus heuristics must be developed for selecting the proper code generation strategy.
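The two strategies can be contrasted in a small sketch: the generic path interprets a layout descriptor at every access, while the cascade tests the distribution once and then runs a specialized loop (a Python sketch; the descriptor fields and distribution kinds are invented for illustration):

```python
# Sketch: generic code vs. a distribution-if cascade for computing the
# locally owned index set. Descriptor fields are invented.

class Layout:
    def __init__(self, kind, n, p, my):
        self.kind, self.n, self.p, self.my = kind, n, p, my

def owns_block(lay, i):
    block = -(-lay.n // lay.p)          # ceiling division
    return i // block == lay.my

def owns_cyclic(lay, i):
    return i % lay.p == lay.my

def owned_generic(lay):
    """Generic code: interpret the layout descriptor at every access."""
    test = owns_block if lay.kind == "BLOCK" else owns_cyclic
    return [i for i in range(lay.n) if test(lay, i)]

def owned_cascade(lay):
    """Distribution-if cascade: one test, then a specialized loop with
    no per-element ownership check."""
    if lay.kind == "BLOCK":
        b = -(-lay.n // lay.p)
        return list(range(lay.my * b, min((lay.my + 1) * b, lay.n)))
    elif lay.kind == "CYCLIC":
        return list(range(lay.my, lay.n, lay.p))
    raise ValueError("unknown distribution: fall back to generic code")
```

Both paths compute the same set; the cascade trades one specialized body per reaching distribution (code size) against the per-access interpretation overhead of the generic version.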

Compilation Techniques for Virtual Shared Memory Architectures

Bernd Wender

DMMPs are difficult to program and to compile for, in contrast to the shared-memory paradigm, which is generally considered to provide a natural and intuitive approach to parallel programming. One way to overcome this dilemma has been pursued in a number of research projects: a virtual shared memory is provided on top of a physically distributed memory by an additional software layer, using a paging scheme similar to the classical virtual memory implementations on single-processor machines. The main problem with this approach is to maintain memory coherence across all local memories without a significant loss of performance.

In recent years, hardware vendors have developed a class of new parallel architectures which provide a hardware-supported shared address space on top of an architecture with physically distributed memory. The CONVEX SPP Exemplar is an example of this class. It consists of groups of up to eight processors, called hypernodes. All processors within a hypernode are connected via a crossbar to a physically distributed but logically shared memory. The hypernodes are connected by four CTI rings; the CTI supports access to global memory on a cache line basis. Cache coherence is maintained using a directory-based mechanism within each hypernode, and the SCI method across the hypernodes. The Exemplar thus provides a global address space across all hypernodes using its two-level shared memory (NUMA) structure.

The concept of a global address space alleviates the task of the programmer drastically: it is possible to write parallel programs in a way similar to that in which sequential programs are written. Large parts of the parallelization task can be left to the compiler, and the programmer can support the parallelization via compiler directives.

However, some important problems remain. Although the shared memory programming model is quite straightforward, some tedious tasks are left to the programmer, e.g.:

• choosing sections of code that are worthwhile to parallelize,

• detecting inhibitors of parallelization,

• restructuring parts of the program into parallelizable sections, and

• handling loops with loop-carried dependences.

Due to the non-uniform data access characteristics of the Exemplar, it is also necessary to design the layout of the data structures of the program carefully, to avoid unnecessary non-local accesses, cache thrashing effects, false sharing, and the like.

Our goal is to alleviate the task of writing well-performing programs for virtual shared memory machines by using language elements, program analysis techniques, and tools known from the context of parallelization for DMMPs. This can be achieved by a parallelization system which uses information about the data layout together with control and data flow information, and transforms the program according to a strategy derived from a model of the underlying target architecture. Such a system will find parallelizable structures in the code, perform a series of transformations to remove inhibitors of parallelization, and generate a target program annotated with compiler directives for the CONVEX native compiler.

Parallelization of Sparse Matrix Co des

Manuel Ujaldon, Barbara Chapman, Emilio Zapata, and Hans Zima

Overview

In this section we consider the specific requirements of sparse computations as they arise in many problem areas, such as finite elements, molecular dynamics, matrix decomposition, and image reconstruction. We have introduced new methods for the representation and distribution of sparse matrices, and propose a new data type that encodes the representation of such data. These features serve to provide the compiler and the runtime system with important additional information that permits the sparsity structure to be exploited.

In order to parallelize sparse codes effectively, there are three main issues to consider:

1. how to generalize the representation of sparse matrices on a single processor to DMMPs,

2. how to distribute the data structures typically used in such codes, and

3. how to implement an efficient compilation/runtime scheme based on a given combination of representation and distribution.

Footnote: CTI (Convex Toroidal Interface) is an extension of the Scalable Coherent Interface (SCI) defined in an IEEE standard.

The special representation of sparse data leads to code in which irregular accesses are made to major data structures. Existing compilers use routines based upon the inspector-executor paradigm to translate such codes. However, this very general scheme is relatively inefficient. The special nature of the language features presented here enables optimizations which save memory, reduce the amount of communication, and generally improve the runtime behavior of the application.

Representing Sparse Matrices on DMMPs

We construct a representation of a sparse matrix on a DMMP by combining a standard representation used on sequential machines with a data distribution, in the following way.

A distributed representation (d-representation) is determined by two components: a sequential representation (s-representation) and a data distribution. The s-representation specifies a set of data structures which store the data of the sparse matrix and establish the associated access mechanisms on a sequential machine. The distribution determines, in the usual sense, a mapping of the array to the processors of the machine.

For the following, assume a processor array Q(X, Y) and a data array A(n, m).

Figure: Sample sparse matrix A (the matrix entries shown in the original figure are not reproduced here).

We now describe a frequently used s-representation. The Compressed Row Storage (CRS) representation is determined by a triple of vectors (DA, CO, RO), which are respectively called the data, column, and row vectors. The data vector stores the nonzero values of the matrix as they are traversed in a row-wise fashion. The column vector stores the column indices of the elements in the data vector. Finally, the row vector stores the indices in the data vector that correspond to the first nonzero element of each row, if such an element exists.

For example, consider the sparse matrix A shown in the figure above. The data, column, and row vectors for A are shown in the figure below.

Another s-representation, Compressed Column Storage (CCS), is similar to CRS, but is based on an enumeration which traverses the columns of A rather than the rows.
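The CRS construction just described can be sketched as follows (a Python sketch with 1-based indices to match the Fortran setting; the sample matrix is invented):

```python
# Sketch: building the CRS triple (DA, CO, RO) of a sparse matrix.
# 1-based indices, matching the Fortran setting of the report.

def crs(matrix):
    """Return (DA, CO, RO): nonzero values in row-wise order, their
    column indices, and the start of each row in DA (plus an end
    marker, a common convention that makes row extents explicit)."""
    da, co, ro = [], [], []
    for row in matrix:
        ro.append(len(da) + 1)      # first nonzero of this row in DA
        for j, v in enumerate(row, start=1):
            if v != 0:
                da.append(v)
                co.append(j)
    ro.append(len(da) + 1)          # end marker
    return da, co, ro

m = [[0, 5, 0],
     [7, 0, 0],
     [0, 0, 9]]
da, co, ro = crs(m)
# da = [5, 7, 9], co = [2, 1, 3], ro = [1, 2, 3, 4]
```

CCS is obtained symmetrically by traversing columns instead of rows, so the same sketch applied to the transpose of the matrix yields a CCS-style triple.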

A d-representation for A results from combining a data distribution with the given s-representation. This is to be understood as follows: the data distribution determines for each processor an associated local segment, which, under appropriate constraints, is again a sparse matrix. The local segment of each processor will then be represented in a manner analogous to the s-representation employed in the sequential code. More specifically, DA, CO, and RO are automatically converted to sets of vectors DAp, COp, and ROp for each processor p. Hence the parallel code will save computation and storage using the very same mechanisms that were applied in the original program. In practice, we need some additional global information to support exchanges of sparse data with other processors.

Figure: CRS representation (DA, CO, RO) of the matrix A (the vector contents shown in the original figure are not reproduced here).

Below we describe two strategies to represent and distribute sparse matrices on multiprocessors:

• Multiple Recursive Decomposition (MRD)
MRD is a data distribution that generalizes the Binary Recursive Decomposition (BRD) of Berger and Bokhari to an arbitrary two-dimensional array of processors. This distribution defines the local segment of each processor as a rectangular matrix which preserves neighborhood properties and achieves a good load balance.

• BRS and BCS
The BRS and BCS representations are based upon cyclic distributions: both dimensions of A are assumed to be distributed cyclically, with the block length as specified by the annotation in the Vienna Fortran declaration

    REAL A(N, M) DIST(CYCLIC, CYCLIC) TO Q(X, Y)

We obtain the desired representation schemes by combining this distribution with the two kinds of s-representation:

• Block Row Scatter (BRS), using CRS, and

• Block Column Scatter (BCS), using CCS.

The BRS and BCS representations are good choices for irregular algorithms with a gradual decrease or increase in the workload, and for those where the workload is not identical for each nonzero element of the sparse matrix. Many common algorithms are of this nature, including sparse matrix decompositions (LU, Cholesky, QR, WZ), image reconstruction (the Expectation Maximization algorithm, Least Squares Minimum), iterative methods for solving linear systems (Conjugate and Biconjugate Gradient, Minimal Residual, Chebyshev Iteration, and others), and eigenvalue solvers (the Lanczos algorithm).
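The BRS idea, a cyclic distribution of both matrix dimensions onto Q(X, Y) with each processor storing its local nonzeros in CRS fashion, can be sketched as follows (a Python sketch assuming block length 1 in both dimensions; the sample matrix is invented):

```python
# Sketch: BRS-style cyclic scattering of a sparse matrix onto a
# processor array Q(X, Y). 1-based matrix indices, as in the report.

def brs_owner(i, j, X, Y):
    """Processor coordinates owning element (i, j) under a cyclic
    distribution of both dimensions with block length 1."""
    return ((i - 1) % X, (j - 1) % Y)

def local_nonzeros(matrix, X, Y):
    """Scatter the nonzeros to their owners; each processor's list is
    what it would store locally (e.g., in CRS form)."""
    segs = {(px, py): [] for px in range(X) for py in range(Y)}
    for i, row in enumerate(matrix, start=1):
        for j, v in enumerate(row, start=1):
            if v != 0:
                segs[brs_owner(i, j, X, Y)].append((i, j, v))
    return segs
```

Because the cyclic scattering spreads neighboring rows and columns over different processors, workloads that vary gradually across the matrix are balanced well, which is why BRS and BCS suit the algorithm classes listed above.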

For example, the BRS representation of the sparse matrix A shown above is given in the figure below.

Language Extensions for the Supp ort of Sparse Matrix Computations

We discuss here a small number of new language features for Vienna Fortran (and, similarly, HPF) that support the sparse representations discussed above.

Figure: BRS representation of A on the processor array Q: each processor Qp holds its own vectors DAp, COp, and ROp (the vector contents shown in the original figure are not reproduced here).

When the data are available at the beginning of execution, the original matrix is read from a file, and the d-representation can be constructed at compile time; otherwise it has to be computed at runtime. We must take account of the fact that the data are accessed according to sparse structures in the source code. We require these auxiliary structures to be declared and used according to one of the representations known to the compiler. The compiler uses this information to construct the local data sets, as described in the previous section.

This requires us to give the following information to the compiler:

• The name, index domain, and element type of the sparse matrix must be declared. This is done using regular Fortran declaration syntax; it creates an array resembling an ordinary Fortran array, but without the standard memory allocation.

• An annotation must be specified which declares the array as being SPARSE and provides information on the representation of the array, including the names of the auxiliary vectors.

• The keyword DYNAMIC is used in a manner analogous to its meaning in Vienna Fortran: if it is specified, then the d-representation will be determined dynamically, as a result of executing a DISTRIBUTE statement. Otherwise, the sparsity structure of the matrix is statically known, and thus all components of the d-representation, possibly excepting the actual nonzero values of the matrix, can be constructed by the compiler. Often this information will be contained in a file whose name is indicated in this annotation.

Sparse Matrix Product

One of the most important operations in matrix algebra is the matrix product. The figure below presents this algorithm expressed in Vienna Fortran, extended by the new sparse annotations. Both the CCS and CRS representations are used here: while CRS is more suitable for the traversal of A, CCS is more appropriate for B in computing the product C = A*B.

Implementation

The above features have been implemented in the VFCS. The compiler as well as the runtime system exploit the additional information given by the new directives to significantly enhance the performance of the resulting code when compared to the naive use of the inspector-executor paradigm. More details on the implementation and performance numbers are given in the references.

      PARAMETER ( X = ..., Y = ... )
      PROCESSORS Q( X, Y )
      PARAMETER ( NA = ..., NB = ..., NC = ... )
      REAL C( NA, NC ) DIST ( CYCLIC, CYCLIC )
      INTEGER I, J, J1, J2, K
C     A uses Compressed Row Storage format
      REAL A( NA, NB ), SPARSE ( CRS ( AD, AC, AR ) ), DYNAMIC
C     B uses Compressed Column Storage format
      REAL B( NB, NC ), SPARSE ( CCS ( BD, BC, BR ) ), DYNAMIC
      ...
C     Read A and B
      ...
      DISTRIBUTE A, B :: ( CYCLIC, CYCLIC )
C     Initialization of dense matrix C
      FORALL I = 1, NA
         FORALL J = 1, NC
            C( I, J ) = 0.0
         END FORALL
      END FORALL
C     Computation of the product
      FORALL I = 1, NA
         FORALL K = 1, NC
            DO J1 = AR( I ), AR( I+1 ) - 1
               DO J2 = BC( K ), BC( K+1 ) - 1
                  IF ( AC( J1 ) .EQ. BR( J2 ) ) THEN
                     C( I, K ) = C( I, K ) + AD( J1 ) * BD( J2 )
                  END IF
               END DO
            END DO
         END FORALL
      END FORALL
      END

      Figure: Vienna Fortran Sparse Matrix Product
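The loop nest of this figure can be mirrored in a short sequential sketch. The following Python version is our own illustration; it reproduces the CRS/CCS index arithmetic (with 1-based auxiliary vectors) but none of the distribution aspects:

```python
def sparse_product(AD, AC, AR, BD, BR, BC, na, nc):
    """C = A * B with A in CRS (AD/AC/AR) and B in CCS (BD/BR/BC),
    mirroring the loop nest of the Vienna Fortran figure."""
    C = [[0.0] * nc for _ in range(na)]
    for i in range(1, na + 1):
        for k in range(1, nc + 1):
            for j1 in range(AR[i - 1], AR[i]):      # nonzeros of row i of A
                for j2 in range(BC[k - 1], BC[k]):  # nonzeros of column k of B
                    if AC[j1 - 1] == BR[j2 - 1]:    # matching inner index
                        C[i - 1][k - 1] += AD[j1 - 1] * BD[j2 - 1]
    return C

# A = [[1, 2], [0, 3]] in CRS; B = [[4, 0], [5, 6]] in CCS
C = sparse_product([1, 2, 3], [1, 2, 2], [1, 3, 4],
                   [4, 5, 6], [1, 2, 2], [1, 3, 4], 2, 2)
```

Only index pairs where a nonzero of A and a nonzero of B share the same inner index contribute to the product, which is why CRS for A and CCS for B fit this computation so well.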

High Performance Access to Mass Storage Systems

Peter Brezany

Introduction

Languages like HPF and Vienna Fortran and their compilers have been designed to improve the practical applicability of massively parallel systems. To accelerate the transition of these systems into fully operational environments, it is also necessary to develop appropriate language constructs and software tools supporting application programmers in the development of large scientific I/O-intensive applications.

The research described in this section focuses on mass storage support for VFCS, to enable efficient execution of parallel I/O operations and operations on out-of-core (OOC) structures. The use of OOC structures implies I/O operations: due to main memory constraints, some parts of these data structures (e.g., large arrays) must be swapped to disk. On some machines, hardware and software support for virtual memory allows the program version operating on in-core data structures to be run on larger datasets. However, the performance achieved is often very poor. The approach used in this project is based on two main concepts:

1. Vienna Fortran language extensions and compilation techniques. We propose constructs to specify OOC structures and I/O operations for distributed data structures in the context of Vienna Fortran. These operations can be used by the programmer to provide information which helps the compiler and the runtime environment to use the underlying I/O subsystem in an efficient way.

2. Integrated advanced runtime support. The modules of VFCS that process I/O operations and handle OOC structures are coupled to a mass storage oriented runtime system called VIPIOS (Vienna Parallel I/O System). The objective of the proposed integrated compile-time and runtime optimizations is to minimize the number of disk accesses for file I/O and OOC processing. A central issue in this context is to achieve efficient data prefetching from the disks to the processors' main memory, minimize the number of disk accesses, reduce the volume of data transferred, and increase the reuse of data kept in main memory buffers.

Language and Compiler Support

Distributed data structures are stored in the parallel I/O subsystem as parallel files. The file layout may be optimized by VIPIOS to achieve efficient I/O data transfers. In the context of an OPEN or WRITE statement, the user may give a hint to the compilation system that data in the file will be written to or read from an array of a given distribution. The REORGANIZE statement enables the user to specify the restructuring of a file into a form which will be more efficient for subsequent processing. A more detailed description of the appropriate language constructs can be found in the references.

At compile time, the translation of parallel I/O operations conceptually consists of two phases: basic compilation, which extracts parameters about data distributions and file access patterns from the Vienna Fortran program and passes this information to the VIPIOS primitives; and advanced optimizations, including code restructuring based on program analysis.

The OOC array annotation is of the following form:

      REAL A( N ) DIST ( BLOCK ), OUT_OF_CORE, IN_MEM ( M )

where the keyword OUT_OF_CORE indicates that A is an OOC array, and the optional keyword IN_MEM indicates that only array portions of M elements are allowed to be kept in memory.

VFCS restructures the source of the out-of-core program in such a way that, during the course of the computation, sections of the array are fetched from disks into the local memory, the new values are computed, and the updated sections are stored back onto disks if necessary. The data transfer and the management of the data on disks are performed by VIPIOS. To support efficient data prefetching, the compilation system forwards internal information to the I/O subsystem, in particular information about the array sections to be transferred, which is given in the form of a Section Requirement Graph.
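The fetch/compute/store cycle described above can be sketched without VIPIOS. The following Python fragment is a hypothetical illustration (the function name and the flat file layout are our own assumptions); it processes an out-of-core array of float64 values while never holding more than m elements in core:

```python
import struct

def ooc_scale(path, n, m, factor):
    """Scale an out-of-core array of n float64 values stored in `path`,
    keeping at most m elements in memory: fetch a section from disk,
    compute on it in core, and store the updated section back."""
    with open(path, "r+b") as f:
        for start in range(0, n, m):
            count = min(m, n - start)
            f.seek(start * 8)                            # fetch section
            section = list(struct.unpack(f"{count}d", f.read(count * 8)))
            section = [factor * x for x in section]      # compute in core
            f.seek(start * 8)                            # store section back
            f.write(struct.pack(f"{count}d", *section))
```

The compiler-directed scheme described in this section automates exactly this kind of restructuring, and additionally overlaps the fetch with computation via prefetching.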

Advanced Runtime Support

A central issue of the client-server based runtime framework is a conceptual distinction between two types of processes: application processes and VIPIOS servers. The application processes are created by the VFCS according to the SPMD paradigm. The VIPIOS servers run independently on a number of dedicated nodes and perform the data requests of the application processes. The number and the locations of the VIPIOS servers depend on the underlying hardware architecture (disk arrays, local disks, specialized I/O nodes, etc.), the system configuration (number of available nodes, types of available nodes, etc.), the VIPIOS system administration (number of serviced nodes, disks, application processes), or on user demands (I/O characteristics, regular/irregular problems, etc.). For each application process a distinct VIPIOS server is assigned, accomplishing all data requests.

Summing up, the VIPIOS, as a future component of the VFCS, is responsible for the organization and maintenance of all data held on the mass storage devices.

Conclusion

As mentioned in the preceding sections, high performance languages generally lack efficient parallel I/O support. A possible approach to solving this problem is the development of an integrated runtime subsystem which is optimized for HPF language systems. As a main goal, physical data distributions should adapt to the requirements of the problem characteristics specified in the application program.

Chapter

Tools and Knowledge-Based Program Development

In the first sections of this chapter we describe a number of software tools that have been designed and implemented in the context of VFCS over the three-year period covered by this report. The final section describes the design of an integrated compilation environment which generalizes this work by outlining a knowledge-based system for automatic compilation, thus providing a uniform framework for future research in this area.

ANALYST

Barbara Chapman, Markus Egg, and Fritz Wollenweber

Summary

ANALYST is a prototype software tool to support an applications developer or code owner who wishes to understand more about his or her Fortran applications. It provides a range of information about a source program, at the level of detail desired, and in either graphical or text format. In addition, it permits some simple transformations to be applied to the code.

Design of the Tool

The amount of time spent by application developers in order to acquire information on a code, prior to performing substantial modifications or converting it to run on a parallel platform, is considerable. In fact, this initial program analysis may take longer than the parallel port or adaptation itself. Although there are a few tools on the market which provide support for source program analysis, they generally do not provide the user with sufficient control over the amount and detail of information provided. The result is often a large amount of data from which the relevant facts must be laboriously extracted.

This lack of appropriate support for the initial phases of program parallelization is commercially significant: it results in a much longer and more difficult porting effort. However, the kind of information which a user needs is often closely related to the kind of analysis which must be performed by automatic and semi-automatic restructuring tools.

ANALYST is a prototype software tool which attempts to fill this gap. It aims to give the user the means to specify and obtain the desired information on all or part of an application program. It provides a range of different kinds of information about a source program, at the level of detail desired, and in both text and graphical format. This work did not intend to develop new methods of analysis. Rather, it has focused on the issue of understanding just what kind of support a user needs, and on both deriving and presenting information to the user in an appropriate and flexible manner. A significant fraction of the implementation was dedicated to the development of a sophisticated graphical user interface, including tools for manipulating the graphical displays and for navigation within them. ANALYST is the result of a collaboration between tool developers at the VCPC and an experienced end user who has been involved in the design from the outset.

The program and its parts are displayed in several forms:

- there are several representations of the program structure, including the call graph, a call tree, and the source code form; and

- the individual program units (subprograms) are represented both as source text and in the form of a compressed flow graph and a full flow graph.

The information currently presented to the user may involve one or more of these levels of representation; it includes:

- various kinds of information related to the flow of data, including details of the arguments passed at call sites, of the predecessors and successors of statements, and of reaching definitions;

- several kinds of data dependence information, available in graphical form; a selection may be obtained in terms of program statements, loop levels, or kind of dependence, or in terms of the variables involved;

- precise details on the usage of global variables, and knowledge of the use of sequence and storage association in the code, both of which are essential when parallelizing; common block usage and references to the data within a common block are presented in several forms, and equivalencing is analyzed.

ANALYST thus has menu items for both local and global call site information, global equivalence information, and local and global common block usage information, in addition to the above. There are menu entries for such things as printing displays, and for viewing the source code of additional program units, searching for text strings, and jumping to locations in the source code.

A transformation menu is included in the functionality of the system. Transformations may be applied to the source code from within ANALYST. This includes the ability to apply them to all statements where they may be used, or to a specific statement or region of code. The results are immediately displayed and the corresponding information is updated.

The text and graphical display forms of information representation are suitably linked where appropriate. If, for example, a call graph node representing a specific program unit is clicked on, the corresponding source text is shown. Colour coding provides some information at a glance, for instance indicating nodes containing I/O statements. Clicking on a node related to program statements suffices to highlight the corresponding text in the source code display. Some kinds of information are shown graphically by highlighting, emphasizing, or colour-coding the corresponding nodes or arcs, or by creating a new graph. Information may also be appended to arcs in text form. In some cases, additional details are retrieved by clicking on the colour-coded or newly created node or arc.

The user may navigate graphs quickly via zoom functions and a so-called slider, which also displays the area of the graph currently visible on the screen. Nodes of a graph may be moved by the user; subgraphs may be selected and extracted from a display, possibly in order to obtain additional information for this set of nodes and arcs. Some features of the graphical displays may be customized via an options menu.

ANALYST is being created using X Windows and OSF/Motif; it runs on a variety of popular workstations. ANALYST has been designed as a standalone tool, but it is also part of VFCS, from which the front end and the transformation facilities come.

New functionality is still being added to ANALYST. This includes additional means of viewing the structure of a program, and will include information on the parts of an array which are accessed in a program unit. In particular, support for the selection of ALIGNMENT and DISTRIBUTION, in preparation for the creation of HPF code, will be available in a future version. The sequential profiler from VFCS can also be included in the realization of this system.

Concept Comprehension: The PAP Recognizer

Beniamino Di Martino

One of the major problems for automatic parallelization is related to the fact that workable solutions have to be determined in search spaces whose huge size is caused by the complexity of the underlying parallel machine. A standard approach in this context consists of first using a set of heuristics, thereby reducing the search space to a manageable size, and subsequently applying analysis.

Concept Comprehension is a method that supports this strategy by predefining a set of algorithmic patterns for which efficient mappings to an HPC can be defined (Parallelizable Algorithmic Patterns), associating these patterns with a corresponding set of heuristics, specifying the syntactic and semantic properties that characterize these patterns and their variants in the source program, and developing a system of production rules that perform a hierarchical concept parsing of the program's intermediate representation, aiming at the automatic recognition of pattern instances.

The tasks to which concept comprehension is applicable include automatic data distribution; code restructuring and optimization; the replacement of code implementing recognized functionalities by optimized sequential libraries such as BLAS and LINPACK; and parallel code generation through code replacement with parallel library calls and high-level MPI primitives. Finally, recognition of high-level algorithms can drive the automatic selection of the execution model that is best suited to the algorithm, to the target architecture, and to the runtime parameters. This can enable much more flexible approaches to program parallelization than those provided by the SPMD paradigm.

A prototype tool for the automatic recognition of Parallelizable Algorithmic Patterns, the PAP Recognizer, has been implemented in collaboration with the University of Naples. The PAP Recognizer implements a plan-based technique for the recognition of concept instances in the code that works in a hierarchical way. The output of the tool is a graphical browser that permits the visualization of the hierarchical description of the recognized concepts, together with their implementation within the program code. The prototype has been integrated into the VFCS: it utilizes the structural analysis of the input code performed by the VFCS frontend as a basis to build the internal representation of the program to be analyzed. It relies on Prolog as a system shell and takes advantage of Prolog's deductive inference-rule engine to perform the hierarchical concept recognition.
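The flavor of plan-based matching can be conveyed with a toy sketch. The following Python fragment is our own illustration only (the real PAP Recognizer works on a Program Dependence Graph via Prolog production rules); it matches a single "sum reduction" plan against an abstract loop body:

```python
def recognize_sum_reductions(loop_body):
    """Toy plan matcher: each statement is (lhs, op, rhs_terms).  A variable
    s is recognized as a sum reduction if it is updated exactly once as
    s = s + <expr> and never assigned in any other way in the loop body."""
    updates, disqualified = {}, set()
    for lhs, op, rhs in loop_body:
        if op == "+" and lhs in rhs:
            updates[lhs] = updates.get(lhs, 0) + 1   # reduction-style update
        else:
            disqualified.add(lhs)                    # assigned another way
    return [v for v, n in updates.items() if n == 1 and v not in disqualified]

body = [("s", "+", ["s", "a[i]"]),        # s = s + a(i)   -> reduction on s
        ("t", "*", ["a[i]", "b[i]"])]     # t = a(i)*b(i)  -> not a reduction
```

A hierarchical recognizer composes such plans: once a sum reduction is recognized, higher-level plans (e.g., an inner product) can match against the recognized concept rather than against raw statements.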

The method developed for algorithmic concept comprehension permits successful handling of a number of problems arising in the context of program comprehension. These can be summarized as:

- syntactic variation;

- delocalization: the implementation of a concept can be spread throughout the code;

- implementation variation: an abstract concept can be implemented in many different ways, and so has to be represented by several alternative plans;

- overlapping implementations: often the implementations of two or more distinct concepts are merged, so a portion of the program may be part of more than one concept instance.

The syntactic variation problem is solved by characterizing interstatement-level concepts with non-syntactic properties, like control and data dependence, and by taking advantage of the backtracking characteristic of the recognition procedure to perform symbolic analysis of expressions within statements.

The solution of the delocalization problem is based on the properties of the abstract program representation, which is based on an inherently delocalized structural representation (Program Dependence Graph): it has a global scope of visibility, so that the active rule can attempt to match all instances of concepts already recognized, at every level of abstraction. Although this approach in principle increases the complexity of the process, the systematic use of control and data dependence relationships to characterize concepts allows the application of rules to be driven by the locality typically present in the source program. In this way, complexity can be maintained at an acceptable level without constraining the delocalized recognition capability.

The implementation variation problem is solved by the backtracking feature of the recognition process. More specifically, backtracking allows the specification of one concept by means of multiple plans; each plan specifies a different algorithmic implementation of the same concept. However, backtracking also has its drawbacks. While on one hand it makes the recognition procedure more powerful and general, on the other hand it causes the search complexity to grow exponentially with the code size. Nevertheless, as we have observed above, both the top-down approach and the summarization of derived subconcepts within the nodes of the abstract program representation should prune the search space considerably, making practical the analysis of non-trivial pieces of code.

Finally, the overlapping implementation problem is solved by the global scope of visibility of the representation, and by the fact that the parsing mechanism does not restrict the use of a subconcept to one plan, allowing recognition even in the presence of shared concept instances.

An important consequence of the features just discussed is their independence of restructuring techniques that may modify the original code before and during the recognition process to deal with delocalized code and implementation variations. This means that our approach does not need a canonical form for concept implementations. Nevertheless, pre-applied restructuring transformations could still be useful in certain situations to speed up the recognition process.

Related Work

Recurrence detection and idiom recognition were described by David Kuck and implemented in the Parafrase system. More recent work targeted towards pattern matching for the support of automatic code optimization includes the expert system EAVE for interactive vectorization of FORTRAN programs; an approach by Redon and Feautrier based on algebraic specification for recurrence detection; the commercial CMAX system; and a method by Pinter and Pinter that relies on a PDG representation which is then modified using graph rewrite rules driven by a list of graph patterns. Currently, three other efforts aim at the application of program understanding techniques to automatic parallelization: work by Kessler, and proposals of Bhansali et al. and of Metzger.

Support for Data Alignment and Distribution

Barbara Chapman and Erwin Laure

High-level programming language extensions to Fortran such as Vienna Fortran and HPF require the user to specify the distribution of program data to the nodes of the executing machine. Each language provides a rather different set of possible distributions. Although it may sometimes be very easy for a programmer to select an appropriate set of mappings, for many programs it is difficult and requires a tedious analysis of the code and its data references.

hardest research issues in the eld There are several reasons for this First any approach which

will work well in practice must take due regard of Fortran co ding practices and common array

usages For some co de structures exp erienced manual parallelization teams are still working

to nd go o d solutions Further any approach will require signicant analysis it will need to

b e interpro cedural and will need to gather extensive information on the source co de including

p erformance information The extent to which this is a global problem or can b e handled lo cally

will dep end on the characteristics of the target machine Hence strategies must display some

exibility or will b e tied to an architectural mo del Finally even relatively simple subproblems

are NPcomplete and thus heuristic approaches are needed But this presupp oses an

amount of exp erience which relatively few researchers in the compiler eld are able to obtain

Many researchers are engaged in the search for automatic data distribution strategies but so

far despite some attempts to consider how these might b e used in realistic scenarios

and some early use in a SIMD environment none have provided a general purp ose

pragmatic approach which has b een used in a commercial system There have b een a numb er

of very interesting results however which have brought insights and metho dological advances

some are able to supply distributions in simplied situations

Although the current implementation does not yet resort extensively to heuristics, it is based upon a good understanding of the particular challenges imposed by the structure of Fortran codes and typical data usage patterns, as well as Fortran programming methods which may obscure relevant information. This work was preceded by a study of these issues.

This project focuses on the selection and implementation of practical methods which support the user in this task, initially by providing the relevant program information.

The current system is implemented as a new component of the Vienna Fortran Compilation System, the Alignment Component, and it is also being integrated into the ANALYST tool. Its main task is to derive information on both the most suitable alignments of arrays in the code and the access patterns used. This information is presented to the user with an indication of the accuracy of the results.

The steps taken when a user requests alignment information are the following:

- First, several program transformations are applied. These have the effect of improving the information relating array subscripts to loop variables; this leads to more accurate dependence analysis, and may have a large impact on the alignment step itself, as experimental results have shown.

- Next, the Weight Finder is automatically executed; this gives concrete values for the number of times a particular subroutine or loop is invoked, and provides values for the number of times a branch is executed (frequency and true-ratio information). It is also able to give the values assumed by a selected set of variables. This information is used during the alignment step to assign priorities, and to decide whether a reference is likely to be local or not.

- The alignment analysis is performed. Pairs of array references are tested to detect alignment preferences, and a Component Alignment Graph is generated. The concept of alignment is based upon the work of Li and Chen and of Gupta, but has been extended to cover arbitrary identical references. Pairs of references on the right-hand side of statements are also tested. Arcs of the Component Alignment Graph are attributed: the weights assigned are derived from the measured execution frequency and the locality of the alignment, as well as from data dependence results. The algorithm uses the results of data dependence analysis to determine where communication will take place (i.e., to which loop level it will be extracted) and is able to quantify the amount of data communication based upon the results from the Weight Finder. Estimates of the startup time and transfer time are provided for each target machine and are used to derive a value which is a reasonably good estimate of the communication cost if the alignment is not respected at all; this is a worst-case estimate. This information is saved together with a description of the locality of the reference. Different locality classes may result in different arcs; hence there may be multiple arcs between a single pair of nodes in this graph.

- After this analysis, the alignment graph itself is solved in order to provide a set of array dimension alignments. The algorithm is an improved version of the original method developed by Li and Chen. It has been selected because it provides satisfactory results in a reasonable amount of time; solution methods proposed more recently may be computationally very expensive. Each node of the graph corresponds to a dimension of an array. These are conceptually sorted into columns, where each column contains the nodes corresponding to a single array. The algorithm first sorts the nodes on the basis of the cumulative weights on edges leaving a column; these are sorted into descending order. The nodes of the first column are the initial alignment sets. The set of arcs connecting this with other nodes, and their attributes, are simplified. For the next step, whenever there is still more than one arc connecting a pair of nodes, that with the highest associated weight is used. The next column in the ordering is selected and matched with the alignment sets using a bipartite graph matching. Then the nodes are merged with the sets to which they have been assigned by the matching. The arcs leaving the updated alignment sets are simplified, and the matching step is repeated until either there are no more columns of nodes, or no edges exist which connect the sets with the remaining nodes. In this latter case, there will be more than one collection of alignment sets.

- The results of this alignment phase are presented to the user. They are accompanied by locality information describing the access patterns, including in particular whether or not the alignment information was consistent throughout the code.
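The core idea of the graph-solving step, merging array dimensions with the heaviest alignment preferences first while never merging two dimensions of the same array, can be sketched as follows. This is a simplified, hypothetical Python illustration, not the VFCS algorithm (which uses column ordering and bipartite matching as described above):

```python
from collections import defaultdict

def align(nodes_by_array, edges):
    """Greedy alignment in the spirit of Li and Chen.  Nodes are array
    dimensions; `edges` maps (dim_a, dim_b) to a preference weight.
    Dimensions are merged into alignment sets, heaviest preference first,
    refusing merges that would align two dimensions of the same array."""
    parent = {d: d for dims in nodes_by_array.values() for d in dims}
    def find(x):                          # root of x's alignment set
        while parent[x] != x:
            x = parent[x]
        return x
    for (a, b), _w in sorted(edges.items(), key=lambda e: -e[1]):
        ra, rb = find(a), find(b)
        if ra != rb:
            # would this merge put two dimensions of one array together?
            clash = any(len({find(d) for d in dims} & {ra, rb}) == 2
                        for dims in nodes_by_array.values())
            if not clash:
                parent[rb] = ra
    sets = defaultdict(set)
    for dims in nodes_by_array.values():
        for d in dims:
            sets[find(d)].add(d)
    return list(sets.values())
```

For two 2-dimensional arrays A and B with preferences (A1,B1)=10, (A1,B2)=8, and (A2,B2)=5, the heavy edge wins and the conflicting lighter edge is rejected, yielding the alignment sets {A1,B1} and {A2,B2}.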

The implementation has been organized in such a way that the algorithm may be applied to all or part of a subroutine. The start and end statements must not, however, be enclosed by any loop. This requirement could easily be relaxed to permit so-called counter loops to surround the statements, but at the expense of some accuracy in the estimation of communication costs. In general, this does not seem necessary.

Future work will extend this component to provide automatic generation of alignment and distribution directives, and to search for a good interprocedural solution.

P³T

Thomas Fahringer

State-of-the-art parallelizing compilers such as VFCS provide the user with a large set of program transformations and a variety of data distribution strategies. However, as of today, it is the programmer's responsibility to evaluate the performance gains for each of these transformations. As a consequence, the programmer frequently compiles and executes his parallel program on the target architecture for performance evaluation. Obviously, this is a tedious, time-consuming, and error-prone task. Certain critical information, such as work distribution, cannot be obtained by machine-level profiling: even if all processors are busy all the time, it does not mean that they do useful, non-replicated work. Most architectures do not allow cache behavior to be profiled; therefore it is difficult to evaluate data locality or reuse. We developed an automatic performance estimator, P³T, to overcome these obstacles by providing the user with immediate and accurate estimates of the performance impacts of program changes, without actually executing the program. In order to examine whether a specific program transformation or a data distribution strategy causes a slowdown or improves the performance of a parallel program, P³T automatically computes, at compile time, a set of parallel program parameters. This includes work distribution, number of transfers, amount of data transferred over the network, transfer times, computation times, network contention, and number of cache misses. It is decisive to derive and display the parallel program parameters separately, as opposed to hiding them in a single estimated runtime, which is done in most existing performance estimators. This provides the compiler and programmer with detailed information, for selected parts of the program, on different aspects of the program behavior. Therefore, it can answer the following two fundamental performance questions, which cannot be addressed if only estimated runtime information is known:

1. Which parts of the parallel program need to be improved?

2. What kind of improvement is required?

P³T is able to answer the first question because the parallel program parameters can be selectively determined for statements, loops, procedures, and the entire program; furthermore, their effect with respect to individual processors can be examined. The second question is answered because the different parallel program parameters relate to distinct performance drawbacks. This enables the compiler and/or programmer to apply well-directed optimization strategies to alleviate or even eliminate these drawbacks. For example, it might apply loop interchange and/or strip mining to reduce the number of cache misses implied by a certain loop nest; it may decide to change the data distribution of an array in order to improve the program's work distribution and/or communication overhead; privatizing variables or scalar expansion may avoid extensive communication inside a loop; etc.

In the following, the parallel program parameters as computed by P³T are briefly described. They are defined and discussed in detail in the references.

- Work Distribution (WD) estimates how evenly the work contained in a parallel program is distributed across all processors executing the program. The work distribution is inherently defined by the owner-computes paradigm.

- Number of Transfers (NT) is an estimate of the number of send/receive operations induced by a communication statement.

- Amount of Data Transferred (TD) approximates the data volume (in bytes) transferred over the communication network as implied by a communication statement.

- Network Contention (NC) represents an upper bound on the number of channel contentions induced by a communication statement on the underlying physical communication network. It is assumed that a channel contention occurs if a physical network channel is traversed by two messages at the same time in the same direction.

- Transfer Time (TT) predicts the time needed to transfer the required nonlocal data over the physical communication network.

- Computation Time (CT) is an estimate for the sequential computation time of the parallel program.

- Number of Cache Misses (CM) models the data locality of program statements. It is defined by an upper bound on the number of accessed cache lines, which is assumed to correlate with the number of cache misses.

The parameters are adapted such that they can be selectively computed for single processors, statements, loops, procedures, and the entire program. The parameters are based on a loop model with loop lower and upper bounds as linear functions of all enclosing loop variables. In order to take procedure calls into account, the parameter outcome for a single procedure call instantiation is assumed to be independent of the call site. Program unknowns, such as loop iteration and statement execution counts, are obtained by a single profile run of the original input program. It has been shown that a single profile run is sufficient for numerous important program transformations. The parallel program parameters depend on these data.

The parameters are computed based on an analytical model, without incorporating simulation techniques. Geometric operations, such as intersection and volume algorithms for n-dimensional polytopes, are incorporated in our performance analysis. Similar operations are used to model nonlocal array accesses, which refer to array portions to be communicated (transferred) between processors.

Analyzing the communication patterns, the process-to-processor mapping strategy, and the interconnection network and routing mechanism of the target architecture (e.g., e-cube routing on the Intel hypercube) allows us to estimate the corresponding network contention behavior.

Transfer times can be obtained by modelling the communication patterns, the data volumes transferred, the interconnection network, the process-to-processor mapping strategy, the architecture-dependent startup time, and the message transfer time per byte.

In order to estimate computation times, we pre-measure a large set of kernels. This parameter does not account for communication and blocking time. The kernels are measured on different architectures for varying problem sizes, and the measured kernel run times are stored in a kernel library. To estimate the computation time of a parallel program, the program (with communication statements and all other explicit parallel language constructs ignored) is parsed to detect existing library kernels, incorporating pattern matching.

Estimating the number of cache misses requires the analysis of array access patterns (subscript expressions), the grouping of array accesses into array access classes, and the accumulation of the number of cache lines accessed by each array access class. Two array accesses belong to the same array access class if they access a common cache memory location. It is assumed that the number of cache misses correlates with the number of cache lines accessed. Target-architecture-specific information about data type lengths, cache line size, and the overall number of cache lines available is incorporated in this analysis.
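The core of this estimate — counting the distinct cache lines touched by a set of array references — can be sketched as follows. The line size and element size are illustrative; the real analysis works symbolically on subscript expressions rather than on concrete addresses:

```python
def cache_lines_touched(byte_addresses, line_size=64):
    """Upper bound on cache misses: number of distinct cache lines referenced."""
    return len({addr // line_size for addr in byte_addresses})

ELEM = 8  # 8-byte reals (illustrative)

# Contiguous access to 1000 elements touches 8000/64 = 125 lines,
contiguous = [i * ELEM for i in range(1000)]
# while a stride-8 access to 1000 elements touches one line per element.
strided = [i * 8 * ELEM for i in range(1000)]
```

This illustrates why loop interchange can pay off: the same number of element accesses may touch an order of magnitude more cache lines under an unfavorable stride.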

The performance estimator is based on Fortran programs. However, most of the described techniques can be used for other languages, such as the C programming language, as well.

P³T provides a graphical user interface which displays each program source line together with the corresponding parameter outcome. The parameters are illustrated using colored bars, with the numeric parameter values next to them. Brightly colored bars indicate a good parameter outcome, while dark colors report poor performance. Additional performance and analysis information is visualized by mouse-clicking on individual statements in a Motif/X window.

P³T is, to the best of the authors' knowledge, the first performance estimator to guide both the selection of performance-efficient data distribution strategies and the application of profitable program transformations. Experiments demonstrate that it is critical to estimate not only the performance outcome of data partitioning strategies but, in particular, that of optimizing program transformations such as interchange and distribution of loops or inter-loop communication fusion. The advanced and accurate analytical methods of P³T allow the detection of both crossover points in the relative quality of different distribution strategies and undulations of the corresponding performance curve for parallel programs.

P³T has been fully implemented according to the features mentioned above. It is an integrated tool of the VFCS. The estimator is limited to very regular programs; it cannot be used for irregular problems, which require runtime analysis. Shifting performance estimation to runtime in order to support runtime optimization will be addressed in future work. P³T is currently being extended to estimate the performance of High Performance Fortran (HPF) programs, using a separate frontend under the VFCS. Ongoing work to fine-tune the estimator for a larger set of optimizing transformations and to evaluate it on other MIMD architectures will further enhance the usefulness of P³T.

Simulation Tool

Roman Blasko

Motivation and Objectives

PEPSY is a performance prediction tool that performs performance analysis based on discrete-event simulation. PEPSY has been designed to analyze and predict the behavior of parallel programs produced by the frontend of VFCS, before application of the backend. It can be used to compare programs generated by different parallelization strategies of VFCS by evaluating a range of static and dynamic properties of the program, providing them to the user and/or the expert adviser XPA as a basis for an iterative performance tuning process. PEPSY allows the analysis and optimization of parallel program performance without generating the final code for the target parallel computer and without the necessity of actually accessing the target machine.

System Description

PEPSY (PErformance Prediction SYstem) consists of two main phases: automatic modelling of parallel programs, and performance analysis based on discrete-event simulation. The automatic modelling and discrete-event simulation techniques are implemented by two basic functional modules, viz. MOGEN and PROGAN, respectively. We have defined the Process Graph (PG) for the representation of sequential and parallel Fortran programs (Fortran, Vienna Fortran, Message Passing Fortran) at the statement level. The PG uses a set of node types for representing the statement types used in the program, such as labeled and unlabeled assignment, conditional GOTO statement, logical IF statement, DO-loop head, and various types of communication statements. The automatic modelling technique is implemented in the functional module MOGEN (MOdel GENerator), which is integrated with the internal representation of the parallel program in VFCS and has access to the results produced by the Weight Finder (see the corresponding section). The automatic modelling consists of three phases. The first phase determines the basic parameters of the program via a top-down analysis based on the syntax tree. The second phase extracts the program parameters characterizing the communication between parallel processes. The third phase generates the complete PG model of the parallel program, to be analyzed by PROGAN. The parameterization phase uses a database of language and machine characteristics. The present version is targeted towards the Intel iPSC DMMP; changing the data in the database allows porting of PEPSY to other target computers.

The performance analyzer PROGAN (PROcess Graph ANalyzer) is based on discrete-event simulation and has been developed for the performance evaluation of parallel processes represented by a PG. The input to PROGAN is a PG model of the parallel program to be analyzed; the output from PROGAN is a set of static parameters and performance data characterizing the dynamic behavior of the parallel program. PROGAN has several performance analysis facilities, implemented by optional monitoring modes and developed for evaluating both the detailed and the global behavior of the parallel program. PROGAN can also be used separately, for any PG model of a parallel system, whether generated automatically or written by an editor.

Performance Indices and Monitoring Facilities

We have defined a set of performance indices used for characterizing static and behavioral properties of the parallel program. All these parameters are evaluated by the functional module PROGAN of PEPSY.

Static parameters characterize the size and structure of the modeled system, i.e., the parallel program. We have defined nine parameters characterizing the size of the model and the interconnections between its nodes.

Dynamic parameters characterize the behavior of the model for steady or transient states. PEPSY evaluates ten basic and several derived dynamic indices, including execution time, computation time, communication time, degree of parallelism, utilization, and communication volume. By comparing different versions of the parallel program, PEPSY allows the evaluation of scalability parameters such as execution signature, speedup, efficiency, and efficacy.
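The basic scalability indices named above can be sketched as follows. Note that "efficacy" here follows one common definition (speedup times efficiency); the exact index evaluated by PEPSY may differ:

```python
def scalability_indices(t_seq, t_par, p):
    """Speedup, efficiency, and efficacy for a run on p processors.

    t_seq: sequential execution time; t_par: parallel execution time.
    Efficacy is taken as speedup * efficiency (one common definition,
    assumed here for illustration).
    """
    speedup = t_seq / t_par
    efficiency = speedup / p
    efficacy = speedup * efficiency
    return speedup, efficiency, efficacy

# A program that runs in 100 s sequentially and 25 s on 8 processors:
s, e, y = scalability_indices(100.0, 25.0, 8)
```

Comparing these indices across program versions is exactly the kind of evaluation PEPSY supports without executing the program on the target machine.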

The monitoring facilities of PEPSY evaluate the performance indices either for the whole execution period of the parallel program or for an optional sampling period. Performance results are provided in a hierarchy: at the global or program level, as average values over all parallel processors; at the processor level, comparing all parallel processes (for example, to perform load-balancing analysis); and at the statement level of the parallel program.

Current State and Future Research

The functionality of PEPSY has been validated by predicting the performance of a basic set of parallel programs generated by VFCS. The evaluated performance characteristics allow the comparison of different parallel program versions and the selection of a parallelization strategy for the compiler without executing the program on a target machine. A comparison of predicted and measured values has yielded satisfactory results.

We have designed a hierarchical performance prediction technique for SPMD parallel programs characterized by an acyclic call graph. Comparing hierarchical and flat performance prediction, we have reduced the required processing time by more than one order of magnitude for an illustrative example. This technique also permits the recognition of the most critical parts of the parallel program, not only within individual program units but across the whole procedure call structure.

Our future research and development will address the issues of program representation, the automation of hierarchical performance prediction, and its full integration with VFCS and its expert adviser via a user-friendly interface.

VFPMS

Bernd Wender

The Vienna Fortran Performance Measuring System (VFPMS) is designed to support the user in specifying measurements for performance-critical components of Vienna Fortran or HPF programs during a parallelization session. It serves as a sophisticated interface for the integration of post-execution performance analysis mechanisms into an advanced parallelization system. VFPMS is a fully integrated module of VFCS and can be used via the graphical user interface as well as the command language interface.

Motivation

The development of high performance SPMD applications is typically an iterative process. It involves experiments with various data distributions and program transformation sequences. For large programs, detailed and focussed performance measurements of performance-critical components become more and more important, since most available profilers produce such a large amount of performance data that it is often difficult to determine the relevant parts of the profiles.

VFPMS has two advantages. On the one hand, it is integrated into VFCS; as a consequence, the user can easily specify within the parallelization session which program components are to be measured. On the other hand, it provides several methods for specifying program components in a simple and intuitive way, based on the textual intermediate representation of the program.

Description

The user starts with a Vienna Fortran or HPF source program, which is transformed into an intermediate representation by the VFCS. All transformations are applied to the intermediate representation. It is possible to save and load versions of the intermediate representation during a design cycle and then step back to a previous version whenever necessary. VFPMS works on the intermediate representation; therefore the user can easily specify critical parts of the application and generate a set of instrumented target programs with respect to the relevant transformations.

We provide a graphical user interface for the specification of components, modeled after an extension of the VFCS batch language for performance measuring purposes. If VFCS is run as a command line compiler (via directives in a command file), the measurement features of the command language can be used. This is particularly important for comparative measurements over a set of slightly different program versions, for example to investigate the effect of various distributions on communication behavior.

An important issue in improving the performance of parallel programs is the availability of feedback about certain performance parameters, such as the execution and/or communication time of critical parts of the program. Due to the complexity of the collected performance data, it is often helpful to use performance evaluation and visualization tools such as MEDEA, PARAGRAPH, or PARMON. These tools normally require tracefiles in a specific format, e.g., the very common format generated by the PICL instrumentation primitives. For the sake of portability, VFPMS does not produce a tracefile directly. Our approach is to generate detailed descriptions of the performed measurements in a format which is human-readable but can also be used to generate tracefiles in almost arbitrary formats, using the performance data collected during the execution of the parallel program. Currently we support two tracefile formats: the MEDEA/PARMON format and the more common PICL format.

Post-execution Performance Analysis

Mario Pantano

Post-execution performance analysis plays a key role in guiding the restructuring process by supplying crucial information regarding the behavior of the program. In the following we examine this topic in more detail.

Performance Indices

Performance indices can be derived at different program levels: programmers often use performance tools like microscopes, starting with a global view and then gradually descending to smaller program components where sources of performance loss have been discovered. Roughly, at least three levels of analysis can be distinguished in a Fortran program:

- interprocedural analysis, providing a global characterization of program properties;
- intraprocedural analysis, which is applied to a single procedure; and
- local analysis, related to individual statements, statement sequences, and loops within a procedure.

For each level, performance indices can be determined using a combination of static and dynamic methods. The list below describes a set of indices that provide useful quantitative information concerning properties of SPMD programs when executed on a parallel machine:

- number of transfers: the number of Send and Receive operations;
- send time: the time spent executing Send operations;
- receive time: the time spent executing Receive operations;
- communication time: send time + receive time;
- computation time: the time spent in computation;
- execution time: communication time + computation time;
- amount of data transferred: the data volume transferred by the Send and Receive operations;
- network contention: an indication of the number of channel contentions induced by a communication operation on the underlying physical communication network;
- cache misses: the number of cache misses.

Another important set of performance indices, at a higher level of abstraction, is the following:

- Degree of Parallelism: the number of processors that are busy computing at a given instant of time, assuming an unbounded number of processors.

- Execution Profile: a function representing the number of processors that are busy at each instant of the program execution. In a similar way, the communication profile and the computation profile can be defined.

- Processor Working Set (PWS): the number of processors associated with the knee of the execution time/efficiency profile. Each point of such a profile represents the combination of execution time and efficiency achieved by a particular number of processors. The knee, the point where the ratio between the execution time and efficiency is maximized, represents an optimal system operating point.
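One way to locate such a knee numerically is sketched below. For illustration, the knee is taken as the processor count that maximizes the ratio of efficiency to execution time (sometimes called the system's "power"); the report's exact formulation of the ratio may differ:

```python
def processor_working_set(times):
    """Locate the PWS: the knee of the execution-time/efficiency profile.

    times maps processor count p -> measured execution time T(p) and must
    contain p = 1. Here the knee is the p maximizing E(p) / T(p), an
    illustrative formulation of the time/efficiency trade-off.
    """
    t1 = times[1]
    def power(p):
        efficiency = (t1 / times[p]) / p
        return efficiency / times[p]
    return max(times, key=power)

# Illustrative timings: beyond 8 processors, adding hardware barely helps.
pws = processor_working_set({1: 100.0, 2: 55.0, 4: 30.0, 8: 20.0, 16: 18.0})
```

In the example, 16 processors still reduce execution time slightly, but the efficiency loss outweighs the gain, so the operating point settles at 8.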

Measurements collected at runtime during the execution of a parallel program provide the basis for an accurate evaluation of its performance. Appropriate tools have to be used in order to process the large amount of collected data and derive a compact representation of the performance. For this purpose, the MEasurements Description, Evaluation and Analysis tool (MEDEA) has been integrated into the VFCS. The interface between VFCS and MEDEA is a tracefile in PICL format. MEDEA is a general-purpose environment which contains features for the analysis of performance data collected by measuring/monitoring tools. The tracefile is preprocessed in order to perform a preliminary analysis of the collected data and to extract the values of performance parameters.

The amount of performance data collected and the large number of indices considered for program analysis can be managed through the application of statistical techniques. MEDEA allows the computation of standard statistical indices, such as moments (mean, variance, standard deviation), parameter distribution, range, skewness, median, mode, and percentiles. Simple data transformations are also provided.

The analysis of parameter distributions may lead to the application of various types of transformations in order to obtain comparable ranges for parameter values. For example, parameters with a highly skewed distribution can be transformed by taking the logarithm of the values. The static properties of the performance data set can be derived through the multidimensional clustering analysis technique provided by MEDEA. A clustering-based approach is adopted when some sort of classification needs to be obtained; for example, this technique helps in discovering similarities among the various processors executing a given parallel program.
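The logarithmic transformation mentioned above is trivial but effective. A minimal sketch, assuming strictly positive parameter values:

```python
import math

def log_transform(values):
    """Map a highly skewed set of parameter values onto a comparable range
    by taking base-10 logarithms (all values must be positive)."""
    return [math.log10(v) for v in values]

# Per-processor communication times spanning four orders of magnitude
# collapse onto a linear 0..4 scale:
spread = log_transform([1.0, 10.0, 100.0, 1000.0, 10000.0])
```

After such a transformation, indices with very different dynamic ranges can be fed into the same clustering analysis without one parameter dominating the distance metric.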

In addition to such a quantitative description of the program, functional analysis gives a logical description of it. This approach can help in identifying the performance of different code sections and those aspects of the compilation system which might be improved. Moreover, the application of various techniques, such as numerical fitting, permits the reproduction of time-dependent program characteristics.

Expert System Adviser

Stefan Andel and Jan Hulman

Motivation

The set of parallelization support tools provided by VFCS can be used to supply the crucial information needed for managing the parallelization process. However, driving a parallelization session still remains a difficult and tedious task, in which the user, in order to achieve good parallel program performance, is forced to make crucial decisions that require a huge amount of experience and detailed knowledge.

A large collection of useful, though usually fragmented, knowledge has been acquired by experts in program parallelization and optimization. It is highly desirable to offer this expert knowledge to the average user by incorporating it into a new generation of parallelizers. With this goal in mind, we have used expert system technology to design a knowledge-based tool, the Expert Adviser (XPA). XPA uses an explicit and uniform representation of parallelization knowledge, including precise information about a program, the parallelization environment, target architectures, and human expert knowledge. It has been designed as an integral part of the new Integrated Compilation Environment (see the section on NICE).

Current State

A prototype of the XPA has been developed as a separate subsystem with precisely specified interfaces to the compiler and performance analysis tools, using the ProKappa Expert System Development Environment from Intellicorp.

Full support of the parallelization process requires the development of a knowledge base modelling the program, the parallelization session, and the parallelization environment. We focussed initially on building the program-related part of the knowledge base and the interface with VFCS. In addition, a limited knowledge-based approach to the solution of the automatic data distribution problem was embedded into XPA. The current XPA prototype applies a simple strategy for finding a suitable initial data distribution and alignment, which is subject to performance tuning at a later stage. The implemented version is restricted to regular problems and basic Vienna Fortran/HPF distribution types.

The XPA knowledge base comprises objects and rules. An inference engine performs reasoning about the objects, using the rules to deduce new facts or to propose parallelization actions (data distribution).

The program-related part of the knowledge base, a program model, is automatically created at the beginning of the XPA processing by extracting relevant information from the VFCS internal program representation. Hierarchically interconnected ProKappa objects represent refined and compressed program-specific knowledge. In the following phases, the program model is preprocessed using rules from the knowledge base and augmented by derived distribution-relevant information, such as alignment preferences and interdimensional and intradimensional alignment conflicts. The XPA decomposition of the program into parallelization-relevant components (phases) is also reflected in the model. Finally, concepts (objects representing data distribution proposals) complete the program model.

The current XPA prototype comprises the following main modules:

- a hierarchical knowledge base containing parallelization-relevant knowledge, represented by objects and rules;

- an analysis module, which infers new information such as alignment constraints, detects conflicts in these constraints, identifies potential communication in a program, and spreads this information interprocedurally;

- interfaces to the VFCS, to a sequential profiler (the Weight Finder), and to the user, supporting knowledge acquisition;

- an automatic generator of a program-oriented parallelization strategy, based on program profiling information;

- an automatic data distribution module, which generates distribution proposals for program arrays;

- the parallelization proposal realization module, which annotates the program with a proposed data distribution and optionally generates a parallel target program using the restructuring features of VFCS.

The integration of the XPA with the program comprehension tool PAP Recognizer was studied. A pattern knowledge base (KBP) has been implemented, which contains a number of simple code patterns together with rules that are able to derive locally optimal parallelization proposals. Performance measurements of the code patterns have been performed on the Intel iPSC target architecture.

Future Work

Parallelization experiments with the XPA prototype have confirmed the feasibility of providing knowledge-based support for selected parallelization subproblems. The current implementation of the program model and the data distribution module allows incremental improvement and extension of the XPA functionality in the future.

An intelligent knowledge-based programming environment should present several basic features: intelligent assistance for the user, including explanation and guidance facilities; automation of the parallelization process; and automatic selection and exploitation of parallelization support tools within the environment. The current XPA prototype exhibits some of these features, though in a limited, rudimentary form. In the framework of the development of NICE (see the corresponding section), the XPA prototype can serve as a valuable and convenient workbench for clarifying new ideas and as a rapid prototyping tool.

Debugging

Peter Brezany

Debugging is an integral part of the software development process. It enables the programmer to locate, analyze, and correct suspected faults, where a fault is defined to be an accidental condition that causes a program to fail to perform its required function. Debugging is therefore an interactive process which is based on the intuition of the programmer and his or her knowledge of the program under investigation.

Due to this intuitive character, even debugging sequential programs involves many problems; much greater difficulties occur in debugging parallel programs. A large gap exists between user needs and the existing support for debugging parallel and distributed programs. The current approach is to perform debugging at the SPMD program level, using a parallel debugger provided by the vendor of the underlying DMMP. However, debugging at this abstraction level is difficult, because the user must know exactly how the program parallelized by a restructurer or compiler works. Moreover, according to published analysis results, each system has a distinct debugger, often with a sophisticated user interface. This variation, and the amount of time that must be invested in learning a new user interface, contribute to the users' reluctance to use a debugger. So far, the market has not strongly demanded better debugging tools, because the rapid evolution of hardware and software for massively parallel systems has mainly kept programmers busy just porting existing code to new systems. This situation is likely to last for a while; thus it seems that outside influence must be used to improve debuggers for parallel and distributed programs. The existence of high-quality debuggers will push the development and implementation of new parallel algorithms for applications which currently do not have effective solutions.

Our research goal is the design and implementation of a source-level debugging system that enables the programmer, working in a high-level language, to observe the program behavior at the level at which the program has been developed. The programmer will be able to set a breakpoint on a specific logical processor, or to examine a distributed variable using the processor name and the variable name that occur in the source program.

There are three main features that characterize our approach:

1. We follow an approach referred to as a sequential view of parallel execution: the real parallel code is executed, but a corresponding source-code-level interface is presented to the programmer. Providing a sequential view is similar to the problem of debugging optimized code: the debugger must undo the effect of the optimization (in our case, parallelization) at debug time.

2. Debugging of large-scale programs is supported by a progressive program analysis technique called slicing, which enables the debugger to focus only on those program parts that are relevant for the object or feature observed at the moment. For example, the program slice with respect to a specified variable at some program point consists of those parts of the program that may directly or indirectly affect the value of that variable at that particular program point.
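The backward-slice computation described above can be sketched as a transitive closure over dependence edges. This is a minimal data-dependence-only sketch; a full slicer would also follow control dependences:

```python
def backward_slice(uses, criterion):
    """Compute a backward program slice over a def-use dependence map.

    uses maps a statement id to the ids of statements whose computed
    values it reads (data dependences only, for illustration).
    Returns every statement that may affect the slicing criterion.
    """
    result, worklist = set(), [criterion]
    while worklist:
        s = worklist.pop()
        if s not in result:
            result.add(s)
            worklist.extend(uses.get(s, ()))
    return result

# 1: a = ...   2: b = ...   3: c = a   4: d = b + c   5: e = b
deps = {3: {1}, 4: {2, 3}, 5: {2}}
slice_of_d = backward_slice(deps, 4)   # statement 5 is irrelevant to d
```

The debugger can then restrict its display and analysis to the statements in the slice, which is what makes the technique attractive for large programs.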

3. Another advanced support for the debugger is a technique called checkpoint/restart. Checkpointing is the act of saving the volatile state of a program (e.g., registers, memory, file status) to stable storage (e.g., disk), so that after a failure the program may be restored (restarted) to the state of its most recent checkpoint. Checkpointing will be used to provide playback for debugging. This support for the debugger will be developed in coordination with another project that addresses parallel I/O.

Debugging is characterized by a highly interactive mode of work. Therefore, the development of an appropriate user-friendly interface is a goal with very high priority; only in this way can we achieve broad acceptance of our debugging system.

NICE

Mario Pantano, Stefan Andel, Beniamino Di Martino, Jan Hulman, and Hans P. Zima

In this section we present a proposal for a unified compilation environment. We introduce the new Integrated Compilation Environment (NICE) (see the figure below), which is designed to provide automatic support for the translation of high-level source programs into target programs for an HPC. Note that the figure contains only the main components involved in the parallelization process; it shows neither the interaction with the user nor other important tools of the environment, such as those for debugging and graphical support.

The major system components include:

- the restructuring system, an enhanced version of VFCS;
- the performance analysis system (PAS); and
- the expert system adviser (XPA).

The user has access to all system components via a high-level graphical interface oriented towards the source program level.

The transformation of a program consists of an initial parallelization phase, followed by iterative performance tuning.

Initial parallelization applies VFCS to the source program, generating an initial version of the target program. This phase does not rely on PAS or XPA, but will be supported by newly developed tools for automatic data distribution, concept comprehension, and program analysis. Performance information is provided by P³T, a static performance prediction tool, and a simulation tool, both of which are components of the current VFCS. Initial parallelization generates code to create the process structure in the parallel object program; determines a complete initial set of specifications for data distribution and alignment, extending a (possibly empty) partial specification in the source program; generates a corresponding work distribution; inserts the required synchronization and communication; and optimizes the program based on statically available performance information and knowledge about the target architecture and special libraries.

The translation process may be terminated at this stage. Otherwise, the program generated by initial parallelization becomes the current parallel program, and one or more iterations of a cyclic performance tuning process are performed.

[Figure: Structure of the new Integrated Compilation Environment — source code (Vienna Fortran, HPF, Fortran 77/90/95) flows through VFCS with concept comprehension, supported by the data distribution system, the Expert System Adviser (XPA), and the performance prediction and measurement systems, producing parallel code (Fortran + MPI, fully annotated HPF) that is compiled, executed, and analyzed by the Performance Analysis System (PAS).]

Each such iteration produces a new current parallel program, based on restructuring transformations selected on the basis of analysis information that includes runtime performance information obtained from an execution of the program on the target machine.

program on the target machine This is supp orted by the VFPMS which instruments the co de

and during the program execution pro duces raw p erformance data Postexecution p erformance

to ols analyze such data and provide information regarding the real parallel programs b ehavior

This pro cess terminates as a result of explicit user intervention satisfaction of certain predened

p erformance criteria or b ecause a predened time limit has b een reached

Each iteration consists of four steps: (i) selective code instrumentation, (ii) execution of the instrumented code on the target machine, (iii) performance analysis, and (iv) generation and application of a restructuring strategy.
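The four steps can be sketched as a driver loop. Everything below is a hypothetical illustration: the callables stand in for the VFPMS instrumentation, target-machine execution, post-execution analysis, and transformation selection; none of the names come from the actual system.

```python
def tune(program, instrument, execute, analyze, restructure, meets_criteria,
         max_iterations=10):
    """Sketch of the cyclic tuning process: instrument, execute,
    analyze, restructure -- until a performance criterion is met or
    an iteration budget (standing in for a time limit) runs out."""
    current = program
    for _ in range(max_iterations):
        instrumented = instrument(current)      # (i) selective code instrumentation
        raw_data = execute(instrumented)        # (ii) run on the target machine
        report = analyze(raw_data)              # (iii) post-execution analysis
        if meets_criteria(report):              # predefined performance criteria
            break
        current = restructure(current, report)  # (iv) apply a restructuring strategy
    return current
```

The explicit iteration bound mirrors the report's third termination condition; user intervention would simply abandon the loop.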

Related Work

A few attempts have been made to develop advanced programming environments as a collection of knowledge-based subsystems. This method has been explicitly applied to the problem of generating code for DMMPs by Ko-Yang Wang at Purdue University. He has constructed a prototype system which has a description of a target machine, a set of program transformations, and rules for applying the transformations, written in terms of the machine parameters. This approach permits code to be transformed for a variety of different machines by a single compilation system.

The MPP Apprentice performance monitoring tool helps the user to find and correct performance anomalies and inefficiencies in programs designed for Cray MPP systems. The MPP Apprentice is a robust environment in which the user can navigate through the code structure; it provides information regarding problems such as load imbalances and excessive synchronization or communication. General advice on ways to improve the code is also provided.

The Pablo performance analysis environment, designed to collect, analyze, and present performance data, has been integrated with a parallelizing compiler; the work described addresses the problem of performance debugging within a compilation system.

Chapter

Algorithms for Massively Parallel Computer Systems

Marian Vajtersic

Motivation

The research described in this chapter was oriented towards the development of efficient parallel algorithms for SIMD systems and DMMPs. It is based on the observation that the design of new algorithms can play a major role in exploiting the capabilities of massively parallel HPCs for scientific computing applications. Furthermore, algorithmic aspects of parallelism are closely related to those of programming and automatic compilation, the two mutually influencing each other.

Contributions to algorithm development have been made in the following areas:

- Matrix Multiplication
- Singular Value Decomposition
- Linear Algebraic Systems with Special Matrices

The architecture classes studied in this context included SIMD processor arrays, DMMPs with a hypercube interconnection scheme, and specialized VLSI systems.

Matrix Multiplication on SIMD Processor Arrays

We have designed new parallel algorithms which are based on the classical matrix multiplication formula as well as on the fast formulas of Strassen and Winograd. In all these algorithms the computations are performed in a fully regular manner, with systolic-like movements of data between neighboring processors of a processor array with a toroidal interconnection topology.

The structure of these algorithms depends on the selected data distribution strategy. Their runtime can be predicted from explicitly known complexities for the arithmetic and data movement operations.

For the situation where the operand matrices coincide with the size of the processor array, the best results have been obtained for the parallelized Winograd algorithm, where within one stage of the product-term computation the operand data travel only to the directly connected processors to the West and North, respectively. For large matrices, i.e., when the matrix size is larger than that of the processor array, appropriate data distributions have been determined; according to these distributions, two algorithmic variants of the systolic-like computing strategy are formulated. Small matrices have to be copied and expanded in order to occupy the complete array of processors. A reduction of parallel arithmetic operations is thus achieved; as a consequence, the number of data handling operations increases, and a trade-off between the savings in arithmetic steps and the higher data communication costs has to be taken into account.

Although the algorithms are intended primarily for SIMD-type processor arrays, their block variants can be implemented in a macro-systolic manner on DMMPs.
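As an illustration of the systolic-like computing strategy on a torus, the following serial NumPy emulation implements Cannon's scheme: after an initial skew, each step multiplies local blocks and shifts operand blocks to the neighboring processors to the West and North. This is the textbook relative of the algorithms described above, not the report's own Winograd- or Strassen-based variants:

```python
import numpy as np

def cannon_matmul(A, B, p):
    """Serial emulation of a systolic-like block matrix multiply on a
    p x p torus of processors (Cannon's scheme)."""
    n = A.shape[0]
    assert n % p == 0
    b = n // p
    # Partition the operands into p x p grids of b x b blocks.
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
    Cb = [[np.zeros((b, b)) for _ in range(p)] for _ in range(p)]
    # Initial alignment: skew row i of A left by i, column j of B up by j.
    Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]
    for _ in range(p):
        for i in range(p):
            for j in range(p):
                Cb[i][j] = Cb[i][j] + Ab[i][j] @ Bb[i][j]
        # Systolic step: shift A one position West, B one position North.
        Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return np.block(Cb)
```

On a real processor array each block would live on its own processor, and the shifts would be neighbor-to-neighbor messages on the torus.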

Singular Value Decomposition Algorithms for MIMD Hypercube Systems

The SVD (Singular Value Decomposition) of a matrix belongs to a class of computation-intensive tasks frequently encountered in many branches of scientific computing, such as image processing, least squares problems, and robotics. Since the computational complexity of this problem grows as the cube of the matrix size, parallelism can achieve a reduction of this value and lead to the fulfillment of the time demands of large-size real-time applications.

The two-sided Jacobi decomposition belongs to the methods which are well suited for coarse-grained parallel computation of the SVD. Our work deals with the parallelization of this method for DMMPs with the hypercube interconnection topology.

Two new methods have been developed which enable the calculation of the singular values in a fully parallel fashion. The block decomposition of the matrix leads to a parallel solution of SVD subproblems of fixed block size in each iteration. The computational process is performed on column stripes of the matrix and is optimal with respect to the reduction of the computational complexity and of the costs for the exchange of matrix blocks among the processors. In addition, optimal load balancing is achieved, and the methods are applicable to parallel architectures that can be embedded into the hypercube.
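For illustration, here is a minimal serial sketch of the closely related one-sided (Hestenes) Jacobi variant, not the report's two-sided block method: each sweep orthogonalizes pairs of columns with plane rotations, and because disjoint column pairs are independent, a parallel ordering can process many pairs at once — the source of the coarse-grained parallelism:

```python
import numpy as np

def jacobi_svd_singular_values(A, sweeps=30, tol=1e-12):
    """One-sided Jacobi sketch: rotate pairs of columns of a working
    copy of A until they are mutually orthogonal; the singular values
    are the final column norms, returned in descending order."""
    U = np.array(A, dtype=float)
    n = U.shape[1]
    for _ in range(sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                if abs(gamma) > tol * np.sqrt(alpha * beta):
                    converged = False
                    # Standard Jacobi rotation annihilating the (p, q) coupling.
                    zeta = (beta - alpha) / (2.0 * gamma)
                    t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                    c = 1.0 / np.sqrt(1.0 + t * t)
                    s = c * t
                    up = c * U[:, p] - s * U[:, q]
                    uq = s * U[:, p] + c * U[:, q]
                    U[:, p], U[:, q] = up, uq
        if converged:
            break
    return np.sort(np.linalg.norm(U, axis=0))[::-1]
```

A parallel version replaces the nested pair loop with a rotation ordering in which each processor handles a different pair (or block pair) per step.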

Our results are also directly implementable in the most recent application of the SVD, the generation of portraits of complex matrices. There, the SVD is applied to evaluate the minimum singular values of a large number of matrices whose structure depends on the values at the grid points of a specified region in the two-dimensional complex plane.
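Assuming the portrait computation has the usual form — the minimum singular value of A - zI evaluated at each grid point z — it can be sketched as follows; the function name and interface are illustrative:

```python
import numpy as np

def portrait(A, re, im):
    """Grid of minimum singular values of (A - z*I) over points
    z = x + iy of a rectangular region of the complex plane."""
    n = A.shape[0]
    out = np.empty((len(im), len(re)))
    for r, y in enumerate(im):
        for c, x in enumerate(re):
            z = complex(x, y)
            # Smallest singular value of the shifted matrix at this grid point.
            out[r, c] = np.linalg.svd(A - z * np.eye(n), compute_uv=False)[-1]
    return out
```

Since every grid point is an independent SVD evaluation, the map parallelizes trivially over points, which is why fast parallel SVD kernels pay off here.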

VLSI Algorithms for Special Linear Systems

We have developed new VLSI (Very Large Scale Integration) algorithms for solving finite-difference approximations to model elliptic problems on rectangular domains.

Systems of linear algebraic equations arising from the discretization of these problems possess special properties: the resulting matrices are large and sparse, their size equals the number of interior grid points, and the sparsity has a regular shape. This is the reason why fast special solvers, which differ from the classical methods based on the elimination or orthogonalization principles, have been designed for these problems.

For the fast solution of these model elliptic problems, new versions of parallel algorithms have been studied. One of them is influenced by the VLSI technology. In the VLSI computational model, a new component contributing to the complexity of an algorithm is the area of its design; the number of parallel arithmetic and data transfer operations is contained in the time component of the complexity measure.

Our algorithms have been designed for solving special block-tridiagonal and block-five-diagonal linear systems (discretizations of second- and fourth-order elliptic boundary problems, respectively). The complexity model used for the evaluation and classification of these algorithms is area × time^2. The results we have obtained for our VLSI Poisson and biharmonic solvers show that, in comparison with existing VLSI designs for these problems, our algorithms achieve the best upper bounds in this complexity measure.
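To make the special structure concrete: for the second-order (Poisson) model problem, the standard 5-point discretization on an n x n interior grid yields the block-tridiagonal system sketched below. This is only the problem setup, not one of the report's VLSI solvers:

```python
import numpy as np

def poisson_matrix(n):
    """Assemble the 5-point finite-difference matrix for the Poisson
    equation on an n x n interior grid.  The result is block-tridiagonal:
    tridiag(-1, 4, -1) blocks on the diagonal, -I on the off-diagonals."""
    N = n * n
    A = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            k = i * n + j               # row-major grid-point index
            A[k, k] = 4.0
            if j > 0:     A[k, k - 1] = -1.0   # West neighbor
            if j < n - 1: A[k, k + 1] = -1.0   # East neighbor
            if i > 0:     A[k, k - n] = -1.0   # North neighbor
            if i < n - 1: A[k, k + n] = -1.0   # South neighbor
    return A
```

For n = 3 the matrix is 9 x 9, symmetric, and its regular sparsity (five nonzero diagonals) is what the fast special solvers exploit.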

Chapter

Description of Major Projects

In this chapter we give a short outline of the major projects. VCPC projects are specially marked.

The ESPRIT III Project PPPE

PPPE stands for "A Portable Parallel Programming Environment". It is a European Research and Development project which was funded as part of the European Union's Framework III research programme ESPRIT. The original project, coordinated by Meiko (UK), ran from July to June; the founding of the VCPC and the initial funding of staff and resources at the Centre were realized within an extension of this project, which now terminates in June.

The PPPE project has delivered a production-quality set of tools for parallel programming, whose purpose is to simplify the development of large-scale scientific and engineering applications for massively parallel distributed architectures. Research within PPPE focussed on the development of experimental tools at the HPF programming level, which are expected to lead to products in the future.

The project has made substantial contributions to standards activities, in particular to the MPI and HPF efforts, in which it actively participated.

Tools developed within PPPE include a debugger and a performance analyzer for use with PARMACS and MPI codes, as well as an automatic converter from PARMACS to MPI. These are complemented by an HPF compiler and debugger. All tools are portable and run on a variety of UNIX platforms. The PPPE research tools include ADAPTOR from GMD, the Parallel Performance Estimator (PPE) from the University of Southampton, and the Vienna Fortran Compilation System from the University of Vienna, which implements language extensions to HPF in an integrated environment.

The project consortium consists of leading hardware and software vendors throughout Europe, as well as major research institutions with a strong track record in parallel processing. End users are important participants in this process: the end users in PPPE significantly influenced the design and functionality of both the research and the production-quality tools, and they evaluated the final products. The PPPE partners come from member states of the EU. The hardware and software vendors are Meiko, PALLAS, Intel, IBM, NA Software, and Emeraude; the research institutions are GMD and the Universities of Southampton and Vienna; the end users in the project are AVL, ESI, ECMWF, FIRST Informatics, and Simulog.

The Institute for Software Technology and Parallel Systems at the University of Vienna is active within the research workpackage of PPPE, for which it was Workpackage leader.

The Project Extension

During the initial three years, the consortium was successful in its goal of developing mature programming tools for a new and rapidly growing market. PPPE is now entering its fourth year, in which it is led by Barbara Chapman of the VCPC. In this final year it is refocussing its activities in order to ensure the efficient exploitation of the project results. The PPPE goals in year four thus include the promotion of the toolset and the demonstration of its use, as well as further work on standards and experimental tools. Activities include measures to train application developers in the use of the new language standards and the PPPE toolset, and to promote their active deployment in the development and performance tuning of parallel application codes. The project supports a number of training events and workshops in parallel computing, as well as some full-scale application development based upon MPI and HPF.

The toolset is installed at the VCPC, where it is made available to end users for code development. The staff of the VCPC use the PPPE tools in their daily work and are skilled in their use. The Centre provides further information on the tools, can demonstrate their capabilities, and can broker an initial contact with the marketers. Future plans at the VCPC include the development of training material for selected tools.

Meiko, PALLAS, Intel, GMD, ECMWF, the University of Southampton, and the Institute all participate in this final year.

For further information on PPPE, contact info@vcpc.univie.ac.at or see the PPPE homepage at http://www.vcpc.univie.ac.at/activities/projects/PPPE.html.

The ESPRIT III Project PREPARE

PREPARE (Programming Environment for Parallel Architectures) is an ESPRIT III project performed by a team including a number of European companies and research institutes. The project aims to develop a new, powerful toolset for the parallelization of applications for DMMPs.

The project bases its toolset on High Performance Fortran (HPF). It develops a programming environment in which HPF programs can be developed, restructured, and compiled in a machine-independent way. The PREPARE system relies on the COSY technology of the associated COMPARE project. The COSY model is an innovative compilation framework that makes it possible to configure highly optimizing compilers from a set of building blocks called engines. These engines work concurrently; they share a generic Internal Representation (IR) in which all HPF data-parallel constructs are mapped to one canonical form (CF). The kernel engine is the Parallelization Engine, which is responsible for the restructuring of HPF programs and for the SPMD code generation. A dedicated engine performs automatic DO-loop vectorization, which maps parallel loops to the CF. The Interactive Engine helps the programmer to specify the data distribution: it reports to what extent the system is able to parallelize the program automatically, together with the obstacles it encounters on its way, and it invites the user to remove such obstacles by providing additional information.

The PREPARE HPF compilation system includes about thirty engines and is able to generate binary code for several target DMMPs (the Parsytec GC, clusters of Sun SPARCstations, and other systems built upon SPARC or PowerPC processors), as well as message-passing C code with MPI communications. The binary code generation focuses on gaining high performance (integration of intra- and inter-processor parallelism), while the C code generation guarantees a high portability of the parallel code.

The project is carried out by a consortium of three industrial and six academic partners. The main contractor is ACE (Amsterdam, The Netherlands), which coordinates the project, provides the COSY framework, and is responsible for the integration and for the transfer of the prototype into a product. The company Parsytec (Aachen, Germany) is responsible for the runtime system and will perform the integration with its GC systems based on the PowerPC processors. The third industrial partner, Steria (France), is responsible for the Interactive Engine, which is being developed in cooperation with the Technical University of Munich, which contributes its expertise in performance monitoring and user interfaces. A crucial role in the consortium is played by institutes with expertise in automatic parallelization for DMMPs; this includes our institute as well as IRISA (Rennes, France), GMD (Berlin, Germany), and TNO (Delft, The Netherlands). PELAB from the University of Linköping interfaces the PREPARE system with its ObjectMath environment, which enables scientists and engineers to describe their models in a high-level, object-oriented, equational representation.

The contributions of our institute to the project include a functional specification of the whole parallelization process, participation in the design of the Parallelization Engine and the Analysis Engine, and the development of the runtime library PARTI+, which extends the functionality of, and optimizes, the standard PARTI library from the University of Maryland. This library supports the parallelization and execution of irregular applications. In the final project phase we are building a test suite enabling the testing and benchmarking of the PREPARE compilation system and the integration of its parts.

The ESPRIT IV Project HPF+

The purpose of this project is to improve the current version of High Performance Fortran (HPF) and the related compiler technology by extending the functionality of the language and developing optimizing compilation strategies driven by the requirements of a set of advanced application codes.

The language extensions will be formalized by specifying the full syntax and semantics of the HPF+ language; an overview of its new features relative to HPF was given in an earlier section. The implementation effort will be carried out in the framework of the VFCS; the Meiko CS-2 at the VCPC will serve as the project platform.

The HPF+ Consortium

The HPF+ Consortium consists of application designers and of both academic and commercial language, compiler, and tool developers:

- Engineering Systems International (ESI), Paris, France
- AVL List GmbH, Graz, Austria
- European Centre for Medium-Range Weather Forecasts (ECMWF), Reading, England
- Institute for Software Technology and Parallel Systems and VCPC, University of Vienna, Vienna, Austria
- Dipartimento di Informatica e Sistemistica, University of Pavia, Pavia, Italy
- NA Software Limited, Liverpool, England

The Project Coordinator is the University of Vienna. The project started on the 1st of January.

Objectives and Results

- Develop project benchmarks
  The project benchmarks will be based on a set of advanced application codes, including PAM-CRASH from ESI, IFS from ECMWF, and FIRE from AVL. They will range from highly simplified kernels to representative benchmark programs for the requirement specification and evaluation of HPF+.

- Develop a full HPF+ language specification

- Develop HPF+-related compilation technology
  The HPF+ extensions will be implemented in the VFCS. Optimizations will be developed within this framework, in accordance with the experience gained from using the project benchmarks.

- Extension of MEDEA for performance analysis of HPF+ codes
  The MEDEA system of the University of Pavia will be extended in order to provide detailed descriptions of the behavior of HPF+ programs and to aid users in the interpretation of the achieved performance. These performance studies will identify problem areas in the implementation, providing feedback to application designers and implementers.

- Evaluation of HPF+
  An evaluation of HPF+ and its implementation, based on the project benchmarks, will provide a comparison of the enhanced technology with HPF and with the message-passing approach. This evaluation will be carried out with respect to both the required porting effort and the performance of the object code on the project platform. The HPF-related part of the evaluation effort will use commercial compilers, including the HPF Mapper from NA Software. For all kernel migrations to HPF+, the possible impact on the full code will be analyzed.

- Transfer of technology
  A transfer of technology to industry will be addressed by studying the incorporation of the most effective of the new language features and compilation techniques into commercial compilers such as the NA Software HPF Mapper.

The ESPRIT IV Project PHAROS (VCPC)

This project, which started in January, will evaluate the currently commercially available HPF compilers and tools, using codes provided by industrial partners. HPF versions of industrial applications will be developed in association with the code providers, the compiler and tool developers, and centers of expertise in HPF and parallel processing. Work in other projects has already produced message-passing versions of the selected codes; this will allow a comparison of the relative effort of both code development and maintenance, along with the performance on different platforms.

Project partners include PALLAS (Germany), GMD (Germany), NA Software (UK), SIMULOG (France), CISE (Italy), debis (Germany), MATRA (France), SEMCAP (France), and the VCPC.

The ACTS Project DIANE (VCPC)

This project is intended to develop a DIstributed ANnotation Environment (DIANE). It is conceived as a service allowing users to create, exchange, and consume multimedia data easily. The basic concept is a multimedia annotated document, consisting of recorded screen output of arbitrary applications together with multimedia annotations, including pointer movements. DIANE will realize this system for a distributed environment consisting of user terminals and of annotation servers for storing annotated documents.

Partners in the project are a university in Germany, the Hospital General de Manresa (Spain), STISA (Spain), Kapsch (Austria), and the VCPC. The VCPC will exploit the technology developed in the project to interact with partners at remote sites and to provide feedback on the suitability of the software for long-distance working and on the ease with which the system can be used in training, support, and program/project development.

LACE (VCPC)

Several Central European nations have joined to develop a regional Numerical Weather Prediction Service in the LACE project, with the aim of improving the quality of forecasts by using locally gathered data and of providing detailed forecasts in the event of a natural disaster or other emergencies. LACE will use the ALADIN code, developed by an international team of meteorologists under the leadership of METEO France. The VCPC supports this activity by providing a parallel platform for experimentation, as well as access to both networking and HPC expertise.

MAGELLAN (VCPC)

With funding from the Austrian Research Foundation (FWF), the VCPC is collaborating with the Institute for Computer Graphics (ICG) in Graz, Austria, and the Jet Propulsion Laboratory (JPL) in Pasadena, California, USA, to parallelize a program that performs image analysis on the data collected by the Magellan spacecraft. The spacecraft used a synthetic aperture radar technique to map the surface of Venus and returned more data than all previous NASA planetary missions. The data will be corrected for ephemeris and radiometric errors, which currently prevent stereoscopic analysis from being used to obtain height information. Parallel computing will reduce the time taken for the correction and allow other investigations to be performed on the data.

The CEI Project PACT

PACT (Programming Environments, Algorithms, Applications, Compilers and Tools for Parallel Computation) is a cooperative research project that involves partners from Austria and from the Central European Initiative (CEI). The project started in November; it is coordinated by our institute and involves researchers working in seven different countries:

Austria
- Austrian Center for Parallel Computation (ACPC)

Czech Republic
- Czech Technical University, Prague

Hungary
- Hungarian Academy of Sciences, Computer and Automation Institute, Budapest
- Technical University of Budapest
- Research Institute for Measurement and Computing Techniques, Budapest

Italy
- University of Pavia
- Università degli Studi di Roma "La Sapienza", Rome
- Institute for Research on Parallel Information Systems, Naples
- CRS4, Centre for Advanced Studies, Research and Development in Sardinia, Cagliari
- Politecnico di Milano

Poland
- Institute of Mathematical Machines, Warsaw
- Technical University of Częstochowa, Częstochowa

Slovakia
- Slovak Academy of Sciences, Bratislava
- Slovak Technical University, Bratislava

Slovenia
- Institute Jožef Stefan, Ljubljana

PACT consists of seven workpackages:

- Workpackage: Advanced Compiler Technology
  In this workpackage, a programming environment for the development of parallel numerical programs is being designed and implemented, based upon advanced compilation, performance analysis, and graphical methods, and in close cooperation with experienced application designers.

- Workpackage: Parallel Computer Algebra System
  The main goal of this workpackage is the development of an environment and a library for Parallel Computer Algebra in scientific and technical computation.

- Workpackage: Performance Analysis of Parallel Systems and Their Workload
  In this workpackage, methods for the performance analysis of parallel programs are being developed and implemented. In conjunction with the compiler technology workpackage, the Vienna Fortran Compilation System and the performance analysis tool MEDEA were combined into an integrated system.

- Workpackage: Parallel Computer Graphics
  This workpackage is based on an integrated view of computer graphics and its parallelization. The work concentrates on geometric modelling, image generation, and animation and visualization.

- Workpackage: Parallel Numerics (PARNUM)
  The principal goal of this workpackage is the study and development of theoretical, design, and implementation aspects of parallel numerical algorithms.

- Workpackage: Visual Programming
  In this workpackage, a notation for the specification of parallel systems is being developed as a basis for the design of a parallel software development paradigm.

- Workpackage: Execution Environment for Parallel Programs
  The main objective of this workpackage is the implementation of a software environment which provides an integrated set of tools to support the execution and debugging of parallel programs.

Besides the research and development activities, several workshops were organized by the project partners.

FWF Priority Research Program Software for Parallel Systems

The priority research program "Software for Parallel Systems" of the Austrian Research Foundation (FWF) has been planned as a five-year effort. The project aims to establish an internationally competitive research effort in the subject area, with a specific emphasis on broadening the cooperation between the partners with a view to achieving synergy. It is coordinated by the Institute.

The project is divided into four technical areas:

- Parallel Symbolic Computation (Bruno Buchberger, RISC Linz)
  The main topics of research in this area include the design and implementation of a parallelizing compiler for a functional language, performance analysis for parallel symbolic computation, and the development of a parallel library of hybrid algorithms.

- Performance Analysis of Parallel Systems (Günther Haring, University of Vienna)
  A performance prediction toolset, including a specification interface, automatic Petri net performance model generation, and simulation-based model evaluation, has been developed. Future work is oriented towards improving and extending the previous work on performance prediction and implementing a probabilistic simulation strategy.

- Quadrature and ODEs (Christoph Überhuber, Vienna University of Technology)
  The work in this subproject focusses on the areas of parallel numerical quadrature and the parallel numerical solution of initial value problems for ordinary differential equations.

- High-Level Programming Support for Parallel Systems (Hans P. Zima, University of Vienna)
  This subproject deals with the development of tools that support a Vienna Fortran/HPF-based environment for HMPs, with an emphasis on being able to handle real programs.

Chapter

European Centre of Excellence for Parallel Computing at Vienna (VCPC)

The VCPC was established with funding from the European Union's ESPRIT Programme, the Austrian Ministry for Science, Research and the Arts (BMWFK), and the Austrian Science Foundation (FWF). It acts as a site for the transfer of High Performance Computing technology to industry.

The aims of the VCPC are broader than those which usually apply to a single university institute. The goal of making parallel processing technology available to a broad range of users implies, among other things, that measures must be taken which support the programmability of parallel computers at a number of levels and which broaden the range of applications that may be run on the machine. The range of projects selected by the VCPC for active work reflects these goals, and also the major trends in the international marketplace: it attempts to cover a range of programming activities, and it supports the transfer of research into products from several Austrian university departments in Vienna, Graz, and Linz. Furthermore, the VCPC has supported the establishment of connections between Austrian research and industry and the European establishment in the HPC sector. In order to achieve its goals, the VCPC is actively engaged in building up a number of strategic alliances with other European centres.

The main computing facility at the VCPC is a Meiko CS-2 HA, which is used by industry and academia for applied research, code development, benchmarking, and demonstration purposes.

Parallel Programming Tools

The parallel programming tools available at the VCPC include:

- Message-Passing Libraries
  - MPI
  - PARMACS
  - PVM
  - NX
- HPF-Related Software
  - NA Software HPF Mapper
  - PGI HPF Compiler
  - NA Software HPF Debugger
  - VFCS
  - ADAPTOR
- TotalView Debugger
- ParaGraph Performance Analyzer

CS-2 HA Hardware Resources

- SuperSPARC scalar PEs with Ecache and RAM
- SuperSPARC PEs with RAM and I/O capability
- A disk array
- Ethernet and ATM connections

Services

The following services are offered by the VCPC:

- Remote network access to the CS-2 HA
- Workstations for on-site code development
- Consultancy on hardware and software products
- Support for the parallelization of applications
- Training courses in HPC and associated technology

Other Activities

In addition to providing hardware and software resources and offering the services outlined above, the VCPC engages in the following activities:

- R&D projects with industry and research organizations worldwide (see below)
- Research and development in languages, compilers, and tools for HPC
- Parallel application development
- Active participation in standards efforts, including HPF and MPI

More information on the projects in which the VCPC participates can be found in the sections on PPPE, HPF+, PHAROS, DIANE, LACE, and MAGELLAN.

Chapter

Publications

A selection of publications is available via anonymous FTP at ftp://ftp.par.univie.ac.at/pub/papers.

Books

Vajtersic, M.: Algorithms for Elliptic Problems: Efficient Sequential and Parallel Solvers. Kluwer Academic Publishers, Boston/Dordrecht.

Fahringer, T.: Automatic Performance Prediction of Parallel Programs. To be published by Kluwer Academic Publishers, Boston, USA.

Chapters in Books

Zima, H.P., Chapman, B.M.: Automatische Parallelisierung sequentieller Programme. In: Waldschmidt, K. (Ed.): Parallelrechner: Architekturen, Systeme, Werkzeuge. Teubner, Stuttgart.

Refereed Publications

PalkoVVa jtersicM AParallel Recognition of Lines in Binary Images

Computers and Articial Intelligence Vol pp

Blasko R Benacka C Bognar P Simulation Approach to Design of Dataow Computer

Computers and Articial Intelligence Vol No pp

ChapmanBMMehrotraPZimaHP Handling Distributed Data in Vienna Fortran Pro cedures

BanerjeeUGelernterDNicolauAPaduaDEds Pro c th IntWorkshop on Languages and Compilers

for Parallel Computing New Haven Connecticut USA August

Lecture Notes in Computer Science pp Springer Verlag

ChapmanBMMehrotraPZimaHP User Dened Mappings in Vienna Fortran

Pro c Second Workshop on Languages Compilers and Runtime Environments for DistributedMemory

Multipro cessors Boulder CO Septemb er Octob er ACM SIGPLAN Notices Vol No

pp January

FahringerTZimaHP A Static Parameter Based Performance Prediction To ol for Parallel Programs

Invited pap er Pro c th ACM International Conference on Sup ercomputing Tokyo Japan Also

Technical Rep ort ACPCTR Austrian Center for Parallel Computation January

Zima H P Chapman B M Compiling for DistributedMemory Systems

Invited Pap er Pro c IEEE Sp ecial Section on Languages and Compilers for Parallel Machines pp

February

Also Technical Rep ort ACPCTR Austrian Center for Parallel Computation Novemb er

Fahringer T The Weight Finder An Advanced Proler for Fortran Programs

Automatic Parallelizati on New Approaches to Co de Generation Data Distribution and Performance

Prediction pp Vieweg Advanced Studies in Computer Science ISBN Verlag Vieweg

Wiesbaden Germany March

Hlavacova O Simek J Va jtersic M Some Developments of Parallel Algorithms

Pro c Int Workshop Software Engineering for Parallel RealTime Systems pp March

Va jtersicM A Fast Multiplic ation of Matrices on Parallel Sup ercomputer Array

Pro c PARSWorkshop Dresden pp April

Chapman B.M., Mehrotra P., Zima H.P.: High Performance Fortran Without Templates: A New Model for Data Distribution and Alignment.
Proc. Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, May. ACM SIGPLAN Notices, Vol., No., pp., July.

Chapman B.M., Fahringer T., Zima H.P.: Automatic Support for Data Distribution on Distributed-Memory Multiprocessor Systems.
Invited Paper, Proc. Sixth Workshop on Languages and Compilers for Parallelism, Portland, August.

Benkner S., Zima H.P.: Massively Parallel Architectures and Their Programming Paradigms: Recent Developments.
Invited Paper, Proc. AICA International Section, Parallel and Distributed Architectures and Algorithms, pp., Gallipoli, Italy, September.

Fahringer T.: Automatic Cache Performance Prediction in a Parallelizing Compiler.
Proc. AICA International Section, Lecce, Italy, September.

Blasko R.: Parameterization and Abstract Representation of Parallel Fortran Programs for Performance Analysis.
Proc. AICA Conference on Parallel and Distributed Architectures and Algorithms, Gallipoli (Lecce), Italy, pp., September.

Lucka M.: An Effective Algorithm for Computation of Two-Dimensional Fourier Transform for NxM Matrices.
Lecture Notes in Computer Science, Parallel Computation, Proc. Second International ACPC Conference, Gmunden, Austria, pp., Springer-Verlag, October.

Zima H.P., Brezany P., Chapman B.M., Hulman J.: Automatic Parallelization for Distributed-Memory Systems: Experiences and Current Research.
Invited Paper, Spies P.P. (Ed.), Euro-ARCH, pp., Proc. European Informatics Congress Computing Systems Architectures, Munich, Informatik aktuell, Springer Verlag, October.

Chapman B.M., Mehrotra P., Moritsch H., Zima H.P.: Dynamic Data Distributions in Vienna Fortran.
Proc. Supercomputing, pp., Portland, Oregon, November.

Benkner S., Brezany P., Zima H.P.: Compiling High Performance Fortran in the Prepare Environment.
Proc. Fourth Workshop on Compilers for Parallel Computers, Delft, Netherlands, pp., December.

Hulman J., Andel S., Chapman B.M., Zima H.P.: Intelligent Parallelization Within the Vienna Fortran Compilation System.
Proc. Fourth Workshop on Compilers for Parallel Computers, Delft, Netherlands, pp., December.

Zima H.P., Brezany P., Chapman B.M.: SUPERB and Vienna Fortran.
Invited paper, Parallel Computing, pp.

Blasko R.: Process Graph and Tool for Performance Analysis of Parallel Processes.
Troch I., Breitenecker F. (Eds.), Proc. IMACS Symposium on Mathematical Modelling MATHMOD, Vienna, pp., February.

Benkner S., Brezany P., Zima H.P.: Processing Array Statements and Procedure Interfaces in the PREPARE High Performance Fortran Compiler.
Fritzson P.A. (Ed.), Proc. Int. Conf. on Compiler Construction (CC), Edinburgh, UK, Lecture Notes in Computer Science, pp., Springer Verlag, Berlin, April.

Chapman B.M., Mehrotra P., Zima H.P.: High Performance Fortran Languages: Advanced Applications and Their Implementation.
Future Generation Computer Systems.
Also in: Gentzsch W. and Harms U. (Eds.), Proc. High Performance Computing and Networking (HPCN Europe), Volume II, Lecture Notes in Computer Science, pp., Springer Verlag, Berlin, April.

Haines M., Hess B., Mehrotra P., Van Rosendale J., Zima H.P.: Runtime Support for Data Parallel Tasks.
Proc. Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers), McLean, Virginia.
Also Technical Report TR, Institute for Software Technology and Parallel Systems, Univ. of Vienna, April, and ICASE Technical Report.

Blasko R.: A Systematic Strategy for Performance Prediction by Improvement of Parallel Programs.
Proc. CAST Fourth International Workshop on Computer Aided Systems Technology, Univ. of Ottawa, Ottawa, Ontario, Canada, p., May.

Chapman B.M., Mehrotra P., Zima H.P.: Extending HPF for Advanced Data Parallel Applications.
IEEE Magazine on Parallel and Distributed Technology, pp., Fall.
Also Technical Report TR, Institute for Software Technology and Parallel Systems, Univ. of Vienna, May.

Fahringer T.: Automatically Estimating Network Contention of Parallel Programs.
Proc. International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, Vienna, Austria, May.

Fahringer T.: Using the P3T to Guide the Parallelization and Optimization Effort under the Vienna Fortran Compilation System.
IEEE Proc. Scalable High Performance Computing Conference, Knoxville, TN, May.

Chapman B.M., Mehrotra P., Zima H.P.: High Performance Fortran: Current Status and Future Directions.
Invited Paper, Proc. International Advanced Workshop on High Performance Computing: Technology and Applications, Cetraro, Italy, June.

Vajtersic M.: Two Classes of Efficient Hypercube Orderings for SVD.
Proc. International Conference on Applied Mathematics and Applications, World Scientific, Sofia, pp., August.

Blasko R.: Performance Analysis of Parallel Programs Based on Simulation.
Proc. ASU Conference, Prague, ASU Stockholm, pp., September.

Vajtersic M.: Some Examples of Massively Parallel Numerical Algorithms.
Proc. International Conference on Parallel Processing and Applied Mathematics, Technical Univ. of Czestochowa, Czestochowa, pp., September.

Chapman B.M., Mehrotra P., Van Rosendale J., and Zima H.P.: A Software Architecture for Multidisciplinary Applications: Integrating Task and Data Parallelism.
Proc. CONPAR/VAPP VI, pp., Linz, Austria, September.
Also Technical Report TR, Institute for Software Technology and Parallel Systems, Univ. of Vienna, March.

Pantano M., Zima H.P.: Performance Analysis of Parallelized Programs Using Workload Characterization Techniques.
Proc. AICA Annual Conference, Palermo, Italy, pp., September.

Blasko R.: Automatic Modeling and Performance Analysis of Parallel Processes by PEPSY.
Fortschritte in der Simulationstechnik, Band, Kampe G., Zeitz M. (Eds.), ASIM Symposium, Stuttgart, Vieweg, pp., October.

Chapman B.M., Mehrotra P., Zima H.P.: Why High Performance Fortran is not Useful for Advanced Numerical Applications: Directions for Future Developments.
Invited Paper, Furnari M.M. (Ed.), Proc. Second International Workshop on Massive Parallelism: Hardware, Software and Applications, Capri, Italy, pp., October.

Chapman B.M., Zima H., Pantano M.: Compiler Technology for Scalable Parallel Architectures: A Short Overview.
Proc. Sixth ECMWF Workshop on the Use of Parallel Processors in Meteorology: Coming of Age, Editors Geerd-R. Hoffmann and Norbert Kreitz, pp., Reading, UK, November.

Chapman B.M., Mehrotra P., Van Rosendale J., Zima H.P.: Extending Vienna Fortran With Task Parallelism.
Proc. International Conference on Parallel and Distributed Systems (ICPADS), Hsinchu, Taiwan, R.O.C., December.

Francomano E., Tortorici Macaluso A., Vajtersic M.: Implementation Analysis of Fast Matrix Multiplication Algorithms on Shared Memory Computers.
Computers and Artificial Intelligence, pp.

Vajtersic M.: Algorithms for Massively Parallel Matrix-Product Computations.
Proc. Conference on Parallel and Distributed Computing, Univ. of Kuwait, pp., March.

Chapman B.M., Pantano M., Zima H.P.: Supercompilers for Massively Parallel Architectures.
Proc. Aizu International Symposium on Parallel Algorithms/Architecture Synthesis (pAs), Aizu-Wakamatsu, Fukushima, Japan, pp., March.

Blasko R.: Hierarchical Performance Prediction for Parallel Programs.
Proc. International Symposium and Workshop on Systems Engineering of Computer Based Systems, Tucson, USA, IEEE, pp., March.

Vajtersic M.: Special Block-Five-Diagonal System Solvers for the VLSI Parallel Model.
Proc. International Workshop Parallel Numerics, Massimo Zaccaria, Naples, pp.

Blasko R.: Simulation Based Performance Prediction by PEPSY.
Proc. Annual Simulation Symposium, Phoenix, USA, IEEE, pp., April.

Di Martino B., Chapman B.M.: Program Comprehension Techniques to Improve Automatic Parallelization.
Proc. Workshop on Automatic Data Layout and Performance Prediction (ADL), Rice Univ., Houston, USA, April.

Ujaldon M., Zapata E.L., Chapman B.M., Zima H.P.: New Data Parallel Language Features for Sparse Matrix Computations.
Proc. International Parallel Processing Symposium (IPPS), Santa Barbara, California, April.
Also Technical Report, Institute for Software Technology and Parallel Systems, Univ. of Vienna, April.

Ujaldon M., Zapata E.L., Chapman B., Zima H.P.: Data-Parallel Computation for Sparse Codes: A Survey and Contributions.
Third Workshop on Languages, Compilers and Runtime Systems for Scalable Computers. Kluwer Academic Publishers, in: Languages, Compilers and Runtime Systems for Scalable Computers, Chapter, B.K. Szymanski and B. Sinharoy (Eds.), Boston, Massachusetts.

Blasko R.: Prediction of Static and Dynamic Characteristics of Parallel Processes.
Proc. International Symposium on System Modelling and Control, Zakopane, Vol., pp., May.

Brezany P., Mück T., Schikuta E.: Language, Compiler and Parallel Database Support for I/O Intensive Applications.
Proc. HPCN Europe, Milan, Italy, Springer-Verlag, pp., May.

Vajtersic M.: High-Performance VLSI Model Elliptic Solvers.
Proc. HPCN Europe Conference, Lecture Notes in Computer Science, Springer-Verlag, pp., May.

Brezany P., Cheron O., Sanjari K., van Konijnenburg E.: Processing Irregular Codes Containing Arrays with Multi-Dimensional Distributions by the PREPARE HPF Compiler.
Proc. HPCN Europe, Milan, Italy, Springer-Verlag, pp., May.

Andre P., Brezany P., Cheron O., Denissen W., Sanjari K., Pazat J.: A New Compiler Technology for Handling HPF Data Parallel Constructs.
Proc. Workshop on Languages and Runtime Systems for Scalable Computers, Troy, USA, Kluwer, May.

Benkner S.: Handling Block-Cyclic Distributed Arrays in Vienna Fortran.
Proc. International Conference on Parallel Architectures and Compilation Techniques (PACT), Limassol, Cyprus, June.

Brezany P., Sipkova V., Chapman B.M., Greimel R.: Automatic Parallelization of the AVL FIRE Benchmark for a Distributed-Memory System.
Workshop on Applied Parallel Computing in Physics, Chemistry and Engineering Science, Denmark, August.

Krajcovic D., Lucka M., Viktorinova E.: QR Factorization of Block Banded Matrix on Distributed Memory System.
Proc. Parallel Numerics, Sorrento, Italy, September.

Benkner S.: Vienna Fortran: An Advanced Data Parallel Language.
Proc. International Conference on Parallel Computing Technologies (PaCT), St. Petersburg.
Lecture Notes in Computer Science, pp., Springer Verlag, September.

Chapman B.M., Mehrotra P., Zima H.P.: Languages and Tools for Parallel Scientific Computing.
Invited Paper, Proc. Seminar on Current Trends in Theory and Practice of Informatics, Milovy, Czech Republic, November/December.
Lecture Notes in Computer Science (LNCS), Springer Verlag, November.

Fahringer T.: Estimating and Optimizing Performance for Parallel Programs.
IEEE Computer, Vol., No., pp., November.

Brezany P., Sipkova V., Chapman B.M., Greimel R.: Automatic Parallelization of the AVL FIRE Benchmark for a Distributed-Memory System.
Second Workshop on Applied Parallel Computing (PARA), Lyngby, Denmark, August. To appear in Lecture Notes in Computer Science, Springer Verlag, December.

Di Martino B., Chapman B.M., Iannello G., Zima H.P.: Integration of Program Comprehension Techniques into the Vienna Fortran Compilation System.
Proc. International Conference on High Performance Computing, New Delhi, India, December.

Calzarossa M., Massari L., Merlo A., Pantano M., Tessera D.: MEDEA: A Tool for Workload Characterization of Parallel Systems.
IEEE Parallel and Distributed Technology Magazine, Winter.

Andel S., Di Martino B., Hulman J., Zima H.P.: Program Comprehension Support for Knowledge-Based Parallelization.
Proc. Euromicro Workshop on Parallel and Distributed Processing, Braga, Portugal, IEEE Press, to appear January.

Di Martino B., Kessler C.W.: Program Comprehension Engines for Automatic Parallelization: A Comparative Study.
First Int. Workshop on Software Engineering for Parallel and Distributed Systems, Berlin, Germany, to appear March.

Lucka M., Vajtersic M., Viktorinova E.: Massively Parallel Poisson and QR Factorization Solvers.
Computers & Mathematics with Applications, to appear.

Pantano M., Zima H.: An Integrated Environment for the Support of Automatic Compilation.
Proc. High Performance Computing: Technology and Applications, pp., Editors L. Grandinetti, G.R. Joubert, J.J. Dongarra, J. Kowalik, Elsevier Science.

Chapman B.M., Mehrotra P., Zima H.P.: Vienna Fortran and the Path Towards a Standard Parallel Language.
Invited paper. In: M. Shimasaki, H. Sato (Eds.), Proc. International Symposium on Parallel and Distributed Supercomputing, Fukuoka, Japan, pp., September.

Ujaldon M., Zapata E.L., Chapman B.M., Zima H.P.: Vienna Fortran/HPF Extensions for Sparse and Irregular Problems and Their Compilation.
Submitted to IEEE Transactions on Parallel and Distributed Systems.
Technical Report TR, Institute for Software Technology and Parallel Systems, Univ. of Vienna, October.

Technical Reports and Other Publications

Chapman B.M., Benkner S., Blasko R., Brezany P., Egg M., Fahringer T., Gerndt H.M., Hulman J., Knaus B., Kutschera P., Moritsch H., Schwald A., Sipkova V., Zima H.P.: VIENNA FORTRAN Compilation System, Version: Users Guide.
Technical Report, Institute for Software Technology and Parallel Systems, Univ. of Vienna, January.

Chapman B.M., Mehrotra P., Zima H.P.: An Alternative Model for Distribution and Alignment in High Performance Fortran Without Templates.
Deliverable, ESPRIT Project PPPE, April.

Vajtersic M.: Parallel VLSI Algorithms for Solving The Poisson and Biharmonic Problem.
Report, RIST Institute, Univ. of Salzburg.

Benkner S., Brezany P., Zima H.P.: Functional Specification of the PREPARE Parallelization Engine.
Deliverable D, ESPRIT Project PREPARE, June.
Also Research Report of the PREPARE Project, June.

Benkner S., Brezany P., Zima H.P.: Program Normalization.
Viennanorm rel., The PREPARE Consortium, June.

Brezany P., Sanjari K.: Examples of Program Transformations Specified in PUMA.
Research Report of the PREPARE Project, July.

Chapman B.M., Moritsch H., Zima H.P.: Dynamic Data Distributions in Vienna Fortran: Language Features and Runtime Support.
Deliverable Da, ESPRIT Project PPPE, July.

Chapman B.M., Fahringer T., Zima H.: Automatic Support for Data Distribution on Distributed Memory Multiprocessor Systems.
Technical Report TR, Institute for Software Technology and Parallel Systems, Univ. of Vienna, August.

Brezany P.: Compiling FORTRAN for Massively Parallel Computers.
Proc. MSC European Users Conference, Vienna, September.

Fahringer T.: Automatic Performance Prediction for Parallel Programs on Massively Parallel Computers.
Technical Report TR, Institute for Software Technology and Parallel Systems, Univ. of Vienna, September.

Brezany P.: Analysis Requirements from the Parallelization Engine.
Research Report of the PREPARE Project, November.

Brezany P.: Users Support for Parallelizing Optimizations: Examples.
Research Report of the PREPARE Project, November.

Benkner S., Brezany P.: Functional Specification of Basic Parallelization Scheme.
Research Report of the PREPARE Project, November.

Rippel R.: A Concept to Realize The Recognition Of Patterns In Fortran or Vienna Fortran.
Proc. Workshop for Parallel Processing, Technical Report, Technical Univ. of Clausthal, January.

Benkner S., Brezany P.: Functional Specification of Parallelization Engine for Basic Prepare Fortran, Part: Basic Parallelization Scheme.
Research Report of the PREPARE Project, February.

Brezany P., Sanjari K., Sipkova V.: Functional Specification of Parallelization Engine for Basic Prepare Fortran, Part: Optimizations.
Research Report of the PREPARE Project, February.

Brezany P., Sanjari K.: Test Suite for Exercising Transformation Catalogue.
Research Report of the PREPARE Project, February.

Wender B., Zima H.P.: The Vienna Fortran Abstract Machine.
Deliverable DZ, CEI Project PACT, April.

Wender B.: VFCS Instrumentation for Integration of MEDEA.
Deliverable DZ, CEI Project PACT, April.

Brezany P., Gerndt M., Sipkova V.: Shared Virtual Memory Support in the Vienna Fortran Compilation System.
Technical Report, KFA Jülich, Germany, April.

Chapman B.M., Moritsch H., Zima H.P.: Dynamically Distributed Arrays: Specification of the Compilation Method.
Deliverable DZ, CEI Project PACT, April.

Brezany P., Chapman B.M., Ponnusamy R., Sipkova V., Zima H.P.: Study of Application Algorithms With Irregular Distributions.
Deliverable DZ, CEI Project PACT, April.

Benkner S.: Vienna Fortran Version: Language Overview.
Deliverable a, ESPRIT Project PPPE, August.

Brandes T., Chapman B.M., Greimel R., Lonsdale G., Wollenweber F., Zima H.P.: HPF Extensions for Advanced Applications in the PPPE Project.
Deliverable, ESPRIT Project PPPE, September.

Vajtersic M.: Some Approaches How To Multiply Matrices On Massively Parallel Meshes.
Proc. International Workshop Parallel Numerics, Slovak Academy of Sciences, Smolenice, pp., September.

Andel S., Hulman J.: An Experiment with an Expert System Shell.
Deliverable DZ, CEI Project PACT, October.

Brezany P., Gerndt M., Sipkova V.: Study of Vienna Fortran/HPF Interface with Virtual Shared Memory Architectures.
Deliverable DZ, CEI Project PACT, October.

Lucka M., Paul M.: Validation Suite and Automatic System to Exercise Integration of the PREPARE HPF Compiler System.
Deliverable, ESPRIT Project PREPARE, October.

Chapman B.M., Pantano M., Zima H.: Preliminary Overall Design of a Second Generation Restructuring System for Distributed Memory Machines.
Deliverable d, ESPRIT Project PPPE, January.

Andel S., Hulman J., Di Martino B., Zima H.P.: Design of Knowledge Base for Program Patterns (KBP).
Deliverable DZ, CEI Project PACT, April.

Di Martino B.: Pattern Matching Techniques.
Deliverable DZ, CEI Project PACT, April.

Di Martino B.: Design of a Pattern Matching Expert (PMX).
Deliverable DZ, CEI Project PACT, April.

Mehofer E.: Dynamic Data Distribution in Vienna Fortran.
Deliverable DZ, CEI Project PACT, April.

Moritsch H.: The Implementation of Dynamic Data Distributions in VFCS: Language, Compile and Runtime Support.
Deliverable Df, ESPRIT Project PPPE, April.

Andel S., Hulman J.: Automatic Data Distribution.
Final Report of FWF Project P, May.

Di Martino B.: Design of a Tool for Automatic Recognition of Parallelizable Algorithmic Patterns and Description of its Prototype Implementation.
Tech. Rep., Institute for Software Technology and Parallel Systems, Univ. of Vienna, May.

Andel S., Chapman B.M., Hulman J.: Expert System Tool for Decision Making.
Deliverable PPPE De, ESPRIT Project PPPE, June.

Lucka M., Paul M.: New VFCS Testing Tool XVRTE.
Institute for Software Technology and Parallel Systems, Univ. of Vienna, Internal Report, June.

Moritsch H.: Implementation of Dynamic Data Distributions in the Vienna Fortran Compilation System.
Deliverable DZ, CEI Project PACT, July.

Pantano M., Chapman B.M., Di Martino B., Zima H.P.: Final Overall Design of a Second Generation Restructuring System for Distributed Memory Machines.
Deliverable Dg, ESPRIT Project PPPE, July.

Andel S., Chapman B.M., Hulman J., Zima H.P.: An Expert Advisor for Parallel Programming Environments and Its Realization within the Framework of the Vienna Fortran Compilation System.
Technical Report TR, Institute for Software Technology and Parallel Systems, Univ. of Vienna, September.

Brezany P., Sipkova V.: Implementation of Indirect Distributions within VFCS.
Deliverable Di, ESPRIT Project PPPE, September.

Wender B.: Porting the VFCS to the Meiko CS.
Deliverable Dh, ESPRIT Project PPPE, September.

Andel S., Hulman J.: Implementation of an Automatic Data Distribution Tool.
Deliverable DZ, CEI Project PACT, October.

Andel S., Hulman J.: Implementation of KBP.
Deliverable DZ, CEI Project PACT, October.

Andel S., Di Martino B., Hulman J.: Integration of PMX and KBP.
Deliverable DZ, CEI Project PACT, October.

Di Martino B.: Implementation of a Pattern Matching Expert (PMX).
Deliverable DZ, CEI Project PACT, October.

Brezany P., Sipkova V.: Implementation of Irregular Data Distributions in Vienna Fortran.
Deliverable DZ, CEI Project PACT, October.

Brezany P., Sipkova V., Wender B.: Implementation of the Vienna Fortran HPF Interface for a Particular Virtual Shared Memory Architecture.
Deliverable DZ, CEI Project PACT, October.

Chapman B.M., Zima H.P., Haines M., Mehrotra P., Van Rosendale J.: OPUS: A Coordination Language for Multidisciplinary Applications.
Technical Report TR, Institute for Software Technology and Parallel Systems, Univ. of Vienna, October.

Editorial Activities

M. Vajtersic

Editorial Board Member of Parallel Processing Letters, World Scientific, Singapore

Editorial Board Member of Parallel Algorithms and Applications, Gordon and Breach

Editorial Board Member of Computers and Artificial Intelligence, Slovart, Slovakia

Editor of Proc. Software Engineering for Parallel Real-Time Systems, Smolenice

Editor of Proc. Algorithms and Software for Parallel Computer Systems, Smolenice

Editor of Proc. Parallel Numerics, Smolenice

Editor of Proc. Parallel Numerics, Sorrento

H.P. Zima

Editor of the Series "Internationale Computer Bibliothek", published by Addison-Wesley Germany

Associate Editor (Compiler Technology) of Scientific Programming, published by John Wiley

Member, Editorial Board of Concurrency: Practice and Experience, published by John Wiley

Member, Editorial Board of Parallel Processing Letters

Foundation Editor of The Journal of Universal Computer Science (J.UCS)

Program Committee Memberships

T. Fahringer

ACM International Conference on Supercomputing, Barcelona, Spain, July

M. Vajtersic

PPAM, Czestochowa, Poland

Numerical Methods and Applications, Sofia, Bulgaria

PARCELLA, Potsdam, Germany

H.P. Zima

Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), San Diego, CA, May

ACM International Conference on Supercomputing (ICS), Tokyo, Japan, July

Parallel Computing (ParCo), Grenoble, France, September

Working Conference on Massively Parallel Programming Models (MPPM), Berlin, Germany, September

AICA, Parallel and Distributed Architectures and Algorithms, Lecce, Italy, September

Second International Conference of the Austrian Center for Parallel Computation, Gmunden, Austria, October

International Symposium on Parallel Algorithm/Architecture Synthesis (pAs), Aizu-Wakamatsu, Japan, March

IEEE Ninth International Parallel Processing Symposium (IPPS), Santa Barbara, California, April

Third Workshop on Languages, Compilers and Runtime Systems for Scalable Computers, Troy, New York, May

Parallel Computing Technologies (PaCT), St. Petersburg, Russia, September

Parallel Computing (ParCo), Gent, September

Working Conference on Massively Parallel Programming Models (MPPM), Berlin, Germany, October

Habilitation Thesis

Vajtersic M.: Numerical Algorithms for Some Parallel Computer Architectures, January

PhD Theses

Fahringer T.: Automatic Performance Prediction for Parallel Programs on Massively Parallel Computers, October

Pantano M.: Integrazione di Tecniche e Strumenti per la Caratterizzazione del Carico in un Compilatore per Sistemi Paralleli (Integration of Techniques and Tools for Workload Characterization in a Compiler for Parallel Systems), Univ. of Pavia

Benkner S.: Vienna Fortran and its Compilation, September

Chapter

Lectures and Research Visits

Lectures

S. Benkner

Functional Specification of the Prepare Parallelization Engine
PREPARE Review Meeting, STERIA, Paris-Velizy, France, June

Compiling HPF in the PREPARE Environment
Int. Workshop on Compilers for Parallel Computers, TU Delft, The Netherlands, December

Processing Array Statements and Procedure Interfaces in the Prepare High Performance Fortran Compiler
International Conference on Compiler Construction, Edinburgh, UK, April

Vienna Fortran and its Compilation
Advanced Course on Languages, Compilers and Programming Environments for Scalable Parallel Computers, Vienna, Austria, July

Handling Block-Cyclic Distributed Arrays in Vienna Fortran
International Conference on Parallel Architectures and Compilation Techniques, Limassol, Cyprus, June

Advanced Compilation Strategies for HPF
Advanced Course on Languages, Compilers and Programming Environments for Scalable Parallel Computers, Vienna, Austria, July

Vienna Fortran: An Advanced Data Parallel Language
International Conference on Parallel Computing Technologies, St. Petersburg, Russia, September

R. Blasko

Parameterization and Abstract Representation of Parallel Fortran Programs for Performance Analysis
AICA Conference on Parallel and Distributed Architectures and Algorithms, Gallipoli (Lecce), Italy, September

Process Graph and Tool for Performance Analysis of Parallel Processes
The IMACS Symposium on Mathematical Modeling MATHMOD, Vienna, Austria, February

A Systematic Strategy for Performance Prediction by Improvement of Parallel Programs
CAST Fourth International Workshop on Computer Aided Systems Technology, Univ. of Ottawa, Ottawa, May

Performance Analysis of Parallel Programs Based on Simulation
The ASU Conference, Prague, September

Automatic Modeling and Performance Analysis of Parallel Processes by PEPSY
ASIM Symposium, Stuttgart, October

Performance Prediction for Parallel Programs Based on Simulation
ACPC Workshop, Keutschach, Austria, November

Hierarchical Performance Prediction for Parallel Programs
The International Symposium and Workshop on Systems Engineering of Computer Based Systems, Tucson, USA, March

Simulation Based Performance Prediction by PEPSY
Annual Simulation Symposium, Phoenix, USA, April

Prediction of Static and Dynamic Characteristics of Parallel Processes
Proc. International Symposium on System Modelling and Control, Zakopane, May

P. Brezany

The State of the Art in Parallel Computing
Tutorial at SUPEUR (Supercomputing in Europe), Vienna, September

Compiling FORTRAN for Massively Parallel Computers
Invited Talk at the MSC European Users Conference, Vienna, Austria, September

Language, Compiler and Runtime Support for Irregular Computations
Univ. of Liverpool, England, February

Compiling HPF Languages
Technical Univ. Munich, Germany, February

PREPARE High Performance Fortran Compiler
Purdue Univ., USA, May

Compiling Irregular Codes in the PREPARE HPF Compiler
Univ. of Maryland, USA, May

PREPARE High Performance Fortran Compiler
Rensselaer Polytechnic Institute, Troy, USA, June

Compiling Irregular Codes in the PREPARE HPF Compiler and in the Vienna Fortran Compilation System
Carnegie Mellon Univ., Pittsburgh, USA, June

Compiling Irregular and Dynamic Problems
Advanced Course on Languages, Compilers and Programming Environments for Scalable Parallel Computers, Vienna, Austria, July

Issues in Language and Compiler Support for Parallel I/O
Talk, Workshop of the CEI Project PACT, Vienna, Austria, September

Processing Irregular Codes Containing Arrays with Multi-Dimensional Distributions by the PREPARE HPF Compiler
HPCN Europe, Milan, Italy, May

Program Analysis for Automatic Parallelization of Irregular and Dynamic Problems
DIKU, Univ. of Copenhagen, Denmark, August

Automatic Parallelization of the AVL FIRE Benchmark for a Distributed-Memory System
Second Workshop on Applied Parallel Computing (PARA), Lyngby, Denmark, August

Automatic Parallelization of Scientific Applications for Distributed Memory Architectures
Academic Computer Centre CYFRONET, Cracow, Poland, October

B.M. Chapman

Dynamic Data Distributions for Load Balancing
ACPC Workshop, Vienna, Austria, March

Data Distribution for Load Balancing in Vienna Fortran
International Workshop on Languages and Compilers for Parallel Computing, Jerusalem, Israel, June

Automatic Parallelization for Distributed-Memory Systems
Tutorial (jointly with H. Zima), PARLE, Munich, Germany, June

Parallel Programming in Vienna Fortran
Computer Science Colloquium, ETH Zurich, Switzerland, June

Automatic Support for Data Distribution on Distributed Memory Multiprocessor Systems
Sixth Annual Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon, USA, August

Automatic Compilation Techniques in VFCS (with demo)
ICASE, NASA Langley, September

Coding Applications in Vienna Fortran
Computer Science Seminar, Chemnitz, Germany, November

Data Distributions for Parallelizing Fortran Codes
Parallel Computing Colloquium, Univ. of Southampton, December

A Parallelization Environment for Interactive Improvement of Fortran Programs
Fourth Workshop on Compilers for Parallel Computers, Delft, Netherlands, December

Language Support for Irregular and Dynamic Data Mappings
HPFF Kick-Off Meeting, Houston, Texas, USA, January

Tutorial: High Performance Fortran and Irregular Codes
INRIA, Sophia Antipolis, France, January

Automatische Parallelisierung: Möglichkeiten und Herausforderungen (Automatic Parallelization: Possibilities and Challenges)
Invited Talk, GI/ITG PARS Workshop, Koblenz, Germany, June

Applications in Science and Engineering
The Vienna Fortran Compilation System: Research Issues in Compilers
Advanced Course on Languages, Compilers and Programming Environments for Scalable Parallel Computers, Vienna, Austria, July

A Software Architecture for Multidisciplinary Applications: Integrating Task and Data Parallelism
CONPAR/VAPP VI, Linz, Austria, September

HPF for Production Codes: Status and Experiences
RAPS Workshop, Toulouse, France, October

Beyond Current HPF
GMD HPF Workshop, St. Augustin, Germany, December

Vienna Fortran Language Features for HPF
Annual Workshop, High Performance Fortran Forum, Houston, Texas, USA, January

Making Parallel Processing Work: The Agenda of the VCPC
CRIM, Montreal, Canada, February

Extensions to HPF for Advanced Applications
VII Las Palmas Seminar on Computer Sciences, Las Palmas, Spain, April

Technology Transfer in High Performance Computing: Activities of the VCPC
Opening Ceremony of the VCPC, Kleiner Festsaal, Vienna, Austria, June

One-Week Course in HPF
High Performance Fortran Summer School, Bergen, Norway, June

Beyond HPF: The Language Extensions for HPF
Workshop on Compilers for Parallel Computers, Malaga, Spain, June

Programming Environments for HPF
Workshop on Compilers for Parallel Computers, Malaga, Spain, June

Applications in Science and Engineering
Advanced Course on Languages, Compilers and Programming Environments for Scalable Parallel Computers, University of Vienna, July

Experiences with HPF at Vienna
High Performance Fortran Workshop, ICASE, NASA Langley, Hampton, Virginia, USA, August

Vienna Fortran: Features to Extend High Performance Fortran
EPCC Annual Seminar, Edinburgh, Scotland, September

Vienna Fortran Extensions for HPF
Keynote address, High Performance Fortran Workshop, Southampton, England, October

The Future of High Performance Fortran
GMD NEC Workshop on Scientific Parallel Computing, GMD, Germany, October

The Future Development of High Performance Fortran
IDRIS Seminar, IDRIS, Paris, France, November

Vienna Fortran Language Features for HPF
State University of New York at Albany, New York, USA, December

The Vienna Fortran Compilation System as a Prototype HPF Programming Environment
Rensselaer Polytechnic Institute, Troy, New York, USA, December

B. Di Martino

Program Comprehension Techniques to Improve Automatic Parallelization
Workshop on Automatic Data Layout and Performance Prediction, Rice Univ., Houston, USA, April

Paradigms for the Parallelization of Branch-and-Bound Algorithms
Workshop on Applied Parallel Computing in Physics, Chemistry and Engineering Science, Lyngby, DK, August

Parallelization of Neural Network Algorithms
Euro PVM Users Group Meeting, Lyon, France, September

Integration of Program Comprehension Techniques into the Vienna Fortran Compilation System
Conference on High Performance Computing, New Delhi, India, December

T. Fahringer

A Static Parameter Based Performance Prediction Tool for Parallel Programs
ACM International Conference on Supercomputing, Tokyo, Japan, July

Automatic Cache Performance Prediction in a Parallelizing Compiler
AICA, Lecce, Italy, September

Automatically Estimating Network Contention of Parallel Programs
Vienna, Austria, May

Using the P3T to Guide the Parallelization and Optimization Effort under the Vienna Fortran Compilation System
Knoxville, TN, USA, May

The P3T: A Performance Estimator for HPF Programs
Vienna, Austria, July

P3T: An Automatic Performance Estimator for Parallel Programs
International Workshop on Automatic Data Layout and Performance Prediction, Rice Univ., Houston, Texas, USA, April

On the Utility of Threads for Data Parallel Programming
ACM International Conference on Supercomputing, Barcelona, Spain, July

M. Lucka

A Massively Parallel Poisson Solver
International Conference on Numerical Methods, Miskolc, August

QR Factorization of Block Banded Matrix on Distributed Memory System
Int. Workshop Parallel Numerics, Sorrento, Italy, September

Calibration of METEOSAT Images on the Parallel Computer INTEL iPSC
Int. Workshop Parallel Numerics, Sorrento, Italy, September

H. Moritsch

Parallel Program Development with the VFCS
ACPC Meeting, Hafnersee, November

M. Pantano

Performance Analysis of Parallelized Programs Using Workload Characterization Techniques
AICA Annual Conference, Palermo, Italy, September

Software Tools in an Integrated Programming Environment
Advanced Course on Languages, Compilers and Programming Environments for Scalable Parallel Computers, Vienna, Austria, July

R. Rippel

Optimizing Parallelization by Pattern Matching
Workshop on Parallel Processing, Lessach, Austria, September

M. Vajtersic

Some Developments of Parallel Algorithms

Workshop Software Engineering for Parallel Real-Time Systems, Smolenice, Slovakia, March

Hypercube Parallel Orderings for the SVD Computations

Workshop Algorithms and Software for Parallel Computer Systems, Smolenice, Slovakia, March

A Fast Multiplication of Matrices on a Parallel Supercomputer Array

Workshop Feinkörnige und Massive Parallelität, Dresden, Germany, April

Some Linear Algebra Algorithms for a Massively Parallel System

Dept. of Informatics, Univ. of Milan, Italy, May

Developments of Parallel Algorithms: Some Principles

Dept. of Systems and Informatics, Univ. of Bologna, Italy, June

Massively Parallel Algorithms: Some Numerical Examples

Polytechnic Univ. of Catalonia, Department of Computer Architecture, Barcelona, Spain, February

Some Developments of Algorithms for Massively Parallel Arrays

Centre d'Estudis Avancats de Blanes, Girona, Spain, February

Massively Parallel Algorithms

Univ. of Patras, Electronics Laboratory, Patras, Greece, July

Two Classes of Efficient Hypercube Orderings for SVD

Invited presentation, Int. Conf. Advances in Numerical Methods and Applications, Sofia, Bulgaria, August

Some Examples of Massively Parallel Numerical Algorithms

Keynote address, Int. Conf. Parallel Processing and Applied Mathematics, Czestochowa, Poland, September

Some Approaches How to Multiply Matrices on Massively Parallel Meshes

Int. Workshop Parallel Numerics, Smolenice, Slovakia, September

Hypercube Jacobi-like Algorithms

International Workshop Algorithms for Future Technologies, Warsaw, Poland, October

Issues in Parallel Algorithm Design

Department of Mathematics, Istvan Szechenyi College, Gyor, Hungary, November

Algorithms for Massively Parallel Matrix Product Computations

International Conference on Parallel and Distributed Computing, Kuwait, March

High-Performance VLSI Model Elliptic Solvers

Congress High Performance Computing and Networking (HPCN Europe), Milan, Italy, May

Numerical Algorithms for Some Parallel Computer Architectures

Habilitation lecture, Department of Mathematics, Univ. of Salzburg, June

Some Examples of Parallel Algorithms for Massively Parallel SIMD and MIMD Machines

Kolloquium lecture, Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Germany, June

Massively Parallel Numerical Solvers

Univ. of Pisa, Department of Informatics, Pisa, Italy, July

Development of a Class of Parallel Orderings for Hypercubes and Their Improvement to
Other Topologies

IRISA, Rennes, France, September

Special Block-Five-Diagonal System Solvers for the VLSI Parallel Model

September

H. P. Zima

Vienna Fortran: Eine Spracherweiterung für Multiprozessorsysteme mit verteiltem Speicher

Computer Science Seminar, Univ. of Klagenfurt, Austria, January

Vienna Fortran and its Applications

Computer Science Colloquium, Rensselaer Polytechnic Institute, Troy, New York, February

Application Programming in Vienna Fortran

Computer Science Seminar, Univ. of Maryland, February

The Vienna Fortran Compilation System

International Workshop on Algorithms and Software for Parallel Computer Systems, Smolenice, Slovakia,
March

Programming Complex Applications in Vienna Fortran

International Workshop on Algorithms and Software for Parallel Computer Systems, Smolenice, Slovakia,
March

High Performance Fortran Without Templates: A New Model for Data Distribution and
Alignment

Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego,
California, May

Vienna Fortran

Tutorial Lecture, ERCIM Advanced Course on Programming High Performance Computers, Paris, France,
May

The Vienna Fortran Compilation System: A Second Generation Compiler for Massively
Parallel Systems

International Workshop on Languages and Compilers for Parallel Computing, Jerusalem, Israel, May

Automatic Parallelization for Distributed-Memory Systems

Tutorial (jointly with Barbara M. Chapman), Parallel Architectures and Languages (PARLE), Munich,
Germany, June

Vienna Fortran: A Second Generation Compiler for Massively Parallel Systems

Invited Lecture, ParCo, Grenoble, France, September

Massively Parallel Architectures and Their Programming Paradigms: Recent Developments

Invited Lecture, AICA Annual Conference, Gallipoli, Italy, September

Fortran for Parallel Environments: High Performance Fortran and Vienna Fortran

Invited Lecture, SUPEUR, Vienna, Austria, September

Automatic Parallelization for Distributed-Memory Systems: Experiences and Recent Research

Main Lecture, EUROARCH, European Informatics Congress Computing System Architectures, Munich,
Germany, October

Languages and Compilers for Massively Parallel Machines

Keynote Address, ECUC European Convex Users Conference, Bilbao, Spain, October

High Performance Fortran Languages: Applications and Implementation (together with Barbara
Chapman)

Scientific Parallel Computing Seminar, Chemnitz, Germany, November

Dynamic Data Distributions in Vienna Fortran

Supercomputing, Portland, Oregon, November

High Performance Fortran Languages and Their Implementation

Invited Lecture, Linköping Univ., Sweden, November

High Performance Fortran Languages and Their Implementation

Invited Lecture, Umeå Univ., Sweden, November

High Performance Fortran Languages and Their Compilation

Keynote Lecture, Euromicro Workshop on Parallel and Distributed Processing, Malaga, January

Handling Dynamic and Irregular Problems in Vienna Fortran

Seminar, Department of Computer Science and Engineering, Univ. of California, San Diego, February

High Performance Fortran: Current Limitations and Future Directions

Computer Science Seminar, IBM Toronto, April

High Performance Fortran Languages: Advanced Applications and Their Implementation

Invited Lecture, High Performance Computing and Networking Europe (HPCN Europe), Munich,
Germany, April

Handling Irregular and Dynamic Problems in the Vienna Fortran Compilation System

Keynote Address, First International Conference on Massively Parallel Computing Systems (MPCS),
Ischia, Italy, May

Handling Irregular and Dynamic Problems in Vienna Fortran

Invited Lecture, Summer School on Parallel Computation in Computational Fluid Dynamics, Delft,
Netherlands, June

Research at the Institute for Software Technology and Parallel Systems

First Joint Workshop Kyoto-Vienna, Vienna, Austria, June

High Performance Fortran: Current Status and Future Directions

Invited Lecture, International Advanced Workshop on High Performance Computing Technology and
Applications, Cetraro, Italy, June

Research Issues in Languages for High Performance Computing (together with Piyush Mehrotra)

Advanced Course on Languages, Compilers and Programming Environments for Scalable Parallel
Computers, Vienna, Austria, July

Basic Compilation and Optimization Strategies

Advanced Course on Languages, Compilers and Programming Environments for Scalable Parallel
Computers, Vienna, Austria, July

Why High Performance Fortran is not Useful for Advanced Numerical Applications: Directions
for Future Developments

Keynote Address, Second International Workshop on Massive Parallelism: Hardware, Software and
Applications, Capri, Italy, October

Vienna Fortran: Eine Sprache für die Programmierung von massiv parallelen Supercomputern

Computer Science Seminar, Univ. of Linz, Austria, October

Parallelisierung irregulärer Applikationen in Vienna Fortran

Kolloquium über Parallelverarbeitung in technisch-naturwissenschaftlichen Anwendungen, Jülich, Germany,
October

Vienna Fortran: Advanced Applications and Their Implementation

Computer Science Seminar, Univ. of Barcelona, November

Compiler Technology for Massively Parallel Architectures: State of the Art and Current
Research

Invited Lecture, Sixth Workshop on Use of Parallel Processors in Meteorology, Reading, England, November

Porting Advanced Numerical Applications to Massively Parallel Machines: A Research Agenda

Invited Lecture, International Workshop on General Purpose Massively Parallel Systems: The Experience
of the CNR Project, Milano, Italy, November

Compiling High-Performance Fortran Languages for Massively Parallel Machines

Tutorial, International Conference on Parallel and Distributed Systems (ICPADS), Hsinchu, Taiwan,
ROC, December

Extending Vienna Fortran With Task Parallelism

International Conference on Parallel and Distributed Systems (ICPADS), Hsinchu, Taiwan, ROC,
December

High Performance Languages

Tutorial, Aizu International Symposium on Parallel Algorithms/Architecture Synthesis (pAs),
Aizu-Wakamatsu, Fukushima, Japan, March

Supercompilers for Massively Parallel Architectures

Invited Lecture, Aizu International Symposium on Parallel Algorithms/Architecture Synthesis (pAs),
Aizu-Wakamatsu, Fukushima, Japan, March

Vienna Fortran: Language Implementation and Advanced Applications

Seminar Lecture, Department of Information Science, Kyoto Univ., Kyoto, Japan, March

Vienna Fortran and its Compilation

Las Palmas Seminar on Computer Sciences, Las Palmas, Gran Canaria, Spain, April

Data-Parallel Computation for Sparse Codes

Third Workshop on Languages, Compilers and Runtime Systems for Scalable Computers, Troy, New York,
May

Vienna Fortran and HPF: A Comparison

Department of Computer Science Colloquium, Univ. of Illinois at Urbana-Champaign, May

Languages and Compilers for Massively Parallel Machines

Opening Workshop of the VCPC, Univ. of Vienna, June

Integrating Task Parallelism and Data Parallelism in the Vienna Fortran Compilation System

Workshop on Compilers for Parallel Computers, Malaga, Spain, June

Advanced Language Features: The Path Towards HPF

Advanced Course on Languages, Compilers and Programming Environments for Scalable Parallel
Computers, Vienna, Austria, July

Advanced Compilation Strategies for HPF (together with Siegfried Benkner)

Advanced Course on Languages, Compilers and Programming Environments for Scalable Parallel
Computers, Vienna, Austria, July

Languages for Scalable Parallel Architectures

Invited Presentation, Fourth International Parallel Computing Workshop (PCW), London, England,
September

Vienna Fortran and the Path Towards a Standard Parallel Language

Invited Talk, International Symposium on Parallel and Distributed Supercomputing (PDSC), Fukuoka,
Japan, September

Vienna Fortran and the Path Towards a Standard Parallel Language

Cray Distinguished Lecture Series, Univ. of Minnesota, Minneapolis, October

Compiling Data Parallel Languages for Scalable Parallel Machines

Cray Distinguished Lecture Series, Cray Research Inc., Eagan, Minnesota, October

Vienna Fortran: Eine Sprache für die Programmierung von massiv parallelen
Hochleistungsrechnern

Kolloquium des Instituts für Computergraphik, Graz Univ. of Technology, Graz, Austria, November

Languages and Tools for Parallel Scientific Computing

Invited Lecture, Seminar on Current Trends in Theory and Practice of Informatics, Milovy, Czech Republic,
November

Research Visits

S. Benkner

PPPE Review Meeting, Southampton, UK, February

PREPARE Project Meeting, TU Munich, Germany, March

PPPE WP Meeting, NA Software, Liverpool, UK, May

PREPARE Project Meeting, GMD, Karlsruhe, Germany, May

PREPARE Review Meeting, STERIA, Paris-Velizy, France, June

PPPE Project Meeting, EMERAUDE, Paris, France, July

Supercomputing Conference (SC), Portland, Oregon, USA, November

Int. Workshop on Compilers for Parallel Computers, TU Delft, The Netherlands, December

International Conference on Compiler Construction, Univ. of Edinburgh, Scotland, April

High Performance Fortran Forum Meeting, Chicago, Illinois, USA, June

ACRI, Lyon, France, January

International Conference on Parallel Architectures and Compilation Techniques, Limassol, Cyprus, June

International Conference on Parallel Computing Technologies, St. Petersburg, Russia, September

GMD FIRST, Berlin, Germany, December

R. Blasko

Univ. of Ottawa, Ottawa, Canada, May

Univ. of Stuttgart, Stuttgart, Germany, October

SIEMENS AG, ZFE ST SN, Munich, Germany, October

Univ. of Arizona, Tucson, USA, March

P. Brezany

Univ. of Liverpool, England, February

TU Munich, Germany, February

Purdue Univ., USA, May

Univ. of Maryland, USA, May

Rensselaer Polytechnic Institute, Troy, USA, June

Carnegie Mellon Univ., Pittsburgh, USA, June

B. M. Chapman

ICASE, NASA Langley Research Center, Hampton, VA, USA, February, March

University of Maryland, February

Queen's Univ., Belfast, N. Ireland, April, June

NA Software, Liverpool, UK, May

European Centre for Medium-Range Weather Forecasting (ECMWF), Reading, UK, May

Emeraude, Paris, France, July

ICASE, NASA Langley Research Center, Hampton, VA, USA, August, September

Queen's Univ., Belfast, N. Ireland, November, December

Meiko, Bristol, UK, December

Electronics and Computer Science Dept., Univ. of Southampton, UK, December

ECMWF, Reading, UK, December

Simulog and INRIA, Sophia Antipolis, France, January

Queen's Univ., Belfast, Northern Ireland, May

ACE, Amsterdam, Netherlands, September

PALLAS, Brühl, Germany, September

METEO France, Toulouse, France, October

CERFACS, Toulouse, France, October

Meiko Scientific, Bristol, UK, October

Supercomputing, Washington, DC, USA, November

Workshop on Use of Parallel Processors in Meteorology, ECMWF, Reading, UK, November

Parallel Applications Centre, Southampton, UK, November

GMD, Germany, January

NA Software, Liverpool, January

ICASE, NASA Langley, Hampton, USA, February, March

Meteo France, Toulouse, France, April

HPF Forum Meeting, Denver, USA, May

Technical University of Graz, Austria, May

HPCN Europe Conference, Milan, May

HPF Forum Meeting, Denver, USA, July

ICASE, NASA Langley, Hampton, USA, August

Cray Research, Eagan, USA, October, November

HPF Forum Meeting, Denver, USA, November

ICASE, NASA Langley, Hampton, USA, November

NCF Workshop European Academic Supercomputing, The Hague, Netherlands, November

SARA Supercomputer Day, Amsterdam, Netherlands, December

Supercomputing, San Diego, USA, December

Meiko Scientific, Bristol, UK, December

T. Fahringer

ICASE, NASA Langley, VA, January

ICASE, NASA Langley, VA, July, October

I. Glendinning

FSP Workshop, Technical University of Graz, Austria, September

RAPS Workshop, Graz, Austria, May

National Hosts Conference, Vienna, Austria, November

Supercomputing, San Diego, USA, December

H. Moritsch

HPCE Preparatory Meeting, Heathrow, London, July

HPCN, Milano, Italy, May

National Hosts Conference, Vienna, Austria, November

R. Rippel

Univ. of Salzburg, November

E. Mehofer

EUROPA Standard Parallel C Meeting, Brussels, Belgium, November

M. Pantano

ICASE, NASA Langley Research Center, Hampton, VA, USA, November, December

Supercomputing, Portland, Oregon, USA, November

Univ. of Bonn, Germany, March

AICA Annual Conference, Palermo, Italy, September

International Workshop on General Purpose Massively Parallel Systems: The Experience of the CNR
Project, Milano, Italy, November

Univ. of Pavia, Italy, November

Univ. of Pavia, Italy, March

ESPRIT project PHAROS meeting, Univ. of Pavia, Italy, May

Technical Univ. Munich, Germany, June

G. Robinson

ESPRIT project PHAROS meeting, GMD, Germany, August

PALLAS GmbH, Bonn, Germany, August

TD European Workshop, EPFL, Geneva, Switzerland, September

CERN, Geneva, Switzerland, September

Meiko Ltd., Bristol, UK, September

ECMWF, Reading, UK, September

HPF Forum, Dallas, Texas, USA, September

ECMWF, Reading, UK, November

Supercomputing, San Diego, CA, USA, December

National Hosts Conference, Vienna, Austria, November (virtual presence via video link)

V. Sipkova

Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich, Germany, January

Department of Computer Science, Univ. of Maryland, College Park, Maryland, USA, February

PARA Workshop on Applied Parallel Computing in Physics, Chemistry and Engineering Science,
Denmark, August

M. Vajtersic

Univ. of Salzburg, Austria, April-May

Univ. of Milan, Italy, May

Univ. of Bologna, Italy, June

Univ. of Palermo, Italy, July

Department of Computer Architecture, Univ. of Catalonia, Spain, February

Department of Electronics, Univ. of Patras, Greece, July

Institute of Informatics, Univ. of Warsaw, Poland, October

Weierstrass Institute for Applied Mathematics and Stochastics, Berlin, Germany, June

Department of Mathematics, Univ. of Bologna, Italy, July

Department of Informatics, Univ. of Pisa, Italy, July

Department of Informatics, Univ. of Parma, Italy, July

IRISA, Rennes, France, September

B. Wender

Department for Computer Graphics and Parallel Processing, Univ. of Linz, Austria

HPCN Europe, Munich, Germany, April

CONVEX Computer GmbH, Munich, Germany, October

H. P. Zima

Univ. of Klagenfurt, Austria, January

Rensselaer Polytechnic Institute, Troy, New York, February

Univ. of Maryland, February

International Workshop on Algorithms and Software for Parallel Computer Systems, Smolenice, Slovakia,
March

Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego,
California, May

ERCIM Workshop, Paris, France, May

International Workshop on Languages and Compilers for Parallel Computing, Jerusalem, Israel, May

Parallel Architectures and Languages (PARLE), Munich, Germany, June

ICASE, NASA Langley Research Center, August-September

ParCo, Grenoble, France, September

AICA Annual Conference, Gallipoli, Italy, September

SUPEUR, Vienna, Austria, September

EUROARCH, European Informatics Congress Computing System Architectures, Munich, Germany,
October

ECUC European Convex Users Conference, Bilbao, Spain, October

Technical Univ. of Chemnitz, Germany, November

Supercomputing, Portland, Oregon, USA, November

Linköping Univ., Sweden, November

Umeå Univ., Sweden, November

Euromicro Workshop on Parallel and Distributed Processing, Malaga, Spain, January

ICASE, NASA Langley Research Center, February

Workshop on Task Parallelism in Fortran, Pasadena, California, February

Department of Computer Science and Engineering, Univ. of California, San Diego, USA, February

Computer Science Seminar, IBM Toronto, Canada, April

High Performance Computing and Networking Europe (HPCN Europe), Munich, Germany, April

First International Conference on Massively Parallel Computing Systems (MPCS), Ischia, Italy, May

Summer School on Parallel Computation in Computational Fluid Dynamics, Delft, The Netherlands, June

International Advanced Workshop on High Performance Computing Technology and Applications,
Cetraro, Italy, June

Second International Workshop on Massive Parallelism: Hardware, Software and Applications, Capri, Italy,
October

Univ. of Linz, Austria, October

Kolloquium über Parallelverarbeitung in technisch-naturwissenschaftlichen Anwendungen, Jülich, Germany,
October

Univ. of Barcelona, Spain, November

Sixth Workshop on Use of Parallel Processors in Meteorology, Reading, England, November

International Workshop on General Purpose Massively Parallel Systems: The Experience of the CNR
Project, Milano, Italy, November

International Conference on Parallel and Distributed Systems (ICPADS), Hsinchu, Taiwan, ROC,
December

ICASE, NASA Langley Research Center, February

Cray Research Inc., Eagan, Minnesota, February

Aizu International Symposium on Parallel Algorithms/Architecture Synthesis (pAs), Aizu-Wakamatsu,
Fukushima, Japan, March

Kyoto Univ., Kyoto, Japan, March

Univ. of Las Palmas, Gran Canaria, Spain, April

Third Workshop on Languages, Compilers and Runtime Systems for Scalable Computers, Troy, New York,
May

Univ. of Illinois at Urbana-Champaign, USA, May

Workshop on Compilers for Parallel Computers, Malaga, Spain, June

ICASE, NASA Langley Research Center, August-September

Fourth International Parallel Computing Workshop (PCW), London, England, September

International Symposium on Parallel and Distributed Supercomputing (PDSC), Fukuoka, Japan,
September

Univ. of Minnesota, Minneapolis, USA, October

Cray Research Inc., Eagan, Minnesota, October

ICASE, NASA Langley Research Center, November

Graz Univ. of Technology, Graz, Austria, November

Seminar on Current Trends in Theory and Practice of Informatics, Milovy, Czech Republic, November

Supercomputing, San Diego, California, USA, December

Appointments and Awards

Barbara M. Chapman

NASA Consultant, Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley
Research Center, Hampton, Virginia, USA

Thomas Fahringer

Best European Paper Award, ACM International Conference on Supercomputing (ICS), together
with H. Zima

M. Vajtersic

Visiting Professor, Univ. of Salzburg, Austria

Member, Collegium for Mathematics and Computer Science, Slovak Academy of Sciences, Slovakia

Member, Dr.Sc. Degree Committee, Faculty of Electrical Engineering, Czech Technical Univ., Prague,
Czech Republic

H. P. Zima

Best European Paper Award, ACM International Conference on Supercomputing (ICS), together
with T. Fahringer

Adjunct Professor, Rice Univ., Houston, Texas, USA, until

NASA Consultant, Institute for Computer Applications in Science and Engineering (ICASE), NASA
Langley Research Center, Hampton, Virginia, USA

Member, Curatorium, Institute for Information Processing of the Austrian Academy of Science

Member, Advisory Panel of the European Commission on High Performance Computing and Networking

Chapter

Curricula and Courses

The department offers a wide variety of courses covering the areas of architectures, languages,
compilers, and programming environments for parallel systems. In particular, lectures, exercise
classes, and seminars are provided on the following topics:

- Compiler construction

- Program analysis and optimization

- Language design for parallel systems

- Vectorization and parallelization

- Program development for parallel and distributed systems

- Computer organization

- Software development

- Theoretical computer science

In addition, laboratories are offered in the following areas:

- Compiler construction

- Vienna Fortran Compilation System

- Tool development

- Automatic parallelization

- Parallel algorithms

Chapter

Research Facilities

The following equipment is accessible on site:

- Meiko CS, processors

- iPSC, processors

- MasPar MP, a SIMD machine with processors

- Transputer system, processors

- Sun SparcStations

- HPs with transputer board

- several personal computers, some of them with transputer boards

The currently available workstations are used in research projects and for teaching purposes.
The transputer system is partially available for projects and for laboratories.

In addition, machines accessible via the ACPC network include:

- RISC Linz: Sequent Symmetry

- Technical University of Vienna: iPSC

- University of Linz: nCUBE, processors

Chapter

Institute/VCPC Colloquium

- J. M. Stichnoth, Carnegie Mellon University, Pittsburgh, PA, USA

A Parallelizing Fortran Compiler Supporting Task and Data Parallelism

January

- E. Burke, Kendall Square Research

The KSR ALLCACHE Invention: The First Shared Memory on a High Performance Parallel
Computer

January

- D. C. Marinescu, Computer Science Department, Purdue Univ., USA

Iterative Computations on Distributed Memory MIMD Systems

February

- B. K. Szymanski, Rensselaer Polytechnic Institute, Troy, USA

Compiler Technology for Irregular Parallel Computations

March

- X. Huang, Univ. of Koblenz-Landau, Germany

Automatisierte Performance-Analyse paralleler Applikationen

May

- K. Knobe, Massachusetts Institute of Technology, USA

Subspace Optimizations

May

- C. Kesselman, California Institute of Technology, Pasadena, USA

Writing Multi-Paradigm Parallel Programs

June

- C. Bischof, Argonne National Laboratory, USA

Computational Differentiation

September

- F. Baetke, CONVEX, Munich, Germany

Metacompiling: How To Realize A Vision

December

- C. W. Brand, Montanuniversität Leoben, Austria

A Preconditioned Implementation of Recursive Spectral Bisectioning for Partitioning
Unstructured Problems

January

- W. Polzleitner, Joanneum Research, Graz, Austria

Digitale Bildverarbeitung bei Joanneum Research: Vorstellung laufender und zukünftiger
Aktivitäten

April

- D. Padua, Univ. of Illinois at Urbana-Champaign, USA

Experience with the Polaris Restructurer

April

- S. Grabner, Univ. of Linz, Austria

GDDT: A Tool for Data Decomposition

May

- R. G. Babb II, Univ. of Denver, CO, USA

Semi-automatic Reengineering of Scientific Programs for Large-Grain Parallelism

June

- K. Miura, Fujitsu America Inc., San Jose, CA, USA

Fujitsu VPP System: A Vector Parallel Approach to High Performance Computing

June

- S. K. Das, Univ. of North Texas, Denton, TX, USA

Linearly Ordered Concurrent Data Structures on Hypercubes

June

- E. Zapata, Univ. of Malaga, Spain

Experiences with Data Distributions for Sparse Matrix Computations

June

- F. Leberl, Technical Univ. Graz, Austria

Computermodelle bestehender Objekte und ihre Visualisierung

June

- I. Foster, Argonne National Laboratory, USA

Integration of Data Parallelism With Task Parallelism

September

- A. Schwald, Univ. Salzburg, Austria

AdaX: From Abstraction Oriented to Object Oriented

November

- M. Gerndt, KFA Jülich, Germany

SVM-Fortran: Programmierumgebung für Rechner mit gemeinsamem virtuellem Speicher

December

- C. Lengauer, Univ. of Passau, Germany

Zur automatischen Parallelisierung von for/while-Schleifensätzen

January

- S. H. Bokhari, University of Engineering and Technology, Lahore, Pakistan, and ICASE, NASA Langley
Research Center, Hampton, Virginia, USA

Multiphase Complete Exchange

March

- V. van Dongen, CRIM, Montreal, Canada

EPPP: An Environment for Portable Parallel Programming

May

- W. Joppich, GMD Schloß Birlinghoven, St. Augustin, Germany

Experiences with the Portable Parallelization of the IFS Code: Performance Results and
Experiences

May

- J. R. Gurd, Univ. of Manchester, UK

Research at the Centre for Novel Computing

May

- R. Moreno-Diaz, Universidad de Las Palmas, Canary Islands, Spain

On Living Parallel Processing Systems

June

- B. K. Szymanski, Rensselaer Polytechnic Institute, Troy, USA

Parallel Ecological Simulations

June

- G. Iannello, Univ. of Naples, Italy

Optimal Collective Communication Algorithms in LogP

June

- C. G. Diderich, Swiss Federal Institute of Technology, Lausanne, Switzerland

Solving the Alignment Problem for a Given Degree of Parallelism

October

- B. Rodriguez, NOAA Forecast Systems Laboratory, Boulder, Colorado, USA

Work and Future Interests of the NOAA Forecast Systems Lab

October

Chapter

Visitors

Frank Baetke, CONVEX Computer Corporation, Munich, Germany

Christian Bischof, Argonne National Laboratory, Illinois, USA

Shahid H. Bokhari, University of Lahore, Pakistan

Agnes Bradier, European Commission, Brussels, Belgium

Clemens W. Brand, Montanuniversität Leoben, Austria

Thomas Brandes, GMD, St. Augustin-Bonn, Germany

Wilhelm Brandstatter, AVL List, Graz, Austria

Norman Brown, NA Software, Liverpool, UK

Ed Burke, Kendall Square Research, Waltham, Massachusetts, USA

Maria Calzarossa, Univ. of Pavia, Italy

Alok Choudhary, Syracuse Univ., New York, USA

Tom Bo Clausen, European Commission, Brussels, Belgium

James Cownie, Meiko, Almondsbury, UK

Sajal K. Das, Univ. of North Texas, Denton, Texas, USA

Mike Delves, NA Software, UK

Philippe Devillers, Simulog, Guyancourt, France

Claude Diderich, Swiss Federal Institute of Technology, Lausanne

Martin van Dijk, ACE, Amsterdam, The Netherlands

Beniamino Di Martino, University of Naples, Italy

Vincent van Dongen, CRIM, Montreal, Canada

Alistair Dunlop, Univ. of Southampton, UK

Thilo Ernst, GMD FIRST, Berlin, Germany

Hubert Fischer, BMW, Munich, Germany

Ian Foster, Argonne National Laboratory, Illinois, USA

Dennis Gannon, University of Indiana, Bloomington, Indiana, USA

Michael Gerndt, KFA Jülich, Jülich, Germany

Stefan Gillich, Intel ESCD, Feldkirchen/Munich, Germany

Siegfried Grabner, Univ. of Linz, Austria

Robert Greimel, AVL List, Graz, Austria

John R. Gurd, Univ. of Manchester, UK

Stefan Haberhauer, Intel ESCD, Feldkirchen/Munich, Germany

Dong Soo Han, Kyoto Univ., Japan

Tony Hey, Univ. of Southampton, UK

Olga Hlavacova, Slovak Academy of Science, Bratislava, Slovakia

Ladislav Hluchy, Slovak Academy of Science, Bratislava, Slovakia

Geerd Hoffmann, ECMWF, Reading, UK

Friedel Hoßfeld, KFA Jülich, Jülich, Germany

John Dawson, Cray Research, Eagan, Minnesota, USA

Xiandeng Huang, Univ. of Koblenz-Landau, Germany

Giulio Iannello, Univ. of Naples, Italy

Wilfried Imrich, Montanuniversität Leoben, Austria

Stefan Jähnichen, GMD FIRST, Berlin, Germany

Andrzej Janicki, Institute of Mathematical Machines, Warsaw, Poland

W. Joppich, GMD, St. Augustin-Bonn, Germany

Hideyuki Kawabata, Kyoto Univ., Japan

Carl Kesselman, California Institute of Technology, Pasadena, California, USA

Christoph W. Kessler, Univ. of Saarbrücken, Germany

Kathleen B. Knobe, MIT, Cambridge, Massachusetts, USA

Franz Leberl, Technical Univ. Graz, Austria

Pierre Leca, Onera, Chatillon, France

Christian Lengauer, Univ. of Passau, Germany

Helmut List, AVL List, Graz, Austria

Guy Lonsdale, ESI, Eschborn, Germany

Marc Loriot, Simulog, Guyancourt, France

Thomas Ludwig, Technical Univ. Munich, Germany

Dan C. Marinescu, Computer Science Department, Purdue Univ., Indiana, USA

Piyush Mehrotra, ICASE, NASA Langley Research Center, Hampton, Virginia, USA

Ken Miura, Fujitsu America Inc., San Jose, California, USA

Roberto Moreno-Diaz, Univ. of Las Palmas, Spain

Michael Nolle, Technical Univ. Hamburg, Germany

David Padua, Univ. of Illinois at Urbana-Champaign, Illinois, USA

Mario Pantano, Univ. of Pavia, Italy

K. Pantazopoulos, First Informatics, Patras, Greece

Ivan Plander, Slovak Academy of Science, Bratislava, Slovakia

Wolfgang Polzleitner, Joanneum Research, Graz, Austria

Ravi Ponnusamy, Univ. of Maryland, College Park, Maryland, USA

David Pritchard, Univ. of Southampton, UK

Bozena Przyborowska, Institute of Mathematical Machines, Warsaw, Poland

Thierry van der Pyl, European Commission, Brussels, Belgium

B. Rodriguez, NOAA Forecast Systems Laboratory, Boulder, Colorado, USA

Joel Saltz, Univ. of Maryland, College Park, Maryland, USA

Takashi Sato, Univ. of Kyoto, Japan

Andreas Schwald, Univ. of Salzburg, Salzburg, Austria

Vince Schuster, Portland Group Inc., Portland, Oregon, USA

Friedrich W. Schroer, GMD FIRST, Berlin, Germany

Giuseppe Serazzi, Politecnico di Milano, Italy

Piero Sguazzero, IBM ECSEC, Rome, Italy

Karl Solchenbach, Pallas GmbH, Germany

James M. Stichnoth, Carnegie Mellon Univ., Pittsburgh, Pennsylvania, USA

Henry Strauss, CONVEX, Munich, Germany

Klaus Stüben, GMD, St. Augustin-Bonn, Germany

Boleslaw K. Szymanski, Rensselaer Polytechnic Institute, Troy, New York, USA

Guillermo Trabado, Univ. of Malaga, Spain

Takao Tsuda, Kyoto Univ., Japan

Akihiro Tsukada, Kyoto Univ., Japan

Tetsutaro Uehara, Kyoto Univ., Japan

Manuel Ujaldon, Univ. of Malaga, Spain

John Van Rosendale, ICASE, NASA Langley Research Center, Hampton, Virginia, USA

Eva Viktorinova, Slovak Academy of Science, Bratislava, Slovakia

Michael Wehner, Lawrence Livermore National Laboratory, Livermore, California, USA

Fritz Wollenweber, ECMWF, Reading, UK

Francis Wray, Smith Systems Engineering, UK

Emilio Zapata, Univ. of Malaga, Spain

Bibliography

Aho, A.V., Sethi, R., Ullman, J.: Compilers: Principles, Techniques, and Tools. Addison-Wesley,
Reading, MA.

Adve, V., Mellor-Crummey, J., Anderson, M., Kennedy, K., Reed, D.A.: Integrating Compilation and
Performance Analysis for Data-Parallel Programs. Supercomputing, San Diego, CA, Dec.

Alt, M., Aßmann, U., van Someren, H.: Cosy Compiler Phase Embedding with the CoSy Compiler Model.
Int. Conf. on Compiler Construction (CC), Edinburgh, LNCS, April.

Andel, S., Di Martino, B., Hulman, J., Zima, H.P.: Program Comprehension Support for Knowledge-Based
Parallelization. Proc. Euromicro Workshop on Parallel and Distributed Processing, Braga, Portugal,
January, IEEE Press, to appear.

Andel, S., Hulman, J., Di Martino, B., Zima, H.P.: Design of Knowledge Base for Program Patterns (KBP).
CEI Project PACT, Deliverable DZ, April.

Andel, S., Hulman, J.: Implementation of an Automatic Data Distribution Tool. Deliverable DZ, CEI
Project PACT, October.

Andel, S., Chapman, B.M., Hulman, J., Zima, H.P.: An Expert Advisor for Parallel Programming
Environments and Its Realization within the Framework of the Vienna Fortran Compilation System.
Technical Report TR, Institute for Software Technology and Parallel Systems, University of Vienna,
September.

Andersen, J., Mitra, G., Parkinson, D.: The Scheduling of Sparse Matrix-Vector Multiplication on a
Massively Parallel DAP Computer. Parallel Computing.

Anderson, J., Lam, M.S.: Global Optimizations for Parallelism and Locality on Scalable Parallel Machines.
SIGPLAN Conference on Programming Language Design and Implementation.

Andre, F., Pazat, J.-L., Thomas, H.: PANDORE: A System to Manage Data Distribution. Proc.
International Conference on Supercomputing, June.

Balasundaram, V., Fox, G., Kennedy, K., Kremer, U.: An Interactive Environment for Data Partitioning
and Distribution. Proc. DMCC, March.

Balasundaram, V., Fox, G., Kennedy, K., Kremer, U.: A Static Performance Estimator to Guide Data
Partitioning Decisions. Proc. 3rd ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming (PPoPP), Williamsburg, VA, April.

Barrett R Berry M Chan T Demmel J Donato J Dongarra J Eijkhout V Pozo R Romine C

van der Vorst H Templates for the solution of linear systems Building blo cks for iterative metho ds SIAM

Benkner S., Chapman B.M., Zima H.P.: Vienna Fortran. Proc. Scalable High Performance Computing Conference, Williamsburg, Virginia, April.

Benkner S., Zima H.P.: Massively Parallel Architectures and Their Programming Paradigms: Recent Developments. AICA, Lecce, Italy.

Benkner S., Brezany P., Zima H.P.: Compiling High Performance Fortran in the PREPARE Environment. International Workshop on Compilers for Parallel Computers, Delft, The Netherlands, Dec.

Benkner S., Brezany P., Zima H.P.: Processing Array Statements and Procedure Interfaces in the PREPARE High Performance Fortran Compiler. In P.A. Fritzson (editor), Compiler Construction: Proc. International Conference, LNCS, Springer-Verlag, April.

Benkner S.: Vienna Fortran and its Compilation. PhD Thesis, Technical Report TR, University of Vienna, Institute of Software Technology and Parallel Systems, November.

Benkner S.: Handling Block-Cyclic Distributed Arrays in Vienna Fortran. Proc. International Conference on Parallel Architectures and Compilation Techniques, Limassol, Cyprus, June.

Benkner S.: Vienna Fortran: An Advanced Data Parallel Language. Proc. International Conference on Parallel Computing Technologies (PaCT), St. Petersburg, Russia, Lecture Notes in Computer Science, Springer-Verlag, September.

Benkner S., Andel S., Blasko R., Brezany P., Celic A., Chapman B.M., Egg M., Fahringer T., Hulman J., Hou Y., Kelc E., Mehofer E., Moritsch H., Paul M., Sanjari K., Sipkova V., Velkov B., Wender B., Zima H.P.: Vienna Fortran Compilation System: Users Guide, October.

Berger M.J., Bokhari S.H.: A Partitioning Strategy for Nonuniform Problems on Multiprocessors. IEEE Trans. Comput.

Berrendorf R., Gerndt M., Nagel W., Pruemmer J.: SVM Fortran. Internal Report KFA-ZAM-IB, Forschungszentrum Jülich GmbH, Jülich, Germany, Nov.

Bik A.J.C., Wijshoff H.A.G.: Compilation Techniques for Sparse Matrix Computations. ACM ICS, Tokyo, July.

Bixby R., Kennedy K., Kremer U.: Automatic Data Layout using Integer Programming. Proc. Int. Conf. on Parallel Architectures and Compilation Techniques (PACT), Montreal, Canada, August.

Bhansali S., Hagemeister J.R.: A Pattern Matching Approach for Reusing Software Libraries in Parallel Systems. Proc. of Workshop on Knowledge Based Systems for the Reuse of Program Libraries, Sophia Antipolis, FR, Nov.

Blasko R.: Parameterization and Abstract Representation of Parallel Fortran Programs for Performance Analysis. Proc. AICA Annual Conference, Gallipoli, Italy, September.

Blasko R.: Automatic Modeling and Performance Analysis of Parallel Processes by PEPSY. Proc. of ASIM Symposium, Stuttgart, Germany, October.

Blasko R.: Performance Analysis of Parallel Programs Based on Simulation. Proc. of ASU Conference, Prague, September.

Blasko R.: Process Graph and Tool for Performance Analysis of Parallel Processes. Proc. of IMACS Symposium on Mathematical Modelling (MATHMOD), Vienna, Austria, February.

Blasko R.: Hierarchical Performance Prediction for Parallel Programs. Proc. International Symposium and Workshop on Systems Engineering of Computer Based Systems, Tucson, USA, IEEE, March.

Blasko R.: Simulation Based Performance Prediction by PEPSY. Proc. Annual Simulation Symposium, Phoenix, USA, IEEE, April.

Bodin F., Kervella L., Priol T.: Fortran-S: A Fortran Interface for Shared Virtual Memory Architectures. Proc. Supercomputing, Portland, USA.

Bomans L., Hempel R.: The Argonne/GMD Macros in Fortran for Portable Parallel Programming and Their Implementation on the Intel iPSC. Parallel Computing.

Bose P.: Heuristic Rule-Based Program Transformations for Enhanced Vectorization. Proc. International Conference on Parallel Processing.

Bose P.: Interactive Program Improvement via EAVE: An Expert Adviser for Vectorization. Proc. Int. Conf. on Supercomputing, July.

Brezany P., Gerndt M., Sipkova V., Zima H.P.: SUPERB Support for Irregular Scientific Computation. Proc. Scalable High Performance Computing Conference.

Brezany P., Cheron O., Sanjari K., van Konijnenburg E.: Processing Irregular Codes Containing Arrays with Multi-Dimensional Distributions by the PREPARE HPF Compiler. HPCN Europe, Milan, Springer-Verlag.

Brezany P., Mück T., Schikuta E.: Language, Compiler and Parallel Database Support for I/O Intensive Applications. Proc. HPCN Europe, Milan, Italy, May, Springer-Verlag.

Brezany P., Mück T., Schikuta E.: Data Prefetching in the VIPIOS Framework. Submitted to HPCN Europe, Brussels, November.

Calzarossa M., Massari L., Merlo A., Pantano M., Tessera D.: MEDEA: A Tool for Workload Characterization of Parallel Systems. IEEE Parallel and Distributed Technology, Winter.

Calzarossa M., Serazzi G.: Workload Characterization: A Survey. Proc. of the IEEE, August.

Chapman B.M.: Automatic Data Distribution. PPPE Project Deliverable c, September.

Chapman B.M., Fahringer T., Zima H.P.: Automatic Support for Data Distribution on Distributed Memory Multiprocessor Systems. Proc. International Workshop on Languages and Compilers for Parallel Computing, Portland, OR, Springer-Verlag, LNCS.

Chapman B.M., Mehrotra P., Zima H.P.: Programming in Vienna Fortran. Scientific Programming, Fall.

Chapman B.M., Mehrotra P., Zima H.P.: High Performance Fortran Without Templates: A New Model for Data Distribution and Alignment. Proc. Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, May. ACM SIGPLAN Notices, July.

Chapman B.M., Mehrotra P., Moritsch H., Zima H.P.: Dynamic Data Distribution in Vienna Fortran. Proc. Supercomputing, Portland, Oregon, November.

Chapman B.M., Mehrotra P., Zima H.P.: Extending HPF For Advanced Data Parallel Applications. IEEE Magazine on Parallel and Distributed Technology, Fall.

Chapman B.M., Mehrotra P., van Rosendale J., Zima H.P.: A Software Architecture for Multidisciplinary Applications: Integrating Task and Data Parallelism. Proc. CONPAR/VAPP VI, Linz, Austria, September. Also Technical Report TR, Institute for Software Technology and Parallel Systems, University of Vienna, March.

Chapman B.M., Mehrotra P., Van Rosendale J., Zima H.P.: Extending Vienna Fortran With Task Parallelism. Proc. International Conference on Parallel and Distributed Systems (ICPADS), Hsinchu, Taiwan, ROC, December.

Chapman B.M., Zima H.P., Haines M., Mehrotra P., Van Rosendale J.: OPUS: A Coordination Language for Multidisciplinary Applications. Technical Report TR, Institute for Software Technology and Parallel Systems, University of Vienna, October.

Chatterjee S., Gilbert J.R., Schreiber R.: Mobile and Replicated Alignment of Arrays in Data-Parallel Programs. Proc. Supercomputing, Portland, OR, November.

Chen M., Li J.: Optimizing Fortran Programs for Data Motion on Massively Parallel Systems. Technical Report YALE/DCS/TR, Yale University, January.

Cheng D., Hood R.: A Portable Debugger for Parallel and Distributed Programs. Proc. Supercomputing, Washington, DC.

CM Fortran Reference Manual. Thinking Machines Corporation, Cambridge, MA.

Cohn R.: Source Level Debugging of Automatically Parallelized Code. Proc. ACM/ONR Workshop on Parallel and Distributed Debugging, Santa Cruz, California, May.

Cordsen J., Luther E., Oestmann B.: The Consistency Framework of the SODA-DVSM System. Esprit Project SODA Deliverable, German National Research Center for Computer Science, Berlin, Germany, May.

Crooks P., Perrott R.H.: An Automatic Data Distribution Generator for Distributed Memory MIMD Machines. Proc. Fourth Workshop on Compilers for Parallel Computers, Delft, The Netherlands, Dec.

Das R., Saltz J.: A Manual for PARTI Runtime Primitives, Revision. Internal Research Report, University of Maryland, Dec.

De Pietro G., Giordano A., Vajtersic M., Zinterhof P.: Parallel Numerics. Proc. International Workshop, Sorrento, September, ZACCARIA Publisher, Naples.

Di Martino B., Iannello G.: Towards Automatic Parallelization through Program Comprehension. Proc. IEEE Workshop on Program Comprehension, Washington, USA, Nov.

Di Martino B.: Design of a Tool for Automatic Recognition of Parallelizable Algorithmic Patterns and Description of its Prototype Implementation. Tech. Rep., Institute for Software Technology and Parallel Systems, Univ. of Vienna, May.

Di Martino B., Chapman B.M., Iannello G., Zima H.P.: Integration of Program Comprehension Techniques into the Vienna Fortran Compilation System. Proc. International Conference on High Performance Computing, New Delhi, India, Dec.

Di Martino B., Kessler C.W.: Program Comprehension Engines for Automatic Parallelization: A Comparative Study. To appear in Proc. of First Int. Workshop on Software Engineering for Parallel and Distributed Systems, Berlin, Germany, March.

Eager D.L., Lazowska E.D., Zahorjan J.: Speedup versus Efficiency in Parallel Systems. IEEE Transactions on Computers, March.

Fahringer T.: The Weight Finder: An Advanced Profiler for Fortran Programs. Proc. AP, Saarbrücken, Germany.

Fahringer T.: Automatic Cache Performance Prediction in a Parallelizing Compiler. Proc. AICA International Section, Lecce, Italy, September.

Fahringer T.: Automatic Performance Prediction for Parallel Programs on Massively Parallel Computers. PhD Thesis, Technical Report TR, University of Vienna, Institute of Software Technology and Parallel Systems, October.

Fahringer T.: The Weight Finder: An Advanced Profiler for Fortran Programs. In: Automatic Parallelization: New Approaches to Code Generation, Data Distribution and Performance Prediction, Vieweg Advanced Studies in Computer Science, Verlag Vieweg, Wiesbaden, Germany, March.

Fahringer T.: Automatically Estimating Network Contention of Parallel Programs. Proc. International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, Vienna, Austria, May.

Fahringer T.: Evaluation of Benchmarking Performance Estimation for Parallel Fortran Programs on Massively Parallel SIMD and MIMD Computers. IEEE Proc. Euromicro Workshop on Parallel and Distributed Processing, Malaga, Spain, January.

Fahringer T.: Using the P3T to Guide the Parallelization and Optimization Effort under the Vienna Fortran Compilation System. Proc. IEEE Scalable High Performance Computing Conference, Knoxville, TN, May.

Fahringer T.: P3T: An Automatic Performance Estimator for Parallel Programs. Proc. International Workshop on Automatic Data Layout and Performance Prediction, Rice University, Houston, USA, April.

Fahringer T.: Estimating and Optimizing Performance for Parallel Programs. IEEE Computer, November.

Fahringer T.: On Estimating the Useful Work Distribution of Parallel Programs under P3T, a Static Performance Estimator. Accepted for publication in Concurrency: Practice and Experience (Ed. Geoffrey Fox), to appear.

Fahringer T., Zima H.P.: A Static Parameter based Performance Prediction Tool for Parallel Programs. Invited Paper, Proc. ACM International Conference on Supercomputing, Tokyo, Japan, July.

Fortran. ANSI X3J3 Internal Document S, May.

Fox G., Hiranandani S., Kennedy K., Koelbel C., Kremer U., Tseng C., Wu M.: Fortran D Language Specification. Department of Computer Science, Rice COMP TR, Rice University, March.

Francomano E., Tortorici Macaluso A., Vajtersic M.: Implementation Analysis of Fast Matrix Multiplication Algorithms on Shared Memory Computers. Computers and Artificial Intelligence.

Geist G.A., Heath M.T., Peyton B.W., Worley P.H.: A Users Guide To PICL: A Portable Instrumented Communication Library. ORNL/TM, Oak Ridge National Laboratory, Math Sciences Section, Oak Ridge, Tennessee, August.

Gerndt H.M.: Automatic Parallelization for Distributed-Memory Multiprocessing Systems. PhD Thesis, University of Bonn, December.

Ghosal D., Serazzi G., Tripathi S.K.: The Processor Working Set and its Use in Scheduling Multiprocessor Systems. IEEE Transactions on Software Engineering, May.

Guerrini C., Vajtersic M.: Optimal Parallel Ordering Schemes For Solving The Singular Value Decomposition. Intern. Journal on Mini and Microcomputers.

Gupta M., Banerjee P.: Automatic Data Partitioning on Distributed Memory Multiprocessors. Technical Report UILU-ENG, Coordinated Science Lab, University of Illinois.

Gupta M., Banerjee P.: Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers. IEEE Transactions on Parallel and Distributed Systems.

Haines M., Cronk D., Mehrotra P.: On the Design of Chant: A Talking Threads Package. Proc. Supercomputing, Washington, DC, November. Also ICASE Technical Report.

Haines M., Hess B., Mehrotra P., van Rosendale J., Zima H.P.: Runtime Support for Data Parallel Tasks. Proc. Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers), McLean, Virginia. Also Technical Report TR, Institute for Software Technology and Parallel Systems, University of Vienna, April, and ICASE Technical Report.

Haines M., Mehrotra P., Cronk D.: Ropes: Support for Collective Operations Among Distributed Threads.

Hall M.W., Hiranandani S., Kennedy K., Tseng C.W.: Interprocedural Compilation of Fortran D for MIMD Distributed-Memory Machines. Proc. Supercomputing, Minneapolis, November.

High Performance Fortran Forum: High Performance Fortran Language Specification, Version. Technical Report, Rice University, Houston, TX, May. Also available as Scientific Programming, Spring and Summer.

Hiranandani S., Kennedy K., Tseng C.W.: Compiling Fortran D for MIMD Distributed-Memory Machines. Comm. ACM, August.

HPC White Papers. The HPC Working Group, Technical Report TR, Center for Research on Parallel Computation (CRPC), Rice University, Houston, Texas, December.

Hulman J.: Knowledge-Based Support for Transformation Systems. Proc. Int. Workshop on Algorithms and Software for Parallel Computer Systems (I. Plander, M. Vajtersic, H.P. Zima, eds.), Smolenice, March.

Hulman J., Andel S., Chapman B.M., Zima H.P.: Intelligent Parallelization Within the Vienna Fortran Compilation System. Proc. Fourth Workshop on Compilers for Parallel Computers, Delft, Netherlands, December.

IEEE: Threads Extension for Portable Operating Systems, Draft, February.

Ikudome K., Fox G., Kolawa A., Flower J.: An Automatic and Symbolic Parallelization System for Distributed Memory Parallel Computers. Proc. Fifth Distributed Memory Computing Conference, Charleston, SC, April.

Jain R.: The Art of Computer System Performance Analysis. John Wiley & Sons, New York.

Kaushik S.D., Huang C.H., Ramanujam J., Sadayappan P.: Multiphase Array Redistribution: A Communication Efficient Approach to Array Redistribution. Technical Report OSU-CISRC-TR, Ohio State University, Columbus, OH.

Kessler C.W., Paul W.J.: Automatic Parallelization by Pattern Matching. Lecture Notes in Computer Science, Parallel Computation.

Knobe K., Lukas J.D., Steele G.L.: Data Optimization: Allocation of Arrays to Reduce Communication on SIMD Machines. Journal of Parallel and Distributed Computing.

Knobe K., Lukas J.D., Dally W.J.: Dynamic Alignment on Distributed Memory Systems. Proc. Third Workshop on Compilers for Parallel Computers, Vienna, Austria. Technical Report ACPC/TR, Austrian Center for Parallel Computation, July.

Koelbel C.: Compiling Programs for Nonshared Memory Machines. PhD Dissertation, Purdue University, West Lafayette, IN, Nov.

Koelbel C., Mehrotra P.: Compiling Global Name-Space Parallel Loops for Distributed Execution. IEEE Transactions on Parallel and Distributed Systems, October.

Koelbel C., Mehrotra P., Van Rosendale J.: Semi-Automatic Process Partitioning for Parallel Computation. International Journal of Parallel Programming.

Kremer U.: NP-Completeness of Dynamic Remapping. Proc. Fourth Workshop on Compilers for Parallel Computers, Delft, The Netherlands, Dec.

Kuck D.J.: A Survey of Parallel Machine Organization and Programming. ACM Computing Surveys.

Kuck D.J., Kuhn R.H., Leasure B.R., Wolfe M.J.: The Structure of an Advanced Retargetable Vectorizer. In Hwang K. (Ed.), Supercomputers: Design and Applications Tutorial, IEEE Catalog Number EHO, IEEE Society Press, Silver Spring, Maryland.

Laure E.: Die Alignmentanalyse: Eine pragmatische Methode zur Erzeugung effizienter datenparalleler Fortranprogramme und ihre Realisierung im VFCS (in German). Diploma Thesis, University of Vienna, Institute for Software Technology and Parallel Systems, January.

Lenzi P., Serazzi G.: ParMon: Parallel Monitor. Technical Report, Dipartimento di Elettronica, Politecnico di Milano, October.

Li J., Chen M.: Index Domain Alignment: Minimizing Cost of Cross-Referencing Between Distributed Arrays. Technical Report YALEU/DCS/TR, Yale University, November.

Li J., Chen M.: Generating Explicit Communication from Shared-Memory Program References. Proc. Supercomputing, New York, NY, November.

Li K.: Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD Thesis, Yale University.

Loveman D.: High Performance Fortran Proposal. High Performance Fortran Forum, Houston, TX, January.

Mace M.: Memory Storage Patterns in Parallel Processing. Kluwer Academic Publishers, Boston, MA.

McDowell C.E., Helmbold D.P.: Debugging Concurrent Programs. ACM Computing Surveys, Dec.

Metzger R.: Automated Recognition of Parallel Algorithms in Scientific Applications. Proc. of Workshop on Plan Recognition at IJCAI, Aug.

Mehrotra P., van Rosendale J.: Programming Distributed Memory Architectures Using Kali. In A. Nicolau, D. Gelernter, T. Gross and D. Padua (editors), Advances in Languages and Compilers for Parallel Processing, Pitman/MIT Press.

Merlo A., Rossaro P.: MEDEA: Architettura Software (in Italian). Technical Report, CNR (National Research Council), Italy, September. Rapporto Tecnico CNR, Progetto Finalizzato Sistemi Informatici e Calcolo Parallelo.

Message Passing Interface Forum: Document for a Standard Message-Passing Interface. University of Tennessee, Knoxville, Tennessee, October.

MIMDizer Users Guide, Version. Pacific Sierra Research Corporation, Placerville, CA.

Natarajan V., Chiou D., Ang B.S.: Performance Visualization on Monsoon. Journal of Parallel and Distributed Computing.

Paalvast E., Sips H.: A High-Level Language for the Description of Parallel Algorithms. Proc. Parallel Computing, Leyden, Netherlands, August.

Palko V., Vajtersic M.: A Parallel Recognition of Lines in Binary Images. Computers and Artificial Intelligence.

Pase D.: MPP Fortran Programming Model. High Performance Fortran Forum, Houston, TX, January.

Philippsen M., Mock M.U.: Data and Process Alignment in Modula. In Kessler C.W. (Ed.), Automatic Parallelization: New Approaches to Code Generation, Data Distribution and Performance Prediction, Vieweg Verlag.

Pinter S.S., Pinter R.Y.: Program Optimization and Parallelization Using Idioms. ACM SIGPLAN Principles of Programming Languages.

Ponnusamy R., et al.: A Manual for the CHAOS Runtime Library. Technical Report, University of Maryland, May.

Redon X., Feautrier P.: Detection of Recurrences in Sequential Programs with Loops. PARLE, Springer LNCS.

Reed D.A., Olson R.D., Aydt R.A., Madhyastha T.M., Birkett T., Jensen D.W., Nazief B.A., Totty B.K.: Scalable Performance Environments for Parallel Systems. Proc. Sixth Distributed Memory Computing Conference, Portland, Oregon, April.

Rogers A., Pingali K.: Process Decomposition Through Locality of Reference. Proc. Conference on Programming Language Design and Implementation, ACM SIGPLAN, June.

Rosti E., Serazzi G.: Workload Characterization for Performance Engineering of Parallel Applications. IEEE, January.

Rühl R., Annaratone M.: Parallelization of Fortran Code on Distributed-Memory Parallel Processors. Proc. ACM International Conference on Supercomputing, June.

Sabot G., Wholey S.: Cmax: A Fortran Translator for the Connection Machine System. Int. ACM Conference on Supercomputing.

Saltz J., Crowley K., Mirchandaney R., Berryman H.: Runtime Scheduling and Execution of Loops on Message Passing Machines. Journal of Parallel and Distributed Computing.

Saltz J., Das R., Moon B., Sharma S., Hwang Y., Ponnusamy R., Uysal M.: A Manual for the CHAOS Runtime Library. Computer Science Department, University of Maryland.

Sarkar V.: Partitioning and Scheduling Parallel Programs for Multiprocessors. The MIT Press, Cambridge, Massachusetts.

Stramm B.: Performance Prediction for Mapped Parallel Programs. PhD Thesis, University of California, San Diego.

Sun X.H.: Parallel Computation Models: Representation, Analysis and Applications. PhD Thesis, Michigan State University, Department of Computer Science.

Sussman A.: Model-Driven Mapping of Computation onto Distributed Memory Parallel Computers. PhD Thesis, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA.

Thakur R., Choudhary A., Fox G.: Runtime Array Redistribution in HPF Programs. Proc. Scalable High-Performance Computing Conference, Knoxville, Tennessee, May.

Ujaldon M., Zapata E.L., Chapman B.M., Zima H.P.: Vienna Fortran/HPF Extensions for Sparse and Irregular Problems and Their Compilation. Technical Report TR, Institute for Software Technology and Parallel Systems, University of Vienna, October. Submitted to IEEE Transactions on Parallel and Distributed Systems.

Vajtersic M.: Algorithms for Elliptic Problems: Efficient Sequential and Parallel Solvers. Kluwer Academic Publishers, Dordrecht/Boston.

Vajtersic M., Zinterhof P.: Parallel Numerics. Proc. International Workshop, Smolenice, SAS Publisher, Bratislava.

Vajtersic M.: Two Classes of Efficient Hypercube Orderings For SVD. Proc. International Conference on Applied Mathematics and Applications, World Scientific, Sofia, August.

Vajtersic M.: Some Approaches How To Multiply Matrices On Massively Parallel Meshes. Proc. International Workshop Parallel Numerics, Slovak Academy of Sciences, Smolenice, September.

Vajtersic M.: Some Examples Of Massively Parallel Numerical Algorithms. Proc. International Conference on Parallel Processing and Applied Mathematics, Technical University of Czestochowa, Czestochowa, September.

Vajtersic M.: High-Performance VLSI Model Elliptic Solvers. Proc. HPCN Europe Conference, Lecture Notes in Computer Science, Springer-Verlag.

Vajtersic M.: Algorithms For Massively Parallel Matrix-Product Computations. Proc. Conference on Parallel and Distributed Computing, University of Kuwait, March.

Vajtersic M.: Special Block-Five-Diagonal System Solvers For The VLSI Parallel Model. Proc. International Workshop Parallel Numerics, September, IRSIP CNR, Naples.

Vajtersic M.: Numerical Algorithms for Some Parallel Computer Architectures. Habilitation work, University of Salzburg.

Veen A., de Lange M.: Overview of the PREPARE Project. International Workshop on Compilers for Parallel Computers, Delft, The Netherlands, Dec.

Wang K.Y.: A Framework for Intelligent Parallel Compilers. Technical Report CSD-TR, Computer Science Dept., Purdue University.

Wang K.Y., Gannon D.: Applying AI Techniques to Program Optimization for Parallel Computers. In: Parallel Processing for Supercomputers and Artificial Intelligence, McGraw-Hill.

Wender B.: VFCS Instrumentation for Integration of MEDEA. Deliverable DZ, CEI Project PACT, April.

Wakatani A., Wolfe M.: A New Approach to Array Redistribution: Strip Mining Redistribution. Proc. Parallel Architectures and Languages Europe, July.

Wholey S.: Automatic Data Mapping for Distributed Memory Parallel Computers. PhD Thesis, Computer Science, Carnegie Mellon University, PA, May.

Wu J., Saltz J., Berryman H., Hiranandani S.: Distributed Memory Compiler Design for Sparse Problems. ICASE Report.

Zima H.P., Bast H., Gerndt M.: SUPERB: A Tool for Semi-Automatic MIMD/SIMD Parallelization. Parallel Computing.

Zima H.P., Brezany P., Chapman B.M., Mehrotra P., Schwald A.: Vienna Fortran: A Language Specification, Version. Technical Report ACPC/TR, Austrian Center for Parallel Computation, March. Also NASA Contractor Report, ICASE Interim Report, NASA Langley Research Center, Hampton, Virginia, March.

Zima H.P., Brezany P., Chapman B.M., Hulman J.: Automatic Parallelization for Distributed-Memory Systems: Experiences and Current Research. Invited Paper, in Spies P.P. (Ed.), EuroArch, Proc. European Informatics Congress Computing Systems Architectures, Munich, October, Informatik aktuell, Springer-Verlag.

Zima H.P., Chapman B.M.: Compiling for Distributed-Memory Systems. Invited Paper, Proc. of the IEEE, Special Section on Languages and Compilers for Parallel Machines, February. Also Technical Report ACPC/TR, Austrian Center for Parallel Computation, November.

Zima H.P., Chapman B.M.: Supercompilers for Parallel and Vector Computers. ACM Press Frontier Series, Addison-Wesley.