PARALLEL PROGRAMMING METHODOLOGY AND ENVIRONMENT

FOR THE SHARED MEMORY PROGRAMMING MODEL

A Thesis

Submitted to the Faculty

of

Purdue University

by

Insung Park

In Partial Fulfillment of the

Requirements for the Degree

of

Doctor of Philosophy

December


To my beloved grandmother


ACKNOWLEDGMENTS

First, I'd like to thank my grandmother, whom I have not seen for more than two years and whom I will never see again. She fled to South Korea with four little daughters during the Korean War and started a new life in an unfamiliar place with her bare hands. Her courage, perseverance, and endurance have led to my existence. Over the years in graduate school, she has always been on my side, lending a sympathetic ear and doing her best to keep me sane. I wish I could see her just one more time.

I'd like to thank my advisor, Dr. Rudolf Eigenmann, for his encouragement and advice during my research. His insightful comments and constructive suggestions are greatly appreciated. I also express my gratitude to my graduate committee members, Dr. Jose A. B. Fortes, Dr. Howard J. Siegel, and Dr. Elias Houstis, for their time and advice.

My deepest love goes to my parents and two brothers, In Jun and In Kwon. I can never thank them enough for their never-ending support that has made me come through with my research. Through ups and downs in life, their love and encouragement has given me the strength to go on with my life. I am also grateful to my aunts, uncles, and cousins, who have never hidden their pride in me and concern for my well-being.

Fresh and valuable perspectives that the members of our research group have provided are greatly appreciated. Among them, Mike, Seon, Brian, and Vishal have made extra efforts to help me with my research, which I deeply acknowledge.

Mike, Natalie, and Nicholas deserve special mention for always being there for me. I cherish them as my brother, sister, and nephew. Without them, I would not have made it this far. I believe one of the reasons God led me here is to meet them. I also


value my to-be-lifelong friendship with Seon Young and their precious daughter Arden. The numerous evenings I have spent with all these friends are precious to me.

I appreciate many of my Korean friends here at Purdue. Especially, I extend my thanks to Jonghyeok and JeHo. The life here has been joyous and fun because of them. Thanks are also due to their wives, who have fed this single, hungry graduate student countless times. I'd also like to mention In Sung, Jae Hyung, Yonghee, Soon Keon, Heon, Seungmoon, Soohong, Jang Won Il, Jung Min, Hun Soo, Woon Young, Jong Sun, Se Hyun, and their families.

Lastly, I send my best regards to Joon Sook and her family. I wish them happiness.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

INTRODUCTION
    Motivation
        State of parallel computing
        Open issues in the shared memory programming model
        Need for parallel programming environment
    Thesis Organization

BACKGROUND
    Parallel Programming Concepts, Terminology, and Notations
    Parallelization in the Shared Memory Programming Model
        Introduction
        History of parallel shared memory directives
        Shared memory program execution
        Automatic parallelization
    Parallelization in the Message Passing Programming Model
        MPI and PVM
        HPF
        Visual parallel programming systems
    Parallel Programming and Optimization Methodology
        Shared memory programming methodology
        Message passing programming methodology
    Tools
        Program development and optimization
        Instrumentation
        Performance visualization and evaluation
        Guidance
    Utilizing Web Resources for Parallel Programming
    Conclusions

SHARED MEMORY PROGRAM OPTIMIZATION METHODOLOGY
    Introduction: Scope, Audience, and Metrics
        Scope of the proposed methodology
        Target audience
        Metrics: understanding overheads
    Parallel Program Optimization Methodology
        Instrumenting the program
        Getting serial execution time
        Running the parallelizing compiler
        Manually optimizing programs
        Getting optimized execution time
        Finding and resolving performance problems
    Conclusions

TOOL SUPPORT FOR PROGRAM OPTIMIZATION METHODOLOGY
    Design Objectives
    Ursa Minor: Performance Evaluation Tool
        Functionality
        Internal Organization of the Ursa Minor tool
        Database structure and data format
        Summary
    InterPol: Interactive Tuning Tool
        Overview
        Functionality
        Summary
    Other Tools in Our Toolset
        Polaris parallelizing compiler
        InterAct performance monitoring and steering tool
        MaxP parallelism analysis tool
    Integration with Methodology
        Tool support in each step
        Other useful utilities
    The Parallel Programming Hub and Ursa Major
        Parallel Programming Hub: globally accessible integrated tool environment
        Ursa Major: making a repository of knowledge available to the worldwide audience
    Conclusions

EVALUATION
    Methodology Evaluation: Case Studies
        Manual tuning of ARC2D
        Evaluating a parallelizing compiler on a large application
        Interactive compilation
        Performance advisor: hardware counter data analysis
        Performance advisor: simple techniques to improve performance
    Efficiency of the Tool Support
        Facilitating the tasks in parallel programming
        General comments from users
    Comparison with Other Parallel Programming Environments
    Comparison of Ursa Major and the Parallel Programming Hub
    Conclusions

CONCLUSIONS
    Summary
    Directions for Future Work

LIST OF REFERENCES

VITA


LIST OF TABLES

Overhead categories of the speedup component model
Optimization technique application criteria
A detailed breakdown of the performance improvement due to each technique
Common tasks in parallel programming
Time in seconds taken to perform the tasks without our tools
Time in seconds taken to perform the tasks with our tools
Feature comparison of parallel programming environments
Workload distribution on resources with our network-based tools


LIST OF FIGURES

The structure of an SMP
An Origin system: (a) topology and (b) structure of a single node board
Simple parallelization with OpenMP
Screenshot of the CODE visual programming system
The timeline graph from NTV
The graphs generated by AIMS
The graphs generated by Pablo
Typical parallel program development cycle
Overview of the proposed methodology
Scalar privatization: (a) the original loop and (b) the same loop after privatizing variable X
Array privatization: (a) the original loop and (b) the same loop after privatizing variable array A
Scalar reduction: (a) the original loop and (b) the same loop after recognizing reduction variable SUM
Array reduction: (a) the original loop and (b) the same loop after recognizing reduction array A
Induction variable recognition: (a) the original loop and (b) the same loop after replacing induction variable X
Scheduling modification: (a) the original loop and (b) the same loop after modifying scheduling by pushing parallel constructs inside the loop nest. In (b), the inner loop is executed in parallel, so processors access array elements that are at least a stride apart
Padding: (a) the original loop and (b) the same loop after padding extra space into the arrays
Load balancing: (a) the original loop and (b) the same loop after changing to an interleaved scheduling scheme. By changing the scheduling from static to dynamic, unbalanced load can be distributed more evenly
Blocking/tiling: (a) the original loop and (b) the same loop after applying tiling to split the matrices into smaller tiles. In (b), another loop has been added to assign smaller blocks to each processor. The data are likely to remain in the cache when they are needed again
Loop interchange: (a) a loop with poor locality and (b) the same loop with better locality after interchanging the loop nest
Software pipelining and loop unrolling: (a) the original loop, (b) the same loop with software pipelining, where instructions are interleaved across iterations and a preamble and postamble have been added, and (c) the same loop unrolled
Original loop SHALOW do in program SWIM
Parallel version of loop SHALOW do in program SWIM
Optimized version of loop SHALOW do in program SWIM
Main view of the Ursa Minor tool. The user has gathered information on program BDNA. After sorting the loops based on the execution time, the user inspects the percentage of three major loops (ACTFOR do, ACTFOR do, RESTAR do) using a pie chart generator (bottom left). Computing the speedup column with the Expression Evaluator reveals that the speedup for RESTAR do is poor, so the user is examining more detailed information on the loop
Structure view of the Ursa Minor tool. The user is looking at the Structure View generated for program BDNA. Using the Find utility, the user sets the view to subroutine ACTFOR and opens up the source view for the parallelized loop ACTFOR do
The user interface of Merlin in use. Merlin provides solutions to the detected problems. This example shows the problems addressed in loop ACTFOR DO of program BDNA. The button labeled "Ask Merlin" activates the analysis. The "View Source" button opens the source viewer for the selected code section. The "ReadMe for Map" button pulls up the ReadMe text provided by the performance map writer
The internal structure of a Merlin map. The Problem Domain corresponds to general performance problems. The Diagnostics Domain depicts possible causes of the problems, and the Solution Domain contains suggested remedies. Conditions are logical expressions representing an analysis of the data
Building blocks of the Ursa Minor tool and their interactions
The database structure of Ursa Minor
An overview of InterPol. Three main modules interact with users through a Graphical User Interface. The Program Builder handles file I/O and keeps track of the current program variant. The Compiler Builder allows users to arrange optimization modules in Polaris. The Compilation Engine combines the user selections from the other two modules and calls Polaris modules
User interface of InterPol: (a) the main window and (b) the Compiler Builder
Monitoring the example application through the InterAct interface. The main window shows the characterization data of the major loops in the SPEC SWIM benchmark
Tool support for the parallel programming methodology
Ursa Minor usage on the Parallel Programming Hub
Interaction provided by the Ursa Major tool
The (a) execution time and (b) speedup of the various versions of ARC2D. Mod 1: loop interchange; Mod 2: STEPFY do modification; Mod 3: STEPFX do modification; Mod 4: FILERX do modification; Mod 5: YPENTA do modification; Mod 6: modification on XPENTA, YPENT, and XPENT
Contents of the Program Builder during an example usage of the InterPol tool: (a) the input program and (b) the output from the default Polaris compiler configuration
Contents of the Program Builder during an example usage of the InterPol tool: (c) the output after placing an additional dead-code elimination pass prior to inlining and (d) the program after manually parallelizing subroutine two
Performance analysis of the loop STEPFX DO in program ARC2D. The graph on the left shows the overhead components in the original serial code. The graphs on the right show the speedup component model for the parallel code variants before and after loop interchanging is applied. Each component of this model represents the change in the respective overhead category relative to the serial program. Merlin is able to generate the information shown in these graphs
Speedup achieved by applying the performance map. The speedup is with respect to a one-processor run with serial code on a Sun Enterprise system. Each graph shows the cumulative speedup when applying each technique
Overall times to finish all tasks
The response time of UMApplet and UMParHub on (a) a networked PC, (b) a networked workstation, and (c) a dial-up PC
The response time of the three operations on the RETRAN database: (a) loading, (b) spreadsheet command evaluation, and (c) source searching


ABSTRACT

Park, Insung. Ph.D., Purdue University, December. Parallel Programming Methodology and Environment for the Shared Memory Programming Model. Major Professor: Rudolf Eigenmann.

The easy programming model of the shared memory paradigm possesses many attributes desirable to novice programmers. However, there has not been a good methodology with which programmers can navigate through the difficult task of program parallelization and optimization. It is becoming increasingly difficult to achieve good performance without experience and intuition. Guiding methodologies must define easy-to-follow steps for programming and tuning multiprocessor applications. In addition, a parallel programming environment must acknowledge time-consuming steps in the parallelization and tuning process and support users in their efforts.

We propose a parallel programming methodology for the shared memory model and a set of tools designed to assist users in accordance with the methodology. Our research addresses the questions of what to do in parallel program development and tuning, how to do it, and where to do it. Our main contribution is to provide a comprehensive programming environment such that both novice and advanced users can perform performance tuning in an efficient and straightforward manner. Our effort differs from other parallel programming environments in that it integrates most stages of parallel programming tasks based on a common methodology, and it addresses issues that have not been attempted in previous efforts. We have used network computing technology so that programmers worldwide can benefit from our work. Through a series of evaluation processes, we found that our programming environment provides a methodology that works well with parallel applications and that our tools provide efficient support to both novice and advanced programmers.

INTRODUCTION

Motivation

State of parallel computing

Multiprocessor machines have existed in many different architectures. Among them, shared memory machines have been getting much attention recently. This is mainly due to the fact that the shared memory architecture offers an easy programming model and that the techniques for parallelization of programs for this class of machines are well established and can be automated.

Today, new affordable multiprocessor workstations and PCs are attracting an increasing number of users; consequently, these new programmers are inexperienced and desire an easier programming model to harness the power of parallel computing. These aspects draw more attention to shared memory machines in two ways. First, most newly developed parallel computers are shared memory machines or compatible with the shared memory programming model. Second, the aforementioned easy programming model, with the help of parallelizing compilers, requires relatively little experience to develop parallel programs.

The effort in the industry toward the standardization of a programming model makes shared memory machines more appealing. The lack of a standardized parallel language had been a problem with the shared memory model. It often required programmers to learn a new set of language constructs whenever there was a need to port programs across platforms. To make matters worse, the difference among these native dialects in their ability to express parallelism was significant enough that, in many cases, a considerable change had to be made in the program code itself, going beyond direct directive translation. There have been several attempts to provide standard parallel languages, which will be discussed in Chapter 2, but they failed to get the attention of the parallel computing community in general.

The recent parallel language standard for shared memory multiprocessor machines, OpenMP, promises an attractive interface for those programmers who wish to exploit parallelism explicitly. The OpenMP standard resolves the portability problem and is expected to attract more programmers and computer vendors in the high performance computing area.

Open issues in the shared memory programming model

There are, however, open issues to be addressed. Perhaps the most serious of all is the lack of a good programming methodology for these types of machines. In contrast to several efforts to establish a methodology for other programming models, no known literature speaks of a programming and tuning methodology for the shared memory model. A programmer who is to develop a parallel program has to face a number of challenging questions. What are the known techniques for parallelizing this program? What information is available for the program at hand? How much speedup can be expected from this program? What are the limitations for the parallelization of this program? It usually takes substantial experience to find the answers to such questions. Most general programmers do not have the time and resources to acquire this experience.

We believe that the absence of a programming methodology is attributable to three reasons. First, many advanced parallel programmers are used to programming in terms of application-level parallelism. By this we mean the study of the underlying physics and algorithms to find parallelism residing at that level. It is indeed an effective method if it succeeds, because in some cases the scope of the resulting parallelism is wider than the finer grain parallelism of the directive-based programming model, resulting in less synchronization overhead. However, this approach requires significant effort to understand the underlying physics, and it is prone to human error. It is not a rare case in which a programmer realizes in a later stage of development that the algorithm that he or she thought to be parallel is actually sequential. If the person parallelizing a program is not the programmer who wrote it, the required effort doubles, as an understanding of the program has to precede parallelization. Furthermore, depending on the problem that programmers wish to solve, the underlying algorithms and physical models vary significantly, making a systematic approach to parallel application design difficult. A programmer who is used to this approach has to tackle each problem case by case, relying on intuition and experience.

In contrast to the application-level approach, there is a program-level parallelism approach. This means an effort to find parallelism based on the source code and how it is written. Focusing only on repetitive computing constructs (loops), this approach allows automatic recognition of parallelism and possible transformations. Numerous research projects have addressed the issues of identifying parallelism and applying the corresponding transformations that can be incorporated into compilers. Nevertheless, these are not parallel programming methodologies by themselves. These researchers address only one part of parallel program development: parallelization. A complete parallel programming methodology has to encompass the entire development process, including parallelization, evaluation, tuning, and so on.

The second reason for the lack of a methodology for the shared memory architecture stems from the significant aid provided by parallelizing compilers. Many inexperienced programmers expect a significant speedup after running a parallelizing compiler. Indeed, these compilers simplify the process considerably. However, running a parallelizing compiler does not necessarily achieve high performance. To achieve optimal performance from a program, many factors often have to be considered, including both machine-dependent and independent parameters, underlying algorithms, and so on. As shown in earlier studies, without proper consideration of these effects, the resulting performance may even degrade. We believe that there is room for a systematic way to provide users with guidelines and remedies that can be incorporated into a structured methodology.

Finally, there are some aspects of the shared memory model that make it hard to develop a general methodology. As mentioned above, the shared memory model offers an easy programming interface. This does not mean that obtaining good performance is easy as well. Unlike some other programming models, such as a message passing scheme where a programmer explicitly dictates synchronization and the sending and receiving of messages, important events such as multiple processors writing to a shared variable or false sharing are not readily visible to users in the shared memory model. Furthermore, these effects are hard to measure, if not impossible, without introducing significant overhead. Therefore, if the performance is not satisfactory, inexperienced programmers have difficulties finding what caused it. An increasing number of Non-Uniform Memory Access (NUMA) machines add more complexity because they introduce another variable to consider, namely memory latency. The shared memory programming model provides an easy, transparent means of expressing parallelism, but the price is that parallel performance optimization requires significant time and resources. A good methodology should be general enough to cover a variety of architectures and applications, but flexible enough to help programmers pinpoint the bottlenecks and resolve the problems in a specific situation.

Need for parallel programming environment

With the gaining momentum of the shared memory architecture, a methodology for the shared memory model is needed. The shared memory model provides a simple user interface; what we do not have now is an easier way to produce good performance. The methodology has to be a set of structured guidelines that encompass the whole process of program development, while providing useful tips with which users can navigate through difficult steps. As there are a variety of issues to deal with, it has to be general without losing its utility when applied to real environments.

A good methodology does not suffice without proper support from tools. Listing the tasks that need to be completed cannot be of much help to programmers if all those tasks are to be accomplished manually, with only basic utilities available on the target machine. During an optimization process, programmers face challenges in analysis and performance data management, incremental application of parallelization and optimization techniques, performance measurement and monitoring, and problem identification and devising remedies. Each of these tasks poses a significant burden on programmers, and without any help it can be a time-consuming task.

This leads to the need for supporting facilities for the underlying methodology. These facilities need to address the difficult and time-consuming steps specified by the methodology and provide functionality that accelerates these steps. Together, the methodology and the tools should be able to make up for the lack of experience among novice programmers wherever it is required most, such as analysis, diagnosis, and the formation of solutions. We acknowledge the many tools designed for the purpose of helping programmers, but the majority of them focus on specific aspects or environments in the program development process and are not based on a methodology. We believe that providing a more comprehensive and actively guiding toolset is possible with the current technology.

Another problem with the current tools is their accessibility. If useful tools cannot be easily found and used, the effort to develop such tools is wasted. Furthermore, as more diverse multiprocessors find their users, the compatibility issue has become an important factor in a tool's applicability. As the existing programming models converge to the standard (OpenMP), tool developers should consider this problem. With the emerging network technology and new portable languages such as Java, we already have the basic framework enabling more accessible parallel programming tools.

We present here our results on the subject of a parallel programming methodology and supporting tools. We have developed a methodology that has worked well under various environments and a set of tools that address difficult tasks in the shared memory model. Combining the methodology and the supporting tools we developed, programmers can now follow a structured approach toward optimal performance with the support of efficient tools. This optimization paradigm is available to a general audience through the Purdue University Network Computing Hub (PUNCH) and a Java Applet application, allowing our methodology and tool support to reach many users throughout the globe.

Thesis Organization

Chapter 2 gives a brief overview of the history and background of parallel programming, focusing on methodologies and programming tools. Chapter 3 presents our proposed methodology toward these issues, and the supporting tools developed for the methodology are summarized in Chapter 4. Chapter 5 discusses the evaluation process and the results. Chapter 6 concludes the thesis.

BACKGROUND

In this chapter, we examine previous efforts in developing programming methodologies and tools for parallel programming, targeted towards the two well-known programming models: the shared memory and the distributed memory models. Our research can be summarized as building a comprehensive programming environment by designing a good programming methodology, providing a toolset that supports it, and making our results available to a wide audience. From this perspective, we discuss general concepts in parallel programming, methodologies and tools proposed by other researchers, and previous efforts towards better accessible data repositories and parallel programming tools.

Parallel Programming Concepts, Terminology, and Notations

Parallelism exists in many forms. In this paper, we consider parallel processing in which multiple processors take part in executing a single program. Other parallel schemes, such as instruction-level parallelism or vector architectures, are not the target of our research. There are two major multiprocessor architecture categories: SIMD (Single Instruction, Multiple Data) and MIMD (Multiple Instruction, Multiple Data). Among these, we focus on the MIMD architecture, which is the most commonly used architecture these days.

The MIMD architecture consists of two types of machines: shared memory machines and distributed memory machines. The physical memory of the shared memory architecture may be shared or distributed, further dividing the architecture into the Uniform Memory Access (UMA) architecture and the Non-Uniform Memory Access (NUMA) architecture. Some distinguish them by using the terms Symmetric Multi-Processor (SMP) architecture and Distributed Shared Memory (DSM) architecture, respectively. DSM machines seek to resolve the limited capacity of shared memory buses, which prevents scaling to a large number of processors on a conventional SMP architecture. The figure below shows a typical flat SMP architecture with four processors. By contrast, the figure that follows it shows the architecture of a Cray Origin system, which is a DSM machine.

CPU 1, CPU 2, ..., CPU P, each with an external cache, connected to a shared main memory.

Fig. The structure of an SMP

From the programmer's point of view, there are two main models for programming on parallel machines: the shared memory programming model and the message passing programming model. There are other programming models that target a cluster of SMP machines or parallel logic environments, but they are not widely used and will not be discussed in detail.

The shared memory model and the message passing programming model share the same basic concept: threads. A single process forks multiple threads that independently execute portions of a program. The difference between these two is how threads access memory. In the shared memory model, multiple processors share a single memory space, so processors can read or write to the shared space regardless of where they actually reside. The notion of "shared" and "private" data becomes important. Shared data are visible to all processors participating in the parallel execution. Communication between processors takes place in the form of reading and writing to shared data. Private data, on the other hand, are local to each processor

Node boards, each holding two CPUs with external caches, a Hub ASIC with XIO, and memory and directory, connected by routers (R).

Fig. An Origin system: (a) topology and (b) structure of a single node board

and cannot be accessed by other processors.

By contrast, in the message passing scheme, processors do not share memory. All data are private to the processor that owns them. The message passing scheme requires each processor to be aware of which processor owns what data; thus, if there is a need to read or write a data item that belongs to another processor, the item has to be explicitly sent and received.

These two models provide high-level constructs for easier programming. The shared memory model offers directive languages with which a user specifies whether certain loops can be executed in parallel. Also, users can program directly with threads with the help of thread libraries. In the message passing model, parallel constructs typically come in the form of a library of functions. The library includes functions for sending and receiving messages, synchronization, initialization, and grouping. The Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) are important standards implemented in such libraries. The parallel programmer's task in the message passing model is to incorporate these functions into parallel algorithms. Programmers need to devise ways to split data, communicate, and synchronize, and then write or modify the program based on the design.
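As a rough illustration of this library style, the lines below are a minimal, hypothetical sketch of the kind of calls such a program is built from; the program name, message tag, and data value are invented, and error checking is omitted.

      PROGRAM EXCHANGE
C     Minimal sketch of message passing with MPI library calls:
C     every process learns its rank, then rank 0 sends one value
C     to rank 1, which receives it.
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, STATUS(MPI_STATUS_SIZE)
      DOUBLE PRECISION X
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      IF (RANK .EQ. 0) THEN
         X = 1.0D0
         CALL MPI_SEND(X, 1, MPI_DOUBLE_PRECISION, 1, 99,
     &                 MPI_COMM_WORLD, IERR)
      ELSE IF (RANK .EQ. 1) THEN
         CALL MPI_RECV(X, 1, MPI_DOUBLE_PRECISION, 0, 99,
     &                 MPI_COMM_WORLD, STATUS, IERR)
      END IF
      CALL MPI_FINALIZE(IERR)
      END

Even in this tiny sketch, the programmer decides explicitly which process owns the data and when it is sent and received, which is the bookkeeping burden discussed later in this chapter.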

Although the shared memory programming model is basically for programming on shared memory machines and the message passing model for programming on distributed memory machines, this mapping between programming models and architectures is not binding. Many modern parallel computers are compatible with both programming models, although their hardware design takes specifically one form or the other. There is still no general agreement as to which architecture and which programming model are more effective, and it is not likely that any one of them will prevail over the other in the near future.

Here, we focus on parallelization in the shared memory model. Although we view parallel program development in terms of programming models, we will keep in mind the effects of specific hardware implementations on program performance, as various machine-dependent parameters play significant roles in program execution. We would like our approach to parallel programming to address some of these hardware-related issues.

Parallelization in the Shared Memory Programming Model

Introduction

The focus of the shared memory programming model is on loops. Loops are the most common means of expressing repetitive computing patterns in a program. The concept of thread execution does not restrict parallelism to the loop level, but the high-level directive languages provided by the shared memory programming model mainly deal with ways to specify parallel loop execution. By exploiting parallelism among loop iterations, the shared memory model often achieves a significant performance gain.

In the shared memory programming model, a programmer specifies parallel execution by annotating the source code with directives. Typically, directives consist of one or more lines indicating serial or parallel execution, variable types (shared, private, and reduction), the scheduling scheme, and a conditional construct (IF directive). Communication and synchronization among processors are implicit inside parallel sections, meaning that those operations are transparent and do not show up in the source code. Also, parallelization is localized; in other words, parallelizing one section of code has no logical effect on the rest of the program (although cache effects can affect the performance of the code outside the parallel section). Transparent synchronization and localized parallel sections make the shared memory programming model an easy scheme to work with, especially for inexperienced programmers. The figure below shows a portion of code taken from an example program that computes pi, before and after parallelization using OpenMP. Lines starting with !$OMP indicate directives. The directive PARALLEL DO indicates that the loop has no loop-carried dependences and may be executed in parallel. The directives PRIVATE and SHARED tell the compiler that the variables in the following parentheses are private or shared, respectively. The directive REDUCTION(+:SUM) indicates that the variable SUM is a summation reduction variable and requires special care for parallel execution. Examining the details of OpenMP is beyond the scope of this thesis; more information can be found in the literature.
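The scheduling and conditional clauses mentioned above follow the same directive style. The lines below are a minimal sketch on a hypothetical loop (the trip-count threshold of 1000 and the chunk size of 16 are arbitrary choices, not values from the example program):

C     Hypothetical loop: parallelize only when the trip count is large
C     enough, and hand out iterations to processors in chunks of 16.
!$OMP PARALLEL DO IF(N .GT. 1000) SCHEDULE(DYNAMIC,16)
!$OMP&  PRIVATE(T) SHARED(A,B,N)
      DO I = 1, N
         T = 2.0*A(I)
         B(I) = T + A(I)
      ENDDO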

By narrowing the main concern to loops, the shared memory model has enabled impressive advances in parallelization and optimization techniques. Well-known techniques for parallelization include advanced data dependence analysis, induction variable substitution, reduction variable recognition, privatization, and so on. In addition, there are locality enhancement techniques that specifically target the shared memory architecture, such as blocking/tiling and load balancing. Most of these techniques have been incorporated into modern parallelizers, which will be presented in the section on automatic parallelization below.
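As a small illustration of one of these techniques, the sketch below shows scalar privatization applied to a hypothetical loop (the array names and the temporary X are invented for this example; the transformations actually used in this thesis are presented in a later chapter):

C     (a) Original loop: the temporary X would be shared by default,
C         so concurrent iterations would race on it.
      DO I = 1, N
         X = A(I) + B(I)
         C(I) = X*X
      ENDDO

C     (b) After scalar privatization: each processor works on its own
C         copy of X, and the iterations become independent.
!$OMP PARALLEL DO PRIVATE(X)
      DO I = 1, N
         X = A(I) + B(I)
         C(I) = X*X
      ENDDO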

History of parallel shared memory directives

As mentioned in the introduction, until the late 1990s the shared memory model had suffered from the lack of a standard language. Computers from different vendors came with their own sets of directives for expressing parallelism, and compilers did not understand any directives other than their own. There had been a few initiatives to resolve this problem. An informal industry group called the Parallel Computing Forum (PCF) was formed to address the issue of standardizing loop parallelism

(a) Original sequential code:

      W = 1.0/N
      SUM = 0.0
      DO I = 1, N
         X = W*(I-0.5)
         SUM = SUM + F(X)
      ENDDO
      PI = W*SUM

(b) After transformation:

      W = 1.0/N
      SUM = 0.0
!$OMP PARALLEL DO PRIVATE(X), SHARED(W),
!$OMP&  REDUCTION(+:SUM)
      DO I = 1, N
         X = W*(I-0.5)
         SUM = SUM + F(X)
      ENDDO
      PI = W*SUM

Fig. Simple parallelization with OpenMP

in Fortran. The group was active for three years before publishing its final report. After PCF was dissolved, a subcommittee (X3H5) authorized by ANSI was formed to establish an independent language model for shared memory programming in Fortran and C. However, interest was eventually lost, and the proposed standards were abandoned, leaving behind the last revision of the X3H5 standards document (Revision M). There have also been commercial portable directive sets, such as the KAP/Pro directive set from Kuck and Associates (KAI). However, since native compilers only support their own directives, portability could only be achieved by transforming directives into thread-based code and compiling the resulting code with native compilers. Overall, all these efforts failed to gain attention from the general parallel computing community.

In 1997, spurred by the rekindled popularity of shared memory machines, Silicon Graphics, Inc. (SGI) and several major high performance computer vendors initiated an effort to establish a new standard directive language. The proposed directive language, named OpenMP, embraces the previous standardization efforts and adds a few new concepts for more expressiveness. Unlike previous attempts, this is an industry-wide effort to resolve a practical problem, so it is likely to result in a successful standard that is supported by the majority of new and existing high performance computers. It seems safe to say that OpenMP ensures the future of the shared memory architecture and its programming model by adding portability across platforms.

Shared memory program execution

Once an executable is generated by compiling a program with directives, programmers can run it as they would run any sequential program. In fact, an OpenMP program starts out as a sequential program and engages other processors as OpenMP parallel constructs are encountered. The user has a number of controls over parallel execution, typically in the form of environment variables. The most important one of them is the environment variable that sets the number of processors participating in the execution of parallel code sections. For programmers who are used to the message passing programming model, it is important to note that there are no configuration scripts or setups necessary to execute an OpenMP program.
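For instance, OpenMP implementations commonly take the thread count from the OMP_NUM_THREADS environment variable; the fragment below is a minimal sketch (a hypothetical program, not from this thesis) showing that the standard runtime routines can also set and query it from within the code.

      PROGRAM HELLO
C     Minimal sketch: set the thread count from the program and let
C     each thread report its identifier inside a parallel region.
      INTEGER OMP_GET_THREAD_NUM
      EXTERNAL OMP_GET_THREAD_NUM
      CALL OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL
      PRINT *, 'hello from thread ', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL
      END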

Automatic parallelization

As the techniques for identifying parallelism and parallelizing loops advance, it is a natural course of action to incorporate them into a compiler so that the whole process takes place without the programmer's involvement. The apparent advantage of using a parallelizing compiler is that the conversion of a given serial program into parallel form is done mechanically by the tool, relieving programmers from worrying about parallelization details. As the impact of parallelizing compilers is significant, especially for the shared memory programming model, a reasonable methodology should consider their role in parallel program development. Thus, we briefly discuss the general aspects of parallelizing compilers in this section.

The effort to automate the parallelization process started with the vectorizers of the 1970s and 1980s. The most important vectorizers among them are the Parafrase compiler from the University of Illinois, the PFC parallelizing compiler developed at Rice University, and the PTRAN compiler from IBM's T. J. Watson Research Laboratory. They laid the foundation for the modern parallelizers. Most of the general techniques for vectorizing arrays within loops remain in the parallelizing compilers of today.

Today, all shared memory multiprocessor machines are equipped with their own parallelizers, and there have been several efforts from academia to create a new generation of state-of-the-art parallelizing compilers for the shared memory programming model. Two of the noticeable recent efforts in this field are the Polaris parallelizing compiler, developed at the University of Illinois and Purdue University, and the SUIF (Stanford University Intermediate Format) parallelizing compiler from Stanford University. They were both built upon their own infrastructures (bases for Polaris and kernels for SUIF), which were designed to help researchers working on compiler technology. The focus of the SUIF compiler is on parallelizing the C language. With such techniques as global data and computation decomposition, communication optimization, array privatization, interprocedural parallelization, and pointer analysis, SUIF boasts an impressive performance gain on many programs.

Polaris, as a compiler, includes advanced capabilities for array privatization, symbolic and nonlinear data dependence testing, idiom recognition, interprocedural analysis, and symbolic program analysis. The Polaris infrastructure provides useful facilities for analyzing and manipulating Fortran programs, which can provide useful information regarding the program structure and its potential parallelism. Polaris has played a major role in our previous efforts in methodology and tool research and will continue to be a major part of our future research. The details of the role of Polaris in our research will be discussed in Chapter 4.

Parallelization in the Message Passing Programming Model

MPI and PVM

Both MPI and PVM provide message passing infrastructures for parallel programs running on distributed memory machines. Ever since the introduction of the first distributed memory machine, the Cosmic Cube from Caltech in the early 1980s, researchers and programmers who saw the potential of distributed memory computers had struggled amid conflicting supporting interfaces, until Oak Ridge National Laboratory's PVM system and a joint US-Europe initiative for a standard message passing interface, eventually named MPI, arrived on the scene. These two interfaces were accepted by the majority of people involved in parallel computing on distributed memory machines and successfully ported to a variety of multiprocessor systems, including shared memory machines.

These two systems take the form of libraries rather than separate language constructs. The libraries consist of functions and subroutines for synchronization and for sending and receiving messages across processors. Users are required to insert calls to these routines to control the parallel execution of a program. This required programmers to change their way of thinking: they had to be the masters that explicitly take care of data distribution, communication, and other parallelization details. Nevertheless, the performance on some distributed memory machines was impressive.

The message passing programming model is well suited for distributed systems with a large number of processors. By carefully controlling the interaction among processors, the performance of some applications that do not require heavy communication is able to scale well as the number of processors increases. Another nice aspect of PVM and MPI is that they enable a cluster of heterogeneous uniprocessor systems to behave like one supercomputer. Good performance of the message passing model, however, often relies on one critical factor: network latency. The time to transfer a message from one processor to another ranges from a hundred to a million clock cycles. If the application at hand requires frequent communication among participating processors, the resulting performance gain can be seriously limited even on the fastest network today, let alone on a cluster of uniprocessors connected by simple network cables. This problem has spawned numerous research efforts regarding data parallelism and work distribution on distributed memory machines, which we will not discuss any further.

Another drawback of the message passing interface is its aforementioned low-level programming style. The amount of bookkeeping for data transfer and synchronization can amount to an intolerable level, and it is all up to the programmer to ensure correct execution. Furthermore, the tricks and tweaks needed to obtain high performance may be overwhelming to inexperienced programmers. Even worse, in this programming model the effort to parallelize a program generally starts from analyzing the underlying physics, making it difficult for programmers other than the original authors to parallelize a program. Overall, learning these interfaces is not particularly difficult, but designing a parallel program to achieve good performance is.

HPF

Many people thought that the message passing programming style is at too low a level to appeal to the general audience. For this reason, a group of researchers at Rice University attempted to provide higher level constructs for programming on distributed memory machines. Their results are Fortran D and its successor, High Performance Fortran (HPF). These are sets of extensions to Fortran. The HPF programming model looks similar to the shared memory model in that it focuses on loop parallelism controlled by directives added in front of loops. In addition, it provides directives for data distribution onto distributed memory systems. HPF translators generate a message passing program based on these directives. Compared to message passing functions, these directives let programmers specify array distribution without burdening them with tedious bookkeeping details.
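As a rough illustration of this directive style, the lines below sketch HPF data distribution directives for a hypothetical array layout (the names, sizes, and processor count are invented); the translator derives the corresponding message passing code from such annotations.

C     Hypothetical arrays distributed block-wise over four abstract
C     processors; B is kept aligned with A so that B(I) and A(I)
C     always reside on the same processor.
      REAL A(1000), B(1000)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ ALIGN B(I) WITH A(I)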

However, compared to the shared memory programming model, HPF lacks important constructs such as loop-private arrays, and, most of all, the performance of HPF programs is not as good as that of programs written directly in MPI or PVM. So far, only a handful of compilers and systems fully support HPF.

Visual parallel programming systems

A different approach to simplifying the user interface of the message passing programming model is to achieve an even higher level of abstraction by adopting the visual programming model of such systems as Visual C++ and Visual BASIC. The goal of such research efforts is to develop visual programming environments in which programmers use nodes and arcs to design and implement parallel applications. They opt for a more efficient way of designing and implementing parallel programs; performance evaluation and tuning are not their main concern. Visual programming systems such as HeNCE, Enterprise, CODE, GRAPNEL, PRIO, and Visper belong to this category.

Contrary to the traditional coding model, these systems call for a different paradigm for writing parallel programs. Conventional constructs of a programming language are replaced with visual entities, although programmers are often required to provide some form of textual description to specify the details needed for the intended functionality. These systems comprise not only new programming models but also supporting tools that actually allow programmers to use them. These tools usually come with a set of templates to help programmers in designing parallel programs. A screenshot of the CODE visual parallel programming system is shown in the figure below.

The advantage of these visual parallel programming systems is an efficient representation of complex program structures and parallel constructs. Generally, programmers have less difficulty in grasping the parallel nature of programs using these tools. In addition, they reduce debugging time by providing utilities for automatic translation of parallel constructs. However, the tasks of splitting data and coordinating communication are still left to programmers.

Fig. Screenshot of the CODE visual programming system

Parallel Programming and Optimization Methodology

As explained earlier, the parallel constructs provided by the shared memory programming model and the message passing programming model take significantly different forms. Hence, the corresponding programming methodologies have taken distinct paths.

Shared memory programming methodology

In the shared memory model, parallelism is specified with directives that have no effect on program semantics. Tasks are distributed based on loop iterations, and the key aspects of parallelizing shared memory programs are to detect loop-carried data dependences and to identify shared and private data in each iteration. This can be done by static, program-level analysis. Therefore, the methodology for the shared memory programming model, at the highest level, is to examine the loops in a serial program code region, detect parallelism, and determine shared and private variables. There are publications and lecture notes addressing programming on shared memory machines. They present concepts and notations, explain directives, and discuss parallelization techniques and dependence test criteria. However, they do not offer an overall strategy or a procedural methodology for performance optimization. One exception is a document specifically aimed at optimization for Origin machines, which devotes a section to tuning parallel code for the Origin. This section consists of architecture-specific techniques that are useful in further improving parallel performance. However, compared to the detailed single-processor tuning description in the same text, parallel performance tuning only serves to complement the single-processor case. Also, the document lacks performance problem definitions and a performance evaluation description for parallel programs.
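To make the loop-level focus concrete, the sketch below contrasts a loop that carries a dependence across iterations with one whose iterations are independent; both loops are hypothetical and serve only to illustrate the dependence criteria mentioned above.

C     This loop carries a dependence: iteration I reads A(I-1), which
C     iteration I-1 writes, so its iterations cannot simply be marked
C     parallel.
      DO I = 2, N
         A(I) = A(I-1) + B(I)
      ENDDO

C     This loop has independent iterations and can be annotated as a
C     parallel loop once A, B, and C are identified as shared.
      DO I = 1, N
         A(I) = B(I) + C(I)
      ENDDO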

An alternative way of expressing programs in the shared memory model is to use threads. In this scheme, the programmer packages program sections that can execute concurrently into subroutines and spawns these subroutines as parallel threads. Thread parallelism is at a lower level than directive parallelism; in fact, compilers will translate a directive-parallel program into a thread-parallel program as an intermediate compilation step. Advanced parallel programmers sometimes prefer thread parallelism because it can offer more control over parallel program execution. Usually, this comes at the cost of a higher programming effort. A brief description of shared memory programming with multithreading is given in a lecture note.

Message passing programming methodology

Although HPF provides a directive-based programming model for the message passing model, the programming methodologies found in the literature focus on the application-level approach using library functions. General methodologies for programming with message passing libraries have been described in book form. In one such book, the authors employ the application-level approach (application-driven development, in that book's terms), in which they first categorize a given problem as one of five classes: synchronous applications, loosely synchronous applications, embarrassingly parallel applications, asynchronous problems, and metaproblems. To this end, the book provides the reader with many example algorithms common in scientific computing. Based on the category of the target problem, the book lists possible parallel algorithms and suitable parallel machines. In another, parallel program design consists of four stages: partitioning, communication, agglomeration, and mapping. Partitioning and communication are the tasks of distributing data and coordinating task execution, respectively. In the agglomeration stage, the combined parallel structures, data distribution, and communication are evaluated; if necessary, smaller tasks are combined into a larger task to improve performance or to reduce development cost. Finally, in the mapping stage, each task is assigned to a processor in a manner that attempts to satisfy the design goal. Since parallel constructs are integrated into the program source in the message passing model, program design becomes an important part of parallel programming. This book also gives a detailed description of the evaluation process for parallel performance.

There are two different approaches to abstracting parallel programming in the message passing model using mathematical notations. One is based on parallel program archetypes, or programming paradigms. These are abstract notations that combine computation structure, parallelization strategy, and templates for dataflow and communication. Programmers are given a set of parallel program archetypes or programming paradigms. They then identify an appropriate element within the set that matches the problem they are trying to solve. Finally, they implement the actual program using the parallel structure or the template stated by that element. Using this methodology, programmers can save the time and effort of designing an appropriate parallel structure for a given problem; once they identify the right parallel program archetype or programming paradigm, the implementation becomes simpler. This scheme works well in the case of scientific computing, in which a set of well-known algorithms is used across many applications. The other approach states that programmers begin with a conceptual or formal description of a given problem and find an appropriate parallel structure for the algorithm through a series of suggested analysis processes. This method is more algorithm-specific, and its applicability is even narrower.

Tools

In this section, we briefly introduce the tools that have been developed to help programmers in programming and tuning parallel programs. As the task of developing a well-performing parallel program is very challenging, numerous tools have existed to help programmers. Some have been made public for the general audience, and some were used only within small research groups. Among the public tools, only a few gained attention from the parallel computing community, and even fewer were actually used by other researchers and programmers.

We present here some of the major efforts in developing parallel programming tools. Due to the sheer number of tools, we have divided them into four categories based on their functionality: program development and optimization, instrumentation, performance visualization and evaluation, and guidance. We will examine their advantages and shortcomings and discuss possible improvements. It should be noted that in this section we do not cover tools designed to offer assistance in other aspects of developing parallel programs, such as serial program coding and parallel program debugging. There are numerous general program coding and editing tools. Some of the efforts in parallel program debugging include the portable debugger for parallel and distributed programs, Panorama, TotalView, and Assure. For the tools relevant to our research, we present a detailed comparison later in Chapter 5.

Program development and optimization

In this section, we focus on tools specifically designed for program parallelization and optimization. The objective of these tools is to improve the performance of existing programs by helping users apply various techniques. In addition to the support for manual modifications, these tools generally have automated optimization utilities to make it easy for programmers to apply the techniques to selected parts of a program. We begin with the tools for the shared memory model.

Faust is an ambitious project started at the Center for Supercomputing Research and Development (CSRD) at the University of Illinois in the late 1980s. The tool supports many aspects of programming parallel machines, providing facilities for project database management, automatic program restructuring and editing, graphic browsers for call graphs, and an event display tool for performance evaluation. It is an environment that covers a wide range of parallel programming stages, such as coding, parallelization, and performance tuning. Its emphasis on project management allows it to support a major portion of the entire program development cycle.

The Start/Pat parallel programming toolkit was developed at Georgia Tech to support programming and debugging of parallel programs. It consists of a static analyzer (Start) and an interactive parallelizer (Pat). Its main concern is parallelization; general code optimization is not supported.

ParaScope is an extension of the R^n programming environment developed at Rice University. Like Start/Pat, the focus of ParaScope is automatic or interactive restructuring of sequential programs into parallel form. It integrates an editor, a compiler, and a parallel debugger. The automatic transformation is conducted based on the data dependence information collected by their previous tool, PTool. It provides convenient facilities for parallelization and code transformation.

Faust, Start/Pat, and ParaScope are important milestones in the effort toward interactive optimization tools for parallel programs. Unfortunately, the developers have stopped maintaining these tools, and their target architectures or programming models have been abandoned. Nonetheless, their pioneering work laid the groundwork for the current generation of interactive optimizers.

PTOPP (Practical Tools for Optimizing Parallel Programs) is a set of tools for efficient optimization and parallelization developed at the Center for Supercomputing Research and Development (CSRD). It was designed based on the experience gained through the optimization of applications for the Alliant FX and the Cedar machine. The toolset stays at the UNIX operating system level and provides some interaction through facilities built upon the Emacs editor. Facilities are provided for execution time analysis, convenient database and file management of performance data, and a flexible interface with extensive configurability. The PTOPP toolset does not include an interactive parallelization utility, but the Polaris compiler can be invoked through its interface.

Our research effort actually started out by expanding the PTOPP utilities to integrate static analysis data from a parallelizing compiler as well as simulation and performance data, which were missing from the previous version. PTOPP is a set of useful tools that help make parallel programming easier, but the core need of novice programmers, namely their lack of experience, has not been addressed in this project.

SUIF Explorer is an interactive optimization tool developed at Stanford University. It utilizes the SUIF compiler infrastructure for automatic parallelization. The tool comes with a basic performance evaluation facility: based on the profile data generated from program runs, it can sort execution times to identify dominant code segments. In addition, it displays the static analysis data gathered from executing the SUIF parallelizing compiler. Perhaps the highlight of the tool is its program slicing capability. Using this technique, SUIF Explorer allows users to select certain lines in a program source and displays the sections of code that may be affected by a change made to those lines. This utility, combined with the automatic parallelization module, provides an interactive way of tackling the task of tuning parallel programs.

Visual KAP for OpenMP is a commercial interactive tool from Kuck and Associates, Inc. It performs automatic parallelization on program files, but it lacks support for manual optimization and finer grain tuning. FORGExplorer is another commercial interactive parallelization tool, from Applied Parallel Research, Inc. Like most of the tools presented in this section, FORGExplorer is capable of automatic parallelization of code sections while presenting users with static analysis data such as call graphs and control and data flow diagrams.

There are a couple of important optimization tools for the message passing programming model. The Fortran D Editor is a graphical editor for Fortran D that provides users with information on the parallelism and communications in a program. It obtains data dependence, communication, and data layout information through a direct interface to the Fortran D compiler and displays the information during editing sessions. This is useful knowledge in developing message passing programs, but the Fortran D Editor lacks support for automatic parallelization. Converting directive-based data parallel languages to message passing programs is challenging as it is, and automatic parallelization of sequential programs with data parallel directives has not been successful.

The same applies to CAPTools. CAPTools is a programming tool for the message passing model from the University of Greenwich in London. The parallelization process here is semi-automatic. Through a series of user interactions, users make their decisions on which sections should be parallelized and how to distribute work and data. CAPTools constructs a data dependence graph for the target section and uses this graph in the subsequent automatic parallelization phase. If CAPTools needs more information from users, it asks questions through the user interface. Recently, a new frontend for the shared memory model using OpenMP has been added, but the details are not available as of this writing.

Instrumentation

Instrumentation is a means to obtain performance data and is usually part of the functionality of most visualization and evaluation tools. In this section, we examine general mechanisms for instrumentation in the shared memory and message passing models and discuss a few instrumentation utilities that deserve special attention.

The main concern in parallel program instrumentation varies depending on the programming paradigm. In the shared memory model, where communication between processors is fast and frequent, reducing the instrumentation overhead is an important issue. On the message passing side, an often overwhelming amount of performance data becomes a problem. To this end, some researchers have incorporated a real-time summation utility or non-uniform instrumentation, which will be discussed later in this section. Both of these issues conflict with the ultimate goal of instrumentation: obtaining as much performance data as possible.

As mentioned in an earlier chapter, detailed instrumentation of shared memory programs is not feasible without significant perturbation. Hence, most instrumentation utilities rely on simple timing information, and the task of shared memory program instrumentation is mainly one of inserting calls to timing routines. The problem that often arises is that timing routine calls in nested code regions cause significant overhead.

At its foundation, the Polaris compiler is a parallelizing compiler, but it provides a powerful instrumentation utility for shared memory programs. Polaris offers several different strategies for instrumentation that allow users to control the amount and the targets of instrumentation. Recently, a new library that supports hardware counters has been made compatible with the Polaris instrumentation utility. Other optimization tools capable of instrumentation include SUIF Explorer, FORGExplorer, and GuideView.

In the message passing programming model, the data needed for visualization and animation are traces, and several trace formats exist: the IBM PE tracing format, the PVM tracing format, the ParaGraph format, Pablo's SDDF (Self-Defining Data Format), and the VAMPIR format are some examples. The difference between these is mainly the size of the trace files. Most visualization tools for the message passing model introduced in the next section use one of these well-known formats.

Since the parallel constructs in the message passing model are libraries of functions, instrumentation takes place by intercepting these calls. For additional information, a series of checkpoints are inserted for status feedback. Instrumenting these checkpoints is relatively simple, but the resulting trace data may be unmanageably large. AIMS tries to resolve this problem by automatically identifying important regions. Paradyn's approach is unique in that its instrumentation and monitoring utility enables dynamically adjustable instrumentation by providing an online summarization facility. VAMPIR offers more compact trace formats. More details on AIMS, Paradyn, and VAMPIR are given in the next section.

The developers of TAU at the University of Oregon chose a different approach to program instrumentation. TAU is a toolset designed for profiling, tracing, and visualizing parallel program performance. TAU's instrumentation utility can generate either timing profiles or trace files, depending on the target application. When timing profiles are generated, static viewers present the summary information; for trace files, a trace visualizer is used. The instrumentation library has been developed for multiple languages, such as C, C++, Fortran, HPF, and Java, thus significantly broadening its applicability. However, the instrumentation process is done manually: users need to specify which functions should be instrumented and associate them with a set of groups. For very large programs this can be very cumbersome, especially when users have little knowledge of the program at hand.

Performance visualization and evaluation

Performance visualization refers to the transformation of numeric performance data into meaningful graphical representations. Visualization helps users gain insight into the behavior of parallel programs so that they can better understand the programs and improve their performance. Performance visualization is often a stepping stone to performance evaluation and problem identification. Performance visualization can be either dynamic or static. Dynamic visualization tools use graphical animation to illustrate the dynamic behavior of the program under consideration; the animation can take place either during program execution or after program termination through trace simulation. Static visualization displays a summary of performance characteristics in charts and graphs.

GuideView, from the KAP/Pro toolset, is a typical static visualization tool. However, it targets the shared memory model and does not use traces; an instrumented runtime library generates and summarizes timing information. Using charts and graphs, GuideView illustrates what each processor is doing at various levels of detail using a hierarchical summary. Its intuitive color-coded displays make it easy to assess the target application's performance. However, due to the high overheads incurred by the instrumentation, the resulting graphs may not reflect accurate real-time performance. The Fortran D Editor, SUIF Explorer, FORGExplorer, and DEEP/MPI are also capable of graphical presentation of performance data, but their uses are limited to simple displays of the execution time of code blocks. DEEP/MPI targets MPI programs but does not provide a display of traces; instead, it shows resource usage and timing charts.

RACY, from the TAU project, has performance viewing utilities consisting of a tabularized text report and several static charts. The information displayed consists mostly of timing profiles. As mentioned above, the TAU instrumentation utility is capable of generating trace files for message passing programs. Instead of writing their own trace viewer, the developers decided to use VAMPIR, which is also discussed in this section.

As for the static display of traces, NTV summarizes traces from message passing program execution and presents users with summary charts and timeline graphs, as shown in the figure below. This type of graph helps users understand load distribution, stalls, and the communication structure of the program. PMA, from the Annai Tool Environment, is a graphic utility similar to NTV; Annai integrates this information with its source viewer for easier reference. XMPI, from the LAM project, offers a similar view, although its main goal is the debugging of MPI programs. TraceView is a pioneering work in timeline display, and it generates timeline graphs for both shared memory and message passing programs through different runtime libraries; in both cases, trace files are used. However, its graphics are not as refined as those listed above, and the displayed data for shared memory programs are limited due to the nature of the shared memory programming model.

Fig.: The timeline graph from NTV

ParaGraph, Upshot, AIMS, Scope, and VAMPIR are tools for animated post-mortem visualization of program behavior based on trace simulation. The advantage of trace simulation is that the speed of the graphic animation can be adjusted (with the exception of ParaGraph), so that events that are difficult to observe in real time can be slowed down for better understanding. ParaGraph was a pioneering effort in performance visualization from the University of Illinois. The tool is visually elaborate, but its practical value is limited by a few missing features, such as the ability to set the speed of replay, and by the lack of appropriate annotation. Furthermore, the target and the framework of the graphic presentation are predetermined by the developers, so users have little freedom to view other aspects of program behavior from different perspectives. Upshot has a feature to adjust the speed, but it does not have features such as a dynamic call graph or a communication diagram. AIMS is an automated instrumentation and monitoring system from NASA; it displays dynamic program behavior through animated and summary views. AIMS adds a modeling module that provides a means of estimating how the program would behave if the execution environment were modified. A screenshot of AIMS in use is shown below. The goal of Scope is extensibility: Scope allows users more freedom to arrange performance data into new displays. VAMPIR adds a zoom utility, allowing users to examine performance data at varying levels of detail. All these tools target message passing programs.

Pablo, Paradyn, XPVM, PVaniM, and Falcon can animate the behavior of a program while it is running. This monitoring capability is achieved by periodically updating graphs and charts with newly available runtime data from the executing application. However, events that occur frequently for a very short period of time cannot be traced and displayed. For this reason, XPVM and PVaniM have utilities to play back the generated traces, and the other tools generate summary statistics. Even so, visualizing important events during the execution of a shared memory program in an animated fashion is not feasible, in the sense that these events, such as writes to shared variables, happen too frequently and too many times. These tools visualize the events during message passing program execution.

Pablo, a performance evaluation tool developed at the University of Illinois, is perhaps the most successful tool currently in use. It uses adaptive instrumentation control to reduce the perturbation caused by instrumentation as it executes. The resulting trace files are used to produce graphical displays of the program performance. Pablo also has a sonification utility and 3-D support that convey more information to its users through a multimedia experience. The combined effort with the Fortran D Editor now allows Pablo to integrate performance data with a program development environment. However, the lack of appropriate annotation and a complex visual interface impose a steep learning curve on users. A snapshot of Pablo's graphical data presentation is shown below.

Fig.: The graphs generated by AIMS

The Paradyn Parallel Performance Measurement Tool, developed at the University of Wisconsin at Madison, is characterized by an instrumentation scheme that dynamically controls overheads by monitoring the cost of data collection. The basic paradigm of instrumentation, execution, and visualization is the same as that of Pablo, but due to the dynamic nature of its instrumentation scheme, the tool is particularly useful when the application at hand is very large or long-running. The tool also contains a visualization facility that generates real-time tables and histograms, although it is not as extensive as that of Pablo.

Fig.: The graphs generated by Pablo

XPVM is a graphical user interface for PVM that displays both real-time and post-mortem animations of message traffic and machine utilization by PVM applications. While an application is running, XPVM displays a space-time diagram of the parallel tasks, showing when they are computing, communicating, or idle. XPVM stores events in a trace file that can be replayed and stopped to analyze the behavior of a completed execution.

PVaniM specifically targets network computing environments. The performance factors that are unique to networked environments require careful consideration in performance visualization. PVaniM addresses these network issues, such as possible heterogeneity, low network bandwidth, and clock skew, in its design. Its playback utility also adds to its usefulness by allowing users to examine details that may have been missed during real-time monitoring.

The principal aspects of Falcon are its abstractions and accompanying tools for the analysis of application-specific program information and online steering. The term application-specific means that users choose which aspects of dynamic behavior to monitor and steer, beyond a predetermined set of parameters. In addition, Falcon provides support for the online graphical display of the information being monitored. The Falcon developers used the POLKA system for its animated and static performance views.

The metrics supported by these animation tools include CPU utilization, memory usage, floating point operations, message size, and so on. They help programmers identify the bottlenecks in the execution of message passing programs. The advantage of these types of tools lies in providing different views of program execution by visualizing the temporal behavior of the target program. When processor communication is relatively sparse and visible, as in the message passing programming model, this is particularly valuable, and bottleneck identification leads easily to well-known techniques to resolve the problems, such as a different data distribution, combining messages, algorithm modification, and so on.

The ability to monitor real-time performance presents opportunities for performance steering. To this end, the developers of Pablo, Paradyn, PVaniM, and Falcon have implemented performance steering facilities. In fact, the main focus of Falcon has been performance steering from the beginning of its development. Typically, users provide or select a set of parameters that they want to manipulate during program execution, and they are able to do so at various checkpoints inserted into the target program. Performance steering is not our concern in this research, so we will not go into any more detail.

Finally, CUMULVS takes a different approach to performance visualization. As an extension to PVM, CUMULVS is a library of functions that users can insert into programs to visualize the behavior of a parallel program in real time. The instrumentation task is shifted to programmers, but this gives users the flexibility to choose what type of data they want to view. The CUMULVS data collection utility can be used with several frontend visualization systems. CUMULVS also supports program steering through checkpoints.

Guidance

The term performance guidance is used in many different contexts in the parallel programming field. Generally, it means taking a more active role in helping programmers overcome the obstacles in tuning programs. With so many available tools for instrumentation and visualization of raw data, the task of extracting meaningful information is becoming increasingly burdensome. In this section, we discuss several tools that support this functionality. Accommodating novice programmers and automating the performance evaluation process are important issues in parallel programming, and they are among the focuses of our research. However, we found only a few efforts addressing these subjects.

SUIF Explorer's Parallelization Guru bases its analysis on two metrics: parallelism coverage and parallelism granularity. These metrics are computed and updated when programmers make changes to a program and run it. It sorts profile data in decreasing order to bring the programmer's attention to the most time-consuming sections of the program. It is also capable of analyzing data dependence information and highlighting the sections that need to be examined by its users.

Paradyn's Performance Consultant discovers performance problems by searching through the space defined by its own search model. The search process is fully automatic, but manual refinements to direct the search are possible as well. The result is presented to the users through graphical displays. DEEP/MPI features a similar advisor that gives textual information about message passing program performance. The DEEP/MPI advisor's analysis is hard-coded, and the analysis is limited to subroutines or functions.

PPA proposes a different approach to tuning message passing programs. Unlike the Parallelization Guru, the Performance Consultant, and DEEP/MPI, which base their analysis on runtime data and traces, PPA analyzes a program source and uses a deductive framework to derive the algorithmic concept from the program structure. Compared to other programming tools, the suggestions provided by PPA are more detailed and assertive; the solution for one published example was to replace an inefficient algorithm.

The Parallelization Guru, the Performance Consultant, and DEEP/MPI basically tell the user where the problem is, whereas the expert system in PPA takes on the role of a programming environment, a step toward an active guiding system. However, the knowledge base for the expert system relies on an understanding of the underlying algorithm based on pattern matching, and having an expert system that understands the full variety of parallel algorithms is nearly impossible. Because of the complexity required, problem identification is done by other tools and hand analysis, and the suggestions provided by the tool consider only parallel constructs, which also limits its usage. Because of its lack of performance evaluation and tuning support, PPA cannot be considered a programming environment, but its effort toward a performance guiding tool is worth noting.

Utilizing Web Resources for Parallel Programming

One of our objectives is to reach a general audience with our methodology, tools, and optimization study results. We have taken the Internet computing approach to address this issue. Thus, we focus our attention on previous efforts that attempted to utilize the Web to provide a programming environment and to establish online repositories.

Many of the systems and technologies that currently allow computing on the Web support a single tool or a relatively small set of tools. They include PUNCH, MOL, NetSolve, Ninf, RCS, VNC, WinFrame, Globus, and Legion. More detailed descriptions of these systems are found in the literature.

As for benchmark repositories, several Web tools offer performance numbers for various benchmarks. Typically, the presented data are timing numbers, such as overall program performance or specific timings of communication in message passing systems. Extensive characteristics of the measured programs are usually not part of the online databases; the user has to obtain from separate sources the information that is often necessary for interpreting the numbers. Furthermore, these repositories do not provide information gathered by other tools, such as compilers or simulators, and consequently they do not support the comparison or the combined presentation of performance aspects and program characteristics.

Our effort to resolve these problems with the previous research unfolds in two ways. First, we have used PUNCH, a network computing infrastructure, to construct an integrated, Web-accessible, and efficient parallel programming tool environment. PUNCH allows remote users to execute unmodified tools on its resource nodes. More detailed descriptions of PUNCH are given in a later section. Second, our results on performance enhancement with various applications have been made accessible through an Applet-based browser, which allows not only examining the raw data but also manipulating and reasoning about the information. This facility is explained in more detail in a later section.

Conclusions

Thus far, we have studied general concepts and paradigms in parallel programming. We have also looked at general trends in parallel programming models and supporting tools. We have learned that there have been numerous attempts to aid parallel programmers through various tools. However, these tools are generally not based on a programming methodology and tend to focus on one specific aspect of the optimization process. In addition, a brief discussion has been given on enhancing tool accessibility via the Web.

It seems that tools supporting the shared memory model place more emphasis on static analysis and automatic code transformation, while those supporting the message passing model mainly focus on performance visualization. This is not surprising, considering that the shared memory model enables structured, program-level parallelism but instrumentation is expensive, whereas in the message passing model events are relatively explicit and sparse but automatic parallelization is difficult.

Several tools have attempted an integration of different aspects of parallel programming. Pablo and the Fortran D Editor opt for the integration of program optimization and performance visualization, but their visualization utilities, although highly versatile, are difficult to comprehend and offer little help to programmers in deductive reasoning. The lack of automatic parallelization capability in the Fortran D Editor also limits its utilization, especially among novice programmers. SUIF Explorer and FORGExplorer have a similar goal, but their performance analysis utilities serve only the complementary purpose of directing programmers to time-consuming code regions. The KAP/Pro Toolset consists of useful tools but does not support manual tuning. The focus of the Annai Tool Project is limited to the aspects of parallelization, debugging, and performance monitoring. Faust may be the most comprehensive environment to date, encompassing code optimization and performance evaluation; however, many aspects of Faust are not suitable for modern parallel machines, and it is no longer maintained by its developers. There is also the issue of active user guidance, which none of the optimization tools supports. Apart from the missing functionality, the problems with these tools, and with most other tools discussed in this chapter, are the lack of continuous support, system compatibility, scalability, effort to add new tools or features, and accessibility (being unavailable and difficult to learn).

The quality of visualization of the performance and structure of parallel programs provided by today's tools has reached an impressive level. Almost every aspect of parallel program execution can be viewed in user-friendly displays. Parallel execution events and resource utilization summaries are presented via colorful graphs, charts, animation, and even sound effects. We believe that the next step in assisting programmers in performance evaluation should be support for comprehension and deductive reasoning about performance data. As the user base of affordable parallel machines keeps expanding, this aspect of performance evaluation becomes increasingly important.

"A lot of smart people are developing parallel tools that smart users just won't use." This sentence, quoted from the literature, summarizes well some of the problems with tool development over the years. Many tools have ended their lives unused by anyone other than their developers. Perhaps this is because the tool developers have focused their attention only on specific stages in parallel program development, disregarding the big picture. In many cases, the developers created the tool that they thought would be useful based on their experience in their own environment. Another reason could be the lack of effort by the developers in providing convenient access to their tools. The conventional approach to promoting tool usage has always been telling users what the tool can do and explaining what to do with it; furthermore, not enough consideration has been put into actually allowing users to try the tools. We advocate the importance of a programming and optimization methodology once more, because knowing exactly what must be done at each stage of parallel program development leads to an effort to understand and appreciate the tool functionality that fits users' needs. With active motivation to reach a larger audience with an integrated methodology and toolset, we may have a better chance.

SHARED MEMORY PROGRAM OPTIMIZATION METHODOLOGY

In this chapter, we outline our proposal on the issue of a methodology for the shared memory programming model. We believe that the programming style of this model allows a systematic approach to program tuning that is far more detailed and organized than the simple descriptions found in general guidelines. The programmer's task in this scheme is to follow the steps suggested by the guidelines and apply the appropriate techniques.

Introduction: Scope, Audience, and Metrics

Before presenting the methodology, we first discuss its scope and target audience, as well as the metrics used in the methodology.

Scope of the proposed methodology

The figure below shows a typical shared memory program development cycle. The software design and implementation part (inside a dashed box) has been simplified in this figure. The issues in these stages include planning, design, coding, testing, and debugging. This is a quite complex topic, and a sophisticated set of methodologies, remedies, metrics, and tools exists for helping programmers in this matter. We will not discuss general software engineering issues any further in this proposal.

In this research, we focus our view on the parallelizing and tuning process: the box enclosing parallelization/tuning, program development, program execution, and performance evaluation. We assume the programmers have a working serial program. Developing a sequential program is orthogonal to parallel processing, and we assume that most programmers follow one of the existing software engineering practices. Our effort attempts to resolve difficulties and problems associated with parallelizing and optimizing sequential programs.

Fig.: Typical parallel program development cycle (design, implementation, parallelization/program tuning, compilation, program execution, performance evaluation)

Also notice that we do not consider the application-level approach, explained in an earlier chapter, to parallel program development. Finding parallelism at the algorithm level and incorporating it while writing a program is a different subject, in that it requires a new perspective in examining algorithms, identifying parallelism, dividing and balancing tasks, and incorporating them into the source code. As pointed out in the introduction, the sheer number of variables in this approach is so large that finding a systematic programming methodology would be extremely difficult. Some tips can be found in the literature, as well as in some of the programming methodologies introduced in an earlier chapter.

Target audience

We assume that our target programmers are familiar with programming and compilation. They should be able to write, debug, compile, and run a sequential program. They should also know at least the basics of how parallel processing works in the shared memory programming model. It helps to understand the underlying shared memory architecture, because certain machine-dependent parameters have a significant impact on program performance. To follow our methodology, it is not necessary to be an experienced parallel programmer; however, even for experienced programmers, the methodology serves as an efficient strategy for parallel programming.

We divide our target audience into two groups: novice and advanced programmers. The word novice means new to parallel programming, not to programming in general. The novice programmer group consists of those given the task of parallelizing a sequential program or writing a parallel program without much prior experience in the process. They resort to a methodology mainly for the guidelines and suggestions that make up for their lack of experience. They need to get a feeling for what the available techniques are and how they can be applied. The supporting tools must take this into account to make the learning curve as smooth as possible.

The need of advanced parallel programmers lies in the supporting utilities. The methodology aids them in efficiently structuring the approach they have already been taking. They have a good idea of what tasks have to be done in each stage, and they desire effective tools that accelerate tedious tasks. They would like the tools to be flexible so that they can configure them to fit the specific tasks of their interest.

Metrics: understanding overheads

Performance evaluation is an important stage in parallel programming. The evaluation process consists of finding performance problems and possible techniques for improvement. Finding problems requires definitions of performance problems; in other words, programmers should know which phenomena constitute performance problems. Without definitions, problems cannot be found. Metrics are used to formalize performance problems.

In our methodology, the performance evaluation process begins with identifying dominant and problematic code sections. A metric system provides a means to efficiently identify bottlenecks in the presence of a possibly large amount of information. As overhead analysis is a critical part of the methodology, we introduce in this section two perspectives for looking at parallel program performance and the related metrics. The main attention of these systems goes to overhead.

One common way to view performance overhead is described well in the literature, in which a programmer identifies two factors contributing to the overall overhead: parallelization overhead and spreading overhead. Our tuning strategy in the proposed methodology is based on this overhead model.

Parallelization overhead: This refers to the overhead introduced by transforming a program into parallel form. Often it is identified by comparing the execution times of the serial version and of the parallel version run on one processor. The main reason for this overhead is that the code inevitably gets augmented for parallelization. The parallelization overhead of a parallel loop is computed as

  $T_{parallelization} = T_{1\text{-}processor\ parallel\ execution} - T_{serial\ execution}$

The factors that contribute to parallelization overhead are listed below.

- Instructions needed for parallel execution: instructions for tasks such as fork, join, and barriers are necessary for parallel execution. They increase the code size and cause unavoidable overhead.

- Instructions needed for code transformation: some parallelization techniques require code changes that may incur overhead. For instance, the reduction technique requires a separate preamble and postamble, and the induction technique may introduce a complicated expression in each iteration that was not part of the original code.

- Inefficient optimization: code-generating compilers may perform fewer optimizations on a parallel code section than on the original serial code, leading to less efficient code.

Parallelization overhead may be amortized if the loop runs significantly longer than the overhead time. On the other hand, frequent invocation of a very small parallel loop can cause serious degradation in performance.
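As a purely illustrative calculation with hypothetical numbers (not measurements from this work): if a loop takes $T_{serial\ execution} = 100$ ms and the parallel version of the same loop, run on one processor, takes $T_{1\text{-}processor\ parallel\ execution} = 104$ ms, then $T_{parallelization} = 104 - 100 = 4$ ms. A loop whose body runs for seconds amortizes such a cost easily, whereas a loop of a few hundred microseconds that is invoked thousands of times spends a substantial fraction of its time in parallelization overhead.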

Spreading overhead: The execution model of a shared memory architecture is basically such that, at the beginning of the program, a process forks multiple threads, and the master thread among them wakes the others up whenever it encounters a parallel section. The time to wake the other threads is an unavoidable overhead. Spreading overhead usually increases as more processors are used in program execution. The spreading overhead is computed as

  $T_{spreading}(P) = T_{parallel\ execution}(P) - \frac{T_{1\text{-}processor\ parallel\ execution}}{P}$

where $P$ denotes the number of processors.

Some of the reasons for spreading overhead are given below.

- Startup latency: the time to initiate parallel execution on multiple threads. Naturally, the more threads run, the larger the overhead. One way to reduce it is to merge adjacent parallel regions into one, making each parallel section as large as possible.

- Memory congestion: because data are shared in a shared memory, heavy traffic on the memory bus may cause parallel execution to slow down. One possible remedy is to increase the locality of loops to reduce bus traffic.

- Coherence traffic: sharing data also requires coordination, which adds additional overhead for legitimate data invalidation.

- False sharing: depending on the cache line size, data that are needed by only one processor may spread over other processors' caches, causing frequent unnecessary invalidations.

- Load imbalance: tasks are unevenly distributed over multiple processors. In cases where the number of iterations is small and cannot be distributed evenly, the expected speedup is limited by the remainder.

Another perspective on overhead is provided by a hardware-counter-based model proposed in the literature. Hardware counters available on most modern machines provide detailed statistics on the dynamic behavior of parallel programs, yet the measured values do not necessarily translate into parallel programming terms. The proposed model defines four overhead components, memory stalls, processor stalls, code overhead, and thread management overhead, based on the hardware counter data. Each component is clearly defined, and the possible contributing factors and remedies are also given. This model provides a more detailed insight into the overhead characteristics of parallel loops; for instance, a loop may exhibit small parallelization and spreading overheads while memory or processor stalls indicate a problem. We have just begun to explore this new system, and more work needs to be done to incorporate it into tool development. The problem with this model is that obtaining the necessary data is tedious and very time-consuming. The traditional parallelization and spreading overhead model still serves as the primary measure of performance analysis for many programmers, and it will continue to do so in the future.

Parallel Program Optimization Methodology

In the past, we have participated in several research efforts in parallelizing programs for different target architectures. At first, we belonged to the category of novice programmers. After a great deal of trial and error, we developed a structured way to parallelize programs successfully. As the number of programs that we dealt with increased, our general methodology went through several stages of adjustment and improvement. Finally, we felt the need to write it down so that a wider range of programmers could benefit from the efficiency it provides. Thus, we started the process of refining our methodology to improve both its efficiency and its practicality.

The figure below shows an overview of the parallelization and optimization steps outlined by our proposed methodology. There are two feedback loops in the diagram: the first serves as the adjusting process for instrumentation overhead, and the second is the actual optimization process, consisting of the application of new techniques and their evaluation.

Our methodology envisions the following tasks when porting an application program to a parallel machine and tuning its performance. We start by identifying the most time-consuming code section of the program, optimize its performance using several recipes, and then repeat this process with the next most important code section. The most important code blocks for parallel execution in our programming paradigm are loops. Hence, we profile the program execution time on a loop-by-loop basis. We do this by instrumenting the program with calls to timer functions. The timing profile not only allows us to identify the most important code sections but also to monitor the program's performance improvements as we convert it from a serial to a parallel program. However, as the diagram shows, programmers may need to adjust the amount of profiling because of the accompanying overhead. The first step of performance optimization is to apply a parallelizing compiler. If no such tool is available, or if we are not satisfied with the resulting performance, we can apply program transformations by hand; we will describe a number of such techniques. The following sections describe all these steps in detail.

Instrumenting program

Instrumentation is a means to obtain performance data. Typically, in the shared memory model, profiling routines that record the necessary data are inserted into the code. As a result, one or more profiles are generated at the end of the program execution. There are other methods to instrument a program, using assembly code, which we do not consider in this research. Program instrumentation is an important step in optimizing program performance: the profile results from instrumented program runs provide the basis for performance evaluation and optimization. It should be determined beforehand what type of code blocks should be instrumented.

Fig.: Overview of the proposed methodology (instrumenting the program; getting serial execution time; running the parallelizing compiler; manually optimizing the program; getting optimized execution time; speedup evaluation; finding and resolving performance problems)

In the directive-based shared memory programming model, loops are usually the basic blocks for instrumentation, because they are the basic sections considered for parallelization. The metrics for measurement can vary, but they should conform to the goal of the optimization. There are utilities for measuring various aspects of program execution; the most widely used measure is execution time.

As the first step, programmers should instrument the serial program. The purpose of this step is to understand the distribution of execution time within the program and to identify the code segments worth the optimization effort. Therefore, it is desirable to obtain as much timing data as possible throughout the target program. For instance, programmers may decide to instrument all the loops in a given program.
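As a minimal sketch of what such loop-level timing instrumentation might look like (the profile array, the loop identifier, and the loop body are illustrative placeholders rather than the interface of any particular library; OMP_GET_WTIME is the standard OpenMP wall-clock timer):

      PROGRAM INSTREX
C     Sketch of loop-level timing instrumentation using the OpenMP
C     wall-clock timer.  A real instrumentation library would normally
C     hide this bookkeeping behind inserted subroutine calls.
      INTEGER N, MAXLOOPS, I
      PARAMETER (N = 1000, MAXLOOPS = 100)
      DOUBLE PRECISION A(N), B(N), C(N)
      DOUBLE PRECISION T0, T1, LOOPTIME(MAXLOOPS)
      DOUBLE PRECISION OMP_GET_WTIME
      EXTERNAL OMP_GET_WTIME

      DO I = 1, MAXLOOPS
         LOOPTIME(I) = 0.0D0
      ENDDO
      DO I = 1, N
         B(I) = I
         C(I) = 2 * I
      ENDDO

C     Time the loop of interest (identified here, arbitrarily, as loop 17)
C     and accumulate its elapsed wall-clock time.
      T0 = OMP_GET_WTIME()
      DO I = 1, N
         A(I) = B(I) + C(I)
      ENDDO
      T1 = OMP_GET_WTIME()
      LOOPTIME(17) = LOOPTIME(17) + (T1 - T0)

      PRINT *, 'Loop 17 time (s): ', LOOPTIME(17)
      END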

Unfortunately, most instrumentation methods introduce overhead. This has to be considered very carefully, because it not only affects the program's performance but can also skew the execution profile so that the programmer targets the wrong program sections. Our methodology suggests the following remedies.

- Programmers should make sure that they run the program both with and without instrumentation. They should proceed only after they have verified that the perturbation is small.

- In order to reduce overhead, programmers should remove instrumentation from innermost loops (innermost code sections in general). They may need to find out the overhead per call of the instrumentation library. If the initial profile shows code sections whose average execution times are less than two orders of magnitude larger than the overhead, the corresponding instrumentation should be removed.

- Programmers should add instrumentation after they run the code through a parallelizing compiler. Compilers usually can apply fewer optimizations in the presence of many subroutine calls, and source-level instrumentation generally takes the form of inserted subroutine calls. If an assembly-level instrumentation tool is available, this is less of a problem.

- Programmers should be careful when adding instrumentation inside a parallel loop or region. Instrumentation libraries may assume that these function calls are made from serial program sections only.

- It is desirable that programmers make sure that the instrumented code segments in the optimized program match those instrumented in the sequential program, so that side-by-side comparisons can be made in the performance evaluation stage.

There is an obvious dilemma: if programmers remove too many instrumentation points, the profile becomes less useful. They should leave the instrumentation in place at least for all those program sections that they may later try to tune.

Getting serial execution time

Program execution may be affected by many factors: processor speed, architecture, operating system, system load, network load (such as file I/O requests), and so on. The resulting program from this optimization process may be subject to all these factors. However, to accurately measure the effect, whether positive or negative, of the techniques applied during the optimization process, it is very important to eliminate these external factors during instrumented program runs. One way to ensure an uninterrupted environment is to use single-user time, during which only one user is allowed on the system. In this way, programmers can reduce unnecessary overheads caused by context switching, external file I/O, and so on.

Running parallelizing compiler

Parallelizing compilers can analyze the input program, detect parallelism, and automatically generate appropriate directives for the detected parallel regions. Parallelizing compilers relieve parallel programmers of the task of parallelizing all loops manually. They are especially useful when the loops under consideration have complex structures for which human analysis is cumbersome. State-of-the-art parallelizing compilers include many advanced techniques for parallelization and optimization.

It is important to note that relying entirely on parallelizing compilers for optimization may not result in optimal performance. Compilers base the techniques that they apply on the static analysis of input programs, which may not accurately reflect the dynamic behavior of the programs; modeling the dynamic characteristics of programs is very difficult. For this reason, the programmer's intervention may be necessary to achieve near-optimal performance. The programmer's compensation for the compiler's lack of knowledge of the dynamic behavior of a program is the key to obtaining good performance.

Nonetheless, running a parallelizing compiler is a good starting point. It can save programmers a significant amount of time that would otherwise be spent analyzing all the loops in a program. For novice programmers, manually parallelizing loops may be cumbersome to begin with. In addition, most parallelizing compilers are capable of generating a listing of the static analysis results, which may provide programmers with valuable information on various code sections.

In our methodology, we do not assume that programmers necessarily have access to parallelizing compilers. If they do not, the first set of techniques to apply should be those for parallelization, described in the next section.

Manually optimizing programs

Manual optimization allows users to make up for the compiler's shortcomings. If a programmer has run a parallelizing compiler, the static analysis information generated by the compiler in the form of listing files can help the programmer better understand the problems at hand. Running instrumented programs offers insight into the program's dynamic behavior. Combined with the programmer's knowledge of the underlying algorithm and physics, these data provide vital clues for improving the performance.

In our methodology, we have divided various well-known techniques into four categories: parallelization techniques, parallel performance optimization techniques, serial performance optimization techniques, and other techniques. Parallelization techniques refer to techniques that parallelize code segments. Parallel performance optimization techniques are the ones that may improve the performance of already parallel sections. Serial performance optimization techniques aim to improve the performance of code sections whether they are serial or parallel; some of these techniques may result in a superlinear speedup if they are not also applied to the serial program that serves as the performance reference point. Locality enhancement techniques are typical examples. The techniques in the remaining category do not seem to have an effect on performance by themselves; however, they may enable other, previously non-applicable techniques. The benefits of the techniques described below can vary significantly with the underlying machine. The judgment about which techniques to apply to a given program should be based on accurate performance evaluation, which will be discussed in a subsequent section.

We give brief descriptions of the techniques that we have used to improve program performance. More detailed descriptions and theoretical background can be found in the literature.

Parallelization techniques

Privatization: Privatization seeks to reduce false dependences. Often, scalar variables and arrays are used as temporary storage within an iteration of a loop; if a private copy of such a variable is provided to each iteration, the loop may be parallelized. More conservatively, a single copy may be provided to each of the participating processors for its own use. For example, in the figure below, variable X is used as temporary storage within a loop. By giving each participating processor a separate copy of X, seemingly serial code can be executed in parallel. In some cases the temporary storage is an array, as shown in the second figure below.

Fig.: Scalar privatization — (a) the original loop and (b) the same loop after privatizing variable X
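A plausible reconstruction of the scalar privatization example in the figure (the loop body is illustrative; what matters is that X is written before it is read in every iteration, so each thread can safely keep its own copy):

C     (a) Original loop: X is reused as scratch storage in every
C         iteration, creating a false dependence between iterations.
      DO I = 1, N
         X = B(I) + C(I)
         A(I) = X * X
      ENDDO

C     (b) Parallel version: each thread gets a private copy of X.
!$OMP PARALLEL DO PRIVATE(X)
      DO I = 1, N
         X = B(I) + C(I)
         A(I) = X * X
      ENDDO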

Fig.: Array privatization — (a) the original loop and (b) the same loop after privatizing array A
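A plausible reconstruction of the array privatization example (the loop bodies are assumptions for illustration; the pattern is that the scratch array A is completely rewritten before it is read within each iteration of the outer loop):

C     (a) Original loop nest: A(1:M) is used as a scratch array inside
C         every iteration of the I loop.
      DO I = 1, N
         DO J = 1, M
            A(J) = B(J,I) + C(J,I)
         ENDDO
         DO J = 1, M
            D(J,I) = A(J) * 2.0
         ENDDO
      ENDDO

C     (b) Parallel version: each thread works on its own private copy
C         of the scratch array A.
!$OMP PARALLEL DO PRIVATE(A, J)
      DO I = 1, N
         DO J = 1, M
            A(J) = B(J,I) + C(J,I)
         ENDDO
         DO J = 1, M
            D(J,I) = A(J) * 2.0
         ENDDO
      ENDDO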

Reduction: Scalar reductions are recurrences of the form sum = sum + expr, where expr is a loop-variant expression and sum is a scalar variable. Loops that contain such recurrences cannot be executed in parallel without being restructured, since values are accumulated into the variable sum. One way of addressing such a situation is to calculate local sums on each processor and combine these sums at the completion of the loop. The figure below shows an example of such a scalar reduction operation and its transformed version in OpenMP. OpenMP provides a construct for identifying reduction operations of type addition, multiplication, maximum, and minimum.

Fig.: Scalar reduction — (a) the original loop and (b) the same loop after recognizing reduction variable SUM
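A reconstruction of the scalar reduction example using the OpenMP REDUCTION clause (the array name follows the figure; the clause syntax is standard OpenMP Fortran):

C     (a) Original loop: SUM accumulates a value across iterations.
      SUM = 0.0
      DO I = 1, N
         SUM = SUM + A(I)
      ENDDO

C     (b) Parallel version: each thread accumulates a private partial
C         sum, and the partial sums are combined at the end of the loop.
      SUM = 0.0
!$OMP PARALLEL DO REDUCTION(+:SUM)
      DO I = 1, N
         SUM = SUM + A(I)
      ENDDO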

In addition to scalar reductions, array reductions must be addressed, as it has been shown that array reduction recognition is one of the most important transformations in real applications. Array reductions, like scalar reductions, are summations; however, they are of the form A(ind) = A(ind) + expr, where the value of the subscript ind of A cannot be determined at compile time. Therefore, local sums must be accumulated for each element of A and combined at the completion of the loop. The figure below shows such a reduction operation. The constant No_Of_Procs holds the number of participating processors, and the function call Get_My_Id returns the identification of the processor executing that iteration. The two additional loops for initialization and final summation are called the preamble and postamble, respectively.

Fig.: Array reduction — (a) the original loop and (b) the same loop after recognizing reduction array A
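A hedged reconstruction of the expanded-array transformation sketched in the figure. The names No_Of_Procs, Get_My_Id, and Elements_In_A follow the text and figure and stand for however the runtime exposes the processor count, the identifier of the executing processor, and the extent of A; the use of a separate expanded array A2 and the exact directive are assumptions made for clarity:

C     (a) Original loop: the subscript IND(I) is not known at compile
C         time, so different iterations may update the same element of A.
      DO I = 1, N
         A(IND(I)) = A(IND(I)) + B(I)
      ENDDO

C     (b) Transformed version.  Preamble: clear one private column of
C         the expanded array per processor.
      DO I = 1, No_Of_Procs
         DO J = 1, Elements_In_A
            A2(J,I) = 0.0
         ENDDO
      ENDDO
C     Parallel accumulation: each processor adds into its own column.
!$OMP PARALLEL DO SHARED(A2, B, IND)
      DO I = 1, N
         A2(IND(I), Get_My_Id()) = A2(IND(I), Get_My_Id()) + B(I)
      ENDDO
C     Postamble: combine the partial results back into A.
      DO J = 1, Elements_In_A
         DO I = 1, No_Of_Procs
            A(J) = A(J) + A2(J,I)
         ENDDO
      ENDDO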

Induction: Induction variables are variables that form a recurrence in the enclosing loop. The figure below shows an example of a simple induction expression as well as a transformed form that has no loop-carried dependences. Induction variable substitution must first recognize variables of this form and then substitute them with a closed-form expression.

Fig.: Induction variable recognition — (a) the original loop and (b) the same loop after replacing induction variable X
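A plausible reconstruction of the induction example (the particular recurrence and its closed form are an illustration; the exact expressions of the original figure are not recoverable):

C     (a) Original loop: X forms a recurrence, which serializes the
C         iterations.
      X = 0
      DO I = 1, N
         X = X + I
         A(X) = B(I)
      ENDDO

C     (b) After induction variable substitution: X is replaced by its
C         closed form I*(I+1)/2, removing the loop-carried dependence.
!$OMP PARALLEL DO
      DO I = 1, N
         A(I*(I+1)/2) = B(I)
      ENDDO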

This transformation allows the original loop to be executed in parallel. Unfortunately, if there are many enclosing loops and complex induction variables, the closed-form induction expressions may become rather expensive to compute. If these expressions are used often, they can introduce significant overhead.

Handling I/O: If the I/O statements within a loop are necessary for program execution and the order of the I/O statements has to be preserved among loop iterations, the loop cannot be parallelized. In other cases, the loop can still be parallelized by using one of the following methods (a sketch of the second method follows this list).

- If the I/O is not absolutely necessary, it can simply be removed. For instance, if the I/O was inserted for debugging purposes or as execution status reports, deleting the I/O statements will not affect the execution.

- In cases where I/O is needed to report the status of an array, the loop may be distributed into two loops, one for computation and the other for I/O. The resulting loop containing only I/O cannot be parallelized, but the loop containing only computation may be parallelizable.
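A minimal sketch of the loop distribution option (the array names and the parallelizability of the computational part are assumptions for illustration):

C     Original loop mixes computation and status output:
C       DO I = 1, N
C          A(I) = A(I) + B(I)
C          WRITE(*,*) 'A(', I, ') = ', A(I)
C       ENDDO

C     After distribution: the computational loop can be parallelized,
C     while the I/O loop stays serial to preserve the output order.
!$OMP PARALLEL DO
      DO I = 1, N
         A(I) = A(I) + B(I)
      ENDDO
      DO I = 1, N
         WRITE(*,*) 'A(', I, ') = ', A(I)
      ENDDO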

Handling subroutine and function calls: If a loop contains a subroutine or function call, parallelizing compilers usually make the conservative decision not to parallelize it. The programmer has to make sure that the subroutine or function has no side effects before manually parallelizing the loop.

Also, depending on the implementation of the parallel constructs, parallel sections inside a function or subroutine that is already running in parallel may have unexpected effects. If a programmer decides to execute a subroutine or function within a parallel block, it is advisable to remove the parallel constructs within that subroutine or function. Another possible solution is to inline the called function or subroutine if its size is reasonably small. More details on inlining are presented later in this section.

Parallel performance optimization techniques

Parallelization introduces overhead that clearly affects execution time. Programmers must be aware that parallelization may even degrade the performance of some code sections. We presented the parallelization and spreading overhead model earlier in this chapter. The techniques listed below aim to further improve the performance of already parallel code sections; they mainly seek to reduce the overhead introduced by parallelization.

Serialization: In many cases the effect of an optimization is not entirely predictable. Furthermore, if programmers use a parallelizing compiler, the compiler may cause some code sections to perform worse. Sometimes parallelizing a code segment simply does not pay off. For instance, if the execution time of a loop is of the same order as the parallelization overhead, its parallel execution is likely to perform worse than the serial version. If there are no other eligible techniques to further improve the parallel section, simply removing the parallel directives can at least prevent the degradation.

This technique is highly machine-dependent. The benefit of parallelization rests on many machine parameters: cache and memory size, bandwidth, processor speed, I/O efficiency, and the operating system. If the target program is to be used on various architectures, programmers should make a cautious decision as to which segments should be converted back to serial, based on a study of those architectures. A useful strategy is to serialize those loops or code sections whose timing profiles show no improvement from any parallelization and tuning attempts. It is also advisable to monitor the performance of those loops whose execution time is less than an order of magnitude larger than the fork-join overhead. The fork-join overhead can be measured as the difference in execution time of an empty parallel loop between parallel and serial execution.
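A minimal sketch of how the fork-join overhead might be measured under this definition (the repetition count is arbitrary, and in practice the compiler may have to be prevented from removing the empty loops; OMP_GET_WTIME is the OpenMP wall-clock timer):

      INTEGER I, K, NREP
      PARAMETER (NREP = 1000)
      DOUBLE PRECISION T0, T1, T2, TFORKJOIN
      DOUBLE PRECISION OMP_GET_WTIME
      EXTERNAL OMP_GET_WTIME

C     Time NREP invocations of an empty parallel loop.
      T0 = OMP_GET_WTIME()
      DO K = 1, NREP
!$OMP PARALLEL DO
         DO I = 1, 1
            CONTINUE
         ENDDO
      ENDDO
      T1 = OMP_GET_WTIME()
C     Time the same empty loops executed serially.
      DO K = 1, NREP
         DO I = 1, 1
            CONTINUE
         ENDDO
      ENDDO
      T2 = OMP_GET_WTIME()
C     Average fork-join overhead per parallel loop invocation.
      TFORKJOIN = ((T1 - T0) - (T2 - T1)) / NREP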

It should be noted that serialization itself can have a negative impact. The idea of serialization is to restore a code segment back to its original state, but due to cache effects, the execution may slow down compared to the same code section in the untouched version. For instance, a small serial loop right between two large parallel loops may cause significant cache misses because of the data distribution across caches.

Handling false sharing: Depending on the cache line size, data that are needed by only one processor may spread over other processors' caches, causing frequent invalidations. This may be prevented by applying one of the two techniques described below; sketches follow the corresponding figures.

- Programmers may try to modify array access patterns by scheduling tasks that access adjacent regions on the same processor. An example is given in the scheduling modification figure below.

- Another solution is padding. By adding empty data items to a shared array, one may avoid false sharing by separating the data into individual cache lines. However, this may cause negative effects due to the increase in data size. The padding figure below shows an example. It should be noted that changing array declarations can have global and interprocedural effects: all uses of the modified arrays must be changed to use the new dimensions.

Fig.: Scheduling modification — (a) the original loop nest and (b) the same nest after pushing the parallel construct inside the loop nest; in (b) the inner loop is executed in parallel, so each processor accesses array elements that are a large stride apart
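A hedged reconstruction of the kind of change the figure describes (the array shapes and the exact directives are assumptions): moving the work-sharing construct from one loop of a nest to another changes which array elements each processor touches, and therefore how the data fall onto cache lines.

C     (a) Work-sharing on the outer I loop: in Fortran's column-major
C         layout, neighboring I values from different processors are
C         adjacent in memory, so chunk boundaries may share cache lines.
!$OMP PARALLEL DO PRIVATE(J)
      DO I = 1, N
         DO J = 1, N
            A(I,J) = B(I,J)
         ENDDO
      ENDDO

C     (b) Work-sharing pushed inside, onto the J loop: each processor's
C         elements are a full column apart, so processors are unlikely
C         to share a cache line.
!$OMP PARALLEL PRIVATE(I)
      DO I = 1, N
!$OMP DO
         DO J = 1, N
            A(I,J) = B(I,J)
         ENDDO
!$OMP END DO NOWAIT
      ENDDO
!$OMP END PARALLEL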

Fig.: Padding — (a) the original loop and (b) the same loop after padding extra space into the arrays
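A hedged reconstruction of the padding idea (the pad of 8 elements is an arbitrary illustration; an appropriate value depends on the cache line size of the target machine):

C     (a) Original declarations: columns used by different processors
C         may end up on the same cache line.
      REAL A(N,N), B(N,N)

C     (b) Padded declarations: the extra, unused rows push the data used
C         by different processors onto different cache lines.  All loops
C         keep using indices 1..N; only the declared leading dimension,
C         and hence the memory layout, changes.
      REAL A(N+8,N), B(N+8,N)

!$OMP PARALLEL DO PRIVATE(I)
      DO J = 1, N
         DO I = 1, N
            A(I,J) = B(I,J)
         ENDDO
      ENDDO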

In Fortran, if a loop iterates over a large index range, multiple processors allow many ways to split the iterations among them. Depending on the loop structure, scheduling can make a significant difference in performance. Locality and false sharing are the two most important factors affected by the choice of scheduling scheme. The OpenMP directive language provides four different options for scheduling, listed below; a short usage sketch follows the list. Some scheduling schemes incur more overhead than others due to the required bookkeeping, so programmers are advised to examine the loop structure before trying a different scheduling mechanism.

- static: Each processor is assigned a contiguous chunk of iterations. If the amount of work in each iteration is approximately the same and there are enough iterations for an equal distribution, this scheduling will do fine.

- dynamic: A processor is assigned the next iteration as the processor becomes available. This is useful if the loop has varying amounts of work per iteration. The overhead is usually higher than that of static scheduling, but if the program is to run in a multiuser environment, its better load-balancing properties can improve performance.

- guided: The same as dynamic scheduling, but a progressively decreasing number of iterations is dispatched to each processor.

- runtime: The decision for scheduling is deferred until run time. The value of the environment variable OMP_SCHEDULE determines the scheduling scheme.
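As a concrete illustration of how these options are selected, the sketch below (not taken from any benchmark discussed in this thesis; the program name, loop bounds, and work array are placeholders) applies a SCHEDULE clause to a triangular loop nest, where iteration costs vary with the outer index.

      PROGRAM SCHED
      INTEGER I, J, N
      PARAMETER (N = 1000)
      DOUBLE PRECISION WORK(N)
      DO I = 1, N
         WORK(I) = 0.0D0
      ENDDO
C     The inner loop bound depends on I, so iterations have uneven
C     cost; DYNAMIC (or GUIDED) scheduling balances the load better
C     than the default STATIC scheduling.  With SCHEDULE(RUNTIME),
C     the scheme is taken from the OMP_SCHEDULE environment variable.
!$OMP PARALLEL DO PRIVATE(J) SCHEDULE(DYNAMIC)
      DO I = 1, N
         DO J = 1, I
            WORK(I) = WORK(I) + 1.0D0
         ENDDO
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, 'WORK(N) = ', WORK(N)
      END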

Load balancing: Unevenly distributed tasks cause stalls on some processors. In cases where the number of iterations is small and cannot be distributed evenly, the expected speedup is limited because the leftover iterations (the remainder of the iteration count divided by the number of processors) leave some processors idle. There is no solution for this case other than moving the parallelism to outer loops. If the imbalance is incurred by uneven work within the loop body, such as an outer parallel loop with an inner triangular loop, dynamic scheduling may result in better performance. The figure below shows an example of load balancing by changing the scheduling.

(a)
!$OMP PARALLEL DO
!$OMP+SCHEDULE(STATIC)
      DO I = 1, N
        DO J = 1, I
        ENDDO
      ENDDO

(b)
!$OMP PARALLEL DO
!$OMP+SCHEDULE(DYNAMIC)
      DO I = 1, N
        DO J = 1, I
        ENDDO
      ENDDO

Fig.  Load balancing: (a) the original loop and (b) the same loop after changing to an interleaved scheduling scheme. By changing the scheduling from static to dynamic, an unbalanced load can be distributed more evenly.

Blocking/tiling: If the data size handled by each iteration of a loop is larger than the data cache of the processor, and the data are reused within each iteration, many cache misses occur. Blocking (tiling) splits the data needed by each iteration so that they fit into one processor's cache. This technique is particularly useful in large matrix manipulation. Obviously, machine parameters must come into play for this technique to be successful: knowing the machine's cache size helps determine the right block size. Blocking and tiling are basically locality-enhancement techniques. The figure below shows how blocking/tiling can be applied.

In part (a) of the figure, the entire B array is referenced in each iteration of the I loop. If the N*N references within each iteration of the I loop exceed the cache size, then each access to a new line of array B will be a cache miss. Tiling the K and J loops allows smaller sections of B to be accessed repeatedly before moving on to another section, decreasing the references within the I loop to BLK*BLK references. If BLK is small enough, then each line of B will only see one cache miss during the execution of the entire nest.

(a)
      DO I = 1, N
        DO K = 1, N
          DO J = 1, N
            C(J,I) = A(K,I) * B(J,K) + C(J,I)
          ENDDO
        ENDDO
      ENDDO

(b)
      DO KK = 1, N, BLK
        DO JJ = 1, N, BLK
          DO I = 1, N
            DO K = KK, MIN(KK+BLK-1, N)
              DO J = JJ, MIN(JJ+BLK-1, N)
                C(J,I) = A(K,I) * B(J,K) + C(J,I)
              ENDDO
            ENDDO
          ENDDO
        ENDDO
      ENDDO

Fig.  Blocking/tiling: (a) the original loop and (b) the same loop after applying tiling to split the matrices into smaller tiles. In (b) outer blocking loops have been added to assign smaller blocks to each processor, so the data are likely to remain in the cache when they are needed again.

Serial performance optimization techniques

Sometimes programmers inadvertently write inefficient code. For those who are not familiar with performance issues, it is not unusual to add code that works against good performance. There are simple techniques that enhance the performance of a code segment, whether it is serial or parallel, without altering its intended functionality. The techniques listed below aim to enhance the locality of program data, resulting in better cache performance, or to reduce stalls. They are mostly machine-independent; enhancing locality, for instance, always helps. If the dominant code segments in the target program are inherently serial, the following techniques may be good candidates for improving performance without parallelization.

Loop interchange: Loop interchange is a simple technique that swaps the loops in a loop nest. The array access patterns determined by the loop order can have a drastic effect on the resulting performance. Of the two code segments shown in the figure below, the first one has poor locality because it has an array access stride of N. The second loop, on the other hand, performs better because of its stride-1 access.

(a)
      DO I = 1, N
        DO J = 1, M
          A(I,J) = B(I,J)
        ENDDO
      ENDDO

(b)
      DO J = 1, M
        DO I = 1, N
          A(I,J) = B(I,J)
        ENDDO
      ENDDO

Fig.  Loop interchange: (a) a loop with poor locality and (b) the same loop with better locality after interchanging the loop nest.

Loop interchange is a simple technique that may result in a large performance gain. Programmers should be aware, however, that loop interchange is not always legal. In the presence of backward data dependences (e.g., a statement such as A(I,J) = A(I-1,J+1) + B(I,J)) in a loop, interchanging the loops violates the dependence of the original code.

Loop fusion: This is the opposite of loop distribution, described below. If multiple loops have the same iteration range, they can be merged, provided doing so does not break any dependences between them. Fusion generally increases locality because it allows processors to reuse data that are already in their caches. However, fusion may cause the data accessed per iteration to exceed the cache size, which degrades performance. As a side effect, if fusion is applied to parallel loops, it decreases the number of synchronization barriers and thus reduces both parallelization and spreading overhead. Programmers should be aware that loop fusion is not always legal even when the iteration spaces match; a small sketch is given below.
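The following minimal sketch (invented for illustration; the array names and sizes are placeholders and do not come from any benchmark in this thesis) shows two loops over the same range fused into one, so that each element of A is reused while it is still in the cache.

      PROGRAM FUSION
      INTEGER I, N
      PARAMETER (N = 100000)
      REAL A(N), B(N), C(N)
      DO I = 1, N
         A(I) = REAL(I)
      ENDDO
C     Before fusion: two separate loops traverse A.
      DO I = 1, N
         B(I) = A(I) * 2.0
      ENDDO
      DO I = 1, N
         C(I) = A(I) + 1.0
      ENDDO
C     After fusion: one loop computes both results, reusing A(I)
C     while it is still in the cache (legal here because the two
C     original loops have no dependences between them).
      DO I = 1, N
         B(I) = A(I) * 2.0
         C(I) = A(I) + 1.0
      ENDDO
      PRINT *, B(N), C(N)
      END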

Software pipelining and/or loop unrolling: In some compute-intensive loops, data dependences across nearby iterations may cause pipeline stalls. This is more frequent with floating-point operations, which take a number of CPU cycles. One way to alleviate this problem is software pipelining or loop unrolling. Loop unrolling does not have a direct effect on reducing dependence stalls, but it allows the backend compiler to interleave dependent instructions. Moreover, unlike software pipelining, which may create a loop-carried dependence, an unrolled loop can still be executed in parallel if the original loop is parallel. As a side effect, unrolled loops have fewer synchronization barriers when executed in parallel. Both techniques allow more cycles between dependent instructions, so stalls are reduced. Hardware counters often have facilities to measure dependence stalls. The figure below shows a simple loop before and after applying software pipelining and unrolling.

Other performance-enhancing techniques

Loop distribution: Loop distribution refers to splitting a loop into multiple loops with smaller tasks. This technique may reduce the grain size of parallelism; however, it enables other transformations. An actual code section found in program SWIM from the SPEC benchmark suite is shown in the SHALOW loop figure later in this section.

(a)
      DO I = 1, N
        C = A(I) + B(I)
        D(I) = C
      ENDDO

(b)
      C = A(1) + B(1)
      DO I = 1, N-1
        D(I) = C
        C = A(I+1) + B(I+1)
      ENDDO
      D(N) = C

(c)
      DO I = 1, N, 2
        C = A(I) + B(I)
        D(I) = C
        C = A(I+1) + B(I+1)
        D(I+1) = C
      ENDDO

Fig.  Software pipelining and loop unrolling: (a) the original loop, (b) the same loop after software pipelining (instructions are interleaved across iterations, and a preamble and a postamble have been added), and (c) the same loop unrolled by a factor of two.

The outer loop is parallel; adding appropriate directives, we get the parallelized version shown in a later figure in this section. As mentioned above in the locality-enhancement discussion, the nested loops in this code segment would be a good candidate for loop interchange due to the column-major storage order of Fortran. However, the one line right after the nested loop prevents applying the technique. By splitting the outer loop into two and interchanging the nested loops, we get the code shown in the optimized-version figure, which performs significantly better than the previous two versions.

Subroutine inlining: Inlining replaces a call to a subroutine with the code contained within the subroutine itself. This procedure, also called inline expansion, can have several beneficial effects. The most obvious of these is the removal of the calling overhead. This is particularly true when a call is embedded within a small loop, so that the overhead would otherwise be incurred in each loop iteration. More importantly, in the context of parallelizing compilers, additional optimizations and transformations may be facilitated by this transformation.

      DO icheck = 1, mnmin
        DO jcheck = 1, mnmin
          pcheck = pcheck + ABS(pnew(icheck, jcheck))
          ucheck = ucheck + ABS(unew(icheck, jcheck))
          vcheck = vcheck + ABS(vnew(icheck, jcheck))
        CONTINUE
        ENDDO
        unew(icheck, icheck) = unew(icheck, icheck)
     *      * (MOD(icheck, 100) / 100.)
      CONTINUE
      ENDDO

Fig.  The original SHALOW do loop in program SWIM.

With procedure calls inlined, the procedure's code may be optimized within the context of the call site. With site-specific information now available, other transformations may become possible, which in turn may facilitate yet other optimizations. This may allow some instances of a procedure to be executed in parallel even if it is not parallelizable at every call site.

The downside of inlining is the increase in code size, which can be significant if full inlining is performed. This may cause many instruction cache misses. Also, with the increase in code size comes an increase in compilation time, since now each instance of the inlined code is optimized separately. Often, full inlining is not practical, and so heuristics are developed for its application.
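As a hedged illustration (the program, subroutine, and array names below are invented for this sketch and do not come from any program discussed in this thesis), inlining a small routine that is called inside a loop removes the per-iteration call overhead and exposes the loop body to further compiler analysis.

      PROGRAM INLN
      INTEGER I, N
      PARAMETER (N = 1000)
      REAL X(N)
      DO I = 1, N
         X(I) = REAL(I)
      ENDDO
C     Before inlining: HALVE is called once per iteration.
      DO I = 1, N
         CALL HALVE(X(I))
      ENDDO
C     After inlining: the body of HALVE is substituted directly,
C     removing the call overhead and letting the compiler analyze
C     (and possibly parallelize) the loop as a whole.
      DO I = 1, N
         X(I) = X(I) * 0.5
      ENDDO
      PRINT *, X(1), X(N)
      END

      SUBROUTINE HALVE(V)
      REAL V
      V = V * 0.5
      RETURN
      END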

Dead-code elimination: Dead-code elimination is an optimization technique that removes unnecessary code from a program. Its direct effect is decreased execution time: code that has no effect on the output of the program is removed, and thus the time spent executing this portion of the application is eliminated. Again, there is the additional benefit that dead-code

!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(JCHECK,ICHECK)
!$OMP DO
!$OMP+REDUCTION(+:vcheck,ucheck,pcheck)
      DO icheck = 1, mnmin
        DO jcheck = 1, mnmin
          pcheck = pcheck + ABS(pnew(icheck, jcheck))
          ucheck = ucheck + ABS(unew(icheck, jcheck))
          vcheck = vcheck + ABS(vnew(icheck, jcheck))
        CONTINUE
        ENDDO
        unew(icheck, icheck) = unew(icheck, icheck)
     *      * (MOD(icheck, 100) / 100.)
      CONTINUE
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL

Fig.  Parallel version of the SHALOW do loop in program SWIM.

!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(JCHECK,ICHECK)
!$OMP DO
!$OMP+REDUCTION(+:vcheck,ucheck,pcheck)
      DO icheck = 1, mnmin
        DO jcheck = 1, mnmin
          pcheck = pcheck + ABS(pnew(icheck, jcheck))
          ucheck = ucheck + ABS(unew(icheck, jcheck))
          vcheck = vcheck + ABS(vnew(icheck, jcheck))
        CONTINUE
        ENDDO
      CONTINUE
      ENDDO
!$OMP END DO
!$OMP DO
      DO icheck = 1, MIN(m, n)
        unew(icheck, icheck) = unew(icheck, icheck)
     *      * (MOD(icheck, 100) / 100.)
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL

Fig.  Optimized version of the SHALOW do loop in program SWIM.

elimination may enable other optimizations; for example, an imperfect loop nest can become a perfect loop nest after dead-code elimination.
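A minimal illustration of that last point (invented for this sketch, not taken from any benchmark): the assignment to T below is dead if T is not used after the loop, and removing it turns the imperfect nest into a perfect one, which in turn makes transformations such as loop interchange or tiling easier to apply.

      PROGRAM DEADCD
      INTEGER I, J, N
      PARAMETER (N = 100)
      REAL A(N,N), T
C     Imperfect nest: the assignment to T sits between the two
C     loop headers.  T is never used afterwards, so it is dead.
      DO I = 1, N
         T = 2.0 * I
         DO J = 1, N
            A(J,I) = REAL(I + J)
         ENDDO
      ENDDO
C     After dead-code elimination the nest is perfect.
      DO I = 1, N
         DO J = 1, N
            A(J,I) = REAL(I + J)
         ENDDO
      ENDDO
      PRINT *, A(N,N)
      END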

Getting optimized execution time

As described earlier, using single-user execution times is important to reduce external perturbation factors. In parallel programs these factors may cause significant inaccuracies and variations in execution time because of the unpredictable nature of other users' processes.

Finding and resolving performance problems

Finding dominant regions: Programmers should focus on dominant code segments based on the measured data. Instrumented program runs usually generate profiles containing the measured data, and programmers should find the major code blocks that consume most of the execution time from these files. With tool support, this task can be simplified.

Dominant program sections may change as a result of the program tuning process. After each iteration of this process, programmers should re-evaluate the most time-consuming (or the most problematic, depending on the metrics) code sections. Other program sections may have become the point of biggest return on further time investment.

Identifying problems and finding remedies: When dominant code sections are found, programmers should figure out any possible improvements to those segments. First, the status of the segments should be understood: "Is the code section parallel?" and "Is the speedup acceptable?" are the questions that should be answered before looking for the right remedies. Computing the overheads discussed earlier can be of significant help to this end. Performance analysis is a difficult part of performance tuning; in the next chapter we present our effort to facilitate performance analysis through tool support.

- Code not parallel: Even advanced parallelizing compilers such as the Polaris compiler cannot detect all possible parallelism. There are mainly two reasons for this. First, the target code uses algorithmic techniques that a parallelizing compiler cannot analyze. Second, the data dependences within the code cannot be determined without examining the input data, so the parallelizing compiler makes the conservative decision not to parallelize the code.

For the first case, programmers may be able to find parallelism themselves. For example, if a reduction variable is not recognized by a parallelizing compiler, programmers can parallelize the code section with the proper reduction directives (a small sketch is given after this list). Programmers may need to study the underlying algorithm for this task. Parallelization techniques are discussed in an earlier section.

For the second case, programmers may be able to make up for the lack of information about the input data. For instance, if the reason for not parallelizing a code section is that the compiler cannot determine that certain array accesses do not overlap, programmers can simply parallelize the code manually. If a conditional exit within a loop only occurs on a fatal error condition, ignoring it and parallelizing the loop will not affect a correct execution.

If the programmer cannot find any way to parallelize a given code section, replacing the algorithm with a parallel counterpart may be possible; parallel algorithms exist for some inherently serial algorithms, such as random number generation and linear recurrences.

Finally, even if none of these techniques is applicable, programmers should try enhancing the locality of the code. Some of the locality-enhancing techniques can make a drastic difference in performance; several were listed earlier in this chapter.

- Speedup not acceptable: For parallel code segments, there are several reasons for poor speedup, including poor locality and parallelization and/or spreading overhead. Spreading overhead may itself be incurred by poor locality. Programmers should try to enhance locality and reduce overhead. Problems with data locality may be detected if a hardware counter is available on the target machine: a large number of stalls or a high data cache miss ratio is a good indication of poor locality. Some of these techniques are described earlier in this chapter.
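As a hedged illustration of the reduction case mentioned in the list above (the sketch is generic; the program name, array name, and size are placeholders), a sum that a compiler fails to recognize as a reduction can be parallelized manually with an OpenMP REDUCTION clause:

      PROGRAM REDUCE
      INTEGER I, N
      PARAMETER (N = 100000)
      DOUBLE PRECISION A(N), S
      DO I = 1, N
         A(I) = 1.0D0 / DBLE(I)
      ENDDO
      S = 0.0D0
C     The REDUCTION clause gives each thread a private copy of S
C     and combines the partial sums at the end of the loop.
!$OMP PARALLEL DO REDUCTION(+:S)
      DO I = 1, N
         S = S + A(I)
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, 'Sum = ', S
      END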

Conclusions

The ultimate objective of our research is to answer "what" and "how" in a parallel optimization process. The proposed methodology is designed to tell programmers what must be done. We have divided the program optimization process into several steps with feedback loops; each step defines specific tasks for programmers to accomplish. We have also listed common analyses and techniques that are needed. There is a clear goal in each stage, and the condition for its achievement is clearly defined. In this way, our methodology provides significant guidance to programmers in optimizing parallel applications.

The methodology described above has been devised empirically. All of the analyses and techniques have helped us improve the performance of scientific and engineering applications. However, figuring out exactly which technique will improve performance is still a difficult subject and requires further study; performance prediction and modeling have not been successful in the general case. In the next chapter we introduce our experience-based approach to resolving this issue. We support our methodology with a set of tools, which is our approach to answering the question "how". These supporting tools are the topic of the next chapter.

TOOL SUPPORT FOR PROGRAM OPTIMIZATION METHODOLOGY

As previously mentioned, the main advantage of a methodical approach to parallel programming is that it is efficient and easy to apply without advanced experience. The proposed methodology outlines this systematic endeavor towards good performance. However, the individual steps listed in the methodology can be time-consuming and tedious.

Parallel programmers without access to parallel programming tools have relied on text editors, shells, and compilers. Programmers write a program using text editors and generate an executable with the resident compilers. All other tasks, such as managing files, examining performance figures, searching for problems, and incorporating solutions, can be achieved with these traditional tools. However, considerable effort and good intuition are needed for file organization and performance diagnostics. Even with parallelizing compilers, these tasks remain for the users to deal with. In fact, most users end up writing small helper scripts for these tasks.

The tools designed specifically for the development and tuning of parallel programs step in where traditional tools have limits. In general, these tools provide interactivity and an adequate user interface for incorporating user knowledge to further improve program performance. The previous efforts listed in the background chapter mainly focus on two aspects of functionality: automation and visualization. Automatic utilities simplify analyzing very complex program structures. Visualization utilities allow users to view and interpret a large amount of static analysis information and performance data in an efficient manner. Still, we feel that certain functionalities that could be of great help to programmers have been largely ignored by tool developers.

Based on user feedback and the specifics of our methodology, we have set our design goals, which are listed in the next section. Then we discuss in detail the tools that we have developed and/or included in our programming environment. We also present our effort to reach a general audience with our tools through the World Wide Web. Finally, we describe how these tools fit into our methodology and help programmers in the tuning process.

Design Objectives

Consistent support for the methodology: This is the main goal of our research. We examine the steps in the methodology and find time-consuming programming chores that call for additional aid. Some tasks are tedious and may be automated; some require complex analysis and cumbersome reasoning, so assisting utilities are needed. If these are properly addressed with tool support, programmers can achieve greater performance with ease. The integration of the methodology and the tool support would significantly increase efficiency and productivity.

Support for deductive reasoning: Current performance visualization systems offer a variety of utilities for viewing a large amount of data from many different perspectives. Understanding data patterns and locating problems, however, are still left to the users. In addition to providing raw information, advanced tools must help filter and abstract a potentially very large amount of data. Instead of providing a fixed number of options for data presentation, offering the ability to freely manipulate data, and even to compute new sets of meaningful results, can serve as the basis for the user's deductive reasoning.

Active guidance system: Tuning programs requires dealing with numerous different instances of code segments. Categorizing these variants and finding the right remedies demand sufficient experience on the programmer's part. The transfer of such knowledge from experienced to novice programmers has always been a problem in the parallel programming community; it usually takes novice programmers a significant amount of time and effort to gain adequate expertise in parallel programming. We believe that it is possible to address this issue systematically using today's technology.

Program characteristics visualization and performance evaluation: The task of improving program performance starts with examining the performance and analysis data and finding room for improvement. The ability to scroll through these data and visualize what they imply is critical in this task. Tables, graphs, and charts are a common way of expressing a large data set for easy comprehension. However, one of the pitfalls that researchers easily fall into is presenting too much information in a myriad of windows without proper annotations. A good tool should be able to draw the user's attention to what is important.

Integration of static analysis with performance evaluation: Most tools published so far focus on only one of the two types of data. However, as mentioned earlier, good performance only comes from considering both aspects. It is important to identify the relationship between the data from both sides and to have them available for easy analysis. Without the consideration of performance data, static program optimization can even degrade performance; likewise, without static analysis data, optimization based only on performance data may yield only marginal gains.

Interactive and modular compilation: The usual black-box-oriented use of compiler tools has limits in efficiently incorporating the user's knowledge of program algorithms and dynamic behavior. For example, although the compiler detects a value-specific data dependence, the user may know that for every reasonable program input the values are such that the dependence does not occur. In other cases, users may know that the array sections accessed in different loop iterations do not overlap. Furthermore, certain program transformations may make a substantial performance difference but are applicable to very few programs and hence are not built into a compiler's repertoire. If a user can find the reason why a loop was not parallelized automatically, a small modification may be applied that ensures parallel execution. For these reasons, manual code modifications in addition to automatic parallelization are often necessary to achieve good performance, and tools should support a convenient mechanism for incorporating manual tuning. Another drawback of conventional compilers is their limited support for incremental tuning. The localized effect of parallel directives in the shared memory programming model allows users to focus on small portions of code for possible improvement. Hence, compiler support for incremental tuning is also an important goal in our tool design.

Data management: This is a basic need in successfully optimizing various applications. Data management refers to the task of organizing data files, maintaining the storage for the gathered data, and making it easy to retrieve the data for quick comparison and manipulation. A unified space for experimental data with clean interfaces not only helps the developers themselves but also supports combined efforts among research groups by allowing simple access to related databases.

Accessibility: Although the importance of advanced tools for all software development is evident, many available tools remain unused. A major reason is that the process of searching for tools with the needed capabilities and downloading and installing them on locally available platforms is very time-consuming. In order to evaluate and find an appropriate tool, this process may need to be repeated many times. Using today's network computing technology, tool accessibility can be greatly enhanced.

Portability: For disseminating a new tool to the user community, it is important that the tool be easy to install on new platforms. In addition, a tool has to be flexible in the data formats it can read, so that it can adapt to the tools, compilers, and performance analyzers available on the local platform.

Configurability: Satisfying the general users of a tool can only be achieved by allowing them to configure the tool to their liking. By having configurability as one of our design goals, many user preferences can be incorporated into the tool's usage without individually addressing them.

Flexibility: Flexibility is an important characteristic of general tools. We have seen many cases in which new types of performance data needed to be incorporated into the picture for a better understanding of program behavior. Furthermore, we would like to keep the applicability of the tool open for tasks beyond performance tuning.

In the next few sections we introduce the tools in our methodology-support toolbox. We present overviews of the tools as well as their detailed structure and functionality where needed. We also describe the look and feel of these tools from the end user's point of view.

Ursa Minor: Performance Evaluation Tool

Often the programmer's intervention into automatic optimization is necessary to achieve near-optimal parallel program performance. To aid programmers in this process, we have developed a performance evaluation tool, Ursa Minor (User Responsive System for the Analysis, Manipulation and Instrumentation of New Optimization Research). The main goal of Ursa Minor is performance optimization through the interactive integration of performance evaluation with static program analysis information. With this tool, performance anomalies such as poor speedup and high cache miss ratios are easily identified on a loop-by-loop basis via a graphical user interface, and overhead components are computed instantly. This information is combined with static program information, such as array access patterns or loop nest structure, to give a better understanding of the problems at hand.

Ursa Minor complements the Polaris compiler in its support for OpenMP parallel programming in that it understands the compiler's output. It collects and combines information from various sources, and its graphical interface provides selective views and combinations of the gathered data. Ursa Minor consists of a database utility, a visualization system for both performance data and program structure, a source searching and viewing tool, and a file management module. Ursa Minor also provides users with powerful utilities for manipulating and restructuring input data to serve as the basis for the user's deductive reasoning. In addition, it takes performance evaluation one step further by means of an active performance guidance system called Merlin. Ursa Minor can present to the user, and reason about, many different types of data (e.g., compilation results, timing profiles, hardware counter information), making it widely applicable to different kinds of program optimization scenarios.

Functionality

Here we describe the functionality of Ursa Minor and what it can do for programmers. A typical performance evaluation process consists of visualizing performance, identifying problems or anomalies, finding the causes, and devising the corresponding remedies. Programmers need to visualize and compare the performance data from different trials, ruminate over them, compute derivative values, examine the runtime environment for the causes of possible problems, and search for solutions. We have designed practical utilities to assist programmers in this process and integrated them into Ursa Minor.

Performance data and program structure visualization

The Ursa Minor tool presents information to the user through two main display windows: the Table View and the Structure View. The Table View shows the data as text entries that relate to Program Units, which can be subroutines, functions, loops, blocks, or any entities that a user defines. The Structure View is designed to visualize the program structure under consideration. A user interacts with the tool by choosing menu items or mouse-clicking.

The Table View displays data such as the average execution time, the number of invocations of code sections, cache misses, and a text label indicating whether loops are serial or parallel. Generally, the entries can be of type integer, floating-point number, or string. Users can manipulate the presented data through the various features this view provides. This is the main view that provides the means for modifying and augmenting the underlying database, and accesses to the other modules of Ursa Minor take place through this view. The Table View is a tabbed folder that contains one or more labeled tabs. Each tab corresponds to a program unit group, that is, a group of data of a similar type. For instance, the folder labeled LOOPS contains all the data regarding the loops in a given program. When reading predefined data inputs such as timing files and Polaris listing files, Ursa Minor generates predefined program unit groups (e.g., LOOPS, PROGRAM, CALLSTRUCTURE, etc.). Users can create their own groups from their own input files using the proper format.

A user can rearrange columns, delete columns, and sort the entries alphabetically or based on execution time. The bar graph on the right side shows an instant normalized graph of a numeric column. After each program run, the newly collected information is included as additional columns in the Table View. Users can examine these numbers side by side as they see fit. In this way, performance differences can be inspected immediately for each individual loop as well as for the overall program, and the effects of program modifications on other program sections become obvious as well. A modification may change the relative importance of loops, so that sorting them by their newest execution time yields a new most-time-consuming loop on which the programmer has to focus next. The figure below shows the Table View of Ursa Minor in use.

Various features make the Table View easier to use and more accessible. Users can set a display threshold for each column so that an item whose value is less than a certain quantity is displayed in a different color; this feature allows users to effortlessly identify code sections with poor speedup, for instance. One or more rows and columns can be selected so that they can be manipulated as a whole. Data that would not fit into a table cell, such as the compiler's explanation for why a loop is not parallel, can be displayed in a separate window with one mouse click. Finally, Ursa Minor is capable of generating pie charts and bar graphs for a selected column or row for instant visualization of numeric data.

Fig.  Main view of the Ursa Minor tool. The user has gathered information on program BDNA. After sorting the loops based on execution time, the user inspects the percentages of the three major loops (two ACTFOR do loops and one RESTAR do loop) using the pie chart generator (bottom left). Computing the speedup column with the Expression Evaluator reveals that the speedup of the RESTAR do loop is poor, so the user is examining more detailed information on that loop.

Another view of Ursa Minor presents the calling structure of a given program, which includes subroutine, function, and loop nest information, as shown in the Structure View figure. Each rectangle represents either a subroutine, a function, or a loop. The rectangles are color-coded so that more information is conveyed to the user visually; for example, parallel loops are represented by green rectangles and serial loops by red rectangles. Clicking one of these rectangles displays the corresponding source code. In the figure, the user is inspecting an ACTFOR do loop in this way. Rectangles positioned to the right are nested program units: if unit A contains unit B, the rectangle representing B is placed to the right of the rectangle for A. If one wants a wider view of the program structure, the user can zoom in and out. This display helps the user understand the program structure for tasks such as interchanging loops or finding outer or inner candidate parallel loops.

Expression Evaluator

The ability to compute derivative values from raw performance data is critical in analyzing the gathered information. For instance, the average timing value over different runs, the speedup, the parallel efficiency, and the percentage of the execution time of a code section with respect to the overall execution time of the program are common metrics used by many programmers. Instead of adding individual utilities to compute these values, we have added the Expression Evaluator for user-entered expressions. We provide a set of built-in mathematical functions for numeric, relational, and logical operations; nested operators are allowed, and any reasonable combination of these functions is supported. The Expression Evaluator also has a pattern matching capability, so the selection of a data set for evaluation becomes simple.

The Expression Evaluator also provides users with query functions that apprehend static analysis data from a parallelizing compiler. These functions can be combined with the mathematical functions, allowing queries such as "loops that are parallel and whose speedups are below a threshold" or "loops that contain I/O and whose execution time exceeds a given fraction of the overall execution time". For example, after users have identified parallel loops with poor speedup, they may want to compute the cache miss ratio for those

Fig.  Structure View of the Ursa Minor tool. The user is looking at the Structure View generated for program BDNA. Using the Find utility, the user has set the view to subroutine ACTFOR and opened the source view for a parallelized ACTFOR do loop.

loops, or the parallelization overheads. Instead of leaving the reasoning process entirely to users, Ursa Minor guides users through these deductive steps. The Expression Evaluator is a powerful utility that allows manipulating and restructuring the input data to serve as the basis for the user's deductive reasoning through a common spreadsheet-like interface.
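For reference, the common derived metrics named above can be written as follows. These are the standard definitions, stated in our own notation rather than in the tool's expression syntax; p denotes the number of processors and T_i the time spent in code section i.

\[
  \mathrm{speedup} = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}}, \qquad
  \mathrm{efficiency} = \frac{\mathrm{speedup}}{p}, \qquad
  \%\,\mathrm{time}_i = 100 \cdot \frac{T_i}{T_{\mathrm{program}}}
\]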

The Merlin performance advisor

As previously mentioned, identifying performance bottlenecks and finding the right remedies take experience and intuition, which novice programmers usually lack. Acquiring this expertise requires many trials and studies, and even for programmers who have experienced peers, the transfer of knowledge from advanced to novice programmers takes time and effort.

We believe that tools can be of considerable use in addressing this problem. We have combined the aforementioned Expression Evaluator with a knowledge database to create a framework for the easy transfer of experience. Merlin is an automatic performance data analyzer that allows experienced programmers to tell novice programmers how to diagnose and improve many types of performance problems. Its objective is to provide guidelines and suggestions to inexperienced programmers based on the accumulated knowledge of advanced programmers.

The figure below shows an instance of the Merlin user interface. Merlin is activated when a user clicks "Run Performance Advisor for This Row" in the row popup menu. The interface consists of an analysis text area, an advice text area, and buttons. The analysis text area displays the diagnosis that Merlin has performed on the selected program unit. The advice text area provides Merlin's solutions to the detected problems, with examples if any. Each diagnosis and its corresponding advice are paired by an identification number. Users can also load a different map at any time.

Merlin differs from conventional spreadsheet macros in that it is capable of comprehending static analysis data generated by a parallelizing compiler. Merlin can take into account numeric performance data as well as program information such

Fig.  The user interface of Merlin in use. Merlin provides solutions to the detected problems; this example shows the problems addressed in an ACTFOR DO loop of program BDNA. The button labeled "Ask Merlin" activates the analysis. The "View Source" button opens the source viewer for the selected code section. The "ReadMe for Map" button pulls up the ReadMe text provided by the performance map writer.

as whether loops are parallel, the existence of I/O statements or function calls within a code block, and so on. This allows a comprehensive analysis based on both the performance and static data available for the code section under consideration.

Merlin navigates through a knowledge-based database ("maps") that contains information on diagnoses and solutions for various performance symptoms. Experienced programmers write maps based on their knowledge, and novice programmers can view the suggestions made by the experienced programmers by activating Merlin. As shown in the map-structure figure, a map consists of three domains. The elements in the Problem Domain correspond to general performance problems from the viewpoint of programmers; they represent situations such as poor speedup, a large number of stalls, and non-parallel loops, depending on the performance data types targeted by Merlin. The Diagnostics Domain depicts possible causes of the problems, such as floating-point dependences and data cache overflow. Finally, the Solution Domain contains remedial techniques; typical examples are serialization, loop interchange, tiling, and loop unrolling. These elements are linked by conditions. Conditions are logical expressions representing an analysis of the data: if a condition evaluates to true, the corresponding link is taken, and the element in the next domain pointed to by the link is explored. Merlin invokes the Expression Evaluator for the evaluation of these expressions. A Merlin map is written in the Generic Data Format described later in this chapter, and it is loaded into Ursa Minor as an instance of an Ursa Minor database. A more detailed description of Merlin is available elsewhere.

Merlin enables multiple cause-effect analyses of performance and static data. It fetches the data specified by the map from the Ursa Minor tool, performs the listed operations, and follows the links whose conditions are true. There are no restrictions on the number of elements and conditions within each domain, and each link is followed independently. Hence, multiple perspectives can be easily incorporated into one map: memory stalls, for instance, may be caused by poor locality, but they could also indicate a floating-point dependence. In this way, Merlin considers all possibilities separately and presents an inclusive set of solutions to users. At the same time, the remedies

[Diagram: a Merlin map with three columns of elements, the Problem Domain (problem 1, 2, 3, ...), the Diagnostics Domain (diagnostics 1, 2, 3, ...), and the Solution Domain (solution 1, 2, 3, ...); conditions (condition 1, condition 2, ...) link problems to diagnostics and diagnostics to solutions.]

Fig.  The internal structure of a Merlin map. The Problem Domain corresponds to general performance problems, the Diagnostics Domain depicts possible causes of the problems, and the Solution Domain contains suggested remedies. Conditions are logical expressions representing an analysis of the data.

suggested by Merlin assist users in learning by example. Merlin enables users to gain expertise in an efficient manner by listing performance data analysis steps and many example solutions given by experienced programmers.

Merlin is able to work with any map, as long as the map is in the correct format. Therefore, the intended focus of the performance evaluation may shift depending on the interest of the user group. For instance, the default map that comes with Merlin focuses on the parallel optimization of programs; should a map that focuses on architecture be developed and used instead, the responses of Merlin will reflect that intention. The Ursa Minor environment thus does not limit its usage to parallel programming.

Other functionality

During the process of compiling a parallel program and measuring its performance, a considerable amount of information is gathered. For example, timing information becomes available from various program runs, structural information about the program is gathered from the code documentation, and compilers offer a large amount of program analysis information. Finding parallelism starts with looking through this information and locating potentially parallel sections of code. The bookkeeping effort accompanying this procedure is often overwhelming. Ursa Minor provides an organized solution to this problem: all the data regarding the tuning of a specific program are integrated into one compact database. Easy access to the database, supported by the tool, gives users convenient views and manipulation of the data without having to deal with numerous files.

Ursa Minor also supports inter-group logs. Sharing performance data and optimization results among team members is important. Group members can share the databases generated by others by specifying one location as a data repository. When a member decides to share a database with other members, Ursa Minor adds a log entry with the information regarding that particular database to the repository. In this way, group members do not have to ask others to send a database in order to examine the data; the repository has all the information about the databases that members want to share.

Configurability is one way to ensure that the tool adapts well to many users' environments and preferences. The Ursa Minor user interface is configurable: users can change the look of the display views and many other functionalities, and most functions can be mapped to keyboard shortcuts, allowing advanced users to speed up their tasks.

Learning how to use a new tool has always been a nuisance for many programmers. As tools become complex and versatile, reading a manual is cumbersome by itself. Some successful commercial applications in word processing or games have employed an online tutorial approach, in which an embedded module steps through some of the basic functions of the program and tells users how to use them. We have incorporated such a module into Ursa Minor. Our interactive demo session allows users to explore important features of the tool with input data prepared by the developers. In addition, this demo session automates some of the steps so that users can quickly look through them.

Internal Organization of the Ursa Minor Tool

[Diagram: static data (program structure, dependence analysis), dynamic data (performance numbers, runtime environment, hardware counter data), and results from other tools and spreadsheets feed the Database Manager and the database; the Expression Evaluator, the GUI Manager (Table View and Structure View), and the Merlin Performance Advisor operate on the database and interact with the user.]

Fig.  Building blocks of the Ursa Minor tool and their interactions.

The figure above illustrates the interaction between the Ursa Minor modules and various data files. The Database Manager handles interaction between the database and the other modules; depending upon users' requests, it fetches the required data items or creates or modifies database entities. The GUI Manager coordinates the various windows and views and controls the process of handling user actions; it also takes care of data consistency between the database and the display windows. The Expression Evaluator is a facility that allows users to perform spreadsheet-like, user-typed commands on the current database; this module parses a command, applies the operations, and updates the views accordingly. Finally, Merlin is a guidance system capable of automatically conducting performance analysis and finding solutions.

Internally, Ursa Minor stores information in an Ursa Minor/Major Database (UMD). The UMD is a storage unit that holds the collective information about a program, its execution results in a certain system environment, and any other pertinent data that users include. This database can be stored in different formats, including a plain text file, which can optionally be inspected with an editor and printed. Furthermore, a database can be saved in a format that can be read by commercial spreadsheets, providing a richer set of data manipulation functions and graphical representations.

The Ursa Minor tool is written in Java; thus, any platform on which the Java runtime environment is available can be used to run the tool. It uses the basic Java language with standard APIs, which enhances the portability of the tool. Object orientation in Java allows a relatively easy addition of new types of data to the database. The windowing toolkits and utilities provide a good environment for prototyping user interfaces, which enabled us to focus on the design of the tool's functionality. Furthermore, Java, with its network support, is a useful language for realizing another goal of this project: making the gathered program compilation and performance results available to users worldwide. This goal has been realized in the Ursa Major tool, which is discussed in a later section.

Database structure and data format

Ursa Minor maintains an organized database structure to store data. Inside the Ursa Minor database, data items are stored as one of four types: integer, floating-point number, string, and long string. For the most part, the database module does not care what kind of information it holds. This is, of course, good programming practice, but more importantly, it helps ensure the flexibility and configurability of the entire tool. Certain modules do understand data semantics, such as the Structure View and the query functions in the Expression Evaluator, but the lack of the required data does not prevent the tool's usage.

At the bottom of the structure is the Program Unit. This is the basic storage unit that maps to an entity such as a loop, a subroutine, or a code block. These units belong to a larger entry called a Program Unit Group. Usually, Program Unit Groups are labeled "loops", "subroutines", etc., depending on the Program Units that they keep. These groups are combined into a Session, which logically maps to the database for one optimization study. Sessions are managed by the Ursa Minor database manager, the module that handles database accesses. The figure below shows a design schematic of the database.

[Diagram: a Session contains Program Unit Groups (Loops, Subroutines, Functions, ...); each group contains Program Units (Loop 1, Loop 2, Loop 3, ...); each Program Unit holds typed fields such as Integer: number of invocations, Float: average execution time, Float: overall execution time, Float: number of cycles, Float: memory stalls, String: serial or parallel, and Long String: nested units.]

Fig.  The database structure of Ursa Minor.

Ursa Minor is capable of reading several different types of data files that are generated by the other tools listed in this chapter. Performance data ("sum") files are generated when Polaris-instrumented executables run. Polaris listing files are generated when Polaris attempts parallelization of a program, and they contain static analysis information. When Ursa Minor reads these files, it parses them in a predefined way and creates the appropriate program unit groups, so users of the tool do not need to concern themselves with data types or formats when loading these files. Also, Ursa Minor can read and write its database using the Java serialization utility, which stores the database in a compact data file. Adding or removing data from the loaded database is as simple as clicking a menu item.

In order to provide more flexibility, we have defined the Generic Data Format, which can handle a wide variety of data. Using this text-based format, users can input almost any type of data with any data structure. The format allows users to create program unit groups of their own and arrange data as they see fit. This feature greatly enhances the applicability of Ursa Minor and fulfills one of the design goals: flexibility.

Summary

Ursa Minor supports the methodology presented in the previous chapter by providing utilities that mitigate many tasks in the performance evaluation stage. It integrates static analysis and performance data by means of a database with structure-based entities that hold many different types of data. With its support for deductive reasoning, active guidance, and data management through configurable and flexible utilities, Ursa Minor offers significant aid to parallel programmers in need of a performance evaluation tool.

Ursa Minor has been installed on the Parallel Programming Hub, allowing access by remote users all over the world. Users can quickly evaluate the tool with ease or utilize it extensively for production use. By combining Ursa Minor with other utilities on the Hub in support of the methodology, our goal of a comprehensive programming environment is drawing near. The Parallel Programming Hub is discussed in detail in a later section.

InterPol: Interactive Tuning Tool

Good performance from a program is usually achieved by an incremental tuning and evaluation process. The term "incremental" applies to both the applied techniques and the modified code segments. Conventional batch-oriented compilers are limited in helping programmers with this task: often, selecting target regions and choosing optimization techniques are done by slicing a program and manipulating compiler options manually. The accompanying tasks of file management and learning about compiler options are often overwhelming to programmers.

Advanced parallelizing compilers provide a large list of available techniques for program parallelization and optimization. These techniques are usually controlled by switches or command line options that may not be intuitive or user-friendly. The ability to select optimization techniques, and even to reorder their application, would provide flexibility in exploring various combinations of techniques on different sections of code. In addition, it would offer a playground for those interested in studying compiler techniques.

InterPol is an interactive utility that allows users to target program segments and apply optimization techniques selectively. It allows users to build their own compiler from the numerous optimization modules available in a parallelizing compiler infrastructure. It is also capable of incorporating manual changes made by users. Meanwhile, InterPol keeps track of the entire program that the user wants to optimize, relieving programmers of file and version management tasks. In this way, programmers are free to apply selected techniques to specific regions, change code manually, and generate a working version of the entire program without exiting the tool. During the optimization process, the tool can display static analysis information generated by the underlying compiler, which can help users in further optimizing the program.

Overview

The figure below illustrates the major components of InterPol. Users select code regions using the Program Builder and arrange optimization techniques through the Compiler Builder. The Compilation Engine takes input from these builders, executes the selected compiler modules, and displays the output program. If the user wants to keep the modified code segments, the output goes back into the Program Builder. Instead of running the Compilation Engine, users may choose to make changes to the code manually. All of these actions are controlled by a graphical user interface, and users are able to store the current program variant at any point in the optimization process.

Functionality

Part (a) of the user-interface figure below shows the graphical user interface offered by InterPol. Target code segments and the corresponding transformed versions are visible in separate areas.

[Diagram: the input program feeds the Program Builder; the Compiler Builder configures calls into the Polaris infrastructure; the Compilation Engine combines both and produces the output program; all three modules are driven through a graphical user interface.]

Fig.  An overview of InterPol. Three main modules interact with users through a graphical user interface. The Program Builder handles file I/O and keeps track of the current program variant. The Compiler Builder allows users to arrange optimization modules in Polaris. The Compilation Engine combines the user selections from the other two modules and calls the Polaris modules.

Static analysis information is given in another area whenever the user activates the compiler. Finally, the Program Builder interface provides an instant view of the current version of the target program. InterPol is written in Java.

The underlying parallelization and optimization tool is the Polaris compiler infrastructure. The various Polaris modules form building blocks for a custom-designed parallelizing compiler, and InterPol is capable of stacking up these modules in any order. Polaris also comes with several different data dependence test modules, which can likewise be arranged by InterPol. Overall, a large number of modules are available for application, and users have the freedom to choose any blocks in any order. Executing this custom-built compiler is as simple as clicking a menu item, and the result is displayed immediately in the graphical user interface. Part (b) of the user-interface figure shows the Compiler Builder interface of InterPol. More detailed configuration is also possible through InterPol's Polaris switch interface, which controls the behavior of the individual passes.

Fig.  User interface of InterPol: (a) the main window and (b) the Compiler Builder.

The Program Builder keeps and displays the up-to-date version of the whole program. Users select program segments from this module, apply the automatic optimizations set up by the Compiler Builder, and/or add manual changes. The Compiler Builder is accessible at any point, so users can apply entirely different sets of techniques to different regions. The current version of the program is always shown in the Program Builder interface for easy examination. Through this continuous process of tuning optimized program segments, users always stay in the process, observing and modifying program transformations step by step.

During the optimization process, InterPol can display the program analysis results generated by running the Polaris modules, including data dependence test results, induction and reduction variables, etc. This provides a basis for further optimization: programmers can incorporate their knowledge of the underlying algorithm, compensating for the compiler's limited knowledge of the program's dynamic behavior and input data.

Summary

InterPol seeks to assist programmers by providing highly flexible utilities for both automatic and manual optimization. For those who are not familiar with the techniques available in parallelizing compilers, the tool provides greater insight into the effects of code transformations. By combining the Ursa Minor performance evaluation tool with InterPol, we hope to create a complete programming environment.

Other Tools in Our Toolset

The functionality of Ursa Minor and InterPol, combined with the Polaris instrumentation module, covers all the aspects of the methodology discussed in the previous chapter. Later, in the section on integration with the methodology, we describe how these tools provide comprehensive support for the methodology. In this section we present a set of complementary tools in our toolset that were developed in related projects. The main goals of these tools do not necessarily match the issues that we address in this research, but they provide additional information and grant control over other aspects of program development. These tools have been either developed or modified at Purdue University.

Polaris parallelizing compiler

The Polaris parallelizing compiler is a source-to-source restructurer developed at the University of Illinois and Purdue University. Polaris automatically finds parallelism and inserts appropriate parallel directives into input programs. Polaris includes advanced capabilities for array privatization, symbolic and nonlinear data dependence testing, idiom recognition, interprocedural analysis, and symbolic program analysis. In addition, the current Polaris tool is able to generate OpenMP parallel directives and apply locality optimization techniques such as loop interchange and tiling.

As demonstrated previously, the Polaris compiler has successfully improved the performance of many programs on various target machines. Polaris provides a good starting point for parallelizing and optimizing Fortran programs. For advanced programmers, it can save the substantial time that would otherwise be spent tuning loops that can be parallelized automatically; for novice programmers, manually parallelizing those loops would be cumbersome to begin with. In addition, Polaris can provide a listing file with the results of static program analysis, which may give programmers valuable information on various code sections.

InterPol, described above, provides easy interactive access to the Polaris parallelizing compiler and is even capable of restructuring the optimization modules within Polaris. If InterPol is not available, Polaris can serve as an alternative, allowing fast parallelization of the programs at hand. Polaris is installed on the Parallel Programming Hub, available to programmers all over the world.

InterAct p erformance monitoring and steering to ol

InterAct is a toolset that allows interactive instrumentation and tuning of OpenMP programs. This toolset provides a simple interface and API that allow users to quickly identify performance bottlenecks through online monitoring of program performance and to explore solutions by experimenting with user-defined tunable variables. The Polaris parallelizing compiler has been modified to annotate sequential Fortran programs with OpenMP shared-memory directives as well as to insert calls to the instrumentation library. The instrumentation library collects both timings and hardware counter events, transparently managing low-level details such as counter overflows. To manage the hardware counters, the OpenMP Performance Counter Library (OMPcl) has been developed to accurately collect events within the multithreaded OpenMP environment.

InterAct provides a graphical user interface (GUI) to monitor program behavior as well as to dynamically change instrumentation, environment settings, and critical program variables during execution. It supports visualization of collected data, dynamic instrumentation, interactive modification of the number of threads used by the application, interactive selection of the runtime library used for managing parallel threads, and interactive modification of global variables that are registered by the target application. These global variables may be compiler- or user-inserted and may be used to control the behavior and/or performance of the application. The toolset provides a socket interface between the application and the GUI that allows monitoring to be done either locally or remotely. The accompanying figure shows a screenshot of InterAct in use for studying the dynamic behavior of the SWIM benchmark.

Fig. Monitoring the example application through the InterAct interface. The main window shows the characterization data of the major loops in the SPEC SWIM benchmark.

MaxP parallelism analysis tool

A compiler can analyze the static behavior of a program; it can find characteristics that hold for all possible input data sets and target machines. In contrast, dynamic evaluation of a program can provide insight into program characteristics and predict behaviors that static analysis methods may miss. Of particular interest is understanding the dynamic behavior of parallelism, one of the most dominant factors in performance.

MaxP is a Polaris-based tool developed at Purdue University. It evaluates the inherent parallelism of a program at run time. The inherent parallelism is defined as the ratio of the total number of operations in a program (or program section) to the number of operations along the critical path. The critical path is the longest path in the program's dataflow graph, which MaxP computes during program execution. The tool can thus determine the minimum execution time of a program assuming an unlimited number of parallel processors. It reports the maximum parallelism as an upper estimate of the potential performance gain that a user can expect from aggressively optimizing the code.
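As an idealized illustration (invented numbers, not a measurement): a loop whose 1000 iterations each perform 10 mutually independent operations contains 10,000 operations in total, but its critical path is only 10 operations long, so MaxP would report an inherent parallelism of 1000 for that loop.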

Integration with Methodology

In this section we examine how we envision the combined methodology-plus-tools scenario. First, we discuss how these tools facilitate the steps listed earlier. Then we focus on other features of the tools that help programmers throughout the tuning process.

Tool support in each step

Our tools have been designed and modified with the parallel programming methodology in mind. The accompanying figure gives an overview of how these tools can be used in each step of the methodology introduced in the previous chapter. Ursa Minor mainly contributes to the performance evaluation stages; InterPol and Polaris offer aid in the parallelization and manual tuning stages. Additional help in executing target programs is available through InterAct. In the following, we revisit each step in the methodology and discuss the roles of our tools.

Instrumenting program

The Polaris tool offers an instrumentation module as one of its passes. Users can activate this module using a set of switches; in this way, users can generate instrumented versions of both parallel and serial programs. Polaris provides several switches for instrumenting the execution time of loops. These switches dictate the types of code blocks that are instrumented and how nested sections are instrumented. By carefully controlling the switches, users can add all the necessary timing functions without excessive overhead.

Fig. Tool support for the parallel programming methodology.

Combined with the OpenMP Performance Counter Library introduced above, Polaris can instrument a program so that each run generates a profile containing various performance data measured by hardware counters on the instrumented code segments. This library is available on many modern machines. Many types of measurement are available, including the number of cycles, instruction and data cache hits, the numbers of reads and writes, instruction counts, dependence stalls, and so on. The library can generate a data file that can be read by Ursa Minor for further analysis.

As noted in the methodology, it is important to record the execution time of the uninstrumented program. This serves as the basis for measuring the perturbation that instrumentation introduces. A simple UNIX command such as time can provide such a timing number.
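The actual Polaris instrumentation interface is not reproduced here; the following hand-written sketch only illustrates the general pattern of loop-level timing instrumentation, using the standard Fortran SYSTEM_CLOCK intrinsic in place of the real instrumentation library:

      SUBROUTINE RESTAR_SKETCH(N, A)
!     Hand-written sketch of loop-level timing instrumentation; the real
!     instrumentation library is replaced by the standard SYSTEM_CLOCK
!     intrinsic, and the printed label mimics a summary-file entry.
      INTEGER N, I, T0, T1, RATE
      REAL A(N)
      DOUBLE PRECISION ELAPSED
      CALL SYSTEM_CLOCK(T0, RATE)
      DO I = 1, N
         A(I) = A(I) * 2.0
      END DO
      CALL SYSTEM_CLOCK(T1)
      ELAPSED = DBLE(T1 - T0) / DBLE(RATE)
      PRINT *, 'RESTAR_do  TOT ', ELAPSED
      END SUBROUTINE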

Getting serial execution time

Running an instrumented serial version is typically done through the UNIX command line, usually with a simple command invocation. Instrumentation generates some form of record containing the timing information for the instrumented code segments. For example, an executable instrumented by the Polaris instrumentation utility generates a file whose entries look like the following:

RESTAR_do    AVE ...    MIN ...    MAX ...    TOT ...
RESTAR_do    AVE ...    MIN ...    MAX ...    TOT ...
RESTAR_do    AVE ...    MIN ...    MAX ...    TOT ...
RESTAR_do    AVE ...    MIN ...    MAX ...    TOT ...
RESTAR_do    AVE ...    MIN ...    MAX ...    TOT ...
ACTFOR_do    AVE ...    MIN ...    MAX ...    TOT ...
ACTFOR_do    AVE ...    MIN ...    MAX ...    TOT ...
OVERALL time ...

The tabular section shows the average (AVE), minimum (MIN), maximum (MAX), and cumulative total (TOT) time spent in each instrumented segment. The last line shows the overall execution time of the entire program. This file can be read directly by the Ursa Minor tool for analysis.

Running parallelizing compiler

This is the step in which users attempt parallelization by running automatic utilities. Its main goals are to let an automatic parallelizer optimize complex loops, to obtain the compiler's static analysis results, and to save time by automating the parallelization of small, inconsequential loops. The target in this case is therefore usually the entire program. Furthermore, most parallelizers with interprocedural analysis capability work best when an entire program is given as input. Polaris, as a batch-oriented program, performs well for this purpose; InterPol is also capable of handling this task.

Manually optimizing programs

Any text editor can be used to modify programs manually, and several UNIX commands are useful for manipulating programs; an example is fsplit, which splits subroutines and functions into separate files. However, InterPol is specifically designed for the process of manual tuning. InterPol allows programmers to apply selected techniques to specific regions, change code manually, and generate a working version of the entire program without exiting the tool. Some of the manual techniques that users may consider were presented in an earlier chapter.

Getting optimized execution time

In the shared-memory model, programmers can invoke a parallel program the same way they execute a serial program. Typically, certain environment variables need to be set beforehand; for example, on Solaris machines the environment variable OMP_NUM_THREADS determines the number of processors to be used. If programmers used the Polaris compiler for instrumentation, a summary file is generated after each run.

InterAct allows interactive instrumentation and tuning of OpenMP programs. Its ability to dynamically change runtime parameters (tile size, unrolling factor) provides a testbed for finding the optimal set of techniques. Monitoring and changing the hardware counter instrumentation make the instrumentation process more efficient.

Finding and resolving performance problems

Programmers need utilities for collecting and sorting data. Identifying performance problems requires a considerable amount of examination and hand analysis, and finding solutions often requires experience with program optimization studies. Ursa Minor provides tools that assist parallel programmers in evaluating performance effectively. Its graphical interface provides selective views and combinations of timing information together with program structure and static analysis data. Users can assemble a table, open a Structure View, draw charts, perform spreadsheet-type operations, and examine source code. Ursa Minor manages the information within its own database, so data management that might otherwise have required significant file and version control becomes simple.

Identifying dominant loops is very simple with Ursa Minor. Users can load timing profiles and sort the entries through the column popup menu. If a user creates a pie chart, the most time-consuming loops are displayed, with the entire circle representing the total execution time. The bar graph on the right gives an instant view of normalized numeric data.

An important task in tuning program performance is to evaluate whether an applied program modification produces an acceptable result. This involves computing various metrics, such as speedup and parallel efficiency, and examining program analysis information. The built-in mathematical functions allow users to manipulate the data. The static analysis information generated by the Polaris compiler is also managed within the Ursa Minor database; for code segments that require manual tuning, this information provides vital clues. Static analysis information, as well as the source code viewer, can be pulled up at any time with simple menu clicks, so users can make a comprehensive diagnosis of the problems at hand.

Ursa Minor does more than just present data; it is capable of actively analyzing the data and giving advice. When users run Merlin, it extracts the necessary information and applies diagnosis techniques to find appropriate solutions. As mentioned previously, the decisions that Ursa Minor makes rely on the Merlin map, which is typically provided by advanced parallel programmers. In this way, the knowledge of experienced programmers can easily be used by novice programmers. Because a map can contain a variety of functions applicable to any type of data, Merlin can be used in many different fields of study.

Other useful utilities

In addition, the toolset provides functionality for tasks that are not specifically tied to the methodology steps.

When programmers are given an application to optimize, they usually start by examining the source code. Basic knowledge about the program structure, such as the large subroutines or functions, their algorithms, and their callers and callees, helps programmers tremendously later in the tuning stage. The algorithms employed by program modules, although not necessary for following the methodology, may be important, especially when programmers need to consider replacing algorithms.

Programs written by others are generally harder to understand; different coding styles make it difficult to grasp the underlying composition of individual program modules. The Structure View of Ursa Minor alleviates this problem by presenting users with an intuitive, color-coded view of the program structure. A simple click pulls up the source view when closer examination is desired. This can save a significant amount of the user's time.

As the size and complexity of applications grow rapidly, the subject of performance steering is receiving more attention. Performance steering can be useful both during development and in production use. For instance, finding the right parameters for convergence criteria during application development can be tricky, so the ability to set or reset the relevant variables during program execution is advantageous for experimenting with different values. Also, an application may be able to simulate many different aspects of a target object while users are interested in only one aspect; in this case, performance steering can save time and resources by restricting the simulation. The interest of InterAct lies along this line. The primary use of InterAct in our study has been finding the optimal combination of optimization-related parameters (e.g., tile size, unrolling factor) for a given application. For long-running programs, InterAct allows fine control over variables such as the simulation step size and the number of iterations.

When more than one person is involved in an optimization project, communication between group members becomes problematic. The data that one person generates may not be easily accessible to, or compatible with, the tools used by others. Other members of the group may want to focus on different perspectives, but the information from one researcher may not be formatted or arranged in a compatible way. Sharing a manipulable database opens up the possibility of all members having access to a set of compatible databases relevant to their individual tasks. At the same time, group members can reason about the data gathered by others, focusing on the aspects they are interested in. Ursa Minor enables an efficient and meaningful way of sharing research results.

Finally, the growing popularity of multiprocessor workstations and high-performance PCs is leading to a substantial increase in non-expert users and programmers of this machine class. Such users need new programming paradigms; perhaps most importantly, they need good examples to learn from. We have extended our effort to support parallel programming by example through Web-accessible tools and a database repository. This is the topic of the next section.

The Parallel Programming Hub and Ursa Major

Although the importance of advanced tools for software development is evident, many available tools remain unused, mainly because of their limited accessibility. We have developed a set of tools for parallel programmers, and the Internet provided an opportunity to make our tools more accessible to parallel programmers worldwide. Here we present two separate outcomes of our effort to reach a wider audience with our tools. The Parallel Programming Hub is an ongoing project to provide a globally accessible, integrated environment that hosts parallelizing compilers, program analyzers, and interactive performance tuning tools; users can access and run these tools with common Web browsers. Ursa Major is an Applet-based application that enables visualization and manipulation of the performance and static analysis data of various parallel applications that have been studied at Purdue University; its goal is to make a repository of program information available via the World Wide Web.

Parallel Programming Hub: a globally accessible, integrated tool environment

Programming tools are of paramount importance for efficient software development. However, despite several decades of tool research and development, there is a drastic contrast between the large number of existing tools and those actually used by ordinary programmers. We believe there are two main reasons for this situation. The first is that, in order to benefit from new tools, a programmer typically has to go through one or several tedious efforts of searching, downloading, installing, and resolving platform incompatibilities before the tools can even be learned and their usefulness evaluated. The second is that, even when the value of a number of tools has been established, they often use different terminology, diverse user interfaces, and incompatible data exchange formats; hence they are not integrated.

Through the combined efforts of many researchers, we have created the Parallel Programming Hub, a new parallel programming tool environment that is accessible and executable anytime, anywhere through standard Web browsers, and integrated in that it provides tools that adhere to a common methodology for parallel programming and performance tuning. The Parallel Programming Hub addresses the two issues above. First, it makes a growing number of tools available on the Web, where they are accessible and executable through standard Web browsers. The Parallel Programming Hub places no restrictions on the type of tools that can be added: a new tool can be installed without modification, providing its original graphical user interface and, if necessary, being served directly off the home site of a proprietary provider. In all cases, authorized users can access the tool via standard Web browsers.

Our methodology is supported by the Parallel Programming Hub, which includes the Polaris parallelizing compiler, the MaxP parallelism analysis tool, and the Ursa Minor performance evaluation and visualization tool, described in previous sections. In addition, an increasing number of tools are being made available through the Parallel Programming Hub. Currently, the Trimaran environment for instruction-level parallelism (ILP) and the SUIF parallelizing compiler are accessible. Authorized users can also access a number of common support tools, such as Matlab, Mentor Graphics, GNU Octave, and StarOffice. The accompanying figure shows a screenshot of Ursa Minor in use on the Parallel Programming Hub.

On the surface, the Parallel Programming Hub is a set of Web pages through which users can run various parallel programming tools. Underneath this interface is an elaborate network computing infrastructure called the Purdue University Network Computing Hub (PUNCH). PUNCH is an infrastructure that supports network-accessible, demand-based computing. It allows users to access and run unmodified tools via standard Web browsers. PUNCH allows tools to be written in any language and does not require the source or object code of the applications it hosts; this allows a wide variety of tools to be included.

When a user invokes a tool on PUNCH, the resource management unit selects an appropriate platform from a resource pool and executes the tool on it. The resource management unit keeps resource usage at an optimal level. It also makes the system highly scalable, ensuring that PUNCH performs well under widely varying numbers of users, tools, and resource nodes.

Fig. Ursa Minor in use on the Parallel Programming Hub.

PUNCH is logically divided into discipline-specific Hubs. Currently, PUNCH consists of four Hubs containing tools from semiconductor technology, VLSI design, computer architecture, and parallel programming. These Hubs contain over thirty tools from eight universities and four vendors and serve more than five hundred users from Purdue, the rest of the US, and Europe. PUNCH has been accessed millions of times since it became operational.

Upon registering, a user is given an account and disk space that is accessible whenever the user is on PUNCH. The execution of tools via PUNCH takes place in UNIX "shadow" accounts that are managed by the network computing infrastructure. This shadow account structure allows user accounts to be added to the parallel programming Hub without requiring a UNIX system administrator to set up individual accounts. PUNCH keeps all user files in a master account and maintains a pool of shadow accounts that are allocated dynamically to users at run time. Input files for interactive programs such as Ursa Minor are transferred on demand from master to shadow accounts via a system-call tracing program based on the UFO prototype, which implements a user-level virtual file system on top of the FTP protocol. This system is transparent to users, so all file transactions appear to be normal disk I/O.

The immediate advantage of having an integrated, network-based tool environment is a substantial saving in users' effort and resources. The Parallel Programming Hub eliminates the time to search for, download, and install tools, and it greatly supports users in learning a tool through uniform documentation, online tutorials, and tools that speak a common terminology. A typical tool access time for first-time users of the ParHub is on the order of a minute, including authentication and navigating to the right tool. This contrasts with download and installation times at least an order of magnitude larger; even greater effort becomes necessary if tools need to be adapted to local platforms.

A novel aspect of the ParHub's underlying technology is that it represents not only an actual information grid but also includes the necessary portals for its end users. One vision is that future users can access software tools via any local platform, from a palmtop to a powerful workstation; compute power and file space are provided on the Web, and mobility is provided in that these resources are accessible transparently from any access point. The described infrastructure represents a significant step toward this vision.

Ursa Major: making a repository of knowledge available to a worldwide audience

A core need for advancing the state of the art of computer systems is performance evaluation and the comparison of results with those obtained by others. To this end, many test applications have been made publicly available for study and benchmarking by both researchers and industry. Although a large body of measurements obtained from these programs can be found in the literature and in public data repositories, it is usually extremely difficult to combine them into a form meaningful for new purposes. In part this is because the data are not readily available (i.e., they have to be extracted from several papers) and because they have to undergo substantial recategorization and transformation. In addressing this issue, the Ursa Major project is creating a comprehensive database of such information.

Many tools can gather raw program and performance information and present it to users, which is a starting point for answering such questions. However, in addition to providing raw information, advanced tools must help filter and abstract a potentially very large amount of data.

Ursa Major addresses these issues by providing an instrument with which application, machine, and performance information can be obtained from various sources and displayed in an interactive viewer attached to the World Wide Web. It provides a repository for this information and assists users in its abstraction and comprehension. Industrial benchmarkers may be interested in a single number for machine comparisons; programmers may be interested in transformations that can improve the performance of an application; computer architects may want to compare their cache measurements with those obtained by their peers. Ursa Major provides hooks for these needs, and it includes instruments for the underlying data mining task.

Ursa Major is an Applet-based application that enables visualization and manipulation of the performance and static analysis data of various parallel applications that have been studied at Purdue University. The goal of Ursa Major is to make a repository of program information available via the World Wide Web. Ursa Major has its origin in the Ursa Minor tool and provides almost identical functionality. Because we chose Java as the implementation language, it was natural to combine these resources with the rapidly advancing Internet technology and in this way allow users at remote sites to access our experimental data. Typically, in response to a user interaction, Ursa Major fetches from the repository a program database that represents a specific parallel programming case study and then displays it using Ursa Minor's visualization utilities. Due to Applet security constraints, local disk access is not supported by Ursa Major. The accompanying figure shows an overall view of the interactions between Ursa Major, a user, and the Ursa Major repository (UMR).

between Ursa Major a user and the Ursa Major rep ository UMR

Remote Server

Ursa Major Applet UMR (Ursa Major Repository)

Java Program Download DataBase Download

URSA MAJOR UMD (Ursa Major Database)

presentation/edit database presentation/edit database Loop Table View Call Graph View

interaction interaction

User

Fig Interaction provided bytheUrsa Major to ol

The data repository is being constructed from results gathered in various research projects. Currently, it contains the characteristics of a number of programs, the results of compiler analyses of these programs, their performance numbers on diverse architectures, and the data generated in several simulator runs. Individual databases in the repository are in the Generic Data Format described earlier. One issue in designing the repository was to define a storage scheme that makes it easy for users to find information entered by other users. To this end, the repository structure uses extensions on file and directory names indicating data such as program names, platforms, compilers, optimizations, and parallel languages. To be flexible, these extensions are not hard-coded; instead, they are described in a configuration file that Ursa Major reads at the start of a session.

Ursa Major supports a user model of "parallel programming by example," and it serves as a program and benchmark database for high-performance computing. It integrates information available from performance analysis tools, compilers, simulators, and source programs to a degree not provided by previous tools. Ursa Major can be executed on the World Wide Web, from which a growing repository of information can be viewed. Through continuous updates to the repository, we envision Ursa Major becoming the first place to look for performance data.

The emergence of the Parallel Programming Hub presents an interesting opportunity to compare these two network-based tools. Although their goals are distinct, Ursa Minor on the Parallel Programming Hub and Ursa Major provide users with the same visualization utilities for viewing performance and static analysis data. The Parallel Programming Hub enables Ursa Minor to load and manipulate user inputs from remote sites; on the other hand, it lacks support for access to a centralized repository. A detailed comparison in terms of response time is given in the next chapter.

Conclusions

Our effort to create a parallel programming environment has resulted in a parallel program development and tuning methodology and a set of tools. We have developed the tools with our design goals in mind: to provide an integrated, flexible, accessible, portable, and configurable tool environment that conforms to the underlying methodology. Our toolset integrates static program analysis with performance evaluation while supporting data visualization and interactive compilation. Data management is also simplified with our tools.

To give access to these tools to as many users as possible, and to disseminate our performance databases of various applications as widely as possible, we have used a network computing infrastructure. In addition, we are building a database repository that enables the visualization and manipulation of performance results through a Java Applet application.

Here we conclude the presentation of our methodology and tool efforts. The methodology addresses the "what" of parallel programming. The toolset described in this chapter has been designed and implemented based on our experience and design goals and aims to answer the "how." Finally, with the extra effort to promote the tools and reach a wider audience, we have attempted to answer the "where." The methodology and the tools are useless if they are not effective in actual parallel programming and performance tuning; the obvious next step is to evaluate the benefits of the tools as well as the methodology, hence answering "how well" they work. This is the topic of the next chapter.

EVALUATION

Evaluating a methodology and tools is difficult, largely due to two problems. First, the desirable characteristics of a methodology and supporting tools, such as efficiency and effectiveness, cannot be measured easily, especially in quantitative terms; it is very challenging to establish a set of metrics for such measures. Second, the goal of developing a methodology and supporting tools is to assist users, so in determining their efficiency, the users' willingness toward them and knowledge of them become critical factors. Having a large user community would help judge their value, but even then, creating controlled experiments to obtain quantitative feedback is very difficult.

These are the main reasons that many tool efforts in parallel programming have ignored the evaluation aspect. The majority of publications related to parallel programming tools do not include quantitative evaluations. Even general descriptions of user feedback, such as "response to the Sigma editor has been good," are seldom found. Some demonstrate the usage of tools via descriptive case studies. Publications focusing on programming methodology have taken the same approach, giving several examples of how their proposed scheme can be applied to actual programming practice. One notable evaluation effort is found in the SUIF Explorer publication, in which a performance improvement attempted by a user is summarized in detail. Whether it accurately reflects the efficiency of the tool is arguable, but as the only quantitative measurement for tool evaluation, their effort is noteworthy.

In this chapter, we attempt a fair and accurate evaluation as follows. First, we give a series of case studies to demonstrate the usage of our methodology and tool support, with a detailed description of each parallelization and tuning process; these case studies serve to show the applicability of the methodology and the functionality of the tools. Next, we evaluate the tool functionality by analyzing and comparing tasks accomplished with and without the tools, and we summarize comments from users. We then compare our tools with other parallel programming environments. Finally, we discuss tool accessibility as a result of adopting the network computing facilities, and we close with conclusions.

Methodology Evaluation: Case Studies

Manual tuning of ARC2D

In this section, we present a case study illustrating the manual tuning of the program ARC2D from the Perfect Benchmarks; this case study has been presented previously. In this study, a programmer tried to improve the performance of the program beyond that achieved by the Polaris parallelizing compiler. The target machine is a multiprocessor HyperSPARC workstation.

Polaris was able to parallelize almost all loops in ARC2D; however, the speedup of the resulting executable remained low. Using Ursa Minor's Structure View and sorting utility, the programmer found three loops to which loop interchange could be applied: FILERX_do, XPENTA_do, and XPENT_do. After the loop nests were interchanged, the total program execution time decreased noticeably, improving the overall speedup.

As a result of this modification, the dominant program sections changed. The programmer re-evaluated the most time-consuming loops, using the Expression Evaluator to compute the new speedups and each loop's percentage of the total execution time. The most time-consuming loop was now the STEPFY_do nest, which consumed a large fraction of the new parallel execution time. The programmer examined the nest with the source viewer and noticed two things: there were many adjacent parallel regions, and the parallel loops were not always distributing the same dimension of the work array. The programmer merged all of the adjacent parallel regions in the nest into a single parallel region. The new parallel region consisted of four consecutive parallel loops: the first two nests were single loops that distributed the work array across its innermost dimension, and the second two nests were doubly nested and distributed the work array across its second innermost dimension. The effect of these changes was twofold: first, merging the regions eliminates parallel loop fork/join overhead; second, normalizing the distributions within the subroutine improves locality. After this change, the speedup of the loop improved further.
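The STEPFY code itself is not reproduced here; the following hand-written sketch (invented array and bounds) only illustrates the general transformation of merging adjacent parallel regions into one region containing several work-sharing loops, so that threads are forked and joined once instead of once per loop:

      SUBROUTINE MERGE_SKETCH(N, WORK)
!     Hand-written sketch: originally each loop below would sit in its
!     own parallel region, paying the fork/join cost twice.  After
!     merging, one parallel region contains two work-sharing loops.
      INTEGER N, J
      REAL WORK(N,2)
!$OMP PARALLEL
!$OMP DO
      DO J = 1, N
         WORK(J,1) = WORK(J,1) + 1.0
      END DO
!$OMP END DO
!$OMP DO
      DO J = 1, N
         WORK(J,2) = WORK(J,2) + 1.0
      END DO
!$OMP END DO
!$OMP END PARALLEL
      END SUBROUTINE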

The programmer was able to apply the same techniques (fusion and normalization) to the next most time-consuming loops: STEPFX_do, FILERX_do, and YPENTA_do. These modifications yielded a further speedup gain. Finally, the programmer applied the same techniques to the next most time-consuming sections, XPENTA, YPENT, and XPENT, according to the newly computed profiles and speedups, improving the speedup again. The programmer felt that the point of diminishing returns had been reached and halted the optimization.

Fig. The (a) execution time and (b) speedup of the various versions of ARC2D. The modifications are, in order: loop interchange; the STEPFY_do modification; the STEPFX_do modification; the FILERX_do modification; the YPENTA_do modification; and the modification of XPENTA, YPENT, and XPENT.

In summary, applying loop interchange, parallel region merging, and distribution normalization increased the out-of-the-box speedup substantially, with a corresponding decrease in execution time. The figure above shows the improvement in total program performance as each optimization was applied. Ursa Minor allowed the user to quickly identify the loop structure of the program and to sort the loops to identify the most time-consuming code sections. After each modification, the user was able to add the new timing data from the modified program runs, recalculate the speedup, and see whether the improvement was worthwhile.

Evaluating a parallelizing compiler on a large application

In one research project, a user is enabling the Polaris compiler to work effectively with large codes comprising many thousands of lines. These codes have many levels of abstraction and are very modular, making it difficult to link performance and parallelization bottlenecks to their causes. Ursa Minor was used with the SPECseis application suite, a set of codes that perform seismic processing, as a basic GUI to help manage the thousands of lines of code and hundreds of loop timings, as well as to direct the compiler developer toward enabling Polaris to recognize more parallelism.

Ursa Minor allows the user to easily pick out the significant portions of the code in terms of execution time and to find their callers and callees. We found that the implementation of the finite-differencing scheme, a landmark in the history of seismic processing, takes only a small fraction of the total time, while the accompanying correction routine, which compensates for the errors that accrue with the finite-difference approximation, accounts for a much larger share of the total execution time. The correction routine performs an FFT, applies the error equations, and transforms the data back from the frequency domain.

Besides the ability to quickly and easily locate the major components of the execution time, the user found Ursa Minor helpful in analyzing the effectiveness of compilation techniques. One key benefit of using Ursa Minor for performance evaluation is the ability to apply the Expression Evaluator to both the run-time performance data and the compile-time analysis data. Polaris was able to parallelize loops that contributed only a small fraction of the execution time. The user used Ursa Minor to determine why certain key loops were not parallelized (a feature requiring one mouse click) in order to add techniques that address these issues. The SEICFT routine, for example, performs an FFT on a frequency slice; the routine contains while loops, which Polaris does not parallelize.

With Ursa Minor, the user was also able to work with the application as a whole to determine what factors influence automatic parallelization across the entire code, using the commands provided in the Ursa Minor tool. In particular, Ursa Minor revealed that inlining or interprocedural analysis is a crucial parallelism enabler for parallelizing compilers when dealing with large, modular codes: eight out of the top ten loops of the first seismic phase contain subroutine calls.

Interactive compilation

The use of a parallelizing compiler as an interactive tool can benefit users in many ways. Users can incorporate feedback from the compiler during compilation and make appropriate modifications to the source. The incremental use of such a tool also simplifies code management and debugging, because the code changes made by users are localized. In addition, the ability to build a parallelizing compiler, as described in the previous chapter, allows users to experiment with different compiler techniques so that they can learn more about the techniques and their effects.

We present a case study to demonstrate the functionality of InterPol. A user parallelized the small example program shown in part (a) of the figure below. Part (b) shows the code after simply running it through the default Polaris configuration with the inlining switch set to inline small subroutines. Two important results can be seen: subroutine one is not inlined, because the inlining pass executes prior to dead-code elimination, and the loops in subroutine two are not found to be parallel because of subscripted array subscripts, which the Polaris compiler cannot analyze. Part (c) shows the resulting program after adding a dead-code pass prior to the inlining pass in the Compiler Builder and running the main program and subroutine one from part (a) through this new compiler.

Fig. Contents of the Program Builder during an example usage of the InterPol tool: (a) the input program and (b) the output from the default Polaris compiler configuration.

Fig. Contents of the Program Builder during an example usage of the InterPol tool: (c) the output after placing an additional dead-code elimination pass prior to inlining and (d) the program after manually parallelizing subroutine two.

Finally, in part (d), the user has selected only subroutine two, parallelized it by hand, and included this modified version in the Program Builder. Through simple interactions with InterPol, the user was able to take a code for which Polaris could initially parallelize only a single innermost loop and parallelize both of its outermost loops.
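The figure contents are only partially recoverable, so the following is a hand-written sketch, not the original listing, of the kind of code involved: the subscripted subscripts C(J) and C(I) in subroutine two defeat the compiler's dependence test, but a user who knows that the index array holds distinct values can insert the directive manually, as was done for part (d):

      SUBROUTINE TWO(A, B, C, N)
!     Sketch only: array bounds and the exact statements of the original
!     example are not reproduced.  The subscripted subscripts below make
!     the write pattern opaque to automatic dependence analysis.
      INTEGER N, I, J, C(N)
      REAL A(N,N), B(N,N)
!     Manual parallelization: valid only because the user knows that the
!     entries of C are distinct, so different I iterations write
!     different columns of A and B.
!$OMP PARALLEL DO PRIVATE(J)
      DO I = 1, N
         DO J = 1, N
            A(C(J), C(I)) = REAL(I * J)
            B(C(J), C(I)) = REAL(I + J)
         END DO
      END DO
!$OMP END PARALLEL DO
      END SUBROUTINE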

Performance advisor: hardware counter data analysis

In this case study, we discuss a performance map that uses the speedup component model. The model fully accounts for the gap between the measured speedup and the ideal speedup in each parallel program section. It assumes execution on a shared-memory multiprocessor and requires that each parallel section be fully characterized using hardware performance monitors to gather detailed processor statistics; such monitors are now available on most commodity processors.

With hardware counter and timer data loaded into Ursa Minor, users can simply click on a loop in the Ursa Minor table view and activate Merlin. Merlin then lists the numbers corresponding to the various overhead components responsible for the speedup loss in each code section. The displayed values show the overhead categories in a form that allows users to easily see why a parallel region does not exhibit the ideal speedup of p on p processors. Merlin then identifies the dominant components in the loops under inspection and suggests techniques that may reduce these overheads. An overview of the speedup component model and its implementation as a Merlin map is given below.

Performance map description

The objective of our performance map is to fully account for the performance losses incurred by each parallel program section on a shared-memory multiprocessor system. We categorize the overhead factors into four main components; the table below shows the categories and their contributing factors.

Table. Overhead categories of the speedup component model

Overhead category   Contributing factor   Description                                              Measured with
Memory stalls       IC miss               Stall due to instruction-cache miss                      HW counter
                    Write stall           The store buffer cannot hold additional stores           HW counter
                    Read stall            An instruction in the execute stage depends on an        HW counter
                                          earlier load that is not yet completed
                    RAW load stall        A read needs to wait for a previously issued write       HW counter
                                          to the same address
Processor stalls    Mispredict stall      Stall caused by branch misprediction and recovery        HW counter
                    Float dep. stall      An instruction needs to wait for the result of a         HW counter
                                          floating-point operation
Code overhead       Parallelization       Added code necessary for generating parallel code        computed
                    Code generation       More conservative compiler optimizations for             computed
                                          parallel code
Thread management   Fork/join             Latencies due to creating and terminating parallel       timers
                                          sections
                    Load imbalance        Wait time at join points due to uneven workload          timers
                                          distribution

Memory stalls reflect latencies incurred due to cache misses, memory access times, and network congestion. Merlin calculates the cycles lost due to these overheads; if the percentage of time lost is large, locality-enhancing software techniques are suggested. These techniques include optimizations such as loop interchange, loop tiling, and loop unrolling; we found loop interchange and loop unrolling to be among the most important techniques.
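As a reminder of what the first of these transformations does, the following hand-written sketch (invented array and bounds) shows a loop nest after interchange; afterwards the inner loop walks down a column of the array with stride 1, which matches Fortran's column-major storage:

      SUBROUTINE INTERCHANGE_SKETCH(N, A)
!     Hand-written sketch of loop interchange.  The original ordering
!     (outer J, inner I) would touch A(J,I) with stride N, since Fortran
!     stores arrays column by column.  After interchanging the loops,
!     the inner loop walks down one column of A with stride 1.
      INTEGER N, I, J
      REAL A(N,N)
      DO I = 1, N
         DO J = 1, N
            A(J,I) = A(J,I) + 1.0
         END DO
      END DO
      END SUBROUTINE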

Processor stalls account for delays incurred inside the processor, including branch mispredictions and floating-point dependence stalls. Although it is difficult to address these stalls directly at the source level, loop unrolling and loop fusion, if properly applied, can remove branches and give the back-end compiler more freedom to schedule instructions. Therefore, if processor stalls are a dominant factor in a loop's performance, Merlin will suggest that these two techniques be considered.

Code overhead corresponds to the time taken by instructions not found in the original serial code. A positive code overhead means that the total number of cycles, excluding stalls, consumed across all processors executing the parallel code is larger than the number used by a single processor executing the equivalent serial section. These added instructions may have been introduced when parallelizing the program (e.g., by substituting an induction variable) or by a more conservative parallel code-generating compiler. If code overhead causes performance to degrade below that of the original code, Merlin will suggest serializing the code section.
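In other words, writing busy_p for the non-stall cycles spent by processor p in a parallel section and busy_serial for the non-stall cycles of the corresponding serial section (symbols introduced here only for illustration), the code overhead is (sum over all p of busy_p) minus busy_serial; a positive value indicates instructions that exist only because the code was parallelized.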

Thread management accounts for latencies incurred at the fork and join points of each parallel section. It includes the time for creating or notifying waiting threads, for passing parameters to them, and for executing barrier operations. It also includes the idle time spent waiting at barriers, which is due to unbalanced thread workloads. We measure these latencies directly through timers placed before and after each fork and each join point. Thread management latencies can be reduced through highly optimized runtime libraries and through improved balancing of threads with uneven workloads; Merlin will suggest improved load balancing if this component is large.

Ursa Minor, combined with this Merlin map, displays the measured performance of the parallel code relative to the serial version, the execution overheads of the serial code in terms of stall cycles reported by the hardware monitor, and the speedup component model for the parallel code. We discuss details of the analysis where necessary to explain effects; for the full analysis, with detailed overhead factors and a larger set of programs, we refer the reader to the corresponding publication.

Experiment

For our experiment, we translated the original source into OpenMP parallel form using the Polaris parallelizing compiler. The source program is the Perfect Benchmark ARC2D, which Polaris parallelizes to a high degree.

We performed our measurements on a Sun Enterprise system with six UltraSPARC processors, each with an on-chip L1 data cache and a unified L2 cache. Each code variant was compiled with the Sun Fortran compiler using architecture- and cache-specific flags (-xtarget, -xcache) and full optimization. For hardware performance measurements, we used the available hardware counter (TICK register).

ARC2D consists of many small loops, each with an average execution time of a few milliseconds. The figure below shows the overheads in the loop STEPFX_do of the original code and the speedup component graphs generated before and after applying a loop interchange transformation.

Fig. Performance analysis of the loop STEPFX_do in program ARC2D. The graph on the left shows the overhead components in the original serial code. The graphs on the right show the speedup component model for the parallel code variants before and after loop interchange is applied. Each component of this model represents the change in the respective overhead category relative to the serial program. Merlin is able to generate the information shown in these graphs.

Merlin calculates the speedup component model using the data collected by a hardware counter and displays the speedup component graph. Merlin applies the following map rule based on the model: if the memory stall component appears in the performance graphs of both the serial code and the Polaris-parallelized code, then apply loop interchange. Following this suggested recipe, the user tries loop interchange, which results in a significant, now superlinear, speedup. The loop-interchange graph on the right of the figure shows that the memory stall component has become negative, which means that there are fewer stalls than in the original serial program; this negative component explains the superlinear speedup.

The speedup component model further shows that the code overhead component has decreased drastically from the original parallelized program; the code is even more efficient than the serial program, further contributing to the superlinear speedup.

In this example, the use of the performance map for the speedup component model significantly reduced the time the user spent analyzing the performance of the parallel program. It helped explain both the sources of overhead and the sources of the superlinear speedup behavior.

Performance advisor: simple techniques to improve performance

In this section, we present a performance map based solely on execution timings and static compiler information. Such a map requires only program characterization data that a novice user can easily obtain. In this study, the map is designed to advise novice programmers on improving the performance achieved by a parallelizing compiler such as Polaris. We assume that novice programmers have used a parallelizing compiler as the first step in optimizing the target program and that its static analysis information is available; the performance map presented in this section aims at improving this initial performance.

Our goal in this study is to provide users with a set of simple techniques that may help enhance the performance of a parallel program, based on data that can be generated easily: timing and static program analysis data. Based on our experience with parallel programs, we have chosen techniques that are easy to apply and may yield considerable performance gain: serialization, loop interchange, and loop fusion. They are applicable to loops, which are often the focus of the shared-memory programming model. All of these techniques are present in modern compilers; however, compilers may not have enough knowledge to apply them most profitably, and some code sections may need small modifications before the techniques become applicable automatically.

Performance map description

We have devised criteria for applying these techniques, shown in the table below. If the speedup of a parallel loop falls below a threshold, we assume that the loop is too small for parallelization or that it requires extensive modification; serializing it prevents performance degradation. Loop interchange may be used to improve locality by increasing the number of stride-1 accesses in a loop nest; loop interchange is commonly applied by optimizers, but our case study shows many examples of opportunities missed by the back-end compiler. Loop fusion can likewise be used to increase both granularity and locality. The criteria shown in the table are simple heuristics and do not attempt an exact analysis of the benefit of each technique; in particular, we simply assumed a fixed speedup threshold for applying loop fusion.

Table. Optimization technique application criteria

Technique          Criterion
Serialization      speedup below a threshold
Loop interchange   more non-stride-1 accesses than stride-1 accesses
Loop fusion        speedup below a threshold
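For illustration, the following hand-written sketch (invented arrays) shows loop fusion: two adjacent loops over the same index range are combined into one parallel loop, which increases the work per fork/join and reuses values while they are still in cache:

      SUBROUTINE FUSION_SKETCH(N, A, B, C)
!     Hand-written sketch of loop fusion.  In the original form,
!     B(I) = A(I) * 2.0 and C(I) = A(I) + B(I) would be computed in two
!     separate parallel loops; the fused loop below does both per
!     iteration, increasing granularity and reusing A(I) and B(I)
!     while they are still in cache.
      INTEGER N, I
      REAL A(N), B(N), C(N)
!$OMP PARALLEL DO
      DO I = 1, N
         B(I) = A(I) * 2.0
         C(I) = A(I) + B(I)
      END DO
!$OMP END PARALLEL DO
      END SUBROUTINE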

Experiment

We applied these techniques based on the criteria presented above, using a Sun Enterprise system with six UltraSPARC processors. The OpenMP code was generated by the Polaris OpenMP backend. Results for five programs are shown: SWIM and HYDRO2D from the SPEC floating-point benchmarks, a second SWIM version from a newer SPEC suite, and ARC2D and MDG from the Perfect Benchmarks. We applied the techniques incrementally, starting with serialization. The figure below shows the speedup achieved by the techniques. The improvement in execution time ranges from modest, for fusion in ARC2D, to substantial, for loop interchange in SWIM. For HYDRO2D, applying the Merlin suggestions did not noticeably improve performance.

Fig. Speedup achieved by applying the performance map. The speedup is with respect to a one-processor run of the serial code on a Sun Enterprise system. Each graph shows the cumulative speedup as each technique is applied.

Among the codes with large improvement, SWIM benefits most from loop interchange, which was applied, at Merlin's suggestion, to the most time-consuming loop, SHALOW_do. Likewise, the main technique that improved the performance of ARC2D was loop interchange. MDG consists of two large loops and numerous small loops; serializing the small loops was the sole reason for its performance gain. The table below gives a detailed breakdown of how often each technique was applied and its corresponding benefit.

Table. A detailed breakdown of the performance improvement due to each technique

Benchmark   Technique        Number of modifications   Improvement
ARC2D       Serialization
            Interchange
            Fusion
HYDRO2D     Serialization
            Interchange
            Fusion
MDG         Serialization
            Interchange
            Fusion
SWIM        Serialization
            Interchange
            Fusion
SWIM        Serialization
            Interchange
            Fusion

Using this map, considerable speedups were achieved with relatively small effort. Novice programmers can simply run Merlin to see the suggestions made by the map. The map can be updated flexibly without modifying Merlin; thus, if new techniques show potential or the criteria need revision, expert programmers can easily incorporate the changes.

Efficiency of the Tool Support

In order to quantitatively evaluate the efficiency of the tool support, we performed an experiment with the help of actual tool users. We prepared a set of small tasks that are commonly performed by parallel programmers and asked users to accomplish these tasks with and without our tools. In addition, we asked the tool users a series of questions to gather their opinions on the tools and their usage. The questions targeted the functionality of the tools as well as general comments on the methodology. We present the results in the following sections.

Facilitating the tasks in parallel programming

Common tasks in parallel programming

The main objective of the experiment is to produce quantitative measures of the efficiency of the tools' functionality. To this end, we selected tasks that are commonly performed by parallel programmers using parallel directives; these tasks are listed in the table below.

Table. Common tasks in parallel programming

task 1   compute the speedup of the given program on a given number of processors with respect to the serial execution time
task 2   find the most time-consuming loop based on the serial execution time
task 3   find the inner and outer loops of that loop
task 4   find the callers of the subroutine containing the most time-consuming loop
task 5   compute the parallelization and spreading overhead of that loop on a given number of processors
task 6   compute the parallel efficiency of the second most time-consuming loop on a given number of processors
task 7   export profiles to a spreadsheet to create a chart of total execution time on varying numbers of processors for the most time-consuming loops
task 8   count the loops whose speedups are below a given threshold
task 9   count the loops that are parallel and whose speedups are below a given threshold
task 10  compute the parallel coverage and the expected speedup based on Amdahl's Law

Task 1: compute the speedup of the target program. The speedup of the entire program is perhaps the most frequently used metric in computational engineering. Changes made, whether parallelization or any other type of optimization, are evaluated by the speedup gain in program execution time. The instrumentation needed to measure program execution time is simple, and any calculator can be used to compute this number.
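For example (invented numbers), a program that runs in 100 seconds serially and in 28 seconds after parallelization has a speedup of 100/28, or roughly 3.6.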

Task 2: find the most time-consuming code sections. Finding the dominant code sections using profiles is the most important task in performance tuning. Most users would look at the summary files generated from the program execution with a text editor. In this case, users would have to run a text editor (menu clicking or typing a command in a shell) and find the most time-consuming loop in the file. Looking for the largest quantity among many numbers takes a significant amount of time, at best on the order of minutes. Some users suggested using the sort command available in UNIX, as in

    cat name.sum | sort -r -k <column>

This quickly produces a sorted list of the summary file entries, but users have to remember which column to sort by, and the amount of text to type is not trivial. Moreover, if multiple files need to be presented for comparison, the sorting command alone cannot do it. By contrast, using the Ursa Minor tool, the task can be accomplished by activating the tool (typing "UM"), loading the profile (menu clicking), and sorting on the column the user chooses (popup menu clicking).

Task 3: find the inner and outer loops of a specific loop. Increasing the granularity of parallel execution is an important technique for improving parallel performance, and it involves looking into the inner or outer loops of the loop under consideration. No other tools explicitly support this task; programmers would have to use a text editor to find the loop and examine the source to figure out the loop nest. The Structure View of Ursa Minor significantly simplifies this task: users only need to load the compiler listing file (menu clicking, scrolling, and mouse clicking), find the section (scrolling or using the Find feature), and look at the display.

Task 4: find the callers of a specific subroutine. The presence of function or subroutine calls may cause the parallelizing compiler to abandon optimizing loops. Users' knowledge of the target program can be of great use in such cases, and finding the callers and callees of a subroutine or a function is an essential task in optimizing nested subroutines and loops with subroutine calls. Normally, programmers would have to examine the program source to accomplish this task; UNIX utilities such as grep can be useful. The Structure View from Ursa Minor provides one-click support for finding parents and children of selected code sections.
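The manual alternative to the Structure View amounts to scanning the source for CALL statements. The sketch below illustrates that route; the file name, subroutine name, and the simple regular expressions are assumptions for illustration, not part of the tool.

import re

def find_callers(source_path, subroutine):
    # Scan a Fortran source file and report which program units call `subroutine`.
    call_re = re.compile(rf"\bcall\s+{re.escape(subroutine)}\b", re.IGNORECASE)
    unit_re = re.compile(r"^\s*(subroutine|program|function)\s+(\w+)", re.IGNORECASE)
    current, callers = "<unknown>", set()
    with open(source_path) as f:
        for line in f:
            m = unit_re.match(line)
            if m:
                current = m.group(2)   # remember the enclosing program unit
            if call_re.search(line):
                callers.add(current)   # this unit calls the subroutine of interest
    return sorted(callers)

# Example (hypothetical file and subroutine names):
# print(find_callers("program.f", "SOLVER"))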

Task 5: compute overheads. Identifying performance problems requires first defining what the problems are. Metrics such as parallelization and spreading overheads are frequently used variables in these problem definitions, so computing them is a critical step in locating performance problems. One conventional method of computing the overheads uses a calculator; when users need to compute overheads for multiple code sections, a commercial spreadsheet or special-purpose scripts provide an easier way. The mathematical functions provided by Ursa Minor also support the derivation of new metrics from the existing data. This set of functions specifically targets parallel programming, so many of the metrics commonly used in parallel programming are included in the set. In the current version, however, the parallelization and spreading overheads are not directly supported.
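As a rough illustration of scripting such a derivation instead of using a calculator, the sketch below computes two overhead quantities from loop timings. The definitions used here (overhead relative to the serial time and to the one-processor parallel time) are assumptions for illustration; the thesis' own definitions of parallelization and spreading overhead may differ.

def parallelization_overhead(t_serial, t_parallel_1):
    # Assumed definition: extra time the parallel code spends on one processor.
    return t_parallel_1 - t_serial

def spreading_overhead(t_parallel_1, t_parallel_p, p):
    # Assumed definition: cost on p processors beyond the ideal t_parallel_1 / p.
    return t_parallel_p - t_parallel_1 / p

# Example with made-up timings for one loop:
# print(parallelization_overhead(10.0, 11.5), spreading_overhead(11.5, 3.5, 4))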

Task 6: compute parallel efficiencies. Parallel efficiency is another widely used measure for evaluating parallel performance. The parallel efficiency E(P) on P processors is defined as

    E(P) = T_serial / (P * T_parallel(P))

Users can compute this number using a calculator or a spreadsheet. Ursa Minor provides a function that computes parallel efficiency.
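The calculation itself is a one-liner; the sketch below shows it, together with the speedup of task 1, as a user might script it instead of reaching for a calculator. Argument names are illustrative.

def speedup(t_serial, t_parallel_p):
    # Speedup of a code section on p processors.
    return t_serial / t_parallel_p

def parallel_efficiency(t_serial, t_parallel_p, p):
    # E(P) = T_serial / (P * T_parallel(P)), as defined above.
    return t_serial / (p * t_parallel_p)

# Example: a loop that takes 20 s serially and 6 s on 4 processors
# print(speedup(20.0, 6.0), parallel_efficiency(20.0, 6.0, 4))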

Task 7: export profiles to a spreadsheet to create charts. An integrated toolset offers the advantage that exchanging files is easier. Data files take one specific form or another, and converting them into a form that other tools understand may not be trivial. Commercial spreadsheets do a good job of importing text-based tabular data files, such as timing profiles, and of creating a variety of graphs; combining multiple summary files becomes difficult, however. Without Ursa Minor, users would have to create a comma-separated file using Awk or Sed scripts. Adding profiles and arranging data for exporting are frequently used features of Ursa Minor; often this can be done within a minute. In addition, Ursa Minor can create charts on any columns or rows that a user selects.
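The manual route mentioned above, merging several summary files into one comma-separated file that a spreadsheet can import, is sketched below. File names and the assumption of whitespace-separated input are illustrative.

import csv

def merge_summaries_to_csv(summary_paths, out_path="profiles.csv"):
    # Write all rows from each summary file into one CSV, tagging each row
    # with the file it came from so runs on different processor counts can
    # be compared side by side in the spreadsheet.
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in summary_paths:
            with open(path) as f:
                for line in f:
                    fields = line.split()
                    if fields:
                        writer.writerow([path] + fields)

# Example: merge_summaries_to_csv(["run_1cpu.sum", "run_4cpu.sum"])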

Task 8: count loops that have problems. This is another example that emphasizes the perspective on overall performance. Users should be able to view the resulting performance in terms of large blocks of code sections, and that means dealing with the multiple loops that dominate overall performance. There is no direct support for this task in either Ursa Minor or commercial spreadsheets, but a sequence of operations can accomplish it.

Task 9: count parallel loops that have problems. The combined analysis of performance data and static program data, such as compiler listings, is more efficient in locating performance problems; this question is a simple example of such a case. Depending on the focus of the optimization (parallel optimization or general locality optimization), combining the information on the parallel nature of code blocks with their performance figures is much more efficient than dealing with each aspect separately. Conventional tools do not support this approach. The query functions available in Ursa Minor are designed specifically to help users comprehend the two different kinds of data in the same context.
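The kind of combined query described above can be expressed in a few lines once parallelization flags and timings sit in the same records. The loop records, field names, and threshold below are illustrative assumptions, not the tool's database format.

loops = [
    {"name": "loop_A", "parallel": True,  "t_serial": 12.0, "t_parallel": 7.5},
    {"name": "loop_B", "parallel": False, "t_serial":  3.0, "t_parallel": 3.1},
    {"name": "loop_C", "parallel": True,  "t_serial":  5.0, "t_parallel": 1.4},
]

def problem_parallel_loops(loops, threshold=2.0):
    # Loops the compiler marked parallel whose measured speedup is below the threshold.
    return [l["name"] for l in loops
            if l["parallel"] and (l["t_serial"] / l["t_parallel"]) < threshold]

# print(problem_parallel_loops(loops))   # -> ['loop_A']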

Task 10: compute the expected speedup based on Amdahl's law. This task represents a multi-step process of performance evaluation. Amdahl's law provides a simple performance model that can be used to evaluate actual performance. Computing the expected speedup based on Amdahl's law requires computing the parallel coverage of the target program and several further steps of computation. This task was selected to test how users employ tools to accomplish a rather complex goal; users are expected to use a combination of tools for it.
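The arithmetic behind this task is sketched below: parallel coverage is the fraction of serial execution time spent in parallelized code sections, and the expected speedup on P processors follows from Amdahl's law. The input numbers in the example are illustrative.

def parallel_coverage(t_parallel_sections, t_total):
    # Fraction of serial execution time spent in parallelized code sections.
    return t_parallel_sections / t_total

def amdahl_speedup(coverage, p):
    # Expected speedup on p processors under Amdahl's law.
    return 1.0 / ((1.0 - coverage) + coverage / p)

# Example: 90% of the serial time is in parallel loops, 8 processors
# f = parallel_coverage(90.0, 100.0)
# print(amdahl_speedup(f, 8))   # about 4.7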

Task 1 is a simple calculation, so users are expected to use either a calculator or the Expression Evaluator from Ursa Minor with comparable efficiency. Task 2 evaluates the table manipulation utilities (sorting and rearranging) for performance data. Tasks 3 and 4 target the efficiency of the Structure View and the utilities it provides. The Expression Evaluator is the main target for evaluation in tasks 5 and 6. Task 7 tests the ability to rearrange tabular data and export it to other spreadsheet applications. The remaining tasks (8, 9, and 10) attempt to evaluate the combined usage of multiple utilities: sorting, the Expression Evaluator, the query functions, the static information viewer, and the display option control provided by Ursa Minor.

Experiment

We asked four users to participate in this experiment. They were asked to perform these tasks one by one. Two different datasets were prepared for the experiment; they contain timing profiles of FLO52Q from the Perfect Benchmarks under two different environments, so the number of data items is the same in both datasets, but the profile numbers differ. First, the users were asked to perform the tasks without our tools; they were allowed to use any scripts that they had written previously. Then they performed the tasks using our tools with the other dataset.

The time to activate tools (spreadsheet, Ursa Minor, and so on) and to load input files was counted separately as loading time. The reason is that when users perform these individual tasks separately, under different environments, the loading time needs to be added to the time taken to finish each task; since the users performed the tasks in one session, they needed to activate the tools only once. Time to convert data files for different tools is also included in the loading time. Hence, the loading time also reflects the level of integration of the tools.

The four users who participated represent different classes of users. User 1 is an expert performance analyst who has written many special-purpose scripts to perform various jobs (tabularizing, sorting, etc.); he does use our tools but relies more on these scripts. User 2 has also been working on performance evaluation for a while and is considered an expert as well. He uses only basic UNIX commands rather than scripts; however, his skills with these commands are very good, so he can perform a complex task without taking much time. He started using our tools only recently. User 3 is also an expert performance analyst, but his main target programs are not shared memory programs; he has been using our tools for a long time, but with distributed memory programs. Finally, user 4 is a novice parallel programmer. His experience with parallel programs is limited compared to the others. He has read our methodology and tries to use our tools in his benchmarking research.

Table
Time in seconds taken to perform the tasks without our tools
(rows: task 1 through task 10, loading, and total; columns: user 1 through user 4 and the average)

The table above shows the time for these users to perform the assigned tasks. Two of the users decided that two of the tasks could not be performed within a reasonable time, so they gave estimated times instead. All of the users used a commercial spreadsheet later in the session, but user 4, the novice programmer, started doing the tasks only after he had set up the spreadsheet and imported the input files. User 1 used his scripts for many of the tasks.

In the second part of the experiment, users were allowed to use our tools to perform the tasks; the results are shown in the table below. One user used a combination of a spreadsheet and Ursa Minor for several tasks, while the others used a spreadsheet for one task only. One user was not sure that he could finish a particular task even with our tool support, so he gave an estimated time.

Table
Time in seconds taken to perform the tasks with our tools
(rows: task 1 through task 10, loading, and total; columns: user 1 through user 4 and the average)

As can be seen from these tables, our tool support considerably improves the time to perform common parallel programming tasks. The figure below shows the overall times to finish all the tasks. As the figure shows, our tool support not only saves time but also makes the process easier for novice programmers, resulting in comparable times across the users when our tools are used. The work speedup for each user is the ratio of the totals in the two tables.

The strength of our approach lies not only in the fact that the tools offer efficient ways of performing these individual tasks, but also in that these features are provided in an integrated toolset. This is demonstrated by the savings in loading time in our experiment. Users do not have to deal with several tools and commands, and there is no need to open the same file in many different tools. For instance, users can open the Structure View to inspect the program layout and then examine and restructure the performance data from the same database. Taking this advantage into consideration, our tool support becomes even more appealing.

Fig.: Overall times to finish all tasks

General comments from users

We summarize the users' comments on various tool features in this section. Users have responded very positively to the Structure View of Ursa Minor. We have received comments such as "There is no alternative that I know of that gives as good of an overview of the program structure quickly" and "If I am looking at a new program, one that I am unfamiliar with, I almost always look at its structure with Ursa Minor to get a feel for its layout." Although not specified in the methodology, many users examine program sources before they begin working on optimization; the Structure View offers vital help to those users.

The Table View has received good reviews as well. One response was "The Table View is good. I like its ability to combine multiple types of data." In addition, users liked the bar graph at the right side of the Table View, which visualizes numeric data instantly. The Expression Evaluator also proves to be very useful, allowing users to compute different metrics on demand. One user listed the integration of tools in a manner specific to parallel performance as one of the reasons for using our tools. However, some users were not fully content with the cumbersome interface for moving, swapping, and arranging columns, and the limited graphing capabilities were pointed out as one of the weak points of Ursa Minor. Overall, the many versatile features provided by Ursa Minor are greatly appreciated by users.

InterPol is still relatively new to users and has not been used much. Furthermore, we feel that issues remain to be resolved with respect to documentation and user interface; consequently, we did not get much feedback from users. As InterPol gains more recognition, with an improved interface and documentation, we anticipate that users will actively utilize the tool and return to us with quality feedback.

As the tools evolve in a need-driven way, feedback from the user community will provide invaluable direction for the next generation of our tool family, and we expect future upgrades of the tools to incorporate users' opinions. For instance, the weakness in the GUI can be resolved with newly available Java technology. Developers need to monitor users' needs and wishes constantly to keep up with current state-of-the-art parallel programming practices. Keeping the tool design projects and the users' application characterization efforts close together will ensure the practicality of our tools in the future.

Comparison with Other Parallel Programming Environments

In Chapter 2 we listed several parallel programming environments: Pablo and the Fortran D editor, SUIF Explorer, FORGExplorer, the KAP/Pro Toolset, the Annai Tool Project, DEEP/MPI, and Faust. We present in this section a more detailed comparison of our toolset with these environments. The table below shows the availability of features in these environments. The parallelization utility available from the Pablo/Fortran D Editor is actually semi-automatic.

Other than the debugging capability, the Ursa Minor/InterPol pair covers all of the functionalities listed in the table. In addition, our environment has unique features not available from the others. Ursa Minor's ability to freely manipulate and restructure performance data is unprecedented among these programming environments. Furthermore, Ursa Minor allows performance data to be integrated with static analysis data through a set of mathematical and query functions.

Table
Feature comparison of parallel programming environments
(features compared: performance data visualization, program structure visualization, compiler analysis output, automatic parallelization, interactive compilation, support for reasoning, automatic analysis/guidance, and debugging; environments compared: Pablo/Fortran D Editor, SUIF Explorer, FORGExplorer, KAP/Pro Toolset, Annai Project, DEEP/MPI, Faust, and Ursa Minor/InterPol)

A performance guidance system such as Merlin has not been attempted in the others either. SUIF Explorer's Parallelization Guru only points to important target code sections, and DEEP/MPI's advisor is limited to hard-coded procedure-level analysis, so detailed diagnosis of smaller code blocks is not possible. InterPol allows users to build their own parallelizing compiler; no such feature is available in other tools. Overall, the Ursa Minor/InterPol toolset offers the most versatile and flexible features to date.

Perhaps the most outstanding aspect of our toolset is its accessibility. As opposed to most other environments, which have ceased to exist or are no longer supported, Ursa Minor exists in Web-accessible forms. Any user with an Internet connection can use the tool with the help of complete online documentation. Such a quality is not easily found in most tool development projects. The topic of the next section is the efficiency of our tools placed on the World Wide Web.

Comparison of Ursa Major and the Parallel Programming Hub

In an effort to reach a larger audience with our tools, we have used network computing concepts to implement an online tuning data repository, Ursa Major, and a Web-executable integrated tool environment, the Parallel Programming Hub. Ursa Major is an Applet-based data visualization and manipulation tool for a repository of optimization studies. The Parallel Programming Hub allows users to access and run tools without the hassle of searching, downloading, and installing them.

The Parallel Programming Hub contains Ursa Minor, and Ursa Major uses many components from the Ursa Minor tool and provides almost identical functionality. This presents an interesting opportunity to compare and evaluate different approaches to network computing. In this section we compare the efficiency of Ursa Minor on the Parallel Programming Hub and of Ursa Major, providing qualitative and quantitative measures. By this comparison we attempt to provide directions for the next generation of online tools. This work was presented in a previous publication.

Batch-oriented tools run as efficiently on the Parallel Programming Hub as on local platforms; in fact, thanks to the PUNCH system's powerful underlying machine resources, most users' tools have faster response times on the Hub. Interactive tools need closer inspection.

A typical tool interaction with Ursa Minor causes the tool to fetch from a repository a program database that represents a specific parallel programming case study. It then performs various operations on this database and displays the results using Ursa Minor's visualization utilities. The table below shows how server, client, and file operations are invoked by various tasks of the two tools.

In a typical interactive tool session, a user loads input files, runs computing utilities on the data, and adds more files for further manipulation. From this scenario we chose three tool operations. We measured the time taken to load a database,

Table
Workload distribution on resources with our network-based tools

tasks                   Ursa Minor                        Ursa Major
application execution   server                            client (Applet)
database load           local disk I/O (server)           network transfer, client (Applet)
display                 network transfer, client (VNC)    client (Applet)

perform a simple spreadsheet-like operation on the data, and search and display a portion of the source code. The database load is an example of loading input data, while spreadsheet command evaluation is representative of computing on the data; the source search operation requires a simple search through a source code. Interestingly, these three operations exhibit different patterns of resource usage. For Ursa Major, the database load operation requires downloading the database, parsing it, and updating the display appropriately; hence it exercises both networking and computing capabilities. The second operation, evaluation of a spreadsheet command, performs a mathematical operation on data that the Applet has already downloaded, so it only involves computing on the client machine. The search operation mainly relies on networking: a source file is not part of the database, hence it has to be downloaded separately. For Ursa Minor, data transfer over the network is replaced by file I/O; however, the response to a user action has to be updated on the display of the remote client machine.

We chose two different databases for this experiment, representing a small and a large application study, respectively. The first database contains tuning information for the program BDNA from the Perfect Benchmarks; both the database and the accompanying source file are on the order of Kbytes in size, and we consider this to be a small database. The second database contains information about the parallelization of the RETRAN code, which represents a large power plant simulation application; its source alone is on the order of Mbytes in size.

Finally, we chose three machines on which we measured the tool response times. Networked PC is a Pentium II PC running Windows NT, connected to the Internet through an Ethernet card. Dialup PC is a home PC with a Pentium II processor running Windows; its connection to the Internet is through a modem via a local ISP. The third machine, Networked Workstation, is an UltraSPARC workstation running SunOS with a direct network connection.

We measured the response time of the three operations at regular intervals over several days using a Netscape browser. We inserted timing functions for Ursa Major and used an external wall clock for Ursa Minor on the Parallel Programming Hub, and made repeated measurements for each case. The average times are shown in the figure below, which displays the response time in seconds on the three machines for the three measured tool operations: rt-load refers to the response time to load the RETRAN database, and rt-eval and rt-search refer to the time to perform spreadsheet command evaluation and source search, respectively. The data tags with prefix bd refer to the same operations on the BDNA database.
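For readers who want to reproduce this kind of measurement, the following is an illustrative analogue of the inserted timing functions: time an operation repeatedly and report the average response time. It is a sketch only; the thesis instrumented the Java Applet itself and used a wall clock for the Hub-based tool.

import time

def average_response_time(operation, repetitions=10):
    # Run the operation several times and return the mean wall-clock duration.
    total = 0.0
    for _ in range(repetitions):
        start = time.perf_counter()
        operation()                      # e.g., load a database or run a query
        total += time.perf_counter() - start
    return total / repetitions

# Example with a dummy operation standing in for a tool action:
# print(average_response_time(lambda: sum(range(1_000_000))))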

Overall, the networked PC exhibits the shortest response time for all operations. On this machine, the response times of Ursa Minor and Ursa Major are in the same vicinity; however, downloading a large program source significantly increases the response time of the search operation, despite the Ethernet connection. In the case of Ursa Minor, files are read through file I/O within the server, so the network is not a dominating factor. The dialup PC displays adequate response times except for the search operation with Ursa Major, where the network bottleneck is even more pronounced. The networked workstation does not suffer substantially from the network connection, but its slow processor and relatively inefficient implementation of the Java Virtual Machine (JVM) make it the worst performing platform among the three.

Fig.: The response time of UMApplet and UMParHub on (a) a networked PC, (b) a networked workstation, and (c) a dialup PC

The response time on the three different machines for each operation, as shown in the next figure, offers a different perspective. We only present the data regarding the operations on the RETRAN database, because those on the BDNA database show similar trends and the characteristics are more pronounced in the RETRAN case.

The response time of Ursa Minor does not show noticeable variations across the three machines, except on the dialup PC, where the spreadsheet command evaluation takes more than twice as long as on the others; this operation is not time-consuming, so a screen update becomes a factor with the slow modem connection. For Ursa Major, the platform becomes a deciding factor: if the network is slow, the search operation degrades, and for compute-intensive operations the machine speed and the quality of the JVM determine the response time. In all cases the Hub-based tool performs better than the Applet-based version.

Fig.: The response time of the three operations on the RETRAN database: (a) loading, (b) spreadsheet command evaluation, and (c) source searching

Our experiments show that the Parallel Programming Hub offers users a fast and stable solution for interactive network computing. The network transmits only the user's actions (pressing buttons and clicking a mouse) to and from the server, so the network and processor speed had little impact on tool usage in our experiment. By contrast, Applet-based tools rely on the client machine for computation and on the network for data transfer; thus, if the amount of data is large or the client machine is slow, the resulting operations take considerably longer. The two networked machines we used are located within the Purdue network; we expect these performance characteristics to be even more pronounced on geographically distributed machines.

Although not as responsive as the Hub-based Ursa Minor, Ursa Major serves a distinct purpose. The accumulated repository of tuning studies helps users all over the world in their efforts to study the results from other researchers and to compare results on different platforms. Users with above-average machines can take advantage of quick response by running the application on them, and the slow screen updates and sluggish mouse control that may result from a slow network connection for Ursa Minor are not a problem with Ursa Major.

An increasing number of users are taking advantage of the Parallel Programming Hub, which is being accessed by users from all over the world; Ursa Minor itself has been accessed many times since it became operational. As the Hub adds more tools and gains more recognition in the worldwide parallel programming community, we expect the number of accesses to grow at a faster rate.

Conclusions

In this chapter we have evaluated the proposed methodology and the tool support. We have presented several case studies showcasing the usage of the tools in various parallelization and tuning studies. In many studies we performed at Purdue, the proposed approach to performance tuning resulted in considerable improvement in the end results. Many features provided by the tools are actively used by programmers and, most of all, they are contained within an integrated tool environment.

In addition, we have focused on small individual tasks and shown how the tools can effectively assist users by simplifying time-consuming chores and making difficult obstacles more approachable. The sample tasks we used are commonly performed in all tuning studies, and users save considerable time and effort by using our tools. The experimental results show that our tools provide efficient support for many common tasks in parallel programming. In particular, the Expression Evaluator offers significant aid in deriving new data and computing metrics, and the Merlin performance advisor, another unique feature, simplifies the task of performance analysis considerably, as shown in the case studies.

Finally, we have evaluated the efficiency of the two different frameworks that we used to broaden the user community for our tools through network computing. Overall, the Hub-based Ursa Minor exhibited fast and uniform response times, especially in cases where large data transfers are required. On the other hand, Ursa Major does not suffer from sluggish control when the network is slow, but the time to transfer the requested data depends on the size of the database. Nevertheless, the purposes of these two tools are distinct, and they offer significant aid to parallel programmers worldwide.

As mentioned at the beginning, evaluating a methodology and tools is challenging work. This chapter represents our attempt to find ways to do so in both qualitative and quantitative terms. We would like to point out that this is not the end of our work towards a comprehensive parallel programming environment; continuous feedback from its user community will help improve the tools' service to a wide range of parallel programmers.

CONCLUSIONS

Summary

When we first started out as novice parallel programmers, we had little experience in the area. Every problem that we encountered seemed formidable and impossible to resolve; we had to resort to experts for almost every task in the optimization process, and we did not know what to do or how to do it at practically every step of the way. After a long period of trial and error, we developed our own paradigm for parallelizing and tuning programs. As our methodology was refined over the years, the tasks became routine and, most of all, we were seldom puzzled or frustrated by seemingly unexpected results. The methodology gave us the confidence that we could always find the cause of unexpected anomalies and explain the phenomena.

As more members joined our group, however, another problem arose: new members of the group experienced just about the same amount of frustration and dismay as we had. There were no publications that speak of a parallel tuning methodology in terms that both expert and novice programmers could comprehend. Our experience had not yet been documented, and the tools that intimately support it were not there. Part of the motivation for this work stems from the need to address this problem.

Now, with the proposed methodology and the tools, we believe that the framework for a structured approach to parallel programming is firmly in place. With the gaining momentum of the shared memory programming model, we feel that many users could benefit from this environment. Such a comprehensive approach, covering a wide range of tasks in parallel programming, has not been attempted previously.

The specific contribution of the work presented in this thesis is a unified framework for our approach to parallel program development. This includes a parallel programming methodology and a set of tools that support this underlying practice. Our work accomplishes this by achieving the following goals that we set out earlier.

Structured Parallel Programming Methodology. The proposed methodology lists the tasks that need to be performed in each step and the detailed suggestions that users may consider. Users obtain significant guidance, as the objective is clear in each stage. Nonetheless, it is applicable regardless of the underlying platform, the algorithms applied by the target program, or even the tools that programmers use. It is well organized and easy to follow, even for novice programmers.

Integrated Use of Parallelizing Compilers and Evaluation Tools. A combined use of Ursa Minor and InterPol (or Polaris) achieves this. Code segments are labeled as Program Units that work across both of these tools. Profile data provides insight into the dynamic behavior of the program at hand, which in turn can be used to further improve performance. Through an interactive use of these tools, which speak the same terminology, programmers get a clearer understanding of the program.

Integration of Static Analysis Information and Performance Data. Ursa Minor's ability to search and display the source significantly assists users in understanding a program's structure. In addition, Ursa Minor understands the compiler's findings and combines them into the same picture. The query functions available from Ursa Minor allow users to combine static analysis data with performance data in meaningful ways.

Support for Users' Deductive Reasoning. One of the greatest strengths of the Ursa Minor tool is its support for users' deductive reasoning. The Expression Evaluator enables reasoning about the data in numerous ways: users can compute any metrics without modifying or updating the tool, and the newly created data can be manipulated and visualized like any other data, so the tool can stay with the users throughout their reasoning process.

Potential of Automatic Performance Evaluation. Merlin has shown the potential of automatic analysis of performance and static data. It makes the transfer of experience from advanced to novice programmers easier, and tedious analysis steps can be greatly simplified.

Global Accessibility. Having Ursa Minor on the Parallel Programming Hub has opened the door for programmers worldwide to evaluate and use the tool without worrying about searching, downloading, and installing; compatibility issues are nonexistent. Also, Ursa Major provides the global parallel programming community with a database of parallel programming studies that can be easily manipulated and visualized.

Directions for Future Work

Many promising directions for further work suggest themselves.

Support for Other Parallel Programming Languages and Models. As the concept of parallel programming extends to many programming languages, the ability to support other general languages such as Java or C would promote tool usage even further. The structure of the Ursa Minor database is not limited to Fortran and can support these languages; however, a few language-sensitive features have to be reworked. Besides automatic instrumentation and the accompanying tasks, the code segment naming scheme and the incorporation of compiler listings need careful consideration. Supporting other programming models can be significantly more difficult: radically different parallel constructs and programming styles call for a new methodology to begin with. It will be interesting to see whether and how the program-level approach to parallel programming can be applied to other programming models.

Support for Program Execution Traces. The shared memory programming model inherently poses problems for parallel trace generation. Processor communications are implicit and frequent, so generating accurate traces is difficult; however, selecting the right events and performing moderate summarization can make it feasible. Timeline analysis is often critical in identifying problems such as load imbalance.

Parallel Program Debugging. Parallel program debugging is an entirely different field of study, and many challenging tasks have to be planned for and accomplished. As a programming environment, the addition of debugging capability to the toolset would greatly enhance its applicability.

Online Generation of Data Files. Further integration of Ursa Minor, the Polaris parallelizing compiler, and the runtime environment would produce an even more comprehensive environment. Supporting parallelization, compilation, and execution through a single tool would provide a highly integrated perspective and make parallel programming most approachable for novice programmers. The possibility of running and monitoring parallel execution from a remote machine has been shown by InterAct; issues such as single-user time and Ursa Minor's portability need to be resolved first.

Getting More Information from Compilers. There is still plenty of information that is kept internal to a parallelizing compiler. Extracting more useful data from a compiler and presenting it to users should be the top priority for the ongoing evaluation/optimization tool project.

Visual Development of Merlin Maps. Merlin is still in its infancy and needs more feedback and refinement. Foremost is the interface for developing a map: although Merlin maps are well structured in format, programmers currently rely on conventional text editors to create them. A better, possibly graphical, user interface would make expert programmers' jobs much easier.

Global Information Exchange among Parallel Programmers. Ursa Major has demonstrated the possibility of global communication and cooperation among parallel programmers worldwide. The obvious next step is the exchange of performance data among remote parallel programming and computer systems researchers. With proper support from the Ursa Major tool, such as the ability to submit a database, this is a definite possibility. The integrated toolset on the Parallel Programming Hub will continue to promote the usage of our databases. Advances in technology are usually the result of such combined efforts.

LIST OF REFERENCES

L Dagum and R Menon Op enMP an industry standard API for shared

memory programming Computing in Science and Engineering

January

B L Massingill A structured approach to parallel programming Metho dology

and mo dels In Proc of th IPPSSPDP Workshops Held in Conjunction

with the th International Paral lel Processing Symposium and th Symposium

on Paral lel and DistributedProcessing pages

PB Hansen Mo del programs for computational science a programming

metho dology for multicomputers Concurrency Practice and Experience

August

T Raub er and G Runger Deriving structured parallel implementations for

numerical metho ds Microprocessing and Microprogramming

April

S Gorlatch From transformations to metho dology in parallel program develop

ment a case study Microprocessing and Microprogramming

April

Michael Wolfe High Performance Compilers for Paral lel Computing Addison

Wesley Publishing Company

Michael J Wolfe Optimizing Compilers for Supercomputers PhD thesis Uni

versity of Illinois at UrbanaChampaign Octob er

Uptal Bannerjee DependenceAnalysis for SupercomputingKulwer Academic

Publishers Norwell MA

Utpal Banerjee Rudolf Eigenmann Alexandru Nicolau and David Padua

Automatic program parallelization Proceedings of the IEEE

February

Dror E Maydan John L Hennessy and Monica S Lam Ecient and exact

data dep endence analysis In Proc of ACM SIGPLAN ConferenceonPro

gramming Language Design and ImplementationOntario Canada June

Paul M Petersen and David A Padua Static and dynamic evaluation of data

dep endence techniques IEEE Transactions on Paral lel and DistributedSys

tems November

Michael J Voss Portable lo oplevel parallelism for shared memory multipro ces

sor architectures Masters thesis Scho ol of ECE Purdue University Octob er

Nirav H Kapadia and JoseABFortes On the design of a demandbased

networkcomputing system The purdue universitynetwork computing hubs In

Proc of IEEE Symposium on High Performance Distributed Computing pages

Chicago IL

D A Bader and J JaJa SIMPLE a metho dology for programming high p er

formance algorithms on clusters of symmetricmultipro cessors SMPs Journal

of Paral lel and Distributed Computing July

B Buttarazzi A metho dology for parallel structured programming in logic

environments International Journal of Mini and Microcomputers

Message Passing Interface Forum MPI A messagepassing interface standard

Technical rep ort UniversityofTennessee Knoxville Tennessee May

A Beguelin J Dongarra A Geist R Manchek S Otto and J Walp ole PVM

Exp eriences current status and future direction In Proc of Supercomputing

pages November

ANSI XH Paral lel Extensions for Fortran XHSDRevision m edition

April

Kuck and Asso ciates Champaign IL Guide Reference Manualversion

edition Septemb er

David J Kuck The eects of program restructuring algorithm change and ar

chitecture choice on program p eformance In Proc of International Conference

on Paral lel Processing pages St Charles Ill August

Randy Allen and Ken Kennedy Automatic translation of Fortran programs

to vector form ACM Transactions on Programming Languages and Systems

Octob er

FAllenMBurke P Charles R Cytron and J Ferrante An overview of the

PTRAN analysis system for multipro cessing Journal of Paral lel and Distributed

Computing Octob er

William Blume Ramon Doallo Rudolf Eigenmann John Grout Jay Ho einger

Thomas Lawrence Jaejin Lee David Padua Yunheung Paek Bill Pottenger

Lawrence Rauchwerger and Peng Tu Parallel programming with Polaris IEEE

Computer December

M W Hall J M Anderson S P Amarasinghe B R Murphy SW Liao

E Bugnion and M S Lam Maximizing multipro cessor p erformance with the

SUIF compiler IEEE Computer December

AnthonyJGHey Highp erformance computingpast present and future

Computing and Control Engineering Journal February

R W Numrich J L Steidel B H Johnson B D de Dinechin G Elsesser

G Fischer and T MacDonald Denition of the F extension to Fortran

In Proc of the Workshop of Languages and Compilers for Paral lel Computing

pages SpringerVerlag August

R von Hanxleden K Kennedy and J Saltz Valuebased distributions in For

tran D In Proc of International Conference on HighPerformance Computing

and Networking pages SpringerVerlag April

High Performance Fortran Forum High Performance Fortran language sp ec

ication version Technical rep ort Rice University Houston Texas May

Microsoft Visual C httpmsdnmicrosoftcomvisualc

Microsoft Visual Basic httpmsdnmicrosoftcomvbasic

A Beguelin J Dongarra A Geist R Manchek K Mo ore R Wade and

V Sunderam HeNCE Graphical development to ols for networkbased concur

rent computing In Proc of Scalable High Performance Computing Conference

pages April

J Schaeer D Szafron G Lob e and I Parsons The Enterprise mo del for

developing distributed applications IEEE Paral lel and DistributedTechnology

JanuaryMarch

P Newton and J C Browne The CODE graphical parallel programming

language In Proc of International ConferenceonSupercomputing pages

July

P Kacsuk G Dozsa and T Fadgyas Designing parallel programs bythe

graphical language GRAPNEL Microprocessing and Microprogramming

April

O Lo ques J Leite and E V Carrera PRIO a mo dular parallelprogramming

environment IEEE Concurrency JanuaryMarch

N Stankovic and K Zhang Visual programming for messagepassing sys

tems International Journal of Software Engineering and Know ledge Engineer

ing August

Barr E Bauer Practical Paral lel Programming Academic Press

Silicon Graphics Inc PerformanceTuning Optimization for Origin

and Onyx httptechpubssgicomlibrarymanuals

htmlOTuninghtml

Boston Univeristy Introduction to Paral lel Processing on SGI Shared Memory

Computers httpscvbueduSCVTutorialsSMP

University of Illinois at UrbanaChampaign CSECSECE

httpwwwcseuiuceducse

University of California at Berkeley UC Berkeley CS Home Page Ap

plications of Paral lel Computers httpHTTPCSBerkeleyEDU dem

melcs

Georey C Fox Roy D Williams and Paul C Messina Paral lel Computing

Works Morgan Kaufmann Publishers

Ian Foster Designing and Building Paral lel Programs Addison Wesley

D Cheng and R Ho o d A p ortable debugger for parallel and distributed pro

grams In Proc of Supercomputing pages Novemb er

J May and F Berman Retargetability and extensibility in a parallel debugger

Journal of Paral lel and Distributed Computing June

Pallas TotalView httpwwwpallasdepagestotalvhtm

Kuck and Asso ciates Inc KAPProToolsethttpwwwkaicom

Vincent Guarna Jr Dennis Gannon David Jablonowski Allen Malonyand

Yogesh Gaur Faust An integrated environment for the development of parallel

programs IEEE Software July

Bill App elb e Kevin Smith and Charles McDowell StartPat A parallel

programming to olkit IEEE Software July

V Balasundaram K Kennedy U Kremer K McKinley and J Subhlok The

ParaScop e editor An interactive parallel programming to ol In Proc of Super

computing Conference pages

M W Hall T J Harvey K Kennedy N McIntosh K S McKinleyJD

Oldham M H Paleczny and G Roth Exp eriences using the ParaScop e editor

An interactive parallel programming to ol In Proc of Principles and Practices

of Paral lel Programming pages May

Rudolf Eigenmann and Patrick McClaughry Practical to ols for optimizing

parallel programs In Proc of the Simulation Multiconference on the High

Performance Computing Symposium pages March

W Liao A Diwan R PBosch Jr A Ghuloum and M S Lam SUIF explorer

An interactive and interpro cedural parallelizer In Proc of the th ACM SIG

PLAN Symposium on Principles and PracticeofParal lel Programming pages

August

Applied Parallel Research Inc Forge Explorer httpwwwapricom

Seema Hiranandani Ken Kennedy and ChauWen Tseng Compiling For

tran d for MIMD distributedmemory machines Communications of the ACM

August

V S Adve J MellorCrummey M Anderson K KennedyJCWang and

D A Reed An integrated compilation and p erformance analysis environment

for data parallel programs In Proc of Supercomputing Conference pages

S P Johnson C S Ierotheou and M Cross Automatic parallel co de genera

tion for message passing on distributed memory systems Paral lel Computing

February

S P Johnson P F Leggett C S Ierotheou E W Evans and M Cross

Computer AidedParal lelisation Tools CAPTools TutorialsParallel Pro cess

ing Research Group University of Greenwich Octob er CAPTo ols Version Beta

Central Institute for Applied Mathematics PCL The Performance Counter

Library A Common InterfacetoAccess Hardware Performance Counters on

MicroprocessorsNovember

Louis Lop ez The NAS Trace Visualizer NTV Rel Users Guide NASA

Septemb er

Michael T Heath and Jennifer A Etheridge Visualizing the p erformance of

parallel programs IEEE Software Septemb er

Universite de MarnelaVallee PGPVM httpphalanstereuniv

mlvfr svPGPVM

Daniel A Reed Exp erimental p erformance analysis of parallel systems Tech

niques and op en problems In Proc of the th Int Conf on Model ling Techniques

and Tools for Computer Performance Evaluation pages

W E Nagel A Arnold M Web er H C Hopp e and K Solchenbach VAM

PIR visualization and analysis of MPI resources Supercomputer

January

J Yan S Sarukkai and P Mehra Performance measurement visualization

and mo deling of parallel and distributed programs using the AIMS to olkit

SoftwarePractice and Experience April

Barton P Miller Mark D Callaghan Jonathan M Cargille Jerey K

Hollingsworth R Bruce Irvin Karen L Karavanic Krishna Kunchithapadam

and Tia Newhall The Paradyn parallel p erformance measurement to ol IEEE

Computer November

S Shende A D MalonyJCuny K Lindlan PBeckman and S Karmesin

Portable proling and tracing for parallel scientic applications using C In

Proc of ACM SIGMETRICS Symposium on Paral lel and DistributedTools

pages August

PacicSierra Research DEEPMPI Development Environment

for MPI Programs Paral lel Program Analysis and Debugging

mpi tophtml httpwwwpsrvcomdeep

B J N Wylie and A Endo AnnaiPMA multilevel hierarchical parallel pro

gram p erformance engineering In Proc of International Workshop on High

Level Programming Models and Supportive Environments pages

LAM Team University of North Dakota XMPI A RunDebug GUI for

MPI httpwwwmpindedulamsoftwarexmpi

A D Malony D H Hammerslag and D J Jablonowski TraceView a trace

visualization to ol IEEE Software Septemb er

Michael T Heath Performance visualization with ParaGraph In Proc of the

Second Workshop on Environments and Tools for Paral lel Scientic Computing

pages May

E Lusk Visualizing parallel program b ehavior In Proc of Simulation Mul

ticonference on the High Performance Computing Symposium pages April

Y Arrouye Scop e an extensible interactiveenvironment for the p erformance

evaluation of parallel system Microprocessing and Microprogramming

April

J A Kohl and G A Geist The PVM tracing facility and XPVM g In

Proc of the TwentyNinth Hawaii International Conference on System Sciences

pages January

B Top ol J T Stasko and V Sunderam PVaniM A to ol for visualization

in network computing environments Concurrency Practice and Experience

December

G Weiming G Eisenhauer K Schwan and J Vetter Falcon Online mon

itoring for steering parallel programs Concurrency Practice and Experience

August

J T Stasko and E Kraemer A metho dology for building applicationsp ecic

visualizations of parallel programs Journal of Paral lel and Distributed Com

puting June

G A Geist I I J A Kohl and PMPapadop oulos CUMULVS Providing

fault tolerance visualization and steering of parallel applications International

Journal of Supercomputer Applications Fall

K C Li and K Zhang Tuning parallel program through automatic program

analysis In Proc of Second International Symposium on Paral lel Architectures

Algorithms and Networks pages June

A Reinefeld R Baraglia T Decker J Gehring D Laforenza F Ramme

T Romke and J Simon The MOL pro ject An op en extensible metacomputer

In Proc of the IEEE Heterogeneous Computing Workshop pages

H Casanova and J Dongarra NetSolve a network enabled server for solv

ing computational science problems International Journal of Supercomputer

Applications Fall

M Sato H Nakada S Sekiguchi S Matsuoka U Nagashima and H Tak

agi Ninf a networkbased information library for global worldwide computing

infrastructure In Proc of HighPerformance Computing and Networking In

ternational Conference and Exhibition pages April

P Arb enz W Gander and M Oettli The Remote Computation System

Paral lel Computing Octob er

T Richardson Q StaordFraser K R Wo o d and A Hopp er Virtual network

computing IEEE Internet Computing JanuaryFebruary

Citrix ICA technical paperhttpwwwcitrixcompro ductsicaasp

I Foster and C Kesselman Globus A metacomputing infrastructure to olkit

International Journal of Supercomputer Applications Summer

A S Grimshaw and W A Wulf The Legion vision of a worldwide virtual

computer Communications of the ACM January

Insung Park Nirav H Kapadia Renato J Figueiredo Rudolf Eigenmann and

JoseABFortes Towardsanintegrated webexecutable parallel program

ming to ol environment To app ear in the Proc of SCHigh Performance

Networking and Computing

B LaRose The development and implementation of a p erformance database

server Technical Rep ort CS UniversityofTennessee August

The University of Southampton GRAPHICAL BENCHMARK INFORMA

TION SERVICE GBIS httpwwwccgecssotonacukgbispapiani

newgbishtml

Cherri M Pancake and Curtis Co ok What users need in parallel to ol supp ort

Survey results and analysis In Proc of Scalable High Performance Computing

Conference pages March

Roger S Pressman Software Engineering a Practitioners Approach McGraw

Hill Inc New York NY

Peter Pacheco Paral lel Programming with MPI Morgran Kaufman Publishers

D Culler J P Singh and A Gupta Paral lel Computer Architecture Morgran

Kaufman Publishers

Rudolf Eigenmann Toward a metho dology of optimizing programs for high

p erformance computers In Proc of ACM International ConferenceonSuper

computing pages Tokyo Japan July

Seon Wo ok Kim and Rudolf Eigenmann Detailed quantitative analysis of

sharedmemory parallel programs Technical Rep ort ECEHPCLab HP

CLAB Scho ol of ECE Purdue University

Seon Wook Kim Michael J Voss and Rudolf Eigenmann Performance analysis

of parallel compiler backends on sharedmemory multipro cessors In Proc of the

Tenth Workshop on Compilers for Paral lel Computers pages January

Rudolf Eigenmann Insung Park and Michael J Voss Are parallel workstations

the right target for parallelizing compilers In Lecture Notes in Computer

Science No Languages and Compilers for Paral lel Computing pages

March

Michael J Voss Insung Park and Rudolf Eigenmann On the machine

indep endent target language for parallelizing compilers In Proc of the Sixth

Workshop on Compilers for Paral lel ComputersAachen Germany December

Insung Park Michael J Voss and Rudolf Eigenmann Compiling for the new

generation of highp erformance SMPs Technical Rep ort ECEHPCLab

HPCLAB Scho ol of ECE Purdue UniversityNovember

Lynn Pointer Perfect Performance evaluation for costeective tranforma

tions rep ort Technical Rep ort Center for Sup ercomputing Researchand

Development University of Illinois at UrbanaChampaign March

Insung Park Michael J Voss Brian Armstrong and Rudolf Eigenmann Inter

active compilation and p erformance analysis with ursa minorInProc of the

Workshop of Languages and Compilers for Paral lel Computing pages

SpringerVerlag August

Insung Park Michael J Voss Brian Armstrong and Rudolf Eigenmann Par

allel programming and p erformance evaluation with the ursa to ol family In

ternational Journal of Paral lel Programming Novemb er

Insung Park Michael J Voss Brian Armstrong and Rudolf Eigenmann Sup

p orting users reasoning in p erformance evaluation and tuning of parallel ap

plications To app ear in Proc of the Twelth IASTED International Conference

on Paral lel and Distributed Computing and SystemsNovemb er

Seon Wo ok Kim Insung Park and Rudolf Eigenmann A p erformance advisor

to ol for novice programmers in parallel programming To app ear in the Proc

of the Workshop of Languages and Compilers for Paral lel Computing

Stefan Kortmann Insung Park Michael Voss and Rudolf Eigenmann Inter

active and mo dular optimization with interpol In Proc of the In

ternational ConferenceonParal lel and DistributedProcessing Techniques and

Applications pages June

Michael J Voss Kwok Wai Yau and Rudolf Eigenmann Interactive instru

mentation and tuning of Op enMP programs Technical Rep ort ECEHPCLab

HPCLAB

SeonWo ok Kim and Rudolf Eigenmann MaxP Detecting the Maximum Par

al lelism in a Fortran Program HPCLAB

Insung Park and Rudolf Eigenmann Ursa Major Exploring web technology

for design and evaluation of highp erformance systems In Proc of the Inter

national Conference on High Performance Computing and Networking pages

Berlin Germany April SpringerVerlag

T Nakra R Gupta and M L Soa Value prediction in VLIW machines

In Proc of the th International Symposium on Computer Architecture pages

May

Trimaran Homepage Trimaran Manual

httpwwwtrimaranorgdo cshtml

A D Alexandrov M Ib el K E Schauser and C J Scheiman UFO A p er

sonal global le system based on userlevel extensions to the op erating system

ACM Transactions on Computer Systems August

Rudolf Eigenmann and Siamak Hassanzadeh Benchmarking with real indus

trial applications The SPEC HighPerformance Group IEEE Computational

Science Engineering Spring

David L Weaver and Tom Germond The SPARCArchitecture Manual Version

SPARCInternational Inc PTR Prentice Hall Englewo o d Clis NJ

T J Downar JenYing Wu J Steill and R Janardhan Parallel and serial

applications of the RETRAN p ower plantsimulation co de using domain

decomp osition and Krylov subspace metho ds Nuclear Technology

February

VITA

Insung Park was born in February in Seoul, South Korea. He received his BS degree in control and instrumentation engineering from Seoul National University and his MS degree in Electrical Engineering from the Virginia Polytechnic Institute and State University, Blacksburg, Virginia. He successfully defended his PhD research in August at the School of Electrical and Computer Engineering at Purdue University and was awarded the PhD in December of the same year.

Insung Park served for several years as a system administrator of the electrical engineering departmental workstation laboratory. During his MS study he developed a partial scan design tool, BELLONA. As a PhD student at Purdue, he designed and implemented a parallel programming environment consisting of a programming methodology and a set of tools.

He is a member of the honor society of Phi Kappa Phi.