THE NAS PARALLEL BENCHMARKS

D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum,
R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber,
H. Simon, V. Venkatakrishnan and S. Weeratunga

RNR Technical Report RNR-94-007, March 1994

Abstract

A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of five parallel kernels and three simulated application benchmarks. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications.

The principal distinguishing feature of these benchmarks is their "pencil and paper" specification: all details of these benchmarks are specified only algorithmically. In this way, many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

This paper is an updated version of a RNR technical report originally issued January 1991. Bailey, Barszcz, Barton, Carter and Lasinski are with NASA Ames Research Center. Dagum, Fatoohi, Fineberg, Simon and Weeratunga are with Computer Sciences Corp. (work supported through NASA Contract NAS 2-12961). Schreiber is with RIACS (work funded via NASA cooperative agreement NCC 2-387). Browning is now in the Netherlands (work originally done with CSC). Frederickson is now with Cray Research, Inc. in Los Alamos, NM (work originally done with RIACS). Venkatakrishnan is now with ICASE in Hampton, VA (work originally done with CSC). Address for all authors except Browning, Frederickson and Venkatakrishnan: NASA Ames Research Center, Mail Stop TA, Moffett Field, CA.

GENERAL REMARKS

D. Bailey, D. Browning, R. Carter, S. Fineberg and H. Simon

Introduction

The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, is a large scale effort to advance the state of computational aerodynamics. Specifically, the NAS organization aims to provide the Nation's aerospace research and development community, by the year 2000, a high-performance, operational computing system capable of simulating an entire aerospace vehicle system within a computing time of one to several hours [1]. The successful solution of this "grand challenge" problem will require the development of computer systems that can perform the required complex scientific computations at a sustained rate nearly one thousand times greater than current generation supercomputers can now achieve. The architecture of computer systems able to achieve this level of performance will likely be dissimilar to the shared memory multiprocessing supercomputers of today. While no consensus yet exists on what the design will be, it is likely that the system will consist of at least 1000 processors computing in parallel.

Highly parallel systems with computing power roughly equivalent to traditional shared memory multiprocessors exist today. Unfortunately, the performance evaluation of these systems on comparable types of scientific computations is very difficult, for several reasons. Few relevant data are available for the performance of algorithms of interest to the computational aerophysics community on many currently available parallel systems. Benchmarking and performance evaluation of such systems has not kept pace with advances in hardware, software and algorithms. In particular, there is as yet no generally accepted benchmark program, or even a benchmark strategy, for these systems.

The popular "kernel" benchmarks that have been used for traditional vector supercomputers, such as the Livermore Loops [2], the LINPACK benchmark [3, 4] and the original NAS Kernels [5], are clearly inappropriate for the performance evaluation of highly parallel machines. First of all, the tuning restrictions of these benchmarks rule out many widely used parallel extensions. More importantly, the computation and memory requirements of these programs do not do justice to the vastly increased capabilities of the new parallel machines, particularly those systems that will be available by the mid-1990s.

(Footnote: Computer Sciences Corporation, El Segundo, California. This work is supported through NASA Contract NAS 2-12961.)

On the other hand, a full scale scientific application is similarly unsuitable. First of all, porting a large program to a new parallel computer architecture requires a major effort, and it is usually hard to justify a major research task simply to obtain a benchmark number. For that reason we believe that the otherwise very successful PERFECT Club benchmark [6] is not suitable for highly parallel systems. This is demonstrated by the very sparse performance results for parallel machines in the recent reports [7, 8, 9].

Alternatively, an application benchmark could assume the availability of automatic software tools for transforming "dusty deck" source into efficient parallel code on a variety of systems. However, such tools do not exist today, and many scientists doubt that they will ever exist across a wide range of architectures.

Some other considerations for the development of a meaningful benchmark for a highly parallel supercomputer are the following:

- Advanced parallel systems frequently require new algorithmic and software approaches, and these new methods are often quite different from the conventional methods implemented in source code for a sequential or vector machine.

- Benchmarks must be "generic" and should not favor any particular parallel architecture. This requirement precludes the usage of any architecture-specific code, such as message passing code.

- The correctness of results and performance figures must be easily verifiable. This requirement implies that both input and output data sets must be kept very small. It also implies that the nature of the computation and the expected results must be specified in great detail.

- The memory size and run time requirements must be easily adjustable to accommodate new systems with increased power.

- The benchmark must be readily distributable.

In our view, the only benchmarking approach that satisfies all of these constraints is a "paper and pencil" benchmark. The idea is to specify a set of problems only algorithmically. Even the input data must be specified only on paper. Naturally, the problem has to be specified in sufficient detail that a unique solution exists, and the required output has to be brief yet detailed enough to certify that the problem has been solved correctly. The person or persons implementing the benchmarks on a given system are expected to solve the various problems in the most appropriate way for the specific system. The choice of data structures, algorithms, processor allocation and memory usage are all (to the extent allowed by the specification) left open to the discretion of the implementer. Some extension of Fortran or C is required, and reasonable limits are placed on the usage of assembly code and the like, but otherwise programmers are free to utilize language constructs that give the best performance possible on the particular system being studied.

To this end, we have devised a number of relatively simple "kernels," which are specified completely in the next chapter of this document. However, kernels alone are insufficient to completely assess the performance potential of a parallel machine on real scientific applications. The chief difficulty is that a certain data structure may be very efficient on a certain system for one of the isolated kernels, and yet this data structure would be inappropriate if incorporated into a larger application. In other words, the performance of a real computational fluid dynamics (CFD) application on a parallel system is critically dependent on data motion between computational kernels. Thus we consider the complete reproduction of this data movement to be of critical importance in a benchmark.

Our benchmark set therefore consists of two major components: five parallel kernel benchmarks and three simulated application benchmarks. The simulated application benchmarks combine several computations in a manner that resembles the actual order of execution in certain important CFD application codes. This is discussed in more detail in a later chapter.

We feel that this benchmark set successfully addresses many of the problems associated with benchmarking parallel machines. Although we do not claim that this set is typical of all scientific computing, it is based on the key components of several large aeroscience applications used on supercomputers by scientists at NASA Ames Research Center. These benchmarks will be used by the Numerical Aerodynamic Simulation (NAS) Program to evaluate the performance of parallel computers.

Benchmark Rules

Definitions

In the following, the term "processor" is defined as a hardware unit capable of executing both floating point addition and floating point multiplication instructions. The "local memory" of a processor refers to randomly accessible memory that can be accessed by that processor in less than one microsecond. The term "main memory" refers to the combined local memory of all processors. This includes any memory shared by all processors that can be accessed by each processor in less than one microsecond. The term "mass storage" refers to non-volatile, randomly accessible storage media that can be accessed by at least one processor within forty milliseconds. A "processing node" is defined as a hardware unit consisting of one or more processors plus their local memory, which is logically a single unit on the network that connects the processors.

The term "computational nodes" refers to those processing nodes primarily devoted to high-speed floating point computation. The term "service nodes" refers to those processing nodes primarily devoted to system operations, including compilation, linking and communication with external computers over a network.

General rules

Implementations of these benchmarks must be based on either Fortran 90 (which includes Fortran 77 as a subset) or C, although a wide variety of parallel extensions are allowed. This requirement stems from the observation that Fortran and C are the most commonly used programming languages by the scientific parallel computing community at the present time. If in the future other languages gain wide acceptance in this community, they will be considered for inclusion in this group. Assembly language and other low-level languages and constructs may not be used, except that certain specific vendor-supported, assembly-coded library routines may be called (see the next section).

We are of the opinion that such language restrictions are necessary, because otherwise considerable effort would be made by benchmarkers in low-level or assembly-level coding. Then the benchmark results would tend to reflect the amount of programming resources available to the benchmarking organization, rather than the fundamental merits of the parallel system. Certainly the mainstream scientists that these parallel computers are intended to serve will be coding applications at the source level, almost certainly in Fortran or C, and thus these benchmarks are designed to measure the performance that can be expected from such code.

Accordingly, the following rules must be observed in any implementations of the NAS Parallel Benchmarks:

- All floating point operations must be performed using 64-bit floating point arithmetic.

- All benchmarks must be coded in either Fortran or C, with certain approved extensions.

- Implementation of the benchmarks may not include a mix of Fortran and C code; one or the other must be used.

- Any extension of Fortran that is in the High Performance Fortran (HPF) draft dated January 1992 or later [12] is allowed.

- Any language extension or library routine that is employed in any of the benchmarks must be supported by the vendor and available to all users.

- Subprograms and library routines not written in Fortran or C may only perform certain functions, as indicated in the next section.

- All rules apply equally to subroutine calls, language extensions and compiler directives (i.e., special comments).

Allowable Fortran extensions and library routines

Fortran extensions and library routines are also permitted that perform the following:

- Indicate sections of code that can be executed in parallel or loops that can be distributed among different computational nodes.

- Specify the allocation and organization of data among or within computational nodes.

- Communicate data between processing nodes.

- Communicate data between the computational nodes and service nodes.

- Rearrange data stored in multiple computational nodes, including constructs to perform indirect addressing and array transpositions.

- Synchronize the action of different computational nodes.

- Initialize for a data communication or synchronization operation that will be performed or completed later.

- Perform high-speed input or output operations between main memory and the mass storage system.

- Perform any of the following array reduction operations on an array either residing within a single computational node or distributed among multiple nodes: MAX, MIN, AND, OR, XOR.

- Combine communication between nodes with one of the operations listed in the previous item.

- Perform any of the following computational operations on arrays either residing within a single computational node or distributed among multiple nodes: dense or sparse matrix multiplication, dense or sparse matrix-vector multiplication, one-dimensional, two-dimensional or three-dimensional fast Fourier transforms, sorting, block tridiagonal system solution and block pentadiagonal system solution. Such routines must be callable with general array dimensions.

Sample Codes

The intent of this paper is to completely specify the computation to be carried out. Theoretically, a complete implementation, including the generation of the correct input data, could be produced from the information in this paper. However, the developers of these benchmarks are aware of the difficulty and time required to generate a correct implementation from scratch in this manner. Furthermore, despite several reviews, ambiguities in this technical paper may exist that could delay implementations.

In order to reduce the number of difficulties, and to aid the benchmarking specialist, Fortran computer programs implementing the benchmarks are available. These codes are to be considered examples of how the problems could be solved on a single processor system, rather than statements of how they should be solved on an advanced parallel system. The sample codes actually solve scaled down versions of the benchmarks that can be run on many current generation workstations. Instructions are supplied in comments in the source code on how to scale up the program parameters to the full size benchmark specifications.

These programs, as well as the benchmark document itself, are available from the Systems Development Branch in the NAS Systems Division, Mail Stop …, NASA Ames Research Center, Moffett Field, CA 94035, attn: NAS Parallel Benchmark Codes. The sample codes are provided on Macintosh floppy disks and contain the Fortran source codes, "ReadMe" files, input data files, and reference output data files for correct implementations of the benchmark problems. These codes have been validated on a number of computer systems, ranging from conventional workstations to supercomputers.

Three classes of problems are defined in this document. These will be denoted "Sample Code," "Class A" and "Class B," since the three classes differ mainly in the sizes of principal arrays. The three tables below give the problem sizes, memory requirements (measured in Mw), run times and performance rates (measured in Mflops) for each of the eight benchmarks, and for the Sample Code, Class A and Class B problem sets. These statistics are based on implementations on one processor of a Cray Y-MP. The operation count for the Integer Sort benchmark is based on integer operations rather than floating-point operations. The entries in the "Problem size" columns are sizes of key problem parameters. Complete descriptions of these parameters are given in the chapters that follow.

Submission of Benchmark Results

It must be emphasized that the sample codes described in the previous section are not the benchmark codes, but only implementation aids. For the actual benchmarks, the sample codes must be scaled to larger problem sizes. The sizes of the current benchmarks were chosen so that implementations are possible on currently available supercomputers. As parallel computer technology progresses, future releases of these benchmarks will specify larger problem sizes.

Benchmark code                   Problem size   Memory (Mw)   Time (sec)   Rate (Mflops)
Embarrassingly parallel (EP)     …              …             …            …
Multigrid (MG)                   …              …             …            …
Conjugate gradient (CG)          …              …             …            …
3-D FFT PDE (FT)                 …              …             …            …
Integer sort (IS)                …              …             …            …
LU solver (LU)                   …              …             …            …
Pentadiagonal solver (SP)        …              …             …            …
Block tridiagonal solver (BT)    …              …             …            …

Table 1: NAS Parallel Benchmarks Sample Code Statistics

The authors and developers of these benchmarks encourage submission of performance results for the problems listed in the tables above. Periodic publication of the submitted results is planned. Benchmark results should be submitted to the Applied Research Branch, NAS Systems Division, Mail Stop T…, NASA Ames Research Center, Moffett Field, CA 94035, attn: NAS Parallel Benchmark Results. A complete submission of results should include the following:

- A detailed description of the hardware and software configuration used for the benchmark runs.

- A description of the implementation and algorithmic techniques used.

- Source listings of the benchmark codes.

- Output listings from the benchmarks.

Benchmark code                   Problem size   Memory (Mw)   Time (sec)   Rate (Mflops)
Embarrassingly parallel (EP)     …              …             …            …
Multigrid (MG)                   …              …             …            …
Conjugate gradient (CG)          …              …             …            …
3-D FFT PDE (FT)                 …              …             …            …
Integer sort (IS)                …              …             …            …
LU solver (LU)                   …              …             …            …
Pentadiagonal solver (SP)        …              …             …            …
Block tridiagonal solver (BT)    …              …             …            …

Table 2: NAS Parallel Benchmarks Class A Statistics

Benchmark code                   Problem size   Memory (Mw)   Time (sec)   Rate (Mflops)
Embarrassingly parallel (EP)     …              …             …            …
Multigrid (MG)                   …              …             …            …
Conjugate gradient (CG)          …              …             …            …
3-D FFT PDE (FT)                 …              …             …            …
Integer sort (IS)                …              …             …            …
LU solver (LU)                   …              …             …            …
Pentadiagonal solver (SP)        …              …             …            …
Block tridiagonal solver (BT)    …              …             …            …

Table 3: NAS Parallel Benchmarks Class B Statistics

References

[1] Numerical Aerodynamic Simulation Program Plan. NAS Systems Division, Ames Research Center, October 1988.

[2] McMahon, F. H.: "The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range." Technical Report UCRL-53745, Lawrence Livermore National Laboratory, Livermore, Calif., Dec. 1986.

[3] Dongarra, J. J.: "The LINPACK Benchmark: An Explanation." SuperComputing, Spring 1988, pp. 10-14.

[4] Dongarra, J. J.: "Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment." TR MCSRD 23, Argonne National Laboratory, March 1988.

[5] Bailey, D. H., and Barton, J.: "The NAS Kernel Benchmark Program." Technical Report 86711, Ames Research Center, Moffett Field, Calif., August 1985.

[6] Berry, M., et al.: "The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers." The International Journal of Supercomputer Applications, vol. 3, 1989, pp. 5-40.

[7] Pointer, L.: "PERFECT Report: 1." Technical Report CSRD-896, Univ. of Illinois, Urbana, Ill., July 1989.

[8] Pointer, L.: "PERFECT Report: 2: Performance Evaluation for Cost-effective Transformations." Technical Report CSRD-964, Univ. of Illinois, Urbana, Ill., March 1990.

[9] Cybenko, G., Kipp, L., Pointer, L., and Kuck, D.: "Supercomputer Performance Evaluation and the Perfect Benchmarks." Technical Report CSRD-965, Univ. of Illinois, Urbana, Ill., March 1990.

[10] American National Standard Programming Language Fortran X3.9-1978. American National Standards Institute, 1430 Broadway, New York, NY 10018.

[11] Draft Proposed American National Standard Programming Language C, X3J11. American National Standards Institute, 1430 Broadway, New York, NY 10018.

[12] High Performance Fortran (HPF) Language Specification, Version 1.0 Draft. Center for Research on Parallel Computation, Rice University, Houston, TX, January 1993.

THE KERNEL BENCHMARKS

D. Bailey, E. Barszcz, L. Dagum*, P. Frederickson†, R. Schreiber† and H. Simon*

Overview

After an evaluation of a number of large scale CFD and computational aerosciences applications on the NAS supercomputers at NASA Ames, a number of kernels were selected for the benchmark. These were supplemented by some other kernels which are intended to test specific features of parallel machines. The following benchmark set was then assembled:

EP: An "embarrassingly parallel" kernel. It provides an estimate of the upper achievable limits for floating point performance, i.e., the performance without significant interprocessor communication.

MG: A simplified multigrid kernel. It requires highly structured long distance communication and tests both short and long distance data communication.

CG: A conjugate gradient method is used to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. This kernel is typical of unstructured grid computations in that it tests irregular long distance communication, employing unstructured matrix vector multiplication.

FT: A 3-D partial differential equation solution using FFTs. This kernel performs the essence of many "spectral" codes. It is a rigorous test of long-distance communication performance.

IS: A large integer sort. This kernel performs a sorting operation that is important in "particle method" codes. It tests both integer computation speed and communication performance.

* Computer Sciences Corporation. This work is supported through NASA Contract NAS 2-12961.

† Research Institute for Advanced Computer Science (RIACS), Ames Research Center. This work is supported by NAS Systems Division through Cooperative Agreement Number NCC 2-387.

These kernels involve substantially larger computations than previous kernel benchmarks, such as the Livermore Loops or LINPACK, and therefore they are more appropriate for the evaluation of parallel machines. The Parallel Kernels in particular are sufficiently simple that they can be implemented on a new system without unreasonable effort and delay. Most importantly, as emphasized earlier, this set of benchmarks incorporates a new concept in performance evaluation, namely that only the computational task is specified, and that the actual implementation of the kernel can be tailored to the specific architecture of the parallel machine.

In this chapter the Parallel Kernel benchmarks are presented, and the particular rules for allowable changes are discussed. Future reports will describe implementations and benchmarking results on a number of parallel supercomputers.

Description of the Kernels

Kernel EP: An embarrassingly parallel benchmark

D. Bailey and P. Frederickson

Brief Statement of Problem

Generate pairs of Gaussian random deviates according to a specific scheme described below, and tabulate the number of pairs in successive square annuli.

Detailed Description

Set n = 2^28, a = 5^13 and s = 271828183. Generate the pseudorandom floating point values r_j in the interval (0, 1) for 1 ≤ j ≤ 2n, using the pseudorandom number scheme described later in this chapter. Then for 1 ≤ j ≤ n set x_j = 2 r_{2j-1} - 1 and y_j = 2 r_{2j} - 1. Thus x_j and y_j are uniformly distributed on the interval (-1, 1).

Next set k = 0. Then beginning with j = 1, test to see if t_j = x_j^2 + y_j^2 ≤ 1. If not, reject this pair and proceed to the next j. If this inequality holds, then set k ← k + 1, X_k = x_j sqrt((-2 log t_j)/t_j) and Y_k = y_j sqrt((-2 log t_j)/t_j), where log denotes the natural logarithm. Then X_k and Y_k are independent Gaussian deviates with mean zero and variance one. Approximately n π/4 pairs will be constructed in this manner. See Knuth (The Art of Computer Programming, vol. 2) for additional discussion of this scheme for generating Gaussian deviates. The computation of Gaussian deviates must employ the above formula, and vendor-supplied intrinsic routines must be employed for performing all square root and logarithm evaluations.

l     Class A Q_l     Class B Q_l
0     …               …
1     …               …
2     …               …
3     …               …
4     …               …
5     …               …
6     …               …
7     …               …
8     …               …
9     …               …

Table 4: EP Benchmark Counts

Finally, for 0 ≤ l ≤ 9, tabulate Q_l as the count of the pairs (X_k, Y_k) that lie in the square annulus l ≤ max(|X_k|, |Y_k|) < l + 1, and output the ten Q_l counts. The two sums Σ_k X_k and Σ_k Y_k must also be output.

This completes the definition of the Class A problem. The Class B problem is the same except that n = 2^30.
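The pair generation, rejection test and annulus tabulation above can be sketched as follows. This is an expository Python sketch only: the benchmark itself must be coded in Fortran or C, and it mandates the specific pseudorandom number scheme of this chapter, for which Python's `random.Random` is merely a stand-in here.

```python
import math
import random

def ep_pairs(n, rng):
    """Illustrative sketch of kernel EP.  `rng` stands in for the
    benchmark's specified pseudorandom number generator."""
    sum_x = sum_y = 0.0
    q = [0] * 10                        # annulus counts Q_0 .. Q_9
    for _ in range(n):
        x = 2.0 * rng.random() - 1.0    # uniform on (-1, 1)
        y = 2.0 * rng.random() - 1.0
        t = x * x + y * y
        if t > 1.0 or t == 0.0:         # reject pairs outside the unit disk
            continue                    # (the t == 0 guard is an added safety check)
        f = math.sqrt(-2.0 * math.log(t) / t)
        gx, gy = x * f, y * f           # independent Gaussian deviates
        sum_x += gx
        sum_y += gy
        l = int(max(abs(gx), abs(gy)))  # square annulus index
        if l < 10:
            q[l] += 1
    return sum_x, sum_y, q
```

As the text notes, roughly π/4 of the candidate pairs survive the rejection test, and the only communication a parallel implementation needs is the final combination of the sums and counts.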

Verification Test

The two sums Σ_k X_k and Σ_k Y_k must agree with reference values to within one part in 10^12. For the Class A problem the reference values are … and …, while for the Class B problem the sums are … and ….

Each of the ten Q_l counts must agree exactly with the reference values, which are given in the EP Benchmark Counts table above.

Operations to be Timed

All of the operations described above are to be timed, including tabulation and output.

Other Features

- This problem is typical of many Monte Carlo simulation applications.

- The only requirement for communication is the combination of the sums from various processors at the end.

- Separate sections of the uniform pseudorandom numbers can be independently computed on separate processors; see the discussion of the pseudorandom number scheme later in this chapter for details.

- The smallest distance between a floating-point value and a nearby integer among the r_j, X_k and Y_k values is …, which is well above the achievable accuracy using 64-bit floating arithmetic on existing computer systems. Thus if a truncation discrepancy occurs, it implies a problem with the system hardware or software.

Kernel MG: A simple 3D multigrid benchmark

E. Barszcz and P. Frederickson

Brief Statement of Problem

Four iterations of the V-cycle multigrid algorithm described below are used to obtain an approximate solution u to the discrete Poisson problem

   ∇²u = v

on a 256 × 256 × 256 grid with periodic boundary conditions.

Detailed Description

Set v = 0 except at the twenty points listed in table 5, where v = ±1. These points were determined as the locations of the ten largest and ten smallest pseudorandom numbers, generated as in Kernel FT.

Begin the iterative solution with u = 0. Each of the four iterations consists of the following two steps, in which k = log₂(256) = 8:

   r = v - A u        (evaluate residual)
   u = u + M^k r      (apply correction)

Here M^k denotes the V-cycle multigrid operator, defined in the V-cycle multigrid operator table below.

(i, j, k)     v_{i,j,k}
…             …

Table 5: Nonzero values for v

z_k = M^k r_k:

if k > 1:
    r_{k-1} = P r_k                (restrict residual)
    z_{k-1} = M^{k-1} r_{k-1}      (recursive solve)
    z_k = Q z_{k-1}                (prolongate)
    r_k = r_k - A z_k              (evaluate residual)
    z_k = z_k + S r_k              (apply smoother)
else:
    z_1 = S r_1                    (apply smoother)

Table 6: V-cycle multigrid operator
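The recursive structure of the V-cycle operator above can be sketched compactly. This is an illustrative Python sketch, not the sample code: the grid operators A (fine-grid operator), P (restriction), Q (prolongation) and S (smoother) are passed in as functions, and the scalar stand-ins in the usage below are not the benchmark's stencils.

```python
def vcycle(k, r, A, P, Q, S):
    """Compute z = M^k r following the V-cycle definition:
    restrict, solve recursively, prolongate, re-evaluate the
    residual, then smooth."""
    if k > 1:
        rc = P(r)                              # r_{k-1} = P r_k
        z = Q(vcycle(k - 1, rc, A, P, Q, S))   # z_k = Q M^{k-1} r_{k-1}
        rk = r - A(z)                          # r_k = r_k - A z_k
        return z + S(rk)                       # z_k = z_k + S r_k
    return S(r)                                # coarsest level: z_1 = S r_1
```

With scalar stand-in operators such as A(z) = 2z, P(r) = r/2, Q(z) = 2z and S(r) = r/2, the recursion can be traced by hand at any depth k.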

In this definition, A denotes the trilinear finite element discretization of the Laplacian ∇², normalized as indicated in table 7, where the coefficients of P, Q and S are also listed.

In this table, c_0 denotes the central coefficient of the 27-point operator, when these coefficients are arranged as a 3 × 3 × 3 cube. Thus c_0 is the coefficient that multiplies the value at the grid point (i, j, k), while c_1 multiplies the six values at grid points which differ by one in exactly one index, c_2 multiplies the next closest twelve values, those that differ by one in exactly two indices, and c_3 multiplies the eight values located at grid points that differ by one in all three indices. The restriction operator P given in this table is the trilinear projection operator of finite element theory, normalized so that the coefficients of all operators are independent of level, and is half the transpose of the trilinear interpolation operator Q.

C       c_0       c_1       c_2       c_3
A       -8/3      0         1/6       1/12
P       1/2       1/4       1/8       1/16
Q       1         1/2       1/4       1/8
S(a)    -3/8      +1/32     -1/64     0
S(b)    -3/17     +1/33     -1/61     0

Table 7: Coefficients for the trilinear finite element discretization
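The grouping of the 27 coefficients by index distance described above can be exercised with a periodic stencil application. This is an expository Python/NumPy sketch, not the sample code, and the A-row values used in the usage example (-8/3, 0, 1/6, 1/12) are the reconstructed coefficients, assumed rather than quoted from a verified table.

```python
import numpy as np

def apply_27pt(u, c):
    """Apply a 27-point operator on a periodic grid.  Coefficient c[d]
    multiplies the neighbors whose indices differ by one in exactly d
    of the three dimensions (d = 0 is the central point), matching the
    c_0..c_3 grouping described in the text."""
    out = c[0] * u
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                d = abs(dx) + abs(dy) + abs(dz)
                if d == 0:
                    continue   # central point already counted via c[0]
                out = out + c[d] * np.roll(np.roll(np.roll(u, dx, 0), dy, 1), dz, 2)
    return out
```

A quick consistency check on the assumed A coefficients: applied to a constant field the operator returns zero, since -8/3 + 6·0 + 12·(1/6) + 8·(1/12) = 0, as any discretization of the Laplacian must.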

Verification Test

Class A: Evaluate the residual after four iterations of the V-cycle multigrid algorithm, using the coefficients from the S(a) row for the smoothing operator S, and verify that its L2 norm

   ||r||_2 = ( 256^-3 Σ_{i,j,k} r_{i,j,k}^2 )^{1/2}

agrees with the reference value 0.2433365309 × 10^-5 within an absolute tolerance of 10^-14.

Class B: The array size is the same as for Class A, but 20 iterations must be performed, using the coefficients from the S(b) row for the smoothing operator S. The output L2 norm must agree with the reference value 0.180056440132 × 10^-5 within an absolute tolerance of 10^-14.

Timing

Start the clock before evaluating the residual for the first time, and after initializing u and v. Stop the clock after evaluating the norm of the final residual, but before displaying or printing its value.

Kernel CG: Solving an unstructured sparse linear system by the conjugate gradient method

R. Schreiber, H. Simon and R. Carter

Brief Statement of Problem

This benchmark uses the inverse power method to find an estimate of the largest eigenvalue of a symmetric positive definite sparse matrix with a random pattern of nonzeros.

Detailed Description

In the following, A denotes the sparse matrix of order n, lower case Roman letters are vectors, x_j is the j-th component of the vector x, and superscript "T" indicates the usual transpose operator. Lower case Greek letters are scalars. We denote by ||x|| the Euclidean norm of a vector x, ||x|| = sqrt(x^T x). All quantities are real. The inverse power method is to be implemented as follows:

   x = [1, 1, ..., 1]^T
   (start timing here)
   DO it = 1, niter
      Solve the system Az = x and return ||r||, as described below
      ζ = λ + 1/(x^T z)
      Print it, ||r|| and ζ
      x = z/||z||
   ENDDO
   (stop timing here)
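The outer iteration can be sketched as below. This is an illustrative Python sketch with two labeled assumptions: the eigenvalue update ζ = λ + 1/(xᵀz) is the reconstruction of the garbled update line, and a dense direct solve stands in for the conjugate gradient routine the benchmark actually specifies.

```python
import numpy as np

def inverse_power(A, lam, niter, solve):
    """Outer (inverse power) iteration.  `solve` returns
    (z, residual_norm) for A z = x; here a dense solver is the
    stand-in, not the benchmark's CG routine."""
    x = np.ones(A.shape[0])
    zeta = lam
    for _ in range(niter):
        z, rnorm = solve(A, x)
        zeta = lam + 1.0 / (x @ z)    # eigenvalue estimate (assumed update)
        x = z / np.linalg.norm(z)     # normalize for the next iteration
    return zeta
```

On a toy diagonal matrix with zero shift, the iteration converges to the eigenvalue of A closest to the shift.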

Values for the size of the system n, the number of outer iterations niter, and the shift λ for the three different problem sizes are provided in the CG input parameter table below. The solution z to the linear system of equations Az = x is to be approximated using the conjugate gradient (CG) method. This method is to be implemented as follows:

   z = 0
   r = x
   ρ = r^T r
   p = r
   DO i = 1, 25
      q = Ap
      α = ρ/(p^T q)
      z = z + αp
      ρ₀ = ρ
      r = r - αq
      ρ = r^T r
      β = ρ/ρ₀
      p = r + βp
   ENDDO
   compute residual norm explicitly: ||r|| = ||x - Az||
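The CG pseudocode transcribes almost line for line. The sketch below is illustrative Python/NumPy (the benchmark must use Fortran or C and a sparse matrix), and the 25-iteration inner loop count is an assumption recovered from the garbled DO statement.

```python
import numpy as np

def cg_solve(A, x, n_iters=25):
    """Conjugate gradient iteration following the pseudocode above.
    A dense NumPy matrix stands in for the benchmark's sparse matrix."""
    z = np.zeros_like(x)
    r = x.copy()
    rho = r @ r                    # rho = r^T r
    p = r.copy()
    for _ in range(n_iters):
        q = A @ p
        alpha = rho / (p @ q)
        z = z + alpha * p
        rho0 = rho                 # save rho before updating it
        r = r - alpha * q
        rho = r @ r
        beta = rho / rho0
        p = r + beta * p
    return z, np.linalg.norm(x - A @ z)   # explicit residual ||x - Az||
```

For a small symmetric positive definite system, the explicitly computed residual after the full inner loop is at the level of machine rounding.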

Size       n        niter   NONZER   λ
Sample     1400     15      7        10
Class A    14000    15      11       20
Class B    75000    75      13       60

Table 8: Input parameters for CG benchmark

Verification Test

The program should print, at every outer iteration of the power method, the iteration number it, the eigenvalue estimate ζ, and the Euclidean norm ||r|| of the residual vector at the last CG iteration (the vector r in the discussion of CG above). For each problem size, the computed value of ζ must agree with the reference value ζ_REF within a relative tolerance of 10^-10, i.e., |ζ - ζ_REF| ≤ 10^-10 |ζ_REF|. These reference values ζ_REF are provided in the CG output parameter table below.

Timing

The rep orted time must b e the wallclo ck time required to compute all

niter iterations and print the results after the matrix is generated and down

loaded into the system and after the initialization of the starting vector x

Computed FLOP mem

Size nonzeros MW

RE F

Sample

Class A

Class B

Table Output parameters for CG b enchmark

It is permissible initially to reorganize the sparse matrix data structure (arow, acol, aelt), which is produced by the matrix generation routine described below, into a data structure better suited for the target machine. The original or the reorganized sparse matrix data structure can then be subsequently used in the conjugate gradient iteration. Time spent in the initial reorganization of the data structure will not be counted towards the benchmark time.

It is also permissible to use several different data structures for the matrix A, keep multiple copies of the matrix A, or write A to mass storage and read it back in. However, the time for any data movements which take place within the power iterations (outer iteration) or within the conjugate gradient iterations (inner iteration) must be included in the reported time.

However, the matrix A must be used explicitly. By saving the random sparse vectors x used in makea (see below), it is possible to reformulate the sparse matrix-vector multiply operation in such a way that communication is substantially reduced to only a few dense vectors, and sparse operations are restricted to the processing nodes. Although this scheme of matrix-vector multiplication technically satisfied the original rules of the CG benchmark, it defeats this benchmark's intended purpose of measuring random communication performance. Therefore this scheme is no longer allowed, and results employing implicit matrix-vector multiplication based on the outer product representation of the sparse matrix are no longer considered to be valid benchmark results.

Other Features

The input sparse matrix A is generated by a Fortran subroutine called makea, which is provided on the sample code disk described in the Sample Codes section of this document. In this program, the random number generator is initialized with a = 5^13 and s = 271828183. Then the subroutine makea is called to generate the matrix A. This program may not be changed. In the makea subroutine, the matrix A is represented by the following Fortran variables:

N (INTEGER): the number of rows and columns.

NZ (INTEGER): the number of nonzeros.

A (REAL): array of NZ nonzeros.

IA (INTEGER): array of NZ row indices. Element A(K) is in row IA(K) for all 1 ≤ K ≤ NZ.

JA (INTEGER): array of N+1 pointers to the beginnings of columns. Column J of the matrix is stored in positions JA(J) through JA(J+1)-1 of A and IA; JA(N+1) contains NZ+1.

The code generates the matrix as the weighted sum of N outer products of random sparse vectors x:

    A = \sum_{i=1}^{N} \omega_i \, x_i x_i^T

where the weights \omega_i are a geometric sequence with \omega_1 = 1 and the ratio chosen so that \omega_N = 0.1. The vectors x_i are chosen to have a few randomly placed nonzeros, each of which is a sample from the uniform distribution on (0, 1). Furthermore, the i-th element of x_i is set to 1/2 to insure that A cannot be structurally singular. Finally, 0.1 is added to the diagonal of A. This results in a matrix whose condition number (the ratio of its largest eigenvalue to its smallest) is roughly 10. The number of randomly chosen elements of x is provided for each problem size in the NONZER column of the CG parameter table, and the final number of nonzeros of A is listed there in the computed-nonzeros column. As implemented in the sample codes, the shift of the main diagonal of A is the final task in subroutine makea; values for the shift are also provided in that table.

The data structures used are these. First, a list of triples (arow, acol, aelt) is constructed. Each of these represents an element in row i (= arow), column j (= acol), with value a_{ij} (= aelt). When the arow and acol entries of two of these triples coincide, the values in their aelt fields are added together in creating a_{ij}. The process of assembling the matrix data structures from the list of triples, including the process of adding coincident entries, is done by the subroutine sparse, which is called by makea and is also provided. For examples and more details on this sparse data structure, consult the book by Duff, Erisman and Reid.
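As an illustration only (not part of the benchmark specification), the construction above can be sketched in dense form with NumPy; the function name and parameters here are hypothetical:

```python
import numpy as np

def make_matrix_sketch(n, nonzer, rng, ratio_n=0.1, shift=0.1):
    """Dense-form sketch of the makea construction: A is the weighted sum
    of n outer products of random sparse vectors, with omega_1 = 1 and the
    geometric ratio chosen so that omega_n = ratio_n; x_i[i] = 0.5 guards
    against structural singularity, and `shift` is added to the diagonal."""
    ratio = ratio_n ** (1.0 / (n - 1))
    a = np.zeros((n, n))
    omega = 1.0
    for i in range(n):
        x = np.zeros(n)
        # a few randomly placed nonzeros (coincident picks simply overwrite here)
        x[rng.integers(0, n, size=nonzer)] = rng.random(nonzer)
        x[i] = 0.5
        a += omega * np.outer(x, x)
        omega *= ratio
    return a + shift * np.eye(n)
```

In the benchmark itself the matrix is never formed densely: makea emits (arow, acol, aelt) triples, and the provided subroutine sparse assembles them, summing coincident entries.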

Kernel FT: a 3-D fast Fourier transform partial differential equation benchmark

D. Bailey and P. Frederickson

Brief Statement of Problem

Numerically solve a certain partial differential equation (PDE) using forward and inverse FFTs.

Detailed Description

Consider the PDE

    \partial u(x, t) / \partial t = \alpha \nabla^2 u(x, t)

where x is a position in three-dimensional space. When a Fourier transform is applied to each side, this equation becomes

    \partial v(z, t) / \partial t = -4 \alpha \pi^2 |z|^2 \, v(z, t)

where v(z, t) is the Fourier transform of u(x, t). This has the solution

    v(z, t) = e^{-4 \alpha \pi^2 |z|^2 t} \, v(z, 0)

Now consider the discrete version of the original PDE. Following the above steps, it can be solved by computing the forward 3-D discrete Fourier transform (DFT) of the original state array u(x), multiplying the results by certain exponentials, and then performing an inverse 3-D DFT. The forward

DFT and inverse DFT of the n1 x n2 x n3 array u are defined, respectively, as

    F(u)_{q,r,s} = \sum_{l=0}^{n3-1} \sum_{k=0}^{n2-1} \sum_{j=0}^{n1-1} u_{j,k,l} \, e^{-2\pi i\, jq/n1} \, e^{-2\pi i\, kr/n2} \, e^{-2\pi i\, ls/n3}

    F^{-1}(u)_{q,r,s} = \frac{1}{n1\, n2\, n3} \sum_{l=0}^{n3-1} \sum_{k=0}^{n2-1} \sum_{j=0}^{n1-1} u_{j,k,l} \, e^{2\pi i\, jq/n1} \, e^{2\pi i\, kr/n2} \, e^{2\pi i\, ls/n3}
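These happen to be the same conventions used by NumPy's FFT routines (negative exponent and no scaling forward; positive exponent with 1/(n1 n2 n3) scaling inverse), so a sketch of the pair is simply:

```python
import numpy as np

def forward_dft(u):
    # Negative exponent, unscaled: matches the definition of F above.
    return np.fft.fftn(u)

def inverse_dft(v):
    # Positive exponent with 1/(n1*n2*n3) scaling: matches F^{-1} above.
    return np.fft.ifftn(v)
```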

The specific problem to be solved for the Class A benchmark is as follows. Set n1 = 256, n2 = 256 and n3 = 128. Generate 2 n1 n2 n3 64-bit pseudorandom floating point values, using the pseudorandom number generator described later in this chapter, starting with the initial seed 314159265. Then fill the complex array U_{j,k,l}, 0 <= j < n1, 0 <= k < n2, 0 <= l < n3, with this data, where the first dimension varies most rapidly as in the ordering of a 3-D Fortran array. A single complex number entry of U consists of two consecutive pseudorandomly generated results. Compute the forward 3-D DFT of U, using a 3-D fast Fourier transform (FFT) routine, and call the result V. Set \alpha = 10^{-6} and set t = 1. Then compute

    W_{j,k,l} = e^{-4 \alpha \pi^2 (\bar{j}^2 + \bar{k}^2 + \bar{l}^2) t} \, V_{j,k,l}

where \bar{j} is defined as j for 0 <= j < n1/2 and j - n1 for n1/2 <= j < n1. The indices \bar{k} and \bar{l} are similarly defined with n2 and n3. Then compute an inverse 3-D DFT on W, using a 3-D FFT routine, and call the result the array X. Finally, compute the complex checksum \sum_{j=1}^{1024} X_{q,r,s}, where q = j (mod n1), r = 3j (mod n2) and s = 5j (mod n3). After the checksum for this t has been output, increment t by one. Then repeat the above process, from the computation of W through the incrementing of t, until the step t = N has been completed. In this benchmark N = 6. The V array and the array of exponential terms for t = 1 need only be computed once. Note that the array of exponential terms for t > 1 can be obtained as the t-th power of the array for t = 1.

This completes the definition of the Class A problem. The Class B problem is the same except that n1 = 512, n2 = 256, n3 = 256 and N = 20.
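The overall procedure has the following shape. This sketch uses NumPy's RNG and FFTs at an illustrative (much smaller) size, so its checksums will not match the published reference values; only the structure (one forward FFT, then per-step evolution, inverse FFT and checksum) follows the definition above:

```python
import numpy as np

def ft_benchmark_sketch(n1=64, n2=64, n3=32, niter=6, alpha=1e-6):
    """Structural sketch of the FT benchmark (sizes and RNG are NOT the
    specified ones, so checksums will not match the reference values)."""
    rng = np.random.default_rng(314159265)
    # store as (n3, n2, n1) so the first (Fortran) dimension is contiguous last
    u = rng.random((n3, n2, n1)) + 1j * rng.random((n3, n2, n1))
    v = np.fft.fftn(u)

    # evolution factors for t = 1, with wavenumbers folded so that
    # jbar = j for j < n1/2 and j - n1 otherwise
    jbar = np.fft.fftfreq(n1) * n1
    kbar = np.fft.fftfreq(n2) * n2
    lbar = np.fft.fftfreq(n3) * n3
    e1 = np.exp(-4 * alpha * np.pi ** 2
                * (lbar[:, None, None] ** 2 + kbar[None, :, None] ** 2
                   + jbar[None, None, :] ** 2))

    checksums = []
    for t in range(1, niter + 1):
        w = (e1 ** t) * v                  # t-th power of the t = 1 factor array
        x = np.fft.ifftn(w)
        j = np.arange(1, 1025)
        q, r, s = j % n1, (3 * j) % n2, (5 * j) % n3
        checksums.append(x[s, r, q].sum())  # sum of X(q,r,s); array is (l,k,j)
    return checksums
```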

Any algorithm may be used for the computation of the 3-D FFTs mentioned above. One algorithm is the following. Assume that the data in the input n1 x n2 x n3 complex array A are organized so that, for each k and l, all elements of the complex vector (A_{j,k,l}, 0 <= j < n1) are contained within a single processing node. First perform an n1-point 1-D FFT on each of these n2 n3 complex vectors. Then transpose the resulting array into an n2 x n3 x n1 complex array B. Next, perform an n2-point 1-D FFT on each of the n3 n1 first-dimension complex vectors of B. Again note that each of the 1-D FFTs can be performed locally within a single node. Then transpose the resulting array into an n3 x n1 x n2 complex array C. Finally, perform an n3-point 1-D FFT on each of the n1 n2 first-dimension complex vectors of C. Then transpose the resulting array into an n1 x n2 x n3 complex array D. This array D is the final 3-D FFT result.
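Under the assumption that the array axes are ordered (n1, n2, n3), this A -> B -> C -> D sequence can be sketched as three identical rounds; after the third cyclic transpose the array is back in its original orientation:

```python
import numpy as np

def fft3d_by_transposes(a):
    """3-D FFT as three rounds of first-axis 1-D FFTs plus a cyclic
    transpose, following the A -> B -> C -> D sequence described above."""
    for _ in range(3):
        a = np.fft.fft(a, axis=0)        # first-dimension 1-D FFTs, node-local
        a = np.transpose(a, (1, 2, 0))   # (n1,n2,n3) -> (n2,n3,n1), etc.
    return a
```

On a parallel machine the transposes are where the interprocessor communication happens; the 1-D FFTs themselves stay local to a node.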

Algorithms for performing an individual 1-D complex-to-complex FFT are well known and will not be presented here. Readers are referred to the FFT references at the end of this chapter for details. It should be noted that some of these FFTs are "unordered" FFTs, i.e., the results are not in the correct order but instead are scrambled by a bit-reversal permutation. Such FFTs may be employed if desired, but it should be noted that in this case the ordering of the exponential factors in the definition of W_{j,k,l} above must be similarly scrambled in order to obtain the correct results. Also, the final result array X may be scrambled, in which case the checksum calculation will have to be changed accordingly.

It should be noted that individual 1-D FFTs, array transpositions, and even entire 3-D FFT operations may be performed using vendor-supplied library routines. See the general rules earlier in this report for details.

Operations to be Timed

All of the above operations, including the checksum calculations, must be timed.

Verification Test

The N complex checksums must agree with reference values to within one part in 10^12. For the parameter sizes specified above, the reference values are given in the FT benchmark checksum table.

Other Features

- 3-D FFTs are a key part of certain CFD applications, notably large eddy turbulence simulations.

- The 3-D FFT steps require considerable communication for operations such as array transpositions.

Kernel IS: parallel sort over small integers

L. Dagum

Brief Statement of Problem

Sort N keys in parallel. The keys are generated by the sequential key generation algorithm given below and initially must be uniformly distributed in memory. The initial distribution of the keys can have a great impact on the performance of this benchmark, and the required distribution is discussed in detail below.

Table: FT Benchmark Checksums (columns t, Real Part and Imaginary Part, for Class A and Class B; the numeric reference values are not reproduced here).

Definitions

A sequence of keys {K_i | i = 0, 1, ..., N-1} will be said to be sorted if it is arranged in non-descending order, i.e., K_i <= K_{i+1} <= K_{i+2} .... The rank of a particular key in a sequence is the index value i that the key would have if the sequence of keys were sorted. Ranking, then, is the process of arriving at a rank for all the keys in a sequence. Sorting is the process of permuting the keys in a sequence to produce a sorted sequence. If an initially unsorted sequence K_0, K_1, ..., K_{N-1} has ranks r(0), r(1), ..., r(N-1), the sequence becomes sorted when it is rearranged in the order K_{r(0)}, K_{r(1)}, ..., K_{r(N-1)}. Sorting is said to be stable if equal keys retain their original relative order. In other words, a sort is stable only if r(i) < r(j) whenever K_{r(i)} = K_{r(j)} and i < j. Stable sorting is not required for this benchmark.
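A minimal illustration of ranks versus sorting (pure Python; the tie-breaking by original index used here happens to be stable, which the benchmark does not require):

```python
def compute_ranks(keys):
    """Rank of each key = the index it would occupy in the sorted sequence.
    Ties are broken by original position, so equal keys get distinct ranks."""
    order = sorted(range(len(keys)), key=lambda i: (keys[i], i))
    ranks = [0] * len(keys)
    for position, i in enumerate(order):
        ranks[i] = position
    return ranks
```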

Memory Mapping

The benchmark requires ranking an unsorted sequence of N keys. The initial sequence of keys will be generated in an unambiguous sequential manner as described below. This sequence must be mapped into the memory of the parallel processor in one of the following ways, depending on the type of memory system. In all cases, one key will map to one word of memory. Word size must be no less than 32 bits. Once the keys are loaded onto the memory system, they are not to be moved or modified except as required by the procedure described in the Procedure subsection.

Shared Global Memory

All N keys initially must be stored in a contiguous address space. If A_i is used to denote the address of the i-th word of memory, then the address space must be [A_i, A_{i+N-1}]. The sequence of keys K_0, K_1, ..., K_{N-1} initially must map to this address space as

    A_{i+j} <- MEM(K_j)   for j = 0, 1, ..., N-1

where MEM(K_j) refers to the address of K_j.

Distributed Memory

In a distributed memory system with p distinct memory units, each memory unit initially must store N_p keys in a contiguous address space, where

    N_p = N / p

If A_i is used to denote the address of the i-th word in a memory unit, and if P_j is used to denote the j-th memory unit, then P_j{A_i} will denote the address of the i-th word in the j-th memory unit. Some initial addressing (or ordering) of the memory units must be assumed and adhered to throughout the benchmark; note that the addressing of the memory units is otherwise left completely arbitrary. If N is not evenly divisible by p, then memory units {P_j | j = 0, 1, ..., p-2} will store N_p keys and memory unit P_{p-1} will store N_{p-1} keys, where now

    N_p = floor(N / p)
    N_{p-1} = N - (p - 1) N_p

In some cases (in particular, if p is large) this mapping may result in a poor initial load balance with N_{p-1} >> N_p. In such cases it may be desirable to use p' memory units to store the keys, where p' < p. This is allowed, but the storage of the keys still must follow either the equal-division equation or the pair of equations above, with p' replacing p. In the following we will assume N is evenly divisible by p. The address space in an individual memory unit must be [A_i, A_{i+N_p-1}].

If memory units are individually hierarchical, then the N_p keys must be stored in a contiguous address space belonging to a single memory hierarchy, and A_i then denotes the address of the i-th word in that hierarchy. The keys cannot be distributed among different memory hierarchies until after timing begins. The sequence of keys K_0, K_1, ..., K_{N-1} initially must map to this distributed memory as

    P_k{A_{i+j}} <- MEM(K_{k N_p + j})   for j = 0, 1, ..., N_p - 1  and  k = 0, 1, ..., p - 1

where MEM(K_{k N_p + j}) refers to the address of K_{k N_p + j}. If N is not evenly divisible by p, then the mapping given above must be modified for the case where k = p - 1 as

    P_{p-1}{A_{i+j}} <- MEM(K_{(p-1) N_p + j})   for j = 0, 1, ..., N_{p-1} - 1

Hierarchical Memory

All N keys initially must be stored in an address space belonging to a single memory hierarchy, which will here be referred to as the main memory. Note that any memory in the hierarchy which can store all N keys may be used for the initial storage of the keys, and the use of the term "main memory" in the description of this benchmark should not be confused with the more general definition of this term used elsewhere in this report. The keys cannot be distributed among different memory hierarchies until after timing begins. The mapping of the keys to the main memory must follow one of either the shared global memory or the distributed memory mappings described above.

The benchmark requires computing the rank of each key in the sequence. The mappings described above define the initial ordering of the keys. For shared global and hierarchical memory systems, the same mapping must be applied to determine the correct ranking. For the case of a distributed memory system, it is permissible for the mapping of keys to memory at the end of the ranking to differ from the initial mapping only in the following manner: the number of keys mapped to a memory unit at the end of the ranking may differ from the initial value N_p. It is expected in a distributed memory machine that good load balancing of the problem will require changing the initial mapping of the keys, and for this reason a different mapping may be used at the end of the ranking. If N_{p_k} is the number of keys in memory unit P_k at the end of the ranking, then the mapping which must be used to determine the correct ranking is given by

    P_k{A_{i+j}} <- MEM(r(k N_p + j))   for j = 0, 1, ..., N_{p_k} - 1  and  k = 0, 1, ..., p - 1

where r(k N_p + j) refers to the rank of key K_{k N_p + j}. Note, however, this does not imply that the keys, once loaded into memory, may be moved. Copies of the keys may be made and moved, but the original sequence must remain intact, such that each time the ranking process is repeated (Step 4 of the Procedure) the original sequence of keys exists (except for the two modifications of Step 4a) and the same algorithm for ranking is applied. Specifically, knowledge obtainable from the communications pattern carried out in the first ranking cannot be used to speed up subsequent rankings, and each iteration of Step 4 should be completely independent of the previous iteration.

Key Generation Algorithm

The algorithm for generating the keys makes use of the pseudorandom number generator described later in this chapter. The keys will be in the range [0, B_max). Let r_f be a random fraction uniformly distributed in the range 0 < r_f < 1, and let K_i be the i-th key. The value of K_i is determined as

    K_i <- floor( (B_max / 4) (r_{4i} + r_{4i+1} + r_{4i+2} + r_{4i+3}) )   for i = 0, 1, ..., N-1

Note that K_i must be an integer and floor(.) indicates truncation. Four consecutive pseudorandom numbers from the pseudorandom number generator must be used for generating each key. All operations before the truncation must be performed in 64-bit double precision. The random number generator must be initialized with s = 314159265 as a starting seed.

Table: Values to be used for partial verification (pairs of rank indices and reference rank values, one set for the Full Scale benchmark and one for the Sample Code; the numeric entries are not reproduced here).

Partial Verification Test

Partial verification is conducted for each ranking performed. Partial verification consists of comparing a particular subset of ranks with the reference values. The subset of ranks and the reference values are given in the partial verification table. Note that the subset of ranks is selected to be invariant to the ranking algorithm (recall that stability is not required in the benchmark). This is accomplished by selecting for verification only the ranks of unique keys. If a key is unique in the sequence (i.e., there is no other equal key), then it will have a unique rank despite an unstable ranking algorithm. The memory mapping described in the Memory Mapping subsection must be applied.

Full Verification Test

Full verification is conducted after the last ranking is performed. Full verification requires the following:

1. Rearrange the sequence of keys {K_i | i = 0, 1, ..., N-1} in the order {K_j | j = r(0), r(1), ..., r(N-1)}, where r(0), r(1), ..., r(N-1) is the last computed sequence of ranks.

2. For every K_i, from i = 0, ..., N-2, test that K_i <= K_{i+1}.

If the result of this test is true, then the keys are in sorted order. The memory mapping described in the Memory Mapping subsection must be applied.
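As a sketch, the two full-verification steps amount to:

```python
def full_verification(keys, ranks):
    """Place each key at its computed rank, then confirm non-descending
    order; returns True exactly when the ranking was consistent."""
    rearranged = [None] * len(keys)
    for i, key in enumerate(keys):
        rearranged[ranks[i]] = key
    return all(rearranged[i] <= rearranged[i + 1]
               for i in range(len(rearranged) - 1))
```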

Procedure

1. In a scalar, sequential manner, and using the key generation algorithm described above, generate the sequence of N keys.

2. Using the appropriate memory mapping described above, load the N keys into the memory system.

3. Begin timing.

4. Do, for i = 1 to I_max:

   (a) Modify the sequence of keys by making the following two changes: K_i <- i and K_{i+I_max} <- (B_max - i).

   (b) Compute the rank of each key.

   (c) Perform the partial verification test described above.

5. End timing.

6. Perform the full verification test described above.

Specifications

The specifications given in the parameter table below shall be used in the benchmark. Two sets of values are given, one for Class A and one for Class B.

For partial verification, the reference values given in the partial verification table are to be used. In that table, r(j) refers to the rank of K_j and i is the iteration of Step 4 of the Procedure. Again, two sets of values are given, the Full Scale set being for the actual benchmark and the Sample Code set being for development purposes. It should be emphasized that the benchmark measures the performance based on use of the Full Scale values, and the Sample Code values are given only as a convenience to the implementor.

Table: Parameter values to be used for the benchmark

    Parameter   Class A      Class B
    N           2^23         2^25
    B_max       2^19         2^21
    seed        314159265    314159265
    I_max       10           10

Also to be supplied to the implementor is Fortran source code for the sequential implementation of the benchmark, using the Sample Code values and with partial and full verification tests.

A Pseudorandom Number Generator for the Parallel NAS Kernels

D. Bailey

Suppose that n uniform pseudorandom numbers are to be generated. Set a = 5^13 and let x_0 = s be a specified initial seed, i.e., an integer in the range 0 < s < 2^46. Generate the integers x_k for 1 <= k <= n using the linear congruential recursion

    x_{k+1} = a x_k  (mod 2^46)

and return r_k = 2^{-46} x_k as the results. Thus 0 < r_k < 1, and the r_k are very nearly uniformly distributed on the unit interval. See the reference by Knuth for further discussion of this type of pseudorandom number generator.
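In Python the recursion is a few lines, since exact integer arithmetic is built in (the floating point splitting scheme described later in this section is only needed on machines restricted to 64-bit floating point):

```python
class LCG46:
    """x_{k+1} = a * x_k mod 2^46 with a = 5^13; returns r_k = 2^-46 * x_k."""
    A = 5 ** 13
    MOD = 2 ** 46

    def __init__(self, seed=314159265):
        self.x = seed

    def next_fraction(self):
        self.x = (self.A * self.x) % self.MOD
        return self.x / float(self.MOD)
```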

Note that any particular value x_k of the sequence can be computed directly from the initial seed s by using the binary algorithm for exponentiation, taking remainders modulo 2^46 after each multiplication. To be specific, let m be the smallest integer such that 2^m > k, set b = s and t = a. Then repeat the following for i from 1 to m:

    j <- floor(k / 2)
    b <- b t  (mod 2^46)   if 2j != k
    t <- t^2  (mod 2^46)
    k <- j

The final value of b is x_k = a^k s (mod 2^46). See the reference by Knuth for further discussion of the binary algorithm for exponentiation.
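An equivalent bit-by-bit form of that loop (scanning the binary digits of k from the low end; the function name is illustrative) is:

```python
def nth_seed(k, s=314159265, a=5 ** 13, mod=2 ** 46):
    """Compute x_k = a^k * s mod 2^46 by binary exponentiation, so a
    processor can jump straight to its own segment of the sequence."""
    b, t = s, a
    while k > 0:
        if k % 2 == 1:          # the "2j != k" case in the loop above
            b = (b * t) % mod
        t = (t * t) % mod
        k //= 2
    return b
```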

The operation of multiplying two large integers modulo 2^46 can be implemented using 64-bit floating point arithmetic by splitting the arguments into two words with 23 bits each. To be specific, suppose one wishes to compute c = ab (mod 2^46). Then perform the following steps, where int(.) denotes the greatest integer:

    a1 <- int(2^{-23} a)
    a2 <- a - 2^{23} a1
    b1 <- int(2^{-23} b)
    b2 <- b - 2^{23} b1
    t1 <- a1 b2 + a2 b1
    t2 <- int(2^{-23} t1)
    t3 <- t1 - 2^{23} t2
    t4 <- 2^{23} t3 + a2 b2
    t5 <- int(2^{-46} t4)
    c  <- t4 - 2^{46} t5

An implementation of the complete pseudorandom number generator algorithm using this scheme produces the same sequence of results on any system that satisfies the following requirements:

1. The input multiplier a and the initial seed s, as well as the constants 2^23, 2^{-23}, 2^46 and 2^{-46}, can be represented exactly as 64-bit floating point constants.

2. The truncation of a nonnegative 64-bit floating point value less than 2^46 is exact.

3. The addition, subtraction and multiplication of 64-bit floating point values, where the arguments and results are nonnegative whole numbers less than 2^46, produce exact results.

4. The multiplication of a 64-bit floating point value, which is a nonnegative whole number less than 2^46, by the 64-bit floating point value 2^{-m}, 0 < m <= 46, produces an exact result.

These requirements are met by virtually all scientific computers in use today. Any system based on the IEEE 754 floating point arithmetic standard easily meets these requirements using double precision. However, it should be noted that obtaining an exact power of two constant on some systems requires a loop rather than merely an assignment statement with an exponentiation.

Other Features

- The period of this pseudorandom number generator is 2^44, approximately 1.76 x 10^13, and it passes all reasonable statistical tests.

- This calculation can be vectorized on vector computers by generating results in batches of size equal to the hardware vector length.

- By using the scheme described above for computing x_k directly, the starting seed of a particular segment of the sequence can be quickly and independently determined. Thus numerous separate segments can be generated on separate processors of a multiprocessor system.

- Once the IEEE 754 floating point arithmetic standard gains universal acceptance among scientific computers, the modulus 2^46 can be safely increased to 2^52, although the scheme described above for multiplying two such numbers must be correspondingly changed. This will increase the period of the pseudorandom sequence by a factor of 64, to approximately 10^15.

References

[1] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard 754-1985, IEEE, New York, 1985.

[2] Knuth, D. E.: The Art of Computer Programming, Vol. 2, Addison-Wesley, 1981.

[3] Duff, I. S., Erisman, A. M., and Reid, J. K.: Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.

[4] Agarwal, R. C., and Cooley, J. W.: Fourier Transform and Convolution Subroutines for the IBM 3090 Vector Facility, IBM Journal of Research and Development, 1986.

[5] Bailey, D. H.: A High-Performance FFT Algorithm for Vector Supercomputers, International Journal of Supercomputer Applications, vol. 2, 1988.

[6] Pease, M. C.: An Adaptation of the Fast Fourier Transform for Parallel Processing, Journal of the ACM, 1968.

[7] Swarztrauber, P. N.: FFT Algorithms for Vector Computers, Parallel Computing, vol. 1, 1984.

[8] Swarztrauber, P. N.: Multiprocessor FFTs, Parallel Computing, vol. 5, 1987.

A METHODOLOGY FOR BENCHMARKING SOME CFD KERNELS ON HIGHLY PARALLEL PROCESSORS

Sisira Weeratunga, Eric Barszcz, Rod Fatoohi and V. Venkatakrishnan

Summary

A collection of iterative PDE solvers embedded in a pseudo application program is proposed for the performance evaluation of CFD codes on highly parallel processors. The pseudo application program is stripped of complexities associated with real CFD application programs, thereby enabling a simpler description of the algorithms. However, it is capable of reproducing the essential computation and data motion characteristics of large scale, state of the art CFD codes. In this chapter we present a detailed description of the pseudo application program concept. Preceding chapters address our basic approach towards the performance evaluation of parallel supercomputers targeted for use in numerical aerodynamic simulation.

Introduction

Computational Fluid Dynamics (CFD) is one of the fields in the area of scientific computing that has driven the development of modern vector supercomputers. Availability of these high performance computers has led to impressive advances in the state of the art of CFD, both in terms of the physical complexity of the simulated problems and the development of computational algorithms capable of extracting high levels of sustained performance. However, to carry out the computational simulations essential for future aerospace research, CFD must be able and ready to exploit the potential performance and cost-performance gains possible through the use of highly parallel processing technologies. Use of parallel supercomputers appears to be one of the most promising avenues for realizing large, complex physical simulations within realistic time and cost constraints. Although many of the current CFD application programs are amenable to a high degree of parallel computation, performance data on such codes for the current generation of parallel computers often has been less than remarkable. This is especially true for the class of CFD algorithms involving global data dependencies, commonly referred to as the implicit methods. Often the bottleneck is data motion, due to high latencies and inadequate bandwidth.

(The authors of this chapter are with Computer Sciences Corp., Ames Research Center.)

It is a common practice among computer hardware designers to use the dense linear equation solution subroutine in LINPACK to represent the scientific computing workload. Unfortunately, the computational structures in most CFD algorithms bear little resemblance to this LINPACK routine, both in terms of parallelization strategy and in terms of floating point and memory reference features. Most CFD application codes are characterized by their use of either regular or irregular sparse data structures and associated algorithms. One of the reasons for this situation is the near absence of communication between the computer scientists engaged in the design of high performance parallel computers and the computational scientists involved in the development of CFD applications. To be effective, such exchange of information should occur during the early stages of the design process. It appears that one of the contributing factors to this lack of communication is the complexity and confidentiality associated with state-of-the-art CFD application codes. One way to help the design process is to provide the computer scientists with synthetic CFD application programs which lack the complexity of a real application but at the same time retain all the essential computational structures. Such synthetic application codes can be accompanied by detailed and simpler descriptions of the algorithms involved. In return, the performance data on such synthetic application codes can be used by the CFD community to evaluate different parallel supercomputer systems at the procurement stage.

Computational fluid dynamics involves the numerical solution of a system of nonlinear partial differential equations in two or three spatial dimensions, with or without time dependence. The governing partial differential equations, referred to as the Navier-Stokes equations, represent the laws of conservation of mass, momentum and energy applied to a fluid medium in motion. These equations, when supplemented by appropriate boundary and initial conditions, describe a particular physical problem. To obtain a system of equations amenable to solution on a computer requires the discretization of the differential equations through the use of finite difference, finite volume, finite element or spectral methods. The inherent nonlinearities of the governing equations necessitate the use of iterative solution techniques. Over the past years, a variety of efficient numerical algorithms have been developed, all requiring many floating point operations and large amounts of computer memory to achieve a solution with the desired level of accuracy.

In current CFD applications there are two types of computational meshes used for the spatial discretization process: structured and unstructured. Structured meshes are characterized by a consistent, logical ordering of mesh points, whose connectivity is associated with a rectilinear coordinate system. Computationally, structured meshes give rise to regularly strided memory reference characteristics. In contrast, unstructured meshes offer greater freedom in terms of mesh point distribution, but require the generation and storage of random connectivity information. Computationally, this results in indirect memory addressing with random strides, with its attendant increase in memory bandwidth requirements. The synthetic application codes currently under consideration are restricted to the case of structured meshes.

The numerical solution algorithms used in CFD codes can be broadly categorized as either explicit or implicit, based on the procedure used for the time domain integration. Among the advantages of the explicit schemes are the high degree of easily exploitable parallelism and the localized spatial data dependencies. These properties have resulted in highly efficient implementations of explicit CFD algorithms on a variety of current generation highly parallel processors. However, the explicit schemes suffer from stringent numerical stability bounds and, as a result, are not optimal for problems that require fine mesh spacing for numerical resolution. In contrast, implicit schemes have less stringent stability bounds and are suitable for problems involving highly stretched meshes. However, their parallel implementation is more difficult and involves local as well as global spatial data dependencies. In addition, some of the implicit algorithms possess limited degrees of exploitable parallelism. At present, we restrict our synthetic applications to three different representative implicit schemes found in a wide spectrum of production CFD codes in use at NASA Ames Research Center.

In the remaining sections of this chapter we describe the development of a collection of synthetic application programs. First we discuss the rationale behind this approach, followed by a complete description of three such synthetic applications. We also outline the problem setup, along with the associated verification tests, when they are used to benchmark highly parallel systems.

Rationale

In the past, vector supercomputer performance was evaluated through the use of suites of kernels chosen to characterize generic computational structures present in a site's workload. For example, the NAS Kernels were selected to characterize the computational workloads inherent in a majority of the algorithms used by the CFD community at Ames Research Center. However, for highly parallel computer systems this approach is inadequate, for the reasons outlined below.

The first stage of the pseudo application development process was the analysis of a variety of implicit CFD codes and the identification of a set of generic computational structures that represented a range of computational tasks embedded in them. As a result, the following computational kernels were selected:

1. Solution of multiple, independent systems of non-diagonally-dominant block tridiagonal equations with a 5 x 5 block size.

2. Solution of multiple, independent systems of non-diagonally-dominant scalar pentadiagonal equations.

3. Regular-sparse block matrix-vector multiplication.

4. Regular-sparse block lower and upper triangular system solution.
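For orientation, the scalar analogue of the first two kernels is the classic banded-system recurrence. This sketch solves one scalar tridiagonal system by the Thomas algorithm; the benchmark kernels instead solve many independent systems (with 5 x 5 blocks in the tridiagonal case), and the sequential recurrence along each line is exactly what makes their parallelization nontrivial:

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: forward elimination then back substitution.
    lower[0] and upper[-1] are unused; all lists have length n."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                      # forward sweep (sequential recurrence)
        m = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / m if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / m
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```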

These kernels constitute a majority of the computationally intensive main building blocks of the CFD programs designed for the numerical solution of three-dimensional (3-D) Euler/Navier-Stokes equations using finite-volume or finite-difference discretization on structured grids. Kernels 1 and 2 are representative of the computations associated with the implicit operator in versions of the ARC3D code. These kernels involve global data dependencies. Although they are similar in many respects, there is a fundamental difference with regard to the communication-to-computation ratio. Kernel 3 typifies the computation of the explicit part of almost all CFD algorithms for structured grids. Here all data dependencies are local, with either nearest neighbor or, at most, next-to-nearest neighbor type dependencies. Kernel 4 represents the computations associated with the implicit operator of a newer class of implicit CFD algorithms, typified by the code INS3D-LU. This kernel may contain only a limited degree of parallelism relative to the other kernels.

In terms of their parallel implementation, these kernels represent varying characteristics with regard to the following, often related, aspects:

- Available degree of parallelism

- Level of parallelism and granularity

- Data space partitioning strategies

- Global vs. local data dependencies

- Interprocessor and in-processor data motion requirements

- Ratio of communication to computation

Previous research efforts in adapting the algorithms in a variety of flow solvers to the current generation of highly parallel processors have indicated that the overall performance of many CFD codes is critically dependent on the latency and bandwidth of both the in-processor and interprocessor data motion. Therefore, it is important for the integrity of the benchmarking process to faithfully reproduce a majority of the data motions encountered during the execution of the applications in which these kernels are embedded. Also, the nature and amount of data motion depend on the kernel algorithms, on the associated data structures, and on the interaction of these kernels among themselves as well as with the remainder of the application that is outside their scope.

To obtain realistic performance data, the specification of both the incoming and outgoing data structures of the kernels should mimic those occurring in an application program. The incoming data structure is dependent on the section of the code where the data is generated, not on the kernel. The optimum data structure for the kernel may turn out to be suboptimal for the code segments where the data is generated, and vice versa. Similar considerations also apply to the outgoing data structure. Allowing the freedom to choose optimal incoming and outgoing data structures for the kernel as a basis for evaluating its performance is liable to produce results that are not applicable to a complete application code. The overall performance should reflect the cost of the data motion that occurs between kernels.

In order to reproduce most of the data motions encountered in the execution of these kernels in a typical CFD application, we propose embedding them in a pseudo application code. It is designed for the numerical solution of a synthetic system of nonlinear Partial Differential Equations (PDEs), using iterative techniques similar to those found in CFD applications of interest to NASA Ames Research Center. However, it contains none of the pre- and post-processing required by the full CFD applications, nor the interactions of the processors and the I/O subsystem. It can be regarded as a stripped-down version of a CFD application. It retains the basic kernels that are the principal building blocks of the application and admits a majority of the interactions required between these basic routines. Also, the stripped-down version does not represent a fully configured CFD application in terms of system memory requirements. This fact has the potential for creating data partitioning strategies during the parallel implementation of the synthetic problem that may be inappropriate for the full application.

From the point of view of functionality, the stripped-down version does
not contain the algorithms used to apply boundary conditions, as in a real
application. It is well known that the boundary algorithms often give rise to
load imbalances and idling of processors in highly parallel systems. Due to the
simplification of the boundary algorithms, the overall system
performance and efficiency data obtained using the stripped-down version
are likely to be higher than those of an actual application. This effect is somewhat
mitigated by the fact that, for most realistic problems, a relatively small
amount of time is spent dealing with boundary algorithms when compared to the
time spent dealing with the internal mesh points. Also, most boundary
algorithms involve only local data dependencies.

Some of the other advantages of the stripped-down application vs. the full
application approach are:

- Allows benchmarking where real application codes are confidential

- Easier to manipulate and port from one system to another

- Since only the abstract algorithm is specified, facilitates new implementations
that are tied closely to the architecture under consideration

- Allows easy addition of other existing and emerging CFD algorithms
to the benchmarking process

- Easily scalable to larger problem sizes

It should be noted that this synthetic problem differs from a real CFD
problem in the following important aspects:

- In full CFD application codes, a non-orthogonal coordinate transformation
is used to map the complex physical domains to regular
computational domains, thereby introducing metric coefficients of
the transformation into the governing PDEs and boundary conditions.
Such transformations are absent in the synthetic problem, which as a
result may have reduced arithmetic complexity and storage requirements.

- A blend of nonlinear second- and fourth-difference artificial dissipation
terms, whose coefficients are determined based on the local changes in
pressure, is used in most of the actual CFD codes. In the stripped-down
version, only a linear fourth-difference term is used. This reduces
the arithmetic and communication complexity needed to compute the
added higher-order dissipation terms. However, it should be noted that
the computation of these artificial dissipation terms involves only local data
dependencies, similar to the matrix-vector multiplication kernel.

- In codes where artificial dissipation is not used, upwind differencing
based on either flux-vector splitting, flux-difference splitting, or
Total Variation Diminishing (TVD) schemes is used instead. The absence of
such differencing schemes in the stripped-down version induces comparable
effects on the performance data.

- Turbulence models are absent. The computation of terms representing
some turbulence models involves a combination of local and some long-range
data dependencies; the arithmetic and communication complexity
associated with turbulence models is therefore also absent.

In addition, it needs to be emphasized that the stripped-down problem
is neither designed for, nor suitable for, the purpose of evaluating the convergence
rates and/or the applicability of various iterative linear-system solvers
used in computational fluid dynamics applications. As mentioned before,
the synthetic problem differs from real CFD applications in the following
important ways:

- Absence of realistic boundary algorithms

- Higher than normal dissipative effects

- Lack of upwind differencing effects based on either flux-vector splitting
or TVD schemes

- Absence of the geometric stiffness introduced through boundary-conforming
coordinate transformations and highly stretched meshes

- Lack of evolution of weak (i.e., C^0) solutions, found in real CFD
applications, during the iterative process

- Absence of turbulence modelling effects

Some of these effects tend to suppress the predominantly hyperbolic nature
exhibited by the Navier-Stokes equations when applied to compressible
flows at high Reynolds numbers.

Mathematical Problem Definition

We consider the numerical solution of the following synthetic system of five
nonlinear partial differential equations (PDEs):

  ∂U/∂τ = ∂E(U)/∂ξ + ∂F(U)/∂η + ∂G(U)/∂ζ
          + ∂T(U, U_ξ)/∂ξ + ∂V(U, U_η)/∂η + ∂W(U, U_ζ)/∂ζ
          + H(U, U_ξ, U_η, U_ζ),   for (ξ, η, ζ) ∈ D, τ ∈ (0, T],

with the boundary conditions

  B(U) = U_B(τ, ξ, η, ζ)   for (ξ, η, ζ) ∈ ∂D, τ ∈ (0, T],

and initial conditions

  U = U_0(ξ, η, ζ)   in D, for τ = 0,

where D is a bounded domain, ∂D is its boundary, and τ ∈ {τ : 0 < τ ≤ T}.

Also, the solution to the system of PDEs,

  U = ( u^(1), u^(2), u^(3), u^(4), u^(5) )^T,

defined on D × (0, T] and ∂D × (0, T], is a vector function of the temporal variable τ and
of the spatial variables (ξ, η, ζ) that form an orthogonal coordinate system,
i.e.,

  u^(m) = u^(m)(τ, ξ, η, ζ),   m = 1, 2, ..., 5.

The vector functions U_B and U_0 are given, and B is the boundary operator.
E, F, G, T, V, W, and H are vector functions with five components
each, of the form

  E = ( e^(1), e^(2), e^(3), e^(4), e^(5) )^T,

where e^(m) = e^(m)(U), etc., are prescribed functions.

The system given above is in the normal form,
i.e., it gives explicitly the time derivatives of all the dependent variables
u^(1), ..., u^(5). Consequently, the Cauchy data at τ = 0 given by the initial conditions
permits the calculation of the solution U for τ > 0.

In the current implementation of the synthetic-PDE-system solver, we
seek a steady-state solution of the governing equations, of the form

  U* = ( f^(1), f^(2), f^(3), f^(4), f^(5) )^T,

where f^(1), ..., f^(5) are prescribed functions of (ξ, η, ζ), each given as a
linear combination of the elements of a vector e:

  f^(m) = Σ_n C_{m,n} e_n,   m = 1, ..., 5.

Here the vector e is given by

  e = ( 1, ξ, η, ζ, ξ^2, η^2, ζ^2, ξ^3, η^3, ζ^3, ξ^4, η^4, ζ^4 )^T,

and the C_{m,n} (m = 1, ..., 5; n = 1, ..., 13) are specified constants. The vector
forcing function H* = ( h^(1), h^(2), h^(3), h^(4), h^(5) )^T, where h^(m) = h^(m)(ξ, η, ζ),
is chosen such that the system of PDEs, along with the boundary and initial
conditions, with H = H* satisfies the prescribed exact solution U*. This implies

  H* = - [ ∂E(U*)/∂ξ + ∂F(U*)/∂η + ∂G(U*)/∂ζ
         + ∂T(U*, U*_ξ)/∂ξ + ∂V(U*, U*_η)/∂η + ∂W(U*, U*_ζ)/∂ζ ]

for (ξ, η, ζ) ∈ D ∪ ∂D.

The bounded spatial domain D is specified to be the interior of the unit
cube, i.e.,

  D = { (ξ, η, ζ) : 0 < ξ < 1, 0 < η < 1, 0 < ζ < 1 },

and its boundary ∂D is the surface of the unit cube, given by

  ∂D = { ξ = 0 or ξ = 1 } ∪ { η = 0 or η = 1 } ∪ { ζ = 0 or ζ = 1 }.

The vector functions E F G T V and W of the synthetic problem are

sp ecied to b e the following

u

B C

B C

B C

B C

u u

B C

B C

B C

B C

B C

E u u u

B C

B C

B C

B C

B C

u u u

B C

B C

A

u u u

u

B C

B C

B C

B C

u u u

B C

B C

B C

B C

B C

F u u

B C

B C

B C

B C

B C

u u u

B C

B C

A

u u u

u

C B

C B

C B

C B

u u u

C B

C B

C B

C B

C B

u u u G

C B

C B

C B

C B

C B

u u

C B

C B

A

u u u

where

u u u

g k fu

u

Also,

  T(U, U_ξ) = ( d^ξ_1 ∂u^(1)/∂ξ,
                d^ξ_2 ∂u^(2)/∂ξ + k_3 k_4 ∂(u^(2)/u^(1))/∂ξ,
                d^ξ_3 ∂u^(3)/∂ξ + k_3 k_4 ∂(u^(3)/u^(1))/∂ξ,
                d^ξ_4 ∂u^(4)/∂ξ + k_3 k_4 ∂(u^(4)/u^(1))/∂ξ,
                t^(5) )^T,

where the fifth component t^(5) combines d^ξ_5 ∂u^(5)/∂ξ with k_3 k_4-weighted
ξ-derivatives of the ratios [ (u^(2))^2 + (u^(3))^2 + (u^(4))^2 ] / (u^(1))^2 and
u^(5)/u^(1), the latter weighted by the constant k_5. The vector functions
V(U, U_η) and W(U, U_ζ) are defined analogously in the η and ζ directions,
with diffusion constants d^η_m and d^ζ_m and fifth components v^(5) and w^(5),
respectively. Here k_1, k_2, k_3, k_4, k_5, and d^ξ_m, d^η_m, d^ζ_m
(m = 1, ..., 5) are given constants.

The boundary conditions

The boundary conditions for the system of PDEs are prescribed to be of the
uncoupled Dirichlet type and are specified to be compatible with U*, such
that

  u^(m)(τ, ξ, η, ζ) = f^(m)(ξ, η, ζ)   for (ξ, η, ζ) ∈ ∂D, τ ∈ (0, T],

and m = 1, ..., 5.

The initial conditions

The initial values U_0 in D are set to those obtained by a transfinite, trilinear
interpolation of the boundary data given above. Let

  P_ξ u^(m) = (1 - ξ) u^(m)(τ, 0, η, ζ) + ξ u^(m)(τ, 1, η, ζ),

  P_η u^(m) = (1 - η) u^(m)(τ, ξ, 0, ζ) + η u^(m)(τ, ξ, 1, ζ),

  P_ζ u^(m) = (1 - ζ) u^(m)(τ, ξ, η, 0) + ζ u^(m)(τ, ξ, η, 1).

Then, at τ = 0,

  u^(m) = [ P_ξ + P_η + P_ζ - P_ξ P_η - P_η P_ζ - P_ξ P_ζ + P_ξ P_η P_ζ ] u^(m)

for (ξ, η, ζ) ∈ D.
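The transfinite trilinear interpolation above can be sketched as follows. This is a minimal illustration, not part of the benchmark specification: the function name is ours, and for convenience the boundary data is passed as an ordinary function of (ξ, η, ζ), although in the benchmark it is known only on the faces of the cube.

```python
def transfinite_trilinear(f, x, y, z):
    """Transfinite trilinear (Boolean-sum) interpolant of the boundary data f
    on the unit cube, evaluated at an interior point (x, y, z)."""
    # Linear blends in each coordinate direction (projectors P_xi, P_eta, P_zeta)
    Px = (1 - x) * f(0, y, z) + x * f(1, y, z)
    Py = (1 - y) * f(x, 0, z) + y * f(x, 1, z)
    Pz = (1 - z) * f(x, y, 0) + z * f(x, y, 1)
    # Tensor-product (bilinear) corrections, e.g. P_xi P_eta applied to f
    Pxy = ((1 - x) * (1 - y) * f(0, 0, z) + x * (1 - y) * f(1, 0, z)
           + (1 - x) * y * f(0, 1, z) + x * y * f(1, 1, z))
    Pyz = ((1 - y) * (1 - z) * f(x, 0, 0) + y * (1 - z) * f(x, 1, 0)
           + (1 - y) * z * f(x, 0, 1) + y * z * f(x, 1, 1))
    Pxz = ((1 - x) * (1 - z) * f(0, y, 0) + x * (1 - z) * f(1, y, 0)
           + (1 - x) * z * f(0, y, 1) + x * z * f(1, y, 1))
    # Trilinear correction P_xi P_eta P_zeta: blend of the eight corner values
    Pxyz = sum(f(i, j, k) * (x if i else 1 - x) * (y if j else 1 - y)
               * (z if k else 1 - z)
               for i in (0, 1) for j in (0, 1) for k in (0, 1))
    # Boolean sum: reproduces f exactly on all six faces of the cube
    return Px + Py + Pz - Pxy - Pyz - Pxz + Pxyz
```

The Boolean-sum form guarantees that the interpolant agrees with the boundary data on every face, so the initial values are continuous with the Dirichlet data.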

The Numerical Scheme

Starting from the initial values prescribed above, we seek some
discrete approximation U*_h ∈ D_h to the steady-state solution U* of the governing
equations, through the numerical solution of the nonlinear system of
PDEs, using a pseudo-time-marching scheme and a spatial-discretization
procedure based on finite-difference approximations.

Implicit time differencing

The independent temporal variable τ is discretized to produce the set

  D_τ = { τ_n : 0 ≤ n ≤ N },

where the discrete time increment Δτ is given by

  Δτ = τ_n - τ_{n-1}.

Also, the discrete approximation of U on D_τ is denoted by

  U^n = U(τ_n).

A generalized, single-step temporal differencing scheme for advancing the
solution of the governing equations is given by

  ΔU^n = (β Δτ / (1 + θ)) ∂(ΔU^n)/∂τ + (Δτ / (1 + θ)) ∂U^n/∂τ
         + (θ / (1 + θ)) ΔU^{n-1} + O[ (β - 1/2 - θ) Δτ^2 + Δτ^3 ],

where the forward-difference operator Δ is defined as

  ΔU^n = U^{n+1} - U^n.

With the appropriate choices for the parameters β and θ, this scheme
reproduces many of the well-known two- and three-level explicit and implicit
schemes. In this particular case, we are interested only in the two-level,
first-order-accurate Euler implicit scheme, given by β = 1 and θ = 0, i.e.,

  ΔU^n = Δτ ∂(ΔU^n)/∂τ + Δτ ∂U^n/∂τ + O(Δτ^2).

Substituting for ∂U^n/∂τ and ∂(ΔU^n)/∂τ in this scheme using the governing
equations, we get

  ΔU^n = Δτ [ ∂(ΔE^n + ΔT^n)/∂ξ + ∂(ΔF^n + ΔV^n)/∂η + ∂(ΔG^n + ΔW^n)/∂ζ ]
       + Δτ [ ∂(E^n + T^n)/∂ξ + ∂(F^n + V^n)/∂η + ∂(G^n + W^n)/∂ζ + H* ],

where ΔE^n = E^{n+1} - E^n and E^n = E(U^n), etc.

This equation is nonlinear in ΔU^n, as a consequence of the fact that the
increments ΔE^n, ΔF^n, ΔG^n, ΔT^n, ΔV^n, and ΔW^n are nonlinear functions
of the dependent variable U and its derivatives U_ξ, U_η, and U_ζ. A linear
equation with the same temporal accuracy can be obtained
by a linearization procedure using a local Taylor-series expansion in time
about U^n:

  E^{n+1} = E^n + (∂E/∂τ)^n Δτ + O(Δτ^2)
          = E^n + (∂E/∂U)^n (∂U/∂τ)^n Δτ + O(Δτ^2).

Also,

  U^{n+1} = U^n + (∂U/∂τ)^n Δτ + O(Δτ^2).

Then, by combining these two expansions, we get

  E^{n+1} = E^n + (∂E/∂U)^n ( U^{n+1} - U^n ) + O(Δτ^2),

or

  ΔE^n = A(U^n) ΔU^n + O(Δτ^2),

where A(U) is the Jacobian matrix ∂E/∂U.

It should be noted that the above non-iterative time-linearization formulation
does not lower the formal order of accuracy of the temporal discretization.
However, if the steady-state solution is the only objective,
the quadratic convergence of the Newton-Raphson method (for sufficiently
good initial approximations) is recovered only as Δτ → ∞.
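The effect of this one-Newton-step time linearization can be illustrated on a scalar model problem. This is a hedged sketch: the function name and the scalar setting are ours; in the benchmark, u is the five-component grid function and df/du stands for the 5×5 flux Jacobians.

```python
def linearized_implicit_step(u, dtau, f, dfdu):
    """One Euler-implicit step for du/dtau = f(u), with the nonlinear
    increment f(u^{n+1}) replaced by its local Taylor linearization
    f(u^n) + dfdu(u^n)*du (one Newton step per time level):

        (1 - dtau*dfdu(u)) * du = dtau * f(u),    u^{n+1} = u + du.
    """
    du = dtau * f(u) / (1.0 - dtau * dfdu(u))
    return u + du
```

For a linear right-hand side the linearization is exact, so the step reproduces the classical implicit Euler update; for nonlinear f it retains the O(Δτ²) local accuracy noted above while driving the iteration toward the steady state.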

Similarly, the linearization of the remaining terms gives

  ΔF^n = B(U^n) ΔU^n + O(Δτ^2),

  ΔG^n = C(U^n) ΔU^n + O(Δτ^2),

  ΔT^n = (∂T/∂U)^n ΔU^n + (∂T/∂U_ξ)^n ΔU_ξ^n + O(Δτ^2)
       = [ M(U^n) + N(U^n) ∂/∂ξ ] ΔU^n + O(Δτ^2),

  ΔV^n = [ P(U^n) + Q(U^n) ∂/∂η ] ΔU^n + O(Δτ^2),

  ΔW^n = [ R(U^n) + S(U^n) ∂/∂ζ ] ΔU^n + O(Δτ^2),

where the coefficient matrices are

  B(U) = ∂F/∂U,
  C(U) = ∂G/∂U,
  M(U) = ∂T/∂U,   N(U) = ∂T/∂U_ξ,
  P(U) = ∂V/∂U,   Q(U) = ∂V/∂U_η,
  R(U) = ∂W/∂U,   S(U) = ∂W/∂U_ζ.

When the approximations given by these linearizations are introduced into
the time-differenced equation, we obtain the following linear equation for ΔU^n:

  { I - Δτ [ ∂/∂ξ ( A^n + M^n + N^n ∂/∂ξ )
           + ∂/∂η ( B^n + P^n + Q^n ∂/∂η )
           + ∂/∂ζ ( C^n + R^n + S^n ∂/∂ζ ) ] } ΔU^n
    = Δτ [ ∂(E^n + T^n)/∂ξ + ∂(F^n + V^n)/∂η + ∂(G^n + W^n)/∂ζ + H* ].

It should be noted that notation of the form

  ∂/∂ξ ( A^n + M^n + N^n ∂/∂ξ ) ΔU^n

is used to represent expressions such as

  ∂[ ( A^n + M^n ) ΔU^n + N^n ∂(ΔU^n)/∂ξ ] / ∂ξ,

etc. The left-hand side (i.e., LHS) of this equation is referred to as the implicit
part, and the right-hand side (i.e., RHS) as the explicit part.
The solution at the advanced time level, n + 1, is given by

  U^{n+1} = U^n + ΔU^n.

The Jacobian matrices for the problem under consideration, A(U) = ∂E/∂U,
B(U) = ∂F/∂U, and C(U) = ∂G/∂U, are obtained by direct differentiation of
the flux vectors E, F, and G defined above. Each is a full 5×5 matrix whose
entries are quadratic rational functions of the dependent variables
u^(1), ..., u^(5) and the constants k_1 and k_2; in particular, the entries
involve the quantity

  q = k_1 [ (u^(2))^2 + (u^(3))^2 + (u^(4))^2 ] / u^(1),

the quantity φ, and the velocity-like ratios u^(m)/u^(1).

The matrices M, N, P, Q, R, and S follow from differentiating the vector
functions T, V, and W. In particular, N(U), Q(U), and S(U) are 5×5
matrices of near-diagonal form: the first four rows are diagonal, with entries
built from the diffusion constants d^ξ_m (respectively d^η_m, d^ζ_m) and terms
of the form k_3 k_4 / u^(1), together with a first column containing the corresponding
terms in the ratios u^(m)/(u^(1))^2; the fifth row collects entries
n^(5,1), ..., n^(5,5) (and likewise q^(5,m) and s^(5,m)) built from the
constants k_1, ..., k_5, d^ξ_5 (respectively d^η_5, d^ζ_5), and ratios of the
dependent variables u^(m).

Spatial discretization

The indep endent spatial variables are discretized bycovering D the

closure of D with a mesh of uniform increments h h h ineachofthe

co ordinate directions The mesh p oints in the region will b e identied bythe

indextriple i j k where the indices i N j N andk N

corresp ond to the discretization of and co ordinates resp ectively

D D f i N j N k N g

h h i j k

where

i h j h k h

i j k

and the mesh widths are given by

h N h N h N

with N N N N b eing the numb er of mesh p oints in and

directions resp ectively

Then the set of interior mesh p oints is given by

D f i N j N k N g

h i j k

and the b oundary mesh p oints by

D f i fN ggf

h i j k i j k

j fN ggf k fN gg

i j k

Also the discrete approximation of U in D D is denoted by

n

n i h j h k h U U U

ijk h

Spatial differencing

The spatial derivatives in the linear equation above are approximated by the appropriate
finite-difference quotients, based on the values of U_h at mesh points in
D̄_h = D_h ∪ ∂D_h. We use three-point, second-order-accurate central-difference
approximations in each of the three coordinate directions.

In the computation of the finite-difference approximation to the RHS,
the following two general forms of spatial derivatives are encountered:

  ∂ e^(m)(U) / ∂ξ   and   ∂ t^(m)(U, U_ξ) / ∂ξ.

The first form is differenced as (using the m-th component of the vector function E as
an example)

  ∂ e^(m)(U)/∂ξ |_{i,j,k} = [ e^(m)(U_{i+1,j,k}) - e^(m)(U_{i-1,j,k}) ] / (2 h_ξ) + O(h_ξ^2).

The second form is approximated as (using the m-th component of the vector
function T as an example)

  ∂ t^(m)(U, U_ξ)/∂ξ |_{i,j,k}
    = (1/h_ξ) { t^(m)( (U_{i,j,k} + U_{i+1,j,k})/2, (U_{i+1,j,k} - U_{i,j,k})/h_ξ )
              - t^(m)( (U_{i-1,j,k} + U_{i,j,k})/2, (U_{i,j,k} - U_{i-1,j,k})/h_ξ ) } + O(h_ξ^2),

for 2 ≤ i ≤ N_ξ - 1. Similar formulas are used in the η and ζ directions
as well.
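The two difference forms above can be sketched in one dimension as follows. This is a minimal Python/NumPy illustration with 0-based indexing; the function names are ours and are not part of the specification.

```python
import numpy as np

def d_dxi_first_form(e, U, h):
    """Three-point, second-order central difference of d/dxi e(U) at the
    interior points i = 1 .. N-2 of a 1-D array U with uniform spacing h."""
    eU = e(U)
    return (eU[2:] - eU[:-2]) / (2.0 * h)

def d_dxi_second_form(t, U, h):
    """Compact three-point approximation of d/dxi [ t(U, dU/dxi) ], built from
    midpoint values (U_i + U_{i+1})/2 and midpoint slopes (U_{i+1} - U_i)/h."""
    Umid = 0.5 * (U[1:] + U[:-1])        # U at the half points i+1/2
    Uxi = (U[1:] - U[:-1]) / h           # dU/dxi at the half points
    flux = t(Umid, Uxi)                  # t evaluated at the half points
    return (flux[1:] - flux[:-1]) / h    # difference of half-point fluxes
```

Both forms are exact for quadratic data: with U = ξ², the first form (with e the identity) returns 2ξ, and the second form (with t(u, u_ξ) = u_ξ) returns the constant second derivative 2.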

During the finite-difference approximation of the spatial derivatives in the
implicit part, the following three general forms are encountered:

  ∂[ a^(m,l)(U) Δu^(l) ]/∂ξ,
  ∂[ m^(m,l)(U, U_ξ) Δu^(l) ]/∂ξ,
  and
  ∂[ n^(m,l)(U) ∂(Δu^(l))/∂ξ ]/∂ξ.

The first form is approximated by the following:

  ∂[ a^(m,l)(U) Δu^(l) ]/∂ξ |_{i,j,k}
    ≈ (1/(2h_ξ)) { a^(m,l)(U_{i+1,j,k}) Δu^(l)_{i+1,j,k}
                 - a^(m,l)(U_{i-1,j,k}) Δu^(l)_{i-1,j,k} }.

The second form is differenced in the following compact, three-point form:

  ∂[ m^(m,l)(U, U_ξ) Δu^(l) ]/∂ξ |_{i,j,k}
    ≈ (1/(2h_ξ)) { m^(m,l)_{i+1/2} ( Δu^(l)_{i+1,j,k} + Δu^(l)_{i,j,k} )
                 - m^(m,l)_{i-1/2} ( Δu^(l)_{i,j,k} + Δu^(l)_{i-1,j,k} ) },

where m^(m,l)_{i±1/2} denotes m^(m,l) evaluated at the midpoint arguments

  ( ( U_{i,j,k} + U_{i±1,j,k} )/2,   ±( U_{i±1,j,k} - U_{i,j,k} )/h_ξ ).

Finally, the third form is differenced as follows:

  ∂[ n^(m,l)(U) ∂(Δu^(l))/∂ξ ]/∂ξ |_{i,j,k}
    ≈ (1/h_ξ^2) { n^(m,l)(U_{i+1,j,k}) Δu^(l)_{i+1,j,k}
                - 2 n^(m,l)(U_{i,j,k}) Δu^(l)_{i,j,k}
                + n^(m,l)(U_{i-1,j,k}) Δu^(l)_{i-1,j,k} }.

Added higher-order dissipation

When central-difference-type schemes are applied to the solution of the Euler
and Navier-Stokes equations, a numerical dissipation model is included, and
it plays an important role in determining their success. The form of the
dissipation model is quite often a blending of second-difference and fourth-difference
dissipation terms. The second-difference dissipation is nonlinear;
it is used to prevent oscillations in the neighborhood of discontinuous (i.e.,
C^0) solutions, and it is negligible where the solution is smooth. The fourth-difference
dissipation term is basically linear and is included to suppress
high-frequency modes and allow the numerical scheme to converge to a steady
state; near solution discontinuities, it is reduced to zero. It is this term that
affects the linear stability of the numerical scheme.

In the current implementation of the synthetic problem, a linear, fourth-difference
dissipation term of the following form,

  - ε [ h_ξ^4 ∂^4 U^n/∂ξ^4 + h_η^4 ∂^4 U^n/∂η^4 + h_ζ^4 ∂^4 U^n/∂ζ^4 ],

is added to the right-hand side of the linear equation. Here ε is a specified constant.
A similar term, with U^n replaced by ΔU^n, will appear in the implicit operator
if it is desirable to treat the dissipation term implicitly as well.

In the interior of the computational domain, the fourth-difference term is
computed using the following five-point approximation:

  h_ξ^4 ∂^4 U^n/∂ξ^4 |_{i,j,k}
    ≈ U^n_{i-2,j,k} - 4U^n_{i-1,j,k} + 6U^n_{i,j,k} - 4U^n_{i+1,j,k} + U^n_{i+2,j,k},

for 4 ≤ i ≤ N_ξ - 3.

At the first two interior mesh points belonging to either end of the computational
domain, the standard five-point difference stencil used for the fourth-difference
dissipation term is replaced by one-sided or one-sided-biased stencils.
These modifications are implemented so as to maintain a non-positive-definite
dissipation matrix for the system of difference equations.

Then, at i = 2,

  h_ξ^4 ∂^4 U^n/∂ξ^4 |_{i,j,k} ≈ 5U^n_{i,j,k} - 4U^n_{i+1,j,k} + U^n_{i+2,j,k},

and at i = 3,

  h_ξ^4 ∂^4 U^n/∂ξ^4 |_{i,j,k} ≈ -4U^n_{i-1,j,k} + 6U^n_{i,j,k} - 4U^n_{i+1,j,k} + U^n_{i+2,j,k}.

Also, at i = N_ξ - 2,

  h_ξ^4 ∂^4 U^n/∂ξ^4 |_{i,j,k} ≈ U^n_{i-2,j,k} - 4U^n_{i-1,j,k} + 6U^n_{i,j,k} - 4U^n_{i+1,j,k},

and at i = N_ξ - 1,

  h_ξ^4 ∂^4 U^n/∂ξ^4 |_{i,j,k} ≈ U^n_{i-2,j,k} - 4U^n_{i-1,j,k} + 5U^n_{i,j,k}.

Similar difference formulas are also used in the η and ζ directions.
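The dissipation stencils above can be sketched in one coordinate direction as follows. This is a Python/NumPy illustration with 0-based indexing; the function name is ours, and (as a simplifying assumption of the sketch) the h_ξ^4 factor is absorbed into eps, so the stencils are applied as undivided differences.

```python
import numpy as np

def fourth_diff_dissipation(u, eps):
    """Linear fourth-difference dissipation term -eps * (undivided fourth
    difference of u) along one grid line.  u[0] and u[-1] are boundary
    points; interior points use the five-point stencil, and the first and
    last two interior points use the one-sided/biased stencils that keep
    the dissipation matrix non-positive definite."""
    n = len(u)
    d = np.zeros_like(u, dtype=float)
    # one-sided stencils at the first two interior points
    d[1] = 5*u[1] - 4*u[2] + u[3]
    d[2] = -4*u[1] + 6*u[2] - 4*u[3] + u[4]
    # standard five-point stencil in the interior
    d[3:n-3] = u[1:n-5] - 4*u[2:n-4] + 6*u[3:n-3] - 4*u[4:n-2] + u[5:n-1]
    # biased stencils at the last two interior points
    d[n-3] = u[n-5] - 4*u[n-4] + 6*u[n-3] - 4*u[n-2]
    d[n-2] = u[n-4] - 4*u[n-3] + 5*u[n-2]
    return -eps * d
```

On the interior stencil the term annihilates constant and linear data, so it acts only on the high-frequency content of the solution.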

Also, for the vector functions T, V, and W used here,

  M = ∂N/∂ξ,   P = ∂Q/∂η,   R = ∂S/∂ζ.

Then, for the cases where the added higher-order dissipation terms are treated
only explicitly, the linear equation reduces to

  { I - Δτ [ ∂/∂ξ ( A^n + N^n ∂/∂ξ ) + ∂/∂η ( B^n + Q^n ∂/∂η )
           + ∂/∂ζ ( C^n + S^n ∂/∂ζ ) ] } ΔU^n
    = Δτ [ ∂(E^n + T^n)/∂ξ + ∂(F^n + V^n)/∂η + ∂(G^n + W^n)/∂ζ
         - ε ( h_ξ^4 ∂^4 U^n/∂ξ^4 + h_η^4 ∂^4 U^n/∂η^4 + h_ζ^4 ∂^4 U^n/∂ζ^4 ) + H* ].

When the implicit treatment is extended to the added higher-order dissipation
terms as well, the equation takes the following form:

  { I - Δτ [ ∂/∂ξ ( A^n + N^n ∂/∂ξ ) - ε h_ξ^4 ∂^4/∂ξ^4
           + ∂/∂η ( B^n + Q^n ∂/∂η ) - ε h_η^4 ∂^4/∂η^4
           + ∂/∂ζ ( C^n + S^n ∂/∂ζ ) - ε h_ζ^4 ∂^4/∂ζ^4 ] } ΔU^n
    = Δτ [ ∂(E^n + T^n)/∂ξ + ∂(F^n + V^n)/∂η + ∂(G^n + W^n)/∂ζ
         - ε ( h_ξ^4 ∂^4 U^n/∂ξ^4 + h_η^4 ∂^4 U^n/∂η^4 + h_ζ^4 ∂^4 U^n/∂ζ^4 ) + H* ].

The modified vector forcing function is given by

  H* = - [ ∂E(U*)/∂ξ + ∂F(U*)/∂η + ∂G(U*)/∂ζ
         + ∂T(U*, U*_ξ)/∂ξ + ∂V(U*, U*_η)/∂η + ∂W(U*, U*_ζ)/∂ζ ]
       + ε [ h_ξ^4 ∂^4 U*/∂ξ^4 + h_η^4 ∂^4 U*/∂η^4 + h_ζ^4 ∂^4 U*/∂ζ^4 ]

for (ξ, η, ζ) ∈ D ∪ ∂D.

Computation of the explicit part (RHS)

The discretized form of the explicit part is given by

  RHS^n_{i,j,k} = Δτ [ D_ξ(E^n + T^n) + D_η(F^n + V^n) + D_ζ(G^n + W^n) ]_{i,j,k}
               - Δτ ε [ h_ξ^4 D_ξξξξ U^n + h_η^4 D_ηηηη U^n + h_ζ^4 D_ζζζζ U^n ]_{i,j,k}
               + Δτ H*_{i,j,k},

where D_ξ, D_η, and D_ζ are the second-order-accurate central spatial differencing
operators defined on D̄_h. At each point of the mesh, RHS^n_{i,j,k} is a column
vector with 5 components. The discretization of the added higher-order dissipation
terms was already discussed in the previous section. Here we consider the computation
of the first term, using the difference formulas given earlier:

  [ D_ξ(E^n + T^n) + D_η(F^n + V^n) + D_ζ(G^n + W^n) ]_{i,j,k} =

    (1/(2h_ξ)) [ E(U^n_{i+1,j,k}) - E(U^n_{i-1,j,k}) ]
    + (1/h_ξ) { T( (U^n_{i,j,k} + U^n_{i+1,j,k})/2, (U^n_{i+1,j,k} - U^n_{i,j,k})/h_ξ )
              - T( (U^n_{i-1,j,k} + U^n_{i,j,k})/2, (U^n_{i,j,k} - U^n_{i-1,j,k})/h_ξ ) }

    + (1/(2h_η)) [ F(U^n_{i,j+1,k}) - F(U^n_{i,j-1,k}) ]
    + (1/h_η) { V( (U^n_{i,j,k} + U^n_{i,j+1,k})/2, (U^n_{i,j+1,k} - U^n_{i,j,k})/h_η )
              - V( (U^n_{i,j-1,k} + U^n_{i,j,k})/2, (U^n_{i,j,k} - U^n_{i,j-1,k})/h_η ) }

    + (1/(2h_ζ)) [ G(U^n_{i,j,k+1}) - G(U^n_{i,j,k-1}) ]
    + (1/h_ζ) { W( (U^n_{i,j,k} + U^n_{i,j,k+1})/2, (U^n_{i,j,k+1} - U^n_{i,j,k})/h_ζ )
              - W( (U^n_{i,j,k-1} + U^n_{i,j,k})/2, (U^n_{i,j,k} - U^n_{i,j,k-1})/h_ζ ) },

for (i, j, k) ∈ D_h. Also, RHS^n_{i,j,k} = 0 for i = 1 or N_ξ, j = 1 or N_η, and
k = 1 or N_ζ.

This right-hand-side computation is equivalent, in terms of both arithmetic
and communication complexity, to a regular-sparse, block (5×5) matrix-vector
multiplication, which is kernel (c). However, its efficient implementation
in this case does not necessarily require the explicit formation and/or storage
of the regular sparse matrix.

Computation of the forcing vector function

Given the analytical form of U*, the forcing vector function H* can simply be
evaluated analytically as a function of ξ, η, and ζ, using its defining relation;
that function can then be used to evaluate H*_{i,j,k},
which is also a column vector with 5 components.

Here, instead, we opt for a numerical evaluation of H*_{i,j,k}, using U*_{i,j,k} and the
same finite-difference approximations as in the case of the explicit part:

  H*_{i,j,k} = - [ D_ξ(E* + T*) + D_η(F* + V*) + D_ζ(G* + W*) ]_{i,j,k}
             + ε [ h_ξ^4 D_ξξξξ U* + h_η^4 D_ηηηη U* + h_ζ^4 D_ζζζζ U* ]_{i,j,k},

where the central differences of the flux terms are evaluated exactly as in the
explicit-part formula above, with U* in place of U^n, and the fourth-difference
dissipation terms are evaluated using the stencils given in the previous section.
Also, H*_{i,j,k} = 0 for i = 1 or N_ξ, j = 1 or N_η, and k = 1 or N_ζ.

Solution of the system of linear equations

Replacing the spatial derivatives appearing in the linear equation for ΔU^n by their
equivalent difference approximations results in a system of linear equations for
ΔU^n_{i,j,k}, for 2 ≤ i ≤ N_ξ - 1, 2 ≤ j ≤ N_η - 1, and 2 ≤ k ≤ N_ζ - 1. Direct
solution of this system of linear equations requires a formidable matrix-inversion
effort, in terms of both processing time and storage requirements.
Therefore, ΔU^n is obtained through the use of an iterative method. Here we
consider three such iterative methods, involving the kernels (a), (b), and (d).
All of the methods involve some form of approximation to the implicit operator, or
the LHS, of the linear equation. For pseudo-time-marching schemes, it is generally
sufficient to perform only one iteration per time step.

Approximate factorization (Beam-Warming) algorithm

In this method, the implicit operator is approximately
factored in the following manner:

  { I - Δτ ∂/∂ξ ( A^n + N^n ∂/∂ξ ) } { I - Δτ ∂/∂η ( B^n + Q^n ∂/∂η ) }
  { I - Δτ ∂/∂ζ ( C^n + S^n ∂/∂ζ ) } ΔU^n = RHS^n.

The Beam-Warming algorithm is to be implemented as shown below.

Initialization: Set the boundary values of U_{i,j,k} for (i, j, k) ∈ ∂D_h in accordance
with the boundary conditions. Set the initial values of U_{i,j,k} for (i, j, k) ∈ D_h in
accordance with the initial conditions. Compute the forcing-function vector H*_{i,j,k}
for (i, j, k) ∈ D_h.

Step 1 (explicit part): Compute RHS^n_{i,j,k} for (i, j, k) ∈ D_h.

Step 2 (ξ-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU_1, for (i, j, k) ∈ D_h:

  { I - Δτ D_ξ A^n - Δτ D_ξξ N^n } ΔU_1 = RHS^n.

Step 3 (η-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU_2, for (i, j, k) ∈ D_h:

  { I - Δτ D_η B^n - Δτ D_ηη Q^n } ΔU_2 = ΔU_1.

Step 4 (ζ-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU^n, for (i, j, k) ∈ D_h:

  { I - Δτ D_ζ C^n - Δτ D_ζζ S^n } ΔU^n = ΔU_2.

Step 5 (update): U^{n+1}_{i,j,k} = U^n_{i,j,k} + ΔU^n_{i,j,k} for (i, j, k) ∈ D_h.

Steps 1-5 constitute one time-stepping iteration of the approximate factorization
scheme. The solution of the systems of linear equations in each of
Steps 2, 3, and 4 is equivalent to the solution of multiple, independent systems of
block-tridiagonal equations, with each block being a 5×5 matrix, in the
ξ, η, and ζ coordinate directions, respectively. For example, the system of
equations in the ξ-sweep has the following block-tridiagonal structure:

  B_{2,j,k} ΔU_{2,j,k} + C_{2,j,k} ΔU_{3,j,k} = RHS_{2,j,k},

  A_{i,j,k} ΔU_{i-1,j,k} + B_{i,j,k} ΔU_{i,j,k} + C_{i,j,k} ΔU_{i+1,j,k} = RHS_{i,j,k},
    3 ≤ i ≤ N_ξ - 2,

  A_{N_ξ-1,j,k} ΔU_{N_ξ-2,j,k} + B_{N_ξ-1,j,k} ΔU_{N_ξ-1,j,k} = RHS_{N_ξ-1,j,k},

where 2 ≤ j ≤ N_η - 1 and 2 ≤ k ≤ N_ζ - 1. Also, A, B, and C are 5×5
matrices, and ΔU_{i,j,k} is a 5×1 column vector.

Here, for 2 ≤ i ≤ N_ξ - 1, using the difference formulas given earlier,

  A_{i,j,k} = -Δτ { -(1/(2h_ξ)) A(U^n_{i-1,j,k}) + (1/h_ξ^2) N(U^n_{i-1,j,k}) },

  B_{i,j,k} = I + (2Δτ/h_ξ^2) N(U^n_{i,j,k}),

  C_{i,j,k} = -Δτ { (1/(2h_ξ)) A(U^n_{i+1,j,k}) + (1/h_ξ^2) N(U^n_{i+1,j,k}) }.

Also, B_{1,j,k} = I, C_{1,j,k} = 0, A_{N_ξ,j,k} = 0, and B_{N_ξ,j,k} = I, corresponding
to the boundary points, where ΔU = 0.

Diagonal form of the approximate factorization algorithm

Here the approximate factorization algorithm of the previous section is modified
so as to transform the coupled systems given by the left-hand side
into an uncoupled, diagonal form. This involves further approximations
in the treatment of the implicit operator. The diagonalization process targets
only the matrices A, B, and C in the implicit operator; the effects of the other
matrices present in the implicit operator are either ignored or approximated
so as to conform to the resulting diagonal structure.

The diagonalization process is based on the observation that the matrices A,
B, and C each have a set of real eigenvalues and a complete set of eigenvectors.
Therefore, they can be diagonalized through similarity transformations:

  A = T_ξ Λ_ξ (T_ξ)^{-1},   B = T_η Λ_η (T_η)^{-1},   C = T_ζ Λ_ζ (T_ζ)^{-1},

with

  Λ_ξ = diag( u^(2)/u^(1), u^(2)/u^(1), u^(2)/u^(1), u^(2)/u^(1) + a, u^(2)/u^(1) - a ),
  Λ_η = diag( u^(3)/u^(1), u^(3)/u^(1), u^(3)/u^(1), u^(3)/u^(1) + a, u^(3)/u^(1) - a ),
  Λ_ζ = diag( u^(4)/u^(1), u^(4)/u^(1), u^(4)/u^(1), u^(4)/u^(1) + a, u^(4)/u^(1) - a ),

where the sound-speed-like quantity a is given by

  a = ( c_1 c_2 { u^(5) - k_1 [ (u^(2))^2 + (u^(3))^2 + (u^(4))^2 ] / u^(1) } / u^(1) )^{1/2},

c_1 and c_2 are constants, and T_ξ(U), T_η(U), and T_ζ(U) are the matrices whose columns are the
eigenvectors of A, B, and C, respectively. When all matrices other than
A, B, and C are ignored, the implicit operator is given by

  LHS = { I - Δτ ∂/∂ξ A^n } { I - Δτ ∂/∂η B^n } { I - Δτ ∂/∂ζ C^n } ΔU^n.

Substituting for A, B, and C using the similarity transformations, we get

  LHS = { T_ξ^n (T_ξ^n)^{-1} - Δτ ∂/∂ξ ( T_ξ Λ_ξ (T_ξ)^{-1} )^n }
        { T_η^n (T_η^n)^{-1} - Δτ ∂/∂η ( T_η Λ_η (T_η)^{-1} )^n }
        { T_ζ^n (T_ζ^n)^{-1} - Δτ ∂/∂ζ ( T_ζ Λ_ζ (T_ζ)^{-1} )^n } ΔU^n.

A modified form of this equation is obtained by moving the eigenvector
matrices T_ξ, T_η, and T_ζ outside the spatial differential operators ∂/∂ξ, ∂/∂η,
and ∂/∂ζ, respectively:

  LHS ≈ T_ξ^n { I - Δτ ∂/∂ξ Λ_ξ^n } N̂ { I - Δτ ∂/∂η Λ_η^n } P̂
        { I - Δτ ∂/∂ζ Λ_ζ^n } ( T_ζ^n )^{-1} ΔU^n,

where

  N̂ = (T_ξ)^{-1} T_η,   N̂^{-1} = (T_η)^{-1} T_ξ,
  P̂ = (T_η)^{-1} T_ζ,   P̂^{-1} = (T_ζ)^{-1} T_η.

The eigenvector matrices are functions of ξ, η, and ζ, and therefore this modification
introduces further errors of O(Δτ) into the factorization process.

During the factorization process, the presence of the matrices N, Q, and
S in the implicit part was ignored. This is because, in general, the similarity
transformations used for diagonalizing A do not simultaneously diagonalize
N; the same is true for the η and ζ factors as well. This necessitates
some ad hoc, approximate treatment of the ignored terms which, at the same
time, preserves the diagonal structure of the implicit operators. Whatever
the approach used, additional approximation errors are introduced in the
treatment of the implicit operators. We chose to approximate N, Q, and S
in the implicit operators by diagonal matrices whose values are given by the
spectral radii ρ(N), ρ(Q), and ρ(S), respectively. In addition, we also
treat the added fourth-difference dissipation terms implicitly:

  LHS ≈ T_ξ^n { I - Δτ ∂/∂ξ Λ_ξ^n - Δτ ρ(N^n) ∂^2/∂ξ^2 I + Δτ ε h_ξ^4 ∂^4/∂ξ^4 I } N̂
        { I - Δτ ∂/∂η Λ_η^n - Δτ ρ(Q^n) ∂^2/∂η^2 I + Δτ ε h_η^4 ∂^4/∂η^4 I } P̂
        { I - Δτ ∂/∂ζ Λ_ζ^n - Δτ ρ(S^n) ∂^2/∂ζ^2 I + Δτ ε h_ζ^4 ∂^4/∂ζ^4 I } ( T_ζ^n )^{-1} ΔU^n.

The matrices T_ξ, (T_ξ)^{-1}, N̂, and P̂ (and the corresponding η- and ζ-direction
matrices) are full 5×5 matrices whose entries are algebraic functions of the
dependent variables u^(1), ..., u^(5), the quantity a, and the constants k_1
and k_2. In addition, the spectral radii of the matrices N, Q, and S are
given by the maxima of their respective diagonal entries, which are formed
from the diffusion constants d^ξ_m (respectively d^η_m and d^ζ_m) augmented
by terms of the form k_3 k_4 / u^(1) and k_3 k_4 k_5 / u^(1):

  ρ(N) = max( d^ξ_1, d^ξ_2 + k_3 k_4 / u^(1), ..., d^ξ_5 + k_3 k_4 k_5 / u^(1) ),

and similarly for ρ(Q) and ρ(S).

The explicit part of the diagonal algorithm is exactly the same as before,
and all of the approximations are restricted to the implicit operator.
Therefore, if the diagonal algorithm converges, the steady-state solution
will be identical to the one obtained without diagonalization. However, the
convergence behavior of the diagonalized algorithm will, in general, be different.

The Diagonalized Approximate Factorization algorithm is to be implemented
in the following order:

Initialization: Set the boundary values of U_{i,j,k} for (i, j, k) ∈ ∂D_h in accordance
with the boundary conditions. Set the initial values of U_{i,j,k} for (i, j, k) ∈ D_h in
accordance with the initial conditions. Compute the forcing-function vector H*_{i,j,k} for
(i, j, k) ∈ D_h.

Step 1 (explicit part): Compute RHS^n_{i,j,k} for (i, j, k) ∈ D_h.

Step 2: Perform the matrix-vector multiplication

  ΔU_1 = (T_ξ^n)^{-1} RHS^n.

Step 3 (ξ-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU_2:

  { I - Δτ D_ξ Λ_ξ^n - Δτ ρ(N^n) D_ξξ I + Δτ ε h_ξ^4 D_ξξξξ I } ΔU_2 = ΔU_1.

Step 4: Perform the matrix-vector multiplication

  ΔU_3 = N̂^{-1} ΔU_2.

Step 5 (η-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU_4:

  { I - Δτ D_η Λ_η^n - Δτ ρ(Q^n) D_ηη I + Δτ ε h_η^4 D_ηηηη I } ΔU_4 = ΔU_3.

Step 6: Perform the matrix-vector multiplication

  ΔU_5 = P̂^{-1} ΔU_4.

Step 7 (ζ-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU_6:

  { I - Δτ D_ζ Λ_ζ^n - Δτ ρ(S^n) D_ζζ I + Δτ ε h_ζ^4 D_ζζζζ I } ΔU_6 = ΔU_5.

Step 8: Perform the matrix-vector multiplication

  ΔU^n = T_ζ^n ΔU_6.

Step 9 (update): U^{n+1} = U^n + ΔU^n.

Steps 1-9 constitute one iteration of the diagonal form of the approximate
factorization algorithm. The new implicit operators are block
pentadiagonal; however, the blocks are diagonal in form, so that the operators
simplify into five independent scalar pentadiagonal systems. Therefore,
each of Steps 3, 5, and 7 involves the solution of multiple, independent
systems of scalar pentadiagonal equations in the ξ, η, and ζ coordinate directions,
respectively. For example, each of the five independent scalar
pentadiagonal systems in the ξ-sweep has the following structure:

  c_{2,j,k} Δu^(m)_{2,j,k} + d_{2,j,k} Δu^(m)_{3,j,k} + e_{2,j,k} Δu^(m)_{4,j,k} = r^(m)_{2,j,k},

  b_{3,j,k} Δu^(m)_{2,j,k} + c_{3,j,k} Δu^(m)_{3,j,k} + d_{3,j,k} Δu^(m)_{4,j,k}
    + e_{3,j,k} Δu^(m)_{5,j,k} = r^(m)_{3,j,k},

  a_{i,j,k} Δu^(m)_{i-2,j,k} + b_{i,j,k} Δu^(m)_{i-1,j,k} + c_{i,j,k} Δu^(m)_{i,j,k}
    + d_{i,j,k} Δu^(m)_{i+1,j,k} + e_{i,j,k} Δu^(m)_{i+2,j,k} = r^(m)_{i,j,k},
    4 ≤ i ≤ N_ξ - 3,

  a_{N_ξ-2,j,k} Δu^(m)_{N_ξ-4,j,k} + b_{N_ξ-2,j,k} Δu^(m)_{N_ξ-3,j,k}
    + c_{N_ξ-2,j,k} Δu^(m)_{N_ξ-2,j,k} + d_{N_ξ-2,j,k} Δu^(m)_{N_ξ-1,j,k} = r^(m)_{N_ξ-2,j,k},

  a_{N_ξ-1,j,k} Δu^(m)_{N_ξ-3,j,k} + b_{N_ξ-1,j,k} Δu^(m)_{N_ξ-2,j,k}
    + c_{N_ξ-1,j,k} Δu^(m)_{N_ξ-1,j,k} = r^(m)_{N_ξ-1,j,k},

where 2 ≤ j ≤ N_η - 1, 2 ≤ k ≤ N_ζ - 1, and m = 1, ..., 5; here Δu^(m) denotes the
m-th component of the unknown vector of the sweep (ΔU_2 in the ξ-sweep) and r^(m)
the corresponding right-hand side (ΔU_1). The elements of
the pentadiagonal matrix are given by the following.

For the interior rows, 4 ≤ i ≤ N_ξ - 3,

  a_{i,j,k} = Δτ ε,

  b_{i,j,k} = (Δτ/(2h_ξ)) λ^(m,m)(U^n_{i-1,j,k}) - (Δτ/h_ξ^2) ρ(N^n_{i-1,j,k}) - 4Δτε,

  c_{i,j,k} = 1 + (2Δτ/h_ξ^2) ρ(N^n_{i,j,k}) + 6Δτε,

  d_{i,j,k} = -(Δτ/(2h_ξ)) λ^(m,m)(U^n_{i+1,j,k}) - (Δτ/h_ξ^2) ρ(N^n_{i+1,j,k}) - 4Δτε,

  e_{i,j,k} = Δτ ε,

where λ^(m,m) denotes the m-th diagonal entry of Λ_ξ. In the rows adjacent to
the boundaries (i = 2, 3, N_ξ - 2, and N_ξ - 1), the coefficients that would
reference boundary or out-of-range points vanish (a_{2,j,k} = b_{2,j,k} = a_{3,j,k} = 0,
e_{N_ξ-2,j,k} = 0, and d_{N_ξ-1,j,k} = e_{N_ξ-1,j,k} = 0), and the dissipation
contributions are modified in accordance with the one-sided stencils given
earlier; in particular, the 6Δτε in c_{i,j,k} becomes 5Δτε at i = 2 and
i = N_ξ - 1.
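A scalar pentadiagonal solve of the kind required in Steps 3, 5, and 7 can be sketched as follows. This is a Python/NumPy sketch using forward elimination without pivoting, relying on the diagonal dominance produced by the implicit operator; the function name and 0-based indexing are ours.

```python
import numpy as np

def penta_solve(a, b, c, d, e, r):
    """Solve the scalar pentadiagonal system
        a[i] x[i-2] + b[i] x[i-1] + c[i] x[i] + d[i] x[i+1] + e[i] x[i+2] = r[i]
    for i = 0..n-1 (out-of-range coefficients are taken as zero)."""
    n = len(r)
    a = np.asarray(a, float).copy(); b = np.asarray(b, float).copy()
    c = np.asarray(c, float).copy(); d = np.asarray(d, float).copy()
    e = np.asarray(e, float).copy(); r = np.asarray(r, float).copy()
    for i in range(1, n):
        # eliminate the first sub-diagonal entry of row i using row i-1
        m = b[i] / c[i-1]
        c[i] -= m * d[i-1]; d[i] -= m * e[i-1]; r[i] -= m * r[i-1]
        if i + 1 < n:
            # eliminate the second sub-diagonal entry of row i+1 using row i-1
            m2 = a[i+1] / c[i-1]
            b[i+1] -= m2 * d[i-1]; c[i+1] -= m2 * e[i-1]; r[i+1] -= m2 * r[i-1]
    x = np.zeros(n)
    x[n-1] = r[n-1] / c[n-1]
    x[n-2] = (r[n-2] - d[n-2] * x[n-1]) / c[n-2]
    for i in range(n-3, -1, -1):
        x[i] = (r[i] - d[i] * x[i+1] - e[i] * x[i+2]) / c[i]
    return x
```

As with the block-tridiagonal sweeps, the five systems per grid line (one per component m) and all of the grid lines in a sweep are mutually independent.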

Symmetric successive over-relaxation algorithm

In this method, the system of linear equations obtained after replacing
the spatial derivatives appearing in the implicit operator by
second-order-accurate central finite-difference operators is solved using the
symmetric successive over-relaxation (SSOR) scheme. Let the linear system be given
by

  K^n ΔU^n = RHS^n,

where

  K^n = I - Δτ { D_ξ A^n + D_ξξ N^n + D_η B^n + D_ηη Q^n + D_ζ C^n + D_ζζ S^n }

for (i, j, k) ∈ D_h.

It is specified that the unknowns be ordered lexicographically, corresponding to the
grid points, such that the index in the ξ-direction runs fastest, followed
by the index in the η-direction, and finally the index in the ζ-direction. The finite-difference
discretization matrix K resulting from such an ordering has a very regular
banded structure: there are altogether seven block diagonals, each with a
5×5 block size.

The matrix K can be written as the sum of the matrices D, Y, and Z,

  K^n = D^n + Y^n + Z^n,

where

  D^n = the main block diagonal of K^n,
  Y^n = the three sub-block diagonals of K^n,
  Z^n = the three super-block diagonals of K^n.

Therefore, D is a block-diagonal matrix, while Y and Z are strictly lower
and upper triangular, respectively. Then the point-SSOR iteration scheme
can be written as

  X^n ΔU^n = RHS^n,

where

  X^n = (D^n + ω Y^n) (D^n)^{-1} (D^n + ω Z^n)
      = (D^n + ω Y^n) ( I + ω (D^n)^{-1} Z^n ),

and ω is the over-relaxation factor, a specified constant. The
SSOR algorithm is to be implemented in the following manner:

Initialization: Set the boundary values of U_{i,j,k} in accordance with the boundary
conditions. Set the initial values of U_{i,j,k} in accordance with the initial conditions.
Compute the forcing-function vector H*_{i,j,k} for (i, j, k) ∈ D_h.

Step 1 (explicit part): Compute RHS^n_{i,j,k} for (i, j, k) ∈ D_h.

Step 2 (lower-triangular-system solution): Form and solve the following
regular sparse, block lower triangular system to get ΔU_1:

  ( D^n + ω Y^n ) ΔU_1 = RHS^n.

Step 3 (upper-triangular-system solution): Form and solve the following
regular sparse, block upper triangular system to get ΔU^n:

  ( I + ω (D^n)^{-1} Z^n ) ΔU^n = ΔU_1.

Step 4 (update):

  U^{n+1} = U^n + ΔU^n.

Steps constitute one complete iteration of the SSOR scheme The

l th blo ckrow of the matrix K has the following structure

n n n n

A U B U C U D U

l l l l l l

lN N lN

n n n

E U F U G U RHS

l l l l l

lN lN N

where l i N j N N k and i j k D

h

The matrices are given by

    A_l = -(Δτ/(2hζ)) C(U^(n)_{i,j,k-1}) - (Δτ/hζ²) S(U^(n)_{i,j,k-1})

    B_l = -(Δτ/(2hη)) B(U^(n)_{i,j-1,k}) - (Δτ/hη²) Q(U^(n)_{i,j-1,k})

    C_l = -(Δτ/(2hξ)) A(U^(n)_{i-1,j,k}) - (Δτ/hξ²) N(U^(n)_{i-1,j,k})

    D_l = I + (2Δτ/hξ²) N(U^(n)_{i,j,k}) + (2Δτ/hη²) Q(U^(n)_{i,j,k}) + (2Δτ/hζ²) S(U^(n)_{i,j,k})

    E_l = (Δτ/(2hξ)) A(U^(n)_{i+1,j,k}) - (Δτ/hξ²) N(U^(n)_{i+1,j,k})

    F_l = (Δτ/(2hη)) B(U^(n)_{i,j+1,k}) - (Δτ/hη²) Q(U^(n)_{i,j+1,k})

    G_l = (Δτ/(2hζ)) C(U^(n)_{i,j,k+1}) - (Δτ/hζ²) S(U^(n)_{i,j,k+1})

Also, A_l, B_l and C_l are the elements in the l-th block-row of the matrix Y, whereas E_l, F_l and G_l are the elements in the l-th block-row of the matrix Z.
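As an illustration, the four SSOR steps above can be sketched in Python on a scalar analogue. This is a sketch under stated assumptions, not part of the benchmark specification: the 5×5 blocks are replaced by scalars, K is a small dense matrix split into its diagonal D, strict lower part Y and strict upper part Z, and `ssor_iteration` is an illustrative helper name.

```python
# Scalar-analogue sketch of one SSOR iteration (Steps 1-4 above).
# Assumption: 1x1 "blocks", dense K, pure Python; illustrative only.

def ssor_iteration(K, u, b, omega):
    n = len(K)
    # Step 1: explicit part -- residual RHS = b - K*u
    rhs = [b[i] - sum(K[i][j] * u[j] for j in range(n)) for i in range(n)]
    # Step 2: forward-solve (D + omega*Y) dU1 = RHS  (Y = strict lower part)
    du1 = [0.0] * n
    for i in range(n):
        s = sum(omega * K[i][j] * du1[j] for j in range(i))
        du1[i] = (rhs[i] - s) / K[i][i]
    # Step 3: back-solve (I + omega*D^-1*Z) dU = dU1  (Z = strict upper part)
    du = [0.0] * n
    for i in reversed(range(n)):
        s = sum(omega * K[i][j] * du[j] for j in range(i + 1, n)) / K[i][i]
        du[i] = du1[i] - s
    # Step 4: update the solution
    return [u[i] + du[i] for i in range(n)]

# Usage: iterate on a small diagonally dominant system with omega = 1.2;
# the exact solution of K*u = b below is (1, 2, 3).
K = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]
u = [0.0, 0.0, 0.0]
for _ in range(50):
    u = ssor_iteration(K, u, b, 1.2)
```

In the benchmark proper, each division by K[i][i] becomes a 5×5 block solve, and the forward and backward sweeps run over the lexicographic ordering described above.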

Benchmarks

There are three benchmarks associated with the three numerical schemes described in the preceding sections for the solution of the system of linear equations. Each benchmark consists of running one of the three numerical schemes for N_s time steps with given values for Nξ, Nη, Nζ and Δτ.

Verification test

For each of the benchmarks, at the completion of N_s time steps, compute the following:

1. Root-mean-square norms RMS_R(m) of the residual vectors {RHS^(n)_{m,ijk}} for m = 1,...,5, where n = N_s and (i,j,k) ∈ D_h:

    RMS_R(m) = sqrt( [ Σ_k Σ_j Σ_i {RHS^(N_s)_{m,ijk}}² ] / [(Nξ-2)(Nη-2)(Nζ-2)] )

2. Root-mean-square norms RMS_E(m) of the error vectors {U*_{m,ijk} - U^(n)_{m,ijk}} for m = 1,...,5, where n = N_s and (i,j,k) ∈ D_h:

    RMS_E(m) = sqrt( [ Σ_k Σ_j Σ_i {U*_{m,ijk} - U^(N_s)_{m,ijk}}² ] / [(Nξ-2)(Nη-2)(Nζ-2)] )
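A minimal sketch of these RMS norms in Python, assuming the five-component field is held as a nested list `v[m][i][j][k]` over the full grid, with the sums running over interior points only; the helper name `rms_norm` is illustrative, not from the specification:

```python
import math

def rms_norm(v, nx, ny, nz):
    # RMS of v[m][i][j][k] over interior points, for each of the 5 components.
    # Interior points are indices 1 .. n-2 in each direction (0-based).
    norms = []
    for m in range(5):
        s = 0.0
        for k in range(1, nz - 1):
            for j in range(1, ny - 1):
                for i in range(1, nx - 1):
                    s += v[m][i][j][k] ** 2
        norms.append(math.sqrt(s / ((nx - 2) * (ny - 2) * (nz - 2))))
    return norms

# Usage: a constant field of 2.0 on a 4x4x4 grid gives RMS 2.0 per component
v = [[[[2.0] * 4 for _ in range(4)] for _ in range(4)] for _ in range(5)]
norms = rms_norm(v, 4, 4, 4)
```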

3. The numerically evaluated surface integral, given by

    I = (hξhη/4) Σ_{i=i1..i2-1} Σ_{j=j1..j2-1} [ φ_{i,j,k1} + φ_{i+1,j,k1} + φ_{i,j+1,k1} + φ_{i+1,j+1,k1}
            + φ_{i,j,k2} + φ_{i+1,j,k2} + φ_{i,j+1,k2} + φ_{i+1,j+1,k2} ]
      + (hξhζ/4) Σ_{i=i1..i2-1} Σ_{k=k1..k2-1} [ φ_{i,j1,k} + φ_{i+1,j1,k} + φ_{i,j1,k+1} + φ_{i+1,j1,k+1}
            + φ_{i,j2,k} + φ_{i+1,j2,k} + φ_{i,j2,k+1} + φ_{i+1,j2,k+1} ]
      + (hηhζ/4) Σ_{j=j1..j2-1} Σ_{k=k1..k2-1} [ φ_{i1,j,k} + φ_{i1,j+1,k} + φ_{i1,j,k+1} + φ_{i1,j+1,k+1}
            + φ_{i2,j,k} + φ_{i2,j+1,k} + φ_{i2,j,k+1} + φ_{i2,j+1,k+1} ]

where i1, i2, j1, j2, k1 and k2 are specified constants such that 1 < i1 < i2 < Nξ, 1 < j1 < j2 < Nη and 1 < k1 < k2 < Nζ, and

    φ = C2 { u5 - (u2² + u3² + u4²) / (2 u1) }

The validity of each of these quantities is measured according to

    |X_c - X_r| / |X_r| ≤ ε

where X_c and X_r are the computed and reference values, respectively, and ε is the maximum allowable relative error. The value of ε is specified to be 10⁻⁸. The values of X_r depend on the numerical scheme under consideration and on the values specified for Nξ, Nη, Nζ and N_s.
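The acceptance criterion above amounts to a one-line relative-error test; `verified` is an illustrative helper name, with ε = 10⁻⁸ as the default:

```python
def verified(x_computed, x_reference, epsilon=1e-8):
    # Accept when |Xc - Xr| / |Xr| <= epsilon (relative-error test)
    return abs(x_computed - x_reference) <= epsilon * abs(x_reference)

# Usage
ok = verified(1.0 + 1e-10, 1.0)   # within tolerance
bad = verified(1.1, 1.0)          # 10% relative error, far outside tolerance
```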

Benchmark 1

Class A: Perform N_s iterations of the Approximate Factorization Algorithm with the following parameter values: Nξ = Nη = Nζ = 64, Δτ = 0.0008 and N_s = 200. Timing for this benchmark should begin just before the first step of the first iteration is started and end just after the last step of the N_s-th iteration is complete.

Class B: Same, except Nξ = Nη = Nζ = 102 and Δτ = 0.0003.
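The timing rule, which applies to all three benchmarks, can be sketched as follows; `do_one_iteration` is a hypothetical stand-in for one complete iteration of the chosen scheme, and initialization work is deliberately outside the timed region:

```python
import time

def timed_benchmark(do_one_iteration, state, n_steps):
    # Clock starts just before the first step of iteration 1 ...
    t0 = time.perf_counter()
    for _ in range(n_steps):
        state = do_one_iteration(state)   # one complete iteration (all steps)
    # ... and stops just after the last step of iteration n_steps.
    elapsed = time.perf_counter() - t0
    return state, elapsed

# Usage with a trivial stand-in iteration
final, secs = timed_benchmark(lambda s: s + 1, 0, 200)
```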

Benchmark 2

Class A: Perform N_s iterations of the Diagonal Form of the Approximate Factorization Algorithm with the following parameter values: Nξ = Nη = Nζ = 64, Δτ = 0.0015 and N_s = 400. Timing for this benchmark should begin just before the first step of the first iteration is started and end just after the last step of the N_s-th iteration is complete.

Class B: Same, except Nξ = Nη = Nζ = 102 and Δτ = 0.001.

Benchmark 3

Class A: Perform N_s iterations of the Symmetric Successive Over-Relaxation Algorithm with the following parameter values: Nξ = Nη = Nζ = 64, ω = 1.2 and N_s = 250. Timing for this benchmark should begin just before the first step of the first iteration is started and end just after the last step of the N_s-th iteration is complete.

Class B: Same, except Nξ = Nη = Nζ = 102.

For all benchmarks, the values of the remaining constants are specified: the constants k1, ..., k5; the five components of each of the constants C1, ..., C13; and the dissipation constants d1, ..., d5 in each of the ξ, η and ζ directions, together with d = max(dξ, dη, dζ).

References

Bailey, D. H., and Barton, J. T., "The NAS Kernel Benchmark Program," NASA TM.

Pulliam, T. H., "Efficient Solution Methods for the Navier-Stokes Equations," Lecture Notes for the von Karman Institute for Fluid Dynamics Lecture Series: Numerical Techniques for Viscous Flow Computation in Turbomachinery Bladings, Brussels, Belgium.

Yoon, S., Kwak, D., and Chang, L., "LU-SGS Implicit Algorithm for Three-Dimensional Incompressible Navier-Stokes Equations with Source Term," AIAA Paper (CP).

Jameson, A., Schmidt, W., and Turkel, E., "Numerical Solution of the Euler Equations by Finite Volume Methods Using Runge-Kutta Time-Stepping Schemes," AIAA Paper.

Steger, J. L., and Warming, R. F., "Flux Vector Splitting of the Inviscid Gas Dynamics Equations with Applications to Finite Difference Methods," Journal of Computational Physics.

van Leer, B., "Flux Vector Splitting for the Euler Equations," Eighth International Conference on Numerical Methods in Fluid Dynamics, Lecture Notes in Physics, edited by E. Krause.

Roe, P. L., "Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes," Journal of Computational Physics.

Harten, A., "High Resolution Schemes for Hyperbolic Conservation Laws," Journal of Computational Physics.

Gordon, W. J., "Blending Function Methods for Bivariate and Multivariate Interpolation," SIAM Journal on Numerical Analysis.

Warming, R. F., and Beam, R. M., "On the Construction and Application of Implicit Factored Schemes for Conservation Laws," Symposium on Computational Fluid Dynamics, SIAM-AMS Proceedings.

Lapidus, L., and Seinfeld, J. H., Numerical Solution of Ordinary Differential Equations, Academic Press, New York.

Lomax, H., "Stable Implicit and Explicit Numerical Methods for Integrating Quasi-Linear Differential Equations with Parasitic-Stiff and Parasitic-Saddle Eigenvalues," NASA TN.

Pulliam, T. H., and Chaussee, D. S., "A Diagonal Form of an Implicit Approximate Factorization Algorithm," Journal of Computational Physics.

Chan, T. F., and Jackson, K. R., "Nonlinearly Preconditioned Krylov Subspace Methods for Discrete Newton Algorithms," SIAM Journal on Scientific and Statistical Computing.

Axelsson, O., "A Generalized SSOR Method," BIT.

Olsson, P., and Johnson, S. L., "Boundary Modifications of the Dissipation Operators for the Three-Dimensional Euler Equations," Journal of Scientific Computing.

Coakley, T. J., "Numerical Methods for Gas Dynamics Combining Characteristics and Conservation Concepts," AIAA Paper.