THE NAS PARALLEL BENCHMARKS

D. H. Bailey[1], E. Barszcz[1], J. T. Barton[1], D. S. Browning[2],
R. L. Carter, L. Dagum[2], R. A. Fatoohi[2], P. O. Frederickson[3],
T. A. Lasinski[1], R. S. Schreiber[3], H. D. Simon[2],
V. Venkatakrishnan[2] and S. K. Weeratunga[2]

NAS Applied Research Branch
NASA Ames Research Center, Mail Stop T045-1
Moffett Field, CA 94035

Ref: Intl. Journal of Supercomputer Applications, vol. 5, no. 3 (Fall 1991),
pg. 66-73
Abstract

A new set of benchmarks has been developed for the performance evaluation
of highly parallel supercomputers. These benchmarks consist of five
"parallel kernel" benchmarks and three "simulated application" benchmarks.
Together they mimic the computation and data movement characteristics of
large scale computational fluid dynamics applications.

The principal distinguishing feature of these benchmarks is their "pencil
and paper" specification: all details of these benchmarks are specified
only algorithmically. In this way many of the difficulties associated with
conventional benchmarking approaches on highly parallel systems are
avoided.
[1] This author is an employee of NASA Ames Research Center.

[2] This author is an employee of Computer Sciences Corporation. This work
is supported through NASA Contract NAS 2-12961.

[3] This author is an employee of the Research Institute for Advanced
Computer Science (RIACS). This work is supported by the NAS Systems
Division via Cooperative Agreement NCC 2-387 between NASA and the
Universities Space Research Association.
1 Introduction

The Numerical Aerodynamic Simulation (NAS) Program, which is based at
NASA Ames Research Center, is a large scale effort to advance the state of
computational aerodynamics. Specifically, the NAS organization aims "to
provide the Nation's aerospace research and development community by the
year 2000 a high-performance, operational computing system capable of
simulating an entire aerospace vehicle system within a computing time of
one to several hours" ([4], page 3). The successful solution of this
"grand challenge" problem will require the development of computer systems
that can perform the required complex scientific computations at a
sustained rate nearly one thousand times greater than current generation
supercomputers can now achieve. The architecture of computer systems able
to achieve this level of performance will likely be dissimilar to the
shared memory multiprocessing supercomputers of today. While no consensus
yet exists on what the design will be, it is likely that the system will
consist of at least 1,000 processors computing in parallel.
Highly parallel systems with computing power roughly equivalent to
traditional shared memory multiprocessors exist today. Unfortunately, for
various reasons, the performance evaluation of these systems on comparable
types of scientific computations is very difficult. Little relevant data
is available on the performance of algorithms of interest to the
computational aerophysics community on many currently available parallel
systems. Benchmarking and performance evaluation of such systems has not
kept pace with advances in hardware, software and algorithms. In
particular, there is as yet no generally accepted benchmark program or
even a benchmark strategy for these systems.
The popular "kernel" benchmarks that have been used for traditional
vector supercomputers, such as the Livermore Loops [12], the LINPACK
benchmark [9, 10] and the original NAS Kernels [7], are clearly
inappropriate for the performance evaluation of highly parallel machines.
First of all, the tuning restrictions of these benchmarks rule out many
widely used parallel extensions. More importantly, the computation and
memory requirements of these programs do not do justice to the vastly
increased capabilities of the new parallel machines, particularly those
systems that will be available by the mid-1990s.
On the other hand, a full scale scientific application is similarly
unsuitable. First of all, porting a large program to a new parallel
computer architecture requires a major effort, and it is usually hard to
justify a major research task simply to obtain a benchmark number. For
that reason we believe that the otherwise very successful PERFECT Club
benchmark [11] is not suitable for highly parallel systems. This is
demonstrated by the very sparse performance results for parallel machines
in the recent reports [13, 14, 8]. Alternatively, an application benchmark
could assume the availability of automatic software tools for transforming
"dusty deck" source into efficient parallel code on a variety of systems.
However, such tools do not exist today, and many scientists doubt that
they will ever exist across a wide range of architectures.
Some other considerations for the development of a meaningful benchmark
for a highly parallel supercomputer are the following:

- Advanced parallel systems frequently require new algorithmic and
  software approaches, and these new methods are often quite different
  from the conventional methods implemented in source code for a
  sequential or vector machine.

- Benchmarks must be "generic" and should not favor any particular
  parallel architecture. This requirement precludes the usage of any
  architecture-specific code, such as message passing code.

- The correctness of results and performance figures must be easily
  verifiable. This requirement implies that both input and output data
  sets must be kept very small. It also implies that the nature of the
  computation and the expected results must be specified in great detail.

- The memory size and run time requirements must be easily adjustable to
  accommodate new systems with increased power.

- The benchmark must be readily distributable.
In our view, the only benchmarking approach that satisfies all of these
constraints is a "paper and pencil" benchmark. The idea is to specify a
set of problems only algorithmically. Even the input data must be
specified only on paper. Naturally, the problem has to be specified in
sufficient detail that a unique solution exists, and the required output
has to be brief yet detailed enough to certify that the problem has been
solved correctly. The person or persons implementing the benchmarks on a
given system are expected to solve the various problems in the most
appropriate way for the specific system. The choice of data structures,
algorithms, processor allocation and memory usage are all (to the extent
allowed by the specification) left open to the discretion of the
implementer. Some extension of Fortran or C is required, and reasonable
limits are placed on the usage of assembly code and the like, but
otherwise programmers are free to utilize language constructs that give
the best performance possible on the particular system being studied.
To this end, we have devised a number of relatively simple "kernels",
which are specified completely in [6]. However, kernels alone are
insufficient to completely assess the performance potential of a parallel
machine on real scientific applications. The chief difficulty is that a
certain data structure may be very efficient on a certain system for one
of the isolated kernels, and yet this data structure would be
inappropriate if incorporated into a larger application. In other words,
the performance of a real computational fluid dynamics (CFD) application
on a parallel system is critically dependent on data motion between
computational kernels. Thus we consider the complete reproduction of this
data movement to be of critical importance in a benchmark.
Our benchmark set therefore consists of two major components: five
parallel kernel benchmarks and three simulated application benchmarks.
The simulated application benchmarks combine several computations in a
manner that resembles the actual order of execution in certain important
CFD application codes. This is discussed in more detail in [6].

We feel that this benchmark set successfully addresses many of the
problems associated with benchmarking parallel machines. Although we do
not claim that this set is typical of all scientific computing, it is
based on the key components of several large aeroscience applications
used by scientists on supercomputers at NASA Ames Research Center. These
benchmarks will be used by the Numerical Aerodynamic Simulation (NAS)
Program to evaluate the performance of parallel computers.
2 Benchmark Rules

2.1 Definitions

In the following, the term "processor" is defined as a hardware unit
capable of integer and floating point computation. The "local memory" of
a processor refers to randomly accessible memory with an access time
(latency) of less than one microsecond. The term "main memory" refers to
the combined local memory of all processors. This includes any memory
shared by all processors that can be accessed by each processor in less
than one microsecond. The term "mass storage" refers to non-volatile
randomly accessible storage media that can be accessed by at least one
processor within forty milliseconds. A "processing node" is defined as a
hardware unit consisting of one or more processors plus their local
memory, which is logically a single unit on the network that connects the
processors.

The term "computational nodes" refers to those processing nodes primarily
devoted to high-speed floating point computation. The term "service
nodes" refers to those processing nodes primarily devoted to system
operations, including compilation, linking and communication with
external computers over a network.
2.2 General Rules

Implementations of these benchmarks must be based on either Fortran-77 or
C, although a wide variety of parallel extensions are allowed. This
requirement stems from the observation that Fortran and C are the
programming languages most commonly used by the scientific parallel
computing community at the present time. If in the future other languages
gain wide acceptance in this community, they will be considered for
inclusion in this group. Assembly language and other low-level languages
and constructs may not be used, except that certain specific
vendor-supported assembly-coded library routines may be called (see
section 2.3).

We are of the opinion that such language restrictions are necessary,
because otherwise considerable effort would be spent by benchmarkers on
low-level or assembly-level coding. The benchmark results would then tend
to reflect the amount of programming resources available to the
benchmarking organization, rather than the fundamental merits of the
parallel system. Certainly the mainstream scientists that these parallel
computers are intended to serve will be coding applications at the source
level, almost certainly in Fortran or C, and thus these benchmarks are
designed to measure the performance that can be expected from such code.
Accordingly, the following rules must be observed in any implementations
of the NAS Parallel Benchmarks:

- All floating point operations must be performed using 64-bit floating
  point arithmetic.

- All benchmarks must be coded in either Fortran-77 [1] or C [3], with
  certain approved extensions.

- Implementations of the benchmarks may not mix Fortran-77 and C code;
  one or the other must be used.

- Any extension of Fortran-77 that is in the Fortran-90 draft dated June
  1990 or later [2] is allowed.

- Any extension of Fortran-77 that is in the Parallel Computer Fortran
  (PCF) draft dated March 1990 or later [5] is allowed.

- Any language extension or library routine that is employed in any of
  the benchmarks must be supported by the vendor and available to all
  users.

- Subprograms and library routines not written in Fortran or C may only
  perform certain functions, as indicated in the next section.

- All rules apply equally to subroutine calls, language extensions and
  compiler directives (i.e. special comments).
2.3 Allowable Language Extensions and Library Routines

The following language extensions and library routines are permitted:

- Constructs that indicate sections of code that can be executed in
  parallel or loops that can be distributed among different computational
  nodes.

- Constructs that specify the allocation and organization of data among
  or within computational nodes.

- Constructs that communicate data between processing nodes.

- Constructs that communicate data between the computational nodes and
  service nodes.

- Constructs that rearrange data stored in multiple computational nodes,
  including constructs to perform indirect addressing and array
  transpositions.

- Constructs that synchronize the action of different computational
  nodes.

- Constructs that initialize for a data communication or synchronization
  operation that will be performed or completed later.

- Constructs that perform high-speed input or output operations between
  main memory and the mass storage system.

- Constructs that perform any of the following array reduction operations
  on an array either residing within a single computational node or
  distributed among multiple nodes: +, ×, MAX, MIN, AND, OR, XOR.

- Constructs that combine communication between nodes with one of the
  operations listed in the previous item.

- Constructs that perform any of the following computational operations
  on arrays either residing within a single computational node or
  distributed among multiple nodes: dense matrix-matrix multiplication,
  dense matrix-vector multiplication and one-dimensional, two-dimensional
  or three-dimensional fast Fourier transforms. Such routines must be
  callable with general array dimensions.
3 The Benchmarks: A Condensed Overview

After an evaluation of a number of large scale CFD and computational
aerosciences applications on the NAS supercomputers at NASA Ames, five
medium-sized computational problems were selected as the "parallel
kernels". In addition to these problems, three different implicit
solution schemes were added to the benchmark set. These schemes are
representative of CFD codes currently in use at NASA Ames Research Center
in that they mimic the computational activities and data motions typical
of real CFD applications. They do not include the typical pre- and
postprocessing of real applications, nor do they include I/O. Boundary
conditions are also handled in a greatly simplified manner. For a
detailed discussion of the differences between the simulated application
benchmarks and real CFD applications, see Chapter 3 of [6].

Even the five parallel kernel benchmarks involve substantially larger
computations than many previous benchmarks, such as the Livermore Loops
or LINPACK, and therefore they are more appropriate for the evaluation of
parallel machines. They are sufficiently simple that they can be
implemented on a new system without unreasonable effort and delay. The
three simulated application benchmarks require somewhat more effort to
implement but constitute a rigorous test of the usability of a parallel
system to perform state-of-the-art CFD computations.
3.1 The Eight Benchmark Problems

The following gives an overview of the benchmarks. The first five are the
parallel kernel benchmarks, and the last three are the simulated
application benchmarks. Space does not permit a complete description of
all of these. A detailed description of these benchmark problems is given
in [6].

EP: An "embarrassingly parallel" kernel, which evaluates an integral by
means of pseudorandom trials. This kernel, in contrast to others in the
list, requires virtually no interprocessor communication.
MG: A simplified multigrid kernel. This requires highly structured long
distance communication and tests both short and long distance data
communication.
CG: A conjugate gradient method is used to compute an approximation to
the smallest eigenvalue of a large, sparse, symmetric positive definite
matrix. This kernel is typical of unstructured grid computations in that
it tests irregular long distance communication, employing unstructured
matrix-vector multiplication.
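The idea behind the CG kernel can be sketched as an inverse power
iteration in which each step solves a linear system by the conjugate
gradient method. This is an illustrative sketch only, not the benchmark's
specified scheme (which fixes a particular sparse matrix and iteration;
see [6]); the function names and the dense test matrix below are our own.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def smallest_eigenvalue(A, iters=50):
    """Inverse power iteration: each step solves A z = x with CG, so the
    iterates converge to the eigenvector of the smallest eigenvalue."""
    x = np.ones(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        z = conjugate_gradient(A, x)
        x = z / np.linalg.norm(z)
    return x @ (A @ x)       # Rayleigh quotient estimate
```

In the benchmark proper the matrix is large, sparse and unstructured, so
the dominant cost is the irregular sparse matrix-vector product inside
the CG loop.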
FT: A 3-D partial differential equation solution using FFTs. This kernel
performs the essence of many "spectral" codes. It is a rigorous test of
long-distance communication performance.
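The spectral technique that FT exercises can be illustrated in one
dimension: a heat equation u_t = alpha u_xx on a periodic grid is advanced
exactly in Fourier space, where each mode simply decays. This is a 1-D toy
sketch with our own naming, not the benchmark's 3-D problem.

```python
import numpy as np

def spectral_heat_step(u0, alpha, dt, length=2.0 * np.pi):
    """Advance u_t = alpha * u_xx on a periodic grid by one time step,
    exactly, by damping each Fourier mode by exp(-alpha * k^2 * dt)."""
    n = u0.size
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)  # wavenumbers
    u_hat = np.fft.fft(u0)                             # forward FFT
    u_hat *= np.exp(-alpha * k ** 2 * dt)              # exact mode decay
    return np.real(np.fft.ifft(u_hat))                 # inverse FFT
```

On a 2π-periodic grid the initial condition sin(x) decays to
exp(-alpha dt) sin(x), which makes the step easy to verify. In 3-D, the
forward and inverse transforms are what generate the long-distance
communication the kernel is designed to stress.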
IS: A large integer sort. This kernel performs a sorting operation that
is important in "particle method" codes. It tests both integer
computation speed and communication performance.
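The kind of sorting IS measures can be illustrated by key ranking via a
histogram (counting sort), a common approach for the small bounded
integer keys of particle methods. A minimal sketch, with hypothetical
names:

```python
def counting_sort(keys, max_key):
    """Sort non-negative integer keys bounded by max_key in O(n + max_key)
    time by histogramming, then emitting each value count times."""
    counts = [0] * (max_key + 1)
    for key in keys:
        counts[key] += 1              # histogram of key values
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result
```

In a distributed setting the histogram must be combined across nodes and
the keys redistributed, which is what couples integer speed to
communication performance in this kernel.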
LU: A regular-sparse, block (5 × 5) lower and upper triangular system
solution. This problem represents the computations associated with the
implicit operator of a newer class of implicit CFD algorithms, typified
at NASA Ames by the code "INS3D-LU". This problem exhibits a somewhat
limited amount of parallelism compared to the next two.
SP: Solution of multiple, independent systems of non diagonally dominant,
scalar, pentadiagonal equations. SP and the following problem BT are
representative of computations associated with the implicit operators of
CFD codes such as "ARC3D" at NASA Ames. SP and BT are similar in many
respects, but there is a fundamental difference with respect to the
communication to computation ratio.

BT: Solution of multiple, independent systems of non diagonally dominant,
block tridiagonal equations with a (5 × 5) block size.
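The recurrence structure underlying the SP and BT line solves can be
illustrated by the classic Thomas algorithm for a single scalar
tridiagonal system; the benchmarks themselves solve pentadiagonal (SP)
and 5 × 5 block tridiagonal (BT) systems, but the forward-elimination and
back-substitution pattern is the same. A sketch under our own conventions
(a[0] and c[n-1] are unused):

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system in O(n): a = sub-diagonal, b = main
    diagonal, c = super-diagonal, d = right-hand side."""
    n = len(b)
    cp = [0.0] * n                    # modified super-diagonal
    dp = [0.0] * n                    # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):             # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The sequential dependence of this recurrence along each grid line is
exactly what limits the available parallelism and shapes the
communication-to-computation ratios that distinguish LU, SP and BT.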
3.2 The Embarrassingly Parallel Benchmark

In order to give the reader a flavor of the problem descriptions in [6],
a detailed definition will be given for the first problem, the
"embarrassingly parallel" benchmark:

Set n = 2^28 and s = 271828183. Generate the pseudorandom floating point
values r_j in the interval (0, 1) for 1 ≤ j ≤ 2n using the scheme
described below. Then for 1 ≤ j ≤ n set x_j = 2 r_{2j-1} - 1 and
y_j = 2 r_{2j} - 1. Thus x_j and y_j are uniformly distributed on the
interval (-1, 1).

Next set k = 0, and beginning with j = 1, test to see if
t_j = x_j^2 + y_j^2 ≤ 1. If not, reject this pair and proceed to the
next j. If this inequality holds, then set k ← k + 1,
X_k = x_j sqrt((-2 log t_j)/t_j) and Y_k = y_j sqrt((-2 log t_j)/t_j),
where log denotes the natural logarithm. Then X_k and Y_k are independent
Gaussian deviates with mean zero and variance one. Approximately n π/4
pairs will be constructed in this manner.
Finally, for 0 ≤ l ≤ 9, tabulate Q_l as the count of the pairs (X_k, Y_k)
that lie in the square annulus l ≤ max(|X_k|, |Y_k|) < l + 1. Each of the
ten Q_l counts must agree exactly with reference values.

The 2n uniform pseudorandom numbers r_j mentioned above are to be
generated according to the following scheme: Set a = 5^13 and let x_0 = s
be the specified initial "seed". Generate the integers x_k for 1 ≤ k ≤ 2n
using the linear congruential recursion

    x_{k+1} = a x_k (mod 2^46)
lie in the square annulus l max(jX j; jY j) k k l counts. Each of the ten Q counts must agree exactly with reference values. l The 2n uniform pseudorandom numb ers r mentioned ab ove are to b e j 13 generated according to the following scheme: Set a =5 and let x = s be 0 the sp eci ed initial \seed". Generate the integers x for 1 k 2n using k the linear congruential recursion 46 x = ax (mo d 2 ) k +1 k