THE NAS PARALLEL BENCHMARKS

D. H. Bailey[1], E. Barszcz[1], J. T. Barton[1], D. S. Browning[2],
R. L. Carter, L. Dagum[2], R. A. Fatoohi[2], P. O. Frederickson[3],
T. A. Lasinski[1], R. S. Schreiber[3], H. D. Simon[2],
V. Venkatakrishnan[2] and S. K. Weeratunga[2]

NAS Applied Research Branch
NASA Ames Research Center, Mail Stop T045-1
Moffett Field, CA 94035

Ref: Intl. Journal of Supercomputer Applications, vol. 5, no. 3 (Fall 1991),
pg. 66-73
Abstract

A new set of benchmarks has been developed for the performance evaluation
of highly parallel supercomputers. These benchmarks consist of five
"parallel kernel" benchmarks and three "simulated application" benchmarks.
Together they mimic the computation and data movement characteristics of
large scale computational fluid dynamics applications.

The principal distinguishing feature of these benchmarks is their "pencil
and paper" specification: all details of these benchmarks are specified
only algorithmically. In this way many of the difficulties associated with
conventional benchmarking approaches on highly parallel systems are
avoided.
[1] This author is an employee of NASA Ames Research Center.

[2] This author is an employee of Computer Sciences Corporation. This work
is supported through NASA Contract NAS 2-12961.

[3] This author is an employee of the Research Institute for Advanced
Computer Science (RIACS). This work is supported by the NAS Systems
Division via Cooperative Agreement NCC 2-387 between NASA and the
Universities Space Research Association.
1 Introduction

The Numerical Aerodynamic Simulation (NAS) Program, which is based at
NASA Ames Research Center, is a large scale effort to advance the state of
computational aerodynamics. Specifically, the NAS organization aims "to
provide the Nation's aerospace research and development community by the
year 2000 a high-performance, operational computing system capable of
simulating an entire aerospace vehicle system within a computing time of
one to several hours" ([4], page 3). The successful solution of this
"grand challenge" problem will require the development of computer systems
that can perform the required complex scientific computations at a
sustained rate nearly one thousand times greater than current generation
supercomputers can now achieve. The architecture of computer systems able
to achieve this level of performance will likely be dissimilar to the
shared memory multiprocessing supercomputers of today. While no consensus
yet exists on what the design will be, it is likely that the system will
consist of at least 1,000 processors computing in parallel.
Highly parallel systems with computing power roughly equivalent to
traditional shared memory multiprocessors exist today. Unfortunately, for
various reasons, the performance evaluation of these systems on comparable
types of scientific computations is very difficult. Little relevant data
is available on the performance of algorithms of interest to the
computational aerophysics community on many currently available parallel
systems. Benchmarking and performance evaluation of such systems has not
kept pace with advances in hardware, software and algorithms. In
particular, there is as yet no generally accepted benchmark program or
even a benchmark strategy for these systems.
The popular "kernel" benchmarks that have been used for traditional
vector supercomputers, such as the Livermore Loops [12], the LINPACK
benchmark [9, 10] and the original NAS Kernels [7], are clearly
inappropriate for the performance evaluation of highly parallel machines.
First of all, the tuning restrictions of these benchmarks rule out many
widely used parallel extensions. More importantly, the computation and
memory requirements of these programs do not do justice to the vastly
increased capabilities of the new parallel machines, particularly those
systems that will be available by the mid-1990s.
On the other hand, a full scale scientific application is similarly
unsuitable. First of all, porting a large program to a new parallel
computer architecture requires a major effort, and it is usually hard to
justify a major research task simply to obtain a benchmark number. For
that reason we believe that the otherwise very successful PERFECT Club
benchmark [11] is not suitable for highly parallel systems. This is
demonstrated by the very sparse performance results for parallel machines
in the recent reports [13, 14, 8]. Alternatively, an application benchmark
could assume the availability of automatic software tools for transforming
"dusty deck" source into efficient parallel code on a variety of systems.
However, such tools do not exist today, and many scientists doubt that
they will ever exist across a wide range of architectures.
Some other considerations for the development of a meaningful benchmark
for a highly parallel supercomputer are the following:

- Advanced parallel systems frequently require new algorithmic and
  software approaches, and these new methods are often quite different
  from the conventional methods implemented in source code for a
  sequential or vector machine.

- Benchmarks must be "generic" and should not favor any particular
  parallel architecture. This requirement precludes the usage of any
  architecture-specific code, such as message passing code.

- The correctness of results and performance figures must be easily
  verifiable. This requirement implies that both input and output data
  sets must be kept very small. It also implies that the nature of the
  computation and the expected results must be specified in great detail.

- The memory size and run time requirements must be easily adjustable to
  accommodate new systems with increased power.

- The benchmark must be readily distributable.
In our view, the only benchmarking approach that satisfies all of these
constraints is a "paper and pencil" benchmark. The idea is to specify a
set of problems only algorithmically. Even the input data must be
specified only on paper. Naturally, the problem has to be specified in
sufficient detail that a unique solution exists, and the required output
has to be brief yet detailed enough to certify that the problem has been
solved correctly. The person or persons implementing the benchmarks on a
given system are expected to solve the various problems in the most
appropriate way for the specific system. The choice of data structures,
algorithms, processor allocation and memory usage are all (to the extent
allowed by the specification) left open to the discretion of the
implementer. Some extension of Fortran or C is required, and reasonable
limits are placed on the usage of assembly code and the like, but
otherwise programmers are free to utilize language constructs that give
the best performance possible on the particular system being studied.
To this end, we have devised a number of relatively simple "kernels",
which are specified completely in [6]. However, kernels alone are
insufficient to completely assess the performance potential of a parallel
machine on real scientific applications. The chief difficulty is that a
certain data structure may be very efficient on a certain system for one
of the isolated kernels, and yet this data structure would be
inappropriate if incorporated into a larger application. In other words,
the performance of a real computational fluid dynamics (CFD) application
on a parallel system is critically dependent on data motion between
computational kernels. Thus we consider the complete reproduction of this
data movement to be of critical importance in a benchmark.
Our benchmark set therefore consists of two major components: five
parallel kernel benchmarks and three simulated application benchmarks.
The simulated application benchmarks combine several computations in a
manner that resembles the actual order of execution in certain important
CFD application codes. This is discussed in more detail in [6].

We feel that this benchmark set successfully addresses many of the
problems associated with benchmarking parallel machines. Although we do
not claim that this set is typical of all scientific computing, it is
based on the key components of several large aeroscience applications
used by scientists on supercomputers at NASA Ames Research Center. These
benchmarks will be used by the Numerical Aerodynamic Simulation (NAS)
Program to evaluate the performance of parallel computers.
2 Benchmark Rules

2.1 Definitions

In the following, the term "processor" is defined as a hardware unit
capable of integer and floating point computation. The "local memory" of
a processor refers to randomly accessible memory with an access time
(latency) of less than one microsecond. The term "main memory" refers to
the combined local memory of all processors. This includes any memory
shared by all processors that can be accessed by each processor in less
than one microsecond. The term "mass storage" refers to non-volatile
randomly accessible storage media that can be accessed by at least one
processor within forty milliseconds. A "processing node" is defined as a
hardware unit consisting of one or more processors plus their local
memory, which is logically a single unit on the network that connects the
processors.

The term "computational nodes" refers to those processing nodes primarily
devoted to high-speed floating point computation. The term "service
nodes" refers to those processing nodes primarily devoted to system
operations, including compilation, linking and communication with
external computers over a network.
2.2 General Rules

Implementations of these benchmarks must be based on either Fortran-77 or
C, although a wide variety of parallel extensions are allowed. This
requirement stems from the observation that Fortran and C are the
programming languages most commonly used by the scientific parallel
computing community at the present time. If in the future other languages
gain wide acceptance in this community, they will be considered for
inclusion in this group. Assembly language and other low-level languages
and constructs may not be used, except that certain specific
vendor-supported assembly-coded library routines may be called (see
section 2.3).

We are of the opinion that such language restrictions are necessary,
because otherwise considerable effort would be spent by benchmarkers on
low-level or assembly-level coding. The benchmark results would then tend
to reflect the amount of programming resources available to the
benchmarking organization, rather than the fundamental merits of the
parallel system. Certainly the mainstream scientists that these parallel
computers are intended to serve will be coding applications at the source
level, almost certainly in Fortran or C, and thus these benchmarks are
designed to measure the performance that can be expected from such code.
Accordingly, the following rules must be observed in any implementations
of the NAS Parallel Benchmarks:

- All floating point operations must be performed using 64-bit floating
  point arithmetic.

- All benchmarks must be coded in either Fortran-77 [1] or C [3], with
  certain approved extensions.

- Implementations of the benchmarks may not mix Fortran-77 and C code;
  one or the other must be used.

- Any extension of Fortran-77 that is in the Fortran-90 draft dated June
  1990 or later [2] is allowed.

- Any extension of Fortran-77 that is in the Parallel Computer Fortran
  (PCF) draft dated March 1990 or later [5] is allowed.

- Any language extension or library routine that is employed in any of
  the benchmarks must be supported by the vendor and available to all
  users.

- Subprograms and library routines not written in Fortran or C may only
  perform certain functions, as indicated in the next section.

- All rules apply equally to subroutine calls, language extensions and
  compiler directives (i.e. special comments).
2.3 Allowable Language Extensions and Library Routines

The following language extensions and library routines are permitted:

- Constructs that indicate sections of code that can be executed in
  parallel or loops that can be distributed among different computational
  nodes.

- Constructs that specify the allocation and organization of data among
  or within computational nodes.

- Constructs that communicate data between processing nodes.

- Constructs that communicate data between the computational nodes and
  service nodes.

- Constructs that rearrange data stored in multiple computational nodes,
  including constructs to perform indirect addressing and array
  transpositions.

- Constructs that synchronize the action of different computational
  nodes.

- Constructs that initialize for a data communication or synchronization
  operation that will be performed or completed later.

- Constructs that perform high-speed input or output operations between
  main memory and the mass storage system.

- Constructs that perform any of the following array reduction operations
  on an array either residing within a single computational node or
  distributed among multiple nodes: +, ×, MAX, MIN, AND, OR, XOR.

- Constructs that combine communication between nodes with one of the
  operations listed in the previous item.

- Constructs that perform any of the following computational operations
  on arrays either residing within a single computational node or
  distributed among multiple nodes: dense matrix-matrix multiplication,
  dense matrix-vector multiplication and one-dimensional, two-dimensional
  or three-dimensional fast Fourier transforms. Such routines must be
  callable with general array dimensions.
3 The Benchmarks: A Condensed Overview

After an evaluation of a number of large scale CFD and computational
aerosciences applications on the NAS supercomputers at NASA Ames, five
medium-sized computational problems were selected as the "parallel
kernels". In addition to these problems, three different implicit
solution schemes were added to the benchmark set. These schemes are
representative of CFD codes currently in use at NASA Ames Research Center
in that they mimic the computational activities and data motions typical
of real CFD applications. They do not include the typical pre- and
postprocessing of real applications, nor do they include I/O. Boundary
conditions are also handled in a greatly simplified manner. For a
detailed discussion of the differences between the simulated application
benchmarks and real CFD applications, see Chapter 3 of [6].

Even the five parallel kernel benchmarks involve substantially larger
computations than many previous benchmarks, such as the Livermore Loops
or LINPACK, and therefore they are more appropriate for the evaluation of
parallel machines. They are sufficiently simple that they can be
implemented on a new system without unreasonable effort and delay. The
three simulated application benchmarks require somewhat more effort to
implement but constitute a rigorous test of the usability of a parallel
system to perform state-of-the-art CFD computations.
3.1 The Eight Benchmark Problems

The following gives an overview of the benchmarks. The first five are the
parallel kernel benchmarks, and the last three are the simulated
application benchmarks. Space does not permit a complete description of
all of these. A detailed description of these benchmark problems is given
in [6].

EP: An "embarrassingly parallel" kernel, which evaluates an integral by
means of pseudorandom trials. This kernel, in contrast to others in the
list, requires virtually no interprocessor communication.
MG: A simplified multigrid kernel. This requires highly structured long
distance communication and tests both short and long distance data
communication.
CG: A conjugate gradient method is used to compute an approximation to
the smallest eigenvalue of a large, sparse, symmetric positive definite
matrix. This kernel is typical of unstructured grid computations in that
it tests irregular long distance communication, employing unstructured
matrix-vector multiplication.
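The idea behind the CG kernel can be sketched as an inverse power
iteration in which each step solves a linear system by the conjugate
gradient method. This is an illustrative sketch only, not the benchmark's
specified scheme (which fixes a particular sparse matrix and iteration;
see [6]); the function names and the dense test matrix below are our own.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def smallest_eigenvalue(A, iters=50):
    """Inverse power iteration: each step solves A z = x with CG, so the
    iterates converge to the eigenvector of the smallest eigenvalue."""
    x = np.ones(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        z = conjugate_gradient(A, x)
        x = z / np.linalg.norm(z)
    return x @ (A @ x)       # Rayleigh quotient estimate
```

In the benchmark proper the matrix is large, sparse and unstructured, so
the dominant cost is the irregular sparse matrix-vector product inside
the CG loop.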
FT: A 3-D partial differential equation solution using FFTs. This kernel
performs the essence of many "spectral" codes. It is a rigorous test of
long-distance communication performance.
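The spectral technique that FT exercises can be illustrated in one
dimension: a heat equation u_t = alpha u_xx on a periodic grid is advanced
exactly in Fourier space, where each mode simply decays. This is a 1-D toy
sketch with our own naming, not the benchmark's 3-D problem.

```python
import numpy as np

def spectral_heat_step(u0, alpha, dt, length=2.0 * np.pi):
    """Advance u_t = alpha * u_xx on a periodic grid by one time step,
    exactly, by damping each Fourier mode by exp(-alpha * k^2 * dt)."""
    n = u0.size
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)  # wavenumbers
    u_hat = np.fft.fft(u0)                             # forward FFT
    u_hat *= np.exp(-alpha * k ** 2 * dt)              # exact mode decay
    return np.real(np.fft.ifft(u_hat))                 # inverse FFT
```

On a 2π-periodic grid the initial condition sin(x) decays to
exp(-alpha dt) sin(x), which makes the step easy to verify. In 3-D, the
forward and inverse transforms are what generate the long-distance
communication the kernel is designed to stress.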
IS: A large integer sort. This kernel performs a sorting operation that
is important in "particle method" codes. It tests both integer
computation speed and communication performance.
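The kind of sorting IS measures can be illustrated by key ranking via a
histogram (counting sort), a common approach for the small bounded
integer keys of particle methods. A minimal sketch, with hypothetical
names:

```python
def counting_sort(keys, max_key):
    """Sort non-negative integer keys bounded by max_key in O(n + max_key)
    time by histogramming, then emitting each value count times."""
    counts = [0] * (max_key + 1)
    for key in keys:
        counts[key] += 1              # histogram of key values
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result
```

In a distributed setting the histogram must be combined across nodes and
the keys redistributed, which is what couples integer speed to
communication performance in this kernel.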
LU: A regular-sparse, block (5 × 5) lower and upper triangular system
solution. This problem represents the computations associated with the
implicit operator of a newer class of implicit CFD algorithms, typified
at NASA Ames by the code "INS3D-LU". This problem exhibits a somewhat
limited amount of parallelism compared to the next two.
SP: Solution of multiple, independent systems of non diagonally dominant,
scalar, pentadiagonal equations. SP and the following problem BT are
representative of computations associated with the implicit operators of
CFD codes such as "ARC3D" at NASA Ames. SP and BT are similar in many
respects, but there is a fundamental difference with respect to the
communication to computation ratio.

BT: Solution of multiple, independent systems of non diagonally dominant,
block tridiagonal equations with a (5 × 5) block size.
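The recurrence structure underlying the SP and BT line solves can be
illustrated by the classic Thomas algorithm for a single scalar
tridiagonal system; the benchmarks themselves solve pentadiagonal (SP)
and 5 × 5 block tridiagonal (BT) systems, but the forward-elimination and
back-substitution pattern is the same. A sketch under our own conventions
(a[0] and c[n-1] are unused):

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system in O(n): a = sub-diagonal, b = main
    diagonal, c = super-diagonal, d = right-hand side."""
    n = len(b)
    cp = [0.0] * n                    # modified super-diagonal
    dp = [0.0] * n                    # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):             # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The sequential dependence of this recurrence along each grid line is
exactly what limits the available parallelism and shapes the
communication-to-computation ratios that distinguish LU, SP and BT.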
3.2 The Embarrassingly Parallel Benchmark

In order to give the reader a flavor of the problem descriptions in [6],
a detailed definition will be given for the first problem, the
"embarrassingly parallel" benchmark:

Set n = 2^28 and s = 271828183. Generate the pseudorandom floating point
values r_j in the interval (0, 1) for 1 ≤ j ≤ 2n using the scheme
described below. Then for 1 ≤ j ≤ n set x_j = 2 r_{2j-1} - 1 and
y_j = 2 r_{2j} - 1. Thus x_j and y_j are uniformly distributed on the
interval (-1, 1).

Next set k = 0, and beginning with j = 1, test to see if
t_j = x_j^2 + y_j^2 ≤ 1. If not, reject this pair and proceed to the
next j. If this inequality holds, then set k ← k + 1,
X_k = x_j sqrt((-2 log t_j)/t_j) and Y_k = y_j sqrt((-2 log t_j)/t_j),
where log denotes the natural logarithm. Then X_k and Y_k are independent
Gaussian deviates with mean zero and variance one. Approximately n π/4
pairs will be constructed in this manner.
Finally, for 0 ≤ l ≤ 9, tabulate Q_l as the count of the pairs (X_k, Y_k)
that lie in the square annulus l ≤ max(|X_k|, |Y_k|) < l + 1. Each of the
ten Q_l counts must agree exactly with reference values.

The 2n uniform pseudorandom numbers r_j mentioned above are to be
generated according to the following scheme: Set a = 5^13 and let x_0 = s
be the specified initial "seed". Generate the integers x_k for 1 ≤ k ≤ 2n
using the linear congruential recursion

    x_{k+1} = a x_k (mod 2^46)
lie in the square annulus l max(jX j; jY j) k k l counts. Each of the ten Q counts must agree exactly with reference values. l The 2n uniform pseudorandom numb ers r mentioned ab ove are to b e j 13 generated according to the following scheme: Set a =5 and let x = s be 0 the sp eci ed initial \seed". Generate the integers x for 1 k 2n using k the linear congruential recursion 46 x = ax (mo d 2 ) k +1 k