THE NAS PARALLEL BENCHMARKS

D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum,
R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber,
H. Simon, V. Venkatakrishnan and S. Weeratunga

RNR Technical Report RNR-94-007, March 1994

Abstract

A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of five parallel kernels and three simulated application benchmarks. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications.

The principal distinguishing feature of these benchmarks is their "pencil and paper" specification: all details of these benchmarks are specified only algorithmically. In this way, many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

This paper is an updated version of a RNR technical report originally issued January 1991. Bailey, Barszcz, Barton, Carter and Lasinski are with NASA Ames Research Center. Dagum, Fatoohi, Fineberg, Simon and Weeratunga are with Computer Sciences Corp. (work supported through NASA Contract NAS 2-12961). Schreiber is with RIACS (work funded via NASA cooperative agreement NCC 2-387). Browning is now in the Netherlands (work originally done with CSC). Frederickson is now with Cray Research, Inc. in Los Alamos, NM (work originally done with RIACS). Venkatakrishnan is now with ICASE in Hampton, VA (work originally done with CSC). Address for all authors except Browning, Frederickson and Venkatakrishnan: NASA Ames Research Center, Mail Stop TA, Moffett Field, CA.

GENERAL REMARKS

D. Bailey, D. Browning, R. Carter, S. Fineberg and H. Simon

Introduction

The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, is a large scale effort to advance the state of computational aerodynamics. Specifically, the NAS organization aims to provide the Nation's aerospace research and development community, by the year 2000, a high-performance, operational computing system capable of simulating an entire aerospace vehicle system within a computing time of one to several hours [1]. The successful solution of this "grand challenge" problem will require the development of computer systems that can perform the required complex scientific computations at a sustained rate nearly one thousand times greater than current generation supercomputers can now achieve. The architecture of computer systems able to achieve this level of performance will likely be dissimilar to the shared memory multiprocessing supercomputers of today. While no consensus yet exists on what the design will be, it is likely that the system will consist of at least 1000 processors computing in parallel.

Highly parallel systems with computing power roughly equivalent to traditional shared memory multiprocessors exist today. Unfortunately, the performance evaluation of these systems on comparable types of scientific computations is very difficult, for several reasons. Few relevant data are available for the performance of algorithms of interest to the computational aerophysics community on many currently available parallel systems. Benchmarking and performance evaluation of such systems has not kept pace with advances in hardware, software and algorithms. In particular, there is as yet no generally accepted benchmark program, or even a benchmark strategy, for these systems.

The popular "kernel" benchmarks that have been used for traditional vector supercomputers, such as the Livermore Loops [2], the LINPACK benchmark [3, 4] and the original NAS Kernels [5], are clearly inappropriate for the performance evaluation of highly parallel machines. First of all, the tuning restrictions of these benchmarks rule out many widely used parallel extensions. More importantly, the computation and memory requirements of these programs do not do justice to the vastly increased capabilities of the new parallel machines, particularly those systems that will be available by the mid-1990s.

(Footnote: Computer Sciences Corporation, El Segundo, California. This work is supported through NASA Contract NAS 2-12961.)

On the other hand, a full scale scientific application is similarly unsuitable. First of all, porting a large program to a new parallel computer architecture requires a major effort, and it is usually hard to justify a major research task simply to obtain a benchmark number. For that reason we believe that the otherwise very successful PERFECT Club benchmark [6] is not suitable for highly parallel systems. This is demonstrated by the very sparse performance results for parallel machines in the recent reports [7, 8, 9].

Alternatively, an application benchmark could assume the availability of automatic software tools for transforming "dusty deck" source into efficient parallel code on a variety of systems. However, such tools do not exist today, and many scientists doubt that they will ever exist across a wide range of architectures.

Some other considerations for the development of a meaningful benchmark for a highly parallel supercomputer are the following:

- Advanced parallel systems frequently require new algorithmic and software approaches, and these new methods are often quite different from the conventional methods implemented in source code for a sequential or vector machine.

- Benchmarks must be "generic" and should not favor any particular parallel architecture. This requirement precludes the usage of any architecture-specific code, such as message passing code.

- The correctness of results and performance figures must be easily verifiable. This requirement implies that both input and output data sets must be kept very small. It also implies that the nature of the computation and the expected results must be specified in great detail.

- The memory size and run time requirements must be easily adjustable to accommodate new systems with increased power.

- The benchmark must be readily distributable.

In our view, the only benchmarking approach that satisfies all of these constraints is a "paper and pencil" benchmark. The idea is to specify a set of problems only algorithmically. Even the input data must be specified only on paper. Naturally, the problem has to be specified in sufficient detail that a unique solution exists, and the required output has to be brief yet detailed enough to certify that the problem has been solved correctly. The person or persons implementing the benchmarks on a given system are expected to solve the various problems in the most appropriate way for the specific system. The choice of data structures, algorithms, processor allocation and memory usage are all (to the extent allowed by the specification) left open to the discretion of the implementer. Some extension of Fortran or C is required, and reasonable limits are placed on the usage of assembly code and the like, but otherwise programmers are free to utilize language constructs that give the best performance possible on the particular system being studied.

To this end, we have devised a number of relatively simple "kernels," which are specified completely in the next chapter of this document. However, kernels alone are insufficient to completely assess the performance potential of a parallel machine on real scientific applications. The chief difficulty is that a certain data structure may be very efficient on a certain system for one of the isolated kernels, and yet this data structure would be inappropriate if incorporated into a larger application. In other words, the performance of a real computational fluid dynamics (CFD) application on a parallel system is critically dependent on data motion between computational kernels. Thus we consider the complete reproduction of this data movement to be of critical importance in a benchmark.

Our benchmark set therefore consists of two major components: five parallel kernel benchmarks and three simulated application benchmarks. The simulated application benchmarks combine several computations in a manner that resembles the actual order of execution in certain important CFD application codes. This is discussed in more detail in a later chapter.

We feel that this benchmark set successfully addresses many of the problems associated with benchmarking parallel machines. Although we do not claim that this set is typical of all scientific computing, it is based on the key components of several large aeroscience applications used on supercomputers by scientists at NASA Ames Research Center. These benchmarks will be used by the Numerical Aerodynamic Simulation (NAS) Program to evaluate the performance of parallel computers.

Benchmark Rules

Definitions

In the following, the term "processor" is defined as a hardware unit capable of executing both floating point addition and floating point multiplication instructions. The "local memory" of a processor refers to randomly accessible memory that can be accessed by that processor in less than one microsecond. The term "main memory" refers to the combined local memory of all processors. This includes any memory shared by all processors that can be accessed by each processor in less than one microsecond. The term "mass storage" refers to non-volatile, randomly accessible storage media that can be accessed by at least one processor within forty milliseconds. A "processing node" is defined as a hardware unit consisting of one or more processors plus their local memory, which is logically a single unit on the network that connects the processors.

The term "computational nodes" refers to those processing nodes primarily devoted to high-speed floating point computation. The term "service nodes" refers to those processing nodes primarily devoted to system operations, including compilation, linking and communication with external computers over a network.

General rules

Implementations of these benchmarks must be based on either Fortran 90 (which includes Fortran 77 as a subset) or C, although a wide variety of parallel extensions are allowed. This requirement stems from the observation that Fortran and C are the most commonly used programming languages by the scientific parallel computing community at the present time. If in the future other languages gain wide acceptance in this community, they will be considered for inclusion in this group. Assembly language and other low-level languages and constructs may not be used, except that certain specific vendor-supported, assembly-coded library routines may be called (see the next section).

We are of the opinion that such language restrictions are necessary, because otherwise considerable effort would be made by benchmarkers in low-level or assembly-level coding. Then the benchmark results would tend to reflect the amount of programming resources available to the benchmarking organization, rather than the fundamental merits of the parallel system. Certainly the mainstream scientists that these parallel computers are intended to serve will be coding applications at the source level, almost certainly in Fortran or C, and thus these benchmarks are designed to measure the performance that can be expected from such code.

Accordingly, the following rules must be observed in any implementations of the NAS Parallel Benchmarks:

- All floating point operations must be performed using 64-bit floating point arithmetic.

- All benchmarks must be coded in either Fortran or C, with certain approved extensions.

- Implementation of the benchmarks may not include a mix of Fortran and C code; one or the other must be used.

- Any extension of Fortran that is in the High Performance Fortran (HPF) draft dated January 1992 or later [12] is allowed.

- Any language extension or library routine that is employed in any of the benchmarks must be supported by the vendor and available to all users.

- Subprograms and library routines not written in Fortran or C may only perform certain functions, as indicated in the next section.

- All rules apply equally to subroutine calls, language extensions and compiler directives (i.e., special comments).

Allowable Fortran extensions and library routines

Fortran extensions and library routines are also permitted that perform the following:

- Indicate sections of code that can be executed in parallel or loops that can be distributed among different computational nodes.

- Specify the allocation and organization of data among or within computational nodes.

- Communicate data between processing nodes.

- Communicate data between the computational nodes and service nodes.

- Rearrange data stored in multiple computational nodes, including constructs to perform indirect addressing and array transpositions.

- Synchronize the action of different computational nodes.

- Initialize for a data communication or synchronization operation that will be performed or completed later.

- Perform high-speed input or output operations between main memory and the mass storage system.

- Perform any of the following array reduction operations on an array either residing within a single computational node or distributed among multiple nodes: MAX, MIN, AND, OR, XOR.

- Combine communication between nodes with one of the operations listed in the previous item.

- Perform any of the following computational operations on arrays either residing within a single computational node or distributed among multiple nodes: dense or sparse matrix multiplication, dense or sparse matrix-vector multiplication, one-dimensional, two-dimensional or three-dimensional fast Fourier transforms, sorting, block tridiagonal system solution and block pentadiagonal system solution. Such routines must be callable with general array dimensions.

Sample Codes

The intent of this paper is to completely specify the computation to be carried out. Theoretically, a complete implementation, including the generation of the correct input data, could be produced from the information in this paper. However, the developers of these benchmarks are aware of the difficulty and time required to generate a correct implementation from scratch in this manner. Furthermore, despite several reviews, ambiguities in this technical paper may exist that could delay implementations.

In order to reduce the number of difficulties, and to aid the benchmarking specialist, Fortran computer programs implementing the benchmarks are available. These codes are to be considered examples of how the problems could be solved on a single processor system, rather than statements of how they should be solved on an advanced parallel system. The sample codes actually solve scaled down versions of the benchmarks that can be run on many current generation workstations. Instructions are supplied in comments in the source code on how to scale up the program parameters to the full size benchmark specifications.

These programs, as well as the benchmark document itself, are available from the Systems Development Branch in the NAS Systems Division, Mail Stop …, NASA Ames Research Center, Moffett Field, CA 94035, attn: NAS Parallel Benchmark Codes. The sample codes are provided on Macintosh floppy disks and contain the Fortran source codes, "ReadMe" files, input data files, and reference output data files for correct implementations of the benchmark problems. These codes have been validated on a number of computer systems, ranging from conventional workstations to supercomputers.

Three classes of problems are defined in this document. These will be denoted "Sample Code," "Class A" and "Class B," since the three classes differ mainly in the sizes of principal arrays. The three tables below give the problem sizes, memory requirements (measured in Mw), run times and performance rates (measured in Mflops) for each of the eight benchmarks, and for the Sample Code, Class A and Class B problem sets. These statistics are based on implementations on one processor of a Cray Y-MP. The operation count for the Integer Sort benchmark is based on integer operations rather than floating-point operations. The entries in the "Problem size" columns are sizes of key problem parameters. Complete descriptions of these parameters are given in the chapters that follow.

Submission of Benchmark Results

It must be emphasized that the sample codes described in the previous section are not the benchmark codes, but only implementation aids. For the actual benchmarks, the sample codes must be scaled to larger problem sizes. The sizes of the current benchmarks were chosen so that implementations are possible on currently available supercomputers. As parallel computer technology progresses, future releases of these benchmarks will specify larger problem sizes.

Benchmark code                   Problem size   Memory (Mw)   Time (sec)   Rate (Mflops)
Embarrassingly parallel (EP)     …              …             …            …
Multigrid (MG)                   …              …             …            …
Conjugate gradient (CG)          …              …             …            …
3-D FFT PDE (FT)                 …              …             …            …
Integer sort (IS)                …              …             …            …
LU solver (LU)                   …              …             …            …
Pentadiagonal solver (SP)        …              …             …            …
Block tridiagonal solver (BT)    …              …             …            …

Table 1: NAS Parallel Benchmarks Sample Code Statistics

The authors and developers of these benchmarks encourage submission of performance results for the problems listed in the tables above. Periodic publication of the submitted results is planned. Benchmark results should be submitted to the Applied Research Branch, NAS Systems Division, Mail Stop T…, NASA Ames Research Center, Moffett Field, CA 94035, attn: NAS Parallel Benchmark Results. A complete submission of results should include the following:

- A detailed description of the hardware and software configuration used for the benchmark runs.

- A description of the implementation and algorithmic techniques used.

- Source listings of the benchmark codes.

- Output listings from the benchmarks.

Benchmark code                   Problem size   Memory (Mw)   Time (sec)   Rate (Mflops)
Embarrassingly parallel (EP)     …              …             …            …
Multigrid (MG)                   …              …             …            …
Conjugate gradient (CG)          …              …             …            …
3-D FFT PDE (FT)                 …              …             …            …
Integer sort (IS)                …              …             …            …
LU solver (LU)                   …              …             …            …
Pentadiagonal solver (SP)        …              …             …            …
Block tridiagonal solver (BT)    …              …             …            …

Table 2: NAS Parallel Benchmarks Class A Statistics

Benchmark code                   Problem size   Memory (Mw)   Time (sec)   Rate (Mflops)
Embarrassingly parallel (EP)     …              …             …            …
Multigrid (MG)                   …              …             …            …
Conjugate gradient (CG)          …              …             …            …
3-D FFT PDE (FT)                 …              …             …            …
Integer sort (IS)                …              …             …            …
LU solver (LU)                   …              …             …            …
Pentadiagonal solver (SP)        …              …             …            …
Block tridiagonal solver (BT)    …              …             …            …

Table 3: NAS Parallel Benchmarks Class B Statistics

References

[1] Numerical Aerodynamic Simulation Program Plan. NAS Systems Division, Ames Research Center, October 1988.

[2] McMahon, F. H.: "The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range." Technical Report UCRL-53745, Lawrence Livermore National Laboratory, Livermore, Calif., Dec. 1986.

[3] Dongarra, J. J.: "The LINPACK Benchmark: An Explanation." SuperComputing, Spring 1988, pp. 10-14.

[4] Dongarra, J. J.: "Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment." TR MCSRD 23, Argonne National Laboratory, March 1988.

[5] Bailey, D. H., and Barton, J.: "The NAS Kernel Benchmark Program." Technical Report 86711, Ames Research Center, Moffett Field, Calif., August 1985.

[6] Berry, M., et al.: "The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers." The International Journal of Supercomputer Applications, vol. 3, 1989, pp. 5-40.

[7] Pointer, L.: "PERFECT Report: 1." Technical Report CSRD-896, Univ. of Illinois, Urbana, Ill., July 1989.

[8] Pointer, L.: "PERFECT Report: 2: Performance Evaluation for Cost-effective Transformations." Technical Report CSRD-964, Univ. of Illinois, Urbana, Ill., March 1990.

[9] Cybenko, G., Kipp, L., Pointer, L., and Kuck, D.: "Supercomputer Performance Evaluation and the Perfect Benchmarks." Technical Report CSRD-965, Univ. of Illinois, Urbana, Ill., March 1990.

[10] American National Standard Programming Language Fortran X3.9-1978. American National Standards Institute, 1430 Broadway, New York, NY 10018.

[11] Draft Proposed American National Standard Programming Language C, X3J11. American National Standards Institute, 1430 Broadway, New York, NY 10018.

[12] High Performance Fortran (HPF) Language Specification, Version 1.0 Draft. Center for Research on Parallel Computation, Rice University, Houston, TX, January 1993.

THE KERNEL BENCHMARKS

D. Bailey, E. Barszcz, L. Dagum*, P. Frederickson†, R. Schreiber† and H. Simon*

Overview

After an evaluation of a number of large scale CFD and computational aerosciences applications on the NAS supercomputers at NASA Ames, a number of kernels were selected for the benchmark. These were supplemented by some other kernels which are intended to test specific features of parallel machines. The following benchmark set was then assembled:

EP: An "embarrassingly parallel" kernel. It provides an estimate of the upper achievable limits for floating point performance, i.e., the performance without significant interprocessor communication.

MG: A simplified multigrid kernel. It requires highly structured long distance communication and tests both short and long distance data communication.

CG: A conjugate gradient method is used to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. This kernel is typical of unstructured grid computations in that it tests irregular long distance communication, employing unstructured matrix vector multiplication.

FT: A 3-D partial differential equation solution using FFTs. This kernel performs the essence of many "spectral" codes. It is a rigorous test of long-distance communication performance.

IS: A large integer sort. This kernel performs a sorting operation that is important in "particle method" codes. It tests both integer computation speed and communication performance.

* Computer Sciences Corporation. This work is supported through NASA Contract NAS 2-12961.

† Research Institute for Advanced Computer Science (RIACS), Ames Research Center. This work is supported by NAS Systems Division through Cooperative Agreement Number NCC 2-387.

These kernels involve substantially larger computations than previous kernel benchmarks, such as the Livermore Loops or LINPACK, and therefore they are more appropriate for the evaluation of parallel machines. The Parallel Kernels in particular are sufficiently simple that they can be implemented on a new system without unreasonable effort and delay. Most importantly, as emphasized earlier, this set of benchmarks incorporates a new concept in performance evaluation, namely that only the computational task is specified, and that the actual implementation of the kernel can be tailored to the specific architecture of the parallel machine.

In this chapter the Parallel Kernel benchmarks are presented, and the particular rules for allowable changes are discussed. Future reports will describe implementations and benchmarking results on a number of parallel supercomputers.

Description of the Kernels

Kernel EP: An embarrassingly parallel benchmark

D. Bailey and P. Frederickson

Brief Statement of Problem

Generate pairs of Gaussian random deviates according to a specific scheme described below, and tabulate the number of pairs in successive square annuli.

Detailed Description

Set n = 2^28, a = 5^13 and s = 271828183. Generate the pseudorandom floating point values r_j in the interval (0, 1) for 1 ≤ j ≤ 2n, using the pseudorandom number scheme described later in this chapter. Then for 1 ≤ j ≤ n set x_j = 2 r_{2j-1} - 1 and y_j = 2 r_{2j} - 1. Thus x_j and y_j are uniformly distributed on the interval (-1, 1).

Next set k = 0. Then beginning with j = 1, test to see if t_j = x_j^2 + y_j^2 ≤ 1. If not, reject this pair and proceed to the next j. If this inequality holds, then set k ← k + 1, X_k = x_j sqrt((-2 log t_j)/t_j) and Y_k = y_j sqrt((-2 log t_j)/t_j), where log denotes the natural logarithm. Then X_k and Y_k are independent Gaussian deviates with mean zero and variance one. Approximately n π/4 pairs will be constructed in this manner. See Knuth (The Art of Computer Programming, vol. 2) for additional discussion of this scheme for generating Gaussian deviates. The computation of Gaussian deviates must employ the above formula, and vendor-supplied intrinsic routines must be employed for performing all square root and logarithm evaluations.

l     Class A Q_l     Class B Q_l
0     …               …
1     …               …
2     …               …
3     …               …
4     …               …
5     …               …
6     …               …
7     …               …
8     …               …
9     …               …

Table 4: EP Benchmark Counts

Finally, for 0 ≤ l ≤ 9, tabulate Q_l as the count of the pairs (X_k, Y_k) that lie in the square annulus l ≤ max(|X_k|, |Y_k|) < l + 1, and output the ten Q_l counts. The two sums Σ_k X_k and Σ_k Y_k must also be output.

This completes the definition of the Class A problem. The Class B problem is the same except that n = 2^30.
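The pair generation, rejection test and annulus tabulation above can be sketched as follows. This is an expository Python sketch only: the benchmark itself must be coded in Fortran or C, and it mandates the specific pseudorandom number scheme of this chapter, for which Python's `random.Random` is merely a stand-in here.

```python
import math
import random

def ep_pairs(n, rng):
    """Illustrative sketch of kernel EP.  `rng` stands in for the
    benchmark's specified pseudorandom number generator."""
    sum_x = sum_y = 0.0
    q = [0] * 10                        # annulus counts Q_0 .. Q_9
    for _ in range(n):
        x = 2.0 * rng.random() - 1.0    # uniform on (-1, 1)
        y = 2.0 * rng.random() - 1.0
        t = x * x + y * y
        if t > 1.0 or t == 0.0:         # reject pairs outside the unit disk
            continue                    # (the t == 0 guard is an added safety check)
        f = math.sqrt(-2.0 * math.log(t) / t)
        gx, gy = x * f, y * f           # independent Gaussian deviates
        sum_x += gx
        sum_y += gy
        l = int(max(abs(gx), abs(gy)))  # square annulus index
        if l < 10:
            q[l] += 1
    return sum_x, sum_y, q
```

As the text notes, roughly π/4 of the candidate pairs survive the rejection test, and the only communication a parallel implementation needs is the final combination of the sums and counts.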

Verification Test

The two sums Σ_k X_k and Σ_k Y_k must agree with reference values to within one part in 10^12. For the Class A problem the reference values are … and …, while for the Class B problem the sums are … and ….

Each of the ten Q_l counts must agree exactly with the reference values, which are given in the EP Benchmark Counts table above.

Operations to be Timed

All of the operations described above are to be timed, including tabulation and output.

Other Features

- This problem is typical of many Monte Carlo simulation applications.

- The only requirement for communication is the combination of the sums from various processors at the end.

- Separate sections of the uniform pseudorandom numbers can be independently computed on separate processors; see the discussion of the pseudorandom number scheme later in this chapter for details.

- The smallest distance between a floating-point value and a nearby integer among the r_j, X_k and Y_k values is …, which is well above the achievable accuracy using 64-bit floating arithmetic on existing computer systems. Thus if a truncation discrepancy occurs, it implies a problem with the system hardware or software.

Kernel MG: A simple 3D multigrid benchmark

E. Barszcz and P. Frederickson

Brief Statement of Problem

Four iterations of the V-cycle multigrid algorithm described below are used to obtain an approximate solution u to the discrete Poisson problem

   ∇²u = v

on a 256 × 256 × 256 grid with periodic boundary conditions.

Detailed Description

Set v = 0 except at the twenty points listed in table 5, where v = ±1. These points were determined as the locations of the ten largest and ten smallest pseudorandom numbers, generated as in Kernel FT.

Begin the iterative solution with u = 0. Each of the four iterations consists of the following two steps, in which k = log₂(256) = 8:

   r = v - A u        (evaluate residual)
   u = u + M^k r      (apply correction)

Here M^k denotes the V-cycle multigrid operator, defined in the V-cycle multigrid operator table below.

(i, j, k)     v_{i,j,k}
…             …

Table 5: Nonzero values for v

z_k = M^k r_k:

if k > 1:
    r_{k-1} = P r_k                (restrict residual)
    z_{k-1} = M^{k-1} r_{k-1}      (recursive solve)
    z_k = Q z_{k-1}                (prolongate)
    r_k = r_k - A z_k              (evaluate residual)
    z_k = z_k + S r_k              (apply smoother)
else:
    z_1 = S r_1                    (apply smoother)

Table 6: V-cycle multigrid operator
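The recursive structure of the V-cycle operator above can be sketched compactly. This is an illustrative Python sketch, not the sample code: the grid operators A (fine-grid operator), P (restriction), Q (prolongation) and S (smoother) are passed in as functions, and the scalar stand-ins in the usage below are not the benchmark's stencils.

```python
def vcycle(k, r, A, P, Q, S):
    """Compute z = M^k r following the V-cycle definition:
    restrict, solve recursively, prolongate, re-evaluate the
    residual, then smooth."""
    if k > 1:
        rc = P(r)                              # r_{k-1} = P r_k
        z = Q(vcycle(k - 1, rc, A, P, Q, S))   # z_k = Q M^{k-1} r_{k-1}
        rk = r - A(z)                          # r_k = r_k - A z_k
        return z + S(rk)                       # z_k = z_k + S r_k
    return S(r)                                # coarsest level: z_1 = S r_1
```

With scalar stand-in operators such as A(z) = 2z, P(r) = r/2, Q(z) = 2z and S(r) = r/2, the recursion can be traced by hand at any depth k.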

In this definition, A denotes the trilinear finite element discretization of the Laplacian ∇², normalized as indicated in table 7, where the coefficients of P, Q and S are also listed.

In this table, c_0 denotes the central coefficient of the 27-point operator, when these coefficients are arranged as a 3 × 3 × 3 cube. Thus c_0 is the coefficient that multiplies the value at the grid point (i, j, k), while c_1 multiplies the six values at grid points which differ by one in exactly one index, c_2 multiplies the next closest twelve values, those that differ by one in exactly two indices, and c_3 multiplies the eight values located at grid points that differ by one in all three indices. The restriction operator P given in this table is the trilinear projection operator of finite element theory, normalized so that the coefficients of all operators are independent of level, and is half the transpose of the trilinear interpolation operator Q.

C       c_0       c_1       c_2       c_3
A       -8/3      0         1/6       1/12
P       1/2       1/4       1/8       1/16
Q       1         1/2       1/4       1/8
S(a)    -3/8      +1/32     -1/64     0
S(b)    -3/17     +1/33     -1/61     0

Table 7: Coefficients for the trilinear finite element discretization
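The grouping of the 27 coefficients by index distance described above can be exercised with a periodic stencil application. This is an expository Python/NumPy sketch, not the sample code, and the A-row values used in the usage example (-8/3, 0, 1/6, 1/12) are the reconstructed coefficients, assumed rather than quoted from a verified table.

```python
import numpy as np

def apply_27pt(u, c):
    """Apply a 27-point operator on a periodic grid.  Coefficient c[d]
    multiplies the neighbors whose indices differ by one in exactly d
    of the three dimensions (d = 0 is the central point), matching the
    c_0..c_3 grouping described in the text."""
    out = c[0] * u
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                d = abs(dx) + abs(dy) + abs(dz)
                if d == 0:
                    continue   # central point already counted via c[0]
                out = out + c[d] * np.roll(np.roll(np.roll(u, dx, 0), dy, 1), dz, 2)
    return out
```

A quick consistency check on the assumed A coefficients: applied to a constant field the operator returns zero, since -8/3 + 6·0 + 12·(1/6) + 8·(1/12) = 0, as any discretization of the Laplacian must.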

Verification Test

Class A: Evaluate the residual after four iterations of the V-cycle multigrid algorithm, using the coefficients from the S(a) row for the smoothing operator S, and verify that its L2 norm

   ||r||_2 = ( 256^-3 Σ_{i,j,k} r_{i,j,k}^2 )^{1/2}

agrees with the reference value 0.2433365309 × 10^-5 within an absolute tolerance of 10^-14.

Class B: The array size is the same as for Class A, but 20 iterations must be performed, using the coefficients from the S(b) row for the smoothing operator S. The output L2 norm must agree with the reference value 0.180056440132 × 10^-5 within an absolute tolerance of 10^-14.

Timing

Start the clock before evaluating the residual for the first time, and after initializing u and v. Stop the clock after evaluating the norm of the final residual, but before displaying or printing its value.

Kernel CG: Solving an unstructured sparse linear system by the conjugate gradient method

R. Schreiber, H. Simon and R. Carter

Brief Statement of Problem

This benchmark uses the inverse power method to find an estimate of the largest eigenvalue of a symmetric positive definite sparse matrix with a random pattern of nonzeros.

Detailed Description

In the following, A denotes the sparse matrix of order n, lower case Roman letters are vectors, x_j is the j-th component of the vector x, and superscript "T" indicates the usual transpose operator. Lower case Greek letters are scalars. We denote by ||x|| the Euclidean norm of a vector x, ||x|| = sqrt(x^T x). All quantities are real. The inverse power method is to be implemented as follows:

   x = [1, 1, ..., 1]^T
   (start timing here)
   DO it = 1, niter
      Solve the system Az = x and return ||r||, as described below
      ζ = λ + 1/(x^T z)
      Print it, ||r|| and ζ
      x = z/||z||
   ENDDO
   (stop timing here)
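The outer iteration can be sketched as below. This is an illustrative Python sketch with two labeled assumptions: the eigenvalue update ζ = λ + 1/(xᵀz) is the reconstruction of the garbled update line, and a dense direct solve stands in for the conjugate gradient routine the benchmark actually specifies.

```python
import numpy as np

def inverse_power(A, lam, niter, solve):
    """Outer (inverse power) iteration.  `solve` returns
    (z, residual_norm) for A z = x; here a dense solver is the
    stand-in, not the benchmark's CG routine."""
    x = np.ones(A.shape[0])
    zeta = lam
    for _ in range(niter):
        z, rnorm = solve(A, x)
        zeta = lam + 1.0 / (x @ z)    # eigenvalue estimate (assumed update)
        x = z / np.linalg.norm(z)     # normalize for the next iteration
    return zeta
```

On a toy diagonal matrix with zero shift, the iteration converges to the eigenvalue of A closest to the shift.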

Values for the size of the system n, the number of outer iterations niter, and the shift λ for the three different problem sizes are provided in the CG input parameter table below. The solution z to the linear system of equations Az = x is to be approximated using the conjugate gradient (CG) method. This method is to be implemented as follows:

   z = 0
   r = x
   ρ = r^T r
   p = r
   DO i = 1, 25
      q = Ap
      α = ρ/(p^T q)
      z = z + αp
      ρ₀ = ρ
      r = r - αq
      ρ = r^T r
      β = ρ/ρ₀
      p = r + βp
   ENDDO
   compute residual norm explicitly: ||r|| = ||x - Az||
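The CG pseudocode transcribes almost line for line. The sketch below is illustrative Python/NumPy (the benchmark must use Fortran or C and a sparse matrix), and the 25-iteration inner loop count is an assumption recovered from the garbled DO statement.

```python
import numpy as np

def cg_solve(A, x, n_iters=25):
    """Conjugate gradient iteration following the pseudocode above.
    A dense NumPy matrix stands in for the benchmark's sparse matrix."""
    z = np.zeros_like(x)
    r = x.copy()
    rho = r @ r                    # rho = r^T r
    p = r.copy()
    for _ in range(n_iters):
        q = A @ p
        alpha = rho / (p @ q)
        z = z + alpha * p
        rho0 = rho                 # save rho before updating it
        r = r - alpha * q
        rho = r @ r
        beta = rho / rho0
        p = r + beta * p
    return z, np.linalg.norm(x - A @ z)   # explicit residual ||x - Az||
```

For a small symmetric positive definite system, the explicitly computed residual after the full inner loop is at the level of machine rounding.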

Size       n        niter   NONZER   λ
Sample     1400     15      7        10
Class A    14000    15      11       20
Class B    75000    75      13       60

Table 8: Input parameters for CG benchmark

Verification Test

The program should print, at every outer iteration of the power method, the iteration number it, the eigenvalue estimate ζ, and the Euclidean norm ||r|| of the residual vector at the last CG iteration (the vector r in the discussion of CG above). For each problem size, the computed value of ζ must agree with the reference value ζ_REF within a relative tolerance of 10^-10, i.e., |ζ - ζ_REF| ≤ 10^-10 |ζ_REF|. These reference values ζ_REF are provided in the CG output parameter table below.

Timing

The rep orted time must b e the wallclo ck time required to compute all

niter iterations and print the results after the matrix is generated and down

loaded into the system and after the initialization of the starting vector x

Computed FLOP mem

Size nonzeros MW

RE F

Sample

Class A

Class B

Table Output parameters for CG b enchmark

It is permissible initially to reorganize the sparse matrix data structure (arow, acol, aelt), which is produced by the matrix generation routine described below, into a data structure better suited for the target machine. The original or the reorganized sparse matrix data structure can then be subsequently used in the conjugate gradient iteration. Time spent in the initial reorganization of the data structure will not be counted towards the benchmark time.

It is also permissible to use several different data structures for the matrix A, keep multiple copies of the matrix A, or write A to mass storage and read it back in. However, the time for any data movements which take place within the power iterations (outer iteration) or within the conjugate gradient iterations (inner iteration) must be included in the reported time.

However, the matrix A must be used explicitly. By saving the random sparse vectors x used in makea (see below), it is possible to reformulate the sparse matrix-vector multiply operation in such a way that communication is substantially reduced to only a few dense vectors, and sparse operations are restricted to the processing nodes. Although this scheme of matrix-vector multiplication technically satisfied the original rules of the CG benchmark, it defeats this benchmark's intended purpose of measuring random communication performance. Therefore this scheme is no longer allowed, and results employing implicit matrix-vector multiplication based on the outer product representation of the sparse matrix are no longer considered to be valid benchmark results.

Other Features

The input sparse matrix A is generated by a Fortran subroutine called makea, which is provided on the sample code disk described in the Sample Codes section of this document. In this program, the random number generator is initialized with a = 5^13 and s = 271828183. Then the subroutine makea is called to generate the matrix A. This program may not be changed. In the makea subroutine, the matrix A is represented by the following Fortran variables:

N (INTEGER): the number of rows and columns.

NZ (INTEGER): the number of nonzeros.

A (REAL): array of NZ nonzeros.

IA (INTEGER): array of NZ row indices. Element A(K) is in row IA(K) for all 1 ≤ K ≤ NZ.

JA (INTEGER): array of N+1 pointers to the beginnings of columns. Column J of the matrix is stored in positions JA(J) through JA(J+1)-1 of A and IA; JA(N+1) contains NZ+1.

The code generates the matrix as the weighted sum of N outer products of random sparse vectors x:

    A = \sum_{i=1}^{N} \omega_i \, x_i x_i^T

where the weights \omega_i are a geometric sequence with \omega_1 = 1 and the ratio chosen so that \omega_N = 0.1. The vectors x_i are chosen to have a few randomly placed nonzeros, each of which is a sample from the uniform distribution on (0, 1). Furthermore, the i-th element of x_i is set to 1/2 to insure that A cannot be structurally singular. Finally, 0.1 is added to the diagonal of A. This results in a matrix whose condition number (the ratio of its largest eigenvalue to its smallest) is roughly 10. The number of randomly chosen elements of x is provided for each problem size in the NONZER column of the CG parameter table, and the final number of nonzeros of A is listed there in the computed-nonzeros column. As implemented in the sample codes, the shift of the main diagonal of A is the final task in subroutine makea; values for the shift are also provided in that table.

The data structures used are these. First, a list of triples (arow, acol, aelt) is constructed. Each of these represents an element in row i (= arow), column j (= acol), with value a_{ij} (= aelt). When the arow and acol entries of two of these triples coincide, the values in their aelt fields are added together in creating a_{ij}. The process of assembling the matrix data structures from the list of triples, including the process of adding coincident entries, is done by the subroutine sparse, which is called by makea and is also provided. For examples and more details on this sparse data structure, consult the book by Duff, Erisman and Reid.
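As an illustration only (not part of the benchmark specification), the construction above can be sketched in dense form with NumPy; the function name and parameters here are hypothetical:

```python
import numpy as np

def make_matrix_sketch(n, nonzer, rng, ratio_n=0.1, shift=0.1):
    """Dense-form sketch of the makea construction: A is the weighted sum
    of n outer products of random sparse vectors, with omega_1 = 1 and the
    geometric ratio chosen so that omega_n = ratio_n; x_i[i] = 0.5 guards
    against structural singularity, and `shift` is added to the diagonal."""
    ratio = ratio_n ** (1.0 / (n - 1))
    a = np.zeros((n, n))
    omega = 1.0
    for i in range(n):
        x = np.zeros(n)
        # a few randomly placed nonzeros (coincident picks simply overwrite here)
        x[rng.integers(0, n, size=nonzer)] = rng.random(nonzer)
        x[i] = 0.5
        a += omega * np.outer(x, x)
        omega *= ratio
    return a + shift * np.eye(n)
```

In the benchmark itself the matrix is never formed densely: makea emits (arow, acol, aelt) triples, and the provided subroutine sparse assembles them, summing coincident entries.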

Kernel FT: a 3-D fast Fourier transform partial differential equation benchmark

D. Bailey and P. Frederickson

Brief Statement of Problem

Numerically solve a certain partial differential equation (PDE) using forward and inverse FFTs.

Detailed Description

Consider the PDE

    \partial u(x, t) / \partial t = \alpha \nabla^2 u(x, t)

where x is a position in three-dimensional space. When a Fourier transform is applied to each side, this equation becomes

    \partial v(z, t) / \partial t = -4 \alpha \pi^2 |z|^2 \, v(z, t)

where v(z, t) is the Fourier transform of u(x, t). This has the solution

    v(z, t) = e^{-4 \alpha \pi^2 |z|^2 t} \, v(z, 0)

Now consider the discrete version of the original PDE. Following the above steps, it can be solved by computing the forward 3-D discrete Fourier transform (DFT) of the original state array u(x), multiplying the results by certain exponentials, and then performing an inverse 3-D DFT. The forward

DFT and inverse DFT of the n1 x n2 x n3 array u are defined, respectively, as

    F(u)_{q,r,s} = \sum_{l=0}^{n3-1} \sum_{k=0}^{n2-1} \sum_{j=0}^{n1-1} u_{j,k,l} \, e^{-2\pi i\, jq/n1} \, e^{-2\pi i\, kr/n2} \, e^{-2\pi i\, ls/n3}

    F^{-1}(u)_{q,r,s} = \frac{1}{n1\, n2\, n3} \sum_{l=0}^{n3-1} \sum_{k=0}^{n2-1} \sum_{j=0}^{n1-1} u_{j,k,l} \, e^{2\pi i\, jq/n1} \, e^{2\pi i\, kr/n2} \, e^{2\pi i\, ls/n3}
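These happen to be the same conventions used by NumPy's FFT routines (negative exponent and no scaling forward; positive exponent with 1/(n1 n2 n3) scaling inverse), so a sketch of the pair is simply:

```python
import numpy as np

def forward_dft(u):
    # Negative exponent, unscaled: matches the definition of F above.
    return np.fft.fftn(u)

def inverse_dft(v):
    # Positive exponent with 1/(n1*n2*n3) scaling: matches F^{-1} above.
    return np.fft.ifftn(v)
```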

The specific problem to be solved for the Class A benchmark is as follows. Set n1 = 256, n2 = 256 and n3 = 128. Generate 2 n1 n2 n3 64-bit pseudorandom floating point values, using the pseudorandom number generator described later in this chapter, starting with the initial seed 314159265. Then fill the complex array U_{j,k,l}, 0 <= j < n1, 0 <= k < n2, 0 <= l < n3, with this data, where the first dimension varies most rapidly as in the ordering of a 3-D Fortran array. A single complex number entry of U consists of two consecutive pseudorandomly generated results. Compute the forward 3-D DFT of U, using a 3-D fast Fourier transform (FFT) routine, and call the result V. Set \alpha = 10^{-6} and set t = 1. Then compute

    W_{j,k,l} = e^{-4 \alpha \pi^2 (\bar{j}^2 + \bar{k}^2 + \bar{l}^2) t} \, V_{j,k,l}

where \bar{j} is defined as j for 0 <= j < n1/2 and j - n1 for n1/2 <= j < n1. The indices \bar{k} and \bar{l} are similarly defined with n2 and n3. Then compute an inverse 3-D DFT on W, using a 3-D FFT routine, and call the result the array X. Finally, compute the complex checksum \sum_{j=1}^{1024} X_{q,r,s}, where q = j (mod n1), r = 3j (mod n2) and s = 5j (mod n3). After the checksum for this t has been output, increment t by one. Then repeat the above process, from the computation of W through the incrementing of t, until the step t = N has been completed. In this benchmark N = 6. The V array and the array of exponential terms for t = 1 need only be computed once. Note that the array of exponential terms for t > 1 can be obtained as the t-th power of the array for t = 1.

This completes the definition of the Class A problem. The Class B problem is the same except that n1 = 512, n2 = 256, n3 = 256 and N = 20.
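The overall procedure has the following shape. This sketch uses NumPy's RNG and FFTs at an illustrative (much smaller) size, so its checksums will not match the published reference values; only the structure (one forward FFT, then per-step evolution, inverse FFT and checksum) follows the definition above:

```python
import numpy as np

def ft_benchmark_sketch(n1=64, n2=64, n3=32, niter=6, alpha=1e-6):
    """Structural sketch of the FT benchmark (sizes and RNG are NOT the
    specified ones, so checksums will not match the reference values)."""
    rng = np.random.default_rng(314159265)
    # store as (n3, n2, n1) so the first (Fortran) dimension is contiguous last
    u = rng.random((n3, n2, n1)) + 1j * rng.random((n3, n2, n1))
    v = np.fft.fftn(u)

    # evolution factors for t = 1, with wavenumbers folded so that
    # jbar = j for j < n1/2 and j - n1 otherwise
    jbar = np.fft.fftfreq(n1) * n1
    kbar = np.fft.fftfreq(n2) * n2
    lbar = np.fft.fftfreq(n3) * n3
    e1 = np.exp(-4 * alpha * np.pi ** 2
                * (lbar[:, None, None] ** 2 + kbar[None, :, None] ** 2
                   + jbar[None, None, :] ** 2))

    checksums = []
    for t in range(1, niter + 1):
        w = (e1 ** t) * v                  # t-th power of the t = 1 factor array
        x = np.fft.ifftn(w)
        j = np.arange(1, 1025)
        q, r, s = j % n1, (3 * j) % n2, (5 * j) % n3
        checksums.append(x[s, r, q].sum())  # sum of X(q,r,s); array is (l,k,j)
    return checksums
```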

Any algorithm may be used for the computation of the 3-D FFTs mentioned above. One algorithm is the following. Assume that the data in the input n1 x n2 x n3 complex array A are organized so that, for each k and l, all elements of the complex vector (A_{j,k,l}, 0 <= j < n1) are contained within a single processing node. First perform an n1-point 1-D FFT on each of these n2 n3 complex vectors. Then transpose the resulting array into an n2 x n3 x n1 complex array B. Next, perform an n2-point 1-D FFT on each of the n3 n1 first-dimension complex vectors of B. Again note that each of the 1-D FFTs can be performed locally within a single node. Then transpose the resulting array into an n3 x n1 x n2 complex array C. Finally, perform an n3-point 1-D FFT on each of the n1 n2 first-dimension complex vectors of C. Then transpose the resulting array into an n1 x n2 x n3 complex array D. This array D is the final 3-D FFT result.
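Under the assumption that the array axes are ordered (n1, n2, n3), this A -> B -> C -> D sequence can be sketched as three identical rounds; after the third cyclic transpose the array is back in its original orientation:

```python
import numpy as np

def fft3d_by_transposes(a):
    """3-D FFT as three rounds of first-axis 1-D FFTs plus a cyclic
    transpose, following the A -> B -> C -> D sequence described above."""
    for _ in range(3):
        a = np.fft.fft(a, axis=0)        # first-dimension 1-D FFTs, node-local
        a = np.transpose(a, (1, 2, 0))   # (n1,n2,n3) -> (n2,n3,n1), etc.
    return a
```

On a parallel machine the transposes are where the interprocessor communication happens; the 1-D FFTs themselves stay local to a node.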

Algorithms for performing an individual 1-D complex-to-complex FFT are well known and will not be presented here. Readers are referred to the FFT references at the end of this chapter for details. It should be noted that some of these FFTs are "unordered" FFTs, i.e., the results are not in the correct order but instead are scrambled by a bit-reversal permutation. Such FFTs may be employed if desired, but it should be noted that in this case the ordering of the exponential factors in the definition of W_{j,k,l} above must be similarly scrambled in order to obtain the correct results. Also, the final result array X may be scrambled, in which case the checksum calculation will have to be changed accordingly.

It should be noted that individual 1-D FFTs, array transpositions, and even entire 3-D FFT operations may be performed using vendor-supplied library routines. See the general rules earlier in this report for details.

Operations to be Timed

All of the above operations, including the checksum calculations, must be timed.

Verification Test

The N complex checksums must agree with reference values to within one part in 10^12. For the parameter sizes specified above, the reference values are given in the FT benchmark checksum table.

Other Features

- 3-D FFTs are a key part of certain CFD applications, notably large eddy turbulence simulations.

- The 3-D FFT steps require considerable communication for operations such as array transpositions.

Kernel IS: parallel sort over small integers

L. Dagum

Brief Statement of Problem

Sort N keys in parallel. The keys are generated by the sequential key generation algorithm given below and initially must be uniformly distributed in memory. The initial distribution of the keys can have a great impact on the performance of this benchmark, and the required distribution is discussed in detail below.

Table: FT Benchmark Checksums (columns t, Real Part and Imaginary Part, for Class A and Class B; the numeric reference values are not reproduced here).

Definitions

A sequence of keys {K_i | i = 0, 1, ..., N-1} will be said to be sorted if it is arranged in non-descending order, i.e., K_i <= K_{i+1} <= K_{i+2} .... The rank of a particular key in a sequence is the index value i that the key would have if the sequence of keys were sorted. Ranking, then, is the process of arriving at a rank for all the keys in a sequence. Sorting is the process of permuting the keys in a sequence to produce a sorted sequence. If an initially unsorted sequence K_0, K_1, ..., K_{N-1} has ranks r(0), r(1), ..., r(N-1), the sequence becomes sorted when it is rearranged in the order K_{r(0)}, K_{r(1)}, ..., K_{r(N-1)}. Sorting is said to be stable if equal keys retain their original relative order. In other words, a sort is stable only if r(i) < r(j) whenever K_{r(i)} = K_{r(j)} and i < j. Stable sorting is not required for this benchmark.
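A minimal illustration of ranks versus sorting (pure Python; the tie-breaking by original index used here happens to be stable, which the benchmark does not require):

```python
def compute_ranks(keys):
    """Rank of each key = the index it would occupy in the sorted sequence.
    Ties are broken by original position, so equal keys get distinct ranks."""
    order = sorted(range(len(keys)), key=lambda i: (keys[i], i))
    ranks = [0] * len(keys)
    for position, i in enumerate(order):
        ranks[i] = position
    return ranks
```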

Memory Mapping

The benchmark requires ranking an unsorted sequence of N keys. The initial sequence of keys will be generated in an unambiguous sequential manner as described below. This sequence must be mapped into the memory of the parallel processor in one of the following ways, depending on the type of memory system. In all cases, one key will map to one word of memory. Word size must be no less than 32 bits. Once the keys are loaded onto the memory system, they are not to be moved or modified except as required by the procedure described in the Procedure subsection.

Shared Global Memory

All N keys initially must be stored in a contiguous address space. If A_i is used to denote the address of the i-th word of memory, then the address space must be [A_i, A_{i+N-1}]. The sequence of keys K_0, K_1, ..., K_{N-1} initially must map to this address space as

    A_{i+j} <- MEM(K_j)   for j = 0, 1, ..., N-1

where MEM(K_j) refers to the address of K_j.

Distributed Memory

In a distributed memory system with p distinct memory units, each memory unit initially must store N_p keys in a contiguous address space, where

    N_p = N / p

If A_i is used to denote the address of the i-th word in a memory unit, and if P_j is used to denote the j-th memory unit, then P_j{A_i} will denote the address of the i-th word in the j-th memory unit. Some initial addressing (or ordering) of the memory units must be assumed and adhered to throughout the benchmark; note that the addressing of the memory units is otherwise left completely arbitrary. If N is not evenly divisible by p, then memory units {P_j | j = 0, 1, ..., p-2} will store N_p keys and memory unit P_{p-1} will store N_{p-1} keys, where now

    N_p = floor(N / p)
    N_{p-1} = N - (p - 1) N_p

In some cases (in particular, if p is large) this mapping may result in a poor initial load balance with N_{p-1} >> N_p. In such cases it may be desirable to use p' memory units to store the keys, where p' < p. This is allowed, but the storage of the keys still must follow either the equal-division equation or the pair of equations above, with p' replacing p. In the following we will assume N is evenly divisible by p. The address space in an individual memory unit must be [A_i, A_{i+N_p-1}].

If memory units are individually hierarchical, then the N_p keys must be stored in a contiguous address space belonging to a single memory hierarchy, and A_i then denotes the address of the i-th word in that hierarchy. The keys cannot be distributed among different memory hierarchies until after timing begins. The sequence of keys K_0, K_1, ..., K_{N-1} initially must map to this distributed memory as

    P_k{A_{i+j}} <- MEM(K_{k N_p + j})   for j = 0, 1, ..., N_p - 1  and  k = 0, 1, ..., p - 1

where MEM(K_{k N_p + j}) refers to the address of K_{k N_p + j}. If N is not evenly divisible by p, then the mapping given above must be modified for the case where k = p - 1 as

    P_{p-1}{A_{i+j}} <- MEM(K_{(p-1) N_p + j})   for j = 0, 1, ..., N_{p-1} - 1

Hierarchical Memory

All N keys initially must be stored in an address space belonging to a single memory hierarchy, which will here be referred to as the main memory. Note that any memory in the hierarchy which can store all N keys may be used for the initial storage of the keys, and the use of the term "main memory" in the description of this benchmark should not be confused with the more general definition of this term used elsewhere in this report. The keys cannot be distributed among different memory hierarchies until after timing begins. The mapping of the keys to the main memory must follow one of either the shared global memory or the distributed memory mappings described above.

The benchmark requires computing the rank of each key in the sequence. The mappings described above define the initial ordering of the keys. For shared global and hierarchical memory systems, the same mapping must be applied to determine the correct ranking. For the case of a distributed memory system, it is permissible for the mapping of keys to memory at the end of the ranking to differ from the initial mapping only in the following manner: the number of keys mapped to a memory unit at the end of the ranking may differ from the initial value N_p. It is expected in a distributed memory machine that good load balancing of the problem will require changing the initial mapping of the keys, and for this reason a different mapping may be used at the end of the ranking. If N_{p_k} is the number of keys in memory unit P_k at the end of the ranking, then the mapping which must be used to determine the correct ranking is given by

    P_k{A_{i+j}} <- MEM(r(k N_p + j))   for j = 0, 1, ..., N_{p_k} - 1  and  k = 0, 1, ..., p - 1

where r(k N_p + j) refers to the rank of key K_{k N_p + j}. Note, however, this does not imply that the keys, once loaded into memory, may be moved. Copies of the keys may be made and moved, but the original sequence must remain intact, such that each time the ranking process is repeated (Step 4 of the Procedure) the original sequence of keys exists (except for the two modifications of Step 4a) and the same algorithm for ranking is applied. Specifically, knowledge obtainable from the communications pattern carried out in the first ranking cannot be used to speed up subsequent rankings, and each iteration of Step 4 should be completely independent of the previous iteration.

Key Generation Algorithm

The algorithm for generating the keys makes use of the pseudorandom number generator described later in this chapter. The keys will be in the range [0, B_max). Let r_f be a random fraction uniformly distributed in the range 0 < r_f < 1, and let K_i be the i-th key. The value of K_i is determined as

    K_i <- floor( (B_max / 4) (r_{4i} + r_{4i+1} + r_{4i+2} + r_{4i+3}) )   for i = 0, 1, ..., N-1

Note that K_i must be an integer and floor(.) indicates truncation. Four consecutive pseudorandom numbers from the pseudorandom number generator must be used for generating each key. All operations before the truncation must be performed in 64-bit double precision. The random number generator must be initialized with s = 314159265 as a starting seed.

Table: Values to be used for partial verification (pairs of rank indices and reference rank values, one set for the Full Scale benchmark and one for the Sample Code; the numeric entries are not reproduced here).

Partial Verification Test

Partial verification is conducted for each ranking performed. Partial verification consists of comparing a particular subset of ranks with the reference values. The subset of ranks and the reference values are given in the partial verification table. Note that the subset of ranks is selected to be invariant to the ranking algorithm (recall that stability is not required in the benchmark). This is accomplished by selecting for verification only the ranks of unique keys. If a key is unique in the sequence (i.e., there is no other equal key), then it will have a unique rank despite an unstable ranking algorithm. The memory mapping described in the Memory Mapping subsection must be applied.

Full Verification Test

Full verification is conducted after the last ranking is performed. Full verification requires the following:

1. Rearrange the sequence of keys {K_i | i = 0, 1, ..., N-1} in the order {K_j | j = r(0), r(1), ..., r(N-1)}, where r(0), r(1), ..., r(N-1) is the last computed sequence of ranks.

2. For every K_i, from i = 0, ..., N-2, test that K_i <= K_{i+1}.

If the result of this test is true, then the keys are in sorted order. The memory mapping described in the Memory Mapping subsection must be applied.
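As a sketch, the two full-verification steps amount to:

```python
def full_verification(keys, ranks):
    """Place each key at its computed rank, then confirm non-descending
    order; returns True exactly when the ranking was consistent."""
    rearranged = [None] * len(keys)
    for i, key in enumerate(keys):
        rearranged[ranks[i]] = key
    return all(rearranged[i] <= rearranged[i + 1]
               for i in range(len(rearranged) - 1))
```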

Procedure

1. In a scalar, sequential manner, and using the key generation algorithm described above, generate the sequence of N keys.

2. Using the appropriate memory mapping described above, load the N keys into the memory system.

3. Begin timing.

4. Do, for i = 1 to I_max:

   (a) Modify the sequence of keys by making the following two changes: K_i <- i and K_{i+I_max} <- (B_max - i).

   (b) Compute the rank of each key.

   (c) Perform the partial verification test described above.

5. End timing.

6. Perform the full verification test described above.

Specifications

The specifications given in the parameter table below shall be used in the benchmark. Two sets of values are given, one for Class A and one for Class B.

For partial verification, the reference values given in the partial verification table are to be used. In that table, r(j) refers to the rank of K_j and i is the iteration of Step 4 of the Procedure. Again, two sets of values are given, the Full Scale set being for the actual benchmark and the Sample Code set being for development purposes. It should be emphasized that the benchmark measures the performance based on use of the Full Scale values, and the Sample Code values are given only as a convenience to the implementor.

Table: Parameter values to be used for the benchmark

    Parameter   Class A      Class B
    N           2^23         2^25
    B_max       2^19         2^21
    seed        314159265    314159265
    I_max       10           10

Also to be supplied to the implementor is Fortran source code for the sequential implementation of the benchmark, using the Sample Code values and with partial and full verification tests.

A Pseudorandom Number Generator for the Parallel NAS Kernels

D. Bailey

Suppose that n uniform pseudorandom numbers are to be generated. Set a = 5^13 and let x_0 = s be a specified initial seed, i.e., an integer in the range 0 < s < 2^46. Generate the integers x_k for 1 <= k <= n using the linear congruential recursion

    x_{k+1} = a x_k  (mod 2^46)

and return r_k = 2^{-46} x_k as the results. Thus 0 < r_k < 1, and the r_k are very nearly uniformly distributed on the unit interval. See the reference by Knuth for further discussion of this type of pseudorandom number generator.
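In Python the recursion is a few lines, since exact integer arithmetic is built in (the floating point splitting scheme described later in this section is only needed on machines restricted to 64-bit floating point):

```python
class LCG46:
    """x_{k+1} = a * x_k mod 2^46 with a = 5^13; returns r_k = 2^-46 * x_k."""
    A = 5 ** 13
    MOD = 2 ** 46

    def __init__(self, seed=314159265):
        self.x = seed

    def next_fraction(self):
        self.x = (self.A * self.x) % self.MOD
        return self.x / float(self.MOD)
```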

Note that any particular value x_k of the sequence can be computed directly from the initial seed s by using the binary algorithm for exponentiation, taking remainders modulo 2^46 after each multiplication. To be specific, let m be the smallest integer such that 2^m > k, set b = s and t = a. Then repeat the following for i from 1 to m:

    j <- floor(k / 2)
    b <- b t  (mod 2^46)   if 2j != k
    t <- t^2  (mod 2^46)
    k <- j

The final value of b is x_k = a^k s (mod 2^46). See the reference by Knuth for further discussion of the binary algorithm for exponentiation.
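An equivalent bit-by-bit form of that loop (scanning the binary digits of k from the low end; the function name is illustrative) is:

```python
def nth_seed(k, s=314159265, a=5 ** 13, mod=2 ** 46):
    """Compute x_k = a^k * s mod 2^46 by binary exponentiation, so a
    processor can jump straight to its own segment of the sequence."""
    b, t = s, a
    while k > 0:
        if k % 2 == 1:          # the "2j != k" case in the loop above
            b = (b * t) % mod
        t = (t * t) % mod
        k //= 2
    return b
```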

The operation of multiplying two large integers modulo 2^46 can be implemented using 64-bit floating point arithmetic by splitting the arguments into two words with 23 bits each. To be specific, suppose one wishes to compute c = ab (mod 2^46). Then perform the following steps, where int(.) denotes the greatest integer:

    a1 <- int(2^{-23} a)
    a2 <- a - 2^{23} a1
    b1 <- int(2^{-23} b)
    b2 <- b - 2^{23} b1
    t1 <- a1 b2 + a2 b1
    t2 <- int(2^{-23} t1)
    t3 <- t1 - 2^{23} t2
    t4 <- 2^{23} t3 + a2 b2
    t5 <- int(2^{-46} t4)
    c  <- t4 - 2^{46} t5

An implementation of the complete pseudorandom number generator algorithm using this scheme produces the same sequence of results on any system that satisfies the following requirements:

1. The input multiplier a and the initial seed s, as well as the constants 2^23, 2^{-23}, 2^46 and 2^{-46}, can be represented exactly as 64-bit floating point constants.

2. The truncation of a nonnegative 64-bit floating point value less than 2^46 is exact.

3. The addition, subtraction and multiplication of 64-bit floating point values, where the arguments and results are nonnegative whole numbers less than 2^46, produce exact results.

4. The multiplication of a 64-bit floating point value, which is a nonnegative whole number less than 2^46, by the 64-bit floating point value 2^{-m}, 0 < m <= 46, produces an exact result.

These requirements are met by virtually all scientific computers in use today. Any system based on the IEEE 754 floating point arithmetic standard easily meets these requirements using double precision. However, it should be noted that obtaining an exact power of two constant on some systems requires a loop rather than merely an assignment statement with an exponentiation.

Other Features

- The period of this pseudorandom number generator is 2^44, approximately 1.76 x 10^13, and it passes all reasonable statistical tests.

- This calculation can be vectorized on vector computers by generating results in batches of size equal to the hardware vector length.

- By using the scheme described above for computing x_k directly, the starting seed of a particular segment of the sequence can be quickly and independently determined. Thus numerous separate segments can be generated on separate processors of a multiprocessor system.

- Once the IEEE 754 floating point arithmetic standard gains universal acceptance among scientific computers, the modulus 2^46 can be safely increased to 2^52, although the scheme described above for multiplying two such numbers must be correspondingly changed. This will increase the period of the pseudorandom sequence by a factor of 64, to approximately 10^15.

References

[1] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard 754-1985, IEEE, New York, 1985.

[2] Knuth, D. E.: The Art of Computer Programming, Vol. 2, Addison-Wesley, 1981.

[3] Duff, I. S., Erisman, A. M., and Reid, J. K.: Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.

[4] Agarwal, R. C., and Cooley, J. W.: Fourier Transform and Convolution Subroutines for the IBM 3090 Vector Facility, IBM Journal of Research and Development, 1986.

[5] Bailey, D. H.: A High-Performance FFT Algorithm for Vector Supercomputers, International Journal of Supercomputer Applications, vol. 2, 1988.

[6] Pease, M. C.: An Adaptation of the Fast Fourier Transform for Parallel Processing, Journal of the ACM, 1968.

[7] Swarztrauber, P. N.: FFT Algorithms for Vector Computers, Parallel Computing, vol. 1, 1984.

[8] Swarztrauber, P. N.: Multiprocessor FFTs, Parallel Computing, vol. 5, 1987.

A METHODOLOGY FOR BENCHMARKING SOME CFD KERNELS ON HIGHLY PARALLEL PROCESSORS

Sisira Weeratunga, Eric Barszcz, Rod Fatoohi and V. Venkatakrishnan

Summary

A collection of iterative PDE solvers embedded in a pseudo application program is proposed for the performance evaluation of CFD codes on highly parallel processors. The pseudo application program is stripped of complexities associated with real CFD application programs, thereby enabling a simpler description of the algorithms. However, it is capable of reproducing the essential computation and data motion characteristics of large scale, state of the art CFD codes. In this chapter we present a detailed description of the pseudo application program concept. Preceding chapters address our basic approach towards the performance evaluation of parallel supercomputers targeted for use in numerical aerodynamic simulation.

Introduction

Computational Fluid Dynamics (CFD) is one of the fields in the area of scientific computing that has driven the development of modern vector supercomputers. Availability of these high performance computers has led to impressive advances in the state of the art of CFD, both in terms of the physical complexity of the simulated problems and the development of computational algorithms capable of extracting high levels of sustained performance. However, to carry out the computational simulations essential for future aerospace research, CFD must be able and ready to exploit the potential performance and cost-performance gains possible through the use of highly parallel processing technologies. Use of parallel supercomputers appears to be one of the most promising avenues for realizing large, complex physical simulations within realistic time and cost constraints. Although many of the current CFD application programs are amenable to a high degree of parallel computation, performance data on such codes for the current generation of parallel computers often has been less than remarkable. This is especially true for the class of CFD algorithms involving global data dependencies, commonly referred to as the implicit methods. Often the bottleneck is data motion, due to high latencies and inadequate bandwidth.

(The authors of this chapter are with Computer Sciences Corp., Ames Research Center.)

It is a common practice among computer hardware designers to use the dense linear equation solution subroutine in LINPACK to represent the scientific computing workload. Unfortunately, the computational structures in most CFD algorithms bear little resemblance to this LINPACK routine, both in terms of parallelization strategy and in terms of floating point and memory reference features. Most CFD application codes are characterized by their use of either regular or irregular sparse data structures and associated algorithms. One of the reasons for this situation is the near absence of communication between the computer scientists engaged in the design of high performance parallel computers and the computational scientists involved in the development of CFD applications. To be effective, such exchange of information should occur during the early stages of the design process. It appears that one of the contributing factors to this lack of communication is the complexity and confidentiality associated with state-of-the-art CFD application codes. One way to help the design process is to provide the computer scientists with synthetic CFD application programs which lack the complexity of a real application but at the same time retain all the essential computational structures. Such synthetic application codes can be accompanied by detailed and simpler descriptions of the algorithms involved. In return, the performance data on such synthetic application codes can be used by the CFD community to evaluate different parallel supercomputer systems at the procurement stage.

Computational fluid dynamics involves the numerical solution of a system of nonlinear partial differential equations in two or three spatial dimensions, with or without time dependence. The governing partial differential equations, referred to as the Navier-Stokes equations, represent the laws of conservation of mass, momentum and energy applied to a fluid medium in motion. These equations, when supplemented by appropriate boundary and initial conditions, describe a particular physical problem. To obtain a system of equations amenable to solution on a computer requires the discretization of the differential equations through the use of finite difference, finite volume, finite element or spectral methods. The inherent nonlinearities of the governing equations necessitate the use of iterative solution techniques. Over the past years, a variety of efficient numerical algorithms have been developed, all requiring many floating point operations and large amounts of computer memory to achieve a solution with the desired level of accuracy.

In current CFD applications there are two types of computational meshes used for the spatial discretization process: structured and unstructured. Structured meshes are characterized by a consistent, logical ordering of mesh points, whose connectivity is associated with a rectilinear coordinate system. Computationally, structured meshes give rise to regularly strided memory reference characteristics. In contrast, unstructured meshes offer greater freedom in terms of mesh point distribution, but require the generation and storage of random connectivity information. Computationally, this results in indirect memory addressing with random strides, with its attendant increase in memory bandwidth requirements. The synthetic application codes currently under consideration are restricted to the case of structured meshes.

The numerical solution algorithms used in CFD codes can be broadly categorized as either explicit or implicit, based on the procedure used for the time domain integration. Among the advantages of the explicit schemes are the high degree of easily exploitable parallelism and the localized spatial data dependencies. These properties have resulted in highly efficient implementations of explicit CFD algorithms on a variety of current generation highly parallel processors. However, the explicit schemes suffer from stringent numerical stability bounds and, as a result, are not optimal for problems that require fine mesh spacing for numerical resolution. In contrast, implicit schemes have less stringent stability bounds and are suitable for problems involving highly stretched meshes. However, their parallel implementation is more difficult and involves local as well as global spatial data dependencies. In addition, some of the implicit algorithms possess limited degrees of exploitable parallelism. At present, we restrict our synthetic applications to three different representative implicit schemes found in a wide spectrum of production CFD codes in use at NASA Ames Research Center.

In the remaining sections of this chapter we describe the development of a collection of synthetic application programs. First we discuss the rationale behind this approach, followed by a complete description of three such synthetic applications. We also outline the problem setup, along with the associated verification tests, when they are used to benchmark highly parallel systems.

Rationale

In the past, vector supercomputer performance was evaluated through the use of suites of kernels chosen to characterize generic computational structures present in a site's workload. For example, the NAS Kernels were selected to characterize the computational workloads inherent in a majority of the algorithms used by the CFD community at Ames Research Center. However, for highly parallel computer systems this approach is inadequate, for the reasons outlined below.

The first stage of the pseudo application development process was the analysis of a variety of implicit CFD codes and the identification of a set of generic computational structures that represented a range of computational tasks embedded in them. As a result, the following computational kernels were selected:

1. Solution of multiple, independent systems of non-diagonally-dominant block tridiagonal equations with a 5 x 5 block size.

2. Solution of multiple, independent systems of non-diagonally-dominant scalar pentadiagonal equations.

3. Regular-sparse block matrix-vector multiplication.

4. Regular-sparse block lower and upper triangular system solution.
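For orientation, the scalar analogue of the first two kernels is the classic banded-system recurrence. This sketch solves one scalar tridiagonal system by the Thomas algorithm; the benchmark kernels instead solve many independent systems (with 5 x 5 blocks in the tridiagonal case), and the sequential recurrence along each line is exactly what makes their parallelization nontrivial:

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: forward elimination then back substitution.
    lower[0] and upper[-1] are unused; all lists have length n."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                      # forward sweep (sequential recurrence)
        m = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / m if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / m
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```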

These kernels constitute a majority of the computationally intensive main building blocks of the CFD programs designed for the numerical solution of three-dimensional (3-D) Euler/Navier-Stokes equations using finite-volume or finite-difference discretization on structured grids. Kernels 1 and 2 are representative of the computations associated with the implicit operator in versions of the ARC3D code. These kernels involve global data dependencies. Although they are similar in many respects, there is a fundamental difference with regard to the communication-to-computation ratio. Kernel 3 typifies the computation of the explicit part of almost all CFD algorithms for structured grids. Here all data dependencies are local, with either nearest neighbor or, at most, next-to-nearest neighbor type dependencies. Kernel 4 represents the computations associated with the implicit operator of a newer class of implicit CFD algorithms, typified by the code INS3D-LU. This kernel may contain only a limited degree of parallelism relative to the other kernels.

In terms of their parallel implementation, these kernels represent varying characteristics with regard to the following, often related, aspects:

- Available degree of parallelism

- Level of parallelism and granularity

- Data space partitioning strategies

- Global vs. local data dependencies

- Interprocessor and in-processor data motion requirements

- Ratio of communication to computation

Previous research efforts in adapting the algorithms in a variety of flow solvers to the current generation of highly parallel processors have indicated that the overall performance of many CFD codes is critically dependent on the latency and bandwidth of both the in-processor and interprocessor data motion. Therefore, it is important for the integrity of the benchmarking process to faithfully reproduce a majority of the data motions encountered during the execution of the applications in which these kernels are embedded. Also, the nature and amount of data motion depend on the kernel algorithms, on the associated data structures, and on the interaction of these kernels among themselves as well as with the remainder of the application that is outside their scope.

To obtain realistic performance data, the specification of both the incoming and outgoing data structures of the kernels should mimic those occurring in an application program. The incoming data structure is dependent on the section of the code where the data is generated, not on the kernel. The optimum data structure for the kernel may turn out to be suboptimal for the code segments where the data is generated, and vice versa. Similar considerations also apply to the outgoing data structure. Allowing the freedom to choose optimal incoming and outgoing data structures for the kernel as a basis for evaluating its performance is liable to produce results that are not applicable to a complete application code. The overall performance should reflect the cost of the data motion that occurs between kernels.

In order to reproduce most of the data motions encountered in the execution of these kernels in a typical CFD application, we propose embedding them in a pseudo application code. It is designed for the numerical solution of a synthetic system of nonlinear Partial Differential Equations (PDEs), using iterative techniques similar to those found in CFD applications of interest to NASA Ames Research Center. However, it contains none of the pre- and post-processing required by the full CFD applications, nor the interactions of the processors and the I/O subsystem. It can be regarded as a stripped-down version of a CFD application. It retains the basic kernels that are the principal building blocks of the application and admits a majority of the interactions required between these basic routines. Also, the stripped-down version does not represent a fully configured CFD application in terms of system memory requirements. This fact has the potential for creating data partitioning strategies during the parallel implementation of the synthetic problem that may be inappropriate for the full application.

From the point of view of functionality, the stripped-down version does
not contain the algorithms used to apply boundary conditions, as in a real
application. It is well known that the boundary algorithms often give rise to
load imbalances and idling of processors in highly parallel systems. Due to the
simplification of the boundary algorithms, the overall system
performance and efficiency data obtained using the stripped-down version
are likely to be higher than those of an actual application. This effect is somewhat
mitigated by the fact that, for most realistic problems, a relatively small
amount of time is spent dealing with boundary algorithms when compared to the
time spent dealing with the internal mesh points. Also, most boundary
algorithms involve only local data dependencies.

Some of the other advantages of the stripped-down application vs. the full
application approach are:

- Allows benchmarking where real application codes are confidential

- Easier to manipulate and port from one system to another

- Since only the abstract algorithm is specified, facilitates new implementations
that are tied closely to the architecture under consideration

- Allows easy addition of other existing and emerging CFD algorithms
to the benchmarking process

- Easily scalable to larger problem sizes

It should be noted that this synthetic problem differs from a real CFD
problem in the following important aspects:

- In full CFD application codes, a non-orthogonal coordinate transformation
is used to map the complex physical domains to regular
computational domains, thereby introducing metric coefficients of
the transformation into the governing PDEs and boundary conditions.
Such transformations are absent in the synthetic problem, which as a
result may have reduced arithmetic complexity and storage requirements.

- A blend of nonlinear second- and fourth-difference artificial dissipation
terms, whose coefficients are determined based on the local changes in
pressure, is used in most of the actual CFD codes. In the stripped-down
version, only a linear fourth-difference term is used. This reduces
the arithmetic and communication complexity needed to compute the
added higher-order dissipation terms. However, it should be noted that
the computation of these artificial dissipation terms involves only local data
dependencies, similar to the matrix-vector multiplication kernel.

- In codes where artificial dissipation is not used, upwind differencing
based on either flux-vector splitting, flux-difference splitting, or
Total Variation Diminishing (TVD) schemes is used instead. The absence of
such differencing schemes in the stripped-down version induces comparable
effects on the performance data.

- Turbulence models are absent. The computation of terms representing
some turbulence models involves a combination of local and some long-range
data dependencies; the arithmetic and communication complexity
associated with turbulence models is therefore also absent.

In addition, it needs to be emphasized that the stripped-down problem
is neither designed for, nor suitable for, the purpose of evaluating the convergence
rates and/or the applicability of various iterative linear-system solvers
used in computational fluid dynamics applications. As mentioned before,
the synthetic problem differs from real CFD applications in the following
important ways:

- Absence of realistic boundary algorithms

- Higher than normal dissipative effects

- Lack of upwind differencing effects based on either flux-vector splitting
or TVD schemes

- Absence of the geometric stiffness introduced through boundary-conforming
coordinate transformations and highly stretched meshes

- Lack of evolution of weak (i.e., C^0) solutions, found in real CFD
applications, during the iterative process

- Absence of turbulence modelling effects

Some of these effects tend to suppress the predominantly hyperbolic nature
exhibited by the Navier-Stokes equations when applied to compressible
flows at high Reynolds numbers.

Mathematical Problem Definition

We consider the numerical solution of the following synthetic system of five
nonlinear partial differential equations (PDEs):

  ∂U/∂τ = ∂E(U)/∂ξ + ∂F(U)/∂η + ∂G(U)/∂ζ
          + ∂T(U, U_ξ)/∂ξ + ∂V(U, U_η)/∂η + ∂W(U, U_ζ)/∂ζ
          + H(U, U_ξ, U_η, U_ζ),   for (ξ, η, ζ) ∈ D, τ ∈ (0, T],

with the boundary conditions

  B(U) = U_B(τ, ξ, η, ζ)   for (ξ, η, ζ) ∈ ∂D, τ ∈ (0, T],

and initial conditions

  U = U_0(ξ, η, ζ)   in D, for τ = 0,

where D is a bounded domain, ∂D is its boundary, and τ ∈ {τ : 0 < τ ≤ T}.

Also, the solution to the system of PDEs,

  U = ( u^(1), u^(2), u^(3), u^(4), u^(5) )^T,

defined on D × (0, T] and ∂D × (0, T], is a vector function of the temporal variable τ and
of the spatial variables (ξ, η, ζ) that form an orthogonal coordinate system,
i.e.,

  u^(m) = u^(m)(τ, ξ, η, ζ),   m = 1, 2, ..., 5.

The vector functions U_B and U_0 are given, and B is the boundary operator.
E, F, G, T, V, W, and H are vector functions with five components
each, of the form

  E = ( e^(1), e^(2), e^(3), e^(4), e^(5) )^T,

where e^(m) = e^(m)(U), etc., are prescribed functions.

The system given above is in the normal form,
i.e., it gives explicitly the time derivatives of all the dependent variables
u^(1), ..., u^(5). Consequently, the Cauchy data at τ = 0 given by the initial conditions
permits the calculation of the solution U for τ > 0.

In the current implementation of the synthetic-PDE-system solver, we
seek a steady-state solution of the governing equations, of the form

  U* = ( f^(1), f^(2), f^(3), f^(4), f^(5) )^T,

where f^(1), ..., f^(5) are prescribed functions of (ξ, η, ζ), each given as a
linear combination of the elements of a vector e:

  f^(m) = Σ_n C_{m,n} e_n,   m = 1, ..., 5.

Here the vector e is given by

  e = ( 1, ξ, η, ζ, ξ^2, η^2, ζ^2, ξ^3, η^3, ζ^3, ξ^4, η^4, ζ^4 )^T,

and the C_{m,n} (m = 1, ..., 5; n = 1, ..., 13) are specified constants. The vector
forcing function H* = ( h^(1), h^(2), h^(3), h^(4), h^(5) )^T, where h^(m) = h^(m)(ξ, η, ζ),
is chosen such that the system of PDEs, along with the boundary and initial
conditions, with H = H* satisfies the prescribed exact solution U*. This implies

  H* = - [ ∂E(U*)/∂ξ + ∂F(U*)/∂η + ∂G(U*)/∂ζ
         + ∂T(U*, U*_ξ)/∂ξ + ∂V(U*, U*_η)/∂η + ∂W(U*, U*_ζ)/∂ζ ]

for (ξ, η, ζ) ∈ D ∪ ∂D.

The bounded spatial domain D is specified to be the interior of the unit
cube, i.e.,

  D = { (ξ, η, ζ) : 0 < ξ < 1, 0 < η < 1, 0 < ζ < 1 },

and its boundary ∂D is the surface of the unit cube, given by

  ∂D = { ξ = 0 or ξ = 1 } ∪ { η = 0 or η = 1 } ∪ { ζ = 0 or ζ = 1 }.

The vector functions E F G T V and W of the synthetic problem are

sp ecied to b e the following

u

B C

B C

B C

B C

u u

B C

B C

B C

B C

B C

E u u u

B C

B C

B C

B C

B C

u u u

B C

B C

A

u u u

u

B C

B C

B C

B C

u u u

B C

B C

B C

B C

B C

F u u

B C

B C

B C

B C

B C

u u u

B C

B C

A

u u u

u

C B

C B

C B

C B

u u u

C B

C B

C B

C B

C B

u u u G

C B

C B

C B

C B

C B

u u

C B

C B

A

u u u

where

u u u

g k fu

u

Also,

  T(U, U_ξ) = ( d^ξ_1 ∂u^(1)/∂ξ,
                d^ξ_2 ∂u^(2)/∂ξ + k_3 k_4 ∂(u^(2)/u^(1))/∂ξ,
                d^ξ_3 ∂u^(3)/∂ξ + k_3 k_4 ∂(u^(3)/u^(1))/∂ξ,
                d^ξ_4 ∂u^(4)/∂ξ + k_3 k_4 ∂(u^(4)/u^(1))/∂ξ,
                t^(5) )^T,

where the fifth component t^(5) combines d^ξ_5 ∂u^(5)/∂ξ with k_3 k_4-weighted
ξ-derivatives of the ratios [ (u^(2))^2 + (u^(3))^2 + (u^(4))^2 ] / (u^(1))^2 and
u^(5)/u^(1), the latter weighted by the constant k_5. The vector functions
V(U, U_η) and W(U, U_ζ) are defined analogously in the η and ζ directions,
with diffusion constants d^η_m and d^ζ_m and fifth components v^(5) and w^(5),
respectively. Here k_1, k_2, k_3, k_4, k_5, and d^ξ_m, d^η_m, d^ζ_m
(m = 1, ..., 5) are given constants.

The boundary conditions

The boundary conditions for the system of PDEs are prescribed to be of the
uncoupled Dirichlet type and are specified to be compatible with U*, such
that

  u^(m)(τ, ξ, η, ζ) = f^(m)(ξ, η, ζ)   for (ξ, η, ζ) ∈ ∂D, τ ∈ (0, T],

and m = 1, ..., 5.

The initial conditions

The initial values U_0 in D are set to those obtained by a transfinite, trilinear
interpolation of the boundary data given above. Let

  P_ξ u^(m) = (1 - ξ) u^(m)(τ, 0, η, ζ) + ξ u^(m)(τ, 1, η, ζ),

  P_η u^(m) = (1 - η) u^(m)(τ, ξ, 0, ζ) + η u^(m)(τ, ξ, 1, ζ),

  P_ζ u^(m) = (1 - ζ) u^(m)(τ, ξ, η, 0) + ζ u^(m)(τ, ξ, η, 1).

Then, at τ = 0,

  u^(m) = [ P_ξ + P_η + P_ζ - P_ξ P_η - P_η P_ζ - P_ξ P_ζ + P_ξ P_η P_ζ ] u^(m)

for (ξ, η, ζ) ∈ D.
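The transfinite trilinear interpolation above can be sketched as follows. This is a minimal illustration, not part of the benchmark specification: the function name is ours, and for convenience the boundary data is passed as an ordinary function of (ξ, η, ζ), although in the benchmark it is known only on the faces of the cube.

```python
def transfinite_trilinear(f, x, y, z):
    """Transfinite trilinear (Boolean-sum) interpolant of the boundary data f
    on the unit cube, evaluated at an interior point (x, y, z)."""
    # Linear blends in each coordinate direction (projectors P_xi, P_eta, P_zeta)
    Px = (1 - x) * f(0, y, z) + x * f(1, y, z)
    Py = (1 - y) * f(x, 0, z) + y * f(x, 1, z)
    Pz = (1 - z) * f(x, y, 0) + z * f(x, y, 1)
    # Tensor-product (bilinear) corrections, e.g. P_xi P_eta applied to f
    Pxy = ((1 - x) * (1 - y) * f(0, 0, z) + x * (1 - y) * f(1, 0, z)
           + (1 - x) * y * f(0, 1, z) + x * y * f(1, 1, z))
    Pyz = ((1 - y) * (1 - z) * f(x, 0, 0) + y * (1 - z) * f(x, 1, 0)
           + (1 - y) * z * f(x, 0, 1) + y * z * f(x, 1, 1))
    Pxz = ((1 - x) * (1 - z) * f(0, y, 0) + x * (1 - z) * f(1, y, 0)
           + (1 - x) * z * f(0, y, 1) + x * z * f(1, y, 1))
    # Trilinear correction P_xi P_eta P_zeta: blend of the eight corner values
    Pxyz = sum(f(i, j, k) * (x if i else 1 - x) * (y if j else 1 - y)
               * (z if k else 1 - z)
               for i in (0, 1) for j in (0, 1) for k in (0, 1))
    # Boolean sum: reproduces f exactly on all six faces of the cube
    return Px + Py + Pz - Pxy - Pyz - Pxz + Pxyz
```

The Boolean-sum form guarantees that the interpolant agrees with the boundary data on every face, so the initial values are continuous with the Dirichlet data.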

The Numerical Scheme

Starting from the initial values prescribed above, we seek some
discrete approximation U*_h ∈ D_h to the steady-state solution U* of the governing
equations, through the numerical solution of the nonlinear system of
PDEs, using a pseudo-time-marching scheme and a spatial-discretization
procedure based on finite-difference approximations.

Implicit time differencing

The independent temporal variable τ is discretized to produce the set

  D_τ = { τ_n : 0 ≤ n ≤ N },

where the discrete time increment Δτ is given by

  Δτ = τ_n - τ_{n-1}.

Also, the discrete approximation of U on D_τ is denoted by

  U^n = U(τ_n).

A generalized, single-step temporal differencing scheme for advancing the
solution of the governing equations is given by

  ΔU^n = (β Δτ / (1 + θ)) ∂(ΔU^n)/∂τ + (Δτ / (1 + θ)) ∂U^n/∂τ
         + (θ / (1 + θ)) ΔU^{n-1} + O[ (β - 1/2 - θ) Δτ^2 + Δτ^3 ],

where the forward-difference operator Δ is defined as

  ΔU^n = U^{n+1} - U^n.

With the appropriate choices for the parameters β and θ, this scheme
reproduces many of the well-known two- and three-level explicit and implicit
schemes. In this particular case, we are interested only in the two-level,
first-order-accurate Euler implicit scheme, given by β = 1 and θ = 0, i.e.,

  ΔU^n = Δτ ∂(ΔU^n)/∂τ + Δτ ∂U^n/∂τ + O(Δτ^2).

Substituting for ∂U^n/∂τ and ∂(ΔU^n)/∂τ in this scheme using the governing
equations, we get

  ΔU^n = Δτ [ ∂(ΔE^n + ΔT^n)/∂ξ + ∂(ΔF^n + ΔV^n)/∂η + ∂(ΔG^n + ΔW^n)/∂ζ ]
       + Δτ [ ∂(E^n + T^n)/∂ξ + ∂(F^n + V^n)/∂η + ∂(G^n + W^n)/∂ζ + H* ],

where ΔE^n = E^{n+1} - E^n and E^n = E(U^n), etc.

This equation is nonlinear in ΔU^n, as a consequence of the fact that the
increments ΔE^n, ΔF^n, ΔG^n, ΔT^n, ΔV^n, and ΔW^n are nonlinear functions
of the dependent variable U and its derivatives U_ξ, U_η, and U_ζ. A linear
equation with the same temporal accuracy can be obtained
by a linearization procedure using a local Taylor-series expansion in time
about U^n:

  E^{n+1} = E^n + (∂E/∂τ)^n Δτ + O(Δτ^2)
          = E^n + (∂E/∂U)^n (∂U/∂τ)^n Δτ + O(Δτ^2).

Also,

  U^{n+1} = U^n + (∂U/∂τ)^n Δτ + O(Δτ^2).

Then, by combining these two expansions, we get

  E^{n+1} = E^n + (∂E/∂U)^n ( U^{n+1} - U^n ) + O(Δτ^2),

or

  ΔE^n = A(U^n) ΔU^n + O(Δτ^2),

where A(U) is the Jacobian matrix ∂E/∂U.

It should be noted that the above non-iterative time-linearization formulation
does not lower the formal order of accuracy of the temporal discretization.
However, if the steady-state solution is the only objective,
the quadratic convergence of the Newton-Raphson method (for sufficiently
good initial approximations) is recovered only as Δτ → ∞.
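The effect of this one-Newton-step time linearization can be illustrated on a scalar model problem. This is a hedged sketch: the function name and the scalar setting are ours; in the benchmark, u is the five-component grid function and df/du stands for the 5×5 flux Jacobians.

```python
def linearized_implicit_step(u, dtau, f, dfdu):
    """One Euler-implicit step for du/dtau = f(u), with the nonlinear
    increment f(u^{n+1}) replaced by its local Taylor linearization
    f(u^n) + dfdu(u^n)*du (one Newton step per time level):

        (1 - dtau*dfdu(u)) * du = dtau * f(u),    u^{n+1} = u + du.
    """
    du = dtau * f(u) / (1.0 - dtau * dfdu(u))
    return u + du
```

For a linear right-hand side the linearization is exact, so the step reproduces the classical implicit Euler update; for nonlinear f it retains the O(Δτ²) local accuracy noted above while driving the iteration toward the steady state.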

Similarly, the linearization of the remaining terms gives

  ΔF^n = B(U^n) ΔU^n + O(Δτ^2),

  ΔG^n = C(U^n) ΔU^n + O(Δτ^2),

  ΔT^n = (∂T/∂U)^n ΔU^n + (∂T/∂U_ξ)^n ΔU_ξ^n + O(Δτ^2)
       = [ M(U^n) + N(U^n) ∂/∂ξ ] ΔU^n + O(Δτ^2),

  ΔV^n = [ P(U^n) + Q(U^n) ∂/∂η ] ΔU^n + O(Δτ^2),

  ΔW^n = [ R(U^n) + S(U^n) ∂/∂ζ ] ΔU^n + O(Δτ^2),

where the coefficient matrices are

  B(U) = ∂F/∂U,
  C(U) = ∂G/∂U,
  M(U) = ∂T/∂U,   N(U) = ∂T/∂U_ξ,
  P(U) = ∂V/∂U,   Q(U) = ∂V/∂U_η,
  R(U) = ∂W/∂U,   S(U) = ∂W/∂U_ζ.

When the approximations given by these linearizations are introduced into
the time-differenced equation, we obtain the following linear equation for ΔU^n:

  { I - Δτ [ ∂/∂ξ ( A^n + M^n + N^n ∂/∂ξ )
           + ∂/∂η ( B^n + P^n + Q^n ∂/∂η )
           + ∂/∂ζ ( C^n + R^n + S^n ∂/∂ζ ) ] } ΔU^n
    = Δτ [ ∂(E^n + T^n)/∂ξ + ∂(F^n + V^n)/∂η + ∂(G^n + W^n)/∂ζ + H* ].

It should be noted that notation of the form

  ∂/∂ξ ( A^n + M^n + N^n ∂/∂ξ ) ΔU^n

is used to represent expressions such as

  ∂[ ( A^n + M^n ) ΔU^n + N^n ∂(ΔU^n)/∂ξ ] / ∂ξ,

etc. The left-hand side (i.e., LHS) of this equation is referred to as the implicit
part, and the right-hand side (i.e., RHS) as the explicit part.
The solution at the advanced time level, n + 1, is given by

  U^{n+1} = U^n + ΔU^n.

The Jacobian matrices for the problem under consideration, A(U) = ∂E/∂U,
B(U) = ∂F/∂U, and C(U) = ∂G/∂U, are obtained by direct differentiation of
the flux vectors E, F, and G defined above. Each is a full 5×5 matrix whose
entries are quadratic rational functions of the dependent variables
u^(1), ..., u^(5) and the constants k_1 and k_2; in particular, the entries
involve the quantity

  q = k_1 [ (u^(2))^2 + (u^(3))^2 + (u^(4))^2 ] / u^(1),

the quantity φ, and the velocity-like ratios u^(m)/u^(1).

The matrices M, N, P, Q, R, and S follow from differentiating the vector
functions T, V, and W. In particular, N(U), Q(U), and S(U) are 5×5
matrices of near-diagonal form: the first four rows are diagonal, with entries
built from the diffusion constants d^ξ_m (respectively d^η_m, d^ζ_m) and terms
of the form k_3 k_4 / u^(1), together with a first column containing the corresponding
terms in the ratios u^(m)/(u^(1))^2; the fifth row collects entries
n^(5,1), ..., n^(5,5) (and likewise q^(5,m) and s^(5,m)) built from the
constants k_1, ..., k_5, d^ξ_5 (respectively d^η_5, d^ζ_5), and ratios of the
dependent variables u^(m).

Spatial discretization

The indep endent spatial variables are discretized bycovering D the

closure of D with a mesh of uniform increments h h h ineachofthe

co ordinate directions The mesh p oints in the region will b e identied bythe

indextriple i j k where the indices i N j N andk N

corresp ond to the discretization of and co ordinates resp ectively

D D f i N j N k N g

h h i j k

where

i h j h k h

i j k

and the mesh widths are given by

h N h N h N

with N N N N b eing the numb er of mesh p oints in and

directions resp ectively

Then the set of interior mesh p oints is given by

D f i N j N k N g

h i j k

and the b oundary mesh p oints by

D f i fN ggf

h i j k i j k

j fN ggf k fN gg

i j k

Also the discrete approximation of U in D D is denoted by

n

n i h j h k h U U U

ijk h

Spatial differencing

The spatial derivatives in the linear equation above are approximated by the appropriate
finite-difference quotients, based on the values of U_h at mesh points in
D̄_h = D_h ∪ ∂D_h. We use three-point, second-order-accurate central-difference
approximations in each of the three coordinate directions.

In the computation of the finite-difference approximation to the RHS,
the following two general forms of spatial derivatives are encountered:

  ∂ e^(m)(U) / ∂ξ   and   ∂ t^(m)(U, U_ξ) / ∂ξ.

The first form is differenced as (using the m-th component of the vector function E as
an example)

  ∂ e^(m)(U)/∂ξ |_{i,j,k} = [ e^(m)(U_{i+1,j,k}) - e^(m)(U_{i-1,j,k}) ] / (2 h_ξ) + O(h_ξ^2).

The second form is approximated as (using the m-th component of the vector
function T as an example)

  ∂ t^(m)(U, U_ξ)/∂ξ |_{i,j,k}
    = (1/h_ξ) { t^(m)( (U_{i,j,k} + U_{i+1,j,k})/2, (U_{i+1,j,k} - U_{i,j,k})/h_ξ )
              - t^(m)( (U_{i-1,j,k} + U_{i,j,k})/2, (U_{i,j,k} - U_{i-1,j,k})/h_ξ ) } + O(h_ξ^2),

for 2 ≤ i ≤ N_ξ - 1. Similar formulas are used in the η and ζ directions
as well.
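The two difference forms above can be sketched in one dimension as follows. This is a minimal Python/NumPy illustration with 0-based indexing; the function names are ours and are not part of the specification.

```python
import numpy as np

def d_dxi_first_form(e, U, h):
    """Three-point, second-order central difference of d/dxi e(U) at the
    interior points i = 1 .. N-2 of a 1-D array U with uniform spacing h."""
    eU = e(U)
    return (eU[2:] - eU[:-2]) / (2.0 * h)

def d_dxi_second_form(t, U, h):
    """Compact three-point approximation of d/dxi [ t(U, dU/dxi) ], built from
    midpoint values (U_i + U_{i+1})/2 and midpoint slopes (U_{i+1} - U_i)/h."""
    Umid = 0.5 * (U[1:] + U[:-1])        # U at the half points i+1/2
    Uxi = (U[1:] - U[:-1]) / h           # dU/dxi at the half points
    flux = t(Umid, Uxi)                  # t evaluated at the half points
    return (flux[1:] - flux[:-1]) / h    # difference of half-point fluxes
```

Both forms are exact for quadratic data: with U = ξ², the first form (with e the identity) returns 2ξ, and the second form (with t(u, u_ξ) = u_ξ) returns the constant second derivative 2.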

During the finite-difference approximation of the spatial derivatives in the
implicit part, the following three general forms are encountered:

  ∂[ a^(m,l)(U) Δu^(l) ]/∂ξ,
  ∂[ m^(m,l)(U, U_ξ) Δu^(l) ]/∂ξ,
  and
  ∂[ n^(m,l)(U) ∂(Δu^(l))/∂ξ ]/∂ξ.

The first form is approximated by the following:

  ∂[ a^(m,l)(U) Δu^(l) ]/∂ξ |_{i,j,k}
    ≈ (1/(2h_ξ)) { a^(m,l)(U_{i+1,j,k}) Δu^(l)_{i+1,j,k}
                 - a^(m,l)(U_{i-1,j,k}) Δu^(l)_{i-1,j,k} }.

The second form is differenced in the following compact, three-point form:

  ∂[ m^(m,l)(U, U_ξ) Δu^(l) ]/∂ξ |_{i,j,k}
    ≈ (1/(2h_ξ)) { m^(m,l)_{i+1/2} ( Δu^(l)_{i+1,j,k} + Δu^(l)_{i,j,k} )
                 - m^(m,l)_{i-1/2} ( Δu^(l)_{i,j,k} + Δu^(l)_{i-1,j,k} ) },

where m^(m,l)_{i±1/2} denotes m^(m,l) evaluated at the midpoint arguments

  ( ( U_{i,j,k} + U_{i±1,j,k} )/2,   ±( U_{i±1,j,k} - U_{i,j,k} )/h_ξ ).

Finally, the third form is differenced as follows:

  ∂[ n^(m,l)(U) ∂(Δu^(l))/∂ξ ]/∂ξ |_{i,j,k}
    ≈ (1/h_ξ^2) { n^(m,l)(U_{i+1,j,k}) Δu^(l)_{i+1,j,k}
                - 2 n^(m,l)(U_{i,j,k}) Δu^(l)_{i,j,k}
                + n^(m,l)(U_{i-1,j,k}) Δu^(l)_{i-1,j,k} }.

Added higher-order dissipation

When central-difference-type schemes are applied to the solution of the Euler
and Navier-Stokes equations, a numerical dissipation model is included, and
it plays an important role in determining their success. The form of the
dissipation model is quite often a blending of second-difference and fourth-difference
dissipation terms. The second-difference dissipation is nonlinear;
it is used to prevent oscillations in the neighborhood of discontinuous (i.e.,
C^0) solutions, and it is negligible where the solution is smooth. The fourth-difference
dissipation term is basically linear and is included to suppress
high-frequency modes and allow the numerical scheme to converge to a steady
state; near solution discontinuities, it is reduced to zero. It is this term that
affects the linear stability of the numerical scheme.

In the current implementation of the synthetic problem, a linear, fourth-difference
dissipation term of the following form,

  - ε [ h_ξ^4 ∂^4 U^n/∂ξ^4 + h_η^4 ∂^4 U^n/∂η^4 + h_ζ^4 ∂^4 U^n/∂ζ^4 ],

is added to the right-hand side of the linear equation. Here ε is a specified constant.
A similar term, with U^n replaced by ΔU^n, will appear in the implicit operator
if it is desirable to treat the dissipation term implicitly as well.

In the interior of the computational domain, the fourth-difference term is
computed using the following five-point approximation:

  h_ξ^4 ∂^4 U^n/∂ξ^4 |_{i,j,k}
    ≈ U^n_{i-2,j,k} - 4U^n_{i-1,j,k} + 6U^n_{i,j,k} - 4U^n_{i+1,j,k} + U^n_{i+2,j,k},

for 4 ≤ i ≤ N_ξ - 3.

At the first two interior mesh points belonging to either end of the computational
domain, the standard five-point difference stencil used for the fourth-difference
dissipation term is replaced by one-sided or one-sided-biased stencils.
These modifications are implemented so as to maintain a non-positive-definite
dissipation matrix for the system of difference equations.

Then, at i = 2,

  h_ξ^4 ∂^4 U^n/∂ξ^4 |_{i,j,k} ≈ 5U^n_{i,j,k} - 4U^n_{i+1,j,k} + U^n_{i+2,j,k},

and at i = 3,

  h_ξ^4 ∂^4 U^n/∂ξ^4 |_{i,j,k} ≈ -4U^n_{i-1,j,k} + 6U^n_{i,j,k} - 4U^n_{i+1,j,k} + U^n_{i+2,j,k}.

Also, at i = N_ξ - 2,

  h_ξ^4 ∂^4 U^n/∂ξ^4 |_{i,j,k} ≈ U^n_{i-2,j,k} - 4U^n_{i-1,j,k} + 6U^n_{i,j,k} - 4U^n_{i+1,j,k},

and at i = N_ξ - 1,

  h_ξ^4 ∂^4 U^n/∂ξ^4 |_{i,j,k} ≈ U^n_{i-2,j,k} - 4U^n_{i-1,j,k} + 5U^n_{i,j,k}.

Similar difference formulas are also used in the η and ζ directions.
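The dissipation stencils above can be sketched in one coordinate direction as follows. This is a Python/NumPy illustration with 0-based indexing; the function name is ours, and (as a simplifying assumption of the sketch) the h_ξ^4 factor is absorbed into eps, so the stencils are applied as undivided differences.

```python
import numpy as np

def fourth_diff_dissipation(u, eps):
    """Linear fourth-difference dissipation term -eps * (undivided fourth
    difference of u) along one grid line.  u[0] and u[-1] are boundary
    points; interior points use the five-point stencil, and the first and
    last two interior points use the one-sided/biased stencils that keep
    the dissipation matrix non-positive definite."""
    n = len(u)
    d = np.zeros_like(u, dtype=float)
    # one-sided stencils at the first two interior points
    d[1] = 5*u[1] - 4*u[2] + u[3]
    d[2] = -4*u[1] + 6*u[2] - 4*u[3] + u[4]
    # standard five-point stencil in the interior
    d[3:n-3] = u[1:n-5] - 4*u[2:n-4] + 6*u[3:n-3] - 4*u[4:n-2] + u[5:n-1]
    # biased stencils at the last two interior points
    d[n-3] = u[n-5] - 4*u[n-4] + 6*u[n-3] - 4*u[n-2]
    d[n-2] = u[n-4] - 4*u[n-3] + 5*u[n-2]
    return -eps * d
```

On the interior stencil the term annihilates constant and linear data, so it acts only on the high-frequency content of the solution.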

Also, for the vector functions T, V, and W used here,

  M = ∂N/∂ξ,   P = ∂Q/∂η,   R = ∂S/∂ζ.

Then, for the cases where the added higher-order dissipation terms are treated
only explicitly, the linear equation reduces to

  { I - Δτ [ ∂/∂ξ ( A^n + N^n ∂/∂ξ ) + ∂/∂η ( B^n + Q^n ∂/∂η )
           + ∂/∂ζ ( C^n + S^n ∂/∂ζ ) ] } ΔU^n
    = Δτ [ ∂(E^n + T^n)/∂ξ + ∂(F^n + V^n)/∂η + ∂(G^n + W^n)/∂ζ
         - ε ( h_ξ^4 ∂^4 U^n/∂ξ^4 + h_η^4 ∂^4 U^n/∂η^4 + h_ζ^4 ∂^4 U^n/∂ζ^4 ) + H* ].

When the implicit treatment is extended to the added higher-order dissipation
terms as well, the equation takes the following form:

  { I - Δτ [ ∂/∂ξ ( A^n + N^n ∂/∂ξ ) - ε h_ξ^4 ∂^4/∂ξ^4
           + ∂/∂η ( B^n + Q^n ∂/∂η ) - ε h_η^4 ∂^4/∂η^4
           + ∂/∂ζ ( C^n + S^n ∂/∂ζ ) - ε h_ζ^4 ∂^4/∂ζ^4 ] } ΔU^n
    = Δτ [ ∂(E^n + T^n)/∂ξ + ∂(F^n + V^n)/∂η + ∂(G^n + W^n)/∂ζ
         - ε ( h_ξ^4 ∂^4 U^n/∂ξ^4 + h_η^4 ∂^4 U^n/∂η^4 + h_ζ^4 ∂^4 U^n/∂ζ^4 ) + H* ].

The modified vector forcing function is given by

  H* = - [ ∂E(U*)/∂ξ + ∂F(U*)/∂η + ∂G(U*)/∂ζ
         + ∂T(U*, U*_ξ)/∂ξ + ∂V(U*, U*_η)/∂η + ∂W(U*, U*_ζ)/∂ζ ]
       + ε [ h_ξ^4 ∂^4 U*/∂ξ^4 + h_η^4 ∂^4 U*/∂η^4 + h_ζ^4 ∂^4 U*/∂ζ^4 ]

for (ξ, η, ζ) ∈ D ∪ ∂D.

Computation of the explicit part (RHS)

The discretized form of the explicit part is given by

  RHS^n_{i,j,k} = Δτ [ D_ξ(E^n + T^n) + D_η(F^n + V^n) + D_ζ(G^n + W^n) ]_{i,j,k}
               - Δτ ε [ h_ξ^4 D_ξξξξ U^n + h_η^4 D_ηηηη U^n + h_ζ^4 D_ζζζζ U^n ]_{i,j,k}
               + Δτ H*_{i,j,k},

where D_ξ, D_η, and D_ζ are the second-order-accurate central spatial differencing
operators defined on D̄_h. At each point of the mesh, RHS^n_{i,j,k} is a column
vector with 5 components. The discretization of the added higher-order dissipation
terms was already discussed in the previous section. Here we consider the computation
of the first term, using the difference formulas given earlier:

  [ D_ξ(E^n + T^n) + D_η(F^n + V^n) + D_ζ(G^n + W^n) ]_{i,j,k} =

    (1/(2h_ξ)) [ E(U^n_{i+1,j,k}) - E(U^n_{i-1,j,k}) ]
    + (1/h_ξ) { T( (U^n_{i,j,k} + U^n_{i+1,j,k})/2, (U^n_{i+1,j,k} - U^n_{i,j,k})/h_ξ )
              - T( (U^n_{i-1,j,k} + U^n_{i,j,k})/2, (U^n_{i,j,k} - U^n_{i-1,j,k})/h_ξ ) }

    + (1/(2h_η)) [ F(U^n_{i,j+1,k}) - F(U^n_{i,j-1,k}) ]
    + (1/h_η) { V( (U^n_{i,j,k} + U^n_{i,j+1,k})/2, (U^n_{i,j+1,k} - U^n_{i,j,k})/h_η )
              - V( (U^n_{i,j-1,k} + U^n_{i,j,k})/2, (U^n_{i,j,k} - U^n_{i,j-1,k})/h_η ) }

    + (1/(2h_ζ)) [ G(U^n_{i,j,k+1}) - G(U^n_{i,j,k-1}) ]
    + (1/h_ζ) { W( (U^n_{i,j,k} + U^n_{i,j,k+1})/2, (U^n_{i,j,k+1} - U^n_{i,j,k})/h_ζ )
              - W( (U^n_{i,j,k-1} + U^n_{i,j,k})/2, (U^n_{i,j,k} - U^n_{i,j,k-1})/h_ζ ) },

for (i, j, k) ∈ D_h. Also, RHS^n_{i,j,k} = 0 for i = 1 or N_ξ, j = 1 or N_η, and
k = 1 or N_ζ.

This right-hand-side computation is equivalent, in terms of both arithmetic
and communication complexity, to a regular-sparse, block (5×5) matrix-vector
multiplication, which is kernel (c). However, its efficient implementation
in this case does not necessarily require the explicit formation and/or storage
of the regular sparse matrix.

Computation of the forcing vector function

Given the analytical form of U*, the forcing vector function H* can simply be
evaluated analytically as a function of ξ, η, and ζ, using its defining relation;
that function can then be used to evaluate H*_{i,j,k},
which is also a column vector with 5 components.

Here, instead, we opt for a numerical evaluation of H*_{i,j,k}, using U*_{i,j,k} and the
same finite-difference approximations as in the case of the explicit part:

  H*_{i,j,k} = - [ D_ξ(E* + T*) + D_η(F* + V*) + D_ζ(G* + W*) ]_{i,j,k}
             + ε [ h_ξ^4 D_ξξξξ U* + h_η^4 D_ηηηη U* + h_ζ^4 D_ζζζζ U* ]_{i,j,k},

where the central differences of the flux terms are evaluated exactly as in the
explicit-part formula above, with U* in place of U^n, and the fourth-difference
dissipation terms are evaluated using the stencils given in the previous section.
Also, H*_{i,j,k} = 0 for i = 1 or N_ξ, j = 1 or N_η, and k = 1 or N_ζ.

Solution of the system of linear equations

Replacing the spatial derivatives appearing in the linear equation for ΔU^n by their
equivalent difference approximations results in a system of linear equations for
ΔU^n_{i,j,k}, for 2 ≤ i ≤ N_ξ - 1, 2 ≤ j ≤ N_η - 1, and 2 ≤ k ≤ N_ζ - 1. Direct
solution of this system of linear equations requires a formidable matrix-inversion
effort, in terms of both processing time and storage requirements.
Therefore, ΔU^n is obtained through the use of an iterative method. Here we
consider three such iterative methods, involving the kernels (a), (b), and (d).
All of the methods involve some form of approximation to the implicit operator, or
the LHS, of the linear equation. For pseudo-time-marching schemes, it is generally
sufficient to perform only one iteration per time step.

Approximate factorization (Beam-Warming) algorithm

In this method, the implicit operator is approximately
factored in the following manner:

  { I - Δτ ∂/∂ξ ( A^n + N^n ∂/∂ξ ) } { I - Δτ ∂/∂η ( B^n + Q^n ∂/∂η ) }
  { I - Δτ ∂/∂ζ ( C^n + S^n ∂/∂ζ ) } ΔU^n = RHS^n.

The Beam-Warming algorithm is to be implemented as shown below.

Initialization: Set the boundary values of U_{i,j,k} for (i, j, k) ∈ ∂D_h in accordance
with the boundary conditions. Set the initial values of U_{i,j,k} for (i, j, k) ∈ D_h in
accordance with the initial conditions. Compute the forcing-function vector H*_{i,j,k}
for (i, j, k) ∈ D_h.

Step 1 (explicit part): Compute RHS^n_{i,j,k} for (i, j, k) ∈ D_h.

Step 2 (ξ-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU_1, for (i, j, k) ∈ D_h:

  { I - Δτ D_ξ A^n - Δτ D_ξξ N^n } ΔU_1 = RHS^n.

Step 3 (η-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU_2, for (i, j, k) ∈ D_h:

  { I - Δτ D_η B^n - Δτ D_ηη Q^n } ΔU_2 = ΔU_1.

Step 4 (ζ-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU^n, for (i, j, k) ∈ D_h:

  { I - Δτ D_ζ C^n - Δτ D_ζζ S^n } ΔU^n = ΔU_2.

Step 5 (update): U^{n+1}_{i,j,k} = U^n_{i,j,k} + ΔU^n_{i,j,k} for (i, j, k) ∈ D_h.

Steps 1-5 constitute one time-stepping iteration of the approximate factorization
scheme. The solution of the systems of linear equations in each of
Steps 2, 3, and 4 is equivalent to the solution of multiple, independent systems of
block-tridiagonal equations, with each block being a 5×5 matrix, in the
ξ, η, and ζ coordinate directions, respectively. For example, the system of
equations in the ξ-sweep has the following block-tridiagonal structure:

  B_{2,j,k} ΔU_{2,j,k} + C_{2,j,k} ΔU_{3,j,k} = RHS_{2,j,k},

  A_{i,j,k} ΔU_{i-1,j,k} + B_{i,j,k} ΔU_{i,j,k} + C_{i,j,k} ΔU_{i+1,j,k} = RHS_{i,j,k},
    3 ≤ i ≤ N_ξ - 2,

  A_{N_ξ-1,j,k} ΔU_{N_ξ-2,j,k} + B_{N_ξ-1,j,k} ΔU_{N_ξ-1,j,k} = RHS_{N_ξ-1,j,k},

where 2 ≤ j ≤ N_η - 1 and 2 ≤ k ≤ N_ζ - 1. Also, A, B, and C are 5×5
matrices, and ΔU_{i,j,k} is a 5×1 column vector.

Here, for 2 ≤ i ≤ N_ξ - 1, using the difference formulas given earlier,

  A_{i,j,k} = -Δτ { -(1/(2h_ξ)) A(U^n_{i-1,j,k}) + (1/h_ξ^2) N(U^n_{i-1,j,k}) },

  B_{i,j,k} = I + (2Δτ/h_ξ^2) N(U^n_{i,j,k}),

  C_{i,j,k} = -Δτ { (1/(2h_ξ)) A(U^n_{i+1,j,k}) + (1/h_ξ^2) N(U^n_{i+1,j,k}) }.

Also, B_{1,j,k} = I, C_{1,j,k} = 0, A_{N_ξ,j,k} = 0, and B_{N_ξ,j,k} = I, corresponding
to the boundary points, where ΔU = 0.

Diagonal form of the approximate factorization algorithm

Here the approximate factorization algorithm of the previous section is modified
so as to transform the coupled systems given by the left-hand side
into an uncoupled, diagonal form. This involves further approximations
in the treatment of the implicit operator. The diagonalization process targets
only the matrices A, B, and C in the implicit operator; the effects of the other
matrices present in the implicit operator are either ignored or approximated
so as to conform to the resulting diagonal structure.

The diagonalization process is based on the observation that the matrices A,
B, and C each have a set of real eigenvalues and a complete set of eigenvectors.
Therefore, they can be diagonalized through similarity transformations:

  A = T_ξ Λ_ξ (T_ξ)^{-1},   B = T_η Λ_η (T_η)^{-1},   C = T_ζ Λ_ζ (T_ζ)^{-1},

with

  Λ_ξ = diag( u^(2)/u^(1), u^(2)/u^(1), u^(2)/u^(1), u^(2)/u^(1) + a, u^(2)/u^(1) - a ),
  Λ_η = diag( u^(3)/u^(1), u^(3)/u^(1), u^(3)/u^(1), u^(3)/u^(1) + a, u^(3)/u^(1) - a ),
  Λ_ζ = diag( u^(4)/u^(1), u^(4)/u^(1), u^(4)/u^(1), u^(4)/u^(1) + a, u^(4)/u^(1) - a ),

where the sound-speed-like quantity a is given by

  a = ( c_1 c_2 { u^(5) - k_1 [ (u^(2))^2 + (u^(3))^2 + (u^(4))^2 ] / u^(1) } / u^(1) )^{1/2},

c_1 and c_2 are constants, and T_ξ(U), T_η(U), and T_ζ(U) are the matrices whose columns are the
eigenvectors of A, B, and C, respectively. When all matrices other than
A, B, and C are ignored, the implicit operator is given by

  LHS = { I - Δτ ∂/∂ξ A^n } { I - Δτ ∂/∂η B^n } { I - Δτ ∂/∂ζ C^n } ΔU^n.

Substituting for A, B, and C using the similarity transformations, we get

  LHS = { T_ξ^n (T_ξ^n)^{-1} - Δτ ∂/∂ξ ( T_ξ Λ_ξ (T_ξ)^{-1} )^n }
        { T_η^n (T_η^n)^{-1} - Δτ ∂/∂η ( T_η Λ_η (T_η)^{-1} )^n }
        { T_ζ^n (T_ζ^n)^{-1} - Δτ ∂/∂ζ ( T_ζ Λ_ζ (T_ζ)^{-1} )^n } ΔU^n.

A modified form of this equation is obtained by moving the eigenvector
matrices T_ξ, T_η, and T_ζ outside the spatial differential operators ∂/∂ξ, ∂/∂η,
and ∂/∂ζ, respectively:

  LHS ≈ T_ξ^n { I - Δτ ∂/∂ξ Λ_ξ^n } N̂ { I - Δτ ∂/∂η Λ_η^n } P̂
        { I - Δτ ∂/∂ζ Λ_ζ^n } ( T_ζ^n )^{-1} ΔU^n,

where

  N̂ = (T_ξ)^{-1} T_η,   N̂^{-1} = (T_η)^{-1} T_ξ,
  P̂ = (T_η)^{-1} T_ζ,   P̂^{-1} = (T_ζ)^{-1} T_η.

The eigenvector matrices are functions of ξ, η, and ζ, and therefore this modification
introduces further errors of O(Δτ) into the factorization process.

During the factorization process, the presence of the matrices N, Q, and
S in the implicit part was ignored. This is because, in general, the similarity
transformations used for diagonalizing A do not simultaneously diagonalize
N; the same is true for the η and ζ factors as well. This necessitates
some ad hoc, approximate treatment of the ignored terms which, at the same
time, preserves the diagonal structure of the implicit operators. Whatever
the approach used, additional approximation errors are introduced in the
treatment of the implicit operators. We chose to approximate N, Q, and S
in the implicit operators by diagonal matrices whose values are given by the
spectral radii ρ(N), ρ(Q), and ρ(S), respectively. In addition, we also
treat the added fourth-difference dissipation terms implicitly:

  LHS ≈ T_ξ^n { I - Δτ ∂/∂ξ Λ_ξ^n - Δτ ρ(N^n) ∂^2/∂ξ^2 I + Δτ ε h_ξ^4 ∂^4/∂ξ^4 I } N̂
        { I - Δτ ∂/∂η Λ_η^n - Δτ ρ(Q^n) ∂^2/∂η^2 I + Δτ ε h_η^4 ∂^4/∂η^4 I } P̂
        { I - Δτ ∂/∂ζ Λ_ζ^n - Δτ ρ(S^n) ∂^2/∂ζ^2 I + Δτ ε h_ζ^4 ∂^4/∂ζ^4 I } ( T_ζ^n )^{-1} ΔU^n.

The matrices T_ξ, (T_ξ)^{-1}, N̂, and P̂ (and the corresponding η- and ζ-direction
matrices) are full 5×5 matrices whose entries are algebraic functions of the
dependent variables u^(1), ..., u^(5), the quantity a, and the constants k_1
and k_2. In addition, the spectral radii of the matrices N, Q, and S are
given by the maxima of their respective diagonal entries, which are formed
from the diffusion constants d^ξ_m (respectively d^η_m and d^ζ_m) augmented
by terms of the form k_3 k_4 / u^(1) and k_3 k_4 k_5 / u^(1):

  ρ(N) = max( d^ξ_1, d^ξ_2 + k_3 k_4 / u^(1), ..., d^ξ_5 + k_3 k_4 k_5 / u^(1) ),

and similarly for ρ(Q) and ρ(S).

The explicit part of the diagonal algorithm is exactly the same as before,
and all of the approximations are restricted to the implicit operator.
Therefore, if the diagonal algorithm converges, the steady-state solution
will be identical to the one obtained without diagonalization. However, the
convergence behavior of the diagonalized algorithm will, in general, be different.

The Diagonalized Approximate Factorization algorithm is to be implemented
in the following order:

Initialization: Set the boundary values of U_{i,j,k} for (i, j, k) ∈ ∂D_h in accordance
with the boundary conditions. Set the initial values of U_{i,j,k} for (i, j, k) ∈ D_h in
accordance with the initial conditions. Compute the forcing-function vector H*_{i,j,k} for
(i, j, k) ∈ D_h.

Step 1 (explicit part): Compute RHS^n_{i,j,k} for (i, j, k) ∈ D_h.

Step 2: Perform the matrix-vector multiplication

  ΔU_1 = (T_ξ^n)^{-1} RHS^n.

Step 3 (ξ-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU_2:

  { I - Δτ D_ξ Λ_ξ^n - Δτ ρ(N^n) D_ξξ I + Δτ ε h_ξ^4 D_ξξξξ I } ΔU_2 = ΔU_1.

Step 4: Perform the matrix-vector multiplication

  ΔU_3 = N̂^{-1} ΔU_2.

Step 5 (η-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU_4:

  { I - Δτ D_η Λ_η^n - Δτ ρ(Q^n) D_ηη I + Δτ ε h_η^4 D_ηηηη I } ΔU_4 = ΔU_3.

Step 6: Perform the matrix-vector multiplication

  ΔU_5 = P̂^{-1} ΔU_4.

Step 7 (ζ-sweep of the implicit part): Form and solve the following system
of linear equations for ΔU_6:

  { I - Δτ D_ζ Λ_ζ^n - Δτ ρ(S^n) D_ζζ I + Δτ ε h_ζ^4 D_ζζζζ I } ΔU_6 = ΔU_5.

Step 8: Perform the matrix-vector multiplication

  ΔU^n = T_ζ^n ΔU_6.

Step 9 (update): U^{n+1} = U^n + ΔU^n.

Steps 1-9 constitute one iteration of the diagonal form of the approximate
factorization algorithm. The new implicit operators are block
pentadiagonal; however, the blocks are diagonal in form, so that the operators
simplify into five independent scalar pentadiagonal systems. Therefore,
each of Steps 3, 5, and 7 involves the solution of multiple, independent
systems of scalar pentadiagonal equations in the ξ, η, and ζ coordinate directions,
respectively. For example, each of the five independent scalar
pentadiagonal systems in the ξ-sweep has the following structure:

  c_{2,j,k} Δu^(m)_{2,j,k} + d_{2,j,k} Δu^(m)_{3,j,k} + e_{2,j,k} Δu^(m)_{4,j,k} = r^(m)_{2,j,k},

  b_{3,j,k} Δu^(m)_{2,j,k} + c_{3,j,k} Δu^(m)_{3,j,k} + d_{3,j,k} Δu^(m)_{4,j,k}
    + e_{3,j,k} Δu^(m)_{5,j,k} = r^(m)_{3,j,k},

  a_{i,j,k} Δu^(m)_{i-2,j,k} + b_{i,j,k} Δu^(m)_{i-1,j,k} + c_{i,j,k} Δu^(m)_{i,j,k}
    + d_{i,j,k} Δu^(m)_{i+1,j,k} + e_{i,j,k} Δu^(m)_{i+2,j,k} = r^(m)_{i,j,k},
    4 ≤ i ≤ N_ξ - 3,

  a_{N_ξ-2,j,k} Δu^(m)_{N_ξ-4,j,k} + b_{N_ξ-2,j,k} Δu^(m)_{N_ξ-3,j,k}
    + c_{N_ξ-2,j,k} Δu^(m)_{N_ξ-2,j,k} + d_{N_ξ-2,j,k} Δu^(m)_{N_ξ-1,j,k} = r^(m)_{N_ξ-2,j,k},

  a_{N_ξ-1,j,k} Δu^(m)_{N_ξ-3,j,k} + b_{N_ξ-1,j,k} Δu^(m)_{N_ξ-2,j,k}
    + c_{N_ξ-1,j,k} Δu^(m)_{N_ξ-1,j,k} = r^(m)_{N_ξ-1,j,k},

where 2 ≤ j ≤ N_η - 1, 2 ≤ k ≤ N_ζ - 1, and m = 1, ..., 5; here Δu^(m) denotes the
m-th component of the unknown vector of the sweep (ΔU_2 in the ξ-sweep) and r^(m)
the corresponding right-hand side (ΔU_1). The elements of
the pentadiagonal matrix are given by the following.

For the interior rows, 4 ≤ i ≤ N_ξ - 3,

  a_{i,j,k} = Δτ ε,

  b_{i,j,k} = (Δτ/(2h_ξ)) λ^(m,m)(U^n_{i-1,j,k}) - (Δτ/h_ξ^2) ρ(N^n_{i-1,j,k}) - 4Δτε,

  c_{i,j,k} = 1 + (2Δτ/h_ξ^2) ρ(N^n_{i,j,k}) + 6Δτε,

  d_{i,j,k} = -(Δτ/(2h_ξ)) λ^(m,m)(U^n_{i+1,j,k}) - (Δτ/h_ξ^2) ρ(N^n_{i+1,j,k}) - 4Δτε,

  e_{i,j,k} = Δτ ε,

where λ^(m,m) denotes the m-th diagonal entry of Λ_ξ. In the rows adjacent to
the boundaries (i = 2, 3, N_ξ - 2, and N_ξ - 1), the coefficients that would
reference boundary or out-of-range points vanish (a_{2,j,k} = b_{2,j,k} = a_{3,j,k} = 0,
e_{N_ξ-2,j,k} = 0, and d_{N_ξ-1,j,k} = e_{N_ξ-1,j,k} = 0), and the dissipation
contributions are modified in accordance with the one-sided stencils given
earlier; in particular, the 6Δτε in c_{i,j,k} becomes 5Δτε at i = 2 and
i = N_ξ - 1.
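A scalar pentadiagonal solve of the kind required in Steps 3, 5, and 7 can be sketched as follows. This is a Python/NumPy sketch using forward elimination without pivoting, relying on the diagonal dominance produced by the implicit operator; the function name and 0-based indexing are ours.

```python
import numpy as np

def penta_solve(a, b, c, d, e, r):
    """Solve the scalar pentadiagonal system
        a[i] x[i-2] + b[i] x[i-1] + c[i] x[i] + d[i] x[i+1] + e[i] x[i+2] = r[i]
    for i = 0..n-1 (out-of-range coefficients are taken as zero)."""
    n = len(r)
    a = np.asarray(a, float).copy(); b = np.asarray(b, float).copy()
    c = np.asarray(c, float).copy(); d = np.asarray(d, float).copy()
    e = np.asarray(e, float).copy(); r = np.asarray(r, float).copy()
    for i in range(1, n):
        # eliminate the first sub-diagonal entry of row i using row i-1
        m = b[i] / c[i-1]
        c[i] -= m * d[i-1]; d[i] -= m * e[i-1]; r[i] -= m * r[i-1]
        if i + 1 < n:
            # eliminate the second sub-diagonal entry of row i+1 using row i-1
            m2 = a[i+1] / c[i-1]
            b[i+1] -= m2 * d[i-1]; c[i+1] -= m2 * e[i-1]; r[i+1] -= m2 * r[i-1]
    x = np.zeros(n)
    x[n-1] = r[n-1] / c[n-1]
    x[n-2] = (r[n-2] - d[n-2] * x[n-1]) / c[n-2]
    for i in range(n-3, -1, -1):
        x[i] = (r[i] - d[i] * x[i+1] - e[i] * x[i+2]) / c[i]
    return x
```

As with the block-tridiagonal sweeps, the five systems per grid line (one per component m) and all of the grid lines in a sweep are mutually independent.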

Symmetric successive over-relaxation algorithm

In this method, the system of linear equations obtained after replacing
the spatial derivatives appearing in the implicit operator by
second-order-accurate central finite-difference operators is solved using the
symmetric successive over-relaxation (SSOR) scheme. Let the linear system be given
by

  K^n ΔU^n = RHS^n,

where

  K^n = I - Δτ { D_ξ A^n + D_ξξ N^n + D_η B^n + D_ηη Q^n + D_ζ C^n + D_ζζ S^n }

for (i, j, k) ∈ D_h.

It is specified that the unknowns be ordered lexicographically, corresponding to the
grid points, such that the index in the ξ-direction runs fastest, followed
by the index in the η-direction, and finally the index in the ζ-direction. The finite-difference
discretization matrix K resulting from such an ordering has a very regular
banded structure: there are altogether seven block diagonals, each with a
5×5 block size.

The matrix K can be written as the sum of the matrices D, Y, and Z,

  K^n = D^n + Y^n + Z^n,

where

  D^n = the main block diagonal of K^n,
  Y^n = the three sub-block diagonals of K^n,
  Z^n = the three super-block diagonals of K^n.

Therefore, D is a block-diagonal matrix, while Y and Z are strictly lower
and upper triangular, respectively. Then the point-SSOR iteration scheme
can be written as

  X^n ΔU^n = RHS^n,

where

  X^n = (D^n + ω Y^n) (D^n)^{-1} (D^n + ω Z^n)
      = (D^n + ω Y^n) ( I + ω (D^n)^{-1} Z^n ),

and ω is the over-relaxation factor, a specified constant. The
SSOR algorithm is to be implemented in the following manner:

Initialization: Set the boundary values of U_{i,j,k} in accordance with the boundary
conditions. Set the initial values of U_{i,j,k} in accordance with the initial conditions.
Compute the forcing-function vector H*_{i,j,k} for (i, j, k) ∈ D_h.

Step 1 (explicit part): Compute RHS^n_{i,j,k} for (i, j, k) ∈ D_h.

Step 2 (lower-triangular-system solution): Form and solve the following
regular sparse, block lower triangular system to get ΔU_1:

  ( D^n + ω Y^n ) ΔU_1 = RHS^n.

Step 3 (upper-triangular-system solution): Form and solve the following
regular sparse, block upper triangular system to get ΔU^n:

  ( I + ω (D^n)^{-1} Z^n ) ΔU^n = ΔU_1.

Step 4 (update):

  U^{n+1} = U^n + ΔU^n.

Steps constitute one complete iteration of the SSOR scheme The

l th blo ckrow of the matrix K has the following structure

n n n n

A U B U C U D U

l l l l l l

lN N lN

n n n

E U F U G U RHS

l l l l l

lN lN N

where l i N j N N k and i j k D

h

The matrices are given by

    A_l = -(Δτ/(2hζ)) C(U^(n)_{i,j,k-1}) - (Δτ/hζ²) S(U^(n)_{i,j,k-1})

    B_l = -(Δτ/(2hη)) B(U^(n)_{i,j-1,k}) - (Δτ/hη²) Q(U^(n)_{i,j-1,k})

    C_l = -(Δτ/(2hξ)) A(U^(n)_{i-1,j,k}) - (Δτ/hξ²) N(U^(n)_{i-1,j,k})

    D_l = I + (2Δτ/hξ²) N(U^(n)_{i,j,k}) + (2Δτ/hη²) Q(U^(n)_{i,j,k}) + (2Δτ/hζ²) S(U^(n)_{i,j,k})

    E_l = (Δτ/(2hξ)) A(U^(n)_{i+1,j,k}) - (Δτ/hξ²) N(U^(n)_{i+1,j,k})

    F_l = (Δτ/(2hη)) B(U^(n)_{i,j+1,k}) - (Δτ/hη²) Q(U^(n)_{i,j+1,k})

    G_l = (Δτ/(2hζ)) C(U^(n)_{i,j,k+1}) - (Δτ/hζ²) S(U^(n)_{i,j,k+1})

Also, A_l, B_l and C_l are the elements in the l-th block-row of the matrix Y, whereas E_l, F_l and G_l are the elements in the l-th block-row of the matrix Z.
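As an illustration, the four SSOR steps above can be sketched in Python on a scalar analogue. This is a sketch under stated assumptions, not part of the benchmark specification: the 5×5 blocks are replaced by scalars, K is a small dense matrix split into its diagonal D, strict lower part Y and strict upper part Z, and `ssor_iteration` is an illustrative helper name.

```python
# Scalar-analogue sketch of one SSOR iteration (Steps 1-4 above).
# Assumption: 1x1 "blocks", dense K, pure Python; illustrative only.

def ssor_iteration(K, u, b, omega):
    n = len(K)
    # Step 1: explicit part -- residual RHS = b - K*u
    rhs = [b[i] - sum(K[i][j] * u[j] for j in range(n)) for i in range(n)]
    # Step 2: forward-solve (D + omega*Y) dU1 = RHS  (Y = strict lower part)
    du1 = [0.0] * n
    for i in range(n):
        s = sum(omega * K[i][j] * du1[j] for j in range(i))
        du1[i] = (rhs[i] - s) / K[i][i]
    # Step 3: back-solve (I + omega*D^-1*Z) dU = dU1  (Z = strict upper part)
    du = [0.0] * n
    for i in reversed(range(n)):
        s = sum(omega * K[i][j] * du[j] for j in range(i + 1, n)) / K[i][i]
        du[i] = du1[i] - s
    # Step 4: update the solution
    return [u[i] + du[i] for i in range(n)]

# Usage: iterate on a small diagonally dominant system with omega = 1.2;
# the exact solution of K*u = b below is (1, 2, 3).
K = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]
u = [0.0, 0.0, 0.0]
for _ in range(50):
    u = ssor_iteration(K, u, b, 1.2)
```

In the benchmark proper, each division by K[i][i] becomes a 5×5 block solve, and the forward and backward sweeps run over the lexicographic ordering described above.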

Benchmarks

There are three benchmarks associated with the three numerical schemes described in the preceding sections for the solution of the system of linear equations. Each benchmark consists of running one of the three numerical schemes for N_s time steps with given values for Nξ, Nη, Nζ and Δτ.

Verification test

For each of the benchmarks, at the completion of N_s time steps, compute the following:

1. Root-mean-square norms RMS_R(m) of the residual vectors {RHS^(n)_{m,ijk}} for m = 1,...,5, where n = N_s and (i,j,k) ∈ D_h:

    RMS_R(m) = sqrt( [ Σ_k Σ_j Σ_i {RHS^(N_s)_{m,ijk}}² ] / [(Nξ-2)(Nη-2)(Nζ-2)] )

2. Root-mean-square norms RMS_E(m) of the error vectors {U*_{m,ijk} - U^(n)_{m,ijk}} for m = 1,...,5, where n = N_s and (i,j,k) ∈ D_h:

    RMS_E(m) = sqrt( [ Σ_k Σ_j Σ_i {U*_{m,ijk} - U^(N_s)_{m,ijk}}² ] / [(Nξ-2)(Nη-2)(Nζ-2)] )
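A minimal sketch of these RMS norms in Python, assuming the five-component field is held as a nested list `v[m][i][j][k]` over the full grid, with the sums running over interior points only; the helper name `rms_norm` is illustrative, not from the specification:

```python
import math

def rms_norm(v, nx, ny, nz):
    # RMS of v[m][i][j][k] over interior points, for each of the 5 components.
    # Interior points are indices 1 .. n-2 in each direction (0-based).
    norms = []
    for m in range(5):
        s = 0.0
        for k in range(1, nz - 1):
            for j in range(1, ny - 1):
                for i in range(1, nx - 1):
                    s += v[m][i][j][k] ** 2
        norms.append(math.sqrt(s / ((nx - 2) * (ny - 2) * (nz - 2))))
    return norms

# Usage: a constant field of 2.0 on a 4x4x4 grid gives RMS 2.0 per component
v = [[[[2.0] * 4 for _ in range(4)] for _ in range(4)] for _ in range(5)]
norms = rms_norm(v, 4, 4, 4)
```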

3. The numerically evaluated surface integral, given by

    I = (hξhη/4) Σ_{i=i1..i2-1} Σ_{j=j1..j2-1} [ φ_{i,j,k1} + φ_{i+1,j,k1} + φ_{i,j+1,k1} + φ_{i+1,j+1,k1}
            + φ_{i,j,k2} + φ_{i+1,j,k2} + φ_{i,j+1,k2} + φ_{i+1,j+1,k2} ]
      + (hξhζ/4) Σ_{i=i1..i2-1} Σ_{k=k1..k2-1} [ φ_{i,j1,k} + φ_{i+1,j1,k} + φ_{i,j1,k+1} + φ_{i+1,j1,k+1}
            + φ_{i,j2,k} + φ_{i+1,j2,k} + φ_{i,j2,k+1} + φ_{i+1,j2,k+1} ]
      + (hηhζ/4) Σ_{j=j1..j2-1} Σ_{k=k1..k2-1} [ φ_{i1,j,k} + φ_{i1,j+1,k} + φ_{i1,j,k+1} + φ_{i1,j+1,k+1}
            + φ_{i2,j,k} + φ_{i2,j+1,k} + φ_{i2,j,k+1} + φ_{i2,j+1,k+1} ]

where i1, i2, j1, j2, k1 and k2 are specified constants such that 1 < i1 < i2 < Nξ, 1 < j1 < j2 < Nη and 1 < k1 < k2 < Nζ, and

    φ = C2 { u5 - (u2² + u3² + u4²) / (2 u1) }

The validity of each of these quantities is measured according to

    |X_c - X_r| / |X_r| ≤ ε

where X_c and X_r are the computed and reference values, respectively, and ε is the maximum allowable relative error. The value of ε is specified to be 10⁻⁸. The values of X_r depend on the numerical scheme under consideration and on the values specified for Nξ, Nη, Nζ and N_s.
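The acceptance criterion above amounts to a one-line relative-error test; `verified` is an illustrative helper name, with ε = 10⁻⁸ as the default:

```python
def verified(x_computed, x_reference, epsilon=1e-8):
    # Accept when |Xc - Xr| / |Xr| <= epsilon (relative-error test)
    return abs(x_computed - x_reference) <= epsilon * abs(x_reference)

# Usage
ok = verified(1.0 + 1e-10, 1.0)   # within tolerance
bad = verified(1.1, 1.0)          # 10% relative error, far outside tolerance
```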

Benchmark 1

Class A: Perform N_s iterations of the Approximate Factorization Algorithm with the following parameter values: Nξ = Nη = Nζ = 64, Δτ = 0.0008 and N_s = 200. Timing for this benchmark should begin just before the first step of the first iteration is started and end just after the last step of the N_s-th iteration is complete.

Class B: Same, except Nξ = Nη = Nζ = 102 and Δτ = 0.0003.
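The timing rule, which applies to all three benchmarks, can be sketched as follows; `do_one_iteration` is a hypothetical stand-in for one complete iteration of the chosen scheme, and initialization work is deliberately outside the timed region:

```python
import time

def timed_benchmark(do_one_iteration, state, n_steps):
    # Clock starts just before the first step of iteration 1 ...
    t0 = time.perf_counter()
    for _ in range(n_steps):
        state = do_one_iteration(state)   # one complete iteration (all steps)
    # ... and stops just after the last step of iteration n_steps.
    elapsed = time.perf_counter() - t0
    return state, elapsed

# Usage with a trivial stand-in iteration
final, secs = timed_benchmark(lambda s: s + 1, 0, 200)
```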

Benchmark 2

Class A: Perform N_s iterations of the Diagonal Form of the Approximate Factorization Algorithm with the following parameter values: Nξ = Nη = Nζ = 64, Δτ = 0.0015 and N_s = 400. Timing for this benchmark should begin just before the first step of the first iteration is started and end just after the last step of the N_s-th iteration is complete.

Class B: Same, except Nξ = Nη = Nζ = 102 and Δτ = 0.001.

Benchmark 3

Class A: Perform N_s iterations of the Symmetric Successive Over-Relaxation Algorithm with the following parameter values: Nξ = Nη = Nζ = 64, ω = 1.2 and N_s = 250. Timing for this benchmark should begin just before the first step of the first iteration is started and end just after the last step of the N_s-th iteration is complete.

Class B: Same, except Nξ = Nη = Nζ = 102.

For all benchmarks, the values of the remaining constants are specified: the constants k1, ..., k5; the five components of each of the constants C1, ..., C13; and the dissipation constants d1, ..., d5 in each of the ξ, η and ζ directions, together with d = max(dξ, dη, dζ).

References

Bailey, D. H., and Barton, J. T., "The NAS Kernel Benchmark Program," NASA TM.

Pulliam, T. H., "Efficient Solution Methods for the Navier-Stokes Equations," Lecture Notes for the von Karman Institute for Fluid Dynamics Lecture Series: Numerical Techniques for Viscous Flow Computation in Turbomachinery Bladings, Brussels, Belgium.

Yoon, S., Kwak, D., and Chang, L., "LU-SGS Implicit Algorithm for Three-Dimensional Incompressible Navier-Stokes Equations with Source Term," AIAA Paper (CP).

Jameson, A., Schmidt, W., and Turkel, E., "Numerical Solution of the Euler Equations by Finite Volume Methods Using Runge-Kutta Time-Stepping Schemes," AIAA Paper.

Steger, J. L., and Warming, R. F., "Flux Vector Splitting of the Inviscid Gas Dynamics Equations with Applications to Finite Difference Methods," Journal of Computational Physics.

van Leer, B., "Flux Vector Splitting for the Euler Equations," Eighth International Conference on Numerical Methods in Fluid Dynamics, Lecture Notes in Physics, edited by E. Krause.

Roe, P. L., "Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes," Journal of Computational Physics.

Harten, A., "High Resolution Schemes for Hyperbolic Conservation Laws," Journal of Computational Physics.

Gordon, W. J., "Blending Function Methods for Bivariate and Multivariate Interpolation," SIAM Journal on Numerical Analysis.

Warming, R. F., and Beam, R. M., "On the Construction and Application of Implicit Factored Schemes for Conservation Laws," Symposium on Computational Fluid Dynamics, SIAM-AMS Proceedings.

Lapidus, L., and Seinfeld, J. H., Numerical Solution of Ordinary Differential Equations, Academic Press, New York.

Lomax, H., "Stable Implicit and Explicit Numerical Methods for Integrating Quasi-Linear Differential Equations with Parasitic-Stiff and Parasitic-Saddle Eigenvalues," NASA TN.

Pulliam, T. H., and Chaussee, D. S., "A Diagonal Form of an Implicit Approximate Factorization Algorithm," Journal of Computational Physics.

Chan, T. F., and Jackson, K. R., "Nonlinearly Preconditioned Krylov Subspace Methods for Discrete Newton Algorithms," SIAM Journal on Scientific and Statistical Computing.

Axelsson, O., "A Generalized SSOR Method," BIT.

Olsson, P., and Johnson, S. L., "Boundary Modifications of the Dissipation Operators for the Three-Dimensional Euler Equations," Journal of Scientific Computing.

Coakley, T. J., "Numerical Methods for Gas Dynamics Combining Characteristics and Conservation Concepts," AIAA Paper.