Experience with Parallel Computers at NASA Ames

David H. Bailey

November 14, 1991

Ref: Intl. J. of High Speed Computing, vol. 5, no. 1 (1993), pg. 51-62.

Abstract

Beginning in 1988, the Numerical Aerodynamic Simulation (NAS) organization at NASA Ames Research Center has studied the usability of highly parallel computers on computational fluid dynamics and other aerophysics applications. Presently this organization operates a CM-2 with 32,768 nodes and an Intel iPSC/860 with 128 nodes. This note gives an overview of the experience in using these systems, highlights both strong points and weak points of each, and discusses what improvements will be required in future highly parallel systems in order that they can gain acceptance in the mainstream of scientific computing.

The author is with the NAS Applied Research Branch at NASA Ames Research Center, Moffett Field, CA 94035.

Introduction

NASA Ames Research Center has long been a trail blazer in the field of high performance scientific computation, particularly for computational fluid dynamics (CFD) and other aerophysics applications. Back in the 1970s NASA Ames was the home of the Illiac IV, which featured a peak performance of approximately 250 MFLOPS. In 1982, NASA Ames took delivery of the first two-processor Cray X-MP, which featured a peak performance of nearly 400 MFLOPS. In 1986 the newly formed Numerical Aerodynamic Simulation (NAS) organization at NASA Ames installed the first full-memory Cray-2, with a peak performance of 1.6 GFLOPS. Most recently, the NAS facility added in 1988 a Cray Y-MP with a peak performance of 2.6 GFLOPS.

By the late 1980s it became clear that the "classical" supercomputer design epitomized by existing Cray systems was approaching maturity, and that highly parallel computer systems had the most potential for achieving multi-TFLOPS levels of sustained performance, which we project will be required before the end of the decade. In order to better understand these systems and to help advance this technology to the level required for mainstream supercomputing, the NAS Applied Research Branch was organized in 1988. Scientists in this organization and its affiliates port application programs to parallel computers, develop new algorithms and programming techniques, and study the performance of the resulting implementations [1, 2, 3, 4, 6, 7, 8, 9, 10, 12]. The scope and diversity of these projects can be seen by the partial listing in Table 1.

The NAS project now operates a Connection Machine 2 (CM-2) by Thinking Machines, Inc. and an iPSC/860 by Intel Scientific Computers. There are plans to acquire significantly more powerful systems in the future. As of this date, the CM-2 has been operational for nearly three years and the iPSC/860 for over one year. This note gives a brief summary of our experience and highlights both the strengths and the weaknesses that we have observed in these two systems. Requirements and recommendations for future systems are then outlined.

The Connection Machine

The NAS CM-2, obtained in collaboration with the Defense Advanced Research Projects Agency (DARPA), was installed in 1988 with 32,768 bit-serial processors. Recently it was upgraded with 1,024 64-bit floating point units and four gigabytes of main memory, or about twice that of the NAS Cray-2. The system also now has the "slicewise" software, which treats the hardware as a SIMD array of 64-bit floating point processing units.

For the first year or two, most applications on the CM-2 were coded in *LISP, an extension of the LISP language for parallel processing. More recently, however, programmers have started using the CM Fortran language, which is based on Fortran-90 [5]. CM Fortran has been plagued by delays, and version 1.0 was just recently made available.

One notable achievement on the CM-2 is the porting of a variation of the NASA Ames "ARC3D" code. This is a three dimensional implicit Navier-Stokes fluid dynamics application and employs a number of state of the art techniques.

Project                            Researchers                        Y-MP   CM-2   iPSC
Multigrid (NAS benchmark)          Frederickson, Barszcz                x      x      x
Conj. gradient (NAS benchmark)     Schreiber, Simon                     x      x      x
3-D FFT PDE (NAS benchmark)        Bailey, Frederickson                 x      x      x
Integer sort (NAS benchmark)       Dagum                                x      x      x
LU solver (NAS benchmark)          Fatoohi, Venkatakrishnan             x      x      x
Scalar penta. (NAS benchmark)      Barszcz, Weeratunga                  x      x      x
Block tridiag. (NAS benchmark)     Barszcz, Weeratunga                  x      x      x
INS3D (incomp. Navier-Stokes)      Fatoohi, Yoon                        x      x      x
Isotropic turbulence simulation    Wray, Rogallo                        x      x      x
PSIM3 (particle method)            McDonald                             x             x
PSICM (particle method)            Dagum                                       x
F3D (ARC3D multi-zone)             Barszcz, Chawala, Weeratunga         x             x
CM3D (ARC3D derivative)            Levit, Jespersen                     x      x
ARC2D                              Weeratunga                                         x
Unstructured Euler solver          Hammond, Barth, Venkatakrishnan             x      x
Unstructured partitioning          Simon                                              x
High precision vortex analysis     Bailey, Krasny                                     x

Table 1: Overview of Parallel CFD Research at NASA Ames

On a 64 × 64 × 32 grid, this code now runs at 190 MFLOPS on 16,384 processors [8]. The equivalent one processor Y-MP rate for this code is 128 MFLOPS.

The strongest positive feature of the CM-2, from our perspective, is TMC's adoption of a programming language based on Fortran-90. Some have questioned whether the Fortran-90 array constructs are sufficiently powerful to encompass a large fraction of modern scientific computation. Basically, this is the same as the question of to what extent the data parallel programming model encompasses modern scientific computation. While we cannot speak for other facilities, it seems clear that a very large fraction of our applications can be coded effectively using this model. Even some applications, such as unstructured grid computations, which appeared at first to be inappropriate for data parallel computation, have subsequently been successfully ported to the CM-2, although different data structures and implementation techniques have been required [6].
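
As a concrete illustration of the data parallel style, the following fragment is a minimal sketch of our own (it is not taken from any of the codes above; the program name, the array u and the grid size n are purely illustrative). It expresses a Jacobi-type relaxation sweep as a single Fortran-90 array assignment, so that every interior grid point can, in principle, be updated concurrently:

    program jacobi_sketch
      implicit none
      integer, parameter :: n = 64
      real :: u(n,n), unew(n,n)

      u = 0.0
      u(1,:) = 1.0        ! simple illustrative boundary condition
      unew = u

      ! One relaxation sweep written as a single array assignment; on a
      ! data parallel machine each interior point updates concurrently.
      unew(2:n-1,2:n-1) = 0.25 * (u(1:n-2,2:n-1) + u(3:n,2:n-1) + &
                                  u(2:n-1,1:n-2) + u(2:n-1,3:n))

      print *, 'sample updated value:', unew(2,2)
    end program jacobi_sketch

Written this way, the parallelism is carried by the array syntax itself rather than by explicit node-level message passing, and the same statement compiles unchanged on a serial workstation or a Cray.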

Another positive feature of the CM-2 is that its operating system is relatively stable. It appears to be largely free of the frequent system "crashes" and other difficulties that plague some other highly parallel systems.

One weak point of the CM-2 is that it was not designed from the beginning for floating point computation, and the floating point processors now in the system have been incorporated in a somewhat inelegant fashion. While this has been remedied somewhat with the recent hardware upgrade and the "slicewise" software, the limited bandwidth between floating point processors and the associated main memory modules remains a problem. Related to this is the limited bandwidth between nodes. Even when using the hypercube network, data cannot be transferred between nodes at a rate commensurate with the floating point computation rate. And when the "router" (an alternate hardware facility for irregular communication) must be used, the performance of the code often drops to workstation levels.

Another weakness of the CM-2 design is that until recently no more than four users could be using the system at a given time, which is too restrictive for daytime operation when scientists are developing and debugging their codes. This restriction stems from the fact that the hardware cannot be partitioned any finer than an 8,192 node section. This difficulty has been remedied by the introduction of time sharing software that allows more than one job to be executing in a given sector of the CM-2. However, the down side of this solution is that individual jobs must occupy more processing nodes than necessary, and thus their performance is reduced accordingly.

The principal weakness with the CM-2 at the present time, however, is the CM Fortran compiler, which still has a number of bugs and frequently delivers disappointing performance. Also, the array intrinsic functions are in several cases rather inefficient. Partly this is due to the fact that writing an effective Fortran-90 compiler, including efficient array intrinsics, has proved to be much more of a challenge than originally anticipated. Other vendors attempting to write Fortran-90 compilers are reported to be having similar difficulties. In any event, clearly the CM Fortran compiler must be further improved.

The weaknesses of the current version of the CM Fortran compiler, coupled with the massively parallel design of the hardware, result in a system that is often very sensitive to the algorithm and implementation technique selected. It is not unusual for a reasonably well conceived and well written code to deliver single digit MFLOPS performance rates when first executed. Often the programmer must try quite a number of different parallel algorithms and implementation schemes before finding one that delivers respectable performance. Experienced CM-2 programmers have learned numerous "tricks" that can improve performance, but until recently TMC made very little documentation on these "tricks" available.

The Intel iPSC/860

The NAS Intel iPSC/860 system, which was one of the first two of these systems shipped by Intel, was installed in January 1990. It has 128 nodes, each of which contains a 40 MHz i860 processor (with a 64-bit peak performance rate of 60 MFLOPS) and 8 megabytes of memory. The complete system has one gigabyte of memory and a theoretical peak performance of roughly seven GFLOPS.

Programs for the Intel system may be coded in either Fortran or C, although most scientists have chosen Fortran. At first, the available compilers did not utilize many of the advanced features of the i860, and so single node performance was predictably poor, typically only one or two MFLOPS. More importantly, compile/link times were very long, typically 30 minutes for a 5,000 line program. Recently Intel made available some new Fortran and C compilers that were produced by the Portland Group. While the initial release of these compilers did not feature significantly faster execution speeds, at least the compile/link speeds were much better. Further, the Portland Group's i860 compilers are now available on Sun-4 and Silicon Graphics workstations, and running them off-line in this manner has further reduced compile/link times by more than a factor of ten.

The highest performance rate achieved so far on our iPSC/860 system is 1.6 GFLOPS (32-bit), which has been obtained on an isotropic turbulence simulation program. This code was written in the Vectoral language by Alan Wray and Robert Rogallo and features an assembly-coded fast Fourier transform kernel [12]. This project required approximately two person-years of effort, including the time required by Wray to port his Vectoral language to the i860.

One significant advantage of the Intel system is that it is fairly easy to obtain moderately respectable performance from it. In our experience, almost all codes that have employed reasonably well conceived algorithms have achieved at least 100 MFLOPS (64-bit) on 128 nodes, and many run at significantly higher rates. Another advantage of the Intel system is that the 128 nodes can be decomposed rather flexibly among multiple users for debugging and program development. Currently up to ten subcubes can be allocated, but there is no fundamental reason that more could not be allowed.

One weakness of the current system is that single node performance is disappointing, even with 40 MHz processors and the Portland Group compilers. Most Fortran codes only run at 4 or 5 MFLOPS. Since the peak performance of the i860 is 60 MFLOPS (64-bit), there is ample room for improvement here. However, from our analyses [7] we do not expect more than about 8 to 10 MFLOPS per node on most Fortran codes even with the best Fortran compiler, due mainly to the somewhat limited bandwidth between main memory and the i860 processor. We have found that the 8 kilobyte on-chip data cache in the i860 is too small to significantly improve performance on most codes. A much larger cache would definitely help, but improved main memory bandwidth would be more valuable for the main body of Fortran scientific codes.

Once the Fortran compiler has been improved, the limited bandwidth of the current interprocessor network will loom as a serious performance bottleneck. Obviously Intel recognizes this problem, and the new Touchstone Delta system just installed at Caltech features a grid network with significantly higher bandwidth. Studies are currently in progress on the Delta system to determine whether this new design has alleviated this performance bottleneck.

Needless to say, programmers are not pleased that only low-level communications and synchronization primitives are available on the Intel system. Programmers who have worked on the CM-2 in particular complain about having to manually decompose arrays, synchronize operations and communicate data. While a Fortran-90 multiprocessor compiler is probably the best long-term answer, work in progress on "Fortran-D" at Rice University and elsewhere may provide programmers with help in the short term.

However, the most annoying drawback of the current system is the instability of the operating system. Even one year after installation, it is not unusual for the system to have to be rebooted three or four times in a single day. These frequent "crashes" are discouraging even to the most determined and fearless programmers. Related to this problem is the equally serious drawback of a weak front end system, which is basically a 386 personal computer. Fortunately, with the new workstation-based compilers, users do not have to compile and link on the front end system anymore. But this system is clearly not designed to accommodate numerous interactive users, and future versions of the iPSC must include much more powerful service facilities.

NAS Parallel Benchmark Results

It is well known that existing supercomputer benchmarks and benchmarking methodologies are poorly suited for studying the performance of these new highly parallel scientific computers. For one thing, the rigid tuning requirements of many of the well-known benchmark programs pretty well rule out the usage of many widely used parallel programming constructs. One could assume the existence of automatic tools for converting "dusty deck" Fortran codes to run on highly parallel computers, but such tools simply do not now exist. Another problem with many conventional benchmarks is that the problem sizes are inappropriately small for the new highly parallel systems. Finally, most existing benchmarks employ data structures and implementation techniques that are inappropriate for parallel systems.

In an attempt to remedy these difficulties, researchers in the NAS Applied Research Branch at NASA Ames, the NAS Systems Development Branch and the Research Institute for Advanced Computer Science (RIACS) have developed the NAS Parallel Benchmarks [1, 2]. These benchmarks are a collection of eight problems that are completely specified in "pencil and paper" fashion in a technical document [1].

Benchmark                   Problem Size     Y-MP (8)   CM-2 (32K)   iPSC/860 (128)
Embarrassingly Parallel     2^28                 1399          659              362
Multigrid                   256^3                2706          308              824
Conjugate Gradient          2 × 10^6              631          104               70
3-D FFT PDE                 256^2 × 128          1795          414              696
LU solver                   64^3                 1705          186             224*
Scalar penta. solver        64^3                 1822          109             122*
Block tridiagonal solver    64^3                 1554           94             199*

Table 2: NPB Performance Results (MFLOPS)

Even the input data is completely specified in this document. Implementations must be based on Fortran-77 or C, but a wide range of parallel constructs are allowed. With a few exceptions, assembly code and assembly language subroutines are not allowed for performing computations, although vendor-supported assembly language subroutines may be invoked to perform communication and synchronization operations. Aside from these restrictions, benchmarkers are free to select algorithms, implementation techniques and language constructs deemed to be the most advantageous for implementing the benchmarks on a given system.

The NAS Parallel Benchmarks have been implemented on the parallel computers at NAS, as well as on other systems by other researchers. While efforts are still being made to optimize these implementations, some results are available, which are shown in Table 2. These MFLOPS performance rates are based on single processor floating point operation counts. The Intel results for the last three lines, marked with an asterisk (*), denote runs made on only 64 nodes. For explanations of the various benchmarks, see [1] or [2].

When we compare the performance rates that we have achieved with these benchmarks on the parallel systems with the levels achieved on the Cray, the results are somewhat disappointing: the parallel codes typically run at the equivalent of one or two Y-MP processors. When one computes the ratio of sustained performance to peak performance, the results are even more striking: typically 1% to 5% for the CM-2 and the iPSC/860, as compared with 30% to 60% for the Y-MP. It is true that the Cray Y-MP system is significantly more expensive than the parallel systems. But when one computes sustained performance per dollar, the figures for the iPSC/860 and the CM-2 are not dramatically higher than the Cray figures, as one might expect, but instead are roughly on a par with them.

It is widely believed in the field of parallel computing that the key to obtaining high performance is to employ "more suitable algorithms" on the parallel computers.

Solver            Floating Point     CPU Time
Algorithm         Operations         (Secs.)      MFLOPS
Jacobi            3.82 × 10^12         2124         1800
Gauss-Seidel      1.21 × 10^12          885         1365
Least Squares     2.59 × 10^11          185         1400
Multigrid         2.13 × 10^9           6.7          318

Table 3: NCUBE-2 Performance on a Convection-Diffusion Problem

"More suitable algorithms" often means comparatively elementary algorithms, such as direct methods for solving PDEs, which require only nearest-neighbor communication and thus give higher MFLOPS performance rates than more advanced algorithms, such as implicit methods, which require long-distance communication. Often overlooked is the fact that these nearest-neighbor algorithms are typically much less numerically efficient than the more advanced algorithms. Although the work of NASA Ames researchers has amply confirmed this principle, it can most vividly be seen from some results due to Shadid and Tuminaro [11] of Sandia National Labs, which are reproduced in Table 3.

Note that while the MFLOPS rate of the Jacobi scheme is nearly six times that of the multigrid-based solver on this problem, the Jacobi scheme requires over 300 times as much CPU time to complete the solution. These results and others underscore what should be obvious: the fundamental numerical efficiency of an algorithm is much more important than its appropriateness for a particular architecture. In other words, if one employs an algorithm on a parallel computer that has a significantly higher operation count than the best known serial algorithm for that purpose, whatever cost-performance advantage the parallel computer has might well be nullified.
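
As a quick consistency check (our arithmetic, using only the entries in Table 3): the Jacobi solver sustains 3.82 × 10^12 operations / 2124 seconds, or about 1800 MFLOPS, while the multigrid solver sustains 2.13 × 10^9 / 6.7, or about 318 MFLOPS; the sustained rate favors Jacobi by a factor of roughly 5.7, yet the CPU time ratio of 2124 / 6.7, about 317, favors multigrid, simply because the Jacobi operation count is some 1800 times larger.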

Conclusions

While we are still basically optimistic that highly parallel computers are the wave of the future, it is clear that some improvements must be made from the current designs. First of all, parallel vendors need to recognize that most real scientific floating point computation, and certainly scientific computation at our facility, is characterized by a ratio of floating point operations to main memory references of approximately unity. What this means is that systems cannot rely solely on large register or cache utilization ratios to obtain respectable performance rates, and that improved bandwidth between processors and main memory is essential. It also means that floating point processors must deliver good performance on computations with nonunit strides.
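
To make this ratio concrete, consider the following hypothetical kernel, a sketch of our own rather than an excerpt from any NAS code (the name scaled_add and its arguments are illustrative). Each element update performs two floating point operations but touches three main memory operands, so no amount of register or cache blocking can push the traffic much below roughly one memory reference per operation, and sustained speed is set by memory bandwidth:

    subroutine scaled_add(n, a, x, y)
      ! y := y + a*x, the shape of loop that dominates many scientific
      ! codes: 2n flops versus roughly 3n memory references (load x(i),
      ! load y(i), store y(i)), a flop-to-reference ratio below unity.
      implicit none
      integer, intent(in) :: n
      double precision, intent(in) :: a, x(n)
      double precision, intent(inout) :: y(n)

      y = y + a * x        ! Fortran-90 array form; a DO loop is equivalent
    end subroutine scaled_add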

Secondly, it is an unpleasant fact of the state of the art in computational fluid dynamics, as well as in many other numerical applications, that the demanding problems that have the most research interest involve implicit numerical solvers or other schemes that require substantial long distance communication. Also, arrays often must be accessed in each of three dimensions. In short, the data communication patterns of advanced algorithms, both within a single node and between nodes, are typically nonlocal. The message for parallel computer designers is clear: these systems must be designed with sufficient internode bandwidth, among other things, so that advanced implicit algorithms, as well as other numerically efficient algorithms, can be effectively executed.

Thirdly, we are now convinced that for a majority of all scientific applications, and for a large majority of our applications, the Fortran-90 language [5] (particularly the Fortran-90 array constructs) will be the language of choice, and vendors must support this language for multiprocessor computation, not just on a single node. Many scientists have told us that their chief reservation about porting codes to parallel computers is the prospect that their code will no longer run on other systems. Once Fortran-90 is officially adopted, which is now expected soon, scientists should be much more open to recasting their codes into this language. Already Cray, for one, has enhanced its Fortran compiler to accept many of the Fortran-90 array constructs, and scientists at our facility now use the Crays in their efforts to port codes to the CM-2.

It may be too much to expect that future parallel computer systems will deliver respectable performance from arbitrary Fortran-90 programs. For example, it is questionable whether the interconnection networks of future systems will have the high bandwidth and low latency required to support intensive, long distance communication along each dimension of a three dimensional array code. However, in our opinion it is reasonable to expect that a parallel computer system should deliver respectable performance on codes where an honest effort has been made to reduce interprocessor data traffic. As an illustration, consider the following design for one iteration of a three dimensional scientific computation (a brief Fortran-90 sketch of this scheme follows the list):

1. Perform numerical computations, using array constructs, along the first dimension, which is mapped to be within local nodes.

2. Perform an array transposition, using the Fortran-90 RESHAPE operator, so that the second dimension is now the first dimension.

3. Perform numerical computations along the first (i.e., the second) dimension.

4. Perform an array transposition so that the original third dimension is now the first dimension.

5. Perform numerical computations along the first (i.e., the third) dimension.

6. Perform an array transposition back to the original array ordering.
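
The following is a minimal Fortran-90 sketch of this scheme under illustrative assumptions (a cubic grid of size n, arrays u1, u2 and u3, and a generic sweep routine standing in for the real per-line computation); it is not any particular NAS code. Each RESHAPE with an ORDER argument cyclically permutes the dimensions so that the next coordinate direction becomes the first, locally stored dimension:

    program transpose_sketch
      implicit none
      integer, parameter :: n = 64
      real :: u1(n,n,n), u2(n,n,n), u3(n,n,n)

      u1 = 1.0                                      ! placeholder initial data

      call sweep(u1)                                ! step 1: work along dimension 1
      u2 = reshape(u1, shape(u2), order=(/3,1,2/))  ! step 2: u2(j,k,i) = u1(i,j,k)
      call sweep(u2)                                ! step 3: work along original dim 2
      u3 = reshape(u2, shape(u3), order=(/3,1,2/))  ! step 4: u3(k,i,j) = u1(i,j,k)
      call sweep(u3)                                ! step 5: work along original dim 3
      u1 = reshape(u3, shape(u1), order=(/3,1,2/))  ! step 6: restore original ordering

    contains

      subroutine sweep(a)
        ! Stand-in for the real computation: any update whose data
        ! dependence runs along the first (local) dimension only.
        real, intent(inout) :: a(:,:,:)
        integer :: m
        m = size(a, 1)
        a(2:m,:,:) = a(2:m,:,:) + a(1:m-1,:,:)
      end subroutine sweep

    end program transpose_sketch

On a distributed memory system essentially all of the internode traffic in this scheme is concentrated in the three transposition steps, which is one reason we single out internode bandwidth as a design requirement above.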

One feature of the above design that is common to many three dimensional scientific programs is that it features two dimensional parallelism. For a 100 × 100 × 100 problem, this means 10,000-way parallelism. In our opinion, if a parallel computer system cannot deliver respectable performance on a problem of this size with two dimensional parallelism, but instead requires the programmer to also find parallelism in the third dimension, then its usability for a broad range of real scientific problems will be severely limited.

We certainly do not claim that the above scheme is most efficient for all three dimensional scientific applications. Even for our CFD programs, other schemes are often more efficient. Nonetheless, it seems clear that if a parallel computer system cannot deliver respectable performance on an application coded according to this design, then it is doubtful that it will be able to deliver very much better performance on that application no matter how it is coded. In short, we regard this as a minimum requirement of a usable parallel computer.

While Fortran-90 appears to be a satisfactory means of expressing the main body of data parallel, single program multiple data (SPMD) computations, it should be emphasized that some important algorithms and applications do not fit well into this model. Some of these that we have identified include "chimera" schemes for unstructured grid computations, computations over complicated geometries (such as complete aircraft configurations), and domain decomposition schemes. Additional research is needed to find the best language constructs for handling truly independent, asynchronous computations such as these.

Although a high level of sustained performance on real scientific applications, coded with reasonable effort, is the most important consideration in a parallel computer, it is clear from our experience that other aspects of these systems also need to be improved. For example, if parallel computers are ever to be widely used in scientific computation centers, then they must be capable of handling a fairly large number of interactive users (say 50), especially during daytime hours when users are debugging and upgrading their codes.

While it is possible to support multiple running jobs by means of time sharing, as TMC has attempted to do, it still seems preferable for performance reasons (as mentioned above) to be able to assign individual jobs to separate hardware subdomains. Such a hardware design may also be an advantage in the future, when scientists tackle large multidisciplinary applications. Whichever scheme is used, the system must provide reasonable security against users storing data into nodes or memory locations that they do not own.

A related issue is mass storage. While both the current CM-2 and iPSC/860 systems have moderately fast and high capacity mass storage systems, it is clearly important that these capabilities be greatly improved in future systems. Also, it is essential that future parallel computer systems provide a high performance network interface (such as the HiPPI interface) to allow high speed data communication to workstations and archival storage.

We recognize that the development of high performance, highly usable parallel computer systems will be a challenging task. But we feel that without major improvements in both hardware and software, there is a risk that scientists will eventually tire of struggling with these systems and will return to conventional workstations and vector computers, and the development of parallel computing technology will be greatly retarded.

Acknowledgment

The author wishes to acknowledge helpful comments and suggestions by Eric Barszcz, Leo Dagum and Tom Lasinski of the NAS Applied Research Branch, by Rod Fatoohi, Horst Simon, V. Venkatakrishnan and Sisira Weeratunga of Computer Sciences Corporation, and by P. Frederickson and R. Schreiber of the Research Institute for Advanced Computer Science (RIACS).

References

1. Bailey, D. H., Barton, J., Lasinski, T., and Simon, H., "The NAS Parallel Benchmarks", Technical Report RNR-91-002, NAS Applied Research Branch, NASA Ames Research Center, January 1991.

2. Bailey, D. H., et al., "The NAS Parallel Benchmarks", International Journal of Supercomputer Applications, 1991, to appear.

3. Bailey, D. H., et al., "Performance Results on the Intel Touchstone Gamma Prototype", Proceedings of the Fifth Distributed Memory Computing Conference, April 1990, p. 1236-1245.

4. Barszcz, E., "One Year with an iPSC/860", Technical Report RNR-91-001, NAS Applied Research Branch, NASA Ames Research Center, January 1991.

5. Fortran 90, Draft International Standard, American National Standards Institute, June 1990.

6. Hammond, S., and Barth, T. J., "An Efficient Massively Parallel Euler Solver for Unstructured Grids", RIACS Technical Report 90-47, NASA Ames Research Center, October 1990. Also published as AIAA paper 91-0441, 29th Aerospace Sciences Meeting, January 1991.

7. Lee, K., "On the Floating Point Performance of the i860 Microprocessor", Technical Report RNR-90-019, NAS Applied Research Branch, NASA Ames Research Center, October 1990.

8. Levit, C., and Jespersen, D., "Numerical Simulation of Flow Past a Tapered Cylinder", Technical Report RNR-90-021, NAS Applied Research Branch, NASA Ames Research Center, October 1990. Also published as AIAA paper 91-0751, 29th Aerospace Sciences Meeting, January 1991.

9. McDonald, J. D., "Particle Simulation in a Multiprocessor Environment", Technical Report RNR-91-003, NAS Applied Research Branch, NASA Ames Research Center, January 1991.

10. Schreiber, R., "An Assessment of the Connection Machine", RIACS Technical Report 90-47, NASA Ames Research Center, June 1990.

11. Shadid, J. N., and Tuminaro, R. S., "Iterative Methods for Nonsymmetric Systems on MIMD Machines", Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, 1991, to appear.

12. Wray, A., personal communication.