Lessons from Massively Parallel Applications on Message Passing Computers
Geoffrey C. Fox
Syracuse University
Northeast Parallel Architectures Center
111 College Place
Syracuse, New York 13244-4100

Abstract

We review a decade's work on message passing MIMD parallel computers in the areas of hardware, software, and applications. We conclude that distributed memory parallel computing works, and describe the implications of this for future portable software systems.

1 Introduction

We start with a nostalgic note. The 1984 COMPCON conference was my first opportunity to discuss our early hypercube results from Caltech [1], based on the software and science applications we built for C. Seitz's 64-node Cosmic Cube, which started "production" runs on Quantum Chromodynamics (QCD) in October, 1983. That first MIMD machine delivered only two megaflops of performance, ten times better than the VAX 11/780 we were using at the time. However, the basic parallelization issues remain similar in the 1991 six gigaflop QCD implementation on the full size 64K CM-2. What have we and others learned in the succeeding eight years, while the parallel hardware has evolved impressively, with in particular a factor of 3000 improvement in performance? There is certainly plenty of information! In 1989, I surveyed some four hundred papers describing parallel applications [2], [3]; now the total must be over one thousand. A new complete survey is too daunting for me. However, my personal experience, and I believe the lesson of the widespread international research on message passing parallel computers, has a clear message:

The message passing computational model is very powerful and allows one to express essentially all large scale computations and execute them efficiently on distributed memory SIMD and MIMD parallel machines.

Less formally, one can say that parallel computing works, or more controversially but accurately in my opinion, that "distributed memory parallel computing works". In the rest of this paper, we will dissect this assertion and suggest that it has different implications for hardware, software, and applications. Formally, we relate these as shown in Figure 1 by viewing computation as a series of maps. Software is an expression of the map of the problem onto the machine. In Section 2, we review a classification of problems described in more detail in [3], [4], [5], [6], [7], [8]. In the following three sections, we draw lessons for applications, hardware, and software, and quantify our assertion above about message passing parallel systems.

2 Problem Architecture

Problems, like computers, have architectures. Both are large, complex collections of objects.
A problem will perform well when mapped onto a computer if their architectures match well. This loose statement will be made more precise in the following, but not completely in this brief paper. At a coarse level, we like to introduce five broad problem classes, which are briefly described in Table 1. These can and should be refined, but this is not necessary here. Thus, as described in Table 1, we do need to differentiate the application equivalent of the control structure (SIMD and MIMD) for computers. However, details such as the topology (hypercube, mesh, tree, etc.) are important for detailed performance estimates but not for the general conclusions of this paper. Note that the above implies that problems and computers both have a topology. We will use the classification of Table 1 in the following sections, which will also expand and exemplify the brief definitions of Table 1.

Synchronous: Data parallel. Tightly coupled, and software needs to exploit features of problem structure to get good performance. Comparatively easy as different data elements are essentially identical.

Loosely Synchronous: As above, but data elements are not identical. Still parallelizes due to macroscopic time synchronization.

Asynchronous: Functional (or data) parallelism that is irregular in space and time. Often loosely coupled and so need not worry about optimal decompositions to minimize communication. Hard to parallelize (massively) unless ...

Embarrassingly Parallel: Independent execution of disconnected components.

A/LS (Loosely Synchronous Complex): Asynchronous collection of (loosely) synchronous components, where these programs themselves can be parallelized.

Table 1: Five Problem Architectures

3 Applications

Let us give some examples of the five problem architectures.

Synchronous: These are regular computations on regular data domains, exemplified by full matrix algorithms such as LU decomposition, finite difference algorithms, and convolutions such as the fast Fourier transform.

Loosely Synchronous: These are typified by iterative calculations (or time evolutions) on geometrically irregular and perhaps heterogeneous data domains. Examples are irregular mesh finite element problems and inhomogeneous particle dynamics.

Asynchronous: These are characterized by a temporal irregularity which makes parallelization hard. An important example is event driven simulation, where events, as in a battlefield simulation, occur in spatially distributed fashion but irregularly in time. Branch and bound and other pruned tree algorithms common in artificial intelligence, such as computer chess [9], fall in this category.

Synchronous and loosely synchronous problems parallelize naturally in a fashion that scales to large computers with many nodes. One only requires that the application be "large enough", which can be quantified by a detailed performance analysis [10] that was discussed quite accurately in my original COMPCON paper [1]. The speedup is

    S = \frac{N}{1 + f_c}                                        (1)

on a computer with N nodes, where the overhead f_c has a term due to communication of the form

    f_c \propto \frac{1}{n^{1/d}} \, \frac{t_{comm}}{t_{calc}}   (2)

where t_comm and t_calc are respectively the typical node-to-node communication and node (floating point) calculation times, n is the application grain size, and d is its dimension, which is defined precisely in [10]; in physical simulations d is usually the geometric dimension.
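To make the scaling behavior of equations (1) and (2) concrete, the short Python sketch below evaluates the estimated speedup for a few grain sizes. The proportionality constant in (2), the node count, the dimension, and the t_comm/t_calc ratio are illustrative assumptions, not measurements from any of the machines discussed here.

# Minimal numerical sketch of the performance model in equations (1) and (2).
# The proportionality constant in (2) is set to 1, and N, d, t_comm, t_calc
# below are hypothetical values chosen only to illustrate the trend.

def estimated_speedup(N, n, d, t_comm, t_calc, const=1.0):
    """Speedup S = N / (1 + f_c), with f_c = const * (t_comm / t_calc) / n**(1/d)."""
    f_c = const * (t_comm / t_calc) / n ** (1.0 / d)
    return N / (1.0 + f_c)

if __name__ == "__main__":
    # A three-dimensional problem (d = 3) on N = 256 nodes, with node-to-node
    # communication assumed twice as slow as floating point calculation.
    for n in (10**2, 10**4, 10**6):
        S = estimated_speedup(N=256, n=n, d=3, t_comm=2.0, t_calc=1.0)
        print(f"grain size n = {n:>7d}: estimated speedup ~ {S:6.1f} of 256")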
Good performance requires that 1/n^{1/d} be "small", with a value that depends on the critical machine parameter t_comm/t_calc. The grain size n would be the number of grid points stored on each node in a finite difference problem, so that the complete problem had Nn grid points. Implicit in the above discussion is that these problems are "data parallel" in the language of Hillis [11], [12]. This terminology is sometimes associated only with problems run on SIMD machines, but in fact data parallelism is the general origin of massive parallelism on either SIMD or MIMD architectures. MIMD machines are needed for loosely synchronous data parallel problems, where we do not have a homogeneous algorithm with the same update operation on each data element.

The above analysis does not apply to asynchronous problems, as this case has additional synchronization overhead. One can, in fact, use message passing to naturally synchronize synchronous or loosely synchronous problems. These typically divide into communication and calculation phases, as given by individual iterations or time steps in a simulation (a small single-process sketch of this phase structure is given at the end of this section). These phases define an algorithmic synchronization common to the entire application. This is lacking in asynchronous problems, which require sophisticated parallel software support such as that given by the time warp system [13]. Formally, the series of maps introduced in Section 1 can be written as

    Numerical Formulation of Problem  --"compiler"-->  Virtual Machine (Virtual Problem)  --"assembler"-->  Real (Parallel) Machine (Specific Computer)    (3)

However, there is a very important class of asynchronous applications for which large scale parallelization is possible. These we call loosely synchronous complex, as they consist of an asynchronous collection of loosely synchronous (or synchronous) modules. A good example, shown in Figure 2, is the simulation of a satellite based defense system. Viewed at the level of the satellites, we see an asynchronous application. However, the modules are not elemental events but rather large scale data parallel subsystems. In a simulation developed by JPL, these modules included sophisticated Kalman filters and target-weapon association [14]. This problem class shows a functional parallelism at the module level and a conventional data parallelism within the modules. The latter ensures that large problems of this class will parallelize on large machines. Image analysis, vision, and other large information processing or command and control problems fall in the loosely synchronous complex class.

A final problem class of practical importance is termed "embarrassingly parallel". These consist of a set of independent calculations. This is seen in parts of many chemistry calculations, where one can independently compute the separate matrix elements of the Hamiltonian.
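As a minimal sketch of this class, the Python fragment below farms independent "matrix element" evaluations out to a pool of workers and only gathers the results at the end; element() is a hypothetical stand-in, not the Hamiltonian of any real chemistry code, and the point is simply that the tasks need no communication with one another.

# Minimal sketch of an embarrassingly parallel calculation: each "matrix
# element" is independent, so tasks can be distributed with no communication
# between them. element() is a hypothetical placeholder.

import math
from multiprocessing import Pool

def element(index):
    i, j = index
    # Stand-in for one independent matrix element H[i][j].
    return i, j, math.exp(-abs(i - j)) * math.cos(i + j)

if __name__ == "__main__":
    size = 8
    indices = [(i, j) for i in range(size) for j in range(size)]
    with Pool(processes=4) as pool:
        values = pool.map(element, indices)  # independent tasks, no coordination
    H = [[0.0] * size for _ in range(size)]
    for i, j, value in values:
        H[i][j] = value
    print("H[0][:4] =", H[0][:4])

On a message passing machine the same structure would appear as a scatter of independent tasks followed by a gather of results, with no messages in between.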
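Returning to the synchronous and loosely synchronous phase structure described above, the sketch below imitates it on a single process: boundary ("ghost") copies between sub-domains stand in for the node-to-node messages of the communication phase, and a Jacobi-style relaxation stands in for the calculation phase. The four "nodes", the 1-D domain, and the update rule are illustrative assumptions, not taken from the applications discussed in this paper.

# Single-process stand-in for the loosely synchronous phase structure: each
# iteration is a communication phase (ghost-point exchange between sub-domains,
# standing in for messages) followed by a calculation phase (local update).

NODES, POINTS_PER_NODE, STEPS = 4, 8, 200

# Each "node" stores its grain of the 1-D domain plus two ghost points.
domains = [[0.0] * (POINTS_PER_NODE + 2) for _ in range(NODES)]
domains[0][0] = 1.0                      # fixed left boundary value
domains[-1][POINTS_PER_NODE + 1] = 0.0   # fixed right boundary value

for step in range(STEPS):
    # Communication phase: fill interior ghost points from neighbouring "nodes".
    for k in range(NODES):
        if k > 0:
            domains[k][0] = domains[k - 1][POINTS_PER_NODE]
        if k < NODES - 1:
            domains[k][POINTS_PER_NODE + 1] = domains[k + 1][1]
    # Calculation phase: local Jacobi update on each node's own points.
    for k in range(NODES):
        old = domains[k][:]
        for i in range(1, POINTS_PER_NODE + 1):
            domains[k][i] = 0.5 * (old[i - 1] + old[i + 1])

print([round(x, 2) for x in domains[0][1:POINTS_PER_NODE + 1]])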