
State-of-the-Art Parallel Computing

Peter S. Lomdahl and David M. Beazley

The availability of state-of-the-art computers presents computational scientists with an exciting challenge. The architects of these machines claim that they are scalable—that a machine containing 100 processors arranged in a parallel architecture can be made 10 times more powerful by employing the same basic design but using 1000 processors instead of 100. Such massively parallel computers are available, and one of the largest—a CM-5 Connection Machine containing 1024 processors—is here at Los Alamos National Laboratory. The challenge is to realize the promise of those machines—to demonstrate that large scientific problems can be computed with the advertised speed and cost-effectiveness. The challenge is significant because the architecture of parallel computers differs dramatically from that of conventional vector supercomputers, which have been the standard tool for scientific computing for the last two decades.

Opening illustration: This "snapshot" from a molecular-dynamics simulation shows a thin plate undergoing fracture. The plate contains 38 million particles. This state-of-the-art simulation was performed using a scalable algorithm, called SPaSM, specifically designed to run on the massively parallel CM-5 at Los Alamos National Laboratory. The velocities of the particles are indicated by color—red and yellow particles have higher velocities than blue and gray particles.



The massively parallel architecture requires an entirely new approach to programming. A computation must somehow be divided equally among the hundreds of individual processors, and communication between processors must be rapid and kept to a minimum. The incentive to work out these issues was heightened by Gordon Bell, an independent consultant to the computer industry and well-known skeptic of massively parallel machines. Six years ago he initiated a formal competition for achievements in large-scale scientific computation on vector and parallel machines. The annual prizes are awarded through and recognized by the Computer Society of the Institute of Electrical and Electronic Engineers (IEEE). The work reported here was awarded one of the 1993 IEEE Gordon Bell prizes. We were able to demonstrate very high performance as measured by speed: Our simulation, called SPaSM (scalable parallel short-range molecular dynamics), ran on the massively parallel CM-5 at a rate of 50 gigaflops (50 billion floating-point operations per second). That rate is nearly 40% of the theoretical maximum; typical programs achieve only about 20% of the theoretical maximum. Also, the algorithm we developed has the property of scalability: when the number of processors used was doubled, the time required to run the simulation was cut in half.

Our achievement was the adaptation of the molecular-dynamics method to the architecture of parallel machines. Molecular dynamics (MD) is a first-principles approach to the study of materials. MD starts with the fundamental building blocks—the individual atoms—that make up a material and follows the motions of the atoms as they interact with each other through interatomic forces. Even the smallest piece of material that is useful for experimental measurement contains over ten billion atoms. Therefore, the larger the number of atoms included in the calculation, the more interesting the results for materials science. Scientists would very much like to perform simulations that predict the motions of billions of atoms during a dynamical process such as machining or fracture and compare the predictions with experiment. Unfortunately, the realization of that prospect is not possible today. However, if the machines can be scaled up in both memory capacity and speed, as the computer manufacturers have promised, then simulations can be scaled up proportionally so that billion-atom simulations may become a reality. Furthermore, as we show here, the current generation of massively parallel machines has made it possible to simulate the motions of tens of millions of atoms. Such simulations are large enough to perform meaningful studies of the bulk properties of matter and to improve the approximate models of those properties used in standard continuum descriptions of macroscopic systems. Molecular-dynamics simulations on massively parallel computers thus represent a new opportunity to study the dynamical properties of materials from first principles.

Massively parallel machines have only recently become popular, so the vendors have not yet developed the software—the compilers—needed to optimize a given program for the particular architecture of their machines. Consequently, to develop a scalable algorithm for molecular dynamics that would achieve high performance, we needed to understand the inner workings of the CM-5 and incorporate that knowledge into the details of our program. Much of our work focused on implementing a particular technique for interprocessor communication known as "message passing." In this approach programmers use explicit instructions to initiate communication among the processors of a parallel system, and the processors accomplish the communication by sending messages to each other.

Message-passing programming is advantageous not only in achieving high performance, but also in increasing "portability," because it is easily adapted for use on a number of massively parallel machines. We were able to run our message-passing algorithm on another parallel computer, the Cray T3D, with minimal modification.

Following brief descriptions of MD and the architecture and properties of the CM-5, we present the method by which we have mapped a molecular-dynamics simulation to the architecture of the CM-5. Our example illustrates message passing and other techniques that are widely applicable to effective programming of massively parallel computers.

Molecular Dynamics

Molecular dynamics is a computational method used to track the positions and velocities of individual particles (atoms or groups of atoms) in a material as the particles interact with each other and respond to external influences. The motion of each particle is calculated by Newton's equation of motion, force = mass × acceleration, where the force on a given particle depends on the interactions of the particle with other particles. The mutual interaction of many particles simultaneously is called a "many-body problem" and has no analytical, or exact, solution because the force on a particle is always changing in a complex way as each particle in the system moves under the influence of the others.
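To make the many-body bookkeeping concrete, the sketch below shows the brute-force version of a single force evaluation in C: every particle's acceleration is rebuilt from its interactions with every other particle, which is why the forces must be recomputed at every timestep. The names are ours and the pair force is left abstract here; the specific Lennard-Jones form and the far more efficient parallel organization used in SPaSM are described in the sections that follow.

/* Brute-force evaluation of the many-body forces for one timestep, in two
   dimensions.  pair_force() returns the force on particle i due to particle
   j; its form (the interaction potential) is discussed in the next section. */
void pair_force(double dx, double dy, double *fx, double *fy);

void compute_accelerations(const double *x, const double *y,
                           double *ax, double *ay, int n, double mass)
{
    for (int i = 0; i < n; i++) {
        ax[i] = 0.0;
        ay[i] = 0.0;
        for (int j = 0; j < n; j++) {      /* every other particle contributes */
            if (j == i)
                continue;
            double fx, fy;
            pair_force(x[i] - x[j], y[i] - y[j], &fx, &fy);
            ax[i] += fx / mass;            /* Newton: acceleration = force/mass */
            ay[i] += fy / mass;
        }
    }
}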



The MD algorithm provides an approximate solution to the many-body problem by treating time as a succession of discrete timesteps and treating the force on (and therefore the acceleration of) each particle as a constant for the duration of a single timestep. At each timestep the position and velocity of each particle are updated on the basis of the constant acceleration it experiences during that timestep. For example, at time t, at the beginning of a timestep of length Δt, particle i has position r_i and velocity v_i. The force on particle i from the other particles is calculated; the force determines the acceleration a_i of particle i for the duration of this timestep. From elementary physics, the position r_i' and velocity v_i' at the later time t + Δt are given by

\[ r_i' = r_i + v_i\,\Delta t + \tfrac{1}{2}\,a_i\,(\Delta t)^2 \quad\text{and}\quad v_i' = v_i + a_i\,\Delta t. \]

Sometimes these equations are replaced by more sophisticated approximations for the new positions and velocities; the approximations involve interpolations from positions and velocities at time t and the earlier time t − Δt. Such schemes conserve energy and momentum more effectively than the simple update of position and velocity given above.

Whatever the scheme, the equations yield a new position and velocity for each particle at the new time t + Δt. The procedure is repeated for each succeeding timestep, and the result is a series of "snapshots" of the positions and velocities of the particles at times t = 0, Δt, 2Δt, ..., nΔt.

In MD simulations particles are usually treated as point particles (that is, they have no extent), and the force between any two particles is usually approximated as the gradient of a potential function V that depends only on the distance r between the two particles. The force on particle i from particle j can be written

\[ F_{ij} = -\left.\frac{dV}{dr}\right|_{r_{ij}}, \quad\text{where}\quad r_{ij} = r_i - r_j, \]

and that force is directed along the line connecting the two particles. If the force is negative, particles i and j attract one another; if the force is positive, the particles repel each other. The net force exerted on particle i by all other particles j is approximated by summing the forces calculated from all pair-wise interactions between particle i and each particle j.

The standard pair-wise potential used in MD simulations is the Lennard-Jones potential, which is known empirically to provide a reasonably good "fit" to experimental data for atom-atom interactions in many solids and liquids. As shown in Figure 1, this potential has both an attractive part and a repulsive part:

\[ V(r) = 4\epsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right], \]

where r is the distance between the two particles and σ and ε are adjustable parameters. When the distance between two particles is at the minimum of the potential function, the force between the two particles is zero and the particles are in their equilibrium positions.

Figure 1. The Lennard-Jones Interaction Potential. The Lennard-Jones interaction potential V between two particles is plotted as a function of the distance r between the particles. The potential is in units of ε, and the distance between the particles is in units of σ. The parameter σ is equal to the value of r at which the potential function crosses zero. The potential has a positive slope at longer distances, corresponding to an attractive force, and a steeply negative slope at short distances, corresponding to a strong repulsive force. Separating the two regions is the local minimum of the potential—the deepest point in the potential "well"—which has a depth equal to ε and is located at r = 2^(1/6)σ. At the potential minimum the force between the particles is zero. Therefore, in the absence of effects from any other particles, the separation between two particles will tend to oscillate around r = 2^(1/6)σ.
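As a concrete (and much simplified) illustration of the formulas above, here is how the Lennard-Jones pair force and the single-timestep update might look in C. This is our own sketch in reduced units (σ = ε = m = 1) and two dimensions, not the SPaSM code, and the function names are ours.

/* Lennard-Jones pair force on particle i due to particle j, in reduced
   units (sigma = epsilon = 1) and two dimensions.  dx and dy are the
   components of r_ij = r_i - r_j; fx and fy receive the force components.
   Differentiating V(r) gives F(r) = -dV/dr = 24*(2/r^13 - 1/r^7), directed
   along the line connecting the two particles. */
void lj_pair_force(double dx, double dy, double *fx, double *fy)
{
    double r2  = dx*dx + dy*dy;       /* r^2 */
    double ir2 = 1.0 / r2;
    double ir6 = ir2 * ir2 * ir2;     /* 1/r^6 */
    /* F(r)/r, so that multiplying by dx and dy projects the force onto
       the line connecting the two particles. */
    double f_over_r = 24.0 * (2.0 * ir6 * ir6 - ir6) * ir2;
    *fx = f_over_r * dx;
    *fy = f_over_r * dy;
}

/* Advance one particle through a single timestep dt with the constant-
   acceleration update quoted in the text:
       r' = r + v*dt + (1/2)*a*(dt)^2,    v' = v + a*dt,
   where a = F/m and the mass m is taken to be 1. */
void advance_particle(double *x, double *y, double *vx, double *vy,
                      double fx, double fy, double dt)
{
    *x  += (*vx) * dt + 0.5 * fx * dt * dt;
    *y  += (*vy) * dt + 0.5 * fy * dt * dt;
    *vx += fx * dt;
    *vy += fy * dt;
}

Note that the sign convention matches the text: at large separations f_over_r is negative and the force pulls the particles together; at short separations it is positive and strongly repulsive.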


In the absence of other particles nearby, the separation of two particles will tend to oscillate around this distance. Physically, the average nearest-neighbor distance of atoms in many solids and fluids corresponds very closely to the position of the minimum in the atom-atom potential.

For our molecular-dynamics simulations, we simplify the force calculation greatly by using a truncated form of the Lennard-Jones potential. That is, we "cut off" the long, flat tail of the potential and assume that the force between particles is exactly zero for particle separations greater than some chosen rmax, the interaction cut-off distance. In other words, when calculating the forces for particle i, we need consider only those particles j that are within a distance rmax of i. The time required to calculate the pair interactions with all the surrounding particles is reduced considerably since a given particle "feels" forces from perhaps only a hundred neighbors instead of the millions or billions of particles included in the simulation. The truncated potential provides an excellent approximation for solid materials because in solids the effects of distant atoms are "screened" by nearer atoms. The fact that we can use a short-range potential is the most significant difference between our problem and that described in "A Fast Tree Code for Many-Body Problems," in which the force is long-range and therefore each particle always interacts with all other particles in the system.

Mapping Molecular Dynamics onto the Architecture of the CM-5

The molecular-dynamics method has been implemented on electronic computers since 1957. The first such calculations were performed on the UNIVAC, the first commercially available computer. Initially only 32 particles were used, and only about 300 interactions per hour could be calculated. The rapid advance of computers has made it possible for us—using the massively parallel CM-5—to include tens of millions of atoms in our simulation of materials behavior. Since materials contain large numbers of repeating units (units that can be treated in parallel), calculations of materials behavior are ideally suited for implementation on massively parallel architectures.

Figure 2 illustrates the architecture of the CM-5, which was built by Thinking Machines Corporation. The computer was constructed from large numbers of relatively inexpensive, high-volume microprocessor and memory components. Such a scheme allows the hardware designers to build a high-performance supercomputer without a major effort in the development of new microprocessor and memory circuits. The Laboratory's CM-5 has 1024 processing nodes, but partitions as small as 32 nodes may be used.

The processing nodes in the CM-5 are connected by two types of high-speed communication networks: a control network and a data network. The control network performs synchronization functions, broadcasts, and reductions. The synchronization functions are used to coordinate the work of the individual processors. Broadcasts are messages sent to all processors simultaneously, such as a set of instructions to be executed by each processor. Reductions are operations in which data are collected from all processors for the calculation of some overall value such as the total energy of a system.

The data network carries simultaneous point-to-point communication between multiple processing nodes at a rate of over 5 megabytes per second per node. Both networks have a "fat-tree" topology, a scalable structure in which the total bandwidth (capacity for data transmission) of the network increases in proportion to the number of processors. When new processing nodes are added, the networks are expanded so that the effective bandwidth between any two processors does not decrease. Maintaining a high bandwidth is critical to the scalability of the CM-5. A machine having more processing nodes will have a greater total amount of network "traffic." Therefore the larger computer must also have a greater network capacity in order to realize the expected gain in performance.

Our MD algorithm uses the data network extensively—large amounts of particle data are transmitted between processors. The control network is also very important to our algorithm. A single program is broadcast over the control network to the processors, each of which executes that program for the appropriate subset of the particles in the simulation. The processing nodes are allowed to operate independently during most of the calculations. However, there are steps in our algorithm that require synchronization of the processors. For example, the calculation of new particle positions and velocities cannot proceed until the total force acting on each particle has been determined. A special synchronization function ensures that the latter calculation is complete before the former is initiated.

Access to the communication networks is provided by a message-passing library supplied by Thinking Machines. The library allows two types of communication—synchronous and asynchronous. We use both communication styles in our MD algorithm.

All programs on the CM-5 use some form of message passing between processing nodes. Data-parallel languages such as C* and FORTRAN-90 perform message passing transparently; that is, the underlying communication is hidden from the user.
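The control-network reductions described above (operations that combine a value from every processor into a single total, such as the total energy of the system) can be sketched as follows. The article does not list the Thinking Machines library calls, so this fragment uses an MPI collective purely as a stand-in, with hypothetical variable names; it assumes each node stores the velocities of only the particles it owns.

#include <mpi.h>

/* Sketch of a reduction: each node sums the kinetic energy of the particles
   it owns, and a single collective operation combines the partial sums into
   the total for the whole system on every node.  MPI_Allreduce is used here
   only as a stand-in for the CM-5 control-network reduction; vx, vy, and
   nlocal are hypothetical per-node arrays and counts (particle mass = 1). */
double total_kinetic_energy(const double *vx, const double *vy, int nlocal)
{
    double local = 0.0;
    double total = 0.0;

    for (int i = 0; i < nlocal; i++)
        local += 0.5 * (vx[i]*vx[i] + vy[i]*vy[i]);

    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;            /* identical on every processing node */
}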



[Figure 2 diagram: (a) a 16-node CM-5, with processing nodes (PN) connected in parallel by a fat-tree data network; (b) a 32-node CM-5; (c) a processing node, consisting of a RISC microprocessor with data-network and control-network interfaces, four 64-bit vector units, and four 8-megabyte memory banks.]

Figure 2. The Architecture of the CM-5 Connection Machine. The diagram shows the general layout of the CM-5, one of the largest massively parallel computers. Also shown in the diagram is the manner in which this parallel architecture can be scaled up in size. (a) Sixteen processing nodes are connected in parallel by a network that carries data between the processors. A control network (not shown), similar in structure to the data network, carries instructions to the nodes. The networks have a "fat tree" topology—the upper branches (which connect a large number of nodes and therefore must carry more data) have a higher bandwidth (capacity for data transmission) so that the overall point-to-point bandwidth is at least 5 megabytes per second for any pair of nodes. (b) The addition of more processing nodes to the system (shown in gray) requires that the networks be expanded to a "higher" level in order to maintain the point-to-point bandwidth. (c) Each processing node consists of a 33-megahertz SPARC RISC (Reduced Instruction Set Computer) microprocessor, 32 megabytes of memory, and four vector-processing units that perform 64-bit floating-point and integer arithmetic at a maximum combined rate of 128 million floating-point operations per second. The 1024-node CM-5 at the Laboratory also has 400 gigabytes (400 billion bytes) of disk space distributed over a number of disk drives (not shown), which are connected in parallel to the networks by input/output processors. The parallel arrangement of the disk drives allows data to be transferred rapidly to disk; however, the data of a single write operation will be spread over all disks. For more details of computer elements and architectures see "How Computers Work: An Introduction to Serial, Vector, and Parallel Computers."
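The per-node figures in this caption also account for the performance numbers quoted in the introduction; the arithmetic below is ours, using only values stated in the article.

\[ 1024 \ \text{nodes} \times 128 \ \text{Mflops per node} \approx 131 \ \text{Gflops (theoretical peak)}, \qquad \frac{50 \ \text{Gflops}}{131 \ \text{Gflops}} \approx 0.38, \]

that is, the 50-gigaflop SPaSM run achieved nearly 40% of the machine's theoretical maximum.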


However, to achieve higher performance we wrote explicit message-passing instructions into our program. It was designed to be executed on all of the processing nodes simultaneously. Once the program is broadcast to the processing nodes, each node works independently on a small part of the problem and manages its own data and interprocessor communications.

Data structures of the MD algorithm. The particles in a molecular-dynamics simulation occupy a simulated region of space called a "computational box," shown in Figure 3a (for simplicity, the algorithm is illustrated in two dimensions). The box is divided into domains of equal size, one for each processor (Figure 3b). The assignment of particles to processors is made according to the domain in which each particle is located. The data associated with each particle include its velocity and its coordinates within the computational box.

Because any simulation is finite in size, we often need a method to effectively eliminate the boundaries of the simulation. Therefore, we apply periodic boundary conditions to the computational box. That is, the entire computational box with all of its particles is treated as just one unit of an infinite lattice of identical units. Thus a particle near a boundary of the computational box will be treated as if it is interacting with particles in an adjacent box containing an identical number of particles with identical positions and velocities. One result of such boundary conditions is that when a particle moves out through one side of the computational box, an identical particle with the same velocity enters through the boundary at the opposite side of the box. Therefore, both linear momentum and the number of particles in the computational box are conserved.

Typically the volume of the computational box is large enough that the processor domains have dimensions significantly larger than the interaction cut-off distance, rmax. In such cases each processor will have been assigned a significant number of particles that do not interact with each other. Computing the forces between all pairs of particles within each domain is unnecessary because the result for each pair of particles separated by more than rmax would be zero. Therefore, the domain of each processor is subdivided into an identical number of cells, each having dimensions slightly larger than rmax (see Figure 3c). The number of such cells depends only on the size of the domain and rmax, not on the number of available processors. The domains may be subdivided into thousands of cells for simulations with large spatial dimensions. In the example in Figure 3, there are 4 processor domains and each domain has 16 cells. The cell structure forms the foundation of our algorithm.

Each particle is represented in the computer by a C-programming-language data structure consisting of the particle parameters, including position, velocity, force, and type. Associated with each cell is a small block of memory containing a sequential list of the particle data for that cell. Storing particle data in this manner facilitates the communication steps of the force calculation. The entire contents of a cell can be communicated to other processors simply by sending a small block of memory.

MD force calculation. The calculation of forces acting on the particles is the most time-consuming part of the MD method. To optimize the calculation, particles that are neighbors physically should also be near one another in the memory of the computer. However, such an arrangement is difficult to maintain because each particle is free to move anywhere in space—particles that are initially neighbors may separate during the course of an MD simulation, and particles that are initially far apart may become neighbors. The cell structure just described was designed to keep the particles organized so that the force calculation proceeds efficiently.

The force on a given particle includes contributions from all the other particles that are closer than rmax. Because the cell size has been chosen to be slightly larger than rmax, all the particles that must be considered are located within the cell containing the given particle or within adjacent cells.

Each processor follows an identical, predetermined sequence to calculate the forces on the particles within its assigned domain. For each cell in its domain, the processor computes forces exerted on particles in the cell by other particles in the same cell as well as by particles located in adjacent cells. The forces due to particles in adjacent cells are calculated in the order given by the interaction path shown in Figure 4a. Use of an identical path for each cell ensures that all contributions to the total force on every particle are computed without performing redundant calculations.

Typical calculations have hundreds or thousands of cells per processor domain, so for most of the cells of a given domain all neighboring cells are also assigned to the same domain. Therefore, the interaction calculation can be performed for most cells without any communication between processors. However, for cells along the edge of each processor domain, some of the adjacent cells are assigned to other processors (as in Figure 4b), and the message-passing features of the CM-5 must be utilized so that particle data for those adjacent cells can be sent to and received from the neighboring processors.



Because each processor has been assigned an identical cell structure and follows the same sequence of operations through that structure, all processors will be required to transmit particle data for the same internal cell at approximately the same time. At such times the processors synchronize and participate in send-and-receive communication. Processors assigned to domains with fewer particles will proceed through the interaction calculations faster. Therefore, when working on cells at the edges of the domain, those processors may be required to wait for others before they can synchronize. The details of synchronized message passing are shown in Figure 4c.

At every message-passing step, each processor sends the particle data for an entire cell to the appropriate processors and receives corresponding information from other processors. The force calculation then proceeds, now that each processor has available in its local memory the particle data needed to implement the next step along the interaction path. Note that synchronization occurs only during message passing. At all other times, the processors are running asynchronously.

Redistribution of particles. When the force calculation is complete, the processors compute the new positions and velocities of all particles in the system. Since our algorithm is based on the cell structure of the processor domains, the possibility of changes in particle positions requires that the computer data structures for each cell are updated regularly. A special redistribution function checks the coordinates of each particle after every timestep. If the new coordinates of a particle indicate that it has moved to a new cell, the particle parameters must be transferred to the particle list of the new cell (see Figure 5a).

Figure 3. The Data Structure of the MD Algorithm. This figure illustrates the mapping of a molecular-dynamics problem onto a parallel processor with 4 nodes. For simplicity, the algorithm is illustrated in two dimensions. (a) The computational box. A "computational box" contains all the particles in the simulation. A single coordinate system is used for the entire computational box so that each particle has a unique set of coordinates. (b) The processor domains. The computational box is divided into 4 equal domains, each of which is assigned to a separate processor as denoted by the numbers 1, 2, 3, and 4. (c) The cell structures of each processor domain. The domain assigned to each processor is subdivided into cells having dimensions just larger than rmax, the interaction cut-off distance. In this example the domain of each processor has been divided into 16 cells. The circle of radius rmax is centered on a particle and indicates the area containing all the other particles that are close enough to affect the motion of the center particle. Note that all the particles contained in this circle are located either within the same cell as the center particle or within adjacent cells. Associated with each cell is a small block of memory containing a sequential list of the particle data for that cell.
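A minimal sketch of the data structures just described might look like the following C fragment. The field and type names are ours; the article specifies only that each particle carries position, velocity, force, and type, and that each cell owns a small block of memory holding a sequential list of its particles. Two dimensions are used, as in Figure 3.

#define MAX_CELL_PARTICLES 64      /* hypothetical capacity of a cell's block */

/* One particle: position, velocity, accumulated force, and a type tag
   (the parameters named in the text). */
typedef struct {
    double x, y;                   /* coordinates in the computational box */
    double vx, vy;                 /* velocity */
    double fx, fy;                 /* force accumulated during this timestep */
    int    type;                   /* particle type */
} Particle;

/* One cell: a small contiguous block of memory holding a sequential list
   of the particles currently inside the cell.  Because the list is
   contiguous, the entire contents of a cell can be sent to another
   processor as a single block of memory. */
typedef struct {
    int      n;                          /* particles currently in the cell */
    Particle p[MAX_CELL_PARTICLES];      /* sequential particle list */
} Cell;

/* Each processor's domain is an ncx-by-ncy array of such cells, each cell
   slightly larger than the interaction cut-off distance rmax on a side. */

The fixed-size array is a simplification for illustration; a production code would more likely size or grow each cell's block dynamically.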



Figure 4. MD Force Calculation. This figure shows the steps that are performed for each cell in order to calculate and sum up the forces acting on all particles. Each processor works on its cells in the order 1 through 16 and follows an identical interaction pathway for each cell.

(a) Interaction path for cell 7. The processors are performing the interaction calculations for cell 7, having already done so for cells 1 through 6. First, all forces between particles within cell 7 are computed and the results are sent to an accumulator that adds all the pair-wise contributions for each particle. Then the block of memory containing the list of particle data for cell 8 is called up and used to calculate the forces exerted on particles in cell 7 by particles in cell 8. The results are sent to the accumulator for cell 7. By Newton's third law, the forces exerted on particles in cell 8 by particles in cell 7 are equal and opposite to those exerted on particles in cell 7 by particles in cell 8. Consequently, the results are also sent to the accumulator for cell 8. The process is repeated three times more for cell 7, calculating forces involving particles in cell 7 with those in cells 12, 11, and 10. The forces exerted on particles in cell 7 by particles in cells 2, 3, 4, and 6 were sent to the accumulator for cell 7 when the interaction pathways were followed for cells 2, 3, 4, and 6. Because all the particles that must be considered for cell 7 are within the domain of a single processor, no message passing is required.

(b) Interaction path for cell 8 (synchronous message passing steps are red). The processors are calculating forces for cell 8. Because cell 8 is located at the edge of a processor domain, some steps of the interaction path (shown by red arrows) cross the boundary between processor domains. To carry out the entire force calculation for cell 8 (and for all other cells along the edge of processor domains) the data for the particles in cell 8 must be made available to neighboring processors. Synchronous message passing is used to transfer the particle data for cell 8 to and from neighboring processors. Note that because the computational box has periodic boundary conditions, interaction paths leave the box at one edge while equivalent interaction paths enter the box at the opposite edge. Because each processor has the same cell structure and proceeds through the cells in the same order, all the processors will need to send and/or receive particle data for a given edge cell at approximately the same time. Those processors that are assigned to neighboring domains in the computational box will exchange data with one another during the force calculation. Therefore, synchronization of message passing between the appropriate processors is straightforward.

(c) Synchronous message passing between processing nodes (PN). One processor is sending data to another via synchronous message passing. The source processor stops computing, issues a "Ready to send" signal, and waits for a "Ready to receive" signal from the destination processor. After the "Ready to receive" signal is issued, the data is transferred, and then both processors resume calculations.

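The interaction path of Figure 4a translates into a short double loop over cells. The sketch below is our own rendering, not the SPaSM source: for each cell it computes the same-cell pairs and then pairs with the four neighbors reached by the path (east, northeast, north, northwest), accumulating equal and opposite forces so that no pair is visited twice. It assumes the Particle and Cell declarations and the lj_pair_force routine from the earlier sketches. For edge cells, the neighboring cell's particle list would first have to be obtained from another processor by the synchronous message passing of Figures 4b and 4c; that communication step is omitted here.

/* Offsets along the interaction path of Figure 4a: east, northeast, north,
   and northwest neighbors.  Together with the same-cell pairs, sweeping
   every cell with this fixed path visits each nearby pair of cells exactly
   once.  (A production code would also skip pairs farther apart than rmax.) */
static const int path_dx[4] = { 1, 1, 0, -1 };
static const int path_dy[4] = { 0, 1, 1,  1 };

/* Accumulate forces between every particle in cell a and every particle in
   cell b (a and b may be the same cell).  By Newton's third law, equal and
   opposite contributions are added to the accumulators of both cells. */
static void cell_pair_forces(Cell *a, Cell *b)
{
    for (int i = 0; i < a->n; i++) {
        int jstart = (a == b) ? i + 1 : 0;    /* avoid counting a pair twice */
        for (int j = jstart; j < b->n; j++) {
            double fx, fy;
            lj_pair_force(a->p[i].x - b->p[j].x,
                          a->p[i].y - b->p[j].y, &fx, &fy);
            a->p[i].fx += fx;   a->p[i].fy += fy;
            b->p[j].fx -= fx;   b->p[j].fy -= fy;
        }
    }
}

/* Sweep all cells of this processor's domain (ncx by ncy cells) in a fixed
   order.  Neighbors that fall outside the domain belong to another
   processor; their particle lists would first be obtained by synchronous
   message passing, which is omitted in this sketch. */
static void compute_forces(Cell *domain, int ncx, int ncy)
{
    for (int cy = 0; cy < ncy; cy++) {
        for (int cx = 0; cx < ncx; cx++) {
            Cell *c = &domain[cy * ncx + cx];
            cell_pair_forces(c, c);                      /* same-cell pairs */
            for (int k = 0; k < 4; k++) {                /* path neighbors  */
                int nx = cx + path_dx[k];
                int ny = cy + path_dy[k];
                if (nx >= 0 && nx < ncx && ny >= 0 && ny < ncy)
                    cell_pair_forces(c, &domain[ny * ncx + nx]);
            }
        }
    }
}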


Figure 5. Redistribution of Particles. After the force calculation is completed for a given timestep, new positions and velocities are calculated for all the particles.

(a) Moving particles to proper cells (asynchronous message passing steps are red). The new coordinates of each particle are checked by a special redistribution function. For each case in which the new coordinates of a particle correspond to a location in a different cell, the redistribution function also transfers the data for that particle to the sequential list of particle data for the new cell. Most of the transfers are between cells within the same processor domain. Those transfers that cross a boundary between processor domains are executed by means of message passing between processing nodes and are indicated by red arrows in the figure.

(b) Asynchronous message passing between processing nodes (PN). Since there is no way to know beforehand which processors will be involved in the transfer of particles, the transfer is accomplished by an asynchronous mode of message passing. During asynchronous message passing, the source processor issues a "Ready to send" signal and immediately resumes calculation (contrast with synchronous message passing in Figure 4c). All processors periodically check the network and retrieve any waiting messages. The only significant interruption to the calculations of either node is the time used for the transfer of the message.
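The protocol in panel (b) can be sketched roughly as follows. The Thinking Machines library calls used on the CM-5 are not listed in the article, so MPI's nonblocking send and message probe are used here purely as stand-ins for the "Ready to send" and periodic-network-check behavior; the Particle type is the one from the earlier data-structure sketch, and TAG_PARTICLE and the cell bookkeeping are hypothetical.

#include <mpi.h>

#define TAG_PARTICLE 1

/* Send one departing particle to the processor that owns its new cell,
   without waiting for the receiver.  The nonblocking send plays the role of
   "issue 'Ready to send' and immediately resume calculation."  The caller
   must keep *p unchanged until the request completes (e.g., via MPI_Wait). */
void send_particle_async(const Particle *p, int new_owner, MPI_Request *req)
{
    MPI_Isend((void *)p, sizeof(Particle), MPI_BYTE,
              new_owner, TAG_PARTICLE, MPI_COMM_WORLD, req);
}

/* Called periodically during normal computation: drain any particle
   messages that have arrived and hand each one to the local cell lists. */
void poll_network(void)
{
    int pending;
    MPI_Status st;

    MPI_Iprobe(MPI_ANY_SOURCE, TAG_PARTICLE, MPI_COMM_WORLD, &pending, &st);
    while (pending) {
        Particle p;
        MPI_Recv(&p, sizeof(Particle), MPI_BYTE, st.MPI_SOURCE,
                 TAG_PARTICLE, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... append p to the sequential list of the cell that now
           contains (p.x, p.y) on this processor ... */
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_PARTICLE, MPI_COMM_WORLD, &pending, &st);
    }
}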




There are two types of particle transfers to consider. One involves the simple transfer of a particle from one cell to another within a single processor domain. In this case, deleting particle data from one cell and adding it to a new cell requires only copying of memory contents within a single processing node. The second type of transfer requires moving particle data to a cell list in another domain, so the particle data is deleted from the original cell list and sent to the new processor by message passing. The redistribution process cannot use synchronized message passing since it is not known how many particles will leave each processor, how many will be received by each processor, and from what processors particles will arrive. Therefore all processors are put into an asynchronous message-passing mode. Asynchronous message passing occurs in the background while the particle coordinates are checked (see Figure 5b). The messages are addressed so that they go to the appropriate processor, and each node periodically checks the network and receives any messages that are waiting. Messages may be sent or received at any time, and, in general, all processors operate independently. The redistribution of particle data is complete when all particle coordinates have been checked and all messages have been retrieved from the data network.

Our experience indicates that the redistribution of particles is efficient and accounts for a small portion of the overall iteration time. Generally, the number of particles changing cells after any given timestep is small compared with the total number of particles in the system. In addition, since most of the particles that change cells simply move to another cell on the same processor, the inter-processor communication required for the redistribution procedure is minimal.

Table 1. Update Times per Timestep in Scaling Simulations. Shown in the table are the timing results for a set of test simulations (rmax = 2.5σ). Each entry in the table is the CPU time (in seconds) required to compute a single timestep for the given combination of numbers of particles and processors.

                              Processors
  Particles           32     64    128    256    512   1024
  1,024,000         8.90   4.51   2.32   1.26   0.72   0.44
  2,048,000            —   8.96   4.44   2.46   1.36   0.74
  4,096,000            —      —   8.79   4.81   2.67   1.36
  8,192,000            —      —  16.83   8.81   4.80   2.47
  16,384,000           —      —      —  16.95   8.74   4.49
  32,768,000           —      —      —      —  16.90   8.54
  65,536,000           —      —      —      —      —  16.55
  131,072,000          —      —      —      —      —  34.26

Figure 6. The Scalability of the MD Algorithm. The figure is a plot of the CPU time required to simulate a single timestep versus the reciprocal of the number of processors used for the calculation. The dots represent the results obtained from a 4-million-particle simulation (rmax = 5σ) in which particles rearranged from a simple cubic lattice to a face-centered cubic lattice. The line represents perfect scaling from the 32-node data point, that is, an update time that is inversely proportional to the number of processing nodes used in the simulation. The results of this series of test calculations indicate that the performance of our algorithm, at least for this type of simulation, scales very well with the number of available processors. [The plot shows update time in seconds versus 1/(number of processors); the measured points are labeled by the number of processors used (32 through 1024) and are compared with the line for 100% scaling efficiency extrapolated from the 32-node result.]
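Parallel efficiency can be read directly from these results. For a fixed number of particles, doubling the processor count ideally halves the update time, so the ratio of the ideal time to the measured time gives the efficiency. The arithmetic below is ours, using the 32,768,000-particle row of Table 1 and the 32-to-1024-processor speedup quoted in the accompanying text for the rmax = 5σ run.

\[ \text{efficiency (512 to 1024 nodes)} = \frac{T_{512}}{2\,T_{1024}} = \frac{16.90\ \text{s}}{2 \times 8.54\ \text{s}} \approx 0.99, \]

\[ \text{efficiency (32 to 1024 nodes, } r_{\mathrm{max}} = 5\sigma) = \frac{T_{32}/T_{1024}}{1024/32} \approx \frac{30.4}{32} \approx 0.95 . \]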


The architecture and capabilities of the CM-5 are well suited for molecular-dynamics applications. The high-speed communication networks allow us to use message passing to efficiently transfer data among the processors. An equivalent domain was assigned to each processor in order to divide the workload among the processors, and the cell structure and interaction path ensure that the interaction calculations are completed in a minimum amount of time.

The Scalability of our MD Algorithm

In ideal circumstances, the speed of a well-designed parallel algorithm will scale linearly with the number of processors used in the parallel computer. To test the scalability of our algorithm we repeatedly performed a test simulation using various numbers of processors and various numbers of particles ranging from 1 million to 131 million. The starting condition for the tests consisted of identical particles arranged in a simple cubic lattice. The simple cubic lattice—known to be unstable for an interaction potential that depends only on r and to undergo a phase change in which the particles rearrange to a face-centered cubic configuration—was chosen to guarantee movement of particles between domains during the calculation.

Table 1 shows the results of the scaling tests in which rmax was set equal to 2.5σ. Each doubling of the number of processors used for a given number of particles approximately halved the update times (the CPU time required to compute all particle motions during a single timestep). For a 4-million-particle simulation in which rmax was equal to 5σ, the increase in speed in going from 32 processors to 1024 processors was more than a factor of 30, corresponding to 95% parallel efficiency. The latter scaling results are shown graphically in Figure 6, where the update times for the 4-million-particle simulation are plotted as a function of the reciprocal of the number of processors used. The scalability of our MD algorithm was also demonstrated by comparing the update times for varying numbers of particles (see Table 1). For a given number of processors, the update times increased linearly with the number of particles in the simulation.

Applications

As a demonstration of the practicality of our algorithm, we performed a simulation of a 1-million-particle projectile colliding with a 10-million-particle plate. Figure 7 is a series of images taken from the simulation; each image is a "snapshot" showing the positions and velocities of the particles at the end of a timestep. The velocities are indicated by color—red and yellow particles are moving faster than blue and gray particles. The top panel in Figure 7 shows the system at timestep 1000, just after the projectile has made contact with the plate. Note that the particles in the lower third of the projectile have higher kinetic energies (higher velocities) than those in the remainder of the projectile. Some of the particles in the plate also have increased kinetic energies. At timestep 2000 (middle panel of Figure 7) the projectile has nearly penetrated the plate and part of the projectile has begun to break up. A shock wave has propagated to the edges of the plate. By timestep 2900 (bottom panel of Figure 7) part of the projectile has been absorbed into the plate, the remainder of the projectile has disintegrated into free particles, and the shape of the plate has become significantly distorted.

The impact simulation shows the inherently unstructured nature of MD simulations. Tracking the motions of individual particles often requires significant amounts of communication and data management. We have been able to achieve high performance by carefully mapping the physics of the problem to the architecture of the CM-5. The fast iteration times we have attained allow the calculation of larger MD simulations than have previously been possible. For example, the illustration at the beginning of this article was taken from one of the largest MD simulations performed to date—our 38-million-particle simulation of a plate undergoing fracture. Because our algorithm is scalable, the size of the physical system that can be modeled is expected to increase proportionally as each new generation of massively parallel machines is developed.

We are now working to incorporate new types of interaction potentials into our MD algorithm. The particular atomic interactions will be described by materials-specific interaction potentials like the "embedded-atom method" potential. Such potential functions are created by adding a "many-body" term to the pair-interaction potential presented in this paper. The many-body term varies according to the local density surrounding a given atom and thereby results in a more physically realistic interaction potential.

We are using our MD algorithm to study the physics of fracture. We are particularly interested in understanding the brittle and ductile behavior of materials. In ductile fracture the propagation of cracks is accompanied by the formation of a significant number of dislocations (discontinuities in the crystalline structure of a material) at the crack tip.


Figure 7. An 11-Million-Particle Impact Simulation. The three images show the progress of an MD simulation in which a 1-million-particle projectile impacts a 10-million-particle plate. The particles of both the projectile and the plate were initially at rest in equilibrium positions in face-centered cubic lattices. The interactions between particles were approximated by using a Lennard-Jones potential. The colors in the figure represent the velocities of the particles; red and yellow particles have higher velocities than blue and gray particles. The simulation of 2900 timesteps required 30 hours of CPU time on a 512-processor CM-5. The top, middle, and bottom panels correspond to timesteps 1000, 2000, and 2900, respectively.


These dislocations are responsible for permanent deformations in the material. When the fracture is brittle, the crack propagates with little or no permanent deformation. Our goal is to understand and ultimately control parameters that influence these phenomena because such knowledge and ability will allow us to design materials having specific responses and behavior. Large-scale molecular dynamics is an ideal tool for the investigation of fracture phenomena because MD allows us to perform simulations of near-macroscopic samples using very few simplifications or approximations.

Acknowledgements

It is a pleasure to acknowledge the contributions of Niels Grønbech-Jensen, Pablo Tamayo, and Brad L. Holian to parts of the work described here. We also acknowledge generous support from the staff of the Advanced Computing Laboratory, who provided both machine access and assistance. In particular, David O. Rich helped with the use of the CM-5, and Michael F. Krogh provided invaluable help with visualization.

David M. Beazley (left) is a graduate student in the Center for Nonlinear Studies and the Condensed Matter and Statistical Physics Group. His research interests include massively parallel supercomputing, high-performance computer architecture, and scientific computing. Beazley began working at the Laboratory while an undergraduate in 1990. He received a B.A. in mathematics from Fort Lewis College in 1991, and is currently working on a Ph.D. in computational science at the University of Utah.

Peter S. Lomdahl is a staff member in the Condensed Matter and Statistical Physics Group in the Theoretical Division, where he has worked on computational condensed-matter and materials-science research since 1985. From 1982 to 1985 he was a postdoctoral fellow with the Center for Nonlinear Studies. Lomdahl received his M.S. in electrical engineering and his Ph.D. in mathematical physics from the Technical University of Denmark in 1979 and 1982. His research interests include parallel computing and nonlinear phenomena in condensed-matter physics and materials science.

Further Reading

M. P. Allen and D. J. Tildesley. 1987. Computer Simulation of Liquids. Clarendon Press.

D. M. Beazley and P. S. Lomdahl. 1994. Message-passing multi-cell molecular dynamics on the Connection Machine 5. Parallel Computing 20: 173–195.

D. M. Beazley, P. S. Lomdahl, P. Tamayo, and N. Grønbech-Jensen. 1994. A high performance communications and memory caching scheme for molecular dynamics on the CM-5. In Proceedings of the Eighth International Parallel Processing Symposium. IEEE Computer Society.

R. C. Giles and P. Tamayo. 1992. A parallel scalable approach to short-range molecular dynamics on the CM-5. In Proceedings of Scalable High Performance Computing Conference 1992. IEEE Computer Society.

P. S. Lomdahl, P. Tamayo, N. Grønbech-Jensen, and D. M. Beazley. 1993. 50 GFlops molecular dynamics on the Connection Machine. In Proceedings of Supercomputing 1993. IEEE Computer Society.

S. Plimpton. 1993. Fast parallel algorithms for short-range molecular dynamics. Sandia National Laboratory Report SAND91-1144, UC-705.

P. Tamayo, J. P. Mesirov, and B. M. Boghosian. 1991. Parallel approaches to short range molecular dynamics simulations. In Proceedings of Supercomputing 1991. IEEE Computer Society.
