Blue Gene: A vision for protein science using a petaflop

by the IBM Blue Gene team: F. Allen, G. Almasi, W. Andreoni, D. Beece, B. J. Berne, A. Bright, J. Brunheroto, . Cascaval, J. Castanos, P. Coteus, P. Crumley, A. Curioni, M. Denneau, W. Donath, M. Eleftheriou, B. Fitch, B. Fleischer, C. J. Georgiou, R. Germain, M. Giampapa, D. Gresh, M. Gupta, R. Haring, H. Ho, P. Hochschild, S. Hummel, T. Jonas, D. Lieber, G. Martyna, K. Maturu, J. Moreira, D. Newns, M. Newton, R. Philhower, T. Picunko, J. Pitera, M. Pitman, R. Rand, A. Royyuru, V. Salapura, A. Sanomiya, R. Shah, Y. Sham, S. Singh, M. Snir, F. Suits, R. Swetz, W. C. Swope, N. Vishnumurthy, T. J. C. Ward, H. Warren, R. Zhou

In December 1999, IBM announced the start of ofproteins and the protein folding problem, includ­ a five-year effort to build a massively parallel ing structure prediction and studies ofmechanisms. computer, to be applied to the study of biomolecular phenomena such as protein folding. We discuss the limitations of experimental probes The project has two main goals: to advance our of the folding process-a motivation for the use of understanding of the mechanisms behind protein simulation. This is followed by a brief high-level folding via large-scale simulation, and to explore overview of computational approaches for studying novel ideas in massively parallel machine the mechanisms of protein folding, including a sur­ architecture and software. This project should enable biomolecular simulations that are orders vey of some of the challenges, options, and areas of of magnitude larger than current technology exploration in the field. We then give a briefdescrip­ permits. Major areas of investigation include: tion ofthe skeleton science plan now being executed. how to most effectively utilize this novel platform to meet our scientific goals, how to make such massively parallel machines more usable, and After making the case for the utility of large-scale how to achieve performance targets, with simulation, we focus on the elements ofthe machine reasonable cost, through novel machine architecture that form the basis for the hardware and architectures. This paper provides an overview software research that the Blue Gene project will of the Blue Gene project at IBM Research. It pursue. Finally, we describe some of the challenges includes some of the plans that have been made, the intended goals, and the anticipated to be faced in creating a simulation application that challenges regarding the scientific work, the can efficiently execute the goals ofthe scientific pro- software application, and the hardware design.

©Copyright 2001 by International Business Machines Corpora­ tion. Copying in printed form for private use is permitted with­ outpayment ofroyalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copy­ his paperprovides an overview ofthe Blue Gene right notice are included on the first page. The title and abstract, but no other portions, ofthis paper may be copied or distributed Tproject at IBM Research. We begin with a brief royalty free without further permission by computer-based and discussion ofwhy IBM decided to undertake this ad­ other information-service systems. Permission to republish any venturous research project. We include an overview other portion of this paper must be obtained from the Editor.

310 ALLEN ET AL. 0018-8670/01/$5.00 © 2001 IBM IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. gram on the project hardware, along with some op­ • Protein-drug interactions (docking) tions for meeting those challenges. • Enzyme catalysis (through use ofhybrid quantum and classical methods) 4 Motivation for IBM • Refinement ofprotein structures created through other methods There are several reasons why IBM is interested in the use ofbiomolecular simulations to study protein science. The life sciences are receiving special atten­ Protein science overview tion from IBM because the field is demonstrating ex­ The human genome is currently thought to contain plosive growth, and the life sciences are creatingwhat approximately 40000 genes, which code for a much will become one ofthe most significant industries of larger number ofproteins through alternative splic­ the new century. Indeed, with advances in bioinfor­ ing and post-translational modification, a molecu­ matics and genomics, high-throughput screening of lar toolkit assembled to handle a huge diversity of drug candidates, and ready access to information on functions. An understanding of how proteins func­ the Internet, the life sciences have benefited from tion is essential for understanding the life cycle computational capabilities and will be driving the re­ and metabolism, how cells send signals to their envi­ quirements for data, network, and computational ca­ ronment, and how cells receive and process signals pabilities in the future. The particular area of pro­ from their environment. An understanding of pro­ tein folding was chosen because there is great synergy tein structure and function can serve as a basis for between IBM's interests and capabilities in high-per­ innovation in new therapies, diagnostic devices, and formance computing and the scientific needs of the even industrial applications. field. The understanding ofthe protein folding phe­ nomenon is a recognized "grand challenge problem" The function ofproteins is intimately associated with of great interest to the life sciences. their structure. 1 The examples shown in Figure 1 il­ lustrate this. 5 When proteins fold into the wrong IBM has built research machines to explore novel ar­ structure, the results can be fatal, e.g., "mad cow" chitectural ideas before, and these projects have fre­ disease probably results from an autocatalyzed wrong quently been associated with important scientific fold in the prion protein 6 and cystic fibrosis is also challenges, such as problems in lattice quantum chro­ 7 2 connected with protein (mis)folding. modynamics. ,3 Protein architecture. Protein architecture 8 is based The mission of the Blue Gene scientific program is on three principles: to use large-scale biomolecular simulation to advance our understanding of biologically important pro­ cesses, in particular our understanding ofthe mech­ 1. The formation of a polymer chain anisms behind protein folding. 2. The folding ofthis chain into a compact function­ enabling structure, or native structure Increased computational power translates into an in­ 3. Post-translational modification of the folded creased ability to validate the models used in sim­ structure ulations and, with appropriate validation of these models, to probe these biological processes at the The protein chain (orpeptide chain ifshort in length) microscopic level over long time periods. A critical is a heteropolymer built up from alpha amino acid component ofour research programwill be the con­ monomers, as shown in Figure 2. The sequence of nection of the simulations to the experimental bio­ amino acid residues in the peptide chain is termed physics of protein dynamics. 1 To achieve our high­ theprimary structure ofthe protein. The 20 different level scientific goals, it will be essential to collaborate choices for each amino acid in the chain give the pos­ with the worldwide experimental, simulation, and sibility ofenormous diversity, even for small proteins. theoretical communities in order to utilize the com­ For example, a peptide of 30 residues yields the as­ putational platform in the most intelligent way. tonishing number of about 2030, or approximately 39 10 , possible unique sequences. The scientific knowledge derived from research on protein folding can potentially be applied to a variety From the enormous number of possible protein se­ ofrelated life sciences problems ofgreat scientific and quences that could exist, we observe relatively few commercial interest, including: in nature. It is thought that the diversity of viable

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 ALLEN ET AL. 311

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. Figure 1 On the left, the enzyme malate dehydrogenase is shown with the reactants for the catalytic process in place. The site where these reactants are bound, sometimes termed a cleft, is specifically tailored in shape and affinity to enormously accelerate this specific reaction, and to be relatively inert to other processes. On the right, the M2 proton channel of the influenza A virus is shown. This ion channel plays an essential role in infection and is the site of attack of a class of channel-blocking flu drugs. In its open state illustrated below, the water column via which protons pass through the channel is seen to be a well-defined structural feature.

MALATE DEHYDROGENASE M2 CHANNEL (++ FORM)

r ...-, 'f i}l'~".# .1-'" .( - I' .' ' ~ ' . ."". , ~.~l._ ~;"' :.~ -~ -~~ 1~"'" ~.... "1 ..... , # i __ ~',.~..''''. '.. ... 't.. ,' .,), ~ l,' JI., .... , t.:t \·/ (~ ·;~~~~5) ((~i' ~I'~~.~, " .....\..~ t~l .. : ~~ .p!(~~.~t:~,,\ ~~~/ ( J.. ~t~~m'"~:'--'1 .~ 'f.'~...:' >-~~6~j '-((' ..~ ~ ~ ~--- ~4. ;:l~a~~~'~?'/ '.;A./ ... ~; _-'.A~.~,~_

Reprinted with permission, from (A) Q. Zhong, T. Husslein, P. B. Moore, D. M. Newns, P. Pattnaik, and M. L. Klein, "The M2 Channel ofInfluenza A Virus: A Molecular Dynamics Study," FEBSLetters 434, No.3, 265-271 (1998); (B) Q. Zhong, D. M. Newns, P. Pattnaik, J. D. Lear, and M. L. Klein, "Two Possible Conducting States ofthe Influenza A Virus M2 Ion Channel," FEBSLetters 473, No.2, 195-198 (2000).

proteins has been constrained by natural selection sheet. The relatively organized alpha helix and beta to give: sheet sections of a protein are joined by less orga­ nized loop or turn regions. The way in which the rel­ 1. Desired function atively localized secondary structure elements com­ 2. Adequate stability bine to form the overall compact protein is termed 3. Foldability the tertiary level of structure, as can be seen from 4. Evolvability from appropriate evolutionary pre­ the example on the right in Figure 3. Finally, qua­ cursors ternary structure refers to the way that tertiary struc­ tures from two or more chains combine to form much The peptide chain has certain local fold character­ larger structures. istics termed secondary structure. 7 Steric hindrance and energetic considerations favor certain confor­ The protein folding problem. There are two impor­ mations of the peptide chain. One such conforma­ tant facets to the protein folding problem: predic­ tion is the alpha helix (see red helices in Figure 3). tion ofthree-dimensional structure from amino acid Another secondary structure is the beta sheet (blue sequence, and understanding the mechanisms and flattened regions in Figure 3), in which two or more pathways whereby the three-dimensional structure strands of the peptide chain are aligned to form a forms within biologically relevant timescales.

312 ALLEN ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. The prediction of structure from sequence data is Figure 2 The generic formula for an alpha amino acid is the subject of an enormous amount ofresearch and NH2 - GaHR - GOOH, where the suffix denotes a series of conferences that assess the state of the the alpha carbon. The different amino acids, also art in structure prediction. 9 While this area is ex­ termed "residues" in protein architecture, differ in tremely important, good progress in the area ofstruc­ the "R" group attached to the alpha carbon. tural predictions has been made using only modest There are 20 possible choices of R group. In the amounts of computational power. The effort de­ polymerization process successive monomers are added, resulting in a peptide chain, as shown. scribed in this paper is aimed at improving our un­ The peptide and ultimately the protein chain derstanding ofthe mechanisms behind protein fold­ has the formula NH2 - GaHR1 - GO - NH • ing' rather than at structure prediction. Even though GaHR2 - ... - GaHRL - GOOH. biologists have been most interested in structure pre­ diction, there has been an increasing recognition of the role that misfolding of proteins plays in certain disease processes, notably Alzheimer's disease and mad cow disease. 6 The section that follows describes some of the fundamental reasons for interest in the process of protein folding.

Why proteinfolding mechanisms are important. The fact that a subset ofheteropolymers constructed from amino acids and used in biological processes actu­ ally take on reproducible three-dimensional struc­ tures in a relatively short time of seconds or less is one of the marvels of nature. Heteropolymers typ­ ically form a random coil in solution and do not "fold" to any reproducible structure in experimen­ tally accessible times. Consider the paradox noted by Levinthal, 10 which asks the reader to consider that if, say, we allow three possible conformations for ev­ ery amino acid residue on the peptide chain, a 100­ residue proteinwould have 3 100 == 10 47 configurations. much ofthe hydrophobic groups into the core ofthe Any unbiased exploration of this conformational protein-out ofcontactwith water, while the hydro­ space would take a vast amount oftime to complete. philic and charged groups remain in contactwith wa­ Thus, the proteins that fold reproducibly and quickly ter. The stability can be described in terms of the into particular shapes must have been selected in Gibbs free-energy change ~G some way for these properties. It is hypothesized that these proteins conduct this conformational search ~G == ~H - T~S, along particularpathways that allow the folding pro­ cess to proceed quickly. One ofthe challenges in the where ~H is the enthalpy change and ~S is the en­ study ofprotein dynamics is to understand the mech­ tropy change. ~H is negative due to the more fa­ anisms behind this behavior. An improved under­ vorable hydrophobic interactions in the folded state, standing ofthese mechanisms is not only interesting but so is ~S because the folded state is much more from a purely scientific perspective, but might even­ ordered and has lower entropy than the unfolded tually allow us to engineer other "self-assembling" state. The balance between the enthalpy and entropy structures. terms is a delicate one, and the total free-energy change is only oforder 15 kilocalories per mole. Ev­ Current view of folding mechanisms. A simplistic idently the internal hydrophobic/external hydrophilic but illustrative way of viewing protein folding is to packing requirement places strong constraints on the note that the amino acidR groups (see Figure 2, cap­ amino acid sequence, as does the requirement that tion) fall into three main classes: (1) charged, (2) hy­ the native state be kinetically accessible. drophilic ("water-loving"), and (3) hydrophobic ("water-hating"). In the simplest picture, the folded It is helpful to think ofthe physics ofthe folding pro­ state ofthe peptide chain is stabilized primarily (for cess as a "free-energy funnel," 11 shown schematically a globular protein in water), by the sequestration of in Figure 4. Since the folding process is slow relative

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 ALLEN ET AL. 313

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. Figure 3 The left portion of the figure shows the unfolded state with some secondary structure formed; the right portion shows the native folded state. PDB 10: 1B2X, A. M. Buckle, K. Henrick, and A. R. Fersht, "Crystal Structural Analysis of Mutations in the Hydrophobic Cores of Barnase," Journal ofMolecular Biology 234,847 (1993).

to motions at atomic scale, we can think of partially hairpin (blue). It is still a long way from the com­ folded configurations as having a quasi-equilibrium pact native structure at right. The folding process in value ofthe free energy. The free energy surface may different proteins spans an enormous dynamic range be displayed as a function of some reduced dimen­ from approximately 20 microseconds to approxi­ sionality representation ofthe system configuration mately 1 second. in a given state ofthe protein. 12 Figure 4 is a vertical plot offree energy in a contracted two-dimensional Probes of folding. In biophysical studies of protein space (horizontal plane) representing the configura­ folding, current emphasis is on the reversible fold­ tion of the peptide chain. The most unfolded con­ ing ofmoderate-sized globular proteins under closely figurations are the most numerous, but have the high­ controlled conditions in vitro. Many globular proteins est free energy, and occur on the rim of the funnel. can be made to unfold in vitro and then to undergo Going into the funnel represents a loss of number a refold back to the native structure without any ex­ of configurations (decrease of entropy), but a grad­ ternal assistance. 1 The unfolding process can be ual decrease in free energy, until the native statewith driven in several ways, for example by changes in de­ very few configurations and the lowest free energy naturant concentration, by heat pulses, by pressure is reached at the bottom of the funnel. The walls of pulses, etc. In this manner, refolding times down to the funnel contain only relatively shallow subsidiary tens of microseconds can be measured. minima, which can trap the folding protein in non­ native states, but only for a short time. Now the evo­ Although the study ofprotein dynamics and folding lution of the system as it folds can be described in pathways via experiment is an active area ofresearch terms of the funnel. The system starts off in a phys­ and much progress is being made,l,13 experimental ically probable state on the rim of the funnel, and probes do notyet provide as complete a view ofpro­ then makes transitions to a series of physically ac­ tein dynamics at the microscopic length and time­ cessible states within the funnel, until the bottom of scales as one would like. Numerical simulations of­ the funnel is gradually approached. fer a potential window into this microscopic world, and models that treat the relevant physical phenom­ Figure 3 illustrates folding. Here the unfolded pep­ ena with varying degrees of abstraction have been tide chain on the left already contains some folded useful sources of insight, particularly for the theo­ secondary structure, alpha helices (red), and a beta retical community. As larger computational re-

314 ALLEN ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. sources have become available and simulation tech­ Figure 4 The free energy is plotted vertically as a function niques have improved, more detailed simulations of of the accessible configurations of the peptide larger systems and longer timescales have become chain, which are projected into the plane in this possible. representation. The most unfolded configurations are the most numerous, but have the highest free The simplest systems for study are those with less energy, and occur on the rim of the funnel. Going than approximately 120 residues, which constitute into the funnel represents a loss of a number of configurations (decrease of entropy), but a "two-state folders," i.e., there is no intermediate be­ gradual decrease in free energy, until the native tween the denatured and native states. Such ideal state with very few configurations and the lowest systems should also not contain prosthetic groups, free energy is reached at the bottom of the funnel. i.e., they should be pure peptides, should not require the assistance of "chaperones" to fold, and should preferably not have disulfide bonds.

Consider the following three types ofprotein science studies that might employ large-scale numerical sim­ ulation techniques:

• Structure prediction • Folding pathway characterization • Folding kinetics

Protein structure prediction can be carried out using a large number oftechniques 8 and, as previously dis­ cussed, it is unnecessary to spend a "petaflop year" on the prediction of a single protein structure. That said, there is some reason to believe that atomistic simulation techniques may be useful in refining struc­ tures obtained by other methods.

Foldingpathway characterization typically involves the study of thermodynamic properties of a protein in quasi-equilibrium during the folding process. Map­ Reprinted from Folding and Design 1, 1. N. Onuchic, N. D. Socci, ping out the free-energy "landscape" that the pro­ Z. Luthey-Schulten, and ~ G. Wolynes, "Protein Folding Funnels: tein traverses as it samples conformations during the The Nature ofthe Transition State Ensemble," 441-450. ©1996, folding process can give insights into the nature of with permission from Elsevier Science. intermediate states along the folding pathway and into the "ruggedness" ofthe free-energy surface that is traversed during this process. Because such stud­ case, the calculation of thermodynamic averages is ies involve computations ofaverage values ofselected not enough; the actual dynamics ofthe system must functions of the system's state, one has the choice be simulated with sufficient accuracy to allow esti­ of either averaging over time as the system samples mation of rates. Of course, a large number of tran­ a large number ofstates (molecular dynamics) or av­ sition events must be simulated in order to derive eraging over configurations (Monte Carlo). Aggres­ rate estimates with reasonable statistical uncertain­ sive sampling techniques that may improve the com­ ties. Another challenge faced in such simulations is putational efficiency with which such averages can that the simulation techniques used to reproduce be computed can be used to good effect in these stud­ thermodynamic averages in ensembles other than ies. Simulation techniques to compute these aver­ constant particle number, volume, and energy (NVE) ages over the appropriate thermodynamic ensem­ are, strictly speaking, inappropriate for studies of bles are available. 14 folding kinetics.

Simulation studies of folding kinetics are aimed at Both pathway and kinetic studies may "zoom in" on understanding the rates at which the protein makes selected regions of interest in the folding process transitions between various conformations. In this rather than trying to simulate the process from end

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 ALLEN ET AL. 315

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. A more detailed treatment ofthe interatomic inter­ Figure 5 Schematic plot of degree of sophistication of interatomic interactions vs amount of biophysical actions is provided by united-atom models that are process that can be simulated simulated in a vacuum or in what is known as im­ plicit solvent. As the name indicates, united-atom models treat groups of atoms within the protein as if they were a single particle. The implicit solvent approximation treats the water surrounding the pro­ tein not as discrete molecules, but rather as a con­ tinuum, perhaps with a dielectric interaction on the protein. These models, because of their increased sophistication, require significantly more computer resources than the lattice and beaded-string mod­ els, and so, many fewer simulations can be run and not as many variables can be easily investigated.

Next in sophistication are the all-atom models ofpro­ teins, in which the water solvent is treated with an atomistic resolution. This treatment of the water, known as an explicit solvent treatment, adds consid­ erable computational cost to the calculation, since the interactions between water molecules become the dominant part of the simulation.

Finally, the ultimate in treatment of interatomic in­ teractions would include quantum mechanical (OM) to end. If probing the dynamics of the folding pro­ methods that treat the actual electronic wave func­ cess through repeated simulation of large portions tion of the system explicitly. Such calculations are ofcomplete folding trajectories is the method used, used, in fact, to study relatively short-time biomo­ then some logical starting points would be peptides lecular processes such as enzyme catalysis. orsmall fast-folding proteinswith well-characterized As Figure 5 indicates, as the degree of sophistica­ properties. IS Preferably, folding times should be un­ der 100 microseconds for convenience ofsimulation. tion ofthe treatment ofthe interatomic interactions increases, it becomes more and more difficult to ex­ haustively simulate the entire process ofprotein fold­ Computational approaches to studying ing' because the computational requirements rise so folding mechanisms quickly. However, all approaches have been fruit­ ful, some workers choosing to take advantage ofthe In principle, computer simulations of biomolecular ability to simulate the entire folding process, albeit processes such as protein folding can be carried out at a lower level of sophistication of interatomic in­ using techniques spanning a broad range of sophis­ teractions. Others have focused their efforts onhigher tication in modeling the basic physical processes and quality treatment ofthe interactions, but with much spanning a broad range in computational cost. less exhaustive exploration of the folding process. The addition of hardware with the capabilities At one end ofthe spectrum in sophistication oftreat­ planned for Blue Gene should significantly improve ment ofthe interatomic interactions are the so-called the ability to perform simulations of many degrees lattice and beaded-string models. Often in these ofsophistication. The current plan for the Blue Gene treatments each amino acid residue of a protein is project is to use all-atom explicit solvent simulations approximated as a point particle. In the lattice mod­ as our first targeted approach. els these particles are constrained to occupy sites on a lattice in three dimensions. In the beaded string So far the most ambitious attempt to fold a protein models this restriction can be relaxed. These mod­ is that of Duan and Kollman 18 on the villin head­ els have been extremely useful because they can be piece in water. In three months, on a 256-node Cray exhaustively investigated with a modest amount of T3E processor, the authors were able only to follow computer resources. 16,17 the folding trajectory for one microsecond, still too

316 ALLEN ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. short a time for folding, and in fact the authors were Table 1 The computational effort required to study protein not able to achieve a fully folded configuration, folding is enormous. Using crude workload though extensive secondary structure formation oc­ estimates for a petaflop/second capacity machine leads to an estimate of three years to simulate curred. Subsequently three more trajectories have 100 microseconds. also been obtained by these authors. Physical time for simulation 10-4 seconds Challenges for computational modeling. The current Typical time-step size 10-15 seconds Number of MD time steps 1011 expectation is that it will be sufficient to use classical Atoms in a typical protein and 32000 techniques, such as molecular dynamics (MD), to water simulation model proteins in the Blue Gene project. This is be­ Approximate number of cause many aspects of the protein folding process interactions in force do not involve the making and breaking of covalent calculation Machine instructions per force 1000 bonds. While disulfide bonds playarole in many pro­ calculation tein structures, their formation will notbe addressed Total number of machine by classical atomistic simulations. In classical atom­ instructions istic approaches, a model for the interatomic inter­ actions is used. This is known as a potential, or force field, since the forces on all the particles can be com­ puted from it, if one has its mathematical expres­ treat the molecules as completely rigid. The use of sion and all its parameters. The MD approach is to RESPA and the fixing of bond lengths involving hy­ compute all the forces on all the atoms of the com­ drogen atoms with SHAKE and RATTLE allow the use puter model ofthe protein and solvent, then use that of larger time-step sizes without much degradation force to compute the new positions of all the atoms in the accuracy of the simulation. a very short time later. By doing this repeatedly, a trajectory of the atoms of the system can be traced Onewould like to study the dynamics ofa single pro­ out, producing atomic coordinates as a function of tein in an unbounded volume of water. Since only time. a finite simulation cell can be modeled, the most common approach is to use periodic boundary con­ Newton's equation is integrated for each particle us­ ditions in the simulation volume. Thus the simula­ ing a small time step of the order of 10-15 seconds. tion models a periodic structure of unit cells, such This small time-step size is required to accurately as a simple cubic array, each cell consisting of a box describe the fastest vibrations ofthe protein and sol­ ofwater containing a protein within it, and all cells vent system, which tend to be those associated with being identical. The charges in different cells are par­ movement ofhydrogen atoms in the protein and wa­ tially screened from each other by the high water di­ ter. The magnitude of the computational cost can electric constant, which is exhibited by water mod­ be seenwhen one notes that folding times ofapprox­ els commonly used in simulations. Long-range imately 10 -4 seconds are observed in some fast-fold­ electrostatic interactions for such a periodic struc­ ing systems, requiring the computation of approx­ ture can be treated via the Ewald summation tech­ imately 1011 MD time steps. As can be seen in Table nique. 25 1, the computational requirements for studying pro­ tein folding are enormous. One of the key challenges faced in implementing classical atomistic simulations is that of adequately Various methods can be used to improve the sta­ modeling the interatomic interactions in the systems bility and efficiency ofthe dynamic simulation. These required to study problems ofbiological interest such include the use of integrators with good stability as protein folding. The forces that are typically taken properties such as velocity Verlet, 19 and extensions into account in MD simulations ofprotein and water such as RESPA. 20 Additional efficiencies can be re­ systems (and many others) are illustrated in Figure alized through freezing the fastest modes of vibra­ 6, which displays some ofthe types ofenergy expres­ tion by constraining the bonds to hydrogen atoms sions involved in a typical model potential. The en­ to be fixed in length using algorithms such as SHAKE21 ergy terms may be classified as eitherbonded ornon­ and RATTLE. 22 This approximation is consistentwith bonded interactions. Bonded interaction terms in many of the most popular models for water inter­ most model potentials consist ofat least bond stretch­ actions, such as TIP3P, TIP4P, TIP5P,23 and SPC and ing, angle bending, and torsion energy terms, but SpCIE,24 since these models for water interactions may include more complex terms such as stretch-

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 ALLEN ET AL. 317

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. by the Lennard-Jones (LJ), or Van der Waals inter­ Figure 6 Representative functional forms for interparticle interactions used in force fields for atomistic action. This is comprised of the fluctuating-dipole simulations, Utotal = UStretch + U Bend + UTorsion + or dispersive attraction (r -6 term), togetherwith ex­ UCoulomb + U u . Bond stretch interactions involve change repulsion, which in the LJ form of the inter­ two particles, angle bending interactions involve action is artificially modeled by the r -12 term. Note three particles, and the torsion terms involve that the sums in the Coulomb and LJ expressions in­ four particles. All of the nonbonded interactions clude only nonbonded pairs ofatoms. These are pairs involve pairs of particles. The nonbonded inter• actions between particles that also have bonded ofatoms that are neither covalently bonded nor con­ interactions are typically modified or eliminated. nected by an angle energy expression. Pairs of at­ oms that are neither bonded nor connected by an angle expression, yet are coupled by a torsion expres­ sion, are sometimes known as 1-4nonbonded pairs. Depending on the model potential in use, these non­ bonded interactions are typically excluded or mod­ ified. The modifications may, for example, be dif­ ferent LJ parameters or use some multiplicative attenuation factor.

UTorsion = . L.. L Vijkln [1 + cos(n

In order for a science program based on large-scale bend interaction. Bond stretch and angle bending simulation to be a success, the model potentials used movements are generally very small, but torsional must adequately represent the relevant physics and movement may involve significant rotational dis­ chemistry involved. The project is to some extent hos­ placements. As Figure 6 indicates, bond and angle tage to the accuracy ofthe force fields available. Ex­ expressions are often harmonic in nature, whereas isting force fields are remarkably successful in some torsional energy contributions are usually expressed areas. However, the suitability ofexisting force fields as a short sum ofcosine functions. Note that the sum for accurately modeling the dynamics ofthe folding in the bond expression is over all pairs of atoms that process is an area of active investigation and refine­ are considered to be covalently bonded. The sum in ment ofthe models used is an ongoing process. The the angle expression is over all sets of three atoms force fields in common use to model the protein and that share two covalent bonds. The sum in the tor­ water system include, for example, CHARMM,26 sion expression is over all sets of four atoms that AMBER,27 GROMOS,28 and OPLS-AA. 29 Beyond im­ share three covalent bonds. proved parameterization ofthe existing models, the inclusion of additional physical phenomena such as The nonbonded forces are also illustrated in Figure polarizability is being investigated. 30,31 6. These consist of Coulomb interactions and Van der Waals interactions. The Coulomb interactions It seems obvious that calculations ofprotein folding between the atoms are due to charges and have the pathways and rates would be very sensitive to vari­ longest range in terms of the extent of their influ­ ations in the force field. However, strong arguments ence. These interactions playa key role in defining exist that in fact, paradoxically, the phenomenon of the energies of hydrogen bonds, which are ubiqui­ folding may be quite robust. The notion of design­ tous both intraprotein, intrawater, and between pro­ ability of a native fold, meaning that the fold is sta­ tein and water. The binding of solids such as solid ble over a significant variation of amino acid con­ hydrocarbons, and many of the interactions within tent, is one aspect of the robustness. Also, many the cores offolded protein molecules, are dominated proteins exist in highly mutated forms across and

318 ALLEN ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. within species. These mutated forms often have the parallel replica method of Voter, 33,34 the reaction same fold. On the other hand, mutated proteins of­ path approach of Elber,35 and the transition path ten have much slower folding rates, even ifthe same sampling approach of Chandler and coworkers. 36 structure is ultimately produced. And furthermore, it is well known that a number of diseases exist that Among the areas we plan to investigate are folding can be traced to changes in a single amino acid in kinetics, pathway characterization, and force-field the protein, which presumably results in unfoldable comparison and assessment. In kinetic studies, many or misfolded proteins. This argues for the possibil­ occurrences of the relevant events are required to ity that folding may be very sensitive to force-field reach any well-supported conclusions. Simulations quality. Characterizations, comparisons, and assess­ that probe for information about kinetics will need ments offorce fields are expected to represent a sig­ to be performed using constant energy dynamics, and nificant area of activity for the Blue Gene science highly accurate trajectories will need to be gener­ program. Thus, the software application infrastruc­ ated to avoid heating the system. Among the impor­ ture that supports the science must have the ability to support multiple force fields and new develop­ tant questions to investigate in kinetic studies, in ad­ ments in force-field technology. dition to attempting to predict rates, is whether any observations are consistentwith recent theories that Simulation options. It is very important to dispel the relate folding to chemical reaction kinetics or nu­ cleation events as observed in first-order phase tran­ notion that the Blue Gene resource will be applied 16 to study a single folding event of a single protein. sitions. These studies will also require the devel­ For physically meaningful studies ofprotein folding, opment of new trajectory analysis tools. it will be necessary to simulate a large number of trajectories in order to reach conclusions supported For folding pathway characterizations, we might de­ by reasonable statistics. Estimates ofthe number of velop and employ technology similar to that ofShei­ trajectories required, for each system studied, range nerman and Brooks, 37 where free energy along fold­ from 10 to 100 in order, for example, to derive a ing pathways is characterized using sophisticated crude estimate of the characteristic time for a step thermodynamic sampling techniques. These meth­ in a folding process. This requirement places lim­ ods can provide quantitative (subject to the approx­ itations on the sizes of systems and the lengths of imations ofthe model and simulation methods) char­ individual dynamical simulations that can be under­ acterization of the protein folding landscape. Such taken. There is some evidence that multiple simu­ simulations require the implementation ofthermal­ lations of limited duration provide more informa­ and pressure-control technology in the application, tion than a single longer trajectory representing an as well as the implementation of potential of mean equivalent amount ofcomputation. 32 For force-field force and umbrella sampling techniques. characterizations, too, many simulations will be re­ quired on a large variety ofprotein and peptide sys­ tems using different force fields. Some ofthe studies will target force-field character­ ization through a variety of probes. The simplest of We therefore anticipate the simulation ofa large va­ these are studies of the structural stability of exper­ riety of protein and smaller peptide systems in the imentally generated native structures using differ­ Blue Gene program for various amounts oftime and, ent force fields. These are likely to involve both ki­ for many, with a high degree of replication in order netic and thermodynamic sampling simulations 12 of to obtain meaningful statistics. We also anticipate small peptides in water over tens of nanoseconds­ that effort will be spent on the implementation and times characteristic offast alpha-helixformation. An­ investigation ofnewly developed, and perhaps more other type of study in which known protein struc­ efficient, algorithms for investigating protein science. tures are partially unfolded by applying a heat pulse, and then refolded, can provide further insight into A number ofnew techniques are becoming available the ability of the force fields to correctly reproduce that might prove useful for studying protein pro­ free-energy minima. Another possible way to com­ cesses such as folding. These, and othersyet to come, pare and assess force fields is to see if the trends in should be examined for applicability in the context the free energy of solvation or the partition coeffi­ of the hardware and science program. For studies cient for peptides that is observed in experiments can of protein folding kinetics, it may be possible to ex­ be computed using free-energy methods imple­ ploit acceleration techniques. Examples include the mented in the Blue Gene application.

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 ALLEN ET AL. 31 9

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. the hardware most effectively to advance our under­ Figure 7 This figure illustrates the interactions between the major intellectual communities participating in the standing of the protein folding process. Therefore, science of protein folding. Extensive opportunities a key aspect of the Blue Gene science program will for interchange of ideas exist between each of be outreach to the worldwide scientific community these communities. Experimentalists can suggest ofprotein experimentalists, simulation methodology new systems and properties to study in and application experts, and biophysical theorists in simulations. Simulators can suggest systems to academia, government laboratories, and commercial be probed by experimentalists. Theorists can suggest systems for study by experimental and life sciences institutions. It is hoped that this project simulation methods. Simulators can feed back can help catalyze interactions within the life sciences microscopic information to theorists that may community, as shown in Figure 7. This is essential support or refute existing theories or suggest ifwe are to have a strong and relevant scientific im­ potential avenues for improvement. pact in protein science through the application of Experimentalists can similarly interact with the hardware. theorists via their data. It is hoped that the Blue Gene project will allow IBM to act as a catalyst to stimulate these interactions. We intend to engage the scientific community di­ rectly by organizing and hosting workshops, confer­ ences, and seminars. At the time this paper was be­ ing written, the first IBM-sponsored protein science workshop planned to meet at the University of Cal­ EXPERIMENTALISTS ifornia in San Diego on March 30-31, 2001. One of the primary functions of the meeting was to gener­ ate suggestions for new collaborative work in the field, especially, but not limited to, work that may II bear directly on the Blue Gene project.

THEORISTS SIMULATORS Hardware A petaflop/s power (10 15 floating-point operations per second) computer, with something like 50 times the computing power of all in the world today, can only be realized through massive parallelism. Such a machine would need to have high­ Overview of planned science program. The Blue bandwidth, low-latency communications; otherwise Gene science program is planned to be incremen­ it would be essentially limited to data-parallel op­ tal, with early studies being carried out on currently eration, a mode resulting in unacceptably long turn­ available hardware platforms. An application suite around time for jobs, each of which would be run will evolve that makes use of existing software tools on a low-power subset of the machine. A conven­ and biomolecular modeling and simulation software tional implementation ofa petaflop/s machine would packages, where appropriate. In fact, a rich set of be too massive, expensive, and unreliable. Because function exists today in commercial, academic, and of the communications requirement, solutions in­ public domain tools that allow visualization, prob­ volving huge numbers of loosely coupled personal lem setup, molecular coordinate manipulation, and computers are too inefficient for the desired appli­ trajectory generation and analysis. Some of the ini­ cations. tial simulations may be performed using these com­ mercially available software packages. However, new In the MD GRAPE approach,38 a set of boards con­ software will ultimately need to be developed for the taining a GRAPE chip hard-wired to implement the 2 computational core in order to efficiently exploit the full O(N ) 39 long-range force calculation is coupled special hardware being designed and to explore novel to a host machine in which the short-range forces techniques and algorithms. are computed; in this approach the use offinite-range cutoffs of the long-range forces to improve perfor­ Even with the anticipated computational power mance is discarded and compensated for by the ef­ available to the Blue Gene project for MD simula­ ficiency of "pipelining" the calculation of the long­ tions, careful consideration must be given to the sci­ range forces done in the GRAPE chip. There is also entific program in order to utilize the capability of a WINE chip for implementing the k-space part of

320 ALLEN ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. Ewald. Scaling up such a highly specialized machine chips with some faulty processors. This approach is to the necessary size would require an appropriate worthwhile if one targets applications with signifi­ scaling up of the host machine so that the bonded cant amounts ofparallelism, so that a large number force component (which is calculated more fre­ ofindividual processors can be usefully applied. Mo­ quently than the long-range forces when using the lecular dynamics is one example of such an appli­ multiple time-step technique) remains in scale. This cation. would seem to be a highly complex project involving many different types of chips, and therefore an ex­ Current microprocessor architectures suffer from the pensive approach. well-known "Von Neumann bottleneck": memory access time is measured in hundreds ofprocessor cy­ We next discuss the "cellular" Blue Gene architec­ cles, and the compute units often stall, waiting for ture for achieving a petaflop/s machine, a relatively data to arrive from memory. This leads to increas­ unspecialized approach that seems to be feasible. ingly complex logic to support long execution pipe­ lines and to increasingly complex cache hierarchies High-performance supercomputers are built today to satisfy most memory accesses from smaller, but by connecting together large numbers ofnodes, each faster, cache memory. While caches may take half consisting of a conventional server system (unipro­ of the area of a microprocessor chip, they still do a cessor, or small-scale shared-memory multiproces­ poor job for many scientific codes. Current CMOS sor). The ASCI Blue Pacific IBM sp* system, which (complementary metal-oxide semiconductor) tech­ achieves a performance in excess of 10 tera­ nology, in particular, IBM Blue Logic technology, 40 /second (peak), demonstrates the top perfor­ provides efficient support for embedded DRAM (dy­ mance achievable by such systems. However, it will namic random-access memory) cores on logic chips. be difficult to scale up the performance of such sys­ The Von Neumann bottleneck is minimized if this tems to petaflop/s performance in the near future. technology is used to integrate processor(s) and memory on one chip, thus providing a low-latency, A petaflop/s system built out ofconventional server high-bandwidth processor-memory interface. How­ nodes would consume hundreds of megawatts of ever, while it is quite easy to fit on one chip suffi­ electrical power; it would require many acres ofma­ cient floating-point logic to execute tens of billions chine room floor; it would likely have an unaccept­ offloating-point operations per second (flop/s), it is ably low mean time between failures (an MTBF in not possible to fit on one processor chip much more excess of100 hours is considered remarkable in cur­ than 32 megabytes of memory. The usual "rule of rent high-end supercomputers); it would be exceed­ thumb" of one of memory per flop/s precludes ingly difficult to program and manage; and it would the use of merged DRAM logic for systems with any have an unacceptably high price tag. interesting performance. The door for the creative use of merged DRAM logic opens, once it is under­ There are several reasons for this. Conventional mi­ stood that these rules ofthumb for balanced systems croprocessors are built to execute one sequential in­ apply to servers built to accommodate a broad mix struction stream as fast as possible. Increasingly large of workloads, but may not apply to systems target­ amounts of hardware are used to extract parallel­ ing a narrower range of applications. Preliminary ism from a sequential instruction stream with tech­ analysis indicates that some MD codes can be run ef­ niques such as register renaming, speculative exe­ ficiently using less than one byte per second per 1000 cution, branch prediction, and so on. These flop/so techniques yield diminishing returns: whereas the number of gates in integrated microprocessors has More generally, thinking on algorithm and system increased by three orders of magnitude in the last design is still dominated by the invalid perception two decades, the number ofinstructions executed at that compute units are the most expensive resource each cycle has increased by a factor of ten, at best. in a computing system. Applications are designed to minimize the number of operations required to A simplified microprocessor design leads to higher solve a problem. Systems are designed with enough efficiency and enables many processor cores to fit on memory and I/O to achieve a high usage ofthe com­ a single chip. This has the added advantage thatwires pute units, even if this leads to low memory or disk in the processor core can be kept short, further im­ utilization. In fact, the cost of systems is dominated proving performance, and that it is possible to by storage cost, and by the cost ofthe logic required achieve higher chip manufacturing yields by using to move data around (caches and buses). If one

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 ALLEN ET AL. 321

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. thinks anew from basic principles, the conclusion will it waits for a load to complete from memory, then often be that algorithms that use more computing the shared processor resources can be used by other but less memory, oruse more computing but require thread units. In effect, one has replaced one fast pro­ less communication, coupled with systems designed cessor by multiple slower thread units that are bet­ to ensure more effective use of memory and com- ter matched to the memory latency. This is a worth­ while choice if application-level parallelism is there to be exploited. Assuming a SOO-megahertz clock, each processor can execute up to one gigaflop/s (two floating-point instructions at each cycle).

These considerations dictate the design for the Blue Gene system. The building block for Blue Gene will be a single chip that integrates multiple processors, as just described, memory, and communication logic. This will allow us to build a full system, essentially by replicating this one chip. munication, even at the expense oflower utilization ofthe processing units, are more cost-effective than We have determined that it is possible to fabricate conventional algorithms and systems. in currently available standard cell technology an in­ expensive chip with double-precision scalar perfor­ Conventional high-end systems are clusters ofnodes, mance on the order of 32 gigaflop/s, internal DRAM each controlled by a full . This sig­ capacity around 8 megabytes, and external commu­ nificantly increases (software) error rates, as well as nication bandwidth exceeding 12 gigabytes per sec­ introduces major inefficiencies. The predominant us­ ond. A novel mechanical packaging scheme maps a age mode for such machines is a partition dedicated 32 X 32 X 32 cube of these chips into a system cov­ for a significant period of time to the execution of ering an area ofabout 40 X 40 feet. Power consump­ one parallel application; the efficient execution of tion for a performance ofone petaflop/s is under two such an application requires the tight coordination megawatts. ofall resources in the partition. Yet, each operating system instance allocates resources (processor time The basic programming model offered by a Blue slices and memory) autonomously, with no aware­ Gene system is very similar to that offered by large­ ness of the required coupling with other nodes. In­ scale clusters today. Each chip is a shared memory deed, efficient performance is achieved only when multiprocessor that runs simultaneous multiple the operating system is "out of the way"; many of threads. Threads running on the same chip commu­ the mechanisms and policies of the operating sys­ nicate via shared memory. Communication between tem become obstacles to performance, rather than chips uses message passing. However, the sheer size supporting the right virtual parallel machine abstrac­ of the system will force us to evolve toward higher­ tion. Performance and stability can be significantly level programming models that will be mapped on improved by using at each node a simple kernel that the actual hardware by a combination of compile­ provides a thin layer ofprotection atop the hardware, time and run-time techniques. and by providing global mechanisms and policies for the management of partitions. A major concern on a system of this size is error re­ covery. We expect that the incidence ofsoftware er­ Even though access to on-chip memory can be much rors will be kept low by keeping the resident soft­ faster than to off-chip memory, DRAM access will still ware very simple, and by applying correctness­ require multiple cycles. We use simultaneous mul­ proving techniques to key software subsystems, such tithreading to hide this memory latency. The basic as the communication software. However, the large building block for processors is a thread unit that ex­ number ofcomponents imply that hardware failures ecutes an autonomous instruction stream. Each will be very likely to occur during anyone job run. thread unit has its own register set and its own in­ Furthermore, there will be anon-negligible likeli­ struction sequencer. Multiple thread units within one hood that hardware errors will corrupt the state of processor share more expensive resources, such as a computation without being detected by hardware. the double-precision floating-point units, instruction Software must assume responsibility for error cor­ cache, and memory bus. If one thread unit stalls as rection and, to some extent, for error detection. A

322 ALLEN ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. simple method for detecting errors is to replicate One important research area for the scientific pro­ computation and compare results. However, com­ gram is the study ofvery long timescale phenomena putation mirroring reduces the effective performance within the protein folding process. We would like to by a factor oftwo. Less onerous error detection may use the approximately 100-fold increase in compu­ be possible through use ofapplication-dependent re­ tational power that the hardware potentially offers dundancy, in the computation state, to periodically to realize a similar 100-fold increase in the time­ validate this state. Error recovery will be based on scales probed by the scientific program. the ability to isolate faulty components and to re­ start a computation on the remaining components This requires the ability to use additional hardware from a previously taken checkpoint. threads to increase the number of time steps that can be simulated in a fixed amount oftime on a prob­ The architectural choices outlined in this section lem offixed size. Since the systems most relevant to have their downside; the low amount ofmemory re­ the science program contain approximately 30000 stricts the range of applications that can exploit this atoms, the ability to use more hardware threads than architecture. Compatibility with existing software is there are particles in the simulation is required, if not guaranteed. We believe that an improvement of one wishes to run on the full unpartitioned machine. two orders of magnitude in compute , or compute performance per gate count, Existing approaches. For concreteness, this discus­ is well worth this trade-off. At the very high end this sion focuses on approaches aimed at molecular might be the only practical way of approaching dynamics. An extensive review of atomistic simula­ petaflop/s performance levels. tion techniques on parallel machines has recently ap­ peared 42 that provides a broader survey than that Application issues provided here. We provide a briefdescription ofcur­ rent approaches to decomposing the problem on a Given the nature of the architecture just described, parallel machine and also some ofthe currently pop­ it is clear that a number ofchallenges must be faced ular algorithmic alternatives. in crafting an application to execute the science pro­ gram on the target machine platform. At the high­ In the process ofintegration ofthe equation ofmo­ est level, a major challenge is to understand how to tion required by molecular dynamics, the integration extract the maximal amount ofconcurrency from the itself occupies negligible time, since it is of order N application. The scientific goals of the project may in anN-atom system, and involves simple arithmetic. require simulation of a fixed-size system for a very The various contributions to the potential energy of large number of time steps. the system are shown in Figure 6, and the evalua­ tion ofthe corresponding force components on each In the following discussion, some ofthe known strat­ particle dominates the computation. egies for mapping such simulations onto very large numbers of nodes are described, along with the is­ There are only O(N) bonded force evaluations and sues connectedwith these strategies. A high-level de­ they typically consume anywhere from one to 20 per­ scription of selected algorithmic alternatives avail­ cent of the total force computation in a uniproces­ able for improving the efficiency of the dominant sor environment. 25 This variation is due to bothvari­ computational burden is provided; we end with a ation in system type and to use ofmultiple time-step brief survey of some areas of exploration. integration techniques like RESPA20 that can change the relative frequencies atwhich various force terms Scalability challenge. With an immense number are evaluated. The force computation is dominated (more than 30000) ofnodes, an even larger number by the nonbonded interactions, comprising the LJ and of CPUs (approximately 1000000), and a yet larger Coulomb contributions. The LJ forces go out to large number of hardware thread contexts (more than numbers ofnear neighbors, whereas the long-ranged 5000000), the challenge of mapping a fixed-size N­ Coulomb forces created by one charge are felt by body simulation onto such a massively parallel ma­ every other charge in the system. Calculation ofthe chine is considerable. 41 It is most common to define long-range forces is the most expensive part of the scalability in terms of the ability to employ greater MD calculation. numbers of nodes to solve larger problems, where "larger" in this case refers to the number of parti­ How can the immense parallelism of the machine cles' N. be exploited to distribute this computational burden

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 ALLEN ET AL. 323

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. Figure 8 The vertical axis represents the rate at which 1. Atom decomposition binds the computation ofall time steps are computed. With perfect scalability, the forces on a single particle to a particular hard­ the data would fall on a line with slope equal to ware thread. one. The behavior shown here is typical of a 2. Force decomposition binds the computation of a parallel molecular dynamics code; scalability is particular force term (usually a two-body force) better for systems with large numbers of particles. to a particular hardware thread. 3. Spatial decomposition binds the computation of all the forces on particles within a volume element to a particular hardware thread. Particles may "migrate" from hardware thread to hardware 0 0 0 0 000 thread during the course of the simulation. 00 00 0 0 00 The communication expense of these three ap­ 0 0 43 0 0 proaches has been reviewed by Plimpton with re­ 0 0 gard to molecular dynamics simulations with short­ 0 0 ranged interactions. 0 0 0-0 Spatial decomposition is most efficient if the cutoff 0 0 0-0 radius for the real space force sums is relatively small, but there is heightened interest in using Ewald 14,25 methods in simulations of biomolecular systems to take into account long-ranged Coulombic forces via periodic boundary conditions. The use of this tech­ nique imposes additional considerations and com­ plicates the issue of choosing the most appropriate approach for problem decomposition.

The Ewald technique expresses the slowly conver­ gent Coulomb sum in terms of two rapidly conver­ gent series, one in real space and one in Fourier (k-) space. The optimal complexity of the Ewald Cou­ 0 3 2 0 lomb sum is O(N / ), i.e., each charge can be thought 1 2 0 0 of as interacting with only O(N / ) other charges. 0 0 Implementation ofthe Fourier transform using fast 0 0 0 Fourier transform (FFT) techniques 44 enables a speed up to O(N log N); due to the communication de­ 0-0 mands ofthe FFT, this approach presents challenges 0-0 in a massively parallel environment and may raise the crossover point in particle number at which the FFT-based methods become more efficient than a standard Ewald calculation.

A different approach, with a computational complex­ ity of O(N), uses multipole expansions to evaluate the Coulomb interaction. 45 But the crossover in N at which the multipole approach becomes the most efficient is typically at higher N than for the FFT ap­ proach. The optimal choice ofalgorithm depends on system size and efficiency of implementation on a particular hardware platform. It should be noted that without running into communications problems and the effective system size, N, is the number ofcharges, problems with the machine's limited memory? In a not the number of atoms, and can be influenced by parallel environment, there are three basic approach­ the choices made in modeling the system. For the es: TIP5p 23 model of water, five charge centers per wa-

324 ALLEN ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. ter molecule are required as opposed to the three Summary charge centers required by other models, such as TIP3p 23 or SPCIE. 24 The Blue Gene project represents a unique oppor­ tunity to explore novel research into a number of A feeling for the "state of the art" in mapping mo­ areas, including machine architecture, programming lecular dynamics codes onto parallel machines is pro­ models, algorithmic techniques, and biomolecular vided in Figure 8. 46,47 It is important to remember simulation science. As we discussed, every aspect of that these codes are addressing the challenge ofmap­ this highly adventurous project involves significant ping macromolecular system simulations containing challenges. Carrying out our planned program will bonded as well as nonbonded interactions onto mas­ require a collaborative effort across many disciplines sively parallel machines. One of these studies notes and the involvement ofthe worldwide scientific and that scalability begins to fall off when the average technical community. In particular, the scientific pro­ number of atoms per hardware thread drops below gram will engage with the life sciences community 0(10).46 It is clear that there is a considerable moun­ in order to make best use of this unique computa­ tain to climb in the course of mapping a single mo­ tional resource. lecular dynamics simulation ofmodest size onto ma­ *Trademark or registered trademark of International Business chine architectures with hardware thread counts Machines Corporation. much larger than one thousand.

Areas for exploration. There are a number of areas Cited references and note to be explored that might allow one to overcome or 1. A. R. Fersht, Structure and Mechanism in Protein Science: A evade the scalability challenge just described: Guide to Enzyme Catalysis and Protein Folding, W. H. Free­ man, (1998). • Improved integrators that might allow use oflarger 2. J. Beetem, M. Denneau, and D. Weingarten, "The GF11 Par­ allel Computer," Experimental Architec­ time steps tures, J. J. Dongarra, Editor, North Holland Publishing Co., • Improved implementations and algorithms for Amsterdam (1987). long-range force evaluation. It is possible that ex­ 3. D. H. Weingarten, "Quarks by Computer," Scientific Amer­ isting statements about the suitability of specific ican 274, No.2, 104-108 (1996). algorithms for specific problems based on prob­ 4. W. Andreoni, A. Curioni, and T. Mordasini, "DFT-Based Mo­ lecular Dynamics as a New Tool for Computational Biology: lem size may have to be modified to take into ac­ First Applications and Perspective," IBMJournal ofResearch count the suitability of those algorithms for map­ and Development 45, Nos. 3 and 4 (2001). ping onto a massively parallel machine. 5. Q. Zhong, T. Husslein, P. B. Moore, D. M. Newns, P. C. • Improved techniques for calculating thermody­ Pattnaik, and M. L. Klein, FEBS Letters 434, No.3, 265-271 (1998). namic ensemble averages. Use of ensemble-aver­ 6. J. Collinge, "Variant Creutzfeldt-Jakob Disease," The Lan­ aging techniques such as Monte Carlo may allow cet 354, No. 9175, 317-323 (1999). more efficient utilization ofthe machine than con­ 7. E. Strickland, B.-H. Qu, and P. J. Thomas, "Cystic Fibrosis: ventional molecular dynamics in calculating ther­ A Disease of Altered Protein Folding," Journal ofBioener­ modynamic ensemble averages. Also, free-energy getics and Biomembranes 29, 483-490 (1997). 8. C. Branden and J. Tooze, Introduction to Protein Structure, calculation techniques have an embarrassingly par­ 2nd Edition, Garland Inc., New York (1999), and references allel component that can utilize a partitioned set therein. of hardware threads. 9. J. Moult, T. Hubbard, K. Fidelis, and J. T. Pedersen, "Crit­ • Novel methods for studying long-time dynamics ical Assessment of Methods of Protein Structure Prediction (CASP): Round III," Proteins: Structure, Function, and Ge­ in biomolecular systems that may circumvent ex­ netics 37, Supplement 3, 2-6 (1999). isting scalability issues 32,33 10. C. Levinthal, "Are There Pathways for Protein Folding?"Jour­ nal de Chimie Physique 65, 44-45 (1968). In order to explore algorithmic and decomposition 11. W. Jin, J. Onuchic, and P. Wolynes, "Statistics ofKinetic Path­ alternatives, it is essential to have a good set ofsim­ ways on Biased Rough Energy Landscapes with Applications to Protein Folding," Physical Review Letters 76, 4861-4864 ulation environments and tools to support applica­ (1996). tion development and tuning. Modeling the highly 12. F. B. Sheinerman and C. L. Brooks, "Molecular Picture of threaded machine architecture currently being Folding of a Small Alpha/Beta Protein," Proceedings ofthe planned represents a considerable challenge. In or­ National Academy of Sciences (USA) 95, No.4, 1562-1567 (1998). der to achieve performance on such platforms, a non­ 13. V. Munoz, P. A. Thompson, J. Hofrichter, and W. A. Eaton, intrusive run-time environment with extremely low "Folding Dynamics and Mechanism ofBeta-Hairpin Forma­ overhead will also be needed. 48 tion," Nature 390, 196-198 (1997).

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 ALLEN ET AL. 325

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. 14. D. Frenkel and B. Smit, Understanding Molecular Simulation, 32. V. Daggett, "Long Time-Scale Simulations," Current Opin­ Academic Press, San Diego, CA (1996). ion in Structural Biology 10, 160-164 (2000). 15. S. E. Jackson, "How Do Small Single-Domain Proteins Fold?" 33. A. F. Voter, "Parallel Replica Method for Dynamics of In­ Folding & Design 3, R81-R91 (August 1998). frequent Events," PhysicalReview B 57, No. 22, 13985-13988 16. V. S. Pande, A. Y. Grosberg, T. Tanaka, and D. S. Rokhsar, (1998). "Pathways for Protein Folding: Is a 'New View' Needed?" 34. V. S. Pande, private communication (2000). Current Opinion in Structural Biology 8, 68-79 (1998). 35. R. Olender and R. Elber, "Calculation of Classical Trajec­ 17. D. Thirumalai and D. K. Klimov, "Deciphering the Time­ tories with a Very Large Time Step: Formalism and Numer­ scales and Mechanisms of Protein Folding Using Minimal ical Examples," Journal of Chemical Physics 105, No. 643, Off-Lattice Models," Current Opinion in Structural Biology 9, 9299-9315 (1996). No.2, 197-207 (1999). 36. P. G. Bolhuis, C. Dellago, P. L. Geissler, and D. Chandler, 18. Y. Duan and P. A. Kollman, "Pathways to a Protein Folding "Transition Path Sampling: Throwing Ropes over Mountains Intermediate Observed in a I-Microsecond Simulation in in the Dark," Journal ofPhysics: Condensed Matter 12, No. Aqueous Solution," Science 282, No. 5389, 740-744 (1998). 8A, 147-152 (2000). 19. W. C. Swope, H. C. Andersen, P. H. Berens, and K. R. Wil­ 37. F. B. Sheinerman and C. L. Brooks, "Calculations on Fold­ son, "A Computer Simulation Method for the Calculation ing ofSegment b1 ofStreptococcal Protein g," Journal ofMo­ ofEquilibrium Constants for the Formation ofPhysical Clus­ lecular Biology 278, No.2, 439-456 (1998). ters ofMolecules: Application to Small Water Clusters,"Jour­ 38. T. Fukushige, M. Taiji, J. Makino, T. Ebisuzaki, and D. Su­ nal of Chemical Physics 76, 637-649 (1982). gimoto, "A Highly Parallelized Special-Purpose Computer 20. M. Tuckerman, B. J. Berne, and G. J. Martyna, "Reversible for Many-Body Simulations with an Arbitrary Central Force: Multiple Time Scale Molecular Dynamics,"Journal ofChem­ MD:GRAPE," Astrophysical Journal 468, 51-61 (1996). 2 ical Physics 97, No.3, 1990-2001 (August 1992). 39. This notation indicates order of complexity. Here O(N ) 21. J. P. Ryckaert, G. Ciccotti, and H. J. C. Berendsen, "Numer­ means that as N increases, the increase in time required for 2 ical Integration of the Cartesian Equations of Motion of a the computation is proportional to N • System with Constraints: Molecular Dynamics ofn-Alkanes," 40. See http://www.chips.ibm.com/products/asics/products/edram. Journal of Computational Physics 23, 327-341 (1977). 41. V. E. Taylor, R. L. Stevens, and K. E. Arnold, "Parallel Mo­ 22. H. C. Andersen, "RATTLE: A Velocity Version of the lecular Dynamics: Implications for Massively Parallel Ma­ SHAKE Algorithm for Molecular Dynamics," Journal of chines," Journal ofParallel andDistributed Computing 45, No. Computational Physics 52, 24-34 (1983). 2,166-175 (September 1997). 23. M. W. Mahoney and W. L. Jorgensen, "A Five-Site Model 42. G. S. Heffelfinger, "Parallel Atomistic Simulations," Com­ for Liquid Water and the Reproduction ofthe Density Anom­ puterPhysics Communications 128, Nos. 1-2,219-237 (June aly by Rigid, Nonpolarizable Potential Functions," Journal 2000). 43. S. Plimpton, "Fast Parallel Algorithms for Short-Range Mo­ of Chemical Physics 112, No. 20, 8910-8922 (2000). lecular Dynamics,"Journal ofComputationalPhysics 117, No. 24. H. J. C. Berendsen, J. R. Grigera, and T. P. Straatsma, Jour­ 1, 1-19 (March 1995). nal ofPhysical Chemistry 91, 6269 (1987). 44. R. W. Hockney and J. W. Eastwood, Computer Simulation Us­ 25. M. P. Allen and D. J. Tildesley, Computer Simulation ofLiq­ ingParticles, Institute ofPhysics Publishing, Bristol, UK (1988). uids, Oxford University Press, New York (1989). 45. L. Greengard and V. Rokhlin, "A Fast Algorithm for Par­ 26. A. D. MacKerell, D. Bashford, M. Bellott, R. L. Dunbrack, ticle Simulations," Journal ofComputationalPhysics 73, 325­ J. D. Evanseck, M. J. Field, S. Fischer, J. Gao, H. Guo, 348 (1987). S. Ha, D. Joseph-McCarthy, L. Kuchnir, K. Kuczera, F. T. K. 46. R. K. Brunner, J. C. Phillips, and L. V. Kale, "Scalable Mo­ Lau, C. Mattos, S. Michnick, T. Ngo, D. T. Nguyen, B. Prod­ lecular Dynamics for Large Biomolecular Systems," Super­ hom, W. E. Reiher, B. Roux, M. Schlenkrich, J. C. Smith, computing 2000 Proceedings, Dallas, TX (November 4-10, R. Stote, J. Straub, M. Watanabe, J. Wiorkiewicz-Kuczera, 2000), available at http://www.sc2000.org/proceedings/ D. Yin, and M. Karplus, "All-Atom Empirical Potential for techpapr/papers/pap271.pdf. Molecular Modeling and Dynamics Studies ofProteins,"Jour­ 47. S. Plimpton, R. Pollock, and M. Stevens, "Particle-Mesh nal ofPhysical Chemistry B 102, 3586-3616 (1998). Ewald and rRESPA for Parallel Molecular Dynamics Sim­ 27. W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, K. M. ulations," Proceedings ofthe Eighth SIAMConference on Par­ Merz, D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Cald­ allel Processing for Scientific Computing, Minneapolis, MN well, and P. A. Kollman, "A Second Generation Force Field (March 14-17, 1997), pp. 8-21. for the Simulation of Proteins and Nucleic Acids," Journal 48. B. G. Fitch and M. E. Giampapa, "The Vulcan Operating of the American Chemical Society 117, 5179-5197 (1995). Environment: A Brief Overview and Status Report," Paral­ 28. W. F. van Gunsteren, X. Daura, and A. E. Mark, "GROMOS lel Supercomputing in Atmospheric Science, G.-R. Hoffman Force Field," Encyclopaedia ofComputational Chemistry, Vol­ and T. Kauranne, Editors, World Scientific Publishing Co., ume 2 (1998). Inc., Riveredge, NJ (1993), p. 130. 29. G. Kaminski, R. A. Friesner, J. Tirado-Rives, and W. L. Jor­ gensen, "Evaluation and Improvement of the OPLS-AA Force Field for Proteinsvia Comparisonwith Accurate Quan­ Accepted for publication March 8, 2001. tum Chemical Calculations on Peptides," submitted to Jour­ nal of Computational Chemistry (2001). IBM Blue Gene team IBMResearch Division, Thomas1 30. J. L. Banks, G. A. Kaminski, R. Zhou, D. T. Mainz, B. J. Research Center, P. O. Box218, Yorktown Heights, New York 10598 Berne, and R. A. Friesner, "Parameterizing a Polarizable (electronic mail: rgermain@us..com). The Blue Gene team Force Field from ab initio Data: I. The Fluctuating Point spans a wide range oftechnical disciplines and organizations within Charge Model," Journal of Chemical Physics 110,741-754 IBM Research. Blue Gene team members are listed below, (1999). grouped by their contributions to the areas ofhardware, systems 31. J. Ponder, private communication. software, and application/science.

326 ALLEN ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply. Hardware: Ruud Haring is the manager of the Blue Gene Sys­ tems Development group and Monty Denneau is the chief ar­ chitect ofthe Blue Gene chip and system. Otherpast and present contributors to the hardware effort are: Daniel K. Beece, Arthur A. Bright, Paul Coteus, Bruce Fleischer, Christos J. Georgiou, Peter H. Hochschild, Kiran K. Maturu, Robert Philhower, Tho­ mas Picunko, Rick A. Rand, Valentina Salapura, Rahul S. Shah, Sarabjeet Singh, Richard Swetz, Nagesh K. Vishnumurthy, and Henry S. Warren, Jr. Systems software: Manish Gupta is the manager ofthe High Per­ formance and Cellular Programming Environments group. Jose Moreira is the manager ofthe Blue Gene System Software group. Other past and present contributors to the systems software ef­ fort are: , George AImasi, Jose Brunheroto, Calin Cascaval, Jose Castanos, Paul Crumley, Wilm Donath, Maria Eleftheriou, Mark Giampapa, Howard Ho, Derek Lieber, Matthew Newton, Aida Sanomiya, and Marc Snir. Application and science: Robert S. Germain is the manager of the Biomolecular Dynamics and Scalable Modeling group and Blake G. Fitch is the application architect. Otherpast and present contributors to the science and application effort include: Wanda Andreoni, Bruce J. Berne, Alessandro Curioni, Donna Gresh, Susan Hummel, Tiziana Jonas, Glenn Martyna, Dennis Newns, Jed Pitera, Michael Pitman, Ajay Royyuru, Yuk Sham, Frank Suits, William Swope, T. J. Christopher Ward, and Ruhong Zhou.

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 ALLEN ET AL. 327

Authorized licensed use limited to: IBM CORPORATE. Downloaded on August 17,2010 at 20:52:17 UTC from IEEE Xplore. Restrictions apply.