Protein Structure Prediction and Potential Energy Landscape Analysis using Continuous Global Minimization

KEN A. DILL Departmentof Pharmaceutical Chemistry, University of California at San Francisco, San Francisco, CA 94118

ANDREW T. PHILLIPS ComputerScience Department, United StatesNaval Academy,Annapolis, MD 21402

J. BEN ROSEN Computer Science and Engineering Department, University of California at San Diego, San Diego, CA 92093

INTRODUCTION form of energy minimization technique, such as simu- lated annealing and genetic algorithms, rather than the Proteins require specific three-dimensional conforma- computationally more efficient continuous minimiza- tions to function properly. These “native” conforma- tion techniques. Each such method correctly predicts a tions result primarily from intramolecular interactions few protein structures, but missesmany others, and the between the atoms in the macromolecule, and also process is both slow and limited to structures of rela- intermolecular interactions between the macromole- tively small size. cule and the surrounding solvent. Although the folding The purpose of this paper is to show how one can process can be quite complex, the instructions guiding apply more efficient continuous minimization tech- this process are specified by the one-dimensional pri- niques to the energy minimization problem by using mary sequenceof the protein or nucleic acid: external an accurate continuous approximation to the discrete factors, such as helper (chaperone) proteins, present at information provided for known protein structures. In the time of folding have no effect on the final state of addition, we will show how the results of one particu- the protein. Many denatured proteins spontaneously lar computational method for predic- refold into functional conformations once denaturing tion (the CGU algorithm), which is based on this conditions are removed. Indeed, the existence of a continuous minimization technique, can be used both unique native conformation, in which residues distant to accurately determine the global minimum of poten- in sequence but close in proximity exhibit a densely tial energy function and also to offer a quantitative packed hydrophobic core, suggests that this three- analysis of all of the local (and global) minima on the dimensional structure is largely encoded within the energy landscape. sequential arrangement of these specific amino acids. Computational results using a continuous global In any case, the native structure is often the conforma- minimization technique on small to medium size pro- tion at the global minimum energy (see [ 11). tein structures has already proven successful. This In addition to the unique native (minimum energy) method has been extensively tested on a variety of structure, other less stable structures exist as well, computational platforms including the Intel Paragon, each with a corresponding potential energy. These Cray T3D, an 8 workstation Dee Alpha cluster, and a structures, in conjunction with the native structure, heterogeneousnetwork of 13 Sun SparcStations and 7 make up an energy landscape that can be used to char- SGI Indys. Protein structures with as many as 46 resi- acterize various aspectsof the protein structure. dues have been computed in under 40 hours on the 20 Over 20 years of research into this “ workstation heterogeneousnetwork. problem” has resulted in numerous important algo- rithms that aim to predict native three-dimensional THE POLYPEPTIDE MODEL AND POTENTIAL ENERGY protein structures (see [2], [3], [4], [8], [9], [lo], [12], FUNCTION [14], [15], [18], [19], [20], [21], and [22]). Suchmeth- ods assumethat the native structure is a balance of var- Since computational search methods are not yet fast ious interactions. These methods invariably use some enough to find global optima in real-space representa-

109 K. Dill et al. tions using accurate all-atom models and potential neighborhood of an experimentally determined value. functions, a practical conformational search strategy Among these are the 3n-1 backbone bond lengths 1 requires both a simplified, yet sufficiently realistic, between the pairs of consecutive atoms N-C’, C’-C,, molecular model with an associated potential energy and C,-N. Also, the 3n-2 backbone bond angles 8 function which consists of the dominant forces defined by N-&-C’, C&‘-N, and C/-N-C, are also involved in protein folding, and also a global optimiza- fixed at their ideal values. Finally, the n-l bond tion method which takes full advantage of any special dihedral angles o are fixed in the trans (1 SO’) confor- properties of this kind of energy function. mation. This leaves only the n-l backbone dihedral Each residue in the primary sequenceof a protein is angle pairs (cp,yr)in the reduced representation model. characterized by its backbone components NH-C,H- These also are not completely independent; in fact, C’O and one of 20 possible sidechains they are severely constrained by known chemical data attached to the central C, atom. The three-dimensional (the Ramachandranplot) for each of the 20 amino acid structure of macromolecules is determined by internal residues. Furthermore, since the atoms from one C, to molecular coordinates consisting of bond lengths 1 the next C, along the backbone can be grouped into (defined by every pair of consecutive backbone rigid planar peptide units, there are no extra parame- atoms), bond angles 8 (defined by every three consecu- ters required to express the three-dimensional position tive backbone atoms), and the backbone dihedral of the attached 0 and H peptide atoms. Hence, these angles cp,v, and o, wherecp gives the position of C’ rel- bond lengths and bond angles are also known and ative to the previous three consecutive backbone atoms fixed. C-N-C,, v gives the position of N relative to the pre- A key element of this simplified polypeptide vious three consecutive backbone atoms N-Cc-C’, and model is that each sidechain is classified as either

110 Protein Structure Prediction and Potential Energy Landscape Analysis

Rose [ 161. Both groups have demonstrated that this type of potential function is sufficient to accurately model the forces which are most responsible for fold- ing proteins. The specific potential function used ini- tially in this study is a simple modification of the Sun/ Thomas/Dill energy function and has the following form:

E total = Eex + Ehp + Ebb + Eqny where E,, is the steric repulsive term which rejects any Figure 2 Combined Potential Function conformation that would permit unreasonably small Energy Terms E, + E,,p interatomic distances, Ehp is the contact energy term the most favorable hydrogen bond is one in which favoring pairwise H-H residues, EM is the contact energy term favoring pairwise hydrogen bonding, and all of the bonded pairs C/O, O/H, and H/N lie E,,,, is the main chain torsional term that allows only along a straight line, since the energies between those (cp,yr)pairs which are permitted by the Ram- the pairs C’/H and O/N are repulsive while the achandran maps. In particular, the excluded volume energiesbetween C/N and O/H are attractive (see energy term E, and the hydrophobic interaction Figure 3). If for the #i pair of carbonyl and amide energy term Ehp are defined in this case as follows:

Cl Eex = C 9 and 1.0 + eXp((dij - d,ff j/d,) Figure 3 Schematic Illustration of Most ij Favorable Hydrogen Bond Alignment to be the distancebetween Ehp E C Eiif(dij> where groups, we define d(‘)ii 2)aato be the distance [i-J1 >2 ~t~~e~dC~ zltr$j)f ti be he distance r between 0 and H, and dt jii to be the distance .fCdij) = 1.0 + exp((~j- d,)/dt) ’ between the 0 and N, then the energy term Ebb can be representedas follows (with an additional The excluded volume term Eex is a soft sigmoidal steric repulsive energy term, not shown, between potential where dii is the interatomic distance between the O/H pair as well): two C, atoms or between two sidechain center of mass 4 atoms C,, &,, determines the rate of decrease of E,,, E and de8 determines the midpoint of the function (i.e. hb = ‘3x . c Eijky where the function equals l/2 of its maximum value). Vk=l Similarly, the hydrophobic interaction energy term Ehp where Egl 3: QcQH I d(‘$, Eij2 = Q@, I dop Ed3 = is a short ranged soft sigmoidal potential where dti rep- QoQH I d(3)+ and E64 = Q~QN I d( ij and where Qc resents the interatomic distance between two sidechain = +OS, QH = +0.3, Qo = -0.5, and QN = -0.3 are the centroids C,, and do and d, represent the rate of charges on the four participating atoms. This energy decrease and the midpoint of Ehp, respectively. The term is computed for all pairs ij of backbone amide hydrophobic interaction coefficient Ed = -1 .O when and carbonyl groups. both residues i and j are hydrophobic, and is set to 0 The final term in the potential energy function, otherwise. Figure 2 shows the combined effect of the Eq,,,, is the torsional penalty term allowing only “real- energy terms E, + Ehp for a pair of H-H residues. istic” (9,~) pairs in each conformation. That is, since cp The contact energy term Ebb represents the and v refer to rotations of two rigid peptide units attraction between backbone amides and carbon- around the same C, atom (see Figure I), most combi- yls participating in a hydrogen bond. To deter- nations produce steric collisions either between atoms mine this energy of attraction, it is assumed that in different peptide groups or between a peptide unit

111 K. Dill et al. and the side chain attached to C, (except for ). Hence, only certain specific combinations of (9,~) _- pairs are actually observed in practice, and are often conveyed via the Ramachandran plot, such as the one in Figure 4, and the cp-yrsearch space is therefore very 1

.’. .T. - - R2

P

Figure 5 Approximating Ramachandran Data by Ellipsoids

50 100 150 0 if (cp,v) E Ri for some i cp (1) qp,= Figure 4 Typical Ramachandran Plot B otherwise much restricted. where l3 is some large constant penalty. To obtain such Use of the Ramachandran plot in determining pro- an energy term, we first represent the i* ellipsoid Ri by tein structure is essential in order to restrict the (p-v a quadratic function 4i(Cp,~)which is positive definite plane to those regions observed for actual protein mol- (both eigenvaluespositive) and satisfies qi(o,y) = 0 on ecules. While most other global optimization methods the boundary of the ellipsoid Ri, qi(o,W) < 0 in the inte- for protein structure prediction also depend on this rior of Rip and qi((PW)> 0 in the exterior. By simply property, they are limited to some variant of simulated constructing a sigmoidal penalty term of the form annealing (or any other method based on random sam- pling rather than gradient information) to perform the optimization due to the lack of a smooth, i.e. differen- tiable, representation for EVV This, in , results in a very slow local minimization process. In contrast, our i-l approach, first presented in [7], is to model the Ram- achandran data by a smooth function which will have where the constants n > 0 determine the rate by which the approximate value zero in any permitted region, E,, approaches0 or l3near an ellipsoid boundary, then and a large positive value in all excluded regions. it is easy to see that ET,,, z 0 in the ellipsoid’s interior, A key observation in the construction of the func- and Eq,,, 3 l3 at distant exterior points, thus satisfying tion Eq,,, is that the set of allowable (cp,yr)pairs form (1). Figure 6 shows ET,,, as a function Of qi((p,~) for a compact clusters in the o-v plane. By enclosing each single ellipsoid with the values of l3= 1 and y1 = 5. Fig- such cluster in an appropriately constructed ellipsoid, ure 7 illustrates the three-dimensional plot of Eq,,, (for we may use the ellipsoids to define the energy term l3 = 1, and all yi = 100) for the data provided in Figure E cpV In particular, given p regions (ellipsoids) R,, 4. This differentiable representation for Eqv is crucial Rzr..., R,, containing the experimentally allowable for applying a continuous minimization technique to (cp,~) pairs (see Figure 5), we desire the energy term the problem of computing the minimum potential E,, to satisfy energy, and its use in the CGU algorithm, described next, is a major reason for the successof this method.

112 Protein Structure Prediction and Potential Energy Landscape Analysis

mum possible amount in the discrete Lt norm (see Figure 8). The use of such an underestimating function

Figure 8 The Convex Global Underestimator (CGU) / r -4 -* I 4 eP9v) * allows the replacement of a very complex function by Figure. 6 Sigmoidal Penalty Term E,,,, for p = 1 a simple convex underestimator. For simplicity of andyI = 5 notation, we define the differentiable function F(+) = E t0tal,where $ E R7 (z = 2n-2 representsthe number of backbone dihedral angles cpand v), and where F(G) is assumedto have many local minima. To begin the iterative process, a set of K 2 2s+l dis- tinct local minima @), for j-l,...& are computed and a convex quadratic underestimator function Y(o) is then fitted to these local minima so that it underesti- mates all the local minima, and normally interpolates F(@)) at 22+1 points (see Figure 8). This is accom- plished by determining the coefficients in the function Y($) so that

for j=l,...,K, and where ’ 6 .is minimized. That is, the difference between3J ($) ‘anA Y(+) is minimized in the discrete Lr norm over the set of K local minima @), j=l,...,K. The underestimating function Y($) is given by

Figure 7 Three-Dimensional Plot of E,, (for fi - 1.0) Corresponding to the Ramachandran Data in Figure 5 (3) w40 = cO + i (‘[(Pi + ~di@~). i-l

THE CGU GLOBAL OPTIMIZATION ALGORITHM Convexity of this quadratic function is guaranteed by requiring that di 2 0 for i-1 ,...,z. Other linear com- One computational method for finding the global min- binations of convex functions could also be used, but imum of the polypeptide’s potential energy function is the coefficients ci and di of this particular quadratic to use a global underestimator to localize the search in function provide various useful information related to the region of the global minimum. This CGU (convex the energy landscape. global underestimator) method, first described in [13] The convex quadratic underestimating function and subsequently applied successfully to the molecu- Y(+) determined by the values c E R’+’ and d E Rt lar model and potential energy function described ear- provides a global approximation to the local minima lier (see [6] and [7]), is designed to fit all known local of F(G), and its easily computed global minimum point minima with a convex function which underestimates hill is given by (~min)i= -ci/di, i=l,...,r, with corre- all of them, but which differs from them by the mini- sponding function value Yy,, given by

113 K. Dill et al.

P($“‘> - e-(E’-E”)‘kT (4) Y min - co - i c,2/(2di). (5) N -(Ei - E,)/RT i-l e c The value Yti is a good candidate for an approxi- i-0 mation to the global minimum of the correct energy where Ej E F(@) is the energy of the conformation function F(Q), and so $ti can be used as an initial Q(i),E. is the energy of the global minimum conforma- starting point around which additional configurations tion q(O),kT is Boltzmann’s constant multiplied by (i.e. local minima) can be generated. That is, the mini- temperature, and N 2 K is the total number of confor- mum of this underestimator is used to predict the glo- mations that are local minima of F(q). Thus, stable bal minimum for the true potential function F(o), conformations are less likely to exist in higher energy allowing a more localized conformer search to be per- states than in lower energy ones. formed based on the predicted minimum (see Figures Note that the CGU method will find only K of the N 9 and 10). A new set of conformers generated by the total local minima of F($), and that K <> Eo, so that their effect on the total sum in (5) is negligi- WP ble. If ($“‘)i represents the i* angle in the jti conforma- tion, then the weighted mean of the ith angle is given Figure 9 A Reduced Search Domain bY

j-0 and the corresponding mean square deviation in <+i> is given by Figure 10 The New CGU Over the Reduced Search Domain j-0 localized search then servesas a basis for another qua- dratic underestimation over a sufficiently reduced Thus a small mean square deviation demonstrates the space. After several repetitions, the reduced space will increased reliability of e&i>. Also a small mean square contain only a single conformation so that the convex deviation should give <$i> = (+‘O’)i.If all such mean underestimator prediction agrees exactly with the best square deviations are small, then the computed global known local minimum (see [ 131 for specific details). minimum angles (+‘O’>ishould give a good approxima- This conformation is thus the conformation with mini- tion to the true native conformation. mum energy among all conformations found by the If the final convex underestimator Y(@)agrees with algorithm, and so it is declared to be the best approxi- the global minimum potential energy E. at the com- mation to the true global minimum conformation. puted global minimum conformation e(O)(this condi- tion can be easily assured by requiring that ci = -di THE NATIVE STRUCTURE AND ITS FLUCTUATIONS ($‘O’)ifor all i = l,...,~; that is, the gradient of the con- vex underestimator vanishes at the computed global The CGU method gives a very simple expression for minimum), then E. can be expressed,using (4), as the fluctuations around the native structure. The proba- bility P(@)) of finding a molecule in a conformation I$) is given by the Boltzmann distribution law

114 Protein Structure Prediction and Potential Energy Lmdscape Analysis

COMPUTATIONAL SUMMARY E, = Ymin = co - ~ c?/(2di). (6) Using the chain representation, energy function, and i-l CGU search algorithm described above, we used eight Furthermore, with a suitable assumption’, combining protein sequences as test cases (met-enkephalin, (3) and (6) shows that the convex underestimator bradykinin, oxytocin, mellitin, zinc-finger, avian pan- energy Y(g) of any conformation $ is related to the creatic polypeptide, crambin, and BBAl [ 171, a 23- computed global minimum energy EO by residue B@r motif). While the CGU algorithm has been extensively tested on a variety of high perfor- mance platforms including the Intel Paragon, the Cray w44 - E, - i i di[~i- (Q’O’)i] T3D, an 8 workstation Dee Alpha cluster, and a heter- i-l ogeneous network of 13 Sun SparcStations and 7 SGI Indys, the results presented here represent those Now, if $(I) denotes a conformation with all angles $9 obtained from the heterogeneous network of Suns and except for $1,fixed at their respective global minimum SGIs using MPI (Message Passing Interface) for the values (+‘O’>i,then the energy difference directly attrib- interprocess communication. The total time for solu- uted to any $l is clearly tion and the computed global minimum potential energy for each structure is given in Table 1. Computa- tions on the Cray T3D are in progress and are expected to result in substantial reductions in computing time. The two principal results of this paper are: (1) to Finally, by applying (7) to (5), the Boltzmann distribu- show that starting from many different randomly cho- tion of angle $1 is then proportional to (ignoring the sen open starting conformations of the chain, the CGU denominator in (5)) method converges on the same structure in each case, suggesting that the method is probably reaching the global minimum of the energy function, and (2) to show the scaling of the solution time with the chain length (see Table l), indicating that the method seems where & is the mean, and al2 is the variance. There- practical for small protein-sized molecules. Table 2 fore, we can interpret (+(“))I as the mean value of $, shows that the simulations are dominated by a single and kT/dl as the variance of +l obtained directly from stable state for chains longer than about 20 residues. the convex global underestimator. Note that a large This paper is a test of a conformational search strat- value of dl implies a small variance in the angle $. egy, not an energy function. The energy function is not Also a high temperature T, as well as a small value of yet an accurate model of real proteins: the best com- dl, implies a large variance in the angle, as expected. puted structures differ from the true native structures. The standard deviation of $ is But similarly simple energy functions have begun to show value in predicting protein structures ([ 161, [ 181, 63) q = (kT/d,)“2. and [22]). Therefore we believe improved energy functions used in conjunction with the CGU search Note that this result depends on the property that the method may be useful in protein folding algorithms. CGU algorithm computes a large set of local minima, in addition to the global minimum. ACKNOWLEDGMENTS

K.A. Dill and A.T. Phillips were supported by the NSF grant BIR-9119575, and J.B. Rosen was supported by the ARPA/NIST grant 6ONANB4D1615 and NSF 1. For each conformation I$), the CGU function value W(#)) grants CCR-9509085 and CCR-952715 1. The authors matches the corresponding potential energy function Ej = also wish to thank the San Diego Supercomputer Cen- F(#)). Even if this assumption is not satisfied, an upper bound on the standard deviation given in (8) may be obtained.

115 K. Dill et al. ter for providing the computational resources to con- 1(1992), 625-640. duct this research. 15.J. Skolnick, and A. Kolinski, Simulations of the Folding ofa Globular Protein, Science250 (1990), 1121-1125. 16.R. Srinivasan and G.D. Rose,LINUS: A Hierarchic Pro- REFERENCES cedure to Predict the Fold of a Protein, PROTEINS: Structure, Function, and Genetics 22 (1995), 81-99. 1. C.B. Anfinsen, Principles that Govern the Folding of 17.M.D. Struthers, R.P. Cheng, and B. Imperiali, Design of Protein Chains, Science 181(1973), 223-230. a Monomeric 23-Residue Polypeptide with Defined Ter- 2. E.M. Boczko, and C. Brooks, First-Principles Calcula- tiary Structure, Science271(1996), 342-345. tion of the Folding Free Energy of a Three-Helix Bundle 18.S. Sun, P.D. Thomas, and K.A. Dill, A Simple Protein Protein, Science269 (1995), 393-396. Folding Algorithm using a Binary Code and Secondary 3. D.G. Covell, Folding Protein a-Carbon Chains info Structure Constraints, Protein Engineering 8 (1995), Compact Forms by Monte Carlo Methods, PROTEINS: 769-778. Structure, Function, and Genetics 14 (1992), 409-420. 19. S. Vajda, M.S. Jafri, O.U. Sezerman, and C. DeLisi, 4. D.G. Covell, Lattice Model Simulations of Polypeptide Necessary Conditions for Avoiding Incorrect Polypep- Chain Folding, Journal of Molecular Biology 235 tide Folds in Conformationul Search by Energy Minimi- (1994), 1032-1043. zation, Biopolymers 33 (1993), 173-192. 5. K.A. Dill, Dominant Forces in Protein Folding, Bio- 20. A. Wallqvist, and M. Ullner, A Simplified Amino Acid chemistry 29 (1990), 7133-7155. Potential for use in Structure Predictions of Proteins, 6. K.A. Dill, A.T. Phillips, and J.B. Rosen, CGU: An Algo- PROTEINS: Structure, Function, and Genetics 18 rithm for Molecular Structure Prediction, IMA Volumes (1994), 267-280. in Mathematics and its Applications (1996), forthcom- 21.C. Wilson, and S. Doniach, A Computer Model to ing. Dynamically Simulate Protein Folding - Studies with 7. K.A. Dill, A.T. Phillips, and J.B. Rosen, Molecular Crambin, PROTEINS: Structure, Function, and Genetics Structure Prediction by Global Optimization, Journal of 6 (1989), 193-209. Global Optimization (1996), forthcoming. 22.K. Yue, and K.A. Dill, Folding Proteins with a Simple 8. D. Hinds, and M. Levitt, Exploring Conformational Energy Function and Extensive Conformational Search- Space with a Simple Lattice Model for Protein Structure, ing, Protein Science5 (1996), 254-261. Journal of Molecular Biology 243 (1994), 668-682. 9. I. Kuntz, G. Crippen, P. Kollman, and D. Kimmelman, Permi=ion to make digitalhrd copies of all or part ofthis material for Calculation of Protein Tertiary Structure, Journal of personal or chssroom use is granted without fee provided fiat he copies Molecular Biology 106 (1976), 983-994. are not made or distributed for profit or commercial advantage, the copy- right notice, the title ofthe publication and its date appe,~, and notice is 10.M. Levitt, and A. Warshel, Computer Simulation of Pro- given that copyright is by permission ofthe ACM, Inc. To copy otherwise, tein Folding, Nature 253 (1975), 694-698. to republish to post on servers or to redistribute to lists, requires specific ll.S. Miyazawa, and R.L. Jernigan, A New Substitution permission and/or fee. RECOMB 97, Santa Fe New Mexico USA Matrix for Protein Sequence Searches Based on Contact Copyright 1997ACM O-89791-882-8/97/01 .,$x50 Frequencies in Protein Structures, Protein Engineering 6 (1993): 267-278. 12.A. Monge, R. Friesner, and B. Honig, An Algorithm to Generate Low-Resolution Protein Tertiary Structures from Knowledge of SecondaryStructure, Proceedingsof the National Academy of ScienceUSA 91(1994), 5027- 5029. 13.A.T. Phillips, J.B. Rosen, and V.H. Walke, Molecular Structure Determination by Convex Global Underesti- mation of Local Energy Minima, Dimacs Series in Dis- crete Mathematics and Theoretical Computer Science23 (1995), P.M. Pardalos, G.-L. Xue, and D. Shalloway (Eds), 181-198. 14.M. Sippl, M. Hendlich, and P. Lackner, Assembly of Polypeptide and Protein Backbone Conformations from Low Energy Ensembles of Short Fragments: Develop- ment of Strategies and Construction of Models for Myo- globin, Lysozyme, and Thymosin Beta 4, Protein Science

116 Protein Structure Prediction and Potential Energy Landscape Analysis

Table 1 Eight Small Test Problems

avian pancreatic

Table 2 Probability Distribution of Local Minima

Number of Local Minima in Probability Range Shown

compound C.2 Total (residues) I I met-enkephalin (5) bradykinin (9) oxytocin (9)

zinc-finger (30) avian pancreatic polypeptide (36) crambin (46)

117