A Statistical Mechanical Method to Optimize Energy Functions for Protein Folding
Total Page:16
File Type:pdf, Size:1020Kb
A statistical mechanical method to optimize energy functions for protein folding Ugo Bastolla*†, Michele Vendruscolo‡, and Ernst-Walter Knapp*† *Freie Universita¨t Berlin, Department of Biology, Chemistry and Pharmacy, Takustrasse 6, D-14195 Berlin, Germany; and ‡Oxford Centre for Molecular Sciences, New Chemistry Laboratory, Oxford OX1 3QT, United Kingdom Communicated by Vitalii Goldanskii, Russian Academy of Sciences, Moscow, Russia, December 22, 1999 (received for review May 12, 1999) We present a method for deriving energy functions for protein A second class of methods (19–23) aims at providing the folding by maximizing the thermodynamic average of the overlap largest possible thermodynamic stability to the target proteins. with the native state. The method has been tested by using the These methods optimize quantities related to the Z score (24), pairwise contact approximation of the energy function and gen- measuring the difference between the energy of the native state erating alternative structures by threading sequences over a da- and the average energy of the alternative states in units of the tabase of 1,169 structures. With the derived energy function, most standard deviation of the energy. Goldstein et al. (19, 20), native structures: (i) have minimal energy and (ii) are thermody- inspired by a spin-glass analysis, derived efficient parameters for namically rather stable, and (iii) the corresponding energy land- fold recognition. More recently, the method was extended to scapes are smooth. Precisely, 92% of the 1,013 x-ray structures are protein structure prediction via simulated annealing (21). Hao stabilized. Most failures can be attributed to the neglect of inter- and Scheraga applied a similar method to the more difficult actions between chains forming polychain proteins and of inter- problem of deriving an energy function for folding simulations actions with cofactors. When these are considered, only nine cases of a single protein (22). Mirny and Shakhnovich proposed remain unexplained. In contrast, 38% of NMR structures are not optimizing the Z score to obtain an energy function conferring assigned properly. large stability to most native states of a target set of proteins (23). These methods, however, do not guarantee that the native he starting point for folding proteins on a computer is to structure has the lowest energy among all alternative structures. Tassume, according to Anfinsen’s experiments (1), that the We present a third method, which combines the main advan- native state of the protein is in thermodynamic equilibrium and tages of the two previous ones. Here, the energy function is corresponds to the minimum free energy. The most straightfor- determined by maximizing the average native overlap Q. When ward approach considers a detailed atomistic model and follows Q is very close to 1, it is guaranteed that the native state and the its time evolution either by molecular dynamics (2, 3) or by ground state coincide. Moreover, a large value of Q also Monte Carlo simulations (4, 5). To date, only for special regular indicates that the native state is thermodynamically stable and structures (4, 5) and for small polypeptides (3), the experimen- suggests that the energy landscape is well correlated. tally known native structures have been reproduced in computer As a first application of the method, we determine optimal experiments. These simulations are still far from being routine parameters for the pairwise contact approximation of the energy methods of structure prediction. The reason is that a protein in function. We consider a database of 1,169 protein chains (25, 26) solution is only marginally stable, and its behavior depends and generate alternative structures by threading (9, 10, 13). Our crucially on subtle details of the interaction. Thus, in most energy function stabilizes 92% of the 1,013 x-ray structures even models even minor changes in the energy function can destabi- without considering interactions between different chains and lize the native state (6, 7). with cofactors. These can explain the remaining cases with very BIOPHYSICS An alternative approach consists in adopting a coarse-grained few exceptions. On the other hand, only 62% of NMR structures (mesoscopic) description of the protein structure and using an are stable. This can be at least partially explained by the way in energy function not derived from physical principles but ob- which these structures are represented in the PDB files. tained from the information contained in the Protein Data Bank In the next section, we define the theoretical framework of (PDB) of native structures. To carry out this task, several authors optimization methods. Then we describe our method and apply assume that the structural motifs in the set of native protein it to the determination of pairwise contact interactions. Finally, structures follow a Boltzmann distribution, whose energy func- we discuss our results. tion is calculated from the observed frequencies (8–12). Here we follow a different approach and determine the energy parame- Optimizing Energy Parameters ters by an optimization scheme. Two optimization schemes have been proposed so far. Their We consider a chain of N amino acids and an effective energy common goal is to obtain an energy function such that the function E(C, S) depending on the mesoscopic configuration ⑀͞⍀ ϭ ⑀⌺ ground state of the model corresponds to the observed native C N and on the sequence S {S1,...SN} . We choose as ϭ ⌫ structure and is thermodynamically stable. A first optimization mesoscopic representation the contact map matrix C f( ), ⌫ scheme, introduced by Maiorov and Crippen (13), requires that where is the microscopic state and the native states of a target set of proteins have energies lower than for a set of alternative structures. This is obtained by solving 1 if residues i and j are in contact, ϭ ͭ a system of inequalities. Recently, this method has been im- Cij [1] 0 otherwise. proved by Domany and coworkers (14–16) and Maritan and coworkers (17, 18). In the new formulation, it is possible to Abbreviation: PDB, Protein Data Bank. answer rigorously the question whether the system of inequalities †To whom reprint requests should be addressed. E-mail: [email protected] or is solvable. If a solution exists, it is not unique, and it is possible [email protected]. to improve the method by choosing the solution of maximal The publication costs of this article were defrayed in part by page charge payment. This stability, defined as the solution in which the stability gap of the article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. least stable protein is maximized (16, 18). §1734 solely to indicate this fact. PNAS ͉ April 11, 2000 ͉ vol. 97 ͉ no. 8 ͉ 3977–3981 Downloaded by guest on October 3, 2021 ϭ We consider two residues in contact if they are separated by of such similarity is the overlap q0 q(C0(S),Cn(S)), where C0(S) ϭ more than two residues along the sequence and if any two heavy is the lowest energy contact map for sequence S.Ifq0 1, we atoms belonging to them are closer than a threshold distance of are guaranteed that the native state contact map has the lowest 4.5 Å (16). energy. In the spirit of our statistical mechanics approach, we ⍀ The ensemble of configurations N is the ensemble of all optimize the similarity between the native structure and the contact maps realized by a given set of structures with N residues. whole Boltzmann ensemble obtained from a simulation or by We shall consider three possibilities: threading. We define the average native overlap (i) The threading set (9, 10, 13): backbone structures are ͑ ͒ ϭ ͗ ͑ ͒͘ obtained by cutting in all possible ways structures with length Q S, U qn C U, [5] ЈϾ N N contained in a subset of the PDB. ϭ where qn(C) q(C,Cn(S)), Cn(S) being the native state. The (ii) The lattice set: structures are self avoiding random walks ⍀ (28) on a suitable lattice. brackets denote a Boltzmann average in the ensemble , ͗ ͘ϭ ͞ ͚ Ϫ ϭ͚ Ϫ (iii) The off-lattice set: structures are obtained from Monte A(C) 1 z C A(C)exp( F(C, S)) and z C exp( F(C, S)), Carlo or molecular dynamics simulations that fulfill the physical and the interaction parameters U are measured in units such that ϭ constraints of excluded volume and of definite values for bond kBT 1. For threading, the entropy of contact maps is zero, and ϭ angles and lengths. we have F(C, S) E(C, S, U). There are three advantages in The present study will be limited to the simpler cases of lattice optimizing Q(S, U) instead of q0:(i) Q(S, U) yields information and threading sets. The more difficult case of an off-lattice set, about the thermodynamic stability of Cn(S). A value of Q close relevant for folding simulations, will be studied in a forthcoming to 1 does not only mean that C0(S) and Cn(S) coincide, but also paper. that Cn(S) has a large Boltzmann weight. (ii) If the temperature A central role in our method is played by the overlap q(C,CЈ), is not too low and the set of alternative conformations contains which measures the similarity between two contact maps. structures similar enough to the native one, this condition implies also that the low energy states are those similar to Cn(S). Ј ijCijC ij This is a way to obtain a smooth energy landscape, where the q͑C, CЈ͒ ϭ . [2] energy decreases on the average as Cn(S) is approached. For ͩ Јͪ max ijCij, ijC ij most reasonable dynamical rules, the correlation between E and q is expected to favor fast folding, in agreement with the funnel It holds q(C,CЈ) (0, 1) and q(C,CЈ) ϭ 1 if and only if the two scenario proposed by Bryngelson and Wolynes (30) and with S, U contact maps are equal.