Generalized Ensemble Methods for De Novo Structure Prediction
BIOPHYSICS

Alena Shmygelska1 and Michael Levitt1
Department of Structural Biology, Stanford University, Stanford, CA 94305-5126
Contributed by Michael Levitt, December 11, 2008 (sent for review October 12, 2008)

Current methods for predicting protein structure depend on two interrelated components: (i) an energy function that should have a low value near the correct structure and (ii) a method for searching through different conformations of the polypeptide chain. Identification of the most efficient search methods is essential if we are to be able to apply such methods broadly and with confidence. In addition, efficient search methods provide a rigorous test of existing energy functions, which are generally knowledge-based and contain different terms added together with arbitrary weights. Here, we test different search methods with one of the most accurate and predictive energy functions, namely Rosetta, the knowledge-based force field from Baker’s group [Simons K, Kooperberg C, Huang E, Baker D (1997) J Mol Biol 268:209–225]. We use an implementation of a generalized ensemble search method to scale relevant parts of the energy function. This method, known as Hamiltonian Replica Exchange Monte Carlo, outperforms the original Monte Carlo Simulated Annealing used in the Rosetta package in terms of sampling low-energy states. It also outperforms another widely used generalized ensemble search method known as Temperature Replica Exchange Monte Carlo. Our results reveal clear deficiencies in the low-resolution Rosetta energy function in that the lowest energy structures are not necessarily the most native-like. By using a set of nonnative low-energy structures found by our extensive sampling, we discovered that the long-range and short-range backbone hydrogen-bonding energy terms of the Rosetta energy discriminate between the nonnative and native-like structures significantly better than the low-resolution score used in Rosetta.

conformational search | protein folding | Rosetta force field

Author contributions: A.S. and M.L. designed research; A.S. performed research; A.S. analyzed data; and A.S. and M.L. wrote the paper.
The authors declare no conflict of interest.
Freely available online through the PNAS open access option.
1To whom correspondence may be addressed. E-mail: [email protected] or [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0812510106/DCSupplemental.
© 2009 by The National Academy of Sciences of the USA
www.pnas.org/cgi/doi/10.1073/pnas.0812510106 PNAS | February 3, 2009 | vol. 106 | no. 5 | 1415–1420

Predicting the functional 3-dimensional structure (the native state) of a protein from its amino acid sequence is of central importance to structural and functional biology and has enormous applications in alleviating human disease. Even if the structures of all proteins were known, we would still not be able to answer questions related to diseases directly caused by protein misfolding, such as certain types of cancer and Alzheimer’s and Parkinson disease. For this we would need to understand the physical basis of the energy terms that make the native state so special. Such understanding of the energetics of the system would also lead to more efficient and comprehensive drug design. Structure prediction depends on solving two problems: (i) describing the energy function with sufficient accuracy and (ii) searching the conformational space sufficiently well. These problems are particularly severe for proteins of biologically relevant lengths (>150 aa).

In this work we focus on conformational sampling, which has been recognized as the critical step in high-resolution structure prediction (1–3). The most widely used standard methods for de novo structure prediction are based on variants of the Monte Carlo method (4–6) and are unable to explore low-energy regions efficiently because of the ruggedness of the potential energy surface. To overcome these problems, a number of generalized ensemble Monte Carlo methods have been developed (7–10). These methods strive to search energy space better by computing the density of states, sampling expanded ranges of temperatures, or computing other physical quantities affecting transitions between the states during the search. In particular, advanced methods such as Temperature Replica Exchange Monte Carlo (TREM) (8) and Hamiltonian Replica Exchange Monte Carlo (HREM) (10) have been shown to outperform standard Monte Carlo in terms of sampling for both simplified and all-atom force fields of small proteins (8, 10, 11).

For longer proteins, the computational cost and ruggedness of the all-atom energy function make solving this problem particularly challenging, as evidenced by the modest success of full-atom refinement (12–14). For this reason, multiscale approaches that start with low-resolution or reduced-model energy functions and then use all-atom energy functions on a few selected conformations [often relying on additional steps such as use of sequence homologs (2) or clustering (3, 4)] have been developed (4, 6, 12, 13). These approaches often fail to generate low-resolution models within the “radius of convergence” (rmsd <3 Å) of the native state necessary for the success of subsequent full-atom refinement (2).

In this work, we test whether enhanced conformational sampling of low-resolution models can improve structure prediction. Specifically, we apply generalized Monte Carlo methods to one of the most powerfully predictive de novo protein potential energy functions, the low-resolution Rosetta force field (1). We compare the performance of two of the best-performing search methods, Temperature Replica Exchange Monte Carlo and Hamiltonian Replica Exchange Monte Carlo, with the four-stage Monte Carlo Simulated Annealing protocol used in the original Rosetta algorithm. We show that for a representative set of 40 proteins containing α, β, α+β, and α/β folds, both the HREM and, to a lesser degree, TREM methods enhance sampling of low-energy states as compared with the original Rosetta method. More importantly, we are able to use the nonnative-like low-energy structures sampled by generalized ensemble methods to suggest improvements of the low-resolution scoring function used in Rosetta. Our analysis of energy landscapes and structure clusters shows that HREM outperforms other search methods, not only in terms of finding more low-energy states, but also in sampling a more diverse set of compact structures for use in optimization of energy functions.

Fig. 1. Energy and rmsd differences for HREM and TREM as compared to ROSETTA. (A) Difference in energy value between conformations sampled during 20,000 independent runs by HREM and TREM and those conformations independently generated by ROSETTA for 40 selected proteins from the four structural classes of SCOP (55–208 aa). In each case we show differences for (i) the lowest energy values (min), (ii) the cutoff energy value for the 90th percentile of low-energy structures (p90, best 10% of structures), and (iii) the lowest energy values from the five largest clusters (Cbest). In all cases, HREM obtains lower energy values than ROSETTA (energy differences <0), whereas TREM is better than ROSETTA in just 50% of the cases. (B) Difference in Cα root mean square deviation (rmsd) values between the same conformations. In each case we show differences for (i) the rmsd for the lowest energy structure (min), (ii) the mean rmsd for the 90th percentile of low-energy structures (p90), and (iii) the cluster centroid rmsd from the five largest clusters (Cbest).

[...] found that the HREM search method generally outperforms other search methods in terms of sampling low-energy states on all sequences. In particular, performance differences between the generalized ensemble methods, HREM and TREM, and ROSETTA (the lowest energy, the energy level below which 10% of the structures lie, and the lowest energy among the five highly populated clusters) become more marked as the length of the protein increases and seem to be larger for β-folds. In comparison with ROSETTA, HREM (consistently) and TREM (often) gave rise to significant improvement in terms of lower energy values. This did not always lead to an improvement in rmsd because of false minima in the energy landscapes (Fig. 1B).

Results and Discussion

Four Stages of the Rosetta Scoring Function. Rosetta’s low-resolution Monte Carlo method (known here as ROSETTA) employs a hierarchical protocol consisting of four sequential searches that involve swapping fragments of length 9 and then 3 residues. Each stage employs a different scoring function. These four different energy-scoring functions involve (i) replacement of the extended chain (score0), (ii) buildup of the secondary structure (score1), (iii) alternation of high (score2) and low (score5) sheet weights, and (iv) low-resolution centroid refinement (score3) (15). Finally, structures are selected according to another low-resolution centroid refinement score (score4). Each subsequent scoring function used in ROSETTA adds new terms, while leaving many energy contributions unchanged; this provides significant overlap of the energy values of conformations sampled by different scoring functions. In addition, the cumulative nature of the energy functions used consecutively in ROSETTA allows one to represent each scoring function as a scaled