Shaping up the folding funnel by local interaction: Lesson from a structure prediction study

George Chikenji*†, Yoshimi Fujitsuka‡, and Shoji Takada*‡§¶

*Department of Chemistry, Faculty of Science, and ‡Graduate School of Science and Technology, Kobe University, Nada, Kobe 657-8501, Japan; †Department of Computational Science and Engineering, Graduate School of Engineering, Nagoya University, Nagoya 464-8603, Japan; and §Core Research for Evolutionary Science and Technology, Japan Science and Technology Agency, Nada, Kobe 657-8501, Japan

Edited by R. Stephen Berry, University of Chicago, Chicago, IL, and approved December 30, 2005 (received for review September 22, 2005) Predicting protein tertiary structure by folding-like simulations is CASP6, our group Rokko (as well as the server Rokky) used the one of the most stringent tests of how much we understand the SIMFOLD software as a primary prediction tool, demonstrating principle of . Currently, the most successful method one of the top-level prediction performances in the new fold for folding-based structure prediction is the fragment assembly category (elite club membership by the assessor, B.K. Lee). (FA) method. Here, we address why the FA method is so successful In this work, we start with comparison of the structures and its lesson for the folding problem. To do so, using the FA predicted in the past. We then design a structure prediction test method, we designed a structure prediction test of ‘‘chimera of ‘‘chimera ,’’ where the local structural preference is proteins.’’ In the chimera proteins, local structural preference is that for the wild-type (WT) sequence and nonlocal interactions specific to the target sequences, whereas nonlocal interactions are are the only sequence-independent compaction force. We show only sequence-independent compaction forces. We find that these for some small proteins that the chimera proteins can find the chimera proteins can find the native folds of the intact sequences native-like topology of the intact proteins with high probability, with high probability indicating dominant roles of the local inter- indicating that local structural biases and nonspecific compac- actions. We further explore roles of local structural preference by tion forces together lead proteins to their native folds. We exact calculation of the HP lattice model of proteins. From these further explore this idea by performing exact calculation of the results, we suggest principles of protein folding: For small proteins, HP lattice model of proteins. The result supports the above compact structures that are fully compatible with local structural suggestion and further indicates that compact structures that are preference are few, one of which is the native fold. These local fully compatible with weak bias in local structure are few, one of biases shape up the funnel-like . which is the native fold. These studies elucidate that local structural biases of modest strength make the energy landscape computational ͉ energy landscape ͉ fragment assembly ͉ of proteins funnel-like. Go៮ model ͉ SIMFOLD Results and Discussion atural proteins have the algorithm of finding the global Lesson from Structure Prediction in the Past. We start with two Nminimum of their free energy surface within a biologically observations about comparisons of structures predicted in the relevant timescale (1). One of the most stringent tests of how past, which provide important insight. much we understand the algorithm may be to predict protein First, by using the same energy function (SIMFOLD), we compared the prediction performance by molecular dynamics tertiary structures by simulating processes that are analogous to (MD) simulation with that by the FA method for some small folding, which is often called de novo structure prediction. proteins (17). We found that the performance is strikingly Recently, significant progress has been made in de novo structure different, albeit use of the same energy function. In the case of prediction, in which the most successful method is the fragment protein G, for example, both methods found roughly correct assembly (FA) method developed by Baker and coworkers (2) secondary structure elements at the right residues, whereas their and others (3–6). The FA method shows considerable promise packing in three dimensions was quite different. MD simulations for new fold targets of recent Critical Assessments of Techniques generated many misfolds with the wrong topology, but not the for Protein Structure Prediction (CASPs), the community-wide native fold (four structures that are not marked X in Fig. 1a are blind tests of structure prediction (7–11). In the FA method, the snapshots from MD). In contrast, FA could successfully generate protocol is separated into two stages: First, we collect structural the native fold with quite high population. (The structure labeled candidates for every short segment of the target sequence, with ‘‘Native Fold’’ in Fig. 1b is from FA simulation.) Why did retrieving them from the structural database. The second stage it happen, and what does it tell us? The lowest energy of the is to assemble͞fold these fragments for constructing tertiary misfold found in MD was slightly lower than that around the structures that have low energies. native basin, indicating that the SIMFOLD energy was not good at Simple questions arose as to why the FA method is so identifying the native fold. So, why could FA generate the native successful and what we can learn about protein folding from the fold with high population? Because the FA method restricts local success of the FA method. According to Baker and coworkers structure to that found in the database, local structure is not (12, 13), the FA method is based on the experimental observa- affected by the energy function used, as is the case for MD. tion that local sequence of a protein biases but does not uniquely Indeed, the lowest-energy structures found in MD contained decide its local structure. To what extent does the modest local some local structures, especially in the loops, that never appear bias influence tertiary structures generated? How is the FA in the structural database, implying inaccuracy of local interac- method related to recently developed protein folding theory (14, 15)? In this work, we address these questions. For this purpose, we need structure prediction software that Conflict of interest statement: No conflicts declared. uses the FA method. Here, we use the in-house-developed This paper was submitted directly (Track II) to the PNAS office. software, SIMFOLD. SIMFOLD employs a coarse-grained protein Abbreviations: CASP, Critical Assessment of Techniques for Protein Structure Prediction; FA, representation and has its own energy function derived from fragment assembly; MD, molecular dynamics; PDB, Protein Data Bank; rmsd, rms deviation. BIOPHYSICS physicochemical considerations (16, 17). Conformational sam- ¶To whom correspondence should be addressed. E-mail: [email protected]. pling is performed by the FA method (18, 19). In the recent © 2006 by The National Academy of Sciences of the USA

www.pnas.org͞cgi͞doi͞10.1073͞pnas.0508195103 PNAS ͉ February 28, 2006 ͉ vol. 103 ͉ no. 9 ͉ 3141–3146 Downloaded by guest on October 1, 2021 account when fragment candidates are listed up in the first stage, whereas nonlocal interactions play central roles in assembling͞ folding these fragments in the second stage. Taking advantage of this separation, we designed the follow- ing chimera experiment. We use protein G (56-residue ␣ ϩ ␤ protein) as an example. First, we prepared fragment candidates of every overlapping eight residues, as usual, for protein G. Second, in the assembling stage, we used the energy function for the 56-residue polyvaline (polyVal) sequence instead of the sequence of protein G and performed multicanonical Monte Carlo simulations of assembling (18). Thus, the simulation was conducted for the chimera interactions; local structural prefer- ence is for protein G, and nonlocal interaction is for polyVal. For the latter, we chose Val nearly by chance as a representative of a hydrophobic residue. Hydrophobic residues tend to attract each other, thus introducing nonspecific force of compaction (collapse). Note that we are not interested in polyVal folding itself. We tested the same experiment using polyisoleucin and polyleucine and found essentially the same results (data not shown), confirming that the choice of Val is not essential. As a control experiment, we also performed structure prediction of Fig. 1. Comparison of structures generated by MD (a) and FA (b) with the ͞ same energy function and corresponding schematic energy landscapes. (a)MD protein G using the WT sequence in the folding assembling simulations generated four misfold structures, suggesting a highly rugged stage, as usual. The chimera and control experiments are energy landscape. (b) The FA method generated the native fold with high denoted as [polyVal͞WT] and [WT͞WT], respectively. (Here, probability, suggesting a funnel-like landscape. Structures without red X were [A͞B] means that fragment candidates are prepared for B and obtained by the corresponding simulation, whereas those with red X were not the assembling simulation uses interactions of sequence A.) The observed. The cartoons of protein structures were generated by MOLSCRIPT (59). fragment candidates prepared are identical between chimera and control experiments. We now show the results. Fig. 3 a and b plot energies and rms tion of SIMFOLD, e.g., because of ignored local hydrogen bonding deviations (rmsds) of structures generated by assembling simu- between main chain and side chain. The FA method is free from lations of [WT͞WT] (control) and [polyVal͞WT] (chimera), this inaccuracy in local interactions, which may be the primary respectively. Here, the rmsd is measured from the native. We reason that the FA method works so well. Success of FA, in turn, clearly see a stronger correlation between rmsd and a lower suggests importance of local interactions for encoding the native envelope of energies in the [WT͞WT] case than that in the fold. [polyVal͞WT] case, suggesting that the WT sequence has a more The second observation is from the recent CASPs. In the new funnel-like energy landscape, and thus the prediction perfor- fold category of CASP6, most prediction groups of high score mance of the WT sequence is better than that of the chimera used FA-based methods. Although each group used its own protein. Surprisingly enough, there is weak, but significant, scoring (i.e., energy) function, predicted structures were some- correlation between the rmsd and the lower envelope of energies times similar to each other. Obviously, well predicted structures in the [polyVal͞WT] case. Next, as usual in structure prediction, were similar. Interestingly, sometimes predicted structures with we performed cluster analysis of low energy structures. Fig. 3 d the wrong fold were somewhat similar. A nice example is the and e show the structures of the largest cluster centers for target T0201 from CASP6. Shown in Fig. 2 are prediction models [WT͞WT] and [polyVal͞WT], respectively. (The native struc- of three independent groups, Rokko, Baker࿝ROBETTA, and ture of protein G is depicted in Fig. 3c as a reference.) As Jones࿝UCL, by the FA methods. Apparently, they have the expected, the predicted structure of the [WT͞WT] case is quite wrong fold, but are very similar to each other. This observation similar to the native (rmsd is 4.0 Å). Surprisingly, the predicted suggests that the prediction of the FA method is robust against structure of the [polyVal͞WT] case also has the correct fold difference in scoring functions. (rmsd of 5.1 Å), although secondary structure elements are slightly perturbed; the second ␤-strand is too short. The size of ͞ Ͼ Chimera Experiment. Motivated by these two observations, we the largest cluster for [polyVal WT] is reasonably large ( 20% designed a chimera experiment for addressing robustness of the in population) by the standards of de novo structure prediction, FA method with respect to the energy function and the role of although it is smaller than the control case. The structure local interactions for protein folding. Here, we note that local prediction performance by the FA method is unexpectedly and nonlocal interactions are treated explicitly in separate stages robust with respect to the energy function used for assembling. This robustness is not limited to protein G. We performed the in the FA method; local structural preference is taken into same chimera and control experiments as above for an all ␣-protein, 434 C1 repressor [Protein Data Bank (PDB) ID code 1r69), and an all ␤-protein Hex1 (PDB ID code 1khi). rmsd– energy plots of 434 C1 repressor for the [WT͞WT] and that for the [polyVal͞WT] cases are shown in Fig. 3 f and g, respectively. The two figures look quite different, reflecting the difference in energy functions used in assembling: For the [WT͞WT] case, weak energetic bias is found toward smaller rmsd, whereas it is not seen in the chimera case. As was the case for protein G, Fig. 2. Examples of structure predictions from the new fold category in performing cluster analysis for low-energy structures, we iden- CASP6. (a) The native structure of target T0201 (PDB ID code 1s12). (b) Model tified that the structures of the largest cluster center have the 1 from group Rokko. (c) Model 6 from group Baker࿝Robetta. (d) Model 1 from native fold for both cases (see Fig. 3 i and j), although the group Jones࿝UCL. predicted structure for [polyVal͞WT] (rmsd 5.5 Å) is worse than

3142 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0508195103 Chikenji et al. Downloaded by guest on October 1, 2021 simulations and the structure of the largest cluster center from the [polyVal͞WT], respectively. Interestingly, the structure for [polyVal͞WT] has the correct fold with rmsd of 5.4 Å, whereas that for the [WT͞WT] simulations has a pair of ␤-strands (fourth strand in yellow and fifth strand in red) flipped, resulting in a larger rmsd of 5.9 Å. (The largest cluster of the [WT͞WT] simulations is even worse.) The above chimera experiment showed that local structural preferences specific to the WT sequence and the nonspecific compaction force together drive the protein into its native fold with reasonably high probability for small proteins. In other words, given the fragment structure ensemble that reflects local structure preferences of the target sequence, compact structures that are fully compatible with these fragments are few, one of which is the native fold. We also performed the same experiments for some medium- to-large size proteins, finding that the performance of chimera experiments, on average, becomes worse as the size increases (data not shown). Thus, the conclusion above is limited to small proteins.

Nature of Short Fragments Taken from Proteins. Because the above results highlight the importance of local structural bias, we here briefly discuss physical properties of short fragments taken from proteins. First, a class of peptides excised from proteins can fold to the native structure by themselves, independently from other parts of the proteins (20–23). For these peptides, local interac- tions are so strong and specific that the peptide conformation is constrained into basically one structure. Another class of short sequences, often called chameleon sequences, is known to fold to entirely different conformations when they are embedded in different sequence contexts, demonstrating possible multiplicity of local structure bias (24). Some other peptides, once cleaved, do not possess any well defined structure (25). For these peptides, some local structures are prohibited by, for example, steric clash between side-chain atoms or by proximity of two atoms with the same charge. Thus, local bias does not pin down their conformations but reduces available conformational space. From these observations, it is reasonable to consider that short peptides excised from proteins have local biases and restriction in conformational space that are strongly sequence dependent. One of the biased conformations corresponds to the native. All atom MD simulations for short peptides extracted from proteins support this idea (26, 27). In the FA method, it is assumed that such biased conformations of short segments can be approxi- mated by fragment conformations retrieved from the structure database by sequence-comparison methods.

Fig. 3. Results of control and chimera experiments of protein G (a–e), 434 C1 Impact of Weak Local Constraint: Lattice HP Model. Then, how and ͞ repressor (f–j), and Hex1 (k–o). (a) rmsd–energy plots of protein G for [WT to what extent do local structural biases alter the global shape of WT]. (b) rmsd–energy plots for [polyVal͞WT]. (c) The native structure of protein G (PDB ID code 1pgb). (d) The structure of the largest cluster center of the energy landscape? For addressing this question with least protein G for [WT͞WT]. (e) The structure of the largest cluster center for ambiguity, we conducted an exact calculation using a simple [polyVal͞WT]. (f) rmsd–energy plots of 434 C1 repressor for [WT͞WT]. (g) two-dimensional lattice (2D) HP model of proteins (28), to rmsd–energy plots for [polyVal͞WT]. (h) The native structure of 434 C1 re- which we weakly added local structural biases, mimicking char- pressor (PDB ID code 1r69). (i) The structure of the largest cluster center of 434 acteristics of short peptides of real proteins. C1 repressor for [WT͞WT]. (j) The structure of the largest cluster center for In the lattice HP model, a protein is represented as a self- [polyVal͞WT]. (k) rmsd–energy plots of Hex1 for [WT͞WT]. (l) rmsd–energy avoiding chain on a 2D square lattice with two types of amino plots for [polyVal͞WT]. (m) The native structure of Hex1 (PDB ID code 1khi). (n) acids: H (hydrophobic) and P (polar). The energy, E, of a chain ͞ The structure of the second largest cluster center of Hex1 for [WT WT]. (o) The conformation is determined only by the number of H–H con- structure of the largest cluster center for [polyVal͞WT]. tacts, h,asE ϭϪ␧h, where ␧ is a positive constant (used as the energy unit). With this simplicity, complete enumeration in structural and sequence space is feasible for short chains, and ͞ that for [WT WT] (rmsd 3.5 Å). Results for Hex1 are shown in thus the sequence–structure relationship of this model has been Fig. 3 k and l. In this case, there is no correlation between rmsds quite thoroughly uncovered (28, 29). Here, we randomly choose and energies for both [WT͞WT] and [polyVal͞WT] cases, and a sequence that has a unique ground state-structure with the thus it is expected that predictions are worse than the previous chain length n ϭ 16. The sequence chosen here is HHHPHH- cases of protein G and 434 C1 repressor. Fig. 3 n and o shows the PHHHHPHHP. BIOPHYSICS structure of the second largest cluster center from the [WT͞WT] By exact enumeration of all possible conformations, we easily

Chikenji et al. PNAS ͉ February 28, 2006 ͉ vol. 103 ͉ no. 9 ͉ 3143 Downloaded by guest on October 1, 2021 Fig. 5. Schematic energy landscapes of the pure HP model (a) and the HP model with local structure constraints (b).

local structures. Physically, given a sequence of a short segment, some local structures are prohibited by, for example, steric clash. Obviously, these prohibited fragment conformations should not contain those that appear in the native structure. In the HP lattice model, we here impose structure biases on a four-residue fragment. There are six types of four-residue sequences in the sequence we use here, which are HHHH, PHHH, HPHH, HHPH, HHHP, and PHHP. The number of possible conforma- tions of four residues in two dimensions is five, as listed in Fig. 4c. For each four-residue sequence, one of the five structures is chosen as a prohibited conformation (marked by green rectan- gles in Fig. 4c), where the choice is at random from all structures that do not appear in the native structure. These structures Fig. 4. The HP model studied here. (a) Native structure of the HP model. (b) should be considered as conformations with high free energy. Example of a maximally compact structure of the HP model. (c) All four-residue We first investigate how the first-excited structures (E ϭϪ8) segments of the HP sequence and all possible structures. Structures sur- are affected by the introduced constraint in local structures, rounded by green rectangles are prohibited by the local structure constraint finding that 18 of 20 first-excited structures contain at least one we impose here. (d) All structures of the first excited states (E ϭϪ8) of the HP prohibited local structure, and thus they are eliminated. (See Fig. model. Structures are classified by the number of native contacts Q. Only two 4d where all nonshaded structures contain prohibited local gray shaded structures are not prohibited by imposing the local conforma- structures.) In particular, structures with small Q score, i.e., tional constraints. highly dissimilar to the native, are completely eliminated by the weak local constraint, where the Q score is the number of native find that this sequence has a unique ground state with energy contacts. Only 2 of the 20 structures remain, and they have E ϭϪ9 at the structure depicted in Fig. 4a and 20 misfolded relatively high Q score of 7 (shaded ones in Fig. 4d; Q score at structures in the first excited states (E ϭϪ8). All of the latter the native is 9), which means that these first-excited structures structures are shown in Fig. 4d, as classified by the nativeness are near native. This result suggests that the energy landscape score, the Q score, which is used as a measure of structural becomes more funnel-like by adding a weak constraint in local similarity to the native and is defined by the number of non- structure (see Fig. 5). bonded contacts common to those at the native structure (Q in Next, we see how characteristics of global conformational the of this model is 9). The presence of many space are altered by the weak constraint in local structure. Fig. low-energy misfolded structures with low Q scores indicates that 6a plots numbers of conformations (in logarithmic scale) as a the energy landscape of this sequence is highly rugged (see Fig. function of total number of nonbonded contacts C, which 5a). It has been known in general that the HP model has a highly reflects compactness, where data for the pure HP model without rugged energy landscape, which is non-protein-like (30). The HP local constraint and the locally constrained HP model are shown model simply captures the hydrophobic force, which is widely in white and black bars, respectively. For all C, the number of believed to be the driving force for the protein to fold (31). As conformations is significantly reduced by adding a local con- is clear from argument above, the key feature missing in this HP straint. Also important is that the number of conformations model is biases in local conformation. Thus, we introduce decreases with C, indicating that structures with higher com- sequence-dependent weak biases in local structures of the HP pactness are fewer. Led by these two tendencies, the striking model and consider the effect of them on the global shape of the result is that the number of maximally compact structures (C ϭ energy landscape. Local bias of the HP model were investigated 9) is 69 without the local constraint, but only 2 of 69 remain after before (32, 33), where biases were sequence independent. imposing the local constraint. One of the two remaining struc- Based on the argument in the previous section, the local tures is of course the native structure, by design. The other structural bias we introduce should be weak and sequence structure (shown in Fig. 4b) has the energy E ϭϪ5 and Q ϭ 1 dependent. Namely, the bias should not decisively pin down local and thus is highly excited. structure of a segment, but should modestly reduce diversity of Suppose we try to predict the ground state of this locally

3144 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0508195103 Chikenji et al. Downloaded by guest on October 1, 2021 We note that the current studies do not suggest that the nonlocal interactions are unimportant. Hydrophobicity and van der Waals packing, which are predominantly nonlocal, play major roles in the thermodynamic ‘‘stability’’ of proteins. The present work suggests the primary code that ‘‘specifies’’ the native fold, but not the stability, is in the local structure preferences.

Implication to Computational Protein Design. The perspective that there are only limited numbers of compact structures fully Fig. 6. The effect of the local structure constraint on the conformational ⍀ compatible with local constraints may have implications to space of lattice protein models. (a) The number of conformations (in protein design as well. Recently, considerable progress has been logarithmic scale) as a function of total number of contacts C. The white bar indicates log⍀(C) of the pure HP model. The dotted bar indicates log⍀(C)of made in de novo design of foldable proteins (43–45), where the the HP model with local constraints. (b) Temperature dependence of the primary strategy is to seek the sequence that optimizes side- number of nonnative contacts. Dotted, solid, and broken lines correspond chain interactions at a target structure. From the perspective of to the pure HP model, the locally constrained HP model, and Go៮ model, folding theory, however, it has been anticipated that, instead of respectively. optimizing the interactions at the target structure, the (normal- ized) gap between target energy and the average energy of the denatured ensemble should be optimized. Lattice model simu- constrained HP model. Assuming that the ground state is a lations support this theory (46). However, using this idea directly maximally compact structure, we generate maximally compact for designing real protein sequences is limited (47). More structures that satisfy the local conformational rule. There are importantly, major successes of protein design so far come from only two of these structures, and one of them is the native. Thus, optimization of the interaction energy only at the target (43–45). we can predict the native structure with probability 0.5 regardless How can the practice and the theory be reconciled? Obviously of the energy function. Because two maximally compact struc- by now, the perspective that there are only limited numbers of Ϫ Ϫ tures have the energies of 9 and 5, an energy function that compact structures fully compatible with local constraints in- has typical error of 2 energy units is enough to identify the deed fills the gap: If this is the case and if the target structure is ground state. This finding is perfectly parallel to the fact that, in one of the possible compact structures, destabilizing alternative the chimera experiments, just generating compact structures conformations may not be necessary. using fragment libraries produces the native fold with high probability. Discussion on Denatured States of Proteins. The finding that local structural preference and compaction forces together lead the What Makes the Protein Energy Landscape Funnel-Like? What do protein into the native fold may have some links to characteristics these results tell us about protein folding? Recently it was of denatured states of proteins, which recently attract consider- established that the transition state ensemble of folding path- able attention (48–54). In simulations of a ␤-hairpin confined in ways can often be well described by the so-called Go៮-like models a pore, Thirumalai and coworkers (52) found that, under highly (34–41), which correspond to the perfect-funnel. The basic denatured conditions, the confined ␤-hairpin possesses partially assumption in the model is that only native contacts are favor- native-like order. This result is in contrast to the denatured able, and nonnative contacts are neutral or repulsive (42). These ensemble of a bulk ␤-hairpin, which is extended and has little studies suggested that the energy landscape of proteins is native-like order. Thus, compaction force alone could induce funnel-like. It is unclear, however, how the smooth and funnel- native-like order as in ref. 55. The present work suggests that this shaped energy landscape is organized from numerous atomic hypothesis is quite probable by the help of local structural interactions. Important and interesting questions related to this preferences. point are what is the physicochemical reason of the funnel- shaped energy landscape and why nonnative interactions play Conclusions only minor roles. This study addressed what microscopic interactions make the In the previous section, we showed that constraints in local protein energy landscape funnel-like, motivated the by recent structure prohibit many stable misfolded structures, resulting in success of the FA method in structure prediction. We reached a funnel-like energy landscape. To see this effect more quanti- the following principles of protein folding: (i) There are only tatively under thermodynamic conditions, we calculated the limited numbers of compact conformations that are fully com- temperature dependence of the average number of nonnative patible with local structural preferences, and the native fold is contacts for the above lattice HP model for the pure and locally one of them, and thus (ii) local structure preferences strongly constrained HP models (Fig. 6b). For comparison, we also show shape up the protein folding funnel. results for a Go៮ model, in which the native structure is set as the same as that of the HP model (Fig. 4a). As shown in Fig. 6b, there Materials and Methods is an apparent peak in the number of nonnative contacts curve SIMFOLD energy employs a coarse-grained protein representa- for the pure HP model, indicating that there is a nonnative tion. It includes explicit backbone structure and a simplified side intermediate state in the thermodynamic transition. Thus, the chain, which is represented by the centroid of side-chain heavy pure HP model does not show a two-state transition. In contrast, atoms. The SIMFOLD energy function consists of various inter- the thermodynamic behavior of the number of nonnative con- actions: hydrogen bonding, hydrophobic interaction, van der tacts of the locally constrained HP model shows a sigmoidal Waals interaction, and so on. Their functional forms are well transition, which is quite similar to that of the Go៮ model. Indeed based on physicochemical considerations, whereas parameters nonnative interactions in the locally constrained HP model are are optimized based on the energy landscape theory and struc- significantly reduced, relative to the pure HP model. Namely, tural database. The explicit expression is described in ref. 17. although nonnative interactions exist in the locally constrained In the first stage of structure prediction, we prepared fragment HP model, they do not contribute very much, suggesting local candidates for every segment of a target sequence, as usual, using structural biases make the energy landscape more funnel-like profile–profile comparison (56) against a nonredundant set of BIOPHYSICS (see Fig. 5b). the structural database. Sequences homologous to the target

Chikenji et al. PNAS ͉ February 28, 2006 ͉ vol. 103 ͉ no. 9 ͉ 3145 Downloaded by guest on October 1, 2021 protein [PSI-BLAST (57) E value Ͻ 0.01] were rigorously excluded old fragments and thus short portions of fragments (for example, from the fragment libraries, because the focus here is not on one-residue-length remains of fragment candidates) remain in homology modeling. We prepared Ϸ20 fragment candidates for any stage of the simulation. Frequent use of short remains of every eight-residue-length segment of the target protein. Pre- fragments diminishes the advantage of the FA method, that is, pared fragment structures would reflect the structural prefer- correlated ␾,␺ angle information in consecutive residues. The ence of the corresponding segments of the target protein. new algorithm rigorously prohibits use of remains of less than Second, tertiary structures were generated by assembling the four residues, and thus any structure generated is made of fragment candidates prepared above. Monte Carlo simulation portions of fragments with the length between four and eight. was carried out for finding low-energy conformations with the Finally, we performed cluster analysis of the resulting low- newly developed reversible fragment replacement algorithm energy structures using the pairwise C␣ rmsd as a measure of (G.C., Y.F., and S.T., unpublished data). This algorithm has dissimilarity. Cluster centers of largest cluster sizes were selected some attractive characteristics over conventional fragment in- as prediction models (58). The cluster cutoff was set to 6 Å here. sertion. First, in contrast to conventional algorithms, the current one satisfies the detailed balance condition, thus enabling us to We thank Satoshi Takahashi and Sotaro Fuchigami for fruitful discus- estimate thermodynamic quantities and use modern sampling sions. We also thank David Leitner for useful discussions and careful techniques in computational physics. Here, we use one of these reading of the manuscript. G.C. and Y.F. were supported by the Japan techniques, the multicanonical ensemble method that realizes Society for the Promotion of Science Research Fellowship for Young uniform sampling in energy space, avoiding too-long trapping by Scientists. This work was supported by Grant-in-Aid for scientific misfolded structures (18). Second, in conventional fragment research on priority areas ‘‘Water and Biomolecules’’ from the Ministry insertion algorithms, insertion of a fragment overwrites part of of Education, Science, Sports, and Culture of Japan.

1. Fersht, A. R. (1999) Structure and Mechanism in Protein Science: A Guide to 30. Wolynes, P. G. (1997) Nat. Struct. Biol. 4, 871–874. Enzyme Catalysis and Protein Folding (Freeman, New York). 31. Dill, K. A. (1990) Biochemistry 29, 7133–7155. 2. Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. (1997) J. Mol. Biol. 268, 32. Thomas, P. D. & Dill, K. A. (1993) Protein Sci. 2, 2050–2065. 209–225. 33. Irba¨ck, A. & Sandelin, E. (1998) J. Chem. Phys. 108, 2245–2250. 3. Jones, T. A. & Thirup, S. (1986) EMBO J. 5, 819–822. 34. Shoemaker, B. A., Wang, J. & Wolynes, P. G. (1997) Proc. Natl. Acad. Sci. USA 4. Levitt, M. (1992) J. Mol. Biol. 226, 507–533. 94, 777–782. 5. Bowie, J. U. & Eisenberg, D. (1994) Proc. Natl. Acad. Sci. USA 91, 4436–4440. 35. Portman, J. J., Takada, S. & Wolynes, P. G. (1998) Phys. Rev. Lett. 81, 6. Jones, D. T. (1997) Proteins 29, 185–191. 5237–5240. 7. Bradley, P., Chivian, D., Meiler, J., Misura, K. M. S., Rohl, C. A., Schief, W. R., 36. Galzitskaya, O. V. & Finkelstein, A. V. (1999) Proc. Natl. Acad. Sci. USA 96, Wedemeyer, W. J., Schueler-Furman, O., Murphy, P., Strauss, J., et al. (2003) 11299–11304. Proteins 53, 457–468. 37. Alm, E. & Baker, D. (1999) Proc. Natl. Acad. Sci. USA 96, 11305–11310. 8. Jones, D. T. & McGuffin, L. J. (2003) Proteins 53, 480–485. 38. Munoz, V. & Eaton, W. A. (1999) Proc. Natl. Acad. Sci. USA 96, 11311–11316. 9. Fang, Q. & Shortle, D. (2003) Proteins 53, 486–490. 39. Clementi, C., Nymeyer, H. & Onuchic, J. N. (2000) J. Mol. Biol. 298, 937–953. 10. Karplus, K., Karchin, R., Draper, J., Casper, J., Mandel-Gutfreund, Y., 40. Koga, N. & Takada, S. (2001) J. Mol. Biol. 313, 171–180. Diekhans, M. & Hughey, R. (2003) Proteins 53, 491–496. 41. Karaniclas, J. & Brooks, C. L., III (2003) J. Mol. Biol. 334, 309–325. 11. Shao, Y. & Bystroff, C. (2003) Proteins 53, 497–502. 42. Go៮, N. (1983) Annu. Rev. Biophys. Bioeng. 12, 183–210. 12. Bonneau, R. & Baker, D. (2001) Annu. Rev. Biophys. Biomol. Struct. 30, 43. Dahiyat, B. I. & Mayo, S. L. (1997) Science 278, 82–87. 173–189. 44. Walsh, S. T. R., Cheng, H., Bryson, J. W., Roder, H. & DeGrado, W. F. (1999) 13. Rohl, C., C. E. M. Strauss, K. M. S. Misura & Baker, D. (2004) Methods Proc. Natl. Acad. Sci. USA 96, 5486–5491. Enzymol. 383, 66–93. 45. Kuhlman, B., Dantas, G., Ireton, G. C., Varani, G., Stoddard, B. L. & Baker, 14. Bryngelson, J. D., Onuchic, J. N., Socci, N. D. & Wolynes, P. G. (1995) Proteins D. (2003) Science 302, 1364–1868. 21, 167–195. 46. Shakhnovich, E. I. (1998) Folding Des. 3, R45–R58. 15. Onuchic, J. N. & Wolynes, P. G. (2004) Curr. Opin. Struct. Biol. 14, 70–75. 47. Jin, W., Kambara, O., Sasakawa, H., Tamura, A. & Takada, S. (2003) Structure 16. Takada, S., Luthey-Schulten, Z. A. & Wolynes, P. G. (1999) J. Chem. Phys. 110, (London) 11, 581–590. 11616–11629. 48. Smith, L. J., Fiebig, K. M., Schwalbe H. & Dobson, C. M. (1996) Folding Des. 17. Fujitsuka, Y., Takada, S., Luthey-Schulten, Z. A. & Wolynes, P. G. (2004) 1, Proteins 54, 88–103. R95–R106. 18. Chikenji, G., Fujitsuka, Y. & Takada, S. (2003) J. Chem. Phys. 119, 6895–6903. 49. Hammarstro¨m, P. & Carlsson, U. (2000) Biochem. Biophys. Res. Commun. 276, 19. Chikenji, G., Fujitsuka, Y. & Takada, S. (2004) Chem. Phys. 307, 157–162. 393–398. 20. Blanco, F. J., Rivas, G. & Serrano, L. (1994) Nat. Struct. Biol. 1, 584–590. 50. Shortle, D. & Ackerman, M. S. (2001) Science 293, 487–489. 21. Yi, Q., Bystroff, C., Rajagopal, P., Klevit, R. E. & Baker, D. (1998) Protein Sci. 51. Mohana-Borges, R., Goto, N. K., Kroon, G. J., Dyson, H. J. & Wright, P. E. 283, 293–300. (2004) J. Mol. Biol. 304, 1131–1142. 22. Rotondi, K. S. & Gierasch, L. M. (2003) Biochemistry 42, 7976–7985. 52. Klimov, D. K., Newfield, D. & Thirumalai, D. (2002) Proc. Natl. Acad. Sci. USA 23. Honda, S., Yamasaki, K., Sawada, Y. & Morii, H. (2004) Structure (London) 99, 8019–8024. 12, 1507–1518. 53. Zagrovic, B. & Pande, V. S. (2003) Nat. Struct. Biol. 10, 955–961. 24. Zhou, X., Aleber, F., Folkers, G., Gonnet, G. H. & Chelvanayagam, G. (2000) 54. Jha, A. K., Colubri, A., Freed, K. F. & Sosnick, T. R. (2005) Proc. Natl. Acad. Proteins 41, 248–256. Sci. USA 102, 13099–13104. 25. Wang, Y. & Shortle, D. (1997) Folding Des. 2, 93–100. 55. Chan, H. S. & Dill, K. A. (1990) Proc. Natl. Acad. Sci. USA 87, 6388–6392. 26. Nakajima, N., Higo, J., Kidera, A. & Nakamura, H. (2000) J. Mol. Biol. 296, 56. Pietrokovski, S. (1996) Nucleic Acids Res. 24, 3836–3845. 197–216. 57. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. 27. Higo, J., Ito, N., Kuroda, M., Ono, S., Nakajima, N. & Nakamura, H. (2001) & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389–3402. Protein Sci. 10, 1160–1171. 58. Shortle, D., Simons, K. T. & Baker, D. (1998) Proc. Natl. Acad. Sci. USA 95, 28. Lau, K. F. & Dill, K. A. (1989) Macromolecules 22, 3986–3997. 11158–11162. 29. Li, H., Helling, R., Tang, C. & Wingreen, N. (1996) Science 273, 666–669. 59. Kraulis, P. (1991) J. Appl. Crystallogr. 24, 946–950.

3146 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0508195103 Chikenji et al. Downloaded by guest on October 1, 2021