doi:10.1006/jmbi.2001.5037 available online at http://www.idealibrary.com on J. Mol. Biol. (2001) 313, 171±180

Roles of Native Topology and Chain-length Scaling in Folding: A Simulation Study with a Go-like Model Nobuyasu Koga1 and Shoji Takada1,2*

1Graduate School of Science We perform folding simulations on 18 small with using a simple and Technology, Kobe Go-like protein model and analyze the folding rate constants, character- University and istics of the transition state ensemble, and those of the denatured states in terms of native topology and chain length. Near the folding transition 2PRESTO, Japan Science and temperature, the folding rate k scales as k exp( c RCO N0.6) where Technology, Rokkodai, Nada F F RCO and N are the relative contact order and numberÀ of residues, Kobe 657-8501, Japan respectively. Here the topology RCO dependence of the rates is close to that found experimentally (kF exp( c RCO)), while the chain length N dependence is in harmony withÀ the predicted scaling property 2/3 (kF exp( cN )). Thus, this may provides a uni®ed scaling law in fold-  À 2/3 ing rates at the transition temperature, kF exp( c RCO N ). The degree of residual structure in the denatured stateÀ is highly correlated with RCO, namely, proteins with smaller RCO tend to have more ordered structure in the denatured state. This is consistent with the observation that many helical proteins such as myoglobin and protein A, have partial helices, in the denatured states. The characteristics of the transition state ensemble calculated by the current model, which uses native topology but not sequence speci®c information, are consistent with experimental f-value data for about half of proteins. # 2001 Academic Press Keywords: contact order; folding rate; transition state; f-value; denatured *Corresponding author state

Introduction tacts, which are the residue pairs close in contact at the native structure, divided by the total number How proteins fold to their native structures has of residues. Second, the folding transition state long been studied and still remains a fundamental ensembles (TSEs) are characterized by the f-value question in biophysics.1 ± 3 Recent studies of small, analysis developed by Fersht et al.1,6 for many pro- single-domain proteins have leaded to a new para- teins and it was found that proteins that have simi- digm that the folding rates and pathways are lar- lar structures with low sequence homology have gely determined by their native topology. First, similar TSE for the SH3 domain family7,8 and ab Baker and co-workers found that the folding rates plaits9,10 which implies that the native topology of two-state folding proteins are strongly corre- tends to determine folding pathways of the pro- lated with a measure of the topological complexity tein. It should be noted, however, those recent of native structure,4,5 that is, folding is slower for works report some exceptions to this scenario:11± 13 more complex topologies. The measure of topologi- For example, protein G and protein L have the cal complexity, which they called the relative con- same topology, but exhibit different transition state tact order, is a key quantity currently. It is de®ned ensembles. as the average sequence separation of native con- Why native topology plays an important role in determining folding mechanisms is qualitatively understood from the energy landscape perspective. Abbreviations used: TSE, transition state ensembles; RCO, relative contact order; ACO, absolute contact It suggests that fast folding proteins have a funnel- like energy landscape with some ruggedness on order; D, denatured; N, native; AcP, acylphosphatase; 2 ADA2h, procarboxypeptidase A2. the surface. Evolutionary pressure has made the E-mail address of the corresponding author: ruggedness on the slope suf®ciently small so that [email protected] the current natural proteins may have minimal

0022-2836/01/010171±10 $35.00/0 # 2001 Academic Press 172 Topology and Chain-length in frustration. The effective energy, which is de®ned ility due to contacts, and the other from entropy as protein internal interaction energy plus sol- change due to loss of chain ¯exibility. vation free energy, decreases smoothly with fold- Advantages of these free-energy functional ing reaction to proceed, whereas the acquisition of approaches are their simplicity and reasonable structured segments upon folding reduces the agreement between theory and experiments. On chain entropy, too. Since the energy stabilization the other hand, a possible shortcoming may be the accelerates and the entropy loss decelerates folding dif®culty to single out reasons for any discrepancy reaction, their subtle balance determines the fold- between theory and experiments because these the- ing pathways and rates. Upon formation of a resi- ories often are based on many drastic approxi- due-residue contact, the entropy loss is larger for mations, such as neglect of non-native interactions, non-local contact than for local one. Thus, a protein crude approximations in the expression of entropy whose native topology has more non-local contacts loss and sequence-speci®c interactions included in folds more slowly. Also, qualitatively, contacts the model. formed at the TSE are chosen so that the entropy An alternative and complementary approach to loss may be minimized with given amount of these studies was recently taken by Onuchic and 26,27 structured segments. Here the entropy loss is his colleagues. They proposed a simple off- strongly dependent on the sequence separation of lattice simulation model that implements Go-like the formed contact and so the native topology con- interactions and performed trols characteristics of the folding TSEs. simulation for a few small-to-medium size pro- If we move from the qualitative to the semi- teins. They reported probabilities of forming native quantitative level, however, many issues need to be contacts in the TSEs and intermediate states that clari®ed. First, it is somewhat surprising that the are qualitatively consistent with experiments. This folding rate constant correlates with relative contact simulation study seems more straightforward than order instead of absolute contact order{ because the the free-energy functional approaches. Especially, latter is more directly linked to the entropy loss no additional approximation is needed besides the upon contact formation.3 Related to this is chain- interaction energy function. It is also based on the length scaling of the folding rate, where theoretical perfect-funnel model and is constructed mostly work predicts some different types of scaling such from the native topology without taking into n 14 2/3 15 1/2 16 account details of sequence-speci®c information. as kF NÀ , kF exp( cN ), kF exp( cN ), that wer e uni®edby WolynÀ es17 while expÀerimental Thus, it is less ambiguous to address the interplay data exhibit little correlation between chain lengths between the native topology and folding mechan- and folding rates.5 Here, experimental data are col- isms, making it easier to single out reasons for a lected only for small two-state folding proteins, discrepancy, if any, with experiments. Here, we attempt to extend Clementi, Nymeyer, while the theory dealt with relatively large proteins 26 and thus this may not be inconsistency, but a gap and Onuchic's work to a more comprehensive between small and large proteins. Here, we show level. Performing molecular dynamics simulation on 18 small proteins, most of which are experimen- how the contact order and chain length scaling can tally known to exhibit apparent two-state tran- be coupled to ®ll the gap giving a uni®ed perspec- sition, we address in greater detail the relation tive of folding rates. between folding mechanisms and the native top- Another challenge at the semi-quantitative level ology. In particular, the folding rates are computed is the prediction of folding pathways and charac- and their correlation with the contact order and its teristics of the TSE from a simple funnel theory. In variants is investigated; the degree of native-like particular, comparison of the calculated values f order in the denatured states is analyzed in terms with experimental ones is an excellent way of test- of the contact order; and the folding pathways are ing the funnel theory. During the last few years, analyzed and computed TSEs, i.e. computed f- several groups reported that simple free energy values, are compared with experimental f-value functionals can describe the folding free energy results. surface and quantitatively predict folding path- ways and rate constants with reasonable accuracy.18 ± 23 A basic assumption common in all of these works is that the native contacts (the con- Results tacts that exist in the native structure) alone can We use a minimal protein model proposed by describe the overall shape of a funnel energy land- Clementi et al.26,27 that tries to mimic the perfect scape and interaction energies of any non-native funnel aspect of folding energy landscape. Brie¯y, contacts are neglected. These models are often the protein chain is represented only with Ca called perfect-funnel models, or Go-like mod- 24,25 atoms of every amino acid residues that are con- els. The functional model essentially has two nected via virtual bonds. Both local and non-local competing contributions, one from energetic stab- interactions are set up so that they have the lowest energy at the native structure. In particular, attrac- { The absolute contact order is the average sequence tive contact interactions are introduced between separation of native contacts not divided by the chain native contacts, which are the residue pairs in close length. contact in the native structure, while the rest of Topology and Chain-length in Protein Folding 173 pairs have solely repulsive interactions avoiding SH3 domain) Both curves have peaks at the same chain crossing. temperature which we de®ne as the folding tran- During folding simulation at a temperature T, sition temperature TF. We calculated the free we monitor a measure of nativeness, Q-score. The energy pro®le as a function of Q by the histogram value of Q close to unity means the conformation method, con®rming the two-state transition (see is similar to the native structure whilst a Q-value Figure 1(c) for the case of src SH3 domain). of zero means the conformation is dissimilar to the In the following sub-sections, we analyze folding native structure. At a suf®ciently high (low) tem- mechanisms for 18 single-domain proteins, most of perature, Q is mostly smaller (larger) than 0.5. In which are experimentally considered to exhibit the between them, we can ®nd a temperature where two-state folding transition. The proteins studied Q < 0.5 for half of time and Q > 0.5 for the rest of include mainly a, mainly b, and a/b mixed pro- time (see Figure 1(a) for the case of src SH3 teins ranging from 53 to 153 residues long. domain), which should be close to the transition temperature. Folding and unfolding rate constants Folding rate constants are strongly temperature dependent and thus it is crucial to determine the transition temperature We have computed the folding rate constants at accurately. To this end, with use of the above men- T 0.97TF for 18 small proteins. Figure 2(a) depicts tioned simulation, we employ the multiple histo- theˆ folding rate constants (in a logarithmic scale) as gram method,28 computing the heat capacity a function of the chain length, N, where we see lit- 2 2 2 (Cn ( E E )/kT ) and change in the Q score tle correlation. This is consistent with what Baker ( dˆQh/diÀhT ( iEQ E Q )/kT2) as a function of et al. found in experimental data.5 On the other À h i ˆ h iÀh ih i temperature, T. (see Figure 1(b) for Cn of the src hand, Figure 2(b) and (c) show the folding rates

Figure 1. Example of folding time course and statistical mechanical analysis for the case of src SH3 domain. (a) The native-ness Q-value as a function of time t near the folding transition temperature TF. This shows apparent two-state transition between native (Q 0.9) and denatured (Q 0.15) states. (b) C as a function of temperature T, which exhi-   n bits a clear peak at the folding transition temperature TF. (c) The free energy pro®les F(Q) near the folding transition temperature TF, T 0.96TF (green), T TF (the thick red curve), and T 1.04TF (blue). The denatured and native states are separatedˆ by a free energy barrierˆ around Q 0.4. (d) The site-resolvedˆ folding pathways q (Q) (see the  i text for the explicit de®nition) are plotted along the reaction coordinate Q. qi(Q) is unity (zero) when the local environment of site i is native-like (denatured-like) and the value of qi(Q) is represented by the color; green, yellow, and white correspond to 0, 0.5, and 1, respectively. Positions of b-strands are illustrated as arrows in the left side. 174 Topology and Chain-length in Protein Folding

Figure 2. Folding rate constants of 18 small proteins simulated at T 0.97TF. (a) The rates are plotted against the chain length N where little correlation is seen. (b) The rates are plottedˆ with respect to the relative contact order (RCO). The correlation is found with r 0.69. (c) The rates are plotted against the absolute contact order (ACO) where a stronger correlation (r 0.80) isˆ found. (d) The rates are plotted with respect to RCO x N0.607 at which the strongest correlation is obtainedˆ (r 0.84). ˆ

with respect to the relative contact order (RCO, imentally observed.5 The measured folding time RCO ÆSij/MN where Sij is the sequence separ- varies over six orders of magnitude depending on ationˆ between i and j, the summation is over all topology, while simulated ones are within two native contacts, and M is the number of native con- orders of magnitude. In simulations, the slope tacts) and the absolute contact order (ACO, depends on energy balance between the local and ACO ÆSij/M), respectively. We ®nd signi®cant non-local interactions. By increasing non-local correlationˆ in both cases as in experimental data, interactions relative to local ones, we get much but the correlation here is somewhat stronger with slower rates for complex proteins with large RCO, ACO (the correlation coef®cient r 0.80) than with while the rate constants of helical proteins are not RCO (r 0.69), which is oppositˆe to what was affected as much. Simply due to limit of computer found expˆ erimentally.3 We note that the folding time, we use non-local interactions somewhat rates are computed near T while the experimental weaker than that should be for the same slope in F ln k plot. We anticipate that real proteins have data are mostly collected around 20-25 C, which F stronger non-local interactions than those in this are lower than TF. Thus the difference is not necessarily a contradiction, but rather re¯ects the simulation. more controlled conditions made possible in the simulation. Residual structure in the denatured state If we hypothetically change the chain length Next, we investigate the degree of native-like with the topology ®xed, we see that the RCO is a order in the denatured (D) state, in the transition scale-invariant quantity and ACO RCO N is ˆ  state ensemble (TSE) as well as in the native (N) proportional to the chain length. To further investi- state. The native and denatured states are de®ned gate the chain length scaling, we perform a non- as the local minima of the free energy pro®le (see linear least square ®tting of the folding rates Figure 1(c) for src SH3 domain), while the TSE cor- (ln k A B RCO Nn). This yields n 0.607 F ˆ À Á Á ˆ Æ responds to the barrier top. The Q-value for each 0.179 with r 0.84. The folding rates are plotted of the three states is plotted in Figure 3 against the against RCOˆN0.607 in Figure 2(d). Interestingly, Á RCO, where the red, green and blue dots corre- this is in harmony with the scaling property spond to the D state, the TSE and the N state, 2/3 kF exp( c N ) predicted by Finkelstein & respectively.  À 15 17 Badredtinov and Wolynes at the folding tran- First, we ®nd that the Q-value of the D state, sition temperature. Q(D), is highly correlated with RCO with a nega- We also remark that the slope in the plot of ln kF tive correlation coef®cient (r 0.87); namely pro- versus RCO (Figure 2(b)) is much less than exper- teins with small RCO, whichˆÀ are mainly composed Topology and Chain-length in Protein Folding 175

Folding pathway ensemble and topology To quantify folding pathways, we introduce a local measure of nativeness qi(X) for the ith amino acid residue at a given ensemble of structures X: e e q X h iiX Àh iiD i †ˆ e e h iiN Àh iiD

where ei is the site-energy de®ned as a sum of interaction energies of ith amino acid residue with any other residues. The bracket means an aver- age of the quantity over an ensemblehi that is speci- ®ed as subscript; N and D are the native and denatured states. In particular, fi qi(TS) is an analog of experimental f-value of siteˆ i, under the assumptions of simple transition state theory, and is used for comparison with protein engineering experiments.

Figure 3. The average native-ness Q-values in the SH3 domain and Sso7d denatured states, the native states, and the TSE are plotted with respect to the relative contact order (RCO) The SH3 domain is an interesting example to at the folding transition temperature. We see that the investigate the role of topology versus sequence- degree of residual structures in the denatured states are speci®city for determining folding pathways larger and more signi®cant for a-helical proteins than because comprehensive f-value analysis was per- for b-proteins. formed for two different proteins in the family, src and a-spectrin SH3 domains,7,8 as well as its structural analog Sso7d.23 Although the src and a-spectrin SH3 domains have only a weak sequence homology (36 %), their folding transition state ensembles turned out to be largely similar. of a-helices, tend to have higher degree of native- Namely, the distal b-hairpin and nearby contigu- like order in the D state. This is consistent with ous region are highly structured, while the rest of experimental observations. Although it is not easy the protein is less ordered at the transition state to quantify the residual order in the D state of real ensemble. This similarity to the transition state proteins, there are many papers that report lead to an idea that the protein topology can deter- residual helical structure in helical proteins. For mine the folding transition state. Since the current example, A and H helices of myoglobin29 and the simulation model mostly utilizes topological infor- third helix in the B domain of protein A30 are con- mation of the protein, neglecting sequence speci- sidered to be partially formed in the D state. Some ®city, it is interesting to see if the characteristics of helices are inherently stable without tertiary con- the TSE are well reproduced by the current model. tacts and so some secondary structures may Onuchic et al. already reported simulation results for src SH3 ®nding qualitative agreement with remain in the D state. On the other hand, any experiments,26 which we now investigate more b-strand can not be stable of its own, and thus it is comprehensively and quantitatively. relatively rare for partial b-structure survives in Figure 1(d) shows computed folding pathway the D state. The Q-value of the TSE seems to have ensembles of src SH3 domain. In Figure 1(d), local the same tendency as that of the D state, although measure of native-ness, qi(Q) is represented along the correlation is weaker. the global reaction coordinate Q. We see that the The Q-values of the N state, Q(N), are around distal b-hairpin, which corresponds to the third 0.8 for all proteins, almost independent of the and fourth b-strands, indeed gets native-like order RCO. A weak positive correlation is seen in at an earlier stage than the rest of the protein. This Figure 3, which probably re¯ects that proteins with is consistent with the experimental results. More speci®cally, the TSE corresponds to Q 0.4 where higher RCO have a more global network of con- ˆ tacts that reduces the ¯uctuation in the N state. fi, a computational analog of the experimental f- values, are computed. This is plotted onto the Fluctuation amplitude in the N state of real pro- native structure in Figure 4(a, left), together with teins, however, should depend on the detailed experimentalf values in Figure 4(a, right). Here, the atomic packing as well as coarse-grained topologi- blue and red stand for unstructured (f 0) and cal network and so the weak dependence found structured (f 1), at the TSE, respectively.ˆ Clearly, here might be an artifact of the current simple the major featuresˆ of theory and experiment model. resembles each other albeit minor discrepancy. The 176 Topology and Chain-length in Protein Folding

Figure 4. Both experimental and computed f-values of SH3 domains and their structural analog. The blue and red mean f 0 (unstructured in the TSE), and 1 (native-like in the TSE), respectively, whereas white segments mean no availableˆ experimental data or f-values larger than unity or smaller than zero. Computed (left) and experimental (right) results are shown for comparison, (a) src SH3 domain; (b) a spectrin SH3 domain; (c) Sso7d. Protein structures are drawn with Molscript.34 corresponding results for a-spectrin SH3 domain repeated simulation, we con®rmed that this is not are plotted in Figure 4(b, left) where we can see due to statistical error. The same tendency is found that most of the features of the TSE and folding in a free energy functional approach to src SH3 pathways are very similar to those of src SH3 domain23 and thus quite possibly this is not just an domain, which is also consistent with experimental artifact. results shown in Figure 4(b, right). It is interesting Sso7d is a structural analog of the SH3 domain to note that in Figure 1(d) the ®rst b-hairpin, called where the ®fth b-strand is replaced with an a-helix RT loop, gets some native-like order at very early and it has detectable sequence homology to neither stage of folding, Q 0.25. This order has to be bro- src nor a spectrin SH3 domains. The characteristics ken, however, for further proceed the reaction. By of the TSE of Sso7d were experimentally Topology and Chain-length in Protein Folding 177 determined23 and they turned out to be very differ- because they share the same symmetric topology, ent from those of src and a-spectrin SH3 domains. but the TSEs experimentally observed are different Thus, it is an interesting test of the current top- each other: The protein L has the structured ology-alone model. Figure 4(c, left) shows com- N-terminal b-hairpin at the TSE, while the C-term- puted f-values, which are compared with inal hairpin is ordered in the case of protein G. experimental data given in Figure 4(c, right). We Our simulation gave the high f-values in the see that the simulated TSE is mostly similar to that N-terminal hairpin for both protein L and protein of src SH3 domain, and thus deviates signi®cantly G (data not shown). Despite a nearly symmetric from f-value experiment of Sso7d. This suggests native topology, the polarized TSE again seems to that the native topology of this family drives the be a natural consequence of the topology-only folding pathways to the TSEs of the src SH3 model. Nevertheless in the laboratory, the way of domains, but the pathways are perturbed by the breaking the symmetry is not decided by the top- C-terminal helix and sequence-speci®c interactions ology alone, but by sequence-speci®c information. in Sso7d. We have investigated the characteristics of the TSEs of other proteins, too. Roughly, we ®nd that the computed TSEs are consistent with f-value U1A, AcP, and ADA2h experiments for about half of the proteins. Proteins Despite low mutual sequence homology, all with more complex topologies, such as the ®bro- these proteins have the same super-fold, so-called nectin type 3, tend to give better agreement with the ab plaits composed of babbab secondary struc- experimental results than those of simple topology tures, and characteristics of the TSEs of them have made of a-helices alone, such as l repressor. For been anticipated by the protein engineering experi- such helical proteins, the TSE characteristic are ments.9,10,13 Brie¯y, the acylphosphatase(AcP) and fragile and can easily be perturbed by changes in the activation domain of procarboxypeptidase A2 parameters. (ADA2 h) share the characteristic of TSE's, which is an expanded version of the native structure; the central section of the b-sheet (b-strands 1 and 3) is Discussion more ordered than the other strands and helices are partially formed.9,10 On the other hand, the As mentioned above, the scaling law kF exp( c 0.6  À U1A has more polarized TSE in which b strands 2 RCO N ) found in this paper is slightly different and 3 with helix 1 (one side of the protein) are from what was suggested experimentally 13 k exp( c RCO). However, we note that the tem- organized. F  À We compare computed f-values with exper- peratures in our study are not the same as in the imental ones in Figures 5. For U1A, computed experiments. The experimental data were collected f-values (Figure 5(a) left) are markedly high at mostly 20-25 C. This is somewhat lower than TF. b strands 1 and 3, and helix 1 (left half of Moreover, since kinetic experiments were per- Figure 5), which is perfectly consistent with formed by many different groups, temperatures what is anticipated from experiments (Figure 5(a), used for measurements are different one by one, right). The computed f-values of AcP is plotted which makes detailed interpretation dif®cult. On in Figure 5(b), left) accompanied by the exper- the other hand, we have performed folding simu- imental data in Figure 5(b), right). We see here lations carefully choosing the temperature consist- that the computed TSE signi®cantly deviates ently near TF leading to the scaling law closer to the theoretical prediction by Finkelstein & from the experimental results and is quite simi- 2/3 lar to that of U1A. The results for ADA2 h (data Wolynes, kF exp( cN ). This means that the  À 2/3 not shown) are similar to the case of AcP giving free energy barrier for folding scales as N , which discrepancy from experimental data. In other is the size of the boundary surface of the folding words, the current topology-only model leads to nucleus at the transition temperature{. The original basically the same TSEs for all of these proteins, scaling of their theory itself is, however, hardly implying that the polarized TSE is a direct con- detectable at least for small-to-medium size pro- sequence of the native topology. This suggests teins as we see in Figure 2(a) probably because top- two of the three proteins have the central ology effect is so strong that it washes out signal of b-sheet formed at their TSEs owing to sequence- chain length dependence. Thus, combination of speci®c interactions. two observations, topology effect and chain-length scaling, may gives a uni®ed rule for folding kin- etics near the folding transition temperature. Other cases Wolynes17 predicted that the chain-length scaling 2/3 varies from kF exp( cN ) near TF to a weaker The immunoglobin binding domain of protein L  Àn dependence kF N À at T5TF (but above the and protein G are other interesting examples glass transition temperature). Thus, probably, as decreasing temperature, the chain-length { For a large protein of N residue long, the gyration dependence of the current model becomes weaker, 1/3 radius of the compact form scales as N , and global with the scaling of kF approaching that observed boundary surface area scales as N2/3. experimentally. Further analysis is needed to this 178 Topology and Chain-length in Protein Folding

Figure 5. Both experimental and computed f-values of U1A and Acylphosphatase (AcP). Computed (left) and experimental (right) results are shown for comparison; (a) U1A and (b) AcP. Protein struc- tures are drawn with Molscript.34

direction. In this context, Cieplak & Hoang31 found approximation used by Alm & Baker21 and Munoz n 22 the scaling law of folding rate kF NÀ where n is & Eaton restricts the number of contiguous ®tted to around 2.5 by performing folding simu- native-like segments less than one or two. The val- lation of a simple protein model somewhat similar idity of this approximation depends on the chain to this work. They did the simulation at the stiffness.32 With the stiffness used in the current temperature where the folding is the fastest. In study, at most two contiguous native-like segments the current case, the fastest folding is attained at seem to be suf®cient to describe the folding path- T5TF and thus their observation is consistent with ways as we can see in Figure 1(d). ours based on Wolynes' theory. Using the model proposed by Onuchic et al.2 of For about half of proteins studied, the calculated proteins, we have performed comprehensive fold- TSE's are not similar to what was anticipated from ing simulation on 18 single-domain proteins and experimental f-value analysis. Many reasons can investigated folding rates and pathways with com- be considered for these disagreements. Comparing parison to experimental results. First, we ®nd that with successful prediction by Go-like free energy the folding rate constants scales as kF exp(c RCO functional approaches,20 ± 23 we consider that a N0.6) where RCO and N are the relative contact major property to be taken into account would be order and number of residues, respectively. This an inherent propensity of the secondary structure suggests the scaling law of the folding rates 2/3 of each segment. The hydrogen bond interaction kF exp(c RCO N ) ®lls the gap between the might be another source of missing speci®city. experimentally observed correlation with the RCO Simulation study with these interactions is under and the chain-length scaling property predicted by consideration. For some proteins, non-native inter- Finkelstein15 and Wolynes.17 Second, the residual actions may play a crucial role in the TSEs, native-like order in the denatured state is highly although their importance in general is not clear at correlated with the RCO with negative correlation the moment. coef®cient; proteins with smaller RCO tend to have The straightforward simulation study presented higher degree of residual order in the denatured here provides a good tool to validate approxi- states. This is consistent with the observation that mations made in free energy functional many helical proteins possess partial helices in the approaches. In particular, the contiguous sequence denatured states. Finally, the topology-only model Topology and Chain-length in Protein Folding 179 used in this paper successfully describes the fold- A standard algorithm33 is used for performing con- ing pathways and the TSEs for about half of pro- stant temperature molecular dynamics simulation, where teins studied. the Newtonian equation of motion is numerically solved with velocity rescaled for keeping the temperature. The measure of the native-ness Q(À) is de®ned, for a Materials and Methods given conformation À, as the number of formed native contacts divided by the total number of native contacts, where a contact between i and j is classi®ed as formed Proteins studied when rij is shorter than 1.2r0ij. The proteins swtudied are cytochrome b562 (256b), We estimate the ®rst passage time t1 tfold where t is the time when the protein ®rst reachesˆh thei native apo-myoglobin (1pmb), monomeric l repressor (1lmb), fold albumin binding domain (1prb), Sso7d (1bnz), Im9 topology, which is judged by the Q-value, and the aver- (1imq), ACBD (2abd), cold shock protein B (1nmg), chy- age is taken over 50 trajectories. The folding rate constanthÁÁÁi k is obtained by k 1/t . The TSEs are motripsin inhibitor 2 (1coa), TN ®bronectin type 3 (1ten), F F  1 U1A (1urn), IgG-binding domain of protein G (2gb1), obtained from the same 50 trajectories. Ada2 h (1aye), FKBP (1fkt), IgG-binding domain of pro- tein L (2ptl), a-spectrin SH3 domain (1shg), src SH3 domain (1srl) and Acylphosphatase (2acy), where the PDB code is given in parentheses. Acknowledgments We thank George Chikenji and Rie Tatsumi for helpful Simulation algorithms discussions. The work was supported, in part, by Grant- in-Aid for scienti®c research on priority areas (C) The interaction energy E at a given protein confor- ``Genome Information Science'' and priority areas (A) mation À is given as ``Molecular Physical Chemistry'' from the Ministry of Education, Science, Sports and Culture of Japan. E À; À K r r 2 0†ˆ r i À 0i† bonds X References K y y 2 ‡ y i À 0i† angles 1. Fersht, A. R. (1999). Structure and Mechanism in Pro- X tein Science: A Guide to Enzyme Catalysis and Protein 1 1 K † 1 cos f f † Folding, Freeman, New York. ‡ f f ‰ À i À 0i †Š dihedral 2. Onuchic, J. N., Luthey-Schulten, Z. A. & Wolynes, X 3 3 P. G. (1997). Theory of protein folding: the energy K † 1 cos 3 f f † ‡ f ‰ À i À 0i †Šg landscape perspective. Annu. Rev. Phys. Chem. 48, 545-600. native contact 12 10 3. Grantcharova, V., Alm, E. J., Baker, D. & Horwich, r0ij r0ij e 5 6 A. L. (2001). Mechanisms of protein folding. Curr. ‡ 1 r À r i>j 3 "#ij ij Opin. Struct. Biol. 11, 70-82. XÀ   4. Baker, D. (2000). A surprising simplicity to protein non-native contact C 12 folding. Nature, 405, 39-42. e 5. Plaxco, K. W., Simons, K. T. & Baker, D. (1998). ‡ 2 r i>j 3 ij Contact order, transition state placement and the À  X refolding rates of single domain proteins. J. Mol. In the equation, ri, yi, and fi stand for the ith virtual Biol. 277, 985-994. bond length between ith and (i 1)th amino acid resi- 6. Itzhaki, L. A., Otzen, D. E. & Fersht, A. R. (1995). ‡ dues, the virtual bond angle between (i 1)th and ith The structure of the transition state for folding of À bonds, and the virtual dihedral angle around the ith chymotrypsin inhibitor 2 analysed by protein engin- bond, respectively. rij is the distance between ith and jth eering methods: evidence for a nucleation-conden- amino acid residues. All the parameters with subscript sation mechanism for protein folding. J. Mol. Biol, ``0`` mean the values of the corresponding variables at 254, 260-288. the native structure. The ®rst term keeps the chain con- 7. Martinez, J. C. & Serrano, L. (1999). The folding nectivity, while the second and the third terms represent transition state between SH3 domains is confor- the local torsional interactions. The fourth and ®fth mationally restricted and evolutionarily conserved. terms are non-local interactions, where the former Nature Struct. Biol. 6, 1010-1016. includes native contact interactions and the latter is non- 8. Riddle, D. S., Grantcharova, V. A., Santiago, J. D., speci®c repulsion between non-native pairs. For other Alm, E., Ruczinski, I. & Baker, D. (1999). Experiment (1) parameters, we use Kr 100.0, Ky 20.0, Kf 1.0, and theory highlight role of native state topology in (3) ˆ ˆ ˆ K 0.5, e 0.18, e 0.18, C 4.0 AÊ for all proteins SH3 folding. Nature Struct. Biol. 6, 1016-1024. f ˆ 1 ˆ 2 ˆ ˆ studied. 9. Chiti, F., Taddei, N., White, P. M., Bucciantini, M., We de®ne that the ith and jth amino acid residues are Magherini, F., Stefani, M. & Dobson, C. M. (1999). in the native contacts if one of the non-hydrogen atoms Mutational analysis of acylphosphatase suggests the in the ith amino acid residue are within a critical dis- importance of topology and contact order in protein tance s to any non-hydrogen atom in the jth amino acid folding. Nature Struct. Biol. 6, 1005-1009. residue. We tested both s 5.5 AÊ and 6.5 AÊ in investi- 10. Villegas, V., Martinez, J. C., Aviles, F. X. & Serrano, gating folding reaction of tenˆ different proteins ®nding L. (1998). Structure of the transition state in the fold- no qualitative difference. Here, we present results for ing process of human procarboxypeptidase A2 acti- s 6.5 AÊ . vation domain. J. Mol. Biol. 283, 1027-1036. ˆ 180 Topology and Chain-length in Protein Folding

11. Kim, D. E., Fisher, C. & Baker, D. (2000). A break- variations in the folding pathways. J. Mol. Biol. 304, down of symmetry in the folding transition state of 967-982. protein L. J. Mol. Biol. 298, 971-984. 24. Go, N. (1983). Theoretical studies of protein folding. 12. McCallister, E. L., Alm, E. & Baker, D. (2000). Criti- Annu. Rev. Biophys. Bioeng. 12, 183-210. cal role of b-hairpin formation in protein G folding. 25. Takada, S. (1999). Go-ing for the prediction of pro- Nature Struct. Biol. 7, 669-673. tein folding mechanisms. Proc. Natl Acad. Sci. USA, 13. Ternstrom, T., Mayor, U., Akke, M. & Oliveberg, M. 96, 11698-11700. (1999). From snapshot to movie: f analysis of pro- 26. Clementi, C., Nymeyer, H. & Onuchic, J. N. (2000). tein folding transition states taken one step further. Topological and energetic factors: what determines Proc. Natl Acad. Sci. USA, 96, 14854-14859. the structural details of the transition state ensemble 14. Gutin, A. M., Abkevich, V. & Shakhnovich, E. I. and ``en-route'' intermediates for protein folding? (1996). Chain length scaling of protein folding time. J. Mol. Biol. 298, 937-953. Phys. Rev. Letters, 77, 5433-5436. 27. Clementi, C., Jennings, P. A. & Onuchic, J. N. (2000). 15. Finkelstein, A. V. & Badredtinov, A. Y. (1997). Rate How native-state topology affects the folding of of protein folding near the point of thermodynamic dihydrofolate reductase and interleukin-1 b. Proc. equilibrium between the coil and the most stable Natl Acad. Sci. USA, 97, 5871-5876. chain fold. Fold. Des. 2, 115-121. 28. Kumar, S., Bouzida, D., Swendsen, R. H., Kollman, 16. Thirumalai, D. (1995). From minimal models to real P. A. & Rosenberg, J. M. (1992). Multidimensional proteins: time scales for protein folding kinetics. free-energy calculations using the weighted histo- J. Phys. (France), 15, 1457-1467. gram analysis method. J. Comput. Chem. 13, 1011- 17. Wolynes, P. G. (1997). Folding funnels and energy 1021. landscapes of larger proteins within the capillarity 29. Eliezer, D., Yao, J., Dyson, H. J. & Wright, P. E. approximation. Proc. Natl Acad. Sci. USA, 94, 6170- (1998). Structural and dynamic characterization of 6175. partially folded states of apomyoglobin and impli- 18. Shoemaker, B. A., Wang, J. & Wolynes, P. G. (1997). cations for protein folding. Nature Struct. Biol. 5, Structural correlations in protein folding funnels. Proc. Natl Acad. Sci. USA, 94, 777-782. 148-155. 19. Portman, J. J., Takada, S. & Wolynes, P. G. (1998). 30. Bai, Y., Karimi, A., Dyson, H. J. & Wright, P. E. Variational theory for site resolved protein folding (1997). Absence of a stable intermediate on the fold- free energy surfaces. Phys. Rev. Letters, 81, 5237- ing pathway of protein A. Protein Sci. 6, 14491457. 5240. 31. Cieplak, M. & Hoang, T. X. (2000). Scaling of folding 20. Galzitskaya, O. V. & Finkelstein, A. V. (1999). A properties in Go models of proteins. J. Biol. Phys. 26, theoretical search for folding/unfolding nuclei in 273-294. three-dimensional protein structures. Proc. Natl Acad. 32. Portman, J. J., Takada, S. & Wolynes, P. G. (2001). Sci. USA, 96, 11299-11304. Microscopic theory of protein folding rates. I. Fine 21. Alm, E. & Baker, D. (1999). Prediction of protein- structure of the free energy pro®le and folding folding mechanisms from free-energy landscapes routes from a variational approach. J. Chem. Phys. derived from native structures. Proc. Natl Acad. Sci. 114, 5069-5081. USA, 96, 11305-11310. 33. Berendsen, H. J., Postma, J. P. M., van Gunsteren, 22. Munoz, V. & Eaton, W. A. (1999). A simple model W. F., DiNola, A. & Haak, J. R. (1984). Molecular for calculating the kinetics of protein folding from dynamics with coupling to an external bath. J. Chem. three-dimensional structures. Proc. Natl Acad. Sci. Phys. 81, 3684-3690. USA, 96, 11311-11316. 34. Kraulis, P. J. (1991). A program to produce both 23. Guerois, R. & Serrano, L. (2000). The SH3-fold detailed and schematic plots of protein structures. J family: experimental evidence and prediction of Appl Crystallog. 24, 946-950.

Edited by B. Honig

(Received 20 June 2001; received in revised form 15 August 2001; accepted 15 August 2001)