
Cotranslational folding allows misfolding-prone proteins to circumvent deep kinetic traps

Amir Bitran(a,b), William M. Jacobs(c), Xiadi Zhai(a), and Eugene Shakhnovich(a,1)

(a) Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138; (b) Harvard University Program in Biophysics, Harvard University, Cambridge, MA 02138; and (c) Department of Chemistry, Princeton University, Princeton, NJ 08544

Edited by William A. Eaton, National Institutes of Health, Bethesda, MD, and approved December 12, 2019 (received for review July 31, 2019)

PNAS | January 21, 2020 | vol. 117 | no. 3 | 1485–1495
www.pnas.org/cgi/doi/10.1073/pnas.1913207117

Many large proteins suffer from slow or inefficient folding in vitro. It has long been known that this problem can be alleviated in vivo if proteins start folding cotranslationally. However, the molecular mechanisms underlying this improvement have not been well established. To address this question, we use an all-atom simulation-based algorithm to compute the folding properties of various large protein domains as a function of nascent chain length. We find that for certain proteins, there exists a narrow window of lengths that confers both thermodynamic stability and fast folding kinetics. Beyond these lengths, folding is drastically slowed by nonnative interactions involving C-terminal residues. Thus, cotranslational folding is predicted to be beneficial because it allows proteins to take advantage of this optimal window of lengths and thus avoid kinetic traps. Interestingly, many of these proteins' sequences contain conserved rare codons that may slow down synthesis at this optimal window, suggesting that synthesis rates may be evolutionarily tuned to optimize folding. Using kinetic modeling, we show that under certain conditions, such a slowdown indeed improves cotranslational folding efficiency by giving these nascent chains more time to fold. In contrast, other proteins are predicted not to benefit from cotranslational folding due to a lack of significant nonnative interactions, and indeed these proteins' sequences lack conserved C-terminal rare codons. Together, these results shed light on the factors that promote proper protein folding in the cell and how biomolecular self-assembly may be optimized evolutionarily.

cotranslational folding | codon usage | self-assembly

Significance

Many proteins must adopt a specific structure to perform their functions, and failure to do so has been linked to disease. Although small proteins often fold rapidly and spontaneously to their native conformations, larger proteins are less likely to fold correctly due to the myriad incorrect arrangements they can adopt. Here, we provide mechanistic insights into how this problem can be alleviated if proteins start folding while they are being translated by the ribosome. This process of cotranslational folding biases certain proteins away from misfolded states that tend to hinder spontaneous refolding. Signatures of unusually slow translation suggest that some of these proteins have evolved to fold cotranslationally.

Author contributions: A.B., W.M.J., and E.S. designed research; A.B. and X.Z. performed research; A.B. analyzed data; and W.M.J. and E.S. wrote the paper.
The authors declare no competing interest.
This article is a PNAS Direct Submission.
Published under the PNAS license.
Data deposition: A dataset containing folding rates and free energies for all protein constructs included in this publication has been deposited in Figshare (https://figshare.com/articles/Analyzed data/11496954).
1 To whom correspondence may be addressed. Email: shakhnovich@chemistry.harvard.edu.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1913207117/-/DCSupplemental.
First published January 7, 2020.

Many large proteins refold from a denatured state very slowly in vitro (on timescales of minutes or slower) while others do not spontaneously refold at all (1–6). Given that proteins must fold rapidly and efficiently in the crowded cellular environment, how is this conundrum resolved? The answer involves a number of factors that affect cellular folding, but which are absent in vitro. For example, molecular chaperones such as GroEL in Escherichia coli and TriC and HSP90 in eukaryotes may substantially improve folding efficiency by passively confining unfolded chains to promote their folding or by expending energy to repeatedly anneal misfolded chains until the correct structure is attained (6–12). A second factor that may affect in vivo folding efficiency is cotranslational folding on the ribosome (13–23), which can improve the folding of as much as 30% of the E. coli proteome (20). A recent set of works (13, 14) suggests that protein synthesis rates may be under evolutionary selection to allow for cotranslational folding. Namely, these works show that conserved stretches of rare codons, which are typically translated more slowly than their synonymous counterparts, are significantly enriched roughly 30 amino acids upstream of chain lengths at which folding is predicted to begin. This 30 amino acid gap is expected given that the ribosome exit tunnel sequesters the last ∼30 amino acids of a nascent chain and generally impedes their folding. The observed enrichment of conserved pauses at folding-competent chain lengths suggests that cotranslational folding may be under positive evolutionary selection. However, the specific mechanisms by which cotranslational folding is beneficial have not been elucidated.

Here, we address this question using an all-atom computational method for inferring protein-folding pathways and rates while accounting for the possibility of nonnative conformations. We apply this method to compute protein-folding properties at various nascent chain lengths to investigate how vectorial synthesis affects cotranslational folding efficiency. We find that for certain large proteins, vectorial synthesis is beneficial because it allows nascent chains to fold rapidly at shorter chain lengths, prior to the synthesis of C-terminal residues which stabilize nonnative kinetic traps. Many of these proteins' sequences contain conserved rare codons ∼30 amino acids downstream of these faster-folding intermediate lengths, suggesting these sequences may have evolved to provide enough time for cotranslational folding. We also identify counterexamples: proteins without conserved rare codons that do not misfold into deep kinetic traps and for which vectorial synthesis thus confers no advantage. Together, these results provide a detailed molecular picture of how vectorial synthesis may improve in vivo folding efficiency and how cotranslational folding speed may be optimized evolutionarily.

Results

Predicting Cotranslational Folding Properties of Nascent Chains. To compute cotranslational folding pathways and rates, we developed a simulation-based method and analysis pipeline described in

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

[Fig. 1 panel labels: Replica exchange simulations; Unfolding simulations; Compute folding rates from detailed balance; Repeat for multiple chain lengths, incorporate into kinetic model]

Fig. 1. (Top Left) We run replica exchange atomistic simulations with a knowledge-based potential and umbrella sampling to compute a protein's free-energy landscape. (Bottom Left) To obtain barrier heights, we run high-temperature unfolding simulations and extrapolate unfolding rates down to lower temperatures assuming Arrhenius kinetics. (Top Right) The principle of detailed balance is then used to compute folding rates. (Bottom Right) The process is repeated at multiple chain lengths and incorporated into a kinetic model of cotranslational folding. For details, see Materials and Methods.
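The two numerical steps named in the caption, Arrhenius extrapolation of unfolding rates and conversion to folding rates by detailed balance, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the plain least-squares fit, and the reduced units (k_B = 1, so free energies and temperatures share units) are assumptions made only for illustration.

```python
import math

def arrhenius_extrapolate(temps, rates, target_temp):
    """Fit ln(k) = ln(A) - E_a / T by least squares and evaluate at target_temp.

    Hypothetical helper: temps and rates are unfolding rates measured at
    high simulation temperatures (reduced units, k_B = 1).
    """
    xs = [1.0 / t for t in temps]          # Arrhenius plots are linear in 1/T
    ys = [math.log(k) for k in rates]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return math.exp(intercept + slope / target_temp)

def folding_rate_from_detailed_balance(k_unfold, g_folded, g_unfolded, temp):
    """Detailed balance: k_fold / k_unfold = exp(-(G_folded - G_unfolded) / T)."""
    return k_unfold * math.exp(-(g_folded - g_unfolded) / temp)

# Usage with made-up numbers: rates generated from k = 10 * exp(-5 / T).
high_temps = [1.0, 1.1, 1.2]
unfold_rates = [10.0 * math.exp(-5.0 / t) for t in high_temps]
k_u = arrhenius_extrapolate(high_temps, unfold_rates, target_temp=0.5)
k_f = folding_rate_from_detailed_balance(k_u, g_folded=-5.0, g_unfolded=0.0, temp=0.5)
```

The extrapolation is only as good as the Arrhenius assumption itself, which the text ties to large barriers between intermediates.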

Fig. 1 and Materials and Methods. The method utilizes an all-atom Monte Carlo simulation program with a knowledge-based potential and a realistic move set described previously (24–26). In essence, rather than simulating a protein's folding ab initio from an unfolded ensemble (which is intractable for large proteins at reasonable simulation timescales), we simulate unfolding and, in tandem, calculate the free energies of the folded, unfolded, and various intermediate states from simulations with enhanced sampling. Given rates of sequential unfolding between these states and their free energies, the reverse folding rates can be computed from detailed balance. Importantly, our sequence-based potential energy function is not biased toward the native state, as in native-centered (Gō) models, and allows for the possibility of nonnative interactions. Thus we can account for the role of misfolded states in folding kinetics. This method is applied at multiple chain lengths to predict cotranslational folding properties.

Our approach here is valid as long as the following conditions hold: 1) The ribosome does not significantly affect cotranslational folding pathways. Previous work suggests that the ribosome's destabilizing effect on nascent chains is relatively modest, typically 1 to 2 kcal/mol (27), and affects various folding intermediates to a comparable extent (28). Thus, the ribosome is expected not to drastically affect the relative stability of the different intermediates computed here and is not included in our simulations. 2) Unfolding rates obey Arrhenius kinetics, such that rates computed at high temperatures can be readily extrapolated to lower temperatures. This holds as long as the barriers between intermediates are large so that a local equilibrium is reached in each free-energy basin prior to unfolding. 3) Nonnative contacts form on timescales faster than the timescales of native folding transitions. This condition, which has previously been verified in lattice simulations (29, 30), is also satisfied for the misfolded states observed here, which are dominated by short-range interactions that form rapidly compared to the long-range native contacts. This implies that a protein's folding landscape can be described by macrostates characterized by certain folded native elements in fast equilibrium with nonnative contacts that are compatible with the currently folded elements. Because of this separation of timescales, the transitions between macrostates defined in this way are approximately Markovian and can therefore be reproduced by a coarse-grained kinetic model (Materials and Methods).

MarR, an E. coli Protein with Conserved Rare Codons, Adopts Stable Cotranslational Folding Intermediates. We began by simulating the cotranslational folding of a protein previously shown to contain conserved rare codons ∼30 amino acids downstream of a

possible cotranslational folding intermediate (13): E. coli multiple antibiotic resistance regulator (MarR). MarR, a transcriptional repressor (31–33), natively assembles into a homodimer with each monomer composed of a winged helix DNA-binding region and a helical dimerization region (Fig. 2A). To investigate whether individual monomers are stable, we ran equilibrium replica exchange simulations with umbrella sampling using our all-atom potential (Materials and Methods). We find that the dimerization region is folded a fraction of the time, while the DNA-binding region is stably folded the majority of the time at temperatures below T ≈ 0.9 T_M (Fig. 2B, blue dotted line), where T_M is the monomer melting temperature (SI Appendix, Fig. S1B). These results indicate that the monomer acquires a substantial amount of native structure in isolation.

Fig. 2. (A) Structure of native MarR dimer bound to DNA (Left) as well as monomer (Right) with highlighted dimerization region (green), DNA-binding region (blue), and a crucial beta hairpin involved in stabilizing the DNA-binding region (gold). (B) Mean fraction of native contacts per subunit for monomeric and dimeric MarR as a function of temperature normalized by the DNA-binding region melting temperature. The dimer melting temperature is indicated by the left dotted line, followed by the monomeric DNA-binding region melting temperature (right dotted line). Sample monomeric structures from each temperature range are shown, illustrating melting of the dimerization region. (C) Predicted folding pathway of MarR monomer. (See text for details.) (D) (Top) At various chain lengths, we plot the equilibrium probability that the structural elements associated with each folding step in the MarR monomer folding pathway are folded (gold, hairpin folding; blue, DNA-binding region folding; green, dimerization region folding). Xs indicate the minimum chain lengths at which each step is possible. (Bottom) For each length shown in Top, we plot the rate of the slowest folding step, DNA-binding region formation. A narrow window of chain lengths that confers both folding speed and stability is highlighted in purple. Both Top and Bottom are shown at a simulation temperature of T = 0.51 T_M. Error bars on folding rates are obtained from bootstrapping (Materials and Methods).

We next turned to investigating the monomer's folding pathway. We find that the monomer folds in three steps (Fig. 2C) characterized by 1) the relatively fast folding of a crucial beta hairpin composed of residues valine 84 through leucine 100 (gold in Fig. 2), which scaffolds the final structure; 2) the completion of the entire DNA-binding region folding, which is the rate-limiting step involving the formation of long-range contacts between one of the strands in the hairpin (leucine 97 through leucine 100) and another beta strand composed of alanine 53 through threonine 56 (blue in Fig. 2); and finally 3) folding of the dimerization region (green in Fig. 2), which is reversible as the various helices composing this region rapidly exchange between native and nonnative tertiary arrangements (SI Appendix, Fig. S1B). Naturally, the dimerization region becomes substantially more ordered in the presence of a dimeric partner. Rates for each folding step as a function of temperature are shown in SI Appendix, Fig. S2.

Having predicted the monomer's folding pathway, we wondered whether these folding steps can take place cotranslationally. To test this, we truncated residues from the C terminus of the protein and ran equilibrium simulations of the resulting nascent-like chains at various lengths. At each length, we computed the probability that the tertiary contacts associated with each folding step are formed at equilibrium (Fig. 2D, Top; see Materials and Methods for details). We find that as soon as the crucial beta hairpin (gold in Fig. 2) has been fully synthesized at length 100, both beta-hairpin folding and the rate-limiting step (DNA-binding region folding) become thermodynamically favorable, suggesting folding can begin cotranslationally at this length (SI Appendix, Fig. S1F). This finding is in agreement with prior analysis using a coarse-grained model, which predicts a cotranslational folding intermediate at a similar chain

length (SI Appendix, Fig. S1I). Meanwhile, the helix consisting of residues methionine 1 through serine 34 is stabilized by loose nonnative contacts with the DNA-binding region (SI Appendix, Fig. S1H), as the C-terminal helices with which it pairs to form the dimerization region have not yet been synthesized. These helices have been partially synthesized by length 112, but dimerization-region folding is still unfavorable at this point. The entirety of the C-terminal helices must be synthesized, which occurs around the full monomer length of 144, for the dimerization region to acquire partial stability (≈70% folded at the temperature shown). We note that, because the dimerization region is largely composed of C-terminal residues, the monomers are not expected to dimerize cotranslationally, consistent with proteome-wide trends against cotranslational homodimerization (34).

These results are reported at a simulation temperature of T = 0.51 T_M, where T_M is the DNA-binding region melting temperature. We chose this temperature because it is slightly below the dimer melting temperature of T ≈ 0.65 T_M (Fig. 2B) and corresponds to a physiologically reasonable folding stability of ∼5 k_BT (SI Appendix, Fig. S1B). However, our results are consistent across temperature choices below the dimer melting temperature (SI Appendix, Fig. S1E). We further note that, although real physiological temperatures typically lie only slightly below protein melting temperatures, our temperature choice of T = 0.51 T_M is nonetheless reasonable in our model because our potential energy function is temperature independent.

MarR Folding Rate Rapidly Decreases beyond 100 Amino Acids Due to Nonnative Interactions. We next asked how the folding kinetics for MarR's rate-limiting folding step, namely DNA-binding region folding, change as the nascent chain elongates beginning at 100 amino acids. We find that for a narrow window around this length, the rate-limiting step is both thermodynamically favorable and relatively fast (Fig. 2D, Bottom). Beyond 100 amino acids, this step becomes dramatically slower. By length 112, this rate has decreased by roughly 1,000-fold, and by the time the monomer is fully synthesized (144 amino acids [AAs]), the rate has decreased by roughly 2,000-fold relative to the 100-AA partial chain. This slowdown far exceeds what is predicted from general scaling laws of folding time as a function of length (1, 35–37). For instance, the power-law scaling proposed by Gutin et al. (36), τ ∼ L^4, predicts only an ∼4-fold slowdown between lengths 100 and 144 AA, since (144/100)^4 ≈ 4.3. The discrepancy between this general scaling and our observed dramatic slowdown suggests that factors specific to MarR are at play. One possibility is nonnative intermediates. To test this hypothesis, we turned off the contribution of nonnative contacts to the potential energy by rerunning simulations in an all-atom Gō potential in which only native contacts contribute (38, 39). In stark contrast to the full


Fig. 3. (A) DNA-binding region folding rate as a function of temperature at nascent chain length 100 (dashed line) and for full MarR (solid line), using the all-atom potential (Left) and a native-centric potential in which nonnative interactions have been turned off (Right). Symbols indicate temperatures at which the partial chain folds significantly faster than the full monomer (p < 0.01) based on bootstrapped distributions (Materials and Methods). Rates are plotted only at temperatures where the folding free-energy difference is ≤20 k_BT, owing to the large statistical uncertainties associated with free-energy differences greater than this. The resulting temperature range is different in the two potentials, hence the differing x scales. (B) Free-energy difference between configurations prior to the rate-limiting step that are kinetically trapped (defined as having at least five nonnative contacts that must be broken before the rate-limiting step can occur) and those that are not trapped, as a function of temperature, for both the partial MarR chain at length 100 and full MarR. (C) Mean nonnative contact maps for the two most prevalent clusters (Materials and Methods) among full MarR simulation snapshots in which the DNA-binding region is not folded, along with representative structures. Contacts involving the C terminus that must be broken before folding can proceed are circled in red on the maps and highlighted on the respective structures.
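The trapped versus nontrapped comparison in panel B can be sketched as follows. This is an illustrative simplification, not the authors' analysis: it assumes each pre-rate-limiting snapshot has already been reduced to a count of nonnative contacts that must be broken, treats all snapshots as equally weighted samples at one temperature (replica exchange data would normally need reweighting), and uses reduced units with k_B = 1. The function name and the example counts are hypothetical.

```python
import math

def trapped_free_energy(blocking_contact_counts, temp, threshold=5):
    """Free energy of the trapped ensemble relative to the nontrapped one.

    A snapshot is called trapped if it has at least `threshold` nonnative
    contacts that must be broken before the rate-limiting step can occur.
    Returns G_trapped - G_nontrapped = -T * ln(N_trapped / N_nontrapped);
    more negative values correspond to deeper kinetic traps.
    """
    n_trapped = sum(1 for c in blocking_contact_counts if c >= threshold)
    n_nontrapped = len(blocking_contact_counts) - n_trapped
    if n_trapped == 0 or n_nontrapped == 0:
        raise ValueError("both ensembles must be sampled to estimate a free-energy difference")
    return -temp * math.log(n_trapped / n_nontrapped)

# Made-up counts for eight snapshots: five are trapped, three are not.
example_counts = [6, 7, 5, 1, 0, 9, 8, 2]
delta_g = trapped_free_energy(example_counts, temp=1.0)
```

The explicit error for an empty ensemble mirrors the caveat in the text that the estimate becomes unreliable when nontrapped structures are observed extremely infrequently.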

knowledge-based potential (Fig. 3A, Left), the native-only potential predicts that, below the melting temperature, the full protein folds dramatically faster than the partial chain at length 100 (Fig. 3A, Right). Furthermore, whereas the full potential predicts that both folding rates drop with decreasing temperature, the native-only potential predicts that the folding rates remain constant or increase with decreasing temperature. These findings can be explained by two effects related to nonnative contacts, namely 1) the partial chain is normally stabilized by loose nonnative contacts, and so their absence leads to a reduced thermodynamic driving force for folding (SI Appendix, Figs. S1H and S2E), and 2) the absence of nonnative contacts eliminates kinetic trapping for the full protein at low temperatures. As a result, the folding rate now increases, rather than decreases, with lowering temperature due to a stronger thermodynamic driving force. These observations point to the importance of nonnative interactions in producing the observed orders-of-magnitude slowdown in MarR folding rate at lengths beyond 100 amino acids. Interestingly, although no nonnative frustration is observed in the native-only potential, we do observe the possibility of topological frustration (40), where certain native contacts that initially form must be temporarily broken before the rate-limiting step can occur (SI Appendix, Fig. S1J). However, these states are rarely observed in the full potential, indicating that they make a negligible contribution to the folding pathway relative to the nonnatively trapped, low free-energy states.

As an additional test of the role of nonnative contacts, we examined snapshots that have yet to undergo the rate-limiting step and identified ones that are kinetically trapped, defined as having five or more nonnative contacts that need to be broken before the rate-limiting step can occur. Snapshots that do not fulfill this criterion are deemed nontrapped and generally take on a looser, more molten-globule–like structure. We then computed the free-energy difference between these trapped and nontrapped ensembles as a measure for the stability of misfolded kinetic traps (Fig. 3B). For all temperatures below the melting temperature, this free-energy difference is less negative for the MarR chain at length 100 than for the full protein. We note that at temperatures below T ≈ 0.85 T_M, nontrapped structures are observed extremely infrequently, leading to large uncertainties in this free-energy calculation. We thus do not plot these temperatures. But the trend at temperatures above T ≈ 0.85 T_M clearly suggests that the full protein experiences deeper kinetic traps. Although we define trapped snapshots here as ones that have five or more nonnative contacts, our results are robust to the choice of this threshold value (SI Appendix, Fig. S2F).

Since kinetic traps are deeper at chain lengths beyond 100 amino acids, we hypothesized these traps are stabilized by nonnative contacts involving residues at sequence positions beyond 100, which would otherwise compose the dimerization region in the native structure. To test this, we constructed average nonnative contact maps of full protein snapshots prior to the rate-limiting step and clustered the contact maps (Materials and Methods). Indeed, the two most heavily populated clusters are composed of snapshots whose topologies differ substantially from the native state and include multiple nonnative contacts involving residues beyond 100 (Fig. 3C). In the first cluster (Fig. 3C, Left), residues 51 to 55, which natively pair with the beta-strand 95 to 100, are instead sequestered into a nonnative hydrophobic core that is stabilized by C-terminal residues. In the second cluster (Fig. 3C, Right), the beta-strand 95 to 100 forms a nonnative hairpin with residues 106 to 111, again impeding the native pairing with residues 51 to 55. Notably, many of the residues involved in stabilizing these nonnative traps, particularly cluster 2, are already synthesized at length 112, thus explaining why the rate of folding is already much slower at that length than at length 100. Together, these contact maps further highlight the importance of C-terminal nonnative contacts in MarR in drastically slowing folding as the nascent chain elongates.

Kinetic Modeling Predicts That Vectorial Synthesis Helps MarR Circumvent Deep Kinetic Traps. Given that nascent MarR folding is fastest at chain lengths around 100 AAs, we hypothesized that vectorial synthesis may significantly improve folding efficiency compared to what would be possible with unassisted posttranslational folding. To test this, we developed a kinetic model of cotranslational folding (Fig. 4A; details in Materials and Methods). Our kinetic model assumes that cotranslational folding can be characterized by a fixed number of length regimes, namely chain-length intervals for which the folding properties are nearly constant. For MarR, informed by the calculations described above, we identified three such regimes: 1) 100 to 112 amino acids, at which point folding is relatively fast; 2) 112 to 144 amino acids; and 3) the full 144 amino acid monomer. These latter two regimes show similar folding properties, namely much slower folding, and are both depicted together as corresponding to a single row in Fig. 4A. We assume that the protein spends a fixed amount of time at each length regime, during which it can fold or unfold as a continuous-time Markov process (Materials and Methods), prior to proceeding to the next regime via irreversible synthesis. This model contains two free parameters: 1) the simulation temperature, which is kept at T = 0.51 T_M as before, and 2) the ratio of the folding timescale to the synthesis timescale. This ratio cannot be determined from Monte Carlo simulations, which compute folding timescales in arbitrary Monte Carlo steps (although relative rates between different folding steps or lengths can be computed).

In Fig. 4B, Left, we incorporate our computed folding rates for MarR into the kinetic model and plot the resulting probability of occupying different folding intermediates over time. We choose a set of parameters for which the effect of vectorial synthesis is particularly pronounced; namely we assume the slowest folding rate is 6 · 10^−3 times the protein synthesis rate. For these parameters, enough time is spent at the 100 to 112 amino acid length regime that the DNA-binding region folds in roughly 50% of nascent chains (green and blue curves). The other half remains trapped in misfolded states (red curve). In contrast, an analogous simulation of posttranslational folding shows no appreciable folding during this time period owing to the deep traps (SI Appendix, Fig. S3A). Although vectorial synthesis is clearly advantageous, we wondered whether the advantage can be enhanced by slowing down synthesis around the optimal folding length of 100. In vivo such a slowdown may result from a conserved stretch of rare codons which occurs in MarR roughly 30 amino acids downstream of this length (SI Appendix, Fig. S3B). Indeed, we find that increasing the time spent in the 100 to 112 length regime by a factor of 6 increases the population that has undergone the rate-limiting step (green and blue curves) to nearly 100% (Fig. 4B, Right). This suggests that, for these parameters, a rare-codon induced slowdown around length 100 significantly improves cotranslational folding efficiency.

We next varied our model's free parameters to test the generality of these results. In Fig. 4C, we show the mean time to fold if folding occurs entirely posttranslation, divided by the same quantity when folding is cotranslational. This ratio is a proxy for the benefit due to vectorial synthesis, with a value greater than 1 implying a benefit. We plot this ratio as a function of the unknown folding/synthesis timescale ratio, assuming that rare codons increase the time spent at the 100 to 112 length regime by various factors. We find that vectorial synthesis is always beneficial, although as expected this benefit diminishes as the folding/synthesis rate ratio approaches zero, as the chain no longer has enough time to fold at length 100 (SI Appendix, Fig. S3C). Furthermore, slowing down synthesis due to rare codons


Fig. 4. (A) Schematic of kinetic model (see main text and Materials and Methods for details). Dimerization is shown for completeness, but not accounted for in the kinetic model. (B) Probability of occupying different states as a function of time, assuming the slowest folding rate is 6 · 10^−3 times the protein synthesis rate (under constant translation speed). We further assume either no slowdown at conserved rare codons between residues 100 and 112 (Left) or a sixfold slowdown at rare codons (Right) (main text and Materials and Methods). States are colored as in A (black, no native tertiary structure; gold, beta hairpin folded; red, beta hairpin folded with significant nonnative contacts; blue, DNA-binding region folded; green, fully folded), and sample structures are shown. We neglect lengths prior to 100, at which point no folding occurs. (C) Fractional reduction in the mean time to complete synthesis and folding as a function of the unknown synthesis rate, assuming various percent slowdowns at rare codons indicated by numbers over the curves.
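To make the logic of the length-regime model concrete, the sketch below collapses it to two states (rate-limiting step completed or not) and propagates the folded fraction through successive regimes with the closed-form two-state relaxation. It is a deliberately reduced illustration, not the model described in Materials and Methods: the two-state simplification, the function names, and all numerical rates and dwell times are assumptions chosen only to show how a longer dwell in a fast-folding regime raises the final folded fraction.

```python
import math

def folded_fraction_after_regime(p_folded, k_fold, k_unfold, dwell_time):
    """Relax a two-state population for dwell_time at fixed rates.

    Solves dp/dt = k_fold * (1 - p) - k_unfold * p in closed form.
    """
    k_total = k_fold + k_unfold
    if k_total == 0.0:
        return p_folded
    p_eq = k_fold / k_total
    return p_eq + (p_folded - p_eq) * math.exp(-k_total * dwell_time)

def cotranslational_folded_fraction(regimes, p_start=0.0):
    """Propagate the folded fraction through (k_fold, k_unfold, dwell_time) regimes.

    Synthesis is irreversible: the population at the end of one regime is
    the starting population of the next.
    """
    p = p_start
    for k_fold, k_unfold, dwell_time in regimes:
        p = folded_fraction_after_regime(p, k_fold, k_unfold, dwell_time)
    return p

# Hypothetical numbers: a fast regime followed by a 1,000-fold slower one.
k_fast = math.log(2.0)        # chosen so one time unit folds half the chains
k_slow = k_fast / 1000.0
no_pause = cotranslational_folded_fraction([(k_fast, 0.0, 1.0), (k_slow, 0.0, 3.0)])
sixfold_pause = cotranslational_folded_fraction([(k_fast, 0.0, 6.0), (k_slow, 0.0, 3.0)])
```

With these made-up parameters, lengthening the dwell in the fast regime sixfold moves the folded fraction from about one half to close to one, which is the qualitative behavior the caption describes for the two panels of B.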

improves this benefit as long as the folding/synthesis rate ratio is less than ∼0.01. For ratios above this, folding at intermediate lengths is fast enough that there is no benefit from slowing down synthesis (SI Appendix, Fig. S3D). Thus in summary, our model predicts that 1) for nearly all parameter values, MarR cotranslational folding improves folding efficiency by helping nascent chains overcome deep kinetic traps, and 2) assuming a reasonable range of timescales, rare codons tune synthesis rates so that a nascent MarR monomer can optimally exploit the faster folding rates available to it at lengths around 100 amino acids.

Nonnative Interactions Explain Rare Codon Usage in Multiple Proteins. We then applied these methods to investigate the folding of other E. coli proteins which were previously predicted to form stable folding intermediates upstream of conserved rare codon stretches (13). For each, we plot the native stability and the slowest folding rate as a function of chain length at a chosen temperature where the folding stability is physiologically reasonable (∼5 to 15 k_BT). One example is the beta-ketoacyl-(acyl carrier protein) reductase, or FabG, an essential enzyme involved in fatty acid synthesis (Fig. 5A and SI Appendix, Fig. S4). As with MarR, our simulations point to a rapid increase in monomer stability around 85 amino acids, at which point enough of the protein has been synthesized that a folding core composed of three N-terminal beta strands can fold (Fig. 5A, Top). This early folding step, which is rate limiting overall, slows down somewhat beyond length 85 and even more beyond length 128, again owing to the protein's tendency to adopt misfolded states that differ dramatically in topology from the native state and are stabilized by C-terminal nonnative interactions (Fig. 5A, Bottom and SI Appendix, Fig. S4 F–H). Thus, vectorial synthesis benefits FabG folding by allowing the chain to take advantage of these shorter lengths. The sequence contains various stretches of rare codons, each of which is predicted to potentially enhance this benefit under different conditions (SI Appendix, Fig. S4 I–K). Another protein that shows similar behavior is the enzyme cytidylate kinase (CMK) (Fig. 5B and SI Appendix, Fig. S5). Our simulations predict that nonnative kinetic traps lead to very slow CMK folding, consistent with previous experimental findings that the protein refolds on timescales of minutes (41). We further

find that the stability notably increases with length at around 145 amino acids, even though our force field predicts a folded fraction of only ∼0.1 at this length (Fig. 5 B, Top). Slight inaccuracies in the force field may change this exact value, but our observation of a rapid increase in stability around this chain length is expected to be qualitatively robust. As with the other proteins, this critical chain length corresponds to the point at which the rate-limiting step (beta-core nucleation) is fastest, as nonnative contacts significantly slow this step at longer lengths (Fig. 5 B, Bottom and SI Appendix, Fig. S5 E and F). Furthermore, the chain-length window that corresponds to both increasing stability and relatively fast folding once again occurs roughly 30 amino acids downstream of a conserved stretch of rare codons (SI Appendix, Fig. S5G). We note that, owing to large barriers in CMK's landscape, the simulations did not converge adequately at low enough temperatures to allow for reliable folding-rate calculations. We thus compute folding rates only at higher temperatures very close to the full protein's melting temperature, at which point thermal stabilities are poor. However, we expect these trends to extend to lower, more physiologically reasonable temperatures, at which point the difference in folding rates, and thus the benefit due to vectorial synthesis, may be even more substantial.

Fig. 5. (A–D) As a function of chain length, the equilibrium probability that tertiary structure elements associated with the rate-limiting step are formed (Top) and the folding rate associated with the rate-limiting step (Bottom) are shown for proteins (A) FabG, (B) CMK, (C) DHFR, and (D) HemK. For each protein, the native structure (A–D, Top) and a sample structure that has yet to undergo the rate-limiting folding step (A–D, Bottom) are shown, with C-terminal nonnative contacts that must be broken prior to this step highlighted in red. Blue Xs (Top) indicate the lengths at which the first amino acids associated with the rate-limiting step have been synthesized, while black Xs (Bottom) indicate that no folding rate is computed because, even though enough residues have been synthesized for the rate-limiting structures to fold, their folding stability is low. As before, for each protein, we work at a temperature at which the fully synthesized chain shows a folding stability of ∼5 to 15 kBT. For more details pertaining to each protein, see SI Appendix. (E) For each protein simulated, we indicate whether stable cotranslational folding intermediates are formed, deep kinetic traps slow folding, and conserved C-terminal rare codons are found in the sequence.

Counterexamples. Using our methodology, we also identified proteins for which vectorial synthesis and rare-codon induced pauses confer no benefit. We began by considering E. coli dihydrofolate reductase (DHFR) (Fig. 5C and SI Appendix, Fig. S6), an essential enzyme which is known to fold rapidly (42–45). Indeed, our simulations predict no deep kinetic traps for DHFR: the kinetic trap depth for unfolded states, computed as in Fig. 3B, is nearly zero at physiologically reasonable temperatures (SI Appendix, Fig. S6F). Rather, the unfolded ensemble is

characterized by loose, molten-globule–like states with significantly higher energy than the native state (Fig. 5 C, Bottom and SI Appendix, Fig. S6 E–G). Our predicted folding pathway (SI Appendix, Fig. S6D) is in agreement with previous studies, which show that DHFR folds in multiple steps with fast relaxation times and no significant off-pathway intermediates (41, 42). Owing to this smooth folding landscape, we predict no advantage to vectorial synthesis, because even though the chain can fold at an intermediate length of 149 (Fig. 5 C, Top), the folding kinetics hardly change with length (Fig. 5 C, Bottom). This is consistent with the protein's codon usage: Although E. coli DHFR contains C-terminal rare codons (SI Appendix, Fig. S6H), they are not conserved and their synonymous substitution has been shown not to affect in vivo soluble protein levels or E. coli fitness (45). [However, conserved N-terminal rare codons were shown to be crucial for mRNA folding to ensure accessibility of the Shine–Dalgarno sequence (45).] In addition to DHFR, we simulated the N-terminal domain of HemK (residues 1 to 74; Fig. 5D and SI Appendix, Fig. S7), a protein whose cotranslational folding pathway has been studied using Förster resonance energy transfer (FRET) by Holtkamp et al. (15). We find that the domain can adopt a stable native-like structure at around 40 amino acids, consistent with an observed increase in FRET near this length by Holtkamp et al. (15) (Fig. 5 D, Top). But as with DHFR, slowing down synthesis at this length is predicted to confer no advantage (Fig. 5 D, Bottom), as the full domain folds rapidly and experiences only shallow folding traps at physiological temperatures (SI Appendix, Fig. S7G). Consistent with this, the HemK N-terminal domain shows no conserved rare codons (SI Appendix, Fig. S7H). Our results for every protein we simulate are summarized in Fig. 5E.

Discussion

Together, these results shed light on how vectorial synthesis and its regulation affect the efficiency of in vivo cotranslational folding for various proteins depending on their nascent chain properties. The main takeaway is summarized in Fig. 6. For the relatively large single-domain proteins MarR, FabG, and CMK, we identify a narrow window of chain lengths at which folding is both favorable and fast. Prior to this length, the nascent chain cannot yet adopt native-like structures, while beyond this length, the folding rate drops by orders of magnitude. This dramatic drop in folding rate far exceeds what is expected due to increasing chain length alone (1, 35, 36) and instead results from deep nonnative contacts involving C-terminal residues, which must be broken before folding can proceed. Interestingly, these nonnative interactions occur entirely within individual domains, in contrast to previous works which suggest that nonnative interactions between multiple domains can be avoided via cotranslational folding (46, 47). Vectorial synthesis can thus substantially decrease the time required for folding by allowing individual domains to exploit the narrow window of lengths at which problematic C-terminal residues have not yet been synthesized. At sufficiently fast translation speeds, vectorially synthesized proteins may still tend to fold posttranslationally, and so slowing synthesis at these critical lengths is necessary to promote cotranslational folding. In the case of MarR, FabG, and CMK, this prediction is consistent with the presence of conserved C-terminal rare codons ∼30 amino acids downstream. Our results may also explain why other proteins lack conserved C-terminal rare codons. Namely, for DHFR and the HemK N-terminal domain, we find that although cotranslational folding is possible, it is not advantageous relative to posttranslational folding because the full proteins fold rapidly without populating significant kinetic traps.

Fig. 6. For misfolding-prone proteins that can fold cotranslationally, the overall folding rate is optimized if the nascent chain has time to start folding at the earliest length at which stable folding can occur. At this point, the chain's folding landscape is still relatively smooth (blue arrow). If the nascent chain's folding rate at this critical length is slightly slower than the synthesis rate, then slowing down synthesis using rare codons roughly 30 amino acids downstream is beneficial. In contrast, delaying folding until further synthesis is complete (red arrow) leads to deep kinetic traps stabilized by C-terminal residues, which significantly slow folding.

This study generates specific experimentally testable predictions regarding the molecular mechanisms by which vectorial synthesis speeds up folding, and it also advances our general understanding of codon usage in proteins. For decades, it has been known that synonymous mutations which alter translation speed can affect the folding of large proteins, potentially reducing fitness (18, 48) or exacerbating disease symptoms (49–51). However, the mechanism for these effects has not been established. Other studies have examined the role of evolutionarily conserved clusters of rare codons at domain boundaries, suggesting that these may give individual domains time to fold cotranslationally (52). But more recent work has shown that conserved rare codons may be found at any chain length at which folding can begin and not exclusively at domain boundaries (13, 14). These studies did not, however, establish a rationale for slowing down synthesis in the middle of a domain. Our work provides a potential mechanistic explanation for these observations, pointing to the crucial role of misfolded intermediates stabilized by C-terminal residues. In the cell, such intermediates may be involved in harmful aggregation, an effect that is not considered in our model but which may further heighten selection for cotranslational folding. Although our work focuses on proteins for which pauses in synthesis benefit folding, in other cases, slowing synthesis has been shown to hinder proper in vivo folding (16, 53, 54), particularly if nascent chains tend to misfold rather than adopting native-like structure. Finally, it is worth noting that some rare codons, particularly at the 5′ end of genes, have evolved for reasons unrelated to cotranslational folding, for instance to promote proper mRNA folding (45, 55, 56) or to minimize ribosome jamming (57). However, our work focuses on rare codons farther downstream in coding sequences, at which point a nascent chain will be synthesized to a greater extent and cotranslational folding becomes possible.

More generally, this work expands our understanding of how evolution optimizes the folding of large, misfolding-prone proteins in vivo. Besides vectorial synthesis and codon usage, another regulatory strategy involves chaperones. Growing evidence suggests that these two strategies may work in tandem in the cell, as chaperones such as trigger factor, DnaK, and TRiC have been shown to bind nascent chains and promote cotranslational folding (4, 8, 9, 58). Thus, rare codons may serve an additional role of slowing synthesis to give time for chaperones to bind. This may be especially beneficial if cotranslational folding intermediates are nonnative-like or aggregation prone or if

these intermediates must undergo slow steps such as proline isomerization. Our method for studying cotranslational folding, including the role of misfolded intermediates, can be applied in the future to shed light on these potentially myriad additional roles for factors such as chaperones that regulate protein folding in vivo.

Materials and Methods

Selection of Proteins. We compiled a list of single-domain proteins identified in ref. 13 as having a stretch of at least three evolutionarily conserved rare codons located at least 30 amino acids downstream of a substantial drop in native folding free energy. To maximize computational feasibility, we simulated the three shortest proteins from this list, namely MarR (144 amino acids), CMK (228 amino acids), and FabG (244 amino acids). The only additional results that were excluded from this publication involve RSME (244 amino acids), whose knotted topology did not allow for adequate simulation convergence, and ISPA (300 amino acids), whose native state was found to be marginally unstable in our force field, an issue which can be corrected in the future through minor modifications to the potential function. We additionally selected DHFR and HemK as counterexamples because they lack conserved C-terminal rare codons and their folding pathways have been well characterized experimentally.

Atomistic Monte Carlo Simulations. Our algorithm for computing folding rates utilizes atomistic Monte Carlo simulations with a knowledge-based potential and a realistic move set comprising backbone and sidechain rotations (24–26). For each full protein construct and intermediate chain length, we performed the following steps:

1) A starting structure was downloaded from the Protein Data Bank (PDB) (PDB IDs for each protein are shown in SI Appendix, Table S1). This starting structure was equilibrated in the full potential at a very low simulation temperature for 15 to 30 million Monte Carlo (MC) steps with harmonic umbrella biasing along native contacts. Umbrella biasing during equilibration increases the likelihood that the protein undergoes slight conformational changes relative to the starting structure that are necessary to attain the lowest-energy configuration in the potential. Nascent chain constructs at intermediate lengths (for example, MarR at length 100) were then generated by truncating the C terminus of the equilibrated full protein structure and equilibrating these truncated structures as was done for the respective complete PDB structure.

2) To compute equilibrium thermodynamic properties, we ran replica exchange simulations using an added harmonic umbrella-sampling bias with respect to the number of native contacts. These simulations were run for 200 to 800 million MC steps at a wide range of temperatures. For some proteins, the initial 200 to 600 million MC steps additionally implemented a knowledge-based move set (59) to aid the protein in finding energy minima at intermediate numbers of native contacts. However, the time steps that utilized these moves were not included in the free-energy calculations, since these moves do not satisfy detailed balance.

3) To compute rates of unfolding, we ran simulations without replica exchange or umbrella sampling at temperatures near or above the melting temperature. For all proteins, simulations were run starting from the equilibrated native structure. For FabG and CMK, we additionally ran unfolding simulations beginning from intermediate states containing a high degree of nonnative structure, extracted from low-temperature trajectories in the replica exchange simulations. Such simulations allow for a better estimate of the unfolding rate for these partially nonnative intermediates at low temperatures.

Simulation Analysis and Folding Rate Computation. To investigate a given construct's folding properties, we first generated native contact maps of the respective fully synthesized and equilibrated structure and identified islands of long-range contacts referred to as substructures (60). Native contact maps and identified substructures for each protein are shown in SI Appendix. We then defined a coarse-grained folding landscape characterized by transitions between states defined by a subset of formed substructures. Such states are referred to as topological configurations (60). For example, for fully synthesized MarR, topological configurations include abc (only substructures a, b, and c are folded), abcdef (all substructures are folded), and ∅ (no substructures folded; SI Appendix, Fig. S1). The resulting network of topological configurations is analogous to a Markov state model (61) in which states are defined based on structural features, rather than directly from kinetic information. This is justified because the folding/unfolding of a native substructure typically requires the forming/breaking of a loop, which is associated with a large free-energy barrier. Thus, topological configurations show Markovian dwell-time distributions, as microstates consistent with a topological configuration rapidly equilibrate relative to the timescale of transition between topological configurations (60).

Having defined substructures for a given protein, we assigned all simulation snapshots from replica exchange simulations to a topological configuration in accordance with which substructures are formed. Using these replica exchange simulations, we then used the multistate Bennett acceptance ratio (MBAR) method (62) to compute a potential of mean force (PMF) as a function of topological configuration; examples for MarR are shown in SI Appendix. The MBAR method was also used to compute PMFs as a function of number of native contacts or of presence/absence of kinetic trapping (as in Fig. 3C). The PMF as a function of native contacts was used to compute a thermal average number of native contacts at each temperature, as in Fig. 2B.

To analyze unfolding simulations, we first assigned snapshots from these simulations to topological configurations, as above. To account for possible misclassification due to structural ambiguity, we fitted the unfolding trajectories to a hidden Markov model that assumes a constant and uniform probability of misclassification to any incorrect configuration. We then identified clusters, or sets of topological configurations that are in rapid exchange. This was accomplished by defining a kinetic distance between topological configurations, defined as the average time to transition between them, and then clustering together configurations whose distance is below some threshold. The threshold was chosen to ensure a substantial separation between the timescales of exchange within the resulting clusters and exchange between clusters. This again ensures that clusters show Markovian dwell time distributions, which we have verified for MarR. The resulting clusters for each protein construct are shown in SI Appendix, Fig. S1. Each snapshot from the unfolding simulations was then assigned to a cluster. At each simulation temperature, we then computed rates of unfolding between clusters and fitted the log rates as a function of temperature to the Arrhenius equation. SI Appendix shows that the Arrhenius equation provides a good fit for the observed MarR unfolding rates. Using the Arrhenius equation, we then extrapolated unfolding rates to lower, more physiologically reasonable temperatures. We also computed the relative free energies of each cluster at those temperatures using the PMFs as a function of topological configuration obtained previously. From these unfolding rates and free energies, the folding rates between clusters were calculated using detailed balance. Namely, for two clusters $i$ and $j$, the forward and reverse transition rates $\lambda_{i\to j}$ and $\lambda_{j\to i}$ satisfy the ratio

\[
\frac{\lambda_{i\to j}}{\lambda_{j\to i}} = \frac{P^{\mathrm{eq}}_{j}}{P^{\mathrm{eq}}_{i}} = e^{-(F_j - F_i)/kT}, \qquad [1]
\]

where $F_i$ and $F_j$ are the relative free energies of the respective clusters.

For each protein construct, we performed a bootstrap analysis to obtain an error distribution on folding rates by resampling unfolding trajectories 1,000 times with replacement. All folding rates in the main text are expressed in inverse MC sweeps (MC steps scaled by protein length) to allow for meaningful comparisons between different chain lengths. Our observed rates span ∼12 orders of magnitude (Fig. 5 A–C), roughly comparable to the range in experimentally measured folding times from microseconds to days (1, 4). We tested our method on HemK, for which folding transitions are fast enough for their rate to be directly calculated, and obtained good agreement (SI Appendix, Fig. S7).

Using the PMFs as a function of topological configuration, we computed the equilibrium probabilities of forming structures associated with the rate-limiting folding step (Figs. 2D and 5) as follows: First, we identified the cluster that the protein transitions into during the rate-limiting step. For MarR, this would be the cluster consisting of [abc, bc, bcd]. We then identified the substructures that are formed in the least-folded configuration assigned to this cluster (b and c for MarR) and computed the Boltzmann probability that the protein occupies any configuration in which at least these substructures are formed. The minimum chain length at which this step can occur (colored Xs in the plots) was defined as the first length such that, for each of these substructures identified above, at least one native contact belonging to that substructure can form.

Simulations with Native-Only Potential. These simulations for MarR at 100 residues and full MarR were run and analyzed as in the previous section, but with only native contacts found in the equilibrated structure contributing to the energy (38, 39). The values for attraction between native contacts, as well as added modest repulsion between nonnative contacts, were tuned so that the ratio of the ground-state energies of full MarR and MarR at 100 residues is close to that in the full knowledge-based potential.

Clustering Nonnative Contact Maps. To cluster misfolded states in accordance with which nonnative contacts are present, we made nonnative contact maps of all snapshots assigned to a given topological configuration of interest at a set temperature range. The nonnative clusters for MarR in Fig. 3C include snapshots assigned to configuration b. We then assigned a distance between every pair of snapshots, defined as the Hamming distance between the contact maps (including only nonnative contacts that are not present in the equilibrated native structure), and defined a distance threshold such that pairs of snapshots whose distance is less than this threshold are defined as adjacent. We formed clusters by finding the disconnected components of the resulting adjacency matrix. For most proteins, a distance threshold of 100 produced clusters that are structurally distinct and well defined, but results are robust to this precise value. Having defined clusters, we produced nonnative contact maps for each cluster by averaging the contact maps of snapshots assigned to that cluster. Each resulting average contact map depicts the frequency with which nonnative contacts are observed in a given set of structurally similar misfolded states.

Kinetic Model of Cotranslational Folding. To model cotranslational folding, we defined a set of length regimes, each of which corresponds to an interval of chain lengths for which the protein's folding properties are assumed to be constant. These folding properties are obtained by simulating a nascent chain at a length that is assumed to be representative of the length regime and then applying the methods of the previous sections. At each length regime $L$, we define $P^{L,T}(t)$ as the vector of probabilities of occupying different clusters as a function of time at a given temperature $T$. Assuming continuous-time Markovian dynamics, $P^{L,T}(t)$ satisfies the master equation

\[
\frac{d}{dt} P^{L,T}(t) = M^{L}(T)\, P^{L,T}(t), \qquad [2]
\]

where $M^{L}(T)$ is a transition matrix whose entries are given by

\[
M^{L}_{ij}(T) =
\begin{cases}
\lambda^{L}_{j\to i}(T) & \text{if } i \neq j \\
-\sum_{k \neq j} \lambda^{L}_{j\to k}(T) & \text{if } i = j
\end{cases}, \qquad [3]
\]

where the folding/unfolding rates $\lambda^{L}_{j\to i}(T)$ at length regime $L$ are computed as described previously.

At each length $L$, the master equation is solved for an amount of time $\tau_L$ corresponding to the total time spent at length $L$, given an initial probability distribution $P^{L,T}(0)$. At the first length regime at which folding can occur, $P^{L,T}(0)$ is assumed to be one at the cluster containing the unfolded state (topological configuration ∅) and zero elsewhere. After time $\tau_L$, the probability $P^{L,T}(\tau_L)$ becomes the new initial distribution, $P^{L',T}(0)$, at the next length regime $L'$, and the master equation is solved again given a new $M^{L'}(T)$. In the case that cluster $c$ at length $L$ does not have an exact match at length $L'$, then for each cluster $c'$ at length $L'$, we define a similarity between $c$ and $c'$ as the average number of substructures that must be formed or broken to transition from a topological configuration in $c$ to one in $c'$. We then find the $c'$ that is most similar to $c$ and propagate element $c$ of $P^{L,T}(\tau_L)$ to element $c'$ of $P^{L',T}(0)$. The time spent at a given length regime $\tau_L$ is computed using

\[
\tau_L = \tau_{\mathrm{fast}} N^{L}_{\mathrm{fast}} + \tau_{\mathrm{rare}} N^{L}_{\mathrm{rare}}, \qquad [4]
\]

where $\tau_{\mathrm{fast}}$ and $\tau_{\mathrm{rare}}$ are the average times to translate a fast and a rare codon, respectively, while $N^{L}_{\mathrm{fast}}$ and $N^{L}_{\mathrm{rare}}$ are the numbers of fast and rare codons in the length regime $L$. The values of $\tau_{\mathrm{fast}}$ and $\tau_{\mathrm{rare}}$ relative to the characteristic folding times are unknown and varied as free parameters as described in the main text.

In addition to computing how probability distributions evolve in time, we can compute the mean time to completion of synthesis and folding $\tau_{\mathrm{total}}$ (Fig. 4C). To do this, we solve and propagate the probability distribution until the fully synthesized length regime $F$ is reached, and then evaluate the sum

\[
\tau_{\mathrm{total}} = \sum_{L} \tau_L + \sum_{c} P^{F,T}_{c}(0)\, \tau^{F}_{\mathrm{fold},c}, \qquad [5]
\]

where the second sum is over clusters in the full length $F$, $P^{F,T}_{c}(0)$ is the initial probability of occupying cluster $c$ (obtained by propagating from the penultimate length regime as described above), and $\tau^{F}_{\mathrm{fold},c}$ is the mean first-passage time to reach the folded cluster starting from cluster $c$. This mean first-passage time is obtained by setting an absorbing boundary at the folded cluster and solving the equation

\[
\left(M^{L}(T)\right)^{\top} \tau^{F}_{\mathrm{fold}} = -\mathbf{1}, \qquad [6]
\]

where $\left(M^{L}(T)\right)^{\top}$ is the transpose of the transition matrix, $\tau^{F}_{\mathrm{fold}}$ is a vector whose elements are the mean first-passage times to the folded cluster from each initial cluster $c$, and the right-hand side is a vector of negative ones.

Data Availability. A dataset containing folding rates and free energies for all protein constructs included in this publication has been deposited in Figshare (https://figshare.com/articles/Analyzed_data/11496954).

ACKNOWLEDGMENTS. The computations in this paper were run on the Odyssey cluster supported by the Faculty of Arts and Sciences Division of Science, Research Computing Group at Harvard University. A.B. was funded by the National Science Foundation Graduate Research Fellowship Program (DGE1745303) and a Harvard Molecular Biophysics Training Grant (Principal Investigator James M. Hogle, NIH/National Institute of General Medical Sciences T32 GM008313). W.M.J. was funded by NIH Grant F32GM116231. E.S. was funded by NIH Grant R01 GM124044.
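As an illustration of the Kinetic Model of Cotranslational Folding described in Materials and Methods, the sketch below builds the transition matrix of Eq. 3, integrates the master equation (Eq. 2) for the time given by Eq. 4, and solves Eq. 6 for mean first-passage times. It is a simplified stand-in rather than the authors' implementation: the three-cluster rate table, the codon times, and the fixed-step Runge–Kutta integrator are all assumptions made for this example.

```python
def build_transition_matrix(rates):
    """rates[j][i] is the rate from cluster j to cluster i; returns M as in Eq. 3."""
    n = len(rates)
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j:
                M[i][j] = rates[j][i]     # off-diagonal: flow from j into i
                M[j][j] -= rates[j][i]    # diagonal: minus total flow out of j
    return M

def propagate(M, p0, tau, steps=2000):
    """Integrate dP/dt = M P (Eq. 2) over a time tau with fixed-step RK4."""
    n = len(p0)
    dt = tau / steps

    def deriv(v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    p = list(p0)
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([p[i] + 0.5 * dt * k1[i] for i in range(n)])
        k3 = deriv([p[i] + 0.5 * dt * k2[i] for i in range(n)])
        k4 = deriv([p[i] + dt * k3[i] for i in range(n)])
        p = [p[i] + dt * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) / 6.0
             for i in range(n)]
    return p

def time_at_length(tau_fast, tau_rare, n_fast, n_rare):
    """Time spent in a length regime (Eq. 4)."""
    return tau_fast * n_fast + tau_rare * n_rare

def mean_first_passage_times(M, folded):
    """Solve (M_sub)^T tau = -1 (Eq. 6), with the folded cluster made absorbing."""
    keep = [c for c in range(len(M)) if c != folded]
    m = len(keep)
    # Augmented matrix of the transposed system, right-hand side of -1
    A = [[M[keep[col]][keep[row]] for col in range(m)] + [-1.0] for row in range(m)]
    for c in range(m):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(m):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    tau = [0.0] * len(M)
    for idx, c in enumerate(keep):
        tau[c] = A[idx][m] / A[idx][idx]
    return tau

# Hypothetical clusters: 0 = unfolded, 1 = nonnative trap, 2 = folded
rates = [[0.0, 1.0, 1.0],   # unfolded -> trap, unfolded -> folded
         [1.0, 0.0, 0.0],   # trap -> unfolded only
         [0.0, 0.0, 0.0]]   # folded treated as absorbing
M = build_transition_matrix(rates)
tau_L = time_at_length(tau_fast=0.05, tau_rare=0.3, n_fast=8, n_rare=2)
p_end = propagate(M, [1.0, 0.0, 0.0], tau_L)
mfpt = mean_first_passage_times(M, folded=2)
```

Raising `tau_rare` lengthens `tau_L`, which gives the chain more time at this length regime before its distribution is handed to the next one. The mean total time of Eq. 5 would then be the sum of the `tau_L` values plus the `mfpt` entries weighted by the initial distribution at full length.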

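The procedure described under Clustering Nonnative Contact Maps (Hamming distances between binary contact maps, an adjacency threshold, and clusters taken as the separate components of the resulting graph) can be sketched as follows. The flattened tuple representation of a contact map, the function names, and the tiny example maps are assumptions for illustration; the threshold of 100 quoted in the text applies to full-size maps, not to this toy input.

```python
def hamming(a, b):
    """Number of positions at which two flattened binary contact maps differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_contact_maps(maps, threshold):
    """Label each map by the connected component of the graph in which two maps
    are adjacent when their Hamming distance is below the threshold."""
    n = len(maps)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and hamming(maps[u], maps[v]) < threshold:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

def average_map(maps, labels, cluster):
    """Per-contact frequency over all snapshots assigned to one cluster."""
    members = [m for m, lab in zip(maps, labels) if lab == cluster]
    return [sum(col) / len(members) for col in zip(*members)]

# Hypothetical nonnative contact maps, flattened to six possible contacts each
maps = [
    (1, 1, 0, 0, 0, 0),
    (1, 1, 1, 0, 0, 0),
    (0, 0, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1),
]
labels = cluster_contact_maps(maps, threshold=2)
freq0 = average_map(maps, labels, cluster=0)
```

Here the first two maps differ by one contact and fall into one cluster, as do the last two, while the two groups differ by at least four contacts and stay separate. The averaged map for each cluster gives the frequency with which each nonnative contact appears among its members.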
1. A. N. Naganathan, V. Muñoz, Scaling of folding times with protein size. J. Am. Chem. Soc. 127, 480–481 (2005).
2. J. A. Houwman, C. P. van Mierlo, Folding of proteins with a flavodoxin-like architecture. FEBS J. 284, 3145–3167 (2017).
3. T. Suren et al., Single-molecule force spectroscopy reveals folding steps associated with hormone binding and activation of the glucocorticoid receptor. Proc. Natl. Acad. Sci. U.S.A. 115, 11688–11693 (2018).
4. Z. N. Scholl, W. Yang, P. E. Marszalek, Chaperones rescue luciferase folding by separating its domains. J. Biol. Chem. 289, 28607–28618 (2014).
5. J. L. Sohl, S. S. Jaswal, D. A. Agard, Unfolded conformations of α-lytic protease are more stable than its native state. Nature 395, 817–819 (1998).
6. M. J. Kerner et al., Proteome-wide analysis of chaperonin-dependent protein folding in Escherichia coli. Cell 122, 209–220 (2005).
7. J. Weaver et al., GroEL actively stimulates folding of the endogenous substrate protein PepQ. Nat. Commun. 8, 15934 (2017).
8. K. Döring et al., Profiling Ssb-nascent chain interactions reveals principles of Hsp70-assisted folding. Cell 170, 298–311 (2017).
9. A. Y. Yam et al., Defining the TRiC/CCT interactome links chaperonin function to stabilization of newly made proteins with complex topologies. Nat. Struct. Mol. Biol. 15, 1255–1262 (2008).
10. S. Chakrabarti, C. Hyeon, X. Ye, G. H. Lorimer, D. Thirumalai, Molecular chaperones maximize the native state yield on biological times by driving substrates out of equilibrium. Proc. Natl. Acad. Sci. U.S.A. 114, E10919–E10927 (2017).
11. M. Taipale et al., Quantitative analysis of Hsp90-client interactions reveals principles of substrate recognition. Cell 150, 987–1001 (2012).
12. S. Bershtein, W. Mu, A. W. Serohijos, J. Zhou, E. I. Shakhnovich, Protein quality control acts on folding intermediates to shape the effects of mutations on organismal fitness. Mol. Cell 49, 133–144 (2013).
13. W. M. Jacobs, E. I. Shakhnovich, Evidence of evolutionary selection for cotranslational folding. Proc. Natl. Acad. Sci. U.S.A. 114, 11434–11439 (2017).
14. J. L. Chaney et al., Widespread position-specific conservation of synonymous rare codons within coding sequences. PLoS Comput. Biol. 13, e1005531 (2017).
15. W. Holtkamp et al., Cotranslational protein folding on the ribosome monitored in real time. Science 350, 1104–1107 (2015).
16. F. Buhr et al., Synonymous codons direct cotranslational folding toward different protein conformations. Mol. Cell 61, 341–351 (2016).
17. R. Bartoszewski et al., Codon bias and the folding dynamics of the cystic fibrosis transmembrane conductance regulator. Cell Mol. Biol. Lett. 21, 23 (2016).
18. J. Fu et al., Codon usage affects the structure and function of the Drosophila circadian clock protein PERIOD. Genes Dev. 30, 1761–1775 (2016).
19. C. Kimchi-Sarfaty et al., A "silent" polymorphism in the MDR1 gene changes substrate specificity. Science 315, 525–528 (2007).
20. P. Ciryam, R. I. Morimoto, M. Vendruscolo, C. M. Dobson, E. P. O'Brien, In vivo translation rates can substantially delay the cotranslational folding of the Escherichia coli cytosolic proteome. Proc. Natl. Acad. Sci. U.S.A. 110, E132–E140 (2013).
21. P. L. Clark, J. King, A newly synthesized, ribosome-bound polypeptide chain adopts conformations dissimilar from early in vitro refolding intermediates. J. Biol. Chem. 276, 25411–25420 (2001).
22. M. S. Evans, I. M. Sander, P. L. Clark, Cotranslational folding promotes β-helix formation and avoids aggregation in vivo. J. Mol. Biol. 383, 683–692 (2008).
23. S. J. Kim et al., Translational tuning optimizes nascent protein folding in cells. Science 348, 444–448 (2015).
24. J. S. Yang, W. W. Chen, J. Skolnick, E. I. Shakhnovich, All-atom ab initio folding of a diverse set of proteins. Structure 15, 53–63 (2007).
25. E. Kussell, J. Shimada, E. I. Shakhnovich, A structure-based method for derivation of all-atom potentials for protein folding. Proc. Natl. Acad. Sci. U.S.A. 99, 5343–5348 (2002).
26. I. A. Hubner, E. J. Deeds, E. I. Shakhnovich, Understanding ensemble protein folding at atomic detail. Proc. Natl. Acad. Sci. U.S.A. 103, 17747–17752 (2006).

27. A. J. Samelson, M. K. Jensen, R. A. Soto, J. H. D. Cate, S. Marqusee, Quantitative determination of ribosome nascent chain stability. Proc. Natl. Acad. Sci. U.S.A. 113, 13402–13407 (2016).
28. K. Liu, J. E. Rehfus, E. Mattson, C. M. Kaiser, The ribosome destabilizes native and non-native structures in a nascent multidomain protein. Protein Sci. 26, 1439–1451 (2017).
29. D. K. Klimov, D. Thirumalai, Multiple protein folding nuclei and the transition state ensemble in two-state proteins. Proteins Struct. Funct. Genet. 43, 465–475 (2001).
30. L. A. Mirny, V. Abkevich, E. I. Shakhnovich, Universality and diversity of the protein folding scenarios: A comprehensive analysis with the aid of a lattice model. Fold. Des. 1, 103–116 (1996).
31. A. S. Seoane, S. B. Levy, Characterization of MarR, the repressor of the multiple antibiotic resistance (mar) operon in Escherichia coli. J. Bacteriol. 177, 3414–3419 (1995).
32. R. Martin, J. Rosner, Binding of purified multiple antibiotic-resistance repressor protein (MarR) to mar operator sequences. Proc. Natl. Acad. Sci. U.S.A. 92, 5456–5460 (1995).
33. V. Duval, L. M. McMurry, K. Foster, J. F. Head, S. B. Levy, Mutational analysis of the multiple-antibiotic resistance regulator marR reveals a ligand binding pocket at the interface between the dimerization and DNA binding domains. J. Bacteriol. 195, 3341–3351 (2013).
34. E. Natan et al., Cotranslational protein assembly imposes evolutionary constraints on homomeric proteins. Nat. Struct. Mol. Biol. 25, 279–288 (2018).
35. T. J. Lane, V. S. Pande, Inferring the rate-length law of protein folding. PLoS One 8, e78606 (2013).
36. A. M. Gutin, V. I. Abkevich, E. I. Shakhnovich, Chain length scaling of protein folding time. Phys. Rev. Lett. 77, 5433–5436 (1996).
37. D. Thirumalai, From minimal models to real proteins: Time scales for protein folding kinetics. J. Phys. I 5, 1457–1467 (1995).
38. J. Shimada, E. L. Kussell, E. I. Shakhnovich, The folding thermodynamics and kinetics of crambin using an all-atom Monte Carlo simulation. J. Mol. Biol. 308, 79–95 (2001).
39. J. Shimada, E. I. Shakhnovich, The ensemble folding kinetics of protein G from an all-atom Monte Carlo simulation. Proc. Natl. Acad. Sci. U.S.A. 99, 11175–11180 (2002).
40. R. D. Hills, C. L. Brooks, Subdomain competition, cooperativity, and topological frustration in the folding of CheY. J. Mol. Biol. 382, 485–495 (2008).
41. T. Beitlich, T. Lorenz, J. Reinstein, Folding properties of cytosine monophosphate kinase from E. coli indicate stabilization through an additional insert in the NMP binding domain. PLoS One 8, e78384 (2013).
42. D. K. Heidary, J. C. O'Neill, M. Roy, P. A. Jennings, An essential intermediate in the folding of dihydrofolate reductase. Proc. Natl. Acad. Sci. U.S.A. 97, 5866–5870 (2002).
43. T. Inanami, T. P. Terada, M. Sasai, Folding pathway of a multidomain protein depends on its topology of domain connectivity. Proc. Natl. Acad. Sci. U.S.A. 111, 15969–15974 (2014).
44. J. V. Rodrigues et al., Biophysical principles predict fitness landscapes of drug resistance. Proc. Natl. Acad. Sci. U.S.A. 113, E1470–E1478 (2016).
45. S. Bhattacharyya et al., Accessibility of the Shine-Dalgarno sequence dictates N-terminal codon bias in E. coli. Mol. Cell 70, 894–905.e5 (2018).
46. G. N. Jacobson, P. L. Clark, Quality over quantity: Optimizing co-translational protein folding with non-'optimal' synonymous codons. Curr. Opin. Struct. Biol. 38, 102–110 (2016).
47. I. M. Sander, J. L. Chaney, P. L. Clark, Expanding Anfinsen's principle: Contributions of synonymous codon selection to rational protein design. J. Am. Chem. Soc. 136, 858–861 (2014).
48. M. Zhou et al., Non-optimal codon usage affects expression, structure and function of clock protein FRQ. Nature 495, 111–115 (2013).
49. G. Gervasini et al., Polymorphisms in ABCB1 and CYP19A1 genes affect anastrozole plasma concentrations and clinical outcomes in postmenopausal breast cancer patients. Br. J. Clin. Pharmacol. 83, 562–571 (2017).
50. A. Lazrak et al., The silent codon change I507-ATC->ATT contributes to the severity of the ΔF508 CFTR channel dysfunction. FASEB J. 27, 4630–4645 (2013).
51. I. J. Purvis et al., The efficiency of folding of some proteins is increased by controlled rates of translation in vivo. A hypothesis. J. Mol. Biol. 193, 413–417 (1987).
52. C. McCarthy, A. Carrea, L. Diambra, Bicodon bias can determine the role of synonymous SNPs in human diseases. BMC Genom. 18, 227 (2017).
53. D. D. Nedialkova, S. A. Leidel, Optimization of codon translation rates via tRNA modifications maintains proteome integrity. Cell 161, 1608–1618 (2015).
54. L. M. Alexander, D. H. Goldman, L. M. Wee, C. Bustamante, Non-equilibrium dynamics of a nascent polypeptide during translation suppress its misfolding. Nat. Commun. 10, 2709 (2019).
55. D. B. Goodman, G. M. Church, S. Kosuri, Causes and effects of N-terminal codon bias in bacterial genes. Science 342, 475–479 (2013).
56. G. Kudla, A. W. Murray, D. Tollervey, J. B. Plotkin, Coding-sequence determinants of gene expression in Escherichia coli. Science 324, 255–258 (2009).
57. T. Tuller et al., An evolutionarily conserved mechanism for controlling the efficiency of protein translation. Cell 141, 344–354 (2010).
58. S. Pechmann, F. Willmund, J. Frydman, The ribosome as a hub for protein quality control. Mol. Cell 49, 411–421 (2013).
59. W. W. Chen, J. S. Yang, E. I. Shakhnovich, A knowledge-based move set for protein folding. Proteins 66, 682–688 (2007).
60. W. M. Jacobs, E. I. Shakhnovich, Structure-based prediction of protein-folding transition paths. Biophys. J. 111, 925–936 (2016).
61. B. E. Husic, V. S. Pande, Markov state models: From an art to a science. J. Am. Chem. Soc. 140, 2386–2396 (2018).
62. M. R. Shirts, J. D. Chodera, Statistically optimal analysis of samples from multiple equilibrium states. J. Chem. Phys. 129, 124105 (2008).

PNAS | January 21, 2020 | vol. 117 | no. 3 | 1495

BIOPHYSICS AND COMPUTATIONAL BIOLOGY