The Pennsylvania State University
The Graduate School
Department of Chemistry

MODELING AND PREDICTING CO-TRANSLATIONAL PROTEIN FOLDING WITH CHEMICAL KINETIC AND MOLECULAR DYNAMIC SIMULATIONS

A Dissertation in Chemistry by

Daniel A. Nissley

© 2019 Daniel A. Nissley

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

May 2019

The dissertation of Daniel A. Nissley was reviewed and approved* by the following:

Edward P. O’Brien
Assistant Professor of Chemistry and of the Institute for CyberScience
Dissertation Adviser
Chair of Committee

Phillip C. Bevilacqua
Head of Chemistry Department
Distinguished Professor of Chemistry and Biochemistry and Molecular Biology

William G. Noid
Associate Professor of Chemistry

Reka Albert
Distinguished Professor of Physics and Biology

*Signatures are on file in the Graduate School


ABSTRACT

Proteins are linear polymers of amino acids that perform myriad functions within living cells. Most proteins must form a specific three-dimensional structure, often referred to as the native state, in order to perform their biological function. The process of reaching the native state is termed protein folding. It is often assumed that thermodynamics alone determines the specific conformation of the native state of a protein, meaning that its structure is determined by the amino-acid sequence alone. Within living cells, however, proteins are translated based on mRNA templates by the ribosome in a distinctly out-of-equilibrium fashion. The non-equilibrium nature of translation means that kinetics can trump thermodynamics in determining the conformation of a protein, and recent experiments indicate that seemingly small changes to the kinetics of translation can radically alter protein structure and function and even lead to disease. One way in which the speed of translation can be perturbed is through synonymous codon mutations, which change the rate of translation but not the primary sequence of the nascent protein that is produced. Understanding how the non-equilibrium nature of protein synthesis influences the likelihood that a given protein will correctly fold and function is therefore critical to understanding protein biogenesis.

This thesis contains four theoretical and computational studies of co-translational protein folding and its influence on protein conformations after synthesis is complete. The state of the experimental and computational literature is summarized in Chapter 1. In Chapter 2 I describe a chemical kinetic model that is able to accurately predict experimental co-translational folding probabilities for the first time. This chemical kinetic model is used to make the novel prediction that some yeast proteins which typically fold post-translationally can be made to fold co-translationally by recoding their mRNA sequences to contain the slowest-translating synonymous codon at each codon position. This chemical kinetic model is general, in principle, for all proteins and all organisms.

The fluorescent technique FRET has recently been used to monitor protein folding on the ribosome both in vitro and in vivo. One disadvantage of such fluorescent techniques is that they produce a single value per time point, providing minimal structural information. In Chapter 3, I present results from low-friction, coarse-grain Langevin dynamics simulations that elucidate the structural origins of FRET measurements on the ribosome. These simulations are in strong agreement with experimental time series and reveal the underlying co-translational folding trajectory at a spatial resolution of 3.8 Å. I also show that the alternative hypothesis, that nascent chain compaction occurs due to collapse as is expected for a polymer in poor solvent, is not consistent with the experimental data. Dimensional collapse, however, could not be ruled out. I therefore suggest alternative dye positions that could be used to differentiate between domain folding and domain dimensional collapse in future experiments.

Chapter 4 of this thesis presents the hypothesis that the pathogenesis of Huntington’s Disease is, at least in part, due to the dysfunction of a co-translational process. The genetic cause of Huntington’s Disease is the expansion of a poly-glutamine region in Exon 1 of the HTT gene. Individuals with 35 or more CAG codons (which encode the amino acid glutamine) in their HTT gene will suffer disease symptom onset within a typical human lifespan. The age of symptom onset decreases linearly as the number of CAG codons increases beyond 35. Based on strong circumstantial experimental evidence and a simple kinetic model, I argue that the expansion of the CAG codon region in the transcript leads to an increase in the speed of translation at a key time during the synthesis of huntingtin protein that leads to its misprocessing and the onset of disease symptoms. I go on to propose experiments to test this hypothesis.

In Chapter 5 I describe high-throughput simulations of the translation elongation, translation termination, and post-translational dynamics of a representative subset of the E. coli cytosolic proteome. I find that roughly one in four proteins is kinetically trapped for as long as 3 minutes after the completion of protein synthesis. Finally, in Chapter 6 I summarize the conclusions that can be drawn from, and the future directions related to, my work. One obvious future goal is the extension of my simulations of multi-domain E. coli proteins to the study of how domain interfaces influence protein folding. That is, do domains fold independently or do they rely upon one another? My E. coli proteome data set also includes diverse information about translation termination kinetics, and preliminary simulations reveal that electrostatic effects seem to largely determine its timescales. In summary, the results presented in this thesis advance understanding of co-translational protein folding and the influence translation kinetics can have on protein conformations.


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
ACKNOWLEDGEMENTS

Chapter 1: INTRODUCTION
  1.1 Kinetics can be more important than thermodynamics for protein folding
  1.2 Synonymous codon substitutions can alter the rate of translation
  1.3 Co-translational folding can be influenced by codon translation rates
  1.4 Codon translation rates modulate protein function, misfolding, and aggregation
  1.5 Translocation of proteins across cell membranes is modulated by codon translation rates
  1.6 Other co-translational processes can also depend on codon translation rates
  1.7 Several human diseases have been linked to codon translation rates
  1.8 Recent approaches to modeling codon translation rate effects on co-translational protein folding and translocation
  1.9 Thesis objectives

Chapter 2: ACCURATE PREDICTION OF CELLULAR CO-TRANSLATIONAL FOLDING INDICATES PROTEINS CAN SWITCH FROM POST- TO CO-TRANSLATIONAL FOLDING
  2.1 Abstract
  2.2 Introduction
  2.3 Results
    2.3.1 Derivation of the model
    2.3.2 Constructing a fully constrained model
    2.3.3 Prediction of pulse-chase co-translational folding curves
    2.3.4 Prediction of FactSeq co-translational folding curves
    2.3.5 Sensitivity of predictions to parameter variation
    2.3.6 Model sensitivity to variable codon translation rates
    2.3.7 Domains can switch from post- to co-translational folding
  2.4 Discussion
  2.5 Acknowledgements

Chapter 3: STRUCTURAL ORIGINS OF FRET-OBSERVED NASCENT CHAIN COMPACTION ON THE RIBOSOME
  3.1 Abstract
  3.2 Introduction
  3.3 Results
    3.3.1 E and E/E_end are equally valid for comparison to experimental fluorescence
    3.3.2 Both partial folding and dimensional collapse are consistent with the experimental data
    3.3.3 Alternative dye positions allow the partial folding and dimensional collapse mechanisms to be tested
    3.3.4 Co-translational folding of HemK NTD proceeds through a partially folded intermediate
  3.4 Discussion
  3.5 Conclusions
  3.6 Acknowledgements

Chapter 4: ALTERED CO-TRANSLATIONAL PROCESSING PLAYS A ROLE IN HUNTINGTON’S PATHOGENESIS – A HYPOTHESIS
  4.1 Abstract
  4.2 Introduction
  4.3 Co-translational processes influence post-translational protein behavior
  4.4 Nascent chain interactions with auxiliary factors depend on nascent chain length
  4.5 Co-translational folding and downstream function depend on translation-elongation rates
  4.6 The co-translational targeting of nascent proteins depends on translation-elongation rates
  4.7 Polyproline stretches slow-down translation elongation
  4.8 The hypothesis: altered co-translational processes involving huntingtin play a role in HD pathology
  4.9 Our hypothesis in the context of other mechanisms that could contribute to mHtt pathology
  4.10 A kinetic model based on this hypothesis suggests why the age of onset of HD symptoms negatively correlates with the number of CAG repeats
  4.11 On the nature of misprocessing
  4.12 Testing the co-translational misprocessing hypothesis of HD pathology
  4.13 Conclusion
  4.14 Acknowledgements

Chapter 5: PREVALENCE AND TIMESCALES OF KINETIC TRAPPING WITHIN THE E. COLI PROTEOME
  5.1 Introduction
  5.2 Results and Discussion

Chapter 6: CONCLUSIONS AND FUTURE DIRECTIONS
  6.1 Conclusions
  6.2 Future directions
    6.2.1 Translation termination timescales and governing factors
    6.2.2 Investigating aspects of multi-domain protein folding in bulk solution and on the ribosome

Appendix A: CHAPTER 2 METHODS AND SUPPORTING INFORMATION
  A.1 Methods
    A.1.1 Selection of model parameters
    A.1.2 Calculation of error bars
    A.1.3 Calculation of test statistics for FactSeq data
    A.1.4 Details of protein domain identification and numbering
    A.1.5 Identifying yeast protein domains
    A.1.6 The time-dependent fraction of full-length protein
    A.1.7 Testing the applicability of assumptions A1 and A3
    A.1.8 Scaling codon translation rate estimates for CHO cells
    A.1.9 Ribosome profiling of yeast
    A.1.10 Bioinformatic analysis of ribosome profiling data
  A.2 Supplemental discussion
    A.2.1 Derivation of Eq. 2.2
    A.2.2 Predictions are robust to small deviations from steady state
    A.2.3 Predictions are robust to change in dwell-time distribution
  A.3 Supplemental figures and tables

Appendix B: CHAPTER 3 METHODS AND SUPPORTING INFORMATION
  B.1 Methods
    B.1.1 Construction of HemK N-terminal domain folding model
    B.1.2 Construction of dye-modified HemK model
    B.1.3 Construction of good- and poor-solvent collapse Hamiltonians
    B.1.4 Selection of mean in silico translation elongation time
    B.1.5 Continuous synthesis simulations
    B.1.6 Mapping of simulation timescales
    B.1.7 Calculation of E and E/E_end ensemble average time series
    B.1.8 Fraction of native contacts analysis
    B.1.9 Comparing simulated and experimental time series
  B.2 Supplemental discussion
    B.2.1 Comparing simulated and experimental time series (expanded discussion)
    B.2.2 Testing the applicability of the Förster equation
  B.3 Supplemental figures and tables

Appendix C: CHAPTER 4 METHODS
  C.1 Methods
    C.1.1 Derivation of chemical-kinetic model for Htt misprocessing
    C.1.2 Calculation of k_on for the chemical kinetic model

Appendix D: CHAPTER 5 METHODS AND SUPPLEMENTARY FIGURES AND TABLES
  D.1 Methods
    D.1.1 Multi-domain protein selection and model building
    D.1.2 Single-domain protein selection and model building
    D.1.3 Selection of mean in silico codon translation time based on training set folding kinetics
    D.1.4 Parameterization of single- and multi-domain protein contact potentials
    D.1.5 Construction of 50S E. coli ribosome cutout
    D.1.6 Simulations of translation elongation, translation termination, and post-translational protein dynamics
    D.1.7 Calculation of domain and interface folding times
  D.2 Supplementary figures and tables

REFERENCES


LIST OF FIGURES

Figure 1.1. Codon translation rates influence a broad range of nascent protein behaviors
Figure 1.2. A range of methods for describing the influence of codon translation rates on nascent protein behavior
Figure 1.3. Codon translation rate effects on nascent protein folding behavior predicted by chemical kinetic or coarse-grained simulation methods
Figure 1.4. Influence of codon translation rates on the translocation or insertion of proteins into membranes as described by coarse-grained models
Figure 2.1. Illustration of the pulse-chase experiment
Figure 2.2. The co- and post-translational protein folding reaction scheme that Eq. 2.2 solves
Figure 2.3. Comparison between the predicted and experimentally measured SFVP co-translational folding curves
Figure 2.4. Comparison between the predicted and experimentally-measured FRB and HA1 co-translational folding curves
Figure 2.5. Sensitivity analysis of the predicted co-translational folding curve of ΔC SFVP to changes in the number of residues that fit inside the ribosome, k_A,i, k_F,i, and k_U,i
Figure 2.6. Effects of variable codon translation rates on the predicted co-translational folding curve for ΔC SFVP
Figure 2.7. Synonymous codon substitutions can switch some yeast protein domains from post- to co-translational folding according to Eq. 2.2
Figure 3.1. Coarse-grain representation of HemK NTD with dye-modified residues
Figure 3.2. Simulation time series compared to experimental results
Figure 3.3. Alternative dye positions that maximize separation between good-solvent and folding results for each HemK construct
Figure 3.4. Three sets of fraction of native contact, Q, time series are displayed for HemK112 alongside representative coarse-grain nascent-chain structures
Figure 4.1. The proposed co-translational mechanism of HD pathology
Figure 4.2. A simple chemical-kinetic model explains how the age of onset of HD symptoms could arise from disruption of co-translational processing of huntingtin
Figure 5.1. Protein training and data sets and ribosome cutout
Figure 5.2. 1GYT fraction of native contacts time series
Figure 6.1. Distributions of mean translation termination times from coarse-grain simulations
Figure A.1. Ribo-Seq data exhibits stationary ribosome profile distributions between biological replicates of yeast
Figure A.2. Ribo-Seq data shows stationary ribosome profile distributions
Figure A.3. Linear least squares analysis of the appearance of full-length ΔC SFVP since the start of the chase period
Figure A.4. An illustration of using Eq. 2.2 to compute P_F(t) using a tractable example
Figure A.5. Comparison of Gillespie Algorithm simulations of non-steady-state translation kinetics to predictions made with Eq. 2.2
Figure A.6. A double-exponential ribosome dwell-time distribution does not alter the predicted co-translational folding curve of ΔC protein
Figure A.7. Sensitivity analysis of co-translational folding curves predicted with Eq. A.5 for FRB and HA1 to changes in the parameters k_F, k_U, and k_A
Figure A.8. Sensitivity analysis of co-translational folding curves predicted with Eq. 2.2 for the SFVP, DHOM, DPP3, SBA1, and EF2 wild-type proteins to changes in the parameters k_F, k_U, and k_A
Figure A.9. The various estimates of codon translation rates do not correlate with each other
Figure A.10. The six slowest-translating codons predicted by the Fluitt-Viljoen model cause the greatest deviation between the predicted and experimental values
Figure B.1. Three sets of fraction of native contact, Q, time series are displayed for HemK98
Figure B.2. Three sets of fraction of native contact, Q, time series are displayed for HemK84
Figure B.3. Three sets of fraction of native contact, Q, time series are displayed for HemK70
Figure B.4. Three sets of fraction of native contact, Q, time series are displayed for HemK56
Figure B.5. Three sets of fraction of native contact, Q, time series are displayed for HemK42
Figure B.6. The mean radius of gyration evaluated over ten 600-ns trajectories
Figure B.7. Results of Gillespie Algorithm simulations through a two-state co-translational folding reaction scheme
Figure B.8. Value of κ^2 as a function of nascent chain length
Figure D.1. Parameters related to interface stability
Figure D.2. Parameters related to domain stability
Figure D.3. Distributions of protein and domain length for multi-domain proteins
Figure D.4. Distributions of protein length for single-domain proteins
Figure D.5. 1FUI domain fraction of native contact plots
Figure D.6. 1FUI interface fraction of native contact plots


LIST OF TABLES

Table 2.1. Model parameters for SFVP, FRB, HA1, and yeast proteins
Table 3.1. Pearson R^2 values between simulated and experimental time series
Table 3.2. λ values for comparisons between E and E/E_end time series
Table 3.3. λ values for comparisons between alternative dye position E and E/E_end time series
Table 5.1. Synthesis times, folding times, and percentage of trajectories kinetically trapped
Table 6.1. Translation termination times for 1T8K and 1U0B under various conditions
Table A.1. Translation rate profiles used in calculating pulse-chase curves in Figure 2.6
Table A.2. Summary of pulse-chase error bars from literature sources
Table B.1. HemK dye model mass parameters
Table B.2. HemK dye model bond parameters
Table B.3. HemK dye model angle parameters
Table B.4. HemK dye model torsional parameters
Table B.5. HemK dye model non-bonded parameters
Table D.1. Database of multi-domain proteins
Table D.2. Replica exchange results for α-helical proteins
Table D.3. Replica exchange results for β-sheet proteins
Table D.4. Replica exchange results for α/β proteins
Table D.5. Kinetic parameters for 18 training set proteins
Table D.6. Single-domain protein data set information
Table D.7. Multi-domain protein data set information
Table D.8. Additional training set parameters
Table D.9. η values used for each structural class and the training set overall
Table D.10. Mean in silico dwell times calculated from Fluitt-Viljoen model
Table D.11. Kinetically trapped inter-domain interfaces


ABBREVIATIONS

aa        amino acids
BOF       BodipyFL
BOP       Bodipy 576/587
CAF       co-translationally acting factor
CDS       coding sequence
CG        coarse grain
CTD       C-terminal domain
E. coli   Escherichia coli
ER        endoplasmic reticulum
FL        firefly luciferase
FRET      Förster resonance energy transfer
FRQ       FREQUENCY
HD        Huntington's Disease
HemK      E. coli N5-glutamine methyltransferase protein
Htt       huntingtin protein
MAP       methionine aminopeptidase
MD        molecular dynamics
mHtt      mutant huntingtin protein
mRNA      messenger RNA
N17       huntingtin protein N-terminal signal sequence
NTD       N-terminal domain
PDB       Protein Data Bank
PDF       peptide deformylase
PKR       protein kinase R
RNA       ribonucleic acid
RNC       ribosome nascent chain
SFVP      Semliki Forest virus protein
SNP       single-nucleotide polymorphism
SRP       signal recognition particle
TF        trigger factor
UPS       ubiquitin/proteasome system
VMD       Visual Molecular Dynamics (software)


ACKNOWLEDGEMENTS

I thank the National Institutes of Health, National Science Foundation, and Human Frontier Science Program for funding this work in part, and I acknowledge that the findings and conclusions in this thesis do not necessarily reflect the views of these funding agencies. First, I should thank my advisor, Ed O’Brien. I am honored to have been the first graduate student in your lab at Penn State and proud of the work we have done together. Your tireless support of my research and career goals has been critical to my success. I look forward to continuing our relationship as I progress into the next stages of my career. In addition to Ed I need to thank my committee, Professors Will Noid, Phil Bevilacqua, and Reka Albert, for their thoughtful questions and criticisms that have helped me consider research from different perspectives, often to my great benefit. I had the good fortune to work with fantastic postdoctoral researchers and graduate students at Penn State. Thank you Ajeet, Fabio, Ben, Dave, Joe, Ian, Nabeel, Sarah, and Yang for your help and support. Ajeet, Fabio, Ben, Dave, and Joe in particular have been my friends and supporters through many shared battles and I will forever be grateful for their help. I should also thank my friends. Erik, thank you for always having an open door if I needed to get out of the house and out of my head. Josh, I will forever be thankful for you getting me into D&D, it is a fantastic escape from reality and I have fond memories of long nights and days playing. Pat, I have always appreciated your stoic and steady personality, I wish it would rub off but I do not think it will at this point. Zach, thank you for teaching me how to truly relax and enjoy life in its quiet moments. Thank you to both Bens for helpful conversations on many topics. Most importantly I need to thank my family. My parents, as always, have been my tireless cheerleaders and support system. 
Thank you for all of your visits, kind words, homecooked meals, and advice. Thank you, Jenn and Kellie, for visiting State College, showing me around NYC, and most importantly introducing me to Ethiopian food. My grandparents, some of whom are not here to see me finish my degree, were always ready with love and support. I know Grandpa Haeffner would have been particularly proud given his love of all things Penn State. Finally, I need to thank my fiancée, Grace. You have always been there for me when I was stressed, angry, tired, and in general hard to be around, with kindness and patience. When I could not find strength to keep working you showed me the way forward by believing in me more than I have ever believed in myself. I will never forget your small kindnesses to make the week more bearable – meeting for lunch, dropping me off at work on your days off, always listening to me complain – you are my best friend and my most reliable supporter. I know I could not have done this without you – aren’t I lucky that I met you when I did?


Chapter 1

INTRODUCTION

This introductory chapter is reproduced in part with permission from The Journal of the American Chemical Society from the Perspective article “Timing is Everything: Unifying Codon Translation Rates and Nascent Proteome Behavior” by Daniel A. Nissley and Edward P. O’Brien in The Journal of the American Chemical Society, 2014, 136(52), pp 17892-17898. Copyright 2014 American Chemical Society.

The results presented in this thesis center around modeling and understanding the influence that translation speed has on protein folding. A protein’s folded state is often assumed to be fully dictated by the thermodynamic information encoded in its primary sequence. However, it has recently come to be understood that the speed of protein synthesis by the ribosome can profoundly influence protein folding and function. These results indicate that kinetics rather than thermodynamics can determine the fate of a protein in vivo. This introduction presents the experimental evidence for such non-equilibrium effects on protein folding and provides context on applicable theoretical and simulation techniques. In the subsequent chapters I describe projects that resulted in an accurate predictive model for co-translational folding, accurate prediction of FRET arising from co-translational folding, a new hypothesis for Huntington’s Disease pathogenesis, and quantification of the prevalence and timescales of kinetic trapping within the E. coli proteome. These projects are summarized at the end of the Introduction in the Thesis Objectives section.

1.1 Kinetics can be more important than thermodynamics for protein folding

During the process of translation, the ribosome synthesizes a protein molecule by unidirectionally translocating along an mRNA molecule one codon at a time (Figure 1.1A). A number of processes involving the nascent chain occur before it has been fully synthesized. These processes, referred to as co-translational processes, include co-translational folding1,2, molecular binding3, translocation between cellular compartments4,5, and the ubiquitination6,7 and glycosylation8 of the nascent protein. Recently, experiments from a number of different laboratories have demonstrated that changing the rate at which codon positions in an open reading frame (ORF) are translated by the ribosome can dramatically affect such co-translational processes and consequently alter the fate of the nascent protein in vivo (Figure 1.1B), even though the primary structure of the nascent protein is unaltered. Such results demonstrate that the thermodynamic stability of a protein’s native state can be less relevant to its nascent behavior in a cell than the rates of the processes acting on the protein. This property is a hallmark of nonequilibrium processes, wherein changes in kinetics can change the system’s behavior despite there being no change in the system’s composition.


Figure 1.1. Codon translation rates influence a broad range of nascent protein behaviors. (A) The codons comprising the mRNA are the template directing the ribosome as to which protein sequence to synthesize. Aminoacyl-tRNA (aa-tRNA) delivers the correct amino acid by selectively binding to the codon that is complementary to its anticodon. The nascent protein emerges from the exit tunnel, N-terminus first, as the ribosome translocates along the mRNA in the 5′ to 3′ direction. (B) The range of co- and post-translational processes which may occur for a nascent protein. (C) The codon usage, measured as the frequency of a particular codon per 1000 codons, is shown for the genomes of E. coli and H. sapiens (reproduced from the NCBI GenBank database).

Codon translation rates can govern nascent protein behavior, and the disparate observations supporting this claim are manifestations of the out-of-equilibrium nature of co-translational processes. Hence, any general theory that attempts to quantitatively predict nascent proteome behaviors must necessarily account for the interplay between codon translation rates and the rates at which these co-translational processes proceed. In the following introductory sections, we detail recent experimental examples of these phenomena, describe promising theoretical approaches to treating such situations, and discuss how these treatments might be extended to account for the broad spectrum of nascent protein behaviors affected by codon translation rates. Ultimately, the creation of such a theoretical framework would help us better understand the causes and consequences of translation-regulated nascent protein behavior in cells and the origins of various human diseases9 as well as provide a framework to manipulate nascent proteome behavior in vivo.

1.2 Synonymous codon substitutions can alter the rate of translation

The genetic code is degenerate, with the 20 naturally occurring amino acids encoded by 61 unique codon types that compose the ORFs of transcripts (Figure 1.1A). All amino acids, with the exception of methionine and tryptophan, are encoded by at least two and as many as six different codons10. Codons that encode for the same amino acid are said to be synonymous with one another. Synonymous codons are not translated by the ribosome at identical rates; experiments indicate that the average translation rates between the synonymous codons GAA and GAG, which both encode glutamic acid, differ 3-fold in E. coli11. For those codons where translation rates have not yet been measured, theoretical modeling suggests even greater variability may exist12. Therefore, the translation rate at a particular codon position in an ORF can be altered by substituting a synonymous codon at that position, leaving the amino acid sequence of the synthesized protein unaltered. Indeed, as discussed below, synonymous codons have been extensively utilized to examine the influence of codon translation rates on nascent protein behavior. Furthermore, there is evidence that evolutionary pressures have shaped codon usage in the transcriptomes of organisms (Figure 1.1C) in some cases to modulate nascent proteome behavior13–16, indicating just how important the influence of codon translation rates can be to an organism’s phenotype.
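The degeneracy described above can be checked directly from the standard genetic code. The short Python sketch below is an illustration added for clarity, not code from this thesis: it builds the standard codon table from its conventional compact encoding (bases ordered T, C, A, G, DNA alphabet) and groups the sense codons by the amino acid they encode. The variable names are arbitrary choices.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in the conventional compact form: codons are
# enumerated with bases in the order T, C, A, G (third base varying fastest),
# and "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codon_table = {
    "".join(bases): aa
    for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group sense codons by amino acid to expose the synonymous sets.
synonymous = defaultdict(list)
for codon, aa in codon_table.items():
    if aa != "*":
        synonymous[aa].append(codon)

n_sense_codons = sum(len(codons) for codons in synonymous.values())
single_codon_aas = sorted(aa for aa, codons in synonymous.items() if len(codons) == 1)
max_degeneracy = max(len(codons) for codons in synonymous.values())

print(n_sense_codons, len(synonymous), single_codon_aas, max_degeneracy)
print(sorted(synonymous["E"]))  # synonymous codons for glutamic acid
```

Grouping by amino acid makes the synonymous sets explicit: a synonymous substitution replaces a codon with another member of the same list, which leaves the encoded protein sequence unchanged.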

1.3 Co-translational folding can be influenced by codon translation rates

Introducing synonymous codon mutations within a transcript, which alters the distribution of translation rates along the ORF, can change the ability of a nascent protein to co-translationally fold. The process of co-translational protein folding consists of the concomitant folding of one or more domains in a protein into a stable tertiary structure during the time it takes to synthesize the full-length protein1. Co-translational folding can be a biologically beneficial process because it allows the individual segments that compose multidomain proteins to fold in the absence of nascent chain segments from other domains, thereby minimizing the chances of interdomain misfolding1,17,18. Consequently, proteins that fold co-translationally often display decreased levels of misfolding and aggregation and, in some cases, populate on-pathway structures that enhance the chances that the protein will attain the correct folded structure19,20. Slower codon translation speeds can favor the correct folding of eukaryotic proteins by affording domains more time to fold while bound to the ribosome21. For example, the exchange of two rare codons for their most common synonymous codons, which presumably translate more quickly, in a normally slow-translating region within the ORF of the three-domain SufI protein was found to decrease its ability to co-translationally fold in an in vitro synthesis system22. In several cases, an increase in co-translational folding has been facilitated by the positioning of rare codon clusters on the mRNA downstream of the co-translationally folding domain22,23. An extensive bioinformatics study, however, indicates this arrangement of rare codons relative to domain boundaries is not typically found in the transcriptomes of organisms, though there is evidence that protein domain boundaries are enriched in fast-translating codons24. A recent theoretical study using kinetic models also suggests that there are scenarios in which fast-translating codons could increase the probability of correct co-translational folding when they are positioned in the ORF downstream of nascent protein segments prone to misfolding25. These results illustrate how altering codon translation rates at specific positions in an ORF can modulate the co-translational folding of domains in a protein.
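The argument that slower translation affords a domain more time to fold on the ribosome can be made concrete with a deliberately simple toy calculation. The Python sketch below is an illustrative assumption, not the chemical kinetic model developed in Chapter 2: it treats folding as an irreversible first-order process with a single rate constant and ignores unfolding and misfolding, so it cannot capture the scenario in which fast-translating codons help. The rate constant, codon count, and dwell times are hypothetical values chosen only for illustration.

```python
import math


def fold_probability(k_fold, dwell_times):
    """Probability that a domain has folded before synthesis finishes.

    Toy assumption: folding is irreversible and first-order with rate
    k_fold (1/s), and the domain can fold during the total time the
    ribosome spends on the remaining codons (sum of dwell_times, in s).
    """
    time_available = sum(dwell_times)
    return 1.0 - math.exp(-k_fold * time_available)


# Hypothetical parameters: 40 codons remain after the domain has emerged
# from the exit tunnel, and the domain folds at 0.5 per second.
k_fold = 0.5
fast_codons = [0.05] * 40  # 0.05 s per codon, 2 s available in total
slow_codons = [0.15] * 40  # 0.15 s per codon, 6 s available in total

p_fast = fold_probability(k_fold, fast_codons)
p_slow = fold_probability(k_fold, slow_codons)

print(f"fast-translating downstream codons: {p_fast:.2f}")
print(f"slow-translating downstream codons: {p_slow:.2f}")
```

Under these assumptions, substituting slower synonymous codons downstream of the domain raises the co-translational folding probability only because it lengthens the time available; the amino acid sequence, and hence the folding rate, is unchanged.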

1.4 Codon translation rates modulate protein function, misfolding, and aggregation

Altering codon translation rates can also alter the ability of a newly synthesized protein to carry out its biological function18,22,26–28. So-called “silent” single nucleotide polymorphisms, which are naturally occurring synonymous codon substitutions in the genome of an organism, have been found to affect protein expression levels29, the final folded structure a nascent protein attains in vitro29, and downstream processes such as aggregation30 and the substrate specificity of newly synthesized proteins31. Each of these changes in the behavior of the newly synthesized protein can be explained by the changes that these synonymous codon substitutions have on the rates at which codons are translated. The expression levels of soluble, functional protein molecules can be controlled by tuning the rates at which codons are translated. One highly successful codon optimization strategy for increasing the yields of eukaryotic proteins expressed recombinantly in prokaryotic cells is to maintain the eukaryotic codon usage profile along the ORF in the context of the prokaryotic organism32. The success of this approach is consistent with the notion that by preserving the evolutionarily optimized relative timing at which different segments of a nascent chain are produced, the co-translational acquisition of correct structure and function can be achieved while minimizing the tendency of the nascent protein to aggregate. Optimizing codon usage, however, is not always beneficial to protein expression. For example, the expression of functional Neurospora clock protein FREQUENCY (FRQ) is dependent on the nonoptimal codon usage present within the naturally occurring transcript, and though optimizing the codon usage in the FRQ transcript results in increased yields of protein, it also causes the loss of FRQ’s proper periodic expression with time29. More generally, a multigene codon optimization study in E. 
coli33 found that while the majority of the genes studied showed 2- to 3-fold increases in expression relative to the wild-type transcripts, 20% had decreased expression levels. These results indicate that the optimization of a codon sequence does not guarantee an increase in successful expression of a given transcript. A more robust understanding of the molecular origin of both variable codon translation rates themselves and their myriad effects on co-translational processes may enhance the ability of optimization procedures to control nascent protein behavior and increase expression levels. Nascent protein misfolding, which can be measured experimentally for enzymes as a decreased specific activity in the soluble protein fraction13,21,26,27, is also a potential consequence of synonymous codon substitutions. Recombinant expression of eukaryotic proteins (including firefly luciferase and green fluorescent protein) by E. coli containing streptomycin–pseudodependent ribosomes, whose global translation elongation rate increases with increasing concentration of streptomycin, was found to produce a greater fraction of correctly folded protein when codon translation rates were slowed21. A synonymous codon mutation in the Multidrug Resistance 1 gene was suggested to alter the conformation and drug transport function of the synthesized protein despite resulting in an

identical amino acid sequence and comparable levels of expression31. In the case of FRQ, it was shown that optimization of its codon sequence also led to a conformational change in mature FRQ and the total abolishment of circadian periodicity within the organism29. These, and other experiments34, make clear the tangible link between co-translational protein misfolding, codon translation rates, and subsequent cellular processes that depend on the proper functioning of nascent proteins. When proteins are unable to reach their correct folded conformation, or any other soluble structure, they can aggregate and precipitate from solution35. Such aggregation can be detrimental to a cell and is the factor which unifies the amyloidosis family of diseases35. The aggregation propensity of nascent proteins can be altered through modification of codon translation rates, indicating a direct link between co-translational events and the post-translational process of aggregation. For example, the replacement of three codons in a segment of the Echinococcus granulosus fatty acid binding protein 1 transcript with synonymous codons resulted in an increase in aggregation in vivo26, suggesting that aggregation of nascent proteins is more prevalent when certain codons are translated at different rates21. It was also shown that optimizing the codon sequence of firefly luciferase’s transcript, with all codons being replaced with their fastest-translating synonymous codon, led to an increase in the amount of aggregation relative to the wild-type codon sequence at comparable levels of expression in E. coli13. Therefore, changes in expression levels of this protein do not drive aggregation, but instead, it is likely that changes in a co-translational process result in nascent protein misfolding. Codon translation rates are thus seen to have a significant impact on the likelihood of nascent protein aggregation.

1.5 Translocation of proteins across cell membranes is modulated by codon translation rates

The successful co-translational translocation of secretory proteins through the Sec translocon has also been linked to codon translation rates. The signal recognition particle (SRP) is an abundant and universally conserved ribonucleoprotein which recognizes signal sequences located at the N-terminus of secretory nascent polypeptides and targets them to the endoplasmic reticulum (ER) for translocation5. The likelihood that a nascent protein is successfully translocated depends on the affinity of SRP for the signal sequence, which can be altered by modifying the amino acid sequence. In a recent study, it was shown that globally decreasing the codon translation rate in HDB52 cells via the addition of cycloheximide led to an increase in the percentage of nascent protein successfully translocated into the ER for proteins containing signal sequences with low binding affinity for SRP36. It was hypothesized that slower translation rates provide more time for SRP to recognize and bind to the signal sequence, which is the first step on the pathway to translocation. Additionally, an earlier study37 found that the topology of membrane proteins can also be altered by globally slowing translation rates. Though these studies used an external means (cycloheximide) to manipulate the translation rate, it is likely that altering

the codon usage in the transcript could provide a similar decrease in translation rate and aid in the co-translational translocation of secretory proteins. Co-translational protein translocation can also be influenced by changes in other co-translational processes that are sensitive to codon translation rates. A zinc-finger domain, the folding of which can be induced by the presence of Zn2+, was used to show that the ability of protein-conducting channels to co-translationally translocate proteins across the ER membrane is inhibited by the co-translational folding of the passenger sequence38. In constructs that were designed to allow the zinc finger to co-translationally fold, the passenger protein that was covalently attached to the zinc finger was observed to be diverted to the cytosol instead of entering the ER lumen. On the other hand, the inhibition of zinc-finger folding allowed translocation into the ER to occur. This suggests that altering codon translation rates could alter the likelihood of co-translational folding and, thereby, affect the translocation efficiency of secretory proteins. More generally, this observation illustrates how codon translation rates can modulate not only individual co-translational processes but also multiple co-translational processes that occur in concert. Such multifactorial effects of codon translation rates illustrate the pressing need for a theoretical framework which will facilitate a quantitative understanding of the consequences of codon translation rates for nascent proteome behavior.

1.6 Other co-translational processes can also depend on codon translation rates

The processes of nascent protein glycosylation8, chaperoning interactions3, enzymatic modification39, and ubiquitination6,7,40 can also occur co-translationally. While the influence of codon translation rates on these cellular processes has not yet been examined, any process that occurs on the time scale of translation could also have its outcomes influenced by the kinetics of translation elongation. The covalent attachment of carbohydrates to proteins is known as glycosylation, and most eukaryotic proteins that are secreted to the exterior of the cell are glycosylated8. This modification to the nascent chain often occurs co-translationally, with the ribosome inserting the nascent protein from the cytoplasmic side of the ER into the SEC translocon and glycosylation occurring as nascent chain segments emerge into the ER lumen4,8. Co-translational glycosylation of the nascent protein may be much more efficient as compared to post-translational glycosylation because nascent protein segments are more likely to be unstructured during the period of their synthesis on the ribosome, thereby offering greater exposure of potential glycosylation sites8. This suggests that slowing down translation could increase the probability of nascent chain segments folding as they emerge from the translocon, thereby decreasing the chances for glycosylation. A number of molecular chaperones and processing enzymes have been observed to interact with nascent proteins during translation39,41. In some cases, molecular chaperones, such as trigger factor in E. coli, assist nascent protein folding by sequestering aggregation-prone polypeptide sequences that can become exposed as a result of protein misfolding. Although some chaperones act post-translationally, those which associate with proteins during translation must do so on the time scale of protein synthesis and thus may be

kinetically dependent on codon translation rates. Processing enzymes in E. coli, such as PDF and MAP42, act on nascent chains during translation, suggesting that their binding and enzyme kinetics could also be affected by codon elongation rates. Translation elongation rates have also been linked to observed differences in the correlated processes of arginylation and ubiquitination of γ- and β-actin nascent proteins in vivo43. Ubiquitination is carried out by ubiquitin ligases, enzymes that can covalently attach a ubiquitin protein molecule to another protein, which in some cases targets that protein for degradation6. Estimates of the relative amounts of nascent proteins that are co-translationally ubiquitinated range from 1% to 30%6,7,39. Just as glycosylation, chaperoning, and enzymatic processing may be dependent on codon translation rates, it is also possible that co-translational ubiquitination may be affected by the kinetics of translation elongation. Since misfolded proteins are more likely to be ubiquitinated, co-translational ubiquitination and folding are another set of processes where changes in translation rates could affect more than one co-translational process.

Figure 1.2. A range of methods for describing the influence of codon translation rates on nascent protein behavior. (A) A chemical kinetic reaction scheme that describes co-translational translocation. At each nascent chain length, the states that a nascent chain segment may populate include unfolded (U), folded (F), unfolded and translocated (Utrans), or folded and translocated (Ftrans) states. Note that translocation cannot occur if the nascent protein segment is folded and that translocation is irreversible. (B) Numerical integration relies on mathematical estimation to solve systems of differential equations that represent the time evolution of chemical kinetic reaction schemes. (C) Coarse-grained simulations of ribosome nascent chain complexes allow for a detailed molecular perspective on co-translational folding. Snapshots from simulations in which ribosomes (red and yellow) are engaged in the translation of nascent proteins (green) are shown. Trigger factor (black) is shown associated with the ribosome-nascent chain complex. (D) The master equation approach can solve a set of differential equations to reveal the time-dependent evolution of the system on the reaction scheme. (E) The Gillespie Algorithm is used to solve a stochastic version of chemical kinetics for single molecules and consists of three key steps. The initialization step requires defining the states of the system and the rates of interconversion between these states. Random numbers are then used to generate the time step and starting reaction, and the reaction is then modeled. Next, the time step and number of molecules in each state are updated. The final two steps are then repeated until a predetermined stop condition is met.
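The Gillespie steps summarized in the caption of Figure 1.2E can be illustrated with a minimal sketch. It assumes a hypothetical two-state scheme in which a domain interconverts between unfolded and folded states at every nascent chain length while the ribosome translates one codon at a time; the function names and all rate values are illustrative assumptions and are not taken from the models developed in this thesis.

```python
import random


def simulate_synthesis(k_fold, k_unfold, codon_rates, rng):
    """Simulate one nascent chain; return (folded_at_release, elapsed_time).

    Assumption: folding and unfolding are allowed at every codon position.
    """
    folded = False   # the domain starts in the unfolded state (U)
    codon = 0        # index of the codon currently being translated
    elapsed = 0.0
    while codon < len(codon_rates):
        # Reactions available from the current state and their rates.
        k_conformational = k_unfold if folded else k_fold
        k_translate = codon_rates[codon]
        k_total = k_conformational + k_translate
        # First random number: exponentially distributed waiting time.
        elapsed += rng.expovariate(k_total)
        # Second random number: pick a reaction in proportion to its rate.
        if rng.random() * k_total < k_translate:
            codon += 1            # one codon is translated
        else:
            folded = not folded   # U -> F or F -> U
    return folded, elapsed


def fraction_folded(k_fold, k_unfold, codon_rates, n_trajectories, seed=0):
    """Fraction of trajectories that are folded when synthesis finishes."""
    rng = random.Random(seed)
    n_folded = sum(
        simulate_synthesis(k_fold, k_unfold, codon_rates, rng)[0]
        for _ in range(n_trajectories)
    )
    return n_folded / n_trajectories


if __name__ == "__main__":
    # Hypothetical rates in 1/s: 20 codons translated quickly versus slowly.
    print("fast:", fraction_folded(1.0, 0.1, [100.0] * 20, 2000, seed=1))
    print("slow:", fraction_folded(1.0, 0.1, [1.0] * 20, 2000, seed=1))
```

Passing a list of per-codon rates, rather than a single global rate, makes it straightforward to mimic a synonymous substitution by changing one entry of `codon_rates`.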

1.7 Several human diseases have been linked to codon translation rates

A number of human diseases that affect the lungs and blood, as well as various cancers, have been linked to the variability of codon translation rates9. The recent sequencing of the lymphocyte mRNAs of five Swedish families with hemophilia B led to the finding that a synonymous single nucleotide polymorphism (SNP) in the F9 gene was the only difference between their genotype and healthy individuals, with mRNA splice variation ruled out as a cause44. This leaves open the possibility that this SNP alters nascent protein behavior through a change in the translation elongation rate. Similarly, a synonymous SNP in the cystic fibrosis transmembrane conductance regulator gene, one of the most frequent causes of cystic fibrosis, was found to alter the expression level of the mutant protein via an observed change in protein synthesis rates34; however, this could also potentially occur due to a change in the translation initiation rate as compared to a change in elongation rate. Various cancers, including lung carcinoma and cervical and vulvar cancer, have also been linked to synonymous SNPs, suggesting a possible role of codon translation rates9,45. A molecular perspective connecting changes in co-translational processes due to alteration in codon translation rates and the progression of these diseases is lacking. Such a perspective could help us understand the origin of these diseases.

1.8 Recent approaches to modeling codon translation rate effects on co-translational protein folding and translocation

The numerous examples discussed above illustrate the importance that codon translation rates can have in governing nascent proteome behavior and indicate an area of biology where the application of theoretical and simulation techniques could significantly advance knowledge and understanding. Chemical kinetics and coarse-grained molecular dynamics simulations are two techniques that have recently been applied to study codon translation rate effects on co-translational protein folding and translocation through the SEC translocon. Using a Markov chain probability approach, the chemical reaction schemes representing co-translational domain folding mechanisms involving either two or three thermodynamic states were analytically solved25,46. The resulting equations provide the capability to predict how individual codon translation rates in an ORF will influence the probability of a domain populating its unfolded, intermediate, or folded states at different points during its synthesis. Extending this type of approach to other nascent protein behaviors (Figures 1.1B and 1.2A) is particularly promising, as it would allow for the rapid


prediction of the time evolution of the different states that a nascent chain populates as it is being synthesized, the probability of various co-translational processes occurring, and ultimately could provide predictions about the fate of the nascent protein in vivo. An alternative to solving such reaction schemes analytically is to numerically integrate the differential equations that describe the relationships between the co-translationally populated states and the underlying rates of interconversion between those states (Figure 1.2B). This approach was recently used to study the post-translational behavior of newly synthesized proteins in E. coli, including their degradation, aggregation, and interactions with mainly post-translationally acting chaperones47. This numerical approach can also, in principle, be applied to co-translational behavior. There are benefits and costs to such a numerical approach as compared to the analytical approach previously described. Numerical approaches can be readily adapted to any complex reaction scheme, especially with their implementation in widely used and robust software packages such as Mathematica and Matlab. On the other hand, analytical solutions sometimes must be derived on a case-by-case basis for different co-translational situations25. Numerical integration, in principle, can take longer to calculate, as a larger number of integration steps must be executed to obtain converged results, whereas the analytic approach often involves fewer steps. A significant advantage that analytical approaches have is that they can rapidly identify the dynamic regimes within such models through the use of derivative tests25, while an exhaustive numerical search of the kinetic regimes of co-translational behavior can be prohibitive due to the large kinetic-parameter space of the representative reaction networks. Therefore, the benefits of analytical approaches are quite extensive, although this can be mitigated by the one-time cost of obtaining the analytic solution.

Figure 1.3. Codon translation rate effects on nascent protein folding behavior predicted by chemical kinetic or coarse-grained simulation methods. (A) Co-translational folding on the ribosome is favored by slow translation conditions (left) which allow the nascent chain additional time to adopt the native conformation. Fast translation conditions (right) can result in unfolded or misfolded conformations. (B) The effect of globally altering codon translation rates on the probability that a protein will co-translationally fold as determined by both coarse-grained simulations (symbols) and chemical kinetics (solid lines) for a wide range of codon translation rates. The figure legend lists the codon translation times, τA, which are equivalent to the inverses of the rates. (C) The probability that a domain co-translationally folds as a function of nascent chain length determined by both chemical kinetic (solid lines) and coarse-grained simulations (symbols) as affected by the introduction of a single fast- (orange) or slow- (pink) translating codon, indicated by “F” or “S”, respectively. (D) A snapshot of the time step at which a coarse-grained model of the co-translationally folding Semliki Forest Virus Protein (SFVP) adopts its native fold. (E) The pathways of co-translational folding of SFVP on the ribosome. Red points symbolize highly probable states along SFVP's co-translational folding pathways, where QN, QC, and Qint are the fraction of native contacts in the protein’s N-terminal, C-terminal, and interfacial regions, respectively. (F) Same as E but for refolding trajectories started with the SFVP molecule free in solution. Panels A, B, and C were produced with permission from Nature Publishing Group (O’Brien, Vendruscolo, and Dobson, 2012). Panels D, E, and F were produced with permission from PLOS (Elcock 2006).

Coarse-grained simulations (Figure 1.2C) of the translation process offer a more detailed molecular view of the consequences codon translation rates have on co-translational folding than mathematical modeling48–50. Like mathematical models, coarse-grained models can be used to calculate the probabilities of nascent chains being in different states at different translation elongation rates46, but they also offer additional information, including structural information, the energetics associated with various co-translational processes49, and information on dynamics that might be neglected in the chemical kinetic modeling. The trade-off for this more detailed picture is a dramatically increased cost of computation.
Figure 1.4. Influence of codon translation rates on the translocation or insertion of proteins into membranes as described by coarse-grained models. (A) Schematic of a kinetically controlled translocation/membrane-insertion coarse-grained model used to study the influence of codon translation rates. (B) The percent of nascent protein which is inserted by a Type II mechanism in coarse-grained simulations at global translation rates of 6 AA/s (blue) and 24 AA/s (red) as a function of nascent chain length. (C) Same as B except stop-transfer efficiency is plotted as a function of nascent chain length. The figures and data displayed in panels A, B, and C are reproduced with permission from Cell Press (Zhang and Miller, 2012).

An ensemble of coarse-grained simulations of co-translational folding took ∼800 CPU days (= 4 days per trajectory × 200 trajectories), while the kinetic model predicted this result in a computation requiring just a few seconds25. Thus, coarse-grained simulations and mathematical modeling can be used to address a wide range of questions, but where there is overlap in the question being addressed it is often

more efficient to use kinetic modeling. Chemical kinetic and molecular dynamic computations have recently been used to address a number of questions concerning the fates of nascent proteins, especially with regard to their co-translational folding and translocation. A chemical kinetic model describing a two-state co-translational folding mechanism (Figure 1.3A) allowed for the probability of co-translational folding to be calculated as a function of nascent chain length46. For Protein G, it was found that both globally changing the codon translation rate (Figure 1.3B) and the insertion of single slow- or fast-translating codons (Figure 1.3C) dramatically alter predicted folding behavior. Furthermore, coarse-grain models (Figure 1.3D) were used to probe the intrinsic differences between in vivo and in vitro protein folding, with results suggesting that, for multidomain proteins, ribosome-mediated folding in vivo (Figure 1.3E) can follow significantly different pathways from that of in vitro (Figure 1.3F) refolding51. In addition, the molecular regulation of co-translational protein translocation through the Sec translocon was probed using a coarse-grained model, revealing that multiple kinetic pathways exist for the integration of proteins into lipid membranes (Figure 1.4A)52. This study also demonstrated that in such models increasing the global codon translation rate by a factor of 4 resulted in a large reduction in the percent of nascent protein successfully inserted into the membrane by a Type II mechanism (Figure 1.4B) and an increase in the amount of protein directed into the membrane (referred to as the stop-transfer efficiency; Figure 1.4C). Related all-atom models of co-translational translocation also suggest that the kinetics of membrane insertion versus codon translation play a critical role in the cell’s regulation of nascent protein translocation53,54.
Such chemical kinetic and coarse-grain modeling techniques will continue to be important tools for the expansion of our understanding of the impact of codon translation rates on nascent protein behavior.
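The trade-off between analytical and numerical solutions discussed in this section can be made concrete with a small sketch. It assumes the simplest possible case, a two-state domain held at a fixed nascent chain length that starts unfolded and obeys dP_F/dt = k_fold (1 − P_F) − k_unfold P_F; the closed-form solution is compared with a forward-Euler integration. The function names and rate values are illustrative assumptions, not the models of references 25 or 46.

```python
import math


def folded_probability_analytic(k_fold, k_unfold, t):
    """Closed-form P_F(t) for a two-state domain that starts unfolded."""
    k_total = k_fold + k_unfold
    return (k_fold / k_total) * (1.0 - math.exp(-k_total * t))


def folded_probability_euler(k_fold, k_unfold, t, n_steps):
    """Forward-Euler integration of dP_F/dt = k_fold*(1 - P_F) - k_unfold*P_F."""
    dt = t / n_steps
    p_folded = 0.0
    for _ in range(n_steps):
        p_folded += dt * (k_fold * (1.0 - p_folded) - k_unfold * p_folded)
    return p_folded


if __name__ == "__main__":
    # Hypothetical rates (1/s) and dwell time (s) at one chain length.
    k_fold, k_unfold, dwell = 5.0, 0.5, 0.2
    exact = folded_probability_analytic(k_fold, k_unfold, dwell)
    for n_steps in (5, 50, 10000):
        approx = folded_probability_euler(k_fold, k_unfold, dwell, n_steps)
        print(n_steps, abs(approx - exact))
```

The analytic expression is evaluated in a single step, whereas the numerical result only approaches it as the number of integration steps grows, which is the cost noted above. The numerical routine, however, would need no new derivation if further states or reactions were added to the scheme.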

1.9 Thesis objectives

This introduction highlights the influence that codon translation rates have on the broad spectrum of nascent protein behaviors and the tools that can be used to understand and model these phenomena. The work in this thesis advances the study of non-equilibrium effects on nascent protein folding in several ways. Chapter 2 presents an analytical chemical kinetic model that accurately predicts, for the first time, in vivo co-translational folding probabilities. This chemical kinetic model also demonstrates that some proteins that typically fold post-translationally can be made to fold co-translationally with synonymous codon mutations. In Chapter 3 I describe molecular dynamics simulations with three different polymer models that reveal the co-translational folding pathway of a protein on the ribosome. Chemical kinetic methods, principles of co-translational processes, and strong circumstantial experimental evidence are described in Chapter 4 to support a novel hypothesis for the involvement of co-translational processing in the pathogenesis of Huntington’s Disease. Chapter 5 expands on the methods used to produce Chapter 3’s results with a set of high-throughput simulations of a representative subset of the E. coli proteome that reveal widespread kinetic trapping on timescales on the order of

the half-lives of some proteins. Finally, in Chapter 6 I describe the conclusions from and possible future directions enabled by my theoretical and computational work to date.


Chapter 2

ACCURATE PREDICTION OF CELLULAR CO-TRANSLATIONAL FOLDING INDICATES PROTEINS CAN SWITCH FROM POST- TO CO-TRANSLATIONAL FOLDING

Published as a paper entitled “Accurate prediction of cellular co-translational folding indicates proteins can switch from post- to co-translational folding” by Daniel A. Nissley, Ajeet K. Sharma, Nabeel Ahmed, Ulrike A. Friedrich, Gunter Kramer, Bernd Bukau, and Edward P. O’Brien in Nature Communications 2016, 7, 10341. D.A.N., A.K.S., N.A. and E.P.O. designed the research. D.A.N., A.K.S. and E.P.O. carried out the theoretical modelling and analysis. U.F., G.K. and B.B. performed the ribosome profiling experiments. N.A. analyzed the ribosome profiling and FactSeq data. D.A.N., A.K.S., N.A., U.F., G.K., B.B. and E.P.O. interpreted the data and wrote the manuscript. This article is published under an Open Access Creative Commons Attribution 4.0 International License.

2.1 Abstract

The rates at which domains fold and codons are translated are important factors in determining whether a nascent protein will co-translationally fold and function or misfold and malfunction. Here we develop a chemical kinetic model that calculates a protein domain’s co-translational folding curve during synthesis using only the domain’s bulk folding and unfolding rates and codon translation rates. We show that this model accurately predicts the course of co-translational folding measured in vivo for four different protein molecules. We then make predictions for a number of different proteins in yeast and find that synonymous codon substitutions, which change translation-elongation rates, can switch some protein domains from folding post-translationally to folding co-translationally, a result consistent with previous experimental studies. Our approach explains essential features of co-translational folding curves and predicts how varying the translation rate at different codon positions along a transcript’s coding sequence affects this self-assembly process.

2.2 Introduction

Protein folding, the assembly of a protein molecule or domain into a tertiary structure, can occur as a protein is being synthesized by the ribosome in a process referred to as co-translational folding1,41,55. In vitro56,57 and in vivo58 studies in which ribosomes were arrested at different nascent chain lengths have identified a number of proteins that can co-translationally fold. A convincing demonstration that co-translational folding occurs inside cells during continuous translation comes from pulse-chase experiments in which the synthesis of the cytosolic Semliki Forest virus protein (SFVP) was monitored in Chinese


hamster ovarian (CHO) cells2. SFVP is composed of four distinct protein segments (Figure 2.1A), including an N-terminal protease segment (referred to as ‘C protein’) that auto-catalytically cleaves itself from the SFVP molecule once folded (Figure 2.1B). Pulse-chase experiments revealed that cleaved C protein appears before synthesis of full-length SFVP is complete, demonstrating that C protein does indeed fold co-translationally in vivo. In this study, we develop a chemical kinetic model that predicts the course of such co-translational folding and compare the results to experimentally-measured co-translational folding curves reported in the literature. Pulse-chase experiments use the incorporation of radiolabeled amino acids into nascent proteins to resolve the time course of protein synthesis (Figure 2.1C).

Figure 2.1. Illustration of the pulse-chase experiment. (a) A schematic representation of the relevant protein segments of WT SFVP. Residues 1–267 correspond to the segment known as C protein. The other three protein domains are collectively referred to as p97. (b) The crystal structures of the three protein segments for which co-translational folding curves were predicted in this study. In each case, the co-translational folding domain whose behavior is predicted is colored blue. Top left, C protein of SFVP. Bottom left, the FRB domain. Right, HA1, for which the co-translational folding of residues 53–275 was experimentally monitored. (c) Pulse-chase experiments proceed in a step-wise manner as described in the main text. Ribosomes (grey circles) engaged in the translation of an mRNA (light green line) incorporate radiolabeled (red dots) and unlabeled (blue dots) amino acids into nascent proteins. Only those nascent chains that contain labeled amino acids (red segments) can be experimentally observed.
In the ‘pulse’ phase of the experiment, cells in culture are supplied with media containing radiolabeled amino acids, such as 35S-Met and 35S-Cys, for a prescribed period of time. These

radiolabeled amino acids begin being incorporated into nascent chains 10 s after their addition to the cell culture59. This delay is due to the fact that the amino acids must be taken up by the cells and covalently attached to tRNA. Immediately following the pulse, a ‘chase’ is initiated by supplying the cells with media containing unlabeled amino acids, which, following another 10 s delay after their addition to the cell culture59, inhibits the incorporation of labeled amino acids into the elongating nascent chain without hindering the translation process. Radiolabeled nascent protein is then tracked at different time points by a combination of SDS–polyacrylamide gel electrophoresis (for separation by protein size) and phosphorimaging (for quantification of protein levels), allowing the amount of each protein in a sample to be monitored as a function of time since the start of the pulse or chase. The SFVP is a 1,257-residue polyprotein; the last three segments are collectively referred to as p97 (Figure 2.1A)2. C protein (Figure 2.1B) is composed of the 267 N-terminal residues of SFVP and contains a non-sequential catalytic triad (H145, D167 and S219) that, upon folding, allows C protein to rapidly cleave itself from the rest of the polyprotein. Both folding and auto-catalytic cleavage of C protein occur co-translationally2. It has been suggested that, once cleaved, C protein is incapable of cleaving C protein from other nascent proteins20. In pulse-chase experiments, the fraction of C protein cleaved since the start of the chase period is monitored. These data correspond to C protein’s co-translational folding curve, which equals the probability of C protein being folded as a function of time. Time-dependent co-translational folding was measured for two different constructs of SFVP, the wild-type (WT) and a deletion mutant, termed ΔC.
This mutant lacks the 112 most-N-terminal residues, which are intrinsically disordered, resulting in a 1,145-amino acid long protein with a truncated C-protein segment that retains its catalytic activity. Recently, we introduced a kinetic model that accurately predicts the results of co-translational folding from molecular dynamic simulations46. Here we examine if this approach can be extended to predict in vivo co-translational folding curves. The resulting model’s predictions show excellent agreement with measured co-translational folding curves for four different proteins. We use this model to make novel predictions concerning a small subset of proteins in yeast, finding that some can switch between post- and co-translational folding mechanisms due to synonymous codon substitutions that alter translation-elongation rates. Thus, our model provides a rapid and accurate means to anticipate how small protein domains co-translationally behave in vivo, and the capability to explore the consequences of variable codon translation rates arising from synonymous mutations on this process.

2.3 Results

2.3.1 Derivation of the model

Our goal is to develop a kinetic model that can predict co-translational folding curves measured by pulse-chase experiments. As a starting point, we note that only radiolabeled nascent chains are visible to these experiments, with unlabeled nascent chains making no contribution to the co-translational folding curve. Thus, only translation-initiation and elongation events that occur during the period of radiolabel incorporation contribute to the measured co-translational folding curve, as these events generate chains that are radiolabeled, while such translation events that occur outside the incorporation period do not. From these considerations, it follows that in the calculation of the experimentally-measured co-translational folding curve (푃F(푡)) we must account for (1) contributions from both ribosome-bound and ribosome-released radiolabeled nascent chains; (2) that at different time points during the experiment, the ribosome-bound population can contain sub-populations of nascent chains of different lengths; and (3) that the ribosome-released population can contain nascent chains that were released from the ribosome at different time points. The contribution to the co-translational folding curve from the ribosome-bound nascent chain population is proportional to the fraction of nascent chains that are both radiolabeled and folded at a nascent chain length of 푖, while the contribution from the ribosome-released nascent chains is proportional to the fraction of radiolabeled released nascent chains and the time since their release. We can express these ideas mathematically as:

$$P_{\mathrm{F}}(t) = \underbrace{\sum_{i=1}^{M} P_{\mathrm{F,B}}(i)\, f_{\mathrm{L,B}}(i,t)}_{\text{contribution from ribosome-bound labeled nascent chains}} + \underbrace{\sum_{t'=0}^{t} P_{\mathrm{F,R}}(t,t')\, f_{\mathrm{L,R}}(t,t')}_{\text{contribution from ribosome-released labeled nascent chains}}. \qquad \text{(Eq. 2.1)}$$

The first summation term in Eq. 2.1 represents the contribution of ribosome-bound, labeled chains to the co-translational folding curve, and the second term is the contribution from released, labeled chains. In Eq. 2.1, 푃F,B(푖) is the probability that the nascent chain segment of interest (that is, the segment whose folding is being monitored) is folded (F) and bound (B) to the ribosome at a nascent chain length of 푖. The nascent chain segment of interest for SFVP is C protein (Figure 2.1A). 푓L,B(푖, 푡) is the fraction of ribosome-bound (B) nascent chain segments of interest that are at codon position 푖 and contain a radioactive label (L) at time 푡. A nascent chain segment is considered radiolabeled if at least one residue in the segment of interest is labeled. Although the absolute intensity of the phosphorimaging signal is directly proportional to the number of radioactive amino acids in a peptide, Helenius and co-workers normalized the experimental data by dividing by the maximum observed intensity2. This normalization procedure removes the signal’s dependence on the absolute number of radiolabeled amino acids and absolute number of labeled protein molecules, yielding the co-translational folding probability. 푃F,R(푡, 푡′) is the probability that at time 푡 the nascent chain segment of interest is folded (F) for those nascent chains released (R) from the ribosome at time 푡′, where 0 ≤ 푡′ ≤ 푡. 푓L,R(푡, 푡′) is the fraction of labeled (L) nascent chains at time 푡 that were released (R) from the ribosome at time 푡′. The first summation in Eq. 2.1 is over the different nascent chain lengths (from codon 푖 = 1 to 푖 = 푀, the stop codon) and the second summation is over the different time points during the experiment. To determine mathematical expressions for each of the terms in Eq. 2.1 we make the following assumptions:

A1. That steady-state translation kinetics occur throughout the time course of the experiment, which requires that the number of ribosomes initiating translation is equal to the number of ribosomes terminating translation at all times during the experiment. Consistent with this assumption, we performed Ribo-seq experiments on yeast and found that, for genes that have good coverage, stationary ribosome profile distributions occur between biological replicates (Figures A.1 and A.2). Furthermore, the constant rate of accumulation of full-length SFVP during the pulse-chase experiment (Figure A.3) means that the rate of protein synthesis is constant; this can only be the case if translation is occurring at steady state.

A2. That the co- and post-translational folding of the nascent chain segment of interest occurs in a two-state manner (Figure 2.2), with rates 푘U,푖 and 푘F,푖 at nascent chain length 푖, and rates 푘U and 푘F for ribosome-released nascent chains. Two-state folding indicates that the nascent chain segment does not populate any intermediate states, which is a reasonable assumption for small, cooperative folding domains. C protein has been shown to fold in a manner consistent with this two-state assumption20.

A3. That the dwell time of the ribosome at a particular codon position is exponentially distributed, with the rate of translation of codon 푖 denoted 푘A,푖. This assumption allows for the derivation of an analytical model60,61, but there is experimental evidence that ribosome dwell times are best described by the difference of two exponential terms62. We show below, however, that the predictions using either dwell-time distribution are highly similar.
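The two dwell-time descriptions named in assumption A3 can be compared with a short sampling sketch. It draws dwell times from a single exponential with rate 푘A and from a two-step process, whose waiting-time density is a difference of two exponential terms, matched to the same mean. The step rates `k1` and `k2`, the sample size, the seed and all function names are illustrative assumptions and are not taken from this work.

```python
import random
import statistics

def sample_single_exponential(k_a, n, rng):
    """Dwell times (s) for a one-step process with rate k_a (per second)."""
    return [rng.expovariate(k_a) for _ in range(n)]

def sample_two_step(k1, k2, n, rng):
    """Dwell times (s) for two sequential exponential steps with rates k1 != k2.

    The density of the sum is k1*k2/(k2 - k1) * (exp(-k1*t) - exp(-k2*t)),
    that is, a difference of two exponential terms.
    """
    return [rng.expovariate(k1) + rng.expovariate(k2) for _ in range(n)]

rng = random.Random(0)           # fixed seed so the sketch is reproducible
k_a = 3.9                        # AA per second, the average rate used in the text
k1, k2 = 1.5 * k_a, 3.0 * k_a    # hypothetical step rates; 1/k1 + 1/k2 == 1/k_a

single = sample_single_exponential(k_a, 50_000, rng)
two_step = sample_two_step(k1, k2, 50_000, rng)

mean_single = statistics.fmean(single)
mean_two_step = statistics.fmean(two_step)
sd_single = statistics.pstdev(single)
sd_two_step = statistics.pstdev(two_step)
```

With matched means, the two-step distribution is narrower than the single exponential, so the two descriptions differ mainly in the spread of dwell times rather than in the average time spent per codon.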

These assumptions are, of course, not valid for all proteins or translation systems. For example, if a protein is known to fold via a pathway that includes an intermediate state then assumption A2 is not valid and our model will make inaccurate predictions. Under these assumptions, and with the introduction of discretization of t into time points of duration sδt, Eq. 2.1 can be rewritten as (see Appendix A for a full derivation)

$$P_{\mathrm{F}}(t(s)) = \frac{1}{\sum_{i=1}^{M} N_{\mathrm{L,B}}(i,t(s)) + \sum_{n=0}^{s} N_{\mathrm{L,R}}(t(s),t'(n))} \left[ \sum_{i=1}^{M} N_{\mathrm{L,B}}(i,t(s))\, P_{\mathrm{F,B}}(i) + \sum_{n=0}^{s} N_{\mathrm{L,R}}(t(s),t'(n)) \left( \left[ P_{\mathrm{F,B}}(M) - \frac{k_{\mathrm{F}}}{k_{\mathrm{F}}+k_{\mathrm{U}}} \right] e^{-[k_{\mathrm{F}}+k_{\mathrm{U}}][t(s)-t'(n)]} + \frac{k_{\mathrm{F}}}{k_{\mathrm{F}}+k_{\mathrm{U}}} \right) \right], \qquad \text{(Eq. 2.2)}$$

which expresses 푃F(푡(푠)) purely as a function of the underlying rates of folding, unfolding and codon translation. To illustrate how the quantities 푁L,B and 푁L,R, the relative numbers of ribosome-bound and ribosome-released nascent chains, can change with time during the experiment and how 푃F(푡(푠)) is calculated in practice, we provide a simple but tractable example in Figure A.4. We tested the validity of assumptions A1 and A3 and determined that our model can be applied even when there are small deviations from steady state (see Section A.2.2 and Figure A.5) and that the predictions using either the single-exponential or the difference of two exponential dwell-time distribution (see Section A.2.3, Figure A.6) are highly similar. We provide computer code as a Supplementary File to carry out these calculations; it is the same code used to make the predictions displayed in Figures 2.3, 2.5, 2.6, and 2.7. For a typical protein domain, making a prediction with Eq. 2.2 requires between 1 and 3 min of computer time on a typical computer.
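As a reading aid for Eq. 2.2, the sketch below evaluates its right-hand side at a single time point from already-known populations. It is not the Supplementary File code: the function name, the argument layout and the toy input values are assumptions made here for illustration, and the quantities 푁L,B, 푁L,R and 푃F,B(푖) are taken as given rather than derived.

```python
import math

def folding_probability(n_bound, p_fb, released, k_f, k_u, t_now):
    """Evaluate the right-hand side of Eq. 2.2 at one time point t(s).

    n_bound  -- N_LB(i, t(s)) for codons i = 1..M (list of length M)
    p_fb     -- P_FB(i) for codons i = 1..M (list of length M)
    released -- list of (t_release, N_LR) pairs with t_release <= t_now
    k_f, k_u -- bulk folding and unfolding rates (per second)
    """
    if len(n_bound) != len(p_fb):
        raise ValueError("n_bound and p_fb must both have length M")
    total = sum(n_bound) + sum(n for _, n in released)
    if total <= 0:
        raise ValueError("no labeled chains at this time point")
    p_eq = k_f / (k_f + k_u)      # equilibrium folded fraction
    p_release = p_fb[-1]          # P_FB(M), folded probability at release
    bound_term = sum(n * p for n, p in zip(n_bound, p_fb))
    released_term = sum(
        n * ((p_release - p_eq) * math.exp(-(k_f + k_u) * (t_now - t_rel)) + p_eq)
        for t_rel, n in released
    )
    return (bound_term + released_term) / total

# Toy input (hypothetical): M = 3 codons, two batches of released chains.
example = folding_probability(
    n_bound=[1.0, 1.0, 1.0],
    p_fb=[0.0, 0.0, 0.5],
    released=[(0.0, 2.0), (1.0, 1.0)],
    k_f=20.0,
    k_u=4.34e-5,
    t_now=2.0,
)
```

Each released batch relaxes exponentially from 푃F,B(푀) toward the equilibrium value 푘F/(푘F+푘U), and the normalization by the total number of labeled chains is what turns the weighted sum into a probability.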

2.3.2 Constructing a fully constrained model

Figure 2.2. The co- and post-translational protein folding reaction scheme that Eq. 2.2 solves. Initiation of translation of a transcript occurs at a rate 푘푖푛푡. At each codon position 푖 the probability that the nascent chain segment of interest folds depends on the rates of folding, unfolding and codon translation. At short nascent chain lengths a domain within the nascent chain is not sterically permitted to fold due to the confining environment of the ribosome exit tunnel, and therefore at these lengths the rates of folding and unfolding are defined to be zero. When the domain has emerged from the exit tunnel it can fold and unfold with rates 푘퐹,푖 and 푘푈,푖. Once the nascent chain has been released from the ribosome it will fold and unfold post-translationally with the bulk folding and unfolding rates 푘퐹 and 푘푈. Note that this scheme does not depict the time-dependent fraction of radiolabeled nascent chains at codon 푖, which Eq. 2.2 does account for.

A concern with any model that aims to predict experimentally measured quantities is that it will be under-constrained. In such situations it is common to introduce additional assumptions to reduce the number of free parameters. Equation 2.2, with only assumptions A1, A2 and A3, is an under-constrained model for predicting SFVP’s behavior, as 3,771 rates are needed. These rates are the 1,257 codon translation rates in the CDS, and C protein’s folding and unfolding rate at each of the 1,257 nascent chain lengths. However, introducing three additional assumptions results in a fully constrained model; these assumptions are:

A4. That each codon translates at the average codon translation rate. There is experimental evidence that this is a reasonable approximation for some proteins. While it is almost certainly the case that translation rates can vary from one codon to the next, it has been shown in mouse stem cells that no matter the length or type of protein being translated, all proteins are translated with an average codon translation rate of 5.6 AA per second63. On heuristic grounds, we expect that this experimental observation likely arises from the Central Limit Theorem, meaning that the most-probable codon translation rate will be the average codon translation rate provided that these rates are randomly distributed across the CDS.

A5. That the nascent chain segment of interest is only sterically permitted to fold once it emerges from the ribosome exit tunnel. This assumption is supported by structural64, proteolysis65, single molecule66 and coarse-grained simulation studies49 that demonstrate that protein domains need linker lengths of between 24 and 40 residues to fold, as the exit tunnel is too narrow to allow large domains to fold48.

A6. That once C protein is sterically permitted to fold and unfold it does so at its bulk folding and unfolding rates. Coarse-grained simulations of protein-G folding on the ribosome found it attained its bulk folding and unfolding rates just three residues beyond the nascent chain length at which it could form a thermodynamically-stable folded structure49. A single-molecule experiment66 suggests that T4 lysozyme attains its bulk folding and unfolding rates at a linker length of 80 residues, ∼40 residues after it has emerged from the exit tunnel. Consider that C protein is sterically permitted to fold starting at 297 residues in length, such that at nascent chain lengths between 297 and 337 residues its 푘F and 푘U may differ from their bulk values. From 337 to 1,257 residues in length, however, C protein has most likely attained its bulk 푘F and 푘U values. Thus, for only 40 out of 920 (=1,257–337) nascent chain lengths are the 푘F and 푘U of C protein potentially different than its bulk values, or only 4% of the nascent chain lengths at which C protein is sterically permitted to fold. This assumption is therefore reasonable for the proteins investigated in this paper.

Assumption A4 reduces the number of required translation rates from 1,257 to 1, reducing the number of required parameters by 1,256. Assumption A5 reduces the number of free parameters by 592 (=2 × 296), because the 푘F,푖 and 푘U,푖 values for 푖≤296 residues can be set to 0 s−1. Assumption A6 reduces the number of free parameters by 1,920 (=2 × (1,256–296)), as for all nascent chain lengths at which folding and unfolding are permitted the bulk 푘F and 푘U values are used. Thus, with these assumptions, we only require three parameters to make predictions using Equation 2.2: the bulk 푘F and 푘U values and average 푘A. Therefore, our predictions are made based on a model that is fully constrained by literature-reported values.
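The parameter bookkeeping above can be checked with a few lines of arithmetic. The variable names are chosen here for illustration; the numbers are those quoted in the text for WT SFVP.

```python
M = 1257                    # codons in the SFVP CDS
FIRST_FOLDING_LENGTH = 297  # shortest length at which C protein may fold

# One translation, one folding and one unfolding rate per nascent chain length.
total_rates = 3 * M

# A4: a single average translation rate replaces M codon-specific rates.
removed_a4 = M - 1

# A5: folding and unfolding rates are zero for lengths 1..296.
removed_a5 = 2 * (FIRST_FOLDING_LENGTH - 1)

# A6: bulk rates replace the remaining length-specific folding/unfolding rates.
removed_a6 = 2 * ((M - 1) - (FIRST_FOLDING_LENGTH - 1))

remaining = total_rates - removed_a4 - removed_a5 - removed_a6
```

The three parameters left over are the bulk 푘F, the bulk 푘U and the average 푘A.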


As more experimental information becomes available the number of assumptions required to make predictions using Equation 2.2 can be reduced. For example, ribosome profiling67 holds out the promise that it may be possible to directly measure the 푘A,푖 values for a transcript12,68–70. In such a situation, assumption A4 is not necessary.

2.3.3 Prediction of pulse-chase co-translational folding curves

Using as input parameters the experimentally-determined values of 푘F, 푘U and 푘A (see Table 2.1 and Appendix A) for C protein in CHO cells and the experimental values of a 45-s pulse period and a 360-s chase period2, with a 10-s delay in the start of the incorporation period as is observed to occur in CHO cells59, we find that Equation 2.2 accurately predicts the experimentally measured co-translational folding curves for both the WT and ΔC SFVP constructs (Figure 2.3; SFVP WT: 푅2 = 0.96, 푝 = 1 × 10−4; SFVP ΔC: 푅2 = 0.99, 푝 = 1 × 10−6).
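To give a feel for the time scales that enter this prediction, the following sketch works through the timing arithmetic implied by the quoted pulse length, uptake delay and average translation rate. It assumes that the 10-s delay shifts both the start and the end of label incorporation, which is one reading of the experimental description; the function and variable names are illustrative only.

```python
PULSE_S = 45.0         # pulse duration (s)
UPTAKE_DELAY_S = 10.0  # delay before added amino acids are incorporated (s)
K_A = 3.9              # average codon translation rate (AA per second)

def incorporation_window(pulse_s, delay_s):
    """Return (start, end) of label incorporation, in s after the pulse is added.

    Assumes the same delay applies when the pulse and when the chase are added.
    """
    return delay_s, pulse_s + delay_s

start, end = incorporation_window(PULSE_S, UPTAKE_DELAY_S)

# Codons a single ribosome translates while label is being incorporated.
codons_per_window = K_A * (end - start)

# Mean time to synthesize full-length WT SFVP (1,257 codons) and to reach
# the first length at which C protein is permitted to fold (297 residues).
synthesis_time_wt = 1257 / K_A
time_to_first_folding_length = 297 / K_A
```

Under these assumptions the incorporation window is much shorter than the time needed to synthesize the full polyprotein, which is why chains of many different lengths carry label at any given time point.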

2.3.4 Prediction of FactSeq co-translational folding curves

As a further test of our approach, we also modelled in vivo co-translational folding curves for the 99-amino acid FKBP12-rapamycin-binding domain of a Flag-FRB-GFP construct (Figure 2.1B) and the 290 structured residues of the viral protein HA1 from influenza A/PR8 (Fig. 2.1B). These co-translational folding curves have been measured using the experimental technique known as folding-associated co-translational sequencing (FactSeq)71. FactSeq is a Next-Gen sequencing technique that uses substrate or antibody binding to monitor the co-translational folding status of a protein segment as a function of the nascent chain length rather than as a function of time as in pulse-chase measurements. Thus, Eq. A.5 (described in Appendix A) and not Eq. 2.2 is appropriate for predicting these co-translational folding curves. For FRB and HA1, we used the 푘F and 푘U values reported in Table 2.1. The typical range of translation rates in eukaryotic cells is 3.2–5.6 AA per second2,63. Using this range of 푘A values we find Eq. A.5 predicts very similar in vivo co-translational folding trends as are observed experimentally for FRB and HA1; the results when a 푘A of 3.9 AA per second is used are displayed here in Figure 2.4. The FactSeq data exhibit large variances in their signal from one codon position to the next, non-zero probabilities within the first fifty codons where folding cannot take place owing to the steric effect of the ribosome exit tunnel48, and probabilities >1.0 that arise from a numerator and denominator that are measured in two different experiments. Owing to these poor experimental statistics it is inappropriate to compare the measurements to the detailed, codon-specific predictions of our model. Instead it is justified, as was done in the original FactSeq publication71, to interpret the experimental data in terms of unfolded and folded regions along the transcript. Therefore, we broke the FactSeq data and our predictions into three regions. Region I corresponds to the first 50 codons of the transcript, and is used as a baseline where any signal from this region must correspond to unfolded protein. We then used the boundaries identified by Qian and colleagues71 in the original FactSeq paper for Regions II and III (see Appendix A). If Region II corresponds to an unfolded protein domain then the median FactSeq signal in this region should be statistically indistinguishable from the median value in Region I. We therefore tested the null hypothesis that the median values in Regions I and II are the same. We applied the Mann–Whitney U-test to this hypothesis and found that Regions I and II are statistically the same (Figure 2.4D, Region I versus Region II: FRB: 푝 = 0.078, H28-E23: 푝 = 0.1933 and Y8-10C2: 푝 = 0.4471). We also used the Mann–Whitney U-test to determine that Region III is statistically different from Regions I and II (Figure 2.4D, Region III versus Region I; FRB: 푝 = 5.04 × 10−11, H28-E23: 푝 = 2.56 × 10−11 and Y8-10C2: 푝 = 9.11 × 10−8. Region III versus Region II; FRB: 푝 = 3.2 × 10−9, H28-E23: 푝 = 2.75 × 10−15 and Y8-10C2: 푝 = 8.98 × 10−11). Thus, the experimental data are consistent with the FRB and HA1 folding domains being unfolded in Regions I and II and folded in Region III. These trends in the FactSeq data and our predictions are consistent. These results lend further support to the accuracy of our modelling approach, as Eq. A.5 is an integral part of Eq. 2.2.

Figure 2.3. Comparison between the predicted and experimentally measured SFVP co-translational folding curves. Probabilities of co-translational folding calculated using Eq. 2.2 (red triangles) and experimentally measured using pulse-chase labelling (open blue squares) for the WT (a) and ΔC mutant (b) of SFVP. Error bars for the experimental results were not reported, and so error bars were estimated as the average standard deviation from the mean from three independent pulse-chase experiments carried out under similar experimental conditions (see Appendix A). To match the convention used in the experiment, the predicted co-translational folding curve was shifted such that the start of the chase is at 푡 = 0. WT: 푅2 = 0.96, 푝 = 1 × 10−4; ΔC mutant: 푅2 = 0.99, 푝 = 1 × 10−6.

Figure 2.4. Comparison between the predicted and experimentally-measured FRB and HA1 co-translational folding curves. (a) The co-translational folding probability calculated with Eq. A.5 (black line) and the experimentally-measured fraction folded using FactSeq (blue circles) for (a) FRB, HA1 using antibody binding epitope (b) H28-E23 and (c) Y8-10C2 are shown. Regions I, II and III, as described in the main text, are indicated, respectively, by the shaded regions in green, blue and red. (d) The median values of the FactSeq-measured 푃퐹,퐵(푖) in Regions I, II and III are shown with bootstrapped error bars for FRB, H28-E23 and Y8-10C2. The statistical significance of the 푃퐹,퐵(푖) values was determined using the Mann–Whitney U-Test. Region I versus Region II: FRB: P=0.078, H28-E23: P=0.1933 and Y8-10C2: P=0.4471. Region III versus Region I: FRB: P=5.04 × 10−11, H28-E23: P=2.56 × 10−11 and Y8-10C2: P=9.11 × 10−8. Region III versus Region II FRB: P=3.2 × 10−9, H28-E23: P=2.75 × 10−15 and Y8-10C2: P=8.98 × 10−11. Hence, the experimental data from FactSeq are consistent with the predicted co-translational folding curves in panels a, b, and c of this figure.
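The region-wise comparison with the Mann–Whitney U-test can be sketched in plain Python as follows. This is a generic implementation using the normal approximation, without tie or continuity corrections, and is not the analysis code used for Figure 2.4; the function name and the two small example samples are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U-test via the normal approximation.

    Returns (U, p). Tied values receive average ranks; the variance is not
    corrected for ties, so p is approximate for small or heavily tied samples.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1   # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u_x, n1 * n2 - u_x)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p

# Hypothetical signals: two interleaved samples and two well-separated ones.
u_same, p_same = mann_whitney_u([1, 3, 5, 7], [2, 4, 6, 8])
u_diff, p_diff = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

A large p-value, as in the interleaved example, means only that the test cannot distinguish the two samples; it does not by itself establish that the underlying distributions are identical.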

Table 2.1. Model parameters for SFVP, FRB, HA1, and yeast proteins

| Protein | Total codons encoding protein | Codons encoding co-translational folding domain | Length of observable domain | 푘F,푖 (s−1) | 푘U,푖 (s−1) | 푘A,푖 (AA/s) |
| --- | --- | --- | --- | --- | --- | --- |
| SFVP WT | 1,257 | 1–267 | 255 | 0 for 푖 = 1–296; 20 for 푖 = 297–1,257 | 4.34 × 10−5 | 3.9 |
| SFVP ΔC | 1,145 | 1–155 | 143* | 0 for 푖 = 1–184; 20 for 푖 = 185–1,145 | 4.34 × 10−5 | 3.9 and variable |
| Flag-FRB-GFP | 379 | 11–99 | 99 | 0 for 푖 = 1–128; 15.93 for 푖 = 129–379 | 0.72 | 3.9 |
| HA1 | 565 | 53–275 | 222 | 0 for 푖 = 1–304; 0.1378 for 푖 = 305–565 | 7.58 × 10−5 | 3.9 |
| DHOM | 359 | 1–161 | 161 | 0 for 푖 = 1–190; 0.0240 for 푖 = 191–359 | 2.48 × 10−8 | Variable |
| SBA1 | 216 | 1–135 | 135 | 0 for 푖 = 1–164; 0.0721 for 푖 = 165–216 | 5.40 × 10−6 | Variable |
| EF2 | 842 | 570–721 | 151 | 0 for 푖 = 1–750; 0.0501 for 푖 = 751–842 | 1.03 × 10−7 | Variable |
| DPP3 | 711 | 431–671 | 240 | 0 for 푖 = 1–700; 0.1811 for 푖 = 701–711 | 4.33 × 10−9 | Variable |

* The last radiolabeled position in SFVP WT is 푖 = 255; the length of the observable domain for SFVP ΔC is therefore 143 (= 255 − 112).

2.3.5 Sensitivity of predictions to parameter variation

To test the sensitivity of our model’s predictions, we varied the parameters 푘F,푖, 푘U,푖 and 푘A,푖 several fold for each protein. The predicted folding curves for the proteins HA1 and yeast proteins DHOM, DPP3, SBA1, and EF2 (see below) are sensitive to one order of magnitude changes in 푘F,푖 (Figures A.7 and A.8). On the other hand, the folding curves predicted for ΔC SFVP (Figure 2.5A), WT SFVP (Figure A.8) and FRB (Figure A.7) only visibly shift after a two order of magnitude change in 푘F,푖. By varying 푘U,푖 by an order of magnitude we determined that the predicted folding curves for all the proteins are insensitive to this variation in the respective unfolding rates (Figure 2.5B, A.7, and A.8). We also determined that, for all proteins in this study, except FRB, a twofold change in the global 푘A,푖 substantially shifts the co-translational folding curves (Figure 2.5D, A.7, and A.8). In the case of ΔC SFVP, we used trial and error to determine the 푘F and 푘U values needed for Eq. 2.2 to make inaccurate predictions. We find that the 푘F and 푘U values must change by factors of 10³ and 10⁶, respectively, for the predictions to fall outside the error bars (Figure 2.5A and B). We also tested how the number of residues that could fit in the ribosome exit tunnel affected the results for ΔC SFVP and found that our predictions are robust to changes to this value (Fig. 2.5C). We emphasize that not all proteins exhibit such robust results and elaborate on this point further in the Discussion section.
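A back-of-the-envelope calculation with the two-state expressions gives some intuition for these factors. The sketch below is not the sensitivity analysis performed in this work: it only evaluates the equilibrium folded fraction 푘F/(푘F+푘U) and the relaxation half-time ln 2/(푘F+푘U) for C protein's tabulated rates under the scaling factors mentioned in the text. The function names and the choice of scan factors are illustrative assumptions.

```python
import math

def equilibrium_folded_fraction(k_f, k_u):
    """Two-state equilibrium probability of being folded."""
    return k_f / (k_f + k_u)

def relaxation_half_time(k_f, k_u):
    """Time (s) for a two-state system to relax halfway to equilibrium."""
    return math.log(2) / (k_f + k_u)

K_F, K_U = 20.0, 4.34e-5   # C protein rates from Table 2.1 (per second)

# Equilibrium folded fraction when k_U is multiplied by each factor.
ku_scan = {f: equilibrium_folded_fraction(K_F, K_U * f) for f in (1, 10, 1e3, 1e6)}

# Relaxation half-time when k_F is divided by each factor.
kf_half_times = {f: relaxation_half_time(K_F / f, K_U) for f in (1, 10, 1e3)}
```

In this simplified picture the folded fraction stays close to one until 푘U approaches 푘F, and folding only becomes slow relative to the tens-of-seconds time scale of the experiment once 푘F has dropped by roughly three orders of magnitude.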

Figure 2.5. Sensitivity analysis of the predicted co-translational folding curve of ΔC SFVP to changes in the number of residues that fit inside the ribosome, 푘퐴,푖, 푘퐹,푖, and 푘푈,푖. (a) Co-translational folding curves calculated using 푘퐹,푖 values of 0.02, 2, 20 or 200 s−1 in Eq. 2.2 are plotted alongside the experimental time course (blue squares, panels a, b, c, and d). (b) Co-translational folding curves calculated using 푘푈,푖 values of 43.0, 4.34 × 10−4, 4.34 × 10−5 and 4.34 × 10−6 s−1. (c) Co-translational folding curves for the cases of the ribosome exit tunnel including 20 (green triangles), 30 (red squares) or 40 (blue diamonds) amino acids. (d) Co-translational folding curves calculated using global codon translation rates of 7.6 (purple diamonds), 3.9 (red triangles) or 1.9 AA per second (green circles).


2.3.6 Model sensitivity to variable codon translation rates

The efficiency of co-translational folding can be influenced by the variability in translation rates from one codon position to the next along an mRNA molecule25,28,72. Our previous predictions (Figure 2.3) were based on a uniform translation rate (assumption A4) and we therefore wished to test how sensitive our predictions are to variable rates. Individual codon translation rates in CHO cells, however, have not been measured. There have been at least five different estimates of codon translation rates in other organisms extracted from ribosome profiling data68–70 or calculated from theory12. These estimated codon translation rates do not correlate with each other, even when calculated for the same organism (Figure A.9). Settling the controversy of which data set is most accurate is outside the scope of this study. Therefore, we used each of the five codon translation rate sets to test the sensitivity of our predictions. To apply these rates to CHO cells we scaled them such that the average codon translation rate across the ΔC SFVP transcript matched the experimentally-measured 3.9 AA per second value (Table A.1). Using these individual codon translation rates in Eq. 2.2, we find that for four out of the five translation rate sets the predictions are essentially the same as when the average translation rate is used at every codon position (Figure 2.6). These results indicate that our predictions for ΔC SFVP are not highly sensitive to variable codon translation rates and that assumption A4 is reasonable for this protein. The Fluitt–Viljoen translation rate estimates are the only ones to result in predicted values that are statistically different from experiment. We also noticed that the Fluitt–Viljoen rates have the largest variance in translation rates compared with the other rate estimates (Table A.1). 
Therefore, we hypothesized that either the fastest- or slowest-translating codons in the set of rates predicted by Fluitt and Viljoen were the greatest contributors to the deviations from experiment. To test this hypothesis we created two new translation rate data sets. For the first (denoted ‘Slow Set’) the six slowest-translating sense codons were assigned their Fluitt–Viljoen values and the other 58 codon types were assigned the average rate of 3.9 AA per second. The other set (denoted ‘Fast Set’) used the six fastest-translating sense codons. Using these new translation-rate estimates in Eq. 2.2, we find that the fast set better reproduces the experimental values, while the slow set yields a deviation in the same direction as that observed when the full Fluitt–Viljoen translation rate set is used (Figure A.10). This test indicates the greatest contributor to the deviation from experiment is the slowest codon translation rates estimated by Fluitt and Viljoen. It also suggests that, at least for ΔC SFVP synthesis in CHO cells, Fluitt and Viljoen’s estimated rates may have too great a variance.

Figure 2.6. Effects of variable codon translation rates on the predicted co-translational folding curve for ΔC SFVP. The predictions made using Eq. 2.2 with translation rates measured by Gardin et al. for yeast (green squares), Stadler and Fire for yeast (purple diamonds), Dana and Tuller for yeast (light blue triangles), Dana and Tuller for C. elegans (gold circles), and predicted by the Fluitt–Viljoen model for yeast (red circles) are displayed alongside the experimental (open blue squares) values with their associated error bars (see Fig. 2.3 and Methods section). The various translation-rate sets used are listed in Table A.1.
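The rescaling of a codon rate set and the construction of the ‘Slow Set’ and ‘Fast Set’ can be sketched as below. This sketch assumes that "average codon translation rate across the transcript" means the arithmetic mean of the per-codon rates over all positions; the function names, the three-codon rate table and the four-codon transcript are hypothetical and serve only to show the mechanics.

```python
def scale_rates_to_mean(codon_rates, transcript, target_mean):
    """Rescale per-codon-type rates so their mean over `transcript` is target_mean.

    Assumes an arithmetic mean of rates taken over every codon position.
    """
    current_mean = sum(codon_rates[c] for c in transcript) / len(transcript)
    factor = target_mean / current_mean
    return {codon: rate * factor for codon, rate in codon_rates.items()}

def extreme_subset(codon_rates, n, average_rate, slowest=True):
    """Keep the n slowest (or fastest) codon types; set all others to average_rate."""
    ordered = sorted(codon_rates, key=codon_rates.get, reverse=not slowest)
    keep = set(ordered[:n])
    return {c: (r if c in keep else average_rate) for c, r in codon_rates.items()}

# Hypothetical three-codon rate table (AA per second) and toy transcript.
raw_rates = {"AAA": 2.0, "GGG": 6.0, "CCC": 10.0}
transcript = ["AAA", "GGG", "GGG", "CCC"]

scaled = scale_rates_to_mean(raw_rates, transcript, target_mean=3.9)
slow_set = extreme_subset(scaled, n=1, average_rate=3.9, slowest=True)
fast_set = extreme_subset(scaled, n=1, average_rate=3.9, slowest=False)
```

Multiplying every rate by one common factor preserves the relative ordering and relative spread of the rate set, so any difference in the prediction comes from the shape of the distribution rather than from its mean.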

2.3.7 Domains can switch from post- to co-translational folding

Synonymous codon substitutions can radically alter nascent protein behavior by modifying the translation-elongation kinetics of a transcript24,27 and thereby changing the timing and efficiency of co-translational processes. Previously, it was demonstrated that the co-translational folding of a domain in the E. coli protein SufI can be abolished by the introduction of fast-translating synonymous codon substitutions in a normally slow-translating region22. In light of this, we sought to determine if synonymous codon substitutions can alter the most fundamental classification of nascent protein folding in yeast in the opposite manner. That is, can synonymous codon substitutions be used to cause a yeast protein domain that folds post-translationally when translated from the WT transcript to fold co-translationally in the case of the synonymous variant? Experimental and simulation studies have found that slowing down translation-elongation tends to increase the probability that a domain will co-translationally fold28,46. Therefore, we hypothesized that introducing slow-translating codon substitutions into transcripts might be sufficient to switch some yeast domains from post- to co-translational folding. To test this hypothesis we examined 10 randomly-selected cytosolic, multi-domain proteins in yeast and predicted their pulse-chase folding curves using their WT mRNA sequence and also predicted their folding curves when all the codon positions were substituted with their slowest-translating synonymous codon. To make these predictions the Fluitt–Viljoen yeast translation rates were used (Table A.1), and, as in the experiments with SFVP, a pulse period of 45 s was used. We find that four of the yeast proteins we examined contain at least one domain that switches from post- to co-translational folding in our model. The pulse-chase time courses for two of these proteins (Figure 2.7A and B, top panels) show that for the WT CDSs the appearance of the full-length protein precedes folding, indicating that these proteins fold predominantly post-translationally; the situation is reversed for the mutated, slowest-translating CDSs, indicating the same domains fold predominantly co-translationally in this case. This change from post- to co-translational folding is also evidenced by an increase in the time-independent probability that the protein domain folds co-translationally (푃F,Co−t, see Appendix A) for the slowest-translating CDSs (Figure 2.7A and B bottom panel, C). Thus, our model predicts that, for some proteins in yeast, a fundamental change in nascent protein folding mechanisms can occur owing to synonymous codon substitutions.

Figure 2.7. Synonymous codon substitutions can switch some yeast protein domains from post- to co-translational folding according to Eq. 2.2. (a) Top panel. The probability of folding as a function of the chase time for domain 1 of DHOM predicted using Eq. 2.2. Calculations were performed for both the WT transcript (red solid line) and the transcript in which all codon positions were substituted with their slowest-translating synonymous codon (solid blue line). In the same panel is plotted the time-dependent fraction of full-length protein (see Methods section) synthesized from the WT (red dashed line) or the slow-translating (blue dashed line) transcript. (a) Bottom panel. The fraction of DHOM molecules whose first domain folds co-translationally when synthesized from the WT (red) or slowest-translating (blue) transcript. (b) Same as a but for domain 1 of SBA1. (c) Additional probabilities of co-translational folding for domain 6 of EF2 (top) and domain 2 of DPP3 (bottom) for their WT and slowest-translating transcripts. Dashed grey lines separate the co- and post-translational folding classes.
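The qualitative effect of slowing translation can be illustrated with a simplified two-state calculation. This is a sketch under stated assumptions, not the 푃F,Co−t expression of Appendix A: it assumes exponentially distributed dwell times, for which averaging exp(−(푘F+푘U)τ) over the dwell time τ gives 푘A/(푘A+푘F+푘U), and it uses uniform translation rates of 10 and 1 AA per second that are purely hypothetical. The folding and unfolding rates and codon boundaries are those listed for DHOM in Table 2.1.

```python
def folded_at_termination(k_f, k_u, k_a_per_codon, first_folding_codon):
    """Probability that a two-state domain is folded when synthesis finishes.

    k_a_per_codon       -- translation rate at each codon (list, AA per second)
    first_folding_codon -- first 1-based codon position at which folding is allowed

    For an exponentially distributed dwell time with rate k_a, the folded
    probability relaxes toward equilibrium by the factor k_a / (k_a + k_f + k_u)
    at each codon.
    """
    p_eq = k_f / (k_f + k_u)
    p_folded = 0.0                      # unfolded until folding is permitted
    for k_a in k_a_per_codon[first_folding_codon - 1:]:
        carry_over = k_a / (k_a + k_f + k_u)
        p_folded = p_eq + (p_folded - p_eq) * carry_over
    return p_folded

# DHOM domain 1 (Table 2.1): 359 codons, folding permitted from codon 191.
K_F, K_U, N_CODONS, FIRST = 0.0240, 2.48e-8, 359, 191

p_fast = folded_at_termination(K_F, K_U, [10.0] * N_CODONS, FIRST)  # hypothetical
p_slow = folded_at_termination(K_F, K_U, [1.0] * N_CODONS, FIRST)   # hypothetical
```

In this toy calculation a tenfold slowdown of translation moves the domain from being mostly unfolded at the end of synthesis to being almost always folded, which mirrors the direction of the switch described above.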

2.4 Discussion

The study of protein folding in vitro over the past several decades has led to models that can accurately predict the time course of folding for small proteins73. More recently, it has been demonstrated that the tertiary folding of protein domains can begin during their synthesis by the ribosome2,22,48,64. Translation introduces an additional process that can influence nascent protein folding; hence, the kinetic equations describing protein folding have recently been expanded to account for the impact of codon translation rates25,46. These new models, while successfully tested against results from molecular dynamics simulations46, have not previously been validated against experimental data. The results of our study are the first to do so, and they demonstrate that our chemical kinetic modelling approach (Eq. 2.2) can make accurate predictions of nascent protein folding in vivo. The model calculates the predicted folding probability as a continuous rather than a discrete variable, which means the model is deterministic rather than stochastic74. This is a reasonable approximation for ensemble experiments, such as pulse chase, where the signal is averaged over a large number of nascent protein molecules. Importantly, the model only requires as input the domain-of-interest’s bulk folding and unfolding rates, and the average translation rate in the cell. If assumption A4 is discarded then the model requires all 64 codon translation rates. Such rate information has been reported in the literature for a number of different proteins73 and cell types59,63,75,76, suggesting this theoretical approach can be applied to a wide variety of proteins in different organisms. Our model explains the molecular origin of three features of the experimentally measured pulse-chase co-translational folding curves of SFVP (Figure 2.3). 
First, the non-zero folding probability at the start of the chase period is a result of the pulse’s duration being long enough to allow some labeled nascent C protein to complete synthesis, fold and cleave itself from the incomplete nascent protein before the chase period starts. Second, the measured WT and ΔC 푃F(푡) curves increase linearly (푅2 values of 0.94 for WT and
0.99 for ΔC SFVP) between the end of the incorporation period and the time point at which all labeled nascent C proteins achieve their equilibrium folding probability (that is, between times 0 and 100 s in Figure 2.3). This linear regime arises because a constant number of labeled C proteins reach the folded state at each time point during this period. Finally, the plateau of the co-translational folding curve, from 푡 = 100 to 360 푠, arises because in this range all labeled C protein molecules have achieved their equilibrium folding probability. Thus, Eq. 2.2 not only provides accurate predictions but also offers explanations for the features of co-translational folding curves. A subtle, but important technical point is that radiolabeling in pulse-chase experiments is typically preceded by a period of amino acid starvation, and this was indeed the case in the SFVP experiments that we modelled (Figure 2.3). This can potentially lead to deviations from steady state, which would violate assumption A1 of Eq. 2.2. The deviations from steady-state behavior during Helenius’s pulse-chase experiments, however, appear to be minimal, as evidenced by the linear time dependence of the accumulation of C protein during the chase (WT: 푅2 = 0.94, 푝 = 0.02; ΔC mutant 푅2 = 0.99, 푝 = 0.004; Figure 1C, D in Nicola, Chen, and Helenius, 1999, respectively). This can only occur if the rate of protein synthesis is constant, which can only be the case if translation is occurring at steady state. Thus, the assumption of steady-state translation is reasonable for this experimental data set. There can be experiments where steady state is not achieved77 (see Figure 7b in original publication, bottom panel; linear regression analysis of those data: 푅2 = 0.61, 푝 = 0.07). We therefore suggest that experimentalists who wish the steady-state approximation to be upheld follow the protocol of Helenius and co-workers. 
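The linearity check used above as evidence for steady state can be reproduced with an ordinary least-squares fit. The sketch below computes the slope, intercept and 푅2 of a straight-line fit in plain Python; the function name and the example time course are hypothetical and are not the published SFVP data.

```python
def linear_fit_r2(times, values):
    """Least-squares line through (times, values); returns (slope, intercept, R^2)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    s_vv = sum((v - mean_v) ** 2 for v in values)
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = s_tv / s_tt
    intercept = mean_v - slope * mean_t
    r_squared = s_tv ** 2 / (s_tt * s_vv)
    return slope, intercept, r_squared

# Hypothetical accumulation time course (time in s, normalized signal).
chase_times = [0, 20, 40, 60, 80, 100]
signal = [0.20, 0.36, 0.50, 0.67, 0.81, 0.98]

slope, intercept, r_squared = linear_fit_r2(chase_times, signal)
```

An 푅2 close to one indicates that the signal accumulates at a nearly constant rate over the fitted interval, which is the behavior expected when translation is at steady state.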
We were only able to test our model predictions for four proteins owing to the scarcity of in vivo experimentally-measured co-translational folding curves. As protein biophysicists continue to shift their research efforts from in vitro to in vivo protein behavior, we expect that more data will become available. Even without such data we can identify scenarios where the model could make inaccurate predictions. The current model assumes that domains fold in a two-state manner (assumption A2). Therefore, domains that populate long-lived intermediates or misfolded structures are unlikely to be accurately described by our model. This limitation can be overcome by using previously-reported mathematical expressions for the $P_{\mathrm{F,B}}(i)$ and $P_{\mathrm{F,R}}(t, t')$ terms in Eq. 2.1 that describe co-translational folding mechanisms involving three states25. In addition, co-translational folding can be influenced by chaperones41,78,79 and other cellular factors3. As a first approximation, Eq. 2.2 can implicitly account for the effects of these other molecules on the co-translational folding process by accounting for their effect on nascent protein folding and unfolding rates. For example, trigger factor is a molecular chaperone in E. coli that has been shown to slow down the co-translational folding of β-galactosidase78 through a number of potential molecular mechanisms50. Our model can implicitly account for this effect by appropriately decreasing the $k_{\mathrm{F},i}$ values.

A biologically fundamental prediction from our model is that some yeast proteins can be shifted from a post- to a co-translational folding mechanism by substituting codon positions in the WT CDS with their slowest-translating synonymous codon. Experimentalists have found that the introduction of presumably slow-translating synonymous substitutions often increases the extent of co-translational protein folding as reflected by the enzymatic activity13 or resistance to proteases22 of nascent proteins. For
example, a domain in SufI lost resistance to protease degradation when two rare codons were replaced with common codons, suggesting that faster elongation kinetics in the mutant transcript provide that domain insufficient time to co-translationally fold22. Similarly, it was found that optimizing codon usage in the N-terminal 164 codons of the Neurospora clock protein frequency (FRQ) was sufficient to decrease its ability to associate with the protein WC-2 by 60%29. If this 60% decrease is due to a decrease in co-translational folding efficiency, it would suggest that FRQ’s folding mechanism switched from predominantly co- to post-translational. These experimental studies highlight the challenge of determining the relative contributions of co- and post-translational folding to the observed signals. Our model, which can reproduce experimental co-translational folding curves, allows the contributions from co- and post-translational folding to be separately quantified. Thus, our prediction that some yeast proteins can transition from a predominantly post- to a predominantly co-translational folding mechanism suggests that this phenomenon can occur in organisms other than the two already identified. Our results, however, say nothing about how common or uncommon it is for yeast proteins to be able to switch from post- to co-translational folding, as only 10 proteins were examined. In the future, it would be interesting to address this issue by applying our model to the entire yeast proteome. There are a number of proteins reported in the literature26,31 for which only a few synonymous codon substitutions can alter nascent protein folding. Yet, for SFVP, we found that altered codon translation rates have minimal to moderate effects on its co-translational folding curve (Figure 2.5), and that for some yeast proteins (Figure 2.7) synonymous substitutions at all codon positions were necessary to shift the protein from post- to co-translational folding.
Should just a few synonymous codons be able to significantly affect the co-translational folding of every protein? Recent theoretical papers25,46,55 demonstrate that the complex interplay of timescales of folding and translation-elongation influences whether a protein’s co-translational folding curve is robust or sensitive to changes in codon translation rates. For example, if a domain folds extremely slowly or quickly relative to the possible codon translation times then introducing a synonymous mutation will have negligible effect on its co-translational folding. However, if the folding and codon translation times are similar, perturbations to a codon’s translation time can shift the folding curve. Furthermore, if a domain can populate off-pathway intermediates, synonymous codons can have an even greater impact55. In the case of SFVP, its bulk folding time is 50 ms[20], fivefold faster than the 256 ms codon translation time in CHO cells2. Thus, unless a synonymous codon substitution in SFVP’s transcript speeds up translation by more than fivefold, the substitution is unlikely to have a significant effect on its folding curve.

The preceding discussion of the importance of time scales of codon translation and folding also explains, in part, why the predictions for some protein domains are robust to folding-rate variation (Figure 2.5) and sensitive for others (Figures A.7 and A.8). Take, for example, the very different effects that varying $k_{\mathrm{F}}$ by the same amount can have on the folding curves for HA1 and FRB. The rates of folding for HA1 and FRB are 0.1378 s−1 and 15.93 s−1, respectively. Increasing the folding rate of HA1 by an order of magnitude to 1.378 s−1 significantly alters its folding curve (Figure A.7, left column, data for HA1), but decreasing the folding rate of FRB by an order of magnitude to 1.593 s−1 does not significantly alter its folding curve (Figure A.7, left column, data for FRB).
Why is one of these changes significant and the other insignificant? This is an example of how the interplay of timescales in non-equilibrium systems affects sensitivity, and is best
understood in light of timescale ratios. Increasing HA1’s folding rate to 1.378 s−1 changes the time required for its folding from 7,300 to 730 ms, a difference of roughly 6,600 ms; this 6,600 ms difference provides enough time for roughly 26 additional codons to be translated in CHO cells, significantly perturbing the co-translational folding curve. In the case of FRB, however, the order of magnitude decrease in $k_{\mathrm{F}}$ increases the mean time required for folding by only 570 ms, such that only two additional codons are translated before folding occurs. These differences in sensitivity can be observed in the co-translational folding curves for HA1 and FRB. Thus, the apparent robustness of our model’s predictions is a function of the separation of timescales.

In summary, we have derived an equation that can accurately predict the probability that particular segments of a nascent chain co-translationally fold in vivo as a function of time on the basis of their bulk folding and unfolding rates and the average codon translation rate. The application of our assumptions (A1 through A6) to Eq. 2.2 is sufficient to fully constrain it with experimental rate information, leaving no free parameters. This equation is general for pulse-chase experiments of any duration, and, by discarding assumption A4, can account for the effects of variable codon translation rates. We have used Eq. 2.2 to show that synonymous codons can switch yeast proteins between post- and co-translational folding mechanisms. Such quantitative modeling of co-translational folding opens up new opportunities to understand differential codon usage in organisms22,80 and the influence of co-translational folding on mRNA sequence evolution81, and can form the basis for the rational design of mRNA sequences to manipulate nascent protein behavior32.
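The HA1 and FRB timescale comparison above can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only: it assumes the mean folding time of a two-state domain is the reciprocal of its folding rate, and the function names are hypothetical rather than taken from the original analysis.

```python
# Mean folding time is taken as 1/k_F (two-state folding, assumption A2).
CODON_TIME_MS = 256.0  # average codon translation time in CHO cells (from the text)


def mean_folding_time_ms(k_f_per_s: float) -> float:
    """Mean folding time in ms for a folding rate given in s^-1."""
    return 1000.0 / k_f_per_s


def extra_codons(k_f_old: float, k_f_new: float) -> float:
    """Number of codons translated during the change in mean folding time."""
    delta_ms = abs(mean_folding_time_ms(k_f_old) - mean_folding_time_ms(k_f_new))
    return delta_ms / CODON_TIME_MS


ha1 = extra_codons(0.1378, 1.378)  # HA1: folding rate increased tenfold
frb = extra_codons(15.93, 1.593)   # FRB: folding rate decreased tenfold
print(f"HA1: {ha1:.1f} codons, FRB: {frb:.1f} codons")
```

With unrounded folding times the sketch gives about 25.5 codons for HA1 and about 2.2 codons for FRB, in line with the roughly 26 and two codons quoted above.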

2.5 Acknowledgements

We thank Carol Deutsch, Phil Bevilacqua, Will Noid, Naomi Altman, and Ben Fritch for valuable feedback on the manuscript and Shu-Bing Qian for providing the raw FactSeq data from Han et al. 2012. This study was supported by an HFSP grant.

Chapter 3

STRUCTURAL ORIGINS OF FRET-OBSERVED NASCENT CHAIN COMPACTION ON THE RIBOSOME

This chapter is reproduced with permission from The Journal of Physical Chemistry B from the article “Structural origins of FRET-observed nascent chain compaction on the ribosome” by Daniel A. Nissley and Edward P. O’Brien in The Journal of Physical Chemistry B, 2018, 122(43), 9927-9937. Copyright 2018 American Chemical Society.

3.1 Abstract

A fluorescence signal arising from a Förster resonance energy transfer process was used to monitor conformational changes of a domain within the E. coli protein HemK during its synthesis by the ribosome. An increase in fluorescence was observed to begin 10 s after translation was initiated, indicating the domain became more compact in size. Since fluorescence only reports a single value at each time point it contains very little information about the structural ensemble that gives rise to it. Here, we supplement this experimental information with coarse-grained simulations that describe protein conformations and transitions at a spatial resolution of 3.8 Å. We use these simulations to test three hypotheses that might explain the cause of domain compaction: (1) that poor solvent quality conditions drive the unfolded state to compact, (2) that a change in the dimension of the space the domain occupies upon moving outside the exit tunnel causes compaction, or (3) that domain folding causes compaction. We find that domain folding and dimensional collapse are both consistent with the experimental data, while poor-solvent collapse is inconsistent. We identify alternative dye labeling positions on HemK that upon fluorescence can differentiate between the domain folding and dimensional collapse mechanisms. Partial folding of domains has been observed in C-terminally truncated forms of proteins. Therefore, it is likely that the experimentally observed compact state is a partially folded intermediate consisting, according to our simulations, of the first three helices of the HemK N-terminal domain adopting a native, tertiary configuration. With these simulations we also identify the possible co-translational folding pathways of HemK.

3.2 Introduction

Figure 3.1. Coarse-grain representation of HemK NTD with dye-modified residues. (A) Left: cartoon model superimposed over an all-atom representation of HemK NTD (residues 2-73, PDB ID: 1T43). Right: Cα coarse-grain representation of HemK NTD. Helices 1, 2, 3, 4, and 5 are shown in red, blue, yellow, cyan, and silver, respectively. Unstructured regions are shown in ochre. (B) Schematic of the 112-residue HemK construct with dye positions indicated. (C) All-atom and coarse-grained representations of the FRET Donor/Acceptor pair BOF/BOP. Interaction sites are shown in magenta while virtual bonds are shown in green. The leftmost bead in each image corresponds to the residue’s Cα interaction site. (D) Cross-section of a coarse-grained ribosome-nascent chain complex harboring a 112-residue HemK nascent chain. Ribosomal RNA and ribosomal protein interaction sites are shown in magenta and light purple, respectively. The coloring scheme for the N-terminal domain of HemK is the same as in (A). The unstructured and structured regions of the C-terminal linker are displayed in ochre and green, respectively. The FRET dyes are shown in pink.

Förster resonance energy transfer (FRET) has recently been used to probe the conformations of nascent proteins on both translationally arrested58 and continuously translating82 ribosomes. The efficiency of energy transfer from the FRET donor dye to the FRET acceptor dye is related to the distance between them, providing a means of monitoring protein conformational states.83 In one such study Rodnina and co-workers82 used a FRET-based assay, which measures acceptor dye fluorescence but not FRET efficiency, to monitor the synthesis of the N-terminal domain (NTD) of the Escherichia coli N5-glutamine methyltransferase protein HemK. The HemK NTD is 73 residues in length and contains five helices that form a bundle in the native state (Figure 3.1A). In these experiments the HemK NTD was fused to a 39-residue C-terminal segment composed of the interdomain linker and a small portion of the HemK C-terminal domain (CTD). A FRET donor/acceptor pair was incorporated into the HemK NTD via chemically modified residues at positions 1 and 34 (Figure 3.1B and C). Time series of acceptor fluorescence due to FRET during continuous synthesis were obtained for six HemK constructs by monitoring the fluorescence in the FRET acceptor channel in a stopped-flow
apparatus. The acceptor fluorescence began to increase after 10 s of synthesis, indicating that the FRET dyes moved closer together due to compaction of the domain. These data, in combination with limited proteolysis assays and photoinduced electron transfer experiments, were used to support the hypothesis that HemK NTD co-translationally folds through a compact state. Subsequent fitting to a chemical kinetic model suggested that partially folded structures can transiently form during synthesis, although alternative mechanisms such as the formation of a molten globule or a nonspecific collapsed state have not been ruled out.84

The observed compact state could arise due to partial domain folding85 or collapse of the unfolded state in either a good or poor solvent.86 The ribosome exit tunnel is ∼100 Å in length and has an average diameter of 15 Å, which is similar to the 10 Å persistence length of unstructured proteins.87–89 For these reasons, nascent protein segments are confined in an effectively one-dimensional tube inside the tunnel and translocate to a three-dimensional space outside of it. Polymers tend to collapse when there is an increase in the space dimension in which they exist.86,90,91 The radius of gyration, $R_{\mathrm{g}}$, of the unfolded state of an $N$-monomer chain in a good solvent scales with the space dimension, $d$, as $R_{\mathrm{g}} \propto N^{3/(d+2)}$. Thus, as a domain moves from inside to outside the exit tunnel the scaling of $R_{\mathrm{g}}$ switches from $N$ ($d = 1$) to $N^{3/5}$ ($d = 3$), such that the unfolded state of the 73-residue HemK NTD would compact in size by a factor of $N^{-2/5}$, an 82% reduction. Indeed, such unstructured collapse has been seen in simulations of other ribosome-nascent chain complexes.49 For a domain in a poor solvent, $R_{\mathrm{g}} \propto N^{1/d}$, and so $R_{\mathrm{g}}$ will go from $N$ ($d = 1$) to $N^{1/3}$ ($d = 3$) when the domain enters the cytosol, such that HemK NTD will collapse by a factor of $N^{-2/3}$, a 94% reduction. Such collapse would bring the fluorescent dyes in HemK closer together and therefore could explain the increased FRET fluorescence.
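The two collapse estimates quoted above follow directly from the scaling exponents. The short Python sketch below evaluates them for the 73-residue HemK NTD. It assumes, as the estimate in the text implicitly does, that the prefactor of the scaling law is the same inside and outside the tunnel; the function names are hypothetical.

```python
def scaling_exponent(d: int, good_solvent: bool = True) -> float:
    """Exponent nu in R_g ~ N**nu for space dimension d."""
    return 3.0 / (d + 2) if good_solvent else 1.0 / d


def size_ratio(n: int, d_from: int, d_to: int, good_solvent: bool = True) -> float:
    """Ratio R_g(d_to) / R_g(d_from), assuming identical prefactors."""
    nu_from = scaling_exponent(d_from, good_solvent)
    nu_to = scaling_exponent(d_to, good_solvent)
    return n ** (nu_to - nu_from)


N_HEMK_NTD = 73  # residues in the HemK N-terminal domain
good = size_ratio(N_HEMK_NTD, 1, 3, good_solvent=True)   # N**(-2/5)
poor = size_ratio(N_HEMK_NTD, 1, 3, good_solvent=False)  # N**(-2/3)
print(f"good solvent: {100 * (1 - good):.0f}% reduction in R_g")
print(f"poor solvent: {100 * (1 - poor):.0f}% reduction in R_g")
```

The sketch reproduces the 82% and 94% reductions given in the text.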
Here, we present results from coarse-grained, low-friction Langevin dynamics simulations of HemK synthesis that were performed with three different Hamiltonians to assess which driving force yields a compaction process that is most consistent with the experimental data. The first Hamiltonian, which models domain folding, is a Gō-based model previously used to model co-translational folding in silico.46,50,92 The second Hamiltonian approximates the influence of a good solvent on the nascent protein to model dimensional collapse. The third Hamiltonian models the influence of a poor solvent to permit nonspecific collapse. With this approach we are able to provide a molecular interpretation of FRET measurements on ribosome-nascent chain complexes, suggest optimal dye positions for future experiments, and determine the co-translational folding pathway of HemK NTD.

3.3 Results

3.3.1 $E$ and $E/E_{\mathrm{end}}$ are equally valid for comparison to experimental fluorescence

The experimentally observed time-dependent fluorescence of the acceptor in the presence of the donor ($F_{\mathrm{AD(A)}}$) divided by the final fluorescence of the acceptor measured for HemK112 ($F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$) cannot be calculated from our simulations. However, we can compare our results to experiment through the FRET efficiency. The FRET efficiency, $E$, which is the probability of energy transfer from the donor to the acceptor dye, can be calculated from our simulations using Förster’s equation, $E = 1/\left[1 + (r/R_0)^6\right]$, where $r$ is the distance between the dyes and $R_0$ is the Förster radius. We can also calculate the related quantity $E/E_{\mathrm{end}}$, which is the time-dependent value of $E$ divided by the final value of $E$ for HemK112 ($E_{\mathrm{end}}$). Both $E$ and $E/E_{\mathrm{end}}$ are directly proportional to $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ (see Eq. B.9 and B.11 and Supplementary Discussion in Appendix B). Under the experimental conditions of constant illumination at the FRET donor excitation wavelength, as $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ increases $E$ and $E/E_{\mathrm{end}}$ will also increase, and as $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ decreases $E$ and $E/E_{\mathrm{end}}$ will decrease. We emphasize that both $E$ and $E/E_{\mathrm{end}}$ are equally valid to compare to $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$, and it is not possible to argue that one is better to use than the other. Therefore, in the rest of this study we compare both our simulated $E$ and $E/E_{\mathrm{end}}$ time series to the experimental $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ time series and discuss the conclusions from each set of results.
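As an illustration of the two quantities defined above, the following Python sketch evaluates Förster’s equation for a series of dye separations and then normalizes by the final value. The Förster radius and the distances are placeholder values chosen for illustration, not parameters of the study, and the function names are hypothetical.

```python
def fret_efficiency(r: float, r0: float) -> float:
    """Förster's equation: E = 1 / (1 + (r / R0)**6)."""
    return 1.0 / (1.0 + (r / r0) ** 6)


def normalize_by_final(e_series: list[float], e_end: float) -> list[float]:
    """E/E_end: each efficiency divided by the final efficiency E_end."""
    return [e / e_end for e in e_series]


# Placeholder inputs for illustration only (not values from the study).
R0 = 50.0                             # assumed Förster radius, in Å
distances = [80.0, 65.0, 50.0, 40.0]  # assumed dye-dye distances over time, in Å

e_series = [fret_efficiency(r, R0) for r in distances]
e_norm = normalize_by_final(e_series, e_end=e_series[-1])
```

Because the normalization is a division by a constant, the normalized series has the same shape as the unnormalized one, so any linear correlation with an experimental series is unchanged by it.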

3.3.2 Both partial folding and dimensional collapse are consistent with the experimental data

To test whether partial folding, dimensional collapse, or poor-solvent collapse is most consistent with experiment we simulated the synthesis of the six HemK experimental constructs, which differ in their final length, with three different Hamiltonians. A coarse-grain representation of the E. coli 50S ribosomal subunit93 and a dye-modified HemK coarse-grain structure82 (Figure 3.1D) were used in all synthesis simulations. To model domain folding we use a Gō-based Hamiltonian utilized in previous studies of co-translational folding.46,50,92 To model dimensional collapse we use another Hamiltonian that approximates the influence of good-solvent conditions on the HemK NTD. Finally, we use a third Hamiltonian that approximates the influence of poor-solvent conditions with the introduction of nonspecific attractive interactions that cause domain collapse in bulk solution. Note that nascent chains modeled using any of these Hamiltonians will undergo
dimensional collapse to some extent; the good-solvent model, however, will only collapse due to a change in dimension, while the other two models have additional driving forces for compaction.

Figure 3.2. Simulation time series compared to experimental results. (A) Simulated $E$ time series are plotted alongside experimental $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ time series with experimental fluorescence data plotted as green circles. Ensemble average simulation time series for the folding, good-solvent, and poor-solvent Hamiltonians are displayed in blue, magenta, and orange, respectively. Simulation error bars are 95% confidence intervals that are, for some data points, smaller than the plotted line. (B) Same as (A), but simulation time series are $E/E_{\mathrm{end}}$ rather than $E$. Note that the left set of y axes applies only to the experimental time series.

We find the poor-solvent collapse $E$ and $E/E_{\mathrm{end}}$ results are inconsistent with the experimental fluorescence data for each of the six HemK constructs (Table 3.1 and Figure 3.2A and B, orange time series, Pearson $R^2$ range: 0.51–0.82). For the dimensional collapse mechanism strong correlations are found between the dimensional collapse $E$ and $E/E_{\mathrm{end}}$ time series (Table 3.1 and Figure 3.2A and B, magenta time series, Pearson $R^2$ range: 0.76–0.99) and the experimental data, as they follow similar trends. The $E$ and $E/E_{\mathrm{end}}$ time series resulting from the folding Hamiltonian (Table 3.1 and Figure 3.2A and B, blue time series, Pearson $R^2$ range: 0.76–0.98) are also strongly correlated with the experimental $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ time series. On the basis of these comparisons, we conclude that both the co-translational folding and dimensional collapse mechanisms are consistent with the experimental data, while poor-solvent collapse is inconsistent.

To quantify the similarity of the $E$ and $E/E_{\mathrm{end}}$ time series arising from different Hamiltonians we calculated the root-mean-square deviation between pairs of simulation
time series, which we denote λ (Table 3.2 and Appendix B). Comparison of λ to the time series in Figure 3.2 suggests that, when λ ≥ 0.1, the folding and good-solvent time series can be differentiated from one another. We find that for the majority of HemK constructs λ < 0.1 between the folding and good-solvent Hamiltonians, indicating that in many cases they are highly similar. However, λ > 0.1 when the poor-solvent Hamiltonian is compared to either the folding or good-solvent Hamiltonian results. These results quantitatively demonstrate that, in general, the folding and good-solvent Hamiltonians give very similar results, meaning that both mechanisms of compaction can provide a reasonable explanation for the experimental data.
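The λ metric described above can be sketched in a few lines of Python. This is a minimal illustration that assumes λ is the plain root-mean-square deviation over time points common to both series; the precise definition is the one given in Appendix B. The two input series are invented numbers, not simulation output.

```python
import math


def rmsd_lambda(series_a: list[float], series_b: list[float]) -> float:
    """Root-mean-square deviation between two equally sampled time series."""
    if len(series_a) != len(series_b) or not series_a:
        raise ValueError("time series must be non-empty and of equal length")
    squared = [(a - b) ** 2 for a, b in zip(series_a, series_b)]
    return math.sqrt(sum(squared) / len(squared))


# Invented E time series for two Hamiltonians (illustrative numbers only).
folding = [0.10, 0.20, 0.45, 0.70]
good_solvent = [0.10, 0.25, 0.40, 0.50]

lam = rmsd_lambda(folding, good_solvent)
distinguishable = lam >= 0.1  # threshold suggested in the text
```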

Table 3.1. Pearson $R^2$ values between simulated and experimental time series

Hamiltonian      HemK112  HemK98  HemK84  HemK70  HemK56  HemK42
Folding model    0.98     0.97    0.97    0.95    0.78    0.76
Good solvent     0.99     0.99    0.96    0.98    0.88    0.76
Poor solvent     0.65     0.66    0.51    0.58    0.55    0.82

3.3.3 Alternative dye positions allow the partial folding and dimensional collapse mechanisms to be tested

With the experimental dye labeling positions of residues 1 and 34 the folding and dimensional collapse mechanisms give similar levels of agreement with experiment. We hypothesized that there exist alternative dye positions within the HemK NTD that would permit these two mechanisms to be tested. We therefore reanalyzed the folding and good-solvent Hamiltonian simulations to determine dye positions that maximize the difference in $E$ and $E/E_{\mathrm{end}}$ between these two mechanisms. We find that alternative dye positions can improve the separation between the folding and good-solvent $E$ time series by 80 to 194%, with all λ > 0.1, such that folding and dimensional collapse can be differentiated from one another (Figure 3.3 and Table 3.3). The separation between $E/E_{\mathrm{end}}$ time series also improved in five of the six HemK constructs, and λ became greater than 0.1 for three of the six constructs. For HemK70 there are no alternative dye positions that increase the difference between the folding and good-solvent Hamiltonian $E/E_{\mathrm{end}}$ results, implying the experimental positions are optimal for this construct and calculation method. We therefore recommend that, in future experiments, to maximize the observable difference between folding and dimensional collapse for HemK112 the donor and acceptor be placed at residues 3 and 66, respectively.
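The reanalysis described above amounts to a scan over candidate dye pairs. The Python sketch below shows one way such a scan could be organized, assuming that distance time series for each candidate (donor, acceptor) pair have already been extracted from the two sets of simulations. The data layout, function names, residue indices, Förster radius, and distances are all hypothetical and serve only to make the example runnable.

```python
import math


def fret_efficiency(r: float, r0: float) -> float:
    """Förster's equation: E = 1 / (1 + (r / R0)**6)."""
    return 1.0 / (1.0 + (r / r0) ** 6)


def rmsd(a: list[float], b: list[float]) -> float:
    """Root-mean-square deviation between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def best_dye_pair(dist_fold: dict, dist_good: dict, r0: float):
    """Return the (donor, acceptor) pair with the largest lambda, and that lambda."""
    best_pair, best_lam = None, -1.0
    for pair in sorted(dist_fold):
        e_fold = [fret_efficiency(r, r0) for r in dist_fold[pair]]
        e_good = [fret_efficiency(r, r0) for r in dist_good[pair]]
        lam = rmsd(e_fold, e_good)
        if lam > best_lam:
            best_pair, best_lam = pair, lam
    return best_pair, best_lam


# Invented distance time series in Å, keyed by (donor, acceptor) residue index.
dist_fold = {(10, 20): [60.0, 45.0, 30.0], (5, 60): [90.0, 50.0, 20.0]}
dist_good = {(10, 20): [60.0, 50.0, 35.0], (5, 60): [90.0, 80.0, 70.0]}

pair, lam = best_dye_pair(dist_fold, dist_good, r0=50.0)
```

An exhaustive scan is affordable here because the number of residue pairs in a 73-residue domain is small; the invented inputs are constructed only so that the scan has a clear maximum to find.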

Figure 3.3. Alternative dye positions that maximize separation between good-solvent and folding results for each HemK construct. (A) Simulated $E$ time series calculated with the Förster equation using explicit dye representation coordinates (folding Hamiltonian: blue; good-solvent Hamiltonian: magenta) or calculated using alternative dye position Cα coordinates (folding Hamiltonian: green; good-solvent Hamiltonian: orange). FRET time series were calculated for all possible locations of the FRET acceptor (a) and donor (d) dyes within the primary sequence of HemK NTD. The $E$ results from the specific locations found to maximize the difference between folding and good-solvent Hamiltonians are plotted here, and the locations of the a and d dyes within the sequence are noted in the subplot titles. Simulation error bars are 95% confidence intervals that are, for some data points, smaller than the plotted line. (B) Same as (A) but time series are $E/E_{\mathrm{end}}$ rather than $E$.

3.3.4 Co-translational folding of HemK NTD proceeds through a partially folded intermediate

Figure 3.4. Three sets of fraction of native contact, $Q$, time series are displayed for HemK112 alongside representative coarse-grain nascent-chain structures. (A) Time series of $Q$ representing intra-helix contacts for helices 1, 2, 3, 4, and 5 within HemK NTD. (B) Time series of $Q$ representing inter-helix contacts between all pairs of helices between which contacts form in HemK NTD native state. The plot legend indicates the pair of helices considered for each time series. (C) Time series of $Q$ representing overall folding (i.e., intra- and inter-helix contacts) calculated over either all five helices (h12345), the N-terminal four helices (h1234), or the N-terminal three helices (h123). Error bars are 95% confidence intervals. (D) Structures corresponding to key folding intermediates labeled with the nascent chain length at which they occur. Coloring is the same as in Figure 3.1. The orange sphere indicates the position at which the C-terminal residue is restrained. The 50S ribosome representation is removed for visual clarity but was present during simulations.

Multiple proteins have been observed to fold co-translationally through partially folded intermediates,94 and we found on the basis of our simulations that partial folding is a likely mechanism for HemK NTD compaction. We therefore sought insight into the possible co-translational folding pathways of HemK NTD by analyzing how the fraction of native contacts ($Q$) changed with time in our folding-Hamiltonian simulations. For HemK112 we observe that helices 1, 2, and 3 form their intrahelix contacts while still buried in the exit tunnel after 7, 12, and 15 s of synthesis, respectively (Figure 3.4A; h1, h2, and h3 time series. Figure 3.4D; L = 42 aa and L = 56 aa structures). Tertiary contacts begin to form at 15 s and, after 23 s of synthesis, helices 1, 2, and 3 form a stable intermediate at a nascent chain length of 84 aa (Figure 3.4B and C; h1h2, h1h3, h2h3, and h123 time series. Figure 3.4D; L = 70 aa and L = 84 aa structures). Helix 4 forms secondary structure after 24 s of synthesis and joins to the partially folded intermediate after 27 s of synthesis (Figure 3.4A, B, and C; h4, h1h4, h2h4, h1234 time series. Figure 3.4D; L = 98 aa structure). Helix 5 forms intrahelix contacts after 31 s, when the full-length protein is complete, and begins to participate in tertiary structure at roughly the same time, forming the native state (Figure 3.4A and B; h5, h2h5, h4h5 time series. Figure 3.4D; L = 112 aa structure).

The $Q$ time series for the shorter HemK constructs are very similar to those for HemK112 but with some exceptions. HemK98 is not long enough for helix 5 to dock with helices 1–4, and helix 5 therefore forms secondary but not tertiary contacts (Figure B.1). HemK84 forms intrahelix contacts in helix 4 transiently from 15 to 23 s, but as tertiary contacts form between helices 1–3, it unwinds (Figure B.2). HemK70 displays unstable tertiary contact formation between helices 1 and 2, while HemK56 and HemK42 display no tertiary structure formation (Figures B.3–B.5).
In summary, we find that the co-translational folding of HemK NTD proceeds via the formation of a three-helix bundle consisting of helices 1–3 after 23 s of synthesis followed by a four-helix bundle at 27 s and the native-state five-helix bundle at 31 s.
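The fraction of native contacts used in this pathway analysis can be illustrated with a short Python sketch. It assumes a simple definition in which a native contact counts as formed when its two interaction sites lie within a fixed cutoff distance; the cutoff, the coordinates, and the contact list below are hypothetical and are not the criteria or data used in the simulations.

```python
import math


def fraction_native_contacts(coords: dict, native_pairs: list, cutoff: float) -> float:
    """Fraction of native contact pairs whose sites lie within `cutoff`."""
    formed = sum(
        1 for i, j in native_pairs
        if math.dist(coords[i], coords[j]) <= cutoff
    )
    return formed / len(native_pairs)


# Invented coordinates (Å) for three interaction sites, keyed by residue index.
coords = {0: (0.0, 0.0, 0.0), 4: (6.0, 0.0, 0.0), 8: (20.0, 0.0, 0.0)}
native_pairs = [(0, 4), (0, 8), (4, 8)]  # hypothetical native contact list

q = fraction_native_contacts(coords, native_pairs, cutoff=8.0)
```

Restricting `native_pairs` to contacts within one helix, between two helices, or across a subset of helices gives the intra-helix, inter-helix, and overall $Q$ values of the kind plotted in Figure 3.4.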

Table 3.2. λ values for comparisons between $E$ and $E/E_{\mathrm{end}}$ time series

λ values for $E$ time series
Hamiltonians compared          HemK112  HemK98  HemK84  HemK70  HemK56  HemK42
Folding vs. good solvent       0.162    0.160   0.148   0.054   0.064   0.086
Folding vs. poor solvent       0.228    0.227   0.229   0.164   0.243   0.427
Good solvent vs. poor solvent  0.203    0.215   0.210   0.215   0.305   0.513

λ values for $E/E_{\mathrm{end}}$ time series
Hamiltonians compared          HemK112  HemK98  HemK84  HemK70  HemK56  HemK42
Folding vs. good solvent       0.073    0.075   0.083   0.152   0.076   0.049
Folding vs. poor solvent       0.298    0.316   0.308   0.360   0.482   0.688
Good solvent vs. poor solvent  0.292    0.311   0.302   0.312   0.444   0.736

Table 3.3. λ values for comparisons between alternative dye position $E$ and $E/E_{\mathrm{end}}$ time series

λ values for alternative dye position analysis of $E$ time series
               HemK112   HemK98    HemK84    HemK70    HemK56    HemK42
Best location  d3, a66   d3, a64   d3, a56   d42, a64  d4, a31   d3, a29
λ              0.365     0.325     0.267     0.159     0.136     0.216
(% change)     (+125%)   (+103%)   (+80%)    (+194%)   (+113%)   (+151%)

λ values for alternative dye position analysis of $E/E_{\mathrm{end}}$ time series
               HemK112   HemK98    HemK84    HemK70    HemK56    HemK42
Best location  d38, a64  d35, a73  d36, a73  d1, a38   d19, a43  d2, a29
λ              0.095     0.159     0.250     0.146     0.087     0.169
(% change)     (+30%)    (+112%)   (+201%)   (-4%)     (+14%)    (+245%)
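The percent changes in Table 3.3 appear to be relative changes in λ with respect to the folding vs. good-solvent values for the experimental dye positions in Table 3.2. The Python sketch below checks this reading for the $E$ time series; the interpretation is an inference from the tabulated numbers rather than a definition stated in the text, and the variable names are hypothetical.

```python
# λ for the E time series, folding vs. good-solvent Hamiltonians.
LAMBDA_EXPERIMENTAL_DYES = {  # Table 3.2 (dyes at residues 1 and 34)
    "HemK112": 0.162, "HemK98": 0.160, "HemK84": 0.148,
    "HemK70": 0.054, "HemK56": 0.064, "HemK42": 0.086,
}
LAMBDA_ALTERNATIVE_DYES = {  # Table 3.3 (best alternative dye positions)
    "HemK112": 0.365, "HemK98": 0.325, "HemK84": 0.267,
    "HemK70": 0.159, "HemK56": 0.136, "HemK42": 0.216,
}


def percent_change(old: float, new: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old


changes = {
    construct: percent_change(LAMBDA_EXPERIMENTAL_DYES[construct],
                              LAMBDA_ALTERNATIVE_DYES[construct])
    for construct in LAMBDA_EXPERIMENTAL_DYES
}
for construct, value in changes.items():
    print(f"{construct}: {value:+.1f}%")
```

Under this reading the computed values agree with the tabulated percent changes to within rounding, and span the 80 to 194% range quoted in Section 3.3.3.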

3.4 Discussion

The output of experimental fluorescence assays is a single value per time point, and converting these single values into a meaningful interpretation of the nature of a protein’s underlying conformational ensemble is a difficult inverse problem. This issue becomes acute when considering that the experimental signal arises from ensembles of asynchronously elongating ribosome-nascent chain complexes containing a range of
nascent chain lengths. Our coarse-grain molecular dynamics study of the HemK NTD reveals the molecular details of its co-translational folding at Angstrom spatial resolution to help alleviate this fundamental issue of interpreting fluorescence data. Co-translational folding is an important process by which nascent proteins obtain structure at the earliest possible stage in their biosynthesis. Estimates suggest that at least one-third of the E. coli proteome folds co-translationally,72 and perturbations to co-translational folding are associated with protein misfolding21 and disease.9 Understanding the molecular details of co-translational folding is therefore important to understanding protein biogenesis at large.

We tested three hypotheses concerning the nature of the compact state observed experimentally during HemK NTD synthesis: (1) that it collapses due to the change in dimension it experiences upon leaving the exit tunnel as it would in a good solvent, (2) that it collapses to an unstructured globule upon leaving the exit tunnel as in a poor solvent, and (3) that it attains a partially folded structure. We simulated HemK NTD synthesis with three different Hamiltonians to test these hypotheses and found that, while nonspecific collapse is inconsistent with the experimental data, both partial domain folding and dimensional collapse are consistent. Partial folding of HemK NTD by sequential addition of helices to a growing globular core has previously been suggested,84 and our results are consistent with this mechanism but cannot rule out dimensional collapse. Reanalysis of our good-solvent and folding-Hamiltonian results allowed us to select optimal dye positions that maximize the observable difference between the folding and dimensional collapse mechanisms. We were able to improve the root-mean-square separation between our folding and good-solvent model results by up to 245% by permuting the dye labeling positions in this way.
Use of these dye positions will allow for the good-solvent and folding hypotheses to be tested in future experiments.

The extensive molecular information contained in our simulations of HemK NTD with the folding Hamiltonian also allowed us to investigate its folding pathways by analyzing the time series of fraction of native contacts. The native state, observed only for HemK112, is preceded at shorter nascent chain lengths by structures composed of subsets of helices. HemK42 and HemK56 display no tertiary contact formation between helices, while HemK70 begins to form contacts between helices 1 and 2. This is in contrast with a previous study,84 which suggested tertiary contact formation between helices 1 and 2 within HemK56. HemK84 displays contact formation between helices 1–3, and coarse-grain structures indicate the formation of a three-helix bundle intermediate, but helices 4 and 5 remain unable to participate due to confinement in the ribosome exit tunnel. HemK98 is almost fully native, but helix 5 is unable to participate in stable tertiary structure with the other helices. Our results are thus largely consistent with the sequence of partially folded intermediates suggested by chemical kinetic modeling of experimental fluorescence data,84 with the exception that we see no contact formation between helices 1 and 2 for HemK56.

We observe larger deviations between experiment and simulation for HemK70 and 56 in comparison to the other four HemK constructs. The specific chemical environment experienced by the nascent chain at lengths of 70 and 56 residues may influence the agreement between the experimental fluorescence of the acceptor and the simulation FRET efficiency (see Appendix B). It is also possible that the experimental structures that give
39 rise to the detected fluorescence are non-native and therefore not reproducible by any of our three Hamiltonians, which cannot form non-native tertiary structure. Calculating fluorescence intensities from molecular dynamics simulation data is problematic. While Förster’s equation provides an easy route from atomic distances to FRET efficiencies, no such equation exists to convert atomic coordinates to a fluorescence intensity. However, 퐸 is intrinsically related to the fluorescence of the acceptor molecule during FRET, and comparisons can therefore still be made (see Appendix B). The quantity 퐸 , the calculation of which mimics the data processing step of division by the final 퐸end HemK112 time point value as was done for the experimental fluorescence data, is also 퐸 intrinsically related to the experimental fluorescence intensity. Both 퐸 and lead to the 퐸end same general conclusion: a poor-solvent model is inconsistent with experiment, while folding and good-solvent models are in roughly equivalent agreement. The robustness of 퐸 this result to normalization by supports our assertion that both values are equally valid. 퐸end It is also important to note that the type of coarse-grain model represented by our folding Hamiltonian is only capable of forming native contacts and thus cannot form alternative tertiary structure. We have therefore not ruled out the possibility that compact structure(s) detected during HemK NTD synthesis in vitro are the result of non-native tertiary structure.
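
The route from simulated dye-dye distances to a FRET efficiency, and to its normalized form, can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used in this work: the Förster radius `R0`, the function names, and the example distances are hypothetical, and averaging per-frame efficiencies over a trajectory is an assumption of the sketch.

```python
def fret_efficiency(r, r0):
    """Förster's equation: transfer efficiency for a donor-acceptor distance r."""
    return 1.0 / (1.0 + (r / r0) ** 6)


def mean_efficiency(distances, r0):
    """Average the per-frame efficiency over a trajectory of dye-dye distances."""
    return sum(fret_efficiency(r, r0) for r in distances) / len(distances)


def normalize_by_final(efficiencies):
    """Divide every mean efficiency by the last one, which plays the role of E_end here."""
    e_end = efficiencies[-1]
    return [e / e_end for e in efficiencies]


# Hypothetical dye-dye distances (in Angstrom) for two nascent chain lengths.
R0 = 50.0  # assumed Förster radius, for illustration only
trajectories = [[60.0, 65.0, 70.0], [40.0, 45.0, 50.0]]
E = [mean_efficiency(t, R0) for t in trajectories]
E_over_E_end = normalize_by_final(E)
```

Because the efficiency depends on the sixth power of the distance ratio, averaging efficiencies frame by frame is not the same as applying Förster's equation to the average distance; the sketch uses the former.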

3.5 Conclusions

Coarse-grain molecular dynamics is a powerful means of investigating molecular mechanisms of co-translational folding. Previous molecular dynamics simulations at both atomic and coarse-grained resolution have been used to gain insight into FRET results with biopolymers.95–97 We have found that coarse-grain Langevin dynamics simulations can be used to optimize dye position choices to test specific hypotheses concerning the nature of protein conformations. Without the acceleration of protein dynamics inherent to coarse-grain models and low-friction Langevin dynamics these simulations would be exceedingly time-consuming, but we were able to achieve the equivalent of ∼35 s of experimental time for six HemK constructs in ∼30 d of computer processor time. Acceleration of these simulations through the use of a further reduced ribosome structure will allow the co-translational folding behavior of even very large proteins to be studied. This type of continuous synthesis protocol could then be used to investigate co-translational folding of entire proteomes given sufficient computational resources. Molecular dynamics simulations can help decrease the ambiguity of single-valued measurements such as fluorescence and reveal the molecular details of co-translational folding, and simulations with different Hamiltonians can be used to test different molecular hypotheses for the origin of an experimental signal. If used in a preliminary scan of potential FRET dye labeling positions, molecular dynamics can also reveal ideal dye positions that will provide the best signal for testing a particular molecular hypothesis. These results highlight molecular dynamics as a powerful tool for investigating co-translational folding in molecular detail.
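
As an illustration of such a preliminary scan, the sketch below ranks candidate dye-position pairs by the root-mean-square separation between the FRET efficiency profiles predicted by two models. It is a hedged example under stated assumptions: the dictionary layout, the function names, and every number are invented for illustration and are not results from this work.

```python
import math


def rms_separation(profile_a, profile_b):
    """Root-mean-square difference between two efficiency-vs-chain-length profiles."""
    squared = [(a - b) ** 2 for a, b in zip(profile_a, profile_b)]
    return math.sqrt(sum(squared) / len(squared))


def best_dye_pair(model_a, model_b):
    """Return the dye pair whose two model profiles are furthest apart."""
    return max(model_a, key=lambda pair: rms_separation(model_a[pair], model_b[pair]))


# Hypothetical mean efficiencies at three nascent chain lengths, keyed by
# (donor residue, acceptor residue). All values are made up for illustration.
folding = {(1, 34): [0.30, 0.50, 0.70], (1, 60): [0.20, 0.60, 0.90]}
good_solvent = {(1, 34): [0.30, 0.45, 0.60], (1, 60): [0.20, 0.30, 0.40]}

print(best_dye_pair(folding, good_solvent))
```

The same ranking could be applied to any pair of Hamiltonians; the choice of root-mean-square separation as the score follows the measure quoted in the discussion, while the exhaustive `max` over pairs is simply the most direct way to express the scan.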

3.6 Acknowledgements

The authors gratefully recognize funding support from National Science Foundation CAREER Award 1553291 and National Institutes of Health Maximizing Investigators’ Research Award 1R35GM124818-01.

Chapter 4

ALTERED CO-TRANSLATIONAL PROCESSING PLAYS A ROLE IN HUNTINGTON’S PATHOGENESIS – A HYPOTHESIS

Published as a paper entitled “Altered co-translational processing plays a role in Huntington’s pathogenesis – A hypothesis” by Daniel A. Nissley and Edward P. O’Brien in Frontiers in Molecular Neuroscience 2016, 9:54. E.P.O. first proposed the hypothesis. D.A.N. and E.P.O. conducted the research and wrote the manuscript. This article was published under an Open Access Creative Commons Attribution License (CC BY).

4.1 Abstract

Huntington's disease (HD) is an autosomal dominant neurodegenerative disorder caused by the expansion of a CAG codon repeat region in the HTT gene's first exon that results in huntingtin protein aggregation and neuronal cell death. The development of therapeutic treatments for HD is hindered by the fact that while the etiology and symptoms of HD are understood, the molecular processes connecting this genotype to its phenotype remain unclear. Here, we propose the novel hypothesis that the perturbation of a co-translational process affects mutant huntingtin due to altered translation-elongation kinetics. These altered kinetics arise from the shift of a proline-induced translational pause site away from Htt's localization sequence due to the expansion of the CAG-repeat segment between the poly-proline and localization sequences. Motivation for this hypothesis comes from recent experiments in the field of protein biogenesis that illustrate the critical role that temporal coordination of co-translational processes plays in determining the function, localization, and fate of proteins in cells. We show that our hypothesis is consistent with various experimental observations concerning HD pathology, including the dependence of the age of symptom onset on CAG repeat number. Finally, we suggest three experiments to test our hypothesis.

4.2 Introduction

Huntington's disease is an autosomal dominant neurodegenerative disorder characterized by the death of striatal neurons and the appearance of aggregates in the nuclei of a wide range of brain tissues98,99. Physical symptoms of HD include chorea (involuntary, dance-like motor function) and the dementia-like decline of mental faculties99. The genetic cause of HD is the expansion of a CAG codon repeat in Exon 1 of the HTT transcript; persons without HD have on average 19 CAG repeats, while individuals with 35 or more CAG repeats will develop HD symptoms over a typical lifespan100–102. Each additional CAG repeat beyond 35 results in the onset of symptoms roughly 3 years sooner, with repeat lengths greater than 60 leading to acute juvenile onset101–103.

The pathogenesis of HD is most likely due to one or more of the aberrant gain-of-function104 or loss-of-function105 behaviors that have been identified for mutant huntingtin (mHtt) or the HTT transcript. Understanding the mechanism of pathogenesis is significantly complicated by the fact that the normal function of wild-type huntingtin protein (Htt) is not agreed upon106. Starting from the N-terminus, Htt begins with a 17-amino acid localization signal (denoted N17)107 that allows for the targeting of Htt to subcellular organelles including the Golgi apparatus, endoplasmic reticulum, and mitochondria108. Next in the sequence are 19 glutamines100,101 (poly-glutamine) encoded by CAG codon repeats. Following the poly-glutamine region is a 38 amino acid proline-rich region (poly-proline) which contains a total of 28 prolines, including continuous stretches of 10 and 11 prolines109. These three sequence elements, N17, the poly-glutamine region, and the poly-proline region, constitute Exon 1 of Htt109. The characteristic aggregates observed in HD patients are primarily constituted by small fragments of Htt Exon 1 that are most likely generated by abortive degradation attempts or the translation of an Exon 1-only mRNA produced due to aberrant splicing110–112. Recent evidence suggests that C-terminal mHtt fragments can also lead to toxic effects by inhibiting the protein dynamin 1, leading to endoplasmic reticulum stress113. The remainder of the 3,144 residues in Htt consists of approximately 40 HEAT repeats, which are conserved helix-turn-helix structure motifs106 each roughly 40 residues in length. The only difference between the primary structure of mHtt and Htt, and the genetic cause of HD pathology, is that mHtt contains an expansion of the poly-glutamine region of Exon 1 to 35 or more glutamines.

Since the discovery in 1993 that HD is caused by CAG-repeat expansion in Htt Exon 1, no dominant hypothesis for HD pathogenesis has emerged.
Instead, many interrelated hypotheses have been put forward that variously describe HD as the result of dysfunction at the protein114 or mRNA115 level, as associated with subcellular organelles like the mitochondria104 or systems such as the ubiquitin-proteasome system111, and as due to either loss- or gain-of-function116. Each of these hypotheses can explain some subset of the diverse set of experimental observations of HD pathogenesis and mHtt behavior, though the overall picture remains unclear. Despite this complex and overlapping set of hypotheses, new hypotheses continue to emerge that can explain more diverse, or previously disparate, sets of observations.

Here, we propose the novel hypothesis that co-translational processing plays a role in HD pathogenesis. This hypothesis is motivated by recent experimental results demonstrating that many of the behaviors of nascent proteins, including their targeting and aggregation, can be altered by changes to translation kinetics. We propose that a co-translational process is perturbed for mHtt due to altered translation-elongation kinetics downstream of N17. The magnitude of this perturbation is directly proportional to the increase in the length of the poly-glutamine sequence past Q35. Such perturbations to co-translational processes have been shown to profoundly alter downstream protein behavior, and the idea that there is a co-translational element to Huntington's pathogenesis is consistent with the experimental literature. We describe the insights gained from these studies and how they suggest a role for co-translational processes in HD pathology. We show that our hypothesis is consistent with a number of experimental observations concerning mHtt behavior and HD pathogenesis. We conclude by suggesting three experiments that directly test key aspects of our hypothesis.

4.3 Co-translational processes influence post-translational protein behavior

A number of processes involving nascent proteins take place during protein synthesis. These processes are therefore referred to as co-translational processes, and include domain folding1,2, chaperone interactions3, translocation4,5, ubiquitination6,7, phosphorylation117,118, acetylation119, and glycosylation8. The rates at which different nascent chain segments are synthesized can affect these co-translational processes, leading to altered nascent protein behavior in a cell21,22,120. These co-translational processes appear to occur far from equilibrium, such that the kinetics of translation can be more important than thermodynamics in determining nascent protein behavior121. Perturbations to these temporally-coordinated co-translational processes can result in deleterious downstream effects such as protein mistargeting41 and aggregation26. Most literature examples of codon-translation-rate-dependent phenomena are from prokaryotic organisms, such as E. coli. However, the timing of translation is also critical for nascent protein behavior in eukaryotic cells121. Eukaryotic cells contain homologs or molecules that carry out similar functions to those which act co-translationally in prokaryotes. Furthermore, the principles of non-equilibrium systems that underlie these phenomena are organism-independent121. For example, human cells contain a chaperone system homologous to the DnaJ/DnaK chaperone system that assists protein folding in prokaryotes.

4.4 Nascent chain interactions with auxiliary factors depend on nascent chain length

Recent studies have shown that the interactions of ribosome nascent chain (RNC) complexes with targeting complexes122, enzymes123, and chaperones79 are carefully orchestrated in cells, and that their equilibrium affinities for translationally-arrested RNCs depend sensitively on nascent chain length. For example, the signal recognition particle (SRP), a universally-conserved ribonucleoprotein that targets nascent proteins for translocation into the ER by selectively interacting with conserved signal sequences, interacts with RNCs in a nascent-chain-length-dependent manner122. Experiments have shown that SRP binds most strongly to arrested-RNCs of between 75 and 95 amino acids in length; outside this range, the dissociation constant increases 3- to 24-fold122. Interactions between nascent chains and chaperones have also been shown to depend on nascent chain length. Trigger factor (TF) is a molecular chaperone in E. coli which assists the folding of nascent proteins by binding the ribosome during translation and shielding the nascent protein from aberrant interactions50,124, helping to prevent misfolding and aggregation125. Similar to SRP, RNC/TF interactions are also optimal within a narrow range of nascent chain lengths, with a nearly 5-fold decrease in TF's dissociation constant for an RNC harboring a 100-residue nascent chain in comparison to a 23-residue nascent chain79. Nascent chain length is a key factor affecting these co-translational processes.
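
To give a sense of scale for the fold changes in dissociation constant quoted for SRP, the sketch below evaluates the equilibrium fraction of RNCs bound in a simple one-site binding model, $f = [\mathrm{L}]/([\mathrm{L}] + K_d)$. The one-site model, the ligand concentration, and the reference $K_d$ are assumptions chosen purely for illustration; only the 3- and 24-fold factors come from the text.

```python
def fraction_bound(ligand_conc, kd):
    """Equilibrium fraction bound for one-site binding with free ligand in excess."""
    return ligand_conc / (ligand_conc + kd)


KD_OPTIMAL = 1.0  # hypothetical Kd at the optimal nascent chain lengths (arbitrary units)
LIGAND = 1.0      # hypothetical free SRP concentration, in the same units as Kd

for fold in (1, 3, 24):
    # Scale Kd by the fold change reported outside the optimal length range.
    print(fold, round(fraction_bound(LIGAND, KD_OPTIMAL * fold), 3))
```

With these made-up concentrations the bound fraction falls from 0.5 at the optimal lengths to 0.25 and 0.04 for the 3- and 24-fold weaker affinities. The absolute values depend entirely on the assumed concentration, so only the trend is meaningful.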

4.5 Co-translational folding and downstream function depend on translation-elongation rates

During continuous translation in vivo, the dwell time of a ribosome at a given nascent chain length depends on the rate at which the codon in the A-site is decoded into an amino acid. The ribosome does not translate all codons at the same rate due to a variety of molecular factors12,69,126, and the variability in codon translation rates across an mRNA's coding sequence is a key parameter that modulates nascent protein behavior. Barral and co-workers found that “optimizing” the codon sequence of firefly luciferase (FL) by replacing rare codons (which are thought to translate more slowly than average) with common synonymous codons resulted in a ~55% decrease in specific activity in vivo13. In the case of the fast-translating FL transcript, the decrease in specific activity was accompanied by an increase in the amount of aggregated FL, suggesting that accelerating translation decreased FL's ability to acquire its correct structure and perform its intended function. Codon translation rates have also been shown to play a key role in regulating the structure and function of the clock protein FRQ in N. crassa29. Optimization of the wild-type FRQ translation-rate profile resulted in the abolishment of N. crassa's circadian rhythm and a two-fold decrease in FRQ's ability to interact with a binding partner, suggesting that changes to FRQ's translation-rate profile altered its structure and function.

4.6 The co-translational targeting of nascent proteins depends on translation-elongation rates

Codon translation rates influence other co-translational processes in addition to nascent protein structure acquisition. The ability of SRP to target nascent chains for translocation depends not only on its equilibrium affinity for conserved N-terminal signal sequences measured on arrested-RNCs36, but also on the rate at which the signal sequence emerges from the exit tunnel during continuous synthesis in a cell. Globally decreasing codon translation rates increases the amount of protein that SRP successfully translocates into the ER36. A bioinformatic analysis120 also revealed that “non-optimal” codons are systematically enriched in the genomes of nine yeast species 35–40 codons downstream of SRP signal sequences. The location of this downstream region would result in translational slowdown while the signal sequence is connected to the ribosome by a 35–40 amino acid linker, which corresponds to the approximate length of the ribosome exit tunnel127, such that translation will be slowed just as the signal sequence emerges from the tunnel and SRP is sterically permitted to interact with it. This slowdown is presumed to give SRP more time to recognize and bind the signal sequence, ensuring that the nascent protein is successfully targeted to, and translocated into, the ER.

4.7 Poly-proline stretches slow down translation elongation

The molecular origin of the observed variability of codon translation rates is complex, including factors such as cognate tRNA concentrations, the chemical nature of the amino acid being added to the nascent chain (i.e., the nature of the amino acid in the A-site), and sequence motifs within the nascent chain128,129. For example, it has been well-established that poly-proline regions slow down translation. In vitro128 and in vivo129 experiments have demonstrated that the ribosome translates sequences of two or more prolines much more slowly than the average global translation rate.

The results we have discussed highlight how critical the timing of translation can be to co-translational phenomena and to determining downstream protein behavior in a cell.

4.8 The hypothesis: altered co-translational processes involving huntingtin play a role in HD pathology

Our hypothesis for the contribution of co-translational processes to HD pathogenesis naturally follows from these experimental observations of protein biogenesis. In Htt, stretches of prolines are optimally positioned 30–57 residues downstream of N17 to slow translation-elongation when N17 has just been exposed from the confines of the ribosome exit tunnel, which may also be the nascent chain length at which the binding between a co-translationally acting factor (CAF) and the nascent chain is strongest (Figure 4.1). These proline residues are highly conserved, being present in the huntingtin proteins of all higher vertebrates106. This slowdown of translation provides time for an as-yet-unidentified (and unlooked-for) CAF to interact with the nascent chain and either help direct it to its proper subcellular location or chemically modify the nascent chain as needed for its function. In the case of mHtt, however, the expanded poly-glutamine region, rather than the poly-proline region, will be undergoing translation as the N17 sequence emerges from the ribosome exit tunnel; translation of these glutamines (encoded by CAG) is two- to six-fold faster than translation of the prolines located at these same codon positions in the wild-type128,129, and the CAF will thus have less time to bind N17 at the strongest-binding nascent chain lengths. A key concept in this hypothesis is that for each additional CAG repeat added to the poly-glutamine region there is a proportional decrease in the time available for the CAF to bind N17 at the nascent-chain lengths for which it has the strongest binding affinity at equilibrium. As a result, for each additional CAG repeat a smaller fraction of mHtt will interact with the CAF, and more mHtt will therefore have the opportunity to act aberrantly and form aggregates.

Figure 4.1. The proposed co-translational mechanism of HD pathology. (1) A CAF (orange) recognizes N17 (blue rectangle) of nascent Htt (top, nascent proteins shown in green, ribosome in gray). In the case of mHtt (bottom), the poly-proline region is not correctly positioned to slow translation as N17 emerges, reducing the ability of the CAF to bind. (2) In the case of Htt, the CAF directs the subcellular localization of Htt to the Golgi, ER, and mitochondria. mHtt is largely directed to the cytosol, where proteolysis (3) produces short, Exon 1-containing fragments (short green line segments) that form amyloid. Proteolysis can also result in C-terminal mHtt fragments that interfere with ER function.

This hypothesis is consistent with nine key Experimental Observations concerning HD and mHtt:

Experimental Observation 1: The brains of HD patients contain protein aggregates in cell nuclei130,131. The main constituents of these characteristic aggregates are fragments of Exon 1 of mHtt110. In the cytosol, mHtt is targeted for degradation by either chaperone-mediated autophagy132 or the ubiquitin/proteasome system (UPS)111.

Explanation 1: As the number of CAG repeats increases, more mHtt is co-translationally misprocessed and directed to the cytosol. The UPS' ability to clear mHtt decreases over time133,134, and it also has trouble completely degrading proteins with repetitive sequence elements like mHtt's poly-glutamine region135. These problems with degradation combine with the increased partitioning of mHtt into the cytosol to increase the quantity of mHtt fragments in the cell, thereby making it more likely that they may enter the nucleus and aggregate or otherwise act aberrantly105. This increase in cytosolic mHtt may also lead to an increase in the quantity of C-terminal fragments that cause ER stress113.

Experimental Observation 2: Only persons with 35 or more CAG repeats develop HD symptoms100–102.

Explanation 2: If nascent Htt contains >35 CAG repeats, the poly-proline region is not positioned to slow translation as N17 emerges from the ribosome exit tunnel. Inefficient processing results due to decreased CAF binding, leading to the accumulation of mHtt in the cytosol and subsequent aggregation.

Experimental Observation 3: Each additional CAG repeat beyond 35 speeds disease progression by roughly 3 years100–102.

Explanation 3: As the number of CAG repeats increases over 35, targeting becomes proportionally less efficient, impaired by the decreased time available for the CAF to bind mHtt's N17 sequence (see Figure 4.2). This decreased targeting efficiency results in an increased flux of mHtt into the cytosol, increasing the rate of aggregate formation.

Experimental Observation 4: N17 is required for the correct targeting of Htt to subcellular organelles108; the removal of N17 has been shown to completely abolish the targeting of Htt to the mitochondria, ER, and Golgi apparatus.

Explanation 4: CAFs can be sequence specific36, identifying motifs like conserved patterns of hydrophobic and charged residues136. If N17 is absent, then there is nothing for the CAF to recognize and bind, resulting in the inefficient targeting of mHtt to subcellular organelles108.

Experimental Observation 5: Removal of N17 leads to a large increase in the rate of nuclear aggregate formation in an HD mouse model, despite lower expression levels137.

Explanation 5: Removal of N17 effectively abolishes targeting of nascent Htt to subcellular organelles (see Experimental Observation 4). The ΔN17 deletion mutant of mHtt therefore remains in the cytosol for an extended period of time, further burdening the UPS and speeding the formation of the Exon 1 fragments that form nuclear aggregates or of the C-terminal fragments that induce ER stress110,113.

Experimental Observation 6: The length of the poly-glutamine region alters Htt targeting108. Increasing the length of the poly-glutamine domain from 25 to 97 Q's reduces co-localization of the protein to the mitochondria, ER, and Golgi by 4, 10, and 30%, respectively.

Explanation 6: Expansion of the poly-glutamine sequence moves the poly-proline stretch further downstream, disrupting the wild-type translation-elongation schedule. Without the poly-proline region to slow translation at the proper time, N17 is less likely to correctly interact with a CAF, in turn reducing the amount of correctly localized mHtt.

Experimental Observation 7: The presence of the poly-proline region is critical for correct targeting108. Removal of the poly-proline region reduces the co-localization of the protein to the ER and Golgi by 30 and 25%, respectively.

Explanation 7: Complete removal of the poly-proline region significantly perturbs the wild-type translation-rate profile of Htt. There will be less time available for a CAF to interact with N17, decreasing the probability of correct targeting and downstream function.

Experimental Observation 8: Removal of the poly-proline region is detrimental to spatial learning and memory in a mouse model of HD138.

Explanation 8: As described in Experimental Observation 7, removal of the poly-proline region leads to an increase in the quantity of Htt which is unable to perform its correct downstream function, leading to the observed disease symptoms.

Experimental Observation 9: A fusion protein consisting of the first 171 amino acids of Q125 mHtt attached to GFP has a shorter soluble half-life than an analogous Htt171-GFP fusion protein139.

Explanation 9: This decreased solubility is due to perturbed co-translational processing, which can result in poor targeting to subcellular organelles that may increase aggregation propensity and result in a decreased soluble half-life26.

Though other hypotheses for HD pathology can explain different subsets of these observations, our hypothesis is the only one104,105,140–142 that, to our knowledge, is consistent with the experimentally-observed dependence of proper Htt targeting on the presence of the N17, poly-glutamine, and poly-proline sequence motifs. Our hypothesis is also unique in offering a molecular explanation for why removal of the N17 or poly-proline sequences leads to the development of disease symptoms.

4.9 Our hypothesis in the context of other mechanisms that could contribute to mHtt pathology

Many hypotheses have been proposed105,108,111,115,140–142 (a partial list) that can explain different groupings of the nine experimental observations of HD pathogenesis and Htt behavior we list above. We cannot succinctly describe the full complexity of these interrelated theories. However, given the complexity of HD pathogenesis, it seems reasonable that each of these mechanisms has the potential to contribute in some way to the disease phenotype. None of the other hypotheses that offer explanations for Experimental Observations 1–9 are mutually exclusive with the model that we have described. For example, the hypothesis that mHtt toxicity is mRNA-mediated115,142 posits that large CAG-repeat lengths in HTT can cause toxic downstream effects by forming a stable mRNA hairpin that sequesters a diverse set of proteins. Our co-translational hypothesis of HD pathogenesis can co-exist with this hypothesis. Consider one implication of the RNA-mediated hypothesis: at high values of $N_{\mathrm{CAG}}$ a hairpin can form in HTT that recruits various cellular factors, including protein kinase R (PKR)143. PKR is a double-stranded RNA-dependent kinase that forms part of the cellular virus-defense system, and its activation is associated with myriad downstream cell stress and apoptosis events including the upregulation of proteolysis machinery143. Simultaneously, perturbation of co-translational processes may increase the fraction of mHtt directed to the cytosol (Figure 4.1), such that both processes synergistically contribute to an increase in the quantity of mHtt fragments observed in vivo. This explanation also demonstrates the interplay between our hypothesis and the hypothesis that proteolysis of mHtt is key to pathogenesis, as perturbed co-translational interactions that increase the amount of mHtt directed to the cytosol would also increase the stress on the cell's proteolysis machinery111.

Repeat-associated non-ATG translation initiation of the mHtt transcript may also be related to our co-translational hypothesis. The frameshift created by this non-canonical translation initiation process will alter the identity of the codons being translated and also result in a nascent protein without the N17 localization sequence142. These two consequences of non-ATG translation initiation constitute a significant alteration to codon elongation rates and co-translational behavior that may contribute to HD pathology. Consider also that our hypothesis is independent of the method of mHtt fragment generation. Whether pathogenic fragments are a result of proteolysis or of aberrant mRNA splicing of Exon 1, translation will still occur, and co-translational phenomena can thus still be critical. Experiments have also shown that the MID1-PP2A regulatory complex binds the mRNA hairpin formed by the expanded CAG-repeat region of mutant HTT, increasing the rate of translation initiation144. This increase in translation initiation for the mutant transcript increases the amount of mHtt translated, increasing the amount of protein that may be co-translationally misprocessed according to our hypothesis.

Figure 4.2. A simple chemical-kinetic model explains how the age of onset of HD symptoms could arise from disruption of co-translational processing of huntingtin. (A) The CAF is assumed to bind N17 irreversibly in the region of optimal binding with rate $k_{\mathrm{on}}$. (B) Within this model, the CAF can only bind when the nascent Htt molecule is between 52 and 71 amino acids in length (i.e., when the ribosome starts translating the poly-proline region). (C) As $N_{\mathrm{CAG}}$ increases the number of glutamines (Q) in the binding region increases and the number of prolines (P) decreases, leading to a decrease in the time available for CAF binding ($\tau_{\mathrm{AFB}}$). $\tau_{\mathrm{AFB}}$ has units of $\tau_{\mathrm{A}}$ (see Appendix C). (D) The fraction of Htt that is co-translationally misprocessed depends on $\tau_{\mathrm{AFB}}$ and, thereby, on $N_{\mathrm{CAG}}$, as expressed by Eq. 4.1. (E) The fraction of Htt which is misprocessed ($f_{\mathrm{mp}}$ in Eq. 4.1) strongly correlates with $N_{\mathrm{CAG}}$ when realistic values for $k_{\mathrm{on}}$ and CAF concentration are used (see Appendix C for a complete description of Eq. 4.1). (F) The age of HD symptom onset shows strong negative correlation with the fraction of Htt misprocessed predicted by Eq. 4.1. Age of onset vs. $N_{\mathrm{CAG}}$ data were extracted from Figure 1C of Lee et al. (2012) with PlotDigitizer (plotdigitizer.sourceforge.net).
As previously mentioned, ours is the only hypothesis, to our knowledge, that can explain the experimentally-observed influence of the N17, poly-glutamine, and poly-proline regions on the sub-cellular localization of Htt and mHtt, the large increase in nuclear aggregation when N17 is deleted, and the relationship between poly-proline deletion and various negative effects (Experimental Observations 4–8). The ability of our hypothesis to explain the large increase in aggregation rate observed when mHtt lacking N17 is expressed in vivo (Experimental Observation 5) is particularly important. In vitro studies have demonstrated that the aggregation rate of purified poly-glutamine segments increases as the number of poly-glutamine residues increases145. Occam's Razor suggests, then, that our co-translational phenomenon-based hypothesis may be superfluous. However, this simpler hypothesis cannot explain the observed increase in aggregation in vivo upon N17 deletion, whereas our hypothesis can.

4.10 A kinetic model based on this hypothesis suggests why the age of onset of HD symptoms negatively correlates with the number of CAG repeats

We can express this hypothesis in terms of the chemical reaction scheme shown in Figure 4.2A, in which we assume that a CAF only binds nascent Htt in a narrow range of nascent chain lengths with rate 푘on. This reaction scheme can be solved analytically using the principles of chemical kinetics to give an equation that estimates the change in the fraction of misprocessed nascent mHtt (푓mp) as the number of CAG repeats (푁CAG) increases above 35. By misprocessed we mean mHtt fails to interact with the CAF; for the purposes of this simple model, we assume that this co-translational interaction is an obligate step in Htt's normal protein maturation pathway. To derive this equation we make the simplifying assumptions that (i) the binding of the hypothesized CAF to N17 is irreversible in the region of optimal binding, (ii) 푘on is non-zero from codon position 53, when translation of the poly-proline region begins for Q35-Htt, to codon position 72 (corresponding to nascent chain lengths 52–71, see Figure 4.2B and Methods in Appendix C), and (iii) prolines are translated twice as slowly as glutamines (Figure 4.2C and Methods in Appendix C). Assumptions (ii) and (iii) have experimental support in the literature, as described in the Methods in Appendix C. With these assumptions, 푓mp (see Methods in Appendix C for a complete description of this kinetic model and its technical assumptions), which calculates the relative fraction of Htt misprocessed at 푁CAG in comparison to the amount misprocessed when 푁CAG = 35, can be expressed as

푓mp(푁CAG) = exp[−푘on(휏AFB(푁CAG) − 휏AFB(35))] − 1. (Eq. 4.1)

In Eq. 4.1, 휏AFB is the total amount of time available for CAF binding at the optimal nascent-chain binding lengths, which is a function of 푁CAG (Figure 4.2C). As 휏AFB decreases the fraction of misprocessed Htt increases (Figure 4.2D). Figure 4.2E displays the results obtained from Eq. 4.1 when a realistic value for 푘on (see Methods in Appendix C) is used to calculate 푓mp for values of 푁CAG from 40 to 53. We find that 푓mp strongly correlates with 푁CAG (Pearson 푅² = 0.996, 푝 = 1 × 10⁻¹⁵). Furthermore, we find that the experimentally-determined age of HD symptom onset also strongly correlates with 푓mp (Pearson 푅² = 0.962, 푝 = 7 × 10⁻¹⁰, Figure 4.2F). This latter correlation is consistent with our hypothesis that a defect in a co-translational process plays a role in HD pathogenesis. Strong correlations are also found between our kinetic model and the age of onset data presented by Brinkman et al. (1997) for 푁CAG values of 39–50 (푓mp vs. 푁CAG: Pearson 푅² = 0.997, 푝 = 7 × 10⁻¹⁴; age of onset vs. 푓mp: Pearson 푅² = 0.958, 푝 = 3 × 10⁻⁸, data not shown)101.
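The qualitative behaviour of Eq. 4.1 can be illustrated numerically. The following is a minimal sketch and not the model detailed in Appendix C: it assumes that N17 is 17 residues long, that every codon in the binding window downstream of the glutamine stretch is a proline, and it uses placeholder values for the glutamine dwell time (`t_gln`) and for 푘on (`k_on`). All function and parameter names are illustrative.

```python
import math

N17_LENGTH = 17      # assumed length of the N17 segment (residues)
WINDOW_START = 53    # first codon position with non-zero k_on (from the text)
WINDOW_END = 72      # last codon position with non-zero k_on (from the text)


def tau_afb(n_cag, t_gln=0.1):
    """Time (s) spent translating the binding window for a given CAG-repeat number.

    Assumption: codons after the glutamine stretch are prolines, each taking
    twice the glutamine dwell time t_gln (a placeholder value).
    """
    first_proline = N17_LENGTH + n_cag + 1
    total = 0.0
    for codon in range(WINDOW_START, WINDOW_END + 1):
        total += t_gln if codon < first_proline else 2.0 * t_gln
    return total


def f_mp(n_cag, k_on=1.0, t_gln=0.1):
    """Eq. 4.1: misprocessed fraction relative to N_CAG = 35 (k_on is a placeholder)."""
    return math.exp(-k_on * (tau_afb(n_cag, t_gln) - tau_afb(35, t_gln))) - 1.0


if __name__ == "__main__":
    for n in range(40, 54):
        print(n, round(f_mp(n), 3))
```

Under these assumptions each additional CAG repeat replaces one slow proline codon in the window with a faster glutamine codon, so 휏AFB falls and 푓mp rises monotonically with 푁CAG, which is the trend described for Figure 4.2D.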

4.11 On the nature of misprocessing

With over 11 different co-translational processes potentially acting on huntingtin, any one or more of them may be perturbed by altered translation-elongation kinetics due to CAG expansion. We believe, however, that the two most-likely culprits are (i) altered co-translational modification of N17 by a CAF or (ii) decreased binding of a CAF that targets Htt to its proper cellular location. Consider, for example, the influence that CAG-repeat expansion could have on the phosphorylation of Htt at positions S13 and S16 within N17 and, thereby, on downstream Htt behavior such as membrane binding. Experiments have demonstrated that the phosphorylation of these two serine residues is important for the cellular localization of Htt107,146 and also plays a role in determining if Htt will be targeted for degradation by the ubiquitin/proteasome system (UPS)147. Critically, mHtt has also been experimentally shown to have decreased levels of phosphorylation in comparison to Htt146. Such changes in phosphorylation state have been shown to influence the binding affinity of peptides for membranes148. Our hypothesis can succinctly explain these observations. The N-terminal location of S13 and S16 within huntingtin means that the time available for co-translational phosphorylation will decrease as the number of CAG repeats increases (see Figures 4.2C and D). In the case of Htt, the poly-proline sequence is correctly placed to slow translation as N17 emerges, allowing time for the phosphorylation of N17 and the subsequent downstream localization and function of Htt. In the case of mHtt, however, co-translational phosphorylation is perturbed by the decrease in time available for CAF binding introduced by the expansion of the CAG-repeat region, resulting in a decrease in the fraction of mHtt that is correctly phosphorylated and can perform its correct downstream function. An analogous mechanism can be postulated for each of the N-terminal processing events which may affect Htt co-translationally, such as N-terminal acetylation and phosphorylation of Thr3149. It is important to note that this example assumes N17 phosphorylation occurs co-translationally; while there is evidence that this is the case for some proteins117,118, whether or not Htt is co-translationally phosphorylated has not yet been investigated.

4.12 Testing the co-translational misprocessing hypothesis of HD pathology

Though our hypothesis is consistent with a wide range of experimental observations concerning HD pathogenesis, it is based on a large body of circumstantial evidence from the co-translational folding field. We therefore suggest three experiments that would test this model. First and foremost, it must be determined if a CAF engages Htt. This question can be answered with a combination of pulse-chase labeling and cross-linking assays. Pulse-chase assays have been used extensively to study co-translational protein folding, as they provide the ability to specifically visualize nascent proteins2,59, while chemical crosslinking has also been used to capture RNC/CAF complexes in situ150. We suggest that these two techniques be combined into a hybrid technique. A short pulse-period of radiolabel incorporation (<1 min) would be followed by a chase period with unlabeled media containing a crosslinking agent (such as dithiobis succinimidyl propionate150) to selectively label nascent Htt and capture its interactions with any CAFs, respectively. Similar methods utilizing the incorporation of non-canonical amino acids into Htt followed by photo-crosslinking151 or click chemistry152 to connect Htt to binding partners can also be envisioned, though these methods require chemical modification of N17 that may alter the ability of potential binding partners to recognize and bind Htt Exon 1107. The experiment would conclude with size-dependent separation by SDS-PAGE and gel visualization via phosphorimaging or with mass spectrometry, depending on the technique used. Pending the positive identification of a Htt/CAF interaction, the next relevant question is whether this CAF preferentially binds Htt over mHtt. This question can be straightforwardly answered via the application of a number of different experimental techniques for determining dissociation constants as well as the on and off rates of RNC/CAF binding79. 
Our hypothesis suggests that the amount of mHtt directed to the cytosol (i.e., the amount of nascent protein that is mistargeted or otherwise malfunctions and is available for proteolysis) is greater than the amount of Htt directed to the cytosol. Furthermore, our hypothesis predicts that the amount of mHtt directed into the cytosol will be a monotonically-increasing function of poly-glutamine length. Though the subcellular localization of various Htt Exon 1 constructs has been previously reported108, these data are not sufficient to quantify the flux of Htt/mHtt into the cytosol. Similar experiments could be designed utilizing separate fluorescent tags for each relevant organelle (i.e., the Golgi, ER, and mitochondria) as well as for the Htt construct. The fraction of cytosolic protein at various times after the start of the experiment could then be calculated by the difference between the total Htt-associated fluorescence and the Htt-associated fluorescence that co-localized with subcellular organelles.

4.13 Conclusion

We have described a novel hypothesis that presents possible contributions to HD pathology due to perturbation of the non-equilibrium phenomenon of co-translational nascent-protein processing. Within this model, N17 is the CAF binding site, the poly-glutamine region acts as a linker connecting N17 to the poly-proline region, and the poly-proline stretch acts as a brake on translation elongation that facilitates N17-CAF interactions. As the number of CAG repeats increases above 35, the poly-proline stall site shifts further and further downstream of N17, and due to the decreased time available for CAF binding, more mHtt fails to be correctly co-translationally modified or targeted and is therefore directed to the cytosol. In the cytosol, proteolysis can result in the production of Exon 1 or C-terminal fragments that form HD's characteristic nuclear amyloid110,111 or cause ER dysfunction113, respectively. There is precedent for such an effect for other proteins; codon translation rates have been implicated as causal factors in the development of some human cancers45,153, cystic fibrosis34, and disparate drug transport functionalities between synonymous mutant proteins31. Our hypothesis is consistent with key observations of mHtt behavior and HD pathology (Figures 4.1 and 4.2). Furthermore, our hypothesis is testable, and we have suggested a number of experiments that would test it. Pending the identification of a CAF that interacts with mHtt, new therapeutic strategies can be explored based on the idea of reducing the rate of poly-glutamine translation to provide the needed time for CAF binding. Though we have thoroughly investigated our co-translational hypothesis only for HD, five other poly-glutamine proteins associated with the neurodegenerative disorders SBMA, DRPLA, SCA-2, SCA-3, and SCA-7 also contain poly-proline regions154. 
Therefore, it is possible that these other proteins might also have a contribution to their pathology due to changes to co-translational phenomena upon poly-glutamine expansion. However, given the differences between these other poly-glutamine proteins and Htt, the specific form of this effect is likely different. It is our hope that the novel perspective offered in this paper will motivate experimentalists to further explore the molecular biophysics of HD pathology and any connection to translation kinetics and nascent protein behavior.


4.14 Acknowledgements

This project was supported by a Human Frontiers in Science Foundation program grant (RGP0038/2015).


Chapter 5

PREVALENCE AND TIMESCALES OF KINETIC TRAPPING WITHIN THE E. COLI PROTEOME

Sections of this chapter are currently under review as part of a manuscript titled “Domain topology, stability, and translation speed determine co-translational folding force generation” by Sarah Leininger, Daniel A. Nissley, Fabio Trovato, and Edward P. O’Brien. D.A.N. performed all simulations and analysis described in this chapter with the exception of the construction of the multi-domain protein all-atom models and the training set replica exchange simulations (see Appendix D).

5.1 Introduction

The process of protein folding is often described in terms of the folding funnel155, in which folding is seen as a downhill conformational search towards the native state that represents the global free-energy minimum. Any peaks and valleys on the walls of this folding funnel form kinetic traps – local minima in which the protein molecule can be trapped in non-equilibrium structures for various durations. These non-equilibrium structures may resemble molten globules or partially folded intermediates with some resemblance to the folded state156. Due to the stochastic nature of protein folding some molecules are thought to quickly proceed to the folded state while others become caught in kinetic traps. Though protein folding pathways through the folding funnel have been explored for small proteins and subdomains, no analysis of the entire proteome or a representative subset thereof has been carried out for an organism. Thus, several open questions remain concerning kinetic traps157–159: How common is kinetic trapping? How long do kinetically trapped states persist on average? These questions have widespread implications for protein biogenesis, as kinetically trapped proteins will likely not be able to perform their intended biological function. Though such long-lived kinetically trapped conformations may be targeted for degradation before they can aberrantly influence the cell, recent experiments have found populations of proteins that are soluble and non-functional13. Kinetic trapping may be one source of such conformations. One of the key aspects of protein biogenesis that may affect the propensity of a protein to become kinetically trapped is the inherent non-equilibrium nature of translation. Thus, any study of kinetic trapping should consider the stochastic processes of protein synthesis and termination to accurately capture the influence of out-of-equilibrium effects on the partitioning of the ensemble of nascent proteins into different kinetically trapped states. To replicate the biological process of protein folding and answer the above questions concerning kinetic trapping we have performed high-throughput simulations of the translation elongation, translation termination, and post-translational dynamics of 50 multi- and 72 single-domain E. coli proteins (Figure 5.1). These 122 proteins correspond to approximately 10% of all E. coli proteins for which experimental structures are available72. We find that 32 of these 122 proteins contain at least one domain or interface that is kinetically trapped, and that these kinetic traps persist for the equivalent of up to 3 minutes of real time, which is longer than the half-lives of some proteins160. Kinetic trapping may thus be a widespread mechanism of misfolding that can sequester protein molecules in non-functional states for physiologically relevant periods of time.

Figure 5.1. Protein training and data sets and ribosome cutout. Top: Cartoon models of training set, multi-domain dataset, and single-domain dataset proteins are shown in dark blue, orange, and cyan, respectively. Only the 18 training set proteins for which folding times were calculated are shown. Bottom: The entire 50S E. coli ribosomal subunit is shown in a wire representation and the portion explicitly included in the coarse-grain model is shown as spheres. Yellow interaction sites correspond to ribosomal proteins except L24 which is colored bright green to indicate it was allowed to fluctuate in the simulations. Ribose interaction sites are shown in cyan. Purine, pyrimidine, and phosphate interaction sites are displayed in ochre.

5.2 Results and Discussion

To approximate the E. coli proteome, we selected a data set of 50 multi- and 72 single-domain proteins from a previously published database of cytosolic E. coli proteins (see Appendix D)72. These proteins range in size from 72 to 1,342 amino acids. This data set contains the correct distributions of domain size, protein size, and structural class (see Methods, Figure D.3, and Figure D.4 in Appendix D). Each protein within this data set was parameterized based on a training set of 19 single-domain proteins and then synthesized on an explicit representation of the E. coli ribosome exit tunnel and ribosome surface in Langevin dynamics simulations in CHARMM (see Figure 5.1 and Appendix D). Following synthesis, the translation termination and post-translational dynamics of each protein were simulated for a wall time of thirty CPU days. Fraction of native contacts time series, 푄(푡), were calculated from these simulations for each domain and, for multi-domain proteins, each interface (see Appendix D). A domain or interface is considered to be folded when its 푄(푡) value is greater than or equal to 〈푄〉ref − 2휎 for at least 150 ps of simulation time, where 〈푄〉ref is the mean fraction of native contacts calculated over ten 30-day simulations initiated from the native state and 휎 is the standard deviation of this average. A total of 32 proteins contain at least one kinetically trapped domain or interface. We define “kinetically trapped” in this context to mean that the domain or interface never folded during the simulation based on our 〈푄〉ref − 2휎 criterion despite remaining stable in simulations initiated from the native state. These kinetically trapped domains and interfaces are summarized in Tables 5.1 and D.11, respectively. The mean folding time is calculated over the subset of folded trajectories for each domain or interface, while the trapping duration is calculated as the average total simulation time achieved for trajectories that never folded minus the mean synthesis time. Trapping durations in seconds were calculated based on the mean acceleration of dynamics observed in our low-friction Langevin dynamics simulations (see Methods in Appendix D).

Figure 5.2. 1GYT fraction of native contacts time series. Fraction of native contacts as a function of simulation and experimental time for domain 1 (top), domain 2 (middle) and the 1|2 interface (bottom) of PDB ID 1GYT. Light and dark lines correspond to individual trajectories and the ensemble average over trajectories, respectively. The dotted black line in each plot indicates the 〈푄〉ref − 2휎 threshold for the domain or interface. Note that one trajectory does fold for 1GYT domain 2 but due to the binning procedure it appears not to breach the folding threshold.
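The folding criterion described above can be sketched in a few lines. This is an illustrative implementation, not the analysis code used for the thesis: it assumes that "for at least 150 ps" means one uninterrupted run of frames at or above the threshold, and the frame spacing `dt_ps` and all function names are hypothetical.

```python
def folding_threshold(q_ref_mean, q_ref_std):
    """Threshold <Q>_ref - 2*sigma from native-state reference simulations."""
    return q_ref_mean - 2.0 * q_ref_std


def first_folding_frame(q_series, dt_ps, threshold, min_duration_ps=150.0):
    """Return the first frame of the first run with Q(t) >= threshold that
    spans at least min_duration_ps, or None if no such run exists.

    Assumption: the 150 ps must be contiguous; dt_ps is the spacing between
    saved frames (a hypothetical value must be supplied by the caller).
    """
    run_start = None
    for i, q in enumerate(q_series):
        if q >= threshold:
            if run_start is None:
                run_start = i
            # Time elapsed between the first and current frame of this run.
            if (i - run_start) * dt_ps >= min_duration_ps:
                return run_start
        else:
            run_start = None
    return None


def is_kinetically_trapped(q_series, dt_ps, threshold):
    """A trajectory counts as trapped if it never satisfies the criterion."""
    return first_folding_frame(q_series, dt_ps, threshold) is None
```

With frames saved every 15 ps, for example, a domain would need eleven consecutive frames at or above the threshold before it is counted as folded.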

Table 5.1. Synthesis times, folding times, and percentage of trajectories kinetically trapped

PDB ID  Domain  % trapped  Mean folding  Mean simulation time for   Mean synthesis  Trapping      Trapping
                           time, ns      trapped trajectories, ns   time, ns        duration, ns  duration, s
1A6J    1       6          14774         32639                      1819            30820         122
1CLI    2       10         7652          14650                      3488            11163         44
1DUV    1       20         4193          12853                      3749            9104          36
1DUV    2       2          3771          12853                      3749            9104          36
1DXE    1       10         11778         23172                      2885            20287         80
1FUI    1       62         9805          13010                      6769            6241          25
1FUI    2       16         5360          13010                      6769            6241          25
1FUI    3       94         7208          13010                      6769            6241          25
1GLF    2       42         10027         13328                      5203            8125          32
1GYT    2       98         5794          16675                      5336            11339         45
1K7J    1       66         27037         42821                      2110            40710         162
1KSF    2       10         4623          12719                      7855            4864          19
1NG9    3       2          6054          12488                      8769            3719          15
1P7L    3       42         4463          12218                      3899            8319          33
1P91    1       46         6324          19118                      3288            15829         63
1PF5    1       4          1590          36792                      1470            35322         140
1U0B    2       4          4535          12863                      4761            8102          32
1U60    1       4          6638          14865                      3676            11189         44
1UUF    1       12         7874          15378                      3859            11519         46
1W78    1       8          3390          11582                      4485            7098          28
1W78    2       2          4526          11582                      4485            7098          28
2FYM    2       2          4287          10685                      3786            6898          27
2H1F    2       4          3701          14525                      3755            10770         43
2HG2    1       22         6627          11551                      5256            6295          25
2HNA    1       16         1593          33016                      12872           20144         80
2ID0    3       70         10301         13926                      7026            6900          27
2R5N    2       28         6755          11389                      6718            4672          19
2WIU    1       4          6058          13018                      5268            7751          31
3PCO    3       38         4816          12684                      7466            5218          21
3PCO    5       10         7557          12684                      7466            5218          21
4HR7    3       32         5082          12629                      4582            8047          32
4IM7    1       2          6022          11557                      5501            6056          24
4IM7    2       16         5451          11557                      5501            6056          24
4IWX    1       44         15163         25719                      3163            22556         89
4IWX    3       42         15364         25719                      3163            22556         89
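The two trapping-duration columns of Table 5.1 follow from simple arithmetic, sketched below. The subtraction is taken directly from the definition in the text. The conversion factor to experimental time is an assumption: the value of roughly 4 × 10⁶ used here is back-calculated from the ratio of the last two columns of Table 5.1 and is not the acceleration factor reported in Appendix D. Function names are illustrative.

```python
def trapping_duration_ns(mean_sim_time_trapped_ns, mean_synthesis_time_ns):
    """Mean simulation time of never-folded trajectories minus mean synthesis time."""
    return mean_sim_time_trapped_ns - mean_synthesis_time_ns


def to_experimental_seconds(sim_time_ns, acceleration=4.0e6):
    """Rescale simulation time to an estimate of real time.

    Assumption: acceleration ~4e6 is inferred from Table 5.1 itself; the
    value actually used is described in Appendix D.
    """
    return sim_time_ns * 1e-9 * acceleration


if __name__ == "__main__":
    # (PDB ID, domain, mean simulation time for trapped trajectories, mean synthesis time)
    rows = [("1A6J", 1, 32639, 1819), ("1GYT", 2, 16675, 5336), ("1PF5", 1, 36792, 1470)]
    for pdb_id, domain, sim_ns, synth_ns in rows:
        d_ns = trapping_duration_ns(sim_ns, synth_ns)
        print(pdb_id, domain, d_ns, round(to_experimental_seconds(d_ns)))
```

Because the assumed factor is approximate, the seconds it produces can differ from the tabulated values by about one second.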

Perhaps unsurprisingly, multi-domain proteins are more likely to experience kinetic trapping than single-domain proteins. Of the 35 kinetically trapped domains only 7 are within single-domain proteins. This may be attributable to the fact that multi-domain proteins are in general more topologically frustrated than single-domain proteins due to the presence of inter-domain contacts and, on average, longer amino acid sequences. Consider domain 2 of PDB ID 1GYT, the most severely kinetically trapped domain, for which 98% of trajectories are unable to fold during the equivalent of 45 s of post-translational dynamics. Domain 1 folds on the order of the synthesis time after approximately 5 µs, but the interface between domains 1 and 2 (denoted 1|2 interface) is unable to form in most trajectories (Figure 5.2). Several trajectories make excursions to higher-푄 regions of conformational space before completely unfolding once more, and these excursions are observed in both the domain 2 and 1|2 interface plots. This result demonstrates the interplay between domain and interface formation. The 푄 time series for PDB ID 1FUI display similar behavior for each of the three domains and three interfaces, which are kinetically trapped in between 12 and 94% of trajectories, and no trajectory successfully folded each domain and formed each interface (Figures D.5 and D.6).

Our simulations lack several cellular components that may influence the likelihood that a protein becomes kinetically trapped. Chaperonins, such as GroEL, are a class of molecular chaperones that recognize unfolded proteins, draw them into their internal cavity, and aid their folding to the correct native state161,162. Within cells, such chaperonins might aid the folding of kinetically trapped states and reduce the average timescale of kinetic trapping. On the other hand, the type of coarse-grain model we use here is incapable of forming non-native tertiary contacts. Our kinetically trapped states are thus the result of topological frustration and not misfolding to alternative tertiary structures as may occur for some proteins. While the action of chaperonins will tend to decrease the prevalence of long-lived kinetically trapped states, the lack of non-native tertiary contacts smooths the folding landscape and should tend to accelerate folding on average. Our simulations thus represent a reasonable approximation of the prevalence and timescales of kinetic trapping but neglect certain aspects of protein biogenesis in cells that may tend to increase or decrease kinetic trapping compared to reality.

Future work related to our simulations could ascertain the effect of the pervasive kinetic trapping we have detected on protein function and solubility. For example, the likelihood that a kinetically trapped conformation will remain soluble could be estimated based on the proportion of solvent-exposed surface area of hydrophobic regions for a protein in comparison to its native state. Similarly, the RMSD over those residues identified to be related to the function of a given protein could be used as a surrogate for protein function.

In summary, we have probed the timescales and prevalence of kinetic trapping within the E. coli proteome using high-throughput, low-friction Langevin dynamics simulations of protein synthesis, termination, and post-translational dynamics. We find that more than one in four proteins is kinetically trapped, and that these kinetically trapped states persist for minutes after protein synthesis is complete.


Chapter 6

CONCLUSIONS AND FUTURE DIRECTIONS

6.1 Conclusions

The goal of the research presented in this thesis was to gain fundamental insight into the process of co-translational folding using various theoretical and computational techniques. Using classical and stochastic chemical kinetics as well as coarse-grain molecular dynamics simulations I have modeled co-translational folding at the whole cell and individual protein levels. I hope that the results presented in this thesis will motivate further investigations of the influence of the non-equilibrium nature of protein synthesis on protein structure and function.

In Chapter 2 I describe a chemical kinetic model that accurately reproduces all available experimental co-translational folding time series and is general for all organisms, proteins, and translation schedules. This model predicts that some yeast proteins can be made to switch from folding post- to co-translationally if their translation schedule is delayed. This model will hopefully prove to be a useful tool for the community in determining the influence of translation schedule perturbations on protein folding. For example, a researcher expressing a human protein in E. coli could use my model to determine how the significantly faster translation in E. coli than in human cells influences the ability of their protein to fold correctly.

Chapter 3 describes Langevin dynamics simulations of the synthesis and co-translational folding of the E. coli protein HemK. These simulations were used to test three hypotheses to explain the experimentally observed FRET signal during HemK synthesis: (1) collapse as in a poor solvent, (2) dimensional collapse due to the change in space dimension experienced by the nascent chain upon leaving the exit tunnel, and (3) co-translational folding. The poor-solvent hypothesis was ruled out, but dimensional collapse and folding were found to be equally consistent with the experimental data. Further analysis of these simulations revealed that the co-translational folding of HemK proceeds by the sequential addition of helices to a growing hydrophobic core as each helix emerges from the exit tunnel. Such simulations highlight the ability of molecular dynamics to reveal co-translational processes at molecular detail.

Chapter 4 extends the basic principles of co-translational phenomena to consider the hypothesis that Huntington’s Disease is the result of dysfunction of a co-translational phenomenon. Based on strong circumstantial evidence and the results of a simple kinetic model I argue that this hypothesis is plausible and suggest key experiments to test its predictions. Such theoretical considerations of the possible influence of co-translational phenomena may open up new avenues of research related to human disease.

Finally, in Chapter 5, I describe high-throughput simulations of a representative subset of the E. coli proteome that were carried out to assess the prevalence and timescales of kinetic trapping. I find that more than one protein in four (32 of 122) is kinetically trapped, and that the timescale of kinetic trapping is longer than the half-lives of some proteins.

6.2 Future directions

6.2.1 Translation termination timescales and governing factors

I observed a two order of magnitude range in mean translation termination times in simulations of nascent chain ejection from the ribosome as described in Chapter 5 and Appendix D (Figure 6.1 and Table 6.1). This observation raises the question of what factors govern the translation termination time for a given protein. The fastest- and slowest-terminating protein sequences, PDB IDs 1T8K and 1U0B, respectively, were found to contain markedly different electrostatic qualities in their C-termini. While 1U0B contains a cluster of positively charged amino acids at its C-terminus, 1T8K does not. I therefore hypothesized that electrostatic interactions play a role in the slow ejection of 1U0B. To test this hypothesis, I performed two preliminary sets of control simulations for each protein. In the first control I truncated each nascent chain such that only their C-terminal 50 residues remained, and in the second control I removed all positive charges within the same range of residues (Table 6.1).

Figure 6.1. Distributions of mean translation termination times from coarse-grain simulations. Top: Distribution of 〈휏〉 in units of ns where 〈휏〉 is the mean termination time for fifty statistically independent simulations each generated from a different final structure from protein synthesis. Bottom: Data in top panel normalized by the fastest termination time, 〈휏〉min, and plotted with a logarithmic x-axis.

Table 6.1. Translation termination times for 1T8K and 1U0B under various conditions

PDB ID  Number of  〈휏term〉, ns  〈휏term〉, ns             〈휏term〉, ns
        residues                C-terminal 50 aa only   No (+) in C-terminal 50 aa
1T8K    78         0.3          0.3                     0.3
1U0B    461        40.0         141                     0.6

Truncating 1T8K had no influence on its termination time, but 1U0B’s mean termination time increased roughly 3.5-fold, from 40.0 to 141 ns (Table 6.1). It is possible that without the entropic pulling force93 generated by the N-terminal ~400 residues 1U0B has only a very small driving force for ejection. Removal of positive charges in the C-terminal 50 residues of 1U0B reduced its mean termination time to 0.6 ns, on the order of the value for 1T8K, while the same change did not alter 1T8K’s mean termination time. Based on these simulations it appears that electrostatics drive the slow ejection of 1U0B. Future work based on these preliminary simulations would ideally take the form of collaboration with an experimental group. For example, an experimental collaborator could measure the actual ejection times for sets of proteins I predict to be fast- or slow-terminating. If my simulations are accurate the experiments should find the same general trend. All-atom molecular dynamics could also be used to gain more detailed insight into the molecular mechanism underlying slow ejection. Simulations at an all-atom resolution are too costly and slow to actually observe nascent chain ejection on a timescale of tens to hundreds of nanoseconds, but constant velocity pulling simulations could be used to measure rupture forces for different nascent chains. My simulations predict that proteins with more positive charges in their C-termini will require larger forces to be extracted from the ribosome exit tunnel. Determining the molecular factors that govern the timescales of translation termination would be a step towards a more complete understanding of the process of protein biogenesis and highlight the predictive power of coarse-grain simulation methods.
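As a quick check on the comparisons drawn from Table 6.1, the fold changes relative to the full-length simulations can be computed directly. This is only a worked-arithmetic sketch; the dictionary keys and function name are illustrative labels, and the numbers are copied from Table 6.1.

```python
# Mean termination times in ns, copied from Table 6.1.
TAU_TERM_NS = {
    "1T8K": {"full": 0.3, "cterm50": 0.3, "no_positive": 0.3},
    "1U0B": {"full": 40.0, "cterm50": 141.0, "no_positive": 0.6},
}


def fold_change(pdb_id, condition):
    """Ratio of the mean termination time under `condition` to the full-length value."""
    times = TAU_TERM_NS[pdb_id]
    return times[condition] / times["full"]


if __name__ == "__main__":
    for pdb_id in TAU_TERM_NS:
        for condition in ("cterm50", "no_positive"):
            print(pdb_id, condition, round(fold_change(pdb_id, condition), 3))
    # Spread between the slowest and fastest full-length proteins.
    print(TAU_TERM_NS["1U0B"]["full"] / TAU_TERM_NS["1T8K"]["full"])
```

The last ratio, about 133, corresponds to the two-order-of-magnitude range between the fastest- and slowest-terminating proteins.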

6.2.2 Investigating aspects of multi-domain protein folding in bulk solution and on the ribosome

The dataset of 50 multi-domain proteins generated for my simulations described in Chapter 5 could also be used to address the question of whether domains in multi-domain proteins fold independently of one another. This question could be answered through three sets of temperature quenching simulations (see Appendix D for a representative simulation protocol): (1) of each multi-domain protein with inter-domain contacts removed but all domains present, (2) of domains simulated in isolation from one another, and (3) of each multi-domain protein with no modification to its parameters or domain topology. The first set of simulations controls for the influence of attractive terms between domains. The second set controls for not only attractive interactions between domains but also their influences on dynamics. The third set of simulations serves as a baseline for comparison. Analysis of these simulations has the potential to reveal diverse aspects of multi-domain protein folding: (1) how much do domain interfaces influence the ability of individual domains to reach their native-state conformation? (2) what influence do domain interfaces have on the kinetics of refolding? Comparison of the results of refolding with all domains and attractive interactions present to folding during synthesis (as described in Chapter 5) would also reveal the differences between co- and post-translational folding. In summary, this proposed extension of the research presented in Chapter 5 would elucidate mechanisms of multi-domain protein folding, the dependence of such folding on other domains in the protein, and general differences between co- and post-translational folding.


Appendix A

CHAPTER 2 METHODS AND SUPPORTING INFORMATION

Published as a paper entitled “Accurate prediction of cellular co-translational folding indicates proteins can switch from post- to co-translational folding” by Daniel A. Nissley, Ajeet K. Sharma, Nabeel Ahmed, Ulrike A. Friedrich, Gunter Kramer, Bernd Bukau, and Edward P. O’Brien in Nature Communications 2016, 7, 10341. D.A.N., A.K.S., N.A. and E.P.O. designed the research. D.A.N., A.K.S. and E.P.O. carried out the theoretical modelling and analysis. U.F., G.K. and B.B. performed the ribosome profiling experiments. N.A. analyzed the ribosome profiling and FactSeq data. D.A.N., A.K.S., N.A., U.F., G.K., B.B. and E.P.O. interpreted the data and wrote the manuscript. This article is published under an Open Access Creative Commons Attribution 4.0 International License.

A.1 Methods

A.1.1 Selection of model parameters

The co-translational folding curves of four different proteins have been measured in vivo using either pulse-chase2 or FactSeq71 experimental techniques. Equation 2.2 requires the bulk folding and unfolding rates for each of these domains along with the average codon translation rate for each transcript. These rates are listed in Table 2.1 for the four proteins, as are the lengths of the proteins and observable segment. In the case of the SFVP constructs, the observable region is limited by the most C-terminal Met residue within the C protein domain, Met255, as only Met and Cys residues were radiolabeled in the experiment. For both the Flag-FRB-GFP and HA1 constructs, all residues within the segment of interest are experimentally observable. The rates of folding (푘F,푖) for the SFVP WT and ΔC constructs were taken from the reported experimental values, and the rate of unfolding for the SFVP constructs was calculated from the experimentally-determined thermodynamic stability of the native state as 푘U = 푘F exp[∆퐺UF/푅푇]. The rates of folding and unfolding for the Flag-FRB-GFP and HA1 proteins were predicted using a phenomenological model73. The codon translation rate of 3.9 AA per second in CHO cells was calculated from Fig. 1d of (Nicola, Chen, and Helenius 1999), which displays the results for a pulse-chase experiment in which the synthesis of the cleavage-negative Δile SFVP construct is observed to be linear as a function of time. A S219I point mutation in this construct of C protein disrupts the function of the catalytic triad, preventing it from catalyzing its cleavage from the rest of the protein. Δile SFVP is otherwise identical to ΔC SFVP. The experimental data points were extracted using PlotDigitizer (PlotDigitizer.com) and a linear least squares analysis carried out (Figure A.3), resulting in a line of best fit of 푦 = 0.0025푡 + 0.26 (푅² = 0.95, 푝 = 0.001). The time at which the fraction of full-length protein first reaches a value of 1.0 is equal to the amount of time required to synthesize the entire protein. Dividing the length of the protein, 1,145 amino acids, by this time value, 296 s, yields an average codon translation rate of 3.9 AA per second for SFVP synthesis in CHO cells.
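The translation-rate estimate above is a two-step calculation that can be reproduced as follows. This is a minimal sketch using the fitted slope and intercept and the protein length quoted in the text; the function names are illustrative.

```python
def time_to_full_length(slope, intercept, target=1.0):
    """Time (s) at which the fitted line y = slope*t + intercept reaches `target`."""
    return (target - intercept) / slope


def mean_translation_rate(n_residues, synthesis_time_s):
    """Average codon translation rate in amino acids per second."""
    return n_residues / synthesis_time_s


if __name__ == "__main__":
    t_full = time_to_full_length(slope=0.0025, intercept=0.26)
    rate = mean_translation_rate(1145, t_full)
    print(round(t_full), round(rate, 1))
```

Solving 0.0025푡 + 0.26 = 1.0 gives the 296 s synthesis time, and 1,145 residues divided by that time gives the quoted rate after rounding to one decimal place.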

A.1.2 Calculation of error bars

No error bars were reported for the SFVP experimental data2 that are displayed in Figure 2.3. To better assess how well our calculations agreed with these experimental results, we performed a literature search for similar pulse-chase experiments involving 35Cys and 35Met labeling in which error bars are reported for proteins translating in vivo. The error bars were extracted from the published graphs of three separate studies163–165 with the program PlotDigitizer (PlotDigitizer.com) and then converted to a standard deviation. The individual standard deviations were then averaged, yielding an average standard deviation (푛 = 33) of 0.151 (in units of probability). Though the various experiments that were considered in this estimate contain a different number of measurements, it has been shown that the standard deviation is fairly insensitive to 푛[166]. The individual data points that were extracted from our literature search are reported in Table A.2.

A.1.3 Calculation of test statistics for FactSeq data

The FactSeq data in Figure 2.4 were each broken into three separate regions. The first region was defined as codon positions 1–50, which represents nascent chain lengths at which the nascent proteins will be unfolded. The second region was defined to be from codon position 51 to the last codon stated by Han and co-workers71 to be in the unfolded state. For FRB and both epitopes of HA1 the second region thus consists of codon positions 51–150 and 51–310, respectively. The third region is defined as the codon positions for which the protein is expected to be folded, which is codon positions 151–379 for FRB and codon positions 310–565 for HA1. The three regions were compared pairwise and statistical significance was determined with the Mann–Whitney U-test. The 95% confidence interval of the median values was calculated by bootstrapping with 100,000 replications. The median values of the three regions along with the corresponding 95% confidence intervals and statistical significances are shown in Figure 2.4D.
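The region comparison described above can be sketched with the Python standard library. This is an illustrative outline only: the helper names are hypothetical, the sketch returns the Mann–Whitney U statistic without a p-value (a statistics package would normally supply that), and the default number of bootstrap replications is reduced from the 100,000 used in the text to keep the example fast.

```python
import random
import statistics


def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y (ties count one half)."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def bootstrap_median_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower = medians[int(n_boot * alpha / 2)]
    upper = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper
```

In use, each of the three codon-position regions would be passed pairwise to `mann_whitney_u`, and `bootstrap_median_ci` would be called once per region with `n_boot=100_000` to match the replication count stated above.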

A.1.4 Details of protein domain identification and numbering

We used a previously reported method of domain identification72 based on the Class Architecture Topology Homology (CATH) and Domain Parser databases. CATH domains are identified on the basis of sequence homology167 and thus do not always represent autonomous folding units. Some CATH domains are composed of non-contiguous segments of the protein. The method we use here requires that the amino acids that compose a domain be contiguous and that each autonomous folding unit contain at least 50 amino acids; we therefore modified some CATH domain definitions such that domains only consisted of contiguous segments of >50 amino acids. Renumbering domains in a protein in this way can result in a number of domains that is larger than the number of domains identified by CATH. For example, suppose that within a 500 amino-acid protein CATH identifies five domains, with the fifth domain composed of amino acids 1–100 and 300–400. As the two segments that compose the CATH domain are non-contiguous, our labelling scheme would separate them into two unique domains. We would refer to amino acids 1–100 as domain 5 and amino acids 300–400 as domain 6. Domain details for the four yeast proteins can be found in Table 2.1.
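The renumbering rule in the example above can be sketched as follows. This is a hypothetical helper, not the procedure's actual implementation: it assumes that the first contiguous segment of a domain keeps the original number, that additional segments receive the next unused numbers, and that segments shorter than the minimum length are dropped.

```python
def renumber_domains(cath_domains, min_length=50):
    """Split non-contiguous domains into contiguous ones and renumber them.

    cath_domains: dict mapping domain number -> list of (start, end) residue
                  ranges (inclusive), as reported by the database.
    Returns a list of (domain_number, start, end) tuples.
    """
    next_id = max(cath_domains) + 1   # first unused domain number
    result = []
    for dom_id in sorted(cath_domains):
        # Keep only contiguous segments that are long enough (assumed rule).
        segments = [
            (start, end)
            for start, end in sorted(cath_domains[dom_id])
            if end - start + 1 >= min_length
        ]
        for rank, (start, end) in enumerate(segments):
            if rank == 0:
                new_id = dom_id       # first segment keeps the original number
            else:
                new_id = next_id      # later segments become new domains
                next_id += 1
            result.append((new_id, start, end))
    return result


# The fifth domain from the example in the text (the other four are omitted).
print(renumber_domains({5: [(1, 100), (300, 400)]}))
# prints [(5, 1, 100), (6, 300, 400)]
```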

A.1.5 Identifying yeast protein domains

We randomly selected 10 multi-domain yeast proteins that had domain definitions reported in the CATH or Domain Parser databases. We tested which of these domains could switch from post- to co-translational folding by applying Eq. A.5. To determine the starting and ending codons for each domain we used BLAST168 to map its protein sequence onto the yeast reference genome (UCSC: sacCer2). We then used the de Sancho–Munoz model73 to estimate each domain’s folding and unfolding rates at 303 K, which were then used in Eq. 2.2 to predict its co-translational folding profile for the WT mRNA sequence and the recoded, slowest-translating mRNA sequence. The probability that a domain folds co-translationally (푃F,Co−t) was taken as the value of Eq. A.5 calculated at the stop codon. Proteins with 푃F,Co−t ≥ 0.5 fold predominantly co-translationally, while proteins with 푃F,Co−t < 0.5 fold predominantly post-translationally. Using these definitions, we predict that the four yeast proteins listed in Table 2.1 are capable of switching from post- to co-translational folding due to synonymous codon substitutions.

A.1.6 The time-dependent fraction of full-length protein

The time-dependent fraction of full-length protein (Figure 2.7) that has been synthesized at time 푡 in the pulse-chase experiment (푓L,R(푡)) is equal to the total number of protein molecules that have been released into the cytosol by time 푡 divided by the total number of full-length proteins that are synthesized during the entire simulated experiment

\[ f_{\mathrm{L,R}}(t) = \frac{N_{\mathrm{L,R}}(t)}{N_{\mathrm{L,R}}(t = 360\ \mathrm{s})}. \tag{Eq. A.1} \]

In Eq. A.1, 푁L,R(푡 = 360 푠) is the total number of proteins that will be released into the cytosol by the final time point in the chase period.

A.1.7 Testing the applicability of assumptions A1 and A3

The covalent attachment of amino acids into polypeptides is a many-step process169. However, a two-exponential fit of the experimentally-measured ribosome dwell-time distribution indicates only two rate limiting steps62,170. Therefore, to numerically test if the predicted co-translational folding curve would change significantly when a dwell-time distribution of the form $P(\tau) = \frac{k_1 k_2}{k_2 - k_1}\left[\exp(-k_1\tau) - \exp(-k_2\tau)\right]$ is used, we assumed that ribosomes stochastically switch between the pre-translocation and post-translocation states. The post-translocation state transitions to the pre-translocation state with rate 푘1, and the transition from the pre-translocation to post-translocation state occurs with rate 푘2 and elongates the nascent chain by one amino acid. We scaled the experimentally-fitted values of 푘1 and 푘2 from (Tinoco and Wen, 2009) to keep the mean codon translation rate equal to 3.9 AA per second, which is SFVP’s average codon translation rate in CHO cells (that is, $1/k_{\mathrm{A}} = 1/k_1 + 1/k_2$), and used 푘1 = 4.7363 s−1 and 푘2 = 22.0649 s−1.

We started our virtual experiment from the situation where ribosomes of each nascent chain length are equally probable. Therefore, a single ribosome is assigned for each nascent chain length. In the system, a new translation-initiation event occurs after a variable time interval of 휏 that is exponentially distributed with a mean value of $1/k_{\mathrm{int}}$. Therefore, the number of labeled proteins increases with time and then saturates after the end of the pulse period. Using the Gillespie algorithm74, we simulated the stochastic kinetics of each of these ribosome-nascent chain complexes translating the SFVP ΔC mRNA. These simulations generated the trajectories for the time evolution of each of these ribosome-nascent chain complexes in different states. Using these trajectories, the co-translational folding curve was calculated as

\[ P_{\mathrm{F}}(t) = \frac{\sum_{i} \delta_t(i)}{N(t)}, \tag{Eq. A.2} \]

where 푁(푡) is the number of labeled protein domains at time 푡, and 훿푡(푖) equals one when the 푖th labeled protein is in the folded state at time 푡.

This virtual experiment was repeated 20 times, generating 20 different co-translational folding curves, which were then averaged together to give the co-translational folding curve displayed in Figure A.6.

We tested the applicability of Eq. 2.2 under non-steady-state conditions by comparing the predictions made using Eq. 2.2 with non-steady-state co-translational folding curves for ΔC SFVP generated by the Gillespie algorithm. We used a sinusoidally varying time-dependent translation-initiation rate $k_{\mathrm{int}}(t) = k_{\mathrm{int}}(0)\left[1 + A\sin\left(2\pi t/\tau_{\mathrm{p}}\right)\right]$ to create a non-steady-state condition in the system (Figure A.5, top panel). The plots shown in Figure A.5 were made with 휏p = 45 s, 푘int(0) = 3.9 AA per s, and 퐴 as indicated in the figure. We drew a random number, 푡1, from an exponential distribution with mean value $1/k_{\mathrm{int}}(0)$. The first translation-initiation event occurred at time 푡1. For the next initiation event, another random number, 푡2, distributed exponentially with the mean value $1/k_{\mathrm{int}}(t_1)$, was generated, and the second initiation thus occurred at time 푡1 + 푡2. This exponential distribution of time intervals between successive initiation events ensures that translation initiation is a Markovian process. New translation initiations were generated by this method until the end of the pulse period. We simulated the stochastic kinetics of ribosomes arriving in the system after each initiation event and computed the co-translational folding curves by using Eq. A.2. The mean co-translational folding curve over 20 of these virtual experiments is displayed in Figure A.5.
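The initiation-time sampling described above can be sketched as below. This is an illustrative outline rather than the simulation code used for Figure A.5: the function names, the `pulse_end` parameter, and the small floor on the rate (which avoids a zero rate when the amplitude equals 1) are assumptions introduced here.

```python
import math
import random


def initiation_rate(t, k0, amplitude, period):
    """Sinusoidally varying rate k_int(t) = k_int(0) [1 + A sin(2 pi t / tau_p)]."""
    return k0 * (1.0 + amplitude * math.sin(2.0 * math.pi * t / period))


def sample_initiation_times(k0, amplitude, period, pulse_end, seed=0):
    """Draw successive initiation times until the end of the pulse period.

    Each waiting time is exponential with mean 1 / k_int evaluated at the
    time of the previous initiation event (t = 0 for the first event).
    """
    rng = random.Random(seed)
    times = []
    t = 0.0
    rate = initiation_rate(t, k0, amplitude, period)
    while True:
        rate = max(rate, 1e-9)          # assumed guard against a zero rate
        t += rng.expovariate(rate)      # exponential waiting time, mean 1/rate
        if t >= pulse_end:
            break
        times.append(t)
        rate = initiation_rate(t, k0, amplitude, period)
    return times


# Parameter values quoted in the text; a 45 s pulse length is assumed here.
events = sample_initiation_times(k0=3.9, amplitude=0.4, period=45.0, pulse_end=45.0)
print(len(events), "initiation events during the pulse")
```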

A.1.8 Scaling codon translation rate estimates for CHO cells

Codon translation times in yeast were obtained from Stadler and Fire69, Dana and Tuller70, Gardin et al.68 and Fluitt and Viljoen12. Rates for translation in Caenorhabditis elegans were also obtained from Dana and Tuller70. For Gardin et al., Stadler and Fire, and Dana and Tuller the translation times were estimated from ribosome profiling analysis, and were referred to as the relative residence time score, occupancy and normalized footprint count, respectively, in the original publications. To map these rates to CHO cells, each reported set of rates was scaled such that the average translation rate across the CDS of ΔC SFVP matched the experimentally-determined value of 3.9 AA per second. To achieve this, the unscaled translation times were matched with the corresponding codons in ΔC SFVP’s sequence. The inverse of each of the unscaled translation time estimates was then taken to produce the estimated translation rate. The sum of these estimated translation rates across the ΔC SFVP CDS was then divided by the length of the CDS (=1,145 codons) to obtain the average unscaled translation rate. Dividing the desired average translation rate of 3.9 AA per second by the unscaled average translation rate yields a scaling factor, 휒, that relates the unscaled values to the correctly scaled values that reproduce the 3.9 AA per second average in CHO cells. Thus, multiplying the unscaled codon translation rates by 휒 yields the set of scaled rates that maintain the desired 3.9 AA per second average. This process is summarized in Eqs. A.3 and A.4.

\[ \chi = \frac{3.9\ \text{AA per second}}{\dfrac{\sum_{i=1}^{1{,}145} k^{\mathrm{unscaled}}_{\mathrm{A},i}}{1{,}145}} \tag{Eq. A.3} \]

\[ k^{\mathrm{scaled}}_{\mathrm{A},i} = \chi \, k^{\mathrm{unscaled}}_{\mathrm{A},i} \tag{Eq. A.4} \]

Stadler and Fire only report rates for codons AAC, AAU, AGC, AGU, CAC, CAU, GAC, GAU, GGC, GGU, UAC, UAU, UGC, UGU, UUC and UUU; occupancies of 1.000 were therefore assumed for each codon for which a specific translation time estimate was not reported. Translation times for stop codons (UAA, UAG and UGA), which are required by Eq. 2.2 to provide the ribosome dwell time at the last codon position in the CDS, were only reported by the Fluitt–Viljoen model; where specific translation times for stop codons were not reported, the average translation time of 256 ms for ΔC SFVP in CHO cells was used. Scaled and unscaled rates are reported in Table A.1.
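The scaling procedure of Eqs. A.3 and A.4 can be sketched in a few lines. The function name and the toy translation times below are illustrative assumptions; in practice the input would be the per-codon unscaled translation times along the ΔC SFVP CDS.

```python
def scale_rates(unscaled_times, target_mean_rate=3.9):
    """Scale per-codon rates so that their mean equals `target_mean_rate`.

    unscaled_times: unscaled translation time estimate for each codon in the CDS.
    Returns (chi, scaled_rates), following Eqs. A.3 and A.4.
    """
    # Invert each unscaled time to obtain an unscaled rate.
    unscaled_rates = [1.0 / t for t in unscaled_times]
    # Average unscaled rate across the CDS (denominator of Eq. A.3).
    mean_unscaled = sum(unscaled_rates) / len(unscaled_rates)
    chi = target_mean_rate / mean_unscaled          # Eq. A.3
    scaled = [chi * k for k in unscaled_rates]      # Eq. A.4
    return chi, scaled


# Toy three-codon example with made-up unscaled times.
chi, scaled = scale_rates([0.5, 1.0, 2.0])
print(chi, sum(scaled) / len(scaled))   # the mean of the scaled rates is 3.9
```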

A.1.9 Ribosome profiling of yeast

Ribosome profiling of yeast S288C cells was performed following the protocol of Ingolia et al.67 with the following modifications: yeast cells were grown in yeast extract peptone dextrose, at 30 °C to an optical density (OD600) of 0.5. Cells were collected by fast filtration in the absence of antibiotics and immediately flash-frozen in liquid nitrogen. Frozen cells were mechanically lysed for 2 min at 30 Hz using a Retsch MM400 mixer mill and a lysis buffer composed of 20 mM Tris pH 8.0, 140 mM KCl, 6 mM MgCl2, 0.1% NP-40, 100 μg ml−1 cycloheximide, 200 μg ml−1 heparin, 1 mM PMSF, 20 μg ml−1 leupeptin, 20 μg ml−1 aprotinin, 1 mg ml−1 AEBSF, 1 μg ml−1 E-64, 40 μg ml−1 bestatin, 12.5 U DNase. Lysates were thawed and exposed parts of mRNAs were digested with 5 U/A260 RNaseI (Ambion) at 25 °C, 650 r.p.m. for 1 h. Digestion was stopped by adding 8 U/A260 SUPERase·In (Ambion) and the lysate was cleared of membranes, organelles and cell debris by centrifugation at 4 °C and 30,000g for 5 min. The supernatant was loaded on a 10–50% sucrose gradient (20 mM Tris pH 8.0, 140 mM KCl, 6 mM MgCl2, 100 μg ml−1 cycloheximide, 1x EDTA-free protease inhibitor tablets (Roche)) and monosome fractions were pooled. RNA was isolated from monosomes by hot-phenol extraction and directly precipitated using GlycoBlue as coprecipitant. The size-selection step after of the footprint was omitted. Dephosphorylated mRNA footprints (5 pmol) were linked to the 1 μg linker L1′171 by incubation with 200 U T4 RNA Ligase 2, truncated (NEB) at 37 °C for 2.5 h in buffer containing 20 mM Tris pH 7, 20% PEG MW 8000, 10% DMSO, 20 U SUPERase·In. Linked footprints were size-selected by gel electrophoresis. Reverse transcription was carried out with 200 U SuperScript III (Invitrogen), 20 U SUPERase·In, 10 nmol dNTP, 25 pmol Linker L1′L2′150, 100 nmol DTT in 20 μl of 1 × FSB buffer (Invitrogen). Circularization by incubation with CircLigase (Epicentre) was performed two times for 1 h each (a second aliquot of CircLigase was added after one hour) and the product was directly used for amplification by PCR. Deep sequencing was performed using Illumina HiSeq 2000 instrumentation.

A.1.10 Bioinformatic analysis of ribosome profiling data

The raw reads from the ribosome-protected fragments were trimmed of the 3′ custom adaptor 5′-CTGTAGGCACCATCAATTCGTATGCCGTCTTCTGCTTG-3′ using cutadapt172 (v1.1). The low quality reads were filtered using PRINSEQ173 (v0.20.4), and reads shorter than 20 nucleotides were discarded. The processed reads were first aligned to the ribosomal RNA sequences using Bowtie 2174 (v2.2.3). The unaligned reads were then aligned to the Saccharomyces cerevisiae assembly R64-1-1 (UCSC: sacCer3) using Tophat175 (v2.0.13) with up to two mismatches allowed. Gene annotations were obtained from Saccharomyces Genome Database (http://www.yeastgenome.org/) on 30 October 2014. For downstream analysis, only reads with length 27–32 nucleotides were considered, as they are more likely to represent the ribosome-protected fragments. The ribosome profiles of individual genes were obtained by quantifying the coverage at a gene position by the 5′ end of the reads. The reads that correspond to start and stop codons in the active site were not considered. Since the active site of translation is ∼15 nucleotides downstream of the 5′ end of the ribosome-protected fragment, the ribosome profiles of genes were calculated from four codons upstream of the start codon to six codons upstream of the stop codon. For pairwise comparison of ribosome profiles in the two replicate samples (Figure A.1D), only those genes were considered that had at least one read mapping to each codon position and no multiply aligned reads, with the first and last codons not considered. In all, 91 genes met these criteria.
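The profile construction and replicate comparison described above can be illustrated with a short sketch. The coordinate convention is an assumption made here for illustration: read positions are 5′-end nucleotide offsets relative to the first nucleotide of the start codon (negative values lie upstream), and the window runs from four codons upstream of the start codon to six codons upstream of the stop codon. The function names are hypothetical and do not correspond to any tool cited above.

```python
import math


def ribosome_profile(reads, stop_codon_index, min_len=27, max_len=32):
    """Count read 5' ends per codon within the analysis window.

    reads: iterable of (five_prime_offset_nt, read_length) tuples.
    stop_codon_index: codon index of the stop codon (start codon = 0).
    """
    first = -4                       # four codons upstream of the start codon
    last = stop_codon_index - 6      # six codons upstream of the stop codon
    profile = [0] * (last - first + 1)
    for offset, length in reads:
        if not (min_len <= length <= max_len):
            continue                 # keep only 27-32 nt fragments
        codon = offset // 3          # floor division also handles negative offsets
        if first <= codon <= last:
            profile[codon - first] += 1
    return profile


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Applying `ribosome_profile` to the same gene in each replicate and passing the two profiles to `pearson_r` gives the kind of per-gene correlation summarized in Figure A.1D.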

A.2 Supplemental discussion

A.2.1 Derivation of Eq. 2.2

From the assumptions A1, A2, and A3 it follows that 푃F,B(푖), the probability that the nascent chain segment of interest is folded at nascent chain length 푖, can be calculated as previously described46

\[ P_{\mathrm{F,B}}(i) = \sum_{j=1}^{i} \frac{k_{\mathrm{F},j}}{k_{\mathrm{A},j+1}} \prod_{k=j}^{i} \frac{k_{\mathrm{A},k+1}}{k_{\mathrm{A},k+1} + k_{\mathrm{F},k} + k_{\mathrm{U},k}}. \tag{Eq. A.5} \]

Equation A.5 calculates the probability 푃F,B(푖) that a protein segment will co-translationally fold as a function of the nascent chain length, 푖, given a collection of stochastically translating ribosomes that have initiated translation at the same time point. In Eq. A.5, 푖 is the number of residues in the nascent chain at a given time point during synthesis. The parameters in Eq. A.5 are the codon translation rate (푘A,푖) and the folding (푘F,푖) and unfolding (푘U,푖) rates of the nascent chain segment of interest at each nascent chain length. The summation and product operators in Eq. A.5 are calculated over the different possible nascent chain lengths from 1 to 푖. This equation has been shown to accurately predict the co-translational folding curves generated by coarse-grained molecular dynamics simulations of translation46.

From assumption A2 we have that 푃F,R(푡, 푡′), the time evolution of the probability of a released nascent chain segment of interest being folded, is described by the equation176

\[ P_{\mathrm{F,R}}(t, t') = \left[ P_{\mathrm{F,B}}(M) - \frac{k_{\mathrm{F}}}{k_{\mathrm{F}} + k_{\mathrm{U}}} \right] e^{-[k_{\mathrm{F}} + k_{\mathrm{U}}][t - t']} + \frac{k_{\mathrm{F}}}{k_{\mathrm{F}} + k_{\mathrm{U}}}. \tag{Eq. A.6} \]

The 푃F,B(푀) term in Eq. A.6 is the probability that the nascent chain segment of interest is folded at the last codon in the coding sequence (CDS) immediately before it is released from the ribosome, and is calculated using Eq. A.5.

The term 푓L,B(푖, 푡) in Eq. 2.1 is the fraction of labeled nascent chains of length 푖 at time 푡, and can be expressed as

\[ f_{\mathrm{L,B}}(i, t) = \frac{N_{\mathrm{L,B}}(i, t)}{N_{\mathrm{L,B}}(t) + N_{\mathrm{L,R}}(t)}, \tag{Eq. A.7} \]

where 푁L,B(푖, 푡) is the number of ribosome-bound, labeled nascent chains of length 푖 at time 푡 and 푁L,B(푡) is the number of bound, labeled nascent chains of any length at time 푡 and is equal to $\sum_{i=1}^{M} N_{\mathrm{L,B}}(i, t)$. 푁L,R(푡) is the number of labeled chains that have completed synthesis and have been released from the ribosome by time 푡, and is equal to $N_{\mathrm{L,R}}(t) = \sum_{t'}^{t} N_{\mathrm{L,R}}(t, t')$, where 푁L,R(푡, 푡′) is the number of labeled nascent chains released at time 푡′ from the ribosome.

The term 푓L,R(푡, 푡′) in Eq. 2.1 is the fraction of labeled nascent chains released from the ribosome at time 푡′, and can be written in a manner analogous to that of Eq. A.7:

\[ f_{\mathrm{L,R}}(t, t') = \frac{N_{\mathrm{L,R}}(t, t')}{N_{\mathrm{L,B}}(t) + N_{\mathrm{L,R}}(t)}. \tag{Eq. A.8} \]
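The ribosome-bound folding probability (the double sum and product over nascent chain lengths) and the post-release relaxation expression can be evaluated numerically as in the following minimal sketch. The function names and the use of dictionaries keyed by 1-based nascent chain length are choices made here for illustration; the codon translation rates must be supplied up to index 푖 + 1 because the expression refers to 푘A,푘+1.

```python
import math


def p_folded_bound(i, k_a, k_f, k_u):
    """Probability that the segment is folded at nascent chain length i.

    k_a, k_f, k_u: dicts keyed by 1-based chain length; k_a must include i + 1.
    """
    total = 0.0
    for j in range(1, i + 1):
        product = 1.0
        for k in range(j, i + 1):
            product *= k_a[k + 1] / (k_a[k + 1] + k_f[k] + k_u[k])
        total += (k_f[j] / k_a[j + 1]) * product
    return total


def p_folded_released(t, t_release, p_fb_m, k_f, k_u):
    """Folded probability at time t for a chain released at time t_release.

    p_fb_m: folded probability at the last codon, just before release.
    k_f, k_u: bulk folding and unfolding rates of the released segment.
    """
    p_eq = k_f / (k_f + k_u)     # long-time (equilibrium) folded probability
    decay = math.exp(-(k_f + k_u) * (t - t_release))
    return (p_fb_m - p_eq) * decay + p_eq
```

For a single codon position the first function reduces to 푘F,1 / (푘A,2 + 푘F,1 + 푘U,1), which is a convenient sanity check on any implementation.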


Inserting Eqs. A.6, A.7, and A.8 into Eq. 2.1 yields

\[ P_{\mathrm{F}}(t) = \sum_{i=1}^{M} P_{\mathrm{F,B}}(i) \frac{N_{\mathrm{L,B}}(i, t)}{N_{\mathrm{L,B}}(t) + N_{\mathrm{L,R}}(t)} + \sum_{t'=0}^{t} \frac{N_{\mathrm{L,R}}(t, t')}{N_{\mathrm{L,B}}(t) + N_{\mathrm{L,R}}(t)} \left( \left[ P_{\mathrm{F,B}}(M) - \frac{k_{\mathrm{F}}}{k_{\mathrm{F}} + k_{\mathrm{U}}} \right] e^{-[k_{\mathrm{F}} + k_{\mathrm{U}}][t - t']} + \frac{k_{\mathrm{F}}}{k_{\mathrm{F}} + k_{\mathrm{U}}} \right). \tag{Eq. A.9} \]

During the experiment, the number of labeled chains of a particular length can change with time. Therefore, one outstanding issue in using Eq. A.9 is how to keep track of labeled nascent chain segments of interest as a function of time since the start of the incorporation period. In other words, how do we populate the arrays 푁L,B(푖, 푡) and 푁L,R(푡, 푡′) for all values of 푖, 푡, and 푡′? Although assumption A1 requires that the average number of ribosomes at a given codon position is constant with time, it does not require that the number of labeled nascent chain segments at that codon position is constant with time. Below, we demonstrate how it is possible to keep track of labeled nascent chains in a closed-form solution that results in an expression for Eq. 2.1 in terms of 푘A,푖, 푘F,푖, and 푘U,푖.

We first note that under steady state conditions (assumption A1) the flux of ribosomes into and out of codon position 푖 is constant with time. Hence, the number of ribosomes transitioning from codon position 푖 − 1 to 푖 during the time interval 훿푡 (denoted 퐹푖) is equal to the number of ribosomes transitioning from 푖 to 푖 + 1. That is, 퐹푖 = 퐹푖+1 at all codon positions in the CDS, where 퐹푖 equals

\[ F_i = k_{\mathrm{A},i-1} N_{\mathrm{rib},i-1} \, \delta t. \tag{Eq. A.10} \]

Letting $\delta t = 1/k_{\mathrm{A,fastest}}$, where 푘A,fastest is the translation rate of the fastest translating codon position in the CDS, and letting 푖 − 1 be the fastest translating codon position within the CDS, then 푘A,푖−1 = 푘A,fastest and 푁rib,푖−1 = 푁rib,fastest, which is defined to be the number of ribosomes at the fastest translating codon position in the CDS. Substituting these relationships into Eq. A.10 yields

\[ F_i = N_{\mathrm{rib,fastest}}. \tag{Eq. A.11} \]

Thus, according to Eq. A.11, in a time interval equal to $1/k_{\mathrm{A,fastest}}$, the number of ribosomes that move to the next codon position equals 푁rib,fastest at all codon positions. The steady-state number of ribosomes can be solved for by equating 퐹푖 at codon position 푖 and 푖 + 1: 퐹푖 = 퐹푖+1, 푘A,푖−1푁rib,푖−1훿푡 = 푘A,푖푁rib,푖훿푡, and solving for 푁rib,푖

\[ N_{\mathrm{rib},i} = \frac{k_{\mathrm{A},i-1}}{k_{\mathrm{A},i}} N_{\mathrm{rib},i-1}. \tag{Eq. A.12} \]

Once more letting 푖 − 1 be the fastest translating codon position in the CDS, we find that Eq. A.12 equals

\[ N_{\mathrm{rib},i} = \frac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},i}} N_{\mathrm{rib,fastest}}. \tag{Eq. A.13} \]
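The steady-state relations above lend themselves to a quick numerical check: if the number of ribosomes at each codon is set by the ratio of the fastest codon translation rate to the local rate, then the flux per time step of length 1/푘A,fastest is the same at every codon position. The sketch below uses hypothetical function names and a made-up three-codon rate profile.

```python
def steady_state_ribosomes(k_a, n_fastest=1.0):
    """Steady-state ribosome number at each codon, given the number at the fastest codon."""
    k_fast = max(k_a)
    return [n_fastest * k_fast / k for k in k_a]


def flux_per_step(k_a, n_rib):
    """Ribosomes leaving each codon during one time step dt = 1 / k_fastest."""
    dt = 1.0 / max(k_a)
    return [k * n * dt for k, n in zip(k_a, n_rib)]


rates = [2.0, 4.0, 1.0]                    # toy codon translation rates (per s)
occupancy = steady_state_ribosomes(rates)  # slower codons hold more ribosomes
print(occupancy)
print(flux_per_step(rates, occupancy))     # identical flux at every codon
```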


Equation A.13 tells us that the number of ribosomes at codon 푖 is directly proportional to the number of ribosomes at the fastest translating codon, and that the proportionality constant is the ratio of codon translation rates.

Equations A.11 and A.13 dictate how to populate the arrays 푁L,B(푖, 푡) and 푁L,R(푡, 푡′) when modeling the pulse-chase experiment. Consider the following: at time 푡 = 0, the start of the incorporation period, there are no labeled nascent chains, because no radiolabeled amino acids have had the opportunity to be incorporated. According to Eq. A.11, at time 푡 = 0 + 훿푡, the number of radiolabeled nascent chains at codon 푖 will increase by a number equal to 푁rib,fastest, up to a maximum steady-state number of labeled nascent chains equal to $(k_{\mathrm{A,fastest}}/k_{\mathrm{A},i}) N_{\mathrm{rib,fastest}}$ (i.e., Eq. A.13). At time 푡 = 0 + 2훿푡, the number of radiolabeled nascent chains at codon 푖 will again increase by 푁rib,fastest, provided codon position 푖 has not already reached its steady-state value. The application of this procedure for filling the 푁L,B(푖, 푡) array with radiolabeled nascent chains at a given codon position will continue at each new time-interval increment until either $N_{\mathrm{L,B}}(i, t) = (k_{\mathrm{A,fastest}}/k_{\mathrm{A},i}) N_{\mathrm{rib,fastest}}$ at all 푖 or the incorporation period ends. Once all codon positions have a number of radiolabeled nascent chains equal to their steady-state value as defined by Eq. A.13, the number of labeled chains transitioning from one codon to the next remains equal to 푁rib,fastest during the incorporation period. Release of radiolabeled nascent chains during the incorporation period commences once the last codon position in the CDS has reached its steady-state value of labeled nascent chains, and, as indicated by Eq. A.11, the number of radiolabeled nascent chains released at each subsequent time point is equal to 푁rib,fastest. After the incorporation period (i.e., during the chase period), no new radiolabeled nascent chains are created.
In the time interval 훿푡, the number of radiolabeled nascent chains shifting from nascent chain length 푖 − 1 to 푖 will equal 푁rib,fastest, provided there are more than 푁rib,fastest radiolabeled nascent chains at codon 푖 − 1 at the start of the time interval. As time progresses during the chase, the number of ribosomes transitioning out of codon position 푖 − 1 will eventually result in there being no labeled nascent chains at codon position 푖 − 1, and the number of labeled nascent chains transitioning into codon 푖 will then be zero. Hence, time-step by time-step during the chase, and codon position by codon position, the number of radiolabeled chains goes to zero, and all of the radiolabeled nascent chains are eventually released from their ribosomes.

As a consequence of the addition and subtraction of radiolabeled nascent chains in units proportional to 푁rib,fastest, the terms 푓L,B(푖, 푡) and 푓L,R(푡, 푡′) are independent of the actual value of 푁rib,fastest, because during both the pulse and chase periods this quantity cancels out in the numerator and denominator of these terms. To illustrate this point, we consider the situation when, at a time 푡1 during the pulse, each codon position has just reached its steady-state number of radiolabeled nascent chains; i.e., $N_{\mathrm{L,B}}(i, t_1) = (k_{\mathrm{A,fastest}}/k_{\mathrm{A},i}) N_{\mathrm{rib,fastest}}$ for all 푖 and 푁L,R(푡1) = 0. In this case, Eq. A.7 is

\[ f_{\mathrm{L,B}}(i, t_1) = \frac{\dfrac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},i}} N_{\mathrm{rib,fastest}}}{\sum_{j=1}^{M} \dfrac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},j}} N_{\mathrm{rib,fastest}} + 0} = \frac{\dfrac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},i}}}{\sum_{j=1}^{M} \dfrac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},j}}}. \tag{Eq. A.14} \]

Note that in Eq. A.14 푁rib,fastest has cancelled out. At 푡2 = 푡1 + 훿푡, 푁rib,fastest labeled nascent chains are released from the ribosome, and therefore 푁L,R(푡2) = 푁rib,fastest and again 푁rib,fastest cancels out. In this case,

\[ f_{\mathrm{L,B}}(i, t_2) = \frac{\dfrac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},i}} N_{\mathrm{rib,fastest}}}{\sum_{j=1}^{M} \dfrac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},j}} N_{\mathrm{rib,fastest}} + N_{\mathrm{rib,fastest}}} = \frac{\dfrac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},i}}}{\sum_{j=1}^{M} \dfrac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},j}} + 1}. \tag{Eq. A.15} \]

Likewise for the released chain term

\[ f_{\mathrm{L,R}}(t, t_2) = \frac{N_{\mathrm{rib,fastest}}}{\sum_{i=1}^{M} \dfrac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},i}} N_{\mathrm{rib,fastest}} + N_{\mathrm{rib,fastest}}} = \frac{1}{\sum_{i=1}^{M} \dfrac{k_{\mathrm{A,fastest}}}{k_{\mathrm{A},i}} + 1}. \tag{Eq. A.16} \]

These results (Eqs. A.14, A.15, and A.16) demonstrate that provided we calculate 푃F(푡) at 훿푡 time intervals equal to $1/k_{\mathrm{A,fastest}}$, 푃F(푡) does not depend on the actual value of 푁rib,fastest, but only on the codon translation rates across the CDS. Therefore, for convenience, we set 푁rib,fastest = 1 when making predictions using this method, and we emphasize that this choice does not affect our predictions – the results are the same regardless of the true value of 푁rib,fastest.

The requirement for a discrete time interval $\delta t = 1/k_{\mathrm{A,fastest}}$ to accurately calculate 푓L,B and 푓L,R indicates that 푃F(푡) can only be accurately calculated at integer multiples of 훿푡. That is, 푡 = 푡(푠) = 푠훿푡 and 푡′ = 푡′(푛) = 푛훿푡, where 푠 and 푛 are integers such that 푠 ≥ 푛 ≥ 0. Thus, Eq. 2.1 can be rewritten to indicate this discrete time-point dependence as

\[ P_{\mathrm{F}}(t(s)) = \sum_{i=1}^{M} P_{\mathrm{F,B}}(i) f_{\mathrm{L,B}}(i, t(s)) + \sum_{n=0}^{s} P_{\mathrm{F,R}}(t(s), t'(n)) f_{\mathrm{L,R}}(t(s), t'(n)). \tag{Eq. A.17} \]

Substituting Eqs. A.6, A.7, and A.8 into Eq. A.17 yields

\[ P_{\mathrm{F}}(t(s)) = \sum_{i=1}^{M} P_{\mathrm{F,B}}(i) \frac{N_{\mathrm{L,B}}(i, t(s))}{\sum_{j=1}^{M} N_{\mathrm{L,B}}(j, t(s)) + \sum_{n=0}^{s} N_{\mathrm{L,R}}(t(s), t'(n))} + \sum_{n=0}^{s} \frac{N_{\mathrm{L,R}}(t(s), t'(n))}{\sum_{i=1}^{M} N_{\mathrm{L,B}}(i, t(s)) + \sum_{l=0}^{s} N_{\mathrm{L,R}}(t(s), t'(l))} \left( \left[ P_{\mathrm{F,B}}(M) - \frac{k_{\mathrm{F}}}{k_{\mathrm{F}} + k_{\mathrm{U}}} \right] e^{-[k_{\mathrm{F}} + k_{\mathrm{U}}][t(s) - t'(n)]} + \frac{k_{\mathrm{F}}}{k_{\mathrm{F}} + k_{\mathrm{U}}} \right), \tag{Eq. A.18} \]

and factoring out the denominator in Eq. A.18 yields

\[ P_{\mathrm{F}}(t(s)) = \frac{1}{\sum_{i=1}^{M} N_{\mathrm{L,B}}(i, t(s)) + \sum_{n=0}^{s} N_{\mathrm{L,R}}(t(s), t'(n))} \left[ \sum_{i=1}^{M} N_{\mathrm{L,B}}(i, t(s)) P_{\mathrm{F,B}}(i) + \sum_{n=0}^{s} N_{\mathrm{L,R}}(t(s), t'(n)) \left( \left[ P_{\mathrm{F,B}}(M) - \frac{k_{\mathrm{F}}}{k_{\mathrm{F}} + k_{\mathrm{U}}} \right] e^{-[k_{\mathrm{F}} + k_{\mathrm{U}}][t(s) - t'(n)]} + \frac{k_{\mathrm{F}}}{k_{\mathrm{F}} + k_{\mathrm{U}}} \right) \right], \tag{Eq. A.19} \]

which is identical to Eq. 2.2 and expresses 푃F(푡(푠)) purely as a function of the underlying rates of folding, unfolding, and codon translation.
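The final expression of this derivation, which the text identifies with Eq. 2.2, is a weighted average over bound and released labeled chains. The sketch below evaluates it for given chain counts; it does not reproduce the pulse-chase bookkeeping that fills those counts, and the function name, argument layout, and the convention of returning zero when no labeled chains exist are assumptions made here for illustration.

```python
import math


def p_folded(t, n_bound, p_fb, released, k_f, k_u):
    """Folded fraction of labeled chains at time t.

    n_bound:  labeled ribosome-bound chain counts, n_bound[i-1] for length i = 1..M.
    p_fb:     folded probability of bound chains, p_fb[i-1] for length i = 1..M.
    released: list of (release_time, count) pairs with release_time <= t.
    k_f, k_u: bulk folding and unfolding rates after release.
    """
    total = sum(n_bound) + sum(count for _, count in released)
    if total == 0:
        return 0.0                      # assumed convention: no labeled chains yet
    p_eq = k_f / (k_f + k_u)
    p_fb_last = p_fb[-1]                # folded probability at the last codon
    bound_term = sum(n * p for n, p in zip(n_bound, p_fb))
    released_term = sum(
        count * ((p_fb_last - p_eq) * math.exp(-(k_f + k_u) * (t - t_rel)) + p_eq)
        for t_rel, count in released
    )
    return (bound_term + released_term) / total


# Multiplying every chain count by the same factor leaves the result unchanged,
# which mirrors the cancellation of the ribosome number argued in the text.
base = p_folded(3.0, [1, 1], [0.0, 0.5], [(1.5, 1)], k_f=1.0, k_u=1.0)
scaled = p_folded(3.0, [7, 7], [0.0, 0.5], [(1.5, 7)], k_f=1.0, k_u=1.0)
print(base, scaled)
```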

A.2.2 Predictions are robust to small deviations from steady state

Assumption A1 greatly simplifies our model’s calculations by assuming steady-state translation kinetics occur throughout the pulse-chase experiment. However, it also means that our model may give misleading or erroneous predictions when this assumption does not hold for the system being modeled. In order to assess how well our model can approximate systems that experience non-steady-state translation kinetics, we simulated the co-translational protein folding of ∆C SFVP protein under non-steady-state conditions using the Gillespie Algorithm and compared the resulting co-translational folding curves with the predictions made by Eq. 2.2. The non-steady-state condition is created in the Gillespie simulation by introducing a sinusoidally-varying time-dependent initiation rate $k_{\mathrm{int}}(t) = k_{\mathrm{int}}(0)\left[1 + A\sin\left(2\pi t/\tau\right)\right]$; in this equation, 푘int(0) is the initiation rate at time zero (see Methods), 퐴 is the amplitude of the sine function (see Methods and Figure A.5), 푡 is the experimental time, and τ is the duration of the pulse period. Small values of |퐴| correspond to small deviations from steady-state, and large |퐴| values produce more significant non-steady-state behavior. We performed stochastic simulations for ∆C SFVP using values of 퐴 between 0 and 1, and found that the 푃F(푡) curve predicted by Eq. 2.2 remains within the statistical uncertainty of the simulated curve (Figure A.5, bottom left) for small values of A (e.g., A = 0.4). However, Eq. 2.2 fails to accurately predict the 푃F(푡) curve when larger deviations (e.g., A = 1.0) from steady-state translation kinetics are introduced (Figure A.5, bottom right). These results support the idea that our model can be applied even when there are small deviations from steady state in the real system.

A.2.3 Predictions are robust to change in dwell-time distribution

Single-molecule, laser optical tweezer in vitro experiments on translating ribosomes show a ribosome dwell-time distribution best fit by the difference of two exponential terms of the form $P(\tau) = \frac{k_1 k_2}{k_2 - k_1}\left[\exp(-k_1\tau) - \exp(-k_2\tau)\right]$, with rates 푘1 = 0.7 and 푘2 = 3.4 s−1 [62]. Eq. 2.2 assumes (Assumption A3) that ribosomes dwell at a codon with a single-exponential distribution. It is not analytically possible, to our knowledge, to solve the reaction scheme shown in Figure 2.2 for the dwell-time distribution 푃(휏). Therefore, to numerically test if using the distribution 푃(휏) changes the resulting folding curve for ∆C SFVP we ran stochastic simulations using the Gillespie Algorithm12 on a reaction network representing the SFVP co-translational folding process on ribosomes that exhibit the experimentally measured 푃(휏) distribution, scaled to have an average translation rate of 3.9 AA per s. The scaled 푘1 and 푘2 values are 4.7363 and 22.0649 s−1, respectively. In 20 virtual experiments, 6,386 individual ribosome trajectories were simulated on this reaction network. We find that the average co-translational folding curve across these virtual experiments yields the same results as the predictions from Eq. 2.2 (Figure A.6). Therefore, the predictions for SFVP are robust to changes in this dwell time distribution, and A3 is a reasonable approximation.
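The two-step dwell-time model can be checked numerically: the sum of two independent exponential waiting times with rates 푘1 and 푘2 follows the difference-of-exponentials density quoted above, and its mean dwell time is 1/푘1 + 1/푘2. The sketch below uses the scaled rate values from the text; the function names are illustrative and the sampling is not the Gillespie simulation itself.

```python
import math
import random

K1 = 4.7363    # scaled rate of the first step (per s), from the text
K2 = 22.0649   # scaled rate of the second step (per s), from the text


def dwell_pdf(tau, k1=K1, k2=K2):
    """Dwell-time density P(tau) for two sequential exponential steps."""
    return k1 * k2 / (k2 - k1) * (math.exp(-k1 * tau) - math.exp(-k2 * tau))


def sample_dwell(rng, k1=K1, k2=K2):
    """One dwell time drawn as the sum of two exponential waiting times."""
    return rng.expovariate(k1) + rng.expovariate(k2)


mean_dwell = 1.0 / K1 + 1.0 / K2   # mean dwell time in seconds
mean_rate = 1.0 / mean_dwell       # effective codon translation rate

rng = random.Random(0)
empirical = sum(sample_dwell(rng) for _ in range(20_000)) / 20_000
print(f"analytic mean rate: {mean_rate:.2f} AA per s")
print(f"empirical mean dwell: {empirical:.3f} s (analytic {mean_dwell:.3f} s)")
```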


A.3 Supplemental figures and tables

Figure A.1. Ribo-Seq data exhibits stationary ribosome profile distributions between biological replicates of yeast. Data for replicates 1 and 2 are shown in blue and red, respectively. (A) Fragment size distribution of reads mapped to the CDS regions and 50 nt upstream of the first codon. (B) Meta-gene analysis: Normalized read count for fragment size 28 in a 100 nt region of the CDS from 18 nt upstream of the start codon to 82 nt within the CDS region for 6,665 genes in yeast. These data demonstrate strong 3 nt periodicity for ribosome footprints. (C) Distribution of reads of fragment size 28 whose 5’ end has aligned to reading frame 0, 1 or 2. (D) Pairwise correlation of ribosome profiles for individual genes from the two biological replicates which have at least 1 read at each codon position and contain no multiply-aligned reads. Boxplot shows the distribution of Pearson correlation coefficient values for the 91 genes that meet these criteria.


Figure A.2. Ribo-Seq data shows stationary ribosome profile distributions. Data for replicates 1 and 2 are shown in blue and red, respectively. Ribosome profiles across both replicates are shown for genes YKL056C (top) and YBR011C (middle and bottom, split for clarity of codon-position axis), which have 푟 values of 0.96 and 0.98, respectively.


Figure A.3. Linear least squares analysis of the appearance of full-length ΔC SFVP since the start of the chase period. A linear line of best fit with the equation 푦 = 0.0025푡 + 0.26 (R2 = 0.95, p = 0.001) was calculated for the experimental values1 (red triangles) for the time evolution of full-length Δile SFVP.


Figure A.4. An illustration of using Eq. 2.2 to compute 푃F(푡) using a tractable example. (A) Consider a hypothetical peptide consisting of two codon positions for which 푃F(푡) will be calculated. We assume that while both codon positions can be radiolabeled, only the second codon position will be experimentally monitored, i.e. the second codon will correspond to the “segment of interest.” This is analogous to SFVP where only the folding status of C protein was experimentally monitored. The values of 푃F,B(푖), 푘A,푖, and 푃F,R(푡, 푡′) are listed in the red boxes. (B) The pulse-chase experiment is displayed in schematic form. At 푡 = 0 ∙ 훿푡 = 0 푠, there are no labeled nascent chains in the system (see key in figure) and the pulse is initiated. At 푡 = 1 ∙ 훿푡 = 1.5 푠 we add one labeled nascent chain to each codon position. Only those nascent chains which are labeled at codon position 푖 = 2 contribute to 푃F(푡). We likewise add one labeled nascent chain at each codon position until the end of the incorporation period, defined here to be when 푡 = 3 ∙ 훿푡 = 4.5 푠. At the final incorporation period time-point, one labeled nascent chain is released from codon position 푖 = 2. For chase time points, from 푡 = 4 ∙ 훿푡 = 6.0 푠 to 푡 = 7 ∙ 훿푡 = 10.5 푠, the number of ribosomes transitioning into and out of each codon position remains equal to 푁rib,fastest = 1, and the labeled nascent chains which were added during the pulse period are tracked over time. At 푡 = 6 ∙ 훿푡 = 9.0 푠 there are no longer any labeled nascent chains bound to a ribosome, and the value of 푃F(푡) has contributions from nascent chains that have been released from the ribosome and are labeled at codon position 푖 = 2. (C) The application of Eq. 2.2 to the simple example outlined in (A) and (B). The quantities 푃F,B(1) ∗ 푁L,B(1, 푡), 푃F,R(푡, 0.0) ∗ 푁L,R(푡, 0.0) and 푃F,R(푡, 1.5) ∗ 푁L,R(푡, 1.5) are omitted for compactness after one explicit use because they are equal to zero for all 푡 (see panel A). Likewise, the quantity 푃F,B(2) ∗ 푁L,B(2, 푡) is omitted in equations 푃F(9.0) and 푃F(10.5), as at these time points there are no longer any bound labeled nascent chains.


Figure A.5. Comparison of Gillespie Algorithm simulations of non-steady-state translation kinetics to predictions made with Eq. 2.2. Top panel: Plot of 푘푖푛푡(푡) during the pulse period of the Gillespie Algorithm simulations. Bottom left panel: Gillespie Algorithm simulations using the sinusoidally varying 푘푖푛푡 with an amplitude, A, of 0.4 (green squares with experimental error bars) are in agreement with predictions made with Eq. 2.2 (red X’s). Bottom right panel: When A=1.0 is used (orange squares with experimental error bars), Eq. 2.2 fails to predict (red circles) the same co-translational folding curve as the Gillespie Algorithm simulations at time 푡 ≈ 45 푠.


Figure A.6. A double-exponential ribosome dwell-time distribution does not alter the predicted co-translational folding curve of ∆C protein. The co-translational folding curves obtained by using the Gillespie algorithm with a single- or double-exponential ribosome dwell-time distribution at each codon position show excellent agreement with the predictions made by Eq. 2.2. The same mean translation rate of 3.9 AA per s is used in both cases.


Figure A.7. Sensitivity analysis of co-translational folding curves predicted with Eq. A.5 for FRB and HA1 to changes in the parameters $k_{\mathrm{F}}$, $k_{\mathrm{U}}$, and $k_{\mathrm{A}}$. Left column: Co-translational folding curves calculated with various values of $k_{\mathrm{F}}$ are displayed. Middle column: Co-translational folding curves calculated with various values of $k_{\mathrm{U}}$ are displayed. The various plots for the protein HA1 in the middle column are so similar as to be indistinguishable. Right column: Co-translational folding curves calculated with various values of $k_{\mathrm{A}}$ are displayed.


Figure A.8. Sensitivity analysis of co-translational folding curves predicted with Eq. 2.2 for the SFVP, DHOM, DPP3, SBA1, and EF2 wild-type proteins to changes in the parameters $k_{\mathrm{F}}$, $k_{\mathrm{U}}$, and $k_{\mathrm{A}}$. Left column: Co-translational folding curves calculated with various values of $k_{\mathrm{F}}$ are displayed. Middle column: Co-translational folding curves calculated with various values of $k_{\mathrm{U}}$ are displayed. The various plots for the proteins DHOM, DPP3, SBA1, and EF2 in the middle column are so similar as to be indistinguishable. Right column: Co-translational folding curves calculated with various values of $k_{\mathrm{A}}$ are displayed. In the case of the yeast proteins DHOM, DPP3, SBA1, and EF2, each individual codon translation rate, $k$, as predicted by the Fluitt-Viljoen model, was multiplied by the indicated constant.


Figure A.9. The various estimates of codon translation rates do not correlate with each other. Each set of sense codon translation rates was obtained for yeast and then scaled to reproduce the average rate of 3.9 AA per s (see Methods and Table A.1) across the ∆C SFVP transcript. Values of $R^2$ for each pair of codon translation rate estimates are shown. Units on all axes are AA per s.


Figure A.10. The six slowest-translating codons predicted by the Fluitt-Viljoen model cause the greatest deviation between the predicted and experimental values. Co-translational folding curves predicted with Eq. 2.2 using the full set of Fluitt-Viljoen translation rates (red circles), Slow-Set (green triangles), and Fast-Set (purple diamonds) are displayed. See Results section for a definition of the Slow-Set and Fast-Set.


Table A.1. Translation rate profiles used in calculating pulse-chase curves in Figure 2.6

| Codon | Fluitt-Viljoen12 Yeast (AA per s) | Fluitt-Viljoen12 CHO (AA per s) | Stadler and Fire69 Yeast Occupancy | Stadler and Fire69 CHO (AA per s) | Gardin et al.68 Yeast RRT | Gardin et al.68 CHO (AA per s) | Tuller and Dana70 Yeast NFC | Tuller and Dana70 CHO (AA per s) | Tuller and Dana70 C. elegans NFC | Tuller and Dana70 CHO (AA per s) |
|---|---|---|---|---|---|---|---|---|---|---|
| AAA | 4.651 | 2.539 | 1.000 | 3.415 | 0.880 | 4.598 | 1.162 | 4.321 | 0.798 | 2.849 |
| AAC | 8.850 | 4.831 | 0.394 | 8.675 | 0.760 | 5.324 | 0.944 | 5.942 | 0.956 | 3.415 |
| AAG | 11.905 | 6.499 | 1.000 | 3.415 | 0.740 | 5.468 | 1.085 | 8.946 | 0.976 | 3.487 |
| AAU | 6.135 | 3.349 | 0.812 | 4.207 | 0.880 | 4.598 | 1.000 | 2.608 | 1.091 | 3.897 |
| ACA | 3.257 | 1.778 | 1.000 | 3.415 | 1.350 | 2.997 | 0.821 | 2.701 | 0.765 | 2.733 |
| ACC | 10.753 | 5.870 | 1.000 | 3.415 | 0.700 | 5.780 | 0.823 | 4.277 | 0.877 | 3.134 |
| ACG | 0.794 | 0.433 | 1.000 | 3.415 | 1.120 | 3.613 | 0.948 | 1.404 | 0.980 | 3.501 |
| ACU | 10.989 | 5.999 | 1.000 | 3.415 | 0.780 | 5.187 | 0.831 | 5.942 | 0.920 | 3.288 |
| AGA | 11.111 | 6.066 | 1.000 | 3.415 | 1.010 | 4.006 | 1.074 | 6.481 | 1.143 | 4.083 |
| AGC | 4.082 | 2.228 | 0.565 | 6.046 | 1.090 | 3.712 | 1.037 | 1.080 | 0.714 | 2.552 |
| AGG | 0.794 | 0.433 | 1.000 | 3.415 | 1.590 | 2.545 | 1.097 | 2.614 | 1.119 | 3.998 |
| AGU | 3.003 | 1.639 | 1.042 | 3.278 | 1.100 | 3.678 | 1.014 | 0.474 | 0.847 | 3.026 |
| AUA | 1.431 | 0.781 | 1.000 | 3.415 | 1.570 | 2.577 | 0.981 | 1.081 | 1.171 | 4.184 |
| AUC | 12.048 | 6.578 | 1.000 | 3.415 | 0.810 | 4.995 | 0.972 | 5.595 | 0.960 | 3.431 |
| AUG | 4.032 | 2.201 | 1.000 | 3.415 | 0.920 | 4.398 | 0.944 | 5.942 | 1.167 | 4.168 |
| AUU | 12.987 | 7.090 | 1.000 | 3.415 | 0.920 | 4.398 | 1.028 | 7.258 | 1.135 | 4.056 |
| CAA | 8.850 | 4.831 | 1.000 | 3.415 | 0.870 | 4.651 | 0.743 | 4.861 | 0.697 | 2.489 |
| CAC | 9.524 | 5.200 | 0.578 | 5.905 | 1.080 | 3.747 | 0.631 | 4.321 | 0.842 | 3.007 |
| CAG | 1.241 | 0.677 | 1.000 | 3.415 | 1.150 | 3.518 | 0.837 | 2.096 | 0.723 | 2.584 |
| CAU | 7.634 | 4.168 | 1.489 | 2.294 | 0.930 | 4.351 | 0.673 | 1.897 | 0.961 | 3.431 |
| CCA | 11.765 | 6.423 | 1.000 | 3.415 | 1.380 | 2.932 | 0.944 | 5.401 | 0.917 | 3.275 |
| CCC | 2.375 | 1.297 | 1.000 | 3.415 | 1.710 | 2.366 | 0.806 | 0.778 | 0.723 | 2.584 |
| CCG | 16.949 | 9.253 | 1.000 | 3.415 | 1.310 | 3.089 | 1.051 | 1.728 | 0.929 | 3.317 |
| CCU | 2.538 | 1.386 | 1.000 | 3.415 | 1.270 | 3.186 | 0.855 | 1.080 | 0.744 | 2.658 |
| CGA | 5.682 | 3.102 | 1.000 | 3.415 | 1.450 | 2.791 | 0.889 | 3.241 | 0.940 | 3.360 |
| CGC | 8.850 | 4.831 | 1.000 | 3.415 | 1.450 | 2.791 | 0.829 | 2.722 | 0.908 | 3.245 |
| CGG | 1.848 | 1.009 | 1.000 | 3.415 | 1.440 | 2.810 | 0.843 | 0.540 | 0.869 | 3.104 |
| CGU | 11.494 | 6.275 | 1.000 | 3.415 | 0.870 | 4.651 | 0.750 | 3.780 | 1.024 | 3.657 |
| CUA | 3.268 | 1.784 | 1.000 | 3.415 | 1.250 | 3.237 | 0.967 | 1.620 | 0.744 | 2.658 |
| CUC | 1.608 | 0.878 | 1.000 | 3.415 | 1.890 | 2.141 | 1.046 | 0.540 | 0.640 | 2.287 |
| CUG | 6.369 | 3.477 | 1.000 | 3.415 | 0.920 | 4.398 | 1.046 | 0.519 | 0.845 | 3.019 |
| CUU | 1.553 | 0.848 | 1.000 | 3.415 | 1.240 | 3.263 | 0.977 | 0.237 | 0.705 | 2.519 |
| GAA | 11.236 | 6.134 | 1.000 | 3.415 | 1.040 | 3.891 | 1.192 | 8.101 | 1.171 | 4.184 |
| GAC | 10.753 | 5.870 | 0.526 | 6.489 | 0.850 | 4.760 | 0.967 | 8.644 | 1.198 | 4.278 |
| GAG | 1.623 | 0.886 | 1.000 | 3.415 | 1.250 | 3.237 | 1.200 | 3.673 | 1.411 | 5.041 |
| GAU | 9.346 | 5.102 | 0.811 | 4.210 | 0.760 | 5.324 | 1.042 | 3.794 | 1.333 | 4.763 |
| GCA | 4.149 | 2.265 | 1.000 | 3.415 | 1.280 | 3.161 | 0.883 | 3.241 | 0.837 | 2.989 |
| GCC | 7.407 | 4.044 | 1.000 | 3.415 | 0.860 | 4.705 | 0.782 | 4.277 | 0.774 | 2.764 |
| GCG | 5.780 | 3.156 | 1.000 | 3.415 | 0.990 | 4.087 | 0.785 | 1.037 | 0.777 | 2.775 |
| GCU | 10.753 | 5.870 | 1.000 | 3.415 | 0.810 | 4.995 | 0.845 | 5.942 | 0.880 | 3.142 |
| GGA | 2.119 | 1.157 | 1.000 | 3.415 | 1.560 | 2.594 | 1.107 | 1.620 | 1.293 | 4.617 |
| GGC | 14.925 | 8.149 | 0.657 | 5.194 | 1.220 | 3.317 | 0.949 | 8.644 | 0.628 | 2.245 |
| GGG | 2.037 | 1.112 | 1.000 | 3.415 | 1.610 | 2.513 | 1.102 | 1.599 | 1.119 | 3.998 |
| GGU | 12.195 | 6.658 | 1.187 | 2.878 | 0.930 | 4.351 | 0.977 | 3.794 | 0.721 | 2.574 |
| GUA | 1.477 | 0.806 | 1.000 | 3.415 | 1.310 | 3.089 | 0.981 | 1.621 | 0.869 | 3.104 |
| GUC | 9.524 | 5.200 | 1.000 | 3.415 | 0.750 | 5.395 | 0.893 | 5.444 | 0.798 | 2.849 |
| GUG | 2.242 | 1.224 | 1.000 | 3.415 | 1.520 | 2.662 | 1.042 | 1.599 | 0.874 | 3.122 |
| GUU | 14.925 | 8.149 | 1.000 | 3.415 | 0.750 | 5.395 | 0.972 | 7.561 | 0.944 | 3.373 |
| UAC | 9.524 | 5.200 | 0.914 | 3.737 | 1.250 | 3.237 | 1.088 | 4.321 | 1.083 | 3.870 |
| UAU | 10.753 | 5.870 | 2.342 | 1.458 | 1.250 | 3.237 | 1.084 | 1.897 | 1.340 | 4.787 |
| UCA | 3.413 | 1.863 | 1.000 | 3.415 | 1.260 | 3.211 | 0.949 | 2.161 | 0.845 | 3.019 |
| UCC | 9.091 | 4.963 | 1.000 | 3.415 | 0.990 | 4.087 | 0.935 | 4.277 | 0.783 | 2.795 |
| UCG | 0.883 | 0.482 | 1.000 | 3.415 | 1.430 | 2.830 | 1.070 | 1.231 | 0.801 | 2.861 |
| UCU | 17.544 | 9.578 | 1.000 | 3.415 | 0.980 | 4.129 | 0.991 | 5.942 | 0.756 | 2.701 |
| UGC | 3.876 | 2.116 | 0.486 | 7.027 | 1.230 | 3.290 | 0.833 | 2.160 | 1.036 | 3.700 |
| UGG | 7.143 | 3.900 | 1.000 | 3.415 | 1.530 | 2.645 | 0.907 | 3.413 | 0.929 | 3.317 |
| UGU | 4.762 | 2.600 | 1.391 | 2.455 | 0.810 | 4.995 | 0.795 | 0.948 | 1.131 | 4.040 |
| UUA | 6.667 | 3.640 | 1.000 | 3.415 | 0.990 | 4.087 | 1.033 | 3.780 | 1.088 | 3.885 |
| UUC | 8.130 | 4.439 | 0.708 | 4.826 | 1.000 | 4.046 | 0.977 | 5.942 | 1.000 | 3.572 |
| UUG | 9.091 | 4.963 | 1.000 | 3.415 | 0.920 | 4.398 | 1.222 | 6.610 | 1.024 | 3.657 |
| UUU | 8.333 | 4.550 | 2.022 | 1.689 | 1.050 | 3.854 | 0.963 | 2.608 | 1.068 | 3.815 |
| UAA | 43.478 | 23.737 | N/A | 3.900 | N/A | 3.900 | N/A | 3.900 | N/A | 3.900 |
| UAG | 25.000 | 13.649 | N/A | 3.900 | N/A | 3.900 | N/A | 3.900 | N/A | 3.900 |
| UGA | 40.000 | 21.838 | N/A | 3.900 | N/A | 3.900 | N/A | 3.900 | N/A | 3.900 |


Table A.2. Summary of pulse-chase error bars from literature sources

| Figure number in original publication | Data point (time) | Standard Deviation |
|---|---|---|
| Figure 4A[163]29 | 0.5 h | 0.074 |
| | 1 h (top) | 0.130 |
| | 1 h (bottom) | 0.112 |
| Figure 4B[163] | 0.5 h (top) | 0.215 |
| | 0.5 h (bottom) | 0.138 |
| | 1 h | 0.117 |
| | 3 h | 0.034 |
| Figure 4C[163] | 0.5 h (top) | 0.029 |
| | 0.5 h (bottom) | 0.073 |
| | 1 h | 0.038 |
| | 2 h | 0.137 |
| | 3 h (top) | 0.093 |
| | 3 h (bottom) | 0.041 |
| Figure 4D[163] | 1 h | 0.150 |
| | 2 h (top) | 0.052 |
| | 2 h (bottom) | 0.046 |
| | 3 h (top) | 0.060 |
| | 3 h (bottom) | 0.078 |
| Figure 6[164] 30 (left panel) | 20 min (left) | 0.402 |
| | 20 min (middle) | 0.278 |
| | 8 min (right) | 0.279 |
| | 15 min (right) | 0.296 |
| | 15 min (right) | 0.279 |
| | 24 min | 0.319 |
| Figure 3B[165] | 1 h (grey) | 0.075 |
| | 1 h (black) | 0.130 |
| | 3 h (grey) | 0.091 |
| | 3 h (black) | 0.130 |
| | 7 h (black) | 0.124 |
| Figure 3C[165] | 1 h (top) | 0.204 |
| | 1 h (bottom) | 0.200 |
| | 3 h (grey) | 0.249 |
| | 3 h (black) | 0.318 |


Appendix B

CHAPTER 3 METHODS AND SUPPORTING INFORMATION

This chapter is reproduced with permission from the article “Structural origins of FRET-observed nascent chain compaction on the ribosome” by Daniel A. Nissley and Edward P. O’Brien, The Journal of Physical Chemistry B, 2018, 122(43), 9927-9937. Copyright 2018 American Chemical Society.

B.1 Methods

B.1.1 Construction of HemK N-terminal domain folding model

A structure-based Go̅ model of the HemK N-terminal domain (residues 2–73) was constructed from Protein Data Bank identification (PDB ID) 1T43 in a manner previously described.50,177 Each amino acid was represented as a single spherical interaction site with its center of mass at the location of the Cα atom in the crystal structure. Transferable bond, angle, and dihedral terms were used. Native contacts were defined based on the PDB ID: 1T43 crystal structure. The ETEN potential178, which introduces into the Lennard-Jones function a small energy barrier representing the desolvation penalty for interacting amino acids, along with more curvature in the energetic minimum region, was used for all simulations. This potential function increases the cooperativity of folding transitions for some proteins178. The stability of this HemK NTD Go̅ model was tuned with replica exchange simulations in the molecular dynamics package CHARMM179 to reproduce the experimental value of the folded state stability at 298 K within error (experiment: −4.66 ± 0.10 kcal/mol, simulation: −4.58 ± 0.11 kcal/mol) when Lennard-Jones well depths, initially set by the Betancourt-Thirumalai statistical potential180, were globally scaled by a factor of 2.2. A total of 50 000 exchanges were attempted between replicas; the first 5000 were discarded, and the weighted histogram analysis method181 was used for analysis. The root-mean-square deviation (RMSD) from the native state was calculated for the five helices of 1T43 as identified by STRIDE.182 The HemK NTD Go̅ model was defined as folded when the RMSD was less than 7 Å in comparison to the native state. This threshold value is the RMSD at which the cumulative probability of RMSD equals 0.5 at the Go̅ model’s melting temperature.

B.1.2 Construction of dye-modified HemK model

Explicit coarse-grained representations of the FRET dyes BodipyFL (BOF, Life Technologies D6140) and Bodipy 576/589 (BOP, Life Technologies D2225) and their linkers were built in Gaussian/g09d01, and their geometries were optimized with B3LYP/6-31G.82,183,184 These structures include all atoms of the dyes and linkers with the Cα position represented by a −CH3 group. The coordinates of the optimized structures were used to generate coarse-grained representations at a level similar to that used previously for FRET dyes.95 The approximate locations of the coarse-grain centers are displayed in Figure 3.1B. Each of the rings was reduced to a single interaction site at the centroid of the ring’s heavy atoms. Heavy atoms shared by two rings were included in the calculation of both rings’ centroids. Linker interaction sites were placed at bond midpoints. The total mass of the dye, linker, and Cα beads was redistributed evenly between all of the beads constituting each dye-modified amino acid representation. BOF-Met and BOP-Lys beads were thus assigned masses of 67.7 AMU and 48.8 AMU, respectively, such that the total mass of each dye-modified amino acid was conserved. The parameters chosen for BOF-Met and BOP-Lys are provided in Tables B.1–B.5. Bond lengths were extracted by measuring the distances between the coarse-grain beads in the geometry-optimized structures. Force constants of 50.0 kcal/(mol × Å²) were used for all dye and linker bonds. Bond angles were extracted from the B3LYP/6-31G-optimized structures, and all force constants were taken as 30 kcal/(mol × radian²). A double-well angle potential185 was used for all angles in the linker. Bond angles formed by adjacent nascent chain beads, the dye-modified Cα beads, and the first bead of the linker (e.g., the L1-A1-A2 bond angle for BOF-Met) were assigned the average Cα-Cα-Cβ bond angle from an ALA-ALA-ALA tripeptide constructed in CHARMM in a right-handed helical conformation. Gly-Gly dihedral terms were used for the linkers. Single-well dihedral potentials were used for the dye dihedrals, with the minimum energy position located at the value of the dihedral calculated from the B3LYP/6-31G-optimized structures. The rmin/2 values for the linker beads were taken from Merchant et al. 2007.95 The rmin/2 of the ribose interaction site was used for each of the five-membered rings, while the rmin/2 used for six-membered rings by Merchant et al. 2007 was used for the six-membered rings here.
The dye and linker representations interact in a purely repulsive manner with one another, the rest of the protein representation, and the ribosome representation. As the PDB ID: 1T43 structure does not contain an N-terminal Met residue, the BOF-Met coarse-grain representation was appended to the N-terminus of the HemK NTD model. A structured linker, consisting of the next 39 residues in HemK’s sequence, was appended to the C-terminus of the N-terminal domain model as was done in the original experiments. This coarse-grain structure was used for all continuous-synthesis trajectories and is referred to in the following Methods sections as the HemK NTD Go̅-dye model.
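The coarse-graining rules described in this section (ring beads at heavy-atom centroids, linker beads at bond midpoints, and even redistribution of the total mass) can be sketched as follows. This is an illustrative sketch only: the function names, the toy planar ring, and the 300.0 AMU, five-bead example are hypothetical and are not the BOF-Met or BOP-Lys parameters.

```python
def centroid(points):
    """Return the unweighted centroid of a list of (x, y, z) heavy-atom positions."""
    n = len(points)
    return tuple(sum(p[axis] for p in points) / n for axis in range(3))


def bond_midpoint(atom_a, atom_b):
    """Return the midpoint of the bond between two (x, y, z) positions."""
    return tuple((atom_a[axis] + atom_b[axis]) / 2.0 for axis in range(3))


def redistribute_mass(total_mass, n_beads):
    """Spread the total mass evenly over all beads so the total is conserved."""
    return [total_mass / n_beads] * n_beads


# Hypothetical planar six-membered ring (angstroms); not from the optimized dyes.
ring_atoms = [(1.4, 0.0, 0.0), (0.7, 1.2, 0.0), (-0.7, 1.2, 0.0),
              (-1.4, 0.0, 0.0), (-0.7, -1.2, 0.0), (0.7, -1.2, 0.0)]
ring_bead = centroid(ring_atoms)                       # one bead per ring
linker_bead = bond_midpoint((0.0, 0.0, 0.0), (1.5, 0.0, 0.0))
bead_masses = redistribute_mass(300.0, 5)              # hypothetical total mass
```

An atom shared by two fused rings would simply appear in the point list passed to `centroid` for each of the two rings.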

B.1.3 Construction of good- and poor-solvent collapse Hamiltonians

The folding Hamiltonian was modified to approximate the influence of good solvent by removing all intra-NTD contacts while preserving inter-NTD/linker and inter-NTD/CTD contacts. Similarly, the poor-solvent Hamiltonian was constructed by first removing all intra-NTD contacts while preserving inter-NTD/linker and inter-NTD/CTD contacts and then adding a nonspecific pairwise attractive term. The functional form of the potential for these attractive interactions is taken to be the same as for contacts within the folding Hamiltonian. The well-depth for these added interactions was tuned with simulations of the HemK NTD Go̅-dye model in bulk solution with well-depths of ε = {0.0, −0.2, −0.4, ..., −19.6, −19.8, −20.0}, in units of kilocalories per mole. Ten trajectories were run for 750 ns at each well-depth, and system coordinates were recorded every 1.5 ns. The first 150 ns were discarded, and the mean radius of gyration was calculated over the remaining 600 ns of each simulation. The resulting plot of mean radius of gyration as a function of the well-depth is shown in Figure B.6. The well-depth corresponding to the midpoint of the collapse transition, −3.4 kcal/mol, was used in our continuous-synthesis simulations. An rmin of 6.17 Å, which is the median rmin of all the intra-NTD contacts for the folding Hamiltonian, was used for these nonspecific interactions.

B.1.4 Selection of mean in silico translation elongation time

The use of low-friction Langevin dynamics with a coarse-grain model greatly accelerates dynamics. Because the relative time scales of folding and amino acid addition are critical to accurately capturing co-translational folding behavior, we calculated the in silico time scale of amino acid addition, $\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle$, as

$$\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle = \langle\tau_{\mathrm{F}}^{\mathrm{sim}}\rangle \cdot \left(\frac{\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle}{\langle\tau_{\mathrm{F}}^{\mathrm{exp}}\rangle}\right), \qquad \text{(Eq. B.1)}$$

where $\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle$ and $\langle\tau_{\mathrm{F}}^{\mathrm{exp}}\rangle$ are the experimental amino acid addition and folding times, respectively, and $\langle\tau_{\mathrm{F}}^{\mathrm{sim}}\rangle$ is the simulation folding time. This equation ensures that a realistic ratio of time scales is maintained in the simulations. The value of $\langle\tau_{\mathrm{F}}^{\mathrm{sim}}\rangle$ was determined from temperature-quenching simulations. The HemK NTD Go̅-dye model was first equilibrated at 800 K for 40 ns to ensure complete unfolding and then instantaneously cooled to 298 K for 200 ns. Three hundred and fifty trajectories were run with this protocol, and the time-dependent survival probability of the unfolded state was calculated. The HemK NTD Go̅-dye model was considered to have folded during the quench at 298 K if the RMSD of the residues within its five helical elements was less than or equal to 5 Å. This value is less than the 7-Å thermodynamic RMSD threshold to prevent conformations that will likely rapidly unfold from being counted as folded. The survival probability curve was fit with a single-exponential function of the form $S_{\mathrm{U}}(t) = \exp(-k_{\mathrm{F}} t)$, where $S_{\mathrm{U}}(t)$ is the survival probability of the unfolded state, $t$ is the time since the start of the quenching period, and $k_{\mathrm{F}}$ is a fit parameter representing the rate of folding. The value of $\langle k_{\mathrm{F}}\rangle$ was determined with curve-fitting in python to be 0.281 ns⁻¹ (Pearson $R^2$ of fit: 0.987), corresponding to a mean folding time of $\langle\tau_{\mathrm{F}}^{\mathrm{sim}}\rangle = 3.56$ ns. Holtkamp et al. 2015 determined that $\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle = 0.278$ s and $\langle\tau_{\mathrm{F}}^{\mathrm{exp}}\rangle = 1.95 \times 10^{-4}$ s; with $\langle\tau_{\mathrm{F}}^{\mathrm{sim}}\rangle = 3.56$ ns, Eq. B.1 gives $\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle = 5080$ ns. Simulating the synthesis of the HemK constructs with an explicit representation of the E. coli 50S ribosome is computationally intractable for a dwell time of this duration. To reduce the required computational time, we applied an acceleration factor $\alpha$ such that

$$\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle' = \frac{\langle\tau_{\mathrm{F}}^{\mathrm{sim}}\rangle}{\alpha} \cdot \left(\frac{\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle}{\langle\tau_{\mathrm{F}}^{\mathrm{exp}}\rangle}\right) = \frac{1}{\alpha} \cdot \langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle. \qquad \text{(Eq. B.2)}$$

As $\alpha$ increases, the simulation mean dwell time decreases in inverse proportion. A factor of $\alpha$ speed-up is reasonable if $\langle\tau_{\mathrm{F}}^{\mathrm{sim}}\rangle \ll \langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle'$, such that the simulation is in a quasi-equilibrium regime. We used Gillespie algorithm74 simulations of two-state co-translational folding to determine the highest value of $\alpha$ that does not significantly alter the behavior of the system. In these simulations, the HemK NTD was only permitted to fold once the entire domain and an additional 30 residues, to simulate the effect of the exit tunnel, were synthesized. An initial set of 10 000 trajectories with the experimental rates of folding, unfolding, and amino acid addition were run for comparison. Additional sets of simulations were then run with the in silico rates calculated from temperature quenching with α = {1, 10, 100, 200, 500, 1000, 10 000, 15 000} (see Figure B.7). These simulations suggest that an $\alpha$ value of up to ∼1000 is reasonable, consistent with the notion that significant departure from quasi-equilibrium behavior is only expected when $\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle \cong \langle\tau_{\mathrm{F}}^{\mathrm{exp}}\rangle$, and there is a ∼1400-fold difference between $\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle$ and $\langle\tau_{\mathrm{F}}^{\mathrm{exp}}\rangle$. We chose α = 500 as a compromise between computational expense and preservation of the experimental ratio of time scales as closely as possible. Thus, $\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle' = \frac{5080\ \mathrm{ns}}{500} = 10.2$ ns was used for all continuous synthesis simulations.
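The arithmetic behind Eqs. B.1 and B.2 can be checked with a few lines of Python. This is a minimal sketch using the numbers quoted in this section; the function names are hypothetical and are not part of the original analysis scripts.

```python
def folding_time_ns(k_f_per_ns):
    """Mean folding time (ns) for a single-exponential survival curve exp(-k_F * t)."""
    return 1.0 / k_f_per_ns


def sim_addition_time_ns(tau_f_sim_ns, tau_a_exp_s, tau_f_exp_s):
    """Eq. B.1: preserve the experimental ratio of addition time to folding time."""
    return tau_f_sim_ns * (tau_a_exp_s / tau_f_exp_s)


def accelerated_addition_time_ns(tau_a_sim_ns, alpha):
    """Eq. B.2: shorten the in silico dwell time by the acceleration factor alpha."""
    return tau_a_sim_ns / alpha


tau_f_sim = folding_time_ns(0.281)                        # about 3.56 ns
tau_a_sim = sim_addition_time_ns(3.56, 0.278, 1.95e-4)    # about 5075 ns (reported as 5080 ns)
tau_a_sim_acc = accelerated_addition_time_ns(5080.0, 500) # 10.16 ns (reported as 10.2 ns)
timescale_ratio = 0.278 / 1.95e-4                         # about 1400-fold separation
```

With unrounded inputs the Eq. B.1 result is about 5075 ns, which matches the reported 5080 ns to three significant figures.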

B.1.5 Continuous synthesis simulations

A coarse-grained representation of the E. coli ribosome’s large subunit was constructed from PDB ID: 3UOS as described previously.50 The CHARMM Cartesian coordinate system origin was placed at the location of the N6 atom of A2602, which is between the A- and P-site tRNA, with the positive x-axis aligned along the long axis of the exit tunnel. The ribosome portions around the C-terminal residue of the nascent chain were allowed to fluctuate by selecting ribosomal interaction sites within 12 Å of the point (6, 0, 0) Å and with an x-coordinate greater than 3 Å and applying a harmonic restraint using CONS HARM with a force constant of 0.5 kcal/(mol × Å²).49 A single ribosome interaction site corresponding to U2585’s uracil ring (by PDB ID: 3UOS numbering) was deleted from the PDB structure to avoid steric clashes with the nascent chain. All interactions between ribosome sites were deactivated with the BLOCK module in CHARMM. The C-terminal bead of the nascent chain was held at the point (6, 0, 0) Å by a spherical harmonic restraint with a force constant of 50 kcal/(mol × Å²). Five planar restraints were positioned on five sides around the nascent chain using the GEO PLANE functionality in the MMFP module of CHARMM for nascent chain lengths up to and including 15 residues to direct the nascent chain into the exit tunnel. These planar restraints use the potential form EXPONENTIAL and parameters FORCE 50, DROFF 2.0, and P1 0.050. One yz-plane, passing through the point (1, 0, 0) Å, was used to approximate the steric bulk of the P-site tRNA. Four additional planes were used to define a box around the nascent chain, open only on the face pointing out of the exit tunnel, to help guide it into the exit tunnel correctly. Two xz-planes, with one passing through the point (0, 20, 0) Å and the other through the point (0, −10, 0) Å, were used. Two xy-planes, one passing through the point (0, 0, 20) Å and the other through the point (0, 0, −20) Å, were also used. 
For nascent chain lengths greater than 15 residues only the yz-plane through (1, 0, 0) Å was retained to approximate the steric bulk of the tRNA. This combination of planar and harmonic restraints prevents early termination due to steric clashes that lead to SHAKE algorithm errors.


A modified version of a previously published continuous synthesis protocol for CHARMM was used.186 Simulations were begun with the Cα bead of the N-terminal residue (BOF-Met) restrained at the point (6, 0, 0) Å. The dwell time at each nascent chain length was randomly sampled from a single-exponential distribution with mean $\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle' = 10.2$ ns. A temperature of 310 K, a frictional coefficient of FBETA = 0.050 ps⁻¹, an integration time step of 0.015 ps, and the SHAKE algorithm were used for all continuous synthesis simulations. System coordinates were saved every 5000 integration time steps (75 ps). A total of 200, 100, and 80 synthesis trajectories were completed using the folding, good-solvent, and poor-solvent Hamiltonians, respectively. Each poor-solvent Hamiltonian trajectory was initiated from a different configuration taken from the folding Hamiltonian simulations at nascent chain length 33. Starting at a nascent chain length of 33 ensures that the nascent chain will be equilibrated by the time the BOP-Lys residue is added at position 34 and FRET becomes possible. The time series for continuous synthesis trajectories synthesized with the poor-solvent Hamiltonian were adjusted by appending 32 dwell times selected randomly from an exponential distribution with mean of 10.2 ns. This adjustment accounts for the time that would be required to synthesize the first 32 residues of the nascent chain that were not explicitly simulated with Langevin dynamics. The size of the harmonically restrained region of ribosomal interaction sites was also increased to a total of 95 ribosome interaction sites within the box bounded by x = [3, 30] Å, y = [−15, 10] Å, and z = [−15, 10] Å. The poor-solvent simulations were otherwise identical to the folding and good-solvent simulations. 
Trajectories for HemK98, HemK84, HemK70, HemK56, and HemK42 with each of the three Hamiltonians were extended by an appropriate amount to bring their total duration to ∼35 s when mapped to experimental time.
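The dwell-time sampling described in this section can be illustrated with a short Python sketch. It is not the CHARMM protocol itself: the function names, the seed, and the choice of 80 chain lengths are hypothetical, and only the 10.2 ns mean and the 32 appended dwell times come from the text.

```python
import random


def sample_dwell_times(n_lengths, mean_dwell_ns, rng):
    """Draw one exponentially distributed dwell time (ns) per nascent chain length."""
    return [rng.expovariate(1.0 / mean_dwell_ns) for _ in range(n_lengths)]


def addition_times(dwell_times_ns, offset_ns=0.0):
    """Cumulative simulation time (ns) at which each successive dwell period ends."""
    times, elapsed = [], offset_ns
    for dwell in dwell_times_ns:
        elapsed += dwell
        times.append(elapsed)
    return times


rng = random.Random(2018)   # fixed seed, used only to make this sketch reproducible
MEAN_DWELL_NS = 10.2        # accelerated mean dwell time from Eq. B.2

# Hypothetical trajectory covering 80 nascent chain lengths.
dwells = sample_dwell_times(80, MEAN_DWELL_NS, rng)

# Trajectories started at length 33 skip the first 32 residues, so 32 sampled
# dwell times are added as a time offset.
offset = sum(sample_dwell_times(32, MEAN_DWELL_NS, rng))
schedule = addition_times(dwells, offset_ns=offset)
```

The expected value of the 32-dwell offset is 32 × 10.2 ns ≈ 326 ns, but each trajectory receives its own random draw.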

B.1.6 Mapping of simulation timescales

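To compare our simulation results to experimental time series we mapped our simulation time course to experimental time. We assume that there exists a uniform scaling factor, $c$, that maps the timescale of our simulations to experimental time:

$$t_{\mathrm{exp}} = c \cdot t_{\mathrm{sim}}. \qquad \text{(Eq. B.3)}$$

If we assume that $t_{\mathrm{exp}} = \langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle$ and $t_{\mathrm{sim}} = \langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle$, we have

$$c = \frac{\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle}{\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle}. \qquad \text{(Eq. B.4)}$$

As we employed an acceleration factor, $\alpha$, this equation must be adjusted to the form

$$c = \frac{\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle}{\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle / \alpha} = \frac{\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle}{\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle'}. \qquad \text{(Eq. B.5)}$$

Thus, we mapped our simulated time series in units of nanoseconds onto the experimental time regime in units of seconds by the procedure

$$t_{\mathrm{exp}} = t_{\mathrm{sim}} \cdot \frac{1\ \mathrm{s}}{10^{9}\ \mathrm{ns}} \cdot \frac{\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle}{\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle'}, \qquad \text{(Eq. B.6)}$$

in which $t_{\mathrm{sim}}$ has units of ns and both $\langle\tau_{\mathrm{A}}^{\mathrm{exp}}\rangle$ and $\langle\tau_{\mathrm{A}}^{\mathrm{sim}}\rangle'$ have units of s.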
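The mapping in Eqs. B.3–B.6 amounts to a unit conversion followed by multiplication by a dimensionless factor. The following is a minimal sketch with hypothetical function names; the default values are the experimental addition time (0.278 s) and the accelerated simulation dwell time (10.2 ns) quoted in the Methods.

```python
NS_PER_S = 1.0e9


def scaling_factor(tau_a_exp_s, tau_a_sim_acc_ns):
    """Eq. B.5: dimensionless factor c, with both times expressed in seconds."""
    return tau_a_exp_s / (tau_a_sim_acc_ns / NS_PER_S)


def to_experimental_time(t_sim_ns, tau_a_exp_s=0.278, tau_a_sim_acc_ns=10.2):
    """Eq. B.6: convert simulation time in ns to experimental time in s."""
    return (t_sim_ns / NS_PER_S) * scaling_factor(tau_a_exp_s, tau_a_sim_acc_ns)


c = scaling_factor(0.278, 10.2)           # roughly 2.7e7
one_dwell = to_experimental_time(10.2)    # one mean dwell time maps to about 0.278 s
one_frame = to_experimental_time(0.075)   # one saved frame (75 ps) in experimental seconds
```

By construction, one mean simulated dwell time maps onto one mean experimental amino acid addition time.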


B.1.7 Calculation of $E$ and $E/E_{\mathrm{end}}$ ensemble average time series

The Förster equation was used to calculate the FRET efficiency as a function of the interdye distance $r$. A Förster radius $R_0$ of 54 Å was used.183 The distance between the donor and acceptor was calculated as the distance between the interaction sites representing the six-membered rings in BOF and BOP at each frame of a trajectory. The FRET efficiency time series for each trajectory was then time-averaged into 15 ns bins. Finally, the binned trajectories were averaged together to produce the ensemble average FRET time series. Statistics were generated by determining the standard error of the average for each bin across all trajectories and then calculating the corresponding 95% confidence interval. Time series of $E/E_{\mathrm{end}}$ were produced by dividing the ensemble average $E$ time series for each construct by the value in the final bin for the HemK112 ensemble average time series for the corresponding Hamiltonian. These ensemble average trajectories were then projected onto experimental time by multiplication by the time adjustment factor $c$. Calculation of $E$ and $E/E_{\mathrm{end}}$ time series for alternative dye positions was performed in the same fashion, except the interdye distance was taken to be the distance between the Cα interaction sites for the dye locations under consideration.
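The analysis steps in this section can be sketched in plain Python as follows. This is an illustration rather than the original analysis code: the function names are hypothetical, frames are assumed to be saved at regular intervals so that no bin is empty, and the confidence-interval step is omitted.

```python
def fret_efficiency(r_angstrom, r0_angstrom=54.0):
    """Förster equation: E = 1 / (1 + (r / R0)^6)."""
    return 1.0 / (1.0 + (r_angstrom / r0_angstrom) ** 6)


def bin_average(times_ns, values, bin_width_ns=15.0):
    """Time-average one trajectory into consecutive bins of fixed width."""
    sums, counts = {}, {}
    for t, v in zip(times_ns, values):
        k = int(t // bin_width_ns)
        sums[k] = sums.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    return [sums[k] / counts[k] for k in sorted(sums)]


def ensemble_average(binned_trajectories):
    """Average bin k over all trajectories (truncated to the shortest trajectory)."""
    n_bins = min(len(b) for b in binned_trajectories)
    n_traj = len(binned_trajectories)
    return [sum(b[k] for b in binned_trajectories) / n_traj for k in range(n_bins)]


def normalize_by_endpoint(series, e_end):
    """Divide an ensemble-average E series by E_end (final HemK112 bin)."""
    return [e / e_end for e in series]


# Toy example: two short trajectories of interdye distances (angstroms).
times = [0.0, 7.5, 15.0, 22.5]
traj_a = bin_average(times, [fret_efficiency(r) for r in (80.0, 70.0, 54.0, 54.0)])
traj_b = bin_average(times, [fret_efficiency(r) for r in (90.0, 60.0, 50.0, 40.0)])
mean_e = ensemble_average([traj_a, traj_b])
```

In practice `e_end` would be taken from the last bin of the HemK112 ensemble average for the same Hamiltonian, not from the construct being normalized.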

B.1.8 Fraction of native contacts analysis

The fraction of native contacts ($Q$) formed by the five helices within the HemK NTD Go̅-dye model was calculated from the folding Hamiltonian continuous synthesis simulations. First, the set of contacts within the coarse-grain, native-state reference structure was determined. Two interaction sites are considered to share a contact if two criteria are satisfied: (1) both residues are within structured regions and separated by at least three residues in the primary sequence (i.e., $i \rightarrow i + 4$ or greater separation) and (2) the coarse-grain residues in the reference structure are no more than 8 Å apart. A contact was considered to be formed during a continuous synthesis trajectory if it satisfied criterion (1) above and the distance between the residues was less than or equal to $1.2 \times d_{\mathrm{ref}}$, where $d_{\mathrm{ref}}$ is the distance between the residues in the reference structure and the factor of 1.2 is to adjust for thermal fluctuations. The total number of contacts formed at frame $j$ of the simulation, $N(j)$, was divided by the total number of contacts formed in the native state, $N_{\mathrm{ref}}$, to produce the value of $Q(j)$. Time series of $Q(j)$ were then binned and averaged, and the corresponding statistics were determined in the same manner as described for the $E$ and $E/E_{\mathrm{end}}$ time series.
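The two contact criteria and the definition of $Q(j)$ can be sketched as follows. The function names and the six-bead toy structure are hypothetical; only the sequence separation ($i \rightarrow i + 4$), the 8 Å cutoff, and the 1.2 tolerance factor are taken from the text.

```python
import math


def native_contacts(ref_coords, structured, min_separation=4, cutoff=8.0):
    """List (i, j, d_ref) for structured pairs at least i -> i+4 apart and within the cutoff."""
    contacts = []
    n = len(ref_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if i in structured and j in structured:
                d_ref = math.dist(ref_coords[i], ref_coords[j])
                if d_ref <= cutoff:
                    contacts.append((i, j, d_ref))
    return contacts


def fraction_native(frame_coords, contacts, tolerance=1.2):
    """Q for one frame: formed contacts divided by the number of native contacts."""
    formed = sum(
        1 for i, j, d_ref in contacts
        if math.dist(frame_coords[i], frame_coords[j]) <= tolerance * d_ref
    )
    return formed / len(contacts)


# Hypothetical six-bead hairpin used as the native reference (angstroms).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
             (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]
extended = [(3.8 * i, 0.0, 0.0) for i in range(6)]   # fully stretched chain

contacts = native_contacts(reference, structured=set(range(6)))
q_native = fraction_native(reference, contacts)
q_extended = fraction_native(extended, contacts)
```

Restricting `structured` to the indices of a subset of helices would give the per-helix or helix-pair $Q$ values of the kind shown in the supplemental figures.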

B.1.9 Comparing simulated and experimental time series


Pearson $R^2$ values were calculated between simulated $E$ or $E/E_{\mathrm{end}}$ time series and experimental $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ time series. The experimental time point temporally closest to each of the simulation time points (after appropriate mapping of the simulated time series to the experimental time frame of reference) was determined, and the Pearson $R^2$ was calculated between this set of experimental values and the simulation time series. All $p$-values are $<10^{-10}$. We also introduce the parameter $\lambda$, defined as

$$\lambda = \sqrt{\frac{\sum_{k}\left(E_{\mathrm{H1}}(k) - E_{\mathrm{H2}}(k)\right)^{2}}{N}}, \qquad \text{(Eq. B.7)}$$

where the summation is evaluated over all bins, $k$, in the ensemble-average time series for a particular HemK construct and $E_{\mathrm{H1}}(k)$ and $E_{\mathrm{H2}}(k)$ are the ensemble average $E$ values in bin $k$ for the Hamiltonian 1 and Hamiltonian 2 time series, respectively, which are selected from the folding, good-solvent, and poor-solvent time series. The term $N$ is equal to the total number of bins across which $\lambda$ is calculated. When $E/E_{\mathrm{end}}$ time series rather than $E$ time series are compared, the terms $E_{\mathrm{H1}}(k)$ and $E_{\mathrm{H2}}(k)$ are replaced with $E_{\mathrm{H1}}(k)/E_{\mathrm{H1}}^{\mathrm{end}}$ and $E_{\mathrm{H2}}(k)/E_{\mathrm{H2}}^{\mathrm{end}}$, respectively, in which $E_{\mathrm{H1}}^{\mathrm{end}}$ and $E_{\mathrm{H2}}^{\mathrm{end}}$ are the final values for the HemK112 time series calculated with the corresponding Hamiltonian’s simulation results.
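Eq. B.7 is a root-mean-square difference between two binned time series. A minimal sketch, with hypothetical function names and toy input values, is:

```python
import math


def rms_difference(series_h1, series_h2):
    """Eq. B.7: lambda = sqrt(sum_k (E_H1(k) - E_H2(k))^2 / N) over N shared bins."""
    if len(series_h1) != len(series_h2):
        raise ValueError("both time series must cover the same bins")
    n_bins = len(series_h1)
    squared = sum((a - b) ** 2 for a, b in zip(series_h1, series_h2))
    return math.sqrt(squared / n_bins)


def rms_difference_normalized(series_h1, series_h2, e_end_h1, e_end_h2):
    """Same metric for E/E_end series; each Hamiltonian uses its own HemK112 endpoint."""
    scaled_h1 = [e / e_end_h1 for e in series_h1]
    scaled_h2 = [e / e_end_h2 for e in series_h2]
    return rms_difference(scaled_h1, scaled_h2)


# Toy values only; real inputs are ensemble-average E values per 15 ns bin.
lam_same = rms_difference([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
lam_toy = rms_difference([0.0, 0.0], [3.0, 4.0])
lam_norm = rms_difference_normalized([0.2, 0.4], [0.3, 0.6], 0.4, 0.6)
```

Two series that differ only by a constant multiplicative factor give a nonzero $\lambda$ for $E$ but can give zero for $E/E_{\mathrm{end}}$, which is one way the two comparisons can diverge.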

B.2 Supplemental discussion

B.2.1 Comparing simulated and experimental time series (expanded discussion)

Due to technical concerns in the experiments, including that not all ribosomes translate in the in vitro translation system used and that the efficiency of dye incorporation is ~60-70%, FRET efficiencies could not be measured. Instead, the fluorescence detected in the acceptor channel in the presence of the donor (denoted $F_{\mathrm{AD(A)}}$) for each construct was normalized to the endpoint fluorescence in the acceptor channel for HemK112 (referred to here as $F_{\mathrm{AD(A)}}^{\mathrm{end}}$), producing the quantity $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$. Our simulation output, consisting of the coordinates of the system at discrete times during the simulation, can be used to calculate the FRET efficiency according to Förster’s equation, as $E = \frac{1}{1 + (r/R_0)^6}$. Here $E$ is expressed as a function of the distance between the dye molecules, $r$, and the Förster radius, $R_0$, which is the distance at which $E = 0.5$ (see Methods for details of simulation analysis). The FRET efficiency, $E$, is related to $F_{\mathrm{AD(A)}}$ by the equation

$$E = \frac{F_{\mathrm{AD(A)}}}{F_{\mathrm{AD(A)}} + F_{\mathrm{AD(D)}}}, \qquad \text{(Eq. B.8)}$$

in which $F_{\mathrm{AD(D)}}$ is the fluorescence in the donor channel in the presence of the acceptor. Multiplying Eq. B.8 by $F_{\mathrm{AD(A)}}^{\mathrm{end}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ and rearranging the result yields

$$E = \frac{F_{\mathrm{AD(A)}}}{F_{\mathrm{AD(A)}}^{\mathrm{end}}} \cdot \left(\frac{F_{\mathrm{AD(A)}}^{\mathrm{end}}}{F_{\mathrm{AD(A)}} + F_{\mathrm{AD(D)}}}\right). \qquad \text{(Eq. B.9)}$$

The FRET efficiency is thus seen to be related to $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ by the factor $F_{\mathrm{AD(A)}}^{\mathrm{end}}/(F_{\mathrm{AD(A)}} + F_{\mathrm{AD(D)}})$, indicating that FRET efficiencies calculated using Förster’s equation can be converted to the experimental quantity $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ if the time-dependent value of $F_{\mathrm{AD(A)}}^{\mathrm{end}}/(F_{\mathrm{AD(A)}} + F_{\mathrm{AD(D)}})$ is known. Unfortunately, we have no means of estimating or calculating this value based on our simulation results. Despite this limitation, the FRET efficiency from Förster’s equation is seen by Equation B.9 to be intrinsically related to the experimental quantity $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$. Under constant illumination at the FRET donor excitation wavelength, as $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ increases $E$ will increase, and as $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ decreases $E$ will decrease.

The data processing step of normalization by $F_{\mathrm{AD(A)}}^{\mathrm{end}}$ produces time series that are more intuitive to compare between the different HemK constructs, as all time series are held relative to HemK112’s endpoint. To mimic this procedure, in addition to calculating $E$ we also calculated the parameter $E/E_{\mathrm{end}}$, in which

$$E_{\mathrm{end}} = \frac{F_{\mathrm{AD(A)}}^{\mathrm{end}}}{F_{\mathrm{AD(A)}}^{\mathrm{end}} + F_{\mathrm{AD(D)}}^{\mathrm{end}}}. \qquad \text{(Eq. B.10)}$$

The value of $E_{\mathrm{end}}$ is thus the final $E$ value for HemK112 as a function of the final fluorescences of the acceptor ($F_{\mathrm{AD(A)}}^{\mathrm{end}}$) and donor ($F_{\mathrm{AD(D)}}^{\mathrm{end}}$). The experimental quantity $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ is related to $E/E_{\mathrm{end}}$ by the equation

$$\frac{E}{E_{\mathrm{end}}} = \frac{F_{\mathrm{AD(A)}}}{F_{\mathrm{AD(A)}}^{\mathrm{end}}} \cdot \left(\frac{F_{\mathrm{AD(A)}}^{\mathrm{end}} + F_{\mathrm{AD(D)}}^{\mathrm{end}}}{F_{\mathrm{AD(A)}} + F_{\mathrm{AD(D)}}}\right). \qquad \text{(Eq. B.11)}$$

A similar issue to that encountered in attempting to convert $E$ to $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ is also encountered when attempting to convert $E/E_{\mathrm{end}}$ to $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$. Conversion of $E/E_{\mathrm{end}}$ to $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ requires knowledge of the time-dependent quantity $(F_{\mathrm{AD(A)}}^{\mathrm{end}} + F_{\mathrm{AD(D)}}^{\mathrm{end}})/(F_{\mathrm{AD(A)}} + F_{\mathrm{AD(D)}})$, to which we do not have access. However, $E/E_{\mathrm{end}}$ is also intrinsically related to $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ by the same logic as employed for $E$.

In summary, the values of $E$ and $E/E_{\mathrm{end}}$ we calculate from our simulation data are related to the experimental parameter $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$, but not equivalent to it. Both $E$ and $E/E_{\mathrm{end}}$ are equally valid, and we find that different conclusions are drawn when one set of results is considered to the exclusion of the other. Thus, we compare our simulated $E$ and $E/E_{\mathrm{end}}$ time series to the experimental $F_{\mathrm{AD(A)}}/F_{\mathrm{AD(A)}}^{\mathrm{end}}$ time series and discuss the implications of each set of results.

B.2.2 Testing the applicability of the Förster equation

The Förster equation assumes, in addition to the standard assumptions of Förster theory, that the so-called orientation factor, $\kappa^2$, is equal to the isotropic average of 2/3. To test this assumption, we calculated the mean value of $\kappa^2$ at each nascent chain length during the continuous synthesis simulations with the folding Hamiltonian with the equation

$$\kappa^2 = \left[(\hat{a} \cdot \hat{d}) - 3\,(\hat{a} \cdot \hat{r})(\hat{d} \cdot \hat{r})\right]^2. \qquad \text{(Eq. B.12)}$$

The parameters $\hat{a}$, $\hat{d}$, and $\hat{r}$ are the unit vectors of the acceptor dipole moment, of the donor dipole moment, and along the vector between the dye centers of mass, respectively. The unit vectors $\hat{a}$ and $\hat{d}$ were calculated as shown in Figure B.8A. Figure B.8B displays the resulting mean $\kappa^2$ values as a function of nascent chain length. The mean value of $\kappa^2$ across all trajectories and nascent chain lengths is 0.678. Significant departures from the isotropic averaging regime are only observed for nascent chain lengths between 34 and ~40, or about 6% of the nascent chain lengths at which FRET is possible. The use of the Förster equation is therefore justified.
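Eq. B.12 and the isotropic limit of 2/3 can be checked numerically with a short sketch. The function names are hypothetical, and the random sampling below stands in for the simulation frames; in the actual analysis the dipole vectors come from the dye geometry (Figure B.8A).

```python
import math
import random


def unit(v):
    """Normalize a 3-vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kappa_squared(acceptor_dipole, donor_dipole, separation):
    """Eq. B.12: kappa^2 = [(a.d) - 3 (a.r)(d.r)]^2 with all vectors normalized."""
    a, d, r = unit(acceptor_dipole), unit(donor_dipole), unit(separation)
    return (dot(a, d) - 3.0 * dot(a, r) * dot(d, r)) ** 2


def random_unit_vector(rng):
    """Isotropic direction obtained by normalizing three Gaussian components."""
    return unit((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))


def isotropic_mean_kappa_squared(n_samples, rng):
    """Monte Carlo estimate of <kappa^2> for freely rotating, uncorrelated dipoles."""
    r = (1.0, 0.0, 0.0)
    total = 0.0
    for _ in range(n_samples):
        total += kappa_squared(random_unit_vector(rng), random_unit_vector(rng), r)
    return total / n_samples


collinear = kappa_squared((1, 0, 0), (1, 0, 0), (1, 0, 0))    # head-to-tail dipoles
parallel = kappa_squared((0, 1, 0), (0, 1, 0), (1, 0, 0))     # side-by-side dipoles
orthogonal = kappa_squared((0, 1, 0), (0, 0, 1), (1, 0, 0))   # mutually perpendicular
estimate = isotropic_mean_kappa_squared(50000, random.Random(1))
```

The three limiting geometries give the extreme and intermediate values of $\kappa^2$, and the Monte Carlo estimate should approach 2/3 as the number of samples grows.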


B.3 Supplemental figures and tables

Figure B.1. Three sets of fraction of native contacts, $Q$, time series are displayed for HemK98. (A) Time series of $Q$ representing intra-helix contacts for helices 1, 2, 3, 4, and 5 within the HemK NTD. (B) Time series of $Q$ representing inter-helix contacts between all pairs of helices between which contacts form in the HemK NTD native state. The plot legend indicates the pair of helices considered for each time series. (C) Time series of $Q$ representing overall folding (i.e., intra- and inter-helix contacts) calculated over either all five helices (h12345), the N-terminal four helices (h1234), or the N-terminal three helices (h123). Error bars are 95% confidence intervals.


Figure B.2. Three sets of fraction of native contacts, $Q$, time series are displayed for HemK84. (A) Time series of $Q$ representing intra-helix contacts for helices 1, 2, 3, 4, and 5 within the HemK NTD. (B) Time series of $Q$ representing inter-helix contacts between all pairs of helices between which contacts form in the HemK NTD native state. The plot legend indicates the pair of helices considered for each time series. (C) Time series of $Q$ representing overall folding (i.e., intra- and inter-helix contacts) calculated over either all five helices (h12345), the N-terminal four helices (h1234), or the N-terminal three helices (h123). Error bars are 95% confidence intervals.


Figure B.3. Three sets of fraction of native contacts, $Q$, time series are displayed for HemK70. (A) Time series of $Q$ representing intra-helix contacts for helices 1, 2, 3, 4, and 5 within the HemK NTD. (B) Time series of $Q$ representing inter-helix contacts between all pairs of helices between which contacts form in the HemK NTD native state. The plot legend indicates the pair of helices considered for each time series. (C) Time series of $Q$ representing overall folding (i.e., intra- and inter-helix contacts) calculated over either all five helices (h12345), the N-terminal four helices (h1234), or the N-terminal three helices (h123). Error bars are 95% confidence intervals.


Figure B.4. Three sets of fraction of native contacts, $Q$, time series are displayed for HemK56. (A) Time series of $Q$ representing intra-helix contacts for helices 1, 2, 3, 4, and 5 within the HemK NTD. (B) Time series of $Q$ representing inter-helix contacts between all pairs of helices between which contacts form in the HemK NTD native state. The plot legend indicates the pair of helices considered for each time series. (C) Time series of $Q$ representing overall folding (i.e., intra- and inter-helix contacts) calculated over either all five helices (h12345), the N-terminal four helices (h1234), or the N-terminal three helices (h123). Error bars are 95% confidence intervals.


Figure B.5. Three sets of fraction of native contacts, $Q$, time series are displayed for HemK42. (A) Time series of $Q$ representing intra-helix contacts for helices 1, 2, 3, 4, and 5 within the HemK NTD. (B) Time series of $Q$ representing inter-helix contacts between all pairs of helices between which contacts form in the HemK NTD native state. The plot legend indicates the pair of helices considered for each time series. (C) Time series of $Q$ representing overall folding (i.e., intra- and inter-helix contacts) calculated over either all five helices (h12345), the N-terminal four helices (h1234), or the N-terminal three helices (h123). Error bars are 95% confidence intervals.


Figure B.6. The mean radius of gyration evaluated over ten 600-ns trajectories for each $\varepsilon$ is plotted as a function of the interaction energy, $\varepsilon$, for the HemK NTD. A value of $\varepsilon = -3.4$ kcal/mol was selected for continuous synthesis simulations with the poor-solvent Hamiltonian. Error bars are 95% confidence intervals.


Figure B.7. Results of Gillespie Algorithm simulations through a two-state co-translational folding reaction scheme. Each plot is the average of 10,000 statistically independent simulations of translation and folding through this reaction scheme. The solid magenta line is the same in each panel, and was calculated using the experimental rates of folding, unfolding, and amino acid addition of 5,137 s$^{-1}$, 1.97 s$^{-1}$, and 3.6 aa/s, respectively. The blue circles plotted in each panel were calculated using the in silico folding rate of the dye-modified HemK NTD of $2.81 \times 10^{8}$ s$^{-1}$, an unfolding rate of $1.08 \times 10^{5}$ s$^{-1}$, and a mean amino acid addition time as indicated on each plot.


Figure B.8. Value of $\kappa^2$ as a function of nascent chain length. (A) All-atom schematics of the dye-modified residues BOP-Lys (left) and BOF-Met (right) are displayed with the approximate locations of the CG interaction sites superimposed. Note that bond lengths, angles, dihedrals, and interaction radii are not to scale or accurately reproduced. The unit vectors of the acceptor dipole moment ($\hat{a}$), donor dipole moment ($\hat{d}$), and the vector connecting the six-membered rings ($\hat{r}$) are shown in blue. (B) The mean value of $\kappa^2$ as a function of nascent chain length during continuous synthesis simulations with the folding Hamiltonian is shown. Error bars are 95% confidence intervals about the mean.


Table B.1. HemK dye model mass parameters

| Bead Type | Mass (AMU) |
|---|---|
| A1 | 67.666667 |
| L1 | 67.666667 |
| L2 | 67.666667 |
| R1 | 67.666667 |
| R2 | 67.666667 |
| R3 | 67.666667 |
| A34 | 48.777778 |
| L3 | 48.777778 |
| L4 | 48.777778 |
| L5 | 48.777778 |
| R4 | 48.777778 |
| R5 | 48.777778 |
| R6 | 48.777778 |
| R7 | 48.777778 |

*Total mass of BOF = 406 AMU was split evenly between six interaction sites, while the total mass of BOP = 439 AMU was split evenly between nine interaction sites.


Table B.2. HemK dye model bond parameters

| Bond* | Equilibrium Bond Length (Å) | Source |
|---|---|---|
| A1 – L1 | 1.916690 | B3LYP/6-31G minimized |
| L1 – L2 | 2.427500 | B3LYP/6-31G minimized |
| L2 – R1 | 3.059530 | B3LYP/6-31G minimized |
| R1 – R2 | 2.297340 | B3LYP/6-31G minimized |
| R2 – R3 | 2.504890 | B3LYP/6-31G minimized |
| R3 – R1 | 4.609180 | B3LYP/6-31G minimized |
| A34 – L3 | 1.969880 | B3LYP/6-31G minimized |
| L3 – L3 | 2.564730 | B3LYP/6-31G minimized |
| L3 – L4 | 2.373360 | B3LYP/6-31G minimized |
| L4 – L5 | 2.447950 | B3LYP/6-31G minimized |
| L5 – R4 | 3.067520 | B3LYP/6-31G minimized |
| R4 – R5 | 2.293190 | B3LYP/6-31G minimized |
| R4 – R6 | 4.395260 | B3LYP/6-31G minimized |
| R5 – R6 | 2.304410 | B3LYP/6-31G minimized |
| R6 – R7 | 3.817990 | B3LYP/6-31G minimized |
| R5 – R7 | 4.562860 | B3LYP/6-31G minimized |

*All bond force constants taken as 50 kcal/(mol·Å$^2$).


Table B.3. HemK dye model angle parameters

| Angle | Equilibrium Angle (Degrees) | Source |
|---|---|---|
| L1 – A1 – A2* | 109.35 | Average Cα-Cα-Cβ from ALA-ALA-ALA tripeptide generated in CHARMM assuming right-handed alpha-helical conformation |
| A1 – L1 – L2 | Cα double-well potential | Best, Chen, and Hummer 2005 |
| L1 – L2 – R1 | Cα double-well potential | Best, Chen, and Hummer 2005 |
| L2 – R1 – R2 | 88.785 | B3LYP/6-31G minimized |
| L2 – R1 – R3 | 105.648 | B3LYP/6-31G minimized |
| R1 – R2 – R3 | 147.367 | B3LYP/6-31G minimized |
| R1 – R3 – R2 | 15.592 | B3LYP/6-31G minimized |
| R2 – R1 – R3 | 17.041 | B3LYP/6-31G minimized |
| A33 – A34 – L3 | 109.35 | Average Cα-Cα-Cβ from ALA-ALA-ALA tripeptide generated in CHARMM assuming right-handed alpha-helical conformation |
| L3 – A34 – A35 | 109.35 | Average Cα-Cα-Cβ from ALA-ALA-ALA tripeptide generated in CHARMM assuming right-handed alpha-helical conformation |
| A34 – L3 – L3 | Cα double-well potential | Best, Chen, and Hummer 2005 |
| L3 – L3 – L4 | Cα double-well potential | Best, Chen, and Hummer 2005 |
| L3 – L4 – L5 | Cα double-well potential | Best, Chen, and Hummer 2005 |
| L4 – L5 – R4 | Cα double-well potential | Best, Chen, and Hummer 2005 |
| L5 – R4 – R5 | 89.722 | B3LYP/6-31G minimized |
| L5 – R4 – R6 | 106.354 | B3LYP/6-31G minimized |
| R5 – R4 – R6 | 17.105 | B3LYP/6-31G minimized |
| R4 – R6 – R5 | 17.019 | B3LYP/6-31G minimized |
| R5 – R6 – R7 | 93.037 | B3LYP/6-31G minimized |
| R4 – R6 – R7 | 110.028 | B3LYP/6-31G minimized |
| R6 – R7 – R5 | 30.287 | B3LYP/6-31G minimized |
| R7 – R5 – R6 | 56.676 | B3LYP/6-31G minimized |
| R7 – R5 – R4 | 202.661 | B3LYP/6-31G minimized |
| R6 – R5 – R4 | 145.877 | B3LYP/6-31G minimized |

*Angle force constants taken as 30.000000 kcal/(mol·radian$^2$) unless double-well potential is used.


Table B.4. HemK dye model torsional parameters

| Dihedral | Force constant | Multiplicity | Delta | Source |
|---|---|---|---|---|
| A32 – A33 – A34 – L3 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| A32 – A33 – A34 – L3 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| A32 – A33 – A34 – L3 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| A32 – A33 – A34 – L3 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| A33 – A34 – L3 – L3 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| A33 – A34 – L3 – L3 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| A33 – A34 – L3 – L3 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| A33 – A34 – L3 – L3 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L2 – L1 – A1 – A2 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| L2 – L1 – A1 – A2 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| L2 – L1 – A1 – A2 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| L2 – L1 – A1 – A2 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L1 – A1 – A2 – A3 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| L1 – A1 – A2 – A3 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| L1 – A1 – A2 – A3 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| L1 – A1 – A2 – A3 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L3 – A34 – A35 – A36 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| L3 – A34 – A35 – A36 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| L3 – A34 – A35 – A36 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| L3 – A34 – A35 – A36 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L3 – L3 – A34 – A35 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| L3 – L3 – A34 – A35 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| L3 – L3 – A34 – A35 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| L3 – L3 – A34 – A35 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| A1 – L1 – L2 – R1 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| A1 – L1 – L2 – R1 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| A1 – L1 – L2 – R1 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| A1 – L1 – L2 – R1 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L1 – L2 – R1 – R2 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| L1 – L2 – R1 – R2 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| L1 – L2 – R1 – R2 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| L1 – L2 – R1 – R2 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L1 – L2 – R1 – R3 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| L1 – L2 – R1 – R3 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| L1 – L2 – R1 – R3 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| L1 – L2 – R1 – R3 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| A34 – L3 – L3 – L4 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| A34 – L3 – L3 – L4 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| A34 – L3 – L3 – L4 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| A34 – L3 – L3 – L4 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L3 – L3 – L4 – L5 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| L3 – L3 – L4 – L5 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| L3 – L3 – L4 – L5 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| L3 – L3 – L4 – L5 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L3 – L4 – L5 – R4 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| L3 – L4 – L5 – R4 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| L3 – L4 – L5 – R4 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| L3 – L4 – L5 – R4 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L4 – L5 – R4 – R5 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| L4 – L5 – R4 – R5 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| L4 – L5 – R4 – R5 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| L4 – L5 – R4 – R5 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L4 – L5 – R4 – R6 | 0.019163 | 1 | 56.923718 | Gly-Gly Cα dihedral |
| L4 – L5 – R4 – R6 | 0.301167 | 2 | 5.844342 | Gly-Gly Cα dihedral |
| L4 – L5 – R4 – R6 | 0.043003 | 3 | 292.787031 | Gly-Gly Cα dihedral |
| L4 – L5 – R4 – R6 | 0.107372 | 4 | 331.434920 | Gly-Gly Cα dihedral |
| L2 – R1 – R2 – R3 | 0.200000 | 1 | 8.214 (-171.786)* | B3LYP/6-31G minimized |
| L2 – R1 – R3 – R2 | 0.200000 | 1 | 188.53 (8.530)* | B3LYP/6-31G minimized |
| L5 – R4 – R5 – R6 | 0.200000 | 1 | 13.317 (-166.683)* | B3LYP/6-31G minimized |
| L5 – R4 – R5 – R7 | 0.200000 | 1 | 186.386 (6.386)* | B3LYP/6-31G minimized |
| L5 – R4 – R6 – R5 | 0.200000 | 1 | 193.889 (13.889)* | B3LYP/6-31G minimized |
| L5 – R4 – R6 – R7 | 0.200000 | 1 | 190.499 (10.499)* | B3LYP/6-31G minimized |
| R4 – R5 – R6 – R7 | 0.200000 | 1 | 356.81 (176.810)* | B3LYP/6-31G minimized |
| R4 – R5 – R7 – R6 | 0.200000 | 1 | 4.647 (-175.353)* | B3LYP/6-31G minimized |
| R4 – R6 – R5 – R7 | 0.200000 | 1 | 3.19 (-176.810)* | B3LYP/6-31G minimized |
| R4 – R6 – R7 – R5 | 0.200000 | 1 | 180.993 (0.993)* | B3LYP/6-31G minimized |
| R5 – R4 – R6 – R7 | 0.200000 | 1 | 176.61 (-3.390)* | B3LYP/6-31G minimized |
| R6 – R4 – R5 – R7 | 0.200000 | 1 | 353.069 (173.069)* | B3LYP/6-31G minimized |

*For single-well dihedrals, the values are presented as “phase (angle)”.


Table B.5. HemK dye model non-bonded parameters

| Bead | $E_{\mathrm{min}}$ (kcal/mol) | $r_{\mathrm{min}}/2$ (Å) | Source |
|---|---|---|---|
| A1 | -0.000132 | 3.407083 | Met Cα prm file |
| L1 | -0.000132 | 2.245 | Merchant et al. 2007 |
| L2 | -0.000132 | 2.245 | Merchant et al. 2007 |
| R1 | -0.000132 | 4.2 | Ribose |
| R2 | -0.000132 | 5.051 | Merchant et al. 2007 |
| R3 | -0.000132 | 4.2 | Ribose |
| A34 | -0.000132 | 4.252035 | Lys Cα prm file |
| L3 | -0.000132 | 2.245 | Merchant et al. 2007 |
| L4 | -0.000132 | 2.245 | Merchant et al. 2007 |
| L5 | -0.000132 | 2.245 | Merchant et al. 2007 |
| R4 | -0.000132 | 4.2 | Ribose |
| R5 | -0.000132 | 5.051 | Merchant et al. 2007 |
| R6 | -0.000132 | 4.2 | Ribose |
| R7 | -0.000132 | 4.2 | Ribose |


Appendix C

CHAPTER 4 METHODS

Published as a paper entitled “Altered co-translational processing plays a role in Huntington’s pathogenesis – A hypothesis” by Daniel A. Nissley and Edward P. O’Brien in Frontiers in Molecular Neuroscience 2016, 9:54. EPO first proposed the hypothesis. DAN and EPO conducted the research and wrote the manuscript. This article was published under an Open Access Creative Commons Attribution License (CC BY).

C.1 Methods

C.1.1 Derivation of chemical-kinetic model for Htt misprocessing

CAF binding tends to be favored in a narrow range of nascent chain lengths. For example, equilibrium binding data show that SRP's affinity for arrested RNCs is optimal within a ~20 amino-acid region [122]. Outside this optimal region, the $K_{\mathrm{D}}$ increases by 3- to 24-fold. Therefore, in our chemical kinetic reaction scheme (Figures 4.2A and B) we allow CAF/N17 binding only in a narrow range of nascent chain lengths by assuming that $k_{\mathrm{on}}$ is zero except between codon positions 53–72. We also assume that $k_{\mathrm{off}}$ is equal to zero, which approximates a binding process that heavily favors association over dissociation in the region of optimal binding. The ribosome first encounters the codons encoding the poly-proline region at codon 53 in Htt with a 35-residue poly-glutamine region. We therefore take position 53, somewhat arbitrarily, as the start of the CAF binding region (Figure 4.2B), which extends to codon position 72. Increasing the width of this optimal-binding region does not influence the results obtained in the calculation of Eq. 4.1 (data not shown). In reality, the nascent chain length regime over which a CAF prefers to engage nascent Htt may be significantly different than that which we use here. Under these assumptions the amount of misprocessed Htt ($A_{\mathrm{mp}}$) can be expressed as a function of the total time available for CAF binding ($\tau_{\mathrm{AFB}}$, the dwell time of the RNC in the optimal-binding region). These ideas are expressed mathematically in Eq. C.1.

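$$A_{\mathrm{mp}}(\tau_{\mathrm{AFB}}(N_{\mathrm{CAG}})) = A_{\mathrm{mp}}(\tau_{\mathrm{AFB}} = 0) \, \exp(-k_{\mathrm{on}} \tau_{\mathrm{AFB}}(N_{\mathrm{CAG}})) \qquad \text{(Eq. C.1)}$$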

$A_{\mathrm{mp}}(\tau_{\mathrm{AFB}} = 0)$ is the amount of Htt misprocessed when there is no time available for CAF binding. The value of $\tau_{\mathrm{AFB}}$ depends on $N_{\mathrm{CAG}}$ and is the total time required by the ribosome to decode codons $i = 53$ to $i = 72$ (Figure 4.2C). For simplicity, we assume that there are two types of codons: faster-translating codons, which are decoded in time $\tau_{\mathrm{A}}$, and slower-translating codons, which are decoded in time $2\tau_{\mathrm{A}}$ (Figure 4.2C). Codons in the poly-glutamine region are defined to be faster-translating, while codons in the poly-proline region are defined to be slower-translating. Thus, $\tau_{\mathrm{AFB}}$ decreases by an increment of $\tau_{\mathrm{A}}$ for each CAG repeat past 35 (Figure 4.2C). Experimental reports suggest that proline codons require between two and six times longer to translate than the global average codon translation time [128,129]. Figures 4.2E and F display the results when we use the smallest difference suggested by the literature of a two-fold increase in translation time of prolines. Equation 4.1 gives strong correlations when translation times between $2\tau_{\mathrm{A}}$ and $6\tau_{\mathrm{A}}$ are used (data not shown). As $N_{\mathrm{CAG}}$ increases over 35 and $\tau_{\mathrm{AFB}}$ decreases, the fraction of nascent Htt which is incorrectly processed increases (Figure 4.2D). In order to avoid the issue of estimating a value for $A_{\mathrm{mp}}(\tau_{\mathrm{AFB}} = 0)$, we consider instead the fraction of Htt which is misprocessed ($f_{\mathrm{mp}}$) at $N_{\mathrm{CAG}}$ relative to the amount of Htt misprocessed at $N_{\mathrm{CAG}} = 35$,

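$$f_{\mathrm{mp}}(N_{\mathrm{CAG}}) = \frac{A_{\mathrm{mp}}(\tau_{\mathrm{AFB}} = 0) \left[ \exp(-k_{\mathrm{on}} \tau_{\mathrm{AFB}}(N_{\mathrm{CAG}})) - \exp(-k_{\mathrm{on}} \tau_{\mathrm{AFB}}(35)) \right]}{A_{\mathrm{mp}}(\tau_{\mathrm{AFB}} = 0) \, \exp(-k_{\mathrm{on}} \tau_{\mathrm{AFB}}(35))} \qquad \text{(Eq. C.2)}$$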

Algebraic simplification of Eq. C.2 yields Eq. 4.1, which gives the fraction of misprocessed Htt as a function of $N_{\mathrm{CAG}}$ and does not depend on $A_{\mathrm{mp}}(\tau_{\mathrm{AFB}} = 0)$. Equation 4.1 is approximately linear for small arguments of the exponential term. Consider the power series expansion of $\exp(x)$,

$$\exp(x) = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \qquad \text{(Eq. C.3)}$$

For values of $|x| \ll 1$, the $n > 1$ terms in Eq. C.3 are vanishingly small, and $\exp(x)$ is approximately linear in $x$. We also note that altering the width and/or location of the optimal-binding region will alter the range of $N_{\mathrm{CAG}}$ values over which $f_{\mathrm{mp}}$ is a monotonically increasing function. Once $N_{\mathrm{CAG}}$ increases such that all codons in the optimal-binding region encode Q and not P, $\tau_{\mathrm{AFB}}$ is minimized and $f_{\mathrm{mp}}$ remains constant. With the sample numbers used here, $\tau_{\mathrm{AFB}}$ reaches a minimum value of 20 at $N_{\mathrm{CAG}} = 55$ (see Figure 4.2C), such that $f_{\mathrm{mp}}$ becomes constant when $N_{\mathrm{CAG}} \geq 55$.
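The model of this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used to produce Figure 4.2. It takes the simplified form of Eq. C.2, $f_{\mathrm{mp}} = \exp(-k_{\mathrm{on}}[\tau_{\mathrm{AFB}}(N_{\mathrm{CAG}}) - \tau_{\mathrm{AFB}}(35)]) - 1$, assumes every non-glutamine codon in the window is a slow proline codon, and is only defined for $N_{\mathrm{CAG}} \geq 35$. The function names and the value of `tau_a` are hypothetical.

```python
import math

WINDOW_START, WINDOW_END = 53, 72  # codon positions of the optimal-binding region
N_REF = 35                         # reference poly-glutamine length


def tau_afb(n_cag: int, tau_a: float) -> float:
    """Dwell time in the binding window; fast codons take tau_a, slow ones 2*tau_a."""
    if n_cag < N_REF:
        raise ValueError("this sketch is only defined for n_cag >= 35")
    window = WINDOW_END - WINDOW_START + 1   # 20 codons
    n_fast = min(n_cag - N_REF, window)      # glutamine codons shifted into the window
    n_slow = window - n_fast                 # remaining (assumed proline) codons
    return n_fast * tau_a + n_slow * 2.0 * tau_a


def f_mp(n_cag: int, k_on: float, tau_a: float) -> float:
    """Fraction misprocessed relative to n_cag = 35 (simplified Eq. C.2)."""
    delta_tau = tau_afb(n_cag, tau_a) - tau_afb(N_REF, tau_a)
    return math.exp(-k_on * delta_tau) - 1.0


# Example with the k_on estimate from Section C.1.2 and a hypothetical tau_a of 1 s.
profile = {n: f_mp(n, k_on=0.0366, tau_a=1.0) for n in (35, 45, 55, 65)}
```

In this sketch `profile[35]` is zero by construction and the values at 55 and 65 repeats are identical, reproducing the plateau described above.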

C.1.2 Calculation of $k_{\mathrm{on}}$ for the chemical kinetic model

Equation 4.1 requires a rate constant for CAF/RNC binding in order to predict $f_{\mathrm{mp}}$. Binding rates are available in the literature for several different CAFs, including trigger factor, signal recognition particle, DnaK, and DnaJ. The rates reported in the literature typically have units of M$^{-1}$ s$^{-1}$, indicating a dependence on both the cellular concentration of the CAF and time. First, we selected the $k_{\mathrm{on}}$ rate measured for DnaJ of $3.3 \times 10^{5}$ M$^{-1}$ s$^{-1}$ [47]. Next, we determined a reasonable estimate for the intracellular concentration of DnaJ in human cells. Finka and Goloubinoff (2013) recently reported intracellular concentrations for 147 molecular chaperones in HeLa cells [187]. We generated a reasonable estimate of the DnaJ intracellular concentration by taking the median of the cellular concentrations of the subset of 109 chaperones which were identified to be cytosolic or nuclear. This median value is $1.11 \times 10^{-7}$ M (assuming an average HeLa cell volume of 2,600 μm$^3$ [187]). Multiplying the on-rate for DnaJ by this concentration yields the in vivo $k_{\mathrm{on}}$ estimate used in Eq. 4.1 of 0.0366 s$^{-1}$.
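The unit conversion described above is a single multiplication, shown here as a brief Python check. The variable and function names are hypothetical; the two numerical inputs are the values quoted in the text.

```python
K_ON_BIMOLECULAR = 3.3e5     # M^-1 s^-1, on-rate measured for DnaJ
CAF_CONCENTRATION = 1.11e-7  # M, median cytosolic/nuclear chaperone concentration


def pseudo_first_order_rate(k_bimolecular: float, concentration: float) -> float:
    """Convert a bimolecular rate constant (M^-1 s^-1) to an effective rate (s^-1)."""
    return k_bimolecular * concentration


k_on_in_vivo = pseudo_first_order_rate(K_ON_BIMOLECULAR, CAF_CONCENTRATION)
```

Rounded to three significant figures, `k_on_in_vivo` agrees with the 0.0366 s$^{-1}$ quoted above.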


Appendix D

CHAPTER 5 METHODS AND SUPPLEMENTARY FIGURES AND TABLES

D.1 Methods

D.1.1 Multi-domain protein selection and model building

Fifty genes coding for cytosolic multi-domain proteins were selected randomly from a previously published database of E. coli proteins [72]. The amino acid sequence of each gene (here defined as the translated sequence), Protein Data Bank (PDB) identifier, and domain definition of these proteins were also collected as a starting point for building the corresponding atomistic and coarse-grained models. The amino acid sequence of each PDB structure listed in Table D.1 was aligned to the translated sequence of the corresponding gene. The residue numbering in the PDB file was shifted by $(p - 1)$, where $p$ is the position that the first PDB residue occupies in the translated sequence after alignment, to ensure consistency between the translated mRNA sequences and the coordinates of the protein. All residues not resolved in the PDB structure were eliminated from the aligned PDB sequence. Missing residues were identified by comparing the aligned PDB sequence and the translated amino acid sequence. Because some of the PDB structures contained missing residues or domains, we searched the PDB for alternative structures that had the minimum number of missing residues/domains, the highest possible resolution, and that represented the protein closest to physiological conditions. In some cases, such as when different domains were crystallized in different experiments, the original and alternative PDB structures were utilized to model the full protein (see Table D.1 for more details). PDB files storing different domains or segments of the same protein were combined by performing a structural alignment of the common regions first and then merging the coordinates to reconstruct the entire protein. In one case the PDB files contained no common regions to be used for the initial structural alignment (1NG9/3ZLJ in Table D.1). Whenever homologous structures from E. coli were not available, homologous structures from other organisms were used as a template for the protein model, provided the sequence similarity was larger than 30% and the portion of the protein solved in both structures had a low backbone RMSD (no more than 2 Å). VMD was used to perform the structural alignments [188].

After this first phase, some multi-domain proteins still contained missing residues or disconnected domains. All proteins were therefore subjected to a rebuilding phase to add the missing atoms. The rebuilding phase further allowed us to revert possible mutations or sequence mismatches, such as those introduced upon modeling the coordinates of a protein based on multiple structures (see Table D.1). Because the reconstructed segments were generated in an extended conformation and disconnected from the polypeptide, a minimization was performed in vacuo for 200 integration time steps. This short minimization was sufficient to relax the non-physical conformations and produce a connected polypeptide chain. For proteins with short stretches of missing residues (fewer than 10), this minimized configuration was accepted as the final full-atomistic model. If a protein contained one or more long stretches of missing residues (more than 10) or disconnected domains, the minimized protein structure was further subjected to MD at the temperature of E. coli optimal growth (310 K). In this phase, the reconstructed atoms within each solved domain were left free to move, thereby allowing the structure to locally equilibrate. The smallest domain in each protein was also allowed to move freely in order to re-orient in the most favorable conformation relative to the larger domain(s). All other atoms were either fixed or harmonically restrained to fluctuate around the experimentally solved structure with a force constant of 1 kcal/(mol·Å$^2$). All reconstructions, minimizations, and MD simulations were performed using CHARMM with the par27 force field. The minimized structures were solvated in TIP3 water and 150 mM of NaCl, gradually heated to 310 K over 100 ps, and equilibrated for 1.5 ns at the same temperature. Production runs had different durations, spanning from 20 to 50 ns. Langevin dynamics with a friction coefficient of 1.0 ps$^{-1}$ and a time step of 1.5 fs was used. For each protein the conformation with the lowest energy was selected as the final full-atomistic model. We emphasize that the purpose of the MD simulations was not to thoroughly explore the conformational space of the multi-domain proteins, but rather to provide reasonable atomistic conformations for the coarse-graining phase.

PREDATOR [189] and IUPRED [190] were used to predict whether those residues not resolved in the PDB structures are intrinsically unstructured. Domains in the multi-domain proteins were initially defined according to CATH [167], which is also used in the original database [72]. The domain residue numbering was shifted to match the translated sequence and modeling of the missing atoms performed (Table D.1). The final domain definition reported in Table D.7 also includes residues and domains that were modeled as described in Table D.1.

D.1.2 Single-domain protein selection and model building

The database from which proteins were selected contains 1,014 cytosolic proteins, with 598 single- and 416 multi-domain proteins. Given a set of 50 multi-domain proteins, 72 single-domain proteins are required to reproduce this 41:59 ratio of multi- to single-domain proteins. Of these 598 single-domain proteins there are 250 α, 55 β, and 293 α/β; this ratio of structural classes was reproduced in the subset of 72 proteins by choosing 30 α, 7 β, and 35 α/β proteins. Single-domain proteins were chosen at random from the database and the PDB files and FASTA amino acid sequences retrieved from the Protein Data Bank. These amino acid sequences were then mapped to the corresponding mRNA sequence based on their ordered locus number as recorded in UniProt. mRNA sequences were retrieved from the University of California Santa Cruz microbe table browser (http://microbes.ucsc.edu/). Proteins whose amino acid sequences did not exactly match their corresponding mRNA were rejected and a replacement protein selected at random. In some cases, small sections of amino acids (12 amino acids or fewer) or small numbers of heavy atoms (fewer than 10) that were not resolved in the experimental structure were rebuilt and minimized in CHARMM.
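The sample sizes quoted above follow from simple proportional allocation, which the following Python sketch reproduces. It is an illustration only; the function names are hypothetical and the independent rounding of each class is an assumption that happens to sum to 72 for these counts but is not guaranteed to do so in general.

```python
N_SINGLE, N_MULTI = 598, 416  # database counts of single- and multi-domain proteins
CLASS_COUNTS = {"alpha": 250, "beta": 55, "alpha/beta": 293}  # single-domain classes


def matched_single_domain_count(n_multi_selected: int) -> int:
    """Single-domain proteins needed to keep the database's multi:single ratio."""
    return round(n_multi_selected * N_SINGLE / N_MULTI)


def allocate_by_class(n_total: int) -> dict:
    """Split n_total across structural classes in proportion to the database."""
    return {name: round(n_total * count / N_SINGLE)
            for name, count in CLASS_COUNTS.items()}


n_single_selected = matched_single_domain_count(50)
class_allocation = allocate_by_class(n_single_selected)
```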


D.1.3 Selection of mean in silico codon translation time based on training set folding kinetics

Cα CG models were built for six α-helical (Table D.2), six β-sheet (Table D.3), and seven α/β proteins (Table D.4) with lengths varying from 58 to 155 residues for which experimental free energy changes upon folding ($\Delta G_{\mathrm{exp}}$) [191,192] and mean folding times ($\langle \tau_{\mathrm{F}}^{\mathrm{exp}} \rangle$) [191,193–197] were previously determined. Three to six replica exchange simulations with various multiplicative scaling factors for the attractive non-bonded interactions ($\eta$), which were originally set with the Betancourt-Thirumalai statistical potential [180], were completed for each protein and the free energy of folding ($\Delta G_{\mathrm{sim}}$) computed for each value of $\eta$ using the WHAM equations [181]. The fraction of native contacts ($Q$) was used to differentiate folded and unfolded conformations. The $Q$ threshold separating the folded and unfolded populations at equilibrium, $Q_{\mathrm{eq}}$, was identified as the value of $Q$ at which the cumulative probability of $Q$ at the CG model’s melting temperature ($T_{\mathrm{M}}$) equals 0.5. The $\Delta G_{\mathrm{sim}}(\eta)$ data for each protein were fit to the equation $\Delta G_{\mathrm{sim}}(\eta) = m\eta + b$ and the tuned $\eta$, denoted $\eta^*$, calculated as $\eta^* = (\Delta G_{\mathrm{exp}} - b)/m$, where $m$ and $b$ are the slope and y-axis intercept, respectively, of a particular protein’s $\Delta G_{\mathrm{sim}}(\eta)$ best-fit line. Thus, $\eta^*$ is the scaling constant predicted to reproduce a protein’s $\Delta G_{\mathrm{exp}}$ value in silico. All values of $\Delta G_{\mathrm{sim}}$ were computed at the temperature at which $\Delta G_{\mathrm{exp}}$ was obtained ($T_{\mathrm{exp}}$). The $\eta^*$ values calculated by this method range from 0.998 to 1.628, with an overall average value of 1.235. The mean $\eta^*$ values for the α, β, and α/β proteins within the training set are 1.170, 1.442, and 1.114, respectively.

Coarse-grain models were constructed for all 19 training set proteins at their respective $\eta^*$ values. The mean time of folding of each of these tuned CG models, denoted $\langle \tau_{\mathrm{F}}^{\mathrm{sim}} \rangle$, was determined from temperature quench simulations. Five hundred trajectories were run for each CG model, with a 1000-K equilibration period of between 20 ns and 2 µs followed by a 310-K quenching period of between 100 ns and 10 µs. The durations of these two phases of the simulation were chosen such that the length of the 1000-K equilibration was 20% of the duration of the 310-K quench. Kinetic $Q$ thresholds, denoted $Q_{\mathrm{kin}}$, were defined for each protein to differentiate the unfolded and folded ensembles during the quench period. This parameter was calculated as the arithmetic mean of $Q_{\mathrm{eq}}$ and the most probable value of $Q$ observed at 310 K ($Q_{310}$) in the REX simulation at the value of $\eta$ closest to $\eta^*$ for a given CG model. We required that the folded state remain stably folded for at least 150 ps before the protein is considered to have folded at the initial frame at which $Q > Q_{\mathrm{kin}}$. The time-dependent survival probability of the unfolded state was calculated as

$$S_{\mathrm{U}}(t) = 1 - \frac{N_{\mathrm{F}}(t)}{N_{\mathrm{traj}}},$$

where $N_{\mathrm{F}}(t)$ is the number of trajectories that have populated the folded state based on the above kinetic definition of folding at least once by simulation time $t$ and $N_{\mathrm{traj}}$ is the number of statistically independent trajectories included in the analysis. Survival probability curves for all proteins except ABP1 SH3 (PDB ID: 1JO8) were fit to the equation $S_{\mathrm{U}}(t) = \exp(-k_{\mathrm{F}} t)$ with Python, where $k_{\mathrm{F}}$ is the sole fitting parameter, and the mean folding time calculated as $\langle \tau_{\mathrm{F}}^{\mathrm{sim}} \rangle = 1/k_{\mathrm{F}}$. The survival probability distribution for ABP1 SH3 was found to be best fit by the sum of two exponential terms; the fit equation for ABP1 SH3 was modified to the form $S_{\mathrm{U}}(t) = f_1 \exp(-k_{\mathrm{F},1} t) + f_2 \exp(-k_{\mathrm{F},2} t)$ with $f_1 + f_2 \equiv 1$. Curve fitting revealed that $f_1 = 0.49$, $f_2 = 0.51$, $k_{\mathrm{F},1} = 9.87 \times 10^{-3}$ ns$^{-1}$, and $k_{\mathrm{F},2} = 2.18 \times 10^{-1}$ ns$^{-1}$; the value of $\langle \tau_{\mathrm{F}}^{\mathrm{sim}} \rangle$ reported in Table D.5 for ABP1 SH3 was calculated as $\langle \tau_{\mathrm{F}}^{\mathrm{sim}} \rangle = f_1 \tau_1 + f_2 \tau_2$, where $\tau_1 = 1/k_{\mathrm{F},1}$ and $\tau_2 = 1/k_{\mathrm{F},2}$. Pearson $R^2$ values for all fits were ≥0.98. Note that very few trajectories for dihydrofolate reductase (PDB ID: 1RX4) folded during 10 µs of quench and $\langle \tau_{\mathrm{F}}^{\mathrm{sim}} \rangle$ therefore could not be calculated.

We maintain the experimentally observed ratio of a protein’s folding time to the timescale of amino acid addition by requiring that

$$\langle \tau_{\mathrm{A}}^{\mathrm{sim}} \rangle = \langle \tau_{\mathrm{F}}^{\mathrm{sim}} \rangle \frac{\langle \tau_{\mathrm{A}}^{\mathrm{exp}} \rangle}{\langle \tau_{\mathrm{F}}^{\mathrm{exp}} \rangle} = \frac{1}{\alpha} \langle \tau_{\mathrm{A}}^{\mathrm{exp}} \rangle, \qquad \text{[Eq. D.1]}$$

in which we have introduced the parameter $\alpha = \langle \tau_{\mathrm{F}}^{\mathrm{exp}} \rangle / \langle \tau_{\mathrm{F}}^{\mathrm{sim}} \rangle$ that gives the fold acceleration of protein folding observed in our simulations compared to experiment. The $\langle \tau_{\mathrm{F}}^{\mathrm{sim}} \rangle$, $\alpha$, and other relevant parameters for each of the 18 proteins are reported in Table D.5. The mean value of $\alpha$ over the 18 proteins in the data set was found to be $\langle \alpha \rangle = 3{,}967{,}486$. Assuming $\langle \tau_{\mathrm{A}}^{\mathrm{exp}} \rangle = 0.05$ s in E. coli,

$$\langle \tau_{\mathrm{A}}^{\mathrm{sim}} \rangle = \frac{1}{\langle \alpha \rangle} \langle \tau_{\mathrm{A}}^{\mathrm{exp}} \rangle = \left( \frac{1}{3{,}967{,}486} \right) \times 0.05\ \mathrm{s} \times \left( \frac{10^{9}\ \mathrm{ns}}{1\ \mathrm{s}} \right) = 12.6\ \mathrm{ns}. \qquad \text{[Eq. D.2]}$$

We emphasize that the acceleration of folding we observe, and therefore this mean in silico translation time, is dependent on the specifics of our CG model (force field, frictional coefficient, temperature, model resolution, etc.) and thus not transferable.
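The main steps of this section can be sketched in plain Python. The sketch below is illustrative only and is not the analysis code used in this work. The function names and the frame spacing `dt_ps` are hypothetical. The rate constant is estimated by a linearized least-squares fit of $-\ln S_{\mathrm{U}}(t)$ against $t$ through the origin, a simple stand-in for whatever curve-fitting routine was actually used.

```python
import math


def eta_star(dg_exp, slope, intercept):
    """Invert the best-fit line dG_sim(eta) = slope * eta + intercept at dG_exp."""
    return (dg_exp - intercept) / slope


def first_folding_time(q_series, q_kin, dt_ps, min_stable_ps=150.0):
    """Time (ps) of the first frame starting a run with Q > q_kin that lasts
    at least min_stable_ps; None if no such run occurs."""
    run_start = None
    for i, q in enumerate(q_series):
        if q > q_kin:
            if run_start is None:
                run_start = i
            if (i - run_start) * dt_ps >= min_stable_ps:
                return run_start * dt_ps
        else:
            run_start = None
    return None


def survival_probability(folding_times, n_traj, t):
    """S_U(t) = 1 - N_F(t) / N_traj; unfolded trajectories are given as None."""
    n_folded = sum(1 for tf in folding_times if tf is not None and tf <= t)
    return 1.0 - n_folded / n_traj


def fit_k_f(times, survival):
    """Least-squares slope through the origin of -ln(S_U) versus t."""
    num = den = 0.0
    for t, s in zip(times, survival):
        if s > 0.0:  # the logarithm is undefined once every trajectory has folded
            num += t * -math.log(s)
            den += t * t
    return num / den


def tau_a_sim_ns(tau_a_exp_s, alpha_mean):
    """Eq. D.2: scale the experimental codon time by 1/alpha and convert s to ns."""
    return tau_a_exp_s / alpha_mean * 1e9
```

For example, `tau_a_sim_ns(0.05, 3967486)` evaluates to approximately 12.6, matching Eq. D.2.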

D.1.4 Parameterization of single- and multi-domain protein contact potentials

The 휂 values reported in Tables D.6 and D.7 for the single- and multi-domain data sets were selected based on the training set values recorded in Table D.5. Successive rounds of ten 1-µs Langevin dynamics simulations at 310 K were run for each protein. A frictional coefficient of FBETA = 0.050 ps-1, an integration time step of 0.015 ps, the SHAKE algorithm, and the ETEN non-bonded potential form are used. Coordinate information was printed every 5000 integration time steps (75 ps). A domain or interface was considered to be kinetically stable if its fraction of native contacts, 푄, was greater than the average 푄kin from the training set of 〈푄kin〉 = 0.69080 for at least 98% of the simulation frames in all ten trajectories. Three different 휂 values were calculated for each structural class. The first value is simply 〈휂〉class, the mean 휂 calculated for the training set proteins in a given structural class. Inspection of the ∆퐺class = 푚 ∗ 〈휂〉class + 푏 values in Table D.8 indicates that using 〈휂〉class results in under-stabilized protein models in some cases. To account for this variation, simulations were also run with 휂 increased by 〈∆∆퐺/∆퐺 〉 ∗ 100% = exp − 117

22%, where ∆∆퐺 = ∆퐺exp − ∆퐺class for a given protein and 〈 〉− indicates that the average is calculated only for negative values of ∆∆퐺 (i.e., over values of ∆∆퐺 corresponding to proteins under-stabilized by 〈휂〉class; see Table D.8). This increase of 22% is thus the average relative increase in 〈휂〉class necessary to stabilize a protein not stable at 〈휂〉class. Finally, a third set of simulations were run with 〈휂〉class increased by the largest destabilization relative to the average ∆퐺exp, calculated as ∆∆퐺min/〈∆퐺exp〉*100% = 72% where ∆∆퐺min is the most-negative value of ∆∆퐺 within the training set. These values are summarized in Table D.9. Selection of 휂 for single-domain proteins was made based on these initial simulations by selecting the lowest 휂 value that kinetically stabilizes a domain. Seven of the 72 single-domain proteins were found to be unstable at each of the three 휂 values tested. For these proteins a fourth round of simulations were run with 휂 = 2.480, the overall largest value of 휂 suggested by the training set (corresponding to the +72% value for β domains). None of the seven proteins were found to be stable at 휂 = 2.480, and they were therefore assigned the median 휂 from the set of stable domains of 1.170. Scaling factors for the multi-domain proteins were determined by an analogous procedure. Contacts between interfaces were assigned the overall mean 휂 value from the training set (see Table D.9). Following the initial three sets of simulations another round of ten 1-μs simulations were run with each domain and interface assigned the minimum 휂 at which it was found to be stable in the initial round of simulations. In many cases domains and interfaces previously stable at a particular 휂 value became unstable in these “mixed 휂” simulations. For example, for PDB ID: 1D2F domain 1, domain 2, and the 1|2 interface were initially found to be stable at 휂 = 1.427, 1.359, and 2.124, respectively. 
However, when these 휂 values were used in the mixed-휂 simulations the 1|2 interface became unstable. In these situations, 휂 values were increased by one level (e.g., from 1.114 to 1.359 or from 1.442 to 1.759) and an additional set of ten 1-μs simulations was run. As for the single-domain proteins, all domains and interfaces that were found to be unstable at each of the three 휂 values for their respective structural class were also run with 휂 = 2.480, the overall training set maximum value. If a domain or interface was unstable even at 휂 = 2.480, it was assigned the median 휂 of the subset of stable domains (1.170) or stable interfaces (1.507), respectively. The only exception to these selection rules is PDB ID: 2KX9 domain 2; this domain was found to be unstable at all tested values of 휂, but when it was assigned the median value of 휂 = 1.170, domains 1 and 3 also became unstable. Because the stability of domains 1 and 3 is strongly correlated with that of domain 2, domain 2 was assigned 휂 = 2.480 to ensure the stability of domains 1 and 3, even though domain 2 is unstable at this value and the selection rules dictate that it be assigned 휂 = 1.170. A total of 59/110 interfaces and 14/213 domains (not including domain 1 of PDB 1FTS, which appears to be intrinsically disordered and contains no native contacts) were found to be unstable at all of the 휂 values calculated from the training set. Unstable interfaces tend to be smaller, to consist of fewer inter-domain contacts, and to be less hydrophobic than stable interfaces (Figure D.1). These unstable interfaces may represent crystal packing interfaces that are not present in the soluble native state of the protein.


Similarly, unstable domains tend to be small and to contain fewer intra-domain contacts than stable domains (Figure D.2). Increasing 휂 beyond the values suggested by the training set to stabilize these interfaces and domains would exaggerate the incidence of kinetic trapping, because more and more energy would be required to break a given contact, making contacts longer lived on average. We therefore chose to use the median values for unstable interfaces and domains in order to preserve a realistic energy scale while preventing such exaggerated kinetic trapping behavior.
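The stability criterion and the "lowest stabilizing 휂" rule of this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code used for the dissertation: the function names and the input layout (a list of 푄 time series per trajectory, keyed by 휂) are hypothetical, and only the threshold of 0.69080 and the 98% requirement come from the text.

```python
Q_KIN_MEAN = 0.69080   # training set <Q_kin> quoted in Section D.1.4
MIN_FRACTION = 0.98    # fraction of frames that must exceed the threshold


def is_kinetically_stable(trajectories, q_threshold=Q_KIN_MEAN,
                          min_fraction=MIN_FRACTION):
    """Return True if every trajectory has Q > q_threshold in at least
    min_fraction of its frames. `trajectories` is a list of Q time series
    (hypothetical input layout)."""
    for q_series in trajectories:
        if not q_series:
            return False
        frames_above = sum(1 for q in q_series if q > q_threshold)
        if frames_above < min_fraction * len(q_series):
            return False
    return True


def lowest_stable_eta(q_by_eta):
    """Return the smallest eta whose trajectories pass the stability test.
    `q_by_eta` maps eta -> list of Q time series; returns None if no
    tested eta stabilizes the domain or interface."""
    for eta in sorted(q_by_eta):
        if is_kinetically_stable(q_by_eta[eta]):
            return eta
    return None
```

A return value of None corresponds to the case in which a domain or interface is unstable at every tested 휂 and falls back to the median value described above.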

D.1.5 Construction of 50S E. coli ribosome cutout

Simulating the synthesis of each of the 122 proteins in the multi- and single-domain datasets would be prohibitively expensive and time consuming if the entire 50S ribosome were explicitly represented. To increase computational efficiency, the structure of the 50S ribosome contained in PDB ID: 3R8T was reduced to a cutout of the exit tunnel and the surface near the exit tunnel opening. The full 50S structure was first coarse-grained and oriented as previously described198, with the origin of the CHARMM coordinate system placed at the position of the N6 atom of A2602 and the positive x-axis pointing from this origin towards the exit tunnel opening. A single ribosomal interaction site corresponding to the uracil ring of U2585 was removed to prevent steric clashes with the nascent chain. Initial continuous synthesis simulations with PDB 2QVR were run with the resulting CG structure, which contained the entire 50S subunit minus U2585’s uracil ring, and a trajectory in which the nascent chain was inserted in an extended conformation through the exit tunnel was selected. All ribosome interaction sites within 30 Å of the nascent chain or with an x-coordinate greater than 60 Å were kept, and all other interaction sites were deleted. Residues with an x-coordinate greater than 60 Å but with zero solvent accessible surface area, as calculated with the COOR SURF functionality of CHARMM with RPROBE = 1.8 Å, were also removed. This probe size is significantly smaller than the smallest nascent chain interaction site, meaning that only ribosome sites with which the nascent chain cannot interact were removed. An 18-residue loop of ribosomal protein L24, which extends out over the exit tunnel opening, was allowed to fluctuate. The resulting ribosome cutout consists of 3,800 interaction sites.
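The geometric part of this selection (keep sites within 30 Å of the nascent chain or with x > 60 Å) can be illustrated as follows. This is a sketch under stated assumptions: coordinates are plain (x, y, z) tuples in Å in the frame described above, the function name is hypothetical, and the subsequent removal of buried sites by solvent accessible surface area (done with CHARMM in the text) is not reproduced here.

```python
import math


def select_cutout(ribosome_sites, nascent_sites, cutoff=30.0, x_min=60.0):
    """Keep ribosome interaction sites that lie within `cutoff` angstroms of
    any nascent-chain site or that have an x-coordinate greater than `x_min`.
    Sites are (x, y, z) tuples; all other sites are discarded."""
    kept = []
    for site in ribosome_sites:
        near_chain = any(math.dist(site, nc) <= cutoff for nc in nascent_sites)
        beyond_plane = site[0] > x_min
        if near_chain or beyond_plane:
            kept.append(site)
    return kept
```

The double loop is quadratic in the number of sites, which is acceptable for a one-time preprocessing step of this size; a spatial grid would be the natural replacement for larger systems.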

D.1.6 Simulations of translation elongation, translation termination, and post-translational protein dynamics

Each protein in the single- and multi-domain protein datasets was synthesized starting from a single residue using a modified version of a previously published protocol198. CHARMM simulation parameters are the same as those reported in D.1.4. The dwell time at a particular nascent chain length was randomly selected from an exponential distribution with a mean equal to the average decoding time of the codon in the A-site. Average decoding times were taken from the Fluitt-Viljoen model12 and scaled to reproduce an overall average of 12.6 ns (840,000 integration time steps of 0.015 ps duration, see Tables D.5 and D.10). A planar restraint in the yz-plane through the point (58, 0, 0) Å was used to prevent the nascent chain from contacting the underside of the ribosome cutout. Fifty trajectories were run for each of the 122 proteins in the dataset.

After Langevin dynamics was completed for a given trajectory, the harmonic restraint on the C-terminal bead, which models the covalent bond between the nascent protein and the P-site tRNA, was removed. Simulations of termination were run until the C-terminal residue of each trajectory reached an x-coordinate of 100 Å or greater, indicating that the protein had exited the tunnel. The ribosome cutout was then deleted and the conformation of the protein saved for post-translational dynamics. Starting from the final conformations of the translation termination simulations, post-translational dynamics were run in bulk solution for 30 days of wall time. Different proteins reached different total simulation durations because simulation speed depends on protein size. Ten trajectories were also initiated from the native state (rather than from a synthesis trajectory) and run for 30 days of wall time to serve as a comparison data set representing the native state ensemble of each protein.
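The dwell-time procedure can be sketched as below. The scaling rule is an assumption inferred from the numbers: a single factor is chosen so that the unweighted mean over the 64 codons of Table D.10 equals 840,000 time steps, which reproduces the tabulated entries that were checked (for example, UUU: 136 ms → 846,908 steps). Function and variable names are hypothetical.

```python
import random

# Mean decoding times in ms, in the index order of Table D.10 (UUU ... AAA).
DECODING_MS = [
    136, 195, 50, 157, 55, 246, 96, 106, 75, 109, 168, 12, 53, 77, 19, 11,
    260, 204, 35, 286, 143, 197, 134, 237, 28, 35, 397, 34, 296, 222, 231, 179,
    26, 208, 42, 73, 39, 415, 44, 83, 35, 49, 81, 324, 77, 116, 36, 57,
    97, 128, 266, 128, 55, 153, 129, 178, 85, 127, 461, 190, 109, 161, 102, 76,
]

TARGET_MEAN_STEPS = 840_000  # 12.6 ns at 0.015 ps per integration step

# Assumed uniform scale factor (time steps per ms of experimental decoding time).
SCALE = TARGET_MEAN_STEPS / (sum(DECODING_MS) / len(DECODING_MS))


def mean_dwell_steps(decoding_ms):
    """Mean in silico dwell time, in integration steps, for one codon."""
    return round(decoding_ms * SCALE)


def sample_dwell_steps(decoding_ms, rng=random):
    """Draw one dwell time (in steps) from an exponential distribution whose
    mean is the scaled decoding time of the A-site codon."""
    mean_steps = decoding_ms * SCALE
    return max(1, round(rng.expovariate(1.0 / mean_steps)))
```

For example, `sample_dwell_steps(136, random.Random(1))` draws a single dwell time for a UUU codon; averaging many such draws approaches `mean_dwell_steps(136)`.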

D.1.7 Calculation of domain and interface folding times

Time series of 푄 were calculated for each domain and interface as described in D.1.3. Contacts within domains were only considered in this calculation if the coarse-grain interaction sites were (i) within 8 Å of one another in the native state reference structure, (ii) separated by at least 3 amino acids in the primary sequence, and (iii) both identified to be in conserved secondary structural elements by STRIDE.182 Interface contacts were required to satisfy criteria (i) and (ii) as well as the condition that the two coarse-grain interaction sites be in different domains as defined in Table D.7. The first passage time of each domain and interface to the folded/structured state was determined by comparing these 푄 time series to the mean 푄 of the corresponding domain or interface in simulations initiated from the native state reference structure in bulk solution, denoted 〈푄〉ref. A domain or interface was considered to have folded when its 푄 value was greater than or equal to 〈푄〉ref for at least 150 ps of simulation time.
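The first-passage criterion can be sketched as follows. Two details are assumptions rather than statements from the text: frames are taken to be 75 ps apart (the output interval given in D.1.4), and the duration of a run of frames is measured from its first to its last frame, with the folding time reported as the start of the first qualifying run. The function name is hypothetical.

```python
def first_passage_time(q_series, q_ref, frame_interval_ps=75.0,
                       min_duration_ps=150.0):
    """Return the time (ps) at which Q first reaches q_ref and then stays at
    or above it for at least min_duration_ps; return None if this never
    happens. `q_series` holds one Q value per saved frame."""
    run_start = None
    for i, q in enumerate(q_series):
        if q >= q_ref:
            if run_start is None:
                run_start = i  # a new run of folded frames begins here
            if (i - run_start) * frame_interval_ps >= min_duration_ps:
                return run_start * frame_interval_ps
        else:
            run_start = None  # run broken; wait for the next crossing
    return None
```

Requiring a minimum residence time filters out transient fluctuations above the threshold, so that a single frame with a high 푄 value is not counted as a folding event.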


D.2 Supplementary figures and tables

Figure D.1. Parameters related to interface stability. Stacked boxplots for number of residues involved in interface contacts (푁residues), number of inter-domain contacts per interface (푁contacts), number of contacts per interface residue (푁residues/푁contacts), and mean hydropathy of residues involved in interface contacts. Mean hydropathy values are calculated using the Kyte-Doolittle hydropathy score, in which a more positive value is more hydrophobic and a more negative value is more hydrophilic. Boxplots were produced by rank ordering a given parameter and then color-coding the result based on the 휂 value used to scale that interface’s inter-domain contact energies.


Figure D.2. Parameters related to domain stability. Stacked boxplots for number of residues per domain (푁residues), number of intra-domain contacts (푁contacts), and number of intra-domain contacts per residue (푁residues/푁contacts). Boxplots were produced by rank ordering a given parameter, splitting the ordered list into quartiles, and then color-coding sections of the bar based on the 휂 value used to scale that domain’s intra-domain contact energies. Plots include single- and multi-domain proteins.


Figure D.3. Distributions of protein and domain length (top and bottom, respectively) within the entire database of E. coli proteins and the subset of 50 multi-domain proteins selected.


Figure D.4. Distributions of protein size for single-domain proteins in the set of all 598 single-domain proteins within the database of E. coli proteins and for the subset of 72 single-domain proteins selected for simulations.


Figure D.5. 1FUI domain fraction of native contacts time series. Fraction of native contacts as a function of simulation and experimental time for domain 1 (top), domain 2 (middle) and domain 3 (bottom) of PDB ID: 1FUI. Light and dark lines correspond to individual trajectories and the ensemble average over trajectories, respectively. The dotted black line in each plot indicates the 〈Q〉ref − 2σ threshold for each domain.


Figure D.6. 1FUI interface fraction of native contacts time series. Fraction of native contacts as a function of simulation and experimental time for the 1|2 (top), 1|3 (middle) and 2|3 (bottom) interfaces of PDB ID: 1FUI. Light and dark lines correspond to individual trajectories and the ensemble average over trajectories, respectively. The dotted black line in each plot indicates the 〈Q〉ref − 2σ threshold for each interface.


Table D.1. Database of multi-domain proteins

| pdb:chain | missing res | other pdb | Modeling strategy | Notes |
|---|---|---|---|---|
| 1svt:J | 526-548 | | Rebuild and mini. MD to relax the missing residues. | Missing residues are predicted to be disordered. |
| 1ksf:X | 143-167, 611-623 | | Rebuild and mini. MD to relax the missing residues. | Missing residues are predicted to be disordered. |
| 1brm:A | 232-242 | 1t4b:B | Substitution with alternative structure. Rebuild and mini. | 1t4b represents the same gene. |
| 3brq:B | 1-58 | 3oqo:A | Modeling of N-ter domain of 3brq. Rebuild and mini. MD to relax the relative orientation of the N-ter domain relative to the rest. | 3oqo (different gene) is used as a template to model the missing N-ter domain of 3brq. 3oqo and 3brq are structurally aligned and 3oqo coordinates of residues 1 to 58 merged with 3brq. |
| 1fts:A | 1-200 | | Rebuild and mini. | Missing residues are predicted to be disordered199. Implicit solvent MD is performed due to the fragment length. |
| 3dnt:A | <10 | 2wiu:A | Substitution with alternative structure. Rebuild and mini. | 2wiu represents the same gene. |
| 1hw7:A | 233-292 | 3m7m:X, 1xjh:A | Substitution with alternative structures 3m7m. Modeling of C-ter domain of 3m7m. Rebuild and mini. MD to relax the relative orientation of the N and C-ter domains. | 3m7m (same gene) is used because it is the closed state of , which occurs under no-stress conditions. 1xjh is used to model the missing C-ter domain. 3m7m and 1xjh are structurally aligned to 1vzy (other organism) and then merged together. |
| 1mxa:A | <10 | 1p7l:A | Substitution with alternative structure. Rebuild and mini. | 1p7l represents the same gene. |
| 1d2f:A | 1-29 | 4dgt:A | Substitution with alternative structure. Rebuild and mini. MD to relax the missing residues. | Coordinates missing from 1d2f are taken from 4dgt (different organism), after structural alignment of 1d2f and 4dgt. |
| 1xvi:A | 236-271 | | Predict model using I-TASSER. MD to relax the missing residues. | Missing residues are partly structured; I-TASSER predicted model is used as a template. |
| 1ng9:A | 801-853 | 3zlj:C | Modeling of C-ter domain of 1ng9. Rebuild and mini. MD simulation to find the relative orientation of the N- and C-ter domains. | 3zlj (same gene) corresponds to residues 823 to 853. Linker residues 801 to 822 are missing in 1ng9 and 3zlj, and are predicted to be unstructured. The linker is added to 1ng9; 3zlj is re-oriented in vmd to be attached to linker. |
| 1zym:A | 250-575 | 2kx9:A | Substitution with alternative structure. Rebuild and mini. | 2kx9 represents the same gene. |
| 2hnh:A | 911-1160 | 5fku:A | Modeling of missing residues. Rebuild and mini. MD to relax the missing residues. | Coordinates missing from 2hnh are taken from 5fku (same gene) after structural alignment of 5fku and 2hnh. Residues 927-937 are missing in both 2hnh and 5fku. They are predicted as unstructured. |

Multi-domain proteins subject to rebuilding and minimization phases only (pdb:chain): 2kfw:A 3ofo:D 4e8b:A 4kn7:C 3nxc:A 3gn5:B 1u0b:B 3pco:D 2fym:A 1qf6:A 2r5n:A 1uuf:A 3qou:A 4im7:A 4hr7:A 2hg2:A 1duv:G 1gyt:L 4dzd:A 1fui:A 2qvr:A 1w78:A 2ww4:A 1xru:A 4fzw:A 1cli:A 2ptq:A 1gqe:A 2h1f:A 4iwx:A 1gz0:C 4dcm:A 2id0:A 1ef9:A 2qcu:B 1glf:O 1ger:B


Table D.2. Replica exchange results for α-helical proteins

Experimental parameters and linear regression parameters:

| Protein | PDB ID | 푇exp (K) | ∆퐺exp (kcal/mol) | Ref. | 푚 | 푏 | Goodness of fit | 휂 |
|---|---|---|---|---|---|---|---|---|
| EC298 | 1RJK | 298 | -2.72 | [191] | -23.456 | 22.609 | 0.99927 | 1.080 |
| λ-repressor | 1LMB | 298 | -4.25 | [191] | -20.230 | 19.278 | 0.98838 | 1.163 |
| bACBP | 2ABD | 298 | -6.40 | [191] | -21.941 | 22.216 | 0.98517 | 1.304 |
| IM7 | 1CEI | 298 | -2.88 | [191] | -16.745 | 13.832 | 0.98901 | 0.998 |
| IM9 | 1IMQ | 298 | -6.81 | [191] | -18.480 | 17.167 | 0.98347 | 1.298 |
| Cytochrome-256b | 256B | 298 | -6.60 | [192] | -28.079 | 26.378 | 0.99558 | 1.174 |

REX input and output parameters [rotated table; values are listed in extracted order, and their assignment to individual proteins could not be recovered]:

REX input 휂: 1.000 1.000 1.050 1.100 1.200 1.200 1.300 1.400 1.200 1.255 1.350 1.000 1.150 1.200 1.050 1.100 1.150 1.300 1.350 1.400 1.150 1.300 1.600
Temperature (K): 301.8 302.6 311.2 319.0 335.0 321.0 336.4 351.6 316.4 323.8 336.8 313.6 341.2 351.6 307.0 315.0 322.8 345.8 353.8 362.2 321.8 341.0 378.4
Column with header fragment “eq”: 0.60 0.67 0.67 0.67 0.65 0.53 0.55 0.51 0.66 0.68 0.62 0.62 0.59 0.58 0.53 0.52 0.52 0.51 0.50 0.48 0.57 0.56 0.55
∆퐺sim (kcal/mol; magnitudes, minus signs detached in extraction): 1.18 2. 3.17 5.5 5.13 6.77 9.17 4.26 5.09 7.49 2.96 5.22 6.42 1.90 3.39 4.42 6.69 7.36 9.05 6.46 10.3 18.3; unplaced value: 0.798

Table D.3. Replica exchange results for β-sheet proteins

Experimental parameters and linear regression parameters:

| Protein | PDB ID | 푇exp (K) | ∆퐺exp (kcal/mol) | Ref. | 푚 | 푏 | 휂 |
|---|---|---|---|---|---|---|---|
| ABP1 SH3 | 1JO8 | 298 | -3.07 | [191] | -4.590 | 4.403 | 1.628 |
| Fyn SH3 | 1SHF | 293 | -6.68 | [191] | -11.149 | 9.680 | 1.467 |
| CspB Bc | 1C9O | 298 | -1.81 | [191] | -13.749 | 16.035 | 1.298 |
| CspA | 1MJC | 298 | -3.00 | [192] | -15.226 | 18.998 | 1.445 |
| Tenascin | 1TEN | 298 | -6.71 | [191] | -28.220 | 34.408 | 1.457 |
| Twitchin | 1WIU | 293 | -5.00 | [191] | -20.195 | 22.402 | 1.357 |

Goodness-of-fit values for the linear regressions [extracted order; assignment to proteins not recoverable]: 0.97134 0.99129 0.99 0.99883 0.99238 0.99358; unplaced fragment: 500

REX input and output parameters [rotated table; values are listed in extracted order, and their assignment to individual proteins could not be recovered]:

REX input 휂: 1.400 1.500 1.600 1.350 1.400 1.500 1.300 1.400 1.500 1.600 1.450 1.500 1.600 1.400 1.500 1.600 1.200 1.300 1.400; unplaced value: 1.350
Temperature (K): 316.6 324.0 338.0 352.6 333.2 339.8 355.4 311.4 326.2 340.6 354.4 317.4 324.0 337.2 329.0 343.8 360.4 301.4 314.8 329.0
Column with header fragment “eq”: 0.77 0.77 0.77 0.77 0.51 0.54 0.51 0.48 0.47 0.47 0.47 0.51 0.50 0.51 0.40 0.39 0.39 0.49 0.49 0.47
∆퐺sim (kcal/mol; magnitudes, minus signs detached in extraction): 1.75 2.06 2.52 2.91 5.48 5.76 7.10 1.69 3.38 4.71 5.83 3.02 3.93 5.33 5.16 7.81 10.8 1.73 4.06 5.77

Table D.4. Replica exchange results for α/β proteins

Experimental parameters and linear regression parameters:

| Protein | PDB ID | 푇exp (K) | ∆퐺exp (kcal/mol) | Ref. | 푚 | 푏 | Goodness of fit | 휂 |
|---|---|---|---|---|---|---|---|---|
| Hpr | 1POH | 298 | -5.46 | [191] | -21.377 | 19.419 | 0.94639 | 1.164 |
| Urm1 | 2QJL | 298 | -3.48 | [191] | -22.836 | 21.078 | 0.98269 | 1.075 |
| Src SH2 | 1SPR | 298 | -7.24 | [191] | -20.051 | 18.778 | 0.91770 | 1.298 |
| Azurin (apo) | 1E65 | 298 | -5.29 | [191] | -25.190 | 23.997 | 0.98516 | 1.163 |
| CheY | 3CHY | 298 | -5.60 | [192] | -58.105 | 52.554 | 0.99949 | 1.001 |
| Ribonuclease H | 2RN2 | 298 | -7.20 | [192] | -38.735 | 33.369 | 0.99945 | 1.047 |
| Dihydrofolate reductase | 1RX4 | 298 | -6.37 | [192] | -29.666 | 24.797 | 0.98363 | 1.051 |

REX input and output parameters [rotated table; values are listed in extracted order, and their assignment to individual proteins could not be recovered]:

REX input 휂: 1.150 1.050 0.930 1.050 1.200 1.230 1.100 1.200 1.300 1.400 1.100 1.190 1.200 1.300 1.100 1.200 1.300 1.400 1.000 1.100 1.200 1.100 1.200 1.300 1.000 1.100
Temperature (K): 323.0 303.4 305.2 308.6 330.2 335.6 312.6 331.0 348.8 366.2 317.6 324.8 329.0 347.6 312.2 328.0 345.8 362.2 315.2 335.6 354.2 332.4 348.4 366.0 315.0 331.4
Column with header fragment “eq”: 0.48 0.69 0.51 0.51 0.49 0.49 0.47 0.48 0.47 0.49 0.72 0.71 0.73 0.67 0.69 0.70 0.64 0.61 0.46 0.46 0.44 0.22 0.22 0.22 0.55 0.55
∆퐺sim (kcal/mol; magnitudes, minus signs detached in extraction): 5.77 2.55 3.01 2.79 6.03 6.71 3.74 6.56 9.04 10.5 3.72 4.53 5.01 7.67 4.10 5.49 8.72 11.5 5.48 11.5 17.1 9.19 13.2 16.9 4.49 7.99

Table D.5. Kinetic parameters for 18 training set proteins

[Rotated table; the series below are listed in extracted order, and the assignment of values to individual proteins could not be recovered. A further column, with header fragments “exp F / sim F =”, is too fragmented to reproduce.]

Proteins: IM7, IM9, EC298, λ-repressor, bACBP, Cytochrome-256b, ABP1 SH3, Fyn SH3, CspB Bc, CspA, Tenascin, Twitchin, Hpr, Urm1, Src SH2, Azurin (apo), CheY, Ribonuclease H
PDB IDs: 1E65 1CEI 1JO8 256B 2QJL 1SPR 1SHF 1C9O 2RN2 1IMQ 1MJC 1TEN 1WIU 1POH 1RYK 2ABD 3CHY 1LMB
Structural class: α β α α α α α β β β β β; α/β α/β α/β α/β α/β α/β
Residues: 80 66 69 86 85 86 58 59 69 89 93 85 99; 128 106 102 128 155
휂 from linear regression: 1.163 1.298 1.163 1.080 1.304 0.998 1.298 1.174 1.628 1.467 1.445 1.457 1.357 1.164 1.075 1.298 1.001 1.047
Closest simulated 휂: 1.200 1.300 1.200 1.100 1.350 1.000 1.300 1.150 1.600 1.500 1.450 1.500 1.400 1.150 1.100 1.300 1.000 1.100
푄 column with header fragment “eq”: 0.53 0.48 0.70 0.67 0.62 0.62 0.51 0.57 0.77 0.51 0.51 0.39 0.47 0.48 0.47 0.67 0.46 0.22
푄 column with header fragment “310”: 0.83 0.74 0.90 0.98 0.94 0.76 0.78 0.87 0.98 0.78 0.82 0.82 0.86 0.86 0.94 0.75 0.59; unplaced value: 1.0
푄kin: 0.68 0.61 0.80 0.83 0.78 0.69 0.64 0.72 0.89 0.75 0.64 0.61 0.65 0.67 0.66 0.81 0.61 0.41
Simulated folding time (ns): 8.38 6.44 7.15 6.57; 26.54 18.02 11.82 49.52 18.20 51.62 18.08 12.04 18.63 68.98 96.55 81.80; 203.40 284.55
Experimental folding time (s): 0.00003 0.00073 0.00735 0.00011 0.00095 0.00075 0.00066 0.00125 0.08547 0.01060 0.00503 0.16611 0.66667 0.06711 0.07576 0.00016 0.02653 1.35135
References for experimental folding times: [191] [191] [191] [191] [191] [193] [191] [191] [191] [194] [191] [191] [195] [191] [191] [191] [196] [197]

Table D.6. Single-domain protein data set information

| PDB ID | Number of residues | Name | Structural Class | Minimum η required for kinetic stability |
|---|---|---|---|---|
| 1A69 | 239 | Purine nucleoside phosphorylase DeoD-type | α/β | 1.359 |
| 1A6J | 163 | Nitrogen regulatory IIA protein | α/β | 1.114 |
| 1A82 | 225 | Dethiobiotin synthetase | α/β | 1.114 |
| 1AG9 | 176 | Flavodoxin 1 | α/β | 1.114 |
| 1AH9 | 72 | Translation initiation factor 1 | β | 2.480 |
| 1AKE | 214 | Adenylate kinase | α | 1.170 |
| 1B9L | 120 | Dihydroneopterin triphosphate 2'-epimerase | α/β | 1.359 |
| 1DCJ | 81 | Sulfur carrier protein TusA | α/β | 1.359 |
| 1DFU | 94 | 50S ribosomal protein L25 | α/β | 1.916 |
| 1DXE | 256 | 5-keto-4-deoxy-D-glucarate aldolase | α | 1.170 |
| 1EIX | 245 | Orotidine 5'-phosphate decarboxylase | α | 1.170 |
| 1EM8 | 147 | DNA polymerase III subunit chi | α/β | 1.114 |
| 1EUM | 165 | Bacterial non-heme ferritin | α | 1.170 |
| 1FJJ | 158 | UPF0098 protein YbhB | β | 1.442 |
| 1FM0 | 81 | sulfur carrier subunit | α/β | 1.359 |
| 1GQT | 309 | Ribokinase | α/β | 1.114 |
| 1GT7 | 274 | Rhamnulose-1-phosphate aldolase | α | 1.170 |
| 1H16 | 760 | Formate acetyltransferase 1 | α | 1.170 |
| 1H75 | 81 | Glutaredoxin-like protein NrdH | α/β | 1.359 |
| 1I6O | 220 | Beta-carbonic anhydrase | α | 1.170 |
| 1JNS | 93 | Peptidyl-prolyl cis-trans isomerase C | α/β | 1.170* |
| 1JW2 | 72 | Hemolysin expression-modulating protein Hha | α | 2.012 |
| 1JX7 | 117 | Protein YchN | α/β | 1.114 |
| 1K7J | 206 | Uncharacterized protein YciO | α/β | 1.114 |
| 1KO5 | 175 | Thermoresistant gluconokinase | α | 1.170 |
| 1L6W | 220 | Fructose-6-phosphate aldolase 1 | α | 1.170 |
| 1M3U | 264 | 3-methyl-2-oxobutanoate hydroxymethyltransferase | α | 1.170 |
| 1MZG | 138 | desulfuration protein SufE | α | 1.170 |
| 1NAQ | 112 | Divalent-cation tolerance protein CutA | α/β | 1.916 |
| 1ORO | 213 | Orotate phosphoribosyltransferase | α/β | 1.114 |
| 1P91 | 269 | 23S rRNA (guanine(745)-N(1))-methyltransferase | α/β | 1.359 |
| 1PF5 | 131 | RutC family protein YjgH | α/β | 1.916 |
| 1PMO | 466 | Glutamate decarboxylase beta | α | 1.170 |
| 1PSU | 140 | Acyl-coenzyme A thioesterase PaaI | α/β | 1.916 |
| 1Q5X | 161 | Regulator of RNase E activity A | α/β | 1.359 |
| 1QTW | 285 | Endonuclease 4 | α | 1.170 |
| 1RQJ | 299 | Farnesyl diphosphate synthase | α | 1.170 |
| 1SG5 | 84 | Protein rof | β | 1.170* |
| 1SV6 | 269 | 2-keto-4-pentenoate hydratase | α/β | 1.359 |
| 1T8K | 78 | Acyl carrier protein | α | 1.427 |
| 1U60 | 310 | Glutaminase 1 | α | 1.170 |
| 1W8G | 234 | Pyridoxal phosphate homeostasis protein | α | 1.170 |
| 1WOC | 104 | Primosomal replication protein N | β | 2.480 |
| 1XN7 | 78 | Probable [Fe-S]-dependent transcriptional repressor FeoC | α | 1.170* |
| 1YQQ | 277 | Purine nucleoside phosphorylase 2 | α/β | 1.114 |
| 1ZYL | 328 | Stress response kinase A | α | 1.170 |
| 1ZZM | 259 | Uncharacterized metal-dependent hydrolase YjjV | α | 1.170 |
| 2A6Q | 84 | Toxin YoeB | α/β | 1.359 |
| 2AXD | 76 | DNA polymerase III subunit theta | α | 1.170* |
| 2D1P | 119 | Protein TusC | α | 1.170 |
| 2FEK | 147 | Low molecular weight protein--phosphatase Wzb | α | 1.170 |
| 2GQR | 237 | Phosphoribosylaminoimidazole-succinocarboxamide synthase | α/β | 1.359 |
| 2HD3 | 95 | Ethanolamine utilization protein eutN | β | 1.759 |
| 2HGK | 109 | Hypothetical protein yqcC | α | 1.427 |
| 2HNA | 147 | Protein MioC | α/β | 1.916 |
| 2HO9 | 167 | Chemotaxis protein CheW | β | 1.442 |
| 2JEE | 81 | Cell division protein ZapB | α | 1.170* |
| 2JO6 | 108 | Nitrite reductase (NADH) small subunit | β | 2.480 |
| 2JRX | 75 | UPF0352 protein YejL | α | 1.170* |
| 2KC5 | 162 | Hydrogenase-2 operon protein HybE | α/β | 1.359 |
| 2O1C | 150 | Dihydroneopterin triphosphate diphosphatase | α/β | 1.114 |
| 2OQ3 | 147 | Mannitol-specific cryptic phosphotransferase enzyme IIA component | α/β | 1.359 |


Table D.6 continued

| PDB ID | Number of residues | Name | Structural Class | Minimum η required for kinetic stability |
|---|---|---|---|---|
| 2PTH | 194 | Peptidyl-tRNA hydrolase | α/β | 1.114 |
| 2UYJ | 129 | Putative reactive intermediate deaminase TdcF | α/β | 1.359 |
| 2V81 | 205 | 2-dehydro-3-deoxy-6-phosphogalactonate aldolase | α | 1.170 |
| 2YVA | 196 | DnaA initiator-associating protein DiaA | α | 1.170 |
| 3ASV | 248 | Short-chain dehydrogenase/reductase SDR | α | 1.170 |
| 3BMB | 136 | Regulator of nucleoside diphosphate kinase | α/β | 1.114 |
| 3HWO | 391 | Isochorismate synthase EntC | α/β | 1.359 |
| 3IV5 | 98 | DNA-binding protein fis | α | 1.170* |
| 3N1S | 119 | HIT-like protein hinT | α/β | 1.114 |
| 4A2C | 346 | Galactitol 1-phosphate 5-dehydrogenase | α/β | 1.359 |

*indicates an 휂 value that does not result in a stable domain or interface based on the criteria described in Section D.1.4


Table D.7. Multi-domain protein data set information

Each entry lists the PDB ID, number of residues, and name, followed by one line per domain or interface giving its residue range, structural class, and the minimum η required for kinetic stability.

1CLI, 345 residues, Phosphoribosylformylglycinamidine cyclo-ligase
  Domain 1: 1-170, α/β, 1.359
  Domain 2: 171-345, α/β, 1.114
  1|2 Interface, –, 1.235

1D2F, 390 residues, Protein MalY
  Domain 1: 1-41; 285-390, α, 1.427
  Domain 2: 42-284, α/β, 1.359
  1|2 Interface, –, 2.480

1DUV, 334 residues, Ornithine carbamoyltransferase subunit I
  Domain 1: 1-136; 320-334, α, 1.170
  Domain 2: 137-319, α, 1.170
  1|2 Interface, –, 1.235

1EF9, 261 residues, Methylmalonyl-CoA decarboxylase
  Domain 1: 1-202, α/β, 1.114
  Domain 2: 203-261, α, 1.170*
  1|2 Interface, –, 1.507*

1FTS, 497 residues, Signal recognition particle receptor FtsY
  Domain 1: 1-200, ID, 1.170
  Domain 2: 201-284, α, 1.170
  Domain 3: 285-497, α/β, 1.114
  1|2 Interface, –, 1.507*
  1|3 Interface, –, 1.507*
  2|3 Interface, –, 1.507*

1FUI, 591 residues, L-fucose isomerase
  Domain 1: 1-175, α/β, 1.114
  Domain 2: 176-340, α, 1.170
  Domain 3: 341-591, α/β, 1.114
  1|2 Interface, –, 1.235
  1|3 Interface, –, 1.235
  2|3 Interface, –, 2.124

1GER, 450 residues, Glutathione reductase
  Domain 1: 1-143; 262-337, α/β, 1.114
  Domain 2: 144-261, α/β, 1.114
  Domain 3: 338-450, α/β, 1.114
  1|2 Interface, –, 1.507*
  1|3 Interface, –, 2.124
  2|3 Interface, –, 1.235

1GLF, 502 residues, Glycerol kinase
  Domain 1: 1-254, α/β, 1.114
  Domain 2: 255-502, α/β, 1.114
  1|2 Interface, –, 1.235

1GQE, 365 residues, Peptide chain release factor RF2
  Domain 1: 1-120, α, 1.170
  Domain 2: 121-225; 317-365, α/β, 1.114
  Domain 3: 226-316, α/β, 1.114
  1|2 Interface, –, 1.507*
  1|3 Interface, –, 1.507*
  2|3 Interface, –, 1.507*

1GYT, 503 residues, Cytosol aminopeptidase
  Domain 1: 1-180, α/β, 1.114
  Domain 2: 181-503, α/β, 1.114
  1|2 Interface, –, 1.235

1GZ0, 243 residues, 23S rRNA (guanosine-2'-O-)-methyltransferase RlmB
  Domain 1: 1-78, α/β, 1.114
  Domain 2: 79-243, α/β, 1.114
  1|2 Interface, –, 1.507*

1KSF, 758 residues, ATP-dependent Clp protease ATP-binding subunit ClpA
  Domain 1: 1-155, α, 1.427
  Domain 2: 156-350, α/β, 1.916
  Domain 3: 351-438, α, 1.427
  Domain 4: 439-652, α/β, 1.359
  Domain 5: 653-758, α/β, 1.916
  1|3 Interface, –, 2.480
  2|3 Interface, –, 1.507*
  3|4 Interface, –, 1.507*
  4|5 Interface, –, 1.507*


Table D.7 continued

1NG9, 853 residues, DNA mismatch repair protein MutS
  Domain 1: 1-127, α/β, 1.114
  Domain 2: 128-265, α/β, 1.114
  Domain 3: 266-384; 538-566, α, 1.170
  Domain 4: 385-537, α, 1.427
  Domain 5: 567-811, α/β, 1.359
  Domain 6: 812-853, α, 2.012
  1|2 Interface, –, 1.507
  1|3 Interface, –, 1.507*
  2|3 Interface, –, 1.235
  2|5 Interface, –, 1.507*
  3|4 Interface, –, 1.235
  3|5 Interface, –, 1.235
  5|6 Interface, –, 1.507*

1P7L, 384 residues, S-adenosylmethionine synthase
  Domain 1: 1-10; 137-233, α/β, 1.114
  Domain 2: 11-105; 234-270, α/β, 1.359
  Domain 3: 106-136; 271-384, α/β, 1.114
  1|2 Interface, –, 1.507*
  1|3 Interface, –, 1.235
  2|3 Interface, –, 2.480

1QF6, 642 residues, --tRNA ligase
  Domain 1: 1-64, α/β, 2.480
  Domain 2: 130-188, α/β, 1.114
  Domain 3: 65-129; 189-234, α/β, 1.114
  Domain 4: 235-534, α/β, 1.114
  Domain 5: 535-642, α/β, 1.114
  1|2 Interface, –, 1.507*
  1|3 Interface, –, 1.507*
  2|3 Interface, –, 1.235
  3|4 Interface, –, 1.507
  4|5 Interface, –, 1.235

1SVT, 548 residues, 60 kDa
  Domain 1: 1-135; 411-548, α, 1.427
  Domain 2: 136-191; 373-410, α, 1.427
  Domain 3: 192-372, α/β, 1.114
  1|2 Interface, –, 2.124
  2|3 Interface, –, 1.507*

1T4B, 367 residues, Aspartate-semialdehyde dehydrogenase
  Domain 1: 1-134; 352-367, α/β, 1.114
  Domain 2: 135-351, α/β, 1.114
  1|2 Interface, –, 1.235

1U0B, 461 residues, Cysteine--tRNA ligase
  Domain 1: 1-305, α/β, 1.114
  Domain 2: 306-392, α, 1.170
  Domain 3: 393-461, α, 2.480
  1|2 Interface, –, 1.235
  1|3 Interface, –, 1.507*
  2|3 Interface, –, 1.235

1UUF, 349 residues, Aldehyde reductase YahK
  Domain 1: 1-158; 315-349, α/β, 1.114
  Domain 2: 159-314, α/β, 1.114
  1|2 Interface, –, 1.507*

1W78, 422 residues, Dihydrofolate synthase/folylpolyglutamate synthase
  Domain 1: 1-287, α/β, 1.114
  Domain 2: 288-422, α/β, 1.114
  1|2 Interface, –, 2.480

1XRU, 278 residues, 4-deoxy-L-threo-5-hexosulose-uronate ketol-isomerase
  Domain 1: 1-7; 30-143, β, 1.442
  Domain 2: 8-29; 144-278, β, 1.442
  1|2 Interface, –, 1.507

1XVI, 271 residues, Mannosyl-3-phosphoglycerate phosphatase
  Domain 1: 1-92; 187-271, α/β, 1.916
  Domain 2: 93-186, α/β, 1.916
  1|2 Interface, –, 2.480

2FYM, 432 residues, Enolase
  Domain 1: 1-127, α/β, 1.359
  Domain 2: 128-432, α/β, 1.114
  1|2 Interface, –, 2.480

2H1F, 326 residues, Lipopolysaccharide heptosyltransferase-1
  Domain 1: 1-163, α/β, 1.114
  Domain 2: 164-326, α/β, 1.114
  1|2 Interface, –, 1.235


Table D.7 continued

2HG2, 479 residues, Lactaldehyde dehydrogenase
  Domain 1: 1-253; 450-479, α/β, 1.114
  Domain 2: 254-449, α/β, 1.114
  1|2 Interface, –, 1.507*

2HNH, 1160 residues, DNA polymerase III subunit alpha
  Domain 1: 1-280, α, 1.170
  Domain 2: 281-404, α, 1.170
  Domain 3: 1079-1160, α/β, 1.916
  Domain 4: 558-931, α, 1.170
  Domain 5: 932-1078, α/β, 1.916
  Domain 6: 405-557, α/β, 1.114
  1|2 Interface, –, 1.507*
  1|6 Interface, –, 2.480
  2|4 Interface, –, 1.235
  2|6 Interface, –, 1.235
  3|5 Interface, –, 1.507*
  4|5 Interface, –, 1.507*
  4|6 Interface, –, 1.507*

2ID0, 644 residues, Exoribonuclease 2
  Domain 1: 1-82, α/β, 1.916
  Domain 2: 83-172, α/β, 1.359
  Domain 3: 173-557, α/β, 1.114
  Domain 4: 558-644, β, 1.442
  1|2 Interface, –, 2.480
  2|3 Interface, –, 2.124
  2|4 Interface, –, 1.507*
  3|4 Interface, –, 1.507*

2KFW, 196 residues, FKBP-type peptidyl-prolyl cis-trans isomerase SlyD
  Domain 1: 1-71; 126-196, α/β, 2.480
  Domain 2: 72-125, β, 1.170*
  1|2 Interface, –, 1.507*

2KX9, 575 residues, Phosphoenolpyruvate-protein phosphotransferase
  Domain 1: 1-20; 147-229, α/β, 2.480
  Domain 2: 21-146, α, 2.480*
  Domain 3: 230-575, α, 1.170
  1|2 Interface, –, 1.507*
  1|3 Interface, –, 1.507*

2PTQ, 456 residues, Adenylosuccinate lyase
  Domain 1: 1-117, α, 1.170
  Domain 2: 118-381; 446-456, α, 1.170
  Domain 3: 382-445, α, 1.170
  1|2 Interface, –, 1.235
  2|3 Interface, –, 2.124

2QCU, 501 residues, Aerobic glycerol-3-phosphate dehydrogenase
  Domain 1: 1-387, α/β, 1.114
  Domain 2: 388-501, α, 1.170
  1|2 Interface, –, 1.507*

2QVR, 332 residues, Fructose-1,6-bisphosphatase
  Domain 1: 1-194, α/β, 1.114
  Domain 2: 195-332, α, 1.170
  1|2 Interface, –, 1.507

2R5N, 663 residues, Transketolase 1
  Domain 1: 1-326, α, 1.170
  Domain 2: 327-537, α, 1.170
  Domain 3: 538-663, α/β, 1.114
  1|2 Interface, –, 1.235
  2|3 Interface, –, 1.507*

2WIU, 440 residues, Serine/threonine-protein kinase toxin HipA
  Domain 1: 1-183; 217-250, α/β, 1.114
  Domain 2: 184-216; 251-440, α, 1.170
  1|2 Interface, –, 1.507

2WW4, 283 residues, 4-diphosphocytidyl-2-C-methyl-D-erythritol kinase
  Domain 1: 1-164, α/β, 1.114
  Domain 2: 165-283, α/β, 1.114
  1|2 Interface, –, 2.124

3BRQ, 336 residues, HTH-type transcriptional regulator AscG
  Domain 1: 1-59, α, 1.170*
  Domain 2: 60-161, α/β, 1.114
  Domain 3: 162-336, α, 1.170
  1|2 Interface, –, 1.507*
  2|3 Interface, –, 1.235


Table D.7 continued

3GN5, 131 residues, Antitoxin MqsA
  Domain 1: 1-68, α/β, 2.480
  Domain 2: 69-131, α, 1.170
  1|2 Interface, –, 1.507*

3M7M, 292 residues, 33 kDa chaperonin
  Domain 1: 1-178, α/β, 1.359
  Domain 2: 179-229, α, 2.012
  Domain 3: 230-292, α, 1.170*
  1|2 Interface, –, 2.480
  1|3 Interface, –, 1.507*

3NXC, 198 residues, Nucleoid occlusion factor SlmA
  Domain 1: 1-62, α, 1.427
  Domain 2: 63-198, α, 1.170
  1|2 Interface, –, 2.124

3OFO, 206 residues, 30S ribosomal protein S4
  Domain 1: 1-96; 191-206, α, 2.012
  Domain 2: 97-190, α/β, 1.916
  1|2 Interface, –, 1.507*

3PCO, 795 residues, Phenylalanine--tRNA ligase beta subunit
  Domain 1: 1-38; 155-190, α/β, 1.359
  Domain 2: 39-154, β, 1.759
  Domain 3: 191-398, α/β, 1.359
  Domain 4: 399-480, α/β, 1.359
  Domain 5: 481-697, α/β, 1.916
  Domain 6: 698-795, α/β, 1.916
  1|2 Interface, –, 1.507*
  1|3 Interface, –, 1.507*
  1|4 Interface, –, 1.507*
  2|3 Interface, –, 1.507*
  3|4 Interface, –, 1.507
  5|6 Interface, –, 1.507*

3QOU, 284 residues, Chaperedoxin
  Domain 1: 1-110, α/β, 1.114
  Domain 2: 111-198, α, 1.170
  Domain 3: 199-284, α, 1.170
  2|3 Interface, –, 1.507*

4DCM, 378 residues, Ribosomal RNA large subunit methyltransferase G
  Domain 1: 1-185, α/β, 1.114
  Domain 2: 186-378, α/β, 1.114
  1|2 Interface, –, 1.507*

4DZD, 199 residues, CRISPR system Cascade subunit CasE
  Domain 1: 1-74, α/β, 1.359
  Domain 2: 75-199, α/β, 1.359
  1|2 Interface, –, 1.507*

4E8B, 243 residues, Ribosomal RNA small subunit methyltransferase E
  Domain 1: 1-71, β, 1.442
  Domain 2: 72-243, α/β, 1.114
  1|2 Interface, –, 1.507*

4FZW, 255 residues, 2,3-dehydroadipyl-CoA hydratase
  Domain 1: 1-197, α/β, 1.114
  Domain 2: 198-255, α, 1.170*
  1|2 Interface, –, 1.507*

4HR7, 449 residues, Biotin carboxylase
  Domain 1: 1-85, α/β, 1.114
  Domain 2: 131-203, α/β, 1.114
  Domain 3: 86-130; 204-449, α/β, 1.114
  1|3 Interface, –, 1.235
  2|3 Interface, –, 1.507*

4IM7, 486 residues, Hypothetical oxidoreductase ydfI
  Domain 1: 1-279, α/β, 1.114
  Domain 2: 280-486, α, 1.170
  1|2 Interface, –, 2.124

4IWX, 300 residues, Ribosomal protein S6--L-glutamate ligase
  Domain 1: 1-114, α/β, 1.114
  Domain 2: 115-181, α/β, 1.916
  Domain 3: 182-300, α/β, 1.114
  1|2 Interface, –, 1.507*
  1|3 Interface, –, 1.235
  2|3 Interface, –, 1.507


Table D.7 continued

4KN7, 1342 residues, DNA-directed RNA polymerase subunit beta
  Domain 1: 1-17; 796-825; 1060-1242, α/β, 1.359
  Domain 2: 18-152; 445-585; 657-713, α/β, 1.916
  Domain 3: 826-935; 1041-1059, β, 1.759
  Domain 4: 586-656, β, 1.759
  Domain 5: 714-795, β, 2.480
  Domain 6: 153-444, α/β, 1.359
  Domain 7: 936-1040, α, 1.427
  Domain 8: 1243-1342, α, 1.170*
  1|2 Interface, –, 2.124
  1|3 Interface, –, 1.507
  1|5 Interface, –, 1.507*
  1|8 Interface, –, 1.507*
  2|3 Interface, –, 1.507*
  2|4 Interface, –, 2.480
  2|5 Interface, –, 1.507
  2|6 Interface, –, 1.507*
  2|7 Interface, –, 1.507*
  3|5 Interface, –, 1.507*
  3|7 Interface, –, 1.507*
  3|8 Interface, –, 1.507*
  5|7 Interface, –, 1.507*

*indicates an 휂 value that does not result in a stable domain or interface based on the criteria described in Section D.1.4


Table D.8. Additional training set parameters

| α | 휂∗ | 푚 | 푏 | ∆퐺exp | ∆퐺class = 푚 ∗ 〈휂〉class + 푏 | ∆∆퐺 = ∆퐺exp − ∆퐺class |
|---|---|---|---|---|---|---|
| EC298 | 1.080 | -23.456 | 22.609 | -2.72 | -4.82 | 2.10 |
| λ-repressor | 1.163 | -20.230 | 19.278 | -4.25 | -4.38 | 0.13 |
| bACBP | 1.304 | -21.941 | 22.216 | -6.40 | -3.44 | -2.96 |
| IM7 | 0.998 | -16.745 | 13.832 | -2.88 | -5.75 | 2.87 |
| IM9 | 1.298 | -18.480 | 17.167 | -6.81 | -4.45 | -2.36 |
| cytochrome-256b | 1.174 | -28.079 | 26.378 | -6.60 | -6.46 | -0.14 |

〈휂〉class, α = 1.170

| β | 휂∗ | 푚 | 푏 | ∆퐺exp | ∆퐺class = 푚 ∗ 〈휂〉class + 푏 | ∆∆퐺 = ∆퐺exp − ∆퐺class |
|---|---|---|---|---|---|---|
| ABP1 SH3 | 1.628 | -4.590 | 4.403 | -3.07 | -2.22 | -0.85 |
| Fyn SH3 | 1.467 | -11.149 | 9.680 | -6.68 | -6.40 | -0.28 |
| CspB Bc | 1.298 | -13.749 | 16.035 | -1.81 | -3.79 | 1.98 |
| CspA | 1.445 | -15.226 | 18.998 | -3.00 | -2.96 | -0.04 |
| Tenascin | 1.457 | -28.220 | 34.408 | -6.71 | -6.29 | -0.42 |
| Twitchin | 1.357 | -20.195 | 22.402 | -5.00 | -6.72 | 1.72 |

〈휂〉class, β = 1.442

| α/β | 휂∗ | 푚 | 푏 | ∆퐺exp | ∆퐺class = 푚 ∗ 〈휂〉class + 푏 | ∆∆퐺 = ∆퐺exp − ∆퐺class |
|---|---|---|---|---|---|---|
| Hpr | 1.164 | -21.377 | 19.419 | -5.46 | -4.40 | -1.06 |
| Urm1 | 1.075 | -22.836 | 21.078 | -3.48 | -4.36 | 0.88 |
| Src SH2 | 1.298 | -20.051 | 18.778 | -7.24 | -3.56 | -3.68 |
| Azurin (apo) | 1.163 | -25.190 | 23.997 | -5.29 | -4.07 | -1.22 |
| CheY | 1.001 | -58.105 | 52.554 | -5.60 | -12.18 | 6.58 |
| Ribonuclease H | 1.047 | -38.735 | 33.369 | -7.20 | -9.79 | 2.59 |
| Dihydrofolate reductase | 1.051 | -29.666 | 24.797 | -6.37 | -8.26 | 1.89 |

〈휂〉class, α/β = 1.114

Note: All free energies, m, and b have units of kcal/mol.
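
The derived columns of Table D.8 follow directly from the column headers: ∆Gclass = m ∗ ⟨η⟩class + b and ∆∆G = ∆Gexp − ∆Gclass. The sketch below recomputes them for the α class. It assumes that ⟨η⟩class is the plain mean of the η* column, which is consistent with the tabulated values but is an inference from the numbers rather than a statement in the table. The tuple layout and the function names are illustrative only.

```python
# Illustrative sketch: recompute the derived columns of Table D.8 for the
# alpha class. The row layout and function names are assumptions made for
# this example; the numbers are copied from the table.

# (protein, eta_star, m, b, dG_exp, dG_class_tabulated, ddG_tabulated)
ALPHA = [
    ("EC298",           1.080, -23.456, 22.609, -2.72, -4.82,  2.10),
    ("lambda-repressor", 1.163, -20.230, 19.278, -4.25, -4.38,  0.13),
    ("bACBP",           1.304, -21.941, 22.216, -6.40, -3.44, -2.96),
    ("IM7",             0.998, -16.745, 13.832, -2.88, -5.75,  2.87),
    ("IM9",             1.298, -18.480, 17.167, -6.81, -4.45, -2.36),
    ("cytochrome-256b", 1.174, -28.079, 26.378, -6.60, -6.46, -0.14),
]


def class_mean_eta(rows):
    """Average the per-protein eta* values (assumed definition of <eta>_class)."""
    return sum(row[1] for row in rows) / len(rows)


def class_free_energy(m, b, eta_class):
    """dG_class = m * <eta>_class + b, in kcal/mol."""
    return m * eta_class + b


eta_class = class_mean_eta(ALPHA)

# Map each protein to (dG_class, ddG) recomputed from m, b, and dG_exp.
results = {}
for name, _eta_star, m, b, dg_exp, _dg_tab, _ddg_tab in ALPHA:
    dg_class = class_free_energy(m, b, eta_class)
    results[name] = (dg_class, dg_exp - dg_class)

for name, (dg_class, ddg) in results.items():
    print(f"{name:17s} dG_class = {dg_class:7.2f}  ddG = {ddg:6.2f}")
```

The unrounded class mean is used for the calculation. Using the three-decimal value 1.170 shifts some ∆Gclass entries by about 0.01 kcal/mol.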


Table D.9. η values used for each structural class and the training set overall

Structural Class  ⟨η⟩class  ⟨η⟩class+22%  ⟨η⟩class+72%
α                 1.170     1.427         2.012
β                 1.442     1.759         2.480
α/β               1.114     1.359         1.916
Overall           1.235     1.507         2.124
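
The second and third columns of Table D.9 are the class averages increased by 22% and 72%. A minimal sketch of that arithmetic is shown below. The dictionary keys and the helper name `scaled_eta` are illustrative choices; the base values are copied from the table.

```python
# Illustrative sketch: reproduce the +22% and +72% columns of Table D.9
# from the class-averaged eta values. Names are assumptions for this example.

ETA_CLASS = {
    "alpha": 1.170,
    "beta": 1.442,
    "alpha/beta": 1.114,
    "overall": 1.235,
}


def scaled_eta(eta, percent):
    """Increase eta by the given percentage and round to three decimals."""
    return round(eta * (1.0 + percent / 100.0), 3)


# Map each class to (base value, +22% value, +72% value).
table = {
    name: (eta, scaled_eta(eta, 22), scaled_eta(eta, 72))
    for name, eta in ETA_CLASS.items()
}

for name, (base, plus22, plus72) in table.items():
    print(f"{name:10s} {base:.3f} {plus22:.3f} {plus72:.3f}")
```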


Table D.10. Mean in silico dwell times calculated from Fluitt-Viljoen model

Index  Codon  Mean decoding time from  Mean in silico decoding time,
              Fluitt and Viljoen, ms   0.015 ps time steps
1      UUU    136                      846908
2      UUC    195                      1214317
3      UUG    50                       311363
4      UUA    157                      977681
5      UCU    55                       342500
6      UCC    246                      1531908
7      UCG    96                       597818
8      UCA    106                      660090
9      UGU    75                       467045
10     UGC    109                      678772
11     UGG    168                      1046181
12     UGA    12                       74727
13     UAU    53                       330045
14     UAC    77                       479500
15     UAG    19                       118318
16     UAA    11                       68500
17     CUU    260                      1619090
18     CUC    204                      1270363
19     CUG    35                       217954
20     CUA    286                      1780998
21     CCU    143                      890499
22     CCC    197                      1226772
23     CCG    134                      834454
24     CCA    237                      1475862
25     CGU    28                       174363
26     CGC    35                       217954
27     CGG    397                      2472225
28     CGA    34                       211727
29     CAU    296                      1843271
30     CAC    222                      1382453
31     CAG    231                      1438499
32     CAA    179                      1114681
33     GUU    26                       161909
34     GUC    208                      1295272
35     GUG    42                       261545
36     GUA    73                       454591
37     GCU    39                       242863
38     GCC    415                      2584316
39     GCG    44                       274000
40     GCA    83                       516863
41     GGU    35                       217954
42     GGC    49                       305136
43     GGG    81                       504409
44     GGA    324                      2017635
45     GAU    77                       479500
46     GAC    116                      722363
47     GAG    36                       224182
48     GAA    57                       354954
49     AUU    97                       604045
50     AUC    128                      797090
51     AUG    266                      1656453
52     AUA    128                      797090
53     ACU    55                       342500
54     ACC    153                      952772
55     ACG    129                      803318
56     ACA    178                      1108454
57     AGU    85                       529318
58     AGC    127                      790863
59     AGG    461                      2870770
60     AGA    190                      1183181
61     AAU    109                      678772
62     AAC    161                      1002590
63     AAG    102                      635181
64     AAA    76                       473272
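
The two numeric columns of Table D.10 differ by a nearly constant factor, so the in silico dwell times appear to be a uniform rescaling of the Fluitt and Viljoen decoding times. The sketch below estimates that factor from four rows of the table and converts step counts to simulated time using the 0.015 ps time step given in the column header. The scale factor is inferred from the tabulated numbers, not quoted from the text, and the variable and function names are illustrative.

```python
# Illustrative sketch: relate the two numeric columns of Table D.10.
# The scale factor is estimated from the table itself (an assumption),
# and only four representative rows are included.

TIME_STEP_PS = 0.015  # integration time step from the Table D.10 header

# (codon, mean decoding time in ms, mean in silico decoding time in steps)
ROWS = [
    ("UUU", 136, 846908),
    ("UUC", 195, 1214317),
    ("UGA", 12, 74727),
    ("AGG", 461, 2870770),
]


def steps_to_ns(steps):
    """Convert a number of integration steps to simulated nanoseconds."""
    return steps * TIME_STEP_PS / 1000.0


# Pooled estimate of integration steps per millisecond of real decoding time.
steps_per_ms = sum(steps for _, _, steps in ROWS) / sum(ms for _, ms, _ in ROWS)

for codon, ms, steps in ROWS:
    predicted = round(steps_per_ms * ms)
    print(f"{codon}: {ms} ms -> {predicted} steps (tabulated {steps}), "
          f"{steps_to_ns(steps):.2f} ns simulated")
```

On these rows the pooled estimate is roughly 6227 steps per millisecond and reproduces each tabulated step count to within one step.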


Table D.11. Kinetically trapped inter-domain interfaces

PDB ID  Domain  %trapped  Mean folding  Mean simulation time,  Mean synthesis  Trapping      Trapping
                          time, ns      subset of trapped      time, ns        duration, ns  duration, s
                                        trajs, ns
1DUV    1|2     4         3793          14036                  3749            10287         41
1FUI    1|2     16        6100          15020                  6769            8251          33
1FUI    2|3     12        6788          15329                  6769            8560          34
1FUI    1|3     60        9717          13408                  6769            6639          26
1GLF    1|2     42        9969          14442                  5203            9239          37
1GYT    1|2     78        7927          16950                  5336            11614         46
1KSF    2|3     2         4871          13501                  7855            5646          22
1NG9    3|5     16        7615          12324                  8769            3555          14
1QF6    2|3     8         3303          12878                  6754            6124          24
1SVT    1|2     16        4606          11009                  4662            6347          25
1SVT    2|3     4         4843          11281                  4662            6618          26
1T4B    1|2     16        7957          15626                  4255            11371         45
1U0B    2|3     2         4578          12215                  4761            7454          30
1W78    1|2     8         4508          12752                  4485            8267          33
1XVI    1|2     2         3109          17349                  3230            14119         56
2FYM    1|2     2         4093          10694                  3786            6908          27
2HG2    1|2     2         5581          13774                  5256            8517          34
2WW4    1|2     2         3128          14417                  3227            11191         44
3PCO    1|4     26        6308          12773                  7466            5307          21
3PCO    3|4     40        5901          12837                  7466            5371          21
4IWX    1|3     42        15612         29539                  3163            26377         105
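
In every row of Table D.11 the trapping duration in nanoseconds equals, to within 1 ns of rounding, the mean simulation time of the trapped trajectories minus the mean synthesis time. The sketch below checks this for four rows. The relationship is inferred from the tabulated numbers rather than quoted from the text, and the function name is illustrative. The conversion to seconds in the final column is not reproduced here because its scale factor is not given in the table.

```python
# Illustrative sketch: check that trapping duration (ns) in Table D.11 is
# consistent with mean simulation time minus mean synthesis time.
# This relationship is an inference from the tabulated values.

# (PDB ID, interface, mean simulation time of trapped trajectories in ns,
#  mean synthesis time in ns, tabulated trapping duration in ns)
ROWS = [
    ("1DUV", "1|2", 14036, 3749, 10287),
    ("1FUI", "1|3", 13408, 6769, 6639),
    ("1GYT", "1|2", 16950, 5336, 11614),
    ("4IWX", "1|3", 29539, 3163, 26377),
]


def trapping_duration_ns(mean_simulation_ns, mean_synthesis_ns):
    """Time spent trapped after synthesis completes, in nanoseconds."""
    return mean_simulation_ns - mean_synthesis_ns


durations = {
    (pdb_id, interface): trapping_duration_ns(sim, synth)
    for pdb_id, interface, sim, synth, _tabulated in ROWS
}

for (pdb_id, interface), value in durations.items():
    print(f"{pdb_id} {interface}: {value} ns")
```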


REFERENCES

1. Komar, A. A. A pause for thought along the co-translational folding pathway. Trends Biochem. Sci. 34, 16–24 (2009).
2. Nicola, A. V., Chen, W. & Helenius, A. Co-translational folding of an alphavirus capsid protein in the cytosol of living cells. Nat. Cell Biol. 1, 341–345 (1999).
3. Gloge, F., Becker, A. H., Kramer, G. & Bukau, B. Co-translational mechanisms of protein maturation. Curr. Opin. Struct. Biol. 24, 24–33 (2014).
4. Pechmann, S., Willmund, F. & Frydman, J. The Ribosome as a Hub for Protein Quality Control. Mol. Cell 49, 411–421 (2013).
5. Walter, P. & Johnson, A. Signal Sequence Recognition and To the Membrane. Annu. Rev. Cell Biol. 10, 87–119 (1994).
6. Comyn, S. A., Chan, G. T. & Mayor, T. False start: Cotranslational protein ubiquitination and cytosolic protein quality control. J. Proteomics 100, 92–101 (2014).
7. Duttler, S., Pechmann, S. & Frydman, J. Principles of cotranslational ubiquitination and quality control at the ribosome. Mol. Cell 50, 379–393 (2013).
8. Ruiz-Canada, C., Kelleher, D. J. & Gilmore, R. Cotranslational and Posttranslational N-Glycosylation of Polypeptides by Distinct Mammalian OST Isoforms. Cell 136, 272–283 (2009).
9. Sauna, Z. E. & Kimchi-Sarfaty, C. Understanding the contribution of synonymous mutations to human disease. Nat. Rev. Genet. 12, 683–691 (2011).
10. Lehmann, J. & Libchaber, A. Degeneracy of the genetic code and stability of the base pair at the second position of the anticodon. RNA 14, 1264–1269 (2008).
11. Sørensen, M. A. & Pedersen, S. Absolute in vivo translation rates of individual codons in Escherichia coli. The two glutamic acid codons GAA and GAG are translated with a threefold difference in rate. J. Mol. Biol. 222, 265–280 (1991).
12. Fluitt, A., Pienaar, E. & Viljoen, H. Ribosome kinetics and aa-tRNA competition determine rate and fidelity of peptide synthesis. Comput. Biol. Chem. 31, 335–346 (2007).
13. Spencer, P. S., Siller, E., Anderson, J. F. & Barral, J. M. Silent substitutions predictably alter translation elongation rates and protein folding efficiencies. J. Mol. Biol. 422, 328–335 (2012).
14. Makhoul, C. H. & Trifonov, E. N. Distribution of rare triplets along mRNA and their relation to protein folding. J. Biomol. Struct. Dyn. 20, 413–420 (2002).
15. Clarke, T. F. & Clark, P. L. Increased incidence of rare codon clusters at 5’ and 3’ gene termini: implications for function. BMC Genomics 11, 118 (2010).
16. Clarke IV, T. F. & Clark, P. L. Rare codons cluster. PLoS One 3, (2008).
17. Frydman, J., Erdjument-Bromage, H., Tempst, P. & Ulrich Hartl, F. Co-translational domain folding as the structural basis for the rapid de novo folding of firefly luciferase. Nat. Struct. Biol. 6, 697–705 (1999).
18. Chang, H. C., Kaiser, C. M., Hartl, F. U. & Barral, J. M. De novo folding of GFP fusion proteins: High efficiency in eukaryotes but not in bacteria. J. Mol. Biol. 353, 397–409 (2005).
19. Evans, M. S., Sander, I. M. & Clark, P. L. Cotranslational Folding Promotes β-Helix Formation and Avoids Aggregation In Vivo. J. Mol. Biol. 383, 683–692 (2008).
20. Sánchez, I. E., Morillas, M., Zobeley, E., Kiefhaber, T. & Glockshuber, R. Fast folding of the two-domain Semliki Forest virus capsid protein explains co-translational proteolytic activity. J. Mol. Biol. 338, 159–167 (2004).
21. Siller, E., DeZwaan, D. C., Anderson, J. F., Freeman, B. C. & Barral, J. M. Slowing Bacterial Translation Speed Enhances Eukaryotic Protein Folding Efficiency. J. Mol. Biol. 396, 1310–1318 (2010).
22. Zhang, G., Hubalewska, M. & Ignatova, Z. Transient ribosomal attenuation coordinates protein synthesis and co-translational folding. Nat. Struct. Mol. Biol. 16, 274–280 (2009).
23. Sander, I. M., Chaney, J. L. & Clark, P. L. Expanding Anfinsen’s principle: Contributions of synonymous codon selection to rational protein design. J. Am. Chem. Soc. 136, 858–861 (2014).
24. Saunders, R. & Deane, C. M. Synonymous codon usage influences the local protein structure observed. Nucleic Acids Res. 38, 6719–6728 (2010).
25. O’Brien, E. P., Vendruscolo, M. & Dobson, C. M. Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates. Nat. Commun. 5, (2014).
26. Cortazzo, P. et al. Silent mutations affect in vivo protein folding in Escherichia coli. Biochem. Biophys. Res. Commun. 293, 537–541 (2002).
27. Komar, A. A., Lesnik, T. & Reiss, C. Synonymous codon substitutions affect ribosome traffic and protein folding during in vitro translation. FEBS Lett. 462, 387–391 (1999).
28. Sherman, M. Y. & Qian, S. B. Less is more: Improving proteostasis by translation slow down. Trends Biochem. Sci. 38, 585–591 (2013).
29. Zhou, M. et al. Non-optimal codon usage affects expression, structure and function of clock protein FRQ. Nature 494, 111–115 (2013).
30. Fedyunin, I. et al. tRNA concentration fine tunes protein solubility. FEBS Lett. 586, 3336–3340 (2012).
31. Kimchi-Sarfaty, C. et al. A ‘silent’ polymorphism in the MDR1 gene changes substrate specificity. Science 315, 525–529 (2007).
32. Angov, E., Hillier, C. J., Kincaid, R. L. & Lyon, J. A. Heterologous protein expression is enhanced by harmonizing the codon usage frequencies of the target gene with those of the expression host. PLoS One 3, 1–10 (2008).
33. Maertens, B. et al. Gene optimization mechanisms: A multi-gene study reveals a high success rate of full-length human proteins expressed in Escherichia coli. Protein Sci. 19, 1312–1326 (2010).
34. Bartoszewski, R. A. et al. A synonymous single nucleotide polymorphism in ΔF508 CFTR alters the secondary structure of the mRNA and the expression of the mutant protein. J. Biol. Chem. 285, 28741–28748 (2010).
35. Chiti, F. & Dobson, C. M. Protein Misfolding, Functional Amyloid, and Human Disease. Annu. Rev. Biochem. 75, 333–366 (2006).
36. Zhang, D. & Shan, S. O. Translation elongation regulates substrate selection by the signal recognition particle. J. Biol. Chem. 287, 7652–7660 (2012).
37. Goder, V. & Spiess, M. Molecular mechanism of signal sequence orientation in the endoplasmic reticulum. EMBO J. 22, 3645–3653 (2003).
38. Conti, B. J., Elferich, J., Yang, Z., Shinde, U. & Skach, W. R. Cotranslational folding inhibits translocation from within the ribosome-Sec61 translocon complex. Nat. Struct. Mol. Biol. 21, 228–235 (2014).
39. Arfin, S. M. & Bradshaw, R. A. Cotranslational Processing and Protein Turnover in Eukaryotic Cell. Biochemistry 27, 7979–7984 (1988).
40. Turner, G. C. & Varshavsky, A. Detecting and Measuring Protein Degradation. Adv. Sci. 289, 2117–2120 (2010).
41. Kramer, G., Boehringer, D., Ban, N. & Bukau, B. The ribosome as a platform for co-translational processing, folding and targeting of newly synthesized proteins. Nat. Struct. Mol. Biol. 16, 589–597 (2009).
42. Meinnel, T., Mechulam, Y. & Blanquet, S. Methionine as translation start signal: A review of the enzymes of the pathway in Escherichia coli. Biochimie 75, 1061–1075 (1993).
43. Zhang, F., Saha, S., Shabalina, S. A. & Kashina, A. Differential Arginylation of Actin Sequence – Dependent Degradation. Science 1065, 1534–37 (2010).
44. Knobe, K. E., Sjörin, E. & Ljung, R. C. R. Why does the mutation G17736A/Val107Val (silent) in the F9 gene cause mild haemophilia B in five Swedish families? Haemophilia 14, 723–728 (2008).
45. Supek, F., Miñana, B., Valcárcel, J., Gabaldón, T. & Lehner, B. Synonymous mutations frequently act as driver mutations in human cancers. Cell 156, 1324–1335 (2014).
46. O’Brien, E. P., Vendruscolo, M. & Dobson, C. M. Prediction of variable translation rate effects on cotranslational protein folding. Nat. Commun. 3, (2012).
47. Powers, E. T., Powers, D. L. & Gierasch, L. M. FoldEco: A Model for Proteostasis in E. coli. Cell Rep. 1, 265–276 (2012).
48. O’Brien, E. P., Hsu, S. T. D., Christodoulou, J., Vendruscolo, M. & Dobson, C. M. Transient tertiary structure formation within the ribosome exit port. J. Am. Chem. Soc. 132, 16928–16937 (2010).
49. O’Brien, E. P., Christodoulou, J., Vendruscolo, M. & Dobson, C. M. New scenarios of protein folding can occur on the ribosome. J. Am. Chem. Soc. 133, 513–526 (2011).
50. O’Brien, E. P., Christodoulou, J., Vendruscolo, M. & Dobson, C. M. Trigger factor slows Co-translational folding through kinetic trapping while sterically protecting the nascent chain from aberrant cytosolic interactions. J. Am. Chem. Soc. 134, 10920–10932 (2012).
51. Elcock, A. H. Molecular simulations of cotranslational protein folding: Fragment stabilities, folding cooperativity, and trapping in the ribosome. PLoS Comput. Biol. 2, 0824–0841 (2006).
52. Zhang, B. & Miller, T. F. Long-Timescale Dynamics and Regulation of Sec-Facilitated Protein Translocation. Cell Rep. 2, 927–937 (2012).
53. Gumbart, J. C., Teo, I., Roux, B. & Schulten, K. Reconciling the roles of kinetic and thermodynamic factors in membrane-protein insertion. J. Am. Chem. Soc. 135, 2291–2297 (2013).
54. Zhang, B. & Miller, T. F. Direct simulation of early-stage sec-facilitated protein translocation. J. Am. Chem. Soc. 134, 13700–13707 (2012).
55. O’Brien, E. P., Ciryam, P., Vendruscolo, M. & Dobson, C. M. Understanding the influence of codon translation rates on cotranslational protein folding. Acc. Chem. Res. 47, 1536–1544 (2014).
56. Kowarik, M., Küng, S., Martoglio, B. & Helenius, A. Protein folding during cotranslational translocation in the endoplasmic reticulum. Mol. Cell 10, 769–778 (2002).
57. Eichmann, C., Preissler, S., Riek, R. & Deuerling, E. Cotranslational structure acquisition of nascent polypeptides monitored by NMR spectroscopy. Proc. Natl. Acad. Sci. 107, 9111–9116 (2010).
58. Kim, S. J. et al. Protein folding. Translational tuning optimizes nascent protein folding in cells. Science 348, 444–448 (2015).
59. Braakman, I., Hoover-Litty, H., Wagner, K. R. & Helenius, A. Folding of influenza hemagglutinin in the endoplasmic reticulum. J. Cell Biol. 114, 401–411 (1991).
60. Reuveni, S., Meilijson, I., Kupiec, M., Ruppin, E. & Tuller, T. Genome-scale analysis of translation elongation with a ribosome flow model. PLoS Comput. Biol. 7, (2011).
61. Margaliot, M. & Tuller, T. On the steady-state distribution in the homogeneous ribosome flow model. IEEE/ACM Trans. Comput. Biol. Bioinforma. 9, 1724–1736 (2012).
62. Tinoco, I. & Wen, J.-D. Simulation and analysis of single-ribosome translation. Phys. Biol. 6, (2009).
63. Ingolia, N. T., Lareau, L. F. & Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789–802 (2011).
64. Kosolapov, A. & Deutsch, C. Tertiary interactions within the ribosomal exit tunnel. Nat. Struct. Mol. Biol. 16, 405–411 (2009).
65. Hoffmann, A. et al. Concerted Action of the Ribosome and the Associated Chaperone Trigger Factor Confines Nascent Polypeptide Folding. Mol. Cell 48, 63–74 (2012).
66. Kaiser, C. M., Goldman, D. H. & Chodera, J. D. The Ribosome Modulates Nascent Protein Folding. Science 334, 1723–1727 (2011).
67. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. supplementary material. Science 324, 218–23 (2009).
68. Gardin, J. et al. Measurement of average decoding rates of the 61 sense codons in vivo. Elife 3, 1–20 (2014).
69. Stadler, M. & Fire, A. Wobble base-pairing slows in vivo translation elongation in metazoans. RNA 17, 2063–2073 (2011).
70. Dana, A. & Tuller, T. The effect of tRNA levels on decoding times of mRNA codons. Nucleic Acids Res. 42, 9171–9181 (2014).
71. Han, Y. et al. Monitoring cotranslational protein folding in mammalian cells at codon resolution. Proc. Natl. Acad. Sci. 109, 12467–12472 (2012).
72. Ciryam, P., Morimoto, R. I., Vendruscolo, M., Dobson, C. M. & O’Brien, E. P. In vivo translation rates can substantially delay the cotranslational folding of the Escherichia coli cytosolic proteome. Proc. Natl. Acad. Sci. 110, E132–E140 (2013).
73. De Sancho, D. & Muñoz, V. Integrated prediction of protein folding and unfolding rates from only size and structural class. Phys. Chem. Chem. Phys. 13, 17030–17043 (2011).
74. Gillespie, D. T. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361 (1977).
75. Liang, S.-T., Xu, Y.-C., Dennis, P. & Bremer, H. mRNA Composition and Control of Bacterial Gene Expression. J. Bacteriol. 182, 3037–3044 (2000).
76. Maizel, J. V. Synthesis Assembly of Adenovirus. 000, (1967).
77. Netzer, W. J. & Hartl, F. U. Recombination of protein domains facilitated by co-translational folding in eukaryotes. Nature 388, 343–349 (1997).
78. Agashe, V. R. et al. Function of trigger factor and DnaK in multidomain protein folding: increase in yield at the expense of folding speed. Trends Biochem. Sci. 117, 199–209 (2004).
79. Rutkowska, A. et al. Dynamics of trigger factor interaction with translating ribosomes. J. Biol. Chem. 283, 4124–4132 (2008).
80. Hunt, R. C., Simhadri, V. L., Iandoli, M., Sauna, Z. E. & Kimchi-Sarfaty, C. Exposing synonymous mutations. Trends Genet. 30, 308–321 (2014).
81. Pechmann, S. & Frydman, J. Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding. Nat. Struct. Mol. Biol. 20, 237–243 (2013).
82. Holtkamp, W. et al. Cotranslational protein folding on the ribosome monitored in real time. Science 350, 1104–1107 (2015).
83. Kajihara, D. et al. FRET analysis of protein conformational change through position-specific incorporation of fluorescent amino acids. Nat. Methods 3, 923–929 (2006).
84. Mercier, E. & Rodnina, M. V. Co-Translational Folding Trajectory of the HemK Helical Domain. Biochemistry 57, 3460–3464 (2018).
85. Nilsson, O. B. et al. Cotranslational folding of spectrin domains via partially structured states. Nat. Struct. Mol. Biol. 24, 221–225 (2017).
86. Flory, P. J. Principles of Polymer Chemistry. (Cornell Univ. Press, 1953). doi:10.1126/science.119.3095.555-a
87. Stirnemann, G., Giganti, D., Fernandez, J. M. & Berne, B. J. Elasticity, structure, and relaxation of extended proteins under force. Proc. Natl. Acad. Sci. 110, 3847–3852 (2013).
88. Schlierf, M., Li, H. & Fernandez, J. M. The unfolding kinetics of ubiquitin captured with single-molecule force-clamp techniques. Proc. Natl. Acad. Sci. 101, 7299–7304 (2004).
89. Walther, K. A. et al. Signatures of hydrophobic collapse in extended proteins captured with force spectroscopy. Proc. Natl. Acad. Sci. 104, 7916–7921 (2007).
90. Flory, P. J. & Volkenstein, M. Statistical mechanics of chain molecules. (Interscience, 1969). doi:10.1002/bip.1969.360080514


91. Dill, K. A. & Shortle, D. Denatured States of Proteins. Annu. Rev. Biochem. 60, 795–825 (1991).
92. Caniparoli, L. & O’Brien, E. P. Modeling the effect of codon translation rates on co-translational protein folding mechanisms of arbitrary complexity. J. Chem. Phys. 142, (2015).
93. Fritch, B. et al. Origins of the Mechanochemical Coupling of Formation to Protein Synthesis. J. Am. Chem. Soc. 140, 5077–5087 (2018).
94. Thommen, M., Holtkamp, W. & Rodnina, M. V. Co-translational protein folding: progress and methods. Curr. Opin. Struct. Biol. 42, 83–89 (2017).
95. Merchant, K. A., Best, R. B., Louis, J. M., Gopich, I. V. & Eaton, W. A. Characterizing the unfolded states of proteins using single-molecule FRET spectroscopy and molecular simulations. Proc. Natl. Acad. Sci. U. S. A. 104, 1528–1533 (2007).
96. Best, R. B., Hofmann, H., Nettels, D. & Schuler, B. Quantitative Interpretation of FRET Experiments via Molecular Simulation: Force Field and Validation. Biophys. J. 108, 2721–2731 (2015).
97. Henry, E. R. & Hochstrasser, R. M. Molecular dynamics simulations of fluorescence polarization of in myoglobin. Proc. Natl. Acad. Sci. U. S. A. 84, 6142–6146 (1987).
98. Davies, S. W. et al. Formation of neuronal intranuclear inclusions underlies the neurological dysfunction in mice transgenic for the HD mutation. Cell 90, 537–548 (1997).
99. Tobin, A. J. & Signer, E. R. Huntington’s disease: The challenge for cell biologists. Trends Cell Biol. 10, 531–536 (2000).
100. Squitieri, F. et al. DNA haplotype analysis of huntington disease reveals clues to the origins and mechanisms of CAG expansion and reasons for geographic variations of prevalence. Hum. Mol. Genet. 3, 2103–2114 (1994).
101. Brinkman, R. R., Mezei, M. M., Theilmann, J., Almqvist, E. & Hayden, M. R. The likelihood of being affected with Huntington disease by a particular age, for a specific CAG size. Am. J. Hum. Genet. 60, 1202–10 (1997).
102. Lee, J. M. et al. CAG repeat expansion in Huntington disease determines age at onset in a fully dominant fashion. Neurology 78, 690–695 (2012).
103. Li, S. H. & Li, X. J. Aggregation of N-terminal huntingtin is dependent on the length of its glutamine repeats. Hum. Mol. Genet. 7, 777–782 (1998).
104. Yano, H. et al. Inhibition of mitochondrial protein import by mutant huntingtin. Nat. Neurosci. 17, 822–831 (2014).
105. Atwal, R. S. et al. Huntingtin has a membrane association signal that can modulate huntingtin aggregation, nuclear entry and toxicity. Hum. Mol. Genet. 16, 2600–2615 (2007).
106. Cattaneo, E., Zuccato, C. & Tartari, M. Normal huntingtin function: An alternative approach to Huntington’s disease. Nat. Rev. Neurosci. 6, 919–930 (2005).
107. Maiuri, T., Woloshansky, T., Xia, J. & Truant, R. The huntingtin N17 domain is a multifunctional CRM1 and ran-dependent nuclear and cilial export signal. Hum. Mol. Genet. 22, 1383–1394 (2013).
108. Rockabrand, E. et al. The first 17 amino acids of Huntingtin modulate its sub-cellular localization, aggregation and effects on calcium homeostasis. Hum. Mol. Genet. 16, 61–77 (2007).
109. Crick, S. L., Ruff, K. M., Garai, K., Frieden, C. & Pappu, R. V. Unmasking the roles of N- and C-terminal flanking sequences from exon 1 of huntingtin as modulators of polyglutamine aggregation. Proc. Natl. Acad. Sci. 110, 20075–20080 (2013).
110. Suhr, S. T. et al. Identities of sequestered proteins in aggregates from cells with induced polyglutamine expression. J. Cell Biol. 153, 283–294 (2001).
111. Chow, W. N. V., Luk, H. W., Chan, H. Y. E. & Lau, K.-F. Degradation of mutant huntingtin via the ubiquitin/proteasome system is modulated by FE65. Biochem. J. 443, 681–689 (2012).
112. Landles, C. et al. Proteolysis of mutant huntingtin produces an exon 1 fragment that accumulates as an aggregated protein in neuronal nuclei in huntington disease. J. Biol. Chem. 285, 8808–8823 (2010).
113. El-Daher, M.-T. et al. Huntingtin proteolysis releases non-polyQ fragments that cause toxicity through dynamin 1 dysregulation. EMBO J. 34, 2255–2271 (2015).
114. Truant, R., Atwal, R. & Burtnik, A. Hypothesis: huntingtin may function in membrane association and vesicular trafficking. Biochem. Cell Biol. 84, 912–917 (2006).
115. McLaughlin, B. A., Spencer, C. & Eberwine, J. CAG trinucleotide RNA repeats interact with RNA-binding proteins. Am. J. Hum. Genet. 59, 561–9 (1996).
116. Gil, J. M. & Rego, A. C. Mechanisms of Neurodegeneration in Huntington’s Disease. Eur. J. Neurosci. 27, 2803–2820 (2008).
117. Oh, W. J. et al. mTORC2 can associate with ribosomes to promote cotranslational phosphorylation and stability of nascent Akt polypeptide. EMBO J. 29, 3939–3951 (2010).
118. Keshwani, M. M. et al. Cotranslational cis-phosphorylation of the COOH-terminal tail is a key priming step in the maturation of cAMP-dependent protein kinase. Proc. Natl. Acad. Sci. 109, E1221–E1229 (2012).
119. Polevoda, B. & Sherman, F. Nα-terminal Acetylation of Eukaryotic Proteins. J. Biol. Chem. 275, 36479–36482 (2000).


120. Pechmann, S., Chartron, J. W. & Frydman, J. Local slowdown of translation by nonoptimal codons promotes nascent-chain recognition by SRP in vivo. Nat. Struct. Mol. Biol. 21, 1100–1105 (2014).
121. Nissley, D. A. & O’Brien, E. P. Timing is everything: Unifying Codon translation rates and nascent proteome behavior. J. Am. Chem. Soc. 136, 17892–17898 (2014).
122. Noriega, T. R. et al. Signal recognition particle-ribosome binding is sensitive to nascent chain length. J. Biol. Chem. 289, 19294–19305 (2014).
123. Sandikci, A. et al. Dynamic enzyme docking to the ribosome coordinates N-terminal processing with polypeptide folding. Nat. Struct. Mol. Biol. 20, 843–850 (2013).
124. Maier, T., Ferbitz, L., Deuerling, E. & Ban, N. A cradle for new proteins: Trigger factor at the ribosome. Curr. Opin. Struct. Biol. 15, 204–212 (2005).
125. Hoffmann, A., Bukau, B. & Kramer, G. Structure and function of the molecular chaperone Trigger Factor. Biochim. Biophys. Acta - Mol. Cell Res. 1803, 650–661 (2010).
126. Pop, C. et al. Causal signals between codon bias, mRNA structure, and the efficiency of translation and elongation. Mol. Syst. Biol. 10, 770–770 (2014).
127. Yusupov, M. M. et al. Crystal Structure of the Ribosome at 5.5 Å Resolution. Science 292, 883–896 (2001).
128. Pavlov, M. Y. et al. Slow peptide bond formation by proline and other N-alkylamino acids in translation. Proc. Natl. Acad. Sci. 106, 50–54 (2009).
129. Artieri, C. G. & Fraser, H. B. Accounting for biases in riboprofiling data indicates a major role for proline in stalling translation. Genome Res. 24, 2011–2021 (2014).
130. DiFiglia, M., Sapp, E., Chase, K., Davies, S. & Bates, G. Aggregation of Huntingtin in Neuronal Intranuclear Inclusions and Dystrophic Neurites in Brain. Science 277, 1990–1993 (1997).
131. Huang, C. C. et al. Amyloid formation by mutant huntingtin: Threshold, progressivity and recruitment of normal polyglutamine proteins. Somat. Cell Mol. Genet. 24, 217–233 (1998).
132. Qi, L. & Zhang, X. Role of chaperone-mediated autophagy in degrading Huntington’s disease-associated huntingtin protein. Acta Biochem. Biophys. 83–91 (2014). doi:10.1093/abbs/gmt133.Advance
133. Seo, H., Sonntag, K. C. & Isacson, O. Generalized brain and skin proteasome inhibition in Huntington’s disease. Ann. Neurol. 56, 319–328 (2004).
134. Hunter, J. M., Lesort, M. L. & Johnson, G. V. Ubiquitin-proteasome system alterations in a striatal cell model of Huntington’s Disease. J. Neurosci. Res. 85, 1774–1788 (2007).
135. Fishbain, S. et al. Sequence composition of disordered regions fine-tunes protein half-life. Nat. Struct. Mol. Biol. 22, 214–221 (2015).
136. Xia, J. Huntingtin contains a highly conserved nuclear export signal. Hum. Mol. Genet. 12, 1393–1403 (2003).
137. Gu, X. et al. N17 Modifies Mutant Huntingtin Nuclear Pathogenesis and Severity of Disease in HD BAC Transgenic Mice. Neuron 85, 726–741 (2015).
138. Neveklovska, M., Clabough, E. B. D., Steffan, J. S. & Zeitlin, S. O. Deletion of the huntingtin proline-rich region does not significantly affect normal huntingtin function in mice. J. Huntingtons. Dis. 1, 71–87 (2012).
139. Kaytor, M. D., Wilkinson, K. D. & Warren, S. T. Modulating huntingtin half-life alters polyglutamine-dependent aggregate formation and cell toxicity. J. Neurochem. 89, 962–973 (2004).
140. Zuccato, C. et al. Huntingtin interacts with REST/NRSF to modulate the transcription of NRSE-controlled neuronal genes. Nat. Genet. 35, 76–83 (2003).
141. Cornett, J. et al. Polyglutamine expansion of huntingtin impairs its nuclear export. Nat. Genet. 37, 198–204 (2005).
142. Nalavade, R., Griesche, N., Ryan, D. P., Hildebrand, S. & Krauß, S. Mechanisms of RNA-induced toxicity in CAG repeat disorders. Cell Death Dis. 4, e752-11 (2013).
143. Peel, A. L. PKR Activation in Neurodegenerative Disease. J. Neuropathol. Exp. Neurol. 63, 97–105 (2004).
144. Krauß, S. et al. Translation of HTT mRNA with expanded CAG repeats is regulated by the MID1-PP2A protein complex. Nat. Commun. 4, (2013).
145. Scherzinger, E. et al. Self-assembly of polyglutamine-containing huntingtin fragments into amyloid-like fibrils: Implications for Huntington’s disease pathology. Proc. Natl. Acad. Sci. 96, 4604–4609 (1999).
146. Atwal, R. S. et al. Kinase inhibitors modulate huntingtin cell localization and toxicity. Nat. Chem. Biol. 7, 453–460 (2011).
147. Thompson, L. M. et al. IKK phosphorylates Huntingtin and targets it for degradation by the proteasome and lysosome. J. Cell Biol. 187, 1083–1099 (2009).
148. Dehlin, E. et al. Regulation of ghrelin structure and membrane binding by phosphorylation. Peptides 29, 904–911 (2008).
149. Aiken, C. T. et al. Phosphorylation of threonine 3: Implications for huntingtin aggregation and neurotoxicity. J. Biol. Chem. 284, 29427–29436 (2009).
150. Oh, E. et al. Selective ribosome profiling reveals the cotranslational chaperone action of trigger factor in vivo. Cell 147, 1295–1308 (2011).
151. MacKinnon, A. L., Garrison, J. L., Hegde, R. S. & Taunton, J. Photo-leucine incorporation reveals the target of a cyclodepsipeptide inhibitor of cotranslational translocation. J. Am. Chem. Soc. 129, 14560–14561 (2007).
152. Dieterich, D. C. et al. Labeling, detection and identification of newly synthesized proteomes with bioorthogonal non-canonical amino-acid tagging. Nat. Protoc. 2, 532–540 (2007).
153. Zheng, S., Kim, H. & Verhaak, R. G. W. Silent mutations make some noise. Cell 156, 1129–1131 (2014).
154. Bateman, A. et al. UniProt: A hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
155. Dill, K. A. & Chan, H. S. From Levinthal to pathways to funnels. Nat. Struct. Biol. 4, 10–19 (1997).
156. Shin, S.-H. et al. Direct observation of kinetic traps associated with structural transformations leading to multiple pathways of S-layer assembly. Proc. Natl. Acad. Sci. 109, 12968–12973 (2012).
157. Borgia, A., Williams, P. M. & Clarke, J. Single-Molecule Studies of Protein Folding. Annu. Rev. Biochem. 77, 101–125 (2008).
158. Veitshans, T., Klimov, D. & Thirumalai, D. Protein folding kinetics: Timescales, pathways and energy landscapes in terms of sequence-dependent properties. Fold. Des. 2, 1–22 (1997).
159. Noe, F., Schutte, C., Vanden-Eijnden, E., Reich, L. & Weikl, T. R. Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proc. Natl. Acad. Sci. U.S.A. 106, 19011–19016 (2009).
160. Maurizi, M. R. Proteases and protein degradation in Escherichia coli. Experientia 48, 178–201 (1992).
161. Piana, S. & Shaw, D. E. Atomic-Level Description of Protein Folding inside the GroEL Cavity. J. Phys. Chem. B acs.jpcb.8b07366 (2018). doi:10.1021/acs.jpcb.8b07366
162. Kerner, M. J. et al. Proteome-wide analysis of chaperonin-dependent protein folding in Escherichia coli. Cell 122, 209–220 (2005).
163. Tomita S, Kirino Y, S. T. Cleavage of Alzheimer’s Amyloid Precursor Protein (APP) by Secretases Occurs after O-Glycosylation of APP in the Protein Secretory Pathway. J. Biol. Chem. 273, 6277–6284 (1998).
164. Twisk, J. et al. The role of the LDL receptor in apolipoprotein B secretion. J. Clin. Invest. 105, 521–531 (2000).
165. Govind, A. P., Walsh, H. & Green, W. N. Nicotine-Induced Upregulation of Native Neuronal Nicotinic Receptors Is Caused by Multiple Mechanisms. J. Neurosci. 32, 2227–2238 (2012).
166. Cumming, G., Fidler, F. & Vaux, D. L. Error bars in experimental biology. J. Cell Biol. 177, 7–11 (2007).


167. Sillitoe, I. et al. New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res. 41, 490–498 (2013).
168. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–10 (1990).
169. Frank, J. & Gonzalez, R. L. Structure and Dynamics of a Processive Brownian Motor: The Translating Ribosome. Annu. Rev. Biochem. 79, 381–412 (2010).
170. Wen, J.-D. et al. Following translation by single ribosomes one codon at a time. Nature 452, 598–603 (2008).
171. Ingolia, N. T., Brar, G. A., Rouskin, S., McGeachy, A. M. & Weissman, J. S. Fragments. Nat. Prot. 7, 1534–1550 (2013).
172. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10 (2011).
173. Schmieder, R. & Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863–864 (2011).
174. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
175. Kim, D. et al. TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).
176. Jackson, S. E. & Fersht, A. R. Folding of Chymotrypsin Inhibitor 2. 1. Evidence for a Two-State Transition. Biochemistry 30, 10428–10435 (1991).
177. O’Brien, E. P., Ziv, G., Haran, G., Brooks, B. R. & Thirumalai, D. Effects of denaturants and osmolytes on proteins are accurately predicted by the molecular transfer model. Proc. Natl. Acad. Sci. 105, 13403–13408 (2008).
178. Karanicolas, J. & Brooks, C. The origins of asymmetry in the folding transition states of protein L and protein G. Protein Sci. 11, 2351–2361 (2002).
179. Brooks, B. R. et al. CHARMM: The Biomolecular Simulation Program. J. Comput. Chem. 30, 1545–1614 (2009).
180. Betancourt, M. R. & Thirumalai, D. Pair potentials for protein folding: Choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Sci. 8, 361–369 (2008).
181. Kumar, S., Rosenberg, J. M., Bouzida, D., Swendsen, R. H. & Kollman, P. A. The weighted histogram analysis method for free-energy calculations on biomolecules. I. The method. J. Comput. Chem. 13, 1011–1021 (1992).
182. Frishman, D. & Argos, P. Knowledge-based protein secondary structure assignment. Proteins-Structure, Funct. Genet. 23, 566–579 (1995).

183. Mittelstaet, J., Konevega, A. L. & Rodnina, M. V. A kinetic safety gate controlling the delivery of unnatural amino acids to the ribosome. J. Am. Chem. Soc. 135, 17031–17038 (2013).
184. Frisch, M. J. et al. Gaussian 09, Revision E.01. Gaussian, Inc. (2009).
185. Best, R. B., Chen, Y. G. & Hummer, G. Slow protein conformational dynamics from multiple experimental structures: The helix/sheet transition of Arc repressor. Structure 13, 1755–1763 (2005).
186. Trovato, F. & O’Brien, E. P. Fast Protein Translation Can Promote Co- and Posttranslational Folding of Misfolding-Prone Proteins. Biophys. J. 112, 1807–1819 (2017).
187. Finka, A. & Goloubinoff, P. Proteomic data from human cell cultures refine mechanisms of chaperone-mediated protein homeostasis. Cell Stress Chaperones 18, 591–605 (2013).
188. Humphrey, W., Dalke, A. & Schulten, K. VMD: Visual molecular dynamics. J. Mol. Graph. 14, 33–38 (1996).
189. Frishman, D. & Argos, P. Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng. 9, 133–142 (1996).
190. Dosztanyi, Z., Csizmok, V., Tompa, P. & Simon, I. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21, 3433–3434 (2005).
191. De Sancho, D., Doshi, U. & Muñoz, V. Protein folding rates and stability: how much is there beyond size? J. Am. Chem. Soc. 131, 2074–2075 (2009).
192. Kumar, M. D. S. ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucleic Acids Res. 34, D204–D206 (2006).
193. Wittung-Stafshede, P., Gray, H. B. & Winkler, J. R. Rapid formation of a four-helix bundle. Cytochrome b562 folding triggered by electron transfer. J. Am. Chem. Soc. 119, 9562–9563 (1997).
194. Reid, K., Rodriguez, H., Hillier, B. & Gregoret, L. Stability and folding properties of a model beta-sheet protein, Escherichia coli CspA. Protein Sci. 7, 470–479 (1998).
195. Van Nuland, N. A. J. et al. Slow cooperative folding of a small globular protein HPr. Biochemistry 37, 622–637 (1998).
196. López-Hernández, E. & Serrano, L. Structure of the transition state for folding of the 129 aa protein CheY resembles that of a smaller protein, CI-2. Fold. Des. 1, 43–55 (1996).

197. Raschke, T. M. & Marqusee, S. The kinetic folding intermediate of ribonuclease H resembles the acid molten globule and partially unfolded molecules detected under native conditions. Nat. Struct. Biol. 4, 298–304 (1997).
198. Nissley, D. A. & O’Brien, E. P. Structural Origins of FRET-Observed Nascent Chain Compaction on the Ribosome. J. Phys. Chem. B 122, 9927–9937 (2018).
199. Stjepanovic, G. et al. Lipids Trigger a Conformational Switch That Regulates Signal Recognition Particle (SRP)-mediated Protein Targeting. J. Biol. Chem. 286, 23489–23497 (2011).

VITA

Daniel A. Nissley

Education
Ph.D., Chemistry, The Pennsylvania State University May 2019
B.S., Chemistry, Lehigh University May 2013

Publications
[1] Leininger, S.E.; Trovato, F.; Nissley, D.A.; O’Brien, E.P. “Domain topology, stability, and translation speed govern mechanical force generation on the ribosome.” Under Review.
[2] Nissley, D.A.; O’Brien, E.P. “Structural origins of FRET-observed nascent chain compaction on the ribosome.” J. Phys. Chem. B 122, 9927–9937 (2018).
[3] Fritch, B.; Kosolapov, A.; Hudson, P.; Nissley, D.A.; Woodcock, H.L.; Deutsch, C.; O’Brien, E.P. “Origins of the mechanochemical coupling of peptide bond formation to protein synthesis.” J. Am. Chem. Soc. 140, 5077–5087 (2018).
[4] Nissley, D.A.; O’Brien, E.P. “Altered co-translational processing plays a role in Huntington’s pathogenesis – A hypothesis.” Front. Mol. Neurosci. 9:54 doi: 10.3389/fnmol.2016.00054 (2016).
[5] Nissley, D.A.; Sharma, A.K.; Friedrich, U.A.; Kramer, G.; Bukau, B.; O’Brien, E.P. “Accurate prediction of cellular co-translational folding indicates proteins can switch from post- to co-translational folding.” Nat. Commun. 7:10341 doi: 10.1038/ncomms10341 (2016).
[6] Nissley, D.A.; O’Brien, E.P. “Timing is everything: unifying codon translation rates and nascent proteome behavior.” J. Am. Chem. Soc. 136, 17892–17898 (2014).
[7] Cook, K.M.; Nissley, D.A.; Ferguson, G.S. “Spatially selective formation of hydrocarbon, fluorocarbon, and hydroxyl-terminated monolayers on a microelectrode array.” Langmuir 29, 6779–6783 (2013).

Invited Oral Presentations
Penn State Chemical Biology Seminar Series October 2018
Biophysical Society 61st Annual Meeting February 2017

Selected Poster Presentations
Biophysical Society 62nd Annual Meeting February 2018
From Computational Biophysics to Systems Biology 2017 May 2017

Selected Awards and Honors
Penn State Graduate Student Award 2017
Best Poster, From Computational Biophysics to Systems Biology 2017
Penn State Travel Awards 2016-2018