MARKOV STATE MODELS FOR PROTEIN AND RNA FOLDING

A DISSERTATION

SUBMITTED TO THE PROGRAM IN BIOPHYSICS

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Gregory R. Bowman

July 2010

© 2010 by Gregory Ross Bowman. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/ky974bm1455

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Vijay Pande, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Russ Altman

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Daniel Herschlag

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

ABSTRACT

Understanding the molecular bases of human health could greatly augment our ability to prevent and treat diseases. For example, a deeper understanding of protein folding would serve as a reference point for understanding, preventing, and reversing protein misfolding in diseases like Alzheimer’s. Unfortunately, the small size and tremendous flexibility of proteins and other biomolecules make it difficult to simultaneously monitor their thermodynamics and kinetics with sufficient chemical detail. Atomistic Molecular Dynamics (MD) simulations can provide a solution to this problem in some cases; however, they are often too short to capture biologically relevant timescales with sufficient statistical accuracy. We have developed a number of methods to address these limitations. In particular, our work on Markov State Models (MSMs) now makes it possible to map out the conformational space of biomolecules by combining many short simulations into a single statistical model. Here we describe our use of MSMs to better understand protein and RNA folding. We chose to focus on these folding problems because of their relevance to misfolding diseases and the fact that any method capable of describing such drastic conformational changes should also be applicable to less dramatic but equally important structural rearrangements like allostery. One of the key insights from our folding simulations is that protein native states are kinetic hubs. That is, the unfolded ensemble is not one rapidly mixing set of conformations. Instead, there are many non-native states that can each interconvert more rapidly with the native state than with one another. In addition to these general observations, we also demonstrate how MSMs can be used to make predictions about the structural and kinetic properties of specific systems. Finally, we explain how MSMs and other enhanced sampling algorithms can be used to drive efficient sampling.

ACKNOWLEDGMENTS

Thanks to my family and my God for giving me the passion, intellect, and opportunity to do this work. It is difficult to imagine life without the love, support, and training my parents, brother, and wife have given me. Graduate school—and life in general—have been much more enjoyable with the companionship of my beautiful wife Angela.

Thanks to my advisor, Vijay Pande, for being such a superb guide, for creating such an intellectually invigorating environment, and for being so generous with resources of all kinds. My lab-mates have also been great. I’m especially indebted to Xuhui Huang for helping to jump-start my progress by working so closely with me during my rotation and the early years of my PhD. Sergio Bacallado, Kyle Beauchamp, John Chodera, Dan Ensign, Imran Haque, Peter Kasson, Yu-Shan Lin, Paul Novick, and Vince Voelz were all great collaborators. Thanks to Jason Wagoner and Del Lucent for all the conversations about science, religion, politics, and philosophy.

Thanks to my committee members, Russ Altman and Dan Herschlag, for making time to help me along the way. Dan was especially generous in including me in his group and getting me into the wet-lab. Seb Doniach has also been like a co-advisor.

Table of Contents

List of tables ...... x
List of figures ...... xi
Introduction ...... 1
Chapter 1: Using generalized ensemble simulations and Markov state models to identify conformational states ...... 6
    Abstract ...... 6
    Introduction ...... 6
    Description of Method ...... 10
    Conclusions ...... 17
Chapter 2: Progress and challenges in the automated construction of Markov state models for full protein systems ...... 19
    Abstract ...... 19
    Introduction ...... 20
    Materials & Methods ...... 24
    Results & Discussion ...... 29
    Conclusions ...... 45
Chapter 3: Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39) ...... 47
    Abstract ...... 47
    Introduction ...... 48
    Materials & Methods ...... 48
    Results & Discussion ...... 49
    Conclusions ...... 55
Chapter 4: Protein folded states are kinetic hubs ...... 56
    Abstract ...... 56
    Introduction ...... 57
    Results & Discussion ...... 59
    Conclusions ...... 71
    Materials & Methods ...... 73
Chapter 5: Atomistic folding simulations of the five helix bundle protein Lambda6-85 ...... 75
    Abstract ...... 75
    Introduction ...... 76
    Results & Discussion ...... 78
    Conclusions ...... 84
Chapter 6: Enhanced modeling via network theory: Adaptive sampling of Markov state models ...... 86
    Abstract ...... 86
    Introduction ...... 86
    Theoretical Underpinnings ...... 89
    Results & Discussion ...... 96
    Conclusions ...... 107
Chapter 7: Simulated tempering yields insight into the low-resolution Rosetta scoring functions ...... 108
    Abstract ...... 108
    Introduction ...... 109
    Methods ...... 111
    Results ...... 119
    Discussion ...... 128
    Conclusions ...... 132
Chapter 8: The roles of entropy and kinetics in structure prediction ...... 133
    Abstract ...... 133
    Introduction ...... 134
    Results & Discussion ...... 136
    Conclusions ...... 144
    Materials & Methods ...... 145
Chapter 9: Structural insight into RNA hairpin folding intermediates ...... 148
    Abstract ...... 148
    Introduction ...... 148
    Results & Discussion ...... 150
    Conclusions ...... 156
Chapter 10: Rapid equilibrium sampling initiated from non-equilibrium data ...... 157
    Abstract ...... 157
    Introduction ...... 158
    Results & Discussion ...... 162
    Conclusions ...... 170
    Materials & Methods ...... 171
Appendix A: Estimating transition matrices and equilibrium distributions ...... 172
Appendix B: The possibility of longer timescales than the implied timescales ...... 175
Appendix C: Supporting information for chapter 3 ...... 177
    Molecular dynamics simulation ...... 177
    Markov State Model (MSM) construction ...... 178
    Transition Pathway Theory (TPT) analysis ...... 178
    Structural analysis of macrostate ensembles ...... 179
    Analysis of states along folding pathways: comparison between secondary structure formation and reaction progress (pfold) ...... 180
    How does NTL9 fold in our simulations? ...... 181
Appendix D: Supporting information for chapter 4 ...... 187
    Villin MSM ...... 187
    Simple models ...... 191
Appendix E: Supporting information for chapter 5 ...... 199
    Simulation Details ...... 199
    MSM Construction and Analysis ...... 199
Appendix F: Supporting information for chapter 6 ...... 210
Appendix G: Supporting information for chapter 9 ...... 211
    Serial Replica Exchange (SREMD) ...... 211
    Simulation Details ...... 211
    Topological Method (Mapper) for Pathway Analysis ...... 212
    PEDFs ...... 213
    Melting Curves ...... 214
Appendix H: Supporting information for chapter 10 ...... 216
    Initial Configurations ...... 216
    The Convergence of Weights in Simulated Tempering (ST) ...... 216
    Molecular Dynamics (MD) Simulation Details ...... 220
    Hierarchical K-medoids clustering algorithm ...... 220
    Markov State Models ...... 221
    A simple model of non-Arrhenius, metastable dynamics ...... 226
Bibliography ...... 236

LIST OF TABLES

Number Page
Table 1. Exponential fits, MFPTs, and lag phases (all in units of steps) for transitioning from the unfolded state(s) to the native state in the three simple models ...... 198
Table 2. Convergence of the weights is shown for representative temperatures. Δg = g_j − g_i obtained from distributed computing simulations starting from a helical structure (third column) and a coil structure (fourth column) at different temperature pairs. Differences between the free energy differences Δf_ji = g_j/β_j − g_i/β_i obtained from simulations starting from a helical structure and a coil structure are displayed in the fifth column. kT at temperature i is shown in the sixth column. Δf_ji(helical) − Δf_ji(coil) (kJ/mol) is smaller than kT (kJ/mol) at all temperature pairs ...... 221
Table 3. Metastability (Q) and average self-transition probability between metastable states for the MSMs built from ST simulations and seeding simulations ...... 225

LIST OF FIGURES

Number Page
Figure 1. Schematic of the steps required for building an MSM and obtaining representative conformations for each state. First, GE data represented by points are grouped into microstates represented by circles, with darker circles for more highly populated microstates. Kinetically related microstates are then lumped together into macrostates, or metastable states, represented by amorphous shapes. Finally, representative conformations are obtained by extracting the most probable conformation from each macrostate ...... 10
Figure 2. Implied timescales as a function of the lag time. There are two probable gaps in the implied timescales. If gap one were selected then a macrostate MSM with four states would be constructed, whereas if gap two were selected a higher resolution MSM with six states would be constructed ...... 14
Figure 3. Scatter plots of the free energy of each microstate (in kcal/mol) versus its RMSD. A) The initial 10,000 state model, B) the 30,000 state model, C) the final 10,000 state model, and D) the final 10,000 state model except that the average RMSD across five structures in each state is used instead of the RMSD of the state center ...... 31
Figure 4. Top ten implied timescales for the initial 10,000 state model ...... 31
Figure 5. Three representative structures for A) the lowest RMSD state in the final model and B) the most probable state in the final model overlaid with the crystal structure (red). The phenylalanine core is shown explicitly for each molecule ...... 35
Figure 6. Top ten implied timescales for the final model. A) The implied timescales at intervals of one ns. B) The implied timescales with error bars obtained by doing five iterations of bootstrapping at an interval of five ns ...... 38
Figure 7. The average RMSD of each state in the final model versus its left eigenvector component in the longest timescale transition, showing that this transition corresponds to folding ...... 39
Figure 8. Comparison between the time evolution of the native population in the MSM (blue) and the raw data (black) for the entire dataset. The error bars represent the standard error ...... 40
Figure 9. Comparison between the time evolution of the RMSD in the MSM (blue), the reduced representation (yellow), and the raw data (black) for A) an example of good agreement and B) an example of the worst case scenario. The error bars represent one standard deviation in the RMSD ...... 42
Figure 10. Improved agreement between the MSM and raw data for the example of poor agreement from Figure 9B, obtained by building the transition probability matrix from simulations started from this starting structure alone. The error bars represent one standard deviation in the RMSD ...... 44

Figure 11. (a) Distributions of RMSD-C for native-state simulations of NTL9(1-39) after 10 µs. The arrows indicate thresholds defined for the native basin at 3.5 Å and 4 Å. (b) The number of parallel simulations M(t) started from unfolded states at 370 K that reach time t. (c) Posterior predictions of the folding rate given the amount of simulation time and observed folding events for 3.5 Å (dashed) and 4 Å (solid) thresholds, with uniform (black) and Jeffreys (gray) priors, using methods from (85). In red is a Gaussian distribution representing the experimental rate mean and standard deviation ...... 50
Figure 12. (a) A snapshot from a folding trajectory (dark blue) achieves an RMSD-C of 3.1 Å compared to the native state (cyan). (b) Non-native (top) and native-like (bottom) hydrophobic core arrangements observed in low-RMSD conformations of folding trajectories. Highlighted are sidechains of residues F5 (magenta), V3, V9, V21 (tan), and L30, L35 (pink) ...... 51
Figure 13. A 2000-state Markov State Model (MSM) was built using a lag time of 12 ns. Shown is the superposition of the top 10 folding fluxes, calculated by a greedy backtracking algorithm (see Appendix C). These pathways account for only about 25% of the total flux, and transit only 14 of the 2000 macrostates (shown labeled a-n, for convenient discussion). The visual size of each state is proportional to its free energy, and arrow size is proportional to the inter-state flux ...... 52
Figure 14. The 14 macrostates involved in the top ten folding pathways, plotted along structural and kinetic reaction coordinates. The balance between native-like helix and sheet structure is quantified by Qα − (Qβ12 + Qβ13)/2 (vertical axis), and progress along the folding reaction is quantified by the pfold (committor) value (horizontal axis). It can be seen that the “unfolded” state (a) contains residual native-like helical propensity, and that pathways involving various orderings of native-like helix and sheet formation are possible ...... 54
Figure 15. Q-values, which capture the extent of native-like structures, plotted versus pfold (committor) values. The lines are to guide the eye ...... 54
Figure 16. Three representative networks each having unfolded state(s) (U and U_i), intermediates (I_i), and a native state (N). S has a single pathway, P has parallel pathways, and H has a heterogeneous unfolded state ...... 61
Figure 17. Distributions of the first folding times for the simple networks S, P, and H are shown in panels A, B, and C respectively. The blue lines are exponential fits to the data after the initial lag phase ...... 62
Figure 18. Relaxation of villin from the 500 state model. Distributions of the MFPTs from (A) unfolded states to the native state and (B) between unfolded states. (C) Relaxation kinetics with a 10:1 signal-to-noise ratio (black curve with Gaussian noise) and a single exponential fit (blue curve with τ≈810 ns) ...... 64
Figure 19. Schematic diagrams of funnel and native hub models having unfolded states (U), intermediates (I), and native states (N). (A) A network description of a folding funnel with nodes corresponding to individual conformations and a bottleneck near the native state. (B) A native hub model with metastable nodes. The size of each node in (B) is correlated with its equilibrium probability and the connectivity falls off as one moves away from the native state ...... 67
Figure 20. Distance between the final villin MSM and MSMs constructed from subsets of the data (varying trajectory length and number of trajectories). Distance is measured by a relative entropy metric (see Appendix D for details). Black lines are contours of equal amounts of data. No data was available for the upper-right portion of the graph ...... 70

Figure 21. (A) The crystal structure of the λ1-92 dimer bound to DNA (PDB code 1LMB). (B) A model of λ6-85 with the Trp22-Tyr33 pair monitored in T-jump experiments space-filled ...... 77

Figure 22. One of the 10 millisecond timescale pathways labeled with pfold values (the probability of reaching state H before state A) ...... 80
Figure 23. The 500 most populated macrostates with sizes proportional to their free energies and connections between states if transitions between them occurred in our simulations. The native state (green state with green connections) is a hub. The crystallographic state from Figure 22H is blue, the compact β-sheet state from Figure 22A is red, and the remaining states are yellow. All of these states have smaller equilibrium populations and fewer connections than the native state ...... 83
Figure 24. Distributions of mean first passage times (MFPTs) between sets of microstates (A) without weighting the distribution and (B) weighting each MFPT by the equilibrium probability of the starting state. The solid line is the distribution of MFPTs from non-native to native microstates and the dashed line is the distribution of MFPTs between non-native states. The average MFPT from non-native states to native ones is about 10 times faster than that between non-native states in (A) and the difference is even greater in (B). Native microstates were defined as those in the most populated macrostate. All other microstates were considered non-native ...... 83
Figure 25. Scaling for adaptive sampling of villin as the number of parallel simulations (N) used during each round is varied. (A) Wall-clock time scaling as N is varied. The black line is a best fit to the linear portion of the data (circles), which extends up to 5,000 simulations per iteration. (B) Computer time required to achieve a given model quality (relative entropy) for various sampling schemes. L refers to one long trajectory and the numbers refer to the number of parallel simulations used in each iteration of adaptive sampling. All results come from averaging over ten independent runs. Each step equates to 15 ns ...... 98
Figure 26. (A) The two models, S and P. (B) Distance from the true model (measured via the relative entropy) as a function of wall-clock time for adaptive sampling versus one long simulation of S (assuming 5 steps/day to mimic 5 nanoseconds/day in protein folding simulations). The lines are one long simulation (dashed line) and adaptive sampling with 10 simulations of 20 steps (solid line), 10 simulations of 200 steps (dotted line), 100 simulations of 20 steps (dash-dot line), and 1000 simulations of 20 steps (black squares) per iteration ...... 100
Figure 27. Relative entropy (top) and free energy of each state in kcal/mol (bottom) as a function of the adaptive sampling iteration on model S ...... 102
Figure 28. Distance from the true model (measured via the relative entropy) as a function of the number and length of simulations averaged over 10 independent samples. (A) Reference distribution for S, (B) adaptive sampling of S, (C) reference distribution for P, and (D) adaptive sampling of P. All simulations for the reference distributions started from state 1. The first 10 simulations for adaptive sampling started from state 1 and subsequent batches of simulations started from the state contributing most to uncertainty in the slowest process. Black lines are contours of equal amounts of data ...... 103
Figure 29. Scaling for adaptive sampling of our simple models as the number of parallel simulations (N) used during each round is varied. (A) and (B) Wall-clock time scaling as N is varied for simple models S and P respectively. The black line is a best fit to the linear portion of the data (circles). (C) and (D) Computer time required to achieve a given model quality (relative entropy) for various sampling schemes applied to S and P respectively. L refers to one long trajectory and the numbers refer to the number of parallel simulations used in each iteration of adaptive sampling. All results come from averaging over ten independent runs ...... 105
Figure 30. Flow chart showing the order the scoring functions are used in and giving brief descriptions of each. After score5, Rosetta returns to score2 five times before progressing to score3. The first six scoring functions constitute the low-resolution de novo structure prediction phase ...... 113
Figure 31. Score versus RMSD (Å) for an SH3 domain (PDB code 1shf). Each diamond represents the lowest scoring structure for a single run. Data for ST is shown in blue while data for standard Rosetta is shown in red. The black “+” symbols represent models obtained by idealizing and relaxing the crystal structure in low-resolution mode ...... 120
Figure 32. Score versus RMSD (Å) for protein G (PDB code 1igd). Each diamond represents the lowest scoring structure for a single run. Data for ST is shown in blue while data for standard Rosetta is shown in red. Panel (A) shows results from the low-resolution phase. The black “+” symbols represent models obtained by idealizing and relaxing the crystal structure in low-resolution mode. Panel (B) shows results from the full-atom phase. The yellow circles represent models obtained by idealizing and relaxing the crystal structure in full-atom mode. The black “*” symbols are full-atom models obtained by relaxing the low-resolution structures depicted by “+” symbols in (A) using the full-atom scoring functions ...... 121
Figure 33. Evolution of the score4 weights for protein G. The dashed line is the difference between the weights of the highest two temperatures: 10 and 20 kT. The solid line is the difference between the weights of the lowest two temperatures: 0.1 and 0.25 kT. The first points come from constant temperature runs and subsequent points represent each iteration of refining the weights. Δg = g_j − g_i, where j > i ...... 123
Figure 34. Projections of the free energy landscape onto score versus RMSD (Å) for protein G in score4 using: (A) standard Rosetta runs starting from an extended chain, (B) standard Rosetta runs starting from the native state, (C) ST runs at 0.1 kT starting from an extended chain, (D) ST runs at 0.1 kT starting from the native state, (E) ST runs at 2 kT starting from the native state, (F) ST runs at 20 kT starting from the native state. Each white plus-sign corresponds to the lowest scoring structure for a single run. The lowest scoring structures from each run were sorted by RMSD and only every twentieth point is shown so as to give the entire range without obscuring the underlying plot ...... 124
Figure 35. Projections of the free energy landscape onto score versus RMSD (Å) for protein G. Each white plus-sign corresponds to the lowest scoring structure for a single run. The lowest scoring structures from each run were sorted by RMSD and only every twentieth point is shown so as to give the entire range without obscuring the underlying plot. (A), (D), (G), and (J) show data from standard Rosetta runs with frequent recovery of the lowest scoring structure in score1, score2, score5, and score3 respectively. (B), (E), (H), and (K) show data from standard Rosetta runs without frequent recovery of the lowest scoring structure in score1, score2, score5, and score3 respectively. (C), (F), (I), and (L) show data from ST runs at 0.1 kT without frequent recovery of the lowest scoring structure in score1, score2, score5, and score3, respectively ...... 127

Figure 36. Time evolution of the Cα RMSD of the current umbrella center for five representative simulations demonstrating the presence of reversible folding ...... 137
Figure 37. Average energy (<∆E>), conformational entropy (<∆S>), and free energy (<∆F>) as a function of Cα RMSD for protein G and engrailed homeodomain (EH) ...... 138
Figure 38. Average free energies (<∆F>) as a function of Cα RMSD for temperatures of 0.5 and 0.1 for protein G and engrailed homeodomain (EH). The black lines are the hypothesized free energy at the given temperature and the dash-dot lines are the free energy at temperature 0.8 shown for reference ...... 140
Figure 39. (A) The native structure of protein G and (B) the 5.7 Å starting structure used for comparing the ST and Standard Rosetta variants ...... 142
Figure 40. Distribution of the minimum Cα RMSD values reached by 100 Simulated Tempering (ST) and 100 standard Rosetta runs started from a 5.7 Å structure. Results for both the low temperature and standard Rosetta variants were identical so only a single plot is shown ...... 142
Figure 41. Relative magnitude of the average hydrogen bonding energy (solid line) versus the total average energy (dash-dot line) as a function of Cα RMSD for protein G and engrailed homeodomain (EH) ...... 143

Figure 42. (A) NMR structure of the GCAA tetraloop. (B) Contact map for the native state. Bases are numbered from 5’ to 3’ and native base-pair contacts (dotted lines) are numbered 1-4 ...... 149
Figure 43. The probability of a given number of native contacts during (A) unfolding and (B) refolding. (C) The probability of each contact when a given number of contacts are present during unfolding and refolding with the arrows representing the direction of movement between the unfolded state (U) and the folded state (F) ...... 153
Figure 44. Contact maps representing the cluster centers from independent clustering of the unfolding (A) and refolding data (B). The grey lines represent the connectivity of the states. The blue lines represent native contacts with a probability of 0.6 or greater within the cluster. Intermediate structures are labeled A-D ...... 153
Figure 45. Representative full-atom structures for the intermediate states with labels (A)-(D) corresponding to the labels A-D in Figure 44 ...... 155
Figure 46. A schematic free energy landscape with three representative seeding trajectories started from each basin and a projection of this free energy landscape onto a 2D plane showing the division into metastable states ...... 161
Figure 47. Schematic of the adaptive seeding scheme. The top arrow represents our ST trajectories, which are split into equilibration (green) and production (light blue) phases. The light red and light yellow boxes encompass our long and short adaptive seeding schemes respectively. For each adaptive seeding scheme, the dotted lines demark the portion of the ST data used to identify the dominant thermodynamic, or metastable, states by building an MSM (S). Constant temperature (or canonical, NVT) simulations are then started from each state and used to build a new MSM (E) that captures the equilibrium distribution. Both the light yellow and red boxes also encompass a portion of the original ST data that is equivalent to the amount of sampling used in the adaptive seeding scheme. An MSM is also built for this data and used as a baseline for judging the efficiency of the adaptive seeding scheme ...... 163
Figure 48. Population of each state (bar graphs correspond to the mean values, and error bars stand for standard deviations) for (A) the long adaptive seeding scheme (lag time t=4.5 ns) and (B) the short adaptive seeding scheme (lag time t=4.5 ns) ...... 165
Figure 49. Population of each state for the long adaptive seeding scheme as the lag time is varied ...... 166
Figure 50. Representative structure for each of the six metastable states. The numbering is the same as in Figures 48 and 49 ...... 170
Figure 51. Graph depiction of the model system defined in Appendix B with edges labeled by A) their probability and B) their average timescale under a two-state assumption ...... 176
Figure 52. (a) Implied timescales for a series of 100,000-microstate Markov State Models (MSMs) built at lag times between 1 and 32 ns. As the longest timescale levels off beyond a lag time of 10 ns, a lag time of 12 ns was chosen to build subsequent MSMs. The spectral gap present at all lag times indicates apparent two-state folding kinetics. (b) The implied timescales for a 2000-macrostate model built by lumping states from the microstate MSM show a similar spectral gap and leveling off of timescales. The faster implied timescales of the macrostate model at short lag times are due to lumping effects. (c) The 10 slowest implied timescales for the 2000 state models, with error analysis from a bootstrapping procedure. Error bars represent the standard deviation from the bootstrap analysis ...... 183
Figure 53. A scatter plot of the 2000 macrostates obtained by lumping the 100,000-state MSM calculated from the simulation data at 370 K. The RMSD-to-native is calculated using the peptide backbone residues, with respect to the native starting state. The free energy of each microstate i is computed as −kT ln(p_i/p_0), where p_i is the equilibrium probability of the microstate, and p_0 is an arbitrary reference (in this case, max(p_i)). Shown in red are the 14 macrostates transited by the top ten pathway fluxes, labeled with the same letters as in Figure 13. In this mesoscopic view, we find that 1) the macrostates are diffuse collections of conformational states, 2) there are multiple folding pathways along these metastable states, and 3) we can identify highly populated “native” (state n) and “unfolded” (state a) macrostates that dominate the observed relaxation rates. The red arrow is meant to guide the eye in illustrating a “mesoscopic” view of the transition state barrier: the “unfolded” state (a) and “native” state (n) are at free energy minima, while intermediate RMSD values have macrostates with higher free energies ...... 184

Figure 54. Contact profile subspaces used to calculate Qβ12, Qβ13, and Qα, which quantify the extent of native-like structuring for beta-strand 1 and 2 pairing, beta-strand 1 and 3 pairing, and helix formation, respectively ...... 184
Figure 55. Here, contact profiles (see definition above) for the 14 macrostates involved in the top ten folding pathways are plotted in a similar fashion to Figure 55. For clarity, the pathway arrows have been removed. Each contact profile is a 39 × 39 matrix of inter-residue contacts, showing the contact fraction on a linear grayscale from 0 (white) to 1 (black) ...... 185
Figure 56. Here, values of Qα (yellow), Qβ12 (red), and Qβ13 (blue) are plotted in a bar graph for each of the 14 macrostates involved in the top ten folding pathways. The layout is in a similar fashion to Figure 56 ...... 185
Figure 57. Macrostates l, m and n (the “native” state) have very similar structural ensembles and similar pfold values (pfold > ~0.93). To examine the subtle differences in their macrostate contact profiles, we computed difference contact profiles for (l-m), (n-l) and (n-m) transitions. These difference maps reveal that these states differ mostly in their hairpin registrations and packing of the hairpin loop ...... 186
Figure 58. Implied timescales for the villin macrostate MSM ...... 194
Figure 59. Distribution of MFPTs between all pairs of non-native states for villin (A) on a linear scale to demonstrate the peak does not shift significantly relative to the distribution shown in Figure 18B and (B) on a log scale to highlight that the tail of the distribution does extend to about 60 ns ...... 194
Figure 60. Distributions of the MFPTs (A) from each non-native state to the native state and (B) between every pair of non-native states for our 2,000 state NTL9(1-39) model. As discussed in Ref (93), further refinement of this model is likely necessary. However, we do not expect the qualitative trend of long timescales (relative to folding) for transitioning between unfolded states to change ...... 195
Figure 61. Two conformations from different unfolded basins demonstrating the structural heterogeneity of non-native states (especially in their non-native contacts) that, in combination with the vastness of conformational space, result in slow transitions between unfolded states. The structures are colored red to blue from the N-terminus to the C-terminus. Atoms for residues Arg 14, Trp 23, and Lys 32 are shown to highlight that 23 and 32 are in contact on the left while the chain has rearranged such that 14 and 32 are in contact on the right. These images were made with VMD (67) ...... 195
Figure 62. Relaxation of the fraction folded starting from equally populated unfolded states (black is data and blue is single exponential fit with τ≈810 ns). The beginning of the curve is dominated by single exponential relaxation but deviations from this apparent two-state behavior become apparent later ...... 196

Figure 63. Relaxation of the fraction unfolded for a villin model at the microstate level (thick black line) and a biexponential fit (thin blue line) with time constants of ~60 and ~415 ns, at least qualitatively consistent with time constants of ~70 and ~720 ns from experiment (56). We hope to explain this behavior in a future work on villin. As in Ref. (4), the native state was defined as all microstates with an average Cα RMSD to the crystal structure less than 3 Å.

Figure 64. The distance to the gold-standard model, measured via the relative entropy, for 40,000 trajectories up to 400 nanoseconds in length. The black lines are contours of equal amounts of data. Again, there was insufficient data to resolve the upper right-hand corner of the plot.

Figure 65. Implied timescales for the full 370 K dataset.

Figure 66. Implied timescales for the 300 K dataset.

Figure 67. Implied timescales for ¾ of the 370 K dataset selected at random.

Figure 68. A coarse-grained view of the slowest transition with state sizes proportional to the free energy and arrow widths proportional to the flux (see key in figure).

Figure 69. Another coarse-grained view of the slowest transition with state sizes proportional to the free energy and arrow widths proportional to the flux (see key in figure). Here the states are laid out in terms of the average number of β-sheet residues (calculated from 100 random conformations from each state) and the pfold (probability of reaching the crystallographic state in L before the compact β-sheet state in A).

Figure 70. Free energy projections of the microstate MSM onto typical order parameters like the radius of gyration (Rg), the Cα RMSD to the crystal structure, and the distance between the Trp22 and Tyr33 residues. Differences between the two panels highlight the difficulty in interpreting such projections.

Figure 71. Free energy projection of the microstate MSM onto Pfold and the distance between the Trp22 and Tyr33 residues. Obtaining projections onto kinetic order parameters like Pfold is greatly simplified with MSMs. In this case Pfold refers to the probability of reaching the crystallographic state before reaching the compact β-sheet state (i.e. the slow transition from Figure 21). Unlike the projections in, this one hints that D14A may not be well described by a simple two- or three-state model or that the Trp22-Tyr33 distance is not a good reaction coordinate, since there is a broad range of Pfold values possible for a given Trp-Tyr distance. Indeed, analysis of the MSM reveals that D14A is best described by a native hub.

Figure 72. The ten most populated macrostates with their equilibrium probabilities.

Figure 73. Relaxation of the fraction unfolded with different observables and observation times. The thick black curves come from the MSM and the thin blue curves from biexponential fits to the MSM relaxation. The top row shows relaxation of the fraction unfolded measured by the Trp22-Tyr33 distance (A) starting from all states being equally populated and (B) starting from all non-native states being equally populated. The bottom row shows relaxation of the fraction unfolded measured by the Cα RMSD to the crystal structure (C) starting from all states being equally populated and (D) starting from all non-native states being equally populated. Fitting parameters are given in the figure (in units of microseconds). In this case, the fitting parameters are relatively independent of the observable and starting distribution.

Figure 74. Relaxation of the fraction unfolded with different observables and observation times from an MSM built without the trajectories started from β-sheet structures. The thick black curves come from the MSM and the thin blue curves from biexponential fits to the MSM relaxation. The top row shows relaxation of the fraction unfolded measured by the Trp22-Tyr33 distance (A) starting from all states being equally populated and (B) starting from all non-native states being equally populated. The bottom row shows relaxation of the fraction unfolded measured by the Cα RMSD to the crystal structure (C) starting from all states being equally populated and (D) starting from all non-native states being equally populated. Fitting parameters are given in the figure (in units of microseconds). In this case the fitting parameters are more dependent on the observable, consistent with the experimental observation of probe dependent kinetics.

Figure 75. Projection of the free energy onto pfold (A) from the compact β-sheet state in Figure 22A to the native state in Figure 22H, (B) from the extended state in Figure 22E to the native state in Figure 22H, and (C) from the extended state in Figure 22E to the native state in Figure 22G. None are purely downhill, though some may be consistent with incipient downhill folding (i.e. have sufficiently low barriers that there is a reasonable population at the barrier top that can fold in a downhill manner in addition to activated folding across the barrier).

Figure 76. The helicity of each residue predicted from Agadir (143). The purple, numbered bars show where the five helices are (the extra purple block between helices 4 and 5 is a turn).

Figure 77. Uncertainty in the log base 10 of the relative entropies averaged over 10 independent samples of (A) reference simulations of M1 and (B) adaptive sampling of M1. Black lines are contours of equal amounts of data.

Figure 78. Uncertainty in the log base 10 of the relative entropies averaged over 10 independent samples of (A) reference simulations of M2 and (B) adaptive sampling of M2. Black lines are contours of equal amounts of data.

Figure 79. (a) Potential Energy Distribution Functions (PEDFs) generated from Folding@home data at each of the 56 temperatures used. (b) The χ² convergence measure averaged over all temperatures as a function of time. Triangles correspond to using Pfinal as the reference distribution and circles correspond to using Pinitial as the reference.

Figure 80. Native contacts melting curve. Only every third temperature is displayed for clarity.

Figure 81. The two initial structures used in this study: A) A near-native conformation and B) a random coil conformation.

Figure 82. Amount of sampling at different temperatures for ST simulations started from the native (top row) and coil (bottom row) configurations, computed from different segments of simulation time (0-0.3 ns, 1.2-1.5 ns, 2.7-3.0 ns, and 8.7-9.0 ns). Uniform sampling is reached for both sets of ST simulations, indicating the weights are converged.

Figure 83. Three example structures from a single microstate.

Figure 84. The largest one hundred implied timescales as a function of the lag time for (a) ST simulations starting from the coil initial configuration and (b) the long adaptive seeding microstate MSM.

Figure 85. Potential of Mean Force (PMF) for the simple potential at β (1/kT) values of (a) 0.995, (b) 0.652, and (c) 0.456. In part a, four metastable macrostates are separated by the dashed black lines and labeled.

Figure 86. Populations of four macrostates as a function of β=1/kT.

Figure 87. Folding (black) and unfolding (red) rates plotted as a function of β=1/kT.

Figure 88. Logarithms of the implied timescales as a function of β for the 2D potential. The three slowest timescales are plotted using up triangle, down triangle, and cross points, respectively.

Figure 89. Populations computed from Simulated Tempering (ST) simulations for four metastable states, plotted as a function of the length of the simulation. The reference population is shown by the solid lines and 1000 trajectories are used for this calculation. The error bars are the standard deviation obtained from bootstrapping 100 times with replacement.

Figure 90. Populations computed from the Adaptive Seeding Method (ASM) for four metastable states, plotted as a function of the length of the simulation. The reference population is shown by the solid lines and 1000 trajectories are used for this calculation. The lag time is selected as 1/3 of the length of the simulation. The error bars are the standard deviation obtained from a Bayesian method (see section 2.5.3 for details).

Figure 91. Populations computed from ASM simulations for four metastable states as a function of lag time.

Figure 92. Number of steps taken to reach convergence as a function of the number of trajectories.

INTRODUCTION

Molecular kinetics plays fundamental roles in human health and disease. For example, conformational changes in the ribosome drive translation and many drugs work by inducing allosteric conformational changes in G protein-coupled receptors. Many neurological diseases, like Alzheimer’s, are also hypothesized to result from protein misfolding. Therefore, a deeper understanding of molecular kinetics is crucial for our ability to comprehend and control human health.

Protein folding is a classic grand challenge in molecular biophysics because it is such a dramatic example of molecular kinetics and has important medical implications. With the recent discovery of structured RNAs, RNA folding has also become of interest. Folding is the process by which a disordered chain of residues (either amino acids or nucleotides) spontaneously self-assembles into a specific three-dimensional shape. The fact that folding happens at all is astounding given the enormous number of possible conformations a protein or RNA can adopt. For example, a hypothetical protein with 100 residues, each of which could adopt two conformations, could adopt over 1,000,000,000,000,000,000,000,000,000,000 different structures. If such a protein visited one conformation per second, then reaching all of them would take over 1,000,000,000,000 times longer than the age of the universe. Moreover, real proteins have many more degrees of freedom and sometimes many more residues. Despite this, proteins can often fold in a matter of milliseconds to seconds and RNA folding is only moderately slower. Therefore, it is reasonable to conclude that there must be one or more pathways guiding a biomolecule to its native (or most probable) state. Because folding is such a dramatic conformational change, any method that could map out the pathways by which protein and RNA molecules fold would likely be a powerful means of understanding less drastic but equally important structural rearrangements like allostery, all of which fall into the general category of molecular kinetics. In addition, accurate models for protein folding would serve as a reference point for understanding, preventing, and reversing misfolding diseases.
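The size of this conformational search can be checked with a few lines of arithmetic. The Python sketch below reproduces the estimate for the hypothetical 100-residue, two-conformations-per-residue protein; the age of the universe (about 13.8 billion years) is an assumed value that is not given in the text.

```python
# Back-of-the-envelope check of the conformational search estimate.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # assumed value, not taken from the text

n_residues = 100
conformations_per_residue = 2
n_conformations = conformations_per_residue ** n_residues  # exact integer 2**100

# Visiting one conformation per second means the search time in seconds
# equals the number of conformations.
search_time_s = float(n_conformations)
age_of_universe_s = AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR
ratio = search_time_s / age_of_universe_s

print(f"{n_conformations:.3e} conformations")
print(f"{ratio:.1e} times the age of the universe")
```

Under these assumptions both figures quoted above are lower bounds: the number of conformations exceeds 10^30 and the search time exceeds 10^12 ages of the universe.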

Many experimental techniques have been developed to probe folding. Unfortunately, biomolecules are extremely sensitive to their underlying chemical details and no current experimental method can simultaneously describe the atomic details of a molecule’s thermodynamics and kinetics. For example, x-ray crystallography can provide atomistic snapshots of a protein’s structure but gives little information about its kinetics. FRET, on the other hand, can provide information about a protein’s structure and dynamics by reporting on the changing distance between two probes attached to a molecule but is blind to the rest of that molecule’s structure. Heterogeneity also complicates the interpretation of much experimental data.

Molecular dynamics (MD) simulations are a powerful means of simultaneously modeling a biomolecule’s thermodynamics and kinetics with atomic resolution. In an MD simulation, one explicitly represents every atom and the bonds between them. One can then iteratively update the position and velocity of each atom based on the force exerted on it by the rest of the simulated system. The resulting trajectory is like a movie taken by zooming in on a single protein (or some other biomolecule).
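The iterative update of positions and velocities can be illustrated with a minimal Python sketch. It uses the velocity Verlet scheme, a common MD integrator chosen here only for illustration (the text does not name a specific integrator), and a single particle in a harmonic well in reduced units as a stand-in for a real force field. All function and variable names are hypothetical.

```python
import numpy as np

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate positions x and velocities v for n_steps of size dt."""
    frames = [x.copy()]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # forces at the new positions
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
        frames.append(x.copy())
    return np.array(frames), v

# Toy "force field": one particle in a harmonic well (reduced units).
k_spring = 1.0

def harmonic_force(x):
    return -k_spring * x

def total_energy(x, v):
    return 0.5 * np.dot(v, v) + 0.5 * k_spring * np.dot(x, x)

x0 = np.array([1.0, 0.0, 0.0])
v0 = np.zeros(3)
traj, v_final = velocity_verlet(x0, v0, harmonic_force,
                                mass=1.0, dt=0.01, n_steps=1000)
print(traj.shape, total_energy(traj[-1], v_final))
```

The returned array of frames plays the role of the trajectory described above. With a small enough timestep the total energy stays close to its initial value of 0.5.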

Unfortunately, MD has many of its own challenges. First and foremost among these is the sampling problem. Atomistic MD simulations must take very small timesteps (on the order of femtoseconds) to avoid unphysical phenomena like atoms passing through one another. Therefore, a typical computer can only simulate ~5 nanoseconds/day even for a small protein and would take over 500,000 years to simulate one second. In addition, molecular kinetics are stochastic, so generating a single long simulation is inadequate for truly understanding processes like protein folding. Instead, one must witness numerous events to characterize the entire distribution of pathways by which they can occur. Moreover, even if one could run a sufficient number of long simulations, the task of analyzing this data and making a direct connection with experiments would still remain. And, of course, the validity of the results of any simulation depends on the accuracy of the approximations and parameters (together referred to as the force field) used to describe the interactions between atoms. Unfortunately, testing a force field requires obtaining sufficient sampling and comparing the results to a large body of experimental data, so selecting (or developing) a good force field is non-trivial at best.
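The sampling estimate quoted above follows directly from the simulation rate. The short Python check below uses the ~5 nanoseconds/day figure from the text. The 2 fs timestep is an assumed, typical value used only to show how many integration steps one second of simulated time would require.

```python
# Cost of simulating one second of dynamics at ~5 ns/day.
NS_PER_SECOND = 1e9
FS_PER_SECOND = 1e15

ns_per_day = 5.0      # rate quoted in the text
timestep_fs = 2.0     # assumed typical timestep, not from the text

days_needed = NS_PER_SECOND / ns_per_day
years_needed = days_needed / 365.25
steps_needed = FS_PER_SECOND / timestep_fs

print(f"{years_needed:,.0f} years, {steps_needed:.1e} integration steps")
```

The result is a little under 550,000 years, consistent with the figure of over 500,000 years given above.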

Networks called Markov state models (MSMs) are one potential solution to these problems (1-3). An MSM is essentially a map of a molecule’s conformational space built from MD simulations. That is, like a road map with cities labeled with populations connected by roads labeled with speed limits, MSMs give the probability that a protein or other molecule will be in a certain set of conformations (called a metastable state) connected by edges describing where it can go next and how quickly.

MSMs are typically constructed from simulation trajectories (3-8). Because of the temporal relationship between conformations in a trajectory, it is possible to group conformations that can interconvert rapidly into states and then determine the connectivity between states by counting the number of times a simulation went from one state to another. By employing these kinetic definitions, one ensures that the system’s dynamics can be modeled reasonably well by assuming stochastic transitions between states (1, 3-6, 9-12). Thus, it is possible to perform analyses, such as identifying the most probable conformations at equilibrium or modeling the relaxation of some experimental observable, and make a quantitative comparison to (or predictions of) experiments. In addition, one can naturally vary the temporal and spatial resolution of an MSM by changing the definition of what it means to interconvert rapidly or slowly (4, 5, 10, 13, 14), much like zooming in and out on a Google map. By choosing a long timescale cutoff, one can obtain humanly comprehensible models with just a few metastable (or long-lived) states that capture large conformational changes, like folding. Such coarse-grained models are useful for gaining an intuition for a system. With a short timescale cutoff, on the other hand, one can obtain a model with many states. By using such high resolution models, one sacrifices ease of comprehension for more quantitative agreement with experiments (4, 5, 15). Regardless of the resolution, one can also draw on network theory to analyze MSMs and gain important insights into processes like folding (16, 17). Thus, MSMs are a powerful way of analyzing simulation data sets.
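The counting procedure described above can be sketched in a few lines of Python. The example assumes the trajectories have already been converted into sequences of integer state labels. It counts observed transitions, row-normalizes the counts into transition probabilities, and extracts the equilibrium populations as the stationary distribution. The function names and toy trajectories are hypothetical, and the simple row normalization is only one of several possible estimators.

```python
import numpy as np

def count_transitions(state_trajs, n_states, lag=1):
    """Count how often state a is followed by state b after `lag` frames."""
    counts = np.zeros((n_states, n_states))
    for traj in state_trajs:
        for a, b in zip(traj[:-lag], traj[lag:]):
            counts[a, b] += 1
    return counts

def transition_matrix(counts):
    """Row-normalize counts; assumes every state has at least one exit count."""
    return counts / counts.sum(axis=1, keepdims=True)

def equilibrium_populations(T):
    """Stationary distribution: left eigenvector of T with eigenvalue 1."""
    evals, evecs = np.linalg.eig(T.T)
    pi = np.abs(evecs[:, np.argmax(evals.real)].real)
    return pi / pi.sum()

# Two toy trajectories over three states (hypothetical data).
state_trajs = [[0, 0, 1, 1, 2, 2, 1, 0], [2, 2, 1, 1, 0, 0, 0, 1]]
counts = count_transitions(state_trajs, n_states=3)
T = transition_matrix(counts)
pi = equilibrium_populations(T)
print(T)
print(pi)
```

Because each short trajectory only contributes transition counts, many independent simulations can be pooled into a single model in exactly this way.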

MSMs also provide a statistical approach to molecular simulation, and potentially to other problems exhibiting metastability (18). Rather than attempting to generate one realization of an entire process, one instead decomposes conformational space into multiple metastable states and seeks to gather statistics on each step of the process independently and in parallel (e.g. by running many short simulations from each state and then combining them into a single MSM). Adaptive sampling algorithms for MSM construction take this statistical approach a step further (12, 18-20). In adaptive sampling, one first obtains an initial model of the entire process of interest by any means possible. One then iteratively calculates the contribution of each step of the process to uncertainties in some observable of interest via Bayesian statistics and runs numerous parallel simulations of the steps that can lead to the greatest increases in precision until the desired level of statistical certainty is achieved. Such an approach was recently shown to lead to dramatic reductions in the statistical uncertainty in the observable of interest relative to other refinement schemes (19). More recently, we have shown that it leads to efficient improvement of the global model quality (18). Once converged sampling is obtained, MSMs at varying resolutions can be used to assess the validity of the underlying force field by making quantitative comparisons to existing data and predictions of new experiments. Therefore, one can gain new insight into processes like protein folding, or at least understand and correct errors in the force field.
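One way to picture the adaptive sampling loop is the Python sketch below. It is a simplified stand-in for the Bayesian error analysis cited above (12, 18-20), not a reproduction of it. Each row of the transition matrix is given a Dirichlet posterior built from its transition counts plus an assumed uniform prior, one row at a time is resampled while the others are held at their posterior means, and the resulting variance in an observable (here, the equilibrium population of a target state) is attributed to that row. New simulations would then be started from the state with the largest contribution. All names and the toy counts are hypothetical.

```python
import numpy as np

def stationary(T):
    """Stationary distribution of a row-stochastic matrix."""
    evals, evecs = np.linalg.eig(T.T)
    pi = np.abs(evecs[:, np.argmax(evals.real)].real)
    return pi / pi.sum()

def variance_contributions(counts, target, n_samples=200, prior=1.0, seed=0):
    """Variance in the target state's population due to each row's uncertainty."""
    rng = np.random.default_rng(seed)
    alpha = counts + prior                       # Dirichlet parameters per row
    mean_T = alpha / alpha.sum(axis=1, keepdims=True)
    contributions = np.zeros(len(counts))
    for i in range(len(counts)):
        values = []
        for _ in range(n_samples):
            T = mean_T.copy()
            T[i] = rng.dirichlet(alpha[i])       # resample only row i
            values.append(stationary(T)[target])
        contributions[i] = np.var(values)
    return contributions

# Toy counts: states 0 and 1 are well sampled, state 2 is barely sampled.
counts = np.array([[9000.0, 1000.0, 0.0],
                   [1000.0, 8000.0, 1000.0],
                   [1.0, 2.0, 3.0]])
contributions = variance_contributions(counts, target=0)
next_state = int(np.argmax(contributions))
print(contributions, next_state)
```

In this toy case the poorly sampled state dominates the uncertainty, so it is selected as the starting point for the next round of simulations.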

Here we describe how MSMs can be used to understand protein folding (and related problems in molecular kinetics) and connect to experiments. We begin with an introduction to MSMs and a software package we developed to automate the construction of these models from simulation data sets. Next, we describe initial applications of this software to small model systems (a 35 residue mutant of the villin headpiece and a 39 residue fragment of NTL9) to test this methodology. We then describe new insights into protein folding obtained from MSMs and their application to larger, more biologically relevant systems like λ repressor (an 80-residue protein). This discussion is followed by an explanation of how MSMs can be used to solve the sampling problem using adaptive sampling and other enhanced sampling algorithms. Within this discussion of sampling, we also describe some of the initial applications of MSMs to RNA folding.

CHAPTER 1: USING GENERALIZED ENSEMBLE SIMULATIONS AND MARKOV STATE MODELS TO IDENTIFY CONFORMATIONAL STATES

This chapter was taken from: Bowman GR, Huang X, & Pande VS (2009) Using generalized ensemble simulations and Markov state models to identify conformational states Methods 49:197-201.

ABSTRACT

Part of understanding a molecule’s conformational dynamics is mapping out the dominant metastable, or long lived, states that it occupies. Once identified, the rates for transitioning between these states may then be determined in order to create a complete model of the system’s conformational dynamics. Here we describe the use of the MSMBuilder package (now available at https://simtk.org/home/msmbuilder/) to build Markov State Models (MSMs) to identify the metastable states from Generalized Ensemble (GE) simulations, as well as other simulation datasets. Besides building MSMs, the code also includes tools for model evaluation and visualization.

INTRODUCTION

Molecular Dynamics (MD) and Monte Carlo (MC) computer simulations have the potential to complement experiments by elucidating the chemical details underlying the conformational dynamics of biological macromolecules like proteins and RNA. Such simulations sample a system’s free energy landscape, which is characterized by long-lived, or metastable, states separated by large free energy barriers. Thus, understanding a system’s conformational dynamics can be broken down into two steps: 1) identifying the long-lived, or metastable, states visited by the system and 2) determining the rates of transitioning between these states. Unfortunately, it is extremely difficult to adequately sample the conformational space accessible to biomolecules. Furthermore, even if adequate sampling can be achieved, the resulting datasets are often quite large and, therefore, difficult to analyze and interpret.

A popular approach to the first step is to use Generalized Ensemble (GE) algorithms (21-25) to sample the accessible space and then to generate projections of the free energy landscape onto some set of order parameters to identify the dominant thermodynamic states (26-29). GE algorithms, such as the Replica Exchange Method (REM) (22, 23) and Simulated Tempering (ST) (24, 25), achieve broad sampling at the temperature of interest by performing a random walk in temperature space. Broad sampling is possible because an energy barrier that is difficult to cross at the temperature of interest will be flattened out and, therefore, more easily crossed at higher temperatures. GE algorithms also maintain canonical sampling at every temperature. Thus, they are a suitable way to sample the accessible space.
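As an illustration of how the walk in temperature space preserves canonical sampling, the Python sketch below implements the standard Metropolis criterion for exchanging configurations between two replicas in REM: a swap is accepted with probability min(1, exp[(β_i − β_j)(E_i − E_j)]), where β = 1/kT. The function names, units (kcal/mol and Kelvin), and example energies are assumptions made for the example and do not come from the text.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def swap_probability(temp_i, energy_i, temp_j, energy_j):
    """Metropolis acceptance probability for exchanging two replicas."""
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)

def attempt_swap(temp_i, energy_i, temp_j, energy_j, rng=random):
    """Return True if the exchange is accepted."""
    return rng.random() < swap_probability(temp_i, energy_i, temp_j, energy_j)

# The cold replica holds the higher-energy configuration: always accepted.
p_favorable = swap_probability(300.0, -100.0, 350.0, -120.0)
# The cold replica holds the lower-energy configuration: rarely accepted.
p_unfavorable = swap_probability(300.0, -120.0, 350.0, -100.0)
print(p_favorable, p_unfavorable)
```

Because the acceptance rule satisfies detailed balance in the joint ensemble of all replicas, each temperature continues to sample its own canonical distribution even as configurations move up and down the temperature ladder.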

Projections of the free energy landscape onto a few order parameters are frequently used to make sense of the resulting dataset (26-29). Such projections may be meaningful if an appropriate set of order parameters is chosen; however, this is quite difficult so there is always the danger of being misled by projections because meaningful information along other order parameters may be completely lost (3, 30). For example, structures that fall within the same basin in some projection may have little structural or kinetic similarity. Thus, choosing a representative conformation for that basin may be impossible.

Clustering methods, on the other hand, do not have these issues because the dominant order parameters do not need to be specified in advance. However, most clustering algorithms group conformations together based solely on their structural similarity (31, 32), so they may fail to capture important kinetic properties. To illustrate the importance of integrating kinetic information into the clustering of simulation trajectories, one can imagine two people standing on either side of a wall. Geometrically these two individuals may be very close but kinetically speaking it could be extremely difficult for one to get to the other. Similarly, two conformations from a simulation dataset may be geometrically close but kinetically distant and, therefore, a clustering based solely on a geometric criterion would be inadequate for describing the system’s dynamics.

Here we describe the use of Markov State Models (MSMs) to identify metastable states in GE datasets, though we note that the MSMBuilder package we introduce to build MSMs may be applied to any simulation dataset. An MSM may be thought of as a form of clustering that incorporates kinetic information by grouping conformations that can interconvert rapidly into the same state and conformations that cannot interconvert rapidly into different states (3, 6, 9, 11, 33, 34). Thus, conformations in the same metastable state, which may be thought of as a large free energy basin, will be grouped together while conformations separated by large free energy barriers will not.

A biomolecular folding free energy landscape may be thought of as a hierarchy of basins (35, 36). Since larger basins may contain numerous smaller local minima our use of the phrase free energy basin above is somewhat ambiguous. To determine what constitutes a distinct free energy basin an MSM may be represented as a transition probability matrix where the entry at row i and column j gives the probability of transitioning from state i to state j during a time Δt, called the lag time. Based on this matrix one may obtain a series of implied timescales for transitioning between various regions of phase space and use this information to determine an appropriate number of metastable states, as explained below. The number of metastable states to be constructed controls the resolution of the model by determining how large a barrier must be in order to divide phase space into multiple states.
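The relationship between the transition probability matrix and the implied timescales can be made concrete with a short Python sketch. It uses the standard relation τ_k = −Δt / ln λ_k, where λ_k are the eigenvalues of the transition matrix other than the stationary eigenvalue of 1. The two-state matrix and the function name are hypothetical.

```python
import numpy as np

def implied_timescales(T, lag_time):
    """Implied timescales from the non-stationary eigenvalues of T."""
    evals = np.sort(np.linalg.eigvals(T).real)[::-1]
    nontrivial = evals[1:]                     # drop the eigenvalue of 1
    nontrivial = nontrivial[nontrivial > 0.0]  # the logarithm needs positive values
    return -lag_time / np.log(nontrivial)

# Toy two-state model; row i, column j is P(i -> j) during one lag time.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
timescales = implied_timescales(T, lag_time=1.0)
print(timescales)  # one timescale, in units of the lag time
```

For this matrix the only non-stationary eigenvalue is 0.7, so the single implied timescale is −1/ln(0.7), or about 2.8 lag times.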

In the past, MSMs have generally been used to model kinetics and, therefore, have been built from constant temperature data. For example, MSMs have been used to model numerous small systems (33, 37, 38) and a few larger ones (39, 40). Since GE simulations perform a random walk in temperature space, they do not have physical kinetics. However, GE simulations contain the desired canonical ensemble and therefore the desired free energy barriers. These barriers may be flattened or distorted at higher temperatures but the barriers at the temperature of interest should still be sufficient to provide the desired separation of timescales, that is, fast intrastate transitions and slower interstate transitions. Thus, the pseudo-kinetics of GE simulations are still sufficient to identify the dominant metastable states.

In the following sections we describe the use of the MSMBuilder package (now available at https://simtk.org/home/msmbuilder/) to identify the dominant metastable states in GE datasets, though we note the method may be applied as is to datasets generated with other algorithms and is easily extensible to completely different problems. There are four major steps in the procedure: 1) dividing the data into small sets called microstates based on their structural similarity, 2) lumping kinetically related microstates together into metastable states (also called macrostates), 3) extracting representative conformations for each state, and optionally 4) calculating populations of each state to judge convergence. Steps 1-3 are depicted schematically in Figure 1. The conformations extracted with this method represent the space explored by the system and thus give insights into its dynamics. The pseudo-kinetics of the GE simulations may give some indication of the connectivity of these states but cannot give conclusive results due to the random walk in temperature space. However, this method may serve as a basis for obtaining both accurate thermodynamics and kinetics (Huang et al. in preparation).

Figure 1. Schematic of the steps required for building an MSM and obtaining representative conformations for each state. First, GE data represented by points are grouped into microstates represented by circles, with darker circles for more highly populated microstates. Kinetically related microstates are then lumped together into macrostates, or metastable states, represented by amorphous shapes. Finally, representative conformations are obtained by extracting the most probable conformation from each macrostate.

DESCRIPTION OF METHOD

1. DIVIDING THE DATA INTO MICROSTATES

The first step in building an MSM is to divide the data into thousands of microstates based on their structural similarity (6). For conformational dynamics we measure structural similarity by the RMSD for some subset of the atoms. While the RMSD may not be very meaningful for large distances, it does have a kinetic interpretation for small distances. That is, conformations with very small RMSDs should be able to interconvert rapidly. Thus, if a microstate is small enough that every member has a very small RMSD to every other member then one may assume that their structural similarity implies a kinetic similarity.
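For readers unfamiliar with the RMSD as a distance metric, the Python sketch below computes the minimum RMSD between two conformations after optimal rigid-body superposition using the Kabsch algorithm. This is a generic NumPy illustration, not the implementation used in MSMBuilder, and the coordinate arrays are random stand-ins for the selected subset of atoms.

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Minimum RMSD between two (n_atoms, 3) arrays after superposition."""
    a = coords_a - coords_a.mean(axis=0)       # remove translation
    b = coords_b - coords_b.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)          # Kabsch: SVD of the covariance
    d = np.sign(np.linalg.det(vt.T @ u.T))     # guard against reflections
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    a_rotated = a @ rotation.T
    return float(np.sqrt(np.mean(np.sum((a_rotated - b) ** 2, axis=1))))

# Hypothetical conformation and a rotated, translated copy of it.
rng = np.random.default_rng(0)
conf_a = rng.normal(size=(10, 3))
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
conf_b = conf_a @ rot_z.T + np.array([1.0, 2.0, 3.0])
print(rmsd(conf_a, conf_b))
```

Because the superposition removes overall rotation and translation, the two copies have an RMSD of essentially zero; only internal rearrangements contribute to the distance.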

However, one must also take care not to generate microstates that are too small because it is important to see a sufficient number of transitions between them. For example, if every conformation were put into its own microstate no pair of trajectories would ever visit the same microstate. Thus, the most meaningful grouping of microstates would be to group every conformation in the same trajectory together and no new insight would be gained.

One method for determining an appropriate size for each microstate is to measure the average RMSD between every pair of temporally adjacent conformations in each trajectory and to ensure that the diameter of each microstate is no more than this value (Sun et al. in preparation). Thus, any pair of conformations within a given microstate will tend to be within one MD step of each other. However, this method may be overly stringent. We have found that using microstates with an all-heavy-atom RMSD radius of about 3.0 Å allows us to capture the true equilibrium distribution for an 8 nucleotide RNA hairpin. Preliminary work in our lab shows that radii on the order of 2-2.5 Å seem more appropriate for protein systems.
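The first criterion can be sketched as follows in Python. The example averages the distance between temporally adjacent frames over all trajectories and, reading "diameter" as twice the cluster radius, converts it into an upper bound on the radius. For brevity the default distance is a plain RMSD without superposition, which assumes the frames are already aligned; in practice a superposition-based RMSD would be passed in instead. All names and the toy trajectory are hypothetical.

```python
import numpy as np

def plain_rmsd(a, b):
    """RMSD without superposition; assumes the frames are pre-aligned."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def mean_adjacent_rmsd(trajectories, distance=plain_rmsd):
    """Average distance between temporally adjacent frames."""
    dists = [distance(traj[t], traj[t + 1])
             for traj in trajectories
             for t in range(len(traj) - 1)]
    return float(np.mean(dists))

# Toy trajectory: 4 frames of 5 atoms drifting 0.1 units per frame along x.
toy_traj = np.array([np.zeros((5, 3)) + [0.1 * t, 0.0, 0.0] for t in range(4)])
max_diameter = mean_adjacent_rmsd([toy_traj])
max_radius = 0.5 * max_diameter
print(max_diameter, max_radius)
```

As noted above, this bound may be stricter than necessary, so it is best treated as a conservative starting point for the microstate size.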

One can use the doFastGromacsClustering executable provided by the Clusterer component of the MSMBuilder package to divide a dataset into microstates. At present the Clusterer code is capable of using an approximation of the k-centers clustering algorithm (41, 42) to divide simulation datasets generated with the Gromacs software package (43) into some number of microstates. However, it is written in object-oriented C++ code so it is straightforward to add new clustering algorithms, data types to cluster, distance metrics, and other components.

The approximate k-centers clustering algorithm was chosen as the default clustering method because it is deterministic, simple, fast, and creates clusters with approximately equal radii (42). The algorithm works as follows: 1) every point is initially infinitely far from any cluster center, 2) choose an arbitrary point as the first cluster center, 3) compute the distance between every point and the new cluster center, 4) assign points to this new cluster center if they are closer to it than the cluster center they are currently assigned to, 5) declare the point that is furthest from every cluster center to be the next new cluster center, and 6) repeat steps 3-5 until the desired number of clusters have been generated. Thus, the algorithm has complexity O(kN) where k is the number of clusters to be generated and N is the number of data points to be clustered. An order of magnitude speedup is also made possible by using the triangle inequality to avoid unnecessary distance computations (Sun et al. in preparation). This fast version of the algorithm is used by default, though the original version described above is also available. Besides the cluster definitions, this program also gives the radius of each microstate and the average and standard deviation of the RMSD from every member of the microstate to the cluster center.
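The steps listed above translate almost line for line into the following Python sketch of the basic (unaccelerated) algorithm. It is an illustration rather than the Clusterer code itself: Euclidean distance between points stands in for the RMSD between conformations, the triangle-inequality speedup is omitted, and the function name and toy data are hypothetical.

```python
import numpy as np

def k_centers(points, k, first_center=0):
    """Approximate k-centers clustering with O(kN) distance evaluations."""
    n_points = len(points)
    dist_to_center = np.full(n_points, np.inf)  # step 1: infinitely far
    assignments = np.zeros(n_points, dtype=int)
    centers = []
    new_center = first_center                   # step 2: arbitrary first center
    for cluster_id in range(k):
        centers.append(new_center)
        # Step 3: distance from every point to the new center.
        d = np.linalg.norm(points - points[new_center], axis=1)
        # Step 4: reassign points that are closer to the new center.
        closer = d < dist_to_center
        assignments[closer] = cluster_id
        dist_to_center[closer] = d[closer]
        # Step 5: the point furthest from its center seeds the next cluster.
        new_center = int(np.argmax(dist_to_center))
    return np.array(centers), assignments, dist_to_center

# Three well-separated pairs of points in two dimensions.
points = np.array([[0.0, 0.0], [0.1, 0.0],
                   [10.0, 0.0], [10.1, 0.0],
                   [0.0, 10.0], [0.0, 10.1]])
centers, assignments, radii = k_centers(points, k=3)
print(centers, assignments, radii.max())
```

Each pass through the loop touches every point once, which gives the O(kN) cost noted above, and always placing the next center at the point furthest from all existing centers is what keeps the cluster radii roughly equal.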

The arbitrary starting point used by this approximate k-centers clustering algorithm would be of some concern for small k or if the microstates were our primary interest. However, we have found that the clustering results are insensitive to the starting point for large k (e.g. k > 1000). In addition, we are mainly concerned with the macrostates generated by lumping kinetically related microstates together. The lumping algorithm described in the next section is fairly insensitive to the exact boundaries between microstates as long as each microstate is sufficiently small, so the arbitrary starting point is acceptable for building MSMs.

An attractive feature of this approximate k-centers algorithm is that it yields clusters of approximately equal volume (as judged by using the maximal distance between the cluster center and any other point in the cluster as the radius of a sphere) (42). This property is of value because it means that the population of a cluster is approximately proportional to its density in phase space. However, we note that exploiting this interpretation requires some caution as it is unclear how to compute exact volumes in a high dimensional phase space and, therefore, difficult to measure densities in phase space precisely. Regardless, this property also allows the boundaries between metastable states to be well-resolved. Clustering algorithms that do not have this guarantee may create large clusters in sparse regions of phase space and small clusters in dense regions. The large clusters in sparse regions of phase space are prone to violate the assumption that conformations within a microstate are kinetically related. Therefore, various conformations in the microstate may be most kinetically related to different metastable states, in which case it will be unclear which macrostate to group the microstate with.

2. LUMPING MICROSTATES INTO MACROSTATES Conceivably, one could extract a representative conformation for each microstate to get an idea of the conformational space explored by the system of interest. However, this would only be a slight improvement upon examining the raw data itself. Instead, it is valuable to lump kinetically related microstates together into metastable states, also called macrostates. The tools for lumping together microstates, as well as for extracting representative conformations and determining state populations, may be found in the PythonTools component of the MSMBuilder package.

The first step in generating a set of macrostates is to determine how many of them to create (6). This task may be accomplished with the BuildMSMsAsVaryLagTime.py script. This script builds a microstate MSM for each of a series of lag times. A microstate MSM is just a transition probability matrix where the entry in row i and column j is the probability that a simulation will be in microstate j at time t+Δt given that it was in microstate i at time t. A series of implied timescales is then calculated and printed to a file for each microstate MSM based on the eigenvalues of the transition probability matrix. These implied timescales correspond to the timescales for transitioning between different sets of microstates. An appropriate number of macrostates to build can be determined based on the location of the major gap in the implied timescales, which should correspond to the largest separation of timescales within the system. The implied timescales for multiple lag times are examined because the location of this gap is normally sensitive to the lag time. Ideally the implied timescales will level out as the lag time increases (34) and obvious gaps that are robust with respect to the lag time will be apparent, as indicated in Figure 2. An appropriate number of macrostates is then one more than the number of implied timescales above the major gap (3, 6). In non-ideal cases the number of implied timescales above the gap will not level off. In such cases we recommend erring on the side of having too many macrostates rather than too few. If too many macrostates are generated then some of the representative conformations may be redundant (only separated by small barriers), whereas if too few are constructed important regions of phase space may not be identified.

Figure 2. Implied timescales as a function of the lag time. There are two probable gaps in the implied timescales. If gap one were selected then a macrostate MSM with four states would be constructed whereas if gap two were selected a higher resolution MSM with 6 states would be constructed.
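The gap criterion described above can be illustrated with a short sketch. The helper below is hypothetical (it is not part of MSMBuilder) and assumes that the "major gap" is the largest ratio between consecutive implied timescales at a single lag time. In practice the gap should still be checked by eye across several lag times.

```python
import numpy as np

def n_macrostates_from_gap(timescales):
    """Hypothetical helper: macrostate count implied by the largest timescale gap."""
    ts = np.sort(np.asarray(timescales, dtype=float))[::-1]  # slowest first
    ratios = ts[:-1] / ts[1:]              # separation between neighbouring timescales
    n_above = int(np.argmax(ratios)) + 1   # number of timescales above the largest gap
    return n_above + 1                     # one more state than slow processes

# Made-up implied timescales (arbitrary units) with a clear gap after the third.
example_timescales = [900.0, 750.0, 600.0, 40.0, 30.0, 25.0]
n_macro = n_macrostates_from_gap(example_timescales)
print(n_macro)
```

With three timescales above the gap, the sketch suggests a four-state macrostate model, which matches the "gap one" case in the Figure 2 caption.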

A macrostate MSM with the appropriate number of states may then be built using the BuildMacroMSM.py script. First, this script uses the Perron Cluster Cluster Analysis (PCCA) algorithm (44, 45) to lump together kinetically related microstates. The PCCA algorithm identifies kinetic relationships based on the eigenvalue/eigenvector structure of the microstate MSM and will not be described in detail here. This initial lumping is then refined using simulated annealing to maximize the metastability (6), which is defined as

$$ Q = \sum_{i=1}^{N} T_{ii} \qquad (1) $$

where N is the number of macrostates and T is the macrostate MSM transition probability matrix. In words, the metastability is the sum of the self-transition probabilities of each macrostate. Thus, the metastability may range from 0 to N. Maximizing the metastability is a heuristic for maximizing the separation of timescales (6). During each simulated annealing step a randomly selected microstate is reassigned to a randomly selected macrostate, the resulting change in metastability is calculated, and the move is either accepted or rejected based on the Metropolis criterion.
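As a minimal sketch of the quantities involved in this refinement, the code below lumps a microstate count matrix into a macrostate transition matrix, evaluates the metastability as the trace of that matrix, and applies a Metropolis acceptance test to a proposed change. The function names, the toy count matrix, and the annealing temperature parameter are assumptions made for illustration. They are not taken from BuildMacroMSM.py.

```python
import numpy as np

def macro_transition_matrix(counts, mapping, n_macro):
    """Lump microstate counts into a row-normalized macrostate matrix.

    Assumes every macrostate contains at least one observed transition.
    """
    lumped = np.zeros((n_macro, n_macro))
    for i, a in enumerate(mapping):
        for j, b in enumerate(mapping):
            lumped[a, b] += counts[i, j]
    return lumped / lumped.sum(axis=1, keepdims=True)

def metastability(T):
    """Q: sum of the macrostate self-transition probabilities (trace of T)."""
    return float(np.trace(T))

def accept_move(q_old, q_new, temperature, rng):
    """Metropolis criterion for a move intended to increase Q."""
    if q_new >= q_old:
        return True
    return rng.random() < np.exp((q_new - q_old) / temperature)

# Toy data: microstates 0,1 interconvert quickly, as do microstates 2,3.
counts = np.array([[90, 8, 1, 1],
                   [8, 90, 1, 1],
                   [1, 1, 90, 8],
                   [1, 1, 8, 90]], dtype=float)
q_good = metastability(macro_transition_matrix(counts, [0, 0, 1, 1], 2))
q_bad = metastability(macro_transition_matrix(counts, [0, 1, 0, 1], 2))
print(q_good, q_bad)
```

The kinetically sensible lumping gives the higher metastability, so a move from the poor lumping to the good one would always be accepted.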

We recommend using a lag time of one step to build the MSM to maximize the use of all the data. The resulting state definitions and a longer lag time may then be used to obtain populations and transition rates. A lag time within the implied timescale gap should yield a strongly Markovian model. That is, one with a sufficiently large separation of timescales that the assumption that the state at time t+Δt depends only on the state at time t is valid.

The main outputs of the BuildMacroMSM.py script are a mapping from microstates to macrostates and the metastability of this lumping. The mapping from microstates to macrostates may be used to determine which macrostate each data point is in using the WriteMacroAssignments.py script or the doFastGromacsAssign program. In general, the WriteMacroAssignments.py script should be used as it is faster. Both methods allow the user to specify a temperature range and will only print out assignments for conformations within this range. This feature is useful for calculating populations of states at a given temperature. The mapping may also be used by the getMacroStateCenters program to get information about each macrostate, such as the most geometrically central microstate and the average and the standard deviation of the RMSD between that microstate’s center and the center of every other microstate in that macrostate. Such information is useful for getting an idea of the size of each macrostate.

3. EXTRACTING REPRESENTATIVE CONFORMATIONS There are a number of ways of extracting representative conformations for each macrostate. A simple way of getting a single conformation is to use the getMacroStateCenters program as discussed above. However, one must remember that conformations selected in this manner represent the geometric center of each macrostate and not necessarily the most probable member of each macrostate.

To understand the distribution of conformations in each macrostate one may identify the central conformation of each microstate in a given macrostate using the GetMicroCentersByMacroState.py script. The conformations for a given macrostate may then be overlaid in a viewer for visual analysis. Such an approach may be cumbersome if there are too many microstates in each macrostate. One alternative is to randomly select a reduced number of conformations from each macrostate using the GetRandomConfsFromEachState.py script. A major shortcoming of these methods is that they select conformations with a more or less uniform distribution across the macrostate.

Probably the best way of extracting representative conformations is to use the GetDensityInfo.py script. This script outputs a list of the microstates in each macrostate ordered from densest to sparsest. That is, the most probable to the least probable. Any number of the most probable structures in a given macrostate may then be selected and overlaid in a viewer to get an idea of the distribution of conformations within the state.

4. JUDGING CONVERGENCE Unfortunately there is no analytic way of checking that a single set of simulations has explored the entire accessible space for a given system and, therefore, yielded representative conformations that accurately describe the conformational dynamics. To the best of our knowledge, the most effective way to ensure that the entire space has been explored is to run two distinct sets of simulations started from very different initial configurations. The populations for each state may then be calculated for each dataset. If they agree then one can be relatively sure that the entire space has been explored because the thermodynamics found are independent of the starting conformation.

One practical consideration is that the same state definition must be used for both datasets because it is unclear how to compare different MSMs. A common state definition may be obtained by building a single MSM based on both datasets. The WriteMacroAssignments.py script or doFastGromacsAssign program may then be used to independently assign each dataset to this common state definition, preferably restricting the assignments to the temperature range of interest for GE datasets, so that the population of each state may be determined. Of course, due to the stochastic nature of conformational dynamics the two sets of populations are unlikely to agree exactly. To make a valid comparison the GetMacroMSMPopStats.py script may be used to obtain error bars on the populations from each dataset. This script uses a bootstrapping algorithm to approximate the variation in the populations. If the populations agree within error then the two simulations may be considered to have converged to the true equilibrium distribution and one may be relatively sure that the entire accessible space has been explored. Thus, the conformations extracted in step 3 will provide an accurate depiction of the conformational dynamics of the system.
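The comparison described above can be sketched as follows. This is an illustrative bootstrap over whole trajectories, not a reproduction of GetMacroMSMPopStats.py, whose internals are not described here. The function names, the number of bootstrap samples, and the agreement test (difference no larger than the summed error bars) are all assumptions.

```python
import numpy as np

def bootstrap_populations(trajs, n_states, n_boot=200, seed=0):
    """Mean and standard deviation of state populations over bootstrap samples.

    trajs: list of 1D integer arrays of macrostate assignments.
    """
    rng = np.random.default_rng(seed)
    pops = np.empty((n_boot, n_states))
    for b in range(n_boot):
        # Resample whole trajectories with replacement.
        picks = rng.integers(0, len(trajs), size=len(trajs))
        counts = np.zeros(n_states)
        for k in picks:
            counts += np.bincount(trajs[k], minlength=n_states)
        pops[b] = counts / counts.sum()
    return pops.mean(axis=0), pops.std(axis=0)

def populations_agree(mean_a, err_a, mean_b, err_b):
    """True if every state population agrees within the combined error bars."""
    return bool(np.all(np.abs(mean_a - mean_b) <= err_a + err_b))

# Toy assignments for two datasets mapped onto one common state definition.
set_a = [np.array([0, 0, 1, 1, 2]), np.array([0, 1, 1, 1, 2]), np.array([0, 0, 0, 1, 2])]
set_b = [np.array([0, 1, 1, 2, 2]), np.array([0, 0, 1, 1, 2]), np.array([0, 0, 1, 1, 1])]
mean_a, err_a = bootstrap_populations(set_a, n_states=3)
mean_b, err_b = bootstrap_populations(set_b, n_states=3)
print(populations_agree(mean_a, err_a, mean_b, err_b))
```

Resampling whole trajectories rather than individual frames is one way to respect the correlation between consecutive snapshots within a trajectory.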

CONCLUSIONS

Using the MSMBuilder to analyze GE simulations and other datasets will allow researchers to quickly map out the conformational space explored by biological macromolecules like RNA, which is the first step to understanding conformational dynamics. The MSMBuilder may also be used to determine the rates of transitioning between states in microcanonical and canonical simulations, resulting in a complete Markov state model for the system’s conformational dynamics. While more sophisticated algorithms for building MSMs exist (6), they are not likely to provide much improvement for analyzing GE datasets due to the distortion resulting from high temperature data. However, the highly extensible object-oriented design of the code should allow such algorithms to be incorporated easily for use with other datasets. Incorporating other data types, clustering methods, distance metrics, and analysis tools should also be straightforward. In particular, this software serves as a foundation for automating adaptive sampling algorithms (19), which promise to allow the maximal use of one’s computing resources by focusing sampling on regions of uncertainty. Finally, the results of applying this method to GE datasets may be used as a basis for determining the rates of transitioning between states (Huang et al. in preparation), thereby giving a complete picture of a system’s dynamics.


CHAPTER 2: PROGRESS AND CHALLENGES IN THE AUTOMATED CONSTRUCTION OF MARKOV STATE MODELS FOR FULL PROTEIN SYSTEMS

This chapter was taken from: Bowman GR, Beauchamp KA, Boxer G, & Pande VS (2009) Progress and challenges in the automated construction of Markov state models. Journal of Chemical Physics 131:124101.

ABSTRACT

Markov State Models (MSMs) are a powerful tool for modeling both the thermodynamics and kinetics of molecular systems. In addition, they provide a rigorous means to combine information from multiple sources into a single model and to direct future simulations/experiments to minimize uncertainties in the model. However, constructing MSMs is challenging because doing so requires decomposing the extremely high dimensional and rugged free energy landscape of a molecular system into long-lived states, also called metastable states. Thus, their application has generally required significant chemical intuition and hand tuning. To address this limitation we have developed a toolkit for automating the construction of MSMs called MSMBuilder (available at https://simtk.org/home/msmbuilder). In this work we demonstrate the application of MSMBuilder to the villin headpiece (HP-35 NleNle), one of the smallest and fastest folding proteins. We show that the resulting MSM captures both the thermodynamics and kinetics of the original molecular dynamics of the system. As a first step towards experimental validation of our methodology we show that our model provides accurate structure prediction and that the longest timescale events correspond to folding.


INTRODUCTION

For a molecular system, the distribution of conformations and the dynamics between them is determined by the underlying free energy landscape. Thus, the ability to map out a molecule’s free energy landscape would yield solutions to many outstanding biophysical questions. For example, structure prediction could be accomplished by identifying the free energy minimum (46), leading to insights into catalytic mechanisms of proteins that are difficult to crystallize. Intermediate states, such as those currently thought to be the primary toxic elements in Alzheimer’s disease (47), could also be identified by locating local minima. As a final example, protein folding mechanisms could be understood by examining the rates of transitioning between all the relevant states.

Unfortunately, the free energy landscapes of solvated biomolecules are extremely high dimensional and there is no analytical means to identify all the relevant features, especially when one is concerned with molecules in which small molecular changes yield significant perturbations of the system, such as amino acid mutations in proteins. Therefore, a theoretical treatment requires sampling the potential, generally using Monte Carlo (MC) or Molecular Dynamics (MD), and then inferring information about the states in the free energy landscape from the sampled configurations. Moreover, if one is interested in kinetic properties, one must go further and sample kinetic quantities (e.g. rates) of interconversion between these thermodynamic states.

Mapping out a molecule’s free energy landscape can be broken down into three stages: 1) identifying the relevant states and, in particular, the native state, 2) quantifying the thermodynamics of the system, and 3) quantifying the kinetics of transitioning between the states. Each of these stages builds upon the preceding stages. In fact, this hierarchy of objectives is evident in the literature. For example, in the structure prediction community it is common to plot the free energy as a function of the RMSD to the native state (48). Such representations allow researchers to quickly assess whether or not their potential accurately captures the most experimentally verifiable state, the native state. However, they provide little information on the presence of other states, their relative probabilities, or the kinetics of moving between them (49). Projections of the free energy landscape onto multiple order parameters, on the other hand, may capture multiple states and their thermodynamics (30, 49). The main limitation of these representations is that they depend heavily upon the order parameters selected (30). If the order parameters are not good reaction coordinates, then important features may be distorted or even completely obscured (30, 50). Furthermore, barring the selection of a perfect set of reaction coordinates, such projections only yield limited information about the system’s kinetics due to loss of information about other important degrees of freedom (51).

Clustering techniques are a promising means of overcoming these limitations as they allow the automatic identification of the relevant degrees of freedom (52). However, most clustering techniques are based solely on geometric criteria (31, 32) so they may fail to capture important kinetic properties. To illustrate the importance of integrating kinetic information into the clustering of simulation trajectories, one can imagine two people standing on either side of a wall. Geometrically these two individuals may be very close but kinetically speaking it could be extremely difficult for one to get to the other. Similarly, two conformations from a simulation dataset may be geometrically close but kinetically distant and, therefore, a clustering based solely on a geometric criterion would be inadequate for describing the system’s dynamics.

Markov State Models (MSMs) fit nicely into this progression as they provide a natural means to achieve a complete understanding of a molecule’s free energy landscape: a map of all the relevant states with their correct thermodynamics and kinetics (3, 6, 9, 10, 53). The critical distinction between MSMs and other clustering techniques is that an MSM constitutes a kinetic clustering of one’s data (3, 6, 9, 10). That is, conformations that can interconvert rapidly are grouped into the same state while conformations that can only interconvert slowly are grouped into separate states. Such a kinetic clustering ensures that equilibration within a state, and therefore loss of memory of the previous state, occurs more rapidly than transitions between states. As a result, the model satisfies the Markov property: the identity of the next state depends only on the identity of the current state and not any of the previous states.

MSMs are better able to capture the stochastic nature of processes like protein folding than traditional analysis techniques, allowing more quantitative comparisons with and predictions of experimental observables. Thus, they will allow researchers to move beyond the traditional view of MD simulations as molecular microscopes. An MSM also provides a natural means of varying the resolution of one’s model. For example, consider a protein folding process that occurs on a 10 μs timescale. Using a cutoff of one ns to distinguish a fast transition from a slow one would yield a high resolution model that may be difficult to interpret by eye. Using a cutoff of one μs, however, would likely yield a high-level model capturing the essence of the process in a human readable form. MSMs provide a rigorous means to combine data from multiple sources and can be used to extract information about long timescale events from short simulations (11, 54, 55). Finally, there are a number of ways of exploiting MSMs to minimize the amount of computation that must be performed to achieve a good model for a given system (12, 19, 20).

Unfortunately, constructing MSMs is a difficult task because it requires dividing the rugged and high dimensional free energy landscape of a system into metastable states (6). A good set of states will tend to divide phase space along the highest free energy barriers. More specifically, none of the states will have significant internal barriers. Such a partitioning ensures the separation of timescales discussed above (intrastate transitions are fast relative to interstate transitions) and, therefore, that the model is Markovian. States with high internal barriers break the separation of timescales and introduce memory. To illustrate this situation, imagine a state divided in half by a single barrier that is higher than any barrier between states. Besides breaking the separation of timescales by causing transitions within this state to be slow relative to transitions between states, trajectories that enter the state to the left of the internal barrier will also tend to leave to the left while trajectories that enter on the right will tend to leave to the right. Thus, the probability of any possible new state will depend both on the identity of the current state and the previous state, breaking the Markov property. Avoiding such internal barriers has generally required a great deal of chemical insight and hand tuning (33, 39); thus, the application of MSMs has been limited.

To facilitate the more widespread use of MSMs we have developed an open source software package called MSMBuilder that automates their construction (now available at https://simtk.org/home/msmbuilder) (10). MSMBuilder builds on previous automated methods (6) by incorporating new geometric and kinetic clustering algorithms. It also provides a command-line interface built on top of an object oriented structure that should allow for the rapid incorporation of new advances. In summary, MSMBuilder works as follows: 1) group conformations into very small states called microstates and assume the high degree of structural similarity within a state implies a kinetic similarity, 2) validate that this state decomposition is Markovian, and optionally 3) lump the microstates into some number of macrostates based on kinetic criteria and ensure that this macrostate model is Markovian. There are also a number of tools for analyzing and visualizing the model at both the microstate and macrostate levels.

In this work we demonstrate that MSMBuilder is able to construct MSMs for full protein systems in an automated fashion by applying it to the villin headpiece (HP-35 NleNle) (56, 57). Unlike the peptides that have been studied with automated methods in the past (6), villin has all the hallmarks of a protein, such as a hydrophobic core and tertiary contacts. It is also fast folding, so it is possible to carry out simulations on timescales comparable to the folding time (58).

Our hope is that this work will serve as a guide for future users of MSMBuilder. Thus, we will discuss failed models, the insights these models gave us, and how these insights led to the final model. We will also discuss some of the remaining limitations in the automated construction of MSMs. In addition, we will demonstrate that our model yields accurate structure prediction and that the longest timescales correspond to folding. However, our main emphasis will be on the methodology of building MSMs that faithfully represent the raw simulation data. In particular, we will focus on the microstate level as this is the finest resolution and bounds the performance of lower resolution models. The full biophysical implications of the model and their relation to experimental results will be discussed more thoroughly in a later work.

MATERIALS & METHODS

SIMULATION DETAILS The data set used in this study was taken from Ensign et al. (58) and is described briefly below. It consists of ~450 simulations ranging from 35 ns to 2 μs in length and is publicly available at the SimTK website (https://simtk.org/home/foldvillin).

First, the crystal structure (PDB structure 2F4K) (56) was relaxed using a steepest descent algorithm in GROMACS (43, 59) using the AMBER03 force field (60). The resulting structure was placed in an octahedral box of dimensions 4.240 nm×4.969 nm×4.662 nm and solvated with 1306 TIP3P water molecules. Nine 10 ns high temperature simulations (at 373 K), each with different initial velocities drawn from a Maxwell–Boltzmann distribution, were run from this solvated structure. The final structures from each of these unfolding simulations were then used as the initial points for ~450 folding simulations at 300 K.

Folding simulations were preceded by 10 ns equilibration simulations at constant volume with the protein coordinates fixed. For all MD simulations, the SHAKE (61) and SETTLE (62) algorithms were used with the default GROMACS 3.3 parameters to constrain bond lengths. Periodic boundary conditions were employed. To control temperature, protein and solvent were coupled separately to a Nosé–Hoover thermostat (63, 64) with an oscillation period of 0.5 ps. The system was coupled to a Parrinello–Rahman barostat (65, 66) at 1 bar, with a time constant of 10 ps, assuming a compressibility of 4.5×10^-5 bar^-1. Velocities were assigned randomly from a Maxwell–Boltzmann distribution. The linear center-of-mass motion of the protein and solvent groups was removed every ten steps. A cutoff at 0.8 nm was employed for both the Coulombic and van der Waals interactions. During these simulations, the long-range electrostatic forces were treated with a reaction field assuming a continuum dielectric of 78, and the van der Waals interactions were treated with a switch from 0.7 nm to 0.8 nm. The neighbor list was set to 0.7 nm for computational performance.

MARKOV STATE MODEL CONSTRUCTION All the MSMs used in this paper were constructed with MSMBuilder (10), the relevant components of which are reviewed below. A significant modification of the code was the introduction of sparse matrix types, which allows the construction of MSMs with many more states than previously possible by making more efficient use of the available memory. Sparse matrices will be included in the next release of MSMBuilder.

CLUSTERING An approximate k-centers clustering algorithm was used to generate the microstates in all the MSMs used in this study (41, 42). The algorithm works as follows: 1) choose an arbitrary point as the first cluster center, 2) compute the distance between every point and the new cluster center, 3) assign points to this new cluster center if they are closer to it than the cluster center they are currently assigned to, 4) declare the point that is furthest from every cluster center to be the next new cluster center, and 5) repeat steps 2-4 until the desired number of clusters has been generated. The computational complexity of this algorithm is O(kN) where k is the number of clusters and N is the number of data points to be clustered. The algorithm is intended to give clusters with approximately equal radii, where the radius of a cluster is defined as the maximum distance between the cluster center and any other data point in the cluster. Given that MD simulations are Markovian (9), it should be possible to generate a Markov model for simulation dynamics by constructing sufficiently small (or numerous) states. However, the size of a given data set will limit how many clusters can be generated because reducing the number of conformations in each state will eventually result in an unacceptable level of statistical uncertainty.
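The steps listed above can be sketched in a few lines. This is an illustrative version only: it uses Euclidean distance between coordinate vectors as a stand-in for the RMSD metric, and the function name and return values are assumptions rather than the MSMBuilder implementation.

```python
import numpy as np

def k_centers(points, k, first=0):
    """Approximate k-centers clustering with a Euclidean stand-in metric.

    Returns the indices of the cluster centers, the cluster index of every
    point, and each point's distance to its assigned center.
    """
    points = np.asarray(points, dtype=float)
    centers = [first]                                   # step 1: arbitrary first center
    assignments = np.zeros(len(points), dtype=int)
    dist = np.linalg.norm(points - points[first], axis=1)
    while len(centers) < k:
        new = int(np.argmax(dist))                      # step 4: furthest point becomes a center
        centers.append(new)
        d_new = np.linalg.norm(points - points[new], axis=1)   # step 2
        closer = d_new < dist                           # step 3: reassign points that are closer
        assignments[closer] = len(centers) - 1
        dist[closer] = d_new[closer]
    return np.array(centers), assignments, dist

# Toy one-dimensional data.
data = np.array([[0.0], [0.1], [5.0], [5.2], [10.0]])
centers, labels, dist = k_centers(data, k=3)
print(centers, labels, dist.max())
```

Each pass costs one distance evaluation per data point, which is where the O(kN) scaling comes from. The largest entry of the returned distance array is the largest cluster radius.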

Based on the Boltzmann relationship, we can calculate the free energy of a state as -kT log(p), where p is the probability of being in the state. Though small variations in the radii of microstates may imply quite large variations in their volumes due to the high dimensionality of the phase space of biomolecules, empirically we find that assuming the clusters have equal volume is useful. In particular, we find that interpreting lower free energy microstates as having higher densities and evaluating models based on the correlation between the free energy and RMSD of each microstate agrees with other measures of the validity of an MSM, such as implied timescales plots as discussed below. Because this relationship is not guaranteed to hold, the correlation between microstate free energy and RMSD should never be used as the sole assessment of a model. Nevertheless, as discussed in the Results & Discussion, it is quite useful for identifying potential shortcomings of a given model. These issues are not a concern at the macrostate level.
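A minimal sketch of the free energy estimate follows, assuming populations are estimated from raw assignment counts and that energies are reported in kcal/mol relative to the most populated state. The function name, the reference to the minimum, and the toy data are illustrative assumptions.

```python
import numpy as np

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def microstate_free_energies(assignments, n_states, temperature=300.0):
    """Free energy -kT log(p) of each state, shifted so the minimum is zero.

    States with no counts get an infinite free energy.
    """
    counts = np.bincount(assignments, minlength=n_states).astype(float)
    p = counts / counts.sum()
    with np.errstate(divide="ignore"):
        f = -K_B * temperature * np.log(p)
    return f - f.min()

# Toy assignments: state 0 is twice as populated as state 1.
toy_assignments = np.array([0, 0, 0, 0, 1, 1])
free_energies = microstate_free_energies(toy_assignments, n_states=2)
print(free_energies)
```

A factor of two in population corresponds to kT log(2), roughly 0.4 kcal/mol at 300 K. The correlation with RMSD discussed above could then be computed with, for example, np.corrcoef on the finite entries.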

All clustering in this work was based on the heavy-atom RMSD between pairs of conformations. However, we note that pairs of atoms in the same side chain that are indistinguishable with respect to symmetry operations were excluded from the RMSD computations.

Representative conformations from some clusters are shown using VMD (67).

TRANSITION PROBABILITY MATRICES Transition probability matrices are at the heart of MSMs (9). Row normalized transition probability matrices are used in this study. The element in row i and column j of such a matrix gives the probability of transitioning from state i to state j in a certain time interval called the lag time (τ).

The transition probability matrix serves many purposes. For example, a vector of state probabilities may be propagated forward in time by multiplying it by the transition probability matrix:

$$ p(t + \tau) = p(t)\, T(\tau) \qquad (1.1) $$

where t is the current time, τ is the lag time, p(t) is a row vector of state probabilities at time t, and T(τ) is the row normalized transition probability matrix with lag time τ.
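As an illustration of Equation 1.1, the sketch below estimates a row normalized transition probability matrix from discrete state trajectories and propagates a probability vector. Counting transitions with a sliding window and leaving unvisited rows as zeros are assumptions of this sketch, not a description of how MSMBuilder counts transitions.

```python
import numpy as np

def transition_matrix(trajs, n_states, lag):
    """Row normalized transition probabilities estimated at the given lag (in frames)."""
    counts = np.zeros((n_states, n_states))
    for traj in trajs:
        traj = np.asarray(traj)
        # Sliding window: count every pair of frames separated by `lag`.
        np.add.at(counts, (traj[:-lag], traj[lag:]), 1.0)
    rows = counts.sum(axis=1, keepdims=True)
    # Rows with no counts are left as zeros rather than dividing by zero.
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

def propagate(p0, T, n_steps):
    """Apply p(t + tau) = p(t) T(tau) repeatedly to a row vector."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_steps):
        p = p @ T
    return p

# Toy two-state trajectory.
toy_trajs = [[0, 0, 1, 1, 0, 0, 1, 1]]
T_toy = transition_matrix(toy_trajs, n_states=2, lag=1)
print(T_toy)
print(propagate([1.0, 0.0], T_toy, n_steps=1))
```

Each application of the matrix advances the probability vector by one lag time, so n_steps applications correspond to a total time of n_steps times τ.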

The eigenvalue/eigenvector spectrum of a transition probability matrix gives information about aggregate transitions between subsets of the states in the model and what timescales these transitions occur on (9). More specifically, the eigenvalues are related to an implied timescale for a transition, which can be calculated as

$$ k = -\frac{\tau}{\ln(\mu)} \qquad (1.2) $$

where τ is the lag time and μ is an eigenvalue. The corresponding left eigenvector specifies which states are involved in the aggregate transition. That is, states with positive eigenvector components are transitioning with those with negative components and the degree of participation for each state is related to the magnitude of its eigenvector component (9).
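Equation 1.2 can be evaluated directly from the eigenvalues of a transition probability matrix. The sketch below is illustrative: taking the real part of the eigenvalues, skipping the largest (stationary) eigenvalue, and discarding eigenvalues outside (0, 1) are assumptions made so that the logarithm is well defined.

```python
import numpy as np

def implied_timescales(T, lag, n_timescales=10):
    """Implied timescales k = -lag / ln(mu) for the slowest processes."""
    mu = np.sort(np.linalg.eigvals(T).real)[::-1]   # eigenvalues, largest first
    mu = mu[1:n_timescales + 1]                     # skip the stationary eigenvalue (1)
    mu = mu[(mu > 0.0) & (mu < 1.0)]                # keep values where ln(mu) < 0
    return -lag / np.log(mu)

# Toy two-state matrix with eigenvalues 1 and 0.7.
T_two_state = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
timescales = implied_timescales(T_two_state, lag=1.0)
print(timescales)
```

The result has the same units as the lag time. Repeating the calculation for matrices built at a series of lag times gives the data for an implied timescales plot.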

IMPLIED TIMESCALES PLOTS Implied timescales plots are one of the most sensitive indicators of whether or not a model is Markovian (34). These plots are generated by graphing the implied timescales of an MSM for a series of lag times. If the model is Markovian at a certain lag time then the implied timescales should remain constant for any greater lag time. The minimal lag time at which the implied timescales level off is the Markov time, or the smallest time interval for which the model is Markovian. The implied timescales for a non-Markovian model tend to increase with the lag time instead of leveling off. Unfortunately, increasing the lag time decreases the amount of data and, therefore, increases the uncertainty in the implied timescales. Thus, implied timescales plots can be very difficult to interpret.

In this study error bars on implied timescales plots were obtained using a bootstrapping procedure. Five subsets of the available trajectories were drawn at random with replacement and the averages and variances of the implied timescales for each lag time were calculated.

TIME EVOLUTION OF OBSERVABLES The time evolution of the mean and variance of any molecular observable can be calculated from an MSM. Calculating the time evolution of an observable X requires calculating the average of X in each state i (X_i) and the average of X^2 in each state (X_i^2). In this study we took averages over five randomly selected conformations from each state. An initial state probability vector may then be propagated in time as in Equation 1.1. At each time step the mean and variance can be calculated as

$$ \langle X \rangle = \sum_{i=1}^{N} p_i(t)\, X_i \qquad (1.3) $$

$$ \sigma^2 = \langle X^2 \rangle - \langle X \rangle^2 \qquad (1.4) $$

where N is the number of states, p_i(t) is the probability of state i at time t, σ is the standard deviation, and

$$ \langle X^2 \rangle = \sum_{i=1}^{N} p_i(t)\, X_i^2 \qquad (1.5) $$
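Equations 1.1 and 1.3 through 1.5 combine into a short loop, sketched below. The function name and the toy inputs are assumptions. Here x_mean holds the per-state averages of X and x2_mean holds the per-state averages of X^2.

```python
import numpy as np

def observable_evolution(p0, T, x_mean, x2_mean, n_steps):
    """Mean and variance of an observable at each step of the propagation."""
    p = np.asarray(p0, dtype=float)
    x_mean = np.asarray(x_mean, dtype=float)
    x2_mean = np.asarray(x2_mean, dtype=float)
    means, variances = [], []
    for _ in range(n_steps + 1):
        mean = p @ x_mean                        # Equation 1.3
        mean_sq = p @ x2_mean                    # Equation 1.5
        means.append(mean)
        variances.append(mean_sq - mean ** 2)    # Equation 1.4
        p = p @ T                                # Equation 1.1
    return np.array(means), np.array(variances)

# Toy two-state model in which X is exactly 1 in state 0 and exactly 3 in state 1.
T_obs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
means, variances = observable_evolution([1.0, 0.0], T_obs, [1.0, 3.0], [1.0, 9.0], n_steps=2)
print(means, variances)
```

Starting entirely in state 0, the variance is zero at the first step and grows as probability spreads to state 1.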


RESULTS & DISCUSSION

AN INITIAL MODEL Given the computational cost of running extensive MD simulations an important consideration in constructing an MSM is to maximize one’s use of the available data. Of course, one’s hardware always sets hard upper limits on the amount of data that may be used at each stage of building an MSM. In particular, it may not always be possible to fit all of the available conformations into memory for the initial clustering phase of constructing an MSM with MSMBuilder. A convenient way of overcoming this bottleneck is to use a subset of the available data to generate a set of clusters. Data that was left out during the clustering phase may then be assigned to these clusters.

To maximize the use of our data while satisfying the memory constraints of our system we first sub-sampled our dataset by a factor of 10 and clustered the resulting conformations into 10,000 states. Snapshots were stored every 50 ps during our MD simulations, which will henceforth be referred to as the raw data. Thus, the effective trajectories used during our clustering consisted of snapshots separated by 500 ps. The remaining 90% of the data was subsequently assigned to this 10,000 state model. Fortunately, it is possible to parallelize this assignment phase because the cluster definitions are never updated after the initial clustering.
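The subsample, cluster, then assign workflow described above can be sketched as follows. Euclidean distance is used as a stand-in for RMSD, the random coordinates are placeholder data, and the first five subsampled frames serve as hypothetical cluster centers in place of a real clustering.

```python
import numpy as np

def assign_to_centers(points, centers):
    """Index of, and distance to, the nearest center for every point."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise distances, shape (n_points, n_centers).
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1), d.min(axis=1)

# Placeholder "raw data": 1000 frames of 3 coordinates each.
frames = np.random.default_rng(0).normal(size=(1000, 3))
subsample = frames[::10]              # every tenth frame is used for clustering
stand_in_centers = subsample[:5]      # hypothetical centers, not a real clustering
labels_all, dist_all = assign_to_centers(frames, stand_in_centers)
print(labels_all[:10], dist_all.mean())
```

Because the centers are fixed, each chunk of frames can be assigned independently, which is what makes this phase straightforward to parallelize. The mean distance to the nearest center is the same kind of diagnostic as the average distance quoted later for the initial model.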

As discussed in the introduction, the first criterion for assessing the validity of our model is whether or not it is capable of capturing the native state. The next criterion is whether or not the thermodynamics of the model are correct. An initial assessment of these two criteria may be obtained from a scatter plot of the free energy of each state as a function of the RMSD of the state center from the native state.

There is some correlation between the free energy of a microstate and the RMSD of its center from the crystal structure in this model, as shown in Figure 3A. However, the most native-like RMSD of any of the state centers is 4.15 Å whereas the simulations reach conformations with RMSD values as low as 0.52 Å. This discrepancy is a first indication that there may be significant heterogeneity within the states of this model. In particular, more near-native conformations must have been absorbed into one or more other states. Highly heterogeneous states are likely to violate the assumption that the degree of geometric similarity within a microstate implies a kinetic similarity, preventing the construction of a valid MSM. This conclusion is supported by the fact that the average distance between any conformation and the nearest cluster center is over 4.5 Å.


Figure 3. Scatter plots of the free energy of each microstate (in kcal/mol) versus its RMSD. A) The initial 10,000 state model, B) the 30,000 state model, C) the final 10,000 state model, and D) the final 10,000 state model except that the average RMSD across five structures in each state is used instead of the RMSD of the state center.

Final confirmation of the imperfections of the current 10,000 state model comes from examining the implied timescales as a function of the lag time. If the division into microstates were fine enough to ensure the absence of any large internal barriers, the largest implied timescales should be invariant with respect to the lag time for any lag time greater than the Markov time (34). Figure 4 shows that the implied timescales for this model continue to grow monotonically as the lag time is increased. While the growth is not too severe, it should be possible to improve upon this model given the amount of sampling in the dataset.

Figure 4. Top ten implied timescales for the initial 10,000 state model.
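The implied timescale test can be sketched as follows, using the standard relation t_i = -τ / ln λ_i(τ), where λ_i(τ) is an eigenvalue of the transition matrix estimated at lag time τ. The function names are hypothetical, and the crude symmetrization of the count matrix is an assumption made to keep the eigenvalues real; it is not necessarily the estimator used in this work.

```python
import numpy as np

def count_matrix(trajs, n_states, lag):
    """Count transitions i -> j separated by `lag` frames (sliding window)."""
    c = np.zeros((n_states, n_states))
    for traj in trajs:
        traj = np.asarray(traj)
        np.add.at(c, (traj[:-lag], traj[lag:]), 1.0)
    return c

def implied_timescales(trajs, n_states, lag, dt, k=3):
    """Top-k implied timescales at one lag time, in the units of dt.
    Assumes every state is visited, so no row of counts is empty."""
    c = count_matrix(trajs, n_states, lag)
    c = 0.5 * (c + c.T)                      # crude symmetrization (assumption)
    t = c / c.sum(axis=1, keepdims=True)     # row-normalized transition matrix
    evals = np.sort(np.linalg.eigvals(t).real)[::-1]
    lam = evals[1:k + 1]                     # skip the stationary eigenvalue 1
    with np.errstate(invalid="ignore", divide="ignore"):
        return -lag * dt / np.log(lam)       # nan where an eigenvalue is <= 0
```

Calling `implied_timescales` over a range of lag times and plotting the results against the lag time gives a plot of the kind shown in Figure 4; a Markovian model should produce curves that level off.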

Besides the structural and kinetic heterogeneity within states, the monotonic growth of the implied timescales may also be due to the low number of counts in some states and the resulting uncertainty in transition probabilities from these states. For example, there are fewer than 10 data points in over 100 of the states at the smallest lag time. Even for a state with ten data points, no transition probability can be resolved beyond a single significant digit. Increasing the lag time will reduce the number of data points in every state, having particularly deleterious effects on estimates of transition probabilities from states with low counts in the first place.

MORE STATES ARE NOT ALWAYS BETTER

As a first attempt at improving our original model we increased the number of states from 10,000 to 30,000. Our objective in doing so was to avoid internal barriers by dividing phase space into smaller states. In addition, we hoped to find more near native states by pulling low RMSD conformations into their own clusters.

Clustering the data into more states did indeed result in more near-native states, as shown in Figure 3B. The most native-like state center in the 30,000 state model has an RMSD of 3 Å and there is still a general correlation between low free energy and low RMSD. The average distance between any conformation and its nearest state center was also reduced from 4.5 Å to 3.5 Å.

However, increasing the number of states also had some negative effects on the model. In the 10,000 state model about 1% of the states had 10 or fewer conformations in them whereas in the new 30,000 state model 6% of the states have 10 or fewer conformations. Thus, the uncertainty in the transition probabilities from many states will be greater. In addition, while increasing the number of states did create a handful of more near-native states, it also more than doubled the number of states with an RMSD over 10 Å. These phenomena are consistent with the fact that the approximate k-centers clustering algorithm used in this work tends to create clusters with approximately equal radii (41, 42). When adding more clusters, this property will tend to result in most of the new clusters appearing in large sparse regions of phase space in the tails of the distribution of conformations. As a result of these shortcomings, the 30,000 state model was found to have monotonically increasing implied timescales similar to those for the 10,000 state model and, therefore, is not significantly more Markovian than the previous model (data not shown).

DISREGARDING OUTLIERS DURING CLUSTERING YIELDS A MARKOVIAN MODEL

One approach to dealing with outliers would be to use all the data during the clustering phase and then discard those clusters that behave in unphysical ways, such as clusters that act as sinks. However, such an approach could discard legitimate trapped states. In addition, the tendency of our approximate k-centers algorithm to select outliers as cluster centers could easily result in a large fraction of clusters being discarded.

To deal with the limitations of our clustering algorithm we reverted to using 10,000 states and increased the amount of sub-sampling at the clustering stage from a factor of 10 to a factor of 100, which is equivalent to using trajectories with conformations stored at a 5 ns interval for this data set. This change compensates for the tendency of our approximate k-centers algorithm to select outliers as cluster centers by reducing the number of available data points in the tails of the distribution of conformations at the clustering stage. Thus, increasing the degree of sub-sampling at our clustering stage focuses more clusters in dense regions of phase space where more of the relevant dynamics are occurring. The remaining data can then be assigned to these clusters, so no data is thrown out entirely. Incorporating the remaining data in this manner will tend to enlarge clusters on the periphery of phase space because they will absorb data points in the tails of the distribution of conformations. More central clusters, on the other hand, will tend to stay approximately the same size. The number of data points in every cluster should increase though, allowing better resolution of the transition probabilities from each state.

A very simple kinetically inspired clustering scheme could be implemented by sub-sampling to select N evenly spaced conformations (in time) as cluster centers. In this case a large number of clusters would appear in dense regions of phase space while there would be very few clusters in sparse regions. Our current approach is an intermediate between such a kinetically inspired clustering and the purely geometrically defined clustering used in our first two models. It is intended to have some of the strengths of both approaches, i.e. fine resolution everywhere as in the geometric approach but even more so in dense regions of phase space as in the kinetic approach.

In fact, sub-sampling more at the approximate k-centers clustering stage and then assigning the remaining data to these clusters does improve the structural, thermodynamic, and kinetic properties of the model. Based on our experience with this data set and a few others (RNA hairpins and small peptides, data not shown) a good starting point is to sub-sample such that 10N conformations are used to generate N clusters and conformations used during the clustering are separated by at least 100 ps. The remaining data should then be assigned to these clusters. The degree of sub-sampling and number of clusters may then be adjusted to improve the model as necessary as the optimal parameters will depend on the system. In particular, the optimal strategy may be quite different for much smaller or larger systems.
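The starting-point guideline above can be written as a small helper. This is only a sketch of the rule of thumb: the function name, its arguments, and the dataset size in the example are hypothetical and are not taken from MSMBuilder or from the villin dataset.

```python
import math

def clustering_stride(n_snapshots, n_clusters, snapshot_interval_ps,
                      ratio=10, min_spacing_ps=100.0):
    """Sub-sampling stride such that roughly ratio * n_clusters conformations
    are clustered and clustered conformations are at least min_spacing_ps apart."""
    stride_for_ratio = max(1, n_snapshots // (ratio * n_clusters))
    stride_for_spacing = math.ceil(min_spacing_ps / snapshot_interval_ps)
    return max(stride_for_ratio, stride_for_spacing)

# Hypothetical dataset: 1,000,000 snapshots stored every 50 ps, 10,000 clusters.
stride = clustering_stride(1_000_000, 10_000, 50.0)
```

The larger of the two strides is used so that both conditions hold; the result is a starting value to be adjusted for the system at hand.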

Structural agreement: Figure 3C shows that our new model has state centers with RMSDs as low as 3.4 Å, which is somewhat higher than the 30,000 state model but better than the original model. Examination of randomly selected structures from a number of states revealed that the microstate center is not always a good representative of the state. In particular, some near-native states have a dense pocket of very low RMSD conformations and a handful of outliers. In such cases our approximate k-centers clustering algorithm will select a conformation in between the dense pocket of low RMSD states and the outliers (41) when really a structure from the denser region would be more representative of the state. A further improvement in the structural characterization of the model is made possible by calculating the average RMSD over five randomly selected conformations from each state instead of just the state center, as shown in Figure 3D. This analysis reveals that the most native-like state has an average RMSD of about 1.8 Å. To illustrate the agreement between this state and the crystal structure, Figure 5A shows an overlay of three randomly selected conformations from this state with the crystal structure. An interesting future direction would be to further validate near-native states by comparing them directly with the experimental data rather than the model thereof.

Figure 5. Three representative structures for A) the lowest RMSD state in the final model and B) the most probable state in the final model overlaid with the crystal structure (red). The phenylalanine core is shown explicitly for each molecule.


Thermodynamic agreement: As discussed in the introduction, we cannot calculate the equilibrium distribution of villin analytically so we do not have an absolute reference point to judge our model against. However, there are some promising features of the thermodynamics of the model that lend it credibility. The most populated state has about 4% of the total population and has an average RMSD of 2.3 Å. Figure 5B illustrates the agreement between three random conformations from this state and the crystal structure. The state with the lowest average RMSD also has the fifth highest population, which is about 2% of the total population, and about 12% of the conformations are in states with average RMSD values less than 3 Å. There is also a reasonable correlation between the RMSD and the free energy, as shown in Figure 3D. Our results seem to be robust with respect to the method used for calculating the equilibrium distribution as well, as discussed in Appendix A. Finally, the populations from the MSM are consistent with those from averaging over the raw data in successive windows of the simulation time, indicating that the MSM thermodynamics are in agreement with the underlying potential if not experiment (data not shown).

Here it is important to note that none of the simulations were started from the native state. While this is not formally a blind prediction (since the crystal structure has been previously reported (57)), it is promising that so many simulations folded under the given potential, allowing one to not merely reach the folded state but predict its structure ab initio. It will be interesting to see if this procedure can yield similar results in a blind prediction, or at least when structural criteria are not used as a basis for adjusting the model as in this work.

Kinetic agreement: Another promising feature of this model is that there are no fewer than 12 data points in every state, indicating that this model may be able to better resolve the transition probabilities for most states. In fact, the implied timescales for this model do seem to level off as the lag time is increased. Figure 6A shows that the longest timescales level off at a lag time of about 15 ns but increase moderately at longer lag times. Figure 6B, however, shows that the implied timescales are level within error from 15 to 60 ns. After about 35 ns there is an increase in the statistical uncertainty in the implied timescales, explaining their apparent growth in Figure 6A. After 60 ns the statistical uncertainty becomes enormous so implied timescales beyond this point are not shown. Thus, this model appears to be Markovian at lag times of 15 ns and beyond.


Figure 6. Top ten implied timescales for the final model. A) The implied timescales at intervals of one ns. B) The implied timescales with error bars obtained by doing five iterations of bootstrapping at an interval of five ns.

The longest implied timescale for this model is about 8 μs. While this is quite long relative to the experimentally measured folding time of 720 ns at 300 K (56), it is consistent with previous simulation work suggesting that the experimental measurements may be monitoring structural properties which relax faster than the complete folding process (58). In that study, the authors found that a surrogate for the experimental observable was consistent with the experimental measurements but that longer timescales on the order of 4 μs were present when monitoring the relaxation of a more global metric for folding. Ensign et al. also found timescales as long as ~50 μs by applying a maximum likelihood estimator to a subset of the data with little folding. While this timescale is much longer than any of the implied timescales in our MSM, it is not inconsistent with our model because the rates for transitioning between some states in an MSM, when fit using a two-state kinetics assumption, may be slower than the implied timescales. Ensign et al. likely identified one of these slow rates by focusing on a subset of the data. For a more detailed discussion of this topic with a simple example see Appendix B.

The components of the left eigenvector corresponding to the longest timescale give information about what is occurring on this timescale. That is, states with positive eigenvector components are interchanging with states with negative components and the degree of participation in this aggregate transition is given by the magnitude of the components (9). Figure 7 demonstrates that the longest timescale in our model does correspond to folding by showing that it corresponds to transitions between high and low RMSD states. Numerous states do not participate strongly in this transition, explaining the streak of points with eigenvector components near zero.


Figure 7. The average RMSD of each state in the final model versus its left eigenvector component in the longest timescale transition showing that this transition corresponds to folding.
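The sign structure of the slowest left eigenvector can be demonstrated on a toy transition matrix. The three-state matrix below is hypothetical: states 0 and 1 interconvert quickly, while state 2 exchanges slowly with both. The function name is also an assumption made for the example.

```python
import numpy as np

def slowest_left_eigenvector(t):
    """Return the second-largest eigenvalue of t and its left eigenvector.
    Left eigenvectors of t are right eigenvectors of its transpose."""
    evals, evecs = np.linalg.eig(t.T)
    order = np.argsort(evals.real)[::-1]
    idx = order[1]                      # order[0] is the stationary eigenvalue 1
    return evals.real[idx], evecs[:, idx].real

# Hypothetical 3-state transition matrix (rows sum to 1).
t = np.array([[0.89, 0.10, 0.01],
              [0.10, 0.89, 0.01],
              [0.01, 0.01, 0.98]])
lam, vec = slowest_left_eigenvector(t)
# States whose components share a sign move together; here states 0 and 1
# exchange population with state 2 on the slowest timescale.
```

The overall sign of an eigenvector is arbitrary, so only the relative signs and magnitudes of the components carry meaning.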

For further confirmation that the MSM is an accurate model of the simulation data we compared the predicted time evolution of the population of the native state with the raw simulation data, where the native state was defined as all microstates with an average Cα RMSD to the crystal structure less than 3 Å. Figure 8 shows that there is good agreement between the MSM and raw data.


Figure 8. Comparison between the time evolution of the native population in the MSM (blue) and the raw data (black) for the entire dataset. The error bars represent the standard error.
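The predicted time evolution of a population follows from repeatedly applying the transition matrix to an initial distribution, p(t + τ) = p(t) T(τ). The sketch below shows this propagation; the function names, the two-state matrix, and the choice of native state are hypothetical and serve only as an example.

```python
import numpy as np

def propagate(p0, t, n_steps):
    """Populations after 0, 1, ..., n_steps lag times, one row per step."""
    p = np.asarray(p0, dtype=float)
    history = [p]
    for _ in range(n_steps):
        p = p @ t                       # row vector times transition matrix
        history.append(p)
    return np.array(history)

def native_population(history, native_states):
    """Total population of the states labeled native at each step."""
    return history[:, native_states].sum(axis=1)

# Hypothetical two-state model started entirely in state 0; state 1 is "native".
t = np.array([[0.9, 0.1],
              [0.2, 0.8]])
history = propagate([1.0, 0.0], t, n_steps=2)
native = native_population(history, [1])
```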

While the time evolution of state populations is a good test of our MSM, often we will want to compute the time evolution of some observable to make comparisons with and predictions of experiments. As an example we compare the predicted time evolution of the Cα RMSD to the actual time evolution of the RMSD in the raw data for each of the nine initial configurations. The means by which we calculated the RMSD from the MSM is described in the Methods section. Measuring the time evolution of the RMSD from the raw data is simply a matter of measuring the average RMSD over the simulations started from the given initial structure at every time point. We also included a reduced representation of the raw data in this comparison. In the reduced representation each trajectory is represented as a series of states rather than a series of conformations. The average RMSD at a given time point is then calculated by averaging the RMSD of the states each of the relevant trajectories is in. It is important to note that we used the average RMSD across five randomly selected conformations (and the variance thereof) for each state rather than the RMSD of the state centers in these comparisons. Just using the RMSD of the state centers resulted in poor comparisons since they are not truly representative of the state, as discussed above.
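One way an observable such as the RMSD could be computed from state populations is sketched below. The mean is the population-weighted average of the per-state means. The spread is computed with the law of total variance over a mixture of states; whether the Methods section uses exactly this form is an assumption, and the function name and numbers are hypothetical.

```python
import numpy as np

def observable_mean_and_variance(populations, state_means, state_vars):
    """Mean and variance of an observable over a population-weighted
    mixture of states, given each state's own mean and variance."""
    p = np.asarray(populations, dtype=float)
    m = np.asarray(state_means, dtype=float)
    v = np.asarray(state_vars, dtype=float)
    mean = p @ m
    variance = p @ (v + m ** 2) - mean ** 2   # law of total variance
    return mean, variance

# Hypothetical example: two equally populated states with mean RMSDs of
# 2 and 6 Å, each with a within-state variance of 1 Å^2.
mean, variance = observable_mean_and_variance([0.5, 0.5], [2.0, 6.0], [1.0, 1.0])
```

Applying this function to the populations at each time step gives a predicted time course of the observable and its spread.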

Very good agreement (i.e. within the uncertainties of the observables) was found between all three representations for seven of the nine starting configurations, an example of which is shown in Figure 9A. In these cases the MSM was found to capture both the mean and variance of the time evolution of the RMSD to high precision. The agreement was less strong for the two remaining starting conformations, as shown in Figure 9B. In these cases the reduced representation agreed well with the raw data, showing that our states are structurally sufficient to capture the correct behavior. The mean RMSD from the MSM does not agree as well with the other two representations, though the true mean is still within the variance of the prediction from the MSM. Note that this variance, like all the other variances shown in Figure 9, is due only to the variance in the RMSD within each state and does not include any of the statistical uncertainty in the model. The large magnitude of these variances is an indication of the heterogeneity of villin folding.


Figure 9. Comparison between the time evolution of the RMSD in the MSM (blue), the reduced representation (yellow), and the raw data (black) for A) an example of good agreement and B) an example of the worst case scenario. The error bars represent one standard deviation in the RMSD.

The discrepancy between the MSM predictions and the other two representations for two of the starting structures indicates that our model still has some subtle memory issues in a subset of the states. Interestingly, the two conformations where the MSM agreed less well with the raw data were found to be faster folding than the other seven initial configurations in a previous study (58). It would appear that the slower folding trajectories are dominating the equilibrium distribution, causing all the MSM predictions to level off at about 6 Å, which is too high for the two fast folding initial configurations. Similar results were found with other observables, such as the distance between the Trp23 and His27 residues that was previously used as a surrogate for the experimental observable used to measure the folding time (58) (data not shown).

REMAINING ISSUES

The most probable cause of any subtle memory issues in our model is the existence of internal barriers within some states. As discussed previously, a state with a sufficiently high internal barrier could cause transition probabilities from that state to depend on the identity of the previous state. In particular, simulations started from one initial configuration could tend to enter and exit a state in one way while simulations started from a different initial configuration could tend to enter and exit the same state in a completely different way.

To test for the existence of internal barriers we calculated independent MSMs for each initial configuration. Each of these MSMs used the same state definitions; however, only simulations started from the given starting conformation were used to calculate the transition probabilities between states. All of these models agreed well with the raw data. For example, Figure 10 shows good agreement for the starting structure previously used as an example of the poorest agreement between the full model and the raw data (shown in Figure 9B).


Figure 10. Improved agreement between the MSM and raw data for the example of poor agreement from Figure 9B obtained by building the transition probability matrix from simulations started from this starting structure alone. The error bars represent one standard deviation in the RMSD.

This improved agreement indicates that some states do indeed have internal barriers. Moreover, the seven conformations for which the full model best reproduced the raw data probably have the same behavior in these states while the two initial configurations with poorer agreement between the full MSM and the raw data have a different behavior in these states. The discrepancy then occurs because transition probabilities for these states in the full MSM will be a weighted average of the two types of behavior. The two starting conformations that contribute less heavily to this weighted average are then captured less well by the full MSM.

In an attempt to address this problem we tried increasing the number of states to 30,000. This model may have had some structural advantages and given a slightly lower Markov time; however, it still suffered from the same subtle memory issues as the 10,000 state version (data not shown). Models with even more states were not attempted as they would greatly increase the number of states with very few counts and, therefore, increase uncertainty in the model. These issues may be resolved by identifying those states with internal barriers and splitting them further. However, such hand-tuning is beyond the scope of this work, which focuses on the performance of automated procedures for constructing MSMs.

CONCLUSIONS

Our analysis of the villin headpiece shows that the automated construction of MSMs using MSMBuilder is now at a point where it can be applied to full protein systems, a step beyond the small peptides that have been studied in the past(6, 68). This advance was made possible by the proper application of our approximate k-centers clustering algorithm. A naïve application of this algorithm to a molecular simulation dataset may result in a mediocre state decomposition because outliers in sparse regions of phase space are likely to be selected as cluster centers. To compensate for this tendency, one can sub-sample at the clustering stage, effectively disregarding many of the outliers and focusing the clusters in more relevant regions of conformational space. Data not included in the clustering phases may then be assigned to the resulting model to maximize the use of the available data. General guidelines for applying this result are given in Section C of the Results & Discussion.

To demonstrate that our MSM is a reasonable map for villin’s underlying free energy landscape, we showed that it is capable of accurate structure prediction and its thermodynamics and kinetics are consistent with the raw simulation data. Thus, we have laid a foundation for implementing an automated adaptive sampling scheme capable of constructing models with the minimum possible computational cost. The fact that our model captures both the mean behavior and heterogeneity of villin folding will also allow for more accurate comparisons with experiments and predictions of other experimental observables in a future work on the biophysics of villin folding. By applying this methodology to multiple systems we hope to understand general principles of protein folding. Of course, there is still room for improvement. Future work on estimating reversible transition matrices from simulation data, clustering, adaptive sampling, and exploring the connections between MSMs and Transition Path Sampling (TPS) (33, 69) could extend the accuracy and applicability of MSMBuilder.


CHAPTER 3: MOLECULAR SIMULATION OF AB INITIO PROTEIN FOLDING FOR A MILLISECOND FOLDER NTL9(1-39)

This chapter was taken from: Voelz VA, Bowman GR, Beauchamp KA, & Pande VS (2010) Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39). J Am Chem Soc 132:1526-1528.

ABSTRACT

To date, the slowest-folding proteins folded ab initio by all-atom molecular dynamics simulations have had folding times in the range of nanoseconds to microseconds. We report simulations of several folding trajectories of NTL9(1-39), a protein which has a folding time of ~1.5 milliseconds. Distributed molecular dynamics simulations in implicit solvent on GPU processors were used to generate ensembles of trajectories out to ~40 µs for several temperatures and starting states. At a temperature less than the melting point of the forcefield, we observe a small number of productive folding events, consistent with predictions from a model of parallel uncoupled two-state simulations. The posterior distribution of the folding rate predicted from the data agrees well with the experimental folding rate (~640/sec). Markov State Models (MSMs) built from the data show a gap in the implied time scales indicative of two-state folding, and heterogeneous pathways connecting diffuse mesoscopic substates. Structural analysis of the 14 out of 2000 macrostates transited by the top ten folding pathways reveals that native-like pairing between strands 1 and 2 only occurs for macrostates with pfold > 0.5, suggesting β12 hairpin formation may be rate-limiting. We believe that using simulation data such as these to seed adaptive resampling simulations will be a promising new method for achieving statistically converged descriptions of folding landscapes at longer time scales than ever before.


INTRODUCTION

A complete understanding of how proteins fold, i.e. self-assemble to their biologically relevant “native state,” remains an unattained goal (70). Computer simulation, validated by experiment, is a natural means to elucidate this. There is over a million-fold range in folding rates, suggesting a possible diversity in mechanisms between slow and fast folding proteins (71). Very fast (microsecond timescale) folding proteins (56, 72) appear to fold via a large number of heterogeneous, parallel paths (58, 73, 74), potentially key for folding on such fast timescales. Does the folding of much slower proteins change this picture?

To date, the slowest-folding proteins folded ab initio by all-atom molecular dynamics simulations with fidelity to experimental kinetics have had folding times in the range of nanoseconds to microseconds. These include the designed mini-protein Trp-cage (~4.1 µs) (75), the villin headpiece domain (~10 µs) (76), a fast-folding variant of villin (<1 µs) (58), and Fip35 WW domain (~13 µs) (77). In this communication, we report simulations of several folding trajectories, each from fully unfolded states, of the 39-residue protein NTL9(1-39), which experimentally has a folding time of ~1.5 milliseconds (78).

MATERIALS & METHODS

Trajectories were simulated via the Folding@Home distributed computing platform (79) at 300K, 330K, 370K and 450K from native, extended, and random-coil configurations using an accelerated version of GROMACS written for GPU processors (80), for an aggregate time of 1.52 ms. GPUs play a key role here, allowing for dramatically longer trajectories than previously possible. The AMBER ff96 forcefield (60) with the GBSA solvation model (81) was used, a combination previously shown to give good results folding Fip35 WW domain (77), and shown to exhibit a good balance of native-like secondary structure for a set of small helical and beta sheet peptides studied by replica exchange (82).


RESULTS & DISCUSSION

PREDICTION OF AB INITIO FOLDING AND FOLDING RATES

We find that the native state (taken from the N-terminal domain of the crystal structure of ribosomal protein L9 (83)) is stable in this forcefield at 300K, exhibiting decreasing stability with increasing temperature (Figure 11a). RMSD-Cα distributions after 10 µs show well-defined native and collapsed unfolded basins near 3Å and 5Å, respectively. Of the ~3000 trajectories started from unfolded (extended and coil) states at 370K (Figure 11b), two reach an RMSD-Cα < 3.5Å and eight reach an RMSD-Cα < 4Å. No productive folding trajectories were observed at lower temperatures, consistent with the enhanced forward folding rate expected at higher temperature from Arrhenius kinetics. Higher temperature trajectories (450K) exceed the melting temperature of NTL9 in the forcefield.

The observed number of folding events n is consistent with expectations from a simple model of parallel uncoupled folding simulations (84) in which folding is modeled as a two-state Poisson process: $\langle n \rangle = \int M(t)\,k\,\exp(-M(t)kt)\,dt$, where $\langle n \rangle$ is the expected number of folding events, M(t) is the number of simulations that reach time t (Figure 11b), and k is the experimental folding rate (~640/sec) (78). This theory predicts (on average) ~1.8 folding trajectories for the amount of sampling performed, in agreement with the two folding trajectories found in practice. Posterior distributions of folding rates given the amount of simulation time and number of folding trajectories were computed using a Bayesian approach (85), which yield expectation values within an order of magnitude of the experimental folding rate.


Figure 11. (a) Distributions of RMSD-Cα for native-state simulations of NTL9(1-39) after 10 µs. The arrows indicate thresholds defined for the native basin at 3.5Å and 4Å. (b) The number of parallel simulations M(t) started from unfolded states at 370K that reach time t. (c) Posterior predictions of the folding rate given the amount of simulation time and observed folding events for 3.5Å (dashed) and 4Å (solid) thresholds, using uniform (black) and Jeffreys (gray) priors, using methods from (85). In red is a Gaussian distribution representing the experimental rate mean and standard deviation.
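A simplified version of such a posterior can be sketched as follows. If folding is a constant-rate process, observing n folding events in a total simulation time T gives a likelihood proportional to k^n exp(-kT), and a uniform prior on k then yields a Gamma(n + 1, T) posterior. This is a textbook simplification and not necessarily the method of (85); only the uniform prior is illustrated, and the total time used in the example is a hypothetical value, not the actual amount of 370K sampling.

```python
import math

def rate_posterior_uniform_prior(n_events, total_time):
    """Gamma(n + 1, total_time) posterior over the rate k under a uniform prior.
    Returns the posterior mean and a function evaluating the posterior density."""
    shape = n_events + 1.0
    mean = shape / total_time

    def pdf(k):
        log_pdf = (shape * math.log(total_time) + (shape - 1.0) * math.log(k)
                   - total_time * k - math.lgamma(shape))
        return math.exp(log_pdf)

    return mean, pdf

# Two observed folding events; total_time (in seconds) is hypothetical.
posterior_mean, posterior_pdf = rate_posterior_uniform_prior(2, 2.0e-3)
```

The posterior is broad when only a few events are observed, which is why such an estimate can only be expected to agree with experiment to within roughly an order of magnitude.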

In addition to native-like conformations, we see near-native configurations, which show heterogeneity in hydrophobic packing, most notably in alternative side chain arrangements in the beta-sheet structure (Figure 12). Most common of these is a non-native hydrophobic core involving residues I4, I18 and I37 (which normally contact the C-terminal helix in the full-length protein) with F5 solvent-exposed.

INSIGHT INTO FOLDING MECHANISMS

In order to describe the kinetics and mechanistic aspects of folding, we employ a new paradigm for sampling the global free energy landscape of folding, using Markov State Models (MSMs). MSM approaches, by automatically identifying a set of kinetically metastable states (such as foldons (86)) and efficiently sampling transitions between these states, can model long-timescale kinetics from much shorter trajectories (3, 6, 37, 54).

Our strategy for simulating slow-folding proteins is first to generate an initial series of kinetically connected states from both the folding and unfolding directions, and then to use adaptive resampling techniques (12) to produce statistically converged estimates of metastable basins and the transition rates between them. In the remainder of this communication, we report progress toward the first goal, by constructing an MSM from the entire set of 370K trajectory data (4, 10), which we will use to seed future rounds of transition sampling. While additional rounds of adaptive sampling could likely aid in increasing the quantitative power of this model, there are several notable observations which can be made with the current data set.


Figure 12. (a) A snapshot from a folding trajectory (dark blue) achieves an RMSD-Cα of 3.1Å compared to the native state (cyan). (b) Non-native (top) and native-like (bottom) hydrophobic core arrangements observed in low-RMSD conformations of folding trajectories. Highlighted are sidechains of residues F5 (magenta), V3, V9, V21 (tan), and L30, L35 (pink).

Key to accurately identifying metastable states is the clustering of trajectory conformations into microstates fine-grained enough to be used for lumping into groups of maximally metastable macrostates (10). 100,000 microstate clusters were calculated using an approximate k-centers algorithm (42), each with an average radius of 4.5Å RMSD-backbone. Lag times ranging from 1 to 32 ns were used to build a series of MSMs. The implied time scales predicted by these models (obtained by diagonalizing the rate matrix) show a clear spectral gap separating the slowest relaxation time scale from the rest, indicative of single-exponential kinetics (see Figure 52). The implied time scale of the model levels off beyond a lag time of ~10 ns to an implied time scale of ~1 ms, close to the experimental folding time.

An important strength of MSMs is their ability to gain insight at coarser scales by “lumping” the kinetic transitions into a simpler model with fewer states. To gain a mesoscopic view of the folding free energy landscape, we lumped our 100,000-microstate MSM into a 2000-macrostate model. In this view, we find that the metastable states are diffuse collections of conformations over which multiple possible folding pathways can occur, indicating a vast heterogeneity of folding substates that need to be understood in greater detail. At the same time, we can identify highly populated “native” (state n) and “unfolded” (state a) macrostates that dominate the observed relaxation rates (Figure 13 and Figure 53).

The ten pathways with the highest folding flux from macrostate a to n were calculated by a greedy backtracking algorithm (see Appendix C) from the macrostate transition matrix using transition path theory (5, 87) (TPT). The diversity of pathways demonstrates the power of the MSM approach: although we observe only a few folding trajectories directly, a network of many possible pathways can be inferred from the overlapping sampling of local transitions.

While NTL9(1-39) folds quickly for a two-state folder, it is similar in size to many ultrafast (sub-millisecond) folders that appear to exhibit so-called “downhill” folding. Hence, we would like to understand the structural features that limit the overall folding rate. As in a macroscopic two-state model, the highest-flux pathways in our mesoscopic model are a→m→n and a→l→n, direct routes from disordered to structured macrostates, reminiscent of nucleation-condensation. These pathways by themselves, however, account for only ~10% of the total flux, and the structural diversity seen in all pathways is reminiscent of more hierarchical folding models such as diffusion-collision. Thus, we sought to more fully study the 14 macrostates transited by the top ten folding pathways.

Figure 13. A 2000-state Markov State Model (MSM) was built using a lag time of 12 ns. Shown is the superposition of the top 10 folding fluxes, calculated by a greedy backtracking algorithm (see Appendix C). These pathways account for only about 25% of the total flux, and transit only 14 of the 2000 macrostates (shown labeled a-n, for convenient discussion). The visual size of each state is proportional to its free energy, and arrow size is proportional to the inter-state flux.

To examine structural changes along the folding reaction, we considered three main native structural elements: the central helix (α), the pairing of strands 1 and 2 (β12), and the pairing of strands 1 and 3 (β13). To quantify the extent of native-like structuring for each of these elements we calculated Qα, Qβ12 and Qβ13, respectively (see Appendix C for details). The Q-value is a number between 0 and 1 that quantifies the extent of native-like contacts. We then examined, for each macrostate, the Q-values in relation to the pfold value (committor), a kinetic reaction coordinate. The pfold value is computed from the macrostate transition matrix (5, 37, 87).
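To make the idea of a Q-value concrete, the sketch below computes the fraction of a given list of native contact pairs that lie within a distance cutoff in one conformation. The actual definition used in this work is given in Appendix C. The cutoff value, the pair list, and the coordinates here are assumptions chosen only for illustration.

```python
import numpy as np

def q_value(coords, native_pairs, cutoff=7.0):
    """Fraction of native contact pairs closer than `cutoff` (units of coords)."""
    i, j = native_pairs[:, 0], native_pairs[:, 1]
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    return float(np.mean(dist < cutoff))

# Hypothetical conformation: four atoms on a line (coordinates in Angstroms).
coords = np.array([[0.0, 0.0, 0.0],
                   [5.0, 0.0, 0.0],
                   [10.0, 0.0, 0.0],
                   [30.0, 0.0, 0.0]])
# Hypothetical native contact list for one structural element.
native_pairs = np.array([[0, 1], [1, 2], [0, 3], [2, 3]])
q = q_value(coords, native_pairs)
print(q)
```

Restricting `native_pairs` to the contacts of a single element (the helix or one strand pair) would give one Q-value per element, which can then be averaged over the conformations in each macrostate.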

This analysis yields several key insights into the folding mechanism of NTL9(1-39) on the mesoscale. We find the “unfolded” state a is compact, and contains a baseline level of residual native-like structure, with Qα near 0.5, and Qβ12 and Qβ13 near 0.2. In general, across the 14 macrostates studied, Q-values increase as pfold values increase, although the relative balance of Qα, Qβ12 and Qβ13 varies, indicating pathway heterogeneity: i.e. native-like structures can form in different orders (Figure 14, Figure 55, Figure 56). An exception to this, however, is observed for β12 strand pairing. Only for macrostates with pfold > 0.5 (states g-n) does appreciable β12 strand pairing occur (Figure 15). This suggests that the formation of a local strand pair (β12), rather than a nonlocal strand pair (β13), is rate-limiting. This effect is not predicted by strictly topological models of folding in which loop closure entropy loss dominates (88), but instead may result from sequence-specific details.

Unlike the 13 strand pair, which has a small interaction surface stabilized by hydrophobic contacts, the 12 hairpin contains seven of the protein’s eight lysine residues, and three of its five glycine residues in a flexible loop region, features which may imbue 12 with larger barriers to folding. This proposed role of 12 is also consistent with the large changes in kinetics and stability seen experimentally for mutations in the 12 hairpin (78).

Figure 14. The 14 macrostates involved in the top ten folding pathways, plotted along structural and kinetic reaction coordinates. The balance between native-like helix and sheet structure is quantified by Qα - (Qβ12 + Qβ13)/2 (vertical axis), and progress along the folding reaction is quantified by the pfold (committor) value (horizontal axis). It can be seen that the “unfolded” state (a) contains residual native-like helical propensity, and that pathways involving various ordering of native-like helix and sheet formation are possible.

Figure 15. Q-values, which capture the extent of native-like structures, plotted versus pfold (committor) values. The lines are to guide the eye.

It is natural to compare our results with previous unfolding simulations of NTL9(1-39) K12M by Snow et al. (89). In that work, a detailed characterization of the transition state ensemble required the definition of strand-pairing reaction coordinates corresponding to β12 and β13 formation. In our MSM analysis, no such pre-definition is required. Snow et al. also note the difficulty in resolving kinetic intermediates not captured by the chosen order parameters. Indeed, our structural analysis can resolve subtle kinetic intermediates within the native basin, corresponding to alternative rearrangements of the β12 hairpin loop (Figure 57).

CONCLUSIONS

The above results suggest that existing forcefield models using implicit solvent are indeed accurate enough to fold proteins ab initio at long time scales (milliseconds), opening the door to simulating more structurally complex proteins. Moreover, our work demonstrates that there need not be a single pathway or single, dominant mechanism for the folding of a given protein: since the theories proposed for how proteins fold are based on broadly relevant physical principles, it is natural to imagine that multiple mechanisms could be simultaneously present, but that the sequence of the protein, coupled with the chemical environment, would control the extent to which each mechanistic pathway is seen.

CHAPTER 4: PROTEIN FOLDED STATES ARE KINETIC HUBS

This chapter was taken from: Bowman GR & Pande VS (2010) Protein folded states are kinetic hubs. Proc Natl Acad Sci U S A 107:10890-10895.

ABSTRACT

Understanding molecular kinetics, and particularly protein folding, is a classic grand challenge in molecular biophysics. Network models, such as Markov State Models (MSMs), are one potential solution to this problem. MSMs have recently yielded quantitative agreement with experimentally derived structures and folding rates for specific systems, leaving them positioned to potentially provide a deeper understanding of molecular kinetics that can lead to experimentally testable hypotheses. Here we use existing MSMs for the villin headpiece and NTL9, which were constructed from atomistic simulations, to accomplish this goal. In addition, we provide simpler, humanly comprehensible networks that capture the essence of molecular kinetics and reproduce qualitative phenomena like the apparent two-state folding often seen in experiments. Together, these models show that protein dynamics are dominated by stochastic jumps between numerous metastable states and that proteins have heterogeneous unfolded states (many unfolded basins that interconvert more rapidly with the native state than with one another) yet often still appear two-state. Most importantly, we find that protein native states are hubs that can be reached quickly from any other state. However, metastability and a web of non-native states slow the average folding rate. Experimental tests for these findings and their implications for other fields, like protein design, are also discussed.

INTRODUCTION

Molecular kinetics has fascinated biophysicists and biochemists for decades. From a biophysical point of view, it remains a mystery how systems with so many possible configurations can self-organize with such specificity and rapidity, carry out catalysis, and trigger signaling cascades. From a biomedical standpoint, protein misfolding causes many debilitating diseases, including Alzheimer’s, Huntington’s, and Parkinson’s diseases (90). Understanding how proteins fold is a logical first step in understanding how they misfold and, more importantly, how to prevent or recover from misfolding; indeed, this approach is already proving valuable (40). Furthermore, a better understanding of protein folding mechanisms could lead to more efficient structure prediction (91, 92), for use in high throughput proteomics and studies of systems that defy experimental characterization, and better models for molecular kinetics could aid in computational drug and protein design.

What would the ultimate theory of molecular kinetics look like though? A natural way of answering this question is by analogy to well established theories, such as Schrödinger’s equation in the successful field of quantum mechanics. On the one hand, computational solutions to Schrödinger’s equation have yielded quantitative agreement with and prediction of experimental observables. However, equally important is this theory’s ability to yield insight into simple systems, such as the particle in a box, for the purposes of gaining an intuition for fundamental principles, like the quantization of energy and the role of molecular orbitals. Likewise, the ultimate theory of molecular kinetics should be capable of scaling from sophisticated models capable of quantitatively predicting experiments to simple models which yield mechanistic insight. At even the most fundamental levels of this hierarchy, such a theory ought to be at least qualitatively consistent with experimental observations and be capable of generating experimentally testable hypotheses. In particular, such a theory ought to provide insight into protein folding as success in describing such drastic conformational changes would be evidence for the theory’s ability to describe less extreme ones.

We propose that networks of metastable, or long-lived, states (4, 9, 33, 55) could fulfill this role because they are implicit in even the most simple protein folding models; examples include U↔N and U↔I↔N where U is the unfolded state, I is an intermediate, and N is the native state. Networks called Markov State Models (MSMs) make these implicitly considered properties explicit and have the potential to provide complete maps of a protein’s free energy landscape, with nodes corresponding to metastable states (or free energy basins) and edges representing the probabilities of transitioning between pairs of these states (3, 4, 6, 9, 33, 55).

A number of recent works have provided validation for these networks by showing that they can yield quantitative agreement with experimentally derived structures and folding rates (4, 5, 12, 93). In particular, the predicted native state from our villin model (based on calculated free energies) had an RMSD to the crystal structure of ~1.8 Å (4). The model also correctly predicted quantitative details of the kinetics, such as the absolute folding rate (to logarithmic accuracy). This degree of accuracy in predicted free energies, structures, and rates is crucial as all experimental measurements are functions of these properties. In all, the agreement between theory and experiment leads us to the conclusion that our models provide a sufficiently accurate reflection of reality.

To further flesh out this potential theory of molecular kinetics, we have delved into the nature of the free energy landscapes of the villin headpiece (HP-35 NleNle) (56) and a 39 residue fragment of NTL9 (78). Furthermore, because complex networks for real systems are difficult to comprehend, we construct simple, generic models that capture qualitative phenomena like apparent two-state folding and provide an intuition for molecular kinetics. Together, these models allow us to assess existing theories, which describe folding as a two-state process characterized by cooperative transitions across a dominant free energy barrier separating a rapidly mixing unfolded ensemble from the native state (94, 95).

The remainder of this paper will be organized around three key results. First, protein free energy landscapes can yield apparent two-state behavior even in the absence of a single dominant barrier. Second, protein unfolded states are heterogeneous, having multiple basins that interconvert more rapidly with the native state than one another. Third, protein native states are kinetic hubs: it is possible to reach them relatively quickly from anywhere in a network but it is also possible to get stuck in a web of non-native states.

RESULTS & DISCUSSION

APPARENT TWO-STATE BEHAVIOR CAN OCCUR IN THE ABSENCE OF A KINETICALLY RELEVANT TWO-STATE DECOMPOSITION. Many proteins appear to fold via a single cooperative transition from a rapidly mixing ensemble of unfolded conformations to a well defined native structure (94, 96). However, based on chemical intuition, one would expect to find many more metastable states, corresponding to the numerous favorable interactions that could form in the absence of the full native structure as well as dynamics within the native state. To reconcile these points, one typically assumes a single dominant free energy barrier that serves as the rate limiting step for folding. Other barriers are often assumed to be small relative to the thermal energy (or at least to the dominant barrier) and the equilibrium probability of any intermediate is assumed to be too small to detect.

However, in some cases modeling experimental data requires the use of at least three states (97-99) and simple toy models have shown that even three-state systems can yield apparent two-state behavior (100). Thus, it is natural to hypothesize that many systems may have more complex arrangements of metastable states (9, 10, 101) yet still exhibit apparent two-state behavior.

To test this hypothesis, we first turn to an MSM for the villin headpiece. This MSM was recently built from atomistic simulations and, by assuming stochastic jumps between its states, was shown to give quantitative agreement with experimental structures and folding rates in addition to recapitulating the raw simulation data (4). Thus, the presence of numerous metastable states in this model would be strong evidence for their actual existence and the stochastic nature of transitions between them. Indeed, with a lagtime on the order of 10 ns, analysis of this MSM reveals the existence of at least 500 metastable states. At least 2,000 are found for NTL9 (93). The free energy barriers between our villin states have an average height of about 5.9 (+/- 2.5) kT (see Appendix D for details), indicating that they are non-trivial and potentially detectable. Moreover, no single dominant barrier is apparent.

To better understand the system specific results from our all-atom models, we now consider three simple models for dynamics capable of providing insight into protein folding in general. Each of these networks has six metastable states and is depicted in Figure 16. These models have a single folding pathway (S), parallel folding pathways (P), and a heterogeneous unfolded state (H, with multiple unfolded basins that each interconvert more rapidly with the native state than with one another) as discussed in the Materials & Methods section.

Figure 16. Three representative networks each having unfolded state(s) (U and Ui), intermediates (Ii), and a native state (N). S has a single pathway, P has parallel pathways, and H has a heterogeneous unfolded state.

One may be tempted to associate the states in these models with folding nuclei (102), pre-organized secondary structure (103), foldons (104), or the elements of some other model of protein folding (53). However, we simply require that they all be metastable. That is, a system within one state is more likely to stay there than to transition to a different state. Moreover, we propose that the concept of metastability unifies many of the previously proposed folding mechanisms, each of which describes some systems better than others, as all consist of basic units that are stable on some timescale.

We can now imagine monitoring stochastic transitions within each of these representative systems (or ensembles thereof) with a device that can only detect the native state. This hypothetical setup is equivalent to experiments wherein unfolded molecules are allowed to relax to an observable folded state where they are trapped to prevent unfolding and refolding. Figure 17 shows that such an experiment yields the exponential behavior typical of an ideal two-state system. In fact, exponential fits to the data after the initial lag phase only give slight underestimates of the true Mean First Passage Times (MFPTs) between the unfolded and folded states (Table 1). Thus, even these simple systems are qualitatively consistent with both stochastic jumps between numerous metastable states and apparent two-state behavior. This is particularly surprising for model H since it cannot be divided into a single, rapidly mixing unfolded basin separated from the native state by one dominant barrier (i.e. it is not two-state).
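The thought experiment above can be sketched numerically. The snippet below computes exact mean first passage times to an absorbing native state by solving a linear system, and samples individual first folding times by stochastic jumps. The six-state linear chain is a hypothetical stand-in for a single-pathway network; its transition probabilities are assumptions, not the values from the Materials & Methods section.

```python
import numpy as np

def mfpt_to_target(T, target, lag_time=1.0):
    """MFPT from every state to `target`: solve (I - T') m = lag * 1 off-target."""
    n = T.shape[0]
    others = [i for i in range(n) if i != target]
    A = np.eye(n - 1) - T[np.ix_(others, others)]
    m = np.zeros(n)
    m[others] = np.linalg.solve(A, np.full(n - 1, lag_time))
    return m

def simulate_first_passage(T, start, target, n_runs, rng):
    """Sample first-passage times (in steps) by jumping according to rows of T."""
    n = T.shape[0]
    times = np.empty(n_runs, dtype=int)
    for k in range(n_runs):
        state, steps = start, 0
        while state != target:
            state = rng.choice(n, p=T[state])
            steps += 1
        times[k] = steps
    return times

# Hypothetical single-pathway chain: state 0 = unfolded, state 5 = native.
n = 6
T = np.zeros((n, n))
for i in range(n):
    if i > 0:
        T[i, i - 1] = 0.1
    if i < n - 1:
        T[i, i + 1] = 0.1
    T[i, i] = 1.0 - T[i].sum()  # metastable: most probability stays put

mfpt = mfpt_to_target(T, target=5)
times = simulate_first_passage(T, start=0, target=5, n_runs=500,
                               rng=np.random.default_rng(0))
print(mfpt[0], times.mean())
```

A histogram of `times` could then be fit with a single exponential after the initial lag phase and the fitted time constant compared with `mfpt[0]`, mirroring the comparison made in Table 1.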

Figure 17. Distributions of the first folding times for the simple networks S, P, and H are shown in panels A, B, and C respectively. The blue lines are exponential fits to the data after the initial lag phase.

A kinetic perspective on our simple networks helps to explain why two-state behavior is often observed even when there are many large barriers. As discussed previously, when there is a single dominant rate then faster transitions will tend to be lost in the noise. Multiple slow rates will also be lost in the noise if they are too similar. Moreover, this same logic applies even when there are multiple folding routes from different starting points (and thus no kinetically relevant two-state decomposition). Thus, observing anything other than mainly single exponential kinetics requires a delicate balance wherein the slowest rates differ sufficiently to distinguish them but not so much that one dominates the rest, not to mention extremely precise measurements.

Fortunately, there is ample evidence that achieving this balance and the precision necessary to detect it are possible. Experimental data are often consistent with multi-exponential behavior but are fit to stretched exponentials instead (105, 106). Increasing the temporal resolution of single molecule pulling experiments has also steadily revealed more metastable states, and kinetic measurements can be probe dependent (107, 108). We propose that the ability to simultaneously monitor multiple degrees of freedom (such as extension and FRET) in single molecule experiments would reveal even more metastable states, particularly if MSMs were used to choose the number of probes employed and their placement.

PROTEINS HAVE HETEROGENEOUS UNFOLDED STATES WITH MULTIPLE BASINS THAT INTERCONVERT MORE RAPIDLY WITH THE NATIVE STATE THAN EACH OTHER. We now investigate which of the simple network topologies is most representative of real protein free energy landscapes. As a first step, we have calculated that every state can reach the native basin of our villin model in one or two steps. This eliminates the possibility of a single pathway since states with that topology could require up to 499 steps to reach the native basin.
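Counting the number of steps needed to reach the native basin amounts to a breadth-first search over the observed transitions. The sketch below assumes a count matrix whose entry (i, j) is the number of observed i→j transitions; the five-state matrix and the function name are hypothetical.

```python
from collections import deque
import numpy as np

def steps_to_target(counts, target):
    """Minimum number of observed transitions needed to reach `target`."""
    n = counts.shape[0]
    dist = [None] * n          # None marks states that cannot reach the target
    dist[target] = 0
    queue = deque([target])
    while queue:
        j = queue.popleft()
        # Predecessors of j: states i with at least one observed i -> j jump.
        for i in np.nonzero(counts[:, j])[0]:
            if dist[i] is None:
                dist[i] = dist[j] + 1
                queue.append(int(i))
    return dist

# Hypothetical counts; state 0 plays the role of the native basin.
counts = np.array([[50, 2, 3, 0, 0],
                   [4, 30, 0, 1, 0],
                   [5, 0, 20, 0, 2],
                   [0, 3, 0, 10, 1],
                   [0, 0, 2, 2, 8]])
steps = steps_to_target(counts, target=0)
print(steps)
```

In a strictly linear single-pathway network with N states, the same search would return values up to N - 1, which is the basis of the 499-step bound quoted above for a 500-state model.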

Determining whether the parallel pathway model (95, 109, 110) or the heterogeneous unfolded state model is more representative of villin requires a definition of the unfolded state(s). Since every non-native state can reach the native basin in one or two steps it is natural to label every state that is not directly connected to the native state (332 in all) as unfolded and all other non-native states (167 in all) as intermediates.

Taking this definition, we can now examine the distribution of MFPTs from each unfolded state to the native state as well as the distribution of MFPTs between all pairs of unfolded states. Doing so reveals that the average MFPT to the native state is 880 (+/-270) nanoseconds, in reasonable agreement with the experimentally predicted folding time of 720 nanoseconds (56). Moreover, this value is much lower than the average MFPT between pairs of unfolded states (~370 microseconds), as shown in Figure 18A and 18B. Considering every non-native state as part of the unfolded ensemble also gives similar distributions (Figure 59), implying that these results are robust to the exact definition of the unfolded state. Similar results are found for NTL9 as well (Figure 60). Thus, we can conclude that the heterogeneous unfolded state model is most representative of our villin and NTL9 models and possibly proteins in general. This result is in contrast to existing theories of protein folding, which assume rapid equilibration within the unfolded ensemble (95, 111, 112).

Figure 18. Relaxation of villin from 500 state model. Distributions of the MFPTs from (A) unfolded states to the native state and (B) between unfolded states. (C) Relaxation kinetics with a 10:1 signal-noise ratio (black curve with Gaussian noise) and a single exponential fit (blue curve with τ≈810 ns).

Examination of representative structures suggests that non-native interactions (often in the context of relatively compact conformations) and the enormity of conformational space are responsible for slow transitions between unfolded basins (Figure 61). Non-native contacts can easily have free energies on the order of native contacts, making non-native states reasonably metastable. Once a set of non-native contacts is broken, the probability of forming a particular set of other non-native contacts is quite small due to the large number of other possibilities. This small probability is equivalent to a slow rate. In contrast, evolutionary pressure to fold makes transitioning to the native state reasonably probable, which equates to fast folding relative to slower transitions between unfolded basins.

The tight distribution of MFPTs to the native state is also consistent with our explanation of apparent two-state behavior. Due to experimental noise, it is difficult to justify using more than one or two exponentials to fit the relaxation of our coarse-grained villin model with 500 states, as shown in Figure 18C. Only with an extremely high signal to noise ratio can one accurately identify the deviations from single exponential relaxation shown in Figure 62. We also note that more fine-grained models for villin can capture the burst phase in its relaxation (Figure 63) but here we emphasize the ability of our coarse-grained model to capture the apparent two-state behavior that dominates this system’s relaxation (56).

Our ability to reconcile our model with existing experimental data on the nature of the unfolded ensemble (specifically under native conditions, as opposed to the more rapidly mixing denatured state) indicates that more experiments will be required to definitively falsify or support our conclusions. For example, Nettels et al. have reported a 50 ns global relaxation time within the unfolded ensemble (113). Our model, however, would suggest that this may be due to relaxation within individual unfolded basins, not between them. This hypothesis is consistent with recent measurements of slow dynamics in the unfolded ensemble from the Lapidus lab (114, 115). Therefore, we suggest that this may be an interesting direction for future experimental work. In addition to existing methodologies for probing the unfolded ensemble, single molecule experiments monitoring multiple degrees of freedom could help to falsify or support our conclusions.

If our heterogeneous unfolded state model is indeed generally true then protein folding kinetics cannot be accurately described by two states separated by a single barrier. Instead, folding must be understood in terms of multiple pathways starting from a number of distinct states. Mixing between pathways adds another layer of complexity to the folding process. Modeling the effects of mutations will thus require considering changes in the relative free energies of numerous states and barrier heights. Understanding the global effects of small changes on networks will likely also be important for protein design.

A NATIVE HUB ALLOWS RAPID FOLDING BUT PROTEINS CAN STILL GET STUCK IN A WEB OF NON-NATIVE STATES. The accessibility of villin’s native state implies the hub-like connectivity characteristic of small-world and scale-free networks (116, 117). We can test this hypothesis by counting the number of connections observed between states because only those transitions with probabilities above some threshold are observed with our finite sampling (all transitions would be observed with infinite sampling). Examining subsets of the states independently, one finds that the average degree (or number of connections) increases as one moves from the unfolded states to the native basin. The unfolded states have an average degree of 12 while the intermediate states have an average degree of 25. The native state acts as a hub, connected to 167 other states. Similar results are found for a small β-sheet peptide (17) and NTL9.
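The degree of each state can be read directly off a transition count matrix. The sketch below treats two states as connected if a transition was observed in either direction, which is an assumption about how connections are counted; the five-state matrix and the `min_count` threshold are likewise hypothetical.

```python
import numpy as np

def node_degrees(counts, min_count=1):
    """Number of distinct neighbors of each state, ignoring self-transitions."""
    observed = (counts >= min_count) | (counts.T >= min_count)
    np.fill_diagonal(observed, False)
    return observed.sum(axis=1)

# Hypothetical counts in which state 0 acts as a hub.
counts = np.array([[50, 2, 3, 1, 1],
                   [4, 30, 0, 1, 0],
                   [5, 0, 20, 0, 0],
                   [2, 3, 0, 10, 0],
                   [1, 0, 0, 0, 8]])
degrees = node_degrees(counts)
print(degrees)
```

Averaging `degrees` separately over the states labeled unfolded, intermediate, and native would give group-wise numbers of the kind quoted above.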

Reduced connectivity between non-native states results in slow dynamics within the unfolded ensemble. This connectivity contradicts other models, which predict bottlenecks close to the native state and high connectivity in non-native regions (95, 110, 112, 118), as depicted in Figure 19A. A more thorough discussion of the similarities and differences between our model and those proposed previously is given in the next section.

Figure 19. Schematic diagrams of funnel and native hub models having unfolded states (U), intermediates (I), and native states (N). (A) A network description of a folding funnel with nodes corresponding to individual conformations and a bottleneck near the native state. (B) A native hub model with metastable nodes. The size of each node in (B) is correlated with its equilibrium probability and the connectivity falls off as one moves away from the native state.

The native hub explains how villin folds so quickly. Just as there are only about six degrees of separation between people in the US (119), it is possible to reach villin’s native state in one or two jumps (each 15 ns). Therefore, it is possible to fold from anywhere in the landscape in 30 ns or less. This result is consistent with recent experimental work showing that the transition path time between the unfolded and native ensembles can be as much as four orders of magnitude faster than the average folding time (120) and likely results from evolutionary pressure to fold quickly.

Due to the kinetic proximity of the native state with a 15 ns lagtime, we see that villin can fold in just 30 ns; however, such trajectories are rare because the metastability and connectivity of non-native states makes taking a direct route to the native state improbable. Instead, villin will often spend considerable time in a web of non-native states before finally folding, resulting in an average folding time on the microsecond timescale. In the future, it will be interesting to test whether slower folding proteins have unfolded states further from the native one or just more strongly metastable states, which equates to higher barriers and slower transitions between states. Preliminary analysis of NTL9 suggests every basin can reach the native state in 5 steps (~100 nanoseconds) or less.

We have also found a rough correlation between the connectivity of states and their equilibrium probabilities. The average probabilities of unfolded and intermediate states are ~0.0005 and ~0.004, respectively. The native state has an equilibrium probability of ~0.2. Figure 19B shows a schematic of a protein folding network that attempts to capture all of these observations in a humanly comprehensible manner. All of these observations are in qualitative agreement regardless of the degree of lumping; that is, whether one uses smaller and more numerous states to capture more local minima in the landscape or fewer and more voluminous states to obtain an even more coarse-grained model. While one may be tempted to consider Figure 19B merely an alternative depiction of a funnel, we emphasize that the kinetic connectivity of the native state and lack of connectivity within the unfolded ensemble are important qualitative deviations from traditional funnel theory (95).
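Equilibrium probabilities of this kind follow from the transition matrix as its stationary distribution, and free energies relative to the most populated state follow as -ln(π_i / π_max) in units of kT. The sketch below uses a hypothetical three-state model; the matrix and function names are assumptions for illustration.

```python
import numpy as np

def equilibrium_populations(T):
    """Stationary distribution: left eigenvector of T with eigenvalue 1."""
    w, v = np.linalg.eig(T.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def free_energies_kT(pi):
    """Free energy of each state relative to the most populated one, in kT."""
    return -np.log(pi / pi.max())

# Hypothetical symmetric counts, which guarantee detailed balance.
C = np.array([[90, 10, 0],
              [10, 170, 20],
              [0, 20, 680]], dtype=float)
T = C / C.sum(axis=1, keepdims=True)
pi = equilibrium_populations(T)
G = free_energies_kT(pi)
print(pi, G)
```

Plotting `pi` against the number of connections of each state is one simple way to look for the rough correlation described above.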

An important methodological consequence of the network topology found here is that many short, parallel simulations (or experiments) started from arbitrary initial points are an excellent way of exploring the entire free energy landscape. In the extreme case of using a single starting point, one could still reach every free energy basin despite the presence of numerous metastable states so long as each simulation was longer than the diameter of the network (the minimal time that allows one to reach any state from an arbitrary starting point). However, reaching every state would be impossible with simulations that were shorter than the diameter of the network. Thus, our network theory provides an alternate explanation for the previously noted need to have simulations longer than some minimal lag phase, which was then attributed to the need to equilibrate within the unfolded state before folding in two-state systems (121).

Another simple but more efficient strategy would be to start simulations from multiple conformations dispersed throughout phase space and run them long enough to ensure mixing between them and coverage of the entire space. In fact, Figure 20 and Figure 64 show that such a scheme is actually more valuable than a few long trajectories, using a relative entropy metric for MSMs from Ref (18) to measure the information content of different datasets relative to our validated villin model. However, this trend can be seen to break down for simulations that are insufficiently long or too few as they are unlikely to reach every state or traverse every possible pathway between pairs of states. The simulation length at which this breakdown occurs decreases as the number of simulations increases though. Even better performance can be obtained using adaptive sampling algorithms (18, 19), which direct sampling to where it is needed most to improve a model.
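One way to compare a model built from a subset of the data against a reference model is a population-weighted Kullback-Leibler divergence between corresponding rows of the two transition matrices. The sketch below assumes that form, with a pseudocount to avoid zero probabilities in the test model. The exact metric of Ref (18) is described in Appendix D and may differ in detail, and all numbers here are hypothetical.

```python
import numpy as np

def msm_relative_entropy(P_ref, pi_ref, test_counts, pseudocount=1.0):
    """D = sum_i pi_i sum_j P_ij ln(P_ij / Q_ij), with Q from pseudocounted counts."""
    Q = test_counts + pseudocount
    Q = Q / Q.sum(axis=1, keepdims=True)
    mask = P_ref > 0
    ratio = np.ones_like(P_ref)
    ratio[mask] = P_ref[mask] / Q[mask]   # terms with P_ij = 0 contribute nothing
    return float(np.sum(pi_ref[:, None] * P_ref * np.log(ratio)))

# Hypothetical reference model and its stationary distribution.
P_ref = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
pi_ref = np.array([2.0 / 3.0, 1.0 / 3.0])

# Hypothetical count matrices from a small and a large dataset.
few_counts = np.array([[8.0, 2.0], [1.0, 9.0]])
many_counts = np.array([[900.0, 100.0], [200.0, 800.0]])

d_few = msm_relative_entropy(P_ref, pi_ref, few_counts)
d_many = msm_relative_entropy(P_ref, pi_ref, many_counts)
print(d_few, d_many)
```

Evaluating such a distance for subsets that vary in trajectory length and trajectory number gives a grid of values of the kind plotted in Figure 20.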

Figure 20. Distance between the final villin MSM and MSMs constructed from subsets of the data (varying trajectory length and number of trajectories). Distance is measured by a relative entropy metric (see Appendix D for details). Black lines are contours of equal amounts of data. No data was available for the upper-right portion of the graph.

COMPARISON TO PREVIOUS THEORIES FOR PROTEIN FOLDING. There is a long history of theoretical models for protein folding (53) so it is important to put our work in the context of these previous theoretical approaches. In particular, folding funnel models (95, 112, 118) have dominated much of how the field currently conceptualizes protein folding and hence it is natural to compare our model to such theories. One of the most similar funnel categories is type 0B, which is characterized by overall downhill folding interrupted by a glass transition along the reaction coordinate (95). While this regime does include slow dynamics between compact states, it also results in a small number of folding pathways relative to higher connectivity in the unfolded ensemble. In addition, this and other previous funnel-based models have explicitly described rapidly interconverting unfolded states, as reflected in the “bottleneck” discussed in previous works (110, 111), as well as the choice of structurally-based reaction coordinates like the number of native contacts (Q) (95, 111), which directly requires that dynamics along orthogonal degrees of freedom, such as interconversion between unfolded conformations, is rapid compared to folding. In contrast, we find a large number of folding pathways, slow dynamics between unfolded states relative to folding, and no glass transition. Our folding rates are also quite similar, rather than the different rates characteristic of the folding pathways in type 0B folding.

Other funnel models have recognized the possibility of a large number of folding pathways (95, 109, 118), but still in the context of fast dynamics within the unfolded basin relative to slower transitions to the folded state. Some have even gone so far as to assume global connectivity (122, 123); however, even these emphasize that local connectivity would dominate in the full dimensional conformational space and global connectivity only arises when projecting onto a few order parameters. Furthermore, they argue global connectivity will not give an activation barrier and, therefore, these models are primarily intended for studies of downhill folding or the early activationless stages of folding. Our model, on the other hand, has a native hub and slow dynamics in the unfolded state relative to faster folding regardless of the degree of coarse-graining one employs. We also demonstrate that this can result in apparent two-state folding (i.e. activated kinetics) and that this occurs in non-downhill folding proteins, such as the millisecond folding NTL9.

CONCLUSIONS

Many biological systems, ranging from signaling pathways to social networks, can be most naturally described as networks. As a field, we have now established a new level to this hierarchy: a network theory for molecular kinetics that is able to map out the free energy landscapes of proteins and other macromolecules in their entirety.

Previous work has demonstrated that this network theory is capable of quantitative agreement with experiments (4, 5, 12, 93) and we have now shown that it can also scale down to simple, generic models. Using this theory at both the quantitative and qualitative levels, we have provided an intuition for conformational changes as drastic as protein folding and this intuition has led to experimentally testable insights into the nature of protein free energy landscapes.

We have focused on three new insights from these network models, which appear to hold regardless of the degree of coarse-graining one employs and can be reconciled with current experiments. First, even models that defy a kinetic decomposition into two states often give rise to apparent two-state behavior. Second, proteins have heterogeneous unfolded states (multiple basins that each interconvert more rapidly with the native state than with one another, preventing a kinetic decomposition into two states). Third, proteins have a native hub. Thus, it is possible to fold quickly from anywhere in the landscape but proteins often get stuck in a web of non-native states before finally folding, greatly increasing the average folding time.

These properties are a natural result of reasonably strong non-native interactions and the enormous number of non-native conformations a protein can adopt, in combination with evolutionary pressure to fold quickly (for example, to avoid aggregation). Therefore, we suggest that these conclusions are likely true of proteins in general. Our approach also unifies other models for protein folding by recognizing that each of them builds upon elements, whether they are called folding nuclei (102) or foldons (104), which correspond to different types of metastable states.

We look forward to a fruitful future of drawing on network theory to better understand molecular kinetics and guide experiments probing both general properties and system specific details. In particular, can one reinterpret the many experiments that have been analyzed under a two-state assumption? If so, that could shed light on the chemistry of the underlying structures that leads to the network topology and dynamics described here. Moreover, can further experiments be designed to directly probe the unfolded state under native conditions (rather than with denaturant or high temperature, where mixing is more rapid) to directly test the predictions made here?

We also hope to explore how the methodologies developed for building and understanding biomolecular networks may be applicable to other types of networks, especially as network theorists attempt to develop a general framework for understanding network dynamics.

MATERIALS & METHODS

ATOMIC RESOLUTION PROTEIN FOLDING SIMULATIONS AND NETWORKS. Ref (4) describes the use of the MSMBuilder package (https://simtk.org/home/msmbuilder/) (10) to construct an MSM with 10,000 microstates for the villin headpiece (HP-35 NleNle). This model was based on ~450 all-atom, explicit solvent simulations, each up to 2 μs in length, for a total simulation time of 354 μs (58). While the longest timescale transitions in the model from Ref (4) were found to be Markovian, implying memory-less transitions between metastable states, not every state was metastable. We used MSMBuilder to lump kinetically related microstates into 500 metastable macrostates to ensure a direct correlation between states in the MSM and free energy basins, as described in the SI. This is equivalent to common experimental analyses in which the potential is smoothed and the friction is rescaled. We note, however, that the free energy landscape for this system is actually a hierarchy of basins so it is possible to build many valid MSMs with different numbers of states. As a result, one would not expect there to be exactly 500 experimentally detectable states. Regardless of the resolution at which one examines this hierarchy, however, requiring that each state is metastable ensures that they are directly related to a free energy basin. Thus, our networks of metastable states are an important step beyond previously described networks, which often used simpler approximations to define state boundaries and the transition rates between states (17, 95, 110, 124, 125). An additional 40,000 simulations, each up to 400 ns in length (for a total simulation time of 14 milliseconds), were also assigned to this MSM to explore the effect of using more simulations.

Preliminary results for a 39 residue fragment of NTL9 are based on an MSM built from ~1.5 milliseconds of simulation in implicit solvent with a different force field (93). Similarities between these two systems thus suggest our results are not a force field artifact.

SIMPLE MODELS. We have designed three simple networks, depicted in Figure 16, that capture the essence of various protein folding mechanisms. Each of these models has six metastable states with approximately the same equilibrium and transition probabilities so that differences between their behaviors may be attributed to differences in their topologies (see Appendix D for details).

The first model (S) has a single folding pathway. This model is a natural extension of the common U↔I↔N model (97, 126) and is often used to justify the expense of running long simulations as shorter ones could fail to reach every state.

The second model (P) has parallel folding pathways. Parallel folding pathways have been proposed for a number of systems (58, 98, 99, 109). In addition, this model emphasizes the need to observe numerous folding and unfolding transitions to obtain sufficient statistics on the entire process. The increased connectivity relative to S also results in faster timescales.

The third model (H) has a heterogeneous unfolded state: multiple unfolded basins that each interconvert more rapidly with the native state than with one another. Thus, there is no kinetic decomposition of this model into two states, one folded and one unfolded. This model was inspired by a growing body of work on the presence of deep minima and gutters in unfolded regions of conformational space (114, 115, 127-129).

CHAPTER 5: ATOMISTIC FOLDING SIMULATIONS OF THE FIVE HELIX BUNDLE PROTEIN LAMBDA6-85

This chapter is in preparation as: Bowman GR, Voelz VA, Ensign DL, & Pande VS (2010) Atomistic folding simulations of the five helix bundle protein λ6-85.

ABSTRACT

Understanding protein folding is a long-standing problem with important medical applications, such as elucidating the role of protein misfolding in diseases like Alzheimer’s. Solving the folding problem will ultimately require a combination of theory and experiment, with theoretical models providing an atomically-detailed picture of both the thermodynamics and kinetics of folding and experimental tests grounding these models in reality. However, modeling long timescale dynamics (e.g. microseconds, milliseconds, and beyond) with sufficient statistical accuracy and chemical detail to make a quantitative connection with experiments is extremely challenging. Here we report significant progress in this direction: an atomistic model of the folding of an 80-residue fragment of the λ repressor protein with explicit solvent that captures dynamics on 10 millisecond timescales. This advance greatly increases the common ground accessible to both theory and experiment (both in terms of system size and long timescales) and leads to a number of predictions that warrant further experimental tests. For example, our model’s native state is a kinetic hub and biexponential kinetics arise from the presence of many free energy basins separated by barriers of different heights rather than a lack of barriers (the previously proposed downhill scenario).

INTRODUCTION

Understanding protein folding is a long-standing problem with important medical applications, such as elucidating the role of protein misfolding in diseases like Alzheimer’s. Solving the folding problem will ultimately require a combination of theory and experiment, with theoretical models providing an atomically-detailed picture of both the thermodynamics and kinetics of folding and experimental tests grounding these models in reality. However, modeling long timescale dynamics (e.g. microseconds, milliseconds, and beyond) with sufficient statistical accuracy and chemical detail to make a quantitative connection with experiments is extremely challenging. Much progress has been made with small, fast-folding proteins but can the methods used scale to larger, slower systems? Here we report significant progress in this direction: an atomistic model of the folding of an 80-residue fragment of the λ repressor protein with explicit solvent that captures dynamics on a 10 millisecond timescale.

This advance builds on a growing body of work on describing molecular kinetics with Markov State Models (MSMs). MSMs are essentially maps of a molecule’s conformational space (1-3, 6). However, instead of having towns connected by roads labeled with speed limits, MSMs have metastable states (sets of rapidly interconverting conformations) connected by edges giving the probability of going from one state to another. One can exploit the kinetic definition of states in an MSM to scale from high-resolution models capable of quantitative agreement with experiments to low-resolution models that provide an intuition for the system. In addition, one can break up slow processes like protein folding into many small steps that can be studied with short, parallel simulations.

The proteins studied with MSMs to date have generally been small and fast folding (see Refs (2) and (3) for reviews). For example, we have built a model for a 35-residue mutant of the villin headpiece (4) that folds on the μs timescale. The native state of this model (i.e. lowest free energy state) was within 1.8 Å of the crystal structure, an important achievement given that all the simulations used to build the model started from unfolded conformations. Noé et al. have built an MSM for a Pin WW domain (5) (34 residues, μs folding time) and Voelz et al. have built an MSM for a 39-residue fragment of NTL9 (93) (the first millisecond folder to be modeled with MSMs). The ability of these models to predict structures, thermodynamics, and rates indicates they should be capable of predicting any experimental observable, since all are functions of these properties.

To test whether the MSM approach can scale to larger systems, we have applied it to the D14A mutant of an 80-residue fragment of the λ repressor protein (72). Full length λ repressor is a 236-residue protein capable of dimerizing and binding to DNA, maintaining the λ phage in the lysogenic state and regulating its own expression. Figure 21A shows the crystal structure of a 92-residue fragment that can still dimerize and bind to DNA (130, 131). Based on this structure, Huang and Oas selected an 80-residue fragment (λ6-85) that favors the monomeric state (Figure 21B), making it appropriate for folding studies (132). This fragment was one of the first sub-millisecond timescale folders to be discovered. Subsequently, a number of mutants of λ6-85 have been found to fold on faster timescales (72, 133-135). The D14A mutant is one of the fastest folders, having an approximately 2 μs molecular phase and an approximately 10 μs activated phase (72). These timescales have been attributed to downhill (or barrierless) and two-state folding, respectively.

Figure 21. (A) The crystal structure of the λ1-92 dimer bound to DNA (PDB code 1LMB). (B) A model of λ6-85 with the Trp22-Tyr33 pair monitored in T-jump experiments space-filled.

The fast timescales reported for D14A make it a prime candidate for atomistic molecular dynamics simulations combined with MSMs, which can now capture millisecond timescales (93). We have run 3,265 trajectories with explicit solvent at 370 K. Each one is up to 1 μs in length, for an aggregate of 1.3 milliseconds of simulation. These simulations were started from six initial configurations drawn from replica exchange simulations in implicit solvent (136). One is native-like, three are partially unfolded, and two have β-sheets. A more detailed description of our simulations is given in Appendix E. We then constructed a high-resolution MSM with 30,000 microstates that is appropriate for making quantitative connections with experiments. A low-resolution model with 5,000 macrostates was created from the high-resolution MSM to facilitate interpretation of the model. More details on our use of the MSMBuilder package (10) to construct these models are given in Appendix E. While no single trajectory visits every state, these MSMs are able to capture long timescale dynamics by exploiting overlap between our simulations to stitch them together in a physically and statistically meaningful way. Examination of the implied timescales of the microstate MSM shows that a five ns lag time yields Markovian behavior (Figure 65).
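The implied-timescale test mentioned above is conventionally based on the relation t_i = -τ / ln λ_i, where λ_i is an eigenvalue of the transition probability matrix estimated at lag time τ. A lag time is usually taken to be Markovian once the t_i stop changing as τ is increased. The sketch below is illustrative only: the function names and the two-state example are hypothetical and are not part of MSMBuilder or of the model described here.

```python
import math


def implied_timescale(eigenvalue, lag_time):
    """Return t = -lag_time / ln(eigenvalue) for an eigenvalue in (0, 1).

    The result is in the same units as lag_time.
    """
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must lie strictly between 0 and 1")
    return -lag_time / math.log(eigenvalue)


def two_state_slow_eigenvalue(p_ab, p_ba):
    """Second eigenvalue of the row-stochastic matrix
    [[1 - p_ab, p_ab], [p_ba, 1 - p_ba]] (the first eigenvalue is 1).
    """
    return 1.0 - p_ab - p_ba


if __name__ == "__main__":
    # Hypothetical two-state model estimated at a 5 ns lag time.
    lam = two_state_slow_eigenvalue(0.001, 0.004)
    print(implied_timescale(lam, lag_time=5.0), "ns")
```

In practice one would repeat this calculation for a series of lag times and look for the point at which the slowest timescales level off.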

RESULTS & DISCUSSION

Analysis of our high-resolution MSM reveals the presence of 10 millisecond timescales. These timescales are preserved in an independent dataset run at 300 K and subsamples of the 370 K dataset (Figure 66 and Figure 67), indicating that they are a robust feature of the simulated system. Do these slow timescales reflect inadequacies in the simulation parameters (the force field)? For example, λ repressor’s folding time is known to be sensitive to solvent viscosity (137), so small errors in our parameterization could easily affect our predicted rates. Or could the experimental probes and techniques used to date be insensitive to these long timescales? One might expect D14A, with its sizeable hydrophobic core, to fold on slower timescales given that the wild-type villin headpiece (which is less than half the size of D14A and barely has a hydrophobic core) is also reported to fold in just under ten μs (73).

To explore these possibilities we mapped out the 10 millisecond timescale conformational rearrangement. Analysis of our coarse-grained MSM reveals that this slow timescale corresponds to exchange between a compact β-sheet structure and the crystal structure through multiple parallel pathways (Figure 68 and Figure 69). Figure 22 shows a representative pathway between these states from our high-resolution MSM. First, the compact β-sheet structure expands, breaking apart the β-sheets. Then helices 1 and 4 begin to form, followed by collapse into a native-like topology. Finally, the remaining helices form. As in a previous study (138), more conventional projections of the free energy landscape were less informative (Figure 70 and Figure 71).

Figure 22. One of the 10 millisecond timescale pathways labeled with pfold values (the probability of reaching state H before state A).

The prediction of β-sheet states in the unfolded ensemble under folding conditions is somewhat surprising for a helical protein, especially since they are well populated (Figure 72). However, experiments have shown that the unfolded and denatured states of many systems can have significant populations of compact, β-sheet structures yet still display the random coil statistics characteristic of expanded conformations (139, 140). Thus, our prediction of compact, β-sheet structures is not unreasonable.

As a further test we used our MSM to model the relaxation of a surrogate for the Trp22-Tyr33 quenching interaction measured in T-jump experiments and a more global metric, the Cα RMSD to the crystal structure (Figure 73). Both have biexponential relaxation (a characteristic of D14A that has been used to argue that it is a downhill folder), but the molecular phase is about two orders of magnitude slower than in experiment (1 millisecond versus 10 μs). However, ignoring simulations started from β-sheet structures yields better agreement (Figure 74). First, the Trp22-Tyr33 surrogate has a 1 μs molecular phase and a 4.3 μs activated phase, in reasonable agreement with the experimental values of 2 and 10 μs. Second, the RMSD now relaxes on different timescales, consistent with observed probe dependent kinetics (141, 142). Projections of the free energy onto a kinetically meaningful reaction coordinate (pfold (51)) are not purely downhill, but could be consistent with incipient downhill folding along parallel pathways (Figure 75). Incipient downhill folding is a scenario in which a barrier is present but is sufficiently low that its peak is well populated; therefore, one observes downhill folding (a molecular phase) from the barrier top and two-state folding (an activated phase) across the barrier.

Based on these results, we cannot conclusively determine whether the stability of the β-sheet states is a force field artifact or a feature of D14A not yet detected by experiments. It is possible that short T-jumps simply cannot reach the β-sheet states. Fully resolving this issue will likely require more experiments and more points of comparison between theory and experiment. Regardless of the outcome, it is exciting that MSMs built from atomistic simulations can now capture 10 millisecond timescales.

The crystallographic state (Figure 22H, probability ~0.09) is not the native (most stable) state in our model. The native state in our model (Figure 22G, probability ~0.44) differs from the crystallographic state in that helix five is unraveled and packed against the side of the protein. This observation is consistent with both the negligible helical propensity in helix five reported by Agadir (143) (Figure 76) and the context of this helix in the original crystal structure (Figure 21A), where it is extended by seven residues. These extra residues form important contacts between the two members of the dimer that could stabilize helix five. Truncating the sequence to favor the monomer could lead to a lack of structure in the remaining residues of helix five, resulting in a strong propensity to fill in the hydrophobic cavity normally occupied by the corresponding helix in the other member of the dimer or adopt one of a number of other well-populated, unstructured conformations (Figure 72). Further support for this observation comes from the fact that a crystal structure of λ6-85 has high B-factors in helix five (135) and the stability of this system seems to be insensitive to mutations in this helix (136). Similar results were also found in a Gō model study, where helix five tended to un-dock from the rest of the protein (138). However, Gō models do not include non-native interactions, so helix five was not found to unravel or pack against the protein. The behavior of a variational model (144) and a diffusion-collision model (145) also differs from that found here due to the lack of non-native interactions. However, the diffusion-collision model is similar in nature to our MSM approach in its use of states and rates. Helix five was also found to be unstable in replica exchange simulations with implicit solvent (136).

Our MSM for D14A is also consistent with previous reports of native hubs (16, 146). A first hint of this comes from the large number of connections to our native state (Figure 23). The native state in our model makes direct connections to 98% of the non-native states while non-native states only connect to 0.1% of the other states on average. Moreover, the MFPTs to the native state are typically ~10 times faster than the MFPTs between non-native states, as shown in Figure 24. Therefore, molecules in non-native states can generally fold faster than they can transition to other non-native states. The fastest way to transition between two randomly selected non-native states is then to fold and unfold.
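The MFPT comparison above can be illustrated on a toy network. The sketch below assumes a row-stochastic transition matrix and solves the standard first-passage equations, m_i = τ + Σ_{j≠target} T_ij m_j with m_target = 0, by fixed-point iteration. The three-state matrix is invented for illustration and is not taken from the D14A model, and the function name is hypothetical.

```python
def mean_first_passage_times(T, target, lag_time=1.0, tol=1e-12, max_iter=100000):
    """MFPT from every state to `target` for a row-stochastic matrix T.

    Iterates m_i = lag_time + sum_{j != target} T[i][j] * m_j until the
    largest update falls below `tol`. Assumes `target` is reachable from
    every state; otherwise the iteration does not converge.
    """
    n = len(T)
    m = [0.0] * n  # m[target] stays 0 by definition
    for _ in range(max_iter):
        delta = 0.0
        for i in range(n):
            if i == target:
                continue
            new = lag_time + sum(T[i][j] * m[j] for j in range(n) if j != target)
            delta = max(delta, abs(new - m[i]))
            m[i] = new
        if delta < tol:
            return m
    raise RuntimeError("MFPT iteration did not converge")


# Toy hub network (illustrative numbers): state 0 plays the role of the
# native hub; states 1 and 2 are non-native and only weakly connected.
HUB_T = [
    [0.80, 0.10, 0.10],
    [0.20, 0.79, 0.01],
    [0.20, 0.01, 0.79],
]

if __name__ == "__main__":
    print("to native:", mean_first_passage_times(HUB_T, target=0))
    print("to state 2:", mean_first_passage_times(HUB_T, target=2))
```

In this toy matrix each non-native state reaches the hub in 5 lag times on average, whereas passage from one non-native state to the other takes about 18 lag times, which mirrors the qualitative behavior described above.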

Figure 23. The 500 most populated macrostates with sizes proportional to their free energies and connections between states if transitions between them occurred in our simulations. The native state (green state with green connections) is a hub. The crystallographic state from Figure 22H is blue, the compact β-sheet state from Figure 22A is red, and the remaining states are yellow. All of these states have smaller equilibrium populations and fewer connections than the native state.

Figure 24. Distributions of mean first passage times (MFPTs) between sets of microstates (A) without weighting the distribution and (B) weighting each MFPT by the equilibrium probability of the starting state. The solid line is the distribution of MFPTs from non-native to native microstates and the dashed line is the distribution of MFPTs between non-native states. The average MFPT from non-native states to native ones is about 10 times faster than that between non-native states in (A) and the difference is even greater in (B). Native microstates were defined as those in the most populated macrostate. All other microstates were considered non-native.

This hub model presents an alternative to the two-state and downhill models often used to describe protein folding and interpret experiments. Rather than having a single dominant barrier or no barrier at all, the hub model has many metastable states separated by barriers of different heights and numerous unfolded basins that interconvert more rapidly with the native state than with one another. Therefore, there are many parallel folding pathways. We have already shown that MSMs with native hubs can predict the dominant two-state behavior and burst phase kinetics of other systems (16). Here we show that MSMs with native hubs can also predict the biexponential relaxation of D14A that has previously been attributed to downhill (or barrierless) folding (72, 147). Our previous work proposed that the native hub results from non-negligible non-native contacts, which must be broken in order to fold (16, 146). Figure 22 demonstrates this behavior in our model of D14A. Testing the hub model will require more experiments on the unfolded state under native conditions (rather than at high temperature or in the presence of denaturant, where the unfolded ensemble is likely more diffuse).

CONCLUSIONS

The combination of simulations and MSMs can now access ~10 millisecond timescales for moderately large (~80 residue) systems, greatly increasing the common ground between theory and experiment. The ability of our MSMs to capture biexponential kinetics also indicates that proteins previously designated as downhill folders may actually have many barriers of differing heights. In addition, our model leads to a number of predictions for D14A: 1) current experiments may be failing to detect processes on 10 millisecond timescales, 2) there may be significant β-sheet structure in the unfolded ensemble under native conditions, 3) helix five may unfold and fill a hydrophobic pocket in the native state and lack structure in other well populated states, and 4) the native state may act as a kinetic hub. Our ability to reconcile these observations with existing experiments suggests that more experimental data will be necessary to provide a detailed description of how D14A folds. We suggest that MSMs could be used to help design such experiments and lead to important new insights into folding or, at the very least, provide more data for refining existing force fields and improving the agreement between theory and experiment.

CHAPTER 6: ENHANCED MODELING VIA NETWORK THEORY: ADAPTIVE SAMPLING OF MARKOV STATE MODELS

This chapter was taken from: Bowman GR, Ensign DL, & Pande VS (2010) Enhanced modeling via network theory: adaptive sampling of Markov state models. J Chem Theory Comput 6:787-794.

ABSTRACT

Computer simulations can complement experiments by providing insight into molecular kinetics with atomic resolution. Unfortunately, even the most powerful supercomputers can only simulate small systems for short timescales, leaving modeling of most biologically relevant systems and timescales intractable. In this work, however, we show that molecular simulations driven by adaptive sampling of networks called Markov State Models (MSMs) can yield tremendous time and resource savings, allowing previously intractable calculations to be performed on a routine basis on existing hardware. We also introduce a distance metric (based on the relative entropy) for comparing MSMs. We primarily employ this metric to judge the convergence of various sampling schemes but it could also be employed to assess the effects of perturbations to a system (e.g. determining how changing the temperature or making a mutation changes a system’s dynamics).

INTRODUCTION

Molecular dynamics simulations are a powerful means of understanding both the thermodynamics and kinetics of molecular processes like protein folding and conformational changes. Unfortunately, such processes are highly sensitive to the underlying chemical details. For example, point mutations in the amino acid sequence of a protein may have significant effects on its kinetics (147) and a small number of point mutations can even drastically change the native structure (148). Thus, atomistic simulations are required to make quantitative connections with experiments (149, 150).

Advances in computing have made it possible to rapidly generate huge data sets even at this level of chemical detail (79, 151); however, these data sets are still insufficient. A typical computer can only simulate ~5 nanoseconds/day of protein folding and would thus take over 500 years to simulate one millisecond, an average folding time typical of proteins. Whether one is interested in dynamics or merely equilibrium probabilities, a kinetic perspective on this problem that explicitly considers the rate of equilibration reveals that metastability, or the presence of long-lived states that act as “traps”, is a common source of inefficiency.
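The estimate above follows from simple arithmetic, reproduced here as a sanity check. The function name is illustrative and a 365.25-day year is assumed.

```python
def years_to_simulate(target_ns, ns_per_day, days_per_year=365.25):
    """Wall-clock years needed to generate `target_ns` of simulation
    on one machine producing `ns_per_day` nanoseconds per day."""
    return target_ns / ns_per_day / days_per_year


if __name__ == "__main__":
    one_millisecond_ns = 1.0e6  # 1 ms expressed in nanoseconds
    print(years_to_simulate(one_millisecond_ns, ns_per_day=5.0))
```

At 5 ns/day, one millisecond corresponds to 200,000 days of computing, or roughly 550 years, consistent with the figure quoted above.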

One approach to dealing with this issue is to make tremendous investments in specialized software and hardware for generating long simulations (152). While theoretically sound (153), this serial approach often only results in simulations that are long relative to standard trajectories. However, a truly long simulation must be orders of magnitude longer than the slowest relaxation time so that the probabilities of all states and pathways can be estimated accurately. Even if such a simulation were possible, the task of analyzing the data would still remain (152, 154). Moreover, serial approaches are inherently inefficient, due both to parallelization overhead and, more importantly, to the fact that they waste hundreds of years of computing time waiting for rare events.

A statistical approach provides a fundamentally different perspective on model construction. Rather than attempting to generate one realization of an entire process, one instead aims to generate an ensemble of events in parallel. For example, a number of methods have been developed for exploiting statistical mechanics to simulate protein folding more efficiently (69, 84, 155, 156). Most of these approaches rely on the fact that in two-state protein folding, the waiting time for observing a transition is exponentially distributed but the actual transition times are quite rapid (120). Thus, proteins often fold much faster or slower than the average folding time. Such approaches are amenable to commodity hardware and take far less wall-clock time than a serial approach with an equivalent amount of sampling, particularly when combined with grid computing (79). Unfortunately, these methods are generally only applicable to two-state systems and may require simulations of an unknown minimum length (121). Some multi-state generalizations exist (157) but quickly become computationally intractable.

Markov State Models (MSMs) extend this work by providing a tractable, multi-state scheme for efficiently modeling any system exhibiting metastability (9). An MSM is a network with nodes corresponding to metastable states and edges describing the rates of transitioning between pairs of states, akin to a map with cities connected by roads labeled with speed limits. Rather than attempting to generate one realization of an entire process, one can exploit the decomposition of conformational space into multiple metastable states to gather statistics on each step of the process independently, allowing a problem to be broken up into more manageable and trivially parallelizable pieces.

Mathematically, MSMs are represented as transition probability matrices, with the entry in row i and column j giving the probability of transitioning from state i to state j within a time interval called the lag time of the model. Building MSMs is a challenging task but significant progress has been made over the past few years (3, 4, 6, 10), leading to freely available software for automatically constructing these models (10). While MSMs could be used to analyze truly long simulations, their ultimate value lies in their ability to facilitate efficient model construction by allowing precise, parallel determination of the transition rates between states by running many short simulations from each of them.
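As a concrete illustration of this matrix representation, the sketch below counts transitions in a discrete state trajectory at a chosen lag time and row-normalizes the counts to obtain transition probabilities. It is a minimal sketch under stated assumptions, not the MSMBuilder implementation: the function names are hypothetical, a sliding-window count is assumed, and states with no observed outgoing transitions are simply left as rows of zeros.

```python
def count_transitions(trajectory, n_states, lag=1):
    """Count i -> j transitions separated by `lag` frames.

    `trajectory` is a sequence of integer state labels in [0, n_states).
    Every frame is used as a starting point (sliding window), which is an
    assumption of this sketch.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for t in range(len(trajectory) - lag):
        counts[trajectory[t]][trajectory[t + lag]] += 1
    return counts


def row_normalize(counts):
    """Convert a count matrix into a transition probability matrix.

    Entry [i][j] is the estimated probability of going from state i to
    state j within one lag time. Rows with no counts are left as zeros.
    """
    probs = []
    for row in counts:
        total = sum(row)
        if total:
            probs.append([c / total for c in row])
        else:
            probs.append([0.0] * len(row))
    return probs


if __name__ == "__main__":
    traj = [0, 0, 1, 0, 1, 1]  # toy two-state trajectory
    print(row_normalize(count_transitions(traj, n_states=2)))
```

Counts from many short, independent simulations can be summed before normalization, which is what allows the model to be built in parallel.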

Adaptive sampling algorithms for MSM construction take this statistical approach a step further (12, 19, 20). In adaptive sampling, one first obtains an initial model of the entire process of interest by any means possible. One then iteratively calculates the contribution of each step of the process to uncertainties in some observable of interest via Bayesian statistics and runs numerous parallel simulations of the steps that can lead to the greatest increases in precision until the desired level of statistical certainty is achieved. Such an approach was recently shown to lead to dramatic reductions in the statistical uncertainty in the observable of interest relative to other refinement schemes (19).

However, a number of important questions remain to be answered. First, does adaptive sampling improve the global model quality or just local components that are important for the observable of interest? Exactly how much more efficient is adaptive sampling? And finally, is adaptive sampling capable of discovering previously unknown components of a model, or is it only able to refine the initial model it is given?

In this work, we address these questions using an MSM for the villin headpiece (HP-35 NleNle) that was recently constructed from atomistic simulations with explicit solvent (4). We then move on to simple models, where the role of the network is clear, to gain an intuition for our results and test whether such methods could be more broadly applicable to a wide class of different types of systems. These analyses rely on a new distance metric for MSMs developed in Section 2.2, which should prove generally useful for evaluating various sampling schemes and even assessing the effects of perturbations to a system (like changes in temperature or even mutations).

THEORETICAL UNDERPINNINGS

ADAPTIVE SAMPLING. In adaptive sampling approaches to MSM construction, simulations are run iteratively to minimize uncertainties in some property of a model (12, 19, 20). In this work, adaptive sampling is performed as follows:

1. perform N simulations of L steps starting from a particular starting state(s)

2. build an MSM only including those states identified so far

3. calculate the contribution of each state to uncertainty in the slowest kinetic rate following Ref (19)

4. start N new simulations of L steps distributed amongst the states in proportion to their contribution to uncertainty in the slowest rate

5. repeat steps 2-4 for some number of iterations
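The steps above can be sketched as a short driver loop. This is a schematic outline only: the simulation engine, the MSM builder, and the uncertainty analysis of Ref (19) are represented by caller-supplied functions (`run_simulations`, `build_msm`, and `uncertainty_contributions`, all hypothetical names), and the proportional split in step 4 uses a largest-remainder rounding rule, which is an assumption of this sketch rather than something specified in the text.

```python
def allocate_simulations(contributions, n_sims):
    """Split n_sims among states in proportion to their contributions.

    `contributions` maps state -> non-negative weight. Fractional shares
    are rounded with the largest-remainder rule so the total is n_sims.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("at least one contribution must be positive")
    quotas = {s: n_sims * c / total for s, c in contributions.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    leftover = n_sims - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


def adaptive_sampling(run_simulations, build_msm, uncertainty_contributions,
                      start_states, n_sims, n_steps, n_rounds):
    """Schematic adaptive sampling loop following steps 1-5.

    run_simulations(plan, n_steps) -> iterable of trajectories, where
        `plan` maps starting state -> number of simulations to launch.
    build_msm(trajectories) -> model built from the states seen so far.
    uncertainty_contributions(model) -> dict state -> contribution to the
        uncertainty in the quantity of interest (placeholder for Ref (19)).
    """
    # Step 1: initial simulations, spread evenly over the starting state(s).
    plan = allocate_simulations({s: 1.0 for s in start_states}, n_sims)
    trajectories = list(run_simulations(plan, n_steps))
    for _ in range(n_rounds):                      # step 5
        model = build_msm(trajectories)            # step 2
        contrib = uncertainty_contributions(model) # step 3
        plan = allocate_simulations(contrib, n_sims)  # step 4
        trajectories.extend(run_simulations(plan, n_steps))
    return build_msm(trajectories)
```

Keeping the three scientific components behind plain function arguments makes it possible to exercise the loop with synthetic trajectories before committing real simulation time.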

All the MSMs in this work were constructed and analyzed with the MSMBuilder package (which is freely available at https://simtk.org/home/msmbuilder/) (10) modified such that transition count matrices were not symmetrized by counting the transitions that would have been observed if one watched each simulation backwards.

We note that, in the past, the simulations in each round of adaptive sampling were all started from the same initial state (the one contributing most to uncertainty in the quantity of interest) (19). The intuition behind our alteration was that as the number of simulations (N) becomes large, starting all the simulations from one state would be excessive as fewer would be sufficient to drastically reduce the uncertainty. Instead, it would be preferable to allocate some of these excess simulations to reduce uncertainties in other states’ transition probabilities. Indeed, we have found that our modified procedure yields better results for sufficiently large N on reasonably complex networks and gives equivalent results for simple networks and small N.

To demonstrate the utility of this algorithm, we carried out adaptive sampling with synthetic trajectories generated from transition count matrices. To generate synthetic simulations from a transition count matrix we first normalize each row to obtain a transition probability matrix. At each time step (or each lag time), the next state is chosen according to the distribution of transition probabilities for the current state. The prior described below is not used for these calculations, so the matrices used to generate trajectories tend to be sparse.
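The synthetic trajectory generation just described can be sketched in a few lines. The function names, the 0-indexed state labels, and the toy count matrix are assumptions made for illustration; no prior is added, in keeping with the text.

```python
import random


def row_normalize(counts):
    """Turn a transition count matrix into a transition probability matrix."""
    probs = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("state with no observed transitions")
        probs.append([c / total for c in row])
    return probs


def synthetic_trajectory(counts, start_state, n_steps, rng):
    """Generate a state sequence with n_steps transitions from a count matrix."""
    probs = row_normalize(counts)
    states = list(range(len(probs)))
    trajectory = [start_state]
    for _ in range(n_steps):
        current = trajectory[-1]
        # Draw the next state from the current state's row of probabilities.
        next_state = rng.choices(states, weights=probs[current], k=1)[0]
        trajectory.append(next_state)
    return trajectory


# Hypothetical three-state count matrix (states labeled 0, 1, 2).
toy_counts = [[90, 10, 0], [5, 90, 5], [0, 10, 90]]
print(synthetic_trajectory(toy_counts, start_state=0, n_steps=20,
                           rng=random.Random(0)))
```

Passing the random number generator explicitly makes a run reproducible, which is convenient when averaging over independent runs.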

QUANTIFYING THE SIMILARITY BETWEEN MSMS. In order to monitor the convergence of any sampling scheme, it is important to first develop a similarity metric that is capable of measuring the global quality of a test model relative to some reference model. Such a metric would also have broad usefulness, as there are several reasons for comparing MSMs quantitatively. For example, this metric could be used to compare MSMs generated by two different simulation methods allowing one to directly compare the resulting dynamics. Alternatively, one could compare MSMs generated by two somewhat different, but related systems, such as comparing the simulations of the dynamics of two point mutants of a given protein.

We have developed such a distance metric for MSMs that is based on the relative entropy, which is a common measure of the distance between two probability distributions in information theory (158) with important physical implications (159). The relative entropy between two normalized distributions P and Q, over a common set of outcomes, is

$$
D(P \,\|\, Q) = \sum_{i} P_i \log \frac{P_i}{Q_i}
$$

where $P_i$ is the probability of outcome i, P is a reference distribution, and Q is some test distribution.

An MSM consists of one normalized distribution per state, which gives the probability of transitioning to each other state within one lag time. We define the relative entropy between a reference and test MSM, with transition matrices P and Q respectively, as


$$
D(P \,\|\, Q) = \sum_{i,j}^{N} P_i \, P_{ij} \log \frac{P_{ij}}{Q_{ij}} \qquad (6.1)
$$

where Pi is the equilibrium probability of state i, Pij is the probability of transitioning from state i to state j during one lag time, and N is the number of states. Intuitively, our relative entropy metric is the sum of the relative entropies between the transition probability distributions for each state weighted by their stationary probabilities.
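Eq. 6.1 translates directly into code. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the natural logarithm is used (the text does not fix the base), the stationary distribution of the reference model is supplied by the caller, and the two-state matrices are toy values.

```python
import math


def msm_relative_entropy(P, Q, stationary):
    """Relative entropy D(P||Q) between two MSMs, following Eq. 6.1.

    P, Q: row-stochastic transition matrices over the same states.
    stationary: equilibrium probabilities of the reference model P.
    """
    total = 0.0
    for i, p_i in enumerate(stationary):
        for p_ij, q_ij in zip(P[i], Q[i]):
            if p_ij == 0.0:
                continue  # a term 0 * log(0 / q) is taken to be 0
            if q_ij == 0.0:
                return math.inf  # test model forbids an allowed transition
            total += p_i * p_ij * math.log(p_ij / q_ij)
    return total


# Toy two-state example; (5/6, 1/6) is the stationary distribution of P_ref.
P_ref = [[0.9, 0.1], [0.5, 0.5]]
Q_test = [[0.8, 0.2], [0.5, 0.5]]
print(msm_relative_entropy(P_ref, Q_test, [5 / 6, 1 / 6]))
```

The early return of infinity is the situation that the pseudo-count prior is meant to avoid in practice.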

One may derive our relative entropy metric for MSMs more formally by considering that the entropy (H) of a sample path of a Markov chain, normalized by its length, is also called the entropy rate. An important theorem in information theory is the following:

Theorem. For an ergodic stochastic process X1, …, Xn

$$
\lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1)
$$

For a Markov chain, the right hand side takes a very simple form, because the conditional entropy only depends on the previous step, which converges to the stationary distribution.

In the following, we prove a similar statement for the relative entropy between the paths of two Markov chains as n goes to infinity. For two Markov chains p and q with state space Ω, we would like to compute:

$$
\lim_{n \to \infty} \frac{1}{n} D\big(p(X_1, \ldots, X_n) \,\|\, q(X_1, \ldots, X_n)\big)
$$

For simplicity, let us define lowercase $x^n = X_1, \ldots, X_n$. Then, by the chain rule for the relative entropy, we get:


1 lim n1 n1  nn 1 xXqxXpDxqxpD nn 1))]|(||)|(())(||)(([ (6.2) n n

Eq. 2.65 in Cover & Thomas (160) defines the conditional relative entropy above as the expectation of the relative entropy between the conditional distributions of $X_n$ given $x^{n-1}$, with respect to the distribution of $x^{n-1}$. This means that:

nn 1 xXqxXpD nn 1   n1  n n yXqyXpDyxp ))|(||)|(()())|(||)|(( n1 y    n1  n n YXqYXpDYXp ))|(||)|(()( Y  where we have grouped terms with the same final state in the “history" y, which have the same relative entropy factor, and summed their probabilities to obtain the marginal probability over Xn-1.

Repeating the step that led to Eq. 6.2 many times yields:

1 n lim mm 1 mm 1  1 XqXpDxXqxXpD 1))(||)((]))|(||)|(([ n   n m2

If the initial state is deterministic, the last term is just zero. As for the first term, as n goes to infinity, the distribution of $X_{m-1}$ goes to the stationary distribution of p, which we call μ. Then, using the equation for the conditional relative entropy,

ZYp )|( nn 1 xXqxXpD nn 1   ZYpZ log[)|()())|(||)|((lim ] n   ZY  ZYq )|(

Since the terms in the series converge to a limit, their Cesàro means converge to the same limit, so:

$$
\lim_{n \to \infty} \frac{1}{n} D\big(p(X_1, \ldots, X_n) \,\|\, q(X_1, \ldots, X_n)\big) = \sum_{Y, Z \in \Omega} \mu(Z)\, p(Y \mid Z) \log\left[ \frac{p(Y \mid Z)}{q(Y \mid Z)} \right]
$$


The terms p(Y|Z) and q(Y|Z) are just the elements of the transition matrices of p and q respectively, so this is equivalent to Eq. 6.1.

PRIOR FOR RELATIVE ENTROPY AND ADAPTIVE SAMPLING. There is always some probability of transitioning between every pair of states, though these probabilities may be low enough that no actual transitions are observed. To account for this, as well as to reflect our lack of prior knowledge about the transition probabilities, we add a pseudo-count of 1/N to every element of the transition count matrix, where N is the number of states, before normalizing each row to find the transition probability matrix, as in Refs (19, 161). The intuition behind this choice is that for a state to exist we must observe at least one count in that state but before observing any real data the probability of this count leading to any other state is equal. From a Bayesian perspective, these pseudo-counts equate to a uniform prior. These pseudo-counts also prevent the relative entropy metric from becoming infinite whenever a zero is encountered in an MSM’s transition probability matrix. It is often the case that certain transitions are not observed, so this correction is of great practical importance.
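As a concrete illustration of this prior, the following sketch adds a pseudo-count of 1/N to every element of a count matrix and then normalizes each row. The function name and the two-state counts are hypothetical.

```python
def transition_probabilities_with_prior(counts):
    """Add a pseudo-count of 1/N to every element, then normalize each row."""
    n_states = len(counts)
    pseudo = 1.0 / n_states
    probs = []
    for row in counts:
        smoothed = [c + pseudo for c in row]
        total = sum(smoothed)  # the raw row sum plus one extra count
        probs.append([c / total for c in smoothed])
    return probs


# Hypothetical two-state counts with one unobserved transition (0 -> 1).
print(transition_probabilities_with_prior([[3, 0], [1, 1]]))
```

Each row gains exactly one count in total, spread evenly over all N destinations, so no transition probability is ever exactly zero.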

VILLIN SIMULATIONS AND MSM. The original ~450 villin simulations are described in detail in Ref (58). In short, ~450 constant temperature molecular dynamics simulations with explicit solvent and up to 2 μs in length were run from nine initial configurations drawn from high temperature unfolding simulations at 373 K. Ref (4) describes the construction of a 10,000 microstate MSM that faithfully reproduces the raw simulation data. For the purposes of this work, we lumped these 10,000 microstates into 500 macrostates exhibiting metastability and having an equivalent Markov time (15 ns). This lumping was done with the MSMBuilder package (10). The macrostates containing the nine initial configurations used during the real simulations were used as the starting points for adaptive sampling. Simulations of just 30 ns were used for adaptive sampling.

SIMPLE MODELS.

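The transition count matrices for simple models S and P ($C_S$ and $C_P$ respectively) are

$$
C_S = \begin{pmatrix}
6{,}000 & 3 & 0 & 0 & 0 & 0 \\
3 & 1{,}000 & 3 & 0 & 0 & 0 \\
0 & 3 & 1{,}000 & 3 & 0 & 0 \\
0 & 0 & 3 & 1{,}000 & 3 & 0 \\
0 & 0 & 0 & 3 & 1{,}000 & 3 \\
0 & 0 & 0 & 0 & 3 & 90{,}000
\end{pmatrix}
$$

and

$$
C_P = \begin{pmatrix}
6{,}000 & 2 & 2 & 0 & 0 & 0 \\
2 & 1{,}000 & 0 & 2 & 2 & 0 \\
2 & 0 & 1{,}000 & 2 & 2 & 0 \\
0 & 2 & 2 & 1{,}000 & 0 & 2 \\
0 & 2 & 2 & 0 & 1{,}000 & 2 \\
0 & 0 & 0 & 2 & 2 & 90{,}000
\end{pmatrix}
$$

where the entry in row i and column j gives the number of transitions observed from state i to state j.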

Mean first passage times were calculated following Ref (161). The mean first passage times for S and P are ~13,000 and ~5,000 steps respectively. Other equilibrium properties can be obtained by normalizing each row to obtain a transition probability matrix and then solving for the eigenvalues and eigenvectors of this matrix. For example, normalizing the first eigenvector (i.e. the one corresponding to an eigenvalue of 1) gives the equilibrium probabilities of each state. Subsequent eigenvalue/eigenvector pairs give kinetic rates and the states involved in these transitions respectively (9). Once again, the MSMBuilder package (10) was used for analysis of these models.
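The mean first passage times quoted above can be checked with a short calculation. The text follows Ref (161); the sketch below instead uses the standard linear system for first passage times, m_i = 1 + Σ_j T_ij m_j over the non-target states j, which is an assumption about method rather than a reproduction of that reference. Function names are hypothetical, states are 0-indexed (state 1 of the text is index 0), and the count matrices are those of models S and P with no prior added.

```python
def row_normalize(counts):
    """Turn a transition count matrix into a transition probability matrix."""
    return [[c / sum(row) for c in row] for row in counts]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def mean_first_passage_time(counts, source, target):
    """Mean number of steps to first reach target when starting from source."""
    T = row_normalize(counts)
    others = [i for i in range(len(T)) if i != target]
    # (I - T restricted to the non-target states) m = 1
    A = [[(1.0 if i == j else 0.0) - T[i][j] for j in others] for i in others]
    m = solve_linear(A, [1.0] * len(others))
    return m[others.index(source)]


C_S = [[6000, 3, 0, 0, 0, 0], [3, 1000, 3, 0, 0, 0], [0, 3, 1000, 3, 0, 0],
       [0, 0, 3, 1000, 3, 0], [0, 0, 0, 3, 1000, 3], [0, 0, 0, 0, 3, 90000]]
C_P = [[6000, 2, 2, 0, 0, 0], [2, 1000, 0, 2, 2, 0], [2, 0, 1000, 2, 2, 0],
       [0, 2, 2, 1000, 0, 2], [0, 2, 2, 0, 1000, 2], [0, 0, 0, 2, 2, 90000]]
print(mean_first_passage_time(C_S, 0, 5))
print(mean_first_passage_time(C_P, 0, 5))
```

For these matrices the system gives roughly 13,400 steps for S and 5,000 steps for P, in line with the values stated in the text.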

Plots of the average relative entropy as a function of simulation number and length were generated by running 600 simulations of 5,000 steps for each model. Average relative entropies over 10 random samples of N trajectories from this pool were then calculated and plotted. Similar plots for our adaptive sampling scheme were also generated by averaging over 10 independent runs.

RESULTS & DISCUSSION

APPLICATION TO VILLIN MSM. With these tools in place, we are now in a position to assess the efficacy of adaptive sampling using a previously calculated MSM for the villin headpiece (4) as a model system. In particular, we would like to assess two types of efficiency. First, given our desire to push the envelope of what is possible in a reasonable amount of time, can adaptive sampling reduce the wall-clock time necessary to achieve a given model quality? Second, given our desire to mitigate negative impacts on the environment, can adaptive sampling reduce the amount of resources (in this case computer time) necessary to achieve a given model quality?

To address these questions we have performed adaptive sampling with a variable number of simulations per iteration generated from our villin MSM. We then assume each simulation progresses at a rate of 5 ns/day, a typical value for modern personal computers, and compare the convergence of our adaptive simulations to the gold-standard model from Ref (4) (that was validated by comparison to both the raw simulation data and experiments) with the convergence of a single long reference simulation to the same gold-standard. Convergence to the gold-standard model is measured with our relative entropy metric for MSMs (described in Section 2.2).


Figure 25A shows that the wall-clock time efficiency of adaptive sampling scales linearly up to 5,000 simulations per iteration. That is, adaptive sampling with N simulations per iteration can reduce the wall-clock time necessary to achieve a given model quality by a factor of N for N as high as 5,000. Using more simulations will help but will only reduce the wall-clock time by a factor of αN, where α<1. The crucial result, however, is that one can reduce a calculation that would take decades to run with traditional methods to a calculation that can be run in a matter of days with adaptive sampling.
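The arithmetic behind the decades-to-days comparison is simple enough to spell out. In the sketch below the aggregate sampling of 100 μs is a hypothetical figure chosen only for illustration, the rate of 5 ns/day is the one assumed in the text, and ideal linear scaling (a factor of N) is assumed throughout.

```python
def wall_clock_days(total_ns, n_parallel, ns_per_day=5.0):
    """Days needed to collect total_ns of sampling with n_parallel
    simultaneous simulations, assuming ideal linear scaling."""
    return total_ns / (ns_per_day * n_parallel)


aggregate_ns = 100_000  # hypothetical aggregate sampling (100 microseconds)
single = wall_clock_days(aggregate_ns, 1)
parallel = wall_clock_days(aggregate_ns, 5000)
print(f"one trajectory: {single / 365:.1f} years; "
      f"5,000 parallel simulations: {parallel:.1f} days")
```

Beyond the linear regime the denominator would be replaced by αN with α<1, as described above.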


Figure 25. Scaling for adaptive sampling of villin as the number of parallel simulations (N) used during each round is varied. (A) Wall-clock time scaling as N is varied. The black line is a best fit to the linear portion of the data (circles), which extends up to 5,000 simulations per iteration. (B) Computer time required to achieve a given model quality (relative entropy) for various sampling schemes. L refers to one long trajectory and the numbers refer to the number of parallel simulations used in each iteration of adaptive sampling. All results come from averaging over ten independent runs. Each step equates to 15 ns.


Adaptive sampling can also greatly reduce the resource requirements for achieving a given model quality. For example, Figure 25B shows the computer time necessary to achieve a given model quality for one long simulation and adaptive sampling with a varying number of simulations per iteration. This figure shows that adaptive sampling requires about half as much computer time to achieve the same model quality as one long simulation. Once again, the relative efficiency of adaptive sampling begins to fall off beyond some optimal number of simulations per iteration.

APPLICATION TO SIMPLE MODELS. To gain an intuition for the applicability of adaptive sampling to other systems, we have also applied it to two classic network topologies, shown in Figure 26A and defined more thoroughly in Section 2.5. These models are representative of problems with metastability, their equilibrium properties can be derived analytically and used as an unambiguous reference, and truly long simulations are feasible.


Figure 26. (A) The two models, S and P. (B) Distance from the true model (measured via the relative entropy) as a function of wall-clock time for adaptive sampling versus one long simulation of S (assuming 5 steps/day to mimic 5 nanoseconds/day in protein folding simulations). The lines are one long simulation (dashed line) and adaptive sampling with 10 simulations of 20 steps (solid line), 10 simulations of 200 steps (dotted line), 100 simulations of 20 steps (dash-dot line), and 1000 simulations of 20 steps (black squares) per iteration.

Both models have states with approximately the same equilibrium and transition probabilities, such that differences between their behaviors can be attributed to differences between their topologies. More specifically, states 1-6 have equilibrium populations of 6%, 1%, 1%, 1%, 1%, and 90% respectively. Drawing an analogy to protein folding, state 1 is the unfolded state, state 6 is the folded state, and the remaining states are intermediates. Thus, S has a single folding pathway and P has parallel folding pathways.

The reduced connectivity in S results in longer timescale transitions relative to P. In fact, the mean first passage time (MFPT) between states 1 and 6 is about three times longer in S than in P, making S considerably harder to sample. In addition, such linear models are often cited as a case where the holistic, long-trajectory approach is absolutely necessary; nevertheless, adaptive sampling is able to learn the network more efficiently than traditional approaches, as shown in Figure 26B. This figure shows how closely various schemes can approach the true model for S given a set amount of wall-clock time and starting from state 1 to mimic the practice of starting protein folding simulations from an arbitrary conformation in the unfolded state.

To provide some intuition for our distance metric, Figure 27 shows the evolution of the relative entropy and the estimated free energy of each state in S during adaptive sampling. Adaptive sampling was carried out by running 10 simulations from state 1 and then repeatedly building an MSM and starting 10 new simulations from the state contributing most to uncertainty in the slowest process. Small jumps in the relative entropy are found each time a state with a low population is discovered (or, equivalently, when a new path is discovered for this model) and a very large jump is evident when the most populated state, state 6, is discovered. Slow decay occurs between these jumps. Thus, our metric is most sensitive to state and path discovery but still captures improvements in estimates of the transition probabilities along known paths. Such behavior is desirable as models that miss important states or paths should be penalized more than ones with imperfect transition probabilities.
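The per-state free energies plotted alongside the relative entropy can be related to state populations through the usual Boltzmann relation, F_i = -kT ln(p_i). The sketch below assumes that relation and a temperature of 300 K, neither of which is stated in the text for the simple models, and reports free energies relative to the most populated state; the function name is hypothetical.

```python
import math

GAS_CONSTANT_KCAL = 0.0019872041  # kcal/(mol K)


def free_energies(populations, temperature=300.0):
    """Free energy of each state in kcal/mol, relative to the lowest one.

    Populations must be strictly positive; a state that has not yet been
    discovered has no finite free energy estimate.
    """
    kT = GAS_CONSTANT_KCAL * temperature
    energies = [-kT * math.log(p) for p in populations]
    lowest = min(energies)
    return [e - lowest for e in energies]


# Equilibrium populations of states 1-6 quoted in the text.
print(free_energies([0.06, 0.01, 0.01, 0.01, 0.01, 0.90]))
```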


Figure 27. Relative entropy (top) and free energy of each state in kcal/mol (bottom) as a function of the adaptive sampling iteration on model S.

Figure 28 shows a more thorough comparison of adaptive sampling and reference simulations with an equal amount of sampling for various numbers and lengths of simulations. Evaluation of the reference simulations for both S and P demonstrates that achieving a reasonable model quality by naively starting simulations from state 1 requires simulations of some minimal length, though this minimal length is shorter for P than S in terms of the absolute number of steps. Moreover, adaptive sampling is able to gain valuable information from much shorter and fewer simulations regardless of the topology of the network; that is, whether there is a single folding pathway or multiple pathways. This figure also shows that adaptive sampling generally benefits from using more parallel simulations but not longer ones. An important point is that each data point in Figure 28B and Figure 28D depends on the data points to its left. For example, to fill in the row corresponding to simulations of length 100, ten independent adaptive sampling runs of 50 iterations were performed.


The first round of each adaptive sampling run was used to compute average relative entropies for 1-10 simulations, the first and second round of each run (which depends on the first round) for 11-20 simulations, and so forth. As a result, there is some horizontal streakiness in these figures. We also note that adaptive sampling results in smaller uncertainties in the relative entropies shown in Figure 28 (see Figure 77 and Figure 78).

Figure 28. Distance from the true model (measured via the relative entropy) as a function of the number and length of simulations averaged over 10 independent samples. (A) Reference distribution for S, (B) adaptive sampling of S, (C) reference distribution for P, and (D) adaptive sampling of P. All simulations for the reference distributions started from state 1. The first 10 simulations for adaptive sampling started from state 1 and subsequent batches of simulations started from the state contributing most to uncertainty in the slowest process. Black lines are contours of equal amounts of data.

Finally, we find that the scaling of adaptive sampling of our simple networks is similar to that found for villin, as shown in Figure 29. One noteworthy difference is that our simple models saturate (i.e. fall short of linear scaling as additional parallel simulations are run) earlier than villin. Comparison of the two simple models also shows that S saturates before P. For S, adaptive sampling scales linearly up to 150 parallel simulations. For P, adaptive sampling scales linearly up to 500 simulations. The improved scaling for P is the result of the increased complexity of the network topology of P compared to S. Each node in P has more connections to learn and the algorithm benefits from doing this in parallel. Indeed, the complexity of our villin model is much greater than either of these simple networks and, as discussed previously, villin scales linearly up to 5,000 simulations per iteration. Thus, we expect that we can achieve linear scaling well beyond 5,000 simulations per iteration for systems that are more complex than the villin MSM that we sampled from.


Figure 29. Scaling for adaptive sampling of our simple models as the number of parallel simulations (N) used during each round is varied. (A) and (B) Wall-clock time scaling as N is varied for simple models S and P respectively. The black line is a best fit to the linear portion of the data (circles). (C) and (D) Computer time required to achieve a given model quality (relative entropy) for various sampling schemes applied to S and P respectively. L refers to one long trajectory and the numbers refer to the number of parallel simulations used in each iteration of adaptive sampling. All results come from averaging over ten independent runs.


APPLICABILITY. The adaptive sampling algorithm employed here was developed for application to MSMs with metastable states. That is, it assumes that every state has a self-transition probability greater than 0.5 such that a simulation in one state is more likely to stay there than to transition to a new state. This property helps to ensure a separation of timescales (fast intrastate transitions, slow interstate transitions) and, therefore, that the model is Markovian because a simulation can lose memory of its previous state before transitioning to a new one. Thus, the procedure for ab initio adaptive sampling is: 1) run some initial simulations, 2) cluster all the simulation data into microstates, 3) lump these microstates into metastable macrostates, 4) calculate the contribution of each macrostate to uncertainties in the slowest rate (or some other observable), 5) start new simulations from each state in proportion to its contribution to the overall uncertainty, and 6) repeat steps 2-5 until the desired level of statistical certainty is achieved. In the future it will be interesting to explore whether this adaptive sampling algorithm is equally applicable to more fine grained divisions of conformational space (e.g. at the microstate level) as the lumping stage would no longer be necessary. In addition, recent work has shown that more fine grained MSMs are better for obtaining quantitative predictions of experimental observables (4, 5, 15), so it could be advantageous to do refinement at this level.
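The metastability assumption stated above is easy to test on a given model. The following sketch, with a hypothetical function name, lists the states whose self-transition probability does not exceed 0.5.

```python
def non_metastable_states(transition_matrix, threshold=0.5):
    """Indices of states whose self-transition probability is not above
    the threshold, i.e. states that violate the metastability assumption."""
    return [i for i, row in enumerate(transition_matrix)
            if row[i] <= threshold]


# Hypothetical two-state model in which state 1 is not metastable.
print(non_metastable_states([[0.9, 0.1], [0.6, 0.4]]))
```

An empty list indicates that every state satisfies the assumption at the chosen lag time.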

The relative entropy metric assumes that the two models being compared have the same state space. Comparing two simulation data sets therefore requires the following steps: 1) define a state space common to both data sets (i.e. by using both data sets for clustering to define microstates and, optionally, lumping to define macrostates), 2) compute transition probability matrices for each data set independently, and 3) compute the relative entropy between these matrices.
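Once both data sets have been assigned to a common set of states, the counting that precedes the transition probability matrices might look like the sketch below. It assumes that each trajectory is already a list of 0-indexed state labels and that transitions are counted with a sliding window at the chosen lag; the function name and the toy trajectories are hypothetical.

```python
def count_transitions(trajectories, n_states, lag=1):
    """Count observed transitions i -> j at the given lag over a shared
    state space of n_states states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for a, b in zip(traj[:-lag], traj[lag:]):
            counts[a][b] += 1
    return counts


# Two hypothetical data sets assigned to the same two states.
counts_a = count_transitions([[0, 0, 1, 1, 0]], n_states=2)
counts_b = count_transitions([[0, 1, 1, 1, 0, 0]], n_states=2)
print(counts_a, counts_b)
```

Because both matrices share the same indexing, they can be normalized row by row and compared element by element.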


CONCLUSIONS

Together, our results with villin and fundamental model systems demonstrate the tremendous value of adaptive sampling. Since model quality has been assessed with a global metric and shows strong agreement between adaptive sampling results and the true model, we can conclude that adaptive sampling to minimize uncertainties in the slowest kinetic rate improves the global quality of a model. Moreover, adaptive sampling is significantly more efficient than a single long simulation, both in terms of the wall-clock time and resources required to achieve a given model quality, up to some saturation point. In fact, adaptive sampling with N parallel simulations requires about a factor of two less computer-time and a factor of N less wall-clock time. Considering that N can easily be as large as 10,000 (or more) (79), this can be a truly dramatic advantage in wall-clock time, turning calculations normally requiring decades into routine calculations on the timescale of days. Finally, since our simulations started from just a couple of states, we can conclude that adaptive sampling is capable of discovering new model components given no prior knowledge of the system, and is thus useful for model construction in addition to model refinement.

The adaptive sampling method described here may be directly applied to learn models from simulations of metastable phenomena, leading to significant resource and time savings in fields like molecular and quantum mechanics, but is not limited to these applications. Given a means to prepare samples within a given state, it could be applied equally well to experimental techniques, such as single molecule FRET and force extension experiments. More broadly, minimizing uncertainties in a model is likely to prove valuable even when metastability is not present. Similar methods may also be useful for understanding other complex network dynamics, as in signaling pathways.


CHAPTER 7: SIMULATED TEMPERING YIELDS INSIGHT INTO THE LOW-RESOLUTION ROSETTA SCORING FUNCTIONS

This chapter was taken from: Bowman GR & Pande VS (2009) Simulated tempering yields insight into the low-resolution Rosetta scoring functions. Proteins 74:777-788.

ABSTRACT

Rosetta is a structure prediction package that has been employed successfully in numerous protein design and other applications (162). Previous reports have attributed the current limitations of the Rosetta de novo structure prediction algorithm to inadequate sampling, particularly during the low-resolution phase (150, 151, 163, 164). Here, we implement the Simulated Tempering (ST) sampling algorithm (24, 25) in Rosetta to address this issue. ST is intended to yield canonical sampling by inducing a random walk in temperature space such that broad sampling is achieved at high temperatures and detailed exploration of local free energy minima is achieved at low temperatures. ST should therefore visit basins in accordance with their free energies rather than their energies and achieve more global sampling than the localized scheme currently implemented in Rosetta. However, we find that ST does not improve structure prediction with Rosetta. To understand why, we carried out a detailed analysis of the low-resolution scoring functions and find that they do not provide a strong bias towards the native state. In addition, we find that both ST and standard Rosetta runs started from the native state are biased away from the native state. Although the low-resolution scoring functions could be improved, we propose that working entirely at full-atom resolution is now possible and may be a better option due to superior native-state discrimination at full-atom resolution. Such an approach will require more attention to the kinetics of convergence, however, as functions capable of native state discrimination are not necessarily capable of rapidly guiding non-native conformations to the native state.


INTRODUCTION

Since the discovery that a protein’s structure is determined by its sequence (46), a great deal of effort has been poured into trying to predict structure from sequence. Thus far, knowledge-based approaches have proved promising, though more purely physics-based structure prediction has potential (92). The Rosetta suite is one of the most successful approaches, and employs a combination of knowledge-based strategies and physical insight. Some of the more prominent achievements of this software package are the design of a protein with novel topology (165), the redesign of protein-protein interfaces (166), the redesign of protein-nucleic acid interfaces (167), the redesign of a folding pathway (168, 169), aid in solving the crystallographic phase problem (170), and, most recently, the design of new enzymes for reactions without known biological catalysts (171).

It has been suggested that the success of Rosetta is in large part due to its accurate scoring functions (151, 172). In the sense that many of the terms are based on energetic principles derived from physical chemistry, one can think of the scoring functions as energy functions. On the other hand, many of the terms are based on statistics from the PDB databank. Because they are based on native protein structures, which are assumed to represent the lowest free energy structures for a given sequence, these terms implicitly consider entropic contributions. Thus, the scoring functions can be thought of as free energy functions. In addition, the practice of clustering the lowest scoring structures is, in a sense, taking into account entropy by considering the relative populations of various states (173). To avoid confusion we will use the term “scoring function.” This is probably the most precise term as the scoring functions are primarily designed to discriminate native structures from non-native ones rather than to reproduce physical behavior. Furthermore, it allows us to more clearly discuss the conformational free energy under a given scoring function.

Rosetta uses a number of scoring functions in two distinct phases: low-resolution and full-atom. This “hierarchical” approach (174) was incorporated into Rosetta for CASP6 (175). The low-resolution phase assumes that the conformational search of a protein is biased by local structural preferences and that the free energy minimum is selected by nonlocal interactions (162, 176). This is captured by building the protein structure from fragments drawn from native protein crystal structures. Thus, local interactions may be assumed to be at free energy minima and a coarse-grained sampling of the nonlocal free energy landscape may be carried out (176). During this phase, sidechains are represented by single atoms called centroids, thus sacrificing atomic resolution for rapid sampling. All of the scoring functions employed in this phase are dominated by the hydrophobic effect (162, 164) and are intended to give the correct topology (162, 176). Full-atom refinement employing a single scoring function is then carried out on each low-resolution model (151). This phase is intended to give atomic resolution with correct packing (162). However, the full-atom scoring function only tends to give accurate results when the starting low-resolution model is within 3 Å of the native state, the “radius of convergence” (163, 174). Thus, the full-atom phase is highly dependent on the success of the low-resolution phase. Together, these two phases represent the belief that the native state lies at the bottom of a deep minimum at the center of a broader basin (162, 173).

A number of recent works have claimed that the main challenge preventing better structure prediction with Rosetta is sampling, particularly in the low-resolution phase (150, 151, 163, 164). They suggest that improved sampling at low-resolution would give more structures within the radius of convergence, and thus better full-atom structures.

To address this issue, we have implemented the Simulated Tempering (ST) sampling algorithm in the low-resolution phase of Rosetta. This algorithm is intended to allow rapid barrier crossing by performing a random walk in temperature space. At high temperatures broad sampling may be achieved, while at low temperature various free energy minima may be explored. ST is a serial algorithm so it is amenable to an automated distributed computing effort like Rosetta@home (151), whereas related parallel algorithms like the Replica Exchange Method (REM) are not (27, 177).
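The random walk in temperature space is usually driven by a Metropolis test on proposed temperature changes. The sketch below shows the textbook form of that test, in which a move from inverse temperature β_old to β_new at fixed energy E is accepted with probability min(1, exp[-(β_new - β_old)E + (g_new - g_old)]). It is not taken from the Rosetta implementation; the function name is hypothetical and the weights g are treated as inputs, since how they are chosen is not described in this passage.

```python
import math
import random


def accept_temperature_move(energy, beta_old, beta_new,
                            weight_old, weight_new, rng):
    """Metropolis test for changing temperature at a fixed conformation."""
    log_ratio = -(beta_new - beta_old) * energy + (weight_new - weight_old)
    if log_ratio >= 0.0:
        return True  # always accept moves that do not lower the weight
    return rng.random() < math.exp(log_ratio)


# Hypothetical values: a high-energy conformation moving to a higher
# temperature (smaller beta) with equal weights.
print(accept_temperature_move(10.0, 1.0, 0.5, 0.0, 0.0, random.Random(0)))
```

With equal weights, high-energy conformations readily move to higher temperatures and low-energy conformations to lower ones, which is the behavior the paragraph above describes.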

METHODS

OVERVIEW OF ROSETTA. The standard Rosetta de novo structure prediction protocol (RSP) is designed to predict the structure of a protein given its sequence. The algorithm begins with a fully extended chain. First, a low-resolution phase is carried out in which side chains are represented with centroids, single atoms which recapitulate the properties of the sidechain. The centroids are located at the center of mass of the sidechain obtained from averaging over all the conformations found in the PDB databank. A Monte Carlo approach is used to substitute in segments from fragment libraries provided by the user.

The fragment libraries consist of possible three- and nine-residue segments from the PDB databank that match portions of the sequence. By default, 200 three-residue and 200 nine-residue fragments are included for each overlapping segment of the protein (176). Secondary structure predictions from PSIPRED (178), JUFO (179), SAM (180), and PROF (181), are used to guide the selection of these segments (151). These fragments are chosen such that the proportion of possible helix, strand, and other configurations is equal to the average prediction of all the secondary structure prediction programs used (176). One may install the software for generating these segments locally or, as in the case of this work, use the Robetta server (http://robetta.bakerlab.org/). Three- and nine-residue sequences are used as they have the most significant correlations in local structure (182).

Only the torsion angles are modified when a fragment is inserted. Bond lengths and angles are held constant. The values for the bond lengths and angles are taken from CHARMM19 (183, 184). When generating the fragment libraries, the torsion angles are modified from those in the PDB databank to maintain consistency with these ideal bond lengths and angles (176).

Three major factors are intended to guide the algorithm to the native structure: 1) a series of scoring functions based on distributions from the PDB databank and Bayesian inference (48, 172), 2) returning to the lowest scoring structure found thus far at regular intervals, and 3) a temperature schedule called quenching that is designed to detect and escape local minima.

The possible components of each scoring function are described in detail elsewhere (176). Since each bit of local structure comes directly from native proteins, it is assumed to be at a free energy minimum. Thus, the low-resolution scoring functions focus on giving a rapid coarse-grained approximation of the free-energy landscape for nonlocal interactions and are meant to find the global topology (162). One of the major driving forces is hydrophobic burial (151, 164).

Figure 30 shows the order in which the scoring functions are used. The final low-resolution de novo structure prediction scoring function, score4, is supposed to be able to distinguish native structures. The other scoring functions are mainly variants of score4 meant to help bias the structure towards the native state as quickly as possible. Rosetta begins with one or two cycles of 2000 Monte Carlo steps with score0, which only has a van der Waals term. This scoring function serves to insert a fragment in each position of the extended chain in order to provide a more or less random starting point for the subsequent scoring functions. Next, a single cycle of 2000 steps is carried out with score1, which is meant to accumulate secondary structure (176). Rosetta then performs five repetitions of a 2000 step cycle with score2 followed by a 2000 step cycle with score5. Score2 includes terms to favor collapse and beta-strand pairing while score5 is similar but lacks these two terms to allow some relaxation. Three cycles of 4000 steps each are then carried out using score3. Score3 includes all of the possible low-resolution (or centroid) terms, except for hydrogen bonding. The first cycle of score3 uses the normal fragment insertion scheme. The remaining cycles use smoothing steps as described by Rohl et al. (176) to make small perturbations that relax the structure. Finally, score4, which does not have any compaction or beta-strand pairing terms, is used to rank the lowest scoring structure seen so far.

Figure 30. Flow chart showing the order the scoring functions are used in and giving brief descriptions of each. After score5, Rosetta returns to score2 five times before progressing to score3. The first six scoring functions constitute the low-resolution de novo structure prediction phase.
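The cycle order described above can be written down as a small table. The following is an illustrative Python sketch, not Rosetta code: the names SCHEDULE and total_steps are hypothetical, and a single score0 cycle is assumed (the protocol allows one or two). Score4 is omitted from the table because it is only used to rank the lowest scoring structure.

```python
# Illustrative only: the low-resolution cycle order described in the text.
# Each entry is (scoring function, Monte Carlo steps in that cycle).
# Assumption: one score0 cycle (the text allows one or two).
SCHEDULE = (
    [("score0", 2000)]
    + [("score1", 2000)]
    + [("score2", 2000), ("score5", 2000)] * 5  # five score2/score5 repetitions
    + [("score3", 4000)] * 3                    # three score3 cycles
)


def total_steps(schedule):
    """Sum the Monte Carlo steps over all cycles in the schedule."""
    return sum(steps for _, steps in schedule)


if __name__ == "__main__":
    for name, steps in SCHEDULE:
        print(f"{name}: {steps} steps")
    print("total steps:", total_steps(SCHEDULE))
```

Under these assumptions a default-length run comprises 15 cycles; the "10 times the default length" runs used later in this chapter would scale each cycle accordingly.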

Beginning with score2 Rosetta returns to the lowest scoring structure seen so far at the end of each cycle (approximately every 2000 steps). The temperature is implicitly in units of kT. By default the temperature is initially set to 2 kT and is updated using a quenching scheme. If 150 steps are performed without any being accepted then it is assumed that a local minimum has been reached and the temperature is increased by 1 kT, thus increasing the probability of accepting subsequent moves (176). As soon as a step is accepted, the temperature is quenched: that is, the temperature is immediately reset to 2 kT.
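A minimal sketch of this quenching schedule inside a Metropolis Monte Carlo loop is given below. It is not the Rosetta implementation: score and propose are hypothetical user-supplied callables, the rejection counter is assumed to restart after each temperature increase, and the periodic return to the lowest scoring structure is left out for brevity.

```python
import math
import random


def quench_run(score, propose, x0, n_steps, t_base=2.0, t_step=1.0,
               patience=150, rng=None):
    """Metropolis Monte Carlo with a quenching temperature schedule.

    score(x) -> float and propose(x, rng) -> new state are hypothetical
    callables. Temperatures are in units of kT. Returns the final state,
    its score, and the highest temperature reached.
    """
    rng = rng or random.Random(0)
    x, s = x0, score(x0)
    temp, rejected_in_a_row, max_temp = t_base, 0, t_base
    for _ in range(n_steps):
        y = propose(x, rng)
        s_new = score(y)
        # Metropolis criterion at the current temperature.
        if s_new <= s or rng.random() < math.exp(-(s_new - s) / temp):
            x, s = y, s_new
            temp, rejected_in_a_row = t_base, 0  # quench on acceptance
        else:
            rejected_in_a_row += 1
            if rejected_in_a_row >= patience:
                # Assumed to be stuck in a local minimum: heat by t_step.
                temp += t_step
                max_temp = max(max_temp, temp)
                rejected_in_a_row = 0  # assumption: the counter restarts
    return x, s, max_temp
```

For example, calling quench_run with a one-dimensional toy score such as lambda x: x * x and a proposal that adds a uniform random displacement exercises both the heating and quenching branches.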

At present, it is standard practice to perform full-atom refinement on each of the low-resolution models (151). An example command line for generating a low-resolution model of protein G and then refining it is as follows:

rosetta.gcc64 aa 1igd A -verbose -silent
-increase_cycles 10 -new_centroid_packing
-abrelax -output_chi_silent -stringent_relax
-vary_omega -omega_weight 0.5 -farlx
-ex1 -ex2 -termini -short_range_hb_weight 0.50
-long_range_hb_weight 1.0 -no_filters
-rg_reweight 0.5 -rsd_wt_helix 0.5
-rsd_wt_loop 0.5 -output_all -accept_all
-do_farlx_checkpointing -relax_score_filter
-record_irms_before_relax -acceptance_rate 1.0
-filter1a 10000 -filter1b 10000 -nstruct 1
-constant_seed -jran 1918492

The purpose of refinement is to get atomic level accuracy with correct packing of sidechains (162). Exploring the full-atom free energy landscape is considerably more expensive than exploring the low-resolution one because of the atomic resolution and the inclusion of local interaction terms. To minimize the computational expense, it is assumed that the low-resolution starting model has the correct topology and only conservative backbone moves are made (176). It is hoped that these conservative moves will also help to get adequate exploration in the context of a compact chain where large moves are likely to cause clashes. The backbone moves include small random perturbations to single torsion angles, alterations of a series of torsion angles such that the global structure is preserved, and gradient descent. The torsion angle potential is based on distributions from the PDB databank (151). Correct packing is achieved by rotamer optimization (162, 163, 185). Solvation effects are captured by employing the EEF1 implicit solvent (186). Even with the assumption that the starting model has the correct topology, refining every model is still very expensive and is only made possible through the use of distributed computing on Rosetta@home (151).

An important addition to the full-atom scoring function is a direction-dependent hydrogen bonding term (187). This potential term is based on distributions from the PDB and has been shown to provide better native state discrimination than Coulomb-based hydrogen bonding terms like those found in standard molecular dynamics packages (187) and to agree with quantum calculations (188). Both backbone-backbone and sidechain hydrogen bonds are included but the backbone-backbone terms have been found to provide the best native state discrimination (187). Hydrogen bonds are short range interactions, so it is not surprising that this potential gives the best discrimination for decoys within 1–3 Å of the native state.

A similar hydrogen bonding term is also included in a brief low-resolution relaxation performed before beginning the full-atom refinement. This low-resolution scoring function is called score6. Relaxation in this scoring function uses a conservative move set similar to that in full-atom relaxation.

The final prediction is made by clustering the 100 to 1000 lowest scoring full-atom models by Cα RMSD (151) and selecting those with the greatest number of neighbors within a cutoff that depends on the size of the protein (173).
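The neighbor-counting step of this selection can be sketched as follows. This is a simplified illustration rather than the Rosetta clustering code: it assumes a precomputed symmetric matrix of pairwise Cα RMSD values, and the function name and the cutoff passed to it are hypothetical (the actual cutoff depends on protein size).

```python
def most_populated_center(rmsd, cutoff):
    """Return (index, neighbor_count) of the model with the most neighbors.

    rmsd is a symmetric square matrix (list of lists) of pairwise RMSDs in
    Angstroms. A neighbor is any other model within the cutoff. Ties are
    resolved in favor of the lowest index.
    """
    best_index, best_count = -1, -1
    for i, row in enumerate(rmsd):
        count = sum(1 for j, d in enumerate(row) if j != i and d <= cutoff)
        if count > best_count:
            best_index, best_count = i, count
    return best_index, best_count
```

Repeating the selection after removing the chosen center and its neighbors would yield further cluster centers, which is one simple way to obtain a ranked list of predictions.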

MODIFICATIONS TO ROSETTA

SIMULATED TEMPERING In this work, the Simulated Tempering (ST) sampling algorithm (24, 25) was implemented in place of the default quenching temperature schedule. ST allows the system to perform a random walk in temperature space with canonical sampling at each temperature. At high temperatures the free energy landscape is flattened, allowing broad sampling of conformation space. At low temperatures, barriers are present and tend to confine the system to exploring a single free energy minimum. By performing a random walk in temperature space a single run is able to explore multiple minima, thus speeding convergence. For a detailed derivation of ST refer to Huang et al. (177) and the original works (24, 25).

ST requires an initial temperature, a list of possible temperatures, and a list of weights for each temperature as inputs. For this work the possible temperatures are 0.1, 0.25, 0.5, 0.75, 1, 2, 3, 5, 10, and 20 kT. At regular intervals an attempt is made to change temperatures. For each attempt, the algorithm randomly decides to go either up or down in temperature. The probability of accepting the attempt is

$$P(i \to j) = \min\left\{1,\ e^{-(\beta_j - \beta_i)\,U(X) + (g_j - g_i)}\right\} \tag{7.1}$$

where $P(i \to j)$ is the probability of transitioning from temperature $i$ to temperature $j$, $\beta_i = 1/(k_B T_i)$, $U(X)$ is the potential energy of conformation $X$ (or in this case the score), and $g_i$ is the weight of temperature $i$. Assuming the weights are properly selected, this probabilistic temperature changing ensures that the detailed balance condition for equilibrium is satisfied. That is,

$$P_i(X)\,P(i \to j) = P_j(X)\,P(j \to i) \tag{7.2}$$

where $P_i(X)$ is the probability of conformation $X$ at temperature $i$ and $P(i \to j)$ is the probability of transitioning from temperature $i$ to temperature $j$. In addition, it can also be shown that for a correct set of weights

$$P(i \to j) = P(j \to i) \tag{7.3}$$

where $P(i \to j)$ is again the probability of transitioning from temperature $i$ to temperature $j$. Furthermore, ensuring that Eq. (7.3) is satisfied is sufficient to yield correct weights (189).
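The temperature move of Eq. (7.1) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code added to Rosetta: temperatures are expressed in kT so that beta = 1/T, the function names are hypothetical, and attempts to step off either end of the temperature list are assumed to be rejected.

```python
import math
import random

# Temperature list from the text, in units of kT.
TEMPS_KT = [0.1, 0.25, 0.5, 0.75, 1, 2, 3, 5, 10, 20]


def acceptance_probability(u, i, j, temps, weights):
    """Eq. (7.1): min{1, exp(-(beta_j - beta_i) * U + (g_j - g_i))}."""
    beta_i, beta_j = 1.0 / temps[i], 1.0 / temps[j]
    exponent = -(beta_j - beta_i) * u + (weights[j] - weights[i])
    # Return 1 directly for non-negative exponents to avoid overflow in exp.
    return 1.0 if exponent >= 0 else math.exp(exponent)


def attempt_temperature_move(u, i, temps, weights, rng):
    """Randomly try the neighboring temperature above or below index i."""
    j = i + rng.choice((-1, 1))
    if j < 0 or j >= len(temps):
        return i  # assumption: moves off the end of the list are rejected
    if rng.random() < acceptance_probability(u, i, j, temps, weights):
        return j
    return i
```

With all weights set to zero and a positive score, moves to higher temperature are always accepted while moves to lower temperature are accepted only with probability less than one, which is why nonzero weights are needed to obtain even sampling across temperatures.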

From Eq. (7.1) it is evident that the probability of making a temperature change is controlled by the energy distribution (or in this case score distribution), the temperature spacing, and the difference between the weights of a pair of neighboring temperatures. To choose the temperature list and an initial set of weights, constant temperature runs were carried out at a variety of temperatures ranging from 0.1 to 20 kT. Temperatures were selected such that weights could be found yielding $P(i \to j) = P(j \to i) \approx 0.5$. Twenty iterations of 100 runs at 10 times the default length were then carried out, updating the weights after each iteration to satisfy Eq. (7.3). This protocol yielded converged weights for all of the systems studied and is thus a plausible candidate for a fully automated system compatible with a distributed computing environment like Rosetta@home.
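The text does not spell out how an initial set of weights is obtained from the constant temperature runs. One common choice, assumed here purely for illustration, is a trapezoid-rule estimate based on the mean score at each temperature. It follows from the fact that even sampling of temperatures requires $g_i = -\ln Z_i$ up to an additive constant, together with $d(-\ln Z)/d\beta = \langle U \rangle$, giving $g_j - g_i \approx (\beta_j - \beta_i)(\langle U \rangle_i + \langle U \rangle_j)/2$ for neighboring temperatures. The function name below is hypothetical.

```python
def estimate_weights(temps, mean_scores):
    """Trapezoid-rule estimate of ST weights from mean scores.

    temps are in units of kT, so beta = 1 / temp. mean_scores[k] is the
    average score observed in a constant temperature run at temps[k].
    The first weight is fixed at zero because only weight differences
    enter the acceptance probability.
    """
    betas = [1.0 / t for t in temps]
    weights = [0.0]
    for k in range(1, len(temps)):
        delta = (betas[k] - betas[k - 1]) * (mean_scores[k] + mean_scores[k - 1]) / 2.0
        weights.append(weights[-1] + delta)
    return weights
```

Such an estimate would serve only as a starting point; the iterative refinement described above is what brings the forward and reverse transition probabilities into agreement.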

OTHER OPTIONS The user may change the frequency of temperature change attempts, the frequency of outputting structures throughout a given run, and whether the program recovers the lowest scoring structure seen thus far between cycles. Another option allows the exploration of the final scoring function alone, thus removing any bias from the other scoring functions or the regular returns to the lowest scoring structure.

STRUCTURE PREDICTION PROTOCOLS Structure prediction is carried out employing the same procedure used by the Baker lab in CASP7 (151), though full-atom refinement was only used in a subset of cases. For each structure 10,000 independent runs are carried out using 10 times the default number of steps for each cycle. Each low-resolution run took about 2 min on an Intel E5345 quad-core 2.33 GHz processor. Full-atom refinement of a model took an additional 4 min. The lowest scoring structure from each run is stored. All of these structures are clustered by RMSD and the top five cluster centers are selected as the best predictions.

A similar procedure was used for comparing the various scoring functions. In this case, 1000 independent runs each with 100 times the default number of steps were performed to ensure adequate exploration of the whole space. Each RSP run using the full sequence of scoring functions took about 15 min on an Intel E5345 quad-core 2.33 GHz processor while using just score4 took 30 min. The ST variant took 30 min and 1 hr when using all the scoring functions and just score4 respectively. For ST, an independent set of weights is used for each scoring function to ensure canonical sampling of each of them.

To characterize the native state, the crystal structure was idealized and relaxed 50 times. Idealization consists of setting the bond lengths and angles to ideal values. Relaxation is carried out to compensate for any deleterious effects resulting from the idealization process. Relaxation may be carried out in either the low-resolution score6 space or the full-atom scoring function using the conservative move set described in the “Overview of Rosetta” section. The native structure used as the starting point for many of the runs described below was the one with the lowest RMSD out of the 50 idealization/relaxation runs carried out in the low-resolution score6 space. For protein G, this structure has an RMSD of 0.88 Å.

Projections of the free energy landscapes from ST runs were generated using the Multistate Bennett Acceptance Ratio (MBAR) estimator (190), a variant of the Weighted Histogram Analysis Method (WHAM) (191), to make use of data from all of the temperatures. Projections from RSP were generated using all the data. These plots are analogous to free energy landscapes but we note that RSP does not guarantee a canonical distribution.
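To make the idea of a projection concrete, the sketch below converts samples from a single temperature into a free energy surface over RMSD and score using $F = -kT \ln p$. It is deliberately simpler than the MBAR analysis used in this work, which reweights data from all temperatures; the function name and bin widths are hypothetical.

```python
import math
from collections import Counter


def free_energy_projection(samples, rmsd_bin=1.0, score_bin=5.0, kt=1.0):
    """Map (rmsd_bin_index, score_bin_index) -> free energy in units of kT.

    samples is an iterable of (rmsd, score) pairs drawn at one temperature.
    Unvisited bins are simply absent from the result, and the most populated
    bin is shifted to zero.
    """
    counts = Counter(
        (math.floor(r / rmsd_bin), math.floor(s / score_bin)) for r, s in samples
    )
    total = sum(counts.values())
    free = {b: -kt * math.log(n / total) for b, n in counts.items()}
    lowest = min(free.values())
    return {b: f - lowest for b, f in free.items()}
```

Because the counts are unweighted, this estimate is only meaningful when the samples follow a canonical distribution at the chosen temperature, which, as noted above, RSP does not guarantee.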

RESULTS

COMPARISON OF ST AND STANDARD ROSETTA The structure prediction protocol was carried out for four systems of varying size: an SH3 domain (58 residues, PDB code 1shf (192)), protein G (61 residues, PDB code 1igd (193)), ubiquitin (76 residues, PDB code 1d3z (194)), and a zinc finger (136 residues, PDB code 2j6a (195)). All are X-ray structures with a resolution better than 2 Å with the exception of ubiquitin, which is the sole NMR structure evaluated in this work.

SH3 DOMAIN In a previous work Bradley et al. found poor results for the SH3 domain studied here (150). They attributed these poor results to inadequate sampling, so this target seemed like a good test for enhanced sampling. The ST results proved to be qualitatively similar to those from RSP, as shown in Figure 31. Both algorithms identified a pronounced score minimum at high RMSD, which may explain the poor results from the previous study. Despite this similarity, ST did yield slightly broader sampling of the low score space. Figure 31 also shows black plus-signs corresponding to the idealized native structure relaxed with the score6 low-resolution scoring function, which includes a direction-dependent hydrogen bonding term, and then scored with the final scoring function, score4. These data indicate that the native structure is stable in score6 but is not recognized as native by the final scoring function.

Figure 31. Score versus RMSD (Å) for an SH3 domain (PDB code 1shf). Each diamond represents the lowest scoring structure for a single run. Data for ST is shown in blue while data for standard Rosetta is shown in red. The black “+” symbols represent models obtained by idealizing and relaxing the crystal structure in low-resolution mode.

PROTEIN G Protein G was chosen as a small and tractable target. Figure 32(A) shows the results of low-resolution de novo structure prediction while panel (B) shows the results for full-atom refinement of the low-resolution models. Both RSP and the ST variant perform well on this system, finding low scoring structures with RMSD values as low as 1.5 Å. The lowest scoring de novo structures had RMSD values of about 2 Å but there is a clear correlation between low score and low RMSD. The ST and RSP results were qualitatively identical for both low-resolution runs and full-atom refinement. Full-atom refinement does not appear to greatly change the RMSD. On average the RMSD changed by -0.03 Å with a standard deviation of 0.3 Å between the low-resolution and full-atom phases. The black plus-signs in Figure 32(A) demonstrate that the native structure is stable in score6 but not recognized by score4, as was the case for 1shf. Figure 32(B), however, shows that the full-atom scoring function assigns low scores to the idealized native structure when relaxed in the full-atom scoring function (yellow circles). Furthermore, the low-resolution relaxed native structures that were not recognized by score4 are assigned low scores after relaxation with the full-atom scoring function (black asterisks). In fact, these structures are closer to the full-atom idealized/relaxed structures than any of the de novo structures.

Figure 32. Score versus RMSD (Å) for protein G (PDB code 1igd). Each diamond represents the lowest scoring structure for a single run. Data for ST is shown in blue while data for standard Rosetta is shown in red. Panel (A) shows results from the low-resolution phase. The black “+” symbols represent models obtained by idealizing and relaxing the crystal structure in low-resolution mode. Panel (B) shows results from the full-atom phase. The yellow circles represent models obtained by idealizing and relaxing the crystal structure in full-atom mode. The black “*” symbols are full-atom models obtained by relaxing the low-resolution structures depicted by “+” symbols in (A) using the full-atom scoring functions.

UBIQUITIN Ubiquitin was selected due to its larger size and to evaluate the accuracy on NMR structures. Both Rosetta variants gave equivalently good results on this system (data not shown). As with protein G, there was a general correlation between score and RMSD, and structures with RMSD values as low as 2.5 Å were reached. Once again, the idealized and relaxed native structure was not assigned a low score by score4.

ZINC FINGER Finally, the 136 residue zinc finger was a CASP7 target selected to push the limits of the algorithms. Rosetta tends to perform well on proteins with fewer than 100 residues (151). If ST is indeed giving enhanced sampling then one would expect it to outperform RSP on larger systems. However, once again both Rosetta variants gave equivalent and poor results, with RMSD values no less than 10 Å, and the idealized and relaxed native structures were not recognized by score4 (data not shown).

The close agreement between the ST results and those of RSP in all cases indicates that ST is not giving enhanced sampling. The most probable explanations are that ST is not capable of escaping free energy minima in the Rosetta scoring functions or that RSP is correctly identifying all the accessible minima of the scoring functions. To explore these alternatives, a more extensive analysis of the protein G results is conducted, as this small system allows many trials to be run. Results presented by Alena Shmygelska and Michael Levitt at our Structural Biology Retreat on Nov. 15, 2007 showed that both temperature replica exchange and Hamiltonian replica exchange do sample significantly better than the original Rosetta Monte Carlo method.

VALIDATION OF ST Figure 33 shows an example of the evolution of the weights throughout the weight determination protocol, demonstrating that it yields converged weights. The difference in weights is plotted as it is this difference, and not the absolute value of the weights, that determines the acceptance probability. The weight differences at high temperatures converge very quickly, consistent with a more or less flat free energy surface allowing broad sampling. The weight differences at low temperatures converge more slowly, consistent with a more rugged landscape and restricted sampling. Independent runs of the weight determination procedure also produced more or less equivalent results (data not shown). The convergence of the weights is good evidence that the algorithm is working properly. These converged weights yield equal sampling of temperature space and multiple visits to both high and low temperatures in each run.

Figure 33. Evolution of the score4 weights for protein G. The dashed line is the difference between the weights of the highest two temperatures: 10 and 20 kT. The solid line is the difference between the weights of the lowest two temperatures: 0.1 and 0.25 kT. The first points come from constant temperature runs and subsequent points represent each iteration of refining the weights. $\Delta g = g_j - g_i$, where $j > i$.

Figure 34(D)–(F) show projections of the free energy surface onto score versus RMSD for protein G runs started from the native structure at a range of temperatures (0.1, 2, and 20 kT, respectively). Figure 34(D) shows that at low temperature, the system spends considerably more time at low score and low RMSD. The higher scores and RMSDs at high temperature show that high enough temperatures are being reached for the system to escape local minima and achieve broad sampling.

Figure 34. Projections of the free energy landscape onto score versus RMSD (Å) for protein G in score4 using: (A) standard Rosetta runs starting from an extended chain, (B) standard Rosetta runs starting from the native state, (C) ST runs at 0.1 kT starting from an extended chain, (D) ST runs at 0.1 kT starting from the native state, (E) ST runs at 2 kT starting from the native state, (F) ST runs at 20 kT starting from the native state. Each white plus-sign corresponds to the lowest scoring structure for a single run. The lowest scoring structures from each run were sorted by RMSD and only every twentieth point is shown so as to give the entire range without obscuring the underlying plot.

THE FINAL SCORING FUNCTION In theory, the sequence of scoring functions employed by Rosetta was chosen to bias the system to low score and low RMSD as quickly as possible. The final low-resolution scoring function (score4) is only applied to the lowest scoring structure found while exploring the previous scoring function (score3). Thus, it is difficult to judge whether or not the system is truly being biased towards the global free energy minimum of score4. To test this, both ST and RSP were applied to the final scoring function in isolation.

Figure 34 shows that this analysis yields qualitatively similar results for both ST and RSP, particularly at low score and low RMSD. The agreement between the results generated with the same algorithm but different starting states demonstrates that the landscapes are converged and, therefore, represent the entire accessible space. The agreement between the ST and RSP results suggests that both algorithms are identifying the free energy minimum. Furthermore, the ST results show that running at lower temperatures does not significantly shift the free energy minimum towards the native state. Limiting the temperature range used by ST to 0.1–3 kT to more closely parallel the temperatures explored by RSP gave similar results (data not shown).

A rough time course of the evolution of the RSP landscapes was found by plotting projections of the free energy landscape for the first third of each run, the second third, and the final third. All three plots were identical to Figure 34(A),(B) (data not shown), further supporting the conclusion that the landscape is converged and indicating that the algorithm is capable of crossing any barriers present on short timescales. Analyzing the temperature throughout the RSP runs shows that 90% of the runs increased their temperature to 3 kT at some point but that only about 10% increased their temperature to 4 kT and none reached temperatures greater than 6 kT. Moreover, less than 10% of the time was spent at temperatures greater than 2 kT.

The global minimum in this projection covers a range of about 5–20 Å. The differences present between the RSP and ST results are due to the greater temperature range and approximately equal sampling at each temperature in ST. Visits to lower RMSD values rapidly become less likely below about 5 Å. Conformational clustering was carried out to confirm that the projection onto score versus RMSD is not hiding a highly populated minimum closer to the native state. The centers of the top 10 most populated clusters were found to fall within the minimum of the landscape, confirming the validity of this projection. Thus, the final scoring function provides only a small bias towards the native state and has only small barriers.

Figure 34 also includes white plus-signs for the lowest scoring structure visited for each run. Once again, there is strong agreement between RSP and ST started from both native and extended structures. These points fall in a range of about 2.5–15 Å. The spread is slightly larger for ST runs, which is to be expected given that more time is spent at higher temperatures.

The large variation of the score with RMSD is also of interest. It seems that while low scores are correlated with low RMSD values, high scores are not necessarily indicative of high RMSD values.

THE OTHER SCORING FUNCTIONS Given this analysis of the final scoring function, it is interesting to ask where the bias towards the native state seen in the RSP results comes from. Two possible answers are: (1) the sequence of other scoring functions and (2) the frequent returns to the lowest scoring structure found so far. To explore these possibilities 1000 runs at 100 times the default length were carried out with both ST and RSP with and without the regular returns to the lowest scoring structure.

Figure 35 shows projections of the free energy landscape onto score versus RMSD for each of the Rosetta scoring functions. The first two columns show general agreement between RSP runs with and without returns to the lowest scoring structure. Strong agreement between ST runs with and without returns was also found (data not shown). General agreement is found between the RSP and ST results as well. Though ST has a stronger bias towards lower scoring structures due to the lower temperatures reached, there is no apparent bias towards lower RMSD structures. Figure 35 also includes white plus-signs corresponding to the lowest scoring structure for a single run in the given scoring function. Although these points are slightly more localized to low RMSD when the low scoring structure is regularly recovered, lower RMSD values are not reached. Thus, it would seem any bias comes from the sequence of scoring functions employed, though returning to the lowest scoring structure between cycles may speed up the process slightly.

Figure 35. Projections of the free energy landscape onto score versus RMSD (Å) for protein G. Each white plus-sign corresponds to the lowest scoring structure for a single run. The lowest scoring structures from each run were sorted by RMSD and only every twentieth point is shown so as to give the entire range without obscuring the underlying plot. (A), (D), (G), and (J) show data from standard Rosetta runs with frequent recovery of the lowest scoring structure in score1, score2, score5, and score3, respectively. (B), (E), (H), and (K) show data from standard Rosetta runs without frequent recovery of the lowest scoring structure in score1, score2, score5, and score3, respectively. (C), (F), (I), and (L) show data from ST runs at 0.1 kT without frequent recovery of the lowest scoring structure in score1, score2, score5, and score3, respectively.

In fact, Figure 35 indicates that this is so. Score1 shows the broadest sampling and the least bias towards low score and RMSD. Score2, on the other hand, has a global free energy minimum at about 5 Å. Score5 once again allows slightly broader sampling. Relatively broad sampling is achieved by score3 as well. However, this scoring function reaches the lowest RMSD values and is the only one to have a global minimum extending well below 5 Å. Based on these results it appears that score2 and score3 provide the greatest biasing force towards the native state. The main distinguishing feature of these scoring functions is the inclusion of compaction and beta-strand pairing terms.

DISCUSSION

In previous works, many of the limitations of Rosetta have been attributed to inadequate sampling, particularly at low-resolution (150, 151, 163, 164). However, the failure of the Simulated Tempering (ST) sampling algorithm to give any improvement on a range of targets indicates that this may not be the case. The fact that ST does not yield any improvement on a larger target where any increase in sampling would be most beneficial is particularly noteworthy. Plots of the free energy landscape at varying temperatures demonstrate that ST is working and that these results are not just due to a failure to reach sufficiently high temperatures.

The results presented in this work indicate that the low-resolution scoring functions are the main limitations to Rosetta’s performance, and not the sampling methods employed. This conclusion is supported by the fact that the final low-resolution scoring function (score4) fails to recognize the native structure for any of the targets examined. Furthermore, the free energy landscape for the score4 function has a broad minimum ranging from 5–20 Å RMSD. This landscape is judged to be converged based on agreement between data generated from two different starting conformations and a rough time course of the landscape. If the low-resolution scoring functions were capable of recognizing the native state but it were just very difficult to get there, one might expect ST runs starting from the native state to give worse results than standard Rosetta runs started from the native state because increasing the temperature would cause unfolding without subsequent refolding. The fact that standard Rosetta runs in score4 started from the native state are no more likely to identify near-native states highlights the bias of the low-resolution scoring functions away from the native state. Two of the other low-resolution scoring functions are found to have minima closer to the native state, presumably due to the inclusion of compaction and beta-strand pairing terms. However, this minimum at around 5 Å is still insufficient to satisfy the 3 Å radius of convergence required by full-atom refinement (163, 174). In principle, lower temperatures may promote more sampling at lower scores but the ST results at 0.1 kT show that in practice this does not provide much, if any, improvement over the standard 2 kT quenching runs. Finally, plotting the free energy landscape of structures throughout numerous runs rather than just a scatter plot of the lowest scoring structure from each run demonstrates that there is only weak correlation between low score and low RMSD. Although low scores may be indicative of low RMSD values, high scores are not necessarily correlated with high RMSD values. Together, these results indicate that the low-resolution scoring functions do indeed allow rapid and complete exploration of a coarse-grained landscape as intended but this landscape does not have the desired near-native free energy minimum.

The conclusion that the low-resolution scoring functions are the main limiting factor in Rosetta is also supported by a number of recent works. For example, Misura et al. note that generating more low-resolution models and then refining them did not improve the accuracy of de novo structure prediction while performing more independent refinement runs of each low-resolution model did (163). This observation seems to indicate that sampling is a problem at high-resolution but not low-resolution. Furthermore, the presence of false minima in the low-resolution scoring function has been acknowledged (150, 151).

One approach to addressing these issues would be to improve the low-resolution scoring function. This could be accomplished through improving the thermodynamics, the kinetics, or both. On the thermodynamic front, the ideal case would be to have a scoring function free energy minimum near the native state. This would ensure that each run would be likely to find a near-native state and preferably spend a significant amount of time sampling near-native regions. One way to achieve this goal would be to make native and near-native states have even lower scores to compensate for the apparent entropic benefits of higher RMSD structures.

An alternative approach would be to think more about kinetics. At present most runs seem to find a lower scoring structure than those in the free energy minimum but do not appear to spend a great deal of time there. Assigning even lower scores to these structures could bias the sampling towards these regions. However, it may be the case that it is just too difficult to get to these structures. To illustrate, one can imagine that a map showing the locations of cities but no roads would give an accurate indication of the distance between two points but no indication of the fastest route between them. Likewise, assigning a structure a low score may accurately identify it as a native-like structure but not solve the problem of getting to that structure from an extended conformation.

One final way of improving the low-resolution scoring functions would be to improve the move set. In one recent work, it is acknowledged that the “single-fragment insertion approach makes many global conformers dynamically inaccessible” (176). Incorporating more conservative moves could improve results. However, making smaller moves would slow down exploration of the space, defeating the purpose of having a low-resolution phase in the first place.


Alternatively, one could forgo the low-resolution phase altogether in favor of increased sampling at full-atom resolution. The hierarchical approach was developed in recognition of the fact that using more detail from the beginning was too expensive but that low-resolution models in isolation make prohibitive simplifications (174). However, creating an adequate and generalizable low-resolution model may not be worth the cost and effort. If the low-resolution scoring functions are not accurate enough then they will tend to bias the structure away from the native state. At present, the low-resolution phase gives structures in the range of 3–6 Å starting from an extended chain (176) and this work shows that starting from the native state gives equivalent results.

Furthermore, the full-atom refinement carried out in this work gave negligible changes in RMSD, though other works have claimed 1.5–4 Å changes (163). In either case, the low-resolution phase is unlikely to give many structures close enough to the native backbone structure. And, without the correct backbone structure it is nearly impossible to get the correct packing (162). In addition, the problems found in the low-resolution phase are compounded by inaccurate secondary structure prediction (151). This dependency could be removed by working solely at full-atom resolution.

The sampling required for such an endeavor is daunting, but recent distributed computing efforts such as Folding@home (79) and Rosetta@home (151) may make it feasible. Furthermore, the fact that the Rosetta full-atom scoring function is starting to yield improvements in homology modeling shows that its accuracy is promising (151). Finally, some recent work in purely full-atom structure prediction without the sampling power of a distributed computing platform shows that this approach may be viable (196, 197). One can even imagine a new hierarchical approach in which a large move set full-atom phase is followed by a second phase with more conservative moves. Of course, developing such a method would still require careful attention to the difference between native state discrimination and a function capable of guiding a non-native state to the native state.


CONCLUSIONS

We have implemented the Simulated Tempering (ST) sampling algorithm in Rosetta to test whether improved sampling in the low-resolution phase can improve Rosetta structure prediction. The low-resolution Rosetta scoring functions are shown to be adequately sampled by both standard Rosetta and a Simulated Tempering variant. Agreement between data generated from an extended and a native conformation supports the conclusion that the entire space is being sampled. Similar agreement between the results from both algorithms indicates that the scoring functions do not have near-native free energy minima. Thus, the low-resolution scoring functions, and not sampling at low-resolution, are the main limitation to accurate Rosetta structure prediction.

Structure prediction with Rosetta may be improved by correcting the low-resolution scoring functions. However, given current computational resources, such as Folding@home and Rosetta@home, it may be time to work at full-atom resolution from the beginning. Such an endeavor would require careful consideration of kinetics. Although functions designed for native-state discrimination may be able to correctly distinguish between native and nonnative conformations, that does not necessarily indicate that they are well suited for guiding nonnative conformations to the native state. Even if one does not care about predicting physical kinetics (e.g. the rate of folding in simulation compared with experiment), rapid kinetics of reaching the native state is crucial for the convergence of the simulation results and the general efficiency of the method.


CHAPTER 8: THE ROLES OF ENTROPY AND KINETICS IN STRUCTURE PREDICTION

This chapter was taken from: Bowman GR & Pande VS (2009) The roles of entropy and kinetics in structure prediction. PLoS One 4:e5840.

ABSTRACT

Here we continue our efforts to use methods developed in the folding mechanism community to both better understand and improve structure prediction. Our previous work demonstrated that Rosetta’s coarse-grained potentials may actually impede accurate structure prediction at full-atom resolution. Based on this work we postulated that it may be time to work completely at full-atom resolution but that doing so may require more careful attention to the kinetics of convergence. To explore the possibility of working entirely at full-atom resolution, we apply enhanced sampling algorithms and the free energy theory developed in the folding mechanism community to full-atom protein structure prediction with the prominent Rosetta package. We find that Rosetta’s full-atom scoring function is indeed able to recognize diverse protein native states and that there is a strong correlation between score and Cα RMSD to the native state. However, we also show that there is a huge entropic barrier to folding under this potential and the kinetics of folding are extremely slow. We then exploit this new understanding to suggest ways to improve structure prediction. Based on this work we hypothesize that structure prediction may be improved by taking a more physical approach, i.e. considering the nature of the model thermodynamics and kinetics which result from structure prediction simulations.

INTRODUCTION

In 1961 Anfinsen demonstrated that the native state of a protein is encoded in its amino acid sequence and hypothesized that the native state is the lowest free energy state (46). Since then, many researchers have dedicated their careers to understanding the driving forces underlying protein folding in order to 1) predict the native states of proteins from their amino acid sequences and 2) understand the mechanisms and pathways by which proteins fold. Collectively, these components constitute the protein folding problem (53, 70).

The protein structure prediction community has generally focused on finding a protein’s native state based on its sequence. A typical approach is to develop a knowledge-based scoring function to discriminate native structures from non-native ones and to sample this potential in search of the global minimum (198). For example, the Rosetta structure prediction package uses a Monte Carlo (MC) scheme to sample a series of scoring functions with increasing levels of chemical detail in order to identify protein native states (48, 49, 150). In Rosetta and many other structure prediction schemes, the problem of finding the free energy minimum is simplified by focusing on the energetic (or score) term (199). We note that Rosetta includes a simple implicit solvent and some implicit accounting for entropy by using information from known structures but stress that it does not explicitly account for conformational entropy. This simplification is justified by arguing that the conformational entropy of the native state is negligible and, therefore, the energetic term must be the dominant factor favoring the native state and the energy minimum should be equivalent to the free energy minimum. This approach has proved remarkably successful and has resulted in the design of a protein with a novel fold (165), accurate high-resolution structure predictions for small globular proteins (151), and the design of novel enzymes (171). However, ignoring conformational entropy will have increasingly deleterious effects on the landscape as one moves away from the native state and this may ultimately prevent accurate structure prediction for more complex systems.


In contrast, researchers studying folding mechanisms have placed less emphasis on predicting native states and focused on understanding how proteins fold. This work is also based on potentials, or force fields. However, these potentials have been designed to reproduce our physical reality rather than to simply discriminate native and non-native protein structures. Furthermore, much emphasis has been placed on understanding the entire free energy landscape and the kinetics of traversing this landscape (53). To accomplish these objectives numerous advanced sampling algorithms have been developed (21), as well as methods to visualize free energy landscapes (52) and determine whether or not they represent the true equilibrium distribution of the system under the given potential (177).

Here we continue our efforts to use methods developed in the folding mechanism community to both better understand and improve structure prediction. Our previous work demonstrated that Rosetta’s coarse-grained potentials may actually impede accurate structure prediction at full-atom resolution (49) and this result has been confirmed by other researchers (200). Based on this work we postulated that it may be time to work completely at full-atom resolution but that doing so may require more careful attention to the kinetics of convergence. To explore this possibility, we have used Generalized Ensemble (GE) algorithms (21) to generate projections of the landscape defined by Rosetta’s full-atom scoring function. We find that these scoring functions are capable of recognizing the native states of both protein G and engrailed homeodomain, an α/β and all α-helix protein, respectively. Furthermore, the score has the desired correlation with Cα RMSD to the native state. However, there is a huge entropic barrier to folding and the hydrogen bonding potential does not provide any significant bias towards the native state, slowing the kinetics of convergence. Based on these insights, we believe that further advances in structure prediction may be made by taking advantage of methods and ideas developed in the folding mechanism community.


RESULTS & DISCUSSION

GENERAL APPROACH In order to gain a deeper understanding of Rosetta’s full-atom resolution scoring function we have implemented a variant of the Simulated Tempering (ST) algorithm (24, 25) in Rosetta. ST was originally intended to induce the system of interest to perform a random walk in temperature space so that broad sampling at high temperatures would improve mixing at lower temperatures. However, ST may be generalized to other spaces (24). Here we define an RMSD space consisting of a number of umbrellas constraining the system to a given Cα RMSD from the native state. ST is then used to induce the system to perform a random walk in RMSD space without making any alterations to the temperature (201). Furthermore, we only use MC moves rather than the combination of MC and minimization moves used in the standard Rosetta protocol. Thus, the system can move back and forth between the folded and unfolded states while remaining at equilibrium. Exchanging between umbrellas also allows the system to access all the possible conformations in a given RMSD range (202). By performing many simulations in parallel we hope to explore all the relevant folding pathways. Figure 36 shows that this procedure results in reversible folding (i.e. multiple folding and unfolding events), confirming that our simulations have reached convergence (203). The Multistate Bennett Acceptance Ratio (MBAR) method (190), a statistically optimal variant of the Weighted Histogram Analysis Method (WHAM) (191), is used to determine the unbiased average values of thermodynamic properties such as energies and conformational entropies as a function of the RMSD. All the thermodynamic measurements in this work are dimensionless. That is, energies and free energies are given in units of the thermal energy kT and entropies are given in units of the Boltzmann constant k.


Figure 36. Time evolution of the Cα RMSD of the current umbrella center for five representative simulations demonstrating the presence of reversible folding.

We have applied this method to two systems: protein G (PDB code 1igd) (193) and engrailed homeodomain (PDB code 1enh) (204). Protein G has an α/β fold while engrailed homeodomain (EH) is a 3-helix bundle. Because these systems contain both major protein secondary structure motifs our conclusions should be applicable to most protein systems.

A THERMODYNAMIC PERSPECTIVE The average energy (or score), conformational entropy, and free energy as a function of the RMSD for both protein G and EH are shown in Figure 37. The average score has a clear correlation with the RMSD and the native state is at the scoring function’s global minimum for both systems. Thus, Rosetta’s full-atom scoring function is indeed able to recognize diverse protein native states. However, the conformational entropy of the native state is extremely low for both proteins. In fact, at the temperature used during full-atom Rosetta structure prediction during the CASP competitions (0.8 in arbitrary units, internal to the Rosetta code) the entropy dominates the free energy. As a result, the native state is the free energy maximum instead of the desired minimum.

Figure 37. Average energy (<∆E>), conformational entropy (<∆S>), and free energy (<∆F>) as a function of Cα RMSD for protein G and engrailed homeodomain (EH).

This observation gives some insight into the limitations currently observed with Rosetta structure prediction. Rosetta uses a hierarchical approach in which coarse-grained structure predictions are made and then used as starting points for full-atom refinement (49). A number of recent works have noted that for full-atom refinement to be successful, i.e. reach RMSD values less than 2 Å, the initial configuration must be within a “radius of convergence” of about 3 Å from the native state (150, 199). Our results show that the free energy difference between 3 Å and 2 Å is about 5 kT and, therefore, sampling a 2 Å structure when starting from a 3 Å structure is extremely unlikely. The improbability of moving to lower RMSD structures is consistent with the fact that one to ten thousand independent runs must be performed in order to find a few accurate full-atom structures with Rosetta’s ab initio structure prediction protocol (151).
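As a rough worked estimate, assuming Boltzmann-weighted equilibrium populations under the scoring function and taking the free energy difference of about 5 kT at face value, the population of 2 Å structures relative to 3 Å structures would be approximately:

```latex
\frac{p(2\,\text{\AA})}{p(3\,\text{\AA})} \approx e^{-\Delta F / kT} = e^{-5} \approx 6.7 \times 10^{-3}
```

Under these assumptions, fewer than one in a hundred equilibrium samples drawn from this region would lie at the lower RMSD.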

TEMPERATURE DEPENDENCE OF THE FREE ENERGY The relative importance of the energetic and entropic contributions to the free energy may be tuned by adjusting the temperature ($F = E - TS$). Namely, the energetic term will dominate at sufficiently low temperatures while the entropic term will dominate at higher temperatures. By assuming that the average energy and conformational entropy are independent of temperature we are able to predict the temperature dependence of the free energy. We can then predict what temperature one would have to use in Rosetta structure prediction in order for the free energy landscape to have the desired correlation with the RMSD.
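The extrapolation described above can be sketched in a few lines of Python. This is a minimal illustration and not the analysis code used in this work: the function name `free_energy_profile` is hypothetical, and the linear energy and entropy profiles are toy numbers chosen only so that the two terms balance at a temperature of 0.5. It assumes the energy is expressed in score units, the entropy in units of k, and the temperature in Rosetta's internal units.

```python
import numpy as np

def free_energy_profile(avg_energy, entropy, temperature):
    """Return F(r) = E(r) - T * S(r), shifted so that its minimum is zero.

    Assumes E(r) and S(r) do not themselves change with temperature.
    """
    energy = np.asarray(avg_energy, dtype=float)
    entropy = np.asarray(entropy, dtype=float)
    f = energy - temperature * entropy
    return f - f.min()

# Hypothetical profiles on an RMSD grid (illustrative numbers, not data from this work).
rmsd = np.arange(0.5, 10.5, 0.5)   # Angstroms
avg_energy = 10.0 * rmsd           # toy score that rises with RMSD
entropy = 20.0 * rmsd              # toy entropy that also rises with RMSD

for t in (0.1, 0.5, 0.8):
    f = free_energy_profile(avg_energy, entropy, t)
    span = f.max() - f.min()
    print(f"T = {t}: minimum at {rmsd[np.argmin(f)]:.1f} A, range {span:.1f}")
```

With these toy profiles the low-RMSD end is favored at low temperature and disfavored at high temperature, which is the qualitative behavior the text describes; real profiles would come from the reweighted simulation data.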

We find that the free energy landscape has the desired shape (i.e. stable native state, unstable unfolded state) at temperatures below 0.5, as shown in Figure 38. At temperatures above 0.5 the free energy landscape still has a maximum at the native state. At a temperature of about 0.5 there are still non-trivial barriers between the native and unfolded state but the free energy landscape is essentially flat relative to other temperatures.


Figure 38. Average free energies (<∆F>) as a function of Cα RMSD for temperatures of 0.5 and 0.1 for protein G and engrailed homeodomain (EH). The black lines are the hypothesized free energy at the given temperature and the dash-dot lines are the free energy at temperature 0.8 shown for reference.

EXPLOITING THE TEMPERATURE DEPENDENCE While the projections of the thermodynamic landscapes shown in Figure 37 and Figure 38 appear to be smooth, the true landscapes are actually quite rugged due to energetic terms like hydrogen bonding and van der Waals interactions. In order to explore this space the standard Rosetta full-atom refinement protocol uses a combination of MC and minimization moves (49). The minimization moves are intended to guide the protein towards the native state at the energy minimum while the MC moves are intended to help the protein overcome small barriers. For the MC moves to perform this function they must use a sufficiently high temperature to overcome small barriers but a low enough temperature to avoid mitigating the effectiveness of the minimization moves. Simply running the standard protocol at a lower temperature is likely to destroy this balance and prevent the system from overcoming even trivially small barriers, thus drastically slowing the dynamics. However, using our insights into the temperature dependence of the free energy landscape it may be possible to devise a temperature ST protocol that could overcome this roughness and reach the native state.

To test this hypothesis we have implemented a temperature ST version of the full-atom Rosetta refinement protocol, as well as a variant of the standard protocol that runs at a temperature of 0.1. For the ST variant we used a temperature range of 0.1 to 0.5 and a purely MC move set in order to obey detailed balance. Broad sampling should be possible at a temperature of 0.5 because of the relative flatness of the landscape, while at lower temperatures the native state should be favored. Temperatures above 0.5 are not used because they would favor unfolding. The low temperature variant allows us to ensure that any improvements seen with the ST variant over the standard protocol are not simply the result of running at lower temperatures. Both the standard and low temperature variants use the full set of MC and minimization moves available in Rosetta.

Our ST variant is found to outperform both standard Rosetta and the low temperature variant. For each of these three protocols we performed 100 runs starting from a 5.7 Å structure, well beyond the radius of convergence, drawn from our umbrella sampling simulations. Figure 39 shows our 5.7 Å starting structure alongside protein G’s native state as a reference. Figure 40 shows histograms of the lowest RMSD found in each run. One ST run reached an RMSD value of 4.8 Å and 37% of the ST runs found structures with RMSD values lower than the initial configuration. However, neither the standard protocol nor the low temperature variant was able to find any structures with RMSD values less than that of the initial configuration. The increased ability of our ST protocol to move towards the native state demonstrates that utilizing explicit knowledge of the entropic contribution to the free energy may improve structure prediction, even when the physical conformational entropy is not of interest.


Figure 39. (A) The native structure of protein G and (B) the 5.7 Å starting structure used for comparing the ST and Standard Rosetta variants.

Figure 40. Distribution of the minimum Cα RMSD values reached by 100 Simulated Tempering (ST) and 100 standard Rosetta runs started from a 5.7 Å structure. Results for both the low temperature and standard Rosetta variants were identical so only a single plot is shown.

PHYSICAL PERSPECTIVE ON ENERGETIC TERMS A physical perspective may also be taken in order to evaluate and improve individual energetic terms. For example, Rosetta’s hydrogen bonding term (187) is seen as a critical component of the full-atom scoring function (199). While this term agrees with quantum calculations (188), it has been found empirically that the hydrogen bonding potential only helps discriminate between models within about 3 Å of the native state (187).

We find that the hydrogen bonding term actually impedes the kinetics of convergence while providing only a minor energetic advantage to near-native states and, therefore, ultimately impedes rapid and accurate structure prediction. Figure 41 shows that the average hydrogen bonding energy is somewhat lower within about 3 Å of the native state for protein G but not for EH. For both systems, however, the average hydrogen bonding energy is basically flat relative to the total energy. Because the average hydrogen bonding energy is flat, it does not necessarily provide any guiding force to bias the system towards the native state.

Figure 41. Relative magnitude of the average hydrogen bonding energy (solid line) versus the total average energy (dash-dot line) as a function of Cα RMSD for protein G and engrailed homeodomain (EH).

Shmygelska and Levitt have reported that Rosetta’s hydrogen bonding potential is better able to discriminate native from non-native states than the low-resolution potentials (200). The most likely explanation for this apparent discrepancy is that they weighted the hydrogen bonding term more heavily. During our simulations the long-range hydrogen bonding term was weighted by a factor of one while the short-range term was weighted by a factor of 0.5, following the protocol used by the Baker group in CASP 7. If these terms were weighted more heavily relative to the rest of the potential a stronger bias towards the native state could arise. For example, the small dip we observe in the hydrogen bonding term for protein G could become quite substantial. Comparing our results with those of Shmygelska and Levitt is also complicated by the fact that they sampled the hydrogen bonding term in the context of Rosetta’s less accurate low-resolution potentials while we have sampled it in the context of the more accurate full-atom potential. A more extensive comparison of our methods in the context of the full-atom potential is an interesting future direction.

We suggest that structure prediction potentials could be improved by avoiding such flat terms or reweighting them such that they provide a substantial biasing force towards the native state. We note that proteins can have surprisingly fast kinetics, with some small proteins folding on the microsecond time scale (57). One outstanding question is whether it is even feasible to design a knowledge-based potential that can accurately identify protein native states and have kinetics that are faster than physical kinetics. If not, physics-based methods may actually be the fastest algorithms for complex systems, as they may be able to take advantage of the evolutionary optimization of, and the physical processes underlying, the natural kinetics of protein folding. Even if this is not the case, our results show that structure prediction may benefit by taking advantage of ideas developed to better understand folding mechanisms. Informatics approaches that incorporate more physical insights into protein folding mechanisms are thus an interesting direction (205-207).

CONCLUSIONS

Our results demonstrate that explicitly accounting for conformational entropy and considering the kinetics of convergence may improve structure prediction even if physical conformational entropies and kinetics are not of interest. For example, by understanding the interplay between energy and conformational entropy one can choose an optimal temperature or set of temperatures to use for exploring conformational space. By considering the kinetics of convergence one can ensure that this space can be explored rapidly, resulting in computationally efficient structure prediction protocols. An outstanding question is whether it is possible to design knowledge-based potentials with better entropic and kinetic properties than our physical reality. If not, physics-based structure prediction may ultimately be necessary for more complex systems. Whether or not this is the case, our results show that structure prediction may benefit by taking advantage of ideas developed to better understand folding mechanisms.

MATERIALS & METHODS

All structural representations were generated using VMD (67).

TEMPERATURE ST Temperature ST (24, 25) simulations perform a random walk within a pre-determined temperature set (T1, …, Tn). This is accomplished using an expanded Hamiltonian

$$H_i(X) = \beta_i E(X) - g_i$$

where $\beta_i = 1/(kT_i)$, $E(X)$ is the energy (or score) of the current configuration ($X$), and $g_i$ is the weight corresponding to $T_i$. At regular intervals the simulation attempts to move either up or down in temperature space with equal probability. The probability of accepting a given move is

$$P(i \to j) = \min\left(1,\; e^{-(\beta_j - \beta_i)E(X) + (g_j - g_i)}\right)$$

where $P(i \to j)$ is the probability of moving from $T_i$ to $T_j$.

Our temperature ST simulations used a temperature list of 0.1, 0.15, 0.2, 0.3, 0.4, and 0.5 in arbitrary units internal to the Rosetta code and temperature exchanges were attempted every 50 steps. All weights were determined using the Simulated Tempering Equal Acceptance Ratio (STEAR) method (49). This method obtains an initial estimate of the weights from short constant temperature simulations at each temperature and then refines these weights in subsequent ST simulations before holding them constant in the final data collection phase. Two iterations of weight refinement consisting of 100 runs of 600,000 steps were performed for temperature ST simulations, followed by 100 runs of 600,000 steps for data collection. In order to maintain detailed balance the ST simulations only used MC moves in torsion space.
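The temperature move described above can be sketched as follows. This is a minimal illustration, not the Rosetta implementation: the function names are hypothetical, the weights are placeholders (in this work they come from STEAR), and the sketch assumes the sign convention in which the expanded Hamiltonian is $\beta_i E(X) - g_i$, so that a move from $T_i$ to $T_j$ is accepted with probability $\min(1, e^{-(\beta_j - \beta_i)E(X) + (g_j - g_i)})$. Rejecting proposals that step off either end of the temperature list is also an assumption.

```python
import math
import random

# Temperature list quoted in the text (Rosetta-internal units).
TEMPERATURES = [0.1, 0.15, 0.2, 0.3, 0.4, 0.5]

def st_accept_probability(energy, i, j, temperatures, weights, k=1.0):
    """Acceptance probability for moving from temperature index i to index j."""
    beta_i = 1.0 / (k * temperatures[i])
    beta_j = 1.0 / (k * temperatures[j])
    log_ratio = -(beta_j - beta_i) * energy + (weights[j] - weights[i])
    # Guard against overflow: any non-negative exponent means certain acceptance.
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def attempt_temperature_move(energy, i, temperatures, weights, rng=random):
    """Propose a step up or down with equal probability; return the new index."""
    j = i + rng.choice((-1, 1))
    if j < 0 or j >= len(temperatures):
        return i  # assumed: proposals beyond the ends of the list are rejected
    if rng.random() < st_accept_probability(energy, i, j, temperatures, weights):
        return j
    return i
```

In a full simulation this move would be attempted every 50 MC steps, with `energy` taken from the current configuration and `weights` held fixed during data collection.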

RMSD ST RMSD ST simulations perform a random walk amongst a predetermined set of umbrellas constraining the system to a given RMSD from the native state without changing the system’s temperature. In this case the expanded Hamiltonian and probability of accepting a move are

$$H_i(X) = \beta\left[E(X) + a\,(\mathrm{RMSD}_{\mathrm{current}} - \mathrm{RMSD}_i)^2\right] - g_i$$

$$P(i \to j) = \min\left(1,\; e^{-\beta a\left[(\mathrm{RMSD}_{\mathrm{current}} - \mathrm{RMSD}_j)^2 - (\mathrm{RMSD}_{\mathrm{current}} - \mathrm{RMSD}_i)^2\right] + (g_j - g_i)}\right)$$

where $\beta = 1/(kT)$, $E(X)$ is the energy of the current configuration ($X$), $\mathrm{RMSD}_{\mathrm{current}}$ is the current RMSD from the native state, $\mathrm{RMSD}_i$ is the center of umbrella $i$, and “a” determines the strength of the spring constraining the system to a given umbrella.

Our RMSD ST simulations used umbrellas centered at RMSD values from 0.5 to 10 Å at 0.5 Å intervals and jumps between neighboring umbrellas were attempted every 50 steps. The “a” parameter was set to three. All weights were determined using the Simulated Tempering Equal Acceptance Ratio (STEAR) method (49). This method obtains an initial estimate of the weights from short umbrella simulations at each umbrella (without any jumps between umbrellas) and then refines these weights in subsequent RMSD ST simulations before holding them constant in the final data collection phase. Three iterations of weight refinement consisting of 100 runs of 1,700,000 steps were performed for RMSD ST simulations, followed by 100 runs of 900,000,000 steps for data collection. In order to maintain detailed balance the RMSD ST simulations only used MC moves in torsion space.
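A jump between umbrellas can be sketched in the same way. Again this is a hedged illustration with hypothetical names rather than the code used here. It assumes the harmonic bias $a\,(\mathrm{RMSD}_{\mathrm{current}} - \mathrm{RMSD}_i)^2$ and the weight sign convention $-g_i$; because the configuration does not change during a jump, the energy $E(X)$ cancels and only the bias and weight differences remain. The default `beta=1.0` is a placeholder, since the value of $kT$ is not fixed by this sketch.

```python
import math

# Umbrella centers from 0.5 to 10.0 Angstroms at 0.5 Angstrom intervals, as in the text.
UMBRELLA_CENTERS = [0.5 * n for n in range(1, 21)]
SPRING_CONSTANT = 3.0  # the "a" parameter quoted in the text

def umbrella_accept_probability(rmsd_current, i, j, weights, beta=1.0,
                                centers=UMBRELLA_CENTERS, a=SPRING_CONSTANT):
    """Probability of jumping from umbrella i to umbrella j at fixed temperature."""
    bias_i = a * (rmsd_current - centers[i]) ** 2
    bias_j = a * (rmsd_current - centers[j]) ** 2
    log_ratio = -beta * (bias_j - bias_i) + (weights[j] - weights[i])
    # Guard against overflow: any non-negative exponent means certain acceptance.
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)
```

For example, a structure sitting exactly at the center of one umbrella is always allowed to jump into that umbrella from its neighbor when the weights are equal, while the reverse jump is penalized by the spring term.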


ROSETTA For an overview of the Rosetta structure prediction algorithm and the command-line options used in this study see reference (49). The full Rosetta move set was used for standard Rosetta runs. The same number of moves was used when comparing standard Rosetta runs with ST.


CHAPTER 9: STRUCTURAL INSIGHT INTO RNA HAIRPIN FOLDING INTERMEDIATES

This chapter was taken from: Bowman GR, et al. (2008) Structural insight into RNA hairpin folding intermediates. J Am Chem Soc 130:9676-9678.

ABSTRACT

Hairpins are a ubiquitous secondary structure motif in structured RNA molecules. Despite their simple structure, there is some debate over whether they fold in a two-state or multi-state manner. We have studied the folding of a small tetraloop hairpin using a serial version of the replica exchange method on a distributed computing environment. Based on these simulations we have identified a number of intermediates that are consistent with experimental results. We also find that folding is not simply the reverse of unfolding and suggest that this may be a general feature of biomolecular folding.

INTRODUCTION

RNA hairpins are one of the most common secondary structure motifs, appearing in nearly every large RNA structure (208-210). In addition to serving as nucleation sites for RNA folding (211), they may also guide RNA folding by forming tertiary contacts (212, 213) and serve as recognition sites for RNA binding proteins (214). They are potential drug targets (215), terminate transcription (211), and influence translation through their role as aptamer domains in riboswitches (216). Despite the great variety of functions they may serve, hairpins are one of the simplest RNA motifs, requiring only monovalent ions to fold. Thus, understanding the folding of small RNA hairpins is both a critical first step in understanding the folding of larger RNA molecules (215) and amenable to computer simulation (217-219).

RNA hairpins consist of a primarily Watson-Crick base-paired stem capped with a loop of unpaired or non-Watson-Crick base-paired nucleotides. Tetraloops, such as the GCAA tetraloop (5’-GGGCGCAAGCCU-3’) examined in this work and shown in Figure 42, have four such bases in their loop. This particular structure was chosen due to its predominance in the ribosome (210).

Figure 42. (A) NMR structure of the GCAA tetraloop. (B) Contact map for the native state. Bases are numbered from 5’ to 3’ and native base-pair contacts (dotted lines) are numbered 1-4.

Despite their simple structure there is some controversy over whether these hairpins fold in a two-state or multi-state manner. The two-state hypothesis for nucleic acid hairpins is primarily based on thermodynamic measurements. For example, Ansari et al. found similar sigmoidal melting curves when they monitored all the base-pairing interactions or a subset of fluorescently labeled nucleotides (220). The multi-state hypothesis is based on kinetic measurements, such as FCS and T-jump experiments. For example, Jung et al. found discrepancies between equilibrium distributions from FCS and melting experiments (221). More recently, Ma et al. found evidence of melting in T-jump experiments starting at temperatures above the melting temperature (TM), indicating that the supposed unfolded state in melting experiments is not completely unstructured (222, 223). These authors went on to propose an intermediate state in which the ends of the hairpin are in contact but the base-pairing and base-stacking interactions in the stem are not yet formed.


To determine if there is in fact an intermediate and, if so, what its structure is, we have run Serial Replica Exchange Molecular Dynamics (SREMD) (177, 224) simulations of the GCAA tetraloop depicted in Figure 42. Due to the heterogeneity of the loop (225, 226) we have defined the native state as any conformation with all four stem base-pair contacts formed, numbered as shown in Figure 42B. We refer to these base-pair contacts as native contacts. Two nucleotides are considered to be contacting if any two atoms, one from each nucleotide, fall within 3 Å of each other. Thus, a structure can be well described by a contact map—a bit string specifying which residues are in contact.
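The contact definition above can be illustrated with a short NumPy sketch. The function names and the input layout (one coordinate array of shape `(n_atoms, 3)` per nucleotide, in Å) are assumptions made for this example, as is the choice to record every nucleotide pair, including sequence neighbors; the text does not specify these details.

```python
import numpy as np

def in_contact(atoms_a, atoms_b, cutoff=3.0):
    """True if any atom of one nucleotide lies within `cutoff` Angstroms of any atom of the other."""
    diffs = atoms_a[:, None, :] - atoms_b[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)
    return bool((distances <= cutoff).any())

def contact_map(nucleotides, cutoff=3.0):
    """Return a tuple of 0/1 bits, one per nucleotide pair (i < j), in row-major order."""
    n = len(nucleotides)
    return tuple(
        int(in_contact(nucleotides[i], nucleotides[j], cutoff))
        for i in range(n)
        for j in range(i + 1, n)
    )

# Toy coordinates (hypothetical, single-atom "nucleotides") to show the output format.
toy = [
    np.array([[0.0, 0.0, 0.0]]),
    np.array([[2.0, 0.0, 0.0]]),
    np.array([[10.0, 0.0, 0.0]]),
]
bits = contact_map(toy)
```

Representing each conformation as a fixed-length bit string makes it straightforward to compare conformations, for instance by counting the positions at which two strings differ.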

RESULTS & DISCUSSION

Previously, Sorin et al. studied the folding of this system using constant temperature Molecular Dynamics (MD) and explicit solvent (217). While these studies provided valuable insight into the folding of RNA hairpins, only 19 folding events were observed within the thousands of simulations run. We have applied SREMD on the Folding@home infrastructure to obtain better sampling and, therefore, greater insight into RNA folding.

SREMD is a serial version of the Replica Exchange Molecular Dynamics (REMD) (22, 23), which induces the system to perform a random walk in temperature space such that broad sampling is achieved at high temperature and detailed exploration of free energy minima is achieved at low temperature. In REMD, multiple simulations are run, each at a different temperature. A random walk in temperature space is achieved by periodically attempting to swap the conformations at two neighboring temperatures. The probability of accepting a swap is

 ,1min()( ejiP   UU jiij ))(( ) (1)

where P(i→j) is the probability of transitioning from temperature i (Ti) to temperature j (Tj), βi is 1/(kTi), and Ui is the potential energy of the conformation at Ti. Thus, the detailed balance condition is satisfied. SREMD allows any number of asynchronous simulations to be run, making it more suitable for distributed computing than standard REMD (177). This is accomplished by providing each simulation with the Potential Energy Distribution Function (PEDF) for each temperature. SREMD uses the same criteria for swapping temperatures as REMD except that the energy of the current conformation is compared to an energy randomly drawn from the neighboring temperature’s PEDF rather than the energy from a parallel simulation. The simulation parameters are described in detail in Appendix G.
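The swap criterion can be sketched as follows. This is an illustrative sketch rather than the production code: the function names are hypothetical, the PEDF is assumed to be represented simply as a list of previously sampled potential energies, and energies are assumed to be in kcal/mol.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def swap_probability(u_i, u_j, t_i, t_j):
    """Metropolis acceptance for exchanging temperatures t_i and t_j.

    u_i is the potential energy of the conformation currently at t_i and
    u_j the energy associated with t_j.
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    exponent = (beta_i - beta_j) * (u_i - u_j)
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def sremd_attempt(u_current, t_current, t_neighbor, neighbor_pedf, rng=random):
    """One serial swap attempt.

    Instead of using the energy of a parallel replica, an energy is drawn
    at random from the neighboring temperature's PEDF (here: a list of
    sampled energies, which is an assumption of this sketch).
    Returns the temperature at which the simulation continues.
    """
    u_drawn = rng.choice(neighbor_pedf)
    p_accept = swap_probability(u_current, u_drawn, t_current, t_neighbor)
    return t_neighbor if rng.random() < p_accept else t_current
```

A move to a higher temperature is always accepted when the current energy exceeds the drawn energy, and is otherwise accepted with a probability that decays exponentially in the energy gap.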

We ran 2,800 SREMD simulations with an aggregate simulation time of 54.6 µs starting from the NMR structure (PDB code 1ZIH) (209). Even with this amount of simulation, reversible folding was not achieved and we cannot claim to be at equilibrium (203). However, we did observe 760 trajectories with a complete unfolding event and 550 trajectories with a complete refolding event. Thus, we have sufficient data to define the dominant states in the folding and unfolding pathways, though we cannot give their relative probabilities. While SREMD will not give any kinetic information directly, an analysis of the relevant thermodynamic states can yield information about the states along the folding and unfolding pathways.

An unfolding event is defined as the set of conformations between the first point with no contacts between any two residues on opposite sides of the stem and the first preceding point with four native contacts. A refolding event is defined as the set of conformations between the first point with no contacts between any two residues on opposite sides of the stem and the first subsequent point where the number of native contacts is four.
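These definitions can be made concrete with a short sketch. It assumes that each trajectory has already been reduced to two per-frame series: the number of native contacts and the number of contacts of any kind between residues on opposite sides of the stem. The function names are hypothetical, and "first preceding point" is interpreted here as the nearest earlier frame with four native contacts.

```python
def first_unfolded_frame(cross_contacts):
    """Index of the first frame with no cross-stem contacts, or None."""
    return next((t for t, c in enumerate(cross_contacts) if c == 0), None)


def unfolding_event(native_contacts, cross_contacts):
    """Return (start, end) frame indices of the unfolding event, or None."""
    end = first_unfolded_frame(cross_contacts)
    if end is None:
        return None
    # Nearest earlier frame that still has all four native contacts.
    folded = [t for t in range(end) if native_contacts[t] == 4]
    return (folded[-1], end) if folded else None


def refolding_event(native_contacts, cross_contacts):
    """Return (start, end) frame indices of the refolding event, or None."""
    start = first_unfolded_frame(cross_contacts)
    if start is None:
        return None
    end = next(
        (t for t in range(start + 1, len(native_contacts))
         if native_contacts[t] == 4),
        None,
    )
    return (start, end) if end is not None else None


# Toy trajectory (hypothetical values).
toy_native = [4, 4, 3, 4, 2, 1, 0, 0, 1, 2, 4, 4]
toy_cross = [5, 5, 4, 5, 3, 1, 0, 0, 1, 3, 5, 5]
```

For the toy trajectory the unfolding event spans frames 3 to 6 and the refolding event spans frames 6 to 10.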

We used Mapper (227, 228), a topological data analysis algorithm, to identify the dominant states in the folding and unfolding pathways. For example, to understand unfolding we applied the Mapper technique to conformations from unfolding events, where the conformations were represented by contact maps. The Mapper clustering technique works as follows. First, the similarity between each pair of conformations was determined using the Hamming distance metric. The data set of interest was then divided into overlapping subsets based on the density of configurations around each conformation, allowing efficient identification of intermediate states with low populations as well as folded/unfolded states with high populations. Single-linkage clustering was carried out in each subset, facilitating the identification of non-convex clusters. Finally, a graph was generated that represents the connectivity between clusters in different density levels based on their degree of overlap. More details are provided in the SI.
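Two ingredients of this procedure, the Hamming distance between contact maps and single-linkage clustering within one subset, can be sketched as below. This is not an implementation of Mapper itself: the density-based overlapping cover and the final graph construction are omitted, and the distance threshold and function names are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("contact maps must have the same length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage(maps, threshold):
    """Group contact maps into clusters by single linkage.

    Two maps end up in the same cluster if they are connected by a chain
    of maps whose consecutive Hamming distances are all <= threshold.
    Returns a sorted list of clusters, each a list of indices into `maps`.
    """
    parent = list(range(len(maps)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(maps)):
        for j in range(i + 1, len(maps)):
            if hamming(maps[i], maps[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(maps)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


toy_maps = ["1111", "1110", "1100", "0000", "0001"]  # hypothetical data
toy_clusters = single_linkage(toy_maps, threshold=1)
```

In the toy data, maps 0 and 2 differ by two contacts yet fall in the same cluster because map 1 links them, which is the chaining behavior that lets single linkage recover non-convex clusters.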

In SREMD, replicas visiting high temperatures lead to rapid unfolding. To better understand this unfolding process, we first calculated the probability of having one, two, or three native contacts during unfolding as shown in Figure 43A. These data indicate that there is substantial breathing, with one or two base-pairs being broken and reformed, but that complete unfolding quickly follows the breakage of three native contacts. Further insight is provided by Figure 43C, where we show the probability of each native contact given that a certain number of native contacts are present. Apparently, unfolding has a single dominant pathway characterized by unzipping from the end. This result is confirmed by Mapper, as shown in Figure 44. There is no cluster corresponding to a single native contact due to the low probability of such structures. Structures with three native contacts also appear to be absorbed into either the native cluster or the cluster with only two base-pairs formed, probably due to the use of the simple Hamming distance metric.
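The conditional probabilities plotted in Figure 43C can be computed from a set of native contact maps in a few lines. The sketch below assumes each frame is represented as a bit string over the four native contacts; the function name and the toy data are hypothetical.

```python
def contact_probabilities(maps):
    """Return {n: [P(contact k formed | n native contacts present), ...]}."""
    totals = {}  # n -> number of frames with exactly n contacts
    formed = {}  # n -> per-contact counts among those frames
    for m in maps:
        n = m.count("1")
        totals[n] = totals.get(n, 0) + 1
        row = formed.setdefault(n, [0] * len(m))
        for k, bit in enumerate(m):
            if bit == "1":
                row[k] += 1
    return {n: [c / totals[n] for c in formed[n]] for n in totals}


# Toy frames (hypothetical): bits are native contacts 1-4, left to right.
toy_frames = ["1000", "1000", "0001", "1100", "1100", "1110"]
toy_probs = contact_probabilities(toy_frames)
```

Here, among the frames with a single native contact, contact 1 is present in two thirds and contact 4 in one third.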

Figure 43. The probability of a given number of native contacts during (A) unfolding and (B) refolding. (C) The probability of each contact when a given number of contacts are present during unfolding and refolding with the arrows representing the direction of movement between the unfolded state (U) and the folded state (F).

Figure 44. Contact maps representing the cluster centers from independent clustering of the unfolding (A) and refolding data (B). The grey lines represent the connectivity of the states. The blue lines represent native contacts with a probability of 0.6 or greater within the cluster. Intermediate structures are labeled A-D.

Figure 43B shows that there is often a single contact present during refolding but adding subsequent base-pairs becomes progressively less likely. Thus, there are many nucleation events consisting of the formation of a single native contact but few proceed to the folded state. Figure 43C again shows the probability of each contact given that a certain number of contacts are present. When a single native contact is present, it is most likely either the closing base-pair or the end base-pair (native contacts 1 and 4, respectively). The higher probability of native contact 1 is probably due to the close spatial proximity of the two participating residues imposed by their close proximity in the sequence. The higher probability of native contact 4 may be explained by the lack of steric hindrance relative to the other native contacts. Once two or three native contacts are formed each is more or less equally probable, which is consistent with numerous models.

The results from Mapper shown in Figure 44 give more insight. The first step is either the formation of the closing base-pair or the end base-pair. This is followed by the formation of native contacts 1 and 2 and subsequent folding is dominated by zipping. Presumably, the formation of the end base-pair facilitates the formation of native contacts 1 and 2 by reducing the conformational space that needs to be searched, as predicted by Ma et al. (222). The fact that the end base-pair does not appear in the center of the cluster with two native contacts doesn’t mean it breaks as folding proceeds, just that it does not occur frequently within the cluster. This is consistent with the fact that about four times as many refolding events occur through the pathway starting with the formation of native contact 1 as go through the pathway starting with the formation of native contact 4. Once again, we note these relative probabilities are not necessarily expected to be found in experimental studies due to the random walk in temperature space our simulations undergo. However, these are expected to be the two dominant pathways.

The two folding pathways observed here are consistent with the zipping and compaction mechanisms observed by Sorin et al. (217) as well as experimental work pointing to the presence of multiple folding pathways (215, 229). Furthermore, these results support the hypothesis that the folding pathway of RNA hairpins has at least three states. In particular, the collapsed structure with a single native contact between the end base-pair is consistent with the intermediate structure proposed by Ma et al. (222). However, the other clusters along the folding pathway with one, two, or three native contacts formed may also contribute to the experimental signal. Full-atom structures for each of these intermediates are shown in Figure 45. Reptation (defined as the sliding of the two strands of the stem relative to one another) is not one of the dominant folding pathways, in agreement with results for small β-hairpins (230). Thus, it appears that misfolded states must unfold before refolding properly, although we cannot discount the possibility that they may contribute to folding on longer timescales than our simulations reach. Results from the unfolding analysis using Mapper lend further support to this hypothesis. They include small clusters of reptated structures between the folded and intermediate states (data not shown), consistent with the idea that misfolding serves as an off-pathway trap that slows the overall folding process (215, 220, 223, 231).

Figure 45. Representative full-atom structures for the intermediate states with labels (A)-(D) corresponding to the labels A-D in Figure 44.

Another result of this work is that folding and high temperature unfolding follow different pathways. We propose that this may be a general feature of hairpin folding, due to the intrinsic similarities in the thermodynamic forces which stabilize their structure. Furthermore, the amount of sampling we have achieved and the fact that we have still not reached convergence call into question the results of shorter REMD studies. Such simulations will be dominated by non-equilibrium unfolding, which as we show here does not necessarily provide any insight into folding. Applying measures of convergence, such as reversible folding or agreement between simulations with different starting states, is critical for validating such studies.

CONCLUSIONS

The results presented here support recent work indicating that the folding of even the smallest of RNA motifs is more complicated than previously suspected. We have identified a number of folding intermediates consistent with experimental observations. We also found multiple highly populated folding pathways but only a single dominant unfolding pathway. Significant sampling was necessary to gain any statistics on folding, indicating that shorter simulations are dominated by unfolding, which differs from the folding pathway in this system. In future work we intend to determine the sequence dependence of intermediate states and folding kinetics. Such work will also provide more insight into whether or not folding and unfolding differ for biomolecules in general.

CHAPTER 10: RAPID EQUILIBRIUM SAMPLING INITIATED FROM NON-EQUILIBRIUM DATA

This chapter was taken from: Huang X, Bowman GR, Bacallado S, & Pande VS (2009) Rapid equilibrium sampling initiated from nonequilibrium data. PNAS 106:19765-19769.

ABSTRACT

Simulating the conformational dynamics of biomolecules is extremely difficult due to the rugged nature of their free energy landscapes and multiple long-lived, or metastable, states. Generalized Ensemble (GE) algorithms, which have become quite popular in recent years, attempt to facilitate crossing between states at low temperatures by inducing a random walk in temperature space. Enthalpic barriers may be crossed more easily at high temperatures; however, entropic barriers will become more significant. This poses a problem because the dominant barriers to conformational change are entropic for many biological systems, such as the short RNA hairpin studied here. We present a new efficient algorithm for conformational sampling, called the Adaptive Seeding Method (ASM), that uses non-equilibrium GE simulations to identify the metastable states and seeds short simulations at constant temperature from each of them to quantitatively determine their equilibrium populations. Thus, the ASM takes advantage of the broad sampling possible with GE algorithms but generally crosses entropic barriers more efficiently during the seeding simulations at low temperature. We show that only local equilibrium is necessary for ASM so very short seeding simulations may be used. Moreover, the ASM may be used to recover equilibrium properties from existing datasets that failed to converge and is well-suited to running on modern computer clusters.

INTRODUCTION

The functions of biological macromolecules are in large part determined by their structure and dynamics. As such, many experimental techniques have been developed and applied to probe these properties, each of which has its strengths and weaknesses. Computational methods such as Molecular Dynamics (MD) and Monte Carlo (MC) simulations have the potential to complement such experiments by modeling the evolution of entire systems with atomic resolution. However, it is extremely difficult to obtain equilibrium sampling of even moderately sized systems in atomic simulations because of the rugged nature of the free energy landscapes that must be explored. Without adequate sampling, it is impossible to validate the parameters, or force fields, that determine the interactions between atoms or to address phenomena that occur on biologically relevant timescales.

Many methods have been developed in an attempt to address the sampling problem. Generalized Ensemble (GE) algorithms like the Replica Exchange Method (REM, or Parallel Tempering) (22, 23) and Simulated Tempering (ST) (24, 25) are popular approaches for studying biomolecular folding (26-28, 177, 232-238). They attempt to overcome the sampling problem by inducing a random walk in temperature space while maintaining canonical sampling at each temperature. At high temperatures energetic barriers may be crossed easily while at low temperatures the system is generally constrained to local minima. However, recent studies have shown that GE simulations do not yield converged equilibrium sampling much faster than standard constant temperature MD if the phenomena of interest are non-Arrhenius (27, 177, 238-243).

For example, Zuckerman et al. (240) used the Arrhenius equation to argue that the maximum efficiency gain of GE simulations is no more than an order of magnitude at physiological temperatures and Zheng et al. (241, 242) used a kinetic network model to show that there is an optimal temperature for non-Arrhenius folding kinetics and any time spent above this temperature will decrease the efficiency of GE simulations. This lack of improvement is the result of the interplay between energy and entropy. While high temperatures may facilitate the crossing of energetic barriers, entropic barriers will be more difficult to cross (27).

Thus, GE simulations will provide little improvement when the dominant barriers are entropic. Hansmann and coworkers have made some effort to improve the effectiveness of GE algorithms by optimizing the temperature spacing (244, 245). However, these methods assume that diffusion in temperature space is the rate limiting process, so crossing entropic barriers in the conformational space, which is the true rate limiting process, is still a problem. A number of other methods also exist, such as umbrella sampling and milestoning (156). However, these methods require that the dominant reaction coordinate is known a priori and this information is often unavailable.

The sampling problem is exacerbated by the practice of viewing global equilibration of individual trajectories as a requirement for considering a simulation to have reached equilibrium. Global equilibration is most naturally obtained by running a simulation much longer than the longest relaxation time of the system, so that all degrees of freedom are equilibrated and many uncorrelated samples are generated from each metastable state. For example, the reversible folding metric holds that a simulation has reached equilibrium if there are multiple folding and unfolding events. (203) While this criterion is sufficient, it may not be necessary. Instead of requiring global equilibration of individual trajectories, we suggest that local equilibration may be sufficient. Local equilibration may be achieved by using multiple simulations, each of which visits only a subset of the metastable states with their correct Boltzmann probabilities but that together cover the entire accessible space. Local equilibration may require significantly less wall-clock time because shorter simulations (all of which may be run in parallel) are required. The main difficulty is to analyze multiple simulations appropriately.

Markov State Models (MSMs) are a powerful tool which can be used to extract equilibrium properties from a dataset that satisfies the local equilibration criterion. MSMs partition phase space into metastable states such that intra-state transitions are fast but inter-state transitions are slow.(3, 6, 11, 33, 34) Such separation of timescales ensures that the model is Markovian, that is, the probability of being in a given state at time t+∆t, where ∆t is called the lag time, depends only on the state at time t. The key point is to build a model with a lag time that is shorter than the timescale of the process of interest with few enough states that it may be understood easily. Usually MSMs are used to study kinetics, but here we only derive thermodynamic information from them. In an MSM, the time evolution of a vector representing the population of each state may be calculated by repeatedly left-multiplying by the transition probability matrix.

Pn()[()](0 t T  tn P ) (1)

where P(n∆t) is a vector of state populations at time n∆t and T is the column-stochastic transition probability matrix. The leading eigenvector of T, i.e. the one with eigenvalue 1 (a right eigenvector under this column-stochastic convention), corresponds to the equilibrium distribution (6). This is a useful opportunity: obtaining kinetics from MSMs is challenging, but obtaining only the equilibrium thermodynamic properties is a less demanding goal because less information is required. Indeed, we find that the populations of the dominant states are invariant with respect to the lag time so very short simulations can be used.
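As a minimal illustration of this propagation, the sketch below repeatedly left-multiplies a population vector by a small column-stochastic matrix and shows that the result approaches the same equilibrium distribution from different starting populations. The three-state matrix is a hypothetical example, not a model from this work.

```python
def propagate(T, p, n):
    """Apply p <- T p a total of n times.

    T[i][j] is the probability of moving from state j to state i in one
    lag time, so every column of T sums to one.
    """
    size = len(p)
    for _ in range(n):
        p = [sum(T[i][j] * p[j] for j in range(size)) for i in range(size)]
    return p


# Hypothetical three-state transition matrix (columns sum to one).
T = [
    [0.90, 0.05, 0.00],
    [0.10, 0.90, 0.10],
    [0.00, 0.05, 0.90],
]

from_state_0 = propagate(T, [1.0, 0.0, 0.0], 500)
from_state_2 = propagate(T, [0.0, 0.0, 1.0], 500)
```

Both starting vectors converge to populations of 0.25, 0.5, and 0.25, which is the vector left unchanged by a further multiplication by T.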

Here, we introduce the Adaptive Seeding Method (ASM) and show that it rapidly yields converged thermodynamics even when faced with entropic barriers by exploiting many simulations at local equilibrium. This is achieved by 1) using non-equilibrium GE simulations to obtain broad sampling, 2) building a Markov State Model (MSM) to identify all the metastable states as shown in Figure 46, 3) starting new constant temperature simulations at the temperature of interest from each metastable state in a process called seeding, and 4) using MSMs to extract the correct equilibrium populations from the seeding simulations. Seeding short simulations from the known equilibrium distribution of alanine dipeptide has been shown to yield good models for its kinetics (246). A key advance in our new method is that it does not require that the initial sampling has reached equilibrium. We note that many non-equilibrium GE datasets have been generated due to the difficulty in reaching equilibrium and that there is growing interest in recovering equilibrium properties from such datasets (247). Thus, one strength of the ASM is that steps 2-4 may be used to recover the correct equilibrium thermodynamic properties from a non-equilibrium dataset. Furthermore, this procedure may be iterated and combined with adaptive sampling (161) to most efficiently use one’s computational resources, i.e. using the fewest and shortest trajectories necessary to achieve a good model, since minimizing wall-clock time is an important consideration for computer simulations.

Figure 46. A schematic free energy landscape with three representative seeding trajectories started from each basin and a projection of this free energy landscape onto a 2D plane showing the division into metastable states.

To test the ASM we apply it to a small biomolecular system with long time scale dynamics: an eight nucleotide RNA hairpin (5’-GCUUUUGC-3’) known as the UUUU tetraloop. Hairpins are a fundamental RNA secondary structure motif (208) and perform many biologically relevant functions but our understanding of their folding is still incomplete (52, 217, 220, 222, 223). The folding of this hairpin is diffusion controlled (215, 220, 248), so despite its small size the folding time is on the µs timescale, as measured by laser temperature jump experiments (223). Thus, capturing a single folding event with a single MD simulation with explicit solvent would likely take more than a year on a typical CPU. ASM, however, is able to reach converged equilibrium sampling within a week using many short parallel simulations, as judged by agreement on the populations of metastable states between distinct sets of simulations started from very different initial configurations. ASM is also found to yield converged thermodynamic properties with at least six times less sampling than GE simulations for this system. Finally, the fact that the most highly populated metastable state has a well-formed two base pair stem, as in the NMR structure, provides some validation of the force field. Since there is no analytical solution for the equilibrium distribution of our RNA hairpin system, we also studied a 2D potential where the equilibrium populations can be computed analytically. Using this model, we confirm that ASM is much more efficient than ST, and also provide some guidelines for choosing the optimal number and length of the seeding simulations.

RESULTS & DISCUSSION

COMPARISON OF ASM TO ST

Here we compare the results of our long ASM procedure with an equivalent amount of ST sampling, as depicted schematically in Figure 47. We ran two distinct sets of simulations starting from a near-native and a random-coil configuration, respectively, as shown in Figure 81. Thus, we are able to judge the convergence of our results by comparing these two datasets.

Figure 47. Schematic of the adaptive seeding scheme. The top arrow represents our ST trajectories, which are split into equilibration (green) and production (light blue) phases. The light red and light yellow boxes encompass our long and short adaptive seeding schemes respectively. For each adaptive seeding scheme, the dotted lines demark the portion of the ST data used to identify the dominant thermodynamic, or metastable, states by building an MSM (S). Constant temperature (or canonical, NVT) simulations are then started from each state and used to build a new MSM (E) that captures the equilibrium distribution. Both the light yellow and red boxes also encompass a portion of the original ST data that is equivalent to the amount of sampling used in the adaptive seeding scheme. An MSM is also built for this data and used as a baseline for judging the efficiency of the adaptive seeding scheme.

The first step was to run an independent set of one thousand 18 ns ST simulations starting from each initial configuration to obtain broad sampling. During an initial equilibration phase (first 9 ns) the weights were updated using the Simulated Tempering Equal Acceptance Ratio (STEAR) method(49, 177) described in Appendix H. This procedure was found to give nearly equal sampling of each temperature and converged weights for each dataset (Appendix H). During the subsequent 9 ns production phase the weights were held constant. These two sets of ST simulations do not reach converged sampling because of their short length (data not shown), but they should be able to reach all the metastable states.

To identify the metastable states we built an independent MSM from the production phase of each dataset. First, all the conformations from every temperature were divided into a large number of small sets of very structurally similar, and therefore likely kinetically similar, conformations called microstates using a hierarchical K-medoids clustering algorithm as described in Appendix H. We then used spectral clustering (249, 250) (PCCA (6, 44, 45)) refined with simulated annealing to lump microstates that can interconvert rapidly into larger states called metastable states, while conformations separated by large free energy barriers are grouped into different states, as depicted in Figure 46. This algorithm was developed by Chodera et al. (6) and is also described in the SI. This procedure yielded six states for each dataset.

To obtain equilibrium sampling we then seeded simulations from each metastable state. Specifically, 100 random conformations were chosen from each metastable state and used as starting points for 10 ns constant temperature MD simulations at 300 K. The equilibrium distribution was extracted by building a new MSM. A common state definition is necessary in order to compare different datasets so this MSM was built using all the seeding data. Populations with error bars for each independent dataset were then determined under the same state definition using a Bayesian method developed by Noé (251). Figure 48A shows that the populations from each seeding dataset, as well as the combined data, are in strong agreement and are therefore converged to the equilibrium distribution. Populations for an equivalent amount of folded and coil ST data (19 ns) were also calculated by considering only those conformations at 300 K. These two ST datasets have not converged yet. In particular, there is a clear difference in the populations of states 2 and 4 (about 10% and 7% respectively).
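The seeding and population-extraction steps can be sketched on toy data as follows. This sketch substitutes a simple column-normalized count matrix for the Bayesian estimator cited above, so it yields point estimates without error bars; the function names, the state assignments, and the toy trajectories are all hypothetical.

```python
import random


def choose_seeds(state_members, n_seeds, rng):
    """Pick up to n_seeds random conformation ids from each metastable state."""
    return {
        state: rng.sample(members, min(n_seeds, len(members)))
        for state, members in state_members.items()
    }


def count_matrix(trajectories, n_states, lag):
    """Count transitions at the given lag; counts[dest][origin]."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for origin, destination in zip(traj[:-lag], traj[lag:]):
            counts[destination][origin] += 1
    return counts


def transition_matrix(counts):
    """Normalize each column of the count matrix to sum to one."""
    n = len(counts)
    col_sums = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    return [
        [counts[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def equilibrium_populations(T, n_iter=1000):
    """Approximate the stationary vector of T by repeated multiplication."""
    n = len(T)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        p = [sum(T[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p


# Hypothetical state membership and short seeding trajectories, given as
# sequences of metastable-state indices.
seeds = choose_seeds({0: list(range(10)), 1: list(range(10, 13))}, 5,
                     random.Random(0))
toy_trajs = [[0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 1, 1]]
toy_counts = count_matrix(toy_trajs, n_states=2, lag=1)
toy_pops = equilibrium_populations(transition_matrix(toy_counts))
```

Because the populations are derived from transition probabilities out of each state, the number of trajectories seeded from a state does not by itself fix that state's estimated population.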

Figure 48. Population of each state (bar graphs correspond to the mean values, and error bars stand for standard deviations) for (A) the long adaptive seeding scheme (lag time t=4.5 ns) and (B) the short adaptive seeding scheme (lag time t=4.5 ns).

MSMs are usually used to study kinetics (3). In order to get a reasonable number of states and ensure that the model is Markovian, a relatively long lag time must be used, though it generally ought to be shorter than the timescale of the process of interest. Furthermore, to get accurate kinetics each simulation must be at least a few times longer than the lag time so that multiple crossings of each barrier may be observed. For example, Chodera et al. show that a twenty state MSM for the folding of the helical Fs-peptide (which occurs on a timescale of tens of nanoseconds) requires a lag time of five nanoseconds (6). Thus, obtaining accurate kinetics for the UUUU tetraloop, which folds on a microsecond timescale, should likely require orders of magnitude longer simulations than for the Fs-peptide. However, obtaining accurate thermodynamics may require significantly less sampling. In particular, short lag times where the system is not Markovian may still be sufficient to estimate thermodynamic properties. In fact, Figure 49 shows that the equilibrium populations of each state are identical within statistical error regardless of the lag time. Similar observations have been made by Hummer and coworkers who found that the free energy profile for a water dewetting transition can be predicted using a very short lag time at which the kinetics are not reproduced well (11). In addition to the error due to non-Markovian effects, the statistical error due to insufficient sampling of transition events will also be smaller for thermodynamic properties. In a model with N states there are only N thermodynamic parameters to determine whereas getting accurate kinetics requires determining all N^2 pairwise transition probabilities. Sampling all possible transitions over-determines the free energy differences between states. Thus, obtaining accurate thermodynamics may require significantly less sampling.
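One way to see the over-determination is through detailed balance. The relations below are a sketch under the assumption that the dynamics are reversible and that transitions between states i and j are observed in both directions; T_{ji} denotes the probability of moving from state i to state j in one lag time, \pi_i the equilibrium population of state i, and G_i its free energy (symbols introduced here for illustration).

```latex
\pi_i \, T_{ji}(\Delta t) = \pi_j \, T_{ij}(\Delta t)
\quad\Longrightarrow\quad
G_j - G_i = -kT \ln\frac{\pi_j}{\pi_i} = -kT \ln\frac{T_{ji}(\Delta t)}{T_{ij}(\Delta t)}
```

Each of the N(N-1)/2 pairs of states with observed transitions therefore supplies its own estimate of a free energy difference, whereas only N-1 such differences are independent.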

Figure 49. Population of each state for the long adaptive seeding scheme as the lag time is varied.

MINIMIZING THE SIMULATION LENGTH

To push the limits of the ASM we repeated the above procedure using drastically less data (see short ASM in Figure 47). Ten times less data was used for equilibration, six times less ST data was used to identify the states, and the seeding simulations were half as long. To maximize our use of this minimal data we combined the folded and coil ST data to identify the metastable states used for seeding. Figure 48B shows the populations obtained from this procedure compared to a reference distribution from our long ASM runs and an equivalent amount of ST data started from both folded and coil states. All these populations were determined using the previous state definition. The populations from these short ASM runs were found to be in agreement with the previously determined equilibrium distribution whereas the ST data deviated significantly from equilibrium. To determine the limits of ASM we also performed the same analysis using both fewer and shorter trajectories. First we held the seeding trajectory length constant at 5 ns and varied the number of trajectories initiated from each state, finding that as few as 70 trajectories from each state gave reasonable agreement with the reference distribution. We also held the number of seeding trajectories started from each state constant at 100 and varied the trajectory length, finding that seeding simulations as short as 2 ns gave reasonable agreement with the reference. Thus, the ASM reaches equilibrium at least six times faster than ST. These results demonstrate that the ASM is significantly more efficient than GE simulations for sampling conformational changes that are diffusion controlled, as in hairpin folding.

To address any concerns about the validity of our reference distribution, we also studied a simple model where the equilibrium populations can be computed analytically. The model is based on a discrete-state system introduced by Zwanzig (252) as a simple model for protein folding (see Appendix H for details). There are four metastable states in the system (folded, unfolded and two intermediate states), among which the folded state is favored energetically, while the unfolded state is favored entropically (see Figure 85). This is an attractive system for testing ASM because it has non-Arrhenius folding kinetics, i.e. the folding rate decreases with temperature (see Figure 87).

We compared the efficiency with which ST and ASM reach the equilibrium state populations as a function of the length and number of trajectories. As shown in Figure 92, ASM converges to the correct distribution with 4-7 times shorter simulations than ST. We suggest that seeding simulations longer than the slowest intra-macrostate equilibration time should always be sufficient for convergence. In practice, however, much shorter simulations may be used as discussed before. When using shorter simulations one should test that independent sets of simulations started from different configurations converge to the same distribution and that the equilibrium distribution is invariant with respect to the lag time. We also found that using more than 200 trajectories does not increase the efficiency of ST whereas ASM continues to scale favorably with the number of trajectories up to 600 trajectories in this example. The optimal number of simulations to run depends on one’s tolerance for statistical error. Currently an equal number of simulations are seeded from each state. In the future, however, adaptive sampling (161) could be used to start an optimal number of simulations from each metastable state to further optimize the efficiency of this method.

There are a number of factors contributing to the improved efficiency of ASM. By using short GE simulations to identify the metastable states, the ASM is able to exploit the ability of GE simulations to rapidly cross energetic barriers while avoiding the penalty incurred at high temperatures for entropic barriers by using seeding simulations at low temperatures. Furthermore, only short seeding simulations are necessary because only local, not global, equilibration is required, due to the use of MSMs. Global equilibration metrics like reversible folding require that each simulation is long enough to cross every barrier multiple times. Local equilibration, however, may be obtained with many short simulations run in parallel because each run only has to be long enough to cross a single barrier. By using MSMs to identify the metastable states we can initiate seeding simulations from uncorrelated conformations within every metastable state and thereby ensure every barrier is crossed.

The ASM also has limitations. For example, the initial sampling has to be broad enough to identify all the metastable states. Failure to do so will quickly become apparent as some states will be populated in one dataset but not in another. This situation may be remedied by iterating the ASM: that is, seeding ST simulations from each state to obtain broader sampling, building an MSM to identify the metastable states, and performing new constant temperature seeding simulations. In addition, seeding simulations at physiological temperatures are only able to cross barriers on the order of a few kT. However, this should be sufficient for most biological systems. Finally, we note that the random selection of initial configurations from each state may lead to some error if the seeding simulations are not long enough. In the future, this method might be improved by choosing initial configurations from an equilibrium distribution prepared within each metastable state (54).


EXAMINING THE STATES

Figure 48B shows that the short folded ST data spent a disproportionate amount of time in state 2 while the coil ST data spent a disproportionate amount of time in state 4. Based on this result, we hypothesized that state two is the native state and that state four is a random coil state. To test this hypothesis we extracted representative structures for each state. The representative structure for each state is the configuration with the greatest density of nearby conformations (mathematically this is the conformation with the minimal RMSD to every other conformation in the state).
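As a minimal sketch of how such a representative might be picked, the snippet below selects the conformation with the smallest summed RMSD to all other members of a state. Reading "minimal RMSD to every other conformation" as a minimal sum is an assumption, and the function name and the toy RMSD matrix are hypothetical.

```python
import numpy as np

def representative_index(rmsd_matrix):
    """Index of the conformation with the smallest summed RMSD to all others.

    rmsd_matrix is a symmetric (n, n) array of pairwise RMSDs within one state.
    Using the sum as the density measure is an assumption of this sketch.
    """
    rmsd_matrix = np.asarray(rmsd_matrix, dtype=float)
    return int(np.argmin(rmsd_matrix.sum(axis=1)))

# Hypothetical pairwise RMSDs (in Angstroms) for three conformations in one state.
toy_rmsd = np.array([
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
])
rep = representative_index(toy_rmsd)
```

Here the middle conformation is closest to the others overall, so it is the one returned.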

Figure 50 shows the representative structures for each state. In fact, state two is the folded state, having a well-formed two base pair stem. Our ability to identify this state without including any knowledge of the native state and the fact that it is the most populated state lends credibility to the force field used, AMBER99(60). Furthermore, state four is a random coil. The other states represent various collapsed non-native states. For example, state 1 has native-like base stacking interactions but no clear base pairing between the two sides of the stem. State 3 has interactions between bases 1 and 8 as well as 2 and 7, but they are stacking interactions instead of base pairing interactions. These results are consistent with both experimental and computational work showing that small RNA hairpins have folding intermediates with contacting end residues but without well-formed stems.(52, 222) Fully validating the force field will require longer simulations to get accurate kinetic predictions and more extensive comparisons with experimental observables.


Figure 50. Representative structure for each of the six metastable states. The numbering is the same as in Figures 48 and 49.

CONCLUSIONS

We have introduced the Adaptive Seeding Method (ASM) and shown that it samples significantly more efficiently than GE simulations, which have found widespread use in studying biological systems(27, 28, 219, 237), for a 2D simple potential and RNA hairpins. The ASM takes advantage of the broad sampling possible with GE methods but can more effectively cross entropic barriers using constant temperature simulations. Moreover, by requiring local equilibration rather than global equilibration only relatively short simulations are necessary and these simulations may be run in parallel, rendering the calculation particularly well suited to modern computing clusters. MSMs are then used to extract global equilibrium populations from these short simulations. Besides serving as an efficient sampling algorithm, the ASM also may be used to recover equilibrium properties from non-equilibrium datasets. Thus, the ASM holds great promise for validating force fields and bridging the gap between experimental and computational timescales.

In the future, we plan to apply the adaptive seeding method to larger systems. We also hope to explore alternative sampling methods for identifying initial states. For example, coarse-grained simulations could be used to identify the dominant states of a system and seed all-atom MD simulations that would elucidate the atomic details of the free energy surface. Alternatively, implicit solvent simulations run at low viscosity could be used to rapidly identify the dominant states and seed explicit solvent simulations to provide more accuracy. Finally, adaptive sampling(161) with longer simulations may be used to obtain accurate kinetics from MSMs.


MATERIALS & METHODS

Two distinct sets of ST simulations were run: one started from a folded state and the other from a random coil. An independent MSM was then built for each dataset to identify the dominant metastable states. We used MSMBuilder (10) to build each MSM. Conformations were first split into a large number of microstates using a hierarchical K-medoids clustering algorithm with the all-heavy-atom RMSD as the distance metric (e.g. we generated 1,597 microstates for the long ASM seeding runs). Kinetically related microstates were then lumped together using PCCA (6, 44, 45). One hundred random conformations were then chosen from each state and used as starting points for constant temperature 300K MD simulations, still maintaining two distinct sets of simulations. New MSMs were built from these constant temperature datasets. A Bayesian method (251) was used to calculate the populations of each state with error bars and the models were compared based on these values. The original ST simulations were also extended to match the sampling of the constant temperature simulations. State populations with error bars for these long ST runs were computed using bootstrapping and compared to the populations from the constant temperature simulations. More details are available in the SI.


APPENDIX A: ESTIMATING TRANSITION MATRICES AND EQUILIBRIUM DISTRIBUTIONS

Given our simulation data and assignments thereof to states, it is necessary to estimate the transition probability matrix and the corresponding equilibrium distribution. We have experimented with a number of such methods, all of which give results that are similar to within error for this data set. However, this property should not be assumed of other data sets a priori.

First, we show the standard method for estimating the transition probability matrix T(τ) (or just T for simplicity). The entries of T are the probabilities of transitions from state i to state j in time τ, that is, $T_{ij} = P(i \to j)$. To estimate this, let $C_{ij} = C(i \to j)$ be the number of observed transitions from i to j. Then a reasonable estimate (a maximum likelihood estimate) is $T_{ij} = C_{ij} / C_i$, where

$$C_i = \sum_j C_{ij} \tag{A1}$$

is the number of observed transitions starting in state i.

To estimate the equilibrium distribution of T, one merely has to find the stationary eigenvector of T. Under ideal conditions (if the model is ergodic and irreducible) (253), the stationary eigenvector e is unique and can easily be computed by repeated multiplication of some initial probability density by T, as in Equation A1. Similarly, one could use standard eigenvalue routines to find the eigenvector corresponding to an eigenvalue of 1.
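The estimate and the two routes to the stationary eigenvector described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names and the small count matrix are hypothetical and are not taken from MSMBuilder or from the data set discussed here.

```python
import numpy as np

def estimate_transition_matrix(counts):
    """Row-normalize a count matrix: T_ij = C_ij / C_i."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)  # C_i, one per row
    return counts / row_sums

def stationary_by_power_iteration(T, n_iter=10000, tol=1e-12):
    """Repeatedly multiply an initial (uniform) density by T until it stops changing."""
    p = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(n_iter):
        p_next = p @ T
        if np.abs(p_next - p).max() < tol:
            break
        p = p_next
    return p_next / p_next.sum()

def stationary_by_eigenvector(T):
    """Left eigenvector of T whose eigenvalue is closest to 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(T.T)
    e = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return e / e.sum()

# Hypothetical 3-state count matrix (rows: from-state, columns: to-state).
C = np.array([[90, 10, 0],
              [5, 80, 15],
              [0, 20, 80]])
T = estimate_transition_matrix(C)
e_power = stationary_by_power_iteration(T)
e_eig = stationary_by_eigenvector(T)
```

For an ergodic, irreducible model the two estimates should agree to within the convergence tolerance.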

A possible problem with the standard estimate for T is that the resulting model might not satisfy detailed balance

$$e_i T_{ij} = e_j T_{ji} \tag{A2}$$

where $e_i$ is the equilibrium probability of state i. The naïve solution to this is to symmetrize the count matrix by adding its transpose, which amounts to including the counts that would have arisen from viewing the simulations in reverse. Clearly this procedure is inappropriate for situations not at equilibrium; nonetheless, we sometimes find this procedure useful for equilibrium data due to its ease. Furthermore, if the underlying count matrix is symmetric, one can show that the equilibrium distribution can be obtained simply by dividing the number of observations in each state by the total number of observations.
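A short sketch of the symmetrization procedure follows. The count matrix is hypothetical; the point is only that, once the counts are symmetric, the normalized row sums are stationary and the resulting model satisfies detailed balance.

```python
import numpy as np

# Hypothetical (non-symmetric) observed count matrix.
C = np.array([[90.0, 12.0, 0.0],
              [6.0, 80.0, 18.0],
              [0.0, 14.0, 80.0]])

# Symmetrize by adding the transpose (counts from the time-reversed trajectories).
C_sym = C + C.T

# Row-normalize to obtain the transition matrix.
T_sym = C_sym / C_sym.sum(axis=1, keepdims=True)

# For symmetric counts, the equilibrium distribution is observations per state
# divided by the total number of observations.
e = C_sym.sum(axis=1) / C_sym.sum()

# Flux matrix e_i * T_ij; detailed balance means this matrix is symmetric.
flux = e[:, None] * T_sym
```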

A somewhat more complicated procedure to ensure reversibility is using a maximum likelihood estimate constrained to the set of models satisfying detailed balance. To achieve this, assume that we are given the observed count matrix C. By exploiting the equivalence between this count matrix and a random walk on an edge-weighted undirected graph (160), we then estimate an additional count matrix, X, which we require to be symmetric. We compute X by maximizing the likelihood of X given C; this assumption gives a set of equations that allow the self-consistent calculation of X. More formally, if C is the observed counts, and X is a symmetric matrix that approximates C, then the likelihood is

$$L(X \mid C) = \prod_{i,j} \left( \frac{X_{ij}}{X_i} \right)^{C_{ij}} \tag{A3}$$

Maximizing the likelihood yields the following equation, which we solve by self-consistent iteration,

$$X_{ij} = \frac{C_{ij} + C_{ji}}{\dfrac{C_i}{X_i} + \dfrac{C_j}{X_j}} \tag{A4}$$

where Ci and Xi are defined as the row sums of C and X, respectively, as in Equation A1. In our experience, this method works but it can be slow for the large matrices we consider. Furthermore, statistical noise in the count data can dominate the resulting equilibrium distribution and even cause the self-consistent iterations to diverge.
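The self-consistent iteration can be sketched as a fixed-point loop on the update $X_{ij} \leftarrow (C_{ij} + C_{ji}) / (C_i/X_i + C_j/X_j)$. This is a dense-matrix illustration under stated assumptions: the function name, the starting guess $X = C + C^T$, the stopping rule, and the example counts are all choices made for this sketch, and every state is assumed to have at least one observed count.

```python
import numpy as np

def reversible_mle_counts(C, n_iter=100000, tol=1e-10):
    """Iterate X_ij <- (C_ij + C_ji) / (C_i / X_i + C_j / X_j) to a fixed point.

    Assumes every row of C has at least one count. The initial guess
    (X = C + C^T) and the relative stopping rule are choices of this sketch.
    """
    C = np.asarray(C, dtype=float)
    C_sym = C + C.T          # numerator: C_ij + C_ji
    C_i = C.sum(axis=1)      # row sums of C
    X = C_sym.copy()         # symmetric starting guess
    for _ in range(n_iter):
        q = C_i / X.sum(axis=1)                  # C_i / X_i
        X_new = C_sym / (q[:, None] + q[None, :])
        converged = np.abs(X_new - X).max() < tol * X.max()
        X = X_new
        if converged:
            break
    return X

# Hypothetical observed counts.
C = np.array([[90.0, 12.0, 0.0],
              [6.0, 80.0, 18.0],
              [0.0, 14.0, 80.0]])
X = reversible_mle_counts(C)
T_rev = X / X.sum(axis=1, keepdims=True)   # reversible transition matrix
e_rev = X.sum(axis=1) / X.sum()            # its equilibrium distribution
```

Because X stays symmetric at every step, the transition matrix obtained by row-normalizing it satisfies detailed balance by construction. The update is invariant to an overall rescaling of X, so only the normalized quantities are meaningful.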

A final method is that of Bacallado et al. (254), which uses Bayesian inference with a prior on the space of matrices satisfying detailed balance. This method is formally the most sound, as it uses Bayesian inference and includes a powerful prior. However, it is much more computationally demanding than the other methods. Thus, this method was also applied to the data in order to assess the validity of the simpler methods.

We find that the four methods mentioned above give similar results for the underlying equilibrium distribution of this dataset, indicating that we have achieved equilibrium sampling. As such, we have used the naïve method of symmetrizing the count matrix because it is computationally efficient and because, with so much data, our data set is very close to equilibrium. In general, however, we stress that either the maximum likelihood or Bayesian methods should be used.


APPENDIX B: THE POSSIBILITY OF LONGER TIMESCALES THAN THE IMPLIED TIMESCALES

Here we show a simple model demonstrating that the rates for transitioning between some states in an MSM under a two-state assumption (as used in the maximum likelihood approach of Ensign et al. (58)) may be slower than the implied timescales. First we define a four state system that satisfies detailed balance

0.949, 0.050, 0.001, 0.000  0.001, 0.949, 0.000, 0.050  T  )(    0.001, 0.000, 0.998, 0.001    0.000, 0.001, 0.001, 0.998 

This system is depicted in Figure 51A.

The eigenvalues of this system are 1, 0.997, 0.95559, and 0.94141 and we will assume a lag time of 1 in arbitrary units. Thus, disregarding the eigenvalue of one corresponding to the equilibrium distribution, there are three implied timescales: 332.785, 22.0139, and 16.5627.
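The eigenvalues and implied timescales quoted above can be recomputed from the four-state matrix with NumPy. The sketch uses the standard relation between an eigenvalue λ and its implied timescale, −τ / ln λ, with a lag time of 1; the variable names are illustrative.

```python
import numpy as np

# Four-state transition matrix from Appendix B (rows sum to 1).
T = np.array([
    [0.949, 0.050, 0.001, 0.000],
    [0.001, 0.949, 0.000, 0.050],
    [0.001, 0.000, 0.998, 0.001],
    [0.000, 0.001, 0.001, 0.998],
])
lag_time = 1.0  # arbitrary units

# Eigenvalues sorted from largest to smallest; the first one is 1.
eigenvalues = np.sort(np.linalg.eigvals(T).real)[::-1]

# Implied timescales for every eigenvalue except the stationary one.
implied_timescales = -lag_time / np.log(eigenvalues[1:])
```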

We can write the probability of transitioning between two states as

$$p = 1 - e^{-\tau/\omega} \tag{B1}$$

where ω is the average timescale for the transition (this notation deviates from the standard notation of τ but avoids confusion with the lag time). Rearranging, we find

$$\omega = -\frac{\tau}{\ln(1 - p)} \tag{B2}$$

Plugging our transition probabilities into this equation we arrive at the average timescales for transitioning between each pair of states shown in Figure 51B. Many of these timescales are as high as 1,000 units, much greater than the largest implied timescale of ~332 units. In principle, one could monitor these average timescales, resulting in apparent timescales longer than the implied timescales of the system.
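A small sketch of this conversion, applied to the two distinct nonzero off-diagonal probabilities of the model (0.050 and 0.001), is given below. The function name is hypothetical, and the lag time is set to 1 as in the text.

```python
import numpy as np

def two_state_timescale(p, lag_time=1.0):
    """Average timescale omega = -tau / ln(1 - p) for a transition with probability p per lag time."""
    return -lag_time / np.log1p(-p)

# The two distinct off-diagonal transition probabilities in the model system.
omega_fast = two_state_timescale(0.050)
omega_slow = two_state_timescale(0.001)
```

The slower of the two comes out close to 1,000 lag times, roughly three times the largest implied timescale.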

Figure 51. Graph depiction of the model system defined in Appendix B with edges labeled by A) their probability and B) their average timescale under a two-state assumption.


APPENDIX C: SUPPORTING INFORMATION FOR CHAPTER 3

MOLECULAR DYNAMICS SIMULATION

Distributed molecular dynamics simulations on GPUs were performed using an accelerated version of GROMACS (255) written specifically for GPUs (80) using the Folding@Home platform (79). The AMBER ff96 (60) forcefield was used with the generalized Born/surface area (GBSA) implicit solvent model of Onufriev, Bashford and Case (81). AMBER ff96 has been reported to have more accurate secondary structure propensities when used with GBSA (82). Up to 10,000 parallel simulations (each with randomized Boltzmann-distributed initial velocities) were simulated at 300K, 330K, 370K and 450K, from several different initial starting states. Due to the nature of distributed computing, in which uncoupled simulations are used to produce successive trajectory segments, a broad distribution of trajectory lengths is obtained (see Figure 11b in the main text). Stochastic integration was performed using a time step of 2 fs and Berendsen temperature coupling. A water-like solvent (shear) viscosity of 91 ps⁻¹ was used, with full O(N²) electrostatic and vdW interactions. Hydrogen bond lengths were constrained using the SHAKE algorithm. Trajectory snapshots were recorded every 1 ns.

Starting conformations for the native state of NTL9 were taken from the crystal structure 1DIV, and steepest-descent minimized for 5000 steps. (Minimization was done using the GBSA model of Still et al. (256)) Five starting conformations for the random coil ensemble were taken from snapshots of Monte Carlo trajectories in which dihedral angles were randomized under a potential rewarding compact Rg. The dihedral probabilities came from the TOP500 database (257). Starting conformations for extended structures were constructed by setting dihedral angles to their canonical values.


MARKOV STATE MODEL (MSM) CONSTRUCTION

We used the MSMBuilder package (10), modified to use sparse matrices (4), to construct an MSM for NTL9(1-39). First, 100,000 microstates were generated by clustering conformations separated by 10 ns. The remaining 90% of the data was then assigned to these clusters. The resulting microstates had an average radius of ~4.5 Å, where the radius of a cluster is defined as the largest distance between any conformation in that cluster and the cluster center. The implied timescales were then calculated for lag times from 1 to 32 ns at 4 ns intervals and found to level off at ~12 ns (Figure 52a), implying a 12 ns Markov time. Finally, we generated a macrostate model (Figure 53) by lumping microstates into 2,000 macrostates using the PCCA+ algorithm (44) and verified that the implied timescales still leveled off on a similar timescale (Figure 52b).

We have confirmed the statistical accuracy of our equilibrium populations for the 2,000 state model using a Bayesian method (251). This analysis reveals that the statistical uncertainty in the population of any state ranges from 0.2% to 2% of that state's population (0.7% on average). Unfortunately, it is not possible to rigorously address any systematic error in our model without an independent data set to compare to.

TRANSITION PATHWAY THEORY (TPT) ANALYSIS

Many of our calculations are modeled on those in (3, 87, 258). However, we have chosen a slightly different algorithm for decomposing the reactive flux into individual pathways. Given a folded state B, an unfolded state A, and the matrix of net reactive flux F, our greedy backtracking decomposition works as follows:

1. Start at the folded state B. Label this state x1.

2. Choose the state whose net flux into x1 is maximal.


3. Next, choose the state x2 such that the net flux from x1 to x2 is maximal.

We repeat this process for each state xn-1, choosing the next state xn such that the flux from xn-1 to xn is maximal, until we reach state xn = A.

Upon completion, we have produced a series of states (x1, … , xn) defining a pathway. We define the flux along this pathway as the minimum of the fluxes along its edges, $\min_i F(x_i \to x_{i+1})$. We then subtract this flux from each of the pathway's edges in the original flux matrix. Finally, we repeat the same algorithm on the new flux matrix to produce additional pathways.

The result of this algorithm is a set of pathways and their associated fluxes.
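The decomposition can be sketched as follows. This is one possible reading of the steps above, not the code used for this work: it assumes F[i, j] holds the net reactive flux from state i to state j, backtracks from the folded state by always stepping to the state with the largest remaining flux into the current one, and refuses to revisit states as a guard against cycles. The function name and the four-state flux matrix are hypothetical.

```python
import numpy as np

def greedy_backtrack_pathways(F, source, sink, max_paths=10, eps=1e-12):
    """Decompose a net flux matrix into (pathway, flux) pairs by greedy backtracking.

    F[i, j] is assumed to be the net flux from i to j. Each pathway is found by
    starting at `sink` and repeatedly moving to the state with the largest
    remaining flux into the current state, until `source` is reached.
    """
    F = np.array(F, dtype=float)  # work on a copy
    pathways = []
    for _ in range(max_paths):
        path = [sink]
        while path[-1] != source:
            inflow = F[:, path[-1]].copy()
            inflow[path] = 0.0                 # assumption: never revisit a state
            prev = int(np.argmax(inflow))
            if inflow[prev] <= eps:            # no remaining flux into this state
                break
            path.append(prev)
        if path[-1] != source:
            break                              # remaining flux is exhausted
        path.reverse()                         # report as source -> ... -> sink
        edges = list(zip(path[:-1], path[1:]))
        flux = min(F[i, j] for i, j in edges)  # bottleneck flux of the pathway
        for i, j in edges:
            F[i, j] -= flux                    # remove it from the flux matrix
        pathways.append((path, flux))
    return pathways

# Hypothetical net flux matrix: unfolded state 0, intermediates 1 and 2, folded state 3.
F_toy = np.zeros((4, 4))
F_toy[0, 1], F_toy[1, 3] = 0.6, 0.6
F_toy[0, 2], F_toy[2, 3] = 0.4, 0.4
paths = greedy_backtrack_pathways(F_toy, source=0, sink=3)
```

In this toy case the flux splits into two pathways, one through each intermediate, and the pathway fluxes add up to the total flux leaving the unfolded state.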

STRUCTURAL ANALYSIS OF MACROSTATE ENSEMBLES

Because macrostate conformational ensembles can be somewhat heterogeneous and diffuse, we used a metric, which we call the Q-value, that quantifies the extent of native-like structure without using predetermined reaction coordinates or requiring artificial thresholds for native contacts.

For each macrostate, we define a vector c(x) indexed by x = (i, j), denoting a contact between residues i and j. The entries of c(x) are continuous (non-integer) values between 0 and 1, representing the fraction of the ensemble for which the alpha-carbons of residues i and j are closer than 8 Å. We will call c(x) a contact profile. We define the Q-value of a given c as its projection onto the contact profile of the “native” macrostate (state n), cnat.
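A contact profile of this kind can be sketched as below. The array layout (frames × residues × 3 alpha-carbon coordinates in Å), the function name, and the toy coordinates are assumptions of this sketch; only the 8 Å cutoff comes from the text, and the projection that defines the Q-value is not reproduced here.

```python
import numpy as np

def contact_profile(ca_coords, cutoff=8.0):
    """Fraction of frames in which each pair of alpha-carbons is closer than `cutoff`.

    ca_coords: array of shape (n_frames, n_residues, 3), in Angstroms.
    Returns an (n_residues, n_residues) matrix with entries between 0 and 1.
    """
    ca_coords = np.asarray(ca_coords, dtype=float)
    diffs = ca_coords[:, :, None, :] - ca_coords[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)     # (n_frames, n_res, n_res)
    return (dists < cutoff).mean(axis=0)

# Hypothetical two-frame "ensemble" of three residues placed along the x axis.
frames = np.array([
    [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [20.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [15.0, 0.0, 0.0]],
])
profile = contact_profile(frames)
```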


The Q-value for the “native” macrostate (state n) is unity, and less native-like contact profiles will have lower Q-values. Because a contact profile can only contain entries between 0 and 1, Q is always positive.

Moreover, we also define Q-values for particular structural elements by restricting a contact profile to a particular subspace of contacts. For example, Qβ12 is the Q-value when c is restricted to a subspace where x ∈ β12, a set of contacts corresponding to pairings between beta-strands β1 and β2. We examined Q-values for three native structural elements: Qα, Qβ12, and Qβ13, based on the subsets of the “native” (state n) contact profile (Figure 54). For clarity, we call the Q-value for the entire set of contacts Qtotal.

ANALYSIS OF STATES ALONG FOLDING PATHWAYS: COMPARISON BETWEEN SECONDARY STRUCTURE FORMATION AND REACTION PROGRESS (PFOLD)

How heterogeneous are the possible pathways for folding? One way to examine this question is to compare the secondary structure formed in a given state with its position along the reaction pathway. In Figure 55 and Figure 56, we use a simple metric to plot the secondary structure bias, namely the difference between alpha helical and beta sheet contacts Qα – (Qβ12 + Qβ13)/2 of a given state, and compare this to the position of the state along the reaction pathway as determined by its committor or pfold value. From these figures, it can be seen that 1) the “unfolded” state (a) contains residual native-like helical propensity, and 2) pathways involving various orderings of native-like helix and sheet formation are possible.

The contact profiles (see Figure 55) for these states demonstrate the existence of non-native contacts in some states as well as the fact that certain contacts are present more commonly in a given state. For example, states h, i, j, and k all have a mixture of some contacts which are very prevalent (dark black) and some which are only partially formed (light gray to gray), whereas state g has fewer contacts, but all prevalent. The nature of the heterogeneity even within a state highlights the ensemble nature of this form of analysis, as well as the degree to which a given state is structured (and in which parts).

Finally, we find varying degrees of structure in a given state for the natural independent folding units (i.e. the alpha helix, β12, and β13). This is shown in Figure 57, where we see significant diversity among states and pathways in terms of the secondary structure formed.

HOW DOES NTL9 FOLD IN OUR SIMULATIONS?

In order to understand how NTL9 folds, a natural approach is to analyze the pathways found in terms of existing theories for protein folding. The highest-flux pathways in our mesoscopic model are a→m→n and a→l→n. Both pathways are direct routes from disordered to highly-structured macrostates, reminiscent of a nucleation-condensation mechanism (259). This picture is consistent with the cooperative two-state kinetics observed in stopped-flow re-folding experiments (78). While these pathways show concomitant formation of helix and hairpin structures, the intervening states l and m differ (mostly) in the β12 hairpin registration (see Figure 57). The large pfold values of states l and m, and their obligate presence in the two highest-flux pathways from a to n, suggest that to some extent, the states l, m and n can be considered a very native-like “molten-globule”, in which the details of tertiary arrangement are sorted out after overcoming the main barrier to folding. Kinetics between such metastable states would be difficult to detect experimentally using a single fluorescent reporter, and in the nucleation-condensation view such events might be described using the encompassing term “condensation”.

At the same time, the structural diversity along the folding pathways we analyze corresponds well with many models of hierarchical folding. In general, the macrostates with low pfold values have a baseline of native helicity, with the full extent of native beta-sheet structure occurring later in the folding reaction (Figure 56). This is consistent with the idea that local structures such as helices form early, with non-local structures such as beta-hairpins and beta-sheets forming later in the reaction. Macrostates b through f (which have low pfold values and are involved early in the folding reaction) contain a variety of distinct non-native structural elements, particularly non-native hairpin and sheet arrangements (see Figure 13, and Figure 55). This is reminiscent of hierarchical mechanisms such as diffusion-collision (260) where competing ‘foldons’ (86) form as kinetically metastable units, and are cooperatively stabilized when in a native-like arrangement. The heterogeneous sequences of secondary structure formation in pathways a→h→k→m→n (in which the central helix forms first) versus pathway a→g→l→n (in which the hairpin structure forms first) suggest that independent folding units can form and coalesce in any order.

We stress that there need not be a single pathway or single, dominant mechanism for folding. Moreover, the various theories proposed for how proteins fold, such as a diffusion-collision or nucleation-condensation mechanism, are based on physical principles broadly relevant for proteins. Therefore, it is natural to imagine that multiple mechanisms could be simultaneously present, but that the sequence of the protein, coupled with the chemical environment (solvent conditions, temperature, pH, etc), would control the balance of the degree to which each mechanistic pathway is seen.


Figure 52. (a) Implied timescales for a series of 100,000-microstate Markov State Models (MSMs) built at lag times between 1 and 32 ns. As the longest timescale levels off beyond a lag time of 10 ns, a lag time of 12 ns was chosen to build subsequent MSMs. The spectral gap present at all lag times indicates apparent two-state folding kinetics. (b) The implied timescales for a 2000-macrostate model built by lumping states from the microstate MSM show a similar spectral gap and leveling off of time scales. The faster implied timescales of the macrostate model at short lag times are due to lumping effects. (c) The 10 slowest implied timescales for the 2000 state models, with error analysis from a bootstrapping procedure. Error bars represent the standard deviation from the bootstrap analysis.


Figure 53. A scatter plot of the 2000 macrostates obtained by lumping the 100,000-state MSM calculated from the simulation data at 370K. The RMSD-to-native is calculated using the peptide backbone residues, with respect to the native starting state. The free energy of each microstate i is computed as –kT ln (pi /p0), where pi is the equilibrium probability of the microstate, and p0 is an arbitrary reference (in this case, max(pi)). Shown in red are the 14 macrostates transited by the top ten pathway fluxes, labeled with the same letters as in Figure 13. In this mesoscopic view, we find that 1) the macrostates are diffuse collections of conformational states, 2) there are multiple folding pathways along these metastable states, and 3) we can identify highly populated “native” (state n) and “unfolded” (state a) macrostates that dominate the observed relaxation rates. The red arrow is meant to guide the eye in illustrating a “mesoscopic” view of the transition state barrier: the “unfolded” state (a) and “native” state (n) are at free energy minima, while intermediate RMSD values have macrostates with higher free energies.

Figure 54. Contact profile subspaces used to calculate Qα, Qβ12, and Qβ13, which quantify the extent of native-like structuring for helix formation, beta-strand 1 and 2 pairing, and beta-strand 1 and 3 pairing, respectively.


Figure 55. Here, contact profiles (see definition above) for the 14 macrostates involved in the top ten folding pathways are plotted in a similar fashion to Figure 55. For clarity, the pathway arrows have been removed. Each contact profile is a 39 x 39 matrix of inter-residue contacts, showing the contact fraction on a linear grayscale from 0 (white) to 1 (black).

Figure 56. Here, values of Qα (yellow), Qβ12 (red), and Qβ13 (blue) are plotted in a bar graph for each of the 14 macrostates involved in the top ten folding pathways. The layout is in a similar fashion to Figure 56.


Figure 57. Macrostates l, m and n (the “native” state) have very similar structural ensembles and similar pfold values (pfold > ~0.93). To examine the subtle differences in their macrostate contact profiles, we computed difference contact profiles for (l-m), (n-l) and (n-m) transitions. These difference maps reveal that these states differ mostly in their hairpin registrations and packing of the hairpin loop.


APPENDIX D: SUPPORTING INFORMATION FOR CHAPTER 4

VILLIN MSM

LUMPING INTO MACROSTATES

To identify metastable states in villin, we lumped kinetically related microstates into 500 macrostates (all having self-transition probabilities >0.5) using the PCCA+ algorithm (20, 261). Figure 58 shows the implied timescales for this macrostate MSM. While they are somewhat shorter than at the microstate level (4), their leveling off at lag times of 10-15 ns indicates that the model is Markovian at these timescales (6, 34).

Mean first passage times (MFPTs) between pairs of states can then be calculated as in Ref (161). The equilibrium probability of each state can be obtained by normalizing the first eigenvector of the transition probability matrix. Finally, the relative entropy between two MSMs is calculated as in Ref (18)

$$D(P \,\|\, Q) = \sum_{i,j}^{N} P_i P_{ij} \log \frac{P_{ij}}{Q_{ij}}$$

where Pi is the equilibrium probability of state i, Pij is the probability of transitioning from state i to state j during one lag time, N is the number of states, P is the reference model, and Q is a test model (in this case generated from a subset of the data).
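A minimal sketch of this relative entropy is shown below, assuming natural logarithms (the base is not stated here) and skipping terms with $P_{ij} = 0$, which contribute nothing to the sum. The function name and the two-state matrices are hypothetical.

```python
import numpy as np

def msm_relative_entropy(P, Q, pi):
    """Sum over i, j of pi_i * P_ij * log(P_ij / Q_ij).

    P is the reference transition matrix, Q the test transition matrix, and pi
    the equilibrium distribution of P. Natural logarithms are an assumption.
    Entries with P_ij == 0 are skipped; Q_ij == 0 where P_ij > 0 gives infinity.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    pi = np.asarray(pi, dtype=float)
    weights = pi[:, None] * P
    mask = P > 0
    return float(np.sum(weights[mask] * np.log(P[mask] / Q[mask])))

# Hypothetical two-state reference and test models.
P_ref = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
Q_test = np.array([[0.8, 0.2],
                   [0.3, 0.7]])
pi_ref = np.array([2.0 / 3.0, 1.0 / 3.0])   # stationary distribution of P_ref

d_same = msm_relative_entropy(P_ref, P_ref, pi_ref)
d_diff = msm_relative_entropy(P_ref, Q_test, pi_ref)
```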

Figure 62 shows the relative entropy for varying numbers of simulations (up to 40,000) of a given length (up to 400 nanoseconds). This figure highlights the fact that too small numbers of too short simulations are less valuable than a single long simulation with an equivalent aggregate amount of data but that the simulation length at which this breakdown occurs decreases for increasing numbers of simulations.


BARRIER HEIGHTS

As further confirmation of the relevance of our macrostates, we have developed a simple Bayesian approach for estimating the free energy barrier heights between states under the simplifying assumption that we can examine pairs of connected states independently. To begin with, we make the following definitions:

$C \equiv$ transition count matrix, with $C_{ij}$ giving the number of observed i→j transitions and $C_i$ the number of counts originating in state i

$P \equiv$ transition probability matrix obtained by normalizing each row of C ($P_{ij} = C_{ij} / C_i$)

$\Delta G^{\ddagger}_{ij} \equiv$ barrier height for i→j transitions

$\nu_{ij} \equiv$ attempt frequency for i→j transitions

$n_{ij} \equiv$ number of attempted i→j transitions

$k_{ij} \equiv$ rate of i→j transitions

$k_B \equiv$ Boltzmann’s constant

$T \equiv$ temperature (300 K for this work)

$\tau \equiv$ lag time of the MSM

The quantity we wish to obtain is the posterior distribution over barrier heights given our data (the count matrix, C). However, to account for the attempt frequency we must begin with


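$$P(\Delta G^{\ddagger}_{ij}, \nu_{ij} \mid C) = \frac{P(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij}) \, P(\Delta G^{\ddagger}_{ij}, \nu_{ij})}{P(C)}$$

where the equality comes from applying Bayes’ rule. We can then integrate out the attempt frequency to obtain

$$P(\Delta G^{\ddagger}_{ij} \mid C) = \frac{\int P(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij}) \, P(\Delta G^{\ddagger}_{ij}, \nu_{ij}) \, d\nu_{ij}}{P(C)}$$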

We now assume that the barrier height and attempt frequency are independent and assign them each a uniform prior.

$$P(\Delta G^{\ddagger}_{ij}, \nu_{ij}) = P(\Delta G^{\ddagger}_{ij}) \, P(\nu_{ij})$$

$$P(\Delta G^{\ddagger}_{ij}) = c$$

$$P(\nu_{ij}) = c$$

where c is a constant.

We can also put bounds on the attempt frequency from our observed data by recognizing that the number of attempts is at least as many crossing events as were observed and no greater than the number of observed crossings plus the number of self-transitions.

$$\frac{C_{ij}}{C_i} \le \nu_{ij} \le \frac{C_{ij} + C_{ii}}{C_i}$$

where the denominator gives the total time spent in state i and the numerator gives the number of attempts at transitioning from state i to state j. Using $n_{ij}$ to denote the number of attempts at transitioning from state i to state j, we can also write

$$\nu_{ij} = \frac{n_{ij}}{C_i}$$


Using these priors and bounds, we obtain

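$$P(\Delta G^{\ddagger}_{ij} \mid C) = \frac{\int P(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij}) \, P(\Delta G^{\ddagger}_{ij}) \, P(\nu_{ij}) \, d\nu_{ij}}{P(C)}$$

$$P(\Delta G^{\ddagger}_{ij} \mid C) = \frac{\sum_{n_{ij} = C_{ij}}^{C_{ij} + C_{ii}} P\!\left(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij} = \tfrac{n_{ij}}{C_i}\right) P(\Delta G^{\ddagger}_{ij}) \, P(n_{ij})}{P(C)}$$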

Given a particular value of the attempt frequency (or equivalently, the number of attempts) we can write

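$$P\!\left(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij} = \tfrac{n_{ij}}{C_i}\right) = \binom{n_{ij}}{C_{ij}} P_{ij}^{\,C_{ij}} \left(1 - P_{ij}\right)^{n_{ij} - C_{ij}}$$

where $\binom{n_{ij}}{C_{ij}}$ denotes “n choose c”. Using the simple rate equation

$$P_{ij} = 1 - \exp\left(-k_{ij} \tau\right)$$

and, from transition state theory,

$$k_{ij} = \nu_{ij} \exp\left(-\frac{\Delta G^{\ddagger}_{ij}}{k_B T}\right)$$

we finally obtain

$$P(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij}) = \binom{n_{ij}}{C_{ij}} \left(1 - e^{-\nu_{ij} \tau \exp\left(-\Delta G^{\ddagger}_{ij} / k_B T\right)}\right)^{C_{ij}} \left(e^{-\nu_{ij} \tau \exp\left(-\Delta G^{\ddagger}_{ij} / k_B T\right)}\right)^{n_{ij} - C_{ij}}$$

$$P\!\left(C \mid \Delta G^{\ddagger}_{ij}, \tfrac{n_{ij}}{C_i}\right) = \binom{n_{ij}}{C_{ij}} \left(1 - e^{-\frac{n_{ij}}{C_i} \tau \exp\left(-\Delta G^{\ddagger}_{ij} / k_B T\right)}\right)^{C_{ij}} \left(e^{-\frac{n_{ij}}{C_i} \tau \exp\left(-\Delta G^{\ddagger}_{ij} / k_B T\right)}\right)^{n_{ij} - C_{ij}}$$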

The denominator, P(C), is then obtained by normalization.


Using these equations gives a posterior distribution of barrier heights for every possible transition. Since there are thousands of possible transitions, it is impractical to examine them all. Instead, we have calculated the expected barrier height for each transition. The mean expected barrier height is 5.9 (+/- 2.5) kT, indicating that most of the states are potentially detectable (separated by reasonable barriers) and that the distribution of barrier heights is quite broad.

As a consistency check, it is also possible to solve for the barrier height in terms of the observed counts and attempt frequency.

$$\Delta G^{\ddagger}_{ij} = -k_B T \ln\left[ \frac{-\ln\left(1 - \dfrac{C_{ij}}{C_i}\right)}{\dfrac{n_{ij}}{C_i} \, \tau} \right]$$

One can then plug in the lower and upper bounds for the attempt frequency and ensure that these encompass the Bayesian result.
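The consistency check can be sketched numerically as follows. The sketch assumes that time is measured in units of the lag time (so that τ = 1 and the attempt frequency is simply $n_{ij}/C_i$ attempts per lag time) and reports the barrier in units of $k_B T$; the function name and the example counts are hypothetical.

```python
import math

def barrier_height_kT(c_ij, c_i, n_ij):
    """Barrier height in units of k_B T, with time measured in lag times (tau = 1).

    Combines P_ij = 1 - exp(-k_ij) with k_ij = nu_ij * exp(-dG / k_B T),
    using P_ij = C_ij / C_i and nu_ij = n_ij / C_i.
    """
    rate = -math.log1p(-c_ij / c_i)      # k_ij per lag time
    attempt_frequency = n_ij / c_i       # nu_ij per lag time
    return -math.log(rate / attempt_frequency)

# Hypothetical counts: 1000 counts leaving state i, 10 i->j transitions, 900 self-transitions.
c_i, c_ij, c_ii = 1000, 10, 900

# Fewest possible attempts (every attempt succeeded) and most possible attempts.
lower = barrier_height_kT(c_ij, c_i, n_ij=c_ij)
upper = barrier_height_kT(c_ij, c_i, n_ij=c_ij + c_ii)
```

With the fewest possible attempts the barrier is essentially zero, and with the most possible attempts it is several $k_B T$; under these assumptions the Bayesian estimate for the same counts would be expected to fall between the two.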

SIMPLE MODELS

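TRANSITION COUNT MATRICES

The transition count matrices for simple models S, P, and H ($C_S$, $C_P$, and $C_H$ respectively) are

$$C_S = \begin{pmatrix} 6{,}000 & 3 & 0 & 0 & 0 & 0 \\ 3 & 1{,}000 & 3 & 0 & 0 & 0 \\ 0 & 3 & 1{,}000 & 3 & 0 & 0 \\ 0 & 0 & 3 & 1{,}000 & 3 & 0 \\ 0 & 0 & 0 & 3 & 1{,}000 & 3 \\ 0 & 0 & 0 & 0 & 3 & 90{,}000 \end{pmatrix}$$

$$C_P = \begin{pmatrix} 6{,}000 & 2 & 2 & 0 & 0 & 0 \\ 2 & 1{,}000 & 0 & 2 & 2 & 0 \\ 2 & 0 & 1{,}000 & 2 & 2 & 0 \\ 0 & 2 & 2 & 1{,}000 & 0 & 2 \\ 0 & 2 & 2 & 0 & 1{,}000 & 2 \\ 0 & 0 & 0 & 2 & 2 & 90{,}000 \end{pmatrix}$$

$$C_H = \begin{pmatrix} 3{,}500 & 2 & 0 & 0 & 0 & 0 \\ 2 & 1{,}000 & 2 & 0 & 0 & 2 \\ 0 & 2 & 1{,}000 & 2 & 0 & 2 \\ 0 & 0 & 2 & 1{,}000 & 2 & 2 \\ 0 & 0 & 0 & 2 & 3{,}500 & 0 \\ 0 & 2 & 2 & 2 & 0 & 90{,}000 \end{pmatrix}$$

where the entry in row i and column j gives the number of transitions observed from state i to state j. State 0 is unfolded, 1-4 are intermediates, and 5 is the native state.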

To generate synthetic simulations from a transition count matrix we first normalize each row to obtain a transition probability matrix. At each time step (or each lag time), the next state is chosen according to the distribution of transition probabilities for the current state.
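A minimal sketch of this procedure is shown below, together with a helper that extracts a first passage time from a trajectory. The function names, the three-state count matrix, and the random seed are hypothetical and are not the matrices or settings used for the simple models.

```python
import numpy as np

def simulate_trajectory(counts, start_state, n_steps, rng):
    """Generate a synthetic state trajectory from a transition count matrix.

    Each row of `counts` is normalized to transition probabilities, and at every
    step the next state is drawn from the row of the current state.
    """
    counts = np.asarray(counts, dtype=float)
    T = counts / counts.sum(axis=1, keepdims=True)
    traj = np.empty(n_steps + 1, dtype=int)
    traj[0] = start_state
    for t in range(n_steps):
        traj[t + 1] = rng.choice(T.shape[0], p=T[traj[t]])
    return traj

def first_passage_time(traj, target):
    """Index of the first frame in `target`, or None if it is never reached."""
    hits = np.flatnonzero(np.asarray(traj) == target)
    return int(hits[0]) if hits.size else None

# Hypothetical 3-state linear model: state 0 unfolded, 1 intermediate, 2 native.
toy_counts = np.array([[60, 3, 0],
                       [3, 10, 3],
                       [0, 3, 900]])
rng = np.random.default_rng(0)
toy_traj = simulate_trajectory(toy_counts, start_state=0, n_steps=500, rng=rng)
toy_fpt = first_passage_time(toy_traj, target=2)
```

Repeating the simulation many times and collecting the first passage times gives a distribution of first folding times of the kind analyzed for the simple models.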

FOLDING SIMULATIONS

MFPTs from the unfolded state(s) to the native state, given in Table 1, were calculated following Ref (161). The distribution of first folding times was determined by first running 10,000 simulations of 50,000 steps each started from state 0 (for model H half the simulations were started from state 4). The first folding time of each simulation was then calculated and these values were plotted as a histogram with 100 bins. The lag phase was determined by finding the first folding time with the maximum probability. Exponential fits were calculated by fitting to the first 50 bins after the lag phase (to avoid noise in less populated bins at longer first folding times). The exponential fits and lag phases are also given in Table 1. Similar results were obtained by randomizing matrix elements while maintaining the network topology, subject to the constraints of detailed balance and metastability. Example matrices include

$$C_{S,\mathrm{rand}} = \begin{pmatrix} 6{,}000 & 3 & 0 & 0 & 0 & 0 \\ 3 & 3{,}000 & 3 & 0 & 0 & 0 \\ 0 & 3 & 1{,}000 & 3 & 0 & 0 \\ 0 & 0 & 3 & 500 & 3 & 0 \\ 0 & 0 & 0 & 3 & 1{,}000 & 3 \\ 0 & 0 & 0 & 0 & 3 & 90{,}000 \end{pmatrix}$$

and

$$C_{H,\mathrm{rand}} = \begin{pmatrix} 3{,}500 & 51 & 0 & 0 & 0 & 0 \\ 51 & 1{,}000 & 68 & 0 & 0 & 70 \\ 0 & 68 & 1{,}000 & 17 & 0 & 92 \\ 0 & 0 & 17 & 1{,}000 & 97 & 42 \\ 0 & 0 & 0 & 97 & 3{,}500 & 0 \\ 0 & 70 & 92 & 42 & 0 & 90{,}000 \end{pmatrix}$$


Figure 58. Implied timescales for the villin macrostate MSM.

Figure 59. Distribution of MFPTs between all pairs of non-native states for villin (A) on a linear scale to demonstrate the peak does not shift significantly relative to the distribution shown in Figure 18B and (B) on a log scale to highlight that the tail of the distribution does extend to about 60 ns.


Figure 60. Distributions of the MFPTs (A) from each non-native state to the native state and (B) between every pair of non-native states for our 2,000 state NTL9(1-39) model. As discussed in Ref (93), further refinement of this model is likely necessary. However, we do not expect the qualitative trend of long timescales (relative to folding) for transitioning between unfolded states to change.

Figure 61. Two conformations from different unfolded basins demonstrating the structural heterogeneity of non-native states (especially in their non-native contacts) that, in combination with the vastness of conformational space, results in slow transitions between unfolded states. The structures are colored red to blue from the N-terminus to the C-terminus. Atoms for residues Arg 14, Trp 23, and Lys 32 are shown to highlight that 23 and 32 are in contact on the left while the chain has rearranged such that 14 and 32 are in contact on the right. These images were made with VMD (67).

Figure 62. Relaxation of the fraction folded starting from equally populated unfolded states (black is data and blue is single exponential fit with τ≈810 ns). The beginning of the curve is dominated by single exponential relaxation but deviations from this apparent two-state behavior become apparent later.


Figure 63. Relaxation of the fraction unfolded for a villin model at the microstate level (thick black line) and a biexponential fit (thin blue line) with time constants of ~60 and ~415 ns, at least qualitatively consistent with time constants of ~70 and ~720 ns from experiment (56). We hope to explain this behavior in a future work on villin. As in Ref. (4), the native state was defined as all microstates with an average Cα RMSD to the crystal structure less than 3 Å.


Figure 64. The distance to the gold-standard model, measured via the relative entropy, for 40,000 trajectories up to 400 nanoseconds in length. The black lines are contours of equal amounts of data. Again, there was insufficient data to resolve the upper right-hand corner of the plot.

Model    Exponential Fit    MFPT      Lag Phase
S        12,800             13,400    2,500
P        4,500              5,000     1,000
H        3,300              3,600     800

Table 1. Exponential fits, MFPTs, and lag phases (all in units of steps) for transitioning from the unfolded state(s) to the native state in the three simple models.


APPENDIX E: SUPPORTING INFORMATION FOR CHAPTER 5

SIMULATION DETAILS

Six starting conformations covering a range of 0 to 13 Å Cα RMSD to the crystal structure were drawn from replica exchange simulations in implicit solvent from Bill Swope and Jed Pitera at the IBM Almaden Research Center (136). These conformations were energy minimized using a steepest-descents algorithm in the Gromacs simulation package (43) with the AMBER03 force field (60). They were then solvated in TIP3P water, and the solvent was equilibrated at 300 K with the protein coordinates held fixed. Finally, simulations were run on the Folding@home distributed computing platform using an MPI-enabled version of Gromacs (58) at both 300 and 370 K. The details of this procedure are identical to those used in Ref (58), and a full description can be found there. Most of the results described in this work are from the 370 K data. This temperature was chosen to approximate the experimental melting temperature, correcting for the fact that simulations tend to over-estimate the melting temperature for this system (136).

Structures were rendered with PyMOL.

MSM CONSTRUCTION AND ANALYSIS

We used the MSMBuilder package (4, 10) to construct a microstate model with 30,000 states and a coarse-grained macrostate model with 5,000 states. The microstate model was generated by clustering conformations stored at 5 ns intervals based on their Cα RMSDs using the k-centers algorithm in MSMBuilder. The remaining data (50 ps spacing) were then assigned to these clusters and used to construct a transition count matrix (Cij = the number of observed transitions from state i at time t to state j at time t+τ, where τ is the lag time of the model) and the corresponding transition probability matrix (Pij = the probability of transitioning from state i at time t to state j at time t+τ). The PCCA+ algorithm (20, 44, 261) was then used to lump kinetically related microstates into 5,000 macrostates, and these state definitions were used to construct macrostate-level transition count/probability matrices.
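As a minimal sketch of the definitions of Cij and Pij above, the following Python counts transitions in state-assigned trajectories at a given lag (in frames) and row-normalizes the result. It assumes a sliding window over every frame; this is not the MSMBuilder implementation, and the function names and toy assignments are hypothetical.

```python
def count_transitions(assignments, n_states, lag):
    """C[i][j] = number of observed transitions from state i at frame t
    to state j at frame t + lag, summed over all trajectories."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in assignments:
        for t in range(len(traj) - lag):
            counts[traj[t]][traj[t + lag]] += 1
    return counts


def transition_probabilities(counts):
    """Row-normalize a count matrix; rows without counts are left as zeros."""
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


# Hypothetical state assignments for two short trajectories.
toy_assignments = [[0, 0, 1, 1, 0], [1, 1, 1, 0]]
toy_c1 = count_transitions(toy_assignments, n_states=2, lag=1)
toy_p1 = transition_probabilities(toy_c1)
toy_c2 = count_transitions(toy_assignments, n_states=2, lag=2)
```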

The lag time for each model was selected by computing the implied timescales of the model from

$$k = -\frac{\ln(\mu)}{\tau}$$

where μ is an eigenvalue, τ is the lag time, and k is a rate (the implied timescale is 1/k). This equation comes from the equivalence between discrete time MSMs and continuous time master equations (see Refs (6) and (3) for details). By plotting the implied timescales as a function of the lag time, one can identify the lag time at which they begin to level off (satisfy the Chapman-Kolmogorov test), indicating that the model is Markovian (34). Based on this analysis, we chose a lag time of 5 ns for our microstate model (Figure 65), the model on which all the kinetic analyses in this work were performed.
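To illustrate the relation between an eigenvalue, the lag time, and the implied timescale (the timescale is −τ/ln μ, the inverse of the rate k), the sketch below uses a hypothetical two-state transition matrix whose non-unit eigenvalue is known in closed form. The numbers are illustrative and are not taken from our models.

```python
import math


def implied_timescales(eigenvalues, lag_time):
    """Implied timescale -lag_time / ln(mu) for each eigenvalue 0 < mu < 1.
    The stationary eigenvalue (mu = 1) has no finite timescale and is skipped."""
    return [-lag_time / math.log(mu) for mu in eigenvalues if 0.0 < mu < 1.0]


def two_state_eigenvalues(a, b):
    """Eigenvalues of the row-stochastic matrix [[1 - a, a], [b, 1 - b]]."""
    return [1.0, 1.0 - a - b]


# Hypothetical per-lag transition probabilities and a 5 ns lag time.
toy_eigs = two_state_eigenvalues(a=0.01, b=0.04)
toy_timescales = implied_timescales(toy_eigs, lag_time=5.0)  # in ns
```

Repeating this calculation for models built at several lag times gives the implied timescale plots referred to above.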

To calculate the relaxation of the fraction folded as measured by some observable we used the procedure from Ref (58) to distinguish folded and non-native states and the procedure from Ref (4) to propagate the fraction folded. For example, with the experimental surrogate (Trp22-Tyr33 quenching) we calculated the average and standard deviation of the distance between these residues (Nativeave and Nativestd respectively) in native-state simulations started from a model of D14A based on the 1LMB crystal structure. Five random conformations were drawn from each state and used to calculate the average distance between these residues for that state (Stateave).

A state was considered to be native if Stateave < Nativeave - Nativestd and non-native otherwise. The fraction folded can then be calculated as the dot product between the vector of state populations and a vector with 1’s for folded states and 0’s for non-native ones. To mimic an ensemble T-jump we used two starting populations: 1) all states equally populated and 2) all microstates in non-native macrostates (i.e. outside the most populated macrostate) equally populated. The relaxation of these starting ensembles was modeled by propagating the populations forward in time with the transition probability matrix and calculating the fraction folded at each time step. The same procedure was used for the fraction folded determined by the RMSD to the crystal structure, which was examined to determine whether or not the Trp22-Tyr33 distance could be measuring a more local rearrangement than full folding, as proposed for villin (58). Figure 73 shows that these two observables gave similar timescales for the full MSM and, while differences are apparent when the simulations started from β-sheet structures are ignored, the timescales do not appear to be substantially slower for the RMSD relaxation (Figure 74). The molecular and activated timescales (τm and τa respectively) were obtained by fitting to the biexponential

$$A e^{-t/\tau_m} + B e^{-t/\tau_a} + C$$

where t is the time and A, B, and C are constants.

The states participating most strongly in a given transition mode are specified by the corresponding left eigenvector (states with negative components are interconverting with those with positive components, and the magnitude of the eigenvector component gives the degree of participation) (1). The highest flux pathways between sets of states were calculated as in Refs (258) and (5). Mean First Passage Times (MFPTs) between states and Pfolds were calculated as in Ref (37).

Given our finite sampling, one can estimate the kinetic connectivity of a state by counting the number of edges connecting it to other states (effectively a way of counting the number of edges with probabilities above some threshold since all connections would be made with infinite sampling).

Two residues were considered to be in contact if any pair of atoms was within 7 Å. Native contacts are those formed in the energy-minimized model based on the crystal structure 1LMB (130, 131). Solvent accessible surface areas were measured using the g_sas program from Gromacs (43) with a 1.4 Å probe radius. The distance between two residues is the distance between the centroids of their side chains.

Figure 65. Implied timescales for the full 370 K dataset.

Figure 66. Implied timescales for the 300 K dataset.


Figure 67. Implied timescales for ¾ of the 370 K dataset selected at random.

Figure 68. A coarse-grained view of the slowest transition with state sizes proportional to the free energy and arrow widths proportional to the flux (see key in figure).

Figure 69. Another coarse-grained view of the slowest transition with state sizes proportional to the free energy and arrow widths proportional to the flux (see key in figure). Here the states are laid out in terms of the average number of β-sheet residues (calculated from 100 random conformations from each state) and the pfold (probability of reaching the crystallographic state in L before the compact β-sheet state in A).


Figure 70. Free energy projections of the microstate MSM onto typical order parameters like the radius of gyration (Rg), the Cα RMSD to the crystal structure, and the distance between the Trp22 and Tyr33 residues. Differences between the two panels highlight the difficulty in interpreting such projections.

Figure 71. Free energy projection of the microstate MSM onto Pfold and the distance between the Trp22 and Tyr33 residues. Obtaining projections onto kinetic order parameters like Pfold is greatly simplified with MSMs. In this case Pfold refers to the probability of reaching the crystallographic state before reaching the compact β-sheet state (i.e. the slow transition from Figure 21). Unlike the projections in Figure 70, this one hints that D14A may not be well described by a simple two- or three-state model or that the Trp22-Tyr33 distance is not a good reaction coordinate, since there is a broad range of Pfold values possible for a given Trp-Tyr distance. Indeed, analysis of the MSM reveals that D14A is best described by a native hub.


Figure 72. The ten most populated macrostates with their equilibrium probabilities.

Figure 73. Relaxation of the fraction unfolded with different observables and observation times. The thick black curves come from the MSM and the thin blue curves from biexponential fits to the MSM relaxation. The top row shows relaxation of the fraction unfolded measured by the Trp22-Tyr33 distance (A) starting from all states being equally populated and (B) starting from all non-native states being equally populated. The bottom row shows relaxation of the fraction unfolded measured by the Cα RMSD to the crystal structure (C) starting from all states being equally populated and (D) starting from all non-native states being equally populated. Fitting parameters are given in the figure (in units of microseconds). In this case, the fitting parameters are relatively independent of the observable and starting distribution.

Figure 74. Relaxation of the fraction unfolded with different observables and observation times from an MSM built without the trajectories started from β-sheet structures. The thick black curves come from the MSM and the thin blue curves from biexponential fits to the MSM relaxation. The top row shows relaxation of the fraction unfolded measured by the Trp22-Tyr33 distance (A) starting from all states being equally populated and (B) starting from all non-native states being equally populated. The bottom row shows relaxation of the fraction unfolded measured by the Cα RMSD to the crystal structure (C) starting from all states being equally populated and (D) starting from all non-native states being equally populated. Fitting parameters are given in the figure (in units of microseconds). In this case the fitting parameters are more dependent on the observable, consistent with the experimental observation of probe-dependent kinetics.


Figure 75. Projection of the free energy onto pfold (A) from the compact β-sheet state in Figure 22A to the native state in Figure 22H, (B) from the extended state in Figure 22E to the native state in Figure 22H, and (C) from the extended state in Figure 22E to the native state in Figure 22G. None are purely downhill, though some may be consistent with incipient downhill folding (i.e. have sufficiently low barriers that there is a reasonable population at the barrier top that can fold in a downhill manner in addition to activated folding across the barrier).

Figure 76. The helicity of each residue predicted from Agadir (143). The purple, numbered bars show where the five helices are (the extra purple block between helices 4 and 5 is a turn).


APPENDIX F: SUPPORTING INFORMATION FOR CHAPTER 6

Figure 77. Uncertainty in the log base 10 of the relative entropies averaged over 10 independent samples of (A) reference simulations of M1 and (B) adaptive sampling of M1. Black lines are contours of equal amounts of data.

Figure 78. Uncertainty in the log base 10 of the relative entropies averaged over 10 independent samples of (A) reference simulations of M2 and (B) adaptive sampling of M2. Black lines are contours of equal amounts of data.


APPENDIX G: SUPPORTING INFORMATION FOR CHAPTER 9

SERIAL REPLICA EXCHANGE (SREMD)

Molecular Dynamics (MD) is a powerful technique for exploring the conformational space of biomolecules. However, MD simulations often spend a significant portion of time trapped in local free energy minima. Replica Exchange Molecular Dynamics (REMD) (22, 23) was developed to overcome this problem by inducing a random walk in temperature space. In REMD, independent MD simulations are performed in parallel at different temperatures. At regular intervals attempts are made to exchange configurations between temperatures. These exchanges are accepted according to a well defined transition probability. The REMD scheme requires synchronization of different processors, which makes it unsuitable for a heterogeneous distributed computing environment.

Serial Replica Exchange Molecular Dynamics (SREMD) (177, 224) is a serial version of REMD that is suitable for distributed computing. In SREMD, a single simulation performs a random walk in temperature space by making regular attempts to swap temperatures. The transition probability for this move is determined by one potential energy from the simulation and a second one from a pre-stored potential energy distribution function (PEDF) at the new temperature. SREMD has been shown to be an efficient sampling method when applied in a distributed computing environment (177). However, we note that SREMD is only approximately correct unless the exact PEDFs are adopted.

SIMULATION DETAILS

Our simulations used the AMBER 94 potential (262). The SREMD algorithm was implemented in a version of the GROMACS (43) molecular dynamics simulation package modified for the Folding@Home (79) infrastructure (http://folding.stanford.edu). The RNA molecule was solvated in a water box with 3943 TIP3P (263) waters and 11 Na+ ions. The simulation system was minimized using a steepest descent algorithm, followed by a 100 ps MD simulation with a position restraint potential applied to the RNA heavy atoms. All simulations were run at constant NVT by coupling to a Nose-Hoover thermostat with a coupling constant of 0.02 ps⁻¹ (63). A cutoff of 10 Å was used for non-bonded interactions. Long-range electrostatic interactions were treated with the Particle-Mesh Ewald (PME) method (264). Nonbonded pair-lists were updated every 10 steps with an integration step size of 2 fs in all simulations. All bonds were constrained using the LINCS algorithm (265).

We performed 2,800 SREMD simulations, with an aggregate simulation time of 54.6 µs, starting from the NMR structure (PDB code 1ZIH) (209). The temperature list was roughly exponentially distributed, with 56 temperatures covering a range from 285 to 592 K. To obtain initial estimates of the PEDFs, we performed 56 3 ns SREMD simulations in which every move was accepted. For the Folding@home (FAH) runs, the initial temperatures were selected uniformly from the temperature list. Thus, 50 simulations started from each temperature, each with different initial velocities. The PEDFs were updated every 40 ns for 40 iterations, then every 400 ns for 20 iterations, and finally every 1000 ns.
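For illustration, one simple way to generate a temperature list of this kind is a geometric ladder, in which neighboring temperatures differ by a constant ratio. The text above only states that the list was roughly exponentially distributed, so the exact spacing below is an assumption, and `geometric_ladder` is a hypothetical helper.

```python
def geometric_ladder(t_min, t_max, n):
    """n temperatures from t_min to t_max with a constant ratio between neighbors."""
    ratio = (t_max / t_min) ** (1.0 / (n - 1))
    return [t_min * ratio ** k for k in range(n)]


# 56 temperatures spanning 285 to 592 K, as in the SREMD simulations.
ladder = geometric_ladder(285.0, 592.0, 56)
```

With 2,800 simulations and 56 temperatures, uniform selection of the starting temperature corresponds to 50 simulations per entry of `ladder`.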

TOPOLOGICAL METHOD (MAPPER) FOR PATHWAY ANALYSIS

Our SREMD simulations generate a massive number of configurations. Therefore, it is difficult to discern the structure of the data. Such data is normally dominated by the folded and unfolded structures. However, we are interested in understanding structures in transition states or intermediate states. Direct application of clustering algorithms to all the configurations will be biased toward the densest regions (i.e. folded/unfolded states in this study), making it difficult to identify the sparsely populated intermediate states of interest. Furthermore, such clustering methods will not provide any information on the connectivity between different clusters.


To address such issues, Yao et al. (228) proposed a topological data analysis method to explore pathways in biomolecular folding, based on Mapper [14], a general topological data analysis tool for high dimensional data sets. This method efficiently identifies intermediate states along a pathway. Roughly speaking, we use Mapper with filters based on a conditional density function estimated from the data. The data are then divided into overlapping level sets based on the filter, and single-linkage clustering is used within each density level. Finally, a graph is generated with a node corresponding to each cluster and edges between pairs of nodes in neighboring level sets that have non-zero overlap.

We note that clusters may be intrinsically non-convex in biomolecular folding problems. K-means type clustering algorithms will fail for such clusters. The use of single-linkage clustering in density levels in Mapper allows the efficient discovery of non-convex clusters and separates sparsely populated intermediate states from the dominant unfolded/folded states. For details on how such a scheme works, readers are referred to [13].
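The steps of this scheme (filter, overlapping level sets, single-linkage clustering within each level, and a graph of overlapping clusters) can be sketched for one-dimensional toy data as follows. This is not the implementation of Yao et al.: the neighbor-count density filter, the evenly spaced overlapping intervals, and all names and parameter values are assumptions made for illustration.

```python
def local_density(points, radius):
    """Filter value: number of points within `radius` of each point."""
    return [sum(1 for q in points if abs(p - q) <= radius) for p in points]


def single_linkage(indices, points, eps):
    """Single-linkage clusters cut at distance eps (connected components)."""
    unvisited = set(indices)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            a = frontier.pop()
            near = {b for b in unvisited if abs(points[a] - points[b]) <= eps}
            unvisited -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper(points, filter_values, n_levels, overlap, eps):
    """Return (nodes, edges); each node is (level, frozenset of point indices)."""
    lo, hi = min(filter_values), max(filter_values)
    width = (hi - lo) / n_levels
    nodes = []
    for level in range(n_levels):
        # Each level set is widened by `overlap` * width on both sides.
        a = lo + (level - overlap) * width
        b = lo + (level + 1 + overlap) * width
        members = [i for i, f in enumerate(filter_values) if a <= f <= b]
        nodes.extend((level, c) for c in single_linkage(members, points, eps))
    # Connect clusters in neighboring level sets that share at least one point.
    edges = [(m, n)
             for m in range(len(nodes)) for n in range(m + 1, len(nodes))
             if abs(nodes[m][0] - nodes[n][0]) == 1 and nodes[m][1] & nodes[n][1]]
    return nodes, edges


# Hypothetical 1-D data: two dense basins joined by a sparse bridge.
toy_points = [0.0, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 4.0, 5.0, 5.25, 5.5, 5.75, 6.0]
toy_filter = local_density(toy_points, radius=1.0)
toy_nodes, toy_edges = mapper(toy_points, toy_filter, n_levels=2, overlap=0.5, eps=1.25)
```

For this toy input the graph has three nodes: one low-density node containing the sparse bridge, linked to each of the two dense basins, which remain separate nodes in the high-density level.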

PEDFS

Figure 79 (a) shows SREMD PEDFs from our massive distributed computing simulations. The convergence of the PEDFs can be verified with the χ² convergence measure, defined as an integrated error (224):

$$\chi^2(t) = \sum_{i=1}^{N} \left( P_i(t) - P_i^{ref} \right)^2$$

where N is the number of bins in the potential histogram, Pi(t) is the value of the ith bin of the potential energy histogram generated by potential energies collected over time t at a particular temperature, and Piref is the reference PEDF.
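A minimal sketch of this convergence measure, assuming the PEDFs are stored as normalized histograms on a common set of bins, is given below. The binning helper and the toy energies are hypothetical.

```python
def histogram(energies, lo, hi, n_bins):
    """Normalized histogram of the energies that fall in [lo, hi)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for e in energies:
        if lo <= e < hi:
            counts[min(int((e - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no energies inside the histogram range")
    return [c / total for c in counts]


def convergence_measure(p_t, p_ref):
    """Sum over bins of the squared difference between two histograms."""
    return sum((a - b) ** 2 for a, b in zip(p_t, p_ref))


# Hypothetical potential energies and a flat reference distribution.
toy_pedf = histogram([0.5, 1.5, 1.5, 2.5], lo=0.0, hi=3.0, n_bins=3)
toy_ref = [1.0 / 3.0] * 3
toy_measure = convergence_measure(toy_pedf, toy_ref)
```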


Figure 79 (b) displays the χ² convergence measure averaged over all temperatures. When the final PEDFs are used as the reference distributions, χ²(t) decays to zero. On the other hand, when the PEDFs from the initial 3 ns constant temperature simulations (Pinitial) are used as the reference, χ²(t) grows to a plateau value. The χ²(t) values for single temperatures show the same trends as these averaged values. Therefore, the PEDFs have converged.

MELTING CURVES

Figure 80 shows the native contacts melting curve. The data demonstrates that folded conformations dominate at low temperatures while extended structures dominate at high temperatures.

Figure 79. (a) Potential Energy Distribution Functions (PEDFs) generated from Folding@home data at each of the 56 temperatures used. (b) The χ² convergence measure averaged over all temperatures as a function of time. Triangles correspond to using Pfinal as the reference distribution and circles correspond to using Pinitial as the reference.


Figure 80. Native contacts melting curve. Only every third temperature is displayed for clarity.


APPENDIX H: SUPPORTING INFORMATION FOR CHAPTER 10

INITIAL CONFIGURATIONS

We started our ST simulations from two different initial configurations as shown in Figure 81: a near-native state and a random coil. The near-native state was created by analogy to the NMR structure of the GCAA tetraloop (first structure of PDB code 1zih (209)). The random coil conformation was created with the Nucleic Acid Builder (266).

Figure 81. The two initial structures used in this study: A) A near-native conformation and B) a random coil conformation.

THE CONVERGENCE OF WEIGHTS IN SIMULATED TEMPERING (ST)

SIMULATED TEMPERING

In Simulated Tempering (ST) (24, 25), configurations are sampled from a mixed canonical ensemble in which the canonical ensembles at different temperatures are weighted differently, as defined by a generalized Hamiltonian:

$$\mathcal{H}_i(X, p) = \beta_i H(X, p) - g_i \qquad (H1)$$

where βi = 1/(kBTi), H(X, p) is the Hamiltonian for the canonical ensemble at temperature Ti, X denotes the conformation, and p is the momentum. The constant gi, determined a priori, is the weight for temperature Ti.

ST works as follows: a single simulation starts from a particular temperature (Ti), and an attempt is made periodically to move the configuration (Xn) to another temperature (Tj) according to a well-defined transition probability that satisfies the detailed balance condition

$$P_i(X_n, p_n)\, P(i \to j) = P_j(X_n, p_n')\, P(j \to i) \qquad (H2)$$

The probability of configuration Xn at temperature Ti for the expanded canonical ensemble is

$$P_i(X_n, p_n) = \frac{1}{Z} \exp\left(-\mathcal{H}_i(X_n, p_n)\right) = \frac{1}{Z} \exp\left(-\beta_i H(X_n, p_n) + g_i\right) \qquad (H3)$$

where pn is the momentum and Z is the partition function for the expanded canonical ensemble. H(Xn, pn) is the sum of the kinetic energy (K) and the potential energy (U):

$$H(X_n, p_n) = K(p_n) + U(X_n)$$

A re-scaling of the momentum ($p_n' = \sqrt{T_j/T_i}\, p_n$) following the exchange causes the kinetic energy to cancel out in the detailed balance equation, and the transition probability after applying the Metropolis criterion is

$$P_{i \to j} = \min\left\{1,\ e^{-(\beta_j - \beta_i) U(X_n) + (g_j - g_i)}\right\} \qquad (H4)$$

where U(Xn) is the potential energy for configuration Xn, which is sampled from the canonical ensemble at Ti. A set of weights needs to be pre-determined to calculate these transition probabilities. Without proper weighting, ST simulations will be constrained to a subset of the temperature space and become inefficient (25, 177). It was shown that the weights leading the system to perform a random walk in temperature space equal the unit-less free energies at the different temperatures (24, 25).
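As a small sketch of the temperature move in Equation (H4), the function below evaluates the Metropolis acceptance probability from the potential energy and the two weights, and a second helper gives the momentum re-scaling factor. Energies are assumed to be in kJ/mol, and the function names are hypothetical rather than part of our modified GROMACS code.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def beta(temperature):
    """Inverse temperature 1/(kB T) in mol/kJ."""
    return 1.0 / (K_B * temperature)


def st_acceptance(u, t_i, t_j, g_i, g_j):
    """Probability of accepting a move from temperature t_i to t_j
    for a configuration with potential energy u (kJ/mol)."""
    exponent = -(beta(t_j) - beta(t_i)) * u + (g_j - g_i)
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def momentum_scale(t_i, t_j):
    """Factor applied to the momenta after an accepted move, sqrt(T_j / T_i)."""
    return math.sqrt(t_j / t_i)
```

The scaling factor is what makes the kinetic energy drop out: beta(t_j) times the re-scaled kinetic energy equals beta(t_i) times the original kinetic energy.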

SIMULATED TEMPERING EQUAL ACCEPTANCE RATIO (STEAR) METHOD

It is not an easy task to determine the free energy weights that enable the system to perform a random walk in temperature space. The Simulated Tempering Equal Acceptance Ratio (STEAR) method for determining the free energy weights is adopted in this study (49, 177). This method is based on the property that the free energy weights leading to uniform sampling must yield the same acceptance ratios for the forward and backward transitions between Ti and Tj:

$$\langle P_{ij}(g_j - g_i, U_i) \rangle_i = \langle P_{ji}(g_i - g_j, U_j) \rangle_j \qquad (H5)$$

where

$$\langle P_{ij} \rangle_i = \int P_{ij}(g_j - g_i, U_i)\, P(U_i)\, dU_i, \qquad \langle P_{ji} \rangle_j = \int P_{ji}(g_i - g_j, U_j)\, P(U_j)\, dU_j \qquad (H6)$$

where Ui is the potential energy for a configuration sampled from the canonical ensemble at temperature Ti and P(Ui) is the potential energy distribution function (PEDF) at Ti. PEDFs for each temperature are initially estimated from short trial MD simulations and then updated during an equilibration phase preceding the production phase, which uses a static set of weights. By solving Equation (H5), we can obtain a set of weights close to the free energy weights.

DETAILED PROCEDURE TO UPDATE THE WEIGHTS

The ST algorithm was implemented in version 3.1.4 of the GROMACS (43) molecular dynamics simulation package modified for the Folding@Home (79) infrastructure (http://folding.stanford.edu). In our ST simulations, the temperature list (T1 … Tn) contains 56 temperatures roughly exponentially distributed between 270 and 592 K. The detailed procedure used to determine the weights with STEAR is described below.

Obtaining the initial weights: For each of the two initial configurations (see Figure 81), one 2 ns NVT simulation was carried out at each of 56 temperatures on a computer cluster. Potential energies collected every 0.1 ps from the last nanosecond of these simulations were used to get a rough approximation of the energy distribution at each temperature. The weight (gi) that gives an equal acceptance ratio for transitions from Ti to Ti+1 and vice versa is found using Newton’s method (See Equation (H5)) and g1 is set to zero.
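The equal acceptance ratio condition for one pair of neighboring temperatures can be sketched as a one-dimensional root-finding problem. The sketch below replaces the integrals over the PEDFs with averages over sampled energies and uses bisection in place of Newton’s method for simplicity; both choices, the reduced units, and the gamma-distributed toy energies are assumptions made for illustration.

```python
import math
import random


def mean_acceptance(energies, d_beta, d_g):
    """Average of min(1, exp(-d_beta * U + d_g)) over the sampled energies."""
    return sum(math.exp(min(0.0, -d_beta * u + d_g)) for u in energies) / len(energies)


def equal_acceptance_weight(u_i, u_j, beta_i, beta_j, lo=-1e3, hi=1e3, tol=1e-10):
    """Find d_g = g_j - g_i that equalizes forward and backward acceptance."""
    def imbalance(d_g):
        forward = mean_acceptance(u_i, beta_j - beta_i, d_g)
        backward = mean_acceptance(u_j, beta_i - beta_j, -d_g)
        return forward - backward  # non-decreasing in d_g

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy system in reduced units: 10 quadratic degrees of freedom, for which the
# energies are gamma distributed and g_j - g_i = 10 * ln(beta_j / beta_i).
rng = random.Random(1)
beta_i, beta_j = 1.0, 0.9
u_i = [rng.gammavariate(10, 1.0 / beta_i) for _ in range(5000)]
u_j = [rng.gammavariate(10, 1.0 / beta_j) for _ in range(5000)]
delta_g = equal_acceptance_weight(u_i, u_j, beta_i, beta_j)
```

Chaining such differences along the temperature list, with g1 set to zero, gives the full set of weights.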

Updating the weights: Once an initial set of weights has been chosen, we start 1120 ST simulations from each initial configuration on the Folding@Home distributed computing environment. In these simulations, a temperature swap is attempted every 0.2 ps. At regular intervals (about every 300 ns of aggregate simulation), all the new data are collected, and only the new data are used to refine the approximation of the energy distribution at each temperature. Newton’s method is then used to update the weights so that they satisfy the equal acceptance ratio criterion of Equation (H5) given the new energy distributions.

CONVERGENCE OF THE WEIGHTS

The weights obtained from two independent sets of ST simulations starting from different initial configurations are well converged, as shown in Table 2. The weights converge at about 9 ns for each initial configuration. As described above, a set of converged weights (i.e. free energy weights) should induce uniform sampling of the temperature space. As shown in Figure 82, both sets of simulations achieve uniform sampling at about 9 ns. Thus, after about 9 ns, the weights are held static and the simulations are continued in what is called the production phase.


Figure 82. Amount of sampling at different temperatures for ST simulations started from the native (top row) and coil (bottom row) configurations, computed from different segments of simulation time: 0-0.3 ns, 1.2-1.5 ns, 2.7-3.0 ns, and 8.7-9.0 ns. Uniform sampling is reached for both sets of ST simulations, indicating that the weights are converged.

MOLECULAR DYNAMICS (MD) SIMULATION DETAILS

Our MD simulations used the nucleic acid parameters from the AMBER99 force field (60, 267). The RNA molecule was solvated in a water box with 2543 TIP3P (263) waters and 7 Na+ ions. The simulation system was minimized using a steepest descent algorithm, followed by a 100 ps MD simulation with a position restraint potential applied to the RNA heavy atoms. All NVT simulations were coupled to a Nose-Hoover thermostat with a coupling constant of 0.02 ps⁻¹ (63). A cutoff of 10 Å was used for both vdW and short-range electrostatic interactions. Long-range electrostatic interactions were treated with the Particle-Mesh Ewald (PME) method (264). Nonbonded pair-lists were updated every 10 steps with an integration step size of 2 fs in all simulations. All bonds were constrained using the LINCS algorithm (265).

HIERARCHICAL K-MEDOIDS CLUSTERING ALGORITHM

A hierarchical K-medoids clustering algorithm developed by Boxer, G. is used in this study. In K-medoids clustering one starts by choosing some number of random conformations to be generators. All remaining conformations are then assigned to the generator that they are most similar to, thus forming a state corresponding to each generator. Each generator is then updated by choosing a number of random conformations from its corresponding state and selecting the one that is closest to every other conformation in the state (i.e. the one that is closest to the center of the state) as the new generator. This updating procedure may be continued for some predetermined number of iterations or until the answer converges. The basic idea of hierarchical clustering is to perform K-medoids clustering on the entire dataset and then to recursively perform K-medoids clustering on each state until every state has fewer conformations than some threshold. This threshold is set as an input parameter for the K-medoids clustering algorithm.
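The description above can be sketched as follows for generic data with a user-supplied distance function. This is an illustrative reimplementation rather than the code used in this study: the parameter names and defaults, the guard against clusterings that fail to split, and the one-dimensional toy points (standing in for conformations and RMSD) are all assumptions.

```python
import random


def assign(points, medoids, dist):
    """Index of the most similar generator (medoid) for every point."""
    return [min(range(len(medoids)), key=lambda c: dist(p, medoids[c])) for p in points]


def k_medoids(points, k, dist, n_iter, n_candidates, rng):
    """K-medoids with generators updated from random candidate members."""
    medoids = rng.sample(points, k)
    for _ in range(n_iter):
        labels = assign(points, medoids, dist)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue
            candidates = rng.sample(members, min(n_candidates, len(members)))
            # New generator: the candidate closest to all members of its state.
            medoids[c] = min(candidates, key=lambda m: sum(dist(m, p) for p in members))
    return medoids, assign(points, medoids, dist)


def hierarchical_k_medoids(points, k, threshold, dist, rng, n_iter=5, n_candidates=10):
    """Recursively split states until each has fewer than `threshold` points."""
    if len(points) < threshold or len(points) < k:
        return [points]
    _, labels = k_medoids(points, k, dist, n_iter, n_candidates, rng)
    states = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if len(members) == len(points):
            states.append(members)  # clustering failed to split; stop here
        elif members:
            states.extend(hierarchical_k_medoids(members, k, threshold, dist,
                                                 rng, n_iter, n_candidates))
    return states


# Hypothetical data: 300 distinct points on a line.
toy_rng = random.Random(0)
toy_points = [float(x) for x in toy_rng.sample(range(10000), 300)]
toy_states = hierarchical_k_medoids(toy_points, k=2, threshold=40,
                                    dist=lambda a, b: abs(a - b), rng=toy_rng)
```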

Table 2. Convergence of the weights for representative temperatures. Δg = gj − gi obtained from distributed computing simulations starting from a helical structure (third column) and a coil structure (fourth column) is shown at different temperature pairs. Differences between the free energy differences Δfji = gj/βj − gi/βi obtained from simulations starting from a helical structure and a coil structure are displayed in the fifth column. kT at temperature i is shown in the sixth column. Δfji(Helical) − Δfji(coil) (kJ/mol) is smaller than kT (kJ/mol) at all temperature pairs.

MARKOV STATE MODELS

A Markov model is basically a graph representing the structure and temporal connectivity of some dataset that consists of temporally ordered observations (3, 6). In this case, each node corresponds to a set of kinetically similar conformations. These nodes are connected by directed edges with corresponding values equal to the probability of transitioning between them. For the model to be Markovian, the probability of transitioning to state j must depend solely on the previous state.

A Markov State Model (MSM) may also be represented by a transition probability matrix (also see Eq. 1 in the main text):

$$P(\Delta t) = T(\Delta t)\, P(0) \qquad (H7)$$

where P(∆t) is a vector of state populations at time ∆t, T is the column-stochastic transition probability matrix, and ∆t is the lag time (or time step). Using this representation, the time evolution of a vector representing the population of each state may be calculated by repeatedly left-multiplying the column vector by the transition probability matrix. The model also has a corresponding lag time, which is effectively the time resolution of the model. Each step, or multiplication by the transition probability matrix, is equivalent to one lag time. For the model to be Markovian there must be a separation of timescales. That is, equilibration within states must occur on timescales faster than the lag time while transitions between states must occur on timescales longer than the lag time. The key is finding an appropriate balance between the number of states in the model and the lag time. A desirable Markov model has few enough states that it may be understood by a person and a lag time shorter than the timescale of the process of interest.

The eigenvalues (μk) of the transition matrix each imply a timescale (τk):

$$\tau_k = -\frac{\tau}{\ln(\mu_k)} \qquad (H8)$$

where μk is an eigenvalue of the transition matrix with the lag time τ.

The focus of the current study is thermodynamics rather than kinetics. The first left eigenvector of the transition matrix Tij corresponds to the equilibrium distribution (6).


SPLITTING INTO MICROSTATES

The first step in our procedure to build an MSM is to divide all the sampled conformations into small sets of structurally similar configurations called microstates (3, 6). This is accomplished using the hierarchical K-medoids clustering algorithm described in Section 3. For example, by setting the threshold at which the hierarchical K-medoids clustering stops splitting a state to 2,500 conformations, we divided 1.3 million conformations generated from long ASM seeding simulations into 1,597 microstates. Heavy atom RMSD is used as the distance metric, since it accounts for both local and global similarities between pairs of conformations. This distance metric has also been shown to be able to distinguish between kinetically distinct conformations. If the state population threshold is chosen to be small enough, then the conformations in one microstate may be considered to be kinetically as well as structurally similar, as it would require very few MD steps to get from one to another. As shown in Figure 83, overlaid structures from the same microstate have great structural similarity. Based on this assumption, one may build a microstate Markov model by using the original data to calculate the probability of transitioning between each pair of microstates (stored as a transition probability matrix). Because of the small size of each microstate, this Markov model will have too many states to provide any insight into the nature of the free energy landscape. To gain a clearer understanding of the free energy landscape one may lump together kinetically similar microstates to form macrostates. These macrostates comprise a new MSM that hopefully has an appropriate separation of timescales.

Figure 83. Three example structures from a single microstate.

LUMPING INTO METASTABLE STATES

Lumping is done by first calculating the eigenvalues and eigenvectors of the microstate transition probability matrix (44). Each eigenvalue is related to the timescale for interconverting between two sets of microstates, while the corresponding eigenvector indicates which microstates constitute these two sets, provided the model is Markovian at this timescale. We estimate the number of macrostates from the gap in the implied timescales (see Equation (H6)) of the microstate transition probability matrix as a function of the lag time. As shown in Figure 84, there are six macrostates for the seeding simulations.
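One simple way to automate the gap criterion is to look for the largest ratio between consecutive implied timescales. The sketch below assumes that this ratio is the relevant measure of the gap, which is a choice made here for illustration; the timescale values are invented.

```python
def n_macrostates_from_gap(timescales):
    """Number of macrostates suggested by the largest gap in the timescales.

    If the largest ratio separates the k slowest timescales from the rest,
    those k slow processes connect k + 1 metastable states.
    """
    ts = sorted(timescales, reverse=True)
    if len(ts) < 2:
        raise ValueError("need at least two timescales to find a gap")
    ratios = [ts[i] / ts[i + 1] for i in range(len(ts) - 1)]
    gap_index = max(range(len(ratios)), key=ratios.__getitem__)
    return gap_index + 2


# Invented timescales with a clear gap after the fifth slowest one.
example = [500.0, 300.0, 200.0, 150.0, 120.0, 10.0, 8.0]
n_states = n_macrostates_from_gap(example)
```

In practice the gap should persist over a range of lag times before it is trusted.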

Figure 84. The one hundred largest implied timescales as a function of the lag time for (a) ST simulations starting from the coil initial configuration and (b) the long adaptive seeding microstate MSM.

Sets of kinetically related microstates are grouped together into macrostates using a spectral clustering algorithm, Perron Cluster Cluster Analysis (PCCA) (45). While generating the transition count matrix, all of the recorded transitions are independent (i.e., transitions from time t to 2t, 2t to 3t, etc.). The initial lumping calculated from this data is refined by using a Simulated Annealing (SA) scheme to maximize the metastability (Q) of the model (6). Twenty SA runs of 20,000 steps each are used. In each simulated annealing step, a microstate is randomly reassigned to a new macrostate and the move is accepted using the Metropolis criterion. The metastability is defined as the sum of the self-transition probabilities of the macrostates, $Q = \sum_{i=1}^{N} T_{ii}$. Maximizing the metastability is assumed to be a good way of maximizing the separation of timescales necessary for a valid MSM. The metastability is shown in Table 3.

Simulation     N. Metastable States    Q       Q/N
ST (Native)    6                       5.09    0.848
ST (Coil)      6                       5.01    0.835
Seeding        6                       5.61    0.935

Table 3. Metastability (Q) and average self-transition probability (Q/N) of the metastable states for the MSMs built from ST simulations and seeding simulations.
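The metastability and a Metropolis-accepted reassignment move can be sketched in a few lines of Python. This is an illustrative toy, not the procedure used to produce Table 3: the count matrix is invented, the geometric cooling schedule and its parameters (`t_start`, `t_end`) are assumptions, and the macrostate matrix is estimated by simply summing microstate counts.

```python
import math
import random


def metastability(micro_counts, assignment, n_macro):
    """Sum of macrostate self-transition probabilities, Q = sum_i T_ii."""
    macro = [[0.0] * n_macro for _ in range(n_macro)]
    for i, row in enumerate(micro_counts):
        for j, c in enumerate(row):
            macro[assignment[i]][assignment[j]] += c
    q = 0.0
    for a in range(n_macro):
        total = sum(macro[a])
        if total > 0:
            q += macro[a][a] / total
    return q


def anneal_lumping(micro_counts, assignment, n_macro, n_steps=2000,
                   t_start=0.1, t_end=0.001, seed=0):
    """Refine a lumping by simulated annealing on Q (assumed geometric cooling)."""
    rng = random.Random(seed)
    current = list(assignment)
    q_cur = metastability(micro_counts, current, n_macro)
    best, q_best = list(current), q_cur
    for step in range(n_steps):
        temp = t_start * (t_end / t_start) ** (step / max(n_steps - 1, 1))
        trial = list(current)
        # Randomly reassign one microstate to a macrostate.
        trial[rng.randrange(len(trial))] = rng.randrange(n_macro)
        q_new = metastability(micro_counts, trial, n_macro)
        # Metropolis criterion for maximizing Q.
        if q_new >= q_cur or rng.random() < math.exp((q_new - q_cur) / temp):
            current, q_cur = trial, q_new
            if q_cur > q_best:
                best, q_best = list(current), q_cur
    return best, q_best


# Invented counts: microstates {0, 1} and {2, 3} form two weakly coupled blocks.
micro_counts = [[90, 10, 0, 0],
                [10, 90, 1, 0],
                [0, 1, 90, 10],
                [0, 0, 10, 90]]
q_good = metastability(micro_counts, [0, 0, 1, 1], 2)
q_start = metastability(micro_counts, [0, 1, 0, 1], 2)
best, q_best = anneal_lumping(micro_counts, [0, 1, 0, 1], 2)
```

Note that Q alone does not penalize empty macrostates, so a practical implementation may need to reject moves that empty a state.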

DETERMINING STATE POPULATIONS AND UNCERTAINTIES

Simulation trajectories are used to estimate the transitions between metastable states in order to build an MSM. This estimation introduces uncertainty into any property computed from the model, including the metastable-state equilibrium populations pursued in this study. Obtaining these uncertainties is therefore important for testing the reliability of our results. To estimate them, we employ a Bayesian method introduced by Noe (251). Assuming that the system is Markovian at the given lag time, the method defines the following stochastic model for its parameters. The likelihood of any trajectory is simply the product of independent transition probabilities, as a consequence of the Markov property, and the transition probability matrix T is assigned an independent, symmetric Dirichlet prior in each row. This is the conjugate prior for the Markov likelihood, which means that the posterior distribution of T after observing a number of transitions has the same functional form as the prior. The method makes the further assumption that the system obeys detailed balance, so the distribution of T is restricted to the space of reversible stochastic matrices. This distribution is difficult to normalize analytically, but it may be sampled using a Markov Chain Monte Carlo (MCMC) algorithm. It was shown (251) that the restriction to reversible matrices greatly reduces the uncertainty of many thermodynamic properties, which is why it was deemed necessary in our study. Using this method, we were able to sample from the posterior distribution of T, given our simulation data, to obtain stable Monte Carlo estimates of the standard deviations of the equilibrium populations.
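The conjugate structure described above can be illustrated with a deliberately simplified sampler. The sketch below draws each row of T independently from its Dirichlet posterior and does not enforce detailed balance, so it is not the reversible MCMC sampler of (251); it only shows how posterior samples of T translate into uncertainties on the equilibrium populations. The count matrix and the prior concentration of 1.0 are assumed values.

```python
import random
import statistics


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_transition_matrix(counts, prior, rng):
    """Row-wise Dirichlet posterior sample (no reversibility constraint)."""
    return [sample_dirichlet([c + prior for c in row], rng) for row in counts]


def stationary_distribution(T, n_iter=500):
    """Equilibrium populations by power iteration on the left eigenvector."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
        total = sum(pi)
        pi = [p / total for p in pi]
    return pi


rng = random.Random(0)
counts = [[900, 100], [200, 800]]  # invented two-state transition counts
samples = [stationary_distribution(sample_transition_matrix(counts, 1.0, rng))
           for _ in range(200)]
pop_mean = [statistics.mean(s[k] for s in samples) for k in range(2)]
pop_std = [statistics.stdev(s[k] for s in samples) for k in range(2)]
```

Because reversibility is not imposed here, the resulting error bars would be expected to be wider than those from the constrained sampler.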

A SIMPLE MODEL OF NON-ARRHENIUS, METASTABLE DYNAMICS

SIMPLE POTENTIAL

GE algorithms attempt to overcome the sampling problem by inducing a random walk in temperature space, where high temperatures help systems cross energetic barriers. However, it has been shown that GE simulations will provide little improvement when the folding kinetics are non-Arrhenius and the dominant barriers are entropic at high temperatures. In order to demonstrate the efficiency of the ASM in comparison with the GE algorithms, we introduce a model 2D potential to contrast the convergence of equilibrium statistics from the different algorithms. The model is based on a discrete-state system introduced by Zwanzig (252) as a simple model for protein folding, which is similar in spirit to continuous-space models used to study anti-Arrhenius dynamics by the Levy group (241). These models define an energy surface reminiscent of a golf course, which is flat almost everywhere, with some bias toward the folded state, and has a sharp decline near the folded state. On the other hand, the degeneracy of the microstates increases sharply as we move away from the folded conformation, providing an entropic advantage that stabilizes the unfolded macrostate at higher temperatures.

The system of Zwanzig (252) was modified by introducing an additional, uncoupled degree of freedom, which has the effect of creating intermediate states between the folded and unfolded states. The energy as a function of the two independent parameters S and R is

$$E(S,R) = SU + RU - \varepsilon\,\delta_{S0} - \varepsilon\,\delta_{R0} + (2\varepsilon - \kappa)\,\delta_{R0}\,\delta_{S0} \tag{H9}$$

where $S \in \{0, \ldots, N_S\}$ and $R \in \{0, \ldots, N_R\}$. The constant $U$ determines the slope of the energy function as we move away from the folded state along each coordinate; $\varepsilon$ represents the drop in energy when one of the coordinates becomes 0, while $\kappa$ is the depth of the energy well of the completely folded state, where both S and R equal 0.

SRNS NR gSR,  (H10) SS

With all this information, it is straightforward to analytically derive the partition function

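$$Q = \sum_{S=0}^{N_S} \sum_{R=0}^{N_R} e^{-\beta E(S,R)}\, g_{S,R} = \left[(1 + \nu e^{-\beta U})^{N_S} - 1 + e^{\beta\varepsilon}\right]\left[(1 + \nu e^{-\beta U})^{N_R} - 1 + e^{\beta\varepsilon}\right] + e^{\beta\kappa} - e^{2\beta\varepsilon} \tag{H11}$$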

The equilibrium probability of each of the $(N_R+1)(N_S+1)$ microstates is now easy to compute:

$$P(S,R) = \frac{g_{S,R}\, e^{-\beta E(S,R)}}{Q} \tag{H12}$$

In the current study, we select parameters =4, =100, =1.5, $U = 1$, and $N_R = N_S = 7$ for our purpose of mimicking the non-Arrhenius folding kinetics. The Potential of Mean Force (PMF), $G = -\ln P(S,R)$, at a range of temperatures is displayed in Figure 85. The PMF plots suggest four metastable macrostates, shown in Figure 85 as separated by black dashed lines (the state decomposition is discussed below): the folded state, where S = R = 0 (State 1); the unfolded state, where S > 0 and R > 0 (State 4); and two intermediate states, where either S = 0 (State 2) or R = 0 (State 3).

Figure 85. Potential of Mean Force (PMF) for the simple potential at $\beta$ (= 1/kT) values of (a) 0.995, (b) 0.652, and (c) 0.456. In part (a), the four metastable macrostates are separated by dashed black lines and labeled.

As expected, the stability of the folded state decreases as we increase the temperature, while the opposite is true of the unfolded state. This is also shown in Figure 86, where the equilibrium populations of the four macrostates are plotted as a function of $\beta = 1/kT$. The populations of intermediate states 2 and 3 are low at both low and high temperatures but reach their maximum values at intermediate temperatures, with $\beta \approx 0.65$.

Figure 86. Populations of the four macrostates as a function of $\beta = 1/kT$.

The potential was equipped with discrete-time Metropolis-Hastings Monte Carlo dynamics, in which the proposal probabilities are proportional to the state degeneracy for states where at least one of S and R changes by 1, and zero for all others. A Markovian transition probability matrix T was computed at each temperature, from which we obtained evidence for non-Arrhenius behavior and metastability. The non-Arrhenius behavior can be seen in Figure 87, where we plot the folding and unfolding rates as a function of temperature, computed as the inverse of the mean first passage times between the folded and unfolded states. The mean first passage times are computed using the method described by Singhal et al. (37). The unfolding rate increases with temperature. However, the folding rate decreases with temperature due to the high entropic barriers for refolding at high temperatures. Metastability for this system is confirmed by the large gap between the third and fourth timescales implied by T, as shown in Figure 88. At all temperatures, the third largest timescale is at least a factor of 5 greater than the fourth implied timescale. Therefore, we confirm that there is a separation of timescales for this system and that it has four metastable macrostates. The first three implied timescales correspond to transitions between macrostates, while the other, shorter implied timescales correspond to transitions within macrostates. The state decomposition can be obtained with the spectral clustering algorithm Perron Cluster Cluster Analysis (PCCA) (45), and the resulting definitions of the four metastable states are shown in Figure 85 (a).
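For readers who want to see how a rate follows from a transition matrix, the sketch below computes mean first passage times (MFPTs) from the standard first-passage equations, m_i = 1 + sum over non-target j of T_ij m_j, solved by fixed-point iteration. This is a generic textbook formulation and is not claimed to be the method of Singhal et al. (37); the three-state matrix is invented.

```python
def mean_first_passage_times(T, targets, tol=1e-12, max_iter=100000):
    """MFPT (in steps) from every state to the target set.

    Solves m_i = 1 + sum_{j not in targets} T[i][j] * m_j by iteration,
    with m_i = 0 for states already in the target set.
    """
    n = len(T)
    targets = set(targets)
    m = [0.0] * n
    for _ in range(max_iter):
        new = [0.0 if i in targets else
               1.0 + sum(T[i][j] * m[j] for j in range(n) if j not in targets)
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, m)) < tol:
            return new
        m = new
    raise RuntimeError("MFPT iteration did not converge")


# Invented chain: state 0 <-> state 1 <-> state 2.
T = [[0.8, 0.2, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]]
mfpt_to_2 = mean_first_passage_times(T, targets=[2])
rate_0_to_2 = 1.0 / mfpt_to_2[0]  # rate as the inverse MFPT, per step
```

For large state spaces a direct linear solve is usually preferable to iteration.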

Figure 87. Folding (black) and unfolding (red) rates plotted as a function of $\beta = 1/kT$.

COMPARING EFFICIENCY OF ASM AND GE USING THE SIMPLE POTENTIAL

To test our hypothesis that GE algorithms, in particular Simulated Tempering (ST), would exhibit a slower rate of convergence for equilibrium statistics than ASM, we simulated 1000 trajectories of $6 \times 10^{6}$ steps using each method. An optimal list of 10 temperatures with $\beta$ = 1.1, 0.995, 0.939, 0.89, 0.827, 0.652, 0.554, 0.519, 0.491, and 0.456 was selected for ST to obtain acceptance ratios greater than 40% between all neighboring temperatures. The weights ($g_i$) are chosen analytically from the partition function (177) to enable the system to sample every temperature uniformly.

gQi ln ( ) (H13)

An equal number of trajectories was started from each temperature, with temperature-change proposals made every 10 steps of simulation. Two independent sets of ST simulations were performed, with initial states 0 and 4, respectively.
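The role of the weights can be made concrete with a small sketch. It assumes the common simulated tempering convention in which the joint weight of configuration x at temperature index i is exp(-beta_i E(x) + g_i), so that choosing g_i as minus the logarithm of the partition function makes every temperature equally probable. The four-level energy list is invented, and only three of the inverse temperatures are used to keep the example short.

```python
import math


def partition_function(beta, energies):
    """Partition function of a toy system with the given energy levels."""
    return sum(math.exp(-beta * e) for e in energies)


def st_weights(betas, energies):
    """Weights g_i = -ln Q(beta_i) that flatten the temperature distribution."""
    return [-math.log(partition_function(b, energies)) for b in betas]


def acceptance_probability(energy, beta_old, beta_new, g_old, g_new):
    """Metropolis acceptance for a temperature change at fixed configuration."""
    log_ratio = -(beta_new - beta_old) * energy + (g_new - g_old)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


energies = [0.0, 1.0, 1.0, 2.0]  # invented energy levels
betas = [1.1, 0.995, 0.939]
g = st_weights(betas, energies)

# Marginal weight of each temperature; all equal to 1 with these weights.
temperature_marginals = [
    sum(math.exp(-b * e + gi) for e in energies) for b, gi in zip(betas, g)
]
p_up = acceptance_probability(2.0, betas[0], betas[1], g[0], g[1])
p_down = acceptance_probability(2.0, betas[1], betas[0], g[1], g[0])
```

With poorly chosen weights the marginals differ and the walk in temperature space becomes trapped at the favored temperatures.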

Figure 88. Logarithms of the implied timescales as a function of $\beta$ for the 2D potential. The three slowest timescales are plotted using up triangles, down triangles, and crosses, respectively.

For ASM, we simulated 250 trajectories from each of the 4 macrostates at a constant temperature of $\beta$ = 0.995, at which the folded state is the dominant state, in order to mimic the situation at physiological temperatures.

The convergence of the equilibrium populations from ST was analyzed in the following way. For a set number of trajectories, we take a window of 50,000 steps and compute the fraction of the configurations within this window that are in a given metastable state at temperature $\beta$ = 0.995. By bootstrapping this estimator 100 times, we can determine the distribution of the state populations as a function of simulation time (see Figure 89). Populations obtained from the two independent sets of ST simulations converge between $2.5 \times 10^{5}$ and $3 \times 10^{5}$ steps.
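A bootstrap of this kind can be sketched as follows. The example assumes that the quantity being resampled is the per-trajectory fraction of frames in a given state within the window, and that whole trajectories are resampled with replacement; both are assumptions made for illustration, and the input fractions are invented.

```python
import random
import statistics


def bootstrap_population(per_traj_fractions, n_boot=100, seed=0):
    """Bootstrap mean and standard deviation of a state population.

    per_traj_fractions holds, for each trajectory, the fraction of frames
    in the state of interest within the analysis window.
    """
    rng = random.Random(seed)
    n = len(per_traj_fractions)
    estimates = []
    for _ in range(n_boot):
        resample = [per_traj_fractions[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(resample) / n)
    return statistics.mean(estimates), statistics.pstdev(estimates)


# Invented data: half of the trajectories sit in the state, half do not.
mixed_mean, mixed_std = bootstrap_population([0.0, 1.0] * 50)

# Identical trajectories give zero bootstrap spread.
const_mean, const_std = bootstrap_population([0.25] * 10)
```

Repeating this for windows at increasing simulation times gives a population estimate with error bars as a function of trajectory length.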

Figure 89. Populations computed from Simulated Tempering (ST) simulations for the four metastable states, plotted as a function of the length of the simulation. The reference populations are shown as solid lines, and 1000 trajectories are used for this calculation. The error bars are the standard deviations obtained from bootstrapping 100 times with replacement.

Similarly for ASM, we obtain a distribution for the equilibrium populations at different trajectory lengths for a given number of trajectories, computed by a Bayesian method (251). As shown in Figure 90, it only takes about $4 \times 10^{4}$ steps for ASM to converge to the correct populations, which is much more efficient than ST. The populations in Figure 90 are computed using a lag time of 1/3 of the trajectory length. However, we show in Figure 91 that the populations are almost invariant to the lag time if it is longer than about 1/8 of the trajectory. We note that one has to choose a proper lag time in order to get a good estimate of the populations. A good lag time has to be short enough that there are enough transition counts, but not so short that many of the transition counts are correlated. In our RNA hairpin example, we use a short lag time, but only a few transition counts are taken from each trajectory to make sure we only consider independent transition events. In that case, we can still estimate thermodynamic properties accurately even though the model is not Markovian at the lag time used.

Figure 90. Populations computed from the Adaptive Seeding Method (ASM) for the four metastable states, plotted as a function of the length of the simulation. The reference populations are shown as solid lines, and 1000 trajectories are used for this calculation. The lag time is selected as 1/3 of the length of the simulation. The error bars are standard deviations obtained from a Bayesian method (see Section 2.5.3 for details).

To compare the efficiency of ASM and ST as a function of the length and number of trajectories, we define the following convergence criterion: the probability that the estimated populations for all states are within 5% of the actual equilibrium populations is greater than 80%. The population distributions are computed in the same way as in Figure 89 for ST and in Figure 90 for ASM. As shown in Figure 92, ASM is much more efficient than ST and can reach convergence using simulations 4-7 times shorter than those required by ST. In addition, the efficiency of ST does not increase with the number of trajectories beyond 200, while the efficiency of ASM keeps increasing with the number of trajectories up to 600. Ideally, the length of the seeding simulations should lie in the major gap of the implied timescales, such that they are longer than the slowest intra-macrostate equilibration time, to minimize the model error due to non-Markovian effects. In the current system, the minimum length of the

4 3 4 simulations (~510 ) is indeed between 3rd (1.61 10 ) and 4th (9.58 10 ) slowest implied timescales. There is evidence from the RNA hairpin example and previous work on a water dewetting transition in a carbon nanotube (7) that these requirements for the lag time may be relaxed for real systems, where the separation of timescales is less evident than in the model system studied here. Additionally, the number of seeding simulations has to be large enough to reduce the statistical error to a satisfactory level.
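The convergence criterion used in this comparison can be written as a short check over a set of sampled population vectors (for example, bootstrap or posterior samples). The sketch assumes that "within 5%" is meant relative to each reference population; an absolute tolerance would be an equally simple variant. All numbers below are invented.

```python
def fraction_within_tolerance(samples, reference, rel_tol=0.05):
    """Fraction of samples whose populations all lie within rel_tol of the reference."""
    hits = 0
    for sample in samples:
        if all(abs(p - r) <= rel_tol * r for p, r in zip(sample, reference)):
            hits += 1
    return hits / len(samples)


def is_converged(samples, reference, rel_tol=0.05, min_probability=0.8):
    """Apply the criterion: P(all states within tolerance) > min_probability."""
    return fraction_within_tolerance(samples, reference, rel_tol) > min_probability


reference = [0.7, 0.1, 0.1, 0.1]        # invented reference populations
good = [0.69, 0.102, 0.104, 0.104]      # every state within 5% of reference
bad = [0.60, 0.20, 0.10, 0.10]          # first two states are far off

mostly_good = [good] * 9 + [bad]
too_noisy = [good] * 7 + [bad] * 3
converged = is_converged(mostly_good, reference)
not_converged = is_converged(too_noisy, reference)
```

Scanning this check over trajectory length and number of trajectories yields the kind of curve summarized in Figure 92.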

Figure 91. Populations computed from ASM simulations for four metastable states as a function of lag time.

Figure 92. Number of steps required to reach convergence as a function of the number of trajectories.

BIBLIOGRAPHY

1. Schütte C, Fischer A, Huisinga W, & Deuflhard P (1999) A direct approach to conformational dynamics based on hybrid Monte Carlo. J Comput Phys 151:146-168.
2. Bowman GR, Huang X, & Pande VS (2010) Network models for molecular kinetics and their initial applications to human health. Cell Res 20:622-630.
3. Noe F & Fischer S (2008) Transition networks for modeling the kinetics of conformational change in macromolecules. Curr Opin Struct Biol 18:154-162.
4. Bowman GR, Beauchamp KA, Boxer G, & Pande VS (2009) Progress and challenges in the automated construction of Markov state models for full protein systems. J Chem Phys 131:124101.
5. Noe F, Schutte C, Vanden-Eijnden E, Reich L, & Weikl TR (2009) Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proc Natl Acad Sci U S A 106:19011-19016.
6. Chodera JD, Singhal N, Pande VS, Dill KA, & Swope WC (2007) Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J Chem Phys 126:155101.
7. Sriraman S, Kevrekidis IG, & Hummer G (2005) Coarse nonlinear dynamics and metastability of filling-emptying transitions: Water in carbon nanotubes. Phys. Rev. Lett. 95:130603.
8. Gfeller D, De Los Rios P, Caflisch A, & Rao F (2007) Complex network analysis of free-energy landscapes. Proc Natl Acad Sci U S A 104:1817-1822.
9. Schutte C (1999) Conformational Dynamics: Modeling, Theory, Algorithm, and Application to Biomolecules. (thesis, Freie Universitat Berlin).
10. Bowman GR, Huang X, & Pande VS (2009) Using generalized ensemble simulations and Markov state models to identify conformational states. Methods 49:197-201.
11. Sriraman S, Kevrekidis IG, & Hummer G (2005) Coarse master equation from Bayesian analysis of replica molecular dynamics simulations. J Phys Chem B 109:6479-6484.
12. Huang X, Bowman GR, Bacallado S, & Pande VS (2009) Rapid equilibrium sampling initiated from nonequilibrium data. Proc Natl Acad Sci U S A 106:19765-19769.
13. Huang X, et al. (2010) Constructing multi-resolution Markov state models (MSMs) to elucidate RNA hairpin folding mechanisms. Pac Symp Biocomput 15:228-239.

14. Noe F, Horenko I, Schutte C, & Smith JC (2007) Hierarchical analysis of conformational dynamics in biomolecules: transition networks of metastable states. J Chem Phys 126:155102.
15. Sarich M, Noe F, & Schutte C (2010) On the approximation quality of Markov state models. SIAM Multiscale Model Simul, in press.
16. Bowman GR & Pande VS (2010) Protein folded states are kinetic hubs. Proc Natl Acad Sci U S A 107:10890-10895.
17. Rao F & Caflisch A (2004) The protein folding network. J Mol Biol 342:299-306.
18. Bowman GR, Ensign DL, & Pande VS (2010) Enhanced modeling via network theory: adaptive sampling of Markov state models. J Chem Theory Comput 6:787-794.
19. Hinrichs NS & Pande VS (2007) Calculation of the distribution of eigenvalues and eigenvectors in Markovian state models for molecular dynamics. J Chem Phys 126:244101.
20. Roblitz S (2008) Statistical error estimation and grid-free hierarchical refinement in conformation dynamics. (thesis, Freie Universitat Berlin).
21. Mitsutake A, Sugita Y, & Okamoto Y (2001) Generalized-ensemble algorithms for molecular simulations of biopolymers. Biopolymers 60:96-123.
22. Hansmann UH & Okamoto Y (1999) New Monte Carlo algorithms for protein folding. Curr. Opin. Struct. Biol. 9:177-183.
23. Sugita Y & Okamoto Y (1999) Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. 314:141-151.
24. Lyubartsev AP, Martsinovski AA, Shevkunov SV, & Vorontsov-Velyaminov PN (1992) New approach to Monte Carlo calculation of the free energy: Method of expanded ensembles. J. Chem. Phys. 96:1776-1783.
25. Marinari E & Parisi G (1992) Simulated Tempering: a New Monte Carlo Scheme. Europhys. Lett. 19:451-458.
26. Zhou R, Berne BJ, & Germain R (2001) The free energy landscape for beta hairpin folding in explicit water. Proc. Natl. Acad. Sci. USA 98:14931-14936.
27. Rhee YM & Pande VS (2003) Multiplexed-replica exchange molecular dynamics method for protein folding simulation. Biophysical journal 84:775-786.
28. Nymeyer H & Garcia AE (2003) Simulation of the folding equilibrium of alpha-helical peptides: a comparison of the generalized Born approximation with explicit solvent. Proc. Natl. Acad. Sci. USA 100:13934-13939.
29. Zhou R (2003) Trp-cage: folding free energy landscape in explicit water. Proc. Natl. Acad. Sci. USA 100:13280-13285.
30. Krivov SV & Karplus M (2004) Hidden complexity of free energy surfaces for peptide (protein) folding. Proc. Natl. Acad. Sci. U.S.A. 101:14766-14770.
31. Karpen ME, Tobias DJ, & Brooks CL, 3rd (1993) Statistical clustering techniques for the analysis of long molecular dynamics trajectories: analysis of 2.2-ns trajectories of YPGDV. Biochemistry 32:412-420.

32. Shao JY, Tanner SW, Thompson N, & Cheatham TE (2007) Clustering molecular dynamics trajectories: 1. Characterizing the performance of different clustering algorithms. J. Chem. Theory Comp. 3:2312-2334.
33. Buchete NV & Hummer G (2008) Coarse master equations for peptide folding dynamics. J Phys Chem B 112:6057-6069.
34. Swope WC, Pitera JW, & Suits F (2004) Describing protein folding kinetics by molecular dynamics simulations. 1. Theory. J Phys Chem B 108:6571-6581.
35. Frauenfelder H, Sligar SG, & Wolynes PG (1991) The energy landscapes and motions of proteins. Science 254:1598-1603.
36. Yang WY & Gruebele M (2004) Detection-dependent kinetics as a probe of folding landscape microstructure. J Am Chem Soc 126:7758-7759.
37. Singhal N, Snow CD, & Pande VS (2004) Using path sampling to build better Markovian state models: predicting the folding rate and mechanism of a tryptophan zipper beta hairpin. J. Chem. Phys. 121:415-425.
38. Elmer S, Park S, & Pande VS (2005) Foldamer dynamics expressed via Markov State Models: 2. Explicit solvent molecular dynamics simulations in acetonitrile, chloroform, methanol, and water. J. Chem. Phys. 122:124908.
39. Jayachandran G, Vishal V, & Pande VS (2006) Folding Simulations of the Villin Headpiece in All-Atom Detail. J. Chem. Phys. 124:164902.
40. Kelley NW, Vishal V, Krafft GA, & Pande VS (2008) Simulating oligomerization at experimental concentrations and long timescales: A Markov state model approach. J Chem Phys 129:214707.
41. Gonzalez T (1985) Clustering to minimize the maximum intercluster distance. Theo. Comp. Sci. 38:293-306.
42. Dasgupta S & Long PM (2005) Performance guarantees for hierarchical clustering. J. Comput. System Sci. 70:555-569.
43. Lindahl E, Hess B, & van der Spoel D (2001) GROMACS 3.0: a package for molecular simulation and trajectory analysis. J. Mol. Modeling. 7:306-317.
44. Deuflhard P & Weber M (2005) Robust Perron cluster analysis in conformation dynamics. Lin. Alg. Appl. 398:161-184.
45. Deuflhard P, Huisinga W, Fischer A, & Schütte C (2000) Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains. Lin. Alg. Appl. 315:39-59.
46. Anfinsen CB, Haber E, Sela M, & White FH, Jr. (1961) The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc Natl Acad Sci USA 47:1309-1314.
47. Klein WL, Stine WB, Jr., & Teplow DB (2004) Small assemblies of unmodified amyloid beta-protein are the proximate neurotoxin in Alzheimer's disease. Neurobiol Aging 25:569-580.
48. Simons KT, Kooperberg C, Huang E, & Baker D (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268:209-225.
49. Bowman GR & Pande VS (2009) Simulated tempering yields insight into the low-resolution Rosetta scoring functions. Proteins 74:777-788.

50. Bolhuis PG, Dellago C, & Chandler D (2000) Reaction coordinates of biomolecular isomerization. Proc Natl Acad Sci U S A 97:5877-5882.
51. Du R, Pande VS, Grosberg AY, Tanaka T, & Shakhnovich ES (1998) On the transition coordinate for protein folding. J Chem Phys 108:334-350.
52. Bowman GR, et al. (2008) Structural insight into RNA hairpin folding intermediates. J Am Chem Soc 130:9676-9678.
53. Dill KA, Ozkan SB, Shell MS, & Weikl TR (2008) The protein folding problem. Annu Rev Biophys 37:289-316.
54. Chodera JD, Swope WC, Pitera JW, & Dill KA (2006) Long-timescale protein folding dynamics from short-time molecular dynamics simulations. Multi Mod Simul 5:1214-1226.
55. Yang S, Banavali NK, & Roux B (2009) Mapping the conformational transition in Src activation by cumulating the information from multiple molecular dynamics trajectories. Proc Natl Acad Sci U S A 106:3776-3781.
56. Kubelka J, Chiu TK, Davies DR, Eaton WA, & Hofrichter J (2006) Sub-microsecond protein folding. J Mol Biol 359:546-553.
57. Chiu TK, et al. (2005) High-resolution x-ray crystal structures of the villin headpiece subdomain, an ultrafast folding protein. Proc Natl Acad Sci USA 102:7517-7522.
58. Ensign DL, Kasson PM, & Pande VS (2007) Heterogeneity even at the speed limit of folding: large-scale molecular dynamics study of a fast-folding variant of the villin headpiece. J Mol Biol 374:806-816.
59. Berendsen HJC, van der Spoel D, & van Drunen R (1995) Gromacs - a Message-Passing Parallel Molecular-Dynamics Implementation. Computer Physics Communications 91:43-56.
60. Wang JM, Cieplak P, & Kollman PA (2000) How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? Journal of computational chemistry 21:1049-1074.
61. Ryckaert JP, Ciccotti G, & Berendsen HJC (1977) Numerical Integration of the Cartesian Equations of Motion of a System with Constraints: Molecular Dynamics of n-Alkanes. J. Comp. Phys. 23:327-341.
62. Miyamoto S & Kollman PA (1992) Settle - an Analytical Version of the Shake and Rattle Algorithm for Rigid Water Models. Journal of computational chemistry 13:952-962.
63. Hoover W (1985) Canonical dynamics: Equilibrium phase-space distributions. Phys. Rev. A 31:1695-1697.
64. Nose S & Klein ML (1983) Constant Pressure Molecular-Dynamics for Molecular-Systems. Molecular Physics 50:1055-1076.
65. Nose S (1984) A Molecular-Dynamics Method for Simulations in the Canonical Ensemble. Molecular Physics 52:255-268.
66. Parrinello M & Rahman A (1981) Polymorphic Transitions in Single-Crystals - a New Molecular-Dynamics Method. Journal of Applied Physics 52:7182-7190.

67. Humphrey W, Dalke A, & Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14:33-38.
68. Schultheis V, Hirschberger T, Carstens H, & Tavan P (2005) Extracting Markov Models of Peptide Conformational Dynamics from Simulation Data. JCTC 1:515-526.
69. Bolhuis PG, Chandler D, Dellago C, & Geissler PL (2002) Transition path sampling: throwing ropes over rough mountain passes, in the dark. Annu Rev Phys Chem 53:291-318.
70. Dill KA, Ozkan SB, Weikl TR, Chodera JD, & Voelz VA (2007) The protein folding problem: when will it be solved? Curr Opin Struct Biol 17:342-346.
71. Plaxco KW, Simons KT, & Baker D (1998) Contact order, transition state placement and the refolding rates of single domain proteins. J Mol Biol 277:985-994.
72. Yang WY & Gruebele M (2003) Folding at the speed limit. Nature 423:193-197.
73. Kubelka J, Hofrichter J, & Eaton WA (2004) The protein folding 'speed limit'. Curr Opin Struct Biol 14:76-88.
74. Udgaonkar JB (2008) Multiple routes and structural heterogeneity in protein folding. Annu Rev Biophys 37:489-510.
75. Pitera JW & Swope W (2003) Understanding folding and design: replica-exchange simulations of "Trp-cage" miniproteins. Proc Natl Acad Sci U S A 100:7587-7592.
76. Zagrovic B, Snow CD, Shirts MR, & Pande VS (2002) Simulation of folding of a small alpha-helical protein in atomistic detail using worldwide-distributed computing. J Mol Biol 323:927-937.
77. Ensign DL & Pande VS (2009) The Fip35 WW domain folds with structural and mechanistic heterogeneity in molecular dynamics simulations. Biophys J 96:L53-55.
78. Horng JC, Moroz V, & Raleigh DP (2003) Rapid cooperative two-state folding of a miniature alpha-beta protein and design of a thermostable variant. J Mol Biol 326:1261-1270.
79. Shirts M & Pande VS (2000) COMPUTING: Screen Savers of the World Unite! Science 290:1903-1904.
80. Friedrichs MS, et al. (2009) Accelerating molecular dynamic simulation on graphics processing units. J Comput Chem 30:864-872.
81. Onufriev A, Bashford D, & Case DA (2004) Exploring protein native states and large-scale conformational changes with a modified generalized born model. Proteins 55:383-394.
82. Shell MS, Ritterson R, & Dill KA (2008) A test on peptide stability of AMBER force fields with implicit solvation. J Phys Chem B 112:6878-6886.
83. Hoffman DW, et al. (1994) Crystal structure of prokaryotic ribosomal protein L9: a bi-lobed RNA-binding protein. EMBO J 13:205-212.
84. Shirts MR & Pande VS (2001) Mathematical analysis of coupled parallel simulations. Phys Rev Lett 86:4983-4987.

85. Ensign DL & Pande VS (2009) Bayesian single-exponential kinetics in single-molecule experiments and simulations. J Phys Chem B 113:12410-12423.
86. Panchenko AR, Luthey-Schulten Z, & Wolynes PG (1996) Foldons, protein structural modules, and exons. Proc Natl Acad Sci U S A 93:2008-2013.
87. Metzner P, Schutte C, & Vanden-Eijnden E (2009) Transition Path Theory for Markov Jump Processes. Multiscale Modeling & Simulation 7:1192-1219.
88. Weikl TR (2008) Loop-closure principles in protein folding. Archives of Biochemistry and Biophysics 469:67-75.
89. Snow CD, Rhee YM, & Pande VS (2006) Kinetic definition of protein folding transition state ensembles and reaction coordinates. Biophys J 91:14-24.
90. Uversky VN (2009) Intrinsic disorder in proteins associated with neurodegenerative diseases. Front Biosci 14:5188-5238.
91. Bowman GR & Pande VS (2009) The roles of entropy and kinetics in structure prediction. PLoS One 4:e5840.
92. Ozkan SB, Wu GA, Chodera JD, & Dill KA (2007) Protein folding by zipping and assembly. Proc Natl Acad Sci U S A 104:11987-11992.
93. Voelz VA, Bowman GR, Beauchamp KA, & Pande VS (2010) Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39). J Am Chem Soc 132:1526-1528.
94. Jackson SE & Fersht AR (1991) Folding of chymotrypsin inhibitor 2. 1. Evidence for a two-state transition. Biochemistry 30:10428-10435.
95. Bryngelson JD, Onuchic JN, Socci ND, & Wolynes PG (1995) Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins 21:167-195.
96. Barrick D (2009) What have we learned from the studies of two-state folders, and what are the unanswered questions about two-state protein folding? Phys Biol 6:15001.
97. Spudich GM, Miller EJ, & Marqusee S (2004) Destabilization of the Escherichia coli RNase H kinetic intermediate: switching between a two-state and three-state folding mechanism. J Mol Biol 335:609-618.
98. Radford SE, Dobson CM, & Evans PA (1992) The folding of hen lysozyme involves partially structured intermediates and multiple pathways. Nature 358:302-307.
99. Kamagata K, Sawano Y, Tanokura M, & Kuwajima K (2003) Multiple parallel-pathway folding of proline-free Staphylococcal nuclease. J Mol Biol 332:1143-1153.
100. Ma H & Gruebele M (2006) Low barrier kinetics: dependence on observables and free energy surface. J Comput Chem 27:125-134.
101. Wales DJ & Scheraga HA (1999) Global optimization of clusters, crystals, and biomolecules. Science 285:1368-1372.
102. Wetlaufer DB (1973) Nucleation, rapid folding, and globular intrachain regions in proteins. Proc Natl Acad Sci U S A 70:697-701.
103. Myers JK & Oas TG (2001) Preorganized secondary structure as an important determinant of fast protein folding. Nat Struct Biol 8:552-558.

104. Krishna MM, Maity H, Rumbley JN, Lin Y, & Englander SW (2006) Order of steps in the cytochrome C folding pathway: evidence for a sequential stabilization mechanism. J Mol Biol 359:1410-1419.
105. Volk M, et al. (1997) Peptide Conformational Dynamics and Vibrational Stark Effects Following Photoinitiated Disulfide Cleavage. J Phys Chem B 101:8607.
106. Sabelko J, Ervin J, & Gruebele M (1999) Observation of strange kinetics in protein folding. Proc Natl Acad Sci U S A 96:6031-6036.
107. Liu F & Gruebele M (2007) Tuning lambda6-85 towards downhill folding at its melting temperature. J Mol Biol 370:574-584.
108. Liu F, et al. (2009) A one-dimensional free energy surface does not account for two-probe folding kinetics of protein alpha(3)D. J Chem Phys 130:061101.
109. Ghosh K & Dill KA (2007) The ultimate speed limit to protein folding is conformational searching. J Am Chem Soc 129:11920-11927.
110. Betancourt MR & Onuchic JN (1995) Kinetics of protein like models: The energy landscape factors that determine folding. J Chem Phys 103:773.
111. Cho SS, Levy Y, & Wolynes PG (2006) P versus Q: structural reaction coordinates capture protein folding on smooth landscapes. Proc Natl Acad Sci U S A 103:586-591.
112. Leopold PE, Montal M, & Onuchic JN (1992) Protein folding funnels: a kinetic approach to the sequence-structure relationship. Proc Natl Acad Sci U S A 89:8721-8725.
113. Nettels D, Gopich IV, Hoffmann A, & Schuler B (2007) Ultrafast dynamics of protein collapse from single-molecule photon statistics. Proc Natl Acad Sci U S A 104:2655-2660.
114. Waldauer SA, et al. (2008) Ruggedness in the folding landscape of protein L. HFSP J 2:388-395.
115. Voelz VA, Singh VR, Wedemeyer WJ, Lapidus LJ, & Pande VS (2010) Unfolded state dynamics and structure of protein L characterized by simulation and experiment. J Am Chem Soc 132:4702-4709.
116. Watts DJ & Strogatz SH (1998) Collective dynamics of 'small-world' networks. Nature 393:440-442.
117. Barabasi AL & Albert R (1999) Emergence of scaling in random networks. Science 286:509-512.
118. Dill KA & Chan HS (1997) From Levinthal to pathways to funnels. Nat Struct Biol 4:10-19.
119. Milgram S (1967) The small world problem. Psychol Today 1:61-67.
120. Chung HS, Louis JM, & Eaton WA (2009) Experimental determination of upper bound for transition path times in protein folding from single-molecule photon-by-photon trajectories. Proc Natl Acad Sci U S A 106:11837-11844.
121. Fersht AR (2002) On the simulation of protein folding by short time scale molecular dynamics and distributed computing. Proc Natl Acad Sci U S A 99:14122-14125.

122. Saven JG, Wang J, & Wolynes PG (1994) Kinetics of Protein-Folding - the Dynamics of Globally Connected Rough Energy Landscapes with Biases. J Chem Phys 101:11037-11043.
123. Wang J, Saven JG, & Wolynes PG (1996) Kinetics in a globally connected, correlated random energy model. J Chem Phys 105:11276-11284.
124. Du R, Pande VS, Grosberg AY, Tanaka T, & Shakhnovich ES (1999) On the role of conformational geometry in protein folding. J Chem Phys 111:10375.
125. Andrec M, Felts AK, Gallicchio E, & Levy RM (2005) Protein folding pathways from replica exchange simulations and a kinetic network model. Proc Natl Acad Sci U S A 102:6801-6806.
126. Kim PS & Baldwin RL (1990) Intermediates in the folding reactions of small proteins. Annu Rev Biochem 59:631-660.
127. Shan B, Eliezer D, & Raleigh DP (2009) The unfolded state of the C-terminal domain of the ribosomal protein L9 contains both native and non-native structure. Biochemistry 48:4707-4719.
128. Kuzmenkina EV, Heyes CD, & Nienhaus GU (2005) Single-molecule Forster resonance energy transfer study of protein dynamics under denaturing conditions. Proc Natl Acad Sci U S A 102:15471-15476.
129. McLeish TC (2005) Protein folding in high-dimensional spaces: hypergutters and the role of nonnative interactions. Biophys J 88:172-183.
130. Pabo CO & Lewis M (1982) The operator-binding domain of lambda repressor: structure and DNA recognition. Nature 298:443-447.
131. Clarke ND, Beamer LJ, Goldberg HR, Berkower C, & Pabo CO (1991) The DNA binding arm of lambda repressor: critical contacts from a flexible region. Science 254:267-270.
132. Huang GS & Oas TG (1995) Submillisecond folding of monomeric lambda repressor. Proc Natl Acad Sci U S A 92:6878-6882.
133. Burton RE, Huang GS, Daugherty MA, Calderone TL, & Oas TG (1997) The energy landscape of a fast-folding protein mapped by Ala→Gly substitutions. Nat Struct Biol 4:305-310.
134. Ghaemmaghami S, Word JM, Burton RE, Richardson JS, & Oas TG (1998) Folding kinetics of a fluorescent variant of monomeric lambda repressor. Biochemistry 37:9179-9185.
135. Liu F, Gao YG, & Gruebele M (2010) A survey of lambda repressor fragments from two-state to downhill folding. J Mol Biol 397:789-798.
136. Larios E, Pitera JW, Swope W, & Gruebele M (2006) Correlation of early orientational ordering of engineered λ6–85 structure with kinetics and thermodynamics. Chem Phys 323:45-53.
137. Yang WY & Gruebele M (2004) Folding lambda-repressor at its speed limit. Biophys J 87:596-608.
138. Allen LR, Krivov SV, & Paci E (2009) Analysis of the free-energy surface of proteins from reversible folding simulations. PLoS Comput Biol 5:e1000428.

139. Yang WY, Larios E, & Gruebele M (2003) On the extended beta-conformation propensity of polypeptides at high temperature. J Am Chem Soc 125:16220-16227.
140. Hoffmann A, et al. (2007) Mapping protein collapse with single-molecule fluorescence and kinetic synchrotron radiation circular dichroism spectroscopy. Proc Natl Acad Sci U S A 104:105-110.
141. DeCamp SJ, Naganathan AN, Waldauer SA, Bakajin O, & Lapidus LJ (2009) Direct observation of downhill folding of lambda-repressor in a microfluidic mixer. Biophys J 97:1772-1777.
142. Ma H & Gruebele M (2005) Kinetics are probe-dependent during downhill folding of an engineered lambda6-85 protein. Proc Natl Acad Sci U S A 102:2283-2287.
143. Munoz V & Serrano L (1994) Elucidating the folding problem of helical peptides using empirical parameters. Nat Struct Biol 1:399-409.
144. Portman J, Takada S, & Wolynes PG (1998) Variational Theory for Site Resolved Protein Folding Free Energy Surfaces. Phys Rev Lett 81:5237-5240.
145. Burton RE, Myers JK, & Oas TG (1998) Protein Folding Dynamics: Quantitative Comparison between Theory and Experiment. Biochemistry 37:5337-5343.
146. Pande VS (2010) A simple theory of protein folding kinetics. Phys Rev Lett, in submission.
147. Liu F, et al. (2008) An experimental survey of the transition between two-state and downhill protein folding scenarios. Proc Natl Acad Sci U S A 105:2369-2374.
148. He Y, Yeh DC, Alexander P, Bryan PN, & Orban J (2005) Solution NMR structures of IgG binding domains with artificially evolved high levels of sequence identity but different folds. Biochemistry 44:14055-14061.
149. Rhee YM & Pande VS (2006) On the role of chemical detail in simulating protein folding kinetics. Chem Phys 323:66-77.
150. Bradley P, Misura KM, & Baker D (2005) Toward high-resolution de novo structure prediction for small proteins. Science 309:1868-1871.
151. Das R, et al. (2007) Structure prediction for CASP7 targets using extensive all-atom refinement with Rosetta@home. Proteins 69:118-128.
152. Klepeis JL, Lindorff-Larsen K, Dror RO, & Shaw DE (2009) Long-timescale molecular dynamics simulations of protein structure and function. Curr Opin Struct Biol 19:120-127.
153. Geyer CJ (1992) Practical Markov Chain Monte Carlo. Stat Sci 7:473-511.
154. King RD, et al. (2009) The automation of science. Science 324:85-89.
155. Pande VS, et al. (2003) Atomistic protein folding simulations on the submillisecond time scale using worldwide distributed computing. Biopolymers 68:91-109.
156. Faradjian AK & Elber R (2004) Computing time scales from reaction coordinates by milestoning. J Chem Phys 120:10880-10889.

157. Rogal J & Bolhuis PG (2008) Multiple state transition path sampling. J Chem Phys 129:224107.
158. MacKay DJC (2003) Information theory, inference, and learning algorithms (Cambridge University Press, Cambridge, UK ; New York) p 34.
159. Shell MS (2008) The relative entropy is fundamental to multiscale and inverse thermodynamic problems. J Chem Phys 129:144108.
160. Cover TM & Thomas JA (2006) Elements of information theory (Wiley-Interscience, Hoboken, N.J.) 2nd Ed pp xxiii, 748 p.
161. Singhal N & Pande VS (2005) Error analysis and efficient sampling in Markovian state models for molecular dynamics. J Chem Phys 123:204909.
162. Baker D (2006) Prediction and design of macromolecular structures and interactions. Philos Trans R Soc Lond B Biol Sci 361:459-463.
163. Misura KM & Baker D (2005) Progress and challenges in high-resolution refinement of protein structure models. Proteins 59:15-29.
164. Schueler-Furman O, Wang C, Bradley P, Misura K, & Baker D (2005) Progress in modeling of protein structures and interactions. Science 310:638-642.
165. Kuhlman B, et al. (2003) Design of a novel globular protein fold with atomic-level accuracy. Science 302:1364-1368.
166. Kortemme T, et al. (2004) Computational redesign of protein-protein interaction specificity. Nat Struct Mol Biol 11:371-379.
167. Ashworth J, et al. (2006) Computational redesign of endonuclease DNA binding and cleavage specificity. Nature 441:656-659.
168. Nauli S, Kuhlman B, & Baker D (2001) Computer-based redesign of a protein folding pathway. Nat Struct Biol 8:602-605.
169. Nauli S, et al. (2002) Crystal structures and increased stabilization of the protein G variants with switched folding pathways NuG1 and NuG2. Protein Sci 11:2924-2931.
170. Qian B, et al. (2007) High-resolution structure prediction and the crystallographic phase problem. Nature 450:259-264.
171. Rothlisberger D, et al. (2008) Kemp elimination catalysts by computational enzyme design. Nature 453:190-195.
172. Simons K, et al. (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 34:82-95.
173. Shortle D, Simons K, & Baker D (1998) Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci USA 95:11158-11162.
174. Lee M, Tsai J, Baker D, & Kollman PA (2001) Molecular dynamics in the endgame of protein structure prediction. J Mol Biol 313:417-430.
175. Chivian D, et al. (2005) Prediction of CASP6 structures using automated robetta protocols. Proteins 61:157-166.
176. Rohl C, Strauss C, Misura K, & Baker D (2004) Protein structure prediction using rosetta. Meth Enzymol 383:66-93.

177. Huang X, Bowman GR, & Pande VS (2008) Convergence of folding free energy landscapes via application of enhanced sampling methods in a distributed computing environment. J Chem Phys 128:205106.
178. McGuffin LJ, Bryson K, & Jones DT (2000) The PSIPRED protein structure prediction server. Bioinformatics 16:404-405.
179. Meiler J, Muller M, Zeidler A, & Schmaschke F (2001) Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks. Journal of Molecular Modeling 7:360-369.
180. Karplus K & Hu BR (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics 17:713-720.
181. Ouali M & King RD (2000) Cascaded multiple classifiers for secondary structure prediction. Protein Science 9:1162-1176.
182. Bystroff C, Simons KT, Han KF, & Baker D (1996) Local sequence-structure correlations in proteins. Current Opinion in Biotechnology 7:417-421.
183. Engh RA & Huber R (1991) Accurate Bond and Angle Parameters for X-Ray Protein-Structure Refinement. Acta Crystallographica Section A 47:392-400.
184. Neria E, Fischer S, & Karplus M (1996) Simulation of activation free energies in molecular systems. Journal of Chemical Physics 105:1902-1921.
185. Dunbrack RL & Cohen FE (1997) Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Science 6:1661-1681.
186. Lazaridis T & Karplus M (1999) Effective energy function for proteins in solution. Proteins-Structure Function and Genetics 35:133-152.
187. Kortemme T, Morozov AV, & Baker D (2003) An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J Mol Biol 326:1239-1259.
188. Morozov AV, Kortemme T, Tsemekhman K, & Baker D (2004) Close agreement between the orientation dependence of hydrogen bonds observed in protein structures and quantum mechanical calculations. Proc Natl Acad Sci USA 101:6946-6951.
189. Park S & Pande VS (2007) Choosing weights for simulated tempering. Phys Rev E Stat Nonlin Soft Matter Phys 76:016703.
190. Shirts M & Chodera J (2008) Statistically optimal analysis of samples from multiple equilibrium states. J Chem Phys 129:124105.
191. Kumar S, Bouzida D, Swendsen RH, Kollman PA, & Rosenberg JM (1992) The Weighted Histogram Analysis Method for Free-Energy Calculations on Biomolecules. 1. The Method. J Comp Chem 13:1011-1021.
192. Noble MEM, Musacchio A, Saraste M, Courtneidge SA, & Wierenga RK (1993) Crystal-Structure of the SH3 Domain in Human Fyn - Comparison of the 3-Dimensional Structures of SH3 Domains in Tyrosine Kinases and Spectrin. EMBO Journal 12:2617-2624.
193. Derrick JP & Wigley DB (1994) The third IgG-binding domain from streptococcal protein G. An analysis by X-ray crystallography of the structure alone and in a complex with Fab. J Mol Biol 243:906-918.

194. Cornilescu G, Marquardt JL, Ottiger M, & Bax A (1998) Validation of protein structure from anisotropic carbonyl chemical shifts in a dilute liquid crystalline phase. Journal of the American Chemical Society 120:6836-6837.
195. Heurgue-Hamard V, et al. (2006) The zinc finger protein Ynr046w is plurifunctional and a component of the eRF1 methyltransferase in yeast. Journal of Biological Chemistry 281:36140-36148.
196. Yang JS, Chen WW, Skolnick J, & Shakhnovich EI (2007) All-atom ab initio folding of a diverse set of proteins. Structure 15:53-63.
197. Yang JS, Wallin S, & Shakhnovich EI (2008) Universality and diversity of folding mechanics for three-helix bundle proteins. Proc Natl Acad Sci U S A 105:895-900.
198. Moult J (2005) A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 15:285-289.
199. Das R & Baker D (2008) Macromolecular modeling with rosetta. Annu Rev Biochem 77:363-382.
200. Shmygelska A & Levitt M (2009) Generalized ensemble methods for de novo structure prediction. Proc Natl Acad Sci U S A.
201. Sugita Y, Kitao A, & Okamoto Y (2000) Multidimensional replica-exchange method for free-energy calculations. J Chem Phys 113:6042-6051.
202. Neale C, Rodinger T, & Pomès R (2008) Equilibrium exchange enhances the convergence rate of umbrella sampling. Chem Phys Lett 460:375-381.
203. Rao F & Caflisch A (2003) Replica exchange molecular dynamics simulations of reversible folding. J Chem Phys 119:4035-4042.
204. Clarke ND, Kissinger CR, Desjarlais J, Gilliland GL, & Pabo CO (1994) Structural studies of the engrailed homeodomain. Protein Sci 3:1779-1787.
205. Tsai CJ, Maizel JV, & Nussinov R (2000) Anatomy of protein structures: Visualizing how a one-dimensional protein chain folds into a three-dimensional shape. Proc Natl Acad Sci USA 97:12038-12043.
206. Haspel N, Tsai CJ, Wolfson H, & Nussinov R (2003) Reducing the computational complexity of protein folding via fragment folding and assembly. Protein Sci 12:1177-1187.
207. Kifer I, Nussinov R, & Wolfson HJ (2008) Constructing templates for protein structure prediction by simulation of protein folding pathways. Proteins 73:380-394.
208. Uhlenbeck OC (1990) Tetraloops and RNA folding. Nature 346:613-614.
209. Jucker FM, Heus HA, Yip PF, Moors EH, & Pardi A (1996) A network of heterogeneous hydrogen bonds in GNRA tetraloops. J Mol Biol 264:968-980.
210. Woese CR, Winker S, & Gutell RR (1990) Architecture of ribosomal RNA: constraints on the sequence of "tetra-loops". Proc Natl Acad Sci USA 87:8467-8471.
211. Varani G (1995) Exceptionally stable nucleic acid hairpins. Annual review of biophysics and biomolecular structure 24:379-404.

212. Marino JP, Gregorian RS, Csankovszki G, & Crothers DM (1995) Bent helix formation between RNA hairpins with complementary loops. Science 268:1448-1454.
213. Pley HW, Flaherty KM, & McKay DB (1994) Model for an RNA tertiary interaction from the structure of an intermolecular complex between a GAAA tetraloop and an RNA helix. Nature 372:111-113.
214. Glück A, Endo Y, & Wool IG (1992) Ribosomal RNA identity elements for ricin A-chain recognition and catalysis. Analysis with tetraloop mutants. J Mol Biol 226:411-424.
215. Ansari A & Kuznetsov SV (2005) Is hairpin formation in single-stranded polynucleotide diffusion-controlled? The journal of physical chemistry B 109:12982-12989.
216. Roth A, et al. (2007) A riboswitch selective for the queuosine precursor preQ1 contains an unusually small aptamer domain. Nat Struct Mol Biol 14:308-317.
217. Sorin EJ, Rhee YM, & Pande VS (2005) Does water play a structural role in the folding of small nucleic acids? Biophys J 88:2516-2524.
218. Kannan S & Zacharias M (2007) Folding of a DNA hairpin loop structure in explicit solvent using replica-exchange molecular dynamics simulations. Biophys J 93:3218-3228.
219. Garcia AE & Paschek D (2008) Simulation of the pressure and temperature folding/unfolding equilibrium of a small RNA hairpin. J Am Chem Soc 130:815-817.
220. Ansari A, Kuznetsov SV, & Shen Y (2001) Configurational diffusion down a folding funnel describes the dynamics of DNA hairpins. Proc Natl Acad Sci USA 98:7771-7776.
221. Jung J & Van Orden A (2006) A three-state mechanism for DNA hairpin folding characterized by multiparameter fluorescence fluctuation spectroscopy. J Am Chem Soc 128:1240-1249.
222. Ma H, Wan C, Wu A, & Zewail AH (2007) DNA folding and melting observed in real time redefine the energy landscape. Proc Natl Acad Sci USA 104:712-716.
223. Ma H, et al. (2006) Exploring the energy landscape of a small RNA hairpin. J Am Chem Soc 128:1523-1530.
224. Hagen M, Kim B, Liu P, Friesner RA, & Berne BJ (2007) Serial replica exchange. J Phys Chem B, pp 1416-1423.
225. Menger M, Eckstein F, & Porschke D (2000) Dynamics of the RNA hairpin GNRA tetraloop. Biochemistry, pp 4500-4507.
226. Zhao L & Xia T (2007) Direct revelation of multiple conformations in RNA by femtosecond dynamics. J Am Chem Soc 129:4118-4119.
227. Singh G, Mémoli F, & Carlsson G. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Eurographics Symposium on Point-Based Graphics.
228. Yao Y, et al. (2009) Topological methods for exploring low-density states in biomolecular folding pathways. J Chem Phys 130:144115.

229. Kim J, Doose S, Neuweiler H, & Sauer M (2006) The initial step of DNA hairpin folding: a kinetic analysis using fluorescence correlation spectroscopy. Nucleic Acids Res, pp 2516-2527.
230. Pitera JW, Haque I, & Swope WC (2006) Absence of reptation in the high-temperature folding of the trpzip2 beta-hairpin peptide. The Journal of chemical physics 124:141102.
231. Zhang W & Chen SJ (2002) RNA hairpin-folding kinetics. Proc Natl Acad Sci U S A 99:1931-1936.
232. Mohanty S & Hansmann UH (2006) Folding of proteins with diverse folds. Biophys J 91:3573-3578.
233. Liu P, Huang X, Zhou R, & Berne BJ (2006) Hydrophobic aided replica exchange: an efficient algorithm for protein folding in explicit solvent. J Phys Chem B 110:19018-19022.
234. Im W & Brooks CL (2004) De novo folding of membrane proteins: An exploration of the structure and NMR properties of the fd coat protein. Journal of Molecular Biology 337:513-519.
235. Roitberg AE, Okur A, & Simmerling C (2007) Coupling of replica exchange simulations to a non-Boltzmann structure reservoir. J Phys Chem B 111:2415-2418.
236. Pitera JW, Swope WC, & Abraham FF (2008) Observation of noncooperative folding thermodynamics in simulations of 1BBL. Biophysical journal 94:4837-4846.
237. Zhang W, Wu C, & Duan Y (2005) Convergence of replica exchange molecular dynamics. J Chem Phys 123:154105.
238. Periole X & Mark AE (2007) Convergence and sampling efficiency in replica exchange simulations of peptide folding in explicit solvent. J Chem Phys 126:014903.
239. Nymeyer H (2008) How efficient is replica exchange molecular dynamics? An analytic approach. J Chem Theory Comput 4:626-636.
240. Zuckerman DM & Lyman E (2006) A Second Look at Canonical Sampling of Biomolecules Using Replica Exchange Simulation. J Chem Theory Comput 2:1200-1202.
241. Zheng W, Andrec M, Gallicchio E, & Levy RM (2008) Simple continuous and discrete models for simulating replica exchange simulations of protein folding. J Phys Chem B 112:6083-6093.
242. Zheng W, Andrec M, Gallicchio E, & Levy RM (2007) Simulating replica exchange simulations of protein folding with a kinetic network model. Proc Natl Acad Sci U S A 104:15340-15345.
243. Sanbonmatsu KY & Garcia AE (2002) Structure of Met-enkephalin in explicit aqueous solution using replica exchange molecular dynamics. Proteins 46:225-234.
244. Nadler W & Hansmann UH (2007) Dynamics and optimal number of replicas in parallel tempering simulations. Phys Rev E Stat Nonlin Soft Matter Phys 76:065701.

245. Nadler W & Hansmann UH (2007) Optimizing replica exchange moves for molecular dynamics. Phys Rev E Stat Nonlin Soft Matter Phys 76:057102.
246. Hummer G & Kevrekidis IG (2003) Coarse molecular dynamics of a peptide fragment: Free energy, kinetics, and long-time dynamics computations. J Chem Phys 118:10762-10773.
247. Ytreberg FM & Zuckerman DM (2008) A black-box re-weighting analysis can correct flawed simulation data. Proc Natl Acad Sci U S A 105:7982-7987.
248. Levitt M (1972) Folding of nucleic acids. Ciba Found Symp 7:147-171.
249. Schutte C & Huisinga W (2003) Biomolecular conformations can be identified as metastable sets of molecular dynamics. Handbook of numerical analysis:699-744.
250. Schutte C & Huisinga W (2000) Biomolecular conformations as metastable sets of Markov chains. Proceedings of the 18th Annual Allerton Conference on Communication, Control, and Computing:1106-1115.
251. Noe F (2008) Probability distributions of molecular observables computed from Markov models. J Chem Phys 128:244103.
252. Zwanzig R (1995) Simple-Model of Protein-Folding Kinetics. Proceedings of the National Academy of Sciences of the United States of America 92:9801-9804.
253. Brzezniak Z & Zastawniak T (1999) Basic stochastic processes : a course through exercises (Springer, London ; New York) pp x, 225 p.
254. Bacallado S, Chodera JD, & Pande V (2009) Bayesian comparison of Markov models of molecular dynamics with detailed balance constraint. J Chem Phys 131:045106.
255. Van der Spoel D, et al. (2005) GROMACS: Fast, flexible, and free. Journal of computational chemistry 26:1701-1718.
256. Still WC, Tempczyk A, Hawley RC, & Hendrickson T (1990) Semianalytical Treatment of Solvation for Molecular Mechanics and Dynamics. Journal of the American Chemical Society 112:6127-6129.
257. Lovell SC, et al. (2003) Structure validation by C alpha geometry: phi,psi and C beta deviation. Proteins-Structure Function and Genetics 50:437-450.
258. Berezhkovskii A, Hummer G, & Szabo A (2009) Reactive flux and folding pathways in network models of coarse-grained protein dynamics. J Chem Phys 130:205102.
259. Fersht AR (1997) Nucleation mechanisms in protein folding. Current Opinion in Structural Biology 7:3-9.
260. Karplus M & Weaver DL (1976) Protein-Folding Dynamics. Nature 260:404-406.
261. Weber M & Kube S (2005) Robust Perron Cluster Analysis for various applications in computational life science. Computational Life Sciences, Proceedings 3695:57-66.
262. Cornell WD, et al. (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 117:5179-5197.
263. Jorgensen WL, Chandrasekhar J, Madura JD, Impey RW, & Klein ML (1983) Comparison of simple potential functions for simulating liquid water. J Chem Phys 79:926-935.
264. Darden T, York D, & Pedersen L (1995) A smooth particle mesh Ewald potential. J Chem Phys 103:3014-3021.
265. Hess B, Bekker H, Berendsen HJC, & Fraaije JGEM (1997) LINCS: a linear constraint solver for molecular simulations. J Comput Chem 18:1463-1472.
266. Macke TJ & Case DA (1998) Modeling unusual nucleic acid structures. Molecular Modeling of Nucleic Acids 682:379-393.
267. Duan Y, et al. (2003) A Point-Charge Force Field for Molecular Mechanics Simulations of Proteins Based on Condensed-Phase Quantum Mechanical Calculations. J Comp Chem 24:1999-2012.
