ALGORITHMS FOR BUILDING MODELS OF MOLECULAR MOTION FROM SIMULATIONS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Nina Singhal Hinrichs
September 2007

© Copyright by Nina Singhal Hinrichs 2007
All Rights Reserved

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Vijay S. Pande) Principal Co-Advisor

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Serafim Batzoglou) Principal Co-Advisor

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Leonidas Guibas)

Approved for the University Committee on Graduate Studies.

Abstract

Many important processes in biology occur at the molecular scale. A detailed understanding of these processes can lead to significant advances in the medical and life sciences – for example, many diseases are caused by protein aggregation or misfolding. One approach to studying these systems is to use physically-based computational simulations to model the interactions and movement of the molecules. While molecular simulations are computationally expensive, it is now possible to simulate many independent molecular dynamics trajectories in a parallel fashion by using distributed computing methods such as Folding@Home.

The analysis of these large, high-dimensional data sets presents new computational challenges. This dissertation presents a novel approach to analyzing large ensembles of molecular dynamics trajectories to generate a compact model of the dynamics. The model groups conformations into discrete states and describes the dynamics as Markovian, or history-independent, transitions between the states. We will discuss why the Markovian state model (MSM) is suitable for macromolecular dynamics, and how it can be used to answer many interesting and relevant questions about the molecular system. We will also present new approaches for many of the computational and statistical challenges in building such a model, specifically a novel algorithm for defining the states, methods for comparing different state definitions and determining the optimal number of states, efficient error analysis techniques to determine the statistical reliability, and adaptive algorithms to efficiently design new simulations. The methods are applied to model systems as well as molecular dynamics simulation data of several small peptides.

Acknowledgements

I would first like to thank my family and friends for their love and support: my parents Kumud and Kishore, who always inspired and encouraged me; my sister Monica, for her guidance and advice; Tim Knight, Eran Guendelman, and Andrea Tompa for filling graduate school with fun memories; and especially Tim Hinrichs, my best friend and husband, for sharing this wonderful experience with me.

I would also like to acknowledge some of my collaborators: Peter Kasson for collaborations on lipid vesicle simulations; John Chodera and Bill Swope for interesting discussions about Markov state models and excellent collaborations on state decomposition algorithms; and all of the Pande lab members, past and present, who I had the pleasure of working with.

There were numerous people who made their simulation data available for the analysis presented in this thesis: Christopher Snow for the trpzip2 data set (Chapter 2); Eric Sorin for the Fs peptide data set (Chapter 3); Jed Pitera for the trpzip2 data set (Chapter 3); John Chodera for the alanine data set (Chapters 4 and 6); and Guha Jayachandran for the villin headpiece model (Chapter 6).

Several people had helpful comments on various parts of this thesis: Hans Andersen and Frank Noé for enlightening conversations on the nature of Markov chain models; Vishal Vaidyanathan for assistance with clustering algorithms (Chapter 3); Jed Pitera for insightful discussions and constructive comments on Chapter 3; Libusha Kelly, David Mobley and Guha Jayachandran for critical comments on Chapter 3; Kishore Singhal for insightful discussions about sensitivity analysis (Chapters 5 and 6); and John Chodera for helpful comments on Chapters 4 and 6.

My thesis committee members deserve special thanks: Axel Brunger, as the chair of my orals committee; Jean-Claude Latombe for inspiration about graphical kinetic models and for serving on my orals committee; Leonidas Guibas for discussions about the geometric nature of conformation space and for being a committee member; Serafim Batzoglou, as my co-advisor and for discussions about alignment which helped motivate many of the ideas in this thesis; and especially my advisor Vijay Pande, for his help and guidance throughout my graduate career.

Contents

Abstract
Acknowledgements
1 Introduction
2 Markovian state models
    2.1 Introduction
    2.2 Theory and methods
        2.2.1 Direct rate calculations
        2.2.2 Sampling of paths
        2.2.3 MSM generation
        2.2.4 Post-processing of MSMs
        2.2.5 Reweighting of edges
        2.2.6 Mean first passage time and Pfold calculation
    2.3 Results
        2.3.1 Model system
        2.3.2 Trpzip2 kinetics
    2.4 Discussion and conclusions
3 Automatic state decomposition
    3.1 Introduction
    3.2 Theory
        3.2.1 Markov chain and master equation models of conformational dynamics
        3.2.2 Markov model construction from simulation data given a state partitioning
        3.2.3 Requirements for a useful Markov model
        3.2.4 Validation of Markov models
    3.3 The automatic state decomposition algorithm
        3.3.1 Practical considerations for an automatic state decomposition algorithm
        3.3.2 Sketch of the method
        3.3.3 Implementation
        3.3.4 Validation
    3.4 Applications
        3.4.1 Alanine dipeptide
        3.4.2 The Fs helical peptide
        3.4.3 The trpzip2 β-peptide
    3.5 Discussion
    3.6 Supporting Information
4 Model selection
    4.1 Introduction
    4.2 Methods
        4.2.1 Bayesian Networks
        4.2.2 Parameter estimation in Bayesian Networks
        4.2.3 Scoring of Bayesian Networks
        4.2.4 Markovian state models as Bayesian Networks
        4.2.5 Comparison between different Markovian state models
        4.2.6 Non-equilibrium data
    4.3 Results
        4.3.1 Model system
        4.3.2 Alanine peptide
    4.4 Conclusions
5 Error analysis methods
    5.1 Introduction
    5.2 Methods
        5.2.1 Mean first passage times
        5.2.2 Transition probability distribution
        5.2.3 Sampling based error analysis methods
        5.2.4 Non-sampling based error analysis method
        5.2.5 Adaptive sampling algorithm
        5.2.6 Extension to large systems
    5.3 Results
        5.3.1 Demonstration of method 1
        5.3.2 Validity of approximations
        5.3.3 Adaptive sampling
    5.4 Discussion and conclusions
6 Eigenvalue and eigenvector error analysis
    6.1 Introduction
    6.2 Methods
        6.2.1 Eigenvalue and eigenvector equations
        6.2.2 Transition probability distribution
        6.2.3 Distribution of eigenvalues and eigenvectors
        6.2.4 Adaptive sampling
    6.3 Results
        6.3.1 Eigenvalue distributions
        6.3.2 Eigenvector distributions
        6.3.3 Adaptive sampling
    6.4 Discussion and Conclusions
7 Conclusions
A Sampling from a Dirichlet distribution
B Sampling from a Multivariate Normal distribution
C MFPT sensitivity analysis
D Solving a bordered sparse matrix
E Eigenvalue sensitivity analysis
F Eigenvector sensitivity analysis
Bibliography

List of Tables

3.1 Macrostates from a 20-state state decomposition of the Fs helical peptide.
4.1 Four state definitions for the transition model between 9 conformations.
5.1 Summary of sampling based methods for calculating the error of the MFPT from the initial state due to sampling.
5.2 Means and standard deviations of the MFPT distributions generated for the four sampling and the non-sampling based error analysis methods.
5.3 Running times for the error analysis methods on calculating the MFPT distribution of an 87 state example.

List of Figures

2.1 The shooting algorithm for sampling paths.
2.2 Clustering of MSM points.
2.3 Clustering of nodes to guarantee that all nodes can reach the final state.
2.4 Contour graph of the potential energy, E(x, y), of the model energy landscape.
2.5 The correlation between Pfold values calculated directly from many simulations and MSM simulations on the model energy landscape.
2.6 The comparison between the MFPT calculated directly from many simulations and from the MSM simulations as a function of temperature.
2.7 The comparison between the MFPT calculated from many simulations to the MFPT calculated from reweighted versions of a single MSM as a function of temperature.
2.8 Error analysis of direct simulations and the various MSM techniques.
2.9 The effect of clustering cutoff on the calculated MFPT for the model system and trpzip2 peptide.
3.1 Flowchart of the automatic state decomposition algorithm.