
Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models

Robert T. McGibbon [email protected] Department of Chemistry, Stanford University, Stanford CA 94305, USA

Bharath Ramsundar [email protected] Department of Computer Science, Stanford University, Stanford CA 94305, USA

Mohammad M. Sultan [email protected] Department of Chemistry, Stanford University, Stanford CA 94305, USA

Gert Kiss [email protected] Department of Chemistry, Stanford University, Stanford CA 94305, USA

Vijay S. Pande [email protected] Department of Chemistry, Stanford University, Stanford CA 94305, USA

Abstract

We present a machine learning framework for modeling protein dynamics. Our approach uses L1-regularized, reversible hidden Markov models to understand large protein datasets generated via molecular dynamics simulations. Our model is motivated by three design principles: (1) the requirement of massive scalability; (2) the need to adhere to relevant physical law; and (3) the necessity of providing accessible interpretations, critical for both cellular biology and rational drug design. We present an EM algorithm for learning and introduce a model selection criterion based on the physical notion of convergence in relaxation timescales. We contrast our model with standard methods in biophysics and demonstrate improved robustness. We implement our algorithm on GPUs and apply the method to two large protein simulation datasets generated respectively on the NCSA Blue Waters supercomputer and the Folding@Home distributed computing network. Our analysis identifies the conformational dynamics of the ubiquitin protein critical to cellular signaling, and elucidates the stepwise activation mechanism of the c-Src kinase protein.

1. Introduction

Protein folding and conformational change are grand challenge problems, relevant to a multitude of human diseases, including Alzheimer's disease, Huntington's disease and cancer. These problems entail the characterization of the process and pathways by which proteins fold to their energetically optimal configuration and the dynamics between multiple long-lived, or "metastable," configurations on the potential energy surface. Proteins are biology's molecular machines; a solution to the folding and conformational change problem would deepen our understanding of the mechanism by which microscopic information in the genome is manifested in the macroscopic phenotype of organisms. Furthermore, an understanding of the structure and dynamics of proteins is increasingly important for the rational design of targeted drugs (Wong & McCammon, 2003).

Molecular dynamics (MD) simulations provide a computational microscope by which protein dynamics can be studied with atomic resolution (Dill et al., 1995). These simulations entail the forward integration of Newton's equations of motion on a classical potential energy surface. The potential energy functions in use, called forcefields, are semi-empirical approximations to the true quantum mechanical Born-Oppenheimer surface, designed to reproduce experimental observables (Beauchamp et al., 2012). For moderately sized proteins, this computation can involve the propagation of more than a million physical degrees of freedom. Furthermore, while folding events can take milliseconds ($10^{-3}$ s) or longer, the simulations must be integrated with femtosecond ($10^{-15}$ s) timesteps, requiring the collection of datasets containing trillions of data points.

While the computational burden of performing MD simulations has been a central challenge in the field, significant progress has been achieved recently with the development of three independent technologies: ANTON, a special-purpose supercomputer using a custom ASIC to accelerate MD (Shaw, 2007); Folding@Home, a distributed computing network harnessing the desktop computers of more than 240,000 volunteers (Shirts & Pande, 2000); and Google Exacycle, an initiative utilizing the spare cycles on Google's production infrastructure for science (Kohlhoff et al., 2014).

The analysis of these massive simulation datasets now represents a major difficulty: how do we turn data into knowledge (Lane et al., 2013)? In contrast to some other machine learning problems, the central goal here is not merely prediction. Instead, we view analysis – often in the form of probabilistic models generated from MD datasets – as a tool for generating scientific insight about protein dynamics.

Useful probabilistic models must embody the appropriate physics. The guiding physical paradigm by which chemical dynamics are understood is one of states and rates. States correspond to metastable regions in the configuration space of the protein and can often be visualized as wells on the potential energy surface. Fluctuations within each metastable state are rapid; the dominant, long time-scale dynamics can be understood as a jump process moving with various rates between the states. This paradigm motivates probabilistic models based on a discrete-state Markov chain. A priori, the locations of the metastable states are unknown. As a result, each metastable state should correspond to a latent variable in the model. Hidden Markov models (HMMs) thus provide the natural framework.

Classical mechanics at thermal equilibrium satisfy a symmetry with respect to time: a microscopic process and its time-reversed version obey the same laws of motion. The stochastic analogue of this property is reversibility (also called detailed balance): the equilibrium flux between any two states X and Y is equal in both directions. Probabilistic models which fail to capture this essential property will assign positive probability to systems that violate the second law of thermodynamics (Prinz et al., 2011). Hence, we enforce detailed balance in our HMMs.

In addition to the constraints motivated by adherence to physical laws, suitable probabilistic models should, in broad strokes, incorporate knowledge from prior experimental and theoretical studies of proteins. Numerous studies indicate that only a subset of the degrees of freedom are essential for describing the protein's dominant long time-scale dynamics (see Cho et al. (2006) and references therein). Furthermore, substantial prior work indicates that protein folding occurs via a sequence of localized shifts (Maity et al., 2005). Together, these pieces of evidence motivate the imposition of L1-fusion regularization (Tibshirani et al., 2005). The L1 term penalizes deviations amongst states along uninformative degrees of freedom, thereby suppressing their effect on the model. Furthermore, the pairwise structure of the fusion penalty minimizes the number of transitions which involve global changes: many pairs of states will only differ along a reduced subset of the dimensions.

The main results of this paper are the formulation of the L1-regularized reversible HMM and the introduction of a simple and scalable learning algorithm to fit the model. We contrast our approach against standard frameworks for the analysis of MD data and demonstrate improved robustness and physical interpretability.

This paper is organized as follows. Section 2 describes prior work. Section 3 introduces the model and associated learning algorithm. Section 4 applies the model to three systems: a toy double well potential; ubiquitin, a human signaling protein; and c-Src kinase, a critical regulatory protein involved in cancer genesis. Section 5 provides discussion and indicates future directions.

2. Prior Work

Earlier studies have applied machine learning techniques to investigate protein structure prediction – the problem of discovering a protein's energetically optimal configuration – using CRFs, belief propagation, deep learning, and other general ML methods (Sontag et al., 2012; Di Lena et al., 2012; Chu et al., 2006; Baldi & Pollastri, 2003). But proteins are fundamentally dynamic systems, and none of these approaches offer insight into kinetics; rather, they are concerned with extracting static information about protein structure.

The dominant computational tool for studying protein dynamics is MD. Traditional analyses of MD datasets are primarily visual and non-quantitative. Standard approaches include watching movies of a protein's structural dynamics along simulation trajectories, and inspecting the time evolution of a small number of pre-specified degrees of freedom (Humphrey et al., 1996; Karplus & Kuriyan, 2005). While these methods have been successfully applied to smaller proteins, they struggle to characterize the dynamics of the large and complex biomolecules critical to biological function. Quantitative methods like PCA can elucidate important (high variance) degrees of freedom, but fail to capture the rich temporal structure in MD datasets.

Markov state models (MSMs) are a simple class of probabilistic models, recently introduced to capture the temporal dynamics of the folding process. In an MSM, protein dynamics are modeled by the evolution of a Markov chain on a discrete state space. The finite set of states is generated by clustering the set of configurations in the MD trajectories (Beauchamp et al., 2011). MSMs can be viewed as fully observable HMMs. More recently, HMMs with multinomial emission distributions have been employed on this discrete state space (Noé et al., 2013).

Although MSMs have had a number of notable successes (Voelz et al., 2012; Sadiq et al., 2012), they are brittle and complex. Traditional MSMs lack complete data likelihood functions, and learning cannot be easily characterized by a single optimization problem. For these reasons, MSM learning requires significant manual tuning. For example, because clustering is purely a preprocessing step, the likelihood function contains no guidance on the choice of the metastable states. Moreover, the lack of uncertainty in the observation model necessitates the introduction of a very large number of states, typically more than ten thousand, in order to cover the protein's phase space at sufficient resolution. This abundance of states is statistically inefficient, as millions of pairwise transition parameters must be estimated in typically-sized models, and renders interpretation of learned MSMs challenging.

3. Fusion L1-Regularized Reversible HMM

We introduce the L1-regularized reversible HMM with Gaussian emissions, a generative probabilistic model over multivariate discrete-time continuous-space time series. As discussed in Section 1, we integrate necessary physical constraints on top of the core hidden Markov model (Rabiner, 1989).

Let $\{Y_t\}$ be the observed time series in $\mathbb{R}^D$ of length $T$ (i.e., the input simulation data), and let $\{X_t\}$ be the corresponding latent time series in $\{1, \ldots, K\}$, where $K$ is a hyperparameter indicating the number of hidden states in the model. Each hidden variable $X_t$ corresponds to a metastable state of the physical system. The emission distribution given $X_t = k$ is a multivariate normal distribution parameterized by mean $\mu_k \in \mathbb{R}^D$ and diagonal covariance matrix $\mathrm{Diag}(\sigma_k^2) \in \mathbb{R}^{D \times D}$ (where $\sigma_k^2 \in \mathbb{R}^D$ is the vector of diagonal covariance elements). We use the notation $(\mu_k)_j$ to indicate the $j$th element of the vector $\mu_k$.

Controlling the means $\{\mu_k\}$ is critical for achieving physically interpretable models. As discussed in Section 1, we wish to minimize the differences between $\mu_k$ and $\mu_{k'}$ to the extent possible. Consequently, we place a fusion L1 penalty (Tibshirani et al., 2005) on our log likelihood function, which adds the following pairwise cost:

$$\lambda \sum_{k,k'} \sum_j \tau^{(j)}_{k,k'} \left| (\mu_k)_j - (\mu_{k'})_j \right| .$$

Here, $\lambda$ governs the overall strength of the penalty, while the adaptive fusion weights, $\{\tau^{(j)}_{k,k'}\}$, control the contribution from each pair of states (Guo et al., 2010). During learning, the adaptive fusion weights are computed as

$$\tau^{(j)}_{k,k'} = \left| (\tilde{\mu}_k)_j - (\tilde{\mu}_{k'})_j \right|^{-1},$$

where the $\{\tilde{\mu}_k\}$ are the learned metastable state means in the absence of the penalty. The intuition motivating the adaptive strength of the penalty is that if degree of freedom $j$ is informative for separating states $k$ and $k'$, the corresponding fusion penalty should be applied lightly.
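To make the adaptive weighting concrete, the following is a minimal numpy sketch (our own illustration, not the authors' code) that evaluates this penalty for a set of state means; the function name and the choice to sum over unordered pairs are ours.

```python
import numpy as np

def fusion_penalty(mu, mu_unpenalized, lam, eps=1e-10):
    """Adaptive L1-fusion penalty on the state means.

    mu             : (K, D) current state means
    mu_unpenalized : (K, D) means learned with no penalty, used to set the
                     adaptive weights tau_{k,k'}^{(j)} = 1 / |mu~_kj - mu~_k'j|
    lam            : overall penalty strength lambda
    """
    diff = np.abs(mu[:, None, :] - mu[None, :, :])                     # (K, K, D)
    tau = 1.0 / np.maximum(
        np.abs(mu_unpenalized[:, None, :] - mu_unpenalized[None, :, :]), eps)
    iu = np.triu_indices(mu.shape[0], k=1)     # each unordered pair (k, k') once
    return lam * (tau[iu] * diff[iu]).sum()
```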

The reversible time evolution of the model is parameterized by an irreducible, aperiodic, row-normalized $K$ by $K$ stochastic matrix $T$, which satisfies detailed balance. Mathematically, the detailed balance constraint is

$$\forall k, k': \quad \pi_k T_{k,k'} = \pi_{k'} T_{k',k},$$

where row vector $\pi$ is the stationary distribution of $T$. The stationary distribution $\pi$ also parameterizes the initial distribution over the metastable states. By the Perron-Frobenius theorem, $\pi$ is the dominant left eigenvector of $T$ with eigenvalue 1 and is not an independent parameter in this model.

The initial distributions and evolution of $\{X_t, Y_t\}$ satisfy the following equations:

$$X_0 \sim \sum_{k=1}^{K} \pi_k \delta_k, \qquad X_{t+1} \sim \sum_{k=1}^{K} T_{X_t,k}\, \delta_k, \qquad Y_t \sim \mathcal{N}(\mu_{X_t}, \sigma^2_{X_t}).$$

The complete data likelihood of $\{x_t, y_t\}$ is

$$\mathcal{L}(\{x_t\}, \{y_t\} \mid T, \mu, \sigma) = \pi_{x_0} \prod_{t=1}^{T-1} T_{x_{t-1},x_t} \prod_{t=0}^{T-1} \mathcal{N}(y_t;\, \mu_{x_t}, \sigma^2_{x_t}).$$

The hyperparameter $\Delta$ controls the discretization interval at which a protein's coordinates are sampled to obtain $\{y_t\}$. In the absence of downsampling by $\Delta$, subsequent samples $y_t, y_{t+1}$ would be highly correlated. On the other hand, subsequent samples from an HMM are conditionally independent given the hidden state. Choice of $\Delta$ large enough recovers this conditional independence (vide infra).
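For concreteness, here is a short numpy sketch of the generative process just defined: it draws a synthetic trajectory given a reversible row-stochastic $T$, state means, and diagonal variances. It is our own illustration of the model, not the authors' implementation.

```python
import numpy as np

def sample_trajectory(T_mat, mu, sigma2, n_steps, rng=None):
    """Draw {x_t, y_t} from the generative model: X_0 ~ pi, X_{t+1} ~ T[X_t, :],
    Y_t ~ N(mu[X_t], diag(sigma2[X_t])).

    T_mat  : (K, K) row-stochastic, reversible transition matrix
    mu     : (K, D) state means
    sigma2 : (K, D) diagonal covariances
    """
    rng = np.random.default_rng() if rng is None else rng
    K, D = mu.shape
    # pi is the dominant left eigenvector of T (eigenvalue 1), not a free parameter.
    evals, evecs = np.linalg.eig(T_mat.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = np.abs(pi) / np.abs(pi).sum()
    x = np.empty(n_steps, dtype=int)
    y = np.empty((n_steps, D))
    x[0] = rng.choice(K, p=pi)
    for t in range(n_steps):
        y[t] = rng.normal(mu[x[t]], np.sqrt(sigma2[x[t]]))
        if t + 1 < n_steps:
            x[t + 1] = rng.choice(K, p=T_mat[x[t]])
    return x, y
```

Fitting the model then amounts to inverting this process: recovering $T$, $\{\mu_k\}$, and $\{\sigma_k^2\}$ from the observed $\{y_t\}$ alone, which is the subject of the next section.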

3.1. Learning

The model is fit using expectation-maximization. The E-step is standard, while the M-step requires modification to enforce the detailed balance constraint on $T$ and the adaptive fusion penalty on the $\{\mu_k\}$.

3.1.1. E-step

Inference is identical to that for the standard HMM, using the forward-backward algorithm (Rabiner, 1989) to compute the following quantities:

$$\gamma_i(t) = P(X_t = i \mid \{y_t\}), \qquad \xi_{ij}(t) = P(X_t = i, X_{t+1} = j \mid \{y_t\}).$$

3.1.2. M-step

Both the penalty on $\{\mu_k\}$ and the reversibility constraint affect only the M-step. The M-step update to the means in the $t$-th iteration of EM consists of minimizing the penalized negative log-likelihood

$$\{\mu_k\}^{(t+1)} = \operatorname*{argmin}_{\{\mu_k\}} \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_k(i)\, \frac{(y_i - \mu_k)^2}{2 (\sigma_k^2)^{(t)}} \;+\; \lambda \sum_{k,k'} \sum_j \tau^{(j)}_{k,k'} \left| \mu_{k,j} - \mu_{k',j} \right| .$$

The $\{\mu_k\}$ update is a quadratic program, which can be solved by a variety of methods. We compute $\{\mu_k\}$ by iterated ridge regression. Following Guo et al. (2010) and Fan & Li (2001), we use the local quadratic approximation

$$\left| \mu^{(t,s+1)}_{k,j} - \mu^{(t,s+1)}_{k',j} \right| \;\approx\; \frac{\left( \mu^{(t,s+1)}_{k,j} - \mu^{(t,s+1)}_{k',j} \right)^2}{2 \left| \mu^{(t,s)}_{k,j} - \mu^{(t,s)}_{k',j} \right|} + \frac{1}{2} \left| \mu^{(t,s)}_{k,j} - \mu^{(t,s)}_{k',j} \right|,$$

where $s$ is the iteration index for this procedure within the $t$-th M-step. This approximation is based on the identity

$$|x - y| = \frac{(x - y)^2}{2 |x - y|} + \frac{1}{2} |x - y| .$$

Under the approximation, we obtain a generalized ridge regression problem which can be solved in closed form during each iteration $s$. Note that this approximation is only valid when $|\mu^{(t,s)}_{k,j} - \mu^{(t,s)}_{k',j}| > 0$. For numerical stability, we threshold $|\mu^{(t,s)}_{k,j} - \mu^{(t,s)}_{k',j}|$ to zero at values less than $10^{-10}$.

The variance update is standard:

$$\sigma_k^2 = \frac{\sum_t \gamma_k(t)\, (y_t - \mu_k)^T (y_t - \mu_k)}{\sum_t \gamma_k(t)}.$$
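A sketch of one such iterated ridge update, for a single coordinate $j$ (the objective separates across dimensions because the covariances are diagonal). This is our own minimal numpy rendering of the local quadratic approximation, with a simplified treatment of pairs of means that have already fused:

```python
import numpy as np

def m_step_means_lqa(y, gamma, sigma2, tau, lam, mu_init, n_iter=20, eps=1e-10):
    """M-step update of the K state means along one dimension j, via the
    local quadratic approximation (iterated ridge regression).

    y       : (N,) observations along dimension j
    gamma   : (N, K) posterior state probabilities from the E-step
    sigma2  : (K,) current variances along dimension j
    tau     : (K, K) adaptive fusion weights for dimension j
    lam     : fusion penalty strength lambda
    mu_init : (K,) starting means (e.g. from the previous EM iteration)
    """
    K = gamma.shape[1]
    mu = mu_init.copy()
    n_k = gamma.sum(axis=0)          # effective counts per state
    s_k = gamma.T @ y                # weighted sums per state
    for _ in range(n_iter):
        A = np.diag(n_k / sigma2)    # quadratic data term
        b = s_k / sigma2
        # Surrogate penalty: |mu_k - mu_k'| -> (mu_k - mu_k')^2 / (2 |mu_k - mu_k'|),
        # evaluated at the current iterate.
        for k in range(K):
            for kp in range(k + 1, K):
                d = abs(mu[k] - mu[kp])
                if d < eps:          # pair already fused; surrogate undefined (simplified)
                    continue
                w = lam * tau[k, kp] / d
                A[k, k] += w
                A[kp, kp] += w
                A[k, kp] -= w
                A[kp, k] -= w
        mu = np.linalg.solve(A, b)   # closed-form generalized ridge solve
    return mu
```

Each pass replaces the absolute differences with their quadratic surrogates, producing a linear system that is solved in closed form, which is exactly the generalized ridge regression structure described above.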

The transition matrix update is

$$T = \operatorname*{argmax}_{T} \sum_{ij} \log(T_{ij}) \sum_t \xi_{ij}(t).$$

Because the Gaussian emission distributions have infinite support, $T$ is irreducible and aperiodic by construction. However, we must explicitly constrain $T$ to satisfy detailed balance.

Lemma 1. $T$ satisfies detailed balance if and only if $T_{ij} = W_{ij} / \sum_k W_{ik}$, where $W = W^T$.

Proof. If $T$ satisfies detailed balance, then let $W_{ij} = \pi_i T_{ij} = \pi_j T_{ji} = W_{ji}$. Then note

$$\frac{W_{ij}}{\sum_k W_{ik}} = \frac{\pi_i T_{ij}}{\sum_k \pi_i T_{ik}} = \frac{T_{ij}}{\sum_k T_{ik}} = T_{ij}.$$

To prove the converse, assume $T_{ij} = W_{ij} / \sum_k W_{ik}$, with $W = W^T$. Let $\pi_i = \sum_k W_{ik}$. Then $\pi_i T_{ij} = W_{ij} = W_{ji} = \pi_j T_{ji}$.

Substituting the result of Lemma 1, we rewrite the transition matrix update as

$$W = \operatorname*{argmax}_{W} \sum_{ij} \left[ \log(W_{ij}) - \log \pi_i \right] \sum_t \xi_{ij}(t).$$

We compute the derivative of the inner term with respect to $\log W_{ij}$ and optimize with L-BFGS (Nocedal & Wright, 2006).
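The following sketch implements this update using the symmetric-$W$ parameterization of Lemma 1, optimizing over $\log W$ with SciPy's L-BFGS-B routine; the use of scipy.optimize and the variable names are our own choices, not details taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def reversible_transition_update(xi, w0=None):
    """M-step update of the reversible transition matrix.

    xi : (K, K) array with xi[i, j] = sum_t xi_ij(t), the expected transition counts.
    Returns a row-stochastic T with T_ij = W_ij / sum_k W_ik and W symmetric (Lemma 1).
    """
    K = xi.shape[0]
    iu = np.triu_indices(K)                 # free parameters: upper triangle of log W

    def unpack(theta):
        S = np.zeros((K, K))
        S[iu] = theta
        S = S + S.T - np.diag(np.diag(S))   # symmetric log-weights
        return np.exp(S)

    def negloglik(theta):
        W = unpack(theta)
        r = W.sum(axis=1)                   # row sums play the role of pi
        f = -(xi * (np.log(W) - np.log(r)[:, None])).sum()
        # Gradient w.r.t. W, then chain rule through W = exp(S) with S symmetric.
        G = -xi / W + (xi.sum(axis=1) / r)[:, None]
        GW = G * W
        Gsym = GW + GW.T - np.diag(np.diag(GW))
        return f, Gsym[iu]

    if w0 is None:
        w0 = xi + xi.T + 1e-8               # symmetric, strictly positive initial guess
    res = minimize(negloglik, np.log(w0)[iu], jac=True, method="L-BFGS-B")
    W = unpack(res.x)
    return W / W.sum(axis=1, keepdims=True)
```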

3.2. Model Selection

There are two free model parameters: $K$ and $\Delta$. The number of metastable states, $K$, is expected to be small – at most a few dozen. To choose $K$, we can use the AIC or BIC selection criteria, or alternatively enumerate a few small values.

The choice of $\Delta$ is more difficult than the choice of $K$, as changing the discretization interval alters the support of the likelihood function. Recall that choosing $\Delta$ too small results in subsequent samples $y_t, y_{t+1}$ becoming highly correlated, while the model satisfies the conditional independence assumption $Y_t \perp Y_{t+1} \mid X_t$. Moreover, small $\Delta$ increases data-storage requirements, while $\Delta$ too large will needlessly discard data. Thus a balance between these two conflicting directives is necessary.

We use the physical criterion of convergence in the relaxation timescales to evaluate when $\Delta$ is large enough. The propagation of the dynamics from an initial distribution over the hidden states, $x_t$, can be described by

$$P(y_{t+n} \mid x_t) = \sum_{k=1}^{K} \mathcal{N}(y_{t+n};\, \mu_k, \sigma_k^2) \left[ x_t T^n \right]_k .$$

Diagonalize $T$ in terms of its left eigenvectors $\phi_i$, right eigenvectors $\psi_i$, and eigenvalues $\lambda_i$ (such a diagonalization is always possible since $T$ is a stochastic matrix). Then

$$P(y_{t+n} \mid x_t) = \sum_{k=1}^{K} \mathcal{N}(y_{t+n};\, \mu_k, \sigma_k^2) \left[ \sum_{i=1}^{K} \lambda_i^n\, \langle x_t, \psi_i \rangle\, \phi_i \right]_k .$$

Since $\pi$ is the stationary left eigenvector of $T$, and the remaining eigenvalues lie in the interval $-1 < \lambda_i < 1$, the collective dynamics can be interpreted as a sum of exponential relaxation processes:

$$P(y_{t+n} \mid x_t) = \sum_{k=1}^{K} f_k(y_{t+n}) \left[ \pi + \sum_{i=2}^{K} e^{-n/\tau_i}\, \langle x_t, \psi_i \rangle\, \phi_i \right]_k .$$

In the equation, we define $f_k(y) = \mathcal{N}(y;\, \mu_k, \sigma_k^2)$. Each eigenvector of $T$ (except the first) describes a dynamical mode with characteristic relaxation timescale

$$\tau_i = -\frac{1}{\ln \lambda_i}.$$

The longest timescales, $\tau_i$, are of central interest from a molecular modeling perspective because they describe dynamical modes visible in time-resolved protein experiments (Zhuang et al., 2011) and are robust against perturbations (Weber & Pande, 2012). We choose $\Delta$ large enough to converge the $\tau_i$: for adequately large $\Delta$, we expect $\tau_i(\Delta)$ to asymptotically converge to the true relaxation timescale $\tau_i^*$. For simple systems, we may evaluate $\tau_i^*$ explicitly, while for larger systems, we choose $\Delta$ large enough so that $\tau_i(\Delta)$ no longer changes with further increase in the discretization interval.
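In code, the timescales follow directly from the eigenvalues of the learned transition matrix. A minimal sketch (ours), which assumes the non-stationary eigenvalues are real and positive, as is typical for these reversible models:

```python
import numpy as np

def relaxation_timescales(T_mat, lag=1.0):
    """Relaxation timescales tau_i = -lag / ln(lambda_i) of a reversible,
    row-stochastic transition matrix, skipping the stationary eigenvalue
    lambda_1 = 1. `lag` is the discretization interval Delta in physical units."""
    evals = np.sort(np.real(np.linalg.eigvals(T_mat)))[::-1]
    return -lag / np.log(evals[1:])
```

Monitoring these values as $\Delta$ is increased, and stopping once they plateau, is the model selection criterion used in the experiments below.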

3.3. Implementation

We implement learning for both multithreaded CPU and NVIDIA GPU platforms. In the CPU implementation, we parallelize across trajectories during the E-step using OpenMP. The largest portion of the run time is spent in log-sum-exp operations, which we manually vectorize with SSE2 intrinsics for SIMD architectures. Parallelism on the GPU is more fine grained. The E-step populates two $T \times K \times K$ arrays with forward and backward sweeps respectively. To fully utilize the GPU's massive parallelism, each trajectory has a team of threads which cooperate on updating the $K \times K$ matrix at each time step. Specialized CUDA kernels were written for $K$ = 4, 8, 16 and 32, along with a generic kernel for $K$ > 32.

Even in log space, for long trajectories, the forward-backward algorithm can suffer from an accumulation of floating point errors which lead to catastrophic cancellation during the computation of $\gamma_i(t)$. This risk requires that the forward-backward matrices be accumulated in double precision, whereas the rest of the calculation is safe in single precision.

The speedup using our GPU implementation is 15× compared to our optimized CPU implementation and 75× with respect to a standard numpy implementation, using $K = 16$ states on an NVIDIA GTX TITAN GPU / Intel Core i7 4-core Sandy Bridge CPU platform. Further scaling of the implementation could be achieved by splitting the computation over multiple GPUs with MPI.
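As a small illustration of the numerical point above, here is a log-sum-exp helper of the kind that dominates the run time, accumulating in double precision; this is our sketch, not the SSE2 or CUDA kernels themselves.

```python
import numpy as np

def logsumexp(a, axis=-1):
    """Numerically stable log(sum(exp(a))) along `axis`, the core primitive
    of the log-space forward-backward recursions."""
    a = np.asarray(a, dtype=np.float64)      # accumulate in double precision
    amax = a.max(axis=axis, keepdims=True)
    return np.squeeze(amax, axis=axis) + np.log(np.exp(a - amax).sum(axis=axis))

# Example forward-recursion step in log space (alpha: (K,), logT: (K, K)):
#   alpha_next = logsumexp(alpha[:, None] + logT, axis=0) + log_emission
```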
Figure 1. Simulations of Brownian dynamics on a double well potential (A) illustrate the advantages of the HMM over the MSM. When the dynamics are discretized at a time interval of > 500 steps, the 2-state HMM, unlike the 2-state MSM, achieves a quantitatively accurate prediction of the first relaxation timescale (B). The MSM (C) features hard cutoffs between the states, whereas the HMM states (D) each have infinite support.

4. Experiments

4.1. Double Well Potential

We first consider a one-dimensional diffusion process $y_t$ governed by Brownian dynamics. The process is described by the stochastic differential equation

$$\frac{dy_t}{dt} = -\nabla V(y_t) + \sqrt{2D}\, R(t)$$

where $V$ is the reduced potential energy, $D$ is the diffusion constant, and $R(t)$ is a zero-mean delta-correlated stationary Gaussian process. For simplicity, we set $D = 1$ and consider the double well potential

$$V(y) = 1 + \cos(2y)$$

with reflecting boundary conditions at $y = -\pi$ and $y = \pi$. Using the Euler-Maruyama method and a time step of $\Delta t = 10^{-3}$, we produced ten simulation trajectories of length $5 \times 10^5$ steps each. The histogrammed trajectories are shown in Fig. 1(A). The exact value of the first relaxation timescale was computed by a finite element discretization of the corresponding Fokker-Planck equation (Higham, 2001).
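The simulation setup just described can be reproduced in a few lines of numpy; the following Euler-Maruyama sketch (ours) matches the stated potential, time step, boundary conditions, and trajectory lengths.

```python
import numpy as np

def simulate_double_well(n_steps=500_000, dt=1e-3, D=1.0, seed=0):
    """Euler-Maruyama integration of dy = -V'(y) dt + sqrt(2 D dt) * xi,
    with V(y) = 1 + cos(2 y) and reflecting boundaries at y = +/- pi."""
    rng = np.random.default_rng(seed)
    grad_V = lambda y: -2.0 * np.sin(2.0 * y)          # V'(y) for V = 1 + cos(2y)
    y = np.empty(n_steps)
    y[0] = rng.uniform(-np.pi, np.pi)
    noise = np.sqrt(2.0 * D * dt) * rng.standard_normal(n_steps - 1)
    for t in range(n_steps - 1):
        y_new = y[t] - grad_V(y[t]) * dt + noise[t]
        # Reflecting boundary conditions at +/- pi.
        if y_new > np.pi:
            y_new = 2.0 * np.pi - y_new
        elif y_new < -np.pi:
            y_new = -2.0 * np.pi - y_new
        y[t + 1] = y_new
    return y

# Ten independent trajectories, as in the text.
trajs = [simulate_double_well(seed=s) for s in range(10)]
```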

We applied both a two-state MSM and a two-state HMM, with fusion L1 regularization parameter $\lambda = 0$, to the simulation trajectories. The MSM states were fixed, with a dividing surface at $y = 0$, as shown in Fig. 1(C). The HMM states were learned, as shown in Fig. 1(D). Both the MSM and the HMM display some sensitivity with respect to the discretization interval, with more accurate predictions of the relaxation timescale at longer lag times.

The two-state MSM is unable to accurately learn the longest timescale, $\tau_1$, even with large lag times, while the two-state HMM succeeds in identifying $\tau_1$ with $\Delta \geq 500$ (Fig. 1(B)).

4.2. Ubiquitin

Ubiquitin (Ub) is a regulatory hub protein at the intersection of many signaling pathways in the human body (Hershko & Ciechanover, 1998). Among its many tasks are the regulation of inflammation, repair of DNA, and the breakdown and recycling of waste proteins. Ubiquitin interacts with close to 5000 human signaling proteins. Understanding the link between structure and function in ubiquitin would elucidate the underlying framework of the human signaling network.

Figure 2. Dynamics of Ub. (A) The HMM identifies the two metastable states of Ub, varying primarily in the loop and helix regions (axes in yellow and red respectively). (B) The MSM fails to cleanly separate the two underlying physical states. Three post-processed macrostates from the MSM are shown (in blue, green, and yellow). (C) A structural rendering of the conformational states of the Ub system. S0, shown in grey, binds to the UCH family of proteins, and S1 (with characteristic structural differences to S0 in red and yellow) binds to the USP family.

We obtained a dataset of MD simulations of human Ub consisting of 3.5 million data points. The protein, shown in Fig. 2, is composed of 75 amino acids. The simulations were performed on the NCSA Blue Waters supercomputer. The resulting structures were featurized by extracting the distance from each amino acid's central carbon atom to its position in the simulations' starting configurations. HMMs were constructed with 2 to 6 states. We chose $\Delta$ by monitoring the convergence of the relaxation timescales as discussed in Sec. 3.2, and set the L1 fusion penalty heuristically to a default value of $\lambda = 0.01$. In agreement with existing biophysical data (Zhang et al., 2012), the HMMs correctly determined that Ub was best modeled with 2 states (Fig. 2A). For ease of representation, the learned HMM is shown projected onto two critical degrees of freedom (discussed below).
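The featurization is simple to express in code. The sketch below (our own, with hypothetical array names) computes the distance of each residue's central carbon from its position in a reference structure, assuming the frames have already been superposed onto that reference.

```python
import numpy as np

def featurize(traj_xyz, ref_xyz):
    """Per-residue displacement features, as described above.

    traj_xyz : (T, R, 3) Cartesian coordinates of the R central (alpha) carbons per frame
    ref_xyz  : (R, 3) coordinates of the reference configuration
               (the starting structure for Ub; the inactive state for c-Src)

    Returns a (T, R) feature matrix {y_t}.
    """
    disp = traj_xyz - ref_xyz[None, :, :]
    return np.linalg.norm(disp, axis=-1)
```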
For comparison, we generated MSM models with 500 microstates (Fig. 2B) and projected upon the same critical degrees of freedom. We used a standard kinetic lumping post-processing step to identify 3 macrostates (shown in green, blue, and yellow respectively); the lumping algorithm collapsed when asked to identify 2 macrostates (Bowman, 2012). Contrast the simple, clean output of the 2-state HMM in Fig. 2(A) with the standard MSM of Fig. 2(B). Note how significant post-processing and manual tuning would be required to piece together the true two-state structural dynamics of Ub from the MSM output.

We display a structural rendering of the Ub system in Fig. 2(C). The imposed L1 penalty of the HMM suppresses differences among the uninformative degrees of freedom depicted in grey. The remaining portions of the protein (shown in color) reveal the two critical axes of motion of the Ub system: the hinge dynamics of the loop region displayed in yellow and a kink in the lower helix shown in red. We use these axes in the simplified representations shown in Figs. 2(A,B).

The states S0 and S1 identified by the HMM have direct biological interpretations. Comparison to earlier experimental work reveals that configuration S0 binds to the UCH family of proteins, while configuration S1 binds to the USP family instead (Komander et al., 2009). The families play differing roles in the vital task of regenerating active Ub for the cell-signaling cycle.

Together, MD and the HMM analysis provide atomic insight into the effect of protein structure on ubiquitin's role in the signaling network. Our analysis approach may have significant value for protein biology and for the further study of cellular signaling networks. Although experimental studies of protein signaling provide the gold standard for hard data, they struggle to provide structural explanations – knowing why a certain protein is more suited for certain signaling functions is challenging at best. In contrast, the MD/HMM approach can provide a direct link between structure and function and give a causal basis for observed protein activity.

4.3. c-Src Tyrosine Kinase

Protein kinases are a family of enzymes that are critical for regulating cellular growth and whose aberrant activation can lead to uncontrolled cellular proliferation. Because of their central role in cell proliferation, kinases are a critical target for anti-cancer therapeutics. The c-Src tyrosine kinase is a prominent member of this family that has been implicated in numerous human malignancies (Aleshin & Finn, 2010).

Due to the protein's size and complexity, performing MD simulations of the c-Src kinase is a formidable task. The protein, shown in Fig. 3A, consists of 262 amino acids; when surrounding water molecules – necessary for accurate simulation – are taken into account, the system has over 40,000 atoms. Furthermore, the transition between the active and inactive states takes hundreds of microseconds. Adequate sampling of these processes therefore requires hundreds of billions of MD integrator steps. Simulations of the c-Src kinase were performed on the Folding@Home distributed computing network, collecting a dataset of 4.7 million configurations from 550 µs of sampling, for a total of 108 GB of data (Shukla et al., 2014).

Figure 3. Activation of the c-Src Kinase. (A) Structure of the protein system. (B) The 3-state HMM, projected onto two degrees of freedom representing the positions of the A-loop (shown in red) and C-helix (shown in orange) respectively. (C) Structural renderings of the means of the hidden states (inactive, intermediate, active) showing atomistic details of the activation pathway.

In order to understand the molecular activation mechanism of the c-Src kinase, we analyzed this dataset using the L1-regularized reversible HMM. We featurized the configurations by extracting the distance from each amino acid's central carbon atom to its position in an experimentally determined inactive configuration. We built HMMs with 2 to 6 states, and singled out the 3-state model for achieving a balance of complexity and interpretability. As with Ub, we chose $\Delta$ by monitoring the convergence of the relaxation timescales, and used the same default L1 fusion penalty of $\lambda = 0.01$.

The L1-regularized reversible HMM elucidates the c-Src kinase activation pathway, revealing a stepwise mechanism of the dynamics. A projection of the learned HMM states onto two key degrees of freedom is shown in Fig. 3B. Fig. 3C shows a structural representation of the means of the three states, highlighting a sequential activation mechanism.
The transformation from the inactive to the intermediate state occurs first by the unfolding of the A-loop (the subsection of the protein highlighted in red). Activation is completed by the inward rotation of the C-helix (highlighted in orange) and rupture of a critical side chain interaction between two amino acids on the C-helix and the A-loop respectively.

Although the protein structure is complex, the activation process takes place only in a small portion of the overall protein; the random fluctuations of the remaining degrees of freedom are largely uncoupled from the activation process. As with Ub, the L1 penalty suppresses the signal from unimportant degrees of freedom, shown in grey. In contrast to the simplicity of the HMM approach, a recent MSM analysis of this dataset found similar results, but required 2,000 microstates and significant post-processing of the models to generate physical insight into the activation mechanism (Shukla et al., 2014).

The identification of the intermediate state along the activation pathway has substantial implications in the field of rational drug design. Chemotherapy drugs often have harmful side effects because they target portions of proteins that are common across entire families, interfering with both the uncontrolled behavior of tumor proteins as well as the critical cellular function of healthy proteins. Intermediate states, such as the one identified by the HMM, are more likely to be unique to each kinase protein; future therapeutics that target these intermediate states could have significantly fewer deleterious side effects (Fang et al., 2013).

5. Discussion and Conclusion

Currently, MSMs are a dominant framework for analyzing protein dynamics datasets. We propose replacing this methodology with L1-regularized reversible HMMs. We show that HMMs have significant advantages over MSMs: whereas the MSM state decomposition is a preprocessing procedure without guidance from a complete-data likelihood function, the HMM couples the identification of metastable states with the estimation of transition probabilities. As such, accurate models require fewer states, aiding interpretability from a physical perspective.

The switch is not without tradeoffs. MSMs are backed by a significant body of theoretical work: the MSM is a direct discretization of an integral operator which formally controls the long timescale dynamics, known as the transfer operator. This connection enables the quantification of approximation error in the MSM framework (Prinz et al., 2011). No such theoretical guarantees yet exist for the L1-regularized reversible HMM, because the evolution of $Y_t$ is no longer unconditionally Markovian. However, because the HMM can be viewed as a generalized hidden MSM, there is reason to believe that analogues of MSM theoretical guarantees extend to the HMM framework.

While the L1-regularized reversible hidden Markov model represents an improvement over previous methods for analyzing MD datasets, future work will likely confront a number of remaining challenges. For example, the current model does not learn the featurization and treats $\Delta$ as a hyperparameter. Bringing these two aspects of the model into the optimization framework would reduce the required amount of manual tuning. Adapting techniques from Bayesian nonparametrics, unsupervised feature learning and linear dynamical systems may facilitate the achievement of these goals.

Our results show that structured statistical analysis of massive protein datasets is now possible. We reduce complex dynamical systems with thousands of physical degrees of freedom to simple statistical models characterized by a small number of metastable states and transition rates. The HMM framework is a tool for turning raw molecular dynamics data into scientific knowledge about protein structure, dynamics and function. Our experiments on the ubiquitin and c-Src kinase proteins extract insight that may further the state of the art in cellular biology and rational drug design.

Acknowledgments

We thank Diwakar Shukla for kindly providing the c-Src kinase MD trajectories. B.R. was supported by the Fannie and John Hertz Foundation. G.K. and V.S.P. acknowledge support from the Simbios NIH Center for Biomedical Computation (NIH U54 Roadmap GM072970). V.S.P. acknowledges NIH R01-GM62868 and NSF MCB-0954714.

References

Aleshin, A. and Finn, R. S. SRC: a century of science brought to the clinic. Neoplasia, 12(8):599–607, 2010.

Baldi, Pierre and Pollastri, Gianluca. The principled design of large-scale recursive neural network architectures – DAG-RNNs and the protein structure prediction problem. J. Mach. Learn. Res., 4:575–602, 2003.

Beauchamp, Kyle A., Bowman, Gregory R., Lane, Thomas J., Maibaum, Lutz, Haque, Imran S., and Pande, Vijay S. MSMBuilder2: Modeling conformational dynamics on the picosecond to millisecond scale. J. Chem. Theory Comput., 7(10):3412–3419, 2011.

Beauchamp, Kyle A., Lin, Yu-Shan, Das, Rhiju, and Pande, Vijay S. Are protein force fields getting better? A systematic benchmark on 524 diverse NMR measurements. J. Chem. Theory Comput., 8(4):1409–1414, 2012.

Bowman, Gregory R. Improved coarse-graining of Markov state models via explicit consideration of statistical uncertainty. J. Chem. Phys., 137(13):134111, 2012.

Cho, Samuel S., Levy, Yaakov, and Wolynes, Peter G. P versus Q: Structural reaction coordinates capture protein folding on smooth landscapes. Proc. Natl. Acad. Sci. U.S.A., 103(3):586–591, 2006.

Chu, Wei, Ghahramani, Zoubin, Podtelezhnikov, Alexei, and Wild, David L. Bayesian segmental models with multiple sequence alignment profiles for protein secondary structure and contact map prediction. IEEE/ACM Trans. Comput. Biol. Bioinf., 3(2):98–113, 2006.

Di Lena, Pietro, Baldi, Pierre, and Nagata, Ken. Deep spatio-temporal architectures and learning for protein structure prediction. In Adv. Neural Inf. Process. Syst. 25, NIPS '12, pp. 521–529, 2012.

Dill, Ken A., Bromberg, Sarina, Yue, Kaizhi, Chan, Hue Sun, Fiebig, Klaus M., Yee, David P., and Thomas, Paul D. Principles of protein folding – a perspective from simple exact models. Protein Sci., 4(4):561–602, 1995.

Fan, Jianqing and Li, Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc., 96(456):1348–1360, 2001.

Fang, Zhizhou, Grütter, Christian, and Rauh, Daniel. Strategies for the selective regulation of kinases with allosteric modulators: Exploiting exclusive structural features. ACS Chem. Biol., 8(1):58–70, 2013.

Guo, Jian, Levina, Elizaveta, Michailidis, George, and Zhu, Ji. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

Hershko, Avram and Ciechanover, Aaron. The ubiquitin system. Annu. Rev. Biochem., 67(1):425–479, 1998.

Higham, Desmond J. An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Rev., 43(3):525–546, 2001.

Humphrey, William, Dalke, Andrew, and Schulten, Klaus. VMD: Visual molecular dynamics. J. Mol. Graphics, 14(1):33–38, 1996.

Karplus, M. and Kuriyan, J. Molecular dynamics and protein function. Proc. Natl. Acad. Sci. U.S.A., 102(19):6679–6685, 2005.

Kohlhoff, Kai J., Shukla, Diwakar, Lawrenz, Morgan, Bowman, Gregory R., Konerding, David E., Belov, Dan, Altman, Russ B., and Pande, Vijay S. Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways. Nat. Chem., 6(1):15–21, 2014.

Komander, David, Clague, Michael J., and Urbé, Sylvie. Breaking the chains: structure and function of the deubiquitinases. Nat. Rev. Mol. Cell Biol., 10(8):550–563, 2009.

Lane, Thomas J., Shukla, Diwakar, Beauchamp, Kyle A., and Pande, Vijay S. To milliseconds and beyond: challenges in the simulation of protein folding. Curr. Opin. Struct. Biol., 23(1):58–65, 2013.

Maity, Haripada, Maity, Mita, Krishna, Mallela M. G., Mayne, Leland, and Englander, S. Walter. Protein folding: The stepwise assembly of foldon units. Proc. Natl. Acad. Sci. U.S.A., 102(13):4741–4746, 2005.

Nocedal, J. and Wright, S. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, 2nd edition, 2006.

Noé, Frank, Wu, Hao, Prinz, Jan-Hendrik, and Plattner, Nuria. Projected and hidden Markov models for calculating kinetics and metastable states of complex molecules. J. Chem. Phys., 139(18):184114, 2013.

Prinz, Jan-Hendrik, Wu, Hao, Sarich, Marco, Keller, Bettina, Senne, Martin, Held, Martin, Chodera, John D., Schütte, Christof, and Noé, Frank. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys., 134(17):174105, 2011.

Rabiner, Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257–286, 1989.

Sadiq, S. Kashif, Noé, Frank, and De Fabritiis, Gianni. Kinetic characterization of the critical step in HIV-1 protease maturation. Proc. Natl. Acad. Sci. U.S.A., 109(50):20449–20454, 2012.

Shaw, David E. et al. Anton, a special-purpose machine for molecular dynamics simulation. In ACM Comp. Ar. 34, ISCA '07, pp. 1–12, 2007.

Shirts, Michael and Pande, Vijay S. Screen savers of the world unite! Science, 290(5498):1903–1904, 2000.

Shukla, Diwakar, Meng, Yilin, Roux, Benoît, and Pande, Vijay S. Activation pathway of Src kinase reveals intermediate states as novel targets for drug design. Nat. Commun., 5, 2014.

Sontag, David, Meltzer, Talya, Globerson, Amir, Jaakkola, Tommi S., and Weiss, Yair. Tightening LP relaxations for MAP using message passing. arXiv preprint arXiv:1206.3288, 2012.

Tibshirani, Robert, Saunders, Michael, Rosset, Saharon, Zhu, Ji, and Knight, Keith. Sparsity and smoothness via the fused lasso. J. R. Statist. Soc. B, 67(1):91–108, 2005.

Voelz, Vincent A., Jäger, Marcus, Yao, Shuhuai, Chen, Yujie, Zhu, Li, Waldauer, Steven A., Bowman, Gregory R., Friedrichs, Mark, Bakajin, Olgica, Lapidus, Lisa J., Weiss, Shimon, and Pande, Vijay S. Slow unfolded-state structuring in Acyl-CoA binding protein folding revealed by simulation and experiment. J. Am. Chem. Soc., 134(30):12565–12577, 2012.

Weber, Jeffrey K. and Pande, Vijay S. Protein folding is mechanistically robust. Biophys. J., 102(4):859–867, 2012.

Wong, Chung F. and McCammon, J. Andrew. Protein flexibility and computer-aided drug design. Annu. Rev. Pharmacol. Toxicol., 43(1):31–45, 2003.

Zhang, Yingnan, Zhou, Lijuan, Rouge, Lionel, Phillips, Aaron H., Lam, Cynthia, Liu, Peter, Sandoval, Wendy, Helgason, Elizabeth, Murray, Jeremy M., Wertz, Ingrid E., et al. Conformational stabilization of ubiquitin yields potent and selective inhibitors of USP7. Nat. Chem. Biol., 9(1):51–58, 2012.

Zhuang, Wei, Cui, Raymond Z., Silva, Daniel-Adriano, and Huang, Xuhui. Simulating the T-jump-triggered unfolding dynamics of trpzip2 peptide and its time-resolved IR and two-dimensional IR signals using the Markov state model approach. J. Phys. Chem. B, 115(18):5415–5424, 2011.