
Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models

Robert T. McGibbon [email protected] Department of Chemistry, Stanford University, Stanford CA 94305, USA

Bharath Ramsundar [email protected] Department of Computer Science, Stanford University, Stanford CA 94305, USA

Mohammad M. Sultan [email protected] Department of Chemistry, Stanford University, Stanford CA 94305, USA

Gert Kiss [email protected] Department of Chemistry, Stanford University, Stanford CA 94305, USA

Vijay S. Pande [email protected] Department of Chemistry, Stanford University, Stanford CA 94305, USA

Abstract

We present a machine learning framework for modeling protein dynamics. Our approach uses L1-regularized, reversible hidden Markov models to understand large protein datasets generated via molecular dynamics simulations. Our model is motivated by three design principles: (1) the requirement of massive scalability; (2) the need to adhere to relevant physical law; and (3) the necessity of providing accessible interpretations, critical for both cellular biology and rational drug design. We present an EM algorithm for learning and introduce a model selection criterion based on the physical notion of convergence in relaxation timescales. We contrast our model with standard methods in biophysics and demonstrate improved robustness. We implement our algorithm on GPUs and apply the method to two large protein simulation datasets generated respectively on the NCSA Blue Waters supercomputer and the Folding@Home distributed computing network. Our analysis identifies the conformational dynamics of the ubiquitin protein critical to cellular signaling, and elucidates the stepwise activation mechanism of the c-Src kinase protein.

1. Introduction

Protein folding and conformational change are grand challenge problems, relevant to a multitude of human diseases, including Alzheimer's disease, Huntington's disease and cancer. These problems entail the characterization of the process and pathways by which proteins fold to their energetically optimal configuration and the dynamics between multiple long-lived, or "metastable," configurations on the potential energy surface. Proteins are biology's molecular machines; a solution to the folding and conformational change problem would deepen our understanding of the mechanism by which microscopic information in the genome is manifested in the macroscopic phenotype of organisms. Furthermore, an understanding of the structure and dynamics of proteins is increasingly important for the rational design of targeted drugs (Wong & McCammon, 2003).

Molecular dynamics (MD) simulations provide a computational microscope by which protein dynamics can be studied with atomic resolution (Dill et al., 1995). These simulations entail the forward integration of Newton's equations of motion on a classical potential energy surface. The potential energy functions in use, called forcefields, are semi-empirical approximations to the true quantum mechanical Born-Oppenheimer surface, designed to reproduce experimental observables (Beauchamp et al., 2012). For moderately sized proteins, this computation can involve the propagation of more than a million physical degrees of freedom. Furthermore, while folding events can take milliseconds ($10^{-3}$ s) or longer, the simulations must be integrated with femtosecond ($10^{-15}$ s) timesteps, requiring the collection of datasets containing trillions of data points.

While the computational burden of performing MD simulations has been a central challenge in the field, significant progress has been achieved recently with the development of three independent technologies: ANTON, a special-purpose supercomputer using a custom ASIC to accelerate MD (Shaw, 2007); Folding@Home, a distributed computing network harnessing the desktop computers of more than 240,000 volunteers (Shirts & Pande, 2000); and Google Exacycle, an initiative utilizing the spare cycles on Google's production infrastructure for science (Kohlhoff et al., 2014).

The analysis of these massive simulation datasets now represents a major difficulty: how do we turn data into knowledge (Lane et al., 2013)? In contrast to some other machine learning problems, the central goal here is not merely prediction. Instead, we view analysis – often in the form of probabilistic models generated from MD datasets – as a tool for generating scientific insight about protein dynamics.

Useful probabilistic models must embody the appropriate physics. The guiding physical paradigm by which chemical dynamics are understood is one of states and rates. States correspond to metastable regions in the configuration space of the protein and can often be visualized as wells on the potential energy surface. Fluctuations within each metastable state are rapid; the dominant, long time-scale dynamics can be understood as a jump process moving with various rates between the states. This paradigm motivates probabilistic models based on a discrete-state Markov chain. A priori, the locations of the metastable states are unknown. As a result, each metastable state should correspond to a latent variable in the model. Hidden Markov models (HMMs) thus provide the natural framework.

Classical mechanics at thermal equilibrium satisfy a symmetry with respect to time: a microscopic process and its time-reversed version obey the same laws of motion. The stochastic analogue of this property is reversibility (also called detailed balance): the equilibrium flux between any two states X and Y is equal in both directions. Probabilistic models which fail to capture this essential property will assign positive probability to systems that violate the second law of thermodynamics (Prinz et al., 2011). Hence, we enforce detailed balance in our HMMs.

In addition to the constraints motivated by adherence to physical laws, suitable probabilistic models should, in broad strokes, incorporate knowledge from prior experimental and theoretical studies of proteins. Numerous studies indicate that only a subset of the degrees of freedom are essential for describing the protein's dominant long time-scale dynamics (see Cho et al. (2006) and references therein). Furthermore, substantial prior work indicates that protein folding occurs via a sequence of localized shifts (Maity et al., 2005). Together, these pieces of evidence motivate the imposition of L1-fusion regularization (Tibshirani et al., 2005). The L1 term penalizes deviations amongst states along uninformative degrees of freedom, thereby suppressing their effect on the model. Furthermore, the pairwise structure of the fusion penalty minimizes the number of transitions which involve global changes: many pairs of states will only differ along a reduced subset of the dimensions.

The main results of this paper are the formulation of the L1-regularized reversible HMM and the introduction of a simple and scalable learning algorithm to fit the model. We contrast our approach against standard frameworks for the analysis of MD data and demonstrate improved robustness and physical interpretability.

This paper is organized as follows. Section 2 describes prior work. Section 3 introduces the model and associated learning algorithm. Section 4 applies the model to three systems: a toy double well potential; ubiquitin, a human signaling protein; and c-Src kinase, a critical regulatory protein involved in cancer genesis. Section 5 provides discussion and indicates future directions.

2. Prior Work

Earlier studies have applied machine learning techniques to investigate protein structure prediction – the problem of discovering a protein's energetically optimal configuration – using CRFs, belief propagation, deep learning, and other general ML methods (Sontag et al., 2012; Di Lena et al., 2012; Chu et al., 2006; Baldi & Pollastri, 2003). But proteins are fundamentally dynamic systems, and none of these approaches offer insight into kinetics; rather, they are concerned with extracting static information about protein structure.

The dominant computational tool for studying protein dynamics is MD. Traditional analyses of MD datasets are primarily visual and non-quantitative. Standard approaches include watching movies of a protein's structural dynamics along simulation trajectories, and inspecting the time evolution of a small number of pre-specified degrees of freedom (Humphrey et al., 1996; Karplus & Kuriyan, 2005). While these methods have been successfully applied to smaller proteins, they struggle to characterize the dynamics of the large and complex biomolecules critical to biological function. Quantitative methods like PCA can elucidate important (high variance) degrees of freedom, but fail to capture the rich temporal structure in MD datasets.

Markov state models (MSMs) are a simple class of probabilistic models, recently introduced to capture the temporal dynamics of the folding process. In an MSM, protein dynamics are modeled by the evolution of a Markov chain on a discrete state space. The finite set of states is generated by clustering the set of configurations in the MD trajectories (Beauchamp et al., 2011). MSMs can be viewed as fully observable HMMs. More recently, HMMs with multinomial emission distributions have been employed on this discrete state space (Noé et al., 2013).

Although MSMs have had a number of notable successes (Voelz et al., 2012; Sadiq et al., 2012), they are brittle and complex. Traditional MSMs lack complete data likelihood functions, and learning cannot be easily characterized by a single optimization problem. For these reasons, MSM learning requires significant manual tuning. For example, because clustering is purely a preprocessing step, the likelihood function contains no guidance on the choice of the metastable states. Moreover, the lack of uncertainty in the observation model necessitates the introduction of a very large number of states, typically more than ten thousand, in order to cover the protein's phase space at sufficient resolution. This abundance of states is statistically inefficient, as millions of pairwise transition parameters must be estimated in typically-sized models, and renders interpretation of learned MSMs challenging.

3. Fusion L1-Regularized Reversible HMM

We introduce the L1-regularized reversible HMM with Gaussian emissions, a generative probabilistic model over multivariate discrete-time continuous-space time series. As discussed in Section 1, we integrate necessary physical constraints on top of the core hidden Markov model (Rabiner, 1989).

Let $\{Y_t\}$ be the observed time series in $\mathbb{R}^D$ of length $T$ (i.e., the input simulation data), and let $\{X_t\}$ be the corresponding latent time series in $\{1, \ldots, K\}$, where $K$ is a hyperparameter indicating the number of hidden states in the model. Each hidden variable $X_t$ corresponds to a metastable state of the physical system. The emission distribution given $X_t = k$ is a multivariate normal distribution parameterized by mean $\mu_k \in \mathbb{R}^D$ and diagonal covariance matrix $\mathrm{Diag}(\sigma_k^2) \in \mathbb{R}^{D \times D}$ (where $\sigma_k^2 \in \mathbb{R}^D$ is the vector of diagonal covariance elements). We use the notation $(\mu_k)_j$ to indicate the $j$th element of the vector $\mu_k$.

Controlling the means $\{\mu_k\}$ is critical for achieving physically interpretable models. As discussed in Section 1, we wish to minimize the differences between $\mu_k$ and $\mu_{k'}$ to the extent possible. Consequently, we place a fusion L1 penalty (Tibshirani et al., 2005) on our log likelihood function, which adds the following pairwise cost:

$$\lambda \sum_{k,k'} \sum_j \tau^{(j)}_{k,k'} \left| (\mu_k)_j - (\mu_{k'})_j \right| .$$

Here, $\lambda$ governs the overall strength of the penalty, while the adaptive fusion weights, $\{\tau^{(j)}_{k,k'}\}$, control the contribution from each pair of states (Guo et al., 2010). During learning, the adaptive fusion weights are computed as

$$\tau^{(j)}_{k,k'} = \left| (\tilde{\mu}_k)_j - (\tilde{\mu}_{k'})_j \right|^{-1},$$

where the $\{\tilde{\mu}_k\}$ are the learned metastable state means in the absence of the penalty. The intuition motivating the adaptive strength of the penalty is that if degree of freedom $j$ is informative for separating states $k$ and $k'$, the corresponding fusion penalty should be applied lightly.
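To make the adaptive weighting concrete, the following is a minimal numpy sketch (our own illustration, not the authors' code) that evaluates this penalty for a set of state means; the function name and the choice to sum over unordered pairs are ours.

```python
import numpy as np

def fusion_penalty(mu, mu_unpenalized, lam, eps=1e-10):
    """Adaptive L1-fusion penalty on the state means.

    mu             : (K, D) current state means
    mu_unpenalized : (K, D) means learned with no penalty, used to set the
                     adaptive weights tau_{k,k'}^{(j)} = 1 / |mu~_kj - mu~_k'j|
    lam            : overall penalty strength lambda
    """
    diff = np.abs(mu[:, None, :] - mu[None, :, :])                     # (K, K, D)
    tau = 1.0 / np.maximum(
        np.abs(mu_unpenalized[:, None, :] - mu_unpenalized[None, :, :]), eps)
    iu = np.triu_indices(mu.shape[0], k=1)     # each unordered pair (k, k') once
    return lam * (tau[iu] * diff[iu]).sum()
```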

The reversible time evolution of the model is parameterized by an irreducible, aperiodic, row-normalized $K$ by $K$ stochastic matrix $T$, which satisfies detailed balance. Mathematically, the detailed balance constraint is

$$\forall k, k': \quad \pi_k T_{k,k'} = \pi_{k'} T_{k',k},$$

where row vector $\pi$ is the stationary distribution of $T$. The stationary distribution $\pi$ also parameterizes the initial distribution over the metastable states. By the Perron-Frobenius theorem, $\pi$ is the dominant left eigenvector of $T$ with eigenvalue 1 and is not an independent parameter in this model.

The initial distributions and evolution of $\{X_t, Y_t\}$ satisfy the following equations:

$$X_0 \sim \sum_{k=1}^{K} \pi_k \delta_k, \qquad X_{t+1} \sim \sum_{k=1}^{K} T_{X_t,k}\, \delta_k, \qquad Y_t \sim \mathcal{N}(\mu_{X_t}, \sigma^2_{X_t}).$$

The complete data likelihood of $\{x_t, y_t\}$ is

$$\mathcal{L}(\{x_t\}, \{y_t\} \mid T, \mu, \sigma) = \pi_{x_0} \prod_{t=1}^{T-1} T_{x_{t-1},x_t} \prod_{t=0}^{T-1} \mathcal{N}(y_t;\, \mu_{x_t}, \sigma^2_{x_t}).$$

The hyperparameter $\Delta$ controls the discretization interval at which a protein's coordinates are sampled to obtain $\{y_t\}$. In the absence of downsampling by $\Delta$, subsequent samples $y_t, y_{t+1}$ would be highly correlated. On the other hand, subsequent samples from an HMM are conditionally independent given the hidden state. Choice of $\Delta$ large enough recovers this conditional independence (vide infra).
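For concreteness, here is a short numpy sketch of the generative process just defined: it draws a synthetic trajectory given a reversible row-stochastic $T$, state means, and diagonal variances. It is our own illustration of the model, not the authors' implementation.

```python
import numpy as np

def sample_trajectory(T_mat, mu, sigma2, n_steps, rng=None):
    """Draw {x_t, y_t} from the generative model: X_0 ~ pi, X_{t+1} ~ T[X_t, :],
    Y_t ~ N(mu[X_t], diag(sigma2[X_t])).

    T_mat  : (K, K) row-stochastic, reversible transition matrix
    mu     : (K, D) state means
    sigma2 : (K, D) diagonal covariances
    """
    rng = np.random.default_rng() if rng is None else rng
    K, D = mu.shape
    # pi is the dominant left eigenvector of T (eigenvalue 1), not a free parameter.
    evals, evecs = np.linalg.eig(T_mat.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = np.abs(pi) / np.abs(pi).sum()
    x = np.empty(n_steps, dtype=int)
    y = np.empty((n_steps, D))
    x[0] = rng.choice(K, p=pi)
    for t in range(n_steps):
        y[t] = rng.normal(mu[x[t]], np.sqrt(sigma2[x[t]]))
        if t + 1 < n_steps:
            x[t + 1] = rng.choice(K, p=T_mat[x[t]])
    return x, y
```

Fitting the model then amounts to inverting this process: recovering $T$, $\{\mu_k\}$, and $\{\sigma_k^2\}$ from the observed $\{y_t\}$ alone, which is the subject of the next section.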

3.1. Learning

The model is fit using expectation-maximization. The E-step is standard, while the M-step requires modification to enforce the detailed balance constraint on $T$ and the adaptive fusion penalty on the $\{\mu_k\}$.

3.1.1. E-step

Inference is identical to that for the standard HMM, using the forward-backward algorithm (Rabiner, 1989) to compute the following quantities:

$$\gamma_i(t) = P(X_t = i \mid \{y_t\}), \qquad \xi_{ij}(t) = P(X_t = i, X_{t+1} = j \mid \{y_t\}).$$

3.1.2. M-step

Both the penalty on $\{\mu_k\}$ and the reversibility constraint affect only the M-step. The M-step update to the means in the $t$-th iteration of EM consists of minimizing the penalized negative log-likelihood

$$\{\mu_k\}^{(t+1)} = \operatorname*{argmin}_{\{\mu_k\}} \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_k(i)\, \frac{(y_i - \mu_k)^2}{2 (\sigma_k^2)^{(t)}} \;+\; \lambda \sum_{k,k'} \sum_j \tau^{(j)}_{k,k'} \left| \mu_{k,j} - \mu_{k',j} \right| .$$

The $\{\mu_k\}$ update is a quadratic program, which can be solved by a variety of methods. We compute $\{\mu_k\}$ by iterated ridge regression. Following Guo et al. (2010) and Fan & Li (2001), we use the local quadratic approximation

$$\left| \mu^{(t,s+1)}_{k,j} - \mu^{(t,s+1)}_{k',j} \right| \;\approx\; \frac{\left( \mu^{(t,s+1)}_{k,j} - \mu^{(t,s+1)}_{k',j} \right)^2}{2 \left| \mu^{(t,s)}_{k,j} - \mu^{(t,s)}_{k',j} \right|} + \frac{1}{2} \left| \mu^{(t,s)}_{k,j} - \mu^{(t,s)}_{k',j} \right|,$$

where $s$ is the iteration index for this procedure within the $t$-th M-step. This approximation is based on the identity

$$|x - y| = \frac{(x - y)^2}{2 |x - y|} + \frac{1}{2} |x - y| .$$

Under the approximation, we obtain a generalized ridge regression problem which can be solved in closed form during each iteration $s$. Note that this approximation is only valid when $|\mu^{(t,s)}_{k,j} - \mu^{(t,s)}_{k',j}| > 0$. For numerical stability, we threshold $|\mu^{(t,s)}_{k,j} - \mu^{(t,s)}_{k',j}|$ to zero at values less than $10^{-10}$.

The variance update is standard:

$$\sigma_k^2 = \frac{\sum_t \gamma_k(t)\, (y_t - \mu_k)^T (y_t - \mu_k)}{\sum_t \gamma_k(t)}.$$
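A sketch of one such iterated ridge update, for a single coordinate $j$ (the objective separates across dimensions because the covariances are diagonal). This is our own minimal numpy rendering of the local quadratic approximation, with a simplified treatment of pairs of means that have already fused:

```python
import numpy as np

def m_step_means_lqa(y, gamma, sigma2, tau, lam, mu_init, n_iter=20, eps=1e-10):
    """M-step update of the K state means along one dimension j, via the
    local quadratic approximation (iterated ridge regression).

    y       : (N,) observations along dimension j
    gamma   : (N, K) posterior state probabilities from the E-step
    sigma2  : (K,) current variances along dimension j
    tau     : (K, K) adaptive fusion weights for dimension j
    lam     : fusion penalty strength lambda
    mu_init : (K,) starting means (e.g. from the previous EM iteration)
    """
    K = gamma.shape[1]
    mu = mu_init.copy()
    n_k = gamma.sum(axis=0)          # effective counts per state
    s_k = gamma.T @ y                # weighted sums per state
    for _ in range(n_iter):
        A = np.diag(n_k / sigma2)    # quadratic data term
        b = s_k / sigma2
        # Surrogate penalty: |mu_k - mu_k'| -> (mu_k - mu_k')^2 / (2 |mu_k - mu_k'|),
        # evaluated at the current iterate.
        for k in range(K):
            for kp in range(k + 1, K):
                d = abs(mu[k] - mu[kp])
                if d < eps:          # pair already fused; surrogate undefined (simplified)
                    continue
                w = lam * tau[k, kp] / d
                A[k, k] += w
                A[kp, kp] += w
                A[k, kp] -= w
                A[kp, k] -= w
        mu = np.linalg.solve(A, b)   # closed-form generalized ridge solve
    return mu
```

Each pass replaces the absolute differences with their quadratic surrogates, producing a linear system that is solved in closed form, which is exactly the generalized ridge regression structure described above.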

The transition matrix update is

$$T = \operatorname*{argmax}_{T} \sum_{ij} \log(T_{ij}) \sum_t \xi_{ij}(t).$$

Because the Gaussian emission distributions have infinite support, $T$ is irreducible and aperiodic by construction. However, we must explicitly constrain $T$ to satisfy detailed balance.

Lemma 1. $T$ satisfies detailed balance if and only if $T_{ij} = W_{ij} / \sum_k W_{ik}$, where $W = W^T$.

Proof. If $T$ satisfies detailed balance, then let $W_{ij} = \pi_i T_{ij} = \pi_j T_{ji} = W_{ji}$. Then note

$$\frac{W_{ij}}{\sum_k W_{ik}} = \frac{\pi_i T_{ij}}{\sum_k \pi_i T_{ik}} = \frac{T_{ij}}{\sum_k T_{ik}} = T_{ij}.$$

To prove the converse, assume $T_{ij} = W_{ij} / \sum_k W_{ik}$, with $W = W^T$. Let $\pi_i = \sum_k W_{ik}$. Then $\pi_i T_{ij} = W_{ij} = W_{ji} = \pi_j T_{ji}$.

Substituting the result of Lemma 1, we rewrite the transition matrix update as

$$W = \operatorname*{argmax}_{W} \sum_{ij} \left[ \log(W_{ij}) - \log \pi_i \right] \sum_t \xi_{ij}(t).$$

We compute the derivative of the inner term with respect to $\log W_{ij}$ and optimize with L-BFGS (Nocedal & Wright, 2006).
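The following sketch implements this update using the symmetric-$W$ parameterization of Lemma 1, optimizing over $\log W$ with SciPy's L-BFGS-B routine; the use of scipy.optimize and the variable names are our own choices, not details taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def reversible_transition_update(xi, w0=None):
    """M-step update of the reversible transition matrix.

    xi : (K, K) array with xi[i, j] = sum_t xi_ij(t), the expected transition counts.
    Returns a row-stochastic T with T_ij = W_ij / sum_k W_ik and W symmetric (Lemma 1).
    """
    K = xi.shape[0]
    iu = np.triu_indices(K)                 # free parameters: upper triangle of log W

    def unpack(theta):
        S = np.zeros((K, K))
        S[iu] = theta
        S = S + S.T - np.diag(np.diag(S))   # symmetric log-weights
        return np.exp(S)

    def negloglik(theta):
        W = unpack(theta)
        r = W.sum(axis=1)                   # row sums play the role of pi
        f = -(xi * (np.log(W) - np.log(r)[:, None])).sum()
        # Gradient w.r.t. W, then chain rule through W = exp(S) with S symmetric.
        G = -xi / W + (xi.sum(axis=1) / r)[:, None]
        GW = G * W
        Gsym = GW + GW.T - np.diag(np.diag(GW))
        return f, Gsym[iu]

    if w0 is None:
        w0 = xi + xi.T + 1e-8               # symmetric, strictly positive initial guess
    res = minimize(negloglik, np.log(w0)[iu], jac=True, method="L-BFGS-B")
    W = unpack(res.x)
    return W / W.sum(axis=1, keepdims=True)
```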

3.2. Model Selection

There are two free model parameters: $K$ and $\Delta$. The number of metastable states, $K$, is expected to be small – at most a few dozen. To choose $K$, we can use the AIC or BIC selection criteria, or alternatively enumerate a few small values.

The choice of $\Delta$ is more difficult than the choice of $K$, as changing the discretization interval alters the support of the likelihood function. Recall that choosing $\Delta$ too small results in subsequent samples $y_t, y_{t+1}$ becoming highly correlated, while the model satisfies the conditional independence assumption $Y_t \perp Y_{t+1} \mid X_t$. Moreover, small $\Delta$ increases data-storage requirements, while $\Delta$ too large will needlessly discard data. Thus a balance between these two conflicting directives is necessary.

We use the physical criterion of convergence in the relaxation timescales to evaluate when $\Delta$ is large enough. The propagation of the dynamics from an initial distribution over the hidden states, $x_t$, can be described by

$$P(y_{t+n} \mid x_t) = \sum_{k=1}^{K} \mathcal{N}(y_{t+n};\, \mu_k, \sigma_k^2) \left[ x_t T^n \right]_k .$$

Diagonalize $T$ in terms of its left eigenvectors $\phi_i$, right eigenvectors $\psi_i$, and eigenvalues $\lambda_i$ (such a diagonalization is always possible since $T$ is a stochastic matrix). Then

$$P(y_{t+n} \mid x_t) = \sum_{k=1}^{K} \mathcal{N}(y_{t+n};\, \mu_k, \sigma_k^2) \left[ \sum_{i=1}^{K} \lambda_i^n\, \langle x_t, \psi_i \rangle\, \phi_i \right]_k .$$

Since $\pi$ is the stationary left eigenvector of $T$, and the remaining eigenvalues lie in the interval $-1 < \lambda_i < 1$, the collective dynamics can be interpreted as a sum of exponential relaxation processes:

$$P(y_{t+n} \mid x_t) = \sum_{k=1}^{K} f_k(y_{t+n}) \left[ \pi + \sum_{i=2}^{K} e^{-n/\tau_i}\, \langle x_t, \psi_i \rangle\, \phi_i \right]_k .$$

In the equation, we define $f_k(y) = \mathcal{N}(y;\, \mu_k, \sigma_k^2)$. Each eigenvector of $T$ (except the first) describes a dynamical mode with characteristic relaxation timescale

$$\tau_i = -\frac{1}{\ln \lambda_i}.$$

The longest timescales, $\tau_i$, are of central interest from a molecular modeling perspective because they describe dynamical modes visible in time-resolved protein experiments (Zhuang et al., 2011) and are robust against perturbations (Weber & Pande, 2012). We choose $\Delta$ large enough to converge the $\tau_i$: for adequately large $\Delta$, we expect $\tau_i(\Delta)$ to asymptotically converge to the true relaxation timescale $\tau_i^*$. For simple systems, we may evaluate $\tau_i^*$ explicitly, while for larger systems, we choose $\Delta$ large enough so that $\tau_i(\Delta)$ no longer changes with further increase in the discretization interval.
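In code, the timescales follow directly from the eigenvalues of the learned transition matrix. A minimal sketch (ours), which assumes the non-stationary eigenvalues are real and positive, as is typical for these reversible models:

```python
import numpy as np

def relaxation_timescales(T_mat, lag=1.0):
    """Relaxation timescales tau_i = -lag / ln(lambda_i) of a reversible,
    row-stochastic transition matrix, skipping the stationary eigenvalue
    lambda_1 = 1. `lag` is the discretization interval Delta in physical units."""
    evals = np.sort(np.real(np.linalg.eigvals(T_mat)))[::-1]
    return -lag / np.log(evals[1:])
```

Monitoring these values as $\Delta$ is increased, and stopping once they plateau, is the model selection criterion used in the experiments below.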

3.3. Implementation

We implement learning for both multithreaded CPU and NVIDIA GPU platforms. In the CPU implementation, we parallelize across trajectories during the E-step using OpenMP. The largest portion of the run time is spent in log-sum-exp operations, which we manually vectorize with SSE2 intrinsics for SIMD architectures. Parallelism on the GPU is more fine grained. The E-step populates two $T \times K \times K$ arrays with forward and backward sweeps respectively. To fully utilize the GPU's massive parallelism, each trajectory has a team of threads which cooperate on updating the $K \times K$ matrix at each time step. Specialized CUDA kernels were written for $K$ = 4, 8, 16 and 32, along with a generic kernel for $K$ > 32.

Even in log space, for long trajectories, the forward-backward algorithm can suffer from an accumulation of floating point errors which lead to catastrophic cancellation during the computation of $\gamma_i(t)$. This risk requires that the forward-backward matrices be accumulated in double precision, whereas the rest of the calculation is safe in single precision.

The speedup using our GPU implementation is 15× compared to our optimized CPU implementation and 75× with respect to a standard numpy implementation, using $K = 16$ states on an NVIDIA GTX TITAN GPU / Intel Core i7 4-core Sandy Bridge CPU platform. Further scaling of the implementation could be achieved by splitting the computation over multiple GPUs with MPI.
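As a small illustration of the numerical point above, here is a log-sum-exp helper of the kind that dominates the run time, accumulating in double precision; this is our sketch, not the SSE2 or CUDA kernels themselves.

```python
import numpy as np

def logsumexp(a, axis=-1):
    """Numerically stable log(sum(exp(a))) along `axis`, the core primitive
    of the log-space forward-backward recursions."""
    a = np.asarray(a, dtype=np.float64)      # accumulate in double precision
    amax = a.max(axis=axis, keepdims=True)
    return np.squeeze(amax, axis=axis) + np.log(np.exp(a - amax).sum(axis=axis))

# Example forward-recursion step in log space (alpha: (K,), logT: (K, K)):
#   alpha_next = logsumexp(alpha[:, None] + logT, axis=0) + log_emission
```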
Figure 1. Simulations of Brownian dynamics on a double well potential (A) illustrate the advantages of the HMM over the MSM. When the dynamics are discretized at a time interval of > 500 steps, the 2-state HMM, unlike the 2-state MSM, achieves a quantitatively accurate prediction of the first relaxation timescale (B). The MSM (C) features hard cutoffs between the states, whereas the HMM states (D) each have infinite support.

4. Experiments

4.1. Double Well Potential

We first consider a one-dimensional diffusion process $y_t$ governed by Brownian dynamics. The process is described by the stochastic differential equation

$$\frac{dy_t}{dt} = -\nabla V(y_t) + \sqrt{2D}\, R(t)$$

where $V$ is the reduced potential energy, $D$ is the diffusion constant, and $R(t)$ is a zero-mean delta-correlated stationary Gaussian process. For simplicity, we set $D = 1$ and consider the double well potential

$$V(y) = 1 + \cos(2y)$$

with reflecting boundary conditions at $y = -\pi$ and $y = \pi$. Using the Euler-Maruyama method and a time step of $\Delta t = 10^{-3}$, we produced ten simulation trajectories of length $5 \times 10^5$ steps each. The histogrammed trajectories are shown in Fig. 1(A). The exact value of the first relaxation timescale was computed by a finite element discretization of the corresponding Fokker-Planck equation (Higham, 2001).
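The simulation setup just described can be reproduced in a few lines of numpy; the following Euler-Maruyama sketch (ours) matches the stated potential, time step, boundary conditions, and trajectory lengths.

```python
import numpy as np

def simulate_double_well(n_steps=500_000, dt=1e-3, D=1.0, seed=0):
    """Euler-Maruyama integration of dy = -V'(y) dt + sqrt(2 D dt) * xi,
    with V(y) = 1 + cos(2 y) and reflecting boundaries at y = +/- pi."""
    rng = np.random.default_rng(seed)
    grad_V = lambda y: -2.0 * np.sin(2.0 * y)          # V'(y) for V = 1 + cos(2y)
    y = np.empty(n_steps)
    y[0] = rng.uniform(-np.pi, np.pi)
    noise = np.sqrt(2.0 * D * dt) * rng.standard_normal(n_steps - 1)
    for t in range(n_steps - 1):
        y_new = y[t] - grad_V(y[t]) * dt + noise[t]
        # Reflecting boundary conditions at +/- pi.
        if y_new > np.pi:
            y_new = 2.0 * np.pi - y_new
        elif y_new < -np.pi:
            y_new = -2.0 * np.pi - y_new
        y[t + 1] = y_new
    return y

# Ten independent trajectories, as in the text.
trajs = [simulate_double_well(seed=s) for s in range(10)]
```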

We applied both a two-state MSM and a two-state HMM, with fusion L1 regularization parameter $\lambda = 0$, to the simulation trajectories. The MSM states were fixed, with a dividing surface at $y = 0$, as shown in Fig. 1(C). The HMM states were learned, as shown in Fig. 1(D). Both the MSM and the HMM display some sensitivity with respect to the discretization interval, with more accurate predictions of the relaxation timescale at longer lag times.

The two-state MSM is unable to accurately learn the longest timescale, $\tau_1$, even with large lag times, while the two-state HMM succeeds in identifying $\tau_1$ with $\Delta \geq 500$ (Fig. 1(B)).

4.2. Ubiquitin

Ubiquitin (Ub) is a regulatory hub protein at the intersection of many signaling pathways in the human body (Hershko & Ciechanover, 1998). Among its many tasks are the regulation of inflammation, repair of DNA, and the breakdown and recycling of waste proteins. Ubiquitin interacts with close to 5000 human signaling proteins. Understanding the link between structure and function in ubiquitin would elucidate the underlying framework of the human signaling network.

Figure 2. Dynamics of Ub. (A) The HMM identifies the two metastable states of Ub, varying primarily in the loop and helix regions (axes in yellow and red respectively). (B) The MSM fails to cleanly separate the two underlying physical states. Three post-processed macrostates from the MSM are shown (in blue, green, and yellow). (C) A structural rendering of the conformational states of the Ub system. S0, shown in grey, binds to the UCH family of proteins, and S1 (with characteristic structural differences to S0 in red and yellow) binds to the USP family.

We obtained a dataset of MD simulations of human Ub consisting of 3.5 million data points. The protein, shown in Fig. 2, is composed of 75 amino acids. The simulations were performed on the NCSA Blue Waters supercomputer. The resulting structures were featurized by extracting the distance from each amino acid's central carbon atom to its position in the simulations' starting configurations. HMMs were constructed with 2 to 6 states. We chose $\Delta$ by monitoring the convergence of the relaxation timescales as discussed in Sec. 3.2, and set the L1 fusion penalty heuristically to a default value of $\lambda = 0.01$. In agreement with existing biophysical data (Zhang et al., 2012), the HMMs correctly determined that Ub was best modeled with 2 states (Fig. 2A). For ease of representation, the learned HMM is shown projected onto two critical degrees of freedom (discussed below).
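The featurization is simple to express in code. The sketch below (our own, with hypothetical array names) computes the distance of each residue's central carbon from its position in a reference structure, assuming the frames have already been superposed onto that reference.

```python
import numpy as np

def featurize(traj_xyz, ref_xyz):
    """Per-residue displacement features, as described above.

    traj_xyz : (T, R, 3) Cartesian coordinates of the R central (alpha) carbons per frame
    ref_xyz  : (R, 3) coordinates of the reference configuration
               (the starting structure for Ub; the inactive state for c-Src)

    Returns a (T, R) feature matrix {y_t}.
    """
    disp = traj_xyz - ref_xyz[None, :, :]
    return np.linalg.norm(disp, axis=-1)
```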
For comparison, we generated MSM models with 500 microstates (Fig. 2B) and projected upon the same critical degrees of freedom. We used a standard kinetic lumping post-processing step to identify 3 macrostates (shown in green, blue, and yellow respectively); the lumping algorithm collapsed when asked to identify 2 macrostates (Bowman, 2012). Contrast the simple, clean output of the 2-state HMM in Fig. 2(A) with the standard MSM of Fig. 2(B). Note how significant post-processing and manual tuning would be required to piece together the true two-state structural dynamics of Ub from the MSM output.

We display a structural rendering of the Ub system in Fig. 2(C). The imposed L1 penalty of the HMM suppresses differences among the uninformative degrees of freedom depicted in grey. The remaining portions of the protein (shown in color) reveal the two critical axes of motion of the Ub system: the hinge dynamics of the loop region displayed in yellow and a kink in the lower helix shown in red. We use these axes in the simplified representations shown in Figs. 2(A,B).

The states S0 and S1 identified by the HMM have direct biological interpretations. Comparison to earlier experimental work reveals that configuration S0 binds to the UCH family of proteins, while configuration S1 binds to the USP family instead (Komander et al., 2009). The families play differing roles in the vital task of regenerating active Ub for the cell-signaling cycle.

Together, MD and the HMM analysis provide atomic insight into the effect of protein structure on ubiquitin's role in the signaling network. Our analysis approach may have significant value for protein biology and for the further study of cellular signaling networks. Although experimental studies of protein signaling provide the gold standard for hard data, they struggle to provide structural explanations – knowing why a certain protein is more suited for certain signaling functions is challenging at best. In contrast, the MD/HMM approach can provide a direct link between structure and function and give a causal basis for observed protein activity.

4.3. c-Src Tyrosine Kinase

Protein kinases are a family of enzymes that are critical for regulating cellular growth and whose aberrant activation can lead to uncontrolled cellular proliferation. Because of their central role in cell proliferation, kinases are a critical target for anti-cancer therapeutics. The c-Src tyrosine kinase is a prominent member of this family that has been implicated in numerous human malignancies (Aleshin & Finn, 2010).

Due to the protein's size and complexity, performing MD simulations of the c-Src kinase is a formidable task. The protein, shown in Fig. 3A, consists of 262 amino acids; when surrounding water molecules – necessary for accurate simulation – are taken into account, the system has over 40,000 atoms. Furthermore, the transition between the active and inactive states takes hundreds of microseconds. Adequate sampling of these processes therefore requires hundreds of billions of MD integrator steps. Simulations of the c-Src kinase were performed on the Folding@Home distributed computing network, collecting a dataset of 4.7 million configurations from 550 µs of sampling, for a total of 108 GB of data (Shukla et al., 2014).

Figure 3. Activation of the c-Src Kinase. (A) Structure of the protein system. (B) The 3-state HMM, projected onto two degrees of freedom representing the positions of the A-loop (shown in red) and C-helix (shown in orange) respectively. (C) Structural renderings of the means of the hidden states (inactive, intermediate, active) showing atomistic details of the activation pathway.

In order to understand the molecular activation mechanism of the c-Src kinase, we analyzed this dataset using the L1-regularized reversible HMM. We featurized the configurations by extracting the distance from each amino acid's central carbon atom to its position in an experimentally determined inactive configuration. We built HMMs with 2 to 6 states, and singled out the 3-state model for achieving a balance of complexity and interpretability. As with Ub, we chose $\Delta$ by monitoring the convergence of the relaxation timescales, and used the same default L1 fusion penalty of $\lambda = 0.01$.

The L1-regularized reversible HMM elucidates the c-Src kinase activation pathway, revealing a stepwise mechanism of the dynamics. A projection of the learned HMM states onto two key degrees of freedom is shown in Fig. 3B. Fig. 3C shows a structural representation of the means of the three states, highlighting a sequential activation mechanism.
The transformation from the inactive to the intermediate state occurs first by the unfolding of the A-loop (the subsection of the protein highlighted in red). Activation is completed by the inward rotation of the C-helix (highlighted in orange) and rupture of a critical side chain interaction between two amino acids on the C-helix and the A-loop respectively.

Although the protein structure is complex, the activation process takes place only in a small portion of the overall protein; the random fluctuations of the remaining degrees of freedom are largely uncoupled from the activation process. As with Ub, the L1 penalty suppresses the signal from unimportant degrees of freedom, shown in grey. In contrast to the simplicity of the HMM approach, a recent MSM analysis of this dataset found similar results, but required 2,000 microstates and significant post-processing of the models to generate physical insight into the activation mechanism (Shukla et al., 2014).

The identification of the intermediate state along the activation pathway has substantial implications in the field of rational drug design. Chemotherapy drugs often have harmful side effects because they target portions of proteins that are common across entire families, interfering with both the uncontrolled behavior of tumor proteins as well as the critical cellular function of healthy proteins. Intermediate states, such as the one identified by the HMM, are more likely to be unique to each kinase protein; future therapeutics that target these intermediate states could have significantly fewer deleterious side effects (Fang et al., 2013).

5. Discussion and Conclusion

Currently, MSMs are a dominant framework for analyzing protein dynamics datasets. We propose replacing this methodology with L1-regularized reversible HMMs. We show that HMMs have significant advantages over MSMs: whereas the MSM state decomposition is a preprocessing procedure without guidance from a complete-data likelihood function, the HMM couples the identification of metastable states with the estimation of transition probabilities. As such, accurate models require fewer states, aiding interpretability from a physical perspective.

The switch is not without tradeoffs. MSMs are backed by a significant body of theoretical work: the MSM is a direct discretization of an integral operator which formally controls the long timescale dynamics, known as the transfer operator. This connection enables the quantification of approximation error in the MSM framework (Prinz et al., 2011). No such theoretical guarantees yet exist for the L1-regularized reversible HMM, because the evolution of $Y_t$ is no longer unconditionally Markovian. However, because the HMM can be viewed as a generalized hidden MSM, there is reason to believe that analogues of MSM theoretical guarantees extend to the HMM framework.

While the L1-regularized reversible hidden Markov model represents an improvement over previous methods for analyzing MD datasets, future work will likely confront a number of remaining challenges. For example, the current model does not learn the featurization and treats $\Delta$ as a hyperparameter. Bringing these two aspects of the model into the optimization framework would reduce the required amount of manual tuning. Adapting techniques from Bayesian nonparametrics, unsupervised feature learning and linear dynamical systems may facilitate the achievement of these goals.

Our results show that structured statistical analysis of massive protein datasets is now possible. We reduce complex dynamical systems with thousands of physical degrees of freedom to simple statistical models characterized by a small number of metastable states and transition rates. The HMM framework is a tool for turning raw molecular dynamics data into scientific knowledge about protein structure, dynamics and function. Our experiments on the ubiquitin and c-Src kinase proteins extract insight that may further the state of the art in cellular biology and rational drug design.

Acknowledgments

We thank Diwakar Shukla for kindly providing the c-Src kinase MD trajectories. B.R. was supported by the Fannie and John Hertz Foundation. G.K. and V.S.P. acknowledge support from the Simbios NIH Center for Biomedical Computation (NIH U54 Roadmap GM072970). V.S.P. acknowledges NIH R01-GM62868 and NSF MCB-0954714.

References

Aleshin, A. and Finn, R. S. SRC: a century of science brought to the clinic. Neoplasia, 12(8):599–607, 2010.

Baldi, Pierre and Pollastri, Gianluca. The principled design of large-scale recursive neural network architectures – DAG-RNNs and the protein structure prediction problem. J. Mach. Learn. Res., 4:575–602, 2003.

Beauchamp, Kyle A., Bowman, Gregory R., Lane, Thomas J., Maibaum, Lutz, Haque, Imran S., and Pande, Vijay S. MSMBuilder2: Modeling conformational dynamics on the picosecond to millisecond scale. J. Chem. Theory Comput., 7(10):3412–3419, 2011.

Beauchamp, Kyle A., Lin, Yu-Shan, Das, Rhiju, and Pande, Vijay S. Are protein force fields getting better? A systematic benchmark on 524 diverse NMR measurements. J. Chem. Theory Comput., 8(4):1409–1414, 2012.

Bowman, Gregory R. Improved coarse-graining of Markov state models via explicit consideration of statistical uncertainty. J. Chem. Phys., 137(13):134111, 2012.

Cho, Samuel S., Levy, Yaakov, and Wolynes, Peter G. P versus Q: Structural reaction coordinates capture protein folding on smooth landscapes. Proc. Natl. Acad. Sci. U.S.A., 103(3):586–591, 2006.

Chu, Wei, Ghahramani, Zoubin, Podtelezhnikov, Alexei, and Wild, David L. Bayesian segmental models with multiple sequence alignment profiles for protein secondary structure and contact map prediction. IEEE/ACM Trans. Comput. Biol. Bioinf., 3(2):98–113, 2006.

Di Lena, Pietro, Baldi, Pierre, and Nagata, Ken. Deep spatio-temporal architectures and learning for protein structure prediction. In Adv. Neural Inf. Process. Syst. 25, NIPS '12, pp. 521–529, 2012.

Dill, Ken A., Bromberg, Sarina, Yue, Kaizhi, Chan, Hue Sun, Fiebig, Klaus M., Yee, David P., and Thomas, Paul D. Principles of protein folding – a perspective from simple exact models. Protein Sci., 4(4):561–602, 1995.

Fan, Jianqing and Li, Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc., 96(456):1348–1360, 2001.

Fang, Zhizhou, Grütter, Christian, and Rauh, Daniel. Strategies for the selective regulation of kinases with allosteric modulators: Exploiting exclusive structural features. ACS Chem. Biol., 8(1):58–70, 2013.

Guo, Jian, Levina, Elizaveta, Michailidis, George, and Zhu, Ji. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

Hershko, Avram and Ciechanover, Aaron. The ubiquitin system. Annu. Rev. Biochem., 67(1):425–479, 1998.

Higham, Desmond J. An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Rev., 43(3):525–546, 2001.

Humphrey, William, Dalke, Andrew, and Schulten, Klaus. VMD: Visual molecular dynamics. J. Mol. Graphics, 14(1):33–38, 1996.

Karplus, M. and Kuriyan, J. Molecular dynamics and protein function. Proc. Natl. Acad. Sci. U.S.A., 102(19):6679–6685, 2005.

Kohlhoff, Kai J., Shukla, Diwakar, Lawrenz, Morgan, Bowman, Gregory R., Konerding, David E., Belov, Dan, Altman, Russ B., and Pande, Vijay S. Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways. Nat. Chem., 6(1):15–21, 2014.

Komander, David, Clague, Michael J., and Urbé, Sylvie. Breaking the chains: structure and function of the deubiquitinases. Nat. Rev. Mol. Cell Biol., 10(8):550–563, 2009.

Lane, Thomas J., Shukla, Diwakar, Beauchamp, Kyle A., and Pande, Vijay S. To milliseconds and beyond: challenges in the simulation of protein folding. Curr. Opin. Struct. Biol., 23(1):58–65, 2013.

Maity, Haripada, Maity, Mita, Krishna, Mallela M. G., Mayne, Leland, and Englander, S. Walter. Protein folding: The stepwise assembly of foldon units. Proc. Natl. Acad. Sci. U.S.A., 102(13):4741–4746, 2005.

Nocedal, J. and Wright, S. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, 2nd edition, 2006.

Noé, Frank, Wu, Hao, Prinz, Jan-Hendrik, and Plattner, Nuria. Projected and hidden Markov models for calculating kinetics and metastable states of complex molecules. J. Chem. Phys., 139(18):184114, 2013.

Prinz, Jan-Hendrik, Wu, Hao, Sarich, Marco, Keller, Bettina, Senne, Martin, Held, Martin, Chodera, John D., Schütte, Christof, and Noé, Frank. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys., 134(17):174105, 2011.

Rabiner, Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257–286, 1989.

Sadiq, S. Kashif, Noé, Frank, and De Fabritiis, Gianni. Kinetic characterization of the critical step in HIV-1 protease maturation. Proc. Natl. Acad. Sci. U.S.A., 109(50):20449–20454, 2012.

Shaw, David E. et al. Anton, a special-purpose machine for molecular dynamics simulation. In ACM Comp. Ar. 34, ISCA '07, pp. 1–12, 2007.

Shirts, Michael and Pande, Vijay S. Screen savers of the world unite! Science, 290(5498):1903–1904, 2000.

Shukla, Diwakar, Meng, Yilin, Roux, Benoît, and Pande, Vijay S. Activation pathway of Src kinase reveals intermediate states as novel targets for drug design. Nat. Commun., 5, 2014.

Sontag, David, Meltzer, Talya, Globerson, Amir, Jaakkola, Tommi S., and Weiss, Yair. Tightening LP relaxations for MAP using message passing. arXiv preprint arXiv:1206.3288, 2012.

Tibshirani, Robert, Saunders, Michael, Rosset, Saharon, Zhu, Ji, and Knight, Keith. Sparsity and smoothness via the fused lasso. J. R. Statist. Soc. B, 67(1):91–108, 2005.

Voelz, Vincent A., Jäger, Marcus, Yao, Shuhuai, Chen, Yujie, Zhu, Li, Waldauer, Steven A., Bowman, Gregory R., Friedrichs, Mark, Bakajin, Olgica, Lapidus, Lisa J., Weiss, Shimon, and Pande, Vijay S. Slow unfolded-state structuring in Acyl-CoA binding protein folding revealed by simulation and experiment. J. Am. Chem. Soc., 134(30):12565–12577, 2012.

Weber, Jeffrey K. and Pande, Vijay S. Protein folding is mechanistically robust. Biophys. J., 102(4):859–867, 2012.

Wong, Chung F. and McCammon, J. Andrew. Protein flexibility and computer-aided drug design. Annu. Rev. Pharmacol. Toxicol., 43(1):31–45, 2003.

Zhang, Yingnan, Zhou, Lijuan, Rouge, Lionel, Phillips, Aaron H., Lam, Cynthia, Liu, Peter, Sandoval, Wendy, Helgason, Elizabeth, Murray, Jeremy M., Wertz, Ingrid E., et al. Conformational stabilization of ubiquitin yields potent and selective inhibitors of USP7. Nat. Chem. Biol., 9(1):51–58, 2012.

Zhuang, Wei, Cui, Raymond Z., Silva, Daniel-Adriano, and Huang, Xuhui. Simulating the T-jump-triggered unfolding dynamics of trpzip2 peptide and its time-resolved IR and two-dimensional IR signals using the Markov state model approach. J. Phys. Chem. B, 115(18):5415–5424, 2011.