Nonlinear Discovery of Slow Molecular Modes using State-Free Reversible VAMPnets

Wei Chen,1 Hythem Sidky,2 and Andrew L. Ferguson2, ∗ 1Department of Physics, University of Illinois at Urbana-Champaign, 1110 West Green Street, Urbana, Illinois 61801 2Institute for Molecular Engineering, 5640 South Ellis Avenue, University of Chicago, Chicago, Illinois 60637 The success of enhanced sampling molecular simulations that accelerate along collective variables (CVs) is predicated on the availability of variables coincident with the slow collective motions governing the long-time conformational dynamics of a system. It is challenging to intuit these slow CVs for all but the simplest molecular systems, and their data-driven discovery directly from molecular simulation trajectories has been a central focus of the molecular simulation community to both unveil the important physical mechanisms and to drive enhanced sampling. In this work, we introduce state-free reversible VAMPnets (SRV) as a deep learning architecture that learns nonlinear CV approximants to the leading slow eigenfunctions of the spectral decomposition of the transfer operator that evolves equilibrium-scaled probability distributions through time. Orthogonality of the learned CVs is naturally imposed within network training without added regularization. The CVs are inherently explicit and differentiable functions of the input coordinates making them well- suited to use in enhanced sampling calculations. We demonstrate the utility of SRVs in capturing parsimonious nonlinear representations of complex system dynamics in applications to 1D and 2D toy systems where the true eigenfunctions are exactly calculable and to simulations of alanine dipeptide and the WW domain protein.

I. INTRODUCTION a dataset of size N, which becomes intractable for large datasets. In practice, this issue can be adequately alle- Molecular dynamics (MD) simulations have long been viated by selecting a small number of landmark points an important tool for studying molecular systems by pro- to which kernel TICA is applied, and then constructing viding atomistic insight into physicochemical processes interpolative projections of the remaining points using that cannot be easily obtained through experimenta- the Nystr¨omextension [12]. Second, kTICA is typically tion. A key step in extracting kinetic information from highly sensitive to the kernel hyperparameters [12], and molecular simulation is the recovery of the slow dynam- extensive hyperparameter tuning is typically required to ical modes that govern the long-time evolution of sys- obtain acceptable results. Third, the use of the kernel tem coordinates within a low-dimensional latent space. trick and Nystr¨omextension compromises differentiabil- The variational approach to conformational dynamics ity of the latent space projection. Exact computation of (VAC) [1, 2] has been successful in providing a math- the gradient of any new point requires the expensive cal- ematical framework through which the eigenfunctions of culation of kernel function K(x, y) of new data x with the underlying transfer operator can be estimated [3, 4]. respect to all points y in the training set, and gradient A special case of VAC which estimates linearly op- estimations based only on the landmarks are inherently timal slow modes from mean-free input coordinates is approximate [13]. Due to their high cost and/or insta- known as time-lagged independent component analysis bility of these gradient estimates, the slow modes esti- (TICA) [1, 2, 4–9]. TICA is a widely-used approach that mated through kTICA are impractical for use as collec- has become a standard step in the Markov state model- tive variables (CVs) for enhanced sampling or reduced- ing pipeline [10]. However, it is restricted to form lin- dimensional modeling approaches that require exact gra- ear combinations of the input coordinates and is unable dients of the CVs with respect to the input coordinates. to learn nonlinear transformations that are typically re- Deep learning offers an attractive alternative to kTICA quired to recover high resolution kinetic models of all but as means to solve these challenges. Artificial neural net- arXiv:1902.03336v2 [stat.ML] 2 Jun 2019 the simplest molecular systems. Schwantes et al. address works are capable of learning nonlinear functions of arbi- this limitation by applying the kernel trick with TICA to trary complexity [14, 15], are generically scalable to large learn non-linear functions of the input coordinates [11]. A datasets with training scaling linearly with the size of the special case of a radial basis function kernels was realized training set, the network predictions are relatively robust by No´eand Nuske in the direct application of VAC using to the choice of network architecture and activation func- Gaussian functions [1]. Kernel TICA (kTICA), however, tions, and exact expressions for the derivatives of the suffers from a number of drawbacks that have precluded learned CVs with respect to the input coordinates are its broad adoption. First, its implementation requires available by automatic differentiation [16–19]. A number O(N 2) memory usage and O(N 3) computation time for of approaches utilizing artificial neural networks to ap- proximate eigenfunctions of the dynamical operator have been proposed. Time-lagged autoencoders [20] utilize auto-associative neural networks to reconstruct a time- ∗ Author to whom correspondence should be addressed: lagged signal, with suitable collective variables extracted [email protected] from the bottleneck layer. Variational dynamics encoders 2

(VDEs) [21] combine time-lagged reconstruction loss and tion of the slow eigenfunctions of the reversible dynamics autocorrelation maximization within a variational au- rather than a less biased estimator of the possibly non- toencoder. While the exact relationship between the re- reversible dynamics contained in the finite data. Second, gression approach employed in time-lagged autoencoders VAMPnets employ softmax activations within their out- and the VAC framework is not yet formalized [20], varia- put layers to generate k-dimensional output vectors that tional autoencoders (VAEs) have already been studied as can be interpreted as probability assignment to each of estimators of the dynamical propagator [21] and in fur- k metastable states. This network architecture achieves nishing collective variables for enhanced sampling [22]. nonlinear featurization of the input basis and soft cluster- The most pressing limitation of VAE approaches to date ing into metastable states. The output vectors are subse- is their restriction to the estimation of the single leading quently optimized using the VAMP principle to furnish eigenfunctions of the dynamical propagator. The absence a kinetic model over these soft/fuzzy states by approx- of the full spectrum of slow modes fails to expose the full imating the eigenfunctions of the transfer operator over richness of the underlying dynamics, and limits enhanced the states. Importantly, even though the primary objec- sampling calculations in the learned CVs to acceleration tive of VAMPnets is clustering, the soft state assignments along a single coordinate that may be insufficient to drive can be used to approximate the transfer operator eigen- all relevant conformational interconversions. functions using a reweighting procedure. However, since In this work, we propose a deep-learning based method the soft state assignments produced by softmax activa- to estimate the slow dynamical modes that we term state- tions of the output layer are constrained to sum to unity, free reversible VAMPnets (SRVs). SRVs take advantage there is a linear dependence that requires (k + 1) out- of the VAC framework using a neural network architec- put components to identify the k leading eigenfunctions. ture and loss function to recover the leading modes of The second distinguishing feature of SRVs is thus to em- the spectral hierarchy of eigenfunctions of the transfer ploy linear or nonlinear activation functions in the out- operator that evolves equilibrium-scaled probability dis- put layer of the network. By eschewing any clustering, tributions through time. In a nutshell, the SRV discovers the SRV is better suited to directly approximating the a nonlinear featurization of the input basis to pass to the transfer operator eigenfunctions as its primary objective, VAC framework for estimation of the leading eigenfunc- although clustering can also be performed in the space tions of the transfer operator. of these eigenfunctions in a post-processing step. This seemingly small change has a large impact in the success This approach shares much technical similarity with, rate of the network in successfully recovering the trans- and was in large part inspired by, the elegant VAMPnets fer operator eigenfunctions. Training neural networks is approach developed by No´eand coworkers [23] and deep inherently stochastic and it is standard practice to train canonical correlation analysis (DCCA) approach devel- multiple networks with different initial network parame- oped by Livescu and coworkers [24]. Both approaches ters and select the best. Numerical experiments on the employ Siamese neural networks to discover nonlinear small biomolecule alanine dipeptide (see Section III.C) featurizations of an input basis that are optimized us- in which we trained 100 VAMPnets and 100 SRVs for ing a VAMP score. VAMPnets differ from DCCA in 100 epochs employing optimal learning rates showed that optimizing a VAMP-2 rather than VAMP-1 score, mak- both VAMPnets and SRVs were able to accurately re- ing it better suited to applications to time series data cover the leading eigenfunctions and eigenvalues in quan- due to its theoretical grounding in the Koopman approx- titative agreement with one another, but that VAMPnets imation error [23]. VAMPnets seek to replace the en- exhibited a 29% success rate in doing so compared to tire MSM construction pipeline of featurization, dimen- 70% for SRVs. Accordingly, the architecture of SRVs is sionality reduction, clustering, and construction of a ki- better suited to direct estimation of the transfer opera- netic model. The objective of SRVs is not to perform tor eigenfunctions and may be preferred when the objec- direct state space partitioning but rather to learn con- tive is to estimate these functions as continuous, explicit, tinuous nonlinear functions of the input data to gener- and differentiable functions of the input coordinates that ate a nonlinear basis set with which to approximate the can be used to infer the mechanisms of molecular con- eigenfunctions of the transfer operator. The design of formational transitions, employed directly in enhanced SRVs differs from that of VAMPnets in two important sampling calculations, and passed to standard MSM con- ways to support this goal. Indeed the name given to our struction pipelines to perform microstate clustering and approach is intended to both indicate its heritage with estimation of discrete kinetic models. VAMPnets and these distinguishing features. First, the SRV optimizes the VAC as a variational principle for sta- tionary and reversible processes [1, 2], whereas VAMP- The structure of this paper is as follows. We first derive nets employ the more general VAMP principle that ap- the theoretical foundations of the SRV as a special case of plies to non-stationary and non-reversible processes [23]. VAC, and then demonstrate its efficacy against kTICA As such, SRVs are designed for applications to molec- and state-of-the-art TICA-based MSMs in applications ular systems obeying detailed balance where the VAC to 1D and 2D toy systems where the true eigenfunctions permits us to take molecular trajectories that may not are known and in molecular simulations of alanine dipep- strictly obey detailed balance and make a biased estima- tide and WW domain. 3

II. METHODS Let {ψi(x)} be eigenfunctions of T corresponding to the eigenvalues {λi} in non-ascending order We first recapitulate transfer operator theory and T ◦ ψi(x) = λiψi(x). (5) the variational approach to conformational dynamics (VAC) [1, 2, 11, 25, 26], choosing to borrow the nota- The self-adjoint nature of T implies that it possesses real tional convention from Ref. [26]. We then demonstrate eigenvalues {λi(x)} and its eigenvectors {ψi(x)} form a how the VAC specializes to TICA, kTICA, and SRVs. complete orthonormal basis [1, 2, 27], with orthonormal- ity relations

A. Transfer operator theory hψi|ψjiπ = δij. (6) R Normalization of the transition density dxpτ (y, x) = 1 Given the probability distribution of a system config- together with the assumption of ergodicity implies that uration pt(x) at time t and the equilibrium probability the eigenvalue spectrum is bounded from above by a distribution π(x), we define ut(x) = pt(x)/π(x) and the unique unit eigenvalue such that 1 = λ0 > λ1 ≥ λ2 ≥ ... transfer operator Tt = Tt(τ), known formally as the [2, 26]. Any state χt(x) at a specific time t can be written Perron-Frobenius operator or propagator with respect to as a linear expansion in this basis of {ψi} the equilibrium density [4], such that X Z χt(x) = hψi|χtiπψi(x). (7) 1 t ut+τ (x) = Tt ◦ut(x) = dy p (y, x)ut(y)π(y) (1) i π(x) τ The evolution of χt(x) to χt+kτ (x) after a time period t where pτ (y, x) = P(xt+τ = x|xt = y) is a transition den- kτ can be written as sity describing the probability that a system at y at time t k X k t evolves to x after a lag time τ. In general, pτ (y, x) χt+kτ (x) = T ◦ χt(x) = hψi|χtiπT ψi(x) depends on not only current state y at time t, but also i previous history, and is therefore time dependent. Under X k = hψi|χtiπλi ψi(x) the Markovian assumption, which becomes an increas- i ingly good approximation at larger lag times τ, pt (y, x) τ X  kτ  becomes a time homogeneous transition density pτ (y, x) = hψi|χtiπ exp − ψi(x), ti independent of t and the transfer operator Tt can be i written as T , where (8)

1 Z where ti is the implied timescale corresponding to eigen- ut+τ (x) = T ◦ ut(x) = dy pτ (y, x)ut(y)π(y). π(x) function ψi given by (2) τ If the system is at equilibrium, then it additionally obeys ti = − . (9) log λi detailed balance such that This development makes clear that the eigenvalue as- π(x)pτ (x, y) = π(y)pτ (y, x). (3) sociated with an eigenfunction characterizes its temporal autocorrelation. The pair {ψ0, λ0 = 1} corresponds to Given any two state functions u1(x) and u2(x), we appeal the equilibrium distribution. Smaller positive eigenval- to Eq. 2 and 3 to write ues in the range 1 > λ > 0 decay increasingly faster in time. The self-adjointness of T assure all eigenval- hu1(x)|T ◦ u2(x)iπ Z Z ues are real but does not prohibit negative eigenvalues, = dx u1(x) dy pτ (y, x)u2(y)π(y) which are mathematically admissible but unphysical on the grounds that they are measures of autocorrelation. Z Z Accordingly, negative eigenvalues are rarely observed for = dx u (x) dy p (x, y)u (y)π(x) 1 τ 2 well trained models for which sufficient training data is Z Z available at sufficiently high temporal resolution, and = dy u2(y) dx pτ (x, y)u1(x)π(x) their appearance is a numerical indication that the “slow Z Z modes” identified by the approach cannot be adequately = dx u2(x) dy pτ (y, x)u1(y)π(y) resolved from the data and/or model at hand.

=hT ◦ u1(x)|u2(x)iπ, B. Variational approach to conformational which demonstrates that T is self-adjoint with respect dynamics (VAC) to the inner product Z ˜ Under the VAC, we seek an orthonormal set {ψi} to ap- ha|biπ = a(x)b(x)π(x)dx. (4) proximate {ψi} under the orthogonality conditions given 4 by Eq. 6. Typically we are interested not in the full {ψi} the associated eigenvalue. The spectrum of solutions of but only the leading eigenfunctions corresponding to the Eq. 15 yield the best linear estimations of the eigenfunc- largest eigenvalues and thus longest implied timescales. tions of the transfer operator T within the basis {ζj(x)}. We first observe that ψ0(x) = 1 is a trivial eigen- The generalized eigenvalue problem can be solved by function of T with eigenvalue λ0 = 1 corresponding standard techniques [29]. to the equilibrium distribution at t → ∞. This fol- Eq. 12 and 13 serve as the central equations for TICA, lows from Eq. 2 and 3 whereby T ◦ ψ0(x) = T ◦ 1 = kTICA and SRVs. We first show how these equations can 1 R 1 R π(x) dy pτ (y, x)π(y) = π(x) dy pτ (x, y)π(x) = 1 = be estimated from simulated data, and then how these ψ0(x). three methods emerge as specializations of VAC under To learn ψ1(x), we note that any state function u(x) particular choices for the input basis. which is orthogonal to ψ0(x) can be expressed as

X X C. Estimation of VAC equations from trajectory u(x) = hψi|uiπψi(x) = ciψi(x) (10) data i≥1 i≥1 where ci = hψi|uiπ are expansion coefficients, and Here we show how Eq. 12 and 13 can be estimated P 2 P 2 from empirical trajectory data [1, 2, 27]. The numerator hu | T ◦ uiπ i≥1 ci λi i≥1 ci λ1 λ˜ = = ≤ = λ . of Eq. 12 becomes hu | ui P c2 P c2 1 π i≥1 i i≥1 i ˜ ˜ (11) hψi | T ◦ ψiiπ ˜ Z 1 Z Since λ is bounded from above by λ1 by the variational = dx π(x)ψ˜ (x) dy p (y, x)ψ˜ (y)π(y) principle [2, 11], we can exploit this fact to approximate i π(x) τ i the first non-trivial eigenfunction ψ1 by searching for a u Z ˜ = dx dy ψ˜ (x)p (y, x)ψ˜ (y)π(y) that maximizes λ subject to hu | ψ0iπ = 0. The learned i τ i u is an approximation to first non-trivial eigenfunction Z ˜ ˜ ˜ ψ1. = dx dy ψi(x)P(xt+τ = x|xt = y)ψi(y)P(xt = y) We can continue this procedure to approximate higher ˜ h ˜ ˜ i order eigenfunctions. In general we approximate ψi(x) ≈ E ψi(xt)ψi(xt+τ ) , (18) by maximizing h ˜ ˜ i ˜ ˜ where E ψi(xt)ψi(xt+τ ) can be estimated from a tra- ˜ hψi | T ◦ ψiiπ λi = , (12) ˜ ˜ jectory {xt}. The denominator follows similarly as hψi | ψiiπ ˜ ˜ under the orthogonality constraints hψi | ψiiπ Z ˜ ˜ ˜ ˜ hψk | ψiiπ = 0, 0 ≤ k < i. (13) = dx π(x)ψi(x)ψi(x) In essence, the VAC procedure combines a variational Z = dx ψ˜ (x)ψ˜ (x) (x = x) principle [2] with a linear variational approach perhaps i i P t most familiar from quantum mechanics [28]. Given an h ˜ ˜ i ≈ E ψi(xt)ψi(xt) . (19) arbitrary input basis {ζj(x)}, the eigenfunction approxi- mations may be written as linear expansions The full expression for Eq. 12 becomes ˜ X ψi = sijζj. (14) h ˜ ˜ i ˜ ˜ E ψi(xt)ψi(xt+τ ) j ˜ hψi | T ◦ ψiiπ λi = ≈ . (20) hψ˜ | ψ˜ i h ˜ ˜ i Adopting this basis, the VAC can be shown to lead to i i π E ψi(xt)ψi(xt) a generalized eigenvalue problem analogous to the quan- tum mechanical Roothaan-Hall equations [2, 28] Similarly, Eq. 13 becomes h i ˜ hψ˜ | ψ˜ i ≈ E ψ˜ (x )ψ˜ (x ) = 0, 0 ≤ k < i. (21) Csi = λiQsi, (15) k i π k t i t where Using the same reasoning, the components (Eq. 16 and 17) of the generalized eigenvalue problem (Eq. 15) are Cjk = hζj(x)|T ◦ ζk(x)iπ, (16) estimated as

Qjk = hζj(x)|ζk(x)iπ. (17) Cjk = hζj(x)|T ◦ ζk(x)iπ ≈ E[ζj(xt)ζk(xt+τ )] , (22)

Here si is the (eigen)vector of linear expansion coeffi- ˜ ˜ cients for the approximate eigenfunction ψi, and λi is Qjk = hζj(x)|ζk(x)iπ ≈ E[ζj(xt)ζk(xt)] . (23) 5

D. Time-lagged independent component analysis (TICA) ˜ ˜ 0 = hψk | ψiiπ = E [(ai · δxt)(ak · δxt)] . (30) ˜ In TICA, we represent ψi(x) as a linear combination of molecular coordinates x, where a is a vector of linear i which are exactly the objective function and orthogonal- expansion coefficients and C is an additive constant ity constraints of TICA [5, 9].

˜ ψi(x) = ai · x + C. (24) ˜ E. Kernel TICA (kTICA) The orthogonality condition Eq. 13 of ψi relative to ˜ ψ0(x) = 1 becomes One way of generalizing TICA to learn nonlinear fea- Z Z tures is though feature engineering. Specifically, if we ˜ ˜ ˜ h ˜ i 0 = dx π(x)ψ0(x)ψi(x) = dx π(x)ψi(x) = E ψi(x) . can find a nonlinear mapping φ that maps configura- tions x to appropriate features φ(x), we can apply TICA (25) on these features. However, designing good nonlinear It follows that features typically requires expert knowledge or expen- h ˜ i sive data preprocessing techniques. Therefore, instead 0 = E ψi(x) = E [ai · x + C] = ai · E[x] + C (26) of finding an explicit mapping φ, an alternative ap- proach is to apply the kernel trick using a kernel function ⇒ C = −ai · E[x] , (27) K(x, y) = φ(x) · φ(y) that defines an inner product be- and therefore Eq. 24 can be written as tween φ(x) and φ(y) as a similarity measure in the feature space that does not require explicit definition of φ. ˜ ψi(x) = ai · x − ai · E[x] = ai · δx, (28) To apply the kernel trick to TICA, we need to refor- mulate TICA in terms of this kernel function. It can be where δx = x − E[x] is a mean-free coordinate. Under shown that in Eq. 29, the coefficient a is linear combina- ˜ i this specification for ψi(x), Eq. 12 and Eq. 13 become tion of {δxt} ∪ {δxt+τ } (see Supplementary Information of Ref. [11]) and may therefore be written as ˜ ˜ ˜ hψi | T ◦ ψiiπ λi = X ˜ ˜ ai = (βitδxt + γitδxt+τ ) . (31) hψi | ψiiπ t E [(ai · δxt)(ai · δxt+τ )] = 2 , (29) E [(ai · δxt) ] Under this definition for ai Eq. 29 and 30 become,

˜ ˜ hψi | T ◦ ψiiπ λ˜ = i ˜ ˜ hψi | ψiiπ P P E [( t (βitδxt + γitδxt+τ ) · δxt)( t (βitδxt + γitδxt+τ ) · δxt+τ )] = P 2 , (32) E [( t (βitδxt + γitδxt+τ ) · δxt) ]

˜ ˜ 0 = hψk | ψiiπ " # X X = E ( (βitδxt + γitδxt+τ ) · δxt)( (βktδxt + γktδxt+τ ) · δxt) . (33) t t

Now the objective function Eq. 32 and constraints Eq. 33 linear similarity measure, which is the inner product only depend on inner products between any pair of ele- δx · δy of two vectors, with a symmetric nonlinear kernel ments in {δxt} ∪ {δxt+τ }. function K(x, y). This transforms Eq. 32 and 33 to To obtain a nonlinear transformation, we replace the 6

˜ ˜ hψi | T ◦ ψiiπ λ˜ = i ˜ ˜ hψi | ψiiπ P P E [( t (βitK(xt, xt) + γitK(xt, xt+τ )))( t (βitK(xt, xx+τ ) + γitK(xt+τ , xt+τ )))] = P 2 , (34) E [( t (βitK(xt, xt) + γitK(xt, xt+τ ))) ]

˜ ˜ 0 = hψk | ψiiπ " # X X = E ( (βitK(xt, xt) + γitK(xt, xt+τ )))( (βktK(xt, xt) + γktK(xt, xt+τ ))) , (35) t t

which define the objective function and orthogonality F. State-free reversible VAMPnets (SRVs) constraints of kTICA. As detailed in Ref. [11], Eq. 34 and 35 can be simplified to a generalized eigenvalue problem In general, the eigenfunction approximations need not that admits efficient solution by standard techniques [29]. be a linear function within either the input coordinate space or the feature space. The most general form of Eq. 12 and 13 is Although kTICA enables recovery of nonlinear eigen- functions, it does have some significant drawbacks. First, ˜ ˜ hψi | T ◦ ψiiπ λ˜ = it has high time and space complexity. The Gram matrix i ˜ ˜ 2 hψi | ψiiπ K = [K(x, y)]N×N takes O(N ) time to compute and 2 h ˜ ˜ i requires O(N ) memory, and the generalized eigenvalue E ψi(xt)ψi(xt+τ ) 3 problem takes O(N ) time to solve, which severely limits = h i , (36) ˜2 its application to large datasets. Second, results can be E ψi (xt) sensitive to the choice of kernel and there exist no rigor- ous guidelines for the choice of an appropriate kernel for h i a particular application. This limits the generalizability 0 = hψ˜ | ψ˜ i = E ψ˜ (x )ψ˜ (x ) . (37) of the approach. Third, the kernels are typically highly k i π i t k t sensitive to the choice of hyperparameters. For example, We now introduce the SRV approach that employs a the Gaussian (radial basis function) kernel is sensitive to neural network f to learn nonlinear approximations to noise for small kernel widths σ, which leads to overfit- {ψ˜ } directly without requiring the kernel trick. The neu- ting and overestimation the implied timescales. A large i ral network f maps configuration x to a n-dimensional σ on the other hand typically approaches linear TICA output f (x)(i = 1, ..., n), where n is the number of results which undermines its capacity to learn nonlin- i slow modes we want to learn. Then a linear variational ear transformations [12, 22]. This hyperparameter sen- ˜ method is applied to obtain the corresponding ψi(x) such sitivity typically requires signifiant expenditure of com- ˜ putational effort to tune these values in order to obtain that {ψi(x)} form an orthonormal basis set that mini- satisfactory results [11, 12]. Fourth, kTICA does not fur- mizes the network loss function. nish an explicit expression for the mapping φ : x → φ(x) The method proceeds as follows. Given a neural net- projecting configurations into the nonlinear feature space work f with n-dimensional outputs, we feed a training [21]. Accordingly, it is not straightforward to apply ker- set X = {x} and train the network with loss function L nel TICA within enhanced sampling protocols that re-  h ˜ ˜ i quire the learned latent variables to be explicit and dif- E ψi(xt)ψi(xt+τ ) X ˜ X ferentiable functions of the input coordinates. L = g(λi) = g  h i  . (38) ˜2 i i E ψi (xt)

where g is a monotonically decreasing function. Minimiz- ˜ One way to ameliorate the first deficiency by reduc- ing L is equivalent to maximizing the sum over λi and ˜ ing memory usage and computation time is to employ therefore maximizing the sum over ti (Eq. 9). The {ψi} a variant of kTICA known as landmark kTICA. This correspond to linear combinations of the neural network approach selects m << N landmarks from the dataset, outputs {fi(x)} computes an m-by-m Gram matrix, and then uses the ˜ X Nystr¨omapproximation to estimate the original N-by-N ψi(x) = sijfj(x), (39) Gram matrix [12]. j 7 computed by applying the linear VAC to the neural Input 푥푡 Input 푥푡+휏 network outputs [1, 2]. The linear VAC is equivalent to the generalized eigenvalue problem in Eq. 15 where ζj(x) = fj(x). Hidden Hidden Minimization of the loss function by gradient descent requires the derivative of L with respect to neural net- shared work parameters θ 푓푖(푥푡) 푓푖(푥푡+휏) ˜ ∂L X ∂L ∂g ∂λi = . (40) ∂θ ∂g ˜ ∂θ i ∂λi 퐶jk = E 푓푗 푥푡 푓푘 푥푡+휏 푄 = E[푓 (푥 )푓 (푥 )] Linear VAC The first two partial derivatives are straightforwardly cal- jk 푗 푡 푘 푡 ሚ culated by automatic differentiation once a choice for g 퐶푠푖 = 휆푖푄푠푖 has been made. The third partial derivative requires a lit- tle more careful consideration. We first expand this final ሚ Loss function derivative using Eq. 15 to make explicit the dependence 퐿 = ෍ 푔(휆푖) ˜ of λi on the matrices C and Q 푖

∂λ˜ ∂λ˜ ∂C ∂λ˜ ∂Q FIG. 1: Schematic diagram of state-free reversible i = i + i . (41) ∂θ ∂C ∂θ ∂Q ∂θ VAMPnets (SRVs). A pair of configurations (xt, xt+τ ) are fed into a Siamese neural network in which each To our best knowledge, no existing computational graph subnet possesses the same architecture and weights. frameworks provide gradients for generalized eigenvalue The Siamese net generates network outputs {fi(xt)} problems. Accordingly, we rewrite Eq. 15 as follows. We and {fi(xt+τ )} . The mappings fi : x → fi(x) evolve as first apply a Cholesky decomposition to Q, such that network training proceeds and the network weights are updated. Optimal linear combinations of {f } produce ˜ T i Csi = λiLL si, (42) ˜ estimates of the transfer operator eigenfunctions {ψi} by applying linear VAC, which can be formulated as a where L is a lower triangular matrix. We then left mul- generalized eigenvalue problem. Network training tiply both sides by L−1 to obtain proceeds by and is terminated when P ˜ −1 T −1 T  ˜ T  the loss function L = g(λi) is minimized. L C(L ) L si = λi L si . (43) i

−1 T −1 T Defining C˜ = L C(L ) ands ˜i = L si we convert the generalized eigenvalue problem into a standard eigen- value with a symmetric matrix C˜ validation set with which to implement early stopping. Training is terminated when validation loss no longer de- ˜ creases for a pre-specified number of epochs. A schematic C˜s˜i = λis˜i, (44) diagram of the SRV is shown in Fig. 1. ˜ where the Cholesky decomposition assures numerical sta- We note that we do not learn {ψi} directly in our neu- bility. Now, Eq. 41 becomes ral network, but obtain it as a weighted linear combi- nation of {fi}. Specifically, during training we learn not ˜ ˜ ! ∂λi ∂λi ∂C˜ ∂C ∂C˜ ∂Q only the weights of neural network, but also the linear ex- = + (45) ˜ ∂θ ∂C˜ ∂C ∂θ ∂Q ∂θ pansion coefficients {sij} that yield {ψi} from {fi}. Con- ceptually, one can consider the neural network as learning where all terms are computable using automatic differen- an optimal nonlinear featurization of the input basis to ˜ pass through the VAC framework. After training is com- tiation: ∂λi from routines for a symmetric matrix eigen- ∂C˜ plete, a new out-of-sample configuration x can be passed ∂C˜ ∂C˜ value problem, ∂C and ∂Q from those for Cholesky de- through the neural network to produce f, which is then composition, matrix inversion, and matrix multiplica- transformed via a linear operator {sij} to get the eigen- ∂C ∂Q ˜ tion, and ∂θ and ∂θ by applying the chain rule to Eq. 16 functions estimates {ψi}. Since the neural network is and 17 with ζj(x) = fj(x) and computing the derivatives fully differentiable and the final transformation is linear ∂f ˜ ∂θ through the neural network. the SRV mapping x → {ψi} is explicit and differentiable, Training is performed by passing {xt, xt+τ } pairs to making it well suited to applications in enhanced sam- the SRV and updating the network parameters using pling. mini-batch gradient descent using Adam [30] and employ- An important choice in our network design is the func- ing the automatic differentiation expression for ∂L/∂θ to tion g(λ˜) within our loss function. In theory it can be minimize the loss function. To prevent overfitting, we any monotonically decreasing function, but motivated by ˜ ˜2 shuffle all {xt, xt+τ } pairs and reserve a small portion as Refs. [23] and [27], we find that choosing g(λ) = −λ such 8 that the loss function corresponds to the VAMP-2 score each hidden layer, and hidden layer activation functions yields good performance and possesses strong theoretical for all four applications in this work, demonstrating the grounding. Specifically, the VAMP-2 score may be in- simplicity, robustness, and generalizability of the SRV terpreted as the cumulative kinetic variance [6, 10] anal- approach. ogous to the cumulative explained variance in principal component analysis [31], but where the VAMP-2 score measures kinetic rather than conformational information. III. RESULTS The VAMP-2 score can also be considered to measure the closeness of the approximate eigenfunctions to the true A. 1D 4-well potential eigenfunctions of the transfer operator, with this score achieving its maximum value when the approximations In our first example, we consider a simple 1D 4-well become exact [23]. This choice may also be generalized potential defined in Ref. [11] and illustrated in Fig. 2. ˜ ˜r to the VAMP-r score g(λ) = −λ [27]. We have also ob- The eigenfunctions of the transfer operator for this sys- ˜ ˜ served good performance using g(λ) = 1/ log(λ), which tem are exactly calculable, and this provides a simple corresponds to maximizing the sum of implied timescales initial demonstration of SRVs. We construct a transi- P i ti (Eq. 9). tion matrix as a discrete approximation for the transfer We also note that it is possible that the transfer op- operator by dividing the interval [−1, 1] into 100 evenly- erator may possess degenerate eigenvalues correspond- spaced bins and computing the transition probability pij ing to associated eigenfunctions with identical implied of moving from bin i to bin j as, timescales. True degeneracy is expected to be rare within the leading eigenfunctions but may arise from symmetries ( in the system dynamics; approximate degeneracy may be C e−(Vj −Vi), if |i − j| ≤ 1 p = i (46) encountered wherein timescales become indistinguishable ij 0, otherwise within the training data available to the model. In princi- ple, either of these situations could lead to numerical dif- where Vj and Vi are the potential energies at the centers ficulties wherein degenerate eigenfunctions cannot be sta- of bins j and i and Ci is the normalization factor for bin i bly resolved due to the existence of infinitely many equiv- such that the total transition probability associated with alent linear combinations. In such instances, all eigen- bin i sums to unity. We then define a unit-time transition functions associated with repeated eigenvalues should be matrix P (1) = [pij]100×100, and a transition matrix of lag retained and mutually aligned within successive rounds time τ as P (τ) = P (1)τ . In this model, we use a lag time of training by determining the optimal rotation matrix τ = 100 for both theoretical analysis and model learning. within their linear subspace. In practice, we suspect that minor differences in the implied timescales induced by the finite nature of the training data will break the de- 2.00 generacy of the eigenvalues and allow for stable numerical 1.75 recovery. A post hoc analysis of any groups of close eigen- 1.50 values can then be performed to attempt to resolve the 1.25 root of any suspected underlying symmetry. 1.00

The SRV learning protocol is quite simple and efficient. V(x) It only requires O(N) memory and O(N) computation 0.75 time, which makes it ideal for large datasets. It also does 0.50 not require selection of a kernel function to achieve appro- 0.25 priate nonlinear embeddings [21, 22] Instead we appeal 0.00 to the universal approximation theorem [14, 15], which 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 loosely states that a neural network with more than one x hidden layer and enough number of hidden nodes can approximate any nonlinear continuous function without FIG. 2: Model 1D 4-well potential landscape. Potential need to impose a kernel. Lifting the requirement for ker- energy barriers of various heights to introduce nel selection and its attendant hyperparameter tuning metastability and a separation of time-scales. The is a significant advantage of the present approach over potential is given by V (x) = kernel-based techniques. Training such a simple neural  2 2 2  2 x8 + 0.8e−80x + 0.2e−80(x−0.5) + 0.5e−40(x+0.5) . network is possible using standard techniques such as stochastic gradient descent and is largely insensitive to our choice of network architecture, hyperparameters, and activation functions. Using a default learning rate, large By computing the eigenvectors {vi} of the transition batch size, and sufficiently large network gives excellent matrix P (τ = 100), we recover the equilibrium distribu- results as we demonstrate below. Furthermore, we use tion π = v0. The corresponding eigenfunctions of trans- the same number of hidden layers, number of nodes in fer operator T (τ = 100) are given by ψi = vi/π. The 9 top four eigenfunctions are shown in Fig. 3, where first kTICA SRV eigenfunction ψ0 corresponds to trivial stationary transi- tion, and next three to transitions over the three energy barriers.

t0 = 0 1

1 (t1 = 6057) 1 (t1 = 6047)

1 (t1 = 6038) 1 (t1 = 6038)

t1 = 6038

t3 = 475 2 3

t2 = 922

x x 2 (t2 = 895) 2 (t2 = 893)

2 (t2 = 922) 2 (t2 = 922) FIG. 3: Theoretical eigenfunctions of the transfer operator for the model 1D 4-well potential illustrated in 3 (t3 = 474) 3 (t3 = 474) Figure 2. Note that the first eigenfunction corresponds 3 (t3 = 475) 3 (t3 = 475) to the stationary distribution and is equal to unity for the transfer operator. The remaining eigenfunctions represent transitions between the various energy wells whose barrier heights determine the associated τ eigenvalues λi and implied timescales ti = − . log λi

Using the calculated transition matrix P (1), we gen- x x erate a 5,000,000-step trajectory over the 1D landscape FIG. 4: Top three eigenfunctions learned by kernel by initializing the system within a randomly selected bin TICA (column 1) and SRV (column 2) for the model and then propagating its dynamics forward through time 1D 4-well potential. Each row represents the theoretical under the action of P (1) [11]. The state of the system eigenfunction (red dashed line) and corresponding at any time t is represented by the 1D x-coordinate of learned eigenfunction (blue solid line). Timescales for the particle x(t) ∈ 1. We then adopt a fixed lag time of R theoretical eigenfunctions and learned eigenfunctions τ = 100 and learn the top three non-trivial eigenfunctions are reported in the legend. using both kTICA and SRV. We note that for kTICA, we cannot compute the full 5,000,000-by-5,000,000 Gram matrix, so instead we select 200 landmarks by K-means clustering and use the landmark approximation. We se- lect a Gaussian kernel with σ = 0.05. In contrast, the SRV has no difficulty processing all data points. We use two hidden layers with 100 neurons each, giving a final architecture of [1, 100, 100, 3]. The activation function for all layers are selected to be tanh(x), and we employ a VAMP-2 loss function. The SRV network is constructed B. Ring potential and trained within Keras [32]. The results for kTICA and SRV are shown in Fig. 4. We find that both methods are in excellent agreement with the theoretical eigenfunctions. The small deviation between the estimated timescales for both methods and We now consider the more complicated example of a the theoretical timescales is a result of noise in the sim- 2D modified ring potential V (r, θ). This potential con- ulated data. This result demonstrates the capacity of tains a narrow ring potential valley of 0 kBT and four SRVs to recover the eigenfunctions of a simple 1D system barriers of heights 1.0 kBT , 1.3 kBT , 0.5 kBT , and 8.0 in quantitative agreement with the theoretical results and kBT . The expression of the potential is given by Eq. 47, excellent correspondence with kTICA. and it is plotted in Fig. 5 10

is defined by the 2D (x,y)-coordinate pair representing the particle location (x(t), y(t)) ∈ 2. Again small devi- 2.5 + 9(r − 0.8)2, if |r − 0.8| > 0.05 R  ations between the estimated and theoretical timescales  π 0.5, if |r − 0.8| < 0.05 and θ − < 0.25  2 should be expected due to noise and finite length of the V (r, θ) 1.3, if |r − 0.8| < 0.05 and |θ − π| < 0.25 simulated data. = 3π The kTICA results employing the optimal bandwidth kBT 1.0, if |r − 0.8| < 0.05 and θ − 2 < 0.25  8.0, if |r| > 0.4 and |θ| < 0.05 σ of the Gaussian kernel are shown in Fig. 6. Although  it gives a reasonable approximation of the eigenfunctions 0, otherwise within the ring where data are densely populated, the (47) agreement outside of the ring is much worse. This is due We use the same procedure outlined in the previous ex- to the intrinsic limitation of a Gaussian kernel function: ample to generate the theoretical eigenfunctions of trans- a small σ leads to an accurate representation near land- fer operator and simulate a trajectory using a Markov marks but poor predictions for regions far away, while a state model. The region of interest, [−1, 1] × [−1, 1], is large σ produces better predictions far away at the ex- discretized into 50-by-50 bins. In this model, we use a lag pense of local accuracy. Moreover, the kTICA results time of τ = 100 for both theoretical analysis and model depend sensitively on both the number of landmarks and learning. The transition probability pij of moving from the bandwidth of the Gaussian kernel. In Fig. 7 we report bin i to bin j is given by Eq. 48, where Ci is the nor- the test loss given by Eq. 38 on a dynamical trajectory malization factor for bin i such that the total transition of length 1,000,000 for kTICA models with different ker- probability associated with bin i sums to unity. nel bandwidths σ and numbers of landmarks selected by K-means clustering. The approximations of the leading 8 non-trivial eigenfunctions are reported in Fig. 8. We note that only when we use a large number of landmarks can 0.5k T 7 B we achieve reasonable results, which leads to expensive

6 computations in both landmark selection and calculation of the Gram matrix. Moreover, even with a large number

5 of landmarks, the range of σ values for which satisfactory

) results are obtained is still quite small and requires sub- T B

y 4 stantial tuning. k

1.3kBT 8.0kBT (

V In contrast, the SRV with architecture of [2, 100, 100, 3 3] shows excellent agreement with the theoretical eigen- functions without any tuning of network architecture, ac- 2 tivation functions, or loss function (Fig. 6). The SRV eigenvalues closely match those extracted by kTICA, and 1 the SRV eigenfunctions show superior agreement with 1.0kBT the theoretical results compared to kTICA. This result 0 x demonstrates the capacity of SRVs in leveraging the flexi- bility and robustness of neural networks in approximating FIG. 5: Contour plot of 2D ring potential, which arbitrary continuous function over compact sets. consists a ring-shape potential valley, with four potential barriers of heights 1.0 kBT , 1.3 kBT , 0.5 kBT , and 8.0 kBT . C. Alanine dipeptide

Having demonstrated SRVs on toy models, we now consider their application to alanine dipeptide in water as ( a simple but realistic application to molecular data. Ala- −(Vj −Vi)/(kB T ) Cie , if i,j are neighbors or i=j 0 pij = nine dipeptide (N-acetyl-L-alanine-N -methylamide) is a 0, otherwise. simple 22-atom peptide that stands as a standard test (48) system for new biomolecular simulation methods. The Once again we compute the first three non-stationary molecular structure of alanine dipeptide annotated with theoretical eigenfunctions of T (τ = 100) from the tran- the four backbone dihedral angles that largely dictate sition matrix and illustrate these in Fig. 6. We then its configurational state is presented in Fig. 9. A 200 numerically simulate a 5,000,000-step trajectory over the ns simulation of alanine dipeptide in TIP3P water and 2D landscape under the action of the transition matrix, modeled using the Amber99sb-ILDN forcefield was con- and pass these data to a 1000-landmark kTICA model ducted at T = 300 K and P = 1 bar using the OpenMM employing a Gaussian kernel and an SRV with the same 7.3 simulation suite [33, 34]. Lennard-Jones interactions hidden layer architecture and loss function as the pre- were switched smoothly to zero at a 1.4 nm cutoff, and vious example. The state of the system at any time t electrostatics treated using particle-mesh Ewald with a 11

2.0 2.0 1.6

1.6 1.6 1.2 )

1.2 ) 0.8 ) 1.2 4 0 7 2

0.8 1 0.4 1 9

0.8 3 8 7

0.4 3 0.0 1 1

y 0.4

Theoretical = =

= 0.0 0.4 2 3 1 0.0 t t t ( (

( 0.4 0.8 2 3 0.4 1 0.8 1.2 0.8 1.2 1.6 1.2 1.6 2.0

2.0 2.0 1.6

1.6 1.6 1.2 )

1.2 ) 0.8 ) 1.2 2 9 9 2

0.8 2 0.4 9 5

0.8 5 8 8

0.4 3 0.0 1 1

y 0.4

kTICA = =

= 0.0 0.4 2 3 1 0.0 t t t ( (

( 0.4 0.8 2 3 0.4 1 0.8 1.2 0.8 1.2 1.6 1.2 1.6 2.0

2.0 2.0 1.6

1.6 1.6 1.2 )

1.2 ) 0.8 ) 1.2 1 5 7 1

0.8 9 0.4 8 1

0.8 4 8 8

0.4 3 0.0 1 1

y 0.4

SRV = =

= 0.0 0.4 2 3 1 0.0 t t t ( (

( 0.4 0.8 2 3 0.4 1 0.8 1.2 0.8 1.2 1.6 1.2 1.6 2.0

1.0 1.0 1.0

0.8 0.8 0.8

0.6 1 0.6 2 0.6 3

|kTICA - Theoretical| y

0.4 0.4 0.4

0.2 0.2 0.2

0.0 0.0 0.0

1.0 1.0 1.0

0.8 0.8 0.8

0.6 1 0.6 2 0.6 3

|SRV - Theoretical| y

0.4 0.4 0.4

0.2 0.2 0.2

0.0 0.0 0.0 x x x

FIG. 6: Theoretical eigenfunctions (row 1) and eigenfunctions learned by kernel TICA (row 2) and SRV (row 3) of the 2D ring potential. Each column denotes one eigenfunction with values represented in colors. Timescales are shown in corresponding colorbar labels. Contours of the ring potential are shown in gray to provide a positional reference. Absolute differences with respect to the theoretical eigenfunctions of the kernel TICA (row 4) and SRV (row 5) results.

real space cutoff of 1.4 nm and a reciprocal space grid ps to produce generate a trajectory comprising 1,000,000 spacing of 0.12 nm. Configurations were saved every 2 configurations. The instantaneous state of the peptide is 12

2.800 den layer architecture and loss function as the previous -1.825 -2.364 -2.712 -2.777 -2.816 -2.822 examples. The SRV eigenfunctions illustrated in Fig. 10

0.03 are in excellent agreement with those learned by kTICA, -2.571 -2.657 -2.794 -2.822 -2.826 -2.826 2.805 but importantly the implied timescale of the SRV lead-

0.05 ing mode is 17% slower than that extracted by kTICA. -2.734 -2.774 -2.823 -2.826 -2.827 -2.827 What is the origin of this discrepancy? 2.810

0.08 The current state-of-the-art methodology to approx- -2.769 -2.802 -2.825 -2.827 -2.827 -2.827 imate the eigenfunctions of the transfer operator is to 0.1 construct a Markov state model (MSM) over a microstate loss -2.797 -2.819 -2.824 -2.824 -2.824 -2.824 2.815 decomposition furnished by TICA [9, 10, 44–50] Apply- ing TICA to the 45 pairwise distances between the heavy 0.15 -2.806 -2.818 -2.819 -2.819 -2.819 -2.819 atoms, we construct a MSM using PyEMMA [51], and

0.2 2.820 present the results in Fig. 10. We see that the MSM

-2.803 -2.806 -2.806 -2.806 -2.806 -2.806 eigenfunctions are in excellent accord with those learned

0.3 by SRVs and kTICA, however while SRVs nearly exactly

-2.776 -2.777 -2.777 -2.777 -2.777 -2.777 2.825 match the implied timescales of the MSM and results re-

0.5 ported in Ref. [45], kTICA substantially underestimates 50 100 300 500 800 1000 the timescale of the slowest mode. number of landmarks The underlying reason for this failure is the that spa- tial resolution in feature space is limited by number of FIG. 7: Test loss of kTICA models with different landmarks, and if the Euclidean distance in feature space Gaussian kernel bandwidths σ and number of does not closely correspond to kinetic distances then it re- landmarks learned by K-means clustering applied to the quires much finer spatial resolution to resolve the correct ring potential. Good test losses (i.e., minimization of slow modes. For alanine dipeptide, pairwise distances the loss function) are only obtained for a large number between heavy atoms are not well correlated with the of landmarks and small σ. slowest dynamical modes – here, rotations around back- bone dihedrals – and therefore landmark kTICA models built on these features have difficulty resolving the slow- est mode. This issue may be alleviated by employing represented by the Cartesian coordinates of the 22 atoms more landmarks, but it quickly becomes computationally x(t) ∈ R66, where the influence of the solvent molecules is intractable on commodity hardware to use significantly treated implicitly through their influence on the peptide more than approximately 5000 landmarks. Another op- configuration. In this case the theoretical eigenfunctions tion is to use features that are better correlated with the of the transfer operator are unavailable, and we instead slow modes. For alanine dipeptide, it is known that the compare the SRV results against those of kTICA. backbone dihedrals are good features, and if we perform The 45 pairwise distances between the 10 heavy atoms kTICA using these input features we do achieve much were used as features with which to perform kTICA em- better estimates of the implied timescale of the leading ploying employing a Gaussian kernel, 5000 landmarks ob- mode (t1 = 1602 ps). In general, however, a good fea- tained from K-means clustering, and a lag time of τ = ture set is not known a priori, and for poor choices it is 20 ps. The intramolecular pairwise distances or contact typically not possible to obtain good performance even matrix are favored within biomolecular simulations as an for large numbers of landmarks. internal coordinate frame representation that is invariant Our numerical investigations also show the implied to translational and rotational motions [35]. The leading timescales extracted by SRVs to be very robust to the ˜ three eigenfunctions ψi (i = 1, 2, 3) discovered by kTICA particular choice of lag time. The reliable inference of employing a manually tuned kernel bandwidth are shown implied timescales from MSMs requires that they be con- in Fig. 10 superposed upon the Ramachandran plot in verged with respect to lag time, and slow convergence the backbone φ and ψ torsional angles that are known presents an impediment to the realization of high-time to be good discriminators of the molecular metastable resolution MSMs. The yellow bars in Fig. 11 present the states [36–42]. The timescales of the 4th and higher or- implied time scales of the leading three eigenfunctions der modes lie below the τ = 20 ps lag time so cannot computed from the mean over five SRVs constructed with be resolved by this model. Accordingly, we select three lag times of τ = 10, 40, and 100 ps. It is clear that the leading slow modes for analysis. From Fig. 10 it is appar- implied timescales are robust to the choice of lag time ent that the first slow mode captures transitions along φ over a relatively large range. torsion, the second characterizes the transitions between Given the implied timescales evaluated at four differ- α and (β, P//) basins, and the third motions between the ent lag times – τ = 10, 20, 40, and 100 ps – this pro- αL and γ basins. vides an opportunity to assess the dynamical robustness The trajectory is then analyzed at the same lag time of SRVs by subjecting them to a variant of the Chapman- using a [45, 100, 100, 3] SRV employing the same hid- Kolmogorov test employed in the construction and vali- 13

1| 1 = 0.525 1| 1 = 0.922 1| 1 = 0.987 1| 1 = 0.994 1| 1 = 0.998 1| 1 = 0.998

2.0 0.03

1| 1 = 0.972 1| 1 = 0.981 1| 1 = 0.995 1| 1 = 0.998 1| 1 = 0.999 1| 1 = 0.999 1.6 0.05

1| 1 = 0.988 1| 1 = 0.992 1| 1 = 0.998 1| 1 = 0.999 1| 1 = 0.999 1| 1 = 0.999 1.2 0.08

1| 1 = 0.990 1| 1 = 0.995 1| 1 = 0.999 1| 1 = 0.999 1| 1 = 0.999 1| 1 = 0.999 0.8 0.1

1| 1 = 0.990 1| 1 = 0.997 1| 1 = 0.998 1| 1 = 0.998 1| 1 = 0.998 1| 1 = 0.998 0.4 1 0.15

1| 1 = 0.987 1| 1 = 0.996 1| 1 = 0.996 1| 1 = 0.996 1| 1 = 0.996 1| 1 = 0.996 0.0 0.2 -0.4 1| 1 = 0.977 1| 1 = 0.983 1| 1 = 0.983 1| 1 = 0.983 1| 1 = 0.983 1| 1 = 0.983 0.3 -0.8

1| 1 = 0.917 1| 1 = 0.919 1| 1 = 0.919 1| 1 = 0.919 1| 1 = 0.919 1| 1 = 0.919

0.5 -1.2 50 100 300 500 800 1000

number of landmarks

FIG. 8: Leading non-trivial eigenfunctions of the 2D ring potential learned by kTICA employing different Gaussian ˜ kernel bandwidths σ and number of landmarks defined by K-means clustering. The projection coefficient hψ1|ψ1iπ of the leading learned eigenfunctions on the corresponding leading theoretical eigenfunction are reported directly above each plot as a measure of quality of the kTICA approximation. Contours of the ring potential are shown in gray to provide a positional reference.

dation of MSMs [23]. As discussed in Section. II A, the we have

VAC framework is founded on the Markovian assump- k−1 tion that the transfer operator is homogeneous in time. Y Tt(kτ) = Tt+iτ (τ) (49) In previous two toy examples, this assumption holds by i=0 construction due to the way the trajectory data were gen- erated. However, for a real system like alanine dipeptide, If the Markovian assumption holds, then the transfer op- there is no such guarantee. Here we test a necessary con- erators are independent of t, such that dition for the Markovian assumption to hold. In general T (kτ) = T (τ)k. (50) 14

D. WW domain

Our final example considers a 1137 µs simulation of the folding dynamics of the 35-residue WW domain protein performed in Ref. [53]. We use 595 pairwise distances of all Cα atoms to train a TICA-based MSM and a [595, 100, 100, 2] SRV with the same hidden layer architecture and loss function as all previous examples. We use lag time of τ = 400 ns (2000 steps) for both models, and focus on the two leading slowest modes. The implied timescales of higher-order modes lie close to or below the lag time and so cannot be reliably resolved by this FIG. 9: Molecular rendering of alanine dipeptide model. The slow modes discovered by the TICA-based annotated with the four backbone dihedral angles. MSM and the SRV are shown in Fig. 12a projected onto Image constructed using VMD [43]. the two leading TICA eigenfunctions (tIC1, tIC2) to fur- nish a consistent basis for comparison. The MSM and SRV eigenfunctions are in excellent agreement, exhibit- ing Pearson correlation coefficients of ρ = 0.99, 0.98 for The corresponding eigenvalue and eigenfunctions are the top two modes respectively. The implied timescales ˜ ˜ ˜ inferred by the two methods are also in good agreement. T (τ) ◦ ψi,τ (x) = λi,τ ψi,τ (x) k ˜ ˜k ˜ ⇒ T (τ) ◦ ψi,τ (x) = λi,τ ψi,τ (x), (51) An important aspect of model quality is the conver- and gence rate of the implied timescales with respect to the lag time τ. The lag time must be selected to be suffi- ˜ ˜ ˜ T (kτ) ◦ ψi,kτ (x) = λi,kτ ψi,kτ (x), (52) ciently large such that the state decomposition is Marko- vian whereby dynamical mixing with states is faster than ˜ ˜ where {λi,τ } and {ψi,τ (x)} are the estimated eigenvalues the lag time and interconversion between states is slower, ˜ ˜ and eigenfunctions of T (τ), and {λi,kτ } and {ψi,kτ (x)} but it is desirous that the lag time be as short as pos- those of T (kτ). Appealing to Eq. 50, it follows that sible to produce a model with high time resolution [44]. Better approximations for the leading eigenfunctions of ˜ ˜ ψi,kτ (x) = ψi,τ (x), the transfer operator typically lead to convergence of the ˜ ˜k implied timescales at shorter lag times. We construct λi,kτ = λi,τ , (53) 10 independent SRV models and 10 independent TICA- providing a means to compare the consistency of SRVs based MSMs over the WW domain trajectory data and constructed at different choices of τ. In particular, the report in Fig. 12b the mean and 95% confidence inter- implied timescales for the eigenfunctions of T (kτ) esti- vals of the implied timescale convergence with lag time. mated from an SRV constructed at a lag time kτ The SRV exhibits substantially faster convergence than the TICA-based MSM, particularly in the second eigen- kτ t˜ = − . (54) function. This suggests that the eigenfunctions identi- i,T (kτ),kτ ˜ log λi,kτ fied by the SRV, although strongly linearly correlated should be well approximated by those estimated from an with those identified by the TICA-based MSM, provide SRV constructed at a lag time τ a better approximation to the leading slow modes of the transfer operator and produce a superior state decompo- kτ sition. We attribute this observation to the ability of the t˜i, (τ),kτ = − . (55) T log λ˜k SRV to capture complex nonlinear relationships in the i,τ data within a continuous state representation, whereas If this is not the case, then the assumption of Markovian- the MSM is dependent upon the TICA coordinates which ity is invalidated for this choice of lag time. are founded on a linear variational approximation to the We present in Fig. 11 the predicted implied timescales transfer operator eigenfunctions that subsequently in- over the range of lag times τ = 2–120 ps calculated from form the construction of an inherently discretized mi- an SRV constructed at a lag time of τ = 20 ps. These pre- crostate transition matrix from which we compute the dictions are in excellent accord to the implied timescales MSM eigenvectors. This result demonstrates the viabil- directly computed from SRVs constructed at lag times of ity of SRVs to furnish a high-resolution model of the slow τ = 10, 40, and 100 ps, demonstrating satisfaction of the system dynamics without the need to perform any sys- Chapman-Kolmogorov test and a demonstration of the tem discretization and at a higher time resolution (i.e., Markovian property of the system at lag times τ & 10 ps lower lag time) than is possible with the current state- [10, 23, 44, 49, 52]. of-the-art TICA-based MSM protocol. 15

FIG. 10: Eigenfunctions of alanine dipeptide learned by kTICA (row 1), SRV (row 2), and MSM constructed over TICA (row 3). Eigenfunctions are superposed as heatmaps over the Ramachandran plot in the φ and ψ backbone dihedrals. The metastable basins within the plot are labeled according to their conventional terminology. Timescales of learned eigenfunctions are shown in corresponding colorbar labels. Absolute differences with respect to the MSM eigenfunctions of the kernel TICA (row 4) and SRV (row 5) results.

IV. CONCLUSIONS covery of a hierarchy of nonlinear slow modes of a dy- namical system from trajectory data. The framework is In this work we proposed a new framework that we built on top of transfer operator theory that uses a flex- term state-free reversible VAMPnets (SRV) for the dis- 16

ational dynamics encoders (VDEs) with an exclusive au- 103 tocorrelation loss, subject to Gaussian noise [21]. VDEs however, cannot currently generalize to multiple dimen- sions due to the lack of an orthogonality constraint on 102 the learned eigenfunctions. By using the more general VAMP principle for non-reversible processes and target- ing membership state probabilities rather than learning 101 continuous functions, VAMPnets are obtained [23]. timescales/ps In regards to the analysis of molecular simulation tra- jectories, we anticipate that the flexibility and high time- resolution of SRV models will be of use in helping resolve 100 20 40 60 80 100 120 and understand the important long-time conformational lag time/ps changes governing biomolecular folding and function. Moreover, it is straightforward to replace TICA-based FIG. 11: Implied timescales of the three leading MSMs with SRV-based MSMs, maintaining the large eigenfunctions of alanine dipeptide estimated by SRVs body of theoretical and practical understanding of MSM employing a variety of lag times. The yellow bars and construction while delivering the advantages of SRVs in triangles report the three leading implied timescales for improved approximation of the slow modes and superior SRVs employing lag times of τ = 10, 40, and 100 ps. microstate decomposition. In regards to enhanced sam- The blue, red, and green traces present the pling in molecular simulation, the differentiable nature extrapolative estimation of the implied timescales at of SRV coordinates naturally enables biasing along the various lag times using a single SRV constructed with τ SRV collective variables (CVs) using well-established ac- = 20 ps. The solid black line represents the locus of celerated sampling techniques such as umbrella sampling, points are which the implied timescale is equal to the metadynamics, and adaptive biasing force. The efficiency lag time. Implied timescales that fall into the shaded of these techniques depends crucially on the choice of area decay more quickly than the lag time and therefore “good” CVs coincident with the important underlying cannot be reliably resolved at that choice of lag time. dynamical modes governing the ling-time evolution of the system. A number of recent works have employed neural networks to learn nonlinear CVs describing the directions of highest variance within the data [54–57]. However, the ible neural network to learn an optimal nonlinear basis high variance directions are not guaranteed to also corre- from the input representation of the system. Compared spond to the slow directions of the dynamics. Only vari- to kernel TICA and variational dynamics encoders, our ational dynamics encoders have been used to learn and SRV framework has many advantages. It is capable of bias sampling in a slow CV [22], but, as observed above, simultaneously learning an arbitrary number of eigen- the VDE is limited to approximate only the leading eigen- functions while guaranteeing orthogonality. It also re- function of the transfer operator. SRVs open the door to quires O(N) memory and O(N) computation time, which performing accelerated sampling within the full spectrum makes it amenable to large data sets such as those com- of all relevant eigenfunctions of the transfer operator. In monly encountered in biomlecular simulations. The neu- a similar vein, SRVs may also be profitably incorporated ral network architecture does not require the selection of into adaptive sampling approaches that do not apply ar- a kernel function or adjustment of hyperparameters that tificial biasing force, but rather smartly initialize short can strongly affect the quality of the results and be te- unbiased simulations on the edge of the explored domain dious and challenging to tune [21, 22]. In fact, we find [58–64]. The dynamically meaningful projections of the that training such a simple fully-connected feed-forward data into the SRV collective variables is anticipated to neural network is simple, cheap, and insensitive to batch better resolve the dynamical frontier than dimensional- size, learning rate, and architecture. Finally, the SRV is ity reduction approaches based on maximal preservation a parametric model, which provides an explicit and dif- of variance, and therefore better direct sampling along ferentiable mapping from configuration x to the learned the slow conformational pathways. In sum, we expect approximations of the leading eigenvectors of the transfer SRVs to have a variety of applications not just in the ˜ context of molecular simulations, but also more broadly operator {ψi}. These slow collective variables are then ideally suited to be utilized in collective variable-based within the analysis of dynamical systems. enhanced sampling methods where the differentiability of the SRV collective variables enable their seamless incor- poration with powerful biased sampling techniques such ACKNOWLEDGMENTS as metadynamics [38]. The SRV framework possesses a close connection This material is based upon work supported by the with a number of existing methodologies. In the one- National Science Foundation under Grant No. CHE- dimensional limit, SRVs are formally equivalent to vari- 1841805. H.S. acknowledges support from the Molec- 17

FIG. 12: Application of a TICA-based MSM and SRV to WW domain. (a) Eigenfunctions of WW domain learned by TICA-based MSM (row 1) and SRV (row 2) projected onto the two leading TICA eigenfunctions (tIC1, tIC2). Implied timescales of the learned eigenfunctions are noted adjacent to each plot. (b) Convergence of the implied timescales as a function of lag time for the two leading eigenfunctions of the system as computed by TICA-based MSM (triangles) and SRV (circles). Colored lines indicate the mean implied timescale averages over 10 independently trained models runs and colored shading the 95% confidence intervals. The teal horizontal dashed lines indicate the converged implied timescales achieved by both the MSM and SRV at large τ.

ular Software Sciences Institute (MolSSI) Software Fel- lows program (NSF grant ACI-1547580) [65, 66]. We are grateful to D.E. Shaw Research for sharing the WW do- main simulation trajectories.

SOFTWARE AVAILABILITY

Software to perform slow modes discovery along with an API compatible with scikit-learn and Keras, Jupyter notebooks to reproduce the results presented in this paper, documentation, and examples have been freely available under MIT license at https://github.com/ hsidky/srv. 18

[1] F. No´eand F. Nuske, Multiscale Modeling & Simulation 11, 635 (2013). [2] F. N¨uske, B. G. Keller, G. P´erez-Hern´andez,A. S. J. S. Mey, and F. No´e,Journal of Chemical Theory and Computation 10, 1739 (2014). [3] F. No´eand C. Clementi, Current Opinion in Structural Biology 43, 141 (2017). [4] S. Klus, F. N¨uske, P. Koltai, H. Wu, I. Kevrekidis, C. Sch¨utte, and F. No´e,Journal of Nonlinear Science 28, 985 (2018). [5] G. P´erez-Hern´andez,F. Paul, T. Giorgino, G. De Fabritiis, and F. No´e,The Journal of Chemical Physics 139, 07B604 1 (2013). [6] F. No´eand C. Clementi, Journal of Chemical Theory and Computation 11, 5002 (2015). [7] F. No´e,R. Banisch, and C. Clementi, Journal of Chemical Theory and Computation 12, 5620 (2016). [8] G. P´erez-Hern´andezand F. No´e,Journal of Chemical Theory and Computation 12, 6118 (2016). [9] C. R. Schwantes and V. S. Pande, Journal of Chemical Theory and Computation 9, 2000 (2013). [10] B. E. Husic and V. S. Pande, Journal of the American Chemical Society 140, 2386 (2018). [11] C. R. Schwantes and V. S. Pande, Journal of Chemical Theory and Computation 11, 600 (2015). [12] M. P. Harrigan and V. S. Pande, bioRxiv preprint bioRxiv:10.1101/123752 (2017). [13] M. M. Sultan and V. S. Pande, Journal of Chemical Theory and Computation (2017). [14] M. H. Hassoun, Fundamentals of Artificial Neural Networks (MIT press, 1995). [15] T. Chen and H. Chen, IEEE Transactions on Neural Networks 6, 911 (1995). [16] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, arXiv preprint arXiv:1211.5590 (2012). [17] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, Journal of Marchine Learning Research 18, 1 (2018). [18] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, (2017). [19] J. Schmidhuber, Neural Networks 61, 85 (2015). [20] C. Wehmeyer and F. No´e,The Journal of Chemical Physics 148, 241703 (2018). [21] C. X. Hern´andez,H. K. Wayment-Steele, M. M. Sultan, B. E. Husic, and V. S. Pande, Physical Review E 97, 062412 (2018). [22] M. M. Sultan, H. K. Wayment-Steele, and V. S. Pande, Journal of Chemical Theory and Computation 14, 1887 (2018). [23] A. Mardt, L. Pasquali, H. Wu, and F. No´e,Nature Communications 9, 5 (2018). [24] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, in Proceedings of the 30th International Conference on , Atlanta, Georgia, USA, 2013., Vol. JMLR: W&CP Volume 28 (2013) pp. 1247–1255. [25] C. Sch¨utte,W. Huisinga, and P. Deuflhard, in Ergodic Theory, Analysis, and Efficient Simulation of Dynamical Systems (Springer, 2001) pp. 191–223. [26] J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Sch¨utte, and F. No´e,The Journal of Chemical Physics 134, 174105 (2011). [27] H. Wu and F. No´e,arXiv preprint arXiv:1707.04659 (2017). [28] A. Szabo and N. S. Ostlund, Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory (Courier Corporation, 2012). [29] D. S. Watkins, The Matrix Eigenvalue Problem: GR and Krylov Subspace Methods, Vol. 101 (Siam, 2007). [30] D. P. Kingma and J. Ba, arXiv preprint arXiv:1412.6980 (2014). [31] K. Pearson, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559 (1901). [32] F. Chollet, “Keras,” https://keras.io. [33] P. Eastman, M. S. Friedrichs, J. D. Chodera, R. J. Radmer, C. M. Bruns, J. P. Ku, K. A. Beauchamp, T. J. Lane, L.-P. Wang, and D. Shukla, Journal of Chemical Theory and Computation 9, 461 (2012). [34] P. Eastman, J. Swails, J. D. Chodera, R. T. McGibbon, Y. Zhao, K. A. Beauchamp, L.-P. Wang, A. C. Simmonett, M. P. Harrigan, and C. D. Stern, PLOS 13, e1005659 (2017). [35] F. Sittel, A. Jain, and G. Stock, The Journal of Chemical Physics 141, 07B605 1 (2014). [36] A. L. Ferguson, A. Z. Panagiotopoulos, P. G. Debenedetti, and I. G. Kevrekidis, The Journal of Chemical Physics 134, 04B606 (2011). [37] W. Ren, E. Vanden-Eijnden, P. Maragakis, and W. E, The Journal of Chemical Physics 123, 134109 (2005). [38] A. Laio and M. Parrinello, Proceedings of the National Academy of Sciences 99, 12562 (2002). [39] A. Barducci, G. Bussi, and M. Parrinello, Physical Review Letters 100, 020603 (2008). [40] P. G. Bolhuis, C. Dellago, and D. Chandler, Proceedings of the National Academy of Sciences 97, 5877 (2000). [41] O. Valsson and M. Parrinello, Physical Review Letters 113, 090601 (2014). [42] H. Stamati, C. Clementi, and L. E. Kavraki, Proteins: Structure, Function, and Bioinformatics 78, 223 (2010). [43] W. Humphrey, A. Dalke, and K. Schulten, Journal of Molecular Graphics 14, 33 (1996). [44] V. S. Pande, K. Beauchamp, and G. R. Bowman, Methods 52, 99 (2010). [45] B. Trendelkamp-Schroer, H. Wu, F. Paul, and F. No´e,The Journal of Chemical Physics 143, 11B601 1 (2015). [46] M. M. Sultan and V. S. Pande, The Journal of Physical Chemistry B 122, 5291 (2017). [47] S. Mittal and D. Shukla, Molecular Simulation 44, 891 (2018). [48] M. P. Harrigan, M. M. Sultan, C. X. Hern´andez, B. E. Husic, P. Eastman, C. R. Schwantes, K. A. Beauchamp, R. T. McGibbon, and V. S. Pande, Biophysical Journal 112, 10 (2017). 19

[49] C. Wehmeyer, M. K. Scherer, T. Hempel, B. E. Husic, S. Olsson, and F. No´e,Living Journal of Computational Molecular Science 1, 5965 (2018). [50] M. K. Scherer, B. E. Husic, M. Hoffmann, F. Paul, H. Wu, and F. No´e,arXiv preprint arXiv:1811.11714 (2018). [51] M. K. Scherer, B. Trendelkamp-Schroer, F. Paul, G. P´erez-Hern´andez,M. Hoffmann, N. Plattner, C. Wehmeyer, J.-H. Prinz, and F. No´e,Journal of Chemical Theory and Computation 11, 5525 (2015). [52] J. D. Chodera, W. C. Swope, J. W. Pitera, and K. A. Dill, Multiscale Modeling & Simulation 5, 1214 (2006). [53] K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, Science 334, 517 (2011). [54] W. Chen and A. L. Ferguson, Journal of 39, 2079 (2018). [55] W. Chen, A. R. Tan, and A. L. Ferguson, The Journal of Chemical Physics 149, 072312 (2018). [56] J. M. L. Ribeiro, P. Bravo, Y. Wang, and P. Tiwary, The Journal of Chemical Physics 149, 072301 (2018). [57] J. M. L. Ribeiro and P. Tiwary, Journal of Chemical Theory and Computation 15, 708 (2018). [58] Z. Shamsi, K. J. Cheng, and D. Shukla, The Journal of Physical Chemistry B 122, 8386 (2018). [59] J. Preto and C. Clementi, Physical Chemistry Chemical Physics 16, 19181 (2014). [60] J. K. Weber and V. S. Pande, Journal of Chemical Theory and Computation 7, 3405 (2011). [61] M. I. Zimmerman and G. R. Bowman, Journal of Chemical Theory and Computation 11, 5747 (2015). [62] G. R. Bowman, D. L. Ensign, and V. S. Pande, Journal of Chemical Theory and Computation 6, 787 (2010). [63] N. S. Hinrichs and V. S. Pande, The Journal of Chemical Physics 126, 244101 (2007). [64] S. Doerr and G. De Fabritiis, Journal of Chemical Theory and Computation 10, 2064 (2014). [65] A. Krylov, T. L. Windus, T. Barnes, E. Marin-Rimoldi, J. A. Nash, B. Pritchard, D. G. Smith, D. Altarawy, P. Saxe, C. Clementi, T. D. Crawford, R. J. Harrison, S. Jha, V. S. Pande, and T. Head-Gordon, The Journal of Chemical Physics 149, 180901 (2018). [66] N. Wilkins-Diehr and T. D. Crawford, Computing in Science & Engineering 20, 26 (2018).