Data-driven model reduction and transfer operator approximation
Stefan Klus1, Feliks Nüske1, Péter Koltai1, Hao Wu1, Ioannis Kevrekidis2,3, Christof Schütte1,3, and Frank Noé1
1Department of Mathematics and Computer Science, Freie Universität Berlin, Germany 2Department of Chemical and Biological Engineering & Program in Applied and Computational Mathematics, Princeton University, USA 3Zuse Institute Berlin, Germany
Abstract. In this review paper, we will present different data-driven dimension reduction techniques for dynamical systems that are based on transfer operator theory, as well as methods to approximate transfer operators and their eigenvalues, eigenfunctions, and eigenmodes. The goal is to point out similarities and differences between methods developed independently by the dynamical systems, fluid dynamics, and molecular dynamics communities such as time-lagged independent component analysis (TICA), dynamic mode decomposition (DMD), and their respective generalizations. As a result, extensions and best practices developed for one particular method can be carried over to other related methods.
1 Introduction
The numerical solution of complex systems of differential equations plays an important role in many areas such as molecular dynamics, fluid dynamics, mechanical as well as electrical engineering, and physics. These systems often exhibit multi-scale behavior, which can be due to the coupling of subsystems with different time scales – for instance, fast electrical and slow mechanical components – or due to the intrinsic properties of the system itself – for instance, the fast vibrations and slow conformational changes of molecules. Analyzing such problems using transfer operator based methods is often infeasible or prohibitively expensive from a computational point of view due to the so-called curse of dimensionality. One possibility to avoid this is to project the dynamics of the high-dimensional system onto a lower-dimensional space and to then analyze the reduced system representing, for instance, only the relevant slow dynamics, see, e.g., [1, 2].

In this paper, we will introduce different methods such as time-lagged independent component analysis (TICA) [3, 1, 4] and dynamic mode decomposition (DMD) [5, 6, 7, 8] to identify the dominant dynamics using only simulation data or experimental data. It was shown that
these methods are related to Koopman operator approximation techniques [9, 10, 11, 12]. Extensions of the aforementioned methods, called the variational approach of conformation dynamics (VAC) [13, 14, 15], developed mainly for reversible molecular dynamics problems, and extended dynamic mode decomposition (EDMD) [16, 17, 18], can be used to compute eigenfunctions, eigenvalues, and eigenmodes of the Koopman operator (and its adjoint, the Perron–Frobenius operator). Interestingly, although the underlying ideas, derivations, and intended applications of these methods differ, the resulting algorithms share a lot of similarities.

The goal of this paper is to show the equivalence of different data-driven methods which have been widely used by the dynamical systems, fluid dynamics, and molecular dynamics communities, but under different names. Hence, extensions, generalizations, and algorithms developed for one method can be carried over to its counterparts, resulting in a unified theory and set of tools. An alternative approach to data-driven model reduction – also related to transfer operators and their generators – would be to use diffusion maps [19, 20, 21, 22]. Manifold learning methods, however, are beyond the scope of this paper.

The outline of the paper is as follows: Section 2 briefly introduces transfer operators and the concept of reversibility. In Section 3, different data-driven methods for the approximation of the eigenvalues, eigenfunctions, and eigenmodes of transfer operators will be described. The theoretical background and the derivation of these methods will be outlined in Section 4. Section 5 addresses open problems and lists possible future work.
2 Transfer operators and reversibility
In the literature, the term transfer operator is sometimes used in different contexts. In this section, we will briefly introduce the Perron–Frobenius operator, the Perron–Frobenius operator with respect to the equilibrium density, and the Koopman operator. All three of these operators are, according to our definition, transfer operators.
2.1 Guiding example

Our paper will deal with data-driven methods to analyze both stochastic and deterministic dynamical systems. To illustrate the concepts of transfer operators and their spectral components, we first introduce a simple stochastic dynamical system that will be revisited throughout the paper.
Example 2.1. Consider the following one-dimensional Ornstein–Uhlenbeck process, given by an Itô stochastic differential equation¹ of the form

    dX_t = −αD X_t dt + √(2D) dW_t.
Here, {W_t}_{t≥0} is a one-dimensional standard Wiener process (Brownian motion), the parameter α is the friction coefficient, and D = β^{−1} is the diffusion coefficient. The stochastic forcing usually models physical effects, most often thermal fluctuations, and it is customary to call β the inverse temperature.
¹ A general time-homogeneous Itô stochastic differential equation is given by dX_t = −α(X_t) X_t dt + σ(X_t) dW_t, where α: ℝ^d → ℝ^d and σ: ℝ^d → ℝ^{d×d} are coefficient functions, and {W_t}_{t≥0} is a d-dimensional standard Wiener process.
The transition density of the Ornstein–Uhlenbeck process, i.e., the conditional probability density to find the process near y a time τ after it had been at x, is given by
    p_τ(x, y) = 1/√(2π σ²(τ)) exp(−(y − x e^{−αDτ})² / (2 σ²(τ))),    (1)

where σ²(τ) = α^{−1} (1 − e^{−2αDτ}). Figure 1a shows the transition densities for different values of τ. More details can be found in [23]. For complex dynamical systems, the transition density is not known explicitly, but must be estimated from simulation or measurement data.
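As a concrete reference, equation (1) can be implemented in a few lines. The default parameter values α = 4 and D = 0.25 are taken from Figure 1; the function name and interface are our own choices, not part of the paper:

```python
import numpy as np

def transition_density(x, y, tau, alpha=4.0, D=0.25):
    """Transition density p_tau(x, y) of the Ornstein-Uhlenbeck process,
    equation (1): a Gaussian in y with mean x*exp(-alpha*D*tau) and
    variance sigma^2(tau) = (1 - exp(-2*alpha*D*tau)) / alpha."""
    sigma2 = (1.0 - np.exp(-2.0 * alpha * D * tau)) / alpha
    mean = x * np.exp(-alpha * D * tau)
    return np.exp(-(y - mean) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
```

For each fixed x, the density integrates to one over y, and for small τ it is sharply peaked near x, in line with Figure 1a.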
Figure 1: a) Transition density function of the Ornstein–Uhlenbeck process for different values of τ. If τ is small, values starting in x will stay close to it. For larger values of τ, the influence of the starting point x is negligible. Densities converge to the equilibrium density, denoted by π. Here, α = 4 and D = 0.25. b) Evolution of the probability to find the dynamical system at any point x over time t, after starting with a peaked distribution at t = 0. We show the resulting distributions at times t = 0.1, t = 1, and t = 10. The system relaxes towards the stationary density π(x).
In this work we will describe the dynamics of a system in terms of dynamical operators such as the propagator Pτ , which is defined by the transition density pτ (x, y) and propagates a probability density of Brownian walkers in time by
    p_{t+τ}(x) = P_τ p_t(x).
See Figure 1b for the time evolution of the Ornstein–Uhlenbeck process initiated from a localized starting condition. It can be seen that the distribution spreads out and converges towards a Gaussian distribution, which is then invariant in time. For this simple dynamical system we can give the equation for the invariant density explicitly:
    π(x) = 1/√(2π α^{−1}) exp(−x² / (2 α^{−1})),    (2)

which is a Gaussian whose variance is decreasing with increasing friction and decreasing temperature. △
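The convergence visible in Figure 1 can be checked numerically: for a large lag time, the transition density (1) coincides with the invariant density (2) regardless of the starting point. A minimal sketch, using the parameter values of Figure 1 (the grid and lag are our choices):

```python
import numpy as np

alpha, D = 4.0, 0.25

def p_tau(x, y, tau):
    # transition density of equation (1)
    sigma2 = (1.0 - np.exp(-2.0 * alpha * D * tau)) / alpha
    return np.exp(-(y - x * np.exp(-alpha * D * tau))**2 / (2.0 * sigma2)) \
        / np.sqrt(2.0 * np.pi * sigma2)

def pi_density(x):
    # invariant density of equation (2), a Gaussian with variance 1/alpha
    return np.exp(-x**2 / (2.0 / alpha)) / np.sqrt(2.0 * np.pi / alpha)

y = np.linspace(-4.0, 4.0, 1001)
# for a large lag time the transition density forgets its starting point x
assert np.allclose(p_tau(2.0, y, 50.0), pi_density(y), atol=1e-10)
```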
2.2 Transfer operators

Let {X_t}_{t≥0} be a time-homogeneous² stochastic process defined on the bounded state space X ⊂ ℝ^d. It can be genuinely stochastic or it might as well be deterministic, such that there is a flow map³ Φ_τ : X → X with Φ_τ(X_t) = X_{t+τ} for τ ≥ 0. Let the measure P denote the law of the process {X_t}_{t≥0} that we will study in terms of its statistical transition properties. To this end, under some mild regularity assumptions⁴ which are satisfied by Itô diffusions with smooth coefficients [24, 25], we can give the following definition.
Definition 2.2. The transition density function p_τ : X × X → [0, ∞] of a process {X_t}_{t≥0} is defined by

    P[X_{t+τ} ∈ A | X_t = x] = ∫_A p_τ(x, y) dy

for every measurable set A. Here and in what follows, P[ · | E] denotes probabilities conditioned on the event E. That is, p_τ(x, y) is the conditional probability density of X_{t+τ} = y given that X_t = x.
For 1 ≤ r ≤ ∞, L^r(X) denotes the usual space of r-Lebesgue integrable functions, which is a Banach space with the corresponding norm ‖·‖_{L^r}.
Definition 2.3. Let p_t ∈ L^1(X) be the probability density and f_t ∈ L^∞(X) an observable of the system. For a given lag time τ:
a) The Perron–Frobenius operator or propagator P_τ : L^1(X) → L^1(X) is defined by
    P_τ p_t(x) = ∫_X p_τ(y, x) p_t(y) dy.

b) The Koopman operator K_τ : L^∞(X) → L^∞(X) is defined by
    K_τ f_t(x) = ∫_X p_τ(x, y) f_t(y) dy = E[f_t(X_{t+τ}) | X_t = x].
² We call a stochastic process {X_t}_{t≥0} time-homogeneous, or autonomous, if it holds for every t ≥ s ≥ 0 that the distribution of X_t conditional on X_s = x only depends on x and (t − s). It is the stochastic analogue of the flow of an autonomous (time-independent) ordinary differential equation.
³ For a measure-theoretic discussion of this construction, please refer to [18]. For our purposes, it is sufficient to equip X with the standard Lebesgue measure. In particular, if not stated otherwise, measurability of a set A ⊂ X is meant with respect to the Borel σ-algebra.
⁴ These conditions are called interchangeably absolute continuity, µ-compatibility, or null preservingness.
Both P_τ and K_τ are linear but infinite-dimensional operators which are adjoint to each other with respect to the standard duality pairing ⟨·, ·⟩, defined by ⟨f, g⟩ = ∫_X f(x) g(x) dx. The homogeneity of the stochastic process {X_t}_{t≥0} implies the so-called semigroup property of the operators, i.e., P_{τ+σ} = P_τ P_σ and K_{τ+σ} = K_τ K_σ for τ, σ ≥ 0. In other words, these operators describe time-stationary Markovian dynamics. While the Perron–Frobenius operator describes the evolution of densities, the Koopman operator describes the evolution of observables. For the analysis of the long-term behavior of dynamical systems, densities that remain unchanged by the dynamics play an important role (one can think of the concept of ergodicity).
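The duality ⟨P_τ p, f⟩ = ⟨p, K_τ f⟩ can be verified numerically for the Ornstein–Uhlenbeck process of Example 2.1 by discretizing both integral operators on a grid. This is a quadrature sketch with our own choice of grid, test density p, and observable f:

```python
import numpy as np

alpha, D, tau = 4.0, 0.25, 0.5
x = np.linspace(-5.0, 5.0, 801)
dx = x[1] - x[0]

sigma2 = (1.0 - np.exp(-2.0 * alpha * D * tau)) / alpha
# P[i, j] = p_tau(x_i, x_j): transition density (1) evaluated on the grid
P = np.exp(-(x[None, :] - x[:, None] * np.exp(-alpha * D * tau))**2
           / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

p = np.exp(-(x - 1.0)**2)          # some probability density ...
p /= p.sum() * dx                  # ... normalized on the grid
f = np.sin(x)                      # some observable

Pp = P.T @ p * dx                  # (P_tau p)(x) = integral of p_tau(y, x) p(y) dy
Kf = P @ f * dx                    # (K_tau f)(x) = integral of p_tau(x, y) f(y) dy

# duality pairing: <P_tau p, f> = <p, K_tau f>
assert abs(np.sum(Pp * f) * dx - np.sum(p * Kf) * dx) < 1e-10
```

Note how the Perron–Frobenius operator integrates over the first argument of p_τ (pushing densities forward) while the Koopman operator integrates over the second (pulling observables back).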
Definition 2.4. A density π is called an invariant density or equilibrium density if Pτ π = π. That is, the equilibrium density π is an eigenfunction of the Perron–Frobenius operator Pτ with corresponding eigenvalue 1.
In what follows, L^r_π(X) denotes the weighted L^r-space of functions f such that ‖f‖^r_{L^r_π} := ∫_X |f(x)|^r π(x) dx < ∞. While one can consider the evolution of densities with respect to any density, we are particularly interested in the evolution with respect to the equilibrium density. From this point on, we assume there is a unique invariant density. This assumption is typically satisfied for molecular dynamics applications, where the invariant density is given by the Boltzmann distribution.
Definition 2.5. Let u_t(x) = π(x)^{−1} p_t(x) ∈ L^1_π(X) be a probability density with respect to the equilibrium density π. Then the Perron–Frobenius operator (propagator) with respect to the equilibrium density, denoted by T_τ, is defined by

    T_τ u_t(x) = ∫_X (π(y)/π(x)) p_τ(y, x) u_t(y) dy.
The operators P_τ and K_τ can be defined on other spaces L^r and L^{r′}, with r ≠ 1 and r′ ≠ ∞, see [26, 18] for more details. By defining the weighted duality pairing ⟨f, g⟩_π = ∫_X f(x) g(x) π(x) dx for f ∈ L^r_π(X) and g ∈ L^{r′}_π(X), where 1/r + 1/r′ = 1, the operator T_τ defined on L^{r′}_π(X) is the adjoint of K_τ defined on L^r_π(X) with respect to ⟨·, ·⟩_π:
    ⟨K_τ f, g⟩_π = ⟨f, T_τ g⟩_π.
For more details, see [27, 28, 29, 13, 14, 18, 30]. The two operators Pτ and Tτ are often referred to as forward operators, whereas Kτ is also called backward operator, as they are the solution operators of the forward (Fokker–Planck) and backward Kolmogorov equations [27, Section 11], respectively.
2.3 Spectral decomposition of transfer operators
In what follows, let A_τ denote one of the transfer operators defined above, i.e., P_τ, T_τ, or K_τ. We are particularly interested in computing eigenvalues λ_ℓ(τ) ∈ ℂ and eigenfunctions ϕ_ℓ : X → ℂ of transfer operators, i.e.:

    A_τ ϕ_ℓ = λ_ℓ(τ) ϕ_ℓ.
Figure 2: Dominant eigenfunctions of the Ornstein–Uhlenbeck process computed analytically (dotted lines) and using VAC/EDMD (solid lines). a) Eigenfunctions of the Perron–Frobenius operator P_τ. b) Eigenfunctions of the Koopman operator K_τ.
Note that the eigenvalues depend on the lag time τ. For the sake of simplicity, we will often omit this dependency. The eigenvalues and eigenfunctions of transfer operators contain important information about the global properties of the system such as metastable sets or fast and slow processes and can also be used as reduced coordinates, see [31, 28, 10, 32, 29, 2] and references therein.
Example 2.6. The eigenvalues λ_ℓ and eigenfunctions ϕ_ℓ of K_τ associated with the Ornstein–Uhlenbeck process introduced in Example 2.1 are given by
    λ_ℓ(τ) = e^{−αD(ℓ−1)τ},    ϕ_ℓ(x) = (1/√((ℓ−1)!)) H_{ℓ−1}(√α x),    ℓ = 1, 2, ...,

where H_ℓ denotes the ℓth probabilists' Hermite polynomial [23]. The eigenfunctions of P_τ are given by the eigenfunctions of K_τ multiplied by the equilibrium density π, see also Figure 2. △
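Using NumPy's probabilists' Hermite polynomials, these eigenpairs can be checked against a quadrature discretization of K_τ. The parameters α = 4 and D = 0.25 are those of Figure 1; the grid, lag time, and tolerance are our own choices:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval   # probabilists' Hermite He_n
from math import factorial

alpha, D, tau = 4.0, 0.25, 0.5
x = np.linspace(-6.0, 6.0, 1201)
dx = x[1] - x[0]
sigma2 = (1.0 - np.exp(-2.0 * alpha * D * tau)) / alpha
# quadrature discretization of K_tau: P[i, j] = p_tau(x_i, x_j)
P = np.exp(-(x[None, :] - x[:, None] * np.exp(-alpha * D * tau))**2
           / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

for ell in (1, 2, 3):
    c = np.zeros(ell); c[-1] = 1.0                # select He_{ell-1}
    phi = hermeval(np.sqrt(alpha) * x, c) / np.sqrt(factorial(ell - 1))
    lam = np.exp(-alpha * D * (ell - 1) * tau)
    K_phi = P @ phi * dx                          # quadrature approximation of K_tau phi
    mid = np.abs(x) < 3                           # compare away from the grid boundary
    assert np.max(np.abs(K_phi[mid] - lam * phi[mid])) < 1e-6
```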
In addition to the eigenvalues and eigenfunctions, an essential part of the Koopman operator analysis is the set of Koopman modes for the so-called full-state observable g(x) = x. The Koopman modes are vectors that, together with the eigenvalues and eigenfunctions, allow us to reconstruct and to propagate the system's state [16]. More precisely, assume that each component g_i of the full-state observable, i.e., g_i(x) = x_i for i = 1, ..., d, can be written in terms of the eigenfunctions as g_i(x) = Σ_ℓ ϕ_ℓ(x) η_{iℓ}. Defining the Koopman modes by η_ℓ = [η_{1ℓ}, ..., η_{dℓ}]^T, we obtain g(x) = x = Σ_ℓ ϕ_ℓ(x) η_ℓ and thus

    K_τ g(x) = E[g(X_τ) | X_0 = x] = E[X_τ | X_0 = x] = Σ_ℓ λ_ℓ(τ) ϕ_ℓ(x) η_ℓ.    (3)

For vector-valued functions, the Koopman operator is defined to act componentwise. In order to be able to compute eigenvalues, eigenfunctions, and eigenmodes numerically, we project the infinite-dimensional operators onto finite-dimensional spaces spanned by a given set of basis functions. This will be described in detail in Section 4.
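For the Ornstein–Uhlenbeck process the mode decomposition collapses to a single term: since ϕ_2(x) = √α x, the full-state observable is x = α^{−1/2} ϕ_2(x) with mode η_2 = α^{−1/2}, so (3) reduces to E[X_τ | X_0 = x] = λ_2(τ) x. A Monte Carlo sketch using an Euler–Maruyama discretization (step sizes, sample count, and starting point are our own choices):

```python
import numpy as np

alpha, D, tau = 4.0, 0.25, 0.5
x0 = 1.7
# equation (3) for the OU process: E[X_tau | X_0 = x] = lambda_2(tau) * x
predicted = np.exp(-alpha * D * tau) * x0

# Monte Carlo estimate of the conditional expectation via Euler-Maruyama
rng = np.random.default_rng(0)
n_steps, n_samples = 500, 100_000
dt = tau / n_steps
X = np.full(n_samples, x0)
for _ in range(n_steps):
    X += -alpha * D * X * dt + np.sqrt(2.0 * D * dt) * rng.normal(size=n_samples)

assert abs(X.mean() - predicted) < 0.01
```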
2.4 Reversibility

We briefly recapitulate the properties of reversible systems. For many applications, including commonly used molecular dynamics models, the dynamics in full phase space are known to be reversible.
Definition 2.7. A system is said to be reversible if the so-called detailed balance condition is fulfilled, i.e., it holds for all x, y ∈ X that
    π(x) p_τ(x, y) = π(y) p_τ(y, x).    (4)
Example 2.8. The Ornstein–Uhlenbeck process is reversible. It is straightforward to verify that (4) is fulfilled by the transition density (1) and the stationary density (2) for all values of x, y, and τ. Also, general Smoluchowski equations of a d-dimensional system of the form

    dX_t = −D ∇V(X_t) dt + √(2D) dW_t

with dimensionless potential V(x) are reversible. The stationary density is then given by π ∝ exp(−V(x)) [33]. △
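The detailed balance condition (4) for the Ornstein–Uhlenbeck process can be spot-checked numerically using (1) and (2); the parameter values and test points below are our own choices:

```python
import numpy as np

alpha, D, tau = 4.0, 0.25, 0.5

def p_tau(x, y):
    # transition density of equation (1)
    sigma2 = (1.0 - np.exp(-2.0 * alpha * D * tau)) / alpha
    return np.exp(-(y - x * np.exp(-alpha * D * tau))**2 / (2.0 * sigma2)) \
        / np.sqrt(2.0 * np.pi * sigma2)

def pi(x):
    # invariant density of equation (2), variance 1/alpha
    return np.exp(-alpha * x**2 / 2.0) * np.sqrt(alpha / (2.0 * np.pi))

x, y = 0.7, -1.3
# detailed balance: pi(x) p_tau(x, y) = pi(y) p_tau(y, x)
assert abs(pi(x) * p_tau(x, y) - pi(y) * p_tau(y, x)) < 1e-12
```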
As a result of the detailed balance condition, the Koopman operator K_τ and the Perron–Frobenius operator with respect to the equilibrium density, T_τ, are identical (hence also self-adjoint):
    K_τ f = ∫_X p_τ(x, y) f(y) dy = ∫_X (π(y)/π(x)) p_τ(y, x) f(y) dy = T_τ f.
Moreover, both Kτ and Pτ become self-adjoint with respect to the stationary density, i.e.
    ⟨P_τ f, g⟩_{π^{−1}} = ⟨f, P_τ g⟩_{π^{−1}},    ⟨K_τ f, g⟩_π = ⟨f, K_τ g⟩_π.
Hence, the eigenvalues λ_ℓ are real and the eigenfunctions ϕ_ℓ of K_τ form an orthogonal basis with respect to ⟨·, ·⟩_π. That is, the eigenfunctions can be scaled so that ⟨ϕ_ℓ, ϕ_{ℓ′}⟩_π = δ_{ℓℓ′}. Furthermore, the leading eigenvalue λ_1 is the only eigenvalue with absolute value 1 and we obtain

    1 = λ_1 > λ_2 ≥ λ_3 ≥ ...,

see, e.g., [14]. We can then expand a function f ∈ L^2_π(X) in terms of the eigenfunctions as f = Σ_{ℓ=1}^∞ ⟨f, ϕ_ℓ⟩_π ϕ_ℓ such that
    K_τ f = Σ_{ℓ=1}^∞ λ_ℓ(τ) ⟨f, ϕ_ℓ⟩_π ϕ_ℓ.    (5)
Furthermore, the eigenvalues decay exponentially with λ_ℓ(τ) = exp(−κ_ℓ τ), with relaxation rate κ_ℓ and relaxation timescale t_ℓ = κ_ℓ^{−1}. Thus, for a sufficiently large lag time τ, the fast relaxation processes have decayed and (5) can be approximated by finitely many terms. The propagator P_τ has the same eigenvalues and the eigenfunctions ϕ̃_ℓ are given by ϕ̃_ℓ(x) = π(x) ϕ_ℓ(x).
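The relation between eigenvalues and relaxation timescales is easily made concrete for the Ornstein–Uhlenbeck process, where κ_ℓ = αD(ℓ−1) by Example 2.6, so t_ℓ = −τ / ln λ_ℓ(τ) is independent of the lag time. A short sketch (parameter values as in Figure 1):

```python
import numpy as np

alpha, D, tau = 4.0, 0.25, 0.5
# OU eigenvalues lambda_ell(tau) = exp(-alpha*D*(ell-1)*tau) for ell = 2, 3, 4
lams = np.exp(-alpha * D * np.arange(1, 4) * tau)
# implied relaxation timescales t_ell = kappa_ell^{-1} = -tau / ln(lambda_ell)
timescales = -tau / np.log(lams)
assert np.allclose(timescales, 1.0 / (alpha * D * np.arange(1, 4)))
```

In data-driven settings the same formula is used in reverse: estimated eigenvalues of a transfer operator approximation are converted to implied timescales to judge which processes are slow.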
3 Data-driven approximation of transfer operators
In this section, we will describe different data-driven methods to identify the dominant dynamics of dynamical systems and to compute eigenfunctions of transfer operators associated with the system, namely TICA and DMD as well as VAC and EDMD. A formal derivation of methods to compute finite-dimensional approximations of transfer operators – resulting in the aforementioned methods – will be given in Section 4. Although TICA can be regarded as a special case of VAC, and DMD as a special case of EDMD, these methods are often used in different settings. With the aid of TICA, for instance, it is possible to identify the main slow coordinates and to project the dynamics onto the resulting reduced space, which can then be discretized using conventional Markov state models (a special case of VAC or EDMD, respectively, see Subsection 3.5). We will introduce the original methods – TICA and DMD – first and then extend these methods to the more general case. Since many publications use a different notation, we will start with the required basic definitions.

In what follows, let x_i, y_i ∈ ℝ^d, i = 1, ..., m, be a set of pairs of d-dimensional data vectors, where x_i = X_{t_i} and y_i = X_{t_i + τ}. Here, the underlying dynamical system is not necessarily known; the vectors x_i and y_i can simply be measurement data or data from a black-box simulation. In matrix form, this can be written as

    X = [x_1  x_2  ···  x_m]    and    Y = [y_1  y_2  ···  y_m],    (6)
with X, Y ∈ ℝ^{d×m}. If one long trajectory {z_0, z_1, z_2, ...} of a dynamical system is given, i.e., z_i = X_{t_0 + i h}, where h is the step size and τ = n_τ h the lag time, we obtain

    X = [z_0  z_1  ···  z_{m−1}]    and    Y = [z_{n_τ}  z_{n_τ+1}  ···  z_{n_τ+m−1}].
That is, in this case Y is simply X shifted by the lag time τ. Naturally, if more than one trajectory is given, the data matrices X and Y can be concatenated.

In addition to the data, VAC and EDMD require a set of uniformly bounded basis functions or observables, given by {ψ_1, ψ_2, ..., ψ_k} ⊂ L^∞(X). Since X is assumed to be bounded, we have ψ_i ∈ L^r(X) for all i = 1, ..., k and 1 ≤ r ≤ ∞. The basis functions could, for instance, be monomials, indicator functions, radial basis functions, or trigonometric functions. The optimal choice of basis functions remains an open problem and depends strongly on the system. If the set of basis functions is not sufficient to represent the eigenfunctions, the results will be inaccurate. A too large set of basis functions, on the other hand, might lead to ill-conditioned matrices and overfitting. Cross-validation strategies have been developed to detect overfitting [34].

For a basis ψ_i, i = 1, ..., k, define ψ : ℝ^d → ℝ^k to be the vector-valued function given by
    ψ(x) = [ψ_1(x), ψ_2(x), ..., ψ_k(x)]^T.    (7)
The goal then is to find the best approximation of a given transfer operator in the space spanned by these basis functions. This will be explained in detail in Section 4. In addition to the data matrices X and Y, we will need the transformed data matrices

    Ψ_X = [ψ(x_1)  ψ(x_2)  ···  ψ(x_m)]    and    Ψ_Y = [ψ(y_1)  ψ(y_2)  ···  ψ(y_m)].    (8)
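The constructions in this section can be sketched in a few lines of NumPy: building the time-lagged data matrices X and Y of (6) from a single trajectory, and assembling the transformed matrices Ψ_X and Ψ_Y of (8). The toy trajectory, the lag n_τ, and the monomial basis are our own illustrative choices:

```python
import numpy as np

def time_lagged_data(traj, n_tau):
    """Split one long trajectory (d x T array) into the data matrices
    X = [z_0 ... z_{m-1}] and Y = [z_{n_tau} ... z_{n_tau+m-1}] of eq. (6)."""
    return traj[:, :-n_tau], traj[:, n_tau:]

def psi(x):
    """Vector-valued basis function psi of eq. (7); here monomials
    [1, x, x^2] for one-dimensional data (one possible choice)."""
    return np.vstack([np.ones_like(x), x, x**2])

traj = np.sin(0.1 * np.arange(100.0))[None, :]   # toy trajectory, d = 1, T = 100
X, Y = time_lagged_data(traj, 5)
Psi_X, Psi_Y = psi(X[0]), psi(Y[0])              # k x m matrices of eq. (8)

assert X.shape == (1, 95) and Psi_X.shape == (3, 95)
assert np.allclose(Y[:, 0], traj[:, 5])          # Y is X shifted by the lag
```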
3.1 Time-lagged independent component analysis

Time-lagged independent component analysis (TICA) was introduced in [3] as a solution to the blind source separation problem, where the correlation matrix and the time-delayed correlation matrix are used to separate superimposed signals. The term TICA was introduced later [35]. TDSEP [36], an extension of TICA, is popular in the machine learning community. It was shown only recently that TICA is a special case of VAC, computing the optimal linear projection for approximating the slowest relaxation processes, and as such provides an approximation of the leading eigenvalues and eigenfunctions of transfer operators [1]. TICA is now a popular dimension reduction technique in the field of molecular dynamics [1, 4]. That is, TICA is used as a preprocessing step to reduce the size of the state space by projecting the dynamics onto the main coordinates. The time-lagged independent components are required (a) to be uncorrelated and (b) to maximize the autocovariances at lag time τ, see [35, 1] for more details. Assuming that the system is reversible, the TICA coordinates are the eigenfunctions of T_τ or K_τ, respectively, projected onto the space spanned by linear basis functions, i.e., ψ(x) = x. Let C(τ) be the time-lagged covariance matrix defined by
    C_{ij}(τ) = ⟨X_{t,i} X_{t+τ,j}⟩_t = E_π[X_{t,i} X_{t+τ,j}].
Given data X and Y as defined above, estimators C0 and Cτ for the covariance matrices C(0) and C(τ) can be computed as
    C_0 = 1/(m−1) Σ_{k=1}^m x_k x_k^T = 1/(m−1) X X^T,
    C_τ = 1/(m−1) Σ_{k=1}^m x_k y_k^T = 1/(m−1) X Y^T.    (9)

The time-lagged independent components are defined to be solutions of the eigenvalue problem

    C_τ ξ_ℓ = λ_ℓ C_0 ξ_ℓ    or    C_0^+ C_τ ξ_ℓ = λ_ℓ ξ_ℓ,    (10)

respectively. In what follows, let M_TICA = C_0^+ C_τ, where C_0^+ denotes the Moore–Penrose pseudoinverse of C_0. In applications, often the symmetrized estimators
1 T T 1 T T C0 = 2m−2 (XX + YY ) and Cτ = 2m−2 (XY + YX ) are used so that the resulting TICA coordinates become real-valued. This corresponds to averaging over the trajectory and the time-reversed trajectory. Note that this symmetriza- tion can introduce a large estimator bias that affects the dominant spectrum of (10), if the process is non-stationary, or the distribution of the data is far from the equilibrium of the process. In the latter case, a reweighting procedure can be applied to obtain weighted versions of the estimators (9), to reduce that bias [30]. Example 3.1. Let us illustrate the idea behind TICA with a simple example. Consider the data shown in Figure3, which was generated by a stochastic process which will typically spend a long time in one of the two clusters before it jumps to the other. We are interested
in finding these metastable sets. Standard principal component analysis (PCA) leads to the coordinate shown in red, whereas TICA – shown in black – takes time information into account and is thus able to identify the slow direction of the system correctly. Projecting the system onto the x-coordinate will preserve the slow process while eliminating the fast stochastic noise. △
Figure 3: The difference between PCA and TICA. The top and bottom plots show the x- and y-components of the system, respectively; the plot on the right shows the resulting main principal component vector and the main TICA coordinate.
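In the same spirit as Example 3.1, the covariance estimators (9) and the eigenvalue problem (10) can be applied to a synthetic two-dimensional process with one slow and one fast direction. The toy process below (two independent AR(1) signals) is our own stand-in for the clustered data of Figure 3; the dominant TICA eigenvalue recovers the autocorrelation of the slow coordinate:

```python
import numpy as np

rng = np.random.default_rng(0)
m, rho_slow, rho_fast = 20_000, 0.99, 0.1
Z = np.zeros((2, m))
for k in range(1, m):                 # slow and fast AR(1) components
    Z[0, k] = rho_slow * Z[0, k - 1] + rng.normal()
    Z[1, k] = rho_fast * Z[1, k - 1] + rng.normal()
X, Y = Z[:, :-1], Z[:, 1:]            # lag n_tau = 1

C0 = X @ X.T / (X.shape[1] - 1)       # estimators of eq. (9)
Ct = X @ Y.T / (X.shape[1] - 1)
M_tica = np.linalg.pinv(C0) @ Ct      # eq. (10) in its second form
eigvals, eigvecs = np.linalg.eig(M_tica)
order = np.argsort(-eigvals.real)

# the dominant eigenvalue matches the slow autocorrelation rho_slow
assert abs(eigvals.real[order[0]] - rho_slow) < 0.05
```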
Algorithm 1: AMUSE algorithm to compute TICA.
1. Compute a reduced SVD of X, i.e., X = U Σ V^T.
2. Whiten the data: X̃ = Σ^{−1} U^T X and Ỹ = Σ^{−1} U^T Y.
3. Compute M̃_TICA = X̃ Ỹ^T = Σ^{−1} U^T X Y^T U Σ^{−1}.
4. Solve the eigenvalue problem M̃_TICA w_ℓ = λ_ℓ w_ℓ.
5. The TICA coordinates are then given by ξ_ℓ = U Σ^{−1} w_ℓ.
The TICA coordinates can be computed using AMUSE⁵ [37] as shown in Algorithm 1. Instead of computing a singular value decomposition of the data matrix X in step 1, an eigenvalue decomposition of the covariance matrix X X^T could be computed, which is more efficient if m ≫ d, but less accurate. The vectors ξ_ℓ computed by AMUSE are solutions of
⁵ Algorithm for Multiple Unknown Signals Extraction.
the eigenvalue problem (10), since