Linear Dynamics: Clustering without identification

Chloe Ching-Yun Hsu† (University of California, Berkeley), Michaela Hardt† (Amazon), Moritz Hardt† (University of California, Berkeley)

†This work was done at Google.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s). arXiv:1908.01039v3 [cs.LG] 29 Feb 2020.

Abstract

Linear dynamical systems are a fundamental and powerful parametric model class. However, identifying the parameters of a linear dynamical system is a venerable task, permitting provably efficient solutions only in special cases. This work shows that the eigenspectrum of unknown linear dynamics can be identified without full system identification. We analyze a computationally efficient and provably convergent algorithm to estimate the eigenvalues of the state-transition matrix in a linear dynamical system.

When applied to time series clustering, our algorithm can efficiently cluster multi-dimensional time series with temporal offsets and varying lengths, under the assumption that the time series are generated from linear dynamical systems. Evaluating our algorithm on both synthetic data and real electrocardiogram (ECG) signals, we see improvements in clustering quality over existing baselines.

1 Introduction

The linear dynamical system (LDS) is a simple yet general model for time series. Many machine learning models are special cases of linear dynamical systems [Roweis and Ghahramani, 1999], including principal component analysis (PCA), mixtures of Gaussians, Kalman filter models, and hidden Markov models.

When the states are hidden, LDS parameter identification has provably efficient solutions only in special cases; see for example [Hazan et al., 2018, Hardt et al., 2018, Simchowitz et al., 2018]. In practice, the expectation–maximization (EM) algorithm [Ghahramani and Hinton, 1996] is often used for LDS parameter estimation, but it is inherently non-convex and can often get stuck in local minima [Hazan et al., 2018]. Even when full system identification is hard, is there still hope to learn meaningful information about linear dynamics without learning all system parameters? We provide a positive answer to this question.

We show that the eigenspectrum of the state-transition matrix of unknown linear dynamics can be identified without full system identification. The eigenvalues of the state-transition matrix play a significant role in determining the properties of a linear system. For example, in two dimensions, the eigenvalues determine the stability of a linear dynamical system. Based on the trace and the determinant of the state-transition matrix, we can classify a linear system as a stable node, a stable spiral, a saddle, an unstable node, or an unstable spiral.

To estimate the eigenvalues, we utilize a fundamental correspondence between linear systems and Autoregressive-Moving-Average (ARMA) models. We establish bi-directional perturbation bounds to prove that two LDSs have similar eigenvalues if and only if their output time series have similar autoregressive parameters. Based on a consistent estimator for the autoregressive model parameters of ARMA models [Tsay and Tiao, 1984], we propose a regularized iterated least-squares regression method to estimate the LDS eigenvalues. Our method runs in time linear in the sequence length $T$ and converges to the true eigenvalues at the rate $O_p(T^{-1/2})$.

As one application, our eigenspectrum estimation algorithm gives rise to a simple approach for time series clustering: first use regularized iterated least-squares regression to fit the autoregressive parameters; then cluster the fitted autoregressive parameters.

This simple and efficient clustering approach captures similarity in eigenspectrums, assuming each time series comes from an underlying linear dynamical system. It is a suitable similarity measure where the main goal for clustering is to characterize state-transition dynamics regardless of change of basis, which is particularly relevant when there are multiple data sources with different measurement procedures. Our approach bypasses the challenge of LDS full parameter estimation, while enjoying the natural flexibility to handle multi-dimensional time series with time offsets and partial sequences.

To verify that our method efficiently learns system eigenvalues on synthetic and real ECG data, we compare our approach to existing baselines, including model-based approaches based on LDS, AR and ARMA parameter estimation, and PCA, as well as model-agnostic clustering approaches such as dynamic time warping [Cuturi and Blondel, 2017] and k-Shape [Paparrizos and Gravano, 2015].

Organization. We review LDS and ARMA models in Sec. 3. In Sec. 4 we discuss the main technical results around the correspondence between LDS and ARMA. Sec. 5 presents the regularized iterated regression algorithm, a consistent estimator of autoregressive parameters in ARMA models with applications to clustering. We carry out eigenvalue estimation and clustering experiments on synthetic data and real ECG data in Sec. 6. In the appendix, we describe generalizations to observable inputs and multidimensional outputs, and include additional simulation results.

2 Related Work

Linear dynamical system identification. The LDS identification problem has been studied since the 1960s [Kalman, 1960], yet the theoretical bounds are still not fully understood. Recent provably efficient algorithms [Simchowitz et al., 2018, Hazan et al., 2018, Hardt et al., 2018, Dean et al., 2017] require setups that are not best suited for time series clustering, such as assuming observable states and focusing on prediction error instead of parameter recovery.

On recovering system parameters without observed states, Tsiamis and Pappas recently studied a subspace identification algorithm with a non-asymptotic $O(T^{-1/2})$ rate [Tsiamis and Pappas, 2019]. While our analysis does not provide finite sample complexity bounds, our simple algorithm achieves the same rate asymptotically.

Model-based time series clustering. Common model choices for clustering include Gaussian mixture models [Biernacki et al., 2000], autoregressive integrated moving average (ARIMA) models [Kalpakis et al., 2001], and hidden Markov models [Smyth, 1997]. Gaussian mixture models, ARIMA models, and hidden Markov models are all special cases of the more general linear dynamical system model [Roweis and Ghahramani, 1999].

Linear dynamical systems have been used to cluster video trajectories [Chan and Vasconcelos, 2005, Afsari et al., 2012, Vishwanathan et al., 2007], where the observed time series are higher dimensional than the hidden state dimension. Our work is motivated by the more challenging situation with a single or a few output dimensions, common in climatology, energy consumption, finance, medicine, etc.

Compared to ARMA-parameter based clustering, our method only uses the AR half of the parameters, which we show to enjoy more reliable convergence. We also differ from AR-model based clustering because fitting AR to an ARMA process results in biased estimates.

Autoregressive parameter estimation. Existing spectral analysis methods for estimating AR parameters in ARMA models include high-order Yule-Walker (HOYW), MUSIC, and ESPRIT [Stoica et al., 2005, Stoica et al., 1988]. Our method is based on iterated regression [Tsay and Tiao, 1984], a more flexible method for handling observed exogenous inputs (see Appendix C) in the ARMAX generalization.

3 Preliminaries

3.1 Linear dynamical systems

A discrete-time linear dynamical system (LDS) with parameters $\Theta = (A, B, C, D)$ receives inputs $x_1, \cdots, x_T \in \mathbb{R}^k$, has hidden states $h_0, \cdots, h_T \in \mathbb{R}^n$, and generates outputs $y_1, \cdots, y_T \in \mathbb{R}^m$ according to the following time-invariant recursive equations:

$$
\begin{aligned}
h_t &= A h_{t-1} + B x_t + \zeta_t \\
y_t &= C h_t + D x_t + \xi_t.
\end{aligned}
\tag{1}
$$

Assumptions. We assume that the stochastic noises $\zeta_t$ and $\xi_t$ are diagonal Gaussians. We also assume the system is observable, i.e. $C, CA, CA^2, \cdots, CA^{n-1}$ are linearly independent. When the LDS is not observable, the ARMA model for the output series can be reduced to a lower AR order, and there is not enough information in the output series to recover the full eigenspectrum.

The model equivalence theorem (Theorem 4.1) and the approximation theorem (Theorem 4.2) hold for any real matrix $A$ without additional assumptions. When additionally assuming $A$ only has simple eigenvalues in $\mathbb{C}$, i.e. each eigenvalue has multiplicity 1, we give a better convergence bound.

Distance between linear dynamical systems. With the main goal to characterize state-transition dynamics, we view systems as equivalent up to change of basis, and use the $\ell_2$ distance of the spectrum of the transition matrix $A$, i.e. $d(\Theta_1, \Theta_2) = \|\lambda(A_1) - \lambda(A_2)\|_2$, where $\lambda(A_1)$ and $\lambda(A_2)$ are the spectra of $A_1$ and $A_2$ in sorted order. This distance definition satisfies non-negativity, identity, symmetry, and the triangle inequality.
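The eigenspectrum distance above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `spectral_distance` is hypothetical, both matrices are assumed to have the same dimension, and "sorted order" is assumed to mean NumPy's lexicographic sort on (real part, imaginary part), which the text does not specify. Complex-conjugate pairs with numerically tied real parts may need a more careful matching.

```python
import numpy as np

def spectral_distance(A1, A2):
    """l2 distance between the sorted spectra of two square matrices of equal size.

    Assumption: np.sort on complex values orders by real part, then imaginary part.
    """
    lam1 = np.sort(np.linalg.eigvals(np.asarray(A1, dtype=float)))
    lam2 = np.sort(np.linalg.eigvals(np.asarray(A2, dtype=float)))
    return float(np.linalg.norm(lam1 - lam2))

# A transition matrix and the same dynamics expressed in another basis.
A = np.array([[0.9, 0.2], [0.0, 0.5]])
P = np.array([[2.0, 1.0], [1.0, 1.0]])      # any non-singular matrix
A_changed = np.linalg.inv(P) @ A @ P

d_same = spectral_distance(A, A_changed)    # close to zero: similar matrices share a spectrum
d_diff = spectral_distance(np.diag([0.9, 0.5]), np.diag([0.9, 0.2]))
```

The first distance is zero up to floating-point error because similar matrices share eigenvalues, which is the invariance the distance is designed to have.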

Two very different time series could still have a small distance in eigenspectrum. This is by design, to allow flexibility for different measurement procedures, which mathematically correspond to different $C$ matrices. When there are multiple data sources with different measurement procedures, our approach can compare the underlying dynamics of time series across sources.

Jordan canonical basis. Every square real matrix is similar to a complex block diagonal matrix known as its Jordan canonical form (JCF). In the special case of diagonalizable matrices, the JCF is the same as the diagonal form. Based on the JCF, there exists a canonical basis $\{e_i\}$ consisting only of eigenvectors and generalized eigenvectors of $A$. A vector $v$ is a generalized eigenvector of rank $\mu$ with corresponding eigenvalue $\lambda$ if $(\lambda I - A)^{\mu} v = 0$ and $(\lambda I - A)^{\mu-1} v \neq 0$.

3.2 Autoregressive-moving-average models

The autoregressive-moving-average (ARMA) model combines the autoregressive (AR) model and the moving-average (MA) model. The AR part involves regressing the variable with respect to its lagged past values, while the MA part involves regressing the variable against past error terms.

Autoregressive model. The AR model describes how the current value in the time series depends on the lagged past values. For example, if the GDP realization is high this quarter, the GDP in the next few quarters is likely high as well. An autoregressive model of order $p$, noted as AR($p$), depends on the past $p$ steps,

$$y_t = c + \sum_{i=1}^{p} \varphi_i y_{t-i} + \epsilon_t,$$

where $\varphi_1, \cdots, \varphi_p$ are autoregressive parameters, $c$ is a constant, and $\epsilon_t$ is white noise.

When the errors are normally distributed, ordinary least squares (OLS) regression is a conditional maximum likelihood estimator for AR models, yielding optimal estimates [Durbin, 1960].

Moving-average model. The MA model, on the other hand, captures the delayed effects of unobserved random shocks in the past. For example, changes in winter weather could have a delayed effect on the food harvest in the next fall. A moving-average model of order $q$, noted as MA($q$), depends on unobserved lagged errors in the past $q$ steps,

$$y_t = c + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i},$$

where $\theta_1, \cdots, \theta_q$ are moving-average parameters, $c$ is a constant, and the errors $\epsilon_t$ are white noise.

ARMA model. The autoregressive-moving-average (ARMA) model, denoted as ARMA($p$, $q$), merges AR($p$) and MA($q$) models to consider dependencies both on past time series values and on past unpredictable shocks,

$$y_t = c + \epsilon_t + \sum_{i=1}^{p} \varphi_i y_{t-i} + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}.$$

ARMAX model. ARMA can be generalized to the autoregressive–moving-average model with exogenous inputs (ARMAX),

$$y_t = c + \epsilon_t + \sum_{i=1}^{p} \varphi_i y_{t-i} + \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \sum_{i=0}^{r} \gamma_i x_{t-i},$$

where $\{x_t\}$ is a known external time series, possibly multidimensional. In case $x_t$ is a vector, the parameters $\gamma_i$ are also vectors.

Estimating ARMA and ARMAX models is significantly harder than estimating AR models, since the model depends on unobserved variables and the maximum likelihood equations are intractable [Durbin, 1959, Choi, 2012]. Maximum likelihood estimation (MLE) methods are commonly used for fitting ARMA and ARMAX [Guo, 1996, Bercu, 1995, Hannan et al., 1980], but have convergence issues. Although regression methods are also used in practice, OLS is a biased estimator for ARMA models [Tiao and Tsay, 1983].

Lag operator. We also introduce the lag operator, a concise way to describe ARMA models [Granger and Morris, 1976], defined as $L y_t = y_{t-1}$. The lag operator can be raised to powers or used to form polynomials. For example, $L^3 y_t = y_{t-3}$, and $(a_2 L^2 + a_1 L + a_0) y_t = a_2 y_{t-2} + a_1 y_{t-1} + a_0 y_t$. Lag polynomials can be multiplied or inverted. An AR($p$) model can be characterized by

$$\Phi(L) y_t = c + \epsilon_t,$$

where $\Phi(L) = 1 - \varphi_1 L - \cdots - \varphi_p L^p$ is a polynomial of the lag operator $L$ of degree $p$. For example, any AR(2) model can be described as $(1 - \varphi_1 L - \varphi_2 L^2) y_t = c + \epsilon_t$.

Similarly, an MA($q$) model can be characterized by a polynomial $\Psi(L) = \theta_q L^q + \cdots + \theta_1 L + 1$ of degree $q$,

$$y_t = c + \Psi(L) \epsilon_t.$$

For example, for an MA(2) model the equation would be $y_t = c + (\theta_2 L^2 + \theta_1 L + 1) \epsilon_t$.

Merging the two and adding a dependency on exogenous input, we can write an ARMAX($p$, $q$, $r$) model as

$$\Phi(L) y_t = c + \Psi(L) \epsilon_t + \Gamma(L) x_t \tag{2}$$

where $\Phi$, $\Psi$, and $\Gamma$ are polynomials of degree $p$, $q$ and $r$. When the exogenous time series $x_t$ is multidimensional, $\Gamma(L)$ is a vector of degree-$r$ polynomials.
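To make the lag-operator notation concrete, the following sketch simulates an ARMA($p$, $q$) series from the recursion and then checks numerically that $\Phi(L) y_t = c + \Psi(L)\epsilon_t$ holds. The helper names `simulate_arma` and `apply_lag_polynomial` are hypothetical, and pre-sample values of $y_t$ and $\epsilon_t$ are assumed to be zero, which is a simplification for illustration.

```python
import numpy as np

def simulate_arma(phi, theta, eps, c=0.0):
    """y_t = c + eps_t + sum_i phi_i y_{t-i} + sum_i theta_i eps_{t-i}.

    Assumption: values before the start of the series are treated as zero.
    """
    phi = np.asarray(phi, dtype=float)
    theta = np.asarray(theta, dtype=float)
    eps = np.asarray(eps, dtype=float)
    y = np.zeros(len(eps))
    for t in range(len(eps)):
        ar = sum(phi[i] * y[t - 1 - i] for i in range(len(phi)) if t - 1 - i >= 0)
        ma = sum(theta[i] * eps[t - 1 - i] for i in range(len(theta)) if t - 1 - i >= 0)
        y[t] = c + eps[t] + ar + ma
    return y

def apply_lag_polynomial(coeffs, x):
    """Apply coeffs[0] + coeffs[1] L + coeffs[2] L^2 + ... to the series x."""
    return np.convolve(x, coeffs)[: len(x)]

phi, theta = [0.5, -0.2], [0.4]
eps = np.random.default_rng(0).standard_normal(200)
y = simulate_arma(phi, theta, eps)

# Phi(L) = 1 - phi_1 L - phi_2 L^2 and Psi(L) = 1 + theta_1 L, with c = 0.
lhs = apply_lag_polynomial(np.concatenate(([1.0], -np.array(phi))), y)
rhs = apply_lag_polynomial(np.concatenate(([1.0], np.array(theta))), eps)
```

With $c = 0$ the two sides agree entry by entry, since convolving a series with the coefficient vector of a lag polynomial is the same as applying that polynomial.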

4 Learning eigenvalues without system identification

This section provides theoretical foundations for learning LDS eigenvalues from autoregressive parameters without full system identification. While general model equivalence between LDS and ARMA(X) is known [Åström and Wittenmark, 2013, Kailath, 1980], we provide a detailed analysis of the exact correspondence between the LDS characteristic polynomial and the ARMA(X) autoregressive parameters, along with perturbation bounds.

4.1 Model equivalence

We show that the output series from any LDS can be seen as generated by an ARMAX model, whose AR parameters contain full information about the LDS eigenvalues.

Theorem 4.1. Let $y_t \in \mathbb{R}^m$ be the outputs from a linear dynamical system with parameters $\Theta = (A, B, C, D)$, hidden dimension $n$, and inputs $x_t \in \mathbb{R}^k$. Each dimension of $y_t$ can be generated by an ARMAX($n$, $n$, $n-1$) model, whose autoregressive parameters $\varphi_1, \cdots, \varphi_n$ can recover the characteristic polynomial of $A$ by $\chi_A(\lambda) = \lambda^n - \varphi_1 \lambda^{n-1} - \cdots - \varphi_n$. In the special case where the LDS has no external inputs, the ARMAX model is an ARMA($n$, $n$) model.

See Appendix A for the full proof.

As a high-level proof sketch: We first analyze the hidden state projected to (generalized) eigenvector directions in Lemma A.2. We show that for a (generalized) eigenvector $e$ of the adjoint $A^*$ of the transition matrix with eigenvalue $\lambda$ and rank $\mu$, the time series obtained from applying the lag operator polynomial $(1 - \lambda L)^{\mu}$ to $\langle h_t, e \rangle$ can be expressed as a linear combination of the past $k$ inputs $x_t, \cdots, x_{t-k+1}$. Since $A$ is real-valued, $A$ and its adjoint $A^*$ share the same characteristic polynomial $\chi_A$.

We then consider the lag operator polynomial $\chi_A^{\dagger}(L) = L^n \chi_A(L^{-1})$, and show that the time series obtained from applying $\chi_A^{\dagger}(L)$ to any (generalized) eigenvector direction is a linear combination of the past $k$ inputs. We use this on the Jordan canonical basis for $A^*$ that consists of (generalized) eigenvectors. From there, we conclude that $\chi_A^{\dagger}$ is the autoregressive lag polynomial that contains the autoregressive coefficients.

The converse of Theorem 4.1 also holds. An ARMA($p$, $q$) model can be seen as a ($p+q$)-dimensional LDS where the state encodes the relevant past values and error terms.

Corollary 4.1. The output series of two linear dynamical systems have the same autoregressive parameters if and only if they have the same non-zero eigenvalues with the same multiplicities.

Proof. By Theorem 4.1, the autoregressive parameters are determined by the characteristic polynomial. Two LDSs of the same dimension have the same autoregressive parameters if and only if they have the same characteristic polynomials, and hence the same eigenvalues with the same multiplicities. Two LDSs of different dimensions $n_1 < n_2$ can have the same autoregressive parameters if and only if $\chi_{A_2}(\lambda) = \chi_{A_1}(\lambda) \lambda^{n_2 - n_1}$ and $\varphi_{n_1+1} = \cdots = \varphi_{n_2} = 0$, in which case they have the same non-zero eigenvalues with the same multiplicities.

It is possible for two LDSs with different dimensions to have the same AR coefficients, if the higher-dimensional system has additional zero eigenvalues. Whether over-parameterized models indeed learn additional zero eigenvalues requires further empirical investigation.

4.2 Approximation theorems for LDS eigenvalues

We show that a small error in the AR parameter estimation guarantees a small error in the eigenvalue estimation. This implies that an effective estimation algorithm for the AR parameters in ARMAX models leads to effective estimation of LDS eigenvalues.

General (1/n)-exponent bound

Theorem 4.2. Let $y_t$ be the outputs from an $n$-dimensional linear dynamical system with parameters $\Theta = (A, B, C, D)$, eigenvalues $\lambda_1, \cdots, \lambda_n$, and hidden inputs. Let $\hat{\Phi} = (\hat{\varphi}_1, \cdots, \hat{\varphi}_n)$ be the estimated autoregressive parameters for $\{y_t\}$ with error $\|\hat{\Phi} - \Phi\| = \epsilon$, and let $r_1, \cdots, r_n$ be the roots of the polynomial $z^n - \hat{\varphi}_1 z^{n-1} - \cdots - \hat{\varphi}_n$, i.e. the estimate of the characteristic polynomial in Theorem 4.1.

Assuming the LDS is observable, the roots converge to the true eigenvalues with convergence rate $O(\epsilon^{1/n})$. If all eigenvalues of $A$ are simple (no multiplicity), then the convergence rate is $O(\epsilon)$.

Without additional assumptions, the $\frac{1}{n}$-exponent in the above general bound is tight. As an example, $z^2 - \epsilon$ has roots $z = \pm\sqrt{\epsilon}$. The general phenomenon that a root with multiplicity $m$ can split into $m$ roots at rate $O(\epsilon^{1/m})$ is related to the regular splitting property [Hryniv and Lancaster, 1999, Lancaster et al., 2003].

Linear bound for simple eigenvalues

Under the additional assumption that all the eigenvalues are simple (no multiplicity), we derive a better $O(\epsilon)$ bound instead of $O(\epsilon^{1/n})$. We show that a small perturbation in the AR parameters results in a small perturbation in the companion matrix, and a small perturbation in the companion matrix results in a small perturbation in the eigenvalues.

We defer the full proofs to Appendix B, but describe the proof ideas here.

For a monic polynomial $\Phi(z) = z^n + \varphi_1 z^{n-1} + \cdots + \varphi_{n-1} z + \varphi_n$, the companion matrix, also known as the controllable canonical form in control theory, is the square matrix

$$
C(\Phi) = \begin{pmatrix}
0 & 0 & \cdots & 0 & -\varphi_n \\
1 & 0 & \cdots & 0 & -\varphi_{n-1} \\
0 & 1 & \cdots & 0 & -\varphi_{n-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & -\varphi_1
\end{pmatrix}.
$$

The matrix $C(\Phi)$ is the companion in the sense that it has $\Phi$ as its characteristic polynomial.

In relation to an autoregressive AR($p$) model, the companion matrix corresponds to the transition matrix in the linear dynamical system when we encode the values from the past $p$ lags as a $p$-dimensional state

$$h_t = \begin{pmatrix} y_{t-p+1} & \cdots & y_{t-1} & y_t \end{pmatrix}^T.$$

If $y_t = \varphi_1 y_{t-1} + \cdots + \varphi_p y_{t-p}$, then

$$
h_t = \begin{pmatrix} y_{t-p+1} \\ y_{t-p+2} \\ \vdots \\ y_{t-1} \\ y_t \end{pmatrix}
= \begin{pmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
\varphi_p & \varphi_{p-1} & \varphi_{p-2} & \cdots & \varphi_1
\end{pmatrix}
\begin{pmatrix} y_{t-p} \\ y_{t-p+1} \\ \vdots \\ y_{t-2} \\ y_{t-1} \end{pmatrix}
= C(-\Phi)^T h_{t-1}.
\tag{3}
$$

We then use matrix eigenvalue perturbation theory results on the companion matrix for the desired bound.

Lemma 4.1 (Theorem 6 in [Lancaster et al., 2003]). Let $L(\lambda, \epsilon)$ be an analytic matrix function with semi-simple eigenvalue $\lambda_0$ at $\epsilon = 0$ of multiplicity $M$. Then there are exactly $M$ eigenvalues $\lambda_i(\epsilon)$ of $L(\lambda, \epsilon)$ for which $\lambda_i(\epsilon) \to \lambda_0$ as $\epsilon \to 0$, and for these eigenvalues

$$\lambda_i(\epsilon) = \lambda_0 + \lambda_i' \epsilon + o(\epsilon). \tag{4}$$

Explicit bound on condition number

When the LDS has all simple eigenvalues, we provide a more explicit bound on the condition number.

Theorem 4.3. In the same setting as in Theorem 4.2, when all eigenvalues of $A$ are simple, $|r_j - \lambda_j| \le \kappa \epsilon + o(\epsilon^2)$, where the condition number $\kappa$ is bounded by

$$
\frac{1}{\prod_{k \neq j} |\lambda_j - \lambda_k|} \le \kappa \le \frac{\sqrt{n}\, (\max(1, |\lambda_j|))^{n-1} (1 + \rho(A)^2)^{\frac{n-1}{2}}}{\prod_{k \neq j} |\lambda_j - \lambda_k|},
$$

where $\rho(A)$ is the spectral radius, i.e. the largest absolute value of its eigenvalues.

In particular, when $\rho(A) \le 1$, i.e. when the matrix is Lyapunov stable, the absolute difference between the root from the autoregressive method and the eigenvalue is bounded by

$$|r_j - \lambda_j| \le \frac{\sqrt{n}\, (\sqrt{2})^{n-1}}{\prod_{k \neq j} |\lambda_j - \lambda_k|}\, \epsilon + o(\epsilon^2).$$

In Appendix B, we derive the explicit formula in Theorem 4.3 by conjugating the companion matrix by a Vandermonde matrix to diagonalize it and invoking the explicit inverse formula of Vandermonde matrices.

5 Estimation of ARMA autoregressive parameters

In general, learning ARMA models is hard, since the output series depends on unobserved error terms. Fortunately, for our purpose we are only interested in the autoregressive parameters, which are easier to learn since the past values of the time series are observed.

The autoregressive parameters in an ARMA($p$, $q$) model are not equivalent to the pure AR($p$) parameters for the same time series. For AR($p$) models, ordinary least squares (OLS) regression is a consistent estimator of the autoregressive parameters [Lai and Wei, 1983]. However, for ARMA($p$, $q$) models, due to the serial correlation in the error term $\epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$, the OLS estimates for autoregressive parameters can be biased [Tiao and Tsay, 1983].

Regularized iterated regression. Iterated regression [Tsay and Tiao, 1984] is a consistent estimator for the AR parameters in ARMA models. While iterated regression is theoretically well-grounded, it tends to over-fit and results in excessively large parameters. To avoid over-fitting, we propose a slight modification with regularization, which keeps the same theoretical guarantees and yields better practical performance.

We also generalize the method to handle multidimensional outputs from the LDS and observed inputs by using ARMAX instead of ARMA models, as described in detail in Appendix C as Algorithm 2.

The $i$-th iteration of the regression only uses error terms from the past $i$ lags. The initial iteration is an ARMA($n$, 0) regression, the first iteration is an ARMA($n$, 1) regression, and so forth until ARMA($n$, $n$) in the last iteration.
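The iteration scheme just described can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' released code: the function name `iterated_regression` is hypothetical, the ridge problem is solved in closed form with the penalty applied only to the moving-average coefficients, rows start at $t = n$ so that all AR lags exist, and residuals before that point are assumed to be zero. The usage example fits a simulated ARMA(1, 1) series, where plain OLS on one lag is visibly biased.

```python
import numpy as np

def iterated_regression(y, n, alpha=0.01):
    """Estimate AR parameters of an ARMA(n, n) model by regularized iterated regression.

    Iteration i regresses y_t on n lags of y, i lags of the current residual
    estimates, and an intercept; only the residual coefficients are penalized.
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    rows = np.arange(n, T)                  # time indices with all n AR lags available
    eps_hat = np.zeros(T)                   # error term estimates, initialized to zero
    phi_hat = np.zeros(n)
    for i in range(n + 1):
        cols = [y[rows - j] for j in range(1, n + 1)]
        cols += [eps_hat[rows - j] for j in range(1, i + 1)]
        cols.append(np.ones(len(rows)))     # intercept column
        X = np.column_stack(cols)
        penalty = np.zeros(X.shape[1])
        penalty[n:n + i] = alpha            # ridge penalty on the theta terms only
        beta = np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y[rows])
        phi_hat = beta[:n]
        eps_hat = np.zeros(T)
        eps_hat[rows] = y[rows] - X @ beta  # residuals from the most recent regression
    return phi_hat

# Usage: ARMA(1, 1) with phi = 0.7 and theta = 0.5 (illustrative values).
rng = np.random.default_rng(0)
T, phi_true, theta_true = 20000, 0.7, 0.5
eps = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi_true * y[t - 1] + eps[t] + theta_true * eps[t - 1]

phi_hat = iterated_regression(y, n=1)
phi_ols = float(np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1]))  # pure AR(1) fit, biased here
```

Each pass solves a least-squares problem with at most $2n + 1$ columns, so the cost per pass is linear in the sequence length for fixed $n$.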

Algorithm 1: Regularized iterated regression for autoregressive parameter estimation

Input: Time series $\{y_t\}_{t=1}^{T}$, target hidden state dimension $n$, and regularization coefficient $\alpha$.
Initialize error term estimates $\hat{\epsilon}_t = 0$ for $t = 1, \ldots, T$;
for $i = 0, \cdots, n$ do
    Perform $\ell_2$-regularized least squares regression to estimate $\hat{\varphi}_j$, $\hat{\theta}_j$, and $\hat{c}$ in $y_t = \sum_{j=1}^{n} \hat{\varphi}_j y_{t-j} + \sum_{j=1}^{i} \hat{\theta}_j \hat{\epsilon}_{t-j} + \hat{c}$ with regularization strength $\alpha$ only on the $\hat{\theta}_j$ terms;
    Update $\hat{\epsilon}_t$ to be the residuals from the most recent regression;
end
Return $\hat{\varphi}_1, \cdots, \hat{\varphi}_n$.

Time complexity. The iterated regression involves $n + 1$ steps of least squares regression, each on at most $2n + 1$ variables. Therefore, the total time complexity of Algorithm 1 is $O(n^3 T + n^4)$, where $T$ is the sequence length and $n$ is the hidden state dimension.

Convergence rate. The consistency and the convergence rate of the estimator are analyzed in [Tsay and Tiao, 1984]. Adding regularization does not change the asymptotic property of the estimator.

Theorem 5.1 ([Tsay and Tiao, 1984]). Suppose that $y_t$ is an ARMA($p$, $q$) process, stationary or not. The estimated autoregressive parameters $\hat{\Phi} = (\hat{\varphi}_1, \cdots, \hat{\varphi}_n)$ from iterated regression converge in probability to the true parameters with rate

$$\hat{\Phi} = \Phi + O_p(T^{-1/2}),$$

or more explicitly, convergence in probability means that for all $\epsilon$, $\lim_{T \to \infty} \Pr(T^{1/2} |\hat{\Phi} - \Phi| > \epsilon) = 0$.

5.1 Applications to clustering

The task of clustering depends on an appropriate distance measure for the clustering purpose. When there are multiple data sources for time series with comparable dynamics but measured by different measurement procedures, one might hope to cluster time series based only on the state-transition dynamics.

We can observe from the LDS definition (1) that two LDSs with parameters $(A, B, C, D)$ and $(A', B', C', D')$ are equivalent if $A' = P^{-1} A P$, $B' = P^{-1} B$, $C' = C P$, $D' = D$, and $h_t' = P^{-1} h_t$ under change of basis by some non-singular matrix $P$. Therefore, to capture the state-transition dynamics while allowing for flexibility in the measurement matrix $C$, the distance measure should be invariant under change of basis. We choose to use the eigenspectrum distance between state-transition matrices, a natural distance choice that is invariant under change of basis.

In previous sections, our theoretical analysis shows that AR parameters in ARMA time series models can effectively estimate the eigenspectrum of underlying LDSs. We therefore propose a simple time series clustering algorithm: 1) first use iterated regression to estimate the autoregressive parameters in ARMA models for each time series, and 2) then apply any standard clustering algorithm such as K-means on the distance between autoregressive parameters.

Our method is very flexible. It handles multi-dimensional data, as Theorem 4.1 suggests that any output series from the same LDS should share the same autoregressive parameters. It can also handle exogenous inputs as illustrated in Algorithm 2 in Appendix C. It is scale, shift, and offset invariant, as the autoregressive parameters in ARMA models are. It accommodates missing values in partial sequences, as we can still perform OLS after dropping the rows with missing values. It also allows sequences to have different lengths, and could be adapted to handle sequences with different sampling frequencies, as the compound of multiple steps of LDS evolution is still linear.

6 Experiments

We experimentally evaluate the quality and efficiency of the clustering from our method and compare it to existing baselines. The source code is available online at https://github.com/chloechsu/ldseig.

6.1 Methods

• ARMA: K-means on AR parameters in an ARMA(n, n) model estimated by regularized iterated regression as we proposed in Algorithm 1.
• ARMA_MLE: K-means on AR parameters in an ARMA(n, n) model estimated by the MLE method using statsmodels [Seabold and Perktold, 2010].
• AR: K-means on AR parameters in an AR(n) model estimated by OLS using statsmodels.
• LDS: K-means on estimated LDS eigenvalues. We estimate the LDS eigenvalues with the pylds package [Johnson and Linderman, 2018], with 100 EM steps initialized by 10 Gibbs iterations.
• k-Shape: A shape-based time series clustering method [Paparrizos and Gravano, 2015], using the tslearn package [Tavenard, 2017].
• DTW: K-medoids on dynamic time warping distance, using the dtaidistance [Meert, 2018] and pyclustering [Novikov, 2019] packages.
• PCA: K-means on the first n PCA components, using sklearn [Pedregosa et al., 2011].
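Several of the methods above operate on AR parameters as a stand-in for eigenvalues. The link between the two (Theorem 4.1 and Eq. (3)) can be checked numerically with the sketch below. The helper names are hypothetical, and the example matrix is chosen only for illustration; it is not taken from the experiments.

```python
import numpy as np

def ar_params_from_matrix(A):
    """AR parameters phi_1..phi_n with chi_A(lam) = lam^n - phi_1 lam^(n-1) - ... - phi_n."""
    coeffs = np.poly(np.asarray(A, dtype=float))   # [1, c_1, ..., c_n] of the characteristic polynomial
    return -np.real(coeffs[1:])

def eigenvalues_from_ar_params(phi):
    """Roots of lam^n - phi_1 lam^(n-1) - ... - phi_n, sorted for comparison."""
    phi = np.asarray(phi, dtype=float)
    return np.sort(np.roots(np.concatenate(([1.0], -phi))))

def ar_transition_matrix(phi):
    """Transition matrix of Eq. (3): shifted identity on top, (phi_p, ..., phi_1) as the last row."""
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    M = np.zeros((p, p))
    M[:-1, 1:] = np.eye(p - 1)
    M[-1, :] = phi[::-1]
    return M

A = np.array([[0.9, 0.2], [0.0, 0.5]])             # eigenvalues 0.9 and 0.5
phi = ar_params_from_matrix(A)                     # approximately [1.4, -0.45]
lam = eigenvalues_from_ar_params(phi)              # approximately [0.5, 0.9]
lam_companion = np.sort(np.linalg.eigvals(ar_transition_matrix(phi)))
```

Because the map from AR parameters to eigenvalues is just polynomial root finding, AR parameters estimated from data can be turned into eigenvalue estimates with `eigenvalues_from_ar_params` without identifying $B$, $C$, or $D$.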

# Clusters  Method    Adj. Mutual Info.    Adj. Rand Score      V-measure          Runtime (secs)
2           AR        0.06 (0.04-0.08)     0.07 (0.05-0.09)     0.07 (0.05-0.09)   1.09 (1.01-1.17)
2           ARMA      0.13 (0.11-0.16)     0.16 (0.13-0.19)     0.14 (0.11-0.16)   0.44 (0.40-0.47)
2           ARMA_MLE  0.02 (0.01-0.03)     0.02 (0.01-0.03)     0.03 (0.02-0.04)   70.64 (68.01-73.28)
2           DTW       0.02 (0.01-0.03)     0.02 (0.01-0.03)     0.03 (0.02-0.04)   6.60 (6.34-6.86)
2           k-Shape   0.03 (0.02-0.04)     0.03 (0.02-0.05)     0.04 (0.03-0.05)   28.58 (25.15-32.01)
2           LDS       0.09 (0.06-0.12)     0.09 (0.06-0.12)     0.10 (0.07-0.12)   341.08 (328.24-353.93)
2           PCA       -0.00 (-0.00-0.00)   -0.00 (-0.00-0.00)   0.02 (0.02-0.02)   0.45 (0.43-0.47)
3           AR        0.11 (0.09-0.12)     0.09 (0.07-0.10)     0.12 (0.11-0.14)   1.02 (0.93-1.10)
3           ARMA      0.18 (0.16-0.20)     0.16 (0.14-0.18)     0.19 (0.17-0.21)   0.42 (0.38-0.46)
3           ARMA_MLE  0.04 (0.03-0.05)     0.04 (0.03-0.05)     0.06 (0.05-0.07)   72.58 (69.63-75.52)
3           DTW       0.04 (0.03-0.05)     0.03 (0.02-0.03)     0.07 (0.06-0.08)   6.63 (6.37-6.90)
3           k-Shape   0.06 (0.05-0.07)     0.04 (0.04-0.05)     0.08 (0.07-0.09)   40.10 (34.65-45.55)
3           LDS       0.20 (0.18-0.23)     0.17 (0.15-0.20)     0.22 (0.19-0.24)   338.67 (325.66-351.67)
3           PCA       0.00 (-0.00-0.00)    0.00 (-0.00-0.00)    0.04 (0.04-0.04)   0.47 (0.45-0.49)
5           AR        0.17 (0.16-0.18)     0.11 (0.10-0.12)     0.22 (0.21-0.23)   0.91 (0.83-1.00)
5           ARMA      0.22 (0.21-0.23)     0.15 (0.14-0.16)     0.26 (0.25-0.28)   0.40 (0.35-0.45)
5           ARMA_MLE  0.08 (0.07-0.09)     0.05 (0.04-0.05)     0.14 (0.13-0.15)   74.00 (71.70-76.30)
5           DTW       0.05 (0.04-0.06)     0.03 (0.02-0.03)     0.11 (0.10-0.12)   6.20 (5.86-6.55)
5           k-Shape   0.08 (0.07-0.09)     0.05 (0.04-0.05)     0.14 (0.13-0.15)   97.83 (84.97-110.68)
5           LDS       0.25 (0.23-0.26)     0.17 (0.16-0.18)     0.29 (0.28-0.30)   321.40 (304.17-338.64)
5           PCA       0.01 (0.01-0.01)     0.00 (0.00-0.01)     0.08 (0.07-0.08)   0.52 (0.49-0.55)
10          AR        0.22 (0.21-0.22)     0.11 (0.11-0.12)     0.38 (0.37-0.38)   0.87 (0.77-0.98)
10          ARMA      0.24 (0.23-0.25)     0.14 (0.13-0.15)     0.39 (0.39-0.40)   0.42 (0.37-0.48)
10          ARMA_MLE  0.11 (0.10-0.12)     0.06 (0.05-0.06)     0.29 (0.28-0.30)   63.39 (60.49-66.30)
10          DTW       0.06 (0.06-0.07)     0.03 (0.02-0.03)     0.25 (0.24-0.25)   5.51 (5.03-5.98)
10          k-Shape   0.08 (0.07-0.08)     0.04 (0.03-0.04)     0.26 (0.26-0.27)   108.79 (93.41-124.18)
10          LDS       0.23 (0.22-0.24)     0.13 (0.12-0.13)     0.39 (0.38-0.40)   277.45 (253.38-301.51)
10          PCA       0.02 (0.01-0.02)     0.01 (0.00-0.01)     0.14 (0.13-0.15)   0.47 (0.44-0.51)

Table 1: Performance of clustering 100 random 2-dimensional LDSs based on their output series of length 1000, with 95% confidence intervals from 100 trials. AMI, Adj. Rand, and V-measure are the adjusted mutual information score, adjusted Rand score, and V-measure between ground truth cluster labels and learned clusters. The runtime is on an instance with 12 CPUs and 40 GB memory running Ubuntu 18.
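For readers unfamiliar with the adjusted Rand score reported in the table, the sketch below computes it from the pair-counting formula of [Hubert and Arabie, 1985]. It is an independent NumPy re-implementation for illustration only; the paper reports values from sklearn, and the function name `adjusted_rand_index` is hypothetical.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)
    # Contingency table: entry (a, b) counts items in true cluster a and predicted cluster b.
    contingency = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(contingency, (t, p), 1)

    def pairs(x):
        return x * (x - 1) / 2.0            # number of unordered pairs among x items

    sum_cells = pairs(contingency).sum()
    sum_rows = pairs(contingency.sum(axis=1)).sum()
    sum_cols = pairs(contingency.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(labels_true))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:               # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return float((sum_cells - expected) / (max_index - expected))

perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])    # same partition, relabeled
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])    # partition that splits every true cluster
```

The score is invariant to relabeling of clusters and is corrected for chance, which is why values near zero in the table indicate clusterings that are no better than random assignment.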

Figure 1: Absolute $\ell_2$-error in eigenvalue estimation for 2-dimensional and 3-dimensional LDSs, with 95% confidence interval from 500 trials.

Method    AMI                Adj. Rand Score    V-measure
ARMA      0.12 (0.10-0.13)   0.12 (0.11-0.14)   0.14 (0.12-0.16)
AR        0.10 (0.09-0.12)   0.10 (0.09-0.12)   0.13 (0.11-0.14)
PCA       0.04 (0.03-0.05)   0.02 (0.02-0.03)   0.07 (0.06-0.08)
LDS       0.09 (0.07-0.10)   0.11 (0.10-0.13)   0.10 (0.09-0.11)
k-Shape   0.08 (0.07-0.10)   0.09 (0.07-0.11)   0.10 (0.09-0.12)
DTW       0.03 (0.02-0.04)   0.02 (0.01-0.03)   0.06 (0.05-0.07)

Table 2: Clustering performance on electrocardiogram (ECG) data separating segments of normal sinus rhythm from supraventricular tachycardia. 95% Confidence intervals are from 100 bootstrapped samples of 50 series.

6.2 Metrics of Cluster Quality

We measure cluster quality using three metrics in sklearn: V-measure [Rosenberg and Hirschberg, 2007], adjusted mutual information [Vinh et al., 2010], and adjusted Rand score [Hubert and Arabie, 1985].

6.3 Simulation

Dataset. We generate LDSs representing cluster centers with random matrices of i.i.d. Gaussian entries. From the cluster centers, we derive LDSs that are close to the cluster centers. From each LDS, we generate time series of length 1000 by drawing inputs from standard Gaussians and adding noise to the output sampled from $N(0, 0.01^2)$. More details are in Appendix D.1.

Clustering performance. The iterated ARMA regression method and the LDS method yield the best clustering quality, while the iterated ARMA regression method is significantly faster. These results hold up for different choices of cluster quality metric and number of clusters.

Eigenvalue Estimation. Good clustering results rely on good approximations of the LDS eigenvalue distance. Our analyses in Theorem 5.1 and Theorem 4.2 proved that the iterated ARMA regression algorithm can learn the LDS eigenvalues with convergence rate $O_p(T^{-1/2})$. In Figure 1, we see that the observed convergence rate in simulations roughly matches the theoretical bound.

Each EM step in LDS runs in $O(n^3 T)$. When running a constant number of EM steps, LDS has the same total complexity as iterated ARMA. We chose 100 steps based on empirical evaluation of convergence for sequence length 1000. However, longer sequences may need more EM steps to converge, which would explain the increase in LDS eigenvalue estimation error for sequence length 50000 in Figure 1. Depending on the implementation and initialization scheme, it is possible that the LDS performance can be further optimized.

ARMA and LDS have comparable eigenvalue estimation error for most configurations. While the pure AR approach also gives comparable estimation error on relatively short sequences, its estimation is biased, and the error does not go down as sequence length increases.

6.4 Real-world ECG data

While our simulation results show the efficacy of our method, the data generation process satisfies assumptions that may not hold on real data. As a proof-of-concept, we also test our method on real electrocardiogram (ECG) data.

Dataset. The MIT-BIH [Moody and Mark, 2001] dataset in PhysioNet [Goldberger et al., 2000] is the most common dataset for evaluating algorithms for ECG data [De Chazal et al., 2004, Yeh et al., 2012, Özbay et al., 2006, Ceylan et al., 2009].

It contains 48 half-hour recordings collected at the Beth Israel Hospital between 1975 and 1979. Each two-channel recording is digitized at a rate of 360 samples per second per channel. 15 distinct rhythms, including abnormalities of cardiac rhythm (arrhythmias), are annotated in the recordings by two cardiologists.

Detecting cardiac arrhythmias has stimulated research and product applications such as Apple's FDA-approved detection of atrial fibrillation [Turakhia et al., 2018]. ECG data have been modeled with AR and ARIMA models [Kalpakis et al., 2001, Corduas and Piccolo, 2008, Ge et al., 2002], and more recently convolutional neural networks [Hannun et al., 2019].

We bootstrap 100 samples of 50 time series; each bootstrapped sample consists of 2 labeled clusters: 25 series with supraventricular tachycardia and 25 series with normal sinus rhythm. Each series has length 500, which adequately captures a complete cardiac cycle. We set the ARMA $\ell_2$-regularization coefficient to be 0.01, chosen based on our simulation results.

Results. Comparing methods outlined in Section 6.1, Table 2 shows that our method achieves the best quality, closely followed by the AR and LDS methods, according to adjusted mutual information, adjusted Rand score and V-measure, while being computationally efficient.

7 Conclusion

We give a fast, simple, and provably effective method to estimate linear dynamical system (LDS) eigenvalues based on system outputs. The algorithm combines statistical techniques from the 1980s with our insights on the correspondence between LDSs and ARMA models.

As a proof-of-concept, we apply the eigenvalue estimation algorithm to time series clustering. The resulting clustering approach is flexible enough to handle varying lengths, temporal offsets, as well as multidimensional inputs and outputs. Our efficient algorithm yields high quality clusters in simulations and on real ECG data.

While LDSs are general models encompassing mixtures of Gaussians and hidden Markov models, they may not fit all applications. It would be interesting to extend the analysis to non-linear models, and to consider model overparameterization and misspecification.

Acknowledgments We thank the anonymous reviewers for thoughtful comments, and Scott Linderman, Andrew Dai, and Eamonn Keogh for helpful guidance.

References

[Afsari et al., 2012] Afsari, B., Chaudhry, R., Ravichandran, A., and Vidal, R. (2012). Group action induced distances for averaging and clustering linear dynamical systems with applications to the analysis of dynamic scenes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2208–2215. IEEE.
[Åström and Wittenmark, 2013] Åström, K. J. and Wittenmark, B. (2013). Computer-controlled systems: theory and design. Courier Corporation.
[Beauzamy, 1999] Beauzamy, B. (1999). How the roots of a polynomial vary with its coefficients: A local quantitative result. Canadian Mathematical Bulletin, 42(1):3–12.
[Bercu, 1995] Bercu, B. (1995). Weighted estimation and tracking for armax models. SIAM Journal on Control and Optimization, 33(1):89–106.
[Biernacki et al., 2000] Biernacki, C., Celeux, G., and Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE transactions on pattern analysis and machine intelligence, 22(7):719–725.
[Ceylan et al., 2009] Ceylan, R., Özbay, Y., and Karlik, B. (2009). A novel approach for classification of ecg arrhythmias: Type-2 fuzzy clustering neural network. Expert Systems with Applications, 36(3):6721–6726.
[Chan and Vasconcelos, 2005] Chan, A. B. and Vasconcelos, N. (2005). Probabilistic kernels for the classification of auto-regressive visual processes. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 846–851. IEEE.
[Choi, 2012] Choi, B. (2012). ARMA model identification. Springer Science & Business Media.
[Corduas and Piccolo, 2008] Corduas, M. and Piccolo, D. (2008). Time series clustering and classification by the autoregressive metric. Computational Statistics & Data Analysis, 52:1860–1872.
[Cuturi and Blondel, 2017] Cuturi, M. and Blondel, M. (2017). Soft-dtw: a differentiable loss function for time-series. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 894–903. JMLR.org.
[De Chazal et al., 2004] De Chazal, P., O'Dwyer, M., and Reilly, R. B. (2004). Automatic classification of heartbeats using ecg morphology and heartbeat interval features. IEEE transactions on biomedical engineering, 51(7):1196–1206.
[Dean et al., 2017] Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. (2017). On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688.
[Durbin, 1959] Durbin, J. (1959). Efficient estimation of parameters in moving-average models. Biometrika, 46(3/4):306–316.
[Durbin, 1960] Durbin, J. (1960). Estimation of parameters in time-series regression models. Journal of the Royal Statistical Society: Series B (Methodological), 22(1):139–153.
[El-Mikkawy, 2003] El-Mikkawy, M. E. (2003). Explicit inverse of a generalized vandermonde matrix. Applied mathematics and computation, 146(2-3):643–651.
[Ge et al., 2002] Ge, D., Srinivasan, N., and M Krishnan, S. (2002). Cardiac arrhythmia classification using autoregressive modeling. Biomedical engineering online, 1:5.
[Ghahramani and Hinton, 1996] Ghahramani, Z. and Hinton, G. E. (1996). Parameter estimation for linear dynamical systems. Technical report, Technical Report CRG-TR-96-2, University of Toronto, Dept. of Computer Science.
[Goldberger et al., 2000] Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220. Circulation Electronic Pages: http://circ.ahajournals.org/content/101/23/e215.full PMID:1085218; doi: 10.1161/01.CIR.101.23.e215.
[Granger and Morris, 1976] Granger, C. W. and Morris, M. J. (1976). Time series modelling and interpretation. Journal of the Royal Statistical Society: Series A (General), 139(2):246–257.
[Greenbaum et al., 2019] Greenbaum, A., Li, R.-c., and Overton, M. L. (2019). First-order perturbation theory for eigenvalues and eigenvectors. arXiv preprint arXiv:1903.00785.
[Guo, 1996] Guo, L. (1996). Self-convergence of weighted least-squares with applications to stochastic adaptive control. IEEE transactions on automatic control, 41(1):79–89.

[Hannan et al., 1980] Hannan, E. J., Dunsmuir, W. T., and Deistler, M. (1980). Estimation of vector armax models. Journal of Multivariate Analysis, 10(3):275–295.
[Hannun et al., 2019] Hannun, A. Y., Rajpurkar, P., Haghpanahi, M., Tison, G. H., Turakhia, M. P., Bourn, C., and Ng, A. Y. (2019). Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine, 15.
[Hardt et al., 2018] Hardt, M., Ma, T., and Recht, B. (2018). Gradient descent learns linear dynamical systems. The Journal of Machine Learning Research, 19(1):1025–1068.
[Hazan et al., 2018] Hazan, E., Lee, H., Singh, K., Zhang, C., and Zhang, Y. (2018). Spectral filtering for general linear dynamical systems. In Advances in Neural Information Processing Systems, pages 4634–4643.
[Hryniv and Lancaster, 1999] Hryniv, R. and Lancaster, P. (1999). On the perturbation of analytic matrix functions. Integral Equations and Operator Theory, 34(3):325–338.
[Hubert and Arabie, 1985] Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1):193–218.
[Ipsen and Rehman, 2008] Ipsen, I. C. and Rehman, R. (2008). Perturbation bounds for determinants and characteristic polynomials. SIAM Journal on Matrix Analysis and Applications, 30(2):762–776.
[Johnson and Linderman, 2018] Johnson, M. and Linderman, S. (2018). Pylds: Bayesian inference for linear dynamical systems. https://github.com/mattjj/pylds.
[Kailath, 1980] Kailath, T. (1980). Linear systems, volume 156. Prentice-Hall Englewood Cliffs, NJ.
[Kalman, 1960] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45.
[Kalpakis et al., 2001] Kalpakis, K., Gada, D., and Puttagunta, V. (2001). Distance measures for effective clustering of arima time-series. In Proceedings 2001 IEEE international conference on data mining, pages 273–280. IEEE.
[Lai and Wei, 1983] Lai, T. and Wei, C. (1983). Asymptotic properties of general autoregressive models and strong consistency of least-squares estimates of their parameters. Journal of multivariate analysis, 13(1):1–23.
[Lancaster et al., 2003] Lancaster, P., Markus, A., and Zhou, F. (2003). Perturbation theory for analytic matrix functions: the semisimple case. SIAM Journal on Matrix Analysis and Applications, 25(3):606–626.
[Meert, 2018] Meert, Wannes, e. a. (2018). Dtaidistance. https://dtaidistance.readthedocs.io/en/latest/.
[Moody and Mark, 2001] Moody, G. and Mark, R. (2001). The impact of the mit-bih arrhythmia database. IEEE engineering in medicine and biology magazine: the quarterly magazine of the Engineering in Medicine & Biology Society, 20:45–50.
[Novikov, 2019] Novikov, A. (2019). Pyclustering: data mining library. Journal of Open Source Software, 4(36):1230.
[Özbay et al., 2006] Özbay, Y., Ceylan, R., and Karlik, B. (2006). A fuzzy clustering neural network architecture for classification of ecg arrhythmias. Computers in Biology and Medicine, 36(4):376–388.
[Paparrizos and Gravano, 2015] Paparrizos, J. and Gravano, L. (2015). k-shape: Efficient and accurate clustering of time series. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1855–1870. ACM.
[Pedregosa et al., 2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830.
[Rosenberg and Hirschberg, 2007] Rosenberg, A. and Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL).
[Roweis and Ghahramani, 1999] Roweis, S. and Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural computation, 11(2):305–345.

[Seabold and Perktold, 2010] Seabold, S. and Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th Python in Science Conference, volume 57, page 61. Scipy.
[Simchowitz et al., 2018] Simchowitz, M., Mania, H., Tu, S., Jordan, M. I., and Recht, B. (2018). Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334.
[Smyth, 1997] Smyth, P. (1997). Clustering sequences with hidden markov models. In Advances in neural information processing systems, pages 648–654.
[Stoica et al., 1988] Stoica, P., Friedlander, B., and Söderström, T. (1988). A high-order yule-walker method for estimation of the ar parameters of an arma model. Systems & control letters, 11(2):99–105.
[Stoica et al., 2005] Stoica, P., Moses, R. L., et al. (2005). Spectral analysis of signals. Pearson Prentice Hall Upper Saddle River, NJ.
[Tavenard, 2017] Tavenard, R. (2017). tslearn: A machine learning toolkit dedicated to time-series data. URL https://github.com/rtavenar/tslearn.
[Tiao and Tsay, 1983] Tiao, G. C. and Tsay, R. S. (1983). Consistency properties of least squares estimates of autoregressive parameters in arma models. The Annals of Statistics, pages 856–871.
[Tsay and Tiao, 1984] Tsay, R. S. and Tiao, G. C. (1984). Consistent estimates of autoregressive parameters and extended sample autocorrelation function for stationary and nonstationary arma models. Journal of the American Statistical Association, 79(385):84–96.
[Tsiamis and Pappas, 2019] Tsiamis, A. and Pappas, G. J. (2019). Finite sample analysis of stochastic system identification. arXiv preprint arXiv:1903.09122.
[Turakhia et al., 2018] Turakhia, M., Desai, M., Hedlin, H., Rajmane, A., Talati, N., Ferris, T., Desai, S., Nag, D., Patel, M., Kowey, P., Rumsfeld, J., M. Russo, A., True Hills, M., B. Granger, C., W. Mahaffey, K., and Perez, M. (2018). Rationale and design of a large-scale, app-based study to identify cardiac arrhythmias using a smartwatch: The apple heart study. American Heart Journal, 207.
[Vinh et al., 2010] Vinh, N. X., Epps, J., and Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct):2837–2854.
[Vishwanathan et al., 2007] Vishwanathan, S., Smola, A. J., and Vidal, R. (2007). Binet-cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 73(1):95–119.
[Yeh et al., 2012] Yeh, Y.-C., Chiou, C. W., and Lin, H.-J. (2012). Analyzing ecg for cardiac arrhythmia using cluster analysis. Expert Systems with Applications, 39(1):1000–1010.

A Proofs for model equivalence

In this section, we prove a generalization of Theorem 4.1 for both LDSs with observed inputs and LDSs with hidden inputs.

A.1 Preliminaries

Sum of ARMA processes. It is known that the sum of ARMA processes is still an ARMA process.

Lemma A.1 (Main Theorem in [Granger and Morris, 1976]). The sum of two independent stationary series generated by ARMA($p$, $m$) and ARMA($q$, $n$) is generated by ARMA($x$, $y$), where $x \le p + q$ and $y \le \max(p + n, q + m)$.

In shorthand notation, ARMA($p$, $m$) + ARMA($q$, $n$) = ARMA($p + q$, $\max(p + n, q + m)$).

When two ARMAX processes share the same exogenous input series, the dependency on exogenous input is additive, and the above can be extended to ARMAX($p$, $m$, $r$) + ARMAX($q$, $n$, $s$) = ARMAX($p + q$, $\max(p + n, q + m)$, $\max(r, s)$).

Jordan canonical form and canonical basis. Every square real matrix is similar to a complex block diagonal matrix known as its Jordan canonical form (JCF). In the special case of diagonalizable matrices, the JCF is the same as the diagonal form. Based on the JCF, there exists a canonical basis $\{e_i\}$ consisting only of eigenvectors and generalized eigenvectors of $A$. A vector $v$ is a generalized eigenvector of rank $\mu$ with corresponding eigenvalue $\lambda$ if $(\lambda I - A)^{\mu} v = 0$ and $(\lambda I - A)^{\mu - 1} v \neq 0$.

Relating the canonical basis to the characteristic polynomial, the characteristic polynomial can be completely factored into linear factors $\chi_A(\lambda) = (\lambda - \lambda_1)^{\mu_1} (\lambda - \lambda_2)^{\mu_2} \cdots (\lambda - \lambda_r)^{\mu_r}$ over $\mathbb{C}$. The complex roots $\lambda_1, \cdots, \lambda_r$ are eigenvalues of $A$. For each eigenvalue $\lambda_i$, there exist $\mu_i$ linearly independent generalized eigenvectors $v$ such that $(\lambda_i I - A)^{\mu_i} v = 0$.

A.2 General model equivalence theorem

Now we state Theorem A.1, a more detailed version of Theorem 4.1.

Theorem A.1. For any linear dynamical system with parameters $\Theta = (A, B, C, D)$, hidden dimension $n$, inputs $x_t \in \mathbb{R}^k$, and outputs $y_t \in \mathbb{R}^m$, the outputs $y_t$ satisfy

$$\chi_A^{\dagger}(L)\, y_t = \chi_A^{\dagger}(L)\, \xi_t + \Gamma(L)\, x_t, \tag{5}$$

where $L$ is the lag operator, $\chi_A^{\dagger}(L) = L^n \chi_A(L^{-1})$ is the reciprocal polynomial of the characteristic polynomial of $A$, and $\Gamma(L)$ is an $m$-by-$k$ matrix of polynomials of degree $n - 1$.

This implies that each dimension of $y_t$ can be generated by an ARMAX($n$, $n$, $n - 1$) model, where the autoregressive parameters are the characteristic polynomial coefficients in reverse order and with negated sign.

To prove the theorem, we introduce a lemma to analyze the autoregressive behavior of the hidden state projected onto a generalized eigenvector direction.

Lemma A.2. Consider a linear dynamical system with parameters $\Theta = (A, B, C, D)$, hidden states $h_t \in \mathbb{R}^n$, inputs $x_t \in \mathbb{R}^k$, and outputs $y_t \in \mathbb{R}^m$ as defined in (1). For any generalized eigenvector $e_i$ of $A^*$ with eigenvalue $\lambda$ and rank $\mu$, the lag operator polynomial $(1 - \lambda L)^{\mu}$ applied to the time series $h_t^{(i)} := \langle h_t, e_i \rangle$ results in

$$(1 - \lambda L)^{\mu} h_t^{(i)} = \text{linear transformation of } x_t, \cdots, x_{t-\mu+1}.$$

Proof. To expand the LHS, first observe that

$$\begin{aligned}
(1 - \lambda L) h_t^{(i)} &= (1 - \lambda L) \langle h_t, e_i \rangle \\
&= \langle h_t, e_i \rangle - \lambda L \langle h_t, e_i \rangle \\
&= \langle A h_{t-1} + B x_t, e_i \rangle - \langle h_{t-1}, \lambda e_i \rangle \\
&= \langle h_{t-1}, (A^* - \lambda I) e_i \rangle + \langle B x_t, e_i \rangle.
\end{aligned}$$

We can apply $(1 - \lambda L)$ again similarly to obtain

$$(1 - \lambda L)^2 h_t^{(i)} = \langle h_{t-2}, (A^* - \lambda I)^2 e_i \rangle + \langle B x_{t-1}, (A^* - \lambda I) e_i \rangle + (1 - \lambda L) \langle B x_t, e_i \rangle,$$

and in general we can show inductively that

$$(1 - \lambda L)^k h_t^{(i)} - \langle h_{t-k}, (A^* - \lambda I)^k e_i \rangle = \sum_{j=0}^{k-1} (1 - \lambda L)^{k-1-j} L^j \langle B x_t, (A^* - \lambda I)^j e_i \rangle,$$

where the RHS is a linear transformation of $x_t, \cdots, x_{t-k+1}$.

Since $(\lambda I - A^*)^{\mu} e_i = 0$ by definition of generalized eigenvectors, $\langle h_{t-\mu}, (A^* - \lambda I)^{\mu} e_i \rangle = 0$, and hence $(1 - \lambda L)^{\mu} h_t^{(i)}$ itself is a linear transformation of $x_t, \cdots, x_{t-\mu+1}$.

Proof for Theorem A.1. Using Lemma A.2 and the canonical basis, we can prove Theorem A.1.

Proof. Let $\lambda_1, \cdots, \lambda_r$ be the eigenvalues of $A$ with multiplicity $\mu_1, \cdots, \mu_r$. Since $A$ is a real-valued matrix, its adjoint $A^*$ has the same characteristic polynomial and eigenvalues as $A$. There exists a canonical basis $\{e_i\}_{i=1}^n$ for $A^*$, where $e_1, \cdots, e_{\mu_1}$ are generalized eigenvectors

with eigenvalue $\lambda_1$, $e_{\mu_1+1}, \cdots, e_{\mu_1+\mu_2}$ are generalized eigenvectors with eigenvalue $\lambda_2$, and so on, and $e_{\mu_1+\cdots+\mu_{r-1}+1}, \cdots, e_{\mu_1+\cdots+\mu_r}$ are generalized eigenvectors with eigenvalue $\lambda_r$.

By Lemma A.2, $(1 - \lambda_1 L)^{\mu_1} h_t^{(i)}$ is a linear transformation of $x_t, \cdots, x_{t-\mu_1+1}$ for $i = 1, \cdots, \mu_1$; $(1 - \lambda_2 L)^{\mu_2} h_t^{(i)}$ is a linear transformation of $x_t, \cdots, x_{t-\mu_2+1}$ for $i = \mu_1 + 1, \cdots, \mu_1 + \mu_2$; and so on; $(1 - \lambda_r L)^{\mu_r} h_t^{(i)}$ is a linear transformation of $x_t, \cdots, x_{t-\mu_r+1}$ for $i = \mu_1 + \cdots + \mu_{r-1} + 1, \cdots, n$.

We then apply the lag operator polynomial $\prod_{j \neq i} (1 - \lambda_j L)^{\mu_j}$ to both sides of each equation. The lag polynomial on the LHS becomes $(1 - \lambda_1 L)^{\mu_1} \cdots (1 - \lambda_r L)^{\mu_r} = \chi_A^{\dagger}(L)$. For the RHS, since $\prod_{j \neq i} (1 - \lambda_j L)^{\mu_j}$ is of degree $n - \mu_i$, it lags the RHS by at most $n - \mu_i$ additional steps, and the RHS becomes a linear transformation of $x_t, \cdots, x_{t-n+1}$.

Thus, for each $i$, $\chi_A^{\dagger}(L) h_t^{(i)}$ is a linear transformation of $x_t, \cdots, x_{t-n+1}$.

The outputs of the LDS are defined as $y_t = C h_t + D x_t + \xi_t = \sum_{i=1}^n h_t^{(i)} C e_i + D x_t + \xi_t$. By linearity, and since $\chi_A^{\dagger}(L)$ is of degree $n$, both $\chi_A^{\dagger}(L) \sum_{i=1}^n h_t^{(i)} C e_i$ and $\chi_A^{\dagger}(L) D x_t$ are linear transformations of $x_t, \cdots, x_{t-n}$. We can write any such linear transformation as $\Gamma(L) x_t$ for some $m$-by-$k$ matrix $\Gamma(L)$ of polynomials of degree $n - 1$. Thus, as desired,

$$\chi_A^{\dagger}(L)\, y_t = \chi_A^{\dagger}(L)\, \xi_t + \Gamma(L)\, x_t.$$

Assuming that there are no common factors in $\chi_A^{\dagger}$ and $\Gamma$, $\chi_A^{\dagger}$ is then the lag operator polynomial that represents the autoregressive part of $y_t$. This assumption is the same as saying that $y_t$ cannot be expressed as a lower-order ARMA process. The reciprocal polynomial has the same coefficients in reverse order as the original polynomial. According to the lag operator polynomial on the LHS, $1 - \varphi_1 L - \varphi_2 L^2 - \cdots - \varphi_n L^n = \chi_A^{\dagger}(L)$, and $L^n - \varphi_1 L^{n-1} - \cdots - \varphi_n = \chi_A(L)$, so the $i$-th order autoregressive parameter $\varphi_i$ is the negative value of the $(n - i)$-th order coefficient in the characteristic polynomial $\chi_A$.

A.3 The hidden input case as a corollary

The statement about LDS without external inputs in Theorem 4.1 comes as a corollary to Theorem A.1, with a short proof here.

Proof. Define $y'_t = C h_t + D x_t$ to be the output without noise, i.e. $y_t = y'_t + \xi_t$. By Theorem A.1, $\chi_A^{\dagger}(L) y'_t = \Gamma(L) x_t$. Since we assume the hidden inputs $x_t$ are i.i.d. Gaussians, $y'_t$ is then generated by an ARMA($n$, $n - 1$) process with autoregressive polynomial $\chi_A^{\dagger}(L)$.

The output noise $\xi_t$ itself can be seen as an ARMA(0, 0) process. By Lemma A.1, ARMA($n$, $n - 1$) + ARMA(0, 0) = ARMA($n + 0$, $\max(n + 0, n - 1 + 0)$) = ARMA($n$, $n$). Hence the outputs $y_t$ are generated by an ARMA($n$, $n$) process as claimed in Theorem 4.1. It is easy to see in the proof of Lemma A.1 that the autoregressive parameters do not change when adding a white noise [Granger and Morris, 1976].

B Proof for eigenvalue approximation theorems

Here we restate Theorem 4.2 and Theorem 4.3 together, and prove the combined statement in three steps: 1) the general case, 2) the simple eigenvalue case, and 3) the explicit condition number bounds for the simple eigenvalue case.

Theorem B.1. Suppose $y_t$ are the outputs from an $n$-dimensional latent linear dynamical system with parameters $\Theta = (A, B, C, D)$ and eigenvalues $\lambda_1, \cdots, \lambda_n$. Let $\hat{\Phi} = (\hat{\varphi}_1, \cdots, \hat{\varphi}_n)$ be the estimated autoregressive parameters with error $\|\hat{\Phi} - \Phi\| = \epsilon$, and let $r_1, \cdots, r_n$ be the roots of the polynomial $1 - \hat{\varphi}_1 z - \cdots - \hat{\varphi}_n z^n$.

Assuming the LDS is observable, the roots converge to the true eigenvalues with convergence rate $O(\epsilon^{1/n})$. If all eigenvalues of $A$ are simple (i.e. multiplicity 1), then the convergence rate is $O(\epsilon)$. If $A$ is symmetric, Lyapunov stable (spectral radius at most 1), and only has simple eigenvalues, then

$$|r_j - \lambda_j| \le \frac{\sqrt{n}\, 2^{\frac{n-1}{2}}}{\prod_{k \neq j} |\lambda_j - \lambda_k|}\, \epsilon + O(\epsilon^2).$$

B.1 General (1/n)-exponent bound

This is a known perturbation bound on polynomial root finding due to Ostrowski [Beauzamy, 1999].

Lemma B.1. Let $\Phi(z) = z^n + \varphi_1 z^{n-1} + \cdots + \varphi_{n-1} z + \varphi_n$ and $\Psi(z) = z^n + \psi_1 z^{n-1} + \cdots + \psi_{n-1} z + \psi_n$ be two polynomials of degree $n$. If $\|\Phi - \Psi\|_2 < \epsilon$, then the roots $(r_k)$ of $\Phi$ and roots $(\tilde{r}_k)$ of $\Psi$ under suitable order satisfy

$$|r_k - \tilde{r}_k| \le 4 C p\, \epsilon^{1/n},$$

where $C = \max_{1, 0 \le k \le n} \{ |\varphi_n|^{1/n}, |\psi_n|^{1/n} \}$.

The general $O(\epsilon^{1/n})$ convergence rate in Theorem 4.2 follows directly from Lemma B.1 and Theorem 4.1.

B.2 Bound for simple eigenvalues

The $\frac{1}{n}$-exponent in the above bound might not seem ideal, but without additional assumptions the

$\frac{1}{n}$-exponent is tight. As an example, the polynomial $x^2 - \epsilon$ has roots $x = \pm\sqrt{\epsilon}$. This is a general phenomenon: a root with multiplicity $m$ could split into $m$ roots at rate $O(\epsilon^{1/m})$. It is related to the regular splitting property [Hryniv and Lancaster, 1999, Lancaster et al., 2003] in matrix eigenvalue perturbation theory.

Under the additional assumption that all the eigenvalues are simple (no multiplicity), we can prove a better bound using the following idea with the companion matrix: a small perturbation in the autoregressive parameters results in a small perturbation in the companion matrix, and a small perturbation in the companion matrix results in a small perturbation in the eigenvalues.

Matrix eigenvalue perturbation theory. The perturbation bound on eigenvalues is a well-studied problem [Greenbaum et al., 2019]. The regular splitting property states that, for an eigenvalue $\lambda_0$ with partial multiplicities $m_1, \cdots, m_k$, an $O(\epsilon)$ perturbation to the matrix could split the eigenvalue into $M = m_1 + \cdots + m_k$ distinct eigenvalues $\lambda_{ij}(\epsilon)$ for $i = 1, \cdots, k$ and $j = 1, \cdots, m_i$, and each eigenvalue $\lambda_{ij}(\epsilon)$ is moved from the original position by $O(\epsilon^{1/m_i})$.

For semi-simple eigenvalues, geometric multiplicity equals algebraic multiplicity. Since geometric multiplicity is the number of partial multiplicities while algebraic multiplicity is the sum of partial multiplicities, for semi-simple eigenvalues all partial multiplicities $m_i = 1$. Therefore, the regular splitting property corresponds to the asymptotic relation in equation 6. It is known that regular splitting holds for any semi-simple eigenvalue even for non-Hermitian matrices.

Lemma B.2 (Theorem 6 in [Lancaster et al., 2003]). Let $L(\lambda, \epsilon)$ be an analytic matrix function with semi-simple eigenvalue $\lambda_0$ at $\epsilon = 0$ of multiplicity $M$. Then there are exactly $M$ eigenvalues $\lambda_i(\epsilon)$ of $L(\lambda, \epsilon)$ for which $\lambda_i(\epsilon) \to \lambda_0$ as $\epsilon \to 0$, and for these eigenvalues

$$\lambda_i(\epsilon) = \lambda_0 + \lambda'_i \epsilon + o(\epsilon). \tag{6}$$

Companion Matrix. Matrix perturbation theory tells us how perturbations on matrices change eigenvalues, while we are interested in how perturbations on polynomial coefficients change roots. To apply matrix perturbation theory to polynomials, we introduce the companion matrix, also known as the controllable canonical form in control theory.

Definition B.1. For a monic polynomial $\Phi(z) = z^n + \varphi_1 z^{n-1} + \cdots + \varphi_{n-1} z + \varphi_n$, the companion matrix of the polynomial is the square matrix

$$C(\Phi) = \begin{pmatrix}
0 & 0 & \cdots & 0 & -\varphi_n \\
1 & 0 & \cdots & 0 & -\varphi_{n-1} \\
0 & 1 & \cdots & 0 & -\varphi_{n-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & -\varphi_1
\end{pmatrix}.$$

The matrix $C(\Phi)$ is the companion in the sense that its characteristic polynomial is equal to $\Phi$.

In relation to a pure autoregressive AR($p$) model, the companion matrix corresponds to the transition matrix in the linear dynamical system when we encode the values from the past $p$ lags as a $p$-dimensional state $h_t = \begin{pmatrix} y_{t-p+1} & \cdots & y_{t-1} & y_t \end{pmatrix}^T$.

If $y_t = \varphi_1 y_{t-1} + \cdots + \varphi_p y_{t-p}$, then

$$h_t = \begin{pmatrix} y_{t-p+1} \\ y_{t-p+2} \\ \vdots \\ y_{t-1} \\ y_t \end{pmatrix}
= \begin{pmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
\varphi_p & \varphi_{p-1} & \varphi_{p-2} & \cdots & \varphi_1
\end{pmatrix}
\begin{pmatrix} y_{t-p} \\ y_{t-p+1} \\ \vdots \\ y_{t-2} \\ y_{t-1} \end{pmatrix}
= C(-\Phi)^T h_{t-1}. \tag{7}$$

Proof of Theorem 4.2 for simple eigenvalues

Proof. Let $y_t$ be the outputs of a linear dynamical system $S$ with only simple eigenvalues, and let $\Phi = (\varphi_1, \cdots, \varphi_n)$ be the ARMAX autoregressive parameters for $y_t$. Let $C(\Phi)$ be the companion matrix of the polynomial $z^n - \varphi_1 z^{n-1} - \varphi_2 z^{n-2} - \cdots - \varphi_n$. The companion matrix is the transition matrix of the LDS described in equation 7. Since this LDS has the same autoregressive parameters and hidden state dimension as the original LDS, by Corollary 4.1 the companion matrix has the same characteristic polynomial as the original LDS, and thus also has simple (and hence also semi-simple) eigenvalues. The $O(\epsilon)$ convergence rate then follows from Lemma B.2 and Theorem 5.1, as the error on ARMAX parameter estimation can be seen as a perturbation on the companion matrix.

A note on the companion matrix. One might hope that we could have a more general result using Lemma B.2 for all systems with semi-simple eigenvalues instead of restricting to matrices with simple eigenvalues. Unfortunately, even if the original linear dynamical system has only semi-simple eigenvalues, in general the companion matrix is not semi-simple unless the original linear dynamical system is simple. This is because the

companion matrix always has its minimal polynomial equal to its characteristic polynomial, and hence has geometric multiplicity 1 for all eigenvalues. This also points to the fact that even though the companion matrix has the form of the controllable canonical form, in general it is not necessarily similar to the transition matrix in the original LDS.

B.3 Explicit bound for condition number

In this subsection, we write out explicitly the condition number for simple eigenvalues in the asymptotic relation $\lambda(\epsilon) = \lambda_0 + \kappa \epsilon + o(\epsilon)$, to show how it varies according to the spectrum. Here we use the notation $\kappa(C, \lambda)$ to denote the condition number for eigenvalue $\lambda$ in companion matrix $C$.

Lemma B.3. For a companion matrix $C$ with simple eigenvalues $\lambda_1, \cdots, \lambda_n$, the eigenvalues $\lambda'_1, \cdots, \lambda'_n$ of the perturbed matrix $C + \delta C$ satisfy

$$|\lambda_j - \lambda'_j| \le \kappa(C, \lambda_j) \|\delta C\|_2 + o(\|\delta C\|_2^2), \tag{8}$$

and the condition number $\kappa(C, \lambda_j)$ is bounded by

$$\frac{1}{\prod_{k \neq j} |\lambda_j - \lambda_k|} \le \kappa(C, \lambda_j) \le \frac{\sqrt{n}}{\prod_{k \neq j} |\lambda_j - \lambda_k|} \left(\max(1, |\lambda_j|)\right)^{n-1} \left(1 + \rho(C)^2\right)^{\frac{n-1}{2}}, \tag{9}$$

where $\rho(C)$ is the spectral radius, i.e. the largest absolute value of its eigenvalues.

In particular, when $\rho(C) \le 1$, i.e. when the matrix is Lyapunov stable,

$$|\lambda_j - \lambda'_j| \le \frac{\sqrt{n}\, (\sqrt{2})^{n-1}}{\prod_{k \neq j} |\lambda_j - \lambda_k|} \|\delta C\|_2 + o(\|\delta C\|_2^2). \tag{10}$$

Proof. For each simple eigenvalue $\lambda$ of the companion matrix $C$ with column eigenvector $v$ and row eigenvector $w^*$, the condition number of the eigenvalue is

$$\kappa(C, \lambda) = \frac{\|w\|_2 \|v\|_2}{|w^* v|}. \tag{11}$$

This is derived from differentiating the eigenvalue equation $Cv = v\lambda$, and multiplying the differentiated equation by $w^*$, which results in

$$w^* (\delta C) v + w^* C (\delta v) = \lambda w^* (\delta v) + w^* v (\delta \lambda).$$

Since $w^* C = \lambda w^*$, the terms involving $\delta v$ cancel, which gives

$$\delta \lambda = \frac{w^* (\delta C) v}{w^* v}.$$

Therefore,

$$|\delta \lambda| \le \frac{\|w\|_2 \|v\|_2}{|w^* v|} \|\delta C\|_2 = \kappa(C, \lambda) \|\delta C\|_2. \tag{12}$$

The companion matrix can be diagonalized as $C = V^{-1} \operatorname{diag}(\lambda_1, \cdots, \lambda_n) V$, where the rows of the Vandermonde matrix $V$ are the row eigenvectors of $C$, while the columns of $V^{-1}$ are the column eigenvectors of $C$. Since the $j$-th row $V_{j,*}$ and the $j$-th column $V^{-1}_{*,j}$ have inner product 1 by definition of the matrix inverse, the condition number is given by

$$\kappa(C, \lambda_j) = \|V_{j,*}\|_2\, \|V^{-1}_{*,j}\|_2. \tag{13}$$

Formula for inverse of Vandermonde matrix. The Vandermonde matrix is defined as

$$V = \begin{pmatrix}
1 & \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{p-1} \\
1 & \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{p-1} \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
1 & \lambda_p & \lambda_p^2 & \cdots & \lambda_p^{p-1}
\end{pmatrix}. \tag{14}$$

The inverse of the Vandermonde matrix $V$ is given by [El-Mikkawy, 2003] using elementary symmetric polynomials:

$$(V^{-1})_{i,j} = \frac{(-1)^{i+j} S_{p-i,j}}{\prod_{k<j} (\lambda_j - \lambda_k) \prod_{k>j} (\lambda_k - \lambda_j)}, \tag{15}$$

where $S_{p-i,j} = S_{p-i}(\lambda_1, \cdots, \lambda_{j-1}, \lambda_{j+1}, \cdots, \lambda_p)$.

Pulling out the common denominator, the $j$-th column vector of $V^{-1}$ is

$$\frac{(-1)^j}{\prod_{k<j} (\lambda_j - \lambda_k) \prod_{k>j} (\lambda_k - \lambda_j)}
\begin{pmatrix} (-1) S_{p-1} \\ (-1)^2 S_{p-2} \\ \vdots \\ (-1)^{p-1} S_1 \\ (-1)^p \end{pmatrix},$$

where the elementary symmetric polynomials are over variables $\lambda_1, \cdots, \lambda_{j-1}, \lambda_{j+1}, \cdots, \lambda_p$.

For example, if $p = 4$, then the 3rd column (up to scaling) would be

$$\frac{-1}{(\lambda_3 - \lambda_1)(\lambda_3 - \lambda_2)(\lambda_4 - \lambda_3)}
\begin{pmatrix} -\lambda_1 \lambda_2 \lambda_4 \\ \lambda_1 \lambda_2 + \lambda_1 \lambda_4 + \lambda_2 \lambda_4 \\ -\lambda_1 - \lambda_2 - \lambda_4 \\ 1 \end{pmatrix}.$$

Bounding the condition number. As discussed before, the condition number for eigenvalue $\lambda_j$ is

$$\kappa(C, \lambda_j) = \|V_{j,*}\|_2\, \|V^{-1}_{*,j}\|_2,$$

where $V_{j,*}$ is the $j$-th row of the Vandermonde matrix $V$ and $V^{-1}_{*,j}$ is the $j$-th column of $V^{-1}$.
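The diagonalization $C = V^{-1} \operatorname{diag}(\lambda) V$, the inverse formula (15), and the condition number (13) can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: the helper names `companion` and `vandermonde_inverse` and the four example eigenvalues are illustrative choices, not part of the paper.

```python
import numpy as np

def companion(coeffs):
    """Companion matrix of z^n + c_1 z^(n-1) + ... + c_n (layout of Definition B.1)."""
    n = len(coeffs)
    C = np.zeros((n, n))
    C[1:, :-1] = np.eye(n - 1)              # ones on the subdiagonal
    C[:, -1] = -np.asarray(coeffs)[::-1]    # last column: -c_n, ..., -c_1
    return C

def vandermonde_inverse(lams):
    """Inverse of V (rows 1, lam_j, ..., lam_j^(p-1)) via equation (15)."""
    p = len(lams)
    V_inv = np.zeros((p, p))
    for j in range(p):                      # 0-based column index
        others = np.delete(lams, j)
        # np.poly returns the coefficients of prod(z - mu); entry m equals
        # (-1)^m * e_m(others), so undo the sign to get e_0, ..., e_{p-1}.
        e = np.poly(others) * (-1.0) ** np.arange(p)
        denom = np.prod(lams[j] - lams[:j]) * np.prod(lams[j + 1:] - lams[j])
        for i in range(p):                  # 0-based row index
            # Shifting both indices by one leaves the sign (-1)^(i+j) unchanged.
            V_inv[i, j] = (-1.0) ** (i + j) * e[p - 1 - i] / denom
    return V_inv

lams = np.array([0.9, 0.5, -0.3, -0.7])     # illustrative simple eigenvalues
p = len(lams)
C = companion(np.poly(lams)[1:])            # companion matrix with spectrum lams
V = np.vander(lams, increasing=True)
V_inv = vandermonde_inverse(lams)

# Equation (13): kappa_j = ||row j of V|| * ||column j of V^{-1}||.
kappa = np.linalg.norm(V, axis=1) * np.linalg.norm(V_inv, axis=0)

# Lower bound of equation (9): 1 / prod_{k != j} |lam_j - lam_k|.
gaps = np.array([np.prod(np.abs(lams[j] - np.delete(lams, j))) for j in range(p)])
print(np.column_stack([lams, 1.0 / gaps, kappa]))
```

Each printed row lists an eigenvalue, its lower bound, and its condition number, so one can see directly that closely spaced eigenvalues are the ill-conditioned ones.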

h i 2 p−1 Theorem 4.3 follows from Lemma B.3, because the By definition Vj,∗ = 1 λj λj ··· λj , so estimation error on the autoregressive parameters can p−1 !1/2 be seen as the perturbation on the companion matrix, X 2i and the companion matrix has the same eigenvalues as kVj,∗k2 = λj . i=0 the original LDS.

−1 −1 Using the above explicit expression for V , kV∗,j k2 = C Iterated regression for ARMAX

$$\frac{1}{\prod_{k \neq j} |\lambda_j - \lambda_k|} \left( \sum_{i=0}^{p-1} S_i^2(\lambda_1, \cdots, \lambda_{j-1}, \lambda_{j+1}, \cdots, \lambda_p) \right)^{1/2}.$$

Therefore,

$$\kappa(C, \lambda_j) = \frac{1}{\prod_{k \neq j} |\lambda_j - \lambda_k|} \left( \sum_{i=0}^{p-1} \lambda_j^{2i} \right)^{1/2} \left( \sum_{i=0}^{p-1} S_i^2(\lambda_1, \cdots, \lambda_{j-1}, \lambda_{j+1}, \cdots, \lambda_p) \right)^{1/2}. \quad (16)$$

Note that both parts under $(\cdots)^{1/2}$ are greater than or equal to 1, so we can bound it below by

$$\kappa(C, \lambda_j) \geq \frac{1}{\prod_{k \neq j} |\lambda_j - \lambda_k|}.$$

We can also bound the two parts from above. The first part can be bounded by

$$\left( \sum_{i=0}^{p-1} \lambda_j^{2i} \right)^{1/2} \leq \sqrt{p} \, \max(1, |\lambda_j|)^{p-1}. \quad (17)$$

For the second part, since

$$|S_i(\lambda_1, \cdots, \lambda_{j-1}, \lambda_{j+1}, \cdots, \lambda_p)| \leq \binom{p-1}{i} |\lambda|_{\max}^{i},$$

we have that

$$\sum_{i=0}^{p-1} S_i^2(\lambda_1, \cdots, \lambda_{j-1}, \lambda_{j+1}, \cdots, \lambda_p) \leq \sum_{i=0}^{p-1} \binom{p-1}{i} |\lambda|_{\max}^{2i} = (1 + |\lambda|_{\max}^2)^{p-1}. \quad (18)$$

Combining equations 17 and 18 for the upper bound, and putting it together with the lower bound,

$$\frac{1}{\prod_{k \neq j} |\lambda_j - \lambda_k|} \leq \kappa(C, \lambda_j) \leq \frac{\sqrt{p}}{\prod_{k \neq j} |\lambda_j - \lambda_k|} \left( \max(1, |\lambda_j|) \right)^{p-1} \left( 1 + \rho(C)^2 \right)^{\frac{p-1}{2}}, \quad (19)$$

as desired.

Algorithm. We generalize Algorithm 1 to accommodate exogenous inputs. Since the exogenous inputs are explicitly observed, including them in the regression does not change the consistency of the estimator.

Theorem A.1 shows that different output channels from the same LDS have the same autoregressive parameters in ARMAX models. Therefore, we can leverage multidimensional outputs by estimating the autoregressive parameters in each channel separately and averaging them.

Algorithm 2: Regularized iterated regression for AR parameter estimation in ARMAX
Input: A time series $\{y_t\}_{t=1}^{T}$ where $y_t \in \mathbb{R}^m$, exogenous input series $\{x_t\}_{t=1}^{T}$ where $x_t \in \mathbb{R}^k$, and guessed hidden state dimension $n$.
for $d = 1, \cdots, m$ do
    Let $y_t^{(d)}$ be the projection of $y_t$ to the $d$-th dimension;
    Initialize error term estimates $\hat{\epsilon}_t = \vec{0} \in \mathbb{R}^m$ for $t = 1, \cdots, T$;
    for $i = 0, \cdots, n$ do
        Perform $\ell_2$-regularized least squares regression on $y_t^{(d)}$ against lagged terms of $y_t^{(d)}$, $x_t$, and $\hat{\epsilon}_t$ to solve for coefficients $\hat{\varphi}_j \in \mathbb{R}$, $\hat{\theta}_j \in \mathbb{R}$, and $\hat{\gamma}_j \in \mathbb{R}^k$ in the linear equation $y_t^{(d)} = c + \sum_{j=1}^{n} \hat{\varphi}_j y_{t-j}^{(d)} + \sum_{j=1}^{n-1} \hat{\gamma}_j x_{t-j} + \sum_{j=1}^{i} \hat{\theta}_j \hat{\epsilon}_{t-j}$, with $\ell_2$-regularization only on $\hat{\theta}_j$;
        Update $\hat{\epsilon}_t$ to be the residuals from the most recent regression;
    end
    Record $\hat{\Phi}^{(d)} = (\hat{\varphi}_1, \cdots, \hat{\varphi}_n)$;
end
Return the average estimate $\hat{\Phi} = \frac{1}{m} (\hat{\Phi}^{(1)} + \cdots + \hat{\Phi}^{(m)})$.

As before, the $i$-th iteration of the regression only uses error terms from the past $i$ lags. In other words, the initial iteration is an ARMAX($n$, 0, $n-1$) regression, the first iteration is an ARMAX($n$, 1, $n-1$) regression, and so forth.
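The per-channel loop of Algorithm 2 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `iterated_regression_armax`, the ridge strength `alpha`, and the choice to solve the penalized problem with an augmented least-squares system are all hypothetical. The synthetic series at the end is likewise made up for the example.

```python
import numpy as np


def iterated_regression_armax(y, x, n, alpha=0.1):
    """Sketch of regularized iterated regression for ARMAX AR parameters.

    y: (T, m) outputs, x: (T, k) exogenous inputs, n: guessed hidden dimension.
    alpha is a hypothetical ridge strength, applied only to the MA coefficients.
    """
    T, m = y.shape
    per_channel = []
    for d in range(m):
        yd = y[:, d]
        eps = np.zeros(T)  # error-term estimates, initialised to zero
        target = yd[n:]
        # Lagged outputs y_{t-1..t-n} and lagged inputs x_{t-1..t-(n-1)}.
        y_lags = np.column_stack([yd[n - j:T - j] for j in range(1, n + 1)])
        x_lags = [x[n - j:T - j] for j in range(1, n)]
        base = np.column_stack([np.ones(T - n), y_lags] + x_lags)
        for i in range(n + 1):
            # Iteration i uses error terms from the past i lags only.
            e_lags = [eps[n - j:T - j] for j in range(1, i + 1)]
            X = np.column_stack([base] + e_lags)
            # Ridge penalty on the last i columns (theta coefficients) only,
            # expressed as extra rows of an ordinary least-squares problem.
            penalty = np.zeros(X.shape[1])
            if i > 0:
                penalty[-i:] = alpha
            A = np.vstack([X, np.diag(np.sqrt(penalty))])
            b = np.concatenate([target, np.zeros(X.shape[1])])
            coef, *_ = np.linalg.lstsq(A, b, rcond=None)
            eps[n:] = target - X @ coef  # residuals of the latest regression
        per_channel.append(coef[1:n + 1])  # columns 1..n hold phi_1..phi_n
    return np.mean(per_channel, axis=0)  # average over the m channels


# Usage on a made-up scalar ARMAX series with AR parameters (0.5, -0.2).
rng = np.random.default_rng(0)
T = 4000
x = rng.normal(size=(T, 1))
e = 0.05 * rng.normal(size=T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.5 * y[t - 1] - 0.2 * y[t - 2] + 0.3 * x[t - 1, 0] + e[t] + 0.4 * e[t - 1]
phi_hat = iterated_regression_armax(y[:, None], x, n=2)
```

Stacking the square-root penalty rows under the design matrix gives the same minimizer as the ridge normal equations while letting `np.linalg.lstsq` handle rank-deficient designs in the unpenalized initial iteration.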

Time complexity. The iterated regression in each dimension involves $n + 1$ steps of least squares regression, each on at most $n(k + 2)$ variables. A least squares regression on $p$ variables and $T$ samples costs $O(p^2 T + p^3)$, and here $p = O(nk)$. Therefore, the total time complexity of Algorithm 2 is $O(nm((nk)^2 T + (nk)^3)) = O(mn^3 k^2 T + mn^4 k^3)$, where $T$ is the sequence length, $n$ the hidden state dimension, $m$ the output dimension, and $k$ the input dimension.

D Additional simulation details

D.1 Synthetic data generation

First, we generate $K$ cluster centers by generating LDSs with random matrices $A$, $B$, $C$ of i.i.d. standard Gaussians. We assume that the output $y_t$ depends only on the hidden state $h_t$ and not on the input $x_t$, i.e. the matrix $D$ is zero. When generating the random LDSs, we require that the spectral radius $\rho(A) \leq 1$, i.e. all eigenvalues of $A$ have absolute value at most 1, and we regenerate a new random matrix if the spectral radius is above 1. Our method also applies to the case of arbitrary spectral radius; this requirement only serves to prevent numeric overflow in the generated sequences. We also require that the $\ell_2$ distance $d(\Theta_1, \Theta_2) = \|\lambda(A_1) - \lambda(A_2)\|_2$ between any two cluster centers be at least 0.2.

Then, we generate 100 LDSs by randomly assigning them to the clusters. To obtain an LDS with assigned cluster center $\Theta = (A_c, B_c, C_c)$, we generate $A'$ by adding i.i.d. Gaussians to each entry of $A_c$, while $B'$ and $C'$ are new random matrices of i.i.d. standard Gaussians. The standard deviation of the i.i.d. Gaussians for $A' - A_c$ is chosen such that the average distance to cluster centers is less than half of the inter-cluster distance between centers.

For each LDS, we generate a sequence by drawing hidden inputs $x_t \sim N(0, 1)$ and adding noise $\xi_t \sim N(0, 0.01^2)$ to the outputs.

D.2 Empirical correlation between AR distance and LDS distance

Theorem 4.2 shows that LDSs with similar AR parameters also have similar eigenvalues. The converse of Theorem 4.2 is also true: dynamical systems with small eigenvalue distance have small autoregressive parameter distance, which follows from perturbation bounds for characteristic polynomials [Ipsen and Rehman, 2008]. Figure 2 shows simulation results where the AR parameter distance and the LDS eigenvalue distance are highly correlated.

Figure 2: The eigenvalue $\ell_2$ distance and the autoregressive parameter $\ell_2$ distance for 100 random linear dynamical systems with eigenvalues drawn uniformly randomly from $[-1, 1]$. The two distance measures are highly correlated.
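A comparison of this kind can be sketched in a few lines of NumPy. The details below are assumptions made for illustration and are not taken from the paper: the hidden dimension `n = 3`, the use of all pairwise distances among the 100 systems, and sorting the eigenvalues before taking the $\ell_2$ distance. The AR parameters are represented by the non-leading coefficients of the characteristic polynomial, which differ from the autoregressive coefficients only by a global sign and so give the same distances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_systems = 3, 100  # hypothetical hidden dimension and number of systems

# Eigenvalues drawn uniformly from [-1, 1]; sorted so that the l2 distance
# compares eigenvalues in matching positions (an assumption of this sketch).
eigs = np.sort(rng.uniform(-1.0, 1.0, size=(num_systems, n)), axis=1)

# np.poly returns the monic polynomial with the given roots; dropping the
# leading 1 leaves the n coefficients that determine the AR parameters.
ar = np.array([np.poly(lam)[1:] for lam in eigs])

# l2 distances over all unordered pairs of systems.
iu = np.triu_indices(num_systems, k=1)
eig_dist = np.linalg.norm(eigs[:, None, :] - eigs[None, :, :], axis=-1)[iu]
ar_dist = np.linalg.norm(ar[:, None, :] - ar[None, :, :], axis=-1)[iu]

# Pearson correlation between the two distance measures.
corr = np.corrcoef(eig_dist, ar_dist)[0, 1]
print(f"correlation between eigenvalue and AR distances: {corr:.3f}")
```

Plotting `eig_dist` against `ar_dist` as a scatter plot would give a picture of the same kind as Figure 2, though the exact values depend on the assumed dimension and random seed.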