
Linear Dynamics: Clustering without identification

A Proofs for model equivalence

In this section, we prove a generalization of Theorem 4.1 for both LDSs with observed inputs and LDSs with hidden inputs.

A.1 Preliminaries

Sum of ARMA processes. It is known that the sum of ARMA processes is still an ARMA process.

Lemma A.1 (Main Theorem in [Granger and Morris, 1976]). The sum of two independent stationary series generated by ARMA(p, m) and ARMA(q, n) is generated by ARMA(x, y), where x ≤ p + q and y ≤ max(p + n, q + m).

In shorthand notation, ARMA(p, m) + ARMA(q, n) = ARMA(p + q, max(p + n, q + m)).

When two ARMAX processes share the same exogenous input series, the dependency on the exogenous input is additive, and the above can be extended to ARMAX(p, m, r) + ARMAX(q, n, s) = ARMAX(p + q, max(p + n, q + m), max(r, s)).

Jordan canonical form and canonical basis. Every square real matrix is similar to a complex block diagonal matrix known as its Jordan canonical form (JCF). In the special case of diagonalizable matrices, the JCF is the same as the diagonal form. Based on the JCF, there exists a canonical basis {e_i} consisting only of eigenvectors and generalized eigenvectors of A. A vector v is a generalized eigenvector of rank µ with corresponding eigenvalue λ if (λI − A)^µ v = 0 and (λI − A)^{µ−1} v ≠ 0.

Relating the canonical basis to the characteristic polynomial, the characteristic polynomial can be completely factored into linear factors χ_A(λ) = (λ − λ_1)^{µ_1} (λ − λ_2)^{µ_2} ⋯ (λ − λ_r)^{µ_r} over ℂ. The complex roots λ_1, ⋯, λ_r are the eigenvalues of A. For each eigenvalue λ_i, there exist µ_i linearly independent generalized eigenvectors v such that (λ_i I − A)^{µ_i} v = 0.

A.2 General model equivalence theorem

Now we state Theorem A.1, a more detailed version of Theorem 4.1.

Theorem A.1. For any linear dynamical system with parameters Θ = (A, B, C, D), hidden dimension n, inputs x_t ∈ ℝ^k, and outputs y_t ∈ ℝ^m, the outputs y_t satisfy

    χ_A†(L) y_t = χ_A†(L) ξ_t + Γ(L) x_t,    (5)

where L is the lag operator, χ_A†(L) = L^n χ_A(L^{−1}) is the reciprocal polynomial of the characteristic polynomial χ_A, and Γ(L) is an m-by-k matrix of polynomials of degree n − 1.

This implies that each dimension of y_t can be generated by an ARMAX(n, n, n − 1) model, where the autoregressive parameters are the characteristic polynomial coefficients in reverse order and with negated values.

To prove the theorem, we introduce a lemma to analyze the autoregressive behavior of the hidden state projected to a direction.

Lemma A.2. Consider a linear dynamical system with parameters Θ = (A, B, C, D), hidden states h_t ∈ ℝ^n, inputs x_t ∈ ℝ^k, and outputs y_t ∈ ℝ^m as defined in (1). For any generalized eigenvector e_i of A* with eigenvalue λ and rank µ, the lag operator polynomial (1 − λL)^µ applied to the time series h_t^{(i)} := ⟨h_t, e_i⟩ results in

    (1 − λL)^µ h_t^{(i)} = linear transformation of x_t, ⋯, x_{t−µ+1}.

Proof. To expand the LHS, first observe that

    (1 − λL) h_t^{(i)} = (1 − λL) ⟨h_t, e_i⟩
                       = ⟨h_t, e_i⟩ − λ ⟨h_{t−1}, e_i⟩
                       = ⟨A h_{t−1} + B x_t, e_i⟩ − λ ⟨h_{t−1}, e_i⟩
                       = ⟨h_{t−1}, (A* − λI) e_i⟩ + ⟨B x_t, e_i⟩.

We can apply (1 − λL) again similarly to obtain

    (1 − λL)² h_t^{(i)} = ⟨h_{t−2}, (A* − λI)² e_i⟩ + ⟨B x_{t−1}, (A* − λI) e_i⟩ + (1 − λL) ⟨B x_t, e_i⟩,

and in general we can show inductively that

    (1 − λL)^k h_t^{(i)} − ⟨h_{t−k}, (A* − λI)^k e_i⟩ = Σ_{j=0}^{k−1} (1 − λL)^{k−1−j} L^j ⟨B x_t, (A* − λI)^j e_i⟩,

where the RHS is a linear transformation of x_t, ⋯, x_{t−k+1}.

Since (λI − A*)^µ e_i = 0 by the definition of generalized eigenvectors, ⟨h_{t−µ}, (A* − λI)^µ e_i⟩ = 0, and hence (1 − λL)^µ h_t^{(i)} itself is a linear transformation of x_t, ⋯, x_{t−µ+1}.
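
Lemma A.2 is easy to check numerically in the simple-eigenvector case (µ = 1). The following is a minimal sketch, assuming NumPy and hypothetical dimensions; it uses an eigenvector v of A^T (a left eigenvector of A) and the bilinear form v^T h_t, which sidesteps conjugation conventions in the complex case.

```python
import numpy as np

# Minimal numerical check of Lemma A.2 for a rank-1 eigenvector (mu = 1).
rng = np.random.default_rng(0)
n, k, T = 4, 2, 50                      # hypothetical dimensions
A = rng.normal(size=(n, n)) / np.sqrt(n)
B = rng.normal(size=(n, k))

lam, V = np.linalg.eig(A.T)             # A^T v = lam v, i.e. v^T A = lam v^T
v, lam0 = V[:, 0], lam[0]

X = rng.normal(size=(T, k))             # inputs x_t
H = np.zeros((T, n))                    # hidden states h_t = A h_{t-1} + B x_t
for t in range(1, T):
    H[t] = A @ H[t - 1] + B @ X[t]

proj = H @ v                            # h_t^{(i)} = <h_t, v>
lhs = proj[1:] - lam0 * proj[:-1]       # (1 - lam L) h_t^{(i)}
rhs = (X @ B.T) @ v                     # <B x_t, v>
assert np.allclose(lhs, rhs[1:])        # depends only on x_t, as the lemma predicts
```

For a generalized eigenvector of rank µ > 1, the same check applies after applying (1 − λL) a total of µ times.
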
Proof for Theorem A.1. Using Lemma A.2 and the canonical basis, we can prove Theorem A.1.

Proof. Let λ_1, ⋯, λ_r be the eigenvalues of A with multiplicities µ_1, ⋯, µ_r. Since A is a real-valued matrix, its adjoint A* has the same characteristic polynomial and eigenvalues as A. There exists a canonical basis {e_i}_{i=1}^n for A*, where e_1, ⋯, e_{µ_1} are generalized eigenvectors with eigenvalue λ_1, e_{µ_1+1}, ⋯, e_{µ_1+µ_2} are generalized eigenvectors with eigenvalue λ_2, so on and so forth, and e_{µ_1+⋯+µ_{r−1}+1}, ⋯, e_{µ_1+⋯+µ_r} are generalized eigenvectors with eigenvalue λ_r.

By Lemma A.2, (1 − λ_1 L)^{µ_1} h_t^{(i)} is a linear transformation of x_t, ⋯, x_{t−µ_1+1} for i = 1, ⋯, µ_1; (1 − λ_2 L)^{µ_2} h_t^{(i)} is a linear transformation of x_t, ⋯, x_{t−µ_2+1} for i = µ_1+1, ⋯, µ_1+µ_2; so on and so forth; and (1 − λ_r L)^{µ_r} h_t^{(i)} is a linear transformation of x_t, ⋯, x_{t−µ_r+1} for i = µ_1 + ⋯ + µ_{r−1} + 1, ⋯, n.

We then apply the lag operator polynomial Π_{j≠i} (1 − λ_j L)^{µ_j} to both sides of each equation. The lag polynomial on the LHS becomes (1 − λ_1 L)^{µ_1} ⋯ (1 − λ_r L)^{µ_r} = χ_A†(L). For the RHS, since Π_{j≠i} (1 − λ_j L)^{µ_j} is of degree n − µ_i, it lags the RHS by at most n − µ_i additional steps, and the RHS becomes a linear transformation of x_t, ⋯, x_{t−n+1}.

Thus, for each i, χ_A†(L) h_t^{(i)} is a linear transformation of x_t, ⋯, x_{t−n+1}.

The outputs of the LDS are defined as y_t = C h_t + D x_t + ξ_t = Σ_{i=1}^n h_t^{(i)} C e_i + D x_t + ξ_t. By linearity, and since χ_A†(L) is of degree n, both Σ_{i=1}^n χ_A†(L) h_t^{(i)} C e_i and χ_A†(L) D x_t are linear transformations of x_t, ⋯, x_{t−n}. We can write any such linear transformation as Γ(L) x_t for some m-by-k matrix Γ(L) of polynomials of degree n − 1. Thus, as desired,

    χ_A†(L) y_t = χ_A†(L) ξ_t + Γ(L) x_t.

Assuming that there are no common factors in χ_A† and Γ, χ_A† is then the lag operator polynomial that represents the autoregressive part of y_t. This assumption is the same as saying that y_t cannot be expressed as a lower-order ARMA process. The reciprocal polynomial has the same coefficients in reverse order as the original polynomial. According to the lag operator polynomial on the LHS, 1 − φ_1 L − φ_2 L² − ⋯ − φ_n L^n = χ_A†(L), and L^n − φ_1 L^{n−1} − ⋯ − φ_n = χ_A(L), so the i-th order autoregressive parameter φ_i is the negative value of the (n − i)-th order coefficient of the characteristic polynomial χ_A.

A.3 The hidden input case as a corollary

The statement about LDSs without external inputs in Theorem 4.1 comes as a corollary to Theorem A.1, with a short proof here.

Proof. Define y′_t = C h_t + D x_t to be the output without noise, i.e. y_t = y′_t + ξ_t. By Theorem A.1, χ_A†(L) y′_t = Γ(L) x_t. Since we assume the hidden inputs x_t are i.i.d. Gaussians, y′_t is then generated by an ARMA(n, n − 1) process with autoregressive polynomial χ_A†(L).

The output noise ξ_t itself can be seen as an ARMA(0, 0) process. By Lemma A.1, ARMA(n, n − 1) + ARMA(0, 0) = ARMA(n + 0, max(n + 0, n − 1 + 0)) = ARMA(n, n). Hence the outputs y_t are generated by an ARMA(n, n) process as claimed in Theorem 4.1. It is easy to see in the proof of Lemma A.1 that the autoregressive parameters do not change when adding white noise [Granger and Morris, 1976].

B Proof for eigenvalue approximation theorems

Here we restate Theorem 4.2 and Theorem 4.3 together, and prove them in three steps for 1) the general case, 2) the simple eigenvalue case, and 3) the explicit condition number bounds for the simple eigenvalue case.

Theorem B.1. Suppose y_t are the outputs from an n-dimensional latent linear dynamical system with parameters Θ = (A, B, C, D) and eigenvalues λ_1, ⋯, λ_n. Let Φ̂ = (φ̂_1, ⋯, φ̂_n) be the estimated autoregressive parameters with error ‖Φ̂ − Φ‖ = ε, and let r_1, ⋯, r_n be the roots of the polynomial z^n − φ̂_1 z^{n−1} − ⋯ − φ̂_n. Assuming the LDS is observable, the roots converge to the true eigenvalues with convergence rate O(ε^{1/n}). If all eigenvalues of A are simple (i.e. of multiplicity 1), then the convergence rate is O(ε). If A is symmetric, Lyapunov stable (spectral radius at most 1), and only has simple eigenvalues, then

    |r_i − λ_i| ≤ (√n (√2)^{n−1} / Π_{k≠i} |λ_i − λ_k|) ε + O(ε²).

B.1 General (1/n)-exponent bound

This is a known perturbation bound on polynomial root finding due to Ostrowski [Beauzamy, 1999].

Lemma B.1. Let φ(z) = z^n + φ_1 z^{n−1} + ⋯ + φ_{n−1} z + φ_n and ψ(z) = z^n + ψ_1 z^{n−1} + ⋯ + ψ_{n−1} z + ψ_n be two polynomials of degree n. If ‖φ − ψ‖_2 < ε, then the roots (r_k) of φ and the roots (r̃_k) of ψ, under suitable ordering, satisfy

    |r_k − r̃_k| ≤ 4 C ε^{1/n},

where C = max_{0≤k≤n} {1, |φ_k|^{1/n}, |ψ_k|^{1/n}}.

The general O(ε^{1/n}) convergence rate in Theorem 4.2 follows directly from Lemma B.1 and Theorem 4.1.
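
The ε^{1/n} scaling in Lemma B.1 is easy to see numerically. The following sketch, assuming NumPy (the polynomial and the perturbation size are arbitrary choices), perturbs the constant coefficient of z^n by ε and watches a root of multiplicity n split by roughly ε^{1/n}.

```python
import numpy as np

# Root sensitivity behind Lemma B.1: an eps-size coefficient perturbation can move
# a root of multiplicity n by roughly eps**(1/n).
n, eps = 4, 1e-8
base = np.zeros(n + 1)
base[0] = 1.0                  # z**n, a single root 0 of multiplicity n
pert = base.copy()
pert[-1] = -eps                # z**n - eps

print(np.abs(np.roots(pert)))  # all n roots have modulus ~ eps**(1/n)
print(eps ** (1.0 / n))        # 0.01 for n = 4, eps = 1e-8
```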

B.2 Bound for simple eigenvalues

The 1/n-exponent in the above bound might seem far from ideal, but without additional assumptions the 1/n-exponent is tight. As an example, the polynomial x² − ε has roots x = ±√ε. This is a general phenomenon: a root with multiplicity m can split into m roots at rate O(ε^{1/m}), and it is related to the regular splitting property [Hryniv and Lancaster, 1999, Lancaster et al., 2003] in matrix eigenvalue perturbation theory.

Under the additional assumption that all eigenvalues are simple (no multiplicity), we can prove a better bound using the following idea based on the companion matrix: a small perturbation of the autoregressive parameters results in a small perturbation of the companion matrix, and a small perturbation of the companion matrix results in a small perturbation of the eigenvalues.

Matrix eigenvalue perturbation theory. The perturbation bound on eigenvalues is a well-studied problem [Greenbaum et al., 2019]. The regular splitting property states that, for an eigenvalue λ_0 with partial multiplicities m_1, ⋯, m_k, an O(ε) perturbation of the matrix can split the eigenvalue into M = m_1 + ⋯ + m_k distinct eigenvalues λ_{ij}(ε) for i = 1, ⋯, k and j = 1, ⋯, m_i, and each eigenvalue λ_{ij}(ε) is moved from the original position by O(ε^{1/m_i}).

For semi-simple eigenvalues, geometric multiplicity equals algebraic multiplicity. Since geometric multiplicity is the number of partial multiplicities while algebraic multiplicity is the sum of partial multiplicities, for semi-simple eigenvalues all partial multiplicities m_i = 1. Therefore, the regular splitting property corresponds to the asymptotic relation in equation 6. It is known that regular splitting holds for any semi-simple eigenvalue, even for non-Hermitian matrices.

Lemma B.2 (Theorem 6 in [Lancaster et al., 2003]). Let L(λ, ε) be an analytic matrix function with semi-simple eigenvalue λ_0 at ε = 0 of multiplicity M. Then there are exactly M eigenvalues λ_i(ε) of L(λ, ε) for which λ_i(ε) → λ_0 as ε → 0, and for these eigenvalues

    λ_i(ε) = λ_0 + λ′_i ε + o(ε).    (6)

Companion matrix. Matrix perturbation theory tells us how perturbations of matrices change eigenvalues, while we are interested in how perturbations of polynomial coefficients change roots. To apply matrix perturbation theory to polynomials, we introduce the companion matrix, also known as the controllable canonical form in control theory.

Definition B.1. For a monic polynomial φ(z) = z^n + φ_1 z^{n−1} + ⋯ + φ_{n−1} z + φ_n, the companion matrix of the polynomial is the square matrix

            ⎡ 0  0  ⋯  0  −φ_n     ⎤
            ⎢ 1  0  ⋯  0  −φ_{n−1} ⎥
    C(φ) =  ⎢ 0  1  ⋯  0  −φ_{n−2} ⎥
            ⎢ ⋮  ⋮  ⋱  ⋮     ⋮     ⎥
            ⎣ 0  0  ⋯  1  −φ_1     ⎦

The matrix C(φ) is the companion in the sense that its characteristic polynomial is equal to φ.

In relation to a pure autoregressive AR(p) model, the companion matrix corresponds to the transition matrix of the linear dynamical system obtained when we encode the values from the past p lags as a p-dimensional state h_t = [y_{t−p+1} ⋯ y_{t−1} y_t]^⊤. If y_t = φ_1 y_{t−1} + ⋯ + φ_p y_{t−p}, then

          ⎡ 0    1       0       ⋯  0   ⎤ ⎡ y_{t−p}   ⎤
          ⎢ 0    0       1       ⋯  0   ⎥ ⎢ y_{t−p+1} ⎥
    h_t = ⎢ ⋮    ⋮       ⋮       ⋱  ⋮   ⎥ ⎢    ⋮      ⎥ = C^⊤ h_{t−1},    (7)
          ⎢ 0    0       0       ⋯  1   ⎥ ⎢ y_{t−2}   ⎥
          ⎣ φ_p  φ_{p−1} φ_{p−2} ⋯  φ_1 ⎦ ⎣ y_{t−1}   ⎦

where C is the companion matrix of the polynomial z^p − φ_1 z^{p−1} − ⋯ − φ_p.

Proof of Theorem 4.2 for simple eigenvalues.

Proof. Let y_t be the outputs of a linear dynamical system S with only simple eigenvalues, and let Φ = (φ_1, ⋯, φ_n) be the ARMAX autoregressive parameters for y_t. Let C(Φ) be the companion matrix of the polynomial z^n − φ_1 z^{n−1} − φ_2 z^{n−2} − ⋯ − φ_n. The companion matrix is the transition matrix of the LDS described in equation 7. Since this LDS has the same autoregressive parameters and hidden state dimension as the original LDS, by Corollary 4.1 the companion matrix has the same characteristic polynomial as the original LDS, and thus also has simple (and hence also semi-simple) eigenvalues. The O(ε) convergence rate then follows from Lemma B.2 and Theorem 5.1, as the error in the ARMAX parameter estimates can be seen as a perturbation of the companion matrix.

A note on the companion matrix. One might hope for a more general result using Lemma B.2 for all systems with semi-simple eigenvalues, instead of restricting to matrices with simple eigenvalues. Unfortunately, even if the original linear dynamical system has only semi-simple eigenvalues, in general the companion matrix is not semi-simple unless the original linear dynamical system is simple. This is because the companion matrix always has its minimal polynomial equal to its characteristic polynomial, and hence has geometric multiplicity 1 for all eigenvalues. This also points to the fact that even though the companion matrix has the form of the controllable canonical form, in general it is not necessarily similar to the transition matrix of the original LDS.
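
A quick way to see Definition B.1 in action is to build the companion matrix numerically and confirm that its characteristic polynomial and eigenvalues recover the polynomial and its roots. A minimal NumPy sketch with hypothetical coefficients:

```python
import numpy as np

# Companion matrix of phi(z) = z**n + phi_1 z**(n-1) + ... + phi_n (Definition B.1):
# ones on the subdiagonal, the negated coefficients -phi_n, ..., -phi_1 in the last column.
phi = np.array([-1.1, 0.3, 0.5, -0.2])        # hypothetical coefficients phi_1..phi_n
n = len(phi)
C = np.zeros((n, n))
C[1:, :-1] = np.eye(n - 1)
C[:, -1] = -phi[::-1]

print(np.poly(C))                             # characteristic polynomial ~ [1, phi_1, ..., phi_n]
print(np.sort(np.linalg.eigvals(C)))          # eigenvalues of C ...
print(np.sort(np.roots(np.r_[1.0, phi])))     # ... equal the roots of phi
```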

B.3 Explicit bound for condition number

In this subsection, we write out explicitly the condition number for simple eigenvalues in the asymptotic relation λ(ε) = λ_0 + κ ε + o(ε), to show how it varies according to the spectrum. Here we use the notation κ(C, λ) to denote the condition number of eigenvalue λ in the companion matrix C.

Lemma B.3. For a companion matrix C with simple eigenvalues λ_1, ⋯, λ_n, the eigenvalues λ′_1, ⋯, λ′_n of the matrix perturbed by ΔC, i.e. C + ΔC, satisfy

    |λ′_j − λ_j| ≤ κ(C, λ_j) ‖ΔC‖_2 + o(‖ΔC‖_2),    (8)

and the condition number κ(C, λ_j) is bounded by

    1 / Π_{k≠j} |λ_j − λ_k| ≤ κ(C, λ_j) ≤ (√n / Π_{k≠j} |λ_j − λ_k|) (max(1, |λ_j|))^{n−1} (1 + ρ(C)²)^{(n−1)/2},    (9)

where ρ(C) is the spectral radius, i.e. the largest absolute value of its eigenvalues.

In particular, when ρ(C) ≤ 1, i.e. when the matrix is Lyapunov stable,

    |λ_j − λ′_j| ≤ (√n (√2)^{n−1} / Π_{k≠j} |λ_j − λ_k|) ‖ΔC‖_2 + o(‖ΔC‖_2).    (10)

Proof. For each simple eigenvalue λ of the companion matrix C with column eigenvector v and row eigenvector w*, the condition number of the eigenvalue is

    κ(C, λ) = ‖w‖_2 ‖v‖_2 / |w* v|.    (11)

This is derived by differentiating the eigenvalue equation C v = λ v and multiplying the differentiated equation by w*, which results in

    w* (ΔC) v + w* C (Δv) = w* (Δλ) v + w* λ (Δv),
    Δλ = w* (ΔC) v / (w* v).

Therefore,

    |Δλ| ≤ (‖w‖_2 ‖v‖_2 / |w* v|) ‖ΔC‖_2 = κ(C, λ) ‖ΔC‖_2.    (12)

The companion matrix can be diagonalized as C = V^{−1} diag(λ_1, ⋯, λ_n) V, where the rows of the Vandermonde matrix V are the row eigenvectors of C, while the columns of V^{−1} are the column eigenvectors of C. Since the j-th row V_{j,∗} and the j-th column V^{−1}_{∗,j} have inner product 1 by the definition of the matrix inverse, the condition number is given by

    κ(C, λ_j) = ‖V_{j,∗}‖_2 ‖V^{−1}_{∗,j}‖_2.    (13)

Formula for the inverse of the Vandermonde matrix. The Vandermonde matrix is defined as

        ⎡ 1  λ_1  λ_1²  ⋯  λ_1^{p−1} ⎤
    V = ⎢ 1  λ_2  λ_2²  ⋯  λ_2^{p−1} ⎥    (14)
        ⎢ ⋮   ⋮    ⋮    ⋱     ⋮     ⎥
        ⎣ 1  λ_p  λ_p²  ⋯  λ_p^{p−1} ⎦

The inverse of the Vandermonde matrix, V^{−1}, is given by [El-Mikkawy, 2003] using elementary symmetric polynomials:

    (V^{−1})_{i,j} = (−1)^{i+j} S_{p−i,j} / (Π_{k<j} (λ_j − λ_k) Π_{k>j} (λ_k − λ_j)),    (15)

where S_{p−i,j} = S_{p−i}(λ_1, ⋯, λ_{j−1}, λ_{j+1}, ⋯, λ_p).

Pulling out the common denominator, the j-th column vector of V^{−1} is

    ((−1)^j / (Π_{k<j} (λ_j − λ_k) Π_{k>j} (λ_k − λ_j))) · [ −S_{p−1},  S_{p−2},  −S_{p−3},  ⋯,  (−1)^{p−1} S_1,  (−1)^p ]^⊤,

where the elementary symmetric polynomials are over the variables λ_1, ⋯, λ_{j−1}, λ_{j+1}, ⋯, λ_p.

For example, if p = 4, then the 3rd column (up to scaling) would be

    [ −λ_1 λ_2 λ_4,   λ_1 λ_2 + λ_1 λ_4 + λ_2 λ_4,   −(λ_1 + λ_2 + λ_4),   1 ]^⊤ / ((λ_1 − λ_3)(λ_2 − λ_3)(λ_4 − λ_3)).

Bounding the condition number. As discussed before, the condition number for eigenvalue λ_j is

    κ(C, λ_j) = ‖V_{j,∗}‖_2 ‖V^{−1}_{∗,j}‖_2,

where V_{j,∗} is the j-th row of the Vandermonde matrix V and V^{−1}_{∗,j} is the j-th column of V^{−1}.
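
Formula (13) is straightforward to check numerically before continuing with the bound. The sketch below, assuming NumPy and hypothetical eigenvalues, computes κ(C, λ_j) from the Vandermonde rows and columns and compares it against the first-order eigenvalue movement in (8) under a small random perturbation.

```python
import numpy as np

# Eigenvalue condition numbers of a companion matrix via eq. (13):
# kappa(C, lam_j) = ||V_{j,:}||_2 * ||V^{-1}_{:,j}||_2, with V the Vandermonde matrix.
lams = np.array([0.9, 0.5, -0.3, -0.8])      # hypothetical simple eigenvalues
a = np.poly(lams)[1:]                        # monic polynomial with these roots
n = len(lams)
C = np.zeros((n, n))                         # companion matrix of Definition B.1
C[1:, :-1] = np.eye(n - 1)
C[:, -1] = -a[::-1]

V = np.vander(lams, increasing=True)         # row j is the row eigenvector [1, lam_j, ..., lam_j^{n-1}]
kappa = np.linalg.norm(V, axis=1) * np.linalg.norm(np.linalg.inv(V), axis=0)

# First-order check: a small perturbation dC moves lam_j by at most ~ kappa_j * ||dC||_2.
dC = 1e-7 * np.random.default_rng(1).normal(size=(n, n))
order = np.argsort(lams)
moved = np.sort(np.linalg.eigvals(C + dC).real)
print(np.abs(moved - lams[order]) / np.linalg.norm(dC, 2))   # observed movement rates ...
print(kappa[order])                                          # ... bounded by kappa (up to o(1))
```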

By definition V_{j,∗} = [1  λ_j  λ_j²  ⋯  λ_j^{p−1}], so

    ‖V_{j,∗}‖_2 = (Σ_{i=0}^{p−1} |λ_j|^{2i})^{1/2}.

Using the above explicit expression for V^{−1},

    ‖V^{−1}_{∗,j}‖_2 = (1 / Π_{k≠j} |λ_j − λ_k|) (Σ_{i=0}^{p−1} S_i(λ_1, ⋯, λ_{j−1}, λ_{j+1}, ⋯, λ_p)²)^{1/2}.

Therefore,

    κ(C, λ_j) = (1 / Π_{k≠j} |λ_j − λ_k|) (Σ_{i=0}^{p−1} |λ_j|^{2i})^{1/2} (Σ_{i=0}^{p−1} S_i(λ_1, ⋯, λ_{j−1}, λ_{j+1}, ⋯, λ_p)²)^{1/2}.    (16)

Note that both parts under (⋯)^{1/2} are greater than or equal to 1, so we can bound κ(C, λ_j) below by

    κ(C, λ_j) ≥ 1 / Π_{k≠j} |λ_j − λ_k|.

We can also bound the two parts from above. The first part can be bounded by

    (Σ_{i=0}^{p−1} |λ_j|^{2i})^{1/2} ≤ √p max(1, |λ_j|)^{p−1}.    (17)

For the second part, since |S_i(λ_1, ⋯, λ_{j−1}, λ_{j+1}, ⋯, λ_p)| ≤ (p−1 choose i) |λ|_max^i, we have that

    Σ_{i=0}^{p−1} S_i(λ_1, ⋯, λ_{j−1}, λ_{j+1}, ⋯, λ_p)² ≤ Σ_{i=0}^{p−1} (p−1 choose i) |λ|_max^{2i} = (1 + |λ|_max²)^{p−1}.    (18)

Combining equations 17 and 18 for the upper bound, and putting them together with the lower bound, we obtain

    1 / Π_{k≠j} |λ_j − λ_k| ≤ κ(C, λ_j) ≤ (√p / Π_{k≠j} |λ_j − λ_k|) (max(1, |λ_j|))^{p−1} (1 + ρ(C)²)^{(p−1)/2},    (19)

as desired.

Theorem 4.3 follows from Lemma B.3, because the estimation error on the autoregressive parameters can be seen as a perturbation of the companion matrix, and the companion matrix has the same eigenvalues as the original LDS.

C Iterated regression for ARMAX

Algorithm. We generalize Algorithm 1 to accommodate exogenous inputs. Since the exogenous inputs are explicitly observed, including them in the regression does not change the consistency property of the estimator.

Theorem A.1 shows that different output channels from the same LDS have the same autoregressive parameters in their ARMAX models. Therefore, we can leverage multidimensional outputs by estimating the autoregressive parameters in each channel separately and averaging the estimates. A minimal code sketch follows the algorithm below.

Algorithm 2: Regularized iterated regression for AR parameter estimation in ARMAX

Input: a time series {y_t}_{t=1}^T where y_t ∈ ℝ^m, an exogenous input series {x_t}_{t=1}^T where x_t ∈ ℝ^k, and a guessed hidden state dimension n.

for d = 1, ⋯, m do
    Let y_t^{(d)} be the projection of y_t to the d-th dimension;
    Initialize error term estimates ε̂_t = 0 for t = 1, ⋯, T;
    for i = 0, ⋯, n do
        Perform ℓ2-regularized least squares regression of y_t^{(d)} against lagged terms of y_t^{(d)}, x_t, and ε̂_t to solve for coefficients φ̂_j, η̂_j, and θ̂_j in the linear equation
            y_t^{(d)} = c + Σ_{j=1}^n φ̂_j y_{t−j}^{(d)} + Σ_{j=0}^{n−1} η̂_j^⊤ x_{t−j} + Σ_{j=1}^i θ̂_j ε̂_{t−j},
        with ℓ2-regularization only on the θ̂_j;
        Update ε̂_t to be the residuals from the most recent regression;
    end
    Record Φ̂^{(d)} = (φ̂_1, ⋯, φ̂_n);
end
Return the average estimate Φ̂ = (1/m)(Φ̂^{(1)} + ⋯ + Φ̂^{(m)}).

Again as before, the i-th iteration of the regression only uses error terms from the past i lags. In other words, the initial iteration is an ARMAX(n, 0, n − 1) regression, the first iteration is an ARMAX(n, 1, n − 1) regression, and so forth.
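
The following is a minimal single-channel NumPy sketch of the inner loop of Algorithm 2, not the authors' implementation; the regularization strength, the exact lag ranges, and the function name are assumptions made for illustration. Regularization is applied only to the error-term coefficients by augmenting the least squares system.

```python
import numpy as np

def armax_ar_estimate(y, x, n, alpha=1e-3):
    """Sketch of Algorithm 2 for one output channel.

    y: (T,) output series, x: (T, k) exogenous inputs, n: guessed hidden dimension.
    Returns the estimated AR parameters (phi_1, ..., phi_n); alpha penalizes only
    the error-term coefficients (an assumption about the regularization strength).
    """
    T, k = len(y), x.shape[1]
    eps = np.zeros(T)                         # error-term estimates, initialized to zero
    phi = np.zeros(n)
    for i in range(n + 1):                    # iteration i uses i lags of residuals
        rows, targets = [], []
        for t in range(n, T):
            lagged_y = [y[t - j] for j in range(1, n + 1)]
            lagged_x = list(x[t - n + 1:t + 1][::-1].ravel())   # x_t, ..., x_{t-n+1}
            lagged_e = [eps[t - j] for j in range(1, i + 1)]
            rows.append([1.0] + lagged_y + lagged_x + lagged_e)
            targets.append(y[t])
        Z, b = np.array(rows), np.array(targets)
        # l2-penalize only the error-term coefficients (the last i columns)
        S = np.zeros((i, Z.shape[1]))
        if i > 0:
            S[:, -i:] = np.sqrt(alpha) * np.eye(i)
        coef, *_ = np.linalg.lstsq(np.vstack([Z, S]), np.r_[b, np.zeros(i)], rcond=None)
        phi = coef[1:n + 1]                   # AR parameters after the intercept
        eps[n:] = b - Z @ coef                # residuals feed the next iteration
    return phi
```

Running this per output channel and averaging the returned vectors mirrors the outer loop over d in Algorithm 2.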

Time complexity. The iterated regression in each dimension involves n + 1 steps of least squares regression, each on at most n(k + 2) variables. Therefore, the total time complexity of Algorithm 2 is O(nm((nk)²T + (nk)³)) = O(mn³k²T + mn⁴k³), where T is the sequence length, n the hidden state dimension, m the output dimension, and k the input dimension.

D Additional simulation details

D.1 Synthetic data generation

First, we generate K cluster centers by generating LDSs with random matrices A, B, C of standard i.i.d. Gaussians. We assume that the output y_t only depends on the hidden state h_t but not on the input x_t, i.e. the matrix D is zero. When generating the random LDSs, we require that the spectral radius satisfies ρ(A) ≤ 1, i.e. all eigenvalues of A have absolute value at most 1, and regenerate a new random matrix if the spectral radius is above 1. Our method also applies to the case of arbitrary spectral radius; this requirement only serves to prevent numeric overflow in the generated sequences. We also require that the ℓ2 distance between the eigenvalue spectra of cluster centers, d(Θ_1, Θ_2) = ‖λ(A_1) − λ(A_2)‖_2, is at least 0.2.

Then, we generate 100 LDSs by randomly assigning them to the clusters. To obtain an LDS with assigned cluster center Θ = (A_c, B_c, C_c), we generate A′ by adding i.i.d. Gaussians to each entry of A_c, while B′ and C′ are new random matrices of i.i.d. standard Gaussians. The standard deviation of the i.i.d. Gaussians for A′ − A_c is chosen such that the average distance to the cluster center is less than half of the inter-cluster distance between centers.

For each LDS, we generate a sequence by drawing hidden inputs x_t ∼ N(0, 1) and adding output noise ξ_t ∼ N(0, 0.01²) to the outputs.

D.2 Empirical correlation between AR distance and LDS distance

Theorem 4.2 shows that LDSs with similar AR parameters also have similar eigenvalues. The converse of Theorem 4.2 is also true: dynamical systems with small eigenvalue distance have small autoregressive parameter distance, which follows from perturbation bounds for characteristic polynomials [Ipsen and Rehman, 2008]. Figure 2 shows simulation results where the AR parameter distance and the LDS eigenvalue distance are highly correlated.

Figure 2: The eigenvalue ℓ2 distance and the autoregressive parameter ℓ2 distance for 100 random linear dynamical systems with eigenvalues drawn uniformly at random from [−1, 1]. The two distance measures are highly correlated.
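
A rough way to reproduce the qualitative behavior in Figure 2 is sketched below; it assumes NumPy, pairs eigenvalues by sorting (a heuristic), and converts eigenvalues to AR parameters through the characteristic polynomial as derived in Appendix A.

```python
import numpy as np

# Rough reproduction of the comparison in Figure 2: draw eigenvalue sets uniformly
# from [-1, 1], convert to AR parameters (phi_i = negated characteristic-polynomial
# coefficient), and correlate the two distance measures.
rng = np.random.default_rng(0)
n = 3

def ar_params(eigs):
    return -np.poly(eigs)[1:]                # phi_1, ..., phi_n for prod_i (z - eig_i)

d_eig, d_ar = [], []
for _ in range(100):
    e1 = np.sort(rng.uniform(-1, 1, size=n))
    e2 = np.sort(rng.uniform(-1, 1, size=n))
    d_eig.append(np.linalg.norm(e1 - e2))                      # eigenvalue l2 distance
    d_ar.append(np.linalg.norm(ar_params(e1) - ar_params(e2))) # AR parameter l2 distance
print(np.corrcoef(d_eig, d_ar)[0, 1])                          # typically strongly positive
```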