Stable Spectral Learning Based on Schur Decomposition

Nicolò Colombo
Luxembourg Centre for Systems Biomedicine, University of Luxembourg
[email protected]

Nikos Vlassis
Adobe Research, San Jose, CA
[email protected]

Abstract

Spectral methods are a powerful tool for inferring the parameters of certain classes of probability distributions by means of standard eigenvalue-eigenvector decompositions. Spectral algorithms can be orders of magnitude faster than log-likelihood based and related iterative methods, and, thanks to the uniqueness of the spectral decomposition, they enjoy global optimality guarantees. In practice, however, the applicability of spectral methods is limited due to their sensitivity to model misspecification, which can cause instability issues in the case of non-exact models. We present a new spectral approach that is based on the Schur triangularization of an observable matrix, and we carry out the corresponding theoretical analysis. Our main result is a bound on the estimation error that is shown to depend linearly on the condition number of the ground-truth conditional probability matrix and inversely on the eigengap of an observable matrix. Numerical experiments show that the proposed method is more stable, and performs better in general, than the classical spectral approach using direct matrix diagonalization.

1 INTRODUCTION

The problem of learning mixtures of probability distributions from sampled data is central in the statistical literature (Titterington, 1985; Lindsay, 1995).

In pioneering work, Chang (1996) showed that it is possible to learn a mixture of product distributions via the spectral decomposition of 'observable' matrices, that is, matrices that can be estimated directly from the data using suitable combinations of the empirical joint probability distributions (Chang, 1996). Extensions and improvements of this idea have been developed more recently in a series of works, where the spectral technique is applied to a larger class of probability distributions, including Gaussian mixtures, Hidden Markov models, stochastic languages, and others (Mossel and Roch, 2006; Hsu et al., 2012; Anandkumar et al., 2012a,c; Balle et al., 2014; Kuleshov et al., 2015). Some of the most widely studied algorithms include Chang's spectral technique (Chang, 1996; Mossel and Roch, 2006), a tensor decomposition approach (Anandkumar et al., 2012a), and an indirect learning method for inferring the parameters of Hidden Markov Models (Hsu et al., 2012).

Spectral algorithms are typically much faster than iterative solvers such as the EM algorithm (Dempster et al., 1977), and thanks to the uniqueness of the spectral decomposition, they enjoy strong optimality guarantees. However, spectral algorithms are more sensitive to model misspecification than algorithms that maximize log-likelihood. Studies in the field of linear system subspace identification have shown that the solutions obtained via such methods can be suboptimal (Favoreel et al., 2000; Buesing et al., 2012). On the other hand, good results have been obtained by using the output of a spectral algorithm to initialize the EM algorithm (Zhang et al., 2014).

The practical implementation of the spectral idea is a non-trivial task because the stability of the spectral decomposition strongly depends on the spacing between the eigenvalues of the empirical matrices (Anandkumar et al., 2012a; Hsu and Kakade, 2012). Mossel and Roch (2006) obtain certain eigenvalue separation guarantees for Chang's spectral technique via the contraction of higher (order three) moments through Gaussian random vectors. Anandkumar et al. (2012a) describe a tensor decomposition method that generalizes deflation methods for matrix diagonalization to the case of symmetric tensors of order three. Another algorithmic variant involves replacing the random contracting vector of Chang's spectral technique with an 'anchor observation', which guarantees the presence of at least one well separated eigenvalue (Arora et al., 2012; Song and Chen, 2014) (see also Kuleshov et al. (2015) for a similar idea). Finally, Zou et al. (2013) have presented a technique for learning mixtures of product distributions in the presence of a background model.

In this article we propose an alternative and more stable approach to Chang's method that is based on Schur decomposition (Konstantinov et al., 1994). We show that an approximate triangularization of all observable matrices appearing in Chang's spectral method can be obtained by means of the orthogonal matrices appearing in the Schur decomposition of their linear combination, an idea that has been suggested earlier (Corless et al., 1997; Anandkumar et al., 2012a). Our main result is a theoretical bound on the estimation error that is based on a perturbation analysis of the Schur decomposition (Konstantinov et al., 1994). In analogy to related results in the literature, the bound is shown to depend directly on the model misspecification error and inversely on an eigenvalue separation gap. However, the major advantage of the Schur approach is that the bound depends very mildly on the condition number of the ground-truth conditional probability matrix (see the discussion after Theorem 1). We compare numerically the proposed Schur decomposition approach with the standard spectral technique (Chang, 1996; Mossel and Roch, 2006), and we show that the proposed method is more stable and does a better job in recovering the parameters of misspecified mixtures of product distributions.

2 SPECTRAL LEARNING VIA SCHUR DECOMPOSITION

Here we discuss the standard spectral technique (Chang, 1996; Mossel and Roch, 2006), and the proposed Schur decomposition, in the context of learning mixtures of product distributions. The complete algorithm is shown in Algorithm 1. Its main difference from previous algorithms is step 13 (Schur decomposition).

Algorithm 1 Spectral algorithm via Schur decomposition
Input: data $s_n = [x_n, y_n, z_n] \in N$, dimension $d$, number of mixture components $p$
Output: estimated conditional probability matrices $\hat X, \hat Y, \hat Z$ and mixing weights vector $\hat w$
1: $\hat P = 0$
2: for $s_n \in S$ do
3:   $\hat P_{x_n y_n z_n} = \hat P_{x_n y_n z_n} + 1$
4: end for
5: for $i = 1, \dots, d$ do
6:   $[\hat P^Y_i]_{jk} = \hat P_{jik}$, $[\hat P^X_i]_{jk} = \hat P_{ijk}$, $[\hat P^Z_i]_{jk} = \hat P_{jki}$
7: end for
8: compute $[\hat P_{xz}] = \sum_i [\hat P^Y_i]$, $[\hat P_{yz}] = \sum_i [\hat P^X_i]$, $[\hat P_{xy}] = \sum_i [\hat P^Z_i]$
9: for $i = 1, \dots, d$ do
10:   compute $\hat M_i = \hat P^Y_i \hat P_{xz}^{-1}$
11: end for
12: find $\theta \in \mathbb{R}^d$ such that $\hat M = \sum_i \theta_i \hat M_i$ has real non-degenerate eigenvalues
13: find $\hat U$ such that $\hat U^T \hat U = 1$ and $\hat U^T \hat M \hat U$ is upper triangular (Schur decomposition)
14: for $i, j = 1, \dots, d$ do
15:   let $\hat Y_{i,j} = [\hat U^T \hat M_i \hat U]_{jj}$
16:   set $\hat Y_{i,j} = 0$ if $[\hat U^T \hat M_i \hat U]_{jj} < 0$
17: end for
18: normalize to 1 the columns of $\hat Y$
19: for $i = 1, \dots, d$ do
20:   compute $\hat M^X_i = \hat P^X_i \hat P_{yz}^{-1}$
21:   compute $\hat M^Z_i = \hat P^Z_i \hat P_{xy}^{-1}$
22: end for
23: for $i, j = 1, \dots, d$ do
24:   let $\hat X_{i,j} = [\hat Y^{-1} \hat M^X_i \hat Y]_{jj}$ and set $\hat X_{i,j} = 0$ if $[\hat Y^{-1} \hat M^X_i \hat Y]_{jj} < 0$
25:   let $\hat Z_{i,j} = [\hat X^{-1} \hat M^Z_i \hat X]_{jj}$ and set $\hat Z_{i,j} = 0$ if $[\hat X^{-1} \hat M^Z_i \hat X]_{jj} < 0$
26: end for
27: normalize to 1 the columns of $\hat X$ and $\hat Z$
28: compute $\hat w = \hat X^{-1} \hat P_{xy} (\hat Y^T)^{-1}$ and normalize to 1
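The experiments in Section 3 run Algorithm 1 in Matlab, with step 13 performed by schur(M) (and by eig(M) for the classical variant). For concreteness, the following is a minimal NumPy/SciPy sketch of steps 1-18, i.e., the estimation of $\hat Y$; the function name, the Gaussian draw of $\theta$ with a retry loop, and the numerical tolerances are illustrative choices of ours, not part of the paper.

```python
import numpy as np
from scipy.linalg import schur

def estimate_Y_schur(triples, d, seed=0):
    """Steps 1-18 of Algorithm 1: estimate the conditional probability
    matrix Y-hat from observed triples (x, y, z) in {0, ..., d-1}."""
    rng = np.random.default_rng(seed)

    # Steps 1-4: empirical order-3 tensor.
    P = np.zeros((d, d, d))
    for x, y, z in triples:
        P[x, y, z] += 1.0
    P /= P.sum()

    # Steps 5-8: slices [P^Y_i]_{jk} = P_{jik} and their sum P_xz.
    PY = [P[:, i, :] for i in range(d)]
    Pxz = sum(PY)

    # Steps 9-11: observable matrices M_i = P^Y_i P_xz^{-1}.
    Pxz_inv = np.linalg.inv(Pxz)
    M = [PYi @ Pxz_inv for PYi in PY]

    # Step 12: mix with a random theta until M-hat has real, separated
    # eigenvalues (a Gaussian draw, as in Mossel and Roch (2006)).
    for _ in range(100):
        theta = rng.standard_normal(d)
        Mhat = sum(t * Mi for t, Mi in zip(theta, M))
        eigvals = np.linalg.eigvals(Mhat)
        if np.all(np.abs(eigvals.imag) < 1e-9) and \
           np.min(np.diff(np.sort(eigvals.real))) > 1e-6:
            break

    # Step 13: real Schur decomposition M-hat = U T U^T.
    T, U = schur(Mhat, output='real')

    # Steps 14-18: diagonals of U^T M_i U give the rows of Y (up to scale).
    Y = np.array([np.diag(U.T @ Mi @ U) for Mi in M])
    Y = np.clip(Y, 0.0, None)           # steps 15-16: zero out negative entries
    Y /= Y.sum(axis=0, keepdims=True)   # step 18: columns sum to one
    return Y
```

Steps 19-28 recover $\hat X$, $\hat Z$ and $\hat w$ analogously, with $\hat Y^{-1}$ and $\hat X^{-1}$ playing the role of the (no longer orthogonal) change of basis.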
The spectral approach in a nutshell. Consider $\ell$ distinct variables taking values in a discrete set $\{1, \dots, d\}$ with a finite number of elements, and a sample $S$ consisting of a number of independent joint observations. The empirical distribution corresponding to these observations is computed by counting the frequencies of all possible joint events in the sample (step 3), and it is modeled (approximated) by a mixture of product distributions with a given number $p$ of mixture components. For every $p < d$, spectral methods allow one to recover the parameters of this approximation by means of the simultaneous diagonalisation of a set of 'observable' nearly diagonalizable matrices $\{\hat M_1, \dots, \hat M_p\}$, computed from the sample $S$ (step 10). If the sample is drawn exactly from a mixture of $p$ components, and in the limit of an infinite amount of data, the mixture parameters, i.e., the conditional probability distributions and the mixing weights of the mixture, are contained exactly in the eigenvalues of the matrices $\hat M_i$ (Chang, 1996). If the sample is not drawn exactly from a mixture of $p$ product distributions, and in the typical finite sample setting, the model is only an approximation to the empirical distribution, and as a result, the matrices $\hat M_i$ are no longer simultaneously diagonalizable and an approximate diagonalisation technique is required. The standard approach consists of choosing one particular observable matrix in the set, or a linear combination of all matrices, and using its eigenvectors to diagonalize each $\hat M_i$ (Mossel and Roch, 2006).

Here we propose a new approach that is based on the Schur decomposition of a linear combination of the observable matrices. In particular, we first mix the matrices $\hat M_i$ to compute a candidate matrix $\hat M$ (step 12), and then we apply the Schur decomposition to $\hat M$ (step 13). The eigenvalues of each $\hat M_i$, and thereby the model parameters, are then extracted using the orthogonal matrix $\hat U$ of the Schur decomposition (steps 15-16). Effectively we exploit the fact that the real eigenvalues of a matrix $A$ always appear on the diagonal of its Schur triangularization $T = U^T A U$, even though the entries of the strictly upper diagonal part of $T$ may not be unique.
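This last fact is easy to check numerically; a tiny illustration in NumPy/SciPy (ours, not from the paper):

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(1)
# A matrix with known real eigenvalues: A = V diag(lam) V^{-1}.
lam = np.array([0.7, 0.2, 0.1])
V = rng.standard_normal((3, 3))
A = V @ np.diag(lam) @ np.linalg.inv(V)

T, U = schur(A, output='real')   # A = U T U^T, U orthogonal, T upper triangular
print(np.sort(np.diag(T)))       # matches sorted lam up to numerical error
print(np.allclose(U.T @ U, np.eye(3)))
```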
Using the perturbation analysis of the Schur system of a matrix by Konstantinov et al. (1994), we obtain a theoretical bound on the error of such eigenvalue estimation as a function of the model misspecification error, the condition number of the ground-truth matrix $X$, and the separation of the eigenvalues of $\hat M$ (Theorem 1).

Detailed description and the Schur approach. Consider for simplicity a sample $S$ of independent observations $s = [x, y, z]$ of three distinct variables taking values in the discrete set $\{1, \dots, d\}$. The empirical distribution associated to the sample $S$ is defined as

$$\hat P_{i,j,k} = \frac{1}{|S|} \sum_{s \in S} \delta_{x,i}\, \delta_{y,j}\, \delta_{z,k} \qquad (1)$$

where $|S|$ is the number of elements in $S$ and $\delta_{ab} = 1$ if $a = b$ and zero otherwise. The empirical distribution $\hat P$ is a nonnegative order-3 tensor whose entries sum to one. Its nonnegative rank $\mathrm{rank}_+(\hat P)$ is the minimum number of rank-1 tensors, i.e., mixture components, required to express $\hat P$ as a mixture of product distributions. Such a decomposition of $\hat P$ (exact or approximate) is always possible (Lim and Comon, 2009). Hence, for any choice of $p \le \mathrm{rank}_+(\hat P)$, we can hypothesize that $\hat P$ is generated by a model

$$\hat P = P + \epsilon\, \Delta P, \qquad \epsilon \ge 0 \qquad (2)$$

where $P \in [0,1]^{d_x \times d_y \times d_z}$ is a nonnegative rank-$p$ approximation of $\hat P$, $\epsilon$ is a model misspecification parameter, and $\Delta P \in [0,1]^{d_x \times d_y \times d_z}$ is a nonnegative tensor whose entries sum to one. The rank-$p$ component $P$ is interpreted as the mixture of product distributions that approximates the empirical distribution, and it can be written

$$P_{ijk} = \sum_{h=1}^{p} w_h\, X_{ih} Y_{jh} Z_{kh}, \qquad (3)$$

where $w_h \in [0,1]$ for all $h = 1, \dots, p$, and we have defined the conditional probability matrices

$$X \in [0,1]^{d \times p}, \qquad 1_d^T X = 1_p^T, \qquad (4)$$

(and similarly for $Y, Z$), where $1_n$ is a vector of $n$ ones. The columns of the matrices $X, Y, Z$ encode the conditional probabilities associated with the $p$ mixture components, and the mixing weights satisfy $\sum_h w_h + \epsilon = 1$.

The conditional probability matrices $X, Y, Z$ and the mixing weights $w$ of the rank-$p$ mixture can be estimated from the approximate eigenvalues of a set of observable matrices $\hat M_i$, for $i = 1, \dots, p$, that are computed as follows. Let for simplicity $p = d$ and consider the matrices $[\hat P^Y_i]_{jk} = \hat P_{jik}$ (step 6) and $[\hat P_{xz}] = \sum_i [\hat P^Y_i]$ (step 8). Assuming that $\hat P_{xz}$ is invertible, we define (step 10)

$$\hat M_i = \hat P^Y_i\, \hat P_{xz}^{-1} \qquad (5)$$

for $i = 1, \dots, d$. Under the model assumption $\hat P = P + \epsilon\, \Delta P$, it is easy to show that

$$\hat M_i = M_i + \epsilon\, \Delta M_i + o(\epsilon^2) \qquad (6)$$

where $\Delta M_i \in \mathbb{R}^{d \times d}$ is linear in the misspecification parameter $\epsilon$, and

$$M_i = X\, \mathrm{diag}(Y_{i1}, \dots, Y_{ip})\, X^{-1} \qquad (7)$$

where $\mathrm{diag}(v_1, \dots, v_d)$ denotes a diagonal matrix whose diagonal entries are $v_1, \dots, v_d$. If the model is exact, i.e., $p = \mathrm{rank}_+(\hat P)$ or equivalently $\epsilon = 0$, the matrices $\{\hat M_i\}$ are simultaneously diagonalizable and the entries of the conditional probability matrix $Y$ are given (up to normalization of its columns) by

$$Y_{ij} \propto [V^{-1} \hat M_i V]_{jj}, \qquad i, j = 1, \dots, d \qquad (8)$$

where $V$ is the matrix of the eigenvectors shared by all $\hat M_i$.

When $\epsilon \ne 0$, the matrices $\hat M_i$ are no longer simultaneously diagonalizable and an approximate simultaneous diagonalisation scheme is needed. The standard procedure consists of selecting a representative matrix $\hat M$, computing its eigenvectors $\hat V$, and using the matrix $\hat V$ to obtain the approximate eigenvalues of all matrices $\hat M_i$ and thereby estimate $Y$ (Mossel and Roch, 2006; Hsu et al., 2012; Anandkumar et al., 2012b). In this case, the estimation error is known to depend on the model misspecification parameter $\epsilon$ and on the inverse of an eigenvalue separation $\gamma$ (see, e.g., eq. (12)). Using matrix perturbation theorems and properties of the Gaussian distribution, Mossel and Roch (2006) have shown that a certain separation $\gamma > \alpha$ is guaranteed with probability proportional to $(1 - \alpha)$ if $\hat V$ is the matrix of the eigenvectors of some $\hat M = \sum_i \theta_i \hat M_i$, with $\theta$ sampled from a Gaussian distribution of zero mean and unit variance. In practice, however, this approach can give rise to instabilities (such as negative or imaginary values for $Y_{ij}$), especially when the size of the empirical matrices grows.

Here we propose instead to triangularize the matrices $\hat M_i$ by means of the Schur decomposition of their linear combination $\hat M = \sum_i \theta_i \hat M_i$, for an appropriate $\theta$ (steps 12-13). The orthogonal matrix $\hat U$ obtained from the Schur decomposition $\hat M = \hat U \hat T \hat U^T$ is then used in place of the eigenvector matrix of $\hat M$ to approximately triangularize all the observable matrices $\hat M_i$ and thereby recover the mixture parameters (steps 15 and 24, 25). For example, the conditional probability matrix $Y$ is estimated as

$$\hat Y_{ij} \propto [\hat U^T \hat M_i \hat U]_{jj}, \qquad i, j = 1, \dots, d \qquad (9)$$

and normalized so that its columns sum to one.
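In the exact case ($\epsilon = 0$) the chain (3), (5), (7), (8) can be verified directly. The sketch below is an illustration of ours, assuming $p = d$ and generic (invertible) $X$ and $Z$: it builds a random mixture, forms the observable matrices, and recovers $Y$ from their common eigenbasis.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Random ground-truth mixture with p = d components (columns sum to one).
def random_cond(d, p):
    A = rng.random((d, p))
    return A / A.sum(axis=0, keepdims=True)

X, Y, Z = random_cond(d, d), random_cond(d, d), random_cond(d, d)
w = rng.random(d); w /= w.sum()

# Eq. (3): exact rank-d tensor P_ijk = sum_h w_h X_ih Y_jh Z_kh.
P = np.einsum('h,ih,jh,kh->ijk', w, X, Y, Z)

# Eq. (5): observable matrices M_i = P^Y_i P_xz^{-1} with [P^Y_i]_jk = P_jik.
Pxz = P.sum(axis=1)
M = [P[:, i, :] @ np.linalg.inv(Pxz) for i in range(d)]

# Eq. (7): each M_i equals X diag(Y_i1, ..., Y_id) X^{-1}.
print(np.allclose(M[0], X @ np.diag(Y[0]) @ np.linalg.inv(X)))  # True

# Eq. (8): the shared eigenvector matrix is V = X, so the diagonal of
# V^{-1} M_i V recovers row i of Y (here exactly, so no normalization needed).
V = X
Yrec = np.array([np.diag(np.linalg.inv(V) @ Mi @ V) for Mi in M])
print(np.allclose(Yrec, Y))  # True
```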
Let $\|\cdot\|$ denote the Frobenius norm. Our main result is the following:

Theorem 1. Let $\hat M_i$ and $M_i$ be the real $d \times d$ matrices defined in (5) and (6). Suppose it is possible to find $\theta \in \mathbb{R}^d$ such that $\hat M = \sum_k \theta_k \hat M_k$ has real distinct eigenvalues. Then, for all $j = 1, \dots, d$, there exists a permutation $\pi$ such that

$$|\hat Y_{ij} - Y_{i\pi(j)}| \le a_1 \left( \frac{k(X)\, \lambda_{\max}}{\hat\gamma} + 1 \right) E + o(E^2) \qquad (10)$$

where $k(X) = \frac{\sigma_{\max}(X)}{\sigma_{\min}(X)}$ is the condition number of the ground-truth conditional probability matrix $X$, $\lambda_{\max} = \max_{i,j} Y_{i,j}$,

$$\hat\gamma = \min_{i \ne j} \left| \lambda_i(\hat M) - \lambda_j(\hat M) \right| > 0 \qquad (11)$$

with $\lambda_i(\hat M)$ being the $i$th eigenvalue of $\hat M$, $a_1 = \|\theta\| \sqrt{\frac{2^3 d^2}{d-1}}$, and $E = \max_i \|\Delta M_i\| = \max_i \|\hat M_i - M_i\|$.

Proof. See appendix.

The analogous bound for the diagonalization approach is (Anandkumar et al., 2012c, Section B6, eq. 11)

$$|\hat Y_{ij} - Y_{i\pi(j)}| \le \left( a_2\, k(X)^4\, \frac{\tilde\lambda_{\max}}{\gamma} + a_3\, k(X)^2 \right) E, \qquad (12)$$

where $\gamma = \min_{i \ne j} |\lambda_i(\sum_k \theta_k M_k) - \lambda_j(\sum_k \theta_k M_k)|$, $\tilde\lambda_{\max} = \max(\max_i [\theta^T Y]_i,\ \max_{i,j} Y_{i,j})$, and $a_2, a_3$ are constants that depend on the dimensions of the involved matrices.

When the model misspecification error $E$ is not too large, the error bound under the Schur approach (10) is characterized by a much smoother dependence on $k(X)$ than the error bound (12). Moreover, the Schur bound depends on the eigenvalue gap $\hat\gamma$ of an observable matrix, and hence it can be controlled in practice by optimizing $\theta$. The simplified dependence on $k(X)$ of the Schur bound is due to the good perturbation properties of the orthogonal matrices involved in the Schur decomposition, as compared to the eigenvector matrices of the Chang approach. The difference in the bounds suggests that, for a randomly generated true model, a spectral algorithm based on the Schur decomposition is expected to be more stable and accurate in practice than an algorithm based on matrix diagonalization. Intuitively, the key to the improved stability of the Schur approach comes from the freedom to ignore the non-unique off-diagonal parts in the Schur triangulation.

During the reviewing process we were made aware of the work of Kuleshov et al. (2015), who propose computing a tensor factorization from the simultaneous diagonalization of a set of matrices obtained by projections of the tensor along random directions. Kuleshov et al. (2015) establish an error bound that is independent of the eigenvalue gap, but their approach does not come with global optimality guarantees (although the authors report good results in practice). It would be of interest to see whether such random projections combined with a simultaneous Schur decomposition (see, e.g., De Lathauwer et al. (2004)) could offer improved bounds.

3 EXPERIMENTS

We have compared the performance of the proposed spectral algorithm based on Schur decomposition with the classical spectral method based on eigenvalue decomposition. The two algorithms that we tested are equivalent except for line 13 of Algorithm 1, which in the classical spectral approach should be "find $V$ such that $V^{-1} M V = D$, with $D$ diagonal". In all experiments we used the same code, with decompositions performed via the two Matlab functions schur(M) and eig(M) respectively. We tested the two algorithms on simulated multi-view and Hidden Markov Model data. In what follows we denote by 'schur' the algorithm based on the Schur decomposition and by 'eig' the algorithm based on the eigenvalue-eigenvector decomposition.

In the first set of experiments we generated multi-view data from a mixture of product distributions of $p$ mixture components in $d$ dimensions. For each experiment, we created two different datasets $N = \{[x_n, y_n, z_n] \in [1, \dots, d]^3\}$ and $N_{test}$, one for training and one for testing, the latter containing the labels $L \in [1, \dots, p]^{|N_{test}|}$ of the mixture components that generated each instance. The output was evaluated by measuring the distance between the estimated conditional probability distributions $\hat X, \hat Y, \hat Z$ and the corresponding ground-truth values $X, Y, Z$:

$$E = \|\hat X - X\|^2 + \|\hat Y - Y\|^2 + \|\hat Z - Z\|^2. \qquad (13)$$

Since the order of the columns in $\hat X, \hat Y, \hat Z$ may be different from $X, Y, Z$, the norms were computed after obtaining the best permutation.
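The paper does not spell out how the best permutation is obtained; one reasonable implementation (our assumption) matches the estimated columns to the ground-truth columns with a single assignment computed by the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def recovery_error(Xh, Yh, Zh, X, Y, Z):
    """Eq. (13): squared Frobenius distances computed after aligning the
    estimated mixture components to the ground truth with one shared
    permutation (found on the summed column-matching cost)."""
    # cost[j, k]: total squared distance when estimated component j is
    # identified with ground-truth component k, summed over the three views.
    cost = sum(((Ah[:, :, None] - A[:, None, :]) ** 2).sum(axis=0)
               for Ah, A in ((Xh, X), (Yh, Y), (Zh, Z)))
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()
```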
|N| (d = 10, p = 5)   E_schur          E_eig            S_schur          S_eig            ||T̂_schur − T||   ||T̂_eig − T||
 1000                 0.057 (0.013)    0.066 (0.0167)   0.364 (0.135)    0.360 (0.135)    0.016 (0.004)     0.026 (0.009)
 2000                 0.039 (0.008)    0.120 (0.227)    0.415 (0.068)    0.356 (0.138)    0.011 (0.004)     0.019 (0.008)
 5000                 0.043 (0.012)    0.046 (0.013)    0.387 (0.114)    0.386 (0.084)    0.013 (0.004)     0.021 (0.004)
10000                 0.036 (0.014)    0.047 (0.009)    0.390 (0.130)    0.402 (0.085)    0.013 (0.006)     0.022 (0.007)
20000                 0.032 (0.014)    0.113 (0.230)    0.431 (0.124)    0.341 (0.157)    0.011 (0.007)     0.025 (0.007)
50000                 0.019 (0.015)    0.025 (0.010)    0.475 (0.1887)   0.434 (0.143)    0.007 (0.006)     0.015 (0.009)

Table 1: Columns recovery error E, classification score S, and distance of the approximate distribution $\|\hat T - T\|$ for multi-view datasets of increasing size. The algorithm based on Schur decomposition obtained the best scores on almost all datasets.

We also tested according to a classification rule where the estimated conditional probability matrices $[\hat X, \hat Y, \hat Z]$ were used to assign each triple in the test dataset to one of the mixture components. For every run, we obtained a classification score by counting the number of successful predictions divided by the number of elements in the test dataset:

$$S = \frac{1}{|N_{test}|} \sum_{n \in N_{test}} f(n), \qquad (14)$$

$$f(n) = \begin{cases} 0 & \arg\max_i \hat X_{x_n i}\, \hat Y_{y_n i}\, \hat Z_{z_n i} \ne L(n) \\ 1 & \arg\max_i \hat X_{x_n i}\, \hat Y_{y_n i}\, \hat Z_{z_n i} = L(n) \end{cases} \qquad (15)$$
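A direct transcription of (14)-(15) (illustrative code of ours; zero-based labels and column-aligned estimates are assumed):

```python
import numpy as np

def classification_score(test_triples, labels, Xh, Yh, Zh):
    """Eqs. (14)-(15): fraction of test triples assigned to the correct
    mixture component by the max-product rule."""
    correct = 0
    for (x, y, z), true_label in zip(test_triples, labels):
        pred = np.argmax(Xh[x, :] * Yh[y, :] * Zh[z, :])
        correct += int(pred == true_label)
    return correct / len(labels)
```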

Finally we computed the distance in norm between the recovered tensor

$$\hat T_{ijk} = \sum_r \hat w_r\, [\hat X]_{ir}\, [\hat Y]_{jr}\, [\hat Z]_{kr} \qquad (16)$$

and the original tensor

$$T_{ijk} = \sum_r [w]_r\, [X]_{ir}\, [Y]_{jr}\, [Z]_{kr} \qquad (17)$$

as follows

$$\|\hat T - T\| = \sqrt{\sum_{i,j,k} [\hat T - T]_{ijk}^2}\,. \qquad (18)$$

In Table 1 we show the results obtained by the two algorithms for $d = 10$, $p = 5$ and increasing size of the training dataset $|N|$. In the table we report the average score over 10 analogous runs and the corresponding standard deviation in brackets. When the recovered matrices contained infinite values we have set $E = \|X\|^2 + \|Y\|^2 + \|Z\|^2$ and $\|\hat T - T\| = \|T\|$. The proposed algorithm based on Schur decomposition obtained the best scores on almost all datasets.

In the second set of experiments we tested the two algorithms on datasets generated by a $d$-dimensional Hidden Markov Model with $p$ hidden states. For each experiment we randomly picked a model $M_{true} = M_{true}(O_{true}, R_{true}, h_{true})$, where $O_{true} \in [0,1]^{d \times p}$ is the observation matrix, $R_{true} \in [0,1]^{p \times p}$ is the transition matrix, and $h_{true} \in [0,1]^p$ is the starting distribution, and we generated two sample datasets, one for training and one for testing. All sequences $s_n$ were simulated starting from an initial hidden state drawn from $h_{true}$ and following the dynamics of the model according to $R_{true}$ and $O_{true}$. The length of the sequences in the training and testing datasets was set to 20. We evaluated the two algorithms based on the columns recovery error $E$ as in the previous set of experiments. Also, letting $O_{true} = [o_{true,1}, \dots, o_{true,p}]$ and $O = [o_1, \dots, o_p]$, we considered a recovery ratio $R(M) = \frac{r}{p}$, where $r$ is the number of columns satisfying

$$\|o_{true,i} - o_i\|_2^2 < \xi, \qquad \xi = 0.05\, d\,. \qquad (19)$$

In Table 2 we show the results for recovering a $d = \{5, 10, 20, 30\}$ HMM with $p = 5$ hidden states. All values are computed by averaging over 10 experiments and the corresponding standard deviation is reported between brackets. The Schur algorithm is in general better than the classical approach. We note that, for a fixed number of hidden states, the inference of the HMM parameters becomes harder as the dimensionality of the space decreases. As the recovery ratio $R$ shows, in the limit situation $d = p$ both algorithms fail (values $R = 0$ imply unstable solutions).
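The remaining metrics (16)-(19) are equally direct to compute; a short sketch of ours, again assuming the columns have already been aligned:

```python
import numpy as np

def tensor_distance(w_hat, Xh, Yh, Zh, w, X, Y, Z):
    """Eqs. (16)-(18): Frobenius distance between the reconstructed and
    the ground-truth tensors."""
    T_hat = np.einsum('r,ir,jr,kr->ijk', w_hat, Xh, Yh, Zh)
    T = np.einsum('r,ir,jr,kr->ijk', w, X, Y, Z)
    return np.sqrt(((T_hat - T) ** 2).sum())

def recovery_ratio(O_true, O_hat, xi):
    """Eq. (19): fraction of observation columns recovered within
    squared distance xi (columns assumed already aligned)."""
    sq_dist = ((O_true - O_hat) ** 2).sum(axis=0)
    return (sq_dist < xi).mean()
```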

|N| (d = 30, p = 5)   E(T_schur)       E(T_eig)         R(T_schur)       R(T_eig)
  100                 0.011 (0.001)    0.014 (0.005)    1 (0)            1 (0)
  500                 0.011 (0.001)    0.012 (0.002)    1 (0)            1 (0)
 1000                 0.011 (0.001)    0.013 (0.002)    1 (0)            1 (0)
 2000                 0.011 (0.001)    0.011 (0.001)    1 (0)            1 (0)
 5000                 0.010 (0.001)    0.010 (0.001)    1 (0)            1 (0)

|N| (d = 20, p = 5)   E(T_schur)       E(T_eig)         R(T_schur)       R(T_eig)
  100                 0.019 (0.002)    0.026 (0.010)    1 (0)            0.880 (0.168)
  500                 0.018 (0.003)    0.044 (0.080)    1 (0)            0.900 (0.316)
 1000                 0.019 (0.004)    0.021 (0.002)    1 (0)            1 (0)
 2000                 0.017 (0.002)    0.017 (0.002)    1 (0)            1 (0)
 5000                 0.015 (0.002)    0.018 (0.006)    1 (0)            0.960 (0.126)

|N| (d = 10, p = 5)   E(T_schur)       E(T_eig)         R(T_schur)       R(T_eig)
  100                 0.047 (0.011)    0.050 (0.011)    0.200 (0.188)    0.220 (0.175)
  500                 0.044 (0.008)    0.097 (0.154)    0.240 (0.157)    0.200 (0.188)
 1000                 0.046 (0.017)    0.051 (0.018)    0.260 (0.211)    0.120 (0.139)
 2000                 0.043 (0.016)    0.048 (0.011)    0.180 (0.220)    0.120 (0.139)
 5000                 0.040 (0.013)    0.089 (0.153)    0.380 (0.257)    0.200 (0.133)

|N| (d = 5, p = 5)    E(T_schur)       E(T_eig)         R(T_schur)       R(T_eig)
  100                 0.163 (0.089)    0.164 (0.074)    0 (0)            0 (0)
  500                 0.173 (0.063)    0.228 (0.294)    0 (0)            0.040 (0.084)
 1000                 0.191 (0.068)    0.251 (0.273)    0.020 (0.063)    0.040 (0.084)
 2000                 0.205 (0.070)    0.166 (0.052)    0.060 (0.096)    0 (0)
 5000                 0.142 (0.063)    0.164 (0.069)    0 (0)            0.020 (0.063)

Table 2: Columns recovery error E and recovery ratio R for recovering HMMs of various dimensionality and hidden states using the Schur and the standard spectral approach. See text for details.

4 CONCLUSIONS

We have presented a new spectral algorithm for learning multi-view mixture models that is based on the Schur decomposition of an observable matrix. Our main result is a theoretical bound on the estimation error (Theorem 1), which is shown to depend very mildly (and much more smoothly than in previous results) on the condition number of the ground-truth conditional probability matrix, and inversely on the eigengap of an observable matrix. Numerical experiments show that the proposed method is more stable, and performs better in general, than the classical spectral approach using direct matrix diagonalization.

Appendix - Proof of Theorem 1

Theorem 1. Let $\hat M_i$ and $M_i$ be the real $d \times d$ matrices defined in (5) and (6). Suppose it is possible to find $\theta \in \mathbb{R}^d$ such that $\hat M = \sum_k \theta_k \hat M_k$ has real distinct eigenvalues. Then, for all $j = 1, \dots, d$, there exists a permutation $\pi$ such that

$$|\hat Y_{ij} - Y_{i\pi(j)}| \le a_1 \left( \frac{k(X)\, \lambda_{\max}}{\hat\gamma} + 1 \right) E + o(E^2) \qquad (20)$$

with $\hat Y$ estimated from (9), and where $k(X) = \frac{\sigma_{\max}(X)}{\sigma_{\min}(X)}$ is the condition number of the ground-truth conditional probability matrix $X$, $\lambda_{\max} = \max_{i,j} Y_{i,j}$,

$$\hat\gamma = \min_{i \ne j} \left| \lambda_i(\hat M) - \lambda_j(\hat M) \right| > 0 \qquad (21)$$

with $\lambda_i(\hat M)$ being the $i$th eigenvalue of $\hat M$, $a_1 = \|\theta\| \sqrt{\frac{2^3 d^2}{d-1}}$, and $E = \max_i \|\Delta M_i\| = \max_i \|\hat M_i - M_i\|$.

Proof. Consider the set of real commuting matrices $M_i$, $i = 1, \dots, d$, and their random perturbations $\hat M_i = M_i + \Delta M_i$ defined in (5) and (6). Assume that $\|\Delta M_i\| < E$ for all $i = 1, \dots, d$ and that $\theta \in \mathbb{R}^d$ is such that the eigenvalues of $\hat M = M + \Delta M = \sum_i \theta_i (M_i + \Delta M_i)$ are real and non-degenerate. Let $\hat U$ be the orthogonal matrix defined by the Schur decomposition $\hat M = \hat U \hat T \hat U^T$ computed by the matrix decomposition subroutine in Algorithm 1. Note that $\hat U$ may not be unique and different choices of $\hat U$ lead to different entries in the strictly upper-diagonal part of $\hat T$. However, for any given $\hat U$ such that $\hat U^T \hat M \hat U$ is upper triangular, there exists an orthogonal matrix $U$ and a real matrix $\Delta U \in \mathbb{R}^{d \times d}$ such that $U = \hat U + \Delta U$ and

$$U^T M U = (\hat U + \Delta U)^T (\hat M - \Delta M)(\hat U + \Delta U) \qquad (22)$$
$$= \hat T - \Delta T \qquad (23)$$
$$= T \qquad (24)$$

with $T$ upper triangular. Let $Y_{ij}$ be the ground-truth matrix defined in (3) and $\hat Y$ the estimate output by Algorithm 1. Then, assuming that $\Delta M$ and $\Delta U$ are small, there exists a permutation of the indexes $\pi$ such that, for all $i, j = 1, \dots, d$,

$$\delta y = |\hat Y_{ij} - Y_{i\pi(j)}| \qquad (25)$$
$$= |[\hat U^T \hat M_i \hat U]_{jj} - [U^T M_i U]_{\pi(j)\pi(j)}| \qquad (26)$$
$$\le \|(U - \Delta U)^T (M_i + \Delta M_i)(U - \Delta U) - T_i\| \qquad (27)$$
$$= \|\Delta U^T U T_i + T_i U^T \Delta U - U^T \Delta M_i U + o(\Delta^2)\| \qquad (28)$$
$$= \|x T_i - T_i x + U^T \Delta M_i U + o(\Delta^2)\| \qquad (29)$$
$$\le 2\, \|x\|\, \|T_i\| + \|\Delta M_i\| + o(\|\Delta^2\|) \qquad (30)$$
$$\le 2\, \|x\|\, \mu + E + o(\|\Delta^2\|) \qquad (31)$$

where we have defined $x = U^T \Delta U$, $\mu = \max_i \|M_i\|$, and used $1 = (U + \Delta U)^T (U + \Delta U) = 1 + x^T + x + o(\Delta^2)$, where $o(\Delta^2) = o(x^2) + o(\Delta M\, x)$.

Following (Konstantinov et al., 1994), a linear bound on $\|x\|$ can be estimated as follows. First, observe that the Schur decomposition of $M$ in (24) implies

$$\mathrm{low}(\hat T \hat x - \hat x \hat T) = \mathrm{low}(\hat U^T \Delta M \hat U) + o(\Delta^2) \qquad (32)$$

where $\mathrm{low}(A)$ denotes the strictly lower diagonal part of $A$ and $\hat x = \hat U^T \Delta U$. Since $\hat T$ is upper triangular, one has $\mathrm{low}(\hat T \hat x - \hat x \hat T) = \mathrm{low}(\hat T\, \mathrm{low}(\hat x) - \mathrm{low}(\hat x)\, \hat T)$, i.e. the linear operator defined by $L_{\hat T}(\hat x) = \mathrm{low}(\hat T \hat x - \hat x \hat T)$ maps strictly lower-triangular matrices to strictly lower-triangular matrices. Let $\tilde L_{\hat T}(\cdot)$ be the restriction of $L_{\hat T}(\cdot)$ to the subspace of lower-triangular matrices; then from (32) one has

$$\tilde L_{\hat T}(\mathrm{low}(\hat x)) = \mathrm{low}(\hat U^T \Delta M \hat U) + o(\Delta^2) \qquad (33)$$

In a suitable basis the restricted operator $\tilde L_{\hat T}$ admits a block-triangular matrix realization $\mathcal{M}$, with blocks

$$\mathcal{M}_{ij} = \begin{cases} [0_{d-i,\,i-j},\ 1_{d-j}] & i > j \\ [m_i] & i = j \\ 0 & i < j \end{cases} \qquad (38)$$

$$[m_i]_{jk} = \begin{cases} \hat T_{j+i-1,\,k+i} & j < k \\ \hat T_{j+i,\,j+i} - \hat T_{i,i} & j = k \\ 0 & j > k \end{cases} \qquad (39)$$

for $i, j = 1, \dots, d-1$. The determinant of $\mathcal{M}$ is the product of the determinants of its diagonal blocks, i.e.

$$\det(\mathcal{M}) = \prod_{i > j} (\hat T_{ii} - \hat T_{jj}) \qquad (40)$$

and is not null provided that the eigenvalues of $\hat T$ are real and separated. In this case, the matrix $\mathcal{M}$ and hence the operator $\tilde L_{\hat T}(\cdot)$ are invertible. From (32) one has

$$\mathrm{low}(\hat x) = \tilde L_{\hat T}^{-1}\, \mathrm{low}(\hat U^T \Delta M \hat U) + o(\Delta^2) \qquad (41)$$

and in particular

$$\|\hat x\| = \sqrt{2}\, \|\mathrm{low}(\hat x)\| \qquad (42)$$
$$= \sqrt{2}\, \|\tilde L_{\hat T}^{-1}\|_F\, \|\Delta M\| + o(\|\Delta\|^2) \qquad (43)$$

where the first equality is obtained using the linear approximation $\hat x = -\hat x^T$ and $\|A\|^2 = \|\mathrm{low}(A)\|^2 + \|\mathrm{diag}(A)\|^2 + \|\mathrm{up}(A)\|^2$, with $\mathrm{diag}(A)$ and $\mathrm{up}(A)$ denoting the diagonal and strictly upper-diagonal parts of $A$. The norm of the inverse operator can be bounded using its matrix realization, i.e.

$$\|\tilde L_{\hat T}^{-1}\|_F = \|\mathcal{M}^{-1}\| \le \frac{1}{\sigma_{\min}(\mathcal{M})} \qquad (44)$$

where $\sigma_{\min}(A)$ is the smallest singular value of $A$. We can estimate $\sigma_{\min}(\mathcal{M})$ by using the following lemma

$$\sigma_{\min}(A) = \min_{\mathrm{rank}(B) < n} \|A - B\|, \qquad \mathrm{rank}(A) = n \qquad (45)$$

The norm $\mu$ defined above can be bounded as

$$\mu = \max_i \|M_i\| \qquad (51)$$
$$= \max_i \|X\, \mathrm{diag}(Y_{i1}, \dots, Y_{id})\, X^{-1}\| \qquad (52)$$
$$\le \|X\|\, \|X^{-1}\|\, \max_i \|\mathrm{diag}(Y_{i1}, \dots, Y_{id})\| \qquad (53)$$
$$\le k(X)\, \sqrt{d}\, \max_{i,j} Y_{ij} \qquad (54)$$
$$= k(X)\, \sqrt{d}\, \lambda_{\max} \qquad (55)$$

where $X$ is the ground-truth matrix defined in (3) and $k(X) = \frac{\sigma_{\max}(X)}{\sigma_{\min}(X)}$ is the condition number of $X$. Finally, the statement (10) follows from (31), (43), $\|\hat x\| = \|x\|$, (44), (49), (55) and

$$\|\Delta M\|^2 = \Big\| \sum_{i=1}^{d} \theta_i\, \Delta M_i \Big\|^2 \qquad (56)$$
$$= \sum_{j,k=1}^{d} \Big( \sum_{i=1}^{d} \theta_i\, [\Delta M_i]_{jk} \Big)^2 \qquad (57)$$
$$\le \|\theta\|^2 \sum_{j,k=1}^{d} \sum_{i=1}^{d} [\Delta M_i]_{jk}^2 \qquad (58)$$
$$= \|\theta\|^2 \sum_{i=1}^{d} \|\Delta M_i\|^2 \qquad (59)$$
$$\le d\, \|\theta\|^2 E^2 \qquad (60)$$

where we have used the Cauchy-Schwarz inequality and the definition of $E$. In particular, for all higher order terms contained in $\Delta^2$, one has $o(\|\Delta^2\|) = o(\|x\|^2) + o(\|\Delta M\|^2) = o(E^2)$.
References

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2012a). Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559.

Anandkumar, A., Hsu, D., Huang, F., and Kakade, S. M. (2012b). Learning high-dimensional mixtures of graphical models. arXiv preprint arXiv:1203.0697.

Anandkumar, A., Hsu, D., and Kakade, S. M. (2012c). A method of moments for mixture models and hidden Markov models. CoRR, abs/1203.0683.

Arora, S., Ge, R., and Moitra, A. (2012). Learning topic models - going beyond SVD. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 1-10. IEEE.

Balle, B., Hamilton, W., and Pineau, J. (2014). Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In Jebara, T. and Xing, E. P., editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1386-1394. JMLR Workshop and Conference Proceedings.

Buesing, L., Sahani, M., and Macke, J. H. (2012). Spectral learning of linear dynamics from generalised-linear observations with application to neural population data. In Advances in Neural Information Processing Systems, pages 1682-1690.

Chang, J. T. (1996). Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences, 137(1):51-73.

Corless, R. M., Gianni, P. M., and Trager, B. M. (1997). A reordered Schur factorization method for zero-dimensional polynomial systems with multiple roots. Pages 133-140. ACM Press.

De Lathauwer, L., De Moor, B., and Vandewalle, J. (2004). Computation of the canonical decomposition by means of a simultaneous generalized Schur decomposition. SIAM Journal on Matrix Analysis and Applications, 26(2):295-327.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1-38.

Favoreel, W., De Moor, B., and Van Overschee, P. (2000). Subspace state space system identification for industrial processes. Journal of Process Control, 10(2):149-155.

Hsu, D. and Kakade, S. M. (2012). Learning Gaussian mixture models: Moment methods and spectral decompositions. CoRR, abs/1206.5766.

Hsu, D., Kakade, S. M., and Zhang, T. (2012). A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460-1480.

Konstantinov, M. M., Petkov, P. H., and Christov, N. D. (1994). Nonlocal perturbation analysis of the Schur system of a matrix. SIAM Journal on Matrix Analysis and Applications, 15(2):383-392.

Kuleshov, V., Chaganty, A., and Liang, P. (2015). Tensor factorization via matrix factorization. In 18th International Conference on Artificial Intelligence and Statistics (AISTATS).

Lim, L. H. and Comon, P. (2009). Nonnegative approximations of nonnegative tensors. Journal of Chemometrics, 23(7-8):432-441.

Lindsay, B. G. (1995). Mixture models: theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, pages 1-163. JSTOR.

Mossel, E. and Roch, S. (2006). Learning nonsingular phylogenies and hidden Markov models. The Annals of Applied Probability, 16(2):583-614.

Song, J. and Chen, K. C. (2014). Spectacle: Faster and more accurate chromatin state annotation using spectral learning. bioRxiv.

Titterington, D. M. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley Series in Probability and Mathematical Statistics. Wiley, Chichester; New York.

Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. (2014). Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems, pages 1260-1268.

Zou, J. Y., Hsu, D., Parkes, D. C., and Adams, R. P. (2013). Contrastive learning using spectral methods. In Advances in Neural Information Processing Systems, pages 2238-2246.