Tensor Decomposition with Smoothness
Masaaki Imaizumi 1   Kohei Hayashi 2 3

1 Institute of Statistical Mathematics  2 National Institute of Advanced Industrial Science and Technology  3 RIKEN. Correspondence to: Masaaki Imaizumi
Abstract

Real data tensors are typically high dimensional; however, their intrinsic information is preserved in low-dimensional space, which motivates the use of tensor decompositions such as Tucker decomposition. Frequently, real data tensors are smooth in addition to being low dimensional, meaning that adjacent elements are similar or change continuously; such tensors typically arise as spatial or temporal data. We propose smoothed Tucker decomposition (STD) to incorporate the smoothness property. STD leverages smoothness using the sum of a few basis functions, which reduces the number of parameters. The objective function is formulated as a convex problem, and an algorithm based on the alternating direction method of multipliers is derived to solve it. We theoretically show that, under the smoothness assumption, STD achieves a better error bound. The theoretical result and the performance of STD are numerically verified.

1. Introduction

A tensor (i.e., a multi-way array) is a data structure that generalizes a matrix, and it can represent higher-order relationships. Tensors appear in various applications such as image analysis (Jia et al., 2014), data mining (Kolda & Sun, 2008), and medical analysis (Zhou et al., 2013). For instance, functional magnetic resonance imaging (fMRI) records brain activities in each time period as voxels, which are represented as 4-way tensors (X-axis $\times$ Y-axis $\times$ Z-axis $\times$ time). Frequently, data tensors in the real world contain several missing elements and/or are corrupted by noise, which leads to the tensor completion problem of predicting missing elements and the tensor recovery problem of removing noise.

To solve these problems, the low-rank assumption, i.e., that the given tensor is generated from a small number of latent factors, is widely used. If the number of observed elements is sufficiently larger than the number of latent factors (i.e., the rank) and the noise level, we can estimate the latent factors and reconstruct the entire structure. The methods for estimating latent factors are collectively referred to as tensor decompositions. There are several formulations of tensor decompositions, such as Tucker decomposition (Tucker, 1966) and the CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970). While these methods were originally formulated as nonconvex problems, several authors have studied their convex relaxations in recent years (Liu et al., 2009; Tomioka et al., 2010; Signoretto et al., 2011; Gandy et al., 2011).

Another important, yet less explored, assumption is the smoothness property. Consider fMRI data as a tensor X. As fMRI data are spatiotemporal, each element of X is expected to be similar to its adjacent elements in every mode, i.e., $x_{i,j,k,t}$ should be close to $x_{i\pm 1,j,k,t}$, $x_{i,j\pm 1,k,t}$, $x_{i,j,k\pm 1,t}$, and $x_{i,j,k,t\pm 1}$. In statistics, this kind of smoothness property has been studied through functional data analysis (Ramsay, 2006; Hsing & Eubank, 2015). Studies show that the smoothness assumption increases sample efficiency, i.e., estimation is more accurate with a small sample size. Another advantage is that the smoothness assumption makes interpolation possible, i.e., we can impute an unobserved value using its adjacent observed values. This interpolation ability is particularly useful for solving a specific tensor completion problem referred to as the tensor interpolation problem, also known as the "cold-start" problem (Gantner et al., 2010). Suppose a case in which the fMRI tensor X is completely missing at time $t = t_0$.
In this case, standard tensor decompositions cannot predict the missing elements because there is no information with which to estimate the latent factor at $t = t_0$. However, using the smoothness property, we can estimate the missing elements from the elements at $t = t_0 - 1$ and $t = t_0 + 1$.

A fundamental challenge of tensor completion and recovery methods is to analyze their performance. Tomioka et al. (2011) derived statistical error bounds for convex low-rank tensor decompositions. On the contrary, the performance of tensor decompositions incorporating smoothness (Yokota et al., 2015b;a; Amini et al., 2013) has never been addressed. The most important barrier is that all such methods are formulated as nonconvex problems, which hinders the use of the tools developed for the convex tensor decompositions (Liu et al., 2009; Signoretto et al., 2010; Tomioka et al., 2010).
Contributions  In this paper, we propose a simple tensor decomposition model incorporating the smoothness property, which we refer to as Smoothed Tucker Decomposition (STD). Following the notions of functional data analysis, STD approximates an observed tensor by a small number of basis functions, such as Fourier series, and decomposes them through Tucker decomposition (Figure 1). STD is formulated as a convex optimization problem that regularizes the ranks and the degree of smoothness. To solve this problem, we derive an algorithm based on the alternating direction method of multipliers (ADMM), which always finds the global optimum.

Figure 1. Comparison of Tucker decomposition and STD. In STD, the mode-wise smoothness of the observed tensor is preserved via basis functions.

Based on the convex formulation, we provide a few theoretical guarantees for STD; namely, we derive error bounds for the tensor recovery and interpolation problems. We show that the error bounds for smooth tensors are improved and better than those of other methods. In addition, to the best of our knowledge, this is the first analysis that establishes an error bound for tensor interpolation. These results are empirically confirmed through experiments using synthetic and real data.

To summarize, STD has the following advantages.

• Sample efficiency: STD achieves the same error with a smaller sample size.
• Interpolation ability: STD can solve the tensor interpolation problem.
• Convex formulation: STD ensures that a global solution is obtained.

Related Works  A few authors have investigated the smoothness property for tensor decompositions. Amini et al. (2013) proposed a kernel method, and Yokota et al. (2015a;b) developed a smooth decomposition method for matrices and tensors using basis functions. These studies demonstrated that the smoothness assumption significantly improves the performance of tensor decompositions in actual applications such as noisy image reconstruction (Yokota et al., 2015b). However, these performance gains were confirmed only in an empirical manner.

Several authors have addressed the tensor interpolation problem by extending tensor decompositions; however, instead of smoothness, these methods utilize additional information such as network structures (Hu et al., 2015) or side information (Gantner et al., 2010; Narita et al., 2011). Moreover, the performance on the tensor interpolation problem has never been analyzed theoretically.

2. Preliminaries

Given $K \in \mathbb{N}$ and natural numbers $I_1, \ldots, I_K \in \mathbb{N}$, let $\mathcal{X} \subset \mathbb{R}^{I_1 \times \cdots \times I_K}$ be the space of K-way tensors and $X \in \mathcal{X}$ be a K-way tensor. For practical use, we define $I_{\setminus k} := \prod_{k' \neq k} I_{k'}$. Each way of a tensor is referred to as a mode; $I_k$ is the dimensionality of the k-th mode for $k = 1, \ldots, K$. For a vector $Y \in \mathbb{R}^d$, $[Y]_j$ denotes its j-th element. Similarly, $[X]_{j_1 j_2 \ldots j_K}$ denotes the $(j_1, j_2, \ldots, j_K)$-th element of X. The inner product on $\mathcal{X}$ is defined as $\langle X, X' \rangle = \sum_{j_1, \ldots, j_K = 1}^{I_1, \ldots, I_K} [X]_{j_1, \ldots, j_K} [X']_{j_1, \ldots, j_K}$ for $X, X' \in \mathcal{X}$. This induces the Frobenius norm $|||X|||_F = \sqrt{\langle X, X \rangle}$. For a vector $Z \in \mathbb{R}^d$, let $\|Z\| = \sqrt{Z^T Z}$ denote the norm. In addition, we introduce the $L_2$ norm for functions as $\|f\|_2^2 = \int_I f(t)^2 \, dt$ for a function $f : I \to \mathbb{R}$ with some domain $I \subset \mathbb{R}$. $\mathcal{C}^{\alpha}(I)$ denotes the set of $\alpha$-times differentiable functions on $I$.

2.1. Tucker Decomposition

With a set of finite positive integers $(R_1, \ldots, R_K)$, the Tucker decomposition of X is defined as

$$X = \sum_{r_1, \ldots, r_K = 1}^{R_1, \ldots, R_K} g_{r_1 \ldots r_K} \, u_{r_1}^{(1)} \otimes u_{r_2}^{(2)} \otimes \cdots \otimes u_{r_K}^{(K)}, \qquad (1)$$

where $g_{r_1 \ldots r_K} \in \mathbb{R}$ is a coefficient for each $(r_1, \ldots, r_K)$, $\otimes$ denotes the tensor product, and $u_{r_k}^{(k)} \in \mathbb{R}^{I_k}$ is a vector for each $r_k$ $(k = 1, \ldots, K)$; within each mode, these vectors are orthogonal to each other for $r_k = 1, \ldots, R_k$. Here, we refer to $(R_1, \ldots, R_K)$ as the Tucker rank, and X is an $(R_1, \ldots, R_K)$-rank tensor. In addition, we let the tensor $G \in \mathbb{R}^{R_1 \times \cdots \times R_K}$ with $[G]_{r_1, \ldots, r_K} = g_{r_1 \ldots r_K}$ be a core tensor, and let the matrix $U_{(k)} = (u_1^{(k)} \cdots u_{R_k}^{(k)}) \in \mathbb{R}^{I_k \times R_k}$ collect the vectors for each $k = 1, \ldots, K$. Using this notation, the Tucker decomposition (1) can be written as

$$X = G \times_1 U_{(1)} \times_2 U_{(2)} \cdots \times_K U_{(K)}, \qquad (2)$$

where $\times_k$ denotes the k-mode matrix product (see Kolda & Bader (2009) for more details).
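For concreteness, the k-mode product and the reconstruction in (2) can be sketched in a few lines of numpy. This is only an illustrative sketch; the function names are ours rather than those of any standard tensor library.

```python
import numpy as np

def mode_product(X, U, k):
    """k-mode product X x_k U: contracts mode k of X with the rows of U.

    X has shape (I_1, ..., I_K); U has shape (J, I_k); the result
    replaces mode k of X by a mode of size J.
    """
    Xk = np.moveaxis(X, k, 0)              # bring mode k to the front
    Yk = np.tensordot(U, Xk, axes=(1, 0))  # contract over mode k
    return np.moveaxis(Yk, 0, k)           # put the new mode back

def tucker_reconstruct(G, Us):
    """Rebuild X = G x_1 U_(1) x_2 ... x_K U_(K) from core and factors, as in (2)."""
    X = G
    for k, U in enumerate(Us):
        X = mode_product(X, U, k)
    return X

# Toy example: a (2,3,4)-rank core expanded to a 5x6x7 tensor with
# orthonormal factor matrices (via reduced QR).
rng = np.random.default_rng(0)
G = rng.normal(size=(2, 3, 4))
Us = [np.linalg.qr(rng.normal(size=(I, R)))[0]
      for I, R in zip((5, 6, 7), (2, 3, 4))]
X = tucker_reconstruct(G, Us)
print(X.shape)  # (5, 6, 7)
```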
2.2. Application Problems for Tensors

Let $S \subset \{(j_1, j_2, \ldots, j_K)\}_{j_1, j_2, \ldots, j_K = 1}^{I_1, I_2, \ldots, I_K}$ be an index set and $n := |S|$. Let $j(i)$ be the i-th element of S for $i = 1, \ldots, n$. Then, we consider the following observation model:

$$y_i = [X^*]_{j(i)} + \epsilon_i, \qquad (3)$$

where $X^* \in \mathcal{X}$ is an unobserved true tensor, $\{y_i\}_{i=1}^n$ is the set of observed values, and $\epsilon_i$ is noise with mean zero and variance $\sigma^2$. We define an observation vector $Y := (y_1 \cdots y_n)^T \in \mathbb{R}^n$ and a noise vector $\mathcal{E} := (\epsilon_1 \cdots \epsilon_n)^T \in \mathbb{R}^n$. Additionally, we define a rearranging operator $\mathfrak{X} : \mathcal{X} \to \mathbb{R}^n$ via $[\mathfrak{X}(X)]_i = [X]_{j(i)}$. Using this notation, observation model (3) is written as

$$Y = \mathfrak{X}(X^*) + \mathcal{E}. \qquad (4)$$

When all the elements are observed, i.e., $n = \prod_k I_k$, the problem of estimating $X^*$ is referred to as the tensor recovery problem. When a few elements of X are missing, i.e., $n < \prod_k I_k$, the problem is referred to as the tensor completion problem. Specifically, for any mode k, if there exists an index $j_0 \in [I_k]$ that S does not contain, we refer to the problem of estimating $\{[X^*]_{j_1 \cdots j_K} \mid j_k = j_0\}$ as the tensor interpolation problem.
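The observation model (3)–(4) is easy to simulate. The following sketch (illustrative only; the names are ours) draws noisy observations of a tensor over an index set S, using the cold-start pattern from the introduction in which one entire slice is unobserved.

```python
import numpy as np

def make_observation(X_true, S, noise_sd, rng):
    """Sample y_i = [X*]_{j(i)} + eps_i for an index set S (models (3)/(4)).

    S is a list of K-tuples; the returned y plays the role of
    Y = X(X*) + E, where the rearranging operator simply reads the
    tensor entries at the observed indices.
    """
    idx = tuple(np.array(S).T)  # one index array per mode
    return X_true[idx] + noise_sd * rng.normal(size=len(S))

rng = np.random.default_rng(1)
X_true = rng.normal(size=(5, 6, 7))
# Observe every entry except those with third index t = 0 -- the
# "tensor interpolation" (cold-start) pattern described above.
S = [(i, j, t) for i in range(5) for j in range(6) for t in range(1, 7)]
y = make_observation(X_true, S, noise_sd=0.1, rng=rng)
print(len(S), y.shape)  # 180 (180,)
```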
[X]j1...jK = fX (gj1 ,...,gjK ), (6) Using observation model (4), we provide an estimator for the unknown true tensor X⇤. The estimator of Xˆ is ob- for jk =1, 2,...with each k =1,...,K. tained by solving the following optimization problem: As the smoothness of the function, we assume that fX is differentiable with respect to all K arguments, which al- 1 min Y X(X) 2 +⌦(X) , (5) lows for the expansion of the basis function to a few use- X ⇥ 2nk k 2 ful basis functions (for example, Tsybakov (2008)) and the decomposition of multivariate functions by the basis (see where ⇥ is a convex subset of , and ⌦:⇥ R+ ⇢X X ! (k) Hackbusch (2012) for detail). Let m :[0, 1] R m is a regularization term. For the regularization ⌦(X), the { ! } overlapped Schatten 1-norm is frequently used (Liu et al., be a set of orthonormal basis functions, such as Fourier se-
ries or wavelet series, and wm1,...,mK R m1,...,mK be 2009; Tomioka et al., 2010; Signoretto et al., 2011; Gandy { 2 } et al., 2011); is is defined as a set of coefficients. Because of the differentiability, fX is written as the weighted sum of the basis functions as R 1 K 1 K k X := X := (X ), 1 1 (1) (K) s (k) s rk (k) f = w . ||| ||| K k k K X m1...mK m1 mK (7) k=1 k=1 r =1 ··· ··· k m =1 m =1 X X X X1 XK Ik k =k Ik Combining (6) and (7) yields a formulation of the elements where X(k) R ⇥ 06 denotes the unfolding matrix 2 obtained by concatenatingQ the mode-k fibers of X as col- of the smooth tensor as umn vectors and rk (X(k)) denotes the rk-th largest eigen- [X]j1,...,jK (8) value of X(k). This penalty term regularizes the Tucker 1 1 rank of X (Negahban & Wainwright, 2011; Tomioka et al., = w (1) (g ) (K)(g ). ··· m1...mK m1 j1 ··· mK jK 2011). m =1 m =1 X1 XK Tensor Decomposition with Smoothness
To solve problem (5) with the Schatten regularization, ADMM is frequently employed (Boyd et al., 2011; Tomioka et al., 2010). ADMM generates a sequence of variables and Lagrangian multipliers by iteratively minimizing the augmented Lagrangian function. It is known that ADMM can easily solve an optimization problem with a non-differentiable regularization term such as $|||\cdot|||_s$.

3. STD: Smoothed Tucker Decomposition

3.1. Smoothness on Tensors

Before explaining the proposed approach, we introduce the notion of smoothness on tensors. We start with the idea that a data tensor is obtained as a result of the discretization of a multivariate function. For example, consider an observation model of the wind power on a land surface. Suppose that the land surface is described by a plane $[0,1]^2$ (i.e., longitude and latitude) and the observation model is given by a function $f : [0,1]^2 \to \mathbb{R}$. Assume that we have infinite memory space so that we can record the wind power $y = f(a, b)$ for any point $(a, b) \in [0,1]^2$. In such an unrealistic case, it is possible to handle the entire information about f. However, as only finite memory space is available, we resort to retaining finite observations $\{f(a_i, b_i)\}_{i=1}^n$. If the points $(a_i, b_i)$ are arranged as a grid, the observations can be regarded as a matrix.

This idea is generalized to tensors as follows. Consider a K-variate function $f_X : [0,1]^K \to \mathbb{R}$ and a set of points $\{(g_{j_1}, \ldots, g_{j_K})\}_{j_1 \ldots j_K = 1}^{I_1 \ldots I_K} \subset [0,1]^K$ as grid points in $[0,1]^K$. Then, each element of X is represented as

$$[X]_{j_1 \ldots j_K} = f_X(g_{j_1}, \ldots, g_{j_K}), \qquad (6)$$

for $j_k = 1, 2, \ldots$ with each $k = 1, \ldots, K$.

As for the smoothness of the function, we assume that $f_X$ is differentiable with respect to all K arguments, which allows for its expansion by a few useful basis functions (for example, Tsybakov (2008)) and the decomposition of multivariate functions by the basis (see Hackbusch (2012) for details). Let $\{\phi_m^{(k)} : [0,1] \to \mathbb{R}\}_m$ be a set of orthonormal basis functions, such as Fourier series or wavelet series, and let $\{w_{m_1, \ldots, m_K} \in \mathbb{R}\}_{m_1, \ldots, m_K}$ be a set of coefficients. Because of the differentiability, $f_X$ is written as the weighted sum of the basis functions:

$$f_X = \sum_{m_1 = 1}^{\infty} \cdots \sum_{m_K = 1}^{\infty} w_{m_1 \ldots m_K} \, \phi_{m_1}^{(1)} \cdots \phi_{m_K}^{(K)}. \qquad (7)$$

Combining (6) and (7) yields a formulation of the elements of the smooth tensor:

$$[X]_{j_1, \ldots, j_K} = \sum_{m_1 = 1}^{\infty} \cdots \sum_{m_K = 1}^{\infty} w_{m_1 \ldots m_K} \, \phi_{m_1}^{(1)}(g_{j_1}) \cdots \phi_{m_K}^{(K)}(g_{j_K}). \qquad (8)$$

Hereafter, we say that X is smooth if it follows (8).
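A smooth tensor in the sense of (8) can be generated by evaluating a truncated basis expansion on a grid; the truncation anticipates Section 3.2. Below is a sketch using an orthonormal cosine basis, one typical choice; the code and its names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine_basis(M, grid):
    """First M orthonormal cosine basis functions on [0,1], evaluated
    on the given grid points; phi_1 is the constant function."""
    m = np.arange(M)[:, None]                       # (M, 1)
    B = np.sqrt(2.0) * np.cos(np.pi * m * grid[None, :])
    B[0] = 1.0                                      # phi_1(t) = 1
    return B                                        # shape (M, len(grid))

def smooth_tensor(W, grids):
    """Evaluate [X]_{j1...jK} = sum_m w_m prod_k phi_{m_k}(g_{j_k}),
    i.e., (8) with each sum truncated at the corresponding size of W."""
    X = W
    for k, g in enumerate(grids):
        B = cosine_basis(W.shape[k], g)             # (M_k, I_k)
        Xk = np.tensordot(B.T, np.moveaxis(X, k, 0), axes=(1, 0))
        X = np.moveaxis(Xk, 0, k)                   # mode k now has size I_k
    return X

rng = np.random.default_rng(3)
W = rng.normal(size=(3, 3, 3)) / 10.0               # a few low-frequency coefficients
grids = [np.linspace(0.0, 1.0, I) for I in (20, 30, 40)]
X = smooth_tensor(W, grids)
print(X.shape)  # (20, 30, 40)
```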
3.2. Objective Function

Model (8) is not directly applicable because it requires an infinite number of basis functions. To prevent this, we consider their truncation. Let $M^{(k)} < \infty$ be the number of basis functions for mode k, which represents a degree of smoothness of X in terms of mode k. For example, when $M^{(k)}$ is large, such as $M^{(k)} = I_k$ for all k, the basis function formulation can represent any X, which implies that it neglects the smooth structure. Then, we consider X such that it satisfies the following relation:

$$[X]_{j_1, \ldots, j_K} = \sum_{m_1 = 1}^{M^{(1)}} \cdots \sum_{m_K = 1}^{M^{(K)}} w_{m_1 \ldots m_K} \, \phi_{m_1}^{(1)}(g_{j_1}) \cdots \phi_{m_K}^{(K)}(g_{j_K}). \qquad (9)$$

For practical use, let $W_X \in \mathbb{R}^{M^{(1)} \times \cdots \times M^{(K)}}$ be a coefficient tensor that satisfies $[W_X]_{m_1 \ldots m_K} = w_{m_1 \ldots m_K}$, with given X.

Using this representation, we propose the objective function of STD. Based on the same convex optimization approach as (5), we define the objective function as

$$\min_{X \in \Theta} \left\{ \frac{1}{2n} \|Y - \mathfrak{X}(X)\|^2 + \lambda_n |||W_X|||_s + \mu_n |||W_X|||_F^2 \right\}, \qquad (10)$$

where $\lambda_n, \mu_n \geq 0$ are regularization coefficients that depend on n, with $\lambda_n, \mu_n \to 0$ as $n \to \infty$. Here, the regularization terms $|||W_X|||_s$ and $|||W_X|||_F^2$ are employed.

There are three primary advantages of the formulation given by (10). Firstly, (10) is written as a convex optimization problem; thus, it is ensured that the global solution of $W_X$ will be obtained. Secondly, the regularization term $|||W_X|||_s$ determines the Tucker rank of $W_X$ appropriately. Even though we must select $\lambda_n$, this is considerably easier than directly selecting the K rank values $(R_1, \ldots, R_K)$. Thirdly, the regularization $|||W_X|||_F^2$ penalizes the smoothness of $f_X$. Note that the smoothness of $f_X$ is related to $M^{(k)}$, and we introduce $|||W_X|||_F^2$ to select an appropriate degree of smoothness.

3.3. Algorithm

To optimize (10), we first reformulate it through vectorization and matricization. We define $x \in \mathbb{R}^{\prod_k I_k}$ as the vectorized tensor of X, and let $Q \in \mathbb{R}^{n \times \prod_k I_k}$ be the matricized version of the rearranging operator $\mathfrak{X}$, so that $\mathfrak{X}(X) = Qx$. We define $M^{(\setminus k)} := \prod_{k' \neq k} M^{(k')}$, and let $Z_k \in \mathbb{R}^{M^{(k)} \times M^{(\setminus k)}}$ be the mode-k unfolding matrix of $W_X$. Let $w \in \mathbb{R}^{\prod_k M^{(k)}}$ be the vectorized tensor of $W_X$. Let $\Phi : \mathbb{R}^{\prod_k M^{(k)}} \to \mathbb{R}^{I_1 \times \cdots \times I_K}$ be an operator that converts $W_X$ to X as given by (9) using $\{\phi_m^{(k)}(g_j)\}_{m,j}$, and let $\mathbf{\Phi}$ be a $\prod_k I_k \times \prod_k M^{(k)}$ matrix that satisfies $\Phi(w) = \mathbf{\Phi} w$; as $\Phi(w)$ is a linear mapping by (9), the existence of $\mathbf{\Phi}$ is ensured. Then, (10) is rewritten as follows:

$$\min_{x, w, \{Z_k\}_{k=1}^K} \; \frac{1}{2n} \|Y - Qx\|^2 + \frac{\lambda_n}{K} \sum_{k=1}^{K} \|Z_k\|_s + \mu_n \|w\|^2, \quad \text{s.t. } x = \Phi(w), \; P_k(w) = Z_k, \; \forall k, \qquad (11)$$

where $P_k : \mathbb{R}^{\prod_k M^{(k)}} \to \mathbb{R}^{M^{(k)} \times M^{(\setminus k)}}$ is a rearranging operator from the vector to the unfolding matrix. Note that $|||W_X|||_F = \|w\|$ holds by the definition of w.

We use the ADMM approach to solve (11). Minimizing the augmented Lagrangian function for (11), we obtain the following iteration steps. Here, $\eta > 0$ is a step size, and $\{\alpha_k \in \mathbb{R}^{M^{(k)} M^{(\setminus k)}}\}_{k=1}^{K}$ and $\beta \in \mathbb{R}^{\prod_k I_k}$ are Lagrangian multipliers. Let $(x_0, w_0, \{Z_{k,0}\}_{k=1}^{K}, \{\alpha_{k,0}\}_{k=1}^{K}, \beta_0)$ be an initial point. The ADMM step at the $\ell$-th iteration is written as follows:

$$x^{\ell+1} = (Q^T Q + n\eta I)^{-1} (Q^T Y - n\beta^{\ell} + n\eta \, \mathbf{\Phi} w^{\ell})$$
$$w^{\ell+1} = (2\mu_n I + \eta K I + \eta \mathbf{\Phi}^T \mathbf{\Phi})^{-1} \left( \sum_{k=1}^{K} \left( \eta P_k^T(Z_k^{\ell}) - \alpha_k^{\ell} \right) + \mathbf{\Phi}^T \beta^{\ell} + \eta \, \mathbf{\Phi}^T x^{\ell+1} \right)$$
$$Z_k^{\ell+1} = \mathrm{prox}_{\lambda_n/\eta} \bigl( P_k(w^{\ell+1} + \alpha_k^{\ell}) \bigr)$$
$$\alpha_k^{\ell+1} = \alpha_k^{\ell} + \bigl( w^{\ell+1} - P_k^{-1}(Z_k^{\ell+1}) \bigr)$$
$$\beta^{\ell+1} = \beta^{\ell} + \bigl( x^{\ell+1} - \mathbf{\Phi} w^{\ell+1} \bigr),$$

where $\mathrm{prox}_{\lambda_n/\eta}(\cdot)$ denotes the shrinkage operation of the singular values, defined as $\mathrm{prox}_{\eta}(Z) = U \max(S - \eta I, 0) V^T$, where S, U, and V are obtained through the singular value decomposition $Z = USV^T$. Note that the ADMM steps for $Z_k$ and $\alpha_k$ are required for every $k = 1, \ldots, K$. For $\eta$, Tomioka et al. (2010) suggest setting $\eta = \eta_0 / \sqrt{\mathrm{Var}(y_i)}$ with some constant $\eta_0$. As the regularization terms are convex, the sequence of variables produced by ADMM is ensured to converge to the optimal solution of (10) (Gandy et al., 2011, Theorem 5.1).
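The shrinkage operation prox is standard singular-value soft-thresholding. A minimal sketch of it, together with one illustrative $Z_k$-update (our naming; the shapes are chosen arbitrarily), is:

```python
import numpy as np

def prox_nuclear(Z, tau):
    """Shrinkage of singular values: U max(S - tau, 0) V^T."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# One Z_k-update from the ADMM iteration above, in matrix form:
# Z_k <- prox_{lambda_n/eta}(P_k(w + alpha_k)), where P_k produces the
# mode-k unfolding of the coefficient vector.
rng = np.random.default_rng(4)
Mk, M_rest = 4, 12
Wk = rng.normal(size=(Mk, M_rest))   # plays the role of P_k(w + alpha_k)
Zk = prox_nuclear(Wk, tau=0.5)
# Soft-thresholding never raises the rank of its input.
print(np.linalg.matrix_rank(Zk) <= np.linalg.matrix_rank(Wk))  # True
```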
3.4. Practical Issues

STD has several hyperparameters, i.e., $\lambda_n$, $\mu_n$, $\{M^{(k)}\}$, and $\{\phi_m\}$. $\lambda_n$ and $\mu_n$ can be determined through cross validation. $\{M^{(k)}\}$ is initialized with a large value and reduced during the algorithm depending on $\mu_n$. As $M^{(k)}$ does not exceed $I_k$ for an identification reason, the initial value of $M^{(k)}$ is bounded. Practically, $M^{(k)}$ is considerably less than $I_k$; thus, we can start the iteration with small values.

One can criticize that a few data tensors are smooth in one mode but not in the others. We emphasize that STD can address such a situation by controlling $M^{(k)}$ for each k. As STD can represent tensors without the smooth structure when $M^{(k)} = I_k$, setting $M^{(k)} = I_k$ for some mode k and $M^{(k)} \ll I_k$ for the other modes is sufficient to address the situation.

In this study, selecting the form of the basis functions $\{\phi_m\}$ is not our primary interest because it does not specifically affect the theoretical results. However, there are a few typical choices. For instance, when the data tensor is periodic, such as audio, the Fourier basis is appropriate. Other functions as well, such as wavelet or spline functions, provide the theoretical guarantees for approximating f.

4. Theoretical Analysis

We introduce a few notations for convenience. Let $|||X|||_m := \max_r \frac{1}{K} \sum_{k=1}^{K} \sigma_r(X_{(k)})$ be a norm of a tensor, which is necessary for evaluating the penalty parameters. Let $\mathfrak{X}^*$ be the adjoint operator of $\mathfrak{X}$, namely, $\langle \mathfrak{X}(Z), z' \rangle = \langle Z, \mathfrak{X}^*(z') \rangle$ holds for all $Z \in \mathcal{X}$ and $z' \in \mathbb{R}^n$. As a theoretical requirement, we let the basis functions $\{\phi_j : [0,1] \to \mathbb{R}\}_j$ be uniformly bounded for all $j \geq 1$. All proofs of this section are provided in the supplemental material.

4.1. Error Bound with $X^*$

First, we impose the following assumption on $\mathfrak{X}$.

Assumption 1 (Restricted Strong Convexity (RSC) condition). There exists a finite constant $C_{\mathfrak{X}} > 0$ depending on $\{I_k\}_k$ such that the rearranging operator $\mathfrak{X}$ satisfies

$$\frac{1}{2n} \|\mathfrak{X}(X)\|^2 \geq C_{\mathfrak{X}} \, |||X|||_F^2,$$

for all $X \in \Theta$.

Intuitively, this assumption requires that $\mathfrak{X}$ is sufficiently sensitive to perturbations of X. A similar type of condition has been used in previous studies on sparse regression, such as LASSO (Bickel et al., 2009; Raskutti et al., 2010), i.e., the restricted isometry condition. The RSC condition is weaker than the isometry condition because the RSC condition requires only the lower bound.

We provide the following lemma regarding the error bound when the true tensor $X^*$ may be neither smooth nor low-rank. Let $(R_1^W, \ldots, R_K^W)$ be the Tucker rank of $W_X$.

Lemma 2. Consider $X^* \in \Theta$ and a rearranging operator $\mathfrak{X}$ that satisfies the RSC condition. Suppose there exist sequences $\lambda_n$ and $\mu_n$ that satisfy $\lambda_n \geq 2c \left( |||\mathfrak{X}^*(\mathcal{E})|||_m + \mu_n \right)$ and $c \mu_n / \lambda_n < 3/4$ with a constant $c > 0$. Then, with some constants $C_1, C_2, C_3 > 0$, we have

$$|||\hat{X} - X^*|||_F \leq \max\{\mathrm{I}, \mathrm{II}, \mathrm{III}\},$$

where I, II, and III are

$$\mathrm{I} = \frac{C_1 \lambda_n}{K} \sum_{k=1}^{K} \sqrt{R_k^W}, \qquad \mathrm{II} = \frac{C_2 \lambda_n}{K} \left( \sum_{k=1}^{K} \sum_{r > R_k^W} \sigma_r(W_{X^*(k)}) \right)^{1/2}, \qquad \mathrm{III} = C_3 \mu_n \left( \sum_{\{m_k > M^{(k)}\}_k} \left( w^*_{m_1 \ldots m_K} \right)^2 \right)^{1/2},$$

respectively; $w^*_{m_1 \ldots m_K}$ is the coefficient of $X^*$.

Lemma 2 states that the estimation error of $\hat{X}$ is bounded by three types of values: (I) indicates the error resulting from estimating a tensor that is smooth and low-rank; (II) indicates the error resulting from introducing the low-rank structure; and (III) indicates the error resulting from the approximation by the smooth tensor.

From Lemma 2, we see that (II) and (III) disappear and (I) remains when $X^*$ is low-rank and smooth, which we show in the next proposition. Here, we define $\tilde{\Theta} \subset \Theta$ as the set of tensors represented by (9) with $\{M^{(k)}\}_k$ and a coefficient tensor $W_X$ of rank $(R_1^W, \ldots, R_K^W)$.

Proposition 3. Suppose the same conditions as in Lemma 2 hold and $X^* \in \tilde{\Theta}$ is smooth and low-rank. Then, with some constant $C_f > 0$, we have

$$|||\hat{X} - X^*|||_F \leq \frac{C_f \lambda_n}{K} \sum_{k=1}^{K} \sqrt{R_k^W}.$$

4.2. Error Bound with $f_{X^*}$

One of the advantages of STD is that it can estimate both $X^*$ and the smooth function $f_{X^*}$ defined in (6), which allows for the interpolation of $X^*$. Here, we evaluate the estimation error of STD with respect to the $L_2$ norm for the functional space. Let us define $f_{X^*} := \sum_{m_1, \ldots, m_K} w^*_{m_1, \ldots, m_K} \phi_{m_1}^{(1)} \cdots \phi_{m_K}^{(K)}$, which is the smooth function given as the limit of $X^*$ as $I_k \to \infty$ for all $k \in \{1, \ldots, K\}$. We define the estimator of $f_{X^*}$ as follows:

$$f_{\hat{X}} := \sum_{m_1 \ldots m_K = 1}^{M^{(1)} \ldots M^{(K)}} \hat{w}_{m_1 \ldots m_K} \, \phi_{m_1}^{(1)} \cdots \phi_{m_K}^{(K)},$$

where $\hat{w}_{m_1 \ldots m_K}$ is an element of $W_{\hat{X}}$. The estimation error is provided as follows.
Lemma 4. Suppose that the rearranging operator $\mathfrak{X}$ satisfies the RSC condition and the true tensor $X^* \in \tilde{\Theta}$. Then, with some constant $C_F > 0$, we have

$$\sup_{g \in [0,1]^K} \left| f_{\hat{X}}(g) - f_{X^*}(g) \right| \leq \frac{C_F \lambda_n}{\sqrt{nK}} \sum_{k=1}^{K} \sqrt{R_k^W}.$$

When $\lambda_n \to 0$ by the setting, Lemma 4 shows that $f_{\hat{X}}$ estimated by STD uniformly converges to $f_{X^*}$.

4.3. Applications and Comparison

To discuss the result of Lemma 2 more precisely, we consider the following two practical settings: tensor recovery and tensor interpolation. For each setting, we derive rigorous error bounds.

4.3.1. Tensor Recovery

We consider that all the elements of X are observed and affected by noise, i.e., we set $n = \prod_k I_k$ and $\mathfrak{X}$ is a vectorization operator. Then, by applying Lemma 2, we obtain the following result.

Theorem 5. Suppose that $X^* \in \tilde{\Theta}$, the rearranging operator $\mathfrak{X}$ satisfies the RSC condition, and the noise is i.i.d. Gaussian. Let $C_{10}, C_{11} > 0$ be some finite constants. By setting $\lambda_n = C_{10} \sqrt{\frac{1}{nK}} \sum_{k=1}^{K} \left( \sqrt{I_k} + \sqrt{I_{\setminus k}} \right)$, with high probability, we have

$$|||\hat{X} - X^*|||_F^2 \leq \frac{C_{11}}{nK} \left( \sum_{k=1}^{K} \left( \sqrt{I_k} + \sqrt{I_{\setminus k}} \right) \right)^2 \left( \frac{1}{K} \sum_{k=1}^{K} \sqrt{R_k^W} \right)^2.$$

4.3.2. Tensor Interpolation

Theorem 6. Suppose that $X^* \in \tilde{\Theta}$, the rearranging operator $\mathfrak{X}$ satisfies the RSC condition, and the noise is i.i.d. Gaussian. Let $C_{20}, C_{21} > 0$ be some finite constants. By setting $\lambda_n = C_{20} \sum_{k=1}^{K} \left( \sqrt{I_k} + \sqrt{I_{\setminus k}} \right)$, with high probability, we have

$$\sup_{g \in [0,1]^K} \left| f_{\hat{X}}(g) - f_{X^*}(g) \right| \leq \frac{C_{21}}{\sqrt{n}} \left( \frac{1}{K} \sum_{k=1}^{K} \left( \sqrt{I_k} + \sqrt{I_{\setminus k}} \right) \right) \left( \frac{1}{K} \sum_{k=1}^{K} \sqrt{R_k^W} \right).$$

4.4. Comparison to Related Studies

Several studies have derived an error bound for $|||\hat{X} - X^*|||_F^2 / n$ in each situation. Tomioka et al. (2011) investigated the tensor decomposition problem with an overlapped Schatten-norm and derived the error bound

$$|||\hat{X} - X^*|||_F = O\left( \left( \frac{1}{K} \sum_{k=1}^{K} \left( \sqrt{I_k} + \sqrt{I_{\setminus k}} \right) \right) \left( \frac{1}{K} \sum_{k=1}^{K} \sqrt{R_k^X} \right) \right), \qquad (12)$$

where $(R_1^X, \ldots, R_K^X)$ is the Tucker rank of $X^*$. The error bound in Proposition 3 is obtained by replacing the part $R_k^X$ in (12) by $R_k^W$. Tomioka & Suzuki (2013) and Wimalawarne et al. (2014) introduced modified norms with the Schatten-norm and derived other error bounds.

Table 1 compares the main coefficients for the convergence.

METHOD                       RECOVERY          INTERPOLATION                            ASSUMPTION
Tomioka et al. (2011)        $O(R^X I^2)$      N/A                                      low-rank
Tomioka & Suzuki (2013)      $O(R^X I^2)$      N/A                                      low-rank
Wimalawarne et al. (2014)    $O(R^X I^2)$      N/A                                      low-rank
STD                          $O(R^W I^2)$      $O(n^{-1/2} I^{(K-1)/2} \sqrt{R^W})$     low-rank & smooth

Table 1. Comparison of error bounds under the low-rank and smoothness assumptions. For clarity, special cases of the error bounds are shown, where the shape of X is symmetric, i.e., $I := I_1 = \cdots = I_K$, $R^X := R_1^X = \cdots = R_K^X$, and $R^W := R_1^W = \cdots = R_K^W$.

When the tensor is sufficiently smooth, i.e., $R^W$