
Exact Recovery of Tensor Robust Principal Component Analysis under Linear Transforms

Canyi Lu and Pan Zhou

C. Lu is with the Department of Electrical and Computer Engineering, Carnegie Mellon University (e-mail: [email protected]). P. Zhou is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore (e-mail: [email protected]).

arXiv:1907.08288v1 [cs.LG] 16 Jul 2019

Abstract

This work studies the Tensor Robust Principal Component Analysis (TRPCA) problem, which aims to exactly recover the low-rank and sparse components from their sum. Our model is motivated by the recently proposed linear-transform-based tensor-tensor product and tensor SVD. We define a new transform-dependent tensor rank and the corresponding tensor nuclear norm, and then solve the TRPCA problem by convex optimization whose objective is a weighted combination of the new tensor nuclear norm and the ℓ1-norm. In theory, we show that under certain incoherence conditions, the convex program exactly recovers the underlying low-rank and sparse components. It is of great interest that our new TRPCA model generalizes existing works. In particular, if the studied tensor reduces to a matrix, our TRPCA model reduces to the known matrix RPCA [1]. Our new TRPCA, which allows general linear transforms, can be regarded as an extension of our former TRPCA work [2], which uses the discrete Fourier transform; the proof of the recovery guarantee, however, is different. Numerical experiments verify our results, and the application to image recovery demonstrates the superiority of our method.

I. INTRODUCTION

Tensors are the higher-order generalization of vectors and matrices. They have many applications in the physical, imaging and information sciences, and an in-depth survey can be found in [3]. Tensor decompositions give a concise representation of the underlying structure of the tensor, revealing when the tensor data can be modeled as lying close to a low-dimensional subspace. Having originated in the fields of psychometrics and chemometrics, these decompositions are now widely used in other application areas such as computer vision [4], web data mining [5], and signal processing [6].

Tensor decomposition faces several challenges: arbitrary outliers, missing data/partial observations, and computational efficiency. Tensor decomposition resembles principal component analysis (PCA) for matrices in many ways. The two commonly used decompositions are the CANDECOMP/PARAFAC (CP) and Tucker decompositions [3]. It is well known that PCA is sensitive to outliers and gross corruptions (non-Gaussian noise). Since the CP and Tucker decompositions are also based on least-squares approximation, they are prone to these problems as well. Algorithms based on nonconvex formulations have been proposed to robustify tensor decompositions against outliers [7] and missing data [8]. However, they lack global optimality and statistical guarantees.

In this work, we study the Tensor Robust Principal Component Analysis (TRPCA) problem, which aims to recover the low-rank and sparse tensor components from their sum. More specifically, assume that a tensor X can be decomposed as X = L0 + E0, where L0 is a low-rank tensor and E0 is a sparse tensor. TRPCA aims to recover L0 and E0 from X. We focus on the convex model, which can be solved exactly and efficiently, and whose solutions enjoy a theoretical guarantee. TRPCA extends the well-known Robust PCA [1] model, i.e.,

  min_{L,E} ||L||_* + λ||E||_1,  s.t.  X = L + E,   (1)

where ||L||_* denotes the nuclear norm, ||E||_1 denotes the ℓ1-norm, and λ > 0. Under certain incoherence conditions, it is proved in [1] that the solution to (1) exactly recovers the underlying low-rank and sparse components from their sum. RPCA has many applications in image and video analysis [9], [10].

It is natural to consider the tensor extension of RPCA. Unlike the matrix case, the recovery theory for low-rank tensor estimation problems is far from being well established. The main issue lies in the definition of tensor rank. There are different formulations of the tensor SVD, each associated with a different tensor rank definition, and they lead to different tensor RPCA models. Compared with the matrix RPCA, these tensor RPCA models have several limitations. The tensor CP rank is defined as the smallest number of rank-one terms in a CP decomposition. However, it is NP-hard to compute. Moreover, finding the best convex relaxation of the tensor CP rank is also NP-hard [11], unlike the matrix case, where the convex relaxation of the rank, viz., the nuclear norm, can be computed efficiently. These issues make the low CP rank tensor recovery problem challenging to solve.

The tensor Tucker rank is more widely studied as it is tractable. The sum-of-nuclear-norms (SNN) serves as a simple convex surrogate for the tensor Tucker rank. This idea, after first being proposed in [12], has been studied in [13], [14], and successfully applied to various problems [15], [16]. Unlike the matrix case, the recovery theory for low Tucker rank tensor estimation problems is far from being well established. The work [14] conducts a statistical analysis of tensor decomposition and provides the first theoretical guarantee for SNN minimization. This result has been further enhanced in the recent works [17], [18]. The main limitation of SNN minimization is that it is suboptimal; e.g., for tensor completion, the required sample complexity is much higher than the degrees of freedom. This is different from matrix nuclear norm minimization, which achieves order-optimal sampling complexity [19]. Intuitively, the limitation of SNN lies in the fact that SNN is not a tight convex relaxation of the tensor Tucker rank [20].

More recently, based on the tensor-tensor product (t-product) and tensor SVD (t-SVD) [21], a new tensor tubal rank and Tensor Nuclear Norm (TNN) have been proposed and applied to tensor completion [22], [23], [24] and tensor robust PCA [2], [25]. Compared with the Tucker rank based SNN model, the main advantage of the t-product induced TNN based models is that they enjoy the same tight recovery bound as the matrix case [26], [27], [25], [2]. Also, unlike the CP rank, the tubal rank and TNN are computable. In [28], the authors observe that the t-product is based on a convolution-like operation, which can be implemented using the Discrete Fourier Transform (DFT). In order to properly motivate this transform based approach, a more general tensor-tensor product definition is proposed based on any invertible linear transform. The transform based t-product also has a matrix-algebra based interpretation in the spirit of [29]. Such a transform based t-product is of great interest in practice, as it allows one to use different linear transforms for different real data.

In this work, we propose a more general tensor RPCA model based on a new tensor rank and tensor nuclear norm induced by invertible linear transforms, and provide the theoretical recovery guarantee. We show that when the invertible linear transform given by the matrix L further satisfies

  L^⊤ L = L L^⊤ = ℓ I,

for some constant ℓ > 0, a new tensor tubal rank and tensor nuclear norm can be defined, induced by the transform based t-product. Equipped with the new tensor nuclear norm, we then solve the TRPCA problem by solving a convex program (see the definitions of the notations in Section II)

  min_{L,S} ||L||_* + λ||S||_1,  s.t.  X = L + S.   (2)

In theory, we prove that, under certain incoherence conditions, the solution to (2) exactly recovers the underlying low-rank L0 and sparse S0 components with high probability. Interestingly, our model is much more flexible and the recovery result is much more general than in the existing works [1], [2], since we allow much more general choices of the invertible linear transforms. If the tensors reduce to matrices, our model and main result reduce to the RPCA case in [1]. If the discrete Fourier transform is used as the linear transform L, the t-product, the tensor nuclear norm and our TRPCA model reduce to the cases in our former work [2], [25]. However, our extension to more general choices of linear transforms is non-trivial, since the proof of the exact recovery guarantee is very different due to the different properties of the linear transforms. We will give detailed discussions in Section III.

The rest of this paper is structured as follows. Section II gives some notations and presents the new tensor nuclear norm induced by the t-product under linear transforms. Section III presents the theoretical guarantee of the new tensor nuclear norm based convex TRPCA model. Numerical experiments conducted on both synthetic and real-world data are presented in Section IV. We finally conclude this work in Section V.

II. TENSOR TUBAL RANK AND TENSOR NUCLEAR NORM UNDER LINEAR TRANSFORM

A. Notations

We first introduce some notations and definitions used in this paper. We follow similar notations as in [25]. We denote scalars by lowercase letters, e.g., a, vectors by boldface lowercase letters, e.g., a, matrices by boldface capital letters, e.g., A, and tensors by boldface Euler script letters, e.g., A. For a 3-way tensor A ∈ R^{n1×n2×n3}, we denote its (i, j, k)-th entry as A_ijk or a_ijk and use the Matlab notation A(i, :, :), A(:, i, :) and A(:, :, i) to denote respectively the i-th horizontal, lateral and frontal slice [3]. More often, the frontal slice A(:, :, i) is denoted compactly as A^(i). The tube is denoted as A(i, j, :).

The inner product between A and B in R^{n1×n2} is defined as ⟨A, B⟩ = Tr(A^⊤ B), where A^⊤ denotes the transpose of A and Tr(·) denotes the matrix trace. The inner product between A and B in R^{n1×n2×n3} is defined as ⟨A, B⟩ = Σ_{i=1}^{n3} ⟨A^(i), B^(i)⟩. We denote I_n as the n × n sized identity matrix.

Some norms of vectors, matrices and tensors are used. We denote the Frobenius norm as ||A||_F = √(Σ_{ijk} a_{ijk}^2), the ℓ1-norm as ||A||_1 = Σ_{ijk} |a_{ijk}|, and the infinity norm as ||A||_∞ = max_{ijk} |a_{ijk}|, respectively. The above norms reduce to the vector or matrix norms if A is a vector or a matrix. For v ∈ R^n, the ℓ2-norm is ||v||_2 = √(Σ_i v_i^2). The spectral norm of a matrix A is denoted as ||A|| = max_i σ_i(A), where the σ_i(A)'s are the singular values of A. The matrix nuclear norm is ||A||_* = Σ_i σ_i(A).

B. T-product Induced Tensor Nuclear Norm under Linear Transform

We first give some notations, concepts and properties of the tensor-tensor product proposed in [21]. For A ∈ R^{n1×n2×n3}, we define the block circulant matrix bcirc(A) ∈ R^{n1 n3 × n2 n3} of A as

  bcirc(A) = [ A^(1)    A^(n3)     ···  A^(2)  ;
               A^(2)    A^(1)      ···  A^(3)  ;
               ⋮        ⋮           ⋱   ⋮      ;
               A^(n3)   A^(n3−1)   ···  A^(1)  ].

We define the following two operators

  unfold(A) = [ A^(1) ; A^(2) ; ··· ; A^(n3) ],   fold(unfold(A)) = A,

where the unfold operator maps A to a matrix of size n1 n3 × n2 and fold is its inverse operator. Let A ∈ R^{n1×n2×n3} and B ∈ R^{n2×l×n3}. Then the t-product C = A ∗ B is defined to be a tensor of size n1 × l × n3,

  C = A ∗ B = fold(bcirc(A) · unfold(B)).   (3)

The block circulant matrix can be block diagonalized using the Discrete Fourier Transform (DFT) matrix F_{n3}, i.e.,

  (F_{n3} ⊗ I_{n1}) · bcirc(A) · (F_{n3}^{−1} ⊗ I_{n2}) = Ā,   (4)

where ⊗ denotes the Kronecker product, and Ā is a block diagonal matrix with the i-th block Ā^(i) being the i-th frontal slice of Ā, obtained by performing the DFT on A along the 3-rd dimension, i.e.,

  Ā = A ×_3 F_{n3},   (5)

where ×_3 denotes the mode-3 product (see Definition 2.5 in [28]), and F_{n3} ∈ C^{n3×n3} denotes the DFT matrix (see [25] for the formulation). By using the Matlab command fft, we also have Ā = fft(A, [ ], 3). We denote

  D = A ⊙ B   (6)

as the frontal-slice-wise product (Definition 2.1 in [28]), i.e.,

  D^(i) = A^(i) B^(i),   i = 1, ···, n3.   (7)
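The identities (3)-(7) can be checked numerically. The following minimal NumPy sketch (our own illustration, not code from the paper; all function names are ours) computes the t-product via the block circulant matrix and compares it with the slice-wise product of the DFT-transformed frontal slices:

  import numpy as np

  def bcirc(A):
      # Block circulant matrix of A in R^{n1 x n2 x n3}: size (n1*n3) x (n2*n3), cf. the display above.
      n1, n2, n3 = A.shape
      M = np.zeros((n1 * n3, n2 * n3))
      for i in range(n3):          # block row
          for j in range(n3):      # block column
              M[i*n1:(i+1)*n1, j*n2:(j+1)*n2] = A[:, :, (i - j) % n3]
      return M

  def unfold(B):
      # Stack the frontal slices of B vertically: size (n2*n3) x l.
      return np.vstack([B[:, :, i] for i in range(B.shape[2])])

  def fold(M, n1, l, n3):
      # Inverse of unfold.
      C = np.zeros((n1, l, n3))
      for i in range(n3):
          C[:, :, i] = M[i*n1:(i+1)*n1, :]
      return C

  def tprod_bcirc(A, B):
      # t-product (3): C = fold(bcirc(A) @ unfold(B)).
      n1, _, n3 = A.shape
      return fold(bcirc(A) @ unfold(B), n1, B.shape[1], n3)

  def tprod_dft(A, B):
      # Equivalent computation in the DFT domain, cf. (4)-(7).
      Abar, Bbar = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
      Cbar = np.einsum('ijk,jlk->ilk', Abar, Bbar)   # slice-wise matrix products
      return np.fft.ifft(Cbar, axis=2).real

  A = np.random.randn(4, 3, 5)
  B = np.random.randn(3, 2, 5)
  print(np.allclose(tprod_bcirc(A, B), tprod_dft(A, B)))   # True

Both routines agree up to floating-point error, which is exactly the block diagonalization property used next.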

Then the block diagonalized property in (4) implies that C̄ = Ā ⊙ B̄. So the t-product in (3) is equivalent to the matrix-matrix product in the transform domain using the DFT.

Instead of using the specific discrete Fourier transform, the recent work [28] proposes a more general definition of the t-product under any invertible linear transform L. In this work, we consider the linear transform L : R^{n1×n2×n3} → R^{n1×n2×n3} which gives Ā by performing a linear transform on A along the 3-rd dimension, i.e.,

  Ā = L(A) = A ×_3 L,   (8)

where the linear transform is given by L ∈ R^{n3×n3}, which can be an arbitrary invertible matrix. We also have the inverse mapping given by

  L^{−1}(A) = A ×_3 L^{−1}.   (9)

Then the t-product under the linear transform L is defined as follows.

Definition 1. (T-product) [28] Let L be any invertible linear transform in (8), A ∈ R^{n1×n2×n3} and B ∈ R^{n2×l×n3}. The t-product of A and B under L, denoted as C = A ∗_L B, is defined such that L(C) = L(A) ⊙ L(B).

We denote Ā ∈ R^{n1 n3 × n2 n3} as the block diagonal matrix whose i-th diagonal block is the i-th frontal slice Ā^(i) of Ā = L(A), i.e.,

  Ā = bdiag(Ā) = diag(Ā^(1), Ā^(2), ···, Ā^(n3)),

where bdiag is the operator which maps the tensor Ā to the block diagonal matrix Ā. Then L(C) = L(A) ⊙ L(B) is equivalent to C̄ = Ā B̄. This implies that the t-product under L is equivalent to the matrix-matrix product in the transform domain. By using this property, Algorithm 1 gives the way to compute the t-product. Figure 1 gives an intuitive illustration of the t-product and its equivalence in the transform domain.

Algorithm 1: T-product under linear transform L
  Input: A ∈ R^{n1×n2×n3}, B ∈ R^{n2×l×n3}, and L : R^{n1×n2×n3} → R^{n1×n2×n3}.
  Output: C = A ∗_L B ∈ R^{n1×l×n3}.
  1. Compute Ā = L(A) and B̄ = L(B).
  2. Compute each frontal slice of C̄ by C̄^(i) = Ā^(i) B̄^(i), i = 1, ···, n3.
  3. Compute C = L^{−1}(C̄).

[Fig. 1: An illustration of the t-product under linear transform L.]

It is easy to see that the time complexity for computing L(A) is O(n1 n2 n3^2). Then the time complexity for computing A ∗_L B is O(n1 n2 n3^2 + n2 l n3^2 + n1 n2 n3 l). But the cost can be lower if L has special properties. For example, the cost for computing L(A) is O(n1 n2 n3 log n3) for the DFT used in [25]. The t-product enjoys many properties similar to those of the matrix-matrix product. For example, the t-product is associative, i.e., A ∗_L (B ∗_L C) = (A ∗_L B) ∗_L C.

Definition 2. (Tensor transpose) [28] Let L be any invertible linear transform in (8), and A ∈ R^{n1×n2×n3}. Then the tensor transpose of A under L, denoted as A^⊤, satisfies L(A^⊤)^(i) = (L(A)^(i))^⊤, i = 1, ···, n3.

Definition 3. (Identity tensor) [28] Let L be any invertible linear transform in (8). Let Ī ∈ R^{n×n×n3} be the tensor each of whose frontal slices is the n × n identity matrix. Then I = L^{−1}(Ī) is called the identity tensor under L, so that L(I) = Ī.

It is clear that A ∗_L I = A and I ∗_L A = A given the appropriate dimensions. The tensor Ī = L(I) is a tensor with each frontal slice being the identity matrix.

Definition 4. (Orthogonal tensor) [28] Let L be any invertible linear transform in (8). A tensor Q ∈ R^{n×n×n3} is orthogonal under L if it satisfies Q^⊤ ∗_L Q = Q ∗_L Q^⊤ = I.

Definition 5. (F-diagonal tensor) A tensor is called f-diagonal if each of its frontal slices is a diagonal matrix.

Theorem 1. (T-SVD) [28] Let L be any invertible linear transform in (8), and A ∈ R^{n1×n2×n3}. Then A can be factorized as

  A = U ∗_L S ∗_L V^⊤,   (10)

where U ∈ R^{n1×n1×n3} and V ∈ R^{n2×n2×n3} are orthogonal, and S ∈ R^{n1×n2×n3} is an f-diagonal tensor.

Figure 2 illustrates the t-SVD factorization. The t-SVD can be computed by performing matrix SVDs in the transform domain; see Algorithm 2. For any invertible linear transform L, we have L(0) = L^{−1}(0) = 0, so both S and S̄ are f-diagonal tensors. We can use their number of nonzero tubes to define the tensor tubal rank as in [25].
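Before turning to the tubal rank, note that Algorithm 1 admits a compact implementation once the transform is given as an invertible matrix L ∈ R^{n3×n3}. A minimal NumPy sketch (our own illustration, not the authors' code):

  import numpy as np

  def mode3(A, L):
      # Mode-3 product A x_3 L: apply the matrix L along the tubes (third dimension), cf. (8)-(9).
      return np.einsum('ijk,lk->ijl', A, L)

  def tprod(A, B, L):
      # Algorithm 1: t-product C = A *_L B under the invertible transform given by L.
      Abar, Bbar = mode3(A, L), mode3(B, L)           # step 1: move to the transform domain
      Cbar = np.einsum('ijk,jlk->ilk', Abar, Bbar)    # step 2: slice-wise matrix products
      return mode3(Cbar, np.linalg.inv(L))            # step 3: transform back

  # Example with a random orthogonal transform, so that (11) below holds with ell = 1.
  n1, n2, m, n3 = 4, 3, 2, 5
  L, _ = np.linalg.qr(np.random.randn(n3, n3))
  A, B = np.random.randn(n1, n2, n3), np.random.randn(n2, m, n3)
  C = tprod(A, B, L)       # C has shape (n1, m, n3)

The same mode3 routine is reused in the sketches below for Algorithm 2 and for the experiments.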

[Fig. 2: An illustration of the t-SVD under linear transform L of an n1 × n2 × n3 sized tensor.]

Algorithm 2: T-SVD under linear transform L
  Input: A ∈ R^{n1×n2×n3} and invertible linear transform L.
  Output: T-SVD components U, S and V of A.
  1. Compute Ā = L(A).
  2. Compute each frontal slice of Ū, S̄ and V̄ from Ā by
       for i = 1, ···, n3 do
         [Ū^(i), S̄^(i), V̄^(i)] = SVD(Ā^(i));
       end for
  3. Compute U = L^{−1}(Ū), S = L^{−1}(S̄), and V = L^{−1}(V̄).

Definition 6. (Tensor tubal rank) Let L be any invertible linear transform in (8), and A ∈ R^{n1×n2×n3}. The tensor tubal rank of A under L, denoted as rank_t(A), is defined as the number of nonzero tubes of S, where S is from the t-SVD of A = U ∗_L S ∗_L V^⊤. We can write

  rank_t(A) = #{i : S(i, i, :) ≠ 0} = #{i : S̄(i, i, :) ≠ 0}.

For A ∈ R^{n1×n2×n3} with tubal rank r, we also have the skinny t-SVD, i.e., A = U ∗_L S ∗_L V^⊤, where U ∈ R^{n1×r×n3}, S ∈ R^{r×r×n3}, and V ∈ R^{n2×r×n3}, in which U^⊤ ∗_L U = I and V^⊤ ∗_L V = I. We use the skinny t-SVD throughout this paper. The tensor tubal rank is nonconvex. In Section III, we will study the low tubal rank and sparse tensor decomposition problem by convex surrogate minimization. In the following, we show how to define the convex tensor nuclear norm induced by the t-product under L. We first define the tensor spectral norm as in [25].

Definition 7. (Tensor spectral norm) Let L be any invertible linear transform in (8), and A ∈ R^{n1×n2×n3}. The tensor spectral norm of A under L is defined as ||A|| := ||Ā||.

The tensor nuclear norm can be defined as the dual norm of the tensor spectral norm. To this end, we need the following assumption on L given in (8), i.e.,

  L^⊤ L = L L^⊤ = ℓ I_{n3},   (11)

where ℓ > 0 is a constant. Using (11), we have the following important properties:

  ⟨A, B⟩ = (1/ℓ) ⟨Ā, B̄⟩,   (12)
  ||A||_F = (1/√ℓ) ||Ā||_F.   (13)

For any B ∈ R^{n1×n2×n3} and B̃ ∈ R^{n1 n3 × n2 n3}, we have

  ||A||_* := sup_{||B|| ≤ 1} ⟨A, B⟩   (14)
           = sup_{||B̄|| ≤ 1} (1/ℓ) ⟨Ā, B̄⟩   (15)
           ≤ (1/ℓ) sup_{||B̃|| ≤ 1} ⟨Ā, B̃⟩   (16)
           = (1/ℓ) ||Ā||_*,   (17)

where (15) uses (12), (16) is due to the fact that B̄ is a block diagonal matrix in R^{n1 n3 × n2 n3} while B̃ is an arbitrary matrix in R^{n1 n3 × n2 n3}, and (17) uses the fact that the matrix nuclear norm is the dual norm of the matrix spectral norm. On the other hand, let A = U ∗_L S ∗_L V^⊤ be the t-SVD of A and B = U ∗_L V^⊤. We have

  ||A||_* = sup_{||B|| ≤ 1} ⟨A, B⟩
          ≥ ⟨U ∗_L S ∗_L V^⊤, U ∗_L V^⊤⟩
          = ⟨U^⊤ ∗_L U ∗_L S, V^⊤ ∗_L V⟩
          = ⟨S, I⟩
          = (1/ℓ) ⟨S̄, Ī⟩
          = (1/ℓ) Tr(S̄)
          = (1/ℓ) ||Ā||_*.   (18)

Combining (14)-(17) and (18), we have the following definition of the tensor nuclear norm.

Definition 8. (Tensor nuclear norm) Let L be any invertible linear transform in (8) satisfying (11), and let A = U ∗_L S ∗_L V^⊤ be the t-SVD of A. The tensor nuclear norm of A under L is defined as ||A||_* := ⟨S, I⟩ = (1/ℓ) ||Ā||_*.

If we define the tensor average rank as rank_a(A) = (1/ℓ) rank(Ā), then it is easy to prove that the above tensor nuclear norm is the convex envelope of the tensor average rank within the domain {A : ||A|| ≤ 1} [25].
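Algorithm 2, the tubal rank of Definition 6 and the tensor nuclear norm of Definition 8 all reduce to matrix SVDs in the transform domain. A minimal NumPy sketch (our own illustration; L is an invertible n3 × n3 matrix and ell is the constant ℓ from (11)):

  import numpy as np

  def transform(A, L):
      return np.einsum('ijk,lk->ijl', A, L)       # L(A) = A x_3 L

  def tsvd(A, L):
      # Algorithm 2: t-SVD of A under L via slice-wise matrix SVDs in the transform domain.
      n1, n2, n3 = A.shape
      Abar = transform(A, L)
      Ubar, Sbar, Vbar = np.zeros((n1, n1, n3)), np.zeros((n1, n2, n3)), np.zeros((n2, n2, n3))
      for i in range(n3):
          U, s, Vt = np.linalg.svd(Abar[:, :, i])
          Ubar[:, :, i], Vbar[:, :, i] = U, Vt.T
          Sbar[:len(s), :len(s), i] = np.diag(s)
      Linv = np.linalg.inv(L)
      return transform(Ubar, Linv), transform(Sbar, Linv), transform(Vbar, Linv)

  def tubal_rank(A, L, tol=1e-10):
      # Definition 6: number of nonzero tubes of S (equivalently of Sbar).
      _, S, _ = tsvd(A, L)
      Sbar = transform(S, L)
      return int(sum(np.linalg.norm(Sbar[i, i, :]) > tol for i in range(min(A.shape[:2]))))

  def tnn(A, L, ell):
      # Definition 8: ||A||_* = (1/ell) * ||Abar||_*, i.e., the sum of the singular values of all
      # frontal slices of L(A), divided by the constant ell from (11).
      Abar = transform(A, L)
      return sum(np.linalg.svd(Abar[:, :, i], compute_uv=False).sum()
                 for i in range(A.shape[2])) / ell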
III. TRPCA UNDER LINEAR TRANSFORM AND THE EXACT RECOVERY GUARANTEE

With the new tensor rank and TNN definitions in the previous section, we now consider the TRPCA problem and the convex model (2) using TNN. Assume that we are given X = L0 + S0, where L0 is of low tubal rank and S0 is sparse. The goal is to recover both L0 and S0 from X. We will show that both components can be exactly recovered by solving the convex program (2) under certain incoherence conditions. In this section, we present the exact recovery guarantee of TRPCA in theory. We also give the details for solving the convex program (2).

A. Tensor Incoherence Conditions

As in the recovery problems of RPCA [1] and TRPCA [2], some incoherence conditions are required to avoid pathological situations in which recovery is impossible. We need to assume that the low-rank component L0 is not sparse.

To this end, we assume L0 satisfies some incoherence conditions. Another identifiability issue arises if the sparse tensor has low tubal rank. This can be avoided by assuming that the support of S0 is uniformly distributed. We need the following tensor basis concept for defining the tensor incoherence conditions.

Definition 9. (Standard tensor basis) Let L be any invertible linear transform in (8) satisfying (11). We denote e̊_i as the tensor column basis, which is a tensor of size n × 1 × n3 with the entries of the (i, 1, :) tube of L(e̊_i) equaling 1 and the rest equaling 0. Naturally, its transpose e̊_i^⊤ is called the row basis. The other standard tensor basis is the tube basis ė_k, which is a tensor of size 1 × 1 × n3 with the (1, 1, k)-th entry of L(ė_k) equaling 1 and the rest equaling 0.

Definition 10. (Tensor incoherence conditions) Let L be any invertible linear transform in (8) satisfying (11). For L0 ∈ R^{n1×n2×n3}, assume that rank_t(L0) = r and that it has the skinny t-SVD L0 = U ∗_L S ∗_L V^⊤, where U ∈ R^{n1×r×n3} and V ∈ R^{n2×r×n3} satisfy U^⊤ ∗_L U = I and V^⊤ ∗_L V = I, and S ∈ R^{r×r×n3} is an f-diagonal tensor. Then L0 is said to satisfy the tensor incoherence conditions with parameter µ if

  max_{i=1,···,n1} max_{k=1,···,n3} ||U^⊤ ∗_L e̊_i ∗_L L(ė_k)||_F ≤ √(µr / n1),   (19)
  max_{j=1,···,n2} max_{k=1,···,n3} ||V^⊤ ∗_L e̊_j ∗_L L(ė_k)||_F ≤ √(µr / n2),   (20)

and

  ||U ∗_L V^⊤||_∞ ≤ √(µr / (n1 n2 ℓ)).   (21)

The incoherence conditions guarantee that for small values of µ, the singular vectors are reasonably spread out, or not sparse [1], [19].

Proposition 1. With the same notations as in Definition 10, if the following conditions hold,

  max_{i=1,···,n1} ||U^⊤ ∗_L e̊_i||_F ≤ √(µr / (n1 ℓ)),   (22)
  max_{j=1,···,n2} ||V^⊤ ∗_L e̊_j||_F ≤ √(µr / (n2 ℓ)),   (23)

then (19)-(20) hold.

Proof. By using property (13), we have

  ||U^⊤ ∗_L e̊_i ∗_L L(ė_k)||_F
  = (1/√ℓ) ||L(U^⊤) ⊙ L(e̊_i) ⊙ L(L(ė_k))||_F
  ≤ (1/√ℓ) ||L(U^⊤) ⊙ L(e̊_i)||_F ||L(L(ė_k))||_F
  = √ℓ ||U^⊤ ∗_L e̊_i||_F,

where we use ||L(L(ė_k))||_F = √ℓ ||L(ė_k)||_F = √ℓ due to property (11). Thus, if (22) holds, then (19) holds. Similarly, we have the same relationship between (23) and (20).

The tensor incoherence conditions are used in the proof of Theorem 2. Though formulations similar to (22)-(23) are widely used in RPCA [1] and TRPCA [25], Proposition 1 shows that our new conditions (19)-(20) are less restrictive. We thus use the new conditions in place of (22)-(23) in our proofs.

B. Main Results

Define n_(1) = max(n1, n2) and n_(2) = min(n1, n2). We show that the convex program (2) is able to perfectly recover the low-rank and sparse components.

Theorem 2. Let L be any invertible linear transform in (8) satisfying (11). Suppose L0 ∈ R^{n×n×n3} obeys (19)-(21). Fix any n × n × n3 tensor M of signs. Suppose that the support set Ω of S0 is uniformly distributed among all sets of cardinality m, and that sgn([S0]_ijk) = [M]_ijk for all (i, j, k) ∈ Ω. Then, there exist universal constants c1, c2 > 0 such that with probability at least 1 − c1 (n n3)^{−c2} (over the choice of the support of S0), {L0, S0} is the unique minimizer of (2) with λ = 1/√(n ℓ), provided that

  rank_t(L0) ≤ ρ_r n / (µ (log(n n3))^2)   and   m ≤ ρ_s n^2 n3,   (24)

where ρ_r and ρ_s are positive constants. If L0 ∈ R^{n1×n2×n3} has rectangular frontal slices, TRPCA with λ = 1/√(n_(1) ℓ) succeeds with probability at least 1 − c1 (n_(1) n3)^{−c2}, provided that rank_t(L0) ≤ ρ_r n_(2) / (µ (log(n_(1) n3))^2) and m ≤ ρ_s n1 n2 n3.

Theorem 2 gives the exact recovery guarantee for the convex model (2) under certain conditions. It says that an incoherent L0 can be recovered for rank_t(L0) on the order of n/(µ (log n n3)^2) and a number of nonzero entries in S0 on the order of n^2 n3. TRPCA is also parameter free. If n3 = 1, both our model (2) and the result in Theorem 2 reduce to the matrix RPCA case in [1]. Intuitively, the TNN definition and the TRPCA model based on it in [25] fall into the special cases of ours when the discrete Fourier transform is used. We would like to emphasize some further differences from [25]:

1. It is obvious that our TNN and the TRPCA model based on it under linear transforms are much more general than the ones in [25], which use the discrete Fourier transform. We only require the linear transform to satisfy property (11). We can then use different linear transforms for different purposes, e.g., improving the learning performance for different data or tasks, or improving the efficiency by using special linear transforms.

2. Our tensor incoherence conditions, the theoretical exact recovery guarantee in Theorem 2 and its proof do not fall into the cases in [25], even when the discrete Fourier transform is used. The key difference is that we restrict the linear transform to the real domain, L : R^{n1×n2×n3} → R^{n1×n2×n3}, while the discrete Fourier transform is a mapping from the real domain to the complex domain, L : R^{n1×n2×n3} → C^{n1×n2×n3}. Several special properties of the discrete Fourier transform are used in their proof of the exact recovery guarantee, but these techniques are not applicable in our proofs. For example, the definitions of the standard tensor basis are different. This leads to different tensor incoherence conditions and thus different proofs. We discuss this further in the proofs, which are provided in the supplementary material.

Problem (2) can be solved by the standard Alternating Direction Method of Multipliers (ADMM) [30]. We only give the details for solving the following key subproblem in ADMM, i.e., for any Y ∈ R^{n1×n2×n3},

  min_X τ ||X||_* + (1/2) ||X − Y||_F^2.   (25)

Let Y = U ∗_L S ∗_L V^⊤ be the tensor SVD of Y. For any τ > 0, we define the Tensor Singular Value Thresholding (T-SVT) operator as follows:

  D_τ(Y) = U ∗_L S_τ ∗_L V^⊤,   (26)

where S_τ = L^{−1}((L(S) − τ)_+). Here t_+ = max(t, 0). The T-SVT operator is the proximity operator associated with the proposed tensor nuclear norm.

Theorem 3. Let L be any invertible linear transform in (8) satisfying (11). For any τ > 0 and Y ∈ R^{n1×n2×n3}, the T-SVT operator (26) obeys

  D_τ(Y) = arg min_X τ ||X||_* + (1/2) ||X − Y||_F^2.   (27)

The main cost of ADMM for solving (2) is to compute D_τ(Y) for solving (27). For any general linear transform L in (8), it is easy to see that the per-iteration cost is O(n1 n2 n3^2 + n_(1) n_(2)^2 n3). Such a cost is the same as that of TRPCA using the DFT as the linear transform [2].
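The T-SVT operator (26) is slice-wise matrix singular value thresholding in the transform domain, and it is the only nontrivial step of the ADMM loop. A minimal NumPy sketch (our own illustration; the ADMM parameter schedule below is our choice, not taken from the paper):

  import numpy as np

  def transform(A, L):
      return np.einsum('ijk,lk->ijl', A, L)                        # L(A)

  def tsvt(Y, L, tau):
      # T-SVT (26): threshold the singular values of each frontal slice of L(Y) by tau.
      Ybar = transform(Y, L)
      Xbar = np.zeros_like(Ybar)
      for i in range(Y.shape[2]):
          U, s, Vt = np.linalg.svd(Ybar[:, :, i], full_matrices=False)
          Xbar[:, :, i] = (U * np.maximum(s - tau, 0.0)) @ Vt      # (L(S) - tau)_+
      return transform(Xbar, np.linalg.inv(L))

  def trpca_admm(X, L, lam, mu=1e-3, rho=1.1, iters=500):
      # A standard ADMM loop for (2): min ||Lo||_* + lam*||S||_1  s.t.  X = Lo + S.
      Lo, S, Y = np.zeros_like(X), np.zeros_like(X), np.zeros_like(X)
      for _ in range(iters):
          Lo = tsvt(X - S + Y / mu, L, 1.0 / mu)                   # T-SVT step, cf. (27)
          T = X - Lo + Y / mu
          S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)   # l1 soft-thresholding
          Y = Y + mu * (X - Lo - S)                                # dual update
          mu = min(rho * mu, 1e10)
      return Lo, S

By Theorem 3, tsvt(Y, L, tau) returns the minimizer of (25); the surrounding loop alternates it with entrywise soft-thresholding and a dual update, with lam set to λ = 1/√(n_(1) ℓ) as in Theorem 2.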
IV. EXPERIMENTS

We present an empirical study of our method. The goal of this study is two-fold: a) establish that the convex program (2) indeed recovers the low-rank and sparse parts correctly, and thus verify our result in Theorem 2; b) show that our tensor methods are superior to matrix based RPCA methods and other existing TRPCA methods in practice.

A. Exact Recovery from Varying Fractions of Error

We first verify the correct recovery guarantee of Theorem 2 on randomly generated tensors. Since Theorem 2 holds for any invertible linear transform L in (8) satisfying (11), we consider two cases of L: (1) the Discrete Cosine Transform (DCT), which is used in the original work on the transform based t-product [28]; we use the Matlab command dct to generate the DCT matrix L. (2) A Random Orthogonal Matrix (ROM) generated by the method with code available online (https://www.mathworks.com/matlabcentral/fileexchange/11783-randorthmat). In both cases, (11) holds with ℓ = 1. We generate the random data in the same way as [25]. We simply consider tensors of size n × n × n, with n = 100, 200 and 300. We generate a tensor with tubal rank r as a product L0 = P ∗_L Q^⊤, where P and Q are tensors of size n × r × n with entries independently sampled from the N(0, 1/n) distribution. The support set Ω (with size m) of S0 is chosen uniformly at random. For all (i, j, k) ∈ Ω, let [S0]_ijk = M_ijk, where M is a tensor with independent Bernoulli ±1 entries. For each size n, we set the tubal rank of L0 to 0.1n and consider two cases of the sparsity, m = ||S0||_0 = 0.1n^3 and 0.2n^3.

TABLE I: Correct recovery for random problems of varying size. The Discrete Cosine Transform (DCT) is used as the invertible linear transform L.

  r = rank_t(L0) = 0.1n, m = ||S0||_0 = 0.1n^3
  n    r    m     rank_t(L̂)   ||Ŝ||_0     ||L̂−L0||_F/||L0||_F   ||Ŝ−S0||_F/||S0||_F
  100  10   1e5   10           102,921     2.4e−6                 8.7e−10
  200  20   8e5   20           833,088     6.8e−6                 8.7e−10
  300  30   27e5  30           2,753,084   1.8e−5                 1.4e−9

  r = rank_t(L0) = 0.1n, m = ||S0||_0 = 0.2n^3
  n    r    m     rank_t(L̂)   ||Ŝ||_0     ||L̂−L0||_F/||L0||_F   ||Ŝ−S0||_F/||S0||_F
  100  10   2e5   10           201,090     4.0e−6                 2.1e−10
  200  20   16e5  20           1,600,491   5.3e−6                 9.8e−10
  300  30   54e5  30           5,460,221   1.8e−5                 1.8e−9

TABLE II: Correct recovery for random problems of varying size. A Random Orthogonal Matrix (ROM) is used as the invertible linear transform L.

  r = rank_t(L0) = 0.1n, m = ||S0||_0 = 0.1n^3
  n    r    m     rank_t(L̂)   ||Ŝ||_0     ||L̂−L0||_F/||L0||_F   ||Ŝ−S0||_F/||S0||_F
  100  10   1e5   10           103,034     2.7e−6                 9.6e−9
  200  20   8e5   20           833,601     7.0e−6                 9.0e−10
  300  30   27e5  30           2,852,933   1.8e−5                 1.3e−9

  r = rank_t(L0) = 0.1n, m = ||S0||_0 = 0.2n^3
  n    r    m     rank_t(L̂)   ||Ŝ||_0     ||L̂−L0||_F/||L0||_F   ||Ŝ−S0||_F/||S0||_F
  100  10   2e5   10           201,070     4.3e−6                 2.3e−9
  200  20   16e5  20           1,614,206   5.4e−6                 9.9e−10
  300  30   54e5  30           5,457,874   9.6e−6                 9.5e−10

Tables I and II report the recovery results using the DCT and ROM as the linear transform L, respectively. It can be seen that our convex program (2) gives the correct tubal rank estimation of L0 in all cases, and the relative errors ||L̂ − L0||_F/||L0||_F are very small, all around 10^{−5}. The sparsity estimation of S0 is not as exact as the rank estimation, but note that the relative errors ||Ŝ − S0||_F/||S0||_F are all very small, less than 10^{−8} (actually much smaller than the relative errors of the recovered low-rank component). These results verify the correct recovery phenomenon claimed in Theorem 2 under different linear transforms.
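For completeness, the synthetic data described above can be generated as in the following sketch (our own illustration; scipy's dct stands in for the Matlab dct command, and tprod repeats the t-product routine sketched in Section II):

  import numpy as np
  from scipy.fft import dct

  def transform(A, L):
      return np.einsum('ijk,lk->ijl', A, L)

  def tprod(A, B, L):
      # t-product under L (same routine as sketched in Section II).
      Cbar = np.einsum('ijk,jlk->ilk', transform(A, L), transform(B, L))
      return transform(Cbar, np.linalg.inv(L))

  n = 100                                         # tensors of size n x n x n (n3 = n)
  r, m = int(0.1 * n), int(0.1 * n**3)

  L = dct(np.eye(n), axis=0, norm='ortho')        # orthogonal DCT matrix: L^T L = I, so ell = 1
  P = np.random.randn(n, r, n) / np.sqrt(n)       # entries i.i.d. N(0, 1/n)
  Q = np.random.randn(n, r, n) / np.sqrt(n)
  L0 = tprod(P, Q.transpose(1, 0, 2), L)          # L0 = P *_L Q^T (slice-wise transpose; L is real)

  S0 = np.zeros(n**3)
  idx = np.random.choice(n**3, size=m, replace=False)      # uniformly random support of size m
  S0[idx] = np.random.choice([-1.0, 1.0], size=m)          # independent Bernoulli +/-1 values
  S0 = S0.reshape(n, n, n)

  X = L0 + S0                                     # observation fed to the convex program (2)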

B. Phase Transition in Tubal Rank and Sparsity

Theorem 2 shows that the recovery is correct when the tubal rank of L0 is relatively low and S0 is relatively sparse. Now we examine the recovery phenomenon with varying tubal rank of L0 and varying sparsity of S0. We consider tensors of size n × n × n3, where n = 100 and n3 = 50. We generate L0 in the same way as in the previous section. We consider two distributions of the support of S0. The first case is the Bernoulli model for the support of S0, with random signs: each entry of S0 takes on value 0 with probability 1 − ρ_s, and values ±1 each with probability ρ_s/2. The second case chooses the support Ω in accordance with the Bernoulli model, but this time sets S0 = P_Ω sgn(L0). We set r/n = [0.01 : 0.01 : 0.5] and ρ_s = [0.01 : 0.01 : 0.5]. For each (r/n, ρ_s)-pair, we simulate 10 test instances and declare a trial to be successful if the recovered L̂ satisfies ||L̂ − L0||_F/||L0||_F ≤ 10^{−3}.

Figures 3 and 4 respectively show the fraction of correct recovery for each pair (r/n, ρ_s) for the two settings of S0. The white region indicates exact recovery while the black one indicates failure. The experiments show that the recovery is correct when the tubal rank of L0 is relatively low and the errors S0 are relatively sparse. More importantly, the results show that the used linear transforms are not important, as long as property (11) holds. The recovery performances are very similar even when different linear transforms are used. This verifies our main result in Theorem 2. The recovery phenomena are also similar to RPCA [1] and TRPCA [25]. But note that our results are much more general, as we can use different linear transforms.

[Fig. 3: Correct recovery for varying rank and sparsity. In this test, sgn(S0) is random. Fraction of correct recoveries across 10 trials, as a function of rank_t(L0) (x-axis) and sparsity of S0 (y-axis). (a) Random Signs, DCT: DCT is used as the linear transform L. (b) Random Signs, ROM: ROM is used as the linear transform L.]

[Fig. 4: Correct recovery for varying rank and sparsity. In this test, S0 = P_Ω sgn(L0). Fraction of correct recoveries across 10 trials, as a function of rank_t(L0) (x-axis) and sparsity of S0 (y-axis). (a) Coherent Signs, DCT: DCT is used as the linear transform L. (b) Coherent Signs, ROM: ROM is used as the linear transform L.]

C. Applications to Image Recovery

RPCA and tensor RPCA have been applied to image recovery [12], [2], [24]. The main motivation is that a color image can be well approximated by low-rank matrices or tensors. In this section, we apply our TRPCA model using the DCT as the linear transform to image recovery, and compare our method with state-of-the-art methods.

For any color image of size n1 × n2, we can format it as a tensor of size n1 × n2 × n3, where n3 = 3. The frontal slices correspond to the three channels of the color image. (We observe that this way of tensor construction achieves the best performance in practice, though we do not show the results of other ways of tensor construction due to space limitations.) We randomly select 100 color images from the Berkeley segmentation dataset [31] for this test. We randomly set 10% of the pixels to random values in [0, 255], and the positions of the corrupted pixels are unknown. We compare our TRPCA model with RPCA [1], SNN [18] and TRPCA [2], which also have theoretical recovery guarantees. For RPCA, we apply it on each channel separately and combine the results to obtain the recovered image. For SNN [18], we set λ = [15, 15, 1.5] as in [2]. TRPCA [2] is a special case of our new TRPCA using the discrete Fourier transform. In this experiment, we use the DCT and a ROM as the transforms in our new TRPCA method, and we denote our method using these two transforms as TRPCA-DCT and TRPCA-ROM, respectively. The Peak Signal-to-Noise Ratio (PSNR) value,

  PSNR(L̂, L0) = 10 log_10( ||L0||_∞^2 / ((1/(n1 n2 n3)) ||L̂ − L0||_F^2) ),

is used to evaluate the recovery performance. A higher PSNR value indicates better recovery performance.

Figure 5 shows the comparison of the PSNR values and running times of all the compared methods. Some examples are given in Figure 6. It can be seen that in most cases, our new TRPCA-DCT achieves the best performance. This implies that the discrete Fourier transform used in [2] may not be the best in this task, though the reason is not very clear at present. However, if the random orthogonal matrix (ROM) is used as the linear transform, the results are generally bad. This is reasonable, and it implies that the choice of the linear transform L is crucial in practice, though the best linear transform is currently not clear. Figure 5 (bottom) shows that our method is as efficient as RPCA and the TRPCA in [2].

V. CONCLUSIONS

Based on the t-product under invertible linear transforms, we defined the new tensor tubal rank and tensor nuclear norm. We then proposed a new Tensor Robust Principal Component Analysis (TRPCA) model given transforms satisfying certain conditions. In theory, we proved that, under certain assumptions, the convex program recovers both the low-rank and sparse components exactly. Numerical experiments verify our theory, and the results on images demonstrate the effectiveness of our model.

This work provides a new direction for the pursuit of low-rank tensor estimation. The tensor rank and tensor nuclear norm depend not only on the given tensor, but also on the used invertible linear transforms. It is obvious that the best transforms will be different for different tasks and different data, so looking for optimized transforms is a very interesting future work. Second, though this work focuses on the analysis for 3-way tensors, it may not be difficult to generalize our model and main results to the case of order-p (p ≥ 3) tensors, by using the t-SVD for order-p tensors as in [32]. Finally, it is always important to apply such a new technique to other applications.

50 RPCA SNN TRPCA TRPCA-DCT TRPCA-ROM

40

30 PSNR value

20

10 20 30 40 50 60 70 80 90 100 Index of images 120 RPCA SNN TRPCA TRPCA-DCT 100

80

60

40 Running time (s) 20

10 20 30 40 50 60 70 80 90 100 Index of images Fig. 5: Comparison of the PSNR values (up) and running time (bottom) of RPCA, SNN, TRPCA, our TRPCA-DCT and TRPCA-ROM for image denoising on 50 images. Best viewed in ×2 sized color pdf file.
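The PSNR criterion reported in Fig. 5 is the one defined in Section IV-C. A one-function NumPy sketch (our own illustration) for completeness:

  import numpy as np

  def psnr(L_hat, L0):
      # PSNR(L_hat, L0) = 10*log10( ||L0||_inf^2 / ( ||L_hat - L0||_F^2 / (n1*n2*n3) ) ).
      mse = np.mean((L_hat - L0) ** 2)
      return 10.0 * np.log10(np.max(np.abs(L0)) ** 2 / mse)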

APPENDIX

A. Proof of Theorem 3

Proof. Let Y = U ∗_L S ∗_L V^⊤ be the tensor SVD of Y. By using the definition of the tensor nuclear norm and property (13), problem (27) is equivalent to

  arg min_X τ ||X||_* + (1/2) ||X − Y||_F^2
  = arg min_X (1/ℓ) ( τ ||X̄||_* + (1/2) ||X̄ − Ȳ||_F^2 )
  = arg min_X (1/ℓ) Σ_{i=1}^{n3} ( τ ||X̄^(i)||_* + (1/2) ||X̄^(i) − Ȳ^(i)||_F^2 ).   (28)

By Theorem 2.1 in [33], we know that the i-th frontal slice of D_τ(Y) solves the i-th subproblem of (28). Hence, D_τ(Y) solves problem (27).

[Fig. 6: Performance comparison for image recovery on some sample images. (a) Original image; (b) observed image; (c)-(f) recovered images by RPCA, SNN, TRPCA, and our TRPCA-DCT, respectively. Best viewed in ×2 sized color pdf file.]

REFERENCES

[1] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright, "Robust principal component analysis?," J. ACM, vol. 58, no. 3, 2011.
[2] Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Zhouchen Lin, and Shuicheng Yan, "Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016.
[3] Tamara G. Kolda and Brett W. Bader, "Tensor decompositions and applications," SIAM Rev., vol. 51, no. 3, pp. 455–500, 2009.
[4] M. Alex O. Vasilescu and Demetri Terzopoulos, "Multilinear analysis of image ensembles: Tensorfaces," in Proc. European Conf. Computer Vision, pp. 447–460. Springer, 2002.
[5] Thomas Franz, Antje Schultz, Sergej Sizov, and Steffen Staab, "TripleRank: Ranking semantic web data by tensor decomposition," in Int'l Semantic Web Conf., 2009, pp. 213–228.
[6] Nicholas D. Sidiropoulos, Rasmus Bro, and Georgios B. Giannakis, "Parallel factor analysis in sensor array processing," IEEE Trans. Signal Processing, vol. 48, no. 8, pp. 2377–2388, 2000.
[7] Sanne Engelen, Stina Frosch, and Bo M. Jorgensen, "A fully robust PARAFAC method for analyzing fluorescence data," J. Chemometrics, vol. 23, no. 3, pp. 124–131, 2009.
[8] Evrim Acar, Tamara G. Kolda, Daniel M. Dunlavy, and Morten Mørup, "Scalable tensor factorizations with missing data," in IEEE Int'l Conf. Data Mining, 2009, pp. 701–712.
[9] Hui Ji, Sibin Huang, Zuowei Shen, and Yuhong Xu, "Robust video restoration by joint sparse and low rank matrix approximation," SIAM J. Imaging Sciences, vol. 4, no. 4, pp. 1122–1142, 2011.
[10] Yigang Peng, Arvind Ganesh, John Wright, Wenli Xu, and Yi Ma, "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2233–2246, 2012.
[11] Christopher J. Hillar and Lek-Heng Lim, "Most tensor problems are NP-hard," J. ACM, vol. 60, no. 6, pp. 45, 2013.
[12] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye, "Tensor completion for estimating missing values in visual data," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 208–220, 2013.
[13] Silvia Gandy, Benjamin Recht, and Isao Yamada, "Tensor completion and low-n-rank tensor recovery via convex optimization," Inverse Problems, vol. 27, no. 2, pp. 025010, 2011.
[14] Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, and Hisashi Kashima, "Statistical performance of convex tensor decomposition," in Advances in Neural Information Processing Systems, 2011, pp. 972–980.
[15] Nadia Kreimer and Mauricio D. Sacchi, "Nuclear norm minimization and tensor completion in exploration seismology," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, 2013, pp. 4275–4279.
[16] Marco Signoretto, R. Van De Plas, B. De Moor, and Johan A. K. Suykens, "Tensor versus matrix completion: A comparison with application to spectral data," IEEE Signal Processing Letters, vol. 18, no. 7, pp. 403–406, 2011.
[17] Cun Mu, Bo Huang, John Wright, and Donald Goldfarb, "Square deal: Lower bounds and improved relaxations for tensor recovery," in Proc. Int'l Conf. Machine Learning, 2014, pp. 73–81.
[18] Bo Huang, Cun Mu, Donald Goldfarb, and John Wright, "Provable models for robust low-rank tensor completion," Pacific J. Optimization, vol. 11, no. 2, pp. 339–364, 2015.
[19] Emmanuel J. Candès and Benjamin Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[20] Bernardino Romera-Paredes and Massimiliano Pontil, "A new convex relaxation for tensor completion," in Advances in Neural Information Processing Systems, 2013, pp. 2967–2975.
[21] Misha E. Kilmer and Carla D. Martin, "Factorization strategies for third-order tensors," Linear Algebra and its Applications, vol. 435, no. 3, pp. 641–658, 2011.
[22] Zemin Zhang, Gregory Ely, Shuchin Aeron, Ning Hao, and Misha Kilmer, "Novel methods for multilinear data completion and de-noising based on tensor-SVD," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2014, pp. 3842–3849.
[23] Oguz Semerci, Ning Hao, Misha E. Kilmer, and Eric L. Miller, "Tensor-based formulation and nuclear norm regularization for multienergy computed tomography," IEEE Trans. Image Processing, vol. 23, no. 4, pp. 1678–1693, 2014.
[24] Canyi Lu, Jiashi Feng, Zhouchen Lin, and Shuicheng Yan, "Exact low tubal rank tensor recovery from Gaussian measurements," in Int'l Joint Conf. Artificial Intelligence, 2018.
[25] Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Zhouchen Lin, and Shuicheng Yan, "Tensor robust principal component analysis with a new tensor nuclear norm," IEEE Trans. Pattern Analysis and Machine Intelligence, 2019.
[26] Yudong Chen, "Incoherence-optimal matrix completion," IEEE Trans. Information Theory, vol. 61, no. 5, pp. 2909–2923, 2015.
[27] Zemin Zhang and Shuchin Aeron, "Exact tensor completion using t-SVD," IEEE Trans. Signal Processing, vol. 65, no. 6, pp. 1511–1526, 2017.
[28] Eric Kernfeld, Misha Kilmer, and Shuchin Aeron, "Tensor-tensor products with invertible linear transforms," Linear Algebra and its Applications, vol. 485, pp. 545–570, 2015.
[29] David F. Gleich, Chen Greif, and James M. Varah, "The power and Arnoldi methods in an algebra of circulants," Numerical Linear Algebra with Applications, vol. 20, no. 5, pp. 809–831, 2013.
[30] Canyi Lu, Jiashi Feng, Shuicheng Yan, and Zhouchen Lin, "A unified alternating direction method of multipliers by majorization minimization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 527–541, 2018.
[31] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. IEEE Int'l Conf. Computer Vision, 2001, vol. 2, pp. 416–423.
[32] Carla D. Martin, Richard Shafer, and Betsy LaRue, "An order-p tensor factorization with applications in imaging," SIAM J. Scientific Computing, vol. 35, no. 1, pp. A474–A490, 2013.
[33] Jianfeng Cai, Emmanuel Candès, and Zuowei Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optimization, 2010.

Canyi Lu is currently a postdoctoral research associate at Carnegie Mellon University. He received his Ph.D. degree from the National University of Singapore in 2017. His current research interests include computer vision, machine learning, pattern recognition and optimization. He was the winner of the 2014 Microsoft Research Asia Fellowship, the 2017 Chinese Government Award for Outstanding Self-Financed Students Abroad, and the Best Poster Award at BigDIA 2018.

Pan Zhou received his Master's degree in computer science from Peking University in 2016. He is now a Ph.D. candidate at the Department of Electrical and Computer Engineering (ECE), National University of Singapore, Singapore. His research interests include computer vision, machine learning, and optimization. He was the winner of the 2018 Microsoft Research Asia Fellowship.