
t-Schatten-p Norm for Low-Rank Tensor Recovery

Hao Kong, Xingyu Xie, Student Member, IEEE, and Zhouchen Lin, Fellow, IEEE

Abstract—In this paper, we propose a new definition of the tensor Schatten-p norm (t-Schatten-p norm) based on t-SVD [1], and we prove that this norm has properties similar to those of the matrix Schatten-p norm. More importantly, the t-Schatten-p norm with 0 < p < 1 better approximates the $\ell_1$ norm of the tensor multi-rank [2], so it can serve as a tighter regularizer for Low-Rank Tensor Recovery (LRTR) problems. We further prove the tensor multi-Schatten-p norm surrogate theorem and give an efficient algorithm accordingly. By decomposing the target tensor into many small-scale tensors, the non-convex optimization problem (0 < p < 1) is equivalently transformed into many convex sub-problems, which greatly improves the computational efficiency on large-scale tensors. Finally, we provide conditions for exact recovery in the noiseless case and the corresponding error bounds for the noisy case. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our t-Schatten-p norm on the Tensor Robust Principal Component Analysis (TRPCA) and Tensor Completion (TC) problems.

Index Terms—tensor Schatten-p norm, low-rank, tensor decomposition, convex optimization.

I. INTRODUCTION

IN computer vision and pattern recognition, data structures are becoming more and more complex, so multi-dimensional arrays (also called tensors) have recently attracted increasing attention. Many problems can be cast as Low-Rank Tensor Recovery (LRTR) problems, such as video denoising [3], video inpainting [4], subspace clustering [5], recommendation systems [6], and multitask learning [7]. The LRTR problem aims to recover the original low-rank tensor from an observed corrupted/disturbed tensor. It can be formulated as

$$\min_{\mathcal{X}} \ \operatorname{rank}(\mathcal{X}), \quad \text{s.t.} \ \Psi(\mathcal{X}) = \mathcal{T}, \qquad (1)$$

where $\mathcal{T}$ is the measurement observed through a linear operator $\Psi(\cdot)$ and $\mathcal{X}$ is the clean data. As in the matrix case, the $\operatorname{rank}(\cdot)$ operation acts as a sparsity regularizer on the tensor singular values. Unfortunately, none of the existing definitions of tensor rank works well in practice; they are all tied to particular tensor decompositions [8]. For example, the CP-rank [9] is based on the CANDECOMP/PARAFAC decomposition [10]; the Tucker-n-rank [11] is based on the Tucker decomposition [12]; and the tensor multi-rank and tubal-rank [2] are based on t-SVD [1].

Minimizing the rank function directly is usually NP-hard and cannot be solved within polynomial time, hence one often replaces $\operatorname{rank}(\mathcal{X})$ by a convex/non-convex surrogate function $f(\mathcal{X})$:

$$\min_{\mathcal{X}} \ f(\mathcal{X}), \quad \text{s.t.} \ \Psi(\mathcal{X}) = \mathcal{T}. \qquad (2)$$

The main difference among existing LRTR models is the choice of the surrogate $f(\cdot)$. Based on different definitions of tensor singular values, various tensor nuclear norms or tensor Schatten-p norms¹ have been proposed as rank surrogates [1], [13], [14], but they all have limitations when applied to real problems.

Based on the CP decomposition [10], Friedland et al. [14] introduce cTNN (Tensor Nuclear Norm based on CP) as the convex relaxation of the tensor CP-rank:

$$\|\mathcal{T}\|_{cTNN} = \inf\left\{ \sum_{i=1}^{r} |\lambda_i| : \mathcal{T} = \sum_{i=1}^{r} \lambda_i\, u_i \circ v_i \circ w_i \right\}, \qquad (3)$$

where $\|u_i\| = \|v_i\| = \|w_i\| = 1$ and $\circ$ denotes the vector outer product. Yuan et al. [15] give the sub-gradient of cTNN by leveraging its dual property, so the cTNN minimization problem can be solved by traditional gradient-based methods. It is worth mentioning that, for a given tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, computing its CP-rank [9] is usually NP-complete [16], [17], which means we cannot verify the consistency of cTNN's implicit decomposition with the ground-truth CP decomposition. Moreover, it is hard to measure the tightness of cTNN relative to the CP-rank, since whether cTNN satisfies the continuous analogue of Comon's conjecture [14] remains unknown. Furthermore, unlike the two-dimensional case, cTNN cannot be extended to a tensor Schatten-p norm, because the infimum becomes identically 0 when the $\ell_1$ norm of the coefficients is replaced by an $\ell_p$ norm for any p > 1 [14]. All these reasons limit the application of cTNN to the LRTR problem.

To avoid the NP-complete CP decomposition, Liu et al. [13] define a tensor nuclear norm named SNN (Sum of Nuclear Norms) based on the Tucker decomposition [12]:

$$\|\mathcal{T}\|_{SNN} = \sum_{i=1}^{\dim} \|\mathbf{T}_{(i)}\|_*, \qquad (4)$$

where $\mathbf{T}_{(i)}$ denotes the unfolding of the tensor along the i-th dimension and $\|\cdot\|_*$ is the matrix nuclear norm, i.e., the sum of singular values. Because SNN is easy to compute, it has been widely used, e.g., [13], [18], [19]. By considering the subspace structure of each mode, Kasai et al. [18] propose a Riemannian manifold based tensor completion method (RMTC).

H. Kong and Z. Lin are with the Key Lab. of Machine Perception (MoE), School of EECS, Peking University, Beijing 100871, P. R. China. Z. Lin is the corresponding author. (e-mails: [email protected] and [email protected]). X. Xie is with the College of Automation, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, P. R. China. (e-mail: [email protected]). This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors; it includes the Supplementary Material with our proofs. Contact [email protected] for further questions about this work.

¹ The Schatten-p norm is only a pseudo-norm when 0 < p < 1.
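To make the SNN surrogate of Eq. (4) concrete, the following is a minimal NumPy sketch (the helper names `unfold` and `snn` are ours, not taken from the paper or any cited toolbox): it matricizes the tensor along each mode and sums the matrix nuclear norms.

```python
import numpy as np

def unfold(T, mode):
    """Mode-`mode` matricization: rows are indexed by that mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def snn(T):
    """Sum of Nuclear Norms, Eq. (4): sum over modes of the nuclear norm
    of the corresponding unfolding."""
    return sum(np.linalg.norm(unfold(T, m), ord='nuc') for m in range(T.ndim))

rng = np.random.default_rng(0)
T = rng.standard_normal((10, 12, 8))
print("SNN(T) =", snn(T))
```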

However, Paredes et al. [20] point out that SNN is not the tightest convex relaxation of the Tucker-n-rank [11] but is actually an overlapped regularization of it, and they propose an alternative convex relaxation of the Tucker-n-rank that is tighter than SNN. Tomioka et al. [21] introduce a tensor Schatten-p norm based on SNN, and Li et al. [19] achieve better experimental results on the tensor completion (TC) problem than the original SNN. Nonetheless, these methods simply unfold the high-order tensor into matrices, which unavoidably destroys the intrinsic structure of tensor data.

To maintain the internal structure of high-dimensional arrays, Kilmer et al. [1] propose a new tensor decomposition named t-SVD. Zhang et al. [4] give the definition of the nuclear norm corresponding to t-SVD, i.e., the Tensor Nuclear Norm (TNN), and further show that TNN is the tightest convex relaxation of the $\ell_1$ norm of the tensor multi-rank² within the unit ball of the tensor spectral norm³. When image or video data are arranged into matrices [13], they often lie on a union of low-rank subspaces; fortunately, the original tensor data also have low multi-rank (or tubal-rank) structures. Figure 1 shows the singular values of all frontal slices of several commonly used datasets: most singular values are very close to 0 and much smaller than the largest ones, so the related problems can be solved effectively by t-SVD based low-rank methods. By adopting TNN, [3] and [22] establish exact recovery conditions for the TRPCA and TC problems, respectively.

Because it respects the internal structure of the data, TNN has been widely used in recent years. Nevertheless, for large-scale tensor data the computational complexity of TNN grows dramatically. For instance, when solving TC problems with TNN, the per-iteration complexity is $O(n_1 n_2 n_3(\log n_3 + \min\{n_1, n_2\}))$, which takes several hours to complete a tensor of size 500 × 500 × 500. To avoid this high complexity, Zhou et al. [23] and Liu et al. [24] use tensor factorization to preserve the low-rank structure and only maintain two smaller tensors during each iteration. By decomposing a large-scale tensor into two skinny ones, the per-iteration cost drops to $O(r(n_1+n_2)n_3\log n_3 + r n_1 n_2 n_3)$ [23]. Although the complexity is reduced, these methods do not consider the balance between factors, hence they cannot prevent extremely imbalanced tensor decompositions, which make the obtained tensors violate the incoherence conditions. Note that incoherence is an essential condition for successful completion.

To break the limits of existing methods, in this paper we propose a new tensor Schatten-p norm (t-Schatten-p norm), defined in Eq. (11) based on the t-product [1]. The proposed norm has properties similar to the matrix Schatten-p norm. Additionally, when 0 < p < 1 it is a tighter regularizer than TNN for approximating the $\ell_1$ norm of the tensor multi-rank [2]. Furthermore, inspired by [23] and [25], we extend the surrogate theorem to the tensor case. Using the new theorem and the t-product, when 0 < p < 1 we decompose the target tensor $\mathcal{T}$ into many small-scale tensors $\{\mathcal{T}_i\}$ with $\mathcal{T} = \mathcal{T}_1 * \cdots * \mathcal{T}_I$, and minimize the weighted sum of convex Schatten-$p_i$ norms $\sum_i \frac{1}{p_i}\|\mathcal{T}_i\|^{p_i}_{S_{p_i}}$ with $p_i \ge 1$ for all i. In this way, the original non-convex non-smooth optimization problem is divided into many convex sub-problems. Hence we not only reduce the per-iteration computational complexity but also obtain a better approximation to the $\ell_1$ norm of the tensor multi-rank, which leads to better performance. We also provide an efficient algorithm for solving the resulting optimization problem. Finally, we apply the proposed method to the TC and TRPCA problems and provide theoretical analysis of the performance guarantees.

In summary, our main contributions include:
• We propose a new definition of the tensor Schatten-p norm with some desirable properties, e.g., unitary invariance, convexity and differentiability. When 0 < p < 1, it approximates the $\ell_1$ norm of the tensor multi-rank more tightly than TNN, which is beneficial to LRTR problems.
• We prove the tensor Schatten-p norm surrogate theorem, which transforms a non-convex problem into many convex sub-problems⁴, and we give an efficient algorithm for the transformed model together with a proof of its convergence. Our method not only reduces the per-iteration computational complexity significantly on large-scale tensors but also maintains the balance among the factor tensors.
• We provide sufficient conditions for exact recovery with a general linear operator and error bounds under certain assumptions in the noisy case. To guarantee the performance on the TC problem, we also give a theoretical analysis of exact completion.

We apply the proposed t-Schatten-p norm to the TRPCA and TC problems. Experimental results on synthetic and real-world datasets verify the advantages of our method.

II. NOTATIONS AND PRELIMINARIES

In this section, we introduce the notations and necessary definitions used later.

Tensors are denoted by uppercase calligraphic letters, e.g., $\mathcal{T}$; matrices by boldface uppercase letters, e.g., $\mathbf{M}$; vectors by boldface lowercase letters, e.g., $\mathbf{v}$; and scalars by lowercase letters, e.g., $c$.

For a given 3-order tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, we use $\mathbf{T}^{(k)}$ to denote its k-th frontal slice $\mathcal{T}(:,:,k)$, and $\mathcal{T}_{ijk}$ to denote its (i, j, k)-th entry. We use $\bar{\mathcal{T}}$ to denote the discrete Fourier transform of $\mathcal{T}$ along the third dimension, corresponding to the Matlab operation $\bar{\mathcal{T}} = \mathrm{fft}(\mathcal{T},[\,],3)$; accordingly $\mathcal{T} = \mathrm{ifft}(\bar{\mathcal{T}},[\,],3)$. $\bar{\mathbf{T}}^{(i)}$ denotes the i-th frontal slice of $\bar{\mathcal{T}}$.
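The Fourier-domain convention above can be mirrored in NumPy as in the following sketch (function names are ours); `np.fft.fft(T, axis=2)` plays the role of Matlab's `fft(T, [], 3)`.

```python
import numpy as np

def fft3(T):
    """DFT along the third dimension; matches Matlab's fft(T, [], 3)."""
    return np.fft.fft(T, axis=2)

def ifft3(T_bar):
    """Inverse DFT along the third dimension; matches ifft(T_bar, [], 3)."""
    return np.fft.ifft(T_bar, axis=2)

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 5, 6))
T_bar = fft3(T)                            # frontal slices T_bar[:, :, k] are complex
assert np.allclose(ifft3(T_bar).real, T)   # the round trip recovers T
```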

² The tensor multi-rank is a vector whose entries are the ranks of the frontal slices after the Fourier transform along the third dimension.
³ The related definition of the tensor spectral norm can be found in [4].
⁴ The whole optimization problem is still non-convex due to the multilinear constraint.



Fig. 1. Illustrations of the low tubal-rank properties of some datasets. (a) and (c) are from the Berkeley Segmentation dataset. (b) is from the UMist Faces dataset. (d) is from the YUV Video Sequences dataset.

The block circulant matrix associated with a 3-order tensor $\mathcal{T}$ is denoted by $\mathrm{bcirc}(\mathcal{T}) \in \mathbb{R}^{n_1 n_3 \times n_2 n_3}$:

$$\mathrm{bcirc}(\mathcal{T}) = \begin{bmatrix} \mathbf{T}^{(1)} & \mathbf{T}^{(n_3)} & \cdots & \mathbf{T}^{(2)} \\ \mathbf{T}^{(2)} & \mathbf{T}^{(1)} & \cdots & \mathbf{T}^{(3)} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{T}^{(n_3)} & \mathbf{T}^{(n_3-1)} & \cdots & \mathbf{T}^{(1)} \end{bmatrix}.$$

For block unfolding of $\mathcal{T}$ and its inverse operation, we use the operators

$$\mathrm{unfold}(\mathcal{T}) = \begin{bmatrix} \mathbf{T}^{(1)} \\ \mathbf{T}^{(2)} \\ \vdots \\ \mathbf{T}^{(n_3)} \end{bmatrix}, \qquad \mathrm{fold}(\mathrm{unfold}(\mathcal{T})) = \mathcal{T}.$$

Then the t-product between two 3-order tensors is defined as follows.

Definition 1. (t-product) [1] Let $\mathcal{A} \in \mathbb{R}^{n_1 \times s \times n_3}$ and $\mathcal{B} \in \mathbb{R}^{s \times n_2 \times n_3}$. Then the t-product is defined as
$$\mathcal{C} = \mathcal{A} * \mathcal{B} = \mathrm{fold}(\mathrm{bcirc}(\mathcal{A}) \cdot \mathrm{unfold}(\mathcal{B})), \qquad (5)$$
where $\mathcal{C} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. Note that if $n_3 = 1$, the operator $*$ reduces to matrix multiplication.

For a tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, Kilmer et al. [1] point out that the block circulant matrix $\mathrm{bcirc}(\mathcal{T})$ can be block-diagonalized by a specific matrix. Let $\mathbf{F}_{n_3} \in \mathbb{C}^{n_3 \times n_3}$ be the DFT matrix, $\mathbf{F}^{H}_{n_3}$ its conjugate transpose, and $\mathbf{I}_{n_1}$, $\mathbf{I}_{n_2}$ the $n_1$- and $n_2$-order identity matrices. Then, using the Kronecker product, we have [1]:

$$(\mathbf{F}_{n_3} \otimes \mathbf{I}_{n_1}) \cdot \mathrm{bcirc}(\mathcal{T}) \cdot (\mathbf{F}^{H}_{n_3} \otimes \mathbf{I}_{n_2}) = \begin{bmatrix} \bar{\mathbf{T}}^{(1)} & & \\ & \ddots & \\ & & \bar{\mathbf{T}}^{(n_3)} \end{bmatrix}.$$

Thus the t-product can be computed as follows.

Property 1. [1] Let $\mathcal{A} \in \mathbb{R}^{n_1 \times s \times n_3}$, $\mathcal{B} \in \mathbb{R}^{s \times n_2 \times n_3}$. Then the t-product is equivalent to the matrix products of the Fourier-domain frontal slices of $\mathcal{A}$ and $\mathcal{B}$:
$$\mathcal{T} = \mathcal{A} * \mathcal{B} \iff \bar{\mathbf{T}}^{(k)} = \bar{\mathbf{A}}^{(k)} \bar{\mathbf{B}}^{(k)}, \quad k = 1, \ldots, n_3. \qquad (6)$$

Remark 1. In this paper, we use $\odot$ to represent frontal-slice-wise matrix multiplication between tensors $\mathcal{A}$ and $\mathcal{B}$. Then
$$\mathcal{T} = \mathcal{A} * \mathcal{B} \iff \bar{\mathbf{T}}^{(k)} = \bar{\mathbf{A}}^{(k)} \bar{\mathbf{B}}^{(k)} \iff \bar{\mathcal{T}} = \bar{\mathcal{A}} \odot \bar{\mathcal{B}}. \qquad (7)$$

The relations between inner products (and the Frobenius norm) in the time and frequency domains are as follows.

Property 2. [3] Let $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. Then
$$(1)\ \langle \mathcal{A}, \mathcal{B} \rangle = \frac{1}{n_3}\langle \bar{\mathcal{A}}, \bar{\mathcal{B}} \rangle, \qquad (2)\ \|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle} = \frac{1}{\sqrt{n_3}}\|\bar{\mathcal{A}}\|_F. \qquad (8)$$

For the definitions of the tensor transpose $\mathcal{T}^*$ (Definition 8), the identity tensor $\mathcal{I}$ (Definition 9), and orthogonal tensors (Definition 10), please refer to the Appendix. With these notations, the tensor Singular Value Decomposition (t-SVD) and the Tensor Nuclear Norm (TNN) are defined as follows.

Definition 2. (t-SVD) [1] Let $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. Then there exist $\mathcal{U} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$, $\mathcal{S} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ and $\mathcal{V} \in \mathbb{R}^{n_2 \times n_2 \times n_3}$ such that
$$\mathcal{T} = \mathcal{U} * \mathcal{S} * \mathcal{V}^*, \qquad (9)$$
where $\mathcal{U}^* * \mathcal{U} = \mathcal{I}$, $\mathcal{V}^* * \mathcal{V} = \mathcal{I}$, and $\mathcal{S}$ is a frontal-slice-diagonal tensor.
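As a sanity check on Definition 1 and Property 1, here is a minimal NumPy sketch (all helper names are ours, not the authors' code) that computes the t-product both from the block-circulant definition in Eq. (5) and slice-wise in the Fourier domain as in Eq. (6), and verifies that the two agree.

```python
import numpy as np

def bcirc(A):
    """Block circulant matrix of a 3-way tensor (building block of Eq. (5))."""
    n1, s, n3 = A.shape
    return np.block([[A[:, :, (i - j) % n3] for j in range(n3)] for i in range(n3)])

def t_product_bcirc(A, B):
    """t-product via fold(bcirc(A) @ unfold(B)), Eq. (5)."""
    n1, s, n3 = A.shape
    _, n2, _ = B.shape
    unfoldB = B.transpose(2, 0, 1).reshape(n3 * s, n2)   # stack frontal slices vertically
    C = bcirc(A) @ unfoldB
    return C.reshape(n3, n1, n2).transpose(1, 2, 0)      # fold back into a tensor

def t_product_fft(A, B):
    """t-product via slice-wise matrix products in the Fourier domain, Eq. (6)."""
    Ab, Bb = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
    Cb = np.einsum('isk,sjk->ijk', Ab, Bb)               # one matrix product per slice k
    return np.fft.ifft(Cb, axis=2).real

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))
B = rng.standard_normal((3, 6, 5))
assert np.allclose(t_product_bcirc(A, B), t_product_fft(A, B))
```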

Definition 3. (TNN) [3] The tensor nuclear norm of a tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, denoted by $\|\mathcal{T}\|_*$, is defined as the average of the nuclear norms of all the frontal slices of $\bar{\mathcal{T}}$:
$$\|\mathcal{T}\|_* := \frac{1}{n_3}\sum_{i=1}^{n_3} \|\bar{\mathbf{T}}^{(i)}\|_*. \qquad (10)$$

Furthermore, the tensor spectral norm, tensor multi-rank and tubal-rank are defined via the t-SVD as follows.

Definition 4. (Tensor spectral norm) [3] Let $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. The tensor spectral norm of $\mathcal{T}$ is defined as $\|\mathcal{T}\| := \max_i \{\|\bar{\mathbf{T}}^{(i)}\|\}$.

By Von Neumann's inequality, it is easy to prove that the dual norm of the tensor spectral norm is the tensor nuclear norm, and vice versa.

Definition 5. (Tensor multi-rank and tubal-rank) [4] Let $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. The tensor multi-rank of $\mathcal{T}$ is a vector $\mathbf{r} \in \mathbb{R}^{n_3}$ whose i-th entry is the rank of the i-th frontal slice of $\bar{\mathcal{T}}$, i.e., $r_i = \mathrm{rank}(\bar{\mathbf{T}}^{(i)})$. The tensor tubal-rank of $\mathcal{T}$, denoted by $\mathrm{rank}_t(\mathcal{T})$, is the number of nonzero singular tubes of $\mathcal{S}$, where $\mathcal{S}$ is from the t-SVD $\mathcal{T} = \mathcal{U} * \mathcal{S} * \mathcal{V}^*$.

III. MAIN RESULT

In analogy with the relation between the matrix nuclear norm and the matrix Schatten-p norm, we propose a new definition of the tensor Schatten-p norm (t-Schatten-p norm) based on TNN and t-SVD.

Definition 6. (Tensor Schatten-p Norm) Let $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ with t-SVD $\mathcal{T} = \mathcal{U} * \mathcal{S} * \mathcal{V}^*$. Then the tensor Schatten-p norm is defined as
$$\|\mathcal{T}\|_{S_p} := \left(\frac{1}{n_3}\sum_{i=1}^{n_3}\|\bar{\mathbf{T}}^{(i)}\|_{S_p}^p\right)^{\frac{1}{p}}, \quad \text{i.e.,} \quad \|\mathcal{T}\|_{S_p} := \left(\frac{1}{n_3}\sum_{k=1}^{n_3}\sum_{i=1}^{\min\{n_1,n_2\}} \bar{\mathcal{S}}_{iik}^{\,p}\right)^{\frac{1}{p}}. \qquad (11)$$
When p = 1 it reduces to the tensor nuclear norm, analogously to the matrix case.

Obviously, the t-Schatten-p norm satisfies: (1) $\|\mathcal{T}\|_{S_p} \ge 0$, with equality if and only if $\mathcal{T}$ is zero; (2) $\|\alpha\mathcal{T}\|_{S_p} = |\alpha|\,\|\mathcal{T}\|_{S_p}$.

A. Algebraic Properties

Our proposed t-Schatten-p norm has properties similar to those of the matrix Schatten-p norm. The following are the properties of $\|\mathcal{T}\|_{S_p}$ and $\|\mathcal{T}\|_{S_p}^p$ that we use in this paper; for the proofs, please refer to the Supplementary Material.

Proposition 1. (Unitary Invariance) Let $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, and let $\mathcal{U} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ and $\mathcal{V} \in \mathbb{R}^{n_2 \times n_2 \times n_3}$ be orthogonal tensors. The tensor Schatten-p norm defined in Eq. (11) is unitarily invariant, i.e.,
$$\|\mathcal{T}\|_{S_p} = \|\mathcal{U} * \mathcal{T}\|_{S_p} = \|\mathcal{T} * \mathcal{V}^*\|_{S_p} = \|\mathcal{U} * \mathcal{T} * \mathcal{V}^*\|_{S_p}. \qquad (12)$$

Proposition 2. Given a 3-order tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, when $p \ge 1$, $\|\mathcal{T}\|_{S_p}^p$ is convex w.r.t. $\mathcal{T}$; that is, for any $\lambda \in (0,1)$,
$$\|\lambda\mathcal{A} + (1-\lambda)\mathcal{B}\|_{S_p}^p \le \lambda\|\mathcal{A}\|_{S_p}^p + (1-\lambda)\|\mathcal{B}\|_{S_p}^p, \quad p \ge 1. \qquad (13)$$
When $0 < p < 1$, $\|\mathcal{T}\|_{S_p}^p$ is non-convex w.r.t. $\mathcal{T}$.

With this proposition, when $p \ge 1$, $\|\mathcal{T}\|_{S_p}^p$ can be used in many convex optimization problems. For a convex function, differentiability also matters; the next proposition addresses it for $p \ge 1$.

Proposition 3. Given a 3-order tensor $\mathcal{T}_0 \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ with skinny t-SVD $\mathcal{T}_0 = \mathcal{U} * \mathcal{S} * \mathcal{V}^*$. When $p > 1$, the gradient of $\|\mathcal{T}\|_{S_p}^p$ at $\mathcal{T}_0$ has the form
$$\nabla_{\mathcal{T}_0}\|\mathcal{T}\|_{S_p}^p = p\,\mathcal{U} * \mathcal{D} * \mathcal{V}^*, \quad \text{with } \mathcal{D} = \mathcal{S}^{p-1}, \ p > 1. \qquad (14)$$
Moreover, when $p = 1$, the subdifferential of $\|\mathcal{T}\|_{S_p}^p$ at $\mathcal{T}_0$ is
$$\partial_{\mathcal{T}_0}\|\mathcal{T}\|_{S_p}^p = \{\mathcal{U} * \mathcal{V}^* + \mathcal{W} \mid \mathcal{U}^* * \mathcal{W} = \mathcal{O}, \ \mathcal{W} * \mathcal{V} = \mathcal{O}\}. \qquad (15)$$

For models that need to minimize $\|\mathcal{T}\|_{S_p}^p$, Proposition 3 indicates that gradient-based methods are applicable when $p \ge 1$.

B. Unified Surrogate Theorem

For the matrix case, there exist many surrogates for specific matrix Schatten-p norms, such as $p = 1, 2/3, 1/2$ [26]–[28]. Xu et al. [25] give a general result on unified surrogates for the matrix Schatten-p norm.

Lemma 1. (Multi-Schatten-p Norm Surrogate) [25] Given $I$ ($I \ge 2$) matrices $\mathbf{X}_i$ ($i = 1, \ldots, I$), where $\mathbf{X}_1 \in \mathbb{R}^{m \times d_1}$, $\mathbf{X}_i \in \mathbb{R}^{d_{i-1} \times d_i}$ ($i = 2, \ldots, I-1$), $\mathbf{X}_I \in \mathbb{R}^{d_{I-1} \times n}$, and $\mathbf{X} \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(\mathbf{X}) = r \le \min\{d_i, i = 1, \ldots, I-1\}$, for any $p, p_1, \ldots, p_I > 0$ satisfying $1/p = \sum_{i=1}^{I} 1/p_i$, we have
$$\frac{1}{p}\|\mathbf{X}\|_{S_p}^p = \min_{\mathbf{X}_i:\ \mathbf{X} = \prod_i \mathbf{X}_i} \ \sum_{i=1}^{I} \frac{1}{p_i}\|\mathbf{X}_i\|_{S_{p_i}}^{p_i}. \qquad (16)$$

This lemma indicates that any given Schatten-p norm can be obtained by solving a minimization problem over factorizations. The following Theorem 1 shows that the same rule holds for our proposed t-Schatten-p norm.

Theorem 1. (Multi-Tensor-Schatten-p Norm Surrogate) Given $I$ ($I \ge 2$) tensors $\mathcal{T}_i$ ($i = 1, \ldots, I$), where $\mathcal{T}_1 \in \mathbb{R}^{m \times d_1 \times k}$, $\mathcal{T}_i \in \mathbb{R}^{d_{i-1} \times d_i \times k}$ ($i = 2, \ldots, I-1$), $\mathcal{T}_I \in \mathbb{R}^{d_{I-1} \times n \times k}$, and $\mathcal{T} \in \mathbb{R}^{m \times n \times k}$ with $\mathrm{rank}_t(\mathcal{T}) = r \le \min\{d_i, i = 1, \ldots, I-1\}$, for any $p, p_1, \ldots, p_I > 0$ satisfying $1/p = \sum_{i=1}^{I} 1/p_i$, we have
$$\frac{1}{p}\|\mathcal{T}\|_{S_p}^p = \min_{\mathcal{T}_i:\ \mathcal{T} = \mathcal{T}_1 * \cdots * \mathcal{T}_I} \ \sum_{i=1}^{I} \frac{1}{p_i}\|\mathcal{T}_i\|_{S_{p_i}}^{p_i}. \qquad (17)$$

In the previous section, we mentioned that $\|\mathcal{T}\|_{S_p}^p$ is convex when $p \ge 1$. When $p < 1$, we can use Theorem 1 to convert the non-convex function $\|\mathcal{T}\|_{S_p}^p$ into a weighted sum of convex functions $\|\mathcal{T}_i\|_{S_{p_i}}^{p_i}$ with $p_i \ge 1$ and $\mathcal{T} = \mathcal{T}_1 * \cdots * \mathcal{T}_I$. Note that $p_i$ can also be less than 1, i.e., $0 < p_i < 1$.

C. Tightness

Lu et al. [3] point out that if the tensor average rank is defined as $\mathrm{rank}_a(\mathcal{T}) = \frac{1}{n_3}\sum_{i=1}^{n_3}\mathrm{rank}(\bar{\mathbf{T}}^{(i)})$, then TNN is the convex envelope of the tensor average rank within the unit ball of the tensor spectral norm. Here we claim that when $0 < p < 1$, our $\|\mathcal{T}\|_{S_p}^p$ is a tighter non-convex envelope of the tensor average rank within the same unit ball.

Proposition 4. For a given 3-order tensor $\mathcal{T}$ and $0 < p \le 1$, $\|\mathcal{T}\|_{S_p}^p$ is an envelope of the tensor average rank within the unit ball of the tensor spectral norm. If $p = 1$, it becomes TNN, the tightest convex envelope. Otherwise, it is a non-convex envelope of the tensor multi-rank that is tighter than TNN, in the sense of $\|\mathcal{T}\|_* \le \|\mathcal{T}\|_{S_p}^p \le \frac{1}{n_3}\|\mathrm{rank}_m(\mathcal{T})\|_1$.

Zhang et al. [4] give another definition of TNN and prove that it is the tightest convex envelope of the $\ell_1$ norm of the tensor multi-rank within a unit ball; it is easy to see that the two conclusions are equivalent.

Considering the LRTR problem in Eq. (1), we usually need a relaxed function to replace the non-smooth rank function. Using Proposition 4, we transform the recovery problem into
$$\min_{\mathcal{X}} \ \|\mathcal{X}\|_{S_p}^p, \quad \text{s.t.} \ \Psi(\mathcal{X}) = \mathcal{T}, \qquad (18)$$
where $\mathcal{T}$ is the measurement observed through a linear operator $\Psi(\cdot)$. We aim to recover the clean tensor $\mathcal{X}$ from the observed corrupted/missing tensor $\mathcal{T}$.

D. Advantages and Disadvantages

The main advantage is that our $\|\mathcal{T}\|_{S_p}^p$ is a tighter non-convex envelope of the tensor average rank within the unit ball of the tensor spectral norm, as shown in Proposition 4. Liu et al. [29] point out that, in the matrix case, the recovery condition when using matrix factorization or the Schatten-p norm (p = 2/3) is weaker than that of convex optimization based on the nuclear norm. That is, a better approximation may yield a better solution; our experiments in Section VI also verify this.

However, the disadvantage is also obvious. Compared with TNN in [4], $\|\mathcal{T}\|_{S_p}^p$ is non-convex and non-smooth when $0 < p < 1$, so it is hard to obtain performance guarantees as strong as those of convex programs, e.g., [22]. Even in the matrix case, it is still unknown under what conditions a specific optimization procedure for the Schatten-p norm produces an optimal solution that exactly recovers the target matrix; the same holds for the tensor case.

IV. APPLICATION AND OPTIMIZATION

In this section, we propose a general LRTR model based on the t-Schatten-p norm and give a feasible algorithm to solve it.

A. Model

In practical applications, the observed tensor $\mathcal{T}$ is inevitably contaminated by noise. Therefore we add a noise tensor and a noise regularization term to the model in Eq. (18):
$$\min_{\mathcal{X},\mathcal{E}} \ \|\mathcal{X}\|_{S_p}^p + \lambda g(\mathcal{E}), \quad \text{s.t.} \ \Psi(\mathcal{X}) + \mathcal{E} = \mathcal{T}, \qquad (19)$$
where $\mathcal{T}$ is the observed tensor, $\mathcal{E}$ is the noise tensor, and $g(\cdot)$ is the noise regularizer. In specific problems, if the noise is assumed to follow a Gaussian or Laplacian distribution, $g(\mathcal{E})$ can be chosen as $\|\mathcal{E}\|_F^2$ or $\|\mathcal{E}\|_1$, respectively.

When $p \ge 1$, the problem in Eq. (19) can be solved directly by many convex optimization methods. If $0 < p < 1$, the t-Schatten-p norm is non-convex, and we can use Theorem 1 to convert the non-convex function into a sum of several convex functions. The following proposition guarantees that this is always possible.

Proposition 5. [25] For any $0 < p < 1$, there always exist $I \in \mathbb{N}$ and $p_i$ such that $1/p = \sum_{i=1}^{I} 1/p_i$, where all $p_i$ satisfy one of the cases: (a) $p_i \ge 1$ or (b) $p_i > 1$.

Given $I$ ($I \ge 2$) and $i = 1, \ldots, I$, for any $p, p_i > 0$ satisfying $1/p = \sum_{i=1}^{I} 1/p_i$, we assume $\mathcal{X} = \mathcal{X}_1 * \mathcal{X}_2 * \cdots * \mathcal{X}_I$; then (19) can be converted to
$$\min_{\{\mathcal{X}_i\},\mathcal{E}} \ \sum_{i=1}^{I}\frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} + \lambda g(\mathcal{E}), \quad \text{s.t.} \ \Psi(\mathcal{X}_1 * \mathcal{X}_2 * \cdots * \mathcal{X}_I) + \mathcal{E} = \mathcal{T}. \qquad (20)$$
If $p_i \ge 1$ holds for all i, the problem in Eq. (20) becomes multi-convex. Thus, if we apply the block coordinate descent method to Eq. (20), each $\mathcal{X}_i$ can be efficiently updated by convex optimization.

B. Optimization

Different from the matrix case in [26], we need to introduce an intermediate tensor $\mathcal{G}$ to separate $\{\mathcal{X}_i\}$ from $\Psi(\cdot)$; otherwise, computing the sub-gradient of $\|\Psi(\mathcal{X}_1 * \mathcal{X}_2)\|_F$ may be difficult for certain $\Psi$, such as $P_\Omega$ in the TC problem. By adding an equality constraint, Eq. (20) can be rewritten as
$$\min_{\{\mathcal{X}_i\},\mathcal{E}} \ \sum_{i=1}^{I}\frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} + \lambda g(\mathcal{E}), \quad \text{s.t.} \ \Psi(\mathcal{G}) + \mathcal{E} = \mathcal{T}, \quad \mathcal{X}_1 * \mathcal{X}_2 * \cdots * \mathcal{X}_I = \mathcal{G}. \qquad (21)$$

We solve Eq. (21) by a new method based on LADMPSAP [30] and BCD [31]. Introducing the Lagrange multipliers $\mathcal{Y}$ and $\mathcal{Z}$, the augmented Lagrangian function of (21) is
$$\begin{aligned} L(\mathcal{X}_i,\mathcal{G},\mathcal{E},\mathcal{Y},\mathcal{Z}) = &\ \sum_{i=1}^{I}\frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} + \lambda g(\mathcal{E}) + \langle\mathcal{Y}, \Psi(\mathcal{G})+\mathcal{E}-\mathcal{T}\rangle + \frac{\rho_1}{2}\|\Psi(\mathcal{G})+\mathcal{E}-\mathcal{T}\|_F^2 \\ &\ + \langle\mathcal{Z}, \mathcal{X}_1 * \cdots * \mathcal{X}_I - \mathcal{G}\rangle + \frac{\rho_2}{2}\|\mathcal{X}_1 * \cdots * \mathcal{X}_I - \mathcal{G}\|_F^2. \end{aligned} \qquad (22)$$

All $\{\mathcal{X}_i\}$ in (22) need to be updated sequentially. Note that different updating orders of the $\mathcal{X}_i$ may lead to different convergence rates; we update $\mathcal{X}_1$ and $\mathcal{X}_I$ first, and then update the others in proper order.

1) Update $\mathcal{X}_i$:
Assume that $\mathcal{X}_1^k, \mathcal{X}_2^k, \ldots, \mathcal{X}_{i-1}^k$ have already been updated. Let $\mathcal{Q}_{1:(i-1)} = \mathcal{X}_1^k * \cdots * \mathcal{X}_{i-1}^k$ and $\mathcal{Q}_{(i+1):I} = \mathcal{X}_{i+1}^{k-1} * \cdots * \mathcal{X}_I^{k-1}$. Then the sub-problem for updating $\mathcal{X}_i$ is
$$\mathcal{X}_i^k = \arg\min_{\mathcal{X}_i} \ \frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} + \frac{\rho_2}{2}\left\|\mathcal{Q}_{1:(i-1)} * \mathcal{X}_i * \mathcal{Q}_{(i+1):I} - \mathcal{G}^{k-1} + \mathcal{Z}^{k-1}/\rho_2\right\|_F^2. \qquad (23)$$
Solving (23) directly is difficult. Letting $f_i^k(\mathcal{X}_i) = \frac{1}{2}\|\mathcal{Q}_{1:(i-1)} * \mathcal{X}_i * \mathcal{Q}_{(i+1):I} - \mathcal{G}^{k-1} + \mathcal{Z}^{k-1}/\rho_2\|_F^2$ and linearizing $f_i^k(\mathcal{X}_i)$ at $\mathcal{X}_i^{k-1}$, the sub-problem becomes
$$\begin{aligned} \mathcal{X}_i^k &= \arg\min_{\mathcal{X}_i} \ \frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} + \rho_2\left(\left\langle\nabla f_i^k(\mathcal{X}_i^{k-1}), \mathcal{X}_i - \mathcal{X}_i^{k-1}\right\rangle + \frac{L_i^{k-1}}{2}\|\mathcal{X}_i - \mathcal{X}_i^{k-1}\|_F^2\right) \\ &= \arg\min_{\mathcal{X}_i} \ \frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} + \frac{\rho_2 L_i^{k-1}}{2}\left\|\mathcal{X}_i - \mathcal{X}_i^{k-1} + \frac{\nabla f_i^k(\mathcal{X}_i^{k-1})}{L_i^{k-1}}\right\|_F^2, \end{aligned} \qquad (24)$$
where $\nabla f_i^k(\mathcal{X}_i^{k-1})$ is the gradient of $f_i^k(\mathcal{X}_i)$ at $\mathcal{X}_i^{k-1}$:
$$\nabla f_i^k(\mathcal{X}_i^{k-1}) = \mathcal{Q}_{1:(i-1)}^* * \left(\mathcal{Q}_{1:(i-1)} * \mathcal{X}_i^{k-1} * \mathcal{Q}_{(i+1):I} - \mathcal{G}^{k-1} + \mathcal{Z}^{k-1}/\rho_2\right) * \mathcal{Q}_{(i+1):I}^*. \qquad (25)$$
$L_i^{k-1}$ is the Lipschitz constant of $\nabla f_i^k(\mathcal{X}_i^{k-1})$, and it can be chosen as the tensor spectral norm of $\nabla f_i^k(\mathcal{X}_i^{k-1})$. The right hand side of Eq. (24) is a proximal mapping of $\mathcal{X}_i$ and can be solved by off-the-shelf algorithms:
$$\mathcal{X}_i^k = \mathrm{Prox}_{\rho_2 L_i^{k-1},\,\|\cdot\|_{S_{p_i}}^{p_i}}\!\left(\mathcal{X}_i^{k-1} - \frac{\nabla f_i^k(\mathcal{X}_i^{k-1})}{L_i^{k-1}}\right), \qquad (26)$$
where $\mathrm{Prox}_{\lambda, f}(z) := \arg\min_x f(x) + \frac{\lambda}{2}\|x - z\|_F^2$.

Some special cases of $p_i$ admit closed-form solutions. When $p_i = 1$, the problem in Eq. (26) can be solved by the tensor Singular Value Thresholding (tSVT) operator [4]. When $p_i = 2$, it can be solved by computing the unique critical point of a quadratic function, or by Theorem 3.1 in [13] with $\gamma = 0$. When $p_i < 1$, the problem in Eq. (26) becomes non-smooth and non-convex, and we can use the Generalized Iterated Shrinkage Algorithm (GISA) [32] to obtain a high-precision solution.

2) Update $\mathcal{G}$:
Fixing the other variables, the sub-problem for $\mathcal{G}$ is
$$\mathcal{G}^k = \arg\min_{\mathcal{G}} \ \frac{\rho_1}{2}\|\Psi(\mathcal{G}) + \mathcal{E}^{k-1} - \mathcal{T} + \mathcal{Y}^{k-1}/\rho_1\|_F^2 + \frac{\rho_2}{2}\|\mathcal{X}_1^k * \mathcal{X}_2^k * \cdots * \mathcal{X}_I^k - \mathcal{G} + \mathcal{Z}^{k-1}/\rho_2\|_F^2. \qquad (27)$$
Different linear operators $\Psi$ give different solutions of Eq. (27). For example, in TRPCA problems $\Psi$ is the identity operator and $\mathcal{G}^k$ is given by
$$\mathcal{G}^k = \frac{\rho_1\left(\mathcal{T} - \mathcal{E}^{k-1} - \mathcal{Y}^{k-1}/\rho_1\right) + \rho_2\left(\mathcal{Q}_{1:I}^k + \mathcal{Z}^{k-1}/\rho_2\right)}{\rho_1 + \rho_2}, \qquad (28)$$
where $\mathcal{Q}_{1:I}^k = \mathcal{X}_1^k * \mathcal{X}_2^k * \cdots * \mathcal{X}_I^k$. In TC problems $\Psi$ is the projection operator $P_\Omega$. Splitting the identity operator into $P_\Omega$ and $P_{\bar\Omega}$, $\mathcal{G}^k$ is given by
$$\mathcal{G}^k = P_{\bar\Omega}\!\left(\mathcal{Q}_{1:I}^k + \mathcal{Z}^{k-1}/\rho_2\right) + P_\Omega\!\left(\frac{\rho_1\left(\mathcal{T} - \mathcal{E}^{k-1} - \mathcal{Y}^{k-1}/\rho_1\right) + \rho_2\left(\mathcal{Q}_{1:I}^k + \mathcal{Z}^{k-1}/\rho_2\right)}{\rho_1 + \rho_2}\right). \qquad (29)$$

3) Update $\mathcal{E}$:
Fixing the other variables, the sub-problem for $\mathcal{E}$ is
$$\mathcal{E}^k = \arg\min_{\mathcal{E}} \ \lambda g(\mathcal{E}) + \frac{\rho_1}{2}\|\Psi(\mathcal{G}^k) + \mathcal{E} - \mathcal{T} + \mathcal{Y}^{k-1}/\rho_1\|_F^2, \quad \text{i.e.,} \quad \mathcal{E}^k = \mathrm{Prox}_{\rho_1/\lambda,\, g(\cdot)}\!\left(\mathcal{T} - \Psi(\mathcal{G}^k) - \mathcal{Y}^{k-1}/\rho_1\right). \qquad (30)$$
The problem in Eq. (30) usually has a closed-form solution for a specific $g(\cdot)$. If $g(\cdot)$ is chosen as $\|\mathcal{E}\|_F^2$, then
$$\mathcal{E}^k = \frac{\mathcal{T} - \Psi(\mathcal{G}^k) - \mathcal{Y}^{k-1}/\rho_1}{1 + \rho_1/\lambda}. \qquad (31)$$
If $g(\cdot)$ is chosen as $\|\mathcal{E}\|_1$, let $\mathcal{R} = \mathcal{T} - \Psi(\mathcal{G}^k) - \mathcal{Y}^{k-1}/\rho_1$; then
$$\mathcal{E}^k = \mathrm{sign}(\mathcal{R}) \max\!\left(|\mathcal{R}| - \frac{\lambda}{\rho_1},\ 0\right). \qquad (32)$$

4) Update $\mathcal{Y}$ and $\mathcal{Z}$:
After updating $\{\mathcal{X}_i^k\}$, $\mathcal{G}^k$ and $\mathcal{E}^k$, the Lagrange multipliers are updated by
$$\mathcal{Y}^k := \mathcal{Y}^{k-1} + \rho_1\left(\Psi(\mathcal{G}^k) + \mathcal{E}^k - \mathcal{T}\right), \qquad \mathcal{Z}^k := \mathcal{Z}^{k-1} + \rho_2\left(\mathcal{X}_1^k * \mathcal{X}_2^k * \cdots * \mathcal{X}_I^k - \mathcal{G}^k\right). \qquad (33)$$
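As a minimal illustration of steps 2)–4) above for the TRPCA case ($\Psi$ the identity), the following NumPy sketch implements the closed-form updates of Eqs. (28) and (31)–(33); the $\mathcal{X}_i$ proximal step of Eq. (26) is omitted because it depends on the chosen $p_i$. All names are ours and this is a sketch under the stated assumptions, not the authors' implementation.

```python
import numpy as np

def update_E(T, G, Y, rho1, lam, noise="laplacian"):
    """Closed-form E-update of Eqs. (31)-(32) for the identity operator Psi."""
    R = T - G - Y / rho1
    if noise == "gaussian":                       # g(E) = ||E||_F^2, Eq. (31)
        return R / (1.0 + rho1 / lam)
    return np.sign(R) * np.maximum(np.abs(R) - lam / rho1, 0.0)   # ||E||_1, Eq. (32)

def update_G_trpca(T, E, Y, Z, Q, rho1, rho2):
    """G-update of Eq. (28) (TRPCA, Psi = identity); Q = X_1 * ... * X_I."""
    return (rho1 * (T - E - Y / rho1) + rho2 * (Q + Z / rho2)) / (rho1 + rho2)

def update_multipliers(T, G, E, Q, Y, Z, rho1, rho2):
    """Dual updates of Eq. (33); Psi(G) = G for TRPCA."""
    return Y + rho1 * (G + E - T), Z + rho2 * (Q - G)
```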

For the penalty parameters $\rho_1$ and $\rho_2$, Lin et al. [30] further suggest increasing them gradually. The procedure for solving the problem in Eq. (21) is summarized in Algorithm 1.

Algorithm 1 Solving problem (21)
Input: the observed tensor $\mathcal{T}$ and parameters $\lambda$, $\rho_1^0$, $\rho_2^0$, $\rho_{1,\max}$, $\rho_{2,\max}$, $\eta$, $\varepsilon$.
Initialize: $\mathcal{X}_i^0$, $\mathcal{G}^0$, $\mathcal{E}^0$, $\mathcal{Y}^0$, $\mathcal{Z}^0$.
While not converged do
  1) Update $\mathcal{X}_i^k$ by solving (26) for $i = 1, \ldots, I$.
  2) Update $\mathcal{G}^k$ by solving (27).
  3) Update $\mathcal{E}^k$ by solving (30).
  4) Update $\mathcal{Y}^k$ and $\mathcal{Z}^k$ by (33).
  5) Update $\rho_1^k = \min\{\eta\rho_1^{k-1}, \rho_{1,\max}\}$ and $\rho_2^k = \min\{\eta\rho_2^{k-1}, \rho_{2,\max}\}$.
  6) Check the convergence conditions: $\|\mathcal{X}_i^k - \mathcal{X}_i^{k-1}\| \le \varepsilon$, $\|\mathcal{G}^k - \mathcal{G}^{k-1}\| \le \varepsilon$, and $\|\mathcal{E}^k - \mathcal{E}^{k-1}\| \le \varepsilon$.
  7) $k \leftarrow k + 1$.
end While
Output: the factor tensors $\{\mathcal{X}_i\}$, the noise tensor $\mathcal{E}$, and the intermediate tensor $\mathcal{G}$.

C. Convergence Analysis

In general, it is hard to prove convergence for ADMM-based methods with a Burer–Monteiro factorization constraint. Fortunately, owing to the separability of the proposed objective, we can still prove the convergence of our algorithm by assuming smoothness of the noise regularization $g(\cdot)$ in Eq. (21) and $p_i \ge 1$ for all i. Note that in the following Theorem 2, p can be chosen in the range $(0, +\infty)$ as long as $1/p = \sum_{i=1}^{I} 1/p_i$ holds.

Theorem 2. If the optimization problem in Eq. (21) satisfies the following conditions: (a) the function $g(\cdot)$ in Eq. (21) is smooth, convex, and coercive; (b) $p_i \ge 1$ for $i = 1, \ldots, I$; (c) $\rho_1$ and $\rho_2$ in Eq. (22) are sufficiently large, then the sequence $\{\mathcal{X}_i^k, \mathcal{G}^k, \mathcal{E}^k, \mathcal{Y}^k, \mathcal{Z}^k\}$ generated by Algorithm 1 satisfies:
(1) the augmented Lagrangian function (22) is monotonically decreasing, i.e., for some $c > 0$,
$$L(\mathcal{X}_i,\mathcal{G},\mathcal{E},\mathcal{Y},\mathcal{Z}) - L(\mathcal{X}_i^+,\mathcal{G}^+,\mathcal{E}^+,\mathcal{Y}^+,\mathcal{Z}^+) \ge c\left(\|\mathcal{X}_i - \mathcal{X}_i^+\|_F^2 + \|\mathcal{G} - \mathcal{G}^+\|_F^2 + \|\mathcal{E} - \mathcal{E}^+\|_F^2\right);$$
(2) $\|\mathcal{X}_i^+ - \mathcal{X}_i\| \to 0$, $\|\mathcal{G}^+ - \mathcal{G}\| \to 0$, $\|\mathcal{E}^+ - \mathcal{E}\| \to 0$;
(3) the sequence $\{\mathcal{X}_i^k, \mathcal{G}^k, \mathcal{E}^k, \mathcal{Y}^k, \mathcal{Z}^k\}$ is bounded;
(4) any accumulation point of the sequence $\{\mathcal{X}_i^k, \mathcal{G}^k, \mathcal{E}^k, \mathcal{Y}^k, \mathcal{Z}^k\}$ is a constrained stationary point.

D. Complexity Analysis

For Algorithm 1, different $p_i$'s result in different complexities. Here we analyze the widely used case $p_1 = p_2 = 1$. If $\mathcal{X}_1 \in \mathbb{R}^{n_1 \times d \times n_3}$, $\mathcal{X}_2 \in \mathbb{R}^{d \times n_2 \times n_3}$ and $\mathrm{rank}_t(\mathcal{X}) = r$ ($r \le d$), then the per-iteration complexity of Algorithm 1 is $O\!\left((n_1+n_2)n_3 d\log n_3 + (n_1+n_2)n_3 d^2 + n_1 n_2 n_3 d\right)$, where one iteration means updating all variables once in order. For TNN in [3], the per-iteration complexity is $O(n_1 n_2 n_3(\log n_3 + \min\{n_1,n_2\}))$. Obviously, when $d \ll \min\{n_1, n_2\}$, our method is much more efficient than TNN-based methods at each iteration. Owing to the convexity of TNN, the related problem usually needs fewer iterations to converge, but when the tensor rank is large or the noise is heavy, it may perform worse than our proposed methods; our experiments in Section VI-B also verify this conclusion.

V. RECOVERY GUARANTEES

In this section, we provide theoretical guarantees for LRTR problems based on our proposed t-Schatten-p norm, which aim to recover low-rank tensors from linear observations. For all the proofs of our theorems, please refer to the Supplementary Material.

A. Null Space Property (NSP)

The NSP is widely used in the theoretical analysis of recovering sparse vectors and low-rank matrices [33], [34]. Here we give a sufficient condition for exactly recovering the low-rank tensor $\hat{\mathcal{X}}$ in Eq. (18) via the following model:
$$\min_{\{\mathcal{X}_i\}} \ \sum_{i=1}^{I}\frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i}, \quad \text{s.t.} \ \Psi(\mathcal{X}_1 * \mathcal{X}_2 * \cdots * \mathcal{X}_I) = \mathcal{T}. \qquad (34)$$
Assume $\hat{\mathcal{X}} = \hat{\mathcal{U}} * \hat{\mathcal{S}} * \hat{\mathcal{V}}^*$ is the true tensor in Eq. (18) with $\mathrm{rank}_t(\hat{\mathcal{X}}) = r$, and $\hat{\mathcal{X}} = \hat{\mathcal{X}}_1 * \cdots * \hat{\mathcal{X}}_I$ with $\hat{\mathcal{X}}_1 = \hat{\mathcal{U}} * \hat{\mathcal{S}}^{p/p_1}$, $\hat{\mathcal{X}}_2 = \hat{\mathcal{S}}^{p/p_2}$, ..., $\hat{\mathcal{X}}_I = \hat{\mathcal{S}}^{p/p_I} * \hat{\mathcal{V}}^*$. Let $N(\Psi) := \{\mathcal{X} : \Psi(\mathcal{X}) = 0\}$ denote the null space of the linear operator $\Psi$. Then we have the following theorem.

Theorem 3. Assume $\hat{\mathcal{X}} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is the true tensor of Eq. (18) with tubal-rank $\mathrm{rank}_t(\hat{\mathcal{X}}) = r$, and $p \in (0, 1]$ with $\frac{1}{p} = \sum_{i=1}^{I}\frac{1}{p_i}$. In addition, for every $\mathcal{X}_i \in \{\mathcal{X}_i\}_{i=1}^{I}$ with $\mathcal{X}_i \in \mathbb{R}^{k_1 \times k_2 \times k_3}$, $\min\{k_1, k_2\} \ge r$ holds. Then $\hat{\mathcal{X}}$ is the unique optimal solution of Eq. (18) and can be uniquely recovered by Eq. (34) if, for any $\mathcal{Z} = (\hat{\mathcal{X}}_1 + \mathcal{W}_1) * \cdots * (\hat{\mathcal{X}}_I + \mathcal{W}_I) - (\hat{\mathcal{X}}_1 * \cdots * \hat{\mathcal{X}}_I) \in N(\Psi)\setminus\mathcal{O}$, where the $\{\mathcal{W}_i\}$ have compatible dimensions, we have
$$\sum_{i=1}^{r}\sum_{j=1}^{n_3}\sigma_i^p\!\left(\bar{\mathbf{Z}}^{(j)}\right) < \sum_{i=r+1}^{\min\{n_1,n_2\}}\sum_{j=1}^{n_3}\sigma_i^p\!\left(\bar{\mathbf{Z}}^{(j)}\right). \qquad (35)$$

Note that this condition is usually hard to satisfy. Therefore, for specific problems we need to give error bounds between the true tensors and the solutions obtained by our algorithm.

B. Error Bound Analysis for Robust Tensor Recovery

In this section, we first introduce the following assumption on a general linear operator $\mathcal{A}$; based on it, we then give a theoretical analysis of the error bound for robust tensor recovery.

Assumption 1. [35] Suppose that there is a positive constant $\kappa(\mathcal{A})$ such that for $\Delta \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ and $\Delta \in C$, the general linear operator $\mathcal{A}: \mathbb{R}^{n_1 \times n_2 \times n_3} \to \mathbb{R}^{l}$ satisfies
$$\|\mathcal{A}(\Delta)\|_2 \ge \kappa(\mathcal{A})\|\Delta\|_F, \quad \Delta \in C. \qquad (36)$$
When $C$ is the whole space $\mathbb{R}^{n_1 \times n_2 \times n_3}$, $\kappa(\mathcal{A})$ can be chosen as the smallest singular value of the operator $\mathcal{A}$.

Based on Assumption 1, we provide the error bound for robust tensor recovery via Eqs. (19) and (20), which have noisy measurements.

Theorem 4. Assume that $\hat{\mathcal{X}} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is a true tensor satisfying the corrupted measurements $\Psi(\hat{\mathcal{X}}) + \mathcal{E} = \mathcal{T}$, where $\mathcal{E}$ is the noise with $\|\mathcal{E}\|_F \le \epsilon$. Let $(\hat{\mathcal{X}}_1, \ldots, \hat{\mathcal{X}}_I)$ be a critical point of Eq. (34) with the squared loss $\frac{\lambda}{2}\|\cdot\|_F^2$ and $p_i \ge 1$. Here $\mathrm{rank}_t(\hat{\mathcal{X}}) = r$ ($r \le d$) and $d = \min\{\min\{p, q\} : \hat{\mathcal{X}}_i \in \mathbb{R}^{p\times q\times l},\ i = 1, \ldots, I\}$. If the linear operator $\Psi$ satisfies the condition of Assumption 1 with a positive constant $\kappa(\Psi)$ on $\mathbb{R}^{n_1 \times n_2 \times n_3}$, then
$$\frac{\|\hat{\mathcal{X}} - \hat{\mathcal{X}}_1 * \cdots * \hat{\mathcal{X}}_I\|_F}{\sqrt{n_1 n_2 n_3}} \le \frac{\epsilon}{\kappa(\Psi)\sqrt{n_1 n_2 n_3}} + \frac{\sqrt{t}}{\lambda C_1 \kappa(\Psi)\sqrt{n_1 n_2 n_3}}, \qquad (37)$$
where $t \ge d$ and $C_1$ is a constant related to $\{\hat{\mathcal{X}}_i\}$; a lower bound for $C_1$ is given in the Supplementary Material.

Theorem 4 states that if $\Psi$ satisfies the condition of Assumption 1, then the error between any critical point of Eq. (34) with the squared loss $\frac{\lambda}{2}\|\cdot\|_F^2$ and the true tensor in Eq. (19) is bounded. The right hand side gives a rough guarantee for our proposed model: when the noise is small, the critical point is close to the exact solution.

C. Guarantee for Tensor Completion

The TC problem plays an important role in practical applications. However, the projection operator $P_\Omega$ in Eq. (38) usually does not satisfy the RIP condition or Assumption 1 [13], so the TC problem must be treated as a special case. Setting $\Psi$ to the projection operator $P_\Omega$ in Eq. (34), we obtain
$$\min_{\{\mathcal{X}_i\}} \ \sum_{i=1}^{I}\frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i}, \quad \text{s.t.} \ P_\Omega(\mathcal{X}_1 * \mathcal{X}_2 * \cdots * \mathcal{X}_I) = P_\Omega(\mathcal{T}). \qquad (38)$$
Note that the error bound introduced in Theorem 4 usually does not hold here. Using Theorem 8 in [35], we can deduce that the error bounds for the TC problem are related to $|\Omega|$ and the rank of each frontal slice, but a low tubal-rank cannot guarantee low slice ranks. What is more, our proposed model is non-convex when $0 < p < 1$, which makes it difficult to give a performance guarantee as reliable as those of convex programs, e.g., [22]. Therefore, we give the following theorem to show that, under a very mild condition, the exact solutions of Eq. (38) are its critical points.

Definition 7. (Tensor Incoherence Condition) [22] Let the skinny t-SVD of a tensor $\mathcal{Z}$ be $\mathcal{U} * \mathcal{S} * \mathcal{V}^*$. $\mathcal{Z}$ is said to satisfy the standard tensor incoherence condition if there exists $\mu$ such that
$$\max_{i=1,\ldots,n_1}\|\mathcal{U}^* * \mathring{e}_i\|_F \le \sqrt{\frac{\mu r}{n_1 n_3}}, \qquad \max_{j=1,\ldots,n_2}\|\mathcal{V}^* * \mathring{e}_j\|_F \le \sqrt{\frac{\mu r}{n_2 n_3}}, \qquad (39)$$
where $\mathring{e}_i$ is the $n_1 \times 1 \times n_3$ column basis tensor with $\mathring{e}_{i11} = 1$, $\mathring{e}_j$ is the $n_2 \times 1 \times n_3$ column basis tensor with $\mathring{e}_{j11} = 1$, and $r$ is the tubal-rank of $\mathcal{Z}$, i.e., $\mathrm{rank}_t(\mathcal{Z}) = r$.

Theorem 5. Consider the problem in Eq. (38) with $p_i \ge 1$ ($i = 1, \ldots, I$) and $1/p = \sum_{i=1}^{I} 1/p_i$. Let $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ with $n_1 \ge n_2$, $\Omega \sim \mathrm{Ber}(\rho)$, the skinny t-SVD of $\mathcal{T}$ be $\mathcal{U} * \mathcal{S} * \mathcal{V}^*$, and $\mathrm{rank}_t(\mathcal{T}) = r$. If $\mathcal{T}$ satisfies the tensor incoherence condition with parameter $\mu$ and $\rho \ge O\!\left(\mu r\log(n_1 n_3)/(n_1 n_3)\right)$, then with high probability the exact solution of Eq. (38), denoted by $(\hat{\mathcal{X}}_1, \ldots, \hat{\mathcal{X}}_I)$ with
$$\hat{\mathcal{X}}_1 = \mathcal{U} * \mathcal{S}^{p/p_1} * \mathcal{Q}_1^*, \qquad \hat{\mathcal{X}}_i = \mathcal{Q}_{i-1} * \mathcal{S}^{p/p_i} * \mathcal{Q}_i^*, \ i = 2, \ldots, I-1, \qquad \hat{\mathcal{X}}_I = \mathcal{Q}_{I-1} * \mathcal{S}^{p/p_I} * \mathcal{V}^*, \qquad (40)$$
where $\mathcal{Q}_i \in \mathbb{R}^{q_i \times r \times n_3}$, $\mathcal{Q}_i^* * \mathcal{Q}_i = \mathcal{I}$ for all i and $q_i \ge r$, is a critical point of the problem in Eq. (38).

Theorem 5 gives a new perspective on our non-convex tensor completion problem (38): when a certain optimization procedure converges to a stationary point, it may be close to the exact solutions.

VI. EXPERIMENTS

In this section, we conduct numerical experiments to evaluate the proposed model. We apply the t-Schatten-p norm (tSp) to the TRPCA and TC problems. The results on both synthetic and real-world datasets demonstrate the superiority of our method. The numbers reported in all experiments are averaged over 20 random trials.

In [4], Zhang et al. set $\lambda = \sqrt{\frac{n_3}{\max\{n_1,n_2\}}}$. In [3], Lu et al. set $\lambda = \frac{1}{\sqrt{\max\{n_1,n_2\}\times n_3}}$ and establish some good properties. Because TNN is the main method we compare with, we extend its choice of $\lambda$ to our multi-factor setting so that it can be regarded as a special case of our method. In the following experiments, we usually set $\lambda = \sqrt{\frac{I}{\max\{n_1,n_2\}\times n_3}}$ in Eqs. (41) and (42), where $I$ denotes the number of factors and the data tensor is $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$. Sometimes we readjust $\lambda$ around this default value for a better experimental result.

A. Tensor Robust Principal Component Analysis

1) Model and Experimental Settings:
For the TRPCA problem, $\Psi$ in Eqs. (19) and (20) is the identity operator. The TRPCA model based on our t-Schatten-p norm is then
$$\min_{\{\mathcal{X}_i\},\mathcal{E}} \ \sum_{i=1}^{I}\frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} + \lambda\|\mathcal{E}\|_1, \quad \text{s.t.} \ \mathcal{X}_1 * \mathcal{X}_2 * \cdots * \mathcal{X}_I + \mathcal{E} = \mathcal{T}. \qquad (41)$$

Data: To show the advantages of the proposed method, we experiment with both synthetic and real data. The real dataset covers one computer vision task: sequential face image denoising⁵.

Baseline: In this part, we compare our method with the TNN-based [3] and SNN-based [36] methods, which are widely used in various applications.

Evaluation metrics: Assume that the clean tensor is $\mathcal{X}_0 \in \mathbb{R}^{n_1\times n_2\times n_3}$; we denote the recovered tensor (the output of the algorithms) by $\mathcal{X}$.

⁵ Obtained from http://www.cs.nyu.edu/~roweis/data.html.
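As a concrete reading of Definition 7 above, the sketch below (our own code, a sketch rather than the authors' implementation) estimates the smallest incoherence parameter $\mu$ for which Eq. (39) holds. It relies on the fact that the basis tube $\mathring{e}_i$ becomes an all-ones tube after the DFT along the third dimension, so by Property 2, $\|\mathcal{U}^* * \mathring{e}_i\|_F^2$ equals the average over $k$ of the squared norm of the i-th row of $\bar{\mathbf{U}}^{(k)}$.

```python
import numpy as np

def skinny_tsvd_bases(T, tol=1e-10):
    """Fourier-domain left/right singular bases of each frontal slice and the tubal-rank."""
    Tb = np.fft.fft(T, axis=2)
    Us, Vs, ranks = [], [], []
    for k in range(T.shape[2]):
        U, s, Vh = np.linalg.svd(Tb[:, :, k], full_matrices=False)
        Us.append(U); Vs.append(Vh.conj().T)
        ranks.append(int((s > tol * s[0]).sum()))
    r = max(ranks)                                   # tubal-rank
    return [U[:, :r] for U in Us], [V[:, :r] for V in Vs], r

def incoherence_mu(T):
    """Smallest mu for which the tensor incoherence condition (39) holds."""
    n1, n2, n3 = T.shape
    Us, Vs, r = skinny_tsvd_bases(T)
    row_u = sum(np.sum(np.abs(U) ** 2, axis=1) for U in Us) / n3   # ||U^* * e_i||_F^2, per i
    row_v = sum(np.sum(np.abs(V) ** 2, axis=1) for V in Vs) / n3   # ||V^* * e_j||_F^2, per j
    return max(row_u.max() * n1 * n3 / r, row_v.max() * n2 * n3 / r)

rng = np.random.default_rng(0)
T = rng.standard_normal((30, 20, 4))
print("estimated incoherence parameter mu:", incoherence_mu(T))
```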

• Relative Square Error (RSE): the reconstruction error is computed as
$$\mathrm{RSE} = \frac{\|\mathcal{X} - \mathcal{X}_0\|_F}{\|\mathcal{X}_0\|_F}.$$
• Peak Signal-to-Noise Ratio (PSNR):
$$\mathrm{PSNR} = 10\log_{10}\!\left(\frac{n_1 n_2 n_3\|\mathcal{X}_0\|_\infty^2}{\|\mathcal{X} - \mathcal{X}_0\|_F^2}\right).$$

2) Synthetic Experiments:
On the synthetic dataset we only compare our methods with the TNN-based method [3], because both solve low tubal-rank minimization problems, whereas the SNN method unfolds the tensor into matrices along each dimension and minimizes the Tucker-n-rank [13].

We first generate a tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ ($n_1 = n_2 = n_3 = 50$) with each entry drawn from the normal distribution $\mathcal{N}(0,1)$, and then obtain a low tubal-rank tensor by truncating the singular value vectors in the frequency domain; the tubal-rank is set to 20. To generate the noise/outlier tensor $\mathcal{E}$, we create an index set $\Omega$ by using a Bernoulli model to randomly sample a subset of $\{1,\cdots,n_1\}\times\{1,\cdots,n_2\}\times\{1,\cdots,n_3\}$. The noise/outlier fraction is 0.1 here, and each entry of the noise tensor whose index is contained in $\Omega$ follows the distribution $\mathcal{N}(0,3)$.

Fig. 2. The convergence of the competing methods on the TRPCA problem.

Figure 2 shows the RSEs of the competing methods with respect to the iteration steps. We compare our methods under different selections of p with TNN, where p denotes the vector consisting of all $p_i$'s. Since the optimization problem of TNN is convex and easy to solve, TNN converges faster than our methods. However, our methods can exactly recover the underlying low tubal-rank tensor, because we utilize a tighter rank approximation of each frontal slice of the Fourier-transformed tensor. Although p < 1 makes the optimization non-smooth and non-convex, with the help of Theorem 1 we can still solve the optimization problem efficiently and exactly. For the triple-factor case p = [2, 2, 2], we found a slower convergence rate than in the double-factor cases (p = [1, 1], [1, 2]): although its subproblems are smooth and convex, the triple-factor reformulation makes the objective surface more complex, which brings more difficulties to optimization.

Fig. 3. RSEs of the competing methods vs. CPU time.

Figure 3 shows the RSEs of the competing methods with respect to CPU time. We generate a tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ ($n_1 = n_2 = 800$, $n_3 = 10$) with $\mathrm{rank}_t(\mathcal{X}) = 20$ and the noise fraction set to 10%. The results show that when the tubal-rank is much lower than $\min\{n_1, n_2\}$, our methods are more efficient than TNN owing to their smaller computational complexity. Combining the results of Figures 2 and 3, we choose p = [1, 2] in the following experiments for a faster convergence rate.

For further comparison, we generate a tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ ($n_1 = n_2 = n_3 = 100$), change the tubal-rank from 5 to 43, and vary the noise/outlier fraction $|\Omega|/(n_1 n_2 n_3)$ from 0.05 to 0.45 with a step size of 0.02.

Fig. 4. Comparing our method (p = [1, 2]) with convex optimization with TNN (tubal rank vs. noise rate). The plotted numbers are the success rates within 30 random trials; the white and black areas mean "succeed" and "fail", respectively. Here, success means PSNR ≥ 40 dB.

Figure 4 compares the proposed method with the convex TNN method. Not surprisingly, in terms of the number of successfully restored tensors, our method outperforms TNN by around 40%. From the results in Figure 4, our method is much more robust to the noise/outliers in the relatively high rank case and also performs well when the noise rate is high, which coincides with the similar phenomenon in the matrix case [25].
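The metrics and the synthetic TRPCA data described above can be reproduced, under our own naming and with the noise scale read as a standard deviation, by the following sketch.

```python
import numpy as np

def rse(X, X0):
    """Relative Square Error between the recovery X and the ground truth X0."""
    return np.linalg.norm(X - X0) / np.linalg.norm(X0)

def psnr(X, X0):
    """Peak Signal-to-Noise Ratio as defined above."""
    return 10 * np.log10(X0.size * np.max(np.abs(X0)) ** 2 / np.linalg.norm(X - X0) ** 2)

def low_tubal_rank_tensor(n1, n2, n3, r, rng):
    """Gaussian tensor whose frequency-domain slices are truncated to rank r."""
    Xb = np.fft.fft(rng.standard_normal((n1, n2, n3)), axis=2)
    for k in range(n3):
        U, s, Vh = np.linalg.svd(Xb[:, :, k], full_matrices=False)
        s[r:] = 0.0                                   # keep the r largest singular values
        Xb[:, :, k] = (U * s) @ Vh
    return np.fft.ifft(Xb, axis=2).real

def sparse_outliers(shape, fraction, scale, rng):
    """Gaussian outliers with standard deviation `scale` on a Bernoulli-sampled support."""
    mask = rng.random(shape) < fraction
    return mask * rng.normal(0.0, scale, size=shape)

rng = np.random.default_rng(0)
X0 = low_tubal_rank_tensor(50, 50, 50, r=20, rng=rng)
T = X0 + sparse_outliers(X0.shape, fraction=0.1, scale=3.0, rng=rng)
print("RSE of the corrupted observation:", rse(T, X0))
print("PSNR of the corrupted observation:", psnr(T, X0))
```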

Actually, the conditions of the performance guarantee for the non-convex model are weaker than those for the convex one, which brings benefits to the t-Schatten-p norm on the TRPCA problem.

3) Image Denoising:
We compare the methods on the face image denoising problem. The face dataset consists of 575 grayscale face images of size 112 × 92. All entries of the tensor are scaled to [0, 1], and the noise tensor is generated in the same way as in the synthetic case, with entries drawn from the distribution $\mathcal{N}(0, 3)$. Since the images of one individual are cropped from different views, the data form a high-rank matrix if we vectorize the images and concatenate them as a matrix. Fortunately, as shown in [23], the frontal slices of the Fourier-transformed face tensor are low-rank, so we can denoise the face images by pursuing the low tubal-rank structure.

Fig. 5. Examples of face image denoising. (a) the original image. (b) the observed image. (c)-(f) the denoising results of SNN, TNN, Ours1 (p = [1, 2]), and Ours2 (p = [2, 2, 2]), respectively.

Figure 5 gives some denoising results of the competing methods with the noise rate equal to 0.1. Our methods handle the details (areas near the nose and hair) better than TNN and SNN. For our models and TNN, we set one penalty coefficient for the whole multi-rank vector. Our t-Schatten-p norm is much tighter than the tensor nuclear norm, therefore our proposed models have a lower probability of over-penalizing the tensor rank, which preserves the details of the images. Some numerical results are reported in Table I and Figure 6.

TABLE I
COMPARISON OF PSNR RESULTS ON FACE IMAGES WITH DIFFERENT NOISE RATES.

Noise Rate            0.1     0.2     0.3     0.4
SNN                   28.17   25.48   22.35   21.27
TNN                   28.94   26.58   23.92   21.99
Ours (p = [1, 1])     29.89   27.55   25.94   23.85
Ours (p = [1, 2])     30.41   26.92   26.08   23.10
Ours (p = [2, 2, 2])  29.92   27.03   25.97   21.96

Table I collects the competing methods' PSNRs. The noise scale is set to 3, i.e., the non-zero entries of the noise tensor are generated from the distribution $\mathcal{N}(0, 3)$. The results show that the advantages of our methods become more and more obvious as the noise rate increases.

Fig. 6. The RSEs of the competing methods with different noise ratios.

Figure 6 shows the RSE between the original images and the recovered images, some of which correspond to the various cases in Table I. Compared with TNN and SNN, the tensors recovered by our methods are closer to the original data. These results show that our method has a stronger ability to identify the outliers than the other two methods, which is very important in some scenarios, such as medical image processing and outlier detection.

B. Tensor Completion

1) Model and Experimental Settings:
For the TC problem, $\Psi$ in Eqs. (19) and (20) is an orthogonal projection operator $P_\Omega$. The TC model based on our tSp is then
$$\min_{\{\mathcal{X}_i\},\mathcal{E}} \ \sum_{i=1}^{I}\frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} + \lambda\|\mathcal{E}\|_F^2, \quad \text{s.t.} \ P_\Omega(\mathcal{X}_1 * \mathcal{X}_2 * \cdots * \mathcal{X}_I + \mathcal{E}) = P_\Omega(\mathcal{T}). \qquad (42)$$

Data: We evaluate our method and other state-of-the-art methods on two inpainting tasks: 1) color image inpainting [37] (Berkeley Segmentation database); 2) grayscale video inpainting (YUV Video Sequences).

Baseline: We compare our proposed method with other state-of-the-art methods, including TMac-inc [38], SiLRTC [13], TCTF [23], and TNN [4], on inpainting applications. The codes are provided by their corresponding authors. Note that our proposed tSp, together with TNN and TCTF, is based on the t-product, while TMac-inc and SiLRTC are based on the Tucker product. They all have their own theoretical guarantees, thus we compare all of these methods, but the first three are compared emphatically.

Evaluation metrics: We use the same metrics as in the TRPCA case, i.e., PSNR and RSE.
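For the TC model of Eq. (42), the observation operator $P_\Omega$ can be sketched as follows (a minimal example with our own names; the Bernoulli mask mimics the sampling protocol used in the synthetic experiments below).

```python
import numpy as np

def bernoulli_mask(shape, rate, rng):
    """Index set Omega: each entry is observed independently with probability `rate`."""
    return rng.random(shape) < rate

def project_omega(T, mask):
    """P_Omega(T): keep observed entries, zero out the rest."""
    return np.where(mask, T, 0.0)

rng = np.random.default_rng(0)
T = rng.standard_normal((32, 32, 8))
mask = bernoulli_mask(T.shape, rate=0.4, rng=rng)
T_obs = project_omega(T, mask)
print("empirical sampling rate:", mask.mean())
```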

2) Synthetic Experiments:
We generate a low-rank tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ ($n_1 = n_2 = n_3 = 100$) and an index set $\Omega$ by the following steps. First, we produce two tensors $\mathcal{A} \in \mathbb{R}^{n_1\times r\times n_3}$ and $\mathcal{B} \in \mathbb{R}^{r\times n_2\times n_3}$ and let $\mathcal{X} = \mathcal{A} * \mathcal{B}$, which gives a tensor with $\mathrm{rank}_t(\mathcal{X}) = r$. Then we create the index set $\Omega$ by using a Bernoulli model to randomly sample a subset of $\{1,\cdots,n_1\}\times\{1,\cdots,n_2\}\times\{1,\cdots,n_3\}$. The sampling rate is $|\Omega|/(n_1 n_2 n_3)$.

Fig. 7. The convergence of the competing methods on the TC problem.

Figure 7 presents the RSEs of TNN and the proposed methods with respect to the iteration steps. Here $\mathrm{rank}_t(\mathcal{X})$ equals 40 and the sampling rate equals 0.4. TNN converges much faster than our methods due to its convexity; another reason is that we adopt the proximal gradient to update the variables while TNN achieves closed-form solutions for its subproblems. Nevertheless, all our methods with various selections of p can exactly recover the ground-truth low tubal-rank tensor. As in the TRPCA case, this is because the condition on the sampling rate for exact recovery is weaker than that of the convex nuclear norm; namely, the proposed t-Schatten-p norm still works well when the sample size is low.

Fig. 8. Comparing our methods (p = [1, 1], [1, 2], [2, 2, 2]) with convex optimization TNN (tubal rank vs. sampling rate). The plotted numbers are the success rates within 30 random trials; the white and black areas mean "succeed" and "fail", respectively. Here, success means PSNR ≥ 40 dB.

The exhaustive comparison between TNN and our methods is shown in Figure 8. Our method outperforms TNN by around 20% when p = [2, 2, 2]. These results verify the effectiveness of our t-Schatten-p norm. We can also see that the performances for p = [1, 2] and p = [2, 2, 2] are similar, which is not strange because they both correspond to the t-Schatten-2/3 norm.

3) Image Inpainting:
We use the Berkeley Segmentation dataset⁶ [37] to evaluate our method on the image inpainting task. This dataset contains 200 RGB images in total, each of size 321 × 481 × 3. As pointed out by [23], most natural images have a low tubal-rank structure, so we can inpaint these natural images by low-rank tensor completion.

Figure 9 visualizes the image inpainting results on some images with a 0.4 sampling rate; we choose p = [1, 2] for Ours1 and p = [2, 2, 2] for Ours2. Our methods have the highest PSNR among all the competing completion methods. According to the results, our methods handle the details of the images better than TNN. The reason is that our t-Schatten-p norm is much tighter than the tensor nuclear norm in approximating the tensor tubal-rank, so it has a smaller possibility of over-penalizing the singular values of each frontal slice of the Fourier-transformed tensor. Therefore, the t-Schatten-p norm based method is more suitable for inpainting images with complex details.

TABLE II
COMPARISON OF PSNR RESULTS ON NATURAL IMAGES WITH DIFFERENT SAMPLING RATES.

Sampling Rate         0.2     0.4     0.6     0.8
TMac-inc              18.72   24.18   28.55   38.74
SiLRTC                20.36   25.37   29.20   39.30
TCTF                  21.11   27.10   29.33   39.62
TNN                   23.10   27.98   33.29   41.39
Ours (p = [1, 1])     25.29   29.85   34.42   42.08
Ours (p = [1, 2])     24.41   28.72   33.68   41.31
Ours (p = [2, 2, 2])  24.11   28.77   34.97   41.57

Table II shows the average PSNR results on 40 natural images randomly chosen from the dataset, with different sampling rates. It is obvious that our methods still work well when the sampling rate is very low (rate = 0.2), and they outperform the others by larger margins when the rates are below 0.6. Note that although all the competing methods have similar results when the sampling rate is high, our methods are more memory-saving than the others due to the factorization strategy.

4) Video Inpainting:
We evaluate our methods on the widely used YUV Video Sequences⁷. Each sequence contains at least 150 frames.

⁶ Obtained from https://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/.
⁷ Obtained from http://trace.eas.asu.edu/yuv/.
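The synthetic TC data above (a tubal-rank-r tensor formed as a t-product of two Gaussian factors, observed on a Bernoulli index set) can be generated, for instance, as in the following sketch with our own helper names.

```python
import numpy as np

def t_product(A, B):
    """t-product via slice-wise matrix products in the Fourier domain (Eq. (6))."""
    Ab, Bb = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
    return np.fft.ifft(np.einsum('isk,sjk->ijk', Ab, Bb), axis=2).real

rng = np.random.default_rng(0)
n, r, rate = 100, 40, 0.4
A = rng.standard_normal((n, r, n))
B = rng.standard_normal((r, n, n))
X = t_product(A, B)                       # rank_t(X) = r
Omega = rng.random(X.shape) < rate        # Bernoulli sampling mask
T_obs = np.where(Omega, X, 0.0)           # observed entries P_Omega(X)
```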

Fig. 9. Examples of image inpainting. (a) the original image. (b) the observed image. (c)-(h) the inpainting results of SiLRTC, TCTF, TMac-inc, TNN, Ours1 (p = [1, 1]), and Ours2 (p = [2, 2, 2]), respectively.

In the experiments, we test our methods and the other methods on two videos, each with frames of 144 × 176 pixels. The work in [23] reveals that the tensor of a grayscale video has much redundant information because of the similar contents within and between frames, so its low tubal-rank structure is notable, and we can complete the missing entries by tensor low-rank minimization.

Due to the computational limitation, we only use the first 30 frames of the two sequences. In Figure 11 we display the 10-th frame of each of the two testing videos. From the recovery results, our methods perform better in filling the missing values of the two video sequences and deal with the details better.

TABLE III
COMPARISON OF PSNR RESULTS ON VIDEO INPAINTING WITH DIFFERENT SAMPLING RATES.

Hall Monitor
Sampling rate         0.1     0.2     0.4     0.6
TMac-inc              16.06   22.29   27.55   38.74
SiLRTC                19.23   21.64   28.47   39.30
TCTF                  19.32   23.08   28.50   39.62
TNN                   20.00   25.00   29.33   41.39
Ours (p = [1, 1])     22.08   27.32   31.23   41.31
Ours (p = [1, 2])     23.71   28.14   32.14   42.08
Ours (p = [2, 2, 2])  23.02   27.92   31.75   41.57

Akiyo
Sampling rate         0.1     0.2     0.4     0.6
TMac-inc              15.98   21.06   26.41   28.99
SiLRTC                18.52   23.93   27.12   28.60
TCTF                  20.11   22.07   28.09   29.37
TNN                   19.97   24.20   29.60   33.43
Ours (p = [1, 1])     21.10   26.39   30.42   34.02
Ours (p = [1, 2])     23.79   26.86   32.09   35.68
Ours (p = [2, 2, 2])  23.16   26.02   31.86   35.03

Table III shows the PSNR of the competing methods. Our methods achieve the best inpainting recovery, consistent with the observations in Figure 11. The low tubal-rank methods (TNN and Ours) are also better than the others, which demonstrates that the low tubal-rank structure does benefit the tensor completion task on video sequences. Compared to TNN, our methods maintain the details of the video sequences better. On one hand, the t-Schatten-p norm is much tighter than the nuclear norm for approximating the tensor multi-rank; on the other hand, the surrogate of the t-Schatten-p norm introduces convexity and smoothness to the subproblems of the optimization, which reduces the disadvantages of setting p < 1 for our norm. Hence, by Theorem 5 we can still reach a good stationary point.

C. Discussion on the Choice of p

In this section, we study the relation between the performance and the value of p for our t-Schatten-p norm. We generate a low tubal-rank tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ ($n_1 = n_2 = n_3 = 50$) with $\mathrm{rank}_t(\mathcal{X}) = 20$, and the noise rate and sampling rate both equal 0.4. For each value of $p \in \{1/4, 1/3, 2/5, 1/2, 2/3, 1\}$, the experiment is repeated 20 times and the average RSEs are reported in Figure 10.

Fig. 10. RSEs of the TC and the TRPCA problems with respect to various selections of p for the t-Schatten-p norm.

From Figure 10, we can see that the RSEs increase as p rises in the range [2/5, 1]. This result clearly justifies the validity of our proposed t-Schatten-p norm for solving LRTR problems when p < 1. Actually, a smaller p represents a tighter approximation to the tensor tubal-rank. Note that when p = 0, the t-Schatten-p norm reduces to the tensor rank function.

(a) Original (b) Observation (c) SiLRTC (d) TCTF (e) TMAC (f) TNN (g) Ours1 (h) Ours2

Fig. 11. Examples of video inpainting. (a) the original image. (b) the observed image. (c)-(h) the inpainting results of SiLRTC, TCTF, TMac-inc, TNN, Ours1 (p = [1, 1]), and Ours2 (p = [2, 2, 2]), respectively. to the tensor rank function. However, a smaller p makes the [3] C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan, “Tensor robust objective function more non-convex and non-smooth and thus principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization,” in Proceedings of the IEEE Conference more difficult to optimize. And accoarding to Theorem 1, a on Computer Vision and Pattern Recognition, pp. 5249–5257, 2016. smaller p (in the range (0, 2/5)) may require more tensor [4] Z. Zhang, G. Ely, S. Aeron, N. Hao, and M. Kilmer, “Novel methods factors, and tensor multiplication makes the problem (20) for multilinear data completion and de-noising based on tensor-SVD,” in Proceedings of the IEEE Conference on Computer Vision and Pattern non-convex, which may lead to a bad solution. Besides, the Recognition, pp. 3842–3849, 2014. equivalence condition of Theorem 1 for each factor Xi is not [5] Y. Fu, J. Gao, D. Tien, Z. Lin, and X. Hong, “Tensor LRR and a single point. Instead, each Xi belongs to a large subset by sparse coding-based subspace clustering,” IEEE Transactions on Neural multiplying unitary tensors, which also increases the difficulty Networks and Learning Systems, vol. 27, no. 10, pp. 2120–2133, 2016. [6] N. Boumal and P.-a. Absil, “RTRMC: A Riemannian trust-region method of optimization. for low-rank matrix completion,” in Advances in Neural Information Processing Systems, pp. 406–414, 2011. [7] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil, VII.CONCLUSIONS “Multilinear multitask learning,” in International Conference on Machine Learning, pp. 1444–1452, 2013. We propose a new definition of tensor Schatten-p norm [8] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” named as the t-Schatten-p norm. When p < 1, our t-Schatten-p SIAM review, vol. 51, no. 3, pp. 455–500, 2009. norm can better approximate the `1 norm of tensor multi- [9] F. L. Hitchcock, “The expression of a tensor or a polyadic as a sum of rank than TNN. Therefore, we use this norm to solve the products,” Studies in Applied Mathematics, vol. 6, no. 1-4, pp. 164–189, 1927. LRTR problem as a tighter regularizer. We further provide the [10] H. A. Kiers, “Towards a standardized notation and terminology in surrogate theorem for our proposed t-Schatten-p norm and give multiway analysis,” Journal of Chemometrics, vol. 14, no. 3, pp. 105–122, an efficient algorithm to solve the LRTR problem. We also 2000. [11] J. B. Kruskal, Rank, decomposition, and uniqueness for 3-way and n-way provide some theoretical analysis on exact recovery and the arrays. North-Holland Publishing Co., 1989. corresponding error bound for the noise case. The experimental [12] L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” results on TRPCA and TC show that our methods perform Psychometrika, vol. 31, no. 3, pp. 279–311, 1966. better than the mainstream methods when the clean data have [13] J. Liu, P. Musialski, P. Wonka, and J. Ye, “Tensor completion for estimating missing values in visual data,” IEEE Transactions on Pattern a large tubal-rank or a high noise/corruption ratio. Finally, Analysis and Machine Intelligence, vol. 35, no. 1, pp. 208–220, 2013. We also discuss the choice of p, and recommend a range for [14] S. Friedland and L.-H. 
Lim, “Nuclear norm of higher-order tensors,” selecting p for the LRTR problem. Mathematics of Computation, vol. 87, no. 311, pp. 1255–1281, 2018. [15] M. Yuan and C.-H. Zhang, “On tensor completion via nuclear norm minimization,” Foundations of Computational Mathematics, vol. 16, no. 4, ACKNOWLEDGMENTS pp. 1031–1068, 2016. [16] J. Håstad, “Tensor rank is NP-complete,” Journal of Algorithms, vol. 11, The authors would like to thank Jianlong Wu for helping us no. 4, pp. 644–654, 1990. improve the manuscript. Zhouchen Lin was supported by 973 [17] C. J. Hillar and L.-H. Lim, “Most tensor problems are NP-hard,” Journal of the ACM, vol. 60, no. 6, p. 45, 2013. Program (grant no. 2015CB352502), NSF of China (grant nos. [18] H. Kasai and B. Mishra, “Low-rank tensor completion: a Riemannian 61625301 and 61731018), Qualcomm, and Microsoft Research manifold preconditioning approach,” in International Conference on Asia. Machine Learning, pp. 1012–1021, 2016. [19] C. Li, L. Guo, Y. Tao, J. Wang, L. Qi, and Z. Dou, “Yet another Schatten norm for tensor recovery,” in International Conference on REFERENCES Neural Information Processing, pp. 51–60, 2016. [20] B. Romera-Paredes and M. Pontil, “A new convex relaxation for tensor [1] M. E. Kilmer and C. D. Martin, “Factorization strategies for third-order completion,” in Advances in Neural Information Processing Systems, tensors,” Linear Algebra and its Applications, vol. 435, no. 3, pp. 641– pp. 2967–2975, 2013. 658, 2011. [21] R. Tomioka and T. Suzuki, “Convex tensor decomposition via structured [2] M. E. Kilmer, K. Braman, N. Hao, and R. C. Hoover, “Third-order tensors schatten norm regularization,” in Advances in Neural Information as operators on matrices: A theoretical and computational framework Processing Systems, pp. 1331–1339, 2013. with applications in imaging,” SIAM Journal on Matrix Analysis and [22] Z. Zhang and S. Aeron, “Exact tensor completion using t-SVD,” IEEE Applications, vol. 34, no. 1, pp. 148–172, 2013. Transactions on Signal Processing, vol. 65, no. 6, pp. 1511–1526, 2017. JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING 14

APPENDIX

A. Supplementary Definitions

The following are the definitions of the tensor transpose, the identity tensor, and the orthogonal tensor, respectively.

Definition 8. (Tensor transpose) [1] The conjugate transpose of a tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is the tensor $\mathcal{T}^* \in \mathbb{R}^{n_2 \times n_1 \times n_3}$ obtained by conjugate transposing each frontal slice of $\mathcal{T}$, i.e.,
$$\mathcal{T}^* \text{ is the transpose of } \mathcal{T} \iff \big(\mathcal{T}^*\big)^{(k)} = \big(\mathcal{T}^{(k)}\big)^{H}. \tag{43}$$
$\mathcal{T}^*$ can also be obtained by conjugate transposing each of $\mathcal{T}$'s frontal slices and then reversing the order of the transposed frontal slices 2 through $n_3$.

Definition 9. (Identity tensor) [1] Let $\mathcal{I} \in \mathbb{R}^{n \times n \times n_3}$. Then $\mathcal{I}$ is an identity tensor if its first frontal slice $\mathcal{I}^{(1)}$ is the $n \times n$ identity matrix and all other frontal slices $\mathcal{I}^{(i)}$, $i = 2, \dots, n_3$, are zero matrices.

Definition 10. (Orthogonal tensor) [1] Let $\mathcal{Q} \in \mathbb{R}^{n \times n \times n_3}$. Then $\mathcal{Q}$ is orthogonal if it satisfies
$$\mathcal{Q}^* * \mathcal{Q} = \mathcal{Q} * \mathcal{Q}^* = \mathcal{I}. \tag{44}$$
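To make Definitions 8-10 concrete, the sketch below implements the tensor transpose, the identity tensor, and the t-product (using the standard fact that the t-product reduces to slice-wise matrix products in the Fourier domain), and then checks the orthogonality condition (44) numerically. The helper names are ours, and the orthogonal tensor is built in a deliberately simple way (a single orthogonal frontal slice); this is a sketch of the definitions, not the paper's implementation.

```python
import numpy as np

def t_transpose(T):
    """Definition 8: (conjugate) transpose each frontal slice, then reverse slices 2..n3."""
    Tt = np.conj(np.transpose(T, (1, 0, 2)))
    return np.concatenate([Tt[:, :, :1], Tt[:, :, :0:-1]], axis=2)

def t_identity(n, n3):
    """Definition 9: first frontal slice is the n x n identity, the rest are zero."""
    I = np.zeros((n, n, n3))
    I[:, :, 0] = np.eye(n)
    return I

def t_product(A, B):
    """t-product A * B via slice-wise matrix products in the Fourier domain."""
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    C_hat = np.stack(
        [A_hat[:, :, k] @ B_hat[:, :, k] for k in range(A.shape[2])], axis=2
    )
    return np.real(np.fft.ifft(C_hat, axis=2))

# A simple real orthogonal tensor: one orthogonal matrix placed in a single
# frontal slice, so that every Fourier-domain frontal slice is unitary.
rng = np.random.default_rng(0)
n, n3 = 4, 3
Q0 = np.linalg.qr(rng.standard_normal((n, n)))[0]
Q = np.zeros((n, n, n3))
Q[:, :, 1] = Q0

I = t_identity(n, n3)
print(np.allclose(t_product(t_transpose(Q), Q), I))   # True, as in Eq. (44)
print(np.allclose(t_product(Q, t_transpose(Q)), I))   # True, as in Eq. (44)
```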
B. Proof of Theorem 2

Proof. For convenience, let the variables without and with the superscript $+$ denote the variables at the $k$-th and $(k+1)$-th iterations, respectively, and let $\|\cdot\|$ denote any well-defined matrix/tensor norm. Assume that we already have $\|\mathcal{Y}^+ - \mathcal{Y}\| \to 0$ and $\|\mathcal{Z}^+ - \mathcal{Z}\| \to 0$. Then $\|\nabla_{\mathcal{Z}} L\| = \frac{1}{\rho_2}\|\mathcal{Z}^+ - \mathcal{Z}\| = \|\mathcal{X}_1 * \mathcal{X}_2 * \cdots * \mathcal{X}_I - \mathcal{G}\| \to 0$ and $\|\nabla_{\mathcal{Y}} L\| = \frac{1}{\rho_1}\|\mathcal{Y}^+ - \mathcal{Y}\| = \|\Psi(\mathcal{G}) + \mathcal{E} - \mathcal{T}\| \to 0$, hence the equality constraints are satisfied at the limit points. Since our optimization problem is a multi-linear optimization, by Corollary 6.21 in [39], if $\|\mathcal{X}_i^+ - \mathcal{X}_i\|$, $\|\mathcal{G}^+ - \mathcal{G}\|$, and $\|\mathcal{E}^+ - \mathcal{E}\|$ all approach 0, then there exists $\upsilon \in \partial L$ with $\upsilon \to 0$. Hence, any limit point produced by our algorithm is a constrained stationary point. We now prove that these differences converge to 0.

Lemma 2. The change in the augmented Lagrangian when the primal variable $\mathcal{X}_i$ is updated to $\mathcal{X}_i^+$ is given by
$$
\begin{aligned}
&L(\mathcal{X}_i, \mathcal{G}, \mathcal{E}, \mathcal{Y}, \mathcal{Z}) - L(\mathcal{X}_i^+, \mathcal{G}, \mathcal{E}, \mathcal{Y}, \mathcal{Z}) \\
&= \frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} - \frac{1}{p_i}\|\mathcal{X}_i^+\|_{S_{p_i}}^{p_i} - \langle \upsilon, \mathcal{X}_i - \mathcal{X}_i^+ \rangle \\
&\quad + \rho_2 L_i \|\mathcal{X}_i - \mathcal{X}_i^+\|_F^2 + \frac{\rho_2}{2}\|C(\mathcal{X}_i) - C(\mathcal{X}_i^+)\|_F^2,
\end{aligned} \tag{45}
$$
where $C(\mathcal{X}_i) = \mathcal{Q}_{1:(i-1)} * \mathcal{X}_i * \mathcal{Q}_{(i+1):I} - \mathcal{G}$ is a linear transformation and $\upsilon \in \partial\big(\tfrac{1}{p_i}\|\mathcal{X}_i^+\|_{S_{p_i}}^{p_i}\big)$.

Proof. Expanding $L(\mathcal{X}_i, \mathcal{G}, \mathcal{E}, \mathcal{Y}, \mathcal{Z}) - L(\mathcal{X}_i^+, \mathcal{G}, \mathcal{E}, \mathcal{Y}, \mathcal{Z})$, the change is
$$
\begin{aligned}
&\frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} - \frac{1}{p_i}\|\mathcal{X}_i^+\|_{S_{p_i}}^{p_i} + \langle \mathcal{Z}, C(\mathcal{X}_i) - C(\mathcal{X}_i^+) \rangle + \frac{\rho_2}{2}\big(\|C(\mathcal{X}_i) - \mathcal{G}\|_F^2 - \|C(\mathcal{X}_i^+) - \mathcal{G}\|_F^2\big) \\
&= \frac{1}{p_i}\|\mathcal{X}_i\|_{S_{p_i}}^{p_i} - \frac{1}{p_i}\|\mathcal{X}_i^+\|_{S_{p_i}}^{p_i} + \frac{\rho_2}{2}\|C(\mathcal{X}_i) - C(\mathcal{X}_i^+)\|_F^2 \\
&\quad + \langle \mathcal{Z}, C(\mathcal{X}_i) - C(\mathcal{X}_i^+) \rangle + \rho_2 \langle C(\mathcal{X}_i) - C(\mathcal{X}_i^+), C(\mathcal{X}_i^+) - \mathcal{G} \rangle.
\end{aligned}
$$
We observe that
$$
\begin{aligned}
&\langle \mathcal{Z}, C(\mathcal{X}_i) - C(\mathcal{X}_i^+) \rangle + \rho_2 \langle C(\mathcal{X}_i) - C(\mathcal{X}_i^+), C(\mathcal{X}_i^+) - \mathcal{G} \rangle \\
&= \langle \mathcal{Z} + \rho_2 (C(\mathcal{X}_i^+) - \mathcal{G}), C(\mathcal{X}_i) - C(\mathcal{X}_i^+) \rangle \\
&= \langle C^T(\mathcal{Z} + \rho_2 (C(\mathcal{X}_i^+) - \mathcal{G})), \mathcal{X}_i - \mathcal{X}_i^+ \rangle.
\end{aligned}
$$
Note that $C^T(\mathcal{Z} + \rho_2 (C(\mathcal{X}_i^+) - \mathcal{G})) = \rho_2 \nabla f_i^+(\mathcal{X}_i^+)$. On the other hand, by the optimality of Eq. (23), we have $\rho_2 \nabla f_i^+(\mathcal{X}_i^+) + \rho_2 L_i (\mathcal{X}_i^+ - \mathcal{X}_i) \in -\partial\big(\tfrac{1}{p_i}\|\mathcal{X}_i^+\|_{S_{p_i}}^{p_i}\big)$. Hence, we obtain Eq. (45) directly.

Due to Lemma 2 and the convexity of the function $\|\cdot\|_{S_{p_i}}^{p_i}$ when $p_i \geq 1$ (so that the first three terms on the right-hand side of Eq. (45) sum to a non-negative quantity), we have
$$L(\mathcal{X}_i^+, \mathcal{G}, \mathcal{E}, \mathcal{Y}, \mathcal{Z}) + c_1 \|\mathcal{X}_i^+ - \mathcal{X}_i\|_F^2 \leq L(\mathcal{X}_i, \mathcal{G}, \mathcal{E}, \mathcal{Y}, \mathcal{Z}), \tag{46}$$

where $c_1 = \rho_2 L_i > 0$. Since we update $\mathcal{G}$ by the closed-form solution of Eq. (27), together with the strong convexity of the objective, we have
$$L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}, \mathcal{Y}, \mathcal{Z}) + \frac{\mu_1}{2}\|\mathcal{G} - \mathcal{G}^+\|_F^2 \leq L(\mathcal{X}_i^+, \mathcal{G}, \mathcal{E}, \mathcal{Y}, \mathcal{Z}), \tag{47}$$
where $\mu_1 > 0$. By the same derivation as Lemma 2,
$$
\begin{aligned}
&L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}, \mathcal{Y}, \mathcal{Z}) - L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}^+, \mathcal{Y}, \mathcal{Z}) \\
&= g(\mathcal{E}) - g(\mathcal{E}^+) - \langle \upsilon, \mathcal{E} - \mathcal{E}^+ \rangle + \frac{\rho_1}{2}\|C(\mathcal{E}) - C(\mathcal{E}^+)\|_F^2,
\end{aligned}
$$
where $\upsilon = \nabla g(\mathcal{E}^+)$ and $C(\mathcal{E}) = \Psi(\mathcal{G}) + \mathcal{E} - \mathcal{T}$. Note that $g(\cdot)$ in Theorem 2 is convex in our case, hence we get
$$L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}^+, \mathcal{Y}, \mathcal{Z}) + \frac{\rho_1}{2}\|\mathcal{E} - \mathcal{E}^+\|_F^2 \leq L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}, \mathcal{Y}, \mathcal{Z}). \tag{48}$$
According to Eq. (33), it is easy to verify that
$$L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}^+, \mathcal{Y}^+, \mathcal{Z}^+) - L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}^+, \mathcal{Y}, \mathcal{Z}) = \frac{1}{\rho_1}\|\mathcal{Y}^+ - \mathcal{Y}\|_F^2 + \frac{1}{\rho_2}\|\mathcal{Z}^+ - \mathcal{Z}\|_F^2. \tag{49}$$
By the optimality of $\mathcal{E}^+$, we have

$$\mathcal{Y}^+ = \rho_1\big(\Psi(\mathcal{G}^+) + \mathcal{E}^+ - \mathcal{T}\big) + \mathcal{Y} = -\lambda \nabla g(\mathcal{E}^+). \tag{50}$$
The objective of (27) is differentiable, and by the optimality of $\mathcal{G}^+$ we similarly have
$$\Psi(\mathcal{Y}^+) = \mathcal{Z}^+. \tag{51}$$
Therefore, we conclude
$$L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}^+, \mathcal{Y}^+, \mathcal{Z}^+) - L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}^+, \mathcal{Y}, \mathcal{Z}) \overset{(a)}{\leq} \frac{2}{\rho_1}\|\mathcal{Y}^+ - \mathcal{Y}\|_F^2 \overset{(b)}{\leq} \frac{2\lambda c_2}{\rho_1}\|\mathcal{E}^+ - \mathcal{E}\|_F^2, \tag{52}$$
where $c_2$ is some constant; we assume $\rho_2 \geq \rho_1$ in (a) and use the gradient Lipschitz property of $g(\cdot)$ in (b). Combining Eqs. (46) and (52), we find that the augmented Lagrangian function is monotonically decreasing, so the function $L$ is upper bounded. It is easy to verify that the Lagrangian function $L$ is also lower bounded and coercive. Thus, all the variables produced by our algorithm are bounded. We also obtain
$$
\begin{aligned}
&L(\mathcal{X}_i, \mathcal{G}, \mathcal{E}, \mathcal{Y}, \mathcal{Z}) - L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}^+, \mathcal{Y}^+, \mathcal{Z}^+) \\
&\geq c_1\|\mathcal{X}_i^+ - \mathcal{X}_i\|_F^2 + \frac{\mu_1}{2}\|\mathcal{G} - \mathcal{G}^+\|_F^2 + \left(\frac{\rho_1}{2} - \frac{2\lambda c_2}{\rho_1}\right)\|\mathcal{E} - \mathcal{E}^+\|_F^2.
\end{aligned} \tag{53}
$$
With a proper choice of $\rho_1$ and $\rho_2$ such that $\frac{\rho_1}{2} - \frac{2\lambda c_2}{\rho_1} > 0$, we can derive
$$
\begin{aligned}
&\sum_{k=1}^{\infty}\Big(c_1\|\mathcal{X}_i^+ - \mathcal{X}_i\|_F^2 + c_3\|\mathcal{G}^+ - \mathcal{G}\|_F^2 + c_4\|\mathcal{E}^+ - \mathcal{E}\|_F^2\Big) \\
&\leq \sum_{k=1}^{\infty}\Big(L(\mathcal{X}_i, \mathcal{G}, \mathcal{E}, \mathcal{Y}, \mathcal{Z}) - L(\mathcal{X}_i^+, \mathcal{G}^+, \mathcal{E}^+, \mathcal{Y}^+, \mathcal{Z}^+)\Big) < \infty,
\end{aligned} \tag{54}
$$

where $c_3 = \frac{\mu_1}{2} > 0$ and $c_4 = \frac{\rho_1}{2} - \frac{2\lambda c_2}{\rho_1} > 0$. We conclude that $\|\mathcal{X}_i^+ - \mathcal{X}_i\| \to 0$, $\|\mathcal{G}^+ - \mathcal{G}\| \to 0$, and $\|\mathcal{E}^+ - \mathcal{E}\| \to 0$. Based on Eqs. (50) and (51) and $\|\mathcal{E}^+ - \mathcal{E}\| \to 0$, we then have $\|\mathcal{Y}^+ - \mathcal{Y}\| \to 0$ and $\|\mathcal{Z}^+ - \mathcal{Z}\| \to 0$. The proof is finished.

Hao Kong is currently a Ph.D. candidate with the Key Laboratory of Machine Perception, School of Electronics Engineering and Computer Science, Peking University. He received his Bachelor degree in computer science from the University of Science and Technology Beijing in 2016. His research interests include machine learning, pattern recognition, and computer vision.

Xingyu Xie (S'17) is a Master student at Nanjing University of Aeronautics and Astronautics (NUAA), China. He received his Bachelor degree in Automation from NUAA in 2016. His research interests include machine learning and computer vision. He is a student member of the IEEE.

Zhouchen Lin (M'00-SM'08-F'18) is a professor with the Key Laboratory of Machine Perception, School of Electronics Engineering and Computer Science, Peking University. His research interests include computer vision, image processing, machine learning, pattern recognition, and numerical optimization. He is an area chair of CVPR 2014/2016/2019, ICCV 2015, NIPS 2015/2018 and AAAI 2019, and a senior program committee member of AAAI 2016/2017/2018 and IJCAI 2016/2018. He is an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision. He is a Fellow of IAPR and IEEE.