<<

arXiv:1803.00420v1 [cs.LG] 28 Feb 2018 nw stencernr) hc lolast a to leads also which ), (also norm nuclear rank the the the i.e., as relax 7], known [6, to envelope convex is its way by principled prob- these a solve efficiently lems, To [5]. and low- ro- learning [4] multi-task [2], regression [1], multivariate (RPCA) range [3], (MC) representation analysis rank wide completion component a matrix principal bust has in problem applications of minimization rank The Introduction 1 06 ai,San MR &Pvlm 1 Copyright 51. authors. volume the W&CP by JMLR: 2016 (AISTATS Spain. Cadiz, and 19 2016, Intelligence the Artificial of on ference Proceedings in Appearing rcal n clbeShte us-omApproximation Quasi-Norm Schatten Scalable and Tractable ihtesaeo-h-r methods. state-of-the-art the compared with algorithms our of effectiveness efficiency and the ex- both Our verified results algorithms. perimental our for bounds and error based MC (RSC) the convexity provide also strong We con- restricted point. algorithms critical a our bounded to verges by each generated that establish our and solve problems to alter- algorithms linearized minimization efficient two nating we design matri- We RPCA, factor smaller and ces. much solve MC to need as minimiza- only such rank Schat- problems various defined tion to two quasi-norms the ten applying By tively. om,adpoeta hyaei sec the essence in are Schatten-1 they that prove tri-trace and and norms, bi-trace tractable the two i.e. norms, define Schatten we penalty, bilin- its spectral and ear large- norm trace for the between equivalence impractical relation the by even Motivated or problems. scale slow too and algorithms existing norm are However, trace the function. between rank gap to the introduced bridge was quasi-norm Schatten The eateto optrSineadEgneig h hns Univ Chinese The Engineering, and Science Computer of Department ahaSagYaya i ae Cheng James Liu Yuanyuan Shang Fanhua / n 1 and 2 Abstract / us-om,respec- quasi-norms, 3 th nentoa Con- International akMinimization Rank ) h rc om lhuhteSchatten- the although norm, trace the ag nre fvcos n eut nabae solution. the biased to a in Similar results and vectors, of entries large 1]ad[4 rpsdieaierwihe lease reweighted iterative associated approximate Schatten- to proposed algorithms (IRLS) [14] and [13] [12]. fil- analysis collaborative MRI 10], and [11] [9, tering recovery of Schatten- images deal in the great a attention Therefore, attracted has function. minimization rank quasi-norm the to tion 0 for the from deviate solution the as the solution make original may the norm words, other to trace In leads values. also singular which large equally, over-penalize values singular all shrinks ok[5 6 1 ,1] h Schatten- the recent 10], 9, some 11, In non-convex 16, solve [15, problems. to work nu- reweighted algorithm minimization iteratively (IRNN) surrogate an norm proposed clear [10] dition, ovxotmzto rbe.I at h rc norm trace the an fact, is penalty In problem. optimization convex vr 8 one u htthe How- that solution. out low-rank pointed a [8] motivates ever, it thus and ues, om oevr 1]tertclypoe htthe that proved theoretically trace [17] Schatten- the Moreover, to superior empirically norm. be to shown been has plctos uha olbrtv leig[0 21]. [20, filtering Schatten- the collaborative Since as such large-scale many applications, in applied equivalent successfully been scalable 7], has [19, which a regularization has spectral bilinear norm the formulation, trace the contrast, In appli- [18]. not problems even large-scale are for and cable cost suffer computational they high Thus from iteration. each involve de- in eigenvalue and (EVD) or composition iteratively (SVD) existing solved decomposition value all be singular However, to have minimization. algorithms norm con- trace the than vex measurements fewer significantly quires ln arxfcoiainfr osm ae fthe of cases equiv- some an to design form we Schatten- factorization can matrix question: alent following the ask ℓ p us-omo h iglrvle,i sntrlto natural is it values, singular the on quasi-norm

In this paper we first define two tractable Schatten 3 Tractable Schatten Quasi-Norm norms, the bi-trace (Bi-tr) and tri-trace (Tri-tr) norms. Minimization We then prove that they are in essence the Schatten- 1/2 and 1/3 quasi-norms, respectively, for solving [19] and [7] pointed out that the trace norm has the whose minimization we only need to perform SVDs following equivalent non-convex formulations. on much smaller factor matrices to replace the large Lemma 1. Given a matrix X Rm n with matrices in the algorithms mentioned above. Then we × rank(X)= r d, the following holds: ∈ design two efficient linearized alternating minimization ≤ X tr = min U F V F algorithms with guaranteed convergence to solve our k k U Rm×d,V Rn×d:X=UV T k k k k problems. Finally, we provide the sufficient condition ∈ ∈ U 2 + V 2 for exact recovery, and the restricted strong convexity = min k kF k kF . U,V :X=UV T 2 (RSC) based and MC error bounds. 3.1 Bi-Trace Quasi-Norm 2 Notations and Background Motivated by the equivalence relation between the The Schatten-p norm (0

m n 3.2 Tri-Trace Quasi-Norm Suppose X R × is of rank r, and denote its skinny SVD by X ∈= UΣV T . By Theorems 1 and 2, and the Similar to the definition of the bi-trace quasi-norm, our properties of the ℓp quasi-norm, we have tri-trace (Tri-tr) norm is naturally defined as follows. X tr = diag(Σ) 1 diag(Σ) 1 = X Bi-tr r X tr, Rm n k k k k ≤k k 2 k k ≤ k k Definition 2. For any matrix X × with 2 ∈ X tr = diag(Σ) 1 diag(Σ) 1 = X Tri-tr r X tr. rank(X)= r d, we can factorize it into three much k k k k ≤k k 3 k k ≤ k k smaller matrices≤ U Rm d, V Rd d and W Rn d × × × In addition, such that X = UV W∈ T . Then the∈ tri-trace norm∈ of X is defined as X Bi-tr = diag(Σ) 1 diag(Σ) 1 = X Tri-tr. k k k k 2 ≤k k 3 k k X Tri-tr := min U tr V tr W tr. It is easy to see that Property 5 in turn implies that k k U,V,W :X=UV W T k k k k k k any low bi-trace or tri-trace quasi-norm approximation Like the bi-trace quasi-norm, the tri-trace norm is also is also a low trace norm approximation. a quasi-norm, as stated in the following theorem. 3.3 Problem Formulations Theorem 2. The tri-trace norm is a quasi- k·kTri-tr norm. In addition, it is also the Schatten-1/3 quasi- Bounding the Schatten quasi-norm of X in (1) by norm, i.e., the bi-trace or tri-trace quasi-norm defined above,

X Tri-tr = X S1/3 . the noiseless low-rank structured matrix factorization k k k k problem is given by The proof of Theorem 2 is very similar to that of Theo- T min (U,V )=( U tr + V tr)/2: (UV )=b , (3) rem 1 and is thus omitted. According to Theorem 2, it U,V R k k k k A is easy to verify that the tri-trace quasi-norm possesses  the following properties. where ( ) can also denote ( U tr+ V tr+ W tr)/3, and (RUV· T ) is replaced by k (UVk Wk T k). Ink addition,k Property 3. For any matrix X Rm n with × (3) hasA the following LagrangianA forms, rank(X)= r d, the following holds: ∈ ≤ T 3 U tr+ V tr f( (UV ) b) U tr+ V tr + W tr F (U,V ):=min k k k k + A − , (4) X Tri-tr = min k k k k k k U,V 2 µ k k X=UV W T 3     3 3 3 T U tr+ V tr+ W tr U tr + V tr+ W tr f( (UVW ) b) = min U tr V tr W tr= min k k k k k k . min k k k k k k + A − . (5) X=UVWTk k k k k k X=UVWT 3 U,V,W 3 µ   Property 4. The tri-trace quasi-norm satisfies the The formulations (3), (4) and (5) can address a wide following properties: range of problems, such as MC [13, 10], RPCA [2, 25, 26] ( is the identity operator, and f( ) = 1 or 1. X Tri-tr 0, with equality iff X =0. A · k·k k k ≥ p (0

4.1 LADM Algorithm Algorithm 1 LADM for (4) with non-smooth loss Input: b, the given rank d and µ. To avoid introducing more auxiliary variables, inspired 4 20 4 Initialize: β0 =10− , βmax =10 and ε=10− . by [29], we propose a linearized alternating direction 1: while not converged do method (LADM) to solve (6), whose augmented La- ϕ ψ 2: Update tk , Uk+1, tk , and Vk+1 by (11), (9), (11), grangian function is given by and (10), respectively. 1 f(e) T (U,V,e,λ,β)= ( U + V )+ + λ, (UV ) 3: Update ek+1 and λk+1 by (7c) and (7d). L 2 k ktr k ktr µ h A 4: Update βk+1 by βk+1 = min(ρβk, βmax). b e + (β/2) (UV T ) b e 2, 5: Check the convergence condition, − − i kA − − k2 Rl (U V T ) b e <ε. where λ is the Lagrange multiplier, , de- kA k+1 k+1 − − k+1k2 notes the∈ inner , and β > 0 is a penaltyh· ·i pa- 6: end while rameter. By applying the classical augmented La- Output: Uk+1, Vk+1, ek+1. grangian method to (6), we obtain the following it- erative scheme: U β 4.1.2 Computing Step Sizes U =arg mink ktr + k (UV T) e b 2, (7a) k+1 2 2 k k k 2 U kA − − k There are two step sizes, i.e., the Lipschitz constants ϕ ψ V tr βk T e 2 t in (9) and t in (10), need to be set during the Vk+1=arg mink k + (Uk+1V ) ek bk 2, (7b) k k V 2 2 kA − − k iteration.

f(e) βk T 2 T e =arg min + (U V ) e be , (7c) ϕk(U1) ϕk(U2) F = ∗ [(U1 U2)V ] Vk F k+1 k+1 k+1 k 2 k∇ −∇ k kA {A − k } k e µ 2 kA − − k T T ∗ 2 Vk Vk 2 U1 U2 F , λk+1 =λk + βk( (Uk+1V ) b ek+1), (7d) ≤ kA Ak k k k − k k+1 e T T A − − ψ (V ) ψ (V ) = U ∗ [U (V V ) ] k∇ k 1 −∇ k 2 kF k k+1A {A k+1 1− 2 }kF where bk =b λk/βk. In many machine learning prob- T − ∗ 2 Uk+1Uk+1 2 V1 V2 F , lems [15, 3, 4], is not identity, e.g., the operator Ω. ≤ kA Ak k k k − k A P Due toe the presence of Vk and Uk+1, thus we usually where ∗ denotes the adjoint operator of . Thus, need to introduce some auxiliary variables to achieve both stepA sizes are defined in the following way:A closed-form solutions to (7a) and (7b). To avoid intro- ϕ T ducing additional auxiliary variables, we propose the t ∗ 2 Vk Vk 2, k ≥ kA Ak k k (11) following linearization technique for (7a) and (7b). ψ T ( t ∗ 2 U Uk+1 2. k ≥ kA Ak k k+1 k 4.1.1 Updating Uk+1 and Vk+1 Based on the description above, we develop an efficient LADM algorithm to solve the Bi-tr quasi-norm regu- T 2 Let ϕk(U):= (UV ) b ek+λk/βk 2/2, then we can larized problem (4) with a non-smooth loss function kA k − − k know that the gradient of ϕk(U) is Lipschitz continu- (e.g., RPCA problems), as outlined in Algorithm 1. ϕ ous with the constant tk , i.e., ϕk(U1) ϕk(U2) F To further accelerate the convergence of the algorithm, ϕ k∇Rm d −∇ k ≤ t U1 U2 F for any U1,U2 × . By linearizing the penalty parameter β is adaptively updated by the k k − k ∈ ϕk(U) at Uk and adding a proximal term, we have strategy as in [32], as well as ρ. Moreover, Algorithm ϕ t 1 can be used to solve the noiseless problem (3) and ϕ (U,U )=ϕ (U )+ ϕ (U ),U U + k U U 2 . (8) k k k k h∇ k k − ki 2 k − kkF also extended to solve the Tri-tr quasi-norm regular- Therefore, we have ized problem (5) with a non-smooth loss function. b 1 Uk+1 =arg min U tr +βkϕk(U,Uk) U 2k k 4.2 PALM Algorithm ϕ (9) 1 βktk ϕk(Uk) 2 =arg min U tr+ U Ubk + ∇ ϕ F . By using the similar linearization technique in (9) and U 2k k 2 k − tk k (10), we design an efficient PALM algorithm to solve Similarly, we have (4) with a smooth loss function, e.g., MC problems. ψ 1 βktk ψk(Vk) 2 Specifically, by linearizing the smooth loss function Vk+1=arg min V tr+ V Vk+∇ ψ F , (10) T 2 2k k 2 k − k ϕk(U) := (UV ) b 2/2 at Uk and adding a proxi- V tk k mal term,kA we have the− k following approximation: T 2 where ψk(V ):= (Uk+1V ) b ek +λk/βk 2/2 with kA ψ − − k Uk+1 the Lipschitz constant tk . Using the so-called matrix ϕ shrinkage operator [30], we can obtain a closed-form U tr ϕk(Uk) tk 2 =arg mink k + ∇ ,U Uk + U Uk F solution to (9) and (10), respectively. Additionally, if U 2 h µ − i 2µk − k (12) ϕ f( )= 1, the optimal solution to (7c) can be obtained U tr tk ϕk(Uk) 2 by· thek·k well-known soft-thresholding operator [31]. =arg mink k + U Uk + ∇ ϕ F , U 2 2µk − tk k Fanhua Shang, Yuanyuan Liu, James Cheng

T T T where ϕk(Uk)= ∗[ (UkVk ) b]Vk. Similarly, If θ (1/2, 1), then C > 0 such that [Uk , Vk ] ∇ A A − • ∈ 1∃−θ k − ψ T T − V t ψ (V ) [U , V ] F Ck− 2θ 1 . V =arg min k ktr + k V V + ∇ k k 2 , (13) k ≤ k+1 2 2µk − k ψ kF V tk Theorem 5 shows us the convergence rate of our PALM T T algorithmb b for solving the non-convex and non-smooth where ψk(Vk)= ∗[ (Uk+1V ) b] Uk+1. ∇ {A A k − } bi-trace quasi-norm problem (4) with the squared loss 2. Moreover, we can see that the convergence rate 4.3 Convergence Analysis 2 ofk·k our PALM algorithm is at least sub-linear. In the following, we provide the convergence analysis of our algorithms. First, we analyze the convergence 5 Recovery Guarantees of our LADM algorithm for solving (4) with a non- smooth loss function, e.g., f( )= 1. We provide theoretical guarantees for our Bi-tr quasi- · k·k Theorem 3. Let (Uk, Vk,ek) be a sequence gener- norm minimization in recovering low-rank matrices ated by Algorithm 1,{ then we have} from small sets of linear observations. By using the null-space property (NSP), we first provide a sufficient 1. (U , V ,e ) are all Cauchy ; { k k k } condition for exact recovery of low-rank matrices. We 2. If lim λ λ = 0, then the accumula- k k+1 k 2 then establish the restricted strong convexity (RSC) tion point→∞ ofk the− sequencek (U , V ,e ) satisfies k k k condition based and MC error bounds. the KKT conditions for (6){. } The proof of Theorem 3 is provided in the Supplemen- 5.1 Null Space Property tary Materials. From Theorem 3, we can know that under mild conditions each sequence generated by our The wide use of NSP for recovering sparse vectors LADM algorithm converges to a critical point, similar and low-rank matrices can be found in [22, 34]. We to the LADM algorithms for solving convex problems give the sufficient and necessary condition for exact as in [32]. recovery via our bi-trace quasi-norm model (3) that Moreover, we provide the global convergence of our improves the NSP condition for the Schatten-p quasi- 1/2 Rm d 1/2 PALM algorithm for solving (4) with a smooth loss norm in [34]. Let U⋆=L(d)Σ × , V⋆=R(d)Σ (d) ∈ (d) ∈ function, e.g., f( )= 1 2. Rn d and Σ = diag([σ (X ),. . ., σ (X ), 0,. . ., 0]) · 2k·k2 × (d) 1 0 r 0 Rd d ∈ Theorem 4. Let (Uk, Vk) be a sequence generated × , where L(d) and R(d) denote the matrices con- by our PALM algorithm,{ then} it is a Cauchy sequence sisting the top d left and right singular vectors of the and converges to a critical point of (4) with the squared true matrix X0 (which satisfies (X0)= b) with rank 2 A Rm n loss, 2. at most r (r d). ( ) := X × : (X)= 0 k·k denotes the null≤ spaceN ofA the{ linear∈ operatorA . Then} The proof of Theorem 4 can be found in the Supple- A mentary Materials. Theorem 4 shows the global con- we have the following theorem, the proof of which is vergence of our PALM algorithm. We emphasize that, provided in the Supplementary Materials. different from the general subsequence convergence Theorem 6. X0 can be uniquely recovered by (3), if T T T property, the global convergence property is given by and only if for any Z = U⋆W2 + W1V⋆ + W1W2 Rm d Rn d ∈ (U , V ) (U, V ) as the number of iteration k + , ( ) 0 , where W1 × , W2 × , we have k k → → ∞ N A \{ } ∈ ∈ where (U, V ) is a critical point of (4). On the con- r d trary, existingb b algorithms for solving non-convex and σi(W1)+σi(W2) < σi(W1)+σi(W2). (14) i=1 i=r+1 non-smoothb b problems, such as [14] and [10], have only X X subsequence convergence property. Remark: Since Γ ( ), where Γ = Z Z = U W T +W V T +W W⊂T ,Z N A ( ) 0 , the sufficient{ | By the Kurdyka-Lojasiewicz (KL) property (for more ⋆ 2 1 ⋆ 1 2 ∈N A \{ }} details, see [28]) and Theorem 2 in [33], our PALM condition in Theorem 6 is weaker than the correspond- algorithm has the following convergence rate: ing sufficient condition for the Schatten-p quasi-norm in [34]. Theorem 5. The sequence (Uk, Vk) generated by our PALM algorithm converges{ to a} critical point 1 2 5.2 RSC based Error Bound (U, V ) of F with f( )= 2 2, which satisfies the KL · k·k 1 θ property at each point of dom ∂F with φ(s)=cs − for Unlike most of existing recovery guarantees as in [17, c>b 0band θ [0, 1). We have ∈ 34], we do not impose the restricted isometry property If θ = 0, (U , V ) converges to (U, V ) in finite (RIP) on the general operator , rather, we require the k k A • steps; { } operator to satisfy a weaker and more general condi- A If θ (0, 1/2], then C>0 and γ [0b, 1)b such that tion known as restricted strong convexity (RSC) [35], • ∈ ∃ ∈ [U T, V T ] [U T, V T ] Cγk; as shown in the following. k k k − kF ≤ b b Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

T 1/4 Assumption 1 (RSC). We suppose that there is a X0 UV F E F md log(m) µ√d k − k k k +C2δ + , positive constant κ( ) such that the general operator √mn ≤ √mn Ω 2C Ω m n l A  | |  3 : R × R satisfies the following inequality b b | | A → ˆ ˆ T ˆ Ω(D UV )V Fp kP − k 1 where δ =maxi,j Di,j and C3 = ˆ ˆ T . (∆) 2 κ( ) ∆ F | | Ω(D UV ) F √l kA k ≥ A k k kP − k The proof of Theorem 8 and the analysis of lower- m n for all ∆ R × . boundedness of C3 can be found in the Supplementary ∈ Materials. When the samples size Ω md log(m), We mainly provide the RSC based error bound for ro- | | ≫ bust recovery via our bi-trace quasi-norm algorithm the second and third terms diminish, and the re- with noisy measurements. To our knowledge, our re- covery error is essentially bounded by the “average” covery guarantee analysis is the first one for solutions of entries of noise E. In other words, only O(md log(m)) observed entries are needed, signif- generated by Schatten quasi-norm algorithms, not for 2 the global optima1 of (4) as in [36, 17, 34]. icantly lower than O(mr log (m)) in standard matrix completion theories [37, 39, 7], which will be confirmed Rm n Theorem 7. Assume X0 × is a true matrix and by the following experimental results. the corrupted measurements∈ (X )+ e = b, where e is A 0 noise with e 2 ǫ. Let (U,ˆ Vˆ ) be a critical point of k k ≤ 2 6 Experimental Results (4) with the squared loss 2, and suppose the operator satisfies the RSC conditionk·k with a constant κ( ). ThenA A We evaluate both the effectiveness and efficiency of our methods (i.e., the Bi-tr and Tri-tr methods) for X UˆVˆ T ǫ µ√d k 0 − kF + , solving MC and RPCA problems, such as collaborative √mn ≤ κ( )√lmn 2C1κ( )√lmn filtering and text separation. All experiments were A A conducted on an Intel Xeon E7-4830V2 2.20GHz CPU ∗ T (b (UˆVˆ ))Vˆ F where C1 = kA −A T k . with 64G RAM. b (Uˆ Vˆ ) 2 k −A k The proof of Theorem 7 and the analysis of lower- 6.1 Synthetic Matrix Completion boundedness of C1 is provided in the Supplementary Materials. m n The synthetic matrices X0 R × with rank r are gen- erated randomly by the following∈ procedure: the en- m r n r 5.3 Error Bound on Matrix Completion tries of both random matrices P R × and Q R × are first generated as independent∈ and identically∈ dis- Although the MC problem is a practically important T tributed (i.i.d.) numbers, and then X0 = PQ is application of (4), the projection operator in (15) PΩ assembled. The experiments are conducted on ran- does not satisfy the standard RIP and RSC conditions dom matrices with different noise factors, nf =0.1 or in general [1, 37, 38]. Therefore, we also need to pro- 0.2, where the observed subset is corrupted by i.i.d. vide the recovery guarantee for performance of our Bi- standard Gaussian random variables as in [18]. In tr quasi-norm minimization for solving the following both cases, the sampling ratio (SR) is set to 20% or MC problem. 30%. We use the relative standard error (RSE :=

U tr+ V tr 1 T 2 X X0 F / X0 F ) as the evaluation measure, where min k k k k + Ω(UV ) Ω(D) F . (15) k − k k k U,V 2 2µkP −P k X denotes the recovered matrix.   We compare our methods with two trace norm solvers: Without loss of generality, assume that the observed NNLS [40] and ALT [4], one bilinear spectral regu- matrix D Rm n can be decomposed as a true matrix × larization method, LRMF [20], and two Schatten-p X of rank∈ r d and a random Gaussian noise E, i.e., 0 norm methods, IRLS [14] and IRNN [10]. The re- D =X +E. We≤ give the following recovery guarantee 0 covery results of IRLS and IRNN (p 0.1, 0.2,..., 1 ) for our Bi-tr quasi-norm minimization (15). on noisy random matrices are shown∈{ in Figure 4, from} Theorem 8. Let (U, V ) be a critical point of the prob- which we can observe that as a scalable alternative to lem (15) with given rank d, and m n. Then there trace norm regularization, LRMF with relatively small ≥ exists an absolute constantb b C2, such that with proba- ranks often obtains more accurate solutions than its bility at least 1 2 exp( m), trace norm counterparts, i.e., NNLS and ALT. If p is − − 1 chosen from the range of 0.3, 0.4, 0.5, 0.6 , IRLS and It is well known that the Schatten-p quasi-norm (0< IRNN have similar performance,{ and usually} outper- p<1) problems in [15, 11, 14, 10, 9] are non-convex, non- smooth and non-Lipschitz [24]. The recovery guarantees form NNLS, ALT and LRMF in terms of RSE, oth- in [36, 17, 34] are naturally based on the global optimal erwise they sometimes perform much worse than the solution of associated models. latter three methods, especially p=1. This means that Fanhua Shang, Yuanyuan Liu, James Cheng

0.25 0.14 NNLS 0.1 0.2 ALT LRMF 0.12 0.2 0.08 IRLS 0.15 IRNN 0.1 RSE RSE RSE Tri−tr 0.06 RSE 0.15 Bi−tr 0.1 0.04 0.08 0.1 0.05 0.02 0.06 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 The value of p The value of p The value of p The value of p 0.1 0.07 0.09 NNLS 0.16 ALT 0.06 0.08 0.08 LRMF 0.14 IRLS 0.05 0.07 IRNN 0.12 RSE RSE 0.06 Tri−tr RSE RSE 0.1 0.04 0.06 Bi−tr 0.03 0.05 0.04 0.08 0.06 0.02 0.04 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 The value of p The value of p The value of p The value of p (a) 20% SR and nf =0.1 (b) 20% SR and nf =0.2 (c) 30% SR and nf =0.1 (d) 30% SR and nf =0.2 Figure 1: The recovery accuracy of NNLS, ALT, LRMF, IRLS, IRNN, and our Tri-tr and Bi-tr methods on noisy random matrices of size 100 100 (the first row) or 200 200 (the second row). × ×

1 0.88 both our methods (which are in essence the Schatten- NNLS ALT 0.86 1/2 and 1/3 quasi-norm algorithms) should perform LRMF 0.95 LMaFit better than them. As expected, the RSE results of IRLS 0.84 both our methods under all of these settings are consis- IRNN 0.9 Tri−tr 0.82 Testing RMSE tently much better than those of the other approaches. Testing RMSE Bi−tr 0.8 This clearly justifies the usefulness of our Bi-tr and 0.85 5 10 15 20 5 10 15 20 Tri-tr quasi-norm penalties. Moreover, the running Rank Rank time of all these methods on random matrices with (a) MovieLens1M (b) MovieLens10M different sizes is provided in the Supplementary Mate- 0.94 rials, which shows that our methods are much faster 0.95 than the other methods. This confirms that both our 0.92 0.9 methods have very good scalability and can address 0.9 large-scale problems. 0.85 0.88 Testing RMSE Testing RMSE 0.86 0.8 0.84 6.2 Collaborative Filtering 5 10 15 20 5 10 15 20 Rank Rank We test our methods on the real-world recommenda- (c) MovieLens20M (d) Netflix tion system datasets: MovieLens1M, MovieLens10M Figure 2: Evolution of the testing RMSE of different 2 and MovieLens20M , and Netflix [41]. We randomly methods with ranks varying from 5 to 20. choose 90% as the training set and the remaining as the testing set, and the experimental results are re- ported over 10 independent runs. Besides those meth- shown in Figures 5(b)-(d). In most cases, the sophis- ods used above, we also compare our methods to one ticated matrix factorization based approaches outper- of the fastest methods, LMaFit [42], and use the root form LMaFit as a baseline method without any regu- mean squared error (RMSE) as evaluation measure. larization term. This suggests that those regularized The testing RMSE of all those methods on the four models can alleviate the over-fitting problem of matrix datasets is reported in Figure 5, where the rank varies factorization. The testing RMSE of both our meth- from 5 to 20 (the running time of all methods are pro- ods varies only slightly when the number of the given vided in Supplementary Materials). From all these re- rank increases, while that of the other matrix factor- sults, we can observe that for these fixed ranks, the ma- ization methods changes dramatically. This further trix factorization methods including LMaFit, LRMF means that our methods perform much more robust and our methods significantly perform better than the than them in terms of the given ranks. More impor- trace norm solvers including NNLS and ALT in terms tantly, both our methods under all of the rank settings of RMSE, especially on the three larger datasets, as consistently outperform the other methods in terms of prediction accuracy. This confirms that our Bi-tr 2http://www.grouplens.org/node/73 or Tri-tr quasi-norm regularized models can provide a Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

(a) Images (b) PCP (c) LRMF+ℓ1 (d) Sp+ℓp (e) Tri-tr+ℓ1 (f) Bi-tr+ℓ1 (g) Bi-tr+ℓ1/2 Figure 3: Text separation results. The first and second rows mainly show the detected texts and the recovered background images: (a) Input image (upper) and original image (bottom); (b) AUC: 0.8939, RSE: 0.1494; (c) AUC: 0.9058, RSE: 0.1406; (d) AUC: 0.9425, RSE: 0.1342; (e) AUC: 0.9356, RSE: 0.1320; (f) AUC: 0.9389, RSE: 0.1173; (g) AUC: 0.9731, RSE: 0.0853. good estimation of a low-rank matrix. Note that IRLS 6.91sec, 163.65sec, 0.96sec, 0.57sec and 1.62sec, respec- and IRNN could not run on the three larger datasets tively. In other words, our three methods are at least due to runtime exceptions. Moreover, our methods are 7, 12 and 4 times faster than the other methods, re- much faster than LRMF, NNLS, ALT, IRLS and IRNN spectively. This is a very impressive result as our three on all these datasets, and are comparable in speed with methods are nearly 170, 290 or 100 times faster than LMaFit. This shows that our methods have very good the most related Sp+ℓp method, which further con- scalability and can solve large-scale problems. firms that our methods have good scalability.

6.3 Text Separation 7 Conclusions We conducted an experiment on artificially gener- ated data to separate some text from an image. The In this paper, we defined two tractable Schatten quasi- ground-truth image is of size 256 256 with rank equal norm formulations, and then proved that they are in × to 10. Figure 6(a) shows the input image together essence the Schatten-1/2 and 1/3 quasi-norms, respec- with the original image. The input data are gener- tively. By applying the two defined quasi-norms to ated by setting 10% of the randomly selected pixels as various rank minimization problems, such as MC and missing entries. We compare our Bi-tr+ℓ1, Tri-tr+ℓ1 RPCA, we achieved some challenging non-smooth and and Bi-tr+ℓ1/2 methods (see Supplementary Materials non-convex problems. Then we designed two classes for the details) to three state-of-the-art methods, in- of efficient PALM and LADM algorithms to solve cluding PCP [2], LRMF+ℓ1 [43] and Sp+ℓp [11] with our problems with smooth and non-smooth loss func- 0

References [14] M. Lai, Y. Xu, and W. Yin. Improved itera- tively rewighted least squares for unconstrained [1] E. Cand`es and B. Recht. Exact matrix completion smoothed ℓp minimization. SIAM J. Numer. via convex optimization. Found. Comput. Math., Anal., 51(2):927–957, 2013. 9(6):717–772, 2009. [15] G. Marjanovic and V. Solo. On ℓp optimization [2] E. Cand`es, X. Li, Y. Ma, and J. Wright. Robust and matrix completion. IEEE Trans. Signal Pro- principal component analysis? J. ACM, 58(3):1– cess., 60(11):5714–5724, 2012. 37, 2011. [16] F. Nie, H. Huang, and C. Ding. Low-rank matrix [3] G. Liu, Z. Lin, and Y. Yu. Robust subspace seg- recovery via efficient Schatten p-norm minimiza- mentation by low-rank representation. In ICML, tion. In AAAI, pages 655–661, 2012. pages 663–670, 2010. [17] M. Zhang, Z. Huang, and Y. Zhang. Restricted [4] C. Hsieh and P. A. Olsen. Nuclear norm mini- p-isometry properties of nonconvex matrix recov- mization via active subspace selection. In ICML, ery. IEEE Trans. Inform. Theory, 59(7):4316– pages 575–583, 2014. 4323, 2013.

[5] A. Argyriou, C. A. Micchelli, M. Pontil, and [18] F. Shang, Y. Liu, and J. Cheng. Scalable algo- Y. Ying. A spectral regularization framework for rithms for tractable Schatten quasi-norm mini- multi-task structure learning. In NIPS, pages 25– mization. In AAAI, pages 2016–2022, 2016. 32, 2007. [19] N. Srebro, J. Rennie, and T. Jaakkola. Maximum- margin matrix factorization. In NIPS, pages [6] M. Fazel, H. Hindi, and S. P. Boyd. A rank 1329–1336, 2004. minimization heuristic with application to mini- mum order system approximation. In ACC, pages [20] K. Mitra, S. Sheorey, and R. Chellappa. Large- 4734–4739, 2001. scale matrix factorization with missing data under additional constraints. In NIPS, pages 1642–1650, [7] B. Recht, M. Fazel, and P. A. Parrilo. Guar- 2010. anteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM [21] A. Aravkin, R. Kumar, H. Mansour, B. Recht, Rev., 52:471–501, 2010. and F. J. Herrmann. Fast methods for denois- ing matrix completion formulations, with applica- [8] J. Fan and R. Li. Variable selection via noncon- tions to robust seismic data interpolation. SIAM cave penalized likelihood and its Oracle proper- J. Sci. Comput., 36(5):S237–S266, 2014. ties. J. Am. Statist. Assoc., 96:1348–1361, 2001. [22] S. Foucart and M. Lai. Sparsest solutions of un- [9] Z. Lu and Y. Zhang. Schatten-p quasi- derdetermined linear systems via ℓq-minimization norm regularized matrix optimization via it- for 0 < q 1. Appl. Comput. Harmon. Anal., erative reweighted singular value minimization. 26:397–407,≤ 2009. arXiv:1401.0869v2, 2015. [23] Y. Liu, F. Shang, H. Cheng, and J. Cheng. [10] C. Lu, J. Tang, S. Yan, and Z. Lin. Generalized A Grassmannian manifold algorithm for nuclear nonconvex nonsmooth low-rank minimization. In norm regularized least squares problems. In UAI, CVPR, pages 4130–4137, 2014. pages 515–524, 2014. [11] F. Nie, H. Wang, X. Cai, H. Huang, and C. Ding. [24] W. Bian, X. Chen, and Y. Ye. Complexity anal- Robust matrix completion via joint Schatten p- ysis of interior point algorithms for non-Lipschitz norm and Lp-norm minimization. In ICDM, pages and nonconvex minimization. Math. Program., 566–574, 2012. 149:301–327, 2015. [12] A. Majumdar and R. K. Ward. An algorithm for [25] F. Shang, Y. Liu, J. Cheng, and H. Cheng. Robust sparse MRI reconstruction by Schatten p-norm principal component analysis with missing data. minimization. Magn. Reson. Imaging, 29:408– In CIKM, pages 1149–1158, 2014. 417, 2011. [26] F. Shang, Y. Liu, J. Cheng, and H. Cheng. Re- [13] K. Mohan and M. Fazel. Iterative reweighted al- covering low-rank and sparse matrices via robust gorithms for matrix rank minimization. J. Mach. bilateral factorization. In ICDM, pages 965–970, Learn. Res., 13:3441–3473, 2012. 2014. Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

[27] M. Jaggi. Revisiting Frank-Wolfe: Projection- [40] K.-C. Toh and S. Yun. An accelerated proximal free sparse convex optimization. In ICML, pages gradient algorithm for nuclear norm regularized 427–435, 2013. least squares problems. Pac. J. Optim., 6:615– 640, 2010. [28] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for noncon- [41] KDDCup. ACM SIGKDD and Netflix. In Proc. vex and nonsmooth problems. Math. Program., KDD Cup and Workshop, 2007. 146:459–494, 2014. [42] Z. Wen, W. Yin, and Y. Zhang. Solving a low- rank factorization model for matrix completion by [29] J. Yang and X. Yuan. Linearized augmented a nonlinear successive over-relaxation algorithm. Lagrangian and alternating direction methods Math. Prog. Comp., 4(4):333–361, 2012. for nuclear norm minimization. Math. Comp., 82:301–329, 2013. [43] R. Cabral, F. Torre, J. Costeira, and A. Bernardino. Unifying nuclear norm and [30] J. Cai, E. Cand`es, and Z. Shen. A singular bilinear factorization approaches for low-rank value thresholding algorithm for matrix comple- matrix decomposition. In ICCV, pages 2488– tion. SIAM J. Optim., 20(4):1956–1982, 2010. 2495, 2013. [31] I. Daubechies, M. Defrise, and C. DeMol. An [44] R. Mazumder, T. Hastie, and R. Tibshirani. iterative thresholding algorithm for linear inverse Spectral regularization algorithms for learning problems with a sparsity constraint. Commun. large incomplete matrices. J. Mach. Learn. Res., Pur. Appl. Math., 57(11):1413–1457, 2004. 11:2287–2322, 2010.

[32] Z. Lin, R. Liu, and Z. Su. Linearized alternat- [45] D. P. Bertsekas. Nonlinear Programming. The ing direction method with adaptive penalty for 2nd edition, Athena Scientific, Belmont, 2004. low-rank representation. In NIPS, pages 612–620, [46] M. C. Yue and A. M. C. So. A perturbation 2011. inequality for concave functions of singular val- ues and its applications in low-rank matrix recov- [33] H. Attouch and J. Bolte. On the convergence ery. Appl. Comput. Harmon. Anal., 40(2):396– of the proximal algorithm for nonsmooth func- 416, 2016. tions involving analytic features. Math. Program., 116:5–16, 2009. [47] Y. Wang and H. Xu. Stability of matrix factor- ization for collaborative filtering. In ICML, pages [34] S. Oymak, K. Mohan, M. Fazel, and B. Hassibi. A 417–424, 2012. simplified approach to recovery conditions for low rank matrices. In ISIT, pages 2318–2322, 2011. [48] D. Krishnan and R. Fergus. Fast image decon- volution using hyper-Laplacian priors. In NIPS, [35] S. Negahban, P. Ravikumar, M. J. Wainwright, pages 1033–1041, 2009. and B. Yu. A unified framework for highdi- mensional analysis of M-estimators with decom- [49] J. Zeng, S. Lin, Y. Wang, and Z. Xu. L1/2 reg- posable regularizers. In NIPS, pages 1348–1356, ularization: Convergence of iterative half thresh- 2009. olding algorithm. IEEE Trans. Signal Process., 62(9):2317–2329, 2014. [36] A. Rohde and A. B. Tsybakov. Estimation [50] R. Larsen. PROPACK-software for large of high-dimensional low-rank matrices. Ann. and sparse SVD calculations. Available from Statist., 39(2):887–930, 2011. http://sun.stanford.edu/srmunk/PROPACK/, [37] E. Cand`es and Y. Plan. Matrix completion with 2005. noise. Proc. IEEE, 98(6):925–936, 2010.

[38] P. Jain, R. Meka, and I. Dhillon. Guaranteed rank minimization via singular value projection. In NIPS, pages 937–945, 2010.

[39] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Trans. In- form. Theory, 56(6):2980–2998, 2010. Fanhua Shang, Yuanyuan Liu, James Cheng

Supplementary Materials for “Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization”

Fanhua Shang Yuanyuan Liu James Cheng Department of Computer Science and Engineering, The Chinese University of Hong Kong

In this supplementary material, we give the detailed The Frobenius norm (also called the Schatten-2 norm) proofs of some lemmas, properties and theorems, as of X is defined as well as some additional experimental results on syn- min(m,n) thetic data and four recommendation system data sets. X = Tr (XT X)= σ2(X). k kF v i u i=1 q u X 8 More Notations t

Rn denotes the n-dimensional , and the set of all m n matrices with real entries is denoted m n × m n by R × . Given matrices X and Y R × , the inner 9 Proof of Theorem 1 product is defined by X, Y := Tr(X∈T Y ), where Tr( ) denotes the trace ofh a matrix.i X is the spectral· In the following, we will first prove that the bi-trace k k2 norm and is equal to the maximum singular value of norm Bi-tr is a quasi-norm. k·k X. I denotes an identity matrix. n Proof. By the definition of the bi-trace norm, for any For any vector x R , its ℓp quasi-norm for 0 σ (X) = = σ (X) = 0, where min T U F V F , and Lemma 6 in [44], there r r+1 min(m,n) X=UV k k k k r = rank(X). By writing··· X = UΣV T in its standard 1/2 1/2 exist both matrices U = U(d)Σ(d) and V = V(d)Σ(d) singular value decomposition (SVD), we can extend (which are constructed in the same way as U⋆ and T X = UΣV to the following definitions. V in Section 5.1) suchb that X = U b V with ⋆ k ktr k kF k kF The Schatten-p quasi-norm (0

m n By the above properties, there exists a constant α 1 such that the following holds for all X, Y R × ≥ ∈ X + Y α X + Y k kBi-tr ≤ k ktr α( X + Y ) ≤ k ktr k ktr α( X + Y ). ≤ k kBi-tr k kBi-tr m n T X R × and X = UV , we have ∀ ∈ X Bi-tr = min U tr V tr 0. k k X=UV T k k k k ≥

Moreover, if X Bi-tr = 0, we have X tr X Bi-tr = 0, i.e., X tr = 0. Hence, X = 0. In short, the bi-trace norm k isk a quasi-norm. k k ≤k k k k k·kBi-tr

Before giving a complete proof for Theorem 1, we first present and prove the following lemmas. Lemma 2 (Jensen’s inequality). Assume that the function g : R+ R+ is a continuous concave function on [0, + ). Then, for any t 0, t =1, and any x R+ for i =1→,...,n, we have ∞ i ≥ i i i ∈ P n n g t x t g(x ). (16) i i ≥ i i i=1 ! i=1 X X m n T m r Lemma 3. Suppose Z R × is of rank r min(m,n), and denote its SVD by Z = LΣZ R , where L R × , n r r∈ r ≤ T T T ∈ R R × and ΣZ R × . For any A satisfying AA = A A = Ir r, then (AΣZ A )k,k 0 for any∈ k =1,...,r, and∈ × ≥ 1/2 T 1/2 1/2 Tr (AΣZ A ) Tr (ΣZ )= Z , ≥ k kS1/2 1/2 1/2 where Tr (B)= i Bii .

Proof. For any k P1,...,r , we have ∈{ } (AΣ AT ) = a2 σ 0, Z k,k ki i ≥ i X where σ 0 is the i-th singular value of Z. Then i ≥ 1/2 T 2 1/2 Tr (AΣZ A )= ( akiσi) . (17) i Xk X Let t 0, t = 1, and x 0. Since the function g(x)= x1/2 is concave on R+, and by Lemma 2, we have i ≥ i i i ≥ P g t x t g(x ). i i ≥ i i i ! i X X Due to the fact that a2 = 1 for any k 1,...,r , we have i ki ∈{ } P 1/2 a2 σ a2 σ1/2. ki i ≥ ki i i ! i X X 2 According to the above inequality and the fact that k aki = 1 for any i 1,...,r , (17) can be written as follows: ∈ { } P 1/2 T 2 1/2 Tr (AΣZ A )= ( akiσi) i Xk X a2 σ1/2 ≥ ki i i Xk X 1/2 = σi i X 1/2 1/2 = Tr (ΣZ )= Z . k kS1/2 This completes the proof. Fanhua Shang, Yuanyuan Liu, James Cheng

Proof of Theorem 1:

T T Proof. Assume that U = LU ΣU RU and V = LV ΣV RV are the thin SVDs of U and V , respectively, where Rm d Rn d Rd d T LU × , LV × , and RU , ΣU , RV , ΣV × . Without loss of generality, we set X =LX ΣX RX , where ∈ ∈ m d n d ∈ the columns of LX R × and RX R × are the left and right singular vectors associated with the top d ∈ ∈ d d singular values of X with rank at most r (r d), and Σ = diag([σ (X), , σ (X), 0, , 0]) R × . ≤ X 1 ··· r ··· ∈ T T T T Rd d By X = UV , which means that LX ΣX RX = LU ΣU RU RV ΣV LV , then O1, O1 × satisfy LX = LU O1 T T T ∃ ∈ and LU = LX O1. Since O1 = LU LX and O1 = LX LU , then O1 = O1. Indeed, since LX = LU O1 = LX O1O1, T b T we immediately have O1O1 = O1 O1 = Id d. In addition, we obviously have O1O1 = O1O1 = Id d. Similarly, Rd d T T × × O2 × satisfyingb O2O2 = O2 O2 = Idb d such that RX = LV O2b. Therefore, we have b ∃ ∈ × b T T T T T T b ΣX O2 =ΣX RX LV = LX LU ΣU RU RV ΣV = O1 ΣU RU RV ΣV .

T Rd d T T 2 2 Let O3 = O2O1 × , then we have O3O3 = O3 O3 = Id d, i.e., (O3)ij = (O3)ij = 1 for i, j ∈ × i j ∀ ∈ 1, 2,...,d , where ai,j denotes the element of the matrix A in the i-th row and the j-th column. Furthermore, {let O = R}T R , we have (O )2 1 and (O )2 1 for i, j 1,P2,...,d . P 4 U V i 4 ij ≤ j 4 ij ≤ ∀ ∈{ } T T By the above analysis, thenP we have O2ΣX OP2 = O2O1 ΣU O4ΣV = O3ΣU O4ΣV . Let ̺i and τj denote the i-th and the j-th diagonal elements of ΣV and ΣU , respectively. By Lemma 3, we can derive that 2 2 2 X Tr1/2(O Σ OT ) = Tr1/2(O OT Σ O Σ ) = Tr1/2(O Σ O Σ ) k kS1/2 ≤ 2 X 2 2 1 U 4 V 3 U 4 V    2   2  d d d d = τ (O ) (O ) ̺ = ̺ τ (O ) (O )  v j 3 ij 4 ji i  v i j 3 ij 4 ji i=1 uj=1 i=1 u j=1 X uX X u X d td d   t  a ̺ (τ (O ) (O ) ) ≤ i j 3 ij 4 ji i=1 i=1 j=1 X X X d d d 2 2 ((O3) τj + (O4) τj ) b ̺ ij ji ≤ i 2 i=1 i=1 j=1 X X X d d c ̺ τ ≤ i j i=1 j=1 X X U + V 2 = U V k ktr k ktr , k ktrk ktr ≤ 2   where the inequality a holds due to the Cauchy Schwartz inequality, the inequality b follows from the basic 2 2 ≤ − ≤ inequality xy x +y for any real numbers x and y, and the inequality c relies on the fact that (O )2 =1 ≤ 2 ≤ i 3 ij and (O )2 1. Thus, we obtain i 4 ji ≤ P P U + V 2 U 2 + V 2 X U V k ktr k ktr and X U V k ktr k ktr . k kS1/2 ≤k ktrk ktr ≤ 2 k kS1/2 ≤k ktrk ktr ≤ 2  

1/2 1/2 T On the other hand, set U⋆ = LX ΣX and V⋆ = RX ΣX , then we have X = U⋆V⋆ and L Σ1/2 2 + R Σ1/2 2 X = [Tr1/2(Σ )]2 = L Σ1/2 R Σ1/2 = U V = k X X ktr k X X ktr k kS1/2 X k X X ktrk X X ktr k ⋆ktrk ⋆ktr 2 2 2 2 1/2 1/2 2 U + V LX Σ tr + RX Σ tr U + V = k ⋆ktr k ⋆ktr = k X k k X k = k ⋆ktr k ⋆ktr . 2 2 2 !   Therefore, under the constraint X = UV T , we have 2 2 2 U tr + V tr U tr + V tr X S1/2 = min k k k k = min k k k k = min U tr V tr = X Bi-tr. k k X=UV T 2 X=UV T 2 X=UV T k k k k k k   This completes the proof. Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

10 Proof of Theorem 3

Before giving the proof of Theorem 3, we first prove the boundedness of multipliers and some variables of Algorithm 1. To prove the boundedness, we first give the following lemmas. Lemma 4 ([7]). Let be a real endowed with an inner product , and a corresponding norm H h· ·i , and y ∂ x , where ∂f(x) denotes the subgradient of f(x). Then y ∗ =1 if x =0, and y ∗ 1 if x =0, k·k ∈ k k k k 6 k k ≤ where ∗ is the of . For instance, the dual norm of the trace norm is the spectral norm, 2, i.e., thek·k largest singular value. k·k k·k Lemma 5 ([45]). Assume that g is Lipschitz continuous on dom(g) := X g(X) < satisfying the following condition: g(X) g(Y ) ∇L X Y , X, Y dom(g), with a{ Lipschitz| constant∞} L . Then k∇ − ∇ kF ≤ gk − kF ∀ ∈ g L g(X) g(Y )+ g(Y ), X Y + g X Y 2 , X, Y dom(g). ≤ h∇ − i 2 k − kF ∀ ∈ T Lemma 6. Let λk+1 = λk + βk( (Uk+1Vk+1) b ek+1), then the sequences Uk, Vk , ek and λk produced by Algorithm 1 are all bounded. A − − { } { } { }

Proof. By the first-order optimality condition of the Lagrangian function (U,V,e,λ,β) with respect to e, we have L 0 ∂ (U , V ,e,λ ,β ), ∈ eL k+1 k+1 k k which equivalently states that 1 β (U V T ) e b + λ ∂ e . k A k+1 k+1 − k+1 − k ∈ µ k k+1k1  Recalling λ = λ + β ( (U V T ) b e ), we have λ 1 ∂ e . By Lemma 4, we have k+1 k k A k+1 k+1 − − k+1 k+1 ∈ µ k k+1k1 1 λk+1 , k k∞ ≤ µ where is the dual norm of 1. Thus, the sequence λk is bounded. k·k∞ k·k { } By the definition of the linearization function ϕk(U,Uk) in (8), we have ϕk(Uk,Uk)= ϕk(Uk). Since Uk+1 is the optimal solution of (9), and by Lemma 5, then we can derive that b b (U , V ,e , λ ,β ) L k+1 k k k k 1 β = U + k ϕ (U )+ c 2k k+1ktr 2 k k+1 1 β U + k ϕ (U ,U )+ c ≤2k k+1ktr 2 k k+1 k 1 β U + k ϕ (U )+ c = (U , V ,e , λ ,β ), ≤2k kktr 2 kb k L k k k k k where c is a constant independent of both Uk and Uk+1. Similarly, we have (U , V ,e , λ ,β ) (U , V ,e , λ ,β ). L k+1 k+1 k k k ≤ L k+1 k k k k Furthermore, by the iteration procedure of Algorithm 1, we obtain (U , V ,e , λ ,β ) L k+1 k+1 k+1 k k (U , V ,e , λ ,β ) (U , V ,e , λ ,β ) ≤L k+1 k+1 k k k ≤ L k k k k k 2 = (Uk, Vk,ek, λk 1,βk 1)+ αk λk λk 1 2, L − − k − − k

βk+βk−1 where αk = 2 . 2βk−1 Since ∞ ρ(ρ + 1) αk = < + , 2β0(ρ 1) ∞ Xk=1 − Fanhua Shang, Yuanyuan Liu, James Cheng

and recall the boundedness of λk , we have that (Uk, Vk,ek, λk 1,βk 1) is upper-bounded. { } {L − − } T Note that λk = λk 1 + βk 1( (UkV ) b ek). Then we have − − A k − − 2 2 1 1 λk 2 λk 1 2 ( Uk tr + Vk tr)+ ek 1 = (Uk, Vk,ek, λk 1,βk 1) k k −k − k , 2 k k k k µk k L − − − 2βk 1 − which is also upper-bounded. Thus the sequences e , U and V are all bounded. { k} { k} { k}

Proof of Theorem 3:

T Proof. (I) By (Uk+1Vk+1) ek+1 b = (λk+1 λk)/βk, the boundedness of λk , and limk βk = , we have A − − − { } →∞ ∞ T lim (Uk+1Vk+1) ek+1 b 2 =0. k kA − − k →∞ Hence, (U , V , e ) approaches to a feasible solution. { k k k } In the following, we will prove that the sequences U , V and e are Cauchy sequences. { k} { k} { k} ϕ By the boundedness of λk , ek , Uk and Vk , then both ϕk(Uk) and tk are bounded. Furthermore, P ∂ U satisfies{ } the{ following} { } first-order{ } optimality condition∇ of (9) ∃ k+1 ∈ k k+1ktr 1 1 P + β tϕ U U + ϕ (U ) =0. (18) 2 k+1 k k k+1 − k tϕ ∇ k k  k  By Lemma 4, we have P 1, which implies that P is bounded. k k+1k2 ≤ { k+1}

T ∗((ρ + 1)λk ρλk 1)Vk ϕk(Uk)= ∗[ (UkVk ) ek b + λk/βk]Vk = A − − . (19) ∇ A A − − βk

Substituting (19) into (18), it is easy to see that

1 2 Pk+1 + βk ϕk(Uk) F Pk+1 +2 ∗((ρ + 1)λk ρλk 1)Vk F Uk+1 Uk F = k ∇ϕ k = k A ϕ − − k . k − k βktk 2βktk

Consequently, if m>n,

Un Um F Un Un+1 F + Un+1 Un+2 F + ... + Um 1 Um F k − k ≤k − k k − k k − − k Pn+1+2 ∗((ρ+1)λn ρλn 1)Vn F Pn+2+2 ∗((ρ+1)λn+1 ρλn)Vn+1 F Pm+2 ∗((ρ+1)λm 1 ρλm 2)Vm 1 F =k A ϕ − − k +k A ϕ − k +...+k A −ϕ− − − k 2βntn 2βn+1tn+1 2βm 1tm 1 − − 1 1 1 δC 1 1 ρδC δC ( + + ... + )= (1 + + ... + m n 1 ) < , ≤ βn βn+1 βm 1 βn ρ ρ − − (ρ 1)βn − − ∗ ∗ ∗ Pn+1+2 ((ρ+1)λn ρλn−1)Vn F Pn+2+2 ((ρ+1)λn+1 ρλn)Vn+1 F Pm+2 ((ρ+1)λm−1 ρλm−2)Vm−1 F k A ϕ − k k A ϕ − k k A ϕ − k where δC = max 2t , 2t ,..., 2t . Since { n n+1 m−1 } ρδC (ρ 1)β 0, it follows that indeed Uk is a Cauchy sequence. − n → { } Similarly, V and e are also Cauchy sequences. { k} { k} (II) Let (U , V ,e ) be a stationary point of (6), then the Karush-Kuhn-Tucker (KKT) conditions for (6) are formulated as∗ follows:∗ ∗

0 ∂ U tr +2 ∗(λ )V , ∈ k ∗k A ∗ ∗ T 0 ∂ V tr + 2( ∗(λ )) U , ∈ k ∗k A ∗ ∗ 1 0 ∂ e 1 λ , ∈ µ k ∗k − ∗ e = (U V T ) b, ∗ A ∗ ∗ − Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization where λ is the associated Lagrangian multiplier. The first-order optimality condition of each subproblem at the (k+1)-th∗ iteration is given by

1 0 ∂ U +2β tϕ U U + ϕ (U ) , ∈ k k+1ktr k k k+1 − k tϕ ∇ k k  k  1 0 ∂ V +2β tψ V V + ψ (V ) , (20) ∈ k k+1ktr k k k+1 − k ψ ∇ k k " tk # 1 0 ∂ e β (U V T ) e b + λ /β . ∈ µ k k+1k1 − k A k+1 k+1 − k+1 − k k  

Since Uk , Vk and ek are Cauchy sequences, then Uk+1 Uk F 0, Vk+1 Vk F 0 and ek+1 ek 2 0. Let{ U } ,{V } and {e }be their limit points, respectively,k − andkλ →be thek associated− k → Lagrangiank multiplier.− k → ∞ ∞ ∞ ∞ By the assumption λk+1 λk 2 0, and (Uk, Vk,ek) approaches a feasible solution, then βk ϕk(Uk) = T k − k → T ∇ T βk ∗[ (UkVk ) b ek+λk/βk]Vk ∗(λ )V and βk ψk(Vk)= βk ∗[ (Uk+1Vk ) b ek +λk/βk] Uk+1 A A T − − → A ∞ ∞ ∇ {A A − − } → [ ∗(λ )] U . Therefore, with k , the following holds A ∞ ∞ → ∞

0 ∂ U tr +2 ∗(λ )V , ∈ k ∞k A ∞ ∞ T 0 ∂ V tr + 2[ ∗(λ )] U , ∈ k ∞k A ∞ ∞ 1 0 ∂ e 1 λ , ∈ µ k ∞k − ∞ e = (U V T ) b. ∞ A ∞ ∞ −

Hence, the accumulation point (U ,U ,e ) of the sequence (Uk,Uk,ek) generated by Algorithm 1 satisfies the KKT conditions for the problem∞ (6).∞ ∞ { }

11 Proof of Theorem 4

2 To solve the bi-trace quasi-norm regularized problem (4) with the squared loss 2, the proposed algorithm is based on the proximal alternating linearized minimization (PALM) method for solvingk·k the following non-convex problem: min Q(x, y) := F (x)+ G(y)+ H(x, y), (21) x,y where F (x) and G(y) are proper lower semi-continuous functions, and H(x, y) is a smooth function with Lipschitz continuous gradients on any . In Section 4.2 of the main paper, we stated that our PALM algorithm alternates between two blocks of variables, U and V . We establish the global convergence of our PALM algorithm by transforming the problem (4) into a standard form (21), and show that the transformed problem satisfies the condition needed to establish the convergence. First, the minimization problem (4) can be expressed in the form of (21) by setting

F (U) := 1 U ; 2 k ktr 1 G(V ) := 2 V tr;  k k H(U, V ) := 1 (UV T ) b 2. 2µ kA − k2   The conditions for global convergence of the PALM algorithm proposed in [28] are shown in the following lemma.

Lemma 7. Let (xk,yk) be a sequence generated by the PALM algorithm proposed in [28]. This sequence converges to a critical{ point} of (21), if the following conditions hold:

1. Q(x, y) is a Kurdyka-Lojasiewicz (KL) function;

2. H(x, y) has Lipschitz constant on any bounded set; ∇ 3. (x ,y ) is a bounded sequence. { k k } Fanhua Shang, Yuanyuan Liu, James Cheng

As stated in Lemma 7, the first condition requires that the objective function satisfies the KL property (For more details, see [28]). It is known that any proper closed semi-algebraic function is a KL function as such a 1 θ function satisfies the KL property for all points in domf with φ(s)= cs − for some θ [0, 1) and some c> 0. Therefore, we first give the following definitions of semi-algebraic sets and functions, and∈ then prove that the proposed problem (4) with the squared loss 2 is also semi-algebraic. k·k2 Definition 3 ([28]). A subset S Rn is a real semi-algebraic set if there exists a finite number of real polynomial functions g , h : Rn R such⊂ that ij ij → S = u Rn : g (u)=0, h (u) < 0 . { ∈ ij ij } j i [ \ Moreover, a function g(u) is called semi-algebraic if its graph (u,t) Rn+1 : g(u)= t is a semi-algebraic set. { ∈ } Semi-algebraic sets are stable under the operations of finite union, finite intersections, complementation and Cartesian product. The following are the semi-algebraic functions or the property of semi-algebraic functions used below:

Real polynomial functions. • Finite sums and product of semi-algebraic functions. • Composition of semi-algebraic functions. • 2 Lemma 8. Each term in the proposed problem (4) with the squared loss 2 is a semi-algebraic function, and thus the function (4) is also semi-algebraic. k·k

m d n d Proof. It is easy to notice that the sets = U R × : U D1 and = V R × : V D2 U { ∈ k k∞ ≤ } V { ∈ k k∞ ≤ } are both semi-algebraic sets, where D1 and D2 denote two pre-defined upper-bounds for all entries of U and V , respectively. For both terms F (U) = 1 U and G(V ) = 1 V : According to [28], we can know that the ℓ -norm is a 2 k ktr 2 k ktr 1 semi-algebraic function. Since the trace norm is equivalent to the ℓ1-norm on singular values of the associated matrix, it is natural that the trace norm is also semi-algebraic. 1 T 2 For the third term H(U, V ) = 2µ (UV ) b 2, it is a real polynomial function, and thus is a semi-algebraic kA − k 2 function [28]. Therefore, the proposed problem (4) with the squared loss 2 is semi-algebraic due to the fact that a finite sum of semi-algebraic functions is also semi-algebraic. k·k

1 T 2 For the second condition in Lemma 7, H(U, V ) = 2µ (UV ) b 2 is a smooth polynomial function, and 1 T 1 T TkA − k H(U, V ) = ( ∗[ (UV ) b]V, ∗[ (UV ) b] U). It is natural that H(U, V ) has Lipschitz constant ∇ µ A A − µ {A A − } ∇ on any bounded set [40].

For the final condition in Lemma 7, Uk and Vk for any k = 1, 2,..., which implies the sequence (U , V ) is bounded. ∈ U ∈ V { k k } In short, we can know that three similar conditions as in Lemma 7 hold for our PALM algorithm. In other words, our PALM algorithm shares the same convergence property as in Lemma 7.

12 Proof of Theorem 6

In order to prove Theorem 6, we first introduce the following Lemma (i.e., the Lemma 11 in [34] or the Theorem 1 in [46]).

n1 n2 Lemma 9. Let A, B R × , for any p (0, 1], then we have ∈ ∈ n n σp(A) σp(B) σp(A B), | i − i |≤ i − i=1 i=1 X X where n = min(n1,n2). Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

Proof of Theorem 6:

Proof. (= ) By the definitions of U and V and Theorem 1 of the main paper, we have ⇒ ⋆ ⋆

T 1/2 U⋆ tr + V⋆ tr X0 = U⋆V , X0 = k k k k , (22) ⋆ k kS1/2 2 and rank(U ) = rank(X ) r and rank(V ) = rank(X ) r. Z = U W T + W V T + W W T , we then obtain ⋆ 0 ≤ ⋆ 0 ≤ ∀ ⋆ 2 1 ⋆ 1 2

T T T T X0 + Z = U⋆V + U⋆W + W1V + W1W ⋆ 2 ⋆ 2 (23) T = (U⋆ + W1)(V⋆ + W2) .

Recall that (U V T )= (X )= b. Then for all Z ( ) 0 , we get A ⋆ ⋆ A 0 ∈ N A \{ } (U + W )(V + W )T = (X + Z)= b. A ⋆ 1 ⋆ 2 A 0  T Thus all feasible solutions to (3) can be represented as X0 + Z with Z ( ). To prove that X0 = U⋆V⋆ is ∈ N A T uniquely recovered by (3), we need to show that any feasible solution X1 = U1V1 (X1 = X0) to (3) satisfies 1/2 1/2 m d n d 6 X1 > X0 , where U1 R × and V1 R × . Let Z1 = X1 X0, then Z1 ( ) 0 . Applying k kS1/2 k kS1/2 ∈ ∈ − ∈ N A \{ } Theorem 1 of the main paper and by (23), the following holds for some W1 and W2:

1/2 1/2 X1 = X0 + Z1 = ( U⋆ + W1 tr + V⋆ + W2 tr)/2, (24) k kS1/2 k kS1/2 k k k k

1/2 1/2 T where U⋆ + W1 = UX1ΣX1 , V⋆ + W2 = VX1ΣX1 , and UX1ΣX1VX1 is the same SVD form of X1 as that of X0. According to (22), (24), Theorem 1 of the main paper, and Lemma 9 with p = 1, we have

1/2 1/2 1 X1 = X0 + Z1 = ( U⋆ + W1 tr + V⋆ + W2 tr) k kS1/2 k kS1/2 2 k k k k d d 1 = σ (U + W )+ σ (V + W ) 2 i ⋆ 1 i ⋆ 2 i=1 i=1 ! X X 1 d d σ (U ) σ (W ) + σ (V ) σ (W ) ≥2 | i ⋆ − i 1 | | i ⋆ − i 2 | i=1 i=1 ! X X 1 r r d r r d σ (U ) σ (W )+ σ (W )+ σ (V ) σ (W )+ σ (W ) ≥2 i ⋆ − i 1 i 1 i ⋆ − i 2 i 2 "i=1 i=1 i=r+1 i=1 i=1 i=r+1 # X X X X X X r r 1 > σ (U )+ σ (V ) 2 i ⋆ i ⋆ i=1 i=1 ! X X 1 1/2 = ( U⋆ tr + V⋆ tr)= X0 , 2 k k k k k kS1/2

T which confirms that X0 = U⋆V⋆ is uniquely recovered by (3). ( =) Conversely if (14) does not hold for some W and W , i.e., ⇐ 1 2

r d (σ (W )+ σ (W )) (σ (W )+ σ (W )), (25) i 1 i 2 ≥ i 1 i 2 i=1 i=r+1 X X then we can find W and W such that (W ) = U and (W ) = V , where (W ) and (W ) denote the 1 2 1 r − ⋆ 2 r − ⋆ 1 r 2 r matrices induced by setting all but largest r singular values of W1 and W2 to 0, respectively. Using Theorem 1 Fanhua Shang, Yuanyuan Liu, James Cheng of the main paper, then we can derive that

1/2 X0 + Z k kS1/2 1 ( U + W + V + W ) ≤2 k ⋆ 1ktr k ⋆ 2ktr d d 1 = σ (W )+ σ (W ) 2 i 1 i 2 i=r+1 i=r+1 ! X X r r 1 σ (W )+ σ (W ) ≤2 i 1 i 2 i=1 i=1 ! X X 1 r r = σ (U )+ σ (V ) 2 i ⋆ i ⋆ i=1 i=1 ! X X 1/2 = X0 , k kS1/2

1/2 1/2 T i.e., X0 + Z X0 . This shows that X0 = U⋆V is not the unique minimizer. k kS1/2 ≤k kS1/2 ⋆ 13 Proof of Theorem 7

Proof. With the squared loss, 2, (4) can be reformulated as follows: k·k2

U tr + V tr 1 T 2 min k k k k + (UV ) b 2 . (26) U,V 2 2µkA − k  

Given V , the first-order optimality condition for the problem (26) with respect to U is given by

1 T 1 b ∗(b (UV ))V ∂ U . (27) µA − A ∈ 2 k ktr b b b b According to Lemma 4 and (27), we can know that

1 T 1 ∗(b (UV ))V , µkA − A k2 ≤ 2 where A 2 is the spectral norm of a matrix A andb theb dualb norm of the trace norm. Recall that ∗(b Tk k m d T A − (UV ))V R × and rank( ∗(b (UV ))V ) d, then we obtain A ∈ A − A ≤

b b b Tb b b T µ√d ∗(b (UV ))V √d ∗(b (UV ))V . (28) kA − A kF ≤ kA − A k2 ≤ 2 b b b b b b Let Xˆ = UV T . By the RSC assumption and (28), we have

ˆ ˆ b b X0 X F (X0 X) 2 k − k kA − k √mn ≤ κ( )√lmn A (X ) b b (Xˆ) kA 0 − k2 + k − A k2 ≤ κ( )√lmn κ( )√lmn A A e ∗(b (Xˆ))Vˆ = k k2 + kA − A kF κ( )√lmn C κ( )√lmn A 1 A ǫ µ√d + . ≤ κ( )√lmn 2C κ( )√lmn A 1 A Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

Lower bound on C1

Next we discuss the lower boundedness of C1, that is, it is lower bounded by a positive constant. By the characterization of the subdifferentials of the trace norm, we can know that ∂ X = Y Y, X = X , Y 1 . (29) k ktr { | h i k ktr k k2 ≤ } T µ Let Ξ = ∗(b (UV ))V , and by (27), we have that Ξ ∂ U . By (29), we obtain A − A ∈ 2 k ktr 2 b b b Ξ, U = U b. µ k ktr   Note that X X and X, Y X Y forb any same-sizedb matrices X and Y . Then k ktr ≥k kF h i≤k kF k kF 2 2 Ξ U Ξ, U = U U . µk kF k kF ≥ µ k ktr ≥k kF   Recall that U > 0 and µ = 0, thus we obtainb b b b k kF 6 T µ ∗(b (UV ))V = Ξ . b kA − A kF k kF ≥ 2 b b b U is the optimal solution of the problem (26) with given V , then

1 T 2 1 T 2 1 1 2 b (UV ) b < (UV )b b + U tr b = ν, 2µkA − k2 2µkA − k2 2k k ≤ 2µk k2 where ν > 0 is a constant. Hence,b b b b b

∗(b (Xˆ))Vˆ F √µ C1 = kA − A k > . b (Xˆ) 2√2ν k − A k2 14 Proof of Theorem 8

According to Theorem 4 of the main paper, we can know that (U, V ) is a critical point of the problem (15). To prove Theorem 8, we first give the following lemma [47]. 1 1 b b Lemma 10. Let (X) = X X F and ˆ(X) = Ω(X X) F be the actual and empirical loss √mn √ Ω L k − k L | | kP − k function respectively, where X, X Rm n (m n). Furthermore, assume entry-wise constraint max X δ. b× b i,j ij Then for all rank-r matrices X, with∈ probability≥ greater than 1 2 exp( m), there exists a fixed constant| C|≤such − − that b mr log(m) 1/4 sup ˆ(X) (X) Cδ , X S |L − L |≤ Ω ∈ r  | |  m n where S = X R × : rank(X) r, X √mnδ . r { ∈ ≤ k kF ≤ } 2 2 Suppose that M = maxi,j (Xij Xij ) (2δ) and ǫ = 9δ as in [47]. According to Theorem 2 in [47], thus we have − ≤ bsup ˆ(X) (X) X S |L − L | ∈ r 2ǫ M 2 2mr log(9δm/ǫ) 1/4 + ≤ Ω 2 Ω | |  | |  1/4 p18β1 mr log(m) +2δ ≤ Ω Ω | |  | |  1/4 p 18 mr log(m) = 2+ δ . ( Ω mr log(m))1/4 Ω  | |   | |  18 Therefore, the constant C can be set to 2+ ( Ω mr log(m))1/4 . | | Fanhua Shang, Yuanyuan Liu, James Cheng

Proof of Theorem 8:

Proof.

D UV T k − kF √mn b b D UV T (D UV T )V (D UV T )V k − kF kPΩ − kF + kPΩ − kF ≤ √mn − C3 Ω C3 Ω | | | | b b b b b b b b D UV T (D pUV T ) (D UpV T )V = k − kF kPΩ − kF + kPΩ − kF . √mn − Ω C3 Ω | | | | b b b b b b b p p 1 T 1 T Let τ(Ω) := D UV F Ω(D UV ) F , then we need to bound τ(Ω). It is clear that √mn k − k − √ Ω kP − k | | T T rank(UV ) d, and thus Ub Vb Sd. According to Lemmab b 10, then with probability greater than 1 2 exp( m), ≤ ∈ 18 − − then there exists a fixed constant C2 =2+ ( Ω mr log(m))1/4 such that b b b b | | UV T D (UV T ) (D) sup τ(Ω) = k − kF kPΩ −PΩ kF UˆVˆ T S √mn − Ω ∈ d | | (30) b b 1 b b md log(m) 4 p C δ . ≤ 2 Ω  | | 

T We also need to bound Ω(UV D)V F . Given V , the optimization problem with respect to U is formulated as follows: kP − k b b b U tr 1 b T 2 min k k + Ω(UV ) Ω(D) F . (31) U 2 2µkP −P k b Since (U, V ) is a KKT point of the problem (15), the first-order optimality condition for the problem (31) is given by b b µ (D UV T )V ∂ U . (32) PΩ − ∈ 2 k ktr b b b b Using Lemma 4, we obtain µ (UV T D)V , kPΩ − k2 ≤ 2 where X is the spectral norm of a matrix X.b Recallb thatb rank( (UV T D)V ) d, we have k k2 PΩ − ≤ b b √dµb (UV T D)V √d (UV T D)V . (33) kPΩ − kF ≤ kPΩ − k2 ≤ 2

By (30) and (33), we have b b b b b b

X UV T E D UV T k 0 − kF k kF + k − kF √mn ≤ √mn √mn b b b b E (D UV T )V k kF + τ(Ω) + kPΩ − kF ≤ √mn C3 Ω b1|b| b 4 E F md log(m)p √dµ k k + C2δ + . ≤ √mn Ω 2C Ω  | |  3 | | This completes the proof. p Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

1 10 1 10

0 0 10 10

−1 −1 10 10 C on noiseless matrices C on noiseless matrices 3 3 Lower bound on noiseless matrices −2 Lower bound on noiseless matrices −2 C on noisy matrices 10 C on noisy matrices 10 3 3 Lower bound on noisy matrices Lower bound on noisy matrices and lower bound and lower bound 3 3 −3 C −3 C 10 10

−4 −4 10 10

−2 −1 0 −2 −1 0 10 10 10 10 10 10 Sampling ratio Sampling ratio (a) Matrices of size 100×100 (b) Matrices of size 200×200

Figure 4: Average results (mean and std.) of C3 vs. sampling ratio. For a fixed sampling ratio, we can observe that although C3 is not a constant, the value of C3 is stable with respect to the projection operator Ω (Best viewed zoomed in). P

Lower bound on C3

Finally, we also discuss the lower boundedness of C3, that is, it is lower bounded by a positive constant. Let Q = (D UV T )V , and by (32), we have that Q µ ∂ U . By (29), we obtain PΩ − ∈ 2 k ktr 2 b b b Q, U =b U . µ k ktr   Note that A A and A, B A B forb any matricesb A and B of the same size. k ktr ≥k kF h i≤k kF k kF 2 2 Q U Q, U = U U . µk kF k kF ≥ µ k ktr ≥k kF   Recall that U > 0 and µ = 0, thus we obtainb b b b k kF 6 µ (D UV T )V = Q . b kPΩ − kF k kF ≥ 2 b b b Since U is the optimal solution of the problem (31) with given V , then 1 1 1 1 b (D UV T ) 2 < (D UV T ) 2 +b U (D) 2 = γ, 2µkPΩ − kF 2µkPΩ − kF 2k ktr ≤ 2µkPΩ kF where γ > 0 is a constant. Hence,b b b b b

T Ω(D UV )V F √µ C3 = kP − k > . T Ω(D UV ) F 2√2γ kP −b b bk b b √µ In fact, the value of C3 is much greater than its lower bound, 2√2γ , as shown in Figure 4, where the ordinate is the average results over 100 independent runs, and the abscissa denotes the sampling ratio, which is chosen from 0.002, 0.005, 0.01, 0.05, 0.1,..., 0.95, 0.99, 0.995, 0.999 . Moreover, the regularization parameter µ is set to 5 and{ 100 for noisy matrices (nf =0.1) and noiseless matrices,} respectively.

15 Complexity Analysis

For MC and RPCA problems, the running time of our PALM and LADM algorithms is mainly consumed in performing some matrix multiplications. The time complexity of some multiplications operators is O(mnd). In Fanhua Shang, Yuanyuan Liu, James Cheng

2 2 addition, the time complexity of performing SVD on matrices of the same sizes as Uk and Vk is O(md + nd ). In short, the total time complexity of our PALM and LADM algorithms is O(nmd) (d m,n). Moreover, it is known that the matrix multiplication on multicore architectures can be efficiently≪ implemented. Thus, in practice our PALM and LADM algorithms are fast and scales well to handle large-scale problems.

16 More Experimental Results

For the MC problem, e.g., synthetic matrix completion and collaborative filtering, we propose an efficient proximal alternating linearized minimization (PALM) algorithm to solve (15), and then extend it to solve the Tri-tr quasi- norm regularized matrix completion problem. In the following, we present our bi-trace quasi-norm minimization models for the RPCA problems (e.g., the text separation task): 1 1 T min ( U tr + V tr)+ Ω(D UV ) 1, (34) U, V 2 k k k k µkP − k and 1 1 T 1/2 min ( U tr + V tr)+ Ω(D UV ) , (35) U, V 2 k k k k µkP − k1/2 where Ω denotes the linear projection operator, i.e., Ω(D)ij = Dij if (i, j) Ω, and Ω(D)ij = 0 otherwise. SimilarP to (34), the Tri-tr quasi-norm penalty can alsoP be used to the RPCA problem.∈ P To efficiently solve the RPCA problems (34) and (35), we also need to introduce an auxiliary variable E (as the same role as e), and can assume, without loss of generality, that the unknown entries of D are simply set as T zeros, i.e., DΩC = 0, and EΩC may be any values such that ΩC (D) = ΩC (UV )+ ΩC (E). Therefore, the RPCA problems (34) and (35) are reformulated as follows: P P P

1 1 T min ( U tr + V tr)+ Ω(E) 1, s.t., UV + E = D, (36) U,V,E 2 k k k k µkP k

1 1 1/2 T min ( U tr + V tr)+ Ω(E) , s.t., UV + E = D. (37) U,V,E 2 k k k k µkP k1/2

In fact, we can apply directly Algorithm 1 with the soft-thresholding operator [31] to solve (36). In contrast, for solving (37) we can update E via solving the following problem with Lagrange multiplier Yk,

1 1/2 βk 2 min Ω(E) + E Mk+1 F , (38) E µkP k1/2 2 k − k

T where Mk+1 = D Uk+1Vk+1 Yk/βk. In general, the ℓp (0 4 λ , Hλ(yi)= − | | ( 0, otherwise,

λ 3/2 where φ (y ) = arccos( ( y /3)− ). λ i 8 | i| By Lemma 11, the closed-form solution of (38) is given by

H2/(µβk)((Mk+1)i,j ), (i, j) Ω, (Ek+1)i,j = ∈ ( (Mk+1)i,j , otherwise. Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

−1 −1 10 10

−2 10 −2 NNLS NNLS 10 RSE ALT RSE ALT LRMF LRMF −3 IRLS 10 IRLS IRNN IRNN −3 10 Tri−tr Tri−tr Bi−tr Bi−tr −4 10 0 0.1 0.2 0.3 0.4 0 0.1 0.2 0.3 0.4 Given noise factor Given noise factor (a) Noisy matrices of size 100×100 (b) Noisy matrices of size 200×200 Figure 5: The recovery RSE results of NNLS, ALT, LRMF, IRLS, IRNN, and our Tri-tr and Bi-tr methods on noisy random matrices with different noise levels.

4 10

3 10 NNLS ALT

Time (seconds) 2 10 LRMF IRLS IRNN Tri−tr 1 10 Bi−tr

1 2 3 4 5 4 Size of matrices x 10 Figure 6: The running time of NNLS, ALT, LRMF, IRLS, IRNN, and our Tri-tr and Bi-tr methods as the size of noisy random matrices increases.

16.1 Implementation Details for Comparison

We present the implementation detail for all other algorithms in our comparison. The code for NNLS is downloaded from http://www.math.nus.edu.sg/˜mattohkc/NNLS.html. For ALT, the given rank is set to d = 1.25r for synthetic data and 50 for four recommendation system datasets, and the maximum number⌊ ⌋ of iterations is set to 100 and 50, respectively, and its code is downloaded from http://www.cs.utexas.edu/˜cjhsieh/. For LRMF, the code is down- loaded from http://ttic.uchicago.edu/˜ssameer/#code, and the IRLS code is downloaded from http://www.math.ucla.edu/˜wotaoyin/papers/improved_matrix_lq.html. The given rank of LRMF and IRLS is set to the same value as our algorithms, e.g., d = 1.25r for synthetic data. In addition, the code for IRNN is downloaded from https://sites.google.com/site/canyilu/⌊ ⌋ . For LMaFit, the code is downloaded from http://lmafit.blogs.rice.edu/, and the Sp+ℓp code is down- loaded from https://sites.google.com/site/feipingnie/publications. Note that the regulariza- tion parameter µ is generally set to max(m,n) as suggested in [2]. All the experiments were conducted on an Intel Xeon E7-4830V2 2.20GHz CPU with 64G RAM. p Fanhua Shang, Yuanyuan Liu, James Cheng

Table 1: Characteristics of the recommendation datasets. Dataset # row # column # rating MovieLens1M 6,040 3,906 1,000,209 MovieLens10M 71,567 10,681 10,000,054 MovieLens20M 138,493 27,278 20,000,263 Netflix 480,189 17,770 100,480,507 Table 2: Regularization parameter settings for different algorithms. NNLS ALT LRMF LMaFit IRLS Ours Datasets µλλ λ λµ MovieLens1M 1.70 50 5 – 1e-6 100 MovieLens10M 4.80 100 5 – 1e-6 100 MovieLens20M 6.63 150 5 – 1e-6 100 Netflix 16.76 150 5 – 1e-6 100

16.2 Synthetic Data

In order to evaluate the robustness of our methods against noise, we generated the noisy input by the following procedure [18]: b = (X + nf Θ), A 0 ∗ where the elements of the noise matrix Θ are i.i.d. standard Gaussian random variables, and nf is the given noise factor. We also conduct some experiments on noisy matrices of size 100 100 or 200 200 with different noise factors, and report the RSE results of all algorithms with 20% SR in Figure× 5. It is× clear that ALT and LRMF have very similar performance, and usually outperform NNLS in terms of RSE. Moreover, the recovery performance of both our methods are similar to that of IRLS and IRNN, and they consistently perform better than the other methods. Moreover, we also present the running time of all those methods with 20% SR as the size of noisy random matrices increases, as shown in Figure 6. We can observe that the running time of IRNN and IRLS increases dramatically when the size of matrices increases, and they could not yield experimental results within 48 hours when the size of matrices is 50, 000 50, 000. On the contrary, both our methods are much faster than the other methods. This further justifies that× both our methods have very good scalability and can address large-scale problems. As NNLS uses the PROPACK package [50] to compute a partial SVD in each iteration, it usually runs slightly faster than ALT.

16.3 Real-World Recommendation System Data

In this part, we present the detailed descriptions for four real-world recommendation system data sets and the detailed regularization parameter settings for different algorithms, as shown in Table 1 and Table 2, respectively. For IRNN, the regularization parameter λ is dynamically decreased by λk =0.7λk 1, where λ0 =10 Ω(D) . We also report the running time of all these algorithms on the four data sets, as shown− in Figure 7, fromkP whichk∞ it is clear that both our methods are much faster than the other methods, except LMaFit. Compared with LMaFit, both our methods remain competitive in speed, but achieve much lower RMSE than LMaFit on all the data sets (as shown in Figure 2 in the main paper). Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

IRLS IRNN NNLS 4 10 ALT LMaFit LRMF Tri−tr 3 10 Bi−tr Running time (seconds) 2 10

MovieLens1M MovieLens10M MovieLens20M Netflix Figure 7: Running time (seconds) for comparison on the four data sets (Best viewed in color).