An Accelerated Gradient Method for Trace Norm Minimization

Shuiwang Ji [email protected]
Jieping Ye [email protected]
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We consider the minimization of a smooth loss function regularized by the trace norm of the matrix variable. Such a formulation finds applications in many machine learning tasks including multi-task learning, matrix classification, and matrix completion. The standard semidefinite programming formulation for this problem is computationally expensive. In addition, due to the non-smooth nature of the trace norm, the optimal first-order black-box method for solving such a class of problems converges as $O(\frac{1}{\sqrt{k}})$, where $k$ is the iteration counter. In this paper, we exploit the special structure of the trace norm, based on which we propose an extended gradient algorithm that converges as $O(\frac{1}{k})$. We further propose an accelerated gradient algorithm, which achieves the optimal convergence rate of $O(\frac{1}{k^2})$ for smooth problems. Experiments on multi-task learning problems demonstrate the efficiency of the proposed algorithms.

1. Introduction

The problem of minimizing the rank of a matrix variable subject to certain constraints arises in many fields including machine learning, automatic control, and image compression. For example, in collaborative filtering we are given a partially filled rating matrix and the task is to predict the missing entries. Since it is commonly believed that only a few factors contribute to an individual's tastes, it is natural to approximate the given rating matrix by a low-rank matrix. However, the matrix rank minimization problem is NP-hard in general due to the combinatorial nature of the rank function. A commonly used convex relaxation of the rank function is the trace norm (nuclear norm) (Fazel et al., 2001), defined as the sum of the singular values of the matrix, since it is the convex envelope of the rank function over the unit ball of the spectral norm. A number of recent works have shown that the low-rank solution can be recovered exactly via minimizing the trace norm under certain conditions (Recht et al., 2008a; Recht et al., 2008b; Candès & Recht, 2008).

In practice, the trace norm relaxation has been shown to yield low-rank solutions and it has been used widely in many scenarios. In (Srebro et al., 2005; Rennie & Srebro, 2005; Weimer et al., 2008a; Cai et al., 2008; Ma et al., 2008) the matrix completion problem was formulated as a trace norm minimization problem. In problems where multiple related tasks are learned simultaneously, the models for different tasks can be constrained to share certain information. Recently, this constraint has been expressed as trace norm regularization on the weight matrix in the context of multi-task learning (Abernethy et al., 2006; Argyriou et al., 2008; Abernethy et al., 2009; Obozinski et al., 2009), multi-class classification (Amit et al., 2007), and multivariate linear regression (Yuan et al., 2007; Lu et al., 2008). For two-dimensional data such as images, the matrix classification formulation (Tomioka & Aihara, 2007; Bach, 2008) applies a weight matrix, regularized by its trace norm, to the data. It was shown (Tomioka & Aihara, 2007) that such a formulation leads to improved performance over conventional methods.

A practical challenge in employing the trace norm regularization is to develop efficient algorithms to solve the resulting non-smooth optimization problems. It is well known that the trace norm minimization problem can be formulated as a semidefinite program (Fazel et al., 2001; Srebro et al., 2005). However, such a formulation is computationally expensive. To overcome this limitation, a number of algorithms have been developed recently (Rennie & Srebro, 2005; Weimer et al., 2008a; Weimer et al., 2008b; Cai et al., 2008; Ma et al., 2008). In these algorithms some form of approximation is usually employed to deal with the non-smooth trace norm term. However, a fast global convergence rate for these algorithms is difficult to guarantee.

Due to the non-smooth nature of the trace norm, a simple approach to solve these problems is the subgradient method (Bertsekas, 1999; Nesterov, 2003), which converges as $O(\frac{1}{\sqrt{k}})$, where $k$ is the iteration counter. It is known from the complexity theory of convex optimization (Nemirovsky & Yudin, 1983; Nesterov, 2003) that this convergence rate is already optimal for non-smooth optimization under the first-order black-box model, where only the function values and first-order derivatives are used.

In this paper we propose efficient algorithms with fast global convergence rates to solve trace norm regularized problems. Specifically, we show that by exploiting the special structure of the trace norm, the classical gradient method for smooth problems can be adapted to solve trace norm regularized non-smooth problems. This results in an extended gradient algorithm with the same convergence rate of $O(\frac{1}{k})$ as that for smooth problems. Following Nesterov's method for accelerating the gradient method (Nesterov, 1983; Nesterov, 2003), we show that the extended gradient algorithm can be further accelerated to converge as $O(\frac{1}{k^2})$, which is the optimal convergence rate for smooth problems. Hence, the non-smoothness effect of the trace norm regularization is effectively removed. The proposed algorithms extend the algorithms in (Nesterov, 2007; Tseng, 2008; Beck & Teboulle, 2009) to the matrix case. Experiments on multi-task learning problems demonstrate the efficiency of the proposed algorithms in comparison with existing ones. Note that while the present paper was under review, we became aware of a recent preprint by Toh and Yun (2009), who independently developed an algorithm that is similar to ours.

2. Problem Formulation

In this paper we consider the following problem:

$$\min_W F(W) = f(W) + \lambda \|W\|_* \qquad (1)$$

where $W \in \mathbb{R}^{m \times n}$ is the decision matrix, $f(\cdot)$ represents the loss induced by some convex smooth (differentiable) loss function $\ell(\cdot,\cdot)$, and $\|\cdot\|_*$ denotes the trace norm, defined as the sum of the singular values. We assume that the gradient of $f(\cdot)$, denoted as $\nabla f(\cdot)$, is Lipschitz continuous with constant $L$, i.e.,

$$\|\nabla f(X) - \nabla f(Y)\|_F \le L \|X - Y\|_F, \quad \forall X, Y \in \mathbb{R}^{m \times n},$$

where $\|\cdot\|_F$ denotes the Frobenius norm. Such a formulation arises in many machine learning tasks, including the following:

• Multi-task learning (Argyriou et al., 2008): $f(W) = \sum_{i=1}^{n} \sum_{j=1}^{s_i} \ell(y_i^j, w_i^T x_i^j)$, where $n$ is the number of tasks, $(x_i^j, y_i^j) \in \mathbb{R}^m \times \mathbb{R}$ is the $j$th sample in the $i$th task, $s_i$ is the number of samples in the $i$th task, and $W = [w_1, \cdots, w_n] \in \mathbb{R}^{m \times n}$.

• Matrix classification (Tomioka & Aihara, 2007; Bach, 2008): $f(W) = \sum_{i=1}^{s} \ell(y_i, \mathrm{Tr}(W^T X_i))$, where $(X_i, y_i) \in \mathbb{R}^{m \times n} \times \mathbb{R}$ is the $i$th sample.

• Matrix completion (Srebro et al., 2005; Candès & Recht, 2008; Recht et al., 2008a; Ma et al., 2008): $f(W) = \sum_{(i,j) \in \Omega} \ell(M_{ij}, W_{ij})$, where $M \in \mathbb{R}^{m \times n}$ is the partially observed matrix with the entries in $\Omega$ being observed.

Since the trace norm term in the objective function in Eq. (1) is non-smooth, a natural approach for solving this problem is the subgradient method, in which a sequence of approximate solutions is generated as

$$W_k = W_{k-1} - \frac{1}{t_k} F'(W_{k-1}), \qquad (2)$$

where $W_k$ is the approximate solution at the $k$th iteration, $\frac{1}{t_k}$ is the step size, and $F'(W) \in \partial F(W)$ is a subgradient of $F(W)$ at $W$, where $\partial F(W)$ denotes the subdifferential (Bertsekas, 1999; Nesterov, 2003) of $F(W)$ at $W$. It is known (Nesterov, 2003) that the subgradient method converges as $O(\frac{1}{\sqrt{k}})$, i.e.,

$$F(W_k) - F(W^*) \le \frac{c}{\sqrt{k}}, \qquad (3)$$

for some constant $c$, where $W^* = \arg\min_W F(W)$.

It is known from the complexity theory of convex optimization (Nemirovsky & Yudin, 1983; Nesterov, 2003) that this convergence rate is already optimal for non-smooth problems under the first-order black-box model. Hence, the convergence rate cannot be improved if a black-box model, which does not exploit any special structure of the objective function, is employed. We show in the following that by exploiting the structure of the trace norm, its non-smoothness can be effectively overcome and the convergence rate of the algorithm for solving the trace norm regularized problem in Eq. (1) can be improved significantly.
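To make the formulation concrete, the following sketch evaluates the composite objective of Eq. (1) for the matrix completion loss listed above with a squared loss. It is an illustrative example only; the function names and the use of NumPy are our own assumptions and not part of the original paper.

```python
import numpy as np

def trace_norm(W):
    """Trace (nuclear) norm ||W||_*: the sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

def completion_loss_and_grad(W, M, mask):
    """Squared loss over the observed entries Omega (mask == True) and its gradient.

    f(W) = 0.5 * sum_{(i,j) in Omega} (W_ij - M_ij)^2; the gradient is the
    residual on Omega and zero elsewhere, so grad f is Lipschitz with L = 1.
    """
    R = np.where(mask, W - M, 0.0)
    return 0.5 * np.sum(R ** 2), R

def objective(W, M, mask, lam):
    """Composite objective F(W) = f(W) + lam * ||W||_* from Eq. (1)."""
    f_val, _ = completion_loss_and_grad(W, M, mask)
    return f_val + lam * trace_norm(W)
```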

3. An Extended Gradient Method

First, consider the minimization of the smooth loss function without the trace norm regularization:

$$\min_W f(W). \qquad (4)$$

It is known (Bertsekas, 1999) that the gradient step

$$W_k = W_{k-1} - \frac{1}{t_k} \nabla f(W_{k-1}) \qquad (5)$$

for solving this smooth problem can be reformulated equivalently as a proximal regularization of the linearized function $f(W)$ at $W_{k-1}$:

$$W_k = \arg\min_W P_{t_k}(W, W_{k-1}), \qquad (6)$$

where

$$P_{t_k}(W, W_{k-1}) = f(W_{k-1}) + \langle W - W_{k-1}, \nabla f(W_{k-1}) \rangle + \frac{t_k}{2} \|W - W_{k-1}\|_F^2, \qquad (7)$$

and $\langle A, B \rangle = \mathrm{Tr}(A^T B)$ denotes the matrix inner product. It has been shown (Nesterov, 2003) that the convergence rate of this algorithm is $O(\frac{1}{k})$. Note that the function $P_{t_k}$ defined in Eq. (7) can be considered as a linear approximation of the function $f$ at the point $W_{k-1}$ regularized by a quadratic proximal term.

Based on this equivalence relationship, we propose to solve the optimization problem in Eq. (1) by the following iterative step:

$$W_k = \arg\min_W Q_{t_k}(W, W_{k-1}), \quad \text{where } Q_{t_k}(W, W_{k-1}) \triangleq P_{t_k}(W, W_{k-1}) + \lambda \|W\|_*. \qquad (8)$$

A key motivation for this formulation is that if the optimization problem in Eq. (8) can be easily solved by exploiting the structure of the trace norm, the convergence rate of the resulting algorithm is expected to be the same as that of the gradient method, since no approximation of the non-smooth term is employed. By ignoring terms that do not depend on $W$, the objective in Eq. (8) can be expressed equivalently as

$$\frac{t_k}{2} \left\| W - \left( W_{k-1} - \frac{1}{t_k} \nabla f(W_{k-1}) \right) \right\|_F^2 + \lambda \|W\|_*. \qquad (9)$$

It turns out that the minimization of the objective in Eq. (9) can be solved by first computing the singular value decomposition (SVD) of $W_{k-1} - \frac{1}{t_k} \nabla f(W_{k-1})$ and then applying soft-thresholding to the singular values. This is summarized in the following theorem (Cai et al., 2008).

Theorem 3.1. Let $C \in \mathbb{R}^{m \times n}$ and let $C = U \Sigma V^T$ be the SVD of $C$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ have orthonormal columns, $\Sigma \in \mathbb{R}^{r \times r}$ is diagonal, and $r = \mathrm{rank}(C)$. Then

$$T_\lambda(C) \equiv \arg\min_W \left\{ \frac{1}{2} \|W - C\|_F^2 + \lambda \|W\|_* \right\} \qquad (10)$$

is given by $T_\lambda(C) = U \Sigma_\lambda V^T$, where $\Sigma_\lambda$ is diagonal with $(\Sigma_\lambda)_{ii} = \max\{0, \Sigma_{ii} - \lambda\}$.

The proof of this theorem is in the Appendix. The above discussion shows that the problem in Eq. (8) can be readily solved by SVD. Furthermore, we show in the following that if the step size $\frac{1}{t_k}$ of the gradient method is chosen properly, we can achieve the same convergence rate as in the smooth case, i.e., $O(\frac{1}{k})$, despite the presence of the non-smooth trace norm regularization.
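Theorem 3.1 translates directly into code: compute a thin SVD and shrink the singular values by $\lambda$. The sketch below is a minimal NumPy version; the function name is our own.

```python
import numpy as np

def svd_soft_threshold(C, lam):
    """T_lam(C) = argmin_W 0.5*||W - C||_F^2 + lam*||W||_*  (Eq. (10)).

    By Theorem 3.1 this is U * diag(max(sigma_i - lam, 0)) * V^T,
    i.e. soft-thresholding of the singular values of C.
    """
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt
```

Dividing Eq. (9) by $t_k$ shows that the update in Eq. (8) is $W_k = T_{\lambda/t_k}\!\left(W_{k-1} - \frac{1}{t_k}\nabla f(W_{k-1})\right)$, so each iteration costs one gradient evaluation and one SVD.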

3.1. Step Size Estimation

To choose an appropriate step size, we impose a condition on the relationship between the function values of $F$ and $Q_{t_k}$ at a certain point in Lemma 3.1. We show in Theorem 3.2 below that once this condition is satisfied at each step by choosing an appropriate step size, the convergence rate of the resulting sequence can be guaranteed.

Lemma 3.1. Let

$$p_\mu(Y) = \arg\min_X Q_\mu(X, Y), \qquad (11)$$

where $Q$ is defined in Eq. (8). Assume the following inequality holds:

$$F(p_\mu(Y)) \le Q_\mu(p_\mu(Y), Y). \qquad (12)$$

Then for any $X \in \mathbb{R}^{m \times n}$ we have

$$F(X) - F(p_\mu(Y)) \ge \frac{\mu}{2} \|p_\mu(Y) - Y\|_F^2 + \mu \langle Y - X, p_\mu(Y) - Y \rangle. \qquad (13)$$

The proof of this lemma is in the Appendix.

At each step of the algorithm we need to find an appropriate value for $\mu$ such that $W_k = p_\mu(W_{k-1})$ and the condition

$$F(W_k) \le Q_\mu(W_k, W_{k-1}) \qquad (14)$$

is satisfied. Note that since the gradient of $f(\cdot)$ is Lipschitz continuous with constant $L$, we have (Nesterov, 2003)

$$f(X) \le f(Y) + \langle X - Y, \nabla f(Y) \rangle + \frac{L}{2} \|X - Y\|_F^2, \quad \forall X, Y.$$

Hence, when $\mu \ge L$ we have

$$F(p_\mu(Y)) \le P_\mu(p_\mu(Y), Y) + \lambda \|p_\mu(Y)\|_* = Q_\mu(p_\mu(Y), Y).$$

This shows that the condition in Eq. (14) is always satisfied if the update rule

$$W_k = p_L(W_{k-1}) \qquad (15)$$

is applied. However, $L$ may not be known, or it may be expensive to compute in practice. We propose the following step size estimation strategy to ensure the condition in Eq. (14): given an initial estimate $L_0$ of $L$, we increase this estimate by a multiplicative factor $\gamma > 1$ repeatedly until the condition in Eq. (14) is satisfied. This results in the extended gradient method in Algorithm 1 for solving the problem in Eq. (1).

Algorithm 1 Extended Gradient Algorithm
  Initialize $L_0$, $\gamma$, $W_0 \in \mathbb{R}^{m \times n}$
  Iterate:
  1. Set $\bar{L} = L_{k-1}$
  2. While $F(p_{\bar{L}}(W_{k-1})) > Q_{\bar{L}}(p_{\bar{L}}(W_{k-1}), W_{k-1})$, set $\bar{L} := \gamma \bar{L}$
  3. Set $L_k = \bar{L}$ and $W_k = p_{L_k}(W_{k-1})$

Since the condition in Eq. (14) is always satisfied when $L_k \ge L$, we have

$$L_k \le \gamma L, \quad \forall k. \qquad (16)$$

Note that the sequence of function values generated by this algorithm is non-increasing, as

$$F(W_k) \le Q_{L_k}(W_k, W_{k-1}) \le Q_{L_k}(W_{k-1}, W_{k-1}) = F(W_{k-1}).$$
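Algorithm 1 can be written compactly by combining the operator $T_\lambda$ with the backtracking rule on $\bar{L}$. The following sketch reuses the helpers from the earlier snippets and uses our own naming; it is an illustration of the scheme under those assumptions, not the authors' implementation.

```python
import numpy as np

# Reuses trace_norm and svd_soft_threshold from the sketches above.

def prox_step(W, grad, L, lam):
    """p_L(W) = argmin_X Q_L(X, W); by Theorem 3.1 this is T_{lam/L}(W - grad/L)."""
    return svd_soft_threshold(W - grad / L, lam / L)

def Q_value(X, W, f_W, grad_W, L, lam):
    """Q_L(X, W) = f(W) + <X - W, grad f(W)> + (L/2)||X - W||_F^2 + lam*||X||_*."""
    D = X - W
    return f_W + np.sum(D * grad_W) + 0.5 * L * np.sum(D ** 2) + lam * trace_norm(X)

def extended_gradient(loss_and_grad, W0, lam, L0=1.0, gamma=2.0, max_iter=100):
    """Algorithm 1 (EGM): increase L until F(p_L(W)) <= Q_L(p_L(W), W), i.e. Eq. (14)."""
    W, L = W0.copy(), L0
    for _ in range(max_iter):
        f_W, g_W = loss_and_grad(W)
        while True:
            W_new = prox_step(W, g_W, L, lam)
            f_new, _ = loss_and_grad(W_new)
            if f_new + lam * trace_norm(W_new) <= Q_value(W_new, W, f_W, g_W, L, lam):
                break
            L *= gamma  # step 2: the condition fails, so set L := gamma * L
        W = W_new       # step 3: L_k = L and W_k = p_{L_k}(W_{k-1})
    return W
```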

3.2. Convergence Analysis

We show in the following theorem that when the condition in Eq. (14) is satisfied at each iteration, the extended gradient algorithm converges as $O(\frac{1}{k})$.

Theorem 3.2. Let $\{W_k\}$ be the sequence generated by Algorithm 1. Then for any $k \ge 1$ we have

$$F(W_k) - F(W^*) \le \frac{\gamma L \|W_0 - W^*\|_F^2}{2k}, \qquad (17)$$

where $W^* = \arg\min_W F(W)$.

The proof of this theorem is in the Appendix.

4. An Accelerated Gradient Method

It is known (Nesterov, 1983; Nesterov, 2003) that when the objective function is smooth, the gradient method can be accelerated to achieve the optimal convergence rate of $O(\frac{1}{k^2})$. It was shown recently (Nesterov, 2007; Tseng, 2008; Beck & Teboulle, 2009) that a similar scheme can be applied to accelerate optimization problems where the objective function consists of a smooth part and a non-smooth part, provided that the non-smooth part is "simple". In particular, it was shown that $\ell_1$-norm regularized problems can be accelerated even though they are not smooth. In this section we show that the extended gradient method in Algorithm 1 can also be accelerated to achieve the optimal convergence rate of smooth problems even though the trace norm is not smooth. This results in the accelerated gradient method in Algorithm 2.

Algorithm 2 Accelerated Gradient Algorithm
  Initialize $L_0$, $\gamma$, $W_0 = Z_1 \in \mathbb{R}^{m \times n}$, $\alpha_1 = 1$
  Iterate:
  1. Set $\bar{L} = L_{k-1}$
  2. While $F(p_{\bar{L}}(Z_k)) > Q_{\bar{L}}(p_{\bar{L}}(Z_k), Z_k)$, set $\bar{L} := \gamma \bar{L}$
  3. Set $L_k = \bar{L}$ and update

$$W_k = p_{L_k}(Z_k),$$

$$\alpha_{k+1} = \frac{1 + \sqrt{1 + 4\alpha_k^2}}{2}, \qquad (18)$$

$$Z_{k+1} = W_k + \left( \frac{\alpha_k - 1}{\alpha_{k+1}} \right) (W_k - W_{k-1}). \qquad (19)$$

4.1. Discussion

In the accelerated gradient method, two sequences $\{W_k\}$ and $\{Z_k\}$ are updated recursively. In particular, $W_k$ is the approximate solution at the $k$th step and $Z_k$ is called the search point (Nesterov, 1983; Nesterov, 2003), which is constructed as a linear combination of the latest two approximate solutions $W_{k-1}$ and $W_{k-2}$. The key difference between the extended and the accelerated algorithms is that the gradient step is performed at the current approximate solution in the extended algorithm, while it is performed at the search point $Z_k$ in the accelerated scheme. The idea of constructing the search point is motivated by the investigation of information-based complexity (Nemirovsky & Yudin, 1983; Nesterov, 2003), which reveals that for smooth problems the convergence rate of the gradient method is not optimal, and thus methods with a faster convergence rate should exist. The derivation of the search point is based on the concept of estimate sequences, and more details can be found in (Nesterov, 2003). Note that the sequence $\alpha_k$ can be updated in many ways as long as certain conditions are satisfied (Nesterov, 2003). Indeed, it was shown in (Tseng, 2008) that other schemes for updating $\alpha_k$ can lead to better practical performance, though the theoretical convergence rate remains the same. Note also that the sequence of objective values generated by the accelerated scheme may increase. It can, however, be made non-increasing by a simple modification of the algorithm as in (Nesterov, 2005).
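A sketch of Algorithm 2 in the same style, again with our own function names and reusing the helpers above; the only changes relative to the extended method are that the proximal step is taken at the search point $Z_k$ and that Eqs. (18)-(19) are applied after each step.

```python
import numpy as np

# Reuses trace_norm, prox_step and Q_value from the sketches above.

def accelerated_gradient(loss_and_grad, W0, lam, L0=1.0, gamma=2.0, max_iter=100):
    """Algorithm 2 (AGM): the proximal-gradient step is taken at the search point Z_k."""
    W_prev = W0.copy()        # W_0
    Z = W0.copy()             # Z_1 = W_0
    L, alpha = L0, 1.0        # alpha_1 = 1
    history = []
    for _ in range(max_iter):
        f_Z, g_Z = loss_and_grad(Z)
        while True:           # backtracking on L at the search point (step 2)
            W = prox_step(Z, g_Z, L, lam)
            f_W, _ = loss_and_grad(W)
            if f_W + lam * trace_norm(W) <= Q_value(W, Z, f_Z, g_Z, L, lam):
                break
            L *= gamma
        alpha_next = (1.0 + np.sqrt(1.0 + 4.0 * alpha ** 2)) / 2.0   # Eq. (18)
        Z = W + ((alpha - 1.0) / alpha_next) * (W - W_prev)          # Eq. (19)
        W_prev, alpha = W, alpha_next
        history.append(f_W + lam * trace_norm(W))
    return W_prev, history
```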

4.2. Convergence Analysis

We show in the following that by performing the gradient step at the search point $Z_k$ instead of at the approximate solution $W_k$, the convergence rate of the gradient method can be accelerated to $O(\frac{1}{k^2})$. This result is summarized in the following theorem.

Theorem 4.1. Let $\{W_k\}$ and $\{Z_k\}$ be the sequences generated by Algorithm 2. Then for any $k \ge 1$ we have

$$F(W_k) - F(W^*) \le \frac{2 \gamma L \|W_0 - W^*\|_F^2}{(k+1)^2}. \qquad (20)$$

The proof of this theorem follows the same strategy as in (Beck & Teboulle, 2009) and is given in the Appendix.

5. Experiments

We evaluate the proposed extended gradient method (EGM) and the accelerated gradient method (AGM) on four multi-task data sets. The yeast data set was derived from a yeast gene classification problem consisting of 14 tasks. The letters and digits are handwritten words and digits data sets (Obozinski et al., 2009), which consist of 8 and 10 tasks, respectively. The dmoz is a text categorization data set obtained from DMOZ (http://www.dmoz.org/) in which each of the 10 tasks corresponds to one of the subcategories of the Arts category. For each data set we randomly sample 5% and 10% of the data from each task for training.

To evaluate the efficiency of the proposed formulations, we also report the computation time of the multi-task feature learning (MFL) algorithm (Argyriou et al., 2008), as MFL involves a formulation that is equivalent to EGM and AGM. For all methods, we terminate the algorithms when the relative change in the objective is below 10^{-8}, since the objective values of MFL and EGM/AGM are not directly comparable.

Table 1. Comparison of the three multi-task learning algorithms (EGM, AGM, and MFL) in terms of computation time (in seconds). In each case, the computation time reported is the time used to train the model for the parameter value obtained by cross validation, averaged over ten random trials.

Data set      yeast           letters         digits          dmoz
Percentage    5%      10%     5%      10%     5%      10%     5%       10%
EGM           2.24    3.37    4.74    5.67    62.51   29.59   133.21   146.58
AGM           0.34    0.49    0.62    0.91    2.41    2.39    1.59     1.42
MFL           2.33    17.27   2.49    9.66    15.50   42.64   74.24    31.49

[Figure 1: two panels plotting the objective value against the iteration number, with one curve for EGM and one for AGM in each panel.]

Figure 1. The convergence of EGM and AGM on the yeast data set when 5% (left figure) and 10% (right figure) of the data are used for training. On the first data set EGM and AGM take 1122 and 81 iterations, respectively, to converge, while on the second data set they take 773 and 108 iterations, respectively.
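To reproduce the qualitative behavior in Figure 1, one can run the two sketches above on a small synthetic matrix completion instance and compare the resulting objective values; the data below are synthetic, not the yeast/letters/digits/dmoz sets used in the paper, and all helper names come from the earlier illustrative snippets.

```python
import numpy as np

# A toy run of the two sketches above on synthetic matrix completion data.
rng = np.random.default_rng(0)
m, n, r = 50, 40, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # low-rank ground truth
mask = rng.random((m, n)) < 0.3                                 # ~30% observed entries
lam = 1.0

loss_and_grad = lambda W: completion_loss_and_grad(W, M, mask)
W0 = np.zeros((m, n))

W_egm = extended_gradient(loss_and_grad, W0, lam, max_iter=100)
W_agm, agm_history = accelerated_gradient(loss_and_grad, W0, lam, max_iter=100)
print("EGM objective:", objective(W_egm, M, mask, lam))
print("AGM objective:", agm_history[-1])
```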

The averaged computation time over ten random trials for each method is reported in Table 1. We can observe that AGM is by far the most efficient method in all cases. The relative efficiency of EGM and AGM differs significantly across data sets, while the performance of AGM is very stable across different problems. In order to investigate the convergence behaviors of EGM and AGM, we plot the objective values of these two methods on the yeast data set in Figure 1. We can observe that in both cases AGM converges much faster than EGM, especially at early iterations. This is consistent with our theoretical results and confirms that the proposed accelerated scheme can reach the optimal objective value rapidly.

6. Conclusion and Discussion

In this paper we propose efficient algorithms to solve trace norm regularized problems. We show that by exploiting the special structure of the trace norm, the optimal convergence rate of $O(\frac{1}{\sqrt{k}})$ for general non-smooth problems can be improved to $O(\frac{1}{k})$. We further show that this convergence rate can be accelerated to $O(\frac{1}{k^2})$ by employing Nesterov's method. Experiments on multi-task learning problems demonstrate the efficiency of the proposed algorithms.

As pointed out in the paper, another important application of trace norm regularization is in matrix completion problems. We plan to apply the proposed formulations to this problem in the future. The proposed algorithms require the computation of an SVD, which may be computationally expensive for large-scale problems. We will investigate approximate SVD techniques in the future to further reduce the computational cost.

Appendix

Proof of Theorem 3.1

Proof. Since the objective function in Eq. (10) is strongly convex, a unique solution exists for this problem, and hence it remains to show that the solution is $T_\lambda(C)$. Recall that $Z \in \mathbb{R}^{m \times n}$ is a subgradient of a convex function $h : \mathbb{R}^{m \times n} \to \mathbb{R}$ at $X_0$ if

$$h(X) \ge h(X_0) + \langle Z, X - X_0 \rangle \qquad (21)$$

for any $X$. The set of subgradients of $h$ at $X_0$ is called the subdifferential of $h$ at $X_0$ and is denoted by $\partial h(X_0)$. It is well known (Nesterov, 2003) that $W^*$ is the optimal solution to the problem in Eq. (10) if and only if $0 \in \mathbb{R}^{m \times n}$ is a subgradient of the objective function at $W^*$, i.e.,

$$0 \in W^* - C + \lambda \partial \|W^*\|_*. \qquad (22)$$

Let $W = P_1 \Lambda P_2^T$ be the SVD of $W$, where $P_1 \in \mathbb{R}^{m \times s}$ and $P_2 \in \mathbb{R}^{n \times s}$ have orthonormal columns, $\Lambda \in \mathbb{R}^{s \times s}$ is diagonal, and $s = \mathrm{rank}(W)$. It can be verified that (Bach, 2008; Recht et al., 2008a)

$$\partial \|W\|_* = \{ P_1 P_2^T + S : S \in \mathbb{R}^{m \times n}, \; P_1^T S = 0, \; S P_2 = 0, \; \|S\|_2 \le 1 \}, \qquad (23)$$

where $\|\cdot\|_2$ denotes the spectral norm of a matrix. Decompose the SVD of $C$ as

$$C = U_0 \Sigma_0 V_0^T + U_1 \Sigma_1 V_1^T,$$

where $U_0 \Sigma_0 V_0^T$ corresponds to the part of the SVD with singular values greater than $\lambda$. Then the SVD of $T_\lambda(C)$ is

$$T_\lambda(C) = U_0 (\Sigma_0 - \lambda I) V_0^T,$$

and thus

$$C - T_\lambda(C) = \lambda (U_0 V_0^T + S),$$

where $S = \frac{1}{\lambda} U_1 \Sigma_1 V_1^T$. It follows from the facts that $U_0^T S = 0$, $S V_0 = 0$, and $\|S\|_2 \le 1$ that

$$C - T_\lambda(C) \in \lambda \partial \|T_\lambda(C)\|_*,$$

which shows that $T_\lambda(C)$ is an optimal solution.

Proof of Lemma 3.1

Proof. Since both the loss function $f$ and the trace norm are convex, we have

$$f(X) \ge f(Y) + \langle X - Y, \nabla f(Y) \rangle,$$

$$\lambda \|X\|_* \ge \lambda \|p_\mu(Y)\|_* + \lambda \langle X - p_\mu(Y), g(p_\mu(Y)) \rangle,$$

where $g(p_\mu(Y)) \in \partial \|p_\mu(Y)\|_*$ is a subgradient of the trace norm at $p_\mu(Y)$. Summing up the above two inequalities, we obtain

$$F(X) \ge f(Y) + \langle X - Y, \nabla f(Y) \rangle + \lambda \|p_\mu(Y)\|_* + \lambda \langle X - p_\mu(Y), g(p_\mu(Y)) \rangle. \qquad (24)$$

By combining the condition in Eq. (12), the result in Eq. (24), and the relation

$$Q_\mu(p_\mu(Y), Y) = P_\mu(p_\mu(Y), Y) + \lambda \|p_\mu(Y)\|_*,$$

we obtain

$$\begin{aligned} F(X) - F(p_\mu(Y)) &\ge F(X) - Q_\mu(p_\mu(Y), Y) \\ &\ge \langle X - p_\mu(Y), \nabla f(Y) + \lambda g(p_\mu(Y)) \rangle - \frac{\mu}{2} \|p_\mu(Y) - Y\|_F^2 \\ &= \mu \langle X - p_\mu(Y), Y - p_\mu(Y) \rangle - \frac{\mu}{2} \|p_\mu(Y) - Y\|_F^2 \\ &= \mu \langle Y - X, p_\mu(Y) - Y \rangle + \frac{\mu}{2} \|p_\mu(Y) - Y\|_F^2, \end{aligned}$$

where the first equality follows from the fact that $p_\mu(Y)$ is a minimizer of $Q_\mu(X, Y)$ as in Eq. (11), and thus

$$\nabla f(Y) + \mu (p_\mu(Y) - Y) + \lambda g(p_\mu(Y)) = 0.$$

This completes the proof of the lemma.

Proof of Theorem 3.2

Proof. Applying Lemma 3.1 with $(X = W^*, Y = W_n, \mu = L_{n+1})$ and making use of the fact that for any three matrices $A$, $B$, and $C$ of the same size

$$\|B - A\|_F^2 + 2 \langle B - A, A - C \rangle = \|B - C\|_F^2 - \|A - C\|_F^2, \qquad (25)$$

we obtain

$$\frac{2}{L_{n+1}} \left( F(W^*) - F(W_{n+1}) \right) \ge \|W_{n+1} - W^*\|_F^2 - \|W_n - W^*\|_F^2.$$

Summing the above inequality over $n = 0, \cdots, k-1$ and making use of the inequality in Eq. (16), we get

$$\sum_{n=0}^{k-1} \left( F(W_{n+1}) - F(W^*) \right) \le \frac{\gamma L}{2} \left( \|W_0 - W^*\|_F^2 - \|W_k - W^*\|_F^2 \right).$$

It follows from $F(W_{n+1}) \le F(W_n)$ and $F(W_n) \ge F(W^*)$ that

$$k \left( F(W_k) - F(W^*) \right) \le \sum_{n=0}^{k-1} \left( F(W_{n+1}) - F(W^*) \right) \le \frac{\gamma L}{2} \|W_0 - W^*\|_F^2,$$

which leads to Eq. (17).

Proof of Theorem 4.1

Proof. Let us denote

$$v_k = F(W_k) - F(W^*), \qquad U_k = \alpha_k W_k - (\alpha_k - 1) W_{k-1} - W^*.$$

Applying Lemma 3.1 with $(X = W_k, Y = Z_{k+1}, \mu = L_{k+1})$ and $(X = W^*, Y = Z_{k+1}, \mu = L_{k+1})$, respectively, we obtain the following two inequalities:

$$\frac{2}{L_{k+1}} (v_k - v_{k+1}) \ge \|W_{k+1} - Z_{k+1}\|_F^2 + 2 \langle W_{k+1} - Z_{k+1}, Z_{k+1} - W_k \rangle, \qquad (26)$$

$$-\frac{2}{L_{k+1}} v_{k+1} \ge \|W_{k+1} - Z_{k+1}\|_F^2 + 2 \langle W_{k+1} - Z_{k+1}, Z_{k+1} - W^* \rangle. \qquad (27)$$

Multiplying both sides of Eq. (26) by $(\alpha_{k+1} - 1)$ and adding it to Eq. (27), we get

$$\frac{2}{L_{k+1}} \left( (\alpha_{k+1} - 1) v_k - \alpha_{k+1} v_{k+1} \right) \ge \alpha_{k+1} \|W_{k+1} - Z_{k+1}\|_F^2 + 2 \langle W_{k+1} - Z_{k+1}, \alpha_{k+1} Z_{k+1} - (\alpha_{k+1} - 1) W_k - W^* \rangle.$$

Multiplying the last inequality by $\alpha_{k+1}$ and making use of the equality $\alpha_k^2 = \alpha_{k+1}^2 - \alpha_{k+1}$ derived from Eq. (18), we get

$$\frac{2}{L_{k+1}} \left( \alpha_k^2 v_k - \alpha_{k+1}^2 v_{k+1} \right) \ge \|\alpha_{k+1} (W_{k+1} - Z_{k+1})\|_F^2 + 2 \alpha_{k+1} \langle W_{k+1} - Z_{k+1}, \alpha_{k+1} Z_{k+1} - (\alpha_{k+1} - 1) W_k - W^* \rangle.$$

Applying the equality in Eq. (25) to the right-hand side of the above inequality, we get

$$\frac{2}{L_{k+1}} \left( \alpha_k^2 v_k - \alpha_{k+1}^2 v_{k+1} \right) \ge \|\alpha_{k+1} W_{k+1} - (\alpha_{k+1} - 1) W_k - W^*\|_F^2 - \|\alpha_{k+1} Z_{k+1} - (\alpha_{k+1} - 1) W_k - W^*\|_F^2.$$

It follows from Eq. (19) and the definition of $U_k$ that

$$\frac{2}{L_{k+1}} \left( \alpha_k^2 v_k - \alpha_{k+1}^2 v_{k+1} \right) \ge \|U_{k+1}\|_F^2 - \|U_k\|_F^2,$$

which combined with $L_{k+1} \ge L_k$ leads to

$$\frac{2}{L_k} \alpha_k^2 v_k - \frac{2}{L_{k+1}} \alpha_{k+1}^2 v_{k+1} \ge \|U_{k+1}\|_F^2 - \|U_k\|_F^2. \qquad (28)$$

Applying Lemma 3.1 with $(X = W^*, Y = Z_1, \mu = L_1)$, we obtain

$$\begin{aligned} F(W^*) - F(W_1) &= F(W^*) - F(p_{L_1}(Z_1)) \\ &\ge \frac{L_1}{2} \|p_{L_1}(Z_1) - Z_1\|_F^2 + L_1 \langle Z_1 - W^*, p_{L_1}(Z_1) - Z_1 \rangle \\ &= \frac{L_1}{2} \|W_1 - Z_1\|_F^2 + L_1 \langle Z_1 - W^*, W_1 - Z_1 \rangle \\ &= \frac{L_1}{2} \|W_1 - W^*\|_F^2 - \frac{L_1}{2} \|Z_1 - W^*\|_F^2. \end{aligned}$$

Hence, we have

$$\frac{2}{L_1} v_1 \le \|Z_1 - W^*\|_F^2 - \|W_1 - W^*\|_F^2. \qquad (29)$$

It follows from Eqs. (28) and (29) that $\frac{2}{L_k} \alpha_k^2 v_k \le \|W_0 - W^*\|_F^2$, which combined with $\alpha_k \ge (k+1)/2$ yields

$$F(W_k) - F(W^*) \le \frac{2 L_k \|W_0 - W^*\|_F^2}{(k+1)^2} \le \frac{2 \gamma L \|W_0 - W^*\|_F^2}{(k+1)^2}.$$

This completes the proof of the theorem.

Acknowledgments

This work was supported by NSF IIS-0612069, IIS-0812551, CCF-0811790, NIH R01-HG002516, and NGA HM1582-08-1-0016.

References

Abernethy, J., Bach, F., Evgeniou, T., & Vert, J.-P. (2006). Low-rank matrix factorization with attributes (Technical Report N24/06/MM). Ecole des Mines de Paris.

Abernethy, J., Bach, F., Evgeniou, T., & Vert, J.-P. (2009). A new approach to collaborative filtering: Operator estimation with spectral regularization. J. Mach. Learn. Res., 10, 803-826.

Amit, Y., Fink, M., Srebro, N., & Ullman, S. (2007). Uncovering shared structures in multiclass classification. In Proceedings of the International Conference on Machine Learning, 17-24.

Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73, 243-272.

Bach, F. R. (2008). Consistency of trace norm minimization. J. Mach. Learn. Res., 9, 1019-1048.

Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183-202.

Bertsekas, D. P. (1999). Nonlinear programming. Athena Scientific. 2nd edition.

Cai, J.-F., Candès, E. J., & Shen, Z. (2008). A singular value thresholding algorithm for matrix completion (Technical Report 08-77). UCLA Computational and Applied Math.

Candès, E. J., & Recht, B. (2008). Exact matrix completion via convex optimization (Technical Report 08-76). UCLA Computational and Applied Math.

Fazel, M., Hindi, H., & Boyd, S. P. (2001). A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, 4734-4739.

Lu, Z., Monteiro, R. D. C., & Yuan, M. (2008). Convex optimization methods for dimension reduction and coefficient estimation in multivariate linear regression. Submitted to Mathematical Programming.

Ma, S., Goldfarb, D., & Chen, L. (2008). Fixed point and Bregman iterative methods for matrix rank minimization (Technical Report 08-78). UCLA Computational and Applied Math.

Nemirovsky, A. S., & Yudin, D. B. (1983). Problem complexity and method efficiency in optimization. John Wiley & Sons Ltd.

Nesterov, Y. (1983). A method for solving a convex programming problem with convergence rate O(1/k^2). Soviet Math. Dokl., 27, 372-376.

Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course. Kluwer Academic Publishers.

Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming, 103, 127-152.

Nesterov, Y. (2007). Gradient methods for minimizing composite objective function (Technical Report 2007/76). CORE, Université catholique de Louvain.

Obozinski, G., Taskar, B., & Jordan, M. I. (2009). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing. In press.

Recht, B., Fazel, M., & Parrilo, P. (2008a). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. Submitted to SIAM Review.

Recht, B., Xu, W., & Hassibi, B. (2008b). Necessary and sufficient conditions for success of the nuclear norm heuristic for rank minimization. In Proceedings of the 47th IEEE Conference on Decision and Control, 3065-3070.

Rennie, J. D. M., & Srebro, N. (2005). Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the International Conference on Machine Learning, 713-719.

Srebro, N., Rennie, J. D. M., & Jaakkola, T. S. (2005). Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 1329-1336.

Toh, K.-C., & Yun, S. (2009). An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Preprint, Department of Mathematics, National University of Singapore, March 2009.

Tomioka, R., & Aihara, K. (2007). Classifying matrices with a spectral regularization. In Proceedings of the International Conference on Machine Learning, 895-902.

Tseng, P. (2008). On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization.

Weimer, M., Karatzoglou, A., Le, Q., & Smola, A. (2008a). COFIrank - maximum margin matrix factorization for collaborative ranking. In Advances in Neural Information Processing Systems, 1593-1600.

Weimer, M., Karatzoglou, A., & Smola, A. (2008b). Improving maximum margin matrix factorization. Machine Learning, 72, 263-276.

Yuan, M., Ekici, A., Lu, Z., & Monteiro, R. (2007). Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B, 69, 329-346.