An Accelerated Gradient Method for Trace Norm Minimization

An Accelerated Gradient Method for Trace Norm Minimization Shuiwang Ji [email protected] Jieping Ye [email protected] Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA Abstract (Fazel et al., 2001), de¯ned as the sum of the singular values of the matrix, since it is the convex enve- We consider the minimization of a smooth lope of the rank function over the unit ball of spectral loss function regularized by the trace norm norm. A number of recent work has shown that the low of the matrix variable. Such formulation ¯nds rank solution can be recovered exactly via minimizing applications in many machine learning tasks the trace norm under certain conditions (Recht et al., including multi-task learning, matrix classi- 2008a; Recht et al., 2008b; Cand¶es& Recht, 2008). ¯cation, and matrix completion. The stan- dard semide¯nite programming formulation In practice, the trace norm relaxation has been shown for this problem is computationally expen- to yield low-rank solutions and it has been used widely sive. In addition, due to the non-smooth na- in many scenarios. In (Srebro et al., 2005; Rennie & ture of the trace norm, the optimal ¯rst-order Srebro, 2005; Weimer et al., 2008a; Cai et al., 2008; Ma black-box method for solving such class of et al., 2008) the matrix completion problem was formu- problems converges as O( p1 ), where k is the k lated as a trace norm minimization problem. In prob- iteration counter. In this paper, we exploit lems where multiple related tasks are learned simul- the special structure of the trace norm, based taneously, the models for di®erent tasks can be con- on which we propose an extended gradient al- strained to share certain information. Recently, this 1 gorithm that converges as O( k ). We further constraint has been expressed as the trace norm regu- propose an accelerated gradient algorithm, larization on the weight matrix in the context of multi- which achieves the optimal convergence rate task learning (Abernethy et al., 2006; Argyriou et al., 1 of O( k2 ) for smooth problems. Experiments 2008; Abernethy et al., 2009; Obozinski et al., 2009), on multi-task learning problems demonstrate multi-class classi¯cation (Amit et al., 2007), and mul- the e±ciency of the proposed algorithms. tivariate linear regression (Yuan et al., 2007; Lu et al., 2008). For two-dimensional data such as images, the matrix classi¯cation formulation (Tomioka & Aihara, 1. Introduction 2007; Bach, 2008) applies a weight matrix, regularized by its trace norm, on the data. It was shown The problem of minimizing the rank of a matrix vari- (Tomioka & Aihara, 2007) that such formulation leads able subject to certain constraints arises in many ¯elds to improved performance over conventional methods. including machine learning, automatic control, and image compression. For example, in collaborative ¯l- A practical challenge in employing the trace norm reg- tering we are given a partially ¯lled rating matrix and ularization is to develop e±cient algorithms to solve the task is to predict the missing entries. Since it is the resulting non-smooth optimization problems. It is commonly believed that only a few factors contribute well-known that the trace norm minimization problem to an individual's tastes, it is natural to approximate can be formulated as a semide¯nite program (Fazel the given rating matrix by a low-rank matrix. How- et al., 2001; Srebro et al., 2005). However, such formu- ever, the matrix rank minimization problem is NP- lation is computationally expensive. To overcome this hard in general due to the combinatorial nature of the limitation, a number of algorithms have been devel- rank function. A commonly-used convex relaxation of oped recently (Rennie & Srebro, 2005; Weimer et al., the rank function is the trace norm (nuclear norm) 2008a; Weimer et al., 2008b; Cai et al., 2008; Ma et al., 2008). In these algorithms some form of approxima- Appearing in Proceedings of the 26 th International Confer- tion is usually employed to deal with the non-smooth ence on Machine Learning, Montreal, Canada, 2009. Copy- trace norm term. However, a fast global convergence right 2009 by the author(s)/owner(s). An Accelerated Gradient Method for Trace Norm Minimization rate for these algorithms is di±cult to guarantee. ² Multi-taskP learningP (Argyriou et al., 2008): f(W ) = n si `(yj; wT xj), where n is the Due to the non-smooth nature of the trace norm, a i=1 j=1 i i i j j m simple approach to solve these problems is the subgra- number of tasks, (xi ; yi ) 2 R £R is the jth sam- dient method (Bertsekas, 1999; Nesterov, 2003), which ple in the ith task, si is the number of samples in m£n converges as O( p1 ) where k is the iteration counter. the ith task, and W = [w1; ¢ ¢ ¢ ; wn] 2 R . k It is known from the complexity theory of convex opti- ² Matrix classi¯cation (Tomioka & Aihara, 2007; mization (Nemirovsky & Yudin, 1983; Nesterov, 2003) Ps Bach, 2008): f(W ) = `(y ; Tr(W T X )), that this convergence rate is already optimal for non- i=1 i i where (X ; y ) 2 Rm£n £ R is the ith sample. smooth optimization under the ¯rst-order black-box i i model, where only the function values and ¯rst-order ² Matrix completion (Srebro et al., 2005; Cand¶es derivatives are used. & Recht, 2008;P Recht et al., 2008a; Ma et al., In this paper we propose e±cient algorithms with fast 2008): f(W ) = (i;j)2 `(Mij;Wij), where M 2 global convergence rates to solve trace norm regu- Rm£n is the partially observed matrix with the larized problems. Speci¯cally, we show that by ex- entries in being observed. ploiting the special structure of the trace norm, the classical gradient method for smooth problems can Since the trace norm term in the objective function in be adapted to solve the trace norm regularized non- Eq. (1) is non-smooth, a natural approach for solving smooth problems. This results in an extended gra- this problem is the subgradient method in which a dient algorithm with the same convergence rate of sequence of approximate solutions are generated as 1 O( k ) as that for smooth problems. Following the Nes- terov's method for accelerating the gradient method 1 0 Wk = Wk¡1 ¡ F (Wk¡1); (2) (Nesterov, 1983; Nesterov, 2003), we show that the tk extended gradient algorithm can be further accelerated to converge as O( 1 ), which is the optimal con- where Wk is the approximate solution at the kth it- k2 eration, 1 is the step size, and F 0(W ) 2 @F (W ) is vergence rate for smooth problems. Hence, the non- tk smoothness e®ect of the trace norm regularization is the subgradient of F (W ) at W and @F (W ) denotes e®ectively removed. The proposed algorithms extend the subdi®erential (Bertsekas, 1999; Nesterov, 2003) the algorithms in (Nesterov, 2007; Tseng, 2008; Beck of F (W ) at W . It is known (Nesterov, 2003) that the subgradient method converges as O( p1 ), i.e., & Teboulle, 2009) to the matrix case. Experiments k on multi-task learning problems demonstrate the e±- ciency of the proposed algorithms in comparison with ¤ 1 F (Wk) ¡ F (W ) · cp ; (3) existing ones. Note that while the present paper was k under review, we became aware of a recent preprint by ¤ Toh and Yun (2009) who independently developed an for some constant c, where W = arg minW F (W ). algorithm that is similar to ours. It is known from the complexity theory of convex optimization (Nemirovsky & Yudin, 1983; Nesterov, 2. Problem Formulation 2003) that this convergence rate is already optimal for non-smooth problems under the ¯rst-order black- In this paper we consider the following problem: box model. Hence, the convergence rate cannot be improved if a black-box model, which does not exploit min F (W ) = f(W ) + ¸jjW jj¤ (1) W any special structure of the objective function, is em- where W 2 Rm£n is the decision matrix, f(¢) repre- ployed. We show in the following that by exploiting sents the loss induced by some convex smooth (di®er- the structure of the trace norm, its non-smoothness entiable) loss function `(¢; ¢), and jj ¢ jj¤ denotes the can be e®ectively overcome and the convergence rate trace norm de¯ned as the sum of the singular values. of the algorithm for solving the trace norm regularized We assume that the gradient of f(¢), denoted as 5f(¢), problem in Eq. (1) can be improved signi¯cantly. is Lipschitz continuous with constant L, i.e., m£n 3. An Extended Gradient Method jj 5 f(X) ¡ 5f(Y )jjF · LjjX ¡ Y jjF ; 8X; Y 2 R ; where jj¢jjF denotes the Frobenius norm. Such formu- First, consider the minimization of the smooth loss lation arises in many machine learning tasks such as in function without the trace norm regularization: multi-task learning, matrix classi¯cation, and matrix min f(W ): (4) completion problems. W An Accelerated Gradient Method for Trace Norm Minimization It is known (Bertsekas, 1999) that the gradient step The proof of this theorem is in the Appendix. 1 The above discussion shows that the problem in Wk = Wk¡1 ¡ 5 f(Wk¡1) (5) tk Eq. (8) can be readily solved by SVD. Furthermore, we show in the following that if the step size 1 of the for solving this smooth problem can be reformulated tk equivalently as a proximal regularization of the lin- gradient method is chosen properly, we can achieve the same convergence rate as in the smooth case, i.e., earized function f(W ) at Wk¡1 as 1 O( k ), despite the presence of the non-smooth trace Wk = arg min Pt (W; Wk¡1); (6) norm regularization.

An Accelerated Gradient Method for Trace Norm Minimization

12. Coordinate Descent Methods

Process Optimization

The Gradient Sampling Methodology

A New Algorithm of Nonlinear Conjugate Gradient Method with Strong Convergence*

Using a New Nonlinear Gradient Method for Solving Large Scale Convex Optimization Problems with an Application on Arabic Medical Text

Conditional Gradient Method for Multiobjective Optimization

Successive Linear Programming to The

Metaheuristic Techniques in Enhancing the Efficiency and Performance of Thermo-Electric Cooling Devices

Branch and Bound Algorithm for Dependency Parsing with Non-Local Features

Linear-Memory and Decomposition-Invariant Linearly Convergent Conditional Gradient Algorithm for Structured Polytopes

Conjugate Gradient Descent Conjugate Direction Methods

Block Stochastic Gradient Method