Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence

Efficient Multi-Stage Conjugate Gradient for Trust Region Step

Pinghua Gong and Changshui Zhang
State Key Laboratory on Intelligent Technology and Systems
Tsinghua National Laboratory for Information Science and Technology (TNList)
Department of Automation, Tsinghua University, Beijing 100084, China
{gph08@mails, zcs@mail}.tsinghua.edu.cn

Abstract

The trust region step problem, posed as a sphere constrained quadratic optimization, plays a critical role in the trust region Newton method. In this paper, we propose an efficient Multi-Stage Conjugate Gradient (MSCG) algorithm to compute the trust region step in a multi-stage manner. Specifically, when the iterative solution is in the interior of the sphere, we perform the conjugate gradient procedure. Otherwise, we perform a gradient descent procedure which points to the interior of the sphere and can make the next iterative solution an interior point. Subsequently, we proceed with the conjugate gradient procedure again. We repeat the above procedures until convergence. We also present a theoretical analysis which shows that the MSCG algorithm converges. Moreover, the proposed MSCG algorithm can generate a solution of any prescribed precision controlled by a tolerance parameter, which is the only parameter we need. Experimental results on large-scale text data sets demonstrate that our proposed MSCG algorithm has a faster convergence speed than the state-of-the-art algorithms.

1 Introduction

The trust region Newton method (TRON) has received increasing attention and has been successfully applied to many optimization problems in the artificial intelligence and machine learning communities (Lin, Weng, and Keerthi 2008; Kim, Sra, and Dhillon 2010; Yuan et al. 2010). The trust region Newton method minimizes an objective function l(w) by generating the solution at the (k+1)-th iteration via w^{k+1} = w^k + d^k, where d^k is a trust region step computed by the following trust region step problem:

\min_{d \in \mathbb{R}^n} \Big\{ f(d) = \frac{1}{2} d^T H^k d + (g^k)^T d \Big\} \quad \text{s.t.} \quad \|d\| \le \lambda_k, \qquad (1)

where g^k and H^k are respectively the gradient and the Hessian matrix of the objective function; λ_k is the trust region radius controlling the size of the sphere constraint (||d|| ≤ λ_k). We accept w^{k+1} = w^k + d^k only if d^k makes the ratio (l(w^{k+1}) − l(w^k))/f(d^k) large enough. Otherwise, we update λ_k and recompute d^k until the above ratio is large enough (Lin and Moré 1999; Lin, Weng, and Keerthi 2008; Yuan et al. 2010). In the trust region Newton method, a key issue is how to efficiently compute the trust region step in Eq. (1), which is the focus of this paper.

Existing algorithms for solving Eq. (1) can be broadly classified into two categories. The first category reformulates Eq. (1) as another optimization problem such as root finding (Moré and Sorensen 1983), parameterized eigenvalue finding (Rojas, Santos, and Sorensen 2000) and semi-definite programming (Rendl and Wolkowicz 1997; Fortin and Wolkowicz 2004). The second category directly solves Eq. (1), and includes the conjugate gradient method (Steihaug 1983), the Gauss quadrature technique (Golub and Von Matt 1991) and subspace minimization (Hager 2001; Erway, Gill, and Griffin 2009; Erway and Gill 2009). An interesting member of this class is Steihaug's algorithm (Steihaug 1983), which utilizes the conjugate gradient method to obtain an approximate solution quickly. However, the conjugate gradient procedure in Steihaug's algorithm terminates once its iterative solution first reaches the boundary of the sphere. Thus, Steihaug's algorithm may terminate even if it has not converged, and the precision of the solution cannot be specified by users (Gould et al. 1999). In practice, different applications may have different requirements for the precision of the trust region step, so it is preferable that the precision of the solution can be controlled by a parameter (Erway, Gill, and Griffin 2009).

Recently, a class of first-order algorithms which can efficiently solve Eq. (1) has attracted considerable attention. In the artificial intelligence and machine learning communities, these algorithms are often used to solve large-scale optimization problems with simple constraints (e.g., a sphere constraint) (Shalev-Shwartz, Singer, and Srebro 2007; Figueiredo, Nowak, and Wright 2007; Bach et al. 2011). The most typical ones are the Projected Gradient (PG) algorithm (Lin 2007; Duchi et al. 2008; Daubechies, Fornasier, and Loris 2008; Wright, Nowak, and Figueiredo 2009) and the Accelerated Projected Gradient (APG) algorithm (Nesterov 2004; Liu, Ji, and Ye 2009b; 2009a; Beck and Teboulle 2009; Gong, Gai, and Zhang 2011; Yuan, Liu, and Ye 2011; Liu, Sun, and Ye 2011). Due to the simplicity of projecting a vector onto the sphere constraint, both PG and APG are efficient algorithms for solving the trust region step problem in Eq. (1), especially in large-scale scenarios. However, both algorithms are general optimization techniques and they do not exploit the fact that the objective function in Eq. (1) is quadratic.
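Projection onto the sphere constraint is a simple rescaling, which is what makes PG-type methods attractive here. The following Python/NumPy snippet is a minimal sketch of one projected gradient step for Eq. (1), assuming H is available as an explicit array; it is only an illustration (not the implementation of the cited works), and the fixed step size eta is a hypothetical choice standing in for the step-size rules used in those papers.

```python
import numpy as np

def project_onto_sphere(d, lam):
    """Project d onto the feasible set {d : ||d|| <= lam} by rescaling."""
    nrm = np.linalg.norm(d)
    return d if nrm <= lam else (lam / nrm) * d

def pg_step(d, H, g, lam, eta):
    """One projected gradient step for f(d) = 0.5*d'Hd + g'd, s.t. ||d|| <= lam."""
    grad = H @ d + g                       # gradient of the quadratic objective
    return project_onto_sphere(d - eta * grad, lam)
```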

In this paper, we propose an efficient Multi-Stage Conjugate Gradient (MSCG) algorithm to solve the trust region step problem in a multi-stage manner. The main contributions of this paper include:
(1) We propose an efficient Multi-Stage Conjugate Gradient (MSCG) algorithm by extending the conjugate gradient algorithm from unconstrained optimization to sphere constrained quadratic optimization in a multi-stage manner. Specifically, when the iterative solution is in the interior of the sphere, we perform the conjugate gradient procedure. Otherwise, we perform a gradient descent procedure which points to the interior of the sphere and can make the next iterative solution an interior point. Subsequently, we proceed with the conjugate gradient procedure again. We repeat the above procedures until convergence.
(2) The MSCG algorithm can generate a solution of any prescribed precision controlled by a tolerance parameter, which is the only parameter we need.
(3) We present a detailed theoretical analysis which shows that our proposed MSCG algorithm decreases the objective function value in each iteration and further guarantees the convergence of the algorithm. Moreover, empirical studies on large-scale text data sets demonstrate that the MSCG algorithm has a faster convergence speed than the state-of-the-art algorithms.

The rest of this paper is organized as follows: In Section 2, we introduce some notations and preliminaries on the conjugate gradient method. In Section 3, we present the proposed MSCG algorithm. Experimental results are presented in Section 4 and we conclude the paper in Section 5.

2 Preliminaries

Notations

We introduce some notations used throughout the paper. Scalars are denoted by lower case letters (e.g., x ∈ R) and vectors by lower case bold face letters (e.g., x ∈ R^n). Matrices are denoted by capital letters (e.g., A) and the Euclidean norm of the vector x is denoted by \|x\| = \sqrt{\sum_{i=1}^n x_i^2}. ∇f(x) denotes the first derivative (gradient) of f(x) at x.

Conjugate Gradient

Conjugate gradient (CG) is an efficient method to solve the following unconstrained quadratic programming problem:

\min_{x \in \mathbb{R}^n} \Big\{ h(x) = \frac{1}{2} x^T Q x + b^T x \Big\},

where Q ∈ R^{n×n} is positive definite and b ∈ R^n. CG is an iterative algorithm which generates the solution x^{i+1} at the (i+1)-th iteration via

x^{i+1} = x^i + \frac{\|\nabla h(x^i)\|^2}{(p^i)^T Q p^i}\, p^i,

where ∇h(x^i) is the gradient of h(x) at x^i and p^i is the conjugate gradient direction given by

p^i = -\nabla h(x^i) + \frac{\|\nabla h(x^i)\|^2}{\|\nabla h(x^{i-1})\|^2}\, p^{i-1} \quad \text{with} \quad p^0 = -\nabla h(x^0).

In theory, CG terminates with an optimal solution after at most n steps (Bertsekas 1999). However, in practice, CG often terminates after far fewer than n steps, especially when the data is high dimensional (n is large). Thus, CG is very efficient for solving unconstrained quadratic optimization. However, extending CG to efficiently solve constrained quadratic optimization is still a hard problem.
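For concreteness, the recursion above can be written in a few lines of code. The following Python/NumPy sketch is only an illustration of the CG iteration (the paper's experiments use Matlab); the tolerance tol is a hypothetical stopping choice rather than part of the method.

```python
import numpy as np

def conjugate_gradient(Q, b, x0, tol=1e-8):
    """CG for min h(x) = 0.5*x'Qx + b'x with Q positive definite."""
    x = x0.copy()
    grad = Q @ x + b                        # ∇h(x^0)
    p = -grad                               # p^0 = -∇h(x^0)
    for _ in range(len(b)):                 # at most n steps in exact arithmetic
        if np.linalg.norm(grad) <= tol:
            break
        Qp = Q @ p
        alpha = (grad @ grad) / (p @ Qp)    # step size ||∇h(x^i)||^2 / ((p^i)'Q p^i)
        x = x + alpha * p
        new_grad = grad + alpha * Qp        # ∇h(x^{i+1}) = ∇h(x^i) + alpha * Q p^i
        p = -new_grad + ((new_grad @ new_grad) / (grad @ grad)) * p
        grad = new_grad
    return x
```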
3 Proposed Multi-Stage Conjugate Gradient

In this section, we propose a Multi-Stage Conjugate Gradient (MSCG) algorithm to efficiently solve the trust region step problem. The MSCG algorithm extends the conjugate gradient method used in traditional unconstrained quadratic optimization to solve the sphere constrained quadratic optimization in a multi-stage manner. We unclutter Eq. (1) by omitting the superscripts as follows:

\min_{d \in \mathbb{R}^n} \Big\{ f(d) = \frac{1}{2} d^T H d + g^T d \Big\} \quad \text{s.t.} \quad \|d\| \le \lambda. \qquad (2)

In the subsequent discussion, we consider the case that H is positive definite, which is very common in real applications such as logistic regression, support vector machines (Lin, Weng, and Keerthi 2008) and wavelet-based image deblurring problems (Beck and Teboulle 2009). Our proposed MSCG algorithm involves a conjugate gradient procedure (C procedure) and a gradient descent procedure (G procedure). The MSCG algorithm calculates the iterative solution by switching between them. Specifically, when the i-th iterative solution d^i is an interior point of the sphere, i.e., ||d^i|| < λ, we perform the C procedure. Otherwise, we turn to the G procedure. The detailed MSCG algorithm is presented in Algorithm 1.

Figure 1: Illustration for the MSCG algorithm. The initial solution d^0 is in the interior of the sphere and d^1, d^2, d^3 are generated by the C procedure. d^3 is on the boundary of the sphere and we turn to the G procedure, which generates d^4. Note that if the algorithm starts from 0 and terminates when the iterative solution (d^3) first reaches the boundary of the sphere, the MSCG algorithm degrades into Steihaug's algorithm. Please refer to the text for more detailed explanations.

To better understand Algorithm 1, we illustrate a simple case of the MSCG algorithm in Figure 1. The initial solution d^0 is in the interior of the sphere. Thus, we perform the C procedure, which generates the iterative solutions d^1 and d^2 lying in the interior of the sphere. When proceeding with the C procedure, we obtain d̄^3, which exceeds the sphere constraint. At this point, we truncate d̄^3 such that the iterative solution d^3 lies on the boundary of the sphere (lines 21-24). Then, we turn to the G procedure, whose descent direction points to the interior of the sphere; the G procedure generates the iterative solution d^4 lying in the interior of the sphere. Thus, we reset the conjugate gradient direction (lines 14-15) and turn to the C procedure. We repeat the above process, alternating between the C procedure and the G procedure, until the algorithm converges.

Algorithm 1: MSCG: Multi-Stage Conjugate Gradient algorithm for the trust region step problem

Input: H ∈ R^{n×n}, g, d^0 ∈ R^n, λ > 0, 0 < ε < 1
1   if ||d^0|| < λ then
2       flag = 1;
3   else
4       d^0 = (λ/||d^0||) d^0; flag = 0;
5   end
6   reset = flag;
7   for i = 0, 1, ... do
8       q^i = ∇f(d^i) + (||∇f(d^i)||/λ) d^i;
9       if ||q^i|| ≤ ε||g|| then
10          d* = d^i, iter = i; break;
11      end
12      if flag then
13          (C procedure: lines 14-24)
14          if reset then
15              p^i = −∇f(d^i); reset = 0;
16          else
17              p^i = −∇f(d^i) + (||∇f(d^i)||^2/||∇f(d^{i−1})||^2) p^{i−1};
18          end
19          α_i = ||∇f(d^i)||^2/((p^i)^T H p^i);
20          d^{i+1} = d^i + α_i p^i;
21          if ||d^{i+1}|| ≥ λ then
22              Find a τ > 0 such that ||d^i + τ p^i|| = λ;
23              d^{i+1} = d^i + τ p^i; flag = 0;
24          end
25      else
26          (G procedure: lines 27-32)
27          β_i = (q^i)^T ∇f(d^i)/((q^i)^T H q^i);
28          d^{i+1} = d^i − β_i q^i; flag = 1; reset = 1;
29          if ||d^{i+1}|| ≥ λ then
30              Find a τ > 0 such that ||d^i − τ q^i|| = λ;
31              d^{i+1} = d^i − τ q^i; flag = 0;
32          end
33      end
34  end
Output: d*, iter
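The following Python/NumPy sketch mirrors Algorithm 1; it is an illustration rather than the authors' Matlab implementation. Here matvec(v) stands for the Hessian-vector product Hv (so H never has to be formed), the iteration cap is a hypothetical safeguard, and each truncation step solves ||d^i + τp^i|| = λ (or its boundary special case) for the nonnegative root.

```python
import numpy as np

def mscg(matvec, g, d0, lam, eps, max_iter=10000):
    """Sketch of Algorithm 1 for min f(d) = 0.5*d'Hd + g'd  s.t. ||d|| <= lam.
    matvec(v) returns H @ v; 0 < eps < 1 is the precision tolerance of line 9."""
    d = d0.copy()
    flag = np.linalg.norm(d) < lam                       # lines 1-5
    if not flag:
        d = (lam / np.linalg.norm(d)) * d
    reset = flag                                         # line 6
    grad = matvec(d) + g                                 # ∇f(d^0) = H d^0 + g
    p, prev_grad_sq = None, None
    for i in range(max_iter):                            # line 7
        q = grad + (np.linalg.norm(grad) / lam) * d      # line 8
        if np.linalg.norm(q) <= eps * np.linalg.norm(g): # line 9: stopping criterion
            break
        if flag:                                         # C procedure (lines 14-24)
            if reset:
                p, reset = -grad, False                  # lines 14-15
            else:
                p = -grad + (grad @ grad / prev_grad_sq) * p   # line 17
            Hv = matvec(p)
            step = (grad @ grad) / (p @ Hv)              # line 19: alpha_i
            d_next = d + step * p                        # line 20
            if np.linalg.norm(d_next) >= lam:            # lines 21-24: truncate to boundary
                pd, cc = p @ d, d @ d - lam ** 2
                step = (-pd + np.sqrt(pd ** 2 - (p @ p) * cc)) / (p @ p)
                d_next, flag = d + step * p, False
        else:                                            # G procedure (lines 27-32)
            Hv = matvec(q)
            step = -(q @ grad) / (q @ Hv)                # line 27: step = -beta_i
            d_next = d + step * q                        # line 28
            flag, reset = True, True
            if np.linalg.norm(d_next) >= lam:            # lines 29-32: truncate to boundary
                step = -2 * (q @ d) / (q @ q)
                d_next, flag = d + step * q, False
        prev_grad_sq = grad @ grad
        grad = grad + step * Hv                          # ∇f(d^{i+1}) = ∇f(d^i) + θ_i * Hv
        d = d_next
    return d, i
```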
Theoretical Analysis

In this subsection, we present some theoretical results for the MSCG algorithm (Algorithm 1). First of all, inspired by (Kučera 2007), we provide an important lemma which motivates us to solve the sphere constrained quadratic optimization problem in a multi-stage manner.

Lemma 1 Let q^i = ∇f(d^i) + (||∇f(d^i)||/λ) d^i and ||d^i|| = λ. If d^i is not the optimal solution of Eq. (2), then we have (q^i)^T d^i > 0 and (q^i)^T ∇f(d^i) > 0.

Proof The KKT conditions of Eq. (2) are

\nabla f(d^\star) + \mu^\star d^\star = 0, \quad \mu^\star(\|d^\star\| - \lambda) = 0 \quad \text{and} \quad \mu^\star \ge 0, \qquad (3)

where d* and μ* are the optimal primal and dual variables of Eq. (2), respectively. If μ* > 0, we have

\|d^\star\| = \lambda \;\Rightarrow\; \mu^\star = \frac{\|\nabla f(d^\star)\|}{\|d^\star\|} = \frac{\|\nabla f(d^\star)\|}{\lambda}.

Otherwise, μ* = 0 and ∇f(d*) = 0 hold. Thus, we can rewrite the above KKT conditions in Eq. (3) in the following compact form:

q^\star = \nabla f(d^\star) + \frac{\|\nabla f(d^\star)\|}{\lambda} d^\star = 0. \qquad (4)

If d^i is not the optimal solution, then q^i ≠ 0. Thus ∇f(d^i) and d^i are not collinear. Moreover, we have ||d^i|| = λ, so Lemma 1 follows from the facts:

(q^i)^T d^i = (d^i)^T \nabla f(d^i) + \lambda\|\nabla f(d^i)\| > \lambda\|\nabla f(d^i)\| - \|d^i\|\cdot\|\nabla f(d^i)\| = 0,

(q^i)^T \nabla f(d^i) = \|\nabla f(d^i)\|\left(\|\nabla f(d^i)\| + \frac{(d^i)^T \nabla f(d^i)}{\lambda}\right) > 0.

Lemma 1 indicates that when the iterative solution d^i is on the boundary of the sphere, −q^i is a descent direction and simultaneously points to the interior of the sphere. Thus, conducting gradient descent along −q^i can decrease the objective function value and make the next iterative solution d^{i+1} an interior point of the sphere. Once d^{i+1} is in the interior of the sphere constraint, we can perform the conjugate gradient procedure (C procedure).

Next, we formally present a result which establishes the descent property of the MSCG algorithm (Algorithm 1).

Lemma 2 If d^i is not the optimal solution of Eq. (2), then d^{i+1} generated by Algorithm 1 decreases the objective function value, i.e., f(d^{i+1}) < f(d^i).

Proof Algorithm 1 includes the C procedure and the G procedure, so we prove Lemma 2 in two cases:
(1) d^{i+1} is generated by the C procedure. If d^{i+1} = d^i + α_i p^i satisfies ||d^{i+1}|| < λ, we have

\nabla f(d^{i+1}) = \nabla f(d^i) + \alpha_i H p^i \quad \text{and} \quad (p^i)^T \nabla f(d^{i+1}) = 0,

according to the properties of the conjugate gradient method (Bertsekas 1999). Thus, we obtain:

f(d^{i+1}) - f(d^i) = \alpha_i (p^i)^T \nabla f(d^i) + \tfrac{1}{2}\alpha_i^2 (p^i)^T H p^i
 = \alpha_i (p^i)^T \big(\nabla f(d^i) + \alpha_i H p^i\big) - \tfrac{1}{2}\alpha_i^2 (p^i)^T H p^i
 = \alpha_i (p^i)^T \nabla f(d^{i+1}) - \tfrac{1}{2}\alpha_i^2 (p^i)^T H p^i
 = -\tfrac{1}{2}\alpha_i^2 (p^i)^T H p^i < 0.

Otherwise, d^{i+1} = d^i + τ p^i (0 < τ ≤ α_i) and we have

f(d^{i+1}) - f(d^i) = \tau (p^i)^T \nabla f(d^i) + \tfrac{1}{2}\tau^2 (p^i)^T H p^i
 = \tau (p^i)^T \big(\nabla f(d^i) + \alpha_i H p^i\big) + \big(\tfrac{1}{2}\tau^2 - \tau\alpha_i\big)(p^i)^T H p^i
 = \big(\tfrac{1}{2}\tau^2 - \tau\alpha_i\big)(p^i)^T H p^i < 0.

Thus, f(d^{i+1}) < f(d^i) holds.

(2) d^{i+1} is generated by the G procedure. Denote

l(\theta) = f(d^i - \theta q^i) - f(d^i) = \tfrac{1}{2}\theta^2 (q^i)^T H q^i - \theta (q^i)^T \nabla f(d^i).

Then l(θ) is a strictly decreasing function of θ in the interval [0, β_i], where β_i = (q^i)^T ∇f(d^i)/((q^i)^T H q^i) > 0. Moreover, we have l(0) = 0 and hence l(τ) < 0 and l(β_i) < 0 for all 0 < τ ≤ β_i. Therefore, whether d^{i+1} = d^i − β_i q^i or d^{i+1} = d^i − τ q^i (0 < τ ≤ β_i) is adopted, we have f(d^{i+1}) < f(d^i).

The descent property in Lemma 2 is a critical result for the convergence of Algorithm 1. Finally, we present the convergence guarantee as follows:

Theorem 1 The sequence {d^i} generated by Algorithm 1 converges to the unique optimal solution d*.

Proof We first show that the sequence {d^i} generated by Algorithm 1 has limit points. Since ||d^i|| ≤ λ (see Algorithm 1) and f(d^i) < f(d^0) (see Lemma 2), the level set {d | f(d) ≤ f(d^0)} is nonempty and bounded. Thus, {d^i} generated by Algorithm 1 has limit points.

Next, we show that every limit point of {d^i} is the unique optimal solution d*. f(d) is continuously differentiable and {d^i} is a sequence satisfying f(d^{i+1}) < f(d^i) for all i. In addition, {d^i} is generated by a gradient method d^{i+1} = d^i + θ_i s^i, where θ_i = α_i (or τ) and s^i = p^i for the C procedure, and θ_i = β_i (or τ) and s^i = −q^i for the G procedure. According to Proposition 1.2.5 in (Bertsekas 1999), every limit point of {d^i} generated by Algorithm 1 is a stationary point of f(d). Note that we consider the case that H is positive definite and hence f(d) is strictly convex. Thus, f(d) has a unique optimal solution d*, and hence every limit point of {d^i} is the optimal solution d*.

Implementation Issues and Discussions

We address some implementation issues and discussions in the following aspects:
(1) In the implementation of Algorithm 1, we calculate the matrix-vector product instead of storing the Hessian matrix H explicitly. This saves the memory required to store a large H and enables the MSCG algorithm to tackle large scale trust region step problems.
(2) The most costly computation in each iteration is the matrix-vector product, and we implement Algorithm 1 carefully such that only one matrix-vector product (for computing the step size) is necessary in each iteration. Specifically, when computing the step size α_i (or β_i) in the i-th iteration, we record the matrix-vector product Hp^i (or Hq^i) as Hv. Then we compute the gradient of the next iteration via ∇f(d^{i+1}) = ∇f(d^i) + θ_i Hv, where θ_i equals α_i or τ if the C procedure is adopted, and −β_i or −τ otherwise.
(3) For the termination criterion of Algorithm 1, we consider the KKT condition in Eq. (4). Strictly speaking, Algorithm 1 converges if and only if Eq. (4) is satisfied. In practical implementation, we can use any small parameter ε to specify the precision of the solution (see line 9 of Algorithm 1). We notice that the convergence result for TRON in (Lin, Weng, and Keerthi 2008) (Theorem 2) is established by requiring the precision of the solution of the sub-problem in Eq. (2) to satisfy some conditions. Our proposed MSCG algorithm guarantees convergence and can find a solution in any prescribed precision controlled by users. Thus, the MSCG algorithm can always meet the precision requirement of Theorem 2 in (Lin, Weng, and Keerthi 2008). Moreover, Steihaug's algorithm can be treated as a special case of the MSCG algorithm with a specific termination criterion. If we set the initial point d^0 as 0 and terminate Algorithm 1 when the iterative solution first reaches the boundary of the sphere (see Figure 1), the MSCG algorithm degrades into Steihaug's algorithm (Steihaug 1983). The advantage of the MSCG algorithm over Steihaug's algorithm is its ability to control the precision of the solution, which makes it more flexible in different applications.
(4) The step sizes α_i and β_i are both computed according to the minimization rule (Bertsekas 1999). In the truncation step of lines 22-23, τ is the nonnegative root of the following quadratic equation:

\|p^i\|^2 \tau^2 + 2 (p^i)^T d^i \tau + \|d^i\|^2 - \lambda^2 = 0.

In the truncation step of lines 30-31, we have ||d^i||^2 = λ^2, and thus τ is computed by:

\tau = \frac{2 (q^i)^T d^i}{\|q^i\|^2}.
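For the truncation in lines 22-23, the quadratic above has a nonpositive constant term because ||d^i|| ≤ λ, so its nonnegative root admits the closed form (stated here only as a convenient spelled-out step):

\tau = \frac{-(p^i)^T d^i + \sqrt{\big((p^i)^T d^i\big)^2 - \|p^i\|^2\big(\|d^i\|^2 - \lambda^2\big)}}{\|p^i\|^2}.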
4 Experiments

In this section, we present empirical studies on the MSCG algorithm compared with the state-of-the-art algorithms.

Experimental Setup

We consider the trust region step problem in the following logistic regression problem:

\min_{w}\Big\{ l(w) = \frac{1}{2}\|w\|^2 + \tau \sum_{i=1}^{m} \log\big(1 + \exp(-y_i w^T x_i)\big) \Big\},

where w is the weight vector; x_i and y_i ∈ {+1, −1} are the i-th training sample and the corresponding class label, respectively; τ is a parameter that balances the Euclidean norm regularization and the logistic loss. To solve the trust region step problem in Eq. (2), we need to compute the gradient and Hessian matrix of l(w), which can be obtained as follows (Lin, Weng, and Keerthi 2008):

g = w + \tau \sum_{i=1}^{m} \big(\sigma(y_i w^T x_i) - 1\big) y_i x_i, \qquad (5)

H = I + \tau X^T D X, \qquad (6)

where σ(s) is the sigmoid function σ(s) = (1 + exp(−s))^{−1} and X = [x_1^T; ...; x_m^T] is the data matrix with each row as a sample. I is the identity matrix and D is a diagonal matrix with the i-th diagonal entry

D_{ii} = \sigma(y_i w^T x_i)\big(1 - \sigma(y_i w^T x_i)\big).
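As a concrete illustration of Eq. (5) and the diagonal entries D_ii, the following Python/NumPy sketch computes the gradient of l(w) and the vector diag(D) for a data matrix X whose rows are x_i^T; it is an illustration only, not the Matlab code used in the experiments.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logreg_gradient_and_D(w, X, y, tau):
    """Gradient of l(w) = 0.5*||w||^2 + tau*sum_i log(1+exp(-y_i*w'x_i)) and diag(D)."""
    z = y * (X @ w)                             # y_i * w'x_i for all i
    sig = sigmoid(z)
    g = w + tau * (X.T @ ((sig - 1.0) * y))     # Eq. (5) in matrix form
    D = sig * (1.0 - sig)                       # D_ii = sigma(z_i) * (1 - sigma(z_i))
    return g, D
```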

Table 1: Text data sets statistics: m is the number of samples (# positive and # negative are the numbers of positive and negative samples, respectively) and n is the dimensionality of the data. # nonzero denotes the number of nonzero entries of the data.

No.  dataset    m      n        # positive  # negative  # nonzero
1    classic    7094   41681    2431        4663        223839
2    hitech     2301   10080    1030        1271        331373
3    k1b        2340   21839    2024        316         302992
4    la12       6279   31472    2413        3866        939407
5    la1        3204   31472    1250        1954        484024
6    la2        3075   31472    1163        1912        455383
7    news20     19996  1355191  9999        9997        9097916
8    ng3sim     2998   15810    1000        1998        335879
9    ohscal     11162  11465    4497        6665        674365
10   real-sim   72309  20958    22238       50071       3709083
11   reviews    4069   18482    2132        1937        758635
12   sports     8580   14866    4967        3613        1091723

Table 2: The averaged time (seconds) and the averaged number of matrix-vector products (# averaged prodnum) over 10 independent runs for λ = 10 (top) and λ = 1000 (bottom) on the text data sets 1-4.

λ = 10
data set   time MSCG   time APG   time PG   prodnum MSCG   prodnum APG   prodnum PG
classic    0.0391      0.0942     0.0671    8.0            29.0          20.0
hitech     0.0523      0.0865     0.0624    20.0           39.0          28.0
k1b        0.0245      0.0938     0.0683    7.0            33.0          24.0
la12       0.1134      0.3464     0.1977    14.0           47.0          27.0

λ = 1000
data set   time MSCG   time APG   time PG    prodnum MSCG   prodnum APG   prodnum PG
classic    0.3636      18.3538    24.9720    76.0           5516.0        6896.0
hitech     2.0385      17.6176    23.4262    742.0          7737.0        9983.0
k1b        2.6893      18.7076    58.3748    705.0          6459.0        18994.0
la12       7.4677      57.5597    144.6517   862.0          7669.0        18693.0

Figure 2: Averaged time (top) and averaged # prodnum (bottom) vs. the trust region radius (λ) plots on the text data sets 5-8.

Figure 3: Objective (function) value vs. time plots with different trust region radii on the text data sets 9-12 and 1-4.

We should note that the positive definite Hessian matrix H is very large when the data is high dimensional. Moreover, H is dense even though X is sparse. Thus it is impractical to store the Hessian matrix explicitly. To address this issue, we instead calculate the following matrix-vector product:

Hp = p + \tau X^T \big(D(Xp)\big), \qquad (7)

since the MSCG algorithm does not have to store H explicitly as long as the matrix-vector product is available. If the matrix X is sparse, we can efficiently calculate the above matrix-vector product without storing H.
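A minimal sketch of Eq. (7) is given below, assuming X is a SciPy sparse (or dense NumPy) matrix and D is supplied as the length-m vector of diagonal entries (e.g., computed as in the gradient sketch above); it is an illustration, not the authors' implementation.

```python
import numpy as np

def hessian_vector_product(p, X, D, tau):
    """Return Hp = p + tau * X'(D(Xp)) without forming H = I + tau*X'DX (Eq. (7)).
    X may be a scipy.sparse matrix or a dense array; D is a length-m vector."""
    Xp = X @ p                          # (m,) vector
    return p + tau * (X.T @ (D * Xp))
```

A closure such as matvec = lambda v: hessian_vector_product(v, X, D, tau) can then be handed to the Algorithm 1 sketch given earlier.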
Competing Algorithms and Data Sets

We compare the proposed MSCG algorithm with two competing algorithms: Projected Gradient (PG) and Accelerated Projected Gradient (APG). These two algorithms are very popular in the artificial intelligence and machine learning communities and have been successfully applied to efficiently solve many optimization problems with simple constraints (e.g., a sphere constraint); they are especially efficient in large-scale scenarios. We should mention that we do not include Steihaug's algorithm (Steihaug 1983) in the comparison, since it is a special case of the MSCG algorithm with a specific termination criterion.

We consider twelve text data sets from different sources. These data sets are high dimensional and sparse, and they are summarized in Table 1. Two of them (news20, real-sim) are downloaded from http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/, and they have been preprocessed as two-class data sets (Lin, Weng, and Keerthi 2008). The other ten are available at http://www.shi-zhong.com/software/docdata.zip and are multi-class data sets. We transform the multi-class data sets into two-class data sets (Lin, Weng, and Keerthi 2008) by labeling the first half of all classes as the positive class and the remaining classes as the negative class.

Experimental Evaluation and Analysis

The three algorithms are implemented in Matlab and we execute the experiments on an Intel(R) Core(TM)2 Quad CPU (Q6600 @2.4GHz) with 8GB memory. We set τ = 1 and generate each entry of w from the normal distribution with mean 0 and standard deviation 1. Thus we can obtain the gradient and the matrix-vector product via Eq. (5) and Eq. (7). To fully evaluate the computational efficiency of the MSCG algorithm, we set the trust region radius λ in Eq. (2) as 10^i (i = −1, 0, 1, 2, 3, 4). We terminate the three algorithms (MSCG, APG and PG) when the precision-controlling tolerance ε is less than 10^{−4}, and we record the CPU time and the number of matrix-vector products (# prodnum).

We report the CPU time and the number of matrix-vector products, averaged over 10 independent runs, in Table 2. From these results, we have the following observations: (a) The MSCG algorithm is the most efficient among the three algorithms, especially when the trust region radius is large (λ = 1000). (b) The CPU time is proportional to the number of matrix-vector products (# prodnum) for all three algorithms, since the matrix-vector product is the most costly operation in each iteration. This is consistent with the analysis in Section 3. (c) PG is more efficient than APG when the trust region radius is small, but APG is faster than PG when the trust region radius is large.

We further study the computational efficiency of the MSCG algorithm when the trust region radius varies. Figure 2 shows the averaged CPU time and the averaged number of matrix-vector products (# prodnum) vs. the trust region radius (λ) over 10 independent runs. From these curves, we observe: (i) The MSCG algorithm is always the most efficient among the three algorithms. (ii) For the PG and APG algorithms, the CPU time and the number of matrix-vector products increase when the trust region radius increases. But for the MSCG algorithm, both the CPU time and the number of matrix-vector products decrease when the trust region radius is large enough (λ = 10000). In that case, the optimal solution of the trust region step problem in Eq. (2) is in the interior of the sphere constraint, so Eq. (2) is equivalent to an unconstrained quadratic optimization problem. In this situation, the PG and APG algorithms are computationally more expensive, while the MSCG algorithm only involves the C procedure and is computationally inexpensive. (iii) We observe the same phenomena as in observations (b) and (c) above.

To check the convergence behavior of the MSCG algorithm, we show the objective (function) value vs. time plots in Figure 3. From these curves, we have the following observations: (1) The MSCG algorithm always converges the fastest among the three algorithms. (2) The MSCG algorithm rapidly decreases the objective value and makes it close to the minimum objective value in a very short time. This is very useful when we do not require high-precision solutions.

5 Conclusions

In this paper, we propose an efficient Multi-Stage Conjugate Gradient (MSCG) algorithm to solve the trust region step problem, which is a key ingredient in the trust region Newton method. The proposed MSCG algorithm extends the conjugate gradient method to solve the sphere constrained quadratic optimization problem and involves a C procedure and a G procedure. When the iterative solution is in the interior of the sphere, we perform the C procedure. Otherwise, we perform the G procedure, whose descent direction points to the interior of the sphere. Thus, the G procedure can make the next iterative solution an interior point, and hence we can proceed with the conjugate gradient again. By repeating the above procedures until convergence, we utilize the conjugate gradient in a multi-stage manner.

The MSCG algorithm can generate a solution of any prescribed precision controlled by a tolerance parameter, which is the only parameter we need. Moreover, we provide a detailed theoretical analysis which shows that the MSCG algorithm decreases the objective function value in each iteration and further guarantees the convergence of the algorithm. Empirical studies on large-scale text data sets demonstrate that our proposed MSCG algorithm has a faster convergence speed than the state-of-the-art algorithms. More interestingly, our MSCG algorithm can rapidly decrease the objective value and make it close to the minimum objective value in a very short time. Thus, we can obtain a good approximate solution quickly, which is very useful when we do not require high-precision solutions.

In our future work, we hope to extend the MSCG algorithm to efficiently solve more problems such as ℓ1-norm constrained least squares and other structured sparsity optimization problems. There are three key aspects: (1) In the C procedure, if some feasible solution exceeds the constraint, how can we efficiently truncate it such that it lies on the boundary of the constraint? (2) When some feasible solution lies on the boundary of the constraint, how can we find a descent direction which points to the interior of the constraint? (3) Can we employ the preconditioned conjugate gradient method instead of the conjugate gradient method to further accelerate the convergence speed?

Acknowledgments

This work is supported by NSFC (Grant No. 91120301, 61075004 and 60835002).

References

Bach, F.; Jenatton, R.; Mairal, J.; and Obozinski, G. 2011. Optimization with sparsity-inducing penalties. arXiv preprint arXiv:1108.0775.
Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183-202.
Bertsekas, D. P. 1999. Nonlinear Programming. Athena Scientific.
Daubechies, I.; Fornasier, M.; and Loris, I. 2008. Accelerated projected gradient method for linear inverse problems with sparsity constraints. Journal of Fourier Analysis and Applications 14(5):764-792.
Duchi, J.; Shalev-Shwartz, S.; Singer, Y.; and Chandra, T. 2008. Efficient projections onto the ℓ1-ball for learning in high dimensions. In ICML, 272-279.
Erway, J., and Gill, P. 2009. A subspace minimization method for the trust-region step. SIAM Journal on Optimization 20:1439-1461.
Erway, J.; Gill, P.; and Griffin, J. 2009. Iterative methods for finding a trust-region step. SIAM Journal on Optimization 20(2):1110-1131.
Figueiredo, M.; Nowak, R.; and Wright, S. 2007. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing 1(4):586-597.
Fortin, C., and Wolkowicz, H. 2004. The trust region subproblem and semidefinite programming. Optimization Methods and Software 19(1):41-67.
Golub, G., and Von Matt, U. 1991. Quadratically constrained least squares and quadratic problems. Numerische Mathematik 59(1):561-580.
Gong, P.; Gai, K.; and Zhang, C. 2011. Efficient euclidean projections via piecewise root finding and its application in gradient projection. Neurocomputing 2754-2766.
Gould, N.; Lucidi, S.; Roma, M.; and Toint, P. 1999. Solving the trust-region subproblem using the Lanczos method. SIAM Journal on Optimization 9(2):504-525.
Hager, W. 2001. Minimizing a quadratic over a sphere. SIAM Journal on Optimization 12:188-208.
Kim, D.; Sra, S.; and Dhillon, I. 2010. A scalable trust-region algorithm with application to mixed-norm regression. In ICML.
Kučera, R. 2007. Minimizing quadratic functions with separable quadratic constraints. Optimization Methods and Software 22(3):453-467.
Lin, C., and Moré, J. 1999. Newton's method for large bound-constrained optimization problems. SIAM Journal on Optimization 9:1100-1127.
Lin, C.; Weng, R.; and Keerthi, S. 2008. Trust region Newton method for logistic regression. Journal of Machine Learning Research 9:627-650.
Lin, C. 2007. Projected gradient methods for nonnegative matrix factorization. Neural Computation 19(10):2756-2779.
Liu, J.; Ji, S.; and Ye, J. 2009a. Multi-task feature learning via efficient ℓ2,1-norm minimization. In UAI, 339-348.
Liu, J.; Ji, S.; and Ye, J. 2009b. SLEP: Sparse Learning with Efficient Projections. Arizona State University.
Liu, J.; Sun, L.; and Ye, J. 2011. Projection onto a nonnegative max-heap. In NIPS.
Moré, J., and Sorensen, D. 1983. Computing a trust region step. SIAM Journal on Scientific and Statistical Computing 4:553-572.
Nesterov, Y. 2004. Introductory Lectures on Convex Optimization: A Basic Course. Springer Netherlands.
Rendl, F., and Wolkowicz, H. 1997. A semidefinite framework for trust region subproblems with applications to large scale minimization. Mathematical Programming 77(1):273-299.
Rojas, M.; Santos, S.; and Sorensen, D. 2000. A new matrix-free algorithm for the large-scale trust-region subproblem. SIAM Journal on Optimization 11(3):611-646.
Shalev-Shwartz, S.; Singer, Y.; and Srebro, N. 2007. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 814-821.
Steihaug, T. 1983. The conjugate gradient method and trust regions in large scale optimization. SIAM Journal on Numerical Analysis 626-637.
Wright, S.; Nowak, R.; and Figueiredo, M. 2009. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing 57(7):2479-2493.
Yuan, G.; Chang, K.; Hsieh, C.; and Lin, C. 2010. A comparison of optimization methods and software for large-scale ℓ1-regularized linear classification. Journal of Machine Learning Research 11:3183-3234.
Yuan, L.; Liu, J.; and Ye, J. 2011. Efficient methods for overlapping group lasso. In NIPS.
