Optimizing Costly Functions with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm

Mark Schmidt, Ewout van den Berg, Michael P. Friedlander, and Kevin Murphy
Department of Computer Science, University of British Columbia
{schmidtm,ewout78,mpf,murphyk}@cs.ubc.ca

Abstract

An optimization algorithm for minimizing a smooth function over a convex set is described. Each iteration of the method computes a descent direction by minimizing, over the original constraints, a diagonal plus low-rank quadratic approximation to the function. The quadratic approximation is constructed using a limited-memory quasi-Newton update. The method is suitable for large-scale problems where evaluation of the function is substantially more expensive than projection onto the constraint set. Numerical experiments on one-norm regularized test problems indicate that the proposed method is competitive with state-of-the-art methods such as bound-constrained L-BFGS and orthant-wise descent. We further show that the method generalizes to a wide class of problems, and substantially improves on state-of-the-art methods for problems such as learning the structure of Gaussian graphical models and Markov random fields.

(Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.)

1 Introduction

One-norm regularization is increasingly used in the statistical learning community as a tool to learn sparse or parsimonious models. In the case of i.i.d. regression or classification, there are many efficient algorithms (e.g., Andrew and Gao (2007)) for solving such problems. In the case of structured models, such as Markov random fields (MRFs), the problem becomes much harder because the cost of evaluating the objective function is much higher. In particular, for parameter estimation in chain-structured graphs, it takes O(k^2 v) time per training case, where v is the number of variables (nodes) in the graph and k is the number of states; for structure learning in Gaussian MRFs, it takes O(v^3) time per objective evaluation; and for structure learning in discrete MRFs, it takes O(k^v) time per evaluation (see Table 1). This makes learning very expensive.

One-norm regularized maximum likelihood can be cast as a constrained optimization problem, as can several other problems in statistical learning, such as training support vector machines. Although standard algorithms such as interior-point methods offer powerful theoretical guarantees (e.g., polynomial-time complexity, ignoring the cost of evaluating the function), these methods typically require at each iteration the solution of a large, highly ill-conditioned linear system; solving such systems is potentially very difficult and expensive. This has motivated some authors to consider alternatives such as gradient-projection methods (Duchi et al., 2008a; Schmidt et al., 2008), which only use the function gradient; these methods require only O(n) time per iteration (where n is the number of parameters), plus the cost of projecting onto the constraint set. Because the constraints are often simple, the projection onto the set of feasible values can typically be computed efficiently. Although this leads to efficient iterations, using only first-order information means that these methods typically require a substantial number of iterations to reach an accurate solution.

In the case of unconstrained differentiable optimization with a large number of variables, algorithms based on limited-memory quasi-Newton updates, such as L-BFGS (Liu and Nocedal, 1989), are among the most successful methods that require only first derivatives. In a typical optimization algorithm, a step towards the solution is computed by minimizing a local quadratic approximation to the function; between iterations, the quadratic model is updated with second-order information inferred from observed changes in the gradient.


Model        Parameters      Evaluation    Projection
GGM-Struct   O(v^2)          O(v^3)        O(n)
MRF-Struct   O(k^2 v^2)      O(k^v)        O(n)
CRF-Params   O(k^2 + kf)     O(tvk^2)      O(n)

Table 1: Number of parameters, cost of evaluating the objective, and cost of projection for different graphical model learning problems with (group) ℓ1-regularization. Symbols: v: number of nodes in the graphical model; k: number of states per node; f: number of features; t: number of training examples; n: number of optimization variables. Models: GGM-Struct: learning a sparse Gaussian graphical model structure by imposing a (group) ℓ1 penalty on the precision matrix (Banerjee et al., 2008; Duchi et al., 2008a; Friedman et al., 2007); MRF-Struct: learning a sparse Markov random field structure with (group) ℓ1 penalties applied to the edge weights (Lee et al., 2006; Schmidt et al., 2008); CRF-Params: learning the parameters of a chain-structured conditional random field by using an ℓ1 penalty on the local features (Andrew and Gao, 2007).

The information available via the L-BFGS updates often allows these methods to enjoy good convergence rates. Crucially, the overhead cost per iteration is only O(mn), where m is a small number (typically between five and ten) chosen by the user. Tellingly, one of the most successful large-scale bound-constrained optimization methods is L-BFGS-B (Byrd et al., 1995), which combines L-BFGS updates with a gradient-projection strategy. Also, one of the most effective solvers for (non-differentiable) ℓ1-regularized optimization problems is an extension of the L-BFGS method known as orthant-wise descent (OWD) (Andrew and Gao, 2007). Unfortunately, these extensions crucially rely on the separability of the constraints or of the regularization function; this requirement ensures that the scaled search direction continues to provide descent for the objective even after it is projected. In general, it is not straightforward to efficiently apply such algorithms to problems with more general constraints without a substantial increase in computation.

This paper presents a new algorithm based on a two-level strategy. At the outer level, L-BFGS updates are used to construct a sequence of constrained, quadratic approximations to the problem; at the inner level, a spectral projected-gradient method approximately minimizes this subproblem. The iterations of this algorithm remain linear in the number of variables, but with a higher constant factor than the L-BFGS method, and they require multiple projections per iteration. Nevertheless, the method can lead to substantial gains when the cost of the projection is much lower than the cost of evaluating the function. We describe the new algorithm in §§2–5; in §6 we show experimentally that it equals or surpasses the performance of state-of-the-art methods on the problems shown in Table 1.

2 Projected Newton

We address the problem of minimizing a differentiable function f(x) over a convex set C:

\[
\min_{x} \; f(x) \quad \text{subject to} \quad x \in C. \tag{1}
\]

We cannot in general compute the solution to this problem analytically and must resort to iterative algorithms. Beginning with a solution estimate x_0, at each iteration k the projected Newton method forms a quadratic model of the objective function around the current iterate x_k:

\[
q_k(x) \;\triangleq\; f_k + (x - x_k)^T g_k + \tfrac{1}{2} (x - x_k)^T B_k (x - x_k).
\]

Throughout this paper, we use the shorthand notation f_k = f(x_k) and g_k = ∇f(x_k); B_k denotes a positive-definite approximation to the Hessian ∇²f(x_k). The projected Newton method computes a feasible descent direction by minimizing this quadratic model subject to the original constraints:

\[
\min_{x} \; q_k(x) \quad \text{subject to} \quad x \in C. \tag{2}
\]

Because B_k is positive definite, the direction d_k ≜ x_k^* − x_k, where x_k^* solves (2), is guaranteed to be a feasible descent direction at x_k. (If x_k is stationary, then d_k = 0.) To select the next iterate, a backtracking line search along the line segment x_k + αd_k, for α ∈ (0, 1], is used to select a steplength α that ensures that a sufficient decrease condition, such as the Armijo condition

\[
f(x_k + \alpha d_k) \;\le\; f_k + \nu \alpha g_k^T d_k, \quad \text{with } \nu \in (0, 1),
\]

is satisfied. By the definition of d_k, the new iterate will satisfy the constraints for this range of α. A typical value for the sufficient decrease parameter ν is 10^{-4}. A suitable test of convergence for the method is that the norm of the projected gradient, P_C(x_k − g_k) − x_k, where P_C is the projection onto C, is sufficiently small. If B_k is chosen as the exact Hessian ∇²f(x_k) whenever it is positive definite, and if the backtracking line search always tests the value α = 1 first, this method achieves a quadratic rate of convergence in the neighborhood of any point that satisfies the second-order sufficiency conditions for a minimizer; see Bertsekas (1999, §2.2).

Despite its appealing theoretical properties, the projected Newton method just summarized is inefficient in its unmodified form. The major shortcoming of the method is that finding the constrained minimizer of the quadratic model may be almost as difficult as solving the original problem. Further, it becomes completely impractical to use a general n-by-n Hessian approximation B_k as n grows large. In the next section, we summarize the L-BFGS quasi-Newton updates to B_k and show how they lead to efficient computation of q_k(x) and ∇q_k(x); in §4 we describe how to efficiently solve (2) with an SPG algorithm. Algorithm 1 summarizes the resulting method.
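To make the scheme of this section concrete, the following minimal Python/NumPy sketch (not from the paper) performs one projected-Newton-style iteration. The helpers solve_quadratic_subproblem and project are assumed to be supplied by the user; in PQN they would be the SPG solver of §4 and the projection operator of §5. Only the Armijo backtracking logic is spelled out, with simple halving in place of cubic interpolation.

```python
import numpy as np

def projected_newton_step(f, grad, x, solve_quadratic_subproblem, project,
                          nu=1e-4, tol=1e-6, max_backtracks=30):
    """One iteration of the projected (quasi-)Newton scheme of Section 2.

    solve_quadratic_subproblem(x, g) -- hypothetical helper returning an
        (approximate) minimizer of the quadratic model q_k over the feasible set C.
    project(z) -- Euclidean projection onto C, used here for the optimality test.
    """
    fx, g = f(x), grad(x)

    # Optimality test: norm of the projected gradient.
    if np.linalg.norm(project(x - g) - x) <= tol:
        return x, True

    # Feasible descent direction from the constrained quadratic model (2).
    d = solve_quadratic_subproblem(x, g) - x

    # Backtracking (Armijo) line search over alpha in (0, 1].
    alpha = 1.0
    for _ in range(max_backtracks):
        if f(x + alpha * d) <= fx + nu * alpha * g.dot(d):
            break
        alpha *= 0.5  # simple halving; the paper interpolates cubically
    return x + alpha * d, False
```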


Algorithm 1: Projected Quasi-Newton Algorithm

  Given x_0, memory size m, sufficient-decrease parameter ν, and optimality tolerance ε. Set k ← 0.
  while not converged do
      f_k ← f(x_k), g_k ← ∇f(x_k)
      if k = 0 then
          d_k ← −g_k/||g_k||
      else
          Solve (2) for x_k^*  [Algorithm 2]
          d_k ← x_k^* − x_k
      if ||P_C(x_k − g_k) − x_k|| ≤ ε then converged
      α ← 1
      x_{k+1} ← x_k + αd_k
      while f_{k+1} > f_k + ναg_k^T d_k do
          Choose α ∈ (0, α)  [cubic interpolation]
          x_{k+1} ← x_k + αd_k
      s_k ← x_{k+1} − x_k, y_k ← g_{k+1} − g_k
      if k = 0 then
          S ← s_k, Y ← y_k
      else
          if k ≥ m then remove the first column of S and Y
          S ← [S s_k], Y ← [Y y_k]
      σ_k ← (y_k^T s_k)/(y_k^T y_k)
      Form N and M  [update L-BFGS vectors; see §3]
      k ← k + 1

3 Limited-memory BFGS Updates

Quasi-Newton methods allow us to build an approximation to the Hessian by using the observed gradient vectors at successive iterations. It is convenient to define the quantities

\[
s_k \triangleq x_{k+1} - x_k \quad \text{and} \quad y_k \triangleq g_{k+1} - g_k.
\]

In the BFGS method, the approximation begins with an initial matrix B_0 ≜ σI (for some positive σ), and at each iteration an updated approximation B_{k+1} is computed that satisfies the secant equation

\[
B_{k+1} s_k = y_k.
\]

To uniquely choose among matrices satisfying this interpolation condition, the BFGS method chooses the symmetric matrix whose difference with the previous approximation B_k minimizes a weighted Frobenius norm. This leads to the BFGS formula

\[
B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}, \tag{3}
\]

which is a rank-two change to B_k. The limited-memory variant of BFGS (L-BFGS) maintains only the most recent m vectors s_k and y_k, and discards older vectors. The L-BFGS update is described in Nocedal and Wright (1999, §7.3).

The compact representation

\[
B_k = \sigma_k I - N M^{-1} N^T,
\]

where N is n-by-m, M is m-by-m, and σ_k is a positive scalar, is given by Byrd et al. (1994). Typically, σ_k = (y_k^T s_k)/(y_k^T y_k). The SPG algorithm for solving (2) requires computing q_k(x) and ∇q_k(x) at each iteration, and hence requires matrix-vector products with B_k. With the compact representation, these products can be computed with O(mn) cost.

The L-BFGS approximation will be strictly positive definite as long as the curvature condition y_k^T s_k > 0 is satisfied. This is guaranteed to be true for strictly convex functions, and in the unconstrained case it can be satisfied for general functions by using a suitable line search. Since in our case it may not be possible to satisfy this condition along the feasible line segment, the update is skipped on iterations that do not satisfy this condition. An alternative strategy is to use the "damped" L-BFGS update, which employs a suitable modification to the vector y_k; see Nocedal and Wright (1999, §18.3).

4 Spectral Projected Gradient

The SPG method (Birgin et al., 2000) is a modification of the classic projected-gradient method. While SPG continues to use projections of iterates along the steepest descent direction, which guarantees that it can generate feasible descent directions, the line search in SPG differs in two crucial ways. First, SPG uses a nonmonotone line search in which sufficient descent is determined relative to a fixed number of previous iterations, rather than just the last. This allows the objective to temporarily increase while ensuring overall convergence. Second, it uses the spectral steplength introduced by Barzilai and Borwein (1988). This gives an initial steplength based on a diagonal approximation to the Hessian that minimizes the squared error in the quasi-Newton interpolation conditions. In particular, we have

\[
\alpha_{bb} = \frac{\langle y_{k-1}, y_{k-1} \rangle}{\langle s_{k-1}, y_{k-1} \rangle}. \tag{4}
\]

Based on this, we set the initial steplength to

\[
\bar{\alpha} = \min\{\alpha_{\max}, \max\{\alpha_{\min}, \alpha_{bb}\}\},
\]

where we set the upper limit α_max = 10^{10} and the lower limit α_min = 10^{-10}. Due to its strong empirical performance and simplicity there has recently been a growing interest in the SPG method, and it has been used successfully in several applications (e.g., van den Berg and Friedlander (2008); Dai and Fletcher (2005)).
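As an illustration of how the compact representation yields O(mn) products with B_k, here is a minimal NumPy sketch (not from the paper) that assembles the factors from the stored S and Y following the block form of Byrd et al. (1994); in this block form the factor N has 2m columns when m pairs are stored, and all stored pairs are assumed to satisfy the curvature condition s_i^T y_i > 0 (otherwise the update is skipped, as described above).

```python
import numpy as np

def lbfgs_compact(S, Y, sigma):
    """Build N and M of the compact representation B = sigma*I - N M^{-1} N^T
    from stored difference matrices S = [s_0 ... s_{m-1}], Y = [y_0 ... y_{m-1}]
    (as columns), following Byrd et al. (1994)."""
    SY = S.T @ Y
    D = np.diag(np.diag(SY))              # D_ii = s_i^T y_i
    L = np.tril(SY, k=-1)                 # strictly lower-triangular part of S^T Y
    N = np.hstack([sigma * S, Y])         # n-by-2m
    M = np.block([[sigma * S.T @ S, L],
                  [L.T,            -D]])  # 2m-by-2m
    return N, M

def hess_vec(v, N, M, sigma):
    """O(mn) product B_k v with the compact L-BFGS approximation."""
    return sigma * v - N @ np.linalg.solve(M, N.T @ v)
```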


In this paper we use SPG to solve the constrained, strictly convex quadratic subproblems (2), where B_k is defined by the compact L-BFGS approximation. In the unconstrained case, Friedlander et al. (1999) show that SPG has a superlinear convergence rate for minimizing strictly convex quadratics under certain conditions. The SPG method is summarized in Algorithm 2. Note that the computational complexity is dominated by the cost of function and gradient evaluations (and thus by the efficient matrix-vector products with the Hessian approximation B_k, cf. §3), and by the projections P_C(·) onto C. For good overall performance it is thus essential to have an operator that efficiently projects onto the feasible set. Fortunately, such operators exist for a number of commonly encountered convex sets. We give examples in the next section.

In practice it may not be feasible or necessary to run the SPG subproblem until convergence; recall that the only requirement is that the SPG algorithm generates a direction of feasible descent for the quasi-Newton method. Therefore we may, for example, choose to initialize SPG with the projected steepest-descent iterate and perform a fixed number of iterations to improve the steepest descent direction towards the optimal projected quasi-Newton direction. Either way, solving the SPG subproblem can be expected to be more expensive than updating the compact L-BFGS matrix. This implies that the proposed quasi-Newton algorithm is most effective when the cost of evaluating the overall objective function dominates the cost of applying SPG to the subproblem.

Algorithm 2: Spectral Projected Gradient Algorithm

  Given x_0, steplength bounds 0 < α_min < α_max, initial steplength α_bb ∈ [α_min, α_max], and history length h.
  while not converged do
      ᾱ_k ← min{α_max, max{α_min, α_bb}}
      d_k ← P_C(x_k − ᾱ_k ∇q_k(x_k)) − x_k
      Set bound f_b ← max{q_k(x_k), q_k(x_{k−1}), . . . , q_k(x_{k−h})}
      α ← 1
      while q_k(x_k + αd_k) > f_b + να∇q_k(x_k)^T d_k do
          Choose α ∈ (0, α)  [cubic interpolation]
      x_{k+1} ← x_k + αd_k
      s_k ← x_{k+1} − x_k, y_k ← ∇q_k(x_{k+1}) − ∇q_k(x_k)
      α_bb ← (y_k^T y_k)/(s_k^T y_k)
      k ← k + 1
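For reference, a compact sketch of an SPG inner solver in the spirit of Algorithm 2 (not the authors' implementation): the nonmonotone bound, the safeguarded spectral steplength, and the fixed iteration budget discussed above are shown, with geometric backtracking and a curvature guard as simplifications.

```python
import numpy as np

def spg(q, grad_q, project, x0, n_iter=10, h=10, nu=1e-4,
        a_min=1e-10, a_max=1e10):
    """Minimal SPG sketch for approximately minimizing a smooth function q over C.
    q/grad_q evaluate the subproblem model; project is the Euclidean projection onto C.
    Returns the iterate after a fixed number of iterations (see the discussion above)."""
    x, a_bb, history = x0.copy(), 1.0, []
    for _ in range(n_iter):
        g = grad_q(x)
        a_bar = min(a_max, max(a_min, a_bb))          # safeguarded spectral steplength
        d = project(x - a_bar * g) - x                # feasible descent direction
        history.append(q(x))
        f_b = max(history[-h:])                       # nonmonotone reference value
        alpha = 1.0
        for _ in range(30):                           # backtrack (paper interpolates cubically)
            if q(x + alpha * d) <= f_b + nu * alpha * g.dot(d):
                break
            alpha *= 0.5
        x_new = x + alpha * d
        s, y = x_new - x, grad_q(x_new) - g
        if s.dot(y) > 0:                              # keep the Barzilai-Borwein step well defined
            a_bb = y.dot(y) / s.dot(y)
        x = x_new
    return x
```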

5 Projection onto Norm-Balls

The Euclidean projection operator used in the previous section is defined as

\[
P_C(c) = \arg\min_{x} \; \|c - x\|_2 \quad \text{subject to} \quad x \in C. \tag{5}
\]

We are particularly interested in the case where C is a ball induced by a given norm:

\[
C \equiv B_\tau = \{x \mid \|x\| \le \tau\}.
\]

For certain ℓp-norms (i.e., ||x||_p = (Σ_i |x_i|^p)^{1/p}), notably ℓ2 and ℓ∞, this projection is easily solved; for ℓ2 the solution is given by x = βc, with β = min{1, τ/||c||_2}; for ℓ∞ we have x = sgn(c) · min{|c|, τ}. A randomized algorithm with expected linear-time performance for projection onto the ℓ1-norm ball is described by Duchi et al. (2008b).

In the context of group variable selection it is often beneficial to work with projection onto mixed p,q-norm balls, defined by

\[
\|x\|_{p,q} = \Bigl( \sum_i \|x_{\sigma_i}\|_q^p \Bigr)^{1/p}, \tag{6}
\]

where {σ_i}_{i=1}^g are g disjoint groups of indices. A special case of this mixed norm is the ℓ1,2-norm

\[
\|x\|_{1,2} = \sum_i \|x_{\sigma_i}\|_2,
\]

which is used in the group Lasso (Yuan and Lin, 2006). The ℓ1,∞-norm also arises in group variable selection (Turlach et al., 2005); the ℓ∞,1 and ℓ∞,2-norms arise in dual formulations of group variable selection problems (Duchi et al., 2008a). Projection onto ℓ∞,p-norm balls reduces to independent projection of each group x_{σ_i} onto ℓp-norm balls.

As the following proposition (proved in the Appendix) shows, projection onto the ℓ1,2-norm ball is done by projecting the vector of group norms ||x_{σ_i}||_2 onto the ℓ1-ball, followed by scaling the elements in each group.

Proposition 1. Consider c ∈ R^n and a set of g disjoint groups {σ_i}_{i=1}^g such that ∪_i σ_i = {1, . . . , n}. Then the Euclidean projection P_C(c) onto the ℓ1,2-norm ball of radius τ is given by

\[
x_{\sigma_i} = \operatorname{sgn}(c_{\sigma_i}) \cdot w_i, \quad i = 1, \ldots, g,
\]

where w = P(v) is the projection of the vector v onto the ℓ1-norm ball of radius τ, with v_i = ||c_{σ_i}||_2.

Using the efficient ℓ1 projection algorithm, this immediately gives an expected linear time algorithm for ℓ1,2 projection.
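For concreteness, minimal NumPy sketches (not from the paper) of the simple projections mentioned above: ℓ2 and ℓ∞ are closed-form, and ℓ1 is shown with a standard O(n log n) sort-based routine rather than the expected-linear-time method of Duchi et al. (2008b).

```python
import numpy as np

def project_l2(c, tau):
    """Projection onto {x : ||x||_2 <= tau}: rescale if outside the ball."""
    nrm = np.linalg.norm(c)
    return c if nrm <= tau else (tau / nrm) * c

def project_linf(c, tau):
    """Projection onto {x : ||x||_inf <= tau}: clip each coordinate."""
    return np.clip(c, -tau, tau)

def project_l1(c, tau):
    """Projection onto {x : ||x||_1 <= tau} via soft-thresholding; the threshold
    is found by sorting, giving O(n log n) instead of expected linear time."""
    if np.abs(c).sum() <= tau:
        return c.copy()
    u = np.sort(np.abs(c))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(c) + 1) > css - tau)[0][-1]
    lam = (css[k] - tau) / (k + 1.0)
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)
```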


Figure 1: Comparison of objective value versus number of function evaluations for different methods (chain-structured CRF parameter estimation with ℓ1-regularization). Methods shown: PQN, SPG, OWD, L-BFGS-B.

6 Experiments

In this section we compare the running time of different optimization algorithms on three different convex objective functions subject to convex constraints. For problems with simple constraints, our projected quasi-Newton (PQN) method is competitive with state-of-the-art methods based on L-BFGS, and for problems with more complex constraints (for which L-BFGS cannot easily be applied), our method outperforms the current best approaches.

6.1 Sparse Conditional Random Field Parameter Estimation

We first consider a noun-phrase chunking task (Sang and Buchholz, 2000). The goal is to compute p(y_i | x, w, v), where y_i is one of 22 possible labels (states) for word i, x is a set of features derived from the sentence, and (w, v) are the parameters of the model. A simple approach would be to use logistic regression. However, to exploit correlation amongst neighboring labels, we use a chain-structured conditional random field (CRF), defined by

\[
p(y \mid x, w, v) \;\propto\; \exp\!\Bigl( \sum_{i=1}^{n} w_{y_i}^T x_i + \sum_{i=1}^{n-1} v_{y_i, y_{i+1}} \Bigr),
\]

where w are the parameters for the local evidence potentials, and v are the parameters controlling the 22 × 22 transition matrix. Note that evaluating this likelihood term (and its derivative) takes O(vk^2) time per sentence, where k = 22 is the number of states and v is the length of each sentence.

We follow Sha and Pereira (2003) and use a feature vector of size 1.8M, representing things such as "did word W occur at location i", for each English word W that occurred three or more times in the training data. To perform feature selection from this large set, we use a sparsity-promoting ℓ1 regularizer. Thus we have to solve the following convex and non-smooth problem

\[
\min_{w,v} \; -\sum_i \log p(y_i \mid x_i, w, v) + \lambda \|w\|_1 + \lambda \|v\|_1, \tag{7}
\]

or equivalently the following constrained and smooth problem

\[
\min_{w,v} \; -\sum_i \log p(y_i \mid x_i, w, v) \quad \text{subject to} \quad \|w\|_1 + \|v\|_1 \le \tau. \tag{8}
\]

Here λ (or τ) controls the amount of regularization/sparsity. (Note that here x_i represents observed features, rather than parameters to be optimized.)

We compared four optimizers on this data set: (i) applying a bound-constrained L-BFGS method to (7) after converting it into a bound-constrained problem with twice the number of variables (see, e.g., Andrew and Gao (2007); this conversion is sketched below), (ii) applying the OWD algorithm (Andrew and Gao, 2007) directly to (7), (iii) applying SPG directly to (8), and finally (iv) applying our PQN method to (8). Figure 1 plots the value of the objective (7) against the number of evaluations of the function for λ = 1 (and τ set such that the problems give the same solution). With this value of λ, the model achieves a nearly identical prediction error to that achieved with ℓ2-regularization, but only has 10,907 non-zero parameters. In this plot, we see that the three methods that use the L-BFGS approximation (L-BFGS-B, OWD, and PQN) behave similarly and all find a very good solution after a small number of iterations. We also compared applying PQN to the bound-constrained problem solved by L-BFGS-B, and found that it gave nearly identical performance to PQN applied to (8). In contrast, SPG takes substantially longer to approach this quality of solution. These trends also held for other values of λ. While the added iteration cost of PQN makes it unappealing for this problem in comparison to OWD or L-BFGS-B, we emphasize that L-BFGS-B and OWD cannot handle more general constraints, while PQN can be used to give this level of performance for more general constraints.
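A minimal sketch (not from the paper) of the standard conversion used in option (i): writing each parameter as the difference of two non-negative variables turns the ℓ1 penalty into a linear term over a box, so a bound-constrained solver such as L-BFGS-B can be applied to the smooth reformulation. Here nll and nll_grad stand in for the CRF negative log-likelihood and its gradient, which are assumed to be supplied.

```python
import numpy as np

def split_l1_objective(theta_pm, nll, nll_grad, lam):
    """Objective and gradient of the bound-constrained reformulation of
    min_theta nll(theta) + lam*||theta||_1, using theta = theta_plus - theta_minus
    with theta_plus, theta_minus >= 0 (twice as many variables)."""
    n = theta_pm.size // 2
    theta_plus, theta_minus = theta_pm[:n], theta_pm[n:]
    theta = theta_plus - theta_minus
    f = nll(theta) + lam * (theta_plus.sum() + theta_minus.sum())
    g = nll_grad(theta)
    grad = np.concatenate([g + lam, -g + lam])
    return f, grad

# Usage sketch: hand split_l1_objective and bounds [(0, None)] * (2*n) to any
# bound-constrained quasi-Newton solver, e.g. a scipy L-BFGS-B wrapper.
```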

6.2 Gaussian Graphical Model Structure Learning

We consider the problem of learning the structure (topology) of a Gaussian graphical model. Since absent edges in the graph correspond to zeros in the precision matrix, we can efficiently learn the structure by solving

\[
\min_{K \succ 0} \; -\log\det(K) + \operatorname{tr}(\hat{\Sigma} K) + \lambda \|K\|_1,
\]

where Σ̂ is the empirical covariance, K is the precision matrix, the norm ||K||_1 is applied element-wise, and K ≻ 0 denotes symmetric positive definiteness of K; see Banerjee et al. (2008).


Figure 2: Comparison of dual objective value versus number of function evaluations for different methods (GGM structure learning with ℓ1-regularization (left) and group ℓ1-regularization (right)). Methods shown: PQN, PG, BCD.

This method is dubbed the "graphical lasso" by Friedman et al. (2007). Note that evaluating the objective takes O(v^3) time.

A block coordinate descent (BCD) algorithm for optimizing this is presented in Banerjee et al. (2008) and Friedman et al. (2007). This method solves the corresponding dual problem

\[
\max_{W} \; \log\det(\hat{\Sigma} + W) \quad \text{subject to} \quad \hat{\Sigma} + W \succ 0, \;\; \|W\|_\infty \le \lambda, \tag{9}
\]

where W is the dual variable and ||W||_∞ is computed element-wise. More recently, Duchi et al. (2008a) describe a projected gradient (PG) method for solving (9). The PG method constructs a dual feasible K = (Σ̂ + W)^{-1} at each iteration, and stops when the primal-dual gap is sufficiently small. While projection in the dual formulation is efficient, first-order methods can be slow. In Figure 2, we compare our PQN method to the PG method on the gene expression data from Duchi et al. (2008a) (we also plot the objective value at corresponding iterations of the BCD method).
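As an illustration of why each evaluation costs O(v^3), here is a minimal NumPy sketch (not the authors' code) of the primal graphical-lasso objective and the gradient of its smooth part, using a Cholesky factorization for the log-determinant.

```python
import numpy as np

def glasso_objective(K, Sigma_hat, lam):
    """Primal objective -log det(K) + tr(Sigma_hat K) + lam*||K||_1 (element-wise).
    The Cholesky factorization dominates the cost at O(v^3)."""
    L = np.linalg.cholesky(K)                 # raises LinAlgError if K is not positive definite
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return -logdet + np.trace(Sigma_hat @ K) + lam * np.abs(K).sum()

def glasso_smooth_grad(K, Sigma_hat):
    """Gradient of the smooth part -log det(K) + tr(Sigma_hat K), i.e. -K^{-1} + Sigma_hat."""
    return -np.linalg.inv(K) + Sigma_hat
```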

Duchi et al. also consider an interesting extension to the problem, where sparsity on groups of variables is imposed using ℓ1,∞-regularization, analogous to the group lasso in the classification/regression setting. In Figure 2, we see that PQN is also faster than the PG method for this problem. Further, PQN can also easily be applied in the case of the ℓ1,2 group norm. Figure 3 compares the three regularization strategies over fifty random train/test splits following the scenario outlined in Duchi et al. (2008a), where we see that regularization with the group ℓ1,2-norm provides a further improvement.

Figure 3: Average cross-validated log-likelihood against regularization strength (λ) under different regularization schemes applied to the regularized empirical covariance for the yeast gene expression data from Duchi et al. (2008a). Legend: ℓ1,2, ℓ1,∞, ℓ1, Base.

6.3 Markov Random Field Structure Learning

In the case of Markov random fields on binary variables (such as Ising models), there is a one-to-one correspondence between parameters and edges in the graph, as in the Gaussian case, and so one can again learn structure by optimizing an ℓ1-penalized log-likelihood (Lee et al., 2006). Note, however, that computing the objective now takes O(k^v) time per example, where k = 2 and v is the number of nodes. Lee et al. (2006) proposed to use loopy belief propagation to approximate the objective, and to use an incremental grafting strategy to perform the optimization.
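To see where the O(k^v) cost comes from, here is a tiny illustrative sketch (not from the paper) that computes the exact log-partition function of a pairwise binary MRF by enumerating all 2^v configurations; this is only feasible for small v, which is what motivates the approximations discussed above.

```python
import itertools
import numpy as np

def ising_log_partition(W, b):
    """Exact log-partition function of a pairwise binary MRF
    p(x) propto exp(0.5 * x^T W x + b^T x), x in {-1,+1}^v, computed by
    brute-force enumeration of all 2^v states (hence O(k^v) with k = 2)."""
    v = len(b)
    energies = []
    for bits in itertools.product([-1.0, 1.0], repeat=v):
        x = np.array(bits)
        energies.append(0.5 * x @ W @ x + b @ x)
    energies = np.array(energies)
    m = energies.max()
    return m + np.log(np.exp(energies - m).sum())   # log-sum-exp for numerical stability
```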


Figure 4: Comparison of objective value versus number of function evaluations for different methods (MRF structure learning with group ℓ1-regularization). Methods shown: PQN, SPG, Graft.

If the variables are non-binary, there are multiple parameters per edge, and so one needs to use a group ℓ1 penalty to learn sparse graphs (Schmidt et al., 2008). (This also happens when learning the structure of conditional random fields, where each edge is often associated with a weight vector.) Schmidt et al. (2008) convert the penalized problem into a constrained problem by introducing one variable per group, and applying SPG to the resulting second-order cone problem. Using our proposition regarding efficient projection onto ℓ1,2-norm balls, PQN can be applied directly to the constrained version of the problem.

Figure 4 compares the SPG method of Schmidt et al. (2008), the proposed PQN method, and a grouped-variable extension of the grafting method of Lee et al. (2006), on an MRF structure learning problem. (The data set consists of 5400 measurements of 11 protein activity levels that have been discretized into three states, representing under-expressed, normal, and over-expressed, as in Sachs et al. (2005), who modeled the data using a directed graphical model whose structure was estimated by heuristic search methods.) We see that, once again, the PQN method is much faster than the previous best (SPG) method.

7 Discussion

Although our experiments have focused on constraints involving norms of the parameter vectors, there is a wide variety of constraints where the projection can easily be computed. For example, projection onto a hyper-plane or half-space is a trivial calculation, while projection onto the probability simplex can be computed in O(n log n) (Duchi et al., 2008b). Projection of a symmetric matrix onto the cone of positive semi-definite matrices (in the Frobenius norm) is given by setting the negative eigenvalues in the spectral decomposition of the matrix to 0 (Boyd and Vandenberghe, 2004). Projection onto a second-order cone constraint of the form ||x||_2 ≤ s (where x and s are both variables) can be computed in linear time (Boyd and Vandenberghe, 2004), while projection onto a constraint of the form ||x||_∞ ≤ s can be solved in O(n log n) (Schmidt et al., 2008). Further, if it is difficult to compute the projection onto the full constraint set but simple to project onto subsets of the constraints, Dykstra's algorithm (Dykstra, 1983) can be used to compute the projection (but the efficiency of this depends on the constraints).
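Two of the projections listed above admit very short implementations; the following NumPy sketch (not from the paper) shows the PSD-cone projection via an eigendecomposition and the standard closed-form projection onto the second-order cone {(x, s) : ||x||_2 ≤ s}.

```python
import numpy as np

def project_psd(A):
    """Frobenius-norm projection of a symmetric matrix onto the PSD cone:
    zero out the negative eigenvalues in the spectral decomposition."""
    w, V = np.linalg.eigh((A + A.T) / 2.0)
    return (V * np.maximum(w, 0.0)) @ V.T

def project_soc(x, s):
    """Euclidean projection of (x, s) onto the second-order cone {(x, s): ||x||_2 <= s}."""
    nrm = np.linalg.norm(x)
    if nrm <= s:
        return x, s
    if nrm <= -s:
        return np.zeros_like(x), 0.0
    t = 0.5 * (nrm + s)
    return t * x / nrm, t
```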

This paper is concerned with the problem of minimizing an expensive objective that depends on a large number of variables, over a convex set onto which the projection is easily computed. The method presented here is a natural generalization of the L-BFGS method to this task, and it takes advantage of the recent SPG method to make the iterations efficient. Our experiments indicate that the method achieves state-of-the-art performance on problems with very simple constraints, while it represents a substantial improvement over state-of-the-art methods for problems with more complex constraints.

Appendix

Proof of Proposition 1. Recall that the n-vector x is partitioned into g groups with pairwise disjoint index sets σ_i, i = 1, . . . , g, whose union is {1, . . . , n}. Write x as a g-vector x̃ = (x̃_1, x̃_2, . . . , x̃_g), where each x̃_k = (x_j)_{j∈σ_k} denotes the tuple formed by x_{σ_k}, the components of x belonging to group k. With this notation we extend the signum function as

\[
\operatorname{sgn}(\tilde{x})_k = \operatorname{sgn}(\tilde{x}_k) = \frac{\tilde{x}_k}{|\tilde{x}_k|}, \tag{10}
\]

with |x̃_k| = ||x_{σ_k}||_2. It can be verified that this extended signum function satisfies the usual properties that

\[
\operatorname{sgn}(\alpha \tilde{x}) = \operatorname{sgn}(\tilde{x}) \quad \text{for all } \alpha > 0, \tag{11a}
\]
\[
\|\operatorname{sgn}(\tilde{x}_k)\|_2 \le 1. \tag{11b}
\]

We adopt the convention that sgn(0) can be taken to be any vector satisfying the second property. Finally, with the p-norm ||x̃||_p := ||x||_{p,2} based on (6), it can be shown that ∇(½||x̃||_2²) = x̃ and ∇||x̃||_1 = sgn(x̃).

We now apply these results to derive the optimality conditions for the group projection, leading to the extended soft-thresholding operator. Using ||x̃||_2 = ||x||_2 and ||x̃||_1 = ||x||_{1,2}, we can solve

\[
\min_{\tilde{x}} \; \tfrac{1}{2} \|\tilde{c} - \tilde{x}\|_2^2 + \lambda \|\tilde{x}\|_1. \tag{12}
\]


A vector x̃ is a solution of this problem if and only if it satisfies

\[
\nabla\bigl( \tfrac{1}{2} \|\tilde{x} - \tilde{c}\|_2^2 + \lambda \|\tilde{x}\|_1 \bigr) = \tilde{x} - \tilde{c} + \lambda \operatorname{sgn}(\tilde{x}) = 0. \tag{13}
\]

We claim that the x̃ satisfying this condition, and hence giving the solution of (12), is given by

\[
\tilde{x}_k = S_\lambda(\tilde{c})_k = S_\lambda(\tilde{c}_k), \tag{14}
\]

where

\[
S_\lambda(\tilde{c}_k) =
\begin{cases}
\operatorname{sgn}(\tilde{c}_k)\,(|\tilde{c}_k| - \lambda) & \text{if } |\tilde{c}_k| > \lambda; \\
0 & \text{otherwise.}
\end{cases}
\]

To check this we separately consider the two cases |c̃_k| > λ and |c̃_k| ≤ λ. In the case where |c̃_k| > λ, substitute x̃_k = S_λ(c̃_k) into (13) and use property (11a) to obtain:

\[
\begin{aligned}
\tilde{x}_k - \tilde{c}_k + \lambda \operatorname{sgn}(\tilde{x}_k)
&= \operatorname{sgn}(\tilde{c}_k)(|\tilde{c}_k| - \lambda) - \tilde{c}_k + \lambda \operatorname{sgn}\bigl(\operatorname{sgn}(\tilde{c}_k)(|\tilde{c}_k| - \lambda)\bigr) \\
&= \operatorname{sgn}(\tilde{c}_k)(|\tilde{c}_k| - \lambda) - \tilde{c}_k + \lambda \operatorname{sgn}(\tilde{c}_k) \\
&= \operatorname{sgn}(\tilde{c}_k) \cdot |\tilde{c}_k| - \tilde{c}_k = 0.
\end{aligned}
\]

In the case |c̃_k| ≤ λ, we substitute x̃_k = 0, giving c̃_k = λ sgn(0). But because sgn(0) is arbitrary, we can choose it as (1/λ)c̃_k, which clearly satisfies the condition (11b) since |c̃_k| ≤ λ.

Assuming ||c̃||_1 > τ > 0, it remains to be shown how to find the λ giving ||S_λ(c̃)||_1 = τ, based on v_i = |c̃_i|. Defining I_λ = {j : |c̃_j| > λ} = {j : v_j > λ}, and noting that |sgn(c̃_i)| = 1 for all i ∈ I_λ, we have

\[
\|S_\lambda(\tilde{c})\|_1
= \sum_{i \in I_\lambda} \bigl| \operatorname{sgn}(\tilde{c}_i)(|\tilde{c}_i| - \lambda) \bigr|
= \sum_{i \in I_\lambda} \bigl| |\tilde{c}_i| - \lambda \bigr|
= \sum_{i \in I_\lambda} (|\tilde{c}_i| - \lambda)
= \sum_{i \in I_\lambda} (v_i - \lambda)
= \|S_\lambda(v)\|_1.
\]

The last step follows from v_i ≥ 0 and the definition of I_λ, and shows that λ corresponds to the value chosen for projection onto the ℓ1-norm ball.

The final part of the proposition constructs x from v and w = P_τ(v). In group notation this step can be derived as

\[
\begin{aligned}
\tilde{x}_i = \operatorname{sgn}(\tilde{c}_i) \cdot w_i
&= \operatorname{sgn}(\tilde{c}_i) \cdot \operatorname{sgn}(v_i) \cdot \max\{0, v_i - \lambda\} \\
&= \operatorname{sgn}(\tilde{c}_i) \cdot \max\{0, v_i - \lambda\} \\
&= \operatorname{sgn}(\tilde{c}_i) \cdot \max\{0, |\tilde{c}_i| - \lambda\},
\end{aligned}
\]

where we used (10) and the fact that we can take sgn(v_i) = sgn(|c̃_i|) = 1. This exactly coincides with (14), as required.
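For completeness, a minimal NumPy sketch (not from the paper) of the ℓ1,2-ball projection given by Proposition 1, reusing an ℓ1-ball projection routine such as the project_l1 sketch shown after §5; groups are assumed to be given as a list of index arrays.

```python
import numpy as np

def project_l12(c, groups, tau, project_l1):
    """Euclidean projection of c onto {x : sum_i ||x_{sigma_i}||_2 <= tau},
    following Proposition 1: project the vector of group norms onto the
    l1-ball of radius tau, then rescale each group accordingly."""
    v = np.array([np.linalg.norm(c[idx]) for idx in groups])
    w = project_l1(v, tau)                     # projected group norms
    x = np.zeros_like(c)
    for wi, vi, idx in zip(w, v, groups):
        if vi > 0:                             # sgn(c_sigma_i) = c_sigma_i / ||c_sigma_i||_2
            x[idx] = (wi / vi) * c[idx]
    return x
```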

References

G. Andrew and J. Gao. Scalable training of ℓ1-regularized log-linear models. ICML, 2007.

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 9:485–516, 2008.

J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J. Numer. Anal., 8:141–148, 1988.

E. van den Berg and M. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Comp., 31(2):890–912, 2008.

D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

E. Birgin, J. Martínez, and M. Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim., 10(4):1196–1211, 2000.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

R. Byrd, J. Nocedal, and R. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program., 63(1):129–156, 1994.

R. Byrd, P. Lu, and J. Nocedal. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Stat. Comput., 16(5):1190–1208, 1995.

Y.-H. Dai and R. Fletcher. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numer. Math., 100:21–47, 2005.

J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. UAI, 2008a.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. ICML, 2008b.

R. Dykstra. An algorithm for restricted least squares regression. JASA, 78(384):837–842, 1983.

A. Friedlander, J. M. Martínez, B. Molina, and M. Raydan. Gradient method with retards and generalizations. SIAM J. Numer. Anal., 36(1):275–289, 1999.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.

S.-I. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using ℓ1-regularization. NIPS, 2006.

D. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Program. B, 45(3):503–528, 1989.

J. Nocedal and S. Wright. Numerical Optimization. Springer, 1999.

K. Sachs, O. Perez, D. Pe'er, D. Lauffenburger, and G. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308, 2005.

E. Sang and S. Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. CoNLL, 2000.

M. Schmidt, K. Murphy, G. Fung, and R. Rosales. Structure learning in random fields for heart motion abnormality detection. CVPR, 2008.

F. Sha and F. Pereira. Shallow parsing with conditional random fields. NAACL-HLT, 2003.

B. Turlach, W. Venables, and S. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B, 68(1):49–67, 2006.
