CS260: Machine Learning Algorithms Lecture 3: Optimization

Total Page:16

File Type:pdf, Size:1020Kb

CS260: Machine Learning Algorithms Lecture 3: Optimization CS260: Machine Learning Algorithms Lecture 3: Optimization Cho-Jui Hsieh UCLA Jan 14, 2019 Optimization Goal: find the minimizer of a function min f (w) w For now we assume f is twice differentiable Machine learning algorithm: find the hypothesis that minimizes training error Convex function n A function f : R ! R is a convex function , the function f is below any line segment between two points on f : 8x1; x2; 8t 2 [0; 1]; f (tx1 + (1 − t)x2) ≤ tf (x1) + (1 − t)f (x2) Convex function n A function f : R ! R is a convex function , the function f is below any line segment between two points on f : 8x1; x2; 8t 2 [0; 1]; f (tx1 + (1 − t)x2)≤tf (x1) + (1 − t)f (x2) Strict convex: f (tx1 + (1 − t)x2) < tf (x1) + (1 − t)f (x2) Convex function Another equivalent definition for differentiable function: T f is convex if and only if f (x) ≥ f (x0) + rf (x0) (x − x0); 8x; x0 Convex function Convex function: (for differentiable function) rf (w ∗) = 0 , w ∗ is a global minimum If f is twice differentiable ) f is convex if and only if r2f (w) is positive semi-definite Example: linear regression, logistic regression, ··· Convex function Strict convex function: rf (w ∗) = 0 , w ∗ is the unique global minimum most algorithms only converge to gradient= 0 Example: Linear regression when X T X is invertible Convex vs Nonconvex Convex function: rf (w ∗) = 0 , w ∗ is a global minimum Example: linear regression, logistic regression, ··· Non-convex function: rf (w ∗) = 0 , w ∗ is Global min, local min, or saddle point (also called stationary points) most algorithms only converge to stationary points Example: neural network, ··· Gradient descent Generate the sequence w 1; w 2; ··· t converge to stationary points ( limt!1 krf (w )k = 0) Step size too large ) diverge; too small ) slow convergence Gradient Descent Gradient descent: repeatedly do w t+1 w t − αrf (w t ) α > 0 is the step size Step size too large ) diverge; too small ) slow convergence Gradient Descent Gradient descent: repeatedly do w t+1 w t − αrf (w t ) α > 0 is the step size Generate the sequence w 1; w 2; ··· t converge to stationary points ( limt!1 krf (w )k = 0) Gradient Descent Gradient descent: repeatedly do w t+1 w t − αrf (w t ) α > 0 is the step size Generate the sequence w 1; w 2; ··· t converge to stationary points ( limt!1 krf (w )k = 0) Step size too large ) diverge; too small ) slow convergence d ∗ will decrease f (·) if α (step size) is sufficiently small Why gradient descent? Successive approximation view At each iteration, form an approximation function of f (·): 1 f (w t + d) ≈ g(d) := f (w t ) + rf (w t )T d + kdk2 2α Update solution by w t+1 w t + d ∗ ∗ d = arg mind g(d) ∗ t 1 ∗ ∗ t rg(d ) = 0 ) rf (w ) + α d = 0 ) d = −αrf (w ) Why gradient descent? Successive approximation view At each iteration, form an approximation function of f (·): 1 f (w t + d) ≈ g(d) := f (w t ) + rf (w t )T d + kdk2 2α Update solution by w t+1 w t + d ∗ ∗ d = arg mind g(d) ∗ t 1 ∗ ∗ t rg(d ) = 0 ) rf (w ) + α d = 0 ) d = −αrf (w ) d ∗ will decrease f (·) if α (step size) is sufficiently small Illustration of gradient descent Illustration of gradient descent Form a quadratic approximation 1 f (w t + d) ≈ g(d) = f (w t ) + rf (w t )T d + kdk2 2α Illustration of gradient descent Minimize g(d): 1 rg(d ∗) = 0 ) rf (w t ) + d ∗ = 0 ) d ∗ = −αrf (w t ) α Illustration of gradient descent Update w t+1 = w t + d ∗ = w t −αrf (w t ) Illustration of gradient descent Form another quadratic approximation 1 f (w t+1 + d) ≈ g(d) = f (w t+1) + rf (w t+1)T d + kdk2 2α d ∗ = −αrf (w t+1) Illustration of gradient descent Update w t+2 = w t+1 + d ∗ = w t+1−αrf (w t+1) When will it diverge? Can diverge (f (w t )<f (w t+1)) if g is not an upperbound of f When will it converge? Always converge (f (w t )>f (w t+1)) when g is an upperbound of f Why? When α < 1=L, for any d, 1 g(d) = f (w t ) + rf (w t )T d + kdk2 2α L > f (w t ) + rf (w t )T d + kdk2 2 ≥ f (w t + d) So, f (w t + d ∗) < g(d ∗) ≤ g(0) = f (w t ) In formal proof, need to show f (w t + d ∗) is sufficiently smaller than f (w t ) Convergence Let L be the Lipchitz constant (r2f (x) LI for all x) 1 Theorem: gradient descent converges if α < L So, f (w t + d ∗) < g(d ∗) ≤ g(0) = f (w t ) In formal proof, need to show f (w t + d ∗) is sufficiently smaller than f (w t ) Convergence Let L be the Lipchitz constant (r2f (x) LI for all x) 1 Theorem: gradient descent converges if α < L Why? When α < 1=L, for any d, 1 g(d) = f (w t ) + rf (w t )T d + kdk2 2α L > f (w t ) + rf (w t )T d + kdk2 2 ≥ f (w t + d) In formal proof, need to show f (w t + d ∗) is sufficiently smaller than f (w t ) Convergence Let L be the Lipchitz constant (r2f (x) LI for all x) 1 Theorem: gradient descent converges if α < L Why? When α < 1=L, for any d, 1 g(d) = f (w t ) + rf (w t )T d + kdk2 2α L > f (w t ) + rf (w t )T d + kdk2 2 ≥ f (w t + d) So, f (w t + d ∗) < g(d ∗) ≤ g(0) = f (w t ) Convergence Let L be the Lipchitz constant (r2f (x) LI for all x) 1 Theorem: gradient descent converges if α < L Why? When α < 1=L, for any d, 1 g(d) = f (w t ) + rf (w t )T d + kdk2 2α L > f (w t ) + rf (w t )T d + kdk2 2 ≥ f (w t + d) So, f (w t + d ∗) < g(d ∗) ≤ g(0) = f (w t ) In formal proof, need to show f (w t + d ∗) is sufficiently smaller than f (w t ) When to stop? Fixed number of iterations, or Stop when krf (w)k < Applying to Logistic regression gradient descent for logistic regression Initialize the weights w0 For t = 1; 2; ··· Compute the gradient N 1 X ynxn rf (w) = − T N 1 + eynw xn n=1 Update the weights: w w − αrf (w) Return the final weights w Applying to Logistic regression gradient descent for logistic regression Initialize the weights w0 For t = 1; 2; ··· Compute the gradient N 1 X ynxn rf (w) = − T N 1 + eynw xn n=1 Update the weights: w w − αrf (w) Return the final weights w When to stop? Fixed number of iterations, or Stop when krf (w)k < Line Search: Select step size automatically (for gradient descent) Line Search In practice, we do not know L ··· need to tune step size when running gradient descent Line Search In practice, we do not know L ··· need to tune step size when running gradient descent Line Search: Select step size automatically (for gradient descent) A simple condition: f (w + αd) < f (w) often works in practice but doesn't work in theory A (provable) sufficient decrease condition: f (w + αd) ≤ f (w) + σαrf (w)T d for a constant σ 2 (0; 1) Line Search The back-tracking line search: Start from some large α0 α0 α0 Try α = α0; 2 ; 4 ; ··· Stop when α satisfies some sufficient decrease condition often works in practice but doesn't work in theory A (provable) sufficient decrease condition: f (w + αd) ≤ f (w) + σαrf (w)T d for a constant σ 2 (0; 1) Line Search The back-tracking line search: Start from some large α0 α0 α0 Try α = α0; 2 ; 4 ; ··· Stop when α satisfies some sufficient decrease condition A simple condition: f (w + αd) < f (w) A (provable) sufficient decrease condition: f (w + αd) ≤ f (w) + σαrf (w)T d for a constant σ 2 (0; 1) Line Search The back-tracking line search: Start from some large α0 α0 α0 Try α = α0; 2 ; 4 ; ··· Stop when α satisfies some sufficient decrease condition A simple condition: f (w + αd) < f (w) often works in practice but doesn't work in theory Line Search The back-tracking line search: Start from some large α0 α0 α0 Try α = α0; 2 ; 4 ; ··· Stop when α satisfies some sufficient decrease condition A simple condition: f (w + αd) < f (w) often works in practice but doesn't work in theory A (provable) sufficient decrease condition: f (w + αd) ≤ f (w) + σαrf (w)T d for a constant σ 2 (0; 1) Line Search gradient descent with backtracking line search Initialize the weights w0 For t = 1; 2; ··· Compute the gradient d = −∇f (w) For α = α0; α0=2; α0=4; ··· Break if f (w + αd) ≤ f (w) + σαrf (w)T d Update w w + αd Return the final solution w Stochastic Gradient descent 1 PN In general, f (w) = N n=1 fn(w), each fn(w) only depends on (xn; yn) Large-scale Problems Machine learning: usually minimizing the training loss N 1 X T minf `(w xn; yn)g := f (w) (linear model) w N n=1 N 1 X minf `(hw (xn); yn)g := f (w) (general hypothesis) w N n=1 `: loss function (e.g., `(a; b) = (a − b)2) Gradient descent: w w − η rf (w) | {z } Main computation Large-scale Problems Machine learning: usually minimizing the training loss N 1 X T minf `(w xn; yn)g := f (w) (linear model) w N n=1 N 1 X minf `(hw (xn); yn)g := f (w) (general hypothesis) w N n=1 `: loss function (e.g., `(a; b) = (a − b)2) Gradient descent: w w − η rf (w) | {z } Main computation 1 PN In general, f (w) = N n=1 fn(w), each fn(w) only depends on (xn; yn) Use stochastic sampling: Sample a small subset B ⊆ f1; ··· ; Ng Estimated gradient 1 X rf (w) ≈ rf (w) jBj n n2B jBj: batch size Stochastic gradient Gradient: N 1 X rf (w) = rf (w) N n n=1 Each gradient computation needs to go through all training samples slow when millions of samples Faster way to compute \approximate gradient"? Stochastic gradient Gradient: N 1 X rf (w) = rf (w) N n n=1 Each gradient computation needs to go through all training samples slow when millions of samples Faster way to compute \approximate gradient"? Use stochastic sampling: Sample a small subset B ⊆ f1; ··· ; Ng Estimated gradient 1 X rf (w) ≈ rf (w) jBj n n2B jBj: batch size Extreme case: jBj = 1 ) Sample one training data at a time Stochastic gradient descent Stochastic Gradient Descent (SGD) N Input: training data fxn; yngn=1 Initialize w (zero or random) For t = 1; 2; ··· Sample a small batch B ⊆ f1; ··· ; Ng Update parameter 1 X w w − ηt rf (w) jBj n n2B Stochastic gradient descent Stochastic Gradient
Recommended publications
  • Final Exam Guide Guide
    MATH 408 FINAL EXAM GUIDE GUIDE This exam will consist of three parts: (I) Linear Least Squares, (II) Quadratic Optimization, and (III) Optimality Conditions and Line Search Methods. The topics covered on the first two parts ((I) Linear Least Squares and (II) Quadratic Optimization) are identical in content to the two parts of the midterm exam. Please use the midterm exam study guide to prepare for these questions. A more detailed description of the third part of the final exam is given below. III Optimality Conditions and Line search methods. 1 Theory Question: For this question you will need to review all of the vocabulary words as well as the theorems on the weekly guides for Elements of Multivariable Calculus, Optimality Conditions for Unconstrained Problems, and Line search methods. You may be asked to provide statements of first- and second-order optimality conditions for unconstrained problems. In addition, you may be asked about the role of convexity in optimization, how it is detected, as well as first- and second- order conditions under which it is satisfied. You may be asked to describe what a descent direction is, as well as, what the Newton and the Cauchy (gradient descent) directions are. You need to know what the backtracking line-search is, as well as the convergence guarantees of the line search methods. 2 Computation: All the computation needed for Linear Least Squares and Quadratic Optimization is fair game for this question. Mostly, however, you will be asked to compute gradients and Hessians, locate and classify stationary points for specific optimizations problems, as well as test for the convexity of a problem.
    [Show full text]
  • A Line-Search Descent Algorithm for Strict Saddle Functions with Complexity Guarantees∗
    A LINE-SEARCH DESCENT ALGORITHM FOR STRICT SADDLE FUNCTIONS WITH COMPLEXITY GUARANTEES∗ MICHAEL O'NEILL AND STEPHEN J. WRIGHT Abstract. We describe a line-search algorithm which achieves the best-known worst-case com- plexity results for problems with a certain \strict saddle" property that has been observed to hold in low-rank matrix optimization problems. Our algorithm is adaptive, in the sense that it makes use of backtracking line searches and does not require prior knowledge of the parameters that define the strict saddle property. 1 Introduction. Formulation of machine learning (ML) problems as noncon- vex optimization problems has produced significant advances in several key areas. While general nonconvex optimization is difficult, both in theory and in practice, the problems arising from ML applications often have structure that makes them solv- able by local descent methods. For example, for functions with the \strict saddle" property, nonconvex optimization methods can efficiently find local (and often global) minimizers [16]. In this work, we design an optimization algorithm for a class of low-rank matrix problems that includes matrix completion, matrix sensing, and Poisson prinicipal component analysis. Our method seeks a rank-r minimizer of the function f(X), where f : Rn×m ! R. The matrix X is parametrized explicitly as the outer product of two matrices U 2 Rn×r and V 2 Rm×r, where r ≤ min(m; n). We make use throughout of the notation U (1) W = 2 (m+n)×r: V R The problem is reformulated in terms of W and an objective function F as follows: (2) min F (W ) := f(UV T ); where W , U, V are related as in (1).
    [Show full text]
  • Newton Method for Stochastic Control Problems Emmanuel Gobet, Maxime Grangereau
    Newton method for stochastic control problems Emmanuel Gobet, Maxime Grangereau To cite this version: Emmanuel Gobet, Maxime Grangereau. Newton method for stochastic control problems. 2021. hal- 03108627 HAL Id: hal-03108627 https://hal.archives-ouvertes.fr/hal-03108627 Preprint submitted on 13 Jan 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Newton method for stochastic control problems ∗ Emmanuel GOBET y and Maxime GRANGEREAU z Abstract We develop a new iterative method based on Pontryagin principle to solve stochastic control problems. This method is nothing else than the Newton method extended to the framework of stochastic controls, where the state dynamics is given by an ODE with stochastic coefficients. Each iteration of the method is made of two ingredients: computing the Newton direction, and finding an adapted step length. The Newton direction is obtained by solving an affine-linear Forward-Backward Stochastic Differential Equation (FBSDE) with random coefficients. This is done in the setting of a general filtration. We prove that solving such an FBSDE reduces to solving a Riccati Backward Stochastic Differential Equation (BSDE) and an affine-linear BSDE, as expected in the framework of linear FBSDEs or Linear-Quadratic stochastic control problems.
    [Show full text]
  • A Field Guide to Forward-Backward Splitting with a Fasta Implementation
    A FIELD GUIDE TO FORWARD-BACKWARD SPLITTING WITH A FASTA IMPLEMENTATION TOM GOLDSTEIN, CHRISTOPH STUDER, RICHARD BARANIUK Abstract. Non-differentiable and constrained optimization play a key role in machine learn- ing, signal and image processing, communications, and beyond. For high-dimensional mini- mization problems involving large datasets or many unknowns, the forward-backward split- ting method (also known as the proximal gradient method) provides a simple, yet practical solver. Despite its apparent simplicity, the performance of the forward-backward splitting method is highly sensitive to implementation details. This article provides an introductory review of forward-backward splitting with a special emphasis on practical implementation aspects. In particular, issues like stepsize selection, acceleration, stopping conditions, and initialization are considered. Numerical experiments are used to compare the effectiveness of different approaches. Many variations of forward-backward splitting are implemented in a new solver called FASTA (short for Fast Adaptive Shrinkage/Thresholding Algorithm). FASTA provides a simple interface for applying forward-backward splitting to a broad range of problems ap- pearing in sparse recovery, logistic regression, multiple measurement vector (MMV) prob- lems, democratic representations, 1-bit matrix completion, total-variation (TV) denoising, phase retrieval, as well as non-negative matrix factorization. Contents 1. Introduction2 1.1. Outline 2 1.2. A Word on Notation2 2. Forward-Backward Splitting3 2.1. Convergence of FBS4 3. Applications of Forward-Backward Splitting4 3.1. Simple Applications5 3.2. More Complex Applications7 3.3. Non-convex Problems9 4. Bells and Whistles 11 4.1. Adaptive Stepsize Selection 11 4.2. Preconditioning 12 4.3. Acceleration 13 arXiv:1411.3406v6 [cs.NA] 28 Dec 2016 4.4.
    [Show full text]
  • Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions
    Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions by Charles Gearhart Frye A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Neuroscience in the Graduate Division of the University of California, Berkeley Committee in charge: Associate Professor Michael R DeWeese, Co-chair Adjunct Assistant Professor Kristofer E Bouchard, Co-chair Professor Bruno A Olshausen Assistant Professor Moritz Hardt Spring 2020 Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions Copyright 2020 by Charles Gearhart Frye 1 Abstract Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions by Charles Gearhart Frye Doctor of Philosophy in Neuroscience University of California, Berkeley Associate Professor Michael R DeWeese, Co-chair Adjunct Assistant Professor Kristofer E Bouchard, Co-chair Despite the fact that the loss functions of deep neural networks are highly non-convex, gradient-based optimization algorithms converge to approximately the same performance from many random initial points. This makes neural networks easy to train, which, com- bined with their high representational capacity and implicit and explicit regularization strategies, leads to machine-learned algorithms of high quality with reasonable computa- tional cost in a wide variety of domains. One thread of work has focused on explaining this phenomenon by numerically character- izing the local curvature at critical points of the loss function, where gradients are zero. Such studies have reported that the loss functions used to train neural networks have no local minima that are much worse than global minima, backed up by arguments from random matrix theory.
    [Show full text]
  • Arxiv:2108.10249V1 [Math.OC] 23 Aug 2021 Adepoints
    NEW Q-NEWTON’S METHOD MEETS BACKTRACKING LINE SEARCH: GOOD CONVERGENCE GUARANTEE, SADDLE POINTS AVOIDANCE, QUADRATIC RATE OF CONVERGENCE, AND EASY IMPLEMENTATION TUYEN TRUNG TRUONG Abstract. In a recent joint work, the author has developed a modification of Newton’s method, named New Q-Newton’s method, which can avoid saddle points and has quadratic rate of conver- gence. While good theoretical convergence guarantee has not been established for this method, experiments on small scale problems show that the method works very competitively against other well known modifications of Newton’s method such as Adaptive Cubic Regularization and BFGS, as well as first order methods such as Unbounded Two-way Backtracking Gradient Descent. In this paper, we resolve the convergence guarantee issue by proposing a modification of New Q-Newton’s method, named New Q-Newton’s method Backtracking, which incorporates a more sophisticated use of hyperparameters and a Backtracking line search. This new method has very good theoretical guarantees, which for a Morse function yields the following (which is unknown for New Q-Newton’s method): Theorem. Let f : Rm → R be a Morse function, that is all its critical points have invertible Hessian. Then for a sequence {xn} constructed by New Q-Newton’s method Backtracking from a random initial point x0, we have the following two alternatives: i) limn→∞ ||xn|| = ∞, or ii) {xn} converges to a point x∞ which is a local minimum of f, and the rate of convergence is quadratic. Moreover, if f has compact sublevels, then only case ii) happens. As far as we know, for Morse functions, this is the best theoretical guarantee for iterative optimization algorithms so far in the literature.
    [Show full text]
  • Composing Scalable Nonlinear Algebraic Solvers
    COMPOSING SCALABLE NONLINEAR ALGEBRAIC SOLVERS PETER R. BRUNE∗, MATTHEW G. KNEPLEYy , BARRY F. SMITH∗, AND XUEMIN TUz Abstract. Most efficient linear solvers use composable algorithmic components, with the most common model being the combination of a Krylov accelerator and one or more preconditioners. A similar set of concepts may be used for nonlinear algebraic systems, where nonlinear composition of different nonlinear solvers may significantly improve the time to solution. We describe the basic concepts of nonlinear composition and preconditioning and present a number of solvers applicable to nonlinear partial differential equations. We have developed a software framework in order to easily explore the possible combinations of solvers. We show that the performance gains from using composed solvers can be substantial compared with gains from standard Newton-Krylov methods. AMS subject classifications. 65F08, 65Y05, 65Y20, 68W10 Key words. iterative solvers; nonlinear problems; parallel computing; preconditioning; software 1. Introduction. Large-scale algebraic solvers for nonlinear partial differential equations (PDEs) are an essential component of modern simulations. Newton-Krylov methods [23] have well-earned dominance. They are generally robust and may be built from preexisting linear solvers and preconditioners, including fast multilevel preconditioners such as multigrid [5, 7, 9, 67, 73] and domain decomposition methods [55, 63, 64, 66]. Newton's method starts from whole-system linearization. The linearization leads to a large sparse linear system where the matrix may be represented either explicitly by storing the nonzero coefficients or implicitly by various \matrix-free" approaches [11, 42]. However, Newton's method has a number of drawbacks as a stand-alone solver. The repeated construction and solution of the linearization cause memory bandwidth and communication bottlenecks to come to the fore with regard to performance.
    [Show full text]
  • Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
    Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates Sharan Vaswani Aaron Mishkin Mila, Université de Montréal University of British Columbia Issam Laradji Mark Schmidt University of British Columbia University of British Columbia, 1QBit Element AI CCAI Affiliate Chair (Amii) Gauthier Gidel Simon Lacoste-Julien† Mila, Université de Montréal Mila, Université de Montréal Element AI Abstract Recent works have shown that stochastic gradient descent (SGD) achieves the fast convergence rates of full-batch gradient descent for over-parameterized models satisfying certain interpolation conditions. However, the step-size used in these works depends on unknown quantities and SGD’s practical performance heavily relies on the choice of this step-size. We propose to use line-search techniques to automatically set the step-size when training models that can interpolate the data. In the interpolation setting, we prove that SGD with a stochastic variant of the classic Armijo line-search attains the deterministic convergence rates for both convex and strongly-convex functions. Under additional assumptions, SGD with Armijo line-search is shown to achieve fast convergence for non-convex functions. Furthermore, we show that stochastic extra-gradient with a Lipschitz line-search attains linear convergence for an important class of non-convex functions and saddle-point problems satisfying interpolation. To improve the proposed methods’ practical performance, we give heuristics to use larger step-sizes and acceleration. We compare the proposed algorithms against numerous optimization methods on standard classification tasks using both kernel methods and deep networks. The proposed methods result in competitive performance across all models and datasets, while being robust to the precise choices of hyper-parameters.
    [Show full text]
  • 1 Gradient-Based Optimization
    1 Gradient-Based Optimization 1.1 General Algorithm for Smooth Functions All algorithms for unconstrained gradient-based optimization can be described as follows. We start with iteration number k = 0 and a starting point, xk. 1. Test for convergence. If the conditions for convergence are satisfied, then we can stop and xk is the solution. 2. Compute a search direction. Compute the vector pk that defines the direction in n-space along which we will search. 3. Compute the step length. Find a positive scalar, αk such that f(xk + αkpk) < f(xk). 4. Update the design variables. Set xk+1 = xk + αkpk, k = k + 1 and go back to 1. xk+1 = xk + αkpk (1) | {z } ∆xk There are two subproblems in this type of algorithm for each major iteration: computing the search direction pk and finding the step size (controlled by αk). The difference between the various types of gradient-based algorithms is the method that is used for computing the search direction. AA222: Introduction to MDO 1 1.2 Optimality Conditions T Consider a function f(x) where x is the n-vector x = [x1; x2; : : : ; xn] . The gradient vector of this function is given by the partial derivatives with respect to each of the independent variables, 2 @f 3 6@x 7 6 1 7 6 @f 7 6 7 rf(x) ≡ g(x) ≡ 6@x 7 (2) 6 . 2 7 6 . 7 6 7 4 @f 5 @xn In the multivariate case, the gradient vector is perpendicular to the the hyperplane tangent to the contour surfaces of constant f.
    [Show full text]
  • A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
    Journal of Machine Learning Research 11 (2010) 1145-1200 Submitted 11/08; Revised 11/09; Published 3/10 A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning Jin Yu [email protected] School of Computer Science The University of Adelaide Adelaide SA 5005, Australia S.V.N. Vishwanathan [email protected] Departments of Statistics and Computer Science Purdue University West Lafayette, IN 47907-2066 USA Simon Gunter¨ GUENTER [email protected] DV Bern AG Nussbaumstrasse 21, CH-3000 Bern 22, Switzerland Nicol N. Schraudolph [email protected] adaptive tools AG Canberra ACT 2602, Australia Editor: Sathiya Keerthi Abstract We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2-regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1-regularized risk minimization with logistic loss.
    [Show full text]
  • Line Search Algorithms
    Lab 1 Line Search Algorithms Lab Objective: Investigate various Line-Search algorithms for numerical opti- mization. Overview of Line Search Algorithms Imagine you are out hiking on a mountain, and you lose track of the trail. Thick fog gathers around, reducing visibility to just a couple of feet. You decide it is time to head back home, which is located in the valley located near the base of the mountain. How can you find your way back with such limited visibility? The obvious way might be to pick a direction that leads downhill, and follow that direction as far as you can, or until it starts leading upward again. Then you might choose another downhill direction, and take that as far as you can, repeating the process. By always choosing a downhill direction, you hope to eventually make it back to the bottom of the valley, where you live. This is the basic approach of line search algorithms for numerical optimization. Suppose we have a real-valued function f that we wish to minimize. Our goal is to find the point x∗ in the domain of f such that f(x∗) is the smallest value in the range of f. For some functions, we can use techniques from calculus to analytically obtain this minimizer. However, in practical applications, this is often impossible, especially when we need a system that works for a wide class of functions. A line search algorithm starts with an initial guess at the minimizer, call it x0, and iteratively produces a sequence of points x1; x2; x3;::: that hopefully converge to ∗ the minimizer x .
    [Show full text]
  • Benchmarking of Bound-Constrained Optimization Software Introduction
    Benchmarking of bound-constrained optimization software Anke Tr¨oltzsch (CERFACS) joint work with S. Gratton (CERFACS) and Ph.L. Toint (University of Namur) March 01, 2007 CERFACS working note: WN/PA/07/143 Abstract In this report we describe a comparison of different algorithms for solv- ing nonlinear optimization problems with simple bounds on the variables. Moreover, we would like to come out with an assessment of the optimization library DOT used in the optimization suite OPTALIA at Airbus for this kind of problems. Keywords bound constrained optimization software, nonlinear optimization, line search method, trust region method, interior point method Introduction 0.1 Aim of the study Airbus is using an optimization library called Design Optimization Tools (DOT) to perform the minimizations occurring in aircraft design. DOT is mainly based on line search techniques and the purpose of this study is to compare it with more recent line search solvers and with other solvers based on different approaches such as trust region and interior point method. In this work, we focus on the case where function and gradient are available, but where the Hessian of the problem, when needed, has to be approximated using e.g. quasi-Newton updates. 1 0.2 Short outline In Chapter 1 can be found a description of the mathematical background of the considered bound-constrained optimization problem. In Chapter 2 we describe the algorithms implemented in the considered solvers together with their main properties. The methodology we followed in this compari- son is described in Chapter 3. Particular attention is paid on the choice of a relevant set of test cases and on the use of unified stopping criteria.
    [Show full text]