A Survey of Optimization Methods from a Machine Learning Perspective

Shiliang Sun, Zehui Cao, Han Zhu, and Jing Zhao

arXiv:1906.06821v2 [cs.LG] 23 Oct 2019

This work was supported by NSFC Project 61370175 and Shanghai Sailing Program 17YF1404600. Shiliang Sun, Zehui Cao, Han Zhu, and Jing Zhao are with School of Computer Science and Technology, East China Normal University, 3663 North Zhongshan Road, Shanghai 200062, P. R. China.

Abstract: Machine learning develops rapidly, which has made many theoretical breakthroughs and is widely applied in various fields. Optimization, as an important part of machine learning, has attracted much attention of researchers. With the exponential growth of data amount and the increase of model complexity, optimization methods in machine learning face more and more challenges. A lot of work on solving optimization problems or improving optimization methods in machine learning has been proposed successively. The systematic retrospect and summary of the optimization methods from the perspective of machine learning are of great significance, which can offer guidance for both developments of optimization and machine learning research. In this paper, we first describe the optimization problems in machine learning. Then, we introduce the principles and progresses of commonly used optimization methods. Next, we summarize the applications and developments of optimization methods in some popular machine learning fields. Finally, we explore and give some challenges and open problems for the optimization in machine learning.

Index Terms: Machine learning, optimization method, deep neural network, reinforcement learning, approximate Bayesian inference.

I. INTRODUCTION

Recently, machine learning has grown at a remarkable rate, attracting a great number of researchers and practitioners. It has become one of the most popular research directions and plays a significant role in many fields, such as machine translation, speech recognition, image recognition, recommendation system, etc. Optimization is one of the core components of machine learning. The essence of most machine learning algorithms is to build an optimization model and learn the parameters in the objective function from the given data. In the era of immense data, the effectiveness and efficiency of the numerical optimization algorithms dramatically influence the popularization and application of the machine learning models. In order to promote the development of machine learning, a series of effective optimization methods were put forward, which have improved the performance and efficiency of machine learning methods.

From the perspective of the gradient information in optimization, popular optimization methods can be divided into three categories: first-order optimization methods, which are represented by the widely used stochastic gradient methods; high-order optimization methods, in which Newton’s method is a typical example; and heuristic derivative-free optimization methods, in which the coordinate descent method is a representative.

As the representative of first-order optimization methods, the stochastic gradient descent method [1], [2], as well as its variants, has been widely used in recent years and is evolving at a high speed. However, many users pay little attention to the characteristics or application scope of these methods. They often adopt them as black box optimizers, which may limit the functionality of the optimization methods. In this paper, we comprehensively introduce the fundamental optimization methods. Particularly, we systematically explain their advantages and disadvantages, their application scope, and the characteristics of their parameters. We hope that the targeted introduction will help users to choose the first-order optimization methods more conveniently and make parameter adjustment more reasonable in the learning process.

Compared with first-order optimization methods, high-order methods [3], [4], [5] converge at a faster speed in which the curvature information makes the search direction more effective. High-order optimizations attract widespread attention but face more challenges. The difficulty in high-order methods lies in the operation and storage of the inverse matrix of the Hessian matrix. To solve this problem, many variants based on Newton’s method have been developed, most of which try to approximate the Hessian matrix through some techniques [6], [7]. In subsequent studies, the stochastic quasi-Newton method and its variants are introduced to extend high-order methods to large-scale data [8], [9], [10].

Derivative-free optimization methods [11], [12] are mainly used in the case that the derivative of the objective function may not exist or be difficult to calculate. There are two main ideas in derivative-free optimization methods. One is adopting a heuristic search based on empirical rules, and the other is fitting the objective function with samples. Derivative-free optimization methods can also work in conjunction with gradient-based methods.

Most machine learning problems, once formulated, can be solved as optimization problems. Optimization in the fields of deep neural network, reinforcement learning, meta learning, variational inference and Markov chain Monte Carlo encounters different difficulties and challenges. The optimization methods developed in the specific machine learning fields are different, which can be inspiring to the development of general optimization methods.

Deep neural networks (DNNs) have shown great success in pattern recognition and machine learning. There are two very popular NNs, i.e., convolutional neural networks (CNNs) [13] and recurrent neural networks (RNNs), which play important roles in various fields of machine learning. CNNs are feedforward neural networks with convolution calculation. CNNs have been successfully used in many fields such as image processing [14], [15], video processing [16] and natural language processing (NLP) [17], [18]. RNNs are a kind of sequential model and very active in NLP [19], [20], [21], [22]. Besides, RNNs are also popular in the fields of image processing [23], [24] and video processing [25]. In the field of constrained optimization, RNNs can achieve excellent results [26], [27], [28], [29]. In these works, the parameters of weights in RNNs can be learned by analytical methods, and these methods can find the optimal solution according to the trajectory of the state solution. Stochastic gradient-based algorithms are widely used in deep neural networks [30], [31], [32], [33]. However, various problems are emerging when employing stochastic gradient-based algorithms. For example, the learning rate will be oscillating in the later training stage of some adaptive methods [34], [35], which may lead to the problem of non-converging. Thus, further optimization algorithms based on variance reduction were proposed to improve the convergence rate [36], [37]. Moreover, combining the stochastic gradient descent and the characteristics of its variants is a possible direction to improve the optimization. Especially, switching an adaptive algorithm to the stochastic gradient descent method can improve the accuracy and convergence speed of the algorithm [38].

Reinforcement learning (RL) is a branch of machine learning, for which an agent interacts with the environment by trial-and-error mechanism and learns an optimal policy by maximizing cumulative rewards [39]. Deep reinforcement learning combines the RL and deep learning techniques, and enables the RL agent to have a good perception of its [...]

[...] variational inference was proposed, which introduced natural gradients and extended the variational inference to large-scale data [58].

Optimization methods have a significant influence on various fields of machine learning. For example, [5] proposed the transformer network using Adam optimization [33], which is applied to machine translation tasks. [59] proposed super-resolution generative adversarial network for image super resolution, which is also optimized by Adam. [60] proposed Actor-Critic using trust region optimization to solve the deep reinforcement learning on Atari games as well as the MuJoCo environments.

The stochastic optimization method can also be applied to Markov chain Monte Carlo (MCMC) sampling to improve efficiency. In this kind of application, stochastic gradient Hamiltonian Monte Carlo (HMC) is a representative method [61] where the stochastic gradient accelerates the step of gradient update when handling large-scale samples. The noise introduced by the stochastic gradient can be characterized by introducing Gaussian noise and friction terms. Additionally, the deviation caused by HMC discretization can be eliminated by the friction term, and thus the Metropolis-Hastings step can be omitted. The hyper-parameter settings in the HMC will affect the performance of the model. There are some efficient ways to automatically adjust the hyperparameters and improve the performance of the sampler.

The development of optimization brings a lot of contributions to the progress of machine learning. However, there are still many challenges and open problems for optimization problems in machine learning. 1) How to improve optimization performance with insufficient data in deep neural networks is a tricky problem. If there are not enough samples in the training of deep neural networks, it is prone to cause the problem of high variances and overfitting
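
The contrast the introduction draws between first-order and high-order methods can be illustrated with a small sketch. The following Python snippet is not taken from the paper; the quadratic objective, the matrix `A`, the vector `b`, the learning rate, and the function names are all illustrative assumptions. It compares many plain gradient steps with a single Newton step, which uses the inverse Hessian (here simply the inverse of `A`) and lands on the minimizer of a quadratic at once.

```python
# Hypothetical 2-D quadratic f(x) = 0.5 * x^T A x - b^T x.
# A and b are made-up values chosen for illustration; the minimizer solves A x = b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def grad(x):
    """Gradient of f at x, i.e. A x - b."""
    return [
        A[0][0] * x[0] + A[0][1] * x[1] - b[0],
        A[1][0] * x[0] + A[1][1] * x[1] - b[1],
    ]


def gradient_descent(x, lr=0.1, steps=200):
    """First-order method: repeatedly step against the gradient."""
    for _ in range(steps):
        g = grad(x)
        x = [x[0] - lr * g[0], x[1] - lr * g[1]]
    return x


def newton_step(x):
    """High-order method: one step x - H^{-1} grad(x), where the Hessian H equals A."""
    g = grad(x)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    # Apply the closed-form inverse of the 2x2 Hessian to the gradient.
    dx0 = (A[1][1] * g[0] - A[0][1] * g[1]) / det
    dx1 = (-A[1][0] * g[0] + A[0][0] * g[1]) / det
    return [x[0] - dx0, x[1] - dx1]


if __name__ == "__main__":
    print("gradient descent after 200 steps:", gradient_descent([0.0, 0.0]))
    print("Newton after 1 step:", newton_step([0.0, 0.0]))
```

On this toy problem both approaches approach the same point, but the Newton step needs the Hessian inverse, which is the storage and computation cost the text identifies as the main difficulty of high-order methods at scale.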
Recommended publications
  • Training Autoencoders by Alternating Minimization
    Under review as a conference paper at ICLR 2018. Anonymous authors; paper under double-blind review. Abstract: We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimization problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 Introduction: For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
  • Learning to Learn by Gradient Descent by Gradient Descent
    Marcin Andrychowicz¹, Misha Denil¹, Sergio Gómez Colmenarejo¹, Matthew W. Hoffman¹, David Pfau¹, Tom Schaul¹, Brendan Shillingford¹,², Nando de Freitas¹,²,³ (¹Google DeepMind, ²University of Oxford, ³Canadian Institute for Advanced Research; {mdenil,sergomez,mwhoffman,pfau,schaul}@google.com). Abstract: The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art. 1 Introduction: Frequently, tasks in machine learning can be expressed as the problem of optimizing an objective function $f(\theta)$ defined over some domain $\theta \in \Theta$. The goal in this case is to find the minimizer $\theta^* = \arg\min_{\theta \in \Theta} f(\theta)$. While any method capable of minimizing this objective function can be applied, the standard approach for differentiable functions is some form of gradient descent, resulting in a sequence of updates $\theta_{t+1} = \theta_t - \alpha_t \nabla f(\theta_t)$. The performance of vanilla gradient descent, however, is hampered by the fact that it only makes use of gradients and ignores second-order information.
  • Hybrid Whale Optimization Algorithm with Modified Conjugate Gradient Method to Solve Global Optimization Problems
    Open Access Library Journal 2020, Volume 7, e6459. ISSN Online: 2333-9721, ISSN Print: 2333-9705. Layth Riyadh Khaleel, Ban Ahmed Mitras, Department of Mathematics, College of Computer Sciences & Mathematics, Mosul University, Mosul, Iraq. How to cite this paper: Khaleel, L.R. and Mitras, B.A. (2020) Hybrid Whale Optimization Algorithm with Modified Conjugate Gradient Method to Solve Global Optimization Problems. Open Access Library Journal, 7: e6459. https://doi.org/10.4236/oalib.1106459 Received: May 25, 2020. Accepted: June 27, 2020. Published: June 30, 2020. Copyright © 2020 by author(s) and Open Access Library Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/ Abstract: Whale Optimization Algorithm (WOA) is a meta-heuristic algorithm. It is a new algorithm, it simulates the behavior of Humpback Whales in their search for food and migration. In this paper, a modified conjugate gradient algorithm is proposed by deriving new conjugate coefficient. The sufficient descent and the global convergence properties for the proposed algorithm proved. Novel hybrid algorithm of the Whale Optimization Algorithm (WOA) proposed with modified conjugate gradient Algorithm develops the elementary society that randomly generated as the primary society for the Whales optimization algorithm using the characteristics of the modified conjugate gradient algorithm. The efficiency of the hybrid algorithm measured by applying it to (10) of the optimization functions of high measurement with different dimensions and the results of the hybrid algorithm were very good in comparison with the original algorithm.
  • Training Neural Networks Without Gradients: a Scalable ADMM Approach
    Gavin Taylor¹, Ryan Burmeister¹, Zheng Xu², Bharat Singh², Ankit Patel³, Tom Goldstein² (¹United States Naval Academy, Annapolis, MD USA; ²University of Maryland, College Park, MD USA; ³Rice University, Houston, TX USA). Abstract: With the growing importance of large network models and enormous training datasets, GPUs have become increasingly necessary to train neural networks. This is largely because conventional optimization algorithms rely on stochastic gradient methods that don’t scale well to large numbers of cores in a cluster setting. Furthermore, the convergence of all gradient methods, including batch methods, suffers from common problems like saturation effects, poor conditioning, and saddle points. This paper explores an unconventional training method that uses alternating direction methods and Bregman iteration to train networks without gradient descent steps. [...] many parameters. Because big datasets provide results that (often dramatically) outperform the prior state-of-the-art in many machine learning tasks, researchers are willing to purchase specialized hardware such as GPUs, and commit large amounts of time to training models and tuning hyper-parameters. Gradient-based training methods have several properties that contribute to this need for specialized hardware. First, while large amounts of data can be shared amongst many cores, existing optimization methods suffer when parallelized. Second, training neural nets requires optimizing highly non-convex objectives that exhibit saddle points, poor conditioning, and vanishing gradients, all of which slow down gradient-based methods such as stochastic gra-
  • Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments
    Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Ryan Leary, Vitaly Lavrukhin, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen. Abstract: We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par or better than well-tuned SGD with momentum, Adam, and AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has half the memory footprint of Adam. 1. Introduction: The most popular algorithms for training Neural Networks (NNs) are Stochastic Gradient Descent (SGD) with momentum (Polyak, 1964; Sutskever et al., 2013) and Adam (Kingma & Ba, 2015). [...] We started with Adam, and then (1) replaced the element-wise second moment with the layer-wise moment, (2) computed the first moment using gradients normalized by layer-wise second moment, (3) decoupled weight decay (WD) from normalized gradients (similar to AdamW). The resulting algorithm, NovoGrad, combines SGD’s and Adam’s strengths. We applied NovoGrad to a variety of large scale problems (image classification, neural machine translation, language modeling, and speech recognition) and found that in all cases, it performs as well or better than Adam/AdamW and SGD with momentum. 2. Related Work: NovoGrad belongs to the family of Stochastic Normalized Gradient Descent (SNGD) optimizers (Nesterov, 1984; Hazan et al., 2015). SNGD uses only the direction of the stochastic gradient $g_t$ to update the weights $w_t$: $w_{t+1} = w_t - \lambda_t \cdot \frac{g_t}{\|g_t\|}$
  • CSE 152: Computer Vision Manmohan Chandraker
    CSE 152: Computer Vision, Manmohan Chandraker. Lecture 15: Optimization in CNNs. Recap: engineered against learned features; convolutional filters are trained in a supervised manner by back-propagating classification error. Slide topics: two-layer perceptron network; neural networks, non-linearity and activation functions; multi-layer neural network; from fully connected to convolutional networks; spatial filtering is convolution; convolutional neural networks (2D spatial filters, filters over the whole image, weight sharing; insight: images have similar features at various spatial locations); key operations in a CNN (learned convolution, non-linearity such as the Rectified Linear Unit (ReLU), spatial pooling such as max, feature maps); convolution as a feature extractor. Pooling operations: aggregate multiple values into a single value; invariance to small transformations; keep only most important information for next layer; reduces the size of the next layer; fewer parameters, faster computations; observe larger receptive field in next layer; hierarchically extract more abstract features. Slide credits: Jia-Bin Huang and Derek Hoiem (UIUC), Pieter Abbeel and Dan Klein, Lazebnik, Efstratios Gavves, R. Fergus, Y. LeCun.
  • Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches
    Article. Juan Cruz-Benito 1,*, Sanjay Vishwakarma 2,†, Francisco Martin-Fernandez 1 and Ismael Faro 1. 1 IBM Quantum, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA. 2 Electrical and Computer Engineering, Carnegie Mellon University, Mountain View, CA 94035, USA. * Correspondence. † Intern at IBM Quantum at the time of writing this paper. Citation: Cruz-Benito, J.; Vishwakarma, S.; Martin-Fernandez, F.; Faro, I. Abstract: In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformer while using transfer learning and different forms of tokenization to see how they behave
  • GEE: a Gradient-Based Explainable Variational Autoencoder for Network Anomaly Detection
    Quoc Phong Nguyen (National University of Singapore), Kar Wai Lim (National University of Singapore), Dinil Mon Divakaran (Trustwave), Kian Hsiang Low (National University of Singapore), Mun Choon Chan (National University of Singapore). Abstract: This paper looks into the problem of detecting network anomalies by analyzing NetFlow records. While many previous works have used statistical models and machine learning techniques in a supervised way, such solutions have the limitations that they require large amount of labeled data for training and are unlikely to detect zero-day attacks. Existing anomaly detection solutions also do not provide an easy way to explain or identify attacks in the anomalous traffic. To address these limitations, we develop and present GEE, a framework for detecting and explaining anomalies in network traffic. GEE comprises of two components: (i) Variational Autoencoder (VAE), an unsupervised deep-learning technique for detecting anomalies, and (ii) a gradient-based fingerprinting technique for explaining anomalies. [...] number of factors, such as end-user behavior, customer businesses (e.g., banking, retail), applications, location, time of the day, and are expected to evolve with time. Such diversity and dynamism limits the utility of rule-based detection systems. Next, as capturing, storing and processing raw traffic from such high capacity networks is not practical, Internet routers today have the capability to extract and export meta data such as NetFlow records [3]. With NetFlow, the amount of information captured is brought down by orders of magnitude (in comparison to raw packet capture), not only because a NetFlow
  • A Hybrid Global Optimization Method: the One-Dimensional Case Peiliang Xu
    Journal of Computational and Applied Mathematics 147 (2002) 301–314, www.elsevier.com/locate/cam. A hybrid global optimization method: the one-dimensional case. Peiliang Xu, Disaster Prevention Research Institute, Kyoto University, Uji, Kyoto 611-0011, Japan. Received 20 February 2001; received in revised form 4 February 2002. Abstract: We propose a hybrid global optimization method for nonlinear inverse problems. The method consists of two components: local optimizers and feasible point finders. Local optimizers have been well developed in the literature and can reliably attain the local optimal solution. The feasible point finder proposed here is equivalent to finding the zero points of a one-dimensional function. It warrants that local optimizers either obtain a better solution in the next iteration or produce a global optimal solution. The algorithm by assembling these two components has been proved to converge globally and is able to find all the global optimal solutions. The method has been demonstrated to perform excellently with an example having more than 1 750 000 local minima over $[-10^6, 10^7]$. © 2002 Elsevier Science B.V. All rights reserved. Keywords: Interval analysis; Hybrid global optimization. 1. Introduction: Many problems in science and engineering can ultimately be formulated as an optimization (maximization or minimization) model. In the Earth Sciences, we have tried to collect data in a best way and then to extract the information on, for example, the Earth’s velocity structures and/or its stress/strain state, from the collected data as much as possible.
  • Geometric GSI’19 Science of Information Toulouse, 27Th - 29Th August 2019
    GSI’19, Geometric Science of Information (ALEAE GEOMETRIA). Toulouse, 27th - 29th August 2019. Program. Welcome message from GSI’19 chairmen: On behalf of both the organizing and the scientific committees, it is our great pleasure to welcome all delegates, representatives and participants from around the world to the fourth International SEE conference on “Geometric Science of Information” (GSI’19), hosted at ENAC in Toulouse, 27th to 29th August 2019. GSI’19 benefits from scientific sponsor and financial sponsors. The 3-day conference is also organized in the frame of the relations set up between SEE and scientific institutions or academic laboratories: ENAC, Institut Mathématique de Bordeaux, Ecole Polytechnique, Ecole des Mines ParisTech, INRIA, CentraleSupélec, Institut Mathématique de Bordeaux, Sony Computer Science Laboratories. We would like to express all our thanks to the local organizers (ENAC, IMT and CIMI Labex) for hosting this event at the interface between Geometry, Probability and Information Geometry. The GSI conference cycle has been initiated by the Brillouin Seminar Team as soon as 2009. The GSI’19 event has been motivated in the continuity of first initiatives launched in 2013 at Mines ParisTech, consolidated in 2015 at Ecole Polytechnique and opened to new communities in 2017 at Mines ParisTech. We mention that in 2011, we organized an indo-french workshop on “Matrix Information Geometry” that yielded an edited book in 2013, and in 2017, collaborate to CIRM seminar in Luminy TGSI’17 “Topological & Geometrical Structures of Information”. Frank Nielsen, co-chair (Ecole Polytechnique, Palaiseau, France; Sony Computer Science Laboratories, Tokyo, Japan).
  • Lecture Notes on Numerical Optimization
    Miguel Á. Carreira-Perpiñán, EECS, University of California, Merced. December 30, 2020. These are notes for a one-semester graduate course on numerical optimisation given by Prof. Miguel Á. Carreira-Perpiñán at the University of California, Merced. The notes are largely based on the book “Numerical Optimization” by Jorge Nocedal and Stephen J. Wright (Springer, 2nd ed., 2006), with some additions. These notes may be used for educational, non-commercial purposes. © 2005–2020 Miguel Á. Carreira-Perpiñán. 1 Introduction. Goal: describe the basic concepts & main state-of-the-art algorithms for continuous optimization. The optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$ s.t. $c_i(x) = 0,\ i \in \mathcal{E}$ (equality constraints, scalar) and $c_i(x) \ge 0,\ i \in \mathcal{I}$ (inequality constraints, scalar); x: variables (vector); f(x): objective function (scalar). Feasible region: set of points satisfying all constraints. $\max f \equiv -\min -f$. Ex. (fig. 1.1): $\min_{x_1, x_2} (x_1 - 2)^2 + (x_2 - 1)^2$ s.t. $x_1^2 - x_2 \le 0$, $x_1 + x_2 \le 2$. Ex.: transportation problem (LP): $\min_{\{x_{ij}\}} \sum_{i,j} c_{ij} x_{ij}$ s.t. $\sum_j x_{ij} \le a_i\ \forall i$ (capacity of factory i), $\sum_i x_{ij} \ge b_j\ \forall j$ (demand of shop j), $x_{ij} \ge 0\ \forall i, j$ (nonnegative production); $c_{ij}$: shipping cost; $x_{ij}$: amount of product shipped from factory i to shop j. Ex.: LSQ problem: fit a parametric model (e.g. line, polynomial, neural net...) to a data set (Ex. 2.1). Optimization algorithms are iterative: build sequence of points that converges to the solution. Needs good initial point (often by prior knowledge).
  • Optimization and Approximation
    Elisa Riccietti. Thanks to Stefania Bellavia, University of Florence. Reference book: Numerical Optimization, Nocedal and Wright, Springer. Contents. Part I: Nonlinear optimization. 1 Prerequisites (1.1 Necessary and sufficient conditions; 1.2 Convex functions; 1.3 Quadratic functions). 2 Iterative methods (2.1 Directions for line-search methods: 2.1.1 Direction of steepest descent, 2.1.2 Newton's direction, 2.1.3 Quasi-Newton directions; 2.2 Rates of convergence; 2.3 Steepest descent method for quadratic functions; 2.4 Convergence of Newton's method). 3 Line-search methods (3.1 Armijo and Wolfe conditions; 3.2 Convergence of line-search methods; 3.3 Backtracking; 3.4 Newton's method). 4 Quasi-Newton method (4.1 BFGS method; 4.2 Global convergence of the BFGS method). 5 Nonlinear least-squares problems (5.1 Background: modelling, regression; 5.2 General concepts; 5.3 Linear least-squares problems; 5.4 Algorithms for nonlinear least-squares problems: 5.4.1 Gauss-Newton method; 5.5 Levenberg-Marquardt method). 6 Constrained optimization (6.1 One equality constraint; 6.2 One inequality constraint; 6.3 First order optimality conditions; 6.4 Second order optimality conditions). 7 Optimization methods for Machine Learning. Part II: Linear and integer programming. 8 Linear programming (8.1 How to rewrite an LP in standard form