
arXiv:1906.06821v2 [cs.LG] 23 Oct 2019

A Survey of Optimization Methods from a Machine Learning Perspective

Shiliang Sun, Zehui Cao, Han Zhu, and Jing Zhao

Abstract—Machine learning develops rapidly, which has made many theoretical breakthroughs and is widely applied in various fields. Optimization, as an important part of machine learning, has attracted much attention of researchers. With the exponential growth of data amount and the increase of model complexity, optimization methods in machine learning face more and more challenges. A lot of work on solving optimization problems or improving optimization methods in machine learning has been proposed successively. The systematic retrospect and summary of the optimization methods from the perspective of machine learning are of great significance, which can offer guidance for both developments of optimization and machine learning research. In this paper, we first describe the optimization problems in machine learning. Then, we introduce the principles and progresses of the commonly used optimization methods. Next, we summarize the applications and developments of optimization methods in some popular machine learning fields. Finally, we explore and give some challenges and open problems for the optimization in machine learning.

Index Terms—Machine learning, optimization method, deep neural network, reinforcement learning, approximate Bayesian inference.

Shiliang Sun, Zehui Cao, Han Zhu, and Jing Zhao are with the School of Computer Science and Technology, East China Normal University, 3663 North Zhongshan Road, Shanghai 200062, P. R. China. E-mail: shiliangsun@gmail.com (Shiliang Sun); [email protected], [email protected], [email protected] (Jing Zhao). This work was supported by NSFC Project 61370175 and Shanghai Sailing Program 17YF1404600.

I. INTRODUCTION

RECENTLY, machine learning has grown at a remarkable rate, attracting a great number of researchers and practitioners. It has become one of the most popular research directions and plays a significant role in many fields, such as machine translation, speech recognition, image recognition, recommendation system, etc. Optimization is one of the core components of machine learning. The essence of most machine learning algorithms is to build an optimization model and learn the parameters in the objective function from the given data. In the era of immense data, the effectiveness and efficiency of numerical optimization algorithms dramatically influence the popularization and application of machine learning models. In order to promote the development of machine learning, a series of effective optimization methods were put forward, which have improved the performance and efficiency of machine learning methods.

From the perspective of the gradient information in optimization, popular optimization methods can be divided into three categories: first-order optimization methods, which are represented by the widely used stochastic gradient methods; high-order optimization methods, in which Newton's method is a typical example; and heuristic derivative-free optimization methods, in which the coordinate descent method is a representative.

As the representative of first-order optimization methods, the stochastic gradient descent method [1], [2], as well as its variants, has been widely used in recent years and is evolving at a high speed. However, many users pay little attention to the characteristics or application scope of these methods. They often adopt them as black box optimizers, which may limit the functionality of the optimization methods. In this paper, we comprehensively introduce the fundamental optimization methods. Particularly, we systematically explain their advantages and disadvantages, their application scope, and the characteristics of their parameters. We hope that the targeted introduction will help users to choose the first-order optimization methods more conveniently and make parameter adjustment more reasonable in the learning process.

Compared with first-order optimization methods, high-order methods [3], [4], [5] converge at a faster speed, in which the curvature information makes the search direction more effective. High-order optimizations attract widespread attention but face more challenges. The difficulty in high-order methods lies in the operation and storage of the inverse matrix of the Hessian matrix. To solve this problem, many variants based on Newton's method have been developed, most of which try to approximate the Hessian matrix through some techniques [6], [7]. In subsequent studies, the stochastic quasi-Newton method and its variants are introduced to extend high-order methods to large-scale data [8], [9], [10].

Derivative-free optimization methods [11], [12] are mainly used in the case that the derivative of the objective function may not exist or be difficult to calculate. There are two main ideas in derivative-free optimization methods. One is adopting a heuristic search based on empirical rules, and the other is fitting the objective function with samples. Derivative-free optimization methods can also work in conjunction with gradient-based methods.

Most machine learning problems, once formulated, can be solved as optimization problems. Optimization in the fields of deep neural network, reinforcement learning, meta learning, variational inference and Markov chain Monte Carlo encounters different difficulties and challenges. The optimization methods developed in the specific machine learning fields are different, which can be inspiring to the development of general optimization methods.

Deep neural networks (DNNs) have shown great success in pattern recognition and machine learning. There are two very popular kinds of NNs, i.e., convolutional neural networks (CNNs) [13] and recurrent neural networks (RNNs), which play important roles in various fields of deep learning. CNNs are feedforward neural networks with convolution calculation. They have been successfully used in many fields such as image processing [14], [15], video processing [16] and natural language processing (NLP) [17], [18]. RNNs are a kind of sequential model and very active in NLP [19], [20], [21], [22].
Besides, RNNs are also popular in the fields of image processing [23], [24] and video processing [25]. In the field of constrained optimization, RNNs can achieve excellent results [26], [27], [28], [29]. In these works, the parameters of weights in RNNs can be learned by analytical methods, and these methods can find the optimal solution according to the trajectory of the state solution. Stochastic gradient-based algorithms are widely used in deep neural networks [30], [31], [32], [33]. However, various problems emerge when employing stochastic gradient-based algorithms. For example, the learning rate of some adaptive methods will oscillate in the later training stage [34], [35], which may lead to the problem of non-convergence. Thus, further optimization algorithms based on variance reduction were proposed to improve the convergence rate [36], [37]. Moreover, combining the stochastic gradient descent method with the characteristics of its variants is a possible direction to improve optimization. In particular, switching from an adaptive algorithm to the stochastic gradient descent method can improve the accuracy and convergence speed of the algorithm [38].

Reinforcement learning (RL) is a branch of machine learning, in which an agent interacts with the environment through a trial-and-error mechanism and learns an optimal policy by maximizing cumulative rewards [39]. Deep reinforcement learning combines RL with deep learning techniques, and enables the RL agent to have a good perception of its environment. Recent research has shown that deep learning can be applied to learn a useful representation for reinforcement learning problems [40], [41], [42], [43], [44]. Stochastic optimization algorithms are commonly used in RL and deep RL models.

Meta learning [45], [46] has recently become very popular in the field of machine learning. The goal of meta learning is to design a model that can adapt efficiently to a new environment with as few samples as possible. Meta learning can be applied to solve few-shot learning problems [47]. In general, meta learning methods can be summarized into the following three types [48]: metric-based methods [49], [50], [51], [52], model-based methods [53], [54] and optimization-based methods [55], [56], [47]. We will describe the details of optimization-based meta learning methods in the subsequent sections.

Variational inference is a useful approximation method which aims to approximate the posterior distributions in Bayesian machine learning. It can be considered as an optimization problem. For example, mean-field variational inference uses coordinate ascent to solve this optimization problem [57]. As the amount of data increases continuously, it is not friendly to use the traditional optimization method to handle the variational inference. Thus, the stochastic variational inference was proposed, which introduced natural gradients and extended variational inference to large-scale data [58].
Optimization methods have a significant influence on various fields of machine learning. For example, [5] proposed the transformer network using Adam optimization [33], which is applied to machine translation tasks. [59] proposed a super-resolution generative adversarial network for image super-resolution, which is also optimized by Adam. [60] proposed an actor-critic method using trust region optimization to solve deep RL problems on Atari games as well as the MuJoCo environments.

The gradient method can also be applied to Markov chain Monte Carlo (MCMC) sampling to improve efficiency. In this kind of application, stochastic gradient Hamiltonian Monte Carlo (HMC) is a representative method [61], where the stochastic gradient accelerates the step of gradient update when handling large-scale samples. The noise introduced by the stochastic gradient can be characterized by introducing Gaussian noise and friction terms. Additionally, the deviation caused by HMC discretization can be eliminated by the friction term, and thus the Metropolis-Hastings step can be omitted. The hyper-parameter settings in HMC will affect the performance of the model. There are some efficient ways to adjust the hyper-parameters automatically and improve the performance of the sampler.

The development of optimization brings a lot of contributions to the progress of machine learning. However, there are still many challenges and open problems for optimization in machine learning. 1) How to improve optimization performance with insufficient data in deep neural networks is a tricky problem. If there are not enough samples in the training of deep neural networks, it is prone to cause the problem of high variance and overfitting [62]. In addition, non-convex optimization has been one of the difficulties in deep neural networks, which makes the optimization tend to get a locally optimal solution rather than the global optimal solution. 2) For sequential models, the samples are often truncated into batches when the sequence is too long, which will cause deviation. How to analyze the deviation of stochastic optimization in this case and correct it is vital. 3) The stochastic variational inference is graceful and practical, and it is probably a good choice to develop methods of applying high-order gradient information to stochastic variational inference. 4) It may be a great idea to introduce the stochastic technique to the conjugate gradient method to obtain an elegant and powerful optimization algorithm. The detailed techniques for making such improvements in the stochastic conjugate gradient method are an interesting and challenging problem.

The purpose of this paper is to summarize and analyze classical and modern optimization methods from a machine learning perspective. The remainder of this paper is organized as follows. Section II summarizes the machine learning problems from the perspective of optimization. Section III discusses the classical optimization algorithms and their latest developments in machine learning. Particularly, the recent popular optimization methods, including the first- and second-order optimization algorithms, are emphatically introduced.

Section IV describes the developments and applications of optimization methods in some specific machine learning fields. Section V presents the challenges and open problems in the optimization methods. Finally, we conclude the whole paper.

II. MACHINE LEARNING FORMULATED AS OPTIMIZATION

Almost all machine learning algorithms can be formulated as an optimization problem to find the extremum of an objective function. Building models and constructing reasonable objective functions are the first step in machine learning methods. With the determined objective function, appropriate numerical or analytical optimization methods are usually used to solve the optimization problem.

According to the modeling purpose and the problem to be solved, machine learning algorithms can be divided into supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning. Particularly, supervised learning is further divided into the classification problem (e.g., sentence classification [17], [63], image classification [64], [65], [66], etc.) and the regression problem; unsupervised learning is divided into clustering and dimension reduction [67], [68], [69], among others.

A. Optimization Problems in Supervised Learning

For supervised learning, the goal is to find an optimal mapping function f(x) to minimize the loss function of the training samples,

\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L(y^i, f(x^i, \theta)),   (1)

where N is the number of training samples, \theta is the parameter of the mapping function, x^i is the feature vector of the ith sample, y^i is the corresponding label, and L is the loss function.

There are many kinds of loss functions in supervised learning, such as the square of the Euclidean distance, cross-entropy, contrastive loss, hinge loss, information gain and so on. For regression problems, the simplest way is using the square of the Euclidean distance as the loss function, that is, minimizing square errors on training samples. But the generalization performance of this kind of empirical loss is not necessarily good. Another typical form is structured risk minimization, whose representative method is the support vector machine. In the objective function, regularization items are usually added to alleviate overfitting, e.g., in terms of the L2 norm,

\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L(y^i, f(x^i, \theta)) + \lambda \|\theta\|_2^2,   (2)

where \lambda is the compromise parameter, which can be determined through cross-validation.

B. Optimization Problems in Semi-supervised Learning

Semi-supervised learning (SSL) is the method between supervised and unsupervised learning, which incorporates labeled data and unlabeled data during the training process. It can deal with different tasks including classification tasks [70], [71], regression tasks [72], clustering tasks [73], [74] and dimensionality reduction tasks [75], [76]. There are different kinds of semi-supervised learning methods including self-training, generative models, semi-supervised support vector machines (S3VM) [77], graph-based methods, multi-learning methods and others. We take S3VM as an example to introduce the optimization in semi-supervised learning.

S3VM is a learning model that can deal with binary classification problems where only part of the training set is labeled. Let D^l be the labeled data, which can be represented as D^l = \{\{x^1, y^1\}, \{x^2, y^2\}, ..., \{x^l, y^l\}\}, and D^u be the unlabeled data, which can be represented as D^u = \{x^{l+1}, x^{l+2}, ..., x^N\} with N = l + u. In order to use the information of the unlabeled data, an additional constraint on the unlabeled data is added to the original objective of SVM with slack variables \zeta^i. Specifically, define \epsilon^j as the misclassification error of an unlabeled instance if its true label is positive and z^j as the misclassification error of an unlabeled instance if its true label is negative. The constraint means to make \sum_{j=l+1}^{N} \min(\epsilon^j, z^j) as small as possible. Thus, an S3VM problem can be described as

\min \; \|w\| + C \left( \sum_{i=1}^{l} \zeta^i + \sum_{j=l+1}^{N} \min(\epsilon^j, z^j) \right),

subject to

y^i (w \cdot x^i + b) + \zeta^i \ge 1, \quad \zeta^i \ge 0, \quad i = 1, ..., l,
w \cdot x^j + b + \epsilon^j \ge 1, \quad \epsilon^j \ge 0, \quad j = l+1, ..., N,
-(w \cdot x^j + b) + z^j \ge 1, \quad z^j \ge 0,   (3)

where C is a penalty coefficient. The optimization problem in S3VM is a mixed-integer problem which is difficult to deal with [78]. There are various methods summarized in [79] to deal with this problem, such as the branch and bound techniques [80] and convex relaxation methods [81].
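To make the unlabeled term in (3) concrete, here is a minimal sketch (our illustration, not code from the paper) that evaluates the S3VM objective for a given linear classifier. It relies on the fact that, at the optimum, the slack variables equal hinge losses; all names (s3vm_objective, X_lab, X_unlab, etc.) are assumptions made for the example.

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unlab, C=1.0):
    """Value of the S3VM objective in Eq. (3) for a linear classifier (w, b).

    At the optimum the slack variables equal hinge losses, so
    zeta^i = max(0, 1 - y^i (w.x^i + b)) for labeled points, and the
    unlabeled term takes the smaller of the two possible hinge losses.
    """
    zeta = np.maximum(0.0, 1.0 - y_lab * (X_lab @ w + b))   # labeled slacks
    scores = X_unlab @ w + b
    eps = np.maximum(0.0, 1.0 - scores)                      # error if the true label were +1
    z = np.maximum(0.0, 1.0 + scores)                        # error if the true label were -1
    return np.linalg.norm(w) + C * (zeta.sum() + np.minimum(eps, z).sum())
```

Minimizing this non-convex objective over (w, b) is what the branch and bound and convex relaxation methods cited above address.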
C. Optimization Problems in Unsupervised Learning

Clustering algorithms [67], [82], [83], [84] divide a group of samples into multiple clusters, ensuring that the differences between the samples in the same cluster are as small as possible, and samples in different clusters are as different as possible. The optimization problem for the k-means clustering algorithm is formulated as minimizing the following loss function:

\min_{S} \sum_{k=1}^{K} \sum_{x \in S_k} \|x - \mu_k\|_2^2,   (4)

where K is the number of clusters, x is the feature vector of samples, \mu_k is the center of cluster k, and S_k is the sample set of cluster k. The implication of this objective function is to make the sum of variances of all clusters as small as possible.

The dimensionality reduction algorithm ensures that the original information from data is retained as much as possible after projecting them into the low-dimensional space. Principal component analysis (PCA) [85], [86], [87] is a typical algorithm of dimensionality reduction methods. The objective of PCA is formulated to minimize the reconstruction error as

\min \sum_{i=1}^{N} \|x^i - \bar{x}^i\|_2^2 \quad \text{where} \quad \bar{x}^i = \sum_{j=1}^{D'} z_j^i e_j, \; D' \ll D,   (5)

where N represents the number of samples, x^i is a D-dimensional vector, \bar{x}^i is the reconstruction of x^i, z^i = \{z_1^i, ..., z_{D'}^i\} is the projection of x^i in the D'-dimensional coordinates, and e_j is the standard orthogonal basis under the D'-dimensional coordinates.

Another common optimization goal in probabilistic models is to find an optimal probability density function p(x), which maximizes the logarithmic likelihood function (MLE) of the training samples,

\max \sum_{i=1}^{N} \ln p(x^i; \theta).   (6)

In the framework of Bayesian methods, some prior distributions are often assumed on the parameter \theta, which also has the effect of alleviating overfitting.

D. Optimization Problems in Reinforcement Learning

Reinforcement learning [42], [88], [89], unlike supervised learning and unsupervised learning, aims to find an optimal strategy function, whose output varies with the environment. For a deterministic strategy, the mapping function from state s to action a is the learning target. For an uncertain strategy, the probability of executing each action is the learning target. In each state, the action is determined by a = \pi(s), where \pi(s) is the policy function.

The optimization problem in reinforcement learning can be formulated as maximizing the cumulative return after executing a series of actions which are determined by the policy function,

\max_{\pi} V_{\pi}(s) \quad \text{where} \quad V_{\pi}(s) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, S_t = s \right],   (7)

where V_{\pi}(s) is the value function of state s under policy \pi, r is the reward, and \gamma \in [0, 1] is the discount factor.

E. Optimization for Machine Learning

Overall, the main steps of machine learning are to build a model hypothesis, define the objective function, and solve the maximum or minimum of the objective function to determine the parameters of the model. In these three vital steps, the first two steps are the modeling problems of machine learning, and the third step is to solve the desired model by optimization methods.

III. FUNDAMENTAL OPTIMIZATION METHODS AND PROGRESSES

From the perspective of gradient information, fundamental optimization methods can be divided into first-order optimization methods, high-order optimization methods and derivative-free optimization methods. These methods have a long history and are constantly evolving. They are progressing in many practical applications and have achieved good performance. Besides these fundamental methods, preconditioning is a useful technique for optimization methods. Applying reasonable preconditioning can reduce the number of iterations and obtain better spectral characteristics. These technologies have been widely used in practice. For the convenience of researchers, we summarize the existing common optimization toolkits in a table at the end of this section.

A. First-Order Methods

In the field of machine learning, the most commonly used first-order optimization methods are mainly based on gradient descent. In this section, we introduce some of the representative algorithms along with the development of the gradient descent methods. At the same time, the classical alternating direction method of multipliers and the Frank-Wolfe method in numerical optimization are also introduced.

1) Gradient Descent: The gradient descent method is the earliest and most common optimization method. The idea of the gradient descent method is that variables update iteratively in the (opposite) direction of the gradients of the objective function. The update is performed to gradually converge to the optimal value of the objective function. The learning rate \eta determines the step size in each iteration, and thus influences the number of iterations needed to reach the optimal value [90].

The steepest descent algorithm is a widely known algorithm. The idea is to select an appropriate search direction in each iteration so that the value of the objective function decreases the fastest. Gradient descent and steepest descent are not the same, because the direction of the negative gradient does not always descend fastest. Gradient descent is an example of using the Euclidean norm in steepest descent [91].

Next, we give the formal expression of the gradient descent method. For a linear regression model, we assume that f_\theta(x) is the function to be learned, L(\theta) is the loss function, and \theta is the parameter to be optimized. The goal is to minimize the loss function with

L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} (y^i - f_\theta(x^i))^2,   (8)

f_\theta(x) = \sum_{j=1}^{D} \theta_j x_j,   (9)

where N is the number of training samples, D is the number of input features, x^i is an independent variable with x^i = (x_1^i, ..., x_D^i) for i = 1, ..., N, and y^i is the target output. The gradient descent alternates the following two steps until it converges:

1) Derive L(\theta) with respect to \theta_j to get the gradient corresponding to each \theta_j:

\frac{\partial L(\theta)}{\partial \theta_j} = -\frac{1}{N} \sum_{i=1}^{N} (y^i - f_\theta(x^i)) x_j^i.   (10)

2) Update each \theta_j in the negative gradient direction to minimize the risk function:

\theta_j' = \theta_j + \eta \cdot \frac{1}{N} \sum_{i=1}^{N} (y^i - f_\theta(x^i)) x_j^i.   (11)
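The two alternating steps above translate directly into a short program. The following is a minimal sketch (ours, not the authors' implementation) of the update rule (11) for the linear model (9); passing a smaller batch_size replaces the full-sample gradient with the stochastic estimate discussed in the next subsection.

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, n_iters=1000, batch_size=None, seed=0):
    """Gradient descent for the squared loss (8) with f_theta(x) = X @ theta.

    With batch_size=None every sample is used in each step (Eqs. (10)-(11));
    a smaller batch_size gives the stochastic/mini-batch variant.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(n_iters):
        idx = np.arange(N) if batch_size is None else rng.choice(N, size=batch_size, replace=False)
        residual = X[idx] @ theta - y[idx]
        grad = X[idx].T @ residual / len(idx)   # Eq. (10), with the minus sign absorbed below
        theta -= eta * grad                     # Eq. (11): move against the gradient
    return theta
```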

The gradient descent method is simple to implement. The solution is globally optimal when the objective function is convex. It often converges at a slower speed if the variable is closer to the optimal solution, and more careful iterations need to be performed.

In the above linear regression example, note that all the training data are used in each iteration step, so the gradient descent method is also called the batch gradient descent. If the number of samples is N and the dimension of x is D, the computation complexity for each iteration will be O(ND). In order to mitigate the cost of computation, some parallelization methods were proposed [92], [93]. However, the cost is still hard to accept when dealing with large-scale data. Thus, the stochastic gradient descent method emerges.

2) Stochastic Gradient Descent: Since batch gradient descent has high computational complexity in each iteration for large-scale data and does not allow online updates, stochastic gradient descent (SGD) was proposed [1]. The idea of stochastic gradient descent is using one sample randomly to update the gradient per iteration, instead of directly calculating the exact value of the gradient. The stochastic gradient is an unbiased estimate of the real gradient [1]. The cost of the stochastic gradient descent algorithm is independent of the number of samples and can achieve sublinear convergence speed [37]. SGD reduces the update time for dealing with large numbers of samples and removes a certain amount of computational redundancy, which significantly accelerates the calculation. For strongly convex problems, SGD can achieve the optimal convergence speed [94], [95], [96], [36]. Meanwhile, it overcomes the disadvantage of batch gradient descent that it cannot be used for online learning.

The loss function (8) can be written as the following equation:

L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} (y^i - f_\theta(x^i))^2 = \frac{1}{N} \sum_{i=1}^{N} \mathrm{cost}(\theta, (x^i, y^i)).   (12)

If a random sample i is selected in SGD, the loss function will be L^*(\theta):

L^*(\theta) = \mathrm{cost}(\theta, (x^i, y^i)) = \frac{1}{2} (y^i - f_\theta(x^i))^2.   (13)

The gradient update in SGD uses the random sample i rather than all samples in each iteration,

\theta' = \theta + \eta (y^i - f_\theta(x^i)) x^i.   (14)

Since SGD uses only one sample per iteration, the computation complexity for each iteration is O(D), where D is the number of features. The update rate for each iteration of SGD is much faster than that of batch gradient descent when the number of samples N is large. SGD increases the overall optimization efficiency at the expense of more iterations, but the increased iteration number is insignificant compared with the high computation complexity caused by large numbers of samples. It is possible to use only thousands of samples overall to get the optimal solution even when the sample size is hundreds of thousands. Therefore, compared with batch methods, SGD can effectively reduce the computational complexity and accelerate convergence.

However, one problem in SGD is that the gradient direction oscillates because of the additional noise introduced by random selection, and the search process is blind in the solution space. Unlike batch gradient descent, which always moves towards the optimal value along the negative direction of the gradient, the variance of gradients in SGD is large and the movement direction in SGD is biased. So, a compromise between the two methods, the mini-batch gradient descent method (MSGD), was proposed [1].

The MSGD uses b independent identically distributed samples (b is generally from 50 to 256 [90]) as the sample set to update the parameters in each iteration. It reduces the variance of the gradients and makes the convergence more stable, which helps to improve the optimization speed. For brevity, we will call MSGD as SGD in the following sections.

As a common feature of stochastic optimization, SGD has a better chance of finding the global optimal solution for complex problems. The deterministic gradient in batch gradient descent may cause the objective function to fall into a local minimum for multimodal problems. The fluctuation in SGD helps the objective function jump to another possible minimum. However, the fluctuation in SGD always exists, which may more or less slow down the process of converging.

There are still many details to be noted about the use of SGD in the concrete optimization process [90], such as the choice of a proper learning rate. A too small learning rate will result in a slower convergence rate, while a too large learning rate will hinder convergence, making the loss function fluctuate around the minimum. One way to solve this problem is to set up a predefined list of learning rates or a certain threshold and adjust the learning rate during the learning process [97], [98]. However, these lists or thresholds need to be defined in advance according to the characteristics of the dataset. It is also inappropriate to use the same learning rate for all parameters. If data are sparse and features occur at different frequencies, it is not expected to update the corresponding variables with the same learning rate. A higher learning rate is often expected for less frequently occurring features [30], [33].

Besides the learning rate, how to avoid the objective function being trapped in an infinite number of local minima is a common challenge. Some work has proved that this difficulty does not come from the local minimum values, but comes from the "saddle point" [99]. The slope of a saddle point is positive in one direction and negative in another direction, and gradient values in all directions are zero. It is an important problem for SGD to escape from these points. Some research about escaping from saddle points has been developed [100], [101].

3) Nesterov Accelerated Gradient Descent: Although SGD is popular and widely used, its learning process is sometimes prolonged. How to adjust the learning rate, how to speed up the convergence, and how to prevent being trapped at a local minimum during the search are worthwhile research directions. Much work has been presented to improve SGD. For example, the momentum idea was proposed to be applied in SGD [102]. The concept of momentum is derived from the mechanics of physics, which simulates the inertia of objects. The idea of applying momentum in SGD is to preserve the influence

1 1 of the previous update direction on the next iteration to a O( k ) (after k steps) to O( k2 ), when not using stochastic certain degree. The momentum method can speed up the optimization [105]. convergence when dealing with high curvature, small but Another issue worth considering is how to determine the consistent gradients, or noisy gradients [103]. The momentum size of the learning rate. It is more likely to occur the algorithm introduces the variable v as the speed, which oscillation if the search is closer to the optimal point. Thus, the represents the direction and the rate of the parameter’s learning rate should be adjusted. The learning rate decay factor movement in the parameter space. The speed is set as the d is commonly used in the SGD’s momentum method, which average exponential decay of the negative gradient. makes the learning rate decrease with the iteration period In the gradient descent method, the speed update is v = [106]. The formula of the learning rate decay is defined as η ( ∂L(θ) ) each time. Using the momentum algorithm, the · − ∂(θ) η0 amount of the update v is not just the amount of gradient ηt = , (17) ∂L(θ) 1+ d t descent calculated by η ( ). It also takes into account · · − ∂(θ) the friction factor, which is represented as the previous update where ηt is the learning rate at the tth iteration, η0 is the vold multiplied by a momentum factor ranging between [0, 1]. original learning rate, and d is a decimal in [0, 1]. As can be Generally, the mass of the object is set to 1. The formulation seen from the formula, the smaller the d is, the slower the is expressed as decay of the learning rate will be. The learning rate remains unchanged when d = 0 and the learning rate decays fastest ∂L(θ) old when d =1. v = η ( )+ v mtm, (15) · − ∂(θ) · 4) Adaptive Learning Rate Method: The manually regu- lated learning rate greatly influences the effect of the SGD where mtm is the momentum factor. If the current gradient method. It is a tricky problem for setting an appropriate value is parallel to the previous speed vold, the previous speed can of the learning rate [30], [33], [107]. Some adaptive methods speed up this search. The proper momentum plays a role in were proposed to adjust the learning rate automatically. These accelerating the convergence when the learning rate is small. If methods are free of parameter adjustment, fast to converge, the derivative decays to 0, it will continue to update v to reach and often achieving not bad results. They are widely used in equilibrium and will be attenuated by friction. It is beneficial deep neural networks to deal with optimization problems. for escaping from the local minimum in the training process The most straightforward improvement to SGD is AdaGrad so that the search process can converge more quickly [102], [30]. AdaGrad adjusts the learning rate dynamically based on [104]. If the current gradient is opposite to the previous update the historical gradient in some previous iterations. The update vold, the value vold will have a deceleration effect on this formulae are as follows: search. The momentum method with a proper momentum factor ∂L(θ ) g = t , plays a positive role in reducing the oscillation of convergence t ∂θ when the learning rate is large. How to select the proper size  t 2 (18) of the momentum factor is also a problem. If the momentum  Vt = (gi) + ǫ,  i=1  r factor is small, it is hard to obtain the effect of improving X gt convergence speed. 
If the momentum factor is large, the θt+1 = θt η ,  − Vt current point may jump out of the optimal value point. Many  where g is the gradient of parameter θ at iteration t, V is experiments have empirically verified the most appropriate t  t the accumulate historical gradient of parameter θ at iteration setting for the momentum factor is 0.9 [90]. t, and θ is the value of parameter θ at iteration t. Nesterov Accelerated Gradient Descent (NAG) makes fur- t The difference between AdaGrad and gradient descent is ther improvement over the traditional momentum method that during the parameter update process, the learning rate [104], [105]. In Nesterov momentum, the momentum vold is no longer fixed, but is computed using all the historical mtm is added to θ, denoted as θ. The gradient of θ is used· gradients accumulated up to this iteration. One main benefit when updating. The detailed update formulae for parameters of AdaGrad is that it eliminates the need to tune the learning θ are as follows: e e rate manually. Most implementations use a default value of 0.01 for η in (18). θ = θ + vold mtm, · Although AdaGrad adaptively adjusts the learning rate, it  old ∂L(θ) still has two issues. 1) The algorithm still needs to set the ve = v mtm + η ( ), (16)  · · − ∂(θ) global learning rate η manually. 2) As the training time  θ′ = θ + v. e increases, the accumulated gradient will become larger and larger, making the learning rate tend to zero, resulting in  The improvement of Nesterov momentum over momentum ineffective parameter update. is reflected in updating the gradient of the future position AdaGrad was further improved to AdaDelta [31] and instead of the current position. From the update formula, we RMSProp [32] for solving the problem that the learning can find that Nestorov momentum includes more gradient rate will eventually go to zero. The idea is to consider not information compared with the traditional momentum method. accumulating all historical gradients, but focusing only on Note that Nesterov momentum improves the convergence from the gradients in a window over a period, and using the 7 exponential moving average to calculate the second-order [36], which is much faster than SGD, and has great advantages cumulative momentum, over other stochastic gradient algorithms. However, the SAG method is only applicable to the case 2 Vt = βVt−1 + (1 β)(gt) , (19) where the loss function is smooth and the objective function is − convex [36], [108], such as convex linear prediction problems. where β is the exponentialp decay parameter. Both RMSProp In this case, the SAG achieves a faster convergence rate and AdaDelta have been developed independently around the than the SGD. In addition, under some specific problems, it same time, stemming from the need to resolve the radically can even deliver better convergence than the standard batch diminishing learning rates of AdaGrad. gradient descent. Adaptive moment estimation (Adam) [33] is another ad- Stochastic Variance Reduction Gradient Since the SAG vanced SGD method, which introduces an adaptive learning method is only applicable to smooth and convex functions rate for each parameter. It combines the adaptive learning and needs to store the gradient of each sample, it is rate and momentum methods. In addition to storing an inconvenient to be applied in non-convex neural networks. 
The exponentially decaying average of past squared gradients stochastic variance reduction gradient (SVRG) [37] method V , like AdaDelta and RMSProp, Adam also keeps an t was proposed to improve the performance of optimization in exponentially decaying average of past gradients m , similar t the complex models. to the momentum method: The algorithm of SVRG maintains the interval average mt = β1mt−1 + (1 β1)gt, (20) gradient µ˜ by calculating the gradients of all samples in every − w iterations instead of in each iteration:

2 N Vt = β Vt + (1 β )(gt) , (21) 1 2 −1 − 2 µ˜ = g (θ˜), (24) N i where β and β arep exponential decay rates. The final update i=1 1 2 X formula for the parameter θ is where θ˜ is the interval update parameter. The interval parameter µ contains the average memory of all sample √1 β2 mt ˜ θt+1 = mt η − . (22) gradients in the past time for each time interval w. SVRG − 1 β1 Vt + ǫ − picks uniform it 1, ..., N randomly, and executes gradient The default values of β1, β2, and ǫ are suggested to set updates to the current∈{ parameters:} to 0.9, 0.999, and 10−8, respectively. Adam works well in ˜ practice and compares favorably to other adaptive learning rate θt = θt−1 η (git (θt−1) git (θ)+˜µ). (25) algorithms. − · − 5) Variance Reduction Methods: Due to a large amount The gradient is calculated up to two times in each update. After w iterations, perform θ˜ θw and start the next w iterations. of redundant information in the training samples, the SGD ← ˜ methods are very popular since they were proposed. However, Through these update, θt and the interval update parameter θ will converge to the optimal θ∗, and then µ˜ 0, and the stochastic gradient method can only converge at a sublinear → rate and the variance of gradient is often very large. How ˜ ∗ git (θt−1) git (θ)+˜µ git (θt−1) git (θ ) 0. (26) to reduce the variance and improve SGD to the linear − → − → convergence has always been an important problem. SVRG proposes a vital concept called variance reduction. Stochastic Average Gradient The stochastic average This concept is related to the convergence analysis of SGD, gradient (SAG) method [36] is a variance reduction method in which it is necessary to assume that there is a constant proposed to improve the convergence speed. The SAG upper bound for the variance of the gradients. This constant algorithm maintains parameter d recording the sum of the N upper bound implies that the SGD cannot achieve linear latest gradients gi in memory where gi is calculated using convergence. However, in SVRG, the upper bound of variance one sample i,i { }1, ..., N . The detailed implementation is can be continuously reduced due to the special update item ∈ { } ˜ to select a sample i to update d randomly, and use d to update git (θt−1) git (θ)+˜µ , thus achieving linear convergence t − the parameter θ in iteration t: [37]. The strategies of SAG and SVRG are related to variance d = d gˆi + gi (θt ), − t t −1 reduction. Compared with SAG, SVRG does not need to g g θ ,  ˆit = it ( t−1) (23) maintain all gradients in memory, which means that memory  α resources are saved, and it can be applied to complex problems  θt = θt−1 d, − N efficiently. Experiments have shown that the performance of where the updated item d is calculated by replacing the old SVRG is remarkable on a non-convex neural network [37],  gradient gˆit in d with the new gradient git (θt−1) in iteration [109], [110]. There are also many variants of such linear t, α is a constant representing the learning rate. Thus, each convergence stochastic optimization algorithms, such as the update only needs to calculate the gradient of one sample, not SAGA algorithm [111]. the gradients of all samples. The computational overhead is 6) Alternating Direction Method of Multipliers: Aug- no different from SGD, but the memory overhead is much mented Lagrangian multiplier method is a common method to larger. This is a typical way of using space for saving time. solve optimization problems with linear constraints. 
Compared The SAG has been shown to be a linear convergence algorithm with the naive Lagrangian multiplier method, it makes 8 problems easier to solve by adding a penalty term to the Here, we give a simple example of Frank-Wolfe method. objective. Consider the following example, Consider the optimization problem, min f(x), min θ1(x)+ θ2(y) Ax + By = b, x ,y . (27) { | ∈ X ∈ Y} s.t. Ax = b, (31)  The augmented Lagrange function for problem (27) is x 0,  ≥ ⊤ where A is an m n full row rank matrix, and the feasible β(x, y, λ)=θ (x)+ θ (y) λ (Ax + By b)  L 1 2 − − × β (28) region is S = x Ax = b, x 0 . Expand f(x) linearly at + Ax + By b 2. x , f(x) f(x{)+| f(x )⊤≥(x } x ), and substitute it into 2 0 0 0 0 || − || equation (31).≈ Then we∇ have − When solved by the augmented Lagrangian multiplier method, min f(x )+ f(x )⊤(x x ), its tth step iteration starts from the given λ , and the t t t (32) t s.t. x S, ∇ − optimization turns out to  ∈ which is equivalent to (xt+1,yt+1) = arg min β(x, y, λt) x ,y , ⊤ {L | ∈ X ∈ Y} min f(xt) x, λt+1 = λt β(Axt+1 + Byt+1 b). ∇ (33) ( − − s.t. x S. (29)  ∈ Suppose there exist an optimal solution yt, and then there Separating the (x, y) sub-problem in (29), the augmented must be ⊤ ⊤ method can be relaxed to the following f(xt) yt < f(xt) xt, ∇ ∇ (34) alternating direction method of multipliers (ADMM) [112], ⊤ ( f(xt) (yt xt) < 0. [113]. Its tth step iteration starts with the given (yt, λt), and ∇ − the details of iterative optimization are as follows: So yt xt is the decreasing direction of f(x) at xt. A fetch − step of λt updates the search point in a feasible direction. The ⊤ β 2 detailed operation is shown in Algorithm 1. xt = arg min θ (x) (λt) Ax + Conx x , +1 1 − 2 || || | ∈ X    Algorithm 1 Frank-Wolfe Method [118], [119]  ⊤ β 2  yt+1 = arg min θ2(y) (λt) By + Cony y ,  − 2 || || | ∈ Y Input: x0, ε 0, t := 0   ∗ ≥ λt+1 = λt β(Axt+1 + Byt+1 b), Output: x  − − ⊤  (30) yt min f(xt) x  ← ∇ ⊤ where Conx = Ax + Byt b and Cony = Axt + By b. while f(xt) (yt xt) >ε do − +1 − |∇ − | The penalty parameter β has a certain impact on the λt = arg min0≤λ≤1 f(xt + λ(yt xt)) − convergence rate of the ADMM. The larger β is, the greater the xt+1 xt + λt(yt xt) ≈ − penalties for the constraint term. In general, a monotonically t := t +1 ⊤ increasing sequence of β can be adopted instead of the yt min f(xt) x t ← ∇ fixed β [114]. Specifically,{ } an auto-adjustment criterion that end while ∗ automatically adjusts β based on the current value of x x xt t t ≈ during the iteration was{ } proposed, and applied for solving{ } some problems [115], [116]. The algorithm satisfies the following convergence theorem The ADMM method uses the separable operators in the [118]: ⊤ convex optimization problem to divide a large problem into (1) xt is the Kuhn-Tucker point of (31) when f(xt) (yt ∇ − multiple small problems that can be solved in a distributed xt)=0. manner. In theory, the framework of ADMM can solve most of (2) Since yt is an optimal solution for problem (33), the the large-scale optimization problems. However, there are still vector dt satisfies dt = yt xt and is the feasible descending − ⊤ some problems in practical applications. For example, if we direction of f at point xt when f(xt) (yt xt) =0. 
use a stop criterion to determine whether convergence occurs, The Frank-Wolfe algorithm is∇ a first-order− iterative6 method the original residuals and dual residuals are both related to β, for solving convex optimization problems with constrained and β with a large value will lead to difficulty in meeting the conditions. It consists of determining the feasible descent convergence conditions [117]. direction and calculating the search step size. The algorithm 7) Frank-Wolfe Method: In 1956, Frank and Wolfe pro- is characterized by fast convergence in early iterations and posed an algorithm for solving linear constraint problems slower in later phases. When the iterative point is close to [118]. The basic idea is to approximate the objective function the optimal solution, the search direction and the gradient with a linear function, then solve the to direction of the objective function tend to be orthogonal. Such find the feasible descending direction, and finally make a one- a direction is not the best downward direction so that the dimensional search along the direction in the feasible domain. Frank-Wolfe algorithm can be improved and extended in terms This method is also called the approximate linearization of the selection of the descending directions [120], [121], method. [122]. 9
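As a concrete illustration of Algorithm 1, the sketch below (a toy example of ours, not from the paper) runs Frank-Wolfe on a quadratic objective over the probability simplex, where the linearized subproblem (33) is solved by picking the vertex with the smallest gradient coordinate; for simplicity it uses the common 2/(t+2) step size instead of an exact line search.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iters=100):
    """Frank-Wolfe over the probability simplex {x : x >= 0, sum(x) = 1}.

    grad: callable returning the gradient of f at x.
    The linear minimization min_{y in S} grad(x)^T y is attained at a vertex
    of the simplex, i.e. the coordinate with the smallest gradient value.
    """
    x = x0.copy()
    for t in range(n_iters):
        g = grad(x)
        y = np.zeros_like(x)
        y[np.argmin(g)] = 1.0            # solve the linearized problem (33)
        step = 2.0 / (t + 2.0)           # standard diminishing step size
        x = x + step * (y - x)           # move along the feasible direction y - x
    return x

# Example: minimize ||x - c||^2 over the simplex (c already lies in the simplex).
c = np.array([0.2, 0.5, 0.3])
x_star = frank_wolfe_simplex(lambda x: 2.0 * (x - c), x0=np.ones(3) / 3)
```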

8) Summary: We summarize the mentioned first-order optimization problem that minimizes the quadratic positive optimization methods in terms of properties, advantages, and definite function, disadvantages in Table I. 1 min F (θ)= θ⊤Aθ bθ + c. (36) θ 2 − B. High-Order Methods The above two equations have an identical unique solution. It The second-order methods can be used for addressing the enables us to regard the conjugate gradient as a method for problem where an objective function is highly non-linear and solving optimization problems. ill-conditioned. They work effectively by introducing curvature The gradient of F (θ) can be obtained by simple calculation, information. and it equals the residual of the linear system [93]: r(θ) = This section begins with introducing the conjugate gradient F (θ)= Aθ b. ∇ − method, which is a method that only needs first-order deriva- Definition 1: Conjugate: Given an n n symmetric positive- tive information for well-defined , but × definite matrix A, two non-zero vector di, dj are conjugate overcomes the shortcoming of the steepest descent method, with respect to A if and avoids the disadvantages of Newton’s method of storing and calculating the inverse . But note that ⊤ when applying it to general optimization problems, the second- di Adj =0. (37) order gradient is needed to get an approximation to quadratic A set of non-zero vector d , d , d , ...., d is said to be programming. Then, the classical quasi-Newton method using 1 2 3 n conjugate with respect to A{ if any two unequal} vectors are second-order information is described. Although the conver- conjugate with respect to A [93]. gence of the algorithm can be guaranteed, the computational Next, we introduce the detailed derivation of the conjugate process is costly and thus rarely used for solving large machine gradient method. θ is a starting point, d n−1 is a set of learning problems. In recent years, with the continuous 0 t t=1 conjugate directions. In general, one can{ generate} the update improvement of high-order optimization methods, more and sequence θ ,θ , ...., θn by a iteration formula: more high-order methods have been proposed to handle large- { 1 2 } scale data by using stochastic techniques [124], [125], [126]. θt+1 = θt + ηtdt. (38) From this perspective, we discuss several high-order methods including the stochastic quasi-Newton method (integrating the The step size ηt can be obtained by a linear search, which second-order information and the stochastic method) and their means choosing ηt to minimize the object function f( ) along · variants. These algorithms allow us to use high-order methods θt +ηtdt. After some calculations (more details in [93], [128]), to process large-scale data. the update formula of ηt is 1) Conjugate Gradient Method: The conjugate gradient ⊤ (CG) approach is a very interesting optimization method, rt rt ηt = ⊤ . (39) which is one of the most effective methods for solving large- dt Adt scale linear systems of equations. It can also be used for The search direction d is obtained by a linear combination of solving nonlinear optimization problems [93]. As we know, t the negative residual and the previous search direction, the first-order methods are simple but have a slow convergence speed, and the second-order methods need a lot of resources. dt = rt + βtdt−1, (40) Conjugate gradient optimization is an intermediate algorithm, − which can only utilize the first-order information for some where rt can be updated by rt = rt−1 + ηt−1Adt−1. 
The problems but ensures the convergence speed like high-order scalar βt is the update parameter, which can be determined methods. by satisfying the requirement that dt and dt−1 are conjugate ⊤ Early in the 1960s, a conjugate gradient method for solving with respect to A, i.e., dt Adt−1 =0. Multiplying both sides ⊤ a linear system was proposed, which is an alternative to Gaus- of the equation (40) by dt−1A, one can obtain βt by sian elimination [127]. Then in 1964, the conjugate gradient ⊤ method was extended to handle nonlinear optimization for dt−1Art βt = ⊤ . (41) general functions [93]. For years, many different algorithms dt−1Adt−1 have been presented based on this method, some of which have been widely used in practice. The main features of these After several derivations of the above formula according to algorithms are that they have faster convergence speed than [93], the simplified version of βt is steepest descent. Next, we describe the conjugate gradient ⊤ r rt method. β = t . (42) t r⊤ r Consider a linear system, t−1 t−1 The CG method, has a graceful property that generating a Aθ = b, (35) new vector dt only using the previous vector dt−1, which does where A is an n n symmetric, positive-definite matrix. The not need to know all the previous vectors d0, d1, d2 ...dt−2. matrix A and vector× b are known, and we need to solve the The linear conjugate gradient algorithm is shown in Algorithm value of θ. The problem (35) can also be considered as an 2. 10
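The updates (38)-(42) can be coded directly. Below is a minimal sketch (ours, assuming A is symmetric positive definite) of the linear conjugate gradient iteration that Algorithm 2 summarizes.

```python
import numpy as np

def conjugate_gradient(A, b, theta0, tol=1e-10, max_iters=None):
    """Solve A theta = b (A symmetric positive definite) by linear CG."""
    theta = theta0.copy()
    r = A @ theta - b                     # residual, equals the gradient of F(theta)
    d = -r                                # first search direction
    max_iters = max_iters or len(b)
    for _ in range(max_iters):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        eta = (r @ r) / (d @ Ad)          # step size, Eq. (39)
        theta = theta + eta * d           # Eq. (38)
        r_new = r + eta * Ad              # updated residual
        beta = (r_new @ r_new) / (r @ r)  # Eq. (42)
        d = -r_new + beta * d             # new conjugate direction, Eq. (40)
        r = r_new
    return theta
```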

TABLE I: Summary of First-Order Optimization Methods

GD
Properties: Solves for the optimal value along the direction of the gradient descent. The method converges at a linear rate.
Advantages: The solution is globally optimal when the objective function is convex.
Disadvantages: In each parameter update, gradients over all samples need to be calculated, so the calculation cost is high.

SGD [1]
Properties: The update parameters are calculated using a randomly sampled mini-batch. The method converges at a sublinear rate.
Advantages: The calculation time for each update does not depend on the total number of training samples, and a lot of calculation cost is saved.
Disadvantages: It is difficult to choose an appropriate learning rate, and using the same learning rate for all parameters is not appropriate. The solution may be trapped at a saddle point in some cases.

NAG [105]
Properties: Accelerates the current gradient descent by accumulating the previous gradient as momentum and performs the gradient update process with momentum.
Advantages: When the gradient direction changes, the momentum can slow the update speed and reduce the oscillation; when the gradient direction remains, the momentum can accelerate the parameter update. Momentum helps to jump out of locally optimal solutions.
Disadvantages: It is difficult to choose a suitable learning rate.

AdaGrad [30]
Properties: The learning rate is adaptively adjusted according to the sum of the squares of all historical gradients.
Advantages: In the early stage of training, the cumulative gradient is smaller, the learning rate is larger, and the learning speed is faster. The method is suitable for dealing with sparse gradient problems. The learning rate of each parameter adjusts adaptively.
Disadvantages: As the training time increases, the accumulated gradient becomes larger and larger, making the learning rate tend to zero and resulting in ineffective parameter updates. A manual learning rate is still needed. It is not suitable for dealing with non-convex problems.

AdaDelta / RMSProp [31], [32]
Properties: Change the way of total gradient accumulation to an exponential moving average.
Advantages: Improve the ineffective learning problem in the late stage of AdaGrad. Suitable for optimizing non-stationary and non-convex problems.
Disadvantages: In the late training stage, the update process may be repeated around the local minimum.

Adam [33]
Properties: Combines the adaptive methods and the momentum method. Uses the first-order moment estimation and the second-order moment estimation of the gradient to dynamically adjust the learning rate of each parameter. Adds the bias correction.
Advantages: The gradient descent process is relatively stable. It is suitable for most non-convex optimization problems with large data sets and high-dimensional spaces.
Disadvantages: The method may not converge in some cases.

SAG [36]
Properties: The old gradient of each sample and the summation of gradients over all samples are maintained in memory. For each update, one sample is randomly selected and the gradient sum is recalculated and used as the update direction.
Advantages: The method is a linear convergence algorithm, which is much faster than SGD.
Disadvantages: The method is only applicable to smooth and convex functions and needs to store the gradient of each sample. It is inconvenient to apply in non-convex neural networks.

SVRG [37]
Properties: Instead of saving the gradient of each sample, the average gradient is saved at regular intervals. The gradient sum is updated at each iteration by calculating the gradients with respect to the old parameters and the current parameters for the randomly selected samples.
Advantages: The method does not need to maintain all gradients in memory, which saves memory resources. It is a linear convergence algorithm.
Disadvantages: To apply it to larger/deeper neural nets whose training cost is a critical issue, further investigation is still needed.

ADMM [123]
Properties: The method solves optimization problems with linear constraints by adding a penalty term to the objective and separating variables into sub-problems which can be solved iteratively.
Advantages: The method uses the separable operators in the convex optimization problem to divide a large problem into multiple small problems that can be solved in a distributed manner. The framework is practical in most large-scale optimization problems.
Disadvantages: The original residuals and dual residuals are both related to the penalty parameter, whose value is difficult to determine.

Frank-Wolfe [118]
Properties: The method approximates the objective function with a linear function, solves the linear programming to find the feasible descending direction, and makes a one-dimensional search along the direction in the feasible domain.
Advantages: The method can solve optimization problems with linear constraints, whose convergence speed is fast in early iterations.
Disadvantages: The method converges slowly in later phases. When the iterative point is close to the optimal solution, the search direction and the gradient of the objective function tend to be orthogonal. Such a direction is not the best downward direction.
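As a concrete illustration of the adaptive methods summarized in Table I, the following sketch (our own, using the standard bias-corrected formulation rather than the compact form of Eq. (22)) performs one Adam update step; here v stores the running average of squared gradients, whose square root corresponds to the V_t of Eq. (21).

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (standard bias-corrected form).

    m and v are the running first and second moment estimates (cf. Eqs. (20)-(21)),
    t is the 1-based iteration counter.  Returns the updated (theta, m, v).
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first moment, Eq. (20)
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment (squared-gradient average)
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```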

Algorithm 2 Conjugate Gradient Method [128] second-order gradient is not directly needed in the quasi-

Input: A, b, θ0 Newton method, so it is sometimes more efficient than Output: The solution θ∗ Newton’s method. In the following section, we will introduce r = Aθ b several quasi-Newton methods, in which the Hessian matrix 0 0 − d0 = r0, t =0 and its inverse matrix are approximated by different methods. while−Unsatisfied convergence condition do Quasi-Newton Condition We first introduce the quasi- ⊤ rt rt Newton condition. Assuming that the objective function f can ηt = d⊤Ad t t be approximated by a quadratic function, we can extend f(θ) θt+1 = θt + ηtdt to Taylor series at θ = θt , i.e., rt+1 = rt + ηtAdt +1 ⊤ r r t+1 t+1 ⊤ βt+1 = r⊤r f(θ) f(θt+1)+ f(θt+1) (θ θt+1) t t ≈ ∇ − dt = rt + βt dt +1 +1 +1 1 ⊤ 2 − + (θ θt ) f(θt )(θ θt ). (46) t = t +1 2 − +1 ∇ +1 − +1 end while Then we can compute the gradient on both sides of the above equation, and obtain

2) Quasi-Newton Methods: Gradient descent employs first- 2 f(θ) f(θt )+ f(θt )(θ θt ). (47) order information, but its convergence rate is slow. Thus, ∇ ≈ ∇ +1 ∇ +1 − +1 the natural idea is to use second-order information, e.g., Set θ = θt in (47), we have Newton’s method [129]. The basic idea of Newton’s method 2 is to use both the first-order derivative (gradient) and second- f(θt) f(θt+1)+ f(θt+1)(θt θt+1). (48) ∇ ≈ ∇ ∇ − order derivative (Hessian matrix) to approximate the objective Use B to represent the approximate matrix of the Hessian function with a quadratic function, and then solve the matrix. Set s = θ θ , and u = f(θ ) f(θ ). minimum optimization of the quadratic function. This process t t+1 t t t+1 t The matrix B is satisfied− that ∇ − ∇ is repeated until the updated variable converges. t+1 The one-dimensional Newton’s iteration formula is shown as ut = Bt+1st. (49) f ′(θ ) This equation is called the quasi-Newton condition, or secant θ θ t , (43) t+1 = t ′′ equation. − f (θt) The search direction of quasi-Newton method is where f is the object function. More general, the high- −1 dimensional Newton’s iteration formula is dt = B gt, (50) − t 2 −1 θt = θt f(θt) f(θt) , t 0, (44) +1 − ∇ ∇ ≥ where gt is the gradient of f, and the update of quasi-Newton where 2f is a Hessian matrix of f. More precisely, if the is ∇ learning rate (step size factor) is introduced, the iteration θt+1 = θt + ηtdt. (51) formula is shown as The step size ηt is chosen to satisfy the , which is a set of inequalities for inexact line searches 2 −1 dt = f(θt) f(θt), minηt f(θt + ηtdt) [132]. Unlike Newton’s method, quasi- −∇ ∇ Newton method uses B to approximate the true Hessian θt+1 = θt + ηtdt, (45) t matrix. In the following paragraphs, we will introduce some where d is the Newton’s direction, η is the step size. t t particular quasi-Newton methods, in which Ht is used to This method can be called damping Newton’s method [130]. −1 express the inverse of Bt, i.e., Ht = Bt . Geometrically speaking, Newton’s method is to fit the local DFP In the 1950s, a physical scientist, William C. Davidon surface of the current position with a quadratic surface, while [133], proposed a new approach to solve nonlinear problems. the gradient descent method is to fit the current local surface Then Fletcher and Powel [134] explained and improved this with a plane [131]. method, which sparked a lot of research in the late 1960s and Quasi-Newton Method Newton’s method is an iterative early 1970s [6]. DFP is the first quasi-Newton method named algorithm that requires the computation of the inverse Hessian after the initials of their three names. The DFP correction matrix of the objective function at each step, which makes formula is one of the most creative inventions in the field the storage and computation very expensive. To overcome the of non-linear optimization, shown as below: expensive storage and computation, an approximate algorithm ⊤ ⊤ ⊤ was considered which is called the quasi-Newton method. The (DF P ) utst stut utut Bt+1 = (I ⊤ )Bt(I ⊤ )+ ⊤ . (52) essential idea of the quasi-Newton method is to use a positive − ut st − ut st ut st definite matrix to approximate the inverse of the Hessian The update formula of H is matrix, thus simplifying the complexity of the operation. The t+1 quasi-Newton method is one of the most effective methods H u u⊤H s s⊤ HDF P H t t t t t t . (53) for solving non-linear optimization problems. Moreover, the t+1 = t ⊤ + ⊤ − ut Htut ut st 12
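The DFP correction (53) is a rank-two update that can be written in a few lines. The sketch below (ours, not from the paper) updates the inverse Hessian approximation H from a step s_t and gradient difference u_t; BFGS, discussed next, applies a closely related rank-two correction.

```python
import numpy as np

def dfp_update(H, s, u):
    """DFP update of the inverse Hessian approximation, Eq. (53).

    H: current inverse-Hessian approximation,
    s = theta_{t+1} - theta_t,
    u = grad f(theta_{t+1}) - grad f(theta_t).
    """
    Hu = H @ u
    return H - np.outer(Hu, Hu) / (u @ Hu) + np.outer(s, s) / (u @ s)
```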

BFGS: Broyden, Fletcher, Goldfarb and Shanno proposed the BFGS method [135], [136], [137], [3], in which B_{t+1} is updated according to

    B^{BFGS}_{t+1} = B_t − (B_t s_t s_t^⊤ B_t)/(s_t^⊤ B_t s_t) + (u_t u_t^⊤)/(u_t^⊤ s_t).   (54)

The corresponding update of H_{t+1} is

    H^{BFGS}_{t+1} = (I − (s_t u_t^⊤)/(s_t^⊤ u_t)) H_t (I − (u_t s_t^⊤)/(s_t^⊤ u_t)) + (s_t s_t^⊤)/(s_t^⊤ u_t).   (55)

The quasi-Newton algorithm still cannot solve large-scale data optimization problems, because the method generates a sequence of matrices to approximate the Hessian matrix. Storing these matrices consumes computer resources, especially for high-dimensional problems, and it is impossible to retain them in the high-speed storage of computers, restricting its use to even small and midsize problems [138].

L-BFGS: The limited-memory quasi-Newton method, named L-BFGS [138], [139], is an improvement based on the quasi-Newton method, which is feasible in dealing with the high-dimensional situation. The method stores just a few n-dimensional vectors, instead of retaining and computing fully dense n × n approximations of the Hessian [140]. The basic idea of L-BFGS is to store the vector sequence used in the calculation of the approximation H_{t+1}, instead of storing the complete matrix H_t. L-BFGS rewrites the update formula of H_{t+1} as

    H_{t+1} = (I − (s_t u_t^⊤)/(u_t^⊤ s_t)) H_t (I − (u_t s_t^⊤)/(u_t^⊤ s_t)) + (s_t s_t^⊤)/(u_t^⊤ s_t)
            = V_t^⊤ H_t V_t + ρ_t s_t s_t^⊤,   (56)

where

    V_t = I − ρ_t u_t s_t^⊤,  ρ_t = 1/(s_t^⊤ u_t).   (57)

The above equation means that the inverse Hessian approximation H_{t+1} can be obtained using the sequence of pairs {s_l, u_l}_{l=t−p+1}^{t}. In other words, instead of storing and calculating the complete matrix H_{t+1}, L-BFGS only keeps the latest p pairs {s_l, u_l}. According to the equation, a recursive procedure can be reached. When the latest p steps are retained, the calculation of H_{t+1} can be expressed as [139]

    H_{t+1} = (V_t^⊤ V_{t−1}^⊤ ··· V_{t−p+1}^⊤) H_t^0 (V_{t−p+1} V_{t−p+2} ··· V_t)
            + ρ_{t−p+1} (V_t^⊤ V_{t−1}^⊤ ··· V_{t−p+2}^⊤) s_{t−p+1} s_{t−p+1}^⊤ (V_{t−p+2} ··· V_t)
            + ρ_{t−p+2} (V_t^⊤ V_{t−1}^⊤ ··· V_{t−p+3}^⊤) s_{t−p+2} s_{t−p+2}^⊤ (V_{t−p+3} ··· V_t)
            + ···
            + ρ_t s_t s_t^⊤.   (58)

The update direction d_t = −H_t g_t can then be calculated, where g_t is the gradient of the objective function f. The detailed procedure is shown in Algorithms 3 and 4.

Algorithm 3 Two-Loop Recursion for H_t g_t [93]
Input: ∇f_t, u_t, s_t
Output: H_{t+1} g_{t+1}
g_t = ∇f_t
H_t^0 = ((s_t^⊤ u_t)/‖u_t‖²) I
for l = t − 1 to t − p do
    η_l = ρ_l s_l^⊤ g_{l+1}
    g_l = g_{l+1} − η_l u_l
end for
r_{t−p−1} = H_t^0 g_{t−p}
for l = t − p to t − 1 do
    β_l = ρ_l u_l^⊤ r_{l−1}
    r_l = r_{l−1} + s_l (η_l − β_l)
end for
H_{t+1} g_{t+1} = r_{t−1}

Algorithm 4 Limited-BFGS [139]
Input: θ0 ∈ R^n, ǫ > 0
Output: the solution θ∗
t = 0, g_0 = ∇f_0, u_0 = 1, s_0 = 1
while ‖g_t‖ > ǫ do
    Choose H_t^0, for example H_t^0 = ((s_t^⊤ u_t)/‖u_t‖²) I
    g_t = ∇f_t
    d_t = −H_t g_t obtained from Algorithm 3 (two-loop recursion for H_t g_t)
    Search a step size η_t through the Wolfe rule
    θ_{t+1} = θ_t + η_t d_t
    if t > p then
        Discard the vector pair {s_{t−p}, u_{t−p}} from storage
    end if
    Compute and save s_t = θ_{t+1} − θ_t, u_t = g_{t+1} − g_t
    t = t + 1
end while

For more information about the BFGS and L-BFGS algorithms, one can refer to [93], [138]. Recently, a batch L-BFGS for machine learning was proposed [141], which uses overlapping mini-batches of consecutive samples in the quasi-Newton update. It means that the calculation of u_t becomes u_t = ∇f_{S_{t+1}}(θ_{t+1}) − ∇f_{S_t}(θ_t), where S_t is a small subset of samples; S_{t+1} and S_t are not independent and may contain a relatively large overlap. Numerical results in [141] have shown that this modification of L-BFGS is effective in practice.

3) Stochastic Quasi-Newton Method: In many large-scale machine learning models, it is necessary to use stochastic optimization, with each update based on a relatively small training subset [125]. Stochastic algorithms often obtain the best generalization performance in large-scale learning systems [142]. The quasi-Newton method only uses first-order gradient information to approximate the Hessian matrix, so it is a natural idea to combine the quasi-Newton method with the stochastic method in order to handle large-scale problems. Online-BFGS and online-LBFGS are two such variants of BFGS [124].
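To make the limited-memory recursion above concrete, the following is a minimal NumPy sketch of the standard two-loop recursion for computing the L-BFGS direction. The function name and variable layout are illustrative; in the stochastic variants discussed next, g would be a mini-batch gradient and the (s, u) pairs would be built from sub-sampled quantities.

```python
import numpy as np

def lbfgs_two_loop(g, s_list, u_list):
    """Compute the L-BFGS direction -H_t g via the two-loop recursion.

    s_list and u_list hold the latest p pairs s_l = theta_{l+1} - theta_l and
    u_l = grad f(theta_{l+1}) - grad f(theta_l), ordered oldest first.
    """
    q = g.astype(float).copy()
    rhos = [1.0 / (u @ s) for s, u in zip(s_list, u_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, u, rho in zip(reversed(s_list), reversed(u_list), reversed(rhos)):
        alpha = rho * (s @ q)
        q -= alpha * u
        alphas.append(alpha)
    # Initial scaling H_t^0 = (s_t^T u_t / ||u_t||^2) I, as in Algorithm 3.
    gamma = (s_list[-1] @ u_list[-1]) / (u_list[-1] @ u_list[-1])
    r = gamma * q
    # Second loop: oldest pair to newest.
    for s, u, rho, alpha in zip(s_list, u_list, rhos, reversed(alphas)):
        beta = rho * (u @ r)
        r += s * (alpha - beta)
    return -r   # search direction d_t = -H_t g_t
```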

Consider the minimization of a convex stochastic function, Algorithm 5 SQN Framework [143]

Input: θ0, V , m, ηt min R F (θ)= E[f(θ, ξ)], (59) θ∈ Output: The solution θ∗ where ξ is a random seed. We assume that ξ represents a for t=1, 2, 3, 4,....., do ′ sample (or a set of samples) consisting of an input-output pair st = Htgt using the two-loop recursion. ′ (x, y). In machine learning x typically represents an input and st = ηtst − ′ y is the target output. f usually has the following form: θt+1 = θt + st if update pairs then f(θ; ξ)= f(θ; xi,yi)= l(h(w; xi); yi), (60) Compute st and ut Add a new displacement pair s ,u to V where h is a prediction model parameterized by θ, and l is t t if V >m then { } a loss function. We define f (θ) = f(θ; x ,y ), and use the i i i | Remove| the eldest pair from V empirical loss to define the objective, end if N end if 1 F (θ)= f (θ). (61) end for N i i=1 X Typically, if a large amount of training data is used to train the machine learning models, a better choice is using mini- In the above algorithm, V = st,ut is a collection { } batch stochastic gradient, of m displacement pairs, and gt is the current stochastic gradient FS (θt). Meanwhile, the matrix-vector product 1 ∇ t FS (θt)= fi(θt), (62) Htgt can be computed by a two-loop recursion as described ∇ t c ∇ i S in the previous section. Recently, more and more work X∈ t has achieved very good results in stochastic quasi-Newton. where subset S 1, 2, 3 N is randomly selected. c is t Specifically, a regularized stochastic BFGS method was the cardinality of⊂S {and c ···N.} Let SH 1, 2, 3, ,N t t proposed, which makes a corresponding analysis of the be a randomly chosen subset≪ of the training⊂{ samples··· and the} convergence of this optimization method [144]. Further, an stochastic Hessian estimate can be online L-BFGS was presented in [145]. A linearly convergent 2 1 2 method was proposed [126], which combines the L-BFGS FSt (θt)= fi(θt), (63) ∇ ch ∇ i∈SH method in [125] with the variance reduction technique. Xt Besides these, a variance reduced block L-BFGS method was H where ch is the cardinality of St . With given stochastic proposed, which works by employing the actions of a sub- gradient, a direct approach to develop stochastic quasi-Newton sampled Hessian on a set of random vectors [146]. method is to transform deterministic gradients into stochastic To sum up, we have discussed the techniques of using gradients throughout the iterations, such as online-BFGS and stochastic methods in second-order optimization. The stochas- online-LBFGS [124], which are two stochastic adaptations tic quasi-Newton method is a combination of the stochastic of the BFGS algorithms. Specifically, following the BFGS method and the quasi-Newton method, which makes the quasi- described in the previous section, st,ut are modified as Newton method extend to large datasets. We have introduced the related work of the stochastic quasi-Newton method in st := θt θt and ut := FS (θt ) FS (θt). (64) +1 − ∇ t +1 − ∇ t recent years, which reflects the potential of the stochastic One disadvantage of this method is that each iteration quasi-Newton method in machine learning applications. requires two gradient estimates. Besides this, a more worrying 4) Hessian-Free Optimization Method: The main idea of fact is that updating the inverse Hessian approximations in Hessian-free (HF) method is similar to Newton’s method, each step may not be reasonable [143]. Then the stochastic which employs second-order gradient information. 
The dif- quasi-Newton (SQN) method was proposed, which is to use ference is that the HF method is not necessary to directly sub-sampled Hessian-vector products to update Ht by the calculate the Hessian matrix H. It estimates the product Hv LBFGS according to [125]. Meanwhile, the authors proposed by some techniques, and thus is called “Hessian free”. an effective approach that decouples the stochastic gradient Consider a local quadratic approximation Qθ(dt) of the and curvature estimate calculations to obtain a stable Hessian object F around parameter θ, approximation. In particular, since

1 2 F (θ +d ) Q (d )= F (θ )+ F (θ )⊤d + d⊤B d , (67) F (θt ) F (θt) F (θt)(θt θt), (65) t t θ t t t t t t t ∇ +1 − ∇ ≈ ∇ +1 − ≈ ∇ 2 u can be rewritten as t where dt is the search direction. The HF method applies the 2 conjugate gradient method to compute an approximate solution ut := F H (θt)st. (66) St ∇ dt of the linear system, Based on these techniques, an SQN Framework was proposed, and the detailed procedure is shown in Algorithm 5. Btdt = F (θt), (68) −∇ 14

where Bt = H(θt) is the Hessian matrix, but in practice Bt As described above, we can obtain the approximate solution of is often defined as Bt = H(θt)+ λI, λ 0 [7]. The new direction dt by employing the CG method to solve the linear update is then given by ≥ system, 2 F H (θt)dt = FS (θt), (74) St t θt+1 = θt + ηtdt, (69) ∇ −∇ in which the stochastic gradient and stochastic Hessian matrix where ηt is the step size that ensures sufficient decrease in are used. The basic framework of sub-sampled HF algorithm the objective function, usually obtained by a linear search. is given in [147]. H According to [7], the basic framework of HF optimization is A natural question is how to determine the size of St . On H shown in Algorithm 6. one hand, St can be chosen small enough so that the total cost of CG iteration is not much greater than a gradient evaluation. H Algorithm 6 Hessian-Free Optimization Method [7] On the other hand, St should be large enough to get useful curvature information from Hessian-vector product. How to Input: θ , f(θ ), λ 0 0 balance the size of SH is a challenge being studied [147]. Output: The∇ solution θ∗ t 5) Natural Gradient: The natural gradient method can be t =0 repeat potentially applied to any objective function which measures the performance of some statistical models [148]. It enjoys g = f(θ ) t t richer theoretical properties when applied to objective func- Compute∇ λ by some methods tions based on the KL divergence between the model’s dis- B (v) H(θ )v + λv t t tribution and the target distribution, or certain approximation Compute≡ the step size η t surrogates of these [149]. d = CG(B , g ) t t t The traditional gradient descent algorithm is based on the θ = θ + η−d t+1 t t t Euclidean space. However, in many cases, the parameter t = t +1 until satisfy convergence condition space is not Euclidean, and it may have a Riemannian metric structure. In this case, the steepest direction of the objective function cannot be given by the ordinary gradient and should The advantage of using the conjugate gradient method be given by the natural gradient [148]. is that it can calculate the Hessian-vector product without We consider such a model distribution p(y x, θ), and π(x, y) directly calculating the Hessian matrix. Because in the CG- is an empirical distribution. We need to fit| the parameters algorithm, the Hessian matrix is paired with a vector, then θ RN . Assume that x is an observation vector, and y is we can compute the Hessian-vector product to avoid the its∈ associated label. It has the objective function, calculation of the Hessian inverse matrix. There are many F (θ)= E [ log p(y x, θ)], (75) ways to calculate Hessian-vector products, one of which is (x,y)∼π − | calculated by a finite difference as [7] and we need to solve the optimization problem, f(θ + εv) f(θ) θ∗ = argmin F (θ). (76) Hv = lim ∇ − ∇ . (70) θ ε→+0 ε According to [148], the natural gradient can be transformed Sub-sampled Hessian-Free Method HF is a well-known from a traditional gradient multiplied by a Fisher information method, and has been studied for decades in the optimization matrix, i.e., −1 literatures, but has shortcomings when applied to deep neural N F = G F, (77) ∇ ∇ networks with large-scale data [7]. Therefore, a sub-sampled where F is the object function, F is the traditional gradient, technique is employed in HF, resulting in an efficient HF F is the natural gradient, and▽ G is the Fisher information method [7], [147]. 
The cost in each iteration can be reduced by N matrix,▽ with the following form: using only a small sample set S to calculate Hv. The objective function has the following form: ∂p(y x; θ) ∂p(y x; θ) G = E E ( | )( | )⊤ . x∼π y∼p(y|x,θ) ∂θ ∂θ N    1 (78) min F (θ)= f (θ). (71) N i The update formula with the natural gradient is i=1 X θt = θt ηt N F. (79) In the tth iteration, the stochastic gradient estimation can be − ∇ written as We cannot ignore that the application of the natural gradient is 1 very limited because of too much computation. It is expensive FSt (θt)= fi(θt), (72) ∇ St ∇ to estimate the Fisher information matrix and calculate its i∈S | | Xt inverse matrix. To overcome this limitation, the truncated and the stochastic Hessian estimate is expressed as Newton’s method was developed [7], in which the inverse is calculated by an iterative procedure, thus avoiding the 2 1 2 direct calculation of the inverse of the Fisher information FSH (θt)= fi(θt). (73) ∇ t SH ∇ t i SH matrix. In addition, the factorized natural gradient (FNG) [150] | | X∈ t 15 and Kronecker-factored approximate curvature (K-FAC) [151] discipline of mathematical optimization [155], [156], [157]. It methods were proposed to use the of probabilistic can find the optimal solution without the gradient information. models to calculate the approximate natural gradient update. There are mainly two types of ideas for derivative- 6) Trust Region Method: The update process of most free optimization. One is to use heuristic algorithms. It methods introduced above can be described as θt + ηtdt. The is characterized by empirical rules and chooses methods displacement of the point in the direction of dt can be written that have already worked well, rather than derives solutions as st. The typical trust region method (TRM) can be used for systematically. There are many types of heuristic optimization unconstrained nonlinear optimization problems [140], [152], methods, including classical arithmetic, [153], in which the displacement st is directly determined genetic algorithms, ant colony algorithms, and particle swarm without the search direction dt. optimization [158], [159], [160]. These heuristic methods usu- For the problem min fθ(x), the TRM [140] uses the second- ally yield approximate global optimal values, and theoretical order Taylor expansion to approximate the objective function support is weak. We do not focus on such techniques in this fθ(x), denoted as qt(s). Each search is done within the range section. The other is to fit an appropriate function according of trust region with radius t. This problem can be described to the samples of the objective function. This type of method as △ usually attaches some constraints to the search space to derive the samples. method is a typical derivative- 1 min q (s)= f (x )+ g⊤s + s⊤B s, free algorithm [161], and it can be extended and applied to t θ t t 2 t (80)  optimization algorithms for machine learning problems easily. s.t. st t,  || || ≤ △ In this section, we mainly introduce the coordinate descent method. where gtis the approximate gradient of the objective function f(x) at the current iteration point xt, gt f(xt), Bt is The coordinate descent method is a derivative-free opti- a symmetric matrix, which is the approximation≈ ∇ of Hessian mization algorithm for multi-variable functions. Its idea is 2 matrix fθ(xt), and t > 0 is the radius of the trust region. 
that a one-dimensional search can be performed sequentially ∇ △ If the L2 norm is used in the constraint function, it becomes along each axis direction to obtain updated values for each the Levenberg-Marquardt algorithm [154]. dimension. This method is suitable for some problems in If st is the solution of the trust region subproblem (80), the which the loss function is non-differentiable. displacement st of each update is limited by the trust region The vanilla approach is to select a set of bases e1,e2, ..., eD radius t. The core part of the TRM is the update of t. in the linear space as the search directions and minimizes the In each△ update process, the similarity of the quadratic mode△ l value of the objective function in each direction. For the target t q(st) and the objective function fθ(x) is measured, and t is function L(Θ), when Θ is already obtained, the jth dimension updated dynamically. The actual amount of descent in the△ tth of Θt+1 is solved by [155] iteration is [140] t+1 t+1 t+1 t t θj = arg minθj ∈RL(θ1 , ..., θj−1,θj ,θj+1, ..., θD). (84) ft = ft f(xt + st). (81) Thus, L(Θt+1) L(Θt) ... L(Θ0) is guaranteed. The △ − ≤ ≤ ≤ The predicted drop in the tth iteration is convergence of this method is similar to the gradient descent method. The order of update can be an arbitrary arrangement qt = ft q(st). (82) △ − from e1 to eD in each iteration. The descent direction can be generalized from the coordinate axis to the coordinate block The ratio rt is defined to measure the approximate degree of both, [162]. ft The main difference between the coordinate descent and rt = △ . (83) qt the gradient descent is that each update direction in the △ gradient descent method is determined by the gradient of the It indicates that the model is more realistic than expected when current position, which may not be parallel to any coordinate r is close to 1, and then we should consider expanding . t t axis. In the coordinate descent method, the optimization At the same time, it indicates that the model predicts a large△ direction is fixed from beginning to end. It does not need drop and the actual drop is small when r is close to 0, and t to calculate the gradient of the objective function. In each then we should reduce . Moreover, if r is between 0 and t t iteration, the update is only executed along the direction of 1, we can leave unchanged.△ The thresholds 0 and 1 are t one axis, and thus the calculation of the coordinate descent generally set as the△ left and right boundaries of r [140]. t method is simple even for some complicated problems. For 7) Summary: We summarize the mentioned high-order indivisible functions, the algorithm may not be able to find optimization methods in terms of properties, advantages and the optimal solution in a small number of iteration steps. An disadvantages in Table II. appropriate coordinate system can be used to accelerate the convergence. For example, the adaptive coordinate descent C. Derivative-Free Optimization method takes principal component analysis to obtain a new For some optimization problems in practical applications, coordinate system with as little correlation as possible between the derivative of the objective function may not exist or is not the coordinates [163]. The coordinate descent method still easy to calculate. The solution of finding the optimal point, has limitations when performing on the non-smooth objective in this case, is called derivative-free optimization, which is a function, which may fall into a non-stationary point. 16
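To illustrate the coordinate descent update (84) described above, the following is a minimal NumPy sketch. Since the method does not require gradients, the one-dimensional subproblem along each axis is solved here by a crude grid search; the grid, the example objective, and all names are only illustrative.

```python
import numpy as np

def coordinate_descent(loss, theta0, n_epochs=50, grid=np.linspace(-5, 5, 201)):
    """Minimal coordinate descent sketch for Eq. (84).

    Each coordinate is updated in turn by a one-dimensional search over a
    fixed grid, which works even for non-differentiable losses.
    """
    theta = theta0.astype(float)
    for _ in range(n_epochs):
        for j in range(theta.size):              # sweep the axes e_1, ..., e_D
            candidates = np.tile(theta, (grid.size, 1))
            candidates[:, j] = grid
            losses = np.array([loss(c) for c in candidates])
            theta[j] = grid[np.argmin(losses)]   # best value along axis j
    return theta

# Example: a lasso-style objective, non-differentiable at zero.
A = np.array([[1.0, 0.5], [0.2, 2.0], [1.0, 1.0]])
b = np.array([1.0, 0.0, 0.5])
obj = lambda th: 0.5 * np.sum((A @ th - b) ** 2) + 0.1 * np.sum(np.abs(th))
theta_hat = coordinate_descent(obj, np.zeros(2))
```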

TABLE II: Summary of High-Order Optimization Methods

Method Properties Advantages Disadvantages

Conjugate Gradient [127]. Properties: It is an optimization method between the first-order and second-order gradient methods. It constructs a set of conjugated directions using the gradient of known points, and searches along the conjugated direction to find the minimum point of the objective function. Advantages: The CG method only calculates the first-order gradient but has faster convergence than the steepest descent method. Disadvantages: Compared with the first-order gradient method, the calculation of the conjugate gradient is more complex.

Newton's Method [129]. Properties: Newton's method calculates the inverse matrix of the Hessian matrix to obtain faster convergence than the first-order gradient descent method. Advantages: Newton's method uses second-order gradient information, which gives faster convergence than the first-order gradient method; under certain conditions it has quadratic convergence. Disadvantages: It needs long computing time and large storage space to calculate and store the inverse matrix of the Hessian matrix at each iteration.

Quasi-Newton Method [93]. Properties: The quasi-Newton method uses an approximate matrix to approximate the Hessian matrix or its inverse matrix. Popular quasi-Newton methods include DFP, BFGS and LBFGS. Advantages: The quasi-Newton method does not need to calculate the inverse matrix of the Hessian matrix, which reduces the computing time. In general cases, the quasi-Newton method can achieve superlinear convergence. Disadvantages: The quasi-Newton method needs a large storage space, which is not suitable for handling the optimization of large-scale problems.

Stochastic Quasi-Newton Method [143]. Properties: The stochastic quasi-Newton method employs techniques of stochastic optimization. Representative methods are online-LBFGS [124] and SQN [125]. Advantages: The stochastic quasi-Newton method can deal with large-scale machine learning problems. Disadvantages: Compared with the stochastic gradient method, the calculation of the stochastic quasi-Newton method is more complex.

Hessian-Free Method [7]. Properties: The HF method performs a sub-optimization using the conjugate gradient, which avoids the expensive computation of the inverse Hessian matrix. Advantages: The HF method can employ second-order gradient information but does not need to directly calculate Hessian matrices. Thus, it is suitable for high-dimensional optimization. Disadvantages: The cost of computing the matrix-vector product in the HF method increases linearly with the amount of training data, so it does not work well for large-scale problems.

Sub-sampled Hessian-Free Method [147]. Properties: The sub-sampled Hessian-free method uses the stochastic gradient and sub-sampled Hessian-vector products during the update process. Advantages: The sub-sampled HF method can deal with large-scale machine learning optimization problems. Disadvantages: Compared with the stochastic gradient method, the calculation is more complex and needs more computing time in each iteration.

Natural Gradient [148]. Properties: The basic idea of the natural gradient is to construct the gradient descent algorithm in the predictive function space rather than the parametric space. Advantages: The natural gradient uses the Riemannian structure of the parametric space to adjust the update direction, which is more suitable for finding the extremum of the objective function. Disadvantages: In the natural gradient method, the calculation of the Fisher information matrix is complex.
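To make the natural gradient update (77)-(79) summarized above concrete, the following is a minimal NumPy sketch. It uses the empirical Fisher matrix (an outer-product approximation of (78)) plus a small damping term; the function name, the damping, and the data layout are assumptions made for illustration.

```python
import numpy as np

def natural_gradient_step(per_sample_grads, grad, theta, lr=0.1, damping=1e-4):
    """One natural gradient update, Eq. (79), with an empirical Fisher estimate.

    per_sample_grads: array of shape (N, D), one gradient of -log p(y|x, theta)
    per sample; grad is their mean, i.e., the gradient of Eq. (75).
    """
    N = per_sample_grads.shape[0]
    G = per_sample_grads.T @ per_sample_grads / N          # empirical Fisher, cf. Eq. (78)
    G += damping * np.eye(theta.size)                      # damping for numerical stability
    nat_grad = np.linalg.solve(G, grad)                    # G^{-1} grad, Eq. (77)
    return theta - lr * nat_grad                           # update, Eq. (79)
```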

D. Preconditioning in Optimization

Preconditioning is a very important technique in optimization methods. Reasonable preconditioning can reduce the number of iterations of optimization algorithms. For many important iterative methods, the convergence depends largely on the spectral properties of the coefficient matrix [164]. It can be simply considered that preconditioning transforms a difficult linear system Aθ = b into an equivalent system with the same solution but better spectral characteristics.

For example, if M is a nonsingular approximation of the coefficient matrix A, the transformed system

    M^{−1} A θ = M^{−1} b   (85)

will have the same solution as the system Aθ = b, but (85) may be easier to solve, and the spectral properties of the coefficient matrix M^{−1}A may be more favorable.

In most linear systems, e.g., Aθ = b, the matrix A is often complex and makes the system hard to solve. Therefore, some transformation is needed to simplify the system, and M is called the preconditioner. If the matrix obtained after preconditioning is obviously structured, or sparse, it will be beneficial to the calculation [165].

The conjugate gradient algorithm mentioned previously is the most commonly used optimization method with preconditioning technology, which speeds up convergence. The algorithm is shown in Algorithm 7.

E. Public Toolkits for Optimization

Fundamental optimization methods are applied in machine learning problems extensively. There are many integrated powerful toolkits. We summarize the existing common optimization toolkits and present them in Table III.

IV. DEVELOPMENTS AND APPLICATIONS FOR SELECTED MACHINE LEARNING FIELDS

Optimization is one of the cores of machine learning. Many optimization methods are further developed in the face of different machine learning problems and specific application environments. The machine learning fields selected in this

TABLE III: Available Toolkits for Optimization

Toolkit Language Description

CVX [166] Matlab CVX is a matlab-based modeling system for convex optimization but cannot handle large-scale problems. http://cvxr.com/cvx/download/

CVXPY [167] Python CVXPY is a python package developed by Stanford University Convex Optimization Group for solving convex optimization problems. http://www.cvxpy.org/

CVXOPT [168] Python CVXOPT can be used for handling convex optimization. It is developed by Martin Andersen, Joachim Dahl, and Lieven Vandenberghe. http://cvxopt.org/

APM [169] Python APM python is suitable for large-scale optimization and can solve the problems of linear programming, quadratic programming, nonlinear optimization and so on. http://apmonitor.com/wiki/index.php/Main/PythonApp

SPAMS [123] C++ SPAMS is an optimization toolbox for solving various sparse estimation problems, which is developed and maintained by Julien Mairal. Available interfaces include matlab, R, python and C++. http://spams-devel.gforge.inria.fr/

minConf Matlab minConf can be used for optimizing differentiable multivariate functions subject to simple constraints on the parameters. It is a set of matlab functions, in which there are many methods to choose from. https://www.cs.ubc.ca/∼schmidtm/Software/minConf.html

tf.train.optimizer [170] Python; C++; CUDA The basic optimization class, which is usually not called directly and its subclasses are often used. It includes classic optimization algorithms such as gradient descent and AdaGrad. https://www.tensorflow.org/api guides/python/train
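As an illustration of how such toolkits are typically driven, the following is a small CVXPY example that solves an L1-regularized least-squares problem. It only sketches the general workflow (declare variables, build a convex objective, call solve); the data and the penalty weight are arbitrary.

```python
import cvxpy as cp
import numpy as np

# Least-squares with an L1 penalty, a typical convex problem for such toolkits.
np.random.seed(0)
A, b = np.random.randn(30, 10), np.random.randn(30)

x = cp.Variable(10)
objective = cp.Minimize(cp.sum_squares(A @ x - b) + 0.1 * cp.norm1(x))
problem = cp.Problem(objective)
problem.solve()                        # a suitable solver is chosen automatically

print(problem.status, problem.value)   # e.g., "optimal" and the objective value
```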

section mainly include deep neural networks, reinforcement learning, variational inference and Markov chain Monte Carlo. Algorithm 7 Preconditioned Conjugate Gradient Method [93]

Input: A, θ0, M, b A. Optimization in Deep Neural Networks ∗ Output: The solution θ The deep neural network (DNN) is a hot topic in the f0 = f(θ0) machine learning community in recent years. There are many g = f(θ )= Aθ b 0 ∇ 0 0 − optimization methods for DNNs. In this section, we introduce y0 is the solution of My = g0 them from two aspects, i.e., first-order optimization methods d = g 0 − 0 and high-order optimization methods. t = t 1) First-Order Gradient Method in Deep Neural Networks: while gt =0 do ⊤ The stochastic gradient optimization method and its adaptive 6 gt yt ηt = ⊤ variants have been widely used in DNNs and have achieved dt Adt θt+1 = θt + ηtdt good performance. SGD introduces the learning rate decay gt+1 = gt + ηtAdt factor and AdaGrad accumulates all previous gradients so yt+1 =solution of My = gt that their learning rates are continuously decreased and ⊤ g y t+1 t+1 converge to zero. However, the learning rates of these βt+1 = g⊤d t t two methods make the update slow in the later stage of dt+1 = yt+1 + βt+1dt t = t +1− optimization. AdaDelta, RMSProp, Adam and other methods end while use the exponential averaging to provide effective updates and simplify the calculation. These methods use exponential moving average to alleviate the problems caused by the rapid decay of the learning rate but limit the current 18 learning rate to only relying on a few gradients [34]. The movement of SGD can be decomposed into the learning Reddi et al. used a simple convex optimization example to rates along Adam’s direction and its orthogonal direction. demonstrate that the RMSProp and Adam algorithms could If SGD is going to finish the trajectory but Adam has not not converge [34]. Almost all the algorithms that rely on finished due to the momentum after selecting the optimization a fixed-size window of the past gradients will suffer from direction, walking along Adam’s direction is a good choice for this problem, including AdaDelta and Nesterov-accelerated SWATS. At the same time, SWATS also adjusts its optimized adaptive moment estimation (Nadam) [171]. trajectory by moving in the orthogonal direction. Let It is better to rely on the long-term memory of past P roj dSGD = dAdam, (89) gradients rather than the exponential moving average of Adam t t gradients to ensure convergence. A new version of Adam [34], and derive solution called AmsGrad, uses a simple correction method to ensure (dAdam)TdAdam the convergence of the model while preserving the original ηSGD = t t , (90) t (dAdam)Tg computational performance and advantages. Compared with t t the Adam method, the AmsGrad makes the following changes where P rojAdam means the projection in the direction of to the first-order moment estimation and the second-order Adam. To reduce noise, a moving average can be used to moment estimation: correct the estimate of the learning rate,

SGD SGD SGD mt = β1tmt−1 + (1 β1t)gt, λ = β λ + (1 β )η , (91) − t 2 t−1 − 2 t 2  Vt = β2Vt−1 + (1 β2)gt , (86)  − SGD λSGD  q ˜ t Vˆt = max(Vˆt−1, Vt), λt = , (92) 1 β2  − where β1t is a non-constant which decreases with time, and SGD SGD  where λt is the first moment of learning rate η , and β2 is a constant learning rate. The correction is operated in ˜ SGD ˆ ˆ λt is the learning rate of SGD after converting [38]. For the second-order moment Vt, making Vt monotonous. Vt is SGD SGD switch point, a simple guideline λ˜t λ <ǫ is often substantially used in the iteration of the target function. The | − t | AmsGrad method takes the long-term memory of past gradi- used [38]. Although there is no rigorous mathematical proof ents based on the Adam method, guarantees the convergence for selecting this conversion criterion, it performs well across in the later stage, and works well in applications. a variety of applications. For the mathematical proof of switch Further, adjusting parameters β ,β at the same time helps point, further research can be conducted. Although the SWATS 1 2 is based on Adam, this switching method is also applicable to to converge to a certain extent. For example, β1 can decay β1 other adaptive methods, such as AdaGrad and RMSProp. The modestly as β1t = t , β1t β1, for all t [T ]. β2 can be 1 ≤ ∈ procedure is insensitive to hyper-parameters and can obtain an set as β2t =1 t , for all t [T ], as in AdamNC algorithm [34]. − ∈ optimal solution comparable to SGD, but with a faster training Another idea of combining SGD and Adam was proposed speed in the case of deep networks. for solving the non-convergence problem of adaptive gradient Recently some researchers are trying to explain and improve algorithm [38]. Adaptive algorithms, such as Adam, converge the adaptive methods [172], [173]. Their strategies can also be fast and are suitable for processing sparse data. SGD with combined with the above switching techniques to enhance the momentum can converge to more accurate results. The performance of the algorithm. combination of SGD and Adam develops the advantages of General fully connected neural networks cannot process both methods. Specifically, it first trains with Adam to quickly sequential data such as text and audio. Recurrent neural drop and then switches to SGD for precise optimization based network (RNN) is a kind of neural networks that is more on the previous parameters at an appropriate switch point. The suitable for processing sequential data. It was generally strategy is named as switching from Adam to SGD (SWATS) considered that the use of first-order methods to optimize RNN [38]. There are two core problems in SWATS. One is when was not effective, because the SGD and its variant methods to switch from Adam to SGD, the other is how to adjust the were difficult to learn long-term dependencies in sequence learning rate after switching the optimization algorithm. The problems [99], [104], [174]. SWATS approach is described in detail below. In recent years, a well-designed method of random param- eter initialization scheme using only SGD with momentum The movement dAdam of the parameter at iteration t of the without curvature information has achieved good results in Adam is Adam training RNNs [99]. In [104], [175], some techniques for Adam η dt = mt, (87) improving the optimization in training RNNs are summarized Vt such as the momentum methods and NAG. The first-order Adam where η is the learning rate of Adam [38]. 
The movement optimization methods have got development for training SGD d of the parameter at iteration t of the SGD is RNNs, but they still face the problem of slow convergence in SGD SGD deep RNNs. The high-order optimization methods employing d = η gt, (88) t curvature information can accelerate the convergence near SGD where η is the learning rate of SGD and gt is the gradient the optimal value and is considered to be more effective in of the current position [38]. optimizing DNNs. 19
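Before turning to high-order methods, the adaptive first-order updates above can be summarized in code. The following is a minimal NumPy sketch of the AmsGrad correction in (86); the class name and hyper-parameter values are illustrative. A SWATS-style schedule would run such an adaptive optimizer first and, once the estimated SGD learning rate stabilizes as in (91)-(92), switch to plain SGD.

```python
import numpy as np

class AmsGrad:
    """Minimal AmsGrad sketch following Eq. (86): like Adam, but the second
    moment used in the update is the running maximum v_hat, which never
    decreases and thus keeps the effective learning rate non-increasing."""

    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(dim)       # first moment estimate
        self.v = np.zeros(dim)       # second moment estimate
        self.v_hat = np.zeros(dim)   # elementwise max of all past v

    def step(self, theta, grad):
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        self.v_hat = np.maximum(self.v_hat, self.v)
        return theta - self.lr * self.m / (np.sqrt(self.v_hat) + self.eps)
```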

2) High-Order Gradient Method in Deep Neural Networks: some layers of hidden units in neural networks (like RNNs). We have described the first-order optimization method applied A structural damping can be defined as in DNNs. As most DNNs use large-scale data, different 1 versions of stochastic gradient methods were developed and R(θ)= D(e(x, θ),e(x, θt)), (95) S have got excellent performance and properties. For making | | (x,yX)∈S full use of gradient information, the second-order method where D is a distance function or a loss function. It can prevent is gradually applied to DNNs. In this section, we mainly a large change in e(x, θ) by penalizing the distance between introduce the Hessian-free method in DNN. e(x, θ) and e(x, θt). Then, the damped local objective can be Hessian-free (HF) method has been studied for a long time written as in the field of optimization, but it is not directly suitable for ′ λ ⊤ dealing with neural networks [7]. As the objective function Qθ(d) = Qθ(d)+ µR(d + θt)+ d d, (96) 2 in DNN is not convex, the exact Hessian matrix may not be where µ and λ are two parameters to be dynamically adjusted. positive definite. Therefore, some modifications need to be d is the direction at the tth iteration. More details of the made so that the HF method can be applied to neural networks structural damping can refer to [176]. [176]. Besides, there are many second-order optimization methods The Generalized Gauss-Newton Matrix One solution is to employed in RNNs. For example, quasi-Newton based opti- use the generalized Gauss-Newton (GGN) matrix, which can mization and L-BFGS were proposed to train RNNs [179], be seen as an approximation of a Hessian matrix [177]. The [180]. GGN matrix is a provably positive semidefinite matrix, which In order to make the damping method based on punishment avoids the trouble of negative curvature. There are at least two work better, the damping parameters can be adjusted continu- ways to derive the GGN matrix [176]. Both of them require ously. A Levenberg-Marquardt style heuristic method was used that f(θ) can be expressed as a composition of two functions to adjust λ directly [7]. The Levenberg-Marquardt heuristic is written as f(θ)= Q(F (θ)) where f(θ) is the object function described as follows: and Q is convex. The GGN matrix G takes the following form, 1 3 1) If γ < 4 λ then λ 2 λ, 2) If γ > 3 λ then λ ← 2 λ, G = J ⊤Q′′J, (93) 4 ← 3 where γ is a “reduction rate” with the following form,

f(θt + dt) f(θt ) where J is the Jacobian of F . γ = −1 − −1 . (97) Damping Methods Another modification to the HF method Mt−1(dt) is to use different damping methods. For example, Tikhonov Sub-sampling As sub-sampling Hessian can be used to damping, one of the most famous damping methods, is handle large-scale data, several variations of the sub-sampling implemented by introducing a quadratic penalty term into the methods were proposed [8], [9], [10], which used either λ ⊤ quadratic model. A quadratic penalty term 2 d d is added to stochastic gradients or exact gradients. These approaches use B 2 f θ as a Hessian approximation, where S is a the quadratic model, t = St ( t) t subset∇ of samples. We need to compute the Hessian-vector product in some optimization methods. If we adopt the sub-

λ ⊤ ⊤ 1 ⊤ sampling method, it also means that we can save a lot of Q(θ) := Q(θ)+ d d = f(θt)+ f(θt) d+ d Bd, (94) 2 ∇ 2 computation in each iteration, such as the method proposed in [7]. where B = H + λI, and λ > 0 determines the “strength” Preconditioning Preconditioning can be used to simplify of the damping which is a scalar parameter. Thus, Bv is the optimization problems. For example, preconditioning can formulated as Bv = (H + λI)v = Hv + λv. However, accelerate the CG method. It is found that diagonal matrices the basic Tikhonov damping method is not good in training are particularly effective and one can use the following RNNs [178]. Due to the complex structure of RNNs, the local preconditioner [7]: quadratic approximation in certain directions in the parameter N space, even at very small distances, maybe highly imprecise. α M = diag( fi(θ) fi(θ)) + λI , (98) The Tikhonov damping method can only compensate for ∇ ⊙ ∇ i=1 this by increasing punishment in all directions because  X  where denotes the element-wise product and the exponent the method lacks a selective mechanism [176]. Therefore, α is chosen⊙ to be less than 1. the structural damping was proposed, which makes the performance substantially better and more robust. The HF method with structural damping can effectively train B. Optimization in Reinforcement Learning RNNs [176]. Now we briefly introduce the HF method with Reinforcement learning (RL) is an important research structural damping. Let e(x, θ) mean the vector-value function field of machine learning and is also one of the most of θ which can be interpreted as intermediate quantities during popular topics. Agents using deep reinforcement learning have the calculation of f(x, θ), where f(x, θ) is the object function. achieved great success in learning complex behavior skills and For instance, e(x, θ) might contain the of solved challenging control tasks in high-dimensional primitive 20 perceptual state space [181], [182], [183]. It interacts with the The value function of the current state s can be calculated by environment through the trial-and-error mechanism and learns the value function of the next state s′. The Bellman equations optimal strategies by maximizing cumulative rewards [39]. of Vπ(s) and Qπ(s,a) describe the relation by

We describe several concepts of reinforcement learning as ′ ′ Vπ(s)= π(a s) p(s , r s,a)[r(s,a,s ) follows: | | a s′,r 1) Agent: making different actions according to the state X ′ X + γVπ(s )], (102) of the external environment, and adjusting the strategy ′ ′ Qπ(s,a)= p(s , r s,a)[r(s,a,s ) according to the reward of the external environment. | s′,r 2) Environment: all things outside the agent that will be X ′ ′ ′ ′ affected by the action of the agent. It can change the + γ π(a s )Qπ(s ,a )]. (103) | state and provide the reward to the agent. a′ 3) State s: a description of the environment. X There are many reinforcement learning methods based 4) Action a: a description of the behavior of the agent. on value function. They are called value-based methods, 5) Reward r (s ,a ,s ): the timely return value at t t−1 t−1 t which play a significant role in RL. For example, Q-learning time t. [184] and SARSA [185] are two popular methods which use 6) Policy π a s : a function that the agent decides the ( ) temporal difference algorithms. The policy-based approach action a according| to the current state s. is to optimize the policy π (a s) directly and update the 7) State transition probability p(s′ s,a): the probability θ parameters θ by gradient descent| [186]. distribution that the environment will| transfer to state s′ The actor-critic algorithm is a reinforcement learning at the next moment, after the agent selecting an action method combining policy gradient and temporal differential a based on the current state s. learning, which learns both a policy and a state value function. 8) p(s′, r s,a): the probability that the agent transforms to It estimates the parameters of two structures simultaneously. state s|′ and obtains the reward r, where the agent is in state s and selecting the action a. 1) The actor is a policy function, which is to learn a policy πθ(a s) to obtain the highest possible return. Many reinforcement learning problems can be described | 2) The critic refers to the learned value function Vφ(s), by Markov decision process (MDP) < S,A,P,γ,r > [39], which estimates the value function of the current policy, in which S is state space, A is action space, P is state that is to evaluate the quality of the actor. transition probability function, r is reward function and γ In the actor-critic method, the critic solves a problem of is the discount factor 0 <γ < 1. At each time, the agent prediction, while the actor pays attention to the control [187]. accepts a state and selects the action from an action set There is more information of actor-critic method in [88], [187] according to the policy. The agent receives feedback from the The summary of the value-based method, the policy-based environment and then moves to the next state. The goal of method, and the actor-critic method are as follows: reinforcement learning is to find a strategy that allows us to get the maximum γ-discounted cumulative reward. The discounted 1) The value-based method: It needs to calculate value return is calculated by function, and usually gets a definite policy. 2) The policy-based method: It optimizes the policy π ∞ without selecting an action according to value function. k Gt = γ rt+k. (99) 3) The actor-critic method: It combines the above two k=0 methods, and learns both the policy π and the state value X function. People do not necessarily know the MDP behind the prob- Deep reinforcement learning (DRL) combines reinforce- lem. 
From this point, reinforcement learning is divided into ment learning and deep learning, which defines problems and two categories. One is model-based reinforcement learning optimizes goals in the framework of RL, and solves problems which knows the MDP of the whole model (including the such as state representation and strategy representation using transition probability P and reward function r), and the other deep learning techniques. is the model-free method in which the MDP is unknown. DRL has achieved great success in many challenging control Systematic exploration is required in the latter methods. tasks and uses DNNs to represent the control policy. For The most commonly used value function is the state value neural network training, a simple stochastic gradient algorithm function, or other first-order algorithms are usually chosen, but these algorithms are not efficient in exploring the weight space, Vπ(s)= Eπ[Gt St = s], (100) | which makes DRL methods often take several days to train [60]. So, a distributed method was proposed to solve this which is the expected return of executing policy π from state problem, in which parallel actor-learners have a stabilizing s. The state-action value function is also essential which is the effect during training [182]. It executes multiple agents to expected return for selecting action a under state s and policy interact with the environment simultaneously, which reduces π, the training time. But this method ignores the sampling Qπ(s,a)= Eπ[Gt St = s, At = a]. (101) efficiency. A scalable and sample-efficient natural gradient | 21

algorithm was proposed, which uses a Kronecker-factored where θt is the model parameter at the iteration t, and N is approximation method to compute the natural policy gradient the meta-optimizer with parameter φ that learns how to predict update, and employ the update to the actor and the critic the gradient. After training, the meta-optimizer N and its (ACKTR) [60]. parameter φ are updated according to the loss value in the test samples. The experiments have confirmed that learning neural optimizers is advantageous compared to the most advanced C. Optimization in Meta Learning adaptive stochastic gradient optimization methods used in deep Meta learning [45], [46] is a popular research direction learning [55]. Due to the similarity between the gradient in the field of machine learning. It solves the problem of update in and the cell state update in the learning to learn. In the past cognition, the research of machine long short-term memory (LSTM), LSTM is often used as the learning is to obtain a large amount of data in a specific task meta-optimizer [55], [56]. firstly and then use the data to train the model. In machine A model-agnostic meta learning algorithm (MAML) is learning, adequate training data is the guarantee of achieving another method for meta learning which was proposed to learn good performance. However, human beings can well process the parameters of any model subjected to gradient descent new tasks with only a few training samples, which are much methods. It is applicable to different learning problems, more efficient than traditional machine learning methods. The including classification, regression and reinforcement learning key point could be that the human brain has learned “how to [47]. The basic idea of the model-agnostic algorithm is to learn” and can make full use of past knowledge and experience begin multiple tasks at the same time, and then get the to guide the learning of new tasks. Therefore, how to make synthetic gradient direction of different tasks, so as to learn a machines have the ability to learn efficiently like human beings common base model. The main process can be described as has become a frontier issue in machine learning. follows: in the meta-train step, multiple tasks batch τi, which The goal of meta learning is to design a model that can train test contains (Di ,Di ), are extracted from the total task set training well in the new tasks using as few samples as possible ′ . For all τi, train and update the parameter θi with the train without overfitting. The process of adapting to the new tasks T train samples Di : is essentially a learning process in the meta-testing, but only ′ ∂J (θ) with limited samples from new tasks. The application of meta θ = θ α τi , (105) learning methods in supervised learning can solve the few-shot i − ∂(θ) learning problems [47]. where α is the learning rate of training process and Jτi (θ) is As few-shot learning problems receive more and more train the loss function in task i with training samples Di . After attention, meta learning is also developing rapidly. In general, the training step, use the synthetic gradient direction of these ′ meta learning methods can be summarized into the following test parameters θi on the test samples Di of the respective task three types [48]: metric-based methods [49], [50], [51], to update parameter θ: [52], model-based methods [53], [54] and optimization-based ′ methods [55], [56], [47]. 
In this subsection, we focus on the ∂ Jτi (θi) θ θ β τi∼p(T ) , (106) optimization-based meta learning methods. In meta learning, = − P ∂(θ) there are usually some tasks with sufficient training samples where β is the meta learning rate of the test process and J (θ) and a new task with only a few training samples. The main τi is the loss function in task i with test samples Dtest. The meta- idea can be described as follows: in the meta-train step, i train step is repeated multiple times to optimize a good initial sample a task τ from the total task set , which contains parameter θ. In the meta-test step, the trained parameter θ is (Dtrain,Dtest). For task τ, train and updateT the optimizer τ τ used as the initial parameter such that the model has a maximal parameter θ with the training samples Dtrain, update the τ performance on the new task. MAML does not introduce meta-optimizer parameter φ with the test samples Dtest. τ additional parameters for meta learning, nor does it require a The process of sampling tasks and updating parameters are specific learner architecture. The development of the method is repeated multiple times. In the meta-test step, the trained meta- of great significance to the optimization-based meta learning optimizer is used for learning a new task. methods. Recently, an expanded task-agnostic meta learning Since the purpose of meta learning is to achieve fast algorithm is proposed to enhance the generalization of meta- learning, a key point is to make the gradient descent learner towards a variety of tasks, which achieves outstanding more accurately in the optimization. In some meta learning performance on few-shot classification and reinforcement methods, the optimization process itself can be regarded as a learning tasks [189]. learning problem to learn the prediction gradient rather than a determined gradient descent algorithm [188]. Neural networks with original gradient as input and prediction gradient as D. Optimization in Variational Inference output is often used as a meta-optimizer [55]. The neural work In the machine learning community, there are many at- is trained using the training and test samples from other tasks tractive probabilistic models but with complex structures and and used in the new task. The parameter update in the process intractable posteriors, and thus some approximate methods of training is as follows: are used, such as variational inference and Markov chain Monte Carlo (MCMC) sampling. Variational inference, a θt+1 = θt + N(g(θt), φ), (104) common technique in machine learning, is widely used to 22 approximate the posterior density of the Bayesian model, Then the CAVI algorithm can be given below in Algorithm 8. which transforms intricate inference problems into high- dimensional optimization problems [190], [191]. Compared with MCMC, the variational inference is faster and more Algorithm 8 Coordinate Ascent Variational Inference [192] suitable for dealing with large-scale data. Variational inference Input: p(X,Z), X M has been applied to large-scale machine learning tasks, Output: q(Z)= i=1 qi(zi) such as large-scale document analysis, and Initialize Variational factors qi(zi) computational neuroscience [192]. repeat Q Variational inference often defines a flexible family of for i=1,2,3....,M do ∗ E distributions indexed by free parameters on latent variables qi exp −i[log p(zi,Z−i,X)] [190], and then finds the variational parameters by solving an end for∝ { } optimization problem. 
Compute ELBO(q): Now let us review the principle of variational inference ELBO(q)= E[log p(Z,X)] E log q(Z) [58]. Variational inference approximates the true posterior by − attempting to minimize the Kullback-Leibler (KL) divergence until ELBO converges between a potential factorized distribution and the true posterior. In traditional coordinate ascension algorithms, the efficiency Let Z = z represent the set of all latent variables and i of processing large data is very low, because each iteration parameters in{ the} model and X = x be a set of all i needs to compute all the data, which is very time-consuming. observed data. The joint likelihood of X{ and} Z is p(Z,X)= Modern machine learning models often need to analyze and p(Z)p(X Z). In Bayesian models, the posterior distribution process large-scale data, which is difficult and costly. Stochas- p(Z X) should| be computed to make further inference. tic optimization enables machine learning to be extended What| we need to do is to approximate p(Z X) with the on massive data [193]. This reminds us of an attractive distribution q(Z) that belongs to a constrained family| of dis- technique to handle large data sets: stochastic optimization tributions. The goal is to make the two distributions as similar [97], [192], [194]. By introducing stochastic optimization into as possible. Variational inference chooses KL divergence to variational inference, the stochastic variational inference (SVI) measure the difference between the two distributions, that is was proposed [58], in which the exponential family is taken to minimize the KL divergence of q(Z) and p(Z X). Here is as a typical example. the formula for the KL divergence between q and| p: Gaussian process (GP) is an important machine learning q(Z) method based on statistical learning and Bayesian theory. It KL[q(Z) p(Z X)] = Eq log || | p(Z X) is suitable for complex regression problems such as high  |  = Eq[log q(Z)] Eq[log p(Z X)] dimensions, small samples, and nonlinearities. GP has the − | advantages of strong generalization ability, flexible non- = Eq[log q(Z)] Eq[log p(Z,X)] + log p(X) − parametric inference, and strong interpretability. However, = ELBO(q)+ const, (107) − the complexity and storage requirements of accurate solution where log p(X) is replaced by a constant because we are for GP are high, which hinders the development of GP only interested in q. With the above formula, we can know under large-scale data. The stochastic variational inference KL divergence is difficult to optimize because it requires method introduced in this section can popularize variational knowing the distribution that we are trying to approximate. An inference on large-scale datasets, but it can only be applied to alternative method is to maximize the evidence lower bound probabilistic models with factorized structures. For GPs whose (ELBO), a lower bound on the logarithm of the marginal observations are correlated with each other, the stochastic probability of the observations. We can obtain ELBO’s variational inference can be adapted by introducing the formula as global inducing variables as variational variables [195], [196]. Specifically, the observations are assumed to be conditionally ELBO(q)= E [log p(Z,X)] E [log q(Z)] . (108) − independent given the inducing variables and the variational Variational inference can be treated as an optimization distribution for the inducing variables is assumed to have an problem with the goal of minimizing the evidence lower explicit form. 
Thus, the resulting GP model can be factorized bound. A direct method is to solve this optimization problem in a necessary manner, enabling the stochastic variational using the coordinate ascent, which is called coordinate ascent inference. This method can also be easily extended to models variational inference (CAVI). CAVI iteratively optimizes each with non-Gaussian likelihood or latent variable models based factor of the mean-field variational density, while holding the on GPs. others fixed [192]. Specifically, variational distribution q has the structure of the E. Optimization in Markov Chain Monte Carlo mean-field, i.e., q(Z)= M q (z ). With this assumption, we i=1 i i Markov chain Monte Carlo (MCMC) is a class of sampling can bring the distribution q into the ELBO, by some derivation algorithms to simulate complex distributions that are difficult according to [57], and obtainQ the following formula: to sample directly. It is a practical tool for Bayesian posterior ∗ q exp E i[log p(zi,Z i,X)] . (109) inference. The traditional and common MCMC algorithms i ∝ { − − } 23 include Gibbs sampling, slice sampling, Hamiltonian Monte be done after the leapfrog step. These MH steps require Carlo (HMC) [197], [198], Reimann manifold variants [199], expensive calculations overall data in each iteration. Beyond and so on. These sampling methods are limited by the that, there is an incorrect stationary distribution [200] in computational cost and are difficult to extend to large-scale the stochastic gradient variant of HMC. Thus, Hamiltonian data.This section takes HMC as an example to introduce the dynamic was further modified, which minimizes the effect of optimization in MCMC. The bottleneck of the HMC is that the additional noise, achieves the invariant distribution and the gradient calculation is costly on large data sets. eliminates MH steps [61]. Specifically, a friction term is added We first introduce the derivation of HMC. Consider the to the dynamical process of momentum update: θ, which can be sampled from the posterior distribution, dθ = M −1rdt, p(θ D) exp( U(θ)), (110) dr = U(θ)dt BM −1rdt + (0, 2B(θ)dt). | ∝ −  −∇ − N (118) where D is the set of observations, and U is the potential The introduced friction term is helpful for decreasing total energy function with the following formula: energy H(θ, r) and weakening the effects of noise in the momentum update phase. The is also the U(θ)= log p(θ D)= log p(x θ) log p(θ). (111) − | − | − type of second-order Langevin dynamics with friction in x∈D X physics, which can explore efficiently and counteract the In HMC [197], an independent auxiliary momentum variable effect of the noisy gradients [61] and thus no MH correction r is introduced from Hamiltonian dynamic. The Hamiltonian is required. This second-order Langevin dynamic MCMC function and the joint distribution of θ and r are described by method, called SGHMC, is used to deal with sampling 1 problems on large data sets [61], [201]. H(θ, r)= U(θ)+ rT M −1r = U(θ)+ K(r), (112) 2 Moreover, HMC is highly sensitive to hyper-parameters, such as the path length (step number) L and the step size 1 p(θ, r) exp( U(θ) rT M −1r), (113) ǫ. If the hyper-parameters are not set properly, the efficiency ∝ − − 2 of the HMC will drop dramatically. There are some methods where M denotes the mass matrix, and K(r) is the kinetic to optimize these two hyper-parameters instead of manually energy function. The process of HMC sampling is derived by setting them. 
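As a concrete reference for the Hamiltonian dynamics (114) and the leapfrog discretization (115) discussed above, the following is a minimal NumPy sketch of one HMC transition with an identity mass matrix and a Metropolis-Hastings correction. All names and default values are illustrative; SGHMC would instead use a mini-batch estimate of the gradient, add a friction term as in (118), and drop the correction step.

```python
import numpy as np

def hmc_step(U, grad_U, theta, eps=0.05, L=20, rng=None):
    """One HMC transition for the potential energy U of Eq. (111)."""
    if rng is None:
        rng = np.random.default_rng()
    r = rng.standard_normal(theta.size)          # resample momentum
    theta_new, r_new = theta.copy(), r.copy()
    r_new -= 0.5 * eps * grad_U(theta_new)       # half step for momentum
    for _ in range(L):                           # leapfrog steps, Eq. (115)
        theta_new += eps * r_new                 # full step for position
        r_new -= eps * grad_U(theta_new)         # full step for momentum
    r_new += 0.5 * eps * grad_U(theta_new)       # undo the extra half step
    # Metropolis-Hastings correction on the Hamiltonian H = U + K, Eq. (112).
    H_old = U(theta) + 0.5 * r @ r
    H_new = U(theta_new) + 0.5 * r_new @ r_new
    if np.log(rng.uniform()) < H_old - H_new:
        return theta_new                         # accept the proposal
    return theta                                 # reject and keep the old state
```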
In the case of large datasets, the gradient of U(θ) needs to be calculated on the entire data set in each leapfrog iteration. In order to improve the efficiency, the stochastic gradient method was used to calculate ∇U(θ) with a mini-batch D̃ sampled uniformly from D, which reduces the cost of calculation [61]. However, the gradient calculated on a mini-batch instead of the full dataset introduces noise. According to the central limit theorem, this noisy gradient can be approximated as

∇Ũ(θ) ≈ ∇U(θ) + N(0, V(θ)),                                            (116)

where the gradient noise obeys a normal distribution whose covariance is V(θ). If we replace ∇U(θ) by ∇Ũ(θ) directly, the Hamiltonian dynamics are changed to

dθ = M^{-1} r dt,
dr = −∇U(θ) dt + N(0, 2B(θ) dt),                                       (117)

where B(θ) = (1/2) ε V(θ) is the diffusion matrix [61]. Since the discretization of the dynamical system introduces noise, a Metropolis-Hastings (MH) correction step should be performed after the leapfrog steps. These MH steps require expensive calculations over all the data in each iteration. Beyond that, the stochastic gradient variant of HMC has an incorrect stationary distribution [200]. Thus, the Hamiltonian dynamics were further modified so as to minimize the effect of the additional noise, achieve the invariant distribution, and eliminate the MH steps [61]. Specifically, a friction term is added to the dynamical process of the momentum update:

dθ = M^{-1} r dt,
dr = −∇U(θ) dt − B M^{-1} r dt + N(0, 2B(θ) dt).                       (118)

The introduced friction term helps to decrease the total energy H(θ, r) and weaken the effect of the noise in the momentum update phase. This dynamic is also a type of second-order Langevin dynamics with friction in physics, which can explore efficiently and counteract the effect of the noisy gradients [61], and thus no MH correction is required. This second-order Langevin dynamic MCMC method, called SGHMC, is used to deal with sampling problems on large data sets [61], [201].
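The following is a minimal sketch (not taken from [61]) of the discretized update corresponding to Eq. (118), with M = I, a scalar friction C, and the noise estimate set to zero; the synthetic Gaussian data, prior, batch size, and hyper-parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data x_i ~ N(mu_true, 1) and a N(0, 10^2) prior on the mean theta.
N, mu_true = 1000, 2.0
data = rng.normal(mu_true, 1.0, size=N)

def grad_U_minibatch(theta, batch_size=32):
    """Stochastic estimate of the gradient of the potential energy U(theta), Eq. (116)."""
    batch = rng.choice(data, size=batch_size, replace=False)
    grad_loglik = (N / batch_size) * np.sum(theta - batch)   # from -log p(x | theta), rescaled
    grad_logprior = theta / 100.0                             # from -log p(theta)
    return grad_loglik + grad_logprior

eps, C = 1e-4, 10.0       # step size and friction (the noise estimate is taken as 0 here)
theta, r = 0.0, 0.0
samples = []
for t in range(20000):
    r = (r
         - eps * grad_U_minibatch(theta)         # noisy gradient of U
         - eps * C * r                           # friction term from Eq. (118)
         + np.sqrt(2 * C * eps) * rng.normal())  # injected Gaussian noise
    theta = theta + eps * r
    if t > 5000:                                 # discard a burn-in phase
        samples.append(theta)

print("posterior mean estimate:", np.mean(samples), "  data mean:", data.mean())
```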

Moreover, HMC is highly sensitive to hyper-parameters, such as the path length (number of leapfrog steps) L and the step size ε. If these hyper-parameters are not set properly, the efficiency of HMC drops dramatically. There are some methods to optimize these two hyper-parameters instead of setting them manually.

1) Path Length L: The value of the path length L has a great influence on the performance of HMC. If L is too small, the distance between the resulting sample points will be very small; if L is too large, the resulting sample points will loop back, resulting in wasted computation. In general, manually setting L cannot maximize the sampling efficiency of HMC. Hoffman and Gelman [202] proposed an extension of the HMC method called the No-U-Turn Sampler (NUTS), which uses a recursive algorithm to generate a set of possible independent samples efficiently and stops the simulation automatically when the trajectory starts to backtrack. There is no need to set the path-length parameter L manually. In models with multiple discrete variables, the ability of NUTS to select the trajectory length automatically allows it to generate more valid samples and perform more efficiently than the original HMC.

2) Adaptive Step Size ε: The performance of HMC is highly sensitive to the step size ε in the leapfrog integrator. If ε is too small, the updates will be slow and the computational cost will be high; if ε is too large, the rejection rate will be high, resulting in useless updates. To set ε reasonably and adaptively, a vanishing adaptation of the dual averaging algorithm can be used in HMC [203], [204]. Specifically, a statistic H_t = δ − α_t is adopted in the dual averaging method, where δ is the desired average acceptance probability and α_t is the current Metropolis-Hastings acceptance probability at iteration t. The expectation h(ε) of the statistic H_t is defined as

h(ε) ≡ E_t[H_t | ε_t] ≡ lim_{T→∞} (1/T) Σ_{t=1}^{T} E[H_t | ε_t],       (119)

where ε_t is the step size for iteration t in the leapfrog integrator. To satisfy h(ε) = 0, we can derive the update formula of ε, i.e., ε_{t+1} = ε_t − η_t H_t, where η_t is a vanishing adaptation rate. Tuning ε by this vanishing adaptation algorithm guarantees that the average Metropolis acceptance probability converges to a fixed value.

The hyper-parameters in HMC include not only the step size ε and the number of iteration steps L, but also the mass matrix M, etc. Optimizing these hyper-parameters can help improve the sampling performance [199], [205], [206]. It is convenient and efficient to tune the hyper-parameters of MCMC automatically, without cumbersome manual adjustments based on the data and variables. These adaptive tuning methods can be applied to other MCMC algorithms to improve the performance of the samplers.

In addition to the second-order SGHMC, stochastic gradient Langevin dynamics (SGLD) [207] is a first-order Langevin dynamic technique combined with stochastic optimization. Efficient variants of both SGLD and SGHMC are still an active research topic [201], [208].
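To make the vanishing adaptation update ε_{t+1} = ε_t − η_t H_t above concrete, here is a minimal sketch (not taken from [203], [204]) that replaces a real HMC chain with a synthetic, monotonically decreasing acceptance-probability curve α(ε); the curve, the target δ, and the learning-rate schedule are illustrative assumptions.

```python
import numpy as np

def acceptance_prob(eps):
    """Stand-in for the average Metropolis acceptance rate of an HMC chain at step size eps."""
    return float(np.clip(np.exp(-5.0 * eps), 0.0, 1.0))

delta = 0.65            # desired average acceptance probability
eps = 1.0               # deliberately poor initial step size
for t in range(1, 2001):
    alpha_t = acceptance_prob(eps)       # in a real sampler this comes from the HMC transition
    H_t = delta - alpha_t                # statistic H_t = delta - alpha_t
    eta_t = 1.0 / t                      # vanishing adaptation rate
    eps = max(eps - eta_t * H_t, 1e-4)   # update eps_{t+1} = eps_t - eta_t * H_t, kept positive

print(f"adapted step size: {eps:.4f}, acceptance at that step size: {acceptance_prob(eps):.3f}")
```

The adapted step size settles where α(ε) ≈ δ, which is the behavior the dual averaging scheme enforces on a real chain.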
V. CHALLENGES AND OPEN PROBLEMS

With the rise of practical demands and the increase of the complexity of machine learning models, the optimization methods in machine learning still face challenges. In this part, we discuss open problems and challenges for some optimization methods in machine learning, which may offer suggestions or ideas for future research and promote the wider application of optimization methods in machine learning.

A. Challenges in Deep Neural Networks

There are still many challenges in optimizing DNNs. Here we mainly discuss two of them, with respect to the data and the model, respectively: one is insufficient data in training, and the other is the non-convex objective in DNNs.

1) Insufficient Data in Training Deep Neural Networks: In general, deep learning is based on big data sets and complex models. It requires a large number of training samples to achieve good training effects. But in some particular fields, finding a sufficient amount of training data is difficult. If we do not have enough data to estimate the parameters in the neural networks, it may lead to high variance and overfitting.

There are some techniques in neural networks that can be used to reduce the variance. Adding L2 regularization to the objective is a natural method to reduce the model complexity. Recently, a common method is dropout [62]. In the training process, each neuron is allowed to stop working with a probability p, which can prevent the co-adaptation of certain neurons. M sub-networks can thus be sampled, analogously to bagging with repeated sampling with replacement [209]. The expected output is calculated as

E_o = E_M[f(x; θ, M)] = Σ_{i=1}^{M} p(M_i) f(x; θ, M_i),                (120)

where p(M_i) is the probability of the i-th sub-network. Dropout can prevent overfitting and improve the generalization ability of the network, but its disadvantage is that it increases the training time, as the network trained in each iteration changes from the full network to a sub-network [210].
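For a single linear layer, the ensemble average over sub-networks in Eq. (120) can be checked directly: averaging the outputs of many randomly sampled dropout masks matches the usual test-time weight-scaling approximation. The following sketch (not from the original text) does this with an illustrative keep probability and random weights.

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.normal(size=5)        # input
W = rng.normal(size=(3, 5))   # weights of a small linear layer
p_keep = 0.8                  # each input unit is kept (not dropped) with probability p_keep

# Monte Carlo estimate of Eq. (120): average the outputs of sampled sub-networks M_i.
n_masks = 50000
masks = rng.random((n_masks, 5)) < p_keep        # each row selects one sub-network
mc_average = ((masks * x) @ W.T).mean(axis=0)    # average output over the sub-networks

# Test-time weight-scaling approximation: run the full network on the input scaled by p_keep.
scaled = W @ (x * p_keep)

print(mc_average)
print(scaled)
print("agree within Monte Carlo error:", np.allclose(mc_average, scaled, atol=0.05))
```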

Not only overfitting but also some training details affect the performance of the model, due to the complexity of DNNs. An improper choice of the learning rate or of the number of iterations in SGD can make the model unable to converge, which makes the accuracy of the model fluctuate greatly. Besides, an inappropriate black-box construction of the neural network may result in the training not being able to continue, so designing an appropriate neural network model is particularly important. These impacts are even greater when data are insufficient.

The technology of transfer learning [211] can be applied to build networks in the scenario of insufficient data. Its idea is that models trained on other data sources can be reused in similar target fields after certain modifications and improvements, which dramatically alleviates the problems caused by insufficient datasets. Moreover, the advantages brought by transfer learning are not limited to reducing the need for sufficient training data; it can also avoid overfitting effectively and achieve better performance in general. However, if the target data is not sufficiently relevant to the original training data, the transferred model does not bring good performance.

Meta learning methods can be used to systematically learn the parameter initialization, which ensures that training begins with a suitable initial model. However, it is necessary to ensure the correlation between the tasks used for meta-training and the tasks used for meta-testing. Under the premise of models trained on similar data sources, transfer learning and meta learning can overcome the difficulties caused by insufficient training data in new data sources, but these methods usually introduce a large number of parameters or complex parameter adjustment mechanisms, which need to be further improved for specific problems. Therefore, using insufficient data for training DNNs is still a challenge.

2) Non-convex Optimization in Deep Neural Networks: Convex optimization has good properties and a comprehensive set of tools is available for solving such problems. However, many machine learning problems are formulated as non-convex optimization problems. For example, almost all the optimization problems in DNNs are non-convex. Non-convex optimization is one of the difficulties in the optimization problem. Unlike convex optimization, a non-convex problem may have innumerable local optima in its feasible domain, and the complexity of searching for the global optimum is NP-hard [109].

In recent years, non-convex optimization has gradually attracted the attention of researchers. The methods for solving non-convex optimization problems can be roughly divided into two types. One is to transform the non-convex problem into a convex optimization problem and then use convex optimization methods. The other is to use special optimization methods that handle non-convex functions directly. There is some work summarizing the optimization methods for solving non-convex functions from the perspective of machine learning [212].

1) Relaxation method: Relax the problem to make it become a convex optimization problem. There are many relaxation techniques, for example, the branch-and-bound method called αBB convex relaxation [213], [214], which uses a convex relaxation at each step to compute the lower bound in the region. The convex relaxation method has been used in many fields. In the field of computer vision, a convex relaxation method was proposed to calculate minimal partitions [215]. For unsupervised and semi-supervised learning, the convex relaxation method was used for solving semidefinite programming [216].

2) Non-convex optimization methods: These methods include projected gradient descent [217], [218], alternating minimization [219], [220], [221], the expectation maximization algorithm [222], [223], and stochastic optimization and its variants [37]; a sketch of the first of these is given below.
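As a concrete instance of the projected gradient descent mentioned in the list above, here is a minimal sketch (not from the cited works), shown on a norm-ball-constrained least-squares problem for clarity; the problem data and step size are illustrative assumptions, and the same project-after-each-step pattern carries over to non-convex objectives and constraint sets.

```python
import numpy as np

rng = np.random.default_rng(4)

# Constrained least squares: minimize ||A x - b||^2 subject to ||x||_2 <= 1.
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)

def project_to_ball(x, radius=1.0):
    """Euclidean projection onto the L2 ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

L_grad = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the gradient of ||Ax - b||^2
step = 1.0 / L_grad

x = np.zeros(10)
for _ in range(500):
    grad = 2.0 * A.T @ (A @ x - b)          # gradient step on the smooth objective
    x = project_to_ball(x - step * grad)    # projection step back onto the feasible set

print("constraint satisfied:", np.linalg.norm(x) <= 1 + 1e-9)
print("objective value:", float(np.sum((A @ x - b) ** 2)))
```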

B. Difficulties in Sequential Models with Large-Scale Data

When dealing with large-scale time series, the usual solutions are to use stochastic optimization, to process the data in mini-batches, or to utilize distributed computing to improve computational efficiency [224]. For a sequential model, segmenting the sequences can affect the dependencies between the data at adjacent time indices. If the sequence length is not an integral multiple of the mini-batch size, the general operation is to add some items sampled from the previous data into the last subsequence. This operation introduces wrong dependencies into the training model. Therefore, analyzing the difference between the approximate solution obtained in this way and the exact solution is a direction worth exploring. Particularly, in RNNs, the problems of gradient vanishing and gradient explosion are also prone to occur. So far, they are generally addressed by the specific interaction modes of LSTM and GRU [225] or by gradient clipping. More appropriate solutions for dealing with these problems in RNNs are still worth investigating.

C. High-Order Methods for Stochastic Variational Inference

High-order optimization methods utilize curvature information and thus converge fast. Although computing and storing the Hessian matrices is difficult, with the development of research the calculation of the Hessian matrix has made great progress [8], [9], [226], and second-order optimization methods have become more and more attractive. Recently, stochastic methods have also been introduced into the second-order methods, which extends them to large-scale data [8], [10].

We have introduced some work on stochastic variational inference, which brings the stochastic method into variational inference and is an interesting and meaningful combination. This makes variational inference able to handle large-scale data. A natural idea is whether we can incorporate second-order (or higher-order) optimization methods into stochastic variational inference, which is interesting and challenging.

D. Stochastic Optimization in Conjugate Gradient

Stochastic methods exhibit powerful capabilities when dealing with large-scale data, especially for first-order optimization [227]. Researchers have also introduced this stochastic idea into second-order optimization methods [124], [125], [228] and achieved good results. The conjugate gradient method is an elegant and attractive algorithm, which has the advantages of both first-order and second-order optimization methods. The standard form of the conjugate gradient method is not suitable for a stochastic approximation. By using the fast Hessian-gradient product, the stochastic method has also been introduced into the conjugate gradient method, and some numerical results show the validity of the algorithm [227]. Another version of the stochastic conjugate gradient method employs the variance reduction technique, converges quickly within just a few iterations, and requires less storage space during the running process [229]. The stochastic version of the conjugate gradient method is a potential optimization method and is still worth studying.
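For reference, the following is a minimal sketch (not from the cited works) of the classical, deterministic conjugate gradient method on a quadratic objective, i.e., solving A x = b for a symmetric positive definite A; the stochastic variants discussed above replace the exact matrix-vector products and residuals with mini-batch estimates. The matrix construction below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(5)

def conjugate_gradient(A, b, tol=1e-10):
    """Standard linear conjugate gradient for A x = b with symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual = negative gradient of 0.5 x'Ax - b'x
    p = r.copy()                   # first search direction
    rs_old = r @ r
    for _ in range(len(b)):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)        # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p    # new direction, conjugate to the previous ones
        rs_old = rs_new
    return x

M = rng.normal(size=(50, 50))
A = M @ M.T + 50 * np.eye(50)      # a symmetric positive definite matrix
b = rng.normal(size=50)
x = conjugate_gradient(A, b)
print("residual norm:", np.linalg.norm(A @ x - b))
```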
VI. CONCLUSION

This paper introduces and summarizes the frequently used optimization methods from the perspective of machine learning and studies their applications in various fields of machine learning. Firstly, we describe the theoretical basis of optimization methods from the first-order, high-order, and derivative-free aspects, as well as the research progress in recent years. Then we describe the applications of the optimization methods in different machine learning scenarios and the approaches used to improve their performance. Finally, we discuss some challenges and open problems in machine learning optimization methods.

REFERENCES

[1] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951.
[2] P. Jain, S. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford, “Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification,” Journal of Machine Learning Research, vol. 18, 2018.
[3] D. F. Shanno, “Conditioning of quasi-Newton methods for function minimization,” Mathematics of Computation, vol. 24, pp. 647–656, 1970.
[4] J. Hu, B. Jiang, L. Lin, Z. Wen, and Y.-x. Yuan, “Structured quasi-Newton methods for optimization with orthogonality constraints,” SIAM Journal on Scientific Computing, vol. 41, pp. 2239–2269, 2019.
[5] J. Pajarinen, H. L. Thai, R. Akrour, J. Peters, and G. Neumann, “Compatible natural gradient policy search,” Machine Learning, pp. 1–24, 2019.
[6] J. E. Dennis, Jr. and J. J. Moré, “Quasi-Newton methods, motivation and theory,” SIAM Review, vol. 19, pp. 46–89, 1977.
[7] J. Martens, “Deep learning via Hessian-free optimization,” in International Conference on Machine Learning, 2010, pp. 735–742.
[8] F. Roosta-Khorasani and M. W. Mahoney, “Sub-sampled Newton methods II: local convergence rates,” arXiv preprint arXiv:1601.04738, 2016.
[9] P. Xu, J. Yang, F. Roosta-Khorasani, C. Ré, and M. W. Mahoney, “Sub-sampled Newton methods with non-uniform sampling,” in Advances in Neural Information Processing Systems, 2016, pp. 3000–3008.
[10] R. Bollapragada, R. H. Byrd, and J. Nocedal, “Exact and inexact subsampled Newton methods for optimization,” IMA Journal of Numerical Analysis, vol. 1, pp. 1–34, 2018.
[11] L. M. Rios and N. V. Sahinidis, “Derivative-free optimization: a review of algorithms and comparison of software implementations,” Journal of Global Optimization, vol. 56, pp. 1247–1293, 2013.
[12] A. S. Berahas, R. H. Byrd, and J. Nocedal, “Derivative-free optimization of noisy functions via quasi-Newton methods,” SIAM Journal on Optimization, vol. 29, pp. 965–993, 2019.

[13] Y. LeCun and L. Bottou, “Gradient-based learning applied to document [38] N. S. Keskar and R. Socher, “Improving generalization performance recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998. by switching from Adam to SGD,” arXiv preprint arXiv:1712.07628, [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification 2017. with deep convolutional neural networks,” in Advances in neural [39] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. information processing systems, 2012, pp. 1097–1105. MIT press, 1998. [15] P. Sermanet and D. Eigen, “Overfeat: Integrated recognition, local- [40] J. Mattner, S. Lange, and M. Riedmiller, “Learn to swing up and ization and detection using convolutional networks,” in International balance a real pole based on raw visual input data,” in International Conference on Learning Representations, 2014. Conference on Neural Information Processing, 2012, pp. 126–133. [16] A. Karpathy and G. Toderici, “Large-scale video classification with [41] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, convolutional neural networks,” in IEEE Conference on Computer D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement Vision and , 2014, pp. 1725–1732. learning,” arXiv preprint arXiv:1312.5602, 2013. [17] Y. Kim, “Convolutional neural networks for sentence classification,” [42] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, in Conference on Empirical Methods in Natural Language Processing, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, 2014, pp. 1746–1751. and G. Ostrovski, “Human-level control through deep reinforcement [18] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks learning,” Nature, vol. 518, pp. 529–533, 2015. for human action recognition,” IEEE Transactions on Pattern Analysis [43] Y. Bengio, “Learning deep architectures for AI,” Foundations and and Machine Intelligence, vol. 35, pp. 221–231, 2012. Trends in Machine Learning, vol. 2, pp. 1–127, 2009. [19] S. Lai, L. Xu, and K. Liu, “Recurrent convolutional neural networks [44] S. S. Mousavi, M. Schukat, and E. Howley, “Deep reinforcement for text classification,” in Association for the Advancement of Artificial learning: an overview,” in SAI Intelligent Systems Conference, 2016, Intelligence, 2015, pp. 2267–2273. pp. 426–440. [20] K. Cho and B. Van Merri¨enboer, “Learning phrase representations [45] J. Schmidhuber, “Evolutionary principles in self-referential learning, or using RNN encoder-decoder for statistical machine translation,” in on learning how to learn: the meta-meta-... hook,” Ph.D. dissertation, Conference on Empirical Methods in Natural Language Processing, Technische Universit¨at M¨unchen, M¨unchen, Germany, 1987. 2014, pp. 1724–1734. [46] T. Schaul and J. Schmidhuber, “Metalearning,” Scholarpedia, vol. 5, [21] P. Liu and X. Qiu, “ for text classification with pp. 46–50, 2010. multi-task learning,” in International Joint Conferences on Artificial [47] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning Intelligence, 2016, pp. 2873–2879. for fast adaptation of deep networks,” in International Conference on [22] A. Graves and A.-r. Mohamed, “ with deep recurrent Machine Learning, 2017, pp. 1126–1135. neural networks,” in International Conference on Acoustics, Speech and [48] O. Vinyals, “Model vs optimization meta learning,” Signal processing, 2013, pp. 6645–6649. http://metalearning-symposium.ml/files/vinyals.pdf, 2017. [23] K. Gregor and I. 
Danihelka, “Draw: A recurrent neural network for [49] J. Bromley, I. Guyon, Y. LeCun, E. S¨ackinger, and R. Shah, “Signature image generation,” arXiv preprint arXiv:1502.04623, 2015. verification using a ”siamese” time delay neural network,” in Advances [24] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent in Neural Information Processing Systems, 1994, pp. 737–744. neural networks,” arXiv preprint arXiv:1601.06759, 2016. [50] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks [25] A. Ullah and J. Ahmad, “Action recognition in video sequences using for one-shot image recognition,” in International Conference on deep bi-directional LSTM with CNN features,” IEEE Access, vol. 6, Machine Learning WorkShop, 2015, pp. 1–30. pp. 1155–1166, 2017. [51] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching [26] Y. Xia and J. Wang, “A bi-projection neural network for solving networks for one shot learning,” in Advances in Neural Information constrained quadratic optimization problems,” IEEE Transactions on Processing Systems, 2016, pp. 3630–3638. Neural Networks and Learning Systems, vol. 27, no. 2, pp. 214–224, 2015. [52] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few- [27] S. Zhang, Y. Xia, and J. Wang, “A complex-valued projection neural shot learning,” in Advances in Neural Information Processing Systems, network for constrained optimization of real functions in complex 2017, pp. 4077–4087. variables,” IEEE Transactions on Neural Networks and Learning [53] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, Systems, vol. 26, no. 12, pp. 3227–3238, 2015. “Meta-learning with memory-augmented neural networks,” in Interna- [28] Y. Xia and J. Wang, “Robust regression estimation based on low- tional Conference on Machine Learning, 2016, pp. 1842–1850. dimensional recurrent neural networks,” IEEE Transactions on Neural [54] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” in Networks and Learning Systems, vol. 29, no. 12, pp. 5935–5946, 2018. International Conference on Learning Representations, 2015, pp. 1– [29] Y. Xia, J. Wang, and W. Guo, “Two projection neural networks 15. with reduced model complexity for ,” IEEE [55] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, Transactions on Neural Networks and Learning Systems, pp. 1–10, T. Schaul, B. Shillingford, and N. De Freitas, “Learning to learn 2019. by gradient descent by gradient descent,” in Advances in Neural [30] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods Information Processing Systems, 2016, pp. 3981–3989. for online learning and stochastic optimization,” Journal of Machine [56] S. Ravi and H. Larochelle, “Optimization as a model for few-shot Learning Research, vol. 12, pp. 2121–2159, 2011. learning,” in International Conference on Learning Representations, [31] M. D. Zeiler, “AdaDelta: An adaptive learning rate method,” arXiv 2016, pp. 1–11. preprint arXiv:1212.5701, 2012. [57] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, [32] T. Tieleman and G. Hinton, “Divide the gradient by a running average 2006. of its recent magnitude,” COURSERA: Neural Networks for Machine [58] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic Learning, pp. 26–31, 2012. variational inference,” Journal of Machine Learning Research, vol. 14, [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” pp. 1303–1347, 2013. in International Conference on Learning Representations, 2014, pp. 
1– [59] C. Ledig, L. Theis, F. Husz´ar, J. Caballero, A. Cunningham, A. Acosta, 15. A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single [34] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam image super-resolution using a generative adversarial network,” in and beyond,” in International Conference on Learning Representations, Computer Vision and Pattern Recognition, 2017, pp. 4681–4690. 2018, pp. 1–23. [60] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba, “Scalable [35] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation trust-region method for deep reinforcement learning using Kronecker- learning with deep convolutional generative adversarial networks,” factored approximation,” in Advances in Neural Information Processing arXiv preprint arXiv:1511.06434, 2015. Systems, 2017, pp. 5279–5288. [36] N. L. Roux, M. Schmidt, and F. R. Bach, “A stochastic gradient [61] T. Chen, E. Fox, and C. Guestrin, “Stochastic gradient Hamiltonian method with an exponential convergence rate for finite training sets,” in Monte Carlo,” in International Conference on Machine Learning, 2014, Advances in Neural Information Processing Systems, 2012, pp. 2663– pp. 1683–1691. 2671. [62] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut- [37] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent dinov, “Dropout: a simple way to prevent neural networks from using predictive variance reduction,” in Advances in Neural Information overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929– Processing Systems, 2013, pp. 315–323. 1958, 2014. 27

[63] W. Yin and H. Sch¨utze, “Multichannel variable-size convolution for [91] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge sentence classification,” in Conference on Computational Language University Press, 2004. Learning, 2015, pp. 204–214. [92] J. Alspector, R. Meir, B. Yuhas, A. Jayakumar, and D. Lippe, “A [64] J. Yang, K. Yu, Y. Gong, and T. S. Huang, “Linear spatial pyramid parallel gradient descent method for learning in analog VLSI neural matching using sparse coding for image classification,” in IEEE networks,” in Advances in Neural Information Processing Systems, Conference on Computer Vision and Pattern Recognition, 2009, pp. 1993, pp. 836–844. 1794–1801. [93] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 2006. [65] Y. Bazi and F. Melgani, “Gaussian process approach to remote sensing [94] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method image classification,” IEEE Transactions on Geoscience and Remote Efficiency in Optimization. John Wiley & Sons, 1983. Sensing, vol. 48, pp. 186–197, 2010. [95] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic [66] D. C. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep approximation approach to stochastic programming,” SIAM Journal on neural networks for image classification,” in IEEE Conference on Optimization, vol. 19, pp. 1574–1609, 2009. Computer Vision and Pattern Recognition, 2012, pp. 3642–3649. [96] A. Agarwal, M. J. Wainwright, P. L. Bartlett, and P. K. Ravikumar, [67] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means “Information-theoretic lower bounds on the oracle complexity of clustering algorithm,” Journal of the Royal Statistical Society. Series convex optimization,” in Advances in Neural Information Processing C (Applied Statistics), vol. 28, pp. 100–108, 1979. Systems, 2009, pp. 1–9. [68] S. Guha, R. Rastogi, and K. Shim, “ROCK: A robust clustering [97] H. Robbins and S. Monro, “A stochastic approximation method,” The algorithm for categorical attributes,” Information Systems, vol. 25, pp. Annals of Mathematical Statistics, pp. 400–407, 1951. 345–366, 2000. [98] C. Darken, J. Chang, and J. Moody, “Learning rate schedules for faster [69] C. Ding, X. He, H. Zha, and H. D. Simon, “Adaptive dimension stochastic gradient search,” in Neural Networks for Signal Processing, reduction for clustering high dimensional data,” in IEEE International 1992, pp. 3–12. Conference on Data Mining, 2002, pp. 147–154. [99] I. Sutskever, “Training recurrent neural networks,” Ph.D. dissertation, [70] M. Guillaumin and J. Verbeek, “Multimodal semi-supervised learning University of Toronto, Ontario, Canada, 2013. for image classification,” in Computer Vision and Pattern Recognition, [100] Z. Allen-Zhu, “Natasha 2: Faster non-convex optimization than SGD,” 2010, pp. 902–909. in Advances in Neural Information Processing Systems, 2018, pp. [71] O. Chapelle and A. Zien, “Semi-supervised classification by low den- 2675–2686. sity separation.” in International Conference on Artificial Intelligence [101] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle pointson- and Statistics, 2005, pp. 57–64. line stochastic gradient for decomposition,” in Conference on [72] Z.-H. Zhou and M. Li, “Semi-supervised regression with co-training.” Learning Theory, 2015, pp. 797–842. in International Joint Conferences on Artificial Intelligence, 2005, pp. 908–913. [102] B. T. 
Polyak, “Some methods of speeding up the convergence of iter- ation methods,” USSR Computational Mathematics and Mathematical [73] A. Demiriz and K. P. Bennett, “Semi-supervised clustering using Physics, vol. 4, pp. 1–17, 1964. genetic algorithms,” Artificial Neural Networks in Engineering, vol. 1, pp. 809–814, 1999. [103] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [74] B. Kulis and S. Basu, “Semi-supervised graph clustering: a kernel approach,” Machine Learning, vol. 74, pp. 1–22, 2009. [104] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance [75] D. Zhang and Z.-H. Zhou, “Semi-supervised dimensionality reduction,” of initialization and momentum in deep learning,” in International in SIAM International Conference on Data Mining, 2007, pp. 629–634. Conference on Machine Learning, 2013, pp. 1139–1147. [76] P. Chen and L. Jiao, “Semi-supervised double sparse graphs [105] Y. Nesterov, “A method for unconstrained convex minimization O 1 based discriminant analysis for dimensionality reduction,” Pattern problem with the rate of convergence ( k2 ),” Doklady Akademii Nauk Recognition, vol. 61, pp. 361–378, 2017. SSSR, vol. 269, pp. 543–547, 1983. [77] K. P. Bennett and A. Demiriz, “Semi-supervised support vector [106] L. C. Baird III and A. W. Moore, “Gradient descent for general machines,” in Advances in Neural Information processing systems, reinforcement learning,” in Advances in Neural Information Processing 1999, pp. 368–374. Systems, 1999, pp. 968–974. [78] E. Cheung, Optimization Methods for Semi-Supervised Learning. [107] C. Darken and J. E. Moody, “Note on learning rate schedules for University of Waterloo, 2018. stochastic optimization,” in Advances in Neural Information Processing [79] O. Chapelle, V. Sindhwani, and S. S. Keerthi, “Optimization techniques Systems, 1991, pp. 832–838. for semi-supervised support vector machines,” Journal of Machine [108] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with Learning Research, vol. 9, pp. 203–233, 2008. the stochastic average gradient,” Mathematical Programming, vol. 162, [80] ——, “Branch and bound for semi-supervised support vector pp. 83–112, 2017. machines,” in Advances in Neural Information Processing Systems, [109] Z. Allen-Zhu and E. Hazan, “Variance reduction for faster non-convex 2007, pp. 217–224. optimization,” in International Conference on Machine Learning, 2016, [81] Y.-F. Li and I. W. Tsang, “Convex and scalable weakly labeled svms,” pp. 699–707. Journal of Machine Learning Research, vol. 14, pp. 2151–2188, 2013. [110] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “Stochastic [82] F. Murtagh, “A survey of recent advances in hierarchical clustering variance reduction for nonconvex optimization,” in International algorithms,” The Computer Journal, vol. 26, pp. 354–359, 1983. Conference on Machine Learning, 2016, pp. 314–323. [83] V. Castro and J. Yang, “A fast and robust general purpose clustering [111] A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental algorithm,” in Knowledge Discovery in Databases and Data Mining, gradient method with support for non-strongly convex composite 2000, pp. 208–218. objectives,” in Advances in Neural Information Processing Systems, [84] G. H. Ball and D. J. Hall, “A clustering technique for summarizing 2014, pp. 1646–1654. multivariate data,” Behavioral Science, vol. 12, pp. 153–155, 1967. [112] M. J. Powell, “A method for nonlinear constraints in minimization [85] S. Wold, K. Esbensen, and P. 
Geladi, “Principal component analysis,” problems,” Optimization, pp. 283–298, 1969. Chemometrics and Intelligent Laboratory Systems, vol. 2, pp. 37–52, [113] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed 1987. optimization and statistical learning via the alternating direction method [86] I. Jolliffe, “Principal component analysis,” in International Encyclope- of multipliers,” Foundations and Trends in Machine Learning, vol. 3, dia of Statistical Science, 2011, pp. 1094–1096. pp. 1–122, 2011. [87] M. E. Tipping and C. M. Bishop, “Probabilistic principal component [114] A. Nagurney and P. Ramanujam, “Transportation network policy analysis,” Journal of the Royal Statistical Society: Series B (Statistical modeling with goal targets and generalized penalty functions,” Methodology), vol. 61, pp. 611–622, 1999. Transportation Science, vol. 30, pp. 3–13, 1996. [88] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. [115] B. He, H. Yang, and S. Wang, “Alternating direction method with MIT Press, 2018. self-adaptive penalty parameters for monotone variational inequalities,” [89] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement Journal of Optimization Theory and Applications, vol. 106, pp. 337– learning: A survey,” Journal of Artificial Intelligence Research, vol. 4, 356, 2000. pp. 237–285, 1996. [116] D. Hallac, C. Wong, S. Diamond, A. Sharang, S. Boyd, and [90] S. Ruder, “An overview of gradient descent optimization algorithms,” J. Leskovec, “Snapvx: A network-based convex optimization solver,” arXiv preprint arXiv:1609.04747, 2016. Journal of Machine Learning Research, vol. 18, pp. 1–5, 2017. 28

[117] B. Wahlberg, S. Boyd, M. Annergren, and Y. Wang, “An ADMM [146] R. Gower, D. Goldfarb, and P. Richt´arik, “Stochastic block BFGS: algorithm for a class of total variation regularized estimation problems,” Squeezing more curvature out of data,” in International Conference on arXiv preprint arXiv:1203.1828, 2012. Machine Learning, 2016, pp. 1869–1878. [118] M. Frank and P. Wolfe, “An algorithm for quadratic programming,” [147] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal, “On the use of Naval Research Logistics Quarterly, vol. 3, pp. 95–110, 1956. stochastic Hessian information in optimization methods for machine [119] M. Jaggi, “Revisiting Frank-Wolfe: Projection-free sparse convex learning,” SIAM Journal on Optimization, vol. 21, pp. 977–995, 2011. optimization,” in International Conference on Machine Learning, 2013, [148] S. I. Amari, “Natural gradient works efficiently in learning,” Neural pp. 427–435. Computation, vol. 10, pp. 251–276, 1998. [120] M. Fukushima, “A modified Frank-Wolfe algorithm for solving [149] J. Martens, “New insights and perspectives on the natural gradient the traffic assignment problem,” Transportation Research Part B: method,” arXiv preprint arXiv:1412.1193, 2014. Methodological, vol. 18, pp. 169–177, 1984. [150] R. Grosse and R. Salakhudinov, “Scaling up natural gradient by [121] M. Patriksson, The Traffic Assignment Problem: Models and Methods. sparsely factorizing the inverse fisher matrix,” in International Dover Publications, 2015. Conference on Machine Learning, 2015, pp. 2304–2313. [122] K. L. Clarkson, “Coresets, sparse greedy approximation, and the Frank- [151] J. Martens and R. Grosse, “Optimizing neural networks with Wolfe algorithm,” ACM Transactions on Algorithms, vol. 6, pp. 63–96, Kronecker-factored approximate curvature,” in International Confer- 2010. ence on Machine Learning, 2015, pp. 2408–2417. [123] J. Mairal, F. Bach, J. Ponce, G. Sapiro, R. Jenatton, and [152] R. H. Byrd, J. C. Gilbert, and J. Nocedal, “A trust region method based G. Obozinski, “SPAMS: A sparse modeling software, version 2.3,” on interior point techniques for nonlinear programming,” Mathematical http://spams-devel.gforge.inria.fr/downloads.html, 2014. Programming, vol. 89, pp. 149–185, 2000. [124] N. N. Schraudolph, J. Yu, and S. G¨unter, “A stochastic quasi-Newton [153] L. Hei, “Practical techniques for nonlinear optimization,” Ph.D. method for online convex optimization,” in Artificial Intelligence and dissertation, Northwestern University, America, 2007. Statistics, 2007, pp. 436–443. [125] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer, “A stochastic [154] M. I. Lourakis, “A brief description of the levenberg-marquardt quasi- method for large-scale optimization,” SIAM Journal on algorithm implemented by levmar,” Foundation of Research and Optimization, vol. 26, pp. 1008–1031, 2016. Technology, vol. 4, pp. 1–6, 2005. [126] P. Moritz, R. Nishihara, and M. Jordan, “A linearly-convergent [155] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to stochastic L-BFGS algorithm,” in Artificial Intelligence and Statistics, Derivative-Free Optimization. Society for Industrial and Applied 2016, pp. 249–258. Mathematics, 2009. [127] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for [156] C. Audet and M. Kokkolaras, Blackbox and Derivative-Free Optimiza- solving linear systems. NBS Washington, DC, 1952. tion: Theory, Algorithms and Applications. Springer, 2016. [128] J. R. Shewchuk, “An introduction to the conjugate gradient method [157] L. 
M. Rios and N. V. Sahinidis, “Derivative-free optimization: a review without the agonizing pain,” Carnegie Mellon University, Tech. Rep., of algorithms and comparison of software implementations,” Journal 1994. of Global Optimization, vol. 56, pp. 1247–1293, 2013. [129] M. Avriel, Nonlinear Programming: Analysis and Methods. Dover [158] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by Publications, 2003. simulated annealing,” Science, vol. 220, pp. 671–680, 1983. [130] P. T. Harker and J. Pang, “A damped-Newton method for the linear [159] M. Mitchell, An Introduction to Genetic Algorithms. MIT press, 1998. complementarity problem,” Lectures in , vol. 26, [160] M. Dorigo, M. Birattari, C. Blum, M. Clerc, T. St¨utzle, and A. Winfield, pp. 265–284, 1990. Ant Colony Optimization and . Springer, 2008. [131] P. Y. Ayala and H. B. Schlegel, “A combined method for determining [161] D. P. Bertsekas, Nonlinear Programming. Athena Scientific Belmont, reaction paths, minima, and transition state geometries,” The Journal 1999. of Chemical Physics, vol. 107, pp. 375–384, 1997. [162] P. Richt´arik and M. Tak´aˇc, “Iteration complexity of randomized block- [132] M. Raydan, “The barzilai and borwein gradient method for the coordinate descent methods for minimizing a composite function,” large scale unconstrained minimization problem,” SIAM Journal on Mathematical Programming, vol. 144, pp. 1–38, 2014. Optimization, vol. 7, pp. 26–33, 1997. [163] I. Loshchilov, M. Schoenauer, and M. Sebag, “Adaptive coordinate [133] W. C. Davidon, “Variable metric method for minimization,” SIAM descent,” in Annual Conference on Genetic and Evolutionary Journal on Optimization, vol. 1, pp. 1–17, 1991. Computation, 2011, pp. 885–892. [134] R. Fletcher and M. J. Powell, “A rapidly convergent descent method [164] T. Huckle, “Approximate sparsity patterns for the inverse of a matrix for minimization,” The Computer Journal, vol. 6, pp. 163–168, 1963. and preconditioning,” Applied Numerical Mathematics, vol. 30, pp. [135] C. G. Broyden, “The convergence of a class of double-rank 291–303, 1999. minimization algorithms: The new algorithm,” IMA Journal of Applied [165] M. Benzi, “Preconditioning techniques for large linear systems: a Mathematics, vol. 6, pp. 222–231, 1970. survey,” Journal of Computational Physics, vol. 182, pp. 418–477, [136] R. Fletcher, “A new approach to variable metric algorithms,” The 2002. Computer Journal, vol. 13, pp. 317–322, 1970. [166] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex [137] D. Goldfarb, “A family of variable-metric methods derived by programming, version 2.1,” http://cvxr.com/cvx, 2014. variational means,” Mathematics of Computation, vol. 24, pp. 23–26, [167] S. Diamond and S. Boyd, “Cvxpy: A python-embedded modeling 1970. language for convex optimization,” Journal of Machine Learning [138] J. Nocedal, “Updating quasi-Newton matrices with limited storage,” Research, vol. 17, pp. 2909–2913, 2016. Mathematics of Computation, vol. 35, pp. 773–782, 1980. [139] D. C. Liu and J. Nocedal, “On the limited memory BFGS method [168] M. Andersen, J. Dahl, and L. Vandenberghe, “Cvxopt: A python for large scale optimization,” Mathematical programming, vol. 45, pp. package for convex optimization, version 1.1.6,” https://cvxopt.org/, 503–528, 1989. 2013. [140] W. Sun and Y. X. Yuan, Optimization theory and methods: nonlinear [169] J. D. Hedengren, R. A. Shishavan, K. M. Powell, and T. F. Edgar, programming. Springer Science & Business Media, 2006. 
“Nonlinear modeling, estimation and predictive control in ,” [141] A. S. Berahas, J. Nocedal, and M. Tak´ac, “A multi-batch L-BFGS Computers & Chemical Engineering, vol. 70, pp. 133–148, 2014. method for machine learning,” in Advances in Neural Information [170] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, Processing Systems, 2016, pp. 1055–1063. S. Ghemawat, G. Irving, and M. Isard, “Tensorflow: a system for large- [142] L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,” in scale machine learning,” in USENIX Symposium on Operating Systems Advances in Neural Information Processing Systems, 2008, pp. 161– Design and Implementations, 2016, pp. 265–283. 168. [171] T. Dozat, “Incorporating nesterov momentum into adam,” in Interna- [143] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for tional Conference on Learning Representations, 2016, pp. 1–14. large-scale machine learning,” Society for Industrial and Applied [172] I. Loshchilov and F. Hutter, “Fixing weight decay regularization in Mathematics Review, vol. 60, pp. 223–311, 2018. Adam,” arXiv preprint arXiv:1711.05101, 2017. [144] A. Mokhtari and A. Ribeiro, “Res: Regularized stochastic BFGS [173] Z. Zhang, L. Ma, Z. Li, and C. Wu, “Normalized direction-preserving algorithm,” IEEE Transactions on Signal Processing, vol. 62, pp. 6089– Adam,” arXiv preprint arXiv:1709.04546, 2017. 6104, 2014. [174] H. Salehinejad, S. Sankar, J. Barfett, E. Colak, and S. Valaee, [145] ——, “Global convergence of online limited memory BFGS,” Journal “Recent advances in recurrent neural networks,” arXiv preprint of Machine Learning Research, vol. 16, pp. 3151–3181, 2015. arXiv:1801.01078, 2017. 29

[175] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, “Advances in [202] M. D. Hoffman and A. Gelman, “The No-U-turn sampler: adaptively optimizing recurrent networks,” in IEEE International Conference on setting path lengths in Hamiltonian monte carlo,” Journal of Machine Acoustics, Speech and Signal Processing, 2013, pp. 8624–8628. Learning Research, vol. 15, pp. 1593–1623, 2014. [176] J. Martens and I. Sutskever, “Training deep and recurrent networks with [203] Y. Nesterov, “Primal-dual subgradient methods for convex problems,” Hessian-free optimization,” in Neural Networks: Tricks of the Trade, Mathematical Programming, vol. 120, pp. 221–259, 2009. 2012, pp. 479–535. [204] C. Andrieu and J. Thoms, “A tutorial on adaptive MCMC,” Statistics [177] N. N. Schraudolph, “Fast curvature matrix-vector products for second- and Computing, vol. 18, pp. 343–373, 2008. order gradient descent,” Neural Computation, vol. 14, pp. 1723–1738, [205] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, 2002. M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, “Stan: A [178] J. Martens and I. Sutskever, “Learning recurrent neural networks with probabilistic programming language,” Journal of Statistical Software, Hessian-free optimization,” in International Conference on Machine vol. 76, pp. 1–37, 2017. Learning, 2011, pp. 1033–1040. [206] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White, “MCMC [179] A. Likas and A. Stafylopatis, “Training the random neural network methods for functions: modifying old algorithms to make them faster,” using quasi-Newton methods,” European Journal of Operational Statistical Science, vol. 28, pp. 424–446, 2013. Research, vol. 126, pp. 331–339, 2000. [207] M. Welling and Y. W. Teh, “Bayesian learning via stochastic [180] X. Liu and S. Liu, “Limited-memory bfgs optimization of recurrent gradient Langevin dynamics,” in International Conference on Machine neural network language models for speech recognition,” in Interna- Learning, 2011, pp. 681–688. tional Conference on Acoustics, Speech and Signal Processing, 2018, [208] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven, pp. 6114–6118. “Bayesian sampling using stochastic gradient thermostats,” in Advances [181] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, in Neural Information Processing Systems, 2014, pp. 3203–3211. Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep [209] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, pp. 123– reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015. 140, 1996. [182] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, [210] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep regularization,” arXiv preprint arXiv:1409.2329, 2014. reinforcement learning,” in International Conference on Machine [211] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Learning, 2016, pp. 1928–1937. Transactions on Knowledge and Data Engineering, vol. 22, pp. 1345– [183] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van 1359, 2010. Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and [212] P. Jain and P. Kar, “Non-convex optimization for machine learning,” M. Lanctot, “Mastering the game of go with deep neural networks and Foundations and Trends in Machine Learning, vol. 10, pp. 142–336, tree search,” Nature, vol. 529, pp. 484–489, 2016. 2017. [184] C. J. Watkins and P. 
Dayan, “Q-learning,” Machine Learning, vol. 8, [213] C. S. Adjiman and S. Dallwig, “A global optimization method, αbb, for pp. 279–292, 1992. general twice-differentiable constrained NLPs–I. theoretical advances,” [185] G. A. Rummery and M. Niranjan, “On-line Q-learning using connec- Computers & Chemical Engineering, vol. 22, pp. 1137–1158, 1998. tionist systems,” Cambridge University Engineering Department, Tech. Rep., 1994. [214] C. Adjiman, C. Schweiger, and C. Floudas, “Mixed-integer nonlinear optimization in process synthesis,” in Handbook of combinatorial [186] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint optimization, 1998, pp. 1–76. arXiv:1701.07274, 2017. [187] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, “Natural [215] T. Pock, A. Chambolle, D. Cremers, and H. Bischof, “A convex actor-critic algorithms,” Automatica, vol. 45, pp. 2471–2482, 2009. relaxation approach for computing minimal partitions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. [188] S. Thrun and L. Pratt, Learning to Learn. Springer Science & Business 810–817. Media, 2012. [189] M. Abdullah Jamal and G.-J. Qi, “Task agnostic meta-learning for [216] L. Xu and D. Schuurmans, “Unsupervised and semi-supervised multi- few-shot learning,” in The IEEE Conference on Computer Vision and class support vector machines,” in Association for the Advancement of Pattern Recognition (CVPR), 2019, pp. 1–11. Artificial Intelligence, 904-910, p. 13. [190] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An [217] Y. Chen and M. J. Wainwright, “Fast low-rank estimation by projected introduction to variational methods for graphical models,” Machine gradient descent: General statistical and algorithmic guarantees,” arXiv Learning, vol. 37, pp. 183–233, 1999. preprint arXiv:1509.03025, 2015. [191] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential [218] D. Park and A. Kyrillidis, “Provable non-convex projected gradient families, and variational inference,” Foundations and Trends in descent for a class of constrained matrix optimization problems,” arXiv Machine Learning, vol. 1, pp. 1–305, 2008. preprint arXiv:1606.01316, 2016. [192] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: [219] P. Jain, P. Netrapalli, and S. Sanghavi, “Low-rank A review for statisticians,” Journal of the American Statistical using alternating minimization,” in ACM Annual Symposium on Theory Association, vol. 112, pp. 859–877, 2017. of Computing, 2013, pp. 665–674. [193] L. Bottou and Y. L. Cun, “Large scale online learning,” in Advances [220] M. Hardt, “Understanding alternating minimization for matrix in Neural Information Processing Systems, 2004, pp. 217–224. completion,” in IEEE Annual Symposium on Foundations of Computer [194] J. C. Spall, Introduction to Stochastic Search and Optimization: Science, 2014, pp. 651–660. Estimation, Simulation, and Control. Wiley-Interscience, 2005. [221] M. Hardt and M. Wootters, “Fast matrix completion without the [195] J. Hensman, N. Fusi, and N. Lawrence, “Gaussian processes for big ,” in Conference on Learning Theory, 2014, pp. 638– data,” in Conference on Uncertainty in Artificial Intellegence, 2013, 678. pp. 282–290. [222] S. Balakrishnan, M. J. Wainwright, and B. Yu, “Statistical guarantees [196] J. Hensman, A. G. d. G. Matthews, and Z. 
Ghahramani, “Scalable for the em algorithm: From population to sample-based analysis,” The variational gaussian process classification,” in International Conference Annals of Statistics, vol. 45, pp. 77–120, 2017. on Artificial Intelligence and Statistics, 2015, pp. 351–360. [223] Z. Wang, Q. Gu, Y. Ning, and H. Liu, “High dimensional expectation- [197] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid maximization algorithm: Statistical optimization and asymptotic monte carlo,” Physics Letters B, vol. 195, pp. 216–222, 1987. normality,” arXiv preprint arXiv:1412.8729, 2014. [198] R. Neal, “MCMC using Hamiltonian dynamics,” Handbook of Markov [224] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Chain Monte Carlo, vol. 2, pp. 113–162, 2011. Tang, “On large-batch training for deep learning: Generalization gap [199] M. Girolami and B. Calderhead, “Riemann manifold langevin and and sharp minima,” arXiv preprint arXiv:1609.04836, 2016. hamiltonian monte carlo methods,” Journal of the Royal Statistical [225] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of Society: Series B (Statistical Methodology), vol. 73, pp. 123–214, 2011. gated recurrent neural networks on sequence modeling,” arXiv preprint [200] M. Betancourt, “The fundamental incompatibility of scalable Hamil- arXiv:1412.3555, 2014. tonian monte carlo and naive data subsampling,” in International [226] J. Martens, Second-Order Optimization For Neural Networks. Uni- Conference on Machine Learning, 2015, pp. 533–540. versity of Toronto (Canada), 2016. [201] S. Ahn, A. Korattikara, and M. Welling, “Bayesian posterior sampling [227] N. N. Schraudolph and T. Graepel, “Conjugate directions for stochastic via stochastic gradient fisher scoring,” in International Conference on gradient descent,” in International Conference on Artificial Neural Machine Learning, 2012, pp. 1591–1598. Networks, 2002, pp. 1351–1356. 30

[228] A. Bordes, L. Bottou, and P. Gallinari, “SGD-QN: Careful quasi-Newton stochastic gradient descent,” Journal of Machine Learning Research, vol. 10, pp. 1737–1754, 2009.
[229] X. Jin, X. Zhang, K. Huang, and G. Geng, “Stochastic conjugate gradient algorithm with variance reduction,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–10, 2018.