Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by Local Quadratic Approximation

Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by Local Quadratic Approximation Yingqiu Zhu Yu Chen Danyang Huang∗ [email protected] [email protected] [email protected] School of Statistics, Renmin Guanghua School of Management, Center for Applied Statistics, Renmin University of China Peking University University of China Beijing, China Beijing, China School of Statistics, Renmin University of China Beijing, China Bo Zhang Hansheng Wang [email protected] [email protected] Center for Applied Statistics, Renmin Guanghua School of Management, University of China Peking University School of Statistics, Renmin Beijing, China University of China Beijing, China ABSTRACT experiments have been conducted to prove the strengths of In deep learning tasks, the learning rate determines the up- the proposed LQA method. date step size in each iteration, which plays a critical role in gradient-based optimization. However, the determination CCS CONCEPTS of the appropriate learning rate in practice typically replies • Computing methodologies ! Neural networks; Batch on subjective judgement. In this work, we propose a novel learning. optimization method based on local quadratic approximation (LQA). In each update step, given the gradient direction, we KEYWORDS locally approximate the loss function by a standard quadratic neural networks, gradient descent, learning rate, machine function of the learning rate. Then, we propose an approx- learning imation step to obtain a nearly optimal learning rate in a computationally efficient way. The proposed LQA method ACM Reference Format: has three important features. First, the learning rate is au- Yingqiu Zhu, Yu Chen, Danyang Huang, Bo Zhang, and Hansheng tomatically determined in each update step. Second, it is Wang. 2020. Automatic, Dynamic, and Nearly Optimal Learning dynamically adjusted according to the current loss function Rate Specification by Local Quadratic Approximation. In CIKM ’20: value and the parameter estimates. Third, with the gradi- 29th ACM International Conference on Information and Knowledge Management, October 19–23, 2020, Galway, Ireland. ACM, New York, ent direction fixed, the proposed method leads to nearly the NY, USA, 10 pages. https://doi.org/10.1145/1122445.1122456 greatest reduction in terms of the loss function. Extensive arXiv:2004.03260v1 [stat.ML] 7 Apr 2020 ∗Corresponding Author 1 INTRODUCTION In recent years, the development of deep learning has led to Permission to make digital or hard copies of all or part of this work for remarkable success in visual recognition [7, 10, 14], speech personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear recognition [8, 29], natural language processing [2, 5], and this notice and the full citation on the first page. Copyrights for components many other fields. For different learning tasks, researchers of this work owned by others than ACM must be honored. Abstracting with have developed different network frameworks, including credit is permitted. To copy otherwise, or republish, to post on servers or to deep convolutional neural networks [14, 16], recurrent neu- redistribute to lists, requires prior specific permission and/or a fee. Request ral networks [6], graph convolutional networks [12] and rein- permissions from [email protected]. forcement learning [19, 20]. Although the network structure CIKM ’20, October 19–23, 2020, Galway, Ireland © 2020 Association for Computing Machinery. could be totally different, the training methods are typically ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00 similar. They are often gradient decent methods, which are https://doi.org/10.1145/1122445.1122456 developed based on backpropagation [24]. CIKM ’20, October 19–23, 2020, Galway, Ireland Zhu et al. Given a differentiable objective function, gradient descent compute the Hessian matrix for the loss function. However, is a natural and efficient method for optimization. Among this solution is computationally expensive. Because many various gradient descent methods, the stochastic gradient de- complex deep neural networks involve a large number of scent (SGD) method [23] plays a critical role. In the standard parameters, this makes the Hessian matrix have ultra-high SGD method, the first-order gradient of a randomly selected dimensionality. To solve this problem, we propose here a sample is used to iteratively update the parameter estimates novel approximation step. Note that, given a fixed gradient of a network. Specifically, the parameter estimates are ad- direction, the loss function can be approximated by a stan- justed with the negative of the random gradient multiplied dard quadratic function with the learning rate as the only by a step size. The step size is called the learning rate. Many input variable. For a univariate quadratic function such as generalized methods based on the SGD method have been this, there are only two unknown coefficients. They are the proposed [1, 4, 11, 25, 26]. Most of these extensions specify linear term coefficient and the quadratic term coefficient. improved update rules to adjust the direction or the step size. As long as these two coefficients can be determined, the However, [1] pointed out that, many hand-designed update optimal learning rate can be obtained. To estimate the two rules are designed for circumstances with certain characteris- unknown coefficients, one can try, for example, two different tics, such as sparsity or nonconvexity. As a result, rule-based but reasonably small learning rates. Then, the correspond- methods might perform well in some cases but poorly in ing objective function can be evaluated. This step leads to others. Consequently, an optimizer with an automatically two equations, which can be solved to estimate the two un- adjusted update rule is preferable. known coefficients in the quadratic approximation function. An update rule contains two important components: one Thereafter, the optimal learning rate can be obtained. is the update direction, and the other is the step size. The Our contributions: We propose an automatic, dynamic learning rate determines the step size, which plays a sig- and nearly optimal learning rate tuning algorithm that has nificant role in optimization. If it is set inappropriately, the the following three important features. parameter estimates could be suboptimal. Empirical experi- (1) The algorithm is automatic. In other words, it leads to ence suggests that a relatively larger learning rate might be an optimization method with little subjective judgment. preferred in the early stages of the optimization. Otherwise, (2) The method is dynamic in the sense that the learning the algorithm might converge very slowly. In contrast, a rate used in each update step is different. It is dynamically relatively smaller learning rate should be used in the later adjusted according to the current status of the loss function stages. Otherwise, the objective function cannot be fully op- and the parameter estimates. Typically, larger rates are used timized. This phenomenon inspires us to design a method in the earlier iterations, while smaller rates are used in the to automatically search for an optimal learning rate in each latter iterations. update step during optimization. (3) The learning rate derived from the proposed method is To this end, we propose here a novel optimization method nearly optimal. For each update step, by the novel quadratic based on local quadratic approximation (LQA). It tunes the approximation, the learning rate leads to almost the greatest learning rate in a dynamic, automatic and nearly optimal reduction in terms of the loss function. Here, “almost” refers manner. The method can obtain the best step size in each up- to the fact that the loss function is locally approximated by date step. Intuitively, given a search direction, what should a quadratic function with unknown coefficients numerically be the best step size? One natural definition is the step size estimated. For this particular update step, with the gradient that can lead to the greatest reduction in the global loss. Ac- direction fixed, and among all the possible learning rates, the cordingly, the step size itself should be treated as a parameter, one determined by the proposed method can result in nearly that needs to be optimized. For this purpose, the proposed the greatest reduction in terms of the loss function. method can be decomposed into two important steps. They The rest of this article is organized as follows. In Section are the expansion step and the approximation step. First, in 2, we review related works on gradient-based optimizers. the expansion step, we conduct Taylor expansion on the loss Section 3 presents the proposed algorithm in detail. In Sec- function, around the current parameter estimates. Accord- tion 4, we verify the performance of the proposed method ingly, the objective function can be locally approximated by through empirical studies on open datasets. Then, conclud- a quadratic function in terms of the learning rate. Then, the ing remarks are given in Section 5. learning rate is also treated as a parameter to be optimized, which leads to a nearly optimal determination of the learning 2 RELATED WORK rate for this particular update step. To optimize a loss function, two important components need Second, to implement this idea, we need to compute the to be specified: the update direction and the step size. Ideally, first- and second-order derivatives of the objective function the best update direction should be the gradient computed for on the gradient direction. One way to solve this problem is to the loss function based on the whole data. For convenience, Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by Local Quadratic Approximation CIKM ’20, October 19–23, 2020, Galway, Ireland we refer to it as the global gradient. Since the calculation of 3.1 Stochastic gradient descent the global gradient is computationally expensive, the SGD Assume we have a total of N samples. They are indexed by method [23] uses the gradient estimated based on a stochas- 1 ≤ i ≤ N and collected by S = f1; 2; ··· ; N g.

Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by Local Quadratic Approximation

An Ε-Relaxation Method for Separable Convex Cost Generalized Network Flow Problems

ISMP BP Front

The Global Technology Revolution China, In-Depth Analyses

Optima 82 Publishes the Obituary of Paul Tseng by His Friends Affected by the Proposed Change; Only Their Subtitles Would Change

2013 Newsletter

Chapter 5: Lasso and Sparsity in Statistics

Copyright © by SIAM. Unauthorized Reproduction of This Article Is Prohibited. EXACT REGULARIZATION of CONVEX PROGRAMS 1327

2010 Mathematics Newsletter

Curriculum Vitae James V

Joint Descent: Training and Tuning Simultane

Mathematics at the University of Washington

Exact Regularization of Convex Programs