
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Calibrated Stochastic Gradient Descent for Convolutional Neural Networks

Li'an Zhuo,1 Baochang Zhang,1∗ Chen Chen,2 Qixiang Ye,5∗ Jianzhuang Liu,4 David Doermann3

1School of Automation Science and Electrical Engineering, Beihang University, Beijing
2University of North Carolina at Charlotte, Charlotte, NC
3Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY
4Huawei Noah's Ark Lab
5University of Chinese Academy of Sciences, China
{lianzhuo, bczhang}@buaa.edu.cn, [email protected]

Abstract

In stochastic gradient descent (SGD) and its variants, the optimized gradient estimators may be as expensive to compute as the true gradient in many scenarios. This paper introduces a calibrated stochastic gradient descent (CSGD) algorithm for deep neural network optimization. A theorem is developed to prove that an unbiased estimator for the network variables can be obtained in a probabilistic way based on the Lipschitz hypothesis. Our work is significantly distinct from existing gradient optimization methods, by providing a theoretical framework for unbiased variable estimation in the deep learning paradigm to optimize the model parameter calculation. In particular, we develop a generic gradient calibration layer which can be easily used to build convolutional neural networks (CNNs). Experimental results demonstrate that CNNs with our CSGD optimization scheme can improve the state-of-the-art performance for natural image classification, digit recognition, ImageNet object classification, and object detection tasks. This work opens new research directions for developing more efficient SGD updates and analyzing the back-propagation algorithm.

Introduction

Back-propagation (BP) is one of the most popular algorithms for optimization and by far the most important way to train neural networks. The essence of BP is that the gradient descent algorithm optimizes the neural network parameters by calculating the minimum value of a loss function. Previous studies have focused on optimizing the gradient descent algorithm to make the loss decrease faster and more stably (Kingma and Ba 2014) (Dozat 2016) (Zeiler 2012). These methods, however, are often used as black-box optimizers, so a theoretical explanation of their strengths and weaknesses is hard to quantify (Bau et al. 2017).

Many improvements have been made on the basic gradient descent (Ruder 2016), including batch gradient descent (BGD), stochastic gradient descent (SGD) and mini-batch gradient descent (MBGD). MBGD takes the best of both BGD and SGD and performs an update with every mini-batch of training examples. It is typically the algorithm of choice when training a neural network, and the term SGD is often employed when mini-batches are used (Zhang, Choromanska, and Lecun 2015). The gradient oscillation of SGD, on the one hand, enables it to jump to a new and potentially better local minimum, but this ultimately complicates the convergence because it can cause overshooting. To circumvent this problem, by slowly decreasing the learning rate, SGD shows similar convergence behavior as BGD, converging close to a local or global minimum for non-convex and convex optimization, respectively (Dauphin et al. 2014). An unbiased gradient estimator based on the likelihood-ratio method is introduced in (Gu et al. 2016) to estimate a stable gradient, which however is implemented based on complex mean-field networks that cause inefficiency for model calculation. In (Soudry, Hubara, and Meir 2014), expectation BP is introduced to optimize the neural network calculation only when a prior distribution is given to approximate posteriors in the Bayesian inference. In (Zhang, Kjellström, and Stephan 2017), a mini-batch diversification scheme for SGD is introduced based on a similarity measure between data points. It gives lower probabilities to mini-batches which contain redundant data, and higher probabilities to mini-batches with more diverse data. Biased gradient schemes (Zhang, Kjellström, and Stephan 2017) (Qian 1999) may reduce the stochastic gradient noise or ease the optimization problem, which could lead to faster convergence. However, the biased estimators are heuristic, and it is difficult to enumerate the situations in which they will work well (Zhang, Kjellström, and Stephan 2017).

These algorithms prove to be effective in their engineering applications. However, existing methods have the following limitations: (1) they focus on unbiased or biased gradient estimators which rely on prior knowledge about model optimization or on a heuristic method; (2) unbiased variable estimation could be used to understand CNNs better, which is, however, neglected in prior art. In this paper, we provide a theoretical framework for unbiased variable estimation to optimize the CNN model parameters in an end-to-end manner.

∗ The work was supported by the Natural Science Foundation of China under Contracts 61672079 and 61473086, the Shenzhen Peacock Plan KQTD2016112515134654, the Open Projects Program of the National Laboratory of Pattern Recognition, the Beijing Municipal Science & Technology Commission under Grant Z181100008918014, and NSFC under Grant 61836012. Baochang Zhang and Qixiang Ye are the corresponding authors. Baochang Zhang is also with Shenzhen Academy of Aerospace Technology, Shenzhen, China.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

[Figure 1 graphic: (a) the CSGD procedure (i = 1, ..., 4 in this example), showing forward and backward propagation through a GClayer with branches R1–R4 and a gradient calibration convolution; (b) the gradient calibration convolution, with input 1×4×32×32, kernels 4×4×3×3, and output 1×4×30×30.]

Figure 1: The illustration of the CSGD procedure and the gradient calibration convolution in our CCNNs. We can see that the numbers of input and output channels in our convolution are the same, which is used to build the gradient calibration layer (GClayer) that is generic and easily implemented by simply replicating the same module at each layer. R is shared within each GClayer, and thus the new layer can be implemented in low complexity.

In particular, we do not impose any restriction on the gradient estimator (i.e., biased or unbiased) nor any prior knowledge on the model parameters, which makes our framework more general and thus the performance in practice could be guaranteed.

In this paper, we introduce a calibrated SGD (CSGD) algorithm for stable and efficient training of CNNs. We first develop a theory showing that an unbiased variable estimator can be achieved in a probabilistic way in SGD, which provides a guarantee on the performance in practice and also poses a new direction to analyze the BP algorithm. In particular, we compute the gradient based on a statistical analysis of the importance of each branch of the gradient among training samples. The statistics can be computed in the BP framework by designing a generic gradient calibration layer (GClayer), which can be easily incorporated into any CNN architecture without bells and whistles. We refer to the CNNs based on our CSGD as CCNNs in the following.

Distinctions between this work and prior art. In (Soudry, Hubara, and Meir 2014), based on expectation propagation, the posterior of the weights given the data is approximated using a "mean-field" factorized distribution in an online setting. Differently, ours is automatically calculated in the BP framework. Unlike an analytical approximation to the Bayes update of this posterior, we develop a theory aiming to understand SGD in terms of an unbiased estimator. Our work is also different from (Gu et al. 2016), which designs stochastic neural networks based on unbiased BP only when the distribution is given, i.e., based on Bernoulli and multinomial distributions. Ours is more flexible, without a prior hypothesis on the distribution, and provides a generic convolutional layer to optimise the kernel weights of CNNs. The main contributions of this work are three-fold.

• A theorem is developed to reveal that an unbiased variable estimator can be obtained in a probabilistic way based on a Lipschitz assumption, leading to a calibrated SGD algorithm (CSGD) for optimizing the kernel weights of CNNs.

• We develop a generic gradient calibration layer (GClayer) to achieve the proposed unbiased variable estimation, which can be easily applied to existing CNN architectures, such as AlexNet, ResNets and Wide-ResNet (WRN).

• Experimental results have demonstrated that popular CNN architectures optimised by the proposed CSGD algorithm, dubbed CCNNs, yield state-of-the-art performance for a variety of tasks such as digit recognition, ImageNet classification and object detection.

Calibrated Stochastic Gradient Descent

Gradient descent is based on the observation that if a function f(w) is defined and differentiable in a neighborhood of a point, then f(w) decreases fastest if one goes from a given position in the direction of the negative gradient of f(w). It follows that:

w^{t+1} = w^t - \theta \delta,

where θ is the learning rate and δ is the gradient vector. The popular gradient descent method, SGD, performs frequent updates with a high variance that causes the loss to fluctuate widely during training. To avoid this, we project the gradients onto a subspace, which calibrates the objective to obtain a stable solution. The gradient vector δ is calculated based on the expectation method as:

\delta = \sum_{i=1}^{K} R_i * \delta_i,    (1)

where ∗ is the Schur product, an element-wise multiplication operator, and δ_i spans a subspace Ω = {δ_1, δ_2, ..., δ_K} denoting K branches of the gradient. Each element of R_i denotes the probability of its corresponding element in δ_i, which measures δ_i's importance or contribution to the final δ. The challenge is how to build Ω, and we do so with a new and efficient method. As mentioned, R_i is the measure of the importance of each element in Ω. We further use it to weigh w, which means that the more important R_i is, the larger the corresponding weight imposed on w. We then define:

\delta_i = \partial f(R_i * w) / \partial w,

where δ_i is a flow (or branch) of the gradient δ corresponding to R_i, as shown in Fig. 1. The gradient of Eq. 1 is finally obtained by:

\delta = \sum_{i=1}^{K} R_i \, \partial f(R_i * w) / \partial w.    (2)

That is, by using f(R_i ∗ w), we efficiently solve both w and Ω in the same framework, which will be elaborated in the following. By designing a new and generic layer, the gradient calibration layer (GClayer), the gradient calculation mentioned above can be easily implemented in the BP process. To better estimate the gradient, we can add a Gaussian function N(0, σ) with zero mean and variance σ as the residual, which also makes the theoretical analysis an easier task.
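To make Eq. 2 concrete, the following is a minimal PyTorch-style sketch of the calibrated gradient (our illustration, not the authors' released code); the helper name calibrated_gradient, the loss closure loss_fn, and the representation of R as a list of K tensors are assumptions for exposition.

```python
import torch

def calibrated_gradient(w, R, loss_fn, sigma=0.0):
    """Calibrated gradient of Eq. (2) -- an illustrative sketch.

    w       : weight tensor with requires_grad=True.
    R       : list of K tensors R_i, broadcastable to w, with elements in [0, 1]
              and (elementwise) summing to 1 over the K branches.
    loss_fn : callable mapping a weight tensor to the scalar loss f(.).
    """
    delta = torch.zeros_like(w)
    for Ri in R:
        loss = loss_fn(Ri * w)                    # f(R_i * w)
        (grad_i,) = torch.autograd.grad(loss, w)  # delta_i = d f(R_i * w) / d w
        delta = delta + Ri * grad_i               # Eq. (2): delta = sum_i R_i * delta_i
    if sigma > 0:
        delta = delta + sigma * torch.randn_like(w)  # optional Gaussian residual N(0, sigma)
    return delta
```

Here torch.autograd.grad(loss, w) returns ∂f(R_i ∗ w)/∂w, so the weighted sum follows Eq. 2 literally; how each R_i is built and updated is described next.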

Implementation of the gradient calibration layer

We set H_i = R_i ∗ w, i = 1, ..., K, and H = (H_1, ..., H_K), as the convolutional kernels. We implement f(R_i ∗ w) based on a new layer, which is generic and can be independently used for any network, e.g., CNNs:

F^{l+1} = \mathrm{GClayer}(F^{l}, H),    (3)

where F^l stands for the feature map of the lth layer. Note that here we omit the layer index (i.e., the superscript) of H for simplicity. GClayer denotes the convolutional operation implemented as a new layer or module. A simple example of the forward process is shown in Fig. 1. In the GC convolution, the channels of one output feature map are generated as follows:

F_k^{l+1} = F_k^{l} \otimes H_k,    (4)

where k ∈ {1, ..., K}. Let the size of the input feature map be 4 × 32 × 32 with K = 4, where a duplication process is performed K times on the single channel of the input gray-level images. The size of the output feature map is 4 × 30 × 30. We can see that the numbers of input and output channels in every feature map are the same, as shown in Fig. 1(b), so that GClayer can be easily implemented by simply replicating the same module at each layer. Note that Eq. 1 can then be automatically implemented in BP based on GClayer by estimating R_i, as elaborated in the following.

Updating R_i

We update R_i based on R_i^t and \hat{\delta}_i = \partial f(R_i * w) / \partial R_i. We also assume the elements of R_i are probabilities in [0, 1] and \sum_i R_i = \mathbf{1}, where \mathbf{1} denotes a vector with all elements equal to 1. We update R_i during BP as:

R_i^{t+1} = |R_i^{t} + \ell \hat{\delta}_i|.

We further normalize R_i such that \sum_i R_i = \mathbf{1}. We note that R_i is shared within each layer, i.e., adding only K × 3 × 3 parameters to each layer, whose index is dropped for ease of presentation. This means that the number of additional parameters is much smaller than that of the original filters, and thus GClayer can be implemented in low complexity. Our CSGD algorithm is summarized in Alg. 1. It is based on the BP framework, but unlike conventional methods, ours is initially based on the expectation of the gradient and ultimately obtains an unbiased variable estimator, as discussed below. It poses a new direction to analyze the BP algorithm. To obtain a better understanding of learning algorithms, the Lipschitz distribution is widely used for a theoretical analysis of neural networks. For instance, in CNNs (Zou, Balan, and Singh 2018), the Lipschitz bound is important in the study of stability, and the computation of the Lipschitz bound is used for generative networks. In (Zou, Balan, and Singh 2018) the authors give a general framework for CNNs and prove that the Lipschitz bound of a CNN can be determined by solving a linear program, with a more explicit expression for a suboptimal bound. In light of this, we theoretically show that an unbiased variable estimator can be achieved in CSGD with a Lipschitz assumption in a probabilistic way.

Algorithm 1: The CSGD algorithm
1: Set t = 0
2: Initialize w^t and R_i^t, i = 1, 2, ..., K
3: Initialize the learning rates θ and ℓ
4: repeat
5:   t = t + 1
6:   Update w^{t+1} = w^t − θ Σ_i R_i^t ∂f(R_i^t ∗ w^t)/∂w^t + N(0, σ)
7:   Update R_i^{t+1} = |R_i^t + ℓ ∂f(R_i^t ∗ w^t)/∂R_i^t|
8:   Normalize R
9: until convergence
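Putting the weight update (line 6) and the R_i update and normalization (lines 7–8) together, one iteration of Alg. 1 can be sketched as below. This is again an illustrative PyTorch sketch under the same assumptions as the earlier snippet (loss_fn, the list representation of R, and the hyper-parameter names are ours); the branch loop is repeated here so that the gradient with respect to each R_i can be taken in the same pass.

```python
import torch

def csgd_step(w, R, loss_fn, theta=0.1, ell=0.01, sigma=1e-4):
    """One iteration of Alg. 1 (a sketch). w: leaf tensor with requires_grad=True;
    R: list of K tensors (elementwise probabilities summing to 1 over the list);
    loss_fn: maps a weight tensor to a scalar loss f(.)."""
    delta = torch.zeros_like(w)
    R_grads = []
    for Ri in R:
        Ri_var = Ri.detach().requires_grad_(True)        # track the gradient w.r.t. R_i as well
        loss = loss_fn(Ri_var * w)                       # f(R_i * w)
        gw, gR = torch.autograd.grad(loss, (w, Ri_var))  # d f / d w and d f / d R_i
        delta = delta + Ri_var.detach() * gw             # accumulate Eq. (2)
        R_grads.append(gR)

    with torch.no_grad():
        # Line 6: w <- w - theta * delta + Gaussian residual of scale sigma
        w -= theta * delta + sigma * torch.randn_like(w)

        # Lines 7-8: R_i <- |R_i + ell * d f / d R_i|, then renormalize so sum_i R_i = 1
        new_R = [torch.abs(Ri.detach() + ell * gR) for Ri, gR in zip(R, R_grads)]
        total = torch.stack(new_R).sum(dim=0).clamp_min(1e-12)
        R[:] = [Ri / total for Ri in new_R]
    return w, R
```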

Theoretical analysis

Until now, we have developed a new BP algorithm that introduces the expectation of the gradient into the learning process, which ultimately leads to an unbiased variable estimator. In the following, we show that our proposed CSGD can lead the average of the input to the expectation, that is, an unbiased estimator. More specifically, our theorem shows that an unbiased estimator can be obtained in a probabilistic way based on the Lipschitz hypothesis, if convergence is achieved during training. Such a proof would be very useful to guide the design of new gradient descent algorithms in various practical applications, since the exploration of the unbiased variable estimation provides a different investigation into the BP algorithm from conventional methods. We address how our theorem can be used in the learning stage.

Definition 1: Let x_1, x_2, ..., x_n be c-Lipschitz. Then we have:

|x_{i-1} - x_i| \le c_{i-1},    (5)

where x_i is a 1D random variable and c = (c_0, c_1, ..., c_{n-1}).

Lemma 1: For any vector A, it follows that:

P\left( \sum_{i=1}^{n} |A_i| \ge \sum_{i} \lambda_i \right) \le P\left( \bigcup_{i=1}^{n} \{ |A_i| \ge \lambda_i \} \right) \le \sum_{i=1}^{n} P(|A_i| \ge \lambda_i),    (6)

where P stands for probability. Lemma 1 is obvious. Next, we introduce Lemma 2 and its proof, which will be used in the proof of our theorem.

Lemma 2: For a batch set, we first define a loss function based on the N input samples as:

f(w) = \frac{1}{N} \sum_{n=1}^{N} f_n(w).    (7)

Based on gradient descent, we have:

E(w^{t+1}) = E(w^{t}) - \theta \nabla f(w^{t}).    (8)

Proof: We begin to prove Lemma 2 with a distribution hypothesis \hat{U} (e.g., uniform in SGD) on the input data. We have:

E_{n \sim \hat{U}}[\nabla f_n(w)] = \nabla E_{n \sim \hat{U}}[f_n(w)] = \nabla \sum_{i=1}^{N} \hat{U}(n = i) f_i(w) = \nabla f(w).    (9)

In particular, for SGD with a uniform distribution we have \nabla f(w) = \nabla \frac{1}{N} \sum_{i=1}^{N} f_i(w). Then, a random sample point n \sim \hat{U} is chosen to update the weights based on gradient descent:

w^{t+1} = w^{t} - \theta \nabla f_n(w^{t}),    (10)

and we obtain:

E(w^{t+1}) = E(w^{t}) - \theta E(\nabla f_n(w^{t})).    (11)

Based on Eq. 9, we have:

E(w^{t+1}) = E(w^{t}) - \theta \nabla f(w^{t}).    (12)

Thus, Lemma 2 is proved. □

Lemma 3: If x_1, x_2, ..., x_n satisfy Definition 1, then:

P\left( |\bar{x} - E(\bar{x})| \ge \frac{\lambda}{n} \right) \le 2 e^{-\lambda^2 / \sum_{i=1}^{n} c_i^2},    (13)

where x is a 1D variable updated via gradient descent (with Gaussian noise) to minimize f(x) with \nabla f(x) = 0 when converging, \bar{x} is the average of x_1, x_2, ..., x_n, and c_i is predefined based on the c-Lipschitz hypothesis. The proof of Lemma 3 can be found in our technical report at https://github.com/bczhangbczhang/.

Theorem: Let Y be a vector random variable updated based on gradient descent to minimize f(Y). For a set of samples Y_1, Y_2, ..., Y_n satisfying Definition 1, if \nabla f(Y) = 0, then an unbiased estimator is achieved by Eq. 2 in a probabilistic way, that is:

P\left( |\bar{Y} - E(\bar{Y})| \ge \sum_{i=1}^{n} \lambda_i \right) \le \sum_{i} a_i,    (14)

where \bar{Y} is the mean vector, \lambda = (\lambda_1, \lambda_2, ..., \lambda_n) with \lambda_i \le 1, and a_i = 2 e^{-\lambda_i^2 / (2 \sum_{j=1}^{n} C^2(i, j-1))} for a matrix C.

Proof: Our theorem means that the expectation of \bar{Y} is achieved in a probabilistic way, given that \bar{Y} is a variable updated based on the CSGD algorithm. Before proving the theorem, we introduce the Lipschitz assumption on the ith dimension of Y_j:

\| Y_j^i - Y_{j-1}^i \| \le C(i, j-1),    (15)

which could be easily satisfied in the learning process. According to Lemma 3, Eq. 2 and Eq. 15, we have:

P(|Z^i| \ge \lambda_i) \le 2 e^{-\lambda_i^2 / (2 \sum_{j=1}^{n} C^2(i, j-1))},    (16)

where Z = \bar{Y} - E(\bar{Y}) and Z^i denotes its ith component. Let a_i = 2 e^{-\lambda_i^2 / (2 \sum_{j=1}^{n} C^2(i, j-1))}. Based on Lemma 1 we have:

P\left( |\bar{Y} - E(\bar{Y})| \ge \sum_{i=1}^{n} \lambda_i \right) \le \sum_{i=1}^{n} a_i.    (17)

Thus, our theorem is proved. □

We note that \nabla f(Y) = 0 could be satisfied when the algorithm converges. Thus, our theorem does not require that the gradient estimator is unbiased, which makes it more general.

Implementation and Experiments

In this section, we evaluate our CSGD algorithm for CNNs on several benchmark datasets, including CIFAR-10 and CIFAR-100 (Krizhevsky 2009). In addition, PASCAL VOC 2007 is also used to validate the effectiveness of our network optimization approach on the object detection task. The network architectures we use for evaluation include Wide ResNets (WRN), ResNets and AlexNet.

Backbone network architectures. ResNets are introduced to ease the training of networks that are substantially deeper than those used previously. The ResNets architecture utilizes skip connections or short-cuts to jump over some layers, which can partially avoid the gradient vanishing problem. WRN is a network structure similar to ResNets, and it introduces a novel architecture by decreasing the depth and increasing the width of residual networks. We follow the same layer settings detailed in (Zagoruyko and Komodakis 2016). AlexNet is one of the most famous CNNs, and it is used as the backbone of our method for the object detection task. It contains eight layers: the first five are convolutional layers, and the last three are fully connected layers.
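Before turning to the experiments, the following is a rough sketch of how GClayer-style modules could be dropped into such a backbone. It is based on one reading of Eqs. 3–4 (a channel-preserving convolution with per-layer calibration weights), not the authors' implementation; the class GCLayer, the helper replace_convs_with_gclayers, the use of torchvision's resnet18, and the padding choice are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class GCLayer(nn.Module):
    """Channel-preserving 3x3 convolution in the spirit of Eqs. (3)-(4):
    the k-th output channel is the k-th input channel convolved with
    H_k = R_k * w_k (R shared within the layer). Illustrative only."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.channels = channels
        # one 3x3 kernel per channel (the "w" of the layer) ...
        self.weight = nn.Parameter(torch.randn(channels, 1, kernel_size, kernel_size) * 0.1)
        # ... and K x 3 x 3 calibration values R shared within the layer; here R is left as an
        # ordinary parameter -- the dedicated update/normalization of Alg. 1 (lines 7-8) would
        # be applied in the training loop.
        self.R = nn.Parameter(torch.full((channels, 1, kernel_size, kernel_size), 1.0 / channels))

    def forward(self, x):
        H = self.R * self.weight                        # H_k = R_k * w_k (Schur product)
        # groups=channels gives F_k^{l+1} = F_k^l (conv) H_k, i.e. Eq. (4);
        # padding=1 keeps the spatial size so the module is a drop-in replacement
        # (the paper's toy example in Fig. 1 uses no padding, hence 32 -> 30).
        return F.conv2d(x, H, padding=1, groups=self.channels)

def replace_convs_with_gclayers(model):
    """Replace channel-preserving 3x3 convolutions of a backbone with GCLayers.
    Only convolutions with in_channels == out_channels are touched (a simplification)."""
    for name, module in model.named_children():
        if isinstance(module, nn.Conv2d) and module.in_channels == module.out_channels \
                and module.kernel_size == (3, 3):
            setattr(model, name, GCLayer(module.in_channels))
        else:
            replace_convs_with_gclayers(module)
    return model

# Example: a CCNN-style backbone built from torchvision's ResNet-18 (illustrative only).
ccnn18 = replace_convs_with_gclayers(torchvision.models.resnet18(num_classes=10))
```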

Based on these network architectures, we build our CNNs using the proposed CSGD algorithm (denoted as CCNNs). Their performances are extensively validated in the following experiments.

Experiments on natural image classification

For the natural image classification task, we use the CIFAR-10 and CIFAR-100 datasets (Krizhevsky 2009), which consist of 60,000 color images of size 32x32 in 10 or 100 classes, with 6,000 or 600 images per class, respectively. There are 50,000 training images and 10,000 test images. We first evaluate the effect of K on the performance of our CCNNs on CIFAR-10 by replacing the convolution layers of ResNet-18 with our GClayers. The results show that the performance becomes better when increasing K, as shown in Table 1. σ in Eq. 2 is also tested, and the results in the same table show that the performance becomes better when decreasing σ, e.g., 7.61% vs. 7.68% for σ = 0.0001 vs. σ = 0.001. We choose K = 4 and σ = 0.0001 in all the following experiments, considering that this setting already achieves much better performance than ResNet-18 with Gaussian noise (Neelakantan et al. 2015). Moreover, compared with σ = 0, which denotes the original ResNets, Gaussian noise benefits the final performance in terms of the error rate. We also plot the training and testing error curves of CCNNs-18 in Fig. 2, which show that our CCNNs achieve more stable and better training and test results with a slightly faster convergence speed than the original ResNet-18.

Table 1: Results (error rate (%)) on CIFAR-10. CCNNs are based on ResNet-18, which has a smaller network stage (16-16-32-64). * denotes ResNet with Gaussian noise (Neelakantan et al. 2015).

Model      |            | K=1  | K=2  | K=4  | K=8
CCNNs      | σ = 0      | 9.68 | 8.52 | 7.82 | 7.63
CCNNs      | σ = 0.001  | 9.60 | 8.49 | 7.68 | 7.54
CCNNs      | σ = 0.0001 | 9.48 | 8.32 | 7.61 | 7.42
ResNet-18* | σ = 0.001  | -    | -    | 9.72 | -

We further use WRNs to test CCNNs on the two datasets. We replace the convolution layers with our GClayers and set up 40-layer and 28-layer networks with the same basic blocks and hyper-parameters as WRNs. The network stages are 16-32-64-128 and 64-64-128-256. The details of the CCNNs architecture are presented in Table 2. We use a weight decay of 0.0001 and a momentum of 0.9. These models are trained on 4 GPUs (Titan XP) with a mini-batch size of 128. The training procedure is terminated at 64k iterations, which is determined based on a 45k/5k train/validation split. We follow the same training strategy as in (Zou, Balan, and Singh 2018): horizontal flipping is adopted, and a 32 × 32 crop is sampled randomly from the image padded with 4 pixels on each side. For testing, we only evaluate the single view of the original 32 × 32 image.

We conduct experiments to compare CCNNs with the state-of-the-art networks (i.e., NIN (Boureau, Ponce, and LeCun 2010), VGG (Simonyan and Zisserman 2014), and ResNet (He et al. 2015)) in terms of error rate and the number of parameters. On CIFAR-10, Table 2 shows that CCNNs consistently improve the performance regardless of the number of parameters or kernels as compared with the baseline ResNet. We further compare CCNNs with the Wide Residual Network (WRN) (Zagoruyko and Komodakis 2016), and again CCNNs achieve a better result (3.81% vs. 4.00% error rate). Our model is also half the size of WRN, providing a significant advantage in terms of model efficiency. Similar to CIFAR-10, one can also observe the performance improvement on CIFAR-100, with similar parameter sizes.

Large-scale image classification: ImageNet

To show the effectiveness of CCNNs on larger images, we evaluate the network on the ImageNet (Deng et al. 2009) dataset. ImageNet consists of images with a much higher resolution. In addition, the images usually contain more than one attribute per image, which may have a large impact on the classification accuracy.

For the ImageNet experiment, we train 18-layer CCNNs based on ResNet-18. CCNNs and ResNet are trained for 120 epochs. The learning rate is initialized as 0.1 and decreased to 1/10 of the previous value every 15 epochs. Top-1 and Top-5 errors are used as evaluation metrics. The results are shown in Table 3. Compared to the baseline ResNet-18, our CCNNs achieve better classification performance (i.e., Top-1 error: 29.2% vs. 30.7%, and Top-5 error: 10.3% vs. 10.8%), which further validates the effectiveness of our method.

Experiments on object detection

Object detection is one of the fundamental problems in computer vision, which aims to detect all instances of objects from known classes, such as people and cars, in images. It has various real-world applications, ranging from robotics and autonomous cars to video surveillance and image retrieval. It is very challenging due to the severe scale variation, viewpoint change, intra-class variation, shape variation, and occlusion of objects, as well as background clutter.

PASCAL VOC 2007 dataset. It consists of 2,501 training, 2,510 validation, and 4,092 test images with bounding box annotations for 20 categories (Everingham 2007). We use both the training and validation sets for training and evaluate the detection performance on the test set in terms of the mean average precision (mAP), following the standard PASCAL VOC protocol, which reports the average precision (AP) at 50% intersection-over-union (IoU) of the detected boxes with the ground truth.

We incorporate our CCNNs into the two-stage Faster R-CNN (Ren et al. 2015), which is implemented based on the Caffe2 platform. In CCNNs, AlexNet (Krizhevsky, Sutskever, and Hinton 2012) or ResNet-18 is used as the backbone network, which is first pre-trained on the ILSVRC CLS-LOC dataset (Russakovsky et al. 2015). We train Faster R-CNN using the approximate joint training method with an effective mini-batch size of 4. For anchors, we use 4 scales with box areas of 64 × 64, 128 × 128, 256 × 256, and 512 × 512 pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. 256 anchors are then randomly sampled in an image to compute the loss function of the region proposal network (RPN).
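For reference, the 4 × 3 = 12 anchor shapes implied by these scales and aspect ratios can be enumerated as below. This uses one common convention (area fixed, width:height = r:1); the exact parameterization and rounding in the authors' Faster R-CNN configuration may differ.

```python
import math

def anchor_shapes(area_sides=(64, 128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Enumerate (width, height) pairs for the anchors described above.
    'area_sides' are the side lengths whose squares give the box areas;
    a ratio r means width : height = r : 1 (0.5 is 1:2, 2.0 is 2:1)."""
    shapes = []
    for side in area_sides:
        area = side * side
        for r in ratios:
            w = math.sqrt(area * r)
            h = math.sqrt(area / r)
            shapes.append((round(w, 1), round(h, 1)))
    return shapes

# e.g. anchor_shapes()[:3] -> [(64.0, 64.0), (45.3, 90.5), (90.5, 45.3)]
```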

[Figure 2 graphic: (a) training error and (b) test error on CIFAR-10, plotted against training epoch (0–200), for CCNN-18 and ResNet-18.]

Figure 2: Training error and test error curves on the CIFAR-10 dataset. Compared with the baseline, CCNNs achieve a slightly faster convergence speed and lower training and testing errors.

Table 2: Comparison results on the CIFAR-10 and CIFAR-100 datasets. R is neglected when counting the parameter size of CCNNs. CCNNs are based on WRNs.

Method      | # network stage kernels | # params | CIFAR-10 error (%) | CIFAR-100 error (%)
NIN         | -                       | -        | 8.81               | 35.67
VGG         | -                       | -        | 6.32               | 28.49
ResNet-110  | 16-16-32-64             | 1.7M     | 6.43               | 25.16
ResNet-1202 | 16-16-32-64             | 10.2M    | 7.83               | 27.82
WRN-40      | 64-64-128-256           | 8.9M     | 4.53               | 21.18
WRN-28      | 160-160-320-640         | 36.5M    | 4.00               | 19.25
CCNNs2-40   | 16-64-128-256           | 17.9M    | 4.62               | 21.67
CCNNs3-28   | 64-64-128-256           | 17.6M    | 3.81               | 19.11
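As a quick back-of-the-envelope check of the note that R is negligible in the parameter counts of Table 2 (our arithmetic, using the widest 256-channel stage of CCNNs3-28 as an example):

```python
# Extra parameters added by the shared R of one GClayer vs. one ordinary 3x3 convolution.
K = 4
extra_per_layer = K * 3 * 3          # 36 parameters (the K x 3 x 3 shared R)
conv_params = 256 * 256 * 3 * 3      # 589,824 parameters for a 256->256 3x3 convolution
print(extra_per_layer, conv_params, extra_per_layer / conv_params)  # 36, 589824, ~6e-05
```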

Table 3: Results on the ImageNet dataset. CCNNs are based on ResNets with K = 2.

Models    | Top-1 accuracy (%) | Top-5 accuracy (%)
ResNet-18 | 69.3               | 89.2
CCNNs-18  | 70.8               | 89.7

We re-scale the images such that their shorter side is 600 pixels. A learning rate of 0.004 for the first 12.5k mini-batches and 0.0004 for the next 5k mini-batches, a momentum of 0.9 and a weight decay of 0.0005 are used.

With the mAP measure in Table 4, the results clearly show that AlexNet-CCNNs achieve a better performance (1.1%) than the original Faster R-CNN. On ResNet-18, we can observe a similar phenomenon, where ResNet-CCNNs outperform the baseline ResNet. Considering that ResNets are widely used in various real-world applications, our CCNNs are able to further boost their performance, demonstrating the superiority of the proposed CSGD algorithm. Both CCNNs achieve much better performance than the backbone networks on aero, bus, tv, etc. In Fig. 3, we provide some detection examples based on ResNet-CCNNs, e.g., tvmonitor, bus, bird, and person. They demonstrate that CCNNs are effective for detection even though they could be misled by noisy background for localization.
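The 50% IoU matching criterion behind the APs in Table 4 below can be computed with a few lines (a generic routine for axis-aligned boxes, not taken from the authors' code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Under the VOC protocol a detection matches a ground-truth box when iou(det, gt) >= 0.5.
```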

Table 4: Comparison to the state-of-the-art methods on PASCAL VOC 2007 in terms of mAP (%) on the test set.

method   | bbone | mAP   | aero | bic  | bir  | boa  | bot  | bus  | car  | cat  | cha  | cow  | dta  | dog  | hor  | mbi  | per  | pln  | she  | sof  | tra  | tv
F. R-CNN | Alex  | 51.5  | 56.4 | 63.6 | 47.2 | 34.3 | 31.3 | 54.0 | 69.4 | 60.6 | 31.0 | 56.3 | 47.4 | 56.2 | 68.6 | 62.2 | 60.0 | 26.5 | 50.6 | 40.0 | 59.6 | 54.4
F. R-CNN | CCNNs | 52.6  | 58.4 | 61.0 | 48.7 | 34.3 | 28.2 | 57.6 | 69.4 | 61.3 | 31.5 | 59.9 | 51.0 | 54.7 | 69.8 | 65.9 | 61.3 | 28.2 | 48.9 | 40.8 | 66.0 | 55.7
F. R-CNN | Res18 | 68.3  | 69.2 | 79.3 | 65.8 | 54.0 | 47.4 | 77.7 | 79.0 | 78.5 | 52.3 | 73.3 | 63.4 | 73.7 | 80.0 | 74.8 | 77.4 | 45.1 | 64.6 | 69.6 | 75.1 | 65.8
F. R-CNN | CCNNs | 68.71 | 71.5 | 78.5 | 63.3 | 54.7 | 47.9 | 78.1 | 80.3 | 75.8 | 52.8 | 73.1 | 63.5 | 73.7 | 81.2 | 74.4 | 77.1 | 43.6 | 63.9 | 68.9 | 76.4 | 67.5
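The mAP column is simply the mean of the 20 per-class APs; for example, averaging the AlexNet-based CCNNs row reproduces its 52.6 (our check):

```python
# Per-class APs of the AlexNet-based CCNNs row in Table 4.
aps = [58.4, 61.0, 48.7, 34.3, 28.2, 57.6, 69.4, 61.3, 31.5, 59.9,
       51.0, 54.7, 69.8, 65.9, 61.3, 28.2, 48.9, 40.8, 66.0, 55.7]
print(round(sum(aps) / len(aps), 1))  # 52.6
```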

Figure 3: Examples of Faster R-CNN using CCNNs (ResNets as the backbone) on the VOC 2007 dataset.

Conclusion

In this paper, a calibrated SGD (CSGD) algorithm for deep neural network optimization has been proposed by providing a theoretical investigation into an unbiased variable estimator based on a Lipschitz assumption. CSGD is of high efficiency, being based on a mini-batch of samples as SGD is. We further develop a generic gradient calibration layer, which can be easily incorporated into any deep neural network, e.g., CNNs, with only a small fraction of the network parameters added. Extensive experiments and comparisons on the commonly used benchmarks show that the proposed CSGD algorithm can effectively improve the performance of popular CNNs, e.g., ResNets, leading to new state-of-the-art results on the benchmarks.

References

Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 3319–3327.

Boureau, Y.-L.; Ponce, J.; and LeCun, Y. 2010. A theoretical analysis of feature pooling in visual recognition. In International Conference on Machine Learning, 111–118.

Dauphin, Y. N.; Pascanu, R.; Gulcehre, C.; Cho, K.; Ganguli, S.; and Bengio, Y. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In International Conference on Neural Information Processing Systems, 2933–2941.

Deng, J.; Dong, W.; Socher, R.; Li, L. J.; Li, K.; and Li, F. F. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 248–255.

Dozat, T. 2016. Incorporating Nesterov momentum into Adam. In International Conference on Learning Representations, 1–8.

Everingham, M. 2007. The PASCAL visual object classes challenge (VOC2007) results. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/index.html. 111(1):98–136.

Gu, S.; Levine, S.; Sutskever, I.; and Mnih, A. 2016. MuProp: Unbiased backpropagation for stochastic neural networks. In ICLR, 1861–1869.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 770–778.

Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. Computer Science.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, 1097–1105.

Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Tech Report.

Neelakantan, A.; Vilnis, L.; Le, Q. V.; Sutskever, I.; Kaiser, L.; Kurach, K.; and Martens, J. 2015. Adding gradient noise improves learning for very deep networks. Computer Science.

Qian, N. 1999. On the momentum term in gradient descent learning algorithms. Neural Networks 12(1):145–151.

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6):1137–1149.

Ruder, S. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. Computer Science.

Soudry, D.; Hubara, I.; and Meir, R. 2014. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In International Conference on Neural Information Processing Systems, 963–971.

Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. British Machine Vision Conference.

Zeiler, M. D. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint.

Zhang, S.; Choromanska, A.; and Lecun, Y. 2015. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, 685–693.

Zhang, C.; Kjellström, H.; and Stephan, M. 2017. Determinantal point processes for mini-batch diversification. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 1–8.

Zou, D.; Balan, R.; and Singh, M. 2018. On Lipschitz bounds of general convolutional neural networks. arXiv preprint arXiv:1808.01415.
