arXiv:1502.03167v3 [cs.LG] 2 Mar 2015

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Google Inc., sioffe@google.com
Christian Szegedy, Google Inc., szegedy@google.com

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

1 Introduction

Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad (Duchi et al., 2011) have been used to achieve state of the art performance. SGD optimizes the parameters Θ of the network, so as to minimize the loss

Θ = arg min_Θ (1/N) Σ_{i=1}^N ℓ(x_i, Θ)

where x_{1...N} is the training data set. With SGD, the training proceeds in steps, and at each step we consider a mini-batch x_{1...m} of size m. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters, by computing

(1/m) Σ_{i=1}^m ∂ℓ(x_i, Θ)/∂Θ.

Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by the modern computing platforms.

While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.

The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing

ℓ = F_2(F_1(u, Θ_1), Θ_2)

where F_1 and F_2 are arbitrary transformations, and the parameters Θ_1, Θ_2 are to be learned so as to minimize the loss ℓ. Learning Θ_2 can be viewed as if the inputs x = F_1(u, Θ_1) are fed into the sub-network

ℓ = F_2(x, Θ_2).

For example, a gradient descent step

Θ_2 ← Θ_2 − (α/m) Σ_{i=1}^m ∂F_2(x_i, Θ_2)/∂Θ_2

(for batch size m and learning rate α) is exactly equivalent to that for a stand-alone network F_2 with input x. Therefore, the input distribution properties that make training more efficient – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such it is advantageous for the distribution of x to remain fixed over time. Then, Θ_2 does not have to readjust to compensate for the change in the distribution of x.
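To make the mini-batch gradient step above concrete, here is a minimal numpy sketch of one SGD update; the quadratic per-example loss and the helper name grad_loss are illustrative placeholders, not prescribed by the paper.

```python
import numpy as np

# Minimal sketch of one SGD step with a mini-batch gradient estimate.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)                 # parameters Θ
data = rng.normal(size=(1000, 5))          # training set x_1...N

def grad_loss(x, theta):
    # Gradient of a hypothetical per-example loss ℓ(x, Θ) = 0.5 * ||Θ - x||^2.
    return theta - x

alpha, m = 0.1, 32                         # learning rate and mini-batch size
batch = data[rng.choice(len(data), size=m, replace=False)]
# (1/m) Σ_i ∂ℓ(x_i, Θ)/∂Θ approximates the full-dataset gradient.
theta -= alpha * np.mean([grad_loss(x, theta) for x in batch], axis=0)
```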

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the sub-network, as well. Consider a layer with a sigmoid activation function z = g(Wu + b) where u is the layer input, the weight matrix W and bias vector b are the layer parameters to be learned, and g(x) = 1/(1 + exp(−x)). As |x| increases, g′(x) tends to zero. This means that for all dimensions of x = Wu + b except those with small absolute values, the gradient flowing down to u will vanish and the model will train slowly. However, since x is affected by W, b and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of x into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010), ReLU(x) = max(x, 0), careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.

In Sec. 4.2, we apply Batch Normalization to the best-performing ImageNet classification network, and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin. Using an ensemble of such networks trained with Batch Normalization, we achieve the top-5 error rate that improves upon the best known results on ImageNet classification.

2 Towards Reducing Internal Covariate Shift

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training. To improve the training, we seek to reduce the internal covariate shift. By fixing the distribution of the layer inputs x as the training progresses, we expect to improve the training speed. It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer. By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.

We could consider whitening activations at every training step or at some interval, either by modifying the network directly or by changing the parameters of the optimization algorithm to depend on the network activation values (Wiesler et al., 2014; Raiko et al., 2012; Povey et al., 2014; Desjardins & Kavukcuoglu). However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step. For example, consider a layer with the input u that adds the learned bias b, and normalizes the result by subtracting the mean of the activation computed over the training data: x̂ = x − E[x], where x = u + b, X = {x_{1...N}} is the set of values of x over the training set, and E[x] = (1/N) Σ_{i=1}^N x_i. If a gradient descent step ignores the dependence of E[x] on b, then it will update b ← b + Δb, where Δb ∝ −∂ℓ/∂x̂. Then u + (b + Δb) − E[u + (b + Δb)] = u + b − E[u + b]. Thus, the combination of the update to b and the subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss. As the training continues, b will grow indefinitely while the loss remains fixed. This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.
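The cancellation argued above is easy to check numerically. A minimal sketch follows; the data and the value of Δb are arbitrary choices for illustration only.

```python
import numpy as np

# Sketch of the bias-growth pathology: if normalization subtracts the mean,
# but the update to b ignores the mean's dependence on b, the centered
# output never changes, so b can grow without affecting the loss.
rng = np.random.default_rng(0)
u = rng.normal(size=1000)       # layer input (arbitrary data)
b, delta_b = 0.5, 0.3           # current bias and a proposed update

centered_before = (u + b) - np.mean(u + b)
centered_after = (u + b + delta_b) - np.mean(u + b + delta_b)
print(np.allclose(centered_before, centered_after))  # True: output unchanged
```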

The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters Θ. Let again x be a layer input, treated as a vector, and X be the set of these inputs over the training data set. The normalization can then be written as a transformation

x̂ = Norm(x, X)

which depends not only on the given training example x but on all examples X – each of which depends on Θ if x is generated by another layer. For backpropagation, we would need to compute the Jacobians

∂Norm(x, X)/∂x and ∂Norm(x, X)/∂X;

ignoring the latter term would lead to the explosion described above. Within this framework, whitening the layer inputs is expensive, as it requires computing the covariance matrix Cov[x] = E_{x∈X}[xx^T] − E[x]E[x]^T and its inverse square root, to produce the whitened activations Cov[x]^{−1/2}(x − E[x]), as well as the derivatives of these transforms for backpropagation. This motivates us to seek an alternative that performs input normalization in a way that is differentiable and does not require the analysis of the entire training set after every parameter update.

Some of the previous approaches (e.g. (Lyu & Simoncelli, 2008)) use statistics computed over a single training example, or, in the case of image networks, over different feature maps at a given location. However, this changes the representation ability of a network by discarding the absolute scale of activations. We want to preserve the information in the network, by normalizing the activations in a training example relative to the statistics of the entire training data.

3 Normalization via Mini-Batch Statistics

Since the full whitening of each layer's inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1. For a layer with d-dimensional input x = (x^(1) ... x^(d)), we will normalize each dimension

x̂^(k) = (x^(k) − E[x^(k)]) / sqrt(Var[x^(k)])

where the expectation and variance are computed over the training data set. As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the features are not decorrelated.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation x^(k), a pair of parameters γ^(k), β^(k), which scale and shift the normalized value:

y^(k) = γ^(k) x̂^(k) + β^(k).

These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting γ^(k) = sqrt(Var[x^(k)]) and β^(k) = E[x^(k)], we could recover the original activations, if that were the optimal thing to do.

In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation. This way, the statistics used for normalization can fully participate in the gradient backpropagation. Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.

Consider a mini-batch B of size m. Since the normalization is applied to each activation independently, let us focus on a particular activation x^(k) and omit k for clarity. We have m values of this activation in the mini-batch,

B = {x_{1...m}}.

Let the normalized values be x̂_{1...m}, and their linear transformations be y_{1...m}. We refer to the transform

BN_{γ,β}: x_{1...m} → y_{1...m}

as the Batch Normalizing Transform. We present the BN Transform in Algorithm 1. In the algorithm, ε is a constant added to the mini-batch variance for numerical stability.

Algorithm 1: Batch Normalizing Transform, applied to activation x over a mini-batch.
  Input: Values of x over a mini-batch: B = {x_{1...m}}; parameters to be learned: γ, β
  Output: {y_i = BN_{γ,β}(x_i)}

  μ_B ← (1/m) Σ_{i=1}^m x_i                    // mini-batch mean
  σ_B² ← (1/m) Σ_{i=1}^m (x_i − μ_B)²          // mini-batch variance
  x̂_i ← (x_i − μ_B) / sqrt(σ_B² + ε)           // normalize
  y_i ← γ x̂_i + β ≡ BN_{γ,β}(x_i)              // scale and shift
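For concreteness, a minimal numpy sketch of Algorithm 1 for one activation over a mini-batch; the function name and example values are illustrative, not prescribed by the method.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Algorithm 1: Batch Normalizing Transform over a mini-batch.

    x: shape (m,) -- the m values of one activation in the mini-batch B.
    gamma, beta: learned scale and shift for this activation.
    """
    mu = x.mean()                          # mini-batch mean
    var = x.var()                          # mini-batch variance (biased, 1/m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta               # scale and shift
    return y, x_hat, mu, var

# Example usage with arbitrary values.
x = np.array([0.2, -1.3, 2.1, 0.7])
y, x_hat, mu, var = batch_norm_forward(x, gamma=1.0, beta=0.0)
print(x_hat.mean(), x_hat.var())  # approximately 0 and 1
```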

The BN transform can be added to a network to manipulate any activation. In the notation y = BN_{γ,β}(x), we indicate that the parameters γ and β are to be learned, but it should be noted that the BN transform does not independently process the activation in each training example. Rather, BN_{γ,β}(x) depends both on the training example and the other examples in the mini-batch. The scaled and shifted values y are passed to other network layers. The normalized activations x̂ are internal to our transformation, but their presence is crucial. The distribution of values of any x̂ has the expected value of 0 and the variance of 1, as long as the elements of each mini-batch are sampled from the same distribution, and if we neglect ε. This can be seen by observing that Σ_{i=1}^m x̂_i = 0 and (1/m) Σ_{i=1}^m x̂_i² = 1, and taking expectations. Each normalized activation x̂^(k) can be viewed as an input to a sub-network composed of the linear transform y^(k) = γ^(k) x̂^(k) + β^(k), followed by the other processing done by the original network. These sub-network inputs all have fixed means and variances, and although the joint distribution of these normalized x̂^(k) can change over the course of training, we expect that the introduction of normalized inputs accelerates the training of the sub-network and, consequently, the network as a whole.

During training we need to backpropagate the gradient of loss ℓ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform. We use the chain rule, as follows (before simplification):

∂ℓ/∂x̂_i = ∂ℓ/∂y_i · γ
∂ℓ/∂σ_B² = Σ_{i=1}^m ∂ℓ/∂x̂_i · (x_i − μ_B) · (−1/2)(σ_B² + ε)^{−3/2}
∂ℓ/∂μ_B = (Σ_{i=1}^m ∂ℓ/∂x̂_i · (−1/sqrt(σ_B² + ε))) + ∂ℓ/∂σ_B² · (Σ_{i=1}^m −2(x_i − μ_B))/m
∂ℓ/∂x_i = ∂ℓ/∂x̂_i · 1/sqrt(σ_B² + ε) + ∂ℓ/∂σ_B² · 2(x_i − μ_B)/m + ∂ℓ/∂μ_B · 1/m
∂ℓ/∂γ = Σ_{i=1}^m ∂ℓ/∂y_i · x̂_i
∂ℓ/∂β = Σ_{i=1}^m ∂ℓ/∂y_i

Thus, the BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training. Furthermore, the learned affine transform applied to these normalized activations allows the BN transform to represent the identity transformation and preserves the network capacity.

3.1 Training and Inference with Batch-Normalized Networks

To Batch-Normalize a network, we specify a subset of activations and insert the BN transform for each of them, according to Alg. 1. Any layer that previously received x as the input, now receives BN(x). A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size m > 1, or with any of its variants such as Adagrad (Duchi et al., 2011). The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization

x̂ = (x − E[x]) / sqrt(Var[x] + ε)

using the population, rather than mini-batch, statistics. Neglecting ε, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate Var[x] = (m/(m−1)) · E_B[σ_B²], where the expectation is over training mini-batches of size m and σ_B² are their sample variances. Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation. It may further be composed with the scaling by γ and shift by β, to yield a single linear transform that replaces BN(x). Algorithm 2 summarizes the procedure for training batch-normalized networks.

Algorithm 2: Training a Batch-Normalized Network
  Input: Network N with trainable parameters Θ; subset of activations {x^(k)}_{k=1}^K
  Output: Batch-normalized network for inference, N_BN^inf

  1: N_BN^tr ← N                        // Training BN network
  2: for k = 1 ... K do
  3:   Add transformation y^(k) = BN_{γ^(k),β^(k)}(x^(k)) to N_BN^tr (Alg. 1)
  4:   Modify each layer in N_BN^tr with input x^(k) to take y^(k) instead
  5: end for
  6: Train N_BN^tr to optimize the parameters Θ ∪ {γ^(k), β^(k)}_{k=1}^K
  7: N_BN^inf ← N_BN^tr                 // Inference BN network with frozen parameters
  8: for k = 1 ... K do
  9:   // For clarity, x ≡ x^(k), γ ≡ γ^(k), μ_B ≡ μ_B^(k), etc.
 10:   Process multiple training mini-batches B, each of size m, and average over them:
         E[x] ← E_B[μ_B]
         Var[x] ← (m/(m−1)) · E_B[σ_B²]
 11:   In N_BN^inf, replace the transform y = BN_{γ,β}(x) with
         y = (γ / sqrt(Var[x] + ε)) · x + (β − γ E[x] / sqrt(Var[x] + ε))
 12: end for
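To illustrate steps 10–11 of Algorithm 2, the sketch below folds fixed population statistics into a single per-activation scale and shift; the helper name and the placeholder statistics are ours, and the moving-average alternative mentioned above would simply supply E[x] and Var[x] differently.

```python
import numpy as np

def fold_bn_for_inference(gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Fold population statistics into one linear transform y = scale*x + shift
    (Algorithm 2, step 11)."""
    scale = gamma / np.sqrt(pop_var + eps)
    shift = beta - gamma * pop_mean / np.sqrt(pop_var + eps)
    return scale, shift

# Population statistics averaged over training mini-batches (step 10);
# the numbers here are arbitrary placeholders.
mu_batches = np.array([0.10, 0.12, 0.08])    # per-batch means mu_B
var_batches = np.array([0.95, 1.05, 1.00])   # per-batch sample variances sigma_B^2
m = 32                                       # mini-batch size
pop_mean = mu_batches.mean()
pop_var = m / (m - 1) * var_batches.mean()   # unbiased estimate

scale, shift = fold_bn_for_inference(gamma=1.5, beta=0.2,
                                     pop_mean=pop_mean, pop_var=pop_var)
x = np.array([0.3, -0.7, 1.1])
y = scale * x + shift                        # deterministic inference-time BN
```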

3.2 Batch-Normalized Convolutional Networks

Batch Normalization can be applied to any set of activations in the network. Here, we focus on transforms that consist of an affine transformation followed by an element-wise nonlinearity:

z = g(Wu + b)

where W and b are learned parameters of the model, and g(·) is the nonlinearity such as sigmoid or ReLU. This formulation covers both fully-connected and convolutional layers. We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian" (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.

Note that, since we normalize Wu + b, the bias b can be ignored since its effect will be canceled by the subsequent mean subtraction (the role of the bias is subsumed by β in Alg. 1). Thus, z = g(Wu + b) is replaced with

z = g(BN(Wu))

where the BN transform is applied independently to each dimension of x = Wu, with a separate pair of learned parameters γ^(k), β^(k) per dimension.

For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m·pq. We learn a pair of parameters γ^(k) and β^(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
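A minimal sketch of this per-feature-map normalization for a convolutional activation tensor, assuming NCHW layout; the function name and tensor shapes are illustrative assumptions.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """Batch-normalize a convolutional activation of shape (m, C, p, q).

    Statistics are shared across the mini-batch and all spatial locations,
    so the effective mini-batch size per feature map is m' = m * p * q,
    and gamma/beta have one entry per feature map (shape (C,)).
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # per-feature-map mean
    var = x.var(axis=(0, 2, 3), keepdims=True)   # per-feature-map variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

# Example with arbitrary sizes: m=4 examples, C=8 feature maps of size 5x5.
x = np.random.default_rng(0).normal(size=(4, 8, 5, 5))
y = batch_norm_conv(x, gamma=np.ones(8), beta=np.zeros(8))
```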

3.3 Batch Normalization enables higher learning rates

In traditional deep networks, too high a learning rate may result in gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues. By normalizing activations throughout the network, it prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations and gradients; for instance, it prevents the training from getting stuck in the saturated regimes of nonlinearities.

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters. Indeed, for a scalar a,

BN(Wu) = BN((aW)u)

and we can show that

∂BN((aW)u)/∂u = ∂BN(Wu)/∂u
∂BN((aW)u)/∂(aW) = (1/a) · ∂BN(Wu)/∂W

The scale does not affect the layer Jacobian nor, consequently, the gradient propagation. Moreover, larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth.

We further conjecture that Batch Normalization may lead the layer Jacobians to have singular values close to 1, which is known to be beneficial for training (Saxe et al., 2013). Consider two consecutive layers with normalized inputs, and the transformation between these normalized vectors: ẑ = F(x̂). If we assume that x̂ and ẑ are Gaussian and uncorrelated, and that F(x̂) ≈ Jx̂ is a linear transformation for the given model parameters, then both x̂ and ẑ have unit covariances, and I = Cov[ẑ] = J Cov[x̂] J^T = JJ^T. Thus, JJ^T = I, and so all singular values of J are equal to 1, which preserves the gradient magnitudes during backpropagation. In reality, the transformation is not linear, and the normalized values are not guaranteed to be Gaussian nor independent, but we nevertheless expect Batch Normalization to help make gradient propagation better behaved. The precise effect of Batch Normalization on gradient propagation remains an area of further study.

3.4 Batch Normalization regularizes the model

When training with Batch Normalization, a training example is seen in conjunction with other examples in the mini-batch, and the training network no longer produces deterministic values for a given training example. In our experiments, we found this effect to be advantageous to the generalization of the network. Whereas Dropout (Srivastava et al., 2014) is typically used to reduce overfitting, in a batch-normalized network we found that it can be either removed or reduced in strength.

4 Experiments

4.1 Activations over time

To verify the effects of internal covariate shift on training, and the ability of Batch Normalization to combat it, we considered the problem of predicting the digit class on the MNIST dataset (LeCun et al., 1998a). We used a very simple network, with a 28×28 binary image as input, and 3 fully-connected hidden layers with 100 activations each. Each hidden layer computes y = g(Wu + b) with sigmoid nonlinearity, and the weights W initialized to small random Gaussian values. The last hidden layer is followed by a fully-connected layer with 10 activations (one per class) and cross-entropy loss. We trained the network for 50000 steps, with 60 examples per mini-batch. We added Batch Normalization to each hidden layer of the network, as in Sec. 3.1. We were interested in the comparison between the baseline and batch-normalized networks, rather than achieving the state of the art performance on MNIST (which the described architecture does not).

Figure 1: (a) The test accuracy of the MNIST network trained with and without Batch Normalization, vs. the number of training steps. Batch Normalization helps the network train faster and achieve higher accuracy. (b, c) The evolution of input distributions to a typical sigmoid, over the course of training, shown as {15, 50, 85}th percentiles. Batch Normalization makes the distribution more stable and reduces the internal covariate shift.

Figure 1(a) shows the fraction of correct predictions by the two networks on held-out test data, as training progresses. The batch-normalized network enjoys the higher test accuracy. To investigate why, we studied inputs to the sigmoid, in the original network N and batch-normalized network N_BN^tr (Alg. 2) over the course of training. In Fig. 1(b,c) we show, for one typical activation from the last hidden layer of each network, how its distribution evolves. The distributions in the original network change significantly over time, both in their mean and their variance, which complicates the training of the subsequent layers. In contrast, the distributions in the batch-normalized network are much more stable as training progresses, which aids the training.

4.2 ImageNet classification

We applied Batch Normalization to a new variant of the Inception network (Szegedy et al., 2014), trained on the ImageNet classification task (Russakovsky et al., 2014). The network has a large number of convolutional and pooling layers, with a softmax layer to predict the image class, out of 1000 possibilities. Convolutional layers use ReLU as the nonlinearity. The main difference to the network described in (Szegedy et al., 2014) is that the 5×5 convolutional layers are replaced by two consecutive layers of 3×3 convolutions with up to 128 filters. The network contains 13.6·10^6 parameters, and, other than the top softmax layer, has no fully-connected layers. More details are given in the Appendix. We refer to this model as Inception in the rest of the text. The model was trained using a version of Stochastic Gradient Descent with momentum (Sutskever et al., 2013), using the mini-batch size of 32. The training was performed using a large-scale, distributed architecture (similar to (Dean et al., 2012)). All networks are evaluated as training progresses by computing the validation accuracy @1, i.e. the probability of predicting the correct label out of 1000 possibilities, on a held-out set, using a single crop per image.

In our experiments, we evaluated several modifications of Inception with Batch Normalization. In all cases, Batch Normalization was applied to the input of each nonlinearity, in a convolutional way, as described in section 3.2, while keeping the rest of the architecture constant.

4.2.1 Accelerating BN Networks

Simply adding Batch Normalization to a network does not take full advantage of our method. To do so, we further changed the network and its training parameters, as follows:

Increase learning rate. In a batch-normalized model, we have been able to achieve a training speedup from higher learning rates, with no ill side effects (Sec. 3.3).

Remove Dropout. As described in Sec. 3.4, Batch Normalization fulfills some of the same goals as Dropout. Removing Dropout from Modified BN-Inception speeds up training, without increasing overfitting.

Reduce the L2 weight regularization. While in Inception an L2 loss on the model parameters controls overfitting, in Modified BN-Inception the weight of this loss is reduced by a factor of 5. We find that this improves the accuracy on the held-out validation data.

Accelerate the learning rate decay. In training Inception, the learning rate was decayed exponentially. Because our network trains faster than Inception, we lower the learning rate 6 times faster (see the sketch after this list).

Remove Local Response Normalization. While Inception and other networks (Srivastava et al., 2014) benefit from it, we found that with Batch Normalization it is not necessary.

Shuffle training examples more thoroughly. We enabled within-shard shuffling of the training data, which prevents the same examples from always appearing in a mini-batch together. This led to about 1% improvement in the validation accuracy, which is consistent with the view of Batch Normalization as a regularizer (Sec. 3.4): the randomization inherent in our method should be most beneficial when it affects an example differently each time it is seen.

Reduce the photometric distortions. Because batch-normalized networks train faster and observe each training example fewer times, we let the trainer focus on more "real" images by distorting them less.
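The exact decay constants are not given in the text, so the values below are placeholders; the sketch only illustrates what "lowering the learning rate 6 times faster" means for an exponential schedule.

```python
# Hypothetical exponential decay constants; the paper states only that
# BN-Inception traverses the learning rate decay 6 times faster.
base_lr = 0.0015        # Inception's initial learning rate (Sec. 4.2.2)
decay_rate = 0.94       # placeholder decay factor
decay_steps = 100_000   # placeholder interval (in training steps)

def lr_inception(step):
    return base_lr * decay_rate ** (step / decay_steps)

def lr_bn_inception(step, speedup=6.0):
    # Same schedule, traversed 6 times faster.
    return base_lr * decay_rate ** (speedup * step / decay_steps)

for step in (0, 200_000, 400_000):
    print(step, lr_inception(step), lr_bn_inception(step))
```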

Figure 2: Single crop validation accuracy of Inception and its batch-normalized variants, vs. the number of training steps.

Model           Steps to match Inception (72.2%)   Max accuracy
Inception       31.0·10^6                          72.2%
BN-Baseline     13.3·10^6                          72.7%
BN-x5            2.1·10^6                          73.0%
BN-x30           2.7·10^6                          74.8%
BN-x5-Sigmoid    –                                 69.8%

Figure 3: For Inception and the batch-normalized variants, the number of training steps required to reach the maximum accuracy of Inception (72.2%), and the maximum accuracy achieved by the network.

4.2.2 Single-Network Classification

We evaluated the following networks, all trained on the LSVRC2012 training data, and tested on the validation data:

Inception: the network described at the beginning of Section 4.2, trained with the initial learning rate of 0.0015.

BN-Baseline: Same as Inception with Batch Normalization before each nonlinearity.

BN-x5: Inception with Batch Normalization and the modifications in Sec. 4.2.1. The initial learning rate was increased by a factor of 5, to 0.0075. The same learning rate increase with original Inception caused the model parameters to reach machine infinity.

BN-x30: Like BN-x5, but with the initial learning rate 0.045 (30 times that of Inception).

BN-x5-Sigmoid: Like BN-x5, but with sigmoid nonlinearity g(t) = 1/(1 + exp(−t)) instead of ReLU. We also attempted to train the original Inception with sigmoid, but the model remained at the accuracy equivalent to chance.

In Figure 2, we show the validation accuracy of the networks, as a function of the number of training steps. Inception reached the accuracy of 72.2% after 31·10^6 training steps. Figure 3 shows, for each network, the number of training steps required to reach the same 72.2% accuracy, as well as the maximum validation accuracy reached by the network and the number of steps to reach it.

By only using Batch Normalization (BN-Baseline), we match the accuracy of Inception in less than half the number of training steps. By applying the modifications in Sec. 4.2.1, we significantly increase the training speed of the network. BN-x5 needs 14 times fewer steps than Inception to reach the 72.2% accuracy. Interestingly, increasing the learning rate further (BN-x30) causes the model to train somewhat slower initially, but allows it to reach a higher final accuracy. It reaches 74.8% after 6·10^6 steps, i.e. 5 times fewer steps than required by Inception to reach 72.2%.

We also verified that the reduction in internal covariate shift allows deep networks with Batch Normalization to be trained when sigmoid is used as the nonlinearity, despite the well-known difficulty of training such networks. Indeed, BN-x5-Sigmoid achieves the accuracy of 69.8%. Without Batch Normalization, Inception with sigmoid never achieves better than 1/1000 accuracy.

4.2.3 Ensemble Classification

The current reported best results on the ImageNet Large Scale Visual Recognition Competition are reached by the Deep Image ensemble of traditional models (Wu et al., 2015) and the ensemble model of (He et al., 2015). The latter reports the top-5 error of 4.94%, as evaluated by the ILSVRC server. Here we report a top-5 validation error of 4.9%, and a test error of 4.82% (according to the ILSVRC server). This improves upon the previous best result, and exceeds the estimated accuracy of human raters according to (Russakovsky et al., 2014).

For our ensemble, we used 6 networks. Each was based on BN-x30, modified via some of the following: increased initial weights in the convolutional layers; using Dropout (with the Dropout probability of 5% or 10%, vs. 40% for the original Inception); and using non-convolutional, per-activation Batch Normalization with the last hidden layers of the model. Each network achieved its maximum accuracy after about 6·10^6 training steps. The ensemble prediction was based on the arithmetic average of class probabilities predicted by the constituent networks. The details of ensemble and multicrop inference are similar to (Szegedy et al., 2014).

We demonstrate in Fig. 4 that batch normalization allows us to set a new state of the art by a healthy margin on the ImageNet classification challenge benchmarks.
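A minimal sketch of the ensemble prediction rule described above (arithmetic average of the class probabilities from the constituent networks); the array shapes and placeholder values are illustrative assumptions.

```python
import numpy as np

# probs: per-network softmax outputs, shape (n_models, n_images, n_classes).
# Here 6 models and 1000 ImageNet classes, with random placeholder values.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10, 1000))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

ensemble_probs = probs.mean(axis=0)                  # arithmetic average over models
top5 = np.argsort(-ensemble_probs, axis=-1)[:, :5]   # top-5 predictions per image
```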

Model                      Resolution   Crops   Models   Top-1 error   Top-5 error
GoogLeNet ensemble         224          144     7        -             6.67%
Deep Image low-res         256          -       1        -             7.96%
Deep Image high-res        512          -       1        24.88%        7.42%
Deep Image ensemble        variable     -       -        -             5.98%
BN-Inception single crop   224          1       1        25.2%         7.82%
BN-Inception multicrop     224          144     1        21.99%        5.82%
BN-Inception ensemble      224          144     6        20.1%         4.9%*

Figure 4: Batch-Normalized Inception comparison with previous state of the art on the provided validation set comprising 50000 images. *BN-Inception ensemble has reached 4.82% top-5 error on the 100000 images of the test set of ImageNet as reported by the test server.

5 Conclusion

We have presented a novel mechanism for dramatically accelerating the training of deep networks. It is based on the premise that covariate shift, which is known to complicate the training of machine learning systems, also applies to sub-networks and layers, and removing it from internal activations of the network may aid in training. Our proposed method draws its power from normalizing activations, and from incorporating this normalization in the network architecture itself. This ensures that the normalization is appropriately handled by any optimization method that is being used to train the network. To enable stochastic optimization methods commonly used in deep network training, we perform the normalization for each mini-batch, and backpropagate the gradients through the normalization parameters. Batch Normalization adds only two extra parameters per activation, and in doing so preserves the representation ability of the network. We presented an algorithm for constructing, training, and performing inference with batch-normalized networks. The resulting networks can be trained with saturating nonlinearities, are more tolerant to increased training rates, and often do not require Dropout for regularization.

Merely adding Batch Normalization to a state-of-the-art image classification model yields a substantial speedup in training. By further increasing the learning rates, removing Dropout, and applying other modifications afforded by Batch Normalization, we reach the previous state of the art with only a small fraction of training steps – and then beat the state of the art in single-network image classification. Furthermore, by combining multiple models trained with Batch Normalization, we perform better than the best known system on ImageNet, by a significant margin.

Interestingly, our method bears similarity to the standardization layer of (Gülçehre & Bengio, 2013), though the two methods stem from very different goals, and perform different tasks. The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training, and in our experiments we apply it before the nonlinearity since that is where matching the first and second moments is more likely to result in a stable distribution. On the contrary, (Gülçehre & Bengio, 2013) apply the standardization layer to the output of the nonlinearity, which results in sparser activations. In our large-scale image classification experiments, we have not observed the nonlinearity inputs to be sparse, neither with nor without Batch Normalization. Other notable differentiating characteristics of Batch Normalization include the learned scale and shift that allow the BN transform to represent identity (the standardization layer did not require this since it was followed by the learned linear transform that, conceptually, absorbs the necessary scale and shift), handling of convolutional layers, deterministic inference that does not depend on the mini-batch, and batch-normalizing each convolutional layer in the network.

In this work, we have not explored the full range of possibilities that Batch Normalization potentially enables. Our future work includes applications of our method to Recurrent Neural Networks (Pascanu et al., 2013), where the internal covariate shift and the vanishing or exploding gradients may be especially severe, and which would allow us to more thoroughly test the hypothesis that normalization improves gradient propagation (Sec. 3.3). We plan to investigate whether Batch Normalization can help with domain adaptation, in its traditional sense – i.e. whether the normalization performed by the network would allow it to more easily generalize to new data distributions, perhaps with just a recomputation of the population means and variances (Alg. 2). Finally, we believe that further theoretical analysis of the algorithm would allow still more improvements and applications.

References

Bengio, Yoshua and Glorot, Xavier. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249–256, May 2010.

Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc'Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.

Desjardins, Guillaume and Kavukcuoglu, Koray. Natural neural networks. (unpublished).

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435.

Gülçehre, Çaglar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. CoRR, abs/1301.4083, 2013.

He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015.

Hyvärinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Netw., 13(4-5):411–430, May 2000.

Jiang, Jing. A literature survey on domain adaptation of statistical classifiers, 2008.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a.

LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the Trade. Springer, 1998b.

Lyu, S. and Simoncelli, E. P. Nonlinear image representation using divisive normalization. In Proc. Computer Vision and Pattern Recognition, pp. 1–8. IEEE Computer Society, Jun 23–28 2008. doi: 10.1109/CVPR.2008.4587821.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814. Omnipress, 2010.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, pp. 1310–1318, 2013.

Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014.

Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 924–932, 2012.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014.

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.

Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, October 2000.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.

Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In ICML (3), volume 28 of JMLR Proceedings, pp. 1139–1147. JMLR.org, 2013.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.

Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.

Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 180–184, Florence, Italy, May 2014.

Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and Sun, Gang. Deep image: Scaling up image recognition, 2015.

Appendix

Variant of the Inception Model Used

Figure 5 documents the changes that were performed relative to the GoogLeNet architecture. For the interpretation of this table, please consult (Szegedy et al., 2014). The notable architecture changes compared to the GoogLeNet model include:

• The 5×5 convolutional layers are replaced by two consecutive 3×3 convolutional layers. This increases the maximum depth of the network by 9 weight layers. It also increases the number of parameters by 25%, and the computational cost is increased by about 30%.

• The number of 28×28 Inception modules is increased from 2 to 3.

• Inside the modules, sometimes average-pooling and sometimes maximum-pooling is employed. This is indicated in the entries corresponding to the pooling layers of the table.

• There are no across-the-board pooling layers between any two Inception modules, but stride-2 convolution/pooling layers are employed before the filter concatenation in modules 3c and 4e.

Our model employed separable convolution with depth multiplier 8 on the first convolutional layer. This reduces the computational cost while increasing the memory consumption at training time.

type             patch size/stride   output size    depth   #1×1   #3×3 reduce   #3×3   double #3×3 reduce   double #3×3   Pool +proj
convolution*     7×7/2               112×112×64     1
max pool         3×3/2               56×56×64       0
convolution      3×3/1               56×56×192      1              64            192
max pool         3×3/2               28×28×192      0
inception (3a)                       28×28×256      3       64     64            64     64                   96            avg + 32
inception (3b)                       28×28×320      3       64     64            96     64                   96            avg + 64
inception (3c)   stride 2            28×28×576      3       0      128           160    64                   96            max + pass through
inception (4a)                       14×14×576      3       224    64            96     96                   128           avg + 128
inception (4b)                       14×14×576      3       192    96            128    96                   128           avg + 128
inception (4c)                       14×14×576      3       160    128           160    128                  160           avg + 128
inception (4d)                       14×14×576      3       96     128           192    160                  192           avg + 128
inception (4e)   stride 2            14×14×1024     3       0      128           192    192                  256           max + pass through
inception (5a)                       7×7×1024       3       352    192           320    160                  224           avg + 128
inception (5b)                       7×7×1024       3       352    192           320    192                  224           max + 128
avg pool         7×7/1               1×1×1024       0

Figure 5: Inception architecture
