arXiv:1502.03167v3 [cs.LG] 2 Mar 2015

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Google Inc., sioffe@google.com
Christian Szegedy, Google Inc., szegedy@google.com

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

1 Introduction

Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad (Duchi et al., 2011) have been used to achieve state of the art performance. SGD optimizes the parameters Θ of the network, so as to minimize the loss

Θ = arg min_Θ (1/N) Σ_{i=1}^N ℓ(x_i, Θ)

where x_{1...N} is the training data set. With SGD, the training proceeds in steps, and at each step we consider a mini-batch x_{1...m} of size m. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters, by computing

(1/m) Σ_{i=1}^m ∂ℓ(x_i, Θ)/∂Θ.

Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by the modern computing platforms.

While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.

The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing

ℓ = F_2(F_1(u, Θ_1), Θ_2)

where F_1 and F_2 are arbitrary transformations, and the parameters Θ_1, Θ_2 are to be learned so as to minimize the loss ℓ. Learning Θ_2 can be viewed as if the inputs x = F_1(u, Θ_1) are fed into the sub-network

ℓ = F_2(x, Θ_2).

For example, a gradient descent step

Θ_2 ← Θ_2 − (α/m) Σ_{i=1}^m ∂F_2(x_i, Θ_2)/∂Θ_2

(for batch size m and learning rate α) is exactly equivalent to that for a stand-alone network F_2 with input x. Therefore, the input distribution properties that make training more efficient – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such it is advantageous for the distribution of x to remain fixed over time. Then, Θ_2 does not have to readjust to compensate for the change in the distribution of x.
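To make the mini-batch gradient step above concrete, here is a minimal numpy sketch of one SGD update; the quadratic per-example loss and the helper name grad_loss are illustrative placeholders, not prescribed by the paper.

```python
import numpy as np

# Minimal sketch of one SGD step with a mini-batch gradient estimate.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)                 # parameters Θ
data = rng.normal(size=(1000, 5))          # training set x_1...N

def grad_loss(x, theta):
    # Gradient of a hypothetical per-example loss ℓ(x, Θ) = 0.5 * ||Θ - x||^2.
    return theta - x

alpha, m = 0.1, 32                         # learning rate and mini-batch size
batch = data[rng.choice(len(data), size=m, replace=False)]
# (1/m) Σ_i ∂ℓ(x_i, Θ)/∂Θ approximates the full-dataset gradient.
theta -= alpha * np.mean([grad_loss(x, theta) for x in batch], axis=0)
```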

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the sub-network, as well. Consider a layer with a sigmoid activation function z = g(Wu + b) where u is the layer input, the weight matrix W and bias vector b are the layer parameters to be learned, and g(x) = 1/(1 + exp(−x)). As |x| increases, g′(x) tends to zero. This means that for all dimensions of x = Wu + b except those with small absolute values, the gradient flowing down to u will vanish and the model will train slowly. However, since x is affected by W, b and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of x into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010), ReLU(x) = max(x, 0), careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.

In Sec. 4.2, we apply Batch Normalization to the best-performing ImageNet classification network, and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin. Using an ensemble of such networks trained with Batch Normalization, we achieve the top-5 error rate that improves upon the best known results on ImageNet classification.

2 Towards Reducing Internal Covariate Shift

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training. To improve the training, we seek to reduce the internal covariate shift. By fixing the distribution of the layer inputs x as the training progresses, we expect to improve the training speed. It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer. By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.

We could consider whitening activations at every training step or at some interval, either by modifying the network directly or by changing the parameters of the optimization algorithm to depend on the network activation values (Wiesler et al., 2014; Raiko et al., 2012; Povey et al., 2014; Desjardins & Kavukcuoglu). However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step. For example, consider a layer with the input u that adds the learned bias b, and normalizes the result by subtracting the mean of the activation computed over the training data: x̂ = x − E[x], where x = u + b, X = {x_{1...N}} is the set of values of x over the training set, and E[x] = (1/N) Σ_{i=1}^N x_i. If a gradient descent step ignores the dependence of E[x] on b, then it will update b ← b + Δb, where Δb ∝ −∂ℓ/∂x̂. Then u + (b + Δb) − E[u + (b + Δb)] = u + b − E[u + b]. Thus, the combination of the update to b and the subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss. As the training continues, b will grow indefinitely while the loss remains fixed. This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.
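The cancellation argued above is easy to check numerically. A minimal sketch follows; the data and the value of Δb are arbitrary choices for illustration only.

```python
import numpy as np

# Sketch of the bias-growth pathology: if normalization subtracts the mean,
# but the update to b ignores the mean's dependence on b, the centered
# output never changes, so b can grow without affecting the loss.
rng = np.random.default_rng(0)
u = rng.normal(size=1000)       # layer input (arbitrary data)
b, delta_b = 0.5, 0.3           # current bias and a proposed update

centered_before = (u + b) - np.mean(u + b)
centered_after = (u + b + delta_b) - np.mean(u + b + delta_b)
print(np.allclose(centered_before, centered_after))  # True: output unchanged
```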

The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters Θ. Let again x be a layer input, treated as a vector, and X be the set of these inputs over the training data set. The normalization can then be written as a transformation

x̂ = Norm(x, X)

which depends not only on the given training example x but on all examples X – each of which depends on Θ if x is generated by another layer. For backpropagation, we would need to compute the Jacobians

∂Norm(x, X)/∂x and ∂Norm(x, X)/∂X;

ignoring the latter term would lead to the explosion described above. Within this framework, whitening the layer inputs is expensive, as it requires computing the covariance matrix Cov[x] = E_{x∈X}[xx^T] − E[x]E[x]^T and its inverse square root, to produce the whitened activations Cov[x]^{−1/2}(x − E[x]), as well as the derivatives of these transforms for backpropagation. This motivates us to seek an alternative that performs input normalization in a way that is differentiable and does not require the analysis of the entire training set after every parameter update.

Some of the previous approaches (e.g. (Lyu & Simoncelli, 2008)) use statistics computed over a single training example, or, in the case of image networks, over different feature maps at a given location. However, this changes the representation ability of a network by discarding the absolute scale of activations. We want to preserve the information in the network, by normalizing the activations in a training example relative to the statistics of the entire training data.

3 Normalization via Mini-Batch Statistics

Since the full whitening of each layer's inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1. For a layer with d-dimensional input x = (x^(1) ... x^(d)), we will normalize each dimension

x̂^(k) = (x^(k) − E[x^(k)]) / sqrt(Var[x^(k)])

where the expectation and variance are computed over the training data set. As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the features are not decorrelated.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation x^(k), a pair of parameters γ^(k), β^(k), which scale and shift the normalized value:

y^(k) = γ^(k) x̂^(k) + β^(k).

These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting γ^(k) = sqrt(Var[x^(k)]) and β^(k) = E[x^(k)], we could recover the original activations, if that were the optimal thing to do.

In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation. This way, the statistics used for normalization can fully participate in the gradient backpropagation. Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.

Consider a mini-batch B of size m. Since the normalization is applied to each activation independently, let us focus on a particular activation x^(k) and omit k for clarity. We have m values of this activation in the mini-batch,

B = {x_{1...m}}.

Let the normalized values be x̂_{1...m}, and their linear transformations be y_{1...m}. We refer to the transform

BN_{γ,β}: x_{1...m} → y_{1...m}

as the Batch Normalizing Transform. We present the BN Transform in Algorithm 1. In the algorithm, ε is a constant added to the mini-batch variance for numerical stability.

Algorithm 1: Batch Normalizing Transform, applied to activation x over a mini-batch.
  Input: Values of x over a mini-batch: B = {x_{1...m}}; parameters to be learned: γ, β
  Output: {y_i = BN_{γ,β}(x_i)}

  μ_B ← (1/m) Σ_{i=1}^m x_i                    // mini-batch mean
  σ_B² ← (1/m) Σ_{i=1}^m (x_i − μ_B)²          // mini-batch variance
  x̂_i ← (x_i − μ_B) / sqrt(σ_B² + ε)           // normalize
  y_i ← γ x̂_i + β ≡ BN_{γ,β}(x_i)              // scale and shift
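For concreteness, a minimal numpy sketch of Algorithm 1 for one activation over a mini-batch; the function name and example values are illustrative, not prescribed by the method.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Algorithm 1: Batch Normalizing Transform over a mini-batch.

    x: shape (m,) -- the m values of one activation in the mini-batch B.
    gamma, beta: learned scale and shift for this activation.
    """
    mu = x.mean()                          # mini-batch mean
    var = x.var()                          # mini-batch variance (biased, 1/m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta               # scale and shift
    return y, x_hat, mu, var

# Example usage with arbitrary values.
x = np.array([0.2, -1.3, 2.1, 0.7])
y, x_hat, mu, var = batch_norm_forward(x, gamma=1.0, beta=0.0)
print(x_hat.mean(), x_hat.var())  # approximately 0 and 1
```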

The BN transform can be added to a network to manipulate any activation. In the notation y = BN_{γ,β}(x), we indicate that the parameters γ and β are to be learned, but it should be noted that the BN transform does not independently process the activation in each training example. Rather, BN_{γ,β}(x) depends both on the training example and the other examples in the mini-batch. The scaled and shifted values y are passed to other network layers. The normalized activations x̂ are internal to our transformation, but their presence is crucial. The distribution of values of any x̂ has the expected value of 0 and the variance of 1, as long as the elements of each mini-batch are sampled from the same distribution, and if we neglect ε. This can be seen by observing that Σ_{i=1}^m x̂_i = 0 and (1/m) Σ_{i=1}^m x̂_i² = 1, and taking expectations. Each normalized activation x̂^(k) can be viewed as an input to a sub-network composed of the linear transform y^(k) = γ^(k) x̂^(k) + β^(k), followed by the other processing done by the original network. These sub-network inputs all have fixed means and variances, and although the joint distribution of these normalized x̂^(k) can change over the course of training, we expect that the introduction of normalized inputs accelerates the training of the sub-network and, consequently, the network as a whole.

During training we need to backpropagate the gradient of loss ℓ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform. We use the chain rule, as follows (before simplification):

∂ℓ/∂x̂_i = ∂ℓ/∂y_i · γ
∂ℓ/∂σ_B² = Σ_{i=1}^m ∂ℓ/∂x̂_i · (x_i − μ_B) · (−1/2)(σ_B² + ε)^{−3/2}
∂ℓ/∂μ_B = (Σ_{i=1}^m ∂ℓ/∂x̂_i · (−1/sqrt(σ_B² + ε))) + ∂ℓ/∂σ_B² · (Σ_{i=1}^m −2(x_i − μ_B))/m
∂ℓ/∂x_i = ∂ℓ/∂x̂_i · 1/sqrt(σ_B² + ε) + ∂ℓ/∂σ_B² · 2(x_i − μ_B)/m + ∂ℓ/∂μ_B · 1/m
∂ℓ/∂γ = Σ_{i=1}^m ∂ℓ/∂y_i · x̂_i
∂ℓ/∂β = Σ_{i=1}^m ∂ℓ/∂y_i

Thus, the BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training. Furthermore, the learned affine transform applied to these normalized activations allows the BN transform to represent the identity transformation and preserves the network capacity.

3.1 Training and Inference with Batch-Normalized Networks

To Batch-Normalize a network, we specify a subset of activations and insert the BN transform for each of them, according to Alg. 1. Any layer that previously received x as the input, now receives BN(x). A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size m > 1, or with any of its variants such as Adagrad (Duchi et al., 2011). The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization

x̂ = (x − E[x]) / sqrt(Var[x] + ε)

using the population, rather than mini-batch, statistics. Neglecting ε, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate Var[x] = (m/(m−1)) · E_B[σ_B²], where the expectation is over training mini-batches of size m and σ_B² are their sample variances. Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation. It may further be composed with the scaling by γ and shift by β, to yield a single linear transform that replaces BN(x). Algorithm 2 summarizes the procedure for training batch-normalized networks.

Algorithm 2: Training a Batch-Normalized Network
  Input: Network N with trainable parameters Θ; subset of activations {x^(k)}_{k=1}^K
  Output: Batch-normalized network for inference, N_BN^inf

  1: N_BN^tr ← N                        // Training BN network
  2: for k = 1 ... K do
  3:   Add transformation y^(k) = BN_{γ^(k),β^(k)}(x^(k)) to N_BN^tr (Alg. 1)
  4:   Modify each layer in N_BN^tr with input x^(k) to take y^(k) instead
  5: end for
  6: Train N_BN^tr to optimize the parameters Θ ∪ {γ^(k), β^(k)}_{k=1}^K
  7: N_BN^inf ← N_BN^tr                 // Inference BN network with frozen parameters
  8: for k = 1 ... K do
  9:   // For clarity, x ≡ x^(k), γ ≡ γ^(k), μ_B ≡ μ_B^(k), etc.
 10:   Process multiple training mini-batches B, each of size m, and average over them:
         E[x] ← E_B[μ_B]
         Var[x] ← (m/(m−1)) · E_B[σ_B²]
 11:   In N_BN^inf, replace the transform y = BN_{γ,β}(x) with
         y = (γ / sqrt(Var[x] + ε)) · x + (β − γ E[x] / sqrt(Var[x] + ε))
 12: end for
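To illustrate steps 10–11 of Algorithm 2, the sketch below folds fixed population statistics into a single per-activation scale and shift; the helper name and the placeholder statistics are ours, and the moving-average alternative mentioned above would simply supply E[x] and Var[x] differently.

```python
import numpy as np

def fold_bn_for_inference(gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Fold population statistics into one linear transform y = scale*x + shift
    (Algorithm 2, step 11)."""
    scale = gamma / np.sqrt(pop_var + eps)
    shift = beta - gamma * pop_mean / np.sqrt(pop_var + eps)
    return scale, shift

# Population statistics averaged over training mini-batches (step 10);
# the numbers here are arbitrary placeholders.
mu_batches = np.array([0.10, 0.12, 0.08])    # per-batch means mu_B
var_batches = np.array([0.95, 1.05, 1.00])   # per-batch sample variances sigma_B^2
m = 32                                       # mini-batch size
pop_mean = mu_batches.mean()
pop_var = m / (m - 1) * var_batches.mean()   # unbiased estimate

scale, shift = fold_bn_for_inference(gamma=1.5, beta=0.2,
                                     pop_mean=pop_mean, pop_var=pop_var)
x = np.array([0.3, -0.7, 1.1])
y = scale * x + shift                        # deterministic inference-time BN
```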

3.2 Batch-Normalized Convolutional Networks

Batch Normalization can be applied to any set of activations in the network. Here, we focus on transforms that consist of an affine transformation followed by an element-wise nonlinearity:

z = g(Wu + b)

where W and b are learned parameters of the model, and g(·) is the nonlinearity such as sigmoid or ReLU. This formulation covers both fully-connected and convolutional layers. We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian" (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.

Note that, since we normalize Wu + b, the bias b can be ignored since its effect will be canceled by the subsequent mean subtraction (the role of the bias is subsumed by β in Alg. 1). Thus, z = g(Wu + b) is replaced with

z = g(BN(Wu))

where the BN transform is applied independently to each dimension of x = Wu, with a separate pair of learned parameters γ^(k), β^(k) per dimension.

For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m·pq. We learn a pair of parameters γ^(k) and β^(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
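A minimal sketch of this per-feature-map normalization for a convolutional activation tensor, assuming NCHW layout; the function name and tensor shapes are illustrative assumptions.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """Batch-normalize a convolutional activation of shape (m, C, p, q).

    Statistics are shared across the mini-batch and all spatial locations,
    so the effective mini-batch size per feature map is m' = m * p * q,
    and gamma/beta have one entry per feature map (shape (C,)).
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # per-feature-map mean
    var = x.var(axis=(0, 2, 3), keepdims=True)   # per-feature-map variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

# Example with arbitrary sizes: m=4 examples, C=8 feature maps of size 5x5.
x = np.random.default_rng(0).normal(size=(4, 8, 5, 5))
y = batch_norm_conv(x, gamma=np.ones(8), beta=np.zeros(8))
```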

3.3 Batch Normalization enables higher learning rates

In traditional deep networks, too high a learning rate may result in gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues. By normalizing activations throughout the network, it prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations and gradients; for instance, it prevents the training from getting stuck in the saturated regimes of nonlinearities.

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters. Indeed, for a scalar a,

BN(Wu) = BN((aW)u)

and we can show that

∂BN((aW)u)/∂u = ∂BN(Wu)/∂u
∂BN((aW)u)/∂(aW) = (1/a) · ∂BN(Wu)/∂W

The scale does not affect the layer Jacobian nor, consequently, the gradient propagation. Moreover, larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth.

We further conjecture that Batch Normalization may lead the layer Jacobians to have singular values close to 1, which is known to be beneficial for training (Saxe et al., 2013). Consider two consecutive layers with normalized inputs, and the transformation between these normalized vectors: ẑ = F(x̂). If we assume that x̂ and ẑ are Gaussian and uncorrelated, and that F(x̂) ≈ Jx̂ is a linear transformation for the given model parameters, then both x̂ and ẑ have unit covariances, and I = Cov[ẑ] = J Cov[x̂] J^T = JJ^T. Thus, JJ^T = I, and so all singular values of J are equal to 1, which preserves the gradient magnitudes during backpropagation. In reality, the transformation is not linear, and the normalized values are not guaranteed to be Gaussian nor independent, but we nevertheless expect Batch Normalization to help make gradient propagation better behaved. The precise effect of Batch Normalization on gradient propagation remains an area of further study.

3.4 Batch Normalization regularizes the model

When training with Batch Normalization, a training example is seen in conjunction with other examples in the mini-batch, and the training network no longer produces deterministic values for a given training example. In our experiments, we found this effect to be advantageous to the generalization of the network. Whereas Dropout (Srivastava et al., 2014) is typically used to reduce overfitting, in a batch-normalized network we found that it can be either removed or reduced in strength.

4 Experiments

4.1 Activations over time

To verify the effects of internal covariate shift on training, and the ability of Batch Normalization to combat it, we considered the problem of predicting the digit class on the MNIST dataset (LeCun et al., 1998a). We used a very simple network, with a 28×28 binary image as input, and 3 fully-connected hidden layers with 100 activations each. Each hidden layer computes y = g(Wu + b) with sigmoid nonlinearity, and the weights W initialized to small random Gaussian values. The last hidden layer is followed by a fully-connected layer with 10 activations (one per class) and cross-entropy loss. We trained the network for 50000 steps, with 60 examples per mini-batch. We added Batch Normalization to each hidden layer of the network, as in Sec. 3.1. We were interested in the comparison between the baseline and batch-normalized networks, rather than achieving the state of the art performance on MNIST (which the described architecture does not).

Figure 1: (a) The test accuracy of the MNIST network trained with and without Batch Normalization, vs. the number of training steps. Batch Normalization helps the network train faster and achieve higher accuracy. (b, c) The evolution of input distributions to a typical sigmoid, over the course of training, shown as {15, 50, 85}th percentiles. Batch Normalization makes the distribution more stable and reduces the internal covariate shift.

Figure 1(a) shows the fraction of correct predictions by the two networks on held-out test data, as training progresses. The batch-normalized network enjoys the higher test accuracy. To investigate why, we studied inputs to the sigmoid, in the original network N and batch-normalized network N_BN^tr (Alg. 2) over the course of training. In Fig. 1(b,c) we show, for one typical activation from the last hidden layer of each network, how its distribution evolves. The distributions in the original network change significantly over time, both in their mean and their variance, which complicates the training of the subsequent layers. In contrast, the distributions in the batch-normalized network are much more stable as training progresses, which aids the training.

4.2 ImageNet classification

We applied Batch Normalization to a new variant of the Inception network (Szegedy et al., 2014), trained on the ImageNet classification task (Russakovsky et al., 2014). The network has a large number of convolutional and pooling layers, with a softmax layer to predict the image class, out of 1000 possibilities. Convolutional layers use ReLU as the nonlinearity. The main difference to the network described in (Szegedy et al., 2014) is that the 5×5 convolutional layers are replaced by two consecutive layers of 3×3 convolutions with up to 128 filters. The network contains 13.6·10^6 parameters, and, other than the top softmax layer, has no fully-connected layers. More details are given in the Appendix. We refer to this model as Inception in the rest of the text. The model was trained using a version of Stochastic Gradient Descent with momentum (Sutskever et al., 2013), using the mini-batch size of 32. The training was performed using a large-scale, distributed architecture (similar to (Dean et al., 2012)). All networks are evaluated as training progresses by computing the validation accuracy @1, i.e. the probability of predicting the correct label out of 1000 possibilities, on a held-out set, using a single crop per image.

In our experiments, we evaluated several modifications of Inception with Batch Normalization. In all cases, Batch Normalization was applied to the input of each nonlinearity, in a convolutional way, as described in section 3.2, while keeping the rest of the architecture constant.

4.2.1 Accelerating BN Networks

Simply adding Batch Normalization to a network does not take full advantage of our method. To do so, we further changed the network and its training parameters, as follows:

Increase learning rate. In a batch-normalized model, we have been able to achieve a training speedup from higher learning rates, with no ill side effects (Sec. 3.3).

Remove Dropout. As described in Sec. 3.4, Batch Normalization fulfills some of the same goals as Dropout. Removing Dropout from Modified BN-Inception speeds up training, without increasing overfitting.

Reduce the L2 weight regularization. While in Inception an L2 loss on the model parameters controls overfitting, in Modified BN-Inception the weight of this loss is reduced by a factor of 5. We find that this improves the accuracy on the held-out validation data.

Accelerate the learning rate decay. In training Inception, the learning rate was decayed exponentially. Because our network trains faster than Inception, we lower the learning rate 6 times faster (see the sketch after this list).

Remove Local Response Normalization. While Inception and other networks (Srivastava et al., 2014) benefit from it, we found that with Batch Normalization it is not necessary.

Shuffle training examples more thoroughly. We enabled within-shard shuffling of the training data, which prevents the same examples from always appearing in a mini-batch together. This led to about 1% improvement in the validation accuracy, which is consistent with the view of Batch Normalization as a regularizer (Sec. 3.4): the randomization inherent in our method should be most beneficial when it affects an example differently each time it is seen.

Reduce the photometric distortions. Because batch-normalized networks train faster and observe each training example fewer times, we let the trainer focus on more "real" images by distorting them less.
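The exact decay constants are not given in the text, so the values below are placeholders; the sketch only illustrates what "lowering the learning rate 6 times faster" means for an exponential schedule.

```python
# Hypothetical exponential decay constants; the paper states only that
# BN-Inception traverses the learning rate decay 6 times faster.
base_lr = 0.0015        # Inception's initial learning rate (Sec. 4.2.2)
decay_rate = 0.94       # placeholder decay factor
decay_steps = 100_000   # placeholder interval (in training steps)

def lr_inception(step):
    return base_lr * decay_rate ** (step / decay_steps)

def lr_bn_inception(step, speedup=6.0):
    # Same schedule, traversed 6 times faster.
    return base_lr * decay_rate ** (speedup * step / decay_steps)

for step in (0, 200_000, 400_000):
    print(step, lr_inception(step), lr_bn_inception(step))
```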

Figure 2: Single crop validation accuracy of Inception and its batch-normalized variants, vs. the number of training steps.

Model           Steps to match Inception (72.2%)   Max accuracy
Inception       31.0·10^6                          72.2%
BN-Baseline     13.3·10^6                          72.7%
BN-x5            2.1·10^6                          73.0%
BN-x30           2.7·10^6                          74.8%
BN-x5-Sigmoid    –                                 69.8%

Figure 3: For Inception and the batch-normalized variants, the number of training steps required to reach the maximum accuracy of Inception (72.2%), and the maximum accuracy achieved by the network.

4.2.2 Single-Network Classification

We evaluated the following networks, all trained on the LSVRC2012 training data, and tested on the validation data:

Inception: the network described at the beginning of Section 4.2, trained with the initial learning rate of 0.0015.

BN-Baseline: Same as Inception with Batch Normalization before each nonlinearity.

BN-x5: Inception with Batch Normalization and the modifications in Sec. 4.2.1. The initial learning rate was increased by a factor of 5, to 0.0075. The same learning rate increase with original Inception caused the model parameters to reach machine infinity.

BN-x30: Like BN-x5, but with the initial learning rate 0.045 (30 times that of Inception).

BN-x5-Sigmoid: Like BN-x5, but with sigmoid nonlinearity g(t) = 1/(1 + exp(−t)) instead of ReLU. We also attempted to train the original Inception with sigmoid, but the model remained at the accuracy equivalent to chance.

In Figure 2, we show the validation accuracy of the networks, as a function of the number of training steps. Inception reached the accuracy of 72.2% after 31·10^6 training steps. Figure 3 shows, for each network, the number of training steps required to reach the same 72.2% accuracy, as well as the maximum validation accuracy reached by the network and the number of steps to reach it.

By only using Batch Normalization (BN-Baseline), we match the accuracy of Inception in less than half the number of training steps. By applying the modifications in Sec. 4.2.1, we significantly increase the training speed of the network. BN-x5 needs 14 times fewer steps than Inception to reach the 72.2% accuracy. Interestingly, increasing the learning rate further (BN-x30) causes the model to train somewhat slower initially, but allows it to reach a higher final accuracy. It reaches 74.8% after 6·10^6 steps, i.e. 5 times fewer steps than required by Inception to reach 72.2%.

We also verified that the reduction in internal covariate shift allows deep networks with Batch Normalization to be trained when sigmoid is used as the nonlinearity, despite the well-known difficulty of training such networks. Indeed, BN-x5-Sigmoid achieves the accuracy of 69.8%. Without Batch Normalization, Inception with sigmoid never achieves better than 1/1000 accuracy.

4.2.3 Ensemble Classification

The current reported best results on the ImageNet Large Scale Visual Recognition Competition are reached by the Deep Image ensemble of traditional models (Wu et al., 2015) and the ensemble model of (He et al., 2015). The latter reports the top-5 error of 4.94%, as evaluated by the ILSVRC server. Here we report a top-5 validation error of 4.9%, and a test error of 4.82% (according to the ILSVRC server). This improves upon the previous best result, and exceeds the estimated accuracy of human raters according to (Russakovsky et al., 2014).

For our ensemble, we used 6 networks. Each was based on BN-x30, modified via some of the following: increased initial weights in the convolutional layers; using Dropout (with the Dropout probability of 5% or 10%, vs. 40% for the original Inception); and using non-convolutional, per-activation Batch Normalization with the last hidden layers of the model. Each network achieved its maximum accuracy after about 6·10^6 training steps. The ensemble prediction was based on the arithmetic average of class probabilities predicted by the constituent networks. The details of ensemble and multicrop inference are similar to (Szegedy et al., 2014).

We demonstrate in Fig. 4 that batch normalization allows us to set a new state of the art by a healthy margin on the ImageNet classification challenge benchmarks.
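A minimal sketch of the ensemble prediction rule described above (arithmetic average of the class probabilities from the constituent networks); the array shapes and placeholder values are illustrative assumptions.

```python
import numpy as np

# probs: per-network softmax outputs, shape (n_models, n_images, n_classes).
# Here 6 models and 1000 ImageNet classes, with random placeholder values.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10, 1000))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

ensemble_probs = probs.mean(axis=0)                  # arithmetic average over models
top5 = np.argsort(-ensemble_probs, axis=-1)[:, :5]   # top-5 predictions per image
```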

Model                      Resolution   Crops   Models   Top-1 error   Top-5 error
GoogLeNet ensemble         224          144     7        -             6.67%
Deep Image low-res         256          -       1        -             7.96%
Deep Image high-res        512          -       1        24.88%        7.42%
Deep Image ensemble        variable     -       -        -             5.98%
BN-Inception single crop   224          1       1        25.2%         7.82%
BN-Inception multicrop     224          144     1        21.99%        5.82%
BN-Inception ensemble      224          144     6        20.1%         4.9%*

Figure 4: Batch-Normalized Inception comparison with previous state of the art on the provided validation set comprising 50000 images. *BN-Inception ensemble has reached 4.82% top-5 error on the 100000 images of the test set of ImageNet as reported by the test server.

5 Conclusion

We have presented a novel mechanism for dramatically accelerating the training of deep networks. It is based on the premise that covariate shift, which is known to complicate the training of machine learning systems, also applies to sub-networks and layers, and removing it from internal activations of the network may aid in training. Our proposed method draws its power from normalizing activations, and from incorporating this normalization in the network architecture itself. This ensures that the normalization is appropriately handled by any optimization method that is being used to train the network. To enable stochastic optimization methods commonly used in deep network training, we perform the normalization for each mini-batch, and backpropagate the gradients through the normalization parameters. Batch Normalization adds only two extra parameters per activation, and in doing so preserves the representation ability of the network. We presented an algorithm for constructing, training, and performing inference with batch-normalized networks. The resulting networks can be trained with saturating nonlinearities, are more tolerant to increased training rates, and often do not require Dropout for regularization.

Merely adding Batch Normalization to a state-of-the-art image classification model yields a substantial speedup in training. By further increasing the learning rates, removing Dropout, and applying other modifications afforded by Batch Normalization, we reach the previous state of the art with only a small fraction of training steps – and then beat the state of the art in single-network image classification. Furthermore, by combining multiple models trained with Batch Normalization, we perform better than the best known system on ImageNet, by a significant margin.

Interestingly, our method bears similarity to the standardization layer of (Gülçehre & Bengio, 2013), though the two methods stem from very different goals, and perform different tasks. The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training, and in our experiments we apply it before the nonlinearity since that is where matching the first and second moments is more likely to result in a stable distribution. On the contrary, (Gülçehre & Bengio, 2013) apply the standardization layer to the output of the nonlinearity, which results in sparser activations. In our large-scale image classification experiments, we have not observed the nonlinearity inputs to be sparse, neither with nor without Batch Normalization. Other notable differentiating characteristics of Batch Normalization include the learned scale and shift that allow the BN transform to represent identity (the standardization layer did not require this since it was followed by the learned linear transform that, conceptually, absorbs the necessary scale and shift), handling of convolutional layers, deterministic inference that does not depend on the mini-batch, and batch-normalizing each convolutional layer in the network.

In this work, we have not explored the full range of possibilities that Batch Normalization potentially enables. Our future work includes applications of our method to Recurrent Neural Networks (Pascanu et al., 2013), where the internal covariate shift and the vanishing or exploding gradients may be especially severe, and which would allow us to more thoroughly test the hypothesis that normalization improves gradient propagation (Sec. 3.3). We plan to investigate whether Batch Normalization can help with domain adaptation, in its traditional sense – i.e. whether the normalization performed by the network would allow it to more easily generalize to new data distributions, perhaps with just a recomputation of the population means and variances (Alg. 2). Finally, we believe that further theoretical analysis of the algorithm would allow still more improvements and applications.

References

Bengio, Yoshua and Glorot, Xavier. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249–256, May 2010.

Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc'Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.

Desjardins, Guillaume and Kavukcuoglu, Koray. Natural neural networks. (unpublished).

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435.

Gülçehre, Çaglar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. CoRR, abs/1301.4083, 2013.

He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015.

Hyvärinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Netw., 13(4-5):411–430, May 2000.

Jiang, Jing. A literature survey on domain adaptation of statistical classifiers, 2008.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a.

LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the Trade. Springer, 1998b.

Lyu, S. and Simoncelli, E. P. Nonlinear image representation using divisive normalization. In Proc. Computer Vision and Pattern Recognition, pp. 1–8. IEEE Computer Society, Jun 23–28 2008. doi: 10.1109/CVPR.2008.4587821.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814. Omnipress, 2010.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, pp. 1310–1318, 2013.

Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014.

Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 924–932, 2012.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014.

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.

Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, October 2000.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.

Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In ICML (3), volume 28 of JMLR Proceedings, pp. 1139–1147. JMLR.org, 2013.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.

Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.

Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 180–184, Florence, Italy, May 2014.

Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and Sun, Gang. Deep image: Scaling up image recognition, 2015.

Appendix

Variant of the Inception Model Used

Figure 5 documents the changes that were performed relative to the GoogLeNet architecture. For the interpretation of this table, please consult (Szegedy et al., 2014). The notable architecture changes compared to the GoogLeNet model include:

• The 5×5 convolutional layers are replaced by two consecutive 3×3 convolutional layers. This increases the maximum depth of the network by 9 weight layers. It also increases the number of parameters by 25%, and the computational cost is increased by about 30%.

• The number of 28×28 Inception modules is increased from 2 to 3.

• Inside the modules, sometimes average-pooling and sometimes maximum-pooling is employed. This is indicated in the entries corresponding to the pooling layers of the table.

• There are no across-the-board pooling layers between any two Inception modules, but stride-2 convolution/pooling layers are employed before the filter concatenation in modules 3c and 4e.

Our model employed separable convolution with depth multiplier 8 on the first convolutional layer. This reduces the computational cost while increasing the memory consumption at training time.

type             patch size/stride   output size    depth   #1×1   #3×3 reduce   #3×3   double #3×3 reduce   double #3×3   Pool +proj
convolution*     7×7/2               112×112×64     1
max pool         3×3/2               56×56×64       0
convolution      3×3/1               56×56×192      1              64            192
max pool         3×3/2               28×28×192      0
inception (3a)                       28×28×256      3       64     64            64     64                   96            avg + 32
inception (3b)                       28×28×320      3       64     64            96     64                   96            avg + 64
inception (3c)   stride 2            28×28×576      3       0      128           160    64                   96            max + pass through
inception (4a)                       14×14×576      3       224    64            96     96                   128           avg + 128
inception (4b)                       14×14×576      3       192    96            128    96                   128           avg + 128
inception (4c)                       14×14×576      3       160    128           160    128                  160           avg + 128
inception (4d)                       14×14×576      3       96     128           192    160                  192           avg + 128
inception (4e)   stride 2            14×14×1024     3       0      128           192    192                  256           max + pass through
inception (5a)                       7×7×1024       3       352    192           320    160                  224           avg + 128
inception (5b)                       7×7×1024       3       352    192           320    192                  224           max + 128
avg pool         7×7/1               1×1×1024       0

Figure 5: Inception architecture
