PowerNorm: Rethinking Batch Normalization in Transformers

Sheng Shen*, Zhewei Yao*, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

*Equal contribution. UC Berkeley. Correspondence to: Amir Gholami.
Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).
arXiv:2003.07845v2 [cs.CL] 28 Jun 2020

Abstract

The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN). This is different than batch normalization (BN), which is widely adopted in Computer Vision. The preferred use of LN in NLP is principally due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks; however, a thorough understanding of the underlying reasons for this is not always evident. In this paper, we perform a systematic study of NLP transformer models to understand why BN has a poor performance, as compared to LN. We find that the statistics of NLP data across the batch dimension exhibit large fluctuations throughout training. This results in instability, if BN is naively implemented. To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per-batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass. We show theoretically, under mild assumptions, that PN leads to a smaller Lipschitz constant for the loss, compared with BN. Furthermore, we prove that the approximate backpropagation scheme leads to bounded gradients. We extensively test PN for transformers on a range of NLP tasks, and we show that it significantly outperforms both LN and BN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103. We make our code publicly available at https://github.com/sIncerass/powernorm.

1. Introduction

Normalization has become one of the critical components in Neural Network (NN) architectures for various machine learning tasks, in particular in Computer Vision (CV) and Natural Language Processing (NLP). However, currently there are very different forms of normalization used in CV and NLP. For example, Batch Normalization (BN) (Ioffe & Szegedy, 2015) is widely adopted in CV, but it leads to significant performance degradation when naively used in NLP. Instead, Layer Normalization (LN) (Ba et al., 2016) is the standard normalization scheme used in NLP. All recent NLP architectures, including Transformers (Vaswani et al., 2017), have incorporated LN instead of BN as their default normalization scheme. In spite of this, the reasons why BN fails for NLP have not been clarified, and a better alternative to LN has not been presented.

In this work, we perform a systematic study of the challenges associated with BN for NLP, and based on this we propose Power Normalization (PN), a novel normalization method that significantly outperforms LN. In particular, our contributions are as follows:

• We find that there are clear differences in the batch statistics of NLP data versus CV data. In particular, we observe that batch statistics for NLP data have a very large variance throughout training. This variance exists in the corresponding gradients as well. In contrast, CV data exhibits orders of magnitude smaller variance. See Figures 2 and 3 for a comparison of BN in CV and NLP.

• To reduce the variation of batch statistics, we modify typical BN by relaxing zero-mean normalization, and we replace the variance with the quadratic mean. We denote this scheme as PN-V. We show theoretically that PN-V preserves the first-order smoothness property as in BN; see Lemma 2.

• We show that using running statistics for the quadratic mean results in significantly better performance, up to 1.5/2.0 BLEU on IWSLT14/WMT14 and 7.7/3.4 PPL on PTB/WikiText-103, as compared to BN; see Tables 1 and 2. We denote this scheme as PN. Using running statistics requires correcting the typical backpropagation scheme in BN. As an alternative, we propose an approximate backpropagation to capture the running statistics. We show theoretically that this approximate backpropagation leads to bounded gradients, which is a necessary condition for convergence; see Theorem 4.

• We perform extensive tests showing that PN also improves performance on machine translation and language modeling tasks, as compared to LN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14, and by 5.6/3.0 PPL on PTB/WikiText-103. We emphasize that the improvement of PN over LN is without any change of hyperparameters.

• We analyze the behaviour of PN and LN by computing the Singular Value Decomposition of the resulting embedding layers, and we show that PN leads to a more well-conditioned embedding layer; see Figure 6. Furthermore, we show that PN is robust to small-batch statistics, and it still achieves higher performance, as opposed to LN; see Figure 5.

Figure 1. The illustration of layer normalization (left) and batch/power normalization (right). The entries colored in blue show the components used for calculating the statistics. (The panels plot the feature dimension against the batch dimension, where the batch dimension runs over sentence length and batch size.)

2. Related Work

Normalization is widely used in modern deep NNs such as ResNet (He et al., 2016), MobileNet-V2 (Sandler et al., 2018), and DenseNet (Huang et al., 2017) in CV, as well as LSTMs (Hochreiter & Schmidhuber, 1997; Ba et al., 2016), transformers (Vaswani et al., 2017), and transformer-based models (Devlin et al., 2019; Liu et al., 2019) in NLP. There are two main categories of normalization: weight normalization (Salimans & Kingma, 2016; Miyato et al., 2018; Qiao et al., 2019) and activation normalization (Ioffe & Szegedy, 2015; Jarrett et al., 2009; Krizhevsky et al., 2012; Ba et al., 2016; Ulyanov et al., 2016; Wu & He, 2018; Li et al., 2019). Here, we solely focus on the latter, and we briefly review related work in CV and NLP.

Normalization in Computer Vision. Batch Normalization (BN) (Ioffe & Szegedy, 2015) has become the de-facto normalization for NNs used in CV. BN normalizes the activations (feature maps) by computing channel-wise mean and variance across the batch dimension, as schematically shown in Figure 1. It has been found that BN leads to robustness with respect to sub-optimal hyperparameters (e.g., learning rate) and initialization, and it generally results in more stable training for CV tasks (Ioffe & Szegedy, 2015). Following the seminal work of (Ioffe & Szegedy, 2015), there have been two principal lines of research: (i) extensions/modifications of BN to improve its performance, and (ii) theoretical/empirical studies to understand why BN helps training.

With regard to (i), it was found that BN does not perform well for problems that need to be trained with small batches, e.g., image segmentation (often due to memory limits) (Zagoruyko & Komodakis, 2016; Lin et al., 2017; Goldberger et al., 2005). The work of (Ioffe, 2017) proposed batch renormalization to remove/reduce the dependence of batch statistics on batch size. It was shown that this approach leads to improved performance for small-batch training, as well as for cases with non-i.i.d. data. Along this direction, the work of (Singh & Shrivastava, 2019) proposed "EvalNorm," which uses corrected normalization statistics. Furthermore, the recent work of (Yan et al., 2020) proposed "Moving Average Batch Normalization (MABN)" for small-batch BN by replacing batch statistics with moving averages.

There has also been work on alternative normalization techniques, in particular Layer Normalization (LN), proposed by (Ba et al., 2016). LN normalizes across the channel/feature dimension, as shown in Figure 1. This can be extended to Group Norm (GN) (Wu & He, 2018), where the normalization is performed across a partition of the features/channels with different pre-defined groups. Instance Normalization (IN) (Ulyanov et al., 2016) is another technique, where per-channel statistics are computed for each sample.

With regard to (ii), there have been several studies to understand why BN helps training in CV. The original motivation was that BN reduces the so-called "Internal Covariance Shift" (ICS) (Ioffe & Szegedy, 2015). However, this explanation was viewed as incorrect/incomplete (Rahimi, 2017). In particular, the recent study of (Santurkar et al., 2018) argued that the underlying reason that BN helps training is that it results in a smoother loss landscape. This was later confirmed for deep NN models by measuring the Hessian spectrum of the network with/without BN (Yao et al., 2019).

Normalization in Natural Language Processing. Despite the great success of BN in CV, the large computation and storage overhead of BN at each time step of recurrent neural networks (RNNs) made it impossible/expensive to deploy for NLP tasks (Cooijmans et al., 2017). To address this, the work of (Cooijmans et al., 2017; Hou et al., 2019) used shared BN statistics across different time steps of RNNs. However, it was found that the performance of BN is significantly lower than LN for NLP. For this reason, LN became the default normalization technique, even for the recent transformer models introduced by (Vaswani et al., 2017).

Only limited recent attempts were made to compare LN with other alternatives or to investigate the reasons behind the success of LN in transformer models. For instance, (Zhang & Sennrich, 2019) proposes RMSNorm, which removes the re-centering invariance in LN and performs re-scaling invariance with the root mean square of the summed inputs. They showed that this approach achieves similar performance to LN, but with smaller (89% to 64%) overhead. Furthermore, (Nguyen & Salazar, 2019) studies different variants of weight normalization for transformers in low-resource machine translation. The recent work of (Xu et al., 2019) studies why LN helps training, and in particular it finds that the derivatives of LN help recenter and rescale backward gradients. From a different angle, (Zhang et al., 2019b;a) try to ascribe the benefits of LN to solving the exploding and vanishing gradient problem at the beginning of training. They also propose two properly designed initialization schemes which enjoy that property and are able to stabilize training for transformers.

However, most of these approaches achieve similar or marginal improvement over LN. More importantly, there is still not a proper understanding of why BN performs poorly for transformers applied to NLP data. Here, we address this by systematically studying the BN behavior throughout training; and, based on our results, we propose Power Normalization (PN), a new normalization method that significantly outperforms LN for a wide range of tasks in NLP.

3. Batch Normalization

Notation. We denote the input of a normalization layer as X ∈ R^{B×d}, where d is the embedding/feature size and B is the batch size.¹ We denote L as the final loss of the NN. The i-th row (column) of a matrix, e.g., X, is denoted by X_{i,:} (X_{:,i}). We also write the i-th row of the matrix as its lower-case version, i.e., x_i = X_{i,:}. For a vector y, y_i denotes the i-th element of y.

Without other specification: (i) for two vectors x ∈ R^d and y ∈ R^d, we denote xy as the element-wise product, x + y as the element-wise sum, and ⟨x, y⟩ as the inner product; (ii) for a vector y ∈ R^d and a matrix X ∈ R^{B×d}, we denote y ⊙ X as [y_1 X_{:,1}, ..., y_d X_{:,d}] and y + X as [y + X_{1,:}; ...; y + X_{B,:}]; and (iii) for a vector y ∈ R^d, y > C means that each entry of y is larger than the constant C, i.e., y_i > C for all i.

¹ For NLP tasks, we flatten sentences/words in one dimension, i.e., the batch size actually corresponds to all non-padded words in a training batch.

3.1. Formulation of Batch Normalization

We briefly review the formulation of BN (Ioffe & Szegedy, 2015). Let us denote the mean (variance) of X along the batch dimension as µ_B ∈ R^d (σ_B² ∈ R^d). The batch dimension is illustrated in Figure 1. The BN layer first enforces zero mean and unit variance, and it then performs an affine transformation by scaling the result by γ, β ∈ R^d, as shown in Algorithm 1.

Algorithm 1 Batch Normalization (Every Iteration)
  begin Forward Propagation:
    Input: X ∈ R^{B×d};  Output: Y ∈ R^{B×d}
    µ_B = (1/B) Σ_{i=1}^B x_i              // Get mini-batch mean
    σ_B² = (1/B) Σ_{i=1}^B (x_i − µ_B)²    // Get mini-batch variance
    X̌ = (X − µ_B) / σ_B                    // Normalize
    Y = γ ⊙ X̌ + β                          // Scale and shift
    µ = αµ + (1 − α)µ_B                    // Update running mean
    σ² = ασ² + (1 − α)σ_B²                 // Update running variance
  begin Backward Propagation:
    Input: ∂L/∂Y ∈ R^{B×d};  Output: ∂L/∂X ∈ R^{B×d}
    ∂L/∂X based on Eq. 3                   // Gradient of X
  Inference: Y = γ ⊙ (X − µ)/σ + β

The Forward Pass (FP) of BN is performed as follows. Let us denote the intermediate result of BN with zero mean and unit variance as X̌, i.e.,

    X̌ = (X − µ_B) / σ_B.    (1)

The final output of BN, Y, is then an affine transformation applied to X̌:

    Y = γ ⊙ X̌ + β.    (2)

The corresponding Backward Pass (BP) can then be derived as follows. Assume that the derivative of L with respect to Y is given, i.e., ∂L/∂Y is known. Then, the derivative with respect to the input can be computed as:

    ∂L/∂x_i = (γ/σ_B) ⊙ ∂L/∂y_i − (γ/(σ_B B)) ⊙ Σ_{j∈B} ( ∂L/∂y_j + x̌_j x̌_i ∂L/∂y_j ),    (3)

where, inside the sum, the first term is the contribution from µ_B and the second from σ_B²; we denote these two terms as g_µ and g_σ², respectively. See Lemma 6 in Appendix C for details.
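For concreteness, the following is a minimal PyTorch sketch of the training-time forward pass of Algorithm 1 on the flattened (B, d) token layout used above. This is our own illustration, not the implementation used in the paper; the eps term is an extra numerical-stability assumption not in Algorithm 1.

```python
import torch

def bn_forward(X, gamma, beta, running_mean, running_var, alpha=0.9, eps=1e-5):
    """One training-time BN forward pass over a flattened (B, d) input."""
    mu_b = X.mean(dim=0)                              # mini-batch mean, shape (d,)
    var_b = X.var(dim=0, unbiased=False)              # mini-batch variance, shape (d,)
    X_check = (X - mu_b) / torch.sqrt(var_b + eps)    # Eq. (1)
    Y = gamma * X_check + beta                        # Eq. (2)
    # Running statistics used at inference time.
    running_mean.mul_(alpha).add_((1 - alpha) * mu_b)
    running_var.mul_(alpha).add_((1 - alpha) * var_b)
    return Y
```

The backward pass of Eq. (3) is exactly what automatic differentiation produces for this forward computation, so no custom gradient is required for plain BN.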

Figure 2. The average Euclidean distance between the batch statistics (µ_B, σ_B²) and the running statistics (µ, σ²) stored in the first BN layer during the forward pass, for ResNet20 on CIFAR-10 and Transformer_BN on IWSLT14 (left panel: (1/d)‖µ_B − µ‖; right panel: (1/d)‖σ_B² − σ²‖; x-axis: percent of training epochs). We can clearly see that the distances for ResNet20 have orders of magnitude smaller variation than those for the transformer throughout training. In contrast, the corresponding statistics in Transformer_BN exhibit very high variance with extreme outliers. This is true both for the mean (shown on the left) as well as the variance (shown on the right). This is one of the contributing factors to the low performance of BN in transformers.
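The diagnostic plotted in Figure 2, the average Euclidean distance dist(µ_B, µ) = (1/d)‖µ_B − µ‖ between batch and running statistics (defined formally in the text below), is easy to reproduce. The sketch that follows is a hypothetical stand-in using torch.nn.BatchNorm1d purely for illustration, not the instrumented layers used for the paper's figures.

```python
import torch

def avg_euclidean_dist(batch_stat, running_stat):
    """dist(a, b) = (1/d) * ||a - b||_2, the quantity plotted in Figures 2-4."""
    d = batch_stat.numel()
    return torch.norm(batch_stat - running_stat) / d

# Illustrative check against a single BatchNorm1d layer (names are ours).
bn = torch.nn.BatchNorm1d(512)
X = torch.randn(4096, 512)                 # flattened non-padded tokens x features
bn.train()
bn(X)                                      # forward pass updates bn.running_mean/var
mu_b = X.mean(dim=0)
var_b = X.var(dim=0, unbiased=False)
print(avg_euclidean_dist(mu_b, bn.running_mean).item(),
      avg_euclidean_dist(var_b, bn.running_var).item())
```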

In summary, there are four batch statistics in BN, two in FP and two in BP. The stability of training is highly dependent on these four parameters. In fact, naively implementing BN as above for transformers leads to poor performance. For example, using a transformer with BN (denoted as Transformer_BN) results in 1.1 and 1.4 lower BLEU score, as compared to the transformer with LN (Transformer_LN), on IWSLT14 and WMT14, respectively; see Table 1.

This is significant performance degradation, and it stems from instabilities associated with the above four batch statistics. To analyze this, we studied the batch statistics using the standard setting of ResNet20 on CIFAR-10 and Transformer_BN on IWSLT14 (using a standard batch size of 128 and 4K tokens, respectively). In the first experiment, we probed the fluctuations between the batch statistics, µ_B/σ_B, and the corresponding BN running statistics, µ/σ, throughout training. This is shown for the first BN layer of ResNet20 on CIFAR-10 and Transformer_BN on IWSLT14 in Figure 2. Here, the y-axis shows the average Euclidean distance between the batch statistics (µ_B, σ_B) and the running statistics (µ, σ), and the x-axis is different epochs of training, where we define the average Euclidean distance as dist(µ_B, µ) = (1/d)‖µ_B − µ‖.

The first observation is that Transformer_BN shows significantly larger distances between the batch statistics and the running statistics than ResNet20 on CIFAR-10, which exhibits close to zero fluctuations. Importantly, the distance between σ_B and σ significantly increases throughout training, with extreme outliers. During inference, we have to use the running statistics. However, such large fluctuations would lead to a large inconsistency between the statistics of the testing data and BN's running statistics.

The second observation comes from probing the norm of g_µ and g_σ² defined in Eq. 3, which contribute to the gradient backpropagation of the input. These results are shown in Figure 3, where we report the norm of these two terms for ResNet20 and Transformer_BN. For Transformer_BN, we can see very large outliers that actually persist throughout training. This is in contrast to ResNet20, for which the outliers vanish as training proceeds.

4. Power Normalization

Based on our empirical observations, we propose Power Normalization (PN), which effectively resolves the performance degradation of BN. This is achieved by incorporating the following two changes to BN. First, instead of enforcing unit variance, we enforce unit quadratic mean for the activations. The reason for this is that we find that enforcing zero mean and unit variance in BN is detrimental due to the large variations in the mean, as discussed in the previous section. However, we observe that, unlike the mean/variance, the unit quadratic mean is significantly more stable for transformers. Second, we incorporate running statistics for the quadratic mean of the signal, and we incorporate an approximate backpropagation method to compute the corresponding gradient. We find that the combination of these two changes leads to a significantly more effective normalization, with results that exceed LN, even when the same training hyperparameters are used. Below we discuss each of these two components.

4.1. Relaxing Zero-Mean and Enforcing Quadratic Mean

Here, we describe the first modification in PN. As shown in Figures 2 and 3, µ_B and g_µ exhibit a significant number of large outliers, which leads to inconsistencies between training and inference statistics. We first address this by relaxing the zero-mean normalization, and we use the quadratic mean of the signal, instead of its variance. The quadratic mean exhibits orders of magnitude smaller fluctuations, as shown in Figure 4.

Figure 3. The average gradient norm of the input of the first BN layer contributed by µ_B and σ_B (i.e., (1/d)‖g_µ‖ and (1/d)‖g_σ²‖) for ResNet20 on CIFAR-10 and Transformer_BN on IWSLT14 during the BP (note that d = 16 for the CIFAR-10 experiment and d = 512 for the IWSLT experiment). It can be clearly seen that the norm of g_µ and g_σ² for ResNet20 has orders of magnitude smaller variation throughout training, as compared to that for Transformer_BN. Also, the outliers for ResNet20 vanish at the end of training, which is in contrast to Transformer_BN, for which the outliers persist. This is true both for g_µ (shown on the left) as well as g_σ² (shown on the right).

Figure 4. Results for Transformer on IWSLT14. (Left) The average Euclidean distance between the batch statistics (σ_B, ψ_B) and the running statistics (σ, ψ) stored in the first BN/PN-V layer during forward propagation (FP). (Right) The average norm of the gradient of the input of the first BN/PN-V layer contributed by σ_B/ψ_B. During FP, ψ_B has much smaller variations relative to its running statistics, as compared to σ_B, as shown on the left. It can also be clearly seen that, during BP, the norm of g_ψ² exhibits many fewer outliers than g_σ² throughout training.

We refer to this normalization (i.e., no zero-mean enforcement, and unit quadratic mean) as PN-V, defined as follows.

Definition 1 (PN-V). Let us denote the quadratic mean of the batch as ψ_B² = (1/B) Σ_{i=1}^B x_i². Furthermore, denote X̂ as the signal scaled by ψ_B, i.e.,

    X̂ = X / ψ_B.    (4)

Then, the output of PN-V is defined as

    Y = γ ⊙ X̂ + β,    (5)

where γ ∈ R^d and β ∈ R^d are two parameters (vectors) in PN-V (the same as in the affine transformation used in BN). Note that here we use the same notation Y as the output in Eq. 2 without confusion.

The corresponding BP of PN-V is as follows:

    ∂L/∂x_i = (γ/ψ_B) ⊙ ∂L/∂y_i − (γ/(B ψ_B)) ⊙ Σ_{j∈B} x̂_j x̂_i ∂L/∂y_j,    (6)

where the second term is the gradient attributed to ψ_B², denoted g_ψ². See Lemma 8 in Appendix C for the full details. Note that, compared to BN, there exist only two batch statistics in FP and BP: ψ_B and g_ψ². This modification removes the two unstable factors corresponding to µ_B and σ_B in BN (g_µ and g_σ² in Eq. 3). This modification also results in significant performance improvement, as reported in Table 1 for IWSLT14 and WMT14. By directly replacing BN with PN-V (denoted as Transformer_PN-V), the BLEU score increases from 34.4 to 35.4 on IWSLT14, and from 28.1 to 28.5 on WMT14. These improvements are significant for these two tasks. For example, (Zhang et al., 2019b;a) only improves the BLEU score by 0.1 on IWSLT14.
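A minimal sketch of the PN-V forward pass (Eqs. (4)-(5)) is given below; the backward pass of Eq. (6) is exactly what autograd produces for this forward computation, so no custom gradient is needed for PN-V. The module name and the eps term are our own additions, not part of the paper's released code.

```python
import torch
import torch.nn as nn

class PNV(nn.Module):
    """Sketch of PN-V: scale by the per-feature quadratic mean over the batch."""
    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d))
        self.beta = nn.Parameter(torch.zeros(d))
        self.eps = eps

    def forward(self, X):  # X: (B, d), flattened non-padded tokens
        psi_b = X.pow(2).mean(dim=0).add(self.eps).sqrt()  # quadratic mean psi_B
        X_hat = X / psi_b                                  # Eq. (4)
        return self.gamma * X_hat + self.beta              # Eq. (5)
```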

As mentioned before, ψ_B exhibits orders of magnitude smaller variations, as compared to σ_B. This is shown in Figure 4, where we report the distance between the running statistics for σ, dist(σ_B², σ²), and for ψ, dist(ψ_B², ψ²). Similarly, during BP, we compute the norms of g_σ² and g_ψ², and we report them in Figure 4 throughout training. It can be clearly seen that, during BP, the norm of g_ψ² exhibits many fewer outliers as compared to g_σ².

In (Santurkar et al., 2018), the authors provided theoretical results suggesting that employing BN in DNNs can lead to a smaller Lipschitz constant of the loss. It can be shown that PN-V also exhibits similar behaviour, under mild assumptions. In more detail, let us denote the loss of the NN without normalization as L̂. With mild assumptions, (Santurkar et al., 2018) shows that the norm of ∂L/∂X (with BN) is smaller than the norm of ∂L̂/∂X. Here, we show that, under the same assumptions, PN-V can achieve the same results that BN does. See Appendix C for details, including the statement of Assumption 9.

Lemma 2 (The effect of PN-V on the Lipschitz constant of the loss). Under Assumption 9, we have

    ‖∂L/∂X_{:,i}‖² = (γ_i² / (ψ_B)_i²) ( ‖∂L̂/∂X_{:,i}‖² − ⟨ ∂L̂/∂X_{:,i}, X̂_{:,i}/√B ⟩² ).    (7)

See the proof in Appendix C. Note that ⟨∂L̂/∂X_{:,i}, X̂_{:,i}/√B⟩² is non-negative, and hence the Lipschitz constant of L is smaller than that of L̂ if γ_i ≤ (ψ_B)_i. This is what we observe in practice, as shown in Appendix B.

4.2. Running Statistics in Training

Here, we discuss the second modification in PN. First note that, even though Transformer_PN-V outperforms Transformer_BN, it still cannot match the performance of LN. This could be related to the larger number of outliers present in ψ_B, as shown in Figure 4. A straightforward solution to address this is to use running statistics for the quadratic mean (denoted as ψ²), instead of per-batch statistics, since the latter change in each iteration. However, using running statistics requires a modification of the backpropagation, which we describe below.

Definition 3 (PN). Denote the inputs/statistics at the t-th iteration by ·^(t), e.g., X^(t) is the input data at the t-th iteration. In the forward propagation, the following equations are used for the calculation:

    X̃^(t) = X^(t) / ψ^(t−1),    (8)
    Y^(t) = γ ⊙ X̃^(t) + β,    (9)
    (ψ^(t))² = (ψ^(t−1))² + (1 − α)((ψ_B^(t))² − (ψ^(t−1))²).    (10)

Here, 0 < α < 1 is the moving-average coefficient in the forward propagation, and ψ_B is the statistic for the current batch. Since the forward pass involves running statistics, the backward propagation cannot be accurately computed—namely, the accurate gradient calculation needs to track back to the first iteration. Here, we propose to use the following approximated gradient in backward propagation:

    (X̃^(t))' = ∂L/∂X̃^(t) − ν^(t−1) ⊙ X̃^(t),    (11)
    ∂L/∂X^(t) = (X̃^(t))' / ψ^(t−1),    (12)
    ν^(t) = ν^(t−1)(1 − (1 − α)Γ^(t)) + (1 − α)Λ^(t),    (13)

where Γ^(t) = (1/B) Σ_{i=1}^B x̃_i^(t) x̃_i^(t) and Λ^(t) = (1/B) Σ_{i=1}^B (∂L/∂x̃_i^(t)) x̃_i^(t).

This backpropagation essentially uses running statistics by computing the gradient of the loss w.r.t. the quadratic mean of the current batch, rather than using the computationally infeasible method of directly computing the gradient w.r.t. the running statistics of the quadratic mean. Importantly, this formulation leads to bounded gradients, which is necessary for convergence, as shown below.

Theorem 4 (Gradient of L w.r.t. X is bounded in PN). For any datum point of X (i.e., X_{i,:}), the gradient computed from Eq. 11 is bounded by a constant. Furthermore, the gradient of X_{i,:} given by Eq. 12 is also bounded.

See the proof in Appendix C. The pseudo-code for the PN algorithm is presented in Algorithm 2.

Algorithm 2 Power Normalization (Every Iteration)
  begin Forward Propagation:
    Input: X ∈ R^{B×d};  Output: Y ∈ R^{B×d}
    ψ_B² = (1/B) Σ_{i=1}^B x_i²            // Get mini-batch statistics
    X̃ = X / ψ                              // Normalize by the running statistic
    Y = γ ⊙ X̃ + β                          // Scale and shift
    ψ² = αψ² + (1 − α)ψ_B²                 // Update running statistics
  begin Backward Propagation:
    Input: ∂L/∂Y ∈ R^{B×d};  Output: ∂L/∂X ∈ R^{B×d}
    ∂L/∂X̃ = γ ⊙ ∂L/∂Y                      // Intermediate gradient
    X̃' = ∂L/∂X̃ − ν ⊙ X̃                     // Intermediate estimated gradient
    ∂L/∂X = X̃' / ψ                         // Gradient of X
    ν = ν(1 − (1 − α)Γ) + (1 − α)Λ         // See Definition 3 for Γ and Λ
  Inference: Y = γ ⊙ (X / ψ) + β
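The following is a minimal sketch of Eqs. (8)-(13)/Algorithm 2 as a custom autograd function, purely to make the approximate backward pass concrete. The class name is ours, and this is not the released fairseq implementation: the official code (https://github.com/sIncerass/powernorm) additionally uses separate forward/backward moving-average coefficients, a warmup phase for ψ, synchronization across workers, and the layer-scale described in Appendix A.

```python
import torch

class PowerNormFunction(torch.autograd.Function):
    """Sketch of PN: forward with running quadratic mean psi (Eqs. 8-10),
    approximate backward with the state nu (Eqs. 11-13)."""

    @staticmethod
    def forward(ctx, X, gamma, beta, psi_sq, nu, alpha):
        # X: (B, d).  psi_sq, nu: (d,) buffers carried across iterations
        # (psi_sq initialized to ones, nu to zeros).
        psi = psi_sq.sqrt()
        X_tilde = X / psi                                   # Eq. (8), uses psi^(t-1)
        Y = gamma * X_tilde + beta                          # Eq. (9)
        psi_b_sq = X.pow(2).mean(dim=0)
        psi_sq.mul_(alpha).add_((1 - alpha) * psi_b_sq)     # Eq. (10)
        ctx.save_for_backward(X_tilde, gamma, psi, nu)
        ctx.alpha = alpha
        return Y

    @staticmethod
    def backward(ctx, grad_Y):
        X_tilde, gamma, psi, nu = ctx.saved_tensors
        alpha = ctx.alpha
        grad_X_tilde = grad_Y * gamma                       # dL/dX_tilde
        X_prime = grad_X_tilde - nu * X_tilde               # Eq. (11)
        grad_X = X_prime / psi                              # Eq. (12)
        Gamma = (X_tilde * X_tilde).mean(dim=0)
        Lambda = (grad_X_tilde * X_tilde).mean(dim=0)
        nu.mul_(1 - (1 - alpha) * Gamma).add_((1 - alpha) * Lambda)  # Eq. (13)
        grad_gamma = (grad_Y * X_tilde).sum(dim=0)
        grad_beta = grad_Y.sum(dim=0)
        return grad_X, grad_gamma, grad_beta, None, None, None
```

A thin nn.Module wrapper would hold gamma/beta as parameters and psi_sq/nu as buffers and call PowerNormFunction.apply in its forward.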

5. Results

5.1. Experiment Setup

We compare our PN method with LN and BN for a variety of sequence modeling tasks: Neural Machine Translation (MT) and Language Modeling (LM). We implement our code for MT using fairseq-py (Ott et al., 2019), and (Ma et al., 2019) for LM tasks. For a fair comparison, we directly replace the LN in transformers (Transformer_LN) with BN (Transformer_BN) or PN (Transformer_PN), without varying the position of each normalization layer or changing the training hyperparameters.

For all the experiments, we use the pre-normalization setting in (Wang et al., 2019), where the normalization layer is located right before the multi-head attention module and the point-wise feed-forward network module. Following (Wang et al., 2019), we generally increase the learning rate by a factor of 2.0, relative to the common post-normalization transformer (Vaswani et al., 2017). Below we discuss task-specific settings.²

Neural Machine Translation. We evaluate our methods on two widely used public datasets: IWSLT14 German-to-English (De-En) and WMT14 English-to-German (En-De). We follow the settings reported in (Ott et al., 2018). We use the transformer big architecture for WMT14 (4.5M sentence pairs) and the small architecture for IWSLT14 (0.16M sentence pairs). For inference, we average the last 10 checkpoints, and we set the length penalty to 0.6/1.0 and the beam size to 4/5 for WMT/IWSLT, following (Ott et al., 2019). All the other hyperparameters (learning rate, dropout, weight decay, warmup steps, etc.) are set identically to the ones reported in the literature for LN (i.e., we use the same hyperparameters for BN/PN).

Language Modeling. We experiment on both PTB (Mikolov et al., 2011) and WikiText-103 (Merity et al., 2017), which contain 0.93M and 100M tokens, respectively. We use a three-layer tensorized transformer core-1 for PTB and a six-layer tensorized transformer core-1 for WikiText-103, following (Ma et al., 2019). Furthermore, we apply the multi-linear attention mechanism with masking, and we report the final test-set perplexity (PPL).³

² More detailed experimental settings and comparisons between normalization methods are provided in Appendices A and B.3.
³ We also report the validation perplexity in Appendix B.

Table 1. MT performance (BLEU) on the IWSLT14 De-En and WMT14 En-De test sets. Using PN-V instead of BN significantly improves the performance, but LN still outperforms it. However, PN achieves much higher BLEU scores, as compared to LN.

  Model                                         IWSLT14 (small)   WMT14 (big)
  Transformer (Vaswani et al., 2017)                 34.4             28.4
  DS-Init (Zhang et al., 2019a)                      34.4             29.1
  Fixup-Init (Zhang et al., 2019b)                   34.5             29.3
  Scaling NMT (Ott et al., 2018)                      /               29.3
  Dynamic Conv (Wu et al., 2019)                     35.2             29.7
  Transformer + LayerDrop (Fan et al., 2020)          /               29.6
  Pre-Norm Transformer_LN                            35.5             29.5
  Pre-Norm Transformer_BN                            34.4             28.1
  Pre-Norm Transformer_PN-V                          35.5             28.5
  Pre-Norm Transformer_PN                            35.9             30.1

5.2. Experiment Results

Neural Machine Translation. We use BLEU (Papineni et al., 2002) as the evaluation metric for MT. Following standard practice, we measure tokenized case-sensitive BLEU and case-insensitive BLEU for WMT14 En-De and IWSLT14 De-En, respectively. For a fair comparison, we do not include other external datasets. All the transformers in Table 1 use six encoder layers and six decoder layers.

The results are reported in Table 1. In the first section of rows, we report state-of-the-art results for these two tasks with comparable model sizes. In the second section of rows, we report the results with different types of normalization. Notice the significant drop in BLEU score when BN is used (34.4/28.1), as opposed to LN (35.5/29.5). Using PN-V instead of BN helps reduce this gap, but LN still outperforms it. However, the results corresponding to PN exceed the LN results by more than 0.4/0.6 points, which is significant for these tasks. Compared with other concurrent works such as DS-Init and Fixup-Init (Zhang et al., 2019a;b), the improvements of Transformer_PN are still significant.

Table 2. Results with state-of-the-art methods on PTB and WikiText-103. '–' indicates no reported results in that setting, '*' indicates results from our own implementation. PN achieves 5.6/3.0 points lower testing PPL on PTB and WikiText-103, respectively, compared to LN.

  Model                                          PTB (Test PPL)   WikiText-103 (Test PPL)
  Tied-LSTM (Inan et al., 2017)                       48.7               48.7
  AWD-LSTM-MoS (Yang et al., 2018)                    56.0               29.2
  Adaptive Input (Baevski & Auli, 2019)               57.0               20.5
  Transformer-XL_base (Dai et al., 2019)              54.5               24.0
  Transformer-XL_large (Dai et al., 2019)              –                 18.3
  Tensor-Transformer_1core (Ma et al., 2019)          57.9               20.9
  Tensor-Transformer_2core (Ma et al., 2019)          49.8               18.9
  Tensor-Transformer_1core + LN                       53.2*              20.9*
  Tensor-Transformer_1core + BN                       60.7               27.2
  Tensor-Transformer_1core + PN-V                     55.3               21.3
  Tensor-Transformer_1core + PN                       47.6               17.9

Language Modeling. We report the LM results in Table 2, using the tensorized transformer proposed in (Ma et al., 2019). Here we observe a similar trend. Using BN results in a significant degradation, increasing testing PPL by more than 7.5/6.3 for the PTB/WikiText-103 datasets (achieving 60.7/27.2, as opposed to 53.2/20.9). However, when we incorporate the PN normalization, we achieve state-of-the-art results for these two tasks (for these model sizes and without any pre-training on other datasets). In particular, PN results in 5.6/3.0 points lower testing PPL, as compared to LN. Importantly, note that using PN we achieve better results than (Ma et al., 2019), with the same number of parameters.

5.3. Analysis

The Effect of Batch Size for Different Normalizations. To better understand the effects of our proposed methods PN and PN-V, we change the batch size used to collect statistics in BN, LN, and PN. To this end, we keep the total batch size constant at 4K tokens, and we vary the mini-batch size used to collect statistics from 512 to 4K. Importantly, note that we keep the total batch size constant at 4K, and we use gradient accumulation for smaller mini-batches. For example, for the case with a mini-batch of 512, we use eight gradient accumulations. The results are reported in Figure 5. We can observe that BN behaves poorly and abnormally across different mini-batches. Noticeably, after relaxing the zero-mean normalization in BN and replacing the variance estimation with the quadratic mean, PN-V matches the performance of LN for the 4K mini-batch and consistently outperforms BN. However, it underperforms LN. In contrast, we can see that PN consistently achieves higher results under different mini-batch settings.

Figure 5. Ablation study of the performance (BLEU) of PN, PN-V, LN and BN on IWSLT14 trained using different batch sizes (512 to 4K tokens in each batch). Note that the performance of PN consistently outperforms LN. Meanwhile, PN-V can only match the result of LN when the mini-batch gets to 4K. Among all the settings, BN behaves poorly and abnormally across different mini-batches.

Representation Power of the Learned Embedding. To investigate further the performance gain of PN, we compute the Singular Value Decomposition of the embedding layers, as proposed by (Gao et al., 2019), which argued that the singular value distribution could be used as a proxy for measuring the representational power of the embedding layer. It has been argued that having fast-decaying singular values limits the representational power of the embeddings to a small sub-space. If this is the case, then it may be preferable to have a more uniform singular value distribution (Wang et al., 2020). We compute the singular values of the word embedding matrix for LN and PN, and we report the results in Figure 6. It can be clearly observed that the singular values corresponding to PN decay more slowly than those of LN. Intuitively, one explanation for this might be that PN helps by normalizing all the tokens across the batch dimension, which can result in more equally distributed embeddings. This may illustrate one of the reasons why PN outperforms LN.

Figure 6. Singular values of the embedding matrix trained with LN/PN on WMT14. We normalize the singular values of each matrix so that they are comparable, with the largest one as 1. Note that the singular values corresponding to PN decay more slowly than those of LN.
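The spectrum diagnostic behind Figure 6 amounts to a normalized SVD of the embedding weight matrix. The sketch below is our own illustration, assuming the embedding matrices have already been extracted from LN- and PN-trained checkpoints (the variable names are hypothetical).

```python
import torch

def normalized_singular_values(embedding_weight):
    """Singular values of an embedding matrix, scaled so the largest is 1."""
    s = torch.linalg.svdvals(embedding_weight.float())  # descending order
    return s / s[0]

# Tiny synthetic check that the helper runs:
print(normalized_singular_values(torch.randn(1000, 512))[:5])

# With emb_ln and emb_pn as (vocab_size, d) matrices from trained checkpoints,
# a slower decay of normalized_singular_values(emb_pn) relative to emb_ln
# corresponds to the more uniform spectrum shown in Figure 6.
```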

6. Conclusion

In this work, we systematically analyze the ineffectiveness of vanilla batch normalization (BN) in transformers. Comparing NLP and CV, we show evidence that the batch statistics in transformers on NLP tasks have larger variations. This, in turn, leads to the poor performance of BN in transformers. By decoupling the variations into FP and BP computations, we propose PN-V and PN to alleviate the variance issue of BN in NLP. We also show the advantages of PN-V and PN, both theoretically and empirically. Theoretically, PN-V preserves the first-order smoothness property as in BN, and the approximate backpropagation of PN leads to bounded gradients. Empirically, we show that PN outperforms LN in neural machine translation (0.4/0.6 BLEU on IWSLT14/WMT14) and language modeling (5.6/3.0 PPL on PTB/WikiText-103) by a large margin. We also conduct further analysis of the effect of PN-V/PN/BN/LN under different batch size settings to show the significance of the statistical estimations, and we investigate the representation power of the embedding matrices learned with LN/PN to illustrate the effectiveness of PN.

Acknowledgments

This work was supported by funds from Intel and Samsung. We are grateful for support from Google Cloud and the Google TFTC team, as well as support from Amazon AWS. We would like to acknowledge ARO, DARPA, NSF, and ONR for providing partial support of this work. We are also grateful to Zhuohan Li, Zhen Dong, Yang Liu, the members of Berkeley NLP, and the members of the Berkeley RISE Lab for their valuable feedback.

References

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Baevski, A. and Auli, M. Adaptive input representations for neural language modeling. In ICLR, 2019.

Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., and Courville, A. Recurrent batch normalization. In ICLR, 2017.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.

Fan, A., Grave, E., and Joulin, A. Reducing transformer depth on demand with structured dropout. In ICLR, 2020.

Gao, J., He, D., Tan, X., Qin, T., Wang, L., and Liu, T. Representation degeneration problem in training natural language generation models. In ICLR, 2019.

Goldberger, J., Hinton, G. E., Roweis, S. T., and Salakhutdinov, R. R. Neighbourhood components analysis. In NeurIPS, 2005.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.

Hou, L., Zhu, J., Kwok, J., Gao, F., Qin, T., and Liu, T.-y. Normalization helps training of quantized LSTM. In NeurIPS, 2019.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, 2017.

Inan, H., Khosravi, K., and Socher, R. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR, 2017.

Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In NeurIPS, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.

Li, B., Wu, F., Weinberger, K. Q., and Belongie, S. Positional normalization. In NeurIPS, 2019.

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In ICCV, 2017.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Ma, X., Zhang, P., Zhang, S., Duan, N., Hou, Y., Zhou, M., and Song, D. A tensorized transformer for language modeling. In NeurIPS, 2019.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In ICLR, 2017.

Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Černocký, J. Empirical evaluation and combination of advanced language modeling techniques. In INTERSPEECH, 2011.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In ICLR, 2018.

Nguyen, T. Q. and Salazar, J. Transformers without tears: Improving the normalization of self-attention. arXiv preprint arXiv:1910.05895, 2019.

Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling neural machine translation. In Conference on Machine Translation, 2018.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL: Demonstrations, 2019.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.

Qiao, S., Wang, H., Liu, C., Shen, W., and Yuille, A. Weight standardization. arXiv preprint arXiv:1903.10520, 2019.

Rahimi, A. NeurIPS 2017 test-of-time award presentation, December 2017.

Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS, 2016.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.

Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? In NeurIPS, 2018.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In ACL, 2016.

Singh, S. and Shrivastava, A. Evalnorm: Estimating batch normalization statistics for evaluation. In ICCV, 2019.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. Instance nor- malization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.

Wang, L., Huang, J., Huang, K., Hu, Z., Wang, G., and Gu, Q. Improving neural language generation with spectrum control. In ICLR, 2020.

Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., and Chao, L. S. Learning deep transformer models for machine translation. In ACL, 2019.

Wu, F., Fan, A., Baevski, A., Dauphin, Y., and Auli, M. Pay less attention with lightweight and dynamic convolutions. In ICLR, 2019.

Wu, Y. and He, K. Group normalization. In ECCV, 2018.

Xu, J., Sun, X., Zhang, Z., Zhao, G., and Lin, J. Understanding and improving layer normalization. In NeurIPS, 2019.

Yan, J., Wan, R., Zhang, X., Zhang, W., Wei, Y., and Sun, J. Towards stabilizing batch statistics in backward propagation of batch normalization. In ICLR, 2020.

Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank RNN language model. In ICLR, 2018.

Yao, Z., Gholami, A., Keutzer, K., and Mahoney, M. W. PyHessian: Neural networks through the lens of the Hessian. arXiv preprint arXiv:1912.07145, 2019.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zhang, B. and Sennrich, R. Root mean square layer normalization. In NeurIPS, 2019.

Zhang, B., Titov, I., and Sennrich, R. Improving deep transformer with depth-scaled initialization and merged attention. In EMNLP, 2019a.

Zhang, H., Dauphin, Y. N., and Ma, T. Residual learning without normalization via better initialization. In ICLR, 2019b.

A. Training Details

A.1. Machine Translation

Dataset. The training/validation/test sets for the IWSLT14 dataset contain about 153K/7K/7K sentence pairs, respectively. We use a vocabulary of 10K tokens based on a joint source and target byte pair encoding (BPE) (Sennrich et al., 2016). For the WMT14 dataset, we follow the setup of (Vaswani et al., 2017), which contains 4.5M training parallel sentence pairs. Newstest2014 is used as the test set, and Newstest2013 is used as the validation set. The 37K vocabulary for WMT14 is based on a joint source and target BPE factorization.

Hyperparameters. Given the unstable gradient issues of decoders in NMT (Zhang et al., 2019a), we only change the normalization layers in the 6 encoder layers from LN to BN/PN, and we keep LN in all 6 decoder layers. For Transformer_PN-V big and Transformer_BN big (but not Transformer_PN big), we use the synchronized version, where each FP and BP synchronizes the mean/variance/quadratic mean of the batches on different nodes. For PN, we set the α in the forward and backward passes differently, and we tune the best setting over 0.9/0.95/0.99 on the validation set. To control the scale of the activations, we also add a layer-scale layer (Zhang & Sennrich, 2019) before the normalization layer in each model setting. A warmup scheme for accumulating ψ is also employed, as suggested in (Yan et al., 2020). Specifically, we do not tune the warmup steps, but set them identical to the warmup steps of the learning rate schedule in the optimizer (Vaswani et al., 2017). We set the dropout to 0.3/0.0 for the Transformer big/small models, respectively. We use the Adam optimizer and follow the optimizer setting and learning rate schedule in (Wang et al., 2019). Following (Ott et al., 2018), we set the maximum number of updates to 300k for WMT and 100k for IWSLT. We use early stopping, terminating the experiments when there is no improvement over the last 10/5 epochs. For the big model, we enlarge the batch size and learning rate, as suggested in (Ott et al., 2019), to accelerate training. We employ label smoothing of value ls = 0.1 in all experiments. We implement our code for MT using fairseq-py (Ott et al., 2019).

Evaluation. We use BLEU (Papineni et al., 2002) as the evaluation metric for MT, computed with the multi-bleu.perl script (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl). Following standard practice, we measure tokenized case-sensitive BLEU and case-insensitive BLEU for WMT14 En-De and IWSLT14 De-En, respectively. For a fair comparison, we do not include other external datasets. For inference, we average the last 10 checkpoints, and we set the length penalty to 0.6/1.0 and the beam size to 4/5 for WMT/IWSLT, following (Ott et al., 2019).

A.2. Language Modeling

Dataset. PTB (Mikolov et al., 2011) has 0.93M training tokens, 0.073M validation words, and 0.082M test words. WikiText-103 (Merity et al., 2017) contains 0.27M unique tokens and 100M training tokens from 28K articles, with an average length of 3.6K tokens per article. We use the same evaluation scheme as provided in (Dai et al., 2019).

Hyperparameters. We use a three-layer tensorized transformer core-1 for PTB and a six-layer tensorized transformer core-1 for WikiText-103, following (Ma et al., 2019). This means there exists only one linear projection in the multi-linear attention. We replace every LN layer with a PN layer. For PN, we set the α in the forward and backward passes differently, and we tune the best setting over 0.9/0.95/0.99 on the validation set. The warmup scheme and layer-scale are the same as in the hyperparameter setting introduced for machine translation. We set the dropout to 0.3 for all datasets. The model is trained for 30 epochs for both PTB and WikiText-103. We use the Adam optimizer, and we follow the learning rate setting in (Ma et al., 2019). We set the warmup steps to 4000 and the label smoothing to ls = 0.1 in all experiments.

B. Extra Results

B.1. Empirical Results for Lemma 2

Under Assumption 9, mentioned in Section 4.1 and discussed in Appendix C.1, we show

    ‖∂L/∂X_{:,i}‖² = (γ_i² / (ψ_B)_i²) ( ‖∂L̂/∂X_{:,i}‖² − ⟨ ∂L̂/∂X_{:,i}, X̂_{:,i}/√B ⟩² ).

Given that ⟨∂L̂/∂X_{:,i}, X̂_{:,i}/√B⟩² is non-negative, the Lipschitz constant of L is smaller than that of L̂ if γ_i ≤ (ψ_B)_i. Here, we report empirical results showing that γ_i ≤ (ψ_B)_i holds for each i ∈ {1, 2, ..., d} on IWSLT14; see Figure 7. Observe that the Lipschitz constant of L is smaller than that of L̂ empirically in our setting.

Figure 7. The empirical distribution of γ/ψ_B ∈ R^d in different layers (Layer 1 through Layer 6) of Transformer_PN-V on IWSLT14. Given that γ_i ≤ (ψ_B)_i holds for each i ∈ {1, 2, ..., d}, Lemma 2 holds as well.

B.2. Validation Results on Language Modeling

Table 3. Additional validation and test results with state-of-the-art results on PTB and WikiText-103. '–' indicates no reported results in that setting, '*' indicates that the results are from our own implementation. PN achieves 5.6/3.0 points lower testing PPL on PTB and WikiText-103, respectively, as compared to LN.

  Model                                          PTB Val   PTB Test   WikiText-103 Val   WikiText-103 Test
  Tied-LSTM (Inan et al., 2017)                    75.7       48.7           –                  48.7
  AWD-LSTM-MoS (Yang et al., 2018)                 58.1       56.0          29.0                29.2
  Adaptive Input (Baevski & Auli, 2019)            59.1       57.0          19.8                20.5
  Transformer-XL_base (Dai et al., 2019)           56.7       54.5          23.1                24.0
  Transformer-XL_large (Dai et al., 2019)           –          –             –                  18.3
  Tensor-Transformer_1core (Ma et al., 2019)       55.4       57.9          23.6                20.9
  Tensor-Transformer_2core (Ma et al., 2019)       54.3       49.8          19.7                18.9
  Tensor-Transformer_1core + LN                    58.0*      53.2*         22.7*               20.9*
  Tensor-Transformer_1core + BN                    71.7       60.7          28.4                27.2
  Tensor-Transformer_1core + PN-V                  59.7       55.3          23.6                21.3
  Tensor-Transformer_1core + PN                    51.6       47.6          18.3                17.9

B.3. More Comparisons

As shown in Table 4, we present more comparison results for different normalization methods, including Moving Average Batch Normalization (MABN) (Yan et al., 2020), Batch Renormalization (BRN) (Ioffe, 2017), and Group Normalization (GN) (Wu & He, 2018). We can observe that BRN/MABN/PN-V are better than BN but worse than LN, which suggests that the small batch-size setting (the main focus of (Yan et al., 2020; Ioffe, 2017; Wu & He, 2018)) may share characteristics with the NLP setting, where there exists large variance across batches. GN performs best among the previously proposed methods, which is expected given that LN can be viewed as a special case of GN (with group number 1).⁵

⁵ We empirically found that setting the group number equal to the number of attention heads leads to the best performance.

Table 4. (Left column) NMT performance (BLEU) on IWSLT14 De-En. (Right column) LM performance (Test PPL) on PTB.

  Model               IWSLT14 (BLEU)   PTB (Test PPL)
  Transformer_BN           34.4             60.7
  Transformer_BRN          34.7             58.3
  Transformer_MABN         34.9             57.2
  Transformer_LN           35.5             53.2
  Transformer_GN           35.7             51.7
  Transformer_PN-V         35.5             55.3
  Transformer_PN           35.9             47.6

Throughout these comparisons, PN still performs best on the two tasks, which may further validate the effectiveness of our method.

C. Theoretical Results

In this section, we discuss the theoretical results on BN and PN. We assume γ and β to be constants for our analysis of BN, PN-V and PN.

Since the derivative of the loss L w.r.t. Y is known as ∂L/∂Y, trivially, we have ∂L/∂X̌ = γ ⊙ ∂L/∂Y. Also, it is not hard to establish the following fact.

Fact 5. The derivatives of µ_B and σ_B² w.r.t. x_i are

    ∂µ_B/∂x_i = 1/B   and   ∂σ_B²/∂x_i = (2/B)(x_i − µ_B).    (14)

We are now ready to show the derivative of L w.r.t. x_i under BN.

Lemma 6 (Derivative of L w.r.t. x_i in BN). Based on Fact 5, it holds that

    ∂L/∂x_i = (1/σ_B) ∂L/∂x̌_i − (1/(σ_B B)) Σ_{j∈B} ∂L/∂x̌_j (1 + x̌_j x̌_i).    (15)

Proof. Based on the chain rule, we have

    ∂L/∂x_i = (∂L/∂x̌_i)(∂x̌_i/∂x_i) + Σ_{j∈B} ( (∂L/∂x̌_j)(∂x̌_j/∂µ_B)(∂µ_B/∂x_i) + (∂L/∂x̌_j)(∂x̌_j/∂σ_B)(∂σ_B/∂x_i) )
            = (1/σ_B) ∂L/∂x̌_i + Σ_{j∈B} ∂L/∂x̌_j ( (∂x̌_j/∂µ_B)(1/B) + (∂x̌_j/∂σ_B²)(2/B)(x_i − µ_B) )
            = (1/σ_B) ∂L/∂x̌_i − (1/(σ_B B)) Σ_{j∈B} ∂L/∂x̌_j ( 1 + ((x_i − µ_B)/σ_B)((x_j − µ_B)/σ_B) )    (16)
            = (1/σ_B) ∂L/∂x̌_i − (1/(σ_B B)) Σ_{j∈B} ∂L/∂x̌_j (1 + x̌_j x̌_i).

Replacing ∂L/∂X̌ by γ ⊙ ∂L/∂Y, we obtain Eq. 3. In the following, we first discuss the theoretical properties of PN-V in Appendix C.1; and then we discuss how to use running statistics in the forward propagation and how to modify the corresponding backward propagation in Appendix C.2.
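Before moving on, a quick numerical sanity check of Eq. (15) (equivalently Eq. (3), up to the γ factor) against automatic differentiation; this is our own illustration, with an arbitrary downstream loss, not part of the paper's code.

```python
import torch

torch.manual_seed(0)
B, d = 8, 4
X = torch.randn(B, d, requires_grad=True)

mu = X.mean(dim=0)
sigma = X.var(dim=0, unbiased=False).sqrt()   # biased variance, as in Eq. (1)
X_check = (X - mu) / sigma
loss = X_check.sin().sum()                    # arbitrary downstream loss
g_auto, = torch.autograd.grad(loss, X)

# Eq. (15): dL/dx_i = (1/sigma)[dL/dxc_i - mean_j dL/dxc_j - xc_i * mean_j(dL/dxc_j * xc_j)]
dL_dXc = X_check.cos()                        # dL/dX_check for the loss above
g_manual = (dL_dXc
            - dL_dXc.mean(dim=0)
            - X_check * (dL_dXc * X_check).mean(dim=0)) / sigma
print(torch.allclose(g_auto, g_manual, atol=1e-5))   # expected: True
```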

C.1. Proof of PN-V

Before showing the gradient of L w.r.t. x_i under PN-V, we note the following fact, which is not hard to establish.

Fact 7. The derivative of ψ_B² w.r.t. x_i is

    ∂ψ_B²/∂x_i = (2/B) x_i.    (17)

With the help of Fact 7, we can prove the following lemma.

Lemma 8 (Derivative of L w.r.t. x_i in PN-V). Based on Fact 7, it holds that

    ∂L/∂x_i = (1/ψ_B) ∂L/∂x̂_i − (1/(B ψ_B)) Σ_{j∈B} ∂L/∂x̂_j x̂_j x̂_i.    (18)

Proof. Based on the chain rule, we have

    ∂L/∂x_i = (∂L/∂x̂_i)(∂x̂_i/∂x_i) + Σ_{j∈B} (∂L/∂x̂_j)(∂x̂_j/∂ψ_B²)(∂ψ_B²/∂x_i)
            = (1/ψ_B) ∂L/∂x̂_i + Σ_{j∈B} ∂L/∂x̂_j ( −x_j/(2ψ_B³) )( 2x_i/B )    (19)
            = (1/ψ_B) ∂L/∂x̂_i − (1/(B ψ_B)) Σ_{j∈B} ∂L/∂x̂_j x̂_j x̂_i.
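As a numerical sanity check of the gradient just derived in Eqs. (18)-(19), the sketch below compares the closed-form expression against autograd for a small random input; this is our own illustration with an arbitrary downstream loss, not part of the released code.

```python
import torch

torch.manual_seed(0)
B, d = 8, 4
X = torch.randn(B, d, requires_grad=True)

# PN-V forward: psi_B is the per-feature quadratic mean over the batch.
psi = X.pow(2).mean(dim=0).sqrt()            # shape (d,)
X_hat = X / psi
loss = X_hat.sin().sum()                     # arbitrary downstream loss
g_auto, = torch.autograd.grad(loss, X)

# Eq. (18): dL/dx_i = (1/psi) dL/dxhat_i - (1/(B*psi)) sum_j (dL/dxhat_j * xhat_j) * xhat_i
dL_dXhat = X_hat.cos()                       # dL/dX_hat for the loss above
correction = (dL_dXhat * X_hat).mean(dim=0)  # (1/B) sum_j dL/dxhat_j * xhat_j
g_manual = dL_dXhat / psi - correction * X_hat / psi
print(torch.allclose(g_auto, g_manual, atol=1e-5))   # expected: True
```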

Replacing ∂L/∂X̂ by γ ⊙ ∂L/∂Y, we obtain Eq. 6.

In order to show the effect of PN-V on the Lipschitz constant of the loss, we make the following standard assumption, as in (Santurkar et al., 2018).

Assumption 9. Denote the loss of the non-normalized neural network, which has the same architecture as the PN-V normalized neural network, as L̂. We assume that

    ∂L/∂y_i = ∂L̂/∂x_i,    (20)

where y_i is the i-th row of Y.

Based on these results, we have the following proof of Lemma 2, which was stated in Section 4.

Proof of Lemma 2. Since all the computational operators in the derivative are element-wise, here we consider d = 1 for notational simplicity.⁶ When d = 1, Lemma 8 can be written as

⁶ For d ≥ 2, we just need to separate the entries and prove them individually.

    ∂L/∂x_i = (1/ψ_B) ∂L/∂x̂_i − (1/(B ψ_B)) ⟨∂L/∂X̂, X̂⟩ x̂_i.    (21)

Therefore, we have

    ∂L/∂X = (1/ψ_B) ∂L/∂X̂ − (1/(B ψ_B)) ⟨∂L/∂X̂, X̂⟩ X̂.    (22)

Since

    ‖X̂‖² = Σ_{i∈B} x_i x_i / ( (1/B) Σ_{i∈B} x_i x_i ) = B,    (23)

the following equation can be obtained

    ‖∂L/∂X‖² = (1/ψ_B²) ‖ ∂L/∂X̂ − ⟨∂L/∂X̂, X̂/√B⟩ X̂/√B ‖²
             = (1/ψ_B²) ( ‖∂L/∂X̂‖² − 2⟨∂L/∂X̂, ⟨∂L/∂X̂, X̂/√B⟩ X̂/√B⟩ + ‖⟨∂L/∂X̂, X̂/√B⟩ X̂/√B‖² )
             = (1/ψ_B²) ( ‖∂L/∂X̂‖² − ⟨∂L/∂X̂, X̂/√B⟩² )    (24)
             = (γ²/ψ_B²) ( ‖∂L/∂Y‖² − ⟨∂L/∂Y, X̂/√B⟩² )
             = (γ²/ψ_B²) ( ‖∂L̂/∂X‖² − ⟨∂L̂/∂X, X̂/√B⟩² ).

C.2. Proof of PN

BL In order to prove that after the replacement of BpXptqq with Eq. 12, the gradient of the input is bounded, we need the following assumptions.

Assumption 10. We assume that

BL }xˆi} ď C1 and } } ď C2, (25) Bxˆi

for all input datum point and all iterations. We also assume that the exponentially decaying average of each element of xˆi is bounded away from zero,

t t´j p1 ´ αq α xˆixˆi ą C3 ą 0, @t, (26) j“0 ÿ where we denote α as the decay factor for the backward pass. In addition, we assume that α satisfies

2 1 pC1q ă . (27) 1 ´ α

W.l.o.g., we further assume that every entry of ψptq is bounded below, i.e.

ptq C0 ă ψ , @t. (28)

ptq If we can prove or ν is bounded by some constant C4 (the official proof is in Lemma 11), then it is obvious to prove the each datum point of X1 is bounded. Based on these results, we have the following proof of Theorem4, which was stated in Section4. Ă PowerNorm: Rethinking Batch Normalization in Transformers

Proof of Theorem4. It is easy to see that BL X1 2 νpt´1qxˆptq 2 } i,:} “} ptq ´ i } Bxˆi Ă BL t 1 ptq BL t 1 ptq νp ´ qxˆ , νp ´ qxˆ “ x ptq ´ i ptq ´ i y Bxˆi Bxˆi BL BL 2 νpt´1qxˆptq 2 , νpt´1qxˆptq “} ptq } ` } i } ´ 2x ptq i y Bxˆi Bxˆi BL BL 2 νpt´1q 2 xˆptq 2 , νpt´1qxˆptq ď } ptq } ` } } } i } ´ 2x ptq i y Bxˆi Bxˆi BL BL 2 νpt´1q 2 xˆptq 2 νpt´1qxˆptq ď } ptq } ` } } } i } ` 2} ptq }} i } Bxˆi Bxˆi BL BL 2 νpt´1q 2 xˆptq 2 νpt´1q xˆptq ď } ptq } ` } } } i } ` 2} ptq }} }} i } Bxˆi Bxˆi 2 2 3 ď pC2q ` pC1q pC4q ` C1C2C4

All these inequalities come from Cauchy-Schwarz inequity and the fact that 2 2 2 2 2 2 pa1b1q ` ... ` padbdq ď pa1 ` ... ` adqpb1 ` ... ` bdq.

ptq ptq In the final step of Theorem4, we directly use that ν is uniformly bounded (each element of ν is bounded) by C4. The exact proof is shown in below. Lemma 11. Under Assumption 10, νptq is uniformly bounded.

Proof. For simplicity, denote BL as xˆ1 . It is not hard to see, Bxˆi i 1 B }Γptq}2 “ } xˆptqxˆptq}2 B2 i i i“1 ÿ 1 B B “ x xˆptqxˆptq, xˆptqxˆptqy B2 i i i i i“1 i“1 ÿ ÿ 1 2 ptq ptq ptq ptq ď pB maxtxxˆj xˆj , xˆi xˆi yuq B2 j 2 ď pC1q .

ptq t t´j pjq pjq Similarly, we will have }Λ } ď C1C2 as well as p1 ´ αq j“0 α Γ Γ ą C3. We have νptq “ p1 ´ p1 ´ αqΓptqqνpt´1q ` p1 ´ αřqΛptq “ p1 ´ p1 ´ αqΓptqqpp1 ´ p1 ´ αqΓpt´1qqνpt´2q ` p1 ´ αqΛpt´1qq ` p1 ´ αqΛptq . . t j´1 “ p1 ´ αq p1 ´ p1 ´ αqΓpt´k`1qq Λpt´jq. j“0 k“0 ÿ ` ź ˘ Then, j 1 j 1 1 t ´ t ´ }νptq}2 “x p1 ´ p1 ´ αqΓpt´k`1qq Λpt´jq, p1 ´ p1 ´ αqΓpt´k`1qq Λpt´jqy. p1 ´ αq2 j“0 k“0 j“0 k“0 ÿ ` ź ˘ ÿ ` ź ˘ PowerNorm: Rethinking Batch Normalization in Transformers

Notice that with the definition, 1 B Γpmq “ xˆpmqxˆpmq, (29) B i i i“1 ÿ we will have that all entries of Γpmq are positive, for m P t0, 1, ..., tu. It is clear that when all entries of Λpmq, for m P t0, 1, ..., tu, have the same sign (positive or negative), the above equation achieves its upper bound. W.l.o.g., we assume they are all positive. Since 0 ă α ă 1, it is easy to see that, when K “ rplogp p1´αqC3 q{ logpαqqs, then the following inequality holds, 2C1C1 8 C p1 ´ αq αj ă 3 . (30) 2C C j“K 1 1 ÿ pkq pkq Since }Γ } ď C1, the value of any entry of Γ is also bounded by C1. Therefore, based on this and Eq. 30, when t ą K, we will have t t t´K p1 ´ αq αt´k ΓpkqΓpkq “ p1 ´ αq αt´k ΓpkqΓpkq ´ p1 ´ αq αt´k ΓpkqΓpkq k“t´K`1 k“0 k“0 ÿ ÿ ÿ t´K t´k pkq pkq ą C3~1 ´ p1 ´ αq α }Γ }}Γ } k“0 ÿ t´K t´k ą C3~1 ´ p1 ´ αqC1C1~1 α k“0 ÿt k (31) “ C3~1 ´ p1 ´ αqC1C1~1 α k“K ÿ8 k ą C3~1 ´ p1 ´ αqC1C1~1 α k“K ÿ C3~1 ą C3~1 ´ 2 C ~1 “ 3 , 2 where ~1 is the unit vector. Then, for t ą K, we can bound from below the arithmetic average of the K corresponding items of Γ,

1 K´1 1 K´1 pC q2 ą Γpt´kqΓpt´kq ą αkΓpt´kqΓpt´kq 1 K αK´1 k“0 k“0 ÿ ÿ 1 t “ αt´1ΓpkqΓpkq (32) αK´1 k“t´K`1 ÿ C3 ą “ C5 ą 0. 2p1 ´ αqαK´1 This inequality shows that after the first K items, for any K consecutive Γpkq, the average of them will exceeds a constant number, C5. Therefore, for any t ą T ą K, we will have

T ´K 1 pt´kq pt´kq T ´ K 1 C5 Γ Γ ą t upK qC5 ą . (33) T ´ K K T ´ K 2 k“0 ÿ

t j´1 pt´k`1q pt´jq t j´1 pt´k`1q pt´jq Let us split j“0 k“0p1 ´ p1 ´ αqΓ q Λ into two parts: (i) j“K k“0p1 ´ p1 ´ αqΓ q Λ , and (ii) K´1 j´1 α pt´k`1q pt´jq. From so on, we will discuss how we deal with these two parts j“ř0 ` śk“0p1 ´ p1 ´ qΓ q Λ˘ ř ` ś ˘ respectively. ř ` ś ˘ PowerNorm: Rethinking Batch Normalization in Transformers

t j´1 pt´k`1q pt´jq Case 1: j“K k“0p1 ´ p1 ´ αqΓ q Λ Notice that for 0 ă aj ă 1, the following inequality can be proven with simply induction, ř ` ś ˘ k´1 1 k´1 p1 ´ a q ď p1 ´ α qk. (34) j k j j“0 j“0 ź ÿ pt´j`1q Replacing aj with p1 ´ αqΓ , we will have

t j´1 t j´1 p1 ´ αq j p1 ´ p1 ´ αqΓpt´k`1qq Λpt´jq ď p1 ´ Γpt´k`1qq Λpt´jq j j“K k“0 j“K k“0 ÿ ź ÿ ÿ ` ˘ t ` ˘ C j ď p1 ´ p1 ´ αq 5 q Λpt´jq 2 j“K ÿ (35) t ` ˘ C5 j ď p1 ´ p1 ´ αq q C1C2 2 j“K ÿ ` ˘ 2 ď C1C2 “ C6. p1 ´ αqC5

Here the second inequality comes from Eq. 33, and the third inequality comes form the the fact each entry of Λpmq is smaller pmq 2 than C1C2, given }Λ } ď C1C2. The final inequality comes from Eq. 31, where 0 ă C5 ă pC1q ă 1{p1 ´ αq, then we can have 0 ă 1 ´ p1 ´ αqC5{2 ă 1. ` ˘ K´1 j´1 pt´k`1q pt´jq Case 2: j“0 k“0p1 ´ p1 ´ αqΓ q Λ It is easy to see

ř ` ś K´1 j´1 ˘ K´1 j´1 p1 ´ p1 ´ αqΓpt´k`1qq Λpt´jq ď p~1q Λpt´jq j“0 k“0 j“0 k“0 (36) ÿ ` ź ˘ ÿ ` ź ˘ ď KC1C2.

Combining Case 1 and 2, we have

j 1 j 1 1 t ´ t ´ }νptq}2 “ x p1 ´ p1 ´ αqΓpt´k`1qq Λpt´jq, p1 ´ p1 ´ αqΓpt´k`1qq Λpt´jqy p1 ´ αq2 j“0 k“0 j“0 k“0 (37) ÿ ` ź ˘ ÿ ` ź ˘ ď xC6~1 ` KC1C2~1,C6~1 ` KC1C2~1y ă C7,

ptq ? which indicates }ν } is bounded and C4 “ p1 ´ αq C7.