Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks Guangyong Chen * 1 Pengfei Chen * 2 1 Yujun Shi 3 Chang-Yu Hsieh 1 Benben Liao 1 Shengyu Zhang 2 1 Abstract sive performance. The state-of-the-art neural networks are In this work, we propose a novel technique to often complex structures comprising hundreds of layers of boost training efficiency of a neural network. Our neurons and millions of parameters. Efficient training of a work is based on an excellent idea that whitening modern DNN is often complicated by the need of feeding the inputs of neural networks can achieve a fast such a behemoth with millions of data entries. Developing convergence speed. Given the well-known fact novel techniques to increase training efficiency of DNNs is that independent components must be whitened, a highly active research topics. In this work, we propose a we introduce a novel Independent-Component novel training technique by combining two commonly used (IC) layer before each weight layer, whose in- ones, Batch Normalization (BatchNorm) (Ioffe & Szegedy, puts would be made more independent. However, 2015) and Dropout (Srivastava et al., 2014), for a purpose determining independent components is a compu- (making independent inputs to neural networks) that is not tationally intensive task. To overcome this chal- possibly achieved by either technique alone. This marriage lenge, we propose to implement an IC layer by of techniques endows a new perspective on how Dropout combining two popular techniques, Batch Nor- could be used for training DNNs and achieve the original malization and Dropout, in a new manner that we goal of whitening inputs of every layer (Le Cun et al., 1991; can rigorously prove that Dropout can quadrati- Ioffe & Szegedy, 2015) that inspired the BatchNorm work cally reduce the mutual information and linearly (but did not succeed). reduce the correlation between any pair of neu- Despite a seminal work (Le Cun et al., 1991) advocating rons with respect to the dropout layer parameter the benefits of whitening, this technique has largely been p. As demonstrated experimentally, the IC layer limited to preprocessing input data to a neural network. An consistently outperforms the baseline approaches attempt to implementing the whitening idea at every activa- with more stable training process, faster conver- tion layer turned out to be computationally demanding and gence speed and better convergence limit on CI- resulted in the proposal of BatchNorm instead. The method FAR10/100 and ILSVRC2012 datasets. The im- works by scaling the net activations of each activation layer plementation of our IC layer makes us rethink to zero mean and unit variance for performance enhance- the common practices in the design of neural ment. The normalization procedure significantly smooths networks. For example, we should not place the optimization landscape in the parameter space (Bjorck Batch Normalization before ReLU since the non- et al., 2018) to improve training performance. Due to Batch- negative responses of ReLU will make the weight Norm’s easy implementation and success, the attention was layer updated in a suboptimal way, and we can diverted away from the original goal of whitening. arXiv:1905.05928v1 [cs.LG] 15 May 2019 achieve better performance by combining Batch Normalization and Dropout together as an IC The motivation of this work is to resume the pursuit of devis- layer. ing a computationally efficient approach to whitening inputs of every layer. Given the well-known fact that the independent activations must be whitened, we attempt to make the 1. Introduction net activations of each weight layer more independent. This viewpoint is supported by a recent neural scientific finding: Deep neural networks (DNNs) have been widely adopted representation power of a neural system increases linearly in many artificial-intelligence systems due to their impres- with the number of independent neurons in the population (Moreno-Bote et al., 2014; Beaulieu-Laroche et al., 2018). *Equal contribution 1Tencent Technology 2The Chinese Uni- versity of Hong Kong 3Nankai University. Correspondence to: Thus, we are motivated to make the inputs into the weight Benben Liao <[email protected]>. layers more independent. As discussed in a subsequent section, the independent activations indeed make the training Rethinking the Usage of Batch Normalization and Dropout process more stable. 푖 − 1 푖 An intuitive solution to generate independent components NormNorm NormNorm is to introduce an additional layer, which performs the in- (a) Norm WeightWeightWeight WeightWeight ActivationActivationActivation ActivationActivationActivation BatchBatch BatchBatch dependent component analysis (ICA) (Oja & Nordhausen, Batch 2006) on the activations. A similar idea has been previously explored in (Huang et al., 2018), which adopts zero-phase IC IC IC IC IC component analysis (ZCA) (Bell & Sejnowski, 1997) to (b) IC WeightWeightWeight WeightWeightWeight ActivationActivationActivation ActivationActivation whiten the net activations instead of making them indepen- Activation dent. In particular, ZCA always serves as the first step for the Figure 1. (a) The common practice of performing the whitening ICA methods, which then rotates the whitened activations operation, or named as batch normalization, between the weight to obtain the independent ones. However, the computations layer and the activation layer. (b) Our proposal to place the IC of ZCA itself is quite expensive, since it requires to calcu- layer right before the weight layer. late the eigen-decomposition of a dj × dj matrix with dj being the number of neurons in the j-th layer. This problem is more serious for wide neural networks, where a large architectures (He et al., 2016a;b) with our layer, and find number of neurons often dwell in an intermediate layer. that their performance can be further improved. As demon- While attacking this challenging problem, we find that strated empirically on the CIFAR10/100 and ILSVRC2012 BatchNorm and Dropout can be combined together to con- datasets, the proposed IC layer can achieve more stable struct independent activations for neurons in each interme- training process with faster convergence speed, and improve diate weight layer. To simplify our presentation, we denote the generalization performance of the modern networks. the layers {-BatchNorm-Dropout-} as an Independent Com- The implementation of our IC layer makes us rethink the ponent (IC) layer in the rest of this paper. Our IC layer common practice of placing BatchNorm before the activa- disentangles each pair of neurons in a layer in a continuous tion function in designing DNNs. The effectiveness of doing fashion - neurons become more independent when our layer so for BatchNorm has been argued by some researchers, but is applied (Section 3.1). Our approach is much faster than no analysis has been presented to explain how to deal with the traditional whitening methods, such as PCA and ZCA as the BatchNorm layer. The traditional usage of BatchNorm mentioned above. An intuitive explanation of our method is has been demonstrated to significantly smooth the optimiza- as follows. The BatchNorm normalizes the net activations tion landscape, which induces a more predictive and stable so that they have zero mean and unit variance just like the behavior of the gradients (Santurkar et al., 2018). How- ZCA method. Dropout constructs independent activations ever, as discussed in Section5, such usage of BatchNorm by introducing independent random gates for neurons in a still forbids the network parameters from updating in the layer, which allows neurons outputing its value with proba- gradient direction, which is the fastest way for the loss to bility p, or shuts them down by outputting zero otherwise. achieve minimum, and actually presents a zigzag optimiza- Intuitively, the output of an neuron conveys little information behavior. Moreover, different from prior efforts (Li tion from other neurons. Thus, we can imagine that these et al., 2018) where BatchNorm and Dropout are suggested neurons become statistically independent from each other. to be used simultaneously before the activation layer, our As theoretically demonstrated in Section 3.1, our IC layer analysis provides a unique perspective that BatchNorm and can reduce the mutual information between the outputs of Dropout together play a similar role as the ICA method and 2 any two neurons by a factor of p and reduce the correlation should be placed before the weight layer, which admits a coefficient by p, where p is the Dropout probability. To faster convergence speed when training the deep neural net- our knowledge, such a usage of Dropout has never been works. Both theoretical analysis and experimental results proposed before. indicate that BatchNorm and Dropout together should be Unlike the traditionally unsupervised learning methods, combined as the IC layer, which can be widely utilized for such as ICA and ZCA, we are not required to recover the training deep networks in future. original signals from the independent features or ensure To summarize, we list our main contributions as follows: the uniqueness of these features, but only required to dis- till some useful features, which can help fulfill the super- vised learning tasks. Our analysis, presented in Section2, • We combine two popular techniques, Batch Normal- shows that the proposed IC layer should be placed before the ization and Dropout, in a newly proposed Independent- weight

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks

Batch Normalization

On Self Modulation for Generative Adver- Sarial Networks

A Regularization Study for Policy Gradient Methods

Entropy-Based Aggregate Posterior Alignment Techniques for Deterministic Autoencoders and Implications for Adversarial Examples

Arxiv:1805.10694V3 [Stat.ML] 6 Oct 2018 Ing Algorithm for These Settings

Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks

A Batch Normalized Inference Network Keeps the KL Vanishing Away

Predicting Uber Demand in NYC with Wavenet

Large Memory Layers with Product Keys

Approach Pre-Trained Deep Learning Models with Caution Pre-Trained Models Are Easy to Use, but Are You Glossing Over Details That Could Impact Your Model Performance?

Outline for Today's Presentation

Improved Training of Wasserstein Gans