Arxiv:1805.10694V3 [Stat.ML] 6 Oct 2018 Ing Algorithm for These Settings
Total Page:16
File Type:pdf, Size:1020Kb
Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization Jonas Kohler* Hadi Daneshmand* Aurelien Lucchi Thomas Hofmann ETH Zurich ETH Zurich ETH Zurich ETH Zurich Ming Zhou Klaus Neymeyr Universit¨atRostock Universit¨atRostock Abstract (Bn) (Ioffe and Szegedy, 2015). This technique has been proven to successfully stabilize and accelerate Normalization techniques such as Batch Nor- training of deep neural networks and is thus by now malization have been applied successfully for standard in many state-of-the art architectures such training deep neural networks. Yet, despite as ResNets (He et al., 2016) and the latest Inception its apparent empirical benefits, the reasons Nets (Szegedy et al., 2017). The success of Batch Nor- behind the success of Batch Normalization malization has promoted its key idea that normalizing are mostly hypothetical. We here aim to pro- the inner layers of a neural network stabilizes train- vide a more thorough theoretical understand- ing which recently led to the development of many ing from a classical optimization perspective. such normalization methods such as (Arpit et al., 2016; Our main contribution towards this goal is Klambauer et al., 2017; Salimans and Kingma, 2016) the identification of various problem instances and (Ba et al., 2016) to name just a few. in the realm of machine learning where Batch Yet, despite the ever more important role of Batch Normalization can provably accelerate opti- Normalization for training deep neural networks, the mization. We argue that this acceleration Machine Learning community is mostly relying on em- is due to the fact that Batch Normalization pirical evidence and thus lacking a thorough theoretical splits the optimization task into optimizing understanding that can explain such success. Indeed length and direction of the parameters sepa- { to the best of our knowledge { there exists no theo- rately. This allows gradient-based methods to retical result which provably shows faster convergence leverage a favourable global structure in the rates for this technique on any problem instance. So loss landscape that we prove to exist in Learn- far, there only exists competing hypotheses that we ing Halfspace problems and neural network briefly summarize below. training with Gaussian inputs. We thereby turn Batch Normalization from an effective practical heuristic into a provably converg- 1.1 Related work arXiv:1805.10694v3 [stat.ML] 6 Oct 2018 ing algorithm for these settings. Furthermore, we substantiate our analysis with empirical Internal Covariate Shift The most widespread evidence that suggests the validity of our the- idea is that Batch Normalization accelerates training by oretical results in a broader context. reducing the so-called internal covariate shift, defined as the change in the distribution of layer inputs while the conditional distribution of outputs is unchanged. 1 INTRODUCTION This change can be significant especially for deep neu- ral networks where the successive composition of layers One of the most important recent innovations for opti- drives the activation distribution away from the ini- mizing deep neural networks is Batch Normalization tial input distribution. Ioffe and Szegedy (2015) argue that Batch Normalization reduces the internal covari- ate shift by employing a normalization technique that enforces the input distribution of each activation layer *The first two authors have contributed equally to this work. to be whitened - i.e. enforced to have zero means and Correspondence to [email protected] unit variances - and decorrelated . Yet, as pointed out Exponential convergence rates for Batch Normalization by Lipton and Steinhardt (2018), the covariate shift provably accelerates optimization with Gradient phenomenon itself is not rigorously shown to be the Descent and does the length-direction decoupling play a reason behind the performance of Batch Normalization. role in this phenomenon? Furthermore, a recent empirical study published by (Santurkar et al., 2018) provides strong evidence sup- We answer both questions affirmatively. In particular, porting the hypothesis that the performance gain of we show that the specific variance transformation of Batch Normilization is not explained by the reduction Bn decouples the length and directional components of internal covariate shift. of the weight vectors in such a way that allows local search methods to exploit certain global properties of Smoothing of objective function Recently, San- the optimization landscape (present in the directional turkar et al. (2018) argue that under certain assump- component of the optimal weight vector). Using this tions a normalization layer simplifies optimization by fact and endowing the optimization method with an smoothing the loss landscape of the optimization prob- adaptive stepsize scheme, we obtain an exponential (or lem of the preceding layer. Yet, we note that this effect as more commonly termed linear) convergence rate for may - at best - only improve the constant factor of Batch Norm Gradient Descent on the (possibly) non- the convergence rate of Gradient Descent and not the convex problem of Learning Halfspaces with Gaussian rate itself (e.g. from sub-linear to linear). Furthermore, inputs (Section 4), which is a prominent problem in the analysis treats only the largest eigenvalue and thus machine learning Erdogdu et al. (2016). We thereby one direction in the landscape (at any given point) and turn Bn from an effective practical heuristic into a prov- keeps the (usually trainable) BN parameters fixed to ably converging algorithm. Additionally we show that zero-mean and unit variance. For a thorough conclu- the length-direction decoupling can be considered as a sion about the overall landscape, a look at the entire non-linear reparametrization of the weight space, which eigenspectrum (including negative and zero eigenval- may be beneficial for even simple convex optimization ues) would be needed. Yet, this is particularly hard tasks such as logistic regressions. Interestingly, non- to do as soon as one allows for learnable mean and linear weightspace transformations have received little variance parameters since the effect of their interplay to no attention within the optimization community (see on the distribution of eigenvalues is highly non-trivial. (Mikhalevich et al., 1988) for an exception). Finally, in Section 5 we analyze the effect of Bn for Length-direction decoupling Finally, a different training a multilayer neural network (MLP) and prove perspective was brought up by another normalization { again under a similar Gaussianity assumption { that technique termed Weight Normalization ( ) (Sali- Wn Bn acts in such a way that the cross dependencies be- mans and Kingma, 2016). This technique performs a tween layers are reduced and thus the curvature struc- very simple normalization that is independent of any ture of the network is simplified. Again, this is due data statistics with the goal of decoupling the length of to a certain global property in the directional part of the weight vector from its direction. The optimization the optimization landscape, which BN can exploit via of the training objective is then performed by training the length-direction decoupling. As a result, gradient- the two parts separately. As discussed in Section 2, Bn based optimization in reparametrized coordinates (and and Wn differ in how the weights are normalized but with an adaptive stepsize policy) can enjoy a linear share the above mentioned decoupling effect. Interest- convergence rate on each individual unit. We substan- ingly, weight normalization has been shown empirically tiate both findings with experimental results on real to benefit from similar acceleration properties as Batch world datasets that confirm the validity of our analysis Normalization (Gitman and Ginsburg, 2017; Salimans outside the setting of our theoretical assumptions that and Kingma, 2016). This raises the obvious question cannot be certified to always hold in practice. whether the empirical success of training with Batch Normalization can (at least partially) be attributed to its length-direction decoupling aspect. 2 BACKGROUND 1.2 Contribution and organization 2.1 Assumptions on data distribution We contribute to a better theoretical understanding Suppose that x 2 Rd is a random input vector and y 2 of Batch Normalization by analyzing it from an opti- {±1g is the corresponding output variable. Throughout mization perspective. In this regard, we particularly this paper we recurrently use the following statistics address the following question: u := E [−yx] ; S := E xx> (1) Can we find a setting in which Batch Normalization and make the following (weak) assumption. Jonas Kohler*, Hadi Daneshmand*, Aurelien Lucchi, Thomas Hofmann Assumption 1. [Weak assumption on data distribu- As stated (in finite-sum terms) in Algorithm 1 of tion] We assume that E [x] = 0. We further assume (Ioffe and Szegedy, 2015) the normalization operation that the spectrum of the matrix S is bounded as amounts to computing > > 0 < µ := λmin (S) ;L := λmax (S) < 1: (2) x w − E [x w] BN(x>w) = g x + γ; (6) > 1=2 As a result, S is the symmetric positive definite covari- varx[x w] ance matrix of x. where g 2 R and γ 2 R are (trainable) mean and The part of our analysis presented in Section 4 and 5 variance adjustment parameters. In the following, we relies on a stronger assumption on the data distribu- assume that x is zero mean (Assumption 1) and omit tion. In this regard we consider the combined random γ.2 Then the variance can be written as follows variable > > 2 > > z := −yx; (3) varx[x w] = Ex (x w) = Ex (w x)(x w) > > > whose mean vector and covariance matrix are u and S = w Ex xx w = w Sw as defined above in Eq. (1). (7) Assumption 2. [Normality assumption on data distri- and replacing this expression into the batch normalized bution] We assume that z is a multivariate normal ran- output of Eq.(5) yields dom variable distributed with mean E [z] = E [−yx] = u and second-moment E zz> − E[z]E[z]> = E xx> − x>w f (w; g) = E '(g ) : (8) uu> = S − uu>.