A Quantitative Analysis of the Effect of Batch Normalization on Descent

Yongqiang Cai 1 Qianxiao Li 1 2 Zuowei Shen 1

Abstract effects of BN are attributed to the so-called “reduction of Despite its empirical success and recent theoret- covariate shift”. However, it is unclear what this statement ical progress, there generally lacks a quantita- means in precise mathematical terms. tive analysis of the effect of batch normalization Although recent theoretical work have established cer- (BN) on the convergence and stability of gradi- tain convergence properties of with BN ent descent. In this paper, we provide such an (BNGD) and its variants (Ma & Klabjan, 2017; Kohler et al., analysis on the simple problem of ordinary least 2018; Arora et al., 2019), there generally lacks a quantita- squares (OLS). Since precise dynamical prop- tive comparison between the dynamics of the usual gradient erties of gradient descent (GD) is completely descent (GD) and BNGD. In other words, a basic question known for the OLS problem, it allows us to iso- that one could pose is: what quantitative changes does BN late and compare the additional effects of BN. bring to the stability and convergence of gradient descent dy- More precisely, we show that unlike GD, gra- namics? Or even more simply: why should one use BNGD dient descent with BN (BNGD) converges for instead of GD? To date, a general mathematical answer to arbitrary learning rates for the weights, and the these questions remain elusive. This can be partly attributed convergence remains linear under mild conditions. to the complexity of the optimization objectives that one Moreover, we quantify two different sources of typically applies BN to, such as those encountered in deep acceleration of BNGD over GD – one due to over- learning. In these cases, even a quantitative analysis of the parameterization which improves the effective dynamics of GD itself is difficult, not to mention a precise condition number and another due having a large comparison between the two. range of learning rates giving rise to fast descent. These phenomena set BNGD apart from GD and For this reason, it is desirable to formulate the simplest non- could account for much of its robustness proper- trivial setting, on which one can concretely study the effect ties. These findings are confirmed quantitatively of batch normalization and answer the questions above in by numerical experiments, which further show a quantitative manner. This is the goal of the current pa- that many of the uncovered properties of BNGD per, where we focus on perhaps the simplest supervised in OLS are also observed qualitatively in more learning problem – ordinary least squares (OLS) regression complex problems. – and analyze precisely the effect of BNGD when applied to this problem. A primary reason for this choice is that the dynamics of GD in least-squares regression is completely 1. Introduction understood, thus allowing us to isolate and contrast the ad- arXiv:1810.00122v2 [cs.LG] 9 May 2019 ditional effects of batch normalization. Batch normalization (BN) is one of the most important tech- niques for training deep neural networks and has proven Our main findings can be summarized as follows extremely effective in avoiding gradient blowups during back-propagation and speeding up convergence. In its orig- 1. Unlike GD, BNGD converges for arbitrarily large learn- inal introduction (Ioffe & Szegedy, 2015), the desirable ing rates for the weights, and the convergence remains linear under mild conditions. 1Department of Mathematics, National University of Sin- gapore, Singapore 2Institute of High Performance Comput- 2. 
The asymptotic linear convergence of BNGD is faster ing, A*STAR, Singapore. Correspondence to: Yongqiang Cai than that of GD, and this can be attributed to the over- , Qianxiao Li , Zuowei Shen . parameterization that BNGD introduces.

Proceedings of the 36 th International Conference on Machine 3. Unlike GD, the convergence rate of BNGD is insen- Learning, Long Beach, California, PMLR 97, 2019. Copyright sitive to the choice of learning rates. The range of 2019 by the author(s). insensitivity can be characterized, and in particular it A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

increases with the dimensionality of the problem. tions (Theorem 3.4). This convergence result is stronger, but this is to be expected since we are considering a specific case. More importantly, we discuss concretely how BNGD Although these findings are established concretely only for offers advantages over GD instead of just matching its best- the OLS problem, we will show through numerical experi- case performance. For example, not only do we show that ments that some of them hold qualitatively, and sometimes convergence occurs for any , we also derive a even quantitatively for more general situations in deep learn- quantitative relationship between the learning rate and the ing. convergence rate, from which the robustness of BNGD on OLS can be explained (see Section3). 1.1. Related Work Batch normalization was originally introduced in Ioffe & 1.2. Organization Szegedy(2015) and subsequently studied in further detail Our paper is organized as follows. In Section2, we outline in Ioffe(2017). Since its introduction, it has become an the ordinary least squares (OLS) problem and present GD important practical tool to improve stability and efficiency and BNGD as alternative means to solve this problem. In of training deep neural networks (Bottou et al., 2018). Ini- Section3, we demonstrate and analyze the convergence of tial heuristic arguments attribute the desirable features of the BNGD for the OLS model, and in particular contrast the BN to concepts such as “covariate shift”, but alternative results with the behavior of GD, which is completely known explanations based on landscapes (Santurkar et al., 2018) for this model. We also discuss the important insights to and effective regularization (Bjorck et al., 2018) have been BNGD that these results provide us with. We then validate proposed. these findings on more general supervised learning problems Recent theoretical studies of BN include Ma & Klabjan in Section4. Finally, we conclude in Section5. (2017); Kohler et al.(2018); Arora et al.(2019). We now outline the main differences between them and the current 2. Background work. In Ma & Klabjan(2017), the authors proposed a variant of BN, the diminishing batch normalization (DBN) 2.1. Ordinary Least Squares and Gradient Descent algorithm and established its convergence to a stationary d Consider the simple linear regression model where x ∈ R point of the . In Kohler et al.(2018), the authors is a random input column vector and y is the corresponding also considered a BNGD variant by dynamically setting the output variable. Since batch normalization is applied for learning rates and using bisection to optimize the rescaling each feature separately, in order to gain key insights it is variables introduced by BN. It is shown that this variant of sufficient to consider the case y ∈ R. A noisy linear rela- BNGD converges linearly for simplified models, including tionship is assumed between the dependent variable y and an OLS model and “learning halfspaces”. The primary dif- the independent variables x, i.e. y = xT w + noise where ference in the current work is that we do not dynamically d w ∈ R is the vector of trainable parameters. Denote the modify the learning rates, and consider instead a constant following moments: learning rate, i.e. the original BNGD algorithm. This is an important distinction; While a decaying or dynamic learn- H := E[xxT ], g := E[xy], c := E[y2]. 
(1) ing rate is sometimes used in GD, in the case of BN it is critical to analyze the constant learning rate case, precisely To simplify the analysis, we assume the covariance matrix because one of the key practical advantages of BN is that a H of x is positive definite and the mean E[x] of x is zero. big learning rate can be used. Moreover, this allows us to The eigenvalues of H are denoted as λi(H), i = 1, 2, ...d,. isolate the influence of batch normalization itself, without Particularly, the maximum and minimum eigenvalue of H the potentially obfuscating effects a dynamic learning rate is denoted by λmax and λmin respectively. The condition number of H is defined as κ := λmax . Note that the positive schedule can introduce (e.g. see Eq. (10) and the discussion λmin that follows). As the goal of considering a simplified model definiteness of H allows us to define the vector norm k.kH 2 T is to analyze the additional effects purely due to BN on GD, by kxkH = x Hx. it is desirable to perform our analysis in this regime. The ordinary least squares (OLS) method for estimating the In Arora et al.(2019), the authors proved a general con- unknown parameters w leads to the following optimization 1/2 vergence result for BNGD of O(k− ) in terms of the problem, gradient norm for objectives with Lipschitz continuous gra- 1 T 2 min J0(w) : = Ex,y[(y − x w) ] (2) dients. This matches the best result for gradient descent w d 2 ∈R on general non-convex functions with learning rate tuning = c − wT g + 1 wT Hw, (Carmon et al., 2017). In contrast, our convergence result 2 2 1 is in iteration and is shown to be linear under mild condi- which has unique minimizer w = u := H− g. A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

The gradient descent (GD) method (with step size or learn- reasons which will become clear in the subsequent analysis. ing rate ε) for solving the optimization problem (2) is given We thus have the following discrete-time dynamical system: by the iteration  T  wk g ak = ak + εa − ak , (7) +1 σk wk+1 = wk − ε∇wJ0(wk) = (I − εH)wk + εg, (3)  T  ak wk g wk+1 = wk + ε σ g − σ2 Hwk . (8) 2 k k which converges if 0 < ε < =: εmax, and the con- λmax vergence rate is determined by the spectral radius ρε := To simplify subsequent notation, we denote by H∗ the ma- ρ(I − εH) = maxi{|1 − ελi(H)|} with trix

HuuT H ku − wk+1k ≤ ρ(I − εH)ku − wkk. (4) H∗ := H − uT Hu , (9) It is well-known (e.g. see Chapter 4 of Saad(2003)) that We will see later that the over-parameterization intro- 2 the optimal learning rate is εopt = , where the λmax+λmin duced by BN gives rise to a degenerate κ 1 u 2 optimal convergence rate is ρopt = − .  κ+1 diag 1, kw∗k 2 H∗ at a minimizer (a∗, w∗), and the BNGD k k dynamics is governed by H∗ instead of H as in the GD 2.2. Batch Normalization case. The matrix H∗ is positive semi-definite (H∗u = 0) and has better spectral properties than H, such as a lower Batch normalization is a feature-wise normalization proce- ∗ λmax effective condition number κ∗ = ∗ ≤ κ, where λmax∗ dure typically applied to the output, which in this case is λmin T simply z = x w. The normalization transform is defined and λmin∗ are the maximal and minimal nonzero eigenvalues as follows: of H∗ respectively. Particularly, κ∗ < κ for almost all u (see appendix B.1). z E[z] xT w N(z) := √− = , (5) Var[z] σ √ 3. Mathematical Analysis of BNGD on OLS where σ := wT Hw. After this rescaling, N(z) will be order 1, and hence in order to reintroduce the scale (Ioffe & In this section, we discuss several mathematical results one can derive concretely for BNGD on the OLS problem (6). Szegedy, 2015), we multiply N(z) with a rescaling param- eter a (Note that the shift parameter can be set zero since Compared with GD, the update coefficient before Hwk in E[wT x|w] = 0). Hence, we get the BN version of the OLS Eq. (8) changed from ε in Eq. (3) to a complicated term problem (2): which we call the effective learning rate εˆk

2 T 1  T   a w g min J(a, w) : = Ex,y y − aN(x w) k k d 2 εˆk := ε 2 . (10) w ,a σk σk ∈R ∈R c wT g 1 2 = 2 − σ a + 2 a . (6) Also, notice that with the over-parameterization introduced by a, it is no longer necessary for wk to converge to u. The objective function J(a, w) is no longer convex. In fact, In fact, any non-zero scalar multiple of can be a global T u it has critical points, {(a∗, w∗)|a∗ = 0, w∗ g = 0}, which minimum. Hence, instead of considering the residual u−wk are saddle points of J(a, w) if g =6 0. as in the GD analysis Eq. (4), we may combine Eq. (7) and We are interested in the critical points which constitute the Eq. (8) to give set of global minima and satisfy the relations T  T  wk g wk g √ u − σ2 wk+1 = (I − εˆkH) u − σ2 wk . (11) T k k a∗ = sign(s) u Hu, w∗ = su, for some s ∈ R \{0}. T 2 Define the modified residual ek := u−(wk g/σk)wk, which It is easy to check that they are in fact global minimizers and equals 0 if and only if wk is a global minimizer. Observe the Hessian matrix at each point is degenerate. Neverthe- that the mapping u 7→ (wT g/σ2)w = (wT Hu/wT Hw)w less, the saddle points are strict (see appendix B.1), which is an orthogonal projection under the inner product induced typically simplifies the analysis of gradient descent on non- by H, hence we immediately have convex objectives (Lee et al., 2016; Panageas & Piliouras, T 2017). wk g kek+1kH ≤ u − 2 wk+1 ≤ ρ(I − εˆkH)kekkH , σk H We consider the gradient descent method for solving the (12) problem (6), which we hereafter call batch normalization gradient descent (BNGD). We set the learning rates for a and where ρ(I − εˆkH) is spectral radius of the matrix I − εˆkH. w to be εa and ε respectively. These may be different, for In other words, as long as maxi{|1 − εˆkλi(H)|} ≤ ρˆ < 1 A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent for some ρˆ < 1 and all k, we have linear convergence of ics of BNGD, and is useful in our subsequent analysis of the residual (which also implies linear convergence of the its convergence and stability properties. For example, it objective, see appendix Lemma B.22). indicates that separating learning rate for weights (w) and rescaling parameters (a) is equivalent to changing the norm At this point, we make an important observation: if we allow of initial weights. for dynamic learning rates, we may simply set εˆk = c for some fixed c ∈ (0, 2/λmax) at every iteration. Then, linear convergence is immediate. However, it is clear that this fast 3.2. Batch Normalization Converges for Arbitrary Step convergence is almost entirely due to the effect of dynamic Size learning rates, and this has limited relevance in explaining Having established the scaling law, we then have the follow- the effect of BN. Moreover, comparing with Eq. (4) one ing convergence result for BNGD on OLS. can observe that with this choice, BNGD and GD have the Theorem 3.3 (Convergence of BNGD). The iteration se- same optimal convergence rates, and so this cannot offer quence (a , w ) in Eq. (7)-(8) converges to a stationary explanations for any advantage of BNGD over GD either. k k point for any initial value (a , w ) and any ε > 0, as long For these reasons, it is important to avoid such dynamic 0 0 as ε ∈ (0, 1]. Particularly, we have: If ε = 1 and ε > 0, learning rate assumptions. a a then (ak, wk) converges to global minimizers for almost all As discussed above, without using dynamic learning rates initial values (a0, w0). one has to then estimate εˆk to establish convergence. 
Heuris- tically, observe that if ε small enough, this is likely true as Sketch of proof. We first prove that the algorithm converges the other terms can be controlled due to the normalization. for any ε ∈ (0, 1] and small enough ε, with any initial value Thus, convergence for small ε should hold. In order to han- a (a , w ) such that kw k ≥ 1 (appendix Lemma B.12). Next, dle the large ε case, we establish a simple but useful scaling 0 0 0 we observe that the sequence {kw k} is monotone increas- law that draws connections amongst cases with different ε k ing, and thus either converges to a finite limit or diverges. scales. The scaling property is then used to exclude the divergent case if {kwkk} diverges, then at some k the norm kwkk 3.1. Scaling Property should be large enough, and by the scaling property, it is The dynamical properties of the BNGD iterations are equivalent to a case where kwkk = 1 and ε is small, which governed by a set of parameters, or a configuration we have proved converges. This shows that kwkk converges to a finite limit, from which the convergence of and {H, u, a0, w0, εa, ε}. wk the loss function value can be established, after some work. Definition 3.1 . Two configura- (Equivalent configuration) This proof is fully presented in appendix Theorem B.16 and tions, { } and { }, are H, u, a0, w0, εa, ε H0, u0, a0, w0, εa0 , ε0 the preceding lemmas. Lastly, using the “strict saddle point” said to be equivalent if for BNGD iterates { }, { } wk wk0 arguments (Lee et al., 2016; Panageas & Piliouras, 2017), following these configurations respectively, there is an in- we can prove the set of initial value for which (a , w ) con- vertible linear transformation and a nonzero constant k k T t verges to saddle points has Lebesgue measure 0, provided such that wk0 = T wk, ak0 = tak for all k. εa = 1, ε > 0 (appendix Lemma B.19). The scaling property ensures that equivalent configurations must converge or diverge together, with the same rate up to It is important to note that BNGD converges for all step a constant multiple. Now, it is easy to check the system has size ε > 0 of wk, independent of the spectral properties of the following scaling law. H. This is a significant advantage and is in stark contrast with GD, where the step size is limited by 2/λ , and Proposition 3.2 (Scaling property). Suppose µ =6 max the condition number of H intimately controls the stability 0, γ =6 0, r =6 0,QT Q = I, then (1) The γ and convergence rate. Although we only prove the almost configurations {µQT HQ, Qu, γa , γQw , ε , ε} and √µ 0 0 a everywhere convergence to a global minimizer for the case {H, u, a0, w0, εa, ε} are equivalent. (2) The configura- 2 of εa = 1, we have not encountered convergence to saddles tions {H, u, a0, w0, εa, ε} and {H, u, a0, rw0, εa, r ε} are in the OLS experiments even for εa ∈ (0, 2) with initial equivalent. values (a0, w0) drawn from typical distributions. It is worth noting that the scaling property (2) in Proposi- Remark: In appendixA, we show that the combination of tion 3.2 originates from the batch-normalization procedure the scaling property and the monotonicity of weight norms, and is independent of the specific structure of the loss func- which hold for batch (and weight) normalization of general tion. Hence, it is valid for general problems where BN is loss functions, can be used to prove a more general conver- used (appendix Lemma A.3). 
Despite being a simple result, gence result: if iterates converge for small enough ε, then the scaling property is important in determining the dynam- gradient norm converges for any ε. We note that in the inde- A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent pendent work of Arora et al.(2019), similar ideas have been of H is very large. used to prove convergence results for batch normalization The acceleration effect can be understood heuristically as for neural networks. Lastly, one can also show that in the follows: due to the over-parameterization introduced by general case, the over-parameterization due to batch (and BN, the convergence rate near a minimizer is governed by weight) normalization only introduces strict saddle points H instead of H. The former has a degenerate direction (see appendix Lemma A.1). ∗ {λu : λ ∈ R}, which coincides with the degenerate global minima. Hence, the effective condition number governing 3.3. Convergence Rate and Acceleration Due to convergence is dependent on the largest and the second Over-parameterization smallest eigenvalue of H∗ (the smallest being 0 in the de- Having established the convergence of BNGD on OLS, a generate minima direction). One can contrast this with the natural follow-up question is why should one use BNGD GD case where the smallest eigenvalue of H is considered over GD. After all, even if BNGD converges for any learning instead since no degenerate directions exists. rate, if the convergence is universally slower than GD then it does not offer any advantages. We prove the following result 3.4. Robustness and Acceleration Due to Learning Rate that shows that under mild conditions, the convergence rate Insensitivity of BNGD on OLS is linear. Moreover, close to the optima Let us now discuss another advantage BNGD possesses over the linear rate of convergence can be shown to be faster than GD, related to the insensitive dependence of the effective the best-case linear convergence rate of GD. This offers a learning rate εˆ (and by extension, the effective convergence concrete result that shows that BNGD could out-perform k rate in Eq. (12) or Eq. (13)) on ε. The explicit dependence GD, even if the latter is perfectly-tuned. of εˆk on ε is quite complex, but we can give the following Theorem 3.4 (Convergence rate). If (ak, wk) converges to asymptotic estimates (see appendix B.6 for proof). a minimizer with εˆ := lim εˆk < εmax∗ := 2/λmax∗ , then T k Proposition 3.5. Suppose εa ∈ (0, 1], a0w0 g > 0, and →∞ T the convergence is linear. Furthermore, when (ak, wk) is 2 w0 g T ||g|| ≥ 2 g Hw0, then λmaxε ak σ0 close to a minimizer, such that 2 | | kekkH ≤ δ < 1 σk (this must happen for large enough k, since we assumed (1) When ε is small enough, ε  1, the effective step size convergence to a minimizer), then we have εˆk has a same order with ε. ∗ ∗ ρ (I εˆkH )+δ kek+1kH ≤ −1 δ kekkH , (13) (2) When ε is large enough, ε  1, the effective step size − 1 εˆk has order O(ε− ). where ρ∗(I − εˆkH) := max{|1 − εˆkλmin∗ |, |1 − εˆkλmax∗ |}. Observe that for finite k, εˆ is a This statement is proved in appendix Lemma B.21. Recall k of ε. Therefore, the above result implies, via the mean that H , λ are defined in section2. The assumption ∗ max∗ value theorem, the existence of some ε > 0 such that εˆ < ε is mild since one can prove the set of initial val- 0 max∗ dεˆ /dε| = 0. 
Consequently, there is at least some ues (a , w ) such that (a , w ) converges to a minimizer k ε=ε0 0 0 k k small interval of the choice of learning rates ε where the (a , w ) with εˆ > ε and det(I − εHˆ ) =6 0 is of mea- ∗ ∗ max∗ ∗ performance of BNGD is insensitive to this choice. sure zero (see appendix Lemma B.23). In fact, empirically this is one commonly observed advan- The inequality (13) is motivated by the linearized system tage of BNGD over GD, where the former typically allows corresponding to Eq. (7)-(8) near a minimizer. When the for a variety of (large) learning rates to be used without iteration converges to a minimizer, the limiting εˆ must be a adversely affecting performance. The same is not true for positive number where the assumption εˆ < ε makes max∗ GD, where the convergence rate depends sensitively on the sure the coefficient in Eq. (13) is smaller than 1. This choice of learning rate. We will see later in Section4 that implies linear convergence of ke k . Generally, the matrix k H although we only have a local insensitivity result above, H has better spectral properties than H, in the sense that ∗ the interval of this insensitivity is actually quite large in ρ (I − εˆ H ) ≤ ρ(I − εˆ H), provided εˆ > 0, where the ∗ k ∗ k k practice. inequality is strict for almost all u. This is a consequence of the Cauchy eigenvalue interlacing property, which one can Furthermore, with some additional assumptions and approx- show directly using mini-max properties of eigenvalues (see imations, the explicit dependence of εˆk on  can be charac- appendix Lemma B.1). This leads to acceleration effects of terized in a quantitative manner. Concretely, we quantify BNGD: When kekkH is small, the contraction coefficient the insensitivity of step size characterized by the interval ρ in Eq. (12) can be improved to a lower coefficient in in which the εˆ is close to the optimal step size εopt (or the Eq. (13). This acceleration could be significant when κ∗ is maximal allowed step size εmax in GD, since εopt is very much smaller than κ, which can happen if the spectral gap close to εmax when κ is large). Proposition 3.5 indicates A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

C2 that this interval is approximately [C εmax, ], which As a consequence of the above, if the eigenvalues of 1 εmax C2 H are sampled from a given continuous distribution on crosses a magnitude of 2 , where C1,C2 are positive C1εmax constants. [λmin, λmax], such as the uniform distribution, then by law of large numbers, Ω = O(d) for large d. This result T H We set εa = 1, a0 = w0 g/σ0 (which is the value in the suggests that the magnitude of the interval on which the second step if we set a0 = 0), kw0k = kuk = 1, where performance of BNGD is insensitive to the choice of the Theorem 3.3 gives the linear converge result for almost all learning rate increases linearly in dimension, implying that initial values and the convergence rate can be quantified by this robustness effect of BNGD is especially useful for high ε the limiting effective learning rate εˆ := lim εˆk = 2 . w∞ dimensional problems. Interestingly, although we only de- k k k Consequently, we need to estimate the magnitude→∞ kw k2. rived this result for the OLS problem, this linear scaling The BNGD iteration implies the following equality, ∞ of insensitivity interval is also observed in neural networks experiments, where we varied the dimension by adjusting 2 2 ε2 kwk k = kwkk + 2 βk, (14) +1 wk the width of the hidden layers. See Section 4.2. k k 2 2 a wk 2 The insensitivity to learning rate choices can also lead to kk k where βk is defined as βk := σ2 ek H2 . The ear- k acceleration effects if one have to use the same learning rate lier convergence results motivate the following plausible for training weights with different effective conditioning. approximation: we assume βk linearly converges to zero 2 This may arise in applications where each and the iteration of kwkk can be approximated by ξ(k + 1) ’s gradient magnitude varies widely, thus requiring which obeys the following ODE (whose discretization for- very different learning rates to achieve good performance. mally matches Eq. (14), assuming the aforementioned con- In this case, BNGD’s large range of learning rate insensi- vergence rate ρ): tivity allows one to use common values across all layers ε2β ρ2t without adversely affecting the performance. This is again ξ(0) = kw k2, ξ˙(t) = 0 . (15) 1 ξ(t) in contrast to GD, where such insensitivity is not present. 2 See Section 4.3 for some experimental validation of this Its solution is ξ2(t) = ξ2(0) + ε β0 (1 − ρ2t), where ρ ∈ ln ρ claim. (0, 1) depends on ε and is self-consistently| | determined by the limiting effective step size, i.e. ρ is the spectral radius ε 4. Experiments of I − ξ( ) H and ξ(∞) in turn depends on ρ. Analyzing ∞ the dependence of ξ(∞) on ε can give an estimate of the Let us first summarize our key findings and insights from 1 insensitivity interval, which is now [εmax, ], since β0εmax the analysis of BNGD on the OLS problem. εˆ ≈ ε when ε  1, and εˆ ≈ 1 when ε  1. (see β0ε appendix B.6.) Therefore, the magnitude of the interval of 1. A scaling law governs BNGD, where certain configu- insensitivity varies inversely with β0. Below, we quantify rations can be deemed equivalent. this magnitude in an average sense. 2. BNGD converges for any learning rate ε > 0, provided Definition 3.6. The average magnitude of the insensitivity T that εa ∈ (0, 1]. In particular, different learning rates w0 g interval of BNGD with εa = 1, a = (or a = 0) 0 σ0 0 can be used for the BN variables (a) compared with ¯ 2 ¯ is defined as ΩH = 1/(βH εmax), where βH is the geo- the remaining trainable variables (w). 
metric average of β0 over w0 and u, which we take to be d 1 3. There exists intervals of ε for which the performance independent and uniformly on the unit sphere S − , of BNGD is not sensitive to the choice of ε, and the wT Hu 2 2 ¯  0   magnitude of this interval grows with dimension. βH := exp Ew0,u ln T e0 2 . (16) w0 Hw0 H In the subsequent sections, we first validate numerically Note that we use the geometric average rather than the arith- these claims on the OLS model, and then show that these metic average because we are measuring a ratio. Although insights go beyond the simple OLS model we considered in we can not calculate the value of Ω analytically, we have H the theoretical framework. In fact, much of the uncovered the following lower bound (see appendix B.7): properties are observed in general applications of BNGD in Proposition 3.7. For positive definite matrix H with mini- deep learning. mal and maximal eigenvalues λmin and λmax respectively, the defined in Definition 3.6 satisfies ≥ d , where ΩH ΩH C 4.1. Experiments on OLS

2 T r[H ] T r[H] 2 ln κ T r[H]  C := 4 2 exp (1 − ) , (17) Here we test the convergence and stability of BNGD for dλ dλmax κ 1 dλmin min − the OLS model. Consider a diagonal matrix H = diag(h) κ = λmax is the condition number of H. where h = (1, ..., κ) is a increasing sequence. The scaling λmin A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent property (Proposition 3.2) allows us to set the initial value to the optimization of RNN models. w0 having same norm with u, kw0k = kuk = 1. Of course, one can verify that the scaling property holds strictly in this 4.2. Experiments on the Effect of Dimension case. In order to validate the approximate results in Section 3.4, Figure1 gives examples of H with different condition num- we compute numerically the dependence of the performance bers κ. We tested the loss function of BNGD, compared of BNGD on the choice of the learning rate ε. Observe from with the optimal GD (i.e. GD with the optimal step size Figure2 that the quantitative predictions of Ω in Definition εopt), in a large range of step sizes εa and ε, and with differ- 3.6 is consistent with numerical experiments, and the linear- ent initial values of a0. Another quantity we observe is the in-dimension scaling of the magnitude of the insensitivity effective step size εˆk of BN. The results are encoded by four interval is observed. Perhaps more interestingly, the same different colors: whether εˆk is close to the optimal step size scaling is also observed in (stochastic) BNGD on fully con- εopt, and whether loss of BNGD is less than the optimal nected neural networks trained on the MNIST dataset. This GD. The results indicate that the optimal convergence rate suggests that this scaling is relevant, at least qualitatively, of BNGD can be better than GD in some configurations, beyond the regimes considered in the theoretical parts of consistent with the statement of Theorem 3.4. Recall that this paper. this acceleration phenomenon is ascribed to the conditioning of H∗ which is better than H.

a0 = 10 a0 = 1 a0 = 0.01 a0 = 0.0001 a0 = 1e − 8 yes close to optimal

= 80 step size

κ no yes no loss_BN < loss_GD = 1280 κ = 20480 κ

loss GD(opt) BNGD = 327680 κ

Figure 1. Comparison of BNGD and GD on OLS model. The results are encoded by four different colors: whether εˆk is close to Figure 2. Effect of dimension. (Top line) Tests of BNGD on the optimal step size εopt of GD, characterized by the inequality OLS model with step size εa = 1, a0 = 0. Parameters: 0.8εopt < εˆk < εopt/0.8, and whether loss of BNGD is less than H=diag(linspace(1,10000,d)), u and w0 is randomly chosen uni- the optimal GD. Parameters: H = diag(logspace(0,log10(κ),100)), formly from the unit sphere in d. The BNGD iterations are 100 R u is randomly chosen uniformly from the unit sphere in R , w0 executed for k = 5000 steps. The values are averaged over 500 is set to Hu/kHuk. The GD and BNGD iterations are executed independent runs. (Bottom line) Tests of stochastic BNGD on for k = 2000 steps with the same w0. In each image, the range of MNIST dataset, fully connected neural network with one hidden εa (x-axis) is 1.99 * logspace(-10,0,41), and the range of ε (y-axis) layer and softmax mean-square loss. The separated learning rate is logspace(-5,16,43). Observe that the performance of BNGD is for BN Parameters is lr a=10, The performance is characterized less sensitive to the condition number, and its advantage is more by the accuracy at the first epoch (averaged over 10 independent pronounced when the latter is big. runs). The magnitude Ω is approximately measured for reference.

Another important observation is a region such that is εˆ 4.3. Further Neural Network Experiments close to εopt, in other words, BNGD significantly extends the range of “optimal” step sizes. Consequently, we can We conduct further experiments on deep learning applied choose step sizes in BNGD at greater liberty to obtain almost to standard classification datasets: MNIST (LeCun et al., the same or better convergence rate than the optimal GD. 1998), Fashion MNIST (Xiao et al., 2017) and CIFAR- However, the size of this region is inversely dependent on 10 (Krizhevsky & Hinton, 2009). The goal is to explore the initial condition a0. Hence, this suggests that small if the other key findings outlined at the beginning of this a0 at first steps may improve robustness. On the other section continue to hold for more general settings. For the hand, small εa will weaken the performance of BN. The MNIST and Fashion MNIST dataset, we use two different phenomenon suggests that improper initialization of the BN networks: (1) a one-layer fully connected network (784 × parameters weakens the power of BN. This experience is 10) with softmax mean-square loss; (2) a four-layer con- encountered in practice, such as (Cooijmans et al., 2016), volution network (Conv-MaxPool-Conv-MaxPool-FC-FC) where higher initial values of BN parameter are detrimental with ReLU and cross-entropy loss. For A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent the CIFAR-10 dataset, we use a five-layer net- cases, the range of optimal learning rates in BNGD is quite work (Conv-MaxPool-Conv-MaxPool-FC-FC-FC). All the large, which is in agreement with the OLS analysis (see trainable parameters are randomly initialized by the Glorot Section 3.4). This phenomenon is potentially crucial for scheme (Glorot & Bengio, 2010) before training. For all understanding the acceleration of BNGD in deep neural three datasets, we use a minibatch size of 100 for computing networks. Heuristically, the “optimal” learning rates of GD stochastic . In the BNGD experiments, batch nor- in distinct layers (depending on some effective notion of malization is performed on all layers, the BN parameters are “condition number”) may be vastly different. Hence, GD initialized to transform the input to zero mean/unit with a shared learning rate across all layers may not achieve distributions, and a small√ regularization parameter  =1e-3 the best convergence rates for all layers at the same time. In is added to variance σ2 +  to avoid division by zero. this case, it is plausible that the acceleration of BNGD is a result of the decreased sensitivity of its convergence rate on Scaling property Theoretically, the scaling property 3.2 the learning rate parameter over a large range of its choice. holds for any layer using BN. However, it may be slightly biased by the regularization parameter . Here, we test the 0.10 1.0 gd gd scaling property in practical settings. Figure3 gives the bn bn 0.08 bn lr_a=1 0.8 bn lr_a=1 loss and accuracy of network-(2) (2CNN+2FC) at the first bn lr_a=0.1 bn lr_a=0.1 0.06 bn lr_a=0.01 0.6 bn lr_a=0.01 epoch with different learning rate. The norm of all weights bn lr_a=0.001 bn lr_a=0.001 loss loss and biases are rescaled by a common factor η. We observe 0.04 0.4 that the scaling property remains true for relatively large η. 0.02 0.2

0.00 0.0 However, when η is small, the norm of weights are small. 2 0 2 4 3 2 1 0 1 2 3 √ 10 10 10 10 10 10 10 10 10 10 10 Therefore, the effect of the -regularization in σ2 +  be- learning rate learning rate comes significant, causing the curves to be shifted. gd bn 2.0 bn lr_a=1 bn lr_a=0.1 1.00 bn lr_a=0.01 1.5 0 η=0.01 bn lr_a=0.001 10 0.98 η=0.0316228 loss η=0.1 1.0 0.96 η=0.316228

1 η=1 loss 10− 0.94 η=3.16228 accuracy 0.5 η=10 3 2 1 0 1 2 3 0.92 η=31.6228 10 10 10 10 10 10 10 2 η=100 10− learning rate 0.90 3 1 1 3 3 1 1 3 10− 10− 10 10 10− 10− 10 10 learning rate / η2 learning rate / η2 Figure 4. Performance of BNGD and GD method on MNIST (network-(1), 1FC), Fashion MNIST (network-(2), 2CNN+2FC) Figure 3. Tests of scaling property of the 2CNN+2FC network on and CIFAR-10 (2CNN+3FC) datasets. The performance is charac- MNIST dataset. BN is performed on all layers, and =1e-3 is √ terized by the loss value at the first epoch. In the BNGD method, added to variance σ2 + . All the trainable parameters (except both the shared learning rate schemes and separated learning rate the BN parameters) are randomly initialized by the Glorot scheme, scheme (learning rate lr a for BN parameters) are given. The and then multiplied by a same parameter η. values are averaged over 5 independent runs.

Stability for large learning rates We use the loss value at the end of the first epoch to characterize the performance of 5. Conclusion BNGD and GD methods. Although the training of models In this paper, we analyzed the dynamical properties of batch have generally not converged at this point, it is enough normalization on OLS, chosen for its simplicity and the to extract some relative rate information. Figure4 shows availability of precise characterizations of GD dynamics. the loss value of the networks on the three datasets. It Even in such a simple setting, we saw that BNGD exhibits is observed that GD and BNGD with identical learning interesting non-trivial behavior, including scaling laws, ro- rates for weights and BN parameters exhibit a maximum bust convergence properties, acceleration, as well as the in- allowed learning rate, beyond which the iterations becomes sensitivity of performance to the choice of learning rates. At unstable. On the other hand, BNGD with separate learning least in the setting considered here, our analysis allows one rates exhibits a much larger range of stability over learning to concretely answer the question of why BNGD can achieve rate for non-BN parameters, consistent with our theoretical better performance than GD. Although these results are de- results on OLS problem rived only for the OLS model, we show via experiments that Insensitivity of performance to learning rates Observe these are qualitatively, and sometimes quantitatively valid that BN accelerates convergence more significantly for deep for more general scenarios. These point to promising future networks, whereas for one-layer networks, the best perfor- directions towards uncovering the dynamical effect of batch mance of BNGD and GD are similar. Furthermore, in most normalization in deep learning and beyond. A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

References Ma, Y. and Klabjan, D. Convergence analysis of batch nor- malization for deep neural nets. CoRR, 1705.08011, 2017. Arora, S., Li, Z., and Lyu, K. Theoretical analysis URL http://arxiv.org/abs/1705.08011. of auto rate-tuning by batch normalization. In In- ternational Conference on Learning Representations, Panageas, I. and Piliouras, G. Gradient Descent Only 2019. URL https://openreview.net/forum? Converges to Minimizers: Non-Isolated Critical Points id=rkxQ-nA9FX. and Invariant Regions. In Papadimitriou, C. H. (ed.), 8th Innovations in Theoretical Computer Science Con- Bjorck, J., Gomes, C., and Selman, B. Understanding Batch ference (ITCS 2017), volume 67 of Leibniz Interna- Normalization. ArXiv e-prints, May 2018. tional Proceedings in Informatics (LIPIcs), pp. 2:1–2:12, Bottou, L., Curtis, F. E., and Nocedal, J. Optimization Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz- methods for large-scale . SIAM Review, Zentrum fuer Informatik. ISBN 978-3-95977-029-3. doi: 60(2):223–311, 2018. 10.4230/LIPIcs.ITCS.2017.2.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower Ponomarev, S. P. Submersions and preimages of sets of bounds for finding stationary points i. arXiv preprint measure zero. Siberian Mathematical Journal, 28(1): arXiv:1710.11606, 2017. 153–163, Jan 1987. ISSN 1573-9260. doi: 10.1007/ BF00970225. URL https://doi.org/10.1007/ Cooijmans, T., Ballas, N., Laurent, C., and Courville, BF00970225. A. C. Recurrent batch normalization. CoRR, Saad, Y. Iterative methods for sparse linear systems, vol- abs/1603.09025, 2016. URL http://arxiv.org/ ume 82. siam, 2003. abs/1603.09025. Salimans, T. and Kingma, D. P. Weight normalization: A Glorot, X. and Bengio, Y. Understanding the difficulty simple reparameterization to accelerate training of deep of training deep feedforward neural networks. In Pro- neural networks. CoRR, abs/1602.07868, 2016. URL ceedings of the thirteenth international conference on http://arxiv.org/abs/1602.07868. artificial intelligence and statistics, pp. 249–256, 2010. Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How Ioffe, S. Batch renormalization: Towards reducing mini- Does Batch Normalization Help Optimization? (No, It Is CoRR batch dependence in batch-normalized models. , Not About Internal Covariate Shift). ArXiv e-prints, May http://arxiv.org/ abs/1702.03275, 2017. URL 2018. abs/1702.03275. Shub, M. Global stability of dynamical systems. Springer Ioffe, S. and Szegedy, C. Batch normalization: Accelerating Science & Business Media, 2013. deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a 448–456, 2015. novel image dataset for benchmarking machine learning algorithms, 2017. Kohler, J., Daneshmand, H., Lucchi, A., Zhou, M., Neymeyr, K., and Hofmann, T. Towards a Theoretical Understanding of Batch Normalization. ArXiv e-prints, May 2018.

Krantz, S. G. and Parks, H. R. A primer of real analytic functions. Springer Science & Business Media, 2002.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient- based learning applied to document recognition. Proceed- ings of the IEEE, 86(11):2278–2324, 1998.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient Descent Converges to Minimizers. ArXiv e- prints, February 2016. A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent Appendix

A. Batch and more general normalization on general objective functions Here we consider the generalized versions of batch normalization on general problems, including but not limited to deep neural networks. Consider a smooth loss function J0(w1, ..., wm) and its normalized version J(γ1, ..., γm, w1, ..., wm),

w1 wm  J(γ , ..., γm, w , ..., wm) = J γ , ..., γm , wi =6 0, i = 1, ..., m. (18) 1 1 0 1 w1 S wm S k k 1 k k m

Here the normalizing matrices Si, i = 1, ..., m, are assumed to be positive definite and Si does not depend on wi and γi (it could depend on wj or γj, j < i). For neural networks, choosing Si = I as the identity matrix, one gets the weight normalization (Salimans & Kingma, 2016). Choosing Si as the covariance matrix Σi of ith layer output zi, one gets batch normalization. When the covariance matrix is degenerate, one can set Si = Σi + S0 with S0 being small but positive definite, e.g. S0 = 0.001I.

It is obvious that the normalization changes the landscape of the original loss function J0, such as introducing new stationary points which are not stationary points of J0. However, we will show the newly introduced stationary points are strict saddle points and hence can be avoid by many optimization schemes (Lee et al., 2016; Panageas & Piliouras, 2017).

A.1. Normalization only introduces strict saddles Let us begin with a simple case where m = 1 in Eq. (18), i.e. J(γ, w; S) = J γ w . In this case, the gradients of J are 0 w S k k

∂J T w = ∇J γ w  , (19) 0 w S ∂γ k k kwkS ∂J γ SwwT = I − ∇J γ w . (20) 2 0 w S ∂w kwkS kwkS k k

The stationary points (γ, w) of J can be grouped into two parts:

γw (1) w˜ := is a stationary point of J . In this case, γ = ±kw˜kS. w S 0 k k

T (2) w˜ is not a stationary point of J0. In this case, γ = 0, w ∇J0(w ˜) = 0.

The stationary points in (2) are ones introduced by normalization, giving the Hessian matrix

2 2 T 2 ∂ J ∂ J ! w ( J0(w ˜))w 1 T ! 2 ∇ 2 2 (∇J0(w ˜)) ∂γ ∂γ∂w w S w S A1 := ∂2J ∂2J = 1 k k k k . (21) ∂w∂γ ∂w2 w 2 ∇J0(w ˜) 0 k kS

Since ∇J0(w ˜) =6 0, the rank of A1 is 2. In fact, the nonzero eigenvalues of A1 are: p a ± a2 + 4kbk2 , 2

T 2 w ( J0)w 1 where a = ∇w 2 , b = w 2 ∇J0. Therefore A1 has a negative eigenvalue, and (γ, w) is a strict saddle point. k kS k kS T Let us now consider the case of m > 1. The normalization-introduced stationary points satisfy γi = 0, wi ∇J0(w ˜i) = 0. The Hessian matrix A at these points always has negative eigenvalues because it has a principal minor like A1 in Eq. (21). Thus we have the following lemma:

γ1w γmwm Lemma A.1. If (γ , ..., γm, w , ..., wm) is a stationary point of J but ( , ..., ) is not a stationary point of J , 1 1 w1 S wm S 0 k k 1 k k m then (γ1, ..., γm, w1, ..., wm) is a strict saddle point of J. A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

A.2. Scaling property and increasing norm of wi When using gradient descent to minimize the loss function (18), we need to specify the numerical parameters including the initial values of γi and wi, which denoted by Γ0 and W0 respectively, and the step size for them, denoted by εγ and ε. For simplicity, we use the same εγ for all γi and the same ε for wi. Due to the fact that the scale of wi does not effect the loss, we immediately have the scaling properties on the set of numerical parameters, or a configuration {Γ0,W0, εγ , ε}.

Definition A.2 (Equivalent configuration). Two configurations, {Γ0,W0, εγ , ε} and {Γ0,W00, εγ0 , ε0}, are said to be equiva- lent if for iterates {Γk,Wk}, {Γk0 ,Wk0 } following these configurations respectively, there is an invertible linear transforma- tion T and a nonzero constant t such that Wk0 = TWk, Γk0 = tΓk for all k. It is easy to check the gradient descent on normalized loss function (18) has the following scaling property. 2 Proposition A.3 (Scaling property). For any r =6 0, the configurations {Γ0,W0, εγ , ε} and {Γ0, rW0, εγ , r ε} are equivalent.

Proof. Gradient descent gives the following iteration:

∂J γi,k = γi,k − εγ (Γk,Wk), (22) +1 ∂γi ∂J wi,k = wi,k − ε (Γk,Wk). (23) +1 ∂wi

∂J 1 ∂J 2 ∂J 2 It is easy to check that = , rwi,k = rwi,k − r ε (Γk,Wk). Let γi = γ , w = rwi, ε = εγ , ε = r ε, ∂(rwi) r ∂wi +1 ∂(rwi) i0 i0 γ0 0 then we immediately have the equivalence result.

Another consequence of the invariance of loss functions with respect to the scale of wi is the orthogonality between wi and ∂J . In fact, we have 0 = ∂l = wi · ∂J . As a consequence, we have the following property. ∂wi ∂ wi wi ∂wi k k k k Proposition A.4 (Increaing norm of wi). For any configuration {Γ0,W0, εγ , ε}, the norm of each wi is incresing during gradient descent iteration.

∂J Proof. According to the orthogonality between wi and , we have ∂wi

2 2 2 ∂J 2 2 kwi,k k = kwi,kk + ε ≥ kwi,kk , (24) +1 ∂wi,k which finishes the proof.

A.3. Convergence for arbitrary step size As a consequence of scaling property and the increasing-norm property, we have the following convergence result, which says that convergence for small learning rates implies convergence for arbitrary learning rates for weights.

Theorem A.5 (Convergence of the gradient descent on (18)). If there are two positive constants, εγ∗ , ε∗, such that the gradient descent on J converges for any initial value Γ0,W0 such that kwi,0k = 1 and step size εγ < εγ∗ , ε < ε∗, then the gradient of wi converges for arbitrary step size ε > 0 and εγ < εγ∗ .

Proof. Firstly, the norm of each wi,k must converge for any step size ε > 0 and εγ < εγ∗ . In fact, if wi,k is not bounded, ε then there is a k = K such that 2 < ε . Then using the scaling property, one has a configuration contradicts the wi,K ∗ assumptions. k k ∂J Secondly, the gradients of wi, , converges to zero. According to Eq. (24), we have, ∂wi,k

2 2 2 X∞ ∂J 2 kwi, k = kwi,0k + ε < ∞ (25) ∞ ∂wi,k k=0 P 1 from which it follows by using k k = ∞ that

∂J 2 lim inf k = 0. (26) k ∂wi,k →∞ A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent B. Proof of Theorems on OLS problem B.1. Gradients and the Hessian matrix The objective function in OLS problem (6) has an equivalent form:

1 a T a 1 2 wT g 1 2 J(a, w) = 2 (u − σ w) H(u − σ w) = 2 kukH − σ a + 2 a , (27)

1 where u = H− g. The gradients are:

∂J 1 T a T 1 T ∂a = − σ (w Hu − σ w Hw) = − σ w g + a, (28) ∂J a a a T a T a a T ∂w = − σ (Hu − σ Hw) + σ3 (w Hu − σ w Hw)Hw = − σ g + σ3 (w g)Hw. (29)

The Hessian matrix is

∂2J ∂2J !  T  ∂a2 ∂a∂w 1 A21 2 2 = (30) ∂ J ∂ J A A ∂w∂a ∂w2 21 22 where

a T h 1 T T  3 T i A22 = σ3 (w g) H + wT g (Hw)g + g(Hw) − σ2 (Hw)(Hw) , (31) 1 1 T  A21 = − σ g − σ2 (w g)Hw . (32)

T The objective function J(a, w) has saddle points, {(a∗, w∗)|a∗ = 0, w∗ g = 0}. The Hessian matrix at those saddle points has at least one negative eigenvalue, i.e. the saddle points are strict. In fact, the eigenvalues at the saddle point (a∗, w∗) are q 2 n 1 g o 2 (1 ± 1 + 4 w∗kT Hwk ∗ ), 0, ..., 0 which contains d − 2 repeated zero, a positive and a negative eigenvalue. On the other hand, the nontrivial critical points satisfies the relations, √ T a∗ = ± u Hu, w∗ //u, (33)

T where the sign of a∗ depends on the direction of u, w∗, i.e. sign(a∗) = sign(u w∗). It is easy to check that the nontrivial u 2  critical points are global minimizers. The Hessian matrix at those minimizers is diag 1, kw∗k 2 H∗ where the matrix H∗ is k k HuuT H H∗ = H − uT Hu (34) which is positive semi-definite and has a zero eigenvalue with eigenvector u, i.e. H∗u = 0. The following lemma, similar to the well-known Cauchy interlacing theorem, gives an estimate of eigenvalues of H∗. HuuT H Lemma B.1. If H is positive definite and H∗ is defined as H∗ = H − uT Hu , then the eigenvalues of H and H∗ satisfy the following inequalities:

0 = λ1(H∗) < λ1(H) ≤ λ2(H∗) ≤ λ2(H) ≤ ... ≤ λd(H∗) ≤ λd(H). (35)

Here λi(H) means the i-th smallest eigenvalue of H.

d Proof. (1) According to the definition, we have H∗u = 0, and for any x ∈ R ,

T T (xT Hu)2 T x H∗x = x Hx − uT Hu ∈ [0, x Hx], (36) which implies H∗ is positive semi-definite, and λi(H∗) ≥ λ1(H∗) = 0. Furthermore, we have the following equality:

T 2 x H∗x = min kx − tukH . (37) t ∈R A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

(2) We will prove λi(H∗) ≤ λi(H) for all i, 1 ≤ i ≤ d. In fact, using the Min-Max Theorem, we have

xT H∗x xT Hx λi(H∗) = min max 2 ≤ min max 2 = λi(H). dimV =i x V x dimV =i x V x ∈ k k ∈ k k

(3) We will prove λi(H∗) ≥ λi 1(H) for all i, 2 ≤ i ≤ d. In fact, using the Max-Min Theorem, we have −

T ∗ 2 x H x x tu H λi(H∗) = max min 2 = max min min k − 2k dimV =n i+1 x V x dimV =n i+1,u V x V t x − ∈ k k − ⊥ ∈ ∈R k k 2 x tu H ≥ max min min k − k 2 dimV =n i+1,u V x V t x tu − ⊥ ∈ ∈R k − k 2 y H = max min k k 2 , y = x − tu dimV =n i+1 y span V,u y − ∈ { } k k yT Hy ≥ max min y 2 = λi 1(H), dimV =n (i 1)+1 y V − − − ∈ k k where we have used the fact that x ⊥ u, kx − tuk2 = kxk2 + t2kuk2 ≥ kxk2.

There are several corollaries related to the spectral property of H∗. We first give some definitions. Since H∗ is positive semi-definite, we can define the H∗-seminorm. T Definition B.2. The H∗-seminorm of a vector x is defined as kxkH∗ := x H∗x. kxkH∗ = 0 if and only if x is parallel to u. ∗ λd(H ) Definition B.3. The pseudo-condition number of H is defined as κ (H ) := ∗ . ∗ ∗ ∗ λ2(H )

Definition B.4. For any real number ε, the pseudo-spectral radius of the matrix I − εH∗ is defined as ρ∗(I − εH∗) := max |1 − ελi(H∗)|. 2 i d ≤ ≤ The following corollaries are direct consequences of Lemma B.1, hence we omit the proofs.

Corollary B.5. The pseudo-condition number of H∗ is less than or equal to the condition number of H :

∗ λd(H ) λd(H) κ∗(H∗) := ∗ ≤ =: κ(H), (38) λ2(H ) λ1(H) where the equality holds if and only if u ⊥ span{v1, vd}, vi is the eigenvector of H corresponding to the eigenvalue λi(H). d Corollary B.6. For any vector x ∈ R and any real number ε, we have k(I − εH∗)xkH∗ ≤ ρ∗(I − εH∗)kxkH∗ . Corollary B.7. For any positive number ε > 0, we have

ρ∗(I − εH∗) ≤ ρ(I − εH), (39)

T where the inequality is strict if u vi =6 0 for i = 1, d.

It is obvious that the inequality in Eq. (38) and Eq. (39) is strict for almost all u with respect to the Lebesgue measure. Particularly, if the spectral gap λ2(H) − λ1(H) or λd(H) − λd 1(H) is large, the condition number κ∗(H∗) could be much smaller than κ(H). −

B.2. Scaling property

The dynamical system defined in Eq. (7)-(8) is completely determined by a set of configurations {H, u, a0, w0, εa, ε}. It is easy to check the system has the following scaling property: Lemma B.8 (Scaling property). Suppose µ =6 0, γ =6 0, r =6 0,QT Q = I, then

(1) The configurations { T γ } and { } are equivalent. µQ HQ, √µ Qu, γa0, γQw0, εa, ε H, u, a0, w0, εa, ε

2 (2) The configurations {H, u, a0, w0, εa, ε} and {H, u, a0, rw0, εa, r ε} are equivalent. A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

B.3. Proof of Theorem 3.3 Recall the BNGD iterations

 T  wk g ak = ak + εa − ak , +1 σk  T  ak wk g wk+1 = wk + ε g − 2 Hwk . σk σk

The scaling property simplify our analysis by allowing us to set, for example, kuk = 1 and kw0k = 1. In the rest of this section, we only set kuk = 1.

For the step size of a, it is easy to check that ak tends to infinity with εa > 2 and initial value a0 = 1, w0 = u. Hence we only consider 0 < εa < 2, which make the iteration of ak bounded by some constant Ca.

Lemma B.9 (Boundedness of ak). If the step size 0 < εa < 2, then the sequence ak is bounded for any ε > 0 and any initial value (a0, w0).

T √ wk g T Proof. Define αk := , which is bounded by |αk| ≤ u Hu =: C, then σk

ak+1 = (1 − εa)ak + εaαk k+1 k = (1 − εa) a0 + (1 − εa) εaα0 + ... + (1 − εa)εaαk 1 + εaαk. −

Pk i 1 Since |1 − εa| < 1, we have |ak | ≤ |a | + 2C |1 − εa| ≤ |a | + 2C . +1 0 i=0 0 1 1 εa −| − |

According to the iterations (40), we have

T  T  T  wk g ak wk g wk g u − 2 wk+1 = I − ε 2 H u − 2 wk . (40) σk σk σk σk

Define

T wk g ek := u − 2 wk, (41) σk T 2 T (wk g) 2 qk := u Hu − 2 = kekkH ≥ 0, (42) σk T ak wk g εˆk := ε 2 , (43) σk σk

wT g and using the property σ2 = argminku − twkH , and the property of H-norm, we have k t

2 wT g k 2 2 qk+1 ≤ u − σ2 wk+1 = k(I − εˆkH)ekkH ≤ ρ(I − εˆkH) qk. (44) k H

Therefore we have the following lemma to make sure the iteration converge: + Lemma B.10. Let 0 < εa < 2. If there are two positive numbers ε− and εˆ , and the effective step size εˆk satisfies

ε− + 2 0 < 2 ≤ εˆk ≤ εˆ < (45) wk λmax k k for all k large enough, then the iterations (40) converge to a minimizer.

ε− 1 Proof. Without loss of generality, we assume 2 < and the inequality (45) is satisfied for all k ≥ 0. We will prove wk λmax k k kwkk converges and the direction of wk converges to the direction of u. A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

(1) Since kwkk is always increasing, we only need to prove it is bounded. We have,

2 2 2 2 ak 2 kwk+1k = kwkk + ε 2 kHekk (46) σk k 2 2 2 X ai 2 = kw0k + ε 2 kHeik (47) σi i=0 k 2 2 2 X ai ≤ kw0k + ε λmax 2 qi (48) σi i=0 k 2 2 2 λmaxCa X qi ≤ kw k + ε 2 . (49) 0 λmin wi i=0 k k

2 2 The inequality in last lines are based on the fact that kHeik ≤ λmaxkeikH , and |ak| are bounded by a constant Ca. Next, P qi we will prove ∞ 2 < ∞, which implies kwkk are bounded. i=0 wi k k According to the estimate Eq. (44), we have

− + 2 ε λi 2 qk ≤ max{|1 − εˆ λi| , |1 − 2 | }qk (50) +1 wk i k k − + ε λmin ≤ max{1 − γ , 1 − 2 }qk, (51) wk k k + + 2 where 1 − γ = maxi{|1 − εˆ λi| } ∈ (0, 1). Using the definition of qk, we have

+ 2 − min γ w0 ,ε λmin Cqk qk − qk ≥ { k k 2 } qk =: 2 ≥ 0. (52) +1 wk wk k k k k

T P qk Since qk is bounded in [0, u Hu], summing both side of the inequality, we get the bound of the infinite series 2 ≤ wk k k k uT Hu C < ∞. ε− (2) Since kwkk is bounded, we denote εˆ := 2 , and define ρ := max{|1 − εˆ λi|} ∈ (0, 1), then the inequality (44) − w∞ ± k k i 2 implies qk+1 ≤ ρ qk. As a consequence, qk tends to zero, which implies the direction of wk converges to the direction of u.

(3) The convergence of ak is a consequence of wk converging.

Since a_k is bounded, we assume |a_k| < C̃_a√(u^T Hu) with C̃_a ≥ 1, and define ε_0 := 1/(2C̃_a κλ_max). The following lemma gives convergence for small step sizes.

Lemma B.11. If the initial value (a_0, w_0) satisfies a_0w_0^T g > 0, and the step sizes satisfy ε_a ∈ (0, 1], ε/‖w_0‖² < ε_0, then the sequence (a_k, w_k) converges to a global minimizer.

Remark 1: If we set a_0 = 0, then we have w_1 = w_0 and a_1 = ε_a w_0^T g/σ_0, hence a_1w_1^T g > 0 provided w_0^T g ≠ 0.

Remark 2: For the case ε_a ∈ (1, 2), if the initial value satisfies the additional condition 0 < |a_0| ≤ ε_a|w_0^T g|/σ_0, then (a_k, w_k) converges to a global minimizer as well.

Proof. Without loss of generality, we only consider the case a_0 > 0, w_0^T g > 0, ‖w_0‖ ≥ 1.

(1) We will prove that a_k > 0 and w_k^T g > 0 for all k. Denote y_k := w_k^T g and δ := ‖g‖/(4κ).

On the one hand, if ak > 0, 0 < yk < 2δ, then

y_{k+1} \ge y_k + \varepsilon\frac{a_k\|g\|^2}{2\sigma_k} \ge y_k. (53)

On the other hand, when ak > 0, yk > 0, ε < ε0, we have

y_{k+1} \ge \varepsilon\frac{a_k\|g\|^2}{\sigma_k} + y_k\left(1 - \varepsilon\frac{a_k}{\sigma_k^2}\sqrt{g^T Hg}\right) \ge \frac{1}{2}y_k, (54)

a_{k+1} \ge \min\{a_k,\ y_k/\sigma_k\}. (55)

As a consequence, we have ak > 0, yk ≥ δy := min{y0, δ} for all k by induction.

(2) We will prove the effective step size εˆk satisfies the condition in Lemma B.10.

Since ak is bounded, ε < ε0, we have

\hat\varepsilon_k := \varepsilon\frac{a_k}{\sigma_k^2}\frac{w_k^T g}{\sigma_k} \le \frac{\varepsilon\tilde C_a\lambda_{\max}}{\lambda_{\min}\|w_k\|^2} \le \varepsilon\tilde C_a\kappa =: \hat\varepsilon^+ < \frac{1}{2\lambda_{\max}}, (56)

and

q_{k+1} \le (1 - \hat\varepsilon_k\lambda_{\min})^2 q_k \le (1 - \hat\varepsilon_k\lambda_{\min})q_k < q_k, (57)

which implies w_{k+1}^T g/σ_{k+1} ≥ w_k^T g/σ_k ≥ w_0^T g/σ_0. Furthermore, we have a_k ≥ min{a_0, w_0^T g/σ_0}, and there is a positive constant ε^- > 0 such that

\hat\varepsilon_k \ge \varepsilon\frac{a_k}{\lambda_{\max}\|w_k\|^2}\frac{w_k^T g}{\sigma_k} \ge \frac{\varepsilon^-}{\|w_k\|^2}. (58)

(3) Employing Lemma B.10, we conclude that (a_k, w_k) converges to a global minimizer.

Lemma B.12. If the step sizes satisfy ε_a ∈ (0, 1] and ε/‖w_0‖² < ε_0, then the sequence (a_k, w_k) converges.

Proof. Thanks to Lemma B.11, we only need to consider the case a_kw_k^T g ≤ 0 for all k, and we will prove that the iteration converges to a saddle point in this case. Since the case a_k = 0 or w_k^T g = 0 is trivial, we assume a_kw_k^T g < 0 below. More specifically, we will prove that |a_{k+1}| < r|a_k| for some constant r ∈ (0, 1), which implies convergence to a saddle point.

(1) If a_k and a_{k+1} have the same sign, and hence the opposite sign to w_k^T g, then we have |a_{k+1}| = |1 − ε_a||a_k| − ε_a|w_k^T g|/σ_k ≤ |1 − ε_a||a_k|.

(2) If a_k and a_{k+1} have different signs, then we have

\left|\frac{w_k^T g}{a_k\sigma_k}\right| \le \varepsilon\frac{1}{\sigma_k^2}\left|\|g\|^2 - \frac{w_k^T g}{\sigma_k^2}g^T Hw_k\right| \le 2\varepsilon\kappa\lambda_{\max} < 1. (59)

Consequently, we get

\left|\frac{a_{k+1}}{a_k}\right| = \varepsilon_a\left|\frac{w_k^T g}{a_k\sigma_k}\right| - (1 - \varepsilon_a) \le 2\varepsilon\varepsilon_a\kappa\lambda_{\max} - (1 - \varepsilon_a) < \varepsilon_a \le 1. (60)

(3) Setting r := max(|1 − εa|, 2εεaκλmax − (1 − εa)), we finish the proof.

To simplify our proofs for Theorem 3.3, we give two lemmas which are obvious but useful.

Lemma B.13. If positive sequences f_k, h_k satisfy f_{k+1} ≤ rf_k + h_k with r ∈ (0, 1) and lim_{k→∞} h_k = 0, then lim_{k→∞} f_k = 0.

Proof. It is obvious, because the sequence b_k defined by b_{k+1} = rb_k + h_k with b_0 ≥ f_0 > 0 dominates f_k and tends to zero.

Lemma B.14 (Separation property). For δ_0 small enough, the set S := {w : y²q < δ_0, ‖w‖ ≥ 1} is composed of two separated parts S_1 and S_2 with dist(S_1, S_2) > 0, where in S_1 one has y² < δ_1, q > δ_2, and in S_2 one has q < δ_2, y² > δ_1, for some δ_1 > 0, δ_2 > 0. Here y := w^T g and q := u^T Hu − (w^T g)²/(w^T Hw) = u^T Hu − (w^T Hu)²/(w^T Hw).

Proof. The proof is based on the positive definiteness of H. The geometric meaning is illustrated in Figure 5.

Corollary B.15. If lim_{k→∞} ‖w_{k+1} − w_k‖ = 0 and lim_{k→∞} (w_k^T g)²q_k = 0, then either lim_{k→∞} (w_k^T g)² = 0 or lim_{k→∞} q_k = 0.

Figure 5. The geometric meaning of the separation property

Proof. Denote y_k := w_k^T g. According to the separation property (Lemma B.14), we can choose δ_0 > 0 small enough such that the separated parts S_1 and S_2 of the set S := {w : y²q < δ_0, ‖w‖ ≥ 1} satisfy dist(S_1, S_2) > 0.

Because y_k²q_k tends to zero, w_k belongs to S for all k large enough, say k > k_1. On the other hand, because ‖w_{k+1} − w_k‖ tends to zero, we have ‖w_{k+1} − w_k‖ < dist(S_1, S_2) for all k large enough, say k > k_2. Then, for k > k_3 := max(k_1, k_2), all w_k belong to the same part, S_1 or S_2.

If w_k ∈ S_1 (so q_k > δ_2) for all k > k_3, then lim_{k→∞} (w_k^T g)² = 0. On the other hand, if w_k ∈ S_2 (so y_k² > δ_1) for all k > k_3, then lim_{k→∞} q_k = 0.

Theorem B.16. Let εa ∈ (0, 1] and ε > 0. The sequence (ak, wk) converges for any initial value (a0, w0).

Proof. We will first prove that ‖w_k‖ converges, and then prove that (a_k, w_k) converges as well.

(1) We prove that ‖w_k‖ is bounded and hence converges. In fact, according to Lemma B.12, once ‖w_k‖² ≥ ε/ε_0 for some k, the rest of the iteration converges; hence ‖w_k‖ is bounded.

(2) We prove that lim_{k→∞} ‖w_{k+1} − w_k‖ = 0 and lim_{k→∞} (w_k^T g)²q_k = 0. The convergence of ‖w_k‖ implies that Σ_k a_k²q_k < ∞. As a consequence,

\lim_{k\to\infty} a_k^2 q_k = 0, \qquad \lim_{k\to\infty}\|a_ke_k\|_H = 0, (61)

and lim_{k→∞} ‖w_{k+1} − w_k‖ = 0. In fact, we have

\|w_{k+1} - w_k\|^2 = \varepsilon^2\frac{a_k^2}{\sigma_k^2}\|He_k\|^2 \le \frac{\lambda_{\max}\varepsilon^2}{\lambda_{\min}\|w_0\|^2}a_k^2 q_k \to 0. (62)

Consider the iteration of the sequence |a_k − w_k^T g/σ_k|:

\left|a_{k+1} - \frac{w_{k+1}^T g}{\sigma_{k+1}}\right| \le \left|a_{k+1} - \frac{w_{k+1}^T g}{\sigma_k}\right| + \left|\frac{w_{k+1}^T g}{\sigma_k} - \frac{w_{k+1}^T g}{\sigma_{k+1}}\right|
\le (1 - \varepsilon_a)\left|a_k - \frac{w_k^T g}{\sigma_k}\right| + \varepsilon\frac{|a_k g^T He_k|}{\sigma_k^2} + \frac{|w_{k+1}^T g|}{\sigma_k\sigma_{k+1}}|\sigma_{k+1} - \sigma_k|
\le (1 - \varepsilon_a)\left|a_k - \frac{w_k^T g}{\sigma_k}\right| + \varepsilon\frac{\|g\|_H\|a_ke_k\|_H}{\sigma_k^2} + \frac{|w_{k+1}^T g|}{\sigma_k\sigma_{k+1}}\frac{\lambda_{\max}}{\sigma_k}\varepsilon\|a_ke_k\|_H
\le (1 - \varepsilon_a)\left|a_k - \frac{w_k^T g}{\sigma_k}\right| + 2C\|a_ke_k\|_H. (63)

The constant C in Eq. (63) can be chosen as C = ελ_max‖u‖_H/(λ_min‖w_0‖²). Since ‖a_ke_k‖_H tends to zero, we can use Lemma B.13 to get lim_{k→∞} |a_k − w_k^T g/σ_k| = 0. Combining this with Eq. (61), we then have lim_{k→∞} (w_k^T g)²q_k = 0.

(3) According to Corollary B.15, we have either lim_{k→∞} y_k² = 0 or lim_{k→∞} q_k = 0. In the former case, the iteration (a_k, w_k) converges to a saddle point; in the latter case, (a_k, w_k) converges to a global minimizer. In both cases, (a_k, w_k) converges.

To finish the proof of Theorem 3.3, we have to address the special case ε_a = 1, where the set of initial values such that the BN iteration converges to saddle points is of Lebesgue measure zero. We leave this demonstration to the next section, where we consider the case ε_a ≥ 1.

B.4. Impossibility of converging to strict saddle points

In this section, we prove that the set of initial values such that the BN iteration converges to saddle points is of Lebesgue measure zero, as long as ε_a ≥ 1. The tools in our proof are similar to the analysis of gradient descent on non-convex objectives (Lee et al., 2016; Panageas & Piliouras, 2017). In addition, we use the real analytic property of the BN loss function (27).

For brevity, here we denote x := (a, w) and let εa = ε, then the BN iteration can be rewritten as

x_{n+1} = T(x_n) := x_n - \varepsilon\nabla J(x_n).

Lemma B.17. If A ⊂ T(R^d \ {0}) is a measure-zero set, then the preimage T^{-1}(A) is of measure zero as well.

Proof. Since T is smooth enough, according to Theorem 3 of Ponomarev (1987), we only need to prove that the Jacobian of T(x) is nonzero for almost all x ∈ R^d. In other words, the set {x : det(I − ε∇²J(x)) = 0} is of measure zero. This is true because the function det(I − ε∇²J(x)) is a real analytic function of x ∈ R^d \ {0}. (Details of the properties of real analytic functions can be found in Krantz & Parks (2002).)

Lemma B.18. Let f : X → R be twice continuously differentiable in an open set X ⊂ R^d and let x* ∈ X be a stationary point of f. If ε > 0, det(I − ε∇²f(x*)) ≠ 0, and the matrix ∇²f(x*) has at least one negative eigenvalue, then there exists a neighborhood U of x* such that the following set B has measure zero:

B := {x0 ∈ U : xn+1 = xn − ε∇f(xn) ∈ U, ∀n ≥ 0}. (64)

Proof. The detailed proof is similar to Lee et al. (2016); Panageas & Piliouras (2017). Define the map F(x) := x − ε∇f(x). Since det(I − ε∇²f(x*)) ≠ 0, according to the inverse function theorem, there exists a neighborhood U of x* on which F has a differentiable inverse. Hence F is a local C¹ diffeomorphism, which allows us to use the center-stable manifold theorem (Shub, 2013). The negative eigenvalue of ∇²f(x*) implies λ_max(I − ε∇²f(x*)) > 1, so the dimension of the unstable manifold is at least one, which implies that the set B lies on a lower-dimensional manifold and hence is of measure zero.

Lemma B.19. If ε_a = ε ≥ 1, then the set of initial values such that the BN iteration converges to saddle points is of Lebesgue measure zero.

Proof. We will prove this claim using Lemma B.17 and Lemma B.18. Denote the saddle point set by W := {(a*, w*) : a* = 0, w*^T g = 0}. The key point is that at a saddle point x* := (a*, w*) of the BN loss function (27), the Hessian matrix has eigenvalues

\left\{\ \tfrac{1}{2}\left(1 \pm \sqrt{1 + 4\frac{\|g\|^2}{w_*^T Hw_*}}\right),\ 0,\ \dots,\ 0\ \right\}.

(1) For each saddle point x* := (a*, w*) of the BN loss function, ε ≥ 1 is enough to allow us to use Lemma B.18. Hence there exists a neighborhood U_{x*} of x* such that the following set B_{x*} is of measure zero:

B_{x*} := {x_0 ∈ U_{x*} : x_n ∈ U_{x*}, ∀n ≥ 1}. (65)

(2) The neighborhoods U_{x*} of all x* ∈ W form a cover of W; hence, according to Lindelöf's open cover lemma, there are countably many neighborhoods {U_i : i = 1, 2, ...} covering W, i.e. U := ∪_i U_i ⊇ W. As a consequence, the following set A_0 is of measure zero:

A0 := ∪iBi = ∪i{x0 ∈ Ui : xn ∈ Ui, ∀n ≥ 1}. (66)

(3) Define A_{m+1} := T^{-1}(A_m) = {x ∈ R^d : T(x) ∈ A_m} for m ≥ 0. According to Lemma B.17, all A_m, and hence ∪_m A_m, are of measure zero.

(4) Since each initial value x0 such that the iteration converges to a saddle point must be contained in some set Am, we finish the proof.

Combining the results of Lemma B.19, the scaling property (3.2) and the convergence theorem B.16, we directly obtain the following theorem.

Theorem B.20. If ε_a = 1 and ε > 0, then the BN iteration (7)-(8) converges to a global minimizer for almost all initial values.

B.5. Convergence rate

In Section B.3, we encountered the following estimate for e_k = u − (w_k^T g/σ_k²)w_k:

\|e_{k+1}\|_H \le \rho(I - \hat\varepsilon_k H)\|e_k\|_H. (67)

We can improve the convergence rate above if H^* has better spectral properties. This is the content of Theorem 3.4, which the following lemma proves.

Lemma B.21. The following inequality holds:

(1 - \delta_k)\|e_{k+1}\|_H \le \left(\rho^*(I - \hat\varepsilon_k H^*) + \delta_k\right)\|e_k\|_H, (68)

where \delta_k := \frac{\lambda_{\max}\varepsilon|a_k|}{\sigma_k^2}\|e_k\|_H.

Proof. The case w_k^T g = 0 is trivial, hence we assume w_k^T g ≠ 0 in the following proof. Rewrite the iteration on w_k as the following equality:

u - \frac{w_k^T g}{\sigma_k^2}w_{k+1} = (I - \hat\varepsilon_k H)e_k = (I - \hat\varepsilon_k H^*)e_k - \hat\varepsilon_k\left(1 - \frac{(w_k^T g)^2}{u^T Hu\,\sigma_k^2}\right)Hu. (69)

We then use the properties of the H^*-seminorm to prove the claim.

(1) Estimate the H^*-seminorm of the right-hand side of Eq. (69):

\|\mathrm{right}\|_{H^*} \le \|(I - \hat\varepsilon_k H^*)e_k\|_{H^*} + |\hat\varepsilon_k|\left(1 - \frac{(w_k^T g)^2}{u^T Hu\,\sigma_k^2}\right)\|Hu\|_{H^*} (70)

\le \rho^*(I - \hat\varepsilon_k H^*)\|e_k\|_{H^*} + \frac{\lambda_{\max}|\hat\varepsilon_k|}{\sqrt{u^T Hu}}\|e_k\|_H^2 (71)
= \rho^*(I - \hat\varepsilon_k H^*)\frac{|w_k^T g|}{\sqrt{u^T Hu}\,\sigma_k}\|e_k\|_H + \frac{\lambda_{\max}\varepsilon|a_kw_k^T g|}{\sqrt{u^T Hu}\,\sigma_k^3}\|e_k\|_H^2 (72)
= \frac{|w_k^T g|}{\sqrt{u^T Hu}\,\sigma_k}\left(\rho^*(I - \hat\varepsilon_k H^*) + \delta_k\right)\|e_k\|_H. (73)

(2) Estimate the H^*-seminorm of the left-hand side of Eq. (69). Using the H-norm on the iteration of w_k, we have

\sigma_{k+1} = \left\|w_k + \varepsilon\frac{a_k}{\sigma_k}He_k\right\|_H \ge \sigma_k - \varepsilon\frac{\lambda_{\max}|a_k|}{\sigma_k}\|e_k\|_H. (74)

Consequently, we have

\|\mathrm{left}\|_{H^*} = \frac{|w_k^T g|}{\sqrt{u^T Hu}\,\sigma_k}\frac{\sigma_{k+1}}{\sigma_k}\|e_{k+1}\|_H \ge \frac{|w_k^T g|}{\sqrt{u^T Hu}\,\sigma_k}(1 - \delta_k)\|e_{k+1}\|_H. (75)

(3) Combining (1) and (2), we finish the proof.

Then we give the proof of Theorem 3.4.

Proof of Theorem 3.4. Firstly, Lemma B.21 implies the second part of Theorem 3.4, which is the special case δ_k < δ < 1.

Secondly, if ε̂ < ε*_max, then ρ*(I − ε̂H*) < 1. Since (a_k, w_k) converges to a minimizer, δ_k must converge to zero, and the coefficient (ρ*(I − ε̂_kH*) + δ_k)/(1 − δ_k) must be less than some ρ̂ ∈ (0, 1) when k is large enough, which results in the linear convergence of ‖e_k‖_H.

Now we turn to the convergence of the loss function, which can be rewritten as J_k = ½‖ẽ_k‖²_H with ẽ_k = u − (a_k/σ_k)w_k. There is a useful equality relating ‖ẽ_k‖²_H and ‖e_k‖²_H:

\|\tilde e_k\|_H^2 = \|e_k\|_H^2 + \left(a_k - \frac{w_k^T g}{\sigma_k}\right)^2. (76)

Recalling the inequality (63) and the boundedness of a_k, there is a constant C_0 such that

\left|a_{k+1} - \frac{w_{k+1}^T g}{\sigma_{k+1}}\right| \le |1 - \varepsilon_a|\left|a_k - \frac{w_k^T g}{\sigma_k}\right| + C_0\|e_k\|_H, (77)

which indicates that we can use the convergence of e_k to estimate the convergence of the loss value J_k. In fact, we have the following lemma.

Lemma B.22. If ‖e_k‖_H ≤ Cρ^k for some constant C and ρ ∈ (0, 1), and ε_a ∈ (0, 1], then

\|\tilde e_k\|_H^2 \le C^2\rho^{2k} + \left(C_1(1 - \varepsilon_a)^k + C_2k\gamma^k\right)^2, (78)

where γ = max(ρ, 1 − ε_a), C_1 = |a_0 − w_0^T g/σ_0|, and C_2 = CC_0.

Proof. According to the inequality (77), we have

\left|a_k - \frac{w_k^T g}{\sigma_k}\right| \le C_1(1 - \varepsilon_a)^k + C_2\sum_{i=0}^{k-1}(1 - \varepsilon_a)^i\rho^{k-i} \le C_1(1 - \varepsilon_a)^k + C_2k\gamma^k. (79)

Substituting this into Eq. (76), we finish the proof.
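The linear convergence described by Lemma B.22 can be observed directly in simulation. The sketch below (an illustration with arbitrarily chosen H, u, w_0 and step sizes, not the authors' experiment) tracks q_k = ‖e_k‖²_H and the loss J_k along a single BNGD run.

```python
import numpy as np

# Sketch: track q_k = ||e_k||_H^2 and the loss J_k = 0.5*||u - (a_k/sigma_k) w_k||_H^2
# along one BNGD run, illustrating the (eventually) linear convergence above.
rng = np.random.default_rng(1)
d = 20
lam = np.linspace(1.0, 50.0, d)                  # eigenvalues of a diagonal H
u = rng.standard_normal(d); u /= np.linalg.norm(u)
g = lam * u
w = rng.standard_normal(d); w /= np.linalg.norm(w)
a, eps_a, eps = 0.0, 1.0, 1.0

q_hist, J_hist = [], []
for _ in range(500):
    sigma = np.sqrt(w @ (lam * w))
    wg = w @ g
    e = u - (wg / sigma**2) * w
    q_hist.append(e @ (lam * e))                 # q_k = ||e_k||_H^2
    r = u - (a / sigma) * w
    J_hist.append(0.5 * r @ (lam * r))           # J_k
    a, w = a + eps_a * (wg / sigma - a), \
           w + eps * (a / sigma) * (g - (wg / sigma**2) * (lam * w))

print("late-stage ratio q_{k+1}/q_k:", q_hist[-1] / q_hist[-2])
print("final loss J_k:", J_hist[-1])
```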

B.6. Estimating the effective step size

Firstly, we consider the limit of the effective step size ε̂. When the iteration converges to a minimizer (a*, w*), the value of ε̂ is ε̂ = ε/‖w*‖². Without loss of generality, we assume that w_k always has a different direction from u during the whole course of the iteration. In fact, if w_k has the same direction as u for some k, then the iteration of w_k is trivial, i.e. w_k = w_{k+1} = w_{k+2} = ..., and the effective step size can be any positive number. However, this case is rare. More precisely, we have the following lemma:

Lemma B.23. The set of initial values (a_0, w_0) such that (a_k, w_k) converges to a minimizer (a*, w*) with effective learning rate ε̂ := lim_{k→∞} ε̂_k > ε*_max and det(I − ε̂H*) ≠ 0 is of measure zero.

Proof. The proof is similar to that of Lemma B.19. The key point is that the matrix I − ε̂H* at this minimizer is non-degenerate and has an eigenvalue with absolute value larger than 1, hence there is a local unstable manifold of dimension at least one.

Now we consider the effective learning rate ε̂_k and give the proof of Proposition 3.5. According to Lemma B.11, the effective step size ε̂_k has the same order as ε/‖w_k‖², provided a_0w_0^T g > 0 and ε/‖w_0‖² < ε_0. In fact, we have

\frac{C_1\varepsilon}{\|w_k\|^2} := \frac{a_0w_0^T g}{\sigma_0\lambda_{\max}}\frac{\varepsilon}{\|w_k\|^2} \le \hat\varepsilon_k \le \sqrt{u^T Hu}\,\frac{C_a\varepsilon}{\lambda_{\min}\|w_k\|^2} =: \frac{C_2\varepsilon}{\|w_k\|^2}. (80)

Hence, to prove Proposition 3.5, we only need to estimate the norm of w_k.

Proof of Proposition 3.5. According to the BNGD iteration, we have (see the proof of Lemma B.10)

\|w_{k+1}\|^2 \le \|w_0\|^2 + \varepsilon^2\lambda_{\max}\sum_{i=0}^{k}\frac{a_i^2}{\sigma_i^2}q_i. (81)

(1) When ε/‖w_0‖² < ε_0 (where ε_0 is defined in Lemma B.11), the sequence q_k satisfies q_{k+1} ≤ (1 − ε̂_kλ_min)q_k. Hence the norm of w_k is bounded by

\|w_k\|^2 \le \|w_0\|^2 + \varepsilon\kappa C_a\frac{\sigma_0}{w_0^T g}\sum_{i=0}^{\infty}(q_i - q_{i+1}) \le \|w_0\|^2 + C\varepsilon, (82)

for some constant C. As a consequence,

\tilde C_1\varepsilon := \frac{C_1\varepsilon}{\|w_0\|^2(1 + C\varepsilon_0)} \le \hat\varepsilon_k \le \frac{C_2\varepsilon}{\|w_0\|^2} =: \tilde C_2\varepsilon. (83)

(2) When ε is large enough, the increment of the norm ‖w_k‖ at the first step is large as well. In fact, we have

\|w_1\|^2 - \|w_0\|^2 = \varepsilon^2\frac{a_0^2}{\sigma_0^2}\|He_0\|^2 =: C_3\varepsilon^2. (84)

Since ‖g‖² ≥ (w_0^T g/σ_0²) g^T Hw_0, we have a_1w_1^T g > a_1w_0^T g > 0. Choosing ε larger than some value ε_1 such that ε/‖w_1‖² < ε_0, we can use the argument in (1) on (a_1, w_1). More precisely, there are two constants C_1, C_2 such that

\frac{C_1\varepsilon}{\|w_1\|^2} \le \hat\varepsilon_k \le \frac{C_2\varepsilon}{\|w_1\|^2}. (85)

Plugging Eq. (84) into this, we have

\frac{C_1\varepsilon_1^2}{\|w_0\|^2 + C_3\varepsilon_1^2} \le \frac{C_1\varepsilon^2}{\|w_0\|^2 + C_3\varepsilon^2} \le \hat\varepsilon_k\varepsilon \le \frac{C_2\varepsilon^2}{\|w_0\|^2 + C_3\varepsilon^2} \le \frac{C_2}{C_3}. (86)
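The two regimes in the proof above can be seen numerically: for small ε the effective step size behaves like ε, while for large ε it decays like 1/ε. The following sketch (an illustration on an arbitrary instance, not the paper's experiment) reports ε/‖w_k‖² after a fixed number of BNGD steps for several values of ε.

```python
import numpy as np

# Sketch: measure the (approximate) limiting effective step size eps / ||w||^2
# over a range of eps, illustrating hat(eps) ~ eps for small eps and ~ 1/eps for large eps.
rng = np.random.default_rng(2)
d = 20
lam = np.linspace(1.0, 100.0, d)                 # eigenvalues of a diagonal H
u = rng.standard_normal(d); u /= np.linalg.norm(u)
g = lam * u
w0 = rng.standard_normal(d); w0 /= np.linalg.norm(w0)

for eps in [1e-3, 1e-1, 1e1, 1e3]:
    a, w = 0.0, w0.copy()
    for _ in range(5000):
        sigma = np.sqrt(w @ (lam * w))
        wg = w @ g
        a, w = a + 1.0 * (wg / sigma - a), \
               w + eps * (a / sigma) * (g - (wg / sigma**2) * (lam * w))
    print(f"eps = {eps:8.0e}   effective step size eps/||w||^2 = {eps / (w @ w):.3e}")
```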

B.7. Quantification of the insensitive interval

In this section, we estimate the magnitude of the insensitive interval of step sizes.

The BNGD iteration with configuration ε_a = 1, a_0 = w_0^T g/σ_0, ‖w_0‖ = ‖u‖ = 1 implies the following equality for ‖w_k‖²:

\|w_{k+1}\|^2 = \|w_k\|^2 + \frac{\varepsilon^2}{\|w_k\|^2}\frac{a_k^2\|w_k\|^2}{\sigma_k^2}\|e_k\|_{H^2}^2 =: \|w_k\|^2 + \frac{\varepsilon^2}{\|w_k\|^2}\beta_k, (87)

where β_k is defined as β_k := (a_k²‖w_k‖²/σ_k²)‖e_k‖²_{H²}. The linear convergence results allow us to assume that β_k converges linearly to zero, i.e. β_k = β_0ρ^{2k}, k ≥ 0, where ρ ∈ (0, 1) depends on ε and is self-consistently determined by the limiting effective step size, i.e. ρ = ρ(I − (ε/‖w_∞‖²)H) is the spectral radius of I − (ε/‖w_∞‖²)H. Observe that the iteration in Eq. (87) can be regarded as a numerical scheme for solving the following ODE:

\xi(0) = \|w_1\|^2, \qquad \dot\xi(t) = \frac{\varepsilon^2\beta_0\rho^{2t}}{\xi(t)}, (88)

which has solution ξ²(t) = ξ²(0) + (ε²β_0/|ln ρ|)(1 − ρ^{2t}); the value of ‖w_k‖² can be approximated by ξ(k + 1). In particular, we have an approximation for ‖w_∞‖²:

\|w_\infty\|^2 \approx \xi(\infty) = \sqrt{(1 + \varepsilon^2\beta_0)^2 + \frac{\varepsilon^2\beta_0}{|\ln\rho|}}. (89)
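The quality of the approximation (89) can be checked against the underlying recursion. The sketch below (with arbitrary illustrative values of ε, β_0 and ρ, taking ρ as given rather than determined self-consistently) iterates ‖w_{k+1}‖² = ‖w_k‖² + ε²β_k/‖w_k‖² with β_k = β_0ρ^{2k} and compares the limit with the closed form; agreement is expected to be rough, and best when ρ is close to 1.

```python
import numpy as np

# Sketch: compare the recursion ||w_{k+1}||^2 = ||w_k||^2 + eps^2 * beta_k / ||w_k||^2,
# with beta_k = beta_0 * rho^(2k), against the ODE-based closed form (89).
eps, beta0, rho = 2.0, 0.3, 0.97                 # arbitrary illustrative values

eta = 1.0                                        # ||w_0||^2 = 1
for k in range(5000):
    eta = eta + eps**2 * beta0 * rho**(2 * k) / eta

xi_inf = np.sqrt((1 + eps**2 * beta0)**2 + eps**2 * beta0 / abs(np.log(rho)))
print("recursion limit   ||w_inf||^2 ~", eta)
print("closed form (89)  ||w_inf||^2 ~", xi_inf)
```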

To determine the value of ρ, we let ρ and ε satisfy the following relation:

\rho = \rho\!\left(I - \frac{\varepsilon}{\xi(\infty)}H\right) := \max_i\left\{\left|1 - \frac{\varepsilon}{\xi(\infty)}\lambda_i(H)\right|\right\}, (90)

which closes the calculation of ξ(∞).

Next, we consider two limiting cases: ε ≪ 1 and ε ≫ 1. In both cases, the effective step size ε̂ is small, and the value of ρ is related to ρ = 1 − ελ_min/ξ(∞). Combining this with the definition of ξ(∞), we have

\frac{\varepsilon^2\lambda_{\min}^2}{(1 - \rho)^2} = \xi(\infty)^2 = (1 + \varepsilon^2\beta_0)^2 + \frac{\varepsilon^2\beta_0}{|\ln\rho|} \approx (1 + \varepsilon^2\beta_0)^2 + \frac{\varepsilon^2\beta_0}{1 - \rho}, (91)

where the estimate |ln ρ| ≈ 1 − ρ is used, since ρ is close to 1 when ε̂ is small enough. Consequently, we have:

(1) When ε ≪ 1, we have ξ(∞) ≈ 1, ρ ≈ 1 − ελ_min and ε̂ ≈ ε.

(2) When ε ≫ 1, we have

\hat\varepsilon \approx \frac{1 - \rho}{\lambda_{\min}} \approx \frac{\sqrt{1 + 4\varepsilon^2\lambda_{\min}^2} - 1}{2\varepsilon^2\beta_0\lambda_{\min}} = \frac{2\lambda_{\min}}{\beta_0\left(\sqrt{1 + 4\varepsilon^2\lambda_{\min}^2} + 1\right)} \sim \frac{1}{\beta_0\varepsilon}. (92)

These results indicate that the magnitude of the insensitive interval of step sizes is proportional to the constant 1/β_0.

Finally, we estimate the average of β_0 over w_0 and u for a given H. The average value of β_0 from BNGD is defined as the following geometric average over w_0 and u, which we take to be independent and uniformly distributed on the unit sphere S^{d−1}:

\bar\beta_H := \mathbb{E}^G_{w_0,u}[\beta_0] := \exp\left(\mathbb{E}_{w_0,u}\left[\ln\left(\left(\frac{w_0^T Hu}{w_0^T Hw_0}\right)^2\|e_0\|_{H^2}^2\right)\right]\right). (93)

Correspondingly, the magnitude of the insensitive interval of step sizes is defined as Ω:

\Omega = \Omega_H := \mathbb{E}^G_{w_0,u}\left[\frac{\lambda_{\max}(H)^2}{4\beta_0}\right] = \frac{\lambda_{\max}(H)^2}{4\bar\beta_H}. (94)
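The quantities in (93)-(94) can be estimated by Monte Carlo sampling of w_0 and u on the unit sphere. The sketch below (illustrative sample sizes and eigenvalue range, not the paper's exact setup) does this for H with arithmetic-progression eigenvalues, where Ω_H is expected to grow roughly linearly in d.

```python
import numpy as np

# Sketch: Monte Carlo estimate of the geometric average beta_H (Eq. (93)) and of the
# width Omega_H (Eq. (94)) for H with arithmetic-progression eigenvalues.
rng = np.random.default_rng(3)

def omega_H(d, lam_min=1.0, lam_max=100.0, n_samples=2000):
    lam = np.linspace(lam_min, lam_max, d)       # eigenvalues of a diagonal H
    H = np.diag(lam)
    log_beta = []
    for _ in range(n_samples):
        u = rng.standard_normal(d); u /= np.linalg.norm(u)
        w = rng.standard_normal(d); w /= np.linalg.norm(w)
        t = (w @ H @ u) / (w @ H @ w)            # w0^T H u / w0^T H w0
        e0 = u - t * w
        beta0 = t**2 * (e0 @ (H @ H) @ e0)       # (...)^2 * ||e0||_{H^2}^2
        log_beta.append(np.log(beta0))
    beta_bar = np.exp(np.mean(log_beta))         # geometric mean, Eq. (93)
    return lam_max**2 / (4 * beta_bar)           # Eq. (94)

for d in [10, 50, 100, 200]:
    print(f"d = {d:4d}   Omega_H ~ {omega_H(d):.1f}")
```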

The numerical tests find that Ω_H depends strongly on the dimension d, provided the eigenvalues of H are sampled from typical distributions such as the uniform distribution on [λ_min, λ_max] with 0 < λ_min < λ_max. In fact, we have the following estimate for β̄_H, which implies β̄_H ≤ O(1/d) and Ω_H ≥ O(d).

Lemma B.24. For a positive definite matrix H with minimal and maximal eigenvalues λ_min and λ_max respectively, the β̄_H defined in (93) satisfies

\bar\beta_H \le \frac{1}{d}\,\frac{\mathrm{Tr}[H^2]}{d}\,\frac{\lambda_{\max}\mathrm{Tr}[H]}{d}\,\frac{1}{\lambda_{\min}^2}\exp\left(-\frac{2\ln\kappa}{\kappa - 1}\left(\frac{\mathrm{Tr}[H]}{d\lambda_{\min}} - 1\right)\right), (95)

where κ = λ_max/λ_min is the condition number of H.

Proof. The definition of E^G allows us to estimate each factor of β_0 separately.

(1) The inequality of arithmetic and geometric means implies E^G[(w_0^T Hu)²] ≤ E[(w_0^T Hu)²] = Tr[H²]/d².

(2) Using the definition e_0 = u − (w_0^T Hu/(w_0^T Hw_0)) w_0, we have

\mathbb{E}^G\left[\|e_0\|_{H^2}^2\right] \le \lambda_{\max}\mathbb{E}\left[\|e_0\|_H^2\right] = \lambda_{\max}\mathbb{E}\left[\left\|u - \frac{w_0^T Hu}{w_0^T Hw_0}w_0\right\|_H^2\right]
= \lambda_{\max}\mathbb{E}\left[u^T Hu - \frac{(w_0^T Hu)^2}{w_0^T Hw_0}\right]
\le \lambda_{\max}\mathbb{E}\left[u^T Hu\right] = \frac{\lambda_{\max}\mathrm{Tr}[H]}{d}.

(3) Since w_0^T Hw_0 ∈ [λ_min, λ_max], using the fact that ln(1 + x) ≥ (ln κ/(κ − 1)) x for all x ∈ [0, κ − 1], we have

\mathbb{E}^G\left[w_0^T Hw_0\right] = \exp\left(\mathbb{E}\left[\ln(w_0^T Hw_0)\right]\right)
\ge \lambda_{\min}\exp\left(\mathbb{E}\left[\frac{\ln\kappa}{\kappa - 1}\left(\frac{w_0^T Hw_0}{\lambda_{\min}} - 1\right)\right]\right) (96)
= \lambda_{\min}\exp\left(\frac{\ln\kappa}{\kappa - 1}\left(\frac{\mathrm{Tr}[H]}{d\lambda_{\min}} - 1\right)\right).

Combining the inequalities above, we finish the proof.

If the eigenvalues of H are sampled from a given distribution on [λ_min, λ_max], the values Tr[H]/d and Tr[H²]/d are determined by the distribution and are not sensitive to the dimension d (for d large enough); the estimate in Lemma B.24 then indicates that β̄_H ≤ O(1/d) and Ω_H ≥ O(d). As an example, we consider below an H with eigenvalues forming an arithmetic sequence.

Corollary B.25. If the eigenvalues of H are λ_i = λ_min + (i − 1)(λ_max − λ_min)/(d − 1), d ≥ 2, then we have

\bar\beta_H \le \frac{(\kappa + 1)^3\lambda_{\max}^2}{4d\,\kappa^2}, \qquad \Omega_H \ge \frac{\kappa^2}{(\kappa + 1)^3}\,d. (97)

Proof. It is enough to show that Tr[H]/d = (κ + 1)λ_min/2 and Tr[H²]/d² ≤ (κ + 1)²λ_min²/(2d).

Corollary B.25 indicates that larger dimensions lead to larger insensitive intervals of step size. It is interesting to note that although the lower bound on Ω_H also depends on the condition number κ, the numerical tests in Section B.7.1 find that the width is not sensitive to κ. In fact, one could obtain better lower bounds for Ω_H via better estimates of E^G[w^T Hw]. However, here we focus on the effect of dimension.

B.7.1. NUMERICAL TESTS

In this section, we give some numerical tests on the BNGD iteration with ε_a = 1, a_0 = 0 and various choices of the matrix H. The scaling property allows us to set H diagonal and to give the initial value w_0 the same norm as u, ‖w_0‖ = ‖u‖ = 1.

Firstly, we show the difference between the geometric mean (G-mean) and the arithmetic mean (A-mean) in quantifying the performance of BNGD. Figure 6 gives an example of a 100-dimensional H with condition number κ = 853. The GD and BNGD iterations are executed for k = 5000 steps, where u and w_0 are randomly chosen from the unit sphere. The values of the effective step size, the loss ‖e_k‖²_H and the error ‖e_k‖² are plotted. Furthermore, the mean values over 500 random tests are given. The results show that the G-mean converges quickly as the number of tests increases, whereas the A-mean does not converge as quickly and is dominated by the largest sample values. Hence we use the geometric mean in later tests.
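The qualitative difference between the two means can be reproduced with a small experiment. The sketch below (illustrative problem size, step size and number of runs, not the exact setup of Figure 6) compares the arithmetic and geometric means of the final error over independent random draws of (u, w_0).

```python
import numpy as np

# Sketch: compare the arithmetic mean and the geometric mean of the final error
# ||e_k||^2 over independent random draws of (u, w0).
rng = np.random.default_rng(4)
d, eps, n_runs, n_steps = 100, 10.0, 100, 2000
lam = np.linspace(1.0, 853.0, d)                 # spectrum of a diagonal H

final_err = []
for _ in range(n_runs):
    u = rng.standard_normal(d); u /= np.linalg.norm(u)
    w = rng.standard_normal(d); w /= np.linalg.norm(w)
    g = lam * u
    a = 0.0
    for _ in range(n_steps):
        sigma = np.sqrt(w @ (lam * w))
        wg = w @ g
        a, w = a + (wg / sigma - a), \
               w + eps * (a / sigma) * (g - (wg / sigma**2) * (lam * w))
    sigma = np.sqrt(w @ (lam * w))
    e = u - (w @ g / sigma**2) * w
    final_err.append(e @ e)

final_err = np.array(final_err)
print("arithmetic mean of ||e_k||^2:", final_err.mean())
print("geometric  mean of ||e_k||^2:", np.exp(np.mean(np.log(final_err))))
```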

Figure 6. Tests of BNGD on the OLS model with ε_a = 1, a_0 = 0. Parameters: H is a diagonal matrix with condition number κ = 853 (the first random test in Figure 8); u and w_0 are randomly chosen uniformly from the unit sphere in R^100. The BNGD iterations are executed for k = 5000 steps. The bold curves are averages over the 500 independent runs (the shaded curves).

Secondly, we test the effect of the dimension d. Figure 7 gives three typical settings of H: (a) with arithmetic-progression eigenvalues, (b) with geometric-progression eigenvalues, and (c) with only one large eigenvalue, perturbed from the identity matrix. In the first two cases, the effect of dimension is observed: larger dimensions lead to a larger magnitude Ω of the insensitive step-size interval, and the magnitude is almost proportional to the dimension d, which confirms the analysis in Lemma B.24 and Corollary B.25. In the last case, larger dimensions lead to a smaller Ω, because Tr[H]/d and Tr[H²]/d are strongly influenced by d. However, the condition number of H* is much smaller than κ(H) in this case, which leads to marked acceleration over GD.

Figure 7. Tests of BNGD on the OLS model with ε_a = 1, a_0 = 0. Parameters: (a, top) H = diag(linspace(1,10000,d)); (b, middle) H = diag(logspace(0,4,d)); (c, bottom) H = diag([ones(1,d-1),10000]). u and w_0 are randomly chosen uniformly from the unit sphere in R^d. The BNGD iterations are executed for k = 5000 steps. The curves are averaged over the 500 independent runs.

Finally, we test the effect of the eigenvalue distribution. Figure 8 gives examples of H with different condition numbers but the same dimension d = 100. When the eigenvalues form arithmetic sequences, the width of the near-optimal learning rate interval is almost the same across different condition numbers, while the loss and error still depend on the condition number. Randomly chosen eigenvalues also exhibit this phenomenon.

Figure 8. Tests of BNGD on the OLS model with ε_a = 1, a_0 = 0. Parameters: (top) H = diag(linspace(1,condH,100)); (bottom) H ∈ R^{100×100} is a diagonal matrix with random positive entries and condition number κ. u and w_0 are randomly chosen uniformly from the unit sphere in R^100. The BNGD iterations are executed for k = 5000 steps. The curves are averaged over the 500 independent runs.