
A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

Yongqiang Cai 1 Qianxiao Li 1 2 Zuowei Shen 1

1 Department of Mathematics, National University of Singapore, Singapore. 2 Institute of High Performance Computing, A*STAR, Singapore. Correspondence to: Yongqiang Cai, Qianxiao Li, Zuowei Shen.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Despite its empirical success and recent theoretical progress, there generally lacks a quantitative analysis of the effect of batch normalization (BN) on the convergence and stability of gradient descent. In this paper, we provide such an analysis on the simple problem of ordinary least squares (OLS), where the precise dynamical properties of gradient descent (GD) are completely known, thus allowing us to isolate and compare the additional effects of BN. More precisely, we show that unlike GD, gradient descent with BN (BNGD) converges for arbitrary learning rates for the weights, and the convergence remains linear under mild conditions. Moreover, we quantify two different sources of acceleration of BNGD over GD: one due to over-parameterization, which improves the effective condition number, and another due to having a large range of learning rates giving rise to fast descent. These phenomena set BNGD apart from GD and could account for much of its robustness properties. These findings are confirmed quantitatively by numerical experiments, which further show that many of the uncovered properties of BNGD in OLS are also observed qualitatively in more complex supervised learning problems.

1. Introduction

Batch normalization (BN) is one of the most important techniques for training deep neural networks and has proven extremely effective in avoiding gradient blowups during back-propagation and speeding up convergence. In its original introduction (Ioffe & Szegedy, 2015), the desirable effects of BN are attributed to the so-called "reduction of covariate shift". However, it is unclear what this statement means in precise mathematical terms.

Although recent theoretical work has established certain convergence properties of gradient descent with BN (BNGD) and its variants (Ma & Klabjan, 2017; Kohler et al., 2018; Arora et al., 2019), there generally lacks a quantitative comparison between the dynamics of the usual gradient descent (GD) and BNGD. In other words, a basic question that one could pose is: what quantitative changes does BN bring to the stability and convergence of gradient descent dynamics? Or even more simply: why should one use BNGD instead of GD? To date, a general mathematical answer to these questions remains elusive. This can be partly attributed to the complexity of the optimization objectives that one typically applies BN to, such as those encountered in deep learning. In these cases, even a quantitative analysis of the dynamics of GD itself is difficult, not to mention a precise comparison between the two.

For this reason, it is desirable to formulate the simplest non-trivial setting on which one can concretely study the effect of batch normalization and answer the questions above in a quantitative manner. This is the goal of the current paper, where we focus on perhaps the simplest supervised learning problem, ordinary least squares (OLS) regression, and analyze precisely the effect of BNGD when applied to this problem. A primary reason for this choice is that the dynamics of GD in least-squares regression is completely understood, thus allowing us to isolate and contrast the additional effects of batch normalization.

Our main findings can be summarized as follows:

1. Unlike GD, BNGD converges for arbitrarily large learning rates for the weights, and the convergence remains linear under mild conditions.

2. The asymptotic linear convergence of BNGD is faster than that of GD, and this can be attributed to the over-parameterization that BNGD introduces.

3. Unlike GD, the convergence rate of BNGD is insensitive to the choice of learning rates. The range of insensitivity can be characterized, and in particular it increases with the dimensionality of the problem.

Although these findings are established concretely only for the OLS problem, we will show through numerical experiments that some of them hold qualitatively, and sometimes even quantitatively, for more general situations in deep learning.

1.1. Related Work

Batch normalization was originally introduced in Ioffe & Szegedy (2015) and subsequently studied in further detail in Ioffe (2017). Since its introduction, it has become an important practical tool to improve the stability and efficiency of training deep neural networks (Bottou et al., 2018). Initial heuristic arguments attribute the desirable features of BN to concepts such as "covariate shift", but alternative explanations based on landscapes (Santurkar et al., 2018) and effective regularization (Bjorck et al., 2018) have been proposed.

Recent theoretical studies of BN include Ma & Klabjan (2017); Kohler et al. (2018); Arora et al. (2019). We now outline the main differences between them and the current work. In Ma & Klabjan (2017), the authors proposed a variant of BN, the diminishing batch normalization (DBN) algorithm, and established its convergence to a stationary point of the loss function. In Kohler et al. (2018), the authors also considered a BNGD variant by dynamically setting the learning rates and using bisection to optimize the rescaling variables introduced by BN. It is shown that this variant of BNGD converges linearly for simplified models, including an OLS model and "learning halfspaces". The primary difference in the current work is that we do not dynamically modify the learning rates, and consider instead a constant learning rate, i.e. the original BNGD algorithm. This is an important distinction: while a decaying or dynamic learning rate is sometimes used in GD, in the case of BN it is critical to analyze the constant learning rate case, precisely because one of the key practical advantages of BN is that a big learning rate can be used. Moreover, this allows us to isolate the influence of batch normalization itself, without the potentially obfuscating effects a dynamic learning rate schedule can introduce (e.g. see Eq. (10) and the discussion that follows). As the goal of considering a simplified model is to analyze the additional effects purely due to BN on GD, it is desirable to perform our analysis in this regime.

In Arora et al. (2019), the authors proved a general convergence result for BNGD of O(k^{-1/2}) in terms of the gradient norm for objectives with Lipschitz continuous gradients. This matches the best result for gradient descent on general non-convex functions with learning rate tuning (Carmon et al., 2017). In contrast, our convergence result is on the iterates and is shown to be linear under mild conditions (Theorem 3.4). This convergence result is stronger, but this is to be expected since we are considering a specific case. More importantly, we discuss concretely how BNGD offers advantages over GD instead of just matching its best-case performance. For example, not only do we show that convergence occurs for any learning rate ε > 0, we also derive a quantitative relationship between the learning rate and the convergence rate, from which the robustness of BNGD on OLS can be explained (see Section 3).

1.2. Organization

Our paper is organized as follows. In Section 2, we outline the ordinary least squares (OLS) problem and present GD and BNGD as alternative means to solve this problem. In Section 3, we demonstrate and analyze the convergence of BNGD for the OLS model, and in particular contrast the results with the behavior of GD, which is completely known for this model. We also discuss the important insights into BNGD that these results provide us with. We then validate these findings on more general problems in Section 4. Finally, we conclude in Section 5.

2. Background

2.1. Ordinary Least Squares and Gradient Descent

Consider the simple linear regression model where x ∈ R^d is a random input column vector and y is the corresponding output variable. Since batch normalization is applied to each feature separately, in order to gain key insights it is sufficient to consider the case y ∈ R. A noisy linear relationship is assumed between the dependent variable y and the independent variables x, i.e. y = x^T w + noise, where w ∈ R^d is the vector of trainable parameters. Denote the following moments:

H := E[x x^T],   g := E[x y],   c := E[y^2].    (1)

To simplify the analysis, we assume the covariance matrix H of x is positive definite and the mean E[x] of x is zero. The eigenvalues of H are denoted by λ_i(H), i = 1, 2, ..., d. In particular, the maximum and minimum eigenvalues of H are denoted by λ_max and λ_min respectively. The condition number of H is defined as κ := λ_max/λ_min. Note that the positive definiteness of H allows us to define the vector norm ||·||_H by ||x||_H^2 = x^T H x.

The ordinary least squares (OLS) method for estimating the unknown parameters w leads to the following optimization problem,

min_{w ∈ R^d} J_0(w) := (1/2) E_{x,y}[(y − x^T w)^2] = c/2 − w^T g + (1/2) w^T H w,    (2)

which has the unique minimizer w = u := H^{-1} g.

The gradient descent (GD) method (with step size or learning rate ε) for solving the optimization problem (2) is given by the iteration

w_{k+1} = w_k − ε ∇_w J_0(w_k) = (I − εH) w_k + ε g,    (3)

which converges if 0 < ε < 2/λ_max =: ε_max, and the convergence rate is determined by the spectral radius ρ_ε := ρ(I − εH) = max_i{|1 − ελ_i(H)|}, with

||u − w_{k+1}|| ≤ ρ(I − εH) ||u − w_k||.    (4)

It is well-known (e.g. see Chapter 4 of Saad (2003)) that the optimal learning rate is ε_opt = 2/(λ_max + λ_min), where the optimal convergence rate is ρ_opt = (κ − 1)/(κ + 1).

2.2. Batch Normalization

Batch normalization is a feature-wise normalization procedure typically applied to the output, which in this case is simply z = x^T w. The normalization transform is defined as follows:

N(z) := (z − E[z]) / √(Var[z]) = x^T w / σ,    (5)

where σ := √(w^T H w). After this rescaling, N(z) will be of order 1, and hence in order to reintroduce the scale (Ioffe & Szegedy, 2015), we multiply N(z) by a rescaling parameter a (note that the shift parameter can be set to zero since E[w^T x | w] = 0). Hence, we get the BN version of the OLS problem (2):

min_{w ∈ R^d, a ∈ R} J(a, w) := (1/2) E_{x,y}[(y − a N(x^T w))^2] = c/2 − (w^T g/σ) a + a^2/2.    (6)

The objective function J(a, w) is no longer convex. In fact, it has critical points {(a*, w*) | a* = 0, w*^T g = 0}, which are saddle points of J(a, w) if g ≠ 0.

We are interested in the critical points which constitute the set of global minima and satisfy the relations

a* = sign(s) √(u^T H u),   w* = s u,   for some s ∈ R \ {0}.

It is easy to check that they are in fact global minimizers and that the Hessian matrix at each such point is degenerate. Nevertheless, the saddle points are strict (see appendix B.1), which typically simplifies the analysis of gradient descent on non-convex objectives (Lee et al., 2016; Panageas & Piliouras, 2017).

We consider the gradient descent method for solving the problem (6), which we hereafter call batch normalization gradient descent (BNGD). We set the learning rates for a and w to be ε_a and ε respectively. These may be different, for reasons which will become clear in the subsequent analysis. We thus have the following discrete-time dynamical system:

a_{k+1} = a_k + ε_a (w_k^T g / σ_k − a_k),    (7)

w_{k+1} = w_k + ε (a_k/σ_k) (g − (w_k^T g / σ_k^2) H w_k).    (8)

To simplify subsequent notation, we denote by H∗ the matrix

H∗ := H − H u u^T H / (u^T H u).    (9)

We will see later that the over-parameterization introduced by BN gives rise to a degenerate Hessian diag(1, (||u||^2/||w*||^2) H∗) at a minimizer (a*, w*), and the BNGD dynamics is governed by H∗ instead of H as in the GD case. The matrix H∗ is positive semi-definite (H∗u = 0) and has better spectral properties than H, such as a lower effective condition number κ∗ = λ∗_max/λ∗_min ≤ κ, where λ∗_max and λ∗_min are the maximal and minimal nonzero eigenvalues of H∗ respectively. Particularly, κ∗ < κ for almost all u (see appendix B.1).
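Both iterations are straightforward to implement. The following minimal sketch (ours; the dimension, spectrum, initial values and step sizes are arbitrary illustrative choices) contrasts GD (3) with BNGD (7)-(8) on a synthetic instance, where the GD step size is limited by 2/λ_max while BNGD runs stably with a much larger ε:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
lam = np.linspace(1.0, 100.0, d)            # eigenvalues of H; condition number kappa = 100
H = np.diag(lam)
u = rng.standard_normal(d); u /= np.linalg.norm(u)
g = H @ u

def gd(w0, eps, steps):
    """Plain GD, Eq. (3): w_{k+1} = (I - eps*H) w_k + eps*g."""
    w = w0.copy()
    for _ in range(steps):
        w = w - eps * (H @ w - g)
    return w

def bngd(w0, a0, eps, eps_a, steps):
    """BNGD, Eqs. (7)-(8), with sigma_k = sqrt(w_k^T H w_k)."""
    w, a = w0.copy(), a0
    for _ in range(steps):
        sigma = np.sqrt(w @ H @ w)
        a, w = (a + eps_a * (w @ g / sigma - a),
                w + eps * (a / sigma) * (g - (w @ g / sigma**2) * (H @ w)))
    return w, a

w0 = rng.standard_normal(d)
eps_opt = 2.0 / (lam.max() + lam.min())     # optimal GD step size 2/(lambda_max + lambda_min)
w_gd = gd(w0, eps_opt, 2000)

# GD requires eps < 2/lambda_max = 0.02 here; BNGD runs stably far beyond that (cf. Theorem 3.3 below).
w_bn, a_bn = bngd(w0, a0=0.0, eps=1.0, eps_a=1.0, steps=2000)
proj = (w_bn @ g / (w_bn @ H @ w_bn)) * w_bn   # (w^T g / sigma^2) w, BNGD's effective estimate of u
print(np.linalg.norm(w_gd - u), np.linalg.norm(proj - u))
```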

3. Mathematical Analysis of BNGD on OLS

In this section, we discuss several mathematical results one can derive concretely for BNGD on the OLS problem (6). Compared with GD, the update coefficient before Hw_k in Eq. (8) changes from ε in Eq. (3) to a more complicated term, which we call the effective learning rate εˆ_k:

εˆ_k := ε (a_k/σ_k) (w_k^T g / σ_k^2).    (10)

Also, notice that with the over-parameterization introduced by a, it is no longer necessary for w_k to converge to u. In fact, any non-zero scalar multiple of u can be a global minimum. Hence, instead of considering the residual u − w_k as in the GD analysis Eq. (4), we may combine Eq. (7) and Eq. (8) to give

u − (w_k^T g/σ_k^2) w_{k+1} = (I − εˆ_k H) (u − (w_k^T g/σ_k^2) w_k).    (11)

Define the modified residual e_k := u − (w_k^T g/σ_k^2) w_k, which equals 0 if and only if w_k is a global minimizer. Observe that the mapping u ↦ (w^T g/σ^2) w = (w^T H u/w^T H w) w is an orthogonal projection under the inner product induced by H, hence we immediately have

||e_{k+1}||_H ≤ ||u − (w_k^T g/σ_k^2) w_{k+1}||_H ≤ ρ(I − εˆ_k H) ||e_k||_H,    (12)

where ρ(I − εˆ_k H) is the spectral radius of the matrix I − εˆ_k H. In other words, as long as max_i{|1 − εˆ_k λ_i(H)|} ≤ ρˆ < 1 for some ρˆ < 1 and all k, we have linear convergence of the residual (which also implies linear convergence of the objective, see appendix Lemma B.22).

At this point, we make an important observation: if we allow for dynamic learning rates, we may simply set εˆ_k = c for some fixed c ∈ (0, 2/λ_max) at every iteration. Then, linear convergence is immediate. However, it is clear that this fast convergence is almost entirely due to the effect of dynamic learning rates, and this has limited relevance in explaining the effect of BN. Moreover, comparing with Eq. (4), one can observe that with this choice, BNGD and GD have the same optimal convergence rates, and so this cannot offer explanations for any advantage of BNGD over GD either. For these reasons, it is important to avoid such dynamic learning rate assumptions.

As discussed above, without using dynamic learning rates, one has to then estimate εˆ_k to establish convergence. Heuristically, observe that if ε is small enough, this is likely true, as the other terms can be controlled due to the normalization. Thus, convergence for small ε should hold. In order to handle the large ε case, we establish a simple but useful scaling law that draws connections amongst cases with different ε scales.
3.1. Scaling Property

The dynamical properties of the BNGD iterations are governed by a set of parameters, or a configuration, {H, u, a_0, w_0, ε_a, ε}.

Definition 3.1 (Equivalent configuration). Two configurations, {H, u, a_0, w_0, ε_a, ε} and {H′, u′, a′_0, w′_0, ε′_a, ε′}, are said to be equivalent if, for the BNGD iterates {w_k}, {w′_k} following these configurations respectively, there is an invertible linear transformation T and a nonzero constant t such that w′_k = T w_k and a′_k = t a_k for all k.

The scaling property ensures that equivalent configurations must converge or diverge together, with the same rate up to a constant multiple. Now, it is easy to check that the system has the following scaling law.

Proposition 3.2 (Scaling property). Suppose µ ≠ 0, γ ≠ 0, r ≠ 0 and Q^T Q = I. Then (1) the configurations {µ Q^T H Q, (γ/√µ) Q^T u, γ a_0, γ Q^T w_0, ε_a, ε} and {H, u, a_0, w_0, ε_a, ε} are equivalent; (2) the configurations {H, u, a_0, w_0, ε_a, ε} and {H, u, a_0, r w_0, ε_a, r^2 ε} are equivalent.

It is worth noting that the scaling property (2) in Proposition 3.2 originates from the batch-normalization procedure and is independent of the specific structure of the loss function. Hence, it is valid for general problems where BN is used (appendix Lemma A.3). Despite being a simple result, the scaling property is important in determining the dynamics of BNGD, and is useful in our subsequent analysis of its convergence and stability properties. For example, it indicates that separating the learning rate for the weights (w) and the rescaling parameter (a) is equivalent to changing the norm of the initial weights.
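The scaling property (2) is easy to verify numerically. The sketch below (ours, with arbitrary test data) runs BNGD from (w_0, ε) and from (r·w_0, r^2·ε) for r > 0 and checks that the two trajectories are equivalent in the sense of Definition 3.1 with T = rI and t = 1:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
H = np.diag(np.linspace(1.0, 50.0, d))
u = rng.standard_normal(d); u /= np.linalg.norm(u)
g = H @ u

def bngd_traj(w0, a0, eps, eps_a, steps):
    """Return the list of (a_k, w_k) produced by Eqs. (7)-(8)."""
    w, a, traj = w0.copy(), a0, []
    for _ in range(steps):
        sigma = np.sqrt(w @ H @ w)
        a, w = (a + eps_a * (w @ g / sigma - a),
                w + eps * (a / sigma) * (g - (w @ g / sigma**2) * (H @ w)))
        traj.append((a, w.copy()))
    return traj

w0 = rng.standard_normal(d)
a0, eps_a, eps, r = 0.5, 1.0, 0.5, 10.0       # r > 0: rescaled configuration {H, u, a0, r*w0, eps_a, r^2*eps}

t1 = bngd_traj(w0, a0, eps, eps_a, steps=30)
t2 = bngd_traj(r * w0, a0, r**2 * eps, eps_a, steps=30)

# Expected equivalence: a'_k = a_k and w'_k = r * w_k for all k.
dev_a = max(abs(a2 - a1) for (a1, _), (a2, _) in zip(t1, t2))
dev_w = max(np.max(np.abs(w2 - r * w1)) for (_, w1), (_, w2) in zip(t1, t2))
print(dev_a, dev_w)                            # both at the level of floating-point round-off
```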
3.2. Batch Normalization Converges for Arbitrary Step Size

Having established the scaling law, we then have the following convergence result for BNGD on OLS.

Theorem 3.3 (Convergence of BNGD). The iteration sequence (a_k, w_k) in Eq. (7)-(8) converges to a stationary point for any initial value (a_0, w_0) and any ε > 0, as long as ε_a ∈ (0, 1]. Particularly, we have: if ε_a = 1 and ε > 0, then (a_k, w_k) converges to global minimizers for almost all initial values (a_0, w_0).

Sketch of proof. We first prove that the algorithm converges for any ε_a ∈ (0, 1] and small enough ε, with any initial value (a_0, w_0) such that ||w_0|| ≥ 1 (appendix Lemma B.12). Next, we observe that the sequence {||w_k||} is monotone increasing, and thus either converges to a finite limit or diverges. The scaling property is then used to exclude the divergent case: if {||w_k||} diverges, then at some k the norm ||w_k|| is large enough, and by the scaling property the configuration is equivalent to one where ||w_k|| = 1 and ε is small, which we have proved converges. This shows that ||w_k|| converges to a finite limit, from which the convergence of w_k and of the loss function value can be established, after some work. This proof is fully presented in appendix Theorem B.16 and the preceding lemmas. Lastly, using the "strict saddle point" arguments (Lee et al., 2016; Panageas & Piliouras, 2017), we can prove that the set of initial values for which (a_k, w_k) converges to saddle points has Lebesgue measure 0, provided ε_a = 1, ε > 0 (appendix Lemma B.19).

It is important to note that BNGD converges for all step sizes ε > 0 for w_k, independent of the spectral properties of H. This is a significant advantage and is in stark contrast with GD, where the step size is limited by 2/λ_max, and the condition number of H intimately controls the stability and convergence rate. Although we only prove the almost-everywhere convergence to a global minimizer for the case ε_a = 1, we have not encountered convergence to saddles in the OLS experiments, even for ε_a ∈ (0, 2) with initial values (a_0, w_0) drawn from typical distributions.

Remark: In appendix A, we show that the combination of the scaling property and the monotonicity of the weight norms, which hold for batch (and weight) normalization of general loss functions, can be used to prove a more general convergence result: if the iterates converge for small enough ε, then the gradient norm converges for any ε. We note that in the independent work of Arora et al. (2019), similar ideas have been used to prove convergence results for batch normalization for neural networks. Lastly, one can also show that in the general case, the over-parameterization due to batch (and weight) normalization only introduces strict saddle points (see appendix Lemma A.1).

3.3. Convergence Rate and Acceleration Due to Over-parameterization

Having established the convergence of BNGD on OLS, a natural follow-up question is why one should use BNGD over GD. After all, even if BNGD converges for any learning rate, if the convergence is universally slower than GD then it does not offer any advantages. We prove the following result, which shows that under mild conditions the convergence rate of BNGD on OLS is linear. Moreover, close to the optima the linear rate of convergence can be shown to be faster than the best-case linear convergence rate of GD. This offers a concrete result showing that BNGD can out-perform GD, even if the latter is perfectly tuned.

Theorem 3.4 (Convergence rate). If (a_k, w_k) converges to a minimizer with εˆ := lim_{k→∞} εˆ_k < ε∗_max := 2/λ∗_max, then the convergence is linear. Furthermore, when (a_k, w_k) is close to a minimizer, such that (λ_max ε |a_k| / σ_k^2) ||e_k||_H ≤ δ < 1 (this must happen for large enough k, since we assumed convergence to a minimizer), then we have

||e_{k+1}||_H ≤ ((ρ∗(I − εˆ_k H∗) + δ)/(1 − δ)) ||e_k||_H,    (13)

where ρ∗(I − εˆ_k H∗) := max{|1 − εˆ_k λ∗_min|, |1 − εˆ_k λ∗_max|}.

This statement is proved in appendix Lemma B.21. Recall that H∗ and λ∗_max are defined in Section 2. The assumption εˆ < ε∗_max is mild, since one can prove that the set of initial values (a_0, w_0) such that (a_k, w_k) converges to a minimizer (a*, w*) with εˆ > ε∗_max and det(I − εˆH∗) ≠ 0 is of measure zero (see appendix Lemma B.23).

The inequality (13) is motivated by the linearized system corresponding to Eq. (7)-(8) near a minimizer. When the iteration converges to a minimizer, the limiting εˆ must be a positive number, and the assumption εˆ < ε∗_max makes sure the coefficient in Eq. (13) is smaller than 1. This implies linear convergence of ||e_k||_H. Generally, the matrix H∗ has better spectral properties than H, in the sense that ρ∗(I − εˆ_k H∗) ≤ ρ(I − εˆ_k H), provided εˆ_k > 0, where the inequality is strict for almost all u. This is a consequence of the Cauchy eigenvalue interlacing property, which one can show directly using min-max properties of eigenvalues (see appendix Lemma B.1). This leads to acceleration effects of BNGD: when ||e_k||_H is small, the contraction coefficient ρ in Eq. (12) can be improved to the lower coefficient in Eq. (13). This acceleration could be significant when κ∗ is much smaller than κ, which can happen if the spectral gap of H is very large.

The acceleration effect can be understood heuristically as follows: due to the over-parameterization introduced by BN, the convergence rate near a minimizer is governed by H∗ instead of H. The former has a degenerate direction {λu : λ ∈ R}, which coincides with the degenerate global minima. Hence, the effective condition number governing convergence depends on the largest and the second smallest eigenvalue of H∗ (the smallest being 0 in the degenerate minima direction). One can contrast this with the GD case, where the smallest eigenvalue of H is considered instead, since no degenerate directions exist.
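The improved conditioning can be checked directly from the definition (9). The following sketch (ours, with an arbitrary H and u chosen to have a large spectral gap) compares κ∗ with κ:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 30
# A covariance with a large spectral gap at the top (arbitrary test values).
lam = np.concatenate(([1000.0], np.linspace(1.0, 10.0, d - 1)))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(lam) @ Q.T

u = Q[:, 0] + 0.05 * rng.standard_normal(d)    # u mostly aligned with the top eigenvector
Hu = H @ u
H_star = H - np.outer(Hu, Hu) / (u @ Hu)       # H* = H - H u u^T H / (u^T H u), Eq. (9)

eig = np.sort(np.linalg.eigvalsh(H))
eig_star = np.sort(np.linalg.eigvalsh(H_star)) # H* u = 0, so the smallest eigenvalue is (numerically) zero
kappa = eig[-1] / eig[0]
kappa_star = eig_star[-1] / eig_star[1]        # largest over smallest *nonzero* eigenvalue of H*
print(kappa, kappa_star)                       # kappa* <= kappa, and much smaller in this example
```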
3.4. Robustness and Acceleration Due to Learning Rate Insensitivity

Let us now discuss another advantage BNGD possesses over GD, related to the insensitive dependence of the effective learning rate εˆ_k (and by extension, the effective convergence rate in Eq. (12) or Eq. (13)) on ε. The explicit dependence of εˆ_k on ε is quite complex, but we can give the following asymptotic estimates (see appendix B.6 for the proof).

Proposition 3.5. Suppose ε_a ∈ (0, 1], a_0 w_0^T g > 0, and ||g||^2 ≥ (w_0^T g/σ_0^2) g^T H w_0. Then (1) when ε is small enough, ε ≪ 1, the effective step size εˆ_k has the same order as ε; (2) when ε is large enough, ε ≫ 1, the effective step size εˆ_k has order O(ε^{-1}).

Observe that for finite k, εˆ_k is a differentiable function of ε. Therefore, the above result implies, via the mean value theorem, the existence of some ε_0 > 0 such that dεˆ_k/dε|_{ε=ε_0} = 0. Consequently, there is at least some small interval of the choice of learning rates ε where the performance of BNGD is insensitive to this choice.

In fact, empirically this is one commonly observed advantage of BNGD over GD: the former typically allows a variety of (large) learning rates to be used without adversely affecting performance. The same is not true for GD, where the convergence rate depends sensitively on the choice of learning rate. We will see later in Section 4 that although we only have a local insensitivity result above, the interval of this insensitivity is actually quite large in practice.

Furthermore, with some additional assumptions and approximations, the explicit dependence of εˆ_k on ε can be characterized in a quantitative manner. Concretely, we quantify the insensitivity to the step size by the interval in which εˆ is close to the optimal step size ε_opt (or the maximal allowed step size ε_max in GD, since ε_opt is very close to ε_max when κ is large). Proposition 3.5 indicates that this interval is approximately [C_1 ε_max, C_2/ε_max], which crosses a magnitude of C_2/(C_1 ε_max^2), where C_1 and C_2 are positive constants.
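The shape of the map ε ↦ εˆ described by Proposition 3.5 is easy to observe numerically. The sketch below (ours; the spectrum, initialization and number of steps are arbitrary choices) sweeps ε over many orders of magnitude, runs BNGD for a fixed number of steps, and reports the resulting εˆ from Eq. (10):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 100
lam = np.logspace(0, 3, d)                    # condition number 10^3
H = np.diag(lam)
u = rng.standard_normal(d); u /= np.linalg.norm(u)
g = H @ u
w_init = rng.standard_normal(d); w_init /= np.linalg.norm(w_init)
eps_opt = 2.0 / (lam.max() + lam.min())

def effective_lr(eps, steps=5000, eps_a=1.0):
    """Run BNGD for a fixed number of steps and return eps_hat from Eq. (10)."""
    w, a = w_init.copy(), 0.0
    for _ in range(steps):
        sigma = np.sqrt(w @ H @ w)
        a, w = (a + eps_a * (w @ g / sigma - a),
                w + eps * (a / sigma) * (g - (w @ g / sigma**2) * (H @ w)))
    sigma = np.sqrt(w @ H @ w)
    return eps * (a / sigma) * (w @ g / sigma**2)

# eps_hat grows like eps for small eps, decays like 1/eps for large eps (Proposition 3.5),
# and varies slowly over a wide middle range; compare with the optimal GD step size.
for eps in np.logspace(-4, 6, 11):
    print(f"eps = {eps:8.1e}   eps_hat = {effective_lr(eps):10.3e}   eps_opt = {eps_opt:.3e}")
```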

We set ε_a = 1, a_0 = w_0^T g/σ_0 (which is the value at the second step if we set a_0 = 0) and ||w_0|| = ||u|| = 1, where Theorem 3.3 gives the linear convergence result for almost all initial values, and the convergence rate can be quantified by the limiting effective learning rate εˆ := lim_{k→∞} εˆ_k = ε/||w_∞||^2. Consequently, we need to estimate the magnitude of ||w_∞||^2. The BNGD iteration implies the following equality,

||w_{k+1}||^2 = ||w_k||^2 + (ε^2/||w_k||^2) β_k,    (14)

where β_k is defined as β_k := (a_k^2 ||w_k||^2/σ_k^2) ||e_k||^2_{H^2}. The earlier convergence results motivate the following plausible approximation: we assume β_k converges linearly to zero, and the iteration of ||w_{k+1}||^2 can be approximated by ξ(k+1), which obeys the following ODE (whose discretization formally matches Eq. (14), assuming the aforementioned convergence rate ρ):

ξ(0) = ||w_1||^2,   ξ˙(t) = (ε^2/ξ(t)) β_0 ρ^{2t}.    (15)

Its solution is ξ^2(t) = ξ^2(0) + (ε^2 β_0/|ln ρ|)(1 − ρ^{2t}), where ρ ∈ (0, 1) depends on ε and is self-consistently determined by the limiting effective step size, i.e. ρ is the spectral radius of I − (ε/ξ(∞))H and ξ(∞) in turn depends on ρ. Analyzing the dependence of ξ(∞) on ε can give an estimate of the insensitivity interval, which is now [ε_max, 1/(β_0 ε_max)], since εˆ ≈ ε when ε ≪ 1, and εˆ ≈ 1/(β_0 ε) when ε ≫ 1 (see appendix B.6). Therefore, the magnitude of the interval of insensitivity varies inversely with β_0. Below, we quantify this magnitude in an average sense.

Definition 3.6. The average magnitude of the insensitivity interval of BNGD with ε_a = 1, a_0 = w_0^T g/σ_0 (or a_0 = 0) is defined as Ω_H = 1/(β̄_H ε_max^2), where β̄_H is the geometric average of β_0 over w_0 and u, which we take to be independent and uniformly distributed on the unit sphere S^{d−1}:

β̄_H := exp( E_{w_0,u}[ ln( (w_0^T H u / w_0^T H w_0)^2 ||e_0||^2_{H^2} ) ] ).    (16)

Note that we use the geometric average rather than the arithmetic average because we are measuring a ratio. Although we cannot calculate the value of Ω_H analytically, we have the following lower bound (see appendix B.7):

Proposition 3.7. For a positive definite matrix H with minimal and maximal eigenvalues λ_min and λ_max respectively, the Ω_H defined in Definition 3.6 satisfies Ω_H ≥ d/C, where

C := 4 (Tr[H^2]/(d λ_min^2)) (Tr[H]/(d λ_max)) exp( (1 − 2 ln κ/(κ − 1)) Tr[H]/(d λ_min) ),    (17)

and κ = λ_max/λ_min is the condition number of H.
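The quantity in Definition 3.6 can be estimated by Monte Carlo. The sketch below (ours, based on the reconstruction of Eq. (16) above, so the exact constant should be treated with caution) samples w_0 and u uniformly on the unit sphere and reports Ω_H for a few dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)

def omega_H(lam, n_samples=2000):
    """Monte Carlo estimate of Omega_H = 1/(beta_bar_H * eps_max^2) for H = diag(lam), per Definition 3.6."""
    d = len(lam)
    eps_max = 2.0 / lam.max()
    logs = []
    for _ in range(n_samples):
        w0 = rng.standard_normal(d); w0 /= np.linalg.norm(w0)
        u = rng.standard_normal(d);  u /= np.linalg.norm(u)
        coef = (w0 * lam) @ u / ((w0 * lam) @ w0)        # w0^T H u / w0^T H w0
        e0 = u - coef * w0                               # modified residual e_0
        logs.append(np.log(coef**2 * (e0 * lam**2) @ e0))  # ln[(w0^T H u / w0^T H w0)^2 * ||e0||^2_{H^2}]
    beta_bar = np.exp(np.mean(logs))                     # geometric average over (w0, u)
    return 1.0 / (beta_bar * eps_max**2)

for d in (25, 50, 100, 200):
    lam = np.linspace(1.0, 1e4, d)                       # eigenvalue grid similar to the Figure 2 setup
    print(d, omega_H(lam))                               # roughly linear growth in d
```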

As a consequence of Proposition 3.7, if the eigenvalues of H are sampled from a given continuous distribution on [λ_min, λ_max], such as the uniform distribution, then by the law of large numbers, Ω_H = O(d) for large d. This result suggests that the magnitude of the interval on which the performance of BNGD is insensitive to the choice of the learning rate increases linearly in dimension, implying that this robustness effect of BNGD is especially useful for high dimensional problems. Interestingly, although we only derived this result for the OLS problem, this linear scaling of the insensitivity interval is also observed in neural network experiments, where we varied the dimension by adjusting the width of the hidden layers. See Section 4.2.

The insensitivity to learning rate choices can also lead to acceleration effects if one has to use the same learning rate for training weights with different effective conditioning. This may arise in applications where each layer's gradient magnitude varies widely, thus requiring very different learning rates to achieve good performance. In this case, BNGD's large range of learning rate insensitivity allows one to use common values across all layers without adversely affecting the performance. This is again in contrast to GD, where such insensitivity is not present. See Section 4.3 for some experimental validation of this claim.

4. Experiments

Let us first summarize our key findings and insights from the analysis of BNGD on the OLS problem.

1. A scaling law governs BNGD, where certain configurations can be deemed equivalent.

2. BNGD converges for any learning rate ε > 0, provided that ε_a ∈ (0, 1]. In particular, different learning rates can be used for the BN variables (a) compared with the remaining trainable variables (w).

3. There exist intervals of ε for which the performance of BNGD is not sensitive to the choice of ε, and the magnitude of this interval grows with dimension.

In the subsequent sections, we first validate numerically these claims on the OLS model, and then show that these insights go beyond the simple OLS model considered in the theoretical framework. In fact, many of the uncovered properties are observed in general applications of BNGD in deep learning.

4.1. Experiments on OLS

Here we test the convergence and stability of BNGD for the OLS model. Consider a diagonal matrix H = diag(h) where h = (1, ..., κ) is an increasing sequence.
The scaling λmin A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent property (Proposition 3.2) allows us to set the initial value to the optimization of RNN models. w0 having same norm with u, kw0k = kuk = 1. Of course, one can verify that the scaling property holds strictly in this 4.2. Experiments on the Effect of Dimension case. In order to validate the approximate results in Section 3.4, Figure1 gives examples of H with different condition num- we compute numerically the dependence of the performance bers κ. We tested the loss function of BNGD, compared of BNGD on the choice of the learning rate ε. Observe from with the optimal GD (i.e. GD with the optimal step size Figure2 that the quantitative predictions of Ω in Definition εopt), in a large range of step sizes εa and ε, and with differ- 3.6 is consistent with numerical experiments, and the linear- ent initial values of a0. Another quantity we observe is the in-dimension scaling of the magnitude of the insensitivity effective step size εˆk of BN. The results are encoded by four interval is observed. Perhaps more interestingly, the same different colors: whether εˆk is close to the optimal step size scaling is also observed in (stochastic) BNGD on fully con- εopt, and whether loss of BNGD is less than the optimal nected neural networks trained on the MNIST dataset. This GD. The results indicate that the optimal convergence rate suggests that this scaling is relevant, at least qualitatively, of BNGD can be better than GD in some configurations, beyond the regimes considered in the theoretical parts of consistent with the statement of Theorem 3.4. Recall that this paper. this acceleration phenomenon is ascribed to the conditioning of H∗ which is better than H.

[Figure 1: a grid of color-coded panels with columns indexed by a_0 ∈ {10, 1, 0.01, 0.0001, 1e-8} and rows by κ ∈ {80, 1280, 20480, 327680}; the colors encode whether εˆ_k is close to the optimal step size and whether loss_BN < loss_GD(opt).]

Figure 1. Comparison of BNGD and GD on the OLS model. The results are encoded by four different colors: whether εˆ_k is close to the optimal step size ε_opt of GD, characterized by the inequality 0.8 ε_opt < εˆ_k < ε_opt/0.8, and whether the loss of BNGD is less than that of the optimal GD. Parameters: H = diag(logspace(0, log10(κ), 100)), u is randomly chosen uniformly from the unit sphere in R^100, and w_0 is set to Hu/||Hu||. The GD and BNGD iterations are executed for k = 2000 steps with the same w_0. In each image, the range of ε_a (x-axis) is 1.99 * logspace(-10, 0, 41), and the range of ε (y-axis) is logspace(-5, 16, 43). Observe that the performance of BNGD is less sensitive to the condition number, and its advantage is more pronounced when the latter is big.

Another important observation is a region such that εˆ is close to ε_opt; in other words, BNGD significantly extends the range of "optimal" step sizes. Consequently, we can choose step sizes in BNGD at greater liberty to obtain almost the same or better convergence rate than the optimal GD. However, the size of this region is inversely dependent on the initial condition a_0. Hence, this suggests that a small a_0 in the first steps may improve robustness. On the other hand, a small ε_a will weaken the performance of BN. This phenomenon suggests that improper initialization of the BN parameters weakens the power of BN. This experience is encountered in practice, such as in (Cooijmans et al., 2016), where higher initial values of the BN parameters are detrimental to the optimization of RNN models.

4.2. Experiments on the Effect of Dimension

In order to validate the approximate results in Section 3.4, we compute numerically the dependence of the performance of BNGD on the choice of the learning rate ε. Observe from Figure 2 that the quantitative prediction of Ω in Definition 3.6 is consistent with numerical experiments, and the linear-in-dimension scaling of the magnitude of the insensitivity interval is observed. Perhaps more interestingly, the same scaling is also observed in (stochastic) BNGD on fully connected neural networks trained on the MNIST dataset. This suggests that this scaling is relevant, at least qualitatively, beyond the regimes considered in the theoretical parts of this paper.

[Figure 2: panels showing the insensitivity interval versus learning rate for increasing dimension, for the OLS model (top) and for stochastic BNGD on MNIST (bottom).]

Figure 2. Effect of dimension. (Top line) Tests of BNGD on the OLS model with step size ε_a = 1, a_0 = 0. Parameters: H = diag(linspace(1, 10000, d)), u and w_0 are randomly chosen uniformly from the unit sphere in R^d. The BNGD iterations are executed for k = 5000 steps. The values are averaged over 500 independent runs. (Bottom line) Tests of stochastic BNGD on the MNIST dataset, with a fully connected neural network with one hidden layer and softmax mean-square loss. The separate learning rate for the BN parameters is lr_a = 10. The performance is characterized by the accuracy at the first epoch (averaged over 10 independent runs). The magnitude Ω is approximately measured for reference.

4.3. Further Neural Network Experiments

We conduct further experiments on deep learning applied to standard classification datasets: MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky & Hinton, 2009). The goal is to explore if the other key findings outlined at the beginning of this section continue to hold for more general settings. For the MNIST and Fashion MNIST datasets, we use two different networks: (1) a one-layer fully connected network (784 × 10) with softmax mean-square loss; (2) a four-layer convolution network (Conv-MaxPool-Conv-MaxPool-FC-FC) with ReLU and cross-entropy loss. For the CIFAR-10 dataset, we use a five-layer network (Conv-MaxPool-Conv-MaxPool-FC-FC-FC). All the trainable parameters are randomly initialized by the Glorot scheme (Glorot & Bengio, 2010) before training. For all three datasets, we use a minibatch size of 100 for computing stochastic gradients. In the BNGD experiments, batch normalization is performed on all layers, the BN parameters are initialized to transform the input to zero-mean/unit-variance distributions, and a small regularization parameter ϵ = 1e-3 is added to the variance (i.e. √(σ^2 + ϵ)) to avoid division by zero.

Scaling property. Theoretically, the scaling property (Proposition 3.2) holds for any layer using BN. However, it may be slightly biased by the regularization parameter ϵ. Here, we test the scaling property in practical settings. Figure 3 gives the loss and accuracy of network-(2) (2CNN+2FC) at the first epoch with different learning rates. The norms of all weights and biases are rescaled by a common factor η. We observe that the scaling property remains true for relatively large η. However, when η is small, the norms of the weights are small. Therefore, the effect of the ϵ-regularization in √(σ^2 + ϵ) becomes significant, causing the curves to be shifted.

Figure 3. Tests of the scaling property of the 2CNN+2FC network on the MNIST dataset. BN is performed on all layers, and ϵ = 1e-3 is added to the variance √(σ^2 + ϵ). All the trainable parameters (except the BN parameters) are randomly initialized by the Glorot scheme, and then multiplied by a same parameter η.

Stability for large learning rates. We use the loss value at the end of the first epoch to characterize the performance of the BNGD and GD methods. Although the training of the models has generally not converged at this point, it is enough to extract some relative rate information. Figure 4 shows the loss value of the networks on the three datasets. It is observed that GD and BNGD with identical learning rates for weights and BN parameters exhibit a maximum allowed learning rate, beyond which the iterations become unstable. On the other hand, BNGD with separate learning rates exhibits a much larger range of stability over the learning rate for non-BN parameters, consistent with our theoretical results on the OLS problem.
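One common way to implement such a separate learning rate scheme in practice is to place the BN affine parameters in their own optimizer parameter group. The PyTorch sketch below is ours; the architecture and the two learning rates are placeholders for illustration, not the exact configuration used in our experiments:

```python
import torch
import torch.nn as nn

# A small Conv-MaxPool-Conv-MaxPool-FC-FC network with BN on all layers (placeholder architecture).
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.BatchNorm1d(128), nn.ReLU(),
    nn.Linear(128, 10),
)

# Split parameters: BN scale/shift parameters vs. all remaining trainable weights.
bn_params, other_params = [], []
for m in model.modules():
    if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
        bn_params += list(m.parameters())
    elif isinstance(m, (nn.Conv2d, nn.Linear)):
        other_params += list(m.parameters())

# Separate learning rates: a (possibly large) lr for the weights, a separate lr_a for the BN parameters.
optimizer = torch.optim.SGD(
    [{"params": other_params},              # uses the global lr below (placeholder value)
     {"params": bn_params, "lr": 0.1}],     # separate lr_a for the BN parameters (placeholder value)
    lr=1.0,
)

x = torch.randn(100, 1, 28, 28)             # a dummy MNIST-sized minibatch
loss = nn.functional.cross_entropy(model(x), torch.randint(0, 10, (100,)))
optimizer.zero_grad(); loss.backward(); optimizer.step()
```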

Insensitivity of performance to learning rates. Observe that BN accelerates convergence more significantly for deep networks, whereas for one-layer networks the best performance of BNGD and GD are similar. Furthermore, in most cases, the range of optimal learning rates in BNGD is quite large, which is in agreement with the OLS analysis (see Section 3.4). This phenomenon is potentially crucial for understanding the acceleration of BNGD in deep neural networks. Heuristically, the "optimal" learning rates of GD in distinct layers (depending on some effective notion of "condition number") may be vastly different. Hence, GD with a shared learning rate across all layers may not achieve the best convergence rates for all layers at the same time. In this case, it is plausible that the acceleration of BNGD is a result of the decreased sensitivity of its convergence rate on the learning rate parameter over a large range of its choice.

Figure 4. Performance of the BNGD and GD methods on the MNIST (network-(1), 1FC), Fashion MNIST (network-(2), 2CNN+2FC) and CIFAR-10 (2CNN+3FC) datasets. The performance is characterized by the loss value at the first epoch. In the BNGD method, both the shared learning rate scheme and the separate learning rate scheme (learning rate lr_a for the BN parameters) are given. The values are averaged over 5 independent runs.

5. Conclusion

In this paper, we analyzed the dynamical properties of batch normalization on OLS, chosen for its simplicity and the availability of precise characterizations of GD dynamics. Even in such a simple setting, we saw that BNGD exhibits interesting non-trivial behavior, including scaling laws, robust convergence properties, acceleration, as well as the insensitivity of performance to the choice of learning rates. At least in the setting considered here, our analysis allows one to concretely answer the question of why BNGD can achieve better performance than GD. Although these results are derived only for the OLS model, we show via experiments that they are qualitatively, and sometimes quantitatively, valid for more general scenarios. These point to promising future directions towards uncovering the dynamical effect of batch normalization in deep learning and beyond.

References

Arora, S., Li, Z., and Lyu, K. Theoretical analysis of auto rate-tuning by batch normalization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rkxQ-nA9FX.

Bjorck, J., Gomes, C., and Selman, B. Understanding batch normalization. ArXiv e-prints, May 2018.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.

Cooijmans, T., Ballas, N., Laurent, C., and Courville, A. C. Recurrent batch normalization. CoRR, abs/1603.09025, 2016. URL http://arxiv.org/abs/1603.09025.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. CoRR, abs/1702.03275, 2017. URL http://arxiv.org/abs/1702.03275.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Kohler, J., Daneshmand, H., Lucchi, A., Zhou, M., Neymeyr, K., and Hofmann, T. Towards a theoretical understanding of batch normalization. ArXiv e-prints, May 2018.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent converges to minimizers. ArXiv e-prints, February 2016.

Ma, Y. and Klabjan, D. Convergence analysis of batch normalization for deep neural nets. CoRR, abs/1705.08011, 2017. URL http://arxiv.org/abs/1705.08011.

Panageas, I. and Piliouras, G. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. In Papadimitriou, C. H. (ed.), 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 2:1–2:12, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. ISBN 978-3-95977-029-3. doi: 10.4230/LIPIcs.ITCS.2017.2.

Saad, Y. Iterative Methods for Sparse Linear Systems, volume 82. SIAM, 2003.

Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? (No, it is not about internal covariate shift). ArXiv e-prints, May 2018.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.