
A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

Yongqiang Cai 1 Qianxiao Li 1 2 Zuowei Shen 1

1 Department of Mathematics, National University of Singapore, Singapore. 2 Institute of High Performance Computing, A*STAR, Singapore. Correspondence to: Yongqiang Cai, Qianxiao Li, Zuowei Shen.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Despite its empirical success and recent theoretical progress, there generally lacks a quantitative analysis of the effect of batch normalization (BN) on the convergence and stability of gradient descent. In this paper, we provide such an analysis on the simple problem of ordinary least squares (OLS), where the precise dynamical properties of gradient descent (GD) are completely known, thus allowing us to isolate and compare the additional effects of BN. More precisely, we show that unlike GD, gradient descent with BN (BNGD) converges for arbitrary learning rates for the weights, and the convergence remains linear under mild conditions. Moreover, we quantify two different sources of acceleration of BNGD over GD: one due to over-parameterization, which improves the effective condition number, and another due to having a large range of learning rates giving rise to fast descent. These phenomena set BNGD apart from GD and could account for much of its robustness properties. These findings are confirmed quantitatively by numerical experiments, which further show that many of the uncovered properties of BNGD in OLS are also observed qualitatively in more complex supervised learning problems.

1. Introduction

Batch normalization (BN) is one of the most important techniques for training deep neural networks and has proven extremely effective in avoiding gradient blowups during back-propagation and speeding up convergence. In its original introduction (Ioffe & Szegedy, 2015), the desirable effects of BN are attributed to the so-called "reduction of covariate shift". However, it is unclear what this statement means in precise mathematical terms.

Although recent theoretical work has established certain convergence properties of gradient descent with BN (BNGD) and its variants (Ma & Klabjan, 2017; Kohler et al., 2018; Arora et al., 2019), there generally lacks a quantitative comparison between the dynamics of the usual gradient descent (GD) and BNGD. In other words, a basic question that one could pose is: what quantitative changes does BN bring to the stability and convergence of gradient descent dynamics? Or even more simply: why should one use BNGD instead of GD? To date, a general mathematical answer to these questions remains elusive. This can be partly attributed to the complexity of the optimization objectives that one typically applies BN to, such as those encountered in deep learning. In these cases, even a quantitative analysis of the dynamics of GD itself is difficult, not to mention a precise comparison between the two.

For this reason, it is desirable to formulate the simplest non-trivial setting on which one can concretely study the effect of batch normalization and answer the questions above in a quantitative manner. This is the goal of the current paper, where we focus on perhaps the simplest supervised learning problem, ordinary least squares (OLS) regression, and analyze precisely the effect of BNGD when applied to this problem. A primary reason for this choice is that the dynamics of GD in least-squares regression is completely understood, thus allowing us to isolate and contrast the additional effects of batch normalization.

Our main findings can be summarized as follows:

1. Unlike GD, BNGD converges for arbitrarily large learning rates for the weights, and the convergence remains linear under mild conditions.

2. The asymptotic linear convergence of BNGD is faster than that of GD, and this can be attributed to the over-parameterization that BNGD introduces.

3. Unlike GD, the convergence rate of BNGD is insensitive to the choice of learning rates. The range of insensitivity can be characterized, and in particular it increases with the dimensionality of the problem.

Although these findings are established concretely only for the OLS problem, we will show through numerical experiments that some of them hold qualitatively, and sometimes even quantitatively, for more general situations in deep learning.

1.1. Related Work

Batch normalization was originally introduced in Ioffe & Szegedy (2015) and subsequently studied in further detail in Ioffe (2017). Since its introduction, it has become an important practical tool to improve the stability and efficiency of training deep neural networks (Bottou et al., 2018). Initial heuristic arguments attribute the desirable features of BN to concepts such as "covariate shift", but alternative explanations based on landscapes (Santurkar et al., 2018) and effective regularization (Bjorck et al., 2018) have been proposed.

Recent theoretical studies of BN include Ma & Klabjan (2017); Kohler et al. (2018); Arora et al. (2019). We now outline the main differences between them and the current work. In Ma & Klabjan (2017), the authors proposed a variant of BN, the diminishing batch normalization (DBN) algorithm, and established its convergence to a stationary point of the loss function. In Kohler et al. (2018), the authors also considered a BNGD variant by dynamically setting the learning rates and using bisection to optimize the rescaling variables introduced by BN. It is shown that this variant of BNGD converges linearly for simplified models, including an OLS model and "learning halfspaces". The primary difference in the current work is that we do not dynamically modify the learning rates, and consider instead a constant learning rate, i.e. the original BNGD algorithm. This is an important distinction: while a decaying or dynamic learning rate is sometimes used in GD, in the case of BN it is critical to analyze the constant learning rate case, precisely because one of the key practical advantages of BN is that a big learning rate can be used. Moreover, this allows us to isolate the influence of batch normalization itself, without the potentially obfuscating effects a dynamic learning rate schedule can introduce (e.g. see Eq. (10) and the discussion that follows). As the goal of considering a simplified model is to analyze the additional effects purely due to BN on GD, it is desirable to perform our analysis in this regime.

In Arora et al. (2019), the authors proved a general convergence result for BNGD of O(k^{-1/2}) in terms of the gradient norm for objectives with Lipschitz continuous gradients. This matches the best result for gradient descent on general non-convex functions with learning rate tuning (Carmon et al., 2017). In contrast, our convergence result is on the iterates and is shown to be linear under mild conditions (Theorem 3.4). This convergence result is stronger, but this is to be expected since we are considering a specific case. More importantly, we discuss concretely how BNGD offers advantages over GD instead of just matching its best-case performance. For example, not only do we show that convergence occurs for any learning rate ε > 0, we also derive a quantitative relationship between the learning rate and the convergence rate, from which the robustness of BNGD on OLS can be explained (see Section 3).

1.2. Organization

Our paper is organized as follows. In Section 2, we outline the ordinary least squares (OLS) problem and present GD and BNGD as alternative means to solve this problem. In Section 3, we demonstrate and analyze the convergence of BNGD for the OLS model, and in particular contrast the results with the behavior of GD, which is completely known for this model. We also discuss the important insights into BNGD that these results provide us with. We then validate these findings on more general problems in Section 4. Finally, we conclude in Section 5.

2. Background

2.1. Ordinary Least Squares and Gradient Descent

Consider the simple linear regression model where x ∈ R^d is a random input column vector and y is the corresponding output variable. Since batch normalization is applied to each feature separately, in order to gain key insights it is sufficient to consider the case y ∈ R. A noisy linear relationship is assumed between the dependent variable y and the independent variables x, i.e. y = x^T w + noise, where w ∈ R^d is the vector of trainable parameters. Denote the following moments:

H := E[x x^T],   g := E[x y],   c := E[y^2].    (1)

To simplify the analysis, we assume the covariance matrix H of x is positive definite and the mean E[x] of x is zero. The eigenvalues of H are denoted by λ_i(H), i = 1, 2, ..., d. In particular, the maximum and minimum eigenvalues of H are denoted by λ_max and λ_min respectively. The condition number of H is defined as κ := λ_max/λ_min. Note that the positive definiteness of H allows us to define the vector norm ||·||_H by ||x||_H^2 = x^T H x.

The ordinary least squares (OLS) method for estimating the unknown parameters w leads to the following optimization problem,

min_{w ∈ R^d} J_0(w) := (1/2) E_{x,y}[(y − x^T w)^2] = c/2 − w^T g + (1/2) w^T H w,    (2)

which has the unique minimizer w = u := H^{-1} g.

The gradient descent (GD) method (with step size or learning rate ε) for solving the optimization problem (2) is given by the iteration

w_{k+1} = w_k − ε ∇_w J_0(w_k) = (I − εH) w_k + ε g,    (3)

which converges if 0 < ε < 2/λ_max =: ε_max, and the convergence rate is determined by the spectral radius ρ_ε := ρ(I − εH) = max_i{|1 − ελ_i(H)|}, with

||u − w_{k+1}|| ≤ ρ(I − εH) ||u − w_k||.    (4)

It is well-known (e.g. see Chapter 4 of Saad (2003)) that the optimal learning rate is ε_opt = 2/(λ_max + λ_min), where the optimal convergence rate is ρ_opt = (κ − 1)/(κ + 1).

2.2. Batch Normalization

Batch normalization is a feature-wise normalization procedure typically applied to the output, which in this case is simply z = x^T w. The normalization transform is defined as follows:

N(z) := (z − E[z]) / √(Var[z]) = x^T w / σ,    (5)

where σ := √(w^T H w). After this rescaling, N(z) will be of order 1, and hence in order to reintroduce the scale (Ioffe & Szegedy, 2015), we multiply N(z) by a rescaling parameter a (note that the shift parameter can be set to zero since E[w^T x | w] = 0). Hence, we get the BN version of the OLS problem (2):

min_{w ∈ R^d, a ∈ R} J(a, w) := (1/2) E_{x,y}[(y − a N(x^T w))^2] = c/2 − (w^T g/σ) a + a^2/2.    (6)

The objective function J(a, w) is no longer convex. In fact, it has critical points {(a*, w*) | a* = 0, w*^T g = 0}, which are saddle points of J(a, w) if g ≠ 0.

We are interested in the critical points which constitute the set of global minima and satisfy the relations

a* = sign(s) √(u^T H u),   w* = s u,   for some s ∈ R \ {0}.

It is easy to check that they are in fact global minimizers and that the Hessian matrix at each such point is degenerate. Nevertheless, the saddle points are strict (see appendix B.1), which typically simplifies the analysis of gradient descent on non-convex objectives (Lee et al., 2016; Panageas & Piliouras, 2017).

We consider the gradient descent method for solving the problem (6), which we hereafter call batch normalization gradient descent (BNGD). We set the learning rates for a and w to be ε_a and ε respectively. These may be different, for reasons which will become clear in the subsequent analysis. We thus have the following discrete-time dynamical system:

a_{k+1} = a_k + ε_a (w_k^T g / σ_k − a_k),    (7)

w_{k+1} = w_k + ε (a_k/σ_k) (g − (w_k^T g / σ_k^2) H w_k).    (8)

To simplify subsequent notation, we denote by H∗ the matrix

H∗ := H − H u u^T H / (u^T H u).    (9)

We will see later that the over-parameterization introduced by BN gives rise to a degenerate Hessian diag(1, (||u||^2/||w*||^2) H∗) at a minimizer (a*, w*), and the BNGD dynamics is governed by H∗ instead of H as in the GD case. The matrix H∗ is positive semi-definite (H∗u = 0) and has better spectral properties than H, such as a lower effective condition number κ∗ = λ∗_max/λ∗_min ≤ κ, where λ∗_max and λ∗_min are the maximal and minimal nonzero eigenvalues of H∗ respectively. Particularly, κ∗ < κ for almost all u (see appendix B.1).
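Both iterations are straightforward to implement. The following minimal sketch (ours; the dimension, spectrum, initial values and step sizes are arbitrary illustrative choices) contrasts GD (3) with BNGD (7)-(8) on a synthetic instance, where the GD step size is limited by 2/λ_max while BNGD runs stably with a much larger ε:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
lam = np.linspace(1.0, 100.0, d)            # eigenvalues of H; condition number kappa = 100
H = np.diag(lam)
u = rng.standard_normal(d); u /= np.linalg.norm(u)
g = H @ u

def gd(w0, eps, steps):
    """Plain GD, Eq. (3): w_{k+1} = (I - eps*H) w_k + eps*g."""
    w = w0.copy()
    for _ in range(steps):
        w = w - eps * (H @ w - g)
    return w

def bngd(w0, a0, eps, eps_a, steps):
    """BNGD, Eqs. (7)-(8), with sigma_k = sqrt(w_k^T H w_k)."""
    w, a = w0.copy(), a0
    for _ in range(steps):
        sigma = np.sqrt(w @ H @ w)
        a, w = (a + eps_a * (w @ g / sigma - a),
                w + eps * (a / sigma) * (g - (w @ g / sigma**2) * (H @ w)))
    return w, a

w0 = rng.standard_normal(d)
eps_opt = 2.0 / (lam.max() + lam.min())     # optimal GD step size 2/(lambda_max + lambda_min)
w_gd = gd(w0, eps_opt, 2000)

# GD requires eps < 2/lambda_max = 0.02 here; BNGD runs stably far beyond that (cf. Theorem 3.3 below).
w_bn, a_bn = bngd(w0, a0=0.0, eps=1.0, eps_a=1.0, steps=2000)
proj = (w_bn @ g / (w_bn @ H @ w_bn)) * w_bn   # (w^T g / sigma^2) w, BNGD's effective estimate of u
print(np.linalg.norm(w_gd - u), np.linalg.norm(proj - u))
```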

3. Mathematical Analysis of BNGD on OLS

In this section, we discuss several mathematical results one can derive concretely for BNGD on the OLS problem (6). Compared with GD, the update coefficient before Hw_k in Eq. (8) changes from ε in Eq. (3) to a more complicated term, which we call the effective learning rate εˆ_k:

εˆ_k := ε (a_k/σ_k) (w_k^T g / σ_k^2).    (10)

Also, notice that with the over-parameterization introduced by a, it is no longer necessary for w_k to converge to u. In fact, any non-zero scalar multiple of u can be a global minimum. Hence, instead of considering the residual u − w_k as in the GD analysis Eq. (4), we may combine Eq. (7) and Eq. (8) to give

u − (w_k^T g/σ_k^2) w_{k+1} = (I − εˆ_k H) (u − (w_k^T g/σ_k^2) w_k).    (11)

Define the modified residual e_k := u − (w_k^T g/σ_k^2) w_k, which equals 0 if and only if w_k is a global minimizer. Observe that the mapping u ↦ (w^T g/σ^2) w = (w^T H u/w^T H w) w is an orthogonal projection under the inner product induced by H, hence we immediately have

||e_{k+1}||_H ≤ ||u − (w_k^T g/σ_k^2) w_{k+1}||_H ≤ ρ(I − εˆ_k H) ||e_k||_H,    (12)

where ρ(I − εˆ_k H) is the spectral radius of the matrix I − εˆ_k H. In other words, as long as max_i{|1 − εˆ_k λ_i(H)|} ≤ ρˆ < 1 for some ρˆ < 1 and all k, we have linear convergence of the residual (which also implies linear convergence of the objective, see appendix Lemma B.22).

At this point, we make an important observation: if we allow for dynamic learning rates, we may simply set εˆ_k = c for some fixed c ∈ (0, 2/λ_max) at every iteration. Then, linear convergence is immediate. However, it is clear that this fast convergence is almost entirely due to the effect of dynamic learning rates, and this has limited relevance in explaining the effect of BN. Moreover, comparing with Eq. (4), one can observe that with this choice, BNGD and GD have the same optimal convergence rates, and so this cannot offer explanations for any advantage of BNGD over GD either. For these reasons, it is important to avoid such dynamic learning rate assumptions.

As discussed above, without using dynamic learning rates, one has to then estimate εˆ_k to establish convergence. Heuristically, observe that if ε is small enough, this is likely true, as the other terms can be controlled due to the normalization. Thus, convergence for small ε should hold. In order to handle the large ε case, we establish a simple but useful scaling law that draws connections amongst cases with different ε scales.
3.1. Scaling Property

The dynamical properties of the BNGD iterations are governed by a set of parameters, or a configuration, {H, u, a_0, w_0, ε_a, ε}.

Definition 3.1 (Equivalent configuration). Two configurations, {H, u, a_0, w_0, ε_a, ε} and {H′, u′, a′_0, w′_0, ε′_a, ε′}, are said to be equivalent if, for the BNGD iterates {w_k}, {w′_k} following these configurations respectively, there is an invertible linear transformation T and a nonzero constant t such that w′_k = T w_k and a′_k = t a_k for all k.

The scaling property ensures that equivalent configurations must converge or diverge together, with the same rate up to a constant multiple. Now, it is easy to check that the system has the following scaling law.

Proposition 3.2 (Scaling property). Suppose µ ≠ 0, γ ≠ 0, r ≠ 0 and Q^T Q = I. Then (1) the configurations {µ Q^T H Q, (γ/√µ) Q^T u, γ a_0, γ Q^T w_0, ε_a, ε} and {H, u, a_0, w_0, ε_a, ε} are equivalent; (2) the configurations {H, u, a_0, w_0, ε_a, ε} and {H, u, a_0, r w_0, ε_a, r^2 ε} are equivalent.

It is worth noting that the scaling property (2) in Proposition 3.2 originates from the batch-normalization procedure and is independent of the specific structure of the loss function. Hence, it is valid for general problems where BN is used (appendix Lemma A.3). Despite being a simple result, the scaling property is important in determining the dynamics of BNGD, and is useful in our subsequent analysis of its convergence and stability properties. For example, it indicates that separating the learning rate for the weights (w) and the rescaling parameter (a) is equivalent to changing the norm of the initial weights.
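The scaling property (2) is easy to verify numerically. The sketch below (ours, with arbitrary test data) runs BNGD from (w_0, ε) and from (r·w_0, r^2·ε) for r > 0 and checks that the two trajectories are equivalent in the sense of Definition 3.1 with T = rI and t = 1:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
H = np.diag(np.linspace(1.0, 50.0, d))
u = rng.standard_normal(d); u /= np.linalg.norm(u)
g = H @ u

def bngd_traj(w0, a0, eps, eps_a, steps):
    """Return the list of (a_k, w_k) produced by Eqs. (7)-(8)."""
    w, a, traj = w0.copy(), a0, []
    for _ in range(steps):
        sigma = np.sqrt(w @ H @ w)
        a, w = (a + eps_a * (w @ g / sigma - a),
                w + eps * (a / sigma) * (g - (w @ g / sigma**2) * (H @ w)))
        traj.append((a, w.copy()))
    return traj

w0 = rng.standard_normal(d)
a0, eps_a, eps, r = 0.5, 1.0, 0.5, 10.0       # r > 0: rescaled configuration {H, u, a0, r*w0, eps_a, r^2*eps}

t1 = bngd_traj(w0, a0, eps, eps_a, steps=30)
t2 = bngd_traj(r * w0, a0, r**2 * eps, eps_a, steps=30)

# Expected equivalence: a'_k = a_k and w'_k = r * w_k for all k.
dev_a = max(abs(a2 - a1) for (a1, _), (a2, _) in zip(t1, t2))
dev_w = max(np.max(np.abs(w2 - r * w1)) for (_, w1), (_, w2) in zip(t1, t2))
print(dev_a, dev_w)                            # both at the level of floating-point round-off
```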
3.2. Batch Normalization Converges for Arbitrary Step Size

Having established the scaling law, we then have the following convergence result for BNGD on OLS.

Theorem 3.3 (Convergence of BNGD). The iteration sequence (a_k, w_k) in Eq. (7)-(8) converges to a stationary point for any initial value (a_0, w_0) and any ε > 0, as long as ε_a ∈ (0, 1]. Particularly, we have: if ε_a = 1 and ε > 0, then (a_k, w_k) converges to global minimizers for almost all initial values (a_0, w_0).

Sketch of proof. We first prove that the algorithm converges for any ε_a ∈ (0, 1] and small enough ε, with any initial value (a_0, w_0) such that ||w_0|| ≥ 1 (appendix Lemma B.12). Next, we observe that the sequence {||w_k||} is monotone increasing, and thus either converges to a finite limit or diverges. The scaling property is then used to exclude the divergent case: if {||w_k||} diverges, then at some k the norm ||w_k|| is large enough, and by the scaling property the configuration is equivalent to one where ||w_k|| = 1 and ε is small, which we have proved converges. This shows that ||w_k|| converges to a finite limit, from which the convergence of w_k and of the loss function value can be established, after some work. This proof is fully presented in appendix Theorem B.16 and the preceding lemmas. Lastly, using the "strict saddle point" arguments (Lee et al., 2016; Panageas & Piliouras, 2017), we can prove that the set of initial values for which (a_k, w_k) converges to saddle points has Lebesgue measure 0, provided ε_a = 1, ε > 0 (appendix Lemma B.19).

It is important to note that BNGD converges for all step sizes ε > 0 for w_k, independent of the spectral properties of H. This is a significant advantage and is in stark contrast with GD, where the step size is limited by 2/λ_max, and the condition number of H intimately controls the stability and convergence rate. Although we only prove the almost-everywhere convergence to a global minimizer for the case ε_a = 1, we have not encountered convergence to saddles in the OLS experiments, even for ε_a ∈ (0, 2) with initial values (a_0, w_0) drawn from typical distributions.

Remark: In appendix A, we show that the combination of the scaling property and the monotonicity of the weight norms, which hold for batch (and weight) normalization of general loss functions, can be used to prove a more general convergence result: if the iterates converge for small enough ε, then the gradient norm converges for any ε. We note that in the independent work of Arora et al. (2019), similar ideas have been used to prove convergence results for batch normalization for neural networks. Lastly, one can also show that in the general case, the over-parameterization due to batch (and weight) normalization only introduces strict saddle points (see appendix Lemma A.1).

3.3. Convergence Rate and Acceleration Due to Over-parameterization

Having established the convergence of BNGD on OLS, a natural follow-up question is why one should use BNGD over GD. After all, even if BNGD converges for any learning rate, if the convergence is universally slower than GD then it does not offer any advantages. We prove the following result, which shows that under mild conditions the convergence rate of BNGD on OLS is linear. Moreover, close to the optima the linear rate of convergence can be shown to be faster than the best-case linear convergence rate of GD. This offers a concrete result showing that BNGD can out-perform GD, even if the latter is perfectly tuned.

Theorem 3.4 (Convergence rate). If (a_k, w_k) converges to a minimizer with εˆ := lim_{k→∞} εˆ_k < ε∗_max := 2/λ∗_max, then the convergence is linear. Furthermore, when (a_k, w_k) is close to a minimizer, such that (λ_max ε |a_k| / σ_k^2) ||e_k||_H ≤ δ < 1 (this must happen for large enough k, since we assumed convergence to a minimizer), then we have

||e_{k+1}||_H ≤ ((ρ∗(I − εˆ_k H∗) + δ)/(1 − δ)) ||e_k||_H,    (13)

where ρ∗(I − εˆ_k H∗) := max{|1 − εˆ_k λ∗_min|, |1 − εˆ_k λ∗_max|}.

This statement is proved in appendix Lemma B.21. Recall that H∗ and λ∗_max are defined in Section 2. The assumption εˆ < ε∗_max is mild, since one can prove that the set of initial values (a_0, w_0) such that (a_k, w_k) converges to a minimizer (a*, w*) with εˆ > ε∗_max and det(I − εˆH∗) ≠ 0 is of measure zero (see appendix Lemma B.23).

The inequality (13) is motivated by the linearized system corresponding to Eq. (7)-(8) near a minimizer. When the iteration converges to a minimizer, the limiting εˆ must be a positive number, and the assumption εˆ < ε∗_max makes sure the coefficient in Eq. (13) is smaller than 1. This implies linear convergence of ||e_k||_H. Generally, the matrix H∗ has better spectral properties than H, in the sense that ρ∗(I − εˆ_k H∗) ≤ ρ(I − εˆ_k H), provided εˆ_k > 0, where the inequality is strict for almost all u. This is a consequence of the Cauchy eigenvalue interlacing property, which one can show directly using min-max properties of eigenvalues (see appendix Lemma B.1). This leads to acceleration effects of BNGD: when ||e_k||_H is small, the contraction coefficient ρ in Eq. (12) can be improved to the lower coefficient in Eq. (13). This acceleration could be significant when κ∗ is much smaller than κ, which can happen if the spectral gap of H is very large.

The acceleration effect can be understood heuristically as follows: due to the over-parameterization introduced by BN, the convergence rate near a minimizer is governed by H∗ instead of H. The former has a degenerate direction {λu : λ ∈ R}, which coincides with the degenerate global minima. Hence, the effective condition number governing convergence depends on the largest and the second smallest eigenvalue of H∗ (the smallest being 0 in the degenerate minima direction). One can contrast this with the GD case, where the smallest eigenvalue of H is considered instead, since no degenerate directions exist.
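The improved conditioning can be checked directly from the definition (9). The following sketch (ours, with an arbitrary H and u chosen to have a large spectral gap) compares κ∗ with κ:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 30
# A covariance with a large spectral gap at the top (arbitrary test values).
lam = np.concatenate(([1000.0], np.linspace(1.0, 10.0, d - 1)))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(lam) @ Q.T

u = Q[:, 0] + 0.05 * rng.standard_normal(d)    # u mostly aligned with the top eigenvector
Hu = H @ u
H_star = H - np.outer(Hu, Hu) / (u @ Hu)       # H* = H - H u u^T H / (u^T H u), Eq. (9)

eig = np.sort(np.linalg.eigvalsh(H))
eig_star = np.sort(np.linalg.eigvalsh(H_star)) # H* u = 0, so the smallest eigenvalue is (numerically) zero
kappa = eig[-1] / eig[0]
kappa_star = eig_star[-1] / eig_star[1]        # largest over smallest *nonzero* eigenvalue of H*
print(kappa, kappa_star)                       # kappa* <= kappa, and much smaller in this example
```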
3.4. Robustness and Acceleration Due to Learning Rate Insensitivity

Let us now discuss another advantage BNGD possesses over GD, related to the insensitive dependence of the effective learning rate εˆ_k (and by extension, the effective convergence rate in Eq. (12) or Eq. (13)) on ε. The explicit dependence of εˆ_k on ε is quite complex, but we can give the following asymptotic estimates (see appendix B.6 for the proof).

Proposition 3.5. Suppose ε_a ∈ (0, 1], a_0 w_0^T g > 0, and ||g||^2 ≥ (w_0^T g/σ_0^2) g^T H w_0. Then (1) when ε is small enough, ε ≪ 1, the effective step size εˆ_k has the same order as ε; (2) when ε is large enough, ε ≫ 1, the effective step size εˆ_k has order O(ε^{-1}).

Observe that for finite k, εˆ_k is a differentiable function of ε. Therefore, the above result implies, via the mean value theorem, the existence of some ε_0 > 0 such that dεˆ_k/dε|_{ε=ε_0} = 0. Consequently, there is at least some small interval of the choice of learning rates ε where the performance of BNGD is insensitive to this choice.

In fact, empirically this is one commonly observed advantage of BNGD over GD: the former typically allows a variety of (large) learning rates to be used without adversely affecting performance. The same is not true for GD, where the convergence rate depends sensitively on the choice of learning rate. We will see later in Section 4 that although we only have a local insensitivity result above, the interval of this insensitivity is actually quite large in practice.

Furthermore, with some additional assumptions and approximations, the explicit dependence of εˆ_k on ε can be characterized in a quantitative manner. Concretely, we quantify the insensitivity to the step size by the interval in which εˆ is close to the optimal step size ε_opt (or the maximal allowed step size ε_max in GD, since ε_opt is very close to ε_max when κ is large). Proposition 3.5 indicates that this interval is approximately [C_1 ε_max, C_2/ε_max], which crosses a magnitude of C_2/(C_1 ε_max^2), where C_1 and C_2 are positive constants.
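The shape of the map ε ↦ εˆ described by Proposition 3.5 is easy to observe numerically. The sketch below (ours; the spectrum, initialization and number of steps are arbitrary choices) sweeps ε over many orders of magnitude, runs BNGD for a fixed number of steps, and reports the resulting εˆ from Eq. (10):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 100
lam = np.logspace(0, 3, d)                    # condition number 10^3
H = np.diag(lam)
u = rng.standard_normal(d); u /= np.linalg.norm(u)
g = H @ u
w_init = rng.standard_normal(d); w_init /= np.linalg.norm(w_init)
eps_opt = 2.0 / (lam.max() + lam.min())

def effective_lr(eps, steps=5000, eps_a=1.0):
    """Run BNGD for a fixed number of steps and return eps_hat from Eq. (10)."""
    w, a = w_init.copy(), 0.0
    for _ in range(steps):
        sigma = np.sqrt(w @ H @ w)
        a, w = (a + eps_a * (w @ g / sigma - a),
                w + eps * (a / sigma) * (g - (w @ g / sigma**2) * (H @ w)))
    sigma = np.sqrt(w @ H @ w)
    return eps * (a / sigma) * (w @ g / sigma**2)

# eps_hat grows like eps for small eps, decays like 1/eps for large eps (Proposition 3.5),
# and varies slowly over a wide middle range; compare with the optimal GD step size.
for eps in np.logspace(-4, 6, 11):
    print(f"eps = {eps:8.1e}   eps_hat = {effective_lr(eps):10.3e}   eps_opt = {eps_opt:.3e}")
```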

We set ε_a = 1, a_0 = w_0^T g/σ_0 (which is the value at the second step if we set a_0 = 0) and ||w_0|| = ||u|| = 1, where Theorem 3.3 gives the linear convergence result for almost all initial values, and the convergence rate can be quantified by the limiting effective learning rate εˆ := lim_{k→∞} εˆ_k = ε/||w_∞||^2. Consequently, we need to estimate the magnitude of ||w_∞||^2. The BNGD iteration implies the following equality,

||w_{k+1}||^2 = ||w_k||^2 + (ε^2/||w_k||^2) β_k,    (14)

where β_k is defined as β_k := (a_k^2 ||w_k||^2/σ_k^2) ||e_k||^2_{H^2}. The earlier convergence results motivate the following plausible approximation: we assume β_k converges linearly to zero, and the iteration of ||w_{k+1}||^2 can be approximated by ξ(k+1), which obeys the following ODE (whose discretization formally matches Eq. (14), assuming the aforementioned convergence rate ρ):

ξ(0) = ||w_1||^2,   ξ˙(t) = (ε^2/ξ(t)) β_0 ρ^{2t}.    (15)

Its solution is ξ^2(t) = ξ^2(0) + (ε^2 β_0/|ln ρ|)(1 − ρ^{2t}), where ρ ∈ (0, 1) depends on ε and is self-consistently determined by the limiting effective step size, i.e. ρ is the spectral radius of I − (ε/ξ(∞))H and ξ(∞) in turn depends on ρ. Analyzing the dependence of ξ(∞) on ε can give an estimate of the insensitivity interval, which is now [ε_max, 1/(β_0 ε_max)], since εˆ ≈ ε when ε ≪ 1, and εˆ ≈ 1/(β_0 ε) when ε ≫ 1 (see appendix B.6). Therefore, the magnitude of the interval of insensitivity varies inversely with β_0. Below, we quantify this magnitude in an average sense.

Definition 3.6. The average magnitude of the insensitivity interval of BNGD with ε_a = 1, a_0 = w_0^T g/σ_0 (or a_0 = 0) is defined as Ω_H = 1/(β̄_H ε_max^2), where β̄_H is the geometric average of β_0 over w_0 and u, which we take to be independent and uniformly distributed on the unit sphere S^{d−1}:

β̄_H := exp( E_{w_0,u}[ ln( (w_0^T H u / w_0^T H w_0)^2 ||e_0||^2_{H^2} ) ] ).    (16)

Note that we use the geometric average rather than the arithmetic average because we are measuring a ratio. Although we cannot calculate the value of Ω_H analytically, we have the following lower bound (see appendix B.7):

Proposition 3.7. For a positive definite matrix H with minimal and maximal eigenvalues λ_min and λ_max respectively, the Ω_H defined in Definition 3.6 satisfies Ω_H ≥ d/C, where

C := 4 (Tr[H^2]/(d λ_min^2)) (Tr[H]/(d λ_max)) exp( (1 − 2 ln κ/(κ − 1)) Tr[H]/(d λ_min) ),    (17)

and κ = λ_max/λ_min is the condition number of H.
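The quantity in Definition 3.6 can be estimated by Monte Carlo. The sketch below (ours, based on the reconstruction of Eq. (16) above, so the exact constant should be treated with caution) samples w_0 and u uniformly on the unit sphere and reports Ω_H for a few dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)

def omega_H(lam, n_samples=2000):
    """Monte Carlo estimate of Omega_H = 1/(beta_bar_H * eps_max^2) for H = diag(lam), per Definition 3.6."""
    d = len(lam)
    eps_max = 2.0 / lam.max()
    logs = []
    for _ in range(n_samples):
        w0 = rng.standard_normal(d); w0 /= np.linalg.norm(w0)
        u = rng.standard_normal(d);  u /= np.linalg.norm(u)
        coef = (w0 * lam) @ u / ((w0 * lam) @ w0)        # w0^T H u / w0^T H w0
        e0 = u - coef * w0                               # modified residual e_0
        logs.append(np.log(coef**2 * (e0 * lam**2) @ e0))  # ln[(w0^T H u / w0^T H w0)^2 * ||e0||^2_{H^2}]
    beta_bar = np.exp(np.mean(logs))                     # geometric average over (w0, u)
    return 1.0 / (beta_bar * eps_max**2)

for d in (25, 50, 100, 200):
    lam = np.linspace(1.0, 1e4, d)                       # eigenvalue grid similar to the Figure 2 setup
    print(d, omega_H(lam))                               # roughly linear growth in d
```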

As a consequence of Proposition 3.7, if the eigenvalues of H are sampled from a given continuous distribution on [λ_min, λ_max], such as the uniform distribution, then by the law of large numbers, Ω_H = O(d) for large d. This result suggests that the magnitude of the interval on which the performance of BNGD is insensitive to the choice of the learning rate increases linearly in dimension, implying that this robustness effect of BNGD is especially useful for high dimensional problems. Interestingly, although we only derived this result for the OLS problem, this linear scaling of the insensitivity interval is also observed in neural network experiments, where we varied the dimension by adjusting the width of the hidden layers. See Section 4.2.

The insensitivity to learning rate choices can also lead to acceleration effects if one has to use the same learning rate for training weights with different effective conditioning. This may arise in applications where each layer's gradient magnitude varies widely, thus requiring very different learning rates to achieve good performance. In this case, BNGD's large range of learning rate insensitivity allows one to use common values across all layers without adversely affecting the performance. This is again in contrast to GD, where such insensitivity is not present. See Section 4.3 for some experimental validation of this claim.

4. Experiments

Let us first summarize our key findings and insights from the analysis of BNGD on the OLS problem.

1. A scaling law governs BNGD, where certain configurations can be deemed equivalent.

2. BNGD converges for any learning rate ε > 0, provided that ε_a ∈ (0, 1]. In particular, different learning rates can be used for the BN variables (a) compared with the remaining trainable variables (w).

3. There exist intervals of ε for which the performance of BNGD is not sensitive to the choice of ε, and the magnitude of this interval grows with dimension.

In the subsequent sections, we first validate numerically these claims on the OLS model, and then show that these insights go beyond the simple OLS model considered in the theoretical framework. In fact, many of the uncovered properties are observed in general applications of BNGD in deep learning.

4.1. Experiments on OLS

Here we test the convergence and stability of BNGD for the OLS model. Consider a diagonal matrix H = diag(h) where h = (1, ..., κ) is an increasing sequence.
The scaling λmin A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent property (Proposition 3.2) allows us to set the initial value to the optimization of RNN models. w0 having same norm with u, kw0k = kuk = 1. Of course, one can verify that the scaling property holds strictly in this 4.2. Experiments on the Effect of Dimension case. In order to validate the approximate results in Section 3.4, Figure1 gives examples of H with different condition num- we compute numerically the dependence of the performance bers κ. We tested the loss function of BNGD, compared of BNGD on the choice of the learning rate ε. Observe from with the optimal GD (i.e. GD with the optimal step size Figure2 that the quantitative predictions of Ω in Definition εopt), in a large range of step sizes εa and ε, and with differ- 3.6 is consistent with numerical experiments, and the linear- ent initial values of a0. Another quantity we observe is the in-dimension scaling of the magnitude of the insensitivity effective step size εˆk of BN. The results are encoded by four interval is observed. Perhaps more interestingly, the same different colors: whether εˆk is close to the optimal step size scaling is also observed in (stochastic) BNGD on fully con- εopt, and whether loss of BNGD is less than the optimal nected neural networks trained on the MNIST dataset. This GD. The results indicate that the optimal convergence rate suggests that this scaling is relevant, at least qualitatively, of BNGD can be better than GD in some configurations, beyond the regimes considered in the theoretical parts of consistent with the statement of Theorem 3.4. Recall that this paper. this acceleration phenomenon is ascribed to the conditioning of H∗ which is better than H.

[Figure 1: a grid of color-coded panels with columns indexed by a_0 ∈ {10, 1, 0.01, 0.0001, 1e-8} and rows by κ ∈ {80, 1280, 20480, 327680}; the colors encode whether εˆ_k is close to the optimal step size and whether loss_BN < loss_GD(opt).]

Figure 1. Comparison of BNGD and GD on the OLS model. The results are encoded by four different colors: whether εˆ_k is close to the optimal step size ε_opt of GD, characterized by the inequality 0.8 ε_opt < εˆ_k < ε_opt/0.8, and whether the loss of BNGD is less than that of the optimal GD. Parameters: H = diag(logspace(0, log10(κ), 100)), u is randomly chosen uniformly from the unit sphere in R^100, and w_0 is set to Hu/||Hu||. The GD and BNGD iterations are executed for k = 2000 steps with the same w_0. In each image, the range of ε_a (x-axis) is 1.99 * logspace(-10, 0, 41), and the range of ε (y-axis) is logspace(-5, 16, 43). Observe that the performance of BNGD is less sensitive to the condition number, and its advantage is more pronounced when the latter is big.

Another important observation is a region such that εˆ is close to ε_opt; in other words, BNGD significantly extends the range of "optimal" step sizes. Consequently, we can choose step sizes in BNGD at greater liberty to obtain almost the same or better convergence rate than the optimal GD. However, the size of this region is inversely dependent on the initial condition a_0. Hence, this suggests that a small a_0 in the first steps may improve robustness. On the other hand, a small ε_a will weaken the performance of BN. This phenomenon suggests that improper initialization of the BN parameters weakens the power of BN. This experience is encountered in practice, such as in (Cooijmans et al., 2016), where higher initial values of the BN parameters are detrimental to the optimization of RNN models.

4.2. Experiments on the Effect of Dimension

In order to validate the approximate results in Section 3.4, we compute numerically the dependence of the performance of BNGD on the choice of the learning rate ε. Observe from Figure 2 that the quantitative prediction of Ω in Definition 3.6 is consistent with numerical experiments, and the linear-in-dimension scaling of the magnitude of the insensitivity interval is observed. Perhaps more interestingly, the same scaling is also observed in (stochastic) BNGD on fully connected neural networks trained on the MNIST dataset. This suggests that this scaling is relevant, at least qualitatively, beyond the regimes considered in the theoretical parts of this paper.

[Figure 2: panels showing the insensitivity interval versus learning rate for increasing dimension, for the OLS model (top) and for stochastic BNGD on MNIST (bottom).]

Figure 2. Effect of dimension. (Top line) Tests of BNGD on the OLS model with step size ε_a = 1, a_0 = 0. Parameters: H = diag(linspace(1, 10000, d)), u and w_0 are randomly chosen uniformly from the unit sphere in R^d. The BNGD iterations are executed for k = 5000 steps. The values are averaged over 500 independent runs. (Bottom line) Tests of stochastic BNGD on the MNIST dataset, with a fully connected neural network with one hidden layer and softmax mean-square loss. The separate learning rate for the BN parameters is lr_a = 10. The performance is characterized by the accuracy at the first epoch (averaged over 10 independent runs). The magnitude Ω is approximately measured for reference.

4.3. Further Neural Network Experiments

We conduct further experiments on deep learning applied to standard classification datasets: MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky & Hinton, 2009). The goal is to explore if the other key findings outlined at the beginning of this section continue to hold for more general settings. For the MNIST and Fashion MNIST datasets, we use two different networks: (1) a one-layer fully connected network (784 × 10) with softmax mean-square loss; (2) a four-layer convolution network (Conv-MaxPool-Conv-MaxPool-FC-FC) with ReLU and cross-entropy loss. For the CIFAR-10 dataset, we use a five-layer network (Conv-MaxPool-Conv-MaxPool-FC-FC-FC). All the trainable parameters are randomly initialized by the Glorot scheme (Glorot & Bengio, 2010) before training. For all three datasets, we use a minibatch size of 100 for computing stochastic gradients. In the BNGD experiments, batch normalization is performed on all layers, the BN parameters are initialized to transform the input to zero-mean/unit-variance distributions, and a small regularization parameter ϵ = 1e-3 is added to the variance (i.e. √(σ^2 + ϵ)) to avoid division by zero.

Scaling property. Theoretically, the scaling property (Proposition 3.2) holds for any layer using BN. However, it may be slightly biased by the regularization parameter ϵ. Here, we test the scaling property in practical settings. Figure 3 gives the loss and accuracy of network-(2) (2CNN+2FC) at the first epoch with different learning rates. The norms of all weights and biases are rescaled by a common factor η. We observe that the scaling property remains true for relatively large η. However, when η is small, the norms of the weights are small. Therefore, the effect of the ϵ-regularization in √(σ^2 + ϵ) becomes significant, causing the curves to be shifted.

Figure 3. Tests of the scaling property of the 2CNN+2FC network on the MNIST dataset. BN is performed on all layers, and ϵ = 1e-3 is added to the variance √(σ^2 + ϵ). All the trainable parameters (except the BN parameters) are randomly initialized by the Glorot scheme, and then multiplied by a same parameter η.

Stability for large learning rates. We use the loss value at the end of the first epoch to characterize the performance of the BNGD and GD methods. Although the training of the models has generally not converged at this point, it is enough to extract some relative rate information. Figure 4 shows the loss value of the networks on the three datasets. It is observed that GD and BNGD with identical learning rates for weights and BN parameters exhibit a maximum allowed learning rate, beyond which the iterations become unstable. On the other hand, BNGD with separate learning rates exhibits a much larger range of stability over the learning rate for non-BN parameters, consistent with our theoretical results on the OLS problem.
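One common way to implement such a separate learning rate scheme in practice is to place the BN affine parameters in their own optimizer parameter group. The PyTorch sketch below is ours; the architecture and the two learning rates are placeholders for illustration, not the exact configuration used in our experiments:

```python
import torch
import torch.nn as nn

# A small Conv-MaxPool-Conv-MaxPool-FC-FC network with BN on all layers (placeholder architecture).
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.BatchNorm1d(128), nn.ReLU(),
    nn.Linear(128, 10),
)

# Split parameters: BN scale/shift parameters vs. all remaining trainable weights.
bn_params, other_params = [], []
for m in model.modules():
    if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
        bn_params += list(m.parameters())
    elif isinstance(m, (nn.Conv2d, nn.Linear)):
        other_params += list(m.parameters())

# Separate learning rates: a (possibly large) lr for the weights, a separate lr_a for the BN parameters.
optimizer = torch.optim.SGD(
    [{"params": other_params},              # uses the global lr below (placeholder value)
     {"params": bn_params, "lr": 0.1}],     # separate lr_a for the BN parameters (placeholder value)
    lr=1.0,
)

x = torch.randn(100, 1, 28, 28)             # a dummy MNIST-sized minibatch
loss = nn.functional.cross_entropy(model(x), torch.randint(0, 10, (100,)))
optimizer.zero_grad(); loss.backward(); optimizer.step()
```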

Insensitivity of performance to learning rates. Observe that BN accelerates convergence more significantly for deep networks, whereas for one-layer networks the best performance of BNGD and GD are similar. Furthermore, in most cases, the range of optimal learning rates in BNGD is quite large, which is in agreement with the OLS analysis (see Section 3.4). This phenomenon is potentially crucial for understanding the acceleration of BNGD in deep neural networks. Heuristically, the "optimal" learning rates of GD in distinct layers (depending on some effective notion of "condition number") may be vastly different. Hence, GD with a shared learning rate across all layers may not achieve the best convergence rates for all layers at the same time. In this case, it is plausible that the acceleration of BNGD is a result of the decreased sensitivity of its convergence rate on the learning rate parameter over a large range of its choice.

Figure 4. Performance of the BNGD and GD methods on the MNIST (network-(1), 1FC), Fashion MNIST (network-(2), 2CNN+2FC) and CIFAR-10 (2CNN+3FC) datasets. The performance is characterized by the loss value at the first epoch. In the BNGD method, both the shared learning rate scheme and the separate learning rate scheme (learning rate lr_a for the BN parameters) are given. The values are averaged over 5 independent runs.

5. Conclusion

In this paper, we analyzed the dynamical properties of batch normalization on OLS, chosen for its simplicity and the availability of precise characterizations of GD dynamics. Even in such a simple setting, we saw that BNGD exhibits interesting non-trivial behavior, including scaling laws, robust convergence properties, acceleration, as well as the insensitivity of performance to the choice of learning rates. At least in the setting considered here, our analysis allows one to concretely answer the question of why BNGD can achieve better performance than GD. Although these results are derived only for the OLS model, we show via experiments that they are qualitatively, and sometimes quantitatively, valid for more general scenarios. These point to promising future directions towards uncovering the dynamical effect of batch normalization in deep learning and beyond.

References

Arora, S., Li, Z., and Lyu, K. Theoretical analysis of auto rate-tuning by batch normalization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rkxQ-nA9FX.

Bjorck, J., Gomes, C., and Selman, B. Understanding batch normalization. ArXiv e-prints, May 2018.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.

Cooijmans, T., Ballas, N., Laurent, C., and Courville, A. C. Recurrent batch normalization. CoRR, abs/1603.09025, 2016. URL http://arxiv.org/abs/1603.09025.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. CoRR, abs/1702.03275, 2017. URL http://arxiv.org/abs/1702.03275.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Kohler, J., Daneshmand, H., Lucchi, A., Zhou, M., Neymeyr, K., and Hofmann, T. Towards a theoretical understanding of batch normalization. ArXiv e-prints, May 2018.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent converges to minimizers. ArXiv e-prints, February 2016.

Ma, Y. and Klabjan, D. Convergence analysis of batch normalization for deep neural nets. CoRR, abs/1705.08011, 2017. URL http://arxiv.org/abs/1705.08011.

Panageas, I. and Piliouras, G. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. In Papadimitriou, C. H. (ed.), 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 2:1–2:12, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. ISBN 978-3-95977-029-3. doi: 10.4230/LIPIcs.ITCS.2017.2.

Saad, Y. Iterative Methods for Sparse Linear Systems, volume 82. SIAM, 2003.

Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? (No, it is not about internal covariate shift). ArXiv e-prints, May 2018.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.