Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization
Jonas Kohler* Hadi Daneshmand* Aurelien Lucchi Thomas Hofmann ETH Zurich ETH Zurich ETH Zurich ETH Zurich
Ming Zhou Klaus Neymeyr Universit¨atRostock Universit¨atRostock
Abstract (Bn) (Ioffe and Szegedy, 2015). This technique has been proven to successfully stabilize and accelerate Normalization techniques such as Batch Nor- training of deep neural networks and is thus by now malization have been applied successfully for standard in many state-of-the art architectures such training deep neural networks. Yet, despite as ResNets (He et al., 2016) and the latest Inception its apparent empirical benefits, the reasons Nets (Szegedy et al., 2017). The success of Batch Nor- behind the success of Batch Normalization malization has promoted its key idea that normalizing are mostly hypothetical. We here aim to pro- the inner layers of a neural network stabilizes train- vide a more thorough theoretical understand- ing which recently led to the development of many ing from a classical optimization perspective. such normalization methods such as (Arpit et al., 2016; Our main contribution towards this goal is Klambauer et al., 2017; Salimans and Kingma, 2016) the identification of various problem instances and (Ba et al., 2016) to name just a few. in the realm of machine learning where Batch Yet, despite the ever more important role of Batch Normalization can provably accelerate opti- Normalization for training deep neural networks, the mization. We argue that this acceleration Machine Learning community is mostly relying on em- is due to the fact that Batch Normalization pirical evidence and thus lacking a thorough theoretical splits the optimization task into optimizing understanding that can explain such success. Indeed length and direction of the parameters sepa- – to the best of our knowledge – there exists no theo- rately. This allows gradient-based methods to retical result which provably shows faster convergence leverage a favourable global structure in the rates for this technique on any problem instance. So loss landscape that we prove to exist in Learn- far, there only exists competing hypotheses that we ing Halfspace problems and neural network briefly summarize below. training with Gaussian inputs. We thereby turn Batch Normalization from an effective practical heuristic into a provably converg- 1.1 Related work arXiv:1805.10694v3 [stat.ML] 6 Oct 2018 ing algorithm for these settings. Furthermore, we substantiate our analysis with empirical Internal Covariate Shift The most widespread evidence that suggests the validity of our the- idea is that Batch Normalization accelerates training by oretical results in a broader context. reducing the so-called internal covariate shift, defined as the change in the distribution of layer inputs while the conditional distribution of outputs is unchanged. 1 INTRODUCTION This change can be significant especially for deep neu- ral networks where the successive composition of layers One of the most important recent innovations for opti- drives the activation distribution away from the ini- mizing deep neural networks is Batch Normalization tial input distribution. Ioffe and Szegedy (2015) argue that Batch Normalization reduces the internal covari- ate shift by employing a normalization technique that enforces the input distribution of each activation layer *The first two authors have contributed equally to this work. to be whitened - i.e. enforced to have zero means and Correspondence to [email protected] unit variances - and decorrelated . Yet, as pointed out Exponential convergence rates for Batch Normalization by Lipton and Steinhardt (2018), the covariate shift provably accelerates optimization with Gradient phenomenon itself is not rigorously shown to be the Descent and does the length-direction decoupling play a reason behind the performance of Batch Normalization. role in this phenomenon? Furthermore, a recent empirical study published by (Santurkar et al., 2018) provides strong evidence sup- We answer both questions affirmatively. In particular, porting the hypothesis that the performance gain of we show that the specific variance transformation of Batch Normilization is not explained by the reduction Bn decouples the length and directional components of internal covariate shift. of the weight vectors in such a way that allows local search methods to exploit certain global properties of Smoothing of objective function Recently, San- the optimization landscape (present in the directional turkar et al. (2018) argue that under certain assump- component of the optimal weight vector). Using this tions a normalization layer simplifies optimization by fact and endowing the optimization method with an smoothing the loss landscape of the optimization prob- adaptive stepsize scheme, we obtain an exponential (or lem of the preceding layer. Yet, we note that this effect as more commonly termed linear) convergence rate for may - at best - only improve the constant factor of Batch Norm Gradient Descent on the (possibly) non- the convergence rate of Gradient Descent and not the convex problem of Learning Halfspaces with Gaussian rate itself (e.g. from sub-linear to linear). Furthermore, inputs (Section 4), which is a prominent problem in the analysis treats only the largest eigenvalue and thus machine learning Erdogdu et al. (2016). We thereby one direction in the landscape (at any given point) and turn Bn from an effective practical heuristic into a prov- keeps the (usually trainable) BN parameters fixed to ably converging algorithm. Additionally we show that zero-mean and unit variance. For a thorough conclu- the length-direction decoupling can be considered as a sion about the overall landscape, a look at the entire non-linear reparametrization of the weight space, which eigenspectrum (including negative and zero eigenval- may be beneficial for even simple convex optimization ues) would be needed. Yet, this is particularly hard tasks such as logistic regressions. Interestingly, non- to do as soon as one allows for learnable mean and linear weightspace transformations have received little variance parameters since the effect of their interplay to no attention within the optimization community (see on the distribution of eigenvalues is highly non-trivial. (Mikhalevich et al., 1988) for an exception). Finally, in Section 5 we analyze the effect of Bn for Length-direction decoupling Finally, a different training a multilayer neural network (MLP) and prove perspective was brought up by another normalization – again under a similar Gaussianity assumption – that technique termed Weight Normalization ( ) (Sali- Wn Bn acts in such a way that the cross dependencies be- mans and Kingma, 2016). This technique performs a tween layers are reduced and thus the curvature struc- very simple normalization that is independent of any ture of the network is simplified. Again, this is due data statistics with the goal of decoupling the length of to a certain global property in the directional part of the weight vector from its direction. The optimization the optimization landscape, which BN can exploit via of the training objective is then performed by training the length-direction decoupling. As a result, gradient- the two parts separately. As discussed in Section 2, Bn based optimization in reparametrized coordinates (and and Wn differ in how the weights are normalized but with an adaptive stepsize policy) can enjoy a linear share the above mentioned decoupling effect. Interest- convergence rate on each individual unit. We substan- ingly, weight normalization has been shown empirically tiate both findings with experimental results on real to benefit from similar acceleration properties as Batch world datasets that confirm the validity of our analysis Normalization (Gitman and Ginsburg, 2017; Salimans outside the setting of our theoretical assumptions that and Kingma, 2016). This raises the obvious question cannot be certified to always hold in practice. whether the empirical success of training with Batch Normalization can (at least partially) be attributed to its length-direction decoupling aspect. 2 BACKGROUND
1.2 Contribution and organization 2.1 Assumptions on data distribution
We contribute to a better theoretical understanding Suppose that x ∈ Rd is a random input vector and y ∈ of Batch Normalization by analyzing it from an opti- {±1} is the corresponding output variable. Throughout mization perspective. In this regard, we particularly this paper we recurrently use the following statistics address the following question: u := E [−yx] , S := E xx> (1)
Can we find a setting in which Batch Normalization and make the following (weak) assumption. Jonas Kohler*, Hadi Daneshmand*, Aurelien Lucchi, Thomas Hofmann
Assumption 1. [Weak assumption on data distribu- As stated (in finite-sum terms) in Algorithm 1 of tion] We assume that E [x] = 0. We further assume (Ioffe and Szegedy, 2015) the normalization operation that the spectrum of the matrix S is bounded as amounts to computing
> > 0 < µ := λmin (S) ,L := λmax (S) < ∞. (2) x w − E [x w] BN(x>w) = g x + γ, (6) > 1/2 As a result, S is the symmetric positive definite covari- varx[x w] ance matrix of x. where g ∈ R and γ ∈ R are (trainable) mean and The part of our analysis presented in Section 4 and 5 variance adjustment parameters. In the following, we relies on a stronger assumption on the data distribu- assume that x is zero mean (Assumption 1) and omit tion. In this regard we consider the combined random γ.2 Then the variance can be written as follows variable > > 2 > > z := −yx, (3) varx[x w] = Ex (x w) = Ex (w x)(x w) > > > whose mean vector and covariance matrix are u and S = w Ex xx w = w Sw as defined above in Eq. (1). (7) Assumption 2. [Normality assumption on data distri- and replacing this expression into the batch normalized bution] We assume that z is a multivariate normal ran- output of Eq.(5) yields dom variable distributed with mean E [z] = E [−yx] = u and second-moment E zz> − E[z]E[z]> = E xx> − x>w f (w, g) = E ϕ(g ) . (8) uu> = S − uu>. BN x (w>Sw)1/2
In the absence of further knowledge, assuming Gaussian In order to keep concise notations, we will often use the data is plausible from an information-theoretic point induced norm of the positive definite matrix S defined > 1/2 of view since the Gaussian distribution maximizes the as kwkS := w Sw . Comparing Eq. (4) and (8) entropy over the set of all absolutely continuous distri- it becomes apparent that Bn can be considered as a butions with fixed first and second moment (Dowson reparameterization of the weight space. We thus define and Wragg, 1973). Thus, many recent studies on neural w networks make this assumption on x (see e.g. (Brutzkus w˜ := g (9) kwk and Globerson, 2017; Du and Lee, 2018)). Here we S assume Gaussianity on yx instead which is even less and note that w accounts for the direction and g for restrictive in some cases1. the length of w˜ . As a result, the batch normalized output can then be written as 2.2 Batch normalization as a > reparameterization of the weight space fBN (w, g) = Ex ϕ(x w˜ ) . (10)
In neural networks a Bn layer normalizes the input Note that Weight Normalization (Wn) is another in- of each unit of the following layer. This is done on stance of the above reparametrization, where the co- the basis of data statistics in a training batch, but for variance matrix S is replaced by the identity matrix I the sake of analyzablility we will work directly with (Salimans and Kingma, 2016). In both cases, the objec- population statistics. In particular, the output f of a tive becomes invariant to linear scaling of w. From a specific unit, which projects an input x to its weight geometry perspective, the directional part of Wn can vector w and applies a sufficiently smooth activation be understood as performing optimization on the unit sphere while operates on the S-sphere (ellipsoid) function ϕ : R → R as follows Bn (Cho and Lee, 2017). Note that one can compute the variance term (7) in a matrix-free manner, i.e. S never > f(w) = Ex ϕ x w (4) needs to be computed explicitly for Bn. is normalized on the pre-activation level. That is, the Of course, this type of reparametrization is not exclu- input-output mapping of this unit becomes sive to applications in neural networks. In the follow- ing two sections we first show how reparametrizing the
> 2 fBN (w, g, γ) = Ex ϕ BN(x w) . (5) In the (non-compositional) models of Section 3 and 4, fulfilling the assumption that x is zero-mean is as simple as 1For example, suppose that conditional distribution centering the dataset. Yet, we also omit centering a neurons P (x|y = 1) is gaussian with mean µ for positive labels input as well as learning γ for the neural network analysis and −µ for negative labels (mixture of gaussians). If the in Section 5. This is done for the sake of simplicity but note covariance matrix of these marginal distributions are the that Salimans and Kingma (2016) found that these aspects same, z = yx is Gaussian while x is not. of Bn yield no empirical improvements for optimization. Exponential convergence rates for Batch Normalization weight space of linear models can be advantageous from Based upon existing results, the next theorem estab- a classical optimization point of view. In Section 5 we lishes a linear convergence rate for the above iterates extend this analysis to training Batch Normalized neu- to the minimizer w∗ in the normalized coordinates. ral networks with adaptive-stepsize Gradient Descent Theorem 1. [Convergence rate on least squares] Sup- and show that the length-direction split induces an pose that the (weak) Assumption 1 on the data distri- interesting decoupling effect of the individual network + bution holds. Consider the Gd iterates {wt}t∈N given layers which simplifies the curvature structure. > in Eq. (14) with the stepsize ηt = wt Swt/(2L|ρ(wt)|) and starting from ρ(w0) 6= 0. Then, 3 ORDINARY LEAST SQUARES µ 2t ∆ρ ≤ 1 − ∆ρ , (15) t L 0 As a preparation for subsequent analyses, we start ∗ −1 with the simple convex quadratic objective encountered where ∆ρt := ρ(wt) − ρ(w ). Furthermore, the S - when minimizing an ordinary least squares problem norm of the gradient ∇ρ(wt) relates to the suboptimal- ity as h > 2i min fols(w˜ ) := Ex,y y − x w˜ d 2 2 w˜ ∈ kw k k∇ρ(w )k −1 /|4ρ(w )| = ∆ρ . (16) R (11) t S t S t t (A1) ⇔ min 2u>w˜ + w˜ >Sw˜ . d w˜ ∈R This convergence rate is of the same order as the rate of standard Gd on the original objective fols of Eq. (11) One can think of this as a linear neural network with (Nesterov, 2013). Yet, it is interesting to see that the just one neuron and a quadratic loss. Thus, applying non-convexity of the normalized objective does not slow Bn resembles reparametrizing w˜ according to Eq. (9) gradient-based optimization down. In the following, we and the objective turns into the non-convex problem will repeatedly invoke this result to analyze more com- plex objectives for which Gd only achieves a sublinear > u w 2 convergence rate in the original coordinate space but is min fols(w, g) := 2g + g . (12) d w∈R \{0},g∈R kwkS provably accelerated after using Batch Normalization. Despite the non-convexity of this new objective, we will prove that Gradient Descent (Gd) enjoys a lin- 4 LEARNING HALFSPACES ear convergence rate. Interestingly, our analysis estab- We now turn our attention to the problem of Learning lishes a link between fols in reparametrized coordinates (Eq. (12)) and the well-studied task of minimizing (gen- Halfspaces, which encompasses training the simplest eralized) Rayleigh quotients as it is commonly encoun- possible neural network: the Perceptron. This opti- tered in eigenvalue problems (Argentati et al., 2017). mization problem can be written as > min flh(w˜ ) := Ey,x ϕ(z w˜ ) d (17) 3.1 Convergence analysis w˜ ∈R where z := −yx and ϕ : → + is a loss function. To simplify the analysis, note that, for a given w, R R Common choices for ϕ include the zero-one, piece-wise the objective of Eq. (12) is convex w.r.t. the scalar ∗ linear, logistic and sigmoidal loss. We here tailor our g and thus the optimal value gw can be found by ∂fols ∗ > analysis to the following choice of loss function. setting ∂g = 0, which gives gw := − u w /kwkS. Replacing this closed-form solution into Eq. (12) yields Assumption 3. [Assumptions on loss function] We the following optimization problem assume that the loss function ϕ : R → R is infinitely dif- ferentiable, i.e. ϕ ∈ C∞(R, R), with a bounded deriva- w>uu>w tive, i.e. ∃Φ > 0 such that |ϕ(1)(β)| ≤ Φ, ∀β ∈ R. min ρ(w) := − , (13) d > w∈R \{0} w Sw Furthermore, we need flh to be sufficiently smooth. which – as discussed in Appendix A.2 – is a special Assumption 4. [Smoothness assumption] We as- case of minimizing the generalized Rayleigh quotient sume that the objective f : Rd → R is ζ-smooth for which an extensive literature exists (Knyazev, 1998; if it is differentiable on R and its gradient is ζ- D’yakonov and McCormick, 1995). Here, we particu- Lipschitz. Furthermore, we assume that a solution ∗ 2 larly consider solving (13) with Gd, which applies the α := arg minα k∇f(αw)k exists that is bounded in following iterative updates to the parameters the sense that ∀w ∈ Rd, −∞ < α∗ < ∞.3 > 3This is a rather technical but not so restrictive assump- (u wt)u + ρ(wt)Swt tion. For example, it always holds for the sigmoid loss wt+1 := wt + 2ηt > . (14) wt Swt unless the classification error of w is already zero. Jonas Kohler*, Hadi Daneshmand*, Aurelien Lucchi, Thomas Hofmann
Since globally optimizing (17) is in general NP-hard As an illustration, Figure 1 shows the level sets as (Guruswami and Raghavendra, 2009), we instead fo- well as the optimal direction of a least squares-, a cus on understanding the effect of the normalized pa- logistic- and a sigmoidal regression problem on the rameterization when searching for a stationary point. same Gaussian dataset. Furthermore, it shows iterates Towards this end we now assume that z is a multivari- of Gd in original coordinates and a sequential version ate normal random variable (see Assumption 2 and of Gd in normalized coordinates that first optimizes discussion there). the direction and then the scaling of its parameters (Gdnpseq). Both methods start at the same point and 4.1 Global characterization of the objective run with an infinitesimally small stepsize. It can be seen that, while Gd takes completely different paths
The learning halfspaces objective flh – on Gaussian towards the optimal points of each problem instance, inputs – has a remarkable property: all critical points the dynamics of Gdnpseq are exactly the same until lie on the same line, independent of the choice of the the optimal direction4 is found and differ only in the loss function ϕ. We formalize this claim in the next final scaling. lemma. 10 10 S 1u S 1u Lemma 1. Under Assumptions 1 and 2, all bounded Start Start 5 GD 5 GDNP_seq critical points w˜ ∗ of flh have the general form 0 0 −1 w˜ ∗ = g∗S u, 5 5
10 10 where the scalar g∗ ∈ R depends on w˜ ∗ and the choice 15 15 of the loss function ϕ. 15 10 5 0 5 10 15 10 5 0 5 10 10 10 S 1u S 1u Start Start Interestingly, the optimal direction of these critical 5 GD 5 GDNP_seq points spans the same line as the solution of a corre- 0 0
sponding least squares regression problem (see Eq. (38) 5 5 in Appendix A). In the context of convex optimization 10 10 of generalized linear models, this fact was first pointed
15 15 out in (Brillinger, 2012). Although the global optima 15 10 5 0 5 10 15 10 5 0 5 10 10 10 of the two objectives are aligned, classical optimization S 1u S 1u Start Start methods - which perform updates based on local infor- 5 GD 5 GDNP_seq mation - are generally blind to such global properties 0 0 of the objective function. This is unfortunate since 5 5 Gradient Descent converges linearly in the quadratic least-squares setting but only sublinearly on general 10 10
15 15 Learning Halfspace problems (Zhang et al., 2015). 15 10 5 0 5 10 15 10 5 0 5 10
To accelerate the convergence of Gradient Descent, Er- Figure 1: Level sets and path of Gd (left) and Gdnpseq dogdu et al. (2016) thus proposed a two-step global (right) for a (i) least squares-, (ii) logistic-, and (iii) sig- optimization procedure for solving generalized linear moidal regression on a Gaussian dataset. All iterates are models, which first involves finding the optimal di- shown in original coordinates. Note that the Gdnpseq paths are identical until the optimal direction (red line) is found. rection by optimizing a least squares regression as a surrogate objective and secondly searching for a proper scaling factor of that minimizer. Here, we show that 4.2 Local optimization in normalized running Gd in coordinates reparameterized as in Eq. (9) parameterization makes this two-step procedure redundant. More specifi- cally, splitting the optimization problem into searching Let flh(w, g) be the S-reparameterized objective with for the optimal direction and scaling separately, allows w˜ = gw/kwkS as defined in Eq. (9). We here con- even local optimization methods to exploit the property sider optimizing this objective using Gradient Descent of global minima alignment. Thus - without having to in Normalized Parameters (Gdnp). In each iteration, solve a least squares problem in the first place - the Gdnp performs a gradient step to optimize flh with directional updates on the Learning Halfspace prob- respect to the direction w and does a one-dimensional lem can mimic the least squares dynamics and thereby bisection search to estimate the scaling factor g (see inherit the linear convergence rate. Combined with a fast (one dimensional) search for the optimal scaling 4Depicted by the dotted red line. Note that – as a result in each step the overall convergence stays linear. of Lemma 1 – this line is identical in all problems. Exponential convergence rates for Batch Normalization
Algorithm 1 Gradient Descent in Normalized Parame- coordinates. Although we observe that the accelerated terization (Gdnp) rate achieved by Gdnp is significantly better than 1: Input: Td, Ts, stepsize policy s, stopping criterion h the best known rate of Agd for non-convex objective and objective f functions, we need to point out that the rate for Gdnp 2: g ← 1, and initialize w such that ρ(w ) 6= 0 0 0 relies on the assumption of a Gaussian data distribution. 3: for t = 1,...,Td do 4: if h(wt, gt) 6= 0 then Yet, Section 4.4 includes promising experimental results 5: st ← s(wt, gt) in a more practical setting with non-Gaussian data. 6: wt+1 ← wt − st∇wf(wt, gt) # directional step 7: end if Finally, we note that the proof of this result re- 8: gt ← Bisection(Ts, f, wt) # see Alg. 3 lies specifically on the S-reparametrization done by 9: end for Batch Normalization. In Appendix B.3.6 we detail 10: w˜ ← g w /kw k Td Td Td Td S out why our proof strategy is not suitable for the I- 11: return w˜ Td reparametrization of Weight Normalization and thus leave it as an interesting open question if other settings (or proof strategies) can be found where linear rates Algorithm 1). It is therefore only a slight modification for Wn are provable. of performing Batch Normalization plus Gradient De- scent: Instead of taking a gradient step on both w and g, we search for the (locally)-optimal scaling g at each 4.4 Experiments I iteration. This modification is cheap and it simplifies the theoretical analysis but it can also be substituted Setting In order to substantiate the above analysis we compare the convergence behavior of and easily in practice by performing multiple Gd steps on Gd Agd the scaling factor, which we do for the experiments in to three versions of Gradient Descent in normalized Section 4.4. coordinates. Namely, we benchmark (i) Gdnp (Algo- rithm 1) with multiple gradient steps on g instead of Bisection, (ii) a simpler version (Bn) which updates 4.3 Convergence result w and g with just one fixed step-size gradient step5 We now show that Algorithm 1 can achieve a linear and (iii) Weight Normalization (Wn) as presented in convergence rate to a critical point on the possibly non- (Salimans and Kingma, 2016). All methods use full batch sizes and – except for Gdnp on w – each method convex objective flh with Gaussian inputs. Note that is run with a problem specific, constant stepsize. all information for computing the adaptive stepsize st is readily available and can be computed efficiently. We consider empirical risk minimization as a surrogate Theorem 2. [Convergence rate of Gdnp on learning for flh (17) on the common real-world dataset a9a as well as on synthetic data drawn from a multivariate halfspaces] Suppose Assumptions 1– 4 hold. Let w˜ Td be the output of Gdnp on flh with the following choice of Gaussian distribution. We center the datasets and use stepsizes two different functions ϕ(·). First, we choose the soft- plus which resembles the classical logistic regression 3 kwtkS (convex). Secondly, we use the sigmoid which is a com- st := s(wt, gt) = − (18) Lgth(wt, gt) monly used (non-convex) continuous approximation of the 0-1 loss (Zhang et al., 2015). Further details can for t = 1,...,Td, where be found in Appendix D. 0 > > h(wt, gt) :=Ez ϕ z w˜ t (u wt) (19) Results The Gaussian design experiments clearly 00 > > 2 − Ez ϕ z w˜ t u wt confirm Theorem 4 in the sense that the loss in the convex-, as well as the gradient norm in the non-convex is a stopping criterion. If initialized such that ρ(w0) 6= case decrease at a linear rate. The results on a9a show
0 (see Eq. (13)), then w˜ Td is an approximate critical that Gdnp can accelerate optimization even when the point of flh in the sense that normality assumption does not hold and in a setting where no covariate shift is present, which motivates 2 2Td 2 ∗ k∇w˜ f(w˜ Td )k ≤(1 − µ/L) Φ (ρ(w0) − ρ ) future research of normalization techniques in optimiza- −Ts (0) (0) 2 tion. Interestingly, the performance of simple Bn and + 2 ζ|bt − at |/µ . (20) Wn is similar to that of Gd, which suggests that the length-direction decoupling on its own does not cap- To complete the picture, Table 1 compares this result to the order complexity for reaching an -optimal critical 5Thus Bn is conceptually very close to the classical Batch point w˜ (i.e.k∇w˜ flh(w˜ )k ≤ ) of Gradient Descent Norm Gradient Descent presented in (Ioffe and Szegedy, and Accelerated Gradient Descent (Agd) in original 2015) Jonas Kohler*, Hadi Daneshmand*, Aurelien Lucchi, Thomas Hofmann
Method Assumptions Complexity Rate Reference −2 Gd Smoothness O(Tgd ) Sublinear (Nesterov, 2013) −7/4 Agd Smoothness O(Tgd log(1/)) Sublinear (Jin et al., 2017) −1 Agd Smoothness+convexity O(Tgd ) Sublinear (Nesterov, 2013) 2 Gdnp 1, 2, 3, and 4 O(Tgd log (1/)) Linear This paper
Table 1: Computational complexity to reach an -critical point on flh – with a (possibly) non-convex loss function ϕ. O(Tgd) represents the time complexity of computing a gradient for each method.
101 GD GD 100 where 100 AGD AGD BN BN 10 1 10 1 WN WN (1) (m) (1) (m) 2 GDNP GDNP ˜ 10 10 2 W := {w˜ ,..., w˜ }, Θ := {θ , . . . , θ }.
10 3 10 3 10 4 10 4 In the following, we analyze the optimization of the 10 5 10 5 ˜ 10 6 input weights W with frozen output weights Θ and 10 7 10 6 thus write F (W˜ ) hereafter for simplicity. As previ- 0 100 200 300 400 0 200 400 600 800 1000 ously assumed for the case of learning halfspaces, our gaussian softplus a9a softplus analysis relies on Assumption 2. This approach is GD GD 10 2 10 2 AGD AGD rather common in the analysis of neural networks, e.g. BN BN WN WN Brutzkus and Globerson (2017) showed that for a spe- GDNP 3 GDNP 10 cial type of one-hidden layer network with isotropic 10 3 10 4 Gaussian inputs, all local minimizers are global. Un- der the same assumption, similar results for different 10 5 10 4 classes of neural networks were derived in (Li and Yuan, 0 50 100 150 200 0 200 400 600 800 1000 1200 2017; Soltanolkotabi et al., 2017; Du et al., 2017). gaussian sigmoid a9a sigmoid 5.1 Global characterization of the objective Figure 2: Results of an average run (solid line) in terms of log suboptimality (softplus) and log gradient norm (sigmoid) over iterations as well as 90% confidence intervals of 10 runs We here show that, under the same assumption, the with random initialization. landscape of fnn exhibits a global property similar to one of the Learning Halfspaces objective (see Section 4.1). In fact, the critical points w˜ i of all neurons in ture the entire potential of these methods. Gdnp on a hidden layer align along one and the same single the other hand takes full advantage of the parameter line in Rd and this direction depends only on incoming splitting, both in terms of multiple steps on g and – information into the hidden layer. more importantly – adaptive stepsizes in w. Lemma 2. Suppose Assumptions 1 and 2 hold and (i) ˜ let wˆ be a critical point of fnn(W) with respect to 5 NEURAL NETWORKS hidden unit i and for a fixed Θ 6= 0. Then, there exits (i) We now turn our focus to optimizing the weights of a scalar cˆ ∈ R such that a multilayer perceptron (MLP) with one hidden layer wˆ (i) =c ˆ(i)S−1u, ∀i = 1, . . . , m. (22) and m hidden units that maps an input x ∈ Rd to a scalar output in the following way See Appendix C.2 for a discussion of possible implica- m X > (i) tions for deep neural networks. Fx(W˜ , Θ) = θiϕ(x w˜ ). i=1 5.2 Convergence result The w˜ (i) ∈ Rd and θ(i) ∈ R terms are the input and output weights of unit i and ϕ : R → R is a so-called We again consider normalizing the weights of each activation function. We assume ϕ to be a tanh, which is unit by means of its input covariance matrix S, i.e. (i) (i) (i) (i) a common choice in many neural network architectures w˜ = g w /kw kS, ∀i = 1, . . . , m and present used in practice. Given a loss function ` : R → R+, the a simple algorithmic manner to leverage the input- optimal input- and output weights are determined by output decoupling of Lemma 2. Contrary to Bn, which minimizing the following optimization problem optimizes all the weights simultaneously, our approach is based on alternating optimization over the different ˜ h ˜ i min fnn(W, Θ) := Ey,x ` − yFx(W, Θ) , (21) hidden units. We thus adapt Gdnp to the problem of W˜ ,Θ optimizing the units of a neural network in Algorithm 2 Exponential convergence rates for Batch Normalization
Algorithm 2 Training an MLP with Gdnp is independent of all downstream layers. Since this (i) (i) (i) assumption is rather strong and since Algorithm 2 is 1: Input: Td , Ts , step-size policies s , stopping crite- rion h(i) intended for analysis purposes only, we test the validity 2: W/1 ← 0, g/1 = 1 of the above hypothesis outside the Gaussian setting 3: for i = 1, . . . , m do by training a Batch Normalized multilayer feedforward (1) (i−1) (i+1) (m) 4: W/i := {w ,..., w , w ,..., w }, g/i := network (Bn) on a real-world image classification task {g(1), . . . , g(i−1), g(i+1), . . . , g(m)} with plain Gradient Descent. For comparison, a second (i) (i) (i) (i) (i) 5: f (w , g ) := fnn(w , g , W/i, g/i) unnormalized network is trained by Gd. To validate (i) (i) (i) (i) (i) (i) (i) 6: (w , g ) ← Gdnp(f , Td , Ts , s , h ) Lemma 2 we measure the interdependency between # optimization of unit i the central and all other hidden layers in terms of the 7: end for Frobenius norm of their second partial cross derivatives 8: return W := [w(1),..., w(m)], g := [g1, . . . , gm]. (in the directional component). Further details can be found in Appendix D. but note that this modification is mainly to simplify
lossBN wfBN 80 the theoretical analysis. lossGD f 10 v GD
60 Optimizing each unit independently formally results 8 (i) (i) (i) in minimizing the function f (w , g ) as defined in 40 6 Algorithm 2. In the next theorem, we prove that this 20 version of Gdnp achieves a linear rate of convergence 4
(i) 0 to optimize each f . 2 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000
200 Theorem 3. [Convergence of Gdnp on MLP] Suppose 2 2 fBN/( W2 W4) F 350 fGD/( W2 W4) F 175 2 2 fBN/( W3 W4) F fGD/( W3 W4) F Assumptions 1– 4 hold. We consider optimizing the 2f /( W W ) 300 2f /( W W ) 150 BN 5 4 F GD 5 4 F 2 2 (i) (i) fBN/( W6 W4) F fGD/( W6 W4) F 250 weights (w , g ) of unit i, assuming that all directions 125
(j) k 200 {w }j i. Then, Gdnp with step-size policy s(i) as in 75 150 (i) 50 100 (101) and stopping criterion h as in (102) yields a 25 50 (i) linear convergence rate on f in the sense that 0 0 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 (i) 2 2t ∗ k∇w˜ (i) f(w˜ t )kS−1 ≤(1 − µ/L) C (ρ(w0) − ρ ) Figure 3: (i) Loss, (ii) gradient norm and dependencies −T (i) (0) (0) 2 between central- and all other layers for BN (iii) and GD + 2 s ζ|b − a |/µ , t t (iv) on a 6 hidden layer network with 50 units (each) on the (23) CIFAR10 dataset (5 runs with random initialization, 5000 where the constant C > 0 is defined in Eq. (105). iterations, curvature information computed each 200th).
The result of Theorem 3 relies on the fact that each w(j), j =6 i is either zero or has zero gradient. If we Results Figure 3 confirms that the directional gra- assume that an exact critical point is reached after dients of the central layer are affected far more by optimizing each individual unit, then the result directly the upstream than by the downstream layers to a sur- implies that the alternating minimization presented prisingly large extent. Interestingly, this holds even in Algorithm 2 reaches a critical point of the overall before reaching a critical point. The downstream cross- objective. Since the established convergence rate for dependencies are generally decaying for the Batch Nor- each individual unit is linear, this assumption sounds malized network (Bn) (especially in the first 1000 iter- realistic. We leave a more precise convergence analysis, ations where most progress is made) while they remain that takes into account that optimizing each individual elevated in the un-normalized network (Gd), which unit for a finite number of steps may yield numerical suggest that using Batch Normalization layers indeed suboptimalities, for future work. simplifies the networks curvature structure in w such that the length-direction decoupling allows Gradient 5.3 Experiments II Descent to exploit simpler trajectories in these normal- ized coordinates for faster convergence. Setting In the proof of Theorem 3 we show that Gdnp can leverage the length-direction decoupling in Of course, we cannot untangle this effect fully from a way that lowers cross-dependencies between hidden other possible positive aspects of training with Bn (see layers and yields faster convergence. A central part of introduction). Yet, the fact that the (de-)coupling in- the proof is Lemma 2 which says that – given Gaus- creases in the distance to the middle layer (note how sian inputs – the optimal direction of a given layer earlier (later) layers are more (less) important for W4) Jonas Kohler*, Hadi Daneshmand*, Aurelien Lucchi, Thomas Hofmann emphasizes the relevance of this analysis particularly Brutzkus, A. and Globerson, A. (2017). Globally opti- for deep neural network structures, where downstream mal gradient descent for a convnet with gaussian dependencies might vanish completely with depth. This inputs. arXiv preprint arXiv:1702.07966. does not only make gradient based training easier but also suggests the possibility of using partial second Cho, M. and Lee, J. (2017). Riemannian approach to order information, such as diagonal Hessian approxi- batch normalization. In Guyon, I., Luxburg, U. V., mations (e.g. proposed in (Martens et al., 2012)). Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 5225– 6 CONCLUSION 5235. Curran Associates, Inc.
We took a theoretical approach to study the ac- Dowson, D. and Wragg, A. (1973). Maximum-entropy celeration provided by Batch Normalization. In a distributions having prescribed first and second somewhat simplified setting, we have shown that the moments (corresp.). IEEE Transactions on Infor- reparametrization performed by Batch Normalization mation Theory, 19(5):689–693. leads to a provable acceleration of gradient-based op- timization by splitting it into subtasks that are easier Du, S. S. and Lee, J. D. (2018). On the power of over- to solve. In order to evaluate the impact of the as- parametrization in neural networks with quadratic sumptions required for our analysis, we also performed activation. arXiv preprint arXiv:1803.01206. experiments on real-world datasets that agree with the results of the theoretical analysis to a surprisingly large Du, S. S., Lee, J. D., Tian, Y., Poczos, B., and Singh, A. extent. (2017). Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. We consider this work as a first step for two partic- arXiv preprint arXiv:1712.00779. ular directions of future research. First, it raises the question of how to optimally train Batch Normalized D’yakonov, E. G. and McCormick, S. (1995). Opti- neural networks. Particularly, our results suggest that mization in solving elliptic problems. CRC Press. different and adaptive stepsize schemes for the two parameters - length and direction - can lead to signif- Erdogdu, M. A., Dicker, L. H., and Bayati, M. (2016). icant accelerations. Second, the analysis of Section 3 Scaled least squares estimator for glms in large- and 4 reveals that a better understanding of non-linear scale problems. In Advances in Neural Information coordinate transformations is a promising direction for Processing Systems, pages 3324–3332. the continuous optimization community. Gitman, I. and Ginsburg, B. (2017). Comparison of batch normalization and weight normalization al- References gorithms for the large-scale image classification. arXiv preprint arXiv:1709.08145. Absil, P.-A., Mahony, R., and Sepulchre, R. (2009). Op- timization algorithms on matrix manifolds. Prince- Guruswami, V. and Raghavendra, P. (2009). Hardness ton University Press. of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765. Argentati, M. E., Knyazev, A. V., Neymeyr, K., Ovtchinnikov, E. E., and Zhou, M. (2017). Conver- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep gence theory for preconditioned eigenvalue solvers residual learning for image recognition. In Proceed- in a nutshell. Foundations of Computational Math- ings of the IEEE conference on computer vision ematics, 17(3):713–727. and pattern recognition, pages 770–778. Arpit, D., Zhou, Y., Kota, B. U., and Govindaraju, V. Ioffe, S. and Szegedy, C. (2015). Batch normalization: (2016). Normalization propagation: A parametric Accelerating deep network training by reducing technique for removing internal covariate shift in internal covariate shift. In Proceedings of the 32Nd deep networks. arXiv preprint arXiv:1603.01431. International Conference on International Confer- Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer ence on Machine Learning - Volume 37, ICML’15, normalization. arXiv preprint arXiv:1607.06450. pages 448–456. JMLR.org. Brillinger, D. R. (2012). A generalized linear Jin, C., Netrapalli, P., and Jordan, M. I. (2017). Ac- model with gaussian regressor variables. In Se- celerated gradient descent escapes saddle points lected Works of David Brillinger, pages 589–606. faster than gradient descent. arXiv preprint Springer. arXiv:1711.10456. Exponential convergence rates for Batch Normalization
Klambauer, G., Unterthiner, T., Mayr, A., and Hochre- Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. iter, S. (2017). Self-normalizing neural networks. (2018). How does batch normalization help op- In Advances in Neural Information Processing Sys- timization?(no, it is not about internal covariate tems, pages 972–981. shift). arXiv preprint arXiv:1805.11604.
Knyazev, A. and Shorokhodov, A. (1991). On exact Soltanolkotabi, M., Javanmard, A., and Lee, J. D. estimates of the convergence rate of the steepest as- (2017). Theoretical insights into the optimization cent method in the symmetric eigenvalue problem. landscape of over-parameterized shallow neural Linear algebra and its applications, 154:245–257. networks. arXiv preprint arXiv:1707.04926.
Knyazev, A. V. (1998). Preconditioned eigensolversan Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. oxymoron. Electron. Trans. Numer. Anal, 7:104– (2017). Inception-v4, inception-resnet and the im- 123. pact of residual connections on learning. In AAAI, volume 4, page 12. Knyazev, A. V. and Neymeyr, K. (2003). A geometric Zhang, Y., Lee, J. D., Wainwright, M. J., and Jordan, theory for preconditioned inverse iteration iii: A M. I. (2015). Learning halfspaces and neural net- short and sharp convergence estimate for general- works with random initialization. arXiv preprint ized eigenvalue problems. Linear Algebra and its arXiv:1511.07948. Applications, 358(1-3):95–114.
Krizhevsky, A. and Hinton, G. (2009). Learning multi- ple layers of features from tiny images.
Landsman, Z. and Nevslehov´a,J. (2008). Stein’s lemma for elliptical random vectors. Journal of Multivari- ate Analysis, 99(5):912–927.
Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Sys- tems, pages 597–607.
Lipton, Z. C. and Steinhardt, J. (2018). Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341.
Martens, J., Sutskever, I., and Swersky, K. (2012). Esti- mating the hessian by back-propagating curvature. arXiv preprint arXiv:1206.6464.
Mikhalevich, V., Redkovskii, N., and Antonyuk, A. (1988). Minimization methods for smooth noncon- vex functions. Cybernetics, 24(4):395–403.
Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch. In NIPS-W.
Salimans, T. and Kingma, D. P. (2016). Weight normal- ization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909. Jonas Kohler*, Hadi Daneshmand*, Aurelien Lucchi, Thomas Hofmann
Appendix
A LEAST SQUARES ANALYSIS
A.1 Restated result
Recall that, after normalizing according to (9) and using the closed form solution for the optimal scaling ∗ > factor g := − u w /kwkS, optimizing the ordinary least squares objective can be written as the following minimization problem
w>uu>w min ρ(w) := − . (13 revisited) d > w∈R w Sw We consider optimizing the above objective by Gd which takes iterative steps of the form
wt+1 = wt − ηt∇ρ(wt), (24) where > ∇ρ(wt) = −2 (Bwt + ρ(wt)Swt) /wt Swt. (25) Furthermore, we choose > wt Swt ηt = . (26) 2L|ρ(wt)|
Our analysis relies on the (weak) data distribution assumption stated in A1. It is noteworthy that ρ(w) ≤ 0, ∀w ∈ Rd \{0} holds under this assumption. In the next theorem, we establish a convergence rate of Gd in terms of function value as well as gradient norm.
Theorem 1. [Convergence rate on least squares] Suppose that the (weak) Assumption 1 on the data distribu- > + tion holds. Consider the Gd iterates {wt}t∈N given in Eq. (14) with the stepsize ηt = wt Swt/(2L|ρ(wt)|) and starting from ρ(w0) 6= 0. Then, µ 2t ∆ρ ≤ 1 − ∆ρ , (15) t L 0
∗ −1 where ∆ρt := ρ(wt) − ρ(w ). Furthermore, the S -norm of the gradient ∇ρ(wt) relates to the suboptimality as
2 2 kwtkSk∇ρ(wt)kS−1 /|4ρ(wt)| = ∆ρt. (16)
The proof of this result crucially relies on the insight that the minimization problem given in (12) resembles the problem of maximizing the generalized Rayleigh quotient which is commonly encountered in generalized eigenproblems. We will thus first review this area, where convergence rates are usually provided in terms of the angle of the current iterate with the maximizer, which is the principal eigenvector. Interestingly, this angle can be related to both, the current function value as well as the the norm of the current gradient. We will make use of these connections to prove the above Theorem in Section A.5. Although not necessarily needed for convex function, we introduce the gradient norm relation as we will later go on to prove a similar result for possibly non-convex functions in the learning halfspace setting (Theorem 4).
A.2 Background on eigenvalue problems
A.2.1 Rayleigh quotient
Optimizing the Rayleigh quotient is a classical non-convex optimization problem that is often encountered in eigenvector problems. Let B ∈ Rd×d be a symmetric matrix, then w1 ∈ Rd is the principal eigenvector of B if it Exponential convergence rates for Batch Normalization maximizes w>Bw q(w) = (27) w>w and q(w) is called the Rayleigh quotient. Notably, this quotient satisfies the so-called Rayleigh inequality
λmin(B) ≤ q(w) ≤ λ1(B), ∀w ∈ R \{0}, where λmin(B) and λ1(B) are the smallest and largest eigenvalue of B respectively.
Maximizing q(w) is a non-convex (strict-saddle) optimization problem, where the i-th critical point vi constitutes the i-th eigenvector with corresponding eigenvalue λi = q(wi) (see (Absil et al., 2009), Section 4.6.2 for details). It is known that optimizing q(w) with Gd - using an iteration-dependent stepsize - converges linearly to the principal eigenvector v1. The convergence analysis is based on the ”minidimensional” method and yields the following result 2t λ1 − q(wt) λ1 − λ2 λ1 − q(w0) 2 ≤ 1 − 2 (28) cos ∠(wt, v1) λ1 − λmin cos ∠(w0, v1) under weak assumptions on w0. Details as well as the proof of this result can be found in (Knyazev and Shorokhodov, 1991).
A.2.2 Generalized rayleigh quotient
The reparametrized least squares objective (13), however, is not exactly equivalent to (27) because of the covariance matrix that appears in the denominator. As a matter of fact, our objective is a special instance of the generalized Rayleigh quotient w>Bw ρ˜(w) = , (29) w>Aw where B is defined as above and A ∈ Rd×d is a symmetric positive definite matrix. Maximizing (29) is a generalized eigenproblem in the sense that it solves the task of finding eigenvalues λ of the matrix pencil (B, A) for which det(B − λA) = 0, i.e. finding a vector v that obeys Bv = λAv. Again we have
λmin(B, A) ≤ ρ˜(w) ≤ λ1(B, A), ∀w ∈ R \{0}.
Among the rich literature on solving generalized symmetric eigenproblems, a Gd convergence rate similar to (28) has been established in Theorem 6 of (Knyazev and Neymeyr, 2003), which yields
λ − ρ(w ) λ − λ 2t λ − ρ(w ) 1 t+1 ≤ 1 − 1 2 1 t , ρ(wt+1) − λ2 λ1 − λmin ρ(wt) − λ2 again under weak assumptions on w0.
A.2.3 Our contribution
Since our setting in Eq. (13) is a minimization task, we note that for our specific choice of A and B we have −ρ˜(w) = ρ(w) and we recall the general result that then
min ρ(w) = − max(˜ρ(w)).
More importantly, we here have a special case where the nominator of ρ(w) has a particular low rank structure. In fact, B := uu> is a rank one matrix. Instead of directly invoking the convergence rate in (Knyazev and Neymeyr, 2003), this allows for a much simpler analysis of the convergence rate of Gd on ρ(w) since the rank one property yields a simpler representation of the relevant vectors. Furthermore, we establish a connection between suboptimality on function value and the S−1-norm of the gradient. As mentioned earlier, we need such a guarantee in our future analysis on learning halfspaces which is an instance of a (possibly) non-convex optimization problem. Jonas Kohler*, Hadi Daneshmand*, Aurelien Lucchi, Thomas Hofmann
A.3 Preliminaries
Notations Let A be a symmetric positive definite matrix. We introduce the following compact notations that will be used throughout the analysis.
d > • A-inner product of w, v ∈ R : hw, viA = w Av.
d > 1/2 • A-norm of w ∈ R : kwkA = w Aw . • A-angle between two vectors w, v ∈ Rdn{0}: hw, viA ∠A (w, v) := arccos (30) kwkAkvkA
• A-orthogonal projection wˆ of w to span{v} for w, v ∈ Rdn{0}:
wˆ ∈ span{v} with hw − wˆ , viA = 0 (31)
• A-spectral norm of a matrix C ∈ Rd×d:
kCwkA kCkA := max (32) d w∈R /{0} kwkA
These notations allow us to make the analysis similar to the simple Rayleigh quotient case. For example, the 2 denominator in (29) can now be written as kwkA. Properties We will use the following elementary properties of the induced terms defined above.
2 2 (P.1) sin ∠A(w, v) = 1 − cos ∠A(w, v) (P.2) If wˆ is the A-orthogonal projection of w to span{v}, then it holds that
2 2 kwˆ kA kw − wˆ kA cos ∠A(w, v) = 2 , sin ∠Awv = (33) kwkA kwkA
(P.3) The A-spectral norm of a matrix can be written in the alternative form
1/2 kA Cwk2 kCkA = max (34) d 1/2 w∈R \{0} kA wk2 k A1/2CA−1/2 A1/2wk = max 2 (35) d 1/2 w∈R \{0} kA wk2 1/2 −1/2 = kA CA k2 (36)
(P.4) Let C ∈ Rd×d be a square matrix and w ∈ Rd, then the following result holds due to the definition of A-spectral norm
kBwkA ≤ kBkAkwkA (37)
A.4 Characterization of the LS minimizer
By setting the gradient of (11) to zero and recalling the convexity of fols we immediately see that the minimizer of this objective is w˜ ∗ := −S−1u. (38) Indeed, one can easily verify that w˜ ∗ is also an eigenvector of the matrix pair (B, S) since
∗ > −1 2 Bw˜ = uu (−S u) = −kuk −1 u S (39) 2 −1 ∗ =(kukS−1 )S(−S u) = λ1Sw˜ Exponential convergence rates for Batch Normalization
2 where λ1 := kukS−1 is the corresponding generalized eigenvalue. The associated eigenvector with λ1 is ∗ v1 := w˜ /kukS−1 . (40)
d Thereby we extend the normalized eigenvector to an S-orthogonal basis (v1, v2,..., vd) of R such that ( 0, i 6= j hvi, vjiS = (41) 1, i = j
holds for all i, j. Let V2 := [v2, v3,..., vd] be the matrix whose (i − 1)-th column is vi, i ∈ {2, . . . , d}. The matrix B is orthogonal to the matrix V2 since
> > −1 BV2 = uu V2 = uu S SV2 (40) −1 > (42) = kS ukSu −v1 SV2 , | {z } 0 which is a zero matrix due to the A-orthogonality of the basis (see Eq. (41)). As a result, the columns of V2 are d eigenvectors associated with a zero eigenvalue. Since v1 and V2 form an S-orthonormal basis of R no further ∗ eigenvalues exist. We can conclude that any vector w ∈ span{v1} is a minimizer of the reparametrized ordinary least squares problem as presented in (13) and the minimum value of ρ relates to the eigenvalue as
λ1 = − min ρ(w). w
Spectral representation of suboptimality Our convergence analysis is based on the angle between the d current iterate wt and the leading eigenvector v1, for which we recall property (P.1). We can express w ∈ R in the S-orthogonal basis that we defined above:
d−1 w = α1v1 + V2α2, α1 ∈ R, α2 ∈ R (43) and since α1v1 is the S-orthogonal projection of w to span{v1}, the result of (P.2) implies
2 2 2 kα1v1kS α1 cos ∠S(w, v1) = 2 = 2 . (44) kwkS kwkS
Clearly this metric is zero for the optimal solution v1 and else bounded by one from above. To justify it is a proper choice, the next proposition proves that suboptimality on ρ, i.e. ρ(w) − ρ(v1), relates directly to this angle. 2 Proposition 1. The suboptimality of w on ρ(w) relates to sin ∠S(w, v1) as 2 ρ(w) − ρ(v1) = λ1 sin ∠S(w, v1), (45) where ρ(v1) = λ1. This is equivalent to
2 ρ(w) = −λ1 cos ∠S(w, v1). (46)
Proof. We use the proposed eigenexpansion of Eq. (43) to rewrite
(42) (39) Bw = (α1Bv1 + BV2α2) = α1Bv1 = α1λ1Sv1 (47) and replace the above result into ρ(w). Then
> > w Bw (47) (α1v1 + V2α2) Sv1 ρ(w) = − > = −α1λ1 2 w Sw kwkS 2 (41) α1 (44) 2 = −λ1 2 = −λ1 cos ∠S(w, v1), kwkS which proves the second part of the proposition. The first follows directly from property (P.1). Jonas Kohler*, Hadi Daneshmand*, Aurelien Lucchi, Thomas Hofmann
Gradient-suboptimality connection Fermat’s first-order optimality condition implies that the gradient is zero at the minimizer of ρ(w). Considering the structure of ρ(w), we propose a precise connection between the norm of gradient and suboptimality. Our analysis relies on the representation of the gradient ∇ρ(w) in the S-orthonormal basis {v1,..., vd} which is described in the next proposition. Proposition 2. Using the S-orthogonal basis as given in Eq. (41), the gradient vector can be expanded as
2 2 kwkS∇ρ(w)/2 = − λ1α1 sin ∠S(w, v1)Sv1 2 (48) + λ1 cos ∠S(w, v1)SV2α2
Proof. The above derivation is based on two results: (i) v1 is an eigenvector of (B, S) and (ii) the representation of ρ(w) in Proposition 1. We recall the definition of ∇ρ(w) in (25) and write
2 ∇ρ(w)kwkS/2 = −ρ(w)Sw − Bw (47) = −ρ(w)S (α1v1 + V2α2) + λ1α1Sv1
(46) 2 = −(1 − cos ∠S(w, v1))λ1α1Sv1 (49) 2 +λ1 cos ∠S(w, v1)SV2α2 2 = −λ1α1 sin ∠S(w, v1)Sv1 2 +λ1 cos ∠S(w, v1)SV2α2
Exploiting the gradient representation of the last proposition, the next proposition establishes the connection −1 between suboptimality and the S -norm of gradient ∇wρ(w). Proposition 3. Suppose that ρ(w) 6= 0, then the S−1-norm of the gradient ∇ρ(w) relates to the suboptimality as
2 2 kwkSk∇ρ(w)kS−1 /(4|ρ(w)|) = ρ(w) − ρ(v1) (50)
Proof. Multiplying the gradient representation in Proposition 2 by S−1 yields
−1 2 2 S ∇ρ(w)kwkS/2 = − λ1α1 sin ∠S(w, v1)v1 2 + λ1 cos ∠S(w, v1)V2α2. −1 By combining the above result with the S-orthogonality of the basis (v1, V2), we derive the (squared) S -norm of the gradient as > −1 (41) ∇ρ(w) S ∇ρ(w)/4 = T1 + T2, −4 2 2 4 (51) T1 := kwkS λ1α1 sin ∠S(w, v1), −4 2 4 2 T2 := kwkS λ1 cos ∠S(w, v1)kα2k .
It remains to simplify the terms T1 and T2. For T1, 2 kwk (44) S T = kwk−2α2 = cos2 (w, v ). 2 4 1 S 1 ∠S 1 λ1 sin ∠S(w, v1)
Similarly, we simplify T2: 2 kwkS −2 2 2 4 T2 = kwkS kα2k λ1 cos ∠S(w, v1) −2 2 2 = kwkS kwkS − α1 2 α1 = 1 − 2 kwkS (44) 2 = 1 − cos ∠S(w, v1) 2 = sin ∠S(w, v1) Exponential convergence rates for Batch Normalization
Replacing the simplified expression of T1 and T2 into Eq. (51) yields
2 2 2 2 2 2 2 kwkSk∇wρ(w)kS−1 /4 =λ1 cos ∠S(w, v1) sin ∠S(w, v1) sin ∠S(w, v1) + cos ∠S(w, v1) 2 2 2 =λ1 cos ∠S(w, v1) sin ∠S(w, v1), (46) 2 = |ρ(w)|λ1k sin ∠S(w, v1) (45) = |ρ(w)| (ρ(w) − ρ(v1)) .
A rearrangement of terms in the above equation concludes the proof.
A.5 Convergence proof
2 We have seen: suboptimality in ρ(w) directly relates to sin ∠S(w, v1) for all w \{0}. In the next lemma we prove that this quantity is strictly decreased by repeated Gd updates at a linear rate. 2 kwtkS Lemma 3. Suppose that Assumption 1 holds and consider Gd (Gd) steps on (13) with stepsize ηt = . 2L|ρ(wt)| d Then, for any w0 ∈ R such that ρ(w0) < 0, the updates of Eq. (24) yield the following linear convergence rate
µ 2t sin2 (w , v ) ≤ 1 − sin2 (w , v ) (52) ∠S t 1 L ∠S 0 1
Proof. To prove the above statement, we relate the sine of the angle of a given iterate wt+1 with v1 in terms of the previous angle ∠(wt, v1). Towards this end, we assume for the moment that ρ(wt) 6= 0 but note that this 6 naturally always holds whenever ρ(w0) 6= 0, such that the angle relation can be recursively applied through all t ≥ 0 to yield Eq. (52).
(i) We start by deriving an expression for sin ∠S(v1, wt). By (31) and the definition ρ(wt), we have that −ρ(wt)wt −1 is the S-orthogonal projection of S Bwt to span{wt}. Indeed,
−1 hS Bwt + ρ(wt)wt, wtiS > > −1 wt Bwt > =wt BS Swt − > wt Swt = 0. wt Swt
−1 −1 > −1 Note that S Bwt = S u u wt is a nonzero multiple of v1 and thus sin ∠S(S Bwt, wt) = sin ∠S(v1, wt). Thus, by (P.2) we have −1 sin ∠S(v1, wt) = sin ∠S(S Bwt, wt) −1 kS Bwt + ρ(wt)wtkS (53) = −1 . kS BwtkS
d (ii) We now derive an expression for sin ∠S(v1, wt+1). Let at+1 ∈ R such that at+1wt+1 ∈ R is the S-orthogonal −1 projection of S Bwt to span{wt+1}. Then
−1 hS Bwt − at+1wt+1, wt+1iS = 0 −1 (54) ⇒ hS Bwt − at+1wt+1, (at+1 + ρ(wt))wt+1iS = 0.
By the Pythagorean theorem and (54), we get
−1 2 −1 2 kS Bwt + ρ(wt)wt+1kS = kS Bwt − at+1wt+1kS 2 + k(at+1 + ρ(wt))wt+1kS (55) −1 2 ≥ kS Bwt − at+1wt+1kS.
6as we will show later by induction. Jonas Kohler*, Hadi Daneshmand*, Aurelien Lucchi, Thomas Hofmann
Hence, again by (P.2) −1 sin ∠S(v1, wt+1) = sin ∠S(S Bwt, wt+1) −1 kS Bwt − at+1wt+1kS = −1 kS BwtkS (56) (55) −1 kS Bwt + ρ(wt)wt+1kS ≤ −1 . kS BwtkS
(iii) To see how the two quantities on the right hand side of (53) and (56) relate, let us rewrite the Gd updates from Eq. (24) as follows
S ρ(w )w = ρ(w )w − S−1Bw + ρ(w )w t t+1 t t L t t t S ⇔ S−1Bw + ρ(w )w = S−1Bw + ρ(w )w − S−1Bw + ρ(w )w (57) t t t+1 t t t L t t t S ⇔ S−1Bw + ρ(w )w = I − S−1Bw + ρ(w )w . t t t+1 L t t t
By taking the S-norm we can conclude
−1 kS Bwt + ρ(wt)wt+1kS (57) S −1 ≤ kI − kS · kS Bwt + ρ(wt)wtkS (58) L µ ≤ 1 − kS−1Bw + ρ(w )w k , L t t t S where the first inequality is due to property (P.4) of the S-spectral norm and the second is due to Assumption (1) and (P.3) , which allows us to bound the latter in term of the usual spectral norm as follows
(P.3) 1/2 −1/2 kI − S/LkS = kS (I − S/L) S k2 (59) (2) =kI − S/Lk2 ≤ 1 − µ/L.
(iv) Combining the above results yields the desired bound
(56) −1 kS Bwt + ρ(wt)wt+1kS sin ∠S(v1, wt+1) ≤ −1 kS BwtkS (58) −1 µ kS Bwt + ρ(wt)wtkS (60) ≤ 1 − −1 L kS BwtkS (53) µ = 1 − sin (v , w ). L ∠S 1 t
(v) Finally, we show that the initially made assumption ρ(wt) 6= 0 is naturally satisfied in all iterations. First, ˆ + ρ(w0) < 0 by assumption. Second, assuming ρ(wtˆ) < 0 for an arbitrary t ∈ N gives that the above analysis (i-iv) holds for tˆ+ 1 and thus (60) together with (45) and the fact that λ1 > 0 give
2 ρ(wtˆ+1) = −λ1(1 − sin (v1, wtˆ+1)) 2 < −λ1(1 − sin (v1, wtˆ)) = ρ(wtˆ) < 0, where the last inequality is our induction hypothesis. Thus ρ(wtˆ+1) < 0 and we can conclude by induction that + ρ(wt) < 0, ∀t ∈ N . As a result, (60) holds ∀t ∈ N+, which (applied recursively) proves the statement (52). Exponential convergence rates for Batch Normalization
We are now ready to prove Theorem 1. Proof of Theorem (1): By combining the results of Lemma 3 as well as Proposition 1 and 3, we can complete the proof of the Theorem 1 as follows
2 2 kwtkSk∇wρ(wt)kS−1 /|4ρ(wt)| (50) ∗ = (ρ(wt) − ρ(w ))
(45) 2 = λ1 sin ∠S(wt, v1) (52) µ 2t ≤ 1 − λ sin2 (w , v ) L 1 ∠S 0 1 (45) µ 2t = 1 − (ρ(w ) − ρ(w∗)) . L 0
B LEARNING HALFSPACES ANALYSIS
In this section, we provide a convergence analysis for Algorithm 1 on the problem of learning halfspaces