Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks Using PAC-Bayesian Analysis

Yusuke Tsuzuku 1 2 *   Issei Sato 1 2   Masashi Sugiyama 2 1

1 The University of Tokyo, Tokyo, Japan.   2 RIKEN AIP, Tokyo, Japan.
* The author is currently working at Preferred Networks, Inc. Correspondence to: Yusuke Tsuzuku.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

The notion of flat minima has gained attention as a key metric of the generalization ability of deep learning models. However, current definitions of flatness are known to be sensitive to parameter rescaling. While some previous studies have proposed to rescale flatness metrics using parameter scales to avoid the scale dependence, the normalized metrics lose the direct theoretical connections between flat minima and generalization. We first provide generalization error bounds using existing normalized flatness measures for smooth and stochastic networks using second-order approximation. Using the analysis, we then propose a novel normalized flatness metric. The proposed metric enjoys both direct theoretical connections and better empirical correlation to generalization error.

1. Introduction

Deep learning methods have achieved significant performance improvement in many domains, such as computer vision, language processing, and speech processing (Krizhevsky et al., 2012; Devlin et al., 2019; van den Oord et al., 2018). However, we are still on the way to understanding when they perform well on unseen data. A better understanding of the generalization performance of deep learning methods would help improve their performance. The deep learning community has made tremendous effort to understand generalization of neural networks, both theoretically and empirically (Zhang et al., 2017; Neyshabur et al., 2017; Arora et al., 2018).

The notion of "flat minima" has gained attention as a possible explanation for the generalization ability of deep learning methods (Hochreiter & Schmidhuber, 1997). Many empirical studies have supported the usefulness of this notion. For example, the notion shed light on the reason for larger generalization gaps in large-batch training (Keskar et al., 2017; Yao et al., 2018), and also inspired various training methods (Hochreiter & Schmidhuber, 1997; Chaudhari et al., 2017; Hoffer et al., 2017). Notably, Jiang et al. (2020) suggested that PAC-Bayes based generalization metrics, which have deep connections with flatness, are effective and promising among many other generalization metrics. As measures of flatness, previous studies proposed the volume of the region in which a network keeps roughly the same loss value (Hochreiter & Schmidhuber, 1997), the maximum loss value around minima (Keskar et al., 2017), and the spectral norm of the Hessian (Yao et al., 2018).

Despite the empirical connections of "flatness" to generalization, current definitions of flatness suffer from unnecessary scale dependence. Dinh et al. (2017) showed that we can arbitrarily change the flatness of the loss-landscape for some networks without changing the functions represented by the networks. Such scale dependence appears in networks with ReLU activation functions or normalization layers such as batch-normalization (Ioffe & Szegedy, 2015) and weight-normalization (Salimans & Kingma, 2016). A major counterargument to the scale dependence is that flatness provides valid generalization error bounds when parameter scales are taken into account (Dziugaite & Roy, 2017; Neyshabur et al., 2017). However, both the flatness and parameter scales have weak correlations with the generalization error in practical settings (Sec. 8).

Previous studies have tried to make flatness metrics invariant with respect to parameter scaling. Li et al. (2018) successfully visualized connections between empirical performances and flatness hypotheses by scaling the loss-landscape using parameter scales. The success suggests a connection between the normalized loss-landscape and generalization and motivates us to provide their theoretical interpretations. Neyshabur et al. (2017), Achille & Soatto (2018), and Liang et al. (2019) also suggested scaling flatness metrics by parameter scales. While these metrics are invariant with respect to known scaling issues, they did not provide direct connections between the metrics and generalization error. From the PAC-Bayesian perspective, providing scale-invariant bounds is not straightforward because we need to use the log-uniform distribution as a prior so that the Kullback-Leibler divergence term in PAC-Bayes bounds becomes scale-invariant (Kingma et al., 2015), while the use of the prior invalidates the PAC-Bayesian analysis.

On the way to understanding generalization of deep learning and its connections to a normalized loss-landscape, this paper makes two contributions.

• We provide PAC-Bayesian generalization error bounds using scale-invariant loss curvature metrics.
• We propose a novel normalized loss curvature metric that has a close connection to the bounds.

This paper is organized as follows. We start our discussion with the PAC-Bayesian generalization error bound and its connection to flat minima (Sec. 3). Next, we connect a flatness metric normalized by parameter scales per parameter to the PAC-Bayesian analysis (Sec. 4). Unfortunately, per-parameter normalization yields a constant term proportional to the number of parameters in the generalization error bounds, which makes them vacuous for deep learning methods. Thus, we re-analyze the known scale dependence issue (Sec. 5). We identify that we do not need to scale the loss-landscape per parameter; we only need to scale it per node. Then, applying the analysis, we improve the method of loss-landscape normalization (Sec. 6). Using the novel normalized flatness definition, we provide generalization error bounds that do not have constant terms that scale with the number of parameters. Empirically, the proposed metric can predict the generalization performance of models more accurately than current unnormalized metrics (Sec. 8).

Remark: We provide the generalization bounds for stochastic networks. Most of them rely on the quadratic approximation of the loss, but not all of them [1].

2. Related work

The notion of flat minima has its origin in the minimum description length principle (MDL) (Hinton & van Camp, 1993; Rissanen, 1986; Honkela & Valpola, 2004). Dziugaite & Roy (2017) connected flat minima to PAC-Bayesian arguments, which is a generalization of the MDL. They pointed out that the sharpness of local minima is not sufficient for measuring generalization, and that parameter scales need to be taken into account. PAC-Bayesian analysis has also been used for generalization analysis outside the context of flat minima (Neyshabur et al., 2018; Letarte et al., 2019). Neyshabur et al. (2018) analyzed the propagation of parameters' perturbations to bound the generalization error. Given the existence of adversarial examples (Szegedy et al., 2014), however, this approach inevitably provides vacuous bounds. Instead, in this paper, we stick to flat-minima arguments, which will better capture the effect of parameter perturbations.

Since Dinh et al. (2017) pointed out the scale dependences of flat minima, many studies have attempted to resolve the issue. A common approach is scaling sharpness metrics by parameter scales. Neyshabur et al. (2017) pointed out that the scale dependence can be removed by balancing parameter scales and sharpness metrics. However, their proposal requires using data-dependent priors (Guedj, 2019; Parrado-Hernández et al., 2012; Dziugaite & Roy, 2018) or equivalent alternatives, which adds non-trivial costs to the generalization bounds. In this paper, we make the costs explicit in Sec. 3 and later reduce them in Sec. 6. Prior to Neyshabur et al. (2017), Dziugaite & Roy (2017) tried to achieve balance using numerical optimization. While they partially mitigated the scale dependence, they could not completely overcome it because the prior variance was tied to the whole network. Wang et al. (2018) explored a better choice of posterior variance. However, not only could their analysis not remove the scale dependence, it was based on a parameter-wise argument, which involved a factor that scales with the number of parameters, making their bound as vacuous as naive parameter counting [2]. Our analysis overcomes both problems.

Sharpness metric normalization has also appeared outside the PAC-Bayesian context. Li et al. (2018) proposed rescaling the loss-landscape by the filters' scales of convolutional layers. Even though they provided useful visualizations and insights, it is unclear how the scaling relates to generalization. Our work bridges the gap between the empirical success and theoretical understandings. Achille & Soatto (2018) proposed a notion of information in weights, which has connections to flat minima and PAC-Bayes. However, their arguments were based on an improper log-uniform prior and did not provide valid generalization error bounds. Even if we avoid the log-uniform prior by using a method described in Neklyudov et al. (2017), the notion does not provide non-vacuous bounds, as we show in Sec. 4. Liang et al. (2019) proposed the Fisher-Rao norm, which is a metric invariant with respect to known scaling issues. While the Fisher-Rao norm has interesting functional equivalence, in addition to invariance, it has been hard to directly connect the metric to generalization.

[1] Sec. 6 briefly explains that the stochastic network becomes close to a deterministic network as the sample size increases.
[2] More comparison with Wang et al. (2018) is available in Appendix L in the supplementary material.

We propose a novel scale-invariant sharpness metric for loss-landscapes. Our definition distinguishes itself by its direct connection to PAC-Bayesian generalization error bounds. While previous methods for normalization have been based on per-parameter normalization, ours is based on per-node normalization. This difference is critical from the viewpoint of our PAC-Bayesian analysis in terms of the tightness of the bounds.

3. Flat minima from PAC-Bayesian perspective

This section reviews a fundamental PAC-Bayesian generalization error bound and its connection to flat minima provided by prior work. Table 1 in Appendix A in the supplementary material summarizes the notations used in this paper.

3.1. PAC-Bayesian generalization error bound

The following is one of the most basic PAC-Bayesian generalization error bounds (Germain et al., 2016; Alquier et al., 2016). Let D be a distribution over an input-output space Z, S be a set of m samples drawn i.i.d. from D, H be a set of parameters θ, ℓ : Z × H → [0, 1] be a loss function, P be a distribution over H independent of S, and Q be a distribution over H. For any δ ∈ (0, 1], any distribution Q, and any nonnegative real number λ, with probability at least 1 − δ,

    L_D(Q(θ | θ̄)) ≤ L_S(Q(θ | θ̄)) + λ/(2m) + (1/λ) ln(1/δ) + (1/λ) KL[Q(θ | θ̄) ‖ P(θ)],    (1)

where

    L_D(Q(θ | θ̄)) := E_{z∼D, θ∼Q(θ|θ̄)}[ℓ(z, θ)],    (2)
    L_S(Q(θ | θ̄)) := E_{z∼S, θ∼Q(θ|θ̄)}[ℓ(z, θ)].    (3)

We reorganize the PAC-Bayesian bound (1) for later use as

    L_D(Q(θ | θ̄)) ≤ L_S(θ̄) + λ/(2m) + (1/λ) ln(1/δ) + [ L_S(Q(θ | θ̄)) − L_S(θ̄) ]_(A) + (1/λ) [ KL[Q(θ | θ̄) ‖ P(θ)] ]_(B),    (4)

where the bracketed terms are referred to as term (A) and term (B), respectively. Similar decompositions can be found in prior work (Dziugaite & Roy, 2017; Neyshabur et al., 2017; 2018). We use a different PAC-Bayes bound (1) for later analysis, but they are essentially the same [3]. Flat minima, which are the noise-stable solutions with respect to parameters, naturally correspond to term (A) in Eq. (4). Following prior work (Langford & Caruana, 2002; Hochreiter & Schmidhuber, 1997; Dziugaite & Roy, 2017; Arora et al., 2018), we analyze the true error of the stochastic classifier Q in the following sections. Nagarajan & Kolter (2019) presented a method to generalize PAC-Bayes bounds to deterministic classifiers. We will leave combining our work and theirs as future work.

3.2. Effect of noise under second-order approximation

To connect PAC-Bayesian analysis with the Hessian of the loss-landscape as in prior work (Keskar et al., 2017; Dinh et al., 2017; Yao et al., 2018), we consider the second-order approximation of some surrogate loss functions. We use a Gaussian with a covariance matrix σ²I as the posterior of parameters [4]. Then term (A) in the PAC-Bayesian bound (4) can be calculated as

    L_S(Q(θ | θ̄)) − L_S(θ̄) = E_{ε∼N(0, σ²I)}[L_S(θ̄ + ε)] − L_S(θ̄)    (5)
                             ≈ (1/2) Tr(∇²_θ L_S |_θ̄) σ².    (6)

Thus, we can approximate term (A) by the trace of the Hessian and use it as a generalization metric. The effect of the approximation is further discussed in Appendix H in the supplementary material. Note that connecting PAC-Bayes with the Hessian appears in the literature repeatedly (Dziugaite & Roy, 2017; Wang et al., 2018). By tuning σ with an appropriate prior, we can balance terms (A) and (B) (Dziugaite & Roy, 2017; Neyshabur et al., 2017). However, appropriate methods to balance them have not been extensively studied. Moreover, while some prior work proposed scaling σ by parameter scales, the operation has not been justified from the PAC-Bayesian viewpoint. We address these issues in the next section.

[3] To apply our analysis, we can also use other PAC-Bayesian bounds, such as Theorem 1.2.6 in Catoni (2007), which is known to be relatively tight in some cases and successfully provided empirically nontrivial bounds for ImageNet-scale networks in Zhou et al. (2019).
[4] Remark: Eq. (1) holds with not only Gaussian posteriors but also arbitrary posteriors.
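The relation between the expected sharpness in Eq. (5) and the Hessian trace in Eq. (6) can be checked numerically. The following is a minimal sketch on a smooth toy loss (the loss, dimensions, and sample counts are assumptions of this illustration, not the surrogate losses used in the paper): it compares a Monte Carlo estimate of term (A) with (1/2)·Tr(Hessian)·σ².

```python
# Minimal numerical check of Eqs. (5)-(6): for a smooth loss, the expected
# increase of the loss under Gaussian parameter noise is roughly
# (sigma^2 / 2) * Tr(Hessian). Toy loss only; not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.normal(size=(n, n))
H_quad = A @ A.T / n                  # positive semi-definite quadratic curvature
theta = rng.normal(size=n)

def loss(p):
    # smooth toy loss: quadratic term plus a mild non-quadratic term
    return 0.5 * p @ H_quad @ p + 0.1 * np.sum(np.tanh(p))

def hessian_trace(p, h=1e-4):
    # central finite differences for the diagonal of the Hessian
    tr = 0.0
    for i in range(n):
        e = np.zeros(n); e[i] = h
        tr += (loss(p + e) - 2.0 * loss(p) + loss(p - e)) / h**2
    return tr

sigma = 0.05
samples = np.array([loss(theta + sigma * rng.normal(size=n)) for _ in range(20000)])
term_A = samples.mean() - loss(theta)            # Monte Carlo estimate of Eq. (5)
prediction = 0.5 * hessian_trace(theta) * sigma**2   # second-order value, Eq. (6)
print(f"term (A): {term_A:.5f}   0.5*Tr(H)*sigma^2: {prediction:.5f}")
```

For small σ the two printed values agree closely, which is exactly the regime in which term (A) can be replaced by the trace of the Hessian.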

4. PAC-Bayes and parameter-wise normalization

In this section, we propose a framework that connects parameter-wisely normalized flatness metrics to PAC-Bayesian generalization error analysis. As a first step, we virtually decompose the parameters:

    θ = η μ ⊙ e^{[σ]},    (7)

where ⊙ is the Hadamard product and e^{[·]} is an elementwise exponential. The codomain of the random variables θ, μ, σ is R^n, and a hyperparameter η is a positive real number, η ∈ R_{>0}. Intuitively, this is a scale-sign decomposition of parameters with continuous relaxation. We extend the decomposition to a scale-direction decomposition in Sec. 6. We design a specific prior and posterior on μ and σ to mitigate the scale dependence. We first define our prior design:

    P(μ) = N(μ | 0, I),    (8)
    P(σ) = N(σ | β, η′² I),    (9)

where β ∈ R^n and η′ ∈ R_{>0} are hyperparameters. Next, we define our posterior design:

    Q(μ | μ̄) = N(μ | μ̄, I),    (10)
    Q(σ | σ̄) = N(σ | σ̄, η′² I),    (11)

where μ̄ and σ̄ are real vectors in R^n satisfying

    θ̄ = η μ̄ ⊙ e^{[σ̄]},    (12)

where a real vector θ̄ is a parameter assignment. Note that the choice of μ̄ and σ̄ is not necessarily unique, and we can balance them arbitrarily. Furthermore, the parameter η was introduced to address a technical issue concerning the PAC-Bayesian analysis.

Under the second-order approximation, terms (A) and (B) in the PAC-Bayesian bound (1) can now be approximated as follows [5]:

    (A):  (η²/2) Σ_{i=1}^n (∇²_θ L_S |_θ̄)_{i,i} e^{2σ̄_i} + (η′²/2) Σ_{i=1}^n (∇²_θ L_S |_θ̄)_{i,i} θ̄_i² + E(L_S, Q),    (13)

    (B):  (1/(2λ)) [ (1/η²) ‖θ̄ ⊙ e^{[−σ̄]}‖₂² + (1/η′²) ‖σ̄ − β‖₂² ],    (14)

where E(L_S, Q) is an error term introduced by the second-order approximation. We obtain the following bound by combining (13) and (14), optimizing σ, and taking the union bound with respect to η, η′, and β.

Proposition 4.1. For any networks, any λ ∈ Λ and λ′ ∈ Λ′ where Λ ⊂ R_{>0} and Λ′ ⊂ R_{>0} are finite sets of real numbers independent of S, any 0–1 bounded twice continuously differentiable loss function with respect to parameters, any finite set of real numbers B ⊂ R independent of S, and any parameter assignment θ̄, there exists a stochastic classifier Q(θ | θ̄) such that with probability 1 − δ over the training set and Q, the expected error L_D(Q) is bounded by

    L_S(θ̄) + (λ′/(2√λ)) FR + (1/(2√λ)) NS₀(λ′) + (1/λ) Ξ + λ/(2m) + E(L_S, Q),    (15)

where

    FR = Σ_{i=1}^n (∇²_θ L_S |_θ̄)_{i,i} θ̄_i²,    (16)

    NS₀(λ′) = Σ_{i=1}^n min_{σ̄_i ∈ R, β_i ∈ B} [ (∇²_θ L_S |_θ̄)_{i,i} e^{2σ̄_i} + θ̄_i² / e^{2σ̄_i} + (σ̄_i − β_i)² / λ′ ],    (17)

and

    Ξ = ln(1/δ) + n ln|B| + 2 ln|Λ| + ln|Λ′|.    (18)

The optimal choice of λ is O(m^{1/2}).

FR and NS₀ in the proposition stand for Fisher-Rao and normalized sharpness, respectively, where the latter will be elaborated later and defined in Sec. 7. The convergence rate of the whole bound is O(m^{−1/4}) with the optimal choice of λ. This is not worse than existing bounds such as those of Bartlett et al. (2017), Neyshabur et al. (2018), and Arora et al. (2018). Appendix I in the supplementary material further discusses the convergence rate. The second term, containing FR, coincides with the second-order approximation of the scaled expected sharpness suggested by Neyshabur et al. (2017), the information in weights (Achille & Soatto, 2018), the Gauss-Newton norm (Zhang et al., 2019), and the Fisher-Rao norm under appropriate regularity conditions (Liang et al., 2019). The third term, containing NS₀, also has a connection to normalized loss-landscapes, given by

    NS₁ := (1/2) lim_{λ′→∞} NS₀(λ′) = Σ_{i=1}^n √( (∇²_θ L_S |_θ̄)_{i,i} θ̄_i² ).    (19)

While this equation only holds at the limit, we can match the sides of the equation by modifying the prior and posterior design. We also can remove the second term at the same time. First, we replace B by a uniform distribution on a finite set. The set is carefully designed so that it does not become too large but is sufficiently dense so that the scale dependence can be ignored. Next, we replace both the prior and the posterior of σ with a distribution which takes 1 on a real vector and 0 anywhere else. The replacement corresponds to setting η′ very small in their original prior and posterior. A more concrete explanation is provided in the proof of the next proposition, given in Appendix D.3 in the supplementary material.

Proposition 4.2. For any networks, any λ ∈ Λ where Λ ⊂ R_{>0} is a finite set of real numbers independent of S, any positive number ε ∈ R_{>0}, any 0–1 bounded twice continuously differentiable loss function with respect to parameters, and any parameter assignment θ̄ such that max_i |θ̄_i| ≤ b, if the diagonal elements of the Hessian of the training loss L_S are bounded by M ∈ R_{>0} and nonnegative, there exists a stochastic classifier Q(θ | θ̄) such that with probability 1 − δ over the training set and Q, the expected error L_D(Q) is bounded by [6]

    L_S(θ̄) + ((1 + ε)² / √λ) NS₁ + (1/λ) Ξ₁ + λ/(2m) + E(L_S, Q),    (20)

where

    Ξ₁ = O(n) · C_{ε,δ,|Λ|,M,b,λ}.    (21)

The assumptions that we can bound all elements of θ̄ and the diagonal of the Hessian are reasonable because we typically train neural networks on computers with finite precision. Even though ReLU networks do not satisfy the smoothness assumption, which is necessary because we use the Hessian, the proposition at least addresses the scale dependence due to normalization layers. Section 6 discusses a possible method to address the smoothness and nonnegativity assumptions on the Hessian's diagonal elements. The method is also applicable to the above propositions. The ε that appears in the proposition comes from an approximation error introduced by the discretization trick. Setting ε = 1 suggests that the parameter's existence does not worsen the order of the bound.

5. Finer-grained scale dependence

The generalization error bounds (15) and (20) have constant terms proportional to the number of parameters. This is because specifying scaling parameters per parameter is almost as difficult as directly specifying all parameters. Since deep learning methods typically use overparametrized models, the constant terms make the bounds vacuous. Prior parameter-wisely normalized sharpness metrics also suffer from the problem explicitly or implicitly (Wang et al., 2018; Achille & Soatto, 2018). Fortunately, it turns out that we can control parameter scales by using a fairly small number of scaling parameters. For example, we can guess that parameters belonging to the same weight matrix have similar scales. In this section, we discuss what controls the parameter scales in deep learning models. This discussion is critical for our improved PAC-Bayesian analysis in Sec. 6. It also reveals the scale dependence in existing matrix-norm-based generalization bounds [7].

Here, we analyze the scale dependences appearing in networks with the ReLU activation function. Similar dependences appear in networks with normalization layers, for which we defer the explanation to Appendix C.1 in the supplementary material. We point out that the matrix-wise scale dependence introduced by Dinh et al. (2017) does not fully cover scale dependences [8]. To illustrate the hidden scale dependence, we consider a simple network with a single hidden layer and ReLU activation:

    f_θ(z) = W^{(2)} ( ReLU( W^{(1)} z ) ),    (22)

where the weight matrices W^{(1)} and W^{(2)} are subsets of the parameters θ, and z is an input of the network f_θ. We can scale the i-th column of W^{(2)} by α > 0 and the i-th row of W^{(1)} by 1/α without modifying the function that the network represents [9]. Since we are using the ReLU activation function, which has positive homogeneity, the transformation does not change the represented function. By the transformation, the diagonal elements of the Hessian corresponding to the i-th row of W^{(1)} are scaled by α². The transformation has essentially the same effect as the one proposed by Dinh et al. (2017). The difference is that the above transformation runs node-wise instead of matrix-wise. In terms of weight matrices, the node-wise scale dependence can be translated into a row- and column-wise scale dependence.

[5] See Appendix D.1 in the supplementary material for the derivation.
[6] The exact form of Ξ₁ can be found in Appendix D.3 in the supplementary material.
[7] See Appendix C.2 in the supplementary material for the explanation.
[8] The same scale dependences in ReLU networks were introduced by Neyshabur et al. (2015).
[9] Running examples of the transformation can be found in Appendix B.1 in the supplementary material.
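The node-wise invariance around Eq. (22) is easy to verify numerically. The following is a small illustrative sketch (a toy network of arbitrary size, not the architectures used in Sec. 8): the rescaled network computes exactly the same function, even though any flatness measure attached to the rescaled row of W^{(1)} picks up a factor α².

```python
# Node-wise rescaling of a one-hidden-layer ReLU network (Eq. (22)):
# row i of W1 is scaled by 1/alpha and column i of W2 by alpha.
# The represented function is unchanged; the curvature of the loss with
# respect to the rescaled row changes by alpha^2, as described in Sec. 5.
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hidden, d_out = 5, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))

def forward(W1, W2, z):
    return W2 @ np.maximum(W1 @ z, 0.0)   # W2 ReLU(W1 z)

alpha, i = 10.0, 2
W1s, W2s = W1.copy(), W2.copy()
W1s[i, :] /= alpha                         # shrink incoming weights of node i
W2s[:, i] *= alpha                         # enlarge outgoing weights of node i

z = rng.normal(size=d_in)
print(np.allclose(forward(W1, W2, z), forward(W1s, W2s, z)))   # True
```

Because the two parameter settings represent the same function but have different Hessian diagonals, any sharpness metric that is not normalized node-wise can be made arbitrarily large or small by this transformation.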
6. Improved PAC-Bayesian analysis

In this section, we tighten our PAC-Bayesian bounds in Sec. 4 based on Sec. 5. In Sec. 5, we pointed out the row- and column-wise scale dependencies in modern network architectures. Thus, we at least need to absorb the row and column scales of weight matrices. It seems, however, that we do not need to control all parameter scales separately. Thus, we modify the decomposition per weight matrix in Sec. 4 in the following way:

    W^{(l)} = η Diag(e^{[γ^{(l)}]}) V^{(l)} Diag(e^{[γ′^{(l)}]}),   η ∈ R,    (23)

where Diag(e^{[γ^{(l)}]}) and Diag(e^{[γ′^{(l)}]}) are diagonal matrices whose diagonal elements are e^{[γ^{(l)}]} and e^{[γ′^{(l)}]}, respectively. The codomains of the random variables γ^{(l)}, γ′^{(l)}, and V^{(l)} are R^{h₁^{(l)}}, R^{h₂^{(l)}}, and R^{h₁^{(l)} × h₂^{(l)}}, respectively. A similar decomposition to (23) has been explored for the unit-invariant SVD (Uhlmann, 2018). In that method, weight matrices are normalized using the solution of Program II (Rothblum & Zenios, 1992). Even though the same method is available for removing the scale dependence, we consider the Hessian jointly and use a different scaling control that has some optimality from the PAC-Bayesian perspective. We first define some notations for convenience. Let Ū^{(l)} be a matrix defined as Ū^{(l)} = W̄^{(l)} ⊙ W̄^{(l)}, where W̄^{(l)} is a subset of a parameter assignment θ̄. Let H̄^{(l)} be a matrix such that

    H̄^{(l)}_{i,j} = ∂² L_S(θ) / ∂W^{(l)}_{i,j} ∂W^{(l)}_{i,j} |_{θ = θ̄}.    (24)

By extending Prop. 4.2 to decomposition (23), we have the following proposition.

Proposition 6.1. For any networks with d weight matrices, any λ ∈ Λ where Λ ⊂ R_{>0} is a finite set of real numbers independent of S, any positive number ε ∈ R_{>0}, any positive number R ∈ R_{>0}, any 0–1 bounded twice continuously differentiable loss function with respect to parameters, and a parameter assignment θ̄ such that max_i |θ̄_i| ≤ b, if the diagonal elements of the Hessian of the training loss L_S are bounded by M ∈ R_{>0} and nonnegative, then there exists a stochastic classifier Q(θ | θ̄) such that with probability 1 − δ over the training set and Q, the expected error L_D(Q) is bounded by [10]

    L_S(θ̄) + ((1 + ε)² / √λ) NS₂(R) + (1/λ) Ξ₂ + λ/(2m) + E(L_S, Q),    (25)

where

    NS₂(R) = Σ_{l=1}^d min_{γ̄^{(l)} ∈ [−R,R]^{h₁^{(l)}}} √( e^{[−γ̄^{(l)}]⊤} ( H̄^{(l)} ⊙ Ū^{(l)} ) e^{[γ̄′^{(l)}]} ),    (26)

    Ξ₂ = O(hd) · C_{ε,δ,M,b,R,|Λ|},    (27)

and h = (1/d) max( Σ_{l=1}^d h₁^{(l)}, Σ_{l=1}^d h₂^{(l)} ).

The constant term, Ξ₂, scales with the number of nodes instead of parameters. The reduction of the constant term from Prop. 4.2 makes the bound meaningful in practical settings. For example, ResNet50 has tens of millions of parameters, while it only has tens of thousands of nodes. In classification on ImageNet, which has millions of images, this reduction is critical.

Note that the proof of Prop. 6.1 is constructive. In the construction, the posterior Q(θ | θ̄) is a Gaussian centered at θ̄ with a diagonal covariance matrix Σ such that ‖Σ‖_F = O(λ^{−1/2}). Since the E(L_S, Q) term comes from a second-order approximation of the loss function, as λ increases, which means the training set size m increases, the second-order approximation will hold better, and the E(L_S, Q) term will decrease.

For the sake of theoretical completeness, we address the E(L_S, Q) term, the smoothness assumption, and the nonnegativity assumption. In Props. 4.2 and 6.1, the E(L_S, Q) term was introduced as a term satisfying the following equation:

    L_S(Q(θ | θ̄)) = L_S(θ̄) + E(L_S, Q) + (1/2) E_{θ∼Q(θ|θ̄)}[ (θ − θ̄)⊤ (∇²_θ L_S |_θ̄) (θ − θ̄) ].    (28)

As long as the equation is satisfied, we can use any vectors which mimic the Hessian's diagonal elements. For example, we can use the following as an estimation of the Hessian diagonal:

    f̃ = E_{s ∈ {−1,1}^n}[ Div( s ⊙ (g_{S,θ̄,ε}(r, s) − g_{S,θ̄,ε}(r, −s)), 2r(Abs[θ̄] + ε·1) ) ],    (29)

where

    g_{S,θ̄,ε}(r, s) = ∇_θ L_S( θ̄ + r(Abs(θ̄) ⊙ s + ε·1) ),    (30)

Div[·, ·] is an elementwise division, and Abs[·] is an elementwise absolute value. We added a parameter ε for numerical stability. When a network is smooth, f̃ is an approximation of the diagonal elements of the Hessian. Note that the estimation is scale-invariant when ε = 0. The estimation is defined even when networks are not smooth. Furthermore, we can carry out the following modification to f̃ to enforce the nonnegativity condition:

    f̂ = Max[f̃, 0],    (31)

where Max[·, ·] is an elementwise maximum. Let F̄^{(l)} be a matrix such that we substitute the diagonal elements of the Hessian by f̂ in the definition of H̄^{(l)} (24).

[10] The exact form of Ξ₂ can be found in Appendix D.4 in the supplementary material.
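The estimator of Eqs. (29)-(31) can be reproduced in a few lines. Below is an illustrative sketch on a toy quadratic loss with a known Hessian, so the estimate can be compared against the true diagonal; the loss, gradient, and sample count are assumptions of this sketch, and the function names are hypothetical.

```python
# Sketch of the scale-aware Hessian-diagonal estimator of Eqs. (29)-(31),
# checked against the exact Hessian of a toy quadratic loss L(p) = 0.5 p^T H p.
import numpy as np

rng = np.random.default_rng(2)
n = 10
A = rng.normal(size=(n, n))
H = A @ A.T / n                          # ground-truth Hessian of the toy loss
theta = rng.uniform(0.8, 1.2, size=n)    # parameter assignment, kept away from 0

def grad_loss(p):
    return H @ p                         # gradient of the quadratic toy loss

def g(r, s, eps):
    # Eq. (30): gradient at theta + r * (|theta| * s + eps * 1)
    return grad_loss(theta + r * (np.abs(theta) * s + eps))

def f_hat(r=1e-2, eps=1e-3, n_samples=5000):
    est = np.zeros(n)
    for _ in range(n_samples):
        s = rng.choice([-1.0, 1.0], size=n)          # Rademacher sign vector
        est += s * (g(r, s, eps) - g(r, -s, eps)) / (2.0 * r * (np.abs(theta) + eps))
    f_tilde = est / n_samples                        # Eq. (29)
    return np.maximum(f_tilde, 0.0)                  # Eq. (31)

print(np.round(f_hat(), 2))
print(np.round(np.diag(H), 2))           # the two rows should roughly agree
```

On a quadratic loss the sign-averaged central difference isolates the diagonal exactly in expectation; the residual discrepancy comes only from the finite number of sign samples and from the ε term in the denominator.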

Using F̄^{(l)} instead of H̄^{(l)}, we have the following theorem.

Theorem 6.1. For any networks with d weight matrices, for any λ ∈ Λ where Λ ⊂ R_{>0} is a finite set of real numbers independent of S, a positive number ε ∈ (0, 0.1], any positive number R ∈ R_{>0}, any 0–1 bounded loss function, and a parameter assignment θ̄ such that max_i |θ̄_i| ≤ b, if all elements of f̂ are bounded by M ∈ R_{>0}, then there exists a Gaussian distribution Q(θ | θ̄) with a mean θ̄ and a covariance matrix Σ such that ‖Σ‖_F = O(λ^{−1/2}) in terms of λ, and with probability 1 − δ over the training set S and Q, the expected error L_D(Q) is bounded by [11]

    L_S(Q(θ | θ̄)) + (1/√λ) NS₃(R) + (1/λ) Ξ₃ + λ/(2m),    (32)

where

    NS₃(R) = Σ_{l=1}^d min_{γ̄^{(l)} ∈ [−R,R]^{h₁^{(l)}}} √( e^{[−γ̄^{(l)}]⊤} ( F̄^{(l)} ⊙ Ū^{(l)} ) e^{[γ̄′^{(l)}]} )    (33)

and

    Ξ₃ = Õ(hd) · C_{ε,δ,M,b,R,|Λ|}.    (34)

Note that we can statistically bound the term L_S(Q(θ | θ̄)) using the Monte-Carlo algorithm and Hoeffding's inequality. In Theorem 6.1, we use the second-order approximation of the loss to estimate the best choice of posterior variance to minimize both terms (A) and (B) in Eq. (4).

7. Normalized sharpness definition and calculation

In this section, based on the analysis in Sec. 6, we define normalized sharpness, a sharpness and generalization metric invariant with respect to node-wise rescaling. We also describe a practical calculation technique for the metric. We define normalized sharpness as

    NS = Σ_{l=1}^d inf_{γ̄^{(l)} ∈ R^{h₁^{(l)}}} √( e^{[γ̄^{(l)}]⊤} ( F̄^{(l)} ⊙ Ū^{(l)} ) e^{[−γ̄′^{(l)}]} ),    (35)

where the notations Ū^{(l)} and F̄^{(l)} were introduced in Sec. 6. Normalized sharpness appeared in Prop. 6.1 and Theorem 6.1. The reason for the scale-invariance of normalized sharpness is explained in Appendix J in the supplementary material.

In convolutional layers, since the same filter has the same scale, we can tie the scaling parameter per channel. We defer a more detailed description of normalized sharpness for convolutional layers to Appendix F in the supplementary material. Fortunately, the optimization problem (35) is convex with respect to γ̄^{(l)} [12]. Thus, we can estimate a near-optimal solution to the optimization problem by gradient descent. It is straightforward to see that the convexity also holds with convolutional layers. When calculating the normalized sharpness, we need to ensure that the choice of surrogate loss function does not introduce other scale dependences. We discuss this choice in Appendix G in the supplementary material.

[11] The exact form of Ξ₃ can be found in Appendix D.5 in the supplementary material.
[12] See Appendix E in the supplementary material.
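To make the gradient-descent calculation concrete, the following is a minimal sketch of optimizing log-scale variables for a single layer. It is not the authors' implementation: the objective Σ_ij M_ij · exp(γ_i − γ′_j) with M standing in for F̄^{(l)} ⊙ Ū^{(l)}, the zero-mean constraint that removes the free common shift, and the step size are all assumptions of this sketch; the exact formulation used in the paper is given in Appendix E.

```python
# Gradient descent on a convex row/column log-scaling objective of the general
# shape appearing in Eq. (35), for one layer. M plays the role of the Hadamard
# product of the Hessian-diagonal matrix and the squared-weight matrix.
import numpy as np

rng = np.random.default_rng(3)
h1, h2 = 6, 4
M = rng.uniform(0.1, 1.0, size=(h1, h2))     # nonnegative surrogate for F * U

def objective(gamma, gamma_p):
    # sum_ij M_ij * exp(gamma_i - gamma'_j): convex in (gamma, gamma')
    return np.sum(M * np.exp(gamma[:, None] - gamma_p[None, :]))

gamma, gamma_p = np.zeros(h1), np.zeros(h2)
lr = 0.05
for _ in range(2000):
    E = M * np.exp(gamma[:, None] - gamma_p[None, :])
    gamma -= lr * E.sum(axis=1)              # d objective / d gamma_i
    gamma_p -= lr * (-E.sum(axis=0))         # d objective / d gamma'_j
    gamma -= gamma.mean()                    # project out the common shift
    gamma_p -= gamma_p.mean()

layer_value = float(np.sqrt(objective(gamma, gamma_p)))
unscaled_value = float(np.sqrt(M.sum()))     # value at gamma = gamma' = 0
print(round(layer_value, 4), "<=", round(unscaled_value, 4))
```

Because the objective is convex in the log-scales, plain gradient descent reliably reaches a near-optimal scaling, which is the property the paper relies on when computing normalized sharpness in practice.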

8. Numerical evaluation

We numerically evaluated the effectiveness of normalized sharpness (35). We mainly validated the following two points.

• Unnormalized sharpness metrics can fail to predict generalization performance.
• Normalized sharpness (35) better predicts generalization than unnormalized metrics.

For these purposes, we checked the metrics' ability to distinguish models trained on random labels (Sec. 8.1) and to predict the generalization performance of models trained with different hyperparameters (Sec. 8.2). As existing sharpness metrics, we used the trace of the Hessian without normalization (6) and the sum of the squared Frobenius norms of the weight matrices (Neyshabur et al., 2017). We also investigated the empirical dependence of normalized sharpness on the width and depth of networks (Sec. 8.3). In all experiments, we trained multilayer perceptrons with three hidden layers, LeNet (Lecun et al., 1998), and Wide ResNet (Zagoruyko & Komodakis, 2016) with 16 layers and width factor 4 on MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009). More detailed experimental setups are described in Appendix M in the supplementary material.

8.1. Random labels

We first investigated whether normalized sharpness (35) and current unnormalized sharpness metrics can distinguish models trained on random labels. In this experiment, sharper minima are expected to indicate larger generalization gaps.

Results: Figure 1 shows plots of the mean sharpness metrics for models trained on datasets with different random label ratios. The results show that networks trained on random labels had larger normalized sharpness to fit the random labels. Thus, we can say that normalized sharpness provides a fairly good hierarchy in the hypothesis class. Even though sharpness without normalization and the squared Frobenius norm of weight matrices could also distinguish models trained on random labels to some extent, the signal was much weaker than that of normalized sharpness. Note that normalized sharpness has an advantage in that it does not require a biplot of both the unnormalized sharpness and the squared Frobenius norm.

[Figure 1: six panels plotting normalized sharpness (top row), the trace of the Hessian (middle row), and the squared Frobenius norm (bottom row) against the ratio of random labels, with one curve per architecture (lenet, mlp, wres); left column MNIST, right column CIFAR-10.]

Figure 1. The figure shows the values of normalized sharpness (35), the trace of the Hessian (6), and the sum of the squared Frobenius norms of weight matrices of models trained on datasets with different random label ratios. Each dot represents the mean of the generalization metric among trained networks, and the error bars represent the standard deviation. The left column is the results on MNIST and the right is on CIFAR10. All generalization metrics were rescaled to [0, 1] by their maximum and minimum per network architecture.

Table 1. Correlation coefficients between the generalization gap and the generalization metrics for models trained on MNIST.

                          MLP      LeNet    WResNet
  Normalized sharpness    0.73     0.73     0.59
  Trace of Hessian       -0.62    -0.79    -0.59
  Frobenius norm          0.58     0.58     0.49

Table 2. Correlation coefficients between the generalization gap and the generalization metrics for models trained on CIFAR10.

                          MLP      LeNet    WResNet
  Normalized sharpness    0.92     0.98     0.92
  Trace of Hessian       -0.42    -0.51    -0.53
  Frobenius norm          0.72     0.76     0.43

8.2. Different hyperparameters

We tested whether normalized sharpness and the existing unnormalized sharpness metrics can predict the generalization performance of models trained with different hyperparameters. We then checked the correlations between the generalization metrics and their generalization gaps, defined by (test misclassification ratio) − (train misclassification ratio). Models are trained with different strengths of l2-regularization, weight decay (Loshchilov & Hutter, 2019), and dropout. This is an adversarial setting for existing unnormalized sharpness metrics because both regularization on weights and dropout directly affect the balance between parameter scales and flatness of minima.

Results: We summarized the correlation coefficients between the curvature metrics and the accuracy gaps in Table 1 and Table 2. Scatter plots can be found in Appendix N.1 in the supplementary material with more detailed results. On CIFAR10, especially for LeNet and Wide ResNet, we observed almost linear correlations between normalized sharpness and the accuracy gap. Thus, we can confirm the usefulness of our generalization error bounds and normalized sharpness. On MNIST, even though there were weak correlations between normalized sharpness and the accuracy gap, the correlation was weaker than the results on CIFAR10. A possible explanation of this phenomenon is that the scale of the accuracy gap was too small on MNIST, which was at most 0.02. Since we could not create models with various accuracy gaps by merely changing the regularization parameters on MNIST, the effect of noise would have become larger. In all settings, both the trace of the Hessian and the squared Frobenius norm of the weight matrices had a weaker ability to predict the generalization. We can confirm that normalized sharpness had consistently stronger correlations with generalization. Notably, we observed a negative correlation with the trace of the Hessian and a positive correlation with the squared Frobenius norm. If we use the tuple of the two as a generalization metric (Neyshabur et al., 2017), we cannot determine which models will generalize better. We explain the negative correlations as follows. In our experiments, we used three regularizers: l2-regularization, weight decay, and dropout. When we use l2-regularization or weight decay, stronger regularization decreases the generalization error and the parameter scales. Since there are certain trade-offs between the sharpness and parameter scales, as shown in Eq. (4), the stronger regularization makes the Hessian larger.
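The entries of Tables 1 and 2 are Pearson correlation coefficients computed over the set of trained models. A minimal sketch of that computation follows; the arrays are placeholder values for illustration only, where in the experiments each entry would come from one trained model.

```python
# Pearson correlation between a generalization metric and the observed
# accuracy gap, as reported in Tables 1 and 2. Placeholder data only.
import numpy as np

metric_values = np.array([0.8, 1.3, 2.1, 2.9, 3.5])      # e.g. normalized sharpness
accuracy_gaps = np.array([0.02, 0.05, 0.09, 0.12, 0.16])  # test error - train error

r = np.corrcoef(metric_values, accuracy_gaps)[0, 1]
print(round(float(r), 2))
```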

[Figure 2: two panels; the top panel plots normalized sharpness against network width for multilayer perceptrons of depths 1-4, and the bottom panel plots normalized sharpness against network depth for widths 32, 128, 512, and 2048.]

Figure 2. Plot of the dependence of normalized sharpness on width and depth. The top figure plots the dependence on width of the normalized sharpness of multilayer perceptrons with different widths. The bottom figure plots those on depth. The normalized sharpness showed almost linear dependence on network depth and sublinear dependence on network width.

8.3. Dependence on width and depth

To verify that the reduction of the constant term presented in Sec. 6 improves the overall bounds, we empirically investigated the dependency of normalized sharpness on network width and depth. We trained multilayer perceptrons with depth {1, 2, 3, 4} and width {32, 128, 512, 2048} on MNIST three times each and calculated normalized sharpness for every trained model. Figure 2 shows the results. It indicates that the normalized sharpness has an almost linear dependence on the network depth and a sub-linear dependence on the network width, at least under this setting. This suggests that the constant term reduction presented in Sec. 6 contributes to tightening the overall generalization bounds.

9. Comparison with other generalization bounds

We compare the tightness of our bounds and a prior PAC-Bayesian approach (Neyshabur et al., 2018) [13]. First, we remark that Neyshabur et al. (2018) relies on a margin parameter. The margin parameter controls the term (A) in Eq. (4) to some extent. Importantly, the margin loss they rely on does not decrease when the sample size increases. On the other hand, the term (A) is solely controlled by the posterior choice in our bounds. As a result, the term decreases as the sample size increases. However, the difference makes the rates of convergence different and their direct comparison harder. The convergence rate is further discussed in Appendix I in the supplementary material. Nevertheless, we can still compare the effect of our constant term, which is O(hd), with existing bounds. The main bound in Neyshabur et al. (2018) scales at least as O(d²h). Compared to that, our constant term will not be critical to the tightness of our bound.

An O(hd) generalization bound corresponds to O(1/h) compression per layer. This is fairly competitive with the reported result by Arora et al. (2018). Even though our work is not directly comparable to Arora et al. (2018), this suggests that our bounds are comparably tight compared to those derived by Arora et al. (2018).

Observations in Sec. 8.3 suggest an advantage of normalized sharpness over the Fisher-Rao norm. Since the Fisher-Rao norm scales as O(d²) with respect to depth when the represented function is identical (Liang et al., 2019), normalized sharpness might be more robust against architecture changes, especially concerning the depth of networks.

10. Conclusion

We have formally connected normalized loss curvatures with generalization through PAC-Bayesian analysis. The analysis bridged the known gap between theoretical understandings and empirical connections between normalized loss-landscapes and generalization. The proof consists of two steps: scale-direction decompositions of parameters and a discretization trick. In the analysis, we found that using a smaller number of scaling parameters is critical for meaningful generalization bounds, at least within our framework. Applying this discovery, we proposed normalized sharpness as a novel generalization metric. Experimental results suggest that this metric is more powerful than unnormalized loss sharpness metrics as a measure of generalization.

[13] Remark: Our primary goal is not providing the state-of-the-art tightest bound, but connecting a scale-invariant flatness metric and generalization.

ACKNOWLEDGEMENTS

YT was supported by Toyota/Dwango AI scholarship. IS was supported by KAKENHI 20H04239. MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.

References

Achille, A. and Soatto, S. Emergence of Invariance and Disentanglement in Deep Representations. Journal of Machine Learning Research, 19, 2018.

Alquier, P., Ridgway, J., and Chopin, N. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(239):1-41, 2016.

Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger Generalization Bounds for Deep Nets via a Compression Approach. In Proceedings of the 35th International Conference on Machine Learning, pp. 254-263, 2018.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems 30, pp. 6240-6249. 2017.

Catoni, O. Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics, 2007.

Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. In International Conference on Learning Representations, 2017.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171-4186, 2019.

Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp Minima Can Generalize For Deep Nets. In Proceedings of the 34th International Conference on Machine Learning, pp. 1019-1028, 2017.

Dziugaite, G. K. and Roy, D. M. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017.

Dziugaite, G. K. and Roy, D. M. Data-dependent PAC-Bayes priors via differential privacy. In Advances in Neural Information Processing Systems 31, pp. 8430-8441. 2018.

Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. PAC-Bayesian Theory Meets Bayesian Inference. In Advances in Neural Information Processing Systems 29, pp. 1884-1892. 2016.

Guedj, B. A Primer on PAC-Bayesian Learning. ArXiv e-prints, abs/1901.05353, 2019.

Hinton, G. E. and van Camp, D. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT '93, pp. 5-13, 1993.

Hochreiter, S. and Schmidhuber, J. Flat Minima. Neural Computation, 9(1):1-42, 1997.

Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems 30, pp. 1731-1741. 2017.

Honkela, A. and Valpola, H. Variational Learning and Bits-back Coding: An Information-theoretic View to Bayesian Learning. Trans. Neur. Netw., 15(4):800-810, 2004. ISSN 1045-9227.

Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, pp. 448-456, 2015.

Jiang, Y., Neyshabur, B., Krishnan, D., Mobahi, H., and Bengio, S. Fantastic Generalization Measures and Where to Find Them. In International Conference on Learning Representations, 2020.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In International Conference on Learning Representations, 2017.

Kingma, D. P., Salimans, T., and Welling, M. Variational Dropout and the Local Reparameterization Trick. In Advances in Neural Information Processing Systems 28, pp. 2575-2583. 2015.

Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, pp. 1097-1105. 2012.

Langford, J. and Caruana, R. (Not) Bounding the True Error. In Advances in Neural Information Processing Systems 14, pp. 809-816. 2002.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based Learning Applied to Document Recognition. In Proceedings of the IEEE, pp. 2278-2324, 1998.

LeCun, Y., Cortes, C., and Burges, C. J. C. The MNIST Database of Handwritten Digits. 1998.

Letarte, G., Germain, P., Guedj, B., and Laviolette, F. Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks. In Advances in Neural Information Processing Systems 32, pp. 6869-6879. 2019.

Li, H., Xu, Z., Taylor, G., and Goldstein, T. Visualizing the Loss Landscape of Neural Nets. In Advances in Neural Information Processing Systems 31, pp. 6391-6401. 2018.

Liang, T., Poggio, T., Rakhlin, A., and Stokes, J. Fisher-Rao Metric, Geometry, and Complexity of Neural Networks. In Proceedings of Machine Learning Research, volume 89, pp. 888-896, 2019.

Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.

Nagarajan, V. and Kolter, Z. Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. In International Conference on Learning Representations, 2019.

Neklyudov, K., Molchanov, D., Ashukha, A., and Vetrov, D. P. Structured Bayesian Pruning via Log-Normal Multiplicative Noise. In Advances in Neural Information Processing Systems 30, pp. 6775-6784. 2017.

Neyshabur, B., Salakhutdinov, R. R., and Srebro, N. Path-SGD: Path-Normalized Optimization in Deep Neural Networks. In Advances in Neural Information Processing Systems 28, pp. 2422-2430. 2015.

Neyshabur, B., Bhojanapalli, S., Mcallester, D., and Srebro, N. Exploring Generalization in Deep Learning. In Advances in Neural Information Processing Systems 30, pp. 5947-5956. 2017.

Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks. In International Conference on Learning Representations, 2018.

Parrado-Hernández, E., Ambroladze, A., Shawe-Taylor, J., and Sun, S. PAC-Bayes Bounds with Data Dependent Priors. Journal of Machine Learning Research, 13:3507-3531, 2012.

Rissanen, J. Stochastic Complexity and Modeling. Ann. Statist., 14(3):1080-1100, 09 1986.

Rothblum, U. G. and Zenios, S. A. Scalings of Matrices Satisfying Line-Product Constraints and Generalizations. Linear Algebra and Its Applications, pp. 159-175, 1992.

Salimans, T. and Kingma, D. P. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Advances in Neural Information Processing Systems 29, pp. 901-909. 2016.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

Uhlmann, J. A Generalized Matrix Inverse that is Consistent with Respect to Diagonal Transformations. SIAM Journal on Matrix Analysis and Applications, pp. 781-800, 2018.

van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., and Hassabis, D. Parallel WaveNet: Fast High-Fidelity Speech Synthesis. In Proceedings of the 35th International Conference on Machine Learning, pp. 3918-3926, 2018.

Wang, H., Shirish Keskar, N., Xiong, C., and Socher, R. Identifying Generalization Properties in Neural Networks. ArXiv e-prints, abs/1809.07402, 2018.

Yao, Z., Gholami, A., Lei, Q., Keutzer, K., and Mahoney, M. W. Hessian-based Analysis of Large Batch Training and Robustness to Adversaries. In Advances in Neural Information Processing Systems 31, pp. 4954-4964. 2018.

Zagoruyko, S. and Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference, pp. 87.1-87.12, 2016.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding Deep Learning Requires Rethinking Generalization. In International Conference on Learning Representations, 2017.

Zhang, G., Wang, C., Xu, B., and Grosse, R. Three Mechanisms of Weight Decay Regularization. In International Conference on Learning Representations, 2019.

Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach. In International Conference on Learning Representations, 2019.