Fast generalization error bound of deep learning from a kernel perspective

Taiji Suzuki ([email protected])
Graduate School of Information Science and Technology, The University of Tokyo
PRESTO, Japan Science and Technology Agency, Japan
Center for Advanced Integrated Intelligence Research, RIKEN, Tokyo, Japan

Abstract

We develop a new theoretical framework to analyze the generalization error of deep learning, and derive a new fast learning rate for two representative algorithms: empirical risk minimization and Bayesian deep learning. The series of theoretical analyses of deep learning has revealed its high expressive power and universal approximation capability. Our point of view is to deal with the ordinary finite dimensional deep neural network as a finite approximation of the infinite dimensional one. Our formulation of the infinite dimensional model naturally defines a reproducing kernel Hilbert space corresponding to each layer. The approximation error is evaluated by the degree of freedom of the reproducing kernel Hilbert space in each layer. We derive the generalization error bound of both empirical risk minimization and Bayesian deep learning, and it is shown that there appears a bias-variance trade-off in terms of the number of parameters of the finite dimensional approximation. We show that the optimal width of the internal layers can be determined through the degree of freedom, and we derive an optimal convergence rate that is faster than the O(1/√n) rate shown in the existing studies.

1 Introduction

Deep learning has been showing great success in several applications such as computer vision, natural language processing, and many other areas related to pattern recognition. Several high-performance methods have been developed and it has been revealed that deep learning possesses great potential. Despite the development of practical methodologies, its theoretical understanding is not satisfactory. A wide range of researchers, including theoreticians and practitioners, are expecting a deeper understanding of deep learning.

Among theories of deep learning, a well developed topic is its expressive power. It has been theoretically shown that the deep neural network has exponentially large expressive power with respect to the number of layers, using several mathematical notions (Montufar et al., 2014; Bianchini and Scarselli, 2014; Cohen et al., 2016; Cohen and Shashua, 2016; Poole et al., 2016). Another important issue in neural network theories is its universal approximation capability. It is well known that 3-layer neural networks have this ability, and thus the deep neural network also does (Cybenko, 1989; Hornik, 1991; Sonoda and Murata, 2015). When we discuss the universal approximation capability, the target function to be approximated is arbitrary and the theory is highly nonparametric in its nature.

Once we know the expressive power and universal approximation capability of the deep neural network, the next theoretical question naturally arises in its generalization error. The generalization ability is typically analyzed by evaluating the Rademacher complexity (see Bartlett (1998); Koltchinskii and Panchenko (2002); Neyshabur et al. (2015); Sun et al. (2015)). As a whole, these studies have characterized the generalization ability based on the properties of the weights (parameters), but the properties of the input distributions are not well incorporated, which is not satisfactory to obtain a sharper bound. Moreover, the generalization error bound has been mainly given for finite dimensional models. As we have observed, the deep neural network possesses exponential expressive power and universal approximation capability, which are highly nonparametric characterizations. This means that the theories are developed separately in two regimes: the finite dimensional parametric model and the infinite dimensional nonparametric model. Therefore, theories that connect these two regimes are expected to comprehensively understand the statistical performance of deep learning.

In this paper, we consider both empirical risk minimization and Bayesian deep learning and analyze the generalization error using the terminology of kernel methods. The main purpose of the analysis is to find what kind of quantity affects the optimal structure of deep learning. Especially, we see that the generalization error is well affected by the behavior of the eigenvalues of the (non-centered) covariance matrix of the output of each internal layer. Figure 1 shows the distribution of the eigenvalues of a VGG-13 network where the eigenvalues are ordered in increasing order. We see that the distribution is distorted in the sense that a few eigenvalues are large and the smallest eigenvalue is much smaller than the largest one. This means that the effective information is included in a low-dimensional subspace. Our analysis aims to find the optimal dimension of this low dimensional subspace by borrowing the techniques developed in the analysis of kernel methods. Consequently, (i) we derive a faster learning rate than O(1/√n), and (ii) we connect the finite dimensional regime and the infinite dimensional regime based on the theories of kernel methods. To analyze a sharper generalization error bound, we utilize the so-called local Rademacher complexity technique for the empirical risk minimization method (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii, 2006; Giné and Koltchinskii, 2006), and, as for the Bayesian method, we employ the theoretical techniques developed to analyze nonparametric Bayes methods (Ghosal et al., 2000; van der Vaart and van Zanten, 2008, 2011). These analyses are quite advantageous over the typical Rademacher complexity analysis because we can obtain a convergence rate between O(1/n) and O(1/√n), which is faster than the O(1/√n) rate of the standard Rademacher complexity analysis, by making use of favorable properties of the loss function such as strong convexity.

[Figure 1: Distribution of eigenvalues in the 9-th layer of VGG for the CIFAR-10 dataset.]

As for the second contribution, we first introduce an integral form of the deep neural network, as performed in the research on the universal approximation capability of 3-layer neural networks (Sonoda and Murata, 2015). Based on this integral form, a reproducing kernel Hilbert space (RKHS) naturally arises. Due to this formulation, we can borrow the terminology developed in the kernel method into the analysis of deep learning. In particular, we define the degree of freedom of the RKHS as a measure of complexity of the RKHS (Caponnetto and de Vito, 2007; Bach, 2017b), and based on that, we evaluate how large a finite dimensional model should be to approximate the original infinite dimensional model with a specified precision. These theoretical developments reveal that there appears a bias-variance trade-off. An interesting observation here is that this bias-variance trade-off is completely characterized by the behavior of the eigenvalues of the kernel functions corresponding to the RKHSs. In short, if the distribution of eigenvalues is peaky (that is, the decreasing rate of the ordered eigenvalues is fast), then the optimal width (the number of units in each layer) can be small and consequently the variance can be reduced. This indicates that the notion of the degree of freedom gives a practical implication about the determination of the width of the internal layers.

The obtained generalization error bound is summarized in Table 1.

2 Integral representation of deep neural network

Here we give our problem settings and the model that we consider in this paper. Suppose that n input-output observations D_n = (x_i, y_i)_{i=1}^n ⊂ R^{d_x} × R are independently identically generated from a regression model
\[ y_i = f^o(x_i) + \xi_i \quad (i = 1, \dots, n), \]
where (ξ_i)_{i=1}^n is an i.i.d. sequence of Gaussian noises N(0, σ²) with mean 0 and variance σ², and (x_i)_{i=1}^n is generated independently identically from a distribution P_X(X) with a compact support in R^{d_x}. The purpose of the deep learning problem we consider in this paper is to estimate f^o from the n observations D_n. We may consider other situations such as classification with margin assumptions; however, just for theoretical simplicity, we consider the simplest regression problem.
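As a concrete (and purely illustrative) instance of this observation model, the following sketch draws data from y_i = f^o(x_i) + ξ_i. The particular f^o (a small random ReLU network) and the choice of P_X (uniform on the compact cube [−1, 1]^{d_x}) are hypothetical stand-ins for this example; the paper only assumes compact support of P_X.

```python
import numpy as np

def sample_regression_data(n, d_x=5, sigma=0.1, seed=0):
    """Draw (x_i, y_i), i = 1..n, from y_i = f_true(x_i) + xi_i with Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, d_x))        # P_X with compact support
    # f_true: a fixed shallow ReLU network standing in for the unknown f^o
    W1, b1 = rng.normal(size=(16, d_x)), rng.normal(size=16)
    w2 = rng.normal(size=16)
    f_true = np.maximum(X @ W1.T + b1, 0.0) @ w2 / np.sqrt(16)
    y = f_true + sigma * rng.normal(size=n)           # additive N(0, sigma^2) noise
    return X, y

X, y = sample_regression_data(n=1000)
```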
Table 1: Summary of the derived bounds on the generalization error ‖f̂ − f^o‖²_{L_2(P_X)}, where n is the sample size, R is the norm bound of the weights in the internal layers, R̂_∞ is an L_∞-norm bound of the functions in the model, σ is the observation noise, d_x is the dimension of the input, m_ℓ is the width of the ℓ-th internal layer, and N_ℓ(λ) (λ > 0) is the degree of freedom (Eq. (5)); a ∨ b indicates max{a, b}.

General setting:
\[ L^2 \sum_{\ell=2}^{L} R^{2(L-\ell+1)} \lambda_\ell + \frac{\sigma^2 + \hat{R}_\infty^2}{n} \sum_{\ell=1}^{L} m_\ell m_{\ell+1} \log(n), \]
under the assumption that m_ℓ ≳ N_ℓ(λ_ℓ) log(N_ℓ(λ_ℓ)).

Finite dimensional model:
\[ \frac{\sigma^2 + \hat{R}_\infty^2}{n} \sum_{\ell=1}^{L} m_\ell^* m_{\ell+1}^* \log(n), \]
where m*_ℓ is the true width of the ℓ-th internal layer.

Polynomial decay of eigenvalues:
\[ L^2 \sum_{\ell=2}^{L} (R \vee 1)^{2(L-\ell+1)} n^{-\frac{1}{1+2 s_\ell}} \log(n) + \frac{d_x}{n} \log(n), \]
where s_ℓ is the decay rate of the eigenvalues of the kernel function on the ℓ-th layer.
To analyze the generalization ability of deep learning, we specify a function class in which the true function f^o is included, and, by doing so, we characterize the "complexity" of the true function in a correct way.

In order to give a better intuition, we first start from the simplest model, the 3-layer neural network. Let η be a nonlinear activation function such as ReLU (Nair and Hinton, 2010; Glorot et al., 2011); η(x) = (max{x_i, 0})_{i=1}^d for a d-dimensional vector x ∈ R^d. The 3-layer neural network model is represented by
\[ f(x) = W^{(2)} \eta(W^{(1)} x + b^{(1)}) + b^{(2)}, \]
where we denote by m_2 the number of nodes in the internal layer, and W^{(2)} ∈ R^{1×m_2}, W^{(1)} ∈ R^{m_2×d_x}, b^{(1)} ∈ R^{m_2} and b^{(2)} ∈ R. It is known that this model is a universal approximator, and it is important to consider its integral form
\[ f(x) = \int h(w, b)\, \eta(w^\top x + b)\, \mathrm{d}w\, \mathrm{d}b + b^{(2)}, \tag{1} \]
where (w, b) ∈ R^{d_x} × R is a hidden parameter, h : R^{d_x} × R → R is a function version of the weight matrix W^{(2)}, and b^{(2)} ∈ R is the bias term. This integral form appears in many places to analyze the capacity of the neural network. In particular, through the ridgelet analysis, it is shown that there exists an integral form corresponding to any f ∈ L_1(R^{d_x}) which has an integrable Fourier transform, for an appropriately chosen activation function η such as ReLU (Sonoda and Murata, 2015).

Motivated by the integral form of the 3-layer neural network, we consider a more general representation for deeper neural networks. To do so, we define a feature space on the ℓ-th layer. The feature space is a probability space (T_ℓ, B_ℓ, Q_ℓ), where T_ℓ is a Polish space, B_ℓ is its Borel algebra, and Q_ℓ is a probability measure on (T_ℓ, B_ℓ). This is introduced to represent a general (possibly) continuous set of features as well as a discrete set of features. For example, if the ℓ-th internal layer is endowed with a d-dimensional finite feature space, then T_ℓ = {1, ..., d}. On the other hand, the integral form (1) corresponds to a continuous feature space T_2 = {(w, b) ∈ R^{d_x} × R} in the second layer. Now the input x is a d_x-dimensional real vector, and thus we may set T_1 = {1, ..., d_x}. Since the output is one dimensional, the output layer is just a singleton T_{L+1} = {1}. Based on these feature spaces, our integral form of the deep neural network is constructed by stacking the map on the ℓ-th layer f_ℓ^o : L_2(Q_ℓ) → L_2(Q_{ℓ+1}) given as
\[ f_\ell^o[g](\tau) = \int_{\mathcal{T}_\ell} h_\ell^o(\tau, w)\, \eta(g(w))\, \mathrm{d}Q_\ell(w) + b_\ell^o(\tau), \tag{2a} \]
where h_ℓ^o(τ, w) corresponds to the weight of the feature w for the output τ, and h_ℓ^o ∈ L_2(Q_{ℓ+1} × Q_ℓ) and h_ℓ^o(τ, ·) ∈ L_2(Q_ℓ) for all τ ∈ T_{ℓ+1} (note that, for g ∈ L_2(Q_ℓ), f_ℓ[g] is also square integrable with respect to L_2(Q_{ℓ+1}) if η is Lipschitz continuous, because h_ℓ ∈ L_2(Q_{ℓ+1} × Q_ℓ)). Specifically, the first and the last layers are represented as
\[ f_1^o[x](\tau) = \sum_{j=1}^{d_x} h_1^o(\tau, j)\, x_j\, Q_1(j) + b_1^o(\tau), \tag{2b} \]
\[ f_L^o[g](1) = \int_{\mathcal{T}_L} h_L^o(w)\, \eta(g(w))\, \mathrm{d}Q_L(w) + b_L^o, \tag{2c} \]
where we wrote h_L^o(w) to indicate h_L^o(1, w) for simplicity because T_{L+1} = {1}. Then the true function f^o is given as
\[ f^o(x) = f_L^o \circ f_{L-1}^o \circ \cdots \circ f_1^o(x), \tag{3} \]
where f_ℓ^o ∘ F(x) indicates f_ℓ^o[F(x)](·) ∈ L_2(Q_{ℓ+1}) for F(x)(·) ∈ L_2(Q_ℓ). Since the shallow 3-layer neural network is a universal approximator, so is our generalized deep neural network model (3). It is known that the deep neural network gives a more efficient representation of a function than shallow ones. Actually, Eldan and Shamir (2016) gave an example of a function that the 3-layer neural network cannot approximate under a given precision unless its width is exponential in the input dimension, but the 4-layer neural network can approximate with polynomial order widths (see Safran and Shamir (2016) for other examples). Therefore, it is quite important to consider the integral representation of a deep neural network rather than a 3-layer network.

The integral representation is natural also from the practical point of view. Indeed, it is well known that the deep neural network learns a simple pattern in the early layers and it gradually extracts more complicated features as the layer goes up. The trained feature is usually a continuous one. For example, in computer vision tasks, the second layer typically extracts gradients toward several degree angles (Krizhevsky et al., 2012). The angle is a continuous variable and thus the feature space should be continuous to cover all angles. On the other hand, the real network discretizes the feature space because of the limitation of computational resources. Our theory introduced in the next section offers a measure to evaluate this discretization error.

3 Kernels and corresponding RKHSs

When we estimate f^o, we need to discretize the integrals by finite sums due to the limitation of computational resources, as we do in practice. In other words, we consider the usual finite sum deep learning model as an approximation of the integral form. However, the discrete approximation induces an approximation error. Here we give an upper bound of the approximation error. Naturally, there arises the notion of a bias-variance trade-off: as the complexity of the finite model increases, the "bias" (approximation error) decreases but the "variance" for finding the best parameter in the model increases. Combining these two notions, it is possible to quantify the bias-variance trade-off and find the best strategy to minimize the entire generalization error.

The approximation error analysis of the deep neural network can be well executed by utilizing notions of the kernel method. Here we construct an RKHS for each layer in a way analogous to Bach (2017b,a), who studied shallow learning and the kernel quadrature rule. Let the output of the ℓ-th layer be F_ℓ^o(x, τ) := (f_ℓ^o ∘ ··· ∘ f_1^o(x))(τ). We define a reproducing kernel Hilbert space (RKHS) corresponding to the ℓ-th layer (ℓ ≥ 2) by introducing its associated kernel function k_ℓ : R^{d_x} × R^{d_x} → R. We define the positive definite kernel k_ℓ as
\[ k_\ell(x, x') := \int_{\mathcal{T}_\ell} \eta(F_{\ell-1}^o(x, \tau))\, \eta(F_{\ell-1}^o(x', \tau))\, \mathrm{d}Q_\ell(\tau). \]
It is easy to check that k_ℓ is symmetric and positive definite. It is known that there exists a unique RKHS H_ℓ corresponding to the kernel k_ℓ (Aronszajn, 1950). Under this setting, all arguments at the ℓ-th layer can be carried out through the theories of kernel methods. Importantly, for g ∈ H_ℓ, there exists h ∈ L_2(Q_ℓ) such that
\[ g(x) = \int_{\mathcal{T}_\ell} h(\tau)\, \eta(F_{\ell-1}^o(x, \tau))\, \mathrm{d}Q_\ell(\tau). \]
Moreover, the norms of g and h are connected as
\[ \|g\|_{\mathcal{H}_\ell} = \|h\|_{L_2(Q_\ell)} \tag{4} \]
(Bach, 2017b,a). Therefore, the function x ↦ ∫_{T_ℓ} h_ℓ^o(τ, w) η(F_{ℓ-1}^o(x, w)) dQ_ℓ(w), representing the magnitude of a feature τ ∈ T_{ℓ+1} for the input x, is included in the RKHS, and its RKHS norm is equivalent to that of the internal layer weight ‖h_ℓ^o(τ, ·)‖_{L_2(Q_ℓ)} because of Eq. (4).

To derive the approximation error, we need to evaluate the "complexity" of the RKHS. Basically, the complexity of the ℓ-th layer RKHS H_ℓ is controlled by the behavior of the eigenvalues of the kernel. To formally state this notion, we consider the Hilbert-Schmidt decomposition of the kernel given by
\[ k_\ell(x, x') = \sum_{j=1}^{\infty} \mu_j^{(\ell)} \phi_j^{(\ell)}(x)\, \phi_j^{(\ell)}(x') \]
in L_2(P_X × P_X), where (μ_j^{(ℓ)})_{j=1}^∞ is the sequence of eigenvalues ordered in decreasing order, and (φ_j^{(ℓ)})_{j=1}^∞ forms an orthonormal system in L_2(P_X).

Based on the Hilbert-Schmidt decomposition, we define the degree of freedom N_ℓ(λ) of the RKHS as
\[ N_\ell(\lambda) = \sum_{j=1}^{\infty} \frac{\mu_j^{(\ell)}}{\mu_j^{(\ell)} + \lambda} \tag{5} \]
for arbitrary λ > 0. Roughly speaking, this is an effective dimension of the subspace that approximates the infinite dimensional RKHS with precision λ. Note that N_ℓ(λ) is a monotonically decreasing function with respect to λ.

Note that, by using the canonical expansion η(F_{ℓ-1}^o(x, τ)) = Σ_{j=1}^∞ √(μ_j^{(ℓ)}) φ_j^{(ℓ)}(x) ψ_j^{(ℓ)}(τ) in L_2(P_X × Q_ℓ), the eigenvalues (μ_j^{(ℓ)})_{j=1}^∞ are also those of the covariance operator Σ_ℓ : L_2(Q_ℓ) → L_2(Q_ℓ) defined by
\[ (\Sigma_\ell h)(\tau) = \int_{\mathcal{T}_\ell} \Big\{ \int \eta(F_{\ell-1}^o(x, \tau))\, \eta(F_{\ell-1}^o(x, \tau'))\, \mathrm{d}P_X(x) \Big\}\, h(\tau')\, \mathrm{d}Q_\ell(\tau') \]
for h ∈ L_2(Q_ℓ), because Σ_ℓ can be represented by Σ_ℓ = Σ_{j=1}^∞ μ_j^{(ℓ)} ψ_j^{(ℓ)} ⟨ψ_j^{(ℓ)}, ·⟩_{L_2(Q_ℓ)}.
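The degree of freedom in Eq. (5) is exactly the quantity one can estimate from a trained, finite-width network: form the (non-centered) Gram/covariance matrix of a layer's post-activation outputs on held-out data, take its eigenvalues as stand-ins for the μ_j^{(ℓ)}, and evaluate the sum. The sketch below assumes the layer outputs are available as an n × m_ℓ array and uses an illustrative 1/(n·m_ℓ) normalization; both are conventions of this example rather than definitions from the paper.

```python
import numpy as np

def degree_of_freedom(activations, lam):
    """Estimate N(lambda) = sum_j mu_j / (mu_j + lambda) for one layer.

    activations: array of shape (n, m) holding eta(F_{l-1}(x_i, tau_j))
                 for n held-out inputs and m (discretized) features.
    lam:         precision parameter lambda > 0 in Eq. (5).
    """
    n, m = activations.shape
    # Empirical non-centered covariance; its eigenvalues play the role of mu_j.
    cov = activations.T @ activations / (n * m)
    mu = np.clip(np.linalg.eigvalsh(cov), 0.0, None)   # drop negative round-off
    return float(np.sum(mu / (mu + lam)))

# Toy usage: a random "layer output" whose spectrum decays geometrically.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(500, 64)) @ np.diag(0.9 ** np.arange(64)), 0.0)
for lam in (1e-1, 1e-2, 1e-3):
    print(f"lambda = {lam:g}:  N(lambda) ~ {degree_of_freedom(acts, lam):.1f}")
```

A rapidly decaying spectrum keeps N(λ) small even for small λ, which is the "peaky eigenvalue" regime discussed in the introduction.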

4 Generalization error analysis: bias-variance trade-off

In this section, we give the generalization error analysis using the notion of kernel methods. To do so, we first give some assumptions.

We assume that the true function f^o satisfies a norm condition as follows.

Assumption 1 For each ℓ, h_ℓ^o and b_ℓ^o satisfy
\[ \|h_\ell^o(\tau, \cdot)\|_{L_2(Q_\ell)} \le R, \qquad |b_\ell^o(\tau)| \le R_b \quad (\forall \tau \in \mathcal{T}_{\ell+1}). \]

By Eq. (4), the first assumption ‖h_ℓ^o(τ, ·)‖_{L_2(Q_ℓ)} ≤ R is interpreted as F_ℓ^o(τ, ·) ∈ H_ℓ and ‖F_ℓ^o(τ, ·)‖_{H_ℓ} ≤ R. This means that the feature map F_ℓ^o(τ, ·) in each internal layer is well regulated by the RKHS norm. Moreover, we also assume that the activation function is scale invariant.

Assumption 2 We assume the following conditions on the activation function η.
• η is scale invariant: η(ax) = aη(x) for all a > 0 and x ∈ R^d (for arbitrary d).
• η is 1-Lipschitz continuous: |η(x) − η(x')| ≤ ‖x − x'‖ for all x, x' ∈ R^d.

The first assumption on the scale invariance is essential to derive tight error bounds. The second one ensures that a deviation in each layer does not affect the output too much. The most important example of an activation function that satisfies these conditions is the ReLU activation. Another one is the identity map η(x) = x.

Finally, we assume that the input distribution has a compact support.

Assumption 3 The support of P_X is compact and it is bounded as ‖x‖_∞ := max_{1≤i≤d_x} |x_i| ≤ D_x (∀x ∈ supp(P_X)).

From now on, we fix δ > 0, and define ĉ_δ := 4/(1−δ). Then, we introduce the following constants corresponding to the norm bounds (Assumption 1): R̄ := √(ĉ_δ) R, R̄_b := R_b/(1−δ). Define a finite dimensional model with norm constraints as
\[ \mathcal{F} := \big\{ f(x) = (W^{(L)} \eta(\cdot) + b^{(L)}) \circ \cdots \circ (W^{(1)} x + b^{(1)}) \;\big|\; \|W^{(\ell)}\|_{\mathrm{F}} \le \bar{R},\ \|b^{(\ell)}\| \le \bar{R}_b\ (\ell = 1, \dots, L) \big\}. \]
This is used to approximate the infinite dimensional model characterized by the integral form. We can show that all functions in F have a uniform norm bound ‖f‖_∞ ≤ R̂_∞, where R̂_∞ is defined as
\[ \hat{R}_\infty := \bar{R}^L D_x + \sum_{\ell=1}^{L} \bar{R}^{L-\ell} \bar{R}_b. \]
The proof is given in Appendix A.2 in the supplementary material. Because of this, we can derive the generalization error bound with respect to the population L_2-norm instead of the empirical L_2-norm.

In the following, we derive the generalization error bounds for the two estimators: the empirical risk minimizer and the Bayes estimator. Then, we also give some examples in which the generalization error is analyzed in detail. Before we state the generalization error bounds, we prepare some notations. Let Ĝ = L R̄^{L−1} D_x + Σ_{ℓ=1}^{L} R̄^{L−ℓ}, and define δ̂_{1,n} and δ̂_{2,n} as
\[ \hat{\delta}_{1,n} = \sum_{\ell=2}^{L} \hat{c}_\delta^{\,L-\ell}\, R^{L-\ell+1} \sqrt{\lambda_\ell}, \]
\[ \hat{\delta}_{2,n} = \sqrt{ \frac{\sum_{\ell=1}^{L} m_\ell m_{\ell+1}}{n}\, \log_+\!\Big( 1 + \frac{4\sqrt{2}\, \hat{G} \max\{\bar{R}, \bar{R}_b\} \sqrt{n}}{\sigma \sum_{\ell=1}^{L} m_\ell m_{\ell+1}} \Big) }, \]
where log_+(x) := max{1, log(x)}. Roughly speaking, δ̂_{1,n} corresponds to a finite dimensional approximation error (bias term) and δ̂_{2,n} corresponds to the amount of deviation of the estimators in the finite dimensional model (variance term).

4.1 Empirical risk minimization

In this section, we define the empirical risk minimizer and investigate its generalization error. Let the empirical risk minimizer be f̂:
\[ \hat{f} := \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} (y_i - f(x_i))^2. \]
Note that there exists at least one minimizer because the parameter set corresponding to F is a compact set and η is a continuous function. f̂ need not necessarily be the exact minimizer but could be an approximate minimizer. We, however, assume f̂ is the exact minimizer for theoretical simplicity. In practice, the empirical risk minimizer is obtained by using the back-propagation technique. The regularization of the norms of the weight matrices and the bias terms is implemented by using the L2-regularization and drop-out techniques.
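Because F only constrains Frobenius norms of the weight matrices and Euclidean norms of the biases, one simple way to search over F is projected gradient descent: take a gradient step on the empirical risk and then rescale any layer that leaves its norm ball. The sketch below does this for a one-hidden-layer ReLU network in plain numpy; the radii, step size, and the projection-based treatment of the constraint are illustrative choices of this example (the paper itself only notes that back-propagation with L2-regularization and drop-out is used in practice).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fit_erm(X, y, m=64, R_bar=3.0, Rb_bar=1.0, lr=1e-2, steps=5000, seed=0):
    """Projected gradient descent on the empirical squared risk over a
    one-hidden-layer ReLU network, keeping ||W_l||_F <= R_bar and
    ||b_l|| <= Rb_bar after every step (so the iterates stay inside F)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W1, b1 = 0.1 * rng.normal(size=(m, d)), np.zeros(m)
    W2, b2 = 0.1 * rng.normal(size=(1, m)), np.zeros(1)
    for _ in range(steps):
        Z1 = X @ W1.T + b1                      # (n, m) pre-activations
        H1 = relu(Z1)
        resid = (H1 @ W2.T + b2).ravel() - y    # (n,) residuals
        g_pred = (2.0 / n) * resid[:, None]     # d(risk)/d(prediction)
        gW2, gb2 = g_pred.T @ H1, g_pred.sum(axis=0)
        g_Z1 = (g_pred @ W2) * (Z1 > 0)         # back-prop through the ReLU
        gW1, gb1 = g_Z1.T @ X, g_Z1.sum(axis=0)
        for P, G in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
            P -= lr * G                          # gradient step (in place)
        for P, radius in ((W1, R_bar), (W2, R_bar), (b1, Rb_bar), (b2, Rb_bar)):
            nrm = np.linalg.norm(P)
            if nrm > radius:                     # project back onto the norm ball
                P *= radius / nrm
    return (W1, b1), (W2, b2)
```

Given training data (X, y), fit_erm(X, y) returns parameters of an element of F (for the hypothetical radii above), which can then be used to evaluate the fitted function at new inputs.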

Our main result about the generalization error analysis for the empirical risk minimizer is given in the following theorem.

Theorem 1 For any δ > 0 and λ_ℓ > 0, suppose that
\[ m_\ell \ge 5\, N_\ell(\lambda_\ell) \log\big(32 N_\ell(\lambda_\ell)/\delta\big) \quad (\ell = 2, \dots, L). \tag{6} \]
Then, there exists a universal constant C_1 such that, for any r > 0 and r̃ > 1,
\[ \|\hat{f} - f^o\|_{L_2(P_X)}^2 \le C_1 \Big[ \tilde{r}\, \hat{\delta}_{1,n}^2 + (\sigma^2 + \hat{R}_\infty^2)\, \hat{\delta}_{2,n}^2 + \frac{\hat{R}_\infty^2 + \sigma^2}{n} \Big( \log_+\!\Big( \frac{\sqrt{n}}{\min\{\sigma/\hat{R}_\infty, 1\}} \Big) + r \Big) \Big] \]
with probability 1 − exp(−n δ̂²_{1,n} (r̃−1)² / (11 R̂²_∞)) − 2 exp(−r), for every r > 0 and r̃ > 1.

The proof is given in Appendix C in the supplementary material. It is easily checked that the third term on the right side, (R̂²_∞ + σ²)/n · (log_+(√n / min{σ/R̂_∞, 1}) + r), is smaller than the first two terms; therefore the generalization error bound can be simply evaluated as
\[ \|\hat{f} - f^o\|_{L_2}^2 = O_p\big( \hat{\delta}_{1,n}^2 + \hat{\delta}_{2,n}^2 \big). \]

The first term δ̂_{1,n} represents the bias that is induced by approximating f^o by the finite dimensional model F. The inequality (6) represents a condition on the width m_ℓ under which the finite dimensional model F can approximate the true function f^o with accuracy δ̂_{1,n}. We can see that, as λ_ℓ decreases, the required width of the internal layer m_ℓ increases by the condition (6). Since N_ℓ(λ) is a decreasing function with respect to λ, a larger model is required to approximate the true function with a smaller approximation error (small δ̂_{1,n}). On the other hand, δ̂_{2,n} indicates the estimation error (or variance) to find the best element in the finite dimensional model F. Since a small λ_ℓ leads to a large m_ℓ, the deviation δ̂_{2,n} in the finite dimensional model should increase. Therefore, we observe a bias-variance trade-off between δ̂_{1,n} and δ̂_{2,n}. A key notion that characterizes the bias-variance trade-off is the degree of freedom N_ℓ(λ), which expresses the "complexity" of the RKHS H_ℓ in each layer. The degree of freedom of a complicated RKHS grows faster than that of a simpler one as λ goes to 0. In other words, if the eigenvalues of the kernels decrease rapidly, then the degree of freedom gets smaller, and we achieve a better generalization by using a simpler model.

This is also informative in practice because, to determine the width of each layer, the degree of freedom gives good guidance. That is, if the degree of freedom is small compared with the sample size, then we may increase the width. An estimate of the degree of freedom can be computed from the trained network by computing the Gram matrix corresponding to the kernel induced from the trained network (where the kernel is defined by the finite sum instead of the integral form) and using the eigenvalues of the Gram matrix (see Figure 3).

To obtain the best generalization error bound, (λ_ℓ)_{ℓ=1}^L should be tuned to balance the bias and variance terms (and accordingly (m_ℓ)_{ℓ=2}^L should also be fine-tuned). Examples of the best achievable generalization error will be shown in Section 4.3.

4.2 Bayes estimator

In this section, we formulate a Bayes estimator and derive its generalization error. To define the Bayes estimator, we just need to specify the prior distribution. Let B_d(C) be the ball in the Euclidean space R^d with radius C > 0 (B_d(C) = {x ∈ R^d | ‖x‖ ≤ C}), and let U(B_d(C)) be the uniform distribution on the ball B_d(C). Here, we employ uniform distributions on the model F:
\[ W^{(\ell)} \sim U\big(B_{m_{\ell+1} \times m_\ell}(\bar{R})\big), \qquad b^{(\ell)} \sim U\big(B_{m_{\ell+1}}(\bar{R}_b)\big). \]
In practice, the Gaussian distribution is also employed instead of the uniform distribution, but, for theoretical simplicity, we decided to analyze the uniform prior distribution.

The prior distribution on the parameters (W^{(ℓ)}, b^{(ℓ)})_{ℓ=1}^L induces a distribution of the function f in the space of continuous functions endowed with the Borel algebra corresponding to the L_∞(R^{d_x})-norm. We denote by Π the induced distribution. Using the prior, the posterior distribution is defined via the Bayes principle:
\[ \Pi(\mathrm{d}f \mid D_n) = \frac{ \exp\big\{ -\sum_{i=1}^n (y_i - f(x_i))^2 / (2\sigma^2) \big\}\, \Pi(\mathrm{d}f) }{ \int \exp\big\{ -\sum_{i=1}^n (y_i - f'(x_i))^2 / (2\sigma^2) \big\}\, \Pi(\mathrm{d}f') }. \]
Although we do not pursue the computational issues of Bayesian deep learning here, see, for example, Hernandez-Lobato and Adams (2015); Blundell et al. (2015) for practical algorithms. The following theorem gives how fast the Bayes posterior contracts around the true function.

Theorem 2 Fix arbitrary δ > 0 and λ_ℓ > 0 (ℓ = 1, ..., L), and suppose that the condition (6) on m_ℓ is satisfied. Then, for all r ≥ 1, the posterior tail probability can be bounded as
\[ \mathbb{E}_{D_n} \Pi\Big( f : \|f - f^o\|_{L_2(P_X)} \ge (\hat{\delta}_{1,n} + \sigma \hat{\delta}_{2,n})\, r\, \max\{12, 33 \hat{R}_\infty^2/\sigma^2\} \,\Big|\, D_n \Big) \]
\[ \le \exp\Big( -n \hat{\delta}_{1,n}^2\, \frac{\min\{(r^2-1)^2,\, r^2-1\}}{11 \hat{R}_\infty^2} \Big) + 12 \exp\Big( -\frac{n (\hat{\delta}_{1,n} + \sigma \hat{\delta}_{2,n})^2\, r^2}{8 \sigma^2} \Big). \]

The proof is given in Appendix B in the supplementary material. Roughly speaking, this theorem indicates that the posterior distribution concentrates within the distance δ̂_{1,n} + σδ̂_{2,n} from the true function f^o. The tail probability is sub-Gaussian and thus the posterior mass outside the distance δ̂_{1,n} + σδ̂_{2,n} from the true function rapidly decreases. Here we again observe that there appears a bias-variance trade-off between δ̂_{1,n} and δ̂_{2,n}. This can be understood essentially in the same way as in the empirical risk minimization.
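Since the prior is uniform on a product of norm balls, the unnormalized log-posterior over the parameters is simply the negative squared loss scaled by 1/(2σ²) inside the support, and −∞ outside it. The sketch below spells this out and adds one random-walk Metropolis update as a naive way of drawing approximate posterior samples; the proposal scale and the numpy parameterization are assumptions of this illustration, not the algorithms referenced in the text (Hernandez-Lobato and Adams, 2015; Blundell et al., 2015).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(params, X):
    """Finite-width network (W_L eta(.) + b_L) o ... o (W_1 x + b_1)."""
    h = X
    for l, (W, b) in enumerate(params):
        h = h @ W.T + b
        if l < len(params) - 1:
            h = relu(h)
    return h.ravel()

def log_posterior(params, X, y, sigma, R_bar, Rb_bar):
    """log Pi(params | D_n) up to an additive constant, for the uniform prior."""
    for W, b in params:                      # prior support: the norm balls of F
        if np.linalg.norm(W) > R_bar or np.linalg.norm(b) > Rb_bar:
            return -np.inf
    resid = y - forward(params, X)
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

def metropolis_step(params, X, y, sigma, R_bar, Rb_bar, rng, step=1e-2):
    """One random-walk Metropolis update of all weights and biases."""
    proposal = [(W + step * rng.normal(size=W.shape),
                 b + step * rng.normal(size=b.shape)) for W, b in params]
    log_cur = log_posterior(params, X, y, sigma, R_bar, Rb_bar)
    log_new = log_posterior(proposal, X, y, sigma, R_bar, Rb_bar)
    if np.log(rng.uniform()) < log_new - log_cur:
        return proposal, True
    return params, False
```

Starting the chain from a point inside the norm balls and iterating metropolis_step gives (slowly mixing) draws whose pushforward through `forward` approximates the posterior over functions defined above.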

4.3 Examples

Here, we give some examples of the generalization error bound. We have seen that both the empirical risk minimizer and the Bayes estimator have a simplified generalization error bound of the form
\[ \|\hat{f} - f^o\|_{L_2(P_X)}^2 = O_p\Big( L^2 \sum_{\ell=2}^{L} \bar{R}^{2(L-\ell+1)} \lambda_\ell + \sum_{\ell=1}^{L} \frac{m_\ell m_{\ell+1}}{n} \log(n) \Big), \]
by supposing that σ, R̂_∞ and ĉ_δ^L R^L are of constant order. We evaluate the bound under the best choice of m_ℓ balancing the bias-variance trade-off.

Finite dimensional internal layer If all RKHSs are finite dimensional, say m*_ℓ-dimensional, then N_ℓ(λ) ≤ m*_ℓ for all λ ≥ 0. Therefore, by setting λ_ℓ = 0 (∀ℓ), the generalization error bound is obtained as
\[ \|\hat{f} - f^o\|_{L_2(P_X)}^2 \lesssim \frac{\sigma^2 + \hat{R}_\infty^2}{n} \sum_{\ell=1}^{L} m_\ell^* m_{\ell+1}^* \log(n), \tag{7} \]
where we omitted the factors depending only on log(R̄ R̄_b Ĝ). Note that this convergence rate is solely dependent on the number of parameters. This is much faster than the existing bounds that utilize the Rademacher complexity because their bounds are O(1/√n). This result matches more precise arguments for a finite dimensional 3-layer neural network based on asymptotic expansions (Fukumizu, 1999; Watanabe, 2001).

Polynomial decreasing rate of eigenvalues We assume that the eigenvalues μ_j^{(ℓ)} decay at a polynomial rate,
\[ \mu_j^{(\ell)} \le a_\ell\, j^{-\frac{1}{s_\ell}}, \tag{8} \]
for positive reals 0 < s_ℓ < 1 and a_ℓ > 0. This is a standard assumption in the analysis of kernel methods (Caponnetto and de Vito, 2007; Steinwart and Christmann, 2008), and it is known that this assumption is equivalent to the usual covering number assumption (Steinwart et al., 2009). For small s_ℓ, the decay rate is fast and it is easy to approximate the kernel by another one corresponding to a finite dimensional subspace. Therefore, a small s_ℓ corresponds to a simple model and a large s_ℓ corresponds to a complicated model. In this setting, the degree of freedom is evaluated as
\[ N_\ell(\lambda) \lesssim (\lambda/a_\ell)^{-s_\ell}. \tag{9} \]
Hence, we can show that λ_ℓ = a_ℓ^{2s_ℓ/(1+2s_ℓ)} n^{−1/(1+2s_ℓ)} gives the optimal rate, and we obtain the generalization error bound
\[ \|\hat{f} - f^o\|_{L_2(P_X)}^2 \lesssim \frac{d_x}{n} \log(n) + L^2 \sum_{\ell=2}^{L} (\bar{R} \vee 1)^{2(L-\ell+1)}\, a_\ell^{\frac{2s_\ell}{1+2s_\ell}}\, n^{-\frac{1}{1+2s_\ell}} \log(n), \tag{10} \]
where we omitted the factors depending on s_ℓ, log(R̄ R̄_b Ĝ), σ² and R̂_∞. This indicates that the complexity s_ℓ of the RKHS affects the convergence rate directly. As expected, if the RKHSs are simple (that is, (s_ℓ)_{ℓ=2}^L are small), we obtain faster convergence.

One internal layer: kernel method Finally, we consider a simple but important situation in which there is only one internal layer (L = 2). In this setting, we only need to adjust m_2 because the dimensions of the input and output are fixed as m_1 = d_x and m_3 = 1. We assume the same condition (8) for ℓ = 2. Then, it can be seen that the optimal width gives the following generalization error bound:
\[ \|\hat{f} - f^o\|_{L_2(P_X)}^2 \lesssim \big( (\bar{R} \vee 1)^2 a_2 \big)^{\frac{s_2}{1+s_2}} (d_x + 1)^{\frac{1}{1+s_2}}\, n^{-\frac{1}{1+s_2}} \log(n). \]
This convergence rate is equivalent to the minimax optimal convergence rate of kernel ridge regression (Caponnetto and de Vito, 2007; Steinwart et al., 2009), up to constant and log(n) factors. It is known that the kernel method corresponds to the 3-layer neural network with an infinite dimensional internal layer. In that sense, our analysis includes that of kernel methods, and we can say that the deep neural network is a method that constructs an optimal kernel in a layer-wise manner.
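To make the width selection in this subsection concrete, the helper below plugs the polynomial-decay formulas into the earlier conditions: the balancing choice λ_ℓ = a_ℓ^{2s_ℓ/(1+2s_ℓ)} n^{−1/(1+2s_ℓ)}, the degree-of-freedom bound N_ℓ(λ) ≲ (λ/a_ℓ)^{−s_ℓ} from Eq. (9), and the width requirement (6). Since (9) holds only up to constants, the returned width is a rough order-of-magnitude guide rather than a precise prescription.

```python
import numpy as np

def suggested_width(n, s, a=1.0, delta=0.05):
    """Width suggested by condition (6) under polynomial eigenvalue decay (8).

    n:     sample size
    s:     decay exponent s_l in Eq. (8), with 0 < s < 1
    a:     scale a_l in Eq. (8)
    delta: confidence parameter delta appearing in condition (6)
    """
    lam = a ** (2 * s / (1 + 2 * s)) * n ** (-1.0 / (1 + 2 * s))  # balancing lambda_l
    dof = (lam / a) ** (-s)                                        # Eq. (9), up to constants
    width = 5 * dof * np.log(32 * dof / delta)                     # condition (6)
    return lam, dof, int(np.ceil(width))

for s in (0.2, 0.5, 0.8):
    lam, dof, m = suggested_width(n=10_000, s=s)
    print(f"s = {s}:  lambda ~ {lam:.2e},  N(lambda) ~ {dof:.1f},  width ~ {m}")
```

As expected from Eq. (10), a smaller decay exponent s (a simpler RKHS) yields a smaller degree of freedom and hence a narrower suggested layer.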

5 Numerical experiments

Here, we give some numerical experiments to see connections between our theories and practical situations.

MNIST First, we investigate the relations between the width of each layer and the classification performance on the MNIST dataset (http://yann.lecun.com/exdb/mnist/). We constructed a 5-layer convolutional neural network (CNN) which consists of 3 convolution layers and 2 fully connected layers. We changed the number of channels in the second and third convolution layers from 16 to 256 for the second layer and from 16 to 1024 for the third layer. The result is depicted in Figure 2. We see that for the smallest widths (16, 16), the classification accuracy is the worst. This means that the network is so simple that there appears a large bias. On the other hand, for the large network (256, 1024), the test accuracy is inferior to the best one. This is due to large variance.

[Figure 2: Test classification accuracy for different channel sizes in the second and third layers of the CNN on the MNIST dataset. Horizontal: second layer, vertical: third layer.]

CIFAR-10 We next conduct experiments on the CIFAR-10 dataset (Krizhevsky, 2009), which contains 50,000 training color images of size 32×32 with 10 classes. We applied a VGG-type network (Simonyan and Zisserman, 2014) with 8 convolution layers, 3 max-pooling layers and 2 fully-connected layers. We constructed kernel functions for this network where each feature in the feature space corresponds to each channel, and we computed the degree of freedom N_ℓ(λ) using the validation data. The degree of freedom for each convolution layer is depicted in Figure 3. We see that the degree of freedom increases moderately as λ → 0. This means that the eigenvalues of the kernels decrease rapidly. Therefore, the eigenvalue decay rate assumption in Eq. (8) is justified, and because of this phenomenon the effective dimension of the network is less than the actual number of parameters.

[Figure 3: The degree of freedom against λ for each layer.]

6 Relations to existing work

The sample complexity of deep neural networks has been extensively studied by analyzing their Rademacher complexity. For example, Bartlett (1998) characterized the generalization error of a 3-layer neural network by the norm of the weight vectors instead of the number of parameters. Koltchinskii and Panchenko (2002) studied more general deep neural networks and derived a generalization error bound of the deep neural network under a norm constraint. More recently, Neyshabur et al. (2015) analyzed the Rademacher complexity of the deep neural network based on generalized norms of the weight matrices ((W^{(ℓ)})_{ℓ=1}^L in our paper). Sun et al. (2015) also derived the Rademacher complexity and showed the generalization error under a large margin assumption. Basically, these studies have derived generalization error bounds depending on the properties of the parameters, and the properties of the data distributions are not much taken into account. On the other hand, our analysis involves such information as the degree of freedom, and thus can obtain tighter bounds.

Analysis of the bias-variance trade-off in the three layer neural network from the kernel point of view has been investigated by Bach (2017a). However, only a shallow network is analyzed. Moreover, the loss function is not assumed to be strongly convex, and thus the obtained rate is not faster than O(1/√n).

Another important topic for the analysis of the generalization ability is VC-dimension analysis (Bartlett et al., 1998; Karpinski and Macintyre, 1997; Goldberg and Jerrum, 1995). However, the VC-dimension is a notion independent of the input distributions. Here again, we note that the degree of freedom considered in our paper depends on the input distribution and is more data specific, and thus it gives a tighter bound and could be practically more useful.

7 Conclusion

In this paper, we proposed to use the integral form of the deep neural network for generalization error analysis, and based on that, we derived the generalization error bounds of the empirical risk minimizer and the Bayes estimator. The integral form enabled us to define an RKHS in each layer, and to import the theoretical techniques developed in kernel methods into the analysis of deep learning. In particular, we defined the degree of freedom of each RKHS and showed that the approximation error between a finite dimensional model and the integral form can be characterized by the degree of freedom. In addition to the approximation error, we also derived the estimation error in the finite dimensional model. We have observed that there appears a bias-variance trade-off depending on the size of the finite dimensional model, and the best choice of the size is characterized by the degree of freedom. Our theoretical investigation would be particularly useful to determine the optimal widths of the internal layers. We believe this study opens up a new direction for a series of theoretical analyses of deep learning.

Acknowledgement

TS was partially supported by MEXT Kakenhi (25730013, 25120012, 26280009, 15H01678 and 15H05707), JST-PRESTO and JST-CREST.

References

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017a.

F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017b.

P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487–1537, 2005.

P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.

P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC-dimension bounds for piecewise polynomial networks. Neural Computation, 10(8):2159–2173, 1998.

M. Bianchini and F. Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.

C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1613–1622, 2015.

A. Caponnetto and E. de Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

N. Cohen and A. Shashua. Convolutional rectifier networks as generalized tensor decompositions. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of JMLR Workshop and Conference Proceedings, pages 955–963, 2016.

N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In Proceedings of the 29th Annual Conference on Learning Theory, pages 698–728, 2016.

G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Proceedings of the 29th Annual Conference on Learning Theory, pages 907–940, 2016.

K. Fukumizu. Generalization error of linear neural networks in unidentifiable cases. In Proceedings of the 10th International Conference on Algorithmic Learning Theory, pages 51–62. Springer, 1999.

S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531, 2000.

E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143–1216, 2006.

X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, 2011.

P. W. Goldberg and M. R. Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18(2-3):131–148, 1995.

J. M. Hernandez-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1861–1869, 2015.

K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

M. Karpinski and A. Macintyre. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54(1):169–176, 1997.

V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.

A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012.

S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977–1991, 2002.

G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems 27, pages 2924–2932, 2014.

V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, 2010.

B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1376–1401, 2015.

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, pages 3360–3368, 2016.

I. Safran and O. Shamir. Depth separation in ReLU networks for approximating smooth non-linear functions. arXiv preprint arXiv:1610.09887, 2016.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

S. Sonoda and N. Murata. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 2015.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.

S. Sun, W. Chen, L. Wang, and T.-Y. Liu. Large margin deep neural networks: theory and algorithms. arXiv preprint arXiv:1506.05232, 2015.

A. W. van der Vaart and J. H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36(3):1435–1463, 2008.

A. W. van der Vaart and J. H. van Zanten. Information rates of nonparametric Gaussian process methods. Journal of Machine Learning Research, 12:2095–2119, 2011.

S. Watanabe. Learning efficiency of redundant neural networks in Bayesian estimation. IEEE Transactions on Neural Networks, 12(6):1475–1486, 2001.