Fast generalization error bound of deep learning from a kernel perspective

Taiji Suzuki ([email protected])
Graduate School of Information Science and Technology, The University of Tokyo
PRESTO, Japan Science and Technology Agency, Japan
Center for Advanced Integrated Intelligence Research, RIKEN, Tokyo, Japan

Abstract

We develop a new theoretical framework to analyze the generalization error of deep learning, and derive a new fast learning rate for two representative algorithms: empirical risk minimization and Bayesian deep learning. The series of theoretical analyses of deep learning has revealed its high expressive power and universal approximation capability. Our point of view is to deal with the ordinary finite dimensional deep neural network as a finite approximation of the infinite dimensional one. Our formulation of the infinite dimensional model naturally defines a reproducing kernel Hilbert space corresponding to each layer. The approximation error is evaluated by the degree of freedom of the reproducing kernel Hilbert space in each layer. We derive the generalization error bound of both empirical risk minimization and Bayesian deep learning, and it is shown that a bias-variance trade-off appears in terms of the number of parameters of the finite dimensional approximation. We show that the optimal width of the internal layers can be determined through the degree of freedom, and we derive an optimal convergence rate that is faster than the $O(1/\sqrt{n})$ rate shown in existing studies.

1 Introduction

Deep learning has been showing great success in several applications such as computer vision, natural language processing, and many other areas related to pattern recognition. Several high-performance methods have been developed, and it has been revealed that deep learning possesses great potential. Despite the development of practical methodologies, its theoretical understanding is not satisfactory. A wide range of researchers, including theoreticians and practitioners, are expecting a deeper understanding of deep learning.

Among theories of deep learning, a well developed topic is its expressive power. It has been theoretically shown that deep neural networks have exponentially large expressive power with respect to the number of layers, using several mathematical notions (Montufar et al., 2014; Bianchini and Scarselli, 2014; Cohen et al., 2016; Cohen and Shashua, 2016; Poole et al., 2016). Another important issue in neural network theories is the universal approximation capability. It is well known that 3-layer neural networks have this ability, and thus deep neural networks do as well (Cybenko, 1989; Hornik, 1991; Sonoda and Murata, 2015). When we discuss the universal approximation capability, the target function to be approximated is arbitrary, and the theory is highly nonparametric in nature.

Once we know the expressive power and universal approximation capability of deep neural networks, the next theoretical question naturally arises in their generalization error. The generalization ability is typically analyzed by evaluating the Rademacher complexity (see Bartlett (1998); Koltchinskii and Panchenko (2002); Neyshabur et al. (2015); Sun et al. (2015)). As a whole, these studies have characterized the generalization ability based on the properties of the weights (parameters), but the properties of the input distributions are not well incorporated, which is not satisfactory for obtaining a sharper bound.
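For context, the route taken by these weight-based analyses is the standard uniform deviation bound via Rademacher complexity. The generic form below is a textbook statement in the spirit of the works cited above, not a result reproduced from this paper, and the symbols $\mathcal{G}$, $Z_i$, $\varepsilon_i$, and $\mathfrak{R}_n$ are generic notation rather than this paper's: for a class $\mathcal{G}$ of loss functions with values in $[0,1]$ and an i.i.d. sample $Z_1, \ldots, Z_n$, with probability at least $1-\delta$,

\[
\sup_{g \in \mathcal{G}} \Big( \mathbb{E}[g(Z)] - \frac{1}{n}\sum_{i=1}^{n} g(Z_i) \Big)
\;\le\; 2\,\mathfrak{R}_n(\mathcal{G}) + \sqrt{\frac{\log(1/\delta)}{2n}},
\qquad
\mathfrak{R}_n(\mathcal{G}) = \mathbb{E}\Big[\sup_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\, g(Z_i)\Big],
\]

where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. Rademacher signs. Since $\mathfrak{R}_n(\mathcal{G})$ is typically of order $O(1/\sqrt{n})$ under norm constraints on the weights, this route yields $O(1/\sqrt{n})$ generalization rates, which is the baseline the analysis in this paper improves upon.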
Moreover, the generalization error bound has mainly been given for finite dimensional models. As we have observed, the deep neural network possesses exponential expressive power and universal approximation capability, which are highly nonparametric characterizations. This means that the theories have been developed separately in two regimes: the finite dimensional parametric model and the infinite dimensional nonparametric model. Therefore, theories that connect these two regimes are needed for a comprehensive understanding of the statistical performance of deep learning.

In this paper, we consider both empirical risk minimization and Bayesian deep learning and analyze the generalization error using the terminology of kernel methods. The main purpose of the analysis is to find what kind of quantity affects the optimal structure of deep learning. In particular, we see that the generalization error is strongly affected by the behavior of the eigenvalues of the (non-centered) covariance matrix of the output of each internal layer. Figure 1 shows the distribution of the eigenvalues of the VGG-13 network, where the eigenvalues are ordered in increasing order. We see that the distribution is distorted in the sense that a few eigenvalues are large and the smallest eigenvalue is much smaller than the largest one. This means that the effective information is contained in a low-dimensional subspace. Our analysis aims to find the optimal dimension of this low dimensional subspace by borrowing the techniques developed in the analysis of kernel methods.

Figure 1: Distribution of eigenvalues in the 9-th layer of VGG for the CIFAR-10 dataset.

Consequently, (i) we derive a faster learning rate than $O(1/\sqrt{n})$ and (ii) we connect the finite dimensional regime and the infinite dimensional regime based on the theories of kernel methods. To analyze a sharper generalization error bound, we utilize the so-called local Rademacher complexity technique for the empirical risk minimization method (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii, 2006; Giné and Koltchinskii, 2006), and, as for the Bayesian method, we employ the theoretical techniques developed to analyze nonparametric Bayes methods (Ghosal et al., 2000; van der Vaart and van Zanten, 2008, 2011). These analyses are quite advantageous compared to the typical Rademacher complexity analysis because we can obtain a convergence rate between $O(1/n)$ and $O(1/\sqrt{n})$, which is faster than the $O(1/\sqrt{n})$ rate of the standard Rademacher complexity analysis, by making use of favorable properties of the loss function such as strong convexity.

As for the second contribution, we first introduce an integral form of the deep neural network, as performed in the research on the universal approximation capability of 3-layer neural networks (Sonoda and Murata, 2015). Based on this integral form, a reproducing kernel Hilbert space (RKHS) naturally arises. Due to this formulation, we can borrow the terminology developed for kernel methods in the analysis of deep learning. In particular, we define the degree of freedom of the RKHS as a measure of complexity of the RKHS (Caponnetto and de Vito, 2007; Bach, 2017b), and based on that, we evaluate how large a finite dimensional model should be to approximate the original infinite dimensional model with a specified precision. These theoretical developments reveal that a bias-variance trade-off appears. An interesting observation here is that this bias-variance trade-off is completely characterized by the behavior of the eigenvalues of the kernel functions corresponding to the RKHSs. In short, if the distribution of eigenvalues is peaky (that is, the decreasing rate of the ordered eigenvalues is fast), then the optimal width (the number of units in each layer) can be small and consequently the variance can be reduced. This indicates that the notion of the degree of freedom gives a practical implication for determining the width of the internal layers (a numerical illustration of this computation is given after Table 1).

The obtained generalization error bound is summarized in Table 1, where $a \vee b$ indicates $\max\{a, b\}$.

Table 1: Summary of the derived bounds for the generalization error $\|\hat{f} - f^o\|_{L_2(P_X)}^2$, where $n$ is the sample size, $R$ is the norm of the weights in the internal layers, $\hat{R}_\infty$ is an $L_\infty$-norm bound of the functions in the model, $\sigma$ is the observation noise, $d_x$ is the dimension of the input, $m_\ell$ is the width of the $\ell$-th internal layer, and $N_\ell(\lambda)$ for $\lambda > 0$ is the degree of freedom (Eq. (5)).

General setting: $\displaystyle \sum_{\ell=2}^{L} R^{2(L-\ell+1)} \lambda_\ell + \frac{\sigma^2 + \hat{R}_\infty^2}{n} \sum_{\ell=1}^{L} m_\ell m_{\ell+1} \log(n)$, under the assumption that $m_\ell \gtrsim N_\ell(\lambda_\ell) \log(N_\ell(\lambda_\ell))$.

Finite dimensional model: $\displaystyle \frac{\sigma^2 + \hat{R}_\infty^2}{n} \sum_{\ell=1}^{L} m_\ell^* m_{\ell+1}^* \log(n)$, where $m_\ell^*$ is the true width of the $\ell$-th internal layer.

Polynomial decay eigenvalue: $\displaystyle \sum_{\ell=2}^{L} (R \vee 1)^{2(L-\ell+1)} n^{-\frac{1}{1+2s_\ell}} \log(n) + \frac{d_x}{n} \log(n)$, where $s_\ell$ is the decay rate of the eigenvalues of the kernel function on the $\ell$-th layer.
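To make the role of the degree of freedom concrete, the sketch below estimates the eigenvalues of the non-centered covariance matrix of one internal layer's outputs (the quantity plotted in Figure 1) and evaluates a degree of freedom of the form $N(\lambda) = \sum_j \mu_j / (\mu_j + \lambda)$, the standard spectral definition of Caponnetto and de Vito (2007) and Bach (2017b). Eq. (5) of the paper is not reproduced in this excerpt, so this exact form, together with the function and variable names, is an illustrative assumption rather than the paper's own code.

```python
import numpy as np

def layer_degree_of_freedom(activations, lam):
    """Spectral degree of freedom of one internal layer.

    `activations` is an (n_samples, width) array containing the outputs of the
    layer on n_samples inputs.  The eigenvalues mu_j of the non-centered
    covariance matrix (1/n) A^T A are the quantities plotted in Figure 1.
    N(lam) = sum_j mu_j / (mu_j + lam) is the standard spectral degree of
    freedom; Eq. (5) of the paper is assumed to take this form.
    """
    n = activations.shape[0]
    cov = activations.T @ activations / n   # non-centered covariance matrix
    mu = np.linalg.eigvalsh(cov)            # eigenvalues in increasing order
    mu = np.clip(mu, 0.0, None)             # remove tiny negative round-off
    dof = float(np.sum(mu / (mu + lam)))
    return dof, mu

# Toy activations whose covariance spectrum decays polynomially (mu_j ~ j^{-2}),
# mimicking the distorted spectrum in Figure 1: a few large eigenvalues and
# many small ones.
rng = np.random.default_rng(0)
n, width = 5000, 256
decay = np.arange(1, width + 1, dtype=float) ** -2.0
activations = rng.standard_normal((n, width)) * np.sqrt(decay)

lam = 1e-2
dof, mu = layer_degree_of_freedom(activations, lam)
# The condition in Table 1 asks for a width of order N(lam) * log(N(lam)),
# which is much smaller than the nominal width when the spectrum is peaky.
suggested_width = int(np.ceil(dof * np.log(max(dof, np.e))))
print(f"largest eigenvalue  : {mu[-1]:.3e}")
print(f"smallest eigenvalue : {mu[0]:.3e}")
print(f"N(lam={lam}) = {dof:.1f}, suggested width ~ {suggested_width} (nominal width {width})")
```

In this toy run the degree of freedom is an order of magnitude smaller than the nominal width, which is exactly the situation in which the bounds of Table 1 suggest that narrow internal layers suffice.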
2 Integral representation of deep neural network

Here we give our problem settings and the model that we consider in this paper. Suppose that $n$ input-output observations $D_n = (x_i, y_i)_{i=1}^{n} \subset \mathbb{R}^{d_x} \times \mathbb{R}$ are independently identically generated from a regression model
\[
y_i = f^o(x_i) + \xi_i \quad (i = 1, \ldots, n),
\]
where $(\xi_i)_{i=1}^{n}$ is an i.i.d. sequence of Gaussian noises $N(0, \sigma^2)$ with mean 0 and variance $\sigma^2$, and $(x_i)_{i=1}^{n}$ is generated independently identically from a distribution $P_X(X)$ with a compact support in $\mathbb{R}^{d_x}$. The purpose of the deep learning problem we consider in this paper is to estimate $f^o$ from the $n$ observations $D_n$. We may consider other situations such as classification with margin assumptions; however, just for theoretical simplicity, we consider the simplest regression problem.

To analyze the generalization ability of deep learning, we specify a function class in which the true function $f^o$ is included, and, by doing so, we characterize the "complexity" of the true function in a correct way.
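Under this observation model, the generalization error reported in Table 1 is the squared $L_2(P_X)$ distance between an estimator $\hat{f}$ and the true function $f^o$. The identity below is not stated in this excerpt, but it follows directly from the Gaussian noise assumption (conditioning on the training data $D_n$) and explains why this distance is the natural measure of predictive performance:

\[
\|\hat{f} - f^o\|_{L_2(P_X)}^2
= \int \big(\hat{f}(x) - f^o(x)\big)^2 \, \mathrm{d}P_X(x)
= \mathbb{E}_{X, Y}\big[(Y - \hat{f}(X))^2\big] - \sigma^2,
\]

where $(X, Y)$ is a new observation drawn from the same model, that is, $Y = f^o(X) + \xi$ with $\xi \sim N(0, \sigma^2)$ independent of $X$. In other words, up to the irreducible noise level $\sigma^2$, bounding $\|\hat{f} - f^o\|_{L_2(P_X)}^2$ bounds the excess predictive risk of $\hat{f}$.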