arXiv:1808.04295v4 [cs.LG] 29 Nov 2018

Understanding training and generalization in deep learning by Fourier analysis

Zhi-Qin John Xu∗
New York University Abu Dhabi
Abu Dhabi 129188, United Arab Emirates
[email protected]

Abstract

Background: Why Deep Neural Networks (DNNs), which have many more parameters than training data and are trained by (stochastic) gradient-based methods, often achieve remarkably low generalization error remains an open theoretical question. Contribution: We study DNN training by Fourier analysis. Our theoretical framework explains: i) DNNs trained with (stochastic) gradient-based methods often give low-frequency components of the target function a higher priority during training; ii) small initialization leads to good generalization ability of DNNs while preserving their ability to fit any function. These results are further confirmed by experiments in which DNNs fit natural images, one-dimensional functions, and the MNIST dataset.

1 Introduction

Background

Deep learning has achieved great success in many fields (LeCun et al. (2015)). Recent studies have focused on understanding why DNNs trained by (stochastic) gradient-based methods generalize well, that is, why they often fit test data that were not used for training. Counter-intuitively, although DNNs have many more parameters than training data, they rarely overfit the training data in practice.

Several studies have focused on the local property (sharpness/flatness) of the loss function at minima (Hochreiter & Schmidhuber (1995)) to explore the DNN's generalization ability. Keskar et al. (2016) empirically demonstrated that with a small batch in each training step, DNNs consistently converge to flat minima, which leads to better generalization. However, Dinh et al. (2017) argued that most notions of flatness are problematic. To this end, Dinh et al.
(2017) used deep networks with rectified linear units (ReLUs) to show theoretically that, without specifying the parameterization, any minimum can be arbitrarily sharp or flat. Under the constraint of small weights in the parameterization, Wu et al. (2017) proved that for two-layer ReLU networks, low-complexity solutions lie in regions with a small Hessian, that is, in flat and large basins of attraction. They then concluded that, with gradient-based methods, a random initialization tends to produce starting parameters located in the basin of a flat minimum with high probability.

Several studies rely on the concept of stochastic approximation or uniform stability (Bousquet & Elisseeff (2002), Hardt et al. (2015)). To ensure the stability of a training algorithm, Hardt et al. (2015) assumed a loss function with good properties, such as Lipschitz or smoothness conditions. However, the loss function of a DNN is often very complicated (Zhang et al. (2016)).

∗ This work was done while Xu was a visiting member at the Courant Institute of Mathematical Sciences, New York University, New York, United States.
Preprint. Work in progress.

Another approach to understanding the DNN's generalization ability is to find general principles that hold during the DNN's training. Empirically, Arpit et al. (2017) suggested that DNNs may learn simple patterns first in real data, before memorizing. Xu et al. (2018) found a similar phenomenon empirically, referred to as the Frequency Principle (F-Principle): for a low-frequency dominant function, DNNs with common settings first quickly capture the dominant low-frequency components while keeping the amplitudes of high-frequency components small, and then relatively slowly capture those high-frequency components. The F-Principle can explain empirically how training leads DNNs to good generalization, by showing that the DNN prefers to fit the target function with a low-complexity function (Xu et al. (2018)). Rahaman et al.
(2018) found a similar result that “Lower Frequencies are Learned First”. To understand this phenomenon, Rahaman et al. (2018) derived an inequality showing that the amplitude of each frequency component of the DNN output is bounded by the spectral norm of the DNN weights. Rahaman et al. (2018) then showed empirically that, for a small DNN, the spectral norm (see footnote 2) increases gradually during training. The inequality therefore implies that “longer training allows the network to represent more complex functions by allowing it to also fit higher frequencies”. However, for a large DNN, the spectral norm barely changes during training (see an example in Fig. 5 in the Appendix), so the bound on the amplitude of each frequency component of the DNN output also barely changes. The inequality in Rahaman et al. (2018) therefore cannot explain why the F-Principle still holds for a large DNN.

Contribution

In this work, we develop a theoretical framework based on Fourier analysis to understand the training process and the generalization ability of DNNs with sufficiently many neurons and hidden layers. We show that for any parameter, the gradient descent magnitude in each frequency component of the loss function is proportional to the product of two factors: a decay term with respect to (w.r.t.) frequency, and the amplitude of the difference between the DNN output and the target function. This framework shows that DNNs trained by gradient-based methods give low-frequency components higher priority during the training process. The power spectrum of the tanh function decays exponentially w.r.t. frequency, with an exponential decay rate proportional to the inverse of the weight. We then show that small (large) initialization results in small (large) amplitudes of high-frequency components, thus leading the DNN output to a low (high) complexity function with good (bad) generalization ability.
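As a rough numerical illustration of the decay claim above, the following Python sketch (not from the paper) compares a direct numerical estimate of the tanh spectrum with the standard closed form π/(|w| sinh(π|k|/(2|w|))) of the unnormalized transform. The function names, the integration window and the step size are assumptions chosen for illustration, and the prefactor follows the convention F[g](k) = ∫ g(x) exp(−ikx) dx, so it may differ from the paper's by a constant.

```python
import cmath
import math


def tanh_spectrum_numeric(k, w, b=0.0, half_width=60.0, step=0.05):
    """Estimate |F[tanh(w x + b)](k)| for k != 0 and w > 0.

    tanh itself is not integrable, so the sum runs over its derivative
    w * sech^2(w x + b), whose transform is i k times that of tanh.
    """
    n = int(round(2 * half_width / step))
    total = 0j
    for i in range(n + 1):
        x = -half_width + i * step
        total += w / math.cosh(w * x + b) ** 2 * cmath.exp(-1j * k * x)
    return abs(total * step) / abs(k)


def tanh_spectrum_closed_form(k, w):
    """Closed-form magnitude; it behaves like exp(-pi |k| / (2 |w|)) for large |k / w|."""
    return math.pi / (abs(w) * math.sinh(math.pi * abs(k) / (2 * abs(w))))


# Same frequency, two weight scales (illustrative values only).
k = 4.0
small_w = tanh_spectrum_closed_form(k, w=0.5)
large_w = tanh_spectrum_closed_form(k, w=2.0)
print(small_w, large_w)
```

At a fixed frequency, the smaller weight gives a much smaller amplitude, which is the mechanism behind the statement about small initialization.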
Therefore, with small initialization, sufficiently large DNNs can fit any function (Cybenko (1989)) while keeping good generalization. We demonstrate that the analysis in this work can be qualitatively extended to general DNNs. We illustrate our theoretical results with DNNs fitting natural images, 1-d functions, and the MNIST dataset (LeCun (1998)).

The paper is organized as follows. The common settings of DNNs in this work are presented in Section 2. The theoretical framework is given in Section 3. We then study empirically the evolution of the mean magnitude of DNN parameters during training in Section 4. The theoretical framework is validated by experiments in Section 5. Conclusions and discussion follow in Section 6.

2 Methods

The activation function for each neuron is tanh. We use DNNs with multiple hidden layers and no activation function for the output layer. The DNN is trained by the Adam optimizer (Kingma & Ba (2014)), with its parameters set to the default values (Kingma & Ba (2014)). The loss function is the mean squared error between the DNN output and the target function on the training set.

Footnote 2: In Rahaman et al. (2018), “for matrix-valued weights, their spectral norm was computed by evaluating the eigenvalue of the eigenvector obtained with 10 power iterations. For vector-valued weights, we simply use the L2 norm”.

3 Theoretical framework

In this section, we develop a theoretical framework in the Fourier domain to understand the training process of DNNs. For illustration, we first use a DNN with one hidden layer and the tanh function $\sigma(x)$ as the activation function:

$$\sigma(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{-x} + e^{x}}, \quad x \in \mathbb{R}.$$

The Fourier transform of $\sigma(wx + b)$ with $w, b \in \mathbb{R}$ is

$$\mathcal{F}[\sigma(wx + b)](k) = \sqrt{\frac{\pi}{2}}\,\delta(k) + \sqrt{\frac{\pi}{2}}\,\frac{i}{|w|}\exp(-ibk/w)\,\frac{1}{\exp(\pi k/2w) - \exp(-\pi k/2w)}, \tag{1}$$

where

$$\mathcal{F}[\sigma(x)](k) = \int_{-\infty}^{\infty} \sigma(x)\exp(-ikx)\,dx.$$

Consider a DNN with one hidden layer with $N$ nodes, 1-d input $x$ and 1-d output:

$$\Upsilon(x) = \sum_{j=1}^{N} a_j\,\sigma(w_j x + b_j), \quad a_j, w_j, b_j \in \mathbb{R}. \tag{2}$$

We refer to all $w_j$, $a_j$ and $b_j$ as parameters, where $w_j$ and $a_j$ are weights and $b_j$ is a bias term. When $|\pi k/w_j|$ is large, without loss of generality we assume $\pi k/w_j \gg 0$ and $w_j > 0$, so that

$$\mathcal{F}[\Upsilon](k) \approx \sum_{j=1}^{N} a_j \sqrt{\frac{\pi}{2}}\,\frac{i}{w_j}\exp\bigl(-(ib_j + \pi/2)k/w_j\bigr). \tag{3}$$

We define the difference between the DNN output and the target function $f(x)$ at each $k$ as

$$D(k) \triangleq \mathcal{F}[\Upsilon](k) - \mathcal{F}[f](k).$$

Write $D(k)$ as

$$D(k) = A(k)\,e^{i\theta(k)}, \tag{4}$$

where $A(k)$ and $\theta(k) \in [-\pi, \pi]$ denote the amplitude and phase of $D(k)$, respectively. The loss at frequency $k$ is $L(k) = \frac{1}{2}|D(k)|^{2}$, where $|\cdot|$ denotes the norm of a complex number. The total loss function is defined as

$$L = \sum_{k} L(k).$$

According to Parseval's theorem (see footnote 3), this loss function in the Fourier domain is equal to the commonly used mean squared error loss, that is,

$$L = \frac{1}{2}\sum_{x} \bigl(\Upsilon(x) - f(x)\bigr)^{2}. \tag{5}$$

Footnote 3: Without loss of generality, the constant factor that is related to the definition of the corresponding Fourier transform is ignored here.

At frequency $k$, the amplitude of the gradient with respect to each parameter can be obtained (see Appendix for details):

$$\frac{\partial L(k)}{\partial a_j} = i(C_1 - 1)\,\frac{C_0}{w_j}\,A(k)\exp(-\pi k/2w_j), \tag{6}$$

$$\frac{\partial L(k)}{\partial w_j} = C_0 C_2\,\frac{a_j}{2w_j^{3}}\,A(k)\exp(-\pi k/2w_j), \tag{7}$$

$$\frac{\partial L(k)}{\partial b_j} = (C_1 - 1)\,C_0\,\frac{a_j k}{w_j^{2}}\,A(k)\exp(-\pi k/2w_j), \tag{8}$$

where

$$C_0 = \sqrt{\frac{\pi}{2}}\,e^{i[\theta(k) + b_j k/w_j]}, \tag{9}$$

$$C_1 = \exp\bigl(-2i(b_j k/w_j + \theta(k))\bigr), \tag{10}$$

$$C_2 = \bigl[C_1\bigl(i(\pi k - 2w_j) - 2b_j k\bigr) + \bigl(-i(\pi k - 2w_j) - 2b_j k\bigr)\bigr]. \tag{11}$$

The descent amount in any direction, say, with respect to parameter $\Theta_{jl}$, is

$$\frac{\partial L}{\partial \Theta_{jl}} = \sum_{k} \frac{\partial L(k)}{\partial \Theta_{jl}}.$$
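To make the role of the decay factor exp(−πk/2w_j) concrete, here is a minimal Python sketch (not from the paper) that evaluates D(k) from the approximation in Eq. (3), differentiates L(k) = ½|D(k)|² with respect to a_j, and checks the result against a finite difference. The parameter values and the target coefficient F[f](k) are hypothetical, and all function names are assumptions. The sketch checks only the dependence on k, w_j and A(k); constant prefactors depend on convention and are not compared with Eq. (6).

```python
import cmath
import math

SQRT_HALF_PI = math.sqrt(math.pi / 2)


def unit_term(k, w_j, b_j):
    """Contribution of one hidden unit to F[Upsilon](k) per unit of a_j, from Eq. (3)."""
    return SQRT_HALF_PI * (1j / w_j) * cmath.exp(-(1j * b_j + math.pi / 2) * k / w_j)


def difference(k, a, w, b, target_k):
    """D(k) = F[Upsilon](k) - F[f](k), using the large-|pi k / w_j| approximation."""
    output_k = sum(a_j * unit_term(k, w_j, b_j) for a_j, w_j, b_j in zip(a, w, b))
    return output_k - target_k


def loss_k(k, a, w, b, target_k):
    """L(k) = 0.5 * |D(k)|^2."""
    return 0.5 * abs(difference(k, a, w, b, target_k)) ** 2


def grad_a_analytic(k, a, w, b, target_k, j):
    """dL(k)/da_j = Re(conj(D(k)) * dD(k)/da_j)."""
    d = difference(k, a, w, b, target_k)
    return (d.conjugate() * unit_term(k, w[j], b[j])).real


def grad_a_numeric(k, a, w, b, target_k, j, eps=1e-6):
    """Central finite difference of L(k) with respect to a_j."""
    a_plus, a_minus = list(a), list(a)
    a_plus[j] += eps
    a_minus[j] -= eps
    return (loss_k(k, a_plus, w, b, target_k) - loss_k(k, a_minus, w, b, target_k)) / (2 * eps)


# Hypothetical parameters and target coefficient, for illustration only.
a = [0.7, -0.4, 1.1]
w = [1.0, 0.5, 2.0]
b = [0.3, -0.2, 0.1]
target_k = 0.2 - 0.1j
k, j = 3.0, 0

grad_analytic = grad_a_analytic(k, a, w, b, target_k, j)
grad_numeric = grad_a_numeric(k, a, w, b, target_k, j)

# The gradient magnitude cannot exceed A(k) times the frequency decay term.
amplitude = abs(difference(k, a, w, b, target_k))
bound = SQRT_HALF_PI / w[j] * amplitude * math.exp(-math.pi * k / (2 * w[j]))
print(grad_analytic, grad_numeric, bound)
```

The bound is the product of the two factors named in the text, the amplitude A(k) and a term that decays exponentially in k/w_j, so at equal amplitude a higher frequency or a smaller weight yields a smaller gradient.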
