Frequency Principle in deep learning: an overview

∗ Zhi-Qin John Xu [email protected] School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC and Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, 200240, P.R. China https://ins.sjtu.edu.cn/people/xuzhiqin/

Abstract: Understanding deep learning has become increasingly urgent as it penetrates more and more industries and sciences. In recent years, a line of research based on Fourier analysis has shed light on this “black box” by establishing a Frequency Principle (F-Principle, or spectral bias) of the training behavior of deep neural networks (DNNs): DNNs often fit functions from low to high frequency during training. The F-Principle was first demonstrated on one-dimensional synthetic data and then verified on high-dimensional real datasets. A series of subsequent works further established its validity. The low-frequency implicit bias reveals the strength of neural networks at learning low-frequency functions and their difficulty in learning high-frequency functions. Such understanding inspires the design of DNN-based algorithms for practical problems, explains phenomena emerging in various scenarios, and further advances the study of deep learning from the frequency perspective. Although necessarily incomplete, we provide an overview of the F-Principle and propose some future research problems.
Keywords: Neural network, frequency principle, deep learning, generalization, training, optimization

∗. February 25, 2021

Contents

1 Introduction

2 A brief history of F-Principle

3 Empirical study of F-Principle
  3.1 One-dimensional experiments
  3.2 Frequency in high-dimensional problems
    3.2.1 Projection method
    3.2.2 Filtering method

4 Theoretical study of F-Principle
  4.1 Idealized setting
  4.2 General setting
  4.3 A continuous view point
  4.4 NTK setting and linear F-Principle
    4.4.1 NTK dynamics
    4.4.2 Eigen analysis for two-layer NN
    4.4.3 Linear F-Principle for two-layer neural network

5 Generalization
  5.1 Strength and weakness
  5.2 Early stopping
  5.3 Quantitative understanding in NTK regime

6 Algorithms and scientific computing
  6.1 The idea of DNN-based algorithms for solving a PDE
    6.1.1 A Ritz variational method for PDE
    6.1.2 MscaleDNN least square error method for PDEs
  6.2 Difference from traditional algorithms
    6.2.1 Iterative methods
    6.2.2 Ritz-Galerkin (R-G) method
  6.3 Algorithm design to overcome the curse of high-frequency
    6.3.1 Frequency scaled DNNs
    6.3.2 Activation function with compact support
    6.3.3 Two MscaleDNN structures
    6.3.4 Experiments: A square domain with a few holes
    6.3.5 High dimensional examples

7 Understanding
  7.1 Frequency perspective for understanding phenomena
  7.2 Inspiring the design of algorithm

8 Anti-F-Principle

9 Deep frequency principle
  9.1 Ratio Density Function (RDF)
  9.2 An example: Deep frequency principle on variants of Resnet18

10 Conclusion

1. Introduction

Deep neural networks (DNNs) have achieved tremendous success in many applications, such as computer vision, speech recognition, speech translation, and natural language processing. In addition, DNNs are becoming an indispensable tool for scientific problems. However, DNNs sometimes fail and cause critical issues in applications. Such a “black-box” system has permeated many important industries and scientific fields, and providing a satisfactory interpretation of DNNs is as urgent and important as understanding nature itself. The fundamental study of deep learning is still largely open, although the extensive application of deep learning has accumulated much empirical observation that inspires many heuristic tricks for tuning DNNs. A set of problems are well known but yet to be solved, such as how to set the depth and width, how to choose a good structure, how to estimate the approximation, generalization and training errors, and how to make neural networks (NNs) more robust. A critical issue, however, is how to formulate these hard problems in a way that solving one sub-problem leads to another solvable one. Put simply, it is unclear which problems are relatively simple to begin with, yet lead to more solvable problems and to significant understanding of deep learning.

The study of deep learning resembles the study of classical mechanics, which developed from Tycho Brahe and Johannes Kepler to Galileo Galilei and Isaac Newton, while deep learning is still at the stage of (maybe before) Tycho Brahe. Extensive application alone, similar to the everyday experience of physical laws, leaves no path to deeper understanding. Tycho Brahe recorded tons of planetary data, and Johannes Kepler distilled three empirical laws. Based on Kepler's work, Galileo Galilei and Isaac Newton designed idealized, simple experiments and formulated the simple but effective Newton's laws, which apply to both simple objects and complex systems. For deep learning, much experience has produced much heuristic understanding. It is time to design experiments to identify stable phenomena in deep learning. To this end, since real data are complicated and possess many unknown properties, we have to consider carefully what kind of data and network structure to start with. In addition, similar to position and velocity in a physical system, we also need to design suitable quantities to characterize NNs, based on which we can empirically conclude a set of laws. These empirical laws can be further verified in complex NNs and lead to theoretical study. Just as James Maxwell wrote down a set of equations that finally explained electromagnetic waves, building on a series of works by Gauss, Faraday, Ampère and others, we might eventually find a unified framework based on various empirical laws to understand deep learning.

In this paper, we review the Fourier analysis of the training behavior of NNs as an example of applying the above philosophy to studying deep learning. An initial motivation for utilizing Fourier analysis was to study the generalization and training speed of NNs, which are central issues in deep learning. The study of generalization in deep learning has attracted much attention in recent years due to its contradiction with traditional wisdom (Breiman, 1995; Zhang et al., 2017), as illustrated in the following.
Although the universal approximation theorem shows the powerful approximation capability of DNNs by skillfully constructing a set of parameters, there is no guarantee that training can find a set of parameters that fits not only the training data but also the data unseen during training (test data). Traditional wisdom suggests that a model with too many parameters can easily overfit the data. For example, in one-dimensional interpolation, polynomial functions with higher-order terms, i.e., more parameters, easily cause significant oscillation at test points (the Runge phenomenon). However, DNNs, equipped with many more parameters than training data, often interpolate one-dimensional data with a relatively flat function in commonly used settings (such as small initial parameters) (Wu et al., 2017). For high-dimensional classification problems, over-parameterized DNNs also often generalize well (Breiman, 1995; Zhang et al., 2017). This generalization, in contradiction with traditional wisdom, is of particular interest. To study it, one has to be cautious: the no-free-lunch theorem implies that for any method we can find a dataset on which the method generalizes badly. Therefore, to study the generalization puzzle of over-parameterized DNNs on real datasets, we have to study the DNN algorithm and the real dataset separately. If the characteristics of the algorithm are consistent with those of the real dataset, then the algorithm generalizes well; otherwise, it generalizes badly. The training of DNNs is performed with no explicit constraints; therefore, it is important to study the implicit bias of the training of DNNs. The heuristic understanding from extensive application is that DNNs may be implicitly biased towards simple functions.

To motivate an accurate form of the implicit bias of deep learning, we would like to use a comic dialog to emphasize the importance of synthetic data, in the role of ideal experiments in physics. That is: A: I am looking for the quarter I dropped. B: Did you drop it here? A: No, I dropped it two blocks down the street. B: Then why are you looking for it here? A: Because the light is better here. This comic is often used to criticize those who only solve the problems that can be solved rather than the ones that are important. We slightly revise this comic by replacing the last sentence with “A: Because I need to get familiar with the road structure first.” A carefully designed synthetic dataset is simple enough for clearly analyzing the learning process yet preserves the properties of interest of complicated datasets. Conclusions drawn from such synthetic data, a bright street, can then be used to explore high-dimensional data, a dark street.

As discussed above, the generalization puzzle of over-parameterized DNNs is non-trivial even for one-dimensional data. Therefore, we study the training process of DNNs learning one-dimensional data sampled from $\sin(x) + \sin(5x)$, as shown in Fig. 1. The NN first captures the landscape of the target function, followed by more and more details of the fluctuations. The oscillation and flatness discussed in the above generalization issue and in this one-dimensional learning process hint that frequency, which naturally describes these concepts, may be an important quantity for characterizing the training process of DNNs. To visualize or characterize the training process clearly requires a Fourier transform of the training data. However, the Fourier transform of high-dimensional data suffers from the curse of dimensionality, and the visualization of high-dimensional data is difficult. Although the problem of high-dimensional data is the one that matters, it lies in a dark street. Alternatively, we can study the problem of one-dimensional data, which lies in a perfectly bright street. A series of experiments on synthetic data, followed by real datasets, confirm that

DNNs often fit target functions from low to high frequencies during the training.

This implicit bias is named the frequency principle (F-Principle) (Xu et al., 2019a,b) or spectral bias (Rahaman et al., 2019) and can be robustly observed no matter how overparameterized the NNs are. Xu et al. (2019b) proposed a key mechanism of the F-Principle: the regularity of the activation function converts into the decay rate of the loss function in the frequency domain. Theoretical studies


Figure 1: Illustration of the training process of a DNN. Black dots are training data sampled from the target function $\sin(x) + \sin(5x)$. Cyan, blue and red curves indicate $f_{\theta(t)}(x)$ at training epochs $t = 0, 2000, 17000$, respectively. Reprinted from (Zhang et al., 2021).

subsequently show that the F-Principle holds in a general setting with infinite samples (Luo et al., 2019) and in the regime of wide NNs (the Neural Tangent Kernel (NTK) regime (Jacot et al., 2018)) with finite samples (Zhang et al., 2019; Luo et al., 2020c) or with samples distributed uniformly on a sphere (Cao et al., 2019; Yang and Salman, 2019; Basri et al., 2019; Bordelon et al., 2020). E et al. (2020) show that the integral equation naturally leads to the F-Principle. In addition to characterizing the training speed of NNs, the F-Principle also implies that NNs prefer low-frequency functions and generalize well for low-frequency functions (Xu et al., 2019b; Zhang et al., 2019; Luo et al., 2020c). The F-Principle further inspires the design of DNNs that learn high-frequency functions quickly (Liu et al., 2020; Wang et al., 2020a; Jagtap et al., 2020; Cai et al., 2019; Biland et al., 2019; Li et al., 2020b). In addition, the F-Principle provides a mechanism for understanding many phenomena in applications and inspires a series of studies on deep learning from the frequency perspective. The study of deep learning is a highly inter-disciplinary problem. For example, Fourier analysis, a signal-processing approach, is a useful tool for better understanding deep learning (Giryes and Bruna, 2020). A comprehensive understanding of deep learning remains an exciting research subject calling for more fusion of existing approaches and the development of new methods.

2. A brief history of F-Principle

The story of the F-Principle dates back to October 2017, when David Cai, who was a professor at both the Courant Institute at New York University and the Institute of Natural Sciences at Shanghai Jiao Tong University and advised Zhi-Qin John Xu during his PhD and postdoc, passed away. Xu had been working on computational neuroscience under the supervision of Prof. Cai but was very confused about how to study the biological brain. One project Xu was involved in was reconstructing the connectivity among neurons. A natural question is what we can do if we do have the connectivity map of an NN. The answer to this question is unknown. This motivated Xu to study the function of a network with given connectivity and dynamics, for which an artificial NN is a perfect model to start with.

Xu, Yaoyu Zhang and Yanyang Xiao, who were also advised by Prof. Cai during their PhD and postdoc periods, began to read papers and discuss a lot. Each of them had different ideas for starting deep learning research. Xu's approach was to learn TensorFlow first. The first project Xu tried to program was an NN that can fit any data. Thanks to Prof. Cai, who gave Xu a laptop with a GPU in his very last few days, Xu had the computational power to run an NN to fit various synthetic functions. Xu noticed that the network learned low-frequency components fast but high-frequency components slowly, and named this the F-Principle. Xu often discussed with Yaoyu Zhang and tried to convince him with more and more experiments. Xu also learned some basic knowledge of the Fourier transform from Yanyang Xiao. Fortunately, Prof. David W. MacLaughlin, the adviser of Prof. Cai during his postdoc, took over the mentorship after the passing of Prof. Cai. Xu regularly reported to Prof. MacLaughlin and obtained much encouragement to explore the F-Principle. During Xu's period at Courant, Prof. Charles Peskin gave a lecture about the reconstruction of signals in the retina (Peskin et al., 1984) and Prof. Leslie Greengard gave a talk about how to reconstruct Cryo-EM data (Barnett et al., 2017), both of which involve similar ideas of learning from low to high frequency. Therefore, Xu discussed the F-Principle with both professors occasionally from December 2017 to May 2019.

The first draft of the F-Principle paper (Xu et al., 2019a) was finished in March 2018 by Xu and revised through many iterations by Xu, Yaoyu Zhang and Yanyang Xiao. We hesitated to submit the paper to arXiv because we were unfamiliar with arXiv. In March 2018, there was a conference held at New York University. At the conference, Xu talked to Prof. Yoshua Bengio about the F-Principle. Prof. Bengio commented that the F-Principle may be just a simple consequence of stochastic gradient descent. Xu emphasized that the F-Principle is a very general phenomenon and that, by studying 1-d synthetic data, it can be clearly verified by experiments. We submitted the paper to NeurIPS in May 2018 and submitted it to arXiv in early July 2018 (Xu et al., 2019a). We found a paper from Prof. Bengio's group published on arXiv a few days before ours, which studies the same phenomenon (named spectral bias) and also emphasizes that it is not a consequence of stochastic gradient descent (Rahaman et al., 2019). After a brief communication, both papers acknowledged that the low-frequency bias in DNNs was found independently. In August 2018, Xu (2018b) proposed a simple theory for a two-layer NN with tanh activation function to demonstrate that the regularity of the activation function is the key to the F-Principle. The second version of Rahaman et al. (2019) adopted this simple theory of Xu (2018b) for networks with ReLU activation function. Xu (2018a) further extended the experiments on the F-Principle from the mean squared loss to the cross-entropy loss. Later, Xu et al. (2019b) presented more experimental evidence for high-dimensional datasets and deep network structures, and a rigorous but simple theory for the F-Principle based on the two previous papers (Xu, 2018b,a). In May 2019, several theories were developed to study the F-Principle (Luo et al., 2019; Zhang et al., 2019; Basri et al., 2019). More and more subsequent works on empirical study, theoretical study and algorithm design related to the F-Principle have emerged.

3. Empirical study of F-Principle

Before the discovery of the F-Principle, some works had suggested that the learning of DNNs may follow an order from simple to complex (Arpit et al., 2017). However, previous empirical studies focused on real datasets, which are high-dimensional; thus, it is difficult to find a suitable quantity to characterize this intuition. In this section, we review the empirical study of the F-Principle, which first presents a clear picture for one-dimensional data and then carefully designs experiments to verify the F-Principle in high-dimensional data (Xu et al., 2019a,b; Rahaman et al., 2019).


Figure 2: 1-d input. (a) $f(x)$. Inset: $|\hat{f}(k)|$. (b) $\Delta_F(k)$ of three important frequencies (indicated by black dots in the inset of (a)) against different training epochs. Reprinted from Xu et al. (2019b).

3.1 One-dimensional experiments

To clearly illustrate the F-Principle, one can use 1-d synthetic data to show the relative error of different frequencies during the training of a DNN. The following shows an example from Xu et al. (2019b).

Training samples are drawn from a 1-d target function $f(x) = \sin(x) + \sin(3x) + \sin(5x)$ with three important frequency components, evenly spaced in $[-3.14, 3.14]$, i.e., $\{x_i, f(x_i)\}_{i=0}^{n-1}$. The discrete Fourier transform (DFT) of $f(x)$ or of the DNN output (denoted by $h(x)$) is computed by $\hat{f}_k = \frac{1}{n}\sum_{i=0}^{n-1} f(x_i)\,\mathrm{e}^{-\mathrm{i}2\pi i k/n}$, where $k$ is the frequency. As shown in Fig. 2(a), the target function has three important frequencies as designed (black dots in the inset of Fig. 2(a)). To examine the convergence behavior of different frequency components during training with the MSE loss, one computes the relative difference between the DNN output and the target function at the three important frequencies $k$ at each recording step, that is, $\Delta_F(k) = |\hat{h}_k - \hat{f}_k|/|\hat{f}_k|$, where $|\cdot|$ denotes the norm of a complex number. As shown in Fig. 2(b), the DNN converges at the first frequency peak very fast, converges at the second frequency peak much more slowly, and at the third frequency peak more slowly still.

A series of experiments can be performed at relatively cheap cost on synthetic data to verify the validity of the F-Principle and to eliminate misleading factors. For example, the stochasticity of training and the learning rate turn out to be unimportant for reproducing the F-Principle. If one focused only on high-dimensional data, even the simple MNIST dataset, such experiments would require expensive computation and memory. The study of synthetic data thus gives clear guidance for examining the F-Principle in high-dimensional data. In addition, since frequency is a quantity that theoretical study can access relatively easily, the F-Principle provides a theoretical direction for further study.
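For readers who wish to reproduce this kind of experiment, the following is a minimal sketch in Python (PyTorch and NumPy); the network width, learning rate, and number of epochs are illustrative choices rather than those of the original experiments. It trains a small fully-connected tanh network on $f(x) = \sin(x) + \sin(3x) + \sin(5x)$ and prints $\Delta_F(k)$ at the three dominant frequency peaks during training.

```python
# Minimal sketch of the 1-d experiment: track the relative error Delta_F(k)
# of the dominant frequencies while a small network fits
# f(x) = sin(x) + sin(3x) + sin(5x).  Hyperparameters are illustrative.
import numpy as np
import torch
import torch.nn as nn

n = 201
x = np.linspace(-3.14, 3.14, n)
y = np.sin(x) + np.sin(3 * x) + np.sin(5 * x)

def dft(v):
    # hat{v}_k = (1/n) sum_i v_i e^{-i 2 pi i k / n}
    return np.fft.fft(v) / n

f_hat = dft(y)
# indices of the three dominant frequency peaks of the target, in frequency order
peaks = np.sort(np.argsort(-np.abs(f_hat[: n // 2]))[:3])

net = nn.Sequential(nn.Linear(1, 200), nn.Tanh(), nn.Linear(200, 200), nn.Tanh(), nn.Linear(200, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
xt = torch.tensor(x, dtype=torch.float32).unsqueeze(1)
yt = torch.tensor(y, dtype=torch.float32).unsqueeze(1)

for epoch in range(5000):
    loss = ((net(xt) - yt) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if epoch % 500 == 0:
        h_hat = dft(net(xt).detach().numpy().ravel())
        rel = np.abs(h_hat[peaks] - f_hat[peaks]) / np.abs(f_hat[peaks])
        print(epoch, rel)  # the lowest-frequency peak typically converges first
```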

3.2 Frequency in high-dimensional problems

To study the F-Principle in high-dimensional data, two obstacles must be overcome first: what frequency means in high dimensions and how to separate different frequencies.

The concept of “frequency” often causes confusion in image classification problems. The image (or input) frequency (NOT used in the F-Principle) is the frequency of the 2-d function $I: \mathbb{R}^2 \to \mathbb{R}$ representing the intensity of an image over pixels at different locations. This frequency corresponds to the rate of change of intensity across neighbouring pixels. For example, an image of constant intensity possesses only the zero frequency, i.e., the lowest frequency, while a sharp edge contributes to high frequencies of the image. The frequency used in the F-Principle is instead the response frequency of a general input-output mapping $f$. For example, consider a simplified classification problem on partial MNIST data using only the data with labels 0 and 1: $f(x_1, x_2, \cdots, x_{784}): \mathbb{R}^{784} \to \{0, 1\}$ maps the 784-d space of pixel values to a 1-d space, where $x_j$ is the intensity of the $j$-th pixel. Denote the mapping's Fourier transform as $\hat{f}(k_1, k_2, \cdots, k_{784})$. The frequency in the coordinate $k_j$ measures the rate of change of $f(x_1, x_2, \cdots, x_{784})$ with respect to $x_j$, i.e., the intensity of the $j$-th pixel. If $f$ possesses significant high frequencies for large $k_j$, then a small change of $x_j$ in the image might induce a large change of the output (e.g., an adversarial example). For real data, the response frequency is rigorously defined via the standard nonuniform discrete Fourier transform (NUDFT). The difficulty in separating different frequencies is that the computation of the Fourier transform of high-dimensional data suffers from the curse of dimensionality. For example, if one evaluates only two points in each dimension of the frequency space, then the evaluation of the Fourier transform of a $d$-dimensional function requires $2^d$ points, an impossibly large number even for MNIST data with $d = 784$. Two approaches are proposed in Xu et al. (2019b).

3.2.1 PROJECTION METHOD

One approach is to study the frequency along one direction of the frequency space. For a dataset $\{(x_i, y_i)\}_{i=0}^{n-1}$ with $y_i \in \mathbb{R}$, the high-dimensional nonuniform discrete Fourier transform of $\{(x_i, y_i)\}_{i=0}^{n-1}$ is $\hat{y}_k = \frac{1}{n}\sum_{i=0}^{n-1} y_i \exp(-\mathrm{i}2\pi k \cdot x_i)$. Consider a direction of $k$ in the Fourier space, i.e., $k = k p_1$, where $p_1$ is a chosen and fixed unit vector. Then we have $\hat{y}_k = \frac{1}{n}\sum_{i=0}^{n-1} y_i \exp(-\mathrm{i}2\pi (p_1 \cdot x_i) k)$, which is essentially the 1-d Fourier transform of $\{(x_{p_1,i}, y_i)\}_{i=0}^{n-1}$, where $x_{p_1,i} = p_1 \cdot x_i$ is the projection of $x_i$ on the direction $p_1$. Similarly, one can examine the relative difference between the DNN output and the target function for selected important frequencies at each recording step. In the experiments in Xu et al. (2019b), $p_1$ is chosen as the first principal component of the input space. A fully-connected network and a convolutional network are used to learn MNIST and CIFAR10, respectively. As shown in Fig. 3(a) and 3(c), low frequencies dominate in both real datasets. As shown in Fig. 3(b) and 3(d), one can easily observe that DNNs capture low frequencies first and gradually capture higher frequencies.

Figure 3: Projection method. (a, b) are for MNIST, (c, d) for CIFAR10. (a, c) Amplitude $|\hat{y}_k|$ vs. frequency. Selected frequencies are marked by black squares. (b, d) $\Delta_F(k)$ vs. training epochs for the selected frequencies.
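The projection method itself is simple to implement. The sketch below is an illustrative NumPy version (the random data, the selected frequencies, and the helper names are placeholders, not the original implementation): it projects the inputs onto the first principal component and evaluates the 1-d nonuniform discrete Fourier transforms of the labels and of a model's outputs at selected frequencies.

```python
# Sketch of the projection method: 1-d NUDFT along the first principal component.
# X: (n, d) inputs, y: (n,) labels, h: (n,) model outputs (all illustrative).
import numpy as np

def first_principal_component(X):
    Xc = X - X.mean(axis=0)
    # leading right-singular vector = first principal direction
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def nudft_1d(xp, v, freqs):
    # hat{v}_k = (1/n) sum_i v_i exp(-i 2 pi k x_{p,i}) for each k in freqs
    return np.array([np.mean(v * np.exp(-2j * np.pi * k * xp)) for k in freqs])

def relative_error(X, y, h, freqs):
    p1 = first_principal_component(X)
    xp = X @ p1                        # projections x_{p1,i} = p1 . x_i
    y_hat = nudft_1d(xp, y, freqs)
    h_hat = nudft_1d(xp, h, freqs)
    return np.abs(h_hat - y_hat) / np.abs(y_hat)   # Delta_F(k) per frequency

# toy usage with random data (stand-in for MNIST pixels and DNN outputs)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784))
y = rng.integers(0, 2, size=1000).astype(float)
h = y + 0.1 * rng.normal(size=1000)
print(relative_error(X, y, h, freqs=[0.05, 0.1, 0.2]))
```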

3.2.2 FILTERING METHOD

The projection method examines the F-Principle along only several directions. To complement the projection method, one can consider a coarse-grained filtering method, which is able to unravel whether, in the radially averaged sense, low frequencies converge faster than high frequencies.

The idea of the filtering method is to use a Gaussian filter to derive the low-frequency part of the data and then examine the convergence of the low- and high-frequency parts separately. The low-frequency part can be derived by
$$ y_i^{\mathrm{low},\delta} \triangleq (y * G^{\delta})_i, \qquad (1) $$
where $*$ indicates the convolution operator and $\delta$ is the standard deviation of the Gaussian kernel. Since the Fourier transform of a Gaussian function is still a Gaussian function, but with standard deviation $1/\delta$, $1/\delta$ can be regarded as a rough frequency width that is kept in the low-frequency part. The high-frequency part can be derived by

$$ y_i^{\mathrm{high},\delta} \triangleq y_i - y_i^{\mathrm{low},\delta}. \qquad (2) $$
Then, one can examine
$$ e_{\mathrm{low}} = \left( \frac{\sum_i |y_i^{\mathrm{low},\delta} - h_i^{\mathrm{low},\delta}|^2}{\sum_i |y_i^{\mathrm{low},\delta}|^2} \right)^{1/2}, \qquad (3) $$
$$ e_{\mathrm{high}} = \left( \frac{\sum_i |y_i^{\mathrm{high},\delta} - h_i^{\mathrm{high},\delta}|^2}{\sum_i |y_i^{\mathrm{high},\delta}|^2} \right)^{1/2}, \qquad (4) $$
where $h^{\mathrm{low},\delta}$ and $h^{\mathrm{high},\delta}$ are obtained from the DNN output $h$. If $e_{\mathrm{low}} < e_{\mathrm{high}}$ for different $\delta$'s throughout the training, the F-Principle holds; otherwise, it is falsified. Note that the DNN is trained as usual. As shown in Fig. 4, the low-frequency part converges faster in the following three settings for different $\delta$'s: a tanh fully-connected network for MNIST, a ReLU shallow convolutional network for CIFAR10, and a VGG16 (Simonyan and Zisserman, 2015) for CIFAR10. Another approach to examine the F-Principle in high-dimensional data is to add noise to the training data and examine when the noise is captured by the network (Rahaman et al., 2019). Note that this approach contaminates the training data.
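The filtering method can be summarized in a few lines when the convolution in Eq. (1) is evaluated directly as a normalized Gaussian-weighted sum over the training points, which is one natural discretization; the sketch below (illustrative data and names) computes $e_{\mathrm{low}}$ and $e_{\mathrm{high}}$ for a given model output.

```python
# Sketch of the filtering method: split labels and model outputs into
# low- and high-frequency parts with a Gaussian kernel over the inputs,
# then compare the relative errors e_low and e_high (Eqs. (1)-(4)).
import numpy as np

def gaussian_low_pass(X, v, delta):
    # (v * G^delta)_i as a normalized Gaussian-weighted average over data points
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-d2 / (2.0 * delta ** 2))
    W /= W.sum(axis=1, keepdims=True)
    return W @ v

def e_low_high(X, y, h, delta):
    y_low = gaussian_low_pass(X, y, delta); y_high = y - y_low
    h_low = gaussian_low_pass(X, h, delta); h_high = h - h_low
    e_low = np.sqrt(np.sum((y_low - h_low) ** 2) / np.sum(y_low ** 2))
    e_high = np.sqrt(np.sum((y_high - h_high) ** 2) / np.sum(y_high ** 2))
    return e_low, e_high

# toy usage: check e_low < e_high for a model that has fit mainly the smooth part
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = np.sin(X[:, 0]) + 0.3 * np.sin(8 * X[:, 0])
h = np.sin(X[:, 0])          # stand-in for an early-training DNN output
print(e_low_high(X, y, h, delta=1.0))
```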

Figure 4: F-Principle in real datasets. $e_{\mathrm{low}}$ and $e_{\mathrm{high}}$ are plotted against the training epoch for (a) $\delta = 3$, DNN; (b) $\delta = 3$, CNN; (c) $\delta = 7$, VGG; (d) $\delta = 7$, DNN; (e) $\delta = 7$, CNN; (f) $\delta = 10$, VGG. Reprinted from Xu et al. (2019b).

4. Theoretical study of F-Principle

An advantage of studying DNNs from the frequency perspective is that frequency can often be analyzed theoretically. This is especially important in deep learning, since deep learning is often criticized as a black box due to the lack of theoretical support. In this section, we review theories of the F-Principle for various settings. A key mechanism of the F-Principle, based on the regularity of the activation function, was first proposed in Xu (2018b) and formally published in Xu et al. (2019b). The theories reviewed in this section explore the F-Principle in an idealized setting (Xu et al., 2019b), in a general setting with infinite samples (Luo et al., 2019), in a continuous viewpoint (E et al., 2020), and in the regime of wide NNs (the Neural Tangent Kernel (NTK) regime (Jacot et al., 2018)) with samples distributed uniformly on a sphere (Cao et al., 2019; Yang and Salman, 2019; Basri et al., 2019; Bordelon et al., 2020) or with any finite samples (Zhang et al., 2019; Luo et al., 2020c).

4.1 Idealized setting

The following presents a simple case to illustrate how the F-Principle may arise. More details can be found in Xu (2018b); Xu et al. (2019b). The activation function we consider is

$$ \sigma(x) = \tanh(x) = \frac{\mathrm{e}^{x} - \mathrm{e}^{-x}}{\mathrm{e}^{x} + \mathrm{e}^{-x}}, \quad x \in \mathbb{R}. $$

For a DNN with one hidden layer of $m$ nodes, 1-d input $x$ and 1-d output,
$$ h(x) = \sum_{j=1}^{m} a_j \sigma(w_j x + b_j), \quad a_j, w_j, b_j \in \mathbb{R}, \qquad (5) $$
where $w_j$, $a_j$, and $b_j$ are training parameters. In the sequel, we will also use the notation $\theta = \{\theta_{lj}\}$ with $\theta_{1j} = a_j$, $\theta_{2j} = w_j$, and $\theta_{3j} = b_j$, $j = 1, \cdots, m$. Note that the Fourier transform of $\tanh(x)$ is $\hat{\sigma}(k) = -\frac{\mathrm{i}\pi}{\sinh(\pi k/2)}$. The Fourier transform of $\sigma(w_j x + b_j)$ with $w_j, b_j \in \mathbb{R}$, $j = 1, \cdots, m$, reads as
$$ \widehat{\sigma(w_j \cdot + b_j)}(k) = \frac{2\pi \mathrm{i}}{|w_j|} \exp\Big(\frac{\mathrm{i} b_j k}{w_j}\Big)\, \frac{1}{\exp\big(-\frac{\pi k}{2 w_j}\big) - \exp\big(\frac{\pi k}{2 w_j}\big)}. \qquad (6) $$
Note that the last term decays exponentially with respect to $|k|$. Thus

$$ \hat{h}(k) = \sum_{j=1}^{m} \frac{2\pi a_j \mathrm{i}}{|w_j|} \exp\Big(\frac{\mathrm{i} b_j k}{w_j}\Big)\, \frac{1}{\exp\big(-\frac{\pi k}{2 w_j}\big) - \exp\big(\frac{\pi k}{2 w_j}\big)}. \qquad (7) $$

Define the amplitude deviation between the DNN output and the target function $f(x)$ at frequency $k$ as $D(k) \triangleq \hat{h}(k) - \hat{f}(k)$. Write $D(k)$ as $D(k) = A(k)\mathrm{e}^{\mathrm{i}\phi(k)}$, where $A(k) \in [0, +\infty)$ and $\phi(k) \in \mathbb{R}$ are the amplitude and phase of $D(k)$, respectively. The loss at frequency $k$ is $L(k) = \frac{1}{2}|D(k)|^2$, where $|\cdot|$ denotes the norm of a complex number. The total loss function is defined as $L = \int_{-\infty}^{+\infty} L(k)\,\mathrm{d}k$. Note that, according to Parseval's theorem, this loss function in the Fourier domain equals the commonly used mean squared error loss, that is, $L = \int_{-\infty}^{+\infty} \frac{1}{2}(h(x) - f(x))^2\,\mathrm{d}x$.

The decrement along any direction, say, with respect to parameter $\theta_{lj}$, is
$$ \frac{\partial L}{\partial \theta_{lj}} = \int_{-\infty}^{+\infty} \frac{\partial L(k)}{\partial \theta_{lj}}\,\mathrm{d}k. \qquad (8) $$

The absolute contribution from frequency $k$ to this total amount at $\theta_{lj}$ is

$$ \Big| \frac{\partial L(k)}{\partial \theta_{lj}} \Big| \approx A(k) \exp\big(-|\pi k / 2 w_j|\big)\, F_{lj}(\theta_j, k), \qquad (9) $$
where $\theta_j \triangleq \{w_j, b_j, a_j\}$, $\theta_{lj} \in \theta_j$, and $F_{lj}(\theta_j, k)$ is a function of $\theta_j$ and $k$ that is approximately $O(1)$. For a frequency $k$ at which $\hat{h}(k)$ is not yet close enough to $\hat{f}(k)$, i.e., $A(k) \neq 0$, the factor $\exp(-|\pi k / 2 w_j|)$ dominates $F_{lj}(\theta_j, k)$ for a small $w_j$. Intuitively, the gradient of low-frequency components dominates the training, thus leading to a fast convergence of low-frequency components.
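As a quick numerical illustration of this mechanism, the sketch below (plain NumPy, illustrative parameters) compares the magnitudes of the discrete Fourier coefficients of $\sigma(wx+b)$ for a small and a large $w$; the coefficients for the small $w$ decay much faster with $|k|$, consistent with the factor $\exp(-|\pi k/2w_j|)$ in Eq. (9).

```python
# Compare the spectral decay of tanh(w x + b) for small and large w.
import numpy as np

x = np.linspace(-10, 10, 2048)

def spectrum(w, b=0.0):
    v = np.tanh(w * x + b)
    return np.abs(np.fft.rfft(v)) / len(x)   # one-sided DFT magnitude

for w in (0.5, 5.0):
    s = spectrum(w)
    # report the magnitude at a few frequency indices: small w decays much faster
    print(f"w = {w}: |sigma_hat| at k = 5, 20, 80 ->", s[[5, 20, 80]])
```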

4.2 General setting

Luo et al. (2019) consider DNNs trained by a general loss function $\tilde{R}_D(\theta)$ and measure the convergence of different frequencies by a mean squared loss. Similarly to the filtering method, the basic idea in this part is to decompose the frequency domain into a low-frequency part and a high-frequency part. Consider a target function $f_{\mathrm{target}}$ in a compact domain $\Omega$, i.e., $\Omega \subset\subset \mathbb{R}^d$. A bump function $\chi$ is used to truncate both the hypothesis and target functions:

$$ f_\theta(x) = f(x, \theta) = f^{[L]}(x, \theta)\,\chi(x), \qquad (10) $$

$$ f(x) = f_{\mathrm{target}}(x)\,\chi(x). \qquad (11) $$

In the sequel, we will also refer to $f_\theta$ and $f$ as the hypothesis and target functions, respectively. Consider a general loss function with population measure $D$, i.e.,
$$ \tilde{R}_D(\theta) = \mathbb{E}_{x \sim D}\, \ell(f(x, \theta) - f(x)), \qquad (12) $$
where the function $\ell$ satisfies some mild assumptions to be explained later. In this case, the training dynamics of $\theta$ becomes
$$ \begin{cases} \dfrac{\mathrm{d}\theta}{\mathrm{d}t} = -\nabla_\theta \tilde{R}_D(\theta), \\ \theta(0) = \theta_0. \end{cases} \qquad (13) $$
In the case of the MSE loss function,
$$ R_D(\theta) = \int_{\mathbb{R}^d} |f_D(x, \theta) - f_D(x)|^2\,\mathrm{d}x \qquad (14) $$
$$ \phantom{R_D(\theta)} = \int_{\mathbb{R}^d} |\hat{f}_D(\xi, \theta) - \hat{f}_D(\xi)|^2\,\mathrm{d}\xi, \qquad (15) $$
where $\rho$, satisfying $D(\mathrm{d}x) = \rho(x)\,\mathrm{d}x$, is called the population density and
$$ f_D(\cdot, \theta) = f(\cdot, \theta)\sqrt{\rho(\cdot)}, \qquad f_D(\cdot) = f(\cdot)\sqrt{\rho(\cdot)}. \qquad (16) $$

The second equality is due to the Plancherel theorem. Take the decomposition $R = R_\eta^{-} + R_\eta^{+}$ with
$$ R_\eta^{-}(\theta) = \int_{B_\eta} q(\xi, \theta)\,\mathrm{d}\xi, \qquad R_\eta^{+}(\theta) = \int_{B_\eta^{c}} q(\xi, \theta)\,\mathrm{d}\xi, \qquad (17) $$

where $B_\eta$ and $B_\eta^{c} = \mathbb{R}^d \setminus B_\eta$ are a ball centered at the origin with radius $\eta > 0$ and its complement, respectively, and

$$ q(\xi, \theta) = |\hat{f}(\xi, \theta) - \hat{f}(\xi)|^2. \qquad (18) $$

The assumptions are summarized here.

Assumption 1 (regularity). The bump function $\chi$ satisfies $\chi(x) = 1$ for $x \in \Omega$ and $\chi(x) = 0$ for $x \in \mathbb{R}^d \setminus \Omega'$, for domains $\Omega$ and $\Omega'$ with $\Omega \subset\subset \Omega' \subset\subset \mathbb{R}^d$. There is a positive integer $k$ (which can be $\infty$) such that $f_{\mathrm{target}} \in W^{k,\infty}_{\mathrm{loc}}(\mathbb{R}^d; \mathbb{R})$, $\chi \in W^{k,\infty}_{\mathrm{loc}}(\mathbb{R}^d; [0, +\infty))$, and the activation function $\sigma \in W^{k,\infty}_{\mathrm{loc}}(\mathbb{R}; \mathbb{R})$ for $l \in [L-1]$, $i \in [m_l]$, where $m_l$ is the width of the $l$th layer.

Assumption 2 (bounded population density). There exists a function $\rho \in L^\infty(\mathbb{R}^d; [0, +\infty))$ satisfying $D(\mathrm{d}x) = \rho(x)\,\mathrm{d}x$.

For the training dynamics (13), we suppose the parameters are bounded.

Assumption 3 (bounded trajectory). The training dynamics is nontrivial, i.e., $\theta(t) \not\equiv \mathrm{const}$. There exists a constant $C_0 > 0$ such that $\sup_{t \geq 0} |\theta(t)| \leq C_0$, where the parameter vector $\theta(t)$ is the solution to (13).

Remark 1. The bound $C_0$ depends on the initial parameter $\theta_0$. The general loss function considered in this work satisfies the following assumption.

Assumption 4 (general loss function). The function $\ell$ in the general loss function $\tilde{R}_D(\theta)$ satisfies $\ell \in C^2(\mathbb{R}; [0, +\infty))$ and there exist positive constants $C$ and $r_0$ such that $C^{-1}[\ell'(z)]^2 \leq \ell(z) \leq C|z|^2$ for $|z| \leq r_0$.

Theorem 1 (F-Principle in the intermediate stage, general loss function). Suppose that Assumptions 1, 2, 3, and 4 hold, and consider the training dynamics (13). Then for any $1 \leq r \leq k - 1$, there is a constant $C > 0$ such that for any $0 < T_1 < T_2$ satisfying $\frac{1}{2} R(\theta(T_1)) \geq R(\theta(T_2))$, we have

$$ \frac{\int_{T_1}^{T_2} \frac{\mathrm{d}R_\eta^{+}}{\mathrm{d}t}\,\mathrm{d}t}{\int_{T_1}^{T_2} \frac{\mathrm{d}R}{\mathrm{d}t}\,\mathrm{d}t} \leq C\,\sqrt{T_2 - T_1}\,\eta^{-r}. \qquad (19) $$

4.3 A continuous view point

E et al. (2020) present a continuous framework for studying machine learning and suggest that the F-Principle is part of a reasonably complete picture of the main reasons behind the success of modern machine learning: the gradient flows are nice flows, and they obey the frequency principle, basically because they are integral equations.

4.4 NTK setting and linear F-Principle

In general, it is difficult to analyze the convergence rate of each frequency due to the high dimensionality and nonlinearity. However, in a linear regime, where the network width approaches infinity and the output carries a scaling factor of $1/\sqrt{m}$, several works have explicitly characterized the convergence rate of each frequency.

4.4.1 NTK DYNAMICS

One can consider the following gradient-descent flow dynamics of the empirical risk $L_S$ of a network function $f(\cdot, \theta)$, parameterized by $\theta$, on a set of training data $\{(x_i, y_i)\}_{i=1}^{n}$:

$$ \begin{cases} \dot{\theta} = -\nabla_\theta L_S(\theta), \\ \theta(0) = \theta_0, \end{cases} \qquad (20) $$
where
$$ L_S(\theta) = \frac{1}{2} \sum_{i=1}^{n} (f(x_i, \theta) - y_i)^2. \qquad (21) $$
Then the training dynamics of the output function $f(\cdot, \theta)$ is

$$ \begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t} f(x, \theta) &= \nabla_\theta f(x, \theta) \cdot \dot{\theta} \\ &= -\nabla_\theta f(x, \theta) \cdot \nabla_\theta L_S(\theta) \\ &= -\sum_{i=1}^{n} \nabla_\theta f(x, \theta) \cdot \nabla_\theta f(x_i, \theta)\,(f(x_i, \theta) - y_i) \\ &= -\sum_{i=1}^{n} K_m(x, x_i)\,(f(x_i, \theta) - y_i), \end{aligned} $$
where for time $t$ the NTK evaluated at $(x, x') \in \Omega \times \Omega$ reads as

$$ K_m(x, x')(t) = \nabla_\theta f(x, \theta(t)) \cdot \nabla_\theta f(x', \theta(t)). \qquad (22) $$

As the NTK regime studies the network with infinite width, we denote

$$ K^*(x, x')(t) := \lim_{m \to \infty} K_m(x, x')(t). \qquad (23) $$
The gradient-descent dynamics of the model thus becomes

$$ \frac{\mathrm{d}}{\mathrm{d}t}\big( f(x, \theta(t)) - f(x) \big) = -\sum_{i=1}^{n} K^*(x, x_i)(t)\,\big( f(x_i, \theta(t)) - f(x_i) \big). \qquad (24) $$

Define the residual $u(x, t) = f(x, \theta(t)) - f(x)$. Denote by $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{n}$ the training data, with $u(X) \in \mathbb{R}^{n}$ and $\nabla_\theta f(X, \theta(t))$ the corresponding quantities evaluated on the training inputs, and

$$ K^*(t) = \lim_{m \to \infty} K_m(t) := \lim_{m \to \infty} \nabla_\theta f(X, \theta(t))\,(\nabla_\theta f(X, \theta(t)))^{\mathsf T} \in \mathbb{R}^{n \times n}. \qquad (25) $$

Then, one can obtain
$$ \frac{\mathrm{d}u(X)}{\mathrm{d}t} = -K^*(t)\, u(X). \qquad (26) $$
In a continuous form, one can define the empirical density $\rho(x) = \sum_{i=1}^{n} \delta(x - x_i)/n$ and further denote $u_\rho(x) = u(x)\rho(x)$. Therefore the dynamics for $u$ becomes
$$ \frac{\mathrm{d}}{\mathrm{d}t} u(x, t) = -\int_{\mathbb{R}^d} K^*(x, x')(t)\, u_\rho(x', t)\,\mathrm{d}x'. \qquad (27) $$
Note that here we slightly abuse the notation $K^*$. From the above analysis, it is clear that a convergence analysis can be carried out by performing an eigen decomposition: the component in the sub-space of an eigenvector with a larger eigenvalue converges faster. For two-layer wide NNs in the NTK regime, one can derive $K^*(t) \equiv K^*(0)$. A series of works further show that eigenvectors with larger eigenvalues are of lower frequency, therefore providing a rigorous proof of the F-Principle in the NTK regime for two-layer networks. Consider a two-layer NN

$$ f(x, \theta) = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} a_j \sigma(w_j^{\mathsf T} x + b_j), \qquad (28) $$

where the vector of all parameters $\theta$ is formed of the parameters of each neuron, $(a_j, w_j^{\mathsf T}, b_j)^{\mathsf T} \in \mathbb{R}^{d+2}$ for $j \in [m]$. In the infinite-neuron limit $m \to \infty$, the following linearization around initialization

$$ f^{\mathrm{lin}}(x; \theta(t)) = f(x; \theta(0)) + \nabla_\theta f(x; \theta(0))\,\big(\theta(t) - \theta(0)\big) \qquad (29) $$
is an effective approximation of $f(x; \theta(t))$, i.e., $f^{\mathrm{lin}}(x; \theta(t)) \approx f(x; \theta(t))$ for any $t$, as demonstrated by both theoretical and empirical studies of neural tangent kernels (NTK) (Jacot et al., 2018; Lee et al., 2019). Note that $f^{\mathrm{lin}}(x; \theta(t))$, linear in $\theta$ and nonlinear in $x$, preserves the universal approximation power of $f(x; \theta(t))$ at $m \to \infty$. In the rest of this sub-section, we do not distinguish $f(x; \theta(t))$ from $f^{\mathrm{lin}}(x; \theta(t))$.
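To make the eigen-decomposition picture concrete, the following sketch computes an empirical kernel $K_m = \nabla_\theta f(X)\,\nabla_\theta f(X)^{\mathsf T}$ for a finite-width two-layer tanh network at initialization on a 1-d grid and inspects the dominant frequency of a few eigenvectors; the width, grid, and tanh activation are illustrative choices, and the finite-width kernel only approximates the infinite-width $K^*$.

```python
# Empirical NTK of a finite two-layer tanh network on a 1-d grid, at initialization.
# Eigenvectors with the largest eigenvalues are typically the smoothest
# (lowest-frequency) directions; they are also the fastest to converge.
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 64                       # width, number of grid points (illustrative)
x = np.linspace(-np.pi, np.pi, n)
a = rng.normal(size=m); w = rng.normal(size=m); b = rng.normal(size=m)

z = np.outer(x, w) + b               # (n, m) pre-activations
s, ds = np.tanh(z), 1.0 - np.tanh(z) ** 2
# Jacobian of f(x) = (1/sqrt(m)) sum_j a_j tanh(w_j x + b_j) w.r.t. (a, w, b)
J = np.concatenate([s, ds * a * x[:, None], ds * a], axis=1) / np.sqrt(m)   # (n, 3m)
K = J @ J.T                          # empirical kernel K_m, shape (n, n)

vals, vecs = np.linalg.eigh(K)       # eigenvalues in ascending order
for idx in (-1, -2, -10):            # a few eigenvectors, from the largest eigenvalue down
    v = vecs[:, idx]
    dom_freq = np.argmax(np.abs(np.fft.rfft(v))[1:]) + 1
    print(f"eigenvalue {vals[idx]:.3e}, dominant frequency index {dom_freq}")
```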

4.4.2 EIGEN ANALYSIS FOR TWO-LAYER NN

For a two-layer ReLU network, $K^*$ enjoys good properties for theoretical study. The exact form of $K^*$ can be obtained theoretically (Xie et al., 2017). Under the condition that training samples are distributed uniformly on a sphere, the spectrum of $K^*$ can be obtained through spherical harmonic decomposition (Xie et al., 2017). Roughly speaking, each eigenvector of $K^*$ corresponds to a specific frequency. Based on such harmonic analysis, Basri et al. (2019); Cao et al. (2019) estimate the eigenvalues of $K^*$, i.e., the convergence rate of each frequency. Basri et al. (2020) further relax the condition of data distributed uniformly on a sphere to data distributed piecewise constantly on a sphere, but limit the result to the 1-d sphere. Similarly, under the uniform distribution assumption in the NTK regime, Bordelon et al. (2020) show that as the size of the training set grows, ReLU NNs fit successively higher spectral modes of the target function. Empirical studies also validate that real data often align with the eigenvectors that have large eigenvalues, i.e., low-frequency eigenvectors (Dong et al., 2019; Kopitkov and Indelman, 2020; Baratin et al., 2020).

4.4.3 LINEAR F-PRINCIPLE FOR TWO-LAYER NEURAL NETWORK

The condition of uniform distribution on a sphere is often unrealistic. A parallel work (Zhang et al., 2019) studies the evolution of each frequency for two-layer wide ReLU networks with any data distribution, including the randomly discrete case. Luo et al. (2020b) provide a rigorous version of Zhang et al. (2019) and extend the study from the ReLU activation function in Zhang et al. (2019) to any activation function. Next, we briefly review the study in Luo et al. (2020b), that is, the linear F-Principle (LFP). Consider the two-layer NN
$$ f(x, \theta) = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} a_j \sigma(w_j^{\mathsf T} x + b_j) \qquad (30) $$
$$ \phantom{f(x, \theta)} = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} \sigma^*(x, q_j), \qquad (31) $$
where the vector of all parameters $\theta = \mathrm{vec}(\{q_j\}_{j=1}^{m})$ is formed of the parameters of each neuron $q_j = (a_j, w_j^{\mathsf T}, b_j)^{\mathsf T} \in \mathbb{R}^{d+2}$ and $\sigma^*(x, q_j) = a_j \sigma(w_j^{\mathsf T} x + b_j)$ for $j \in [m]$. Consider the kernel regime with $m \gg 1$ and assume that $b \sim \mathcal{N}(0, \sigma_b^2)$ with $\sigma_b \gg 1$. For the two-layer network, its NTK can be calculated as follows:
$$ K_m(x, x')(t) = \frac{1}{m} \sum_{j=1}^{m} \nabla_{q_j} \sigma^*(x, q_j(t)) \cdot \nabla_{q_j} \sigma^*(x', q_j(t)), \qquad (32) $$
where the parameters $q_j$ are evaluated at time $t$.

An example. Before reviewing the complete LFP theory, we first give the LFP result for a two-layer ReLU network. Consider the residual $u(x, t) = f(x, \theta(t)) - f(x)$; one can obtain
$$ \partial_t \mathcal{F}[u] = -(\gamma(\xi))^2\, \mathcal{F}[u_\rho](\xi), \qquad (33) $$
with
$$ (\gamma(\xi))^2 = \mathbb{E}_{a, r}\!\left[ \frac{r^3}{16\pi^4\|\xi\|^{d+3}} + \frac{a^2 r}{4\pi^2\|\xi\|^{d+1}} \right], \qquad (34) $$
where $\mathcal{F}[\cdot]$ is the Fourier transform, $r = \|w\|$, $\mathbb{E}_{a,r}$ is the expectation with respect to the initial distribution of $a, r$, and $(\cdot)_\rho(x) = (\cdot)(x)\rho(x)$. Here $\rho(x)$ is the data distribution, which can be a continuous function or $\rho(x) = \sum_{i=1}^{n} \delta(x - x_i)/n$, an uncommon part of this dynamics. One can further prove that the long-time solution of (33) satisfies the following constrained minimization problem:
$$ \min_{h} \int \gamma(\xi)^{-2} |\hat{h}(\xi) - \hat{h}_{\mathrm{ini}}(\xi)|^2\,\mathrm{d}\xi, \quad \text{s.t.}\ h(x_i) = f^*(x_i),\ i = 1, \ldots, n. \qquad (35) $$
Based on the equivalent optimization problem (35), each decaying term can be analyzed for 1-d problems ($d = 1$). When the $1/\xi^2$ term dominates, the corresponding minimization problem (35), rewritten in the spatial domain, yields
$$ \min_{h} \int |h'(x) - h'_{\mathrm{ini}}(x)|^2\,\mathrm{d}x, \quad \text{s.t.}\ h(x_i) = f^*(x_i),\ i = 1, \cdots, n, \qquad (36) $$

where $'$ indicates differentiation. For $h_{\mathrm{ini}}(x) = 0$, Eq. (36) yields a linear spline interpolation. Similarly, when the $1/\xi^4$ term dominates, $\int |h''(x) - h''_{\mathrm{ini}}(x)|^2\,\mathrm{d}x$ is minimized, yielding a cubic spline. In general, the two power-law decays coexist, giving rise to a specific mixture of linear and cubic splines. For high-dimensional problems, the model prediction is difficult to interpret because the order of differentiation depends on $d$ and can be fractional. A similar analysis in the spatial domain can be found in a subsequent work by Jin and Montúfar (2020).

Well-posedness. Inspired by the variational formulation of the LFP model in (35), Luo et al. (2020a) propose a new continuum model for supervised learning. This is a variational problem with a parameter $\alpha > 0$:
$$ \min_{h \in H} Q_\alpha[h] = \int_{\mathbb{R}^d} \langle \xi \rangle^{\alpha}\, |\mathcal{F}[h](\xi)|^2\,\mathrm{d}\xi, \qquad (37) $$
$$ \text{s.t.}\ h(x_i) = y_i,\ i = 1, \cdots, n, \qquad (38) $$

where $\langle \xi \rangle = (1 + \|\xi\|^2)^{1/2}$ is the “Japanese bracket” of $\xi$ and $H = \{h(x) \,|\, \int_{\mathbb{R}^d} \langle \xi \rangle^{\alpha} |\mathcal{F}[h](\xi)|^2\,\mathrm{d}\xi < \infty\}$. Luo et al. (2020a) prove that $\alpha = d$ is a critical point, below which the variational problem leads to a trivial solution that is non-zero only at the training data points, and above which the variational problem leads to a solution with certain regularity. The LFP model shows that an NN is a convenient way to implement the variational formulation, which automatically satisfies the well-posedness condition.

Rigorous analysis of LFP. Next, we review the rigorous LFP theory for a general activation function. To simplify the notation, we define $g_1(z) := (\sigma(z), a\sigma'(z))^{\mathsf T}$ and $g_2(z) := a\sigma'(z)$ for $z \in \mathbb{R}$. Then
$$ g_1(w^{\mathsf T} x + b) = \begin{pmatrix} \sigma(w^{\mathsf T} x + b) \\ a\sigma'(w^{\mathsf T} x + b) \end{pmatrix} = \begin{pmatrix} \partial_a [a\sigma(w^{\mathsf T} x + b)] \\ \partial_b [a\sigma(w^{\mathsf T} x + b)] \end{pmatrix}, \qquad (39) $$
$$ g_2(w^{\mathsf T} x + b)\, x = \nabla_w [a\sigma(w^{\mathsf T} x + b)] = a\sigma'(w^{\mathsf T} x + b)\, x. \qquad (40) $$
The following theorem is the key to the exact expression of the LFP dynamics for two-layer networks.

Assumption 5. (Luo et al., 2020b) We assume that the initial distribution of $q = (a, w^{\mathsf T}, b)^{\mathsf T}$ satisfies the following conditions:

(i) independence of $a$, $w$, $b$: $\rho_q(q) = \rho_a(a)\rho_w(w)\rho_b(b)$;
(ii) zero mean and finite variance of $b$: $\mathbb{E}_b\, b = 0$ and $\mathbb{E}_b\, b^2 = \sigma_b^2 < \infty$;

(iii) radial symmetry of $w$: $\rho_w(w) = \rho_w(\|w\| e_1)$, where $e_1 = (1, 0, \cdots, 0)^{\mathsf T}$.

Theorem 2 (Luo et al. (2020b): explicit expression of the LFP operator for two-layer networks). Suppose that Assumption 5 holds. If $\sigma_b \gg 1$, then the dynamics (27) has the following expression,
$$ \langle \partial_t \mathcal{F}[u], \phi \rangle = -\langle \mathcal{L}[\mathcal{F}[u_\rho]], \phi \rangle + O(\sigma_b^{-3}), \qquad (41) $$
where $\phi \in \mathcal{S}(\mathbb{R}^d)$ is a test function and the LFP operator is given by
$$ \begin{aligned} \mathcal{L}[\mathcal{F}[u_\rho]] ={}& \frac{\Gamma(d/2)}{2\sqrt{2}\,\pi^{(d+1)/2}\sigma_b\|\xi\|^{d-1}}\, \mathbb{E}_{a,r}\!\left[\frac{1}{r}\,\mathcal{F}[g_1]\!\Big(\frac{\|\xi\|}{r}\Big)\cdot\mathcal{F}[g_1]\!\Big(\frac{-\|\xi\|}{r}\Big)\right]\mathcal{F}[u_\rho](\xi) \\ &- \frac{\Gamma(d/2)}{2\sqrt{2}\,\pi^{(d+1)/2}\sigma_b}\,\nabla\cdot\left(\mathbb{E}_{a,r}\!\left[\frac{1}{r\|\xi\|^{d-1}}\,\mathcal{F}[g_2]\!\Big(\frac{\|\xi\|}{r}\Big)\,\mathcal{F}[g_2]\!\Big(\frac{-\|\xi\|}{r}\Big)\right]\nabla\mathcal{F}[u_\rho](\xi)\right), \end{aligned} \qquad (42) $$

where $\mathcal{F}[\cdot]$ is the Fourier transform, $\langle \cdot, \cdot \rangle$ is the inner product, and $\Gamma(\cdot)$ is the gamma function. The expectations are taken w.r.t. the initial parameter distribution. Here $r = \|w\|$ with probability density $\rho_r(r) := \frac{2\pi^{d/2}}{\Gamma(d/2)} \rho_w(r e_1) r^{d-1}$, $e_1 = (1, 0, \cdots, 0)^{\mathsf T}$. Based on (42), one can derive the exact LFP dynamics for the cases where the activation function is ReLU or tanh.

Corollary 1 (Luo et al. (2020b): LFP operator for ReLU activation function). Suppose that Assumption 5 holds. If $\sigma_b \gg 1$ and $\sigma = \mathrm{ReLU}$, then the dynamics (27) has the following expression,

$$ \langle \partial_t \mathcal{F}[u], \phi \rangle = -\langle \mathcal{L}[\mathcal{F}[u_\rho]], \phi \rangle + O(\sigma_b^{-3}), \qquad (43) $$

where $\phi \in \mathcal{S}(\mathbb{R}^d)$ is a test function and the LFP operator reads as
$$ \begin{aligned} \mathcal{L}[\mathcal{F}[u_\rho]] ={}& \frac{\Gamma(d/2)}{2\sqrt{2}\,\pi^{(d+1)/2}\sigma_b}\, \mathbb{E}_{a,r}\!\left[\frac{r^3}{16\pi^4\|\xi\|^{d+3}} + \frac{a^2 r}{4\pi^2\|\xi\|^{d+1}}\right]\mathcal{F}[u_\rho](\xi) \\ &- \frac{\Gamma(d/2)}{2\sqrt{2}\,\pi^{(d+1)/2}\sigma_b}\,\nabla\cdot\left(\mathbb{E}_{a,r}\!\left[\frac{a^2 r}{4\pi^2\|\xi\|^{d+1}}\right]\nabla\mathcal{F}[u_\rho](\xi)\right). \end{aligned} \qquad (44) $$
The expectations are taken w.r.t. the initial parameter distribution. Here $r = \|w\|$ with probability density $\rho_r(r) := \frac{2\pi^{d/2}}{\Gamma(d/2)} \rho_w(r e_1) r^{d-1}$, $e_1 = (1, 0, \cdots, 0)^{\mathsf T}$.

Corollary 2 (Luo et al. (2020b): LFP operator for tanh activation function). Suppose that Assumption 5 holds. If $\sigma_b \gg 1$ and $\sigma = \tanh$, then the dynamics (27) has the following expression,

$$ \langle \partial_t \mathcal{F}[u], \phi \rangle = -\langle \mathcal{L}[\mathcal{F}[u_\rho]], \phi \rangle + O(\sigma_b^{-3}), \qquad (45) $$

where $\phi \in \mathcal{S}(\mathbb{R}^d)$ is a test function and the LFP operator reads as
$$ \begin{aligned} \mathcal{L}[\mathcal{F}[u_\rho]] ={}& \frac{\Gamma(d/2)}{2\sqrt{2}\,\pi^{(d+1)/2}\sigma_b\|\xi\|^{d-1}}\, \mathbb{E}_{a,r}\!\left[\frac{\pi^2}{r^3}\,\mathrm{csch}^2\Big(\frac{\pi^2\|\xi\|}{r}\Big) + \frac{4\pi^4 a^2\|\xi\|^2}{r}\,\mathrm{csch}^2\Big(\frac{\pi^2\|\xi\|}{r}\Big)\right]\mathcal{F}[u_\rho](\xi) \\ &- \frac{\Gamma(d/2)}{2\sqrt{2}\,\pi^{(d+1)/2}\sigma_b}\,\nabla\cdot\left(\mathbb{E}_{a,r}\!\left[\frac{4\pi^4 a^2}{r^3\|\xi\|^{d-3}}\,\mathrm{csch}^2\Big(\frac{\pi^2\|\xi\|}{r}\Big)\right]\nabla\mathcal{F}[u_\rho](\xi)\right). \end{aligned} \qquad (46) $$
The expectations are taken w.r.t. the initial parameter distribution. Here $r = \|w\|$ with probability density $\rho_r(r) := \frac{2\pi^{d/2}}{\Gamma(d/2)} \rho_w(r e_1) r^{d-1}$, $e_1 = (1, 0, \cdots, 0)^{\mathsf T}$.

Finally, we give some remarks on the difference between the eigen decomposition and the frequency decomposition. In the non-NTK regime, the eigen decomposition can be analyzed similarly, but without an informative explicit form. In addition, the study of the bias from the perspective of eigen decomposition is very limited. For finite networks, which are the ones used in practice, the kernel evolves during training; thus, it is hard to understand which components converge faster or slower, and the eigen modes of the kernel are also difficult to perceive. In contrast, the frequency decomposition is easy to interpret and is a natural approach widely used in science.

5. Generalization

Generalization is a central issue of deep learning. In this section, we review how the F-Principle provides understanding of the non-overfitting puzzle of over-parameterized NNs in deep learning (Zhang et al., 2017). First, we show how the F-Principle intuitively explains a strength and a weakness of deep learning (Xu et al., 2019b). Second, we provide a rationale for the common trick of early stopping (Xu et al., 2019a), which is often used to improve generalization. Third, we present an a priori generalization error bound for wide two-layer ReLU NNs (Luo et al., 2020b).

5.1 Strength and weakness

As demonstrated in the Introduction, if the implicit bias or characteristics of an algorithm are consistent with the properties of the data, the algorithm generalizes well; otherwise, it does not. By identifying the implicit bias of DNNs in the F-Principle, we can better understand the strength and the weakness of deep learning, as demonstrated by Xu et al. (2019b) in the following.

DNNs often generalize well for real problems (Zhang et al., 2017) but poorly for problems like fitting a parity function (Shalev-Shwartz et al., 2017; Nye and Saxe, 2018), despite excellent training accuracy for all problems. The following demonstrates a qualitative difference between these two types of problems through Fourier analysis and uses the F-Principle to explain the different generalization performances of DNNs.

Using the projection method in Sec. 3.2.1, one can obtain frequencies along the examined directions. For illustration, the Fourier transforms of all MNIST/CIFAR10 data along the first principal component are shown in Fig. 5(a, b), respectively. The Fourier transform of the training data (red dots) overlaps well with that of the total data (green) at the dominant low frequencies. As expected, the Fourier transform of the NN output, which is biased towards low frequencies and evaluated on both training and test data, also overlaps with the true Fourier transform at the low-frequency part. Due to the negligible high-frequency content in these problems, the NNs generalize well.

However, NNs generalize badly for high-frequency functions, as follows. For the parity function $f(x) = \prod_{j=1}^{d} x_j$ defined on $\Omega = \{-1, 1\}^d$, its Fourier transform is $\hat{f}(k) = \frac{1}{2^d}\sum_{x \in \Omega} \prod_{j=1}^{d} x_j \mathrm{e}^{-\mathrm{i}2\pi k \cdot x} = (-\mathrm{i})^d \prod_{j=1}^{d} \sin 2\pi k_j$. Clearly, for $k \in [-\frac{1}{4}, \frac{1}{4}]^d$, the power of the parity function concentrates at $k \in \{-\frac{1}{4}, \frac{1}{4}\}^d$ and vanishes as $k \to 0$, as illustrated in Fig. 5(c) for one direction in the frequency space. Given a randomly sampled training dataset $S \subset \Omega$ with $s$ points, the nonuniform Fourier transform on $S$ is computed as $\hat{f}_S(k) = \frac{1}{s}\sum_{x \in S} \prod_{j=1}^{d} x_j \mathrm{e}^{-\mathrm{i}2\pi k \cdot x}$. As shown in Fig. 5(c), $\hat{f}(k)$ and $\hat{f}_S(k)$ differ significantly at low frequencies, caused by the well-known aliasing effect. Based on the F-Principle, as demonstrated in Fig. 5(c), these artificial low-frequency components will be captured first to explain the training samples, whereas the high-frequency components will be compromised by the DNN, leading to the bad generalization observed in experiments.

The F-Principle implies that, among all the functions that can fit the training data, a DNN is implicitly biased during training towards a function with more power at low frequencies, which is consistent with the implication of the equivalent optimization problem (35). The distributions of power in the Fourier domain of the above two types of problems exhibit significant differences, which result in different generalization performances of DNNs according to the F-Principle.
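The aliasing argument for the parity function can be checked directly. The NumPy sketch below (with illustrative choices $d = 10$, 200 random training points, and the diagonal direction in frequency space) compares $\hat{f}(k)$ evaluated on the full hypercube with $\hat{f}_S(k)$ evaluated on a random subset; near $k = 0$ the exact transform vanishes while the subsampled one carries spurious low-frequency mass.

```python
# Aliasing for the parity function f(x) = prod_j x_j on {-1, 1}^d:
# compare the exact Fourier transform with the nonuniform DFT on a random
# training subset S along one frequency direction (illustrative: the diagonal).
import itertools
import numpy as np

d, s = 10, 200
rng = np.random.default_rng(0)

Omega = np.array(list(itertools.product([-1, 1], repeat=d)), dtype=float)  # all 2^d points
S = Omega[rng.choice(len(Omega), size=s, replace=False)]                   # training subset

def parity(X):
    return np.prod(X, axis=1)

def fourier_along(X, k_scalars, direction):
    # (1/|X|) sum_x f(x) exp(-i 2 pi (k * direction) . x), for each scalar k
    y = parity(X)
    proj = X @ direction
    return np.array([np.mean(y * np.exp(-2j * np.pi * k * proj)) for k in k_scalars])

direction = np.ones(d)
ks = np.linspace(0.0, 0.5, 26)
f_hat_full = fourier_along(Omega, ks, direction)
f_hat_S = fourier_along(S, ks, direction)
# near k = 0 the exact transform vanishes, but the subsample shows spurious mass
print(np.abs(f_hat_full[:3]), np.abs(f_hat_S[:3]))
```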


Figure 5: Fourier analysis for different generalization abilities. Each plot shows the amplitude of the Fourier coefficient against frequency $k$. The red dots are for the training dataset, the green line is for the whole dataset, and the blue dashed line is for the output of a well-trained DNN on the inputs of the whole dataset. For (c), $d = 10$ and the training data are 200 randomly selected points. The settings of (a) and (b) are the same as those in Fig. 3. For (c), we use a tanh-DNN with widths 10-500-100-1 and learning rate 0.0005 under full-batch training by the Adam optimizer. The parameters of the DNN are initialized by a Gaussian distribution with mean 0 and standard deviation 0.05.

5.2 Early stopping

When the training data are contaminated by noise, the early-stopping method is usually applied in practice to avoid overfitting (Lin et al., 2016). By the F-Principle, early stopping helps avoid fitting the noisy high-frequency components and thus naturally leads to a well-generalized solution. Xu et al. (2019a) use the following example for illustration.

As shown in Fig. 6(a), the data are sampled from a function with noise. The DNN can fit the sampled training set well, as the loss function of the training set decreases to a very small value (green stars in Fig. 6(b)). However, the loss function of the test set first decreases and then increases (red dots in Fig. 6(b)). As shown in Fig. 6(c), the Fourier transforms of the training data (red) and the test data (black) only overlap around the dominant low-frequency components. Clearly, the high-frequency components of the training set are severely contaminated by noise. Around the turning step, where the best generalization performance is achieved (indicated by the green dashed line in Fig. 6(b)), the DNN output is a smooth function (blue line in Fig. 6(a)) in the spatial domain and well captures the dominant peak in the frequency domain (Fig. 6(c)). After that, the loss function of the test set increases as the DNN starts to capture the higher-frequency noise (red dots in Fig. 6(b)). These phenomena conform with our analysis that early stopping can lead to a better generalization performance of DNNs, as it helps prevent fitting the noisy high-frequency components of the training set. Since a low-frequency function is more robust w.r.t. its input than a high-frequency function, early stopping can also enhance the robustness of the DNN. This effect is consistent with the study in Li et al. (2020a), which shows that a two-layer NN, trained only on the input weights and early stopped, can reconstruct the true labels from noisy data.
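A minimal sketch of this kind of early-stopping experiment is given below (PyTorch; the target function, noise level, network size, and training length are illustrative choices rather than the original settings). It tracks the test loss along training and records the epoch at which it is smallest, which typically occurs well before the training loss stops decreasing.

```python
# Sketch of early stopping on noisy 1-d data: the test loss typically reaches
# its minimum near the point where the network has fit the low-frequency trend
# but not yet the high-frequency noise.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
def target(x): return np.sin(x)

x_train = np.linspace(-10, 10, 300)
y_train = target(x_train) + 0.3 * rng.normal(size=x_train.shape)   # noisy labels
x_test = np.linspace(-10, 10, 1000)
y_test = target(x_test)                                            # clean test labels

to_t = lambda v: torch.tensor(v, dtype=torch.float32).unsqueeze(1)
Xtr, Ytr, Xte, Yte = map(to_t, (x_train, y_train, x_test, y_test))

net = nn.Sequential(nn.Linear(1, 200), nn.Tanh(), nn.Linear(200, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

best = (np.inf, 0)
for epoch in range(3000):
    loss = ((net(Xtr) - Ytr) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        test_loss = ((net(Xte) - Yte) ** 2).mean().item()
    if test_loss < best[0]:
        best = (test_loss, epoch)       # candidate early-stopping point
print("best test loss", best[0], "at epoch", best[1])
```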

5.3 Quantitative understanding in NTK regime

The static minimization problem (35) defines an FP-energy $E_\gamma(h) = \int \gamma^{-2} |\hat{h}|^2\,\mathrm{d}\xi$ that quantifies the preference of the LFP model among all its steady states. Because $\gamma(\xi)^{-2}$ is an increasing


Figure 6: Effect of Early-stopping on contaminated data. The training set and the test set consist of 300 and 6000 data points evenly sampled in [−10, 10], respectively. (a) The sampled values of the test set (red square dashed line) and DNN outputs (blue solid line) at the turning step. (b) Loss functions for training set (green stars) and test set (red dots) at different recording steps. The green dashed line is drawn at the turning step, where the best generalization performance is achieved. (c) The Fourier transform of the true data for the training set (red) and test set (black), and the Fourier transform of the DNN output for the training set (green), and test set (magenta) at the turning step. function, say γ(ξ)−2 = kξkd+1, the FP-energy R kξkd+1|hˆ|2 dξ amplifies the high frequencies while diminishing low frequencies. By minimizing Eγ(h), problem (35) gives rise to a low frequency fitting, instead of an arbitrary one, of training data. By intuition, if target f ∗ is indeed low frequency ∗ dominant, then h∞ likely well approximates f at unobserved positions. To theoretically demonstrate above intuition, Luo et al. (2020b) derive in the following an estimate of the generalization error of h∞ using the a priori error estimate technique E et al. (2019). ∗ ∗ Because h(x) = f (x) is a viable steady state, Eγ(h∞) ≤ Eγ(f ) by the minimization problem. Using this constraint on h∞, one can obtain that, with probability of at least 1 − δ,

∗ Eγ(f )  p  (h (x) − f ∗(x))2 ≤ √ C 2 + 4 2 log(4/δ) , (47) Ex ∞ n γ where Cγ is a constant depending on γ. Error reduces with more training data as expected with a √ ∗ decay rate 1/ n similar to Monte-Carlo method. Importantly, because Eγ(f ) strongly amplifies high frequencies of f ∗, the more high-frequency components the target function f ∗ possesses, the worse h∞ may generalize. Note that the error estimate is also consistent with another result (Arora et al., 2019) published at similar time. Arora et al. (2019) prove that the generalization error of the two-layer ReLU network in NTK regime found by GD is at most r 2Y T (K∗)−1Y , (48) n ∗ n where K is defined in (25), Y ∈ R is the labels of n training data. If the data Y is dominated more by the component of the eigen-vector that has small eigen-value, then, the above quantity is larger. Since in NTK regime the eigen-vector that has small eigen-value corresponds to a higher frequency, the error bound in (47) is larger, consistent with (48).

6. Algorithms and scientific computing

Recently, DNN-based approaches have been actively explored for a variety of scientific computing problems, e.g., solving high-dimensional partial differential equations (E et al., 2017; Khoo and Ying, 2019; He et al., 2018; Fan et al., 2018; Han et al., 2019; Weinan et al., 2018; Strofer et al., 2019) and molecular dynamics (MD) simulations (Han et al., 2018). Three types of DNN methods are often used: first, parameterizing the solution of a specific PDE by a DNN (E et al., 2017; Raissi et al., 2019); second, learning the mapping from the source to the solution at a set of fixed grid points (Fan et al., 2018); third, learning the operator in the PDE (Li et al., 2020c). An overview of using DNNs to solve high-dimensional PDEs can be found in Han et al. (2020). However, the behaviors of DNNs applied to these problems are not well understood. To facilitate the design and application of DNN-based schemes, it is important to characterize the difference between DNNs and conventional numerical schemes on various scientific computing problems. In this section, we focus on parameterizing the solution of a specific PDE by a DNN and review two types of difference between DNN-based algorithms and traditional algorithms. Then, we review several DNN-based algorithms that focus on overcoming the difficulty of learning high frequencies.

6.1 The idea of DNN-based algorithms for solving a PDE

Consider solving Poisson's equation, which has broad applications in mechanical engineering and theoretical physics (Evans, 2010),
$$ -\Delta u(x) = g(x), \quad x \in \Omega, \qquad (49) $$
with the boundary condition
$$ u(x) = \tilde{g}(x), \quad x \in \partial\Omega. \qquad (50) $$
Traditionally, this problem can be solved by finite difference or finite element methods. However, traditional methods suffer from the curse of dimensionality: for example, the number of grid points grows exponentially with the dimension. By parameterizing the solution with a DNN, one can solve high-dimensional PDEs. In addition, since such DNN algorithms are mesh-free, they can also solve problems with complicated geometry. In the following, we review two commonly used loss functions for solving PDEs.

6.1.1 A RITZ VARIATIONAL METHOD FOR PDE

The deep Ritz method proposed by E and Yu (2018) produces a variational solution $u(r)$ through the following minimization problem

$$ u = \arg\min_{v \in \Pi} J(v), \qquad (51) $$
where the energy functional is defined as
$$ J(v) = \int_\Omega \Big( \frac{1}{2}|\nabla v|^2 + V(r) v^2 \Big)\,\mathrm{d}r - \int_\Omega g(r) v(r)\,\mathrm{d}r \triangleq \int_\Omega E(v(r))\,\mathrm{d}r. \qquad (52) $$
In simulation, the Ritz loss function is chosen to be
$$ L_{\mathrm{ritz}}(h) = \frac{1}{n} \sum_{x \in S} \big( |\nabla h(x)|^2/2 - g(x) h(x) \big) + \beta\,\frac{1}{\tilde{n}} \sum_{x \in \tilde{S}} (h(x) - \tilde{g}(x))^2, \qquad (53) $$

where $h(x)$ is the DNN output, $S$ is the set of samples from $\Omega$, and $n$ is the sample size; $\tilde{S}$ denotes the set of $\tilde{n}$ samples from $\partial\Omega$. The second (penalty) term enforces the boundary condition.
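A minimal PyTorch sketch of the Ritz loss (53) for a 1-d Poisson problem $-u''(x) = g(x)$ on $(-1, 1)$ with zero boundary values is given below, assuming $V \equiv 0$; the network size, sampling, penalty weight $\beta$, and source term are illustrative choices. The gradient $\nabla h$ is obtained by automatic differentiation with respect to the input.

```python
# Sketch of the deep Ritz loss L_ritz(h) for -u''(x) = g(x) on (-1, 1), u(+-1) = 0.
# V is taken to be 0; beta, the network, and the sampling are illustrative choices.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))

def g(x):
    return torch.sin(x) + 4 * torch.sin(4 * x)   # illustrative source term

def ritz_loss(net, n=256, beta=10.0):
    x = (2 * torch.rand(n, 1) - 1).requires_grad_(True)    # interior samples in (-1, 1)
    h = net(x)
    grad_h = torch.autograd.grad(h.sum(), x, create_graph=True)[0]
    interior = (0.5 * grad_h ** 2 - g(x) * h).mean()
    xb = torch.tensor([[-1.0], [1.0]])                     # boundary samples
    boundary = (net(xb) ** 2).mean()                       # target boundary value is 0
    return interior + beta * boundary

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    loss = ritz_loss(net)
    opt.zero_grad(); loss.backward(); opt.step()
```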

6.1.2 MSCALEDNN LEAST SQUARE ERROR METHOD FOR PDES

In an alternative approach, one can simply use the loss function of the least squared residual error (LSE),
$$ L_{\mathrm{LSE}}(h) = \frac{1}{n} \sum_{x \in S} (\Delta h(x) + g(x))^2 + \beta\,\frac{1}{\tilde{n}} \sum_{x \in \tilde{S}} (h(x) - \tilde{g}(x))^2. \qquad (54) $$

To assess the learning accuracy, one can compute the distance between $h(x)$ and $u_{\mathrm{true}}$,

$$ \mathrm{MSE}(h(x), u_{\mathrm{true}}(x)) = \frac{1}{n + \tilde{n}} \sum_{x \in S \cup \tilde{S}} (h(x) - u_{\mathrm{true}}(x))^2. \qquad (55) $$

6.2 Difference from traditional algorithms

6.2.1 ITERATIVE METHODS

A stark difference between a DNN-based solver and the Jacobi method during the training/iteration is that DNNs learn the solution from low to high frequency (Xu et al., 2019b; Wang et al., 2020d; Chen et al., 2020a), while the Jacobi method learns the solution from high to low frequency. Therefore, DNNs suffer from a curse of high frequency.

Jacobi method. Before showing the difference between a DNN-based solver and the Jacobi method, we illustrate the procedure of the Jacobi method. Consider a 1-d Poisson's equation:

$$ -\Delta u(x) = g(x), \quad x \in \Omega \triangleq (-1, 1), \qquad (56) $$
$$ u(-1) = u(1) = 0. \qquad (57) $$

$[-1, 1]$ is uniformly discretized into $n + 1$ points with grid size $\delta x = 2/n$. The Poisson's equation in Eq. (56) can be solved by the central difference scheme,
$$ -\Delta u_i = -\frac{u_{i+1} - 2u_i + u_{i-1}}{(\delta x)^2} = g(x_i), \quad i = 1, 2, \cdots, n-1, \qquad (58) $$
resulting in a linear system
$$ A\mathbf{u} = \mathbf{g}, \qquad (59) $$
where
$$ A = \begin{pmatrix} 2 & -1 & 0 & 0 & \cdots & 0 \\ -1 & 2 & -1 & 0 & \cdots & 0 \\ 0 & -1 & 2 & -1 & \cdots & 0 \\ \vdots & \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 0 & -1 & 2 \end{pmatrix}_{(n-1)\times(n-1)}, \qquad (60) $$

23     u1 g1  u2   g2      i  .  2  .  u =  .  , g = (δx)  .  , xi = 2 . (61)     n  un−2   gn−2  un−1 gn−1 A class of methods to solve this linear system is iterative schemes, for example, the Jacobi method. Let A = D − L − U, where D is the diagonal of A, and L and U are the strictly lower and upper triangular parts of −A, respectively. Then, we obtain

\[
u = D^{-1}(L+U)u + D^{-1}g. \tag{62}
\]

At step t ∈ N, the Jacobi iteration reads as

\[
u^{t+1} = D^{-1}(L+U)u^{t} + D^{-1}g. \tag{63}
\]

We perform the standard error analysis of the above iteration process. Denote by $u^{*}$ the true solution obtained by directly inverting $A$ in Eq. (59). The error at step $t+1$ is $e^{t+1} = u^{t+1} - u^{*}$. Then, $e^{t+1} = R_J e^{t}$, where $R_J = D^{-1}(L+U)$. The convergence speed of $e^{t}$ is determined by the eigenvalues of $R_J$, that is,
\[
\lambda_k = \lambda_k(R_J) = \cos\frac{k\pi}{n}, \quad k = 1, 2, \cdots, n-1, \tag{64}
\]
and the entries of the corresponding eigenvector $v_k$ are
\[
v_{k,i} = \sin\frac{ik\pi}{n}, \quad i = 1, 2, \cdots, n-1. \tag{65}
\]
So we can write
\[
e^{t} = \sum_{k=1}^{n-1} \alpha_k^{t} v_k, \tag{66}
\]
where $\alpha_k^{t}$ can be understood as the magnitude of $e^{t}$ in the direction of $v_k$. Then,
\[
e^{t+1} = \sum_{k=1}^{n-1} \alpha_k^{t} R_J v_k = \sum_{k=1}^{n-1} \alpha_k^{t} \lambda_k v_k, \tag{67}
\]
that is, $\alpha_k^{t+1} = \lambda_k \alpha_k^{t}$. Therefore, the convergence rate of $e^{t}$ in the direction of $v_k$ is controlled by $\lambda_k$. Since
\[
\cos\frac{k\pi}{n} = -\cos\frac{(n-k)\pi}{n}, \tag{68}
\]
the frequencies $k$ and $n-k$ are closely related and converge at the same rate. For frequencies $k < n/2$, $\lambda_k$ is larger for lower frequencies; therefore, lower frequencies converge more slowly in the Jacobi method.
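This analysis is easy to reproduce numerically. The following NumPy sketch (a toy setup assumed for illustration, not the experiment of Xu et al. (2019b)) runs the Jacobi iteration (63) from a random initial guess and projects the error onto the eigenvectors v_k: the coefficient of a low-frequency mode decays far more slowly than that of a higher-frequency mode with k < n/2.

```python
import numpy as np

n = 64                                        # n+1 grid points on [-1, 1]
dx = 2.0 / n
x = -1.0 + dx * np.arange(1, n)               # interior points x_1, ..., x_{n-1}
g = (np.sin(x) + 4 * np.sin(4 * x)) * dx**2   # right-hand side vector, cf. Eq. (61)

# Tridiagonal matrix A of Eq. (60) and its splitting A = D - L - U.
A = 2 * np.eye(n - 1) - np.eye(n - 1, k=1) - np.eye(n - 1, k=-1)
D_inv = np.diag(1.0 / np.diag(A))
LU = np.diag(np.diag(A)) - A                  # L + U
u_true = np.linalg.solve(A, g)

# Columns of V are the eigenvectors v_k of R_J, with entries v_{k,i} = sin(i*k*pi/n).
V = np.array([[np.sin(i * k * np.pi / n) for i in range(1, n)]
              for k in range(1, n)]).T

rng = np.random.default_rng(0)
u = rng.standard_normal(n - 1)                # random start so that all modes are excited
for t in range(201):
    if t % 50 == 0:
        alpha = np.linalg.lstsq(V, u - u_true, rcond=None)[0]
        # the error in a low mode (k = 1) decays far more slowly than in a
        # higher mode (k = 16 < n/2), since |cos(pi/n)| > |cos(16*pi/n)|
        print(t, abs(alpha[0]), abs(alpha[15]))
    u = D_inv @ (LU @ u + g)                  # Jacobi update, Eq. (63)
```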


Figure 7: Poisson's equation. (a) u_ref(x). Inset: |û_ref(k)| as a function of frequency; frequency peaks are marked with black dots. (b, c) ΔF(k) computed on the inputs of the training data at different epochs for the selected frequencies for the DNN (b) and the Jacobi method (c). (d) ‖h − u_ref‖_∞ at different running times. Green stars indicate ‖h − u_ref‖_∞ using the DNN alone. The dashed lines indicate ‖h − u_ref‖_∞ for the Jacobi method, with different colors indicating initialization by different timings of DNN training. Xu et al. (2019b) use a DNN with widths 1-4000-500-400-1 and full-batch training by the Adam optimizer (Kingma and Ba, 2015). The learning rate is 0.0005 and β is 10. The parameters of the DNN are initialized following a Gaussian distribution with mean 0 and standard deviation 0.02.

Numerical experiments. Xu et al. (2019b) consider the example with g(x) = sin(x) + 4 sin(4x) − 8 sin(8x) + 16 sin(24x), such that the exact solution u_ref(x) contains several high frequencies. After training with the Ritz loss, the DNN output well matches the analytical solution u_ref. Focusing on the convergence of the three peaks (inset of Fig. 7(a)) in the Fourier transform of u_ref, as shown in Fig. 7(b), low frequencies converge faster than high frequencies, as predicted by the F-Principle. For comparison, Xu et al. (2019b) also use the Jacobi method to solve problem (56). High frequencies converge faster in the Jacobi method, as shown in Fig. 7(c).

As a demonstration, Xu et al. (2019b) further propose that DNNs can be combined with conventional numerical schemes to accelerate the convergence of low frequencies in computational problems. First, Xu et al. (2019b) solve the Poisson's equation in Eq. (56) by a DNN with M optimization steps (or epochs). Then, they use the Jacobi method with the new initial data for the further iterations. A proper choice of M is indicated by the initial point of the orange dashed line, in which low frequencies are quickly captured by the DNN, followed by fast convergence of the high frequencies under the Jacobi method. A similar idea of using a DNN as the initial guess for traditional methods is shown to be effective in later works (Huang et al., 2020). This example illustrates a cautionary tale: although DNNs have clear advantages, using DNNs alone may not be the best option because of their slow convergence at high frequencies. Taking advantage of both DNNs and conventional methods to design faster schemes could be a promising direction in scientific computing.

6.2.2 RITZ-GALERKIN (R-G) METHOD

Wang et al. (2020b) study the difference between the R-G method and DNN methods, reviewed as follows.

R-G method. We briefly introduce the R-G method (Brenner and Scott, 2008). For problem (56), we construct a functional
\[
J(u) = \frac{1}{2} a(u, u) - (f, u), \tag{69}
\]

where
\[
a(u, v) = \int_{\Omega} \nabla u(x) \cdot \nabla v(x)\,\mathrm{d}x, \qquad (f, v) = \int_{\Omega} f(x) v(x)\,\mathrm{d}x.
\]
The variational form of problem (56) is the following: find $u \in H_0^1(\Omega)$ such that
\[
J(u) = \min_{v \in H_0^1(\Omega)} J(v). \tag{70}
\]
The weak form of (70) is to find $u \in H_0^1(\Omega)$ such that
\[
a(u, v) = (f, v), \quad \forall\, v \in H_0^1(\Omega). \tag{71}
\]
Problem (56) is the strong form if the solution $u \in H_0^2(\Omega)$. To numerically solve (71), we now introduce a finite dimensional space $U_h$ to approximate the infinite dimensional space $H_0^1(\Omega)$. Let $U_h \subset H_0^1(\Omega)$ be a subspace with a sequence of basis functions $\{\phi_1, \phi_2, \cdots, \phi_m\}$. The numerical solution $u_h \in U_h$ that we seek can be represented as
\[
u_h = \sum_{k=1}^{m} c_k \phi_k, \tag{72}
\]
where the coefficients $\{c_k\}$ are the unknowns to be solved for. Replacing $H_0^1(\Omega)$ by $U_h$, both problems (70) and (71) can be transformed into the following system:
\[
\sum_{k=1}^{m} c_k\, a(\phi_k, \phi_j) = (f, \phi_j), \quad j = 1, 2, \cdots, m. \tag{73}
\]

From (73), we can calculate the $c_k$ and then obtain the numerical solution $u_h$. We usually call (73) the R-G equation. For different types of basis functions, the R-G method specializes to the finite element method (FEM), the spectral method (SM), and so on. If the basis functions $\{\phi_i(x)\}$ are local, namely, compactly supported, the method is usually referred to as the FEM. Assume that $\Omega$ is a polygon and that we divide it into a finite element grid $\mathcal{T}_h$ by simplices, with $h = \max_{\tau \in \mathcal{T}_h} \mathrm{diam}(\tau)$. A typical finite element basis is the linear hat basis function, satisfying
\[
\phi_k(x_j) = \delta_{kj}, \quad x_j \in N_h, \tag{74}
\]
where $N_h$ stands for the set of nodes of the grid $\mathcal{T}_h$. Schematic diagrams of the basis functions in 1-D and 2-D are shown in Fig. 8. On the other hand, if we choose global basis functions such as the Fourier basis or the Legendre basis (Shen et al., 2011), the R-G method is called the spectral method.

The error estimate theory of the R-G method is well established. Under suitable assumptions on the regularity of the solution, the linear finite element solution $u_h$ has the error estimate
\[
\|u - u_h\|_{1} \le C_1 h |u|_{2},
\]
where the constant $C_1$ is independent of the grid size $h$. The spectral method has the error estimate
\[
\|u - u_h\| \le \frac{C_2}{m^{s}},
\]
where $C_2$ is a constant and the exponent $s$ depends only on the regularity (smoothness) of the solution $u$. If $u$ is smooth enough and satisfies certain boundary conditions, the spectral method has spectral accuracy.


Figure 8: The finite element basis function in 1d and 2d. Reprinted from Wang et al. (2020b).

Different learning results. DNNs with the ReLU activation function can be proved to be equivalent to finite element methods in the sense of approximation (He et al., 2018). However, the learning results show a stark difference. To investigate the difference, we use a controlled experiment, that is, solving the PDE given $n$ sample points while setting both the number of basis functions in the R-G method and the number of neurons in the DNN equal to $m$. Although not realistic in the common usage of the R-G method, we consider the case $m > n$, because the two methods behave completely differently in this situation, especially when $m \to \infty$. Replacing the integral on the r.h.s. of (73) with a Monte-Carlo (MC) integral formula, we obtain
\[
\sum_{k=1}^{m} c_k\, a(\phi_k, \phi_j) = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\phi_j(x_i), \quad j = 1, 2, \cdots, m. \tag{75}
\]
We consider the 2-D case
\[
\begin{cases}
-\Delta u(x) = f(x), & x \in (0, 1)^2, \\
u(x) = 0, & x \in \partial (0, 1)^2,
\end{cases}
\]
where $x = (x, y)$ and we know the values of $f$ at $n$ points sampled from the function $f(x) := f(x, y) = 2\pi^2 \sin(\pi x)\sin(\pi y)$. We fix the number of sample points at $n = 5^2$; the sampling points in both the $x$ direction and the $y$ direction are taken at $x_h = [0.1, 0.25, 0.5, 0.8, 0.9]$. We test the solution with the number of basis functions $m = 5^2, 50^2, 100^2, 200^2$, respectively. Fig. 9 plots the R-G solutions with the Legendre basis and the piecewise linear basis functions. It can be seen that, for large $m$, the numerical solution is a function with strong singularities. Fig. 10 shows the profile of the R-G solutions at $y = 0.5$ for various $m$, in which we can see that the values of the numerical solutions at the sampling points grow larger and larger as $m$ increases. However, Fig. 11 shows that the two-layer DNN solutions remain stable, without singularity, for large $m$. The smooth solution of the DNN, especially when the number of neurons is large, can be understood through the low-frequency bias, such as the analysis shown in the LFP theory. This helps understand the wide application of DNNs in solving PDEs. For example, the low-frequency bias intuitively explains why DNNs solve a shock wave with a smooth solution in Michoski et al. (2019). For the R-G method, the following theorem explains why there is a singularity in the 2-D case when $m$ is large.

Theorem 3. When $m \to \infty$, the numerical method (75) is solving the problem
\[
\begin{cases}
-\Delta u(x) = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} \delta(x - x_i) f(x_i), & x \in \Omega, \\
u(x) = 0, & x \in \partial\Omega,
\end{cases} \tag{76}
\]
where $\delta(x)$ represents the Dirac delta function.

Remark 2. For the 1-D case, the analytic solution to problem (76) defined on $[a, b]$ can be given as a piecewise linear function, namely,
\[
u(x) = \frac{1}{n}\sum_{i=1}^{n} f(x_i)(b - x_i)\frac{x - a}{b - a} - \frac{1}{n}\sum_{i=1}^{n} f(x_i)(x - x_i)H(x - x_i), \tag{77}
\]
where $H(x)$ is the Heaviside step function
\[
H(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0. \end{cases}
\]
For the 2-D case, Polyanin (2002) gives the exact solution on $[0, a] \times [0, b]$ by the Green's function,
\[
u(x, y) = \frac{4}{nab} \sum_{i=1}^{n} \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} f_i \frac{\sin(p_k x)\sin(q_l y)\sin(p_k x_i)\sin(q_l y_i)}{p_k^2 + q_l^2}, \tag{78}
\]
where $f_i = f(x_i, y_i)$, $p_k = \pi k / a$, and $q_l = \pi l / b$. We can prove that this series diverges at the sampling points $(x_i, y_i)$ $(i = 1, 2, \cdots, n)$ and converges at other points. Therefore, the 2-D exact solution $u(x, y)$ is highly singular.

Remark 3. Although there exists an equivalence between DNN-based algorithms and traditional methods for solving PDEs in the sense of approximation, it is important to take the implicit bias into account when analyzing the learning results of DNNs.
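To make the 1-D picture of Remark 2 concrete, the following NumPy sketch (an assumed toy setup, not the code of Wang et al. (2020b)) assembles the linear-FEM version of system (75) on [0, 1]: as m grows with n fixed, the numerical solution approaches the piecewise-linear function (77) of the delta-source problem (76), rather than the solution of the original smooth problem.

```python
import numpy as np

def hat(x, node, h):
    """Linear hat basis function centered at `node` with support width 2h."""
    return np.clip(1.0 - np.abs(x - node) / h, 0.0, None)

def rg_mc_solve(m, xs, f_vals, eval_pts):
    """Solve Eq. (75) for -u'' = f on (0,1), u(0)=u(1)=0, with m hat bases
    and a Monte-Carlo right-hand side built from n = len(xs) sample points."""
    h = 1.0 / (m + 1)
    nodes = h * np.arange(1, m + 1)
    # stiffness matrix a(phi_k, phi_j) for linear elements: tridiag(-1, 2, -1)/h
    A = (2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h
    b = np.array([np.mean(f_vals * hat(xs, node, h)) for node in nodes])
    c = np.linalg.solve(A, b)
    return sum(ck * hat(eval_pts, node, h) for ck, node in zip(c, nodes))

xs = np.array([0.1, 0.25, 0.5, 0.8, 0.9])            # n = 5 sample points
f_vals = 2 * np.pi**2 * np.sin(np.pi * xs)            # a smooth toy source, values at samples
grid = np.linspace(0, 1, 201)

# Eq. (77) with [a, b] = [0, 1]: the piecewise-linear limit of the MC-quadrature R-G solution
u_limit = (np.mean(f_vals * (1 - xs)) * grid
           - sum(fv * np.clip(grid - xi, 0, None) for fv, xi in zip(f_vals, xs)) / len(xs))

for m in (5, 50, 500):
    u_m = rg_mc_solve(m, xs, f_vals, grid)
    print(m, np.max(np.abs(u_m - u_limit)))           # gap to the limit (77) shrinks as m grows
```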

6.3 Algorithm design to overcome the curse of high-frequency

Compared with many heuristic rules of thumb in designing DNN-based algorithms, the F-Principle provides valuable theoretical insights for overcoming the limit of DNN-based algorithms, that is, the curse of high frequency (Xu et al., 2019b; Wang et al., 2020d). We review some ideas for learning high frequencies faster.

Jagtap et al. (2020) employ adaptive activation functions for regression in DNNs to solve both forward and inverse problems. The activation function σ(x) is replaced by σ(μax), where μ is a fixed scale factor with μ ≥ 1 and a is a trainable variable shared by all neurons. Through empirical studies, they found that this method can learn high-frequency components of the target function faster than the normal activation function. By explicitly imposing higher priority on high frequencies in the loss function, Biland et al. (2019) significantly accelerate the simulation of fluid dynamics through a DNN approach; this approach requires computation in the Fourier domain.

Another approach is to convert the learning or approximation of high-frequency data into that of low-frequency data.

Figure 9: (Example 3): R-G solutions with different m. The basis functions for the first and the second row are the Legendre basis function and the piecewise linear basis function, respectively. Reprinted from Wang et al. (2020b).

[Figure 10 panels: SM (left) and FEM (right), both with n = 5²; curves show the profile u(x, y = 0.5) for m = 5², 50², 100², 200².]

Figure 10: (Example 3): Profile of R-G solutions with different m. Reprinted from Wang et al. (2020b).

PhaseDNN (Cai et al., 2019) is a method that implements this conversion by shifting high-frequency components of the data downward to a low-frequency spectrum; Peng et al. (2020b) propose a similar idea. The learning of the shifted data can be achieved quickly with a small-sized DNN, and the learned function is then shifted back (i.e., upward in frequency) to give an approximation to the original high-frequency data. The PhaseDNN has been shown to be very effective in handling highly oscillatory data from solutions of high-frequency Helmholtz equations and functions of small dimensions. However, due to the number of phase shifts employed along each coordinate direction independently, the PhaseDNN will result in many small DNNs and a considerable computational cost even for three-dimensional problems.

Figure 11: Two-layer DNN solutions with different m. The activation functions for the first and the second row are ReLU(x) and sin(x), respectively. Reprinted from Wang et al. (2020b).

In this section, we review the multi-scale DNN (MscaleDNN) method, originally proposed in Cai and Xu (2019) and completed in Liu et al. (2020). MscaleDNN considers a different way to achieve the conversion from high frequency to low frequency: with a radial partition of the Fourier space, a scaling-down operation converts a higher-frequency spectrum to a lower-frequency one before the learning is carried out with a small-sized DNN. As the scaling operation only needs to be done along the radial direction in the Fourier space, this approach is easy to implement and gives an overall small number of DNNs, thus reducing the computational cost. In addition, borrowing the multi-resolution concept of wavelet approximation theory using compact scaling and wavelet functions (Daubechies, 1992), the traditional global activation functions are modified to ones with compact support. The compact support of the activation functions, with sufficient smoothness, gives a localization in the frequency domain, where the scaling operation effectively produces DNNs that approximate different frequency contents of a PDE solution. Li et al. (2020b) and Wang et al. (2020a) further use MscaleDNN to solve high-dimensional p-Laplacian problems and oscillatory Stokes flows in complex domains, respectively.

6.3.1 FREQUENCY SCALED DNNS

The following presents a naive idea (Cai and Xu, 2019; Liu et al., 2020) of how to use a frequency scaling in the Fourier wave number space to reduce a high-frequency learning problem for a function to a low-frequency learning problem for the DNN.

Consider a band-limited function $f(x)$, $x \in \mathbb{R}^d$, whose Fourier transform $\hat{f}(k)$ has compact support, i.e.,
\[
\mathrm{supp}\, \hat{f}(k) \subset B(K_{\max}) = \{ k \in \mathbb{R}^d,\ |k| \le K_{\max} \}. \tag{79}
\]

We will first partition the domain $B(K_{\max})$ as a union of $M$ concentric annuli with uniform or non-uniform width, e.g., for the case of uniform $K_0$-width,
\[
A_i = \{ k \in \mathbb{R}^d,\ (i-1)K_0 \le |k| \le i K_0 \}, \quad K_0 = K_{\max}/M, \quad 1 \le i \le M, \tag{80}
\]
so that
\[
B(K_{\max}) = \bigcup_{i=1}^{M} A_i. \tag{81}
\]
Now, we can decompose the function $\hat{f}(k)$ as follows:

\[
\hat{f}(k) = \sum_{i=1}^{M} \chi_{A_i}(k)\hat{f}(k) \triangleq \sum_{i=1}^{M} \hat{f}_i(k), \tag{82}
\]
where $\chi_{A_i}$ is the indicator function of the set $A_i$ and
\[
\mathrm{supp}\, \hat{f}_i(k) \subset A_i. \tag{83}
\]
The decomposition in the Fourier space gives a corresponding one in the physical space,
\[
f(x) = \sum_{i=1}^{M} f_i(x), \tag{84}
\]
where
\[
f_i(x) = \mathcal{F}^{-1}[\hat{f}_i(k)](x) = f(x) * \chi_{A_i}^{\vee}(x), \tag{85}
\]
and the inverse Fourier transform of $\chi_{A_i}(k)$, called the frequency selection kernel (Cai et al., 2019), can be computed analytically using Bessel functions:
\[
\chi_{A_i}^{\vee}(x) = \frac{1}{(2\pi)^{d/2}} \int_{A_i} e^{i k \cdot x}\,\mathrm{d}k. \tag{86}
\]

From (83), we can apply a simple down-scaling to convert the high-frequency region $A_i$ to a low-frequency region. Namely, we define a scaled version of $\hat{f}_i(k)$ as
\[
\hat{f}_i^{(\mathrm{scale})}(k) = \hat{f}_i(\alpha_i k), \quad \alpha_i > 1, \tag{87}
\]
and, correspondingly, in the physical space
\[
f_i^{(\mathrm{scale})}(x) = f_i\Big(\frac{1}{\alpha_i} x\Big), \tag{88}
\]
or
\[
f_i(x) = f_i^{(\mathrm{scale})}(\alpha_i x). \tag{89}
\]
We can see that the scaled function $\hat{f}_i^{(\mathrm{scale})}(k)$ has a low-frequency spectrum if $\alpha_i$ is chosen large enough, i.e.,
\[
\mathrm{supp}\, \hat{f}_i^{(\mathrm{scale})}(k) \subset \Big\{ k \in \mathbb{R}^d,\ \frac{(i-1)K_0}{\alpha_i} \le |k| \le \frac{i K_0}{\alpha_i} \Big\}. \tag{90}
\]

Using the F-Principle of common DNNs (Xu et al., 2019b), with $iK_0/\alpha_i$ being small, we can train a DNN $f_{\theta^{n_i}}(x)$, with $\theta^{n_i}$ denoting the DNN parameters, to learn $f_i^{(\mathrm{scale})}(x)$ quickly,
\[
f_i^{(\mathrm{scale})}(x) \sim f_{\theta^{n_i}}(x), \tag{91}
\]
which gives an approximation to $f_i(x)$ immediately,
\[
f_i(x) \sim f_{\theta^{n_i}}(\alpha_i x), \tag{92}
\]
and to $f(x)$ as well,
\[
f(x) \sim \sum_{i=1}^{M} f_{\theta^{n_i}}(\alpha_i x). \tag{93}
\]
The difficulty of using the above procedure directly for approximating a function, and even more so for finding a PDE solution, is the need to compute the convolution in (85), which is computationally expensive for scattered data in space, especially in higher-dimensional problems. However, this framework lays out the structure of the multi-scale DNN to be proposed next.
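The scaling relation (87)–(90) is easy to visualize in 1-D with a discrete Fourier transform. In the following NumPy sketch (the signal and the scale factor α = 8 are assumptions for illustration), scaling the input by 1/α compresses the support of the spectrum by the factor α, moving the content into the low-frequency range that, by the F-Principle, a DNN learns quickly.

```python
import numpy as np

N, L = 4096, 2 * np.pi             # number of samples and domain length
x = np.linspace(0, L, N, endpoint=False)
alpha = 8.0

f_i = np.sin(40 * x)               # a band piece f_i with frequency ~40 (in some A_i)
f_scaled = np.sin(40 * x / alpha)  # f_i^(scale)(x) = f_i(x / alpha), cf. Eq. (88)

def dominant_freq(signal):
    """Index of the dominant Fourier mode (cycles over the domain), skipping the DC mode."""
    spectrum = np.abs(np.fft.rfft(signal))
    return int(np.argmax(spectrum[1:]) + 1)

print(dominant_freq(f_i), dominant_freq(f_scaled))   # ~40 vs ~5: support shrunk by alpha
```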

6.3.2 ACTIVATION FUNCTION WITH COMPACT SUPPORT

In order to produce the scale separation and identification capability of a MscaleDNN, Cai and Xu (2019); Liu et al. (2020) borrow the idea of compact scaling functions in wavelet theory (Daubechies, 1992) and consider activation functions with compact support as well. Compared with the normal activation function ReLU(x) = max{0, x}, we will see that activation functions with compact support are more effective in MscaleDNNs. Several activation functions are defined in Liu et al. (2020) as follows. One is sReLU(x),

\[
\mathrm{sReLU}(x) = \mathrm{ReLU}(-(x-1)) \times \mathrm{ReLU}(x) = (x)_{+}(1-x)_{+}, \tag{94}
\]
and another is the quadratic B-spline with a continuous first derivative, $\phi(x)$,
\[
\phi(x) = (x-0)_{+}^{2} - 3(x-1)_{+}^{2} + 3(x-2)_{+}^{2} - (x-3)_{+}^{2}, \tag{95}
\]
where $x_{+} = \max\{x, 0\} = \mathrm{ReLU}(x)$; the latter has the alternative form
\[
\phi(x) = \mathrm{ReLU}(x)^{2} - 3\,\mathrm{ReLU}(x-1)^{2} + 3\,\mathrm{ReLU}(x-2)^{2} - \mathrm{ReLU}(x-3)^{2}. \tag{96}
\]

Li et al. (2020b) define another one

\[
\mathrm{s2ReLU}(x) = \sin(2\pi x) \cdot \mathrm{ReLU}(x) \cdot \mathrm{ReLU}(1-x). \tag{97}
\]
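These activation functions are one-liners in practice; a minimal PyTorch sketch is given below (the function names follow the text, while the implementation details are assumptions rather than the reference code of Liu et al. (2020) or Li et al. (2020b)).

```python
import math
import torch
import torch.nn.functional as F

def sReLU(x):
    """Eq. (94): sReLU(x) = (x)_+ (1 - x)_+, supported on [0, 1]."""
    return F.relu(x) * F.relu(1.0 - x)

def quad_bspline(x):
    """Eqs. (95)/(96): quadratic B-spline with a continuous first derivative, supported on [0, 3]."""
    return (F.relu(x) ** 2 - 3 * F.relu(x - 1) ** 2
            + 3 * F.relu(x - 2) ** 2 - F.relu(x - 3) ** 2)

def s2ReLU(x):
    """Eq. (97): s2ReLU(x) = sin(2*pi*x) * ReLU(x) * ReLU(1 - x)."""
    return torch.sin(2 * math.pi * x) * F.relu(x) * F.relu(1.0 - x)
```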

6.3.3 TWO MSCALEDNN STRUCTURES

While the procedure leading to (93) is not practical for numerical approximation in high dimensions, it does suggest a plausible form of function space for finding the solution more quickly with DNN functions. We can use a series of $a_i$ ranging from 1 to a large number to produce a MscaleDNN structure, with the goal of speeding up the convergence for solutions with a wide range of frequencies, with uniform accuracy across frequencies. For this purpose, Cai and Xu (2019); Liu et al. (2020) propose the following two multi-scale structures.

MscaleDNN-1. For the first kind, the neurons in the first hidden layer are separated into $N$ parts. The neurons in the $i$-th part receive the input $a_i x$, that is, their output is $\sigma(a_i w \cdot x + b)$, where $w$, $x$, and $b$ are the weight, input, and bias parameters, respectively. A complete MscaleDNN takes the following form:
\[
f_{\theta}(x) = W^{[L-1]} \sigma \circ \big( \cdots \big( W^{[1]} \sigma \circ \big( W^{[0]} (K \odot x) + b^{[0]} \big) + b^{[1]} \big) \cdots \big) + b^{[L-1]}, \tag{98}
\]
where $x \in \mathbb{R}^d$, $W^{[l]} \in \mathbb{R}^{m_{l+1} \times m_l}$, $m_l$ is the number of neurons in the $l$-th hidden layer, $m_0 = d$, $b^{[l]} \in \mathbb{R}^{m_{l+1}}$, $\sigma$ is a scalar function, "$\circ$" means entry-wise operation, "$\odot$" is the Hadamard product, and
\[
K = (\underbrace{a_1, a_1, \cdots, a_1}_{\text{1st part}},\ \underbrace{a_2, a_2, \cdots, a_2}_{\text{2nd part}},\ a_3, \cdots, a_{i-1},\ \underbrace{a_i, a_i, \cdots, a_i}_{i\text{th part}},\ \cdots,\ \underbrace{a_N, a_N, \cdots, a_N}_{N\text{th part}})^{T}, \tag{99}
\]
where $a_i = i$ or $a_i = 2^{i-1}$. This structure of the form in Eq. (98) is referred to as Multi-scale DNN-1 (MscaleDNN-1), as depicted in Fig. 12(a).

MscaleDNN-2. A second kind of multi-scale DNN is shown in Fig. 12(b), as a sum of $N$ subnetworks, in which each scaled input goes through its own subnetwork. In MscaleDNN-2, the weight matrices from $W^{[1]}$ to $W^{[L-1]}$ are block diagonal. Again, one can select the scale coefficients $a_i = i$ or $a_i = 2^{i-1}$.

For comparison studies, a "normal" network is defined as a fully connected DNN with no multi-scale features. Cai and Xu (2019); Liu et al. (2020) perform extensive numerical experiments to examine the effectiveness of different settings and use an efficient one to solve complex problems. All models are trained by Adam (Kingma and Ba, 2015) with learning rate 0.001.
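A minimal PyTorch sketch of the MscaleDNN-2 structure described above, i.e., a sum of subnetworks where the i-th subnetwork receives the scaled input a_i x, is given below; the class names, sizes, and the use of an sReLU-type activation are illustrative assumptions, not the reference implementation of Liu et al. (2020).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sReLU(x):
    # compact-support activation (x)_+ (1 - x)_+, cf. Eq. (94)
    return F.relu(x) * F.relu(1.0 - x)

class SubNet(nn.Module):
    def __init__(self, in_dim, width, depth):
        super().__init__()
        dims = [in_dim] + [width] * depth + [1]
        self.layers = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = sReLU(layer(x))
        return self.layers[-1](x)

class MscaleDNN2(nn.Module):
    """Sum of subnetworks; the i-th subnetwork receives the scaled input a_i * x."""
    def __init__(self, in_dim=2, width=200, depth=3, scales=(1, 2, 4, 8, 16)):
        super().__init__()
        self.scales = scales
        self.subnets = nn.ModuleList(SubNet(in_dim, width, depth) for _ in scales)

    def forward(self, x):
        return sum(net(a * x) for a, net in zip(self.scales, self.subnets))

model = MscaleDNN2()
print(model(torch.rand(8, 2)).shape)   # torch.Size([8, 1])
```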


Figure 12: Illustration of the two MscaleDNN structures: (a) MscaleDNN-1; (b) MscaleDNN-2. Reprinted from Liu et al. (2020).

6.3.4 EXPERIMENTS: A SQUARE DOMAIN WITH A FEW HOLES

The centers of the three circular holes are (−0.5, −0.5), (0.5, 0.5), and (0.5, −0.5), with radii 0.1, 0.2, and 0.2, respectively. In each epoch, we randomly sample 3000 points on the outer boundary, 800 points on the boundary of each big hole, and 400 points on the boundary of the small hole.

Consider the Poisson equation (56) with the source term
\[
g(x) = 2\mu^{2} \sin(\mu x_1)\sin(\mu x_2), \quad \mu = 7\pi. \tag{100}
\]
The exact solution is
\[
u(x) = \sin(\mu x_1)\sin(\mu x_2), \tag{101}
\]
which also provides the boundary condition. In each training epoch, 5000 points are sampled inside the domain. The following two DNN structures are compared:

1. a fully-connected DNN with size 1-1000-1000-1000-1 (normal).

2. a MscaleDNN-2 with five subnetworks of size 1-200-200-200-1 and scale coefficients {1, 2, 4, 8, 16} (Mscale).


Figure 13: Exact and numerical solutions for the Poisson equation: (a) exact, (b) normal, (c) Mscale. Reprinted from Liu et al. (2020).

As shown in Fig. 14, the MscaleDNN solves the problem much faster and reaches lower errors.

Figure 14: Error vs. epoch for the Poisson equation in a square domain with a few holes. Reprinted from Liu et al. (2020).

Compared with the exact solution in Fig. 13(a), the normal DNN fails to resolve the magnitudes of many oscillations, as shown in Fig. 13(b), while the MscaleDNN captures each oscillation of the true solution accurately, as shown in Fig. 13(c).

6.3.5 HIGH DIMENSIONAL EXAMPLES

Li et al. (2020b) consider the following p-Laplacian problem in the domain $\Omega = [0, 1]^5$:
\[
\begin{cases}
-\mathrm{div}\big( \kappa(x_1, x_2, \cdots, x_5) |\nabla u(x_1, x_2, \cdots, x_5)|^{p-2} \nabla u(x_1, x_2, \cdots, x_5) \big) = f(x_1, x_2, \cdots, x_5), \\
u(0, x_2, \cdots, x_5) = u(1, x_2, \cdots, x_5) = 0, \\
\qquad \cdots\cdots \\
u(x_1, x_2, \cdots, 0) = u(x_1, x_2, \cdots, 1) = 0.
\end{cases} \tag{102}
\]
In this example, $p = 2$ and

κ(x1, x2, ··· , x5) = 1 + cos(πx1) cos(2πx2) cos(3πx3) cos(2πx4) cos(πx5).

The forcing term f is chosen such that the exact solution is

u(x1, x2, ··· , x5) = sin(πx1) sin(πx2) sin(πx3) sin(πx4) sin(πx5).

The five-dimensional elliptic problem is solved by two types of MscaleDNNs of size (1000, 800, 500, 500, 400, 200, 200, 100), with activation functions s2ReLU and sReLU, respectively. The training data set includes 7500 interior points and 1000 boundary points randomly sampled from Ω. The testing data set includes 1600 random samples in Ω. The testing results are shown in Fig. 15.

Figure 15: Testing results for high-dimensional example. Mean square error and relative error for s2ReLU and sReLU, respectively. Reprinted from Li et al. (2020b).

The numerical results in Fig. 15 indicate that the MscaleDNN models with s2ReLU and sReLU can still approximate the exact solution of the elliptic equation well in the five-dimensional space, and that the error of s2ReLU is much smaller than that of sReLU.

7. Understanding

The study of the F-Principle has been utilized to understand various phenomena emerging in applications of deep learning and to inspire the design of DNN-based algorithms (Bertiche et al., 2020; Guo and Ouyang, 2020; Lee et al., 2020; Zhang and Young, 2020; Ji and Zhu, 2020; Shao et al., 2021; Kim et al., 2021; Zhang et al., 2020a; Benz et al., 2020; Wang et al., 2020a; Xu et al., 2019a; Ma et al., 2020; Sharma and Ross, 2020; Camuto et al., 2020; Chen et al., 2019, 2020b; Zhu et al., 2019; Chakrabarty and Maji, 2019; Rabinowitz, 2019; Deng and Zhang, 2020; Tancik et al., 2020b; Mildenhall et al., 2020; Bi et al., 2020; Pumarola et al., 2020; Hennigh et al., 2020; Wang et al., 2020c; Xu et al., 2020; Häni et al., 2020; Zheng et al., 2020; Peng et al., 2020a; Guo et al., 2020; Cai and Xu, 2019; Liu et al., 2020; Agarwal et al., 2020; Tancik et al., 2020a; Biland et al., 2019; Liang et al., 2021; Jagtap et al., 2020; Campo et al., 2020; Jiang et al., 2020; Xi et al., 2020; You et al., 2020; Fu et al., 2021, 2020; Kim et al., 2020). We review some examples as follows.

7.1 Frequency perspective for understanding phenomena

Compression phase. Xu et al. (2019a) explain the compression phase in the information plane, proposed by Shwartz-Ziv and Tishby (2017), by the F-Principle as follows. The DNN first fits the continuous low-frequency components of the discretized target function. Then, the DNN output discretizes as the network gradually captures the high-frequency components. The entropy, or information, quantifies the possibility of output values, i.e., more possible output values lead to a higher entropy. By the definition of entropy, this discretization process naturally reduces the entropy of the DNN output. Thus, the compression phase appears in the information plane.

Increasing complexity. The F-Principle also explains the increasing complexity of the DNN output during training. For a common small initialization, the initial output of a DNN is often close to zero. The F-Principle indicates that the output of the DNN contains higher and higher frequencies during training. As frequency is a natural concept to characterize the complexity of a function, the F-Principle indicates that the complexity of the DNN output increases during training. This increasing complexity of the DNN output during training is consistent with previous studies and subsequent works (Arpit et al., 2017; Kalimeris et al., 2019; Goldt et al., 2020; He et al., 2020; Mingard et al., 2019; Jin et al., 2019).

Strength and limitation. Ma et al. (2020) show that the F-Principle may be a general mechanism behind the slow deterioration phenomenon in the training of DNNs, where the effect of the "double descent" is washed out. Sharma and Ross (2020) utilize the low-frequency bias of NNs to study the effectiveness of an iris-recognition NN. Chen et al. (2019) show that, under the same computational budget, a MuffNet is a better universal approximator for functions containing high-frequency components and thus better for mobile deep learning. Chen et al. (2020b) embed a frequency-aware classifier into the discriminator to measure the realness of the input in both the spatial and spectral domains, based on which the generator of SSD-GAN is encouraged to learn the high-frequency content of real data and generate exact details. Zhu et al. (2019) utilize the F-Principle to help understand why high frequency is a limitation when DNNs are used to solve the spectral deconvolution problem. Chakrabarty and Maji (2019) utilize the idea of the F-Principle to study the spectral bias of the deep image prior.

Frequency approach. Camuto et al. (2020) show that the effect of Gaussian noise injections to each hidden layer output is equivalent to a penalty on high-frequency components in the Fourier domain. Rabinowitz (2019); Deng and Zhang (2020) use the F-Principle as one of the typical phenomena to study the difference between normal learning and meta-learning.

7.2 Inspiring the design of algorithm

To accelerate the convergence of high frequencies, a series of works propose different approaches. Several works project the data into a high-dimensional space with a set of sinusoids (Tancik et al., 2020b; Mildenhall et al., 2020; Bi et al., 2020; Pumarola et al., 2020; Hennigh et al., 2020; Wang et al., 2020c; Xu et al., 2020; Häni et al., 2020; Zheng et al., 2020; Peng et al., 2020a; Guo et al., 2020), which is similar to the design of MscaleDNN in Cai and Xu (2019); Liu et al. (2020). Agarwal et al. (2020) revise a normal activation function σ(wx + b) to σ(e^w (x − b)), which can be more sensitive to the weight. Tancik et al. (2020a) use meta-learning to obtain a good initialization for fast and effective image restoration. Biland et al. (2019) explicitly impose higher priority on high frequencies in the loss function. Jagtap et al. (2020); Liang et al. (2021) design different types of activation functions. Campo et al. (2020) use a frequency filter to help reduce the interdependency between the low-frequency and the (harder to learn) high-frequency components of the state-action value approximation to achieve better results in reinforcement learning. Jiang et al. (2020) dynamically down-weight low frequencies in the loss function to generate high-quality images for generative models. Xi et al. (2020) argue that the performance improvement in low-resolution image classification is affected by the inconsistency between the information loss and the learning priority on low-frequency and high-frequency components, and propose a network structure to overcome this inconsistency.

The F-Principle shows that DNNs quickly learn the low-frequency part, which often dominates in real data and is more robust. At the early training stage, the NN is similar to a linear model (Kalimeris et al., 2019; Hu et al., 2020). Some works take advantage of the DNN at the early training stage to save training computation cost. The original lottery ticket approach (Frankle and Carbin, 2019) requires a full training of the DNN, which has a very high computational cost; most of the computation is used to capture high frequencies, while high frequencies may not be important in many cases. You et al. (2020) show that a small but critical subnetwork emerges at the early training stage (Early-Bird ticket), and the performance of training this small subnetwork with the same initialization is similar to that of training the full network, thus saving significant energy for training networks. Fu et al. (2021, 2020) utilize the robustness of low frequencies by applying low precision in the early training stage to save computational cost without sacrificing the generalization performance.

8. Anti-F-Principle

The key to the F-Principle is the regularity of the activation function. Imposing high priority on high frequencies can alleviate the effect of the F-Principle, and sometimes an anti-F-Principle can be observed. For example, consider a loss function containing the gradient of the DNN output with respect to the input,
\[
L_S(\theta) = \sum_{x \in S} \big( \nabla f_{\theta}(x) - \nabla f^{*}(x) \big)^{2}. \tag{103}
\]

The Fourier transform of ∇f_θ(x) is ξF[f_θ](ξ); that is, a higher frequency has a higher weight in the loss function. Then, whether the F-Principle holds depends on the competition between the regularity of the activation function and the loss function. If the loss function gives the high frequencies enough extra priority to compensate for the low priority induced by the activation function, an anti-F-Principle emerges. Since gradients often appear in the loss for solving PDEs, the anti-F-Principle can be observed when solving a PDE with a loss involving high-order derivatives. Some analysis and numerical experiments can also be found in Lu et al. (2019); E et al. (2020).

Another way to observe an anti-F-Principle is to use large values for the network parameters. As shown in the analysis of the idealized setting in Sec. 4.1, large weights alleviate the dominance of low frequencies in Eq. (9). In addition, large parameter values also cause large fluctuations of the DNN output at initialization (experiments can be seen in Xu et al. (2019a)); the amplitude term in Eq. (9) may then give high frequencies a larger priority, leading to an anti-F-Principle, which is also studied in Yang and Salman (2019).

Figure 16: Training epochs (indicated by the ordinate axis) of different DNNs when they achieve a fixed error. (a) Using networks with different numbers of hidden layers of the same size to learn data sampled from the target function cos(3x) + cos(5x). (b) Using a fixed network to learn data sampled from different target functions. Reprinted from Xu and Zhou (2020).

9. Deep frequency principle

Xu and Zhou (2020) propose a deep frequency principle to understand the effect of depth in accelerating the training, reviewed as follows. Understanding the effect of depth is a central problem in revealing the "black box" of deep learning. For example, empirical studies show that a deeper network can learn faster and generalize better on both real and synthetic data (He et al., 2016; Arora et al., 2018). Different network structures have different computation costs in each training epoch. In this work, Xu and Zhou (2020) define the learning of a DNN to be faster if the loss of the DNN decreases to a designated error with fewer training epochs. For example, as shown in Fig. 16(a), when learning data sampled from a target function cos(3x) + cos(5x), a DNN with more hidden layers achieves the designated training loss with fewer training epochs. Although empirical studies suggest that deeper NNs may learn faster, there is little understanding of the mechanism.

The motivation for the deep frequency principle can be understood from the following ideal example. As the frequency of the target function decreases, the DNN achieves a designated error with fewer training epochs, as shown in Fig. 16(b), which is similar to the phenomenon observed when using a deeper network to learn a fixed target function.

Inspired by the above analysis, Xu and Zhou (2020) propose a mechanism to understand why a deeper network f_θ(x) learns faster a set of training data S = {(x_i, y_i)}_{i=1}^{n} sampled from a target function f*(x), illustrated as follows. Networks are trained as usual, while Xu and Zhou (2020) separate a DNN into two parts in the analysis, as shown in Fig. 17: one is a pre-condition component and the other is a learning component, in which the output of the pre-condition component, denoted as f_θ^{[l-1]}(x) (the first l − 1 layers are classified as the pre-condition component), is the input of the learning component. For the learning component, the effective training data at each training epoch is S^{[l-1]} = {(f_θ^{[l-1]}(x_i), y_i)}_{i=1}^{n}. Xu and Zhou (2020) then perform experiments based on

variants of the Resnet18 structure (He et al., 2016) and the CIFAR10 dataset. Xu and Zhou (2020) fix the learning component (the fully-connected layers). When increasing the number of pre-condition layers (convolution layers), Xu and Zhou (2020) find that S^{[l-1]} has a stronger bias towards low frequency during the training. By the F-Principle, a lower-frequency function is learned faster; therefore, the learning component learns S^{[l-1]} faster when the pre-condition component has more layers. The analysis across different network structures is often much more difficult than the analysis of a single structure. To provide hints for future theoretical study, Xu and Zhou (2020) also study a fixed fully-connected DNN by classifying different numbers of layers into the pre-condition component, i.e., varying l for a fixed network in the analysis. As l increases, Xu and Zhou (2020) similarly find that S^{[l]} contains more low frequency and less high frequency during the training. Therefore, Xu and Zhou (2020) propose the following principle:

Deep frequency principle: The effective target function for a deeper hidden layer biases towards lower frequency during the training.

With the well-studied F-Principle, the deep frequency principle provides a promising mechanism for understanding why a deeper network learns faster.

Figure 17: General deep neural network. Reprinted from Xu and Zhou (2020).

9.1 Ratio Density Function (RDF)

Xu and Zhou (2020) define a ratio density function (RDF) to characterize the frequency distribution of a high-dimensional function. The low-frequency part of the output y can be derived by
\[
y_i^{\mathrm{low},\delta(k_0)} \triangleq (y * G^{\delta(k_0)})_i, \tag{104}
\]
where $*$ indicates the convolution operator.

Then, one can compute the low frequency ratio (LFR) by
\[
\mathrm{LFR}(k_0) = \frac{\sum_i |y_i^{\mathrm{low},\delta(k_0)}|^{2}}{\sum_i |y_i|^{2}}. \tag{105}
\]
On discrete data points, the low-frequency part can be computed by
\[
y_i^{\mathrm{low},\delta} = \frac{1}{C_i} \sum_{j=0}^{n-1} y_j G^{\delta}(x_i - x_j), \tag{106}
\]
where $C_i = \sum_{j=0}^{n-1} G^{\delta}(x_i - x_j)$ is a normalization factor and
\[
G^{\delta}(x_i - x_j) = \exp\big( -|x_i - x_j|^{2} / (2\delta) \big). \tag{107}
\]
$1/\delta$ is the variance of $\hat{G}$; therefore, it can be interpreted as the frequency width outside of which the content is filtered out by the convolution. LFR($k_0$) characterizes the power ratio of frequencies within a sphere of radius $k_0$. To characterize each frequency in the radial direction, similarly to a probability density, Xu and Zhou (2020) define the ratio density function (RDF) as
\[
\mathrm{RDF}(k_0) = \frac{\partial\, \mathrm{LFR}(k_0)}{\partial k_0}. \tag{108}
\]

In practical computation, Xu and Zhou (2020) use 1/δ for k_0 and use the linear slope between two consecutive points for the derivative. For illustration, Xu and Zhou (2020) show the LFR and RDF for sin(kπx) in Fig. 18. As shown in Fig. 18(a), the LFR of a lower-frequency function approaches one faster, i.e., already for a small filter width 1/δ in the frequency domain. The RDF in Fig. 18(b) shows that, as k in the target function increases, the peak of the RDF moves towards a wider filter width, i.e., a higher frequency. Therefore, the RDF effectively and intuitively reflects where the power of the function concentrates in the frequency domain. In the following, Xu and Zhou (2020) use the RDF to study the frequency distribution of the effective target functions of hidden layers.
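A minimal NumPy sketch of the discrete LFR/RDF computation of Eqs. (105)–(108) is given below; the function name, the restriction to scalar outputs, and the choice of filter widths are assumptions for illustration.

```python
import numpy as np

def lfr_rdf(xs, ys, deltas):
    """LFR and RDF on discrete data.

    xs: (n, d) inputs; ys: (n,) scalar outputs; deltas: decreasing filter widths,
    so that k0 = 1/delta increases along the returned arrays."""
    d2 = ((xs[:, None, :] - xs[None, :, :]) ** 2).sum(-1)    # pairwise |x_i - x_j|^2
    lfr = []
    for delta in deltas:
        G = np.exp(-d2 / (2.0 * delta))                      # Gaussian kernel, Eq. (107)
        y_low = (G @ ys) / G.sum(axis=1)                     # normalized low-pass part, Eq. (106)
        lfr.append((np.abs(y_low) ** 2).sum() / (np.abs(ys) ** 2).sum())
    k0 = 1.0 / np.asarray(deltas)
    lfr = np.asarray(lfr)
    rdf = np.diff(lfr) / np.diff(k0)                         # slope between consecutive points, Eq. (108)
    return k0, lfr, rdf

# Example: the 1-D target sin(k*pi*x) used in Fig. 18, here with k = 5.
xs = np.linspace(-1, 1, 200)[:, None]
k0, lfr, rdf = lfr_rdf(xs, np.sin(5 * np.pi * xs[:, 0]),
                       deltas=1.0 / np.linspace(1, 2500, 60))
```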

[Figure 18 panels: (a) LFR and (b) RDF vs. 1/δ for k = 1, 5, 9, 13.]

Figure 18: LFR and RDF for sin(kπx) vs. 1/δ. Note that Xu and Zhou (2020) normalize RDF in (b) by the maximal value of each curve for visualization. Reprinted from Xu and Zhou (2020).

9.2 An example: Deep frequency principle on variants of Resnet18

In this subsection, Xu and Zhou (2020) utilize variants of Resnet18 and the CIFAR10 dataset to validate the deep frequency principle. The structures of the four variants are illustrated as follows. As shown in Fig. 19, all structures have several convolution parts, followed by the same two fully-connected layers. Compared with Resnet18-i, Resnet18-(i + 1) drops one convolution part and keeps the other parts the same. As shown in Fig. 20, a deeper net attains a fixed training accuracy with fewer training epochs and achieves better generalization after training. The part from the layer "-2" to the final output can be regarded as a two-layer NN, which is widely studied. Next, Xu and Zhou (2020) examine the RDF for layer "-2". The effective target function is
\[
S^{[-3]} = \big\{ \big( f_{\theta}^{[-3]}(x_i),\, y_i \big) \big\}_{i=1}^{n}. \tag{109}
\]

Figure 19: Variants of Resnet18. Reprinted from Xu and Zhou (2020).


Figure 20: Training accuracy and validation accuracy vs. epoch for variants of Resnet18. Reprinted from Xu and Zhou (2020).

As shown in Fig. 21(a), at initialization, the RDFs of deeper networks concentrate at higher frequencies. However, as training proceeds, the concentration of the RDFs of deeper networks moves towards lower frequency faster. Therefore, for the two-layer NN with a deeper pre-condition component, learning can be accelerated due to the fast convergence of low frequencies in the NN dynamics, i.e., the F-Principle. For the two-layer NN embedded as the learning component of the full network, the effective target function is S^{[-3]}. As the pre-condition component has more layers, layer "-2" is a deeper hidden layer in the full network. Therefore, Fig. 21 validates that the effective target function for a deeper hidden layer biases towards lower frequency during the training, i.e., the deep frequency principle.

10. Conclusion

The F-Principle is very general and important for DNNs. It serves as a basic principle for understanding DNNs and for inspiring the design of DNNs. As a good starting point, the F-Principle leads to further interesting studies for better understanding DNNs, for example, how to build a theory of the F-Principle for general NNs with arbitrary sample distributions and how to study the generalization error.


Figure 21: RDF of S^{[-3]} (the effective target function of layer "-2") vs. 1/δ at different epochs (panels (a)-(f): epochs 0, 1, 2, 3, 15, and 80) for variants of Resnet18. Reprinted from Xu and Zhou (2020).

Fourier analysis can also be used to study DNNs from other perspectives, such as the effect of different image frequencies on the learning results. As a general implicit bias, the F-Principle is insufficient to characterize finer details of the training process of DNNs. To study DNNs in a more detailed way, it is important to examine them from more angles, such as the landscape of the loss function, the effect of width and depth, the effect of initialization, etc. For example, we have studied how initialization affects the implicit bias of DNNs (Zhang et al., 2020b; Luo et al., 2020d) and have drawn a phase diagram for wide two-layer ReLU NNs (Luo et al., 2020d).

References

Rishabh Agarwal, Nicholas Frosst, Xuezhou Zhang, Rich Caruana, and Geoffrey E Hinton. Neural additive models: Interpretable machine learning with neural nets. arXiv preprint arXiv:2004.13912, 2020. 7, 7.2

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 244–253. PMLR, 2018. URL http://proceedings.mlr.press/v80/arora18a.html. 9

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 322–332. PMLR, 2019. URL http: //proceedings.mlr.press/v97/arora19a.html. 5.3

Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste- Julien. A closer look at memorization in deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 233–242. PMLR, 2017. URL http://proceedings.mlr.press/v70/ arpit17a.html. 3, 7.1

Aristide Baratin, Thomas George, César Laurent, R Devon Hjelm, Guillaume Lajoie, Pascal Vincent, and Simon Lacoste-Julien. Implicit regularization via neural feature alignment. arXiv preprint arXiv:2008.00938, 2020. 4.4.2

Alex Barnett, Leslie Greengard, Andras Pataki, and Marina Spivak. Rapid solution of the cryo- em reconstruction problem by frequency marching. SIAM Journal on Imaging Sciences, 10(3): 1170–1195, 2017. 2

Ronen Basri, David W. Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 4763–4772, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/5ac8bb8a7d745102a978c5f8ccdb61b8-Abstract.html. 1, 2, 4, 4.4.2

Ronen Basri, Meirav Galun, Amnon Geifman, David W. Jacobs, Yoni Kasten, and Shira Kritchman. Frequency bias in neural networks for input of non-uniform density. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 685–694. PMLR, 2020. URL http://proceedings.mlr.press/v119/basri20a.html. 4.4.2

Philipp Benz, Chaoning Zhang, and In So Kweon. Batch normalization increases adversar- ial vulnerability: Disentangling usefulness and robustness of model features. arXiv preprint arXiv:2010.03316, 2020. 7

Hugo Bertiche, Meysam Madadi, and Sergio Escalera. Physically based neural simulator for garment animation. arXiv preprint arXiv:2012.11310, 2020. 7

Sai Bi, Zexiang Xu, Pratul Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Milosˇ Hasan,ˇ Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Neural reflectance fields for appearance acquisition. arXiv preprint arXiv:2008.03824, 2020. 7, 7.2

Simon Biland, Vinicius C Azevedo, Byungsoo Kim, and Barbara Solenthaler. Frequency-aware reconstruction of fluid simulations with generative networks. arXiv preprint arXiv:1912.08776, 2019. 1, 6.3, 7, 7.2

Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1024–1034. PMLR, 2020. URL http://proceedings. mlr.press/v119/bordelon20a.html. 1, 4, 4.4.2

Leo Breiman. Reflections after refereeing papers for nips. The Mathematics of Generalization, XX: 11–15, 1995. 1

Susanne C Brenner and L. Ridgway Scott. The mathematical theory of finite element methods. Springer, New York, third edition, 2008. 6.2.2

Wei Cai and Zhi-Qin John Xu. Multi-scale deep neural networks for solving high dimensional pdes. arXiv preprint arXiv:1910.11710, 2019. 6.3, 6.3.1, 6.3.2, 6.3.3, 6.3.3, 7, 7.2

Wei Cai, Xiaoguang Li, and Lizuo Liu. A phase shift deep neural network for high frequency wave equations in inhomogeneous media. Arxiv preprint, arXiv:1909.11759, 2019. 1, 6.3, 6.3.1

Miguel Campo, Zhengxing Chen, Luke Kung, Kittipat Virochsiri, and Jianyu Wang. Band-limited soft actor critic model. arXiv preprint arXiv:2006.11431, 2020. 7, 7.2

Alexander Camuto, Matthew Willetts, Umut Simsekli, Stephen J. Roberts, and Chris C. Holmes. Explicit regularisation in gaussian noise injections. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/c16a5320fa475530d9583c34fd356ef5-Abstract.html. 7, 7.1

Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198, 2019. 1, 4, 4.4.2

Prithvijit Chakrabarty and Subhransu Maji. The spectral bias of the deep image prior. arXiv preprint arXiv:1912.08905, 2019. 7, 7.1

Hesen Chen, Ming Lin, Xiuyu Sun, Qian Qi, Hao Li, and Rong Jin. Muffnet: Multi-layer feature federation for mobile deep learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019. 7, 7.1

Jingrun Chen, Shi Jin, and Liyao Lyu. A consensus-based global optimization method with adaptive momentum estimation. arXiv preprint arXiv:2012.04827, 2020a. 6.2.1

Yuanqi Chen, Ge Li, Cece Jin, Shan Liu, and Thomas Li. Ssd-gan: Measuring the realness in the spatial and spectral domains. arXiv preprint arXiv:2012.05535, 2020b. 7, 7.1

Ingrid Daubechies. Ten lectures on wavelets, volume 61. Siam, 1992. 6.3, 6.3.2

Xiang Deng and Zhongfei Zhang. Is the meta-learning idea able to improve the generalization of deep neural networks on the standard supervised learning? arXiv preprint arXiv:2002.12455, 2020. 7, 7.1

Bin Dong, Jikai Hou, Yiping Lu, and Zhihua Zhang. Distillation ≈ early stopping? Harvesting dark knowledge utilizing anisotropic information retrieval for overparameterized neural network. arXiv preprint arXiv:1910.01255, 2019. 4.4.2

Weinan E and Bing Yu. The deep ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018. 6.1.1

Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high- dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017. 6

Weinan E, Chao Ma, and Lei Wu. A Priori Estimates of the Population Risk for Two-layer Neural Networks. Communications in Mathematical Sciences, 17(5):1407–1425, 2019. doi: 10.4310/ CMS.2019.v17.n5.a11. 5.3

Weinan E, Chao Ma, and Lei Wu. Machine learning from a continuous viewpoint, i. Science China Mathematics, pages 1–34, 2020. 1, 4, 4.3, 8

Lawrence C Evans. Partial differential equations. 2010. 6.1

Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núñez. A multiscale neural network based on hierarchical matrices. arXiv preprint arXiv:1807.01883, 2018. 6

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7. 7.2

Yonggan Fu, Haoran You, Yang Zhao, Yue Wang, Chaojian Li, Kailash Gopalakrishnan, Zhangyang Wang, and Yingyan Lin. Fractrain: Fractionally squeezing bit savings both temporally and spatially for efficient DNN training. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/8dc5983b8c4ef1d8fcd5f325f9a65511-Abstract.html. 7, 7.2

Yonggan Fu, Han Guo, Meng Li, Xin Yang, Yining Ding, Vikas Chandra, and Yingyan Lin. CPT: Efficient Deep Neural Network Training via Cyclic Precision. arXiv:2101.09868, January 2021. 7, 7.2

Raja Giryes and Joan Bruna. How can we use tools from signal processing to understand better neural networks? Inside Signal Processing Newsletter, 2020. 1

Sebastian Goldt, Marc Mézard, Florent Krzakala, and Lenka Zdeborová. Modeling the influence of data structure on learning in neural networks: The hidden manifold model. Physical Review X, 10(4):041044, 2020. 7.1

Michelle Guo, Alireza Fathi, Jiajun Wu, and Thomas Funkhouser. Object-centric neural scene rendering. arXiv preprint arXiv:2012.08503, 2020. 7, 7.2

Weiyu Guo and Yidong Ouyang. Robust learning with frequency domain regularization. arXiv preprint arXiv:2007.03244, 2020. 7

Jiequn Han, Linfeng Zhang, Roberto Car, et al. Deep potential: A general representation of a many-body potential energy surface. Communications in Computational Physics, 23(3), 2018. 6

Jiequn Han, Chao Ma, Zheng Ma, and E Weinan. Uniformly accurate machine learning-based hydrodynamic models for kinetic equations. Proceedings of the National Academy of Sciences, 116(44):21983–21991, 2019. 6

Jiequn Han, Arnulf Jentzen, et al. Algorithms for solving high dimensional pdes: From nonlinear monte carlo to machine learning. arXiv preprint arXiv:2008.13333, 2020. 6

Nicolai Häni, Selim Engin, Jun-Jee Chao, and Volkan Isler. Continuous object representation networks: Novel view synthesis without target view supervision. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/43a7c24e2d1fe375ce60d84ac901819f-Abstract.html. 7, 7.2

46 Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. Relu deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973, 2018. 6, 6.2.2

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90. 9, 9

Shilin He, Xing Wang, Shuming Shi, Michael R Lyu, and Zhaopeng Tu. Assessing the bilingual knowledge learned by neural machine translation models. arXiv preprint arXiv:2004.13270, 2020. 7.1

Oliver Hennigh, Susheela Narasimhan, Mohammad Amin Nabian, Akshay Subramaniam, Kaustubh Tangsali, Max Rietmann, Jose del Aguila Ferrandis, Wonmin Byeon, Zhiwei Fang, and Sanjay Choudhry. NVIDIA SimNet™: an AI-accelerated multi-physics simulation framework. arXiv preprint arXiv:2012.07938, 2020. 7, 7.2

Wei Hu, Lechao Xiao, Ben Adlam, and Jeffrey Pennington. The surprising simplicity of the early-time learning dynamics of neural networks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/ 2020/hash/c6dfc6b7c601ac2978357b7a81e2d7ae-Abstract.html. 7.2

Jianguo Huang, Haoqin Wang, and Haizhao Yang. Int-deep: A deep learning initialized iterative method for nonlinear problems. Journal of Computational Physics, 419:109675, 2020. 6.2.1

Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 8580–8589, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/5a4be1fa34e62bb8a6ec6b91d2462f5a-Abstract.html. 1, 4, 4.4.1

Ameya D Jagtap, Kenji Kawaguchi, and George Em Karniadakis. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. Journal of Computational Physics, 404:109136, 2020. 1, 6.3, 7, 7.2

Guangda Ji and Zhanxing Zhu. Knowledge distillation in wide neural networks: Risk bound, data efficiency and imperfect teacher. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria- Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/ hash/ef0d3930a7b6c95bd2b32ed45989c61f-Abstract.html. 7

Liming Jiang, Bo Dai, Wayne Wu, and Chen Change Loy. Focal frequency loss for generative models. arXiv preprint arXiv:2012.12821, 2020. 7, 7.2

Hui Jin and Guido Montúfar. Implicit bias of gradient descent for mean squared error regression with wide neural networks. arXiv preprint arXiv:2006.07356, 2020. 4.4.3

Pengzhan Jin, Lu Lu, Yifa Tang, and George Em Karniadakis. Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness. arXiv preprint arXiv:1905.11427, 2019. 7.1

Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin L. Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. SGD on neural networks learns functions of in- creasing complexity. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Flo- rence d’Alche-Buc,´ Emily B. Fox, and Roman Garnett, editors, Advances in Neural In- formation Processing Systems 32: Annual Conference on Neural Information Process- ing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 3491–3501, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ b432f34c5a997c8e7c806a895ecc5e25-Abstract.html. 7.1, 7.2

Yuehaw Khoo and Lexing Ying. Switchnet: a neural network model for forward and inverse scattering problems. SIAM Journal on Scientific Computing, 41(5):A3182–A3201, 2019. 6

Soo Ye Kim, Kfir Aberman, Nori Kanazawa, Rahul Garg, Neal Wadhwa, Huiwen Chang, Nikhil Karnad, Munchurl Kim, and Orly Liba. Zoom-to-inpaint: Image inpainting with high frequency details. arXiv preprint arXiv:2012.09401, 2020. 7

Sung Eun Kim, Yongwon Seo, Junshik Hwang, Hongkyu Yoon, and Jonghyun Lee. Connectivity- informed drainage network generation using deep convolution generative adversarial networks. Scientific Reports, 11(1):1–14, 2021. 7

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980. 7, 6.3.3

Dmitry Kopitkov and Vadim Indelman. Neural spectrum alignment: Empirical study. In International Conference on Artificial Neural Networks, pages 168–179. Springer, 2020. 4.4.2

Heehwan Lee, Minjong Hong, Min Kang, Hyun Sung Park, Kyusu Ahn, Yongwoo Lee, and Yongjo Kim. A physics-driven complex valued neural network (cvnn) model for lithographic analy- sis. In Advances in Patterning Materials and Processes XXXVII, volume 11326, page 113260I. International Society for Optics and Photonics, 2020. 7

Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl- Dickstein, and Jeffrey Pennington. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. In Advances in Neural Information Processing Systems 32, pages 8572–8583. 2019. 4.4.1

Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In Silvia Chiappa and Roberto Calandra, editors, The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 4313–4324. PMLR, 2020a. URL http://proceedings.mlr.press/v108/li20j.html. 5.2

Xi-An Li, Zhi-Qin John Xu, and Lei Zhang. A multi-scale dnn algorithm for nonlinear elliptic equations with multiple scales. Communications in Computational Physics, 28(5):1886–1906, 2020b. 1, 6.3, 6.3.2, 6.3.5, 15

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020c. 6

Senwei Liang, Liyao Lyu, Chunmei Wang, and Haizhao Yang. Reproducing activation function for deep learning. arXiv preprint arXiv:2101.04844, 2021. 7, 7.2

Junhong Lin, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties and implicit regularization for multiple passes SGM. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 2340–2348. JMLR.org, 2016. URL http://proceedings.mlr.press/v48/lina16.html. 5.2

Ziqi Liu, Wei Cai, and Zhi-Qin John Xu. Multi-scale deep neural network (MscaleDNN) for solving Poisson-Boltzmann equation in complex domains. Communications in Computational Physics, 28(5):1970–2001, 2020. 1, 6.3, 6.3.1, 6.3.2, 6.3.3, 6.3.3, 12, 13, 14, 7, 7.2

Lu Lu, Xuhui Meng, Zhiping Mao, and George E Karniadakis. Deepxde: A deep learning library for solving differential equations. arXiv preprint arXiv:1907.04502, 2019. 8

Tao Luo, Zheng Ma, Zhi-Qin John Xu, and Yaoyu Zhang. Theory of the frequency principle for general deep neural networks. arXiv preprint arXiv:1906.09235, 2019. 1, 2, 4, 4.2

Tao Luo, Zheng Ma, Zhiwei Wang, Zhi-Qin John Xu, and Yaoyu Zhang. Fourier-domain variational formulation and its well-posedness for supervised learning. arXiv preprint arXiv:2012.03238, 2020a. 4.4.3, 4.4.3

Tao Luo, Zheng Ma, Zhi-Qin John Xu, and Yaoyu Zhang. On the exact computation of linear frequency principle dynamics and its generalization. arXiv preprint arXiv:2010.08153, 2020b. 4.4.3, 5, 2, 1, 2, 5, 5.3

Tao Luo, Zheng Ma, Zhi-Qin John Xu, and Yaoyu Zhang. On the exact computation of linear frequency principle dynamics and its generalization. arXiv preprint arXiv:2010.08153, 2020c. 1, 4

Tao Luo, Zhi-Qin John Xu, Zheng Ma, and Yaoyu Zhang. Phase diagram for two-layer relu neural networks at infinite-width limit. arXiv preprint arXiv:2007.07497, to appear in Journal of Machine Learning Research, 2020d. 10

Chao Ma, Lei Wu, and E Weinan. The slow deterioration of the generalization error of the random feature model. In Mathematical and Scientific Machine Learning, pages 373–389. PMLR, 2020. 7, 7.1

Craig Michoski, Milos Milosavljevic, Todd Oliver, and David Hatch. Solving irregular and data-enriched differential equations using deep neural networks. arXiv preprint arXiv:1905.04351, 2019. 6.2.2

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405–421. Springer, 2020. 7, 7.2

Chris Mingard, Joar Skalse, Guillermo Valle-Pérez, David Martínez-Rubio, Vladimir Mikulik, and Ard A Louis. Neural networks are a priori biased towards boolean functions with low entropy. arXiv preprint arXiv:1909.11522, 2019. 7.1

Maxwell Nye and Andrew Saxe. Are efficient deep representations learnable? 2018. 5.1

Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. arXiv preprint arXiv:2012.15838, 2020a. 7, 7.2

Wei Peng, Weien Zhou, Jun Zhang, and Wen Yao. Accelerating physics-informed neural network training with prior dictionaries. arXiv preprint arXiv:2004.08151, 2020b. 6.3

Charles S Peskin, Daniel Tranchina, and Diana M Hull. How to see in the dark: Photon noise in vision and nuclear medicine. Annals of the New York Academy of Sciences, 435(1):48–72, 1984. 2

Andrei D. Polyanin. Handbook of Linear Partial Differential Equations for Engineers and Scientists. Chapman & Hall/CRC, 2002. 2

Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. arXiv preprint arXiv:2011.13961, 2020. 7, 7.2

Neil C Rabinowitz. Meta-learners’ learning dynamics are unlike learners’. arXiv preprint arXiv:1905.01320, 2019. 7, 7.1

Nasim Rahaman, Devansh Arpit, Aristide Baratin, Felix Draxler, Min Lin, Fred A Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of deep neural networks. International Conference on Machine Learning, 2019. 1, 2, 3, 3.2.2

Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. 6

Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 3067–3075. PMLR, 2017. URL http://proceedings.mlr.press/v70/shalev-shwartz17a.html. 5.1

Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. Robust text captchas using adversarial examples. arXiv preprint arXiv:2101.02483, 2021. 7

Renu Sharma and Arun Ross. D-netpad: An explainable and interpretable iris presentation attack detector. In 2020 IEEE International Joint Conference on Biometrics (IJCB), pages 1–10. IEEE, 2020. 7, 7.1

Jie Shen, Tao Tang, and Li-Lian Wang. Spectral Methods: Algorithms, Analysis and Applications. Springer, 2011. 6.2.2

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017. 7.1

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556. 3.2.2

Carlos Michelen Strofer, Jin-Long Wu, Heng Xiao, and Eric Paterson. Data-driven, physics-based feature extraction from fluid flow fields using convolutional neural networks. Communications in Computational Physics, 25(3):625–650, 2019. 6

Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P Srinivasan, Jonathan T Barron, and Ren Ng. Learned initializations for optimizing coordinate-based neural representations. arXiv preprint arXiv:2012.02189, 2020a. 7, 7.2

Matthew Tancik, Pratul P Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, 2020b. 7, 7.2

Feng Wang, Alberto Eljarrat, Johannes Müller, Trond R Henninen, Rolf Erni, and Christoph T Koch. Multi-resolution convolutional neural networks for inverse problems. Scientific Reports, 10(1):1–11, 2020a. 1, 6.3, 7

Jihong Wang, Zhi-Qin John Xu, Jiwei Zhang, and Yaoyu Zhang. Implicit bias with ritz-galerkin method in understanding deep learning for solving pdes. arXiv preprint arXiv:2002.07989, 2020b. 6.2.2, 8, 9, 10, 11

Sifan Wang, Hanwen Wang, and Paris Perdikaris. On the eigenvector bias of fourier feature networks: From regression to solving multi-scale pdes with physics-informed neural networks. arXiv preprint arXiv:2012.10047, 2020c. 7, 7.2

Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why pinns fail to train: A neural tangent kernel perspective. arXiv preprint arXiv:2007.14527, 2020d. 6.2.1, 6.3

E Weinan, Chao Ma, and Jianchun Wang. Model reduction with memory and the machine learning of dynamical systems. Communications in Computational Physics, 25(4):947–962, 2018. ISSN 1991-7120. 6

Lei Wu, Zhanxing Zhu, and Weinan E. Towards understanding generalization of deep learning: Perspective of loss landscapes. The 34th International Conference on Machine Learning, 2017. 1

Yue Xi, Wenjing Jia, Jiangbin Zheng, Xiaochen Fan, Yefan Xie, Jinchang Ren, and Xiangjian He. Drl-gan: Dual-stream representation learning gan for low-resolution image classification in uav applications. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2020. 7, 7.2

Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. In Aarti Singh and Xiaojin (Jerry) Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning Research, pages 1216–1224. PMLR, 2017. URL http://proceedings.mlr.press/v54/xie17a.html. 4.4.2

Rui Xu, Xintao Wang, Kai Chen, Bolei Zhou, and Chen Change Loy. Positional encoding as spatial inductive bias in gans. arXiv preprint arXiv:2012.05217, 2020. 7, 7.2

Zhi-Qin J Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. International Conference on Neural Information Processing, pages 264–274, 2019a. 1, 2, 3, 5, 5.2, 7, 7.1, 8

Zhi-Qin John Xu. Frequency principle in deep learning with general loss functions and its potential application. arXiv preprint arXiv:1811.10146, 2018a. 2

Zhi-Qin John Xu and Hanxu Zhou. Deep frequency principle towards understanding why deeper learning is faster. arXiv preprint arXiv:2007.14313, AAAI-21, 2020. 16, 9, 9, 17, 9.1, 9.1, 9.1, 18, 9.2, 19, 20, 21

Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523, 2019b. 1, 1, 2, 3, 2, 3.1, 3.2, 3.2.1, 4, 4, 4.1, 5, 5.1, 6.2.1, 7, 6.2.1, 6.3, 6.3.1

Zhiqin John Xu. Understanding training and generalization in deep learning by fourier analysis. arXiv preprint arXiv:1808.04295, 2018b. 2, 4, 4.1

Greg Yang and Hadi Salman. A fine-grained spectral perspective on neural networks. arXiv preprint arXiv:1907.10599, 2019. 1, 4, 8

Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Yingyan Lin, Zhangyang Wang, and Richard G Baraniuk. Drawing early-bird tickets: Towards more efficient training of deep networks. International Conference on Learning Representations, 2020. 7, 7.2

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Sy8gdB9xx. 1, 5, 5.1

Haoming Zhang, Mingqi Zhao, Chen Wei, Dante Mantini, Zherui Li, and Quanying Liu. EEGdenoiseNet: A benchmark dataset for deep learning solutions of EEG denoising. arXiv preprint arXiv:2009.11662, 2020a. 7

Yaoyu Zhang and Lai-Sang Young. Dnn-assisted statistical analysis of a model of local cortical circuits. Scientific Reports, 10(1):1–16, 2020. 7

Yaoyu Zhang, Zhi-Qin John Xu, Tao Luo, and Zheng Ma. Explicitizing an implicit bias of the frequency principle in two-layer neural networks. arXiv preprint arXiv:1905.10264, 2019. 1, 2, 4, 4.4.3

Yaoyu Zhang, Zhi-Qin John Xu, Tao Luo, and Zheng Ma. A type of generalization error induced by initialization in deep neural networks. In Mathematical and Scientific Machine Learning, pages 144–164. PMLR, 2020b. 10

Yaoyu Zhang, Zhi-Qin John Xu, Tao Luo, and Zheng Ma. Linear frequency principle model to understand the absence of overfitting in neural networks. arXiv preprint arXiv:2102.00200, to appear in Chinese Physics Letters, 2021. 1

Quan Zheng, Vahid Babaei, Gordon Wetzstein, Hans-Peter Seidel, Matthias Zwicker, and Gurprit Singh. Neural light field 3d printing. ACM Transactions on Graphics (TOG), 39(6):1–12, 2020. 7, 7.2

Hu Zhu, Yiming Qiao, Guoxia Xu, Lizhen Deng, and Yu-Feng Yu. Dspnet: A lightweight dilated convolution neural networks for spectral deconvolution with self-paced learning. IEEE Transactions on Industrial Informatics, 2019. 7, 7.1
