The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization
Ben Adlam *†1  Jeffrey Pennington *1
Abstract

Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well. An emerging paradigm for describing this unexpected behavior is in terms of a double descent curve, in which increasing a model's capacity causes its test error to first decrease, then increase to a maximum near the interpolation threshold, and then decrease again in the overparameterized regime. Recent efforts to explain this phenomenon theoretically have focused on simple settings, such as linear regression or kernel regression with unstructured random features, which we argue are too coarse to reveal important nuances of actual neural networks. We provide a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent. Our results reveal that the test error has non-monotonic behavior deep in the overparameterized regime and can even exhibit additional peaks and descents when the number of parameters scales quadratically with the dataset size.

Figure 1. An illustration of multi-scale generalization phenomena for neural networks and related kernel methods. The classical U-shaped under- and over-fitting curve is shown on the far left. After a peak near the interpolation threshold, when the number of parameters p equals the number of samples m, the test loss decreases again, a phenomenon known as double descent. On the far right is the limit when p → ∞, which is described by the Neural Tangent Kernel. In this work, we identify a new scale of interest in between these two regimes, namely when p is quadratic in m, and show that it exhibits complex non-monotonic behavior, suggesting that double descent does not provide a complete picture. Putting these observations together, we define three regimes separated by two transitional phases: (i) the classical regime of underparameterization when p < m, (ii) the abundant parameterization regime when m < p < m², and (iii) the superabundant parameterization regime when p > m². The transitional phases between them are of particular interest, as they produce non-monotonic behavior.

1. Introduction

Machine learning models based on deep neural networks have achieved widespread success across a variety of domains, often playing integral roles in products and services people depend on. As users rely on these systems in increasingly important scenarios, it becomes paramount to establish a rigorous understanding of when the models might work and, crucially, when they might not. Unfortunately, the current theoretical understanding of deep learning is modest at best, as large gaps persist between theory and observation and many basic questions remain unanswered.

One of the most conspicuous such gaps is the unexpectedly good generalization performance of large, heavily-overparameterized models. These models can be so expressive that they can perfectly fit the training data (even when the labels are replaced by pure noise), but still manage to generalize well on real data (Zhang et al., 2016). An emerging paradigm for describing this behavior is in terms of a double descent curve (Belkin et al., 2019a), in which increasing a model's capacity causes its test error to first decrease, then increase to a maximum near the interpolation threshold, and then decrease again in the overparameterized regime.

*Equal contribution. †Work done as a member of the Google AI Residency Program. 1Google Brain. Correspondence to: Jeffrey Pennington.
There are of course more elaborate measures of a model's capacity than a naive parameter count. Recent empirical and theoretical work studying the correlation of these capacity measures with generalization has found mixed results, with many measures having the opposite relationship with generalization that theory would predict (Neyshabur et al., 2017). Other work has questioned whether it is possible in principle for uniform convergence results to explain the generalization performance of neural networks (Nagarajan & Kolter, 2019).

Our approach is quite different. We consider the algorithm's asymptotic performance on a specific data distribution, leveraging the large system size to get precise theoretical results. In particular, we examine the high-dimensional asymptotics of kernel ridge regression with respect to the Neural Tangent Kernel (NTK) (Jacot et al., 2018) and conclude that double descent does not always provide an accurate or complete picture of generalization performance. Instead, we identify complex non-monotonic behavior in the test error as the number of parameters varies across multiple scales and find that it can exhibit additional peaks and descents when the number of parameters scales quadratically with the dataset size.

Our theoretical analysis focuses on the NTK of a single-hidden-layer fully-connected model when the samples are drawn independently from a Gaussian distribution and the targets are generated by a wide teacher neural network. We provide an exact analytical characterization of the generalization error in the high-dimensional limit in which the number of samples m, the number of features n0, and the number of hidden units n1 tend to infinity with fixed ratios φ := n0/m and ψ := n0/n1. By adjusting these ratios, we reveal the intricate ways in which the generalization error depends on the dataset size and the effective model capacity.

We investigate various limits of our results, including the behavior when the NTK degenerates into the kernel with respect to only the first-layer or only the second-layer weights. The latter corresponds to the standard setting of random feature ridge regression, which was recently analyzed in (Mei & Montanari, 2019). In this case, the total number of parameters p is equal to the width n1, i.e. p = n1 = (φ/ψ)m, so that p is linear in the dataset size. In contrast, for the full kernel, the number of parameters is p = (n0 + 1)n1 = (φ²/ψ)m² + (φ/ψ)m, i.e. it is quadratic in the dataset size. By studying these two kernels, we derive insight into the generalization performance in the vicinities of linear and quadratic overparameterization, and by piecing these two perspectives together, we infer the existence of multi-scale phenomena, which sometimes can include triple descent. See Fig. 1 for an illustration and Fig. 4 for empirical confirmation of this behavior.

1.1. Our Contributions

1. We derive exact high-dimensional asymptotic expressions for the test error of NTK ridge regression.
2. We prove that the test error can exhibit non-monotonic behavior deep in the overparameterized regime.
3. We investigate the origins of this non-monotonicity and attribute them to the kernel with respect to the second-layer weights.
4. We provide empirical evidence that triple descent can indeed occur for finite-sized networks trained with gradient descent.
5. We find exceptionally fast learning curves in the noiseless case, with E_test ∼ m^{-2}.

1.2. Related Work

A recent line of work studying the behavior of interpolating models was initiated by the intriguing experimental results of (Zhang et al., 2016; Belkin et al., 2018b), which showed that deep neural networks and kernel methods can generalize well even in the interpolation regime. A number of theoretical results have since established this behavior in certain settings, such as interpolating nearest neighbor schemes (Belkin et al., 2018a) and kernel regression (Belkin et al., 2019c; Liang et al., 2020b).

These observations, coupled with classical notions of the bias-variance tradeoff, have given rise to the double descent paradigm for understanding how test error depends on model complexity. These ideas were first discussed in (Belkin et al., 2019a), and empirical evidence was obtained in (Advani & Saxe, 2017; Geiger et al., 2020) and recently in (Nakkiran et al., 2019). Precise theoretical predictions soon confirmed this picture for linear regression in various scenarios (Belkin et al., 2019b; Hastie et al., 2019; Mitra, 2019).

Linear models struggle to capture all of the phenomena relevant to double descent because the parameter count is tied to the number of features. Recent work found multiple descents in the test loss for minimum-norm interpolants in Reproducing Kernel Hilbert Spaces (Liang et al., 2020a), but it similarly requires changing the data distribution to vary model capacity. A precise analysis of a nonlinear system for a fixed data generating process is the most direct way to draw insight into double descent. A recent preprint (Mei & Montanari, 2019) shares this view and adopts a similar analysis to ours, but focuses entirely on the standard case of unstructured random features. Such a setup can indeed model double descent, and certainly bears relevance to certain wide neural networks in which only the top-layer weights are optimized (Neal, 1996; Rahimi & Recht, 2008; Lee et al., 2018; de G. Matthews et al., 2018; Lee et al., 2019), but its connection to neural networks trained with gradient descent remains less clear.
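The unstructured random-features setting discussed above, in which only the top-layer weights are trained, is easy to simulate. The sketch below is a minimal illustration under our own arbitrary choices — the sizes, the ReLU feature map, the linear teacher, and the ridge scaling are all hypothetical, not the paper's experimental setup:

```python
# Minimal random-feature ridge regression: a fixed random first layer W,
# with only the second-layer weights w2 learned (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
n0, n1, m, m_test, gamma = 32, 256, 128, 512, 1e-3

W = rng.standard_normal((n1, n0))                  # frozen first-layer weights
X = rng.standard_normal((n0, m))                   # training inputs (columns)
X_test = rng.standard_normal((n0, m_test))
beta = rng.standard_normal(n0) / np.sqrt(n0)       # linear teacher (hypothetical)
y, y_test = beta @ X, beta @ X_test

relu = lambda u: np.maximum(u, 0.0)
F = relu(W @ X / np.sqrt(n0))                      # random features, n1 x m
F_test = relu(W @ X_test / np.sqrt(n0))

# Ridge regression on the features; gamma * n1 is one conventional scaling.
w2 = y @ F.T @ np.linalg.inv(F @ F.T + gamma * n1 * np.eye(n1))
test_err = np.mean((w2 @ F_test - y_test) ** 2)
```

Sweeping n1 relative to m in a simulation of this kind is how the double descent curve for top-layer-only training is typically visualized.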
Gradient-based training of wide neural networks initialized in the standard way was recently shown to correspond to kernel gradient descent with respect to the Neural Tangent Kernel (Jacot et al., 2018). This result has spawned renewed interest in kernel methods and their connection to deep learning; a woefully incomplete list of papers in this direction includes Lee et al. (2019); Chizat et al. (2019); Du et al. (2019; 2018); Arora et al. (2019); Xiao et al. (2019).

To connect these research directions, our analysis requires tools and recent results from random matrix theory and free probability. A central challenge stems from the fact that many of the matrices in question have nonlinear dependencies between the elements, which arise from the nonlinear feature matrix F = σ(WX). This challenge was overcome in (Pennington & Worah, 2017), which computed the spectrum of F, and in (Pennington & Worah, 2018), which examined the spectrum of the Fisher information matrix; see also (Louart et al., 2018). We also utilize the results of (Adlam et al., 2019; Péché et al., 2019), which established a linear signal-plus-noise model for F that shares the same bulk statistics. This linearized model allows us to write the test error as the trace of a rational function of the underlying random matrices. The methods we use to compute such quantities rely on so-called linear pencils that represent the rational function in terms of the inverse of a larger block matrix (Helton et al., 2018), and on operator-valued free probability for computing the trace of the latter (Far et al., 2006).

2. Preliminaries

In this section, we introduce our theoretical setting and some of the tools required to state our results.

2.1. Problem Setup and Notation

We consider the task of learning an unknown function from m independent samples (x_i, y_i) ∈ R^{n0} × R, i ≤ m, where the datapoints are standard Gaussian, x_i ∼ N(0, I_{n0}), and the labels are generated by a wide single-hidden-layer neural network:

  y_i | x_i, Ω, ω ∼ ω σ_T(Ω x_i/√n0)/√n_T + ε_i .  (1)

The teacher's activation function σ_T is applied coordinate-wise, and its parameters Ω ∈ R^{n_T×n0} and ω ∈ R^{1×n_T} are matrices whose entries are independently sampled once for all data from N(0, 1). We also allow for independent label noise, ε_i ∼ N(0, σ_ε²). We assume the teacher width n_T → ∞, but the rate is not important.

Let ŷ(x) denote the model's predictive function. We consider squared error, so the test loss is

  E(y − ŷ(x))² = E_{x,ε}(ω σ_T(Ωx/√n0)/√n_T + ε − ŷ(x))² ,  (2)

where the expectation is over an iid test point (x, y) conditional on the training set, the teacher parameters, and any randomness in the learning algorithm producing ŷ, such as the random parameters defining the random features. Note that the test loss is a random variable; however, in the high-dimensional asymptotics we consider here, it concentrates about its mean.

2.2. Neural Tangent Kernel Regression

We consider predictive functions ŷ defined by approximate (i.e. random feature) kernel ridge regression using the Neural Tangent Kernel (NTK) of a single-hidden-layer neural network. The NTK can be considered a kernel K that is approximated by random features corresponding to the Jacobian J of the network's output with respect to its parameters, i.e. K(x1, x2) = J(x1)J(x2)^⊤. As the width of the network becomes very large (compared to all other relevant scales in the system), the approximate NTK converges to a constant kernel determined by the network's initial parameters and describes the trajectory of the network's output under gradient descent. In particular,

  N_t(x) = N_0(x) + (Y − N_0(X)) K^{-1}(I − e^{−ηtK}) K_x ,  (3)

where N_t(x) is the output of the network at time t, K := K(γ) = K(X, X) + γI_m, K_x := K(X, x), η is the learning rate, and γ is a ridge regularization constant. (These overloaded definitions of K can be distinguished by the number of arguments and should be clear from context.) For this work, we are interested in the t → ∞ limit of (3), which defines the predictive function,

  ŷ(x) := N_∞(x) = N_0(x) + (Y − N_0(X)) K^{-1} K_x .  (4)

We remark that if the width is not asymptotically larger than the dataset size, the validity of (3) can break down and (4) may not accurately describe the late-time predictions of the neural network. While this potential discrepancy is an interesting topic, we defer an in-depth analysis to future work (but see Fig. 4 for an empirical analysis of gradient descent). Instead, we regard (4) as the definition of our predictive function and focus on kernel regression with the NTK. We believe this setup is interesting in its own right; for example, recent work has demonstrated its effectiveness as a kernel method on complex image datasets (Li et al., 2019) and found it to be competitive with neural networks in small data regimes.

In this work, we restrict our study to the NTK of single-hidden-layer fully-connected networks. In particular, consider a network of width n1 and pointwise activation function σ, defined by

  N_0(x) = W_2 σ(W_1 x/√n0)/√n1 ,  (5)
for initial weight matrices W_1 ∈ R^{n1×n0} and W_2 ∈ R^{1×n1} with iid entries [W_1]_{ij} ∼ N(0, 1) and [W_2]_i ∼ N(0, σ_{W2}²). (Any non-zero σ_{W1} can be absorbed into a redefinition of σ.)

We collect our assumptions on the activation functions below, in Assumption 1. Their main purpose is to ensure that certain moments and derivatives exist almost surely, but for simplicity we state somewhat stronger conditions than are actually required for our analysis. To simplify the already cumbersome algebraic manipulations, we assume that σ has zero Gaussian mean. We emphasize that this condition is not essential and our techniques easily generalize to all commonly used activation functions.

Assumption 1. The activation functions σ, σ_T : R → R are assumed to be differentiable almost everywhere. We assume |σ(x)|, |σ′(x)|, |σ_T(x)| = O(exp(Cx)) for some positive constant C, which implies all the Gaussian moments of σ, σ′, and σ_T exist, and we assume Eσ(Z) = 0 for Z ∼ N(0, 1).

The Jacobian of (5) with respect to the parameters naturally decomposes into the Jacobian with respect to W_1 and W_2, i.e. J(x) = [∂N_0(x)/∂W_1, ∂N_0(x)/∂W_2] = [J_1(x), J_2(x)]. Therefore the kernel K also decomposes this way, and we can write

  K(x1, x2) = J_1(x1)J_1(x2)^⊤ + J_2(x1)J_2(x2)^⊤  (6)
            =: K_1(x1, x2) + K_2(x1, x2) .  (7)

A simple calculation yields the per-layer constituent kernels,

  K_1(X, X) = (X^⊤X/n0) ⊙ ((F′)^⊤ diag(W_2)² F′/n1) ,  (8)
  K_2(X, X) = F^⊤F/n1 ,  (9)

where ⊙ denotes the Hadamard (entrywise) product and we have introduced the abbreviations F = σ(W_1X/√n0) and F′ = σ′(W_1X/√n0). Notice that when σ_{W2}² → 0, K = K_2, i.e. the NTK degenerates into the standard random features kernel. However, the predictive function (4) contains an offset N_0(x) which would typically be set to zero in standard random feature kernel regression because it simply increases the variance of test predictions. Removing this variance component has an analogous operation in neural network training: either the function value at initialization can be subtracted throughout training, or a symmetrization trick can be used in which two copies of the NN are initialized identically and their normalized difference N ≡ (N^(a) − N^(b))/√2 is trained with gradient descent. Either method preserves the kernel K while enforcing N_0 ≡ 0. We call this setup centering, and present results with and without it.

Finally, we note that ridge regularization in the kernel perspective corresponds to using L2 regularization of the neural network's weights toward their initial values.

Figure 2. Theoretical results for the test error with and without centering for different activation functions with φ = 2, γ = 10^{-3}, and SNR = 1, for (a) the second-layer kernel K_2 and (b) the full NTK K, as the number of parameters p is varied by changing the network width. Nonmonotonic behavior is clearly visible at the linear scaling transition (p = m) and the quadratic scaling transition (p = m²). Here, ReLU denotes the zero-mean function σ(x) = max(x, 0) − 1/√(2π).

3. Three Regimes of Parameterization

In this section, we outline an argument based on the structure of the NTK as to why one should expect the test error to exhibit non-trivial phenomena at two different scales of overparameterization. From the expressions for the test error (2) and the predictive function (4), it is evident that the behavior of the test error is determined by the spectral properties of the NTK. Although the fine details of the relationship can only be revealed by the explicit calculation, we can nevertheless make some basic high-level observations based on the coarser structure of the kernel.

The number of trainable parameters p relative to the dataset size m controls the amount of parameterization or complexity of a model. In our setting of a single-hidden-layer fully-connected neural network, p = n1(n0 + 1), and for a fixed dataset, we can adjust the ratio p/m by varying the hidden-layer width n1.

The simplest way to see that there should be two scales comes from examining the two terms in the kernel separately. Because K_1 = J_1J_1^⊤ and J_1 ∈ R^{m×n0n1}, the first-layer kernel has rank at most min{n0n1, m}, which suggests nontrivial transitional behavior when p = Θ(m). Similarly, the rank of K_2 is at most min{n1, m}, which suggests a second interesting scale when n1 = Θ(m), or equivalently, when p = Θ(m²) if n0 = Θ(n1). Our explicit calculations confirm that interesting phenomena indeed occur at these scales, as can be seen in Fig. 2.

These two scales partition the degree of parameterization into three regimes. We consider the classical regime to be when p ≲ m, because classical generalization theory tends to hold and the U-shaped test error curve is observed. The transition around p = Θ(m) manifests as a sharp rise in the test loss near the interpolation threshold, followed by a quick descent as p increases further, as can be seen in Fig. 2(a). We call this the linear scaling transition. After this, we enter a regime we call abundant parameterization
when m ≲ p ≲ m². In this regime, the test error tends to decrease until p nears the vicinity of m², where it can sometimes increase again, producing a second U-shaped curve. When p = Θ(m²), another transition is observed, which we call the quadratic scaling transition; it can be seen in Fig. 2(b). On the other side of this transition, p ≳ m², is a regime we call superabundant parameterization. See Fig. 1 for an illustration of this general picture.

While the classical regime has been long studied, and the superabundant regime has generated considerable recent interest due to the NTK, our main aim in delineating the above regimes is to highlight the existence of the intermediate scale containing complex phenomenology. For this reason, we focus our theoretical analysis on the novel scaling regime in which p = Θ(m²). In particular, as mentioned in Sec. 1, we consider the high-dimensional asymptotics in which n0, n1, m → ∞ with φ := n0/m and ψ := n0/n1 held constant.

4. Overview of Techniques

In this section, we provide a high-level overview of the analytical tools and mathematical results we use to compute the generalization error. To begin with, let us first describe the main technical challenges in computing explicit asymptotic limits of (2).

The first challenge, which is evident upon inspecting (8), is that the kernel contains a Hadamard product of random matrices, for which concrete results in the random matrix literature are few and far between. We address this problem in Sec. 4.1.

The second challenge, which is apparent by inspecting (9), is that the kernel depends on random matrices with nonlinear dependencies between the entries. We describe how to circumvent this difficulty in Sec. 4.2.

Finally, by expanding the square in (2) and substituting (4), we find terms that are constant, linear, and quadratic in K^{-1}. Some of the random matrices that appear inside the matrix inverses (e.g. X and W_1) also appear outside of them as multiplicative factors, a situation that prevents the straightforward application of many standard proof techniques in random matrix theory. We describe how to overcome this challenge in Sec. 4.3.

4.1. Simplification of First-Layer Kernel

A straightforward central limiting argument shows that in the asymptotic limit the entries of W_1X/√n0 are marginally Gaussian with mean zero and unit variance. As such, the first and second moments of the entries in the matrix F′ = σ′(W_1X/√n0) are equal to

  √ζ := E_{z∼N(0,1)} σ′(z) ,  η′ := E_{z∼N(0,1)} σ′(z)² .  (10)

It follows that we can split K_1 into two terms,

  (X^⊤X/n0) ⊙ ((F̄′)^⊤ diag(W_2)² F̄′/n1) + σ_{W2}² ζ X^⊤X/n0 ,  (11)

where F̄′ is a centered version of F′. Focusing on the first term, because n0n1 = (φ²/ψ)m², the random fluctuations in the off-diagonal elements are O(1/m), which are too small to contribute to the spectrum or moments of an m × m matrix whose diagonal entries are order one. In fact, the diagonal entries are simply proportional to the variance of the entries of F′, namely (η′ − ζ). Putting this together, we can eliminate the Hadamard product entirely and write

  K_1 ≅ σ_{W2}²(η′ − ζ)I_m + (σ_{W2}²ζ/n0) X^⊤X ,  (12)

where the ≅ notation means the two matrices share the same bulk statistics asymptotically. We make this argument precise in Sec. S1.

4.2. Linearization 1: Gaussian Equivalents

The test error (2) involves large random matrices with nonlinear dependencies, which are not immediately amenable to standard methods of analysis in random matrix theory. The main culprit is the random feature matrix F = σ(W_1X/√n0), but f := σ(W_1x/√n0), Y = ωσ_T(ΩX/√n0)/√n_T + E, and y := ωσ_T(Ωx/√n0)/√n_T all suffer from the same issue.

The solution is to replace each of these matrices with an equivalent matrix without nonlinear dependencies, but chosen to maintain the same first- and second-order moments for all of the terms that appear in the test error (2). This approach was described for F in (Adlam et al., 2019) (see also (Péché et al., 2019)). The upshot is that the test error is asymptotically invariant to the following substitutions,

  F → F^lin := √(ζ/n0) W_1X + √(η − ζ) Θ_F  (13)
  Y → Y^lin := √(ζ_T/(n_T n0)) ωΩX + √((η_T − ζ_T)/n_T) ωΘ_Y + E  (14)
  f → f^lin := √(ζ/n0) W_1x + √(η − ζ) θ_f  (15)
  y → y^lin := √(ζ_T/(n_T n0)) ωΩx + √((η_T − ζ_T)/n_T) ωθ_y .  (16)

The new objects Θ_F, Θ_Y, θ_f, and θ_y are matrices of the appropriate shapes with iid standard Gaussian entries. The constants η, ζ, η_T, and ζ_T are chosen so that the mixed moments up to second order are the same for the original and linearized versions. In particular,

  ζ := [E_{z∼N(0,1)} σ′(z)]² ,  η := E_{z∼N(0,1)} σ(z)² ,  (17)
  ζ_T := [E_{z∼N(0,1)} σ_T′(z)]² ,  η_T := E_{z∼N(0,1)} σ_T(z)² .  (18)

The statement that the test error only depends on Y^lin is consistent with the observations made in (Ghorbani et al., 2019; Mei & Montanari, 2019) that in the high-dimensional regime where n0 = Θ(m), only linear functions of the data can be learned. Indeed, Y^lin is equivalent to a linear teacher plus noise with signal-to-noise ratio given by

  SNR = ζ_T/(η_T − ζ_T + σ_ε²) .  (19)

We often make this equivalence to a linear teacher explicit by setting σ_T(x) = x, which implies η_T = ζ_T = 1. Doing so also removes the noise from the test label, but since this noise merely contributes an additive shift to the test loss, removing it does not change any of our conclusions.

4.3. Linearization 2: Linear Pencil

Next we turn our attention to the actual computation of the asymptotic test loss. Expanding the test error (2), we have (for simplicity, we discuss the centered setting with N_0 = 0, which captures all of the technical complexities)

  E_test := E_{(x,y)}(y − ŷ(x))²  (20)
         = E_{(x,ε)}[tr(y^⊤y) − 2 tr(K_x^⊤K^{-1}Y^⊤y) + tr(K_x^⊤K^{-1}Y^⊤Y K^{-1}K_x)] .  (21)

The simplification (12) gives

  K = σ_{W2}²((η′ − ζ)I_m + ζX^⊤X/n0) + F^⊤F/n1 + γI_m ,  (22)
  K_x = (σ_{W2}²ζ/n0) X^⊤x + F^⊤f/n1 ,  (23)

which, when applied to (21) together with the substitutions (13)-(16), expresses the test error directly in terms of the iid Gaussian random matrices W_1, X, Θ_F, Ω, Θ_Y, E, θ_f, θ_y, and x. The expectations over x and E are trivial because these variables do not appear inside the matrix inverse K^{-1}. Moreover, asymptotically the traces concentrate around their means with respect to Ω, Θ_Y, θ_f, and θ_y, which we can also compute easily for the same reason. Therefore, the test error can be written as

  E_test = a_0 + Σ_i b_i tr(B_i K^{-1}) + Σ_i c_i tr(C_i K^{-1} D_i K^{-1}) ,  (24)

where B_i, C_i, D_i are monomials in {W_1, X, Θ_F} and their transposes, and a_0, b_i, c_i ∈ R.

Eqn. (24) is a rational function of the noncommutative random variables W_1, X, and Θ_F. A useful result from noncommutative algebra guarantees that such a rational function can be linearized, in the sense that it can be expressed in terms of the inverse of a matrix whose entries are linear in the noncommutative variables. This representation is often called a linear pencil, and is not unique; see e.g. (Helton et al., 2018) for details.

To illustrate this concept, consider the simple case of K^{-1}. After applying the substitutions (13)-(16) to (22), K^{-1} is given by the upper-left block of the inverse of the linear pencil

  [ [γ + σ_{W2}²(η′ − ζ)]I_m   (σ_{W2}²ζ/n0) X^⊤   (√(η − ζ)/n1) Θ_F^⊤   (√ζ/(√n0 n1)) X^⊤ ]
  [ −X                          I                    0                      0                 ]
  [ −√(η − ζ) Θ_F               −√(ζ/n0) W_1         I                      0                 ]
  [ 0                           0                    −W_1^⊤                 I                 ] ,

which can be checked by an explicit computation of the block matrix inverse. After obtaining a linear pencil for each of the terms in (24), the only task that remains is computing the trace. Since each linear pencil is a block matrix whose blocks are iid Gaussian random matrices, its trace can be evaluated using the techniques described in (Far et al., 2006) or through the general formalism of operator-valued free probability. We refer the reader to the book (Mingo & Speicher, 2017) for more details on these topics.

5. Asymptotic Training and Test Error

The calculations described in the previous section are presented in the Supplementary Materials. Here we present the main results.

Proposition 1. As n0, n1, m → ∞ with φ = n0/m and ψ = n0/n1 fixed, the traces τ_1(z) := (1/m) E tr(K(z)^{-1}) and τ_2(z) := (1/m) E tr((X^⊤X/n0) K(z)^{-1}) are given by the unique solutions to the coupled polynomial equations

  φ(ζτ_2τ_1 + φ(τ_2 − τ_1)) + ζτ_1τ_2ψ(zτ_1 − 1) = −ζτ_1τ_2σ_{W2}²(ζ(τ_2 − τ_1)ψ + τ_1ψη′ + φ) ,
  ζτ_1²τ_2(η′ − η)σ_{W2}² + ζτ_1τ_2(zτ_1 − 1) = (τ_2 − τ_1)φ(ζ(τ_2 − τ_1) + ητ_1) ,  (25)

such that τ_1, τ_2 ∈ C⁺ for z ∈ C⁺.

Theorem 1. Let γ = Re(z) and let τ_1 and τ_2 be defined as in Proposition 1 with Im(z) → 0⁺. Then the asymptotic training error E_train = (1/m) E‖Y − ŷ(X)‖_F² is given by

  E_train = −γ²(σ_ε²τ_1′ + τ_2′) + νσ_{W2}²γ²(τ_1 + γτ_1′) + νσ_{W2}⁴γ²((η′ − ζ)τ_1 + ζτ_2)′ ,  (26)

where primes denote derivatives with respect to z, and the asymptotic test error E_test = E(y − ŷ(x))² is given by

  E_test = (γτ_1)^{-2} E_train − σ_ε² .  (27)
Remark 1. The subtraction of σ_ε² in eqn. (27) is because we have assumed that there is no label noise on the test points. Had we included the same label noise on both the training and test distributions, that term would be absent.

Remark 2. When ν = 0, the quantity (γτ_1)^{-2}E_train on the right-hand side of eqn. (27) is precisely the generalized cross-validation (GCV) metric of (Golub et al., 1979). Theorem 1 shows that the GCV gives the exact asymptotic test error for the problem studied here.

6. Test Error in Limiting Cases

While the explicit formulas in the preceding section provide an exact characterization of the asymptotic training and test loss, they do not readily admit clear interpretations. On the other hand, eqn. (25) and therefore the expressions for E_test simplify considerably under several natural limits, which we examine in this section.

6.1. Large Width Limit

Here we examine the test error in the superabundant regime in which the width n1 is larger than any constant times the dataset size m, which can be obtained by letting ψ → 0 and ψ/φ → 0. In this setting we find

  E_test|ψ=0 = (1/(2φχ_0))(χ_0(φ − 1) + ξφ(1 + φ) + ρ(1 − 3φ))
             + (νσ_{W2}²/(2φχ_0))((ηφ + ζ)(ρ + ξφ) − 4ζρφ)
             + (νσ_{W2}²/(2φ))(ηφ − ζ) + (φξ + ρ − χ_0)/(2χ_0 SNR) ,  (28)

where ν = 0 with centering and ν = 1 without it, ρ := ζ(1 + σ_{W2}²), ξ := γ + η + σ_{W2}²η′, and

  χ_0 := √((ρ + ξφ)² − 4φρ²) .  (29)

The learning curve is remarkably steep with centering. To see this, we expand the result as m → ∞, i.e. as φ → 0,

  E_test|ψ=0 = { (1/SNR) φ + O(φ²)       if SNR < ∞ ,
               { (1 − ρ/ξ)² φ² + O(φ³)   if SNR = ∞ .  (30)

Interestingly, we see that when the network is superabundantly parameterized, we obtain very fast learning curves: for finite SNR, E_test ∼ m^{-1}, and in the noiseless case, E_test ∼ m^{-2}. See Fig. 3(b).

6.2. Small Width Limit

Here we consider the limit in which the width n1 is smaller than any constant times the dataset size m or the number of features n0, which can be obtained by letting ψ → ∞ with φ held constant. In this setting we find

  E_test|ψ→∞ = (1/(2φχ_1))(χ_1(φ − 1) + ξ_1φ(1 + φ) + ζ(1 − 3φ)) + (1/(2χ_1 SNR))(φξ_1 + ζ − χ_1) ,  (31)

where ξ_1 := η′ + γ/σ_{W2}², and

  χ_1 := √((ζ + ξ_1φ)² − 4φζ²) .  (32)

The small width limit characterizes one boundary of the abundant parameterization regime and as such provides an upper bound on the test loss in that regime. Therefore, a sufficient condition for the global minimum to occur at intermediate widths is E_test|ψ→∞ < E_test|ψ=0. By comparing eqn. (28) to eqn. (31), precise though unenlightening constraints on the parameters can be derived for satisfying this condition. One such configuration is illustrated in Fig. 4(b).

6.3. Large Dataset Limit

Here we consider the limit in which the dataset size m is larger than any constant times the width n1, which can be obtained by letting φ → 0 with φ/ψ → 0. In this setting we find

  E_test|φ→0 = { ((1 + ψ)/SNR)(φ/ψ) + O((φ/ψ)²)                              if SNR < ∞ ,
               { (τ²(νζ²σ_{W2}⁴ + κ)/((η − ζ)ζ²σ_{W2}⁴))(φ/ψ)² + O((φ/ψ)³)   if SNR = ∞ ,  (33)

where ν = 0 with centering and ν = 1 without it, and

  τ := γ + σ_{W2}²(η′ − ζ) ,  κ := ζψ + (η − ζ)ψ² .

Here again we observe very steep learning curves, similar to the large width limit above.

6.4. Ridgeless Limit: First-Layer Kernel

Here we examine the ridgeless limit γ → 0 of the first-layer kernel K_1. We find that the result can be obtained through a degeneration of (28),

  E_test^{K1}|γ=0 = lim_{σ_{W2}→∞} E_test|ψ=0  (34)
                 = (1/(2φχ̄))(χ̄(φ − 1) + η′φ(1 + φ) + ζ(1 − 3φ)) + (1/(2χ̄ SNR))(φη′ + ζ − χ̄) ,  (35)

where χ̄ := √((ζ + η′φ)² − 4φζ²) and we have specialized to the centered case ν = 0. The expansion as m → ∞ also looks similar to (30) and can be obtained from that equation by substituting ξ/ρ → η′/ζ.

6.5. Ridgeless Limit: Second-Layer Kernel

Here we examine the ridgeless limit γ → 0 when the kernel is due to the second-layer weights only, i.e. K_2. This limit
Figure 3. Test loss as a function of the relative width n1/m for varying SNR and σ_{W2}² (panels (a)-(c)).
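The limiting formulas above can also be compared against finite-size simulation by forming the per-layer kernels (8)-(9) and the centered predictor (4) directly. A minimal sketch follows; the dimensions and the linear teacher (σ_T(x) = x) are our illustrative choices, and σ is the zero-mean ReLU of Fig. 2:

```python
# Finite-size sketch of centered NTK ridge regression, eqs. (4), (8), (9).
import numpy as np

rng = np.random.default_rng(1)
n0, n1, m, m_test, gamma, s_W2 = 24, 48, 64, 256, 1e-3, 1.0

sigma = lambda x: np.maximum(x, 0.0) - 1 / np.sqrt(2 * np.pi)
dsigma = lambda x: (x > 0).astype(float)

W1 = rng.standard_normal((n1, n0))                 # first-layer weights
W2 = s_W2 * rng.standard_normal(n1)                # second-layer weights

X = rng.standard_normal((n0, m))                   # training inputs (columns)
Xt = rng.standard_normal((n0, m_test))
beta = rng.standard_normal(n0) / np.sqrt(n0)       # linear teacher, sigma_T(x) = x
Y, Yt = beta @ X, beta @ Xt

def kernels(A, B):
    FA, FB = sigma(W1 @ A / np.sqrt(n0)), sigma(W1 @ B / np.sqrt(n0))
    GA, GB = dsigma(W1 @ A / np.sqrt(n0)), dsigma(W1 @ B / np.sqrt(n0))
    K1 = (A.T @ B / n0) * (GA.T @ np.diag(W2**2) @ GB / n1)   # eq. (8), Hadamard
    K2 = FA.T @ FB / n1                                        # eq. (9)
    return K1 + K2

K = kernels(X, X) + gamma * np.eye(m)              # K(gamma) = K(X, X) + gamma*I
yhat = Y @ np.linalg.solve(K, kernels(X, Xt))      # eq. (4) with N0 = 0 (centering)
test_err = np.mean((yhat - Yt) ** 2)
```

Sweeping n1 (and hence p = n1(n0 + 1)) in such a simulation is the finite-size analogue of the theoretical curves in Figs. 2 and 3.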