Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

Soufiane Hayou 1 Arnaud Doucet 1 Judith Rousseau 1

Abstract

Recent work by Jacot et al. (2018) has shown that training a neural network of any kind, with gradient descent in parameter space, is strongly related to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. (2019) built on this result by establishing that the output of a neural network trained using gradient descent can be approximated by a linear model for wide networks. In parallel, a recent line of studies (Schoenholz et al., 2017; Hayou et al., 2019) has suggested that a special initialization, known as the Edge of Chaos, improves training. In this paper, we connect these two concepts by quantifying the impact of the initialization and the architecture on the NTK when the network depth becomes large. In particular, we show that the performance of wide deep neural networks cannot be explained by the NTK regime. We also leverage our theoretical results to derive a learning rate passband where training is possible.

1. Introduction

Deep neural networks (DNN) have achieved state of the art results on numerous tasks. Hence, there is a multitude of works trying to theoretically explain their remarkable performance; see, e.g., (Du et al., 2018; Nguyen and Hein, 2018; Zhang et al., 2017; Zou et al., 2018). Recently, Jacot et al. (2018) introduced the Neural Tangent Kernel (NTK) that characterises DNN training in the so-called Lazy training regime (or NTK regime). In this regime, the whole training procedure is reduced to a first order Taylor expansion of the output function near its initialization value. It was shown in (Lee et al., 2019) that such a simple model could lead to surprisingly good performance. However, most experiments with the NTK regime are performed on shallow neural networks and have not covered DNNs. In this paper, we cover this topic by showing the limitations of the NTK regime for DNNs and how it differs from the actual training of DNNs with Stochastic Gradient Descent (SGD).

Neural Tangent Kernel. Jacot et al. (2018) showed that training a neural network (NN) with GD (Gradient Descent) in parameter space is equivalent to a GD in a function space with respect to the NTK. Du et al. (2019) used a similar approach to prove that full batch GD converges to global minima for shallow neural networks, and Karakida et al. (2018) linked the Fisher information matrix to the NTK, studying its spectral distribution for infinite width NN. The infinite width limit for different architectures was studied by Yang (2019), who introduced a tensor formalism that can express the NN computations. Lee et al. (2019) studied a linear approximation of the full batch GD dynamics based on the NTK, and gave a method to approximate the NTK for different architectures. Finally, Arora et al. (2019) proposed an efficient algorithm to compute the NTK for convolutional architectures (Convolutional NTK). In all of these papers, the authors only studied the effect of the infinite width limit (NTK regime) with relatively shallow networks.

Information propagation. In parallel, information propagation in wide DNNs has been studied in (Hayou et al., 2019; Lee et al., 2018; Schoenholz et al., 2017; Yang and Schoenholz, 2017a). These works provide an analysis of the signal propagation at the initial step as a function of the initialization hyper-parameters (i.e. the variances of the initial random weights and biases). They identify a set of hyper-parameters known as the Edge of Chaos (EOC) and activation functions ensuring a deep propagation of the information carried by the input. This ensures that the network output still carries some information about the input. In this paper, we prove that the Edge of Chaos initialization also has some benefits on the NTK.

NTK training and SGD training. Stochastic Gradient Descent (SGD) has been successfully used in training deep networks. Recently, with the introduction of the Neural Tangent Kernel in (Jacot et al., 2018), Lee et al. (2019) suggested a different approach to training overparameterized neural networks. The idea originates from the conjecture that in overparameterized models, a local minimum exists near the initialization; using a first order Taylor expansion near initialization, the model is reduced to a simple linear model, and the linear model is trained instead of the original network. Hereafter, we refer to this training procedure as the NTK training and to the trained model as the NTK regime. We clarify this in Section 2.

¹ Department of Statistics, University of Oxford, Oxford, United Kingdom. Correspondence to: Soufiane Hayou.

Contributions. The aim of this paper is to study the large depth limit of the NTK. Our contributions are:

• We prove that the NTK regime is always trivial in the limit of large depth. However, the convergence rate to this trivial regime is controlled by the initialization hyper-parameters.

• We prove that only an EOC initialization provides a sub-exponential convergence rate to this trivial regime, while other initializations yield an exponential rate. For the same depth, the NTK regime is thus 'less' trivial for an EOC initialization. This allows training deep models using NTK training.

• For ResNets, we also have convergence to a trivial NTK regime, but this always occurs at a polynomial rate, irrespective of the initialization. To further slow down the NTK convergence rate, we introduce scaling factors in the ResNet blocks, which allows NTK training of deep ResNets.

• We leverage our theoretical results on the asymptotic behaviour of the NTK to show the existence of a learning rate passband for SGD training where training is possible.

Table 1 summarizes the behaviour of NTK and SGD training for different depths and initialization schemes of an FFNN on the MNIST dataset. We show whether the model learns or not, i.e. whether the model test accuracy is significantly bigger than 10%, which is the accuracy of the trivial random classifier. The results displayed in the table show that for a shallow FFNN (L = 3), the model learns to classify with both NTK training and SGD training for any initialization scheme. For a medium depth network (L = 30), NTK training and SGD training both succeed in training the model with an initialization on the EOC, while they both fail with other initializations. It has been observed that with SGD, an EOC initialization is beneficial for the training of deep neural networks (Hayou et al., 2019; Schoenholz et al., 2017). Our results show that the EOC initialization is also beneficial for NTK training (Section 2). However, for a deeper network with L = 300, NTK training fails for any initialization, while SGD training succeeds in training the model with an EOC initialization. This confirms the limitations of NTK training for DNNs. However, although the large depth NTK regime is trivial, we leverage this asymptotic analysis to infer a theoretical upper bound on the learning rate (Section 4). We illustrate our theoretical results through extensive simulations. All the proofs are detailed in the appendix.

Table 1. "Does the model learn?" We train a FeedForward Neural Network on MNIST using both standard SGD training and NTK training as defined in Section 2. For shallow networks, both SGD and NTK yield good performance (see Section 5). However, for deep networks, NTK training yields a trivial accuracy of around ~10% for any initialization scheme.

                                         Initialization on     Other
                                         the Edge of Chaos     initialization
  Shallow Network (depth L = 3)    NTK          ✓                    ✓
                                   SGD          ✓                    ✓
  Medium Network (depth L = 30)    NTK          ✓                    ✗
                                   SGD          ✓                    ✗
  Deep Network (depth L = 300)     NTK          ✗                    ✗
                                   SGD          ✓                    ✗

2. Neural Networks and Neural Tangent Kernel

2.1. Setup and notations

Consider a neural network model consisting of L layers of widths (n_l)_{1≤l≤L}, n_0 = d, and let θ = (θ^l)_{1≤l≤L} be the flattened vector of weights and biases indexed by the layer's index, and p be the dimension of θ. The output f of the neural network is given by some mapping s : R^{n_L} → R^o of the last layer y^L(x); o being the dimension of the output (e.g. the number of classes for a classification problem). For any input x ∈ R^d, we thus have f(x, θ) = s(y^L(x)) ∈ R^o. As we train the model, θ changes with time t, and we denote by θ_t the value of θ at time t and f_t(x) = f(x, θ_t). Let D = (x_i, z_i)_{1≤i≤N} be the data set, and let X = (x_i)_{1≤i≤N}, Z = (z_i)_{1≤i≤N} be the matrices of inputs and outputs respectively, with dimensions d × N and o × N. We assume that there is no colinearity in the input dataset X, i.e. there are no two inputs x, x' ∈ X such that x' = αx for some α ∈ R. We also assume that there exists a compact set E ⊂ R^d such that X ⊂ E.

The NTK K^L_θ is defined as the o × o dimensional kernel satisfying, for all x, x' ∈ R^d,

    K^L_{θ_t}(x, x') = ∇_θ f(x, θ_t) ∇_θ f(x', θ_t)^T = Σ_{l=1}^{L} ∇_{θ^l} f(x, θ_t) ∇_{θ^l} f(x', θ_t)^T ∈ R^{o×o}.

• The NTK regime (Infinite width): In the case of an FFNN, Jacot et al. (2018) proved that, with GD, the kernel K^L_{θ_t} converges to K^L, which depends only on L (the depth), for all t < T when n_1, n_2, ..., n_L → ∞ sequentially, where T is an upper bound on the training time. The infinite width limit of the training dynamics with a quadratic loss is given by the linear model

    f_t(X) = e^{-(t/N) K̂^L} f_0(X) + (I − e^{-(t/N) K̂^L}) Z,        (1)

where K̂^L = K^L(X, X). For any input x ∈ R^d, we have

    f_t(x) = f_0(x) + γ(x, X)(I − e^{-(1/N) K̂^L t})(Z − f_0(X)),        (2)

where γ(x, X) = K^L(x, X)(K̂^L)^{-1}. Hereafter, we refer to f_t as the "NTK regime solution" or simply the "NTK regime" when there is no confusion.

For other loss functions such as the cross-entropy loss, (Lee et al., 2019) used some approximations to obtain the NTK regime. These approximations are implemented in the Python library Neural Tangents (Novak et al., 2020).

• Role of the NTK in NTK training: As observed in Du et al. (2019), the convergence speed of f_t to f_∞ (infinite training time) is given by the smallest eigenvalue of K̂^L. If the NTK becomes singular in the large depth limit, then NTK training fails.

• Generalization in the NTK regime: From equation (2), the term γ plays a crucial role in the generalization capacity of the linear model. More precisely, different works (Du et al., 2019; Arora et al., 2019) showed that the inverse NTK plays a crucial role in the generalization error of wide shallow NN. Cao and Gu (2019) proved that training a FeedForward NN of (fixed) depth L with SGD gives a generalization bound of the form O(L √(z^T (K̂^L)^{-1} z / N)) in the limit of infinite width, where z is the training label. Moreover, equation (2) shows that the Reproducing Kernel Hilbert Space (RKHS) generated by the NTK K^L controls the generalization function. To see this, let t ∈ (0, T); from equation (2), we can deduce that there exist coefficients a_1, ..., a_N ∈ R^o such that for all x ∈ R^d, f_t(x) − f_0(x) = Σ_{i=1}^{N} a_i K^L(x_i, x), showing that the 'training residual' f_t − f_0 belongs to the RKHS of the NTK. In other words, the NTK controls whether the network will learn anything beyond initialization with NTK training (linearized regime).

3. Asymptotic Neural Tangent Kernel

In this section, we study the behaviour of K^L as L goes to ∞. We prove that the limiting K^L is trivial, so that the NTK cannot explain the generalization power of DNNs. However, with an EOC initialization, this convergence is slow, which makes it possible to use NTK training for medium depth neural networks (L = 30). However, since the limiting NTK is trivial, NTK training necessarily fails for large depth neural networks.

3.1. NTK parameterization and the Edge of Chaos

Let φ be the activation function. We consider the following architectures:

• FeedForward Fully-Connected Neural Network (FFNN): Consider an FFNN of depth L, widths (n_l)_{1≤l≤L}, weights w^l and biases b^l. For some input x ∈ R^d, the forward propagation using the NTK parameterization is given by

    y^1_i(x) = (σ_w/√d) Σ_{j=1}^{d} w^1_{ij} x_j + σ_b b^1_i,
    y^l_i(x) = (σ_w/√(n_{l−1})) Σ_{j=1}^{n_{l−1}} w^l_{ij} φ(y^{l−1}_j(x)) + σ_b b^l_i,   l ≥ 2.        (3)

• Convolutional Neural Network (CNN): Consider a 1D convolutional neural network of depth L. Denoting by [m : n] the set of integers {m, m + 1, ..., n} for m ≤ n, the forward propagation is given by

    y^1_{i,α}(x) = (σ_w/√(v_1)) Σ_{j=1}^{n_0} Σ_{β∈ker_1} w^1_{i,j,β} x_{j,α+β} + σ_b b^1_i,
    y^l_{i,α}(x) = (σ_w/√(v_l)) Σ_{j=1}^{n_{l−1}} Σ_{β∈ker_l} w^l_{i,j,β} φ(y^{l−1}_{j,α+β}(x)) + σ_b b^l_i,   l ≥ 2,        (4)

where i ∈ [1 : n_l] is the channel number, α ∈ [0 : M − 1] is the neuron location in the channel, n_l is the number of channels in the l-th layer, M is the number of neurons in each channel, ker_l = [−k : k] is a filter of size 2k + 1, and v_l = n_{l−1}(2k + 1). Here, w^l ∈ R^{n_l × n_{l−1} × (2k+1)}. We assume periodic boundary conditions, which result in y^l_{i,α} = y^l_{i,α+M} = y^l_{i,α−M}, and similarly for l = 0, x_{i,α+M_0} = x_{i,α} = x_{i,α−M_0}. For the sake of simplicity, we only consider the case of a 1D CNN; the generalization to an mD CNN for m ∈ N is straightforward.

Hereafter, for x, x' ∈ R^d, we denote by x · x' the scalar product in R^d. For CNN inputs x, x', let [x, x']_{α,α'} denote the convolutional mapping defined by [x, x']_{α,α'} = Σ_{j=1}^{n_0} Σ_{β∈ker_0} x_{j,α+β} x'_{j,α'+β}.

We initialize the model randomly with w^l_{ij}, b^l_i ~iid N(0, 1), where N(μ, σ²) denotes the normal distribution with mean μ and variance σ². In the limit of infinite width, the neurons (y^l_i(.))_{i,l} become Gaussian processes (Neal, 1995; Lee et al., 2018; Matthews et al., 2018; Hayou et al., 2019; Schoenholz et al., 2017); hence, studying their covariance kernel is the natural way to gain insight into their behaviour. Hereafter, we denote by q^l(x, x'), resp. q^l_{α,α'}(x, x'), the covariance between y^l_1(x) and y^l_1(x'), resp. between y^l_{1,α}(x) and y^l_{1,α'}(x'). We define the correlations c^l(x, x') and c^l_{α,α'}(x, x') similarly. For FFNN, we have that

    q^1(x, x') = σ_b² + (σ_w²/d) x · x',

and similarly for CNN we have

    q^1_{α,α'}(x, x') = σ_b² + (σ_w²/(n_0(2k+1))) [x, x']_{α,α'}.
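The NTK parameterization in (3) differs from the standard one only in where the 1/√(n_{l−1}) factor sits. A minimal NumPy sketch of this forward pass is given below; it is an illustration, not the paper's code, and the widths are arbitrary example values (the (σ_b, σ_w) choice is the ReLU EOC point used in Section 5).

```python
import numpy as np

def init_ffnn(widths, rng):
    """Draw w^l_{ij}, b^l_i ~ N(0, 1); widths = [d, n_1, ..., n_L]."""
    return [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward_ntk(params, x, sigma_w, sigma_b, phi=lambda u: np.maximum(u, 0.0)):
    """Forward pass of equation (3): the sigma_w / sqrt(n_{l-1}) factor is applied
    explicitly at every layer, so the stored weights stay N(0, 1)."""
    h = x
    for l, (W, b) in enumerate(params):
        pre = sigma_w / np.sqrt(h.shape[0]) * W @ h + sigma_b * b   # y^l(x)
        h = pre if l == len(params) - 1 else phi(pre)               # no activation on the output layer
    return pre

rng = np.random.default_rng(0)
params = init_ffnn([784, 512, 512, 10], rng)   # illustrative widths only
y = forward_ntk(params, rng.standard_normal(784), sigma_w=np.sqrt(2.0), sigma_b=0.0)
```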

For ε ∈ (0, 1), we define the set B_ε by

    FFNN: B_ε = {(x, x') ∈ R^d × R^d : c^1(x, x') ≤ 1 − ε},
    CNN:  B_ε = {(x, x') ∈ R^d × R^d : ∀ α, α', c^1_{α,α'}(x, x') ≤ 1 − ε},

and assume that there exists ε > 0 such that for all x ≠ x' ∈ X, (x, x') ∈ B_ε. The infinite width limit refers to an infinite number of neurons for fully connected layers, and an infinite number of channels for convolutional layers. All results below are derived in this limit.

Jacot et al. (2018) established the following infinite width limit of the NTK of an FFNN when σ_w = 1. We generalize the result to any σ_w > 0.

Lemma 1 (Generalization of Theorem 1 in (Jacot et al., 2018)). Consider an FFNN of the form (3). Then, as n_1, n_2, ..., n_{L−1} → ∞, we have for all x, x' ∈ R^d, i, i' ≤ n_L, K^L_{ii'}(x, x') = δ_{ii'} K^L(x, x'), where K^L(x, x') is given by the recursive formula

    K^L(x, x') = q̇^L(x, x') K^{L−1}(x, x') + q̂^L(x, x'),

where q̂^l(x, x') = σ_b² + σ_w² E[φ(y^{l−1}_1(x)) φ(y^{l−1}_1(x'))] and q̇^l(x, x') = σ_w² E[φ'(y^{l−1}_1(x)) φ'(y^{l−1}_1(x'))].

Lemmas 1, 2, 3 and 4 are straightforward and follow the same induction approach as in (Jacot et al., 2018). These results can also be obtained using the Tensor Programs framework of Yang (2020), for example.

Lemma 2 (Infinite width dynamics of the NTK of a CNN). Consider a CNN of the form (4). Then we have that for all x, x' ∈ R^d, i, i' ≤ n_1 and α, α' ∈ [0 : M − 1],

    K^1_{(i,α),(i',α')}(x, x') = δ_{ii'} ( (σ_w²/(n_0(2k+1))) [x, x']_{α,α'} + σ_b² ).

For l ≥ 2, as n_1, n_2, ..., n_{l−1} → ∞ recursively, we have for all i, i' ≤ n_l, α, α' ∈ [0 : M − 1], K^l_{(i,α),(i',α')}(x, x') = δ_{ii'} K^l_{α,α'}(x, x'), where K^l_{α,α'} is given by the recursion

    K^l_{α,α'} = (1/(2k+1)) Σ_{β∈ker_l} Ψ^{l−1}_{α+β, α'+β},

where Ψ^{l−1}_{α,α'} = q̇^l_{α,α'} K^{l−1}_{α,α'} + q̂^l_{α,α'}, and q̂^l_{α,α'}, resp. q̇^l_{α,α'}, is defined as q̂^l, resp. q̇^l, in Lemma 1, with y^{l−1}_{1,α}(x), y^{l−1}_{1,α'}(x') in place of y^{l−1}_1(x), y^{l−1}_1(x').

The NTK of a CNN differs from that of an FFNN in that it is an average over the NTK values of the previous layer. This is due to the fact that neurons in the same channel are not independent at initialization.

Using the above recursive formulas for the NTK, we can develop its mean-field theory to better understand its dynamics as L goes to infinity. To alleviate notation, we hereafter use K^L for the NTK of both FFNN and CNN. For FFNN, it represents K^L given by Lemma 1, whereas for CNN, it represents K^L_{α,α'} given in Lemma 2 for any α, α', i.e. all results that follow are true for any α, α'. We start by reviewing the Edge of Chaos theory.

Edge of Chaos (EOC). For some input x, we denote by q^l(x) the variance of y^l(x). The convergence of q^l(x) as l increases is studied in (Lee et al., 2018), (Schoenholz et al., 2017) and (Hayou et al., 2019). Under general regularity conditions, it is proved that q^l(x) converges to a point q(σ_b, σ_w) > 0 independent of x as l → ∞. The asymptotic behaviour of the correlation c^l(x, x') between y^l(x) and y^l(x') for any two inputs x and x' is also driven by (σ_b, σ_w): Schoenholz et al. (2017) show that if σ_w² E[φ'(√(q(σ_b, σ_w)) Z)²] < 1, where Z ∼ N(0, 1), then c^l(x, x') converges to 1 exponentially quickly, and the authors call this phase the ordered phase. However, if σ_w² E[φ'(√(q(σ_b, σ_w)) Z)²] > 1, then c^l(x, x') converges to some c < 1, which is then referred to as the chaotic phase. The authors define the EOC as the set of parameters (σ_b, σ_w) such that σ_w² E[φ'(√(q(σ_b, σ_w)) Z)²] = 1. The behaviour of c^l(x, x') on the EOC is studied in (Hayou et al., 2019), where it is proved to converge to 1 at a polynomial rate (see Section 2 of the Supplementary). The exact rate depends on the smoothness of the activation function.

The following proposition establishes that any initialization in the ordered or chaotic phase leads to a trivial limiting NTK as L becomes large.

Proposition 1 (NTK with Ordered/Chaotic Initialization). Let (σ_b, σ_w) be either in the ordered or in the chaotic phase. Then, there exists λ > 0 such that for all ε ∈ (0, 1), there exists γ > 0 such that

    sup_{(x,x')∈B_ε} |K^L(x, x') − λ| ≤ e^{−γL}.

The proof of Proposition 1 relies on the asymptotic analysis of the second moment of the gradient. We refer the reader to Section 6 in the appendix for more details.

Proposition 1 shows that K̂^L becomes close to a constant matrix as the depth grows. The exponential convergence rate implies that even with a small number of layers, the kernel K^L is close to being degenerate. This suggests that NTK training fails, and the performance of the NTK regime solution will be no better than that of a random classifier. Empirically, we find that with depth L = 30, NTK training fails when the network is initialized on the ordered phase. See Section 5 for more details.
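To see Proposition 1 concretely, the FFNN recursion of Lemma 1 can be iterated in closed form for ReLU, for which E[φ(u)φ(v)] and E[φ'(u)φ'(v)] over a correlated Gaussian pair have analytical expressions (the same identities are used in Section 4 of the appendix). The sketch below is an illustrative computation under these closed forms, not the paper's code; it tracks a single pair of equal-norm inputs.

```python
import numpy as np

def relu_ntk_pair(cos0, depth, sigma_w, sigma_b, q0=1.0):
    """Iterate Lemma 1 for ReLU on two inputs with x.x/d = q0 and initial cosine cos0,
    using E[phi(u)phi(v)] = q f(c)/2 and E[phi'(u)phi'(v)] = (pi - arccos c)/(2 pi)
    for a correlated Gaussian pair. Returns (K^L(x, x'), K^L(x, x))."""
    f = lambda c: (c * np.arcsin(c) + np.sqrt(1.0 - c ** 2)) / np.pi + c / 2.0
    q_diag = sigma_b ** 2 + sigma_w ** 2 * q0            # q^1(x) = q^1(x')
    q_off = sigma_b ** 2 + sigma_w ** 2 * q0 * cos0      # q^1(x, x')
    K_off, K_diag = q_off, q_diag                        # K^1
    for _ in range(2, depth + 1):
        c = np.clip(q_off / q_diag, -1.0, 1.0)
        qdot = sigma_w ** 2 * (np.pi - np.arccos(c)) / (2.0 * np.pi)   # \dot{q}^l(x, x')
        q_off = sigma_b ** 2 + sigma_w ** 2 * q_diag * f(c) / 2.0      # \hat{q}^l(x, x')
        q_diag = sigma_b ** 2 + sigma_w ** 2 * q_diag / 2.0            # \hat{q}^l(x, x)
        K_off = qdot * K_off + q_off                                   # Lemma 1, off-diagonal
        K_diag = (sigma_w ** 2 / 2.0) * K_diag + q_diag                # \dot{q}^l(x, x) = sigma_w^2/2
    return K_off, K_diag

# Ordered initialization (sigma_b, sigma_w) = (1, 0.1) vs EOC (0, sqrt(2)), as in Section 5.
# In the ordered phase both entries collapse exponentially fast to the same constant
# (the lambda of Proposition 1); on the EOC, K^L grows linearly and K^L/L approaches
# its distinct limit only at a log(L)/L rate (Theorem 1 below).
for sb, sw, name in [(1.0, 0.1, "ordered"), (0.0, np.sqrt(2.0), "EOC")]:
    K_off, K_diag = relu_ntk_pair(cos0=0.5, depth=300, sigma_w=sw, sigma_b=sb)
    print(name, K_off / K_diag)
```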

Before stating the results for the EOC initialization, we introduce the following assumption on the input space of the CNN.

Assumption 1 (CNN input space). We assume that for all x, x' ∈ X, q^1_{α,α'}(x, x') is independent of α, α'.

Assumption 1 is a constraint on the input space of the CNN. It simplifies the analysis of the NTK of a CNN by linking it to that of an FFNN. We refer the reader to Section 3 in the appendix for more details. We will state it clearly whenever we use this assumption.

With an initialization on the EOC, the convergence rate is polynomial instead of exponential. We show this in the next theorem. Hereafter, we define the Average NTK (ANTK) by AK^L = K^L/L. The notation g(x) = Θ(m(x)) means there exist two constants A, B > 0 such that A m(x) ≤ g(x) ≤ B m(x).

Theorem 1 (NTK on the Edge of Chaos). Let φ be a non-linear activation function, (σ_b, σ_w) ∈ EOC and AK^L = K^L/L. We have that

    sup_{x∈E} |AK^L(x, x) − AK^∞(x, x)| = Θ(L^{−1}).

Moreover, there exists a constant λ ∈ (0, 1) such that for all ε ∈ (0, 1)

    sup_{(x,x')∈B_ε} |AK^L(x, x') − AK^∞(x, x')| = Θ(log(L) L^{−1}),

where
• if φ = ReLU, then AK^∞(x, x') = (σ_w² ‖x‖ ‖x'‖ / d)(1 − (1 − λ) 1_{x≠x'}) with λ = 1/4;
• if φ = Tanh, then AK^∞(x, x') = q (1 − (1 − λ) 1_{x≠x'}), where q > 0 is a constant and λ = 1/3.
All results hold for CNN under Assumption 1.

The proof of Theorem 1 is delicate and requires a special form of inequalities to control the convergence rate (i.e. to obtain Θ instead of O). We refer the reader to Section 1 in the appendix for more details about the proof techniques.

Theorem 1 shows that with an initialization on the EOC, K^L increases linearly in L. Moreover, the EOC initialization significantly slows down the convergence rate (w.r.t. L) of AK^L to the trivial kernel AK^∞. This is of great importance since AK^∞ is trivial and carries hardly any information on x. Indeed, the convergence rate of AK^L to AK^∞ is Θ(log(L) L^{−1}). This means that as L grows, the NTK with an EOC initialization is still much further from the trivial kernel AK^∞ compared to the NTK with an Ordered/Chaotic initialization. This allows NTK training on deeper networks compared to the Ordered phase initialization. For ReLU, a similar result appeared independently in (Huang et al., 2020) after the first version of this paper was made publicly available. However, those authors only proved an upper bound on the convergence rate of order O(polylog(L)/L), while our result gives the exact rate Θ(log(L) L^{−1}) for both ReLU and Tanh. We also extend the results to ResNet and a scaled form of ResNet in the next section.

3.2. Residual Neural Networks (ResNet)

Another important feature of DNNs, which is known to be highly influential, is their architecture. For residual networks, the NTK also has a simple recursion in the infinite width limit.

Lemma 3 (NTK of a ResNet with fully connected layers in the infinite width limit). Let K^{res,1} be the exact NTK of the ResNet with one layer. Then:
• For the first layer (without residual connections), we have for all x, x' ∈ R^d

    K^{res,1}_{ii'}(x, x') = δ_{ii'} ( σ_b² + (σ_w²/d) x · x' ).

• For l ≥ 2, as n_1, n_2, ..., n_{l−1} → ∞ recursively, we have for all i, i' ∈ [1 : n_l], K^{res,l}_{ii'} = δ_{ii'} K^l_{res}, where K^l_{res} is given by the recursive formula, for all x, x' ∈ R^d,

    K^l_{res}(x, x') = K^{l−1}_{res}(x, x') (q̇^l(x, x') + 1) + q^l(x, x').

For residual networks with convolutional layers, the formula is similar to the CNN case.

Lemma 4 (NTK of a ResNet with convolutional layers in the infinite width limit). Let K^{res,1} be the exact NTK of the ResNet with one layer. Then:
• For the first layer (without residual connections), we have for all x, x' ∈ R^d

    K^{res,1}_{(i,α),(i',α')}(x, x') = δ_{ii'} ( (σ_w²/(n_0(2k+1))) [x, x']_{α,α'} + σ_b² ).

• For l ≥ 2, as n_1, n_2, ..., n_{l−1} → ∞ recursively, we have for all i, i' ∈ [1 : n_l], α, α' ∈ [0 : M − 1], K^{res,l}_{(i,α),(i',α')}(x, x') = δ_{ii'} K^{res,l}_{α,α'}(x, x'), where K^{res,l}_{α,α'} is given by the recursive formula, for all x, x' ∈ R^d, using the same notations as in Lemma 2,

    K^{res,l}_{α,α'} = K^{res,l−1}_{α,α'} + (1/(2k+1)) Σ_{β∈ker_l} Ψ^{l−1}_{α+β, α'+β},

where Ψ^{l−1}_{α,α'} = q̇^l_{α,α'} K^{res,l−1}_{α,α'} + q̂^l_{α,α'}.

The additional terms K^{l−1}_{res}(x, x') (resp. K^{res,l−1}_{α,α'}) in the recursive formulas of Lemma 3 (resp. Lemma 4) are due to the ResNet architecture. It turns out that this term helps in slowing down the convergence rate of the NTK. The next theorem shows that for any σ_w > 0, the NTK of a ResNet explodes (exponentially) as L grows. However, a normalized version K̄^L = K^L/α_L of the NTK of a ResNet will always have a polynomial convergence rate to a limiting trivial kernel.

Theorem 2 (NTK for ResNet). Consider a ResNet satisfying

    y^l(x) = y^{l−1}(x) + F(w^l, y^{l−1}(x)),   l ≥ 2,        (5)

where F is either a convolutional or a dense layer (equations (3) and (4)) with ReLU activation. Let K^L_{res} be the corresponding NTK, and let K̄^L_{res} = K^L_{res}/α_L (Normalized NTK) with α_L = L (1 + σ_w²/2)^{L−1}. Then, we have

    sup_{x∈E} |K̄^L_{res}(x, x) − K̄^∞_{res}(x, x)| = Θ(L^{−1}).

Moreover, there exists a constant λ ∈ (0, 1) such that for all ε ∈ (0, 1)

    sup_{(x,x')∈B_ε} |K̄^L_{res}(x, x') − K̄^∞_{res}(x, x')| = Θ(log(L) L^{−1}),

where K̄^∞_{res}(x, x') = (σ_w² ‖x‖ ‖x'‖ / d)(1 − (1 − λ) 1_{x≠x'}). All results hold for ResNet with convolutional layers under Assumption 1.

The proof techniques used in Theorem 2 are similar to those used in the proof of Theorem 1. Details are provided in the appendix.

Theorem 2 shows that the NTK of a ReLU ResNet explodes exponentially w.r.t. L. However, the normalized kernel K̄^L_{res} = K^L_{res}(x, x')/α_L converges to a limiting kernel K̄^∞_{res} at the exact polynomial rate Θ(log(L) L^{−1}) for all σ_w > 0. This allows for NTK training of deep ResNets, similarly to the EOC initialization for FFNN or CNN networks. However, the NTK explodes exponentially and the normalized NTK converges to a trivial kernel, which means that, even with ResNet, NTK training would fail at some point as we increase the depth.
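The recursion of Lemma 3 can also be iterated in closed form for ReLU residual blocks, using the same Gaussian identities as before together with the residual covariance recursion (equation (14) in the appendix). The following sketch is an illustration under those closed forms, not the paper's code; it makes the α_L = L(1 + σ_w²/2)^{L−1} explosion of Theorem 2 visible.

```python
import numpy as np

def resnet_ntk_pair(cos0, depth, sigma_w, sigma_b=0.0, q0=1.0):
    """Iterate Lemma 3 (ReLU residual blocks) for two equal-norm inputs with
    x.x/d = q0 and initial cosine cos0. Returns K^L_res(x, x') and the
    normalization alpha_L of Theorem 2."""
    f = lambda c: (c * np.arcsin(c) + np.sqrt(1.0 - c ** 2)) / np.pi + c / 2.0
    q_x = sigma_b ** 2 + sigma_w ** 2 * q0                 # q^1(x) = q^1(x')
    q_xy = sigma_b ** 2 + sigma_w ** 2 * q0 * cos0         # q^1(x, x')
    K = q_xy                                               # K^{res,1}(x, x')
    for _ in range(2, depth + 1):
        c = np.clip(q_xy / q_x, -1.0, 1.0)
        qdot = sigma_w ** 2 * (np.pi - np.arccos(c)) / (2.0 * np.pi)
        # Residual covariance recursion (appendix equation (14)):
        q_xy = q_xy + sigma_b ** 2 + sigma_w ** 2 / 2.0 * q_x * f(c)
        q_x = q_x + sigma_b ** 2 + sigma_w ** 2 / 2.0 * q_x      # c = 1 on the diagonal
        K = K * (qdot + 1.0) + q_xy                        # Lemma 3
    alpha_L = depth * (1.0 + sigma_w ** 2 / 2.0) ** (depth - 1)
    return K, alpha_L

# The raw ResNet NTK explodes like alpha_L = L (1 + sigma_w^2/2)^{L-1} (Theorem 2),
# while the normalized kernel K / alpha_L stays O(1):
for L in (10, 50, 100):
    K, aL = resnet_ntk_pair(cos0=0.5, depth=L, sigma_w=1.0)
    print(L, K, K / aL)
```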
The term α_L in the residual NTK might cause numerical stability issues for NTK training, and the triviality of the limiting kernel yields a trivial NTK regime solution (recall that f_t − f_0 belongs to the RKHS of the NTK; see Section 2). It turns out that we can improve the performance of NTK training of ResNets with a simple scaling of the ResNet blocks.

Proposition 2 (Scaled ResNet). Consider a ResNet satisfying

    y^l(x) = y^{l−1}(x) + (1/√l) F(w^l, y^{l−1}(x)),   l ≥ 2,        (6)

where F is either a convolutional or a dense layer ((3) and (4)) with ReLU activation. Then the results of Theorem 2 apply with α_L = L^{1+σ_w²/2} and the convergence rate Θ(log(L) L^{−1}).

Proposition 2 shows that scaling the residual blocks by 1/√l has two important effects on the NTK: first, it stabilizes the NTK, which only grows as L^{1+σ_w²/2} instead of L(1 + σ_w²/2)^{L−1}; second, it drastically slows down the convergence rate to the limiting (trivial) kernel K̄^∞_{res}. Both properties are highly desirable for NTK training. The second property in particular means that with the scaling, we can 'NTK train' deeper ResNets compared to the non-scaled ResNet. We illustrate the effectiveness of the Scaled ResNet in Section 5. A more aggressive scaling was studied in (Huang et al., 2020), where the authors scale the blocks with 1/L instead of our scaling 1/√l, and show that it also stabilizes the NTK of the ResNet. This is the main topic of the next chapter. We particularly show that a suitable scaling ensures that the limiting NTK is universal, i.e. we can approximate any continuous function on some compact set K with a function from the Reproducing Kernel Hilbert Space of the limiting NTK. This is a desirable property since the second term in the solution of the NTK regime lives in the RKHS of the NTK.

3.3. Spectral decomposition of the limiting NTK

To refine the analysis of Section 3, we study the limiting behaviour of the spectrum of the NTK over the unit sphere S^{d−1} = {x ∈ R^d : ‖x‖_2 = 1}. On the sphere S^{d−1}, the kernel K^L is a dot-product kernel, i.e. there exists a function g_L such that K^L(x, x') = g_L(x · x') for all x, x' ∈ S^{d−1}. This kernel type is known to be diagonalizable on the sphere S^{d−1}, and the eigenfunctions are the so-called Spherical Harmonics of S^{d−1}. Many concurrent results have observed this fact (Geifman et al., 2020; Cao et al., 2020; Bietti and Mairal, 2019). Our goal in the next theorem is to confirm the results of the previous section from a spectral perspective, by showing that the eigenvalues of the NTK (scaled NTK) converge to zero as the depth goes to infinity, and only the first eigenvalue remains positive (which corresponds to the constant eigenfunction).

Theorem 3 (Spectral decomposition on S^{d−1}). Let κ^L be either the NTK (K^L) of an FFNN with L layers initialized on the Ordered phase, the Average NTK (AK^L) of an FFNN with L layers initialized on the EOC, or the Normalized NTK (K̄^L_{res}) of a ResNet with L layers (fully connected). Then, for all L ≥ 1, there exists (μ^L_k)_{k≥0} such that for all x, x' ∈ S^{d−1}

    κ^L(x, x') = Σ_{k≥0} μ^L_k Σ_{j=1}^{N(d,k)} Y_{k,j}(x) Y_{k,j}(x'),

where (Y_{k,j})_{k≥0, j∈[1:N(d,k)]} are the spherical harmonics of S^{d−1}, and N(d, k) is the number of harmonics of order k. Moreover, we have that 0 < μ^∞_0 = lim_{L→∞} μ^L_0 < ∞, and for all k ≥ 1, lim_{L→∞} μ^L_k = 0.

The proof of Theorem 3 is based on a result from spectral theory. The limiting eigenvalues are obtained by a simple application of the dominated convergence theorem.

Theorem 3 shows that in the limit of large L, the kernel κ^L becomes close to the trivial kernel κ^∞ : (x, x') ↦ μ^∞_0 Y_{0,0}(x) Y_{0,0}(x'), where Y_{0,0} is the constant function in the spherical harmonics class. Therefore, in the limit of infinite depth, the RKHS of the kernel κ^L is reduced to the space of constant functions.

Although the asymptotic NTK is degenerate, we show in the next section that we can leverage the asymptotic analysis of Section 3 to obtain valuable insights on the choice of the learning rate.

Figure 1. Train/test accuracy of an FFNN with ReLU activation on the Fashion MNIST dataset for different depths and learning rates, trained for 10 epochs. The plot is in log-log scale.

4. Learning Rate Passband

Tuning the learning rate (LR) is crucial for the training of DNNs; a too large or too small LR can cause the training to fail. Empirically, the optimal LR tends to decrease as the network depth grows. In this section, we use the NTK linear model presented in Section 2 to establish the existence of an LR passband, i.e. an interval of values of the learning rate where training occurs.

Recall the dynamics of the linear model

    df_t(X) = −(1/N) K̂^L (f_t(X) − Z) dt.        (7)

The GD update with learning rate η is given by

    f_{t+1}(X) = (I − (η/N) K̂^L) f_t(X) + (η/N) K̂^L Z.        (8)

To ensure stability of (8), a necessary condition is that ‖I − (η/N) K̂^L‖_F < 1, which implies

    η < 2 / μ_max((1/N) K̂^L),

where μ_max denotes the largest eigenvalue.

For an FFNN (or a CNN under Assumption 1) initialized on the EOC, as L grows we have that K̂^L = qL((1 − λ)I + λU) + O(log(L)) (Theorem 1), where U is the all-ones matrix. Therefore, for large L and N, we have μ_max((1/N) K̂^L) ∼ qλL. The upper bound on η scales as 1/L; we therefore expect the passband to have a linear upper bound in log-log scale. To validate this hypothesis, we train an FFNN on the Fashion MNIST dataset. Figure 1 shows the train/test accuracy for different LRs and depths. The slope of the red line is −1, which confirms our prediction that the upper bound of the LR passband decays as L^{−1}. A similar bound has been introduced recently in (Hayase and Karakida, 2020) in the different context of networks achieving dynamical isometry with the Hard-Tanh activation function. On the other hand, Figure 1 shows that the lower bound of the passband and the depth are almost uncorrelated.¹

¹ We currently do not have an explanation for this effect. We leave this for future work.
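The stability bound above is easy to evaluate numerically once a Gram matrix K̂^L is available (for instance from the recursions of Section 3 or from the Neural Tangents library). The sketch below is an illustration of this calculation, not the paper's experimental code; the idealized Gram matrix uses the ReLU value λ = 1/4 from Theorem 1.

```python
import numpy as np

def lr_upper_bound(K_hat):
    """Largest stable learning rate for the linearized GD update (8):
    eta < 2 / mu_max(K_hat / N), with N the number of training points."""
    N = K_hat.shape[0]
    mu_max = np.linalg.eigvalsh(K_hat / N)[-1]   # largest eigenvalue
    return 2.0 / mu_max

# Theorem 1 predicts K_hat ~ q L ((1 - lam) I + lam U) on the EOC, so the upper
# bound of the passband should shrink like 1/L; check on the idealized Gram matrix.
q, lam, N = 1.0, 0.25, 100
for L in (10, 100, 1000):
    K_ideal = q * L * ((1 - lam) * np.eye(N) + lam * np.ones((N, N)))
    print(L, lr_upper_bound(K_ideal))
```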

5. Experiments

5.1. Behaviour of K^L as L goes to infinity

Proposition 1 and Theorems 1 and 2 show that the NTK (or scaled NTK) converges to a trivial kernel. Figure 2 shows the normalized eigenvalues of the NTK of an FFNN on the 2D sphere. On the Ordered phase, the eigenvalues converge quickly to zero as the depth grows, while with an EOC initialization, the eigenvalues converge to zero at a slower rate. For L = 300, the NTK on the EOC is 'richer' than the NTK on the Ordered phase, in the sense that the small eigenvalues with the EOC are relatively much bigger than those with the Ordered phase initialization. This reflects directly on the RKHS of the NTK, and allows the NTK regime solution to be 'richer', since it is a combination of different eigenfunctions of the NTK, and not only one as in the Ordered phase (the constant eigenfunction).

Figure 2. Normalized eigenvalues of K^L on the 2D sphere for an FFNN with different initializations, activations, and depths.

5.2. Can the NTK regime explain DNN performance?

We train FFNN, Vanilla CNN (stacked convolutional layers without pooling, followed by a dense layer), Vanilla ResNet (ResNet with FFNN blocks), and Scaled ResNet with different depths using two training methods:

Table 2. Test accuracy for varying architectures and depths on the MNIST and CIFAR10 datasets. We show test accuracy after 100 training epochs for L ∈ {3, 30} and 160 epochs for L = 300.

MNIST
                    NTK Training                SGD Training
                    EOC          Ordered        EOC          Ordered
L = 3
  FFNN-ReLU         96.64±0.11   96.57±0.12     97.05±0.27   97.11±0.31
  FFNN-Tanh         95.34±1.04   96.32±0.41     97.19±0.11   97.03±0.29
  CNN-ReLU          97.13±0.31   97.23±0.22     98.95±0.12   98.89±0.18
  V-ResNet          96.73±0.05   96.71±0.16     97.19±0.23   97.12±0.14
L = 30
  FFNN-ReLU         96.95±0.22   —              97.55±0.09   —
  FFNN-Tanh         97.30±0.15   —              97.87±0.17   —
  CNN-ReLU          98.60±0.13   —              99.02±0.07   —
  V-ResNet          —            —              98.17±0.03   98.13±0.08
  S-ResNet          97.01±0.10   97.11±0.10     98.33±0.10   98.26±0.14
L = 300
  FFNN-ReLU         —            —              98.14±0.12   —
  FFNN-Tanh         —            —              98.54±0.18   —
  CNN-ReLU          —            —              99.43±0.04   —
  V-ResNet          —            —              98.23±0.09   98.19±0.06
  S-ResNet          —            —              98.40±0.07   98.51±0.08

CIFAR10
                    NTK Training                SGD Training
                    EOC          Ordered        EOC          Ordered
L = 3
  FFNN-ReLU         48.13±0.10   48.45±0.14     55.13±0.23   54.10±0.12
  FFNN-Tanh         48.32±0.15   48.10±0.10     56.13±0.34   54.10±0.23
  CNN-ReLU          49.11±0.16   42.76±3.32     60.23±0.45   59.05±0.15
  V-ResNet          47.82±0.73   48.01±0.20     54.40±0.24   54.28±0.33
L = 30
  FFNN-ReLU         48.32±0.10   —              56.10±0.41   —
  FFNN-Tanh         48.40±0.12   —              57.39±0.08   —
  CNN-ReLU          48.42±0.10   —              75.39±0.31   —
  V-ResNet          —            —              57.09±0.47   58.13±0.18
  S-ResNet          49.10±0.15   50.01±0.12     57.21±0.43   57.51±0.11
L = 300
  FFNN-ReLU         —            —              30.25±3.23   —
  FFNN-Tanh         —            —              58.25±0.43   —
  CNN-ReLU          —            —              76.25±0.21   —
  V-ResNet          —            —              58.87±0.44   59.25±0.10
  S-ResNet          —            —              60.86±0.24   61.51±0.18

SGD training. We use SGD with a batch size of 128 and a learning rate of 10^{−1} for L ∈ {3, 30} and 10^{−2} for L = 300 (this learning rate was found by a grid search with an exponential step size of 10; note that the optimal learning rate with the NTK parameterization is usually bigger than the optimal learning rate with the standard parameterization). We use 100 training epochs for L ∈ {3, 30}, and 150 epochs for L = 300.

NTK training. We use the Python library Neural Tangents introduced by Novak et al. (2020) with 10K samples from MNIST/CIFAR10. This corresponds to the inversion of a 10K × 10K matrix to obtain the NTK regime solution discussed in Section 2.

For the EOC initialization, we use (σ_b, σ_w) = (0, √2) for ReLU, and (σ_b, σ_w) = (0.2, 1.298) for Tanh. For the Ordered phase initialization, we use (σ_b, σ_w) = (1, 0.1) for both ReLU and Tanh. Table 2 displays the test accuracies for both NTK training and SGD training. The dashes refer to the trivial test accuracy ∼10%, which is the test accuracy of a uniform random classifier with 10 classes, i.e. in these cases the model does not learn. For L = 300, NTK training fails for all architectures and initializations, confirming the results of Theorems 1 and 2 and Proposition 1; SGD succeeds in training FFNN and CNN with an EOC initialization and fails with an Ordered initialization, and succeeds in training ResNet with both initializations (which confirms the findings of (Yang and Schoenholz, 2017b) that ResNets 'live' on the EOC). This shows that the NTK regime cannot explain the performance of DNNs trained with SGD. With L = 30, NTK training fails with the Vanilla ResNet, while it yields good performance with the Scaled ResNet; this also confirms the benefits of the scaling introduced in Proposition 2. However, even with the Scaled ResNet, NTK training fails at depth L = 300.

Does Scaled ResNet outperform ResNet with SGD? We train standard ResNet with depths 32, 50, and 104 on CIFAR100 with SGD. We use a decaying learning rate schedule; we start with 0.1 and divide by 10 after n_e/2 epochs, where n_e is the total number of epochs, and we scale again by 10 after a further n_e/4 epochs. We use a batch size of 128, and we train the model for 160 epochs. Proposition 2 shows that the NTK of the Scaled ResNet is more stable compared to the NTK of the standard ResNet. Although this result is limited to NTK training, we investigate the impact of scaling on SGD training. Table 3 displays the test accuracy for standard ResNet and Scaled ResNet after 10 and 160 epochs; Scaled ResNet outperforms ResNet and converges faster. However, it is not clear whether this is linked to the NTK, or caused by something else. We leave this for future work.

Table 3. Test accuracy on CIFAR100 for ResNet.

                         Epoch 10       Epoch 160
  ResNet32   standard    54.18±1.21     72.49±0.18
             scaled      53.89±2.32     74.07±0.22
  ResNet50   standard    51.09±1.73     73.63±1.51
             scaled      55.39±1.52     75.02±0.44
  ResNet104  standard    47.02±3.23     74.77±0.29
             scaled      56.38±2.54     76.14±0.98

6. Conclusion

In this paper, we have shown that the infinite depth limit of the NTK regime is trivial and cannot explain the performance of DNNs. However, we proved that the performance of NTK training is initialization dependent (Table 2). These

findings add to a recent line of research which shows that the infinite width approximation of the NTK does not fully capture the training dynamics of DNNs. Indeed, recent works have shown that the NTK of finite width neural networks changes with time (Chizat and Bach, 2018; Ghorbani et al., 2019; Huang and Yau, 2020), and might even be random, as shown by (Hanin and Nica, 2019), where the authors prove that in the limit n, L → ∞ (where n is the width of the network) with fixed ratio γ = L/n, the limiting kernel is random. An interesting property in this regime is "feature learning", which the NTK regime lacks. Further research is needed in order to understand the difference between the two regimes.

Acknowledgement

The project leading to this work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 834175).

References

Arora, S., S. Du, W. Hu, Z. Li, and R. Wang (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. ICML.

Arora, S., S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang (2019). On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955.

Bietti, A. and J. Mairal (2019). On the inductive bias of neural tangent kernels. NeurIPS.

Cao, Y., Z. Fang, Y. Wu, D. Zhou, and Q. Gu (2020). Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198.

Cao, Y. and Q. Gu (2019). Generalization bounds of stochastic gradient descent for wide and deep neural networks. NeurIPS.

Chizat, L. and F. Bach (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.

Du, S., J. Lee, H. Li, L. Wang, and X. Zhai (2019). Gradient descent finds global minima of deep neural networks. ICML.

Du, S., J. Lee, Y. Tian, B. Poczos, and A. Singh (2018). Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. ICML.

Du, S., X. Zhai, B. Poczos, and A. Singh (2019). Gradient descent provably optimizes over-parameterized neural networks. ICLR.

Geifman, A., A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and R. Basri (2020). On the similarity between the Laplace and neural tangent kernels. NeurIPS.

Ghorbani, B., S. Mei, T. Misiakiewicz, and A. Montanari (2019). Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191.

Hanin, B. and M. Nica (2019). Finite depth and width corrections to the neural tangent kernel. arXiv preprint arXiv:1909.05989.

Hayase, T. and R. Karakida (2020). The spectrum of Fisher information of deep networks achieving dynamical isometry. arXiv preprint arXiv:2006.07814.

Hayou, S., A. Doucet, and J. Rousseau (2019). On the impact of the activation function on deep neural networks training. ICML.

Huang, J. and H. Yau (2020). Dynamics of deep neural networks and neural tangent hierarchy. ICML.

Huang, K., Y. Wang, M. Tao, and T. Zhao (2020). Why do deep residual networks generalize better than deep feedforward networks? – A neural tangent kernel perspective. arXiv preprint arXiv:2002.06262.

Jacot, A., F. Gabriel, and C. Hongler (2018). Neural tangent kernel: Convergence and generalization in neural networks. 32nd Conference on Neural Information Processing Systems.

Karakida, R., S. Akaho, and S. Amari (2018). Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv preprint arXiv:1806.01316.

Lee, J., Y. Bahri, R. Novak, S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018). Deep neural networks as Gaussian processes. 6th International Conference on Learning Representations.

Lee, J., L. Xiao, S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington (2019). Wide neural networks of any depth evolve as linear models under gradient descent. NeurIPS.

Lillicrap, T., D. Cownden, D. Tweed, and C. Akerman (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications 7(13276).

MacRobert, T. (1967). Spherical harmonics: An elementary treatise on harmonic functions, with applications. Pergamon Press.

Matthews, A., J. Hron, M. Rowland, R. Turner, and Z. Ghahramani (2018). Gaussian process behaviour in wide deep neural networks. 6th International Conference on Learning Representations.

Neal, R. (1995). Bayesian learning for neural networks. Springer Science & Business Media 118.

Nguyen, Q. and M. Hein (2018). Optimization landscape and expressivity of deep CNNs. ICML.

Novak, R., L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz (2020). Neural Tangents: Fast and easy infinite neural networks in Python. International Conference on Learning Representations.

Poole, B., S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli (2016). Exponential expressivity in deep neural networks through transient chaos. 30th Conference on Neural Information Processing Systems.

Schoenholz, S., J. Gilmer, S. Ganguli, and J. Sohl-Dickstein (2017). Deep information propagation. 5th International Conference on Learning Representations.

Xiao, L., Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington (2018). Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. ICML.

Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.

Yang, G. (2020). Tensor Programs III: Neural matrix laws. arXiv preprint arXiv:2009.10685.

Yang, G. and S. Schoenholz (2017a). Mean field residual networks: On the edge of chaos. Advances in Neural Information Processing Systems 30.

Yang, G. and S. Schoenholz (2017b). Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems, pp. 7103–7114.

Zhang, C., S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

Zou, D., Y. Cao, D. Zhou, and Q. Gu (2018). Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888.

Appendix

0. Setup and notations

0.1. Neural Tangent Kernel

Consider a neural network model consisting of L layers (y^l)_{1≤l≤L}, with y^l : R^{n_{l−1}} → R^{n_l}, n_0 = d, and let θ = (θ^l)_{1≤l≤L} be the flattened vector of weights and biases indexed by the layer's index, and p be the dimension of θ. Recall that θ^l has dimension n_l(n_{l−1} + 1). The output f of the neural network is given by some transformation s : R^{n_L} → R^o of the last layer y^L(x); o being the dimension of the output (e.g. the number of classes for a classification problem). For any input x ∈ R^d, we thus have f(x, θ) = s(y^L(x)) ∈ R^o. As we train the model, θ changes with time t and we denote by θ_t the value of θ at time t and f_t(x) = f(x, θ_t) = (f_j(x, θ_t), j ≤ o). Let D = (x_i, z_i)_{1≤i≤N} be the data set and let X = (x_i)_{1≤i≤N}, Z = (z_i)_{1≤i≤N} be the matrices of inputs and outputs respectively, with dimensions d × N and o × N. For any function g : R^d × R^o → R^k, k ≥ 1, we denote by g(X, Z) the matrix (g(x_i, z_i))_{1≤i≤N} of dimension k × N.

(Jacot et al., 2018) studied the behaviour of the output of the neural network as a function of the training time t when the network is trained using a gradient descent algorithm. (Lee et al., 2019) built on this result to linearize the training dynamics. We recall hereafter some of these results.

For a given θ, the empirical loss is given by L(θ) = (1/N) Σ_{i=1}^{N} ℓ(f(x_i, θ), z_i). The full batch GD algorithm is given by

    θ̂_{t+1} = θ̂_t − η ∇_θ L(θ̂_t),        (1)

where η > 0 is the learning rate. Let T > 0 be the training time and N_s = T/η be the number of steps of the discrete GD (1). The continuous time system equivalent to (1) with step ∆t = η is given by

dθt = −∇θL(θt)dt. (2)

This differs from the result by (Lee et al., 2019) since we use a discretization step of ∆t = η. It is well known that this discretization scheme leads to an error of order O(η) (see Appendix). Equation (2) can be re-written as

    dθ_t = −(1/N) ∇_θ f(X, θ_t)^T ∇_{z'} ℓ(f(X, θ_t), Z) dt,

where ∇_θ f(X, θ_t) is a matrix of dimension oN × p and ∇_{z'} ℓ(f(X, θ_t), Z) is the flattened vector of dimension oN constructed by concatenating the vectors ∇_{z'} ℓ(z', z_i)|_{z'=f(x_i, θ_t)}, i ≤ N. As a result, the output function f_t(x) = f(x, θ_t) ∈ R^o satisfies the following ODE

    df_t(x) = −(1/N) ∇_θ f(x, θ_t) ∇_θ f(X, θ_t)^T ∇_{z'} ℓ(f_t(X), Z) dt.        (3)

The Neural Tangent Kernel (NTK) K^L_θ is defined as the o × o dimensional kernel satisfying: for all x, x' ∈ R^d,

    K^L_{θ_t}(x, x') = ∇_θ f(x, θ_t) ∇_θ f(x', θ_t)^T = Σ_{l=1}^{L} ∇_{θ^l} f(x, θ_t) ∇_{θ^l} f(x', θ_t)^T ∈ R^{o×o}.        (4)

We also define K^L_{θ_t}(X, X) as the oN × oN matrix defined blockwise by

    K^L_{θ_t}(X, X) =
        [ K^L_{θ_t}(x_1, x_1)   ···   K^L_{θ_t}(x_1, x_N) ]
        [ K^L_{θ_t}(x_2, x_1)   ···   K^L_{θ_t}(x_2, x_N) ]
        [         ⋮              ⋱             ⋮          ]
        [ K^L_{θ_t}(x_N, x_1)   ···   K^L_{θ_t}(x_N, x_N) ],

i.e. the (i, j) block of size o × o is K^L_{θ_t}(x_i, x_j).

By applying (3) to the vector X , one obtains

    df_t(X) = −(1/N) K^L_{θ_t}(X, X) ∇_{z'} ℓ(f_t(X), Z) dt,        (5)

meaning that for all j ≤ N,

    df_t(x_j) = −(1/N) K^L_{θ_t}(x_j, X) ∇_{z'} ℓ(f_t(X), Z) dt.

Infinite width dynamics. In the case of an FFNN, (Jacot et al., 2018) proved that, with GD, the kernel K^L_{θ_t} converges to a kernel K^L which depends only on L (the number of layers) for all t < T when n_1, n_2, ..., n_L → ∞, where T is an upper bound on the training time, under the technical assumption ∫_0^T ‖∇_z ℓ(f_t(X), Z)‖_2 dt < ∞ a.s. with respect to the initialization weights. The infinite width limit of the training dynamics is given by

    df_t(X) = −(1/N) K^L(X, X) ∇_{z'} ℓ(f_t(X), Z) dt.        (6)

We denote hereafter K̂^L = K^L(X, X). As an example, with the quadratic loss ℓ(z', z) = (1/2)‖z' − z‖², (6) is equivalent to

    df_t(X) = −(1/N) K̂^L (f_t(X) − Z) dt,        (7)

which is a simple linear model that has a closed-form solution given by

    f_t(X) = e^{-(1/N) K̂^L t} f_0(X) + (I − e^{-(1/N) K̂^L t}) Z.        (8)

For general input x ∈ Rd, we have

    f_t(x) = f_0(x) + γ(x, X)(I − e^{-(1/N) K̂^L t})(Z − f_0(X)),        (9)

where γ(x, X) = K^L(x, X) K^L(X, X)^{-1}.

0.2. Architectures

Let φ be the activation function. We consider the following architectures (FFNN and CNN).

• FeedForward Fully-Connected Neural Network (FFNN). Consider an FFNN of depth L, widths (n_l)_{1≤l≤L}, weights w^l and biases b^l. For some input x ∈ R^d, the forward propagation using the NTK parameterization is given by

    y^1_i(x) = (σ_w/√d) Σ_{j=1}^{d} w^1_{ij} x_j + σ_b b^1_i,
    y^l_i(x) = (σ_w/√(n_{l−1})) Σ_{j=1}^{n_{l−1}} w^l_{ij} φ(y^{l−1}_j(x)) + σ_b b^l_i,   l ≥ 2.        (10)

• Convolutional Neural Network (CNN/ConvNet) Consider a 1D convolutional neural network of depth L, denoting by [m : n] the set of integers {m, m + 1, ..., n} for n ≤ m, the forward propagation is given by

    y^1_{i,α}(x) = (σ_w/√(v_1)) Σ_{j=1}^{n_0} Σ_{β∈ker_1} w^1_{i,j,β} x_{j,α+β} + σ_b b^1_i,
    y^l_{i,α}(x) = (σ_w/√(v_l)) Σ_{j=1}^{n_{l−1}} Σ_{β∈ker_l} w^l_{i,j,β} φ(y^{l−1}_{j,α+β}(x)) + σ_b b^l_i,   l ≥ 2,        (11)

where i ∈ [1 : n_l] is the channel number, α ∈ [0 : M − 1] is the neuron location in the channel, n_l is the number of channels in the l-th layer, M is the number of neurons in each channel, ker_l = [−k : k] is a filter of size 2k + 1 and v_l = n_{l−1}(2k + 1). Here, w^l ∈ R^{n_l × n_{l−1} × (2k+1)}. We assume periodic boundary conditions, which results in y^l_{i,α} = y^l_{i,α+M} = y^l_{i,α−M} and similarly for l = 0, x_{i,α+M_0} = x_{i,α} = x_{i,α−M_0}. For the sake of simplicity, we consider only the case of a 1D CNN; the generalization to an mD CNN for m ∈ N is straightforward.

1. Proof techniques

The techniques used in the proofs range from simple algebraic manipulations to more delicate inequalities.

Lemmas1,2,3,4. The proofs of these lemmas are simple and follow the same inductive argument as in the proof of the original NTK result in (Jacot et al., 2018). Note that these results can also be obtained by simple application of the Master Theorem in (Yang, 2020) using the framework of Tensor Programs.

Proposition 1, Theorems 1, 2. The proofs of these results follow two steps: first, estimating the asymptotic behaviour of the NTK in the limit of large depth; second, controlling this behaviour using upper/lower bounds. We analyse the asymptotic behaviour of the NTK of an FFNN using existing results on signal propagation in deep FFNNs. However, for CNNs, the dynamics are a bit trickier since they involve convolution operators; we use some results from the theory of circulant matrices for this purpose. It is relatively easy to control the dynamics of the NTK in the Ordered/Chaotic phase; however, the dynamics become more complicated on the Edge of Chaos, and technical lemmas, which we call Appendix Lemmas, are introduced for this purpose.

Theorem3. The spectral decomposition of zonal kernels on the sphere is a classical result in spectral theory which was recently applied to Neural Tangent Kernel Geifman et al.(2020); Cao et al.(2020); Bietti and Mairal(2019). In order to prove the convergence of the eigenvalues, we use Dominated Convergence Theorem, leveraging the asymptotic results in Proposition1 and Theorems1,2.

2. The infinite width limit 2.1. Forward propagation

FeedForward Neural Network. For some input x ∈ Rd, the propagation of this input through the network is given by

    y^1_i(x) = (σ_w/√d) Σ_{j=1}^{d} w^1_{ij} x_j + σ_b b^1_i,

    y^l_i(x) = (σ_w/√(n_{l−1})) Σ_{j=1}^{n_{l−1}} w^l_{ij} φ(y^{l−1}_j(x)) + σ_b b^l_i,   l ≥ 2,

where φ : R → R is the activation function. When we take the limit n_{l−1} → ∞ recursively over l, this implies, using the Central Limit Theorem, that y^l_i(x) is a Gaussian variable for any input x. This gives an error of order O(1/√(n_{l−1})) (standard Monte Carlo error). More generally, an approximation of the random process y^l_i(.) by a Gaussian process was first proposed by (Neal, 1995) in the single layer case and has been extended to the multiple layer case by (Lee et al., 2018) and (Matthews et al., 2018). The limiting Gaussian process kernels follow a recursive formula given by, for any inputs x, x' ∈ R^d,

    κ^l(x, x') = E[y^l_i(x) y^l_i(x')]
               = σ_b² + σ_w² E[φ(y^{l−1}_i(x)) φ(y^{l−1}_i(x'))]
               = σ_b² + σ_w² Ψ_φ(κ^{l−1}(x, x), κ^{l−1}(x, x'), κ^{l−1}(x', x')),

where Ψ_φ is a function that only depends on φ. This provides a simple recursive formula for the computation of the kernel κ^l; see, e.g., (Lee et al., 2018) for more details.
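The function Ψ_φ is a two-dimensional Gaussian integral and can be evaluated numerically with Gauss-Hermite quadrature when no closed form is available (Tanh, for instance). The sketch below is an illustrative implementation of this kernel recursion, not code from the paper; the layer-1 values passed to it are placeholders to be supplied by the user.

```python
import numpy as np

# Probabilists' Gauss-Hermite nodes/weights: E[g(Z)] ~ sum_i w_i g(z_i) / sqrt(2*pi).
_nodes, _weights = np.polynomial.hermite_e.hermegauss(80)

def psi_phi(phi, kxx, kxy, kyy):
    """Psi_phi(k(x,x), k(x,x'), k(x',x')) = E[phi(u) phi(v)] for a centered Gaussian
    pair with Var(u) = kxx, Var(v) = kyy, Cov(u, v) = kxy, via a 2D quadrature grid."""
    c = np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0)
    z1, z2 = np.meshgrid(_nodes, _nodes)
    w = np.outer(_weights, _weights) / (2.0 * np.pi)
    u = np.sqrt(kxx) * z1
    v = np.sqrt(kyy) * (c * z1 + np.sqrt(1.0 - c ** 2) * z2)
    return np.sum(w * phi(u) * phi(v))

def kernel_recursion(phi, k1_xx, k1_xy, k1_yy, depth, sigma_w, sigma_b):
    """Iterate kappa^l = sigma_b^2 + sigma_w^2 * Psi_phi(kappa^{l-1}, ...)."""
    kxx, kxy, kyy = k1_xx, k1_xy, k1_yy
    for _ in range(depth - 1):
        kxx, kxy, kyy = (sigma_b ** 2 + sigma_w ** 2 * psi_phi(phi, kxx, kxx, kxx),
                         sigma_b ** 2 + sigma_w ** 2 * psi_phi(phi, kxx, kxy, kyy),
                         sigma_b ** 2 + sigma_w ** 2 * psi_phi(phi, kyy, kyy, kyy))
    return kxx, kxy, kyy
```

For example, kernel_recursion(np.tanh, 1.0, 0.5, 1.0, depth=20, sigma_w=1.298, sigma_b=0.2) iterates the Tanh kernel at the EOC point used in Section 5 (the layer-1 values 1.0, 0.5, 1.0 are illustrative).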

Convolutional Neural Networks. The infinite width approximation with a 1D CNN yields a recursion for the kernel. However, the infinite width here means an infinite number of channels, with a Monte Carlo error of O(1/√(n_{l−1})). The kernel in this case depends on the choice of the neurons in the channel and is given by

    κ^l_{α,α'}(x, x') = E[y^l_{i,α}(x) y^l_{i,α'}(x')] = σ_b² + (σ_w²/(2k+1)) Σ_{β∈ker} E[φ(y^{l−1}_{1,α+β}(x)) φ(y^{l−1}_{1,α'+β}(x'))],

so that

    κ^l_{α,α'}(x, x') = σ_b² + (σ_w²/(2k+1)) Σ_{β∈ker} F_φ(κ^{l−1}_{α+β,α'+β}(x, x), κ^{l−1}_{α+β,α'+β}(x, x'), κ^{l−1}_{α+β,α'+β}(x', x')).

The convolutional kernel κ^l_{α,α'} has a 'self-averaging' property, i.e. it is an average over the kernels corresponding to different combinations of neurons in the previous layer. However, it is easy to simplify the analysis in this case by studying the average kernel per channel defined by κ̂^l = (1/N²) Σ_{α,α'} κ^l_{α,α'}. Indeed, by summing the terms in the previous equation and using the fact that we use circular padding, we obtain

    κ̂^l(x, x') = σ_b² + σ_w² (1/N²) Σ_{α,α'} F_φ(κ^{l−1}_{α,α'}(x, x), κ^{l−1}_{α,α'}(x, x'), κ^{l−1}_{α,α'}(x', x')).

This expression is similar in nature to that of FFNN. We will use this observation in the proofs.

Note that our analysis only requires the approximation that, in the infinite width limit, for any two inputs x, x', the variables y^l_i(x) and y^l_i(x') are Gaussian with covariance κ^l(x, x') for FFNN, and y^l_{i,α}(x) and y^l_{i,α'}(x') are Gaussian with covariance κ^l_{α,α'}(x, x') for CNN. We do not need the much stronger approximation that the process y^l_i(.) (y^l_{i,α}(.) for CNN) is a Gaussian process.

Residual Neural Networks. The infinite width limit approximation for ResNets yields similar results with an additional residual term. It is straightforward to see that, in the case of a ResNet with FFNN-type layers, we have that

    κ^l(x, x') = κ^{l−1}(x, x') + σ_b² + σ_w² F_φ(κ^{l−1}(x, x), κ^{l−1}(x, x'), κ^{l−1}(x', x')),

whereas for a ResNet with CNN-type layers, we have that

    κ^l_{α,α'}(x, x') = κ^{l−1}_{α,α'}(x, x') + σ_b² + (σ_w²/(2k+1)) Σ_{β∈ker} F_φ(κ^{l−1}_{α+β,α'+β}(x, x), κ^{l−1}_{α+β,α'+β}(x, x'), κ^{l−1}_{α+β,α'+β}(x', x')).

2.2. Gradient Independence

In the mean-field literature on DNNs, an omnipresent approximation is that of gradient independence, which is similar in nature to the practice of feedback alignment (Lillicrap et al., 2016). This approximation states that, for wide neural networks, the weights used in the forward propagation are independent from those used in back-propagation. When used for the computation of the Neural Tangent Kernel, this approximation was proven to give the exact computation for standard architectures such as FFNN, CNN and ResNets (Yang, 2020) (Theorem D.1). This result has been extensively used in the literature as an approximation before being proved to yield exact computations for the NTK, and theoretical results derived under this approximation were verified empirically; see the references below.

Gradient Covariance back-propagation. Analytical formulas for gradient covariance back-propagation were derived using this result, in (Hayou et al., 2019; Schoenholz et al., 2017; Yang and Schoenholz, 2017b; Lee et al., 2018; Poole et al., 2016; Xiao et al., 2018; Yang, 2019). Empirical results showed an excellent match for FFNN in (Schoenholz et al., 2017), for Resnets in (Yang, 2019) and for CNN in (Xiao et al., 2018).

Neural Tangent Kernel. The Gradient Independence approximation was implicitly used in (Jacot et al., 2018) to derive the infinite width Neural Tangent Kernel (See (Jacot et al., 2018), Appendix A.1). Authors have found that this infinite width NTK computed with the Gradient Independence approximation yields excellent match with empirical (exact) NTK. We use this result in our proofs and we refer to it simply by the Gradient Independence.

3. Discussion on Assumption 1

Assumption 1. We assume that for all x, x' ∈ X, q^1_{α,α'}(x, x') is independent of α, α'.

Assumption 1 implies that there exists some function e : (x, x') ↦ e(x, x') such that for all α, α', x, x',

    Σ_{j=1}^{n_0} Σ_{β∈ker_0} x_{j,α+β} x'_{j,α'+β} = e(x, x').

This system has N²M² equations and N × 2n_0 × M variables. Therefore, in the case n_0 ≫ 1, the set of solutions S is large. By using Assumption 1, we restrict our analysis to this case. Hereafter, for all CNN analyses, for some function G and set E, taking the supremum sup_{(x,x')∈E} G(x, x') should be interpreted as sup_{(x,x')∈E∩X²} G(x, x'). Another justification for Assumption 1 can be attributed to a self-averaging property of the dynamics of the correlation inside a CNN. We refer the reader to the proof of Appendix Lemma 3 for more details.

4. Warmup: Results from the Mean-Field theory of DNNs 4.1. Notation

For FFNN layers, let q^l(x) := q^l(x, x) be the variance of y^l_1(x) (the choice of the index 1 is not important since, in the infinite width limit, the random variables (y^l_i(x))_{i∈[1:N_l]} are iid). Let q^l(x, x'), resp. c^l(x, x'), be the covariance, resp. the correlation, between y^l_1(x) and y^l_1(x'). For gradient back-propagation, let q̃^l(x, x') be the gradient covariance defined by q̃^l(x, x') = E[(∂L/∂y^l_1)(x) (∂L/∂y^l_1)(x')], where L is some loss function. Similarly, let q̃^l(x) be the gradient variance at point x. We also define q̇^l(x, x') = σ_w² E[φ'(y^{l−1}_1(x)) φ'(y^{l−1}_1(x'))].

For CNN layers, we use similar notation across channels. Let q^l_α(x) be the variance of y^l_{1,α}(x) (the choice of the index 1 is not important here either since, in the limit of an infinite number of channels, the random variables (y^l_{i,α}(x))_{i∈[1:N_l]} are iid). Let q^l_{α,α'}(x, x') be the covariance between y^l_{1,α}(x) and y^l_{1,α'}(x'), and c^l_{α,α'}(x, x') the corresponding correlation. We also define the pseudo-covariance q̂^l_{α,α'}(x, x') = σ_b² + σ_w² E[φ(y^{l−1}_{1,α}(x)) φ(y^{l−1}_{1,α'}(x'))] and q̇^l_{α,α'}(x, x') = σ_w² E[φ'(y^{l−1}_{1,α}(x)) φ'(y^{l−1}_{1,α'}(x'))]. The gradient covariance is defined by q̃^l_{α,α'}(x, x') = E[(∂L/∂y^l_{1,α})(x) (∂L/∂y^l_{1,α'})(x')].

4.1.1. COVARIANCE PROPAGATION Covariance propagation for FFNN. In Section 2.1, we derived the covariance kernel propagation in an FFNN. For two inputs x, x0 ∈ Rd, we have l 0 2 2 l−1 l−1 0 q (x, x ) = σb + σwE[φ(yi (x))φ(yi (x ))] (12) this can be written as  q  q q  l 0 2 2 l l 0 l−1 l−1 2 iid q (x, x ) = σb + σwE φ q (x)Z1 φ q (x )(c Z1 + 1 − (c ) Z2 ,Z1,Z2 ∼ N (0, 1), with cl−1 := cl−1(x, x0). With ReLU, and since ReLU is positively homogeneous (i.e. φ(λx) = λφ(x) for λ ≥ 0), we have that σ2 q q ql(x, x0) = σ2 + w ql(x) ql(x0)f(cl−1) b 2 where f is the ReLU correlation function given by (Hayou et al., 2019) 1 p 1 f(c) = (c arcsin c + 1 − c2) + c. π 2

Covariance propagation for CNN. The only difference with FFNN is that the independence is across channels and not neurons. Simple calculus yields

2 l 0 l l 0 2 σw X l−1 l−1 0 q 0 (x, x ) = [y (x)y 0 (x )] = σ + [φ(y (x))φ(y (x ))] α,α E i,α i,α b 2k + 1 E 1,α+β 1,α0+β β∈ker Observe that l 0 1 X l 0 q 0 (x, x ) = qˆ 0 (x, x ) (13) α,α 2k + 1 α+β,α +β βinker Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

With ReLU, we have 2 q q l 0 2 σw X l l 0 l−1 0 q 0 (x, x ) = σ + q (x) q (x )f(c (x, x )). α,α b 2k + 1 α+β α0+β α+β,α0+β β∈ker

Covariance propagation for ResNet with ReLU. In the case of ResNet, only an added residual term shows up in the recursive formula. For a ResNet with FFNN layers, the recursion reads

σ2 q q ql(x, x0) = ql−1(x, x0) + σ2 + w ql(x) ql(x0)f(cl−1) (14) b 2 with CNN layers, we have instead 2 q q l 0 l−1 0 2 σw X l l 0 l−1 0 q 0 (x, x ) = q 0 (x, x ) + σ + q (x) q 0 (x )f(c 0 (x, x )) (15) α,α α,α b 2k + 1 α+β α +β α+β,α +β β∈ker

4.1.2. GRADIENT COVARIANCE BACK-PROPAGATION Gradient back-propagation for FFNN. The gradient back-propagation is given by

Nl+1 ∂L X ∂L = φ0(yl) W l+1. ∂yl i l+1 ji i j=1 ∂yj where L is some loss function. Using the Gradient Independence 2.2, we have as in (Schoenholz et al., 2017) N q˜l(x) =q ˜l+1(x) l+1 χ(ql(x)). Nl

l 2 p l 2 where χ(q (x)) = σwE[φ( q (x)Z) ].

Gradient Covariance back-propagation for CNN. We have that

∂L X ∂L = φ(yl−1 ) ∂W l ∂yl j,α+β i,j,β α i,α Moreover, n ∂L X X ∂L = W l+1 φ0(yl ). ∂yl l+1 i,j,β i,α i,α j=1 β∈ker ∂yj,α−β

Using the Gradient Independence 2.2, and taking the average over the number of channels we have that h i " # 2 0 p l 2 " # ∂L 2 σwE φ ( qα(x)Z) X ∂L 2 = . E ∂yl 2k + 1 E l+1 i,α β∈ker ∂yi,α−β

We can get similar recursion to that of the FFNN case by summing over α and using the periodic boundary condition, this yields " # " # X ∂L 2 X ∂L 2 = χ(ql (x)) . E ∂yl α E l+1 α i,α α ∂yi,α

4.1.3. EDGEOF CHAOS (EOC)

Let x ∈ Rd be an input. The convergence of ql(x) as l increases has been studied by (Schoenholz et al., 2017) and (Hayou l et al., 2019). In particular, under weak regularity conditions, it is proven that q (x) converges to a point q(σb, σw) > 0 independent of x as l → ∞. The asymptotic behaviour of the correlations cl(x, x0) between yl(x) and yl(x0) for any two Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

0 l l+1 l inputs x and x is also driven by (σb, σw): the dynamics of c is controlled by a function f i.e. c = f(c ) called the 2 0 p 2 correlation function. The authors define the EOC as the set of parameters (σb, σw) such that σwE[φ ( q(σb, σw)Z) ] = 1 2 0 p 2 where Z ∼ N (0, 1). Similarly the Ordered, resp. Chaotic, phase is defined by σwE[φ ( q(σb, σw)Z) ] < 1, resp. 2 0 p 2 σwE[φ ( q(σb, σw)Z) ] > 1. On the Ordered phase, the gradient will vanish as it backpropagates through the network, and the correlation cl(x, x0) converges exponentially to 1. Hence the output function becomes constant (hence the name ’Ordered phase’). On the Chaotic phase, the gradient explodes and the correlation converges exponentially to some limiting value c < 1 which results in the output function being discontinuous everywhere (hence the ’Chaotic’ phase name). On the EOC, the second moment of the gradient remains constant throughout the backpropagation and the correlation converges to 1 at a sub-exponential rate, which allows deeper information propagation. Hereafter, f will always refer to the correlation function. l l iid 2 We initialize the model with wij, bi ∼ N (0, 1), where N (µ, σ ) denotes the normal distribution of mean µ and variance σ2. In the remainder of this appendix, we assume that the following conditions are satisfied

• The input data is a subset of a compact set E of Rd, and no two inputs are co-linear. • All calculations are done in the limit of infinitely wide networks.

4.2. Some results from the information propagation theory Results for FFNN with Tanh activation. + l −λl Fact 1. For any choice of σ , σ ∈ , there exist q, λ > 0 such that for all l ≥ 1, sup d |q (x, x) − q| ≤ e . b w R x∈R (Equation (3) and conclusion right after in (Schoenholz et al., 2017)). l 0 −γl Fact 2. On the Ordered phase, there exists γ > 0 such that sup 0 d |c (x, x ) − 1| ≤ e . (Equation (8) in (Schoenholz x,x ∈R et al., 2017)) l 0 −1 0 Fact 3. Let (σb, σw) ∈ EOC. Using the same notation as in fact4, we have that sup(x,x )∈B |1 − c (x, x )| = O(l ). (Proposition 3 in (Hayou et al., 2019)). 0 d 1 0 Fact 4. Let B = {(x, x ) ∈ R : c (x, x ) < 1 − }. On the chaotic phase, there exist c < 1 such that for all  ∈ (0, 1), l 0 −γl 0 there exists γ > 0 such that sup(x,x )∈B |c (x, x ) − c| ≤ e . (Equations (8) and (9) in (Schoenholz et al., 2017)) √ 2 2 √ √ 2 σb +σwE[φ( qZ1)φ( q(xZ1+ 1−x Z2))] Fact 5 (Correlation function). The correlation function f is defined by f(x) = q where q is given in Fact1 and Z1,Z2 are iid standard Gaussian variables. Fact 6. f has a derivative of any order j ≥ 1 given by

(j) 2 j−1 (j) (j) p 2 f (x) = σwq E[φ (Z1)φ (xZ1 + 1 − x Z2)], ∀x ∈ [−1, 1]

(j) 2 j−1 (j) 2 As a result, we have that f (1) = σwq E[φ (Z1) ] > 0 for all j ≥ 1.

The proof of the previous fact is straightforward following the same integration by parts technique as in the proof of Lemma 1 in (Hayou et al., 2019). The result follows by induction. 0 Fact 7. Let (σb, σw) ∈ EOC. We have that f (1) = 1 (by definition of EOC). As a result, the Taylor expansion of f near 1 is given by f(c) = c + α(1 − c)2 − ζ(1 − c)3 + O((1 − c)4). where α, ζ > 0.

Proof. The proof is straightforward using fact6, and integral-derivative interchanging.

Results for FFNN with ReLU activation. √ + 2 Fact 8. The ordered phase for ReLU is given by Ord = {(σb, σw) ∈ (R ) : σw < 2}. Moreover, for any (σb, σw) ∈ Ord, 2 l −λl σb there exist λ such that for all l ≥ 1, supx∈ d |q (x, x) − q| ≤ e , where q = 2 . R 1−σw/2 Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

The proof is straightforward using equation (12). l 0 −λl Fact 9. For any (σ , σ ) in the Ordered phase, there exist λ such that for all l ≥ 1, sup 0 d |c (x, x ) − 1| ≤ e . b w (x,x )∈R The proof of this claim follows from standard Banach Fixed point theorem in the same fashion as for Tanh in (Schoenholz et al., 2017). √ + 2 Fact 10. The Chaotic phase for ReLU is given by Ch = {(σb, σw) ∈ (R ) : σw > 2}. Moreover, for any (σb, σw) ∈ Ch, d l 2 l for all l ≥ 1, x ∈ R , q (x, x) & (σw/2) . The variance explodes exponentially on the Chaotic phase, which means the output of the Neural Network can grow arbitrarily in this setting. Hereafter, when no activation function is mentioned, and when we choose "(σb, σw) on the Ordered/Chaotic phase", it should be interpreted as "(σb, σw) on the Ordered phase" for ReLU and "(σb, σw) on the Ordered/Chaotic phase" for Tanh. 2 l σw 2 Fact 11. For ReLU FFNN on the EOC, we have that q (x, x) = d ||x|| for all l ≥ 1. √ The proof is straightforward using equation 12 and that (σb, σw) = (0, 2) on the EOC. √ Fact 12. The EOC of ReLU is given by the singleton {(σb, σw) = (0, 2)}. In this case, the correlation function of an FFNN with ReLU is given by 1 p 1 f(x) = (x arcsin x + 1 − x2) + x π 2 (Proof of Proposition 1 in (Hayou et al., 2019)).

Fact 13. Let (σb, σw) ∈ EOC. Using the same notation as in fact4, we have that

sup |1 − cl(x, x0)| = O(l−2) 0 (x,x )∈B

(Follows straightforwardly from Proposition 1 in (Hayou et al., 2019)). Fact 14. We have that f(c) = c + s(1 − c)3/2 + b(1 − c)5/2 + O((1 − c)7/2 (16) √ √ 2 2 2 with s = 3π and b = 30π . This result was proven in (Hayou et al., 2019) (in the proof of Proposition 1) for order 3/2, the only difference is that here we push the expansion to order 5/2.

Results for CNN with Tanh activation function. + l Fact 15. For any choice of σ , σ ∈ , there exist q, λ > 0 such that for all l ≥ 1, sup 0 sup d |q 0 (x, x) − q| ≤ b w R α,α x∈R α,α e−λl. (Equation (2.5) in (Xiao et al., 2018) and variance convergence result in (Schoenholz et al., 2017)).

l 0 0 The behaviour of the correlation cα,α0 (x, x ) was studied in (Xiao et al., 2018) only in the case x = x. We give a l 0 comprehensive analysis of the asymptotic behaviour of cα,α0 (x, x ) in the next section.

General results on the correlation function. Fact 16. Let f be either the correlation function of Tanh or ReLU. We have that

• f(1) = 1 (Lemma 2 in (Hayou et al., 2019)).

• On the ordered phase 0 < f 0(1) < 1 (By definition).

• On the Chaotic phase f 0(1) > 1 (By definition).

• On the EOC, f 0(1) = 1 (By definition).

• On the Ordered phase and the EOC, 1 is the unique fixed point of f ((Hayou et al., 2019)). Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

• On the Chaotic phase, f has two fixed points, 1 which is unstable, and c < 1 which is a stable fixed point (Schoenholz et al., 2017). Fact 17. Let  ∈ (0, 1). On the Ordered/Chaotic phase, with either ReLU or Tanh, there exists α ∈ (0, 1),γ > 0 such that

sup |f 0(cl(x, x0)) − α| ≤ e−γl 0 (x,x )∈B

Proof. This result follows from a simple first order expansion inequality. For Tanh on the Ordered phase, we have that

0 l 0 0 l 0 sup |f (c (x, x )) − f (1)| ≤ ζl sup |c (x, x ) − 1| 0 0 (x,x )∈B (x,x )∈B

00 00 where ζl = sup l 0 |f (t)| → |f (1)|. We conclude for Ordered phase with Tanh using fact2. The t∈(min(x,x0∈B) c (x,x ),1) same argument can be used for Chaotic phase with Tanh using fact4; in this case, α = f 0(c) where c is the unique stable fixed point of the correlation function f.

σ2 On the Ordered phase with ReLU, let f˜ be the correlation function. It is easy to see that f˜0(c) = w f 0(c) where f is given √ 2 0 2 1/2 3/2 in fact 12. f (x) = 1 − π (1 − x) + O((1 − x) ). Therefore, there exists l0, ζ > 0 such that for l > l0, sup |f 0(cl(x, x0)) − f 0(1)| ≤ ζ sup |cl(x, x0) − 1|1/2 0 0 (x,x )∈B (x,x )∈B We conclude using fact9.

Asymptotic behaviour of the correlation in FFNN. l Appendix Lemma 1 (Asymptotic behaviour of c for ReLU). Let (σb, σw) ∈ EOC and  ∈ (0, 1). We have

l 0 κ √ log(l) −3 sup c (x, x ) − 1 + 2 − 3 κ 3 = O(l ) 0 (x,x )∈B l l

9π2 where κ = 2 . Moreover, we have that 3 9 log(l) 0 l 0 √ −2 sup f (c (x, x )) − 1 + − 2 = O(l ). 0 (x,x )∈B l 2 κ l

√ 0 2 2 l 0 Proof. Let (x, x ) ∈ B and s = . From the preliminary results, we have that lim supx,x0∈ d 1 − c (x, x ) = 0 (fact 3π l→∞ R 13). Using fact 14, we have uniformly over B,

3/2 5/2 7/2 γl+1 = γl − sγl − bγl + O(γl ) where s, b > 0, this yields s 3s2 b γ−1/2 = γ−1/2 + + γ1/2 + γ + O(γ3/2). l+1 l 2 8 l 2 l l Thus, as l goes to infinity s γ−1/2 − γ−1/2 ∼ , l+1 l 2 and by summing and equivalence of positive divergent series s γ−1/2 ∼ l. l 2

−1/2 −1/2 s 3s2 1/2 3/2 Moreover, since γl+1 = γl + 2 + 8 γl + O(γl ), using the same argument multiple times and inverting the formula yields κ √ log(l) cl(x, x0) = 1 − + 3 κ + O(l−3) l2 l3 Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

Note that, by Appendix Lemma5 (section5), the O bound can be chosen in a way that it does not depend on (x, x0), it depends only on ; this concludes the proof for the first part of the result. Using fact 12, we have that 1 1 f 0(x) = arcsin(x) + π √ 2 2 = 1 − (1 − x)1/2 + O((1 − x)3/2). π Thus, it follows that √ 3 9 2 log(l) f 0(cl(x, x0)) = 1 − + + O(l−2). l 4 l2 uniformly over the set B, which concludes the proof.

We prove a similar result for an FFNN with Tanh activation. l Appendix Lemma 2 (Asymptotic behaviour of c for Tanh). Let (σb, σw) ∈ EOC and  ∈ (0, 1). We have

l 0 κ 2 log(l) −3 sup c (x, x ) − 1 + − κ(1 − κ ζ) 3 = O(l ) 0 (x,x )∈B l l

2 f 3(1) where κ = f 00(1) > 0 and ζ = 6 > 0. Moreover, we have that

0 l 0 2 2 log(l) −2 sup f (c (x, x )) − 1 + − 2(1 − κ ζ) 2 = O(l ). 0 (x,x )∈B l l

0 l 0 Proof. Let (x, x ) ∈ B and λl := 1 − c (x, x ). Using a Taylor expansion of f near 1 (fact7), there exist α, ζ > 0 such that 2 3 4 λl+1 = λl − αλl + ζλl + O(λl ) Here also, we use the same technique as in the previous lemma. We have that −1 −1 2 3 −1 −1 2 2 3 λl+1 = λl (1 − αλl + ζλ + O(λl )) = λl (1 + αλl + (α − ζ)λl + O(λl )) −1 2 2 = λl + α + (α − ζ)λl + O(λl ).

−1 By summing (divergent series), we have that λl ∼ αl. Therefore, −1 −1 2 −1 −1 −1 λl+1 − λl − α = (α − β)α l + o(l ) By summing a second time, we obtain −1 −1 λl = αl + (α − βα ) log(l) + o(log(l)), Using the same technique once again, we obtain −1 −1 λl = αl + (α − βα ) log(l) + O(1). This yields log(l) λ = α−1l−1 − α−1(1 − α−2β) + O(l−2). l l2 In a similar fashion to the previous proof, we can force the upper bound in O to be independent of x using Appendix Lemma 5. This way, the bound depends only on . This concludes the first part of the proof.

For the second part, observe that f 0(x) = 1 + (x − 1)f 00(1) + O((x − 1)2), hence 2 log(l) f 0(cl(x, x0)) = 1 − + 2(1 − α−2ζ) + O(l−2) l l2 which concludes the proof. Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

4.3. Large depth behaviour of the correlation in CNNs For CNNs, the infinite width will always mean the limit of infinite number of channels. Recall that, by definition, l 0 2 2 l−1 l−1 0 l 0 l l 0 qˆα,α0 (x, x ) = σb + σwE[φ(y1,α (x))φ(y1,α0 (x ))] and qα,α0 (x, x ) = E[yi,α(x)yi,α0 (x )]. Unlike FFNN, neurons in the same channel are correlated since they share the same filters. Let x, x0 be two inputs and α, α0 two nodes in the same channel i. Using Central Limit Theorem in the limit of large nl (number of channels), we have

2 l 0 l l 0 σw X l−1 l−1 0 2 q 0 (x, x ) = [y (x)y 0 (x )] = [φ(y (x))φ(y (x ))] + σ α,α E i,α i,α 2k + 1 E 1,α+β 1,α0+β b β∈ker

l 0 l Let cα,α0 (x, x ) be the corresponding correlation. Since qα,α(x, x) converges exponentially to q which depends neither on x nor on α, the mean-field correlation as in (Schoenholz et al., 2017; Hayou et al., 2019) is given by

l 0 1 X l−1 0 c 0 (x, x ) = f(c (x, x )) α,α 2k + 1 α+β,α0+β β∈ker

√ 2 √ √ 2 2 σwE[φ( qZ1)φ( q(cZ1+ 1−c Z2))]+σb where f(c) = q and Z1,Z2 are independent standard normal variables. The dynamics l l of cα,α0 become similar to those of c in an FFNN under assumption1. We show this in the proof of Appendix Lemma3. l In (Xiao et al., 2018), authors studied only the limiting behaviour of correlations cα,α0 (x, x) (same input x), however, they l 0 0 do not study cα,α0 (x, x ) when x 6= x . We do this in the following Lemma, which will prove also useful for the main results of the paper. Appendix Lemma 3 (Asymptotic behaviour of the correlation in CNN with Tanh). We consider a CNN with Tanh activation + 2 0 d 1 0 function. Let (σb, σw) ∈ (R ) and  ∈ (0, 1). Let B = {(x, x ) ∈ R : supα,α0 cα,α0 (x, x ) < 1 − }. The following statements hold

1. If (σb, σw) are on the Ordered phase, then there exists β > 0 such that

l 0 −βl sup sup |cα,α0 (x, x ) − 1| = O(e ) 0 d 0 (x,x )∈R α,α

2. If (σb, σw) are on the Chaotic phase, then for all  > 0 there exists β > 0 and c ∈ (0, 1) such that

l 0 −βl sup sup |cα,α0 (x, x ) − c| = O(e ) 0 0 (x,x )∈B α,α

3. Under Assumption1, if (σb, σw) ∈ EOC, then we have

l 0 κ 2 log(l) −3 sup sup cα,α0 (x, x ) − 1 + − κ(1 − κ ζ) 3 = O(l ) 0 0 (x,x )∈B α,α l l

2 f 3(1) where κ = f 00(1) > 0, ζ = 6 > 0, and f is the correlation function given in Fact5.

We prove statements 1 and 2 for general inputs, i.e. without using Assumption1. The third statement requires Assumption1.

Proof. Let (x, x0) ∈ Rd. Without using assumption1, we have that

l 0 1 X l−1 0 c 0 (x, x ) = f(c (x, x )) α,α 2k + 1 α+β,α0+β β∈ker

Writing this in matrix form yields 1 C = Uf(C ) l 2k + 1 l−1 Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

l 0 N 2 where Cl = ((cα,α+β(x, x ))α∈[0:N−1])β∈[0:N−1] is a vector in R , U is a convolution matrix and f is applied element- wise. As an example, for k = 1, U is given by

1 1 0 ... 0 1  ..  1 1 1 0 . 0    ..  0 1 1 1 . 0 U =    .  0 0 1 1 .. 0    . . . .   ......  1 0 ... 0 1 1

For general k, U is a Circulant symmetric matrix with eigenvalues λ1 > λ2 ≥ λ3... ≥ λN 2 . The largest eigenvalue of U is 1 N 2 given by λ1 = 2k + 1 and its equivalent eigenspace is generated by the vector e1 = N (1, 1, ..., 1) ∈ R . This yields

−l l T −βl (1 + 2k) U = e1e1 + O(e ) where β = log( λ1 ). λ2 This provides another justification to Assumption 1; as l grows, and assuming that Cl → e1 (which we show in the remainder 1 of this proof), Cl exhibits a self-averaging property since Cl ≈ 2k+1 UCl−1. This system concentrates around the average value of the entries of Cl as l grows. Since the variances converge to a constant q as l goes to infinity (fact 15), this approximation implies that the entries of Cl become almost equal as l goes to infinity, thus making assumption1 almost satisfied in deep layers. Let us now prove the statements.

0 d l l 0 1. Let (σb, σw) be in the Ordered phase, (x, x ) ∈ R and cm = minα,α0 cα,α0 (x, x ). Using the fact that f is non- l 0 1 P l−1 0 l−1 decreasing, we have that cα,α0 (x, x ) ≥ 2k+1 β∈ker cα+β,α0+β(x, x )) ≥ cm . Taking the minimum again over 0 l l−1 l α, α , we have cm ≥ cm , therefore cm is non-decreasing and converges to the unique fixed point of f which is c = 1. l 0 This proves that supα,α0 |cα,α0 (x, x ) − 1| → 0. Moreover, the convergence rate is exponential using the fact that (fact 16) 0 < f 0(1) < 1. To see this, observe that ! l 0 0 l 0 sup |1 − cα,α0 (x, x )| ≤ sup f (ζ) × sup |1 − cα,α0 (x, x )| α,α0 l−1 α,α0 ζ∈[cm ,1]

0 0 0 Knowing that sup l−1 f (ζ) → f (1) < 1, we conclude. Moreover, the convergence is uniform in (x, x ) since ζ∈[cm ,1] the convergence rate depends only on f 0(1).

2. Let  ∈ (0, 1). In the chaotic phase, the only difference is the limit c = c1 < 1 and the Supremum is taken over B to avoid points where c1(x, x0) = 1. In the Chaotic phase (fact 16), f has two fixed points, 1 is an unstable fixed point and c1 ∈ (0, 1) which is the unique stable fixed point. We conclude by following the same argument.

3. Let  ∈ (0, 1) and (σb, σw) ∈ EOC. Using the same argument of monotony as in the previous cases and that f has 1 as l 0 unique fixed point, we have that liml→∞ supx,x0 supα,α0 |1 − cα,α0 (x, x )| = 0. From fact7, the Taylor expansion of f near 1 is given by

f(c) = c + α(1 − c)2 − ζ(1 − c)3 + O((1 − c)4).

00 (3) f (1) f (1) (k) 2 k−1 (k) √ 2 where α = 2 and ζ = 6 . Using fact6, we know that f (1) = σwq E[φ ( qZ) ]. Therefore, we have α > 0, and ζ < 0. Under assumption1, it is straightforward that for all α, α0, and l ≥ 1

l 0 l 0 cα,α0 (x, x ) = c (x, x )

l 0 l 0 i.e. cα,α0 are equal for all α, α . The dynamics of c (x, x ) are exactly the dynamics of the correlation in an FFNN. We conclude using Appendix Lemma2. Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

It is straightforward that the previous Appendix Lemma extend to ReLU activation, with slightly different dynamics. In this case, we use Appendix Lemma1 to conclude for the third statement. Appendix Lemma 4 (Asymptotic behaviour of the correlation in CNN with ReLU-like activation functions). We consider + 2 + 2 a CNN with ReLU activation. Let (σb, σw) ∈ (R ) . Let (σb, σw) ∈ (R ) and  ∈ (0, 1). The following statements hold

1. If (σb, σw) are on the Ordered phase, then there exists β > 0 such that

l 0 −βl sup sup |cα,α0 (x, x ) − 1| = O(e ) 0 d 0 (x,x )∈R α,α

2. If (σb, σw) are on the Chaotic phase, then there exists β > 0 and c ∈ (0, 1) such that

l 0 −βl sup sup |cα,α0 (x, x ) − c| = O(e ) 0 0 (x,x )∈B α,α

3. Under Assumption1, if (σb, σw) ∈ EOC, then

l 0 κ √ log(l) −3 sup sup c (x, x ) − 1 + 2 − 3 κ 3 = O(l ) 0 0 (x,x )∈B α,α l l

9π2 where κ = 2 .

Proof. The proof is similar to the case of Tanh in Appendix Lemma3. The only difference is that we use Appendix Lemma 1 to conclude for the third statement.

5. A technical tool for the derivation of uniform bounds

Results in Theorem1 and2 and Proposition1 involve a supremum over the set B. To obtain such results, we need a l 0 0 ’uniform’ Taylor analysis of the correlation c (x, x ) (see the next section) where uniformity is over (x, x ) ∈ B. It turns out that such result is trivial when the correlation follows a dynamical system that is controlled by a non-decreasing function. We clarify this in the next lemma. Appendix Lemma 5 (Uniform Bounds). Let A ⊂ R be a compact set and g a non-decreasing function on A. Define the sequence ζl by ζl = g(ζl−1) and ζ0 ∈ A. Assume that there exist αl, βl that do not depend on ζl, with βl = o(αl), such that for all ζ0 ∈ A,

ζl = αl + Oζ0 (βl) where Oζ0 means that the O bound depends on ζ0. Then, we have that

sup |ζl − αl| = O(βl) ζ0∈A i.e. we can choose the bound O to be independent of ζ0.

Proof. Let ζ0,m = min A and ζ0,M = max A. Let (ζm,l) and (ζM,l) be the corresponding sequences. Since g is non- decreasing, we have that for all ζ0 ∈ A, ζm,l ≤ ζl ≤ ζM,l. Moreover, by assumption, there exists M1,M2 > 0 such that |ζm,l − αl| ≤ M1|βl| and |ζM,l − αl| ≤ M2|βl| therefore, |ζl − αl| ≤ max(|ζm,l − αl|, |ζM,l − αl|) ≤ max(M1,M2)|βl| which concludes the proof.

Note that Appendix Lemma5 can be easily extended to Taylor expansions with ‘ o’ instead of ‘O’. We will use this result in the proofs, by refereeing to Appendix Lemma5. Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks 6. Proofs of Section 3: Large Depth Behaviour of Neural Tangent Kernel 6.1. Proofs of the results of Section 3.1 In this section, we provide proofs for the results of Section 3.1 in the paper. Recall that Lemma 1 in the paper is a generalization of Theorem 1 in (Jacot et al., 2018) and is reminded here. The proof is simple and follows similar induction techniques as in (Jacot et al., 2018).

Lemma 1 (Generalization of Th. 1 in (Jacot et al., 2018)). Consider an FFNN of the form (3). Then, as n1, n2, ..., nL−1 → 0 d 0 L 0 L 0 L 0 ∞, we have for all x, x ∈ R , i, i ≤ nL, Kii0 (x, x ) = δii0 K (x, x ), where K (x, x ) is given by the recursive formula

KL(x, x0) =q ˙L(x, x0)KL−1(x, x0) + qL(x, x0),

l 0 2 2 l−1 l−1 0 l 0 2 0 l−1 0 l−1 0 where q (x, x ) = σb + σwE[φ(y1 (x))φ(y1 (x ))] and q˙ (x, x ) = σwE[φ (y1 (x))φ (y1 (x ))].

Proof. The proof for general σw is similar to when σw = 1 ((Jacot et al., 2018)) which is a proof by induction.

For l ≥ 2 and i ∈ [1 : nl]

nl σw X ∂ yl+1(x) = √ wl+1φ0(yl (x))∂ yl (x). θ1:l i n ij j θ1:l j l j=1

Therefore,

2 nl l+1 l+1 0 t σw X l+1 l+1 0 l 0 l 0 l l 0 t (∂θ1:l yi (x))(∂θ1:l yi (x )) = wij wij0 φ (yj(x))φ (yj0 (x ))∂θ1:l yj(x)(∂θ1:l yj0 (x )) nl j,j0

0 0 Using the induction hypothesis, namely that as n0, n1, ..., nl−1 → ∞, for all j, j ≤ nl and all x, x

l l 0 t l 0 0 ∂θ1:l yj(x)(∂θ1:l yj0 (x )) → K (x, x )1j=j we then obtain for all nl, as n0, n1, ..., nl−1 → ∞

2 nl 2 nl σw X l+1 l+1 0 l 0 l 0 l l 0 t σw X l+1 2 0 l 0 l 0 l 0 wij wij0 φ (yj(x))φ (yj0 (x ))∂θ1:l yj(x)(∂θ1:l yj0 (x )) → (wij ) φ (yj(x))φ (yj(x ))K (x, x ) nl nl j,j0 j

and letting nl go to infinity, the law of large numbers, implies that

n σ2 Xl w (wl+1)2φ0(yl (x))φ0(yl (x0))Kl(x, x0) → q˙l+1(x, x0)Kl(x, x0). n ij j j l j

Moreover, we have that

2 l+1 l+1 0 t l+1 l+1 0 t σw X l l 0 2 (∂ l+1 y (x))(∂ l+1 y (x )) + (∂ l+1 y (x))(∂ l+1 y (x )) = φ(y (x))φ(y (x )) + σ w i w i b i b i n j j b l j 2 l l 0 2 l+1 0 → σwE[φ(yi(x))φ(yi(x ))] + σb = q (x, x ). nl→∞ which ends the proof.

We now provide the recursive formula satisfied by the NTK of a CNN, namely Lemma 2 of the paper. Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

Lemma 2 (Infinite width dynamics of the NTK of a CNN). Consider a CNN of the form (4), then we have that for all 0 d 0 0 x, x ∈ R , i, i ≤ n1 and α, α ∈ [0 : M − 1]

2 1 0 σw 0 2 K(i,α),(i0,α0)(x, x ) = δii0 [x, x ]α,α0 + σb n0(2k + 1)

0 0 l 0 For l ≥ 2, as n1, n2, ..., nl−1 → ∞ recursively, we have for all i, i ≤ nl, α, α ∈ [0 : M − 1], K(i,α),(i0,α0)(x, x ) = l 0 l δii0 Kα,α0 (x, x ), where Kα,α0 is given by the recursive formula

l 1 X l−1 K 0 = Ψ α,α 2k + 1 α+β,α0+β β∈kerl

l−1 l l−1 l l l l−1 l−1 0 where Ψα,α0 =q ˙α,α0 Kα,α0 +q ˆα,α0 , and qˆα,α, q˙α,α0 are defined in Lemma1, with y1,α (x), y1,α0 (x ) in place of l−1 l−1 0 y1 (x), y1 (x ).

Proof. Let x, x0 be two inputs. We have that

n0 1 σw X X 1 1 yi,α(x) = √ wi,j,βxj,α+β + σbbi v1 j=1 β∈ker1

nl−1 l σw X X l l−1 l yi,α(x) = √ wi,j,βφ(yj,α+β(x)) + σbbi vl j=1 β∈kerl therefore

  1 1 1 1 1 0 X X X ∂yi,α(x) ∂yi0,α0 (x) ∂yi,α(x) ∂yi0,α0 (x) K 0 0 (x, x ) = + (i,α),(i ,α )  ∂w1 ∂w1  ∂b1 ∂b1 r j β r,j,β r,j,β r r   2 σw X X 2 = δii0  xj,α+βxj,α0+β + σb  n0(2k + 1) j β

Assume the result is true for l − 1, let us prove it for l. Let θ1:l−1 be model weights and bias in the layers 1 to l − 1. Let l l ∂yi,α(x) ∂θ y (x) = . We have that 1:l−1 i,α ∂θ1:l−1

σw X X ∂ yl (x) = wl φ0(yl−1 )∂ yl−1 (x) θ1:l−1 i,α p i,j,β j,α+β θ1:l−1 i,α+β nl−1(2k + 1) j β this yields

l l T ∂θ1:l−1 yi,α(x)∂θ1:l−1 yi0,α0 (x) = 2 σw X X l l 0 l−1 0 l−1 l−1 l−1 T wi,j,βwi0,j0,β0 φ (yj,α+β)φ (yj0,α0+β)∂θ1:l−1 yj,α+β(x)∂θ1:l−1 yj0,α0+β(x) nl−1(2k + 1) j,j0 β,β0 as n1, n2, ..., nl−2 → ∞ and using the induction hypothesis, we have

l l T ∂θ1:l−1 yi,α(x)∂θ1:l−1 yi0,α0 (x) → 2 σw X X l l 0 l−1 0 l−1 l−1 0 wi,j,βwi0,j,β0 φ (yj,α+β)φ (yj,α0+β)K(j,α+β),(j,α0+β)(x, x ) nl−1(2k + 1) j β,β0 Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

l−1 0 l−1 0 note that K(j,α+β),(j,α0+β)(x, x ) = K(1,α+β),(1,α0+β)(x, x ) for all j since the variables are iid across the channel index j.

Now letting nl−1 → ∞, we have that

l l T ∂θ1:l−1 yi,α(x)∂θ1:l−1 yi0,α0 (x) → 1 X l l−1 0  δ 0 q˙ 0 K (x, x ) ii (2k + 1) α+β,α +β (1,α+β),(1,α0+β) β,β0

We conclude using the fact that

2 l l T σw X l−1 l−1 0 2 ∂ y (x)∂ y 0 0 (x) → δ 0 ( [φ(y (x))φ(y (x ))] + σ ) θl i,α θl i ,α ii 2k + 1 E α+β α0+β b β

To alleviate notations, we use hereafter the notation KL for both the NTK of FFNN and CNN. For FFNN, it represents L L 0 the recursive kernel K given by lemma1, whereas for CNN, it represents the recursive kernel Kα,α0 for any α, α , which means all results that follow are true for any α, α0. The following proposition establishes that any initialization on the Ordered or Chaotic phase, leads to a trivial limiting NTK as the number of layers L becomes large.

Proposition 1 (Limiting Neural Tangent Kernel with Ordered/Chaotic Initialization). Let (σb, σw) be either in the ordered or in the chaotic phase. Then, there exist λ > 0 such that for all  ∈ (0, 1), there exists γ > 0 such that

sup |KL(x, x0) − λ| ≤ e−γL. 0 (x,x )∈B

We will use the next lemma in the proof of proposition1. −βl Appendix Lemma 6. Let (al) be a sequence of non-negative real numbers such that ∀l ≥ 0, al+1 ≤ αal + ke , where −γl α ∈ (0, 1) and k, β > 0. Then there exists γ > 0 such that ∀l ≥ 0, al ≤ e .

Proof. Using the inequality on al, we can easily see that

l−1 l X j −β(l−j) al ≤ a0α + k α e j=0 l l ≤ a αl + k e−βl/2 + k αl/2 0 2 2 where we divided the sum into two parts separated by index l/2 and upper-bounded each part. The existence of γ is straightforward.

Now we prove Proposition1

Proof. We prove the result for FFNN first. Let x, x0 be two inputs. From lemma1, we have that

Kl(x, x0) = Kl−1(x, x0)q ˙l(x, x0) + ql(x, x0)

2 1 0 2 σw T 0 l 0 2 2 0 l 0 where q (x, x ) = σb + d x x and q (x, x ) = σb + σwEf∼N (0,ql−1)[φ(f(x))φ(f(x ))] and q˙ (x, x ) = 2 0 0 0 σwEf∼N (0,ql−1)[φ (f(x))φ (f(x ))]. From facts1,2,4,9, 17, in the ordered/chaotic phase, there exist k, β, η, l0 > 0 and α ∈ (0, 1) such that for all l ≥ l0 we have

sup |ql(x, x0) − k| ≤ e−βl 0 (x,x )∈B Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks and sup |q˙l(x, x0) − α| ≤ e−ηl. 0 (x,x )∈B 0 d Therefore, there exists M > 0 such that for any l ≥ l0 and x, x ∈ R Kl(x, x0) ≤ M.

l 0 k r = sup 0 |K (x, x ) − | Letting l (x,x )∈B 1−α , we have

−ηl −βl rl ≤ αrl−1 + Me + e

We conclude using Appendix Lemma6.

Under Assumption1, the proof is similar for CNN, using Appendix Lemmas3 and4.

Now, we show that the Initialization on the EOC improves the convergence rate of the NTK wrt L. We first prove two preliminary lemmas that will be useful for the proof of the next proposition. Hereafter, the notation g(x) = Θ(m(x)) means there exist two constants A, B > 0 such that Am(x) ≤ g(x) ≤ Bm(x).

Appendix Lemma 7. Let A, B, Λ ⊂ R+ be three compact sets, and (al), (bl), (λl) be three sequences of non-negative real numbers such that for all (a0, b0, λ0) ∈ A × B × Λ α a = a λ + b , λ = 1 − + O(l−1−β), b = q(b ) + o(l−1), l l−1 l l l l l 0 ∗ where α ∈ N independent of a0, b0, λ0, q(b0) ≥ 0 is a limit that depends on b0, and β ∈ (0, 1). Assume the ‘O’ and ‘o’ depend only on A, B, Λ ⊂ R. Then, we have

al q −β sup − = O(l ). (a0,b0,λ0)∈A×B×Λ l 1 + α

Proof. Let A, B, Λ ⊂ R be three compact sets and (a0, b0, λ0) ∈ A × B × Λ. It is easy to see that there exists a constant al G > 0 independent of a0, b0, λ0 such that |al| ≤ G × l + |a0| for all l ≥ 0. Letting rl = l , we have that for l ≥ 2 1 α q r = r (1 − )(1 − + O(l−1−β)) + + o(l−2) l l−1 l l l 1 + α q = r (1 − ) + + O(l−1−β). l−1 l l q where O bound depends only on A, B, Λ. Letting xl = rl − 1+α , there exists M > 0 that depends only on A, B, Λ, and l0 > 0 that depends only on α such that for all l ≥ l0 1 + α M 1 + α M x (1 − ) − ≤ x ≤ x (1 − ) + . l−1 l l1+β l l−1 l l1+β Let us deal with the right hand inequality first. By induction, we have that

l l l Y 1 + α X Y 1 + α 1 x ≤ x (1 − ) + M (1 − ) . l l0−1 k j k1+β k=l0 k=l0 j=k+1

By taking the logarithm of the first term in the right hand side and using the fact that Pl 1 = log(l) + O(1), we have k=l0 k

l Y 1 + α (1 − ) = Θ(l−1−α). k k=l0 Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks where the bound Θ does not depend on l0. For the second part, observe that l Y 1 + α (l − α − 1)! k! (1 − ) = j l! (k − α − 1)! j=k+1 and k! 1 ∼ kα−β. (k − α − 1)! k1+β k→∞ Since α ≥ 1 (α ∈ N∗), then the serie with term kα−β is divergent and we have that l l X k! 1 X ∼ kα−β (k − α − 1)! k2 k=l0 k=1 Z l ∼ tα−βdt 1 1 ∼ lα−β+1. α − β + 1 Therefore, it follows that l l l X Y 1 + α 1 (l − α − 1)! X k! 1 (1 − ) = j k1+β l! (k − α − 1)! k1+β k=l0 j=k+1 k=l0 1 ∼ l−β. α This proves that M x ≤ l−β + o(l−β). l α where the ‘o’ bound depends only on A, B, Λ. Using the same approach for the left-hand inequality, we prove that M x ≥ − l−β + o(l−β). l α This concludes the proof.

The next lemma is a different version of the previous lemma which will be useful for other applications. Appendix Lemma 8. Let A, B, Λ ⊂ R+ be three compact sets, and (al), (bl), (λl) be three sequences of non-negative real numbers such that for all (a0, b0, λ0) ∈ A × B × Λ −1 al = al−1λl + bl, bl = q(b0) + O(l ), α log(l) λ = 1 − + κ + O(l−2), l l l2 ∗ + where α ∈ N , κ 6= 0 both do not depend on a0, b0, Λ0, q(bo) ∈ R is a limit that depends on b0. Assume the ‘O’ and ‘o’ depend only on A, B, Λ ⊂ R. Then, we have

al q −1 sup − = Θ(log(l)l ) (a0,b0,λ0)∈A×B×Λ l 1 + α

Proof. Let A, B, Λ ⊂ R be three compact sets and (a0, b0, λ0) ∈ A × B × Λ. Similar to the proof of Appendix Lemma 7, there exists a constant G > 0 independent of a0, b0, λ0 such that |al| ≤ G × l + |a0| for all l ≥ 0, therefore (al/l) is al bounded. Let rl = l . We have 1 α log(l) q r = r (1 − )(1 − + κ + O(l−1−β)) + + O(l−2) l l−1 l l l2 l 1 + α log(l) q = r (1 − ) + r κ + + O(l−2). l−1 l l−1 l2 l Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

q −3/2 Let xl = rl − 1+α . It is clear that λl = 1 − α/l + O(l ). Therefore, using appendix lemma7 with β = 1/2, we q have rl → 1+α uniformly over a0, b0, λ0. Thus, assuming κ > 0 (for κ < 0, the analysis is the same), there exists κ1, κ2, M, l0 > 0 that depend only on A, B, Λ such that for all l ≥ l0

1 + α log(l) M 1 + α log(l) M x (1 − ) + κ − ≤ x ≤ x (1 − ) + κ + . l−1 l 1 l2 l2 l l−1 l 2 l2 l2 It follows that l l l Y 1 + α X Y 1 + α κ2 log(k) + M x ≤ x (1 − ) + (1 − ) l l0 k j k2 k=l0 k=l0 j=k+1 and l l l Y 1 + α X Y 1 + α κ1 log(k) − M x ≥ x (1 − ) + (1 − ) . l l0 k j k2 k=l0 k=l0 j=k+1

Recall that we have l Y 1 + α (1 − ) = Θ(l−1−α) k k=l0 and

l Y 1 + α (l − α − 1)! k! (1 − ) = j l! (k − α − 1)! j=k+1 so that k! κ log(k) − M 1 ∼ log(k)kα−1. (k − α − 1)! k2 k→∞ Therefore, we obtain

l l X k! κ1 log(k) − M X ∼ log(k)kα−1 (k − α − 1)! k2 k=l0 k=1 Z l ∼ log(t)tα−1dt 1 α ∼ C1l log(l), where C1 > 0 is a constant. Similarly, there exists a constant C2 > 0 such that

l X k! κ2 log(k) + M ∼ C lα log(l). (k − α − 1)! k2 2 k=1

(l−α−1)! −1−α Moreover, having that l! ∼ l yields

0 −1 −1 xl ≤ C l log(l) + o(l log(l)) where C0 and ‘o’ depend only on A, B, Λ. Using the same analysis, we get

00 −1 −1 xl ≥ C l log(l) + o(l log(l)) where C00 and ‘o’ depend only on A, B, Λ, which concludes the proof. Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

L L Theorem 1 (Neural Tangent Kernel on the Edge of Chaos). Let φ be ReLU or Tanh, (σb, σw) ∈ EOC and K˜ = K /L. We have that sup |K˜ L(x, x) − K˜ ∞(x, x)| = O(L−1) x∈E Moreover, there exists a constant λ ∈ (0, 1) such that for all  ∈ (0, 1)

L 0 ∞ 0 −1 sup K˜ (x, x ) − K˜ (x, x ) = Θ(log(L)L ). 0 (x,x )∈B where σ2 kxk kx0k ˜ ∞ 0 w 1 • if φ is ReLU-like, then K (x, x ) = d (1 − (1 − λ) x6=x0 ). ∞ 0 • if φ is Tanh, then K˜ (x, x ) = q(1 − (1 − λ)1x6=x0 ) where q > 0 is a constant.

Proof. We start by proving the results for FFNN, then we generalize them to the case of CNN.

l 0 0 d l 0 q (x,x ) Case 1: FFNN. Let  ∈ (0, 1), (σb, σw) ∈ EOC, x, x ∈ and recall c (x, x ) = √ . Let γl := R ql(x,x)ql(x0,x0) 1 − cl(x, x0) and f be the correlation function defined by the recursive equation cl+1 = f(cl). By definition, we have that q˙l(x, x) = f 0(cl−1(x, x0)). We first prove the result for ReLU, then we extend it to Tanh.

• φ =ReLU: From fact 11, we know that, on the EOC for ReLU, the variance ql(x, x) is constant wrt l and given by 2 l 1 σw 2 l q (x, x) = q (x, x) = d ||x|| , and from fact 16 that q˙ (x, x) = 1. Therefore σ2 σ2 Kl(x, x) = Kl−1(x, x) + w ||x||2 = l w ||x||2 = lK˜ ∞(x, x) d d which concludes the proof for KL(x, x). Note that the results is ’exact’ for ReLU, which means the upper bound O(L−1) is valid but not optimal in this case. However, we will see that this bound is optimal for Tanh.

From Appendix Lemma1, we have that

l 0 κ √ log(l) −3 sup c (x, x ) − 1 + 2 − 3 κ 3 = O(l ) 0 (x,x )∈B l l

9π2 where κ = 2 . Moreover, we have that 3 9 log(l) 0 l 0 √ −2 sup f (c (x, x )) − 1 + − 2 = O(l ). 0 (x,x )∈B l 2 κ l

l+1 0 l+1 0 0 l 0 Using Appendix Lemma8 with al = K (x, x ), bl = q (x, x ), λl = f (c (x, x )), we conclude that

l+1 0 2 K (x, x ) 1 σw 0 −1 sup − kxkkx k = Θ(log(l)l ) 0 (x,x )∈B l 4 d

Using the compactness of B, we conclude that

l 0 2 K (x, x ) 1 σw 0 −1 sup − kxkkx k = Θ(log(l)l ) 0 (x,x )∈B l 4 d

• φ = T anh: The case of Tanh is similar to that of ReLU with small differences in technical lemmas used to conclude. From Appendix Lemma2, we have that

l 0 κ 2 log(l) −3 sup c (x, x ) − 1 + − κ(1 − κ ζ) 3 = O(l ) 0 (x,x )∈B l l Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

2 f 3(1) where κ = f 00(1) > 0 and ζ = 6 > 0. Moreover, we have that

0 l 0 2 2 log(l) −2 sup f (c (x, x )) − 1 + − 2(1 − κ ζ) 2 = O(l ). 0 (x,x )∈B l l

We conclude in the same way as in the case of ReLU using Appendix Lemma8. The only difference is that, in this l+1 0 0 case, the limit of the sequence bl = q (x, x ) is the limiting variance q (from facts3,1) does not depend on (x, x ).

Case 2: CNN. Under Assumption1, the NTK of a CNN is the same as that of an FFNN. Therefore, the results on the l 0 NTK of FFNN are all valid to the NTK of CNN Kα,α0 for any α, α .

6.2. Proofs of the results of Section 3.2 on ResNets In this section, we provide proofs for lemmas3 and4 together with Theorem3 and proposition2 on ResNets. Lemma 3 in the paper gives the recursive formula for the mean-field NTK of a ResNet with Fully Connected blocks. Lemma 3 (NTK of a ResNet with Fully Connected layers in the infinite width limit). Let x, x0 be two inputs and Kres,1 be the exact NTK for the Residual Network with 1 layer. Then, we have

• For the first layer (without residual connections), we have for all x, x0 ∈ Rd  2  res,1 0 2 σw 0 K 0 (x, x ) = δ 0 σ + x · x , ii ii b d

where x · x0 is the inner product in Rd. 0 res,l 0 l 0 l 0 • For l ≥ 2, as n1, n2, ..., nL−1 → ∞, we have for all i, i ∈ [1 : nl], Kii0 (x, x ) = δii0 Kres(x, x ), where Kres(x, x ) 0 d is given by the recursive formula have for all x, x ∈ R and l ≥ 2, as n1, n2, ..., nl → ∞ recursively, we have l 0 l−1 0 l 0 l 0 Kres(x, x ) = Kres (x, x )(q ˙ (x, x ) + 1) +q ˆ (x, x ).

Proof. The first result is the same as in the FFNN case since we assume there is no residual connections between the first layer and the input. We prove the second result by induction.

• Let x, x0 ∈ Rd. We have X ∂y1(x) ∂y1(x) ∂y1(x) ∂y1(x) σ2 K1 (x, x0) = 1 1 + 1 1 = w x · x0 + σ2. res ∂w1 ∂w1 ∂b1 ∂b1 d b j 1j 1j 1 1

• The proof is similar to the FeedForward network NTK. For l ≥ 2 and i ∈ [1 : nl]

nl σw X ∂ yl+1(x) = ∂ yl(x) + √ wl+1φ0(yl (x))∂ yl (x). θ1:l i θ1:l i n ij j θ1:l j l j=1 Therefore, we obtain

l+1 l+1 0 t l l 0 t (∂θ1:l yi (x))(∂θ1:l yi (x )) = (∂θ1:l yi(x))(∂θ1:l yi(x )) 2 nl σw X l+1 l+1 0 l 0 l 0 l l 0 t + wij wij0 φ (yj(x))φ (yj0 (x ))∂θ1:l yj(x)(∂θ1:l yj0 (x )) + I nl j,j0 where nl σw X I = √ wl+1(φ0(yl (x))∂ yl(x)(∂ yl (x0))t + φ0(yl (x0))∂ yl (x)(∂ yl(x0))t). n ij j θ1:l i θ1:l j j θ1:l j θ1:l i l j=1 Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

Using the induction hypothesis, as n0, n1, ..., nl−1 → ∞, we have that

2 nl l+1 l+1 0 t σw X l+1 l+1 0 l 0 l 0 l l 0 t (∂θ1:l yi (x))(∂θ1:l yi (x )) + wij wij0 φ (yj(x))φ (yj0 (x ))∂θ1:l yj(x)(∂θ1:l yj0 (x )) + I nl j,j0 n σ2 Xl → Kl (x, x0) + w (wl+1)2φ0(yl (x))φ0(yl (x0))Kl (x, x0) + I0, res n ij j j res l j

σ2 where I0 = w wl+1(φ0(yl(x)) + φ0(yl(x0)))Kl (x, x0). nl ii i i res

0 As nl → ∞, we have that I → 0. Using the law of large numbers, as nl → ∞ n σ2 Xl w (wl+1)2φ0(yl (x))φ0(yl (x0))Kl (x, x0) → q˙l+1(x, x0)Kl (x, x0). n ij j j res res l j Moreover, we have that

2 l+1 l+1 0 t l+1 l+1 0 t σw X l l 0 2 (∂ l+1 y (x))(∂ l+1 y (x )) + (∂ l+1 y (x))(∂ l+1 y (x )) = φ(y (x))φ(y (x )) + σ w i w i b i b i n j j b l j 2 l l 0 2 l+1 0 → σwE[φ(yi(x))φ(yi(x ))] + σb = q (x, x ). nl→∞

Now we proof the recursive formula for ResNets with Convolutional layers. Lemma 4 (NTK of a ResNet with Convolutional layers in the infinite width limit). Let Kres,1 be the exact NTK for the ResNet with 1 layer. Then • For the first layer (without residual connections), we have for all x, x0 ∈ Rd 2 res,1 0  σw 0 2 K(i,α),(i0,α0)(x, x ) = δii0 [x, x ]α,α0 + σb n0(2k + 1)

0 0 res,l 0 • For l ≥ 2, as n1, n2, ..., nl−1 → ∞ recursively, we have for all i, i ∈ [1 : nl], α, α ∈ [0 : M − 1], K(i,α),(i0,α0)(x, x ) = res,l 0 res,l 0 d δii0 Kα,α0 (x, x ), where Kα,α0 is given by the recursive formula for all x, x ∈ R , using the same notations as in lemma2,

res,l res,l−1 1 X l−1 K 0 = K 0 + Ψ . α,α α,α 2k + 1 α+β,α0+β β

l l res,l l where Ψα,α0 =q ˙α,α0 Kα,α0 +q ˆα,α0 .

Proof. Let x, x0 be two inputs. We have that

1 1 1 1 1 0 X X ∂yi,α(x) ∂yi0,α0 (x) ∂yi,α(x) ∂yi0,α0 (x) K(i,α),(i0,α0)(x, x ) = ( 1 1 + 1 1 ) ∂w ∂w 0 ∂b ∂b j β i,j,β i ,j,β j j 2 σw X X 2 = δii0 xj,α+βxj,α0+β + σb . n0(2k + 1) j β

Assume the result is true for l − 1, let us prove it for l. Let θ1:l−1 be model weights and bias in the layers 1 to l − 1. Let l l ∂yi,α(x) ∂θ y (x) = . We have that 1:l−1 i,α ∂θ1:l−1

σw X X ∂ yl (x) = ∂ yl−1(x) + wl φ0(yl−1 )∂ yl−1 (x) θ1:l−1 i,α θ1:l−1 i,α p i,j,β j,α+β θ1:l−1 i,α+β nl−1(2k + 1) j β Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks this yields

l l T l−1 l−1 T ∂θ1:l−1 yi,α(x)∂θ1:l−1 yi0,α0 (x) = ∂θ1:l−1 yi,α (x)∂θ1:l−1 yi0,α0 (x) + 2 σw X X l l 0 l−1 0 l−1 l−1 l−1 T wi,j,βwi0,j0,β0 φ (yj,α+β)φ (yj0,α0+β)∂θ1:l−1 yj,α+β(x)∂θ1:l−1 yj0,α0+β(x) + I, nl−1(2k + 1) j,j0 β,β0 where

σw X I = wl φ0(yl−1 )(∂ yl−1(x)∂ yl−1 (x)T + ∂ yl−1 (x)∂ yl−1(x)T ). p i,j,β j,α+β θ1:l−1 i,α θ1:l−1 i,α+β θ1:l−1 i,α+β θ1:l−1 i,α nl−1(2k + 1) j,β

As n1, n2, ..., nl−2 → ∞ and using the induction hypothesis, we have

l l T l−1 0 0 ∂θ1:l−1 yi,α(x)∂θ1:l−1 yi0,α0 (x) → δii Kα,α0 (x, x )+ 2 σw X X l l 0 l−1 0 l−1 l−1 0 wi,j,βwi0,j,β0 φ (yj,α+β)φ (yj,α0+β)K(j,α+β),(j,α0+β)(x, x ). nl−1(2k + 1) j β,β0

l−1 0 l−1 0 Note that K(j,α+β),(j,α0+β)(x, x ) = K(1,α+β),(1,α0+β)(x, x ) for all j since the variables are iid across the channel index j. Now letting nl−1 → ∞, we have that

l l T ∂θ1:l−1 yi,α(x)∂θ1:l−1 yi0,α0 (x) → l−1 0 1 X 0 l−1 0 l−1 0  δ 0 K 0 (x, x ) + δ 0 f (c (x, x ))K (x, x ) , ii α,α ii (2k + 1) α+β,α0+β (1,α+β),(1,α0+β) β,β0

0 l−1 0 2 0 l−1 0 l−1 where f (cα+β,α0+β(x, x )) = σwE[φ (yj,α+β)φ (yj,α0+β)]. We conclude using the fact that

2 l l T σw X l−1 l−1 0 2 ∂ y (x)∂ y 0 0 (x) → δ 0 ( [φ(y (x))φ(y (x ))] + σ ). θl i,α θl i ,α ii 2k + 1 E α+β α0+β b β

Before moving to the main theorem on ResNets, We first prove a Lemma on the asymptotic behaviour of cl for ResNet.

l Appendix Lemma 9 (Asymptotic expansion of c for ResNet). Let  ∈ (0, 1) and σw > 0. We have for FFNN

l 0 κσw √ log(l) −3 sup c (x, x ) − 1 + 2 − 3 κσw 3 = O(l ) 0 (x,x )∈B l l

9π2 2 2 where κ = (1 + 2 ) . Moreover, we have that 2 σw

2 √ 3(1 + σ2 ) 3 2 log(l) 0 l 0 w −2 sup f (c (x, x )) − 1 + − 2 = O(l ). 0 l 2π l (x,x )∈B where f is the ReLU correlation function given in fact 12.

0 0 Moreover, this result holds also for CNNs where the supremum should be replaced by sup(x,x )∈B supα,α .

Proof. We first prove the result for ResNet with fully connected layers, then we generalize it to convolutional layers. Let  ∈ (0, 1).

• Let x 6= x0 ∈ Rd, and cl := cl(x, x0). It is straightforward that the variance terms follow the recursive form

l l−1 2 l−1 2 l−1 1 q (x, x) = q (x, x) + σw/2q (x, x) = (1 + σw/2) q (x, x) Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

Leveraging this observation, we have that 1 α cl+1 = cl + f(cl), 1 + α 1 + α

2 σw where f is the ReLU correlation function given in fact 12 and α = 2 . Recall that 1 1 p 1 f(c) = c arcsin(c) + 1 − c2 + c. π π 2

l As in the proof of Appendix Lemma1, let γl = 1 − c , therefore, using Taylor expansion of f near 1 given in fact 14 yields αs αb γ = γ − γ3/2 − γ5/2 + O(γ7/5). l+1 l 1 + α l 1 + α l l

0 αs 0 αb This form is exactly the same as in the proof of Appendix Lemma1 with s = 1+α and b = 1+α . Thus, following the same analysis we conclude. For the second result, observe that the derivation is the same as in Appendix Lemma1.

• Under Assumption1, results of FFNN hold for CNN.

The next theorem shows that no matter what the choice of σw > 0, the normalized NTK of a ResNet will always have a ¯ ∞ subexponential convergence rate to a limiting Kres. Theorem 2 (NTK for ResNet). Consider a ResNet satisfying

yl(x) = yl−1(x) + F(wl, yl−1(x)), l ≥ 2, (17)

L where F is either a convolutional or dense layer (equations (3) and (4)) with ReLU activation. Let Kres be the corresponding 2 ¯ L L σw L−1 NTK and Kres = Kres/αL (Normalized NTK) with αL = L(1+ 2 ) . If the layers are convolutional assume Assumption 1 holds. Then, we have ¯ L ¯ ∞ −1 sup |Kres(x, x) − Kres(x, x)| = Θ(L ) x∈E Moreover, there exists a constant λ ∈ (0, 1) such that for all  ∈ (0, 1)

¯ L 0 ¯ ∞ 0 −1 sup Kres(x, x ) − Kres(x, x ) = Θ(L log(L)), 0 x,x ∈B

σ2 kxk kx0k ¯ ∞ 0 w 1 where Kres(x, x ) = d (1 − (1 − λ) x6=x0 ).

Proof. Case 1: ResNet with Fully Connected layers. 0 d L L 0 Let  ∈ (0, 1) and x 6= x ∈ R . We first prove the result for the diagonal term Kres(x, x) then Kres(x, x ).

2 2 l σw σw • Diagonal terms: using properties on the correlation function f (fact 12), we have that q˙ (x, x) = 2 f(1) = 2 . Moreover, it is easy to see that the variance terms for a ResNet follow the recursive formula ql(x, x) = ql−1(x, x) + 2 l−1 σw/2 × q (x, x), hence

σ2 ql(x, x) = (1 + σ2 /2)l−1 w kxk2 (18) w d Recall that the recursive formula of NTK of a ResNet with FFNN layers is given by (Appendix Lemma3)

l 0 l−1 0 l 0 l 0 Kres(x, x ) = Kres (x, x )(q ˙ (x, x ) + 1) + q (x, x ) Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

Hence, for the diagonal terms we obtain

σ2 Kl (x, x) = Kl−1(x, x)( w + 1) + ql(x, x) res res 2

2 ˆ l l σw l−1 Letting Kres = Kres/(1 + l ) yields σ2 Kˆ l (x, x) = Kˆ l−1(x, x) + w kxk2 res res d

ˆ 1 2 ¯ l Kres(x,x) σw 2 Therefore, Kres(x, x) = l + (1 − 1/l) d kxk , the conclusion is straightforward since E is compact and 0 1 (K )res(x, x) is continuous (hence, bounded on E). • The argument is similar to that of Theorem1 with few differences. From appendix lemma9 we have that

l 0 κσw √ log(l) −3 sup c (x, x ) − 1 + 2 − 3 κσw 3 = O(l ) 0 (x,x )∈B l l

9π2 2 2 where κ = (1 + 2 ) . Moreover, we have that 2 σw

2 √ 3(1 + σ2 ) 3 2 log(l) 0 l 0 w −2 sup f (c (x, x )) − 1 + − 2 = O(l ). 0 l 2π l (x,x )∈B

2 σw l+1 0 0 l 0 Let α = 2 . We also have q˙ (x, x ) = αf (c (x, x )) where f is the ReLU correlation function given in fact 12. It 0 follows that for all (x, x ) ∈ B

log(l) 1 +q ˙l+1(x, x0) = (1 + α)(1 − 3l−1 + ζ + O(l−3)) l2 for some constant ζ 6= 0 that does not depend on x, x0. The bound O does not depend on x, x0 either. Now let l+1 0 Kres (x,x ) al = (1+α)l . Using the recursive formula of the NTK, we obtain

al = λlal−1 + bl

2 −1 log(l) −3 σw p 0 l 0 0 −2 0 where λl = 1 − 3l + ζ l2 + O(l ), bl = d kxkkx kf(c (x, x )) = q(x, x ) + O(l ) with q(x, x ) = 2 σw p 0 l 0 −2 d kxkkx k and where we used the fact that c (x, x ) = 1 + O(l ) (Appendix Lemma1) and the formula for ResNet variance terms given by equation (18). Observe that all bounds O are independent from the inputs (x, x0). Therefore, using Appendix Lemma8, we have

L+1 0 L ¯ ∞ 0 −1 sup Kres (x, x )/L(1 + α) − Kres(x, x ) = Θ(L log(L)), 0 x,x ∈B

which can also be written as

L 0 L−1 ¯ ∞ 0 −1 sup Kres(x, x )/(L − 1)(1 + α) − Kres(x, x ) = Θ(L log(L)), 0 x,x ∈B

L 0 L−1 L 0 L−1 −1 We conclude by observing that Kres(x, x )/(L − 1)(1 + α) = Kres(x, x )/L(1 + α) + O(L ) where O can be chosen to depend only on .

Case 2: ResNet with Convolutional layers. Under Assumption1, the dynamics of the correlation and NTK are exactly the same for FFNN, hence all results on FFNN apply to CNN.

Now let us prove the Scaled Resnet result. Before that, we prove the following Lemma Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

Appendix Lemma 10. Consider a Residual Neural Network with the following forward propagation equations 1 yl(x) = yl−1(x) + √ F(wl, yl−1(x)), l ≥ 2. (19) l where F is either a convolutional or dense layer (equations3 and4) with ReLU activation. Then there exists ζ, ∇ > 0 such that for all  ∈ (0, 1)

l 0 ζ ∇ 1 sup 1 − c (x, x ) − 2 + 3 = o( 3 ) 0 (x,x )∈B log(l) log(l) log(l) where the bound ‘o’ depends only on . For CNN, under Assumption1, the result holds and the supremum is taken also over α, α0, i.e.

l 0 ζ ∇ 1 sup sup 1 − cα,α0 (x, x ) − 2 + 3 = o( 3 ) 0 0 (x,x )∈B α,α log(l) log(l) log(l)

0 l l 0 Proof. We first start with the dense layer case. Let  ∈ (0, 1) and (x, x ) ∈ B be two inputs and denote by c := c (x, x ). Following the same machinery as in the proof of Appendix Lemma9, we have that 1 α cl = cl−1 + l f(cl−1) 1 + αl 1 + αl

2 σw 0 l l−1 l where αl = 2l . Using fact 12, it is straightforward that f ≥ 0, hence f is non-decreasing. Therefore, c ≥ c and c converges to a fixed point c. Let us prove that c = 1. By contradiction, suppose c < 1 so that f(c) − c > 0 (f has a unique fixed point which is 1). This yields f(c) − c cl − c cl − c = cl−1 − c + + O( ) + O(l−2) l l by summing, this leads to cl − c ∼ (f(c) − c) log(l) which is absurd since f(c) 6= c ( f has only 1 as a fixed point). We conclude that c = 1. Using the non-decreasing nature of f, it is easy to conclude that the convergence is uniform over B.

Now let us find the asymptotic expansion of 1 − cl. Recall the Taylor expansion of f near 1 given in fact 14 f(c) = c + s(1 − c)3/2 + b(1 − c)5/2 + O((1 − c)7/2) (20) x→1− √ √ 2 2 2 l where s = 3π and b = 30π . Letting γl = 1 − c , we obtain 3/2 5/2 7/5 γl = γl−1 − sδlγl−1 − bδlγl−1 + O(δlγl−1). which yields s 3 b γ−1/2 = γ−1/2 + δ + s2δ2γ1/2 + δ γ + O(δ γ3/2). (21) l l−1 2 l 8 l l−1 2 l l−1 l l−1 therefore, we have that

sσ2 γ−1/2 ∼ w log(l) l 4 l ζ 2 4 and 1 − c ∼ log(l)2 where ζ = 16/s σw. we can further expand the asymptotic approximation to have ζ ∇ 1 1 − cl = − + o( ) log(l)2 log(l)3 log(l)3 0 where ∇ > 0. the ‘o’ holds uniformly for (x, x ) ∈ B as in the proof of Appendix Lemma1.

This result holds for a ResNet with CNN layers under Assumption1 since the dynamics are the same in this case. Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

Proposition 2 (Scaled Resnet). Consider a Residual Neural Network with the following forward propagation equations 1 yl(x) = yl−1(x) + √ F(wl, yl−1(x)), l ≥ 2. (22) l where F is either a convolutional or dense layer (equations3 and4) with ReLU activation. Then the scaling factor αL in 1+σ2 /2 −1 Theorem2 becomes αL = L w and the convergence rate is Θ(log(L) ).

Proof. We use the same techniques as in the non scaled case. Let us prove the result for fully connected layers, the proof for 0 convolutional layers follows the same analysis. Let  ∈ (0, 1) and x, x ∈ B be two inputs. We first prove the result for the L L 0 diagonal term Kres(x, x) then Kres(x, x ).

2 2 l σw σw l l−1 2 l−1 • We have that q˙ (x, x) = 2l f(1) = 2l . Moreover, we have q (x, x) = q (x, x) + σw/2l × q (x, x) = 2 Ql 2 σw 2 [ k=1(1 + σw/2k)] d kxk . Recall that σ2 Kl (x, x) = Kl−1(x, x)(1 + w ) + ql(x, x) res res 2l

l 0 Kres(x,x) letting kl = Ql 2 ,we have that k=1(1+σw/2k) σ2 k0 = k0 + w kxk l l−1 d l 2 Q 2 σw/2 l using the fact that k=1(1 + σw/2k) = Θ(l ), we conclude for Kres(x, x). • Recall that l 0 l−1 0 l 0 l 0 Kres(x, x ) = Kres (x, x )(q ˙ (x, x ) + 1) + q (x, x )

Let cl := cl(x, x0). From Appendix Lemma 10 we have that

ζ ∇ 1 1 − cl = − + o( ) log(l)2 log(l)3 log(l)3

16 0 ζ = 2 4 and ∇ > 0. Using the Taylor expansion of f as in Appendix Lemma1, it follows that s σw

0 l 0 6 −1 0 −2 −3 f (c (x, x )) = 1 − 2 log(l) + ζ log(l) + O(log(l) ) σw

0 √∇ where ζ = 2πζ . We obtain

σ2 1 +q ˙l(x, x0) = 1 + w − 3l−1 log(l)−1 + ζ00l−1 log(l)−2 + O(l−1 log(l)−3) 2l

2 l+1 0 00 σw 0 Kres (x,x ) where ζ = 2 ζ . Letting al = Ql 2 , we obtain k=1(1+σw/2k)

al = λlal−1 + bl

−1 −1 −1 −1 −2 p 1 p 1 0 0 l 0 where λl = 1 − l − 3l log(l) + O(l log(l) ), bl = q (x, x) q (x , x )f(c (x, x )) = q(x, x0) + O(log(l)−2) with q = pq1(x, x)pq1(x0, x0) and where we used the fact that cl = 1 + O(log(l)−2) (Appendix Lemma 10).

al Now we proceed in the same way as in the proof of Appendix Lemma8. Let xl = l −q, then there exists M1,M2 > 0 such that 1 1 x (1 − ) − M l−1 log(l)−1 ≤ x ≤ x (1 − ) − M l−1 log(l)−1 l−1 l 1 l l−1 l 2 Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks

0 therefore, there exists l0 independent of (x, x ) such that for all l ≥ l0

l l l Y 1 X Y 1 x ≤ x (1 − ) − M (1 − )k−1 log(k)−1 l l0 k 2 j k=l0 k=l0 j=k+1 and l l l Y 1 X Y 1 x ≥ x (1 − ) − M (1 − )k−1 log(k)−1 l l0 k 1 j k=l0 k=l0 j=k+1 after simplification, we have that

l l ! X Y 1 1 Z l 1 (1 − )k−1 log(k)−1 = Θ dt = Θ(log(l)−1) j l log(t) k=l0 j=k+1 R t 1 x where we have used the asymptotic approximation of the Logarithmic Intergal function Li(x) = log(t) ∼x→∞ log(x) 2 l σw Q 2 1+ 2 −1 we conclude that αL = L × k=1(1 + σw/2k) ∼ L and the convergence rate of the NTK is now Θ(log(L) ) −1 which is better than Θ(L ). The convergence is uniform over the set B.

In the limit of large L, the matrix NTK of the scaled resnet has the following form

ˆ l −1 AKres = qU + log(L) Θ(ML)

where U is the matrix of ones, and ML has all elements but the diagonal equal to 1 and the diagonal terms are −1 ˆ l O(L log(L)) → 0. Therefore, ML is inversible for large L which makes Kres also inversible. Moreover, observe that the convergence rate for scaled resnet is log(L)−1 which means that for the same depth L, the NTK remains far more expressive for scaled resnet compared to standard resnet, this is particularly important for the generalization.

6.3. Spectral decomposition of the limiting NTK

6.3.1. REVIEWON SPHERICAL HARMONICS

We start by giving a brief review of the theory of Spherical Harmonics (MacRobert, 1967). Let Sd−1 be the unit sphere in d d−1 d R defined by S = {x ∈ R : kxk2 = 1}. For some k ≥ 1, there exists a set (Yk,j)1≤j≤N(d,k) of Spherical Harmonics 2k+d−2 k+d−3 of degree k with N(d, k) = k d−2 .

The set of functions (Yk,j)k≥1,j∈[1:N(d,k)] form an orthonormal basis with respect to the uniform measure on the unit sphere Sd−1. For some function g, the Hecke-Funk formula is given by Z Z 1 Ωd−1 d 2 (d−3)/2 g(hx, wi)Yk,j(w)dνd−1(w) = Yk,j(x) g(t)Pk (t)(1 − t ) dt d−1 Ω S d −1 d−1 d−1 d where νd−1 is the uniform measure on the unit sphere S , Ωd is the volume of the unit sphere S , and Pk is the multi-dimensional Legendre polynomials given explicitly by Rodrigues’ formula

d−1 d 1k Γ( 2 ) 2 3−d d k 2 k+ d−3 P (t) = − (1 − t ) 2 (1 − t ) 2 k 2 d−1 dt Γ(k + 2 )

d 2 2 d−3 (Pk )k≥0 form an orthogonal basis of L ([−1, 1], (1 − t ) 2 dt), i.e.

d d hPk ,Pk0 i d−3 = δk,k0 L2([−1,1],(1−t2) 2 dt) Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks where δij is the Kronecker symbol. Moreover, we have

d 2 (k + d − 3)! kPk k d−3 = L2([−1,1],(1−t2) 2 dt) (d − 3)(k − d + 3)!

Using the Heck-Funk formula, we can easily conclude that any dot product kernel on the unit sphere Sd−1, i.e. and kernel of the form κ(x, x0) = g(hx, x0i) can be decomposed on the Spherical Harmonics basis. Indeed, for any x, x0 ∈ Sd−1, the decomposition on the spherical harmonics basis yields

N(d,k) Z  0 X X 0 κ(x, x ) = g(hw, x i)Yk,j(w)dνd−1(w) Yk,j(x) d−1 k≥0 j=1 S Using the Hecke-Funk formula yields

N(d,k)  Z 1  0 X X Ωd−1 d 2 (d−3)/2 0 κ(x, x ) = g(t)Pk (t)(1 − t ) dt Yk,j(x)Yk,j(x ) Ωd k≥0 j=1 −1 we conclude that N(d,k) 0 X X 0 κ(x, x ) = µk Yk,j(x)Yk,j(x ) k≥0 j=1

Ωd−1 R 1 d 2 (d−3)/2 where µk = g(t)P (t)(1 − t ) dt. Ωd −1 k We use these result in the proof of the next theorem. Theorem 3 (Spectral decomposition). Let κL be either, the NTK (KL) for an FFNN with L layers initialized on the Ordered ˜ L ¯ L phase, The Average NTK (K ) for an FFNN with L layers initialized on the EOC, or the Normalized NTK (Kres) for a L 0 d−1 ResNet with L layers (Fully Connected). Then, for all L ≥ 1, there exists (µk )k≥ such that for all x, x ∈ S

N(d,k) L 0 X L X 0 κ (x, x ) = µk Yk,j(x)Yk,j(x ). k≥0 j=1

d−1 (Yk,j)k≥0,j∈[1:N(d,k)] are spherical harmonics of S , and N(d, k) is the number of harmonics of order k. ∞ L L Moreover, we have that 0 < µ0 = lim µ0 < ∞, and for all k ≥ 1, lim µk = 0. L→∞ L→∞

Proof. From the recursive formulas of the NTK for FFNN, CNN and ResNet architectures, it is straightforward that on the unit sphere Sd−1, the kernel κL is zonal in the sense that it depends only on the scalar product, more precisely, for all L ≥ 1, there exists a function gL such that for all x, x0 ∈ Sd−1 κL(x, x0) = gL(hx, x0i) using the previous results on Spherical Harmonics, we have that for all x, x0 ∈ Sd−1 N(d,k) L 0 X L X 0 κ (x, x ) = µk Yk,j(x)Yk,j(x ) k≥0 j=1 where µL = Ωd−1 R 1 gL(t)P d(t)(1 − t2)(d−3)/2dt. k Ωd −1 k For k = 0, we have that for all L ≥ 1, µL = Ωd−1 R 1 gL(t)(1 − t2)(d−3)/2dt. By a simple dominated convergence 0 Ωd −1 L Ωd−1 R 1 2 (d−3)/2 argument, we have that limL→∞ µ = qλ (1 − t ) dt > 0, where q, λ are given in Theorems1,2 and 0 Ωd −1 Proposition1 (where we take q = 1 for the Ordered/Chaotic phase initialization in Proposition1). Using the same argument, L Ωd−1 R 1 d 2 (d−3)/2 Ωd−1 d d we have that for k ≥ 1, limL→∞ µ = qλ P (t)(1−t ) dt = qλ hP0 ,P i d−3 = 0. k Ωd −1 k Ωd k L2([−1,1],(1−t2) 2 dt) Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks References Arora, S., S. Du, W. Hu, , Z. Li, and R. Wand (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. ICML.

Arora, S., S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang (2019). On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955.

Bietti, A. and J. Mairal (2019). On the inductive bias of neural tangent kernels. NeurIPS 2019.

Cao, Y., Z. Fang, Y. Wu, D. Zhou, and Q. Gu (2020). Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198.

Cao, Y. and Q. Gu (2019). Generalization bounds of stochastic gradient descent for wide and deep neural networks. NeurIPS.

Chizat, L. and F. Bach (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.

Du, S., J. Lee, H. Li, L. Wang, and X. Zhai (2019). Gradient descent finds global minima of deep neural networks. ICML.

Du, S., J. Lee, Y. Tian, B. Poczos, and A. Singh (2018). Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. ICML.

Du, S., X. Zhai, B. Poczos, and A. Singh (2019). Gradient descent provably optimizes over-parameterized neural networks. ICLR.

Geifman, A., A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and R. Basri (2020). On the similarity between the Laplace and neural tangent kernels. NeurIPS.

Ghorbani, B., S. Mei, T. Misiakiewicz, and A. Montanari (2019). Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191.

Hanin, B. and M. Nica (2019). Finite depth and width corrections to the neural tangent kernel. arXiv preprint arXiv:1909.05989.

Hayase, T. and R. Karakida (2020). The spectrum of Fisher information of deep networks achieving dynamical isometry. arXiv preprint arXiv:2006.07814.

Hayou, S., A. Doucet, and J. Rousseau (2019). On the impact of the activation function on deep neural networks training. ICML.

Huang, J. and H. Yau (2020). Dynamics of deep neural networks and neural tangent hierarchy. ICML.

Huang, K., Y. Wang, M. Tao, and T. Zhao (2020). Why do deep residual networks generalize better than deep feedforward networks? – a neural tangent kernel perspective. arXiv preprint arXiv:2002.06262.

Jacot, A., F. Gabriel, and C. Hongler (2018). Neural tangent kernel: Convergence and generalization in neural networks. 32nd Conference on Neural Information Processing Systems.

Karakida, R., S. Akaho, and S. Amari (2018). Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv preprint arXiv:1806.01316.

Lee, J., Y. Bahri, R. Novak, S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018). Deep neural networks as Gaussian processes. 6th International Conference on Learning Representations.

Lee, J., L. Xiao, S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington (2019). Wide neural networks of any depth evolve as linear models under gradient descent. NeurIPS.

Lillicrap, T., D. Cownden, D. Tweed, and C. Akerman (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications 7(13276).

MacRobert, T. (1967). Spherical harmonics: An elementary treatise on harmonic functions, with applications. Pergamon Press.

Matthews, A., J. Hron, M. Rowland, R. Turner, and Z. Ghahramani (2018). Gaussian process behaviour in wide deep neural networks. 6th International Conference on Learning Representations.

Neal, R. (1995). Bayesian learning for neural networks. Springer Science & Business Media 118.

Nguyen, Q. and M. Hein (2018). Optimization landscape and expressivity of deep CNNs. ICML.

Novak, R., L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz (2020). Neural tangents: Fast and easy infinite neural networks in Python. In International Conference on Learning Representations.

Poole, B., S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli (2016). Exponential expressivity in deep neural networks through transient chaos. 30th Conference on Neural Information Processing Systems.

Schoenholz, S., J. Gilmer, S. Ganguli, and J. Sohl-Dickstein (2017). Deep information propagation. 5th International Conference on Learning Representations.

Xiao, L., Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington (2018). Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. ICML.

Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.

Yang, G. (2020). Tensor programs III: Neural matrix laws. arXiv preprint arXiv:2009.10685.

Yang, G. and S. Schoenholz (2017a). Mean field residual networks: On the edge of chaos. Advances in Neural Information Processing Systems 30, 2869–2869.

Yang, G. and S. Schoenholz (2017b). Mean field residual networks: On the edge of chaos. In Advances in neural information processing systems, pp. 7103–7114.

Zhang, C., S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

Zou, D., Y. Cao, D. Zhou, and Q. Gu (2018). Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888.