Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks
Soufiane Hayou 1   Arnaud Doucet 1   Judith Rousseau 1

arXiv:1905.13654v10 [stat.ML] 23 May 2021

1 Department of Statistics, University of Oxford, Oxford, United Kingdom. Correspondence to: Soufiane Hayou <[email protected]>.

Abstract

Recent work by Jacot et al. (2018) has shown that training a neural network of any kind, with gradient descent in parameter space, is strongly related to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. (2019) built on this result by establishing that the output of a neural network trained using gradient descent can be approximated by a linear model for wide networks. In parallel, a recent line of studies (Schoenholz et al., 2017; Hayou et al., 2019) has suggested that a special initialization, known as the Edge of Chaos, improves training. In this paper, we connect these two concepts by quantifying the impact of the initialization and the activation function on the NTK when the network depth becomes large. In particular, we show that the performance of wide deep neural networks cannot be explained by the NTK regime. We also leverage our theoretical results to derive a learning rate passband where training is possible.

1. Introduction

Deep neural networks (DNNs) have achieved state-of-the-art results on numerous tasks. Hence, there is a multitude of works trying to theoretically explain their remarkable performance; see, e.g., (Du et al., 2018; Nguyen and Hein, 2018; Zhang et al., 2017; Zou et al., 2018). Recently, Jacot et al. (2018) introduced the Neural Tangent Kernel (NTK), which characterises DNN training in the so-called lazy training regime (or NTK regime). In this regime, the whole training procedure is reduced to a first-order Taylor expansion of the output function near its initialization value. It was shown in (Lee et al., 2019) that such a simple model could lead to surprisingly good performance. However, most experiments in the NTK regime are performed on shallow neural networks and have not covered DNNs. In this paper, we cover this topic by showing the limitations of the NTK regime for DNNs and how it differs from the actual training of DNNs with Stochastic Gradient Descent.

Neural Tangent Kernel. Jacot et al. (2018) showed that training a neural network (NN) with gradient descent (GD) in parameter space is equivalent to GD in a function space with respect to the NTK. Du et al. (2019) used a similar approach to prove that full-batch GD converges to global minima for shallow neural networks, and Karakida et al. (2018) linked the Fisher information matrix to the NTK, studying its spectral distribution for infinite-width NNs. The infinite-width limit for different architectures was studied by Yang (2019), who introduced a tensor formalism that can express the NN computations. Lee et al. (2019) studied a linear approximation of the full-batch GD dynamics based on the NTK, and gave a method to approximate the NTK for different architectures. Finally, Arora et al. (2019) proposed an efficient algorithm to compute the NTK for convolutional architectures (Convolutional NTK). In all of these papers, the authors only studied the effect of the infinite-width limit (NTK regime) with relatively shallow networks.

Information propagation. In parallel, information propagation in wide DNNs has been studied in (Hayou et al., 2019; Lee et al., 2018; Schoenholz et al., 2017; Yang and Schoenholz, 2017a). These works provide an analysis of the signal propagation at the initial step as a function of the initialization hyper-parameters (i.e. the variances of the initial random weights and biases). They identify a set of hyper-parameters, known as the Edge of Chaos (EOC), and activation functions that ensure a deep propagation of the information carried by the input. This ensures that the network output still has some information about the input. In this paper, we prove that the Edge of Chaos initialization also has some benefits for the NTK.

NTK training and SGD training. Stochastic Gradient Descent (SGD) has been successfully used to train deep networks. Recently, with the introduction of the Neural Tangent Kernel in (Jacot et al., 2018), Lee et al. (2019) suggested a different approach to training overparameterized neural networks. The idea originates from the conjecture that in overparameterized models, a local minimum exists near the initialization weights. Thus, using a first-order Taylor expansion near initialization, the model is reduced to a simple linear model, and the linear model is trained instead of the original network. Hereafter, we refer to this training procedure as NTK training and to the trained model as the NTK regime. We clarify this in Section 2.
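To make this linearization concrete, here is a minimal JAX sketch (illustrative only: the two-layer tanh model, the helper name linearize and all sizes are assumptions, not the paper's setup). It builds the first-order Taylor expansion f_lin(x, θ) = f(x, θ_0) + ∇_θ f(x, θ_0)(θ − θ_0) around the initialization θ_0; NTK training then applies gradient descent to f_lin instead of the original f.

```python
# Minimal sketch of NTK training: train the first-order Taylor expansion of a
# network around its initialization instead of the network itself. Illustrative only.
import jax
import jax.numpy as jnp

def linearize(f, params0):
    """Return f_lin(params, x) = f(params0, x) + <grad_params f(params0, x), params - params0>."""
    def f_lin(params, x):
        delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        y0, dy = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
        return y0 + dy
    return f_lin

# Tiny two-layer tanh model, purely for illustration.
def f(params, x):
    w1, b1, w2, b2 = params
    return jnp.tanh(x @ w1 + b1) @ w2 + b2

key = jax.random.PRNGKey(0)
k1, k2, kx = jax.random.split(key, 3)
params0 = (jax.random.normal(k1, (5, 32)) / jnp.sqrt(5.0), jnp.zeros(32),
           jax.random.normal(k2, (32, 1)) / jnp.sqrt(32.0), jnp.zeros(1))
x = jax.random.normal(kx, (8, 5))

f_lin = linearize(f, params0)
# At the initialization the two models coincide; gradient descent is then run on
# the parameters of f_lin (a model that is linear in params) rather than on f.
print(jnp.allclose(f_lin(params0, x), f(params0, x)))   # True
```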
Contributions. The aim of this paper is to study the large depth limit of the NTK. Our contributions are:

• We prove that the NTK regime is always trivial in the limit of large depth. However, the convergence rate to this trivial regime is controlled by the initialization hyper-parameters.

• We prove that only an EOC initialization provides a sub-exponential convergence rate to this trivial regime, while other initializations yield an exponential rate. For the same depth, the NTK regime is thus 'less' trivial for an EOC initialization. This allows training deep models using NTK training.

• For ResNets, we also have convergence to a trivial NTK regime, but this always occurs at a polynomial rate, irrespective of the initialization. To further slow down the NTK convergence rate, we introduce scaling factors to the ResNet blocks, which allows NTK training of deep ResNets.

• We leverage our theoretical results on the asymptotic behaviour of the NTK to show the existence of a learning rate passband for SGD training where training is possible.

Table 1 summarizes the behaviour of NTK and SGD training for different depths and initialization schemes of an FFNN on the MNIST dataset. We report whether the model learns or not, i.e. whether its test accuracy is significantly higher than 10%, the accuracy of the trivial random classifier. The results displayed in the table show that for a shallow FFNN (L = 3), the model learns to classify with both NTK training and SGD training for any initialization scheme. For a medium-depth network (L = 30), NTK training and SGD training both succeed in training the model with an initialization on the EOC, while they both fail with other initializations (see the definition of NTK training in Section 2). However, for a deeper network with L = 300, NTK training fails for any initialization, while SGD training succeeds in training the model with an EOC initialization. This confirms the limitations of NTK training for DNNs. However, although the large depth NTK regime is trivial, we leverage this asymptotic analysis to infer a theoretical upper bound on the learning rate (Section 4). We illustrate our theoretical results through extensive simulations. All the proofs are detailed in the appendix.

Table 1. "Does the model learn?" We train a feedforward neural network on MNIST using both standard SGD training and the NTK training defined in Section 2. For shallow networks, both SGD and NTK training yield good performance (see Section 5). However, for deep networks, NTK training yields a trivial accuracy of around 10% for any initialization scheme.

                                         Initialization on the Edge of Chaos    Other initialization
Shallow Network (depth L = 3)     NTK                    ✓                              ✓
                                  SGD                    ✓                              ✓
Medium Network (depth L = 30)     NTK                    ✓                              ✗
                                  SGD                    ✓                              ✗
Deep Network (depth L = 300)      NTK                    ✗                              ✗
                                  SGD                    ✓                              ✗

2. Neural Networks and Neural Tangent Kernel

2.1. Setup and notations

Consider a neural network model consisting of L layers of widths (n_l)_{1≤l≤L}, with n_0 = d, let θ = (θ^l)_{1≤l≤L} be the flattened vector of weights and biases indexed by the layer's index, and let p be the dimension of θ. The output f of the neural network is given by some mapping s : R^{n_L} → R^o of the last layer y^L(x), where o is the dimension of the output (e.g. the number of classes in a classification problem). For any input x ∈ R^d, we thus have f(x, θ) = s(y^L(x)) ∈ R^o. As we train the model, θ changes with time t; we denote by θ_t the value of θ at time t and write f_t(x) = f(x, θ_t).
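To fix ideas, here is a minimal sketch of such a network in JAX (illustrative only: the specific widths, the ReLU activation, the Gaussian initialization and the choice s = identity are assumptions, not the paper's experimental setup). The weights of layer l are drawn with variance σ_w²/n_{l−1} and the biases with variance σ_b², i.e. the initialization hyper-parameters discussed in Section 1; for ReLU, the default (σ_w², σ_b²) = (2, 0) corresponds to the Edge of Chaos point identified in the information propagation literature cited above.

```python
# Minimal FFNN matching the notation of Section 2.1. Illustrative only.
import jax
import jax.numpy as jnp

def init_ffnn(key, widths, sigma_w=jnp.sqrt(2.0), sigma_b=0.0):
    """theta = ((W^l, b^l))_{1<=l<=L}, with Var(W^l_ij) = sigma_w^2 / n_{l-1}, Var(b^l_i) = sigma_b^2."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        key, kw, kb = jax.random.split(key, 3)
        W = sigma_w / jnp.sqrt(n_in) * jax.random.normal(kw, (n_in, n_out))
        b = sigma_b * jax.random.normal(kb, (n_out,))
        params.append((W, b))
    return params

def forward(params, x, s=lambda y: y):
    """Compute f(x, theta) = s(y^L(x)); ReLU is applied between layers (one common convention)."""
    y = x
    for l, (W, b) in enumerate(params):
        y = y @ W + b
        if l < len(params) - 1:      # the last layer y^L is left linear before s
            y = jax.nn.relu(y)
    return s(y)

# Example: input dimension d = 784, two hidden layers, output dimension o = 10.
widths = [784, 128, 128, 10]
params = init_ffnn(jax.random.PRNGKey(0), widths)
x = jnp.ones((1, 784))
print(forward(params, x).shape)      # (1, 10)
```

Any such differentiable model can be plugged into the linearization sketch of Section 1 and into the kernel computation below.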
Let D = (x_i, z_i)_{1≤i≤N} be the dataset, and let X = (x_i)_{1≤i≤N} and Z = (z_j)_{1≤j≤N} be the matrices of inputs and outputs respectively, with dimensions d × N and o × N. We assume that there is no collinearity in the input dataset X, i.e. there are no two inputs x, x' ∈ X such that x' = αx for some α ∈ R. We also assume that there exists a compact set E ⊂ R^d such that X ⊂ E.

The NTK K^L_{θ_t} is defined as the o × o dimensional kernel satisfying, for all x, x' ∈ R^d,

$$K^L_{\theta_t}(x, x') = \nabla_\theta f(x, \theta_t)\, \nabla_\theta f(x', \theta_t)^T = \sum_{l=1}^{L} \nabla_{\theta^l} f(x, \theta_t)\, \nabla_{\theta^l} f(x', \theta_t)^T \in \mathbb{R}^{o \times o}.$$

• The NTK regime (infinite width): In the case of an FFNN, Jacot et al. (2018) proved that, with GD, the kernel K^L_{θ_t} converges to a kernel K^L which depends only on L (the depth), for all t < T, when n_1, n_2, ..., n_L → ∞ sequentially, where T is an upper bound on the training time. The infinite-width limit of the training dynamics with a quadratic loss is given by the linear model

$$f_t(X) = e^{-\frac{t}{N}\hat{K}^L} f_0(X) + \big(I - e^{-\frac{t}{N}\hat{K}^L}\big) Z, \qquad (1)$$

where K̂^L = K^L(X, X).
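For a finite network, both the kernel and the dynamics in Eq. (1) can be evaluated directly. The sketch below (illustrative only: the toy scalar-output network, the helper empirical_ntk and the synthetic data are assumptions) stacks the per-example gradients ∇_θ f(x_i, θ_0) into a Jacobian matrix, forms K̂^L = K^L(X, X) as its Gram matrix, and evaluates f_t(X) from Eq. (1) via an eigendecomposition of K̂^L.

```python
# Sketch: empirical NTK at initialization and the linearized dynamics of Eq. (1)
# for a scalar-output (o = 1) toy network. Illustrative only.
import jax
import jax.numpy as jnp

def f(params, x):
    w1, b1, w2 = params
    return (jnp.tanh(x @ w1 + b1) @ w2).squeeze(-1)    # shape (N,)

def empirical_ntk(f, params, X):
    """K_hat[i, j] = <grad_theta f(x_i), grad_theta f(x_j)> (o = 1 case)."""
    jac = jax.jacobian(f, argnums=0)(params, X)        # pytree of (N, ...) arrays
    J = jnp.concatenate([j.reshape(X.shape[0], -1)     # flatten to an N x p matrix
                         for j in jax.tree_util.tree_leaves(jac)], axis=1)
    return J @ J.T                                     # N x N Gram matrix

key = jax.random.PRNGKey(1)
k1, k2, kx = jax.random.split(key, 3)
params = (jax.random.normal(k1, (3, 64)) / jnp.sqrt(3.0), jnp.zeros(64),
          jax.random.normal(k2, (64, 1)) / jnp.sqrt(64.0))
X = jax.random.normal(kx, (10, 3))                     # N = 10 inputs, d = 3
Z = jnp.sin(X[:, 0])                                   # toy targets

K_hat = empirical_ntk(f, params, X)
N, t = X.shape[0], 100.0
# K_hat is symmetric PSD, so exp(-t K_hat / N) is computed via its eigendecomposition.
evals, evecs = jnp.linalg.eigh(K_hat)
E = (evecs * jnp.exp(-t / N * evals)) @ evecs.T
# Eq. (1): f_t(X) = exp(-t K_hat / N) f_0(X) + (I - exp(-t K_hat / N)) Z
f_t = E @ f(params, X) + (jnp.eye(N) - E) @ Z
print(f_t.shape)                                       # (10,)
```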