Dynamics of Deep Neural Networks and Neural Tangent Hierarchy

Jiaoyang Huang * 1 Horng-Tzer Yau * 2

*Equal contribution. 1 School of Mathematics, IAS, Princeton, NJ, USA. 2 Mathematics Department, Harvard, Cambridge, MA, USA. Correspondence to: Jiaoyang Huang. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

The evolution of a deep neural network trained by gradient descent in the overparametrization regime can be described by its neural tangent kernel (NTK) (Jacot et al., 2018; Du et al., 2018b;a; Arora et al., 2019b). It was observed (Arora et al., 2019a) that there is a performance gap between the kernel regression using the limiting NTK and the deep neural networks. We study the dynamic of neural networks of finite width and derive an infinite hierarchy of differential equations, the neural tangent hierarchy (NTH). We prove that the NTH truncated at level p >= 2 approximates the dynamic of the NTK up to arbitrary precision under certain conditions on the neural network width and the data set dimension. The assumptions needed for these approximations become weaker as p increases. Finally, the NTH can be viewed as a higher order extension of the NTK. In particular, the NTH truncated at p = 2 recovers the NTK dynamics.

1. Introduction

Deep neural networks have become popular due to their unprecedented success in a variety of machine learning tasks. Image recognition (LeCun et al., 1998; Krizhevsky et al., 2012; Szegedy et al., 2015), speech recognition (Hinton et al., 2012; Sainath et al., 2013), playing Go (Silver et al., 2016; 2017) and natural language understanding (Collobert et al., 2011; Wu et al., 2016; Devlin et al., 2018) are just a few of the recent achievements. However, one aspect of deep neural networks that is not well understood is training. Training a deep neural network is usually done via a gradient descent based algorithm, and analyzing such training dynamics is challenging. Firstly, as highly nonlinear structures, deep neural networks usually involve a large number of parameters. Secondly, as highly non-convex optimization problems, there is no guarantee that a gradient based algorithm will be able to find the optimal parameters efficiently during the training of neural networks. One question then arises: given such complexities, is it possible to obtain a succinct description of the training dynamics?

In this paper, we focus on the empirical risk minimization problem with the quadratic loss function

\min_\theta L(\theta) = \frac{1}{2n} \sum_{\alpha=1}^{n} (f(x_\alpha, \theta) - y_\alpha)^2,

where \{x_\alpha\}_{\alpha=1}^n are the training inputs, \{y_\alpha\}_{\alpha=1}^n are the labels, and the dependence is modeled by a deep fully-connected feedforward neural network with H hidden layers. The network has d input nodes, and the input vector is given by x \in \mathbb{R}^d. For 1 \le \ell \le H, the \ell-th hidden layer has m neurons. Let x^{(\ell)} be the output of the \ell-th layer, with x^{(0)} = x. Then the feedforward neural network is given by the set of recursive equations:

x^{(\ell)} = \frac{1}{\sqrt{m}} \sigma(W^{(\ell)} x^{(\ell-1)}), \quad \ell = 1, 2, \cdots, H,   (1)

where W^{(1)} \in \mathbb{R}^{m \times d} and W^{(\ell)} \in \mathbb{R}^{m \times m} for 2 \le \ell \le H are the weight matrices, and \sigma is the activation unit, which is applied coordinate-wise to its input. The output of the neural network is

f(x, \theta) = a^\top x^{(H)} \in \mathbb{R},   (2)

where a \in \mathbb{R}^m is the weight vector of the output layer. We denote the vector containing all trainable parameters by \theta = (\mathrm{vec}(W^{(1)}), \mathrm{vec}(W^{(2)}), \cdots, \mathrm{vec}(W^{(H)}), a). We remark that this parametrization is nonstandard because of the 1/\sqrt{m} factors. However, it has already been adopted in several recent works (Jacot et al., 2018; Du et al., 2018b;a; Lee et al., 2019). We note that the predictions and training dynamics of (1) are identical to those of standard networks, up to a scaling factor 1/\sqrt{m} in the learning rate for each parameter.

We initialize the neural network with random Gaussian weights following the Xavier initialization scheme (Glorot & Bengio, 2010). More precisely, we set the initial parameter vector \theta_0 as W^{(\ell)}_{ij} \sim \mathcal{N}(0, \sigma_w^2), a_i \sim \mathcal{N}(0, \sigma_a^2). In this way, for the randomly initialized neural network, the L^2 norms of the outputs of all layers are of order one, i.e., \|x^{(\ell)}\|_2 = O(1) for 0 \le \ell \le H, and f(x, \theta_0) = O(1) with high probability.
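To make the scaling in (1)-(2) concrete, the following is a minimal sketch (not the authors' code) of this parametrization with Gaussian initialization; the width m = 2000, depth H = 4, the tanh activation, and sigma_w = sigma_a = 1 are illustrative assumptions. Running it shows that the layer outputs and f(x, theta_0) indeed stay of order one as m grows.

```python
# Minimal sketch of the network (1)-(2) with the nonstandard 1/sqrt(m) scaling.
import numpy as np

def init_network(d, m, H, sigma_w=1.0, sigma_a=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Ws = [sigma_w * rng.standard_normal((m, d))]              # W^(1) in R^{m x d}
    Ws += [sigma_w * rng.standard_normal((m, m)) for _ in range(H - 1)]
    a = sigma_a * rng.standard_normal(m)                      # output weights
    return Ws, a

def forward(Ws, a, x):
    h, norms = x, []
    for W in Ws:                       # eq. (1): x^(l) = sigma(W^(l) x^(l-1)) / sqrt(m)
        h = np.tanh(W @ h) / np.sqrt(W.shape[0])
        norms.append(np.linalg.norm(h))
    return a @ h, norms                # eq. (2): f(x, theta) = a^T x^(H)

Ws, a = init_network(d=10, m=2000, H=4)
x = np.random.default_rng(1).standard_normal(10)
x /= np.linalg.norm(x)
f_out, norms = forward(Ws, a, x)
print(f_out, norms)                    # the layer norms stay O(1) as m grows
```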

In this paper, we train all layers of the neural network with continuous time gradient descent (gradient flow): for any time t \ge 0,

\partial_t W^{(\ell)}_t = -\partial_{W^{(\ell)}} L(\theta_t), \quad \ell = 1, 2, \cdots, H,
\partial_t a_t = -\partial_a L(\theta_t),   (3)

where \theta_t = (\mathrm{vec}(W^{(1)}_t), \mathrm{vec}(W^{(2)}_t), \cdots, \mathrm{vec}(W^{(H)}_t), a_t). For simplicity of notation, we write \sigma(W^{(\ell)} x^{(\ell-1)}) as \sigma_\ell(x), or simply \sigma_\ell if the context is clear. We write its derivative \mathrm{diag}(\sigma'(W^{(\ell)} x^{(\ell-1)})) as \sigma'_\ell(x) = \sigma^{(1)}_\ell(x), and the r-th derivative \mathrm{diag}(\sigma^{(r)}(W^{(\ell)} x^{(\ell-1)})) as \sigma^{(r)}_\ell(x), or \sigma^{(r)}_\ell, for r \ge 1. In this notation, the \sigma^{(r)}_\ell(x) are diagonal matrices. With these notations, the continuous time gradient descent dynamic (3) is, explicitly,

\partial_t W^{(\ell)}_t = -\partial_{W^{(\ell)}} L(\theta_t)
  = -\frac{1}{n} \sum_{\beta=1}^{n} \left( \sigma'_\ell(x_\beta) \frac{(W^{(\ell+1)}_t)^\top}{\sqrt{m}} \cdots \sigma'_H(x_\beta) \frac{a_t}{\sqrt{m}} \right) \otimes (x^{(\ell-1)}_\beta)^\top \, (f(x_\beta, \theta_t) - y_\beta),   (4)

for \ell = 1, 2, \cdots, H, and

\partial_t a_t = -\partial_a L(\theta_t) = -\frac{1}{n} \sum_{\beta=1}^{n} x^{(H)}_\beta (f(x_\beta, \theta_t) - y_\beta).   (5)

1.1. Neural Tangent Kernel

A recent paper (Jacot et al., 2018) introduced the Neural Tangent Kernel (NTK) and proved that the limiting NTK captures the behavior of fully-connected deep neural networks in the infinite width limit trained by gradient descent:

\partial_t f(x, \theta_t) = \partial_\theta f(x, \theta_t) \partial_t \theta_t = -\partial_\theta f(x, \theta_t) \partial_\theta L(\theta_t)
  = -\frac{1}{n} \sum_{\beta=1}^{n} \partial_\theta f(x, \theta_t) \cdot \partial_\theta f(x_\beta, \theta_t) (f(x_\beta, \theta_t) - y_\beta)
  = -\frac{1}{n} \sum_{\beta=1}^{n} K^{(2)}_t(x, x_\beta)(f(x_\beta, \theta_t) - y_\beta),   (6)

where the NTK K^{(2)}_t(\cdot, \cdot) is given by

K^{(2)}_t(x_\alpha, x_\beta) = \langle \partial_\theta f(x_\alpha, \theta_t), \partial_\theta f(x_\beta, \theta_t) \rangle = \sum_{\ell=1}^{H+1} G^{(\ell)}_t(x_\alpha, x_\beta),   (7)

with, for 1 \le \ell \le H,

G^{(\ell)}_t(x_\alpha, x_\beta) = \langle \partial_{W^{(\ell)}} f(x_\alpha, \theta_t), \partial_{W^{(\ell)}} f(x_\beta, \theta_t) \rangle
  = \left\langle \sigma'_\ell(x_\alpha) \frac{(W^{(\ell+1)}_t)^\top}{\sqrt{m}} \cdots \sigma'_H(x_\alpha) \frac{a_t}{\sqrt{m}}, \;\; \sigma'_\ell(x_\beta) \frac{(W^{(\ell+1)}_t)^\top}{\sqrt{m}} \cdots \sigma'_H(x_\beta) \frac{a_t}{\sqrt{m}} \right\rangle \langle x^{(\ell-1)}_\alpha, x^{(\ell-1)}_\beta \rangle,

and

G^{(H+1)}_t(x_\alpha, x_\beta) = \langle \partial_a f(x_\alpha, \theta_t), \partial_a f(x_\beta, \theta_t) \rangle = \langle x^{(H)}_\alpha, x^{(H)}_\beta \rangle.

The NTK K^{(2)}_t(\cdot, \cdot) varies along training. However, in the infinite width limit the training dynamic is very simple: the NTK does not change along training, K^{(2)}_t(\cdot, \cdot) = K^{(2)}_\infty(\cdot, \cdot), and the network function f(x, \theta_t) follows the linear differential equation (Jacot et al., 2018)

\partial_t f(x, \theta_t) = -\frac{1}{n} \sum_{\beta=1}^{n} K^{(2)}_\infty(x, x_\beta)(f(x_\beta, \theta_t) - y_\beta),   (8)

which becomes analytically tractable. In other words, the training dynamic is equivalent to the kernel regression using the limiting NTK K^{(2)}_\infty(\cdot, \cdot). While the linearization (8) is only exact in the infinite width limit, for a sufficiently wide deep neural network, (8) still provides a good approximation of the learning dynamic of the corresponding deep neural network (Du et al., 2018b;a; Lee et al., 2019). As a consequence, it was proven in (Du et al., 2018b;a) that, for a fully-connected wide neural network with m \gtrsim n^4, under certain assumptions on the data set, the gradient descent converges to zero training loss at a linear rate. Although highly overparametrized neural networks are equivalent to the kernel regression, it is possible to show that the class of finite width neural networks is more expressive than the limiting NTK. It has been shown by explicit constructions in (Ghorbani et al., 2019; Yehudai & Shamir, 2019; Allen-Zhu & Li, 2019) that there are simple functions that can be efficiently learnt by finite width neural networks, but not by the kernel regression using the limiting NTK.
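The kernel (7) can also be evaluated directly by automatic differentiation, as an inner product of parameter gradients. The sketch below (illustrative, not the authors' implementation) does this with JAX for the parametrization (1)-(2); the tanh activation and the specific sizes are assumptions. At finite width this empirical kernel drifts during training, which is exactly the effect quantified in the rest of the paper.

```python
# Empirical NTK K^(2)(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>, eq. (7), via JAX.
import jax
import jax.numpy as jnp

def init_params(key, d, m, H, sigma_w=1.0, sigma_a=1.0):
    keys = jax.random.split(key, H + 1)
    Ws = [sigma_w * jax.random.normal(keys[0], (m, d))]
    Ws += [sigma_w * jax.random.normal(keys[l], (m, m)) for l in range(1, H)]
    a = sigma_a * jax.random.normal(keys[H], (m,))
    return Ws, a

def f(params, x):
    Ws, a = params
    h = x
    for W in Ws:                      # eq. (1)
        h = jnp.tanh(W @ h) / jnp.sqrt(W.shape[0])
    return a @ h                      # eq. (2)

def ntk(params, x1, x2):
    g1 = jax.grad(f)(params, x1)      # gradient with respect to all parameters
    g2 = jax.grad(f)(params, x2)
    leaves1 = jax.tree_util.tree_leaves(g1)
    leaves2 = jax.tree_util.tree_leaves(g2)
    return sum(jnp.vdot(u, v) for u, v in zip(leaves1, leaves2))

key = jax.random.PRNGKey(0)
params = init_params(key, d=5, m=1024, H=3)
x1, x2 = jax.random.normal(key, (2, 5))
x1, x2 = x1 / jnp.linalg.norm(x1), x2 / jnp.linalg.norm(x2)
print(ntk(params, x1, x2))
```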

1.2. Contribution

There is a performance gap between the kernel regression using the limiting NTK and the deep neural networks. It was observed in (Arora et al., 2019a) that convolutional neural networks outperform their corresponding limiting NTK by 5%-6%. This performance gap is likely to originate from the change of the NTK along training due to the finite width effect, and this change of the NTK along training has its benefits for generalization.

In the current paper, we study the dynamic of the NTK for finite width deep fully-connected neural networks. Here we summarize our main contributions:

• We show that the gradient descent dynamic is captured by an infinite hierarchy of ordinary differential equations, the neural tangent hierarchy (NTH). Similar recursive differential equations were also obtained by Dyer and Gur-Ari (Dyer & Gur-Ari, 2019). Different from the limiting NTK (7), which depends only on the neural network architecture, the NTH is data dependent and capable of learning data-dependent features.

• We derive a priori estimates of the higher order kernels involved in the NTH. Using these a priori estimates as input, we confirm a numerical observation in (Lee et al., 2019) that the NTK varies at a rate of order O(1/m). As a corollary, this implies that for a fully-connected wide neural network with m \gtrsim n^3, the gradient descent converges to zero training loss at a linear rate, which improves the results in (Du et al., 2018a).

• The NTH is just an infinite sequence of relationships. Without truncation, it cannot be used to determine the dynamic of the NTK. Using the a priori estimates of the higher order kernels as input, we construct a truncated hierarchy of ordinary differential equations, the truncated NTH. We show that this system of truncated equations approximates the dynamic of the NTK up to a certain time and up to arbitrary precision. This description makes it possible to directly study the change of the NTK for deep neural networks.

1.3. Notations

In the paper, we fix a large constant p^* > 0, which appears in Assumptions 2.1 and 2.2. We use c, C to represent universal constants, which might be different from line to line. We write a = O(b) or a \lesssim b if there exists some large universal constant C such that |a| \le Cb. We write a \gtrsim b if there exists some small universal constant c > 0 such that a \ge cb. We write a \asymp b if there exist universal constants c, C such that cb \le |a| \le Cb. We reserve n for the number of input samples and m for the width of the neural network. For practical neural networks, we always have m \lesssim \mathrm{poly}(n) and n \lesssim \mathrm{poly}(m). We denote the set of input samples as \mathcal{X} = \{x_1, x_2, \cdots, x_n\}. For simplicity of notation, we write the output of the neural network as f_\beta(t) = f(x_\beta, \theta_t). We denote the vector L^2 norm as \|\cdot\|_2, the vector or function L^\infty norm as \|\cdot\|_\infty, the matrix spectral norm as \|\cdot\|_{2\to 2}, and the matrix Frobenius norm as \|\cdot\|_F. We say that an event holds with high probability if it holds with probability at least 1 - e^{-m^c} for some c > 0; then the intersection of \mathrm{poly}(n, m) many high probability events is still a high probability event, provided m is large enough. In the paper, we treat c_r, C_r in Assumptions 2.1 and 2.2, and the depth H, as constants, and we will not keep track of them.

1.4. Related Work

In this section, we survey an incomplete list of previous works on the optimization aspect of deep neural networks.

Because of the highly non-convex nature of deep neural networks, gradient based algorithms can potentially get stuck near a critical point, i.e., a saddle point or a local minimum. So one important question for deep neural networks is: what does the loss landscape look like? One promising candidate for loss landscapes is the class of functions that satisfy: (i) all local minima are global minima and (ii) there exists a negative curvature at every saddle point. A line of recent results shows that, in many optimization problems of interest (Ge et al., 2015; 2016; Sun et al., 2018; 2016; Bhojanapalli et al., 2016; Park et al., 2016), loss landscapes are in this class. For this function class, (perturbed) gradient descent (Jin et al., 2017; Ge et al., 2015; Lee et al., 2016) can find a global minimum. However, even for a three-layer linear network, there exists a saddle point that does not have a negative curvature (Kawaguchi, 2016), so it is unclear whether this geometry-based approach can be used to obtain global convergence guarantees for first-order methods. Another approach is to show that practical deep neural networks allow some additional structure or assumption that makes non-convex optimization tractable. Under certain simplifying assumptions, it has been proven recently that there are novel loss landscape structures in deep neural networks, which may play a role in making the optimization tractable (Dauphin et al., 2014; Choromanska et al., 2015; Kawaguchi, 2016; Liang et al., 2018; Kawaguchi & Kaelbling, 2019).

Recently, it was proved in a series of papers that, if the size of a neural network is significantly larger than the size of the dataset, the (stochastic) gradient descent algorithm can find optimal parameters (Li & Liang, 2018; Du et al., 2018b; Song & Yang, 2019; Du et al., 2018a; Allen-Zhu et al., 2018b; Zou et al., 2018; Zou & Gu, 2019). In this overparametrization regime, a fully-trained deep neural network is indeed equivalent to the kernel regression predictor using the limiting NTK (8). As a consequence, the gradient descent achieves zero training loss for a deep overparameterized neural network. Under further assumptions, it can be shown that the trained networks generalize (Arora et al., 2019b; Allen-Zhu et al., 2018a; Cao & Gu, 2019). Unfortunately, there is a significant gap between the overparametrized neural networks which are provably trainable and the neural networks used in common practice. Typically, deep neural networks used in practical applications are trainable, and yet much smaller than what the previous theories require to ensure trainability. In (Kawaguchi & Huang, 2019), it is proven that gradient descent can find a global minimum for certain deep neural networks of sizes commonly encountered in practice.

Training dynamics of neural networks in the mean field setting have been studied in (Mei et al., 2019; Song et al., 2018; Araujo et al., 2019; Nguyen, 2019; Sirignano & Spiliopoulos, 2019; Chizat & Bach, 2018). Their mean field analysis describes the distributional dynamics of neural network parameters via certain nonlinear partial differential equations, in the asymptotic regime of large network sizes and large numbers of stochastic gradient descent training iterations. However, their analysis is restricted to neural networks in the mean-field framework with a normalization factor 1/m, different from our 1/\sqrt{m}, which is commonly used in modern networks (Glorot & Bengio, 2010).

2. Main results

Assumption 2.1. The activation function \sigma is smooth, and for any 1 \le r \le 2p^* + 1 there exists a constant C_r > 0 such that the r-th derivative of \sigma satisfies \|\sigma^{(r)}(x)\|_\infty \le C_r.

Assumption 2.1 is satisfied by common activation units such as the sigmoid and the hyperbolic tangent. Moreover, the softplus activation, which is defined as \sigma_a(x) = \ln(1 + \exp(ax))/a, satisfies Assumption 2.1 for any hyperparameter a \in \mathbb{R}_{>0}. The softplus activation can approximate the ReLU activation to any desired accuracy,

\sigma_a(x) \to \mathrm{relu}(x) \quad \text{as } a \to \infty,

where relu represents the ReLU activation.

Assumption 2.2. There exists a small constant c > 0 such that the training inputs satisfy c < \|x_\alpha\|_2 \le c^{-1}. For any 1 \le r \le 2p^* + 1, there exists a constant c_r > 0 such that for any distinct indices 1 \le \alpha_1, \alpha_2, \cdots, \alpha_r \le n, the smallest singular value of the data matrix [x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_r}] is at least c_r.

For more general input data, we can always normalize them such that c < \|x_\alpha\|_2 \le c^{-1}. Under this normalization, for the randomly initialized deep neural network, it holds that \|x^{(\ell)}_\alpha\|_2 = O(1) for all 1 \le \ell \le H, where the implicit constants depend on \ell. The second part of Assumption 2.2 requires that any small number of input data x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_r} are linearly independent.

Theorem 2.3. Under Assumptions 2.1 and 2.2, there exists an infinite family of operators K^{(r)}_t : \mathcal{X}^r \mapsto \mathbb{R} for r \ge 2 such that the continuous time gradient descent dynamic is given by an infinite hierarchy of ordinary differential equations, i.e., the NTH,

\partial_t (f_\alpha(t) - y_\alpha) = -\frac{1}{n} \sum_{\beta=1}^{n} K^{(2)}_t(x_\alpha, x_\beta)(f_\beta(t) - y_\beta),   (9)

and for any r \ge 2,

\partial_t K^{(r)}_t(x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_r}) = -\frac{1}{n} \sum_{\beta=1}^{n} K^{(r+1)}_t(x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_r}, x_\beta)(f_\beta(t) - y_\beta).   (10)

There exists a deterministic family (independent of m) of operators K^{(r)} : \mathcal{X}^r \mapsto \mathbb{R} for 2 \le r \le p^* + 1, with K^{(r)} = 0 if r is odd, such that with high probability with respect to the random initialization there exist some constants C, C' > 0 such that

\left\| K^{(r)}_0 - \frac{K^{(r)}}{m^{r/2-1}} \right\|_\infty \lesssim \frac{(\ln m)^C}{m^{(r-1)/2}},   (11)

and for 0 \le t \le m^{p^*/(2(p^*+1))}/(\ln m)^{C'},

\|K^{(r)}_t\|_\infty \lesssim \frac{(\ln m)^C}{m^{r/2-1}}.   (12)

It was proven in (Du et al., 2018a; Lee et al., 2019) that the change of the NTK for a wide deep neural network is upper bounded by O(1/\sqrt{m}). However, the numerical experiments in (Lee et al., 2019) indicate that the change of the NTK is closer to O(1/m). As a corollary of Theorem 2.3, we confirm this numerical observation: the NTK varies at a rate of order O(1/m).

Corollary 2.4. Under Assumptions 2.1 and 2.2, the NTK K^{(2)}_t(\cdot, \cdot) varies at a rate of order O(1/m): with high probability with respect to the random initialization, there exist some constants C, C' > 0 such that for 0 \le t \le m^{p^*/(2(p^*+1))}/(\ln m)^{C'}, it holds

\|\partial_t K^{(2)}_t\|_\infty \lesssim \frac{(1 + t)(\ln m)^C}{m}.

As another corollary of Theorem 2.3, for a fully-connected wide neural network with m \gtrsim n^3, the gradient descent converges to zero training loss at a linear rate.

Corollary 2.5. Under Assumptions 2.1 and 2.2, we further assume that there exists \lambda > 0 (which might depend on n) such that

\lambda_{\min}\left( [K^{(2)}_0(x_\alpha, x_\beta)]_{1 \le \alpha, \beta \le n} \right) \ge \lambda,   (13)

and that the width m of the neural network satisfies

m \ge C' \frac{n^3}{\lambda} (\ln m)^C \ln(n/\varepsilon)^2,   (14)

for some large constants C, C' > 0. Then with high probability with respect to the random initialization, the training error decays exponentially,

\sum_{\beta=1}^{n} (f_\beta(t) - y_\beta)^2 \lesssim n e^{-\frac{\lambda t}{2n}},

which reaches \varepsilon at time t \asymp (n/\lambda) \ln(n/\varepsilon).

The assumption that there exists \lambda > 0 such that \lambda_{\min}([K^{(2)}_0(x_\alpha, x_\beta)]_{1 \le \alpha, \beta \le n}) \ge \lambda appears in many previous papers (Du et al., 2018b;a; Arora et al., 2019b). It is proven in (Du et al., 2018b;a) that if no two inputs are parallel, the smallest eigenvalue of the kernel matrix is strictly positive. In general \lambda might depend on the size n of the training data set. For two layer neural networks with random training data, quantitative estimates for \lambda are obtained in (Ghorbani et al., 2019; Zhang et al., 2019; Xie et al., 2016): it is proven that, with high probability with respect to the random training data, \lambda \gtrsim n^\beta for some 0 < \beta < 1/2.

It is proven in (Du et al., 2018a) that if there exists \lambda^{(H)} > 0 such that

\lambda_{\min}\left( [G^{(H)}_0(x_\alpha, x_\beta)]_{1 \le \alpha, \beta \le n} \right) \ge \lambda^{(H)},

then for m \ge C (n/\lambda^{(H)})^4 the gradient descent finds a global minimum. Corollary 2.5 improves this result in two ways: (i) we improve the quartic dependence on n to a cubic dependence; (ii) we recall that K^{(2)}_0 = \sum_{\ell=1}^{H+1} G^{(\ell)}_0, and those kernels G^{(\ell)}_0 are all non-negative definite. The smallest eigenvalue of K^{(2)}_0 is typically much bigger than that of G^{(H)}_0, i.e., \lambda \gg \lambda^{(H)}. Moreover, since K^{(2)}_0 is a sum of H + 1 non-negative definite operators, we expect that \lambda gets larger if the depth H is larger.

The NTH, i.e., (9) and (10), is just an infinite sequence of relationships; by itself it cannot be used to determine the dynamic of the NTK. However, thanks to the a priori estimates (12) of the higher order kernels, for any p \le p^* it holds with high probability that \|K^{(p+1)}_t\|_\infty \lesssim (\ln m)^C/m^{p/2}. The derivative \partial_t K^{(p)}_t is an expression involving the higher order kernel K^{(p+1)}_t, which is small provided that p is large enough. Therefore, we can approximate the original NTH (10) by simply setting \partial_t K^{(p)}_t = 0. In this way, we obtain the following truncated hierarchy of ordinary differential equations of p levels, which we call the truncated NTH:

\partial_t \tilde f_\alpha(t) = -\frac{1}{n} \sum_{\beta=1}^{n} \tilde K^{(2)}_t(x_\alpha, x_\beta)(\tilde f_\beta(t) - y_\beta),
\partial_t \tilde K^{(r)}_t(x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_r}) = -\frac{1}{n} \sum_{\beta=1}^{n} \tilde K^{(r+1)}_t(x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_r}, x_\beta)(\tilde f_\beta(t) - y_\beta),
\partial_t \tilde K^{(p)}_t(x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_p}) = 0,   (15)

where 2 \le r \le p - 1, with initial conditions

\tilde f_\beta(0) = f_\beta(0), \quad \beta = 1, 2, \cdots, n,
\tilde K^{(r)}_0 = K^{(r)}_0, \quad r = 2, 3, \cdots, p.
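Once the initial kernels K_0^{(2)}, ..., K_0^{(p)} and the initial outputs are available (for instance computed from a concrete network at initialization), the truncated hierarchy (15) is a finite closed system of ODEs. The following forward-Euler sketch is illustrative only; the dense-array representation of the kernels and the solver choice are our assumptions, not part of the paper.

```python
# Integrate the truncated NTH (15) by forward Euler.
import numpy as np

def truncated_nth(f0, y, K_init, T, dt):
    """K_init: dict r -> array of shape (n,)*r for r = 2..p; returns f(T) and the kernels."""
    f = f0.copy()
    K = {r: K_init[r].copy() for r in K_init}
    p = max(K)
    n = len(y)
    for _ in range(int(T / dt)):
        res = f - y
        # d/dt f_alpha = -(1/n) sum_beta K^(2)(x_alpha, x_beta) (f_beta - y_beta)
        df = -K[2] @ res / n
        dK = {}
        for r in range(2, p):
            # d/dt K^(r)(...) = -(1/n) sum_beta K^(r+1)(..., x_beta) (f_beta - y_beta):
            # a contraction of the last index of K^(r+1) against the residual vector.
            dK[r] = -np.tensordot(K[r + 1], res, axes=([-1], [0])) / n
        f += dt * df
        for r in dK:
            K[r] += dt * dK[r]          # K^(p) stays frozen, as in (15)
    return f, K
```

In practice a higher-order ODE solver would be preferable; forward Euler simply keeps the sketch short.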

In the following theorem, we show that this system of truncated equations (15) approximates the dynamic of the NTK up to arbitrary precision, provided that p is large enough.

Theorem 2.6. Under Assumptions 2.1 and 2.2, we take an even p \le p^* and further assume that

\lambda_{\min}\left( [K^{(2)}_0(x_\alpha, x_\beta)]_{1 \le \alpha, \beta \le n} \right) \ge \lambda.   (16)

Then there exist constants c, C, C' > 0 such that for

t \le \min\left\{ c\sqrt{\lambda m/n}/(\ln m)^C, \; m^{p^*/(2(p^*+1))}/(\ln m)^{C'} \right\},   (17)

the dynamic (9) can be approximated by the truncated dynamic (15),

\left( \sum_{\beta=1}^{n} (f_\beta(t) - \tilde f_\beta(t))^2 \right)^{1/2} \lesssim \frac{(1+t)\, t^{p-1} \sqrt{n}}{m^{p/2}} \min\left\{ t, \frac{n}{\lambda} \right\},   (18)

and

|K^{(2)}_t(x_\alpha, x_\beta) - \tilde K^{(2)}_t(x_\alpha, x_\beta)| \lesssim \frac{(1+t)\, t^{p-1}}{m^{p/2}} \left( 1 + \frac{(1+t)\, t\, (\ln m)^C}{m} \min\left\{ t, \frac{n}{\lambda} \right\} \right).   (19)

We remark that the error terms, i.e., the right-hand sides of (18) and (19), can be made arbitrarily small, provided that p is large enough. In other words, if we take p large enough, the truncated NTH (15) can approximate the original dynamic (9), (10) up to any precision, provided that the time constraint (17) is satisfied. To better illustrate the scaling in Theorem 2.6, let us treat \lambda, \varepsilon, n as constants, ignore the \log n, \log m factors, and focus on the trade-off between time t and width m. Then (17) simplifies to t \lesssim m^{p^*/(2(p^*+1))}. If p^* is sufficiently large, the exponent approaches 1/2, and the condition is equivalent to t being much smaller than m^{1/2}. In this regime, (18) simplifies to

\left( \sum_{\beta=1}^{n} (f_\beta(t) - \tilde f_\beta(t))^2 \right)^{1/2} \lesssim \left( \frac{t}{\sqrt{m}} \right)^p,   (20)

and similarly (19) simplifies to

|K^{(2)}_t(x_\alpha, x_\beta) - \tilde K^{(2)}_t(x_\alpha, x_\beta)| \lesssim \left( \frac{t}{\sqrt{m}} \right)^p.   (21)

The error terms in (20) and (21) are smaller than any \delta > 0, provided we take p \ge \log(1/\delta)/\log(\sqrt{m}/t).
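As a worked example of this last bound, the smallest admissible truncation level for a target accuracy \delta can be computed directly (the numerical values below are arbitrary illustrations):

```python
# Smallest integer p with (t/sqrt(m))^p <= delta, i.e. p >= log(1/delta)/log(sqrt(m)/t).
import math

def min_truncation_level(m, t, delta):
    assert t < math.sqrt(m), "the bound applies only in the regime t << sqrt(m)"
    return max(2, math.ceil(math.log(1 / delta) / math.log(math.sqrt(m) / t)))

print(min_truncation_level(m=1e6, t=10.0, delta=1e-9))   # -> 5
```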
Now take t \asymp (n/\lambda)\ln(n/\varepsilon), so that Corollary 2.5 guarantees the convergence of the dynamics, and consider two special cases. (i) If we take p = 2, then the error in (18) is O(n^{7/2}\ln(n/\varepsilon)^3/\lambda^3 m), which is negligible provided that the width m is much bigger than n^{7/2}. We conclude that if m is much bigger than n^{7/2}, the truncated NTH gives a complete description of the original dynamic of the NTK up to equilibrium. The condition that m is much bigger than n^{7/2} is better than the previous best available one, which requires m \gtrsim n^4. (ii) If we take p = 3, then the error in (18) is O(n^{9/2}\ln(n/\varepsilon)^4/\lambda^4 m^{3/2}) when t \asymp (n/\lambda)\ln(n/\varepsilon), which is negligible provided that the width m is much bigger than n^3. We conclude that if m is much bigger than n^3, the truncated NTH gives a complete description of the original dynamic of the NTK up to equilibrium. Finally, we note that the estimates in Theorem 2.6 clearly improve for smaller t.

The previous convergence theory of overparametrized neural networks works only for very wide neural networks, i.e., m \gtrsim n^3. For any width (not necessarily m \gtrsim n^3), Theorem 2.6 guarantees that the truncated NTH approximates the training dynamics of deep neural networks. The effect of the width appears in the approximation time and in the error terms (18) and (19): the wider the neural network is, the longer the truncated dynamic (15) approximates the training dynamic and the smaller the approximation error is. We recall from (7) that the NTK is the sum of H + 1 non-negative definite operators, K^{(2)}_t = \sum_{\ell=1}^{H+1} G^{(\ell)}_t. We expect that \lambda as in (16) gets bigger if the depth H is larger. Therefore, large width and depth make the truncated dynamic (15) a better approximation.

Thanks to Theorem 2.6, the truncated NTH (15) provides a good approximation for the evolution of the NTK. The truncated dynamic can also be used to predict the output on new data points. Recall that the training data are \{(x_\beta, y_\beta)\}_{1 \le \beta \le n} \subset \mathbb{R}^d \times \mathbb{R}, and the goal is to predict the output on a new data point x. To do this, we can first use the truncated dynamic to solve for the approximated outputs \{\tilde f_\beta(t)\}_{1 \le \beta \le n}. Then the prediction on the new test point x \in \mathbb{R}^d can be estimated by sequentially solving for the higher order kernels \tilde K^{(p)}_t(x, \mathcal{X}^{p-1}), \tilde K^{(p-1)}_t(x, \mathcal{X}^{p-2}), \cdots, \tilde K^{(2)}_t(x, \mathcal{X}) and for \tilde f_x(t):
\partial_t \tilde f_x(t) = -\frac{1}{n} \sum_{\beta=1}^{n} \tilde K^{(2)}_t(x, x_\beta)(\tilde f_\beta(t) - y_\beta),
\partial_t \tilde K^{(r)}_t(x, x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_{r-1}}) = -\frac{1}{n} \sum_{\beta=1}^{n} \tilde K^{(r+1)}_t(x, x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_{r-1}}, x_\beta)(\tilde f_\beta(t) - y_\beta),
\partial_t \tilde K^{(p)}_t(x, x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_{p-1}}) = 0,   (22)

where 2 \le r \le p - 1.
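The prediction scheme (22) reuses the training residuals already produced by (15); only the "test slices" of the kernels need to be evolved. A short sketch (again illustrative only, with the initial test-point kernel slices assumed to be given):

```python
# Evolve the test-point slices Kx[r][i1,...,i_{r-1}] ~ K^(r)(x, x_{i1}, ..., x_{i_{r-1}})
# along a precomputed training run, then read off the prediction f~_x(T), as in (22).
import numpy as np

def predict_new_point(fx0, Kx_init, residual_path, dt):
    """residual_path: array of shape (steps, n) holding f~(t_k) - y along the training run."""
    fx = fx0
    Kx = {r: Kx_init[r].copy() for r in Kx_init}   # r -> array of shape (n,)*(r-1)
    p = max(Kx)
    n = residual_path.shape[1]
    for res in residual_path:
        dfx = -Kx[2] @ res / n                      # first equation of (22)
        for r in range(2, p):
            Kx[r] = Kx[r] - dt * np.tensordot(Kx[r + 1], res, axes=([-1], [0])) / n
        fx += dt * dfx                              # Kx[p] stays frozen, as in (22)
    return fx
```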

3. Technique overview

In general, the summands appearing in the kernel K^{(r)}_t(x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_r}) are products of inner products of vectors obtained in the following way: starting from one of the vectors

\frac{a_t}{\sqrt{m}}, \quad \frac{1}{\sqrt{m}}, \quad \{x^{(1)}_\beta, x^{(2)}_\beta, \cdots, x^{(H)}_\beta\}_{\beta \in \{\alpha_1, \alpha_2, \cdots, \alpha_r\}},

(i) multiply by one of the matrices

\left\{ \frac{W^{(2)}_t}{\sqrt{m}}, \frac{(W^{(2)}_t)^\top}{\sqrt{m}}, \cdots, \frac{W^{(H)}_t}{\sqrt{m}}, \frac{(W^{(H)}_t)^\top}{\sqrt{m}} \right\}, \quad \{\sigma'_1(x_\beta), \sigma'_2(x_\beta), \cdots, \sigma'_H(x_\beta)\}_{\beta \in \{\alpha_1, \alpha_2, \cdots, \alpha_r\}};   (23)

(ii) multiply by one of the matrices

\mathrm{diag}(\cdots), \quad \sigma^{(s)}(x_\beta)\, \underbrace{\mathrm{diag}(\cdots) \cdots \mathrm{diag}(\cdots)}_{s-1 \text{ terms}},   (24)

for s \ge 2, where \mathrm{diag}(\cdots) is the diagonalization of a vector obtained by recursively using (i) and (ii).

To describe the vectors appearing in K^{(r)}_t(x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_r}) in a formal way, we need to introduce some more notation. We denote by \mathcal{D}_0 the set of expressions of the form

\mathcal{D}_0 := \{ e_s e_{s-1} \cdots e_1 e_0 : 0 \le s \le 4H - 3 \},   (25)

where e_j is chosen from the following sets:

e_0 \in \left\{ a_t, \{\sqrt{m}\, x^{(1)}_\beta, \sqrt{m}\, x^{(2)}_\beta, \cdots, \sqrt{m}\, x^{(H)}_\beta\}_{1 \le \beta \le n} \right\},

and for 1 \le j \le s,

e_j \in \left\{ \frac{W^{(2)}_t}{\sqrt{m}}, \frac{(W^{(2)}_t)^\top}{\sqrt{m}}, \cdots, \frac{W^{(H)}_t}{\sqrt{m}}, \frac{(W^{(H)}_t)^\top}{\sqrt{m}}, \{\sigma'_1(x_\beta), \sigma'_2(x_\beta), \cdots, \sigma'_H(x_\beta)\}_{1 \le \beta \le n} \right\}.

We remark that, from expression (7), each summand in K^{(2)}_t(x_{\alpha_1}, x_{\alpha_2}) is of the form

\frac{\langle v_1(t), v_2(t) \rangle}{m}, \quad \text{or} \quad \frac{\langle v_1(t), v_2(t) \rangle}{m} \frac{\langle v_3(t), v_4(t) \rangle}{m},

where v_1(t), v_2(t), v_3(t), v_4(t) \in \mathcal{D}_0. But the set \mathcal{D}_0 contains more terms than those appearing in K^{(2)}_t(\cdot, \cdot). Given that we have constructed \mathcal{D}_0, \mathcal{D}_1, \cdots, \mathcal{D}_r, we denote by \mathcal{D}_{r+1} the set of expressions of the form

\mathcal{D}_{r+1} := \{ e_s e_{s-1} \cdots e_1 e_0 : 0 \le s \le 4H - 3 \},   (26)

where e_j is chosen from the following sets (notice that we have included 1 in the following set, which does not appear in the definition of \mathcal{D}_0):

e_0 \in \left\{ a_t, 1, \{\sqrt{m}\, x^{(1)}_\beta, \sqrt{m}\, x^{(2)}_\beta, \cdots, \sqrt{m}\, x^{(H)}_\beta\}_{1 \le \beta \le n} \right\},

and for 1 \le j \le s, e_j belongs to one of the sets

\left\{ \frac{W^{(2)}_t}{\sqrt{m}}, \frac{(W^{(2)}_t)^\top}{\sqrt{m}}, \cdots, \frac{W^{(H)}_t}{\sqrt{m}}, \frac{(W^{(H)}_t)^\top}{\sqrt{m}}, \{\sigma'_1(x_\beta), \sigma'_2(x_\beta), \cdots, \sigma'_H(x_\beta)\}_{1 \le \beta \le n} \right\},

\{\mathrm{diag}(d) : d \in \mathcal{D}_0 \cup \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_r\},

\{\sigma^{(u+1)}_\ell(x_\beta)\, \mathrm{diag}(d_1)\mathrm{diag}(d_2) \cdots \mathrm{diag}(d_u) : 1 \le \ell \le H,\ 1 \le \beta \le n,\ 1 \le u \le r,\ d_1, d_2, \cdots, d_u \in \mathcal{D}_0 \cup \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_r\}.

Moreover, the total number of diag operations in an expression e_s e_{s-1} \cdots e_1 e_0 \in \mathcal{D}_{r+1} is exactly r + 1. We remark that if d \in \mathcal{D}_s, then it contains s diag operations; on the other hand, by definition, we view \mathrm{diag}(d) as an element with s + 1 diag operations, because the diag in \mathrm{diag}(d) is counted as one diag operation.

We will show in the supplementary materials that each summand in K^{(r)}_t(x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_r}) is of the form

\frac{1}{m^{r/2-1}} \prod_{j=1}^{s} \frac{\langle v_{2j-1}(t), v_{2j}(t) \rangle}{m}, \quad 1 \le s \le r, \quad v_1(t), v_2(t), \cdots, v_{2s}(t) \in \mathcal{D}_0 \cup \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_{r-2}.   (27)

The initial value K^{(r)}_0(x_{\alpha_1}, x_{\alpha_2}, \cdots, x_{\alpha_r}) can be estimated by successively conditioning based on the depth of the neural network. A convenient scheme is given by the tensor program (Yang, 2019), which was developed to characterize the scaling limit of neural network computations. In the supplementary materials, we show that at time t = 0 the vectors v_j(0) in (27) are combinations of projections of independent Gaussian vectors. As a consequence, \langle v_{2j-1}(0), v_{2j}(0) \rangle / m concentrates around a certain constant with high probability, and so does the product \prod_{j=1}^{s} \langle v_{2j-1}(0), v_{2j}(0) \rangle / m. This gives the claim (11).

In the supplementary materials, we consider the quantity

\xi(t) = \max\{ \|v_j(t)\|_\infty : v_j(t) \in \mathcal{D}_0 \cup \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_{p^*-1} \}.

Again using the tensor program, we show that with high probability \|v_j(0)\|_\infty \lesssim (\ln m)^C. This gives the estimate of \xi(t) at t = 0. Next we show that the (p^*+1)-th derivative of \xi(t) can be controlled by \xi(t) itself. This gives a self-consistent differential inequality for \xi(t):

\partial_t^{(p^*+1)} \xi(t) \lesssim \frac{\xi(t)^{2p^*}}{m^{p^*/2}}.   (28)

Combining this with the initial estimate of \xi(t), it follows that for time 0 \le t \le m^{p^*/(2(p^*+1))}/(\ln m)^{C'}, it holds that \xi(t) \lesssim (\ln m)^C; in particular \|v_j(t)\|_\infty \lesssim (\ln m)^C. Then the claim (12) in Theorem 2.3 follows.

Thanks to the a priori estimate (12), we show that along the continuous time gradient descent, the higher order kernels K^{(r)}_t vary slowly. We prove Corollaries 2.4 and 2.5, and Theorem 2.6, in the supplementary materials by a Gronwall type argument.
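To make the higher order kernels concrete, note that differentiating (7) along the flow identifies the first kernel beyond the NTK as K_t^{(3)}(x_1, x_2, x_3) = \langle \nabla_\theta K_t^{(2)}(x_1, x_2), \nabla_\theta f(x_3, \theta_t) \rangle; this is the same gradient-of-kernel construction used for the discrete-time kernels in Section 4 below. The sketch here (illustrative only; the one-hidden-layer tanh network and the widths are our assumptions) computes this quantity by forward-over-reverse automatic differentiation, and can be used to probe the m^{-1/2} decay asserted in (11)-(12).

```python
# K^(3)(x1, x2, x3) as a JVP of the empirical NTK in the direction grad_theta f(x3).
import jax
import jax.numpy as jnp

def f(params, x):
    W, a = params
    return a @ (jnp.tanh(W @ x) / jnp.sqrt(W.shape[0]))

def K2(params, x1, x2):
    g1, g2 = jax.grad(f)(params, x1), jax.grad(f)(params, x2)
    return sum(jnp.vdot(u, v) for u, v in
               zip(jax.tree_util.tree_leaves(g1), jax.tree_util.tree_leaves(g2)))

def K3(params, x1, x2, x3):
    tangent = jax.grad(f)(params, x3)
    _, dK = jax.jvp(lambda p: K2(p, x1, x2), (params,), (tangent,))
    return dK

key = jax.random.PRNGKey(1)
x1, x2, x3 = [v / jnp.linalg.norm(v) for v in jax.random.normal(key, (3, 8))]
for m in [256, 1024, 4096]:
    kW, ka = jax.random.split(jax.random.PRNGKey(m))
    params = (jax.random.normal(kW, (m, 8)), jax.random.normal(ka, (m,)))
    print(m, float(K3(params, x1, x2, x3)))   # expect roughly a 1/sqrt(m) decay
```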

4. Discussion and future directions

In this paper, we study the continuous time gradient descent (gradient flow) of deep fully-connected neural networks. We show that the training dynamic is given by a data dependent infinite hierarchy of ordinary differential equations, i.e., the NTH. We also show that this NTH dynamic can be approximated by a finite truncated dynamic up to any precision. This description makes it possible to directly study the change of the NTK for deep neural networks. Here we list some future directions.

1. We mainly study deep fully-connected neural networks in this paper; we believe the same statements can be proven for convolutional and residual neural networks.

2. We focus on the continuous time gradient descent for simplicity. The approach developed here can be generalized to analyze discrete time gradient descent with small step size. We elaborate the main idea here. The discrete time gradient descent is given by

\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) = \theta_t - \frac{\eta}{n} \sum_{\beta=1}^{n} \nabla_\theta f_\beta(t)(f_\beta(t) - y_\beta),

where \eta is the learning rate. We write the NTK as K^{(2)}(x_\alpha, x_\beta; \theta_t) to make the dependence on \theta_t explicit. To estimate the NTK K^{(2)}(x_{\alpha_1}, x_{\alpha_2}; \theta_{t+1}) at time t + 1, we use the Taylor expansion

K^{(2)}(x_{\alpha_1}, x_{\alpha_2}; \theta_{t+1}) = K^{(2)}(x_{\alpha_1}, x_{\alpha_2}; \theta_t) + \sum_{r=3}^{s-1} \frac{(-\eta)^r}{n^r} \sum_{1 \le \beta_1, \beta_2, \cdots, \beta_{r-2} \le n} \mathcal{K}^{(r)}(x_{\alpha_1}, x_{\alpha_2}, x_{\beta_1}, \cdots, x_{\beta_{r-2}}; \theta_t) \times (f_{\beta_1}(t) - y_{\beta_1}) \cdots (f_{\beta_{r-2}}(t) - y_{\beta_{r-2}}) + \Omega_s,   (29)

where \Omega_s is the error term of the truncation and the higher order kernels \mathcal{K}^{(r)} are given by

\mathcal{K}^{(r)}(x_{\alpha_1}, x_{\alpha_2}, x_{\beta_1}, \cdots, x_{\beta_{r-2}}; \theta_t) = \nabla^{(r-2)}_\theta K^{(2)}(x_{\alpha_1}, x_{\alpha_2}; \theta_t)\big( \nabla_\theta f_{\beta_1}(t), \cdots, \nabla_\theta f_{\beta_{r-2}}(t) \big).

We note that these kernels \mathcal{K}^{(r)} are in general different from the kernels K^{(r)}_t in the NTH. A similar argument as for (12) can be used to derive a priori estimates of the kernels \mathcal{K}^{(r)}: we expect that \|\mathcal{K}^{(r)}\|_\infty \lesssim (\ln m)^C/m^{r/2-1} with high probability with respect to the random initialization. Therefore \|\Omega_s\|_\infty \le O((\ln m)^C \eta^s / n^2 m^{s/2-1}), which can be arbitrarily small provided that \eta \ll \sqrt{m} and s is large enough. A similar procedure can be applied to the NTH in the discrete time dynamics; we will not go into details here. To conclude this discussion, we believe that, under the assumption \eta \ll \sqrt{m}, our analysis in continuous time can be carried over to the discrete time dynamics.

3. It will be interesting to further analyze the behaviors of the truncated dynamics (15), and to understand why it is better than the kernel regression using the limiting NTK. For example, if we truncate the dynamic at p = 3,

\partial_t \tilde f_\alpha(t) = -\frac{1}{n} \sum_{\beta=1}^{n} \tilde K^{(2)}_t(x_\alpha, x_\beta)(\tilde f_\beta(t) - y_\beta),
\partial_t \tilde K^{(2)}_t(x_{\alpha_1}, x_{\alpha_2}) = -\frac{1}{n} \sum_{\beta=1}^{n} \tilde K^{(3)}_t(x_{\alpha_1}, x_{\alpha_2}, x_\beta)(\tilde f_\beta(t) - y_\beta),
\partial_t \tilde K^{(3)}_t(x_{\alpha_1}, x_{\alpha_2}, x_{\alpha_3}) = 0.   (30)

The difference between (30) and the kernel regression using the limiting NTK is that in (30) the kernel \tilde K^{(2)}_t changes along time at a rate of O((\ln m)^C/m). We denote the residual vector as \tilde r(t) = (\tilde f_1(t) - y_1, \tilde f_2(t) - y_2, \cdots, \tilde f_n(t) - y_n)^\top. Since \tilde K^{(3)}_t does not depend on time t, we can integrate the second equation in (30) to get \tilde K^{(2)}_t = \tilde K^{(2)}_0 - \frac{1}{n} \tilde K^{(3)}_0\big[ \int_0^t \tilde r(s)\,\mathrm{d}s \big]. We denote \tilde R(t) = \int_0^t \tilde r(s)\,\mathrm{d}s, and plug the previous relation into the first equation of (30) to obtain the following system of ordinary differential equations:

\partial_t \tilde r(t) = -\frac{1}{n} \tilde K^{(2)}_0[\tilde r(t)] + \frac{1}{n^2} \tilde K^{(3)}_0[\tilde R(t), \tilde r(t)],
\partial_t \tilde R(t) = \tilde r(t).

The above system of ordinary differential equations can easily be solved numerically. It will be interesting to understand under what conditions the change of the kernel K^{(2)}_t helps optimization and generalization.
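A short numerical sketch of the reduced system above (illustrative only; the kernels and the initial residual below are synthetic stand-ins rather than kernels computed from a network):

```python
# Forward-Euler integration of d/dt r~ = -(1/n) K2_t r~ with K2_t = K2 - (1/n) K3[R~],
# i.e. the reduced p = 3 system for (r~(t), R~(t)).
import numpy as np

def solve_p3_system(K2, K3, r0, T, dt):
    n = len(r0)
    r, R = r0.copy(), np.zeros_like(r0)
    for _ in range(int(T / dt)):
        # K3[R] is the matrix sum_beta K3[:, :, beta] * R[beta]; applying it to r gives
        # the kernel-drift correction on top of the frozen-NTK decay.
        K2_t = K2 - (1.0 / n) * np.tensordot(K3, R, axes=([2], [0]))
        dr = -(1.0 / n) * K2_t @ r
        r, R = r + dt * dr, R + dt * r
    return r, R

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n)) / np.sqrt(n)
K2 = A @ A.T + 0.5 * np.eye(n)                     # toy positive definite kernel matrix
K3 = 1e-2 * rng.standard_normal((n, n, n))
K3 = (K3 + K3.transpose(1, 0, 2)) / 2              # symmetrize in the first two slots
r_final, _ = solve_p3_system(K2, K3, rng.standard_normal(n), T=50.0, dt=0.01)
print(np.linalg.norm(r_final))
```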

4. The optimal condition for the NTH approximation. We have shown that m \gtrsim n^3 guarantees that the NTH approximates the deep neural network up to the time when both of them find global minimizers. Our analysis loses a factor of n by estimating the norm of the kernels K^{(r)} through their entry-wise L^\infty norm. We believe that this loss can be recovered by a more careful analysis, and that one can improve the condition to m \gtrsim n^2. To further improve on this condition, one would have to analyze other cancellation effects. Assuming such an analysis is possible, one might reach the condition m \gtrsim n. It would be interesting to see whether this is the best possible condition for approximating deep neural networks by the NTH, i.e., whether one can show with some example that the deep neural network converges for m \ll n while the approximation by the NTH fails.

Acknowledgements

The work of J.H. is supported by the Institute for Advanced Study. The work of H.-T. Y. is partially supported by NSF Grants DMS-1606305 and DMS-1855509, and a Simons Investigator award. The authors would like to thank the reviewers for their thoughtful comments and efforts towards improving our manuscript.

References

Allen-Zhu, Z. and Li, Y. What can ResNet learn efficiently, going beyond kernels? arXiv preprint arXiv:1905.10337, 2019.

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In ICML, arXiv:1811.03962, 2018b.

Araujo, D., Oliveira, R. I., and Yukimura, D. A mean-field limit for certain deep neural networks. arXiv preprint arXiv:1906.00193, 2019.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019a.

Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019b.

Bhojanapalli, S., Neyshabur, B., and Srebro, N. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pp. 3873-3881, 2016.

Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, pp. 10835-10845, 2019.

Chizat, L. and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pp. 3036-3046, 2018.

Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., and LeCun, Y. The loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 192-204, 2015.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493-2537, 2011.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933-2941, 2014.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In ICML, arXiv:1811.03804, 2018a.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In ICLR, arXiv:1810.02054, 2018b.

Dyer, E. and Gur-Ari, G. Asymptotics of wide networks from Feynman diagrams. arXiv preprint arXiv:1909.11304, 2019.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pp. 797-842, 2015.

Ge, R., Lee, J. D., and Ma, T. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pp. 2973-2981, 2016.

Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Kingsbury, B., et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571-8580, 2018.

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pp. 1724-1732. JMLR.org, 2017.

Kawaguchi, K. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586-594, 2016.

Kawaguchi, K. and Huang, J. Gradient descent finds global minima for generalizable deep neural networks of practical sizes. arXiv preprint arXiv:1908.02419, 2019.

Kawaguchi, K. and Kaelbling, L. P. Elimination of all bad local minima in deep learning. arXiv preprint arXiv:1901.00279, 2019.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246-1257, 2016.

Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157-8166, 2018.

Liang, S., Sun, R., Lee, J. D., and Srikant, R. Adding one neuron can eliminate all bad local minima. In Advances in Neural Information Processing Systems, 2018.

Mei, S., Misiakiewicz, T., and Montanari, A. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015, 2019.

Nguyen, P.-M. Mean field limit of the learning dynamics of multilayer neural networks. arXiv preprint arXiv:1902.02880, 2019.

Park, D., Kyrillidis, A., Caramanis, C., and Sanghavi, S. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. arXiv preprint arXiv:1609.03240, 2016.

Sainath, T. N., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. Deep convolutional neural networks for LVCSR. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8614-8618. IEEE, 2013.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

Sirignano, J. and Spiliopoulos, K. Mean field analysis of deep neural networks. arXiv preprint arXiv:1903.04440, 2019.

Song, M., Montanari, A., and Nguyen, P. A mean field view of the landscape of two-layers neural networks. Proceedings of the National Academy of Sciences, 115:E7665-E7671, 2018.

Song, Z. and Yang, X. Quadratic suffices for over-parametrization via matrix Chernoff bound. arXiv preprint arXiv:1906.03593, 2019.

Sun, J., Qu, Q., and Wright, J. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853-884, 2016.

Sun, J., Qu, Q., and Wright, J. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131-1198, 2018.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Xie, B., Liang, Y., and Song, L. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.

Yang, G. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. CoRR, abs/1902.04760, 2019. URL http://arxiv.org/abs/1902.04760.

Yehudai, G. and Shamir, O. On the power and limitations of random features for understanding neural networks. arXiv preprint arXiv:1904.00687, 2019.

Zhang, G., Martens, J., and Grosse, R. B. Fast convergence of natural gradient descent for over-parameterized neural networks. In Advances in Neural Information Processing Systems, pp. 8080-8091, 2019.

Zou, D. and Gu, Q. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, pp. 2053-2062, 2019.

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.