Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics

Greg Yang 1 * Etai Littwin 2 *

Abstract

Yang (2020a) recently showed that the Neural Tangent Kernel (NTK) at initialization has an infinite-width limit for a large class of architectures including modern staples such as ResNet and Transformers. However, their analysis does not apply to training. Here, we show the same neural networks (in the so-called NTK parametrization) during training follow a kernel gradient descent dynamics in function space, where the kernel is the infinite-width NTK. This completes the proof of the architectural universality of NTK behavior. To achieve this result, we apply the Tensor Programs technique: Write the entire SGD dynamics inside a Tensor Program and analyze it via the Master Theorem. To facilitate this proof, we develop a graphical notation for Tensor Programs. See the full version of our paper at arxiv.org/abs/2105.03703.

1. Introduction

(Jacot et al., 2018)'s pioneering work showed that a multi-layer perceptron (MLP) trained by gradient descent (GD) evolves like a linear model. This spurred a flurry of research papers using this insight to tackle the core questions of deep learning theory, from optimization to generalization, in both finite and infinite width regimes. (Jacot et al., 2018)'s argument consists of two observations:

NTKINIT For the output of a network f(ξ; w) with parameters w given example ξ, (Jacot et al., 2018) identified the kernel K(ξ, ξ̄) = ⟨∇f(ξ; w), ∇f(ξ̄; w)⟩, known as the Neural Tangent Kernel (NTK). They showed that if f is parametrized and initialized appropriately, then K converges to a deterministic kernel K̊ as the width of the network tends to infinity.

NTKTRAIN As the infinitely wide network is trained by gradient descent, the NTK remains frozen in its initial state, and the network evolves by kernel gradient descent with kernel K̊.

In (Yang, 2020a), the NTKINIT property was proven to hold for standard architectures, meaning any composition of MLPs, recurrent neural networks (RNNs), LSTMs (Hochreiter & Schmidhuber, 1997), gated recurrent units (GRUs) (Cho et al., 2014), convolutions (Fukushima, 1975; 1980; Lecun et al., 1998; 2000; Rumelhart et al., 1986), residual connections (He et al., 2016; Huang et al., 2017), batch normalization (Ioffe & Szegedy, 2015), graph neural networks (Bruna et al., 2014; Defferrard et al., 2016; Duvenaud et al., 2015; Henaff et al., 2015; Kipf & Welling, 2017), and attention (Bahdanau et al., 2015; Vaswani et al., 2017), along with arbitrary weight sharing between components. More generally, it holds for any architecture expressible in a so-called Tensor Program (Yang, 2019b;a; 2020a;b), of which the standard architectures are a subset. However, their reasoning is limited to initialization only.

A statement is architecturally universal if it holds for any reasonable neural architecture. This is an informal property, but here we formalize it by taking "reasonable" to mean "expressible in Tensor Programs." By the expressiveness of such programs (Yang, 2019a; 2020a), architectural universality is a fairly robust notion that covers present (and, we expect, future) architectures comprehensively. In this terminology, (Yang, 2020a) showed that NTKINIT is architecturally universal.

Our Contribution We show the architectural universality of the entire NTK theory by proving NTKTRAIN for the same architectures discussed above, including all standard architectures. In the process, we introduce a new graphical form of Tensor Programs that is both required in our proofs and useful for the pedagogy of Tensor Programs.

The Tensor Program Series This paper follows (Yang, 2019b;a; 2020a;b; Yang & Hu, 2020) in the series. While we number this paper "IIb" right after (Yang, 2020a), we actually need the complete theoretical foundation developed in III (Yang, 2020b). See Footnote 21 for more details.

*Equal contribution. 1 Microsoft Research 2 Apple Research.
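The NTKINIT observation can be checked numerically in the simplest setting: for a 1-hidden-layer network with scalar input in the NTK parametrization, the empirical kernel ⟨∇f(ξ; w), ∇f(ξ̄; w)⟩ fluctuates across random initializations at small width but concentrates as the width grows. A minimal numpy sketch (the tanh nonlinearity and the specific inputs are our own toy choices, not from the paper):

```python
import numpy as np

def empirical_ntk(n, xi, xi_bar, rng):
    # 1-hidden-layer net in NTK parametrization (d = 1):
    #   f(xi) = v @ tanh(u * xi) / sqrt(n),  u, v ~ N(0, 1)
    u = rng.standard_normal(n)
    v = rng.standard_normal(n)

    def grad_f(x):
        h = u * x
        dv = np.tanh(h) / np.sqrt(n)                       # df/dv
        du = v * (1.0 - np.tanh(h) ** 2) * x / np.sqrt(n)  # df/du
        return np.concatenate([du, dv])

    # empirical NTK: inner product of parameter gradients at xi and xi_bar
    return grad_f(xi) @ grad_f(xi_bar)

rng = np.random.default_rng(0)
for n in [100, 10_000, 1_000_000]:
    ks = [empirical_ntk(n, 0.5, 0.8, rng) for _ in range(3)]
    print(n, np.round(ks, 3))  # spread across draws shrinks as n grows
```

At width 100 the three draws differ visibly; at width 10⁶ they agree to a couple of decimals, illustrating convergence of K to a deterministic K̊.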
Correspondence to: Greg Yang, Etai Littwin.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

2. Background

Let f(ξ; w) ∈ ℝ denote the (scalar) output of a neural network parameterized by w, given example ξ. To understand how the output changes with a slight change of the network parameters to w_0 − δw, we may naively expand the network function using the first-order Taylor expansion around a base point w_0:

    f(ξ; w_0 − δw) − f(ξ; w_0) ≈ −⟨∇_w f(ξ; w_0), δw⟩.    (1)

Under the SGD algorithm, the weight update δw is given by the gradient δw = ηχ(ξ̂)∇_w f(ξ̂; w_0), where χ(ξ̂) is the loss derivative, ξ̂ is a sample from the training set, and η is the learning rate. Plugging into Eq. (1), we get

    f(ξ; w_0 − δw) − f(ξ; w_0) ≈ −ηχ(ξ̂)K(ξ, ξ̂),    (2)

where K(ξ, ξ̂) = ⟨∇_w f(ξ; w_0), ∇_w f(ξ̂; w_0)⟩ is the NTK.

The NTK theory of infinitely wide neural networks, as first proposed by (Jacot et al., 2018), boils down to the following observations: when the width of f tends to infinity, the NTK K converges to a fixed kernel K̊ at random initialization, independent of the specific instantiation of the weights, and remains frozen during the optimization process. Eq. (2) then gives an accurate description of the output evolution if we substitute K̊ for K. The seemingly complex optimization trajectory of SGD therefore reduces to the convex trajectory of kernel gradient descent with a time-independent kernel K̊.

Consider the output of the network f ∈ ℝ^D on the full training dataset. As shown in (Jacot et al., 2018), when the L2 loss is used, the evolution of the output f_t at time t under continuous-time GD (i.e. gradient flow) takes a simple form:

    f_t − f* = e^{−ηK̊t}(f_0 − f*),

where K̊ ∈ ℝ^{D×D} is the full NTK matrix evaluated on the training data, f* is the label function, and f_0 is the output at initialization. Hence, provided K̊ is full rank, as t → ∞ we have f_t → f*, and the network can fit the training data perfectly.

Previous Approaches vs Ours A common theme in showing NTKTRAIN for MLPs is to derive high-probability bounds on the deviation of the NTK K from its initial value after training (e.g. Allen-Zhu et al. (2018); Du et al. (2018); Zou et al. (2018)).[1] Obtaining these bounds usually requires developing ad hoc methods on a per-architecture basis, hindering the scalability of the method to other settings. In the present work we take a more holistic approach, leveraging the recently developed Tensor Programs framework (Yang, 2019b;a; 2020a;b). It consists of two layers of arguments: 1) the bottom layer analyzes how the distributions of (pre-)activations change throughout the course of training; this crucially leverages the mathematical machinery of the Tensor Programs Master Theorem;[2] 2) the top layer packages these insights systematically via the notion of paths so as to apply to any architecture expressible by a Tensor Program. We will illustrate 1) through examples in Section 3 and 2) through figures in Section 5.1.

Setup and Notations In this paper, we will consider the architecture (including depth), data, and training time to be fixed as width n → ∞.[3] We describe common notations used in the remainder of the paper. For simplicity, we will consider SGD with batch size 1 and learning rate η (often set to 1 WLOG).[4] We use ξ_t to denote the input and L_t to denote the loss function (absorbing the label) at step t. More generally, subscript t on any symbol means time t. However, for brevity, we abuse notation and shorthand f_t for f_t(ξ_t), and, for any (pre-)activation x, x_t for x_t(ξ_t).[5] We will also write χ_t for the loss derivative L′_t(f_t). For any vector x(ξ) we define δx_{t+1}(ξ) ≝ √n (x_{t+1}(ξ) − x_t(ξ)) and dx(ξ) ≝ √n ∂f(ξ)/∂x(ξ). We will track the evolution of f on an arbitrary input ξ̃.[6] Similar to above, we shorthand x̃_t, f̃_t for x_t(ξ̃), f_t(ξ̃).

3. Motivating Examples

The purpose of this section is to illustrate our key ideas via simple, intuitive examples without diving into the specifics of Tensor Programs. In the process, we will gain insight into how randomness from initialization propagates over the course of training. As these examples intend to provide the reader with the proper intuition, we use informal arguments alone and relegate all formal statements to the appendix. For brevity, we will gloss over minor details or routine calculations, but interested readers can see Appendix A for these omissions.

Key Idea It turns out that the random initialization and the overparametrization of weights cause each (pre-)activation vector x_t(ξ) ∈ ℝⁿ, its gradient dx_t(ξ) ∈ ℝⁿ, and its (scaled) change δx_t(ξ) ∈ ℝⁿ at every time step to have roughly iid coordinates, not just initially but throughout training.[7] Then, as we shall demonstrate through the examples below, to track the evolution of the neural network function, it suffices to track the evolution of the coordinate distributions of x(ξ), dx(ξ), δx(ξ). We write Z^{x(ξ)}, Z^{dx(ξ)}, Z^{δx(ξ)} ∈ ℝ for the random variables corresponding to such coordinate distributions.[8]

Footnotes:
[1] In the original NTK paper (Jacot et al., 2018), the limit is taken as each layer width goes to infinity sequentially, which already doesn't make sense for weight-tied architectures like RNNs.
[2] In particular, we need to use the Master Theorem in (Yang, 2020b), so (Yang, 2020a) could not have obtained NTKTRAIN at the same time as NTKINIT.
[3] They will affect the rate of convergence to the infinite-width limit, but since we are only concerned with whether convergence occurs, they do not appear in our theorem statements here.
[4] This generalizes readily to any batch size and learning rate.
[5] We will not refer to the function x_t : ξ → x_t(ξ) (likewise for f_t, χ_t), so this abuse of notation should cause no confusion.
[6] It might help to think of ξ̃ as some test sample, but it can also fall in the training set.
[7] This is a consequence of the Tensor Program Master Theorem.
[8] As we will explain below, different Zs may correlate, reflecting correlations between corresponding vectors.
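The kernel gradient descent picture in Eq. (2) and the gradient-flow solution above can be simulated directly: with the L2 loss, the outputs on the training set follow f_{t+1} = f_t − ηK(f_t − f*), and the residual decays at rates set by the eigenvalues of K. A quick sketch with a hypothetical PSD kernel and toy data (not from the paper):

```python
import numpy as np

# Kernel gradient descent on D = 5 toy training points with L2 loss:
#   f_{t+1} = f_t - eta * K @ (f_t - f_star),   chi_t = f_t - f_star
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))                            # toy inputs
K = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # stand-in PSD kernel
f_star = rng.standard_normal(5)   # labels
f = rng.standard_normal(5)        # outputs at initialization
eta = 0.1
for _ in range(100_000):
    f = f - eta * K @ (f - f_star)
print(np.max(np.abs(f - f_star)))  # residual ~ 0: training data is fit
```

This mirrors f_t − f* = e^{−ηK̊t}(f_0 − f*): since the stand-in K is full rank, the outputs converge to the labels.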

Our goal is to derive, from these insights:

Claim 3.1. In the large width limit, f̃_t = f_t(ξ̃) changes by

    lim_{n→∞} f̃_{t+1} − f̃_t = −χ̊_t K̊(ξ̃, ξ_t)    (3)

at step t, where K̊ is the limiting NTK of the architecture and χ̊_t = L′_t(lim_n f_t) is the loss derivative.

We start with an example derivation for a 1-hidden-layer MLP, before moving on to 2 hidden layers, where the mathematics quickly becomes much more involved.

3.1. 1 Hidden Layer

Consider a 1-hidden-layer network with nonlinearity φ:

    f = V⊤x,    x = φ(h),    h = Uξ,

where ξ ∈ ℝ^d, U = u/√d ∈ ℝ^{n×d}, V = v/√n ∈ ℝ^{n×1}, for trainable parameter tensors u, v, initialized iid from N(0, 1). In the interest of clarity we assume the output layer is not trained, and d = 1.

For a vector v ∈ ℝⁿ, let v = Θ(n^a) mean that "v has coordinates of order n^a when n is large";[9] likewise for o(n^a), etc. Recall the notations x_t = x_t(ξ_t), x̃_t = x_t(ξ̃), δx_t = √n (x_t − x_{t−1}), and likewise for h_t, h̃_t, δh_t. The key insights are as follows:

Preliminary Calculations It turns out x̃_{t+1} − x̃_t = Θ(1/√n) = o(1), so δx̃_{t+1} = √n (x̃_{t+1} − x̃_t) has Θ(1) coordinates. Likewise for δh̃_{t+1}. Consequently, for any t and input ξ, by telescoping,

    h_t(ξ) = h_0(ξ) + o(1).    (4)

Using ∇_u f = (1/√n) dh_t ξ⊤ and dh_t = φ′(h_t) ⊙ v, we have:

    δh̃_{t+1} = −χ_t ξ_t⊤ξ̃ φ′(h_t) ⊙ v.    (5)

Also, δx̃_{t+1} = √n (φ(h̃_t + δh̃_{t+1}/√n) − φ(h̃_t)). Since δh̃_{t+1} = Θ(1), by intuitive Taylor expansion we have

    δx̃_{t+1} ≈ φ′(h̃_t) ⊙ δh̃_{t+1}.    (6)

The change in the output on example ξ̃ from step t to step t + 1 is given by:

    f̃_{t+1} − f̃_t = V⊤(x̃_{t+1} − x̃_t) = v⊤δx̃_{t+1}/n.    (7)

IID Coordinates By definition v has iid coordinates. It turns out h_t(ξ), δh_t(ξ) (likewise for x) all have approximately iid coordinates of size Θ(1) as well.[10] Let Z^{δx̃_t}, Z^v denote the random variables encoding the corresponding coordinate distributions; likewise for the other vectors. Note that Z^{δx̃_t}, Z^v will in general be correlated, reflecting the coordinatewise correlation between v and δx̃_t.

Law of Large Numbers Eqs. (5) to (7) imply

    lim_{n→∞} f̃_{t+1} − f̃_t = E[Z^v Z^{δx̃_{t+1}}]    (8)
        = −χ̊_t ξ_t⊤ξ̃ E[φ′(Z^{h̃_t}) φ′(Z^{h_t}) (Z^v)²],    (9)

where χ̊_t = L′_t(lim_n f_t) as in Claim 3.1.

Kernel Expression as Gaussian Expectation By Eq. (4), in the n → ∞ limit, Z^{h̃_t} = Z^{h_0(ξ̃)} and Z^{h_t} = Z^{h_0(ξ_t)}. They are independent from Z^v and jointly Gaussian with variances ‖ξ̃‖², ‖ξ_t‖² and covariance ξ̃⊤ξ_t. So (using the initialization of v to simplify E(Z^v)² = 1),

    lim_{n→∞} f̃_{t+1} − f̃_t = −χ̊_t ξ_t⊤ξ̃ E[φ′(Z^{h̃_t}) φ′(Z^{h_t})].    (10)

This can easily be seen to be Eq. (3) (recall we assumed for simplicity that the output layer V is not trained).

Summary Our strategy so far has been computing the form of Z^{δx̃_t} and plugging it into Eq. (8) to compute the dynamics of the output in the limit. Note that our approach differs from previous work, which mainly focuses on proving a bound on the change of the NTK post training. As the architecture gets more complex, bounding the NTK movement becomes quite complex, but our approach scales easily due to the automation provided by the Tensor Programs framework (see Section 4).

In the previous example, the coordinate distribution Z^{x̃_t} took a fairly simple form, which allowed us to intuitively compute the expectation E[Z^{x̃_t} Z^v]. Before introducing a method for computing coordinate distributions in a general architecture, we move on to a slightly more involved architecture, with the intention of highlighting the intuition behind the general case.

3.2. 2 Hidden Layers

In this example we consider a model of the form:

    f = V⊤x,    x = φ(h),    h = Wz,    z = φ(g),    g = Uξ,

where U = u/√d ∈ ℝ^{n×d}, W = w/√n ∈ ℝ^{n×n}, V = v/√n ∈ ℝ^{n×1}, for trainable parameters u, w, initialized iid from a normal distribution. As before, we assume the last layer is not trained, and d = 1.

Again, we want to establish Claim 3.1 for this model. As in the 1-hidden-layer example, the dynamics of the output in the limit is still given by Eq. (8). This time around, the second hidden layer adds nontrivial complexity when evaluating the expectation E[Z^v Z^{δx̃_{t+1}}]. As we shall see, this complexity arises from the dependency of x̃ on the n × n matrices w and w⊤, which makes it wrong to naively apply LLN arguments. Resolving this complexity will pave the way to a general strategy which we will then be able to apply to any arbitrary architecture. We now apply the same insights as in the 1-hidden-layer MLP. Namely:

• Eq. (4) continues to hold with h replaced by any of {x, h, z, g}.

• After some brief calculations, with dh_t denoting the scaled gradient √n ∇_{h_t} f,

    δg̃_{t+1} ≈ −χ_t ξ_t⊤ξ̃ φ′(g_t) ⊙ (W⊤dh_t)    (11)
    δh̃_{t+1} ≈ W δz̃_{t+1} − χ_t (z_t⊤z̃_t/n) φ′(h_t) ⊙ v.    (12)

• As in the 1-hidden-layer case, for all x ∈ {g, z, h, x}, x_t(ξ), δx_t(ξ) have iid coordinates of size Θ(1), as does W⊤dh_t, by definition.[11] Let Z^{δx̃_t}, Z^v denote the (generally correlated) random variables encoding the corresponding coordinate distributions.

• As in Eq. (6), by naive Taylor expansion we have:

    δz̃_{t+1} ≈ φ′(g̃_t) ⊙ δg̃_{t+1}.    (13)

• Eqs. (6) and (7) in the 1-hidden-layer case continue to hold here. Then by Eq. (12) and the Law of Large Numbers,

    lim_{n→∞} f̃_{t+1} − f̃_t = E[Z^v Z^{δx̃_{t+1}}]    (14)
        = −χ̊_t E[Z^{z_t} Z^{z̃_t}] E[φ′(Z^{h̃_t}) φ′(Z^{h_t}) (Z^v)²]    (15)
        + E[φ′(Z^{h̃_t}) Z^{W δz̃_{t+1}} Z^v],    (16)

where χ̊_t = L′_t(lim_n f_t) as in Claim 3.1.

In this expression, the first term (Eq. (15)) can easily be seen to correspond to the contribution from w to the NTK. It remains to show that the second (Eq. (16)) corresponds to the contribution from u.

Challenge of Analyzing Eq. (16) To do this, we must reason about the coordinate distribution of W δz̃_{t+1} (encoded by the random variable Z^{W δz̃_{t+1}}) and compute the expectation in Eq. (16). To understand why this represents a greater challenge than it might first appear, note that from δz̃_{t+1} ≈ φ′(g̃_t) ⊙ δg̃_{t+1} (Eq. (13)), the term W δz̃_{t+1} hides within itself a dependency on W⊤dh_t through δg̃_{t+1} (Eq. (11)). While at t = 0 we may assume W⊤ to be independent of W and obtain the correct results (the Gradient Independence Assumption (Yang & Schoenholz, 2017; Yang, 2020a)), this is no longer the case for t > 0: Z^{W δz̃_{t+1}} will be nontrivially correlated with φ′(Z^{h̃_t}) and Z^v (which would be false if W⊤ could be assumed independent of W). We will give some intuition why later in Eq. (20). Now, what is this dependence exactly?

Claim 3.2. Based on the above discussion and some easy calculations, δz̃_{t+1} can be written as Φ(W⊤dh_t) for some Φ : ℝ → ℝ applied coordinatewise (which will depend on other vectors not of the form W⊤•). Then it turns out[11] that

    Z^{W δz̃_{t+1}} = G + Z^{dh_t} E[∂Z^{δz̃_{t+1}}/∂Z^{W⊤dh_t}],    (17)

where G is some Gaussian variable independent from Z^v, and ∂Z^{δz̃_{t+1}}/∂Z^{W⊤dh_t} ≝ Φ′(Z^{W⊤dh_t}).

Getting Claim 3.1 Thus, from Eqs. (11) and (13), it follows that:

    Z^{δz̃_{t+1}} = −χ̊_t ξ_t⊤ξ̃ φ′(Z^{g̃_t}) φ′(Z^{g_t}) Z^{W⊤dh_t}    (18)
    E[∂Z^{δz̃_{t+1}}/∂Z^{W⊤dh_t}] = −χ̊_t ξ_t⊤ξ̃ E[φ′(Z^{g̃_t}) φ′(Z^{g_t})].    (19)

Plugging into Eqs. (14) and (17), followed by some straightforward calculation, then yields Claim 3.1.

Intuition behind Claim 3.2 Eq. (17) may appear cryptic at first, so let's give some intuition using an example. Suppose Φ in Claim 3.2 is actually the identity. For brevity, we set x = dh_t, y = δz̃_{t+1} ∈ ℝⁿ. Then, following a straightforward calculation, Wy = WW⊤x has coordinates

    (Wy)_α = Σ_{γ≠α} x_γ Σ_{β=1}^n W_{αβ} W_{γβ} + (Σ_{β=1}^n W_{αβ}²) x_α.    (20)

Now, the second sum converges via LLN to x_α as n → ∞. On the other hand, the first sum will converge via CLT to N(0, lim ‖x‖²/n). Thus, in terms of Zs, we have

    Z^{Wy} = G + Z^x = G + Z^x E[Φ′]    (21)

for some Gaussian G; this corresponds directly to Eq. (17).[12] For general Φ, a similar intuition applies after Taylor expansion of Φ.

Footnotes:
[9] More rigorously, we mean that ‖v‖²/n = Θ(n^{2a}). Note this is different from the common interpretation that ‖v‖ = Θ(n^a).
[10] Technically, they have iid coordinates only after conditioning on the initial function (GP) f_0. Likewise, when we take an expectation in this example (e.g. Eqs. (9) and (17)), it is a conditional expectation of this kind. See Appendix D.1.1 for a rigorous treatment. However, to convey the main intuition, we gloss over this technicality here.
[11] Again, this is technically true only after conditioning on f_0; see Footnote [10].
[12] This example was worked out in (Yang, 2020a;b) as well, though in different contexts. Readers needing more explanation may see those works.
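The (W, W⊤) interaction discussed above is easy to observe numerically: for W with iid N(0, 1/n) entries and a fixed vector x, the coordinates of y = WW⊤x consist of a CLT Gaussian plus the correction term Z^x of Eqs. (20) and (21), whereas substituting a fresh independent matrix for W removes the correction. A sketch with toy vectors (x here is just a standard Gaussian stand-in, not the paper's dh_t):

```python
import numpy as np

# y = W W^T x: coordinatewise, a CLT Gaussian of variance ~ |x|^2/n plus the
# correction x_a (Eq. (20)). Replacing W by a fresh independent matrix kills
# the correction, so <y, x>/n drops from E[(Z^x)^2] = 1 to ~ 0.
rng = np.random.default_rng(2)
n = 2000
W = rng.standard_normal((n, n)) / np.sqrt(n)
x = rng.standard_normal(n)

y = W @ (W.T @ x)
W_indep = rng.standard_normal((n, n)) / np.sqrt(n)  # wrong "independent W^T" model
y_indep = W_indep @ (W.T @ x)

print(y @ x / n)        # ~ 1: the Z^x correction is present
print(y_indep @ x / n)  # ~ 0: no (W, W^T) interaction, no correction
```

The second line shows what the Gradient Independence Assumption would (wrongly) predict after the first SGD step.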

Summary This 2-hidden-layer example proceeded much the same as the 1-hidden-layer case, with the main exception of analyzing the interaction of the n × n Gaussian matrix W with W⊤ (Eq. (16)) that occurs after taking at least 1 step of SGD. This was absent in the 1-hidden-layer case because there each weight matrix has at most one side tending to infinity. Such analysis is crucial to obtaining the right results, as assuming W⊤ to be independent from W would imply that f does not move from initialization.[13]

It turns out these two examples have essentially covered all of the core ideas needed to extend the analysis to arbitrary architectures. To formalize and scale up our calculations, we now turn to the Tensor Programs framework.

4. Tensor Programs

So far, our results have been obtained by unrolling the SGD updates on toy models with specific architectures, and using informal arguments. Obviously, these computations quickly become unmanageable when the architecture becomes more complex. The sheer amount of architectural innovation that has sprung up in recent years requires us to adopt a much more general formulation of our results. To that end, we adopt the Tensor Programs (TP) framework developed in (Yang, 2019a; 2020a;b). In a nutshell, it provides a language for describing typical computations done in the context of neural networks, such as forward and backward propagation. It is simultaneously simple and expressive, covering all standard architectures (Yang, 2019a; 2020a). Here we review two basic forms of Tensor Programs, NETSOR⊤ and NETSOR⊤+.

Definition 4.1. A NETSOR⊤ program is just a sequence of ℝⁿ vectors inductively generated via one of the following instructions from an initial set V of random ℝⁿ vectors and a set W of random n × n matrices:

NONLIN For x¹, …, x^k ∈ ℝⁿ in the program and any ψ : ℝ^k → ℝ, we can generate ψ(x¹, …, x^k) ∈ ℝⁿ.

MATMUL Given W ∈ ℝ^{n×n} and x ∈ ℝⁿ, we can generate Wx ∈ ℝⁿ or W⊤x ∈ ℝⁿ.

[Figure 1: Graphical form of NETSOR⊤ programs for an MLP and an RNN. An empty node corresponds to a G-var (an initial vector or a vector created by MATMUL), while a filled node corresponds to an X-var (a vector created by NONLIN). Each dashed edge corresponds to matrix multiplication by the labeled matrix. Each gate (connected by solid edges) represents a NONLIN application. The computation follows the direction of the gates (left-to-right here). Note we have Wξ instead of ξ as an initial vector because we only allow ℝⁿ vectors; likewise for Uξ^i. For the same reason, we don't express the network output in this graph.]

Graphical Form We propose to represent a NETSOR⊤ program as a computational graph, where each node in the graph represents a vector (initial or generated), each dashed edge represents a MATMUL, and each gate represents a NONLIN. For example, Fig. 1 shows the computation graphs expressing (the forward passes of) an MLP and an RNN. We can express backpropagation as well (see Fig. 6). Graphically, the initial vectors are the empty nodes with only one edge coming out, toward the direction of computation. The matrices correspond to (the labels of) the dashed edges. We can also define the output vectors to correspond to the nodes that have only one edge coming out, against the direction of computation.

Neural Network Representation Each NETSOR⊤ program can be thought of as computing a function (ℝⁿ)^V × (ℝ^{n×n})^W → ℝ^Y, taking an instantiation of the initial vectors V and matrices W and computing the values of all output vectors Y. We can say a program represents a network if it computes the body of the network (without the input and output layers), as exemplified by Fig. 1. This is formalized below.

Definition 4.2. Consider a neural network f : (ℝ^d)^k → ℝ with input embedding matrices U¹, …, U^k ∈ ℝ^{n×d} (not necessarily distinct) and readout matrix V ∈ ℝⁿ, so that f(ξ¹, …, ξ^k) = V⊤Φ(U¹ξ¹, …, U^kξ^k; Θ) for some function Φ(x¹, …, x^k; Θ) with parameters Θ. We say a NETSOR⊤ program represents f if it computes Φ (under some correspondence of V ∪ W to {x¹, …, x^k} ∪ Θ).

For example, the programs in Fig. 1 resp. represent a 3-hidden-layer MLP and an RNN running for 3 steps. Note that the initial vectors correspond to a combination of input embeddings (e.g. Wξ) and vector parameters (e.g. biases), and the matrices correspond to matrix parameters (e.g. weights).

Intuition for a Program in the Large-n Limit Typically, the vectors (resp. matrices) in a program will be sampled iid like N(0, 1) (resp. N(0, 1/n)), corresponding to the "standard" initialization of neural networks.[14] In such cases, when n → ∞, a program behaves as follows, in a gist:

IID Coordinates Any vector x ∈ ℝⁿ in the program has roughly iid coordinates. We write Z^x for the random variable encoding this coordinate distribution. This Z^x may be correlated with Z^y for other vectors y in the program, such that, for example, lim_{n→∞} x⊤y/n = E[Z^x Z^y].

Nonlin Z^{ψ(x¹,…,x^k)} = ψ(Z^{x¹}, …, Z^{x^k}).

MatMul, without (W, W⊤)-Interaction Consider a matrix W ∈ ℝ^{n×n} in the program and any set X of ℝⁿ vectors not dependent on vectors of the form W⊤•. Then the random variables {Z^{Wx} : x ∈ X} are jointly Gaussian with mean zero and covariance Cov(Z^{Wx}, Z^{Wx̃}) = E[Z^x Z^{x̃}] for any x, x̃ ∈ X. If W̄ ≠ W is another matrix in the program and Y is a set of such vectors w.r.t. W̄, then the set {Z^{Wx} : x ∈ X} is independent from {Z^{W̄y} : y ∈ Y}.

MatMul, with (W, W⊤)-Interaction For general x, Z^{Wx} decomposes into a sum of a Gaussian part, identical to Z^{Wx} in the above case, and a correction term. This decomposition is a generalization of Eq. (17). See Theorem B.4 for formal details.

NETSOR⊤+ Programs (Yang, 2019a; 2020a) showed that NETSOR⊤ suffices to express the forward and backward passes of most architectures such as ResNet (with Batchnorm), but Transformers and other standard architectures require adding to NETSOR⊤ a new "averaging" instruction that returns the empirical average (1/n) Σ_{α=1}^n x_α of a vector x ∈ ℝⁿ.[15] In the n → ∞ limit, this scalar converges to E[Z^x], as would be expected from the intuitions above. This extension of NETSOR⊤ (called NETSOR⊤+) also allows us to express the network output and loss derivative (in contrast to Fig. 1), which will be a technical requirement for unrolling the entirety of SGD training inside a single program, a key step in the proof of our main result; see the discussion of proof formalization in Section 5. We can say a NETSOR⊤+ program represents a network f if it computes the body of f.

5. Universality of Kernel Dynamics

(Yang, 2019a; 2020a) showed that any neural network of standard architecture is represented by a NETSOR⊤+ program. Moreover,

Theorem 5.1 (Yang (2020a)). For a neural network as in Setup 5.2 below, its Neural Tangent Kernel at initialization has a well-defined infinite-width limit K̊.

Setup 5.2 (Representable NN in NTK Parametrization). Suppose a neural network f is represented by a NETSOR⊤+ program (in the sense of Definition 4.2) whose NONLINs all have polynomially bounded derivatives.[16] Adopt the NTK parametrization: for every matrix parameter W ∈ ℝ^{n×n} of f, we factor W = (1/√n) w where w is the trainable parameter; likewise, for each input layer matrix U^i ∈ ℝ^{n×d}, we factor U^i = (1/√d) u^i, and likewise the output matrix V = (1/√n) v, such that u^i, v are trainable. Finally, we randomly initialize all trainable parameters iid as N(0, 1).

Our main result is to show that the SGD training of a neural network described in Setup 5.2 reduces to kernel gradient descent with kernel K̊ in the infinite-width limit.

Theorem 5.3 (NTKTRAIN is Architecturally Universal). Consider training a network f described in Setup 5.2 via SGD with batch size 1 and (WLOG) learning rate 1. Let ξ_t be the input and L_t : ℝ → ℝ be the loss function (absorbing the label) at time t. Suppose L′_t is continuous for all t. Then, for any ξ and t, f_t(ξ) converges almost surely to a random variable f̊_t(ξ) as width → ∞, such that

    f̊_{t+1}(ξ) − f̊_t(ξ) = −K̊(ξ, ξ_t) L′_t(f̊_t(ξ_t)),    (22)

where K̊ is the infinite-width NTK (at initialization) of the neural network. The full proof of Theorem 5.3 is given in Appendix D.

Extensions We briefly mention several ways our result can be easily extended. 0) Different batch sizes, learning rate schedules, and nonscalar outputs. 1) Variants of the NTK parametrization: we can deal with any parametrization that scales the same way as the NTK parametrization, e.g. weights sampled like N(0, σ²) for any σ, with multipliers γ/fan-in for any γ. 2) Variable width: in real networks, the widths of different layers can often differ (e.g. in ResNet). Our result can be extended to the case where the widths tend to infinity at a fixed ratio, using the variable-width version of Tensor Programs (Yang, 2020b). 3) Unsupervised and other learning settings can be covered because their training and testing computations can be written into Tensor Programs. 4) Weight decay, momentum, and other optimizer tricks can be covered as well, as they can be straightforwardly written into Tensor Programs, but in general the kernel will then change from step to step, in contrast to Theorem 5.3.

Footnotes:
[13] One can see this easily by modifying our calculations above.
[14] In the original definition of (Yang, 2019a; 2020a;b), the vectors can have correlations between them, but we can always rewrite these vectors as linear images of another set of uncorrelated vectors.
[15] For readers familiar with Tensor Programs, this is equivalent in expressivity to MOMENT, which is a composition of a NONLIN and this averaging instruction.
[16] More generally, we can allow any pseudo-Lipschitz function here, but for simplicity we go with the statement in the main text.
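The kernel dynamics of Theorem 5.3 can be sanity-checked at large finite width: take one SGD step on a training input ξ_t and compare the actual change of f on a test input ξ̃ against the kernel prediction −χ_t K(ξ̃, ξ_t), with K the empirical NTK at initialization. A sketch for the 1-hidden-layer model of Section 3.1, where only u is trained (the tanh nonlinearity, L2 loss, and scalar inputs are our toy assumptions):

```python
import numpy as np

# One SGD step on a width-n, 1-hidden-layer net in NTK parametrization
# (only u trained, as in Section 3.1), compared with the kernel-GD
# prediction of Claim 3.1: delta f(xi_tilde) ~ -chi_t * K(xi_tilde, xi_t).
rng = np.random.default_rng(3)
n, lr = 50_000, 1.0
u = rng.standard_normal(n)          # first-layer weights (d = 1)
v = rng.standard_normal(n)          # frozen output weights
xi_t, xi_tilde, y = 0.7, -0.3, 1.0  # toy train input, test input, label

f = lambda uu, x: v @ np.tanh(uu * x) / np.sqrt(n)
grad_u = lambda x: v * (1.0 - np.tanh(u * x) ** 2) * x / np.sqrt(n)  # df/du at init

chi = f(u, xi_t) - y                 # L2 loss derivative at xi_t
K = grad_u(xi_tilde) @ grad_u(xi_t)  # empirical NTK (u-part)
u_new = u - lr * chi * grad_u(xi_t)  # one SGD step on xi_t

actual = f(u_new, xi_tilde) - f(u, xi_tilde)
predicted = -lr * chi * K
print(actual, predicted)  # agree up to finite-width corrections
```

At this width the two numbers agree to a few decimal places; the gap shrinks as n grows, consistent with Eq. (22).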

5.1. Proof Sketch of Special Case 푊 푊 = 푊 = 푊 MLP To convey the main idea, we give a proof sketch of a sim- 푝 푝 푝 푝 plified problem: we assume 1) the input, output layers and biases are not trained (the network has only Rn×n matri- ces as trainable parameters); 2) the forward pass does not contain both a weight matrix W and its transpose W > (but Paths 푝 start with X-var and end with G-var a single matrix W can still be used multiple times with- out being transposed); 3) input space is Rd (with k = 1), > and f = V x ∈ R; 4) the output vector x is a G-var; 푊 푊 5) the network is represented by a NETSOR> (instead of = 푊 = 푊 + 17 RNN NETSOR> ) program; In the appendix, we prove the 푝 푝 푝 푝 general case with these simplifications lifted. 푏 It turns out, every NETSOR> can be simplified into a stan- dard form of sorts, which greatly facilitates our proof. Figure 2: Paths. A graphical illustration of paths in an MLP Definition 5.4. In a NETSOR> program, a G-var18 is an and RNN. A path p always starts with an X-var and ends initial vector or a vector created by MATMUL, while an X- with a G-var. In these examples, path length |p| = 4. var is a vector created by NONLIN.19 We define a reduced NETSOR> program as a program in which only G-vars are allowed as inputs to a NONLIN, while only an X-var is ending with a G-var h describes all possible pathways of allowed as input to aM ATMUL. backpropagating an error signal dh at h to an error signal dx at x. Simultaneously, it also describes all possible pathways Observe that any NETSOR> program may be trivially ex- of forward propagating a change in x to a change in h. pressed as a reduced NETSOR> program by: 1) collapsing chains of non-linearities which appear consecutively, and Decomposing the Change in f to a Sum over Paths Be- 2) insert a NONLIN operation with ψ(x) = x in between n×n cause the gradient ∇W f of a weight W ∈ R is the sum consecutive G-vars. 
Hence, we may safely assume that f is P of outer products h,x:h=W x dh ⊗ x, summing over all representable by a reduced NETSOR> program. G-vars h and X-vars x in the program with h = W x (where dh denotes ∇hf), we also have Key Idea: Paths The examples of Sections 3.1 and 3.2 ex- X X p posed several insights, such as the iid-coordinates intuition, ∇W f = J ⊗ x. (23) important for proving Theorem 5.3. Now we discuss the one h,x:h=W x p remaining key idea for scaling up to general architectures: where Definition 5.5 (Paths). In a NETSOR> program, a path p starts with an X-var and ends with a G-var, alternating ∂f ∂p−1 ∂p−2 ∂p2 0 (J p)> = × × × ... × (24) between X- and G-vars along the path. We write p for the ∂p−1 ∂p−2 ∂p−3 ∂h starting X-var, p1 for the following G-var, and so on, as well as p−1 for the ending G-var (see Fig. 2 for a graphical i.e, J p denotes the error signal at h from backpropagation i illustration). For odd i, let W p denote the defining matrix through path p, and p ranges over all paths starting with of G-var pi. For two equal length paths p, q, we write p =∼ q p0 = x, p1 = h and ending with the output node of the pi √1 (path p is isomorphic to path q) if for all odd i, W is underlying program. Recall W factors as n w where w is i the same matrix as W q .20 In other words, we say path the trainable parameter, not W . By the discussion above, √1 p is isomorphic to path q if their sequences of MATMUL updating w with ∇wf = n ∇W f causes f to change by matrices are identical, (but the NONLIN don’t have to be, |p|  1 1  see Fig. 3 for a graphical illustration). Let denote the √ ∇ f, √ ∇ f number of vectors in p (this is always an even number). n W n W 1 X X X X (25) The collection of paths p starting with an X-var p0 = x and = hJ p,J p¯ihx, x¯i. n h,x:h=W x h,¯ x¯:h¯=W x¯ p p¯ 17This is not common in deep learning, but appears in some weight-tied autoencoders (Li & Nguyen, 2019). 
When every parameter w is randomly initialized iid as N(0, 1), it turns out that ⟨J^p, J^{p̄}⟩ will go to 0 as n → ∞ unless p ≅ p̄ (Definition 5.5). If one thinks of J^p as a product of random Gaussian matrices (interleaved with other matrices), then this is akin to the fact that a mixed moment M := E \prod_{i=1}^r Z_{γ(i)}, with γ: [r] → [k], of standard iid Gaussians Z_1, …, Z_k is nonzero iff every Z_i appears an even number of times in the product.

Footnote 18: "G" because G-vars often are roughly Gaussian vectors.
Footnote 19: "Var" is short for "variable," as the vectors are considered variables in the program. In previous works, "H-var" refers to any vector in the program; we will not use this terminology.
Footnote 20: Here we are talking about equality of symbols rather than equality of the values of those symbols.
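The mixed-moment fact admits a quick Monte Carlo sanity check (our own illustration, not from the paper): a product with all multiplicities even has a nonzero expectation, while a product in which some Z_i appears an odd number of times averages to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
Z = rng.standard_normal((2, N))  # iid samples of Z_1 and Z_2

# All multiplicities even: E[Z_1^2 Z_2^2] = E[Z_1^2] E[Z_2^2] = 1.
even = np.mean(Z[0]**2 * Z[1]**2)
# Z_2 appears an odd number of times: E[Z_1^2 Z_2] = 0.
odd = np.mean(Z[0]**2 * Z[1])

assert abs(even - 1.0) < 0.02
assert abs(odd) < 0.02
```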

This means we can replace the four nested sums in Eq. (25) with the single sum \sum_{p \cong \bar p} and rewrite x = p^0, x̄ = p̄^0.

We have suppressed the dependence on the input in Eq. (25). Being more explicit about it, and performing updates on all weights, we have

f_{t+1}(\xi) - f_t(\xi) \approx -L'_t(f_t(\xi_t)) \sum_{p \cong \bar p} \langle J^p(\xi), J^{\bar p}(\xi_t)\rangle \langle p^0(\xi), \bar p^0(\xi_t)\rangle \qquad (26)

after taking a gradient step on input ξ_t at initialization t = 0. (Here p^0(ξ) denotes the vector p^0 as a function of ξ at initialization.) However, Eq. (26) holds for general t as well: the key insight, similar to Eq. (4) in the 1-hidden-layer example, is that the vectors p^0(ξ), J^p(ξ), etc. change vanishingly from their initial values as n → ∞, after any number of SGD steps. Because our arguments above only depend on inner products between vectors, this means that the error in Eq. (26) for t > 0 vanishes as n → ∞. Finally, at least heuristically, it is straightforward to show that the RHS of Eq. (26) is the NTK, via a series of calculations exemplified by those in Sections 3.1 and 3.2, to get Eq. (22).
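Heuristically, Eq. (26) says that one SGD step moves the function by −L'_t times the kernel evaluated between the test and training inputs. The following minimal numerical sketch (our own toy setup, not the paper's: scalar inputs, a width-n one-hidden-layer tanh network in NTK parametrization, and the learning rate lr made explicit) checks that the actual change of f at a test input after one SGD step matches −lr · L' · Θ(ξ_test, ξ_t), where Θ is the empirical NTK.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000                    # width; the approximation sharpens as n grows
w = rng.standard_normal(n)    # hidden-layer weights (scalar input)
v = rng.standard_normal(n)    # output-layer weights

def f(xi, w, v):
    # NTK parametrization: 1/sqrt(n) scaling on the output layer.
    return v @ np.tanh(w * xi) / np.sqrt(n)

xi_t, y, xi_test, lr = 0.7, 1.0, -0.3, 0.1

# Gradients of f with respect to (w, v) at the training input.
h_t = np.tanh(w * xi_t)
df_dv_t = h_t / np.sqrt(n)
df_dw_t = v * (1 - h_t**2) * xi_t / np.sqrt(n)

# The same gradients at the test input, and the empirical NTK.
h_s = np.tanh(w * xi_test)
df_dv_s = h_s / np.sqrt(n)
df_dw_s = v * (1 - h_s**2) * xi_test / np.sqrt(n)
ntk = df_dv_t @ df_dv_s + df_dw_t @ df_dw_s

# One SGD step on the loss L = (f - y)^2 / 2, so L' = f - y.
Lp = f(xi_t, w, v) - y
w1 = w - lr * Lp * df_dw_t
v1 = v - lr * Lp * df_dv_t

# Kernel gradient descent prediction: f moves by -lr * L' * NTK(test, train).
change = f(xi_test, w1, v1) - f(xi_test, w, v)
predicted = -lr * Lp * ntk
assert np.isclose(change, predicted, rtol=0.1, atol=1e-3)
```

At this width the first-order (kernel) prediction already matches the actual function change to within a few percent; the residual shrinks as n grows, mirroring the n → ∞ limit in the text.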

Putting It All Together and Proof Formalization. While the core ideas discussed above are intuitive, making them rigorous at face value would be quite challenging. Instead, we use the machinery offered by the Tensor Programs framework. The mechanics of the proof then go as follows: 1) First, we unroll the SGD of f into a NETSOR^⊤+ program.^21 This is similar to the equations in Sections 3.1 and 3.2; the key here is to express δx_{t+1} = √n (x_{t+1} − x_t) as a vector in the program, for any (pre-)activation x. 2) We apply the NETSOR^⊤+ Master Theorem (Yang, 2020b) to this program. This yields the coordinate distribution of each vector. The core insights here are demonstrated by the calculation with the Z random variables in Sections 3.1 and 3.2. 3) Finally, we need to show that (the rigorous version of) Eq. (26) indeed recovers the NTK and agrees with Eq. (22). This is done via an inductive (symbolic) computation, and the path concept in this section plays a key role here.

Footnote 21: We note that this formalization crucially relies on NETSOR^⊤+ and its Master Theorem from (Yang, 2020b), because the SGD unrolling cannot be done in NETSOR^⊤. The reason is that we need to express the output and loss derivatives of the network, which are scalars (or at least finite-dimensional), and that cannot be done in a NETSOR^⊤ program. Furthermore, the Master Theorem from (Yang, 2020a) only pertains to a specific type of program that looks like the first backpropagation after initialization. Thus, it cannot deal with the complete unrolling of SGD as we do here, which requires the more advanced Master Theorem from (Yang, 2020b).

Figure 3: Path Isomorphism. A graphical illustration of path isomorphism for paths p, q with |p| = |q| = 2 in an MLP and an RNN. In the MLP example, p is not isomorphic to q because the weights W^1, W^2 defining the vectors p^1, q^1 are not identical. For the RNN, p is isomorphic to q because p^1 and q^1 are defined with the same weight W.

6. Related Works

The connection between kernel methods and neural networks has had a long history before its recent resurgence. The Gaussian process (NNGP) view of wide neural networks, which characterizes the behavior of training only the last layer of a wide neural network, has been studied in (Daniely et al., 2016; Hazan & Jaakkola, 2015; Roux & Bengio, 2007; Lee et al., 2018; Matthews et al., 2018; Hinton & Neal, 1995; Novak et al., 2019). Since the original NTK paper (Jacot et al., 2018), many works have informally derived the infinite-width NTK for various architectures, such as CNNs (Arora et al., 2019), RNNs (Alemohammad et al., 2020), attention (Hron et al., 2020), ensembles (Littwin et al., 2020b), and graph neural networks (Du et al., 2019), but none of them formally proved NTKINIT or NTKTRAIN for those architectures. Finite-width corrections to the NTK were derived for fully connected networks in (Hanin & Nica, 2019; Littwin et al., 2020a). The validity of the NTK theory was empirically studied in (Lee et al., 2019) for a variety of architectures.

The Tensor Programs framework (Yang, 2019a; 2020a;b) was introduced in an attempt to unify and generalize the NNGP/NTK theory to a broad range of architectures, eliminating the need to re-develop the theory for each new architecture. For example, (Yang, 2019a) proved the architectural universality of the NNGP correspondence, while (Yang, 2020a) proved that of NTKINIT. On the other hand, (Yang, 2020b) developed the most general machinery for Tensor Programs and, as a corollary, constructed a comprehensive theory of nonlinear random matrix theory that, for example, can calculate the singular value distribution of a wide neural network of any architecture. Our proofs depend crucially on the machinery of (Yang, 2020b), as discussed in Section 5.

7. Conclusion

New theories of deep learning almost always start with MLPs, rightly so, as they are the simplest case and can often reveal the key insights more clearly. Of course, as deep learning itself is an applied field, one should always ask whether insights on MLPs extend to more general architectures, i.e., whether there is an architecturally universal extension of a proposed theory of MLPs. This is not always easy to answer.

In this paper, we showed that the NTK theory is architecturally universal; but more importantly, we showed that the Tensor Programs technique is a very powerful tool for answering the above question as a matter of routine. Looking forward, we hope to apply it to generate more novel and general insights.

References

Alemohammad, S., Wang, Z., Balestriero, R., and Baraniuk, R. The recurrent neural tangent kernel. ArXiv, abs/2006.10246, 2020.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. arXiv:1811.03962 [cs, math, stat], November 2018. URL http://arxiv.org/abs/1811.03962.

Arora, S., Du, S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. In NeurIPS, 2019.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2014.

Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. ArXiv, abs/1406.1078, 2014.

Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In NIPS, 2016.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.

Du, S., Hou, K., Póczos, B., Salakhutdinov, R., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In NeurIPS, 2019.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054 [cs, math, stat], October 2018. URL http://arxiv.org/abs/1810.02054.

Duvenaud, D., Maclaurin, D., Aguilera-Iparraguirre, J., Gómez-Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.

Fukushima, K. Cognitron: A self-organizing multilayered neural network. Biological Cybernetics, 20(3):121–136, 1975. doi: 10.1007/BF00342633. URL https://doi.org/10.1007/BF00342633.

Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980. doi: 10.1007/BF00344251. URL https://doi.org/10.1007/BF00344251.

Hanin, B. and Nica, M. Finite depth and width corrections to the neural tangent kernel. ArXiv, abs/1909.05989, 2019.

Hazan, T. and Jaakkola, T. Steps toward deep kernel methods from infinite neural networks. ArXiv, abs/1508.05133, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

Henaff, M., Bruna, J., and LeCun, Y. Deep convolutional networks on graph-structured data. ArXiv, abs/1506.05163, 2015.

Hinton, G. E. and Neal, R. Bayesian learning for neural networks. 1995.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9:1735–1780, 1997.

Hron, J., Bahri, Y., Sohl-Dickstein, J., and Novak, R. Infinite attention: NNGP and NTK for deep attention networks. In ICML, 2020.

Huang, G., Liu, Z., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ArXiv, abs/1502.03167, 2015.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, 2018.

Kipf, T. and Welling, M. Semi-supervised classification with graph convolutional networks. ArXiv, abs/1609.02907, 2017.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998. doi: 10.1109/5.726791.

Lecun, Y., Haffner, P., and Bengio, Y. Object recognition with gradient-based learning. 2000.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as Gaussian processes. ArXiv, abs/1711.00165, 2018.

Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. ArXiv, abs/1902.06720, 2019.

Li, P. and Nguyen, P.-M. On random deep weight-tied autoencoders: Exact asymptotic analysis, phase transitions, and implications to training. In ICLR, 2019.

Littwin, E., Galanti, T., and Wolf, L. On random kernels of residual architectures. ArXiv, 2020a.

Littwin, E., Myara, B., Sabah, S., Susskind, J., Zhai, S., and Golan, O. Collegial ensembles. ArXiv, abs/2006.07678, 2020b.

Matthews, A., Rowland, M., Hron, J., Turner, R., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. ArXiv, abs/1804.11271, 2018.

Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Hron, J., Abolafia, D., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are Gaussian processes. In ICLR, 2019.

Roux, N. L. and Bengio, Y. Continuous neural networks. In Meila, M. and Shen, X. (eds.), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, pp. 404–411, San Juan, Puerto Rico, March 2007. PMLR. URL http://proceedings.mlr.press/v2/leroux07a.html.

Rumelhart, D., Hinton, G. E., and Williams, R. J. Learning internal representations by error propagation. 1986.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. ArXiv, abs/1706.03762, 2017.

Yang, G. Tensor programs I: Wide feedforward or recurrent neural networks of any architecture are Gaussian processes. ArXiv, abs/1910.12478, 2019a.

Yang, G. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv:1902.04760 [cond-mat, physics:math-ph, stat], February 2019b.

Yang, G. Tensor programs II: Neural tangent kernel for any architecture. ArXiv, abs/2006.14548, 2020a.

Yang, G. Tensor programs III: Neural matrix laws. ArXiv, abs/2009.10685, 2020b.

Yang, G. and Hu, E. J. Feature learning in infinite-width neural networks. ArXiv, abs/2011.14522, 2020.

Yang, G. and Schoenholz, S. S. Mean field residual network: On the edge of chaos. In Advances in Neural Information Processing Systems, 2017.

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv:1811.08888 [cs, math, stat], November 2018. URL http://arxiv.org/abs/1811.08888.