On the linearity of large non-linear models: when and why the tangent kernel is constant

Chaoyue Liu∗ Libin Zhu† Mikhail Belkin‡

∗Department of Computer Science, The Ohio State University. E-mail: [email protected]
†Department of Computer Science and Halıcıoğlu Data Science Institute, University of California, San Diego. E-mail: [email protected]
‡Halıcıoğlu Data Science Institute, University of California, San Diego. E-mail: [email protected]

Abstract

The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian scaling applicable to the standard classes of neural networks. Our analysis provides a new perspective on the phenomenon of constant tangent kernel, which is different from the widely accepted “lazy training”. Furthermore, we show that the transition to linearity is not a general property of wide neural networks and does not hold when the last layer of the network is non-linear. It is also not necessary for successful optimization by gradient descent.

1 Introduction

As the width of certain non-linear neural networks increases, they become linear functions of their parameters. This remarkable property of large models was first identified in [12] where it was stated in terms of the constancy of the (neural) tangent kernel during the training process. More precisely, consider a neural network or, generally, a machine learning model f(w; x), which takes x as input and has w as its (trainable) parameters. Its tangent kernel K(x,z)(w) is defined as follows:

K_{(x,z)}(w) := ∇_w f(w; x)^T ∇_w f(w; z),   for fixed inputs x, z ∈ ℝ^d.    (1)
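For concreteness, the following minimal sketch (ours, not the authors' code) evaluates the tangent kernel of Eq. (1) with automatic differentiation for a toy shallow model; the architecture, width and inputs are illustrative assumptions.

```python
# Illustrative sketch: evaluating the tangent kernel K_(x,z)(w) of Eq. (1)
# for a toy shallow network with a linear output layer (NTK scaling 1/sqrt(m)).
import jax
import jax.numpy as jnp

m = 1000                                        # hidden width (assumed)
key_w, key_v = jax.random.split(jax.random.PRNGKey(0))
w0 = jax.random.normal(key_w, (m,))             # trainable hidden weights
v = jnp.sign(jax.random.normal(key_v, (m,)))    # fixed +/-1 output weights

def f(w, x):
    # f(w; x) = (1/sqrt(m)) * sum_i v_i * sigma(w_i * x), scalar input x
    return jnp.dot(v, jnp.tanh(w * x)) / jnp.sqrt(m)

def tangent_kernel(w, x, z):
    # K_(x,z)(w) = <grad_w f(w; x), grad_w f(w; z)>
    gx = jax.grad(f)(w, x)
    gz = jax.grad(f)(w, z)
    return jnp.dot(gx, gz)

print(float(tangent_kernel(w0, 0.7, -1.3)))
```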

The key finding of [12] was the fact that for some wide neural networks the kernel K_{(x,z)}(w) is a constant function of the weights w during training. While in the literature, including [12], this phenomenon is described in terms of the (linear) training dynamics, it is important to note that the tangent kernel is associated to the model itself. As such, it does not depend on the optimization algorithm or the choice of a loss function, which are parts of the training process.

The goal of this work is to clarify a number of issues related to the constancy of the tangent kernel, to provide specific conditions under which the kernel is constant, i.e., when non-linear models become linear in the limit as their width approaches infinity, and also to explicate the regimes when they do not. One important conclusion of our analysis is that the “transition to linearity” phenomenon discussed in this work (equivalent to constancy of the tangent kernel) cannot be explained by “lazy training” [6], associated to small change of parameters from the initialization point or model rescaling, which is widely held to be the reason for constancy of the tangent kernel, e.g., [20, 2, 10] (see Section 1.1 for a detailed discussion). The transition to linearity is neither due to a choice of a scaling of the model, nor is it a universal property of large models, including infinitely wide neural networks. In particular, the models shown to transition to linearity in this paper become linear in a Euclidean ball of an arbitrary fixed radius, not just in a small vicinity of the initialization point, where higher order terms of the Taylor series can be ignored.

Our first observation⁴ is that a function f(w, x) has a constant tangent kernel if and only if it is linear in w, that is,

f(w, x) = w^T φ(x) + f_0(x)

for some “feature map” φ and function f_0. Thus the constancy of the tangent kernel is directly linked to the linearity of the underlying model.

So what is the underlying reason that some large models transition to linearity as functions of their parameters, and when do we expect that to be the case? As known from mathematical analysis, the deviation from linearity is controlled by the second derivative, which is represented, for a multivariate function f, by the Hessian matrix H. If its spectral norm ‖H‖ is small compared to the gradient ∇_w f in a ball of a certain radius, the function f will be close to linear and will have a near-constant tangent kernel in that ball. Crucially, the spectral norm ‖H‖ depends not just on the magnitude of its entries, but also on the structure of the matrix H. This simple idea underlies the analysis in this paper. Note that throughout this paper we consider the Hessian of the model f, not of any related loss function.

Constant tangent kernel for neural networks with linear output layer. In what follows we analyze the class of neural networks with linear output layer, which includes networks that have been found to have constant tangent kernel in [12, 16, 8] and other works. We show that while the gradient norm ‖∇_w f‖ is (omitting log factors) of the order Θ(1) w.r.t. the network width m, the spectral norm of the Hessian matrix ‖H‖ scales with m as 1/√m.
In the infinite width limit, this implies a vanishing Hessian and hence transition to linearity of the model in a ball of an arbitrary fixed radius. A consequence of this analysis is the constancy of the tangent kernel, providing a different perspective on the results in [12] and the follow-up works.

We proceed to expose the underlying reason why the Hessian matrix scales differently from the gradient and delimit the regimes where this phenomenon exists. As we show, the scaling of the Hessian spectral norm is controlled both by the ∞-norms of the vectors ∂f/∂α^(l), l ∈ [L], where α^(l) is the (vector) value of the l-th hidden layer, and by the norms of the layer-wise derivatives (specifically, the (2, 2, 1)-norms of the corresponding order 3 tensors). On the other hand, the scaling of the gradient and the tangent kernel is controlled by the 2-norms (i.e., Euclidean norms) of ∂f/∂α^(l). As the network width m (i.e., the minimal width of the hidden layers) becomes large, the discrepancy between the ∞-norm and the 2-norm increases, while the (2, 2, 1)-norms remain of the same order. Hence we obtain the discrepancy between the scaling behaviors of the Hessian and the gradient.

Non-constancy of tangent kernels. We proceed to demonstrate, both theoretically (Section 4) and experimentally (Section 6), that the constancy of the tangent kernel is not a general property of large models, including wide networks, even in the “lazy” training regime. In particular, if the output layer of a network is non-linear, e.g., if there is a non-linear activation on the output, the Hessian norm does not tend to zero as m → ∞, and constancy of the tangent kernel will not hold in any fixed neighborhood nor along the optimization path, although each individual parameter may undergo only a small change. This demonstrates that the constancy of the tangent kernel relies on specific structural properties of the models. Similarly, we show that inserting a narrow “bottleneck” layer, even if it is linear, will generally result in the loss of near-linearity, as the Hessian norm becomes large compared to the gradient ∇_w f of the model.

Importantly, as we discuss in Section 5, non-constancy of the tangent kernel does not preclude efficient optimization. We construct examples of wide networks which can be provably optimized by gradient descent, yet with tangent kernel provably far from constant along the optimization path and with Hessian norm Ω(1), the same order as the gradient.

1.1 Discussion and related work

We proceed to make a number of remarks in the context of some recent work on the subject.

⁴ While it is a known mathematical fact (see [9, 19]), we have not seen it in the neural network literature, possibly because the discussion is typically concerned with the dynamics of optimization. As a special case, note that while ∇f ≡ const clearly implies that f is linear, it is not a priori obvious that the weaker condition ‖∇f‖ ≡ const is also sufficient.

Is the weight change from initialization to convergence small? In the recent literature (e.g., [20, 2, 10]) it is sometimes asserted that the constancy of the tangent kernel is explained by a small change of the weight vector during training, a property related to “lazy training” introduced in [6]. It is important to point out that the notion “small” depends crucially on how the change is measured. Indeed, as we discuss below, when measured correctly in relation to the tangent kernel, the change from initialization is not small.

Let w_0 and w* be the weight vectors at initialization and at convergence, respectively. For example, consider a one hidden layer network of width m. Each component of the weight vector is updated by O(1/√m) under gradient descent, as shown in [12], and hence for wide networks ‖w* − w_0‖_∞ = O(1/√m), a quantity that vanishes with increasing width. In contrast, the change in the Euclidean norm is not small in training: ‖w* − w_0‖² = Σ_{i=1}^m (w*_i − w_{0,i})² = Θ(1). Thus convergence happens within a Euclidean ball with radius independent of the network width. In fact, the Euclidean norm of the change of the weight vector cannot be small for Lipschitz continuous models, even in the limit of infinitely many parameters. This is because

‖w* − w_0‖ ≥ |f(w_0; x) − y| / sup_w ‖∇_w f(w)‖,    (2)

where y is the label at x. Note that the difference |f(w_0; x) − y| between the initial prediction f(w_0; x) and the ground truth label y is of the same order as ‖∇_w f‖. Thus, we see that ‖w* − w_0‖ = Ω(1), no matter how many parameters the model f has.

We note that the (approximate) linearity of a model in a certain region (and hence the constancy of the tangent kernel) is predicated on the second-order term of the Taylor expansion, (w − w_0)^T H (w − w_0). That term depends on the Euclidean distance from the initialization, ‖w − w_0‖ (and on the spectral norm of the Hessian), not on the ∞-norm ‖w − w_0‖_∞. Since, as we discussed above, these norms differ by a factor of √m, an argument based on small change of individual parameters from initialization cannot explain the remarkable phenomenon of constant tangent kernel. In contrast to these interpretations, we show that certain large networks have near-constant tangent kernel in a ball of fixed radius due to the vanishing Hessian norm, as their widths approach infinity. Indeed, that is the case for the networks analyzed in the NTK literature [12, 16, 8, 7].

Can the transition to linearity be explained by model rescaling? The work [6] introduced the term “lazy training” and proposed a mechanism for the constancy of the tangent kernel based on rescaling the model. While, as shown in [6], model rescaling can lead to lazy training, as we discuss below, it does not explain the phenomenon of constant tangent kernel in the setting of the original paper [12] and subsequent works. Specifically, [6] provides the following criterion for the near-constancy of the tangent kernel (using their notation):

κ_f(w_0) := ‖f(w_0) − y‖ · ‖D²f(w_0)‖ / ‖Df(w_0)‖² ≪ 1,    (3)

where we denote the two factors by A := ‖f(w_0) − y‖ and B := ‖D²f(w_0)‖ / ‖Df(w_0)‖². Here y is the ground truth label, D²f(w_0) is the Hessian of the model f at initialization, and ‖Df(w_0)‖² is the squared norm of the gradient, i.e., a diagonal entry of the tangent kernel. The paper [6] shows that the model f can be rescaled to satisfy the condition in Eq. (3) as follows. Consider a rescaled model αf with a scaling factor α ∈ ℝ, α > 0. Then the quantity κ_{αf} becomes

κ_{αf}(w_0) = ‖αf(w_0) − y‖ · (1/α) · ‖D²f(w_0)‖ / ‖Df(w_0)‖² = ‖f(w_0) − y/α‖ · ‖D²f(w_0)‖ / ‖Df(w_0)‖².    (4)

Assuming that f(w_0) = 0 and choosing a large α forces κ_{αf} ≪ 1, by rescaling the factor A to be small while keeping B unchanged.

While rescaling the model, together with the important assumption of f(w_0) = 0, leads to a lazy training regime, we point out that it is not the same regime as observed in the original work [12] and follow-up papers such as [16, 8], and it is also different from practical neural network training, since we usually have A = ‖f(w_0) − y‖ = Θ(1) in these settings. Specifically:

• The assumption of f(w_0) = 0 is necessary for the rescaled models in [6] to have A ≪ 1. Yet, the networks, such as those analyzed in [12], are initialized so that f(w_0) = Θ(1).

• From Eq. (4), we see that rescaling the model f by α is equivalent to rescaling the ground truth label y by 1/α without changing the model (this can also be seen from the loss function, cf. Eq. (2) of [6]). When α is large, the rescaled label y/α is close to zero. However, no such rescaling happens in practice or in works such as [12, 16, 8]. The training dynamics of the model with the label y/α does not generally match the dynamics of the original problem with the label y and will result in a different solution.

Since A = Θ(1) in the NTK setting and many practical settings, to satisfy the criterion in Eq. (3) the model needs to have B = ‖D²f(w_0)‖ / ‖Df(w_0)‖² ≪ 1. In fact, we note that the analysis of 2-layer networks in [6] uses a different argument, not based on model rescaling. Indeed, as we show in this work, B is small for a broad class of wide neural networks with linear output layer, due to a vanishing norm of the Hessian as the width of the network increases. In summary, the rescaled models satisfy the criterion κ ≪ 1 by scaling the factor A to be small, while the neural networks, such as the ones considered in the original work [12], satisfy this criterion by having B ≪ 1, while A = Θ(1).

Is near-linearity necessary for optimization? In this work we concentrate on understanding the phenomenon of constant tangent kernel, when large non-linear systems transition to linearity with an increasing number of parameters. The linearity implies convergence of gradient descent, assuming that the tangent kernel is non-degenerate at initialization. However, it is important to emphasize that linearity or near-linearity is not a necessary condition for convergence. Instead, convergence is implied by uniform conditioning of the tangent kernel in a neighborhood of a certain radius, while the linearity is controlled by the norm of the Hessian. These are conceptually and practically different phenomena, as we show on an example of a wide shallow network with a non-linear output layer in Section 5. See also our paper [17] for an in-depth discussion of optimization.

2 Notation and Basic Results on Tangent Kernel and Hessian

2.1 Notation and Preliminaries

We use bold lowercase letters, e.g., v, to denote vectors, capital letters, e.g., W, to denote matrices, and bold capital letters, e.g., W, to denote matrix tuples or higher order tensors. We denote the set {1, 2, ···, n} as [n]. We use the following norms in our analysis: for vectors, we use ‖ · ‖ to denote the Euclidean norm (a.k.a. vector 2-norm) and ‖ · ‖_∞ for the ∞-norm; for matrices, we use ‖ · ‖ to denote the spectral norm (i.e., matrix 2-norm) and ‖ · ‖_F to denote the Frobenius norm. In addition, we use a tilde, e.g., Õ(·), to suppress logarithmic terms in Big-O notation.

We use ∇_w f to represent the derivative of f(w; x) with respect to w. For (vector-valued) functions, we use the following definition of Lipschitz continuity:

Definition 2.1. A function f : ℝ^m → ℝ^n is called L_f-Lipschitz continuous if there exists L_f > 0 such that for all x, z ∈ ℝ^m, ‖f(x) − f(z)‖ ≤ L_f ‖x − z‖.

For an order 3 tensor, we define its (2, 2, 1)-norm:

Definition 2.2 ((2, 2, 1)-norm of order 3 tensors). For an order 3 tensor T ∈ ℝ^{d_1×d_2×d_3}, with components T_{ijk}, i ∈ [d_1], j ∈ [d_2], k ∈ [d_3], define its (2, 2, 1)-norm as

‖T‖_{2,2,1} := sup_{‖x‖=‖z‖=1} Σ_{k=1}^{d_3} | Σ_{i=1}^{d_1} Σ_{j=1}^{d_2} T_{ijk} x_i z_j |,   where x ∈ ℝ^{d_1}, z ∈ ℝ^{d_2}.    (5)

We will later need the following proposition, which is essentially a special case of the Hölder inequality.

Proposition 2.1. Consider a matrix A with components A_{ij} = Σ_k T_{ijk} v_k, where T_{ijk} is a component of the order 3 tensor T and v_k is a component of the vector v. Then the spectral norm of A satisfies

‖A‖ ≤ ‖T‖_{2,2,1} ‖v‖_∞.    (6)

Proof. Note that the spectral norm is defined as ‖A‖ = sup_{‖x‖=‖z‖=1} x^T A z. Then

‖A‖ = sup_{‖x‖=‖z‖=1} Σ_{i,j,k} T_{ijk} x_i z_j v_k ≤ max_k |v_k| · sup_{‖x‖=‖z‖=1} Σ_k | Σ_{i,j} T_{ijk} x_i z_j | = ‖v‖_∞ ‖T‖_{2,2,1}.

2.2 Tangent kernel and the Hessian

Consider a machine learning model, e.g., a neural network, f(w; x), which takes x ∈ ℝ^d as input and has w ∈ ℝ^p as the trainable parameters. Throughout this paper, we assume f is twice differentiable with respect to the parameters w. To simplify the analysis, we further assume the output of the model f is a scalar. Given a set of points {x_i}_{i=1}^n, where each x_i ∈ ℝ^d, one can build an n × n tangent kernel matrix K(w), where each entry K_{ij}(w) = K_{(x_i,x_j)}(w). As discovered in [12] and analyzed in the subsequent works [16, 8], the tangent kernel is constant for certain infinitely wide networks during training by gradient descent methods.

First, we observe that the constancy of the tangent kernel is equivalent to the linearity of the model. While the mathematical result is not new (see [9, 19]), we have not seen it stated in the machine learning literature (the proof can be found in Appendix C).

Proposition 2.2 (Constant tangent kernel = Linear model). The tangent kernel of a differentiable function f(w; x) is constant if and only if f(w; x) is linear in w.

Of course, for a model to be linear it is necessary and sufficient for the Hessian to vanish. The following proposition extends this result by showing that a small Hessian norm is a sufficient condition for a near-constant tangent kernel. The proof can be found in Appendix D.

Proposition 2.3 (Small Hessian norm ⇒ Small change of tangent kernel). Given a point w_0 ∈ ℝ^p and a ball B(w_0, R) := {w ∈ ℝ^p : ‖w − w_0‖ ≤ R} with fixed radius R > 0, if the Hessian matrix satisfies ‖H(w)‖ < ε, where ε > 0, for all w ∈ B(w_0, R), then the tangent kernel K(w) of the model, as a function of w, satisfies

|K_{(x,z)}(w) − K_{(x,z)}(w_0)| = O(εR),   ∀w ∈ B(w_0, R), ∀x, z ∈ ℝ^d.    (7)

As we shall see in Section 3, all neural networks that are proven in [12, 8, 7] to have (near) constant tangent kernel during training have small (zero, in the limit of m → ∞) spectral norms of the corresponding Hessian matrices.

3 Transition to linearity: non-linear neural networks with linear output layer

In this section, we analyze the class of neural networks with linear output layer, i.e., networks with no non-linear activation on the final output. We show that the spectral norm of the Hessian matrix becomes small as the width of each hidden layer increases. In the limit of infinite width, these spectral norms vanish and the models become linear, with constant tangent kernels. We point out that the neural networks that have been shown to have constant tangent kernels in [12, 16, 8] fall into this category.

3.1 1-hidden layer neural networks

As a warm-up for the more complex setting of deep networks, we start by considering the simple case of a shallow fully-connected neural network with a fixed output layer, defined as follows:

f(w; x) = (1/√m) Σ_{i=1}^m v_i α_i(x),   with α_i(x) = σ(w_i x),  x ∈ ℝ.    (8)

Here m is the number of neurons in the hidden layer, v = (v_1, ···, v_m) is the vector of output layer weights, and w = (w_1, ···, w_m) ∈ ℝ^m is the vector of weights in the hidden layer. We assume that the activation function σ(·) is β_σ-smooth (e.g., σ can be sigmoid or tanh). We initialize at random, w_i ∼ N(0, 1) and v_i ∈ {−1, 1}. We treat v as fixed parameters and w as trainable parameters. For the purpose of illustration, we assume the input x is of dimension 1; the multi-dimensional analysis is similar.

Remark 3.1. This definition of a shallow neural network (i.e., with the presence of a factor 1/√m and v_i and w_i of order O(1)) is consistent with the NTK parameterization used to show constancy of the tangent kernel in [12, 16].

Hessian matrix. We observe that the Hessian matrix H of the neural network f is sparse, specifically, diagonal:

H_{ij} = ∂²f/∂w_i∂w_j = (1/√m) v_i σ''(w_i x) x² · 1_{i=j}.

Consequently, if the input x is bounded, say |x| ≤ C, the spectral norm of the Hessian H is

‖H‖ = max_{i∈[m]} |H_{ii}| = (x²/√m) max_{i∈[m]} |v_i σ''(w_i x)| ≤ (1/√m) β_σ C² = O(1/√m).    (9)

In the limit of m → ∞, the spectral norm ‖H‖ converges to 0.

Tangent kernel and gradient. On the other hand, the magnitude of the norm of the tangent kernel of f is of order Θ(1) in terms of m. Specifically, for each diagonal entry we have

K_{(x,x)}(w) = ‖∇_w f(w; x)‖² = (1/m) Σ_{i=1}^m x² (σ'(w_i x))² = Θ(1).    (10)

In the limit of m → ∞, K_{(x,x)}(w) = x² E_{w∼N(0,1)}[(σ'(wx))²]. Hence the trace of the tangent kernel is also Θ(1). Since the tangent kernel is a positive definite matrix of size independent of m, its norm is of the same order as its trace.

Therefore, from Eq. (9) and Eq. (10) we observe that the tangent kernel scales as Θ(1) while the norm of the Hessian scales as O(1/√m) with the size of the neural network f. Furthermore, as m → ∞, the norm of the Hessian converges to zero and, by Proposition 2.3, the tangent kernel becomes constant.
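The scaling in Eq. (9) and Eq. (10) can be checked numerically from the closed-form derivatives above. The sketch below (ours; the tanh activation, input value and widths are arbitrary choices) prints max_i |H_ii| and K_(x,x) at random initialization for growing m.

```python
# Numerical check of Eq. (9)-(10): for f(w;x) = (1/sqrt(m)) sum_i v_i sigma(w_i x),
# the Hessian is diagonal with entries v_i sigma''(w_i x) x^2 / sqrt(m), so ||H||
# shrinks like 1/sqrt(m), while the diagonal kernel entry K_(x,x) stays Theta(1).
import numpy as np

def sigma_p(u):   return 1.0 - np.tanh(u) ** 2                         # sigma'
def sigma_pp(u):  return -2.0 * np.tanh(u) * (1.0 - np.tanh(u) ** 2)   # sigma''

x = 1.5
rng = np.random.default_rng(0)
for m in [100, 10_000, 1_000_000]:
    w = rng.standard_normal(m)
    v = rng.choice([-1.0, 1.0], size=m)
    hess_norm = np.max(np.abs(v * sigma_pp(w * x) * x ** 2)) / np.sqrt(m)
    kernel_xx = np.mean((x * sigma_p(w * x)) ** 2)
    print(f"m={m:>9}:  ||H|| ~ {hess_norm:.5f},  K(x,x) ~ {kernel_xx:.4f}")
```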

Why does the Hessian become small with increasing width? So why should there be a discrepancy between the scaling of the Hessian spectral norm and the norm of the gradient? This is not a trivial question. There is no intrinsic reason why second and first order derivatives should scale differently with the size of an arbitrary model. In the rest of this subsection we analyze the source of this phenomenon in wide neural networks, connecting it to the disparity of different norms in high dimension. Specifically, we show that the Hessian spectral norm is controlled by the ∞-norm of the vector ∂f/∂α, i.e., ‖∂f/∂α‖_∞. In contrast, the tangent kernel and the norm of the gradient are controlled by its Euclidean norm ‖∂f/∂α‖. The disparity between these norms is the underlying reason for the transition to linearity in the limit of infinite width.

• Hessian is controlled by ‖∂f/∂α‖_∞. Given a model f as in Eq. (8), its Hessian matrix H(f) is given by

H(f) = ∂²f/∂w² = Σ_{i=1}^m (∂f/∂α_i) ∂²α_i/∂w²,    (11)

where ∂²α_i/∂w² are the matrix slices of the order 3 tensor of partial derivatives ∂²α/∂w². When there is no ambiguity, we suppress the argument and denote the Hessian matrix by H. By Proposition 2.1 (essentially Hölder's inequality: |a^T b| ≤ ‖a‖_1 ‖b‖_∞), we have

‖H‖ ≤ ‖∂²α/∂w²‖_{2,2,1} · ‖∂f/∂α‖_∞.    (12)

For this 1-hidden layer network, the tensor ∂²α/∂w² is given by

(∂²α/∂w²)_{ijk} = ∂²α_k/(∂w_i ∂w_j) = σ''(w_k x) x² · I_{i=j=k},    (13)

where I_{·} is the indicator function. By the definition of the (2, 2, 1)-norm we have

‖∂²α/∂w²‖_{2,2,1} = sup_{‖v_1‖=‖v_2‖=1} Σ_{k=1}^m |σ''(w_k x) x² (v_1)_k (v_2)_k| ≤ β_σ x² sup_{‖v_1‖=‖v_2‖=1} Σ_{k=1}^m |(v_1)_k (v_2)_k| ≤ β_σ x².

Thus, we conclude that the Hessian spectral norm satisfies ‖H‖ = O(‖∂f/∂α‖_∞).

• Tangent kernel and the gradient are controlled by ‖∂f/∂α‖. Note that the norm of the tangent kernel is lower bounded by the average of its diagonal entries: ‖K‖ ≥ (1/n) Σ_{i=1}^n K_{(x_i,x_i)}, where n is the size of the dataset. Consider an arbitrary diagonal entry K_{(x,x)} of the tangent kernel matrix:

K_{(x,x)} = ‖∇_w f(w; x)‖² = ‖ (∂α/∂w)^T ∂f/∂α ‖².    (14)

Note that ∂α/∂w is a diagonal matrix with ∂α_i/∂w_i = σ'(w_i x) x. By the Lipschitz continuity of σ(·), ‖∂α/∂w‖ is finite. Therefore, the tangent kernel is of the same order as the 2-norm ‖∂f/∂α‖.

• The discrepancy between the norms. For the network in Eq. (8) we have ∂f/∂α = (1/√m) v. Hence,

‖∂f/∂α‖_∞ = 1/√m,   ‖∂f/∂α‖ = 1.    (15)

The transition to linearity stems from this observation and the fact discussed above that the Hessian

norm scales as ‖∂f/∂α‖_∞, while the tangent kernel is of the same order as ‖∂f/∂α‖.

In what follows, we show that this is a general principle applicable to wide neural networks. We start by analyzing two hidden layer neural networks, which are mathematically similar to the general case, but much less complex in terms of notation.
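Before moving to deeper networks, the following tiny check (ours) illustrates the norm disparity of Eq. (15): the vector ∂f/∂α = v/√m has unit Euclidean norm but an ∞-norm that vanishes as m grows.

```python
# Disparity between the infinity-norm and the 2-norm of df/dalpha = v/sqrt(m).
import numpy as np

rng = np.random.default_rng(0)
for m in [100, 10_000, 1_000_000]:
    v = rng.choice([-1.0, 1.0], size=m)      # output-layer signs
    d = v / np.sqrt(m)                       # df/dalpha for the network in Eq. (8)
    print(f"m={m:>9}:  inf-norm = {np.max(np.abs(d)):.4f},  2-norm = {np.linalg.norm(d):.4f}")
```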

3.2 Two hidden layer neural networks

Now, we demonstrate that analogous results hold for 2-hidden layer neural networks. Consider the 2-hidden layer neural network:

f(W_1, W_2; x) = (1/√m) v^T σ( (1/√m) W_2 σ(W_1 x) ),   W_1 ∈ ℝ^{m×d}, W_2 ∈ ℝ^{m×m}, x ∈ ℝ^d.    (16)

We denote the output of the first hidden layer by α^(1)(W_1; x) = σ(W_1 x) and the output of the second hidden layer by α^(2)(W_1, W_2; x) = σ( (1/√m) W_2 σ(W_1 x) ).

Hessian is controlled by ‖∂f/∂α‖_∞. Similarly to Eq. (12), we can bound the Hessian spectral norm by the ∞-norms of ∂f/∂α^(1) and ∂f/∂α^(2).

Proposition 3.1.

‖H‖ ≤ ( ‖∂α^(1)/∂W_1‖² ‖∂²α^(2)/(∂α^(1))²‖_{2,2,1} + 2 ‖∂α^(1)/∂W_1‖ ‖∂²α^(2)/(∂W_2 ∂α^(1))‖_{2,2,1} + ‖∂²α^(2)/∂W_2²‖_{2,2,1} ) ‖∂f/∂α^(2)‖_∞ + ‖∂²α^(1)/∂W_1²‖_{2,2,1} ‖∂f/∂α^(1)‖_∞.    (17)

Here ∂/∂W_l denotes partial derivatives w.r.t. each element of W_l, i.e., after flattening the matrix W_l. As this proposition is a special case of Theorem 3.1, we omit the proof.

When W_1, W_2 are initialized as random Gaussians, every term in Eq. (17), except for ‖∂f/∂α^(1)‖_∞ and ‖∂f/∂α^(2)‖_∞, is of order O(1), with high probability within a ball of a finite radius (see the discussion in Subsection 3.3 for details). Hence, just like in the one hidden layer case, the magnitude of the Hessian spectral norm is controlled by these ∞-norms:

‖H‖ = O( ‖∂f/∂α^(1)‖_∞ + ‖∂f/∂α^(2)‖_∞ ).    (18)

Tangent kernel and the gradient are controlled by ‖∂f/∂α‖. A diagonal entry of the kernel matrix can be decomposed as

K_{(x,x)} = ‖∇_{W_1} f(W_1, W_2; x)‖² + ‖∇_{W_2} f(W_1, W_2; x)‖² = ‖ (∂α^(1)/∂W_1)^T ∂f/∂α^(1) ‖² + ‖ (∂α^(2)/∂W_2)^T ∂f/∂α^(2) ‖²,

with each additive term corresponding to one layer. As the matrix ∂α^(l)/∂W_l and the vector ∂f/∂α^(l) are independent of each other and random at initialization, we expect ‖ (∂α^(l)/∂W_l)^T ∂f/∂α^(l) ‖² to be of the same order as ‖∂α^(l)/∂W_l‖² ‖∂f/∂α^(l)‖², for l = 1, 2.

3.3 Multilayer neural networks

Now, we extend the analysis to general deep neural networks. First, we show that, in parallel to the one and two hidden layer networks, the Hessian spectral norm and the tangent kernel of a multilayer neural network are controlled by the ∞-norms and 2-norms of the vectors ∂f/∂α^(l), respectively. Then we show that the magnitudes of these two types of vector norms scale differently with respect to the network width.

We consider a general form of a deep neural network f with a linear output layer:

α^(0) = x,
α^(l) = φ_l(w^(l); α^(l−1)),   ∀l = 1, 2, ···, L,
f = (1/√m) v^T α^(L),    (19)

where each vector-valued function φ_l(w^(l); ·) : ℝ^{m_{l−1}} → ℝ^{m_l}, with parameters w^(l) ∈ ℝ^{p_l}, is considered as a layer of the network, and m = m_L is the width of the last hidden layer. This definition includes the standard fully connected, convolutional (CNN) and residual (ResNet) neural networks as special cases.

Initialization and parameterization. In this paper, we consider the NTK initialization/parameterization [12], under which the constancy of the tangent kernel was initially observed. Specifically, the parameters (weights) W := {w^(1), w^(2), ···, w^(L), w^(L+1) := v} are drawn i.i.d. from a standard Gaussian, i.e., w_i^(l) ∼ N(0, 1), at initialization, denoted W_0. The factor 1/√m in the output layer is required by the NTK parameterization so that the output f is of order Θ(1). Different parameterizations (e.g., the LeCun initialization: w_i^(l) ∼ N(0, 1/m)) rescale the tangent kernel and the Hessian by the same factor, and thus do not change our conclusions (see Appendix A).

3.3.1 Bounding the Hessian

To simplify the notation, we start by defining the following useful quantities:

Q_∞(f) := max_{1≤l≤L} ‖∂f/∂α^(l)‖_∞,    (20)

Q_L(f) := max_{1≤l≤L} ‖∂α^(l)/∂w^(l)‖,

Q_{2,2,1}(f) := max_{1≤l_1≤l_2≤L} { ‖∂²α^(l_1)/(∂w^(l_1))²‖_{2,2,1},  ‖∂α^(l_1)/∂w^(l_1)‖ · ‖∂²α^(l_2)/(∂α^(l_2−1) ∂w^(l_2))‖_{2,2,1},  ‖∂α^(l_1)/∂w^(l_1)‖² · ‖∂²α^(l_2)/(∂α^(l_2−1))²‖_{2,2,1} }.    (21)

Remark 3.2. It is important to note that the quantity Q_∞(f) is simply the maximum of the ∞-norms ‖∂f/∂α^(l)‖_∞, and that Q_L(f) and Q_{2,2,1}(f) are independent of the vectors ∂f/∂α^(l), l ∈ [L].

The Hessian spectral norm is bounded by these quantities via the following theorem (see Appendix E for the proof).

Theorem 3.1. Consider an L-layer neural network of the form of Eq. (19). For any W in the parameter space, the following inequality holds:

‖H(f)‖ ≤ C_1 Q_{2,2,1}(f) Q_∞(f) + (1/√m) C_2 Q_L(f)²,

where C_1 = L(L² L_φ^{2L} + L L_φ^L + 1) and C_2 = L L_φ^L.

Remark 3.3. The factor 1/√m in the second term comes from the definition of the output layer in Eq. (19) and is useful to make sure that the model output at initialization is of the same order as the ground truth labels.

Tangent kernel and 2-norms. A diagonal entry of the kernel matrix can be decomposed as

K_{(x,x)} = Σ_{l=1}^L ‖∇_{w^(l)} f(W; x)‖² = Σ_{l=1}^L ‖ (∂α^(l)/∂w^(l))^T ∂f/∂α^(l) ‖²,

with each additive term corresponding to one layer. As before, we expect each term ‖ (∂α^(l)/∂w^(l))^T ∂f/∂α^(l) ‖² to be of the same order as ‖∂α^(l)/∂w^(l)‖² ‖∂f/∂α^(l)‖².

3.3.2 Small Hessian spectral norm and constant tangent kernel

In the following, we apply Theorem 3.1 to fully connected neural networks and show that the corresponding Hessian spectral norm scales as Õ(1/√m) in a region of finite radius. To simplify our analysis, we make the following assumptions.

Assumptions. We assume the hidden layer width m_l = m for all l ∈ [L]⁵, the number of parameters in each layer p_l ≥ m, and that the output is a scalar. We assume that the (vector-valued) layer functions φ_l(w; α), l ∈ [L], are L_φ-Lipschitz continuous and twice differentiable with respect to the input α and the parameters w.

A fully connected neural network has the form of Eq. (19), with each layer function specified by

α^(l) = σ(α̃^(l)),   α̃^(l) = (1/√m) W^(l) α^(l−1),   for l ∈ [L],    (22)

where σ(·) is an L_σ-Lipschitz continuous, β_σ-smooth activation function, such as sigmoid or tanh. The layer parameters W^(l) are reshaped into an m × m matrix. The Euclidean norm of W becomes ‖W‖ = ( Σ_{l=1}^L ‖W^(l)‖_F² )^{1/2}.

With high probability over the Gaussian random initialization, we have the following lemma bounding the quantities Q_∞(f), Q_{2,2,1}(f) and Q_L(f) in a neighborhood of W_0:

Lemma 3.1. Consider a fully connected neural network f(W; x) with linear output layer and Gaussian random initialization W_0. Given any fixed R > 0, at any point W ∈ B(W_0, R) := {W : ‖W − W_0‖ ≤ R}, with high probability over the initialization, the quantities satisfy

Q_∞(f) = Õ(1/√m),  Q_{2,2,1}(f) = O(1),  Q_L(f) = O(1),  w.r.t. m.    (23)

See the proof of the lemma in Appendix F. Applying this lemma to Theorem 3.1, we immediately obtain the following theorem:

Theorem 3.2. Consider a fully connected neural network f(W; x) with linear output layer and Gaussian random initialization W_0. Given any fixed R > 0, and any W ∈ B(W_0, R) := {W : ‖W − W_0‖ ≤ R}, with high probability over the initialization, the Hessian spectral norm satisfies

‖H(W)‖ = Õ(1/√m).    (24)

Remark 3.4. We note that the above theorem also applies to more general networks that have different hidden layer widths, as long as the width of each layer is larger than m. See Theorem 3.3 below.

In the limit of m → ∞, the spectral norm of the Hessian ‖H(W)‖ converges to 0, for all W ∈ B(W_0, R). By Proposition 2.3, this immediately implies constancy of the tangent kernel and linearity of the model in the ball B(W_0, R). On the other hand, the tangent kernel is of order Θ(1) (see for example [7], where the smallest eigenvalue of the tangent kernel is lower bounded by a width-independent constant). Intuitively, the order of the tangent kernel stems from the fact that the 2-norms ‖∂f/∂α^(l)‖ are of order Θ(1).

Remark 3.5. By the optimization theory developed in our work [17], a finite radius R is enough to include the gradient descent solution for the square loss. Hence, for very wide networks, the tangent kernel is constant during gradient descent training.

⁵ The assumption m_l = m is to simplify the analysis; as we discuss below, we only need m_l ≥ m.

3.3.3 Neural networks with hidden layers of different width and general architectures

Our analysis above is applicable to other common neural architectures including Convolutional Neural Networks (CNN) and ResNets, as well as networks of mixed architectural types. Below we briefly highlight the main differences from the fully connected case. Precise statements can be found in Appendix G.

CNN. A convolutional layer maps a hidden layer “image” α^(l) ∈ ℝ^{p×q×m_l} to the next layer α^(l+1) ∈ ℝ^{p×q×m_{l+1}}, where p and q are the sizes of images in the spatial dimensions and m_l and m_{l+1} are the numbers of channels. Note that the number of channels, which can be arbitrarily large, defines the width of a CNN, while the spatial dimensions p and q are always fixed. The key observation is that a convolutional layer is “fully connected” in the channel dimension. In contrast, the convolutional operation, which is sparse, acts only within the spatial dimensions. Hence, we can apply our analysis to the channel dimension with only minor modifications. As the spatial dimension sizes are independent of the network width, the convolutional operation only contributes constant factors to our analysis. Therefore, our norm analysis extends to the CNN setting.

ResNet. A residual layer has the same form as Eq. (22), except that the activation is α^(l) = σ(α̃^(l)) + α^(l−1), which results in an additional identity matrix I in the first order derivative w.r.t. α^(l−1). As shown in Appendix G, the appearance of I does not affect the orders of the ∞-norms and 2-norms of ∂f/∂α^(l), nor of the related (2, 2, 1)-norms. Hence, the analysis above applies.

Architecture with mixed layer types. Neural networks used in practice are often a mixture of different layer types, e.g., a series of convolutional layers followed by fully connected layers. Since our analysis relies on layer-wise quantities, our results extend to such networks.

We have the following general theorem which summarizes our theoretical results.

Theorem 3.3 (Hessian norm is controlled by the minimum hidden layer width). Consider a general neural network f(W; x) of the form of Eq. (19), which can be a fully connected network, CNN, ResNet or a mixture of these types. Let m be the minimum of the hidden layer widths, i.e., m = min_{l∈[L]} m_l. Given any fixed R > 0, and any W ∈ B(W_0, R) := {W : ‖W − W_0‖ ≤ R}, with high probability over the initialization, the Hessian spectral norm satisfies

‖H(W)‖ = Õ(1/√m).    (25)
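To illustrate Theorem 3.3, the following sketch (ours, not the paper's code) estimates ‖H(W)‖ at initialization by power iteration on Hessian-vector products for a small fully connected network with linear output layer; the depth, activation, input and widths are arbitrary assumptions, and the estimate should shrink roughly like 1/√m.

```python
# Estimating the Hessian spectral norm ||H(W)|| of a fully connected network
# with linear output layer by power iteration on Hessian-vector products.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_params(key, m, d=5, depth=3):
    keys = jax.random.split(key, depth + 1)
    shapes = [(m, d)] + [(m, m)] * (depth - 1) + [(1, m)]
    return [jax.random.normal(k, s) for k, s in zip(keys, shapes)]

def f(params, x):
    a = x
    for W in params[:-1]:
        a = jnp.tanh(W @ a / jnp.sqrt(a.shape[0]))      # NTK-style 1/sqrt(fan-in)
    return (params[-1] @ a)[0] / jnp.sqrt(a.shape[0])   # linear output layer

def hessian_norm(params, x, iters=20):
    flat, unravel = ravel_pytree(params)
    grad_f = jax.grad(lambda w: f(unravel(w), x))
    hvp = lambda v: jax.jvp(grad_f, (flat,), (v,))[1]   # H @ v
    v = jnp.ones_like(flat) / jnp.sqrt(flat.size)
    for _ in range(iters):                              # power iteration
        hv = hvp(v)
        v = hv / jnp.linalg.norm(hv)
    return jnp.abs(jnp.vdot(v, hvp(v)))                 # Rayleigh quotient

x = jnp.ones(5)
for m in [64, 256, 1024]:
    print(m, float(hessian_norm(init_params(jax.random.PRNGKey(0), m), x)))
```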

4 Constant tangent kernel is not a general property of wide networks

In this section, we show that a class of infinitely wide neural networks with non-linear output does not generally have constant tangent kernels. This also demonstrates that a linear output layer is a necessary condition for the transition to linearity. We consider the neural network f̃:

f̃(w; x) := φ(f(w; x)),    (26)

where f(w; x) is a sufficiently wide neural network with linear output layer, as considered in Section 3, and φ(·) is a non-linear twice-differentiable activation function. The only difference between f and f̃ is that f̃ has a non-linear output layer. As we shall see, this difference leads to a non-constant tangent kernel during training, as well as a different scaling behavior of the Hessian spectral norm.

Tangent kernel of f̃. The gradient of f̃ is given by ∇_w f̃(w; x) = φ'(f(w; x)) ∇_w f(w; x). Hence, each diagonal entry of the tangent kernel of f̃ is

K̃_{(x,x)}(w) = ‖∇_w f̃(w; x)‖² = φ'(f(w; x))² K_{(x,x)}(w),    (27)

where K_{(·,·)}(w) is the tangent kernel of f. By Eq. (10) we have K̃_{(x,x)}(w) = Θ(1), which is of the same order as K_{(x,x)}(w).

Yet, unlike K_{(·,·)}(w), the kernel K̃_{(·,·)}(w) changes significantly during training, even as m → ∞ (with a change of the order of Θ(1)). To prove that, it is enough to verify that at least one entry of

K̃_{(x,x)}(w) has a change of Θ(1), for an arbitrary x. Consider a diagonal entry. For any w, we have

|K̃_{(x,x)}(w) − K̃_{(x,x)}(w_0)| = |φ'(f(w; x))² K_{(x,x)}(w) − φ'(f(w_0; x))² K_{(x,x)}(w_0)|
  ≥ |φ'(f(w; x))² − φ'(f(w_0; x))²| · K_{(x,x)}(w_0)  −  φ'(f(w; x))² · |K_{(x,x)}(w) − K_{(x,x)}(w_0)|  =:  A − B.

We note that the term B vanishes as m → ∞ due to the constancy of the tangent kernel of f. However, the term A is generally of the order Θ(1) when φ is non-linear⁶. To see that, consider any solution w* such that f(w*; x) = y (which exists for over-parameterized networks). Since f(w_0; x) is generally not equal to y, we obtain the result.

“Lazy training” does not explain constancy of NTK. From the above analysis, we can see that, even with the same parameter settings as f (i.e., the same initial parameters and the same parameter change), the network f̃ does not have constant tangent kernel, while the tangent kernel of f is constant. This implies that the constancy of the tangent kernel cannot be explained in terms of the magnitude of the parameter change from initialization. Instead, it depends on structural properties of the network, such as the linearity of the output layer. Indeed, as we discuss next, when the output layer is non-linear, the Hessian norm of f̃ no longer decreases with the width of the network.

Hessian matrix of f̃. The Hessian matrix of f̃ is

H̃ := ∂²f̃/∂w² = φ''(f) ∇_w f (∇_w f)^T + φ'(f) H,    (28)

where H is the Hessian matrix of the model f. Hence, the spectral norm satisfies

‖H̃‖ ≥ |φ''(f)| · ‖∇_w f‖² − |φ'(f)| · ‖H‖.    (29)

Since, as we already know, lim_{m→∞} ‖H‖ = 0, the second term |φ'(f)| · ‖H‖ vanishes in the infinite width limit. However, the first term is always of order Θ(1), as long as φ is not linear. Hence ‖H̃‖ = Ω(1), compared to ‖H‖ = Õ(1/√m) for the networks in Section 3, and it does not vanish as m → ∞.

Remark 4.1. Note that the first term |φ''(f)| · ‖∇_w f‖² in Eq. (29) has the same order as the squared 2-norm ‖∂f/∂α^(l)‖², instead of the ∞-norm which controls the second term. Therefore, the spectral norm of H̃ is no longer of order O(1/√m), in contrast to that of H.

Neural networks with bottleneck. Here, we show another type of neural networks, bottleneck networks, that does not have constant tangent kernel, by breaking the Õ(1/√m)-scaling of the Hessian spectral norm. Consider a neural network with fully connected layers. Here, we assume all the hidden layers are arbitrarily wide, except that one layer, l ≠ L, has a narrow width. For example, let the bottleneck width m_b = 1. Now, the (l + 1)-th fully connected layer, Eq. (22), reduces to α^(l+1) = σ(w^(l+1) α^(l)),

with α^(l) ∈ ℝ and w^(l+1) ∈ ℝ^m. In this case, the (2, 2, 1)-norm of the order 3 tensor ∂²α^(l+1)/(∂α^(l))² ∈ ℝ^{1×1×m} is

‖∂²α^(l+1)/(∂α^(l))²‖_{2,2,1} = Σ_{i=1}^m | (w_i^(l+1))² σ''(w_i^(l+1) α^(l)) | = Θ(m).    (30)

This makes the quantity Q_{2,2,1}(f) of order O(m). Then, Theorem 3.1 indicates that the Hessian spectral norm is no longer arbitrarily small, suggesting a non-constant tangent kernel during training. Indeed, as we prove below, the Hessian spectral norm is lower bounded by a positive constant, which in turn implies that the linearity does not hold for this kind of neural network. Specifically, consider a bottleneck network of the following form:

f(W; x) = (1/√m) Σ_{j=1}^m w_j^(4) σ( w_j^(3) · (1/√m) Σ_{i=1}^m w_i^(2) σ(w_i^(1) x) ).    (31)

⁶ If φ is linear, the term A is identically zero.

This network has three hidden layers, where the first and third hidden layers have an arbitrarily large width m, and the second hidden layer, the bottleneck layer, has width m_b = 1. Each individual parameter is initialized from the standard normal distribution. For simplicity of the analysis, the activation function for the bottleneck layer is the identity, and it is quadratic for the first and third layers, i.e., σ(z) = ½ z². The following theorem gives a lower bound for the Hessian spectral norm ‖H‖ in a ball around the initialization W_0.

Theorem 4.1. Consider the bottleneck network f(W; x) defined in Eq. (31). Given an arbitrary radius R > 0, for any W ∈ B(W_0, R) and any δ ∈ (0, 1), the Hessian matrix H(W) of the model satisfies

‖H(W)‖ ≥ (x²/4) (1/2 − R²/m) √3 C_1 δ³ − (log(4/δ) R⁴ + R³)/√m,    (32)

for some constant C_1 > 0, with probability at least 1 − 2δ − e^{−m/16}. In particular, in the limit of m → ∞,

‖H(W)‖ ≥ (√3/8) C_1 δ³ x²,    (33)

with probability at least 1 − 2δ.

See the proof in Appendix H. With this lower bound on the Hessian, Proposition 2.2 directly implies that the linearity of the model does not hold for this network. As Eq. (2) shows that ‖W* − W_0‖ = Ω(1), our analysis implies that the model is not linear, and hence the tangent kernel is not constant, along the optimization path. In Section 6, we empirically verify this finding. In Table 1, we summarize the key findings of this section and compare them with the case of neural networks with linear output layer.

Network                    Hessian norm    NTK     Transition to linearity (constant NTK)?
linear output layer        Õ(1/√m)         Θ(1)    Yes
nonlinear output layer     Õ(1)            Θ(1)    No
bottleneck                 Õ(1)            Θ(1)    No

Table 1: Scaling of the Hessian spectral norms of the models: linear output layer, non-linear output layer, and bottleneck. Note: transition to linearity = constant tangent kernel, in the infinite width limit.
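Before turning to optimization, here is a small numerical sanity check (ours) of Eq. (30): with standard Gaussian weights, the sum Σ_i |(w_i^(l+1))² σ''(w_i^(l+1) α^(l))| grows linearly in the width m (tanh stands in here for a generic smooth activation, and the scalar bottleneck value α^(l) is an arbitrary choice).

```python
# Sanity check of Eq. (30): for a width-1 bottleneck feeding a width-m layer,
# sum_i |(w_i)^2 * sigma''(w_i * alpha)| grows linearly in m.
import numpy as np

def sigma_pp(u):                      # second derivative of tanh
    return -2.0 * np.tanh(u) * (1.0 - np.tanh(u) ** 2)

alpha = 0.5
rng = np.random.default_rng(0)
for m in [1_000, 10_000, 100_000]:
    w = rng.standard_normal(m)        # weights of the layer after the bottleneck
    norm_221 = np.sum(np.abs(w ** 2 * sigma_pp(w * alpha)))
    print(f"m={m:>7}:  (2,2,1)-norm ~ {norm_221:.1f}   ratio to m ~ {norm_221 / m:.3f}")
```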

5 Optimization of wide neural networks

A number of recent analyses show convergence of gradient descent for wide neural networks [8, 7, 1, 23, 3, 13, 4]. While an extended discussion of optimization is beyond the scope of this work, we refer the interested reader to our separate paper [17]. The goal of this section is to clarify the important difference between the (near-)linearity of large models and convergence of optimization by gradient descent. It is easy to see that a wide model undergoing the transition to linearity can be optimized by gradient descent if its tangent kernel is well-conditioned at the initialization point. The dynamics of such a model will be essentially the same as for a linear model, an observation originally made in [12]. However, near-linearity or, equivalently, near-constancy of the tangent kernel is not necessary for successful optimization. What is needed is that the tangent kernel is well-conditioned along the optimization path, a far weaker condition. For a specific example, consider the non-linear output layer neural network f̃ = φ(f), as defined in Eq. (26). As shown in Section 4, this network does not have constant tangent kernel, even when the network width is arbitrarily large. The following theorem states that fast convergence of gradient descent still holds (also see Section 6 for empirical verification).

Figure 1: Neural networks with non-linear output layer vs. with linear output layer. Left panel (log scale): max change of tangent kernel ∆K from initialization to convergence w.r.t. the width m. Right panel: evolution of the tangent kernel change ∆K(t) as a function of epoch, width m = 10⁴.

Theorem 5.1. Suppose the non-linear function φ(·) satisfies |φ'(z)| ≥ ρ > 0, ∀z ∈ ℝ, and the network width m is sufficiently large. Then, with high probability over the random initialization, there exists a constant µ > 0 such that gradient descent, with a small enough step size η, converges to a global minimizer of the square loss function L(w) = (1/2) Σ_{i=1}^n (f̃(w; x_i) − y_i)² with an exponential convergence rate:

L(w_t) ≤ (1 − ηµρ²)^t L(w_0).    (34)

The analysis is based on the following reasoning. Convergence of gradient descent methods relies on the condition number of the tangent kernel (see [17]). It is not difficult to see that if the original model f has a well-conditioned tangent kernel, then the same holds for f̃ = φ(f), as long as the derivative of the activation function φ' is separated from zero. Since the tangent kernel of f is not degenerate, the conclusion follows. The technical result is a consequence of Corollary 8.1 in [17].

6 Numerical Verification

We conduct experiments to verify the non-constancy of tangent kernels for certain types of wide neural networks, as theoretically observed in Section 4. Specifically, we use gradient descent to train each neural network described below on synthetic data until convergence. We compute the following quantity to measure the max (relative) change of the tangent kernel from initialization to convergence: ∆K := sup_{t>0} ‖K(w_t) − K(w_0)‖_F / ‖K(w_0)‖_F. For a network that has a nearly constant tangent kernel during training, ∆K is expected to be close to 0, while for a network with a non-constant tangent kernel, ∆K should be Ω(1). The detailed experimental setup and data description are given in Appendix B.

Wide neural networks with non-linear output layers. We consider a shallow (i.e., with one hidden layer) neural network f̃ of the type in Eq. (26) that has a softmax layer or swish [18] activation on the output. As a comparison, we consider a neural network f that has the same structure as f̃, except that the output layer is linear. We report the change of tangent kernels ∆K of f̃ and f at different network widths m ∈ {30, 10², 10³, 10⁴, 10⁵, 10⁶}. The results are plotted in the left panel of Figure 1. We observe that, as the network width increases, the tangent kernel of f, which has a linear output layer, tends to be constant during training. However, the tangent kernel of f̃, which has a non-linear (softmax or swish) output layer, always changes significantly, even when the network width is large. In the right panel of Figure 1, we show the evolution of the tangent kernel with respect to the training time for a very wide neural network (width m = 10⁴). We see that, for the neural network with a non-linear output layer, the tangent kernel changes significantly from initialization, while the tangent kernel of the linear output network is nearly unchanged during training.
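The following sketch (ours, not the experiment code used in the paper) illustrates how ∆K can be measured for a toy one-dimensional problem, comparing a linear output layer with a swish-like non-linear output; the data, width, step size and activation sharpness are made-up choices.

```python
# Toy illustration of the Delta_K measurement: train a shallow network by
# gradient descent and track the relative Frobenius change of its tangent kernel.
import jax
import jax.numpy as jnp

m, lr, steps = 10_000, 1.0, 200
X = jnp.array([-1.0, 0.5, 1.5]); Y = jnp.array([0.5, -0.2, 0.3])
kw, kv = jax.random.split(jax.random.PRNGKey(0))
w0 = jax.random.normal(kw, (m,)); v = jnp.sign(jax.random.normal(kv, (m,)))

def f_lin(w, x):                       # linear output layer, NTK scaling
    return jnp.dot(v, jnp.tanh(w * x)) / jnp.sqrt(m)

def f_non(w, x):                       # non-linear (swish-like) output
    z = f_lin(w, x)
    return z / (1.0 + jnp.exp(-5.0 * z))

def kernel(model, w):
    G = jax.vmap(lambda x: jax.grad(model)(w, x))(X)   # n x p gradient matrix
    return G @ G.T                                     # tangent kernel matrix

def delta_K(model):
    loss = lambda w: 0.5 * jnp.sum((jax.vmap(lambda x: model(w, x))(X) - Y) ** 2)
    grad_loss = jax.grad(loss)
    w, K0, dK = w0, kernel(model, w0), 0.0
    for _ in range(steps):                             # plain gradient descent
        w = w - lr * grad_loss(w)
        dK = jnp.maximum(dK, jnp.linalg.norm(kernel(model, w) - K0)
                             / jnp.linalg.norm(K0))
    return dK

print("linear output:    Delta_K ~", float(delta_K(f_lin)))
print("nonlinear output: Delta_K ~", float(delta_K(f_non)))
```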

Wide neural networks with a bottleneck. We consider a fully connected neural network with 3 hidden layers and a linear output layer. The second hidden layer, i.e., the bottleneck layer, has a width m_b which is typically small, while the width m of the other hidden layers is typically very large, m = 10⁴ in our experiment. For different bottleneck widths m_b ∈ {3, 5, 10, 50, 100, 500, 1000}, we train the network on a synthetic dataset using gradient descent until convergence, and compute ∆K.

The change of the tangent kernel for different bottleneck widths is shown in Figure 2. We can see that a narrow bottleneck layer in a wide neural network prevents the neural tangent kernel from being constant during training. As expected, increasing the width of the bottleneck layer makes the change of the tangent kernel smaller. We observe that the scaling of the tangent kernel change with width follows closely Θ(1/√m_b) (dashed line in Figure 2), in alignment with our theoretical results (Theorem 3.3).

Figure 2: Networks with bottleneck. Experimental results and theoretical prediction of the relative change of the tangent kernel from initialization to convergence, as a function of the bottleneck width.

Acknowledgements

The authors acknowledge support from NSF, the Simons Foundation and a Google Faculty Research Award. We thank James Lucas for correcting the proof of Prop. 2.3. The GPU used for the experiments was donated by Nvidia.

References

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. “A Convergence Theory for Deep Learning via Over-Parameterization”. In: International Conference on Machine Learning. 2019, pp. 242–252.
[2] Shun-ichi Amari. “Any target function exists in a neighborhood of any sufficiently wide random network: A geometrical perspective”. In: arXiv preprint arXiv:2001.06931 (2020).
[3] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. “Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks”. In: International Conference on Machine Learning. 2019, pp. 322–332.
[4] Peter L Bartlett, David P Helmbold, and Philip M Long. “Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks”. In: Neural Computation 31.3 (2019), pp. 477–502.
[5] Anthony Carbery and James Wright. “Distributional and Lq norm inequalities for polynomials over convex bodies in Rn”. In: Mathematical Research Letters 8.3 (2001), pp. 233–248.
[6] Lenaic Chizat, Edouard Oyallon, and Francis Bach. “On lazy training in differentiable programming”. In: Advances in Neural Information Processing Systems. 2019, pp. 2933–2943.
[7] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. “Gradient Descent Finds Global Minima of Deep Neural Networks”. In: International Conference on Machine Learning. 2019, pp. 1675–1685.
[8] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. “Gradient descent provably optimizes over-parameterized neural networks”. In: arXiv preprint arXiv:1810.02054 (2018).
[9] Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/868044 (version: 2014-07-15).
[10] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. “Linearized two-layers neural networks in high dimension”. In: arXiv preprint arXiv:1904.12191 (2019).
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.

[12] Arthur Jacot, Franck Gabriel, and Clément Hongler. “Neural tangent kernel: Convergence and generalization in neural networks”. In: Advances in Neural Information Processing Systems. 2018, pp. 8571–8580.
[13] Ziwei Ji and Matus Telgarsky. “Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks”. In: arXiv preprint arXiv:1909.12292 (2019).
[14] Beatrice Laurent and Pascal Massart. “Adaptive estimation of a quadratic functional by model selection”. In: Annals of Statistics (2000), pp. 1302–1338.
[15] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. “Efficient backprop”. In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 9–48.
[16] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. “Wide neural networks of any depth evolve as linear models under gradient descent”. In: Advances in Neural Information Processing Systems. 2019, pp. 8570–8581.
[17] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. “Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning”. In: arXiv preprint arXiv:2003.00307 (2020).
[18] Prajit Ramachandran, Barret Zoph, and Quoc V Le. “Searching for activation functions”. In: arXiv preprint arXiv:1710.05941 (2017).
[19] Takashi Sakai. “On Riemannian manifolds admitting a function whose gradient is of constant norm”. In: Kodai Mathematical Journal 19.1 (1996), pp. 39–51.
[20] Ruoyu Sun. “Optimization for deep learning: theory and algorithms”. In: arXiv preprint arXiv:1912.08957 (2019).
[21] Terence Tao. Topics in Random Matrix Theory. Vol. 132. American Mathematical Soc., 2012.
[22] Roman Vershynin. “Introduction to the non-asymptotic analysis of random matrices”. In: arXiv preprint arXiv:1011.3027 (2010).
[23] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. “Stochastic gradient descent optimizes over-parameterized deep ReLU networks”. In: arXiv preprint arXiv:1811.08888 (2018).

A Other Parameterization Strategies

Throughout the paper, our analysis is based on the NTK parameterization [12], under which the constancy of the tangent kernel was originally observed. In this section, we show that different parameterization strategies (e.g., LeCun initialization [15]: w_{0,i}^(l) ∼ N(0, 1/m)) do not change our conclusions. Specifically, we show that, compared to the NTK parameterization, a different parameterization strategy only rescales the tangent kernel K and the spectral norm of the Hessian ‖H‖ by the same factor; hence the ratio between the tangent kernel K and the Hessian spectral norm stays the same, and ‖H‖ = o(‖K‖) still holds. This still implies that the tangent kernel is almost constant during training.

Recall that we initialize the parameters W = {w^(1), w^(2), ···, w^(L), w^(L+1) := v} of the general form of a deep neural network f, Eq. (19), by a standard Gaussian, i.e., w_i^(l) ∼ N(0, 1). If we apply another parameterization strategy W̄ here, for example, w̄_i^(l) ∼ N(0, σ_m²), where σ_m can be a function of m, we can write every w̄_i^(l) = σ_m w_i^(l) where w_i^(l) ∼ N(0, 1). Hence each layer function becomes

α^(l) = φ_l( (1/σ_m) w̄^(l); α^(l−1) ).    (35)

In this case, the gradient of the model f w.r.t. the weights of layer l is

∂f/∂w̄^(l) = (∂w^(l)/∂w̄^(l)) ∂f/∂w^(l) = (1/σ_m) ∂f/∂w^(l).    (36)

For the same reason, the Hessian of the model f w.r.t. the weights of layers l_1 and l_2 is

∂²f/(∂w̄^(l_1) ∂w̄^(l_2)) = (1/σ_m²) ∂²f/(∂w^(l_1) ∂w^(l_2)).    (37)

Therefore, it is easy to see that the ratio of the norm of the tangent kernel to the norm of the Hessian stays the same:

‖K(W̄)‖ / ‖H(W̄)‖ = ‖(1/σ_m²) K(W)‖ / ‖(1/σ_m²) H(W)‖ = ‖K(W)‖ / ‖H(W)‖.    (38)

Example: LeCun initialization/parameterization. In many practical machine learning tasks, it is popular to use the LeCun initialization/parameterization: each individual parameter (W_0^(l))_{ij} ∼ N(0, 1/m), while there is no factor 1/√m in the definition of the layer function, e.g., for fully connected layers

α^(l+1) = σ(W^(l) α^(l)).    (39)

In this setting, the factor σ_m = 1/√m. Then, by the analysis above, we see that

‖K‖ = O(m),   ‖H‖ = O(√m) = o(‖K‖).    (40)

It is also interesting to note that the Euclidean norm of the parameter change w* − w_0 also rescales:

‖w* − w_0‖ = Θ(1/√m).    (41)

B Experimental Setup

Dataset. We use a synthetic dataset of size N = 60 which contains C = 3 classes. Each data point (x, y) is sampled as follows: the label y is randomly sampled from {0, 1, 2} with equal probability; given y, x is drawn from the following distribution:

x ∼ N(0, 1) if y = 0;   x ∼ N(10, 1) if y = 1;   x ∼ N(−10, 1) if y = 2.    (42)

We encode each y_i ∈ {0, 1, 2} in {(x_i, y_i)}_{i=1}^N by a one-hot vector y_i ∈ {0, 1}³, and y_{i,j} denotes the j-th component of y_i. We use this dataset for all the optimization tasks mentioned below.
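A short sketch (ours) of how such a dataset can be sampled:

```python
# Sampling the synthetic dataset described above (Eq. 42).
import numpy as np

def make_dataset(N=60, C=3, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, C, size=N)              # labels 0, 1, 2 with equal probability
    means = np.array([0.0, 10.0, -10.0])
    x = rng.normal(loc=means[y], scale=1.0)     # x ~ N(mean_y, 1)
    y_onehot = np.eye(C)[y]                     # one-hot encoding y_i in {0,1}^3
    return x, y_onehot

x, y = make_dataset()
print(x.shape, y.shape)   # (60,) (60, 3)
```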

B.1 Wide neural networks with non-linear output layers

Neural Networks. In the experiments, we train three different neural networks:

• Neural network with a linear output layer:

f(w, V, b; x) = (1/√m) V σ(w x + b),    (43)

where w ∈ ℝ^m and b ∈ ℝ^m are the weights and biases of the first layer, V ∈ ℝ^{3×m} are the weights of the output layer, and σ(·) is the ReLU activation function.

• Neural network with a softmax-activated (non-linear) output layer:

f̃_1(w, V, b; x) = Softmax(f(w, V, b; x));    (44)

• Neural network with a swish-activated (non-linear) output layer:

f̃_2(w, V, b; x) = Swish(f(w, V, b; x)).    (45)

Here the swish activation function is defined as Swish(z) = z ⊙ (1 + exp(−0.1 · z))^{−1}, where ⊙ is the element-wise multiplication.

Optimization Tasks. We combine the training of the networks f and f̃_1 by optimizing the following loss function:

L_1(w, V, b) = −(1/N) Σ_{i=1}^N Σ_{j=1}^C y_{i,j} · log( (f̃_1(x_i))_j ).    (46)

In this combined training, the networks f and f̃_1 always have the same parameters during training, and the difference between f and f̃_1 is the non-linearity on the output.

For the swish-activated network f̃_2, we minimize the square loss function:

L_2(w, V, b) = (1/N) Σ_{i=1}^N ‖f̃_2(x_i) − y_i‖².    (47)

We use gradient descent to minimize the loss functions until convergence is achieved (i.e., loss less than 10⁻⁴). To measure the change of the tangent kernels, we compute the max (relative) change of the tangent kernel from initialization to convergence: ∆K := sup_{t>0} ‖K(w_t) − K(w_0)‖_F / ‖K(w_0)‖_F. For each training, we take 10 independent runs and report the average ∆K.

We compare the tangent kernel changes ∆K of f, f̃_1 and f̃_2 at a variety of network widths, m = 30, 10², 10³, 10⁴, 10⁵, 10⁶.

B.2 Wide neural networks with a bottleneck

The Neural Network. In the experiment, we use a fully connected neural network with 3 hidden layers and a linear output layer. Its second hidden layer, i.e., the bottleneck layer, has a width m_b, while the other hidden layers have a width m. Specifically, it is defined as:

f(W; x) = (1/√m) W_4 σ( W_3 (1/√m) W_2 σ(W_1 x) ),    (48)

where W_1 ∈ ℝ^{m×1}, W_2 ∈ ℝ^{m_b×m}, W_3 ∈ ℝ^{m×m_b}, W_4 ∈ ℝ^{C×m}. Here we use ReLU as the activation function.

Optimization Tasks. We minimize the cross entropy loss:

L(W) = −(1/N) Σ_{i=1}^N Σ_{j=1}^C y_{i,j} · log( (f̃(x_i))_j ),    (49)

where we denote Softmax(f) by f̃. Here, we let the network width m = 10⁴ and investigate different bottleneck widths m_b ∈ {3, 5, 10, 50, 100, 500, 1000}. For each bottleneck width, we use gradient descent to minimize the loss function until convergence is achieved (i.e., loss less than 10⁻⁴) and compute the max (relative) change of the tangent kernel from initialization to convergence: ∆K := sup_{t>0} ‖K(w_t) − K(w_0)‖_F / ‖K(w_0)‖_F. For each training, we take 10 independent runs and report the average ∆K.

C Proof for Proposition 2.2

Proof. Recall that the tangent kernel is defined as

K_{ij}(w) = ∇f(w; x_i)^T ∇f(w; x_j),   for any inputs x_i, x_j ∈ ℝ^d.    (50)

Linearity of f in w ⇒ constancy of tangent kernel. Since f is linear in w, ∇_w f(w; x) is a constant vector in ℝ^p for any given input x. By the definition of the tangent kernel, each element K_{ij}(w) is constant, for any inputs x_i, x_j.

Constancy of tangent kernel ⇒ linearity of f in w. It suffices to prove that for every input x_i, the function f(w; x_i) : ℝ^p → ℝ is linear in w.

For a constant tangent kernel, each diagonal element K_{ii}(w) is constant. Noting that K_{ii}(w) = ‖∇_w f(w; x_i)‖², we have that ‖∇_w f(w; x)‖ is constant in w, for every input x. The following arguments basically follow the idea of [9] (a more general result was shown in [19]). To simplify the notation, in the rest of the proof we suppress the argument x and write f(w) for f(w; x). Let ‖∇f(w)‖ = c. Consider the ordinary differential equation (ODE)

dw(t)/dt = ∇f(w(t)),

where w(0) = w_0 ∈ R^p is the initial setting of the parameters. We have

df/dt = ⟨∇f, dw/dt⟩ = c²,

and consequently

f(w(t)) = c² t + f(w_0).    (51)

For any t_1, t_2, since ‖∇f(w)‖ = c, we have c²|t_1 − t_2| = |f(w(t_1)) − f(w(t_2))| ≤ c‖w(t_1) − w(t_2)‖,

but ‖w(t_1) − w(t_2)‖ ≤ |∫_{t_2}^{t_1} ‖dw(t)/dt‖ dt| = c|t_1 − t_2|. Combining the two inequalities gives ‖w(t_1) − w(t_2)‖ = c|t_1 − t_2|, i.e., the trajectory is a straight line, which indicates

w(t) = t∇f(w_0) + w_0.    (52)

In the following we show that, for any v ∈ R^p with f(v) = f(w_0), we have

⟨∇f(w_0), v − w_0⟩ = 0.    (53)

Given t ≠ 0, let γ: [0, 1] → R^p be a differentiable curve joining v and t∇f(w_0) + w_0. By Eq. (51), Eq. (52) and f(v) = f(w_0), we have

c²|t| = |f(w(t)) − f(w_0)| = |f(w_0 + t∇f(w_0)) − f(v)|
      = |∫_0^1 ⟨∇f(γ(s)), γ′(s)⟩ ds|
      ≤ c ∫_0^1 ‖γ′(s)‖ ds
      = c ‖v − t∇f(w_0) − w_0‖,

where the last equality holds for the straight-line curve. It follows that

c² t² ≤ ‖v − w_0‖² + c² t² − 2t⟨v − w_0, ∇f(w_0)⟩.

Subtracting c²t², dividing by 2t and taking t to ±∞ yields ⟨∇f(w_0), v − w_0⟩ = 0. Then we consider the level set

M_a = M = {w ∈ R^p : f(w) = a},    (54)

where a ∈ R. Its tangent space at w is

T_w M = {w + v ∈ R^p : ⟨v, ∇f(w)⟩ = 0}.    (55)

By Eq. (53) we have ⟨v − w, ∇f(w)⟩ = 0 for all v ∈ M satisfying f(v) = f(w). From Eq. (55) we see that v ∈ T_w M. Therefore M ⊂ T_w M. By the fact that M is a closed hypersurface, M = T_w M for all w ∈ M. Hence there exists a unit vector w′ ∈ R^p, ‖w′‖ = 1, such that the level set of Eq. (54) is equivalently given by

M_a = { (a/c) w′ + v′ : ⟨v′, w′⟩ = 0, v′ ∈ R^p }  for all a.

Finally, we can construct a function g: R → R with g(t) = f(v′ + t w′); then g′(t) = c for all t, which shows that f is linear.

D Proof of Proposition 2.3

Proof. Since the function f is twice differentiable w.r.t. w, according to Taylor's theorem we have the following expression for the gradient:

∇f(w) = ∇f(w_0) + ∫_0^1 H(w_0 + t(w − w_0)) (w − w_0) dt.    (56)

Then the Euclidean norm of the gradient change is bounded by

‖∇_w f(w) − ∇_w f(w_0)‖ = ‖∫_0^1 H(w_0 + t(w − w_0)) (w − w_0) dt‖
                        ≤ ∫_0^1 ‖H(w_0 + t(w − w_0))‖ dt · ‖w − w_0‖.

Since t ∈ [0, 1] and the ball B(w_0, R) is convex, the point w_0 + t(w − w_0) lies in B(w_0, R). Hence,

‖∇_w f(w) − ∇_w f(w_0)‖ ≤ max_{v∈B(w_0,R)} ‖H(v)‖ · ‖w − w_0‖ ≤ εR,    (57)

where ε := max_{v∈B(w_0,R)} ‖H(v)‖ denotes the bound on the Hessian spectral norm in the ball. Hence, according to the definition of the tangent kernel, for any inputs x, z ∈ R^d,

|K_(x,z)(w) − K_(x,z)(w_0)|

≤ ‖∇_w f(w; x) − ∇_w f(w_0; x)‖ · ‖∇_w f(w; z)‖ + ‖∇_w f(w; z) − ∇_w f(w_0; z)‖ · ‖∇_w f(w_0; x)‖
≤ εR (‖∇_w f(w_0; x)‖ + ‖∇_w f(w; z)‖).

Since f is smooth, the gradients ∇_w f(w_0) and ∇_w f(w) are bounded. Therefore, |K_(x,z)(w) − K_(x,z)(w_0)| = O(εR).

E Proof of Theorem 3.1

Proof. The Hessian matrix H of the neural network has the following block structure:
\[
H = \begin{pmatrix}
H^{(1,1)} & H^{(1,2)} & \cdots & H^{(1,L+1)} \\
H^{(2,1)} & H^{(2,2)} & \cdots & H^{(2,L+1)} \\
\vdots & \vdots & \ddots & \vdots \\
H^{(L+1,1)} & H^{(L+1,2)} & \cdots & H^{(L+1,L+1)}
\end{pmatrix}. \tag{58}
\]
Here, each Hessian block H^(l_1,l_2) := ∂²f/(∂w^(l_1) ∂w^(l_2)) is the second derivative of f w.r.t. the weights of the l_1-th and l_2-th layers, where we treat the final-layer parameters v as w^(L+1). The following lemma allows us to bound the Hessian spectral norm by the norms of its blocks (see the proof in Appendix I.1).

Lemma E.1. The spectral norm of a block matrix H of the form (58) is upper bounded by the sum of the spectral norms of its blocks, i.e., ‖H‖ ≤ Σ_{l_1,l_2} ‖H^(l_1,l_2)‖, l_1, l_2 ∈ [L+1].
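A quick numerical sanity check of the bound in Lemma E.1 (illustrative only, with an arbitrary 2×2 block partition of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
blocks = [[rng.standard_normal((3, 3)) for _ in range(2)] for _ in range(2)]
H = np.block(blocks)                                   # assemble the block matrix

spec = lambda A: np.linalg.norm(A, 2)                  # spectral norm
block_sum = sum(spec(B) for row in blocks for B in row)
assert spec(H) <= block_sum + 1e-12                    # ||H|| <= sum of block norms
```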

Now, we analyze the Hessian blocks case by case. Since the Hessian matrix is symmetric, without loss of generality, we assume 1 ≤ l_1 ≤ l_2 ≤ L + 1.

Case 1: 1 ≤ l_1 ≤ l_2 ≤ L. By the chain rule, the gradient of the model f w.r.t. the weights of layer l can be written as
\[
\frac{\partial f}{\partial w^{(l)}} = \frac{\partial \alpha^{(l)}}{\partial w^{(l)}} \left( \prod_{l'=l+1}^{L} \frac{\partial \alpha^{(l')}}{\partial \alpha^{(l'-1)}} \right) \frac{1}{\sqrt{m}}\, v. \tag{59}
\]
Then, the Hessian block has the following expression:

\[
\begin{aligned}
H^{(l_1,l_2)}
={}& \frac{\partial^2\alpha^{(l_1)}}{(\partial w^{(l_1)})^2}\left[\frac{\partial f}{\partial\alpha^{(l_1)}}\right]\mathbb{1}_{l_1=l_2}
+\frac{\partial\alpha^{(l_1)}}{\partial w^{(l_1)}}\left(\prod_{l'=l_1+1}^{l_2-1}\frac{\partial\alpha^{(l')}}{\partial\alpha^{(l'-1)}}\right)\frac{\partial^2\alpha^{(l_2)}}{\partial\alpha^{(l_2-1)}\partial w^{(l_2)}}\left[\frac{\partial f}{\partial\alpha^{(l_2)}}\right]\\
&+\sum_{l=l_2+1}^{L}\frac{\partial\alpha^{(l_1)}}{\partial w^{(l_1)}}\left(\prod_{l'=l_1+1}^{l-1}\frac{\partial\alpha^{(l')}}{\partial\alpha^{(l'-1)}}\right)\frac{\partial^2\alpha^{(l)}}{(\partial\alpha^{(l-1)})^2}\left(\frac{\partial\alpha^{(l_2)}}{\partial w^{(l_2)}}\prod_{l'=l_2+1}^{l-1}\frac{\partial\alpha^{(l')}}{\partial\alpha^{(l'-1)}}\right)\left[\frac{\partial f}{\partial\alpha^{(l)}}\right].
\end{aligned}
\]

Hence, the spectral norm of the Hessian block H^(l_1,l_2) is bounded by

\[
\begin{aligned}
\big\|H^{(l_1,l_2)}\big\|
\le{}& \left\|\frac{\partial^2\alpha^{(l_1)}}{(\partial w^{(l_1)})^2}\right\|_{2,2,1}\left\|\frac{\partial f}{\partial\alpha^{(l_1)}}\right\|_\infty
+ L_\phi^{\,l_2-l_1-1}\left\|\frac{\partial\alpha^{(l_1)}}{\partial w^{(l_1)}}\right\|\left\|\frac{\partial^2\alpha^{(l_2)}}{\partial\alpha^{(l_2-1)}\partial w^{(l_2)}}\right\|_{2,2,1}\left\|\frac{\partial f}{\partial\alpha^{(l_2)}}\right\|_\infty\\
&+\sum_{l=l_2+1}^{L} L_\phi^{\,2l-l_1-l_2}\left\|\frac{\partial\alpha^{(l_1)}}{\partial w^{(l_1)}}\right\|\left\|\frac{\partial^2\alpha^{(l)}}{(\partial\alpha^{(l-1)})^2}\right\|_{2,2,1}\left\|\frac{\partial\alpha^{(l_2)}}{\partial w^{(l_2)}}\right\|\left\|\frac{\partial f}{\partial\alpha^{(l)}}\right\|_\infty.
\end{aligned}
\]

By the definitions in Eq. (21), we have

‖H^(l_1,l_2)‖ ≤ C_1′ Q_{2,2,1}(f) Q_∞(f),    (60)

with C_1′ = L² L_φ^{2L} + L L_φ^{L} + 1.

Case 2: 1 ≤ l1 < l2 = L + 1. Using the gradient expression in Eq.(59), we have

\[
H^{(l_1,L+1)} = \frac{1}{\sqrt{m}}\,\frac{\partial \alpha^{(l_1)}}{\partial w^{(l_1)}}\prod_{l'=l_1+1}^{L}\frac{\partial \alpha^{(l')}}{\partial \alpha^{(l'-1)}}. \tag{61}
\]
Hence,
\[
\|H^{(l_1,L+1)}\| \le \frac{1}{\sqrt{m}}\left\|\frac{\partial \alpha^{(l_1)}}{\partial w^{(l_1)}}\right\|\prod_{l'=l_1+1}^{L}\left\|\frac{\partial \alpha^{(l')}}{\partial \alpha^{(l'-1)}}\right\| \le \frac{L_\phi^{\,L}}{\sqrt{m}}\, Q_L(f). \tag{62}
\]

Case 3: l_1 = l_2 = L + 1. In this case, the Hessian block H^(L+1,L+1) is simply zero, hence its spectral norm is zero. Applying Lemma E.1, we immediately obtain the desired result.

F Proof for Lemma 3.1

According to the definitions of the quantities Q_∞(f), Q_{2,2,1}(f) and Q_L(f) in Eq. (21), it suffices to show that the following layer-wise properties hold everywhere in the ball B(W_0, R) with high probability over the initialization:


• The vector ∞-norm ‖∂f/∂α^(l)‖_∞ = Õ(1/√m), for all l ∈ [L];
• The matrix spectral norm ‖∂α^(l)/∂w^(l)‖ = O(1) w.r.t. m, for all l ∈ [L];
• The (2, 2, 1)-norms of the order-3 tensors ‖∂²α^(l)/(∂w^(l))²‖_{2,2,1}, ‖∂²α^(l)/(∂α^(l−1)∂w^(l))‖_{2,2,1} and ‖∂²α^(l)/(∂α^(l−1))²‖_{2,2,1} are all of order O(1) w.r.t. m, for all l ∈ [L].
We start the proof with some preliminary results, and then prove the above statements one by one.

F.1 Preliminaries

The fully connected neural network is defined in the following way:

α^(0) = x,
α^(l) = σ(α̃^(l)),   α̃^(l) = (1/√(m_{l−1})) W^(l) α^(l−1),   ∀ l ∈ [L],
f = (1/√m) v^T α^(L),    (63)

where m_0 = d is the dimension of the input x, and m_l = m for all l ∈ [L]. The trainable parameters of this network are W := {W^(1), W^(2), ···, W^(L), W^(L+1) := v}, and they are initialized by random Gaussian initialization, i.e., each parameter (W_0^(l))_{ij} ∼ N(0, 1), ∀ l ∈ [L], and v_{0,i} ∼ N(0, 1), i, j ∈ [m]. As the parameters W of each layer are reshaped into matrices, the Euclidean norm of the parameters becomes ‖W‖ := (Σ_{l=1}^{L+1} ‖W^(l)‖_F²)^{1/2}, where ‖·‖_F is the Frobenius norm of a matrix.

To make the presentation of the proof as simple as possible, we first make the following assumption about the initial parameters W_0. We then prove in Lemma F.1 that the assumption is satisfied with high probability over the random Gaussian initialization.

Assumption F.1. We assume that there exists a constant c_0 > 0 such that, for all initial weight matrices/vectors W_0^(l), ‖W_0^(l)‖ ≤ c_0 √m, where l ∈ [L + 1].

Lemma F.1 (Spectral norms of initial weight matrices). If the parameters are initialized as (W_0^(l))_{ij} ∼ N(0, 1) for all l ∈ [L + 1] and m > d, then, for each layer l ∈ [L + 1], with probability at least 1 − 2 exp(−m/2),

‖W_0^(l)‖ ≤ 3√m.    (64)

The proof is in Appendix I.2.

We further assume that, for the input x ∈ R^d, each component is bounded, i.e., |x_i| ≤ C_x for some constant C_x and for all i ∈ [d]. This assumption covers most practical cases.

We prove the following lemma, which states that the norm of the matrix W^(l) keeps its order in a finite ball around W_0^(l).

Lemma F.2. If W_0 satisfies Assumption F.1, then for any W such that ‖W − W_0‖ ≤ R, we have

‖W^(l)‖ ≤ c_0√m + R = O(√m),  ∀ l ∈ [L + 1].    (65)

See the proof in Appendix I.3. The following lemma gives bounds on the Euclidean norm of the vector of hidden neurons at each layer.

Lemma F.3. If W_0 satisfies Assumption F.1, then, for any W such that ‖W − W_0‖ ≤ R, we have, at all hidden layers l,

‖α^(l)(W)‖ ≤ L_σ^l (c_0 + R/√m)^l √m C_x + Σ_{i=1}^{l} L_σ^{i−1} (c_0 + R/√m)^{i−1} σ(0) = O(√m),  ∀ l ∈ [L].    (66)

In particular, for the input layer,

‖α^(0)‖ = ‖x‖ ≤ √d C_x = O(1).    (67)

The proof is in Appendix I.4.

F.2 Matrix spectral norm ∂α(l)/∂w(l) = O(1) and Lipschitz continuity of α(l) w.r.t α(l−1)

Here, we show that, for any l ∈ [L] and at any point W ∈ B(W_0, R), both ‖∂α^(l)/∂w^(l)‖ and ‖∂α^(l)/∂α^(l−1)‖ are of the order O(1), with high probability over the random Gaussian initialization of W_0; the latter is essentially the Lipschitz continuity of α^(l) w.r.t. α^(l−1).

When l = 2, 3, ···, L. Recall from Eq. (22) that a fully connected layer α^(l) is defined, for l = 2, 3, ···, L, as

α^(l) = σ( (1/√m) W^(l) α^(l−1) ).    (68)

The term α̃^(l) := (1/√m) W^(l) α^(l−1) is also known as the preactivation. Note that, in this case, the parameter vector w^(l) is reshaped into an m × m matrix W^(l). The first derivatives of α^(l) are

(∂α^(l)/∂α^(l−1))_{i,j} = (1/√m) σ′(α̃_i^(l)) W_{ij}^(l),    (69)
(∂α^(l)/∂W^(l))_{i,jj′} = (1/√m) σ′(α̃_i^(l)) α_{j′}^(l−1) 𝟙_{i=j}.    (70)

By the definition of the spectral norm, ‖A‖ = sup_{‖v‖=1} ‖Av‖, we have, for all 2 ≤ l ≤ L,

\[
\left\|\frac{\partial\alpha^{(l)}}{\partial\alpha^{(l-1)}}\right\|^2
= \sup_{\|v\|=1}\frac{1}{m}\sum_{i=1}^m\big(\sigma'(\tilde\alpha_i^{(l)})\,(W^{(l)}v)_i\big)^2
= \sup_{\|v\|=1}\frac{1}{m}\big\|\Sigma'^{(l)}W^{(l)}v\big\|^2
\le \frac{1}{m}\big\|\Sigma'^{(l)}\big\|^2\big\|W^{(l)}\big\|^2
\le L_\sigma^2\,(c_0+R/\sqrt{m})^2 = O(1),
\]

where Σ′^(l) is a diagonal matrix with diagonal entries (Σ′^(l))_{ii} = σ′(α̃_i^(l)). In the last inequality above, we used Lemma F.2 and the Lipschitz continuity of the activation σ(·). Similarly, we have

\[
\left\|\frac{\partial\alpha^{(l)}}{\partial W^{(l)}}\right\|^2
= \sup_{\|V\|_F=1}\frac{1}{m}\sum_{i=1}^m\Big(\sum_{j,j'}\sigma'(\tilde\alpha_i^{(l)})\,\alpha^{(l-1)}_{j'}\,\mathbb{1}_{i=j}\,V_{jj'}\Big)^2
= \sup_{\|V\|_F=1}\frac{1}{m}\big\|\Sigma'^{(l)}V\alpha^{(l-1)}\big\|^2
\le \frac{1}{m}\big\|\Sigma'^{(l)}\big\|^2\big\|\alpha^{(l-1)}\big\|^2
\le L_\sigma^{2l}\,(c_0+R)^{2l-2}\,C_x^2 = O(1). \tag{71}
\]

In the last inequality, we used Lemma F.3 and the Lipschitz continuity of the activation σ(·).

When l = 1. The layer function is

α^(1) = φ_1(W^(1); α^(0)) = σ( (1/√d) W^(1) x ).    (72)

In this layer, the input x is fixed (independent of the trainable parameters) and is not a dynamical variable. Hence, ∂α^(1)/∂x is not an interesting object in our Hessian analysis⁷.

7Indeed, it does not show up in the Hessian analysis (c.f. the proof of Theorem 3.1 in Section E).

For ∂α^(1)/∂W^(1), we have (by a similar analysis as in Eq. (71))

‖∂α^(1)/∂W^(1)‖² ≤ (1/d) ‖Σ′^(1)‖² ‖x‖² ≤ L_σ² C_x² = O(1).

F.3 (2, 2, 1)-norms of order 3 tensors are O(1)

Proof. We consider the first layer (l = 1) and the remaining layers (l = 2, 3, ···, L) separately.

When l = 2, 3, ···, L. The second derivatives of the vector-valued layer function α^(l), which are order-3 tensors, have the following expressions:

(∂²α^(l)/(∂α^(l−1))²)_{i,j,k} = (1/m) σ″(α̃_i^(l)) W_{ij}^(l) W_{ik}^(l),    (73)
(∂²α^(l)/(∂α^(l−1)∂W^(l)))_{i,j,kk′} = (1/m) σ″(α̃_i^(l)) W_{ij}^(l) α_{k′}^(l−1) 𝟙_{i=k},    (74)
(∂²α^(l)/(∂W^(l))²)_{i,jj′,kk′} = (1/m) σ″(α̃_i^(l)) α_{j′}^(l−1) α_{k′}^(l−1) 𝟙_{i=k=j}.    (75)

By the definition of the (2, 2, 1)-norm for order 3 tensors, and Lemma F.2, we get

\[
\begin{aligned}
\left\|\frac{\partial^2\alpha^{(l)}}{(\partial\alpha^{(l-1)})^2}\right\|_{2,2,1}
&= \sup_{\|v_1\|=\|v_2\|=1}\frac{1}{m}\sum_{i=1}^m \sigma''(\tilde\alpha_i^{(l)})\,(W^{(l)}v_1)_i (W^{(l)}v_2)_i
\le \sup_{\|v_1\|=\|v_2\|=1}\frac{\beta_\sigma}{m}\sum_{i=1}^m \big|(W^{(l)}v_1)_i (W^{(l)}v_2)_i\big|\\
&\le \sup_{\|v_1\|=\|v_2\|=1}\frac{\beta_\sigma}{2m}\sum_{i=1}^m \big[(W^{(l)}v_1)_i^2 + (W^{(l)}v_2)_i^2\big]
\le \frac{\beta_\sigma}{2m}\sup_{\|v_1\|=\|v_2\|=1}\big(\|W^{(l)}v_1\|^2 + \|W^{(l)}v_2\|^2\big)\\
&\le \frac{\beta_\sigma}{2m}\big(\|W^{(l)}\|^2 + \|W^{(l)}\|^2\big)
\le \beta_\sigma (c_0 + R/\sqrt{m})^2 = O(1). \tag{76}
\end{aligned}
\]

Similarly, by using Lemma F.2 and Lemma F.3, we have

\[
\begin{aligned}
\left\|\frac{\partial^2\alpha^{(l)}}{\partial\alpha^{(l-1)}\partial W^{(l)}}\right\|_{2,2,1}
&= \sup_{\|v_1\|=\|V_2\|_F=1}\frac{1}{m}\sum_{i=1}^m \sigma''(\tilde\alpha_i^{(l)})\,(W^{(l)}v_1)_i (V_2\alpha^{(l-1)})_i
\le \sup_{\|v_1\|=\|V_2\|_F=1}\frac{\beta_\sigma}{2m}\big(\|W^{(l)}v_1\|^2 + \|V_2\alpha^{(l-1)}\|^2\big)\\
&\le \frac{\beta_\sigma}{2m}\big(\|W^{(l)}\|^2 + \|\alpha^{(l-1)}\|^2\big)
\le \frac{\beta_\sigma}{2}(c_0 + R/\sqrt{m})^2 + \frac{\beta_\sigma}{2} L_\sigma^{2l-2}(c_0 + R/\sqrt{m})^{2l-2} C_x^2 = O(1).
\end{aligned}
\]

And

\[
\begin{aligned}
\left\|\frac{\partial^2\alpha^{(l)}}{(\partial W^{(l)})^2}\right\|_{2,2,1}
&= \sup_{\|V_1\|_F=\|V_2\|_F=1}\frac{1}{m}\sum_{i=1}^m \sigma''(\tilde\alpha_i^{(l)})\,(V_1\alpha^{(l-1)})_i (V_2\alpha^{(l-1)})_i
\le \sup_{\|V_1\|_F=\|V_2\|_F=1}\frac{\beta_\sigma}{2m}\big(\|V_1\alpha^{(l-1)}\|^2 + \|V_2\alpha^{(l-1)}\|^2\big)\\
&\le \frac{\beta_\sigma}{2m}\big(\|\alpha^{(l-1)}\|^2 + \|\alpha^{(l-1)}\|^2\big)
\le \beta_\sigma L_\sigma^{2l-2}(c_0 + R/\sqrt{m})^{2l-2} C_x^2 = O(1). \tag{77}
\end{aligned}
\]

When l = 1. As discussed in Section F.2, the input α^(0) = x is constant, so we only need to analyze the tensor ∂²α^(1)/(∂W^(1))² in this case. By a similar analysis as in Eq. (77), we have

‖∂²α^(1)/(∂W^(1))²‖_{2,2,1} ≤ (β_σ/(2d)) (‖α^(0)‖² + ‖α^(0)‖²) ≤ β_σ C_x² = O(1).    (78)

F.4 Vector ∞-norm is Õ(1/√m)

Proof. First of all, we present a few useful facts, Lemmas F.4–F.6, that will be used during the proof. The proofs of these lemmas are in Appendices I.5–I.7. We first show that each activation of the hidden layers is bounded at initialization, with high probability.

Lemma F.4. For any l ∈ [L] and any i ∈ [m], with probability at least 1 − 2e^{−c_α^(l) ln²(m)} for some constant c_α^(l) > 0, |α_i^(l)| = Õ(1) at initialization.

Define the vector b^(l) := ∂f/∂α^(l) ∈ R^m for l ∈ [L], and let b_0^(l) denote b^(l) at initialization. Specifically, b^(l) takes the following form:

b^(l) = ( ∏_{l′=l+1}^{L} (1/√m) (W^(l′))^T Σ′^(l′) ) (1/√m) v,    (79)

where Σ′^(l′) is a diagonal matrix with (Σ′^(l′))_{ii} = σ′(α̃_i^(l′)). The following lemma gives an upper bound on the Euclidean norm of b^(l) in the ball B(W_0, R).

Lemma F.5. If the initial parameters W_0 of the multi-layer neural network f(W) satisfy Assumption F.1, then, for any W such that ‖W − W_0‖ ≤ R, we have, at all hidden layers, i.e., ∀ l ∈ [L],

‖b^(l)‖ ≤ L_σ^{L−l} (c_0 + R/√m)^{L−l+1}.    (80)

In particular, at initialization,

‖b_0^(l)‖ ≤ L_σ^{L−l} c_0^{L−l+1}.    (81)

We proceed to show that all components of b_0^(l) are of order Õ(1/√m) with high probability.

Lemma F.6. With probability at least 1 − m e^{−c_b^(l) ln²(m)} for some constant c_b^(l) > 0, ‖b_0^(l)‖_∞ = Õ(1/√m).

Now we show that, beyond initialization, ‖b^(l)‖_∞ is of order Õ(1/√m) in the ball B(W_0, R) with high probability. Technically, we bound the difference of the ∞-norms by the difference of the 2-norms. First, we prove by induction the following claim: for all l ∈ [L],

‖b^(l) − b_0^(l)‖ = Õ(1/√m).    (82)

In the base case, l = L, we have b^(L) = (1/√m) v and b_0^(L) = (1/√m) v_0. Hence,

‖b^(L) − b_0^(L)‖ = (1/√m) ‖v − v_0‖ ≤ (1/√m) ‖W − W_0‖ ≤ R/√m.    (83)

Now, suppose that ‖b^(l) − b_0^(l)‖ = Õ(1/√m). Then

\[
\begin{aligned}
\big\|b^{(l-1)} - b_0^{(l-1)}\big\|
&= \frac{1}{\sqrt{m}}\Big\| (W^{(l)})^T\Sigma'^{(l)}b^{(l)} - (W_0^{(l)})^T\Sigma_0'^{(l)}b_0^{(l)} \Big\|\\
&= \frac{1}{\sqrt{m}}\Big\| \big((W^{(l)})^T - (W_0^{(l)})^T\big)\Sigma'^{(l)}b^{(l)}
 + (W_0^{(l)})^T\big(\Sigma'^{(l)} - \Sigma_0'^{(l)}\big)b_0^{(l)}
 + (W_0^{(l)})^T\Sigma'^{(l)}\big(b^{(l)} - b_0^{(l)}\big) \Big\|\\
&\le \frac{1}{\sqrt{m}}\big\|W^{(l)} - W_0^{(l)}\big\|\,\big\|\Sigma'^{(l)}\big\|\,\big\|b^{(l)}\big\|
 + \frac{1}{\sqrt{m}}\big\|W_0^{(l)}\big\|\,\big\|\big(\Sigma'^{(l)} - \Sigma_0'^{(l)}\big)b_0^{(l)}\big\|
 + \frac{1}{\sqrt{m}}\big\|W_0^{(l)}\big\|\,\big\|\Sigma'^{(l)}\big\|\,\big\|b^{(l)} - b_0^{(l)}\big\|, \tag{84}
\end{aligned}
\]

where Σ_0′^(l) is a diagonal matrix with (Σ_0′^(l))_{ii} = σ′(α̃_i^(l)(W_0)). To bound the second additive term above, we need the following inequality:

\[
\begin{aligned}
\big\|\tilde\alpha^{(l)}(W) - \tilde\alpha^{(l)}(W_0)\big\|
&= \Big\|\frac{1}{\sqrt{m}}W^{(l)}\alpha^{(l-1)}(W) - \frac{1}{\sqrt{m}}W_0^{(l)}\alpha^{(l-1)}(W_0)\Big\|\\
&\le \frac{1}{\sqrt{m}}\|W_0^{(l)}\|\, L_\sigma\,\big\|\tilde\alpha^{(l-1)}(W) - \tilde\alpha^{(l-1)}(W_0)\big\|
 + \frac{1}{\sqrt{m}}\|W^{(l)} - W_0^{(l)}\|\,\|\alpha^{(l-1)}(W)\|\\
&\le c_0 L_\sigma \big\|\tilde\alpha^{(l-1)}(W) - \tilde\alpha^{(l-1)}(W_0)\big\| + \frac{1}{\sqrt{m}}\|W^{(l)} - W_0^{(l)}\|\,\|\alpha^{(l-1)}(W)\|\\
&= c_0 L_\sigma \big\|\tilde\alpha^{(l-1)}(W) - \tilde\alpha^{(l-1)}(W_0)\big\| + O(1),
\end{aligned}
\]

where the last equality follows from Lemma F.3, i.e., ‖α^(l−1)(W)‖ = O(√m). Recursively applying the above inequality, and since ‖α̃^(1)(W) − α̃^(1)(W_0)‖ ≤ (R/√d)‖x‖ ≤ R C_x, we have

\[
\big\|\tilde\alpha^{(l)}(W) - \tilde\alpha^{(l)}(W_0)\big\| = c_0^{\,l-1} L_\sigma^{\,l-1}\big\|\tilde\alpha^{(1)}(W) - \tilde\alpha^{(1)}(W_0)\big\| + O(1) = O(1). \tag{85}
\]

Also, note that Σ′ is a diagonal matrix; hence

\[
\begin{aligned}
\big\|\big(\Sigma'^{(l)} - \Sigma_0'^{(l)}\big)b_0^{(l)}\big\|
&= \sqrt{\sum_{i=1}^m (b_0^{(l)})_i^2\,\big[\sigma'(\tilde\alpha_i^{(l)}(W)) - \sigma'(\tilde\alpha_i^{(l)}(W_0))\big]^2}
\le \big\|b_0^{(l)}\big\|_\infty \sqrt{\sum_{i=1}^m \big[\sigma'(\tilde\alpha_i^{(l)}(W)) - \sigma'(\tilde\alpha_i^{(l)}(W_0))\big]^2}\\
&\le \big\|b_0^{(l)}\big\|_\infty\, \beta_\sigma\, \big\|\tilde\alpha^{(l)}(W) - \tilde\alpha^{(l)}(W_0)\big\| = \tilde O\!\left(\frac{1}{\sqrt{m}}\right), \tag{86}
\end{aligned}
\]

where we used Lemma F.6 and Eq. (85) in the last equality. Now, insert Eq. (86) into Eq. (84), and apply Lemma F.5 and the induction hypothesis; then we have

\[
\big\|b^{(l-1)} - b_0^{(l-1)}\big\|
\le \frac{1}{\sqrt{m}} R\, L_\sigma^{\,L-l+1}(c_0 + R/\sqrt{m})^{L-l+1}
+ \frac{1}{\sqrt{m}}\, c_0\sqrt{m}\,\big\|\big(\Sigma'^{(l)} - \Sigma_0'^{(l)}\big)b_0^{(l)}\big\|
+ c_0 L_\sigma \big\|b^{(l)} - b_0^{(l)}\big\| = \tilde O\!\left(\frac{1}{\sqrt{m}}\right). \tag{87}
\]

Thus, Eq. (82) holds for l − 1, and the proof of the induction step is complete. Therefore, by the principle of induction, Eq. (82) holds for all l ∈ [L].

Now, let us consider ‖b^(l)‖_∞. By Lemma F.6 and Lemma F.5, with probability at least 1 − m e^{−c_b^(l) ln²(m)},

\[
\big\|b^{(l)}\big\|_\infty \le \big\|b_0^{(l)}\big\|_\infty + \big\|b^{(l)} - b_0^{(l)}\big\|_\infty
\le \big\|b_0^{(l)}\big\|_\infty + \big\|b^{(l)} - b_0^{(l)}\big\| = \tilde O\!\left(\frac{1}{\sqrt{m}}\right). \tag{88}
\]

Using the union bound, for all l ∈ [L], we have, with probability at least 1 − m Σ_{l=1}^{L} e^{−c_b^(l) ln²(m)},

‖b^(l)‖_∞ = ‖∂f/∂α^(l)‖_∞ = Õ(1/√m).    (89)

G Generalization to other architectures

In this section, we apply Theorem 3.1 to both convolutional neural networks (CNNs) and residual networks (ResNets), and show that they both have small Hessian spectral norms when the network width m is sufficiently large and the last layer is linear.

G.1 Convolutional Neural Networks

A convolutional neural network (CNN) is a network of the type in Eq. (19), with each convolutional layer function φ_l defined as

α^(l) = φ_l(W^(l); α^(l−1)) = σ( (1/√(m_l)) W^(l) ∗ α^(l−1) ),  ∀ l ∈ [L],    (90)

where ∗ is the convolution operator (defined below), the layer width m_l = m for all l = 2, 3, ···, L, and m_1 = d with d being the number of channels of the input. To simplify the notation, we consider a one-dimensional CNN, i.e., an "image" is a 1-D array of "pixels"; the analysis in this section also applies to higher-dimensional CNNs. We also drop the layer index l wherever there is no ambiguity. We denote the number of channels of each hidden layer by m, the number of pixels in the "image" by Q and the size of each filter by K. Furthermore, we use i, j ∈ [m] as indices of channels, q ∈ [Q] as indices of pixels and k ∈ [K] as indices within the filter. The input α ∈ R^{m×Q} is a matrix, with m rows as channels and Q columns as pixels. The parameter W ∈ R^{K×m×m} is an order-3 tensor. The output of the layer function φ is of size m × Q. In this 1-D CNN case, the convolution operator is defined as

(W ∗ α)_{i,q} = Σ_{k=1}^{K} Σ_{j=1}^{m} W_{k,i,j} α_{j, q+k−(K+1)/2}.    (91)
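A small NumPy sketch of the 1-D convolution in Eq. (91) and of its rewriting as a sum of matrix products used in the reformulation that follows; it assumes an odd filter size K, zero padding for shifted-out pixels, and illustrative names.

```python
import numpy as np

def conv1d(W, alpha):
    """1-D convolution of Eq. (91): W has shape (K, m, m), alpha has shape (m, Q).
    Pixels shifted outside the image are treated as zero."""
    K, m, _ = W.shape
    Q = alpha.shape[1]
    out = np.zeros((m, Q))
    for q in range(Q):
        for k in range(K):
            qk = q + k - (K - 1) // 2          # 0-based version of q + k - (K+1)/2
            if 0 <= qk < Q:
                out[:, q] += W[k] @ alpha[:, qk]
    return out

def conv1d_as_matmuls(W, alpha):
    """Equivalent form: sum_k W[k] alpha[k], where alpha[k] is a pixel-shifted,
    zero-padded copy of alpha."""
    K, m, _ = W.shape
    Q = alpha.shape[1]
    out = np.zeros((m, Q))
    for k in range(K):
        shift = k - (K - 1) // 2
        alpha_k = np.zeros_like(alpha)
        src = slice(max(0, shift), min(Q, Q + shift))
        dst = slice(max(0, -shift), min(Q, Q - shift))
        alpha_k[:, dst] = alpha[:, src]
        out += W[k] @ alpha_k
    return out

rng = np.random.default_rng(0)
W, alpha = rng.standard_normal((3, 4, 4)), rng.standard_normal((4, 7))
assert np.allclose(conv1d(W, alpha), conv1d_as_matmuls(W, alpha))
```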

Reformulation of the convolutional layer. Now, we reformulate the convolutional layer function in Eq. (90) as a fully-connected-like function, so that we can use the techniques developed in Section F to prove the result for CNNs. Specifically, for all k ∈ [K], define matrices W^[k] and α^[k] such that (W^[k])_{ij} = W_{k,i,j} and (α^[k])_{jq} = α_{j, q+k−(K+1)/2}. Then, the convolution operator in Eq. (91) can be rewritten as

W ∗ α = Σ_{k=1}^{K} W^[k] α^[k],    (92)

where the summands are matrix products. Note that, while the W^[k] are independent from each other for different k ∈ [K], the inputs α^[k] are not independent; instead, they share pixels: (α^[k])_{j,q} = (α^[k′])_{j, q+k−k′}, i.e., each α^[k] is a pixel-shifted version of α (pixels newly generated by the shift are filled with zeros). Therefore, the convolutional layer function can also be written as (for l > 1)

φ(W; α) := σ( (1/√m) Σ_{k=1}^{K} W^[k] α^[k] ) =: σ(α̃).    (93)

We will use this expression of the convolutional layer function for the analysis in this section. Before proceeding to the proof for CNNs, we first point out a few useful facts, summarized in the following lemmas.

Lemma G.1. Given matrices A, B and C such that A = BC, we have ‖A‖_F ≤ ‖B‖ ‖C‖_F, where ‖B‖ is the spectral norm of the matrix B.

See the proof in Appendix I.8. The following two lemmas provide bounds on the spectral norms of the weights and the Frobenius norms of the hidden layers. These two lemmas (and their proofs) are analogous to Lemmas F.2 and F.3, and we omit the proofs.

Lemma G.2. Suppose the parameters are initialized as (W_0^[k])_{i,j} ∼ N(0, 1), for all k ∈ [K], i, j ∈ [m]. Then, with high probability over the random initialization, for any W ∈ B(W_0, R) the following holds:

‖W^[k]‖ = O(√m),  ∀ k ∈ [K].    (94)

Lemma G.3. Suppose the parameters are initialized as (W_0^[k])_{i,j} ∼ N(0, 1), for all k ∈ [K], i, j ∈ [m] and for all layers. Then, with high probability over the random initialization, for any W ∈ B(W_0, R) the following holds at all hidden layers:

‖α‖_F = O(√m).    (95)

At the input layer,

‖α‖_F = O(1).    (96)

We note that the proof for CNNs is basically analogous to that for fully connected neural networks (FCNs). Here, we refer readers to the proof idea for FCNs and only discuss the main differences below. In the following, we focus on analyzing the layers with l > 1. For the case l = 1, we omit the proof and refer the readers to the discussion in Section F, which also applies here.

Matrix spectral norm ‖∂φ/∂w‖ = O(1) and Lipschitz continuity of φ w.r.t. α. As seen in Section F, it suffices to prove the boundedness of the operator norms ‖∂φ/∂w‖ and ‖∂φ/∂α‖. Note that, in the convolutional layer function, the vector of parameters w is reshaped to W ∈ R^{K×m×m}, and the input is reshaped to α ∈ R^{m×Q}. Then, the Euclidean norm of the input becomes the Frobenius norm ‖α‖_F, and the Euclidean norm of the parameters is ‖w‖ = (Σ_{k=1}^{K} ‖W^[k]‖_F²)^{1/2}. Then, the squared spectral norm satisfies

\[
\begin{aligned}
\|\partial\phi/\partial\alpha\|^2
&= \sup_{\|V\|_F=1}\frac{1}{m}\sum_{i=1}^m\sum_{q=1}^Q\big(\sigma'(\tilde\alpha_{i,q})\big)^2\Big(\sum_{k=1}^K W^{[k]}V^{[k]}\Big)_{i,q}^2
\le \frac{L_\sigma^2}{m}\sup_{\|V\|_F=1}\Big\|\sum_{k=1}^K W^{[k]}V^{[k]}\Big\|_F^2\\
&\le \frac{L_\sigma^2}{m}\sup_{\|V\|_F=1}\Big(\sum_{k=1}^K\|W^{[k]}\|\,\|V^{[k]}\|_F\Big)^2
\le \frac{L_\sigma^2}{m}\Big(\sum_{k=1}^K\|W^{[k]}\|\Big)^2 = O(1).
\end{aligned}
\]

Here, in the second inequality we used Lemma G.1, and in the last equality we used Lemma G.2. Similarly, using Lemmas G.1 and G.3, we also have

\[
\begin{aligned}
\|\partial\phi/\partial w\|^2
&= \sup\Big\{\frac{1}{m}\sum_{i=1}^m\sum_{q=1}^Q\big(\sigma'(\tilde\alpha_{i,q})\big)^2\Big(\sum_{k=1}^K V^{[k]}\alpha^{[k]}\Big)_{i,q}^2 \,:\, \sum_{k=1}^K\|V^{[k]}\|_F^2=1\Big\}\\
&\le \frac{L_\sigma^2}{m}\sup\Big\{\Big(\sum_{k=1}^K\|V^{[k]}\|\,\|\alpha^{[k]}\|_F\Big)^2 \,:\, \sum_{k=1}^K\|V^{[k]}\|_F^2=1\Big\}
\le \frac{L_\sigma^2}{m}\Big(\sum_{k=1}^K\|\alpha^{[k]}\|_F\Big)^2 = O(1).
\end{aligned}
\]

27 (2, 2, 1)-norms of order 3 tensors are O(1). Recall that the vector of parameters w is reshaped to W ∈ RK×m×m, and the input is reshaped to α ∈ Rm×Q. Then, by Lemma G.1 and G.2, we have

2 ∂ φ 2 ∂α 2,2,1   m Q K ! K ! X X 1 00 X [k] [k] X [k] [k]  = sup σ (˜αi,q) W V W V : kV1kF = kV2kF = 1 m 1 2 i=1 q=1 k=1 i,q k=1 i,q   K 2 K 2  βσ  X [k] X [k]  ≤ sup W [k]V + W [k]V : kV k = kV k = 1 2m 1 2 1 F 2 F  k=1 F k=1 F  K !2 βσ X ≤ · 2 kW [k]k 2m k=1 = O(1).

Similarly, by using Lemma G.1, G.2 and G.3, we also have

2 ∂ φ

∂α∂w 2,2,1   m Q K ! K ! K X X 1 00 X [k] [k] X [k] [k] X [k] 2 2  = sup σ (˜αi,q) V α W V : kV k = kV2k = 1 m 1 2 1 F F i=1 q=1 k=1 i,q k=1 i,q k=1   K 2 K 2 K  βσ  X [k] X [k] X [k]  ≤ sup V α[k] + W [k]V : kV k2 = kV k2 = 1 2m 1 2 1 F 2 F  k=1 F k=1 F k=1 

K 2 K 2! βσ  X   X  ≤ · kα[k]k + kW [k]k 2m F k=1 k=1 = O(1), and

2 ∂ φ 2 ∂w 2,2,1   m Q K ! K ! K K X X 1 00 X [k] [k] X [k] [k] X [k] 2 X [k] 2  = sup σ (˜αi,q) V α V α : kV k = kV k = 1 m 1 2 1 F 2 F i=1 q=1 k=1 i,q k=1 i,q k=1 k=1   K 2 K 2 K K  βσ  X [k] X [k] X [k] X [k]  ≤ sup V α[k] + V α[k] : kV k2 = kV k2 = 1 2m 1 2 1 F 2 F  k=1 F k=1 F k=1 k=1 

K 2 βσ  X  ≤ · 2 kα[k]k 2m F k=1 = O(1).

√ Vector ∞-norm is O˜(1/ m). The proof idea is similar to the case of fully connected case, as in Section F.4, i.e., proving by induction. The base case of the induction is the same as fully connected case, and we omit it here. The inductive hypothesis for CNN is: max (∂f/∂α(l+1)) = √ i∈[m],q∈[Q] i,q O˜(1/ m).

Now, for the l-th layer, we have

m Q K (l) X X X (l+1) 0 (l+1) 1 [k] (∂f/∂α )i,q = (∂f/∂α )j,q0 σ (˜αj,q0 )√ Wji Iq=q0−k+ K+1 m 2 j=1 q0=1 k=1 K m X X (l+1) 0 (l+1)  1 [k] = (∂f/∂α )j,q+k− K+1 σ α˜ K+1 √ Wji 2 j,q+k− 2 m k=1 j=1 K X (l) [k] , (∂f/∂α )i,q. k=1 √ (l) [k] ˜ By the same argument as in Section F.4, we have: maxi∈[m],q∈[Q](∂f/∂α )i,q = O(1/ m), for each k ∈ [K], with high probability of the random initialization. Since K is finite, then we have √ (l) ˜ maxi∈[m],q∈[Q](∂f/∂α )i,q = O(1/ m) with high probability of the random initialization.

G.2 Residual Networks (ResNet)

In this subsection we prove that the Hessian spectral norm of a ResNet also scales as Õ(1/√m), with m being the width of the network. We define the ResNet f as follows:

α^(1) = σ( (1/√d) W^(1) x ),
α^(l) = σ(α̃_res^(l)) + α^(l−1),   α̃_res^(l) = (1/√m) W^(l) α^(l−1),   ∀ 2 ≤ l ≤ L,
f = (1/√m) v^T α^(L).    (97)

The parameters W := {W^(1), W^(2), ···, W^(L), W^(L+1) := v} are initialized following the random Gaussian initialization strategy, i.e., (W_0^(l))_{ij} ∼ N(0, 1), ∀ l ∈ [L], and v_{0,i} ∼ N(0, 1), i, j ∈ [m].

Remark G.1. This definition of ResNet differs from the standard ResNet architecture in [11] in that the skip connections are at every layer, instead of every two layers. The same analysis can be easily generalized to cases where skip connections are at every two or more layers. The same definition, up to a scaling factor, was also theoretically studied in [7].
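A minimal NumPy sketch of the forward pass in Eq. (97), with the per-layer skip connection; the names and the choice of ReLU as σ are illustrative assumptions.

```python
import numpy as np

def resnet_forward(params, x):
    """Forward pass of Eq. (97): alpha^(1) = sigma(W1 x / sqrt(d)),
    alpha^(l) = sigma(W_l alpha^(l-1) / sqrt(m)) + alpha^(l-1), f = v^T alpha^(L) / sqrt(m)."""
    Ws, v = params                       # Ws = [W1, ..., WL], v = output weights
    sigma = lambda z: np.maximum(z, 0.0)
    d = x.shape[0]
    alpha = sigma(Ws[0] @ x / np.sqrt(d))
    m = alpha.shape[0]
    for W in Ws[1:]:
        alpha = sigma(W @ alpha / np.sqrt(m)) + alpha    # skip connection at every layer
    return v @ alpha / np.sqrt(m)

def init_resnet(d, m, L, seed=0):
    rng = np.random.default_rng(seed)
    Ws = [rng.standard_normal((m, d))] + [rng.standard_normal((m, m)) for _ in range(L - 1)]
    return Ws, rng.standard_normal(m)

params = init_resnet(d=5, m=128, L=4)
print(resnet_forward(params, np.ones(5)))
```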

We see that the ResNet is the same as a fully connected neural network, Eq. (63), except that the activation α^(l) has an extra additive term α^(l−1) from the previous layer, interpreted as a skip connection. Because of this similarity, the proof for ResNets is almost identical to that for fully connected networks. In the following, we sketch the proof for ResNets, focusing on the arguments that are new to ResNets and omitting those identical to the fully connected case. Parallel to Lemmas F.2 and F.3 for the fully connected case, we have the following lemmas for the ResNet.

Lemma G.4. Suppose the parameters are initialized as (W_0^(l))_{i,j} ∼ N(0, 1), for all l ∈ [L], and v_{0,i} ∼ N(0, 1), i, j ∈ [m]. Then, with high probability over the random initialization, for any W ∈ B(W_0, R) the following holds:

‖W^(l)‖ = O(√m),  ∀ l ∈ [L + 1].    (98)

Lemma G.5. Suppose the parameters are initialized as (W_0^(l))_{i,j} ∼ N(0, 1), for all l ∈ [L], and v_{0,i} ∼ N(0, 1), i, j ∈ [m]. Then, with high probability over the random initialization, for any W ∈ B(W_0, R) the following holds at all hidden layers:

‖α^(l)‖ = O(√m).    (99)

In particular, for the input layer,

‖α^(0)‖ = ‖x‖ = O(1).    (100)

The proofs of the above two lemmas are almost identical to those of Lemma F.2 and F.3. We omit the proofs here, and refer interested readers to proofs of Lemma F.2 and F.3.


Matrix spectral norm ‖∂α^(l)/∂w^(l)‖ = O(1) and Lipschitz continuity of α^(l) w.r.t. α^(l−1). When 2 ≤ l ≤ L, from the definition of the ResNet, Eq. (97), a ResNet layer α^(l) is given by

α^(l) = φ_l(W^(l); α^(l−1)) = σ( (1/√m) W^(l) α^(l−1) ) + α^(l−1).    (101)

Therefore, we have

‖∂α^(l)/∂α^(l−1)‖ = sup_{‖v‖=1} ‖( (1/√m) Σ′^(l) W^(l) + I ) v‖
                  ≤ sup_{‖v‖=1} ( (1/√m) ‖Σ′^(l)‖ ‖W^(l)‖ ‖v‖ + ‖v‖ )
                  ≤ L_σ (c_0 + R/√m) + 1 = O(1).

We note that ‖∂α^(l)/∂w^(l)‖ has the same expression as for fully connected networks. By the same argument as in Section F.2, together with Lemma G.5, we have ‖∂α^(l)/∂w^(l)‖ = O(1).

When l = 1, the layer function is

α^(1) = φ_1(W^(1); α^(0)) = σ( (1/√d) W^(1) x ).

In this layer, the input x is fixed (independent of the trainable parameters) and is not a dynamical variable; hence, ∂α^(1)/∂α^(0) is not an interesting object in our Hessian analysis. We have

‖∂α^(1)/∂w^(1)‖ ≤ (1/√d) ‖Σ′^(1)‖ ‖x‖ ≤ L_σ C_x = O(1).

We see that both k∇αφlk and k∇wφlk are bounded, hence, the (vector-valued) layer function of ResNet is Lipschitz continuous.

(2, 2, 1)-norms of order-3 tensors are O(1). Note that the skip-connection term α^(l−1) in Eq. (101) is linear in α^(l−1) and independent of W^(l). Hence, the order-3 tensors are exactly the same as in the case of fully connected networks. Applying the same argument as in Section F.3 gives

‖∂²φ_l/(∂α^(l−1))²‖_{2,2,1} = O(1),   ‖∂²φ_l/(∂α^(l−1)∂W^(l))‖_{2,2,1} = O(1),   ‖∂²φ_l/(∂W^(l))²‖_{2,2,1} = O(1).    (102)

Vector ∞-norm is Õ(1/√m). For a ResNet, define the vector b_res^(l) := ∂f/∂α^(l) for l ∈ [L]. Specifically, b_res^(l) takes the following form:

b_res^(l) = ( ∏_{l′=l+1}^{L} ( (1/√m) (W^(l′))^T Σ′^(l′) + I ) ) (1/√m) v.    (103)

Compared to the expression of b^(l) in Eq. (79), which covers the fully connected case, the only difference is that each factor in b_res^(l) has an extra additive identity matrix. We argue that the ∞-norm ‖b_res^(l)‖_∞ is still of order Õ(1/√m). We show this by induction.

First, recall that by the analysis in Section F.4 we have ‖b^(l)‖_∞ = Õ(1/√m) for all l ∈ [L]. In the base case, b_res^(L) = (1/√m) v = b^(L), so ‖b_res^(L)‖_∞ = Õ(1/√m) holds. Now, suppose ‖b_res^(l+1)‖_∞ = Õ(1/√m) holds. For b_res^(l), we have

b_res^(l) = ( (1/√m) (W^(l+1))^T Σ′^(l+1) + I ) b_res^(l+1) = (1/√m) (W^(l+1))^T Σ′^(l+1) b_res^(l+1) + b_res^(l+1).    (104)

By an analysis analogous to Sections F.4 and I.7, the ∞-norm of the first term is of order Õ(1/√m). Since ‖b_res^(l+1)‖_∞ is also of order Õ(1/√m), we conclude that ‖b_res^(l)‖_∞ is of order Õ(1/√m).

G.3 Architecture with mixed layer types

So far, we have seen that fully connected networks (Theorem 3.2), CNNs (Section G.1) and ResNets (Section G.2) have a Hessian spectral norm of order Õ(1/√m). In fact, our analysis generalizes to architectures with mixed layer types. Note that the analysis of the (2, 2, 1)-norms of the order-3 tensors and of the spectral norms of the first derivatives of the hidden layers in the proof of Lemma 3.1, and its counterparts for CNNs and ResNets, is purely layer-wise and does not depend on the types of the other layers. As for the ∞-norm, our analysis is inductive: ‖∇_{α^(l)} f‖_∞ = Õ(1/√m) relies only on the structure of the current layer and the fact that ‖∇_{α^(l+1)} f‖_∞ = Õ(1/√m).

H Proof of Theorem 4.1

First, we give a useful lemma below:

Lemma H.1. Let x = (x_1, x_2, ..., x_m) and y = (y_1, y_2, ..., y_m), where x_1, ..., x_m and y_1, ..., y_m are i.i.d. random variables with x_i, y_i ∼ N(0, 1) for i ∈ [m]. Given an arbitrary radius R > 0, for any x̄, ȳ ∈ R^m such that ‖x − x̄‖ ≤ R and ‖y − ȳ‖ ≤ R and for any δ_1 ∈ (0, 1), we have, with probability at least 1 − 2δ_1,

Σ_{i=1}^{m} x̄_i ȳ_i² ≥ √3 C_1 δ_1³ √m − 3 log(4/δ_1) R⁴ − R³,    (105)

for some constant C_1 > 0. The proof of the lemma is deferred to Appendix I.9.

Proof of Theorem 4.1. Consider an arbitrary parameter setting W ∈ B(W_0, R). Note that the spectral norm of a matrix is lower bounded by the spectral norm of any of its blocks; hence

‖H(W)‖ ≥ ‖∂²f/(∂w^(2))²‖.    (106)

Hence, it suffices to lower bound the norm of the Hessian block ∂²f/(∂w^(2))².

With a simple computation, the gradient of f w.r.t. w_i^(2) is

∂f/∂w_i^(2) = (1/m) Σ_{j=1}^{m} w_j^(4) σ′(α̃_j^(3)) w_j^(3) α_i^(1),    (107)

and each entry of the Hessian block takes the form

∂²f/(∂w_i^(2) ∂w_k^(2)) = (1/m^{3/2}) Σ_{j=1}^{m} w_j^(4) σ″(α̃_j^(3)) (w_j^(3))² α_i^(1) α_k^(1)
                        = (1/m^{3/2}) Σ_{j=1}^{m} w_j^(4) (w_j^(3))² α_i^(1) α_k^(1).    (108)

In the second equality above, we have used σ(x) = x²/2. Then, the spectral norm of this Hessian block is

‖∂²f/(∂w^(2))²‖ = (1/m^{3/2}) |Σ_{j=1}^{m} w_j^(4) (w_j^(3))²| · ‖α^(1)‖².    (109)
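Since the block in Eq. (108) is a scaled outer product α^(1) (α^(1))^T, its spectral norm factorizes as in Eq. (109); a quick numerical check with arbitrary random values (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
w3, w4, alpha1 = (rng.standard_normal(m) for _ in range(3))

c = (w4 * w3**2).sum() / m**1.5          # scalar coefficient in Eq. (108)
H_block = c * np.outer(alpha1, alpha1)    # the rank-one Hessian block
assert np.isclose(np.linalg.norm(H_block, 2), abs(c) * alpha1 @ alpha1)
```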

For the last factor ‖α^(1)‖², using the tail bound for χ²(m) [14], we have, with probability at least 1 − e^{−m/16},

‖α^(1)‖² = (x²/4) Σ_{i=1}^{m} (w_i^(1))² ≥ (x²/4)(m/2 − R²).


Applying Lemma H.1 to the factor Σ_{j=1}^{m} w_j^(4) (w_j^(3))² and using the union bound, for an arbitrary δ ∈ (0, 1) we have, with probability at least 1 − 2δ − e^{−m/16},

‖∂²f/(∂w^(2))²‖ ≥ (x²/4)(1/2 − R²/m) ( √3 C_1 δ³ − (3 log(4/δ) R⁴ + R³)/√m ).    (110)

Hence, we obtain the following lower bound on the Hessian spectral norm at any W ∈ B(W_0, R):

‖H(W)‖ ≥ ‖∂²f/(∂w^(2))²‖ ≥ (x²/4)(1/2 − R²/m) ( √3 C_1 δ³ − (3 log(4/δ) R⁴ + R³)/√m ).    (111)

I Proofs of Technical Lemmas

I.1 Proof of Lemma E.1

Proof.

 (1,1)   (1,2)    H 0 ··· 0 0 H ··· 0 0 0 ··· 0 0 0 ··· 0  0 0 ··· 0   0 0 ··· 0    kHk =  . . . .  +  . . . .  + ··· +  . . . .   . . . .   . . . .   . . .. .  ...... 0 0 ··· 0 0 0 ··· 0 0 0 ··· H(L+1,L+1) X ≤ kH(l1,l2)k.

l1,l2

I.2 Proofs for Gaussian Random Initialization

Proof of Lemma F.1. Consider an arbitrary random matrix W ∈ R^{m_1×m_2} with each entry W_{ij} ∼ N(0, 1). By Corollary 5.35 of [22], for any t > 0, we have, with probability at least 1 − 2 exp(−t²/2),

In particular, for the initial parameter setting W0, we have

‖W_0^(1)‖₂ ≤ √d + √m + t,   ‖W_0^(l)‖₂ ≤ 2√m + t, l ∈ {2, 3, ..., L},   ‖W_0^(L+1)‖₂ ≤ √(m + 1) + t.

Letting t = √m and noting that m > d, we finish the proof.
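A quick empirical check of the resulting bound ‖W_0^(l)‖ ≤ 3√m for a square Gaussian matrix (illustrative only; the observed ratio is typically close to 2, well below the bound of 3):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000
W0 = rng.standard_normal((m, m))               # entries ~ N(0, 1)
print(np.linalg.norm(W0, 2) / np.sqrt(m))      # spectral norm divided by sqrt(m)
```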

I.3 Proof of Lemma F.2

Proof. By the triangle inequality and the definition ‖W‖ = (Σ_{l=1}^{L+1} ‖W^(l)‖_F²)^{1/2}, we have, for all layers l ∈ [L + 1],

‖W^(l)‖₂ ≤ ‖W_0^(l)‖₂ + ‖W^(l) − W_0^(l)‖₂ ≤ ‖W_0^(l)‖₂ + ‖W^(l) − W_0^(l)‖_F ≤ c_0√m + R.    (113)

Note that, at the output layer, W^(L+1), i.e., v, is a vector, and the Frobenius norm ‖·‖_F reduces to the Euclidean norm ‖·‖.

I.4 Proof of Lemma F.3

Proof. To analyze ‖α^(l)(W)‖, first consider the input layer l = 0: ‖α^(0)‖ = ‖x‖ ≤ √d‖x‖_∞ ≤ √d C_x, where d is the dimension of the input x. We then prove Eq. (66) by induction. For the first hidden layer, l = 1,

‖α^(1)(W)‖ = ‖σ( (1/√d) W^(1) α^(0) )‖
           ≤ (1/√d) L_σ ‖W^(1)‖₂ ‖α^(0)‖ + σ(0)
           ≤ (1/√d) L_σ (c_0√m + R) ‖α^(0)‖ + σ(0)
           ≤ L_σ (c_0 + R/√m) √m C_x + σ(0).    (114)

Above, we used the Lσ-Lipschitz continuity and applied Lemma F.2 in the second inequality. Now, suppose for l-th layer we have

‖α^(l)(W)‖ ≤ L_σ^l (c_0 + R/√m)^l √m C_x + Σ_{i=1}^{l} L_σ^{i−1} (c_0 + R/√m)^{i−1} σ(0).    (115)

Then, by a similar argument as in Eq. (114), we get

‖α^(l+1)(W)‖ = ‖σ( (1/√m) W^(l+1) α^(l)(W) )‖ ≤ L_σ^{l+1} (c_0 + R/√m)^{l+1} √m C_x + Σ_{i=1}^{l+1} L_σ^{i−1} (c_0 + R/√m)^{i−1} σ(0).

I.5 Proof of Lemma F.4

(l) Proof. When 2 ≤ l ≤ L, |αi | takes the following form: m ! (l) 1 X (l) (l−1) |α | = σ √ W α i m ik k k=1 m Lσ X (l) (l−1) ≤ √ W α + |σ(0)|, m ik k k=1 Pm (l) (l−1) (l−1) 2 (l) where we can see k=1 Wik αk ∼ N (0, kα k ) since Wik ∼ N (0, 1) at initialization. By the concentration inequality for Gaussian random variable, we have m (l) Lσ X (l) (l−1) [|α | ≥ ln(m) + |σ(0)|] ≤ [ √ W α ≥ ln(m)] P i P m ik k k=1 mln2(m) − 2 (l−1) 2 ≤ 2e 2Lσ kα k −c(l)ln2(m) = 2e α ,

(l) m for cα = 2 (l−1) 2 = Ω(1) by Lemma F.3. 2Lσ kα k When l = 1, we have d ! (1) 1 X (1) √ |αi | = σ Wik xk d k=1 d Lσ X (1) √ ≤ Wik xk + |σ(0)|. d k=1

33 Pd (1) 2 Similarly, at initialization, k=1 Wik xk ∼ N (0, kxk ). Hence " d # (1) Lσ X (l) √ P[|αi | ≥ ln(m) + |σ(0)|] ≤ P Wik xk ≥ ln(m) d k=1 dln2(m) − 2 2 ≤ 2e 2Lσ kxk −c(0)ln2(m) = 2e α ,

d (0) where we denote 2 by cα , which is of the order Θ(1). Lσ kxk (l) (l) 2 ˜ −cα ln (m) Therefore, |αi | = O(1) with probability at least 1 − 2e for all l ∈ [L].

I.6 Proof of Lemma F.5

Proof. The expression of the derivatives b(l) is

b^(l) = ( ∏_{l′=l+1}^{L} (1/√m) (W^(l′))^T Σ′^(l′) ) (1/√m) v,    (116)

where Σ′^(l′) is a diagonal matrix with (Σ′^(l′))_{ii} = σ′(α̃_i^(l′)(W)). We prove the lemma by induction. When l = L, using Lemma F.2, we have

‖b^(L)‖ = (1/√m) ‖v‖ ≤ (1/√m)(c_0√m + R) = c_0 + R/√m.    (117)

Suppose that at the l-th layer ‖b^(l)‖ ≤ L_σ^{L−l}(c_0 + R/√m)^{L−l+1}. Then

‖b^(l−1)‖ = ‖(1/√m) (W^(l))^T Σ′^(l) b^(l)‖
          ≤ (1/√m) ‖W^(l)‖₂ ‖Σ′^(l)‖₂ ‖b^(l)‖
          ≤ (c_0 + R/√m) L_σ ‖b^(l)‖
          ≤ L_σ^{L−l+1} (c_0 + R/√m)^{L−l+2}.

Above, we used Lemma F.2 and the Lσ-Lipschitz continuity of the activation function σ(·) in the second inequality. Setting R = 0, we immediately obtain Eq.(81).

I.7 Proof of Lemma F.6

Proof. We prove it by induction. When l = L, b_0^(L) = (1/√m) v_0. Since v_{0,i} ∼ N(0, 1), by the concentration inequality, for every i ∈ [m] we have

P[|v_{0,i}| ≥ ln(m)] ≤ 2e^{−ln²(m)/2}.

By the union bound, with probability at least 1 − 2m e^{−ln²(m)/2},

‖v_0‖_∞ ≤ ln(m); in other words, ‖b_0^(L)‖_∞ = Õ(1/√m).

Suppose that, with probability 1 − m e^{−c_b^(l) ln²(m)} for some constant c_b^(l) > 0, we have ‖b_0^(l)‖_∞ = Õ(1/√m). We then show that ‖b_0^(l−1)‖_∞ = Õ(1/√m) with probability 1 − m e^{−c_b^(l−1) ln²(m)} for some constant c_b^(l−1) > 0.

34 (l−1) For simplicity, in the rest of the proof, we hide the subscript 0. Hence we denote b0 = (l−1) (l−1) (l) √1 T 0 m (W0 ) Σ 0 b0 by 1 b(l−1) = √ (W (l−1))T Σ0(l−1)b(l), m (l−1) where (W )ij ∼ N (0, 1). Similarly, we analyze every component of b(l−1):   m m (l−1) 1 X (l−1) 0 1 X (l−1) (l−2) (l) |b | = √ W σ √ W α  b i m ki m kj j k k=1 j=1   m m m 1 X (l−1) 0 1 X (l−1) (l−2) (l) 1 (l−2) X (l−1) 2 (l) ≤ √ W σ √ W α  b + βσα (W ) b m ki m kj j k m i ki k k=1 j6=i k=1   m m m 1 X (l−1) 0 1 X (l−1) (l−2) (l) 1 (l−2) X (l−1) 2 (l) ≤ √ W σ √ W α  b + βσα (W ) b . m ki m kj j k m i ki k k=1 j6=i k=1 For the first term, we use a Gaussian random variable to bound it:   m m m  2  1 X (l−1) 1 X (l−1) (l−2) (l) Lσ X (l−1) (l) L √ W σ0 √ W α b ≤ √ W b ∼ N 0, σ kb(l)k2 . m ki  m kj j  k m ki k m k=1 j6=i k=1 Using the concentration inequality, we have

" m # ln2(m) − (l) 2 Lσ X (l−1) (l) ln(m) 2 (l) 2 −c ln (m) √ W b ≥ √ ≤ 2e 2Lσ kb k ≤ 2e σ , P m ki k m k=1 (l) 1 1 for some cσ = 2 (l) 2 ≥ 2L−2l+2 2L−2l+2 by Lemma F.5. 2Lσ kb k 2Lσ c0 For the second term, we have m m 1 (l−2) X (l−1) (l) 1 (l−2) X (l−1) β α (W )2b ≤ β |α |kb(l)k (W )2, m σ i ki k m σ i ∞ ki k=1 k=1 Pm (l−1) 2 2 where we can see k=1(Wki ) ∼ χ (m). (l−2) 2 (l−2) −cα ln (m) ˜ By Lemma F.4, with probability 1 − e , we get |αi | = O(1). Hence, by Lemma 1 in [14], there exist constants c˜1, c˜2, c˜3 > 0, such that

" m c˜3 # 1 (l−2) X (l−1) ln (m) β |α |kb(l)k (W )2 ≥ c˜ √ ≤ e−c˜2m, P m σ i ∞ ki 1 m k=1 −c(l)ln2(m) with probability 1 − me b by the induction hypothesis. (l−1) Combining these probability terms, there exists a constant cb such that −c(l−1)ln2(m) −c(l)ln2(m) −c(l)ln2(m) −c(l−2)ln2(m) −c˜ m e b ≤ me b + 2e σ + 2e α + e 2 .

Then, with probability at least 1 − e^{−c_b^(l−1) ln²(m)}, |b_i^(l−1)| = Õ(1/√m). By the union bound, with probability at least 1 − m e^{−c_b^(l−1) ln²(m)}, we have ‖b^(l−1)‖_∞ = Õ(1/√m).

Hence, by the principle of induction, for all l ∈ [L], with probability at least 1 − m e^{−c_b^(l) ln²(m)} for some constant c_b^(l) > 0, we have

‖b^(l)‖_∞ = Õ(1/√m).    (118)

I.8 Proof of Lemma G.1

Proof. Let A = (a1, a2, ··· , ad) and C = (c1, c2, ··· , cd), where each ai is a column of the matrix A and each ci is a column of the matrix C. Then we have

ai = Bci, ∀i ∈ [d]. (119)

Now, for the Frobenius norm, we have

‖A‖_F² = Σ_{i=1}^{d} ‖a_i‖² = Σ_{i=1}^{d} ‖B c_i‖² ≤ ‖B‖² Σ_{i=1}^{d} ‖c_i‖² = ‖B‖² ‖C‖_F².

Hence, kAkF ≤ kBkkCkF .

I.9 Proof of Lemma H.1

Proof. First, let’s write x¯ and y¯ as

x¯ = x + s, y¯ = y + t,

where s := (s1, ··· , sm) = (¯x1 − x1, ··· , x¯m − xm) and t := (t1, ··· , tm) = (¯y1 − y1, ··· , y¯m − ym). By the condition of the Lemma, we have ksk ≤ R and ktk ≤ R. Then, we have:

Σ_{i=1}^{m} x̄_i ȳ_i² = Σ_{i=1}^{m} (x_i + s_i)(y_i + t_i)²
 = Σ_{i=1}^{m} (x_i + s_i) y_i² + 2 Σ_{i=1}^{m} x_i t_i y_i + 2 Σ_{i=1}^{m} s_i t_i y_i + Σ_{i=1}^{m} x_i t_i² + Σ_{i=1}^{m} s_i t_i²
 ≥ Σ_{i=1}^{m} (x_i + s_i) y_i² + 2 Σ_{i=1}^{m} x_i t_i y_i − 2 |Σ_{i=1}^{m} s_i t_i y_i| − |Σ_{i=1}^{m} x_i t_i²| − |Σ_{i=1}^{m} s_i t_i²|.    (120)

Now, let’s lower bound the first term and upper bound the last three terms. For the first term, we lower bound it by the anti-concentration inequality, i.e., Theorem 8 in [5]. Specifically, for any δ1 ∈ (0, 1), there exists a constant C1 > 0, such that, with probability at least 1 − δ1,

Σ_{i=1}^{m} (x_i + s_i) y_i² + 2 Σ_{i=1}^{m} x_i t_i y_i ≥ C_1 δ_1³ √( E[ ( Σ_{i=1}^{m} (x_i + s_i) y_i² + 2 Σ_{i=1}^{m} x_i t_i y_i )² ] ).    (121)

On the other hand,

 m m !2 X 2 X E  (xi + si)yi + 2 xitiyi  i=1 i=1

 m !2  m !2 " m ! m !# X 2 X X 2 X = E  (xi + si)yi  + 4E  xitiyi  + 4E (xi + si)yi xitiyi i=1 i=1 i=1 i=1

 m !2 X 2 ≥ E  (xi + si)yi  i=1 " m !# X 2 4 X = E (xi + si) yi + 2 sisj i=1 i6=j m X 2 X = 3m + 3 si + 2 sisj i=1 i6=j m m X 2 X 2 = 3m + 2 si + ( si) i=1 i=1 ≥ 3m, (122) where the expectation E[·] is taken over the random variables x1, ...xm, , y1, ..., ym. Above, we used 2 4 the fact that these random variables are i.i.d. and E[xi ] = 1 and that E[yi ] = 3 for all i ∈ [m].

Combining Eq.(121) and (122), we have, with probability at least 1 − δ1,

Σ_{i=1}^{m} (x_i + s_i) y_i² + 2 Σ_{i=1}^{m} x_i t_i y_i ≥ √3 C_1 δ_1³ √m.    (123)

For the second and third terms, we notice that

Σ_{i=1}^{m} s_i t_i y_i ∼ N(0, Σ_{i=1}^{m} s_i² t_i²),   Σ_{i=1}^{m} x_i t_i² ∼ N(0, Σ_{i=1}^{m} t_i⁴).

It is not hard to see that

Σ_{i=1}^{m} s_i² t_i² ≤ R⁴,   Σ_{i=1}^{m} t_i⁴ ≤ R⁴.

By the concentration inequality for Gaussian random variables (Prop. 2.1.9 in [21]) and the union bound, we have, for any δ_2 ∈ (0, 1),

|Σ_{i=1}^{m} s_i t_i y_i| ≤ log(4/δ_2) R⁴,   |Σ_{i=1}^{m} x_i t_i²| ≤ log(4/δ_2) R⁴,    (124)

with probability at least 1 − δ_2. As for the last term, using ‖s‖ ≤ R and ‖t‖ ≤ R, it is easy to obtain the following bound:

|Σ_{i=1}^{m} s_i t_i²| ≤ R Σ_{i=1}^{m} t_i² ≤ R³.    (125)

Putting inequalities Eq.(123), Eq.(124), and Eq.(125) into Eq.(120) and using union bound, we have with probability at least 1 − δ1 − δ2,

Σ_{i=1}^{m} x̄_i ȳ_i² = Σ_{i=1}^{m} (x_i + s_i)(y_i + t_i)² ≥ √3 C_1 δ_1³ √m − 3 log(4/δ_2) R⁴ − R³.

If we set δ_1 = δ_2 in the above analysis, we have, with probability at least 1 − 2δ_1,

Σ_{i=1}^{m} x̄_i ȳ_i² = Σ_{i=1}^{m} (x_i + s_i)(y_i + t_i)² ≥ √3 C_1 δ_1³ √m − 3 log(4/δ_1) R⁴ − R³.
