Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

Greg Yang 1

Abstract Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Processes to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline tensor program that can express most neural network computations, and we characterize its scaling limit when its tensors are large and randomized. From our framework follows 1. the convergence of random neural networks to Gaussian processes for architectures such as recurrent neural networks, convolutional neural networks, residual networks, attention, and any combination thereof, with or without batch normalization; 2. conditions under which the gradient independence assumption – that weights in backpropagation can be assumed to be independent from weights in the forward pass – leads to correct computation of gradient dynamics, and corrections when it does not; 3. the convergence of the Neural Tangent Kernel, a recently proposed kernel used to predict training dynamics of neural networks under gradient descent, at initialization for all architectures in (1) without batch normalization. Mathematically, our framework is general enough to rederive classical random matrix results such as the semicircle and the Marchenko-Pastur laws, as well as recent results on neural network Jacobian singular values. We hope our work opens a way toward the design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and a deeper understanding of SGD dynamics in modern architectures.

1. Introduction Several recent trends in machine learning theory and practice have found it fruitful to study wide random neural networks, such as neural network-inspired Gaussian Processes, signal propagation in DNNs, small learning rate SGD dynamics, and even, in some sense, the celebrated Approximate Message Passing algorithm for compressed sensing. We review these subjects and more in Section 2. All of these works involve some theory that derives, rigorously or semirigorously, some scaling limit of a neural network as its width goes to infinity. In this paper, we give a unifying treatment of such scaling limits:
• We define a notion of tensor programs which can express most neural network computations, and a natural notion of tensor program scaling that corresponds to increasing width with Glorot initialization (Glorot & Bengio, 2010). Our main theorems characterize the scaling limits in the two most common scenarios that roughly correspond to DNN inference and backpropagation, as well as in the general tensor program case. They are proved via a Gaussian conditioning technique first used in Bolthausen(2012) for analyzing the TAP equations in spin glass theory.
• We obtain corollaries that fully justify semirigorous derivations in prior works and strengthen previous results in the different strands of research mentioned above. In the next section we highlight the most important corollaries and discuss others briefly, leaving their details to the appendix.
By standard architecture we mean any DNN architecture that is some composition of multilayer perceptrons (MLPs), recurrent neural networks (RNNs) (e.g., Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) or Gated Recurrent Unit (GRU) (Cho et al., 2014)), skip connections (He et al., 2016; Huang et al., 2016), (self-)attention (Bahdanau et al., 2014; Vaswani et al., 2017), convolution (LeCun et al., 1998; 1999), and/or batch normalization (batchnorm) (Ioffe & Szegedy, 2015). We use readout layer to mean any linear layer converting some hidden states to an output vector. While

1Microsoft Research AI. Correspondence to: [email protected].

most of our corollaries are stated for standard architectures, they are typically more general, but we just highlight the most relevant cases for a deep learning audience.

2. Related Works and Our Corollaries

We formulate informal versions of our main corollaries and comment on other results inline, marked by a star (★).

2.1. Gaussian Behavior of Wide Neural Networks In 1995, Neal first discovered the Gaussian Process behavior of wide neural networks. He showed that under certain conditions, a single-layer neural network with random parameters can converge in distribution to a Gaussian process as its width goes to infinity. Later works extended the conditions under which this scaling limit takes place (Williams, 1997; Le Roux & Bengio, 2007; Hazan & Jaakkola, 2015). Recently, Lee et al.(2018); Matthews et al.(2018) empirically and/or theoretically investigated analogous correspondences for infinite width, finite depth deep MLPs, and likewise Novak et al.(2018), for deep convolution networks. Daniely et al.(2016) also proved similar results in the framework of kernel methods, where they introduced a notion of “computational skeleton,” similar to tensor programs introduced here, that covers feedforward computation with no weight-sharing (so that, for example, it can express locally connected networks but not convolutional networks)1. Many previous works have exploited this DNN-GP correspondence implicitly or explicitly to build new models (Cho & Saul, 2009; Lawrence & Moore, 2007; Damianou & Lawrence, 2013; Wilson et al., 2016b;a; Bradshaw et al., 2017; van der Wilk et al., 2017; Kumar et al., 2018; Blomqvist et al., 2018; Borovykh, 2018). In particular, Lee et al.(2018); Garriga-Alonso et al.(2018); Novak et al.(2018) directly converted DNN to GP using this correspondence. Lee et al.(2018) constructed the state-of-the-art (SOTA) permutation-invariant GP on MNIST, and Novak et al.(2018) achieved SOTA on CIFAR10 for any GP with untrainable kernel. In this paper, we generalize the DNN-GP correspondence to standard architectures and very general nonlinearities.

Corollary 2.1 (DNN-GP correspondence, informal). Let f be a network of fixed standard architecture, with linear readout layer, and with nonlinearities bounded uniformly by $\exp(O(x^{2-\epsilon}))$ for some $\epsilon > 0$. Fix a finite input set $\mathcal{X}$ of the right signature (e.g. set of batches for a batchnorm network; set of sequences for an RNN). Sampling f's parameters from iid Gaussians induces a distribution of functions on $\mathcal{X}$. If the readout layer weights are sampled independently from hidden parameters, then this distribution weakly converges to a Gaussian process as the network widths go to infinity (with fixed input and output dimensions). See Appendices D.1 and D.4.
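As a concrete illustration of this correspondence (our own sketch, not the constructions of the appendices), the following compares the output covariance of a wide one-hidden-layer tanh network at two inputs, resampled over iid Gaussian parameters, against the limiting GP kernel; the network, its parametrization, and all names here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def readout_samples(x1, x2, width, reps=2000, phi=np.tanh):
    """Scalar readout v^T phi(W x) at two inputs, resampled over iid Gaussian
    parameters, with the readout v independent of the hidden weights W."""
    d = x1.size
    outs = np.empty((reps, 2))
    for r in range(reps):
        W = rng.normal(0.0, 1.0 / np.sqrt(d), (width, d))
        v = rng.normal(0.0, 1.0 / np.sqrt(width), width)
        outs[r] = [v @ phi(W @ x1), v @ phi(W @ x2)]
    return outs

d = 3
x1, x2 = rng.normal(size=d), rng.normal(size=d)
print(np.cov(readout_samples(x1, x2, width=2000).T))   # empirical 2x2 covariance

# limiting GP kernel: E[phi(z_a) phi(z_b)] with (z_1, z_2) ~ N(0, [x_a . x_b / d])
K0 = np.array([[x1 @ x1, x1 @ x2], [x1 @ x2, x2 @ x2]]) / d
z = rng.multivariate_normal(np.zeros(2), K0, size=10**6)
print(np.tanh(z).T @ np.tanh(z) / len(z))               # limiting 2x2 kernel
```

The two printed matrices should agree up to Monte Carlo noise, and the agreement tightens as the width grows.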

In contrast, Matthews et al.(2018) requires φ to be linearly bounded in norm; Daniely et al.(2016) requires φ to be twice-differentiable with $|\phi|, |\phi'|, |\phi''|$ all bounded, or that φ = ReLU; and a sufficient condition given in Novak et al.(2018) is that $\phi'$ exists and is bounded by $\exp(O(x^{2-\epsilon}))$, though it is unclear how the more general set of 3 conditions given there compares with ours. We hope this corollary opens the door to the design of more powerful GPs, in the way of Lee et al.(2018); Novak et al.(2018), by converting state-of-the-art DNNs. 2

2.2. Signal Propagation in Neural Networks Glorot & Bengio(2010); He et al.(2015) derived the popular Glorot and He initializations from consideration of hidden state norms in a DNN with random weights. A recent line of work generalizes their studies significantly by examining the evolution with depth of the covariance between $f(x^i), f(x^j)$ and between $\nabla_x f(x^i), \nabla_x f(x^j)$ for distinct inputs $x^i$ and $x^j$, when the DNN f is wide and the parameters of f are randomized. This evolution is referred to as (forward and backward) signal propagation in the literature (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; 2018; Hanin & Rolnick, 2018; Chen et al., 2018; Yang et al., 2018; Pennington et al., 2017). It has been used to optimize initialization hyperparameters to prevent gradient explosion/vanishing, even to allow training of a 10,000-layer CNN without batchnorm or skip connections (Xiao et al., 2018).

1even though they claim that dealing with weight-tying is straightforward. It’s unclear what they had in mind, however, as there is a significant difference in the scaling behavior of sharing matrix transposes vs sharing no matrix transposes (see Thm 5.1 and Thm 6.3) 2For a gentler introduction to these GP results and several extensions, we recommend the reader to look at Yang(2019).

Suppose $\{x^i\}_i$ is a set of inputs. Let $f^l$ be an $l$-layer MLP with activation φ and uniform width n. If $\Sigma^l_{ij} = \frac{1}{n}\,\mathbb{E}\langle f^l(x^i), f^l(x^j)\rangle$ and $\Pi^l_{ij} = \frac{1}{n}\,\mathbb{E}\langle \nabla_x f^l(x^i), \nabla_x f^l(x^j)\rangle$, with expectation taken over $W^\ell_{ij} \sim \mathcal{N}(0, \sigma_w^2/n)$ and $b^\ell_i \sim \mathcal{N}(0, \sigma_b^2)$ for each layer $\ell$, then the signal propagation literature posits that, in the $n \to \infty$ limit, the dynamics of $\Sigma^l$ and $\Pi^l$ are summarized by

$$\Sigma^l = \sigma_w^2\, \mathbb{E}[\phi(z)\phi(z)^\top : z \sim \mathcal{N}(0, \Sigma^{l-1})] + \sigma_b^2 \qquad (1)$$
$$\Pi^l = \sigma_w^2\, \mathbb{E}[\phi'(z)\phi'(z)^\top : z \sim \mathcal{N}(0, \Sigma^{l-1})] \odot \Pi^{l-1}. \qquad (2)$$

Note that $\Sigma^l$ is essentially the kernel of the corresponding GP, and Eq. (1) is the same one used in the DNN-GP correspondence. Pennington et al.(2017) more generally computed the singular value distribution of the input-output Jacobian matrix of an MLP and characterized conditions under which this distribution concentrates around 1. To make this computation and to derive Eq. (2), they and others (Schoenholz et al., 2017; Yang & Schoenholz, 2017; 2018; Chen et al., 2018; Xiao et al., 2018; Pennington et al., 2017) relied on

Assumption 2.2 (Gradient Independence Assumption). In backpropagation, whenever we multiply by $W^\top$ for some weight matrix W, we multiply by an iid copy instead.

This assumption was first discovered by Schoenholz et al.(2017) to make good predictions and later formulated explicitly by Yang & Schoenholz(2017). In this paper we show

Corollary 2.3 (Assm 2.2 is conditionally correct, informal). In an MLP having nonlinearities with polynomially bounded weak derivatives, Assm 2.2 leads to the correct equation Eq. (2) and the correct singular value distribution computation from Pennington et al.(2017), as long as the readout layer is sampled independently from other parameters and has mean 0. In general, Assm 2.2 does not induce correct computations – for example when the last layer is global mean pooling – and we rigorously give the correct equations, and more generally a way to compute the singular value distribution of the neural network Jacobian, both generalized to all standard architectures without batchnorm. See Appendices D.1 and D.5.
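To make the recursion concrete, here is a minimal sketch (ours, not from the paper) that iterates Eqs. (1) and (2) for a single input and a ReLU activation, in which case $\Sigma$ and $\Pi$ are scalars and the Gaussian expectations have closed forms; the hyperparameter names and defaults are our own.

```python
import numpy as np

def signal_prop(sigma_w=np.sqrt(2.0), sigma_b=0.0, depth=50):
    """Iterate Eqs. (1) and (2) for a single input with ReLU, using the closed forms
    E[relu(z)^2] = Sigma/2 and E[relu'(z)^2] = 1/2 for z ~ N(0, Sigma)."""
    Sigma, Pi = 1.0, 1.0
    for _ in range(depth):
        Sigma = sigma_w**2 * (Sigma / 2.0) + sigma_b**2   # Eq. (1)
        Pi = sigma_w**2 * 0.5 * Pi                        # Eq. (2)
    return Sigma, Pi

# He initialization (sigma_w^2 = 2) keeps both forward and backward signals constant:
print(signal_prop())                # (1.0, 1.0)
print(signal_prop(sigma_w=1.0))     # both decay toward 0 with depth
```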

As an example, we computed the scaling limit for the gradient norms of an LSTM and compared it against empirical simulation (Fig. 1). The theoretical prediction is very precise already at 1000 neurons, which is a typical width in applications of LSTMs. Note that this literature also studies the limit of iterating Eqs. (1) and (2) (the large depth limit), but our results only apply to a fixed number of iterations, and so do not rigorously justify such limits. Chen et al.(2018) estimates the signal propagation in tied-weights RNNs with the equations for that in untied-weights RNNs. They find this a fantastic approximation for simple RNNs but not quite so for gated RNNs. ★ As a corollary of Thm 4.3 we show that, indeed, the tied- and untied-weights theories agree for simple RNNs, but not for general (say, gated) RNNs. We give the simplest counterexample of a weight-tied residual network. See Appendix D.6.

Recently, Li & Nguyen(2018) investigated (forward only) signal propagation in weight-tied autoencoders. ★ A version of their main theorem allowing for arbitrary polynomially bounded activations, without restriction on smoothness, also follows as a corollary of Thm 6.3. See Appendix D.6. We hope Cor 2.3 will allow future works to optimize initialization hyperparameters and prevent gradient explosion/vanishing problems for modern architectures in the way of Schoenholz et al.(2017); Yang & Schoenholz(2017; 2018); Chen et al.(2018); Xiao et al.(2018); Yang et al.(2018).

2.3. Neural Tangent Kernel

For any parametrized function f(x; θ), the Neural Tangent Kernel can in general be defined as $K_\theta(x, x') = \langle \nabla_\theta f(x; \theta), \nabla_\theta f(x'; \theta)\rangle$ (Jacot et al., 2018). In the case when f(x; θ) is a feedforward neural network, with parameters appropriately scaled (see Appendix D.1), there is a scaling limit $K_\theta \to K_\infty$ when θ is randomized and f's widths grow to infinity (Jacot et al., 2018). This convergence allows one to predict the evolution of f(x; θ) due to gradient descent on θ. For example, if we apply gradient flow on a training set $\mathcal{X}$ and the loss $\frac{1}{|\mathcal{X}|}\sum_{(x,y)\in\mathcal{X}} \frac{1}{2}(f(x) - y)^2$, for codomain(f) = $\mathbb{R}$, Jacot et al.(2018) derived
$$\frac{\partial f_t}{\partial t} = -\frac{1}{|\mathcal{X}|}\, K_{\theta_t}(\mathcal{X}, \mathcal{X})\,(f_t - f^*)$$

[Figure 1: log-scale plot of LSTM gradient norms against time t = 80 to 100, with the empirical1/empirical2 measurements and the theoretical1/theoretical2 predictions overlaid.]

Figure 1: For two different random initializations of the LSTM with 1000 neurons along with a readout layer (at each time step), we run 100 steps forward on inputs of zero vectors and 20 steps of backpropagation-through-time. We collect the gradient norms $\|\partial y^{100}/\partial h^t\|$, $t = 80, \ldots, 100$, where y is the readout and h is the hidden state, and plot their means and standard deviations (empirical1 and empirical2). Then we compute the large width limit of these gradient norms according to Thm 5.1 and overlay them on top (theoretical1 and theoretical2). The agreement is so precise that the theoretical curves completely obscure the means of the empirical data.

where $f^*$ is the "ground truth" function that sends $x \mapsto y$ for every $(x, y) \in \mathcal{X}$, and f and $f^*$ are thought of as $|\mathcal{X}|$-dimensional vectors. Jacot et al.(2018) proved that under suitable conditions, with training time T fixed and width → ∞,

$K_{\theta_t}(\mathcal{X}, \mathcal{X}) \to K_\infty(\mathcal{X}, \mathcal{X})$ for all $0 \le t \le T$. This means that, in the large width regime, f (in function space) evolves approximately according to a linear differential equation under gradient flow. In this paper we show

Corollary 2.4 (NTK convergence, informal). Fix a finite input set $\mathcal{X}$. Let f be a network of fixed standard architecture, with linear readout layer, and having nonlinearities with polynomially bounded weak derivatives (so in particular it cannot have batchnorm). Then over $\mathcal{X}$, $K_\theta \to K_\infty$ almost surely as the widths of f go to infinity and θ is suitably randomized, for some $K_\infty$. See Appendices D.1 and D.7.

While Jacot et al.(2018) is groundbreaking in producing an equation to predict the behavior of gradient descent in the small learning rate, large width regime, its proof of the convergence $K_\theta \to K_\infty$ relies on taking the widths to infinity one by one starting from the first layer3. This is an unrealistic limit, as in practice the widths are typically of the same order of magnitude4. In comparison, Cor 2.4 proves that the limit exists as the widths tend to infinity together, and it generalizes to arbitrary standard architectures. ★ We give an example computation of the NTK for a CNN in Appendix D.7; this is a new result that has not appeared in prior literature. Amari et al.(2018); Karakida et al.(2018) recently used Assm 2.2 to study the spectral properties of the empirical Fisher information matrix (FIM), over finitely many datapoints drawn from an isotropic Gaussian, of random neural networks. If we let $J_f$ be the $|\mathcal{X}| \times |\theta|$ matrix whose rows are $\{\nabla_\theta f(x; \theta)\}_{x \in \mathcal{X}}$, then the (empirical) FIM $\propto J_f^\top J_f$ while the NTK is $J_f J_f^\top$.5 Thus the spectral properties of the empirical FIM and the NTK are identical up to scaling. ★ By Cor 2.4, we can then justify the computations of Amari et al.(2018); Karakida et al.(2018) rigorously.
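As a quick illustration of the kernel $K_\theta(x, x') = \langle\nabla_\theta f(x;\theta), \nabla_\theta f(x';\theta)\rangle$ and of its concentration as width grows, here is a minimal sketch (ours, assuming PyTorch is available; the two-layer network, its NTK-style scaling, and all names are illustrative choices, not the paper's code):

```python
import torch

torch.manual_seed(0)

def make_mlp(d_in, width):
    # NTK-style parametrization: weights ~ N(0, 1), forward pass divides by sqrt(fan-in)
    W1 = torch.randn(width, d_in, requires_grad=True)
    W2 = torch.randn(1, width, requires_grad=True)
    def f(x):
        h = torch.relu(W1 @ x / d_in ** 0.5)
        return (W2 @ h / width ** 0.5).squeeze()
    return f, [W1, W2]

def empirical_ntk(f, params, x1, x2):
    """K_theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)> for a scalar-output f."""
    def flat_grad(x):
        grads = torch.autograd.grad(f(x), params)
        return torch.cat([g.reshape(-1) for g in grads])
    return torch.dot(flat_grad(x1), flat_grad(x2)).item()

d = 10
x1, x2 = torch.randn(d), torch.randn(d)
for width in [100, 1000, 10000]:      # fluctuations shrink as the width grows
    f, params = make_mlp(d, width)
    print(width, empirical_ntk(f, params, x1, x2))
```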

2.4. Other Works Very recently, Du et al.(2018b;a); Allen-Zhu et al.(2018b;c); Zou et al.(2018) formally proved that GD or SGD can reduce an overparametrized DNN’s training error to 0 by showing that random initialization imbues the network with certain good

3An earlier version of this paper claimed that it assumes gradient independence; this is incorrect, as the sequential limit obviates the need for it 4Indeed, when the last layer is global mean pooling rather than random Gaussians, the sequential limit obtains a different answer than simultaneous limit 5Karakida et al.(2018) called NTK the dual matrix

properties6 and, with small learning rate, the network never moves too far from its initialization7. Allen-Zhu et al.(2018a); Cao & Gu(2019) also show generalization bounds under various data assumptions using a similar reasoning. There is a long line of work investigating random classical spiking or Hopfield networks, for example Landau & Sompolinsky (2018); Crisanti & Sompolinsky(2018); Kadmon & Sompolinsky(2015); Stern et al.(2014); Rajan et al.(2010); Sompolinsky et al.(1988); Amit et al.(1985). In the reinforcement learning literature, Osband et al.(2018); Burda et al.(2018a;b) used random DNNs for exploration. Other than the works discussed above, Li & Saad(2018); Giryes et al.(2016); Gabri et al. (2018); Reeves(2017); Fletcher & Rangan(2017) also considered neural networks with random weights.

★ Our technique is general enough to rederive the semicircle law for the Gaussian Orthogonal Ensemble and the Marchenko-Pastur Law for Wishart matrices (Tao, 2012). See Appendices D.2 and D.3. Approximate Message Passing is an algorithm for recovering a ground truth vector from noisy measurements (compressed sensing) (Donoho et al., 2009). In one view, the algorithm repeatedly applies a certain neural network to the noisy measurement, and it succeeds if the result eventually converges to the ground truth vector. Previous works have shown that when the measurement matrix is randomized and the dimension goes to infinity, this algorithm satisfies a set of equations called State Evolution that can be used to reason about its behavior (Bayati & Montanari, 2011; Berthier et al., 2017). Their proofs are based on the same Gaussian conditioning technique used here. ★ In Appendix D.8, we detail the algorithm and State Evolution, and prove the validity of the State Evolution equations for arbitrary polynomially bounded nonlinearities and test functions, removing the smoothness assumption of Bayati & Montanari(2011) (in exchange for a stronger moment condition on the measurements). This concludes the discussion of related works and our corollaries. We now present the tensor program framework and our main theorems.

3. Tensor Programs Consider programs of the following form, which we call tensor programs. Each line contains an assignment and a dimension annotation and can have the following types.

VecIn (G) a vector input x:
$l : g^l := x \in \mathbb{R}^{n^l}$

MatIn (A) a matrix input A:
$l : A^l := A \in \mathbb{R}^{n_1^l \times n_2^l}$

T (A) transpose of an A-var:
$l : A^l := (A^j)^\top \in \mathbb{R}^{n_1^l \times n_2^l} = \mathbb{R}^{n_2^j \times n_1^j}$

MatMul (G) if $A^k$ and $g^j$ have $n_2^k = n^j$, then an assignment via a linear mapping:
$l : g^l := A^k g^j \in \mathbb{R}^{n^l} = \mathbb{R}^{n_1^k}$
or similarly for H-vars
$l : g^l := A^k h^j \in \mathbb{R}^{n^l} = \mathbb{R}^{n_1^k}$
where $j, k < l$

LinComb (G) if $n^{j_1} = \cdots = n^{j_k}$, then an assignment via a linear combination of G-vars that appeared in previous lines, with $a^l_{j_i} \in \mathbb{R}$:
$l : g^l := a^l_{j_1} g^{j_1} + \cdots + a^l_{j_k} g^{j_k} \in \mathbb{R}^{n^l} = \mathbb{R}^{n^{j_1}}$

Nonlin (H) if $n^{j_1} = \cdots = n^{j_k}$, then an assignment via some general (possibly nonlinear) function $f^l : \mathbb{R}^k \to \mathbb{R}$, acting coordinatewise:
$l : h^l := f^l(g^{j_1}, \ldots, g^{j_k}) \in \mathbb{R}^{n^l} = \mathbb{R}^{n^{j_1}}$

6using tools similar to ones in the signal propagation literature 7using reasoning similar to Jacot et al.(2018)

Here (G) marks those variables that we call G-vars, and similarly we have A-vars and H-vars. Vars introduced by VecIn and MatIn are also called (vector and matrix) input vars. The initial "l :" marks the line number, and each new variable formed from this line is labeled with a superscript l. A partial program with the $n^l$ and the input G- and A-vars unspecified is called a (program) skeleton, typically denoted by Greek letters like π. This skeleton can be thought of as a generalization of the skeleton in Daniely et al.(2016), in the language of a straightline program that allows weight sharing (transposed or not) and simple type checking.

3.1. Examples Such tensor programs can express the computation in most neural network scenarios. In Appendix B, we give example programs for computations in (1) MLP, forward and backward passes (B.1); (2) batch of input (B.2); (3) residual networks (B.3); (4) simple RNN (B.4); (5) batchnorm (B.5); (6) CNNs (B.6). It’s also clear from these examples that any combination of such computation can be expressed faithfully in a tensor program. On the other hand, tensor programs don’t capture all neural network computations, and one example is layer normalization (Ba et al., 2016), but see Section 8 for how to still compute its scaling limit in this framework.
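As a concrete, entirely illustrative example, the following sketch encodes the body of a small MLP as a transpose-free tensor program and evaluates it line by line; the encoding, names, and interpreter are our own and are not the programs of Appendix B. Following Section 3.2, the first-layer preactivation $W^1 x$ is modeled as a Gaussian input G-var, since the network input x has fixed dimension, while the second-layer weight enters as an input A-var.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # a single common dimension for simplicity

# A 2-hidden-layer MLP body  tanh(W2 tanh(W1 x))  written as tensor program lines.
program = [
    ("MatIn",  rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))),  # 1: A^1 := W2
    ("VecIn",  rng.normal(0.0, 1.0, n)),                     # 2: g^2 := W1 x
    ("Nonlin", (np.tanh, [2])),                              # 3: h^3 := tanh(g^2)
    ("MatMul", (1, 3)),                                      # 4: g^4 := A^1 h^3
    ("Nonlin", (np.tanh, [4])),                              # 5: h^5 := tanh(g^4)
]

def run(program):
    """Evaluate a (transpose-free) tensor program line by line."""
    env = {}
    for l, (kind, payload) in enumerate(program, start=1):
        if kind in ("MatIn", "VecIn"):
            env[l] = payload
        elif kind == "MatMul":
            a, j = payload
            env[l] = env[a] @ env[j]
        elif kind == "Nonlin":
            f, args = payload
            env[l] = f(*[env[j] for j in args])
        else:
            raise ValueError(kind)
    return env

env = run(program)
print(env[5][:5])  # a few coordinates of the final H-var
```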

3.2. Setup Lines of type T, MatMul, LinComb, and Nonlin induce equality constraints on the dimensions $n^l$. Given a skeleton π and a possible set of additional dimensional constraints $\Lambda \subseteq \{``n^l = n^m"\}_{g^l, g^m}$, consider the smallest equivalence relation ∼ on G-vars such that $g^l \sim g^m$ if "$n^l = n^m$" ∈ Λ or if they are constrained to have equal dimension by some line of type T, MatMul, LinComb, or Nonlin. We call each class a common dimension class (CDC) of (π, Λ) and write $c(g^l)$ for the class of a G-var $g^l$. The collection of all common dimension classes is written as $\mathcal{C}_{(\pi,\Lambda)}$, or just $\mathcal{C}$ when π and Λ are understood from context. An algorithm to compute CDCs is presented in Appendix A. In this work, for every skeleton π (equipped with Λ), we study the behavior of vars in its realizations when the input vars are appropriately randomized and as the dimensions $n^l \to \infty$. More precisely, we consider a sequence (in $t \in \mathbb{N}$) of dimensions $\{n^{lt}\}_{g^l \text{ or } h^l} \cup \{n_1^{lt}, n_2^{lt}\}_{A^l}$ respecting ∼, along with input G- and A-vars $g^{lt}, A^{lt}$ of appropriate dimensions. For each $c \in \mathcal{C}$, let $n^{ct} = n^{lt}$ for $g^l \in c$. We extend the notations $g^{lt}$ and $h^{lt}$ to the non-input G- and H-vars computed from these inputs. At time t, we sample independently $A^{lt}_{ij} \sim \mathcal{N}(0, (\sigma^{lt})^2/n_2^{lt})$ for a set $\{\sigma^{lt}\}_{A^l}$.8 For each $c \in \mathcal{C}$, we also sample independently $g^{c_{\mathrm{in}}t}_i \sim \mathcal{N}(\mu^{c_{\mathrm{in}}t}, K^{c_{\mathrm{in}}t})$ for each i. Here $c_{\mathrm{in}}$ is the set of input G-vars in c, $g^{c_{\mathrm{in}}t}_i = (g^{lt}_i)_{g^l \in c_{\mathrm{in}}}$, and $\mu^{c_{\mathrm{in}}t} : c_{\mathrm{in}} \to \mathbb{R}$, $K^{c_{\mathrm{in}}t} : c_{\mathrm{in}} \times c_{\mathrm{in}} \to \mathbb{R}$ are the specified mean and covariance at time t.

Thus given (π, Λ), the data $\{n^{ct}\}_{c \in \mathcal{C}}$, $\{\sigma^{lt}\}_{A^l}$, $\{\mu^{c_{\mathrm{in}}t}\}_{c \in \mathcal{C}}$, and $\{K^{c_{\mathrm{in}}t}\}_{c \in \mathcal{C}}$ realize a random program $\pi(\{n^{ct}\}_{c\in\mathcal{C}}, \{\sigma^{lt}\}_{A^l}, \{\mu^{c_{\mathrm{in}}t}\}_{c\in\mathcal{C}}, \{K^{c_{\mathrm{in}}t}\}_{c\in\mathcal{C}})$. Its vars are random variables and our theorems will concern certain "moments" of them. We assume that, as $t \to \infty$, for all $c, c' \in \mathcal{C}$: (1) $n^{ct}$ is increasing with t and $n^{ct} \to \infty$. (2) $\lim_{t\to\infty} n^{ct}/n^{c't} = \alpha_{c,c'} \in (0, \infty)$ for some constant $\alpha_{c,c'}$ depending only on $c, c'$. (3) $\sigma^{lt} \to \sigma^{l\infty}$ for some finite $\sigma^{l\infty} > 0$ for each input A-var $A^l$. (4) $\mu^{c_{\mathrm{in}}t} \to \mu^{c_{\mathrm{in}}\infty}$ and $K^{c_{\mathrm{in}}t} \to K^{c_{\mathrm{in}}\infty}$ for some finite $\mu^{c_{\mathrm{in}}\infty}, K^{c_{\mathrm{in}}\infty}$, and $\operatorname{rank} K^{c_{\mathrm{in}}t} = \operatorname{rank} K^{c_{\mathrm{in}}\infty}$ for all large t.

Discussion Tensor programs are meant to represent the “body” of a neural network where all dimensions are large compared to input and output dimensions. The CDCs are used to capture the varying widths of practical neural networks; for example, while widths typically increase and decrease in classical networks, they are held constant in blocks in residual networks (see Appendix B.3 for an example). For the first read-through, we recommend the reader to assume all dimensions are the same so that there is a single CDC consisting of all G-vars. The sampling of A-vars reflects variants of Glorot initialization (Glorot & Bengio, 2010) used in practice. The sampling of input G-vars models the distribution of the first hidden layer across multiple inputs, sampling of the first layer parameters (see Appendix B.1 for an example), and/or sampling of bias vectors. Most often, the vector vars should be thought of as hidden layer quantities whose dimensions go to infinity; neural network inputs (of fixed dimension) are indirectly expressed as above, and outputs (of fixed dimension) are obtained as some coordinates of a vector var.

8We could as well assume that there is an infinite 2D array of independent Gaussian variables $\{\mathring{A}^l_{ij} \sim \mathcal{N}(0, 1)\}_{i,j=1}^\infty$, and at time t, set $A^{lt}_{ij} = \sigma^{lt} \mathring{A}^l_{ij} / \sqrt{n_2^{lt}}$. In that case, we do not need $n^{ct}$ to increase strictly with t.

Thms 4.3 and 5.1 below say that, under certain conditions, G-vars converge to Gaussians of specific means and covariances (hence the name "G-var"). But Thm 6.3 shows that in general this may not be true.

Notation We will often identify functions $Z : A \to B$ with vectors in $B^A$ (which should be thought of as dictionaries with keys in A). Given a subset $U \subseteq A$, $Z_U$ is the subvector of Z supported on U, or, as a function, the restriction of Z to U. For $\psi : B^U \to C$, $\psi(Z) \stackrel{\text{def}}{=} \psi(Z_U)$, i.e. we automatically ignore the values of Z outside U. We use $\xrightarrow{\text{a.s.}}$ for almost sure convergence.

4. Programs with No Transposes

For any $c \in \mathcal{C}$, we recursively define
$$\mu^c(g^l) = \begin{cases} \mu^{c_{\mathrm{in}}\infty}(g^l) & \text{if } g^l \in c_{\mathrm{in}} \\ \sum_i a^l_{j_i}\, \mu^c(g^{j_i}) & \text{if } g^l := \sum_i a^l_{j_i} g^{j_i} \\ 0 & \text{if } g^l := A^k g^j \text{ or } g^l := A^k h^j \end{cases} \qquad (3)$$
and recursively define
$$K^c(g^l, g^m) = \begin{cases} K^{c_{\mathrm{in}}\infty}(g^l, g^m) & \text{if } g^l, g^m \in c_{\mathrm{in}} \\ \sum_i a^m_{j_i}\, K^c(g^l, g^{j_i}) & \text{if } g^m := \sum_i a^m_{j_i} g^{j_i} \\ \sum_i a^l_{j_i}\, K^c(g^{j_i}, g^m) & \text{if } g^l := \sum_i a^l_{j_i} g^{j_i} \\ (\sigma^{k\infty})^2\, \mathbb{E}_z[f^a(z) f^b(z)] & \text{if } g^l := A^k h^a,\ g^m := A^k h^b \\ 0 & \text{else} \end{cases} \qquad (4)$$
where $h^a := f^a(g^{j_1}, \ldots, g^{j_k})$, $h^b := f^b(g^{j'_1}, \ldots, g^{j'_{k'}})$, and $z \sim \mathcal{N}(\mu^c, K^c)$. We also make branch 4 cover the case when $g^l := A^k g^a$ or $g^m := A^k g^b$ by "typecasting" $g^a$ to an H-var and setting $f^a = \mathrm{id}$ (similarly for $g^b$). Note that, as discussed in Notation above, $f^a$ will ignore irrelevant components of z, and the expectations only depend on the entries of $\mu^c$ and $K^c$ that correspond to already-defined values, so this describes a valid recursion. We introduce the following technical assumption.

Assumption 4.1 (Almost Sure Rank Convergence). For any $A^k$ and any collection $S \subseteq \{\text{G- or H-var } h : g^l := A^k h \text{ for some } l\}$, let $H^t \in \mathbb{R}^{n^{ct} \times |S|}$ be the matrix whose columns are $h^{mt}$ or $g^{mt}$ for each $h^m$ or $g^m$ in S. If $\frac{1}{n^{ct}} H^{t\top} H^t \in \mathbb{R}^{|S| \times |S|}$ converges almost surely to some $C^*$ as $t \to \infty$, then almost surely $\operatorname{rank} H^t = \operatorname{rank} C^*$ for all large t.

If we don't have lines of type LinComb, and no $f^l$ is a polynomial, then the $C^*$s are all full rank, implying rank convergence by the upper semicontinuity of rank. LinComb lines may add linear dependencies, but they are constant with t, so that $\operatorname{rank} H^t = \operatorname{rank} C^*$ in the limit and we still have rank convergence.

Definition 4.2. For α > 0, a function $\phi : \mathbb{R}^k \to \mathbb{R}$ is said to be α-controlled if for some C, c > 0, $|\phi(x)| \le e^{C\sum_{i=1}^k |x_i|^\alpha + c}$ for all $x \in \mathbb{R}^k$.

Theorem 4.3. Consider dimension constraints Λ and a skeleton π without T lines, i.e. no transpose allowed. Suppose all $f^l$ are α-controlled for some α < 2. Sample all input vars as in Section 3.2 and assume almost sure rank convergence. Then for any $c \in \mathcal{C}$ and any α-controlled function $\phi : \mathbb{R}^c \to \mathbb{R}$, α < 2,

$$\frac{1}{n^{ct}} \sum_{i=1}^{n^{ct}} \phi(g^{ct}_i) \xrightarrow{\text{a.s.}} \mathbb{E}\, \phi(Z)$$

where $g^{ct}_i = (g^{lt}_i)_{g^l \in c}$ and $\mathbb{R}^c \ni Z = (Z^g)_{g \in c} \sim \mathcal{N}(\mu^c, K^c)$.
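As a quick numerical sanity check of Thm 4.3 (our own sketch, not part of the paper), the following compares the empirical average on the left-hand side with the limit computed from the recursion (4), for the tiny program $h^1 := \tanh(g^1)$, $g^2 := A h^1$ and the test function $\phi(z) = z^2$:

```python
import numpy as np

rng = np.random.default_rng(1)

def lhs(n):
    """Empirical average (1/n) sum_i phi(g2_i) for g2 := A h1, h1 := tanh(g1),
    with input G-var g1 ~ N(0, 1), sigma = 1, and test function phi(z) = z^2."""
    g1 = rng.normal(size=n)
    A = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
    g2 = A @ np.tanh(g1)
    return np.mean(g2 ** 2)

def rhs(samples=10**6):
    """Limit E[Z^2] = K^c(g2, g2) = sigma^2 E[tanh(z)^2], z ~ N(0, 1), from Eq. (4)."""
    z = rng.normal(size=samples)
    return np.mean(np.tanh(z) ** 2)

for n in [100, 1000, 4000]:
    print(n, lhs(n), rhs())
```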

Discussion Roughly speaking, G-vars created from the same matrix $A^k$ have nonzero correlations, but otherwise are asymptotically independent modulo LinComb. Intuitively, $g^{ct}_i$ "$\stackrel{d}{=}$" $\mathcal{N}(\mu^c, K^c)$ for large t, iid for each i.

Figure 2: (a) Program π: lines 1 through $L_A$ are the matrix inputs $A^1 := A^1, \ldots, A^{L_A} := A^{L_A}$; lines $L_A + 1$ through $L_A + L_g$ are the vector inputs $g^{L_A+1} := x^1, \ldots, g^{L_A+L_g} := x^{L_g}$; the remaining lines, up to the last line L, are non-input line types. (b) Extended program π̃: after the L lines of π, lines $L+1$ through $L+L_A$ are the transposes $A^{L+1} := (A^1)^\top, \ldots, A^{L+L_A} := (A^{L_A})^\top$; lines $L+L_A+1$ through $L+L_A+L_\nabla$ are new vector inputs $g^{L+L_A+1} := v^1, \ldots, g^{L+L_A+L_\nabla} := v^{L_\nabla}$; further non-input lines follow.

There is an apparent contradiction in Thm 4.3 if we consider deep linear networks with tied weights ($y = W^L x \in \mathbb{R}^n$, $W \in \mathbb{R}^{n\times n}$) and untied weights ($y = \prod_{l=1}^L W^{(l)} x$, $\forall l\,[W^{(l)} \in \mathbb{R}^{n\times n}]$). Via simple computations of $\mu^c$ and $K^c$, one sees that, by Thm 4.3, as the width $n \to \infty$, y is distributed "similarly" in either case (in that α-controlled moments match asymptotically). This seems to contradict our intuition that $W^L x$ should blow up or decay exponentially, with L, along the direction of the eigenvector of W corresponding to the largest eigenvalue of W; whereas in the untied case it's easy to see that each $y_i$ converges in distribution to an i.i.d. Gaussian. This apparent paradox is resolved by noting that Thm 4.3 only applies for fixed skeletons (so fixed L in this example), as widths → ∞. By Rider(2003), the maximum eigenvalue of W scales like $1 + O(n^{-1/2})$ if $W_{ij} \sim \mathcal{N}(0, 1/n)$, and so does that of $W^L$ for fixed L. Furthermore, as n increases, the components of x corresponding to large eigenvalues (≥ 1) of W decrease in magnitude to 0 in probability, by the circular law (Tao, 2012). So the L at which the exponentiating effect of $W^L$ kicks in increases with n.
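The fixed-L, growing-n regime is easy to probe numerically; the following sketch (ours, purely illustrative) compares $\|W^L x\|^2 / n$ for tied weights against fresh weights per layer, and both hover around the same limit predicted by Thm 4.3 once n is large:

```python
import numpy as np

rng = np.random.default_rng(2)

def deep_linear_norm(n, L=5, tied=True):
    """||W^L x||^2 / n with W_ij ~ N(0, 1/n), either one tied W or a fresh W per layer."""
    x = rng.normal(size=n)                        # isotropic Gaussian input
    W = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
    h = x
    for _ in range(L):
        if not tied:
            W = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
        h = W @ h
    return h @ h / n

for n in [100, 1000, 4000]:
    print(n, deep_linear_norm(n, tied=True), deep_linear_norm(n, tied=False))
```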

5. Backprop with Zero Mean Gradients

Let π be a skeleton with L lines but no T. WLOG, suppose all input vars appear at the start, with matrix inputs first, as in Fig. 2a. Consider an extension π̃ of π in the following way: the first few appended lines are transposes of A-vars in π, followed by a series of new vector input vars $\{v^l\}_{l=1}^{L_\nabla}$, as in Fig. 2b. Lines appended below this can be arbitrary non-input lines except that (1) lines of type MatMul must use a transposed matrix $A^{L+1}$ to $A^{L+L_A}$, and $h^l$ or $g^l$ must have been introduced after π (i.e. $l > L$), and (2) any $h^l$ for $l > L + L_A + L_\nabla$, as a function of $v^1, \ldots, v^{L_\nabla}$, must be odd: for any fixed values of $\{g^l\}_{l \le L}$, $h^l(-v^1, \ldots, -v^{L_\nabla}, \{g^l\}_{l\le L}) = -h^l(v^1, \ldots, v^{L_\nabla}, \{g^l\}_{l\le L})$; likewise $g^l$ must be odd for $l > L + L_A$. This in particular means that LinComb lines cannot involve $g^l$ for $l \le L$. If π expresses the forward computation of an NN f without matrix transposes, then π̃ has enough power to express the backpropagation of f and compute the gradients with respect to hidden states. For example, if $f(x) = v^\top \rho(x)$, so that $\frac{\partial f}{\partial x} = v^\top \frac{\partial \rho}{\partial x}$, then $\frac{\partial f}{\partial x}$ is an odd function of v and can be expressed as a π̃ as above (see Appendix B.1 for a concrete example). In general, the multiple $\{v^i\}$ allow for multiple NN outputs. CDCs are naturally defined for π̃ (see Appendix A) just like before. We extend $\mu^c$ and $K^c$ to vars introduced in π̃: for $l, m > 0$ and $k \le L_A$, set $\mu^c(g^{L+l}) = 0$, and when $l > L$ or $m > L$, set
$$K^c(g^l, g^m) = \begin{cases} K^{c_{\mathrm{in}}\infty}(g^l, g^m) & \text{if } g^l, g^m \text{ are input vars} \\ \sum_i a^m_{j_i}\, K^c(g^l, g^{j_i}) & \text{if } g^m := \sum_i a^m_{j_i} g^{j_i} \\ \sum_i a^l_{j_i}\, K^c(g^{j_i}, g^m) & \text{if } g^l := \sum_i a^l_{j_i} g^{j_i} \\ (\sigma^{k\infty})^2 \alpha\, \mathbb{E}_z\, f^a(z) f^b(z) & \text{if } g^l := A^{L+k} h^a,\ g^m := A^{L+k} h^b \\ 0 & \text{else} \end{cases}$$
where $\alpha = \alpha_{c_1(A^k), c_2(A^k)} = \lim_{t\to\infty} \frac{n^{c_1(A^k)t}}{n^{c_2(A^k)t}}$, and branch 4 covers the case when $g^l := A^{L+k} g^a$ or $g^m := A^{L+k} g^b$ by taking $f^a$ or $f^b$ to be the identity. Note that covariances between vars of π and new vars in π̃ are 0.

Theorem 5.1. Sample $\{v^{it}\}_{i=1}^{L_\nabla}$ with zero mean (i.e. $\mu^{c_{\mathrm{in}}t}(g^{L+L_A+i}) \to 0$ for all $i \in [L_\nabla]$) and independently from the input vars $\{x^{lt}\}_{l=1}^{L_g}$ (i.e. $K^{c_{\mathrm{in}}t}(g^l, g^{l'}) = 0$ if $l > L \ge l'$)9. Sample all other vars in π̃ according to Section 3.2. Assume all $f^l$ of π̃ are polynomially bounded and π̃ satisfies almost sure rank convergence. Then for any dimension constraints Λ, any $c \in \mathcal{C}_{(\tilde\pi,\Lambda)}$, and any polynomially bounded function $\phi : \mathbb{R}^c \to \mathbb{R}$,

$$\frac{1}{n^{ct}} \sum_{i=1}^{n^{ct}} \phi(g^{ct}_i) \xrightarrow{\text{a.s.}} \mathbb{E}\, \phi(Z)$$

where $g^{ct}_i = (g^{lt}_i)_{g^l \in c}$ and $Z \sim \mathcal{N}(\mu^c, K^c)$. Note that our result here does not apply to batchnorm, whose Jacobian has a singularity on a 1-dimensional affine subspace (and in particular, at the origin). This theorem allows one to justify the gradient independence assumption rigorously; see Appendix D.5.
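To see the gradient independence assumption in action, the following sketch (ours, illustrative only) backpropagates a mean-zero readout through a wide ReLU MLP, once with the true transposed weights and once with iid copies as in Assm 2.2; in the regime covered by Cor 2.3 and Thm 5.1, both give essentially the same normalized gradient norm.

```python
import numpy as np

rng = np.random.default_rng(3)

def backprop_norm(n=1000, depth=8, sigma_w=np.sqrt(2.0), use_iid_copy=False):
    """||d||^2 / n for d = gradient of v . h^depth w.r.t. h^0, in a ReLU MLP of width n,
    where the readout v has zero mean and is independent of the weights."""
    h = rng.normal(size=n)
    layers = []
    for _ in range(depth):                     # forward pass
        W = rng.normal(0.0, sigma_w / np.sqrt(n), (n, n))
        z = W @ h
        layers.append((W, z))
        h = np.maximum(z, 0.0)
    d = rng.normal(size=n)                     # mean-zero readout vector v
    for W, z in reversed(layers):              # backward pass
        Wb = rng.normal(0.0, sigma_w / np.sqrt(n), (n, n)) if use_iid_copy else W
        d = Wb.T @ (d * (z > 0))
    return d @ d / n

print(backprop_norm(use_iid_copy=False), backprop_norm(use_iid_copy=True))
```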

6. General Tensor Programs

Thm 5.1 does not give the correct computation if the $\{v^{it}\}_i$ do not have zero mean: Consider a one-hidden-layer MLP with quadratic activation, $f(x) = \mathbf{1}^\top \phi(W x)$, $\phi(-) = \frac{1}{2}(-)^2$, $x \in \mathbb{R}^{n^0}$, $W \in \mathbb{R}^{n^1 \times n^0}$, $\mathbf{1} \in \mathbb{R}^{n^1}$. Then $\frac{\partial f}{\partial x} = W^\top(\mathbf{1} \odot (W x)) = W^\top W x$. If $n^1 = n^0$ and $W_{ij} \sim \mathcal{N}(0, 1/n^0)$, then $\mathbb{E}\,\frac{\partial f}{\partial x_i} = x_i$. If we had assumed Thm 5.1 is correct, then we would have (incorrectly) computed $\frac{1}{n^0}\sum_{i=1}^{n^0}\frac{\partial f}{\partial x_i} \xrightarrow{\text{a.s.}} \mathbb{E}\,\frac{1}{n^0}\sum_{i=1}^{n^0}(W'^\top W x)_i = 0$, where $W'$ is an iid copy of W. Below, we develop a theory of the scaling limit of general tensor programs, from which follows the correct way of computing gradients when the $v^{it}$ do not have 0 mean. We first introduce "extended syntax" programs, which are semantically equivalent to programs of the original syntax, and then show that we can "compile" original syntax programs to extended syntax programs with no transposes, with the same scaling limit in a suitable sense.

Definition 6.1. Extended syntax programs are those that allow all line types of Section 3 and in addition

Comp (H) if $n^{j_1} = \cdots = n^{j_k}$, then an assignment via some general (possibly nonlinear) function $f^l : \mathbb{R}^k \to \mathbb{R}$:
$l : h^l := f^l(h^{j_1}, \ldots, h^{j_k}) \in \mathbb{R}^{n^l} = \mathbb{R}^{n^{j_1}}$
where $h^{j_1}, \ldots, h^{j_k}$ are previous G- or H-vars, and $f^l$ acts coordinatewise.

So in essence, extended syntax programs just allow Nonlin lines to take H-vars in addition to G-vars. While in the original syntax, H-vars must feed into lines of type MatMul, in extended syntax they can also be used to create new H-vars via coordinatewise action. One can define CDCs for extended syntax programs just as before (see Appendix A). Each extended syntax program is equivalent to an original syntax program, by expanding the definition of each H-var to a function of previous G-vars. For

9In our previous example of $f(x) = v^\top \rho(x)$, this corresponds to the readout layer v being sampled with zero mean and independently from x and the other parameters of ρ.

example, if $h^l := f^l(h^k, g^m)$ and $h^k := f^k(g^r)$, then the expanded definition of $h^l$ is $f^l(f^k(g^r), g^m)$. We call this the expanded definition of $h^l$, and write $f^{h^l}$ for this expanded function, so that $h^l = f^{h^l}(g^{l_1}, \ldots, g^{l_m})$ for some $l_1, \ldots, l_m < l$; for G-vars, we also define $f^{g^l} = \mathrm{id}$. (In our example above, $f^{h^l}(x, y) = f^l(f^k(x), y)$.) So by replacing each line of type Comp with its expanded definition, we can convert an extended syntax program to an original syntax program with the same semantics for all vector vars.

Definition 6.2. Let π be an original syntax skeleton with associated sampling data $\{\sigma^{lt}\}_{A^l}$, $\{\mu^{c_{\mathrm{in}}t}, K^{c_{\mathrm{in}}t}\}_{c \in \mathcal{C}}$. We define an extended syntax skeleton π̌, called the detransposition of π, by induction on line number as follows. During this process, we keep track of an injective mapping $\varphi$ taking vector (resp. matrix) vars of π to vector (resp. matrix) vars of π̌, along with a specialized mapping $\varphi_g$ taking a G-var of π produced by MatMul to a G-var of π̌. We use a check ˇ to denote objects of the detransposition. We also simultaneously set $\{\check\sigma^{lt}\}_{\check A^l}$, $\{\mu^{\check c_{\mathrm{in}}t}\}_c$ and $\{K^{\check c_{\mathrm{in}}t}\}_c$ of the detransposition. They propagate according to the usual rules, Eqs. (3) and (4), to determine $\check\mu^{\check c}$ and $\check K^{\check c}$. Let l be the current line number of π we are processing, and let $\ell$ denote 1 + the length of the current π̌ (this is where we are adding new lines in π̌).

1. If $g^l := x$ is a VecIn line, then add a line of the same type to π̌, $\ell : \check g^\ell := x$. Set $\varphi(g^l) \leftarrow \check g^\ell$ and $\mu^{\check c_{\mathrm{in}}t}(\check g^\ell) \leftarrow \mu^{c_{\mathrm{in}}t}(g^l)$, where $\check c = c(\check g^\ell)$ and $c = c(g^l)$. Set $K^{\check c_{\mathrm{in}}t}(\check g^\ell, \varphi(g^m)) \leftarrow K^{c_{\mathrm{in}}t}(g^l, g^m)$ for all input G-vars $g^m$ with $m < l$.

2. If $A^l := A$ is a MatIn line, then add to π̌ the line $\ell : \check A^\ell := A$. Set $\varphi(A^l) \leftarrow \check A^\ell$ and $\check\sigma^{\ell t} \leftarrow \sigma^{lt}$ for all $t \in [1, \infty]$.

3. If $A^l := A^{m\top}$ is a T line, then add to π̌ an input line $\ell : \check A^\ell := A'$ for a new input $A'$ sampled iid as $A^{m\top}$. Set $\varphi(A^l) \leftarrow \check A^\ell$ and $\check\sigma^{\ell t} \leftarrow \sqrt{\frac{n_1(A^m)t}{n_2(A^m)t}}\, \sigma^{mt}$ for all $t \in [1, \infty]$.

4. If $g^l := a^l_{j_1} g^{j_1} + \cdots + a^l_{j_k} g^{j_k}$ is a LinComb line in π, we add a line of the same type in π̌:
$$\ell : \check g^\ell := a^l_{j_1} \varphi(g^{j_1}) + \cdots + a^l_{j_k} \varphi(g^{j_k})$$
and set $\varphi(g^l) \leftarrow \check g^\ell$ if each of the $\varphi(g^{j_i})$ is a G-var in π̌; or we add a line of type Comp
$$\ell : \check h^\ell := a^l_{j_1} \varphi(g^{j_1}) + \cdots + a^l_{j_k} \varphi(g^{j_k})$$
and set $\varphi(g^l) \leftarrow \check h^\ell$ if some $\varphi(g^{j_i})$ is an H-var in π̌.

5. If $h^l := f^l(g^{j_1}, \ldots, g^{j_k})$ is a line of type Nonlin, then we add to π̌ a line of type Comp
$$\ell : \check h^\ell := \check f^\ell(\varphi(g^{j_1}), \ldots, \varphi(g^{j_k}))$$
where $\check f^\ell = f^l$, and we set $\varphi(h^l) \leftarrow \check h^\ell$. (If all $\varphi(g^{j_i})$ are G-vars then we also typecast this line to Nonlin.)

6. Suppose $g^l := A h^m$ is a line of type MatMul in π, where A is some previous A-var. Consider the A-var $A'$ where $A' := A^\top$ if A is an input A-var, or $A := A'^\top$ if A is a transposed var. Let $g^i := A' h^i$, $i = 1, \ldots, s$, be all previous lines of type MatMul involving $A'$, where $h^i$ can be a G- or H-var. Define $C \in \mathbb{R}^{s \times s}$, $v \in \mathbb{R}^s$ by
$$C_{ij} = \mathbb{E}\, f^{\varphi(h^i)}(Z)\, f^{\varphi(h^j)}(Z), \qquad v_i = \mathbb{E}\, f^{\varphi_g(g^i)}(Z)\, f^{\varphi(h^m)}(Z),$$
where $Z \sim \mathcal{N}(\check\mu^{\check c}, \check K^{\check c})$ (the expectation will only depend on components of Z corresponding to previous lines). Compute $a = \alpha C^+ v \in \mathbb{R}^s$, where $\alpha = \lim_t \frac{n_2(A)t}{n_1(A)t}$. Then we add the following to π̌:
$$\ell : \check g^\ell := \varphi(A)\varphi(h^m) \quad \text{(MatMul)}$$
$$\ell + 1 : \check h^{\ell+1} := \check g^\ell + \sum_{j=1}^s a_j \varphi(h^j) \quad \text{(Comp)}$$
If the $\varphi(h^j)$ are all G-vars, we typecast line $\ell+1$ to LinComb and write $\check g^{\ell+1}$ instead. Set $\varphi(g^l) \leftarrow \check h^{\ell+1}$ and $\varphi_g(g^l) \leftarrow \check g^\ell$.

See Appendix B.1.1 for a concrete example of detransposition. Below, for any $c \in \mathcal{C}_\pi$, let $\bar c$ be $c \cup \{h : c(h) = c\}$, i.e. the collection of H- or G-vars with the same dimension constraint; see Appendix A.

Theorem 6.3. Let π be an original syntax program with sampling instructions, and π̌ be the detransposition of π, with $\varphi$ the mapping from vector vars of π to vector vars of π̌. Assume all $f^l$ of π are polynomially bounded, and that almost sure rank convergence holds for π̌. Sample input vars of π according to Section 3.2. Then for any dimension constraints Λ, any $c \in \mathcal{C}_{(\pi,\Lambda)}$, and any polynomially bounded function $\phi : \mathbb{R}^{\bar c} \to \mathbb{R}$,

$$\frac{1}{n^{ct}} \sum_{i=1}^{n^{ct}} \phi(h^{ct}_i) \xrightarrow{\text{a.s.}} \mathbb{E}\, \phi\big((f^{\varphi(h)}(Z))_{h \in \bar c}\big)$$

where $h^{ct}_i$ is the sequence of the ith coordinates of all vector vars in $\bar c$, and $Z \sim \mathcal{N}(\check\mu^{\check c}, \check K^{\check c})$. If all $f^l$ in π are differentiable,10 then we can take a in Item 6 of Defn 6.2 to be $\alpha\sigma^2\big(\mathbb{E}\,\partial_{Z^{\varphi_g(g^i)}} f^{\varphi(h^m)}(Z)\big)_{i \in [s]}$, where $\sigma = \sigma^{r\infty}$ and r is the line number of $\varphi(A')$, and the above almost sure convergence will still hold.

See Appendices D.1 to D.3 for example applications of this theorem to random MLPs, and rederivations of the semicircle and Marchenko-Pastur laws. Thm 6.3 has the following basic intuition. Let $g = A^\top F((A h^j)_{j=1}^k)$ for $A \in \mathbb{R}^{n \times m}$, $h^i \in \mathbb{R}^m$, $F : \mathbb{R}^{k \times n} \to \mathbb{R}^n$, $((z^j \in \mathbb{R}^n)_{j=1}^k) \mapsto F((z^j)_{j=1}^k)$. Here F should be thought of as the stretch of π that separates g from the previous G-vars induced by A. Then, semirigorously, as $n, m \to \infty$ and $A_{ij} \sim \mathcal{N}(0, 1/n)$, for any $i \in [m]$,

$$\mathbb{E}\, g_i \approx \mathbb{E}\, A_{:i}^\top \Big\{ F\big((A^{\backslash i} h^j_{\backslash i})_{j=1}^k\big) + \big\langle F', (A_{:i} h^j_i)_{j=1}^k \big\rangle \Big\}$$

where $A^{\backslash i}$ is A without its ith column, $h^j_{\backslash i}$ is $h^j$ without its ith coordinate, and $F'$ is the Jacobian of F at $(A^{\backslash i} h^j_{\backslash i})_{j=1}^k$. Now $\mathbb{E}\, A_{:i}^\top \big\langle F', (A_{:i} h^j_i)_{j=1}^k \big\rangle = \frac{1}{n}\sum_{j=1}^k h^j_i \sum_{m=1}^n \frac{\partial F_m}{\partial z^j_m}$ because $F'$ is independent of $A_{:i}$. Likewise, $A_{:i}^\top F\big((A^{\backslash i} h^j_{\backslash i})_{j=1}^k\big)$ is approximately a Gaussian with mean 0. Then $g_i$ is roughly a Gaussian plus a linear combination of $\{h^j_i\}_{j=1}^k$. So, unlike for the restricted programs in Thms 4.3 and 5.1, we do not expect g to be "Gaussian" in the limit, since the $h^j$ are in general images of nonlinearities and need not be Gaussian. We can, however, still keep track of g's decomposition into a Gaussian part and a linear combination part — this is the key idea behind detransposition, and in particular its step 6. There, the coefficients $a_j$ of this linear combination are computed from the correlations $v_i$ between g and the previous G-vars induced by $A'$. Each $a_j$ can be seen to be exactly the coefficient computed heuristically here by applying Stein's lemma (Lem E.8) when r = 0 in step 6.
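For reference, the standard univariate form of Stein's lemma underlying this heuristic (the paper's precise statement is its Lem E.8, which is not reproduced here) is: for jointly Gaussian $(u, z)$ with $\mathbb{E}\, u = 0$ and a differentiable, polynomially bounded φ,
$$\mathbb{E}[u\, \phi(z)] = \operatorname{Cov}(u, z)\, \mathbb{E}[\phi'(z)],$$
which is the mechanism by which the coefficient of $h^j_i$ above arises from the Gaussian column $A_{:i}$.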

7. Proof Techniques While our tensor program framework is new and Bayati & Montanari(2011) is concerned with a much simpler setting of AMP algorithms, its technique of Gaussian conditioning is useful to us (Lem E.3): If A is a Gaussian matrix, then conditioned on $G = AH$ and $G' = A^\top H'$, the distribution of A is $E + \Pi \tilde A \Pi'$ for some mean matrix E, projection matrices $\Pi, \Pi'$, and $\tilde A$ distributed iid as A. If we let G and H be previous G- and H-vars in MatMul lines involving A, and similarly for $G', H'$ with respect to $A^\top$, this allows us to induct on line number by conditioning on previous lines. Compared to Bayati & Montanari(2011), our input G-vars have all finite moments, whereas the analogous quantities in Bayati & Montanari(2011) just have a bounded number of them. This allows us to simplify the induction somewhat, and remove the smoothness assumption on (the functions playing the same role as) $f^l$ that is required in Bayati & Montanari(2011). The latter is a result of two facts: 1. Gaussian averaging is smooth: $\mathbb{E}[\Phi(z) : z \sim \mathcal{N}(\mu, \Sigma)]$ is generically smooth in μ and Σ. 2. If $\Pi \in \mathbb{R}^{n \times n}$ is a projection matrix of rank $n - O(1)$, then Πz for an isotropic Gaussian vector z has approximately independent coordinates, insofar as a law of large numbers is concerned. This is shown by first bounding the off-diagonal correlations of Π using linear programming duality (Lem E.19) and then bounding the moments of $\sum_i \phi(\Pi z)_i$ using the Hermite expansion of φ and the previous bound on the correlations of Πz (Thm E.21). Again, these two tools were not accessible to Bayati & Montanari(2011) because of their assumptions of only a finite number of input moments. Note that a more straightforward application of the logic of Bayati & Montanari(2011) would not allow us to reason about nonsmooth functions such as the step function, which appears as the gradient of ReLU.
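To make the conditioning trick concrete, here is its simplest one-sided instance (a standard fact stated for illustration; the paper's Lem E.3 handles the two-sided conditioning on both $AH = G$ and $A^\top H' = G'$ quoted above): if A has iid $\mathcal{N}(0, \sigma^2)$ entries, then
$$A \mid \{AH = G\} \;\stackrel{d}{=}\; G H^{+} + \tilde A\,(I - H H^{+}),$$
where $H^+$ is the Moore-Penrose pseudoinverse and $\tilde A$ is an iid copy of A; the first term is the conditional mean E, and $I - HH^+$ plays the role of one of the projections.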

10Even if not all $f^l$ are differentiable, as long as the covariance $\{\check K^{\check c}(\varphi_g(g^i), \varphi_g(g^j))\}_{i,j \in [s]}$ is nonsingular, we may take the distributional derivatives and interpret the expectation as the application of this (tempered) distribution to the Gaussian density function. Then the theorem holds under this interpretation. Even if the covariance is singular, we may consider a subset $\{\varphi_g(g^i)\}_{i \in I}$ of maximal rank, and still apply the tempered distribution interpretation; see the proof for more details.

8. Discussion In this paper, we have introduced a notion of a tensor program able to express almost all neural network computations in modern deep learning, and characterized its scaling limits. As corollaries, we generalized the DNN-GP correspondence, rigorously derived correct equations for signal propagation in neural networks, and proved the convergence of the NTK, among many other results discussed in Section 2. While our results assume Gaussian sampling, we expect the results to hold when we sample from other "nice" distributions (say, with a few finite moments). In the random matrix and statistical physics literature, this universality is very common. In our case, the central limit intuition for the DNN-GP correspondence, for example, would indeed hint at this. We also believe that the rank convergence assumptions are not necessary, but are just side effects of our Gaussian conditioning technique. One should be able to remove them by considering some stability property of tensor programs. Our framework, while surprisingly general, as presented doesn't cover a few deep learning layers, but can be easily extended to do so in all but one case:
• Dropout. Our framework can already cover "Gaussian dropout"; binary dropout can be incorporated easily as well by introducing Bernoulli random variables in the way of Schoenholz et al.(2017).
• Layernorm. Our framework only allows $f^l$ with a fixed signature, as "width" grows, for example batchnorm with a fixed batch size. However, as we take width to infinity, the signature for layernorm also changes. But our theorems show that the mean and variance of the layer converge a.s. to a deterministic limit, so that in the forward pass, layernorm is asymptotically the same as a linear, coordinatewise $f^l$. A brief computation shows that the gradient of layernorm can also be asymptotically expressed via a tensor program in a similar way. Nevertheless, non-"coordinatewise" $f^l$ are worth investigating in the future, perhaps taking inspiration from the work of Berthier et al.(2017) that studied this scenario in the setting of State Evolution for AMP.
• Batchnorm, when reasoning about gradients. Our framework does not allow singularities in $f^l$, whereas during backprop batchnorm's derivative contains a singularity at the origin. Yang et al.(2018) demonstrates that empirically our equations should still extend to this case. We leave this to future work.
Our scaling limit results only apply to fixed tensor program skeletons. This would be enough to derive the behavior of a DNN on a dataset which is small compared to the width. But perhaps more reasonable is when the dataset size is commensurate with or perhaps even larger than the width of the network. This would require taking a joint limit in the skeleton size, over the data distribution, and over the dimensions $\{n^l\}_l$; see Pennington & Worah(2017; 2018) for analogous settings for 2 or 3 layer networks and Gaussian data. We leave the investigation of this to future work. The tensor program framework naturally lends itself to an automation of computations regarding random neural networks, given the program underlying them. Our community might find valuable a module in PyTorch or TensorFlow that computes the corresponding $\mu^c$ and $K^c$ automatically given the tape (for PyTorch) or the computation graph (for TensorFlow). We hope the tools presented in this work will be adopted by any researcher interested in studying random neural networks.

Acknowledgement Thanks are due to Jascha Sohl-Dickstein, Sam Schoenholz, Jeffrey Pennington, Raphael Berthier, Ilya Razenshteyn, Pengchuan Zhang, Hadi Salman, and Zeyuan Allen-Zhu for discussions and help on initial drafts.

References

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. arXiv:1811.04918 [cs, math, stat], November 2018a. URL http://arxiv.org/abs/1811.04918.

Allen-Zhu, Z., Li, Y., and Song, Z. A Convergence Theory for Deep Learning via Over-Parameterization. arXiv:1811.03962 [cs, math, stat], November 2018b. URL http://arxiv.org/abs/1811.03962.

Allen-Zhu, Z., Li, Y., and Song, Z. On the Convergence Rate of Training Recurrent Neural Networks. arXiv:1810.12065 [cs, math, stat], October 2018c. URL http://arxiv.org/abs/1810.12065.

Amari, S.-i., Karakida, R., and Oizumi, M. Fisher Information and Natural Gradient Learning of Random Deep Networks. arXiv:1808.07172 [cond-mat, stat], August 2018. URL http://arxiv.org/abs/1808.07172.

Amit, D. J., Gutfreund, H., and Sompolinsky, H. Spin-glass models of neural networks. Physical Review A, 32(2):1007–1018, August 1985. doi: 10.1103/PhysRevA.32.1007. URL https://link.aps.org/doi/10.1103/PhysRevA.32. 1007.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer Normalization. arXiv:1607.06450 [cs, stat], July 2016. URL http: //arxiv.org/abs/1607.06450.

Bahdanau, D., Cho, K., and Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat], September 2014. URL http://arxiv.org/abs/1409.0473.

Bayati, M. and Montanari, A. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, February 2011. ISSN 0018-9448, 1557-9654. doi: 10.1109/TIT.2010.2094817. URL http://arxiv.org/abs/1001.3448.

Berthier, R., Montanari, A., and Nguyen, P.-M. State Evolution for Approximate Message Passing with Non-Separable Functions. arXiv:1708.03950 [cs, math], August 2017. URL http://arxiv.org/abs/1708.03950.

Blomqvist, K., Kaski, S., and Heinonen, M. Deep convolutional Gaussian processes. arXiv preprint arXiv:1810.03052, 2018.

Bolthausen, E. An iterative construction of solutions of the TAP equations for the Sherrington-Kirkpatrick model. arXiv:1201.2891 [cond-mat, physics:math-ph], January 2012. URL http://arxiv.org/abs/1201.2891.

Borovykh, A. A gaussian process perspective on convolutional neural networks. arXiv preprint arXiv:1810.10798, 2018.

Bradshaw, J., Matthews, A. G. d. G., and Ghahramani, Z. Adversarial examples, uncertainty, and transfer testing robustness in gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476, 2017.

Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. Large-Scale Study of Curiosity-Driven Learning. arXiv:1808.04355 [cs, stat], August 2018a. URL http://arxiv.org/abs/1808.04355.

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by Random Network Distillation. arXiv:1810.12894 [cs, stat], October 2018b. URL http://arxiv.org/abs/1810.12894.

Cao, Y. and Gu, Q. A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks. arXiv:1902.01384 [cs, math, stat], February 2019. URL http://arxiv.org/abs/1902.01384.

Chen, M., Pennington, J., and Schoenholz, S. Dynamical Isometry and a Mean Field Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 873–882, Stockholmsmssan, Stockholm Sweden, July 2018. PMLR. URL http://proceedings.mlr.press/v80/chen18i.html.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs, stat], June 2014. URL http://arxiv.org/abs/1406.1078. 13 Cho, Y. and Saul, L. K. Kernel methods for deep learning. In Advances in neural information processing systems, pp. 342– 350, 2009. URL http://papers.nips.cc/paper/3628-kernel-methods-for-deep-learning.

Crisanti, A. and Sompolinsky, H. Path Integral Approach to Random Neural Networks. Physical Review E, 98(6), December 2018. ISSN 2470-0045, 2470-0053. doi: 10.1103/PhysRevE.98.062120. URL http://arxiv.org/abs/1809. 06042.

Damianou, A. and Lawrence, N. Deep gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215, 2013.

Daniely, A., Frostig, R., and Singer, Y. Toward Deeper Understanding of Neural Networks: The Power of Initial- ization and a Dual View on Expressivity. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Gar- nett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 2253–2261. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6427-toward-deeper-understanding-of-neural- networks-the-power-of-initialization-and-a-dual-view-on-expressivity.pdf.

Deshpande, Y., Abbe, E., and Montanari, A. Asymptotic mutual information for the balanced binary stochastic block model. Information and Inference: A Journal of the IMA, 6(2):125–170, June 2017. ISSN 2049-8764. doi: 10.1093/imaiai/iaw017. URL https://academic.oup.com/imaiai/article/6/2/125/2739335.

Donoho, D. and Montanari, A. High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166(3):935–969, December 2016. ISSN 1432-2064. doi: 10.1007/s00440- 015-0675-z. URL https://doi.org/10.1007/s00440-015-0675-z.

Donoho, D. L., Maleki, A., and Montanari, A. Message Passing Algorithms for Compressed Sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, November 2009. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.0909892106. URL http://arxiv.org/abs/0907.3574.

Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804 [cs, math, stat], November 2018a. URL http://arxiv.org/abs/1811.03804.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. arXiv:1810.02054 [cs, math, stat], October 2018b. URL http://arxiv.org/abs/1810.02054.

Fletcher, A. K. and Rangan, S. Inference in Deep Networks in High Dimensions. arXiv:1706.06549 [cs, math, stat], June 2017. URL http://arxiv.org/abs/1706.06549.

Gabri, M., Manoel, A., Luneau, C., Barbier, J., Macris, N., Krzakala, F., and Zdeborov, L. Entropy and mutual information in models of deep neural networks. arXiv:1805.09785 [cond-mat, stat], May 2018. URL http://arxiv.org/abs/ 1805.09785.

Garriga-Alonso, A., Aitchison, L., and Rasmussen, C. E. Deep Convolutional Networks as shallow Gaussian Processes. arXiv:1808.05587 [cs, stat], August 2018. URL http://arxiv.org/abs/1808.05587.

Giryes, R., Sapiro, G., and Bronstein, A. M. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, July 2016. ISSN 1053-587X, 1941-0476. doi: 10.1109/TSP.2016.2546221. URL http://arxiv.org/abs/1504.08291.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Teh, Y. W. and Titterington, M. (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, May 2010. PMLR. URL http://proceedings.mlr.press/v9/glorot10a.html.

Hanin, B. and Rolnick, D. How to Start Training: The Effect of Initialization and Architecture. arXiv:1803.01719 [cs, stat], March 2018. URL http://arxiv.org/abs/1803.01719.

Hazan, T. and Jaakkola, T. Steps Toward Deep Kernel Methods from Infinite Neural Networks. arXiv:1508.05133 [cs], August 2015. URL http://arxiv.org/abs/1508.05133. 14 He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026– 1034, 2015. URL http://www.cv-foundation.org/openaccess/content_iccv_2015/html/He_ Delving_Deep_into_ICCV_2015_paper.html.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. pp. 770– 778, 2016. URL https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/He_ Deep_Residual_Learning_CVPR_2016_paper.html.

Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Comput., 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.

Hu, T.-C. and L. Taylor, R. On the strong law for arrays and for the bootstrap mean and variance. International Journal of Mathematics and Mathematical Sciences, 20, 1997. doi: 10.1155/S0161171297000483.

Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely Connected Convolutional Networks. arXiv:1608.06993 [cs], August 2016. URL http://arxiv.org/abs/1608.06993.

Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs], February 2015. URL http://arxiv.org/abs/1502.03167.

Jacot, A., Gabriel, F., and Hongler, C. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. arXiv:1806.07572 [cs, math, stat], June 2018. URL http://arxiv.org/abs/1806.07572.

Kabashima, Y., Krzakala, F., Mzard, M., Sakata, A., and Zdeborov, L. Phase Transitions and Sample Complexity in Bayes-Optimal Matrix Factorization. IEEE Transactions on Information Theory, 62(7):4228–4265, July 2016. ISSN 0018-9448. doi: 10.1109/TIT.2016.2556702.

Kadmon, J. and Sompolinsky, H. Transition to Chaos in Random Neuronal Networks. Physical Review X, 5(4), November 2015. ISSN 2160-3308. doi: 10.1103/PhysRevX.5.041030. URL https://link.aps.org/doi/10.1103/ PhysRevX.5.041030.

Kamilov, U. S., Rangan, S., Fletcher, A. K., and Unser, M. Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning. arXiv:1207.3859 [cs, math], July 2012. URL http://arxiv.org/ abs/1207.3859.

Karakida, R., Akaho, S., and Amari, S.-i. Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach. arXiv:1806.01316 [cond-mat, stat], June 2018. URL http://arxiv.org/abs/1806.01316.

Kumar, V., Singh, V., Srijith, P., and Damianou, A. Deep Gaussian Processes with Convolutional Kernels. arXiv preprint arXiv:1806.01655, 2018.

Landau, I. D. and Sompolinsky, H. Coherent chaos in a with structured connectivity. bioRxiv, October 2018. doi: 10.1101/350801. URL http://biorxiv.org/lookup/doi/10.1101/350801.

Lawrence, N. D. and Moore, A. J. Hierarchical Gaussian process latent variable models. In Proceedings of the 24th international conference on Machine learning, pp. 481–488. ACM, 2007.

Le Roux, N. and Bengio, Y. Continuous neural networks. In Artificial Intelligence and Statistics, pp. 404–411, 2007.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

LeCun, Y., Haffner, P., Bottou, L., and Bengio, Y. Object recognition with gradient-based learning. In Shape, contour and grouping in computer vision, pp. 319–345. Springer, 1999.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S., Pennington, J., and Sohl-Dickstein, J. Deep Neural Networks as Gaussian Processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
Li, B. and Saad, D. Exploring the Function Space of Deep-Learning Machines. Physical Review Letters, 120(24), June 2018. doi: 10.1103/PhysRevLett.120.248301. URL http://arxiv.org/abs/1708.01422.
Li, P. and Nguyen, P.-M. On Random Deep Weight-Tied Autoencoders: Exact Asymptotic Analysis, Phase Transitions, and Implications to Training. September 2018. URL https://openreview.net/forum?id=HJx54i05tX.
Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. Gaussian Process Behaviour in Wide Deep Neural Networks. arXiv:1804.11271 [cs, stat], April 2018. URL http://arxiv.org/abs/1804.11271.
Neal, R. M. Bayesian Learning for Neural Networks. PhD Thesis, University of Toronto, 1995.
Novak, R., Xiao, L., Lee, J., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes. arXiv preprint arXiv:1810.05148, 2018.
O'Donnell, R. Analysis of Boolean Functions. Cambridge University Press, New York, NY, 2014. ISBN 978-1-107-03832-5.
Osband, I., Aslanides, J., and Cassirer, A. Randomized Prior Functions for Deep Reinforcement Learning. arXiv:1806.03335 [cs, stat], June 2018. URL http://arxiv.org/abs/1806.03335.
Pennington, J. and Worah, P. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, pp. 2634–2643, 2017.
Pennington, J. and Worah, P. The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network. In Advances in Neural Information Processing Systems 31, 2018.
Pennington, J., Schoenholz, S., and Ganguli, S. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems 30, pp. 4788–4798. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7064-resurrecting-the-sigmoid-in-deep-learning-through-dynamical-isometry-theory-and-practice.pdf.
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pp. 3360–3368, 2016.
Rajan, K., Abbott, L. F., and Sompolinsky, H. Stimulus-Dependent Suppression of Chaos in Recurrent Neural Networks. Physical Review E, 82(1), July 2010. doi: 10.1103/PhysRevE.82.011903. URL http://arxiv.org/abs/0912.3513.
Reeves, G. Additivity of Information in Multilayer Networks via Additive Gaussian Noise Transforms. arXiv:1710.04580 [cs, math, stat], October 2017. URL http://arxiv.org/abs/1710.04580.
Rider, B. A limit theorem at the edge of a non-Hermitian random matrix ensemble. Journal of Physics A: Mathematical and General, 36(12):3401, 2003. doi: 10.1088/0305-4470/36/12/331. URL http://stacks.iop.org/0305-4470/36/i=12/a=331.
Schniter, P. and Rangan, S. Compressive Phase Retrieval via Generalized Approximate Message Passing. IEEE Transactions on Signal Processing, 63(4):1043–1055, February 2015. doi: 10.1109/TSP.2014.2386294.
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. Deep Information Propagation. 2017. URL https://openreview.net/pdf?id=H1W1UN9gg.
Sompolinsky, H., Crisanti, A., and Sommers, H. J. Chaos in Random Neural Networks. Physical Review Letters, 61(3):259–262, July 1988. doi: 10.1103/PhysRevLett.61.259. URL https://link.aps.org/doi/10.1103/PhysRevLett.61.259.
Stern, M., Sompolinsky, H., and Abbott, L. F. Dynamics of Random Neural Networks with Bistable Units. Physical Review E, 90:062710, December 2014. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4348075/.
Tao, T. Topics in Random Matrix Theory. Graduate Studies in Mathematics, 132, 2012.
van der Wilk, M., Rasmussen, C. E., and Hensman, J. Convolutional Gaussian Processes. In Advances in Neural Information Processing Systems 30, pp. 2849–2858, 2017.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Williams, C. K. I. Computing with Infinite Networks. In Advances in Neural Information Processing Systems, 1997.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016a.
Wilson, A. G., Hu, Z., Salakhutdinov, R. R., and Xing, E. P. Stochastic Variational Deep Kernel Learning. In Advances in Neural Information Processing Systems, pp. 2586–2594, 2016b.
Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., and Pennington, J. Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5393–5402, Stockholmsmässan, Stockholm, Sweden, July 2018. PMLR. URL http://proceedings.mlr.press/v80/xiao18a.html.
Yang, G. and Schoenholz, S. S. Mean Field Residual Network: On the Edge of Chaos. In Advances in Neural Information Processing Systems, 2017.
Yang, G. and Schoenholz, S. S. Deep Mean Field Theory: Layerwise Variance and Width Variation as Methods to Control Gradient Explosion. February 2018. URL https://openreview.net/forum?id=rJGY8GbR-.
Yang, G., Pennington, J., Rao, V., Sohl-Dickstein, J., and Schoenholz, S. S. A Mean Field Theory of Batch Normalization. September 2018. URL https://openreview.net/forum?id=SyMDXnCcF7.

Yang, G. Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes. In Advances in Neural Information Processing Systems, 2019.
Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks. arXiv:1811.08888 [cs, math, stat], November 2018. URL http://arxiv.org/abs/1811.08888.

A. Common Dimension Classes

We present an algorithm below to compute each CDC $c$ of $(\pi, \Lambda)$. We write $c(g)$ for the CDC associated to the G-var $g$, and we likewise associate a CDC $c(h)$ to each H-var and left- and right-CDCs $c_1(A), c_2(A)$ to each A-var. Here $g^l$ and $g^m$ should be interpreted as elements of the set $c$ and not as vectors. We induct on lines in the skeleton $\pi$. First let $\sim_\Lambda$ be the smallest equivalence relation on G-vars such that $g^l \sim_\Lambda g^m$ if $(n^l, n^m) \in \Lambda$.

1. VecIn $g^l$. Set $c(g^l) \leftarrow \{g^m : g^m \sim_\Lambda g^l\}$.

2. MatIn $A^l$. Set $c_1(A^l) \leftarrow \{\}$, $c_2(A^l) \leftarrow \{\}$.

3. T $A^l \stackrel{\mathrm{def}}{=} (A^k)^\top$. Set $c_1(A^l) \leftarrow c_2(A^k)$, $c_2(A^l) \leftarrow c_1(A^k)$.

4. MatMul
   (a) If $g^l := A^k g^j$: Merge $c(g^j) \leftarrow c(g^j) \cup c_2(A^k) \to c_2(A^k)$ and $c(g^l) \leftarrow c_1(A^k) \cup \{g^m : g^m \sim_\Lambda g^l\} \to c_1(A^k)$.
   (b) If $g^l := A^k h^j$: Merge $c(h^j) \leftarrow c(h^j) \cup c_2(A^k) \to c_2(A^k)$ and $c(g^l) \leftarrow c_1(A^k) \cup \{g^m : g^m \sim_\Lambda g^l\} \to c_1(A^k)$.

5. LinComb $g^l := a^l_{j_1} g^{j_1} + \cdots + a^l_{j_k} g^{j_k}$. Merge $c(g^{j_i}) \leftarrow \{g^m : g^m \sim_\Lambda g^l\} \cup \bigcup_{i'} c(g^{j_{i'}}) \to c(g^l)$ for all $i$.

6. Nonlin $h^l := f^l(g^{j_1}, \ldots, g^{j_k}) \in \mathbb{R}^{n^l} = \mathbb{R}^{n^{j_1}}$. Merge $c(g^{j_i}) \leftarrow \bigcup_{i'} c(g^{j_{i'}}) \to c(h^l)$ for all $i$.

7. Comp: similar to Nonlin.

B. Example Programs

B.1. MLP

1 : g^1 := W^1 x          network input multiplied by weight matrix
2 : g^2 := b^1            layer 1 bias
3 : g^3 := g^1 + g^2      layer 1 preactivation
4 : h^4 := φ(g^3)         layer 1 activation
5 : A^5 := W^2            layer 2 weights
6 : g^6 := b^2            layer 2 biases
7 : g^7 := A^5 h^4
8 : g^8 := g^7 + g^6      layer 2 preactivations
9 : h^9 := φ(g^8)         layer 2 activations

In line 1 above, we could also spend a few more lines and equivalently set $g^1 := \sum_{i=1}^{n^0} x_i g^{-i}$ where $g^{-i} := W^1_{:i}$. For brevity, we adopt the current approach, but later, for reasoning about backprop, this way of expressing $W^1 x$ is useful. Note that we can also express $x$ as its own deterministic input G-var and $W^1$ as an input A-var, but the program given here is more consistent with our scaling limit, which takes the hidden layer width to infinity but keeps the input dimension fixed.

Backprop of fully-connected, feedforward:

10 : g^10 := ∇_{h^9} L           last layer gradient
11 : h^11 := φ'(g^8) ⊙ g^10      layer 2 preactivation gradient
12 : A^12 := (A^5)^⊤
13 : g^13 := A^12 h^11           layer 1 activation gradient
14 : h^14 := φ'(g^3) ⊙ g^13      layer 1 preactivation gradient

Here $\nabla_{h^9} L$ can be any vector, but in the context of neural networks, it can be thought of as the gradient of some loss function $L$ obtained through some readout layer.
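To make the bookkeeping above concrete, here is a minimal numpy sketch (not part of the paper) of the forward program (lines 1–9) and its backprop extension (lines 10–14); the width n, input dimension n0, nonlinearity φ = tanh, and the weight scalings are illustrative assumptions, and the variable names mirror the line numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n0 = 1000, 3                        # hidden width and (fixed) input dimension
phi, dphi = np.tanh, lambda z: 1 - np.tanh(z) ** 2

x = rng.standard_normal(n0)            # network input
W1 = rng.standard_normal((n, n0))      # used through the input G-var g1 := W1 x
W2 = rng.standard_normal((n, n)) / np.sqrt(n)   # A-var A5 := W2, with a Glorot-like scaling
b1, b2 = rng.standard_normal(n), rng.standard_normal(n)

# forward program, lines 1-9
g1 = W1 @ x               # line 1
g2 = b1                   # line 2
g3 = g1 + g2              # line 3: layer 1 preactivation
h4 = phi(g3)              # line 4: layer 1 activation
g6 = b2                   # line 6 (line 5 introduces A5 = W2)
g7 = W2 @ h4              # line 7
g8 = g7 + g6              # line 8: layer 2 preactivation
h9 = phi(g8)              # line 9: layer 2 activation

# backprop extension, lines 10-14, for an arbitrary last-layer gradient vector
g10 = rng.standard_normal(n)           # line 10: stand-in for the gradient at h9
h11 = dphi(g8) * g10                   # line 11
A12 = W2.T                             # line 12: transposed A-var
g13 = A12 @ h11                        # line 13
h14 = dphi(g3) * g13                   # line 14
```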

B.1.1. DETRANSPOSITION

We demonstrate the detransposition $\check\pi$ of the above program $\pi$. Lines 1 to 9 are almost copied verbatim to $\check\pi$ because the only matrix multiplications involve new A-vars.

1 : ǧ^1 := W^1 x                          ϕ(g^1) = ǧ^1
2 : ǧ^2 := b^1                            ϕ(g^2) = ǧ^2
3 : ǧ^3 := ǧ^1 + ǧ^2                      ϕ(g^3) = ǧ^3
4 : ȟ^4 := φ(ǧ^3)                         ϕ(h^4) = ȟ^4
5 : ǧ^5 := b^2                            ϕ(g^6) = ǧ^5
6 : Ǎ^6 := W^2                            ϕ(A^5) = Ǎ^6
7 : ǧ^7 := Ǎ^6 ȟ^4     begin MatMul conversion;   ϕ_g(g^7) = ǧ^7
8 : ǧ^8 := ǧ^7 + 0     end MatMul conversion;     ϕ(g^7) = ǧ^8
9 : ǧ^9 := ǧ^8 + ǧ^5                      ϕ(g^8) = ǧ^9
10 : ȟ^10 := φ(ǧ^9)                       ϕ(h^9) = ȟ^10

Now to convert lines 10 to 14 of π:

11 : ǧ^11 := ∇_{h^9} L                     ϕ(g^10) = ǧ^11
12 : ȟ^12 := φ'(ǧ^9) ⊙ ǧ^11                ϕ(h^11) = ȟ^12
13 : Ǎ^13 := iid copy of W^{2⊤}            ϕ(A^12) = Ǎ^13
14 : ǧ^14 := Ǎ^13 ȟ^12    begin MatMul conversion;   ϕ_g(g^13) = ǧ^14
15 : ȟ^15 := ǧ^14 + a ȟ^4  end MatMul conversion;    ϕ(g^13) = ȟ^15
16 : ȟ^16 := φ'(ǧ^3) ⊙ ȟ^15                ϕ(h^14) = ȟ^16

where
$$a = \alpha\,\big(\mathbb{E}\, f^{\varphi(h^4)}(Z)^2\big)^{-1}\, \mathbb{E}\, f^{\varphi_g(g^7)}(Z)\, f^{\varphi(h^{11})}(Z) = \alpha\,\big(\mathbb{E}\, f^{\check h^4}(Z')^2\big)^{-1}\, \mathbb{E}\, f^{\check g^7}(Z)\, f^{\check h^{12}}(Z) = \alpha\,\big(\mathbb{E}\, \phi(Z^3)^2\big)^{-1}\, \mathbb{E}\, Z^7 \phi'(Z^9) Z^{11}$$

with $Z, Z' \sim \mathcal{N}(\check\mu^c, \check K^c)$ and $\alpha = \lim n_1(A^6)/n_2(A^6)$.

B.2. Batched input

For the second input y in the batch:

15 : g^15 := W^1 y         2nd input multiplied by (same) weight matrix
16 : g^16 := g^15 + g^2    using same bias as before
17 : h^17 := φ(g^16)       layer 1 activation
18 : g^18 := A^5 h^17      using same weight matrix
19 : g^19 := g^18 + g^6    using same bias
20 : h^20 := φ(g^19)       layer 2 activations

Gradients:

21 : g^21 := ∇_{h^20} L         last layer gradient for input y
22 : h^22 := φ'(g^19) ⊙ g^21    layer 2 preactivation gradient
23 : g^23 := A^12 h^22          layer 1 activation gradient; using same weights
24 : h^24 := φ'(g^16) ⊙ g^23    layer 1 preactivation gradient

B.3. Residual network

Style 1: resblock merges after weights

1 : g^1 := W^1 x              network input multiplied by weight matrix
2 : g^2 := b^1                resblock 1 bias
3 : g^3 := g^1 + g^2          resblock 1 preactivation
4 : h^4 := φ(g^3)             resblock 1 activation
5 : A^5 := W^2                resblock 1 merge weights
6 : g^6 := b^2                resblock 1 merge biases
7 : g^7 := A^5 h^4
8 : g^8 := g^1 + g^7 + g^6    return to main branch

Style 2: resblock merges before weights

1 : g^1 := W^1 x                     network input multiplied by weight matrix
2 : g^2 := b^1                       resblock 1 bias
3 : g^3 := g^1 + g^2                 resblock 1 preactivation
4 : h^4 := φ(g^3) + g^1              resblock 1 activation, merged back into main branch
5 : A^5 := W^2
6 : g^6 := b^2
7 : g^7 := A^5 h^4
8 : g^8 := g^7 + g^6                 resblock 2 preactivation
9 : h^9 := φ(g^8) + φ(g^3) + g^1     merge of 2nd resblock; semantically same as h^9 := φ(g^8) + h^4

B.4. Simple RNN

This is almost the same as the feedforward case except we tie the weights across time, and have inputs for each time step.

1 : g^1 := h^0               hidden state at t = 0
2 : g^2 := b                 RNN bias
3 : A^3 := W                 RNN weights
4 : g^4 := U x^1 + a         affine transform of input at t = 1
5 : g^5 := A^3 g^1
6 : g^6 := g^5 + g^2 + g^4
7 : h^7 := φ(g^6)            hidden state at t = 1
8 : g^8 := U x^2 + a         affine transform of input at t = 2, with same U and a
9 : g^9 := A^3 h^7
10 : g^10 := g^9 + g^2 + g^8
11 : h^11 := φ(g^10)         hidden state at t = 2

More advanced RNNs like GRU or LSTM can be similarly expressed.

B.5. Batchnorm, fully-connected

Let $\tilde\phi(h) = \phi((h - \bar h)/\mathrm{std}(h))$ be batchnorm followed by coordinatewise nonlinearity $\phi$, where $h \in \mathbb{R}^B$ should be interpreted as a single neuron across a batch, $\bar h = \frac{1}{B}\sum_{i=1}^B h_i$, and $\mathrm{std}(h) = \sqrt{\frac{1}{B}\sum_{i=1}^B (h_i - \bar h)^2}$. For example, let $x_1, \ldots, x_4$ be the batch of inputs.

1 : g^1 := W^1 x_1
2 : g^2 := W^1 x_2
3 : g^3 := W^1 x_3
4 : g^4 := W^1 x_4
5 : g^5 := b^1                          layer 1 bias
6 : g^6 := g^1 + g^5
7 : g^7 := g^2 + g^5
8 : g^8 := g^3 + g^5
9 : g^9 := g^4 + g^5
10 : h^10 := φ̃(g^6, g^7, g^8, g^9)_1    layer 1 input 1 activations
11 : h^11 := φ̃(g^6, g^7, g^8, g^9)_2    layer 1 input 2 activations
12 : h^12 := φ̃(g^6, g^7, g^8, g^9)_3    layer 1 input 3 activations
13 : h^13 := φ̃(g^6, g^7, g^8, g^9)_4    layer 1 input 4 activations

The transformer without layernorm (in particular, the softmax and self-attention mechanism) can be expressed in a similar way.
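As a sanity check on the definition of $\tilde\phi$ above, the following numpy sketch (not from the paper) applies batchnorm-then-φ across a hypothetical batch of B = 4 inputs; each column of the array is one neuron across the batch, matching the interpretation of (g^6, ..., g^9) above.

```python
import numpy as np

def phi_tilde(gs, phi=np.tanh):
    """Batchnorm followed by coordinatewise phi.

    gs has shape (B, n): row a is the G-var g^{6+a}, so column i is one
    neuron across the batch of size B. Returns an array of the same shape,
    whose rows play the role of h^10, ..., h^13.
    """
    mean = gs.mean(axis=0, keepdims=True)                          # h-bar, per neuron
    std = np.sqrt(((gs - mean) ** 2).mean(axis=0, keepdims=True))
    return phi((gs - mean) / std)

rng = np.random.default_rng(0)
B, n = 4, 1000
gs = rng.standard_normal((B, n))   # stand-ins for g^6, g^7, g^8, g^9
hs = phi_tilde(gs)
```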

B.6. Convolution

For any convolution weights $\{W^l_{\beta ij}\}_{\beta \in \mathrm{ker},\, i \in [c'],\, j \in [c]}$, $W^l_\beta$ is a dense matrix. Suppose $x$ is an image input to the network with $s$ pixels and $c$ channels, $\{x_{\alpha j}\}_{\alpha \in \mathrm{pos},\, j \in [c]}$, so that $x_\alpha$ is a vector of dimension $c$. Then the $\alpha$th pixel, across all channels, of the convolution $W^l$ applied to $x$ can be described as

$$\sum_{\beta \in \mathrm{ker}} W^l_\beta\, x_{\alpha+\beta}.$$

Define
$$\tilde x_{\alpha i} := \sum_{\beta \in \mathrm{ker}} \sum_{j} W^1_{\beta ij}\, x_{\alpha+\beta,\, j}.$$

For demonstration, assume ker = {0, 1} and pos = [3] and the convolution is circular, and for simplicity assume we don’t have biases.

1 : g^1 := W^1_0 x_1
2 : g^2 := W^1_0 x_2
3 : g^3 := W^1_0 x_3
4 : g^4 := W^1_1 x_1
5 : g^5 := W^1_1 x_2
6 : g^6 := W^1_1 x_3
7 : g^7 := g^1 + g^5        layer 1 pixel 1 preactivations
8 : g^8 := g^2 + g^6        layer 1 pixel 2 preactivations
9 : g^9 := g^3 + g^4        layer 1 pixel 3 preactivations
10 : h^10 := φ(g^7)         layer 1 pixel 1 activations
11 : h^11 := φ(g^8)         layer 1 pixel 2 activations
12 : h^12 := φ(g^9)         layer 1 pixel 3 activations
13 : A^13 := W^2_0
14 : A^14 := W^2_1          layer 2 weights
15 : g^15 := W^2_0 h^10
16 : g^16 := W^2_0 h^11
17 : g^17 := W^2_0 h^12
18 : g^18 := W^2_1 h^10
19 : g^19 := W^2_1 h^11
20 : g^20 := W^2_1 h^12
21 : g^21 := g^15 + g^19    layer 2 pixel 1 preactivations
22 : g^22 := g^16 + g^20    layer 2 pixel 2 preactivations
23 : g^23 := g^17 + g^18    layer 2 pixel 3 preactivations
24 : h^24 := φ(g^21)        layer 2 pixel 1 activations
25 : h^25 := φ(g^22)        layer 2 pixel 2 activations
26 : h^26 := φ(g^23)        layer 2 pixel 3 activations
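The following numpy sketch (not from the paper) verifies the bookkeeping of this program for the assumed setting ker = {0, 1}, pos = [3], circular convolution, no biases: the α-th output pixel Σ_β W_β x_{α+β} coincides with the pairings in lines 7–9.

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, s = 3, 5, 3                          # channels in/out, number of pixels
W0, W1 = rng.standard_normal((2, c_out, c_in))    # dense matrices W_0^1, W_1^1
x = rng.standard_normal((s, c_in))                # pixels x_1, x_2, x_3 as rows (0-indexed here)

# circular convolution: output pixel alpha is W0 x_alpha + W1 x_{alpha+1}
conv = np.stack([W0 @ x[a] + W1 @ x[(a + 1) % s] for a in range(s)])

# the same thing written out as in the tensor program above
g = [W0 @ x[0], W0 @ x[1], W0 @ x[2], W1 @ x[0], W1 @ x[1], W1 @ x[2]]
pre = np.stack([g[0] + g[4], g[1] + g[5], g[2] + g[3]])   # lines 7, 8, 9

assert np.allclose(conv, pre)
```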

C. Additional Notations

We will use teletype font g, h, A, etc., to denote variables or nodes as defined in the straightline program. The superscripts in this case, like g^l, will mean the line number associated to such a variable. In contrast, we will use normal font $g, h, A$, etc., to denote arbitrary variables of the correct type in a program (though sometimes we use $h$ to also denote a var of type H or G), but the superscripts, such as in $g^i$, are not attached to the semantics of the program. In either case, $g^l_k$ or $g^i_k$ will denote the scalar value at the $k$th position of $g^l$ or $g^i$. We write $\mathrm{line}(g)$ for the line number of the G-node $g$, so that $g = g^{\mathrm{line}(g)}$ (same for H-nodes and A-nodes). We write $n(g)$ for the dimension of a G-node, so that $n^{\mathrm{line}(g)} = n(g)$ (similarly for H-nodes). Similarly, $n_1(A)$ and $n_2(A)$ give the first and second dimensions of an A-node $A$. Let $\mathcal{G}_\pi$ be the collection of all G-nodes of a skeleton $\pi$, $\mathcal{H}_\pi$ be the collection of all H-nodes, and let $\mathcal{G}^{\mathrm{in}}_\pi$ be the collection of all input G-nodes. Sometimes we need to talk about all G-nodes on or before line $L$; we will use $\mathcal{G}^{\le L}_\pi$ to denote such a set. When $\pi$ is clear from context, we suppress the subscript $\pi$ for brevity.

Definition C.1. If $c$ is a CDC, then let $c^{\le m} \stackrel{\mathrm{def}}{=} \{g^l \in c : l \le m\}$ and $c^{<m} \stackrel{\mathrm{def}}{=} \{g^l \in c : l < m\}$.

Given a kernel $K : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ and subsets $\mathcal{X}, \mathcal{Y} \subseteq \mathbb{R}^m$, write $K(\mathcal{X}, \mathcal{Y})$ for the corresponding $|\mathcal{X}| \times |\mathcal{Y}|$ submatrix of $K$, and write $K|_{\mathcal{X}} = K(\mathcal{X}, \mathcal{X})$ for the restriction of $K$ to $\mathcal{X}$.

Given two random variables $X, Y$ and a $\sigma$-algebra $\mathcal{A}$, the notation $X|\mathcal{A} \stackrel{d}{=} Y$ or $X \stackrel{d}{=}_{\mathcal{A}} Y$ means that for any integrable function $\phi$ and for any random variable $Z$ measurable on $\mathcal{A}$, $\mathbb{E}\,\phi(X) Z = \mathbb{E}\,\phi(Y) Z$. We say that $X$ is distributed as (or is equal in distribution to) $Y$ conditional on $\mathcal{A}$. In case $\mathcal{A}$ is the trivial $\sigma$-algebra, we just write $X \stackrel{d}{=} Y$. The expression $X \xrightarrow{d} Y$ means $X$ converges to $Y$ in distribution. If random variables $X^t \xrightarrow{a.s.} X^\infty$, then we write $X^\infty \stackrel{a.s.}{=} \lim_{t\to\infty} X^t$.

Definition C.2. For a function $\Phi : \mathbb{R}^n \to \mathbb{R}^n$, we define
$$\mathrm{V}\Phi(\Sigma) \stackrel{\mathrm{def}}{=} \mathbb{E}[\Phi(z)\Phi(z)^\top : z \sim \mathcal{N}(0, \Sigma)]$$
for a PSD matrix $\Sigma$. When $\phi : \mathbb{R} \to \mathbb{R}$, we write $\mathrm{V}\phi$ to mean $\mathrm{V}$ applied to the function that acts coordinatewise by $\phi$.
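For a general Φ, VΦ is simply a Gaussian expectation, so it can be estimated by Monte Carlo; here is a small numpy sketch (not from the paper) for a coordinatewise nonlinearity.

```python
import numpy as np

def V(phi, Sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of V phi(Sigma) = E[phi(z) phi(z)^T], z ~ N(0, Sigma),
    for a coordinatewise nonlinearity phi and a PSD matrix Sigma."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=n_samples)
    fz = phi(z)
    return fz.T @ fz / n_samples

Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
print(V(np.tanh, Sigma))    # approximates V tanh(Sigma)
```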

D. Consequences

D.1. Warmup: MLP

We warm up by considering the GP correspondence, gradient dynamics, and NTK convergence of MLPs first. We define a fully-connected, feedforward neural network $f(x;\theta)$, $x \in \mathbb{R}^{n^0}$, as follows:
$$x^0(x) \stackrel{\mathrm{def}}{=} x$$
$$h^l(x) \stackrel{\mathrm{def}}{=} \frac{1}{\sqrt{n^{l-1}}} W^l x^{l-1}(x) + b^l$$
$$x^l(x) \stackrel{\mathrm{def}}{=} \phi(h^l(x))$$
$$f(x;\theta) \stackrel{\mathrm{def}}{=} \frac{1}{\sqrt{n^L}} v^\top x^L(x)$$
with $b^l \in \mathbb{R}^{n^l}$, $v \in \mathbb{R}^{n^L}$, and $W^l \in \mathbb{R}^{n^l \times n^{l-1}}$ for $l = 1, \ldots, L$. These form the parameters $\theta$ of $f$. (Note here that we follow prior notation and use $h$ for "preactivation," but it is in fact equivalent to a G-var.) We sample $W^l_{ij} \sim \mathcal{N}(0, (\sigma^l_w)^2)$, $v_i \sim \mathcal{N}(0, (\sigma^{L+1}_w)^2)$, and $b^l_i \sim \mathcal{N}(0, (\sigma^l_b)^2)$. We can think of this parametrization as "pulling out the $1/\sqrt{n}$ from Glorot initialization." This parametrization doesn't change the forward kernel $\Sigma^l$ (defined below), but it does change the scaling of gradients.

Define kernels $\Sigma^1, \ldots, \Sigma^{L+1} : (\mathbb{R}^{n^0})^2 \to \mathbb{R}$ by
$$\Sigma^1(x, x') \stackrel{\mathrm{def}}{=} (\sigma^1_w)^2 \frac{1}{n^0}\sum_{i=1}^{n^0} x_i x'_i + (\sigma^1_b)^2$$
$$\Sigma^l \stackrel{\mathrm{def}}{=} (\sigma^l_w)^2\, \mathrm{V}\phi(\Sigma^{l-1}) + (\sigma^l_b)^2$$
$$\Sigma^{L+1} \stackrel{\mathrm{def}}{=} (\sigma^{L+1}_w)^2\, \mathrm{V}\phi(\Sigma^L).$$
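For concreteness, the following numpy sketch (not from the paper) iterates this recursion on a pair of inputs for φ = ReLU, for which Vφ has the well-known closed form E[φ(u)φ(v)] = (√(Σ₁₁Σ₂₂)/2π)(sin θ + (π − θ) cos θ) with cos θ = Σ₁₂/√(Σ₁₁Σ₂₂); the variances σ_w, σ_b are placeholders and taken equal across layers.

```python
import numpy as np

def Vrelu(Sigma):
    """Closed-form V phi(Sigma) for phi = ReLU on a 2x2 PSD matrix."""
    s11, s22, s12 = Sigma[0, 0], Sigma[1, 1], Sigma[0, 1]
    c = np.clip(s12 / np.sqrt(s11 * s22), -1.0, 1.0)
    theta = np.arccos(c)
    off = np.sqrt(s11 * s22) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * c)
    return np.array([[s11 / 2, off], [off, s22 / 2]])

def forward_kernel(x, xp, L, sw=1.5, sb=0.1):
    """Iterate Sigma^1, ..., Sigma^{L+1} on the pair (x, x')."""
    n0 = len(x)
    Sigma = sw**2 / n0 * np.array([[x @ x, x @ xp], [x @ xp, xp @ xp]]) + sb**2
    for _ in range(2, L + 1):
        Sigma = sw**2 * Vrelu(Sigma) + sb**2
    return sw**2 * Vrelu(Sigma)     # Sigma^{L+1}

x, xp = np.array([1.0, -0.5, 0.2]), np.array([0.3, 0.8, -1.0])
print(forward_kernel(x, xp, L=4))
```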

For any parametrized function $f(x;\theta)$, the Neural Tangent Kernel can in general be defined as $K_\theta(x, x') = \langle \nabla_\theta f(x;\theta), \nabla_\theta f(x';\theta)\rangle$. In the case when $f(x;\theta)$ is defined as above, there is a scaling limit of $K_\theta$ when $\theta$ is randomized (Jacot et al., 2018). The "proof" given by Jacot et al. (2018) was a sketch and, most importantly, was silent about its application of the gradient independence assumption (used when applying induction). Below we give a formal proof of NTK convergence, but first we introduce a "gradient kernel." Suppose we take $n^1, \ldots, n^L \to \infty$ in such a way that $n^l/n^m \to \alpha_{l,m}$ for constants $\alpha_{l,m} \in (0, \infty)$. Then define $\Pi^l : (\mathbb{R}^{n^0})^2 \to \mathbb{R}$ by

$$\Pi^L(x, x') \stackrel{\mathrm{def}}{=} (\sigma^{L+1}_w)^2$$
$$\Pi^l \stackrel{\mathrm{def}}{=} \alpha_{l+1,l}\, \mathrm{V}\phi'(\Sigma^{l+1}) \odot \Pi^{l+1}.$$

Theorem D.1. Fix a finite set of inputs $\mathcal{X} \subseteq \mathbb{R}^{n^0}$. As $n^1, \ldots, n^L \to \infty$ with $n^l/n^m \to \alpha_{l,m}$, with parameters sampled as above,

$$f(\mathcal{X}; \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma^{L+1}|_{\mathcal{X}})$$
if the nonlinearity $\phi$ is $\alpha$-controlled with $\alpha < 2$; and

$$\frac{1}{n^l}\sum_{i=1}^{n^l} \psi\Big(\Big\{\sqrt{n^L}\,\frac{\partial f}{\partial x^l_i}(x)\Big\}_{x\in\mathcal{X}}\Big) \xrightarrow{a.s.} \mathbb{E}[\psi(\zeta) : \zeta \sim \mathcal{N}(0, \Pi^l|_{\mathcal{X}})], \quad \text{for all polynomially bounded } \psi,\ l \ge 1,$$

$$K_\theta|_{\mathcal{X}} \xrightarrow{a.s.} \sum_{l=1}^{L} \alpha_{l,L}\, \Pi^l \odot \mathrm{V}\phi'(\Sigma^l) \odot \frac{\Sigma^l + (\sigma^l_w)^2 - (\sigma^l_b)^2}{(\sigma^l_w)^2} + \Sigma^{L+1}/(\sigma^{L+1}_w)^2$$

$$\phantom{K_\theta|_{\mathcal{X}} \xrightarrow{a.s.}} = \sum_{l=1}^{L} \bigodot_{m=l}^{L} \mathrm{V}\phi'(\Sigma^m) \odot \frac{\Sigma^l + (\sigma^l_w)^2 - (\sigma^l_b)^2}{(\sigma^l_w)^2} + \Sigma^{L+1}/(\sigma^{L+1}_w)^2,$$

if the nonlinearity $\phi$ has a polynomially bounded weak derivative.

This theorem formally justifies the computations made in Poole et al. (2016); Schoenholz et al. (2017); Jacot et al. (2018).

Remark D.2. If we set $\sigma^l_w = \sigma^l_b = 1$ for all $l$, then we can recover the NTK recurrence relation given in Jacot et al. (2018). The $\beta$ factor in Jacot et al. (2018) on the bias can also be easily accounted for, but we will not consider such a parametrization here.
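As an illustration of what Theorem D.1 asserts (this computation is not from the paper), the NTK of the parametrization above already concentrates at moderate width: the sketch below evaluates K_θ(x, x') exactly for a one-hidden-layer network by writing out the parameter gradients, and two independent initializations give nearly the same value.

```python
import numpy as np

def ntk_pair(x, xp, n=5000, sw=1.5, sb=0.1, sv=1.0, seed=0):
    """Exact K_theta(x, x') for a one-hidden-layer MLP at a random initialization."""
    rng = np.random.default_rng(seed)
    n0 = len(x)
    W = sw * rng.standard_normal((n, n0))   # W_ij ~ N(0, sw^2); the 1/sqrt(n0) is explicit below
    b = sb * rng.standard_normal(n)
    v = sv * rng.standard_normal(n)
    phi, dphi = np.tanh, lambda z: 1 - np.tanh(z) ** 2

    def grads(inp):
        h = W @ inp / np.sqrt(n0) + b
        df_dv = phi(h) / np.sqrt(n)                 # gradient wrt readout weights
        df_dh = v * dphi(h) / np.sqrt(n)
        df_db = df_dh                               # gradient wrt biases
        df_dW = np.outer(df_dh, inp) / np.sqrt(n0)  # gradient wrt hidden weights
        return df_dW, df_db, df_dv

    gx, gy = grads(x), grads(xp)
    return sum(float(np.sum(u * w)) for u, w in zip(gx, gy))

x, xp = np.array([1.0, -0.5, 0.2]), np.array([0.3, 0.8, -1.0])
# two independent initializations give nearly the same kernel value once n is large
print(ntk_pair(x, xp, seed=0), ntk_pair(x, xp, seed=1))
```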

Proof. Since X is finite, it suffices to just consider two inputs x01 = x, x02 = x0. We can form a tensor program to model a fully-connected, feedforward network, with a batch of inputs. We construct it implicitly as follows, where we use superscripts in •(•) to denote semantically relevant quantities.

0 (i) 1 (l) l (l) l (1a) Pn (i) 0a Define the input vars g˜ := W:i, A := W for l = 2,...L, and g¯ := b for l = 1,...,L. Define g := i=1 g˜ xi for a = 1, 2, gˆ(la) := g(la) +g ¯(l) (this represents hl), h(la) := φ(ˆg(la)) (this represents xl), g(l+1,a) := A(l+1)h(la). For (l) (l) simplicity, we assume that n1(A ) 6= n2(A ) for all l, so that each “layer” belongs to a different CDC. Below, we write K for Kc where c is automatically understood based on the arguments; similarly we write µ for µc with c implied.

(l) (l) line(A )t l line(¯g )t l cint We have the corresponding sampling hyperparameters σ = σw, σ = σb, µ = 0 for all input G-vars, and cint (i) 1 0 2 (l) l 2 K is such that g˜j ∼ N (0, (σw/n ) ) and g¯j ∼ N (0, (σb) ), for all j in appropriate ranges. Then we can compute µ = 0 and

K(ˆg(la), gˆ(lb)) = Σl(x, x0) K(g(la), g(lb)) = Kc(ˆg(la), g(lb)) l 0 l 2 = Σ (x, x ) − (σb) (la) (l) l 2 K(ˆg , g¯ ) = (σb) and K(g, g0) = 0 for all other pairs of G-vars g, g0. Thus, by Thm 4.3, for any (< 2)-controlled ψ,

nl 1 X ψ(hl(x0a))ψ(hl(x0b)) nl i=1 nl 1 X = ψ(ˆg(la))ψ(ˆg(lb)) nl i=1 −−→a.s. ψ(z)ψ(z0) 0 E l (z,z )∼N (0,Σ |x,x0 )

24  l l 0  l Σ (x, x)Σ (x, x ) L where Σ | 0 = . Obviously, given h (x) for all x, f(x; θ) is a Gaussian process with kernel x,x Σl(x0, x)Σl(x0, x0) L 0 L+1 2 1 Pn L L 0 a.s. L+1 K(x, x ) = (σw ) nL i=1 φ(h (x))φ(h (x )) and mean 0. Since K −−→ Σ over the randomization of parameters in layer < L, we have that f itself is a Gaussian process with this kernel, in the limit. Now we think about backprop. We have ∂f = √1 v. We can thus extend the tensor program by g(La) := v, h(la) := g(la) φ0(ˆg(la)), g(l−1,a) := ∂xL nL √ √ (l)> (la) (l) ∂f L (l) ∂f L A h , for a = 1, 2 and for l = L − 1,..., 2. Here g represents ∂xl n and h represents ∂hl n . For brevity we just wrote A(l)> for an implicitly defined transposed var of A(l). We can compute K(g(la), g(lb)) = Πl(x, x0), and K(g, g0) = 0 for all other G-var pairs g, g0 not appearing in the original tensor program. L+1 2 (la) (la) Because v ∼ N (0, (σw ) ), and all g , h are odd in v (being linear in v), Thm 5.1 can be applied. Then, for l = 1,...,L,

nl,nl−1 nl,nl−1 1 X ∂f ∂f 1 X  1 ∂f   1 ∂f  (x) (x0) = √ (x)xl−1(x) √ (x0)xl−1(x0) nl ∂W l ∂W l nl l−1 ∂hl j l−1 ∂hl j i,j=1 ij ij i=j=1 n i n i

 nl  nl−1  1 X ∂f ∂f 0 X l−1 l−1 0 =  (x) (x )  x (x)x (x ) nl−1nl ∂hl ∂hl j j i=1 i i j=1

 nl  nl−1  1 X X = h(la)h(lb) h(l−1,a)h(l−1,b) nl−1nl     i=1 j=1 a.s. l 0 l l 0 l  l−1 l−1  −−→ E zaφ (za)zbφ (zb) E φ(za )φ(zb ) by Thm 5.1 l l 0 l 0 l l−1 l−1 = E zazb E φ (za)φ (zb) E φ(za )φ(zb ) l l l l l where (za, zb) ∼ N (0,K|g(la),g(lb) ), and independently (za, zb) ∼ N (0,K|ˆg(la),ˆg(lb) ). Thus the above is just [Π 0 l l l 2 l 2 0 Vφ (Σ ) (Σ − (σb) )/(σw) ](x, x ). On the other hand, 1 X ∂f ∂f 1 X ∂f ∂f (x) (x0) = (x) (x0) nl ∂bl ∂bl nl ∂hl ∂hl i i i i i i ! 1 X = h(la)h(lb) nl i a.s. l 0 l l 0 l −−→ E zaφ (za)zbφ (zb) l l 0 l 0 l = E zazb E φ (za)φ (zb) = [Πl Vφ0(Σl)](x, x0)

l l l l 1 P ∂f ∂f 0 1 P L L 0 a.s. L+1 0 L+1 2 where z , z , z , z are sampled as above. Also, L (x) (x ) = L h (x)h (x ) −−→ Σ (x, x )/(σ ) a b a b n i ∂vi ∂vi n i i i w as deduced before. Thus L l l 2 l 2 a.s. X Σ + (σ ) − (σ ) ΘL −−→ α Πl Vφ0(Σl) w b + ΣL+1/(σL+1)2 l,L (σl )2 w l=1 w L L−1 L−1 X Y K Σl + (σl )2 − (σl )2 = α α Vφ0(Σm+1) Vφ0(Σl) w b + ΣL+1/(σL+1)2 l,L m+1,m (σl )2 w l=1 m=l m=l w L L X K Σl + (σl )2 − (σl )2 = Vφ0(Σm) w b + ΣL+1/(σL+1)2 (σl )2 w l=1 m=l w

25 1 1> L Global mean pooling as readout layer. Now suppose that f(x; θ) = nL x (x). As in the proof above, we can construct a program π for computing f and its gradients over two inputs x01, x02:

(i) 1 (l) l (l) l Forward Define the input vars g˜ := W:i, A := W for l = 2,...L, and g¯ := b for l = 1,...,L. Define 0 (1a) Pn (i) 0a (la) (la) (l) l (la) (la) l g := i=1 g˜ xi for a = 1, 2, gˆ := g +g ¯ (this represents h ), h := φ(ˆg ) (this represents x ), g(l+1,a) := A(l+1)h(la). Backward We set g(La) := 1, h(la) := g(la) φ0(ˆg(la)), g(l−1,a) := A(l)>h(la), for a = 1, 2 and for l = L − 1,..., 2. (l) L ∂f (l) L ∂f Here g represents n ∂xl and h represents n ∂hl .

Now we construct the detransposition πˇ of π (see Appendix B.1.1 for a concrete example of detransposition). The forward part of πˇ is almost identical to that of π:

ˇ(i) 1 ˇ(l) l (l) l Forward Define the input vars g˜ := W:i, A := W for l = 2,...L, and g¯ˇ := b for l = 1,...,L. Define 0 (1a) Pn ˇ(i) 0a ˇ(la) (la) ˇ(l) l ˇ(la) ˇ(la) l gˇ := i=1 g˜ xi for a = 1, 2, gˆ :=g ˇ + g¯ (this represents h ), h := φ(gˆ ) (this represents x ), gˇ(l+1,a) := Aˇ(l+1)hˇ(la). Finally, we set ϕ(•) = •ˇ for all vars defined here.

(la) l l Here we have automatically simplified the detransposition of g produced from Defn 6.2, by identifying ϕ(g ) and ϕg(g ), which in this case are the same. Now the backward part

Backward Let Aˇ(l)0 be an input A-var sampled iid as A(l)>. We set hˇ(La) :=g ˇ(La) := 1 (representing nL times L (la) (la) (la) 0 ˇ(la) (l−1,a) (l−1,a) (l)0 (la) (l−1,a) the gradient at x ), ϕ(h ) = hˇ := hˇ φ (gˆ ), ϕg(g ) = g˜ˇ := Aˇ hˇ , ϕ(g ) = hˇ(l−1,a) := g˜ˇ(l−1,a) + a(l−1,a)h(l−1,a) for a = 1, 2 and for l = L − 1,..., 2, and for a(la) computed via the derivative rule of Thm 6.3. Specifically, we have, by a simple induction, (L−1,a) L 2 00 L a = αL,L−1(σw) E[φ (y): y ∼ N (0, Σaa)] (la) (l+1,a) l+1 2 0 (l+1,a) l+1 2 0 2 00 a = a αl+1,l(σw ) E ∂y(φ(y)φ (y)) = a αl+1,l(σw ) (E φ (y) + φ(y)φ (y)), l+1 with y ∼ N (0, Σaa ) The derivatives here should be interpreted as tempered distributions in general, testing against the Gaussian density of y. Note that if φ is odd, then φ00 is odd, so that aL−1,a = 0 = ala for all l < L − 1. If φ is ReLU, then φ00 is the Dirac L−1,a p L Delta tempered distribution at 0, so that a = 1/ 2πΣaa.

(la) (lb) d l (la) (lb) d l In the forward pass, (ˆg , gˆ )“ = ”N (0, Σ |a,b) as before. In the backward pass, (g˜ˇ , g˜ˇ )“ = ”N (0, Π |a,b), where

L−1 L 2 0 L Π |a,b = αL,L−1(σw) Vφ (Σ |a,b) l l+1 2 l+1 0 l+1 (l+1,a) (l+1,b) ⊗2 l+1 Π |a,b = αl+1,l(σw ) (Π |a,b Vφ (Σ |a,b) + (a , a ) Vφ(Σ |a,b)) and

nl 1 X (la) (lb) a.s. h h −−→ Πl Vφ0(Σl) + a(la)a(lb)V(φφ0)(Σl) nl i i ab ab ab i=1

Thus Corollary D.3. The NTK of the MLP above with global mean pooling readout layer converges a.s. to a.s. NTK(xa, xb) −−→ L−1 ΣL + (σL)2 − (σL)2 X Σl + (σl )2 − (σl )2 Vφ0(ΣL) ab w b + α Πl Vφ0(Σl) + a(la)a(lb)V(φφ0)(Σl)  ab w b ab σL l,L ab ab ab σl w l=1 w 26 Corollary D.4. If the readout layer is global mean pooling in an MLP and the last layer nonlinearity is odd, then the Gradient Independence Assumption can be applied to give the correct computation of the gradient covariance and the NTK.

Now that we have warmed up a little, we will work with tensor programs and apply Thms 4.3, 5.1 and 6.3 more informally.

D.2. Warmup: Semicircle Law

Thm 6.3 has enough power to rederive the semicircle law for the Gaussian Orthogonal Ensemble (Tao, 2012).

Definition D.5. The Gaussian Orthogonal Ensemble (GOE) is the sequence of matrices $(W_n)_{n\ge 0}$ defined as follows: let $X_{ij} \sim \mathcal{N}(0,1)$, iid, for all $i, j \in \mathbb{N}$. Then set $W_n \stackrel{\mathrm{def}}{=} \frac{1}{\sqrt{2n}}(X_{ij} + X_{ji})_{i,j\in[n]}$.

Definition D.6. The empirical spectral distribution (ESD) $\mu_{W_n}$ of $W_n$ is given by $\mu_{W_n} \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^n \delta_{\lambda_i(W_n)}$, where $\delta_x$ is the Dirac delta measure at $x$, and $\lambda_i$ are the eigenvalues in decreasing (in $i$) order.

Definition D.7. The semicircle law $\mu_{sc}$ is defined to be the distribution with density $\propto \sqrt{4 - x^2}$.

Definition D.8. A random distribution $\mu$ on $\mathbb{R}$, i.e. a random variable taking values in the space of probability distributions on $\mathbb{R}$, converges to a deterministic distribution $\mu^*$ almost surely if, for all compactly supported continuous functions $f$, $\mathbb{E}_{z\sim\mu} f(z) \xrightarrow{a.s.} \mathbb{E}_{z\sim\mu^*} f(z)$.

To prove that µWn converges almost surely to the semicircle law µsc, it suffices to compute the (polynomial) moments of

$\mu_{W_n}$ and show that they converge almost surely to the moments of $\mu_{sc}$ as $n \to \infty$ (Tao, 2012). It's well known that the odd moments of $\mu_{sc}$ are 0 and, for even $k$, $\mathbb{E}_{X\sim\mu_{sc}} X^k = C_{k/2}$, where $C_j$ is the $j$th Catalan number defined by

$$C_0 = 1, \qquad C_j = \sum_{i=0}^{j-1} C_i C_{j-1-i}.$$

Now,
$$\mathbb{E}_{X\sim\mu_{W_n}} X^k = \frac{1}{n}\,\mathbb{E}_{W_n} \operatorname{tr} W_n^k = \frac{1}{n}\,\mathbb{E}_{W_n}\,\mathbb{E}_{v\sim\mathcal{N}(0,I_n)} v^\top W_n^k v.$$

The latter expectation can be expressed as a tensor program: We couple the time index t to t = n.√ Define the A-var 1t 1t 1t 1t p 2 1t A to be an input var, with n1(A ) = n2(A ) = n, sampled Aij ∼ N (0, 1/ 2n (A )) = N (0, 1/ 2n), and define 2t 1t> 1t 2t 0 0 A = A . Thus A + A represents Wn. Set g := v, an input var, sampling gi ∼ N (0, 1). Inductively, set 0j 1 i−1 00j 2 i−1 j 0j 00j j j g := A g , g := A g , and g := g + g . Thus g represents Wnv (where the superscript j represents an j j 1 Pn k 1 Pn 0 k index in g while it represents an exponent in Wn). We aim to compute the limit of n i=1 vi(Wn v)i = n i=1 gi gi as n → ∞. This limit is prescribed by Thm 6.3. The detransposition πˇ of the above program can be described by the following: Define Aˇ1 := ϕ(A1), Aˇ2 := ϕ(A2) (so that they are independently sampled in πˇ), and

hˇ0 :=g ˇ0 := v = ϕ(g0), 0j ˇ1ˇj−1 0j gˇ := A h = ϕg(g ), 00j ˇ2ˇj−1 00j gˇ := A h = ϕg(g ), j−2 ˇ0j 0j X 0j ˇ00i 0j h :=g ˇ + a i h = ϕ(g ), i=0 j−2 ˇ00j 00j X 00j ˇ0i 00j h :=g ˇ + a i h = ϕ(g ), i=0 hˇj := hˇ0j + hˇ00j = ϕ(gj),

0j 00j hˇj hˇj where a i (resp. a i ) is computed by differentiating f via Thm 6.3, because it can be easily seen that f is always a 0r 00r j−1 hˇj Pj−1 j gˇ0r gˇ00r j gˇ0 linear function of {gˇ , gˇ }r=1. In fact, a symmetry argument shows that f (Z) = r=1 br(Z + Z ) + b0Z for 27 j j j some coefficients {br}r=0. An easy inductive argument shows that br satisfies the recurrence 0 b0 = 1, j ∀r 6∈ [0, j], br = 0, j−2 j X i j−1 ∀r < j, br = brbi+1 , i=r j bj = 1 These equations have the unique solution ( C if j − r is even bj = (j−r)/2 r 0 else.

Simultaneously, another easy inductive argument shows that µˇc = 0 and Kˇc(ˇg0, gˇj) = 0 for all j > 0. Thus Thm 6.3 yields, for Z ∼ N (µˇc,Kˇc),

n k−1 ! 1 1 X a.s. 0 X 0r 00r 0 tr(W k) = g0gk −−→ Zgˇ bk(Zgˇ + Zgˇ ) + bkZgˇ n n n i i E r 0 i=1 r=1 k gˇ0 2 = E b0 (Z ) ( C if k is even = bk = k/2 0 0 else. as desired.
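A quick numerical sanity check of this claim (not from the paper): the empirical moments of a single finite GOE matrix are already close to the Catalan numbers.

```python
import numpy as np
from math import comb

def catalan(j):
    return comb(2 * j, j) // (j + 1)

n = 2000
rng = np.random.default_rng(0)
X = rng.standard_normal((n, n))
Wn = (X + X.T) / np.sqrt(2 * n)          # GOE normalization as in Defn D.5
eig = np.linalg.eigvalsh(Wn)

for k in (2, 4, 6):
    print(k, np.mean(eig ** k), catalan(k // 2))   # empirical k-th moment vs C_{k/2}
```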

D.3. Warmup: Marchenko-Pastur Law

Suppose $Y^{(n)}_{ij} \sim \mathcal{N}(0,1)$ for all $i, j \in \mathbb{N}$, and $Y^{(n)} = (Y_{ij})_{i\in[m_n],\, j\in[n]}$ for a sequence $\{m_n\}_n$ satisfying $\lim_{n\to\infty} m_n/n = \alpha \in (0, \infty)$. The Marchenko–Pastur Law says that the spectral distribution of $\frac{1}{n} Y^{(n)} Y^{(n)\top}$ converges almost surely to $\mu_{mp}(\alpha)$, defined as
$$(1 - \alpha^{-1})_+\, \delta_0 + \frac{\sqrt{(b-x)(x-a)}}{\alpha 2\pi x}\, \mathbb{I}_{[a,b]}(x)\, dx$$
where $(r)_+ = r$ if $r > 0$ and $0$ else, and $a = (1 - \sqrt\alpha)^2$ and $b = (1 + \sqrt\alpha)^2$. We can again show this via the moment method.

We define a tensor program $\pi$ as follows. Couple $n = t$. Let $A := Y^{(n)}/\sqrt{n}$ be an input A-var and let $A' := A^\top$. Let $g^0 := v \sim \mathcal{N}(0, I_{m_n}) = \mathcal{N}(0, I_{n(g^0)})$ be an input G-var. Define recursively
$$\tilde g^i := A' g^{i-1}, \qquad g^i := A \tilde g^i.$$

0t a.s. 1 > k a.s. 1 > > k a.s. 1 Pn(g ) 0t kt We seek to compute lim tr(AA ) = lim v (AA ) v = lim 0t g g . n mn n mn Ev∈N (0,Imn ) t→∞ n(g ) i=1 i i ˇ ˇ0 0 The detransposition πˇ of the above, is as follows.√ Set A, A to be input A-vars, corresponding to ϕ(A), ϕ(A ), and sampled iid as such, so that σ∞(Aˇ) = 1 and σ∞(Aˇ0) = α. Then define ˇ0 0 h :=g ˇ = v ∼ N (0,Imn ) i ˇ0ˇi i g˜ˇ := A h = ϕg(g ) i ˇˇi−1 i g˜ˇ := Ah = ϕg(g ) i−1 ˇi ˇi X i ˇj i h := g˜ + ajh = ϕ(g ) j=0 i−1 ˇi ˇi X i ˇj i h := g˜ + ajh = ϕ(g ) j=0

28 i i where aj (resp. aj) is computed through the derivative rule of Thm 6.3. By a simple inductive argument, we see that we can express

i−1 i−1 ˇi X i j ˇi X i j h = bjgˇ , h = bjgˇ j=0 j=0

i i for some set of coefficients {bj, bj}i≥j≥0. Then it’s easy to see that they satisfy the recurrence

0 0 b0 = b0 = 1, i−1 i X i−1 k i ∀j < i, bj = bk bj , bi = 1, k=j i i X i k−1 i ∀j < i, bj = α bkbj , bi = 1. k=j+1

We claim that the solution to these equations is given by

i bj = Mi−j, i i ∀j < i, bj = αMi−j, bi = 1,

r where Mr = E[x : x ∼ µmp(α)]. It suffices to verify the following Catalan-like identity

s−2 X Ms = α MrMs−1−r + (1 + α)Ms−1. (5) r=1

r This can be done by the change of variable in the integral of E[x : x ∼ µmp(α)] to get

r √ r−1 E[x : x ∼ µmp(α)] = E[( αy + 1 + α) : y ∼ µsc] b(r−1)/2c X r − 1 = αk(1 + α)r−1−2k C 2k k k=0 where Ck is the kth Catalan number. Then one can verify Eq. (5) by expanding and applying the Catalan identity Pk−1 Ck = i=0 CiCk−1−i repeatedly. Finally, an easy inductive argument shows that µˇc = 0 and Kˇc(ˇg0, gˇj) = 0 for all j > 0. Thus, we have

n(g0t) n(ˇg0t) a.s. a.s. a.s. 1 > k 1 X 0t kt 1 X 0tˇkt k gˇ0 2 k lim tr(AA ) = lim gi gi = lim E gˇi hi = E b0 (Z ) = b0 = Mk n m t n(g0t) t n(ˇg0t) Z∼N (µˇc,Kˇc) n i=1 i=1 as desired.
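Similarly, a numerical sanity check (not from the paper): the Catalan-like recursion (5) reproduces the empirical moments of (1/n)YY^⊤ for a finite Gaussian Y; the aspect ratio α below is arbitrary.

```python
import numpy as np

def mp_moments(alpha, kmax):
    """Moments M_1..M_kmax of the Marchenko-Pastur law via the Catalan-like recursion (5)."""
    M = [1.0, 1.0]                        # M_0 = M_1 = 1
    for s in range(2, kmax + 1):
        M.append(alpha * sum(M[r] * M[s - 1 - r] for r in range(1, s - 1))
                 + (1 + alpha) * M[s - 1])
    return M

alpha, n = 0.5, 2000
m = int(alpha * n)
rng = np.random.default_rng(0)
Y = rng.standard_normal((m, n))
eig = np.linalg.eigvalsh(Y @ Y.T / n)

M = mp_moments(alpha, 4)
for k in (1, 2, 3, 4):
    print(k, np.mean(eig ** k), M[k])     # empirical k-th moment vs M_k
```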

D.4. DNN-GP correspondence Suppose ρ = F (z; θ) is the part of a neural network that takes an input embedding z = (z1, . . . , zB) and produces a representation ρ = (ρ1, . . . , ρm) of it. For example, z can be Ax for an input x and an embedding matrix A, or 1 B i B 1 B (Ax , . . . , Ax ) for a sequence/batch of inputs (x )i=1 (say when x , . . . , x is a sequence of tokens to be processed by an RNN, or when they form a batch, perhaps to be processed by batchnorm), or (A1x1,...,ABxB) when they form the pixel vectors across the channels of an input image in the case of CNN, perhaps in combination with RNNs/batchnorm. Similarly, ρ can be a vector representation in a MLP or a sequence/batch of vector representations in the case of RNN/batchnorm/CNN. The neural network then converts ρ to an output via some linear transformation, say ρ 7→ (v1>ρ1, . . . , vm>ρm), where each vi is a vector of appropriate size, and vi is allowed to equal to vj whenever they have the same shape. Note that this scenario is general enough to cover simultaneous computation of a neural network on a batch of input, where ρ can be partitioned into the corresponding representations of each parallel output. 29 Suppose F (z; θ) can be represented by a tensor program π where z and θ appear as input G- and A-vars; let the output (ρ1, . . . , ρm) be represented by H- or G-vars h1, . . . , hm of π. When the input embedding is linear, z = (A1x1,...,ABxB), and its matrices A1,...,AB are sampled from zero mean Gaussian distributions, (z1, . . . , zB) is jointly Gaussian with a covariance depending on pairwise products between x1, . . . , xB. Furthermore, if θ is randomized by Gaussians according to Section 3.2 (with some set of compatible sampling hyperparameters σl, µcin , etc), then by Thm 6.3 we get Corollary D.9 (DNN-GP correspondence). If all fl of π are polynomially bounded and almost sure rank convergence 1 i> j a.s. i j l holds for πˇ, then n(hi) h h −−→ Cij for some PSD matrix C, whenever n(h ) = n(h ), as the dimensions {n }l of π go to infinity. The kernel C can be computed via Thm 6.3. A fortiori, if each vi of the readout layer is sampled from N (0, 1/n(vi)), where for each i 6= j, either vi = vj or vi is independent from vj, then the neural network output (v1>ρ1, . . . , vm>ρm) −→Nd (0,C0) in this limit, for ( C if vi = vj C0 = ij ij 0 otherwise.

For example,

1. if z = (Ax1, . . . , AxB) is just a batch of inputs where each Axi is processed by the same neural network f in parallel, and the network outputs (v>ρ1, . . . , v>ρm) for readout weights v, then Cor D.9 says f converges to a Gaussian Process in distribution in the infinite width limit.

ij BS 2. if z = (Ax )i=j=1 represents the embedding of a batch of B sequences of length S, and the network is an RNN that processes each sequence in parallel, in a seq2seq fashion, then Cor D.9 says f converges to a (multivariate) Gaussian Process in distribution in the infinite width limit.

3. we obtain similar GP convergence results for any standard architecture.

D.5. Gradient Independence Assumption Thm D.1 already shows that gradient independence assumption leads to the correct computation for MLPs. In general, if, as before, F (z; θ) is the body of the network that takes an input embedding to a representation, and it can be represented by a tensor program π having no line of typeT, then backprop can be represented by an extended program as in Section 5 with vi being readout layer weights. Thus, if vi are sampled with zero mean and all nonlinearities have polynomially bounded weak derivatives, then Thm 5.1 applies and we can compute the gradient dynamics by computing Kc and µc according to Section 5, which allows us to pretend that the G-vars in π are independent from the weights used in the backward pass. This is in particular true if F (z; θ) has a standard architecture without batchnorm (with no transposed weight sharing in the forward pass). Batchnorm is not covered by our theorems because its gradient has singularities, for example at the origin. However, based on the simulations of Yang et al.(2018), Thm 5.1 seems to hold even when batchnorm is involved.

Singular value distribution. Let F (z; θ) be as above. Denote by J its Jacobian in z. Pennington et al.(2017) applied free probability theory to compute the eigenvalues of JJ > and hence of the singular value of J, when F (z; θ) represents a MLP. Thus J can be expressed as DLW L ··· D2W 2D1W 1 for weight matrices W l for each layer l and diagonal matrices Dl = Diag({φ0(hl+1)}). Specifically, the authors compute the Stieljes transform of JJ > and then its S-transform by leveraging the latter’s compatibility with matrix multiplication. Crucial in this computation is the assumption that Dl and W l are asymptotically free, allowing the application of S-transform. We now justify this assumption. The Stieljes and S-transform methods can be thought of a more nicely packaged way of applying the moment method (Tao, > k > k l 2012), i.e. computing tr((JJ ) ) for each k. Thus it suffices to show that, in the computation of tr((JJ ) ), {D }l can l be thought of as independent of {W }l. > k > > k > k Now tr((JJ ) ) = Ea∼N (0,I) a (JJ ) a. The computation (JJ ) a can be expressed with a tensor program: If π represents the computation of F (forward pass), π0 represents J >, i.e. backprop from gradient vector a (so that π˜ = π|π0 is an extended program of the form described in Section 5), and π00 represents J, then (JJ >)ka is given by the output of πˆ = π|(π0kπ00k · · · kπ0kπ00). Here, | denotes concatenation and k denotes “piping”, so that the output of ρ is inserted as the 30 > > k input of τ in ρkτ. Then lim Ea∼N (0,I) a (JJ ) a can be computed via Thm 6.3. Finally, it only remains to notice that K(g, h0) = 0 for any G-var of π and H-var (of G-var) of (π0kπ00k · · · kπ0kπ00) other than a because h0 is always odd in a a.s. > > k (apply the same reasoning from proof of Thm 5.1). Thus lim Ea∼N (0,I) a (JJ ) a has the same limit as if the A-vars of π are independent from the rest of πˆ. i j In fact, this reasoning, applied to mixed moments, establishes that {D }i ∪ {W }j are almost surely asymptotically free. l l l0 Corollary D.10. In the MLP above, let its hidden layer widths {n }l go to infinity such that n /n → αl,l0 ∈ (0, ∞) for l l l> some constants αl,l0 . Then, for X1,...,Xk chosen from {D ,W ,W }l such that the sizes match and X1 ··· Xk is a square matrix,

1 1 tr (X ··· X ) − tr (ϕ(X ) ··· ϕ(X )) −−→a.s. 0, nL 1 k nL 1 k where ϕ(W l) = W l and ϕ(Dl) = an iid copy of Dl, independent from all other values of ϕ.

This corollary is sufficient to justify the Stieljes transformation calculations of (Pennington et al., 2017), and show that the singular value distributions converge to their limits, almost surely. More generally, even with weight tying and arbitrary architecture, we can compute the singular value distribution of the neural network Jacobian, by expressing the moment computations as tensor programs, just like the above, and crank the machinery of Thm 6.3. Appendix D.3 can be thought of the most basic such case of linear regression.

D.6. Signal Propagation

We begin by examining the simple RNN and the weight-tied autoencoder, before reviewing some mean field equations that appeared in prior literature, which can be justified rigorously. Finally, we close by looking at the weight-tied residual network, which is perhaps the simplest "RNN" where the weight-tying leads to a different behavior than not tying the weights (in contrast to the simple RNN vs MLP).

Simple RNN A simple RNN that takes in input only at time 1 and outputs only at time L can be thought of as an MLP with parameters tied across layers:

$$x^0(x) \stackrel{\mathrm{def}}{=} x, \qquad h^l(x) \stackrel{\mathrm{def}}{=} \frac{1}{\sqrt n} W x^{l-1}(x) + b, \qquad x^l(x) \stackrel{\mathrm{def}}{=} \phi(h^l(x)), \qquad \mathrm{RNN}(x;\theta) \stackrel{\mathrm{def}}{=} \frac{1}{\sqrt n} v^\top x^L(x)$$

for $W \in \mathbb{R}^{n\times n}$ and $x, v, b \in \mathbb{R}^n$. We will sample $W_{ij} \sim \mathcal{N}(0, \sigma_W^2/n)$, $b_i \sim \mathcal{N}(0, \sigma_b^2)$. The computation of $\{\mathrm{RNN}(x^i;\theta)\}_{i=1}^B$ over a batch of inputs can be expressed by a tensor program with no line of type T (essentially the program in Appendix B.1 but with weights and biases tied). By Thm 4.3, we can compute $K^c(h^l(x^i), h^{l'}(x^{i'})) = 0$ whenever $l \ne l'$, and that $K^c|_{\{h^l(x^i)\}_i} = \sigma_W^2\, \mathrm{V}\phi(K^c|_{\{h^{l-1}(x^i)\}_i}) + \sigma_b^2$. This is, of course, exactly the same as the $K^c$ computed if $W$ and $b$ are not tied across layers (see Appendix D). This therefore mathematically proves what the experiments of Chen et al. (2018) suggested.

Corollary D.11. Suppose $\phi$ is $\alpha$-controlled for $\alpha < 2$. Assume almost sure rank convergence. Then for any $\alpha$-controlled $\psi : \mathbb{R}^L \to \mathbb{R}$, as $n \to \infty$,

$$\frac{1}{n}\sum_{i=1}^n \psi\big(h^1_{\mathrm{RNN}}(x^j)_i, \ldots, h^L_{\mathrm{RNN}}(x^j)_i\big) - \frac{1}{n}\sum_{i=1}^n \psi\big(h^1_{\mathrm{MLP}}(x^j)_i, \ldots, h^L_{\mathrm{MLP}}(x^j)_i\big) \xrightarrow{a.s.} 0,$$

where $h^l_{\mathrm{RNN}}$ ($h^l_{\mathrm{MLP}}$) denotes the hidden state of the RNN (MLP), with the weights and biases in each model identically sampled.
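A minimal simulation (not from the paper; the constants and the test function ψ are arbitrary) illustrating Corollary D.11: per-coordinate averages of ψ over the weight-tied and untied hidden states agree at large width.

```python
import numpy as np

def hidden_states(x, L, n, tie_weights, sw=1.2, sb=0.3, seed=0):
    """h^1, ..., h^L for the weight-tied RNN (tie_weights=True) or the untied MLP."""
    rng = np.random.default_rng(seed)
    W = sw * rng.standard_normal((n, n)) / np.sqrt(n)   # 1/sqrt(n) scaling folded into W
    b = sb * rng.standard_normal(n)
    hs, xl = [], x
    for _ in range(L):
        if not tie_weights:                             # resample weights/biases per layer
            W = sw * rng.standard_normal((n, n)) / np.sqrt(n)
            b = sb * rng.standard_normal(n)
        h = W @ xl + b
        hs.append(h)
        xl = np.tanh(h)
    return np.stack(hs)                                 # shape (L, n)

n, L = 2000, 3
x = np.random.default_rng(1).standard_normal(n)         # the (fixed) time-1 input / first-layer input
psi = lambda H: np.mean(np.prod(np.abs(H), axis=0))     # an example test function of (h^1_i, ..., h^L_i)

print(psi(hidden_states(x, L, n, tie_weights=True)),
      psi(hidden_states(x, L, n, tie_weights=False)))
```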

Autoencoder A weight-tied autoencoder is described by the following equations (we follow Li & Nguyen (2018) here):

x0(x) def= x xl(x) def= W lσl−1(xl−1) + bl, ∀l ∈ {1,...,L} xˆL def= W L>φL(xL) + vL xˆl(x) def= W l>φl(ˆxl+1) + vl

0 l l−1 l l−1 for input x ∈ Rn , a set of weights W l ∈ Rn ×n , encoder biases bl ∈ Rn , decoder biases vl ∈ Rn , for l = 1, . . . , L. We also have the decoder and encoder activation functions φl : R → R, σl : R → R. The parameters are sampled iid according to

l 2 l−1 l 2 l 2 Wij ∼ N (0, σW l /n ), bi ∼ N (0, σbl ), vi ∼ N (0, σvl ).

We consider taking the limit where nl → ∞, ∀l, with nl/nl−1 → αl ∈ (0, ∞). Li & Nguyen(2018) proved a (forward) signal propagation theorem of the above weight-tied autoencoder that uses the L L following quantities. Define {τl}l=1 and {τ¯l}l=0 inductively:

2 1 0 2 2 2 2 τ¯ = kσ (x)k , τ¯ = τ + σ l , ∀l ∈ {1,...,L}, 0 n0 l l b 2 2 2 2 2 l−1 2 τ1 = σ 1 τ¯0 , τl = σ l σ (¯τl−1z) , ∀l ∈ {2,...,L} W W Ez

L+1 where z ∼ N (0, 1). Next define {γl, ρl}l=2 inductively:

1 L L 2 γL+1 = τ¯Lz1φ (¯τLz1), ρL+1 = φ (¯τLz1) , 2 zE zE τ¯L 1 1  q  1 l−1 l 2 l−1 l 2 2 γl = τ¯l−1z1φ α σ l γl+1σ (¯τl−1z1) + α σ l ρl+1 + σ l z2 , 2 z E,z W W v τ¯l−1 1 2  q 2 l−1 l 2 l−1 l 2 2 ρl = E φ α σW l γl+1σ (¯τl−1z1) + α σW l ρl+1 + σvl z2 , ∀l ∈ {L − 2,..., 2} z1,z2 where z1, z2 ∼ N (0, 1). By expressing the autoencoder computation on a single input x as a tensor program and applying Thm 6.3, we obtain a version of the main theorem of Li & Nguyen(2018) that assumes no smoothness of the nonlinearities and of test functions. If Xt,Y t ∈ Rn(t) are two sequences of random vectors in t, then write Xt =∼ Y t to mean that for any polynomially bounded 1 Pn(t) t 1 Pn(t) t ψ : R → R, n(t) i=1 ψ(Xi ) and n(t) i=1 ψ(Yi ) converge a.s. to the same limit, as n(t) → ∞. l l l Corollary D.12. Let the activation functions {σ , φ }l be polynomially bounded. Then in the limit {n }l → ∞ as described above,

l ∼ 1. x = N (0, τ¯lInl ), ∀l ∈ {1,...,L}. q l ∼ l 2 l−1 l 2 2 2. xˆ = α σW l γl+1σ (¯τl−1~z1) + α σW l ρl+1 + σvl z2, ∀l ∈ {2,...,L} where ~z1, ~z2 ∼ N (0,Inl−1 ) independently.

3. the autoencoder output xˆ satisfies q ∼ 0 1 2 0 1 2 2 xˆ = φ (α σW 1 γ2σ (x) + α σW 1 ρ2 + σv1 ~z2),

where ~z2 ∼ N (0,In0 ) independent of x.

Li & Nguyen(2018)’s main theorem is almost the same as this, except that 32 l l 2 1. Li & Nguyen(2018) requires σ to be nontrivial in the sense that for any τ > 0, Ez∼N (0,1) σ (τz) > 0. But this l l P is equivalent to saying that σ is not a.e. 0. Indeed, if σ (x) = i aihi(x) is its Hermite expansion in orthonormal l 2 P 2 2i Hermite basis hi, then Ez∼N (0,1) σ (τz) = i ai τ , which can be 0 for positive τ iff all ais vanish.

2. All nonlinearities σl, φl are required by Li & Nguyen(2018) to be globally Lipschitz (and hence linearly bounded). Here we only need them to be polynomially bounded.

3. The equivalence relation =∼ is defined differently in Li & Nguyen(2018). t ∼ t t t p There, X = Y if φt(X ) − E φt(Y ) −→ 0 for any sequence of uniformly pseudo-Lipschitz functions. A sequence of n(t) functions φt : R → R is said to be uniformly pseudo-Lipschitz if there exists a constant C, independent of n, such that for any x, y ∈ Rn(t),  kxk kyk kx − yk |φ (x) − φ (y)| ≤ C 1 + √ + √ √ . n n n n n

In contrast, the test functions φt we allow are coordinatewise functions — a stronger assumption than the above — but does not need to be smooth, just polynomially bounded — a weaker assumption than the above. We also guarantee almost sure convergence, a stronger result than their convergence in probability. It would be interesting in future work to study whether one can remove the smoothness assumption even for noncoordinatewise test functions.

D.6.1. JUSTIFYING SEMIRIGOROUS EQUATIONS

Below, we give several examples of signal propagation equations derived heuristically in prior works, which can now be justified rigorously using the tensor program framework.

MLP (Schoenholz et al., 2017) See Appendix D.1.

0 Residual Network (Yang & Schoenholz, 2017) We define a residual network f(x; θ), x ∈ Rn as follows

x0(x) def= x 1 hl(x) def= √ W lxl−1(x) + bl nl−1 1 xl(x) def= √ V lφ(hl(x)) + xl−1(x) + al nl 1 f(x; θ) def= √ w>xL(x) nL

l L l l l l−1 with al, bl ∈ Rn , w ∈ Rn , V l ∈ Rn ×n , and W l ∈ Rn ×n for l = 1,...,L. These form the parameters θ of f. We l l 2 l l 2 L+1 2 l l 2 l l 2 sample Wij ∼ N (0, (σw) ),Vij ∼ N (0, (σV ) ), wi ∼ N (0, (σw ) ), ai ∼ N (0, (σa) ), and bi ∼ N (0, (σb) ). Define 0 kernels ΣL+1 :(Rn )2 → R by

n0 def 1 X Σ˜ 0(x, x0) = x x0 n0 i i i=1 l def l 2 ˜ l−1 l 2 Σ = (σw) Σ + (σb) ˜ l def ˜ l−1 l 2 l 2 Σ = Σ + (σV ) Vφ(Σ ) + σa L+1 def L+1 2 L Σ = (σw ) Σ .

0 Then for any finite subset X ⊆ Rn , for α-controlled φ,

d L+1 f(X ; θ) −→N (0, Σ |X ). 33 Convolutional Network (Xiao et al., 2018) Consider a convolutional network

0 def xαi(x) = xαi

l def 1 X l l−1 l hαi(x) = √ Wβijx (x) + bi nl α+β,j j∈[nl−1] β∈[sl−1]

l def l xαi(x) = φ(hαi(x))

l l where hαi denotes the preactivation at the lth layer, the ith channel, each with s neurons, and the αth neuron, and likewise l l for xαi. n is the number of channels in layer l. l l l 2 l l l 2 Suppose we have a non-negative vector (vβ)β∈ker that sums up to 1 and Wβij ∼ N (0, (σW ) vβ), bβi ∼ N (0, (σb) ). By l l l def a.s. 1 Pn l l Thm 4.3, {hα•(xa)}α,a are “jointly Gaussian” in the limit. Define Σαa,βb = lim nl i=1 hαi(xa)hβi(xb), for any i. Then with ? denoting 2D circular cross correlation, Xiao et al.(2018) calculated, semirigorously,

l+1 l 2 l l l 2 Σ = (σW ) Diag(v ) ? Vφ(Σ ) + (σb) l+1 l 2 X l l l 2 Σαa,βb = (σW ) vγ Vφ(Σ )α+γ;a,β+γ;b + (σb) . γ∈[sl] These equations can now be recovered rigorously using Thm 4.3. Now suppose the last layer (layer L) is linear with output,

def 1 X L L−1 f(x; θ) = √ Wαixαi ∈ R nL−1sL−1 α∈[sL−1] i∈[nL−1]

L and the weights are sampled according to Wαi ∈ N (0, 1). Then, we can compute via Thm 4.3,   L−1 L−1  d 1 tr Vφ(Σ )•a,•a tr Vφ(Σ )•a,•b (f(xa; θ), f(xb; θ)) −→N 0, L−1 L−1 . sL−1 tr Vφ(Σ )•a,•b tr Vφ(Σ )•b,•b

l l def a.s. Pn ∂f ∂f Define, via Thm 5.1, Παa,βb = lim i=1 l (xa) l (xb), for any i. Then Xiao et al.(2018) essentially calculated, ∂xαi ∂xβi semirigorously, 1 ΠL−1 = (α = β). αa,βb sL−1 I and in all previous layers, the recurrence

l−1 l 2 l# 0 l l Π = (σw) Diag(v ) ? (Vφ (Σ ) Π ) where vl# is the reverse of vl. These equations can now be justified rigorously using Thm 5.1.   Batchnorm (Yang et al., 2018) Given φ : → , let B : B → B, z 7→ φ z−z¯√ . This is an application of R R φ R R kz−z¯k/ B batchnorm followed by coordinatewise action by φ, where z should be thought of as a fixed unit across a batch of size B.

n0 B×n0 B×1 If ~x = (xi, . . . , xB) is a batch of inputs xi ∈ R , then define a deep batchnorm network f(~x; θ): R → R by

~x0(~x) def= ~x

~ l def 1 l l−1 l B h (~x) = (√ W x (~x)i + b )i=1 nl−1 l def ~ l ~x (~x) = Bφ(h (~x))  B def 1 > L f(~x; θ) = √ w x (~x)i . L n i=1

34 l l l 2 L+1 2 Here B and n0 will be fixed and n → ∞ for l > 0. We sample Wij ∼ N (0, (σw) ), wi ∼ N (0, (σw ) ) and l l 2 l B×n0 2 B×B bi ∼ N (0, (σb) ). Define multivariate kernels Σ :(R ) → R by 1 Σ1(~x,~x0) def= (σ1 )2 x >x0 + (σ1)2 ij w n0 i i b l def l 2 l−1 l 2 Σ |{~x,~x0} = (σw) VBφ(Σ |{~x,~x0}) + (σb) L+1 def L+1 2 L Σ |{~x,~x0} = (σw ) VBφ(Σ |{~x,~x0}).

0 Then for any finite set of batches X ⊆ RB×n , for α-controlled φ,

d L+1 f(X ; θ) −→N (0, Σ |X ).

Yang et al.(2018) also calculated the gradient dynamics of such a deep batchnorm network, but our theorems cannot rigorously justify them due to the singularity of the Jacobian of batchnorm.

D.6.2. A TASTE OF WEIGHT-TYING

Weight-tied Residual Network The simplest "recurrent neural network" for understanding when weight-tying can have a different behavior than not is perhaps a residual network with weights tied across layers.

In this section, fix a matrix $W \in \mathbb{R}^{N\times N}$ and a function $\phi : \mathbb{R} \to \mathbb{R}$. Consider the dynamics
$$h^t = W\phi(h^{t-1}) + h^{t-1}, \qquad h^t \in \mathbb{R}^N.$$
What is the "average behavior" of this dynamics as $N \to \infty$, if we were to sample $W_{ij} \sim \mathcal{N}(0, \sigma_w^2/N)$? Thm 4.3 applies here when $\phi$ is $\alpha$-controlled, and it tells us that "$(h^t_i)_i$ are i.i.d. samples of a zero-mean Gaussian distribution, in the limit $N\to\infty$," as far as $\alpha$-controlled test functions are concerned. By Thm 4.3, we can make the following

Definition D.13. Define $K(l,m) \stackrel{\mathrm{def}}{=} \lim^{a.s.} \frac{1}{N}\sum_{i=1}^N h^l_i h^m_i$ and $C(l,m) \stackrel{\mathrm{def}}{=} \lim^{a.s.} \frac{1}{N}\sum_{i=1}^N h^l_i \sum_j W_{ij}\phi(h^m_j)$.

Theorem D.14. $K$ and $C$ satisfy the following equations in the limit $N\to\infty$:
$$K(l,m) = C(l, m-1) + K(l, m-1) \qquad (6)$$
$$\phantom{K(l,m)} = C(m, l-1) + K(m, l-1) \qquad (7)$$
$$\phantom{K(l,m)} = \sigma_w^2\, \mathrm{V}\phi(K^{l-1,m-1})_{12} + K(l-1, m-1) + C(l-1, m-1) + C(m-1, l-1) \qquad (8)$$
$$C(l,m) = C(l-1, m) + \sigma_w^2\, \mathrm{V}\phi(K^{l-1,m})_{12} \qquad (9)$$
where $K^{a,b}$ is the matrix $\begin{pmatrix} K(a,a) & K(a,b)\\ K(a,b) & K(b,b)\end{pmatrix}$. In addition, for all $m, l \ge 0$, $K(l,m) = K(m,l)$, $K(0,m) = K(m,0) = K(0,0)$, and $C(0,m) = 0$.

Proof. The identities at the end are obvious. We will focus on proving Eqs. (6) to (9). We have

N a.s. 1 X K(l, m) = lim hlhm N i i i=1 N   a.s. 1 X X = lim hl W φ(hm−1) + hm−1 N i  ij j i  i=1 j N N a.s. 1 X X a.s. 1 X = lim hl W φ(hm−1) + lim hlhm−1 N i ij j N i i i=1 j i=1 = C(l, m − 1) + K(l, m − 1)

35 which gives Eq. (6) and also Eq. (7) by symmetry. Now

N a.s. 1 X X C(l, m) = lim hl W φ(hm) N i ij j i=1 j N N a.s. 1 X X a.s. 1 X X = lim W φ(hl−1)W φ(hm) + lim hl−1 W φ(hm) N ik k ij j N i ij j i=1 j,k i=1 j N a.s. 1 X X = lim W φ(hl−1)W φ(hm) + C(l − 1, m) N ik k ij j i=1 j,k N a.s. 1 X X = lim W 2 φ(hl−1)φ(hm) + C(l − 1, m) N ij j j i=1 j 2 l−1,m = σwVφ(K )12 + C(l − 1, m) yielding Eq. (9). Finally, Eq. (8) is given by expanding C(l, m − 1) by Eq. (6) and expanding K(l, m − 1) by Eq. (7).

One can see immediately that the growth of hl norm is much faster here than for untied-weights residual network.

We now study the simultaneous evolution of two vectors hl and ~l. def a.s. 1 PN l m def def Definition D.15. Define Kh~(l, m) = lim N i=1 hi~i = K~h(m, l) and Ch~(l, m) = a.s. 1 PN l P m def a.s. 1 PN l P m lim N i=1 hi j Wijφ(~j ), C~h(l, m) = lim N i=1 ~i j Wijφ(hj ). Theorem D.16.

Kh~(l, m) = Ch~(l, m − 1) + Kh~(l, m − 1) (10)

= C~h(m, l − 1) + K~h(m, l − 1) (11) 2 l−1,m−1 = σ Vφ(K )12 + Kh (l − 1, m − 1) + Ch (l − 1, m − 1) + C h(m − 1, l − 1) (12) w h~ ~ ~ ~ 2 l−1,m Ch (l, m) = Ch (l − 1, m) + σ Vφ(K )12 (13) ~ ~ w h~

K (l − 1, l − 1) K (l − 1, m) where Kl−1,m is the matrix hh h~ . h~ Kh~(l − 1, m) K~~(m, m)

Proof.

N a.s. 1 X K (l, m) = lim hl m h~ N i~i i=1 N   a.s. 1 X X = lim hl W φ( m−1) + m−1 N i  ij ~j ~i  i=1 j N N a.s. 1 X X a.s. 1 X = lim hl W φ( m−1) + lim hl m−1 N i ij ~j N i~i i=1 j i=1

= Ch~(l, m − 1) + Kh~(l, m − 1)

36 N a.s. 1 X X C (l, m) = lim hl W φ( m) h~ N i ij ~j i=1 j N N a.s. 1 X X a.s. 1 X X = lim W φ(hl−1)W φ( m) + lim hl−1 W φ( m) N ik k ij ~j N i ij ~j i=1 j,k i=1 j 2 l−1,m = σ Vφ(K )12 + Ch (l − 1, m) w h~ ~

t t T t Now for the backward pass. Define h := ∇ht E, g := W h for a loss function E. Then

ht−1 = ht + (W T ht) ◦ φ0(ht−1)

So

N N a.s. 1 X a.s. 1 X lim hsht = lim hs+1ht + gs+1φ0(hs)ht N i i N i i i i i i=1 i=1 N N a.s. a.s. 1 X t s 1 X X t t lim g g = lim W h W 0 h 0 N i i N ji j j i j i=1 i=1 j,j0 N a.s. 1 X = σ2 lim ht ht w N k k i=1 N N a.s. 1 X a.s. 1 X lim htgsφ0(hr) = lim ht+1gsφ0(hr) + gt+1gsφ0(ht)φ0(hr) N i i i N i i i i i i i i=1 i=1 N N ! N ! a.s. 1 X a.s. 1 X a.s. 1 X = lim ht+1gsφ0(hr) + σ2 lim ht+1hs lim φ0(ht)φ0(hr) N i i i w N i i N i i i=1 i=1 i=1

Suppose we backprop a zero mean Gaussian vector with normalized norm 1. For a weight-tied residual network that runs S steps, we have boundary conditions

N a.s. 1 X lim hShS = 1 N i i i=1 N a.s. 1 X lim hS+1ht = 0, ∀t N i i i=1 N a.s. 1 X lim hSgsφ0(hr) = 0, ∀s, r N i i i i=1

These equations then yield the dynamics of gradients in a weight-tied residual network.

D.7. Neural Tangent Kernel Again, let F (z; θ) be the body of a neural network as above, and suppose it’s represented by a tensor program π. For every ∂F P ∂F P ∂F input A-var A of π, ∂A = g:=Ah ∂g ⊗ h + g:=A>h h ⊗ ∂g , where the sums are over vars satisfying the subscripts in the sums. If the network has scalar output is given by f(x) = v>F (E(x); θ) where E is an embedding function, then the 37 contribution of A to the NTK of f is

def X X ∂F ∂F X NTK (x, y) = (v> (E(x))) (v> (E(y))) h (E(x))h (E(y)) A ∂g i ∂g i j j g:=Ah i j X X X ∂F ∂F + h (E(x))h (E(y)) (v> (E(x))) (v> (E(y))) . (14) i i ∂g j ∂g j g:=A>h i j Each of the four subsums in Eq. (14) can be computed via Thm 6.3 (or Thm 5.1 if π doesn’t useT lines) when we expand the computation of f over x and y as well as its gradients into a single tensor program. Note that Eq. (14) scales like (nl)2; dividing by this factor roughly corresponds to using the parametrization of Jacot et al.(2018). The contribution to NTK from input G-vars is similar and even simpler to compute. The above computation would hold as long as we can apply Thm 6.3, which requires that we have almost sure rank convergence of the relevant programs and that the nonlinearities of f have polynomially bounded weak derivatives. We give an example by computing the NTK of a CNN (which has not appeared in prior literature).

CNN Assume the notation of the CNN section of Appendix D.6.1. l The contribution to the NTK of weights Wβij for l < L is

nl,nl−1 a.s. X X ∂f ∂f lim (x ) (x ) ∂W l a ∂W l b β∈ker i,j=1 βij βij nl,nl−1 a.s. 1 X X X ∂f l−1 X ∂f l−1 = lim l−1 ( l (xa)xα+β,j(xa))( l (xb)xα0+β,j(xb)) n ∂h ∂h 0 β∈ker i,j=1 α αi α0 α i

nl,nl−1 2k+1 1 X X X l l 0 l−1 = Π 0 Vφ(Σ ) 0 Vφ (Σ ) 0 nlnl−1 αa,α b αa,α b α+β;a,α +β;b β∈ker i,j=1 α,α0=1 2k+1 X l l X 0 l−1 = Παa,α0bVφ(Σ )αa,α0b Vφ (Σ )α+β;a,α0+β;b α,α0=1 β∈ker l l 0 l−1 = hΠ•a,•b Vφ(Σ•a,•b),I? Vφ (Σ•a,•b)i

l l 2 l q l l Note that if we sample Wβij ∼ N (0, (σw) ) and replace Wβij with vβWβij, then in the above expression we replace I with Diag(vl). l Similarly, the contribution of bi for l < L is

nl a.s. X ∂f ∂f lim (x ) (x ) ∂bl a ∂bl b i=1 i i nl a.s. X X ∂f X ∂f = lim ( l (xa))( l (xb)) ∂h ∂h 0 i=1 α αi α0 α i X l 0 l = Παa,α0bVφ (Σ )αa,α0b α,α0 11T l 0 l = h , Π•a,•b Vφ (Σ )•a,•bi l 0 l = hΠ•a,•b, Vφ (Σ )•a,•bi

The last layer weights (in the linear layer setting) contribute a.s. 1 X 1 lim xL−1(x )xL−1(x ) = tr Vφ(ΣL−1) . nL−1sL−1 αi a αi b sL−1 •a,•b α,i

38 Therefore Corollary D.17. The NTK of the CNN defined above converges almost surely to

a.s. 1 X NTK(x , x ) −−→ tr Vφ(ΣL−1) + hΠl Vφ(Σl ),I? Vφ0(Σl−1 )i + hΠl , Vφ0(Σl) i a b sL−1 •a,•b •a,•b •a,•b •a,•b •a,•b •a,•b l

D.8. Approximate Message Passing

We follow Bayati & Montanari (2011); Berthier et al. (2017) for a brief introduction to Approximate Message Passing. Given an $n \times N$ matrix $A$, the compressed sensing problem asks for a way to reconstruct a (sparse) vector $x_0 \in \mathbb{R}^N$ from a (small) vector of linear observations $y = Ax_0 + w \in \mathbb{R}^n$. Here $w$ is a noise vector and $A$ is assumed to be known. The Approximate Message Passing algorithm (Donoho et al., 2009) starts with an initial guess $x^0 = 0$ and proceeds by

$$x^{t+1} = \eta_t(A^\top z^t + x^t), \qquad z^t = y - Ax^t + \alpha^t z^{t-1}$$

for an appropriate sequence of nonlinearities $\{\eta_t : \mathbb{R} \to \mathbb{R}\}_{t\ge 0}$ and $\alpha^t = \frac{1}{n}\sum_{i=1}^N \eta'_{t-1}\big((A^\top z^{t-1} + x^{t-1})_i\big) \in \mathbb{R}$. The algorithm succeeds if $x^t$ converges to a good approximation of $x_0$. Similar algorithms have been applied to robust regression (Donoho & Montanari, 2016), Bayesian estimation (Kamilov et al., 2012), low rank matrix recovery (Kabashima et al., 2016), phase retrieval (Schniter & Rangan, 2015), and community detection in graphs (Deshpande et al., 2017).

The behavior of the AMP algorithm is accurately described by a formalism called "state evolution" (SE), as $n, N \to \infty$ with constant ratio $n/N \to \delta \in (0, \infty)$, that bears some resemblance to the evolution of kernels in the GP correspondence of deep neural networks (see Section 2.1) and to the gradient dynamical equations in the signal propagation analysis of DNNs (see Section 2.2). SE was introduced in Donoho et al. (2009) and later suitably formalized and rigorously proved for random Gaussian $A$ and suitably smooth $\eta_t$ in Bayati & Montanari (2011). A more general version of the algorithm where $\eta_t : \mathbb{R}^N \to \mathbb{R}^N$ (instead of acting coordinatewise) was analyzed and similar SE equations proved in Berthier et al. (2017).

As a corollary to one of our main theorems, Thm 6.3, we show that, in the main theorem of Bayati & Montanari (2011), we can forgo smoothness assumptions on $\eta_t$ when each component of $x_0$ is sampled iid from a Gaussian. We'll work with the following more general version of AMP from Bayati & Montanari (2011). The algorithm is defined by two sequences of functions $\{f_t : \mathbb{R}^2 \to \mathbb{R}\}_{t\ge 0}$, $\{g_t : \mathbb{R}^2 \to \mathbb{R}\}_{t\ge 0}$. Given $w \in \mathbb{R}^n$, $x_0 \in \mathbb{R}^N$, define the sequence of vectors $h^t, q^t \in \mathbb{R}^N$ and $b^t, m^t \in \mathbb{R}^n$, by fixing the initial condition $q^0$, and obtaining $\{b^t\}_{t\ge 0}$, $\{m^t\}_{t\ge 0}$, $\{h^t\}_{t\ge 1}$, and $\{q^t\}_{t\ge 1}$ through

$$h^{t+1} = A^\top m^t - \xi_t q^t, \qquad m^t = g_t(b^t, w),$$
$$b^t = A q^t - \lambda_t m^{t-1}, \qquad q^t = f_t(h^t, x_0),$$

where $\xi_t = \frac{1}{N\sigma_t^2}\langle b^t, g_t(b^t, w)\rangle$ and $\lambda_t = \frac{1}{n\tau_{t-1}^2}\langle h^t, f_t(h^t, x_0)\rangle$,$^{11}$ and $\sigma_t$ and $\tau_t$ are defined via

$$\tau_t^2 \stackrel{\mathrm{def}}{=} \mathbb{E}\, g_t(\sigma_t Z, W)^2, \qquad \sigma_t^2 \stackrel{\mathrm{def}}{=} \frac{N}{n}\, \mathbb{E}\, f_t(\tau_{t-1} Z, X_0)^2, \qquad \text{where } Z \sim \mathcal{N}(0,1),\ W \sim \mathcal{N}(0, \sigma_w^2),\ X_0 \sim \mathcal{N}(0, \sigma_{x_0}^2)$$

for sampling hyperparameters $\sigma_w^2, \sigma_{x_0}^2$. By translating the above computation into a tensor program and applying Thm 6.3, we obtain

Corollary D.18. Let {q0(N)}N≥0 and {A(N)}N≥0 be resp. a sequence of initial conditions and a sequence of matrices n×N A ∈ R indexed by N with iid entries Aij ∼ N (0, 1/n). Assume n/N → δ ∈ (0, ∞). Consider the sequence of vectors {x (N), w(N)} whose empirical distributions converge weakly to N (0, σ2 ) and N (0, σ2 ). Suppose that the 0 N≥0 x0 w 2 functions ft and gt are polynomially bounded for all t. Then for any polynomially bounded function ψ : R → R and all 11Note that here we are using h, i to denote (unscaled) inner product, which is different from the usage of this notation in Bayati & Montanari(2011).

39 t ≥ 0,

N 1 X a.s. ψ(ht+1, x ) −−→ ψ(τ Z,X), N i 0,i E t i=1 n 1 X a.s. ψ(bt, w ) −−→ ψ(σ Z,W ), n i i E t i=1 as N → ∞, where X ∼ N (0, σ2 ) and W ∼ N (0, σ2 ) independent of Z ∼ N (0, 1). 0 x0 w This version differs from theorem 2 of Bayati & Montanari(2011) in the following ways

1 PN 0 t 1 Pn 0 t 1. Bayati & Montanari(2011) defined ξt = N i=1 gt(bi, wi) and λt = n i=1 ft(hi, x0,i), where the derivatives are taken against the first argument. This is asymptotically equivalent to our formulation here by Stein’s lemma Lem E.8. Our formulation has the benefit of being defined for gt and ft without weak derivatives.

2. We are requiring that the x0 and w have empirical distributions that converge to Gaussians; with a bit more effort, we can also prove a result that allow them to converge to any distribution with all moments. This is a much stronger assumption than Bayati & Montanari(2011), who only assume that the limit distributions have some finite number of bounded moments.

3. We don’t have any smoothness assumptions on the nonlinearities ft and gt, whereas Bayati & Montanari(2011) requires them to be Lipschitz. 4. We don’t have any smoothness assumptions on the test function ψ, whereas Bayati & Montanari(2011) requires them to be pseudo-Lipschitz of some order 12.

This concludes our discussion of various corollaries of our main theorems. We now turn to their proofs. First let us present the necessary lemmas.

E. Lemmas

E.1. The Conditioning Trick

We first recall the Moore–Penrose pseudoinverse and some of its properties.

Definition E.1. For $A \in \mathbb{R}^{n\times m}$, a pseudoinverse of $A$ is defined as a matrix $A^+ \in \mathbb{R}^{m\times n}$ that satisfies all of the following criteria:

• $AA^+A = A$
• $A^+AA^+ = A^+$
• $(AA^+)^\top = AA^+$
• $(A^+A)^\top = A^+A$

The following facts are standard

• If $A$ has real entries, then so does $A^+$.
• The pseudoinverse always exists and is unique.
• When $A$ is invertible, $A^+ = A^{-1}$.
• $(A^\top)^+ = (A^+)^\top$, which we denote as $A^{+\top}$.

${}^{12}$A function $f : \mathbb{R}^s \to \mathbb{R}$ is pseudo-Lipschitz of order $k$ if there is a universal constant $C$ s.t. $|f(x) - f(y)| \le C(1 + \|x\|^{k-1} + \|y\|^{k-1})\|x - y\|$. Note that this implies $f$ is bounded by a polynomial of degree $k$.

• $A^+ = (A^\top A)^+A^\top = A^\top(AA^\top)^+$.
• $AA^+$ is the orthogonal projector to the column space of $A$; $I - A^+A$ is the orthogonal projector to the null space of $A$.
• If $A$ has singular value decomposition $A = U\Lambda V$ where $U$ and $V$ are orthogonal and $\Lambda$ has the singular values on its diagonal, then $A^+ = V^\top\Lambda^+U^\top$, where $\Lambda^+$ inverts all nonzero entries of $\Lambda$.
• For any collection of vectors $\{v_i\}_{i=1}^n$ in a Hilbert space, $w \mapsto \sum_{i,j=1}^n v_i(\Sigma^+)_{ij}\langle v_j, w\rangle$, where $\Sigma_{ij} = \langle v_i, v_j\rangle$, is the projection operator to the linear span of $\{v_i\}_{i=1}^n$.
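Several of the facts above can be checked numerically; here is a short numpy sanity check of our own (not from the paper) on a random rank-deficient matrix.

```python
# Sanity check of pseudoinverse facts on a rank-deficient 6x5 matrix (ours).
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4)) @ rng.normal(size=(4, 5))   # rank <= 4
Ap = np.linalg.pinv(A)

assert np.allclose(A @ Ap @ A, A)                        # A A+ A = A
assert np.allclose(Ap @ A @ Ap, Ap)                      # A+ A A+ = A+
assert np.allclose((A @ Ap).T, A @ Ap)                   # A A+ symmetric
assert np.allclose(Ap, np.linalg.pinv(A.T @ A) @ A.T)    # A+ = (A^T A)+ A^T
assert np.allclose(Ap, A.T @ np.linalg.pinv(A @ A.T))    # A+ = A^T (A A^T)+

# A A+ is the orthogonal projector onto the column space of A
P = A @ Ap
assert np.allclose(P @ P, P) and np.allclose(P @ A, A)
```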

We present slightly more general versions of lemmas from Bayati & Montanari (2011) that deal with singular matrices.

Lemma E.2. Let $z \in \mathbb{R}^n$ be a random vector with i.i.d. $\mathcal{N}(0, v^2)$ entries and let $D \in \mathbb{R}^{m\times n}$ be a linear operator. Then for any constant vector $b \in \mathbb{R}^m$ the distribution of $z$ conditioned on $Dz = b$ satisfies
$$z \overset{d}{=}_{Dz=b} D^+b + \Pi\tilde z$$
where $D^+$ is the (Moore-Penrose) pseudoinverse, $\Pi$ is the orthogonal projection onto the subspace $\{z : Dz = 0\}$, and $\tilde z$ is a random vector of i.i.d. $\mathcal{N}(0, v^2)$ entries.

Proof. When $D = [I_{m\times m}\,|\,0_{m\times(n-m)}]$, this claim is immediate. By rotational symmetry, this shows that, for any subspace $V$ and any $v$ orthogonal to it, conditioning $z$ on the event $z \in V + v$ yields a Gaussian centered at $v$ with the covariance of $\Pi_V z$. The lemma in the general case then follows by noting that $\{z : Dz = b\}$ can be decomposed as $\{z : Dz = 0\} + D^+b$.
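A small numpy illustration of Lem E.2 (ours, not the paper's): samples of the claimed conditional law all satisfy the linear constraint, and their fluctuation has covariance $v^2\Pi$.

```python
# Monte Carlo illustration of Lem E.2 (assumption: b is taken in the range of D).
import numpy as np

rng = np.random.default_rng(2)
n, m, v = 7, 3, 1.3
D = rng.normal(size=(m, n))
b = D @ rng.normal(size=n)              # ensure b lies in the range of D

Dp = np.linalg.pinv(D)
Pi = np.eye(n) - Dp @ D                 # orthogonal projection onto {z : Dz = 0}

z_tilde = v * rng.normal(size=(100_000, n))
z_cond = Dp @ b + z_tilde @ Pi.T        # samples of D+ b + Pi z_tilde

assert np.allclose(z_cond @ D.T, b)     # every sample satisfies Dz = b
emp_cov = np.cov(z_cond.T)
print(np.abs(emp_cov - v**2 * Pi).max())  # small, up to Monte Carlo error
```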

Lemma E.3. Let $A \in \mathbb{R}^{n\times m}$ be a matrix with random Gaussian entries, $A_{ij} \sim \mathcal{N}(0, \sigma^2)$. Consider fixed matrices $Q \in \mathbb{R}^{m\times q}$, $Y \in \mathbb{R}^{n\times q}$, $P \in \mathbb{R}^{n\times p}$, $X \in \mathbb{R}^{m\times p}$. Suppose there exists a solution in $A$ to the equations $Y = AQ$ and $X = A^\top P$. Then the distribution of $A$ conditioned on $Y = AQ$ and $X = A^\top P$ is
$$A \overset{d}{=}_{Y=AQ,\,X=A^\top P} E + \Pi_P^\perp\tilde A\,\Pi_Q^\perp \quad\text{where}\quad E = YQ^+ + P^{+\top}X^\top - P^{+\top}P^\top YQ^+,$$
$\tilde A$ is an iid copy of $A$, and $\Pi_P^\perp = I - \Pi_P$ and $\Pi_Q^\perp = I - \Pi_Q$, in which $\Pi_P = PP^+$ and $\Pi_Q = QQ^+$ are the orthogonal projections to the column spaces of $P$ and $Q$ respectively.

Proof. We apply Lem E.2 to $D : A \mapsto (AQ, P^\top A)$. The pseudoinverse of $D$ applied to $(Y, X^\top)$ can be formulated as the unique solution of
$$\operatorname{argmin}\left\{\|A\|_F^2 : AQ = Y,\ P^\top A = X^\top\right\}$$
where $\|-\|_F$ denotes Frobenius norm. We check that $E$ is 1) a solution to $AQ = Y$, $P^\top A = X^\top$ and 2) the minimal norm solution. We have $EQ = YQ^+Q + P^{+\top}X^\top Q - P^{+\top}P^\top YQ^+Q$. Note that $YQ^+Q = Y$ because $Y = AQ \implies YQ^+Q = AQQ^+Q = AQ = Y$. So $EQ = Y + P^{+\top}(X^\top Q - P^\top Y)$. But $X^\top Q = P^\top AQ = P^\top Y$, so $EQ = Y$ as desired. A similar, but easier, reasoning gives $P^\top E = X^\top$. This verifies that $E$ is a solution. To check that $E$ is minimal norm, we show that it satisfies the stationarity of the Lagrangian
$$L(A, \Theta, \Gamma) = \|A\|_F^2 + \langle\Theta, Y - AQ\rangle + \langle\Gamma, X - A^\top P\rangle.$$
So $\frac{\partial L}{\partial A} = 0 \implies 2A = \Theta Q^\top + P\Gamma^\top$ for some choices of $\Theta \in \mathbb{R}^{n\times q}$ and $\Gamma \in \mathbb{R}^{m\times p}$. For $\Theta = 2Y(Q^\top Q)^+$ and $\Gamma^\top = 2(P^\top P)^+[X^\top - P^\top YQ^+]$, we can check that
$$\Theta Q^\top + P\Gamma^\top = 2Y(Q^\top Q)^+Q^\top + 2P(P^\top P)^+[X^\top - P^\top YQ^+] = 2YQ^+ + 2P^{+\top}X^\top - 2P^{+\top}P^\top YQ^+ = 2E$$
as desired.

E.2. Probability Facts

Theorem E.4 (Strong Law of Large Numbers for triangular arrays (Hu & L. Taylor, 1997)). Let $\{X_{n,i} : 1 \le i \le n,\ n \ge 1\}$ be a triangular array of random variables with $(X_{n,1},\ldots,X_{n,n})$ mutually independent with mean zero for each $n$ and $n^{-1}\sum_{i=1}^n\operatorname{E}|X_{n,i}|^{2+\rho} \le cn^{\rho/2}$ for some $0 < \rho < 1$, $c < \infty$. Then $\frac1n\sum_{i=1}^n X_{n,i} \to 0$ almost surely as $n \to \infty$.

Lemma E.5. Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables with zero mean. If for some $p \in \mathbb{N}$ and for all $n$, $\operatorname{E}X_n^{2p} \le cn^{-1-\rho}$ for some $\rho > 0$, then $X_n \to 0$ almost surely.

Proof. By Markov's inequality, for any $\epsilon > 0$,
$$\Pr(|X_n| > \epsilon) = \Pr(X_n^{2p} > \epsilon^{2p}) \le \operatorname{E}X_n^{2p}/\epsilon^{2p} \le cn^{-1-\rho}/\epsilon^{2p}, \qquad \sum_n\Pr(|X_n| > \epsilon) \le \sum_n cn^{-1-\rho}/\epsilon^{2p} < \infty.$$
By the Borel-Cantelli Lemma, almost surely, $|X_n| \le \epsilon$ for all large $n$. Then, if we pick a sequence $\{\epsilon_k > 0\}_k$ converging to 0, we have that, almost surely, for each $k$, $|X_n| \le \epsilon_k$ for large enough $n$; i.e. almost surely, $X_n \to 0$.

The following is a standard fact about multivariate Gaussian conditioning

Proposition E.6. Suppose $\mathbb{R}^{n_1+n_2} \ni x \sim \mathcal{N}(\mu, K)$, where we partition $x = (x_1, x_2) \in \mathbb{R}^{n_1}\times\mathbb{R}^{n_2}$, $\mu = (\mu_1, \mu_2) \in \mathbb{R}^{n_1}\times\mathbb{R}^{n_2}$, and $K = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22}\end{pmatrix}$. Then $x_1 \overset{d}{=}_{x_2} \mathcal{N}(\mu|_{x_2}, K|_{x_2})$ where
$$\mu|_{x_2} = \mu_1 + K_{12}K_{22}^+(x_2 - \mu_2), \qquad K|_{x_2} = K_{11} - K_{12}K_{22}^+K_{21}.$$

Lemma E.7. Let $\Phi : \mathbb{R}^n\to\mathbb{R}$ be measurable. Then for $z \sim \mathcal{N}(\zeta, \Sigma)$,
$$\frac{d^2}{d\zeta^2}\operatorname{E}\Phi(z) = 2\frac{d}{d\Sigma}\operatorname{E}\Phi(z)$$
whenever both sides exist.

Proof. First assume $\Sigma$ is invertible. We check
$$\frac{d}{d\zeta}e^{-\frac12(\zeta-z)^\top\Sigma^{-1}(\zeta-z)} = -\Sigma^{-1}(\zeta-z)\,e^{-\frac12(\zeta-z)^\top\Sigma^{-1}(\zeta-z)}$$
$$\frac{d^2}{d\zeta^2}e^{-\frac12(\zeta-z)^\top\Sigma^{-1}(\zeta-z)} = \left(-\Sigma^{-1} + \Sigma^{-1}(\zeta-z)(\zeta-z)^\top\Sigma^{-1}\right)e^{-\frac12(\zeta-z)^\top\Sigma^{-1}(\zeta-z)}$$
$$\frac{d}{d\Sigma}\,\frac{e^{-\frac12(\zeta-z)^\top\Sigma^{-1}(\zeta-z)}}{\det(2\pi\Sigma)^{1/2}} = \frac12\left(-\Sigma^{-1} + \Sigma^{-1}(\zeta-z)(\zeta-z)^\top\Sigma^{-1}\right)\frac{e^{-\frac12(\zeta-z)^\top\Sigma^{-1}(\zeta-z)}}{\det(2\pi\Sigma)^{1/2}} = \frac12\,\frac{d^2}{d\zeta^2}\,\frac{e^{-\frac12(\zeta-z)^\top\Sigma^{-1}(\zeta-z)}}{\det(2\pi\Sigma)^{1/2}}.$$
Integrating against $\Phi$ gives the result. For general $\Sigma$, apply a continuity argument, since the set of invertible $\Sigma$s is dense inside the set of all PSD $\Sigma$.
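In one dimension, Lem E.7 says that for $z \sim \mathcal{N}(\zeta, \sigma^2)$, $\frac{d^2}{d\zeta^2}\operatorname{E}\Phi(z) = 2\frac{d}{d\sigma^2}\operatorname{E}\Phi(z)$; the following finite-difference check (ours, with an arbitrary smooth test function of our choosing) illustrates this.

```python
# 1-D finite-difference check of Lem E.7 (test function Phi is our choice).
import numpy as np
from scipy.integrate import quad

Phi = lambda x: np.tanh(x) + 0.1 * x**4

def EPhi(zeta, sigma2):
    dens = lambda x: np.exp(-(x - zeta)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return quad(lambda x: Phi(x) * dens(x), -np.inf, np.inf)[0]

zeta, sigma2, eps = 0.7, 1.3, 1e-2
lhs = (EPhi(zeta + eps, sigma2) - 2 * EPhi(zeta, sigma2) + EPhi(zeta - eps, sigma2)) / eps**2
rhs = 2 * (EPhi(zeta, sigma2 + eps) - EPhi(zeta, sigma2 - eps)) / (2 * eps)
print(lhs, rhs)   # the two values agree closely
```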

Lemma E.8 (Stein's lemma). For jointly Gaussian random variables $Z_1, Z_2$ with zero mean, and any function $\phi : \mathbb{R}\to\mathbb{R}$ such that $\operatorname{E}\phi'(Z_2)$ and $\operatorname{E}Z_1\phi(Z_2)$ exist, we have
$$\operatorname{E}Z_1\phi(Z_2) = \operatorname{Cov}(Z_1, Z_2)\operatorname{E}\phi'(Z_2).$$

E.3. $\alpha$-controlled functions

The next lemma is easy to show using the equivalence of norms in finite dimensional Euclidean space.

Lemma E.9. Let $\phi : \mathbb{R}^k\to\mathbb{R}$. The following are equivalent:

1. $\phi$ is $\alpha$-controlled

2. For some $p \ge 1$, some $C > 0$, and some $g(x) = o_{\|x\|_p\to\infty}(\|x\|_p^\alpha)$, we have $|\phi(x)| \le e^{C\|x\|_p^\alpha + g(x)}$

3. For all $p \ge 1$, there are some $C, c > 0$ with $|\phi(x)| \le e^{C\|x\|_p^\alpha + c}$

Lemma E.10. Let $C_\alpha : \mathbb{R}^{\ge 0}\to\mathbb{R}^{\ge 0}$, $c \mapsto \operatorname{E}_{z\sim\mathcal{N}(0, I_k)}e^{c\|z\|_2^\alpha}$. Then

1. $C_\alpha(c) < \infty$ for every $c \ge 0$ iff $\alpha < 2$

2. for $\alpha \ge 1$,
$$\operatorname{E}_{z\sim\mathcal{N}(\mu,\Sigma)}e^{C\|z\|_2^\alpha} \le e^{C\alpha\|\mu\|_2^\alpha}\,C_\alpha\!\left(C\alpha\|\Sigma\|_2^{\alpha/2}\right)$$
where $\|\Sigma\|_2$ denotes the spectral norm of $\Sigma$.

3. for any $\alpha$-controlled $\phi : \mathbb{R}^k\to\mathbb{R}$ with $\alpha \ge 1$, there is $C > 0$ such that for all $\mu \in \mathbb{R}^k$, $\Sigma \in \mathrm{PSD}_k$,
$$\operatorname{E}_{z\sim\mathcal{N}(\mu,\Sigma)}|\phi(z)| \le Ce^{C\alpha\|\mu\|_2^\alpha}\,C_\alpha\!\left(C\alpha\|\Sigma\|_2^{\alpha/2}\right)$$
where $\|\Sigma\|_2$ denotes the spectral norm of $\Sigma$.

Note that the RHS is a monotonic function in $\|\mu\|_2$ and $\|\Sigma\|_2$, in the sense that if $\|\mu\|_2$ and $\|\Sigma\|_2$ do not decrease, then the RHS does not decrease either.

Proof. The first claim is obvious and the third follows from the second easily. For the second,
$$\operatorname{E}_{z\sim\mathcal{N}(\mu,\Sigma)}e^{C\|z\|_2^\alpha} \le \operatorname{E}_{z\sim\mathcal{N}(0,I)}e^{C\|\sqrt\Sigma z+\mu\|_2^\alpha} \le \operatorname{E}_{z\sim\mathcal{N}(0,I)}e^{C\alpha(\|\sqrt\Sigma z\|_2^\alpha+\|\mu\|_2^\alpha)} \le e^{C\alpha\|\mu\|_2^\alpha}\operatorname{E}_{z\sim\mathcal{N}(0,I)}e^{C\alpha\|\Sigma\|_2^{\alpha/2}\|z\|_2^\alpha} = e^{C\alpha\|\mu\|_2^\alpha}\,C_\alpha\!\left(C\alpha\|\Sigma\|_2^{\alpha/2}\right).$$

E.4. Hermite Polynomials

We follow a presentation roughly given by O'Donnell (2014).

Definition E.11. Let $\mathrm{He}_n(x)$ be the probabilist's Hermite polynomial, given by the generating function $e^{xt-\frac12 t^2} = \sum_{n=0}^\infty\mathrm{He}_n(x)\frac{t^n}{n!}$. Let $L^2(\mathbb{R}; \mathcal{N}(0,1))$ be the space of square-integrable functions against the standard Gaussian measure, equipped with inner product $\langle\phi,\psi\rangle_G = \operatorname{E}_{x\sim\mathcal{N}(0,1)}\phi(x)\psi(x)$ and norm $\|\phi\|_G^2 = \langle\phi,\phi\rangle_G$. Let $H_n(x) = \mathrm{He}_n(x)/\|\mathrm{He}_n\|_G$ be the normalized versions.

Fact E.12. $\{\mathrm{He}_n(x)\}_{n\ge 0}$ form an orthogonal basis for $L^2(\mathbb{R}; \mathcal{N}(0,1))$ and $\{H_n(x)\}_{n\ge 0}$ form an orthonormal basis for $L^2(\mathbb{R}; \mathcal{N}(0,1))$.

Fact E.13. $\|\mathrm{He}_n\|_G^2 = n!$ so that $H_n(x) = \mathrm{He}_n(x)/\sqrt{n!}$.

Suppose $u^1, \ldots, u^k$ are unit vectors in $\mathbb{R}^k$, and let $\rho_{ij} := \langle u^i, u^j\rangle$. Construct a zero mean Gaussian vector $z = (z_1, \ldots, z_k)$ such that $\operatorname{E}z_iz_j = \rho_{ij}$. Note that $z \overset{d}{=} Ug$ where $g = (g_1, \ldots, g_k)$ is a standard Gaussian vector and $U = (u^i_j)_{i,j=1}^k$ is the matrix with $u^i$ as rows. Then for any $s = (s_1, \ldots, s_k)$ we can compute

$$\operatorname{E}\exp(\langle s, z\rangle) = \operatorname{E}\exp(s^\top Ug) = \operatorname{E}\prod_i\exp(g_i(U^\top s)_i) = \prod_i\operatorname{E}\exp(g_i(U^\top s)_i)$$
by independence of $\{g_i\}_i$
$$= \prod_i\exp\left(\frac12(U^\top s)_i^2\right) = \exp\left(\frac12\sum_i(U^\top s)_i^2\right) = \exp\left(\frac12\|U^\top s\|^2\right) = \exp\left(\frac12\sum_{i,j}\langle u^i, u^j\rangle s_is_j\right) = \exp\left(\frac12\sum_{i,j}\rho_{ij}s_is_j\right).$$

Dividing by $\exp\left(\frac12\sum_i s_i^2\right)$, we obtain
$$\operatorname{E}\exp\left(\sum_i\left(s_iz_i - \frac12 s_i^2\right)\right) = \exp\left(\sum_{i<j}\rho_{ij}s_is_j\right)$$
$$\operatorname{E}\prod_i\sum_m\mathrm{He}_m(z_i)(m!)^{-1}s_i^m = \prod_{i<j}\sum_n(n!)^{-1}(\rho_{ij}s_is_j)^n$$
$$\sum_{(m_i)_i}\operatorname{E}\left[\prod_i\mathrm{He}_{m_i}(z_i)\right]\prod_i\frac{s_i^{m_i}}{m_i!} = \sum_{(n_{(ij)})_{i<j}}\prod_i s_i^{\sum_{j\ne i}n_{(ij)}}\prod_{i<j}\frac{\rho_{ij}^{n_{(ij)}}}{n_{(ij)}!}.$$

Theorem E.14. For any sequence $(m_i \ge 0)_{i=1}^k$,
$$\operatorname{E}\prod_i\mathrm{He}_{m_i}(z_i) = \left(\prod_r m_r!\right)\sum_{(n_{(ij)})}\prod_{i<j}\frac{\rho_{ij}^{n_{(ij)}}}{n_{(ij)}!}$$
whenever there are $(n_{(ij)} \ge 0)_{i<j}$ such that $m_i = \sum_{j\ne i}n_{(ij)}$ for every $i$, with the sum ranging over all such $(n_{(ij)})$; otherwise the expectation is 0.

In particular,

Theorem E.15. If $\phi_i : \mathbb{R}\to\mathbb{R}$ has Hermite expansion $\phi_i(z) = \sum_{u=0}^\infty a_{iu}H_u(z) = \sum_{u=0}^\infty b_{iu}\mathrm{He}_u(z)$ where $b_{iu} = a_{iu}/\sqrt{u!}$, then
$$\operatorname{E}\prod_i\phi_i(z_i) = \sum_{(n_{(ij)})_{i<j}}\left(\prod_r b_{rm_r}m_r!\right)\prod_{i<j}\frac{\rho_{ij}^{n_{(ij)}}}{n_{(ij)}!}$$
where $m_i = \sum_{j\ne i}n_{(ij)}$, whenever the RHS is absolutely convergent.
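For $k = 2$, Thm E.15 reduces to the familiar identity $\operatorname{E}\phi_1(z_1)\phi_2(z_2) = \sum_u b_{1u}b_{2u}\,u!\,\rho^u$ for standard Gaussians with correlation $\rho$. The following Monte Carlo check is our own illustration (the coefficient choices are arbitrary).

```python
# Monte Carlo check of the k = 2 case of Thm E.15 (coefficients are our choice).
import numpy as np
from numpy.polynomial import hermite_e as He   # probabilist's Hermite He_n
from math import factorial

rng = np.random.default_rng(4)
rho = 0.6
b1 = np.array([0.5, 1.0, -0.3, 0.2])           # phi_1 = sum_u b1[u] He_u
b2 = np.array([0.0, 0.7, 0.4, -0.1])           # phi_2 = sum_u b2[u] He_u

g = rng.normal(size=(2, 2_000_000))
z1 = g[0]
z2 = rho * g[0] + np.sqrt(1 - rho**2) * g[1]   # Corr(z1, z2) = rho

mc = np.mean(He.hermeval(z1, b1) * He.hermeval(z2, b2))
exact = sum(b1[u] * b2[u] * factorial(u) * rho**u for u in range(4))
print(mc, exact)   # close, up to Monte Carlo error
```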

Lemma E.16. Suppose $\phi_i$, $i \in [k]$ are as in Thm E.15, with additionally the constraint that we have an index set $I \subseteq [k]$ such that $b_{i0} = a_{i0} = 0$ (i.e. $\operatorname{E}\phi_i(z_i) = 0$) for all $i \in I$. Assume that, for some $\lambda < 1/2$, $|\rho_{ij}| \le \lambda/(k-1)$ for all $i \ne j$. Then
$$\left|\operatorname{E}\prod_{i=1}^k\phi_i(z_i)\right| \le C_{k,|I|}\left(\prod_{r=1}^k\|\phi_r\|_G\right)\lambda^{\lceil|I|/2\rceil}$$
for some constant $C_{k,|I|}$ depending on $k$ and $|I|$ but independent of $\{\phi_i\}_i$ and $\lambda$.

Proof. In the notation of Thm E.15, $\binom{m_i}{\{n_{(ij)}\}_{j\ne i}} \le (k-1)^{m_i}$ by the multinomial theorem. Thus
$$\left|\operatorname{E}\prod_{i=1}^k\phi_i(z_i)\right| \le \sum_{(n_{(ij)})_{i<j}}\left(\prod_{r=1}^k|a_{rm_r}|\sqrt{\binom{m_r}{\{n_{(rj)}\}_{j\ne r}}}\right)\prod_{i<j}|\rho_{ij}|^{n_{(ij)}},$$
where the sum ranges over $(n_{(ij)})_{i<j}$ with $m_r \ge 1$ for all $r \in I$ (since $a_{r0} = 0$ for $r \in I$). Bounding the multinomial coefficients by $(k-1)^{m_r}$, using $|\rho_{ij}| \le \lambda/(k-1)$ and $\sum_r m_r = 2\sum_{i<j}n_{(ij)} \ge |I|$, and summing the resulting series over the admissible $(n_{(ij)})$ (which converges because $\lambda < 1/2$) gives the claimed bound.

Lemma E.17. Suppose $\phi_i$, $i \in [k]$ are as in Thm E.15, with additionally the constraint that we have some index set $I \subseteq [3, k]$ such that for all $i \in I$, $b_{i0} = a_{i0} = 0$ (i.e. $\operatorname{E}\phi_i(z_i) = 0$). Assume that $|\rho_{12}| \le 1/2$ and, for some $\lambda < 1/\sqrt 8$, $|\rho_{ij}| \le \lambda/(k-1)$ for all $i \ne j$ with $\{i,j\} \ne \{1,2\}$. Then
$$\left|\operatorname{E}\prod_{i=1}^k\phi_i(z_i)\right| \le C'_{k,|I|}\left(\prod_{r=1}^k\|\phi_r\|_G\right)\lambda^{\lceil|I|/2\rceil}$$

for some constant $C'_{k,|I|}$ depending on $k$ and $|I|$ but independent of $\{\phi_i\}_i$ and $\lambda$.

Proof. Define $P = \{(i,j) : 1 \ne i < j \ne 2\}$ and $Q = \{(i,j) : i < j \text{ and } (i = 1 \text{ XOR } j = 2)\}$. Also write $R = \prod_{r=1}^k\|\phi_r\|_G$. As in the above proof,
$$\left|\operatorname{E}\prod_{i=1}^k\phi_i(z_i)\right| \le \sum_{(n_{(ij)})}\left(\prod_{r=1}^k|a_{rm_r}|\sqrt{\binom{m_r}{\{n_{(rj)}\}_{j\ne r}}}\right)\left(\prod_{i<j}|\rho_{ij}|^{n_{(ij)}}\right)2^{-n_{(12)}},$$
where the sum ranges over $(n_{(ij)})$ with $m_r \ge 1$ for all $r \in I$. Splitting according to how the indices in $I$ are covered by edges in $P$ and $Q$ and bounding as before, the total is at most $2R\left(2^{|I|/4}B_{|I|}\lambda^{\lceil|I|/2\rceil}\right)(1 + o(1))$, where $B_{|I|}$ is the number of ways of covering $|I|$ vertices with $\lceil\frac{|I|}{2}\rceil$ edges, and $o(1)$ is a term that goes to 0 as $\lambda \to 0$ and is upper bounded by a function of $k$ for all $\lambda < 1/\sqrt 8$. Choosing the appropriate constant $C'_{k,|I|}$ then gives the result.

E.5. Moment Bounds of Lightly Correlated Gaussians

Lemma E.18. Let $\Pi \in \mathbb{R}^{n\times n}$ be an orthogonal projection matrix. Then each diagonal entry $\Pi_{ii} \in [0, 1]$.

Proof. Because $\Pi = \Pi^2$, we have for each $i$, $\Pi_{ii} = \sum_j\Pi_{ij}^2 \implies \Pi_{ii}(1 - \Pi_{ii}) = \sum_{j\ne i}\Pi_{ij}^2 \ge 0 \implies \Pi_{ii} \in [0, 1]$.
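A tiny numpy check of Lem E.18 (ours), which also previews the quantity controlled by Lem E.19 below: when the rank deficiency $n - k$ is small, the total squared off-diagonal correlation stays of order $(n-k)^2$.

```python
# Check Lem E.18 on a random orthogonal projection and print the quantity of Lem E.19.
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 47
Q, _ = np.linalg.qr(rng.normal(size=(n, k)))   # orthonormal columns
Pi = Q @ Q.T                                   # rank-k orthogonal projection

d = np.diag(Pi)
assert np.all(d >= -1e-12) and np.all(d <= 1 + 1e-12)   # diagonal in [0, 1]

# off-diagonal entries of the correlation matrix C = D^{-1/2} Pi D^{-1/2}
C = Pi / np.sqrt(np.outer(d, d))
off = (np.sum(C**2) - np.sum(np.diag(C)**2)) / 2        # sum over i < j of C_ij^2
print(off, (n - k) ** 2)                                # small relative to (n-k)^2
```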

Lemma E.19. Let $\Pi \in \mathbb{R}^{n\times n}$ be an orthogonal projection matrix of rank $k$. Consider the correlation matrix $C \overset{\mathrm{def}}{=} D^{-1/2}\Pi D^{-1/2}$ where $D = \operatorname{Diag}(\Pi)$. Then the off-diagonal entries of $C$ satisfy $\sum_{i<j}C_{ij}^2 \le 2.5(n-k)^2$.

Proof. Because $\Pi = \Pi^2$, we have for each $i$, $\Pi_{ii} = \sum_j\Pi_{ij}^2 = \sum_j\Pi_{ii}\Pi_{jj}C_{ij}^2 \implies 1 - \Pi_{ii} = \sum_{j\ne i}\Pi_{jj}C_{ij}^2$. At the same time, each $C_{ij}^2 \in [0, 1]$. Thus we seek an upper bound on the following linear program in the $n(n-1)/2$ variables $C_{(ij)}^2$ (which identify $C_{ij} = C_{ji} = C_{(ij)}$):
$$\text{Maximize }\ \sum_{i<j}C_{(ij)}^2 \quad\text{s.t.}\quad \forall i,\ 1 - \Pi_{ii} = \sum_{j\ne i}\Pi_{jj}C_{(ij)}^2; \qquad \forall i<j,\ C_{(ij)}^2 \in [0, 1].$$

This LP has the dual
$$\text{Minimize }\ \sum_{i<j}\tau_{ij} + \sum_i(1 - \Pi_{ii})\zeta_i \quad\text{s.t.}\quad \forall i<j,\ \tau_{ij} + \zeta_i\Pi_{jj} + \zeta_j\Pi_{ii} \ge 1; \qquad \forall i<j,\ \tau_{ij} \ge 0; \qquad \forall i,\ \zeta_i \in \mathbb{R}.$$

Any feasible value of the dual LP is an upper bound on the original LP. We now set the dual variables.

WLOG, assume $\Pi_{11} \ge \cdots \ge \Pi_{nn}$. Then necessarily, $1 \ge \Pi_{ii} \ge \frac kn$ for each $i \in [k]$. First define $\rho \overset{\mathrm{def}}{=} \frac kn + \frac nk = \frac{k^2+n^2}{nk} \ge 2$. Note that $\rho - 2 = \frac{(n-k)^2}{nk}$.

Dual variables for $1 \le i < j \le k$. Set $\zeta_i \overset{\mathrm{def}}{=} \frac{1}{\rho\Pi_{ii}}$ for $i \in [k]$. Then for $1 \le i < j \le k$,
$$\tau_{ij} \overset{\mathrm{def}}{=} 1 - (\zeta_i\Pi_{jj} + \zeta_j\Pi_{ii}) = 1 - \frac1\rho\left(r_{ij} + r_{ij}^{-1}\right)$$
where $r_{ij} = \Pi_{ii}/\Pi_{jj} \ge 1$. Note that 1) $\tau_{ij}$ is nonnegative: indeed, since $r + r^{-1}$ is increasing in $r$ for $r \ge 1$, and $r_{ij} \le r_{1k} \le n/k$, we have $r_{ij} + r_{ij}^{-1} \le \rho$, so that $\tau_{ij} \ge 0$; 2) $\tau_{ij} \le \frac{(n-k)^2}{n^2+k^2}$: $r_{ij} + r_{ij}^{-1} \ge 2$ so $\tau_{ij} \le 1 - 2/\rho = \frac{(n-k)^2}{n^2+k^2}$.

$\zeta_j$ for $j > k$. Now set $\zeta_j \overset{\mathrm{def}}{=} \frac{1}{\Pi_{11}}\left(1 - \frac{\Pi_{jj}}{2\Pi_{11}}\right)$ for $k < j \le n$. Note that $\zeta_j \ge 1/2$.

$\tau_{ij}$ for $i \le k < j$. Then for $i \le k < j$, set
$$\tau_{ij} \overset{\mathrm{def}}{=} 1 - \zeta_i\Pi_{jj} - \zeta_j\Pi_{ii} = 1 - \frac{\Pi_{jj}}{\rho\Pi_{ii}} - \zeta_j\Pi_{ii}.$$

Note that for $\Pi_{11} = x \ge y \ge k/n$,
$$\left(\frac{\Pi_{jj}}{\rho x} + \zeta_jx\right) - \left(\frac{\Pi_{jj}}{\rho y} + \zeta_jy\right) = \frac{\Pi_{jj}}{\rho}\frac{y-x}{xy} + \zeta_j(x-y) = (x-y)\left(\zeta_j - (xy)^{-1}\frac{\Pi_{jj}}{\rho}\right) = (x-y)\left(x^{-1}\left(1 - \frac{\Pi_{jj}}{2x}\right) - (xy)^{-1}\frac{\Pi_{jj}}{\rho}\right) = (x-y)x^{-1}\left(1 - \frac{\Pi_{jj}}{2x} - y^{-1}\frac{\Pi_{jj}}{\rho}\right) \tag{15}$$
$$\ge (x-y)x^{-1}\left(1 - \frac{k/n}{2x} - (k/n)^{-1}\frac{k/n}{\rho}\right) \ge (x-y)x^{-1}\left(1 - \frac12 - \frac12\right) \ge 0.$$
Thus for all $i \le k < j$, $\tau_{1j} \le \tau_{ij}$. Simultaneously, Eq. (15) also shows that $\left(\frac{\Pi_{jj}}{\rho x} + \zeta_jx\right) - \left(\frac{\Pi_{jj}}{\rho y} + \zeta_jy\right) \le (x-y)x^{-1}$, so that $\tau_{ij} - \tau_{1j} \le \frac{\Pi_{11}-\Pi_{ii}}{\Pi_{11}} \le \frac{n-k}{n}$. We check that $\tau_{1j} \ge 0$:
$$\tau_{1j} = 1 - \frac{\Pi_{jj}}{\rho\Pi_{11}} - \zeta_j\Pi_{11} = 1 - \frac{\Pi_{jj}}{\rho\Pi_{11}} - \left(1 - \frac{\Pi_{jj}}{2\Pi_{11}}\right) = \frac{\Pi_{jj}}{\Pi_{11}}\left(\frac12 - \frac1\rho\right) \in \left[0,\ \frac{\Pi_{jj}}{\Pi_{11}}\frac{1}{2\rho}\frac{(n-k)^2}{nk}\right] \subseteq \left[0,\ \frac{(n-k)^2}{nk}\right].$$
Combined with our deduction above, we get $\tau_{ij} \le \tau_{1j} + \frac{n-k}{n} \le \frac{n(n-k)}{nk} = \frac{n-k}{k}$.

$\tau_{ij}$ for $k < i < j$. For $k < i < j$, we set
$$\tau_{ij} \overset{\mathrm{def}}{=} 1 - (\zeta_i\Pi_{jj} + \zeta_j\Pi_{ii}) = 1 - \left(\frac{\Pi_{jj}}{\Pi_{11}}\left(1 - \frac{\Pi_{ii}}{2\Pi_{11}}\right) + \frac{\Pi_{ii}}{\Pi_{11}}\left(1 - \frac{\Pi_{jj}}{2\Pi_{11}}\right)\right) = 1 - \frac{\Pi_{jj}+\Pi_{ii}}{\Pi_{11}} + \frac{\Pi_{jj}\Pi_{ii}}{\Pi_{11}^2} = \frac{(\Pi_{11}-\Pi_{jj})(\Pi_{11}-\Pi_{ii})}{\Pi_{11}^2} \in [0, 1].$$

Summary of dual variables. In summary, we have
$$\forall 1\le i\le k,\ \zeta_i \overset{\mathrm{def}}{=} (\rho\Pi_{ii})^{-1} \in \left[\rho^{-1},\ \rho^{-1}\tfrac nk\right]; \qquad \forall k<i\le n,\ \zeta_i \overset{\mathrm{def}}{=} \frac{1}{\Pi_{11}}\left(1 - \frac{\Pi_{ii}}{2\Pi_{11}}\right) \in \left[\tfrac12,\ \tfrac nk\right];$$
$$\forall 1\le i<j\le k,\ \tau_{ij} \in \left[0,\ \tfrac{(n-k)^2}{n^2+k^2}\right]; \qquad \forall i\le k<j\le n,\ \tau_{ij} \in \left[0,\ \tfrac{n-k}{k}\right]; \qquad \forall k<i<j\le n,\ \tau_{ij} \in [0, 1].$$

Objective value given by the dual variables. Now we compute
$$\sum_{i<j}\tau_{ij} + \sum_i(1-\Pi_{ii})\zeta_i = \left(\sum_{i<j\le k} + \sum_{i\le k<j} + \sum_{k<i<j}\right)\tau_{ij} + \left(\sum_{i=1}^k + \sum_{i=k+1}^n\right)(1-\Pi_{ii})\zeta_i,$$
and bounding each group using the ranges in the summary above yields the bound claimed in the lemma statement.

Lemma E.20. Let $z \sim \mathcal{N}(0, \Pi)$ where $\Pi \in \mathbb{R}^{n\times n}$ is an orthogonal projection matrix of rank $k$. Suppose $\phi_i : \mathbb{R}\to\mathbb{R}$ for each $i$ has finite variance $\operatorname{Var}(\phi_i(x) : x \sim \mathcal{N}(0, \Pi_{ii}))$. Then
$$\operatorname{Var}\left(\sum_{i=1}^n\phi_i(z_i)\right) \le \left(\frac{5(n-k)^2}{\sqrt 2} + 1\right)\sum_i\operatorname{Var}(\phi_i(x) : x\sim\mathcal{N}(0,\Pi_{ii})).$$
In particular, if $n - k = O(1)$, then $\operatorname{Var}\left(\sum_{i=1}^n\phi_i(z_i)\right) = O\left(\sum_i\operatorname{Var}(\phi_i(x) : x\sim\mathcal{N}(0,\Pi_{ii}))\right)$.

Proof. Let $C = D^{-1/2}\Pi D^{-1/2}$, $D = \operatorname{Diag}(\Pi)$, be the correlation matrix of $\Pi$. Let $\psi_i(y) = \phi_i(\sqrt{\Pi_{ii}}\,y) - \operatorname{E}_{x\sim\mathcal{N}(0,1)}[\phi_i(\sqrt{\Pi_{ii}}\,x)]$. Then
$$\operatorname{Var}\left(\sum_{i=1}^n\phi_i(z_i)\right) = \operatorname{E}_{z\sim\mathcal{N}(0,C)}\left(\sum_{i=1}^n\psi_i(z_i)\right)^2 = \sum_{i,j}\operatorname{E}_{z\sim\mathcal{N}(0,C)}\psi_i(z_i)\psi_j(z_j).$$

Expand $\psi_i$ in the Hermite orthonormal basis,
$$\psi_i(x) = a_{i1}H_1(x) + a_{i2}H_2(x) + \cdots$$
where $H_u(x)$ is the $u$th Hermite polynomial, normalized so that $\operatorname{E}_{z\sim\mathcal{N}(0,1)}H_u(z)^2 = 1$ (note that $H_0(x) = 1$ does not appear here because $\operatorname{E}_{x\sim\mathcal{N}(0,1)}\psi_i(x) = 0$ by construction). For any locally integrable $\phi : \mathbb{R}\to\mathbb{R}$, let $\|\phi\|_G^2 \overset{\mathrm{def}}{=} \operatorname{E}_{z\sim\mathcal{N}(0,1)}\phi(z)^2$, so that $\|\psi_i\|_G^2 = \sum_u a_{iu}^2 = \operatorname{Var}(\phi_i(x) : x\sim\mathcal{N}(0,\Pi_{ii}))$. Then,

since $\operatorname{E}_{z\sim\mathcal{N}(0,C)}H_u(z_i)H_v(z_j) = \delta_{uv}C_{ij}^u$,
$$\sum_{i<j}\operatorname{E}_{z\sim\mathcal{N}(0,C)}\psi_i(z_i)\psi_j(z_j) = \sum_{i<j}\sum_{u=1}^\infty a_{iu}a_{ju}C_{ij}^u \le 2^{-1/2}\sum_{u=1}^\infty\left(\sum_i a_{iu}^2\right)\left(2.5(n-k)^2\right) = \frac{5(n-k)^2}{2^{3/2}}\sum_i\|\psi_i\|_G^2,$$
where the inequality uses $|C_{ij}| \le 1$ and Lem E.19. On the other hand, $\sum_i\operatorname{E}_{x\sim\mathcal{N}(0,1)}\psi_i(x)^2 = \sum_i\|\psi_i\|_G^2$, so that
$$\sum_{i,j}\operatorname{E}_{z\sim\mathcal{N}(0,C)}\psi_i(z_i)\psi_j(z_j) \le \left(\frac{5(n-k)^2}{\sqrt 2} + 1\right)\sum_i\|\psi_i\|_G^2 = \left(\frac{5(n-k)^2}{\sqrt 2} + 1\right)\sum_i\operatorname{Var}(\phi_i(x) : x\sim\mathcal{N}(0,\Pi_{ii})).$$

Theorem E.21. Let $z \sim \mathcal{N}(0, \Pi)$ where $\Pi \in \mathbb{R}^{n\times n}$ is an orthogonal projection matrix of rank $n - O(1)$, where $O(1)$ denotes a quantity that stays bounded as $n \to \infty$. Suppose $\phi_i : \mathbb{R}\to\mathbb{R}$ for each $i \in [n]$ has finite centered moments $\operatorname{E}_x[(\phi_i(x) - \operatorname{E}_{x'}\phi_i(x'))^r]$, for $x, x' \sim \mathcal{N}(0, \Pi_{ii})$, for all $r \le 2p$, where $p \ge 6$. Then for $Q \overset{\mathrm{def}}{=} \frac1n\sum_{i=1}^n\phi_i(z_i)$, as $n \to \infty$,
$$\operatorname{E}[(Q - \operatorname{E}Q)^{2p}] \le O\left(n^{-1.5}\max_{i\in[n]}\operatorname{E}\left[\left(\phi_i(x) - \operatorname{E}_{x'}\phi_i(x')\right)^{2p} : x, x'\sim\mathcal{N}(0,\Pi_{ii})\right]\right).$$
If in addition, each $\phi_i$ has finite centered moments up to $r \le 2pL$ for some $L > 1$, then
$$\operatorname{E}[(Q - \operatorname{E}Q)^{2p}] \le O\left(n^{-1.5+1/L}\sqrt[L]{\frac1n\sum_{i=1}^n\operatorname{E}\left[\left(\phi_i(x) - \operatorname{E}_{x'}\phi_i(x')\right)^{2pL} : x, x'\sim\mathcal{N}(0,\Pi_{ii})\right]}\right).$$
Here $O(-)$ hides constants that do not depend on $n$, any of the functions $\phi_i$, or $\Pi$.

Proof. Let $C = D^{-1/2}\Pi D^{-1/2}$, $D = \operatorname{Diag}(\Pi)$, be the correlation matrix of $\Pi$. Let $\psi_i(y) = \phi_i(\sqrt{\Pi_{ii}}\,y) - \operatorname{E}_{x\sim\mathcal{N}(0,1)}[\phi_i(\sqrt{\Pi_{ii}}\,x)]$. Order the off-diagonal entries of the correlation matrix in the order of decreasing squared value:

$$C_{(ij)^{(1)}}^2 \ge C_{(ij)^{(2)}}^2 \ge \cdots \ge C_{(ij)^{(N)}}^2,$$
where $N = \binom n2$, and $(ij)^{(t)} = (i^tj^t)$ are unordered pairs of distinct indices $i^t \ne j^t$. Since $\sum_t C_{(ij)^t}^2 \le R$ for some constant $R$, by Lem E.19, we deduce that $|C_{(ij)^t}| \le n^{-1/4}$ for all $t > R\sqrt n$. Consider the $(2p)$th centered moment $\operatorname{E}\left(\frac1n\sum_{i=1}^n\psi_i(y_i)\right)^{2p} = n^{-2p}\operatorname{E}\sum_{\sigma:[2p]\to[n]}\prod_{a=1}^{2p}\psi_{\sigma(a)}(y_{\sigma(a)})$, where $y \sim \mathcal{N}(0, C)$. We shall bound the sum to show that this moment is not too large. First note the naive bound via AM-GM,

$$\left|\operatorname{E}\prod_{a=1}^{2p}\psi_{\sigma(a)}(y_{\sigma(a)})\right| \le \operatorname{E}\frac{1}{2p}\sum_{a=1}^{2p}\psi_{\sigma(a)}(y_{\sigma(a)})^{2p} \le \max_{i\in[n]}\operatorname{E}_{y\sim\mathcal{N}(0,1)}[\psi_i(y)^{2p}] = \max_{i\in[n]}\operatorname{E}[(\phi_i(z_i) - \operatorname{E}\phi_i(z_i))^{2p} : z_i\sim\mathcal{N}(0,\Pi_{ii})] \overset{\mathrm{def}}{=} B_{2p}. \tag{16}$$

Now, for any collection of numbers $\{x_i \in \mathbb{R}\}_{i=1}^m$ and any $L > 0$, we have the trivial bound $\max_i|x_i| \le \left(\sum_{j=1}^m|x_j|^L\right)^{1/L}$, and this bound is tighter the larger $L$ is. Thus $B_{2p} \le \left(n\cdot\frac1n\sum_{i=1}^n(\operatorname{E}[\psi_i(y)^{2p}])^L\right)^{1/L} \le n^{1/L}B_{2p,L}$, where $B_{2p,L} \overset{\mathrm{def}}{=} \sqrt[L]{\frac1n\sum_{i=1}^n\operatorname{E}[\psi_i(y)^{2pL}]}$, for any $L$.

We can categorize the $n^{2p}$ terms of $\operatorname{E}\sum_{\sigma:[2p]\to[n]}\prod_{a=1}^{2p}\psi_{\sigma(a)}(y_{\sigma(a)})$ as follows. Here we use $O(-)$ to hide any constant not depending on $n$ or the functions $\psi_i$.

• Suppose $\sigma$ is injective.
  – Suppose for each $a \ne b$, $(\sigma(a)\sigma(b)) = (ij)^t$ for some $t > R\sqrt n$. By Lem E.16, $\left|\operatorname{E}\prod_{a=1}^{2p}\psi_{\sigma(a)}(y_{\sigma(a)})\right| \le C\left(\prod_{r=1}^{2p}\|\psi_{\sigma(r)}\|_G\right)n^{-p/4}$ for some constant $C$ independent of $\{\psi_r\}_r$ and $n$. Thus the contribution of all such $\sigma$ to the sum is at most
$$\sum_\sigma C\left(\prod_{r=1}^{2p}\|\psi_{\sigma(r)}\|_G\right)n^{-p/4} \le O\left(n^{-p/4}\left(\sum_{i=1}^n\|\psi_i\|_G\right)^{2p}\right).$$
  – Suppose for some $a, b \in [2p]$, $(\sigma(a)\sigma(b)) = (ij)^t$ for $t \le R\sqrt n$. There are at most $\binom{2p}{2}R\sqrt n\cdot n^{2p-2} = O(n^{2p-1.5})$ such $\sigma$ (indeed, there are $R\sqrt n$ ways of choosing such a $t$, $\binom{2p}{2}$ ways of choosing their preimages under $\sigma$ out of $2p$, and $\le n^{2p-2}$ ways of choosing the rest of the values of $\sigma$). By Eq. (16), the contribution of all such $\sigma$ to the sum is at most $O(n^{2p-1.5}B_{2p})$.

• Suppose for some $a^* \ne b^*$ in $[2p]$, $\sigma(a^*) = \sigma(b^*)$, but $\sigma|_{[2p]\setminus\{a^*,b^*\}}$ is injective and takes range outside $\{\sigma(a^*)\}$. There are $\binom{2p}{2}n\binom{n-1}{2p-2} = O(n^{2p-1})$ such $\sigma$.
  – Suppose for each $a \ne b$, $(\sigma(a)\sigma(b)) = (ij)^t$ for some $t > R\sqrt n$, so that $|C_{\sigma(a)\sigma(b)}| \le n^{-1/4}$. We apply Lem E.16 to the functions $\{\psi_{\sigma(a^*)}^2\}\cup\{\psi_{\sigma(a)}\}_{a\notin\{a^*,b^*\}}$, with $\psi_{\sigma(a^*)}^2$ being the sole function whose expectation is not 0, so that the $I$ of Lem E.16 has size $2p-2$, and the $\lambda$ of Lem E.16 is $(k-1)n^{-1/4}$. Then Lem E.16 gives
$$\left|\operatorname{E}\prod_{a=1}^{2p}\psi_{\sigma(a)}(z_{\sigma(a)})\right| \le C\|\psi_{\sigma(a^*)}^2\|_G\left(\prod_{a\notin\{a^*,b^*\}}\|\psi_{\sigma(a)}\|_G\right)(n^{-1/4})^{(2p-2)/2} = C\|\psi_{\sigma(a^*)}^2\|_G\left(\prod_{a\notin\{a^*,b^*\}}\|\psi_{\sigma(a)}\|_G\right)n^{-(p-1)/4}$$
for some constant $C$. Thus the collective contribution of such $\sigma$ to the sum is at most
$$\sum_\sigma C\|\psi_{\sigma(a^*)}^2\|_G\left(\prod_{a\notin\{a^*,b^*\}}\|\psi_{\sigma(a)}\|_G\right)n^{-(p-1)/4} \le O\left(n^{-(p-1)/4}\left(\sum_{i=1}^n\|\psi_i^2\|_G\right)\left(\sum_{i=1}^n\|\psi_i\|_G\right)^{2p-2}\right).$$
  – Suppose for some $a, b \in [2p]$, $(\sigma(a)\sigma(b)) = (ij)^t$ for $t \le R\sqrt n$. There are at most $\binom{2p}{2}n\cdot R\sqrt n\cdot\binom{n-2}{2p-3} = O(n^{2p-1.5})$ such $\sigma$. Using Eq. (16) again, we can upper bound the contribution of such $\sigma$ by $O(n^{2p-1.5}B_{2p})$.

• Otherwise, there is more than one pair of inputs that collide under $\sigma$. There are at most $O(n^{2p-2})$ such $\sigma$. Using Eq. (16), we upper bound their contributions by $O(n^{2p-2}B_{2p})$.

To summarize,
$$\left|\operatorname{E}\sum_{\sigma:[2p]\to[n]}\prod_{a=1}^{2p}\psi_{\sigma(a)}(y_{\sigma(a)})\right| \le O\left(n^{1.75p}B'_{2p} + n^{1.75(p-1)}B''_{2p} + n^{2p-1.5}B_{2p}\right) \le O\left(n^{1.75p}B'_{2p} + n^{1.75(p-1)}B''_{2p} + n^{2p-1.5+1/L}B_{2p,L}\right)$$
$$\operatorname{E}\left(\frac1n\sum_{i=1}^n\phi_i(z_i) - \operatorname{E}\frac1n\sum_{i=1}^n\phi_i(z_i)\right)^{2p} \le O\left(n^{-0.25p}B'_{2p} + n^{-0.25p-1.75}B''_{2p} + n^{-1.5}B_{2p}\right) \le O\left(n^{-0.25p}B'_{2p} + n^{-0.25p-1.75}B''_{2p} + n^{-1.5+1/L}B_{2p,L}\right)$$
where
$$B'_{2p} = \left(\frac1n\sum_{i=1}^n\|\psi_i\|_G\right)^{2p}, \qquad B''_{2p} = \left(\frac1n\sum_{i=1}^n\|\psi_i^2\|_G\right)\left(\frac1n\sum_{i=1}^n\|\psi_i\|_G\right)^{2p-2}.$$
By the power mean inequality, we get that $B'_{2p}, B''_{2p} \le B_{2p} \le n^{1/L}B_{2p,L}$. Substitution then gives the desired result.

This theorem can be significantly strengthened with more careful case work and applying more involved versions of Lems E.16 and E.17.

Lemma E.22. Suppose $\phi : \mathbb{R}\to\mathbb{R}$ is polynomially bounded: $\forall x$, $|\phi(x)| \le C(1 + |x|^p)$ for some $C$. Then for $y \sim \mathcal{N}(0,1)$,
$$|\partial_\mu\operatorname{Var}\phi(\sigma y + \mu)| \le R\sigma^{-1}(1 + \sigma^{2p} + |\mu|^{2p})$$
for a universal constant $R$ depending only on $p$ but not on $\mu$ or $\sigma$.

Proof. By Lem E.7,
$$\partial_\mu\operatorname{E}\phi(\sigma y+\mu)^2 = \sigma^{-1}\operatorname{E}y\phi(\sigma y+\mu)^2$$
$$|\partial_\mu\operatorname{E}\phi(\sigma y+\mu)^2| \le \sigma^{-1}C^2\operatorname{E}|y|(1 + |\sigma y+\mu|^p)^2 \le \sigma^{-1}C^2\operatorname{E}|y|(1 + 2^{p-1}(\sigma^p|y|^p + |\mu|^p))^2 \le \sigma^{-1}3C^2\operatorname{E}|y|(1 + 2^{2p-2}(\sigma^{2p}|y|^{2p} + |\mu|^{2p})) \le \sigma^{-1}C'(1 + \sigma^{2p} + |\mu|^{2p})$$
where $C'$ depends only on $p$ but not on $\mu$ or $\sigma$. Similarly,
$$|\partial_\mu\operatorname{E}\phi(\sigma y+\mu)| \le \sigma^{-1}C''(1 + \sigma^p + |\mu|^p), \qquad |\operatorname{E}\phi(\sigma y+\mu)| \le C'''(1 + \sigma^p + |\mu|^p)$$
for constants $C'', C'''$ depending only on $p$ but not on $\mu$ or $\sigma$. Therefore,
$$|\partial_\mu\operatorname{Var}\phi(\sigma y+\mu)| \le |\partial_\mu\operatorname{E}\phi(\sigma y+\mu)^2| + 2|\operatorname{E}\phi(\sigma y+\mu)||\partial_\mu\operatorname{E}\phi(\sigma y+\mu)| \le \sigma^{-1}C'(1 + \sigma^{2p} + |\mu|^{2p}) + 2C'''(1 + \sigma^p + |\mu|^p)\sigma^{-1}C''(1 + \sigma^p + |\mu|^p) \le C''''\sigma^{-1}(1 + \sigma^{2p} + |\mu|^{2p})$$
for a constant $C''''$ depending only on $p$ but not on $\mu$ or $\sigma$.

F. Proof of Main Theorems

Theorem 4.3. Consider dimension constraints $\Lambda$ and a skeleton $\pi$ without T lines, i.e. no transpose allowed. Suppose all $f^l$ are $\alpha$-controlled for some $\alpha < 2$. Sample all input vars as in Section 3.2 and assume almost sure rank convergence. Then for any $c \in C$ and any $\alpha$-controlled function $\phi : \mathbb{R}^c\to\mathbb{R}$, $\alpha < 2$,
$$\frac{1}{n^{ct}}\sum_{i=1}^{n^{ct}}\phi(g_i^{ct}) \xrightarrow{\text{a.s.}} \operatorname{E}\phi(Z)$$
where $g_i^{ct} = (g_i^{lt})_{g^l\in c}$ and $\mathbb{R}^c \ni Z = (Z^g)_{g\in c} \sim \mathcal{N}(\mu^c, K^c)$.
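Before turning to the proof, the statement can be sanity-checked on a toy two-line tensor program $g := Ah$, $h := f(x)$ with $A_{ij} \sim \mathcal{N}(0, 1/n)$ and $x$ an input G-var with iid $\mathcal{N}(0,1)$ coordinates. The following sketch is our own illustration (the choices $f = \tanh$ and test function $\phi$ are ours, not from the paper); the coordinate distribution of $g$ converges to $\mathcal{N}(0, K)$ with $K = \operatorname{E}f(Z_x)^2$.

```python
# Toy check of GP-style convergence for one MatMul + Nonlin (our illustration).
import numpy as np

rng = np.random.default_rng(6)
f, phi = np.tanh, lambda z: np.abs(z) ** 3

Z = rng.normal(size=1_000_000)          # Monte Carlo samples for the limit
K = np.mean(f(Z) ** 2)                  # limit kernel K(g, g) = E f(Z_x)^2

for n in [100, 1000, 10000]:
    x = rng.normal(size=n)
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    g = A @ f(x)
    # empirical average of phi(g_i) vs the Gaussian limit E phi(sqrt(K) Z)
    print(n, np.mean(phi(g)), np.mean(phi(np.sqrt(K) * Z)))
```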

Proof. WLOG we assume that all input vars appear in the beginning of the program. We do induction on the length of the program.

ct a.s. d 1 Pn cint cint lt cint cint c φ( ) −−→ φ(z) = ( ) l = N (µ ,K ) Base case. We show that for each , nct i=1 gi E where gi gi g ∈cin and z ∼ N (µcin∞,Kcin∞). q For α-controlled φ, α ∈ [1, 2), for every q > 0, there is monotonic function f such that Ex∼N (µ,K) |φ(x)| ≤ ct cint cint 1 Pn 2+ρ f(kµk2, kΣk2) by Lem E.10 uniformly over all µ, K. If we let Xnct,i = φ(gi ) − E φ(gi ), then nct i=1 E |Xn,i| ct cint cint 1 Pn cint is bounded uniformly over all t as µ and K are bounded uniformly over all t. So by Thm E.4, nct i=1 φ(gi ) − cint a.s. E φ(gi ) −−→ 0. Because Ez∼N (µcint,Kcint) φ(z) → Ez∼N (µcin∞,Kcin∞) φ(z) as t → ∞ (for example by an easy application of dominated convergence), we have our desired result. Inductive Case. The inductive case for lines of type LinComb is obvious, as α-controlled functions are closed under composition with linear transforms. t Lt t t t l t l t Suppose at time t, g = g = A h is a line of type MatMul and h = f(g 01 ,..., g 0k0 ) is a line of type Nonlin l t l t t t t 0t 0t where g 01 ,..., g 0k0 (resp. A ) are some previous G-vars (resp. A-var). (The case for g = A g for a G-var g t 0t def def def can be reduced to this case by setting f = id and h = f(g )). Set c = c1 = c(g) = c1(A) and c2 = c(h) = c2(A). i i i r i r Suppose g := Ah , i = 1, . . . , r, are all previous lines of type MatMul involving A. Here {g }i=1 are G-vars and {h }i=1 i i li1 lik i r are G- or H-vars, defined by Nonlin lines h := f (g ,..., g i ) for a collection of functions {f : R → R}i=1. Let def c1t def c2t Gt = [g1t| · · · |grt] ∈ Rn ×r,Ht = [h1t| · · · |hrt] ∈ Rn ×r, so that Gt = AtHt. We will abuse notation and sometimes j r t use G to mean the collection of G-vars {g }j=1. Let A be the σ-algebra spanned by the G-vars appearing before g. By the conditioning trick Lem E.3,

t d t t + ˜t ⊥ t g =At (G (H ) + A ΠHt )h

˜t t t t + where A is an independent copy of A and ΠHt = H (H ) is projection to the space spanned by the columns of Ht. Each ith coordinate of gt is independent conditioned on At, with mean µt def= Gt (Ht)+ht and standard deviation q i i: t def At ⊥ t 2 t At line(A)t At σ = σ kΠHt h k /n2(A ) where σ is a shorthand for σ . For simplicity, we assume σ = 1 for all t; the general case follows very easily from the reasoning below and the fact that σAt converges to a finite, nonzero value. Claim F.0.1. (σt)2 −−→a.s. (σ∞)2 def= Kc(g, g) − Kc(g, G)Kc(G, G)+Kc(G, g).

t 2 1 t> t t> t 1 t> t t> t t> t + t> t t Proof: Note that (σ ) = nc2t (h h − h ΠH h ) = nc2t (h h − h H (H H ) H h ). By induction hypothesis, because f(z)2 is α-controlled for some α < 2,

nc2t 1 t> t 1 X l t l t 2 a.s. l l 2 c c c h h = f(g 01 ,..., g 0k0 ) −−→ E[f(z 01 , . . . , z 0k0 ) : z ∼ N (µ ,K )] = K (g, g), nc2t nc2t i=1

53 l j where z = (z )gl∈c. Likewise, because both f and {f }j are α-controlled jointly for some α < 2, by induction hypothesis,

nc2t 1 t> jt 1 X l t l t j l t l t h h = f(g 01 ,..., g 0k0 )f (g j1 ,..., g jkj ) nc2t nc2t i=1 a.s. l l j l l c c −−→ E[f(z 01 , . . . , z 0k0 )f (z j1 , . . . , z jkj ): z ∼ N (µ ,K )] = Kc(g, gj)

nc2t 0 0 1 j t> jt 1 X j l 0 t ljk t j l t l t h h = f (g j 1 ,..., g j0 )f (g j1 ,..., g jkj ) nc2t nc2t i=1 0 a.s. j l 0 lj0k j l l c c −−→ E[f (z j 1 , . . . , z j0 )f (z j1 , . . . , z jkj ): z ∼ N (µ ,K )] 0 = Kc(gj , gj).

t> t + a.s. c + Finally, by the rank convergence assumption we get (H H ) −−→ K (G, G) .  Claim F.0.2. (Ht)+ht −−→a.s. v∞ def= Kc(G, G)+Kc(G, g).

t + t t> t + t> t a.s. c + c Proof: This is similar to the above. (H ) h = (H H ) H h −−→ K (G, G) K (G, g).  c Let L = line(g). Let φ : R| 0, |φ(x)| ≤ C P |x |α+c e i i . Since for every q > 0, h i h i t c

nct nct  t α P t α 1 X h t c

nct def 1 X c t = C1 (Cqα(σt)α) ψ(g

1 ∞ α c

c

nct 1 X c t h c t i a.s. φ(gt, g

After applying the following claim and the induction hypothesis on    

c

nct   r   1 j X t c

j d c

t d P jt ∞ P jt Proof: From the claims above we have that almost surely, gi =At j gi vj + o(1)( j |gi |) + (1 + o(1))z, where def L c

L c

Hence

ct   1 n h i X t c

55 ct 1 Pn c

Theorem 5.1. Sample $\{v^{it}\}_{i=1}^{L_\nabla}$ with zero mean (i.e. $\mu^{c_{\mathrm{in}}t}(g^{L+L_A+i}) \to 0$ for all $i \in [L_\nabla]$) and independently from the input vars $\{x^{lt}\}_{l=1}^{L_g^0}$ (i.e. $K^{c_{\mathrm{in}}t}(g^l, g^{l'}) = 0$ if $l > L \ge l'$).${}^{13}$ Sample all other vars in $\tilde\pi$ according to Section 3.2. Assume all $f^l$ of $\tilde\pi$ are polynomially bounded and $\tilde\pi$ satisfies almost sure rank convergence. Then for any dimension constraints $\Lambda$, any $c \in C(\tilde\pi, \Lambda)$, and any polynomially bounded function $\phi : \mathbb{R}^c\to\mathbb{R}$,
$$\frac{1}{n^{ct}}\sum_{i=1}^{n^{ct}}\phi(g_i^{ct}) \xrightarrow{\text{a.s.}} \operatorname{E}\phi(Z)$$
where $g_i^{ct} = (g_i^{lt})_{g^l\in c}$ and $Z \sim \mathcal{N}(\mu^c, K^c)$.

Note that we impose a more stringent condition here, compared to Thm 4.3, that the $f^l$ are polynomially bounded, because, as will be apparent from the reasoning below, we need to reason about compositions of $\phi$ and $f^l$; if we still allowed $f^l$ and $\phi$ to be $\alpha$-controlled in general, then their composition in general would not be integrable against Gaussian measures.

Proof. Thm 4.3 already show that this is true for all gl in π. Because we assume that vit are sampled independently from lt x , this is also true up to line L + LA + L∇. We induct on the line number starting from ` = L + LA + L∇ + 1. If line ` does not produce a new G-var, then there is nothing to prove. If line ` is of type LinComb, then the induction hypothesis is obviously true. In the following, suppose line ` is of type MatMul. Setup This line involves a transposed matrix, ` : g` := AL+ahl, where AL+a = (Aa)> by line L + a and l > L (the l l L+a ` L+a l argument for g instead of h is similar and simpler). Let c = c1 = c1(A ) = c(g ) and c2 = c2(A ) = c(h ). Conditioning on all G-vars that appeared before, we have constraints of the form git = AL+a,thit for i = 1, . . . , r, 0it at 0it i r 0i s i r 0i s and g = A h for i = 1, . . . , s. Here {g }i=1 and {g }i=1 are previous G-vars and {h }i=1 and {h }i=1 are c1t previous G- or H-vars. Letting Gt = [g1t| · · · |grt] ∈ Rn ×r (where git are treated as column vectors), and similarly c2t c2t c1t for Ht ∈ Rn ×r,G0t ∈ Rn ×s,H0t ∈ Rn ×s, we get the expressions Gt = AL+a,tHt,G0t = AatH0t. We will abuse notation and sometimes use G to also denote the corresponding collection of G-vars; likewise for G0,H,H0. By the construction of π˜, gi = AL+ahi are lines that appear after π, and g0i = Aah0i are lines that appear in π. In addition, for each i ∈ [r], hi is an H-var that appears after π. Let At be the σ-algebra spanned by the values of all G-vars that appeared before line ` at time t. By the conditioning trick, we have

`t d t ⊥ ˜t ⊥ lt g =At (E + ΠH0t A ΠHt )h with Et = GtHt+ + H0t+>G0t> − H0t+>G0t>HtHt+ = Gt(Ht>Ht)+Ht> + H0t(H0t>H0t)+G0t> − H0t(H0t>H0t)+G0t>Ht(Ht>Ht)+Ht> where A˜t is sampled independently and identically as AL+a,t. Note that

`t d t t ⊥ g =At µ + σ ΠH0t y, with y ∼ N (0,Inc1t ) def µt = Ethlt

c2t t 2 def at 2 n ⊥ lt 2 c2t (σ ) = (σ ) kΠHt h k /n nc1t 13In our previous example of f(x) = v>ρ(x), this corresponds to the readout layer v sampled with zero mean and independently from x and other parameters of ρ.

56 at 2 at at 2 L+a,t at L+a,t where, to recall, (σ ) /n2(A ) = (σ ) /n1(A ) is the sampling variance of each entry of A and A . For brevity, we use the following shorthands

t def t> t c t r×r 0t def 0t> 0t c t s×s t def 0t> t c t s×r Σ = H H /n 2 ∈ R Σ = H H /n 1 ∈ R Υ = G H /n 2 ∈ R t def t> lt c t r t def 0t> lt c t s ω = H h /n 2 ∈ R β = G h /n 2 ∈ R so that

nc2t nc2t µt = GtΣt+ωt + H0tΣ0t+βt − H0tΣ0t+ΥtΣt+ωt. nc1t nc1t

t 0t t t t By induction hypothesis, Σ , Σ , Υ , ω , β all converge almost surely to corresponding limit values: Let α = αc2,c1 = nc2t i 0 0i c c limt→∞ nc1t ; if λi = line(h ), λi = line(h ), then, with Z ∼ N (µ ,K ),

t a.s. ∞ def λi λj k∞ −2 −1 c i j Σij −−→ Σij = E f (Z)f (Z) = (σ ) α K (g , g ) a.s. def 0 0 0t 0∞ λi λj k∞ −2 c 0i 0j Σ ij −−→ Σ ij = E f (Z)f (Z) = (σ ) K (g , g )

t a.s. ∞ def λi l k∞ −2 −1 c i ` t a.s. t a.s. ωi −−→ ωi = E f (Z)f (Z) = (σ ) α K (g , g )Υi −−→ 0 βi −−→ 0.

The last two limits go to 0 because Ht and hlt are odd in v1, . . . vL∇ , which are sampled independently from G0t as remarked above. Consequently, by our rank assumption, Σt+ −−→a.s. Σ∞+, Σ0t+ −−→a.s. Σ0∞+.

a.s. def Claim F.0.1. (σt)2 −−→ (σ∞)2 = Kc(g`, g`) − Kc(g`,G)Kc(G, G)+Kc(G, g`) with t.

Proof: We have

(σt)2 lt> ⊥ lt c2t = h Π t h /n at 2 nc2t H (σ ) nc1t lt 2 c2t lt> lt c2t = kh k /n − h ΠHt h /n = khltk2/nc2t − (hlt>Ht/nc2t)(Ht>Ht/nc2t)+(Ht>hlt/nc2t) = khltk2/nc2t − ωt>Σt+ωt.

c2t a.s. lt 2 c2t 1 Pn lt 2 l 2 c c k∞ −2 −1 c ` ` Now kh k /n = nc2t i=1 (hi ) −−→ E[f (Z) : Z ∼ N (µ ,K )] = (σ ) α K (g , g ) by induction hypothe- sis. On the other hand, again by induction hypothesis and the rank assumption, the second term converges almost surely k∞ −2 −1 c ` c + c ` at 2 nc2t ∞t 2 to (σ ) α K (g ,G)K (G, G) K (G, g ). Combined with the simple fact that (σ ) nc1t → (σ ) α, we get the desired result.

 Let vt def= Σt+ωt, for t ∈ [1, ∞], so that, by our rank condition, vt −−→a.s. v∞ = Σ∞+ω∞, which we can check is equal to Kc(G, G)+Kc(G, g`).

Claim F.0.2. For some vectors εt ∈ Rr, ε0t ∈ Rs that go to 0 almost surely with t, µt = Ethlt = Gt(v∞ + εt) + H0tε0t.

t+ a.s. ∞+ t a.s. ∞ t a.s. t a.s. Proof: Follows immediately from the fact derived above that Σ −−→ Σ , ω −−→ ω , Υ −−→ 0, β −−→ 0.  Convergence almost surely. Let φ be a function with |φ(x)| ≤ C(1 + kxk2p), p ∈ N; φ will be our test function. With 57 w ∼ N (0, 1),Z ∼ N (µc,Kc),

ct n 1 X `t c<`t φ(gi , gi ) − E φ(Z)] ≤ A + B + C nct Z i=1 with

ct   n r def 1 X X ∞ jt ∞ c<`t A = E φ  vj gi + σ w, gi  − E φ(Z) nct w Z i=1 j=1 ct   n  q  r def 1 X t t ⊥ c<`t X ∞ jt ∞ c<`t B = E φ µi + σ (Π 0t )iiw, gi − E φ  vj gi + σ w, gi  nct w H w i=1 j=1 ct n    q  def 1 X `t c<`t t t ⊥ c<`t C = φ gi , gi − E φ µi + σ (Π 0t )iiw, gi nct w H i=1 We shall show that each of A, B, C goes to 0 almost surely as t → ∞. Claim F.0.3. A −−→a.s. 0.

c c c Proof: Note that if R 3 ζ ∼ N (µ ,K ), r X ∞ gj ∞ vj ζ + σ w j=1 = (ζG)>Kc(G, G)+Kc(G, g`) + (Kc(g`, g`) − Kc(g`,G)Kc(G, G)+Kc(G, g`))w

d g` =ζc<` ζ .   Pr ∞ jt ∞ c<`t c<`t Since Ew φ j=1 vj gi + σ w, gi is purely a polynomially-bounded function of gi , this claim is given by the induction hypothesis.  Claim F.0.4. C −−→a.s. 0.

def c t def c<`t ct t t t <` ˜t t Proof: Fix the values of g . For each i ∈ [n ], let φi(x) = φ(µi + σ x, gi ), and φi(x) = φi(x) − t 0 ⊥ t nct Ex0∼N (0,(Π⊥ ) ) φi(x ). By Thm E.21 applied to ΠH0t and {φi}i=1, we get, for any ρ ≥ 6 and any q > 1, H0t ii

 ct 2ρ  v  n u nct 1 1 h i X ˜t ct −1.5+1/q uq X ˜t 2ρq E  ct φi(zi) ≤ O (n ) t ct E φi(zi)  z∼N (0,Π⊥ ) n n z ∼N (0,(Π⊥ ) ) H0t i=1 i=1 i H0t ii

t where the constant hidden in O(−) is independent of q, t, the functions φi, and ΠH0t . We first show that the sum ct h i 1 Pn ˜t 2ρq t nct i=1 Ezi φi(zi) is uniformly bounded almost surely in t over the probability of {A }t≥1, for any q > 1. Indeed, ⊥ with z ∼ N (0, ΠH0t ), nct nct "  2ρq# nct 1 X h ˜t 2ρqi 1 2ρq−1 X t 2ρq t 0 1 2ρq X  t 2ρq φ (zi) ≤ 2 φ (zi) + φ (z ) ≤ 2 φ (zi) ct E i ct E i E0 i i ct E i n zi n zi z n zi i=1 i=1 i i=1 nct 1 2ρq X h t t c<`t 2ρqi = ct 2 E φ(µi + σ zi, gi ) n zi i=1 nct 1 0 X h t 4ρpq t 4ρpq c<`t 4ρpqi ≤ ct C E |µi| + |σ zi| + kgi k n zi i=1 t→∞ nct a.s.   1 00 X t 4ρpq ∞ 4ρpq 4ρpq c<`t 4ρpq ≤ ct C |µi| + (2|σ | + 1) E |zi| + kgi k n zi i=1

58 for some constants C0,C00 > 0, where this last inequality holds for large enough t, almost surely. By induction hypothesis ⊥ and the fact that (ΠH0t )ii ∈ [0, 1] for all i,

nct   1 X ∞ 4ρpq 4ρpq c<`t 4ρpq ct (2|σ | + 1) E |zi| + kgi k n zi i=1 is uniformly bounded in t, almost surely. Thus it remains to bound

4ρpq nct nct  r  1 X t 4ρpq 1 X X jt ∞ t 0jt 0 t |µi| =  g (vj + εj) + h εj  nct nct i i i=1 i=1 j=1 where εt, ε0t −−→a.s. 0, by Claim F.0.2 4ρpq 4ρpq nct  r   r  1 000 X X jt ∞ X jt t 0jt 0 t ≤ C  g vj  +  g εj + h εj  nct i i i i=1 j=1 j=1 4ρpq 4ρpq t→∞ nct  r   r  a.s. 1 000 X X jt ∞ X jt 0jt ≤ C  g vj  +  |g | + |h | nct i i i i=1 j=1 j=1

000 0jt for some constant C , where the last inequality holds for large enough t, almost surely. Because each h i is a polynomially c<`t 14 bounded function of gi , each summand of the RHS is as well . So by induction hypothesis, this converges to a finite value, and hence is uniformly bounded in t, almost surely, as desired.

 ct 2ρ 1 Pn ˜t ct −1.25 Thus, almost surely, Ez∼N (0,Π⊥ ) ct i=1 φi(zi) ≤ c(n ) for some c, by choosing q large enough. By H0t n Lem E.5 and the fact that nct strictly increases with t, we have

ct ct 1 n 1 n    q  X ˜t X `t c<`t t t ⊥ c<`t a.s. φi(zi) = ψ gi , gi − E ψ µi + σ (Π 0t )iiw, gi −−→ 0, where w ∼ N (0, 1). nct nct w H i=1 i=1



Claim F.0.5. B −−→a.s. 0.

Proof: We apply a similar argument as the one in the proof of Thm 4.3 that leverages the smoothness of Gaussian average ˜ ⊥ over A. The major difference here is that we have to deal with the varying variances (ΠH0t )ii for each t, but this can be done by using the fact that rank H0t = O(1).

c def c Define Φ(x`, x <` ; σ) = E[φ(x` + σw, x <` ): w ∼ N (0, 1)]. Then by Lem E.7, Φ is differentiable in x` and σ2 ` c<` −1 ` c<` ` c<` 1 −2 ` with ∂x` Φ(x , x ; σ) = σ E[wφ(x + σw, x ): w ∼ N (0, 1)] and ∂σ2 Φ(x , x ; σ) = 2 σ Ew∼N (0,1) φ(x + 14This is the only place where we need the assumption that all fl are polynomially bounded. Otherwise, their composition might not ne integrable against the Gaussian measure.

59 σw, xc<` )(w2 − 1). Clearly,

` c<` −1 ` c<` 2p |∂x` Φ(x , x ; σ)| ≤ σ C E[|w|(1 + |x | + |σw| + kx k) : w ∼ N (0, 1)] −1 2p−1 ` 2p 2p c 2p ≤ σ 4 C E[|w|(1 + |x | + |σw| + kx <` k ): w ∼ N (0, 1)] ≤ σ−1C0(1 + |x`|2p + kxc<` k2p + σ2p) 1 ` c<` −2 ` c<` 2 |∂σ2 Φ(x , x ; σ)| ≤ σ E |φ(x + σw, x )||w − 1| 2 w∼N (0,1) −2 ` 2p 2p c 2p 2 ≤ Dσ E (1 + |x | + |σw| + kx <` k )|w − 1| w∼N (0,1) ≤ D0σ−2(1 + |x`|2p + kxc<` k2p + σ2p) q ` c<` ` c 2 ` c 2 =⇒ k∇x`,σ2 Φ(x , x ; σ)k = |∂x` Φ(x , x <` ; σ)| + |∂σ2 Φ(x , x <` ; σ)| ≤ D00(1 + σ−2)(1 + |x`|2p + kxc<` k2p + σ2p)

for some constant C0,D,D0,D00 depending only on p and C. Thus

p |Φ(x`, xc<` ; σ) − Φ(x` + ϑ, xc<` ; σ2 + ς2)| Z 1 2 ` c<` p 2 2 ≤ (|ϑ| + ς ) dt k∇x`,σ2 Φ(x + ϑt, x ; σ + ς t)k 0 Z 1  1  00 2 ` 2p c<` 2p 2 2 p ≤ D (|ϑ| + ς ) dt 1 + 2 2 (1 + |x + ϑt| + kx k + (σ + ς t) ) 0 σ + ς t Z 1  1  ˜ 2 ` 2p 2p 2p c<` 2p 2p 2p p ≤ C(|ϑ| + ς ) dt 1 + 2 (1 + |x | + ϑ t + kx k + σ + ς t ) 0 σ  1  ≤ C˜0(|ϑ| + ς2) 1 + (1 + |x`|2p + ϑ2p + kxc<` k2p + σ2p + ς2p) σ2

˜ ˜0 ct def ⊥ for some constants C, C depending only on p and C. Partition [n ] = U t V where U = {i : (ΠH0t )ii < 1/2} and V is its complement. Note that |U| ≤ 2 rank H0t ≤ 2s. So

ct n  q    1 X t t ⊥ c<`t t ∞+ ∞ ∞ c<`t E φ µi + σ (Π 0t )iiw, gi − E φ Gi:Σ ω + σ w, gi nct w H w i=1 ct n  q    1 X t c<`t t ⊥ t ∞+ ∞ c<`t ∞ = Φ µ , g ; σ (Π 0t )ii − Φ G Σ ω , g ; σ nct i i H i: i i=1   q    1 X t c<`t t ⊥ t ∞+ ∞ c<`t ∞ ≤ Φ µ , g ; σ (Π 0t )ii + Φ G Σ ω , g ; σ (18) nct i i H i: i i∈U  q     X t c<`t t ⊥ t ∞+ ∞ c<`t ∞ + Φ µ , g ; σ (Π 0t )ii − Φ G Σ ω , g ; σ . (19) i i H i: i i∈V

60 The first sum (18) converges almost surely to 0:  q    1 X t c<`t t ⊥ t ∞+ ∞ c<`t ∞ Φ µ , g ; σ (Π 0t )ii + Φ G Σ ω , g ; σ nct i i H i: i i∈U 2s  q    t c<`t t ⊥ t ∞+ ∞ c<`t ∞ ≤ max Φ µi, gi ; σ (ΠH0t )ii + Φ Gi:Σ ω , gi ; σ nct i∈[nct] v 2s u 1  q  N Nu X t c<`t t ⊥ ≤ t Φ µi, gi ; σ (Π 0t )ii (nct)1−1/N nct H i∈[nct] v 2s u 1   N X t ∞+ ∞ c<`t ∞ + Nu Φ G Σ ω , g ; σ (20) (nct)1−1/N tnct i: i i∈[nct]

  N 1 P t ∞+ ∞ c<`t ∞ for any N > 0. Now ct Φ G Σ ω , g ; σ converges a.s. to a finite value, by Claim F.0.3, so for nct i∈[n ] i: i  q  N 1 P t c<`t t ⊥ large N, the second term in Eq. (20) converges to 0. Similarly, ct Φ µ , g ; σ (Π 0t ) converges a.s. nct i∈[n ] i i H ii to a finite value, as in the proof of Claim F.0.4, so for large N, the first term of Eq. (20) and thus Eq. (20) itself go to 0 almost surely. We proceed with the second sum (19):  q    1 X t c<`t t ⊥ t ∞+ ∞ c<`t ∞ Φ µ , g ; σ (Π 0t )ii − Φ G Σ ω , g ; σ nct i i H i: i i∈V  1  1 ˜0 X ∞ 2 t 2 ⊥ t 2p 2p c<` 2p ∞ t 2p ≤ C 1 + √ (%i + |(σ ) − (σ ) (Π 0t )ii|)(1 + |µ | + % + kg k + max(σ , σ ) ) ∞ t 2 nct H i i i min(σ , σ / 2) i∈V

def t t 0t 0t t 0t def t 2p 2p c<` 2p where %i = G ε + H ε with ε , ε = o(1) a.s. coming from Claim F.0.2. Write Yi = (1 + |µ | + % + kg k + i: i:  i i i max(σ∞, σt)2p). Since 1 + 1 √ is obviously uniformly bounded in t, via Cauchy-Schwarz, the sum above min(σ∞,σt/ 2)2 is bounded by a constant multiple of s s 1 X 1 X (% + |(σ∞)2 − (σt)2(Π⊥ ) |)2 Y 2. nct i H0t ii nct i i∈V i∈V q 1 P 2 Using similar techniques as before, by applying induction hypothesis, nct i∈V Yi can be shown to be uniformly bounded in t, almost surely. All it remains to show is that the first term in the product above converges to 0 a.s.. Now s 1 X (% + |(σ∞)2 − (σt)2(Π⊥ ) |)2 nct i H0t ii i∈V s s s 1 X 1 X 1 X ≤ %2 + ((σ∞)2 − (σt)2)2 + (σt)4(1 − (Π⊥ ) )2 nct i nct nct H0t ii i∈V i∈V i∈V s s 1 X 1 X ≤ %2 + |(σ∞)2 − (σt)2| + (σt)2 1 − (Π⊥ ) nct i nct H0t ii i∈V i∈V ⊥ since 1 − (ΠH0t )ii ∈ [0, 1/2] s s r 1 X 1 X 2 rank H0t ≤ (Gt εt)2 + (H0t ε0t)2 + |(σ∞)2 − (σt)2| + (σt)2 nct i: nct i: nct i∈V i∈V s s r 1 X 1 X 2 rank H0t ≤ kεtk kGt k2 + kε0tk kH0t k2 + |(σ∞)2 − (σt)2| + (σt)2 . nct i: nct i: nct i∈V i∈V

61 q 1 P t 2 By induction hypothesis, nct i∈V kGi:k converges and so is also uniformly bounded in t, almost surely. Then, because kεtk −−→a.s. 0, the first term converges to 0 a.s.. Likewise for the second term. The final two terms also obviously converge to 0 almost surely with t. This completes the proof of our claim.



We are finally ready to prove Thm 6.3, but a few lemmas first would help our effort significantly.

Lemma F.1. Let $X = (X_1, \ldots, X_n)$ be a multivariate Gaussian with 0 mean and nondegenerate covariance. For any $L^2(X)$-integrable function $f$,
$$\operatorname{E}X_if(X) = \sum_{j=1}^n\operatorname{Cov}(X_i, X_j)\,\frac{\operatorname{E}(X_j - \operatorname{E}[X_j|X_{\setminus j}])f(X)}{\operatorname{Var}(X_j|X_{\setminus j})} = \sum_{j=1}^n\operatorname{Cov}(X_i, X_j)\,\frac{\operatorname{E}\big(X_j - \operatorname{Cov}(X_j, X_{\setminus j})\operatorname{Cov}(X_{\setminus j}, X_{\setminus j})^+X_{\setminus j}\big)f(X)}{\operatorname{Var}(X_j) - \operatorname{Cov}(X_j, X_{\setminus j})\operatorname{Cov}(X_{\setminus j}, X_{\setminus j})^+\operatorname{Cov}(X_{\setminus j}, X_j)}.$$

Proof. By a density argument, it suffices to consider only the case when $f$ is $C^\infty$. Then by Stein's lemma,
$$\operatorname{E}X_if(X) = \sum_{j=1}^n\operatorname{Cov}(X_i, X_j)\operatorname{E}\partial_jf(X).$$
By Stein's lemma again,
$$\operatorname{E}\partial_jf(X) = \operatorname{E}_{X_{\setminus j}}\operatorname{E}_{X_j|X_{\setminus j}}\partial_jf(X) = \operatorname{E}_{X_{\setminus j}}\operatorname{E}_{X_j|X_{\setminus j}}\frac{\big(X_j - \operatorname{Cov}(X_j, X_{\setminus j})\operatorname{Cov}(X_{\setminus j}, X_{\setminus j})^+X_{\setminus j}\big)f(X)}{\operatorname{Var}(X_j) - \operatorname{Cov}(X_j, X_{\setminus j})\operatorname{Cov}(X_{\setminus j}, X_{\setminus j})^+\operatorname{Cov}(X_{\setminus j}, X_j)} = \operatorname{E}_X\frac{\big(X_j - \operatorname{Cov}(X_j, X_{\setminus j})\operatorname{Cov}(X_{\setminus j}, X_{\setminus j})^+X_{\setminus j}\big)f(X)}{\operatorname{Var}(X_j) - \operatorname{Cov}(X_j, X_{\setminus j})\operatorname{Cov}(X_{\setminus j}, X_{\setminus j})^+\operatorname{Cov}(X_{\setminus j}, X_j)}.$$

Definition F.2. Let $g^l = Ah^m$ be a line in $\pi$ and let $g^i = A'h^i$, $i = 1, \ldots, s$ be all previous MatMul lines that involve $A'$. Here $A'$ is the A-var such that $A' := A^\top$ if $A$ is an input A-var, or $A := A'^\top$ if $A$ is a transposed var. Then, by the construction of $\check\pi$, $\varphi(g^l) = \varphi_g(g^l) + \sum_{j=1}^s a_j\varphi(h^j)$ for some coefficients $a$ defined in Defn 6.2. Define $f^{\varphi_h(g^l)} \overset{\mathrm{def}}{=} \sum_{j=1}^s a_jf^{\varphi(h^j)}$, so $f^{\varphi(g^l)} = f^{\varphi_g(g^l)} + f^{\varphi_h(g^l)}$.

Let $g^l := Ah^m$ be a MatMul line in $\pi$ and set $\check c = c(\varphi(g^l))$. Consider the Hilbert space $L^2(Z)$ of $L^2$ functions of $Z \sim \mathcal{N}(\mu^{\check c}, K^{\check c})$ (equivalently, square-integrable random variables in the $\sigma$-algebra generated by $Z$). Write $\langle f, g\rangle = \operatorname{E}f(Z)g(Z)$ for its inner product.

Lemma F.3. Fix a line number $\lambda \ge l$. Let $g^i = A'h^i$, $i = 1, \ldots, s$ be all MatMul lines strictly before line $\lambda$ that involve $A'$. Define $C \in \mathbb{R}^{s\times s}$, $v \in \mathbb{R}^s$ by
$$C_{ij} = \langle f^{\varphi(h^i)}, f^{\varphi(h^j)}\rangle, \qquad v_i = \langle f^{\varphi_g(g^i)}, f^{\varphi(h^m)}\rangle.$$
Then for all $i \in [s]$,
$$v_i = \alpha^{-1}\langle f^{\varphi(h^i)}, f^{\varphi_h(g^l)}\rangle, \qquad f^{\varphi_h(g^l)} = \alpha\,(f^{\varphi(h^i)})_{i=1}^s\,C^+v = \alpha\sum_{i,j\in[s]}(C^+)_{ij}v_jf^{\varphi(h^i)},$$
where $\alpha = \lim_{t\to\infty}n_2(A^t)/n_1(A^t) = \alpha_{c_2(A), c_1(A)}$.

Proof. Let $I' \overset{\mathrm{def}}{=} \{i : \operatorname{line}(\varphi_g(g^i)) < l\}$.

i l −1 ϕ(h ) ϕh(g ) 0 + We show vi = α hf , f i for all i ∈ I . Note first of all that a in Defn F.2 is defined by Defn 6.2 as αCI0 vI0 , where , CI0 = (Cij)i,j∈I0 , and vI0 = (vi)i∈I0 .

I def ϕ (gi) I0 def ϕ (gi) I I0 Suppose Z = {Z g }i∈I is a maximal linearly independent set in Z = {Z g }i∈I0 . Note that Z (as well as Z ) is also independent of all Zgˇ where gˇ is produced by MatMul involving a matrix that is not ϕ(A0) or where gˇ is an input ˇc 0 gˇ ϕ(hm) I var (by the construction of K ). Let Z the collection of all such Z . Then EZJ f (Z) is purely a function of Z , by I0 I I ϕ(hm) expressing other elements of Z as linear combinations of Z . Thus, by Lem F.1 applied to Z and EZ0 f , there exist coefficients {aj}j∈I such that, for each i ∈ I,

i m i m ϕg(g ) ϕ(h ) ϕg(g ) ϕ(h ) vi = E Z E f (Z) = E Z E f (Z) ZI Z0|ZI ZI Z0 X ˇc i j = ajK (ϕg(g ), ϕg(g )) j∈I i j ϕg(g ) X ϕg(g ) = hZ , ajZ i j∈I k∞ 2 (σ ) i X j = hfϕ(h ), a fϕ(h )i α j j∈I by construction of Kc.

This equality is extended to all i ∈ I0 via linear combination. Thus,

k∞ 2 (σ ) ϕ(hi) X ϕ(hj ) (v ) 0 = h(f ) 0 , a f i i i∈I α i∈I j j∈I l i ϕh(g ) X ϕ(h ) + f = α f (CI0 )ijvj i,j∈I0 k∞ 2 X ϕ(hi) + ϕ(hj ) X ϕ(hj ) = (σ ) f (CI0 )ijhf , ajf i i,j∈I0 j∈I k∞ 2 X ϕ(hj ) = (σ ) ΠI0 ajf j∈I k∞ 2 X ϕ(hj ) = (σ ) ajf j∈I

ϕ(hi) where ΠI0 is the projection operator on L2(Z) that projects to the linear span of {f }i∈I0 (see Defn E.1 and the basic facts underneath). 1 ϕ(hi) l 0 So all along, vi = α hf , ϕh(g )i for all i ∈ I .

−1 ϕ(hi) ϕ (gl) 0 i 0 i We show vi = α hf , f h i for all i 6∈ I . Suppose g = A h has line number greater than l. Then, we have that, 0 m i conditioned on ZI , fϕ(h ) and fϕg(g ) are independent. Indeed, with the conditioning, the randomness in the former only ϕ (gj ) 0 comes from {Z g }j6∈I0 and the randomness in the latter only comes from Z , and the two are independent. Thus,

 i   m  ϕg(g ) ϕ(h ) vi = E E f (Z) E f (Z) ZI0 Z|ZI0 Z|ZI0    ˇc ˇc + I0  ϕ(hm) = E KiI0 (KI0I0 ) Z E f (Z) ZI0 Z|ZI0 i m ˇc ˇc + ϕg(g ) ϕ(h ) = KiI0 (KI0I0 ) (hZ , f i)i∈I0

63 ˇc ˇc i j ˇc ˇc i j where K is the row vector (K (ϕg(g ), ϕ(g )))j∈I0 and KI0I0 is the submatrix (K (ϕg(g ), ϕ(g )))i,j∈I0 . Again by the construction of Kˇc, this simplifies to

X ϕ(hi) ϕ(hj ) + vi = hf , f i(C )jkvk j,k∈I0 i l ϕ(h ) −1 ϕh(g ) vi = hf , α f i

Therefore,

X ϕ(hi) + f (C )ijvj i,j∈[s] i j l −1 X ϕ(h ) + ϕ(h ) ϕh(g ) = α f (C )ijhf , f i i,j∈[s] l = α−1Πfϕh(g ) l = α−1fϕh(g ).

j l ϕ(h ) ϕh(g ) where Π is the projection operator to the span of {f }j∈[s], and the last equality follows because f is already in this span.

Theorem 6.3. Let $\pi$ be an original syntax program with sampling instructions, and $\check\pi$ be the detransposition of $\pi$, with $\varphi$ the mapping from vector vars of $\pi$ to vector vars of $\check\pi$. Assume all $f^l$ of $\pi$ are polynomially bounded, and that almost sure rank convergence holds for $\check\pi$. Sample input vars of $\pi$ according to Section 3.2. Then for any dimension constraints $\Lambda$, any $c \in C(\pi, \Lambda)$, and any polynomially bounded function $\phi : \mathbb{R}^{\bar c}\to\mathbb{R}$,
$$\frac{1}{n^{ct}}\sum_{i=1}^{n^{ct}}\phi(h_i^{ct}) \xrightarrow{\text{a.s.}} \operatorname{E}\phi\left((f^{\varphi(h)}(Z))_{h\in\bar c}\right)$$
where $h_i^{ct}$ is the sequence of the $i$th coordinates of all vector vars in $\bar c$, and $Z \sim \mathcal{N}(\mu^{\check c}, K^{\check c})$.

If all $f^l$ in $\pi$ are differentiable, then we can take $a$ in Item 6 of Defn 6.2 to be $\alpha\sigma^2\left(\operatorname{E}\partial_{Z^{\varphi_g(g^i)}}f^{\varphi(h^m)}(Z)\right)_{i\in[s]}$,${}^{15}$ where $\sigma = \sigma^{r\infty}$ and $r$ is the line number of $\varphi(A')$, and the above almost sure convergence will still hold.

Proof. We proceed by induction on line number of π. All line types are trivial except MatMul. So suppose in π, line l is l m l m 0 g := Ah , and the induction hypothesis holds for l and c = c(g ). Set c1 = c and c2 = c(h ). Let A be the A-var such that A0 := A> if A is an input A-var, and A := A0> otherwise. Let gi := Ahi, i = 1, . . . , r, be all MatMul lines involving A that appear before line l, and let g0i := A0h0i, i = 1, . . . , s, 0 i 0i i 0i be all MatMul lines involving A that appear before line l. Note that c(g ) = c(h ) = c1 and c(h ) = c(g ) = c2. Set c2t c1t c1t c2t Ht = [h1t| · · · |hrt] ∈ Rn ×r, and likewise for Gt ∈ Rn ×r,H0t ∈ Rn ×s,G0t ∈ Rn ×r. Let A be the σ-algebra generated by all vector vars before gl. As in the proof of Thm 5.1, lt d t ⊥ ˜t ⊥ mt g =At µ + ΠH0t A ΠHt h ˜t t t t+ t nc2t 0t 0t+ t nc2t 0t 0t+ t t+ t where A is random matrix sampled iid as A, and µ = G Σ ω + nc1t H Σ β − nc1t H Σ Υ Σ ω and

t def t> t c t r×r 0t def 0t> 0t c t s×s t def 0t> t c t s×r Σ = H H /n 2 ∈ R Σ = H H /n 1 ∈ R Υ = G H /n 2 ∈ R t def t> m c t r t def 0t> m c t s ω = H h /n 2 ∈ R β = G h /n 2 ∈ R .

${}^{15}$Even if not all $f^l$ are differentiable, as long as the covariance $\{K^{\check c}(\varphi_g(g^i), \varphi_g(g^j))\}_{i,j\in[s]}$ is nonsingular, we may take the distributional derivatives and interpret the expectation as the application of this (tempered) distribution to the Gaussian density function. Then the theorem holds under this interpretation. Even if the covariance is singular, we may consider a subset $\{\varphi_g(g^i)\}_{i\in I}$ of maximal rank, and still apply the tempered distribution interpretation; see the proof for more details.

64 t 0t t t t By induction hypothesis, Σ , Σ , Υ , ω , β all converge almost surely to corresponding limit values: Let α = αc2,c1 = nc2t ˇc ˇc limt→∞ nc1t . With Z ∼ N (µ ,K ),

t a.s. ∞ def ϕ(hi) ϕ(hj ) Σij −−→ Σij = E f (Z)f (Z) 0t a.s. 0∞ def ϕ(h0i) ϕ(h0j ) Σ ij −−→ Σ ij = E f (Z)f (Z) t a.s. ∞ def ϕ(hi) ϕ(hm) ωi −−→ ωi = E f (Z)f (Z) t a.s. ∞ def ϕ(g0i) ϕ(hj ) Υij −−→ Υi = E f (Z)f (Z) t a.s. ∞ def ϕ(g0i) ϕ(hm) βi −−→ βi = E f (Z)f (Z).

As in the proof of Thm 5.1, we can apply Thm E.21 and Gaussian average smoothness to integrate out A˜t and obtain the following claim Claim F.3.1. With w ∼ N (0, 1),

nct 1 X lt c

def a = Σ∞+ω∞ def a0 = αΣ0∞+(β∞ − Υ∞Σ∞+ω∞) def σ∞ = Kˇc(ϕ (gl), ϕ (gl)) − Kˇc(ϕ (gl), ϕ (G))Kˇc|+ Kˇc(ϕ (G), ϕ (gl)) g g g g ϕg(G) g g

l with ˇc = c(ϕg(g )).

Combining this with the induction hypothesis and Thm 4.3, we have Claim F.3.2. With w ∼ N (0, 1) and Z ∼ N (µˇc,Kˇc),

nct  r s  j 0j0 1 X lt c

In the following, we consider the inner product space of L2-integrable functions of Z (equivalently, square-integrable random variables in the σ-algebra generated by Z), with inner product hf, gi def= f(Z)g(Z) with Z ∼ N (µˇc,Kˇc). We E m abuse notation and let any vector-var in πˇ (e.g. hˇm) denote the corresponding function (e.g. fˇh ). Note that we can rewrite

β∞ − Υ∞Σ∞+ω∞   0i m X 0i j ∞+ k m = hϕ(g ), ϕ(h )i − hϕ(g ), ϕ(h )i(Σ )jkhϕ(h ), ϕ(h )i j,k i∈s ⊥ 0i m 0i ⊥ m = (hΠϕ(H)ϕ(g ), ϕ(h )i)i∈s = (hϕ(g ), Πϕ(H)ϕ(h )i)i∈s

def i r ⊥ where Πϕ(H) is the projection operator to the span of ϕ(H) = {ϕ(h )}i=1 and Πϕ(H) is its orthogonal complement. Claim F.3.3.

r s X j X 0i 0∞+ 0j m ajϕh(g ) = α ϕ(h )(Σ )ijhΠϕ(H)ϕg(g ), ϕ(h )i. j=1 i,j=1 65 Proof: By Lem F.3,

r r X j X ∞+ q m X 0a 0∞+ 0b j ajϕh(g ) = α (Σ )jqhϕ(h ), ϕ(h )i ϕ(h )(Σ )abhϕg(g ), ϕ(h )i j=1 j,q=1 a,b∈[s] X X 0a 0∞+ 0b j ∞+ q m = α ϕ(h )(Σ )abhϕg(g ), ϕ(h )i(Σ )jqhϕ(h ), ϕ(h )i a,b∈[s] j,q∈[r] X 0a 0∞+ 0b m = α ϕ(h )(Σ )abhϕg(g ), Πϕ(H)ϕ(h )i a,b∈[s] as desired.  Claim F.3.4. r s s X j X 0 0j0 X 0i 0∞+ 0j m ajϕh(g ) + aj0 ϕ(h ) = α ϕ(h )(Σ )ijhϕg(g ), ϕ(h )i. j=1 j0=1 i,j=1

Proof: As noted above,

0 X 0∞+ ∞ ∞ ∞+ ∞ aj0 = α (Σ )j0k(β − Υ Σ ω )k k∈[s] X 0∞+ ⊥ 0i m = α (Σ )j0khΠϕ(H)ϕ(g ), ϕ(h )i k∈[s] X 0∞+ ⊥ 0i m = α (Σ )j0khΠϕ(H)ϕg(g ), ϕ(h )i k∈[s] 0i 0i because ϕ(g ) − ϕg(g ) ∈ span ϕ(H) s X 0 0j0 X 0j0 0∞+ ⊥ 0i m aj0 ϕ(h ) = α ϕ(h )(Σ )j0khΠϕ(H)ϕg(g ), ϕ(h )i. j0=1 j0,k∈[s]

Pr j ⊥ By the previous claim, adding j=1 ajϕh(g ) cancels Πϕ(H) and gives the desired result.  Therefore,

r s X j X 0 0j0 ∞ ajϕ(g ) + aj0 ϕ(h ) + σ w j=1 j0=1 r s ∞ X j X 0i 0∞+ 0j m = (σ w + ajϕg(g )) + α ϕ(h )(Σ )ijhϕg(g ), ϕ(h )i j=1 i,j=1 d l l l =H

ϕ(h) where H

nct l 1 X lt c

—————- 66 0i 0j For the second claim, let Cij = hϕ(h ), ϕ(h )i for all i, j ∈ [s]. We can compute, as in the proof of Lems F.1 and F.3,

l X 0i + 0j m ϕh(g ) = α ϕ(h )(C )ijhϕg(g ), h i i,j∈[s] X 0i + ˇc 0j 0k hm = α ϕ(h )(C ) K (ϕ (g ), ϕ (g )) ∂ 0k f (Z) ij g g E Zϕg(g ) i,j,k∈[s] by Stein’s Lemma Lem E.8 X 0i + 2 0j 0k hm = α ϕ(h )(C ) σ hϕ(h ), ϕ(h )i ∂ 0k f (Z) ij E Zϕg(g ) i,j,k∈[s] 2 X 0k hm = ασ Π ϕ(h ) ∂ 0k f (Z) ϕ(H) E Zϕg(g ) k∈[s] 2 X 0k hm = ασ ϕ(h ) ∂ 0k f (Z) E Zϕg(g ) k∈[s]

m where σ = σr∞ and r = line(ϕ(A0)). This computation goes through as long as fh is differentiable, which is implied by l ˇc 0 0 hm all f in π being differentiable, or if the covariance K (ϕg(G ), ϕg(G )) is nondegenerate (which allows us to consider f , a polynomially bounded function, as a tempered distribution, whose derivatives are also tempered distributions, giving a valid interpretation to the expectation). ˇc 0 0 0i When K (ϕg(G ), ϕg(G )) is singular, let I be a minimal subset of [s] such that {ϕ(h )}i∈I is linearly independent. We can compute similarly,

l X 0i + 0j m ϕh(g ) = α ϕ(h )(CI )ijhϕg(g ), h i i,j∈I X 0i + ˇc 0j 0k hm = α ϕ(h )(C ) K (ϕ (g ), ϕ (g )) ∂ 0k f (Z) I ij g g E Zϕg(g ) I i,j,k∈I by Stein’s Lemma Lem E.8 X 0i + 2 0j 0k hm = α ϕ(h )(C ) σ hϕ(h ), ϕ(h )i ∂ 0k f (Z) I ij E Zϕg(g ) I i,j,k∈I 2 X 0k hm = ασ Π ϕ(h ) ∂ 0k f (Z) ϕ(H) E Zϕg(g ) I k∈I 2 X 0k hm = ασ ϕ(h ) ∂ 0k f (Z) E Zϕg(g ) I k∈I

m m 0j h h ϕg(g ) where cI is the restriction of C to I, and fI is the version of f that expands Z , j 6∈ I to linear combinations of ϕ (g0i) ϕ (g0i) {Z g }i∈I . Then this computation goes through always, since {Z g }i∈I has a density.

67