Published as a conference paper at ICLR 2019

BAYESIAN DEEP CONVOLUTIONAL NETWORKS WITH MANY CHANNELS ARE GAUSSIAN PROCESSES

Roman Novak†, Lechao Xiao†∗, Jaehoon Lee‡∗, Yasaman Bahri‡, Greg Yang◦, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

Google Brain, ◦Microsoft Research AI, Department of Engineering, University of Cambridge
{romann, xlc, jaehlee, yasamanb, danabo, jpennin, jaschasd}@google.com, ◦[email protected], [email protected]

ABSTRACT

There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state of the art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance, beneficial in finite channel CNNs trained with stochastic gradient descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit – a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that, while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.

1 INTRODUCTION

Neural networks (NNs) demonstrate remarkable performance (He et al., 2016; Oord et al., 2016; Silver et al., 2017; Vaswani et al., 2017), but are still only poorly understood from a theoretical perspective (Goodfellow et al., 2015; Choromanska et al., 2015; Pascanu et al., 2014; Zhang et al., 2017). NN performance is often motivated in terms of model architectures, initializations, and training procedures together specifying biases, constraints, or implicit priors over the class of functions learned by a network. This induced structure in learned functions is believed to be well matched to structure inherent in many practical machine learning tasks, and in many real-world datasets. For instance, properties of NNs which are believed to make them well suited to modeling the world include: hierarchy and compositionality (Lin et al., 2017; Poggio et al., 2017), Markovian dynamics (Tiňo et al., 2004; 2007), and equivariances in time and space for RNNs (Werbos, 1988) and CNNs (Fukushima & Miyake, 1982; Rumelhart et al., 1985) respectively.

The recent discovery of an equivalence between deep neural networks and GPs (Lee et al., 2018; Matthews et al., 2018b) allows us to express an analytic form for the prior over functions encoded by deep NN architectures and initializations. This transforms an implicit prior over functions into an explicit prior, which can be analytically interrogated and reasoned about.¹

Previous work studying these Neural Network-equivalent Gaussian Processes (NN-GPs) has established the correspondence only for fully connected networks (FCNs). Additionally, previous work has not used analysis of NN-GPs to gain specific insights into the equivalent NNs.

∗Google AI Residents (g.co/airesidency). †, ‡ Equal contribution.
¹While there is broad literature on empirical interpretation of finite CNNs (Zeiler & Fergus, 2014; Simonyan et al., 2014; Long et al., 2014; Olah et al., 2017), it is commonly only applicable to fully trained networks.


(Figure 1: bar chart of accuracy (%) on CIFAR10 for the models FCN, FCN-GP, LCN, LCN w/ pooling, CNN-GP, CNN, and CNN w/ pooling, with bar values 53.74, 58.55, 63.48, 63.77, 63.29, 80.07, and 83.46; the CNN-GP w/ pooling bar is marked "?". Rows beneath the chart indicate which of the qualities compositionality, local connectivity (CNN topology), equivariance, and invariance each model possesses.)

Figure 1: Disentangling the role of network topology, equivariance, and invariance on test performance, for SGD-trained and infinitely wide Bayesian networks (NN-GPs). Accuracy (%) on CIFAR10 of different models of the same depth, nonlinearity, and weight and bias variances. (a) Fully connected models – FCN (fully connected network) and FCN-GP (infinitely wide Bayesian FCN) underperform compared to (b) LCNs (locally connected network, a CNN without weight sharing) and CNN-GP (infinitely wide Bayesian CNN), which have a hierarchical local topology beneficial for image recognition. As derived in §5.1: (i) weight sharing has no effect in the Bayesian treatment of an infinite width CNN (CNN-GP performs similarly to an LCN), and (ii) pooling has no effect on generalization of an LCN model (LCN and LCN with pooling perform nearly identically). (c) Local connectivity combined with equivariance (CNN) is enabled by weight sharing in an SGD-trained finite model, allowing for a significant improvement. (d) Finally, invariance enabled by weight sharing and pooling allows for the best performance. Due to computational limitations, the performance of CNN-GPs with pooling – which also possess the property of invariance – remains an open question. Values are reported for 8-layer ReLU models. See §G.6 for experimental details, and Figure 7 and Table 1 for more model comparisons.

In the present work, we extend the equivalence between NNs and NN-GPs to deep Convolutional Neural Networks (CNNs), both with and without pooling. CNNs are a particularly interesting architecture for study, since they are frequently held forth as a success of motivating NN design based on invariances and equivariances of the physical world (Cohen & Welling, 2016) – specifically, designing a NN to respect translation equivariance (Fukushima & Miyake, 1982; Rumelhart et al., 1985). As we will see in this work, absent pooling, this quality of equivariance has no impact in the Bayesian treatment of the infinite channel (number of convolutional filters) limit (Figure 1). The specific novel contributions of the present work are:

1. We show analytically that CNNs with many channels, trained in a fully Bayesian fashion, correspond to an NN-GP (§2, §3). We show this for CNNs both with and without pooling, with arbitrary convolutional striding, and with both same and valid padding. We prove convergence as the number of channels in hidden layers approaches infinity simultaneously (i.e. $\min\{n^1, \dots, n^L\} \to \infty$, see §E.4 for details), extending the result of Matthews et al. (2018a) under mild conditions on the nonlinearity derivative. Our results also provide a rigorous proof of the assumption made in Xiao et al. (2018) that pre-activations in hidden layers are i.i.d. Gaussian.

2. We show that in the absence of pooling, the NN-GP for a CNN and a Locally Connected Network (LCN) are identical (Figure 1, §5.1). An LCN has the same local connectivity pattern as a CNN, but without weight sharing or translation equivariance.

3. We experimentally compare trained CNNs and LCNs and find that under certain conditions both perform similarly to the respective NN-GP (Figure 6, b, c). Moreover, both architectures tend to perform better with increased channel count, suggesting that similarly to FCNs (Neyshabur et al., 2015; Novak et al., 2018) CNNs benefit from overparameterization (Figure 6, a, b), corroborating a similar trend observed in Canziani et al. (2016, Figure 2). However, we also show that careful tuning of hyperparameters allows finite CNNs trained with SGD to outperform their corresponding NN-GPs by a significant margin. We experimentally disentangle and quantify the contributions stemming from local connectivity, equivariance, and invariance in a convolutional model in one such setting (Figure 1).

4. We introduce a Monte Carlo method to compute NN-GP kernels for situations (such as CNNs with pooling) where evaluating the NN-GP is otherwise computationally infeasible (§4); a rough illustrative sketch follows below.

We stress that we do not evaluate finite width Bayesian networks nor do we make any claims about their performance relative to the infinite width GP studied here or finite width SGD-trained networks. While this is an interesting subject to pursue (see Matthews et al. (2018a); Neal (1994)), it is outside of the scope of this paper.
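As a rough, hedged illustration of the Monte Carlo approach (a sketch, not the implementation used in the paper, with details deferred to §4): one can instantiate many finite random networks of the architecture of interest and average the empirical uncentered covariance of their outputs. The function `sample_network_outputs` below is a hypothetical placeholder for such a finite-network forward pass.

```python
import numpy as np

# Sketch of the Monte Carlo idea (not the paper's code): average the empirical
# uncentered covariance of outputs over many random instantiations of a finite
# network. `sample_network_outputs` is a hypothetical user-supplied function
# returning outputs of shape (|X|, n_out) for one random initialization.
def mc_nngp_kernel(xs, sample_network_outputs, n_samples=100):
    k_sum = None
    for _ in range(n_samples):
        out = sample_network_outputs(xs)      # (|X|, n_out), fresh random weights
        k = out @ out.T / out.shape[1]        # empirical covariance over output units
        k_sum = k if k_sum is None else k_sum + k
    return k_sum / n_samples                  # estimate of the NN-GP kernel on xs
```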

1.1 RELATED WORK

In early work on neural network priors, Neal (1994) demonstrated that, in a fully connected network with a single hidden layer, certain natural priors over network parameters give rise to a Gaussian process prior over functions when the number of hidden units is taken to be infinite. Follow-up work extended the conditions under which this correspondence applied (Williams, 1997; Le Roux & Bengio, 2007; Hazan & Jaakkola, 2015). An exactly analogous correspondence for infinite width, finite depth deep fully connected networks was developed recently in Lee et al. (2018); Matthews et al. (2018b), with Matthews et al. (2018a) extending the convergence guarantees from ReLU to any linearly bounded nonlinearities and monotonic width growth rates. In this work we further relax the conditions to absolutely continuous nonlinearities with exponentially bounded derivative and any width growth rates.

The line of work examining signal propagation in random deep networks (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; 2018; Hanin & Rolnick, 2018; Chen et al., 2018; Yang et al., 2018) is related to the construction of the GPs we consider. They apply a mean field approximation in which the pre-activation signal is replaced with a Gaussian, and the derivation of the covariance function with depth is the same as for the kernel function of a corresponding GP. Recently, Xiao et al. (2017b; 2018) extended this to convolutional architectures without pooling. Xiao et al. (2018) also analyzed properties of the convolutional kernel at large depths to construct a phase diagram which will be relevant to NN-GP performance, as discussed in §B.

Compositional kernels coming from wide convolutional and fully connected layers have also appeared outside of the GP context. Cho & Saul (2009) derived closed-form compositional kernels for rectified polynomial activations (including ReLU). Daniely et al. (2016) proved approximation guarantees between a network and its corresponding kernel, and showed that empirical kernels converge as the number of channels increases.

There is a line of work considering stacking of GPs, such as deep GPs (Lawrence & Moore, 2007; Damianou & Lawrence, 2013). These no longer correspond to GPs, though they can describe a rich class of probabilistic models beyond GPs. Alternatively, deep kernel learning (Wilson et al., 2016b;a; Bradshaw et al., 2017) utilizes GPs with base kernels which take in features produced by a deep neural network (often a CNN), and trains the resulting model end-to-end. Finally, van der Wilk et al. (2017) incorporates convolutional structure into GP kernels, with follow-up work stacking multiple such GPs (Kumar et al., 2018; Blomqvist et al., 2018) to produce a deep convolutional GP (which is no longer a GP). Our work differs from all of these in that our GP corresponds exactly to a fully Bayesian CNN in the infinite channel limit, when all layers are taken to be of infinite size. We remark that while alternative models, such as deep GPs, do include infinite-sized layers in their construction, they do not treat all layers in this way – for instance, through insertion of bottleneck layers which are kept finite. While it remains to be seen exactly which limit is applicable for understanding realistic CNN architectures in practice, the limit we consider is natural for a large class of CNNs, namely those for which all layer sizes are large and comparable in size.
Deep GPs, on the other hand, correspond to a potentially richer class of models, but are difficult to analytically characterize and suffer from higher inference cost.

Borovykh (2018) analyzes the convergence of CNN outputs at different spatial locations (or different timepoints for a temporal CNN) to a GP for a single input example. Thus, while they also consider a GP limit (and perform an experiment comparing posterior GP predictions to an SGD-trained CNN), they do not address the dependence of network outputs on multiple input examples, and thus their model is unable to generate predictions on a test set consisting of new input examples.

In concurrent work, Garriga-Alonso et al. (2018) derive an NN-GP kernel equivalent to one of the kernels considered in our work. In addition to explicitly specifying kernels corresponding to pooling and vectorizing, we also compare the NN-GP performance to finite width SGD-trained CNNs and analyze the differences between the two models.

2 MANY-CHANNEL BAYESIAN CNNS ARE GAUSSIAN PROCESSES

2.1 PRELIMINARIES

General setup. For simplicity of presentation we consider 1D convolutional networks with circularly-padded activations (identically to Xiao et al. (2018)). Unless specified otherwise, no pooling anywhere in the network is used. If a model (NN or GP) is mentioned explicitly as "with pooling", it always corresponds to a single global average pooling layer at the top. Our analysis is straightforward to extend to higher dimensions, using zero (same) or no (valid) padding, strided convolutions, and pooling in intermediary layers (§C). We consider a series of $L + 1$ convolutional layers, $l = 0, \dots, L$.

Random weights and biases. The parameters of the network are the convolutional filters and biases, $\omega^l_{ij,\beta}$ and $b^l_i$, respectively, with outgoing (incoming) channel index $i$ ($j$) and filter relative spatial location $\beta \in [\pm k] \equiv \{-k, \dots, 0, \dots, k\}$.² Assume a Gaussian prior on both the filter weights and biases,

$$\omega^l_{ij,\beta} \sim \mathcal{N}\left(0, \, v_\beta \frac{\sigma^2_\omega}{n^l}\right), \qquad b^l_i \sim \mathcal{N}\left(0, \sigma^2_b\right). \qquad (1)$$

The weight and bias variances are $\sigma^2_\omega$ and $\sigma^2_b$, respectively. $n^l$ is the number of channels (filters) in layer $l$, $2k + 1$ is the filter size, and $v_\beta$ is the fraction of the receptive field variance at location $\beta$ (with $\sum_\beta v_\beta = 1$). In experiments we utilize uniform $v_\beta = 1/(2k+1)$, but nonuniform $v_\beta \neq 1/(2k+1)$ should enable kernel properties that are better suited for ultra-deep networks, as in Xiao et al. (2018).

Inputs, pre-activations, and activations. Let $\mathcal{X}$ denote a set of input images (training set, validation set, or both). The network has activations $y^l(x)$ and pre-activations $z^l(x)$ for each input image $x \in \mathcal{X} \subset \mathbb{R}^{n^0 d}$, with input channel count $n^0 \in \mathbb{N}$ and number of pixels $d \in \mathbb{N}$, where

$$y^l_{i,\alpha}(x) \equiv \begin{cases} x_{i,\alpha} & l = 0 \\ \phi\left(z^{l-1}_{i,\alpha}(x)\right) & l > 0 \end{cases}, \qquad z^l_{i,\alpha}(x) \equiv \sum_{j=1}^{n^l} \sum_{\beta=-k}^{k} \omega^l_{ij,\beta}\, y^l_{j,\alpha+\beta}(x) + b^l_i. \qquad (2)$$

We emphasize the dependence of $y^l_{i,\alpha}(x)$ and $z^l_{i,\alpha}(x)$ on the input $x$. $\phi : \mathbb{R} \to \mathbb{R}$ is a nonlinearity (with elementwise application to higher-dimensional inputs). Similarly to Xiao et al. (2018), $y^l$ is assumed to be circularly-padded and the spatial size $d$ hence remains constant throughout the network in the main text (a condition relaxed in §C and Remark E.3). See Figures 2 and 3 for a visual depiction of our notation.

Activation covariance. A recurring quantity in this work will be the empirical uncentered covariance matrix $K^l$ of the activations $y^l$, defined as

$$\left[K^l\right]_{\alpha,\alpha'}(x, x') \equiv \frac{1}{n^l} \sum_{i=1}^{n^l} y^l_{i,\alpha}(x)\, y^l_{i,\alpha'}(x'). \qquad (3)$$

$K^l$ is a random variable indexed by two inputs $x, x'$ and two spatial locations $\alpha, \alpha'$ (the dependence on layer widths $n^1, \dots, n^l$, as well as weights and biases, is implied and by default not stated explicitly). $K^0$, the empirical uncentered covariance of inputs, is deterministic.

Shapes and indexing. Whenever an index is omitted, the variable is assumed to contain all possible entries along the respective dimension. For example, $y^0$ is a vector of size $|\mathcal{X}|\, n^0 d$, $\left[K^l\right]_{\alpha,\alpha'}$ is a matrix of shape $|\mathcal{X}| \times |\mathcal{X}|$, and $z^l_j$ is a vector of size $|\mathcal{X}|\, d$.

²We will use Roman letters to index channels and Greek letters for spatial location. We use letters $i, j, i', j'$, etc. to denote channel indices, $\alpha, \alpha'$, etc. to denote spatial indices and $\beta, \beta'$, etc. for filter indices.
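To make Equations (1)–(3) concrete, the following is a minimal NumPy sketch (not the authors' code) of a random 1D CNN with circular padding and of the empirical uncentered covariance $K^l$; the function names, shapes, and toy hyperparameters are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of Equations (1)-(3) for a 1D CNN with circular padding:
# sample random filters/biases, propagate activations, and form the empirical
# uncentered covariance K^l of Equation (3). Names and shapes are illustrative.

def conv_layer(y, n_out, k, sigma_w=1.0, sigma_b=0.0, rng=np.random):
    """One random convolutional layer, Equations (1)-(2).
    y: activations of shape (|X|, n_in, d)."""
    n_x, n_in, d = y.shape
    v = np.full(2 * k + 1, 1.0 / (2 * k + 1))            # uniform v_beta
    w = rng.normal(0.0, 1.0, (n_out, n_in, 2 * k + 1))
    w = w * np.sqrt(v * sigma_w**2 / n_in)                # variance per Equation (1)
    b = rng.normal(0.0, sigma_b, n_out)
    z = np.zeros((n_x, n_out, d))
    for beta in range(-k, k + 1):
        # circular padding: y_{j, alpha + beta} wraps around the spatial axis
        y_shift = np.roll(y, -beta, axis=2)
        z += np.einsum('xjd,ij->xid', y_shift, w[:, :, beta + k])
    return z + b[None, :, None]

def empirical_covariance(y):
    """Equation (3): [K^l]_{alpha, alpha'}(x, x') averaged over channels.
    Returns an array of shape (|X|, d, |X|, d)."""
    n_x, n_c, d = y.shape
    return np.einsum('xia,yib->xayb', y, y) / n_c

# Toy usage: |X| = 4 inputs, n^0 = 3 channels, d = 10 pixels, two hidden layers.
phi = lambda t: np.maximum(t, 0.0)  # ReLU
x = np.random.normal(size=(4, 3, 10))
y = x
for _ in range(2):
    y = phi(conv_layer(y, n_out=256, k=1, sigma_w=2.0, sigma_b=0.1))
K = empirical_covariance(y)
```

As the number of channels grows, the entries of `K` concentrate around a deterministic limit; this is the behavior formalized in §2.2.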


Figure 2: A sample 2D CNN classifier annotated according to notation in §2.1 and §3. The network transforms $|\mathcal{X}|$ $n^0 \times d^0 = 3 \times 8 \times 8$-dimensional inputs $y^0(x) = x \in \mathcal{X}$ into $\bar{n}^3 = 10$-dimensional logits $\bar{z}^2(x)$. The model has two convolutional layers with $k = (3, 3)$-shaped filters, nonlinearity $\phi$, and a fully connected layer at the top ($y^2(x) \to \bar{z}^2(x)$, §3.1). Hidden (pre-)activations have $n^1 = n^2 = 12$ filters. As $\min\{n^1, n^2\} \to \infty$, the prior of this CNN will approach that of a GP indexed by inputs $x \in \mathcal{X}$ and target class indices from 1 to $\bar{n}^3 = 10$. The covariance of such a GP can be computed as $\left(\frac{\sigma^2_\omega}{d^2} \sum_{\alpha} \left[\left(\mathcal{C} \circ \mathcal{A}\right)^2\left(K^0\right)\right]_{\alpha,\alpha} + \sigma^2_b\right) \otimes I_{\bar{n}^3}$, where the sum is over the $\{1, \dots, 4\}^{\times 2}$ hypercube (see §2.2, §3.1, §C). Presented is a CNN with stride 1 and no (valid) padding, i.e. the spatial shape of the input shrinks as it propagates through it ($d^0 = (8, 8) \to d^1 = (6, 6) \to d^2 = (4, 4)$). Note that for notational simplicity a 1D CNN and circular padding with $d^0 = d^1 = d^2 = d$ is assumed in the text, yet our formalism easily extends to the model displayed (§C). Further, while displayed (pre-)activations have 3D shapes, in the text we treat them as 1D vectors (§2.1).

Our work concerns proving that the top-layer pre-activations $z^L$ converge in distribution to an $|\mathcal{X}|\, n^{L+1} d$-variate normal random vector with a particular covariance matrix of shape $\left(|\mathcal{X}|\, n^{L+1} d\right) \times \left(|\mathcal{X}|\, n^{L+1} d\right)$ as $\min\{n^1, \dots, n^L\} \to \infty$. We emphasize that only the channels in hidden layers are taken to infinity, and $n^{L+1}$, the number of channels in the top-layer pre-activations $z^L$, remains fixed. For convergence proofs, we always consider $z^l$, $y^l$, as well as any of their indexed subsets like $z^l_j$, $y^l_{i,\alpha}$, to be 1D vector random variables, while $K^l$, as well as any of its indexed subsets (when applicable, e.g. $\left[K^l\right]_{\alpha,\alpha'}$, $K^l(x, x')$), to be 2D matrix random variables.

2.2 CORRESPONDENCE BETWEEN GAUSSIAN PROCESSES AND BAYESIAN DEEP CNNS WITH INFINITELY MANY CHANNELS

We next consider the prior over outputs $z^L$ computed by a CNN in the limit of infinitely many channels in the hidden (excluding input and output) layers, $\min\{n^1, \dots, n^L\} \to \infty$, and derive its equivalence to a GP with a compositional kernel. This section outlines an argument showing that $z^L$ is normally distributed conditioned on the previous layer's activation covariance $K^L$, which itself becomes deterministic in the infinite limit. This allows us to conclude convergence in distribution of the outputs to a Gaussian with the respective deterministic covariance limit. This section omits many technical details elaborated in §E.4.

2.2.1 A SINGLE CONVOLUTIONAL LAYER IS A GP CONDITIONED ON THE UNCENTERED COVARIANCE MATRIX OF THE PREVIOUS LAYER'S ACTIVATIONS

As can be seen in Equation 2, the pre-activations $z^l$ are a linear transformation of the multivariate Gaussian $\left(\omega^l, b^l\right)$, specified by the previous layer's activations $y^l$. A linear transformation of a multivariate Gaussian is itself a Gaussian with a covariance matrix that can be derived straightforwardly. Specifically,

$$z^l \,\big|\, y^l \sim \mathcal{N}\left(0, \, \mathcal{A}\left(K^l\right) \otimes I_{n^{l+1}}\right), \qquad (4)$$

where $I_{n^{l+1}}$ is an $n^{l+1} \times n^{l+1}$ identity matrix, and $\mathcal{A}\left(K^l\right)$ is the covariance of the pre-activations $z^l_i$ and is derived in Xiao et al. (2018). Precisely, $\mathcal{A} : \mathrm{PSD}_{|\mathcal{X}| d} \to \mathrm{PSD}_{|\mathcal{X}| d}$ is an affine transformation (a cross-correlation operator followed by a shifting operator) on the space of positive semi-definite $|\mathcal{X}| d \times |\mathcal{X}| d$ matrices defined as follows:

$$\left[\mathcal{A}\left(K\right)\right]_{\alpha,\alpha'}(x, x') \equiv \sigma^2_b + \sigma^2_\omega \sum_\beta v_\beta \left[K\right]_{\alpha+\beta, \alpha'+\beta}(x, x'). \qquad (5)$$
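For concreteness, here is a small sketch of the affine map $\mathcal{A}$ of Equation (5) acting on a covariance array indexed as `K[x, alpha, x', alpha']`, assuming circular padding and uniform $v_\beta = 1/(2k+1)$ as in the experiments; the function name and argument conventions are illustrative.

```python
import numpy as np

# Sketch of the affine map A of Equation (5) on a covariance array
# K[x, alpha, x', alpha'] of shape (|X|, d, |X|, d), with circular padding
# and uniform v_beta = 1/(2k+1).
def op_A(K, k, sigma_w=1.0, sigma_b=0.0):
    v = 1.0 / (2 * k + 1)
    out = np.zeros_like(K)
    for beta in range(-k, k + 1):
        # [K]_{alpha+beta, alpha'+beta}: shift both spatial axes by beta (circularly)
        out += v * np.roll(K, shift=(-beta, -beta), axis=(1, 3))
    return sigma_b**2 + sigma_w**2 * out
```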


$\mathcal{A}$ preserves positive semi-definiteness due to Equation 4. Notice that the covariance matrix in Equation 4 is block diagonal due to the fact that separate channels $\left\{z^l_i\right\}_{i=1}^{n^{l+1}}$ are i.i.d. conditioned on $y^l$, due to i.i.d. weights and biases $\left\{\omega^l_i, b^l_i\right\}_{i=1}^{n^{l+1}}$.

We further remark that per Equation 4 the normal distribution of $z^l \,|\, y^l$ only depends on $K^l$, hence the random variable $z^l \,|\, K^l$ has the same distribution by the law of total expectation:

$$z^l \,\big|\, K^l \sim \mathcal{N}\left(0, \, \mathcal{A}\left(K^l\right) \otimes I_{n^{l+1}}\right). \qquad (6)$$

2.2.2 ACTIVATION COVARIANCE MATRIX BECOMES DETERMINISTIC WITH INCREASING CHANNEL COUNT

It follows from Equation 6 that the summands in Equation 3 are i.i.d. conditioned on fixed $K^{l-1}$. Subject to weak restrictions on the nonlinearity $\phi$, we can apply the weak law of large numbers and conclude that the covariance matrix $K^l$ becomes deterministic in the infinite channel limit in layer $l$ (note that pre-activations $z^l$ remain stochastic). Precisely,

$$\forall\, K^{l-1} \in \mathrm{PSD}_{|\mathcal{X}| d} \qquad K^l \,\big|\, K^{l-1} \xrightarrow[\, n^l \to \infty \,]{P} \left(\mathcal{C} \circ \mathcal{A}\right)\left(K^{l-1}\right) \quad \text{(in probability)},^3 \qquad (7)$$

where $\mathcal{C}$ is defined for any $|\mathcal{X}| d \times |\mathcal{X}| d$ PSD matrix $K$ as

$$\left[\mathcal{C}\left(K\right)\right]_{\alpha,\alpha'}(x, x') \equiv \mathbb{E}_{u \sim \mathcal{N}(0, K)}\left[\phi\left(u_\alpha(x)\right)\, \phi\left(u_{\alpha'}(x')\right)\right]. \qquad (8)$$

The decoupling of the kernel "propagation" into $\mathcal{C}$ and $\mathcal{A}$ is highly convenient since $\mathcal{A}$ is a simple affine transformation of the kernel (see Equation 5), and $\mathcal{C}$ is a well-studied map in the literature (see §G.4); for nonlinearities such as ReLU (Nair & Hinton, 2010) and the error function (erf), $\mathcal{C}$ can be computed in closed form as derived in Cho & Saul (2009) and Williams (1997) respectively. We refer the reader to Xiao et al. (2018, Lemma A.1) for a complete derivation of the limiting value in Equation 7.

A less obvious result is that, under slightly stronger assumptions on $\phi$, the top-layer activation covariance $K^L$ becomes unconditionally (dependence on observed deterministic inputs $y^0$ is implied) deterministic as channels in all hidden layers grow to infinity simultaneously:
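As an illustration of Equation (8), the sketch below implements $\mathcal{C}$ for $\phi = \mathrm{ReLU}$ using the closed form of Cho & Saul (2009) (the degree-1 arc-cosine kernel); it acts entrywise on a covariance array `K[x, alpha, x', alpha']` and glosses over numerical edge cases beyond simple clipping.

```python
import numpy as np

# Sketch of the map C of Equation (8) for phi = ReLU, via the Cho & Saul (2009)
# closed form: for (u, u') jointly Gaussian with variances s, s' and covariance c,
# E[relu(u) relu(u')] = sqrt(s s') (sin t + (pi - t) cos t) / (2 pi),
# where t = arccos(c / sqrt(s s')). K is indexed as K[x, alpha, x', alpha'].
def op_C_relu(K):
    s = np.einsum('xaxa->xa', K)                               # variances K[x,a,x,a]
    norm = np.sqrt(s[:, :, None, None] * s[None, None, :, :])  # sqrt(s s')
    cos_t = np.clip(K / np.maximum(norm, 1e-12), -1.0, 1.0)
    t = np.arccos(cos_t)
    return norm * (np.sin(t) + (np.pi - t) * cos_t) / (2.0 * np.pi)
```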

$$K^L \xrightarrow[\, \min\{n^1, \dots, n^L\} \to \infty \,]{P} K^L_\infty \equiv \left(\mathcal{C} \circ \mathcal{A}\right)^L\left(K^0\right), \qquad (9)$$

i.e. $K^L_\infty$ is $\left(\mathcal{C} \circ \mathcal{A}\right)$ applied $L$ times to $K^0$, the deterministic input covariance. We prove this in §E.4 (Theorem E.5). See Figure 3 for a depiction of the correspondence between neural networks and their infinite width limit covariances $K^l_\infty$.

2.2.3 A CONDITIONALLY NORMAL RANDOM VARIABLE BECOMES NORMAL IF ITS COVARIANCE BECOMES DETERMINISTIC

§2.2.1 established that $z^L \,|\, K^L$ is Gaussian, and §2.2.2 established that its covariance matrix $\mathcal{A}\left(K^L\right) \otimes I_{n^{L+1}}$ converges in probability to a deterministic $\mathcal{A}\left(K^L_\infty\right) \otimes I_{n^{L+1}}$ in the infinite channel limit (since $K^L \xrightarrow{P} K^L_\infty$, and $\mathcal{A}\left(\cdot\right) \otimes I_{n^{L+1}} : \mathbb{R}^{|\mathcal{X}| d \times |\mathcal{X}| d} \to \mathbb{R}^{|\mathcal{X}| n^{L+1} d \times |\mathcal{X}| n^{L+1} d}$ is continuous). As we establish in §E.4 (Theorem E.6), this is sufficient to conclude with the following result.

Result. If $\phi : \mathbb{R} \to \mathbb{R}$ is absolutely continuous and has an exponentially bounded derivative, i.e. $\exists\, a, b \in \mathbb{R} : |\phi'(x)| \leq a \exp(bx)$ a.e. (almost everywhere), then the following convergence in distribution holds:

$$z^L \,\big|\, y^0 \xrightarrow[\, \min\{n^1, \dots, n^L\} \to \infty \,]{D} \mathcal{N}\left(0, \, \mathcal{A}\left(K^L_\infty\right) \otimes I_{n^{L+1}}\right). \qquad (10)$$

For more intuition behind Equation 10 and an informal proof please consult §E.3.

³The weak law of large numbers allows convergence in probability of individual entries of $K^l$. However, due to the finite dimensionality of $K^l$, joint convergence in probability follows.
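Putting the pieces together, the following sketch (assuming the `op_A` and `op_C_relu` helpers from the sketches above are in scope) iterates $\mathcal{C} \circ \mathcal{A}$ for $L$ layers to form $K^L_\infty$ of Equation (9) and samples top-layer pre-activations from the limiting prior of Equation (10); the ReLU nonlinearity, uniform $v_\beta$, and the toy shapes are assumptions for illustration.

```python
import numpy as np

# Composes Equations (5), (8), (9) and (10): K^L_inf = (C o A)^L (K^0), then a
# draw from the limiting prior N(0, A(K^L_inf) x I_{n^{L+1}}) for one output channel.
# Assumes op_A and op_C_relu from the sketches above are in scope.
def limiting_kernel(x, L, k=1, sigma_w=2.0, sigma_b=0.1):
    """x: inputs of shape (|X|, n^0, d). Returns A(K^L_inf), shape (|X|, d, |X|, d)."""
    n0 = x.shape[1]
    K = np.einsum('xia,yib->xayb', x, x) / n0            # K^0, Equation (3) on inputs
    for _ in range(L):
        K = op_C_relu(op_A(K, k, sigma_w, sigma_b))      # one (C o A) step, Equation (9)
    return op_A(K, k, sigma_w, sigma_b)                  # final A(.) as in Equation (10)

# Toy usage: |X| = 4 inputs, n^0 = 3 channels, d = 10 pixels, depth L = 3.
x = np.random.normal(size=(4, 3, 10))
cov = limiting_kernel(x, L=3).reshape(4 * 10, 4 * 10)
z_sample = np.random.multivariate_normal(np.zeros(40), cov)  # one prior sample of z^L
```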

Figure 3: Visualization of sample finite neural networks and their infinitely wide Bayesian counterparts (NN-GPs). Presented are networks with nonlinearity $\phi$ and $L = 2$ hidden layers that regress 10-dimensional outputs for each of the $|\mathcal{X}| = 4$ inputs from the dataset. A hierarchical computation performed by a NN corresponds to a hierarchical transformation of the respective covariance matrix (resulting in $K_\infty \otimes I_{10}$). Top two rows: a fully connected network (FCN, above) and the respective FCN-GP (below). Bottom two rows: a 1D CNN with no (valid) padding, and the respective CNN-GP. The analogy between the two models (NN and GP) is qualitatively similar to the case of the FCN, modulo additional spatial dimensions in the internal activations and the respective additional entries in intermediary covariance matrices.

3 TRANSFORMING A GP OVER SPATIAL LOCATIONS INTO A GP OVER CLASSES

worse than fully-connected all different inputs becomes asymptotically FCN-GP

UnderUnder review review as a conference as a conference paper paper at ICLR at ICLR 20192 2019 § (attaching a fully-connected layer (Fully-connected) covariancealternatives Frobeniusdue to failing distance to capture and thean GP accuracy. ( 3.2.1) is consistently better than other models due to translation constant and makes FCN learning infeasible.FCN-GP LCN LCN w/pooling CNN-GP CNN CNN w/ pooling In 2.2 we showed that in the infiniteMC-CNN-GP channel with limit pooling a deep CNNpool is! a(#NN GP instantiations) indexedL+1 by2only inputto the center samples, pixel of the output) Model § K + b . (17) (Fully-connected network) (CNN w/o weight sharing) non-linear interactions between invariance of the kernel. 2 ( 3.1)↵ outperforms,↵0 an analogous Under review as a conference paper at ICLR 2019 § 4. Computing theFigure NN-GPFigure 2: Validation 2: ValidationCNN-GP accuracy covariance accuracy with(left)K1 zero of(left)⌘ and padding MC-CNN-GPof an MC-CNN-GP1 increases increases with n withLMn (i.e.M channel(i.e.CNN- channel count count output spatialdistant pixels. locations, and output channel indices. Further, its↵ covariance,↵0 § matrix K⇥ ⇥can be 2 GP withouttimes paddingtimes number number ofas samples) depth of samples) increases. and approaches and Atapproaches depth thatX15 of that thethe⇥ of exact spatial the⇤ CNN-GPexact dimension CNN-GP (not of shown), (not the shown), output while without while the distance the distance Table 3: Validation accuracy of CNN- and FCN- GPs as a function of weight (!, horizontal ∞ Quality Compositionality Local2 connectivity Equivariance Invariance computed in closed form. Herepadding we show(right) is reduced(right) to that the to to exact1 transformationsthe1 exact kernel, making kernel decreases. the decreases.CNN-GP The to dark Theobtain without band dark in bandpadding class the in left the predictionsequivalent plot left corresponds plot corresponds to the that tocenter ill-conditioningare to pixel ill-conditioning axis) and bias ( b , vertical axis) variances. As predicted in A.2, the regions of good performance This approach⇥ corresponds to applying global average pooling right after the last convolu- concentrate around the critical line (phase boundary, right)§ as the depth increases (left to right). • Many different NN architectures convergeselection to Figurea GP. strategy1:L Different+1 L(+13.2.2 dimensionality8 ) – which also collapsing performs strategies worse than described the CNN-GPL in+1 3L.+1(we Validation conjecture, accuracy due of8. Large-depth behavior common in CNN classifiers can be representedUnderof Ktionaln,MUnderof reviewKwhen layer.n,M review§ aswhen theas aThis conference numberas either the a approach conference number ofvectorization paper outer takes of paper at outerproducts ICLR all at pixel-pixelproducts ICLR 2019 contributingor 2019 contributingprojection covariances to Kn,M to intoK§approximatelyn,M(as considerationapproximately long as equals and we equals makes its rank. its rank. All plots share common axes ranges and employ the erf nonlinearity. See A.7.2 for experimental toan overfittingMC-CNN-GP to centrally-located with pooling features)( 3.2.1) is but consistently approaches better the latter than other (right)2 2 models in the due limit to of translation large § • Some admit a tractable analytic form. E.g. theinvariance covarianceValuestheValues ofof reported kernel a the reported kernel. 
translation are forCNN-GP are a forinvariant.3-layer§ a 3-layer with model However, zero model applied padding appliedit requiresto a( 2K/4K to3.1 a) 2K/4K outperforms train/validationd train/validationmemory an analogoussubset to compute subset of CIFAR10CNN- of the CIFAR10• CNNs suffer from exploding or vanishing Depth:Underdetails. review1 as a conference10 paper 100 at ICLR 2019 1000 Phase boundary treat classification4.CNN-GP Computing with no pooling as regressioncan be computeddepth, (seeusingthe asG information ),NN-GP identically becomes to more covariance Lee uniformly et al.( spatially2018)). distributed2 BothO (of|XXiao | these et al., operations2018). CNN-GPs GP withoutdownsampled§ sample-sampledownsampled padding to 8as covariance to depth88. See increases.8. Figure Seeof the Figure7 GP Atfor depth (or7 similarfor15 similard resultstheper§ spatial results covariance with dimension other with entry other architectures of in architectures the an iterative output and withoutA.7.2 andor dis-A.7.2forgradientsfor at large depths unless they are O(#samples^2 x #pixels) resources. generallyexperimental outperformexperimental details.FCN-GP details.⇥ , presumably⇥ due to theO local connectivity prior, but can fail to capture§ § initialized correctly. The phase boundary between preserve the GP equivalence andpadding allow istributed the reduced computation setting). to 1 It1, is making impractical of the theCNN-GP to covariance use this without method matrix padding to analyticallyequivalent of the evaluate respective to the thecenter GP, and pixel we A.2 RELATIONSHIP TO DEEP SIGNAL PROPAGATION nonlinear interactions between⇥ spatially-distant pixels at shallow depths (left).L Values are reported order and chaos (rightmost plot for ERF GP (now• For• Manyother indexed diarchitecturesfferent NN by architectures we input use a Monte samples converge Carloselection toapproach. anda GP. targetpropose strategy toclasses)( use3.2.2 a Monte) – as which Carlo a simple also approach performs transformation (see worse4). than the ofCNN-GPK .(we conjecture, due 8. Large-depth behavior on a 2K/4K train/validation§ subsets of CIFAR10. See A.7.3§ for experimental details. nonlinearity) is the region where large-depth The recurrence relation linking the GP kernel at layer l +1to that of layer l following from Equa- Sampling finite random networks of a given architectureto overfitting and to centrally-located features) but approaches§ the latter (right)∞ in the limit of large CNN-GP 2. Subsampling one particular pixel: take h = e↵, training is feasible. l+1 l empirically• Some admit computing a tractable the output analytic covariance form. E.g. allows the covariancerandom to randomfinite of a widthfinite width networks networks and use and the use empirical the empirical uncentered uncentered covariances covariances ofL activations+2 of activations to estimate to estimate• CNNs suffer from exploding or vanishing tionDepth:10 (i.e. K 1=( 10) K ) is 100 precisely the 1000covariance map examinedPhase boundary in a series of related CNN-GP with no pooling can be computeddepth, Lusing+2 as information becomes more uniformly spatially distributedL+1 (Xiao etn¯ al., 2018). CNN-GPs papers on signal1 propagationC A (Xiao1 et al., 2018; Poole et al., 2016; Schoenholz et al., 2017; Lee In theapproximate following the CNN-GP we will covariance. 
denote In our n¯experimentstheMC-CNN-GP Monteasthe thisMC-CNN-GP the Monte Carlo-GP number Carlo-GP (MC-GP) of (MC-GP) target kernel, kernel,e↵ classes,2 L+1z¯ (x2) as the gradients at large depths unless they are O(#samples^2 x #pixels) resources. generally outperform FCN-GP, presumably1 due to theK local connectivity+ . prior,R but can fail to capture(18)• CNN-GP exhibits similar behavior. Initialized away (biased) estimate converges to the true CNN-GP1. covariance in take h = 1d. Then! ↵,↵ b initialized correctly. The phase boundary between et al., 2018) (modulo notational differences; denoted as F , or e.g. ? in Xiao et al. (2018)). GlobalL average+1 ¯L pooling:+1 K1d ⌘ M 1 nM n ∈ from the phase boundary, the covariance between C A C outputs#(instantiated of theCNN-based networks) and #channels, classifier, bothnonlinear in terms and of interactionsω¯ , b betweenas spatially-distant the last (number1 pixels1 at shallowL + 1 depths) layer (left). weights Values are and reported order and chaos (rightmost plot for ERF In those works, the action of this map on hidden-state covariance matrices was interpreted as defin- For other architectures we use a Monte Carlo approach. l l ⇥ ⇤ l l l l all different inputs becomes asymptotically FCN-GP covariance• Frobenius distance and the GP accuracyon a 2K/4K.This train/validation approach makesKn,M subsetsK usen,M of( ofx, only CIFAR10. x0)( onex,2 x0 pixel-pixel) See A.7.3 covariance,yc↵for(xy experimental;c✓↵m()x andy; c✓↵m requires)(xyc0;↵ details.✓m(x) the0; ✓m same) amount(19) (19)nonlinearity) is the region where large-depth ing a dynamical(Fully-connected) system whose large-depth behavior informs aspects of trainability. In particular, ↵,↵0 ↵,↵0 ! (#NN instantiations) 0 0 constant and makes learning infeasible. l+1 l l i pool L+1 2 CNN-GP biases. ForSampling each finite class random networkswe will of a given denote architecture the and empirical uncentered⌘ covarianceMn⌘ KMn§ of+ the . outputs as (17) training is feasible. as l , K =( ) K K K⇤ , i.e. the covariance approaches a fixed point of memory as vectorization ( 3.1) to2 compute.m=1 c=1m=1↵,↵c=1 b !1 1 C A 1 ⇡ 1 ⌘ 1 empirically computing the output covarianceFigure allowsFigure 2:to Validation 2: Validation accuracy⇥ accuracy⇥ ⇤ (left)K1⇤ § of(left)⌘ and MC-CNN-GPof an MC-CNN-GPX1XX increasesX0 increases with n withMn (i.e.M channel(i.e. channel count count K⇤ . The convergence to a fixed point is problematic for learning because the hidden states no T ↵,↵0 1 2

MC-CNN-GP MC-CNN-GP L+1 L+1 ⇥ ⇥ approximate the CNN-GP covariance. In ourtimes experimentswheretimes numberwhere✓ consists numberthis of✓ samples)consists of samples)M ofdraws andM approachesdraws and of the approaches of weights the that weightsX of and that the biases of exactand the biases from CNN-GPexact their from CNN-GP (notprior their shown), distribution,(not prior shown), distribution, while✓ while them ✓p them(✓),p and(✓), andCNN-GP exhibits similar behavior. Initialized awayTable longer 3: Validation contain information accuracy thatofCNN- can distinguish and FCN- different GPs as pairs a function of inputs. of weight It is similarly (!, horizontal problematic z¯ z¯ , 1 ⇥ ⇤ (11)distancedistance• 2 (biased) estimate converges to the true CNN-GPWe compare1. covarianceiGlobal the in averageperformancei pooling:i of presentedtake h = strategies1d. Then in Figure 1. Note that all described⇠ strategies⇠ axis)for and GPs, bias as ( the , kernel vertical becomes axis) variances. pathological As predicted as it approaches in A.2 a, the fixed regions point. of Precisely, good performance in the chaotic (right)nKThisis(right) to the≡n approach theis width to the exact the width or exactcorresponds kernel number or kernel number decreases. of channels decreases.to of applying channels The in dark hidden Thed global in band dark hidden layers. average in band the layers. The in left pooling the MC-GP The plot left MC-GP corresponds right plot kernel corresponds after kernel converges the to converges lastill-conditioning toconvolu- to ill-conditioning the to analytic the analyticfrom the phase boundary, the covariance between b § #(instantiated networks) and #channels, bothadmit in terms stackingL+1 ofL +1 additional FC layers on top whilel retainingl l l the GPL+1 equivalence,L+1 using a derivation concentrateregime outputs around of the the critical GP become line (phase asymptotically boundary, decorrelated right) as the and depth therefore increases independent, (left to right). while in kernelkernel with increasing with8 increasing width, width,limn limKn K= K =inK probability.in probability. all different inputs becomes asymptotically FCN-GP of Ktionalof Kwhen layer.when theThis number the approach number of outer takes of outerproducts all pixel-pixelproductsn,M2 contributingn,M contributing covariances to K to intoKapproximately considerationapproximately equals and equals makes its rank. its rank. (Fully-connected) a randomcovariance variable Frobenius of distance shape and the GPanalogous accuracyn,M. to. Wen,M2 (Lee will et al. show, 2018; thatMatthews!1 in the!1et al., scenarios2018b1 ). 1 consideredn,M n,M below in Allthe plots ordered share common regime they axes approach ranges and perfect employ correlation the erf nonlinearity. of 1. Either See of theseA.7.2 scenariosfor experimental captures no §   pool ! (#NN instantiations)L+1 2 22 constant and makes learning infeasible. § L|X+1 | ×Values0 |Xthe |Values reported kernel reported translation are for are a forinvariant.3-layer a 3-layer model However, model applied2 appliedit requirestol aK 2K/4K tol a↵ 2K/4K,↵ train/validation+l d train/validationb . memoryl subset to computel subset of CIFAR10l of the CIFAR10(17) details.information about the training data in the kernel and makes learning infeasible. 
FigureFor finiteFigureFor 2: finitewidth 2: width networks, networks, the uncertainty(left)K the1 uncertaintyof(left)⌘ and MC-CNN-GPof in anKn,M MC-CNN-GP in K1isn,MVarO increasesis|X0KVar | increasesn,MK withn,M=n Var with=✓Mn VarK(i.e.n✓M(✓K channel)(i.e.n/M(✓ channel)./M count From. count From 3.1 and 3.2, the outputs z¯ ydownsampledwilldownsampled convergeValidation to 8Validation to88. accuracy in See8 distribution.accuracy Figure See Figure7 for7 similarfor to↵, similar↵ a2 results multivariate results with other with other architectures Gaussian architectures and (i.i.d.A.7.2 and A.7.2for for i sample-sample covariance of the GP (or l d 0 perl covariance1 1 entry in⇥ an iterative⇥l orl dis-1 1 TableThis problem 3: Validation can be amelioratedaccuracy of by CNN- judicious and FCN-hyperparameter GPs as a selection, function of which weight can ( reduce2 , horizontal the rate § § | timesDanielytimesDaniely number et al. number of( et2016 samples)⇥al. of(),2016 samples)⇥we), andknow we approaches andknow that approachesVar that✓ thatVarOKXn of✓( that✓ theK⇥) n of exact(✓ the) ⇤, CNN-GPexact which , CNN-GP which leads (not leadsto shown), (notVar to✓ shown),[KVar while✓[K]§ while the distance]§ the. Fordistance. For ! 4MexperimentaltributedONTEexperimental setting).C details.ARLO details. It EVALUATIONis impractical to use OF this INTRACTABLE method/ ton analytically/ ⇥nGP⇥KERNELS⇤ evaluate⇤ ⇥ then,M GP,⇥ ⇤ and/n,MMn we⇤/ Mn axis)of exponential and bias ( convergence2, vertical axis) to the variances. fixed point. As For predicted hyperpameters in A.2, chosen the regions on a critical of good line performance separating (right)finiteThis(right)finiten to, K approach theln to, exactK theisl also exactcorresponds kernelis a also biased kernel decreases. a biased estimate decreases.to applying Theestimate of dark TheK globall of band dark, whereK averagel in band, where the the in left biaspooling the the plot depends left bias corresponds right plot depends correspondssolely after solelythe on to lastill-conditioning network on to convolu- ill-conditioning network width. width. A.2 RELATIONSHIPb TO DEEP SIGNAL PROPAGATION §  proposeL+1 toLn,M+1 use an,M Monte Carlo approach (see 4). L+1 L+1 concentratetwo untrainable around phases, the critical the convergence line (phase rates boundary, slow to polynomial,right) as the and depth very increases deep networks (left to can right). be of Ktionalof Kwhen layer.when the8 This number the approach number of outer takes of outerproducts all⇥ pixel-pixelproducts1 contributing⇥ ⇤ 1 contributing covariances⇤ to K to intoKapproximately considerationapproximately equals and equals makes its rank. its rank. We introduceWe don,MWe not a don,M currentlyMonte not currently Carlo have estimation an have analytic an analytic method form forform for§ this NN-GP for bias, this butkernelsbias, we butn,M can which we seen,M can are in see Figures computationally in Figures2 and27andthat im-7 forthat for TheAlltrained, recurrence plots and share relationinference common linking with axes deep the ranges GP NN-GP kernel and kernels employ at layer can thel +1 beerf performedtononlinearity. that of layer– see Seel TablefollowingA.7.23. 
for from experimental Equa- h = e 2 2 2 2 l+1 l § random2.ValuestheSubsamplingrandom hyperparametersthefiniteValuesthe reported kernel hyperparameterswidthfinite reported translation onewidth arenetworks7 particularfor we arenetworks probea andforweinvariant.3-layer probea useit pixel:and3-layer is themodel small useit However, empirical istake themodel small relative applied empirical relative appliedituncentered to requires↵to the, a uncentered to variance.2K/4K to the a covariances variance.2K/4K train/validation Incovariances particular,d train/validation In ofmemory particular, activations of subsetK activations tol compute subsetK to( of✓l estimate) CIFAR10 to( ofK✓ theestimate)L CIFAR10KL tiondetails.10 (i.e. K =( ) K ) is precisely the covariance map examined in a series of related practical to compute analytically, or for which we do not2 knowO the|X analytic | form. Similarn,M n,M in spirit F F 1 C A 1 thedownsampled Montethesample-sampledownsampled Monte Carlo-GP Carlo-GP to 8 (MC-GP) covariance to88. (MC-GP) See8 kernel,. Figure Seeof kernel, thee Figure7 GPfor2 (or7 similarforL+1 similard resultsper results covariance with2 other with entry other architectures in architectures an iterative and A.7.2 andor dis-1A.7.2for 1for papersA.3 on S signalTRIDED propagation CONVOLUTIONS (Xiao et AND al., AVERAGE2018; Poole POOLING et al., 2016 IN INTERMEDIATE; Schoenholz et LAYERS al., 2017; Lee to traditionalis nearlyis nearly random constant constant feature for⇥ constant for methods⇥ constantMn (Rahimi↵.Mn We. thus We &K Rechttreat thusMn treat, 2007+asMn), the the.as effective core the effective idea sample is to sample instantiate size for size§ the(18) for many Monte§ the Monte experimentaltributedexperimental setting). details. details. It is impractical to use! thisO method↵,↵ to analyticallyb evaluate the GP, and we et al., 2018) (modulo notational differences; denoted as F , or e.g. ? in Xiao et al. (2018)). CarloCarlo kernel kernel estimate. estimate. Increasing IncreasingKM1 and⌘M reducingandM 1 reducingnMn cannn reducecan reduce memory memory cost, though cost, though potentially potentially at at In thoseA.2Our analysis works, RELATIONSHIP the inthe action main ofTOtext thisDEEP canmapS easilyIGNAL on hidden-state bePROPAGATION extended covariance to coverC average matricesA pooling wasC interpreted and strided as defin- convolu- 8 propose to usel a Montel Carlo approach1 (see1 4).l l l l l SpatiallytheThis expensethe approach local expense of average increased makesK of increased poolingK use compute of( inx, only compute intermediary x0) time( onex, x0 andpixel-pixel) time bias.⇥ layers and bias. can covariance,⇤y be( constructedxy; ✓ ()x andy; ✓ requiresin)(xy a0 similar; ✓ (x) the0; fashion✓ same) amount( A.3).(19) We (19) ingtions a dynamical (applied system before whosethe pointwise large-depth nonlinearity). behavior informsRecall that aspects conditioned of trainability. on K the In pre-activation particular, n,M ↵n,M,↵ ↵,↵ § c↵ c↵m c↵m0 c↵0 m m 0 0 ⌘ Mn⌘ Mn § Thel recurrencedl+11 relation linkingl the GP kernell at layer l +1tod that2 d of1 layer l following from Equa- focus on global average pooling in this work to more effectively isolate the effects of pooling from other aspects as lzj (x) , KR isl+1=( a zero-mean) K multivariatel K Gaussian.K⇤ , i.e. Let theB covarianceR ⇥ denote approaches a linear a operator. 
fixed point Then 2.of memorySubsampling as vectorization one particular ( 3.1) pixel: to compute.takem=1hc==1m=1e↵c,=1 pool pool 1K =( ) 1K 1 1 randomIn a non-distributedrandomInfinite a non-distributedwidthfinite width networks setting, networks setting, the and MC-GP use the andthe MC-GP use reduces empirical the reduces empirical the uncentered memory the uncentered memory requirements covariances requirements covariances of to activations compute of to activations compute to estimate tofrom estimatefrom tion!1l 102(i.e. d2 C A ⇡) is precisely⌘ the covariance2 map examined in a series of related of the model like local⇥ connectivity⇥ ⇤ or⇤ equivariance.§ X XX X K⇤Bz. The(x) convergenceR 1is a zero-mean toC aA fixed Gaussian, point1 is problematic and the covariance for learning is because the hidden states no 2 2 2 2 GP GP 1papersj on signal propagation (Xiao et al., 2018; Poole et al., 2016; Schoenholz et al., 2017; Lee wherethewhere Monte✓ consiststhe Monte✓2 Carlo-GPconsists of2 Carlo-GPM ofdraws (MC-GP)M draws of (MC-GP) the2 kernel, of weights the2 kernel,e weights↵ and2 biases andL biases+1 from their from prior their2 distribution, prior distribution,✓m ✓pm(✓),p and(✓), and longer contain2 information that can distinguish different pairs of inputs. It is similarly problematic d tod to + n + ndn +, makingnd , making theK evaluation the evaluation+ of CNN-GPs. of CNN-GPs with pooling with pooling practical.(18) practical. T We compare the performance of presented strategies! in Figure↵,1↵. Noteb that all described⇠ strategies⇠ foret GPs, al., as2018 the) kernel (modulo becomes notationall pathological differences;l as it denoted approachesl as F a, fixedorl point.e.g. l Precisely,? inT Xiaol in the etT al. chaotic(2018)). n isO then|Xis widthO | the|X width or | numberO or number|XO of | channels|X of | channels inK hidden1 in⌘ hidden layers.M 1 layers. ThenM MC-GP Then MC-GP kernel kernel converges converges to the to analytic the analytic E !l,bl Bzj (x) Bzj (x0) K = BE !l,bCl zj (x)Azj (xC0) K B . (20) admit stacking additional FC layers on top while retaining the GP equivalence, using a derivation In those works,{ the} action of this map on hidden-state covariance{ } matrices was interpreted as defin- ⇣ ⇣ ⌘ ⌘⇣ l ⇣ liml limK⌘ l K⌘=1l Kl=1 Kl l l l l regime outputs of the GP become asymptotically decorrelated and therefore independent, while in kernelkernel withThis increasingapproach with increasing makes width,K width,K usen of(x, only xn0)(n,M onex, x0 pixel-pixel)n,M7⇥ in probability. covariance,⇤yin probability.(xy; ✓ ()x andy; ✓ requires)(xy0; ✓ (x) the0; ✓ same) amount(19) (19) ing a dynamical systemh whosel large-depthl behaviori informs aspectsh of trainability. i In particular, analogous to 2 (Lee et al., 2018n,M; Matthewsn,M!1 !1 et al., 2018b1 ). 1 c↵ c↵m c↵m0 c↵0 m m theOne ordered can easilyregime see they that approachBz K perfectare correlationi.i.d. multivariate of 1.Gaussian Either of as these well. scenarios captures no ISCUSSIONISCUSSION ↵,↵0 ↵,↵0 ⌘ Mn⌘ Mn l Kl+1 =( j) Kl j Kl K 5Dof§5D memory as vectorization ( 3.1) to compute.l m=1 cl =1m=1 c=1l l l l informationas about, the training data in the kernel and makes⇤ , i.e. learning the covariance infeasible. approaches a fixed point K K Var KVar K = Var= VarK (✓K) /M(✓) /M !1 1 C A 1 ⇡ 1 ⌘ 1 For finiteFor finitewidth width networks,⇥ networks, the⇥ ⇤ uncertainty the⇤ § uncertainty in n,M inXisn,MXXisXn,M n,M ✓ n✓ n . From. From KStrided⇤ . The convolution convergence. 
Strided to a fixed convolution point is is problematic equivalent to for a non-strided learning because convolution the hidden composed states with no wherewhere✓ consists✓ consists of M ofdrawsM draws of the of weights the weightsl and biasesl and1 biases from1 their from prior their distribution, prior distribution,l ✓ l 1✓p (✓)1,p and(✓), and Thislonger problem1 contain canbe information ameliorated that by can judicious distinguish hyperparameter different pairs selection, of inputs. which It is can similarly reduce problematic the rate Daniely5.1Daniely et5.1 BAYESIAN al. ( et B2016AYESIAN al. (),CNN2016 we), knowSWITHMANYCHANNELSAREIDENTICALTOLOCALLYCONNECTEDCNN we knowSWITHMANYCHANNELSAREIDENTICALTOLOCALLYCONNECTED that Var that✓ VarKn✓(✓K) n (✓) , which, which leads leadsto Var to✓[KVarn,M✓[K] mn,M ] m. For . For subsampling. Let s N denote size of the stride. Then the strided convolution is equivalent to 4MWe compareONTE C theARLO performance EVALUATION of presented OF strategiesINTRACTABLE/ inn Figure/ ⇥nGP1. Note⇥KERNELS⇤ that⇤ all⇥ described⇥ ⇤/ ⇠ strategiesMn⇤/ ⇠Mn of exponentialfor GPs, as convergence the kernel2 becomes to the fixed pathological point. For as hyperpameters it approaches achosen fixed point. on a critical Precisely, line separatingin the chaotic n is thenNETWORKSis widthl theNETWORKS widthl or numberIN or THE numberIN of ABSENCE channels THE of ABSENCE channels in OF hidden POOLING inl OF hidden POOLING layers.l layers. The MC-GP The MC-GP kernel kernel converges converges to the to analytic the analytic choosing B as follows: Bij = (is j) for i 0, 1,...(d2 1) . admitfinite stackingfiniten, Kn,Mn, additionalKisn,M also,is a also FCbiased, alayers biased estimate on estimate top of whileKl of, whereretainingKl l, where thel the bias the GP depends bias equivalence, depends solely solely on using network on a derivation network width. width. tworegime untrainable outputs phases, of the the GP convergence become asymptotically rates slow2 { to decorrelated polynomial, and and} therefore very deep independent, networks can while be in kernelkernel with increasing with increasing width, width,limn limKn⇥ n,M1K⇥=n,M⇤K1 =⇤inK probability.in probability. analogousWe doWe not to do currently2 not(Lee currently et have al., 2018 an have analytic; anMatthews analytic!1 form!1 etfor form al. this, for2018b bias, this1 ). butbias,1 we but can we see can in see Figures in Figures2 and27andthat7 forthat for trained,theAverage ordered and inference pooling regime. with Average they deep approach pooling NN-GP perfect withkernels stride correlation cans beand performed of window1. Either – size see of Tablews theseis3 equivalent. scenarios tocaptures choosing no We introduce a Monte Carlo estimation method for NN-GP kernels which are computationally im- 2 2 LocallyLocally Connected§ Connected Networks Networks (LCNs) (LCNs) (Fukushima (Fukushimal , 1975l ,;1975Lecun;l Lecun, 1989l ,)1989 are CNNs) arel l CNNs withoutl l withoutL weightL weight informationBij =1/ws aboutfor i =0 the, training1,...(d data2 1) in theand kernelj = is, and . . . makes , (is + learningws 1) infeasible.. practicaltheFor hyperparameters tothe computefiniteFor hyperparameters finitewidth analytically, width networks, we probe networks, weor probethe it isfor uncertainty smallthe itwhich is uncertainty small relative wein relative do toK not the in knowtoK variance.is theVar variance.theis KVar Inanalytic particular,K In= particular, form. Var=✓K Similar Varn,MK ✓K(✓n,MK) in/M spirit(K✓)./M FromK. 
From sharingsharing between between spatial spatial locations. locations. LCNs LCNs preserve preserven,M then,M connectivity the connectivityn,M pattern,n,M pattern, and thusn and topology, thusn topology,1 F of1 a F of a l l 1 1 l l 1 1 A.3ThisND S convolutions.TRIDED problem CONVOLUTIONS can beNote ameliorated that our AND analysis by AVERAGE judicious in the hyperparameterPOOLING main text IN(1D) INTERMEDIATE easily selection, extends which LAYERSto higher-dimensionalcan reduce the rate to traditionalis nearlyDanielyis nearlyDaniely random constant et al. constant( featureet2016 for al. constant(),2016 for methods we constant), know weMn ( know thatRahimi.Mn WeVar that. thus We✓ &VarK Rechttreat thus✓(✓MnK treat,) 2007(as✓Mn)), the, the whichas effective core the, which leads effective idea sample leadsto isVar to sample toinstantiate size✓[KVar for size✓[K] the for many Monte] the. Monte For . For 4MCNN.ONTECNN. However,C However,ARLO they EVALUATION do they not do possess not possess the OF equivariance INTRACTABLEthen equivariancen propertyn property⇥nGP of a⇥KERNELS⇤ CNN of a⇤ CNN– if an⇥ – input ifn,M an⇥ ⇤ inputisn,M translated,Mn⇤ is translated,Mn ofconvolutions exponential by convergence replacing integer to the fixed pixel point. indices For and hyperpameters sizes d, ↵, chosenwith tupleson a critical (see also line separatingFigure 4). CarloCarlo kernel kernel estimate.l estimate.l Increasing IncreasingM andM reducingand reducingl n canl/n reducecan/ reduce memory memory cost, though cost, though potentially/ potentially/ at at Our analysis in the main text can easily be extended to cover average pooling and strided convolu- 8 finitethe latentfinitethen, K latent representationn,Mn, K representationisn,M alsois a also in biased an a LCN in biased estimate an will LCN estimate be of will completelyK be of, completely whereK , different, where the bias different, the rather depends bias rather depends than solely also than solely onbeing also network onbeing translated. network translated. width. width. two untrainable phases, the convergence rates slow to polynomial, and very deep networks can be Spatiallythe expensethe local expense of average increased of increased pooling compute in compute intermediary time and time bias. layers and⇥ 1 bias. can⇥ ⇤ be1 constructed⇤ in a similar fashion ( A.3). We tions (applied before the pointwise nonlinearity). Recall that conditioned on Kl the pre-activation We doWe not do currently not currently have an have analytic an analytic form forform this for bias, this butbias, we but can we see can in see Figures in Figures2§and27andthat7 forthat for l trained,d and1 inference with deep NN-GP kernels can be performedd2 d1 – see Table 3. focusWe on introduceThe global CNN-GPThe average a CNN-GP Monte pooling predictions Carlo predictions in this estimation without work without to spatial more methodspatial effectively pooling for pooling NN-GP in isolate3.1 inkernels theand3.1 effects3.2.2and which ofdepend3.2.2 pooling aredepend computationallyonly from on onlyother sample-sample on aspects sample-sample im- z (x) R is a zero-mean multivariate Gaussian.18 Let B R ⇥ denote a linear operator. Then l pooll poolL 2 L 2 j 2 2 In athe non-distributedIn hyperparameters athe non-distributed hyperparameters setting, we setting, probe the we MC-GP probe it the isMC-GP small it reduces is small relative reduces the relative to memory the§ to variance. 
memory the§ requirements variance.§ requirements In§ particular, In to particular, compute toK computeK(✓) from(K✓) fromK l d2 of thepractical modelcovariances, liketocovariances, compute local andconnectivity analytically, do and not do depend or not equivariance. or depend on for pixel-pixel which on pixel-pixel we covariances.do not covariances. know LCNs the analytic LCNs destroy destroy form. pixel-pixel Similar pixel-pixeln,M covariances:n,M in spirit covariances:F F Bz (x) R is a zero-mean Gaussian, and the covariance is 2 2 2 2 GP GP 1 1 jA.3 STRIDED CONVOLUTIONS AND AVERAGE POOLING IN INTERMEDIATE LAYERS is nearlyL isLCN2 nearlyL constantLCN2 constant for constant for2 constantMn2 .Mn We. thus We treat thusMn treatasMn theas effective the effective sample sample size for size the for Monte the Monte 2 to traditionald randomtod to feature+ n methods+ ndn + (,Rahimi makingnd , making & the Recht evaluation the, 2007 evaluation), of the CNN-GPs of core CNN-GPs idea with is to pooling with instantiate pooling practical. many practical. T T K ↵K,↵ (x,↵,↵ x0)=0(x, x0)=0, for ↵, for= ↵0 =and↵0 allandx, allx0 x, x0 and L>and L>0. However,0. However, LCNs LCNs preserve preserve the the l l l l l l T O Carlo|XO1 | Carlo kernel|X10 | kernel estimate.O0 |X estimate.O | Increasing|X | Increasing6 M6 andM reducingand reducing2n canX 2n reducecanX reduce memory memory cost, though cost, though potentially potentially at at E !l,bl Bzj (x) Bzj (x0) K = BE !l,bl zj (x) zj (x0) K B . (20) LCN LCN CNN CNN Our analysis{ in} the main text can easily be extended{ to cover} average pooling and strided convolu- 8 ⇣ ⇣ ⌘ ⌘⇣ ⇣ ⌘ ⌘ L L L L Spatiallythecovariances expensethecovariances local expense of between average increased of between increased pooling input compute input examples in compute intermediary examples time at and time every7 bias. layersat and every pixel: bias. can pixel: beK constructedK (x, xin0)=( ax, similar x0)=K fashionK ( (x,A.3 x0)()..x, We As x0). a As a h i h Kil ⇥ ⇤⇥ ⇤ ↵,↵ ↵,↵ ↵,↵ ↵,↵ Onetions can easily(applied see before that Bz thel pointwiseKl are nonlinearity). i.i.d. multivariate Recall Gaussian that conditioned as well. on the pre-activation 1 1 1 1 § l d1 j j d2 d1 focus5Dresult, on5D globalISCUSSIONresult, in averagetheISCUSSION in absence the pooling absence of inpooling, this of pooling, work LCN-GPs to more LCN-GPs effectively and CNN-GPs and isolate CNN-GPs are the identical. effects are identical. of pooling Moreover, Moreover, from LCN-GPs other LCN-GPs aspectspool withpool with zj (x) R is a zero-mean multivariate Gaussian. Let B R ⇥ denote a linear operator. Then In a non-distributedIn a non-distributed setting, setting, the MC-GP the MC-GP reduces reduces the memory⇥ the memory⇤⇥ requirements⇤ requirements to⇥ compute⇤ to⇥ compute⇤ from from l 2 d2 2 of the model like local connectivity or equivariance. StridedBzj ( convolutionx) R is. a Strided zero-mean convolution Gaussian, is andequivalent the covariance to a non-strided is convolution composed with 2 2 2 2 2 2 2 GP GP 2 AYESIANd AYESIANtod toSWITHMANYCHANNELSAREIDENTICALTOLOCALLYCONNECTED+SWITHMANYCHANNELSAREIDENTICALTOLOCALLYCONNECTEDn + ndn +, makingnd , making the evaluation the evaluation of CNN-GPs of CNN-GPs with pooling with pooling practical. practical. subsampling. Let s N denote size of the stride. Then the strided convolution is equivalent to 5.15.1 B B CNNCNN 2 l l T l l l T l T O |XO | |X | O |XO | |X | 8 8 choosing B asE follows:!l,bl BBzij j=(x)(isBzjj)(forx0) i K0, 1=,...B(Ed2!l,bl1) z. 
j (x) zj (x0) K B . (20) ⇣NETWORKS⇣NETWORKS⌘ , IN⌘⇣ THE, IN⇣ ABSENCE THE ABSENCE OF⌘ POOLING OF⌘ POOLING { } 2 { { } } 7 Average pooling. Averageh Bz poolingl Kl with stride s iand window sizeh ws is equivalenti to choosing One can easily see that j j are i.i.d. multivariate Gaussian as well. Locally5DLocally Connected5DISCUSSION ConnectedISCUSSION Networks Networks (LCNs) (LCNs) (Fukushima (Fukushima, 1975,;1975Lecun; Lecun, 1989,)1989 are CNNs) are CNNs without without weight weight Bij =1/ws for i =0, 1,...(d2 1) and j = is, . . . , (is + ws 1). Strided convolution. Strided convolution is equivalent to a non-strided convolution composed with sharingsharing between between spatial spatial locations. locations. LCNs LCNs preserve preserve the connectivity the connectivity pattern, pattern, and thus and topology, thus topology, of a of a ND convolutions. Note that our analysis in the main text (1D) easily extends to higher-dimensional subsampling. Let s denote size of the stride. Then the strided convolution is equivalent to CNN.5.1CNN. However,5.1 BAYESIAN However, BAYESIAN theyCNN do they notSWITHMANYCHANNELSAREIDENTICALTOLOCALLYCONNECTEDCNN do possess notSWITHMANYCHANNELSAREIDENTICALTOLOCALLYCONNECTED possess the equivariance the equivariance property property of a CNN of a CNN– if an – input if an inputis translated, is translated, convolutions by replacing integerN pixel indices and sizes d, ↵, with tuples (see also Figure 4). choosing B as follows:2 B = (is j) for i 0, 1,...(d 1) . the latenttheNETWORKS latent representationNETWORKS representation, IN in THE, anIN LCNABSENCE in THE an will LCNABSENCE be OF will completely POOLING be OF completely POOLING different, different, rather rather than also than being also being translated. translated. ij 2 { 2 } Average pooling. Average pooling with stride s and window size ws is equivalent to choosing The CNN-GPThe CNN-GP predictions predictions without without spatial spatial pooling pooling in 3.1 inand3.13.2.2and depend3.2.2 depend only on only sample-sample on sample-sample 18 LocallyLocally Connected Connected Networks Networks (LCNs) (LCNs) (Fukushima (Fukushima§, 1975§,;1975§Lecun; §Lecun, 1989,)1989 are CNNs) are CNNs without without weight weight Bij =1/ws for i =0, 1,...(d2 1) and j = is, . . . , (is + ws 1). covariances,sharingcovariances,sharing between and between do and spatial not do depend spatial not locations. depend on locations. pixel-pixel LCNs on pixel-pixel LCNs preserve covariances. preserve covariances. the connectivity the LCNs connectivity LCNs destroy pattern, destroy pixel-pixel pattern, and pixel-pixel thus and covariances: topology, thus covariances: topology, of a of a L LCNL LCN ND convolutions. Note that our analysis in the main text (1D) easily extends to higher-dimensional KCNN.KCNN. However,(x, x However,0)=0(x,they x0)=0, for do they not↵, for do= possess not↵0 =and possess↵ the0 alland equivariancex, the allx0 equivariancex, x0 and propertyL>and property ofL>0. a However, CNN of0. a However, CNN– if LCNs an – input if LCNs an preserve inputis translated, preserve is the translated, the convolutions by replacing integer pixel indices and sizes d, ↵, with tuples (see also Figure 4). 1 ↵,↵10 ↵,↵0 6 6 2 X 2 X the latentthe latent representation representation in an LCN in an will LCN be will completely be completelyL different,LCNL different,LCN rather rather thanL also thanCNN beingL alsoCNN being translated. translated. 
⇥covariances⇤⇥covariances⇤ between between input input examples examples at every at every pixel: pixel:K K (x, x0)=(x, x0)=K K (x, x0)(.x, As x0). a As a 1 ↵,↵1 ↵,↵ 1 ↵,↵1 ↵,↵ result,Theresult, in CNN-GPThe the in absenceCNN-GP the predictions absence of predictions pooling, of without pooling, LCN-GPs without spatial LCN-GPs spatial and pooling CNN-GPs and pooling in CNN-GPs3.1 in areand3.1 identical. are3.2.2and identical.depend3.2.2 Moreover,depend Moreover,only onLCN-GPs only sample-sample onLCN-GPs sample-sample with with 18 covariances,covariances, and do and not do depend not depend on pixel-pixel on pixel-pixel covariances.⇥ § covariances.⇤⇥ § LCNs§⇤ LCNs§ destroy⇥ destroy pixel-pixel⇤⇥ pixel-pixel⇤ covariances: covariances: L LCNL LCN K K (x, x0)=0(x, x0)=0, for ↵, for= ↵0 =and↵0 allandx, allx0 x, x0 and L>and L>0. However,0. However, LCNs LCNs preserve preserve the the 1 ↵,↵10 ↵,↵0 6 6 8 28X 2 X L LCNL LCN L CNNL CNN ⇥covariances⇤⇥covariances⇤ between between input input examples examples at every at every pixel: pixel:K K (x, x0)=(x, x0)=K K (x, x0)(.x, As x0). a As a 1 ↵,↵1 ↵,↵ 1 ↵,↵1 ↵,↵ result,result, in the in absence the absence of pooling, of pooling, LCN-GPs LCN-GPs and CNN-GPs and CNN-GPs are identical. are identical. Moreover, Moreover, LCN-GPs LCN-GPs with with ⇥ ⇤⇥ ⇤ ⇥ ⇤⇥ ⇤


for every class $i$) as $\min\left\{n^1, \dots, n^{L+1}\right\} \to \infty$, with covariance denoted as $\mathcal{K}^{L+1}_{\infty} \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{X}|}$, or, jointly, $\bar{z}^{L+1} \,\big|\, y \xrightarrow{D} \mathcal{N}\left(0, \mathcal{K}^{L+1}_{\infty} \otimes I_{\bar{n}^{L+2}}\right)$. As in §2.2, we postpone the formal derivation until §E.4 (Theorem E.6).

3.1 VECTORIZATION

One common readout strategy is to vectorize (flatten) the output of the last convolutional layer $z^L(x) \in \mathbb{R}^{n^{L+1}d}$ into a vector4 $\mathrm{vec}\left[z^L(x)\right] \in \mathbb{R}^{n^{L+1}d}$ and stack a fully connected layer on top:
$$\bar{z}^{L+1}_i(x) \equiv \sum_{j=1}^{n^{L+1}d} \bar{\omega}^{L+1}_{ij}\, \phi\!\left(\mathrm{vec}\left[z^L(x)\right]\right)_j + \bar{b}^{L+1}_i, \qquad (12)$$
where the weights $\bar{\omega}^{L+1} \in \mathbb{R}^{\bar{n}^{L+2} \times n^{L+1}d}$ and biases $\bar{b}^{L+1} \in \mathbb{R}^{\bar{n}^{L+2}}$ are i.i.d. Gaussian, $\bar{\omega}^{L+1}_{ij} \sim \mathcal{N}\left(0, \sigma^2_\omega/(n^{L+1}d)\right)$, $\bar{b}^{L+1}_i \sim \mathcal{N}\left(0, \sigma^2_b\right)$.
It is easy to verify that $\bar{z}^{L+1}_i \,\big|\, K^{L+1}$ is a multivariate Gaussian with covariance:
$$\mathcal{K}^{\mathrm{vec}}_i \equiv \mathbb{E}\left[\bar{z}^{L+1}_i \left(\bar{z}^{L+1}_i\right)^T \,\Big|\, K^{L+1}\right] = \frac{\sigma^2_\omega}{d} \sum_\alpha \left[K^{L+1}\right]_{\alpha,\alpha} + \sigma^2_b. \qquad (13)$$

The argument of §2.2 can then be extended to conclude that in the limit of infinite width $\bar{z}^{L+1}_i \,\big|\, y$ converges in distribution to a multivariate Gaussian (i.i.d. for each class $i$) with covariance
$$\mathcal{K}^{\mathrm{vec}}_{\infty} = \frac{\sigma^2_\omega}{d} \sum_\alpha \left[K^{L+1}_{\infty}\right]_{\alpha,\alpha} + \sigma^2_b. \qquad (14)$$
A sample 2D CNN using this readout strategy is depicted in Figure 2, and a sample correspondence between a FCN, FCN-GP, CNN, and CNN-GP is depicted in Figure 3.
Note that as observed in Xiao et al. (2018), to compute the summands $\left[K^{L+1}_{\infty}\right]_{\alpha,\alpha}(x, x')$ in Equation 13, one needs only the corresponding terms $\left[K^{L}_{\infty}\right]_{\alpha,\alpha}(x, x')$. Consequently, we only need to compute $\left\{\left[K^{l}_{\infty}\right]_{\alpha,\alpha}(x, x') : x, x' \in \mathcal{X},\ \alpha \in \{1 \dots d\}\right\}_{l=0,\dots,L}$, and the memory cost is $\mathcal{O}\left(|\mathcal{X}|^2 d\right)$ (or $\mathcal{O}(d)$ per covariance entry in an iterative or distributed setting). Note that this approach ignores pixel-pixel covariances and produces a GP corresponding to a locally connected network (see §5.1).
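To make Equation 14 concrete, below is a minimal NumPy sketch (not the authors' code) of the vectorization readout. It assumes the per-pixel diagonal terms $\left[K^{L+1}_{\infty}\right]_{\alpha,\alpha}(x, x')$ have already been produced by the covariance recursion and stored as an array of shape $(|\mathcal{X}|, |\mathcal{X}|, d)$; the function and argument names are illustrative.

```python
import numpy as np

def vectorization_readout(K_diag, sigma_w2=1.0, sigma_b2=0.0):
    """Eq. (14): K_vec = (sigma_w^2 / d) * sum_alpha [K^{L+1}]_{alpha,alpha} + sigma_b^2.

    K_diag: array of shape (N, N, d) with the per-pixel diagonal covariances
            [K^{L+1}_inf]_{alpha,alpha}(x, x') for all pairs of the N inputs.
    Returns the (N, N) sample-sample covariance of the CNN-GP with a
    vectorized (flattened) readout.
    """
    d = K_diag.shape[-1]
    return (sigma_w2 / d) * K_diag.sum(axis=-1) + sigma_b2
```

Because only the diagonal pixel terms enter, the memory footprint stays at $\mathcal{O}\left(|\mathcal{X}|^2 d\right)$, matching the cost noted above.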

3.2 PROJECTION

Another readout approach is a projection to collapse the spatial dimensions. Let $h \in \mathbb{R}^d$ be a deterministic projection vector, $\bar{\omega}^{L+1}_{ij} \sim \mathcal{N}\left(0, \sigma^2_\omega/n^{L+1}\right)$, and $\bar{b}^{L+1}$ be the same as in §3.1. Define the output to be
$$\bar{z}^{L+1}_i(x) \equiv \sum_{j=1}^{n^{L+1}} \bar{\omega}^{L+1}_{ij} \left(\sum_{\alpha=1}^{d} \phi\!\left(z^L_{j,\alpha}(x)\right) h_\alpha\right) + \bar{b}^{L+1}_i, \quad \text{leading to (analogously to §3.1)} \qquad (15)$$
$$\mathcal{K}^{h}_{\infty} \equiv \sigma^2_\omega \sum_{\alpha,\alpha'} h_\alpha h_{\alpha'} \left[K^{L+1}_{\infty}\right]_{\alpha,\alpha'} + \sigma^2_b. \qquad (16)$$
Examples of this approach include

1. Global average pooling: take $h = \frac{1}{d}\mathbf{1}_d$. Then
$$\mathcal{K}^{\mathrm{pool}}_{\infty} \equiv \frac{\sigma^2_\omega}{d^2} \sum_{\alpha,\alpha'} \left[K^{L+1}_{\infty}\right]_{\alpha,\alpha'} + \sigma^2_b. \qquad (17)$$
4Note that since per our notation described in §2.1 $z^L(x)$ is already considered a 1D vector, here this operation simply amounts to re-indexing $z^L(x)$ with one channel index $j$ instead of two (channel $i$ and pixel $\alpha$).


[Figure 4 plot: validation accuracy (y-axis) vs. depth (x-axis, 1 to 100) for MC-CNN-GP with pooling, CNN-GP, CNN-GP (center pixel), CNN-GP (without padding), and FCN-GP.]

Figure 4: Different dimensionality collapsing strategies described in §3. Validation accuracy of an MC-CNN-GP with pooling (§3.2.1) is consistently better than that of other models due to translation invariance of the kernel. CNN-GP with zero padding (§3.1) outperforms an analogous CNN-GP without padding as depth increases. At depth 15 the spatial dimension of the output without padding is reduced to 1 × 1, making the CNN-GP without padding equivalent to the center pixel selection strategy (§3.2.2) – which also performs worse than the CNN-GP (we conjecture, due to overfitting to centrally-located features) but approaches the latter (right) in the limit of large depth, as information becomes more uniformly spatially distributed (Xiao et al., 2018). CNN-GPs generally outperform FCN-GP, presumably due to the local connectivity prior, but can fail to capture nonlinear interactions between spatially-distant pixels at shallow depths (left). Values are reported on a 2K/4K train/validation subset of CIFAR10. See §G.3 for experimental details.

This approach corresponds to applying global average pooling right after the last convolutional layer.5 It takes all pixel-pixel covariances into consideration and makes the kernel translation invariant. However, it requires $\mathcal{O}\left(|\mathcal{X}|^2 d^2\right)$ memory to compute the sample-sample covariance of the GP (or $\mathcal{O}\left(d^2\right)$ per covariance entry in an iterative or distributed setting). It is impractical to use this method to analytically evaluate the GP, and we propose to use a Monte Carlo approach instead (see §4).
2. Subsampling one particular pixel: take $h = e_\alpha$,

$$\mathcal{K}^{e_\alpha}_{\infty} \equiv \sigma^2_\omega \left[K^{L+1}_{\infty}\right]_{\alpha,\alpha} + \sigma^2_b. \qquad (18)$$
This approach makes use of only one pixel-pixel covariance, and requires the same amount of memory as vectorization (§3.1) to compute.
We compare the performance of the presented strategies in Figure 4. Note that all described strategies admit stacking additional FC layers on top while retaining the GP equivalence, using a derivation analogous to §2.2 (Lee et al., 2018; Matthews et al., 2018b).
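The readouts of Equations 16–18 differ only in the choice of the projection vector $h$. The following NumPy sketch (ours, not the paper's code) assumes the full pixel-pixel covariance tensor is available as an array of shape $(|\mathcal{X}|, |\mathcal{X}|, d, d)$; the helper name and the zero-filled placeholder tensor are purely illustrative.

```python
import numpy as np

def projection_readout(K_full, h, sigma_w2=1.0, sigma_b2=0.0):
    """Eq. (16): K_h = sigma_w^2 * sum_{a,a'} h_a h_{a'} [K^{L+1}]_{a,a'} + sigma_b^2.

    K_full: array of shape (N, N, d, d) holding all pixel-pixel covariances.
    h: deterministic projection vector of shape (d,).
    """
    return sigma_w2 * np.einsum('a,b,xyab->xy', h, h, K_full) + sigma_b2

d = 16                                   # number of pixels (illustrative)
h_pool = np.full(d, 1.0 / d)             # global average pooling, Eq. (17)
alpha = d // 2
h_pixel = np.zeros(d)
h_pixel[alpha] = 1.0                     # subsampling pixel alpha, Eq. (18)

# Placeholder tensor for shape illustration; in practice this comes from the covariance recursion.
K_full = np.zeros((4, 4, d, d))
K_pool = projection_readout(K_full, h_pool)
K_pixel = projection_readout(K_full, h_pixel)
```

Only the pooling choice touches all $d^2$ pixel-pixel entries per pair of inputs, which is the source of the $\mathcal{O}\left(|\mathcal{X}|^2 d^2\right)$ memory cost discussed above.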

4 MONTE CARLO EVALUATION OF INTRACTABLE GP KERNELS

We introduce a Monte Carlo estimation method for NN-GP kernels which are computationally impractical to compute analytically, or for which we do not know the analytic form. Similar in spirit to traditional random feature methods (Rahimi & Recht, 2007), the core idea is to instantiate many random finite width networks and use the empirical uncentered covariances of activations to estimate the Monte Carlo-GP (MC-GP) kernel,

$$\left[K^{l}_{n,M}\right]_{\alpha,\alpha'}(x, x') \equiv \frac{1}{Mn} \sum_{m=1}^{M} \sum_{c=1}^{n} y^{l}_{c\alpha}(x;\, \theta_m)\, y^{l}_{c\alpha'}(x';\, \theta_m) \qquad (19)$$
5Spatially local average pooling in intermediary layers can be constructed in a similar fashion (§C). We focus on global average pooling in this work to more effectively isolate the effects of pooling from other aspects of the model like local connectivity or equivariance.
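A minimal sketch of the Monte Carlo estimate in Equation 19 (not the authors' implementation) is given below. It assumes a user-supplied sample_activations function that draws fresh parameters $\theta_m \sim p(\theta)$, runs the corresponding finite-width network on the inputs, and returns activations of shape (inputs, channels, pixels); that helper and the shapes are assumptions made for illustration.

```python
import numpy as np

def mc_gp_kernel(xs, sample_activations, M=32):
    """Monte Carlo NN-GP kernel estimate, Eq. (19):
        [K_{n,M}]_{a,a'}(x, x') = 1/(M n) * sum_m sum_c y_{c,a}(x; theta_m) y_{c,a'}(x'; theta_m).

    xs: batch of N inputs.
    sample_activations: callable drawing theta_m ~ p(theta) and returning an
        array of activations with shape (N, n, d) = (inputs, channels, pixels).
    Returns an (N, N, d, d) array of estimated pixel-pixel covariances.
    """
    K = None
    for _ in range(M):
        y = sample_activations(xs)                    # fresh draw of theta_m
        n = y.shape[1]
        outer = np.einsum('xca,zcb->xzab', y, y) / n  # average over channels
        K = outer if K is None else K + outer
    return K / M
```

Accumulating one draw at a time keeps only a single network's activations in memory; the resulting $(|\mathcal{X}|, |\mathcal{X}|, d, d)$ estimate can then be fed to a readout such as the projection_readout sketch above.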


[Figure 5 plots: accuracy (left) and log10(Distance) (right) for an MC-CNN-GP as functions of log2(#Channels) and log2(#Samples).]

Figure 5: Convergence of NN-GP Monte Carlo estimates. Validation accuracy (left) of an MC-CNN-GP increases with $n \times M$ (channel count × number of samples) and approaches that of the exact CNN-GP (not shown), while the distance (right) to the exact kernel decreases. The dark band in the left plot corresponds to ill-conditioning of $K^{L+1}_{n,M}$ when the number of outer products contributing to $K^{L+1}_{n,M}$ approximately equals its rank. Values reported are for a 3-layer model applied to a 2K/4K train/validation subset of CIFAR10 downsampled to 8 × 8. See Figure 10 for similar results with other architectures and §G.2 for experimental details.

where $\theta$ consists of $M$ draws of the weights and biases from their prior distribution, $\theta_m \sim p(\theta)$, and $n$ is the width or number of channels in hidden layers. The MC-GP kernel converges to the analytic kernel with increasing width, $\lim_{n \to \infty} K^{l}_{n,M} = K^{l}_{\infty}$ in probability.
For finite width networks, the uncertainty in $K^{l}_{n,M}$ is $\mathrm{Var}\left[K^{l}_{n,M}\right] = \mathrm{Var}_\theta\left[K^{l}_{n}(\theta)\right]/M$. From Daniely et al. (2016, Theorems 2, 3), we know that for well-behaved nonlinearities (that include the ReLU and erf considered in our experiments) $\mathrm{Var}_\theta\left[K^{l}_{n}(\theta)\right] \propto \frac{1}{n}$, which leads to $\mathrm{Var}_\theta\left[K^{l}_{n,M}\right] \propto \frac{1}{Mn}$. For finite $n$, $K^{l}_{n,M}$ is also a biased estimate of $K^{l}_{\infty}$, where the bias depends solely on network width. We do not currently have an analytic form for this bias, but we can see in Figures 5 and 10 that for the hyperparameters we probe it is small relative to the variance. In particular, $\left\|K^{l}_{n,M}(\theta) - K^{l}_{\infty}\right\|^2_F$ is nearly constant for constant $Mn$. We thus treat $Mn$ as the effective sample size for the Monte Carlo kernel estimate. Increasing $M$ and reducing $n$ can reduce memory cost, though potentially at the expense of increased compute time and bias. In a non-distributed setting, the MC-GP reduces the memory requirements to compute $\mathcal{K}^{\mathrm{pool}}_{\mathrm{GP}}$ from $\mathcal{O}\left(|\mathcal{X}|^2 d^2\right)$ to $\mathcal{O}\left(|\mathcal{X}|^2 + n^2 + nd\right)$, making the evaluation of CNN-GPs with pooling practical.

5 DISCUSSION

5.1 BAYESIAN CNNS WITH MANY CHANNELS ARE IDENTICAL TO LOCALLY CONNECTED NETWORKS, IN THE ABSENCE OF POOLING

Locally Connected Networks (LCNs) (Fukushima, 1975; Lecun, 1989) are CNNs without weight sharing between spatial locations. LCNs preserve the connectivity pattern, and thus topology, of a CNN. However, they do not possess the equivariance property of a CNN – if an input is translated, the latent representation in an LCN will be completely different, rather than also being translated.
The CNN-GP predictions without spatial pooling in §3.1 and §3.2.2 depend only on sample-sample covariances, and do not depend on pixel-pixel covariances. LCNs destroy pixel-pixel covariances: $\left[K^{L,\,\mathrm{LCN}}_{\infty}\right]_{\alpha,\alpha'}(x, x') = 0$, for $\alpha \neq \alpha'$ and all $x, x' \in \mathcal{X}$ and $L > 0$. However, LCNs preserve the covariances between input examples at every pixel: $\left[K^{L,\,\mathrm{LCN}}_{\infty}\right]_{\alpha,\alpha}(x, x') = \left[K^{L,\,\mathrm{CNN}}_{\infty}\right]_{\alpha,\alpha}(x, x')$. As a result, in the absence of pooling, LCN-GPs and CNN-GPs are identical. Moreover, LCN-GPs with pooling are identical to CNN-GPs with vectorization of the top layer (under suitable scaling of $y^{L+1}$). We confirm these findings experimentally in trained networks in the limit of large width in Figures 1 and 6 (b), as well as by demonstrating convergence of MC-GPs of the respective architectures to the same CNN-GP (modulo scaling of $y^{L+1}$) in Figures 5 and 10.
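The equality of LCN-GPs and CNN-GPs in the absence of pooling can also be illustrated with a toy NumPy check (not from the paper): zeroing all off-diagonal pixel-pixel covariances, as an LCN does, leaves the vectorization readout of Equation 14 unchanged, while the pooling readout of Equation 17 does change. The random kernel and shapes below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 16
A = rng.normal(size=(N * d, N * d))
# A random PSD "CNN" kernel indexed by (input, input, pixel, pixel).
K_cnn = (A @ A.T).reshape(N, d, N, d).transpose(0, 2, 1, 3)

# The corresponding "LCN" kernel keeps only the alpha == alpha' entries.
idx = np.arange(d)
K_lcn = np.zeros_like(K_cnn)
K_lcn[:, :, idx, idx] = K_cnn[:, :, idx, idx]

# Vectorization readout (Eq. 14) sees only the diagonal: identical for both kernels.
assert np.allclose(K_cnn[:, :, idx, idx].sum(-1), K_lcn[:, :, idx, idx].sum(-1))

# Pooling readout (Eq. 17) mixes all pixel pairs: generally different.
h = np.full(d, 1.0 / d)
pool_cnn = np.einsum('a,b,xyab->xy', h, h, K_cnn)
pool_lcn = np.einsum('a,b,xyab->xy', h, h, K_lcn)
print(np.abs(pool_cnn - pool_lcn).max())  # nonzero in general
```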


5.2 POOLING LEVERAGES EQUIVARIANCE TO PROVIDE INVARIANCE

The only kernel leveraging pixel-pixel covariances is that of the CNN-GP with pooling. This enables the predictions of this GP and the corresponding CNN to be invariant to translations (modulo edge effects) – a beneficial quality for an image classifier. We observe strong experimental evidence supporting the benefits of invariance throughout this work (Figures 1, 4, 5, 6 (b); Table 1), in both CNNs and CNN-GPs.

5.3 FINITE CHANNEL SGD-TRAINED CNNS CAN OUTPERFORM INFINITE CHANNEL BAYESIAN CNNS, IN THE ABSENCE OF POOLING

In the absence of pooling, the benefits of equivariance and weight sharing are more challenging to explain in terms of Bayesian priors on class predictions (since without pooling equivariance is not a property of the outputs, but only of intermediary representations). Indeed, in this work we find that the performance of finite width SGD-trained CNNs often approaches that of their CNN-GP counterpart (Figure 6, b, c),6 suggesting that in those cases equivariance does not play a beneficial role in SGD-trained networks. However, as can be seen in Figures 1, 6 (c), 7, and Table 1, the best CNN overall outperforms the best CNN-GP by a significant margin – an observation specific to CNNs and not FCNs or LCNs.7 We observe this gap in performance especially in the case of ReLU networks trained with a large learning rate. In Figure 1 we demonstrate this large gap in performance by evaluating different models with equivalent architecture and hyperparameter settings, chosen for good SGD-trained CNN performance. We conjecture that equivariance, a property lacking in LCNs and the Bayesian treatment of the infinite channel CNN limit, contributes to the performance of SGD-trained finite channel CNNs with the correct settings of hyperparameters. Nonetheless, more work is needed to disentangle and quantify the separate contributions of stochastic optimization and finite width effects,8 and differences in performance between CNNs with weight sharing and their corresponding CNN-GPs.

6 CONCLUSION

In this work we have derived a Gaussian process that corresponds to fully Bayesian multi-layer CNNs with infinitely many channels. The covariance of this GP can be efficiently computed either in closed form or by using Monte Carlo sampling, depending on the architecture. The CNN-GP achieves state of the art results for GPs without trainable kernels on CIFAR10. It can perform competitively with CNNs (that fit the training set) of equivalent architecture and weight priors, which makes it an appealing choice for small datasets, as it eliminates all training-related hyperparameters. However, we found that the best overall performance, at least in the absence of pooling, is achieved by finite SGD-trained CNNs and not by their infinite Bayesian counterparts. We hope our work stimulates future research towards understanding the distribution over functions induced by model architecture and training approach, and what aspects of this distribution are important for model performance. Another natural extension of our work is the study of other architectures in the infinitely wide limit. After the publication of this paper, Yang (2019) devised a unifying framework proving the GP convergence for even more models (such as ones using batch normalization, (self-)attention, LSTM) with slightly different assumptions on the nonlinearity.

6This observation is conditioned on the respective NN fitting the training set to 100%. Underfitting breaks the correspondence to an NN-GP, since train set predictions of such a network no longer correspond to the true training labels. Properly tuned underfitting often also leads to better generalization (Table 1).
7Performing an analogous large-dataset comparison between CNNs and CNN-GPs with pooling was not computationally feasible. Their relative performance remains an interesting open question for future research.
8We remark that concerns about GPs not being able to learn hierarchical representations have been raised in the literature (Matthews et al., 2018a, Section 7), (Neal, 1995, Chapter 5), (MacKay, 2003, Section 45.7). However, the practical impact of these assertions has not been extensively investigated empirically or theoretically, and we hope that our work stimulates research in this direction.


[Figure 6 plots. (a): accuracy vs. #Channels for CNNs w/ pooling (ReLU and erf). (b): accuracy vs. #Channels for LCN and CNN networks, with no pooling and with global average pooling, compared to the corresponding GPs. (c): CNN-GP accuracy vs. accuracy of the best CNN w/o pooling.]

Figure 6: (a): SGD-trained CNNs often perform better with increasing number of channels. Each line corresponds to a particular choice of architecture and initialization hyperparameters, with best learning rate and weight decay selected independently for each number of channels (x-axis). (b): SGD-trained CNNs often approach the performance of their corresponding CNN-GP with increasing number of channels. All models have the same architecture except for pooling and weight sharing, as well as training-related hyperparameters such as learning rate, weight decay and batch size, which are selected for each number of channels (x-axis) to maximize validation performance (y-axis) of a neural network. As the number of channels grows, best validation accuracy increases and approaches the accuracy of the respective GP (solid horizontal line). (c): However, the best-performing SGD-trained CNNs can outperform their corresponding CNN-GPs. Each point corresponds to the test accuracy of: (y-axis) a specific CNN-GP; (x-axis) the best (on validation) CNN with the same architectural hyper-parameters selected among the 100%-accurate models on the full training CIFAR10 dataset with different learning rates, weight decay and number of channels. While CNN-GP appears competitive against 100%-accurate CNNs (above the diagonal), the best CNNs overall outperform CNN-GPs by a significant margin (below the diagonal, right). For further analysis of factors leading to similar or diverging behavior between SGD-trained finite CNNs and infinite Bayesian CNNs see Figures 1, 7, and Table 1. Experimental details: all networks have reached 100% training accuracy on CIFAR10. Values in (b) are reported on an 0.5K/4K train/validation subset downsampled to 8 × 8 for computational reasons. See §G.5 and §G.1 for full experimental details of the (a, c) and (b) plots respectively.

12 Published as a conference paper at ICLR 2019

86 82.76 NN (underfitting allowed)

75 75.24

NN (no underfitting) 67.14 CNN-GP 64 Test Accuracy, % Accuracy, Test 58.94 FCN-GP

53 FCN CNN w/ ERF CNN w/ small LR (learning rate) CNN w/ ReLU & large LR

Figure 7: Aspects of architecture and inference influencing test performance. Test accuracy (vertical axis, %) for the best model within each model family (horizontal axis), maximizing vali- dation accuracy over depth, width, and training and initialization hyperpameters (“No underfitting” means only models that achieved 100% training accuracy were considered). CNN-GP outperforms SGD-trained models optimized with a small learning rate to 100% train accuracy. When SGD optimization is allowed to underfit the training set, there is a significant improvement in general- ization. Further, when ReLU nonlinearities are paired with large learning rates, the performance of SGD-trained models again improves relative to CNN-GPs, suggesting a beneficial interplay between ReLUs and fast SGD training. These differences in performance between CNNs and CNN-GPs are not observed between FCNs and FCN-GPs, or between LCNs and LCN-GPs (Figure1), suggest- ing that equivariance is the underlying factor responsible for the improved performance of finite SGD-trained CNNs relative to infinite Bayesian CNNs without pooling. Further comparison be- tween SGD-trained CNNs and CNN-GPs with pooling (omitted due to computational limitations), where both models do share the property of invariance, would be an interesting direction for future research. See Table1 for further results on other datasets, as well as comparison to GP performance in prior literature. See G.5 for experimental details. §

Model CIFAR10 MNIST Fashion-MNIST CNN with pooling 14.85 (15.65) − − CNN with ReLU and large learning rate 24.76 (17.64) CNN-GP 32.86 0−.88 −7.40 CNN with small learning rate 33.31(22.89) CNN with erf (any learning rate) 33.31(22.17) − − − − Convolutional GP (van der Wilk et al., 2017) 35.40 1.17 ResNet GP (Garriga-Alonso et al., 2018) 0.84 − Residual CNN-GP (Garriga-Alonso et al., 2018) − 0.96 − CNN-GP (Garriga-Alonso et al., 2018) − 1.03 − − − FCN-GP 41.06 1.22 8.22 FCN-GP (Lee et al., 2018) 44.34 1.21 FCN 45.52 (44.73) − − − Table 1: Best achieved test error for different model classes (vertical axis) and datasets (hor- izontal axis). Test error (%) for the best model within each model family, maximizing validation accuracy over depth, width, and training and initialization hyperpameters. Except where indicated by parentheses, all models achieve 100% training accuracy. For SGD-trained CNNs, numbers in parentheses correspond to the same model family, but without restriction on training accuracy. CNN- GP achieves state of the art results on CIFAR10 for GPs without trainable kernels and outperforms SGD-trained models optimized with a small learning rate to 100% train accuracy. See Figure7 for visualization and interpretation of these results. See G.5 for experimental details. §

13 Published as a conference paper at ICLR 2019

7 ACKNOWLEDGEMENTS

We thank Sam Schoenholz, Vinay Rao, Daniel Freeman, Qiang Zeng, and Phil Long for frequent discussion and feedback on preliminary results.

REFERENCES Mart´ın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Patrick Billingsley. Probability and Measure. John Wiley & Sons, 1995.

Patrick Billingsley. Convergence of probability measures. John Wiley & Sons, 1999.

Kenneth Blomqvist, Samuel Kaski, and Markus Heinonen. Deep convolutional gaussian processes. arXiv preprint arXiv:1810.03052, 2018.

Anastasia Borovykh. A gaussian process perspective on convolutional neural networks. arXiv preprint arXiv:1810.10798, 2018.

John Bradshaw, Alexander G de G Matthews, and Zoubin Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476, 2017.

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.

Minmin Chen, Jeffrey Pennington, and Samuel Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 873–882, Stockholmsmssan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/chen18i.html.

Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in neural information processing systems, pp. 342–350, 2009.

Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard´ Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In International confer- ence on machine learning, pp. 2990–2999, 2016.

Andreas Damianou and Neil Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215, 2013.

Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pp. 2253–2261, 2016.

Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural network. Biological cyber- netics, 20(3-4):121–136, 1975.

Kunihiko Fukushima and Sei Miyake. : A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets, pp. 267–285. Springer, 1982.

Adria` Garriga-Alonso, Laurence Aitchison, and Carl Edward Rasmussen. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, aug 2018. URL https://arxiv.org/abs/1808.05587.

14 Published as a conference paper at ICLR 2019

Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Scul- ley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. ACM, 2017. Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. International Conference on Learning Representations, 2015. Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. arXiv preprint arXiv:1803.01719, 2018. Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog- nition. In Proceedings of the IEEE conference on and pattern recognition, pp. 770–778, 2016. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 3rd International Conference for Learning Representations, 2015. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. Vinayak Kumar, Vaibhav Singh, PK Srijith, and Andreas Damianou. Deep gaussian processes with convolutional kernels. arXiv preprint arXiv:1806.01655, 2018. Neil D Lawrence and Andrew J Moore. Hierarchical gaussian process latent variable models. In Proceedings of the 24th international conference on Machine learning, pp. 481–488. ACM, 2007. Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. In Artificial Intelligence and Statistics, pp. 404–411, 2007. Yann Lecun. Generalization and network design strategies. In Connectionism in perspective. Else- vier, 1989. Yann LeCun, Leon´ Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. Jaehoon Lee, Yasaman Bahri, Roman Novak, Sam Schoenholz, Jeffrey Pennington, and Jascha Sohl- dickstein. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z. Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017. Jonathan L Long, Ning Zhang, and Trevor Darrell. Do convnets learn correspondence? In Advances in Neural Information Processing Systems, pp. 1601–1609, 2014. Philip M Long and Hanie Sedghi. On the effect of the activation function on the distribution of hidden nodes in a deep network. arXiv preprint arXiv:1901.02104, 2019. David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003. Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 9 2018a. Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahra- mani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 4 2018b. URL https://openreview.net/forum?id= H1-nGgWC-.

15 Published as a conference paper at ICLR 2019

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010. Radford M. Neal. Priors for infinite networks (tech. rep. no. crg-tr-94-1). University of Toronto, 1994. Radford M Neal. BAYESIAN LEARNING FOR NEURAL NETWORKS. PhD thesis, University of Toronto, 1995. Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. Proceeding of the international Conference on Learning Representations workshop track, abs/1412.6614, 2015. Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In International Confer- ence on Learning Representations, 2018. URL https://openreview.net/forum?id= HJC2SzZCW. Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. Razvan Pascanu, Yann N Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point problem for non-convex optimization. arXiv preprint arXiv:1405.4604, 2014. Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A re- view. International Journal of Automation and Computing, 14(5):503–519, Oct 2017. ISSN 1751-8520. doi: 10.1007/s11633-017-1054-2. URL https://doi.org/10.1007/ s11633-017-1054-2. Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Expo- nential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems, pp. 3360–3368, 2016. Joaquin Quinonero-Candela˜ and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005. Ali Rahimi and Ben Recht. Random features for large-scale kernel machines. In In Neural Infom- ration Processing Systems, 2007. Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006. David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985. Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. ICLR, 2017. David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. ICLR Workshop, 2014. Peter Tino,ˇ Michal Cernansky, and Lubica Benuskova. Markovian architectural bias of recurrent neural networks. IEEE Transactions on Neural Networks, 15(1):6–15, 2004.

16 Published as a conference paper at ICLR 2019

Peter Tino,ˇ Barbara Hammer, and Mikael Boden.´ Markovian bias of neural-based architectures with feedback connections. In Perspectives of neural-symbolic integration, pp. 95–133. Springer, 2007. Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pp. 567–574, 2009. Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional gaussian pro- cesses. In Advances in Neural Information Processing Systems 30, pp. 2849–2858, 2017. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Infor- mation Processing Systems, pp. 5998–6008, 2017. Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010. Paul J Werbos. Generalization of with application to a recurrent gas market model. Neural networks, 1(4):339–356, 1988. Christopher KI Williams. Computing with infinite networks. In Advances in neural information processing systems, pp. 295–301, 1997. Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pp. 2586–2594, 2016a. Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016b. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmark- ing machine learning algorithms, 2017a. Lechao Xiao, Yasaman Bahri, Sam Schoenholz, and Jeffrey Pennington. Training ultra-deep cnns with critical initialization. In NIPS Workshop, 2017b. Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla con- volutional neural networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5393–5402, Stock- holmsmssan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings. mlr.press/v80/xiao18a.html. Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019. Greg Yang and Sam S. Schoenholz. Deep Mean Field Theory: Layerwise Variance and Width Variation as Methods to Control Gradient Explosion. ICLR Workshop, February 2018. URL https://openreview.net/forum?id=rJGY8GbR-. Greg Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in neural information processing systems, pp. 7103–7114, 2017. Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Sam S. Schoenholz. A mean field theory of batch normalization. ICLR, February 2018. URL https://openreview. net/forum?id=rJGY8GbR-. Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Rep- resentations, 2017.

17 Published as a conference paper at ICLR 2019

Appendices

AADDITIONAL FIGURES

No Pooling Global Average Pooling

0.14 0.14 0.13 GP 0.13 GP 0.12 NN 0.12 NN 0.11 0.11 0.1 0.1

LCN 0.09 0.09 Validation Loss Validation 0.08 0.08 5 10 2 5 100 2 5 1000 2 5 5 10 2 5 100 2 5 1000 2 5

0.14 0.14 0.13 GP 0.13 GP 0.12 NN 0.12 NN 0.11 0.11 0.1 0.1

CNN 0.09 0.09 Validation Loss Validation 0.08 0.08 5 10 2 5 100 2 5 1000 2 5 5 10 2 5 100 2 5 1000 2 5

#Channels Figure 8: Validation loss convergence. Best validation loss (vertical axis) of trained neural net- works (dashed line) as the number of channels increases (horizontal axis) approaches that of a respective (MC-)CNN-GP (solid horizontal line). See Figure6 (b) for validation accuracy, Figure 9 for training loss and G.1 for experimental details. §

No Pooling Global Average Pooling 0.1 0.1

0.01 0.01

0.001 0.001

100μ 100μ

LCN 10μ 10μ Training Loss 1μ 1μ 5 10 2 5 100 2 5 1000 2 5 5 10 2 5 100 2 5 1000 2 5

0.1 0.1 0.01 0.01 0.001 0.001 100μ 100μ

CNN 10μ 10μ Training Loss 1μ 1μ 5 10 2 5 100 2 5 1000 2 5 5 10 2 5 100 2 5 1000 2 5

#Channels Figure 9: No underfitting in small models. Training loss (vertical axis) of best (in terms of vali- dation loss) neural networks as the number of channels increases (horizontal axis). While perfect 0 loss is not achieved (but 100% accuracy is), we observe no consistent improvement when increasing the capacity of the network (left to right). This eliminates underfitting as a possible explanation for why small models perform worse in Figure6 (b). See Figure8 for validation loss and G.1 for experimental details. §

18 Published as a conference paper at ICLR 2019

10 0.42 10 0.5 Accuracy log10(Distance) log2(#Channels) log2(#Channels)

0 0.1 0 −6.5

MC-CNN-GP with pooling 0 log2(#Samples) 10 0log2(#Samples) 10

10 0.42 10 0.5 Accuracy log10(Distance) log2(#Channels) log2(#Channels) MC-LCN-GP 0 0.1 0 −6.5

0 log2(#Samples) 10 0log2(#Samples) 10

10 0.42 10 0.5 Accuracy log10(Distance) log2(#Channels) log2(#Channels)

0 0.1 0 −6.5

MC-LCN-GP with Pooling 0 log2(#Samples) 10 0log2(#Samples) 10

10 0.42 10 0.5 Accuracy log10(Distance) log2(#Channels) log2(#Channels) MC-FCN-GP 0 0.1 0 −6.5

0 log2(#Samples) 10 0log2(#Samples) 10

Figure 10: Convergence of NN-GP Monte Carlo estimates. As in Figure5, validation accuracy (left) of MC-GPs increases with n M (i.e. width times number of samples), while the distance (right) to the the respective exact GP× kernel (or the best available estimate in the case of CNN-GP with pooling, top row) decreases. We remark that when using shared weights, convergence is slower as smaller number of independent random parameters are being used. See G.2 for experimental details. §

19 Published as a conference paper at ICLR 2019

Depth 1 10 100 1000 Phase boundary → 2 0.38 CNN-GP Accuracy Bias variance 0.1 FCN-GP 0 0.1Weight variance 5 Figure 11: Large depth performance of NN-GPs. Validation accuracy of CNN- and FCN-GPs as 2 2 a function of weight (σω, horizontal axis) and bias (σb , vertical axis) variances. As predicted in B, the regions of good performance concentrate around the critical line (phase boundary, right) as§ the depth increases (left to right). All plots share common axes ranges and employ the erf nonlinearity. See G.2 for experimental details. §

BRELATIONSHIP TO DEEP SIGNAL PROPAGATION

The recurrence relation linking the GP kernel at layer l+1 to that of layer l following from Equation9 (i.e. Kl+1 = ( ) Kl ) is precisely the covariance map examined in a series of related papers on signal∞ propagationC ◦ A (Xiao∞ et al., 2018; Poole et al., 2016; Schoenholz et al., 2017; Lee et al., 2018) (modulo notational differences; denoted as F , or e.g. ? in Xiao et al.(2018)). In those works, the action of this map on hidden-state covarianceC matricesA C was interpreted as defining a dynamical system whose large-depth behavior informs aspects of trainability. In particular, as l+1 l l l , K = ( ) K K K∗ , i.e. the covariance approaches a fixed point K∗ . The→ convergence∞ ∞ toC a◦ fixedA point∞ ≈ is problematic∞ ≡ ∞ for learning because the hidden states no longer∞ contain information that can distinguish different pairs of inputs. It is similarly problematic for GPs, as the kernel becomes pathological as it approaches a fixed point. Precisely, in both chaotic and ordered regimes, outputs of the GP become asymptically identically correlated. Either of these scenarios captures no information about the training data in the kernel and makes learning infeasible. This problem can be ameliorated by judicious hyperparameter selection, which can reduce the rate of exponential convergence to the fixed point. For hyperpameters chosen on a critical line separating two untrainable phases, the convergence rates slow to polynomial, and very deep networks can be trained, and inference with deep NN-GP kernels can be performed – see Figure 11.

CSTRIDEDCONVOLUTIONS, AVERAGE POOLING IN INTERMEDIATE LAYERS, HIGHERDIMENSIONS

Our analysis in the main text can easily be extended to cover average pooling and strided convolu- tions (applied before the pointwise nonlinearity). Recall that conditioned on Kl the pre-activation l d d d z (x) R 1 is a zero-mean multivariate Gaussian. Let B R 2× 1 denote a linear operator. Then j ∈ ∈ Bzl (x) Rd2 is a zero-mean Gaussian, and the covariance is j ∈ l l T l l l T l T E ωl,bl Bzj (x) Bzj (x0) K = BE ωl,bl zj (x) zj (x0) K B . (20) { } { } h   i h i Bzl Kl One can easily see that j j are i.i.d. multivariate Gaussian as well.  Strided convolution. Strided convolution is equivalent to a non-strided convolution composed with subsampling. Let s N denote size of the stride. Then the strided convolution is equivalent to choosing B as follows:∈ B = δ(is j) for i 0, 1,... (d 1) . ij − ∈ { 2 − } Average pooling. Average pooling with stride s and window size ws is equivalent to choosing B = 1/ws for i = 0, 1,... (d 1) and j = is, . . . , (is + ws 1). ij 2 − −

20 Published as a conference paper at ICLR 2019

ND convolutions. Note that our analysis in the main text (1D) easily extends to higher-dimensional convolutions by replacing integer pixel indices and sizes d, α, β with tuples (see also Figure2). In Equation2 β values would have to span the hypercube [ k]N = k, . . . , k N in the pre- activation definition. Similarly, in 3 the normalizing factor d±(d2) should{− be the product} (squared) § of its entries, and summations over α, β should span the [d0] [dN ] hypercube as well. The definition of the kernel propagation operator in Equation×5 will · · · × remain exactly the same, so long as β is summed over the hypercube, and theA variance weights remain respectively normalized β vβ = 1. P DREVIEWOFEXACT BAYESIAN REGRESSION WITH GPS

Our discussion in the paper has focused on model priors. A crucial benefit we derive by mapping to a GP is that Bayesian inference is straightforward to implement and can be done exactly for regression (Rasmussen & Williams, 2006, chapter 2), requiring only simple linear algebra. Let T denote training inputs x1, ..., x , t = t1, ..., t training targets, and collectively for theX training set. The integral over|X the | posterior can be|X | evaluated analytically to give a posteriorD  2 predictive distribution on a test point y which is Gaussian, (z∗ , y∗) µ , σ , with ∗ |D ∼ N ∗ ∗ T 1 2 − µ = (x∗, ) ( , ) + σε I t,  (21) ∗ K X K X X |X | T 1 2 2 − σ = (x∗, x∗) (x∗, ) ( , ) + σε I (x∗, ) . (22) ∗ K − K X K X X |X | K X We use the shorthand ( , ) to denote the matrix formed by evaluating the GP co- K X X |X | × |X |  variance on the training inputs, and likewise (x∗, ) is a -length vector formed from the co- variance between the test input and training inputs.K Computationally,X |X | the costly step in GP posterior predictions comes from the matrix inversion, which in all experiments were carried out exactly, and typically scales as 3 (though algorithms scaling as 2.4 exist for sufficiently O |X | O |X | large matrices). Nonetheless, there is a broad literature on approximate Bayesian inference with GPs which can be utilized for efficient implementation (Rasmussen & Williams, 2006, chapter 8); (Quinonero-Candela˜ & Rasmussen, 2005; Titsias, 2009).

EEQUIVALENCE BETWEEN RANDOMLY INITIALIZED NNSAND GPS

In this section, we present two different approaches, the sequential limit ( E.3) and simultaneous limit ( E.4), to illustrate the relationship between many-channels Bayesian CNNs§ and GPs. § Sequential limit ( E.3) involves taking the infinite channel limit in hidden layers in a sequence, starting from bottom§ (closest to inputs) layers and going upwards (to the outputs), i.e. n1 , . . . , nL . Note that this approach in fact only gives intuition into construction a GP using→ a NN∞ architecture→ ∞ to define its covariance, and does not provide guarantees on actual convergence of large but finite Bayesian CNNs to GPs (which is of most practical interest), nor does it guarantee the existence of the specified GP on a given probability space. However, it has the following benefits:

1. Weak assumptions on the NN activation function φ and on the distribution of the NN pa- rameters. 2. The arguments can be easily extended to more complicated network architectures, e.g. architectures with max pooling, dropout, etc. 3. A straightforward and intuitive way to compute the covariance of the Gaussian process without diving into mathematical details.

Simultaneous limit ( E.4) considers growing the number of channels in hidden layers uniformly, i.e. min n1, . . . , nL § . This approach establishes convergence of finite channel Bayesian CNNs to GPs and is thus a→ more ∞ practically relevant result. However, it makes stronger assumptions, and the proof is more involved. We highlight that the GPs obtained by the two approaches are identical. In both sections, we only provide the arguments for CNNs. It is straightforward (and in fact simpler) to extend them to LCNs and FCNs. Indeed, an FCN is a particular case of a CNN where the

21 Published as a conference paper at ICLR 2019

inputs and filters have singular spatial dimensions (d = 1, k = 0). For LCNs, the proof goes LCN LCN through in an identical fashion if we replace with defined as (K) (x, x0) A A A α,α0 ≡ δα,α [ (K)] (x, x0). 0 A α,α0  

E.1 SETUP

Probability space. Let P be a collection of countably many mutually independent random vari- ables (R.V.s) defined on a probability space (Ω, ,P ), where is a product Borel σ-algebra and P is the probability measure. Here P W B F H is the collectionF of parameters used to define neural networks: ≡ ∪ ∪

l l l (i) Weights. W = l W and W = ωij,β : i, j N, β [ k] , where [ k] [ k, k] Z. ∈N ∈ ∈ ± ± ≡ − ∩ We assume ωl are i.i.d. R.V.s withn mean zero and finite varianceo 0 < σ2 < (note the ij,βS ω ∞ lack of scaling by input dimensionality, compensated for later in Equations 29 and 42; see also similar notation used in Matthews et al.(2018a)). When l = 0, we further assume they are Gaussian distributed. (ii) Biases. B = Bl and Bl = bl : j N . We assume bl are i.i.d. Gaussian with mean l N j ∈ j zero and variance∈0 σ2 < . S ≤ b ∞  (iii) Place-holder. H is a place-holder to store extra (if needed) R.V.s , e.g. parameters coming from the final dense layer.

n0 d 0 Inputs. We will consider a fixed R × to denote the inputs, with input channel count n , number of pixels d. Assume x = 0Xfor ⊆ all x and , the cardinality of the inputs, is finite. 6 ∈ X |X | However, our results can be straightforwardly extended to a countably-infinite input indexing spaces for certain topologies via an argument presented in Matthews et al.(2018a, section 2.2), allowing toX infer weak convergence on from convergence on any finite subset (which is the case we consider in this text; see also BillingsleyX (1999, page 19) for details). For this reason, as in Matthews et al. (2018a), weak convergence of countably-infinite stochastic processes will be considered with respect to the topology generated by the following metric:

∞ k N ν (s, s0) 2− min (1, sk s0 ) , s, s0 R . ≡ | − k| ∀ ∈ kX=1 Notation, shapes, and indexing. We adopt the notation, shape, and indexing convention similar to 2.1, which the reader is encouraged to review. We emphasize that whenever an index is omitted, the§ variable is assumed to contain all possible entries along the respective dimension (e.g. whole if x is omitted, or all nl channels if the channel i is omitted). X

E.2 PRELIMINARY

We will use the following well-known theorem. m Theorem E.1. Let X, Xn n be R.V.s in R . The following are equivalent: { } ∈N

(i) X D X (converges in distribution / converges weakly), n → (ii) (Portmanteau Theorem) For all bounded continuous function f : Rm R, → lim E [f(Xn)] = E [f(X)] , (23) n →∞

T it Xn (iii)(L evy’s´ Continuity Theorem) The characteristic functions of Xn, i.e. E e , converge to those of X pointwise, i.e. for all t Rm, h i ∈ itT X itT X lim E e n = E e , (24) n →∞ h i h i where i denotes the imaginary unit.

22 Published as a conference paper at ICLR 2019

Using the equivalence between (i) and (iii), it is straightforward to show that Theorem E.2 (Cramer-Wold, (Billingsley(1995), Theorem 29.4)) .

T T m Xn D X a Xn D a X for all a R . (25) → ⇐⇒ → ∈ In particular,

T T m Xn D (0, Σ) a Xn D (0, a Σa) for all a R . (26) → N ⇐⇒ → N ∈

E.3 SEQUENTIALLIMIT

In this section, we give intuition into constructing a Gaussian process using an infinite channel CNN with the limits taken sequentially. Informally, our argument amounts to showing that if the l 1 l inputs z − to the layer l are a multivariate Gaussian with covariance K Inl , then it’s out- l A ⊗ l+1 puts z converge in distribution to a multivariate Gaussian with covariance K I l+1 as A ⊗ n nl . This “allows” us to sequentially replace z1, z2, . . . , zL with their respective limiting  Gaussian→ ∞ R.V.s as we take n1 , n2 , . . . , nL . However, since each convergence only holds in distribution and→ does ∞ not guarantee→ ∞ the existence→ ∞ of the necessary GPs on a given probability space, what follows merely gives intuition into understanding the relationship between wide Bayesian neural networks and GPs (contrary to E.4, which presents a rigorous convergence proof for min n1, . . . , nL ). § → ∞ Let Ψ1 denote the space of functions with a uniformly bounded second moment, i.e. having 2 C2 (R, φ) sup Ex (0,r) φ(x) < for every R 1 . (27) ≡ 1/R r R ∼N | | ∞ ≥ ≤ ≤ Let N∗ = N 0 . We construct Gaussian processes with the following iterative in l (from 0 to L) procedure: \{ }

l, (i) If l > 0, define random variables (a GP) zi ∞ in (Ω, ,P ) as i.i.d. Gaussian with i N∗ F mean zero and covariance Kl , so thatn they areo ∈ also independent from any future events, i.e. independent from the σA-algebra∞ generated by all R.V.s with layer index greater than l. Here we implicitly assume the probability  space (Ω, ,P ) has enough capacity to fit in the R.V.s generated by the above procedure, if not we willF extend the sample space Ω using product measures.

(ii) For n N∗, define ∈ xi,α l = 0 l yi,α(x) , (28) ≡  l 1,  φ zi,α− ∞(x) l > 0 1  l l l 0 v ω y (x) + b l = 0  √n0 j [n ] β [ k] √ β ij,β j,α+β i l,n  ∈ ∈ ± zi,α(x) , (29) ≡  1 P P l l l √n j [n] β [ k] √vβωij,βyj,α+β(x) + bi l > 0  ∈ ∈ ± where  P P [n] 1, . . . , n and [ k] k, . . . , 0, . . . k . (30) ≡ { } ± ≡ {− } l,n d (iii) Prove that for any finite m 1, zi R|X | converges in distribution to a multivariate ≥ i [m] ⊆ ∈ l l normal with mean zero and covarianceh i K Im as n , where K is defined ∞ ∞ A ⊗ → ∞ l,n identically to Equation9. As a consequence, per our remark in E.1, zi converges § i N∗ l, n o ∈ weakly to the GP zi ∞ . i N∗ n o ∈ In what follows, we use the central limit theorem to prove (iii).

23 Published as a conference paper at ICLR 2019

l,n d Theorem E.3. If φ Ψ1, then for every l 0 and every m 1, zi R|X | converges in ∈ ≥ ≥ i [m] ⊆ ∈l distribution to a multivariate normal with mean zero and covarianceh iK Im. A ∞ ⊗ Proof. We proceed by induction. This is obvious in the base case l = 0, since the weights and biases are assumed to be independent Gaussian. Now we assume the theorem holds for l 1. This implies l l 1, − yi = φ zi− ∞ are i.i.d. random variables. i N∗ ∈ i N∗ n  o ∈  T m Choose a vector a [ai(x)]i [m], x R |X |. Then ≡ ∈ ∈X ∈

l,n 1 l l l ai(x)z (x) = ai(x) √vβω y (x) + b (31) i √n  ij,β j,α+β i i [m] i [m] j [n] β [ k] xX∈ xX∈ X∈ ∈X± ∈X ∈X  

1 l l l  ai(x) √vβω y (x) + ai(x)b (32) √n ij,β j,α+β i j [n] i [m] β [ k] i [m]  X∈ xX∈ ∈X±  xX∈  ∈X  ∈X 1  u + q. (33) ≡ √n j j [n] X∈ It is not difficult to see that uj are i.i.d. and q is Gaussian. Then we can use the central limit theorem to conclude that the above converges in distribution to a Gaussian, once we verify the second moment l of uj is finite. Using the fact that ωij,β is a collection of independent R.V.s, and integrating i,β over these R.V.s first, we get n o 2 2 l l Eu =E ai(x) √vβω y (x) (34) j  ij,β j,α+β  i [m] , x β [ k] ∈ X ∈X ∈X±  2  2 l =σω vβE ai(x)yj,α+β(x) (35) i [m] β [ k] "x # X∈ ∈X± X∈X T ¯ l = ai K ai (36) A ∞ i [m] X∈  where by ¯ we denote the linear transformation part of from Equation5, i.e. without the A 2 A A translation term σb : ¯ 2 (K) (x, x0) σω vβ [K]α+β,α +β (x, x0) . (37) A α,α0 ≡ 0 β   X To prove finiteness of the second moment in Equation 36, it is sufficient to show that all the diagonal l terms of K are finite. This easily follows from the assumption φ Ψ1 and the definition of ∞ ∈ Kl = ( )l K0 . Together with the distribution of v (whose covariance is straightforward ∞ C ◦ A l,n to compute), the joint distribution of zi converges weakly to a mean zero Gaussian with  i [m] l ∈ covariance matrix K Im by Theoremh i E.1 (Equation 26). A ∞ ⊗ Remark E.1. The results  of Theorem E.3 can be strengthened / extended in many directions. (1) The same result for a countably-infinite input index set follows immediately according to the respective remark in E.1. X § (2) The same analysis carries over (and the covariance matrix can be computed without much extra effort) if we stack a channel-wise deterministic affine transform after the convolution operator. Note that average pooling (not max pooling, which is not affine), global average pooling and the convolutional striding are particular examples of such affine tranforms. Moreover, valid padding (i.e. no padding) convolution can be regarded as a subsampling operator (a linear projection) composed with the regular circular padding.

24 Published as a conference paper at ICLR 2019

(3) The same analysis applies to max pooling, but computing the covariance may require non-trivial effort. Let m denote the max pooling operator and assume it is applied right after the activation l l 1, l function. The assumption yi = φ(zi− ∞) are i.i.d. implies m yi are i [n] i [n] i [n] ∈ ∈ ∈ also i.i.d. Then we can proceed exactlyn as above excepto for verifying the finiteness  of second l moment of m(yi) with the following trivial estimate:

l 2 l 2 E max y E y (38) i,α α [s] ≤ i,α ∈ α [s]  X∈

where s is the window size of the max pooling. l In general, one can stack a channel-wise deterministic operator op on yi so long as the second moment of op yl is finite. One can also stack a stochastic operator (e.g. dropout), so long i  as the outputs are still channel-wisely i.i.d. and have finite second moments.  

E.4 SIMULTANEOUS LIMIT

In this section, we present a sufficient condition on the activation function φ so that the neural net- works converge to a Gaussian process as all the widths approach infinity simultaneously. Precisely, let t N and for each l 0, let nl : N N be the width function at layer l (by convention n0(t)∈ = n0 is constant). We≥ are interested in→ the simultaneous limit nl = nl(t) as t , i.e., for any fixed L 1 → ∞ → ∞ ≥ min n1(t), . . . , nL(t) . (39) t −−−→→∞ ∞ Define a sequence of finite channel CNNs as follows:

xi,α, l = 0 yl,t (x) , (40) i,α ≡  l 1,t  φ zi,α− (x) , l > 0   (41)  1 l l,t l √ 0 j [n0] β [ k] √vβωij,βyj,α+β(x) + bi , l = 0 l,t n ∈ ∈ ± zi,α(x)  . (42) ≡ P P l l,t l  j [nl(t)] β [ k] √vβωij,βyj,α+β(x) + bi , l > 0 ∈ ∈ ±  P P This network induces a sequence of covariance matrices Kl (which are R.V.s): for l 0 and t 0, t ≥ ≥ for x, x0 ∈ X nl(t) l 1 l,t l,t Kt (x, x0) yi,α(x)yi,α (x0). (43) α,α0 ≡ nl(t) 0 i=1   X We make an extra assumption on the parameters. Assumption: all R.V.s in W are Gaussian distributed.

Notation. Let PSDm denote the set of m m positive semi-definite matrices and for R 1, define × ≥ PSD (R) Σ PSD : 1/R Σ R for 1 α m . (44) m ≡ { ∈ m ≤ α,α ≤ ≤ ≤ } Further let : PSD2 R be a function given by T∞ → (Σ) E(x,y) (0,Σ) [φ(x)φ(y)] , (45) T∞ ≡ ∼N and C (φ, R) (may equal ) denotes the uniform upper bound for the k-th moment k ∞ k Ck (φ, R) sup Ex (0,r) φ(x) . (46) ≡ 1/R r R ∼N | | ≤ ≤ Let Ψ denotes the space of measurable functions φ with the following properties:

25 Published as a conference paper at ICLR 2019

1. Uniformly bounded second moment: for every R 1,C2 (φ, R) < . ≥ ∞ 2. Lipschitz continuity: for every R 1, there exists β = β (φ, R) > 0 such that for all ≥ Σ, Σ0 PSD (R), ∈ 2 (Σ) (Σ0) β Σ Σ0 ; (47) |T∞ − T∞ | ≤ k − k∞ 3. Uniform convergence in probability: for every R 1 and every ε > 0 there exists ≥ a positive sequence ρn(φ, ε, R) with ρn(φ, ε, R) 0 as n such that for every Σ PSD (R) and any (x , y ) n i.i.d. (0→, Σ) → ∞ ∈ 2 { i i }i=1 ∼ N 1 n P φ (xi) φ (yi) (Σ) > ε ρn (φ, ε, R) . (48) n − T∞ ! ≤ i=1 X

We will also use Ψ1, Ψ2 and Ψ 3 to denote the spaces of measurable functions φ satisfying properties 1, 2, and 3, respectively. It is not difficult to see that for every i, Ψi is a vector space, and so is Ψ = Ψ . ∩i i Finally, we say that a function f : R R is exponentially bounded if there exist a, b > 0 such that → b x f(x) ae | | a.e. (almost everywhere) (49) | | ≤

We now prove our main result presented in 2.2.3 through the following three theorems. § Theorem E.4. If φ is absolutely continuous and φ0 is exponentially bounded then φ Ψ. ∈ l P l Theorem E.5. If φ Ψ, then for l 0, Kt K . ∈ ≥ → ∞ l P l l,t Theorem E.6. If for l 0, Kt K and m 1, the joint distribution of zj converges ≥ → ∞ ≥ j [m] ∈ l in distribution to a multivariate normal distribution with mean zero and covarianceh i K Im. A ∞ ⊗ The proofs of Theorems E.4, E.5, and E.6 can be found in E.7, E.6, and E.5 respectively.  The proof of Theorem E.5 is slightly more technical and we will§ borrow§ some ideas§ from Daniely et al. (2016).

E.5 PROOFOF THEOREM E.6

m Per Equation 26, it suffices to prove that for any vector [ai(x)]i [m], x R |X |, ∈ ∈X ∈

l,t D T l ai(x)zi (x) 0, ai(x) K ai(x) . (50) −→ N  A ∞  i [m] i [m], x xX∈ ∈ X∈X  ∈X   Indeed, the characteristic function

i l,t E exp ai(x)z (x) (51)  i  i [m], x ∈ X∈X   i 1 l l,t l =E exp  ai(x) √vβωij,βy (x) + bi  . (52)  nl(t) j,α+β  i [m] j [nl(t)] β [ k]  xX∈ ∈X ∈X±   ∈X    p  l,t Note that conditioned on yj , the exponent in the above expression is just a linear com- j [nl(t)] ∈ n o l l bination of independent Gaussian R.V.s ωij,β, bi , which is also a Gaussian. Integrating out these n o 26 Published as a conference paper at ICLR 2019

R.V.s using the formula of the characteristic function of a Gaussian distribution yields

i l,t E exp ai(x)z (x) (53)  i  i [m], x ∈ X∈X   1 T l = E exp a K ai (54) −2 i A t  i [m] X∈    1 T l exp ai K ai as t , (55) −→ −2 A ∞  → ∞ i [m] X∈    l P l where we have used Kt K and Lipschitz continuity of the respective function in the vicinity of Kl in the last step. Therefore,−→ ∞ Equation 50 is true by Theorem E.1 (iii). As in E.3, the same result for∞ a countably-infinite input index set follows immediately according to the§ respective remark in E.1. X §

Remark E.2. We briefly comment how to handle the cases when stacking an average pooling, a subsampling or a dense layer after flattening the activations in the last layer.

1 d (i) Global Average pooling / subsampling. Let B R × be any deterministic linear functional d ∈ defined on R . The fact that

l P l Kt K (56) −→ ∞

l,t implies that the empirical covariance of Byj n o T T 1 l,t l,t P l B⊗|X |yj B⊗|X |yj B⊗|X |K B⊗|X | (57) nl(t) −→ ∞ j nl(t) ∈X    

1 d where B⊗|X | R × |X |, copies of B. Invoking the same “characteristic function” argu- ments as above,∈ it is not difficult|X | to show that stacking a dense layer (assuming the weights and 2 2 biases are drawn from i.i.d. Gaussian with mean zero and variances σω and σb , and are prop- l,t erly normalized) on top of Byj the outputs are i.i.d. Gaussian with mean zero and covari- 2 l nT o2 1 1 1 d 1 d ance σωB⊗|X |K B⊗|X | + σb . Taking B = d ,..., d R × or B = eα R × implies the result∞ of global average pooling ( 3.2.1, Equation∈ 17), or subsampling∈ ( 3.2.2, Equation 18).  §  § 

ωl (ii) Vectorization and a dense layer. Let ij,α i [m],j [nl(t)],α [d] be the weights of the dense l ∈ ∈ ∈ layer, ωij,α represents the weight connecting the α-th pixel of the j-channel to the i-th output. Note that the range of α is [d] not [ k] because there is no weight sharing. Define the outputs to be ±

1 l l,t l fi(x) = ωij,αy (x) + bi (58) nl(t)d j,α α [d] j [nl(t)] X∈ ∈X p m Now let [ai(x)]i [m],x R |X | and compute the characteristic function of ∈ ∈X ∈

ai(x)fi(x). (59) i,x X

27 Published as a conference paper at ICLR 2019

Using the fact Eωij,αωi0j0,α0 = 0 unless (ij, α) = (i0j0, α0) and integrating out the R.V.s of the dense layer, the characteristic function is equal to

1 1 2 l,t l,t 2 E exp  ai(x)ai(x0)  σ y (x)y (x0) + σ  (60) 2 nl(t)d ω j,α j,α b − l i [m] x,x0 j [n (t)]  X∈ X∈X  X    ∈α [d]    ∈     1 2 l 2 = E exp ai(x)ai(x0)tr¯ σ K (x, x0) + σ (61) −2 ω t b  i [m] x,x X∈ X0∈X    1 2 l 2 exp ai(x)ai(x0)tr¯ σωK (x, x0) + σb , (62) −→ −2 ∞  i [m] x,x X∈ X0∈X  where tr¯ denotes the mean trace operator acting on the pixel by pixel matrix, i.e. the functional computing the mean of the diagonal terms of the pixel by pixel matrix. Therefore [fi]i [m] 2 ¯ l 2 ∈ converges weakly to a mean zero Gaussian with covariance σωtr K (x, x0) + σb ∞ x,x0 ⊗ I in the case of vectorization ( 3.1). ∈X m §   

E.6 PROOFOF THEOREM E.5

L L d d Proof. We recall Kt and K to be random matrices in R |X |× |X |, and we will ∞ L P L prove convergence Kt K with respect to , the pointwise `∞-norm −−−→t ∞ k · k∞ →∞ L (i.e. K = maxx,x0,α,α0 Kα,α0 (x, x0) ). Note that due to finite dimensionality of K , con- vergencek k w.r.t.∞ all other norms| follows. | 2 We first note that the affine transform is σω-Lipschitz and property2 of Ψ implies that the operator is β-Lipschitz (both w.r.t. w.r.t.A ). Indeed, if we consider C k · k∞ [K] (x, x)[K] (x, x ) Σ α,α α,α0 0 , (63) ≡ [K]α0,α (x0, x)[K]α0,α0 (x0, x0)   2 then [ (K)]α,α0 (x, x0) = (Σ). Thus is σωβ-Lipschitz. C T∞ C ◦ A l P l We now prove the theorem by induction. Assume Kt K as t (obvious for l = 0). → ∞ → ∞ l l l We first remark that K PSD d, since PSD d E Kt K and PSD d is closed. Moreover, due to Equation∞ ∈4, |Xnecessarily | preserves|X | 3 positive semi-definiteness,→ ∞ |X and | therefore l A   K PSD d as well. A ∞ ∈ |X | ε l Now let ε > 0 be sufficiently small so that the 2β -neighborhood of (K ) is contained in ∞ l A 9 PSD d(R), where we take R to be large enough for K to be an interior point of PSD d(R). |X | ∞ |X | Since l+1 l+1 l+1 l l l+1 K Kt K Kt + Kt Kt (64) ∞ − ∞ ≤ ∞ − C ◦ A ∞ C ◦ A − ∞ l l l l+1 = K  Kt +  Kt Kt , (65) C ◦ A ∞ − C ◦ A ∞ C ◦ A − ∞ l+1 P l+1    to prove Kt K , it suffices to show that for every δ > 0, there is a t∗ such that for all t > t∗, → ∞ l l ε l l+1 ε P K Kt > + P Kt Kt > < δ. (66) C ◦ A ∞ − C ◦ A ∞ 2 C ◦ A − ∞ 2    l   l   By our induction assumption, there is a t such that for all t > t

l l ε δ P K Kt > < . (67) ∞ − 2σ2 β 3  ∞ ω  9 l Such R always exists, because the diagonal terms of K∞ are always non-zero. See (Long & Sedghi, 2019, Lemma 4) for proof.

28 Published as a conference paper at ICLR 2019

Since is σ2 β-Lipschitz, then C ◦ A ω l l ε δ P K Kt > < . (68) C ◦ A ∞ − C ◦ A ∞ 2 3     To bound the second term in Equation 66, let U (t) denote the event l U (t) Kt PSD d (R) (69) ≡ A ∈ |X | and U (t)c its complement. For all t > tlits probability is

c l l ε P (U (t) ) < P K Kt > [assumption on small ε] (70) A ∞ − A 2β  ∞    2 l l ε 2 < P σω K Kt > is σω-Lipshitz (71) ∞ − 2β A  ∞    l l ε δ = P K Kt > < . [Equation 67] (72) ∞ − 2σ2 β 3  ∞ ω 

Finally, denote

l l+1 ε [V ( t )]α,α (x, x0) Kt (x, x0) Kt (x, x0) > , (73) 0 ≡ C ◦ A α,α0 − α,α0 2 n o     i.e. the event that the inequality inside the braces holds. The fact

l l+1 ε c K K > U (t) [V (t)] (x, x0) U (t) C ◦ A t − t 2 ⊆  α,α0  ∞ x,x ,α,α n  o [ [0 0 \   implies

l l+1 ε δ 2 2 P K K (t) > + d max P [V (t)] (x, x0) U (t) , t α,α0 C ◦ A − ∞ 2 ≤ 3 |X | x,x0,α,α0 ∩ n  o  (74)

2 where the maximum is taken over all (x, x0, α, α0) [d]. ∈ X × Consider a fixed κ PSD d, and define ∈ |X |

[ (κ)]α,α (x, x)[ (κ)]α,α (x, x0) Σ(κ, α, α0, x, x0) A A 0 , (75) ≡ [ (κ)]α0,α (x0, x)[ (κ)]α0,α0 (x0, x0)  A A  a deterministic matrix in PSD2. Then

[ (κ)] (x, x0) = (Σ (κ, α, α0, x, x0)) , (76) C ◦ A α,α0 T∞ l and, conditioned on Kt,

nl+1(t) l+1 l 1 l,t l,t Kt Kt = κ (x, x0) = φ zi,α (x) φ zi,α (x0) , (77) | α,α0 nl+1(t) 0 i=1   X     l,t l,t l where zi,α(x), zi,α (x0) Kt = κ are i.i.d. (0, Σ(κ, α, α0, x, x0)). 0 i [nl+1(t)] ∼ N n  o ∈ Then if Σ(κ, α, α0, x, x0) PSD (R) we can apply property3 of Ψ to conclude that: ∈ 2 l ε P [V (t)] (x, x0) U (t) K = κ < ρnl+1(t) φ, R, . (78) α,α0 ∩ t 2     

However, if Σ(κ, α, α0, x, x0) PSD2 (R), then necessarily (κ) PSD d (R) (since 6∈ l A 6∈ |X | Σ(κ, α, α0, x, x0) PSD2), ensuring that P U (t) Kt = κ = 0. Therefore Equation 78 holds ∈ 2| 2 for any κ PSD d, and for any (x, x0, α, α0) [d] . ∈ |X | ∈ X × 

29 Published as a conference paper at ICLR 2019

ε We further remark that ρnl+1(t) φ, R, 2 is deterministic and does not depend on (κ, x, x0, α, α0). Marginalizing out Kl and maximizing over (x, x , α, α ) in Equation 78 we conclude that t  0 0 ε max P [V (t)] (x, x0) U (t) < ρ l+1 φ, R, . (79) α,α0 n (t) x,x0,α,α0 ∩ 2     Since ρ φ, R, ε 0 as n , there exists n such that for any nl+1(t) n, n 2 → → ∞ ≥  ε δ max P [V (t)] (x, x0) U (t) < ρnl+1(t) φ, R, , (80) α,α0 2 2 x,x0,α,α0 ∩ 2 ≤ 3 d     |X | and, substituting this bound in Equation 74,

l l+1 ε 2δ P Kt Kt > U (t) < . (81) C ◦ A − ∞ 2 ∩ 3 n  o  Therefore we just need to choose tl+1 > tl so that nl+1(t) n for all t > tl+1. ≥ Remark E.3. We list some directions to strengthen / extend the results of Theorem E.5 (and thus Theorem E.6) using the above framework.

1. Consider stacking a deterministic channel-wise linear operator right after the convolutional layer. Again, strided convolution, convolution with no (valid) padding and (non-global) d0 d average pooling are particular examples of this category. Let B R × denote a linear operator. Then the recurrent formula between two consecutive layers∈ is Kl+1 = (Kl ) (82) ∞ C ◦ B ◦ A ∞ where is the linear operator on the covariance matrix induced by B. Conditioned on l B Kt, since the outputs after applying the linear operator B are still i.i.d. Gaussian (and the property3 is applicable), the analysis in the above proof can carry over with replaced by . A B ◦ A 2. More generally, one may consider inserting an operator op (e.g. max-pooling, dropout and more interestingly, normalization) in some hidden layer. 3. Gaussian prior on weights and biases might be relaxed to sub-Gaussian.

E.7 PROOFOF THEOREM E.4

Note that absolutely continuous exponentially bounded functions contain all polynomials, and are closed under multiplication and integration in the sense that for any constant C the function x φ(t)dt + C (83) Z0 is also exponentially bounded. Theorem E.4 is a consequence of the following lemma. Lemma E.7. The following is true:

1. for k 1, C (φ, R) < if φ is exponentially bounded. ≥ k ∞ 2. φ Ψ if φ0 exists a.e. and is exponentially bounded. ∈ 2 3. φ Ψ if C (φ, R) < . ∈ 3 4 ∞

Indeed, if φ is absolutely continuous and φ0 is exponentially bounded, then φ is also exponentially bounded. By the above lemma, φ Ψ. ∈

b x Proof of Lemma E.7. 1. We prove the first statement. Assume φ(x) ae | |. | | ≤ k k k √rb k k2b2r/2 Ex (0,r) φ (x) = Ex (0,1) φ √rx Ex (0,1) ae |X | 2a e . (84) ∼N | | ∼N ≤ ∼N ≤



30 Published as a conference paper at ICLR 2019

Thus k k k2b2R/2 Ck (φ, R) = sup Ex (0,r) φ (x) 2a e . (85) 1/R r R ∼N | | ≤ ≤ ≤

2. To prove the second statement, let Σ, Σ0 PSD2(R) and define A (similarly for A0): ∈ √Σ11 0 2 A Σ Σ22Σ11 Σ . (86) ≡ 12 − 12 √Σ11 Σ11 ! q T T Then AA = Σ (and A0A0 = Σ0). Let

A(t) (1 t)A + tA0, t [0, 1] (87) ≡ − ∈ and f(w) φ(x)φ(y) where w (x, y)T . (88) ≡ ≡ Since φ0 is exponentially bounded, φ is also exponentially bounded due to being absolutely contin- uous. In addition, p ( w ) f(w) is exponentially bounded for any polynomial p ( w ). k k2 k∇ k2 k k2 Applying the Mean Value Theorem (we use the notation . to hide the dependence on R and other absolute constants)

1 2 (Σ) (Σ0) = (f (Aw) f (A0w)) exp w /2 dw (89) |T∞ − T∞ | 2π − − k k2 Z   1 2 = ( f (A(t)w)) ((A0 A) w) exp w 2 /2 dtdw (90) 2π [0,1] ∇ − − k k ZZ  

2 . (A0 A) w 2 f (A(t)w) 2 exp w 2 /2 dwdt (91) [0,1] k − k k∇ k − k k Z Z   2 A0 A op w 2 f (A(t)w) 2 exp w 2 /2 dwdt. (92) ≤ [0,1] k − k k k k∇ k − k k Z Z   Note that the operator norm is bounded by the infinity norm (up to a multiplicity constant) and w 2 f (A(t)w) 2 is exponentially bounded. There is a constant a (hidden in .) and b such that thek k abovek∇ is boundedk by

2 A0 A exp (b A(t) w 2) exp w 2 /2 dwdt (93) [0,1] k − k∞ k k∞ k k − k k Z Z   √ 2 . A0 A exp b R w 2 w 2 /2 dwdt (94) k − k∞ [0,1] k k − k k Z Z   . A0 A (95) k − k∞ . Σ0 Σ . (96) k − k∞ Here we have applied the facts

A0 A . Σ Σ0 and A(t) √R. (97) k − k∞ k − k∞ k k∞ ≤ 3. Chebyshev’s inequality implies 1 n P φ (xi) φ (yi) (Σ) > ε (98) n − T∞ ! i=1 X 1 1 2 Var (φ (xi) φ (yi)) E φ (xi) φ (yi) (99) ≤n2 ≤ n2 | | 1 C (φ, R) 0 as n . (100) ≤n2 4 → → ∞

31 Published as a conference paper at ICLR 2019

Remark E.4. In practice the 1/n decay bound obtained by Chebyshev’s inequality in Equation 100 is often too weak to be useful. However, if φ is linearly bounded, then one can obtain an exponential decay bound via the following concentration inequality: Lemma E.8. If φ(x) a + b x a.e., then there is an absolute constant c > 0 and a constant κ = κ(a, b, R) >|0 such| ≤ that property| | 3 (Equation 48) holds with n2ε2 nε ρ (φ, , R) = 2 exp c min , . (101) n − (2κ)2 2κ    Proof. We postpone the proof of the following claim.

Claim E.9. Assume φ(x) a+b x . Then there is a κ = κ(a, b, R) such that for all Σ PSD2(R) and all p 1, | | ≤ | | ∈ ≥ p 1/p E(x,y) (0,Σ) φ(x)φ(y) κp. (102) ∼N | | ≤  Claim E.9 and the triangle inequality imply

p 1/p E(x,y) (0,Σ) φ(x)φ(y) Eφ(x)φ(y) 2κp. (103) ∼N | − | ≤ We can apply Bernstein-type inequality (Vershynin, 2010, Lemma 5.16) to conclude that there is a c > 0 such that for every Σ PSD (R) and any (x , y ) n i.i.d. (0, Σ) ∈ 2 { i i }i=1 ∼ N 1 n n2ε2 nε P φ (xi) φ (yi) (Σ) > ε 2 exp c min 2 , . (104) n − T∞ ! ≤ − (2κ) 2κ i=1    X It remains to prove Claim E.9. For p 1, ≥ p 1/p 2p 1/2p 2p 1/2p E(x,y) (0,Σ) φ(x)φ(y) Ex (0,Σ11) φ(x) Ey (0,Σ22) φ(y) (105) ∼N | | ≤ ∼N | | ∼N | | 2p 1/2p 2p 1/2p  a + b E x  a + b E y  (106) ≤ | | | | 2     2p 1/2p   a + b√R Eu (0,1) u (107) ≤ ∼N | | 2  2p p 1/2p   a + b√R c0 p (108) ≤ 2  2   a + bc0 √R p (109) ≤ κp.  (110) ≡ We applied Cauchy-Schwarz’ inequality in the first inequality, the triangle inequality in the second one, the fact Σ , Σ R in the third one, absolute moments estimate of standard Gaussian in the 11 22 ≤ fourth one, where c0 is a constant such that

p 1/p Eu (0,1) u c0√p. (111) ∼N | | ≤ 

FGLOSSARY

We use the following shorthands in this work:

1. NN - neural network; 2. CNN - convolutional neural network; 3. LCN - locally connected network, a.k.a. convolutional network without weight sharing; 4. FCN - fully connected network, a.k.a. multilayer perceptron (MLP); 5. GP - Gaussian process; 6. X-GP - a GP equivalent to a Bayesian infinitely wide neural network of architecture X ( 2). §

32 Published as a conference paper at ICLR 2019

7. MC-(X-)-GP - a Monte Carlo estimate ( 4) of the X-GP. § 8. Width, (number of) filters, (number of) channels represent the same property for CNNs and LCNs. 9. Pooling - referring to architectures as “with” or “without pooling” means having a single global average pooling layer (collapsing the spatial dimensions of the activations yL+1) before the final linear FC layer giving the regression outputs zL+1. 10. Invariance and equivariance are always discussed w.r.t. translations in the spatial dimen- sions of the inputs.

GEXPERIMENTAL SETUP

Throughout this work we only consider 3 3 (possibly unshared) convolutional filters with stride 1 and no dilation. × All inputs are normalized to have zero mean and unit variance, i.e. lie on the d dimensional sphere − of radius √d, where d is the total dimensionality of the input. All labels are treated as regression targets with zero mean. i.e. for a single-class classification problem with C classes targets are C dimensional vectors with 1/C and (C 1)/C entries in incorrect and correct class indices respectively.− − − If a subset of a full dataset is considered for computational reasons, it is randomly selected in a balanced fashion, i.e. with each class having an equal number of samples. No data augmentation is used. All experiments were implemented in Tensorflow (Abadi et al., 2016) and executed with the help of Vizier (Golovin et al., 2017). All neural networks are trained using Adam (Kingma & Ba, 2015) minimizing the mean squared error loss.

G.1 MANY-CHANNEL CNNSAND LCNS

Relevant Figures:6 (b),8,9. We use a training and validation subsets of CIFAR10 of sizes 500 and 4000 respectively. All images are bilinearly downsampled to 8 8 pixels. × All models have 3 hidden layers with an erf nonlinearity. No (valid) padding is used. Weight and bias variances are set to σ2 1.7562 and σ2 0.1841, corresponding to the pre- ω ≈ b ≈ activation variance fixed point q∗ = 1 (Poole et al., 2016) for the erf nonlinearity. NN training proceeds for 219 gradient updates, but aborts if no progress on training loss is observed 4 for the last 100 epochs. If the training loss does not reduce by at least 10− for 20 epochs, the learning rate is divided by 10. All computations are done with 32-bit precision. The following NN parameters are considered:10

1. Architecture: CNN or LCN. 2. Pooling: no pooling or a single global average pooling (averaging over spatial dimensions) before the final FC layer. 3. Number of channels: 2k for k from 0 to 12. k 4. Initial learning rate: 10− for k from 0 to 15. k 5. Weight decay: 0 and 10− for k from 0 to 8. 6. Batch size: 10, 25, 50, 100, 200. 10Due to time and memory limitations certain large configurations could not be evaluated. We believe this did not impact the results of this work in a qualitative way.

33 Published as a conference paper at ICLR 2019

For NNs, all models are filtered to only 100%-accurate ones on the training set and then for each configuration of architecture, pooling, number of channels the model with the lowest validation loss is selected among{ the configurations of learning rate, weight} decay, batch size . { } For GPs, the same CNN-GP is plotted against CNN and LCN networks without pooling. For LCN with pooling, inference was done with an appropriately rescaled CNN-GP kernel, i.e. vec 2 2 σb /d + σb , where d is the spatial size of the penultimate layer. For CNNs with pool- K∞ − ing, a Monte Carlo estimate was computed (see 4) with n = 212 filters and M = 26 samples.  § For GP inference, the initial diagonal regularization term applied to the training convariance matrix 10 is 10− ; if the cholesky decompisition fails, the regularization term is increased by a factor of 10 until it either succeeeds or reaches the value of 105, at which point the trial is considered to have failed.

G.2 MONTE CARLO EVALUATION OF INTRACTABLE GPKERNELS

Relevant Figures:5, 10. We use the same setup as in G.1, but training and validation sets of sizes 2000 and 4000 respectively. § For MC-GPs we consider the number of channels n (width in FCN setting) and number of NN instantiations M to accept values of 2k for k from 0 to 10. Kernel distance is computed as: 2 Kn,M ∞ F kK − 2 k , (112) kK∞kF where is substituted with K210,210 for the CNN-GP pooling case (due to impracticality of com- K∞ puting the exact pool). GPs are regularized in the same fashion as in G.1, but the regularization K4∞ 10 § factor starts at 10− and ends at 10 and is multiplied by the mean of the training covariance diag- onal.

G.3 TRANSFORMING A GP OVER SPATIAL LOCATIONS INTO A GP OVER CLASSES

Relevant Figure: 4.

We use the same setup as in §G.2, but rescale the input images to a size of 31×31, so that at depth 15 the spatial dimension collapses to a 1×1 patch if no padding is used (hence the curve of the CNN-GP without padding halting at that depth).

For the MC-CNN-GP with pooling, we use samples of networks with n = 16 filters. Due to computational complexity we only consider depths up to 31 for this architecture. The number of samples M was selected independently for each depth among 2^k for k from 0 to 15 to maximize the validation accuracy on a separate 500-point validation set. This allowed us to avoid poor conditioning of the kernel.

GPs are regularized in the same fashion as in §G.1, but for the MLP-GP the multiplicative factor starts at 10^−4 and ends at 10^10.
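The depth at which the spatial dimension collapses follows from simple arithmetic: each 3×3 valid convolution shrinks every spatial dimension by 2 pixels, e.g.

```python
# Spatial sizes of the activations for 31x31 inputs under 3x3 valid convolutions.
sizes = [31 - 2 * depth for depth in range(16)]
print(sizes[15])  # 1, i.e. the spatial dimension collapses to a 1x1 patch at depth 15
```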

G.4 RELATIONSHIP TO DEEP SIGNAL PROPAGATION

Relevant Figure: 11.

We use training and validation subsets of CIFAR10 of sizes 500 and 1000 respectively. We use the erf nonlinearity. For the CNN-GP, images are zero-padded (same padding) to maintain the spatial shape of the activations as they are propagated through the network.

Weight and bias variances (horizontal axis σ_ω² and vertical axis σ_b² respectively) are sampled from a uniform grid of size 50×50 on the range [0.1, 5] × [0, 2], including the endpoints.

All computations are done with 64-bit precision. GPs are regularized in the same fashion as in §G.1, but the regularization factor is multiplied by the mean of the training covariance diagonal. If the experiment fails for numerical reasons, a validation accuracy of 0.1 (random chance) is reported.
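The 50×50 variance grid above could be built as follows (a sketch, not the authors' code):

```python
import numpy as np

# 50x50 uniform grid of (sigma_w^2, sigma_b^2) pairs, endpoints included.
weight_vars = np.linspace(0.1, 5.0, 50)
bias_vars = np.linspace(0.0, 2.0, 50)
configs = [(w, b) for w in weight_vars for b in bias_vars]
print(len(configs))  # 2500 phase-plane points
```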


G.5 CNN-GP ON FULL DATASETS

Relevant Figures: 6 (a, c), 7 and Table 1.

We use the full training, validation, and test sets of sizes 50000, 10000, and 10000 respectively for MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017a), and 45000, 5000, and 10000 for CIFAR10 (Krizhevsky, 2009). We use validation accuracy to select the best configuration for each model (we do not retrain on validation sets).

GPs are computed with 64-bit precision, and NNs are trained with 32-bit precision. GPs are regularized in the same fashion as in §G.4.

Zero-padding (same) is used. The following parameters are considered:

1. Architecture: CNN or FCN.
2. Nonlinearity: erf or ReLU.
3. Depth: 2^k for k from 0 to 4 (and up to 2^5 for the MNIST and Fashion-MNIST datasets).

4. Weight and bias variances. For erf: q∗ from {0.1, 1, 2, ..., 8}. For ReLU: a fixed weight variance σ_ω² = 2 + 4e−16 and bias variance σ_b² from {0.1, 1, 2, ..., 8}.

On CIFAR10, we additionally train NNs for 2^18 gradient updates with a batch size of 128 with corresponding parameters, in addition to:¹¹

1. Pooling: no pooling or a single global average pooling (averaging over spatial dimensions) before the final FC layer (only for CNNs).
2. Number of channels or width: 2^k for k from 1 to 9 (and up to 2^10 for CNNs with pooling in Figure 6, a).
3. Learning rate: 10^−k × 2^16 / (width × q∗) for k from 5 to 9, where width is substituted with the number of channels for CNNs and q∗ is substituted with σ_b² for ReLU networks. "Small learning rate" in Table 1 refers to k ∈ {8, 9}. (A sketch of this learning rate grid follows the list.)
4. Weight decay: 0 and 10^−k for k from 0 to 5.
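A sketch of the learning rate grid in item 3 (illustrative only; `width` and `q_star` stand for the quantities described there):

```python
def learning_rates(width, q_star, ks=range(5, 10)):
    """Learning rates 10^-k * 2^16 / (width * q_star) for k from 5 to 9.
    For CNNs, `width` is the number of channels; for ReLU networks, `q_star`
    is replaced by the bias variance sigma_b^2."""
    return [10.0 ** (-k) * 2 ** 16 / (width * q_star) for k in ks]

print(learning_rates(width=512, q_star=1.0))  # e.g. an erf network of width 512
```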

For NNs, all models are filtered to only those that are 100%-accurate on the training set (except for the values in parentheses in Table 1). Values are then reported for the models that achieve the best validation accuracy.

G.6 MODEL COMPARISON ON CIFAR10

Relevant Figure: 1.

We use the complete CIFAR10 dataset as described in §G.5 and consider 8-layer ReLU models with weight and bias variances of σ_ω² = 2 and σ_b² = 0.01. The number of channels / width is set to 2^5, 2^10, and 2^12 for LCN, CNN, and FCN respectively.

GPs are computed with 64-bit precision, and NNs are trained with 32-bit precision. No padding (valid) is used.

NN training proceeds for 2^18 gradient updates with batch size 64, but aborts if no progress on the training loss is observed for the last 10 epochs. If the training loss does not decrease by at least 10^−4 for 2 epochs, the learning rate is divided by 10.

Values for NNs are reported for the best validation accuracy over different learning rates (10^−k for k from 2 to 12) and weight decay values (0 and 10^−k for k from 2 to 7). For GPs, validation accuracy is maximized over initial diagonal regularization terms applied to the training covariance matrix: 10^−k × [mean of the diagonal] for k among 2, 4 and 9 (if the Cholesky decomposition fails, the regularization term is increased by a factor of 10 until it succeeds or k reaches the value of 10).

¹¹Due to time and compute limitations certain large configurations could not be evaluated. We believe this did not impact the results of this work in a qualitative way.
