The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks

Russell Tsuchida,1,2 Tim Pearce,3 Chris van der Heide,2 Fred Roosta,2,4 Marcus Gallagher2
1CSIRO, 2The University of Queensland, 3University of Cambridge, 4International Computer Science Institute

Abstract

Analysing and computing with Gaussian processes arising from infinitely wide neural networks has recently seen a resurgence in popularity. Despite this, many explicit covariance functions of networks with activation functions used in modern networks remain unknown. Furthermore, while the kernels of deep networks can be computed iteratively, theoretical understanding of deep kernels is lacking, particularly with respect to fixed-point dynamics. Firstly, we derive the covariance functions of multi-layer perceptrons (MLPs) with exponential linear units (ELU) and Gaussian error linear units (GELU) and evaluate the performance of the limiting Gaussian processes on some benchmarks. Secondly, and more generally, we analyse the fixed-point dynamics of iterated kernels corresponding to a broad range of activation functions. We find that unlike some previously studied neural network kernels, these new kernels exhibit non-trivial fixed-point dynamics which are mirrored in finite-width neural networks. The fixed point behaviour present in some networks explains a mechanism for implicit regularisation in overparameterised deep models. Our results relate to both the static iid parameter conjugate kernel and the dynamic neural tangent kernel constructions.¹

1 Background — Infinitely Wide Neural Networks as Gaussian Processes

Infinitely wide neural networks (NNs) and Gaussian processes (GPs) share an interesting connection (Neal 1995; Jacot, Gabriel, and Hongler 2018) which has only partially been explored. We begin by reviewing this connection. Readers familiar with this connection may skip to § 2. Consider a one-hidden-layer network with independent parameters. Suppose each ith row of weights W_i together with the corresponding bias B_i in the hidden layer has distribution (W_i^⊤, B_i)^⊤ = W̃_i ∼ N(µ, Σ), with Σ ≻ 0 being a diagonal matrix having a unique "square root" Σ^{1/2}. Further, suppose the output layer parameter vector V = (1/√n) U satisfies U ∼ N(0, σ_w² I), where n is the number of neurons in the hidden layer, and the output bias satisfies V_b ∼ N(0, σ_b²). The output evaluated at input x_1 is f(x_1) = (1/√n) Σ_{i=1}^n U_i ψ(W̃_i^⊤ x̃_1) + V_b, where ψ is an activation function and x̃_1 = (x_1^⊤, 1)^⊤. The covariance between any two outputs is

k^{(1)}(x_1, x_2) = E[ ( Σ_{i=1}^n V_i ψ(W̃_i^⊤ x̃_1) )( Σ_{j=1}^n V_j ψ(W̃_j^⊤ x̃_2) ) ] + σ_b²
                 = σ_w² E[ ψ(W̃_1^⊤ x̃_1) ψ(W̃_1^⊤ x̃_2) ] + σ_b².

The expectation over d+1 random variables reduces to an expectation over 2 random variables, W̃_1^⊤ x̃_1 and W̃_1^⊤ x̃_2, whose joint distribution is bivariate Gaussian. Writing W̃_1^⊤ x̃_i = s_i Z_i + µ̃_i with µ̃_i = µ^⊤ x̃_i and s_i = ‖Σ^{1/2} x̃_i‖, the centred pair (Z_1, Z_2) has zero mean, unit variances and covariance cos θ, where θ is the angle between Σ^{1/2} x̃_1 and Σ^{1/2} x̃_2, and ‖·‖ denotes the Euclidean norm. Therefore, the expectation in terms of Z ∼ N(0, S) is

k^{(1)}(x_1, x_2) = σ_w² E[ ψ(s_1 Z_1 + µ̃_1) ψ(s_2 Z_2 + µ̃_2) ] + σ_b²,    (1)

where S has diagonals 1 and off-diagonals cos θ.

Definition 1. We call (1) the kernel. We call cos θ^{(1)} = k^{(1)}(x_1, x_2) / √( k^{(1)}(x_1, x_1) k^{(1)}(x_2, x_2) ) the normalised kernel.

The above NN converges to a GP as n → ∞ under mild conditions on the input and ψ (Neal 1995). Since f(x_1) is a sum of independent random variables scaling as n^{-1/2}, it converges to a Gaussian random variable with zero mean as n → ∞. More generally, any fixed collection of N evaluations of f, {f(x_i)}_{i=1}^N, converges to an N-dimensional zero-mean Gaussian as n → ∞.

Analytical and closed-form covariance functions (1) are available for specific choices of ψ (Le Roux and Bengio 2007; Tsuchida, Roosta, and Gallagher 2018, 2019a; Pearce et al. 2019; Tsuchida, Roosta, and Gallagher 2019b), although some of these require µ = 0. Most notably, the kernel is known for the historically relevant activation function ψ(z) = erf(z), for RBF networks (Williams 1997), and for the more modern ReLU activation, ψ(z) = max(0, z) (Cho and Saul 2009).

¹Software at github.com/RussellTsuchida/ELU GELU kernels
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
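As a quick numerical illustration of (1), the expectation can be estimated by Monte Carlo over the first-layer parameters and compared against one of the closed forms cited above. The following is a minimal sketch (ours, not the authors' released code); it assumes µ = 0, a ReLU activation for the reference closed form of Cho and Saul (2009), and made-up values of σ_w, σ_b and the inputs.

```python
# Minimal sketch: Monte-Carlo estimate of the expectation in (1), checked
# against the known ReLU arc-cosine kernel (Cho and Saul 2009). mu = 0.
import numpy as np

def mc_kernel(psi, x1, x2, sigma_w, sigma_b, n=400_000, seed=0):
    """Monte-Carlo estimate of k^(1)(x1, x2) in (1) with mu = 0."""
    rng = np.random.default_rng(seed)
    x1t, x2t = np.append(x1, 1.0), np.append(x2, 1.0)      # augmented inputs (x, 1)
    scale = np.append(np.full(x1.size, sigma_w), sigma_b)  # diagonal of Sigma^{1/2}
    W = rng.standard_normal((n, x1.size + 1)) * scale      # rows ~ N(0, Sigma)
    return sigma_w**2 * np.mean(psi(W @ x1t) * psi(W @ x2t)) + sigma_b**2

def relu_kernel(x1, x2, sigma_w, sigma_b):
    """Closed form for psi = ReLU (Cho and Saul 2009), used as a reference."""
    scale = np.append(np.full(x1.size, sigma_w), sigma_b)
    a, b = np.append(x1, 1.0) * scale, np.append(x2, 1.0) * scale   # Sigma^{1/2} x~
    s1, s2 = np.linalg.norm(a), np.linalg.norm(b)
    theta = np.arccos(np.clip(a @ b / (s1 * s2), -1.0, 1.0))
    return (sigma_w**2 * s1 * s2 / (2 * np.pi)
            * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) + sigma_b**2)

x1, x2 = np.array([1.0, 0.0]), np.array([0.6, 0.8])
relu = lambda z: np.maximum(z, 0.0)
print(mc_kernel(relu, x1, x2, 1.4, 0.1), relu_kernel(x1, x2, 1.4, 0.1))  # should agree closely
```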

More recently, Meronen, Irwanto, and Solin (2020) solved the inverse problem, finding a ψ that recovers the Matérn class of covariance functions. Once the form of (1) is known, the kernel of deep networks can be evaluated in the case where Σ = diag(σ_w², ..., σ_w², σ_b²) and µ = 0 (Matthews et al. 2018; Lee et al. 2018; Yang 2019b,a). The case where µ ≠ 0 can also be handled (Tsuchida, Roosta, and Gallagher 2019b), but we focus on µ = 0 in this work. The kernel in layer l+1 can be found iteratively as a function of the kernel in layer l,

k^{(l+1)}(x_1, x_2) = σ_w² E[ ψ(s_1^{(l)} Z_1^{(l)}) ψ(s_2^{(l)} Z_2^{(l)}) ] + σ_b²,
(Z_1^{(l)}, Z_2^{(l)})^⊤ ∼ N( 0, [ [1, cos θ^{(l)}], [cos θ^{(l)}, 1] ] ),    (2)

where cos θ^{(l)} is the normalised kernel in layer l, and s_i^{(l)} = √( k^{(l)}(x_i, x_i) ).

Definition 2. We call k^{(l)} in (2) the kernel in layer l. We call cos θ^{(l)} = k^{(l)}(x_1, x_2) / √( k^{(l)}(x_1, x_1) k^{(l)}(x_2, x_2) ) the normalised kernel in layer l.

A generalisation of iid weight priors to partially exchangeable weight priors is also available (Tsuchida, Roosta, and Gallagher 2019b), resulting in a GP with an additional layer of inference over the hyperparameters µ and Σ. Convergence to GPs also occurs for other NN architectures such as convolutional architectures (Garriga-Alonso, Rasmussen, and Aitchison 2018; Novak et al. 2019) and general compositions of recurrent, graph convolution, pooling, skip connection, attention and normalisation layers (Yang 2019b,a).² When an MLP is trained under a continuous-time analogue of gradient descent, the limiting output is still a GP (Jacot, Gabriel, and Hongler 2018). The dynamics and associated covariance of the GP depend on the neural tangent kernel (NTK) T^{(l)}, which in addition to the iterations (2), is given by

T^{(1)}(x_1, x_2) = k^{(1)}(x_1, x_2),
k̇^{(l+1)}(x_1, x_2) = σ_w² E[ ψ′(s_1 Z_1^{(l)}) ψ′(s_2 Z_2^{(l)}) ],
T^{(l+1)}(x_1, x_2) = T^{(l)}(x_1, x_2) k̇^{(l+1)}(x_1, x_2) + k^{(l+1)}(x_1, x_2).    (3)

Figure 1: Illustration of simplicity bias due to kernel fixed points. Training data x ∈ R² is uniformly sampled on the unit disc at heading γ. Curves show the posterior mean of a GP regression model on y = sin(γ) + ε with known additive noise variance. σ_w is chosen according to Figure 3. Colours move from purple to yellow as depth increases from 1 to 64. (Left) GELU without unique kernel fixed point, leading to overfitting. (Right) ReLU with unique kernel fixed point, leading to underfitting. More examples in Appendix M.

2 Contributions and Motivation

This paper contains two main contributions. We:
1. Derive kernels for GELU and ELU activations (defined below) and verify our results numerically. We implement GPs with different NN kernels on some benchmarks.
2. Study the fixed point dynamics of the kernel when ψ is bounded by the absolute value of a polynomial. We find sufficient conditions for the existence of a unique kernel fixed point. We show theoretically and empirically that unlike the kernel corresponding to ReLU ψ, the new kernels are able to avoid unique fixed points. These conditions apply to both the iid prior (Neal 1995) and dynamic NTK (Jacot, Gabriel, and Hongler 2018) cases. This fixed point behaviour can be used to explain a simplicity bias in deep NNs. More surprisingly, we find theoretically that the NTK dynamics, which approximate training under gradient flow, preserve this simplicity bias.

2.1 Motivation for Studying Fixed Points

Viewing NNs through the arguably idealised lens of GPs has some surprisingly non-intuitive practical implications. One important open problem is in explaining the empirical observation that some overparameterised NNs do not overfit, even when trained without explicit regularisation (Zhang et al. 2017). Tsuchida, Roosta, and Gallagher (2019b) show empirically that samples from the limiting prior of deep MLPs with zero-mean parameters and LReLU activations are approximately constant on the unit hypersphere. Valle-Pérez, Camargo, and Louis (2019) argue that deep NNs with ReLU activations exhibit a "simplicity bias", in that randomly initialised NNs implementing Boolean functions are likely to be simple. Yang and Salman (2019) explain this simplicity bias through a spectral decomposition of the limiting kernel, showing that most of its mass is concentrated around the constant eigenfunction, even when accounting for training using gradient descent under the NTK (Jacot, Gabriel, and Hongler 2018). Yang and Salman (2019) are clear to separate the case of ReLU activations, which do result in kernels having peaked spectral mass and exhibiting a simplicity bias, from ERF activations which do not.

Our motivation for studying fixed points is in a similar spirit to the work above. For LReLU networks, we observe a so-called kernel fixed point. An infinitely deep LReLU network is degenerate, and therefore over-regularised, in that all functions in the prior are constant over inputs on any hypersphere. Therefore, increasingly deep kernels represent a strict and potentially undesirable prior.

²As detailed in these works, knowledge of (1) is often sufficient to describe these kernels, so our new kernels apply to more complicated networks.

k(x_1, x_2) = σ_b² + σ_w² [ (s_1 s_2 cos θ)/4 + (s_1 s_2)/(2π) ( s_1 s_2 ( (cos(2θ) + 3)/2 + s_1² + s_2² + s_1² s_2² sin²θ ) / ( (1 + s_1²)(1 + s_2²) √(1 + s_1² + s_2² + s_1² s_2² sin²θ) ) + cos θ · tan⁻¹( s_1 s_2 cos θ / √(1 + s_1² + s_2² + s_1² s_2² sin²θ) ) ) ]

Figure 2: The GELU kernel. Plots show the normalised kernels in layer l as a function of the angle θ^(0) between the inputs for MLPs of increasing depth when Σ = diag(σ_w², ..., σ_w², 0) and ‖x_i‖ is constant for all i. Values of σ_w are chosen to preserve the expected square norm ‖x‖² (see § 4.1). Solid curve shows the infinitely wide limit, and dots show samples from a network with 2 inputs and 3000 neurons in each layer. Each dot corresponds to an x_1 and x_2 generated through a random rotation of (1, 0)^⊤ and (cos θ^(0), sin θ^(0))^⊤. The random rotation is found through a QR decomposition of a matrix containing entries sampled independently from U[0, 1]. (Left) ‖x‖ = 0.5, σ_w = 1.59. (Middle) ‖x‖ = 1, σ_w = 1.47. (Right) ‖x‖ = 5, σ_w = 1.42.

On the other hand, kernels corresponding to GELU and ELU activation functions do not exhibit unique fixed points, and are therefore less biased towards simple functions. Just as traditional regularisation frameworks allow practitioners to control the bias-variance trade-off, in our framework the activation function represents a similar choice. A deep ReLU network contains a higher degree of implicit regularisation than a deep GELU network. An illustration of this effect is shown in Figure 1. Even more surprisingly, our analysis extends to the NTK.

2.2 Recently Introduced Activation Functions

The increased volume of gradient-based research has seen the introduction of new popular activation functions. Notably these include the exponential linear unit (ELU) (Clevert, Unterthiner, and Hochreiter 2016), the Gaussian error linear unit (GELU) (Hendrycks and Gimpel 2016) and the Swish (Ramachandran, Zoph, and Le 2017; Elfwing, Uchibe, and Doya 2018). The GELU and ELU are

ψ(z) = zΦ(z)   and   ψ(z) = Θ(z)z + Θ(−z)(e^z − 1),

respectively, where Φ denotes the CDF of the standard Gaussian and Θ denotes the Heaviside step function. Many state-of-the-art models use GELU (Radford et al. 2018; Devlin et al. 2019) or Swish activations (Chua et al. 2018). However, even when critically evaluating empirical evidence, it is difficult to determine whether certain activation functions are a better choice for a given problem, let alone separate the activation function expressivity from the ability of optimisers to find good solutions. Analysing activation functions through the lens of GPs allows one to visualise the function space in isolation of the ability of the optimiser to find good solutions, and reveals interesting implicit regularisation structure in the infinitely wide setting.

2.3 Model Selection

The choice of activation function in an NN can be framed as a model selection problem, and as such shares similarities with choosing other model hyperparameters. Take the case of choosing the L2 ridge-regularisation parameter as an example. One could apply n-fold cross-validation, with an implicit understanding that this parameter penalises model complexity as measured through the norm in an RKHS. Alternatively, a Bayesian might interpret the L2 weighting as the precision of a Gaussian prior, and optimise the marginal likelihood with respect to this weighting. A more pure Bayesian might put a prior over the regularisation (or precision) parameter and marginalise over these models. Common to all these approaches is (a) an intuition based on theory of what the parameter does and (b) a practical methodology for dealing with the parameter. In this paper we provide (a), and leave it to the practitioner to decide on (b) based on their philosophy and/or apparatus. Our work shows that GELU and ELU activations can avoid a regularisation mechanism that grows with depth and that is always implicit in ReLU activations.

We stress that the new kernels and the fixed point mechanisms are neither "good" nor "bad" in isolation. The new models should be evaluated according to model selection criteria in their given application, and the fixed point mechanisms describe a regularisation implicit in some kernels.

3 New Kernels

Proposition 3. When ψ is the GELU and µ = 0, the kernel (1) is given by the equation in Figure 2.
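For concreteness, the closed form in Figure 2 can be transcribed directly into code and sanity-checked against a Monte Carlo estimate of the expectation in (1). The sketch below is ours (function and variable names are not from the paper) and assumes µ = 0; the test values of s_1, s_2 and θ are arbitrary.

```python
# Sketch of the closed-form GELU kernel of Proposition 3 (mu = 0).
import numpy as np
from scipy.stats import norm

def gelu_kernel(s1, s2, theta, sigma_w=1.0, sigma_b=0.0):
    """k(x1, x2) with s_i = ||Sigma^{1/2} x~_i|| and theta the angle between them."""
    c, sn = np.cos(theta), np.sin(theta)
    delta = 1.0 + s1**2 + s2**2 + (s1 * s2 * sn) ** 2
    frac = (s1 * s2 * ((np.cos(2.0 * theta) + 3.0) / 2.0 + s1**2 + s2**2 + (s1 * s2 * sn) ** 2)
            / ((1.0 + s1**2) * (1.0 + s2**2) * np.sqrt(delta)))
    arc = c * np.arctan(c * s1 * s2 / np.sqrt(delta))
    return sigma_b**2 + sigma_w**2 * (s1 * s2 * c / 4.0 + s1 * s2 / (2.0 * np.pi) * (frac + arc))

# Monte-Carlo sanity check of the expectation in (1) with psi(z) = z Phi(z)
rng = np.random.default_rng(0)
s1, s2, theta = 1.2, 0.7, 0.9
z1 = rng.standard_normal(500_000)
z2 = np.cos(theta) * z1 + np.sin(theta) * rng.standard_normal(500_000)  # corr = cos(theta)
mc = np.mean((s1 * z1) * norm.cdf(s1 * z1) * (s2 * z2) * norm.cdf(s2 * z2))
print(mc, gelu_kernel(s1, s2, theta))   # should agree closely
```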

Figure 3: As in Figure 2, but with ELU ψ. (Left to right) (‖x‖, σ_w) = (5, 1.40), (1, 1.26), (0.5, 1.17).
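The depth curves in Figures 2 and 3 can be reproduced qualitatively by iterating the recursion (2). Below is a minimal sketch of ours using Gauss–Hermite quadrature for the bivariate expectation; it assumes σ_b = 0, µ = 0 and equal input norms, and takes σ_w values from the figure captions.

```python
# Iterating (2) to trace the normalised kernel with depth (our sketch).
import numpy as np
from scipy.stats import norm

nodes, weights = np.polynomial.hermite.hermgauss(80)

def biv_expect(psi, s1, s2, rho):
    """E[psi(s1 Z1) psi(s2 Z2)], (Z1, Z2) standard bivariate normal, corr rho."""
    z1 = np.sqrt(2.0) * nodes[:, None]
    z2 = rho * z1 + np.sqrt(max(1.0 - rho**2, 0.0)) * np.sqrt(2.0) * nodes[None, :]
    return float(np.sum(np.outer(weights, weights) / np.pi * psi(s1 * z1) * psi(s2 * z2)))

def normalised_kernel_at_depth(psi, cos0, norm_x, sigma_w, depth):
    """Iterate (2) starting from k^(0)(xi, xj) = sigma_w^2 xi . xj (sigma_b = 0)."""
    k11 = k22 = (sigma_w * norm_x) ** 2
    k12 = sigma_w**2 * norm_x**2 * cos0
    for _ in range(depth):
        a, b, c = np.sqrt(k11), np.sqrt(k22), np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        k11, k22 = sigma_w**2 * biv_expect(psi, a, a, 1.0), sigma_w**2 * biv_expect(psi, b, b, 1.0)
        k12 = sigma_w**2 * biv_expect(psi, a, b, c)
    return k12 / np.sqrt(k11 * k22)

gelu = lambda z: z * norm.cdf(z)
relu = lambda z: np.maximum(z, 0.0)
for depth in (1, 4, 16, 64):
    print(depth,
          normalised_kernel_at_depth(gelu, 0.5, 1.0, 1.47, depth),          # cf. Figure 2, middle
          normalised_kernel_at_depth(relu, 0.5, 1.0, np.sqrt(2.0), depth))  # ReLU comparison
```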

The derivation is in the same spirit as Williams (1997); we introduce dummy parameters β_1 and β_2 in the argument of Φ, differentiate with respect to β_1 and β_2 to obtain a PDE, then solve the PDE and evaluate the solution at β_1 = β_2 = 1. Complete working is given in Appendix A. It is plausible that our method of derivation extends to the case µ ≠ 0, although the calculations and resulting expression become more complicated. Even with µ = 0, this kernel has some interesting properties that we discuss in § 4. Interestingly, unlike the ELU kernel with µ = 0, the GELU kernel does not contain any hard-to-compute special functions, only some (inverse) trigonometric functions.

Our expression for the ELU kernel is lengthy and we do not assemble it in the main text, but provide a visualisation in the form of Figure 3.

Proposition 4. When ψ is the ELU, the kernel (1) has an analytical expression implemented in software¹ in terms of the univariate and bivariate normal CDFs.

Complete working is given in Appendix B. Unfortunately, the ELU kernel involves exponentiating arguments involving s_1 and s_2. This can lead to numerical instability in GP regression when many data points are involved. Despite this, having an analytical expression still allows us to gain insights into finite width networks, as we shall see in § 4.

The scaled exponential linear unit (SELU) (Klambauer et al. 2017) is a slightly modified version of the ELU which introduces two scaling parameters: one applied over the entire domain and one over the negative domain. Our analysis for the ELU also handles the SELU (see Appendix B). Klambauer et al. (2017) motivate the SELU by showing that it is able to avoid the exploding and vanishing gradient problem by carefully selecting the value of the scaling parameters. An entirely distinct problem is to analyse the correlations between signals in the network. Fixed points in signal norms are desirable to maintain the scale of signals as they propagate through the network. Fixed points in signal correlations can be undesirable as they force unrelated inputs to have similar feature representations in deep layers of the network. While Klambauer et al. (2017) is concerned with obtaining fixed points of signal norms (i.e. our § 4.1), it does not relate to fixed points of signal correlations (i.e. our § 4.2).

4 Fixed Point Analysis

In this section, we first analyse the conditions under which the expected squared norm of the signals in each layer is preserved as the signal propagates through the network for different choices of the activation function (§ 4.1). In such a situation, the expected squared norm remains at a fixed point as it passes through the network. Then, more generally, we analyse conditions under which the expected squared norm of any two signals and the cosine angle between them approach a constant as depth increases (i.e., when the kernel has a fixed point). We are especially interested in the case where the cosine angle between the signals converges to a unique fixed point (§ 4.2). Finally, we relate the existence of a unique fixed point to a degenerate, underfitting property of very deep infinitely wide MLPs (§ 4.3). For § 4, 5 and 6, we suppose all the weights have the same variance, and so do all the biases; the first d diagonals of the diagonal matrix Σ are σ_w², and the last diagonal is σ_b².

4.1 Warm-Up — Norm Preservation

A useful application of the kernel in finite-width iid-initialised networks is to track the expected squared norm of the signals in each layer as the depth of the network increases. This is used in initialisation to avoid exploding or vanishing signals when using gradient optimisers.

The expected norm squared of the signal in the first hidden layer is (k^{(1)}(x_1, x_1) − σ_b²)/σ_w². For the squared norm of the signal in the hidden layer to be the same as the squared norm of the input, we set ‖x̃_1‖² = (k^{(1)}(x_1, x_1) − σ_b²)/σ_w². We may then solve this condition to find the hyperparameter values that preserve input norms. For example, using the kernel corresponding to ReLU (Cho and Saul 2009), one obtains He initialisation (He et al. 2015), that σ_w = √2, where Σ^{1/2} = diag(σ_w, ..., σ_w, 0).

The analogue for GELU is more involved since no single σ_w preserves the expected square norms of all inputs. Setting σ_b = 0, k(x, x)/σ_w² = ‖x‖² and s_1 = s_2 = σ_w‖x‖ in the equation in Figure 2, we find a root σ*(‖x‖) of

g_{‖x‖}(σ) = σ⁴‖x‖² / ( π(σ²‖x‖² + 1)√(2σ²‖x‖² + 1) ) + (σ²/4)( 1 + (2/π) sin⁻¹( σ²‖x‖² / (1 + σ²‖x‖²) ) ) − 1

numerically.
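The root σ*(‖x‖) can be found with any standard one-dimensional root finder; a sketch of ours using scipy's brentq is below (the bracket endpoints are our choice). The roots should roughly reproduce the σ_w values quoted in the caption of Figure 2.

```python
# Solving g_{||x||}(sigma) = 0 for the GELU norm-preserving sigma_w (cf. Figure 4, left).
import numpy as np
from scipy.optimize import brentq

def g(sigma, norm_x):
    q = (sigma * norm_x) ** 2   # sigma^2 ||x||^2
    return (sigma**4 * norm_x**2 / (np.pi * (q + 1.0) * np.sqrt(2.0 * q + 1.0))
            + sigma**2 / 4.0 * (1.0 + 2.0 / np.pi * np.arcsin(q / (1.0 + q)))
            - 1.0)

for norm_x in (0.5, 1.0, 5.0):
    print(norm_x, brentq(g, 0.5, 3.0, args=(norm_x,)))
# expected roughly 1.59, 1.47 and 1.42, matching the values in the Figure 2 caption
```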

Figure 4 shows a plot of σ*(‖x‖) as ‖x‖ varies. The root of the limit of g_{‖x‖}(σ) as ‖x‖ → ∞ is √2, which recovers He initialisation. This implies that when data has large norms (such as images or audio files), He initialisation is suitable. The same procedure can be carried out for the ELU kernel as shown in Figure 4. This procedure may be viewed as a warm-up handling the special case of x_1 = x_2 for our general fixed point analysis.

4.2 General Fixed Point Analysis

Let S ⊆ [0, ∞) × [0, ∞) × [−1, 1]. In the infinitely wide limit, we may view each layer as updating a state (s_1², s_2², cos θ) ∈ S containing the expected square norms and the cosine angle between the signals in the hidden layers through a function g : S → S. Let (G_1, G_2)^⊤ ∼ N(0, I). We study the fixed-point dynamics of the iterated map g having components

g_1(s_1², s_2², ρ) = σ_w² E[ψ²(s_1 G_1)] + σ_b²,
g_2(s_1², s_2², ρ) = σ_w² E[ψ²(s_2 G_2)] + σ_b²,
g_3(s_1², s_2², ρ) = ( σ_w² E[ ψ(s_1 G_1) ψ( s_2(G_1 ρ + G_2 √(1 − ρ²)) ) ] + σ_b² ) / √( g_1(s_1², s_2², ρ) g_2(s_1², s_2², ρ) ),    (4)

which track the expected square norms (after a linear transformation involving σ_w² and σ_b²) and the normalised kernel as the signals propagate through the layers.³ By inspection, g_3 (but not necessarily g) always has an uncountable set of fixed points at ρ = 1 along s_1 = s_2. Banach's fixed point theorem says that if g is a contraction mapping on a closed set, then g has a unique fixed point on that set (Agarwal, Meehan, and O'Regan 2001). In a slightly different setting, Hasselblatt and Katok (2003, Theorem 2.2.16) allows some open sets.

Theorem 5. Let Dg denote the Jacobian of g and d′ denote the metric induced by a norm with induced matrix norm ‖·‖′. If C ⊂ R^m is an open strictly convex set, C̄ is its closure, g : C̄ → C̄ is differentiable on C and continuous on C̄ with ‖Dg‖′ ≤ λ < 1 on C, then g has a unique fixed point c_0 ∈ C̄ and d′( g^L(c), c_0 ) ≤ λ^L d′(c, c_0) for every c ∈ C̄.

We therefore consider the eigenvalues of the Jacobian, proving the following in Appendix C.

Theorem 6. Let g be as in (4), and suppose the absolute value of ψ is bounded by a polynomial. Let (Z_1, Z_2) ∼ N(0, S) with covariance ρ = cos θ and unit variances. Then for ρ ∈ (−1, 1) and 0 < s_1, s_2, the (unordered) eigenvalues of the Jacobian of g are

λ_1 := ∂g_1/∂s_1² = σ_w² E[ (Z_1² − 1) ψ²(s_1 Z_1) ] / (2 s_1²),
λ_2 := ∂g_2/∂s_2² = σ_w² E[ (Z_2² − 1) ψ²(s_2 Z_2) ] / (2 s_2²),
λ_3 := ∂g_3/∂ρ = ( σ_w² s_1 s_2 / √(g_1 g_2) ) E[ ψ′(s_1 Z_1) ψ′(s_2 Z_2) ],

provided the right hand terms are finite, where ψ′ is the distributional derivative of ψ.

Our result does not include ρ ∈ {−1, 1}. With the additional assumption that ψ is continuous almost everywhere, the expression for λ_3 is valid on the closed interval ρ ∈ [−1, 1], as shown in Appendix F. We would now like to combine Theorem 5 and Theorem 6 in order to comment on the existence of unique fixed points in some special cases. We consider two general cases in Corollaries 7 and 8 below.

³We expressed the kernel (1) in terms of iid Gaussians G instead of dependent Gaussians Z.
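The eigenvalues in Theorem 6 are low-dimensional Gaussian expectations, so the contraction condition of Theorem 5 can be checked numerically for any candidate ψ. The sketch below is ours and uses Gauss–Hermite quadrature; the activation, its derivative and the example arguments (s_1 = s_2 = σ_w = 1.47, roughly the norm-preserving ‖x‖ = 1 case) are illustrative assumptions.

```python
# Numerically evaluating lambda_1, lambda_2, lambda_3 of Theorem 6 (our sketch).
import numpy as np
from scipy.stats import norm

nodes, weights = np.polynomial.hermite.hermgauss(80)

def expect2(f, rho):
    """E[f(Z1, Z2)] for standard bivariate normals with correlation rho."""
    z1 = np.sqrt(2.0) * nodes[:, None]
    z2 = rho * z1 + np.sqrt(max(1.0 - rho**2, 0.0)) * np.sqrt(2.0) * nodes[None, :]
    return float(np.sum(np.outer(weights, weights) / np.pi * f(z1, z2)))

def jacobian_eigenvalues(psi, psi_prime, s1, s2, rho, sigma_w, sigma_b=0.0):
    g1 = sigma_w**2 * expect2(lambda a, b: psi(s1 * a) ** 2, 0.0) + sigma_b**2
    g2 = sigma_w**2 * expect2(lambda a, b: psi(s2 * b) ** 2, 0.0) + sigma_b**2
    lam1 = sigma_w**2 * expect2(lambda a, b: (a**2 - 1.0) * psi(s1 * a) ** 2, 0.0) / (2.0 * s1**2)
    lam2 = sigma_w**2 * expect2(lambda a, b: (b**2 - 1.0) * psi(s2 * b) ** 2, 0.0) / (2.0 * s2**2)
    lam3 = (sigma_w**2 * s1 * s2 / np.sqrt(g1 * g2)
            * expect2(lambda a, b: psi_prime(s1 * a) * psi_prime(s2 * b), rho))
    return lam1, lam2, lam3

# GELU: psi(z) = z Phi(z), psi'(z) = Phi(z) + z phi(z)
gelu = lambda z: z * norm.cdf(z)
gelu_prime = lambda z: norm.cdf(z) + z * norm.pdf(z)
print(jacobian_eigenvalues(gelu, gelu_prime, s1=1.47, s2=1.47, rho=0.5, sigma_w=1.47))
```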

Figure 4: (Left) Values σ* that preserve the layer-wise expected square norm of ‖x‖ in MLPs with GELU, ELU and ReLU activations. (Middle) Lower bound of λ_3 for GELU. (Right) λ_3 for ELU. If λ_3 ≤ 1 on θ ∈ (0, π), a unique fixed point exists.

Corollary 7 (Unique fixed point under absolute homogeneity). Suppose σ_b² = 0 (as is common when initialising neural networks) and ψ is absolutely homogeneous, that is, ψ(|a|z) = |a|ψ(z) for any a ∈ R. Then

∂g_3/∂ρ = λ_3 = E[ψ′(Z_1)ψ′(Z_2)] / E[ψ²(Z_1)].

Furthermore, if
• max_{i=1,2,3} |λ_i| < 1, then g admits a unique fixed point at (s², s², 1) for some s²;
• λ_3 < 1 and g_1(·, s_2², ρ) admits a fixed point s² for any s_2², ρ, then g_3(s², s², ·) admits a unique fixed point at 1.

Proof. Absolute homogeneity implies that

g_3(s_1², s_2², ρ) = σ_w² E[ ψ(s_1 G_1) ψ( s_2(G_1 ρ + G_2 √(1 − ρ²)) ) ] / ( √(σ_w² E[ψ²(s_1 G_1)]) √(σ_w² E[ψ²(s_2 G_2)]) )
                   = E[ ψ(G_1) ψ( G_1 ρ + G_2 √(1 − ρ²) ) ] / ( √(E[ψ²(G_1)]) √(E[ψ²(G_2)]) ),

so 0 = ∂g_3/∂s_1² = ∂g_3/∂s_2². Absolute homogeneity also implies that for all a ∈ R, ψ′(|a|z) = ψ′(z). Then by Theorem 6,

λ_3 = ( σ_w² s_1 s_2 / √(g_1 g_2) ) E[ψ′(s_1 Z_1)ψ′(s_2 Z_2)]
    = ( σ_w² s_1 s_2 / ( σ_w² s_1 s_2 E[ψ²(G_1)] ) ) E[ψ′(Z_1)ψ′(Z_2)]
    = E[ψ′(Z_1)ψ′(Z_2)] / E[ψ²(G_1)].

Note that the Jacobian

[ ∂g_1/∂s_1²  ∂g_1/∂s_2²  ∂g_1/∂ρ ]   [ λ_1   0    0  ]
[ ∂g_2/∂s_1²  ∂g_2/∂s_2²  ∂g_2/∂ρ ] = [  0   λ_2   0  ]
[ ∂g_3/∂s_1²  ∂g_3/∂s_2²  ∂g_3/∂ρ ]   [  0    0   λ_3 ]

is diagonal, and therefore the induced matrix norm (corresponding to a Euclidean vector norm) of the Jacobian, the largest singular value of the Jacobian, is simply the largest absolute value of the diagonal elements. Thus by Theorem 5, if max_{i=1,2,3} |λ_i| < 1 then g has a unique fixed point.

Alternatively, suppose λ_3 < 1 and g_1(·, s_2², ρ) admits a fixed point at s² for any s_2², ρ. Then g_3(s², s², ·) admits a unique fixed point by applying Theorem 5 to the 1D system g_3 with induced matrix norm |λ_3|.

Intuitively but informally, the absolute homogeneity condition in Corollary 7 leads to independent updates, so that g may be thought of as three functions g_1, g_2, g_3 whose inputs and outputs do not interact between iterations. When absolute homogeneity is removed, g_1 and g_2's inputs and outputs are not affected by g_3, but g_3's inputs are affected by the outputs of g_1 and g_2. This makes it more difficult to analyse exactly the same situation as in Corollary 7. However, if we fix the output of g_1 and g_2 at some fixed point (which are guaranteed to exist in the cases handled in § 4.1), then we can neglect interactions between the outputs of g_1, g_2 and the inputs of g_3 and analyse the iterates of g_3 as a univariate function. We proceed with this strategy in Corollary 8.

Corollary 8 (Unique fixed point of normalised kernel). If a fixed point s² of the system only involving g_1(·, s_2², ρ) : [0, ∞) → [0, ∞) exists when σ_w = σ*, we have

∂g_3/∂ρ = λ_3 = (σ*)² E[ψ′(sZ_1)ψ′(sZ_2)]

at the fixed point of g_1(·, s_2², ρ) and g_2(s_1², ·, ρ). Furthermore, if |λ_3| < 1, then g_3(s², s², ·) : [−1, 1] → [−1, 1] admits a unique fixed point at ρ = 1.

Proof. By Theorem 6, we have

λ_3 = ( σ_w² s_1 s_2 / √(g_1 g_2) ) E[ψ′(s_1 Z_1)ψ′(s_2 Z_2)] = (σ*)² E[ψ′(sZ_1)ψ′(sZ_2)].

g_3(s², s², ·) admits a unique fixed point by taking C = (−1, 1) and d′ as the Euclidean metric in Theorem 5.

4.3 Degenerate Priors and Posteriors

Having established conditions under which unique fixed points exist, we now examine what a unique fixed point implies for the limiting prior and posterior. The prior and posteriors are degenerate in the sense that they are almost surely constant over subsets of the input space. We first consider the limiting prior as the depth goes to infinity.

Proposition 9. Let {f^{(L)}(x)}_{x∈X} be a Gaussian process with mean zero and covariance function k^{(L)}. Suppose that lim_{L→∞} k^{(L)}(x_1, x_2) = lim_{L→∞} k^{(L)}(x_1, x_1) = lim_{L→∞} k^{(L)}(x_2, x_2) < ∞ for every x_1, x_2 ∈ X*. Then

lim_{L→∞} ( f^{(L)}(x_1) − f^{(L)}(x_2) ) = 0

almost surely. That is, all draws from the limiting prior are almost surely a constant function.

The proof is given in Appendix H. Suppose we take a priori the Gaussian process in Proposition 9 before the limit is taken, update our belief after observing some data to obtain the posterior and then take the limit. An interesting question is whether the limit commutes with the Bayesian update.

Proposition 10. Let {f^{(L)}(x)}_{x∈X} be a Gaussian process prior with mean zero and covariance function k^{(L)}. Fix some X* ⊆ X such that for every x_1, x_2 ∈ X* ⊆ X and x_3, x_4 ∈ X,
• lim_{L→∞} k^{(L)}(x_1, x_2) = lim_{L→∞} k^{(L)}(x_1, x_1) = lim_{L→∞} k^{(L)}(x_2, x_2) = C < ∞,
• lim_{L→∞} k^{(L)}(x_1, x_3) = lim_{L→∞} k^{(L)}(x_2, x_3) = D(x_3), and
• lim_{L→∞} k^{(L)}(x_3, x_4) < ∞ exists,

where C ∈ R and D : X → R may depend on X*. Fix some dataset X, Y, where each row X_i of X is in X. Then under Bayesian Gaussian process regression with Gaussian likelihood and strictly positive noise variance σ_n² > 0, all draws from the limiting posterior predictive distribution given observations X, Y over X* as L → ∞ are almost surely a constant function.

The proof is given in Appendix H. We are now ready to relate the existence of unique fixed points to degenerate priors and posteriors for some specific examples.

Example, LReLU. Taking X* to be any (subset of a) hypersphere in Propositions 9 and 10, we have the following result, the proof of which is given in Appendix H.

Corollary 11. Let X* be any hypersphere. Define a Gaussian process prior with a covariance function corresponding to an infinitely wide MLP with LReLU activations, µ = 0, σ_w² = 2 and depth L. Then as L → ∞, draws from the prior and posterior predictive distributions are almost surely constant over X*.

Examples, GELU and ELU. In contrast with LReLU activations, such a degeneracy guarantee does not exist for GELU and ELU activations. For both the GELU and ELU, we consider the dynamics on a ball of constant ‖x‖, where σ_w is chosen such that g_1 = ‖x‖. In Figure 4, for different values of ‖x‖ in the context of Corollary 8, we evaluate a lower bound for λ_3 in the case of GELU and λ_3 exactly in the case of ELU. Full working is given in Appendices G.1 and G.2. We observe that each exceeds 1 at some point on the interval (but not over the whole interval), and is therefore not a contraction mapping and hence not guaranteed to have a unique fixed point. This is consistent with Figures 2 and 3, where fixed points are shown by intersecting curves.

5 Extension of Theoretical Results to NTK

We may also study kernel fixed points of infinitely wide neural networks trained under gradient flow. This amounts to studying the fixed point properties of the neural tangent kernel (NTK). If such a unique fixed point exists, this implies that the functions obtained by applying gradient flow to an infinitely wide, infinitely deep MLP are also degenerate. In this section, we briefly sketch how such a result may be obtained. We leave the presentation of formal results and empirical evaluations for future work.

Informally, the value of the eigenvalue λ_3 can still predict the fixed point behaviour of the NTK. Formally, the result is slightly more involved; see Appendix I. Consider a state space S ⊂ R⁴ containing states (s_1², s_2², k, T) consisting of squared norms for both inputs, the kernel, and the NTK. We update the states through h : S → S:

h_i(s_1², s_2², k, T) = σ_w² E[ψ²(s_i Z_i)] + σ_b²,   i = 1, 2,
h_3(s_1², s_2², k, T) = σ_w² E[ψ(s_1 Z_1)ψ(s_2 Z_2)] + σ_b²,
h_4(s_1², s_2², k, T) = T σ_w² E[ψ′(s_1 Z_1)ψ′(s_2 Z_2)] + σ_w² E[ψ(s_1 Z_1)ψ(s_2 Z_2)] + σ_b²,

where Cov(Z_1, Z_2) = k/(s_1 s_2) and E[Z_1] = E[Z_2] = 0. As in Corollary 8, if a fixed point of the system only involving s_1² and h_1 exists at a value σ_w = σ* and s_1 = s_2 = s, then the system reduces to a 2-dimensional update involving only h_3 and h_4 along s_1 = s_2 = s. The Jacobian

J = [ ∂h_3/∂k  ∂h_3/∂T ]   [ ∂h_3/∂k     0     ]
    [ ∂h_4/∂k  ∂h_4/∂T ] = [ ∂h_4/∂k  ∂h_4/∂T  ]

is lower triangular, so the eigenvalues are ∂h_3/∂k and ∂h_4/∂T. By Theorem 6,

∂h_3/∂k = (∂h_3/∂ρ)(∂k/∂ρ)^{−1} = (∂g_3/∂ρ) √(h_1 h_2) (s_1 s_2)^{−1}
        = (σ*)² E[ψ′(s_1 Z_1)ψ′(s_2 Z_2)] = ∂h_4/∂T = λ_3.

Figure 5: Posterior predictive ± 2 standard deviations.

6 Gaussian Process Experiments

We perform two sets of experiments. In the first, we investigate the performance of GPs with various neural network kernels. In the second, we observe the degree of implicit regularisation that is obtained using GPs of finite depth.

6.1 Benchmarking

We provide a software implementation of our new kernels. To demonstrate usage of our covariance functions, first we compare the performance of GP regression models using ReLU, LReLU, ERF and GELU kernels on a popular Bayesian deep learning benchmark (Hernández-Lobato and Adams 2015). The purpose of these experiments is not to showcase the superiority of one prior over another, but rather to provide a sample implementation. This implementation has already been ported over to another framework in concurrent work by others (Novak et al. 2020). The ELU kernel was not included in our experiments (see § 3). We perform separate experiments on shallow models having 1 hidden layer and deep models having up to 32 hidden layers. All data was standardised to have mean 0 and variance 1.

Shallow models. Do differences in priors induced by the various activation functions affect empirical performance? Using the limiting GP allows us to remove the interaction between ψ and optimisation, and purely consider the effect of ψ on the functional prior. Figure 5 shows the predictive distribution of GPs with GELU, ReLU, LReLU and ERF kernels on a toy regression task. ERF has different extrapolation properties due to being a bounded activation, whilst the others appear qualitatively similar, though with extrapolation variance decreasing in the order GELU/ReLU/LReLU. Figure 6 shows benchmark results for single-hidden-layer GPs using a 90%/10% training/test split. See Appendix J.1 for more details and plots. All kernels perform comparably; gains can be made by selecting a kernel suited to the dataset. Results are most different for ERF — either negatively (Concrete, Energy) or positively (Boston, Protein, Wine). Differences are observed between GELU/ReLU/LReLU. For example, GELU offers an advantage in Naval and Yacht, and LReLU performs poorly on Protein.

None of the kernels consistently outperform the others. This is expected behaviour, similar to how a Matérn kernel might outperform a squared exponential kernel only some of the time on real-world datasets. The purpose of the experiment was to evaluate whether different ψ result in a strong enough difference in priors that empirical performance differences can be observed. Having answered in the positive, we posit that for finite-width networks, the difference in performance found by varying ψ may partially derive from differences in the induced prior. This is in contrast to previously cited reasons such as bias shift and its relation to natural gradient (Clevert, Unterthiner, and Hochreiter 2016).
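A single-hidden-layer GP regression with one of these kernels only needs a Gram matrix built from the closed form together with the usual GP predictive equations. The sketch below is ours and illustrates the kind of model used in Figures 5 and 6; it reuses the gelu_kernel function from the sketch following Proposition 3, and the data and hyperparameter values are invented for illustration.

```python
# GP regression with the single-hidden-layer GELU kernel (our sketch).
import numpy as np

def gelu_gram(XA, XB, sigma_w, sigma_b):
    """Gram matrix of the GELU kernel; uses gelu_kernel from the earlier sketch."""
    A = np.hstack([sigma_w * XA, sigma_b * np.ones((len(XA), 1))])   # Sigma^{1/2} (x, 1)
    B = np.hstack([sigma_w * XB, sigma_b * np.ones((len(XB), 1))])
    sA, sB = np.linalg.norm(A, axis=1), np.linalg.norm(B, axis=1)
    cos = np.clip(A @ B.T / np.outer(sA, sB), -1.0, 1.0)
    K = np.empty_like(cos)
    for i in range(len(XA)):
        for j in range(len(XB)):
            K[i, j] = gelu_kernel(sA[i], sB[j], np.arccos(cos[i, j]), sigma_w, sigma_b)
    return K

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, (40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
Xstar = np.linspace(-5.0, 5.0, 200)[:, None]
K = gelu_gram(X, X, sigma_w=1.5, sigma_b=0.5) + 0.1**2 * np.eye(len(X))   # add noise variance
Kstar = gelu_gram(Xstar, X, sigma_w=1.5, sigma_b=0.5)
posterior_mean = Kstar @ np.linalg.solve(K, y)                            # GP predictive mean
```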

Figure 7: RMSE against σ_w² for equivalent ℓ-layer GPs, Wine dataset. Shaded region shows ±1 standard deviation. (Clockwise from top left) ReLU, GELU, LReLU, ERF.

Deep models. How does the performance of models vary with depth? We randomly shuffled the data into an 80%/20% train/test split 5 times. For each split, we ran GP regression with an additive iid Gaussian noise model having variance fixed at 0.1. We varied the depth ℓ ∈ [1, 32] in steps of 1, and the weight and bias variances (which were constrained to be equal in each layer) σ_w² ∈ [0.1, 5] in steps of 0.1. For each setting, we measured the RMSE between the mean of the GP prediction and the true regression targets. Figure 7 shows the average RMSE on the Wine dataset over 5 shuffles. Other datasets are given in Appendix J.2. We make two qualitative observations. Firstly, ℓ > 1 models out-perform ℓ = 1 models. Secondly, the RMSE changes smoothly in both depth and σ_w². The visual smoothness is not due to averaging over 5 trials; smoothness is also observed when we plot results from only 1 random shuffling. Table 1 shows the best models obtained over the grid search for each kernel.

Figure 6: RMSE for equivalent single-hidden-layer GPs. Mean ± 2 standard errors (over 20 runs).

6.2 Overfitting and Underfitting

We empirically investigate the relationship between depth and training/testing error. When the covariance function has a unique fixed point, we expect to see underfitting at large depth, since large depth will push the kernel towards the unique fixed point, that is, a constant normalised kernel. On the other hand, when the covariance function does not have a unique fixed point, we might expect to see overfitting as model complexity may increase with depth.

We build predictor variables of a training dataset by uniformly sampling x ∈ R² on the unit disc at heading γ. We then sample training targets through the mapping y = f(γ) + ε for a number of different choices of f, where ε is additive Gaussian noise with variance fixed at 0.1. We find the posterior predictive mean of a Gaussian process with iterated GELU and ReLU covariance functions of depth L between 1 and 100, with a choice of σ_w according to Figure 4. We repeat this process with a new random training set 10 times. Each repetition, we find the mean-squared error (MSE) between the posterior mean of the Gaussian process over the training set and a test set built from a (deterministic) uniform grid of size 100. Figure 8 shows the resulting train and test errors on one choice of f, and Appendix L shows other choices of f. Figure 1 shows a more direct illustration of the function fit on one of the random data samples, with other choices of f shown in Appendix L.
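A stripped-down version of this experiment might look as follows. The sketch is ours: it reads the inputs as points at unit radius with heading γ, takes f(γ) = sin γ, and reuses the quadrature helper biv_expect and the GELU definition from the earlier sketch; the depth and σ_w = 1.47 (the norm-preserving value for ‖x‖ = 1) are illustrative choices.

```python
# Toy Section 6.2 experiment with a depth-L iterated GELU kernel GP (our sketch).
import numpy as np

def deep_gelu_kernel(x1, x2, sigma_w, depth):
    """k^(depth)(x1, x2) by iterating (2); uses biv_expect and gelu from the earlier sketch."""
    k11, k22 = sigma_w**2 * (x1 @ x1), sigma_w**2 * (x2 @ x2)
    k12 = sigma_w**2 * (x1 @ x2)
    for _ in range(depth):
        a, b = np.sqrt(k11), np.sqrt(k22)
        c = np.clip(k12 / (a * b), -1.0, 1.0)
        k11, k22 = sigma_w**2 * biv_expect(gelu, a, a, 1.0), sigma_w**2 * biv_expect(gelu, b, b, 1.0)
        k12 = sigma_w**2 * biv_expect(gelu, a, b, c)
    return k12

rng = np.random.default_rng(2)
gamma = rng.uniform(0.0, 2.0 * np.pi, 30)
X = np.stack([np.cos(gamma), np.sin(gamma)], axis=1)          # unit-radius inputs at heading gamma
y = np.sin(gamma) + np.sqrt(0.1) * rng.standard_normal(30)    # noise variance 0.1
depth, sigma_w = 16, 1.47
K = np.array([[deep_gelu_kernel(a, b, sigma_w, depth) for b in X] for a in X])
gamma_t = np.linspace(0.0, 2.0 * np.pi, 100)
Xt = np.stack([np.cos(gamma_t), np.sin(gamma_t)], axis=1)
Kt = np.array([[deep_gelu_kernel(a, b, sigma_w, depth) for b in X] for a in Xt])
mean = Kt @ np.linalg.solve(K + 0.1 * np.eye(30), y)
print("test MSE:", np.mean((mean - np.sin(gamma_t)) ** 2))
```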

           ReLU                      GELU                      LReLU                     ERF
           RMSE         σ_w²   ℓ     RMSE         σ_w²   ℓ     RMSE         σ_w²   ℓ     RMSE         σ_w²   ℓ
Boston     2.85 ± 0.64  1.90   7     2.86 ± 0.65  1.80   6     2.60 ± 1.07  2.00   32    2.69 ± 0.95  5.00   2
Concrete   5.22 ± 0.55  5.00   2     5.21 ± 0.56  5.00   2     5.23 ± 0.44  3.30   3     5.63 ± 0.46  5.00   2
Energy     0.89 ± 0.11  5.00   2     0.92 ± 0.12  5.00   2     2.77 ± 0.31  0.10   1     2.79 ± 0.23  0.10   1
Wine       1.15 ± 0.13  5.00   4     1.17 ± 0.14  5.00   4     1.04 ± 0.12  5.00   5     3.96 ± 0.75  5.00   1
Yacht      0.58 ± 0.01  4.80   32    0.58 ± 0.01  2.30   32    0.60 ± 0.02  1.80   29    0.57 ± 0.02  5.00   8

Table 1: Best performing models for each kernel over the grid search.

Figure 8: Training and testing errors for GPs with covariance functions corresponding to infinitely wide MLPs of increasing depth L, using the same example as in Figure 1. Solid curve shows the mean over 10 training data samples, and the shaded region shows ± two standard deviations. By examining whether the training and testing error increases or decreases with depth, we observe that the GELU and ReLU respectively overfit and underfit with depth. See Appendix L for more error curves. (Left) GELU. (Right) ReLU.

7 Discussion and Conclusion

We introduced two new positive semi-definite kernels arising from the infinite-width limit of Bayesian GELU or ELU NNs. We provided visualisations of these kernels for varying depths. We introduced a general framework for understanding the fixed-point dynamics of such kernels and their NTK counterparts. Using this framework, we showed that unlike the ReLU, the GELU and ELU kernels are able to avoid unique fixed points. We empirically verified that finite-width NNs are able to avoid unique kernel fixed points in Figures 2 and 3. We applied our kernels in the setting of shallow and deep GP regression, finding that for some problems specific kernels are more appropriate, and that the GELU kernel is competitive with the ReLU kernel.

Investigations into implicit regularisation consider the role of one or all of (a) the architecture, (b) the learning algorithm and (c) the data sampling process. Neyshabur, Tomioka, and Srebro (2015) argue that (b) leads implicitly to low-norm solutions, explaining the generalisation ability of deep NNs. On the other hand, Dereziński, Liang, and Mahoney (2019) construct a data sampling distribution (c) that explains double descent and implicit regularisation in linear models. Similar to our work but not considering the NTK, the signal propagation literature (Schoenholz et al. 2017; Poole et al. 2016) explains simplicity biases in randomly initialised networks (a). They develop objects similar to ∂g_3/∂ρ, but require bounded activations, and seem to also require some notion of differentiability. Our analysis considers (a) and (b) but not (c). We have recently been made aware of the concurrent work of Huang et al. (2020), who also study the degeneracy of processes induced through ReLU activations. Their focus is on both (a) and (b), arguing that the NTK for residual architectures does not suffer from this degeneracy.

Knowing the gradient of the kernel with respect to (σ_w^(l))² and (σ_b^(l))² is useful for both empirical Bayesian methodologies (e.g. optimising the marginal likelihood using LBFGS) and hierarchical models (e.g. using HMC to integrate out the hyperprior). As we detail in Appendix K, our results may be used to find this gradient. Theorem 6 provides 7 of the 9 elements of the Jacobian. The other 2 can only easily be evaluated in special cases (e.g. LReLU results in a diagonal Jacobian). In future work, it may be interesting to extend Theorem 6 to cover the remaining elements of the Jacobian.

While Lee et al. (2019) found close agreement between NNs and their corresponding limiting GPs, several authors (Neal 1995; MacKay 2003; Der and Lee 2006; Matthews et al. 2018; Chizat, Oyallon, and Bach 2019; Tsuchida, Roosta, and Gallagher 2019b; Allen-Zhu, Li, and Liang 2019; Peluchetti, Favaro, and Fortini 2020; Aitchison 2020) have argued against the use of GP models as a means to understand the success of deep learning. If deep learning's performance can be explained using GPs, why do NN models outperform their limiting GP counterparts? Arora et al. (2019) attain 77% test accuracy on CIFAR10 using a limiting GP arising from a trained CNN, while a ResNet is able to achieve 96% (Springenberg et al. 2015). On the other hand, Arora et al. (2020) find that GP models are competitive on small datasets. It remains to determine if this difference in performance is due to tricks for which equivalences in the GP setting have not yet been fully explored, or if it is the result of some deeper property of GPs. Lee et al. (2020) empirically explore some of these questions, including the effects of finite width, architectures, weight decay and the performance difference between infinite Bayesian and NTK models. While we acknowledge the limitations of the infinitely wide approach, we believe it warrants further exploration, if not to understand the power of deep learning, at least to investigate its generalisation abilities. The purpose of our study was not to optimise any architecture for performance on a particular problem, but rather to develop results under the GP framework that contribute to our understanding of generalisation in the overparameterised setting.

Acknowledgements

This work was partially funded by CSIRO's Machine Learning and Artificial Intelligence Future Science Platform.

TP was funded through EPSRC (EP/N509620/1). CvdH was supported by ACEMS under ARC grant number CE140100049. We would like to thank Bob Williamson for a helpful discussion.

References

Agarwal, R. P.; Meehan, M.; and O'Regan, D. 2001. Fixed point theory and applications, volume 141. Cambridge University Press.

Aitchison, L. 2020. Why bigger is not always better: on finite and infinite neural networks. In International Conference on Machine Learning, 156–164.

Allen-Zhu, Z.; Li, Y.; and Liang, Y. 2019. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, 6155–6166.

Arora, S.; Du, S. S.; Hu, W.; Li, Z.; Salakhutdinov, R. R.; and Wang, R. 2019. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, 8139–8148.

Arora, S.; Du, S. S.; Li, Z.; Salakhutdinov, R.; Wang, R.; and Yu, D. 2020. Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks. In International Conference on Learning Representations.

Billingsley, P. 1995. Probability and measure. John Wiley & Sons, 3rd edition.

Bui, T.; Hernández-Lobato, D.; Hernández-Lobato, J.; Li, Y.; and Turner, R. 2016. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, 1472–1481.

Chizat, L.; Oyallon, E.; and Bach, F. 2019. On Lazy Training in Differentiable Programming. In Advances in Neural Information Processing Systems.

Cho, Y.; and Saul, L. K. 2009. Kernel Methods for Deep Learning. In Advances in Neural Information Processing Systems, 342–350.

Chua, K.; Calandra, R.; McAllister, R.; and Levine, S. 2018. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 4754–4765.

Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2016. Fast and accurate deep network learning by exponential linear units (ELUs). In The International Conference on Learning Representations.

Der, R.; and Lee, D. D. 2006. Beyond Gaussian processes: On the distributions of infinite networks. In Advances in Neural Information Processing Systems, 275–282.

Dereziński, M.; Liang, F.; and Mahoney, M. W. 2019. Exact expressions for double descent and implicit regularization via surrogate random design. In arXiv preprint arXiv:1912.04533.

Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186.

Elfwing, S.; Uchibe, E.; and Doya, K. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107: 3–11.

Garriga-Alonso, A.; Rasmussen, C. E.; and Aitchison, L. 2018. Deep Convolutional Networks as shallow Gaussian Processes. In International Conference on Learning Representations.

Hasselblatt, B.; and Katok, A. 2003. A first course in dynamics: with a panorama of recent developments. Cambridge University Press.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 1026–1034.

Hendrycks, D.; and Gimpel, K. 2016. Gaussian error linear units (GELUs). In arXiv preprint arXiv:1606.08415.

Hernández-Lobato, J. M.; and Adams, R. P. 2015. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning.

Huang, K.; Wang, Y.; Tao, M.; and Zhao, T. 2020. Why Do Deep Residual Networks Generalize Better than Deep Feedforward Networks? — A Neural Tangent Kernel Perspective. Advances in Neural Information Processing Systems 33.

Jacot, A.; Gabriel, F.; and Hongler, C. 2018. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 8571–8580.

Jones, D. S. 1982. The theory of generalised functions. Cambridge; New York: Cambridge University Press, 2nd edition.

Klambauer, G.; Unterthiner, T.; Mayr, A.; and Hochreiter, S. 2017. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, 971–980.

Le Roux, N.; and Bengio, Y. 2007. Continuous neural networks. In Artificial Intelligence and Statistics, 404–411.

Lee, J.; Bahri, Y.; Novak, R.; Schoenholz, S. S.; Pennington, J.; and Sohl-Dickstein, J. 2018. Deep neural networks as Gaussian processes. In The International Conference on Learning Representations.

Lee, J.; Schoenholz, S.; Pennington, J.; Adlam, B.; Xiao, L.; Novak, R.; and Sohl-Dickstein, J. 2020. Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems 33.

Lee, J.; Xiao, L.; Schoenholz, S.; Bahri, Y.; Novak, R.; Sohl-Dickstein, J.; and Pennington, J. 2019. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, 8570–8581.

MacKay, D. J. 2003. Information theory, inference and learning algorithms, 547. Cambridge University Press.

Matthews, A. G. d. G.; Rowland, M.; Hron, J.; Turner, R. E.; and Ghahramani, Z. 2018. Gaussian process behaviour in wide deep neural networks. In The International Conference on Learning Representations.

Meronen, L.; Irwanto, C.; and Solin, A. 2020. Stationary Activations for Uncertainty Calibration in Deep Learning. Advances in Neural Information Processing Systems 33.

Neal, R. M. 1995. Bayesian learning for neural networks. Ph.D. thesis, University of Toronto.

Neyshabur, B.; Tomioka, R.; and Srebro, N. 2015. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations (workshop track).

Novak, R.; Xiao, L.; Hron, J.; Lee, J.; Alemi, A. A.; Sohl-Dickstein, J.; and Schoenholz, S. S. 2020. Neural Tangents: Fast and Easy Infinite Neural Networks in Python. In International Conference on Learning Representations. URL https://github.com/google/neural-tangents.

Novak, R.; Xiao, L.; Lee, J.; Bahri, Y.; Yang, G.; Hron, J.; Abolafia, D. A.; Pennington, J.; and Sohl-Dickstein, J. 2019. Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes. In The International Conference on Learning Representations.

Pearce, T.; Tsuchida, R.; Zaki, M.; Brintrup, A.; and Neely, A. 2019. Expressive Priors in Bayesian Neural Networks: Kernel Combinations and Periodic Functions. In Uncertainty in Artificial Intelligence.

Peluchetti, S.; Favaro, S.; and Fortini, S. 2020. Stable behaviour of infinitely wide deep neural networks. In International Conference on Artificial Intelligence and Statistics, 1137–1146. PMLR.

Poole, B.; Lahiri, S.; Raghu, M.; Sohl-Dickstein, J.; and Ganguli, S. 2016. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, 3360–3368.

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. In Preprint.

Ramachandran, P.; Zoph, B.; and Le, Q. V. 2017. Searching for activation functions. In arXiv preprint arXiv:1710.05941.

Rosenbaum, S. 1961. Moments of a truncated bivariate normal distribution. Journal of the Royal Statistical Society: Series B (Methodological) 23(2): 405–408.

Schoenholz, S. S.; Gilmer, J.; Ganguli, S.; and Sohl-Dickstein, J. 2017. Deep information propagation. In International Conference on Learning Representations.

Sheppard, W. F. 1899. On the application of the theory of error to cases of normal distribution and normal correlation. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character (192): 140.

Springenberg, J.; Dosovitskiy, A.; Brox, T.; and Riedmiller, M. 2015. Striving for Simplicity: The All Convolutional Net. In International Conference on Learning Representations (workshop track).

Tsuchida, R.; Roosta, F.; and Gallagher, M. 2018. Invariance of Weight Distributions in Rectified MLPs. In International Conference on Machine Learning, 5002–5011.

Tsuchida, R.; Roosta, F.; and Gallagher, M. 2019a. Exchangeability and Kernel Invariance in Trained MLPs. In International Joint Conference on Artificial Intelligence.

Tsuchida, R.; Roosta, F.; and Gallagher, M. 2019b. Richer priors for infinitely wide multi-layer perceptrons. In arXiv preprint arXiv:1911.12927.

Valle-Pérez, G.; Camargo, C. Q.; and Louis, A. A. 2019. Deep learning generalizes because the parameter-function map is biased towards simple functions. In International Conference on Learning Representations.

Williams, C. K. 1997. Computing with infinite networks. In Advances in Neural Information Processing Systems, 295–301.

Yang, G. 2019a. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. In arXiv preprint arXiv:1902.04760.

Yang, G. 2019b. Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes. In Advances in Neural Information Processing Systems, 9947–9960.

Yang, G.; and Salman, H. 2019. A fine-grained spectral perspective on neural networks. In arXiv preprint arXiv:1907.10599.

Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2017. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
