The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks

Russell Tsuchida,1,2 Tim Pearce,3 Chris van der Heide,2 Fred Roosta,2,4 Marcus Gallagher2
1CSIRO, 2The University of Queensland, 3University of Cambridge, 4International Computer Science Institute

Abstract

Analysing and computing with Gaussian processes arising from infinitely wide neural networks has recently seen a resurgence in popularity. Despite this, many explicit covariance functions of networks with activation functions used in modern networks remain unknown. Furthermore, while the kernels of deep networks can be computed iteratively, theoretical understanding of deep kernels is lacking, particularly with respect to fixed-point dynamics. Firstly, we derive the covariance functions of multi-layer perceptrons (MLPs) with exponential linear units (ELU) and Gaussian error linear units (GELU) and evaluate the performance of the limiting Gaussian processes on some benchmarks. Secondly, and more generally, we analyse the fixed-point dynamics of iterated kernels corresponding to a broad range of activation functions. We find that unlike some previously studied neural network kernels, these new kernels exhibit non-trivial fixed-point dynamics which are mirrored in finite-width neural networks. The fixed point behaviour present in some networks explains a mechanism for implicit regularisation in overparameterised deep models. Our results relate to both the static iid parameter conjugate kernel and the dynamic neural tangent kernel constructions.¹

1 Background — Infinitely Wide Neural Networks as Gaussian Processes

Infinitely wide neural networks (NNs) and Gaussian processes (GPs) share an interesting connection (Neal 1995; Jacot, Gabriel, and Hongler 2018) which has only partially been explored. We begin by reviewing this connection. Readers familiar with this connection may skip to § 2. Consider a one-hidden-layer network with independent parameters. Suppose each ith row of weights W_i together with the corresponding bias B_i in the hidden layer has distribution (W_i^⊤, B_i)^⊤ = W̃_i ∼ N(µ, Σ), with Σ ≻ 0 being a diagonal matrix having a unique "square root" Σ^{1/2}. Further, suppose the output layer parameter vector V = (1/√n) U satisfies U ∼ N(0, σ_w² I), where n is the number of neurons in the hidden layer, and the output bias satisfies V_b ∼ N(0, σ_b²). The output evaluated at input x_1 is f(x_1) = (1/√n) Σ_{i=1}^n U_i ψ(W̃_i^⊤ x̃_1) + V_b, where ψ is an activation function and x̃_1 = (x_1^⊤, 1)^⊤. The covariance between any two outputs is

k^{(1)}(x_1, x_2) = E[ ( Σ_{i=1}^n V_i ψ(W̃_i^⊤ x̃_1) )( Σ_{j=1}^n V_j ψ(W̃_j^⊤ x̃_2) ) ] + σ_b²
                 = σ_w² E[ ψ(W̃_1^⊤ x̃_1) ψ(W̃_1^⊤ x̃_2) ] + σ_b².

The expectation over d+1 random variables reduces to an expectation over 2 random variables, W̃_1^⊤ x̃_1 and W̃_1^⊤ x̃_2, whose joint distribution is bivariate Gaussian. Writing W̃_1^⊤ x̃_i = s_i Z_i + µ̃_i with µ̃_i = µ^⊤ x̃_i and s_i = ‖Σ^{1/2} x̃_i‖, the centred pair (Z_1, Z_2) has zero mean, unit variances and covariance cos θ, where θ is the angle between Σ^{1/2} x̃_1 and Σ^{1/2} x̃_2, and ‖·‖ denotes the Euclidean norm. Therefore, the expectation in terms of Z ∼ N(0, S) is

k^{(1)}(x_1, x_2) = σ_w² E[ ψ(s_1 Z_1 + µ̃_1) ψ(s_2 Z_2 + µ̃_2) ] + σ_b²,    (1)

where S has diagonals 1 and off-diagonals cos θ.

Definition 1. We call (1) the kernel. We call cos θ^{(1)} = k^{(1)}(x_1, x_2) / √( k^{(1)}(x_1, x_1) k^{(1)}(x_2, x_2) ) the normalised kernel.

The above NN converges to a GP as n → ∞ under mild conditions on the input and ψ (Neal 1995). Since f(x_1) is a sum of independent random variables scaling as n^{-1/2}, it converges to a Gaussian random variable with zero mean as n → ∞. More generally, any fixed collection of N evaluations of f, {f(x_i)}_{i=1}^N, converges to an N-dimensional zero-mean Gaussian as n → ∞.

Analytical and closed-form covariance functions (1) are available for specific choices of ψ (Le Roux and Bengio 2007; Tsuchida, Roosta, and Gallagher 2018, 2019a; Pearce et al. 2019; Tsuchida, Roosta, and Gallagher 2019b), although some of these require µ = 0. Most notably, the kernel is known for the historically relevant activation function ψ(z) = erf(z), for RBF networks (Williams 1997), and for the more modern ReLU activation, ψ(z) = max(0, z) (Cho and Saul 2009).

¹Software at github.com/RussellTsuchida/ELU GELU kernels
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
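As a quick numerical illustration of (1), the expectation can be estimated by Monte Carlo over the first-layer parameters and compared against one of the closed forms cited above. The following is a minimal sketch (ours, not the authors' released code); it assumes µ = 0, a ReLU activation for the reference closed form of Cho and Saul (2009), and made-up values of σ_w, σ_b and the inputs.

```python
# Minimal sketch: Monte-Carlo estimate of the expectation in (1), checked
# against the known ReLU arc-cosine kernel (Cho and Saul 2009). mu = 0.
import numpy as np

def mc_kernel(psi, x1, x2, sigma_w, sigma_b, n=400_000, seed=0):
    """Monte-Carlo estimate of k^(1)(x1, x2) in (1) with mu = 0."""
    rng = np.random.default_rng(seed)
    x1t, x2t = np.append(x1, 1.0), np.append(x2, 1.0)      # augmented inputs (x, 1)
    scale = np.append(np.full(x1.size, sigma_w), sigma_b)  # diagonal of Sigma^{1/2}
    W = rng.standard_normal((n, x1.size + 1)) * scale      # rows ~ N(0, Sigma)
    return sigma_w**2 * np.mean(psi(W @ x1t) * psi(W @ x2t)) + sigma_b**2

def relu_kernel(x1, x2, sigma_w, sigma_b):
    """Closed form for psi = ReLU (Cho and Saul 2009), used as a reference."""
    scale = np.append(np.full(x1.size, sigma_w), sigma_b)
    a, b = np.append(x1, 1.0) * scale, np.append(x2, 1.0) * scale   # Sigma^{1/2} x~
    s1, s2 = np.linalg.norm(a), np.linalg.norm(b)
    theta = np.arccos(np.clip(a @ b / (s1 * s2), -1.0, 1.0))
    return (sigma_w**2 * s1 * s2 / (2 * np.pi)
            * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) + sigma_b**2)

x1, x2 = np.array([1.0, 0.0]), np.array([0.6, 0.8])
relu = lambda z: np.maximum(z, 0.0)
print(mc_kernel(relu, x1, x2, 1.4, 0.1), relu_kernel(x1, x2, 1.4, 0.1))  # should agree closely
```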

More recently, Meronen, Irwanto, and Solin (2020) solved the inverse problem, finding a ψ that recovers the Matérn class of covariance functions. Once the form of (1) is known, the kernel of deep networks can be evaluated in the case where Σ = diag(σ_w², ..., σ_w², σ_b²) and µ = 0 (Matthews et al. 2018; Lee et al. 2018; Yang 2019b,a). The case where µ ≠ 0 can also be handled (Tsuchida, Roosta, and Gallagher 2019b), but we focus on µ = 0 in this work. The kernel in layer l+1 can be found iteratively as a function of the kernel in layer l,

k^{(l+1)}(x_1, x_2) = σ_w² E[ ψ(s_1^{(l)} Z_1^{(l)}) ψ(s_2^{(l)} Z_2^{(l)}) ] + σ_b²,
(Z_1^{(l)}, Z_2^{(l)})^⊤ ∼ N( 0, [ [1, cos θ^{(l)}], [cos θ^{(l)}, 1] ] ),    (2)

where cos θ^{(l)} is the normalised kernel in layer l, and s_i^{(l)} = √( k^{(l)}(x_i, x_i) ).

Definition 2. We call k^{(l)} in (2) the kernel in layer l. We call cos θ^{(l)} = k^{(l)}(x_1, x_2) / √( k^{(l)}(x_1, x_1) k^{(l)}(x_2, x_2) ) the normalised kernel in layer l.

A generalisation of iid weight priors to partially exchangeable weight priors is also available (Tsuchida, Roosta, and Gallagher 2019b), resulting in a GP with an additional layer of inference over the hyperparameters µ and Σ. Convergence to GPs also occurs for other NN architectures such as convolutional architectures (Garriga-Alonso, Rasmussen, and Aitchison 2018; Novak et al. 2019) and general compositions of recurrent, graph convolution, pooling, skip connection, attention and normalisation layers (Yang 2019b,a).² When an MLP is trained under a continuous-time analogue of gradient descent, the limiting output is still a GP (Jacot, Gabriel, and Hongler 2018). The dynamics and associated covariance of the GP depend on the neural tangent kernel (NTK) T^{(l)}, which in addition to the iterations (2), is given by

T^{(1)}(x_1, x_2) = k^{(1)}(x_1, x_2),
k̇^{(l+1)}(x_1, x_2) = σ_w² E[ ψ′(s_1 Z_1^{(l)}) ψ′(s_2 Z_2^{(l)}) ],
T^{(l+1)}(x_1, x_2) = T^{(l)}(x_1, x_2) k̇^{(l+1)}(x_1, x_2) + k^{(l+1)}(x_1, x_2).    (3)

Figure 1: Illustration of simplicity bias due to kernel fixed points. Training data x ∈ R² is uniformly sampled on the unit disc at heading γ. Curves show the posterior mean of a GP regression model on y = sin(γ) + ε with known additive noise variance. σ_w is chosen according to Figure 3. Colours move from purple to yellow as depth increases from 1 to 64. (Left) GELU without unique kernel fixed point, leading to overfitting. (Right) ReLU with unique kernel fixed point, leading to underfitting. More examples in Appendix M.

2 Contributions and Motivation

This paper contains two main contributions. We:
1. Derive kernels for GELU and ELU activations (defined below) and verify our results numerically. We implement GPs with different NN kernels on some benchmarks.
2. Study the fixed point dynamics of the kernel when ψ is bounded by the absolute value of a polynomial. We find sufficient conditions for the existence of a unique kernel fixed point. We show theoretically and empirically that unlike the kernel corresponding to ReLU ψ, the new kernels are able to avoid unique fixed points. These conditions apply to both the iid prior (Neal 1995) and dynamic NTK (Jacot, Gabriel, and Hongler 2018) cases. This fixed point behaviour can be used to explain a simplicity bias in deep NNs. More surprisingly, we find theoretically that the NTK dynamics, which approximate training under gradient flow, preserve this simplicity bias.

2.1 Motivation for Studying Fixed Points

Viewing NNs through the arguably idealised lens of GPs has some surprisingly non-intuitive practical implications. One important open problem is in explaining the empirical observation that some overparameterised NNs do not overfit, even when trained without explicit regularisation (Zhang et al. 2017). Tsuchida, Roosta, and Gallagher (2019b) show empirically that samples from the limiting prior of deep MLPs with zero-mean parameters and LReLU activations are approximately constant on the unit hypersphere. Valle-Pérez, Camargo, and Louis (2019) argue that deep NNs with ReLU activations exhibit a "simplicity bias", in that randomly initialised NNs implementing Boolean functions are likely to be simple. Yang and Salman (2019) explain this simplicity bias through a spectral decomposition of the limiting kernel, showing that most of its mass is concentrated around the constant eigenfunction, even when accounting for training using gradient descent under the NTK (Jacot, Gabriel, and Hongler 2018). Yang and Salman (2019) are clear to separate the case of ReLU activations, which do result in kernels having peaked spectral mass and exhibiting a simplicity bias, from ERF activations which do not.

Our motivation for studying fixed points is in a similar spirit to the work above. For LReLU networks, we observe a so-called kernel fixed point. An infinitely deep LReLU network is degenerate, and therefore over-regularised, in that all functions in the prior are constant over inputs on any hypersphere. Therefore, increasingly deep kernels represent a strict and potentially undesirable prior.

²As detailed in these works, knowledge of (1) is often sufficient to describe these kernels, so our new kernels apply to more complicated networks.

k(x_1, x_2) = σ_b² + σ_w² [ (s_1 s_2 cos θ)/4 + (s_1 s_2)/(2π) ( s_1 s_2 ( (cos(2θ) + 3)/2 + s_1² + s_2² + s_1² s_2² sin²θ ) / ( (1 + s_1²)(1 + s_2²) √(1 + s_1² + s_2² + s_1² s_2² sin²θ) ) + cos θ · tan⁻¹( s_1 s_2 cos θ / √(1 + s_1² + s_2² + s_1² s_2² sin²θ) ) ) ]

Figure 2: The GELU kernel. Plots show the normalised kernels in layer l as a function of the angle θ^(0) between the inputs for MLPs of increasing depth when Σ = diag(σ_w², ..., σ_w², 0) and ‖x_i‖ is constant for all i. Values of σ_w are chosen to preserve the expected square norm ‖x‖² (see § 4.1). Solid curve shows the infinitely wide limit, and dots show samples from a network with 2 inputs and 3000 neurons in each layer. Each dot corresponds to an x_1 and x_2 generated through a random rotation of (1, 0)^⊤ and (cos θ^(0), sin θ^(0))^⊤. The random rotation is found through a QR decomposition of a matrix containing entries sampled independently from U[0, 1]. (Left) ‖x‖ = 0.5, σ_w = 1.59. (Middle) ‖x‖ = 1, σ_w = 1.47. (Right) ‖x‖ = 5, σ_w = 1.42.

On the other hand, kernels corresponding to GELU and ELU activation functions do not exhibit unique fixed points, and are therefore less biased towards simple functions. Just as traditional regularisation frameworks allow practitioners to control the bias-variance trade-off, in our framework the activation function represents a similar choice. A deep ReLU network contains a higher degree of implicit regularisation than a deep GELU network. An illustration of this effect is shown in Figure 1. Even more surprisingly, our analysis extends to the NTK.

2.2 Recently Introduced Activation Functions

The increased volume of gradient-based research has seen the introduction of new popular activation functions. Notably these include the exponential linear unit (ELU) (Clevert, Unterthiner, and Hochreiter 2016), the Gaussian error linear unit (GELU) (Hendrycks and Gimpel 2016) and the Swish (Ramachandran, Zoph, and Le 2017; Elfwing, Uchibe, and Doya 2018). The GELU and ELU are

ψ(z) = zΦ(z)   and   ψ(z) = Θ(z)z + Θ(−z)(e^z − 1),

respectively, where Φ denotes the CDF of the standard Gaussian and Θ denotes the Heaviside step function. Many state-of-the-art models use GELU (Radford et al. 2018; Devlin et al. 2019) or Swish activations (Chua et al. 2018). However, even when critically evaluating empirical evidence, it is difficult to determine whether certain activation functions are a better choice for a given problem, let alone separate the activation function expressivity from the ability of optimisers to find good solutions. Analysing activation functions through the lens of GPs allows one to visualise the function space in isolation of the ability of the optimiser to find good solutions, and reveals interesting implicit regularisation structure in the infinitely wide setting.

2.3 Model Selection

The choice of activation function in an NN can be framed as a model selection problem, and as such shares similarities with choosing other model hyperparameters. Take the case of choosing the L2 ridge-regularisation parameter as an example. One could apply n-fold cross-validation, with an implicit understanding that this parameter penalises model complexity as measured through the norm in an RKHS. Alternatively, a Bayesian might interpret the L2 weighting as the precision of a Gaussian prior, and optimise the marginal likelihood with respect to this weighting. A more pure Bayesian might put a prior over the regularisation (or precision) parameter and marginalise over these models. Common to all these approaches is (a) an intuition based on theory of what the parameter does and (b) a practical methodology for dealing with the parameter. In this paper we provide (a), and leave it to the practitioner to decide on (b) based on their philosophy and/or apparatus. Our work shows that GELU and ELU activations can avoid a regularisation mechanism that grows with depth and that is always implicit in ReLU activations.

We stress that the new kernels and the fixed point mechanisms are neither "good" nor "bad" in isolation. The new models should be evaluated according to model selection criteria in their given application, and the fixed point mechanisms describe a regularisation implicit in some kernels.

3 New Kernels

Proposition 3. When ψ is the GELU and µ = 0, the kernel (1) is given by the equation in Figure 2.
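For concreteness, the closed form in Figure 2 can be transcribed directly into code and sanity-checked against a Monte Carlo estimate of the expectation in (1). The sketch below is ours (function and variable names are not from the paper) and assumes µ = 0; the test values of s_1, s_2 and θ are arbitrary.

```python
# Sketch of the closed-form GELU kernel of Proposition 3 (mu = 0).
import numpy as np
from scipy.stats import norm

def gelu_kernel(s1, s2, theta, sigma_w=1.0, sigma_b=0.0):
    """k(x1, x2) with s_i = ||Sigma^{1/2} x~_i|| and theta the angle between them."""
    c, sn = np.cos(theta), np.sin(theta)
    delta = 1.0 + s1**2 + s2**2 + (s1 * s2 * sn) ** 2
    frac = (s1 * s2 * ((np.cos(2.0 * theta) + 3.0) / 2.0 + s1**2 + s2**2 + (s1 * s2 * sn) ** 2)
            / ((1.0 + s1**2) * (1.0 + s2**2) * np.sqrt(delta)))
    arc = c * np.arctan(c * s1 * s2 / np.sqrt(delta))
    return sigma_b**2 + sigma_w**2 * (s1 * s2 * c / 4.0 + s1 * s2 / (2.0 * np.pi) * (frac + arc))

# Monte-Carlo sanity check of the expectation in (1) with psi(z) = z Phi(z)
rng = np.random.default_rng(0)
s1, s2, theta = 1.2, 0.7, 0.9
z1 = rng.standard_normal(500_000)
z2 = np.cos(theta) * z1 + np.sin(theta) * rng.standard_normal(500_000)  # corr = cos(theta)
mc = np.mean((s1 * z1) * norm.cdf(s1 * z1) * (s2 * z2) * norm.cdf(s2 * z2))
print(mc, gelu_kernel(s1, s2, theta))   # should agree closely
```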

Figure 3: As in Figure 2, but with ELU ψ. (Left to right) (‖x‖, σ_w) = (5, 1.40), (1, 1.26), (0.5, 1.17).
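The depth curves in Figures 2 and 3 can be reproduced qualitatively by iterating the recursion (2). Below is a minimal sketch of ours using Gauss–Hermite quadrature for the bivariate expectation; it assumes σ_b = 0, µ = 0 and equal input norms, and takes σ_w values from the figure captions.

```python
# Iterating (2) to trace the normalised kernel with depth (our sketch).
import numpy as np
from scipy.stats import norm

nodes, weights = np.polynomial.hermite.hermgauss(80)

def biv_expect(psi, s1, s2, rho):
    """E[psi(s1 Z1) psi(s2 Z2)], (Z1, Z2) standard bivariate normal, corr rho."""
    z1 = np.sqrt(2.0) * nodes[:, None]
    z2 = rho * z1 + np.sqrt(max(1.0 - rho**2, 0.0)) * np.sqrt(2.0) * nodes[None, :]
    return float(np.sum(np.outer(weights, weights) / np.pi * psi(s1 * z1) * psi(s2 * z2)))

def normalised_kernel_at_depth(psi, cos0, norm_x, sigma_w, depth):
    """Iterate (2) starting from k^(0)(xi, xj) = sigma_w^2 xi . xj (sigma_b = 0)."""
    k11 = k22 = (sigma_w * norm_x) ** 2
    k12 = sigma_w**2 * norm_x**2 * cos0
    for _ in range(depth):
        a, b, c = np.sqrt(k11), np.sqrt(k22), np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        k11, k22 = sigma_w**2 * biv_expect(psi, a, a, 1.0), sigma_w**2 * biv_expect(psi, b, b, 1.0)
        k12 = sigma_w**2 * biv_expect(psi, a, b, c)
    return k12 / np.sqrt(k11 * k22)

gelu = lambda z: z * norm.cdf(z)
relu = lambda z: np.maximum(z, 0.0)
for depth in (1, 4, 16, 64):
    print(depth,
          normalised_kernel_at_depth(gelu, 0.5, 1.0, 1.47, depth),          # cf. Figure 2, middle
          normalised_kernel_at_depth(relu, 0.5, 1.0, np.sqrt(2.0), depth))  # ReLU comparison
```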

The derivation is in the same spirit as Williams (1997); we introduce dummy parameters β_1 and β_2 in the argument of Φ, differentiate with respect to β_1 and β_2 to obtain a PDE, then solve the PDE and evaluate the solution at β_1 = β_2 = 1. Complete working is given in Appendix A. It is plausible that our method of derivation extends to the case µ ≠ 0, although the calculations and resulting expression become more complicated. Even with µ = 0, this kernel has some interesting properties that we discuss in § 4. Interestingly, unlike the ELU kernel with µ = 0, the GELU kernel does not contain any hard-to-compute special functions, only some (inverse) trigonometric functions.

Our expression for the ELU kernel is lengthy and we do not assemble it in the main text, but provide a visualisation in the form of Figure 3.

Proposition 4. When ψ is the ELU, the kernel (1) has an analytical expression implemented in software¹ in terms of the univariate and bivariate normal CDFs.

Complete working is given in Appendix B. Unfortunately, the ELU kernel involves exponentiating arguments involving s_1 and s_2. This can lead to numerical instability in GP regression when many data points are involved. Despite this, having an analytical expression still allows us to gain insights into finite width networks, as we shall see in § 4.

The scaled exponential linear unit (SELU) (Klambauer et al. 2017) is a slightly modified version of the ELU which introduces two scaling parameters: one applied over the entire domain and one over the negative domain. Our analysis for the ELU also handles the SELU (see Appendix B). Klambauer et al. (2017) motivate the SELU by showing that it is able to avoid the exploding and vanishing gradient problem by carefully selecting the value of the scaling parameters. An entirely distinct problem is to analyse the correlations between signals in the network. Fixed points in signal norms are desirable to maintain the scale of signals as they propagate through the network. Fixed points in signal correlations can be undesirable as they force unrelated inputs to have similar feature representations in deep layers of the network. While Klambauer et al. (2017) is concerned with obtaining fixed points of signal norms (i.e. our § 4.1), it does not relate to fixed points of signal correlations (i.e. our § 4.2).

4 Fixed Point Analysis

In this section, we first analyse the conditions under which the expected squared norm of the signals in each layer is preserved as the signal propagates through the network for different choices of the activation function (§ 4.1). In such a situation, the expected squared norm remains at a fixed point as it passes through the network. Then, more generally, we analyse conditions under which the expected squared norm of any two signals and the cosine angle between them approach a constant as depth increases (i.e., when the kernel has a fixed point). We are especially interested in the case where the cosine angle between the signals converges to a unique fixed point (§ 4.2). Finally, we relate the existence of a unique fixed point to a degenerate, underfitting property of very deep infinitely wide MLPs (§ 4.3). For § 4, 5 and 6, we suppose all the weights have the same variance, and so do all the biases; the first d diagonals of the diagonal matrix Σ are σ_w², and the last diagonal is σ_b².

4.1 Warm-Up — Norm Preservation

A useful application of the kernel in finite-width iid-initialised networks is to track the expected squared norm of the signals in each layer as the depth of the network increases. This is used in initialisation to avoid exploding or vanishing signals when using gradient optimisers.

The expected norm squared of the signal in the first hidden layer is (k^{(1)}(x_1, x_1) − σ_b²)/σ_w². For the squared norm of the signal in the hidden layer to be the same as the squared norm of the input, we set ‖x̃_1‖² = (k^{(1)}(x_1, x_1) − σ_b²)/σ_w². We may then solve this condition to find the hyperparameter values that preserve input norms. For example, using the kernel corresponding to ReLU (Cho and Saul 2009), one obtains He initialisation (He et al. 2015), that σ_w = √2, where Σ^{1/2} = diag(σ_w, ..., σ_w, 0).

The analogue for GELU is more involved since no single σ_w preserves the expected square norms of all inputs. Setting σ_b = 0, k(x, x)/σ_w² = ‖x‖² and s_1 = s_2 = σ_w‖x‖ in the equation in Figure 2, we find a root σ*(‖x‖) of

g_{‖x‖}(σ) = σ⁴‖x‖² / ( π(σ²‖x‖² + 1)√(2σ²‖x‖² + 1) ) + (σ²/4)( 1 + (2/π) sin⁻¹( σ²‖x‖² / (1 + σ²‖x‖²) ) ) − 1

numerically.
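The root σ*(‖x‖) can be found with any standard one-dimensional root finder; a sketch of ours using scipy's brentq is below (the bracket endpoints are our choice). The roots should roughly reproduce the σ_w values quoted in the caption of Figure 2.

```python
# Solving g_{||x||}(sigma) = 0 for the GELU norm-preserving sigma_w (cf. Figure 4, left).
import numpy as np
from scipy.optimize import brentq

def g(sigma, norm_x):
    q = (sigma * norm_x) ** 2   # sigma^2 ||x||^2
    return (sigma**4 * norm_x**2 / (np.pi * (q + 1.0) * np.sqrt(2.0 * q + 1.0))
            + sigma**2 / 4.0 * (1.0 + 2.0 / np.pi * np.arcsin(q / (1.0 + q)))
            - 1.0)

for norm_x in (0.5, 1.0, 5.0):
    print(norm_x, brentq(g, 0.5, 3.0, args=(norm_x,)))
# expected roughly 1.59, 1.47 and 1.42, matching the values in the Figure 2 caption
```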

Figure 4 shows a plot of σ*(‖x‖) as ‖x‖ varies. The root of the limit of g_{‖x‖}(σ) as ‖x‖ → ∞ is √2, which recovers He initialisation. This implies that when data has large norms (such as images or audio files), He initialisation is suitable. The same procedure can be carried out for the ELU kernel as shown in Figure 4. This procedure may be viewed as a warm-up handling the special case of x_1 = x_2 for our general fixed point analysis.

4.2 General Fixed Point Analysis

Let S ⊆ [0, ∞) × [0, ∞) × [−1, 1]. In the infinitely wide limit, we may view each layer as updating a state (s_1², s_2², cos θ) ∈ S containing the expected square norms and the cosine angle between the signals in the hidden layers through a function g : S → S. Let (G_1, G_2)^⊤ ∼ N(0, I). We study the fixed-point dynamics of the iterated map g having components

g_1(s_1², s_2², ρ) = σ_w² E[ψ²(s_1 G_1)] + σ_b²,
g_2(s_1², s_2², ρ) = σ_w² E[ψ²(s_2 G_2)] + σ_b²,
g_3(s_1², s_2², ρ) = ( σ_w² E[ ψ(s_1 G_1) ψ( s_2(G_1 ρ + G_2 √(1 − ρ²)) ) ] + σ_b² ) / √( g_1(s_1², s_2², ρ) g_2(s_1², s_2², ρ) ),    (4)

which track the expected square norms (after a linear transformation involving σ_w² and σ_b²) and the normalised kernel as the signals propagate through the layers.³ By inspection, g_3 (but not necessarily g) always has an uncountable set of fixed points at ρ = 1 along s_1 = s_2. Banach's fixed point theorem says that if g is a contraction mapping on a closed set, then g has a unique fixed point on that set (Agarwal, Meehan, and O'Regan 2001). In a slightly different setting, Hasselblatt and Katok (2003, Theorem 2.2.16) allows some open sets.

Theorem 5. Let Dg denote the Jacobian of g and d′ denote the metric induced by a norm with induced matrix norm ‖·‖′. If C ⊂ R^m is an open strictly convex set, C̄ is its closure, g : C̄ → C̄ is differentiable on C and continuous on C̄ with ‖Dg‖′ ≤ λ < 1 on C, then g has a unique fixed point c_0 ∈ C̄ and d′( g^L(c), c_0 ) ≤ λ^L d′(c, c_0) for every c ∈ C̄.

We therefore consider the eigenvalues of the Jacobian, proving the following in Appendix C.

Theorem 6. Let g be as in (4), and suppose the absolute value of ψ is bounded by a polynomial. Let (Z_1, Z_2) ∼ N(0, S) with covariance ρ = cos θ and unit variances. Then for ρ ∈ (−1, 1) and 0 < s_1, s_2, the (unordered) eigenvalues of the Jacobian of g are

λ_1 := ∂g_1/∂s_1² = σ_w² E[ (Z_1² − 1) ψ²(s_1 Z_1) ] / (2 s_1²),
λ_2 := ∂g_2/∂s_2² = σ_w² E[ (Z_2² − 1) ψ²(s_2 Z_2) ] / (2 s_2²),
λ_3 := ∂g_3/∂ρ = ( σ_w² s_1 s_2 / √(g_1 g_2) ) E[ ψ′(s_1 Z_1) ψ′(s_2 Z_2) ],

provided the right hand terms are finite, where ψ′ is the distributional derivative of ψ.

Our result does not include ρ ∈ {−1, 1}. With the additional assumption that ψ is continuous almost everywhere, the expression for λ_3 is valid on the closed interval ρ ∈ [−1, 1], as shown in Appendix F. We would now like to combine Theorem 5 and Theorem 6 in order to comment on the existence of unique fixed points in some special cases. We consider two general cases in Corollaries 7 and 8 below.

³We expressed the kernel (1) in terms of iid Gaussians G instead of dependent Gaussians Z.
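The eigenvalues in Theorem 6 are low-dimensional Gaussian expectations, so the contraction condition of Theorem 5 can be checked numerically for any candidate ψ. The sketch below is ours and uses Gauss–Hermite quadrature; the activation, its derivative and the example arguments (s_1 = s_2 = σ_w = 1.47, roughly the norm-preserving ‖x‖ = 1 case) are illustrative assumptions.

```python
# Numerically evaluating lambda_1, lambda_2, lambda_3 of Theorem 6 (our sketch).
import numpy as np
from scipy.stats import norm

nodes, weights = np.polynomial.hermite.hermgauss(80)

def expect2(f, rho):
    """E[f(Z1, Z2)] for standard bivariate normals with correlation rho."""
    z1 = np.sqrt(2.0) * nodes[:, None]
    z2 = rho * z1 + np.sqrt(max(1.0 - rho**2, 0.0)) * np.sqrt(2.0) * nodes[None, :]
    return float(np.sum(np.outer(weights, weights) / np.pi * f(z1, z2)))

def jacobian_eigenvalues(psi, psi_prime, s1, s2, rho, sigma_w, sigma_b=0.0):
    g1 = sigma_w**2 * expect2(lambda a, b: psi(s1 * a) ** 2, 0.0) + sigma_b**2
    g2 = sigma_w**2 * expect2(lambda a, b: psi(s2 * b) ** 2, 0.0) + sigma_b**2
    lam1 = sigma_w**2 * expect2(lambda a, b: (a**2 - 1.0) * psi(s1 * a) ** 2, 0.0) / (2.0 * s1**2)
    lam2 = sigma_w**2 * expect2(lambda a, b: (b**2 - 1.0) * psi(s2 * b) ** 2, 0.0) / (2.0 * s2**2)
    lam3 = (sigma_w**2 * s1 * s2 / np.sqrt(g1 * g2)
            * expect2(lambda a, b: psi_prime(s1 * a) * psi_prime(s2 * b), rho))
    return lam1, lam2, lam3

# GELU: psi(z) = z Phi(z), psi'(z) = Phi(z) + z phi(z)
gelu = lambda z: z * norm.cdf(z)
gelu_prime = lambda z: norm.cdf(z) + z * norm.pdf(z)
print(jacobian_eigenvalues(gelu, gelu_prime, s1=1.47, s2=1.47, rho=0.5, sigma_w=1.47))
```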

Figure 4: (Left) Values σ* that preserve the layer-wise expected square norm of ‖x‖ in MLPs with GELU, ELU and ReLU activations. (Middle) Lower bound of λ_3 for GELU. (Right) λ_3 for ELU. If λ_3 ≤ 1 on θ ∈ (0, π), a unique fixed point exists.

Corollary 7 (Unique fixed point under absolute homogeneity). Suppose σ_b² = 0 (as is common when initialising neural networks) and ψ is absolutely homogeneous, that is, ψ(|a|z) = |a|ψ(z) for any a ∈ R. Then

∂g_3/∂ρ = λ_3 = E[ψ′(Z_1)ψ′(Z_2)] / E[ψ²(Z_1)].

Furthermore, if
• max_{i=1,2,3} |λ_i| < 1, then g admits a unique fixed point at (s², s², 1) for some s²;
• λ_3 < 1 and g_1(·, s_2², ρ) admits a fixed point s² for any s_2², ρ, then g_3(s², s², ·) admits a unique fixed point at 1.

Proof. Absolute homogeneity implies that

g_3(s_1², s_2², ρ) = σ_w² E[ ψ(s_1 G_1) ψ( s_2(G_1 ρ + G_2 √(1 − ρ²)) ) ] / ( √(σ_w² E[ψ²(s_1 G_1)]) √(σ_w² E[ψ²(s_2 G_2)]) )
                   = E[ ψ(G_1) ψ( G_1 ρ + G_2 √(1 − ρ²) ) ] / ( √(E[ψ²(G_1)]) √(E[ψ²(G_2)]) ),

so 0 = ∂g_3/∂s_1² = ∂g_3/∂s_2². Absolute homogeneity also implies that for all a ∈ R, ψ′(|a|z) = ψ′(z). Then by Theorem 6,

λ_3 = ( σ_w² s_1 s_2 / √(g_1 g_2) ) E[ψ′(s_1 Z_1)ψ′(s_2 Z_2)]
    = ( σ_w² s_1 s_2 / ( σ_w² s_1 s_2 E[ψ²(G_1)] ) ) E[ψ′(Z_1)ψ′(Z_2)]
    = E[ψ′(Z_1)ψ′(Z_2)] / E[ψ²(G_1)].

Note that the Jacobian

[ ∂g_1/∂s_1²  ∂g_1/∂s_2²  ∂g_1/∂ρ ]   [ λ_1   0    0  ]
[ ∂g_2/∂s_1²  ∂g_2/∂s_2²  ∂g_2/∂ρ ] = [  0   λ_2   0  ]
[ ∂g_3/∂s_1²  ∂g_3/∂s_2²  ∂g_3/∂ρ ]   [  0    0   λ_3 ]

is diagonal, and therefore the induced matrix norm (corresponding to a Euclidean vector norm) of the Jacobian, the largest singular value of the Jacobian, is simply the largest absolute value of the diagonal elements. Thus by Theorem 5, if max_{i=1,2,3} |λ_i| < 1 then g has a unique fixed point.

Alternatively, suppose λ_3 < 1 and g_1(·, s_2², ρ) admits a fixed point at s² for any s_2², ρ. Then g_3(s², s², ·) admits a unique fixed point by applying Theorem 5 to the 1D system g_3 with induced matrix norm |λ_3|.

Intuitively but informally, the absolute homogeneity condition in Corollary 7 leads to independent updates, so that g may be thought of as three functions g_1, g_2, g_3 whose inputs and outputs do not interact between iterations. When absolute homogeneity is removed, g_1 and g_2's inputs and outputs are not affected by g_3, but g_3's inputs are affected by the outputs of g_1 and g_2. This makes it more difficult to analyse exactly the same situation as in Corollary 7. However, if we fix the output of g_1 and g_2 at some fixed point (which are guaranteed to exist in the cases handled in § 4.1), then we can neglect interactions between the outputs of g_1, g_2 and the inputs of g_3 and analyse the iterates of g_3 as a univariate function. We proceed with this strategy in Corollary 8.

Corollary 8 (Unique fixed point of normalised kernel). If a fixed point s² of the system only involving g_1(·, s_2², ρ) : [0, ∞) → [0, ∞) exists when σ_w = σ*, we have

∂g_3/∂ρ = λ_3 = (σ*)² E[ψ′(sZ_1)ψ′(sZ_2)]

at the fixed point of g_1(·, s_2², ρ) and g_2(s_1², ·, ρ). Furthermore, if |λ_3| < 1, then g_3(s², s², ·) : [−1, 1] → [−1, 1] admits a unique fixed point at ρ = 1.

Proof. By Theorem 6, we have

λ_3 = ( σ_w² s_1 s_2 / √(g_1 g_2) ) E[ψ′(s_1 Z_1)ψ′(s_2 Z_2)] = (σ*)² E[ψ′(sZ_1)ψ′(sZ_2)].

g_3(s², s², ·) admits a unique fixed point by taking C = (−1, 1) and d′ as the Euclidean metric in Theorem 5.

4.3 Degenerate Priors and Posteriors

Having established conditions under which unique fixed points exist, we now examine what a unique fixed point implies for the limiting prior and posterior. The prior and posteriors are degenerate in the sense that they are almost surely constant over subsets of the input space. We first consider the limiting prior as the depth goes to infinity.

Proposition 9. Let {f^{(L)}(x)}_{x∈X} be a Gaussian process with mean zero and covariance function k^{(L)}. Suppose that lim_{L→∞} k^{(L)}(x_1, x_2) = lim_{L→∞} k^{(L)}(x_1, x_1) = lim_{L→∞} k^{(L)}(x_2, x_2) < ∞ for every x_1, x_2 ∈ X*. Then

lim_{L→∞} ( f^{(L)}(x_1) − f^{(L)}(x_2) ) = 0

almost surely. That is, all draws from the limiting prior are almost surely a constant function.

The proof is given in Appendix H. Suppose we take a priori the Gaussian process in Proposition 9 before the limit is taken, update our belief after observing some data to obtain the posterior and then take the limit. An interesting question is whether the limit commutes with the Bayesian update.

Proposition 10. Let {f^{(L)}(x)}_{x∈X} be a Gaussian process prior with mean zero and covariance function k^{(L)}. Fix some X* ⊆ X such that for every x_1, x_2 ∈ X* ⊆ X and x_3, x_4 ∈ X,
• lim_{L→∞} k^{(L)}(x_1, x_2) = lim_{L→∞} k^{(L)}(x_1, x_1) = lim_{L→∞} k^{(L)}(x_2, x_2) = C < ∞,
• lim_{L→∞} k^{(L)}(x_1, x_3) = lim_{L→∞} k^{(L)}(x_2, x_3) = D(x_3), and
• lim_{L→∞} k^{(L)}(x_3, x_4) < ∞ exists,

where C ∈ R and D : X → R may depend on X*. Fix some dataset X, Y, where each row X_i of X is in X. Then under Bayesian Gaussian process regression with Gaussian likelihood and strictly positive noise variance σ_n² > 0, all draws from the limiting posterior predictive distribution given observations X, Y over X* as L → ∞ are almost surely a constant function.

The proof is given in Appendix H. We are now ready to relate the existence of unique fixed points to degenerate priors and posteriors for some specific examples.

Example, LReLU. Taking X* to be any (subset of a) hypersphere in Propositions 9 and 10, we have the following result, the proof of which is given in Appendix H.

Corollary 11. Let X* be any hypersphere. Define a Gaussian process prior with a covariance function corresponding to an infinitely wide MLP with LReLU activations, µ = 0, σ_w² = 2 and depth L. Then as L → ∞, draws from the prior and posterior predictive distributions are almost surely constant over X*.

Examples, GELU and ELU. In contrast with LReLU activations, such a degeneracy guarantee does not exist for GELU and ELU activations. For both the GELU and ELU, we consider the dynamics on a ball of constant ‖x‖, where σ_w is chosen such that g_1 = ‖x‖. In Figure 4, for different values of ‖x‖ in the context of Corollary 8, we evaluate a lower bound for λ_3 in the case of GELU and λ_3 exactly in the case of ELU. Full working is given in Appendices G.1 and G.2. We observe that each exceeds 1 at some point on the interval (but not over the whole interval), and is therefore not a contraction mapping and hence not guaranteed to have a unique fixed point. This is consistent with Figures 2 and 3, where fixed points are shown by intersecting curves.

5 Extension of Theoretical Results to NTK

We may also study kernel fixed points of infinitely wide neural networks trained under gradient flow. This amounts to studying the fixed point properties of the neural tangent kernel (NTK). If such a unique fixed point exists, this implies that the functions obtained by applying gradient flow to an infinitely wide, infinitely deep MLP are also degenerate. In this section, we briefly sketch how such a result may be obtained. We leave the presentation of formal results and empirical evaluations for future work.

Informally, the value of the eigenvalue λ_3 can still predict the fixed point behaviour of the NTK. Formally, the result is slightly more involved; see Appendix I. Consider a state space S ⊂ R⁴ containing states (s_1², s_2², k, T) consisting of squared norms for both inputs, the kernel, and the NTK. We update the states through h : S → S:

h_i(s_1², s_2², k, T) = σ_w² E[ψ²(s_i Z_i)] + σ_b²,   i = 1, 2,
h_3(s_1², s_2², k, T) = σ_w² E[ψ(s_1 Z_1)ψ(s_2 Z_2)] + σ_b²,
h_4(s_1², s_2², k, T) = T σ_w² E[ψ′(s_1 Z_1)ψ′(s_2 Z_2)] + σ_w² E[ψ(s_1 Z_1)ψ(s_2 Z_2)] + σ_b²,

where Cov(Z_1, Z_2) = k/(s_1 s_2) and E[Z_1] = E[Z_2] = 0. As in Corollary 8, if a fixed point of the system only involving s_1² and h_1 exists at a value σ_w = σ* and s_1 = s_2 = s, then the system reduces to a 2-dimensional update involving only h_3 and h_4 along s_1 = s_2 = s. The Jacobian

J = [ ∂h_3/∂k  ∂h_3/∂T ]   [ ∂h_3/∂k     0     ]
    [ ∂h_4/∂k  ∂h_4/∂T ] = [ ∂h_4/∂k  ∂h_4/∂T  ]

is lower triangular, so the eigenvalues are ∂h_3/∂k and ∂h_4/∂T. By Theorem 6,

∂h_3/∂k = (∂h_3/∂ρ)(∂k/∂ρ)^{−1} = (∂g_3/∂ρ) √(h_1 h_2) (s_1 s_2)^{−1}
        = (σ*)² E[ψ′(s_1 Z_1)ψ′(s_2 Z_2)] = ∂h_4/∂T = λ_3.

Figure 5: Posterior predictive ± 2 standard deviations.

6 Gaussian Process Experiments

We perform two sets of experiments. In the first, we investigate the performance of GPs with various neural network kernels. In the second, we observe the degree of implicit regularisation that is obtained using GPs of finite depth.

6.1 Benchmarking

We provide a software implementation of our new kernels. To demonstrate usage of our covariance functions, first we compare the performance of GP regression models using ReLU, LReLU, ERF and GELU kernels on a popular Bayesian deep learning benchmark (Hernández-Lobato and Adams 2015). The purpose of these experiments is not to showcase the superiority of one prior over another, but rather to provide a sample implementation. This implementation has already been ported over to another framework in concurrent work by others (Novak et al. 2020). The ELU kernel was not included in our experiments (see § 3). We perform separate experiments on shallow models having 1 hidden layer and deep models having up to 32 hidden layers. All data was standardised to have mean 0 and variance 1.

Shallow models. Do differences in priors induced by the various activation functions affect empirical performance? Using the limiting GP allows us to remove the interaction between ψ and optimisation, and purely consider the effect of ψ on the functional prior. Figure 5 shows the predictive distribution of GPs with GELU, ReLU, LReLU and ERF kernels on a toy regression task. ERF has different extrapolation properties due to being a bounded activation, whilst the others appear qualitatively similar, though with extrapolation variance decreasing in the order GELU/ReLU/LReLU. Figure 6 shows benchmark results for single-hidden-layer GPs using a 90%/10% training/test split. See Appendix J.1 for more details and plots. All kernels perform comparably; gains can be made by selecting a kernel suited to the dataset. Results are most different for ERF — either negatively (Concrete, Energy) or positively (Boston, Protein, Wine). Differences are observed between GELU/ReLU/LReLU. For example, GELU offers an advantage in Naval and Yacht, and LReLU performs poorly on Protein.

None of the kernels consistently outperform the others. This is expected behaviour, similar to how a Matérn kernel might outperform a squared exponential kernel only some of the time on real-world datasets. The purpose of the experiment was to evaluate whether different ψ result in a strong enough difference in priors that empirical performance differences can be observed. Having answered in the positive, we posit that for finite-width networks, the difference in performance found by varying ψ may partially derive from differences in the induced prior. This is in contrast to previously cited reasons such as bias shift and its relation to natural gradient (Clevert, Unterthiner, and Hochreiter 2016).
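A single-hidden-layer GP regression with one of these kernels only needs a Gram matrix built from the closed form together with the usual GP predictive equations. The sketch below is ours and illustrates the kind of model used in Figures 5 and 6; it reuses the gelu_kernel function from the sketch following Proposition 3, and the data and hyperparameter values are invented for illustration.

```python
# GP regression with the single-hidden-layer GELU kernel (our sketch).
import numpy as np

def gelu_gram(XA, XB, sigma_w, sigma_b):
    """Gram matrix of the GELU kernel; uses gelu_kernel from the earlier sketch."""
    A = np.hstack([sigma_w * XA, sigma_b * np.ones((len(XA), 1))])   # Sigma^{1/2} (x, 1)
    B = np.hstack([sigma_w * XB, sigma_b * np.ones((len(XB), 1))])
    sA, sB = np.linalg.norm(A, axis=1), np.linalg.norm(B, axis=1)
    cos = np.clip(A @ B.T / np.outer(sA, sB), -1.0, 1.0)
    K = np.empty_like(cos)
    for i in range(len(XA)):
        for j in range(len(XB)):
            K[i, j] = gelu_kernel(sA[i], sB[j], np.arccos(cos[i, j]), sigma_w, sigma_b)
    return K

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, (40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
Xstar = np.linspace(-5.0, 5.0, 200)[:, None]
K = gelu_gram(X, X, sigma_w=1.5, sigma_b=0.5) + 0.1**2 * np.eye(len(X))   # add noise variance
Kstar = gelu_gram(Xstar, X, sigma_w=1.5, sigma_b=0.5)
posterior_mean = Kstar @ np.linalg.solve(K, y)                            # GP predictive mean
```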

Figure 7: RMSE against σ_w² for equivalent ℓ-layer GPs, Wine dataset. Shaded region shows ±1 standard deviation. (Clockwise from top left) ReLU, GELU, LReLU, ERF.

Deep models. How does the performance of models vary with depth? We randomly shuffled the data into an 80%/20% train/test split 5 times. For each split, we ran GP regression with an additive iid Gaussian noise model having variance fixed at 0.1. We varied the depth ℓ ∈ [1, 32] in steps of 1, and the weight and bias variances (which were constrained to be equal in each layer) σ_w² ∈ [0.1, 5] in steps of 0.1. For each setting, we measured the RMSE between the mean of the GP prediction and the true regression targets. Figure 7 shows the average RMSE on the Wine dataset over 5 shuffles. Other datasets are given in Appendix J.2. We make two qualitative observations. Firstly, ℓ > 1 models out-perform ℓ = 1 models. Secondly, the RMSE changes smoothly in both depth and σ_w². The visual smoothness is not due to averaging over 5 trials; smoothness is also observed when we plot results from only 1 random shuffling. Table 1 shows the best models obtained over the grid search for each kernel.

Figure 6: RMSE for equivalent single-hidden-layer GPs. Mean ± 2 standard errors (over 20 runs).

6.2 Overfitting and Underfitting

We empirically investigate the relationship between depth and training/testing error. When the covariance function has a unique fixed point, we expect to see underfitting at large depth, since large depth will push the kernel towards the unique fixed point, that is, a constant normalised kernel. On the other hand, when the covariance function does not have a unique fixed point, we might expect to see overfitting as model complexity may increase with depth.

We build predictor variables of a training dataset by uniformly sampling x ∈ R² on the unit disc at heading γ. We then sample training targets through the mapping y = f(γ) + ε for a number of different choices of f, where ε is additive Gaussian noise with variance fixed at 0.1. We find the posterior predictive mean of a Gaussian process with iterated GELU and ReLU covariance functions of depth L between 1 and 100, with a choice of σ_w according to Figure 4. We repeat this process with a new random training set 10 times. Each repetition, we find the mean-squared error (MSE) between the posterior mean of the Gaussian process over the training set and a test set built from a (deterministic) uniform grid of size 100. Figure 8 shows the resulting train and test errors on one choice of f, and Appendix L shows other choices of f. Figure 1 shows a more direct illustration of the function fit on one of the random data samples, with other choices of f shown in Appendix L.
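A stripped-down version of this experiment might look as follows. The sketch is ours: it reads the inputs as points at unit radius with heading γ, takes f(γ) = sin γ, and reuses the quadrature helper biv_expect and the GELU definition from the earlier sketch; the depth and σ_w = 1.47 (the norm-preserving value for ‖x‖ = 1) are illustrative choices.

```python
# Toy Section 6.2 experiment with a depth-L iterated GELU kernel GP (our sketch).
import numpy as np

def deep_gelu_kernel(x1, x2, sigma_w, depth):
    """k^(depth)(x1, x2) by iterating (2); uses biv_expect and gelu from the earlier sketch."""
    k11, k22 = sigma_w**2 * (x1 @ x1), sigma_w**2 * (x2 @ x2)
    k12 = sigma_w**2 * (x1 @ x2)
    for _ in range(depth):
        a, b = np.sqrt(k11), np.sqrt(k22)
        c = np.clip(k12 / (a * b), -1.0, 1.0)
        k11, k22 = sigma_w**2 * biv_expect(gelu, a, a, 1.0), sigma_w**2 * biv_expect(gelu, b, b, 1.0)
        k12 = sigma_w**2 * biv_expect(gelu, a, b, c)
    return k12

rng = np.random.default_rng(2)
gamma = rng.uniform(0.0, 2.0 * np.pi, 30)
X = np.stack([np.cos(gamma), np.sin(gamma)], axis=1)          # unit-radius inputs at heading gamma
y = np.sin(gamma) + np.sqrt(0.1) * rng.standard_normal(30)    # noise variance 0.1
depth, sigma_w = 16, 1.47
K = np.array([[deep_gelu_kernel(a, b, sigma_w, depth) for b in X] for a in X])
gamma_t = np.linspace(0.0, 2.0 * np.pi, 100)
Xt = np.stack([np.cos(gamma_t), np.sin(gamma_t)], axis=1)
Kt = np.array([[deep_gelu_kernel(a, b, sigma_w, depth) for b in X] for a in Xt])
mean = Kt @ np.linalg.solve(K + 0.1 * np.eye(30), y)
print("test MSE:", np.mean((mean - np.sin(gamma_t)) ** 2))
```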

           ReLU                      GELU                      LReLU                     ERF
           RMSE         σ_w²   ℓ     RMSE         σ_w²   ℓ     RMSE         σ_w²   ℓ     RMSE         σ_w²   ℓ
Boston     2.85 ± 0.64  1.90   7     2.86 ± 0.65  1.80   6     2.60 ± 1.07  2.00   32    2.69 ± 0.95  5.00   2
Concrete   5.22 ± 0.55  5.00   2     5.21 ± 0.56  5.00   2     5.23 ± 0.44  3.30   3     5.63 ± 0.46  5.00   2
Energy     0.89 ± 0.11  5.00   2     0.92 ± 0.12  5.00   2     2.77 ± 0.31  0.10   1     2.79 ± 0.23  0.10   1
Wine       1.15 ± 0.13  5.00   4     1.17 ± 0.14  5.00   4     1.04 ± 0.12  5.00   5     3.96 ± 0.75  5.00   1
Yacht      0.58 ± 0.01  4.80   32    0.58 ± 0.01  2.30   32    0.60 ± 0.02  1.80   29    0.57 ± 0.02  5.00   8

Table 1: Best performing models for each kernel over the grid search.

Figure 8: Training and testing errors for GPs with covariance functions corresponding to infinitely wide MLPs of increasing depth L, using the same example as in Figure 1. Solid curve shows the mean over 10 training data samples, and the shaded region shows ± two standard deviations. By examining whether the training and testing error increases or decreases with depth, we observe that the GELU and ReLU respectively overfit and underfit with depth. See Appendix L for more error curves. (Left) GELU. (Right) ReLU.

7 Discussion and Conclusion

We introduced two new positive semi-definite kernels arising from the infinite-width limit of Bayesian GELU or ELU NNs. We provided visualisations of these kernels for varying depths. We introduced a general framework for understanding the fixed-point dynamics of such kernels and their NTK counterparts. Using this framework, we showed that unlike the ReLU, the GELU and ELU kernels are able to avoid unique fixed points. We empirically verified that finite-width NNs are able to avoid unique kernel fixed points in Figures 2 and 3. We applied our kernels in the setting of shallow and deep GP regression, finding that for some problems specific kernels are more appropriate, and that the GELU kernel is competitive with the ReLU kernel.

Investigations into implicit regularisation consider the role of one or all of (a) the architecture, (b) the learning algorithm and (c) the data sampling process. Neyshabur, Tomioka, and Srebro (2015) argue that (b) leads implicitly to low-norm solutions, explaining the generalisation ability of deep NNs. On the other hand, Dereziński, Liang, and Mahoney (2019) construct a data sampling distribution (c) that explains double descent and implicit regularisation in linear models. Similar to our work but not considering the NTK, the signal propagation literature (Schoenholz et al. 2017; Poole et al. 2016) explains simplicity biases in randomly initialised networks (a). They develop objects similar to ∂g_3/∂ρ, but require bounded activations, and seem to also require some notion of differentiability. Our analysis considers (a) and (b) but not (c). We have recently been made aware of the concurrent work of Huang et al. (2020), who also study the degeneracy of processes induced through ReLU activations. Their focus is on both (a) and (b), arguing that the NTK for residual architectures does not suffer from this degeneracy.

Knowing the gradient of the kernel with respect to (σ_w^(l))² and (σ_b^(l))² is useful for both empirical Bayesian methodologies (e.g. optimising the marginal likelihood using LBFGS) and hierarchical models (e.g. using HMC to integrate out the hyperprior). As we detail in Appendix K, our results may be used to find this gradient. Theorem 6 provides 7 of the 9 elements of the Jacobian. The other 2 can only easily be evaluated in special cases (e.g. LReLU results in a diagonal Jacobian). In future work, it may be interesting to extend Theorem 6 to cover the remaining elements of the Jacobian.

While Lee et al. (2019) found close agreement between NNs and their corresponding limiting GPs, several authors (Neal 1995; MacKay 2003; Der and Lee 2006; Matthews et al. 2018; Chizat, Oyallon, and Bach 2019; Tsuchida, Roosta, and Gallagher 2019b; Allen-Zhu, Li, and Liang 2019; Peluchetti, Favaro, and Fortini 2020; Aitchison 2020) have argued against the use of GP models as a means to understand the success of deep learning. If deep learning's performance can be explained using GPs, why do NN models outperform their limiting GP counterparts? Arora et al. (2019) attain 77% test accuracy on CIFAR10 using a limiting GP arising from a trained CNN, while a ResNet is able to achieve 96% (Springenberg et al. 2015). On the other hand, Arora et al. (2020) find that GP models are competitive on small datasets. It remains to determine if this difference in performance is due to tricks for which equivalences in the GP setting have not yet been fully explored, or if it is the result of some deeper property of GPs. Lee et al. (2020) empirically explore some of these questions, including the effects of finite width, architectures, weight decay and the performance difference between infinite Bayesian and NTK models. While we acknowledge the limitations of the infinitely wide approach, we believe it warrants further exploration, if not to understand the power of deep learning, at least to investigate its generalisation abilities. The purpose of our study was not to optimise any architecture for performance on a particular problem, but rather to develop results under the GP framework that contribute to our understanding of generalisation in the overparameterised setting.

Acknowledgements

This work was partially funded by CSIRO's Machine Learning and Artificial Intelligence Future Science Platform.

TP was funded through EPSRC (EP/N509620/1). CvdH was supported by ACEMS under ARC grant number CE140100049. We would like to thank Bob Williamson for a helpful discussion.

References

Agarwal, R. P.; Meehan, M.; and O'Regan, D. 2001. Fixed point theory and applications, volume 141. Cambridge University Press.

Aitchison, L. 2020. Why bigger is not always better: on finite and infinite neural networks. In International Conference on Machine Learning, 156–164.

Allen-Zhu, Z.; Li, Y.; and Liang, Y. 2019. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, 6155–6166.

Arora, S.; Du, S. S.; Hu, W.; Li, Z.; Salakhutdinov, R. R.; and Wang, R. 2019. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, 8139–8148.

Arora, S.; Du, S. S.; Li, Z.; Salakhutdinov, R.; Wang, R.; and Yu, D. 2020. Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks. In International Conference on Learning Representations.

Billingsley, P. 1995. Probability and measure. John Wiley & Sons, 3rd edition.

Bui, T.; Hernández-Lobato, D.; Hernández-Lobato, J.; Li, Y.; and Turner, R. 2016. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, 1472–1481.

Chizat, L.; Oyallon, E.; and Bach, F. 2019. On Lazy Training in Differentiable Programming. In Advances in Neural Information Processing Systems.

Cho, Y.; and Saul, L. K. 2009. Kernel Methods for Deep Learning. In Advances in Neural Information Processing Systems, 342–350.

Chua, K.; Calandra, R.; McAllister, R.; and Levine, S. 2018. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 4754–4765.

Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2016. Fast and accurate deep network learning by exponential linear units (ELUs). In The International Conference on Learning Representations.

Der, R.; and Lee, D. D. 2006. Beyond Gaussian processes: On the distributions of infinite networks. In Advances in Neural Information Processing Systems, 275–282.

Dereziński, M.; Liang, F.; and Mahoney, M. W. 2019. Exact expressions for double descent and implicit regularization via surrogate random design. In arXiv preprint arXiv:1912.04533.

Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186.

Elfwing, S.; Uchibe, E.; and Doya, K. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107: 3–11.

Garriga-Alonso, A.; Rasmussen, C. E.; and Aitchison, L. 2018. Deep Convolutional Networks as shallow Gaussian Processes. In International Conference on Learning Representations.

Hasselblatt, B.; and Katok, A. 2003. A first course in dynamics: with a panorama of recent developments. Cambridge University Press.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 1026–1034.

Hendrycks, D.; and Gimpel, K. 2016. Gaussian error linear units (GELUs). In arXiv preprint arXiv:1606.08415.

Hernández-Lobato, J. M.; and Adams, R. P. 2015. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning.

Huang, K.; Wang, Y.; Tao, M.; and Zhao, T. 2020. Why Do Deep Residual Networks Generalize Better than Deep Feedforward Networks? — A Neural Tangent Kernel Perspective. Advances in Neural Information Processing Systems 33.

Jacot, A.; Gabriel, F.; and Hongler, C. 2018. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 8571–8580.

Jones, D. S. 1982. The theory of generalised functions. Cambridge; New York: Cambridge University Press, 2nd edition.

Klambauer, G.; Unterthiner, T.; Mayr, A.; and Hochreiter, S. 2017. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, 971–980.

Le Roux, N.; and Bengio, Y. 2007. Continuous neural networks. In Artificial Intelligence and Statistics, 404–411.

Lee, J.; Bahri, Y.; Novak, R.; Schoenholz, S. S.; Pennington, J.; and Sohl-Dickstein, J. 2018. Deep neural networks as Gaussian processes. In The International Conference on Learning Representations.

Lee, J.; Schoenholz, S.; Pennington, J.; Adlam, B.; Xiao, L.; Novak, R.; and Sohl-Dickstein, J. 2020. Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems 33.

Lee, J.; Xiao, L.; Schoenholz, S.; Bahri, Y.; Novak, R.; Sohl-Dickstein, J.; and Pennington, J. 2019. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, 8570–8581.

MacKay, D. J. 2003. Information theory, inference and learning algorithms, 547. Cambridge University Press.

Matthews, A. G. d. G.; Rowland, M.; Hron, J.; Turner, R. E.; and Ghahramani, Z. 2018. Gaussian process behaviour in wide deep neural networks. In The International Conference on Learning Representations.

Meronen, L.; Irwanto, C.; and Solin, A. 2020. Stationary Activations for Uncertainty Calibration in Deep Learning. Advances in Neural Information Processing Systems 33.

Neal, R. M. 1995. Bayesian learning for neural networks. Ph.D. thesis, University of Toronto.

Neyshabur, B.; Tomioka, R.; and Srebro, N. 2015. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations (workshop track).

Novak, R.; Xiao, L.; Hron, J.; Lee, J.; Alemi, A. A.; Sohl-Dickstein, J.; and Schoenholz, S. S. 2020. Neural Tangents: Fast and Easy Infinite Neural Networks in Python. In International Conference on Learning Representations. URL https://github.com/google/neural-tangents.

Novak, R.; Xiao, L.; Lee, J.; Bahri, Y.; Yang, G.; Hron, J.; Abolafia, D. A.; Pennington, J.; and Sohl-Dickstein, J. 2019. Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes. In The International Conference on Learning Representations.

Pearce, T.; Tsuchida, R.; Zaki, M.; Brintrup, A.; and Neely, A. 2019. Expressive Priors in Bayesian Neural Networks: Kernel Combinations and Periodic Functions. In Uncertainty in Artificial Intelligence.

Peluchetti, S.; Favaro, S.; and Fortini, S. 2020. Stable behaviour of infinitely wide deep neural networks. In International Conference on Artificial Intelligence and Statistics, 1137–1146. PMLR.

Poole, B.; Lahiri, S.; Raghu, M.; Sohl-Dickstein, J.; and Ganguli, S. 2016. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, 3360–3368.

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. In Preprint.

Ramachandran, P.; Zoph, B.; and Le, Q. V. 2017. Searching for activation functions. In arXiv preprint arXiv:1710.05941.

Rosenbaum, S. 1961. Moments of a truncated bivariate normal distribution. Journal of the Royal Statistical Society: Series B (Methodological) 23(2): 405–408.

Schoenholz, S. S.; Gilmer, J.; Ganguli, S.; and Sohl-Dickstein, J. 2017. Deep information propagation. In International Conference on Learning Representations.

Sheppard, W. F. 1899. On the application of the theory of error to cases of normal distribution and normal correlation. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character (192): 140.

Springenberg, J.; Dosovitskiy, A.; Brox, T.; and Riedmiller, M. 2015. Striving for Simplicity: The All Convolutional Net. In International Conference on Learning Representations (workshop track).

Tsuchida, R.; Roosta, F.; and Gallagher, M. 2018. Invariance of Weight Distributions in Rectified MLPs. In International Conference on Machine Learning, 5002–5011.

Tsuchida, R.; Roosta, F.; and Gallagher, M. 2019a. Exchangeability and Kernel Invariance in Trained MLPs. In International Joint Conference on Artificial Intelligence.

Tsuchida, R.; Roosta, F.; and Gallagher, M. 2019b. Richer priors for infinitely wide multi-layer perceptrons. In arXiv preprint arXiv:1911.12927.

Valle-Pérez, G.; Camargo, C. Q.; and Louis, A. A. 2019. Deep learning generalizes because the parameter-function map is biased towards simple functions. In International Conference on Learning Representations.

Williams, C. K. 1997. Computing with infinite networks. In Advances in Neural Information Processing Systems, 295–301.

Yang, G. 2019a. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. In arXiv preprint arXiv:1902.04760.

Yang, G. 2019b. Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes. In Advances in Neural Information Processing Systems, 9947–9960.

Yang, G.; and Salman, H. 2019. A fine-grained spectral perspective on neural networks. In arXiv preprint arXiv:1907.10599.

Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2017. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
