Stable behaviour of infinitely wide deep neural networks

Stefano Favaro (Department ESOMAS, University of Torino and Collegio Carlo Alberto, [email protected]), Sandra Fortini (Department of Decision Sciences, Bocconi University, [email protected]), Stefano Peluchetti (Cogent Labs, [email protected])

Abstract

We consider fully connected feed-forward deep neural networks (NNs) whose weights and biases are independent and identically distributed according to symmetric centered stable distributions. We show that the infinite-width limit of the NN, under a suitable scaling of the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions. The limiting process is referred to as the stable process, and it generalizes the class of Gaussian processes recently obtained as infinite-width limits of NNs (Matthews et al., 2018b). The parameters of the stable process can be computed via an explicit recursion over the layers of the network. Our result contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to extend recent lines of research that rely on Gaussian infinite-width limits.

1 Introduction

The connection between infinitely wide deep feed-forward neural networks (NNs), whose parameters at initialization are independent and identically distributed (iid) as scaled and centered Gaussian distributions, and Gaussian processes (GPs) is well known (Neal, 1995; Der and Lee, 2006; Lee et al., 2018; Matthews et al., 2018a,b; Yang, 2019). Recently, this intriguing connection has been exploited in several research directions, including: i) Bayesian inference for GPs arising from infinitely wide networks (Lee et al., 2018; Garriga-Alonso et al., 2019); ii) kernel regression for infinitely wide networks trained with continuous-time gradient descent, via the neural tangent kernel (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019); iii) analysis of the properties of infinitely wide networks as a function of depth, via the information propagation framework (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019). A substantial gap has been observed between finite NNs and their infinitely wide GP counterparts in terms of empirical performance, at least on some standard benchmarks. Moreover, avoiding undesirable empirical properties arising in very deep networks has proved to be a difficult task. For these reasons, there is increasing interest in going beyond the GPs arising in the limit of infinitely wide NNs, as a way to close, or even reverse, this empirical performance gap and to avoid, or slow down, pathological behaviors in very deep NNs.

Let $\mathcal{N}(\mu, \sigma^2)$ denote the Gaussian distribution with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 \in \mathbb{R}_+$. Following the celebrated work of Neal (1995), we consider the shallow NN

\[
f_i^{(1)}(x) = \sum_{j=1}^{I} w_{i,j}^{(1)} x_j + b_i^{(1)}, \qquad
f_i^{(2)}(x, n) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} w_{i,j}^{(2)} \phi\big(f_j^{(1)}(x)\big) + b_i^{(2)},
\]

where $\phi$ is a nonlinearity, $i = 1, \dots, n$, $w_{i,j}^{(1)}, w_{i,j}^{(2)} \stackrel{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$, $b_i^{(1)}, b_i^{(2)} \stackrel{iid}{\sim} \mathcal{N}(0, \sigma_b^2)$, and $x \in \mathbb{R}^I$ is the input. It follows that

\[
f_i^{(1)}(x) \stackrel{iid}{\sim} \mathcal{N}\big(0, \sigma^2_{f^{(1)}}(x)\big), \qquad
f_i^{(2)}(x, n) \mid f^{(1)} \stackrel{iid}{\sim} \mathcal{N}\big(0, \sigma^2_{f^{(2)}}(x, n)\big),
\]

where

\[
\sigma^2_{f^{(1)}}(x) = \sigma_b^2 + \sigma_w^2 \sum_{j=1}^{I} x_j^2
\]

and

\[
\sigma^2_{f^{(2)}}(x, n) = \sigma_b^2 + \sigma_w^2 \, \frac{1}{n} \sum_{j=1}^{n} \phi\big(f_j^{(1)}(x)\big)^2 .
\]

If $x'$ is another input, we obtain the bivariate Gaussian distributions

\[
\big(f_i^{(1)}(x), f_i^{(1)}(x')\big) \stackrel{iid}{\sim} \mathcal{N}_2\big(0, \Sigma_{f^{(1)}}(x, x')\big), \qquad
\big(f_i^{(2)}(x, n), f_i^{(2)}(x', n)\big) \mid f^{(1)} \stackrel{iid}{\sim} \mathcal{N}_2\big(0, \Sigma_{f^{(2)}}(x, x', n)\big),
\]

where

\[
\Sigma_{f^{(1)}}(x, x') = \begin{bmatrix} \sigma^2_{f^{(1)}}(x) & c_{f^{(1)}}(x, x') \\ c_{f^{(1)}}(x, x') & \sigma^2_{f^{(1)}}(x') \end{bmatrix}, \qquad
\Sigma_{f^{(2)}}(x, x', n) = \begin{bmatrix} \sigma^2_{f^{(2)}}(x, n) & c_{f^{(2)}}(x, x', n) \\ c_{f^{(2)}}(x, x', n) & \sigma^2_{f^{(2)}}(x', n) \end{bmatrix},
\]

\[
c_{f^{(1)}}(x, x') = \sigma_b^2 + \sigma_w^2 \sum_{j=1}^{I} x_j x'_j, \qquad
c_{f^{(2)}}(x, x', n) = \sigma_b^2 + \sigma_w^2 \, \frac{1}{n} \sum_{j=1}^{n} \phi\big(f_j^{(1)}(x)\big)\phi\big(f_j^{(1)}(x')\big).
\]

Let $\stackrel{a.s.}{\rightarrow}$ denote almost sure convergence. By the strong law of large numbers, as $n \rightarrow +\infty$,

\[
\frac{1}{n} \sum_{j=1}^{n} \phi\big(f_j^{(1)}(x)\big)^2 \stackrel{a.s.}{\rightarrow} \mathbb{E}\big[\phi\big(f_1^{(1)}(x)\big)^2\big], \qquad
\frac{1}{n} \sum_{j=1}^{n} \phi\big(f_j^{(1)}(x)\big)\phi\big(f_j^{(1)}(x')\big) \stackrel{a.s.}{\rightarrow} \mathbb{E}\big[\phi\big(f_1^{(1)}(x)\big)\phi\big(f_1^{(1)}(x')\big)\big],
\]

from which one can conjecture that, in the limit of infinite width, the stochastic processes $f_i^{(2)}(x)$ are distributed as iid (over $i$) centered GPs with kernel $K(x, x') = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}\big[\phi\big(f_1^{(1)}(x)\big)\phi\big(f_1^{(1)}(x')\big)\big]$. Provided that the nonlinearity $\phi$ is chosen so that $\phi\big(f_1^{(1)}(x)\big)$ has finite second moment, Matthews et al. (2018b) made this argument rigorous and extended it to deep NNs. A small numerical illustration of this Gaussian argument is sketched below.
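The following minimal simulation (ours, not part of the paper; all names and parameter values are illustrative) compares the empirical covariance of $\big(f^{(2)}(x, n), f^{(2)}(x', n)\big)$ over independent wide shallow networks with a Monte Carlo estimate of the kernel $K(x, x')$ for $\phi = \tanh$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w, sigma_b = 1.0, 0.5
I, n, R = 3, 2000, 4000          # input dim, width, number of independent networks
phi = np.tanh
x, xp = rng.normal(size=I), rng.normal(size=I)

# Sample R independent shallow networks and record (f^(2)(x, n), f^(2)(x', n)).
outs = np.empty((R, 2))
for r in range(R):
    W1 = rng.normal(0.0, sigma_w, size=(n, I))
    b1 = rng.normal(0.0, sigma_b, size=n)
    W2 = rng.normal(0.0, sigma_w, size=n)
    b2 = rng.normal(0.0, sigma_b)
    h_x, h_xp = phi(W1 @ x + b1), phi(W1 @ xp + b1)
    outs[r, 0] = W2 @ h_x / np.sqrt(n) + b2
    outs[r, 1] = W2 @ h_xp / np.sqrt(n) + b2

emp_cov = np.cov(outs.T)[0, 1]

# Monte Carlo estimate of K(x, x') = sigma_b^2 + sigma_w^2 E[phi(f1(x)) phi(f1(x'))].
w1 = rng.normal(0.0, sigma_w, size=(10**6, I))
b1 = rng.normal(0.0, sigma_b, size=10**6)
K = sigma_b**2 + sigma_w**2 * np.mean(phi(w1 @ x + b1) * phi(w1 @ xp + b1))

print(f"empirical covariance: {emp_cov:.4f}   kernel K(x, x'): {K:.4f}")
```

The two printed values should agree up to Monte Carlo error, which is the content of the conjecture above.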

A key assumption underlying the interplay between infinitely wide NNs and GPs is the finiteness of the variance of the parameters' distribution at initialization. In this paper we remove the finite-variance assumption by considering iid initializations based on stable distributions, which include Gaussian initializations as a special case. We study the infinite-width limit of fully connected feed-forward NNs in the following general setting: i) the NN is deep, namely it is composed of multiple layers; ii) biases and suitably scaled weights are iid according to centered symmetric stable distributions; iii) the width of the network's layers goes to infinity jointly over the layers, and not sequentially layer by layer; iv) the convergence in distribution is established jointly for multiple inputs, namely the convergence concerns the class of finite-dimensional distributions of the NN viewed as a stochastic process in function space. See Neal (1995) and Der and Lee (2006) for early works on NNs under stable initialization.

Within this setting, we show that the infinite-width limit of the NN, under a suitable scaling of the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions (Samoradnitsky, 2017). This process is referred to as the stable process. Our result may be viewed as a generalization of the main result of Matthews et al. (2018b) to the context of stable distributions, as well as a generalization of the results of Neal (1995) and Der and Lee (2006) to the context of deep NNs. It contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to extend the research directions i), ii) and iii) that rely on Gaussian infinite-width limits. The class of stable distributions is especially relevant here: while the contribution of each Gaussian weight vanishes as the width grows unbounded, some of the stable weights retain a significant size, thus allowing them to represent "hidden features" (Neal, 1995).

The paper is structured as follows. Section 2 contains some preliminaries on stable distributions, whereas in Section 3 we define the class of feed-forward NNs considered in this work. Section 4 contains our main result: as the width tends to infinity jointly over the network's layers, the finite-dimensional distributions of the NN converge to a multivariate stable distribution whose parameters are computed via a recursion over the layers; the convergence of the NN to the stable process then follows by finite-dimensional projections. In Section 5 we detail how our result extends previously established large-width convergence results and comment on related work, whereas in Section 6 we discuss how our result applies to the research lines highlighted in i), ii) and iii), which rely on GP limits. In Section 7 we comment on future research directions. The Supplementary Material (SM) contains all the proofs (SM A, B, C), a preliminary numerical experiment on the evaluation of the recursion (SM D), and an empirical investigation of the distribution of the parameters of trained NN models (SM E). Code is available at https://github.com/stepelu/deep-stable.

2 Stable distributions

Let $St(\alpha, \sigma)$ denote the symmetric centered stable distribution with stability parameter $\alpha \in (0, 2]$ and scale parameter $\sigma > 0$, and let $S_{\alpha,\sigma}$ be a random variable distributed as $St(\alpha, \sigma)$. That is, the characteristic function of $S_{\alpha,\sigma} \sim St(\alpha, \sigma)$ is $\varphi_{S_{\alpha,\sigma}}(t) = \mathbb{E}[e^{itS_{\alpha,\sigma}}] = e^{-\sigma^{\alpha}|t|^{\alpha}}$. For any $\sigma > 0$, a random variable $S_{\alpha,\sigma}$ with $0 < \alpha < 2$ has finite absolute moments $\mathbb{E}[|S_{\alpha,\sigma}|^{\alpha-\varepsilon}]$ for any $\varepsilon > 0$, while $\mathbb{E}[|S_{\alpha,\sigma}|^{\alpha}] = +\infty$. Note that when $\alpha = 2$ we have $S_{2,\sigma} \sim \mathcal{N}(0, 2\sigma^2)$; the random variable $S_{2,\sigma}$ has finite absolute moments of any order. For any $a \in \mathbb{R}$ we have the scaling identity $a S_{\alpha,1} \sim St(\alpha, |a|)$.

We recall the definition of the symmetric and centered multivariate stable distribution and of its marginal distributions. First, let $\mathbb{S}^{k-1}$ be the unit sphere in $\mathbb{R}^k$. Let $St_k(\alpha, \Gamma)$ denote the symmetric and centered $k$-dimensional stable distribution with stability parameter $\alpha \in (0, 2]$ and (finite) spectral measure $\Gamma$ on $\mathbb{S}^{k-1}$, and let $S_{\alpha,\Gamma}$ be a random vector of dimension $k \times 1$ distributed as $St_k(\alpha, \Gamma)$. The characteristic function of $S_{\alpha,\Gamma} \sim St_k(\alpha, \Gamma)$ is

\[
\varphi_{S_{\alpha,\Gamma}}(t) = \mathbb{E}\big[e^{it^{T} S_{\alpha,\Gamma}}\big] = \exp\Big\{ - \int_{\mathbb{S}^{k-1}} |t^{T} s|^{\alpha} \, \Gamma(ds) \Big\}. \tag{1}
\]

If $S_{\alpha,\Gamma} \sim St_k(\alpha, \Gamma)$, then the marginal distributions of $S_{\alpha,\Gamma}$ are described as follows. Let $1_r$ denote a vector of dimension $k \times 1$ with $1$ in the $r$-th entry and $0$ elsewhere. Then the random variable corresponding to the $r$-th element of $S_{\alpha,\Gamma} \sim St_k(\alpha, \Gamma)$ satisfies

\[
1_r^{T} S_{\alpha,\Gamma} \sim St(\alpha, \sigma(r)), \tag{2}
\]

where

\[
\sigma(r) = \Big( \int_{\mathbb{S}^{k-1}} |1_r^{T} s|^{\alpha} \, \Gamma(ds) \Big)^{1/\alpha}. \tag{3}
\]

The distribution $St_k(\alpha, \Gamma)$ with characteristic function (1) allows for marginals which are neither centered nor symmetric. However, in the present work all the marginals will be centered and symmetric, and the spectral measure $\Gamma$ will often be a discrete measure, i.e. $\Gamma(\cdot) = \sum_{j=1}^{n} \gamma_j \, \delta_{s_j}(\cdot)$ for $n \in \mathbb{N}$, $s_j \in \mathbb{S}^{k-1}$ and $\gamma_j \geq 0$. In particular, under these specific assumptions, we have

\[
\varphi_{S_{\alpha,\Gamma}}(t) = \exp\Big\{ - \sum_{j=1}^{n} \gamma_j \, |t^{T} s_j|^{\alpha} \Big\}.
\]

See Samoradnitsky (2017) for a detailed account of $S_{\alpha,\Gamma} \sim St_k(\alpha, \Gamma)$.
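When $\Gamma$ is discrete, exact sampling from $St_k(\alpha, \Gamma)$ is straightforward: if $\Gamma = \sum_j \gamma_j \delta_{s_j}$, then $\sum_j \gamma_j^{1/\alpha} s_j Z_j$, with $Z_j$ iid $S_{\alpha,1}$, has exactly the characteristic function displayed above. The following sketch is ours (it is not from the paper's code); it relies on scipy.stats.levy_stable, whose scale parametrization we assume matches $St(\alpha, \sigma)$ in the symmetric case $\beta = 0$.

```python
import numpy as np
from scipy.stats import levy_stable

def sample_multivariate_stable(alpha, gammas, s, size, seed=None):
    """Draw `size` samples from St_k(alpha, Gamma), with discrete spectral measure
    Gamma = sum_j gammas[j] * delta_{s[j]} and s[j] points on the unit sphere."""
    m, k = s.shape
    # Z[j, i] iid standard symmetric alpha-stable (S_{alpha, 1}).
    Z = levy_stable.rvs(alpha, 0.0, size=(m, size), random_state=seed)
    # X_i = sum_j gammas[j]^(1/alpha) * s[j] * Z[j, i]  ->  array of shape (size, k).
    return (Z.T * gammas**(1.0 / alpha)) @ s

alpha = 1.5
s = np.array([[1.0, 0.0], [0.0, 1.0], [np.sqrt(0.5), np.sqrt(0.5)]])  # points on S^1
gammas = np.array([0.5, 1.0, 2.0])
X = sample_multivariate_stable(alpha, gammas, s, size=10000, seed=1)
print(X.shape)  # (10000, 2)
```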
3 Deep stable networks

We consider fully connected feed-forward NNs composed of $D \geq 1$ layers, where each layer is of width $n \geq 1$. Let $w_{i,j}^{(l)}$ be the weights of the $l$-th layer, and assume that they are independent and identically distributed as $St(\alpha, \sigma_w)$, a stable distribution with stability parameter $\alpha \in (0, 2]$ and scale parameter $\sigma_w > 0$. That is, the characteristic function of $w_{i,j}^{(l)} \sim St(\alpha, \sigma_w)$ is

\[
\varphi_{w_{i,j}^{(l)}}(t) = \mathbb{E}\big[e^{itw_{i,j}^{(l)}}\big] = e^{-\sigma_w^{\alpha}|t|^{\alpha}}, \tag{4}
\]

for any $i \geq 1$, $j \geq 1$ and $l \geq 1$. Let $b_i^{(l)}$ denote the biases of the $l$-th hidden layer, and assume that they are independent and identically distributed as $St(\alpha, \sigma_b)$. That is, the characteristic function of $b_i^{(l)} \sim St(\alpha, \sigma_b)$ is

\[
\varphi_{b_i^{(l)}}(t) = \mathbb{E}\big[e^{itb_i^{(l)}}\big] = e^{-\sigma_b^{\alpha}|t|^{\alpha}}, \tag{5}
\]

for any $i \geq 1$ and $l \geq 1$. The random weights $w_{i,j}^{(l)}$ are independent of the biases $b_i^{(l)}$, for any $i \geq 1$, $j \geq 1$ and $l \geq 1$. In particular,

\[
w_{i,j}^{(l)} + b_i^{(l)} \sim St\big(\alpha, (\sigma_w^{\alpha} + \sigma_b^{\alpha})^{1/\alpha}\big).
\]

Let $\phi: \mathbb{R} \rightarrow \mathbb{R}$ be a nonlinearity with a finite number of discontinuities and such that it satisfies the envelope condition

\[
|\phi(s)| \leq a + b|s|^{\beta} \tag{6}
\]

for every $s \in \mathbb{R}$ and for some parameters $a, b > 0$ and $\beta < 1/\alpha$, $\beta < 1$. If $x \in \mathbb{R}^I$ is the input of the NN, then the NN is explicitly defined by means of

\[
f_i^{(1)}(x, n) = f_i^{(1)}(x) = \sum_{j=1}^{I} w_{i,j}^{(1)} x_j + b_i^{(1)}, \tag{7}
\]

and

\[
f_i^{(l)}(x, n) = \frac{1}{n^{1/\alpha}} \sum_{j=1}^{n} w_{i,j}^{(l)} \phi\big(f_j^{(l-1)}(x, n)\big) + b_i^{(l)}, \tag{8}
\]

for $l = 2, \dots, D$ and $i = 1, \dots, n$. The scaling of the weights in (8) will be shown to be the correct one to obtain non-degenerate limits as $n \rightarrow +\infty$.

4 Infinitely wide limits

We show that, as the width of the NN tends to infinity jointly over the network's layers, the finite-dimensional distributions of the network converge to a multivariate stable distribution whose parameters are computed via a suitable recursion over the network layers. Then, by combining this limiting result with standard arguments on finite-dimensional projections, we obtain the large-$n$ limit of the stochastic process $\big(f_i^{(l)}(x^{(1)}, n), \dots, f_i^{(l)}(x^{(k)}, n)\big)_{i \geq 1}$, where $x^{(1)}, \dots, x^{(k)}$ are the inputs to the NN. In particular, let $\stackrel{w}{\rightarrow}$ denote weak convergence. Then we show that, as $n \rightarrow +\infty$,

\[
\big(f_i^{(l)}(x^{(1)}, n), \dots, f_i^{(l)}(x^{(k)}, n)\big)_{i \geq 1} \stackrel{w}{\rightarrow} \bigotimes_{i \geq 1} St_k(\alpha, \Gamma(l)), \tag{9}
\]

where $\bigotimes$ denotes the product measure. From now on, $k$ is the number of inputs, which is equal to the dimensionality of the finite-dimensional distributions of interest for the stochastic processes $f_i^{(l)}$. Throughout the rest of the paper we assume that the assumptions introduced in Section 3 hold true. Hereafter we present a sketch of the proof of our main result for a fixed index $i$ and input $x$, and we defer to the SM for the complete proofs.

4.1 Large width asymptotics: k = 1

We characterize the limiting distribution of $f_i^{(l)}(x, n)$ as $n \rightarrow \infty$ for a fixed $i$ and input $x$. We show that, as $n \rightarrow +\infty$,

\[
f_i^{(l)}(x, n) \stackrel{w}{\rightarrow} St(\alpha, \sigma(l)), \tag{10}
\]

where the parameter $\sigma(l)$ is computed through the recursion

\[
\sigma(1) = \Big( \sigma_b^{\alpha} + \sigma_w^{\alpha} \sum_{j=1}^{I} |x_j|^{\alpha} \Big)^{1/\alpha}, \qquad
\sigma(l) = \Big( \sigma_b^{\alpha} + \sigma_w^{\alpha} \, \mathbb{E}_{f \sim q^{(l-1)}}\big[ |\phi(f)|^{\alpha} \big] \Big)^{1/\alpha},
\]

and $q^{(l)} = St(\alpha, \sigma(l))$ for each $l \geq 1$. The generalization of this result to $k \geq 1$ inputs is given in Section 4.2. A small numerical check of this recursion is sketched below.
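As an informal sanity check (ours, not part of the paper; the sampler and all names are ours, and we again assume scipy's levy_stable matches $St(\alpha, \sigma)$ for $\beta = 0$), one can evaluate the recursion for $\sigma(l)$ by Monte Carlo, approximating $\mathbb{E}_{f \sim q^{(l-1)}}[|\phi(f)|^{\alpha}]$ with an average over samples from $St(\alpha, \sigma(l-1))$, and compare the hidden units of a single wide network simulated from (7)-(8) against $St(\alpha, \sigma(l))$.

```python
import numpy as np
from scipy.stats import levy_stable

alpha, sigma_w, sigma_b = 1.5, 1.0, 0.3
phi = np.tanh
x = np.array([0.7, -1.2, 0.4])   # a single input, I = 3
D, n, M = 3, 2000, 10**5         # depth, width, Monte Carlo sample size

# Recursion for sigma(l): sigma(1) exactly, sigma(l) by Monte Carlo over q^(l-1).
sigma = (sigma_b**alpha + sigma_w**alpha * np.sum(np.abs(x)**alpha))**(1 / alpha)
for l in range(2, D + 1):
    f = levy_stable.rvs(alpha, 0.0, scale=sigma, size=M, random_state=3 + l)
    sigma = (sigma_b**alpha + sigma_w**alpha * np.mean(np.abs(phi(f))**alpha))**(1 / alpha)

# One wide network sampled from (7)-(8); units of layer D should be approximately St(alpha, sigma(D)).
f_l = levy_stable.rvs(alpha, 0.0, scale=sigma_w, size=(n, x.size), random_state=4) @ x \
      + levy_stable.rvs(alpha, 0.0, scale=sigma_b, size=n, random_state=5)
for l in range(2, D + 1):
    W = levy_stable.rvs(alpha, 0.0, scale=sigma_w, size=(n, n), random_state=6 + l)
    b = levy_stable.rvs(alpha, 0.0, scale=sigma_b, size=n, random_state=9 + l)
    f_l = W @ phi(f_l) / n**(1 / alpha) + b

# Compare e.g. the median absolute value of the units with that of St(alpha, sigma(D)).
print(np.median(np.abs(f_l)), sigma * levy_stable.ppf(0.75, alpha, 0.0))
```

Since the distribution is symmetric, the median of $|S_{\alpha,\sigma}|$ equals $\sigma$ times the 0.75-quantile of $S_{\alpha,1}$, which is what the last line compares.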
Proof of (10). The proof exploits the exchangeability of the sequence $(f_i^{(l)}(x, n))_{i \geq 1}$, an induction argument over the layer index $l$ for the directing (random) measure of $(f_i^{(l)}(x, n))_{i \geq 1}$, and some technical lemmas that are proved in the SM. Recall that the input $x$ is a real-valued vector of dimension $I$.

We start with a technical remark: in (7)-(8) the stochastic processes $f_i^{(l)}(x, n)$ are only defined for $i = 1, \dots, n$, while the limiting measure in (9) is the product measure over $i \geq 1$. This does not create problems, as for each finite $\mathcal{L} \subset \mathbb{N}$ there is an $n$ large enough such that the processes $f_i^{(l)}(x, n)$ are defined for each $i \in \mathcal{L}$. In any case, the simplest solution consists in extending $f_i^{(l)}(x, n)$ from $i = 1, \dots, n$ to $i \geq 1$ in (7)-(8), and we make this assumption in all the proofs.

By means of (4) and (5), for $i \geq 1$,

\[
\varphi_{f_i^{(1)}(x)}(t) = \mathbb{E}\big[e^{itf_i^{(1)}(x)}\big]
= \mathbb{E}\Big[\exp\Big\{ it \Big( \sum_{j=1}^{I} w_{i,j}^{(1)} x_j + b_i^{(1)} \Big)\Big\}\Big]
= \exp\Big\{ - \Big( \sigma_w^{\alpha} \sum_{j=1}^{I} |x_j|^{\alpha} + \sigma_b^{\alpha} \Big) |t|^{\alpha} \Big\},
\]

i.e. $f_i^{(1)}(x) \stackrel{d}{=} S_{\alpha, (\sigma_w^{\alpha} \sum_{j=1}^{I} |x_j|^{\alpha} + \sigma_b^{\alpha})^{1/\alpha}}$. For $l = 2, \dots, D$,

\[
\varphi_{f_i^{(l)}(x,n) \mid \{f_j^{(l-1)}(x,n)\}_{j=1,\dots,n}}(t)
= \mathbb{E}\Big[ e^{itf_i^{(l)}(x,n)} \,\Big|\, \{f_j^{(l-1)}(x,n)\}_{j=1,\dots,n} \Big]
\]
\[
= \mathbb{E}\Big[ \exp\Big\{ it \Big( \frac{1}{n^{1/\alpha}} \sum_{j=1}^{n} w_{i,j}^{(l)} \phi\big(f_j^{(l-1)}(x,n)\big) + b_i^{(l)} \Big)\Big\} \,\Big|\, \{f_j^{(l-1)}(x,n)\}_{j=1,\dots,n} \Big]
\]
\[
= \exp\Big\{ - \Big( \frac{\sigma_w^{\alpha}}{n} \sum_{j=1}^{n} \big|\phi\big(f_j^{(l-1)}(x,n)\big)\big|^{\alpha} + \sigma_b^{\alpha} \Big) |t|^{\alpha} \Big\},
\]

i.e.

\[
f_i^{(l)}(x,n) \mid \{f_j^{(l-1)}(x,n)\}_{j=1,\dots,n} \stackrel{d}{=} S_{\alpha, (\frac{\sigma_w^{\alpha}}{n} \sum_{j=1}^{n} |\phi(f_j^{(l-1)}(x,n))|^{\alpha} + \sigma_b^{\alpha})^{1/\alpha}}.
\]

It follows from (8) that, for every fixed $l$ and every fixed $n$, the sequence $(f_i^{(l)}(x, n))_{i \geq 1}$ is exchangeable. Let $p_n^{(l)}$ denote the directing (random) measure of the exchangeable sequence $(f_i^{(l)}(x, n))_{i \geq 1}$; that is, by de Finetti's representation theorem, conditionally on $p_n^{(l)}$ the $f_i^{(l)}(x, n)$'s are iid distributed as $p_n^{(l)}$. Now, consider the induction hypothesis that $p_n^{(l-1)} \stackrel{w}{\rightarrow} q^{(l-1)}$ as $n \rightarrow +\infty$, with $q^{(l-1)}$ being $St(\alpha, \sigma(l-1))$, where the parameter $\sigma(l-1)$ will be specified. Therefore,

\[
\mathbb{E}\big[e^{itf_i^{(l)}(x,n)}\big]
= \mathbb{E}\Big[ \exp\Big\{ -|t|^{\alpha} \Big( \frac{\sigma_w^{\alpha}}{n} \sum_{j=1}^{n} \big|\phi\big(f_j^{(l-1)}(x,n)\big)\big|^{\alpha} + \sigma_b^{\alpha} \Big)\Big\}\Big]
\]
\[
= e^{-|t|^{\alpha}\sigma_b^{\alpha}} \, \mathbb{E}\Big[ \exp\Big\{ -|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} \sum_{j=1}^{n} \big|\phi\big(f_j^{(l-1)}(x,n)\big)\big|^{\alpha} \Big\}\Big]
\]
\[
= e^{-|t|^{\alpha}\sigma_b^{\alpha}} \, \mathbb{E}\Big[ \Big( \int \exp\Big\{ -|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha} \Big\} \, p_n^{(l-1)}(df) \Big)^{n} \Big], \tag{11}
\]

where the first equality follows by plugging in the definition of $f_i^{(l)}(x, n)$, rewriting $\mathbb{E}[\exp\{\sum_{j=1}^{n} \cdots\}]$ as $\mathbb{E}[\prod_{j=1}^{n} \exp\{\cdots\}] = \prod_{j=1}^{n} \mathbb{E}[\exp\{\cdots\}]$ by conditional independence, computing the characteristic function of each term, and rearranging. The last equality holds because, since $(f_j^{(l-1)}(x, n))_{j \geq 1}$ is exchangeable, there exists (by de Finetti's theorem) a random probability measure $p_n^{(l-1)}$ such that, conditionally on $p_n^{(l-1)}$, the $f_j^{(l-1)}(x, n)$ are iid distributed as $p_n^{(l-1)}$, which explains (11).

Now, let $\stackrel{p}{\rightarrow}$ denote convergence in probability. The following technical lemmas (Appendix A):

L1) for each $l \geq 2$, $\Pr[p_n^{(l-1)} \in \mathcal{I}] = 1$, with $\mathcal{I} = \{p : \int |\phi(f)|^{\alpha} \, p(df) < +\infty\}$;

L2) $\int |\phi(f)|^{\alpha} \, p_n^{(l-1)}(df) \stackrel{p}{\rightarrow} \int |\phi(f)|^{\alpha} \, q^{(l-1)}(df)$, as $n \rightarrow +\infty$;

L3) $\int |\phi(f)|^{\alpha} \big[ 1 - e^{-|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha}} \big] \, p_n^{(l-1)}(df) \stackrel{p}{\rightarrow} 0$, as $n \rightarrow +\infty$;

together with Lagrange's theorem (the mean value theorem), are the main ingredients for proving (10) by studying the large-$n$ asymptotic behavior of (11). By combining (11) with Lemma L1,

\[
\mathbb{E}\big[e^{itf_i^{(l)}(x,n)}\big] = e^{-|t|^{\alpha}\sigma_b^{\alpha}} \,
\mathbb{E}\Big[ \mathbb{1}_{\{p_n^{(l-1)} \in \mathcal{I}\}} \Big( \int \exp\Big\{ -|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha} \Big\} \, p_n^{(l-1)}(df) \Big)^{n} \Big].
\]

By means of Lagrange's theorem, there exists $\theta_n \in [0, 1]$ such that

\[
\exp\Big\{ -|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha} \Big\}
= 1 - |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha}
+ |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha} \Big( 1 - \exp\Big\{ -\theta_n |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha} \Big\}\Big).
\]

Now, since

\[
\int |\phi(f)|^{\alpha} \big[ 1 - e^{-\theta_n |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha}} \big] \, p_n^{(l-1)}(df)
\leq \int |\phi(f)|^{\alpha} \big[ 1 - e^{-|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha}} \big] \, p_n^{(l-1)}(df),
\]

we obtain

\[
\mathbb{E}\big[e^{itf_i^{(l)}(x,n)}\big] \leq e^{-|t|^{\alpha}\sigma_b^{\alpha}} \,
\mathbb{E}\Big[ \mathbb{1}_{\{p_n^{(l-1)} \in \mathcal{I}\}} \Big( 1 - |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} \int |\phi(f)|^{\alpha} \, p_n^{(l-1)}(df)
+ |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} \int |\phi(f)|^{\alpha} \big[ 1 - e^{-|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha}} \big] \, p_n^{(l-1)}(df) \Big)^{n} \Big].
\]

Finally, recall the fundamental limit $e^{x} = \lim_{n \rightarrow +\infty} (1 + x/n)^{n}$. This, combined with Lemmas L2 and L3, leads to

\[
\mathbb{E}\big[e^{itf_i^{(l)}(x,n)}\big] \rightarrow e^{-|t|^{\alpha} [\sigma_b^{\alpha} + \sigma_w^{\alpha} \int |\phi(f)|^{\alpha} q^{(l-1)}(df)]}
\]

as $n \rightarrow +\infty$. That is, we have proved that the large-$n$ limiting distribution of $f_i^{(l)}(x, n)$ is $St(\alpha, \sigma(l))$, where we set

\[
\sigma(l) = \Big( \sigma_b^{\alpha} + \sigma_w^{\alpha} \int |\phi(f)|^{\alpha} \, q^{(l-1)}(df) \Big)^{1/\alpha}.
\]

4.2 Large width asymptotics: k ≥ 1

We establish the convergence in distribution of $\big(f_i^{(l)}(x^{(1)}, n), \dots, f_i^{(l)}(x^{(k)}, n)\big)$ as $n \rightarrow +\infty$ for a fixed $i$ and $k$ inputs $x^{(1)}, \dots, x^{(k)}$. This result, combined with standard arguments on finite-dimensional projections, then establishes the convergence of the NN to the stable process. Precisely, we show that, as $n \rightarrow +\infty$,

\[
\big(f_i^{(l)}(x^{(1)}, n), \dots, f_i^{(l)}(x^{(k)}, n)\big) \stackrel{w}{\rightarrow} St_k(\alpha, \Gamma(l)), \tag{12}
\]

where the spectral measure $\Gamma(l)$ is computed through the recursion

\[
\Gamma(1) = \sigma_b^{\alpha} \, \|\mathbf{1}\|^{\alpha} \, \delta_{\mathbf{1}/\|\mathbf{1}\|} + \sigma_w^{\alpha} \sum_{j=1}^{I} \|x_j\|^{\alpha} \, \delta_{x_j/\|x_j\|}, \tag{13}
\]

\[
\Gamma(l) = \sigma_b^{\alpha} \, \|\mathbf{1}\|^{\alpha} \, \delta_{\mathbf{1}/\|\mathbf{1}\|} + \sigma_w^{\alpha} \, \mathbb{E}_{f \sim q^{(l-1)}}\big[ \|\phi(f)\|^{\alpha} \, \delta_{\phi(f)/\|\phi(f)\|} \big], \tag{14}
\]

and $q^{(l)} = St_k(\alpha, \Gamma(l))$ for each $l \geq 1$, where $x_j = [x_j^{(1)}, \dots, x_j^{(k)}] \in \mathbb{R}^k$ collects the $j$-th coordinate of the $k$ inputs, $\mathbf{1}$ is the all-ones vector in $\mathbb{R}^k$, and $\phi$ acts elementwise. Here (and in all the expressions involving $\delta_{\bullet}$) we make use of the notational convention that if $\bullet = 0$ in $\delta_{\bullet}$, then $\delta_{\bullet} = 0$. This convention allows us to avoid making the notation more cumbersome than necessary to explicitly exclude the case $\phi(f) = 0$, for which $\phi(f)/\|\phi(f)\|$ is undefined. We omit the sketch of the proof of (12), as it is a step-by-step parallel of the proof of (10) with the added complexities due to the multivariate stable distributions. The reader can refer to the SM for the full proof. A small sketch relating (13) to the univariate recursion of Section 4.1 is given below.
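As a quick consistency check (ours, not part of the paper), one can verify numerically that the marginal scale (2)-(3) induced by the discrete spectral measure $\Gamma(1)$ in (13) on the $r$-th coordinate coincides with the univariate scale $\sigma(1)$ of Section 4.1 evaluated at the single input $x^{(r)}$.

```python
import numpy as np

alpha, sigma_w, sigma_b = 1.7, 1.0, 0.5
rng = np.random.default_rng(4)
X = rng.normal(size=(3, 2))            # I = 3 coordinates, k = 2 inputs x^(1), x^(2)
I, k = X.shape

# Discrete spectral measure Gamma(1): weights gamma_j at points s_j on S^(k-1), eq. (13).
ones = np.ones(k)
points = [ones / np.linalg.norm(ones)] + [X[j] / np.linalg.norm(X[j]) for j in range(I)]
weights = [sigma_b**alpha * np.linalg.norm(ones)**alpha] + \
          [sigma_w**alpha * np.linalg.norm(X[j])**alpha for j in range(I)]

# Marginal scale of the r-th coordinate, eqs. (2)-(3): sigma(r)^alpha = sum_j gamma_j |s_j[r]|^alpha.
r = 0
scale_marginal = sum(g * abs(s[r])**alpha for g, s in zip(weights, points))**(1 / alpha)

# Univariate recursion of Section 4.1 applied to the single input x^(r).
scale_univariate = (sigma_b**alpha + sigma_w**alpha * np.sum(np.abs(X[:, r])**alpha))**(1 / alpha)

print(scale_marginal, scale_univariate)   # the two values coincide
```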

4.3 Finite-dimensional projections

In Section 4.2 we obtained the convergence in law of $\big(f_i^{(l)}(x^{(1)}, n), \dots, f_i^{(l)}(x^{(k)}, n)\big)$, for $k$ inputs and a generic $i$, to a multivariate stable distribution. Let us refer to this random vector as $f_i(x)$. We now derive the limiting behavior in law of $f_i(x)$ jointly over all $i = 1, \dots$ (again for a given collection of $k$ inputs). It is enough to study the convergence of $f_1(x), \dots, f_n(x)$ for a generic $n \geq 1$; that is, it is enough to establish the convergence of the finite-dimensional distributions over $i$, where we consider $f_i(x)$ as a random sequence over $i$. See Billingsley (1999) for details.

To establish the convergence of the finite-dimensional distributions (over $i$) it then suffices to establish the convergence of linear combinations. More precisely, let $X = [x^{(1)}, \dots, x^{(k)}] \in \mathbb{R}^{I \times k}$. We show that, as $n \rightarrow +\infty$,

\[
\big(f_i^{(l)}(X, n)\big)_{i \geq 1} \stackrel{w}{\rightarrow} \bigotimes_{i \geq 1} St_k(\alpha, \Gamma(l)),
\]

by studying the large-$n$ asymptotic behavior of any finite linear combination of the $f_i^{(l)}(X, n)$'s, for $i \in \mathcal{L} \subset \mathbb{N}$. Following the notation of Matthews et al. (2018b), let

\[
T^{(l)}(\mathcal{L}, p, X, n) = \sum_{i \in \mathcal{L}} p_i \big[ f_i^{(l)}(X, n) - b_i^{(l)} \mathbf{1} \big].
\]

Then we write

\[
T^{(l)}(\mathcal{L}, p, X, n)
= \sum_{i \in \mathcal{L}} p_i \Big[ \frac{1}{n^{1/\alpha}} \sum_{j=1}^{n} w_{i,j}^{(l)} \phi\big(f_j^{(l-1)}(X, n)\big) \Big]
= \frac{1}{n^{1/\alpha}} \sum_{j=1}^{n} \gamma_j^{(l)}(\mathcal{L}, p, X, n),
\]

where

\[
\gamma_j^{(l)}(\mathcal{L}, p, X, n) = \sum_{i \in \mathcal{L}} p_i \, w_{i,j}^{(l)} \, \phi\big(f_j^{(l-1)}(X, n)\big).
\]

Then,

\[
\varphi_{T^{(l)}(\mathcal{L}, p, X, n) \mid \{f_j^{(l-1)}(X, n)\}_{j=1,\dots,n}}(t)
= \mathbb{E}\Big[ e^{it^{T} T^{(l)}(\mathcal{L}, p, X, n)} \,\Big|\, \{f_j^{(l-1)}(X, n)\}_{j=1,\dots,n} \Big]
= \prod_{j=1}^{n} \prod_{i \in \mathcal{L}} e^{-\frac{\sigma_w^{\alpha}}{n} |p_i \, t^{T} \phi(f_j^{(l-1)}(X, n))|^{\alpha}}.
\]

That is,

\[
T^{(l)}(\mathcal{L}, p, X, n) \mid \{f_j^{(l-1)}(X, n)\}_{j=1,\dots,n} \stackrel{d}{=} S_{\alpha, \Gamma_n^{(l)}},
\]

where

\[
\Gamma_n^{(l)} = \frac{1}{n} \sum_{j=1}^{n} \sum_{i \in \mathcal{L}} |p_i|^{\alpha} \, \sigma_w^{\alpha} \, \big\|\phi\big(f_j^{(l-1)}(X, n)\big)\big\|^{\alpha} \, \delta_{\phi(f_j^{(l-1)}(X, n)) / \|\phi(f_j^{(l-1)}(X, n))\|}.
\]

Then, along lines similar to the proof of the large-$n$ asymptotics for the $i$-th coordinate, we have

\[
\mathbb{E}\big[ e^{it^{T} T^{(l)}(\mathcal{L}, p, X, n)} \big] \rightarrow
\exp\Big\{ - \int \int_{\mathbb{S}^{k-1}} |t^{T} s|^{\alpha}
\sum_{i \in \mathcal{L}} |p_i|^{\alpha} \sigma_w^{\alpha} \, \|\phi(f)\|^{\alpha} \, \delta_{\phi(f)/\|\phi(f)\|}(ds) \, q^{(l-1)}(df) \Big\}
\]

as $n \rightarrow +\infty$. This completes the proof of the limiting behaviour (9).

5 Related work

For the classical case of Gaussian weights and biases, and more generally for finite-variance iid distributions, the seminal work is that of Neal (1995). There the author establishes, among other notable contributions, the connection between infinitely wide shallow (one hidden layer) NNs and centered GPs; we reviewed the essence of this argument in Section 1.

This result is extended in Lee et al. (2018) to deep NNs where the width $n(l)$ of each layer $l$ goes to infinity sequentially, starting from the lowest layer, i.e. from $n(1)$ to $n(D)$. The sequential nature of the limits reduces the task to a sequential application of the approach of Neal (1995). The computation of the GP kernel for each layer $l$ involves a recursion, and the authors propose a numerical method to approximate the integral involved in each step of the recursion. The case where each $n(l)$ goes to infinity jointly, i.e. $n(l) = n$, is considered in Matthews et al. (2018a) under more restrictive hypotheses, which are relaxed in Matthews et al. (2018b). While this setting is the most representative of a sequence of increasingly wide networks, the theoretical analysis is considerably more complicated, as it does not reduce to a sequential application of the classical multivariate central limit theorem.

Going beyond finite-variance weight and bias distributions, Neal (1995) also introduced preliminary results for infinitely wide shallow NNs when weights and biases follow centered symmetric stable distributions. These results are refined in Der and Lee (2006), which establishes the convergence to a stable process, again in the setting of shallow NNs.

The present paper can be considered a generalization of the work of Matthews et al. (2018b) to the context of weights and biases distributed according to centered and symmetric stable distributions. Our proof follows different arguments from the proof of Matthews et al. (2018b); in particular, it does not rely on the central limit theorem for exchangeable sequences (Blum et al., 1958). Hence, since the Gaussian distribution is a special case of the stable distribution, our proof provides an alternative and self-contained proof of the result of Matthews et al. (2018b). It should be noted that our envelope condition (6) is more restrictive than the linear envelope condition of Matthews et al. (2018b). For the classical Gaussian setting, the conditions on the activation function have been weakened in the work of Yang (2019).

Finally, there has been recent interest in using heavy-tailed distributions for gradient noise (Simsekli et al., 2019) and for trained parameter distributions (Martin and Mahoney, 2019).

In particular, Martin and Mahoney (2019) includes an empirical analysis of the parameters of pre-trained convolutional architectures (which we also investigate in SM E) supportive of heavy-tailed distributions. Results of this kind are compatible with the conjecture that stochastic processes arising from NNs whose parameters are heavy-tailed might be closer representations of their finite, high-performing, counterparts.

6 Future applications

6.1 Bayesian inference

Infinitely wide NNs with centered iid Gaussian initializations, and more generally with finite-variance centered iid initializations, give rise to iid centered GPs at every layer $l$. Let us assume that weights and biases are distributed as in Section 1, and let us assume $L$ layers ($L - 1$ hidden layers). Each centered GP is characterized by its covariance kernel function. Let us denote by $f^{(l)}$ such a GP for layer $2 \leq l \leq L$. Over two inputs $x$ and $x'$, the distribution of $\big(f^{(l)}(x), f^{(l)}(x')\big)$ is characterized by the variances $q_x^{(l)} = \mathbb{V}[f^{(l)}(x)]$, $q_{x'}^{(l)} = \mathbb{V}[f^{(l)}(x')]$ and by the covariance $c_{x,x'}^{(l)} = \mathbb{C}[f^{(l)}(x), f^{(l)}(x')]$. These quantities satisfy

\[
q_x^{(l)} = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}\Big[ \phi\Big( \sqrt{q_x^{(l-1)}}\, z \Big)^2 \Big], \tag{15}
\]

\[
c_{x,x'}^{(l)} = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}\Big[ \phi\Big( \sqrt{q_x^{(l-1)}}\, z \Big)\, \phi\Big( \sqrt{q_{x'}^{(l-1)}}\, \Big( \rho_{x,x'}^{(l-1)} z + \sqrt{1 - (\rho_{x,x'}^{(l-1)})^2}\, z' \Big) \Big) \Big], \tag{16}
\]

where $z$ and $z'$ are independent standard Gaussian random variables $\mathcal{N}(0, 1)$,

\[
\rho_{x,x'}^{(l)} = \frac{c_{x,x'}^{(l)}}{\sqrt{q_x^{(l)}}\sqrt{q_{x'}^{(l)}}}, \tag{17}
\]

with initial conditions $q_x^{(1)} = \sigma_b^2 + \sigma_w^2 \|x\|^2$ and $c_{x,x'}^{(1)} = \sigma_b^2 + \sigma_w^2 \langle x, x' \rangle$.

To perform prediction via $\mathbb{E}[f^{(L)}(x^*) \mid x^*, \mathcal{D}]$, it is necessary to compute these recursions for all ordered pairs of data points $x, x'$ in the training dataset $\mathcal{D}$, and for all pairs $x^*, x$ with $x \in \mathcal{D}$. Lee et al. (2018) propose an efficient quadrature solution that keeps the computational requirements manageable for an arbitrary activation $\phi$.

In our setting, the corresponding recursion is defined by (13)-(14), which is a more computationally challenging problem than in the Gaussian setting. A sketch of a potential approach is as follows. Over the training data points and the test points, $f^{(1)} \sim St_k(\alpha, \Gamma(1))$, where $k$ is equal to the size of the training and test datasets combined. As $\Gamma(1)$ is a discrete measure, exact simulation algorithms are available with a computational cost of $\mathcal{O}(I)$ per sample (Nolan, 2008). We can thus generate $M$ samples $\tilde{f}_j^{(1)}$, $j = 1, \dots, M$, in $\mathcal{O}(IM)$, and use these to approximate $f^{(2)} \sim St_k(\alpha, \Gamma(2))$ with $St_k(\alpha, \tilde{\Gamma}(2))$, where

\[
\tilde{\Gamma}(2) = \sigma_b^{\alpha} \, \|\mathbf{1}\|^{\alpha} \, \delta_{\mathbf{1}/\|\mathbf{1}\|} + \sigma_w^{\alpha} \, \frac{1}{M} \sum_{j=1}^{M} \big\|\phi\big(\tilde{f}_j^{(1)}\big)\big\|^{\alpha} \, \delta_{\phi(\tilde{f}_j^{(1)}) / \|\phi(\tilde{f}_j^{(1)})\|}.
\]

We can repeat this procedure by generating (approximate) random samples $\tilde{f}_j^{(2)}$, at a cost of $\mathcal{O}(M^2)$, which in turn are used to approximate $\Gamma(3)$, and so on. In this procedure the errors can accumulate across the layers, as in Lee et al. (2018). This may be ameliorated by using the quasi-random number generators of Joe and Kuo (2008), as the sampling algorithms for multivariate stable distributions (Weron, 1996; Weron et al., 2010; Nolan, 2008) are all implemented as transformations of uniform distributions; the use of QRNGs effectively defines a quadrature scheme for the integration problem. We report in the SM preliminary results regarding the numerical approximation of the recursion defined by (13)-(14). A code sketch of this sampling-based approximation is given below.

This leaves us with the problem of computing a statistic of $f^{(L)}(x^*) \mid (x^*, \mathcal{D})$, or of sampling from it, to perform prediction. Again, it could be beneficial to leverage the discreteness of $\tilde{\Gamma}(L)$. For example, such multivariate stable random variables can be expressed as suitable linear transformations of independent stable random variables (Samoradnitsky, 2017), and results expressing stable variables as mixtures of Gaussian variables are available in Samoradnitsky (2017).
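A minimal sketch of the sampling-based approximation of (13)-(14) follows. It is ours, uses plain Monte Carlo rather than the QRNG variant discussed above, all names and parameter values are illustrative, and it again assumes scipy's levy_stable matches $St(\alpha, \sigma)$ for $\beta = 0$.

```python
import numpy as np
from scipy.stats import levy_stable

def discrete_gamma(alpha, sigma_b, atom_coef, atoms, k):
    """(weights, points) representation of a discrete spectral measure of the form
    sigma_b^alpha ||1||^alpha delta_{1/||1||} + atom_coef * sum_j ||a_j||^alpha delta_{a_j/||a_j||},
    dropping zero atoms as per the notational convention of Section 4.2."""
    atoms = atoms[np.linalg.norm(atoms, axis=1) > 0]
    norms = np.linalg.norm(atoms, axis=1)
    ones = np.ones(k)
    weights = np.concatenate(([sigma_b**alpha * np.linalg.norm(ones)**alpha],
                              atom_coef * norms**alpha))
    points = np.vstack([ones / np.linalg.norm(ones), atoms / norms[:, None]])
    return weights, points

def sample_discrete_stable(alpha, weights, points, size, seed):
    """Exact samples from St_k(alpha, Gamma) for a discrete Gamma (cf. Nolan, 2008)."""
    Z = levy_stable.rvs(alpha, 0.0, size=(len(weights), size), random_state=seed)
    return (Z.T * weights**(1.0 / alpha)) @ points       # array of shape (size, k)

# Toy setup: k = 4 inputs in R^3, tanh nonlinearity, depth 3, M Monte Carlo samples per layer.
rng = np.random.default_rng(7)
alpha, sigma_w, sigma_b, phi = 1.8, 1.0, 0.3, np.tanh
X = rng.normal(size=(3, 4))                    # row j of X is x_j = (x_j^(1), ..., x_j^(4))
I, k = X.shape
M, depth = 2000, 3

# Gamma(1), eq. (13): the atoms are the rows x_j, each with weight sigma_w^alpha ||x_j||^alpha.
weights, points = discrete_gamma(alpha, sigma_b, sigma_w**alpha, X, k)
for l in range(2, depth + 1):
    # Approximate E_{f ~ q^(l-1)}[...] in eq. (14) with M Monte Carlo samples, as for Gamma-tilde(2).
    f = sample_discrete_stable(alpha, weights, points, size=M, seed=10 + l)
    weights, points = discrete_gamma(alpha, sigma_b, sigma_w**alpha / M, phi(f), k)

print(f"Gamma({depth}) approximated by {len(weights)} atoms")
```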

6.2 Neural tangent kernel

In Section 6.1 we reviewed how the connection with GPs makes it possible to perform Bayesian inference directly on the limiting process. This corresponds to a "weakly-trained" regime of NNs, in the sense that the point (mean) predictions are equivalent to assuming an $l_2$ loss function and fitting only a terminal linear layer to the training data, i.e. performing a kernel regression (Arora et al., 2019). The works of Jacot et al. (2018), Lee et al. (2019) and Arora et al. (2019) consider "fully-trained" NNs with $l_2$ loss and continuous-time gradient descent. Under Gaussian initialization assumptions, it is shown that, as the width of the NN goes to infinity, the point predictions of such fully trained networks are again given by a kernel regression, but with respect to a different kernel, the neural tangent kernel.

In the derivation of the neural tangent kernel, one important point is that the gradients are not computed with respect to the standard model parameters, i.e. the weights and biases entering the affine transforms. Instead, they are "reparametrized gradients", computed with respect to parameters initialized as $\mathcal{N}(0, 1)$, with any scaling (standard deviation) introduced by explicit parameter multiplication. It would thus be interesting to study whether a corresponding neural tangent kernel can be defined for the case of stable distributions with $0 < \alpha < 2$, and whether the parametrization of (7)-(8) is the appropriate one to do so.
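To make the reparametrization concrete, the following sketch (ours and purely illustrative; the stable variant is only an analogy suggested by the scaling identity $aS_{\alpha,1} \sim St(\alpha, |a|)$ of Section 2, not a construction from the paper) contrasts a layer in the standard parametrization with the NTK-style parametrization, in which all trainable parameters are initialized as $\mathcal{N}(0, 1)$ and the scales $\sigma_w, \sigma_b$ enter as explicit multipliers.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(11)
n, sigma_w, sigma_b, alpha = 1000, 1.2, 0.5, 1.7
h = np.tanh(rng.normal(size=n))                 # some previous-layer activations

# Standard parametrization: the scales live in the parameter distribution.
W_std = rng.normal(0.0, sigma_w, size=n)
b_std = rng.normal(0.0, sigma_b)
f_std = W_std @ h / np.sqrt(n) + b_std

# NTK-style parametrization: parameters are N(0, 1); scales are explicit multipliers,
# so gradients are taken with respect to the N(0, 1) variables.
W_ntk = rng.normal(0.0, 1.0, size=n)
b_ntk = rng.normal(0.0, 1.0)
f_ntk = sigma_w * (W_ntk @ h) / np.sqrt(n) + sigma_b * b_ntk

# A stable analogue of the same reparametrization for the layer (8): St(alpha, 1) parameters,
# scales as explicit multipliers, and the width scaling n^(-1/alpha) instead of n^(-1/2).
W_st = levy_stable.rvs(alpha, 0.0, size=n, random_state=12)
b_st = levy_stable.rvs(alpha, 0.0, random_state=13)
f_st = sigma_w * (W_st @ h) / n**(1 / alpha) + sigma_b * b_st

print(f_std, f_ntk, f_st)  # f_std and f_ntk share the same law; f_st is the alpha-stable counterpart
```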
6.3 Information propagation

The recursions (15)-(16) define the evolution over depth of the distribution of $f^{(l)}$ for two inputs $x, x'$ when weights and biases are distributed as in Section 1. The information propagation framework studies the behavior of $q_x^{(l)}$ and $\rho_{x,x'}^{(l)}$ as $l \rightarrow +\infty$. It is shown in Poole et al. (2016) and Schoenholz et al. (2017) that the $(\sigma_w, \sigma_b)$ positive quadrant is divided into two regions: a stable phase, where $\rho_{x,x'}^{(l)} \rightarrow 1$, and a chaotic phase, where $\rho_{x,x'}^{(l)}$ converges to a random variable (in the $\phi = \tanh$ case; in other cases the limiting processes may fail to exist). Thus, in the stable phase $f^{(l)}$ is eventually perfectly correlated over inputs (and in most cases perfectly constant), while in the chaotic phase it is almost everywhere discontinuous. The work of Hayou et al. (2019) formalizes these results and investigates the case where $(\sigma_w, \sigma_b)$ lies on the curve separating the stable phase from the chaotic phase, i.e. the edge of chaos. There it is shown that the behavior is qualitatively similar to that of the stable phase, but with a lower rate of convergence with respect to depth. Thus, in all cases the distribution of $f^{(l)}$ eventually collapses to degenerate and inexpressive distributions as depth increases.

In this context, it would be interesting to study the impact of the use of stable distributions. All the results mentioned above hold for the Gaussian case, which corresponds to $\alpha = 2$; the further analysis would thus study the case $0 < \alpha < 2$, resulting in a triplet $(\sigma_w, \sigma_b, \alpha)$. Even though it seems hard to escape the curse of depth under iid initializations, it might be that the use of stable distributions, with their non-uniformly-vanishing relevance at the unit level (Neal, 1995), slows down the rate of convergence to the limiting regime.
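The following sketch (ours; a plain Monte Carlo evaluation of the expectations in (15)-(16), with illustrative values of $(\sigma_w, \sigma_b)$) iterates the Gaussian recursion and tracks $\rho_{x,x'}^{(l)}$ over depth for $\phi = \tanh$, illustrating the convergence to a limiting value discussed above.

```python
import numpy as np

rng = np.random.default_rng(21)
phi = np.tanh
M = 200_000                                  # Monte Carlo sample size for the expectations
z, zp = rng.normal(size=M), rng.normal(size=M)

def propagate(sigma_w, sigma_b, q_x, q_xp, c, depth):
    """Iterate the recursions (15)-(17) for `depth` layers; return the correlations rho^(l)."""
    rhos = []
    for _ in range(depth):
        rho = min(1.0, c / np.sqrt(q_x * q_xp))
        u = np.sqrt(q_x) * z
        up = np.sqrt(q_xp) * (rho * z + np.sqrt(1.0 - rho**2) * zp)  # same law as sqrt(q_xp) * z
        q_x = sigma_b**2 + sigma_w**2 * np.mean(phi(u)**2)            # eq. (15) at x
        q_xp = sigma_b**2 + sigma_w**2 * np.mean(phi(up)**2)          # eq. (15) at x'
        c = sigma_b**2 + sigma_w**2 * np.mean(phi(u) * phi(up))       # eq. (16)
        rhos.append(c / np.sqrt(q_x * q_xp))                          # eq. (17)
    return rhos

# Initial conditions q^(1), c^(1) for two illustrative inputs.
x, xp = np.array([1.0, -0.5]), np.array([0.3, 0.8])
for sigma_w, sigma_b in [(1.0, 0.2), (2.5, 0.2)]:    # two illustrative (sigma_w, sigma_b) settings
    q_x = sigma_b**2 + sigma_w**2 * x @ x
    q_xp = sigma_b**2 + sigma_w**2 * xp @ xp
    c = sigma_b**2 + sigma_w**2 * x @ xp
    print(sigma_w, sigma_b, [round(r, 3) for r in propagate(sigma_w, sigma_b, q_x, q_xp, c, depth=15)])
```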
7 Conclusions

Within the setting of fully connected feed-forward deep NNs with weights and biases iid according to centered and symmetric stable distributions, we proved that the infinite-width limit of the NN, under a suitable scaling of the weights, is a stable process. This result contributes to the theory of fully connected feed-forward deep NNs, generalizing the work of Matthews et al. (2018b). We presented an extensive discussion of how our result can be used to extend recent lines of research which rely on GP limits.

On the theoretical side, further developments of our work are possible. Firstly, Matthews et al. (2018b) performs an empirical analysis of the rates of convergence to the limiting process, as a function of depth, with respect to the MMD discrepancy (Gretton et al., 2012). Having proved the convergence of the finite-dimensional distributions to multivariate stable distributions, the next step would be to establish the rate of convergence with respect to a metric of choice, as a function of the stability index $\alpha$ and of the depth $l$. Secondly, all the established convergence results (this paper included) concern the convergence of the finite-dimensional distributions of the NN layers. For the countable case, which is the case of the components $i \geq 1$ in each layer, this is equivalent to the convergence in distribution of the whole process (over all the $i$) with respect to the product topology. However, the input space $\mathbb{R}^I$ is not countable. Hence, for a given $i$, the convergence of the finite-dimensional distributions (i.e. over a finite collection of inputs) is not enough to establish the convergence in distribution of the stochastic process seen as a random function of the input (with respect to an appropriate metric). This is also the case for results concerning the convergence to GPs. It would thus be worthwhile to complete this theoretical line of research by establishing such a result for any $0 < \alpha \leq 2$. As a side result, doing so is likely to provide estimates on the smoothness properties of the limiting stochastic processes.

8 Acknowledgements

We wish to thank the three anonymous reviewers and the meta reviewer for their valuable feedback. Stefano Favaro received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant agreement No 817257. Stefano Favaro gratefully acknowledges financial support from the Italian Ministry of Education, University and Research (MIUR), "Dipartimenti di Eccellenza" grant 2018-2022.

References

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. (2019). On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32.

Billingsley, P. (1999). Convergence of Probability Measures. Wiley-Interscience, 2nd edition.

Der, R. and Lee, D. D. (2006). Beyond Gaussian processes: On the distributions of infinite networks. In Advances in Neural Information Processing Systems, pages 275-282.

Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. (2019). Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773.

Hayou, S., Doucet, A., and Rousseau, J. (2019). On the impact of the activation function on deep neural networks training. In Proceedings of the 36th International Conference on Machine Learning, pages 2672-2680.

Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31, pages 8571-8580.

Joe, S. and Kuo, F. Y. (2008). Notes on generating Sobol sequences. ACM Transactions on Mathematical Software (TOMS), 29(1):49-57.

Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. (2018). Deep neural networks as Gaussian processes. In International Conference on Learning Representations.

Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J., and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems 32.

Martin, C. H. and Mahoney, M. W. (2019). Traditional and heavy-tailed self regularization in neural network models. arXiv preprint arXiv:1901.08276.

Matthews, A. G. d. G., Hron, J., Rowland, M., Turner, R. E., and Ghahramani, Z. (2018a). Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations.

Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. (2018b). Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271.

Neal, R. M. (1995). Bayesian Learning for Neural Networks. PhD thesis, University of Toronto.

Nolan, J. P. (2008). An overview of multivariate stable distributions. Online: http://academic2.american.edu/~jpnolan/stable/overview.pdf.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, pages 3360-3368.

Samoradnitsky, G. (2017). Stable non-Gaussian random processes: stochastic models with infinite variance. Routledge.

Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation. In International Conference on Learning Representations.

Simsekli, U., Sagun, L., and Gurbuzbalaban, M. (2019). A tail-index analysis of stochastic gradient noise in deep neural networks. arXiv preprint arXiv:1901.06053.

Weron, R. (1996). On the Chambers-Mallows-Stuck method for simulating skewed stable random variables. Statistics & Probability Letters, 28(2):165-171.

Weron, R. et al. (2010). Correction to: "On the Chambers-Mallows-Stuck method for simulating skewed stable random variables". Technical report, University Library of Munich, Germany.

Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.