Stable behaviour of infinitely wide deep neural networks

Stefano Favaro (Department ESOMAS, University of Torino and Collegio Carlo Alberto, [email protected]), Sandra Fortini (Department of Decision Sciences, Bocconi University, [email protected]), Stefano Peluchetti (Cogent Labs, [email protected])

Abstract

We consider fully connected feed-forward deep neural networks (NNs) whose weights and biases are independent and identically distributed according to symmetric centered stable distributions. We show that the infinite-width limit of the NN, under a suitable scaling of the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions. The limiting process is referred to as the stable process, and it generalizes the class of Gaussian processes recently obtained as infinite-width limits of NNs (Matthews et al., 2018b). The parameters of the stable process can be computed via an explicit recursion over the layers of the network. Our result contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to extend recent lines of research that rely on Gaussian infinite-width limits.

1 Introduction

The connection between infinitely wide deep feed-forward neural networks (NNs), whose parameters at initialization are independent and identically distributed (iid) as scaled and centered Gaussian distributions, and Gaussian processes (GPs) is well known (Neal, 1995; Der and Lee, 2006; Lee et al., 2018; Matthews et al., 2018a,b; Yang, 2019). Recently, this intriguing connection has been exploited in several research directions, including: i) Bayesian inference for GPs arising from infinitely wide networks (Lee et al., 2018; Garriga-Alonso et al., 2019); ii) kernel regression for infinitely wide networks trained with continuous-time gradient descent, via the neural tangent kernel (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019); iii) analysis of the properties of infinitely wide networks as a function of depth, via the information propagation framework (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019). A substantial gap has been observed between finite NNs and their infinitely wide GP counterparts in terms of empirical performance, at least on some standard benchmarks. Moreover, avoiding undesirable empirical properties arising in very deep networks has proved to be a difficult task. For these reasons, there is increasing interest in going beyond the GPs arising in the limit of infinitely wide NNs, as a way to close, or even reverse, this empirical performance gap and to avoid, or slow down, pathological behaviors in very deep NNs.

Let $\mathcal{N}(\mu, \sigma^2)$ denote the Gaussian distribution with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 \in \mathbb{R}_+$. Following the celebrated work of Neal (1995), we consider the shallow NN

\[
f_i^{(1)}(x) = \sum_{j=1}^{I} w_{i,j}^{(1)} x_j + b_i^{(1)}, \qquad
f_i^{(2)}(x, n) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} w_{i,j}^{(2)} \phi\big(f_j^{(1)}(x)\big) + b_i^{(2)},
\]

where $\phi$ is a nonlinearity, $i = 1, \dots, n$, $w_{i,j}^{(1)}, w_{i,j}^{(2)} \stackrel{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$, $b_i^{(1)}, b_i^{(2)} \stackrel{iid}{\sim} \mathcal{N}(0, \sigma_b^2)$, and $x \in \mathbb{R}^I$ is the input. It follows that

\[
f_i^{(1)}(x) \stackrel{iid}{\sim} \mathcal{N}\big(0, \sigma^2_{f^{(1)}}(x)\big), \qquad
f_i^{(2)}(x, n) \mid f^{(1)} \stackrel{iid}{\sim} \mathcal{N}\big(0, \sigma^2_{f^{(2)}}(x, n)\big),
\]

where

\[
\sigma^2_{f^{(1)}}(x) = \sigma_b^2 + \sigma_w^2 \sum_{j=1}^{I} x_j^2
\]

and

\[
\sigma^2_{f^{(2)}}(x, n) = \sigma_b^2 + \sigma_w^2 \, \frac{1}{n} \sum_{j=1}^{n} \phi\big(f_j^{(1)}(x)\big)^2 .
\]

If $x'$ is another input, we obtain the bivariate Gaussian distributions

\[
\big(f_i^{(1)}(x), f_i^{(1)}(x')\big) \stackrel{iid}{\sim} \mathcal{N}_2\big(0, \Sigma_{f^{(1)}}(x, x')\big), \qquad
\big(f_i^{(2)}(x, n), f_i^{(2)}(x', n)\big) \mid f^{(1)} \stackrel{iid}{\sim} \mathcal{N}_2\big(0, \Sigma_{f^{(2)}}(x, x', n)\big),
\]

where

\[
\Sigma_{f^{(1)}}(x, x') = \begin{bmatrix} \sigma^2_{f^{(1)}}(x) & c_{f^{(1)}}(x, x') \\ c_{f^{(1)}}(x, x') & \sigma^2_{f^{(1)}}(x') \end{bmatrix}, \qquad
\Sigma_{f^{(2)}}(x, x', n) = \begin{bmatrix} \sigma^2_{f^{(2)}}(x, n) & c_{f^{(2)}}(x, x', n) \\ c_{f^{(2)}}(x, x', n) & \sigma^2_{f^{(2)}}(x', n) \end{bmatrix},
\]

\[
c_{f^{(1)}}(x, x') = \sigma_b^2 + \sigma_w^2 \sum_{j=1}^{I} x_j x'_j, \qquad
c_{f^{(2)}}(x, x', n) = \sigma_b^2 + \sigma_w^2 \, \frac{1}{n} \sum_{j=1}^{n} \phi\big(f_j^{(1)}(x)\big)\phi\big(f_j^{(1)}(x')\big).
\]

Let $\stackrel{a.s.}{\rightarrow}$ denote almost sure convergence. By the strong law of large numbers, as $n \rightarrow +\infty$,

\[
\frac{1}{n} \sum_{j=1}^{n} \phi\big(f_j^{(1)}(x)\big)^2 \stackrel{a.s.}{\rightarrow} \mathbb{E}\big[\phi\big(f_1^{(1)}(x)\big)^2\big], \qquad
\frac{1}{n} \sum_{j=1}^{n} \phi\big(f_j^{(1)}(x)\big)\phi\big(f_j^{(1)}(x')\big) \stackrel{a.s.}{\rightarrow} \mathbb{E}\big[\phi\big(f_1^{(1)}(x)\big)\phi\big(f_1^{(1)}(x')\big)\big],
\]

from which one can conjecture that, in the limit of infinite width, the stochastic processes $f_i^{(2)}(x)$ are distributed as iid (over $i$) centered GPs with kernel $K(x, x') = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}\big[\phi\big(f_1^{(1)}(x)\big)\phi\big(f_1^{(1)}(x')\big)\big]$. Provided that the nonlinearity $\phi$ is chosen so that $\phi\big(f_1^{(1)}(x)\big)$ has finite second moment, Matthews et al. (2018b) made this argument rigorous and extended it to deep NNs. A small numerical illustration of this Gaussian argument is sketched below.
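The following minimal simulation (ours, not part of the paper; all names and parameter values are illustrative) compares the empirical covariance of $\big(f^{(2)}(x, n), f^{(2)}(x', n)\big)$ over independent wide shallow networks with a Monte Carlo estimate of the kernel $K(x, x')$ for $\phi = \tanh$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w, sigma_b = 1.0, 0.5
I, n, R = 3, 2000, 4000          # input dim, width, number of independent networks
phi = np.tanh
x, xp = rng.normal(size=I), rng.normal(size=I)

# Sample R independent shallow networks and record (f^(2)(x, n), f^(2)(x', n)).
outs = np.empty((R, 2))
for r in range(R):
    W1 = rng.normal(0.0, sigma_w, size=(n, I))
    b1 = rng.normal(0.0, sigma_b, size=n)
    W2 = rng.normal(0.0, sigma_w, size=n)
    b2 = rng.normal(0.0, sigma_b)
    h_x, h_xp = phi(W1 @ x + b1), phi(W1 @ xp + b1)
    outs[r, 0] = W2 @ h_x / np.sqrt(n) + b2
    outs[r, 1] = W2 @ h_xp / np.sqrt(n) + b2

emp_cov = np.cov(outs.T)[0, 1]

# Monte Carlo estimate of K(x, x') = sigma_b^2 + sigma_w^2 E[phi(f1(x)) phi(f1(x'))].
w1 = rng.normal(0.0, sigma_w, size=(10**6, I))
b1 = rng.normal(0.0, sigma_b, size=10**6)
K = sigma_b**2 + sigma_w**2 * np.mean(phi(w1 @ x + b1) * phi(w1 @ xp + b1))

print(f"empirical covariance: {emp_cov:.4f}   kernel K(x, x'): {K:.4f}")
```

The two printed values should agree up to Monte Carlo error, which is the content of the conjecture above.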

A key assumption underlying the interplay between infinitely wide NNs and GPs is the finiteness of the variance of the parameters' distribution at initialization. In this paper we remove the finite-variance assumption by considering iid initializations based on stable distributions, which include Gaussian initializations as a special case. We study the infinite-width limit of fully connected feed-forward NNs in the following general setting: i) the NN is deep, namely it is composed of multiple layers; ii) biases and suitably scaled weights are iid according to centered symmetric stable distributions; iii) the width of the network's layers goes to infinity jointly over the layers, and not sequentially layer by layer; iv) the convergence in distribution is established jointly for multiple inputs, namely the convergence concerns the class of finite-dimensional distributions of the NN viewed as a stochastic process in function space. See Neal (1995) and Der and Lee (2006) for early works on NNs under stable initialization.

Within this setting, we show that the infinite-width limit of the NN, under a suitable scaling of the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions (Samoradnitsky, 2017). This process is referred to as the stable process. Our result may be viewed as a generalization of the main result of Matthews et al. (2018b) to the context of stable distributions, as well as a generalization of the results of Neal (1995) and Der and Lee (2006) to the context of deep NNs. It contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to extend the research directions i), ii) and iii) that rely on Gaussian infinite-width limits. The class of stable distributions is especially relevant here: while the contribution of each Gaussian weight vanishes as the width grows unbounded, some of the stable weights retain a significant size, thus allowing them to represent "hidden features" (Neal, 1995).

The paper is structured as follows. Section 2 contains some preliminaries on stable distributions, whereas in Section 3 we define the class of feed-forward NNs considered in this work. Section 4 contains our main result: as the width tends to infinity jointly over the network's layers, the finite-dimensional distributions of the NN converge to a multivariate stable distribution whose parameters are computed via a recursion over the layers; the convergence of the NN to the stable process then follows by finite-dimensional projections. In Section 5 we detail how our result extends previously established large-width convergence results and comment on related work, whereas in Section 6 we discuss how our result applies to the research lines highlighted in i), ii) and iii), which rely on GP limits. In Section 7 we comment on future research directions. The Supplementary Material (SM) contains all the proofs (SM A, B, C), a preliminary numerical experiment on the evaluation of the recursion (SM D), and an empirical investigation of the distribution of the parameters of trained NN models (SM E). Code is available at https://github.com/stepelu/deep-stable.

2 Stable distributions

Let $St(\alpha, \sigma)$ denote the symmetric centered stable distribution with stability parameter $\alpha \in (0, 2]$ and scale parameter $\sigma > 0$, and let $S_{\alpha,\sigma}$ be a random variable distributed as $St(\alpha, \sigma)$. That is, the characteristic function of $S_{\alpha,\sigma} \sim St(\alpha, \sigma)$ is $\varphi_{S_{\alpha,\sigma}}(t) = \mathbb{E}[e^{itS_{\alpha,\sigma}}] = e^{-\sigma^{\alpha}|t|^{\alpha}}$. For any $\sigma > 0$, a random variable $S_{\alpha,\sigma}$ with $0 < \alpha < 2$ has finite absolute moments $\mathbb{E}[|S_{\alpha,\sigma}|^{\alpha-\varepsilon}]$ for any $\varepsilon > 0$, while $\mathbb{E}[|S_{\alpha,\sigma}|^{\alpha}] = +\infty$. Note that when $\alpha = 2$ we have $S_{2,\sigma} \sim \mathcal{N}(0, 2\sigma^2)$; the random variable $S_{2,\sigma}$ has finite absolute moments of any order. For any $a \in \mathbb{R}$ we have the scaling identity $a S_{\alpha,1} \sim St(\alpha, |a|)$.

We recall the definition of the symmetric and centered multivariate stable distribution and of its marginal distributions. First, let $\mathbb{S}^{k-1}$ be the unit sphere in $\mathbb{R}^k$. Let $St_k(\alpha, \Gamma)$ denote the symmetric and centered $k$-dimensional stable distribution with stability parameter $\alpha \in (0, 2]$ and (finite) spectral measure $\Gamma$ on $\mathbb{S}^{k-1}$, and let $S_{\alpha,\Gamma}$ be a random vector of dimension $k \times 1$ distributed as $St_k(\alpha, \Gamma)$. The characteristic function of $S_{\alpha,\Gamma} \sim St_k(\alpha, \Gamma)$ is

\[
\varphi_{S_{\alpha,\Gamma}}(t) = \mathbb{E}\big[e^{it^{T} S_{\alpha,\Gamma}}\big] = \exp\Big\{ - \int_{\mathbb{S}^{k-1}} |t^{T} s|^{\alpha} \, \Gamma(ds) \Big\}. \tag{1}
\]

If $S_{\alpha,\Gamma} \sim St_k(\alpha, \Gamma)$, then the marginal distributions of $S_{\alpha,\Gamma}$ are described as follows. Let $1_r$ denote a vector of dimension $k \times 1$ with $1$ in the $r$-th entry and $0$ elsewhere. Then the random variable corresponding to the $r$-th element of $S_{\alpha,\Gamma} \sim St_k(\alpha, \Gamma)$ satisfies

\[
1_r^{T} S_{\alpha,\Gamma} \sim St(\alpha, \sigma(r)), \tag{2}
\]

where

\[
\sigma(r) = \Big( \int_{\mathbb{S}^{k-1}} |1_r^{T} s|^{\alpha} \, \Gamma(ds) \Big)^{1/\alpha}. \tag{3}
\]

The distribution $St_k(\alpha, \Gamma)$ with characteristic function (1) allows for marginals which are neither centered nor symmetric. However, in the present work all the marginals will be centered and symmetric, and the spectral measure $\Gamma$ will often be a discrete measure, i.e. $\Gamma(\cdot) = \sum_{j=1}^{n} \gamma_j \, \delta_{s_j}(\cdot)$ for $n \in \mathbb{N}$, $s_j \in \mathbb{S}^{k-1}$ and $\gamma_j \geq 0$. In particular, under these specific assumptions, we have

\[
\varphi_{S_{\alpha,\Gamma}}(t) = \exp\Big\{ - \sum_{j=1}^{n} \gamma_j \, |t^{T} s_j|^{\alpha} \Big\}.
\]

See Samoradnitsky (2017) for a detailed account of $S_{\alpha,\Gamma} \sim St_k(\alpha, \Gamma)$.
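When $\Gamma$ is discrete, exact sampling from $St_k(\alpha, \Gamma)$ is straightforward: if $\Gamma = \sum_j \gamma_j \delta_{s_j}$, then $\sum_j \gamma_j^{1/\alpha} s_j Z_j$, with $Z_j$ iid $S_{\alpha,1}$, has exactly the characteristic function displayed above. The following sketch is ours (it is not from the paper's code); it relies on scipy.stats.levy_stable, whose scale parametrization we assume matches $St(\alpha, \sigma)$ in the symmetric case $\beta = 0$.

```python
import numpy as np
from scipy.stats import levy_stable

def sample_multivariate_stable(alpha, gammas, s, size, seed=None):
    """Draw `size` samples from St_k(alpha, Gamma), with discrete spectral measure
    Gamma = sum_j gammas[j] * delta_{s[j]} and s[j] points on the unit sphere."""
    m, k = s.shape
    # Z[j, i] iid standard symmetric alpha-stable (S_{alpha, 1}).
    Z = levy_stable.rvs(alpha, 0.0, size=(m, size), random_state=seed)
    # X_i = sum_j gammas[j]^(1/alpha) * s[j] * Z[j, i]  ->  array of shape (size, k).
    return (Z.T * gammas**(1.0 / alpha)) @ s

alpha = 1.5
s = np.array([[1.0, 0.0], [0.0, 1.0], [np.sqrt(0.5), np.sqrt(0.5)]])  # points on S^1
gammas = np.array([0.5, 1.0, 2.0])
X = sample_multivariate_stable(alpha, gammas, s, size=10000, seed=1)
print(X.shape)  # (10000, 2)
```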
3 Deep stable networks

We consider fully connected feed-forward NNs composed of $D \geq 1$ layers, where each layer is of width $n \geq 1$. Let $w_{i,j}^{(l)}$ be the weights of the $l$-th layer, and assume that they are independent and identically distributed as $St(\alpha, \sigma_w)$, a stable distribution with stability parameter $\alpha \in (0, 2]$ and scale parameter $\sigma_w > 0$. That is, the characteristic function of $w_{i,j}^{(l)} \sim St(\alpha, \sigma_w)$ is

\[
\varphi_{w_{i,j}^{(l)}}(t) = \mathbb{E}\big[e^{itw_{i,j}^{(l)}}\big] = e^{-\sigma_w^{\alpha}|t|^{\alpha}}, \tag{4}
\]

for any $i \geq 1$, $j \geq 1$ and $l \geq 1$. Let $b_i^{(l)}$ denote the biases of the $l$-th hidden layer, and assume that they are independent and identically distributed as $St(\alpha, \sigma_b)$. That is, the characteristic function of $b_i^{(l)} \sim St(\alpha, \sigma_b)$ is

\[
\varphi_{b_i^{(l)}}(t) = \mathbb{E}\big[e^{itb_i^{(l)}}\big] = e^{-\sigma_b^{\alpha}|t|^{\alpha}}, \tag{5}
\]

for any $i \geq 1$ and $l \geq 1$. The random weights $w_{i,j}^{(l)}$ are independent of the biases $b_i^{(l)}$, for any $i \geq 1$, $j \geq 1$ and $l \geq 1$. In particular,

\[
w_{i,j}^{(l)} + b_i^{(l)} \sim St\big(\alpha, (\sigma_w^{\alpha} + \sigma_b^{\alpha})^{1/\alpha}\big).
\]

Let $\phi: \mathbb{R} \rightarrow \mathbb{R}$ be a nonlinearity with a finite number of discontinuities and such that it satisfies the envelope condition

\[
|\phi(s)| \leq a + b|s|^{\beta} \tag{6}
\]

for every $s \in \mathbb{R}$ and for some parameters $a, b > 0$ and $\beta < 1/\alpha$, $\beta < 1$. If $x \in \mathbb{R}^I$ is the input of the NN, then the NN is explicitly defined by means of

\[
f_i^{(1)}(x, n) = f_i^{(1)}(x) = \sum_{j=1}^{I} w_{i,j}^{(1)} x_j + b_i^{(1)}, \tag{7}
\]

and

\[
f_i^{(l)}(x, n) = \frac{1}{n^{1/\alpha}} \sum_{j=1}^{n} w_{i,j}^{(l)} \phi\big(f_j^{(l-1)}(x, n)\big) + b_i^{(l)}, \tag{8}
\]

for $l = 2, \dots, D$ and $i = 1, \dots, n$. The scaling of the weights in (8) will be shown to be the correct one to obtain non-degenerate limits as $n \rightarrow +\infty$.

4 Infinitely wide limits

We show that, as the width of the NN tends to infinity jointly over the network's layers, the finite-dimensional distributions of the network converge to a multivariate stable distribution whose parameters are computed via a suitable recursion over the network layers. Then, by combining this limiting result with standard arguments on finite-dimensional projections, we obtain the large-$n$ limit of the stochastic process $\big(f_i^{(l)}(x^{(1)}, n), \dots, f_i^{(l)}(x^{(k)}, n)\big)_{i \geq 1}$, where $x^{(1)}, \dots, x^{(k)}$ are the inputs to the NN. In particular, let $\stackrel{w}{\rightarrow}$ denote weak convergence. Then we show that, as $n \rightarrow +\infty$,

\[
\big(f_i^{(l)}(x^{(1)}, n), \dots, f_i^{(l)}(x^{(k)}, n)\big)_{i \geq 1} \stackrel{w}{\rightarrow} \bigotimes_{i \geq 1} St_k(\alpha, \Gamma(l)), \tag{9}
\]

where $\bigotimes$ denotes the product measure. From now on, $k$ is the number of inputs, which is equal to the dimensionality of the finite-dimensional distributions of interest for the stochastic processes $f_i^{(l)}$. Throughout the rest of the paper we assume that the assumptions introduced in Section 3 hold true. Hereafter we present a sketch of the proof of our main result for a fixed index $i$ and input $x$, and we defer to the SM for the complete proofs.

4.1 Large width asymptotics: k = 1

We characterize the limiting distribution of $f_i^{(l)}(x, n)$ as $n \rightarrow \infty$ for a fixed $i$ and input $x$. We show that, as $n \rightarrow +\infty$,

\[
f_i^{(l)}(x, n) \stackrel{w}{\rightarrow} St(\alpha, \sigma(l)), \tag{10}
\]

where the parameter $\sigma(l)$ is computed through the recursion

\[
\sigma(1) = \Big( \sigma_b^{\alpha} + \sigma_w^{\alpha} \sum_{j=1}^{I} |x_j|^{\alpha} \Big)^{1/\alpha}, \qquad
\sigma(l) = \Big( \sigma_b^{\alpha} + \sigma_w^{\alpha} \, \mathbb{E}_{f \sim q^{(l-1)}}\big[ |\phi(f)|^{\alpha} \big] \Big)^{1/\alpha},
\]

and $q^{(l)} = St(\alpha, \sigma(l))$ for each $l \geq 1$. The generalization of this result to $k \geq 1$ inputs is given in Section 4.2. A small numerical check of this recursion is sketched below.
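As an informal sanity check (ours, not part of the paper; the sampler and all names are ours, and we again assume scipy's levy_stable matches $St(\alpha, \sigma)$ for $\beta = 0$), one can evaluate the recursion for $\sigma(l)$ by Monte Carlo, approximating $\mathbb{E}_{f \sim q^{(l-1)}}[|\phi(f)|^{\alpha}]$ with an average over samples from $St(\alpha, \sigma(l-1))$, and compare the hidden units of a single wide network simulated from (7)-(8) against $St(\alpha, \sigma(l))$.

```python
import numpy as np
from scipy.stats import levy_stable

alpha, sigma_w, sigma_b = 1.5, 1.0, 0.3
phi = np.tanh
x = np.array([0.7, -1.2, 0.4])   # a single input, I = 3
D, n, M = 3, 2000, 10**5         # depth, width, Monte Carlo sample size

# Recursion for sigma(l): sigma(1) exactly, sigma(l) by Monte Carlo over q^(l-1).
sigma = (sigma_b**alpha + sigma_w**alpha * np.sum(np.abs(x)**alpha))**(1 / alpha)
for l in range(2, D + 1):
    f = levy_stable.rvs(alpha, 0.0, scale=sigma, size=M, random_state=3 + l)
    sigma = (sigma_b**alpha + sigma_w**alpha * np.mean(np.abs(phi(f))**alpha))**(1 / alpha)

# One wide network sampled from (7)-(8); units of layer D should be approximately St(alpha, sigma(D)).
f_l = levy_stable.rvs(alpha, 0.0, scale=sigma_w, size=(n, x.size), random_state=4) @ x \
      + levy_stable.rvs(alpha, 0.0, scale=sigma_b, size=n, random_state=5)
for l in range(2, D + 1):
    W = levy_stable.rvs(alpha, 0.0, scale=sigma_w, size=(n, n), random_state=6 + l)
    b = levy_stable.rvs(alpha, 0.0, scale=sigma_b, size=n, random_state=9 + l)
    f_l = W @ phi(f_l) / n**(1 / alpha) + b

# Compare e.g. the median absolute value of the units with that of St(alpha, sigma(D)).
print(np.median(np.abs(f_l)), sigma * levy_stable.ppf(0.75, alpha, 0.0))
```

Since the distribution is symmetric, the median of $|S_{\alpha,\sigma}|$ equals $\sigma$ times the 0.75-quantile of $S_{\alpha,1}$, which is what the last line compares.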
Proof of (10). The proof exploits the exchangeability of the sequence $(f_i^{(l)}(x, n))_{i \geq 1}$, an induction argument over the layer index $l$ for the directing (random) measure of $(f_i^{(l)}(x, n))_{i \geq 1}$, and some technical lemmas that are proved in the SM. Recall that the input $x$ is a real-valued vector of dimension $I$.

We start with a technical remark: in (7)-(8) the stochastic processes $f_i^{(l)}(x, n)$ are only defined for $i = 1, \dots, n$, while the limiting measure in (9) is the product measure over $i \geq 1$. This does not create problems, as for each finite $\mathcal{L} \subset \mathbb{N}$ there is an $n$ large enough such that the processes $f_i^{(l)}(x, n)$ are defined for each $i \in \mathcal{L}$. In any case, the simplest solution consists in extending $f_i^{(l)}(x, n)$ from $i = 1, \dots, n$ to $i \geq 1$ in (7)-(8), and we make this assumption in all the proofs.

By means of (4) and (5), for $i \geq 1$,

\[
\varphi_{f_i^{(1)}(x)}(t) = \mathbb{E}\big[e^{itf_i^{(1)}(x)}\big]
= \mathbb{E}\Big[\exp\Big\{ it \Big( \sum_{j=1}^{I} w_{i,j}^{(1)} x_j + b_i^{(1)} \Big)\Big\}\Big]
= \exp\Big\{ - \Big( \sigma_w^{\alpha} \sum_{j=1}^{I} |x_j|^{\alpha} + \sigma_b^{\alpha} \Big) |t|^{\alpha} \Big\},
\]

i.e. $f_i^{(1)}(x) \stackrel{d}{=} S_{\alpha, (\sigma_w^{\alpha} \sum_{j=1}^{I} |x_j|^{\alpha} + \sigma_b^{\alpha})^{1/\alpha}}$. For $l = 2, \dots, D$,

\[
\varphi_{f_i^{(l)}(x,n) \mid \{f_j^{(l-1)}(x,n)\}_{j=1,\dots,n}}(t)
= \mathbb{E}\Big[ e^{itf_i^{(l)}(x,n)} \,\Big|\, \{f_j^{(l-1)}(x,n)\}_{j=1,\dots,n} \Big]
\]
\[
= \mathbb{E}\Big[ \exp\Big\{ it \Big( \frac{1}{n^{1/\alpha}} \sum_{j=1}^{n} w_{i,j}^{(l)} \phi\big(f_j^{(l-1)}(x,n)\big) + b_i^{(l)} \Big)\Big\} \,\Big|\, \{f_j^{(l-1)}(x,n)\}_{j=1,\dots,n} \Big]
\]
\[
= \exp\Big\{ - \Big( \frac{\sigma_w^{\alpha}}{n} \sum_{j=1}^{n} \big|\phi\big(f_j^{(l-1)}(x,n)\big)\big|^{\alpha} + \sigma_b^{\alpha} \Big) |t|^{\alpha} \Big\},
\]

i.e.

\[
f_i^{(l)}(x,n) \mid \{f_j^{(l-1)}(x,n)\}_{j=1,\dots,n} \stackrel{d}{=} S_{\alpha, (\frac{\sigma_w^{\alpha}}{n} \sum_{j=1}^{n} |\phi(f_j^{(l-1)}(x,n))|^{\alpha} + \sigma_b^{\alpha})^{1/\alpha}}.
\]

It follows from (8) that, for every fixed $l$ and every fixed $n$, the sequence $(f_i^{(l)}(x, n))_{i \geq 1}$ is exchangeable. Let $p_n^{(l)}$ denote the directing (random) measure of the exchangeable sequence $(f_i^{(l)}(x, n))_{i \geq 1}$; that is, by de Finetti's representation theorem, conditionally on $p_n^{(l)}$ the $f_i^{(l)}(x, n)$'s are iid distributed as $p_n^{(l)}$. Now, consider the induction hypothesis that $p_n^{(l-1)} \stackrel{w}{\rightarrow} q^{(l-1)}$ as $n \rightarrow +\infty$, with $q^{(l-1)}$ being $St(\alpha, \sigma(l-1))$, where the parameter $\sigma(l-1)$ will be specified. Therefore,

\[
\mathbb{E}\big[e^{itf_i^{(l)}(x,n)}\big]
= \mathbb{E}\Big[ \exp\Big\{ -|t|^{\alpha} \Big( \frac{\sigma_w^{\alpha}}{n} \sum_{j=1}^{n} \big|\phi\big(f_j^{(l-1)}(x,n)\big)\big|^{\alpha} + \sigma_b^{\alpha} \Big)\Big\}\Big]
\]
\[
= e^{-|t|^{\alpha}\sigma_b^{\alpha}} \, \mathbb{E}\Big[ \exp\Big\{ -|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} \sum_{j=1}^{n} \big|\phi\big(f_j^{(l-1)}(x,n)\big)\big|^{\alpha} \Big\}\Big]
\]
\[
= e^{-|t|^{\alpha}\sigma_b^{\alpha}} \, \mathbb{E}\Big[ \Big( \int \exp\Big\{ -|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha} \Big\} \, p_n^{(l-1)}(df) \Big)^{n} \Big], \tag{11}
\]

where the first equality follows by plugging in the definition of $f_i^{(l)}(x, n)$, rewriting $\mathbb{E}[\exp\{\sum_{j=1}^{n} \cdots\}]$ as $\mathbb{E}[\prod_{j=1}^{n} \exp\{\cdots\}] = \prod_{j=1}^{n} \mathbb{E}[\exp\{\cdots\}]$ by conditional independence, computing the characteristic function of each term, and rearranging. The last equality holds because, since $(f_j^{(l-1)}(x, n))_{j \geq 1}$ is exchangeable, there exists (by de Finetti's theorem) a random probability measure $p_n^{(l-1)}$ such that, conditionally on $p_n^{(l-1)}$, the $f_j^{(l-1)}(x, n)$ are iid distributed as $p_n^{(l-1)}$, which explains (11).

Now, let $\stackrel{p}{\rightarrow}$ denote convergence in probability. The following technical lemmas (Appendix A):

L1) for each $l \geq 2$, $\Pr[p_n^{(l-1)} \in \mathcal{I}] = 1$, with $\mathcal{I} = \{p : \int |\phi(f)|^{\alpha} \, p(df) < +\infty\}$;

L2) $\int |\phi(f)|^{\alpha} \, p_n^{(l-1)}(df) \stackrel{p}{\rightarrow} \int |\phi(f)|^{\alpha} \, q^{(l-1)}(df)$, as $n \rightarrow +\infty$;

L3) $\int |\phi(f)|^{\alpha} \big[ 1 - e^{-|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha}} \big] \, p_n^{(l-1)}(df) \stackrel{p}{\rightarrow} 0$, as $n \rightarrow +\infty$;

together with Lagrange's theorem (the mean value theorem), are the main ingredients for proving (10) by studying the large-$n$ asymptotic behavior of (11). By combining (11) with Lemma L1,

\[
\mathbb{E}\big[e^{itf_i^{(l)}(x,n)}\big] = e^{-|t|^{\alpha}\sigma_b^{\alpha}} \,
\mathbb{E}\Big[ \mathbb{1}_{\{p_n^{(l-1)} \in \mathcal{I}\}} \Big( \int \exp\Big\{ -|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha} \Big\} \, p_n^{(l-1)}(df) \Big)^{n} \Big].
\]

By means of Lagrange's theorem, there exists $\theta_n \in [0, 1]$ such that

\[
\exp\Big\{ -|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha} \Big\}
= 1 - |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha}
+ |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha} \Big( 1 - \exp\Big\{ -\theta_n |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha} \Big\}\Big).
\]

Now, since

\[
\int |\phi(f)|^{\alpha} \big[ 1 - e^{-\theta_n |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha}} \big] \, p_n^{(l-1)}(df)
\leq \int |\phi(f)|^{\alpha} \big[ 1 - e^{-|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha}} \big] \, p_n^{(l-1)}(df),
\]

we obtain

\[
\mathbb{E}\big[e^{itf_i^{(l)}(x,n)}\big] \leq e^{-|t|^{\alpha}\sigma_b^{\alpha}} \,
\mathbb{E}\Big[ \mathbb{1}_{\{p_n^{(l-1)} \in \mathcal{I}\}} \Big( 1 - |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} \int |\phi(f)|^{\alpha} \, p_n^{(l-1)}(df)
+ |t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} \int |\phi(f)|^{\alpha} \big[ 1 - e^{-|t|^{\alpha} \frac{\sigma_w^{\alpha}}{n} |\phi(f)|^{\alpha}} \big] \, p_n^{(l-1)}(df) \Big)^{n} \Big].
\]

Finally, recall the fundamental limit $e^{x} = \lim_{n \rightarrow +\infty} (1 + x/n)^{n}$. This, combined with Lemmas L2 and L3, leads to

\[
\mathbb{E}\big[e^{itf_i^{(l)}(x,n)}\big] \rightarrow e^{-|t|^{\alpha} [\sigma_b^{\alpha} + \sigma_w^{\alpha} \int |\phi(f)|^{\alpha} q^{(l-1)}(df)]}
\]

as $n \rightarrow +\infty$. That is, we have proved that the large-$n$ limiting distribution of $f_i^{(l)}(x, n)$ is $St(\alpha, \sigma(l))$, where we set

\[
\sigma(l) = \Big( \sigma_b^{\alpha} + \sigma_w^{\alpha} \int |\phi(f)|^{\alpha} \, q^{(l-1)}(df) \Big)^{1/\alpha}.
\]

4.2 Large width asymptotics: k ≥ 1

We establish the convergence in distribution of $\big(f_i^{(l)}(x^{(1)}, n), \dots, f_i^{(l)}(x^{(k)}, n)\big)$ as $n \rightarrow +\infty$ for a fixed $i$ and $k$ inputs $x^{(1)}, \dots, x^{(k)}$. This result, combined with standard arguments on finite-dimensional projections, then establishes the convergence of the NN to the stable process. Precisely, we show that, as $n \rightarrow +\infty$,

\[
\big(f_i^{(l)}(x^{(1)}, n), \dots, f_i^{(l)}(x^{(k)}, n)\big) \stackrel{w}{\rightarrow} St_k(\alpha, \Gamma(l)), \tag{12}
\]

where the spectral measure $\Gamma(l)$ is computed through the recursion

\[
\Gamma(1) = \sigma_b^{\alpha} \, \|\mathbf{1}\|^{\alpha} \, \delta_{\mathbf{1}/\|\mathbf{1}\|} + \sigma_w^{\alpha} \sum_{j=1}^{I} \|x_j\|^{\alpha} \, \delta_{x_j/\|x_j\|}, \tag{13}
\]

\[
\Gamma(l) = \sigma_b^{\alpha} \, \|\mathbf{1}\|^{\alpha} \, \delta_{\mathbf{1}/\|\mathbf{1}\|} + \sigma_w^{\alpha} \, \mathbb{E}_{f \sim q^{(l-1)}}\big[ \|\phi(f)\|^{\alpha} \, \delta_{\phi(f)/\|\phi(f)\|} \big], \tag{14}
\]

and $q^{(l)} = St_k(\alpha, \Gamma(l))$ for each $l \geq 1$, where $x_j = [x_j^{(1)}, \dots, x_j^{(k)}] \in \mathbb{R}^k$ collects the $j$-th coordinate of the $k$ inputs, $\mathbf{1}$ is the all-ones vector in $\mathbb{R}^k$, and $\phi$ acts elementwise. Here (and in all the expressions involving $\delta_{\bullet}$) we make use of the notational convention that if $\bullet = 0$ in $\delta_{\bullet}$, then $\delta_{\bullet} = 0$. This convention allows us to avoid making the notation more cumbersome than necessary to explicitly exclude the case $\phi(f) = 0$, for which $\phi(f)/\|\phi(f)\|$ is undefined. We omit the sketch of the proof of (12), as it is a step-by-step parallel of the proof of (10) with the added complexities due to the multivariate stable distributions. The reader can refer to the SM for the full proof. A small sketch relating (13) to the univariate recursion of Section 4.1 is given below.
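As a quick consistency check (ours, not part of the paper), one can verify numerically that the marginal scale (2)-(3) induced by the discrete spectral measure $\Gamma(1)$ in (13) on the $r$-th coordinate coincides with the univariate scale $\sigma(1)$ of Section 4.1 evaluated at the single input $x^{(r)}$.

```python
import numpy as np

alpha, sigma_w, sigma_b = 1.7, 1.0, 0.5
rng = np.random.default_rng(4)
X = rng.normal(size=(3, 2))            # I = 3 coordinates, k = 2 inputs x^(1), x^(2)
I, k = X.shape

# Discrete spectral measure Gamma(1): weights gamma_j at points s_j on S^(k-1), eq. (13).
ones = np.ones(k)
points = [ones / np.linalg.norm(ones)] + [X[j] / np.linalg.norm(X[j]) for j in range(I)]
weights = [sigma_b**alpha * np.linalg.norm(ones)**alpha] + \
          [sigma_w**alpha * np.linalg.norm(X[j])**alpha for j in range(I)]

# Marginal scale of the r-th coordinate, eqs. (2)-(3): sigma(r)^alpha = sum_j gamma_j |s_j[r]|^alpha.
r = 0
scale_marginal = sum(g * abs(s[r])**alpha for g, s in zip(weights, points))**(1 / alpha)

# Univariate recursion of Section 4.1 applied to the single input x^(r).
scale_univariate = (sigma_b**alpha + sigma_w**alpha * np.sum(np.abs(X[:, r])**alpha))**(1 / alpha)

print(scale_marginal, scale_univariate)   # the two values coincide
```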

4.3 Finite-dimensional projections

In Section 4.2 we obtained the convergence in law of $\big(f_i^{(l)}(x^{(1)}, n), \dots, f_i^{(l)}(x^{(k)}, n)\big)$, for $k$ inputs and a generic $i$, to a multivariate stable distribution. Let us refer to this random vector as $f_i(x)$. We now derive the limiting behavior in law of $f_i(x)$ jointly over all $i = 1, \dots$ (again for a given collection of $k$ inputs). It is enough to study the convergence of $f_1(x), \dots, f_n(x)$ for a generic $n \geq 1$; that is, it is enough to establish the convergence of the finite-dimensional distributions over $i$, where we consider $f_i(x)$ as a random sequence over $i$. See Billingsley (1999) for details.

To establish the convergence of the finite-dimensional distributions (over $i$) it then suffices to establish the convergence of linear combinations. More precisely, let $X = [x^{(1)}, \dots, x^{(k)}] \in \mathbb{R}^{I \times k}$. We show that, as $n \rightarrow +\infty$,

\[
\big(f_i^{(l)}(X, n)\big)_{i \geq 1} \stackrel{w}{\rightarrow} \bigotimes_{i \geq 1} St_k(\alpha, \Gamma(l)),
\]

by studying the large-$n$ asymptotic behavior of any finite linear combination of the $f_i^{(l)}(X, n)$'s, for $i \in \mathcal{L} \subset \mathbb{N}$. Following the notation of Matthews et al. (2018b), let

\[
T^{(l)}(\mathcal{L}, p, X, n) = \sum_{i \in \mathcal{L}} p_i \big[ f_i^{(l)}(X, n) - b_i^{(l)} \mathbf{1} \big].
\]

Then we write

\[
T^{(l)}(\mathcal{L}, p, X, n)
= \sum_{i \in \mathcal{L}} p_i \Big[ \frac{1}{n^{1/\alpha}} \sum_{j=1}^{n} w_{i,j}^{(l)} \phi\big(f_j^{(l-1)}(X, n)\big) \Big]
= \frac{1}{n^{1/\alpha}} \sum_{j=1}^{n} \gamma_j^{(l)}(\mathcal{L}, p, X, n),
\]

where

\[
\gamma_j^{(l)}(\mathcal{L}, p, X, n) = \sum_{i \in \mathcal{L}} p_i \, w_{i,j}^{(l)} \, \phi\big(f_j^{(l-1)}(X, n)\big).
\]

Then,

\[
\varphi_{T^{(l)}(\mathcal{L}, p, X, n) \mid \{f_j^{(l-1)}(X, n)\}_{j=1,\dots,n}}(t)
= \mathbb{E}\Big[ e^{it^{T} T^{(l)}(\mathcal{L}, p, X, n)} \,\Big|\, \{f_j^{(l-1)}(X, n)\}_{j=1,\dots,n} \Big]
= \prod_{j=1}^{n} \prod_{i \in \mathcal{L}} e^{-\frac{\sigma_w^{\alpha}}{n} |p_i \, t^{T} \phi(f_j^{(l-1)}(X, n))|^{\alpha}}.
\]

That is,

\[
T^{(l)}(\mathcal{L}, p, X, n) \mid \{f_j^{(l-1)}(X, n)\}_{j=1,\dots,n} \stackrel{d}{=} S_{\alpha, \Gamma_n^{(l)}},
\]

where

\[
\Gamma_n^{(l)} = \frac{1}{n} \sum_{j=1}^{n} \sum_{i \in \mathcal{L}} |p_i|^{\alpha} \, \sigma_w^{\alpha} \, \big\|\phi\big(f_j^{(l-1)}(X, n)\big)\big\|^{\alpha} \, \delta_{\phi(f_j^{(l-1)}(X, n)) / \|\phi(f_j^{(l-1)}(X, n))\|}.
\]

Then, along lines similar to the proof of the large-$n$ asymptotics for the $i$-th coordinate, we have

\[
\mathbb{E}\big[ e^{it^{T} T^{(l)}(\mathcal{L}, p, X, n)} \big] \rightarrow
\exp\Big\{ - \int \int_{\mathbb{S}^{k-1}} |t^{T} s|^{\alpha}
\sum_{i \in \mathcal{L}} |p_i|^{\alpha} \sigma_w^{\alpha} \, \|\phi(f)\|^{\alpha} \, \delta_{\phi(f)/\|\phi(f)\|}(ds) \, q^{(l-1)}(df) \Big\}
\]

as $n \rightarrow +\infty$. This completes the proof of the limiting behaviour (9).

5 Related work

For the classical case of Gaussian weights and biases, and more generally for finite-variance iid distributions, the seminal work is that of Neal (1995). There the author establishes, among other notable contributions, the connection between infinitely wide shallow (one hidden layer) NNs and centered GPs; we reviewed the essence of this argument in Section 1.

This result is extended in Lee et al. (2018) to deep NNs where the width $n(l)$ of each layer $l$ goes to infinity sequentially, starting from the lowest layer, i.e. from $n(1)$ to $n(D)$. The sequential nature of the limits reduces the task to a sequential application of the approach of Neal (1995). The computation of the GP kernel for each layer $l$ involves a recursion, and the authors propose a numerical method to approximate the integral involved in each step of the recursion. The case where each $n(l)$ goes to infinity jointly, i.e. $n(l) = n$, is considered in Matthews et al. (2018a) under more restrictive hypotheses, which are relaxed in Matthews et al. (2018b). While this setting is the most representative of a sequence of increasingly wide networks, the theoretical analysis is considerably more complicated, as it does not reduce to a sequential application of the classical multivariate central limit theorem.

Going beyond finite-variance weight and bias distributions, Neal (1995) also introduced preliminary results for infinitely wide shallow NNs when weights and biases follow centered symmetric stable distributions. These results are refined in Der and Lee (2006), which establishes the convergence to a stable process, again in the setting of shallow NNs.

The present paper can be considered a generalization of the work of Matthews et al. (2018b) to the context of weights and biases distributed according to centered and symmetric stable distributions. Our proof follows different arguments from the proof of Matthews et al. (2018b); in particular, it does not rely on the central limit theorem for exchangeable sequences (Blum et al., 1958). Hence, since the Gaussian distribution is a special case of the stable distribution, our proof provides an alternative and self-contained proof of the result of Matthews et al. (2018b). It should be noted that our envelope condition (6) is more restrictive than the linear envelope condition of Matthews et al. (2018b). For the classical Gaussian setting, the conditions on the activation function have been weakened in the work of Yang (2019).

Finally, there has been recent interest in using heavy-tailed distributions for gradient noise (Simsekli et al., 2019) and for trained parameter distributions (Martin and Mahoney, 2019).

In particular, Martin and Mahoney (2019) includes an empirical analysis of the parameters of pre-trained convolutional architectures (which we also investigate in SM E) supportive of heavy-tailed distributions. Results of this kind are compatible with the conjecture that stochastic processes arising from NNs whose parameters are heavy-tailed might be closer representations of their finite, high-performing, counterparts.

6 Future applications

6.1 Bayesian inference

Infinitely wide NNs with centered iid Gaussian initializations, and more generally with finite-variance centered iid initializations, give rise to iid centered GPs at every layer $l$. Let us assume that weights and biases are distributed as in Section 1, and let us assume $L$ layers ($L - 1$ hidden layers). Each centered GP is characterized by its covariance kernel function. Let us denote by $f^{(l)}$ such a GP for layer $2 \leq l \leq L$. Over two inputs $x$ and $x'$, the distribution of $\big(f^{(l)}(x), f^{(l)}(x')\big)$ is characterized by the variances $q_x^{(l)} = \mathbb{V}[f^{(l)}(x)]$, $q_{x'}^{(l)} = \mathbb{V}[f^{(l)}(x')]$ and by the covariance $c_{x,x'}^{(l)} = \mathbb{C}[f^{(l)}(x), f^{(l)}(x')]$. These quantities satisfy

\[
q_x^{(l)} = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}\Big[ \phi\Big( \sqrt{q_x^{(l-1)}}\, z \Big)^2 \Big], \tag{15}
\]

\[
c_{x,x'}^{(l)} = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}\Big[ \phi\Big( \sqrt{q_x^{(l-1)}}\, z \Big)\, \phi\Big( \sqrt{q_{x'}^{(l-1)}}\, \Big( \rho_{x,x'}^{(l-1)} z + \sqrt{1 - (\rho_{x,x'}^{(l-1)})^2}\, z' \Big) \Big) \Big], \tag{16}
\]

where $z$ and $z'$ are independent standard Gaussian random variables $\mathcal{N}(0, 1)$,

\[
\rho_{x,x'}^{(l)} = \frac{c_{x,x'}^{(l)}}{\sqrt{q_x^{(l)}}\sqrt{q_{x'}^{(l)}}}, \tag{17}
\]

with initial conditions $q_x^{(1)} = \sigma_b^2 + \sigma_w^2 \|x\|^2$ and $c_{x,x'}^{(1)} = \sigma_b^2 + \sigma_w^2 \langle x, x' \rangle$.

To perform prediction via $\mathbb{E}[f^{(L)}(x^*) \mid x^*, \mathcal{D}]$, it is necessary to compute these recursions for all ordered pairs of data points $x, x'$ in the training dataset $\mathcal{D}$, and for all pairs $x^*, x$ with $x \in \mathcal{D}$. Lee et al. (2018) propose an efficient quadrature solution that keeps the computational requirements manageable for an arbitrary activation $\phi$.

In our setting, the corresponding recursion is defined by (13)-(14), which is a more computationally challenging problem than in the Gaussian setting. A sketch of a potential approach is as follows. Over the training data points and the test points, $f^{(1)} \sim St_k(\alpha, \Gamma(1))$, where $k$ is equal to the size of the training and test datasets combined. As $\Gamma(1)$ is a discrete measure, exact simulation algorithms are available with a computational cost of $\mathcal{O}(I)$ per sample (Nolan, 2008). We can thus generate $M$ samples $\tilde{f}_j^{(1)}$, $j = 1, \dots, M$, in $\mathcal{O}(IM)$, and use these to approximate $f^{(2)} \sim St_k(\alpha, \Gamma(2))$ with $St_k(\alpha, \tilde{\Gamma}(2))$, where

\[
\tilde{\Gamma}(2) = \sigma_b^{\alpha} \, \|\mathbf{1}\|^{\alpha} \, \delta_{\mathbf{1}/\|\mathbf{1}\|} + \sigma_w^{\alpha} \, \frac{1}{M} \sum_{j=1}^{M} \big\|\phi\big(\tilde{f}_j^{(1)}\big)\big\|^{\alpha} \, \delta_{\phi(\tilde{f}_j^{(1)}) / \|\phi(\tilde{f}_j^{(1)})\|}.
\]

We can repeat this procedure by generating (approximate) random samples $\tilde{f}_j^{(2)}$, at a cost of $\mathcal{O}(M^2)$, which in turn are used to approximate $\Gamma(3)$, and so on. In this procedure the errors can accumulate across the layers, as in Lee et al. (2018). This may be ameliorated by using the quasi-random number generators of Joe and Kuo (2008), as the sampling algorithms for multivariate stable distributions (Weron, 1996; Weron et al., 2010; Nolan, 2008) are all implemented as transformations of uniform distributions; the use of QRNGs effectively defines a quadrature scheme for the integration problem. We report in the SM preliminary results regarding the numerical approximation of the recursion defined by (13)-(14). A code sketch of this sampling-based approximation is given below.

This leaves us with the problem of computing a statistic of $f^{(L)}(x^*) \mid (x^*, \mathcal{D})$, or of sampling from it, to perform prediction. Again, it could be beneficial to leverage the discreteness of $\tilde{\Gamma}(L)$. For example, such multivariate stable random variables can be expressed as suitable linear transformations of independent stable random variables (Samoradnitsky, 2017), and results expressing stable variables as mixtures of Gaussian variables are available in Samoradnitsky (2017).
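A minimal sketch of the sampling-based approximation of (13)-(14) follows. It is ours, uses plain Monte Carlo rather than the QRNG variant discussed above, all names and parameter values are illustrative, and it again assumes scipy's levy_stable matches $St(\alpha, \sigma)$ for $\beta = 0$.

```python
import numpy as np
from scipy.stats import levy_stable

def discrete_gamma(alpha, sigma_b, atom_coef, atoms, k):
    """(weights, points) representation of a discrete spectral measure of the form
    sigma_b^alpha ||1||^alpha delta_{1/||1||} + atom_coef * sum_j ||a_j||^alpha delta_{a_j/||a_j||},
    dropping zero atoms as per the notational convention of Section 4.2."""
    atoms = atoms[np.linalg.norm(atoms, axis=1) > 0]
    norms = np.linalg.norm(atoms, axis=1)
    ones = np.ones(k)
    weights = np.concatenate(([sigma_b**alpha * np.linalg.norm(ones)**alpha],
                              atom_coef * norms**alpha))
    points = np.vstack([ones / np.linalg.norm(ones), atoms / norms[:, None]])
    return weights, points

def sample_discrete_stable(alpha, weights, points, size, seed):
    """Exact samples from St_k(alpha, Gamma) for a discrete Gamma (cf. Nolan, 2008)."""
    Z = levy_stable.rvs(alpha, 0.0, size=(len(weights), size), random_state=seed)
    return (Z.T * weights**(1.0 / alpha)) @ points       # array of shape (size, k)

# Toy setup: k = 4 inputs in R^3, tanh nonlinearity, depth 3, M Monte Carlo samples per layer.
rng = np.random.default_rng(7)
alpha, sigma_w, sigma_b, phi = 1.8, 1.0, 0.3, np.tanh
X = rng.normal(size=(3, 4))                    # row j of X is x_j = (x_j^(1), ..., x_j^(4))
I, k = X.shape
M, depth = 2000, 3

# Gamma(1), eq. (13): the atoms are the rows x_j, each with weight sigma_w^alpha ||x_j||^alpha.
weights, points = discrete_gamma(alpha, sigma_b, sigma_w**alpha, X, k)
for l in range(2, depth + 1):
    # Approximate E_{f ~ q^(l-1)}[...] in eq. (14) with M Monte Carlo samples, as for Gamma-tilde(2).
    f = sample_discrete_stable(alpha, weights, points, size=M, seed=10 + l)
    weights, points = discrete_gamma(alpha, sigma_b, sigma_w**alpha / M, phi(f), k)

print(f"Gamma({depth}) approximated by {len(weights)} atoms")
```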

6.2 Neural tangent kernel

In Section 6.1 we reviewed how the connection with GPs makes it possible to perform Bayesian inference directly on the limiting process. This corresponds to a "weakly-trained" regime of NNs, in the sense that the point (mean) predictions are equivalent to assuming an $l_2$ loss function and fitting only a terminal linear layer to the training data, i.e. performing a kernel regression (Arora et al., 2019). The works of Jacot et al. (2018), Lee et al. (2019) and Arora et al. (2019) consider "fully-trained" NNs with $l_2$ loss and continuous-time gradient descent. Under Gaussian initialization assumptions, it is shown that, as the width of the NN goes to infinity, the point predictions of such fully trained networks are again given by a kernel regression, but with respect to a different kernel, the neural tangent kernel.

In the derivation of the neural tangent kernel, one important point is that the gradients are not computed with respect to the standard model parameters, i.e. the weights and biases entering the affine transforms. Instead, they are "reparametrized gradients", computed with respect to parameters initialized as $\mathcal{N}(0, 1)$, with any scaling (standard deviation) introduced by explicit parameter multiplication. It would thus be interesting to study whether a corresponding neural tangent kernel can be defined for the case of stable distributions with $0 < \alpha < 2$, and whether the parametrization of (7)-(8) is the appropriate one to do so.
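To make the reparametrization concrete, the following sketch (ours and purely illustrative; the stable variant is only an analogy suggested by the scaling identity $aS_{\alpha,1} \sim St(\alpha, |a|)$ of Section 2, not a construction from the paper) contrasts a layer in the standard parametrization with the NTK-style parametrization, in which all trainable parameters are initialized as $\mathcal{N}(0, 1)$ and the scales $\sigma_w, \sigma_b$ enter as explicit multipliers.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(11)
n, sigma_w, sigma_b, alpha = 1000, 1.2, 0.5, 1.7
h = np.tanh(rng.normal(size=n))                 # some previous-layer activations

# Standard parametrization: the scales live in the parameter distribution.
W_std = rng.normal(0.0, sigma_w, size=n)
b_std = rng.normal(0.0, sigma_b)
f_std = W_std @ h / np.sqrt(n) + b_std

# NTK-style parametrization: parameters are N(0, 1); scales are explicit multipliers,
# so gradients are taken with respect to the N(0, 1) variables.
W_ntk = rng.normal(0.0, 1.0, size=n)
b_ntk = rng.normal(0.0, 1.0)
f_ntk = sigma_w * (W_ntk @ h) / np.sqrt(n) + sigma_b * b_ntk

# A stable analogue of the same reparametrization for the layer (8): St(alpha, 1) parameters,
# scales as explicit multipliers, and the width scaling n^(-1/alpha) instead of n^(-1/2).
W_st = levy_stable.rvs(alpha, 0.0, size=n, random_state=12)
b_st = levy_stable.rvs(alpha, 0.0, random_state=13)
f_st = sigma_w * (W_st @ h) / n**(1 / alpha) + sigma_b * b_st

print(f_std, f_ntk, f_st)  # f_std and f_ntk share the same law; f_st is the alpha-stable counterpart
```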
6.3 Information propagation

The recursions (15)-(16) define the evolution over depth of the distribution of $f^{(l)}$ for two inputs $x, x'$ when weights and biases are distributed as in Section 1. The information propagation framework studies the behavior of $q_x^{(l)}$ and $\rho_{x,x'}^{(l)}$ as $l \rightarrow +\infty$. It is shown in Poole et al. (2016) and Schoenholz et al. (2017) that the $(\sigma_w, \sigma_b)$ positive quadrant is divided into two regions: a stable phase, where $\rho_{x,x'}^{(l)} \rightarrow 1$, and a chaotic phase, where $\rho_{x,x'}^{(l)}$ converges to a random variable (in the $\phi = \tanh$ case; in other cases the limiting processes may fail to exist). Thus, in the stable phase $f^{(l)}$ is eventually perfectly correlated over inputs (and in most cases perfectly constant), while in the chaotic phase it is almost everywhere discontinuous. The work of Hayou et al. (2019) formalizes these results and investigates the case where $(\sigma_w, \sigma_b)$ lies on the curve separating the stable phase from the chaotic phase, i.e. the edge of chaos. There it is shown that the behavior is qualitatively similar to that of the stable phase, but with a lower rate of convergence with respect to depth. Thus, in all cases the distribution of $f^{(l)}$ eventually collapses to degenerate and inexpressive distributions as depth increases.

In this context, it would be interesting to study the impact of the use of stable distributions. All the results mentioned above hold for the Gaussian case, which corresponds to $\alpha = 2$; the further analysis would thus study the case $0 < \alpha < 2$, resulting in a triplet $(\sigma_w, \sigma_b, \alpha)$. Even though it seems hard to escape the curse of depth under iid initializations, it might be that the use of stable distributions, with their non-uniformly-vanishing relevance at the unit level (Neal, 1995), slows down the rate of convergence to the limiting regime.
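The following sketch (ours; a plain Monte Carlo evaluation of the expectations in (15)-(16), with illustrative values of $(\sigma_w, \sigma_b)$) iterates the Gaussian recursion and tracks $\rho_{x,x'}^{(l)}$ over depth for $\phi = \tanh$, illustrating the convergence to a limiting value discussed above.

```python
import numpy as np

rng = np.random.default_rng(21)
phi = np.tanh
M = 200_000                                  # Monte Carlo sample size for the expectations
z, zp = rng.normal(size=M), rng.normal(size=M)

def propagate(sigma_w, sigma_b, q_x, q_xp, c, depth):
    """Iterate the recursions (15)-(17) for `depth` layers; return the correlations rho^(l)."""
    rhos = []
    for _ in range(depth):
        rho = min(1.0, c / np.sqrt(q_x * q_xp))
        u = np.sqrt(q_x) * z
        up = np.sqrt(q_xp) * (rho * z + np.sqrt(1.0 - rho**2) * zp)  # same law as sqrt(q_xp) * z
        q_x = sigma_b**2 + sigma_w**2 * np.mean(phi(u)**2)            # eq. (15) at x
        q_xp = sigma_b**2 + sigma_w**2 * np.mean(phi(up)**2)          # eq. (15) at x'
        c = sigma_b**2 + sigma_w**2 * np.mean(phi(u) * phi(up))       # eq. (16)
        rhos.append(c / np.sqrt(q_x * q_xp))                          # eq. (17)
    return rhos

# Initial conditions q^(1), c^(1) for two illustrative inputs.
x, xp = np.array([1.0, -0.5]), np.array([0.3, 0.8])
for sigma_w, sigma_b in [(1.0, 0.2), (2.5, 0.2)]:    # two illustrative (sigma_w, sigma_b) settings
    q_x = sigma_b**2 + sigma_w**2 * x @ x
    q_xp = sigma_b**2 + sigma_w**2 * xp @ xp
    c = sigma_b**2 + sigma_w**2 * x @ xp
    print(sigma_w, sigma_b, [round(r, 3) for r in propagate(sigma_w, sigma_b, q_x, q_xp, c, depth=15)])
```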
7 Conclusions

Within the setting of fully connected feed-forward deep NNs with weights and biases iid according to centered and symmetric stable distributions, we proved that the infinite-width limit of the NN, under a suitable scaling of the weights, is a stable process. This result contributes to the theory of fully connected feed-forward deep NNs, generalizing the work of Matthews et al. (2018b). We presented an extensive discussion of how our result can be used to extend recent lines of research which rely on GP limits.

On the theoretical side, further developments of our work are possible. Firstly, Matthews et al. (2018b) performs an empirical analysis of the rates of convergence to the limiting process, as a function of depth, with respect to the MMD discrepancy (Gretton et al., 2012). Having proved the convergence of the finite-dimensional distributions to multivariate stable distributions, the next step would be to establish the rate of convergence with respect to a metric of choice, as a function of the stability index $\alpha$ and of the depth $l$. Secondly, all the established convergence results (this paper included) concern the convergence of the finite-dimensional distributions of the NN layers. For the countable case, which is the case of the components $i \geq 1$ in each layer, this is equivalent to the convergence in distribution of the whole process (over all the $i$) with respect to the product topology. However, the input space $\mathbb{R}^I$ is not countable. Hence, for a given $i$, the convergence of the finite-dimensional distributions (i.e. over a finite collection of inputs) is not enough to establish the convergence in distribution of the stochastic process seen as a random function of the input (with respect to an appropriate metric). This is also the case for results concerning the convergence to GPs. It would thus be worthwhile to complete this theoretical line of research by establishing such a result for any $0 < \alpha \leq 2$. As a side result, doing so is likely to provide estimates on the smoothness properties of the limiting stochastic processes.

8 Acknowledgements

We wish to thank the three anonymous reviewers and the meta reviewer for their valuable feedback. Stefano Favaro received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant agreement No 817257. Stefano Favaro gratefully acknowledges financial support from the Italian Ministry of Education, University and Research (MIUR), "Dipartimenti di Eccellenza" grant 2018-2022.

References

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. (2019). On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32.

Billingsley, P. (1999). Convergence of Probability Measures. Wiley-Interscience, 2nd edition.

Der, R. and Lee, D. D. (2006). Beyond Gaussian processes: On the distributions of infinite networks. In Advances in Neural Information Processing Systems, pages 275-282.

Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. (2019). Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773.

Hayou, S., Doucet, A., and Rousseau, J. (2019). On the impact of the activation function on deep neural networks training. In Proceedings of the 36th International Conference on Machine Learning, pages 2672-2680.

Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31, pages 8571-8580.

Joe, S. and Kuo, F. Y. (2008). Notes on generating Sobol sequences. ACM Transactions on Mathematical Software (TOMS), 29(1):49-57.

Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. (2018). Deep neural networks as Gaussian processes. In International Conference on Learning Representations.

Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J., and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems 32.

Martin, C. H. and Mahoney, M. W. (2019). Traditional and heavy-tailed self regularization in neural network models. arXiv preprint arXiv:1901.08276.

Matthews, A. G. d. G., Hron, J., Rowland, M., Turner, R. E., and Ghahramani, Z. (2018a). Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations.

Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. (2018b). Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271.

Neal, R. M. (1995). Bayesian Learning for Neural Networks. PhD thesis, University of Toronto.

Nolan, J. P. (2008). An overview of multivariate stable distributions. Online: http://academic2.american.edu/~jpnolan/stable/overview.pdf.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, pages 3360-3368.

Samoradnitsky, G. (2017). Stable non-Gaussian random processes: stochastic models with infinite variance. Routledge.

Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation. In International Conference on Learning Representations.

Simsekli, U., Sagun, L., and Gurbuzbalaban, M. (2019). A tail-index analysis of stochastic gradient noise in deep neural networks. arXiv preprint arXiv:1901.06053.

Weron, R. (1996). On the Chambers-Mallows-Stuck method for simulating skewed stable random variables. Statistics & Probability Letters, 28(2):165-171.

Weron, R. et al. (2010). Correction to: "On the Chambers-Mallows-Stuck method for simulating skewed stable random variables". Technical report, University Library of Munich, Germany.

Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.