
Invariance of Weight Distributions in Rectified MLPs

Russell Tsuchida 1 Farbod Roosta-Khorasani 2 3 Marcus Gallagher 1

Abstract

An interesting approach to analyzing neural networks that has received renewed attention is to examine the equivalent kernel of the neural network. This is based on the fact that a fully connected feedforward network with one hidden layer, a certain weight distribution, an activation function, and an infinite number of neurons can be viewed as a mapping into a Hilbert space. We derive the equivalent kernels of MLPs with ReLU or Leaky ReLU activations for all rotationally-invariant weight distributions, generalizing a previous result that required Gaussian weight distributions. Additionally, the Central Limit Theorem is used to show that for certain activation functions, kernels corresponding to layers with weight distributions having 0 mean and finite absolute third moment are asymptotically universal, and are well approximated by the kernel corresponding to layers with spherical Gaussian weights. In deep networks, as depth increases the equivalent kernel approaches a pathological fixed point, which can be used to argue why training randomly initialized networks can be difficult. Our results also have implications for weight initialization.

1. Introduction

Neural networks have recently been applied to a number of diverse problems with impressive results (van den Oord et al., 2016; Silver et al., 2017; Berthelot et al., 2017). These breakthroughs largely appear to be driven by application rather than an understanding of the capabilities and training of neural networks. Recently, significant work has been done to increase understanding of neural networks (Choromanska et al., 2015; Haeffele & Vidal, 2015; Poole et al., 2016; Schoenholz et al., 2017; Zhang et al., 2016; Martin & Mahoney, 2017; Shwartz-Ziv & Tishby, 2017; Balduzzi et al., 2017; Raghu et al., 2017). However, there is still work to be done to bring theoretical understanding in line with the results seen in practice.

The connection between neural networks and kernel machines has long been studied (Neal, 1994). Much past work has been done to investigate the equivalent kernel of certain neural networks, either experimentally (Burgess, 1997), through sampling (Sinha & Duchi, 2016; Livni et al., 2017; Lee et al., 2017), or analytically by assuming some random distribution over the weight parameters in the network (Williams, 1997; Cho & Saul, 2009; Pandey & Dukkipati, 2014a;b; Daniely et al., 2016; Bach, 2017a). Surprisingly, in the latter approach, rarely have distributions other than the Gaussian distribution been analyzed. This is perhaps due to early influential work on Bayesian networks (MacKay, 1992), which laid a strong mathematical foundation for a Bayesian approach to training networks. Another reason may be that some researchers hold the intuitive (but not necessarily principled) view that the Central Limit Theorem (CLT) should somehow apply.

In this work, we investigate the equivalent kernels for networks with Rectified Linear Unit (ReLU), Leaky ReLU (LReLU) or other activation functions, one hidden layer, and more general weight distributions. Our analysis carries over to deep networks. We investigate the consequences that weight initialization has on the equivalent kernel at the beginning of training.
While initialization schemes that mitigate exploding/vanishing gradient problems (Hochreiter, 1991; Bengio et al., 1994; Hochreiter et al., 2001) for other activation function and weight distribution combinations have been explored in earlier works (Glorot & Bengio, 2010; He et al., 2015), we discuss an initialization scheme for Multi-Layer Perceptrons (MLPs) with LReLUs and weights coming from distributions with 0 mean and finite absolute third moment. The derived kernels also allow us to analyze the loss of information as an input is propagated through the network, offering a complementary view to the shattered gradient problem (Balduzzi et al., 2017).

1 School of ITEE, University of Queensland, Brisbane, Queensland, Australia. 2 School of Mathematics and Physics, University of Queensland, Brisbane, Queensland, Australia. 3 International Computer Science Institute, Berkeley, California, USA. Correspondence to: Russell Tsuchida, Farbod Roosta-Khorasani, Marcus Gallagher.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

2. Preliminaries

Consider a fully connected (FC) feedforward neural network with m inputs and a hidden layer with n neurons. Let σ : R → R be the activation function of all the neurons in the hidden layer. Further assume that the biases are 0, as is common when initializing neural network parameters. For any two inputs x, y ∈ R^m propagated through the network, the dot product in the hidden layer is

$$\frac{1}{n} h(x) \cdot h(y) = \frac{1}{n} \sum_{i=1}^{n} \sigma(w_i \cdot x)\, \sigma(w_i \cdot y), \qquad (1)$$

where h(·) denotes the n-dimensional vector in the hidden layer and w_i ∈ R^m is the weight vector into the i-th neuron. Assuming an infinite number of hidden neurons, the sum in (1) has an interpretation as an inner product in feature space, which corresponds to the kernel of a Hilbert space. We have

$$k(x, y) = \int_{\mathbb{R}^m} \sigma(w \cdot x)\, \sigma(w \cdot y)\, f(w)\, dw, \qquad (2)$$

where f(w) is the probability density function (PDF) for the identically distributed weight vector W = (W_1, ..., W_m)^T in the network. The connection of (2) to the kernels in kernel machines is well-known (Neal, 1994; Williams, 1997; Cho & Saul, 2009).

Probabilistic bounds for the error between (1) and (2) have been derived in special cases (Rahimi & Recht, 2008) when the kernel is shift-invariant. Two specific random feature mappings are considered: (1) Random Fourier features are taken for the σ in (1). Calculating the approximation error in this way requires being able to sample from the PDF defined by the Fourier transform of the target kernel. More explicitly, the weight distribution f is the Fourier transform of the target kernel and the n samples σ(w_i · x) are replaced by some appropriate scale of cos(w_i · x). (2) A random bit string σ(x_i) is associated to each input according to a grid with random pitch δ sampled from f imposed on the input space. This method requires having access to the second moment of the target kernel to sample from the distribution f.

Other work (Bach, 2017b) has focused on the smallest error between a target function g in the reproducing kernel Hilbert space (RKHS) defined by (2) and an approximate function ĝ expressible by the RKHS with the kernel (1). More explicitly, let $g(x) = \int_{\mathbb{R}^m} G(w)\, \sigma(w, x)\, f(w)\, dw$ be the representation of g in the RKHS. The quantity $\hat{g} - g = \sum_{i=1}^{n} \alpha_i \sigma(w_i, \cdot) - \int_{\mathbb{R}^m} G(w)\, \sigma(w, \cdot)\, f(w)\, dw$ (with some suitable norm) is studied for the best set of α_i and random w_i with an optimized distribution.

Yet another measure of kernel approximation error is investigated by Rudi & Rosasco (2017). Let ĝ and g be the optimal solutions to the ridge regression problem of minimizing a regularized cost function C using the kernel (1) and the kernel (2) respectively. The number of datapoints n required to probabilistically bound C(ĝ) − C(g) is found to be O(√n log n) under a suitable set of assumptions. This work notes the connection between kernel machines and one-layer neural networks with ReLU activations and Gaussian weights by citing Cho & Saul (2009). We extend this connection by considering other weight distributions and activation functions.

In this work our focus is on deriving expressions for the target kernel, not the approximation error. Additionally, we consider random mappings that have not been considered elsewhere. Our work is related to work by Poole et al. (2016) and Schoenholz et al. (2017). However, our results apply to the unbounded (L)ReLU activation function and more general weight distributions, while their work considers random biases as well as weights.

3. Equivalent Kernels for Infinite Width Hidden Layers

The kernel (2) has previously been evaluated for a number of choices of f and σ (Williams, 1997; Roux & Bengio, 2007; Cho & Saul, 2009; Pandey & Dukkipati, 2014a;b). In particular, the equivalent kernel for a one-hidden-layer network with spherical Gaussian weights of variance E[W_i^2] and mean 0 is the Arc-Cosine Kernel (Cho & Saul, 2009)

$$k(x, y) = \frac{\mathbb{E}[W_i^2]\, \|x\|\|y\|}{2\pi} \big( \sin\theta_0 + (\pi - \theta_0)\cos\theta_0 \big), \qquad (3)$$

where $\theta_0 = \cos^{-1}\!\big(\tfrac{x \cdot y}{\|x\|\|y\|}\big)$ is the angle between the inputs x and y and ‖·‖ denotes the ℓ2 norm. Noticing that the Arc-Cosine Kernel k(x, y) depends on x and y only through their norms and the angle θ_0, with an abuse of notation we will henceforth set k(x, y) ≡ k(θ_0). Define the normalized kernel to be the cosine similarity between the signals in the hidden layer. The normalized Arc-Cosine Kernel is given by

$$\cos\theta_1 = \frac{k(x, y)}{\sqrt{k(x, x)}\sqrt{k(y, y)}} = \frac{1}{\pi} \big( \sin\theta_0 + (\pi - \theta_0)\cos\theta_0 \big),$$

where θ_1 is the angle between the signals in the first layer. Figure 1 shows a plot of the normalized Arc-Cosine Kernel.

Figure 1. Normalized Arc-Cosine Kernel as a function of θ_0 for a single hidden layer network, Gaussian weights, and ReLU activations. Empirical samples from a network with 1000 inputs and 1000 hidden units are plotted alongside the theoretical curve. Samples are obtained by generating R from a QR decomposition of a random matrix, then setting x = R^T (1, 0, ..., 0)^T and y = R^T (cos θ, sin θ, 0, ..., 0)^T.

One might ask how the equivalent kernel changes for a different choice of weight distribution. We investigate the equivalent kernel for networks with (L)ReLU activations and general weight distributions in Sections 3.1 and 3.2. The equivalent kernel can be composed and applied to deep networks. The kernel can also be used to choose good weights for initialization. These, as well as other implications for practical neural networks, are investigated in Section 5.
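To make the finite-width/infinite-width relationship concrete, the short Python sketch below (our illustration, not code from the paper) estimates the finite-width average (1) for a ReLU layer with spherical Gaussian weights by Monte Carlo and compares it with the closed form (3). The dimensions m = 100 and n = 100000, the helper names, and the use of NumPy are our own choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def arc_cosine_kernel(x, y, var_w=1.0):
    # Closed-form kernel (3) for ReLU activations and spherical Gaussian weights.
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta0 = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return var_w * nx * ny / (2 * np.pi) * (np.sin(theta0) + (np.pi - theta0) * np.cos(theta0))

def empirical_kernel(x, y, n=100000, var_w=1.0, seed=0):
    # Finite-width average (1): mean of sigma(w.x) * sigma(w.y) over n hidden ReLU units.
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(var_w), size=(n, x.size))  # one weight vector per hidden unit
    return np.mean(relu(W @ x) * relu(W @ y))

rng = np.random.default_rng(1)
x, y = rng.normal(size=100), rng.normal(size=100)
print(arc_cosine_kernel(x, y))   # value of (2) under spherical Gaussian weights
print(empirical_kernel(x, y))    # Monte Carlo estimate of (1); agrees up to sampling noise
```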

3.1. Kernels for Rotationally-Invariant Weights

In this section we show that (3) holds more generally than for the case where f is Gaussian. Specifically, (3) holds when f is any rotationally-invariant distribution. We do this by casting (2) as the solution to an ODE, and then solving the ODE. We then extend this result using the same technique to the case where σ is LReLU.

A rotationally-invariant PDF is one with the property f(w) = f(Rw) = f(‖w‖) for all w and orthogonal matrices R. Recall that the class of rotationally-invariant distributions (Bryc, 1995), as a subclass of elliptically contoured distributions (Johnson, 2013), includes the Gaussian distribution, the multivariate t-distribution, the symmetric multivariate Laplace distribution, and symmetric multivariate stable distributions.

Proposition 1. Suppose we have a one-hidden layer feedforward network with ReLU σ and random weights W with uncorrelated and identically distributed rows with rotationally-invariant PDF f : R^m → R and E[W_i^2] < ∞. The equivalent kernel of the network is (3).

Proof. First, we require the following.

Proposition 2. With the conditions in Proposition 1 and inputs x, y ∈ R^m, the equivalent kernel of the network is the solution to the Initial Value Problem (IVP)

$$k''(\theta_0) + k(\theta_0) = F(\theta_0), \qquad k'(\pi) = 0, \qquad k(\pi) = 0, \qquad (4)$$

where θ_0 ∈ (0, π) is the angle between the inputs x and y. The derivatives are meant in the distributional sense; they are functionals applying to all test functions in C_c^∞(0, π). F(θ_0) is given by the (m − 1)-dimensional integral

$$F(\theta_0) = \int_{\mathbb{R}^{m-1}} f\big( (s\sin\theta_0, -s\cos\theta_0, w_3, ..., w_m)^T \big)\, \Theta(s)\, s^3\, ds\, dw_3\, dw_4 \cdots dw_m\, \|x\|\|y\| \sin\theta_0, \qquad (5)$$

where Θ is the Heaviside step function.

The proof is given in Appendix A. The main idea is to rotate w (following Cho & Saul (2009)) so that

$$k(x, y) = \int_{\mathbb{R}^m} \Theta(w_1)\, \Theta(w_1 \cos\theta_0 + w_2 \sin\theta_0)\, w_1 (w_1 \cos\theta_0 + w_2 \sin\theta_0)\, f(w)\, dw\, \|x\|\|y\|.$$

Now differentiating twice with respect to θ_0 yields the second order ODE (4). The usefulness of the ODE in its current form is limited, since the forcing term F(θ_0) as in (5) is difficult to interpret. However, regardless of the underlying distribution on weights w, as long as the PDF f in (5) corresponds to any rotationally-invariant distribution, the forcing term enjoys a much simpler representation.

Proposition 3. With the conditions in Proposition 1, the forcing term F(θ_0) in the kernel ODE is given by F(θ_0) = K sin θ_0, where

$$K = \int_{\mathbb{R}^{m-1}} \Theta(s)\, s^3 f\big( (s, 0, w_3, ..., w_m)^T \big)\, ds\, dw_3 \cdots dw_m\, \|x\|\|y\| < \infty,$$

and the solution to the distributional ODE (4) is the solution to the corresponding classical ODE.

The proof is given in Appendix B.

Note that in the representation F(θ_0) = K sin θ_0 of the forcing term, the underlying distribution appears only as a constant K. For all rotationally-invariant distributions, the forcing term in (4) results in an equivalent kernel with the same form. We can combine Propositions 2 and 3 to find the equivalent kernel assuming rotationally-invariant weight distributions.

Due to the rotational invariance of f, $k(0) = \int_{\mathbb{R}^m} \Theta(w_1)\, w_1^2\, f(Rw)\, dw\, \|x\|\|y\| = \frac{\|x\|\|y\|\, \mathbb{E}[W_i^2]}{2}$. The solution to the ODE in Proposition 2 using the forcing term from Proposition 3 is $k(\theta_0) = c_1 \cos\theta_0 + c_2 \sin\theta_0 - \frac{1}{2} K \theta_0 \cos\theta_0$. Using the conditions from the IVP and k(0), the values of c_1, c_2 and K give the required result.

One can apply the same technique to the case of LReLU activations σ(z) = az + (1 − a)Θ(z)z, where a specifies the gradient of the activation for z < 0.

Proposition 4. Consider the same situation as in Proposition 1 with the exception that the activations are LReLU. The integral (2) is then given by

$$k(x, y) = \Big[ \frac{(1 - a)^2}{2\pi} \big( \sin\theta_0 + (\pi - \theta_0)\cos\theta_0 \big) + a \cos\theta_0 \Big]\, \mathbb{E}[W_i^2]\, \|x\|\|y\|, \qquad (6)$$

where a ∈ [0, 1) is the LReLU gradient parameter.

This is just a slightly more involved calculation than the ReLU case; we defer our proof to the supplementary material.
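The following Monte Carlo sketch (ours, not the authors' code) illustrates Proposition 4 numerically: it draws weights from a spherical multivariate t-distribution, a rotationally-invariant but non-Gaussian member of the class above, and checks the empirical layer average against the closed form (6). The Gaussian-over-chi sampler is the standard representation of the spherical t; the parameter values are arbitrary choices of ours.

```python
import numpy as np

def lrelu(z, a=0.2):
    return np.where(z > 0, z, a * z)

def lrelu_kernel(x, y, a=0.2, var_w=1.0):
    # Closed-form kernel (6) for LReLU activations and rotationally-invariant weights.
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta0 = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    shape = (1 - a) ** 2 / (2 * np.pi) * (np.sin(theta0) + (np.pi - theta0) * np.cos(theta0)) \
            + a * np.cos(theta0)
    return shape * var_w * nx * ny

def sample_spherical_t(n, m, nu, rng):
    # Spherical multivariate t: a Gaussian vector divided by an independent sqrt(chi^2_nu / nu).
    g = rng.normal(size=(n, m))
    chi2 = rng.chisquare(nu, size=(n, 1))
    return g / np.sqrt(chi2 / nu)

m, n, nu, a = 50, 200000, 7.0, 0.2
rng = np.random.default_rng(0)
x, y = rng.normal(size=m), rng.normal(size=m)
W = sample_spherical_t(n, m, nu, rng)
var_w = nu / (nu - 2)                                # E[W_i^2] for this t-distribution
print(lrelu_kernel(x, y, a, var_w))                  # closed form (6)
print(np.mean(lrelu(W @ x, a) * lrelu(W @ y, a)))    # Monte Carlo estimate of (2)
```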

3.2. Asymptotic Kernels

In this section we approximate k for large m and more general weight PDFs. We invoke the CLT as m → ∞, which requires a condition that we discuss briefly before presenting it formally. The dot product w · x can be seen as a linear combination of the weights, with the coefficients corresponding to the coordinates of x. Roughly, such a linear combination will obey the CLT if many coefficients are non-zero. To let m → ∞, we construct a sequence of inputs {x^(m)}_{m=2}^∞. This may appear unusual in the context of neural networks, since m is fixed and finite in practice. The sequence is used only for asymptotic analysis.

As an example, if the dataset were CelebA (Liu et al., 2015) with 116412 inputs, one would have x^(116412). To generate an artificial sequence, one could down-sample the image to be of size 116411, 116410, and so on. At each point in the sequence, one could normalize the point so that its ℓ2 norm is ‖x^(116412)‖. One could similarly up-sample the image.

Intuitively, if the up-sampled image does not just insert zeros, as m increases we expect the ratio |x_i^(m)| / ‖x^(m)‖ to decrease because the denominator stays fixed and the numerator gets smaller. In our proof the application of the CLT requires max_{i=1}^m |x_i^(m)| / ‖x^(m)‖ to decrease faster than m^{−1/4}. Hypothesis 5 states this condition precisely.

Hypothesis 5. For x^(m), y^(m) ∈ R^m, define sequences of inputs {x^(m)}_{m=2}^∞ and {y^(m)}_{m=2}^∞ with fixed ‖x^(m)‖ = ‖x‖, ‖y^(m)‖ = ‖y‖, and $\theta_0 = \cos^{-1}\tfrac{x^{(m)} \cdot y^{(m)}}{\|x\|\|y\|}$ for all m. Letting x_i^(m) be the i-th coordinate of x^(m), assume that $\lim_{m\to\infty} m^{1/4} \max_{i=1}^m \tfrac{|x_i^{(m)}|}{\|x\|}$ and $\lim_{m\to\infty} m^{1/4} \max_{i=1}^m \tfrac{|y_i^{(m)}|}{\|y\|}$ are both 0.

Figures 2 and 5 empirically investigate Hypothesis 5 for two datasets, suggesting it makes reasonable assumptions on high dimensional data such as images and audio.

Figure 2. The solid line is an average over 1000 randomly sampled datapoints. The shaded region represents 1 standard deviation in the worst-case direction. Data is preprocessed so that each dimension is in the range [0, 255]. (Left) Aligned and cropped CelebA dataset (Liu et al., 2015), with true dimensionality m = 116412. The images are compressed using Bicubic Interpolation. (Right) CHiME3 embedded et05 real live speech data from The 4th CHiME Speech Separation and Recognition Challenge (Vincent et al., 2017; Barker et al., 2017). Each clip is trimmed to a length of 6.25 seconds and the true sample rate is 16000 Hz, so the true dimensionality is m = 100000. Compression is achieved through subsampling by factors.

Theorem 6. Consider an infinitely wide FC layer with almost everywhere continuous activation functions σ. Suppose the random weights W come from an IID distribution with PDF f_m such that E[W_i] = 0 and E|W_i^3| < ∞. Suppose that the conditions in Hypothesis 5 are satisfied. Then

$$\sigma\big(W^{(m)} \cdot x^{(m)}\big)\, \sigma\big(W^{(m)} \cdot y^{(m)}\big) \xrightarrow{D} \sigma(Z_1)\, \sigma(Z_2),$$

where $\xrightarrow{D}$ denotes convergence in distribution and Z = (Z_1, Z_2)^T is a Gaussian random vector with mean 0 and covariance matrix

$$\mathbb{E}[W_i^2] \begin{pmatrix} \|x\|^2 & \|x\|\|y\|\cos\theta_0 \\ \|x\|\|y\|\cos\theta_0 & \|y\|^2 \end{pmatrix}.$$

Every Z^(m) = (W^(m) · x^(m), W^(m) · y^(m))^T has the same mean and covariance matrix as Z.

Convergence in distribution is a weak form of convergence, so we cannot expect in general that all kernels should converge asymptotically. For some special cases however, this is indeed possible to show. We first present the ReLU case.

Corollary 7. Let m, W, f_m, E[W_i] and E|W_i^3| be as defined in Theorem 6. Define the corresponding kernel to be $k_f^{(m)}(x^{(m)}, y^{(m)})$. Consider a second infinitely wide FC layer with m inputs. Suppose the random weights come from a spherical Gaussian with E[W_i] = 0 and finite variance E[W_i^2] with PDF g_m. Define the corresponding kernel to be $k_g^{(m)}(x^{(m)}, y^{(m)})$. Suppose that the conditions in Hypothesis 5 are satisfied and the activation functions are σ(z) = Θ(z)z. Then for all s ≥ 2,

$$\lim_{m\to\infty} k_f^{(m)}\big(x^{(m)}, y^{(m)}\big) = k_g^{(s)}\big(x^{(s)}, y^{(s)}\big) = \mathbb{E}\big[\sigma(Z_1)\sigma(Z_2)\big],$$

where Z is as in Theorem 6. Explicitly, $k_f^{(m)}$ converges to (3).

The proof is given in Appendix D. This implies that the Arc-Cosine Kernel is well approximated by ReLU layers with weights from a wide class of distributions. Similar results hold for other σ including the LReLU and ELU (Clevert et al., 2016), as shown in the supplementary material.
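A small numerical sketch (ours, not taken from the paper) of Corollary 7: for inputs whose coordinates are spread out, so that Hypothesis 5 is plausible, the empirical ReLU kernel under IID zero-mean Laplace weights is close to the Arc-Cosine Kernel (3) with matching variance. The choice of Laplace weights, the dimensions and the helper names are our assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def arc_cosine_kernel(theta0, var_w, norm_x=1.0, norm_y=1.0):
    # Closed form (3), written in terms of the angle and the norms.
    return var_w * norm_x * norm_y / (2 * np.pi) * (np.sin(theta0) + (np.pi - theta0) * np.cos(theta0))

rng = np.random.default_rng(0)
m, n, b = 200, 100000, 1.0                       # b is the Laplace scale, so E[W_i^2] = 2 b^2
r1 = np.ones(m) / np.sqrt(m)                     # orthonormal directions with small coordinates,
r2 = np.tile([1.0, -1.0], m // 2) / np.sqrt(m)   # in the spirit of Hypothesis 5
W = rng.laplace(0.0, b, size=(n, m))             # IID, zero mean, finite absolute third moment

for theta0 in [0.5, 1.5, 2.5]:
    x = r1                                       # unit-norm inputs at angle theta0
    y = np.cos(theta0) * r1 + np.sin(theta0) * r2
    k_emp = np.mean(relu(W @ x) * relu(W @ y))   # finite-width estimate of (2)
    print(theta0, k_emp, arc_cosine_kernel(theta0, 2 * b ** 2))
```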


Figure 3. Theoretical normalized kernel for networks of increasing depth. Empirical samples from a network with between 1 and 128 hidden layers, 1000 hidden neurons in each layer, m = 1000 and weights coming from different symmetric distributions. The sampling process for each θ_0 is as described in Figure 1. The variance is chosen according to (8). (a) ReLU activations, (7) distribution. (b) LReLU activations with a = 0.2, (7) distribution. (c) ReLU activations, t-distribution. (d) LReLU activations with a = 0.2, t-distribution.

4. Empirical Verification of Results

We empirically verify our results using two families of weight distributions. First, consider the m-dimensional t-distribution

$$f(w) = \frac{\Gamma[(\nu + m)/2]}{\Gamma(\nu/2)\, \nu^{m/2} \pi^{m/2} \sqrt{|\det(\Sigma)|}} \Big[ 1 + \frac{1}{\nu} \big( w^T \Sigma^{-1} w \big) \Big]^{-(\nu + m)/2},$$

with degrees of freedom ν and identity shape matrix Σ = I. The multivariate t-distribution approaches the multivariate Gaussian as ν → ∞. Random variables drawn from the multivariate t-distribution are uncorrelated but not independent. This distribution is rotationally-invariant and satisfies the conditions in Propositions 1 and 4.

Second, consider the multivariate distribution

$$f(w) = \prod_{i=1}^{m} \frac{\beta}{2\alpha \Gamma(1/\beta)}\, e^{-|w_i/\alpha|^{\beta}}, \qquad (7)$$

which is not rotationally-invariant (except when β = 2, which coincides with a Gaussian distribution) but whose random variables are IID and satisfy the conditions in Theorem 6. As β → ∞ this distribution converges pointwise to the uniform distribution on [−α, α].

In Figure 3, we empirically verify Propositions 1 and 4. In the one hidden layer case, the samples follow the blue curve j = 1, regardless of the specific multivariate t weight distribution, which varies with ν. We also observe that the universality of the equivalent kernel appears to hold for the distribution (7) regardless of the value of β, as predicted by theory. We discuss the relevance of the curves j ≠ 1 in Section 5.

5. Implications for Practical Networks

5.1. Composed Kernels in Deep Networks

A recent advancement in understanding the difficulty in training deep neural networks is the identification of the shattered gradients problem (Balduzzi et al., 2017). Without skip connections, the gradients of deep networks approach white noise as they are backpropagated through the network, making them difficult to train.

A simple observation that complements this view is obtained through repeated composition of the normalized kernel. As m → ∞, the angle between two inputs in the j-th layer of a LReLU network with random weights with E[W] = 0 and E|W^3| < ∞ approaches

$$\cos\theta_j = \frac{1}{1 + a^2} \Big[ \frac{(1 - a)^2}{\pi} \big( \sin\theta_{j-1} + (\pi - \theta_{j-1})\cos\theta_{j-1} \big) + 2a\cos\theta_{j-1} \Big].$$

A result similar to the following is hinted at by Lee et al. (2017), citing Schoenholz et al. (2017). Their analysis, which considers biases in addition to weights (Poole et al., 2016), yields insights on the trainability of random neural networks that our analysis cannot. However, their argument does not appear to provide a complete formal proof for the case when the activation functions are unbounded, e.g., ReLU. The degeneracy of the composed kernel with more general activation functions is also proved by Daniely (2016), with the assumption that the weights are Gaussian distributed.

Corollary 8. The normalized kernel corresponding to LReLU activations converges to a fixed point at θ* = 0.

Proof. Let z = cos θ_{j−1} and define

$$T(z) = \frac{1}{1 + a^2} \Big[ \frac{(1 - a)^2}{\pi} \big( \sqrt{1 - z^2} + (\pi - \cos^{-1} z)\, z \big) + 2az \Big].$$

The magnitude of the derivative of T is $\big| 1 - \frac{(1 - a)^2}{1 + a^2} \frac{\cos^{-1} z}{\pi} \big|$, which is bounded above by 1 on [−1, 1]. Therefore, T is a contraction mapping. By Banach's fixed point theorem there exists a unique fixed point z* = cos θ*. Set θ* = 0 to verify that θ* = 0 is a solution, and θ* is unique.
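A few lines of Python (our illustration, with a = 0.2 as in the figures, and depth 128 chosen to match Figure 3) make the contraction concrete: iterating the map T of the proof above drives any starting angle towards the fixed point θ* = 0.

```python
import numpy as np

def T(z, a=0.2):
    # One composition step of the normalized LReLU kernel, z = cos(theta_{j-1}).
    return ((1 - a) ** 2 / np.pi * (np.sqrt(1.0 - z ** 2) + (np.pi - np.arccos(z)) * z)
            + 2 * a * z) / (1 + a ** 2)

for theta0 in [0.5, 1.5, 3.0]:
    z = np.cos(theta0)
    for _ in range(128):                 # 128 layers, as in Figure 3
        z = np.clip(T(z), -1.0, 1.0)     # clip to guard against floating point drift
    print(theta0, np.arccos(z))          # the angle shrinks towards 0 as depth grows
```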
Corollary 8 implies that for this deep network, the angle between any two signals at a deep layer approaches 0. No matter what the input is, the kernel "sees" the same thing after accounting for the scaling induced by the norm of the input. Hence, it becomes increasingly difficult to train deeper networks, as much of the information is lost and the outputs will depend merely on the norm of the inputs; the signals decorrelate as they propagate through the layers.

At first this may seem counter-intuitive. An appeal to intuition can be made by considering the corresponding linear network with deterministic and equal weight matrices in each layer, which amounts to the celebrated power iteration method. In this case, the repeated application of a matrix transformation A to a vector v converges to the dominant eigenvector (i.e. the eigenvector corresponding to the largest eigenvalue) of A.

Figure 3 shows that the theoretical normalized kernel for networks of increasing depth closely follows empirical samples from randomly initialized neural networks.

In addition to convergence of direction, by also requiring that ‖x‖ = ‖y‖ it can be shown that after accounting for scaling, the magnitude of the signals converges as the signals propagate through the network. This is analogous to having the dominant eigenvalue equal to 1 in the power iteration method comparison.

Corollary 9. The quantity $\mathbb{E}\big[ \big( \sigma^{(j)}(x) - \sigma^{(j)}(y) \big)^2 \big] / \mathbb{E}\big[ \sigma^{(j)}(x)^2 \big]$ in a j-layer random (L)ReLU network of infinite width with random uncorrelated and identically distributed rotationally-invariant weights with ‖x‖ = ‖y‖ approaches 0 as j → ∞.

Proof. Denote the output of one neuron in the j-th layer of a network $\sigma\big( W^{(j)} \cdot \sigma(\cdots \sigma(W^{(1)} x)) \big)$ by $\sigma^{(j)}(x)$ and let k_j be the kernel for the j-layer network. Then

$$\mathbb{E}\big[ \big( \sigma^{(j)}(x) - \sigma^{(j)}(y) \big)^2 \big] / \mathbb{E}\big[ \sigma^{(j)}(x)^2 \big] = \big( k_j(x, x) - 2k_j(x, y) + k_j(y, y) \big) / k_j(x, x) = 2 - 2\cos\theta_j,$$

which approaches 0 as j → ∞.

Contrary to the shattered gradients analysis, which applies to gradient based optimizers, our analysis relates to any optimizers that initialize weights from some distribution satisfying the conditions in Proposition 4 or Corollary 7. Since information is lost during signal propagation, the network's output shares little information with the input. An optimizer that tries to relate inputs, outputs and weights through a suitable cost function will be "blind" to relationships between inputs and outputs.

Our results can be used to argue against the utility of controversial Extreme Learning Machines (ELM) (Huang et al., 2004), which randomly initialize hidden layers from symmetric distributions and only learn the weights in the final layer. A single layer ELM can be replaced by kernel ridge regression using the equivalent kernel. Furthermore, a Multi-Layer ELM (Tang et al., 2016) with (L)ReLU activations utilizes a pathological kernel as shown in Figure 3. It should be noted that ELM bears resemblance to early works (Schmidt et al., 1992; Pao et al., 1994).

Figure 4. Histograms showing the ratio of the norm of signals in layer j to the norm of the input signals. Each histogram contains 1000 data points randomly sampled from a unit Gaussian distribution. The network tested has 1000 inputs, 1000 neurons in each layer, and LReLU activations with a = 0.2. The legend indicates the number of layers in the network. The weights are randomly initialized from a Gaussian distribution. (Left) Weights initialized according to the method of He et al. (2015). (Right) Weights initialized according to (8).

5.2. Initialization

Suppose we wish to approximately preserve the ℓ2 norm from the input to the hidden layer. By comparing (1) and (2), we approximately have $\|h(x)\| \approx \sqrt{k(x, x)\, n}$. Letting θ_0 = 0 in (6), we have $\|h(x)\| = \|x\| \sqrt{\frac{n\, \mathbb{E}[W_i^2]\, (1 + a^2)}{2}}$. Setting ‖h(x)‖ = ‖x‖,

$$\sqrt{\mathbb{E}[W_i^2]} = \sqrt{\frac{2}{(1 + a^2)\, n}}. \qquad (8)$$

This applies whenever the conditions in Proposition 4 or Corollary 12 are satisfied. This agrees with the well-known case when the elements of W are IID (He et al., 2015) and a = 0. For small values of a, (8) is well approximated by the known result (He et al., 2015). For larger values of a, this approximation breaks down, as shown in Figure 4.
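As a quick check of (8), the sketch below (ours, not the paper's code) initializes a wide LReLU layer with weight variance 2/((1 + a²)n), once with Gaussian and once with uniform zero-mean weights, and confirms that the norm of the signal is approximately preserved; the sizes and names are illustrative.

```python
import numpy as np

def lrelu(z, a=0.2):
    return np.where(z > 0, z, a * z)

m, n, a = 1000, 1000, 0.2
var_w = 2.0 / ((1 + a ** 2) * n)               # equation (8)
rng = np.random.default_rng(0)
x = rng.normal(size=m)

W_gauss = rng.normal(0.0, np.sqrt(var_w), size=(n, m))
half_width = np.sqrt(3.0 * var_w)              # uniform on [-b, b] has variance b^2 / 3
W_unif = rng.uniform(-half_width, half_width, size=(n, m))

print(np.linalg.norm(x))                       # norm of the input
print(np.linalg.norm(lrelu(W_gauss @ x, a)))   # approximately equal to the input norm
print(np.linalg.norm(lrelu(W_unif @ x, a)))    # also approximately equal
```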
An alternative approach to weight initialization is the data-driven approach (Mishkin & Matas, 2016), which can be applied to more complicated network structures such as convolutional and max-pooling layers commonly used in practice. As parameter distributions change during training, batch normalization inserts layers with learnable scaling and centering parameters at the cost of increased computation and complexity (Ioffe & Szegedy, 2015).

6. Conclusion

We have considered universal properties of MLPs with weights coming from a large class of distributions. We have theoretically and empirically shown that the equivalent kernel for networks with an infinite number of hidden ReLU neurons and all rotationally-invariant weight distributions is the Arc-Cosine Kernel. The CLT can be applied to approximate the kernel for high dimensional input data. When the activations are LReLUs, the equivalent kernel has a similar form. The kernel converges to a fixed point, showing that information is lost as signals propagate through the network.

One avenue for future work is to study the equivalent kernel for different activation functions, noting that some activations such as the ELU may not be expressible in a closed form (we do show in the supplementary material, however, that the ELU does have an asymptotically universal kernel).

Since wide networks with centered weight distributions have approximately the same equivalent kernel, powerful trained deep and wide MLPs with (L)ReLU activations should have asymmetric, non-zero mean, non-IID parameter distributions. Future work may consider analyzing the equivalent kernels of trained networks and more complicated architectures. We should not expect that k(x, y) may be expressed neatly as k(θ_0) in these cases. This work is a crucial first step in identifying invariant properties in neural networks and sets a foundation from which we hope to expand in future.
A. Proof of Proposition 2

Proof. The kernel with weight PDF f(ω) and ReLU σ is

$$k(x, y) = \int_{\mathbb{R}^m} \Theta(\omega \cdot x)\, \Theta(\omega \cdot y)\, (\omega \cdot x)(\omega \cdot y)\, f(\omega)\, d\omega.$$

Let θ_0 be the angle between x and y. Define u = (‖x‖, 0, ..., 0)^T and v = (‖y‖ cos θ_0, ‖y‖ sin θ_0, 0, ..., 0)^T with u, v ∈ R^m. Following Cho & Saul (2009), there exists some m × m rotation matrix R such that x = Ru and y = Rv. We have

$$k(x, y) = \int_{\mathbb{R}^m} \Theta(\omega \cdot Ru)\, \Theta(\omega \cdot Rv)\, (\omega \cdot Ru)(\omega \cdot Rv)\, f(\omega)\, d\omega.$$

Let ω = Rw and note that the dot product is invariant under rotations and the determinant of the Jacobian of the transformation is 1 since R is orthogonal. We have

$$k(x, y) = \int_{\mathbb{R}^m} \Theta(w \cdot u)\, \Theta(w \cdot v)\, (w \cdot u)(w \cdot v)\, f(Rw)\, dw = \int_{\mathbb{R}^m} \Theta(\|x\| w_1)\, \Theta\big( \|y\| (w_1 \cos\theta_0 + w_2 \sin\theta_0) \big)\, w_1 (w_1 \cos\theta_0 + w_2 \sin\theta_0)\, f(w)\, dw\, \|x\|\|y\|. \qquad (9)$$

One may view the integrand as a functional acting on test functions of θ_0. Denote the set of infinitely differentiable test functions on (0, π) by C_c^∞(0, π). The linear functional acting over C_c^∞(0, π) is a Generalized Function and we may take distributional derivatives under the integral by Theorem 7.40 of Jones (1982). Differentiating twice,

$$k'' + k = \int_{\mathbb{R}^m} \Theta(w_1)\, w_1 (-w_1 \sin\theta_0 + w_2 \cos\theta_0)^2\, \delta(w_1 \cos\theta_0 + w_2 \sin\theta_0)\, f(w)\, dw\, \|x\|\|y\| = \int_{\mathbb{R}^{m-1}} f\big( (s\sin\theta_0, -s\cos\theta_0, w_3, ..., w_m)^T \big)\, \Theta(s)\, s^3\, ds\, dw_3\, dw_4 \cdots dw_m\, \|x\|\|y\| \sin\theta_0.$$

The initial condition k(π) = 0 is obtained by putting θ_0 = π in (9) and noting that the resulting integrand contains a factor of Θ(w_1)Θ(−w_1)w_1, which is 0 everywhere. Similarly, the integrand of k'(π) contains a factor of Θ(w_2)Θ(−w_2)w_2.

The ODE is meant in a distributional sense, that

$$\int_0^\pi \psi(\theta_0)\, \big( k''(\theta_0) + k(\theta_0) - F(\theta_0) \big)\, d\theta_0 = 0 \quad \forall\, \psi \in C_c^\infty(0, \pi),$$

where k is a distribution with a distributional second derivative k''.

B. Proof of Proposition 3

Proof. Denote the marginal PDF of the first two coordinates of W by f_12. Due to the rotational invariance of f, f(Ow) = f(‖w‖) = f(w) for any orthogonal matrix O. So

$$F(\theta_0) = \int_{\mathbb{R}^{m-1}} f\big( (s\sin\theta_0, -s\cos\theta_0, w_3, ..., w_m)^T \big)\, \sin\theta_0\, \Theta(s)\, s^3\, ds\, dw_3 \cdots dw_m\, \|x\|\|y\| = \sin\theta_0 \int_{\mathbb{R}} \Theta(s)\, s^3 f_{12}(s, 0)\, ds\, \|x\|\|y\| = K \sin\theta_0, \qquad K \in (0, \infty].$$

It remains to check that K < ∞. F is integrable since

$$\int_{\mathbb{R}^2} \int_0^\pi \Theta(w_1)\, w_1 (-w_1 \sin\theta_0 + w_2 \cos\theta_0)^2\, \delta(w_1 \cos\theta_0 + w_2 \sin\theta_0)\, f_{12}(w_1, w_2)\, d\theta_0\, dw_1\, dw_2 = \int_{\mathbb{R}^2} \Theta(w_1)\, w_1 \big( w_1^2 + w_2^2 \big)^{1/2} f_{12}(w_1, w_2)\, dw_1\, dw_2 \leq \sqrt{\mathbb{E}\big[\Theta^2(W_1) W_1^2\big]}\, \sqrt{\mathbb{E}\big[W_1^2 + W_2^2\big]} < \infty.$$

Therefore, F is finite almost everywhere. This is only true if K < ∞. k'' = F − k must be a function, so the distributional and classical derivatives coincide.

C. Proof of Theorem 6

Proof. There exist some orthonormal R_1, R_2 ∈ R^m such that y^(m) = ‖y^(m)‖(R_1 cos θ_0 + R_2 sin θ_0) and x^(m) = ‖x^(m)‖ R_1. We would like to examine the asymptotic distribution of $\sigma\big( \|y^{(m)}\|\, W \cdot (R_1 \cos\theta_0 + R_2 \sin\theta_0) \big)\, \sigma\big( \|x^{(m)}\|\, W \cdot R_1 \big)$.

Let $U_1^{(m)} = W \cdot R_1 \cos\theta_0 + W \cdot R_2 \sin\theta_0$ and $U_2^{(m)} = -W \cdot R_1 \sin\theta_0 + W \cdot R_2 \cos\theta_0$. Note that $\mathbb{E}[U_1^{(m)2}] = \mathbb{E}[U_2^{(m)2}] = \mathbb{E}[W_i^2]$ and $\mathbb{E}[U_1^{(m)}] = \mathbb{E}[U_2^{(m)}] = 0$. Also note that $U_1^{(m)}$ and $U_2^{(m)}$ are uncorrelated since

$$\mathbb{E}\big[U_1^{(m)} U_2^{(m)}\big] = \mathbb{E}\Big[ (W \cdot R_1)(W \cdot R_2)\big( \cos^2\theta_0 - \sin^2\theta_0 \big) - \cos\theta_0 \sin\theta_0 \big( (W \cdot R_1)^2 - (W \cdot R_2)^2 \big) \Big] = 0.$$

Let $M_k = \mathbb{E}|W_i|^k$, $U = (U_1, U_2)^T$, I be the 2 × 2 identity matrix and $Q \sim \mathcal{N}(0, M_2 I)$. Then for any convex set S ⊂ R^2 and some C ∈ R, by the Berry-Esseen Theorem, $\big| \mathbb{P}[U \in S] - \mathbb{P}[Q \in S] \big| \leq C\gamma$, where γ is given by

$$\gamma = \sum_{j=1}^m \mathbb{E}\bigg[ \Big\| M_2^{-1/2}\, W_j \begin{pmatrix} R_{1j}\cos\theta_0 + R_{2j}\sin\theta_0 \\ -R_{1j}\sin\theta_0 + R_{2j}\cos\theta_0 \end{pmatrix} \Big\|_2^3 \bigg] = M_2^{-3/2} M_3 \sum_{j=1}^m \Big\| \begin{pmatrix} R_{1j}\cos\theta_0 + R_{2j}\sin\theta_0 \\ -R_{1j}\sin\theta_0 + R_{2j}\cos\theta_0 \end{pmatrix} \Big\|_2^3 = M_2^{-3/2} M_3 \sum_{j=1}^m \big( R_{1j}^2 + R_{2j}^2 \big)^{3/2},$$

so that

$$\gamma^2 \leq M_2^{-3} M_3^2\, m \sum_{j=1}^m \big( R_{1j}^2 + R_{2j}^2 \big)^3 = M_2^{-3} M_3^2\, m \sum_{j=1}^m \big( R_{1j}^6 + 3R_{1j}^4 R_{2j}^2 + 3R_{1j}^2 R_{2j}^4 + R_{2j}^6 \big) \leq M_2^{-3} M_3^2\, m \Big( 4\max_{k=1}^m R_{1k}^4 + 4\max_{k=1}^m R_{2k}^4 \Big).$$

The last line is due to the fact that

$$\sum_{j=1}^m \big( R_{1j}^6 + 3R_{1j}^4 R_{2j}^2 \big) \leq \max_{k=1}^m R_{1k}^4 \sum_{j=1}^m \big( R_{1j}^2 + 3R_{2j}^2 \big),$$

which equals $4\max_{k=1}^m R_{1k}^4$ because R_1 and R_2 are unit vectors, and similarly for the remaining terms.

Now $R_{1k} = \frac{x_k}{\|x\|}$ and $R_{2k} = \frac{1}{\sin\theta_0}\big( \frac{y_k}{\|y\|} - \frac{x_k}{\|x\|}\cos\theta_0 \big)$, so if θ_0 ≠ 0, π, by Hypothesis 5, U^(m) converges in distribution to the bivariate spherical Gaussian with variance E[W_i^2]. Then the random vector

$$Z^{(m)} = \big( Z_1^{(m)}, Z_2^{(m)} \big)^T = \big( \|x\|\, W \cdot R_1,\ \|y\|(W \cdot R_1 \cos\theta_0 + W \cdot R_2 \sin\theta_0) \big)^T = \big( \|x\|(U_1 \cos\theta_0 - U_2 \sin\theta_0),\ \|y\| U_1 \big)^T$$

converges in distribution to the bivariate Gaussian random variable with covariance matrix

$$\mathbb{E}[W_i^2] \begin{pmatrix} \|x\|^2 & \|x\|\|y\|\cos\theta_0 \\ \|x\|\|y\|\cos\theta_0 & \|y\|^2 \end{pmatrix}.$$

Since σ is continuous almost everywhere, by the Continuous Mapping Theorem,

$$\sigma\big(W^{(m)} \cdot x^{(m)}\big)\, \sigma\big(W^{(m)} \cdot y^{(m)}\big) \xrightarrow{D} \sigma(Z_1)\, \sigma(Z_2).$$

If θ_0 = 0 or θ_0 = π, we may treat R_2 as 0 and the above still holds.

D. Proof of Corollary 7

Proof. We have $\lim_{m\to\infty} k_f^{(m)}\big(x^{(m)}, y^{(m)}\big) = \lim_{m\to\infty} \mathbb{E}\big[ \sigma(Z_1^{(m)})\sigma(Z_2^{(m)}) \big]$ and would like to bring the limit inside the expected value. By Theorem 6 and Theorem 25.12 of Billingsley (1995), it suffices to show that $\sigma(Z_1^{(m)})\sigma(Z_2^{(m)})$ is uniformly integrable. Define h to be the joint PDF of Z^(m). We have

$$\lim_{\alpha\to\infty} \int_{|\sigma(z_1)\sigma(z_2)| > \alpha} |\sigma(z_1)\sigma(z_2)|\, h(z_1, z_2)\, dz_1 dz_2 = \lim_{\alpha\to\infty} \int_{|\Theta(z_1)\Theta(z_2) z_1 z_2| > \alpha} |\Theta(z_1)\Theta(z_2)\, z_1 z_2|\, h(z_1, z_2)\, dz_1 dz_2,$$

but the integrand is 0 whenever z_1 ≤ 0 or z_2 ≤ 0. So

$$\int_{|\sigma(z_1)\sigma(z_2)| > \alpha} |\sigma(z_1)\sigma(z_2)|\, h(z_1, z_2)\, dz_1 dz_2 = \int_{\mathbb{R}^2} z_1 z_2\, \Theta(z_1 z_2 - \alpha)\, \Theta(z_1)\, \Theta(z_2)\, h(z_1, z_2)\, dz_1 dz_2.$$

We may raise the Heaviside functions to any power without changing the value of the integral. Squaring the Heaviside functions and applying Hölder's inequality, we have

$$\Big( \int_{\mathbb{R}^2} z_1 z_2\, \Theta^2(z_1 z_2 - \alpha)\, \Theta^2(z_1)\, \Theta^2(z_2)\, h(z_1, z_2)\, dz_1 dz_2 \Big)^2 \leq \mathbb{E}\big[ z_1^2\, \Theta(z_1 z_2 - \alpha)\Theta(z_1)\Theta(z_2) \big]\, \mathbb{E}\big[ z_2^2\, \Theta(z_1 z_2 - \alpha)\Theta(z_1)\Theta(z_2) \big].$$

Examining the first of these factors,

$$\mathbb{E}\big[ z_1^2\, \Theta(z_1 z_2 - \alpha)\Theta(z_1)\Theta(z_2) \big] = \int_0^\infty \int_{\alpha/z_1}^\infty z_1^2\, h(z_1, z_2)\, dz_2\, dz_1.$$

Now let $g_\alpha(z_1) = \int_{\alpha/z_1}^\infty h(z_1, z_2)\, dz_2$. The function $g_\alpha(z_1) z_1^2$ is monotonically pointwise non-increasing to 0 in α for all z_1 > 0 and $\int z_1^2\, g_0(z_1)\, dz_1 \leq \mathbb{E}[Z_1^2] < \infty$. By the Monotone Convergence Theorem, $\lim_{\alpha\to\infty} \mathbb{E}\big[ z_1^2\, \Theta(z_1 z_2 - \alpha)\Theta(z_1) \big] = 0$. The second factor has the same limit, so the limit of the right hand side of Hölder's inequality is 0.

Acknowledgements

We thank the anonymous reviewers for directing us toward relevant work and providing helpful recommendations regarding the presentation of the paper. Farbod Roosta-Khorasani gratefully acknowledges support from the Australian Research Council through a Discovery Early Career Researcher Award (DE180100923). Russell Tsuchida's attendance at the conference was made possible by an ICML travel award.

References

Bach, F. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017a.

Bach, F. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017b.

Balduzzi, D., Frean, M., Leary, L., Lewis, J.P., Ma, K.W., and McWilliams, B. The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 342–350, 2017.

Barker, J., Marxer, R., Vincent, E., and Watanabe, S. The third CHiME speech separation and recognition challenge: Analysis and outcomes. Computer Speech and Language, 46:605–626, 2017.

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Berthelot, D., Schumm, T., and Metz, L. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

Billingsley, P. Probability and Measure. Wiley-Interscience, 3rd edition, 1995. ISBN 0471007102.

Bryc, W. Rotation invariant distributions. In The Normal Distribution, pp. 51–69. Springer, 1995.

Burgess, A.N. Estimating equivalent kernels for neural networks: A data perturbation approach. In Advances in Neural Information Processing Systems, pp. 382–388, 1997.

Cho, Y. and Saul, L.K. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342–350, 2009.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B., and LeCun, Y. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.

Clevert, D., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2016.

Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pp. 2253–2261, 2016.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Haeffele, B.D. and Vidal, R. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma, Technische Universität München, 91, 1991.

Hochreiter, S., Bengio, Y., and Frasconi, P. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Kolen, J. and Kremer, S. (eds.), Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001.

Huang, G., Zhu, Q., and Siew, C. Extreme learning machine: a new learning scheme of feedforward neural networks. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, volume 2, pp. 985–990. IEEE, 2004.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Johnson, M.E. Multivariate statistical simulation: A guide to selecting and generating continuous multivariate distributions. John Wiley & Sons, 2013.

Jones, D.S. The Theory of Generalised Functions, chapter 7, pp. 263. Cambridge University Press, 2nd edition, 1982.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S.S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1611.01232, 2017.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

Livni, R., Carmon, D., and Globerson, A. Learning infinite layer networks without the kernel trick. In International Conference on Machine Learning, pp. 2198–2207, 2017.

MacKay, D.J.C. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

Martin, C.H. and Mahoney, M.W. Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior. arXiv preprint arXiv:1710.09553, 2017.

Mishkin, D. and Matas, J. All you need is a good init. In International Conference on Learning Representations, 2016.

Neal, R.M. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1994.

Pandey, G. and Dukkipati, A. To go deep or wide in learning? In Artificial Intelligence and Statistics, pp. 724–732, 2014a.

Pandey, G. and Dukkipati, A. Learning by stretching deep networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1719–1727, 2014b.

Pao, Y., Park, G., and Sobajic, D.J. Learning and generalization characteristics of the random vector functional-link net. Neurocomputing, 6(2):163–180, 1994.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pp. 3360–3368, 2016.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In Precup, D. and Teh, Y.W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2847–2854, 2017.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2008.

Roux, N. Le and Bengio, Y. Continuous neural networks. In Artificial Intelligence and Statistics, pp. 404–411, 2007.

Rudi, A. and Rosasco, L. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pp. 3218–3228, 2017.

Schmidt, W.F., Kraaijveld, M.A., and Duin, R.P.W. Feedforward neural networks with random weights. In Pattern Recognition, 1992. Vol. II. Conference B: Pattern Recognition Methodology and Systems, Proceedings., 11th IAPR International Conference on, pp. 1–4. IEEE, 1992.

Schoenholz, S.S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. Deep information propagation. In International Conference on Learning Representations, 2017.

Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Sinha, A. and Duchi, J.C. Learning kernels with random features. In Advances in Neural Information Processing Systems, pp. 1298–1306, 2016.

Tang, J., Deng, C., and Huang, G. Extreme learning machine for multilayer perceptron. IEEE Transactions on Neural Networks and Learning Systems, 27(4):809–821, 2016.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Vincent, E., Watanabe, S., Nugraha, A., Barker, J., and Marxer, R. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech and Language, 46:535–557, 2017.

Williams, C.K.I. Computing with infinite networks. In Advances in Neural Information Processing Systems, pp. 295–301, 1997.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.