Invariance of Weight Distributions in Rectified MLPs
Russell Tsuchida 1    Farbod Roosta-Khorasani 2 3    Marcus Gallagher 1

Abstract

An interesting approach to analyzing neural networks that has received renewed attention is to examine the equivalent kernel of the neural network. This is based on the fact that a fully connected feedforward network with one hidden layer, a certain weight distribution, an activation function, and an infinite number of neurons can be viewed as a mapping into a Hilbert space. We derive the equivalent kernels of MLPs with ReLU or Leaky ReLU activations for all rotationally-invariant weight distributions, generalizing a previous result that required Gaussian weight distributions. Additionally, the Central Limit Theorem is used to show that for certain activation functions, kernels corresponding to layers with weight distributions having 0 mean and finite absolute third moment are asymptotically universal, and are well approximated by the kernel corresponding to layers with spherical Gaussian weights. In deep networks, as depth increases the equivalent kernel approaches a pathological fixed point, which can be used to argue why training randomly initialized networks can be difficult. Our results also have implications for weight initialization.

1. Introduction

Neural networks have recently been applied to a number of diverse problems with impressive results (van den Oord et al., 2016; Silver et al., 2017; Berthelot et al., 2017). These breakthroughs largely appear to be driven by application rather than an understanding of the capabilities and training of neural networks. Recently, significant work has been done to increase understanding of neural networks (Choromanska et al., 2015; Haeffele & Vidal, 2015; Poole et al., 2016; Schoenholz et al., 2017; Zhang et al., 2016; Martin & Mahoney, 2017; Shwartz-Ziv & Tishby, 2017; Balduzzi et al., 2017; Raghu et al., 2017). However, there is still work to be done to bring theoretical understanding in line with the results seen in practice.

The connection between neural networks and kernel machines has long been studied (Neal, 1994). Much past work has been done to investigate the equivalent kernel of certain neural networks, either experimentally (Burgess, 1997), through sampling (Sinha & Duchi, 2016; Livni et al., 2017; Lee et al., 2017), or analytically by assuming some random distribution over the weight parameters in the network (Williams, 1997; Cho & Saul, 2009; Pandey & Dukkipati, 2014a;b; Daniely et al., 2016; Bach, 2017a). Surprisingly, in the latter approach, rarely have distributions other than the Gaussian distribution been analyzed. This is perhaps due to early influential work on Bayesian networks (MacKay, 1992), which laid a strong mathematical foundation for a Bayesian approach to training networks. Another reason may be that some researchers hold the intuitive (but not necessarily principled) view that the Central Limit Theorem (CLT) should somehow apply.

In this work, we investigate the equivalent kernels for networks with Rectified Linear Unit (ReLU), Leaky ReLU (LReLU) or other activation functions, one hidden layer, and more general weight distributions. Our analysis carries over to deep networks. We investigate the consequences that weight initialization has on the equivalent kernel at the beginning of training.
While initialization schemes that mitigate exploding/vanishing gradient problems (Hochreiter, 1991; Bengio et al., 1994; Hochreiter et al., 2001) for other activation function and weight distribution combinations have been explored in earlier works (Glorot & Bengio, 2010; He et al., 2015), we discuss an initialization scheme for Multi-Layer Perceptrons (MLPs) with LReLUs and weights coming from distributions with 0 mean and finite absolute third moment. The derived kernels also allow us to analyze the loss of information as an input is propagated through the network, offering a complementary view to the shattered gradient problem (Balduzzi et al., 2017).

1 School of ITEE, University of Queensland, Brisbane, Queensland, Australia. 2 School of Mathematics and Physics, University of Queensland, Brisbane, Queensland, Australia. 3 International Computer Science Institute, Berkeley, California, USA. Correspondence to: Russell Tsuchida <[email protected]>, Farbod Roosta-Khorasani <[email protected]>, Marcus Gallagher <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

2. Preliminaries

Consider a fully connected (FC) feedforward neural network with $m$ inputs and a hidden layer with $n$ neurons. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be the activation function of all the neurons in the hidden layer. Further assume that the biases are 0, as is common when initializing neural network parameters. For any two inputs $x, y \in \mathbb{R}^m$ propagated through the network, the dot product in the hidden layer is

$$\frac{1}{n}\, h(x) \cdot h(y) = \frac{1}{n} \sum_{i=1}^{n} \sigma(w_i \cdot x)\,\sigma(w_i \cdot y), \qquad (1)$$

where $h(\cdot)$ denotes the $n$-dimensional vector in the hidden layer and $w_i \in \mathbb{R}^m$ is the weight vector into the $i$-th neuron. Assuming an infinite number of hidden neurons, the sum in (1) has an interpretation as an inner product in feature space, which corresponds to the kernel of a Hilbert space. We have

$$k(x, y) = \int_{\mathbb{R}^m} \sigma(w \cdot x)\,\sigma(w \cdot y)\, f(w)\, dw, \qquad (2)$$

where $f(w)$ is the probability density function (PDF) for the identically distributed weight vector $W = (W_1, \ldots, W_m)^T$ in the network. The connection of (2) to the kernels in kernel machines is well-known (Neal, 1994; Williams, 1997; Cho & Saul, 2009).
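To make the relationship between (1) and (2) concrete, the following minimal sketch (ours, not part of the paper) estimates the kernel for a ReLU activation with spherical Gaussian weights by averaging over an increasingly wide hidden layer. The helper names `relu` and `empirical_kernel` are illustrative; the point is simply that the finite-width average (1) stabilizes as the width $n$ grows, consistent with the infinite-width interpretation (2).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def empirical_kernel(x, y, n_hidden, weight_std=1.0, seed=0):
    """Finite-width estimate of k(x, y) from equation (1):
    (1/n) * sum_i sigma(w_i . x) * sigma(w_i . y),
    with spherical Gaussian weights w_i ~ N(0, weight_std^2 I)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, weight_std, size=(n_hidden, x.shape[0]))  # rows are the weight vectors w_i
    return np.mean(relu(W @ x) * relu(W @ y))

rng = np.random.default_rng(1)
x, y = rng.normal(size=8), rng.normal(size=8)  # two example inputs in R^m, m = 8

# The Monte Carlo average over hidden neurons settles down as the layer widens.
for n in (10, 100, 10_000, 1_000_000):
    print(n, empirical_kernel(x, y, n))
```

The same estimator works for any activation and any weight density $f$ that can be sampled; only the closed-form limit in (2) changes.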
Probabilistic bounds for the error between (1) and (2) have been derived in special cases (Rahimi & Recht, 2008) when the kernel is shift-invariant. Two specific random feature mappings are considered. (1) Random Fourier features are taken for the $\sigma$ in (1). Calculating the approximation error in this way requires being able to sample from the PDF defined by the Fourier transform of the target kernel. More explicitly, the weight distribution $f$ is the Fourier transform of the target kernel, and the $n$ samples $\sigma(w_i \cdot x)$ are replaced by some appropriate scale of $\cos(w_i \cdot x)$. (2) A random bit string $\sigma(x_i)$ is associated to each input according to a grid with random pitch $\delta$, sampled from $f$, imposed on the input space. This method requires access to the second derivative of the target kernel in order to sample from the distribution $f$.

Other work (Bach, 2017b) has focused on the smallest error between a target function $g$ in the reproducing kernel Hilbert space (RKHS) defined by (2) and an approximate function $\hat{g}$ expressible by the RKHS with the kernel (1). More explicitly, let $g(x) = \int_{\mathbb{R}^m} G(w)\,\sigma(w, x)\, f(w)\, dw$ be the representation of $g$ in the RKHS. The quantity $\big\| \hat{g} - g \big\| = \big\| \sum_{i=1}^{n} \alpha_i \sigma(w_i, \cdot) - \int_{\mathbb{R}^m} G(w)\,\sigma(w, \cdot)\, f(w)\, dw \big\|$ (with some suitable norm) is studied for the best set of $\alpha_i$ and random $w_i$ with an optimized distribution.

Yet another measure of kernel approximation error has been studied through the optimal solutions to the ridge regression problem of minimizing a regularized cost function $C$ using the kernel (1) and the kernel (2), respectively. The number of features required to probabilistically bound $C(\hat{g}) - C(g)$ with $n$ datapoints is found to be $O(\sqrt{n}\,\log n)$ under a suitable set of assumptions. That work notes the connection between kernel machines and one-layer neural networks with ReLU activations and Gaussian weights by citing Cho & Saul (2009). We extend this connection by considering other weight distributions and activation functions.

In this work our focus is on deriving expressions for the target kernel, not the approximation error. Additionally, we consider random mappings that have not been considered elsewhere. Our work is related to that of Poole et al. (2016) and Schoenholz et al. (2017); however, our results apply to the unbounded (L)ReLU activation function and more general weight distributions, whereas their work considers random biases as well as weights.

3. Equivalent Kernels for Infinite Width Hidden Layers

The kernel (2) has previously been evaluated for a number of choices of $f$ and $\sigma$ (Williams, 1997; Roux & Bengio, 2007; Cho & Saul, 2009; Pandey & Dukkipati, 2014a;b). In particular, the equivalent kernel for a one-hidden-layer network with spherical Gaussian weights of variance $\mathbb{E}[W_i^2]$ and mean 0 is the Arc-Cosine Kernel (Cho & Saul, 2009)

$$k(x, y) = \frac{\mathbb{E}[W_i^2]\,\|x\|\,\|y\|}{2\pi}\big(\sin\theta_0 + (\pi - \theta_0)\cos\theta_0\big), \qquad (3)$$

where $\theta_0 = \cos^{-1}\!\big(\tfrac{x \cdot y}{\|x\|\,\|y\|}\big)$ is the angle between the inputs $x$ and $y$, and $\|\cdot\|$ denotes the $\ell_2$ norm. Noticing that the Arc-Cosine Kernel $k(x, y)$ depends on $x$ and $y$ only through their norms and the angle $\theta_0$, with an abuse of notation we will henceforth set $k(x, y) \equiv k(\theta_0)$. Define the normalized kernel to be the cosine similarity between the signals in the hidden layer. The normalized Arc-Cosine Kernel is given by

$$\cos\theta_1 = \frac{k(x, y)}{\sqrt{k(x, x)}\,\sqrt{k(y, y)}} = \frac{1}{\pi}\big(\sin\theta_0 + (\pi - \theta_0)\cos\theta_0\big),$$

where $\theta_1$ is the angle between the signals in the first layer. Figure 1 shows a plot of the normalized Arc-Cosine Kernel.

One might ask how the equivalent kernel changes for a different choice of weight distribution. We investigate the equivalent kernel for networks with (L)ReLU activations and general weight distributions in Sections 3.1 and 3.2. The equivalent kernel can be composed and applied to deep networks. The kernel can also be used to choose good weights for initialization.
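As a companion sketch (again ours, with illustrative function names), the code below evaluates the Arc-Cosine Kernel (3) and its normalized form, then composes the normalized kernel layer by layer under the assumption that the same Gaussian-weight kernel applies at every layer of a deep network. The angle between two signals drifts toward $\theta = 0$, the pathological fixed point referred to in the abstract.

```python
import numpy as np

def arc_cosine_kernel(x, y, var_w=1.0):
    """Equivalent kernel (3) of a one-hidden-layer ReLU network with
    spherical Gaussian weights of per-component variance var_w = E[W_i^2]."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta0 = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return var_w * nx * ny / (2 * np.pi) * (np.sin(theta0) + (np.pi - theta0) * np.cos(theta0))

def normalized_kernel(theta):
    """Normalized Arc-Cosine Kernel: cos(theta_1) as a function of theta_0."""
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

print(arc_cosine_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal unit inputs: 1/(2*pi)

# Composing the normalized kernel across layers: starting from orthogonal
# inputs (theta = pi/2), the angle between the signals shrinks toward 0,
# i.e. deep layers map all inputs to nearly parallel directions.
theta = np.pi / 2
for layer in range(1, 11):
    theta = np.arccos(np.clip(normalized_kernel(theta), -1.0, 1.0))
    print(layer, round(float(theta), 4))
```

The slow drift of $\theta$ toward the fixed point is one way to visualize the loss of information with depth discussed in the introduction.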