On the Impact of the Activation Function on Deep Neural Networks Training
Soufiane Hayou, Arnaud Doucet, Judith Rousseau
Department of Statistics, University of Oxford, Oxford, United Kingdom

Abstract

The weight initialization and the activation function of deep neural networks have a crucial impact on the performance of the training procedure. An inappropriate selection can lead to the loss of information of the input during forward propagation and to the exponential vanishing/exploding of gradients during back-propagation. Understanding the theoretical properties of untrained random networks is key to identifying which deep networks may be trained successfully, as recently demonstrated by (Schoenholz et al., 2017), who showed that for deep feedforward neural networks only a specific choice of hyperparameters known as the 'Edge of Chaos' can lead to good performance. While the work of (Schoenholz et al., 2017) discusses trainability issues, we focus here on training acceleration and overall performance. We give a comprehensive theoretical analysis of the Edge of Chaos and show that we can indeed tune the initialization parameters and the activation function in order to accelerate the training and improve performance.

1. Introduction

Deep neural networks have become extremely popular as they achieve state-of-the-art performance on a variety of important applications including language processing and computer vision; see, e.g., (Goodfellow et al., 2016). The success of these models has motivated the use of increasingly deep networks and stimulated a large body of work to understand their theoretical properties. It is impossible to provide here a comprehensive summary of the large number of contributions within this field. To cite a few results relevant to our contributions, (Montufar et al., 2014) have shown that neural networks have exponential expressive power with respect to the depth, while (Poole et al., 2016) obtained similar results using a topological measure of expressiveness.

Since the training of deep neural networks is a non-convex optimization problem, the weight initialization and the activation function essentially determine the functional subspace that the optimization algorithm will explore. We follow here the approach of (Poole et al., 2016) and (Schoenholz et al., 2017) by investigating the behaviour of random networks in the infinite-width and finite-variance i.i.d. weights context, where they can be approximated by a Gaussian process as established by (Neal, 1995), (Matthews et al., 2018) and (Lee et al., 2018).

In this paper, our contribution is three-fold. Firstly, we provide a comprehensive analysis of the so-called Edge of Chaos (EOC) curve and show that initializing a network on this curve leads to a deeper propagation of the information through the network and accelerates the training. In particular, we show that a feedforward ReLU network initialized on the EOC acts as a simple residual ReLU network in terms of information propagation. Secondly, we introduce a class of smooth activation functions which allow for deeper signal propagation (Proposition 3) than ReLU. In particular, this analysis sheds light on why smooth versions of ReLU (such as SiLU or ELU) perform better experimentally for deep neural networks; see, e.g., (Clevert et al., 2016), (Pedamonti, 2018), (Ramachandran et al., 2017) and (Milletarí et al., 2018). Lastly, we show the existence of optimal points on the EOC curve, provide guidelines for the choice of such a point, and demonstrate numerically the consistency of this approach. We also complement previous empirical results by illustrating the benefits of an initialization on the EOC in this context. All proofs are given in the Supplementary Material.

2. On Gaussian process approximations of neural networks and their stability

2.1. Setup and notations

We use similar notations to those of (Poole et al., 2016) and (Lee et al., 2018). Consider a fully connected feedforward random neural network of depth $L$, widths $(N_l)_{1 \le l \le L}$, weights $W^l_{ij} \overset{iid}{\sim} \mathcal{N}(0, \frac{\sigma_w^2}{N_{l-1}})$ and bias $B^l_i \overset{iid}{\sim} \mathcal{N}(0, \sigma_b^2)$, where $\mathcal{N}(\mu, \sigma^2)$ denotes the normal distribution of mean $\mu$ and variance $\sigma^2$. For some input $a \in \mathbb{R}^d$, the propagation of this input through the network is given for an activation function $\phi : \mathbb{R} \to \mathbb{R}$ by

$$y^1_i(a) = \sum_{j=1}^{d} W^1_{ij} a_j + B^1_i, \qquad (1)$$

$$y^l_i(a) = \sum_{j=1}^{N_{l-1}} W^l_{ij} \, \phi(y^{l-1}_j(a)) + B^l_i, \quad \text{for } l \ge 2. \qquad (2)$$

Throughout this paper we assume that for all $l$ the processes $y^l_i(\cdot)$ are independent (across $i$) centred Gaussian processes with covariance kernels $\kappa^l$ and write accordingly $y^l_i \overset{ind}{\sim} \mathcal{GP}(0, \kappa^l)$. This is an idealized version of the true processes corresponding to choosing $N_{l-1} = +\infty$ (which implies, using the Central Limit Theorem, that $y^l_i(a)$ is a Gaussian variable for any input $a$). The approximation of $y^l_i(\cdot)$ by a Gaussian process was first proposed by (Neal, 1995) in the single layer case and has been recently extended to the multiple layer case by (Lee et al., 2018) and (Matthews et al., 2018).
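To make the forward dynamics of equations (1)-(2) concrete, here is a minimal sketch of such a random network (ours, not code from the paper; the function name, the widths and the parameter values are illustrative assumptions). At large width, each pre-activation coordinate behaves approximately like a centred Gaussian whose variance follows the kernel recursion recalled next.

```python
import numpy as np

def forward_preactivations(a, widths, phi, sigma_w, sigma_b, rng=None):
    """Propagate input `a` through a random fully connected network.

    Weights are drawn i.i.d. N(0, sigma_w^2 / N_{l-1}) and biases i.i.d.
    N(0, sigma_b^2), as in equations (1)-(2). Returns the list of
    pre-activation vectors y^1(a), ..., y^L(a).
    """
    rng = np.random.default_rng() if rng is None else rng
    pre_activations = []
    x = np.asarray(a, dtype=float)   # layer input: a for l = 1, phi(y^{l-1}(a)) afterwards
    fan_in = x.shape[0]
    for n_l in widths:
        W = rng.normal(0.0, sigma_w / np.sqrt(fan_in), size=(n_l, fan_in))
        B = rng.normal(0.0, sigma_b, size=n_l)
        y = W @ x + B                # equation (1) for l = 1, equation (2) for l >= 2
        pre_activations.append(y)
        x = phi(y)
        fan_in = n_l
    return pre_activations

# Example: a 10-layer ReLU network of width 500. The empirical variance of
# y^l(a) across units stays roughly constant here because (sigma_b, sigma_w)
# = (0, sqrt(2)) preserves the variance for ReLU.
relu = lambda y: np.maximum(y, 0.0)
a = np.ones(100) / 10.0
ys = forward_preactivations(a, widths=[500] * 10, phi=relu,
                            sigma_w=np.sqrt(2.0), sigma_b=0.0)
print([round(float(np.var(y)), 4) for y in ys])
```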
We recall here the expressions of the limiting Gaussian process kernels. For any input $a \in \mathbb{R}^d$, $\mathbb{E}[y^l_i(a)] = 0$, so that for any inputs $a, b \in \mathbb{R}^d$

$$\kappa^l(a, b) = \mathbb{E}[y^l_i(a)\, y^l_i(b)] = \sigma_b^2 + \sigma_w^2\, \mathbb{E}[\phi(y^{l-1}_i(a))\, \phi(y^{l-1}_i(b))] = \sigma_b^2 + \sigma_w^2\, F_\phi\big(\kappa^{l-1}(a,a), \kappa^{l-1}(a,b), \kappa^{l-1}(b,b)\big)$$

where $F_\phi$ is a function that only depends on $\phi$. This gives a recursion to calculate the kernel $\kappa^l$; see, e.g., (Lee et al., 2018) for more details. We can also express the kernel $\kappa^l(a,b)$ (which we denote hereafter by $q^l_{ab}$) in terms of the correlation $c^l_{ab}$ in the $l^{th}$ layer

$$q^l_{ab} = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\Big[\phi\big(\sqrt{q^{l-1}_a}\, Z_1\big)\, \phi\big(\sqrt{q^{l-1}_b}\, U_2(c^{l-1}_{ab})\big)\Big]$$

where $q^{l-1}_a := q^{l-1}_{aa}$, resp. $c^{l-1}_{ab} := q^{l-1}_{ab} / \sqrt{q^{l-1}_a q^{l-1}_b}$, is the variance, resp. correlation, in the $(l-1)^{th}$ layer, and $U_2(x) = x Z_1 + \sqrt{1 - x^2}\, Z_2$, where $Z_1$, $Z_2$ are independent standard Gaussian random variables. As the input propagates through the network, $q^l_a$ is updated through the layers by the recursive formula $q^l_a = F(q^{l-1}_a)$, where $F$ is the 'variance function' given by

$$F(x) = \sigma_b^2 + \sigma_w^2\, \mathbb{E}[\phi(\sqrt{x}\, Z)^2], \quad Z \sim \mathcal{N}(0, 1). \qquad (3)$$

Throughout this paper, $Z, Z_1, Z_2$ will always denote independent standard Gaussian variables, and $a, b$ two inputs in $\mathbb{R}^d$.
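The recursions for $q^l_a$ and $c^l_{ab}$ are easy to iterate numerically. The sketch below is our own illustration (the function names, the plain Monte Carlo estimator and the values of $(\sigma_b, \sigma_w)$ are assumptions, not taken from the paper): it estimates the variance function $F$ of (3) and performs the correlation update based on $U_2$, assuming the variance has already reached its limit.

```python
import numpy as np

rng = np.random.default_rng(0)
Z1 = rng.standard_normal(500_000)   # shared Gaussian samples for all estimates
Z2 = rng.standard_normal(500_000)

def variance_map(x, phi, sigma_w, sigma_b):
    """Monte Carlo estimate of F(x) = sigma_b^2 + sigma_w^2 E[phi(sqrt(x) Z)^2] (eq. 3)."""
    return sigma_b**2 + sigma_w**2 * np.mean(phi(np.sqrt(x) * Z1) ** 2)

def correlation_map(c, q, phi, sigma_w, sigma_b):
    """One step of the correlation recursion c^{l-1} -> c^l, assuming the
    variance has already converged to q (so q_a = q_b = q)."""
    u1 = np.sqrt(q) * Z1
    u2 = np.sqrt(q) * (c * Z1 + np.sqrt(1.0 - c**2) * Z2)   # sqrt(q) * U_2(c)
    q_ab = sigma_b**2 + sigma_w**2 * np.mean(phi(u1) * phi(u2))
    # Clip to guard against Monte Carlo noise pushing the ratio slightly above 1.
    return min(q_ab / variance_map(q, phi, sigma_w, sigma_b), 1.0)

# Illustration with tanh and arbitrary (sigma_b, sigma_w), not values from the paper.
phi = np.tanh
sigma_b, sigma_w = 0.5, 1.5
q, c = 1.0, 0.4            # q^1_a and c^1_{ab}
for _ in range(50):
    q = variance_map(q, phi, sigma_w, sigma_b)
for _ in range(50):
    c = correlation_map(c, q, phi, sigma_w, sigma_b)
print(q, c)                # limiting variance and correlation
```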
Let $E$ and $G$ be two subsets of $\mathbb{R}$. We define the following sets of functions for $k \in \mathbb{N}$ by

$$D^k(E, G) = \{f : E \to G \text{ such that } f^{(k)} \text{ exists}\}$$
$$C^k(E, G) = \{f \in D^k(E, G) \text{ such that } f^{(k)} \text{ is continuous}\}$$
$$D^k_g(E, G) = \{f \in D^k(E, G) : \forall j \le k,\ \mathbb{E}[f^{(j)}(Z)^2] < \infty\}$$
$$C^k_g(E, G) = \{f \in C^k(E, G) : \forall j \le k,\ \mathbb{E}[f^{(j)}(Z)^2] < \infty\}$$

where $f^{(k)}$ is the $k^{th}$ derivative of $f$. When $E$ and $G$ are not explicitly mentioned, we assume $E = G = \mathbb{R}$.

2.2. Limiting behaviour of the variance and covariance operators

We analyze here the limiting behaviour of $q^l_a$ and $c^l_{a,b}$ as $l$ goes to infinity. From now onwards, we will also assume without loss of generality that $c^1_{ab} \ge 0$ (similar results can be obtained straightforwardly when $c^1_{ab} \le 0$). We first need to define the Domains of Convergence associated with an activation function $\phi$.

Definition 1. Let $\phi \in D^0_g$ and $(\sigma_b, \sigma_w) \in (\mathbb{R}^+)^2$.
(i) Domain of convergence for the variance $D_{\phi,\mathrm{var}}$: $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$ if there exist $K > 0$ and $q \ge 0$ such that for any input $a$ with $q^1_a \le K$, $\lim_{l \to \infty} q^l_a = q$. We denote by $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w)$ the maximal $K$ satisfying this condition.
(ii) Domain of convergence for the correlation $D_{\phi,\mathrm{corr}}$: $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{corr}}$ if there exists $K > 0$ such that for any two inputs $a, b$ with $q^1_a, q^1_b \le K$, $\lim_{l \to \infty} c^l_{ab} = 1$. We denote by $K_{\phi,\mathrm{corr}}(\sigma_b, \sigma_w)$ the maximal $K$ satisfying this condition.

Remark: Typically, $q$ in Definition 1 is a fixed point of the variance function defined in (3). Therefore, it is easy to see that for any $(\sigma_b, \sigma_w)$ such that $F$ is non-decreasing and admits at least one fixed point, we have $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w) \ge q$ where $q$ is the minimal fixed point, i.e. $q := \min\{x : F(x) = x\}$. Thus, if we re-scale the input data to have $q^1_a \le q$, the variance $q^l_a$ converges to $q$. We can also re-scale the variance $\sigma_w^2$ of the first layer (only) to assume that $q^1_a \le q$ for all inputs $a$.

The next lemma gives sufficient conditions under which $K_{\phi,\mathrm{var}}$ and $K_{\phi,\mathrm{corr}}$ are infinite.

Lemma 1. Assume $\phi''$ exists, at least in the distribution sense. Let $M_\phi := \sup_{x \ge 0} \mathbb{E}[|\phi'(xZ)^2 + \phi''(xZ)\,\phi(xZ)|]$. If $M_\phi < \infty$, then for $\sigma_w^2 < \frac{1}{M_\phi}$ and $\sigma_b \ge 0$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$ and $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w) = +\infty$.
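As a quick numerical sanity check of Lemma 1 (our own sketch, not code from the paper; the choice of tanh, the grid used for the supremum and all constants are assumptions), one can estimate $M_\phi$ by Monte Carlo and verify that for $\sigma_w^2 < 1/M_\phi$ the variance recursion (3) reaches the same limit from very different initial values $q^1_a$, which is what $K_{\phi,\mathrm{var}} = +\infty$ expresses.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal(500_000)

# tanh and its first two derivatives, used as a concrete smooth activation.
phi   = np.tanh
dphi  = lambda y: 1.0 - np.tanh(y) ** 2
ddphi = lambda y: -2.0 * np.tanh(y) * (1.0 - np.tanh(y) ** 2)

def m_phi_estimate(xs):
    """Crude estimate of sup_{x >= 0} E[|phi'(xZ)^2 + phi''(xZ) phi(xZ)|] over a grid."""
    return max(np.mean(np.abs(dphi(x * Z) ** 2 + ddphi(x * Z) * phi(x * Z))) for x in xs)

M_phi = m_phi_estimate(np.linspace(0.0, 20.0, 100))
sigma_w = 0.95 / np.sqrt(M_phi)      # ensures sigma_w^2 < 1 / M_phi
sigma_b = 0.3

# Under the condition of Lemma 1, the variance recursion should reach the same
# limit from very different starting values q^1_a (K_{phi,var} = infinity).
for q0 in (0.01, 1.0, 100.0):
    q = q0
    for _ in range(100):
        q = sigma_b**2 + sigma_w**2 * np.mean(phi(np.sqrt(q) * Z) ** 2)
    print(q0, "->", q)
```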