On the Derivatives of the Sigmoid
Neural Networks, Vol. 6, pp. 845–853, 1993. Printed in the USA. All rights reserved. Copyright © 1993 Pergamon Press Ltd.

ORIGINAL CONTRIBUTION

ALI A. MINAI AND RONALD D. WILLIAMS

University of Virginia, Charlottesville

(Received 6 February 1992; revised and accepted 13 January 1993)

Acknowledgements: This research was supported by the Center for Semicustom Integrated Systems at the University of Virginia and the Virginia Center for Innovative Technology. We would like to thank Prof. Worthy N. Martin and Prof. Bruce A. Chartres for their valuable suggestions. Dr. Minai would also like to thank the Department of Neurosurgery, University of Virginia, and especially Prof. William B. Levy, for their support. The paper has also benefited considerably from the suggestions of two anonymous reviewers. Requests for reprints should be sent to Dr. Ali A. Minai, Department of Neurosurgery, Box 420, Health Sciences Center, University of Virginia, Charlottesville, VA 22908.

Abstract—The sigmoid function is very widely used as a neuron activation function in artificial neural networks, which makes its attributes a matter of some interest. This paper presents some general results on the derivatives of the sigmoid. These results relate the coefficients of various derivatives to standard number sequences from combinatorial theory, and thus provide a standard, efficient way of calculating these derivatives. Such derivatives can be used in approximating the effect of input perturbations on the output of single neurons, and could also be useful in statistical modeling applications.

Keywords—Neuron activation function, Sigmoid function, Logistic function, Eulerian numbers, Stirling numbers.

1. INTRODUCTION

The sigmoid function has found extensive use as a nonlinear activation function for neurons in artificial neural networks. Thus, it is of some interest to explore its characteristics. In this paper, we study the derivatives of the 1-dimensional sigmoid function

    y = σ(x; w) = 1 / (1 + e^{−wx}),    (1)

where x ∈ ℝ is the independent variable and w ∈ ℝ is a weight parameter. In particular, we present recurrence relations for calculating derivatives of any order, and show that the coefficients generated by these recurrences are directly related to standard number-theoretic sequences. Thus, our formulation can be used to determine higher derivatives of the sigmoid directly from standard tables.

2. MOTIVATION

The sigmoid in eqn (1) is also called the logistic function, and can be seen as representing a "neuron" with one input. However, it can also be given a more general interpretation. Let i be a neuron with n inputs x_j, 1 ≤ j ≤ n, and let the output y of the neuron be given by:

    y = 1 / (1 + e^{−z})    (2.1)

    z = Σ_j w_{ij} x_j + θ = wx + θ,    (2.2)

where θ is a bias value, x = [x_1, x_2, …, x_n], and w = [w_{i1}, w_{i2}, …, w_{in}]. This defines a sigmoidal step in n + 1 dimensions (henceforth called an n-dimensional sigmoid), oriented in the direction of the n-dimensional vector w (Figure 1).

The use of this sigmoid function in neural networks derives in part from its utility in the Bayesian estimation of classification probabilities. It arises as the expression for the posterior probability when the weights are interpreted as logs of prior probability ratios (Stolorz et al., 1992). The sigmoid is also attractive as an activation function because of its monotonicity and its simple form. The form of its lower derivatives also makes it attractive for learning algorithms like back propagation (Werbos, 1974).

It is clear from eqns (2.1, 2.2) that the value of y remains unchanged along n-dimensional hyperplanes orthogonal to w. If θ = 0, the n-dimensional hyperplane z = 0 (y = 0.5) passes through the origin x = 0. If θ is not 0, the sigmoid is shifted along w by a distance −θ/‖w‖, where ‖·‖ is the Euclidean norm. The shape of the sigmoid (specifically, the sharpness of its nonlinearity) is determined by ‖w‖, with a larger value giving a sharper sigmoid. Figure 2 shows a 1-dimensional sigmoid with different weight values.

FIGURE 1. A 2-dimensional sigmoid function: y(x_1, x_2) = 1/[1 + e^{−z}]; z = w_1 x_1 + w_2 x_2 + θ; θ = threshold; w = [w_1 w_2]. [Surface plot omitted.]

FIGURE 2. The shape of a 1-dimensional sigmoid function as it varies with the weight w. In the large-w limit, the sigmoid becomes a threshold function. [Plot omitted.]

It can be shown quite easily (Minai, 1992) that tracing y along any line not orthogonal to w in the input space yields a 1-dimensional sigmoid with the effective weight ŵ = ‖w‖ cos β, where β is the angle between w and the direction of the trace. Thus, in order to determine the change in y due to an arbitrary change in x, it is sufficient to consider the corresponding change in a 1-dimensional sigmoid. A Taylor series expansion of the 1-dimensional sigmoid can be used to approximate this change, and this requires the calculation of higher derivatives, depending on the desired level of accuracy. Expanding the function of eqn (1) around x gives:

    σ(x + Δx) = σ(x) + Σ_{n=1}^{∞} (1/n!) [d^n σ(x)/dx^n] (Δx)^n
              = σ(x) + Σ_{n=1}^{∞} (1/n!) σ^(n)(x) (Δx)^n,    (3)

where σ^(n)(x) ≡ d^n σ(x)/dx^n. The issue is to determine σ^(n).

One interesting application of expansions such as these might be in the analysis of layers of sigmoid neurons. Using Taylor series expansions of a relatively high order, one could accurately write the transformation performed by a layer of sigmoid neurons locally as a polynomial. Such a description could then be used in analysing the effects of changes in the layer's input, thus characterizing the smoothness of the mapping and its robustness to perturbations. Such analysis could be very useful in studying the fault tolerance of feedforward networks and, perhaps, even their ability to generalize (which is related to the smoothness of the induced mappings). Another application for such polynomial approximations might be in studying the dynamics of recurrent networks with sigmoid neurons, particularly in characterizing the sensitivity to small perturbations. The Appendix gives further details on estimating the effect of perturbations.

Another motivation for studying the logistic function and its derivatives comes from the area of logistic regression. The logistic function is widely used as the canonical link function in the statistical modeling of binary data with a Bernoulli distribution (McCullagh & Nelder, 1989). Specifically, when generalized linear models on binary data, x = [x_1 … x_m], x_i ∈ {0, 1}, are used to estimate probabilities for classification, it is convenient to relate a probability, p, to the estimator, η ≡ w′x, by

    η(x; w) ≡ w′x = log[p / (1 − p)] ≡ logit(p).

This, in turn, implies the linking function

    p = e^{η} / (1 + e^{η}) = 1 / (1 + e^{−η})

(i.e., the logistic function). There is considerable interest in the higher derivatives of the logistic function because they provide direct information about higher-order statistics of the underlying distribution.

3. DETERMINING σ^(n)

An interesting property of the sigmoid is that the derivatives of y can be written in terms of y and w only. The first derivative of y with respect to x is given by:

    σ^(1) = dy/dx = w e^{−wx} / (1 + e^{−wx})^2 = wy(1 − y).    (4)

The second derivative can be obtained from this by differentiating again with respect to x. It is instructive to consider the following general case:

    d/dx [y^k (1 − y)^l] = k y^{k−1} (1 − y)^l · wy(1 − y) + y^k · l (1 − y)^{l−1} · [−wy(1 − y)]
                         = wk y^k (1 − y)^{l+1} − wl y^{k+1} (1 − y)^l,    (5)

which immediately suggests the following recursive procedure for obtaining σ^(n+1) from σ^(n). Let

    Δ_0[A y^k (1 − y)^l] ≡ Ak y^k (1 − y)^{l+1}    (6)

    Δ_1[A y^k (1 − y)^l] ≡ −Al y^{k+1} (1 − y)^l    (7)

    Δ_{ab}[G] ≡ Δ_b[Δ_a[G]],    (8)

where a is a binary sequence (e.g., 0110101) and b ∈ {0, 1}. Thus,

    Δ_{01101}[G] ≡ Δ_1[Δ_0[Δ_1[Δ_1[Δ_0[G]]]]].

Also,

    Δ[G] ≡ Δ_0[G] + Δ_1[G]    (9)

    Δ[G + H] ≡ Δ[G] + Δ[H]    (10)

    Δ^n[G] ≡ Δ[Δ^{n−1}[G]].    (11)

Then the following obtain:

LEMMA 1. σ^(n) = w Δ[σ^(n−1)].

Proof. The conclusion follows from eqn (5) and the superposition of derivatives. □

LEMMA 2. σ^(n) = w^{n−1} Δ^{n−1}[σ^(1)] = w^{n−1} Δ^{n−1}[wy(1 − y)] = w^n Δ^{n−1}[y(1 − y)].

Proof. The conclusion follows by recursion on Lemma 1. □

REMARK 1. Following Lemma 2, σ^(n) can be written as σ^(n) = w^n σ̃^(n), where σ̃^(n) ≡ Δ^{n−1}[y(1 − y)].

Several observations can be made directly about the derivatives obtained using the above method. These are presented below in the form of lemmas and corollaries. Most of the proofs are straightforward, and are omitted to save space (see Minai, 1992, for all proofs). Some of the more significant ones are given in full.

LEMMA 3. σ^(n), as generated by Δ, is the sum of 2^{n−1} terms. This form is called the expanded or Δ-generated form of σ^(n). The process of obtaining σ^(n) from σ^(1) can be seen as the development of an n-level binary tree (Figure 3), with the pth-level nodes representing the individual terms of σ^(p). If, on each application of Δ to a term of σ^(p), the left child node is generated by Δ_0 and the right child node by Δ_1, then this defines a unique ordering on all
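The Δ-operator recurrence above is straightforward to mechanize. The sketch below is ours, not the authors' code: it stores each term of σ̃^(n) as a triple (A, k, l) standing for A·y^k·(1 − y)^l, applies Δ_0 and Δ_1 as in eqns (6)–(7), and checks eqn (4), the closed form of the second derivative, and the 2^{n−1} term count of Lemma 3. All function names are our own.

```python
import math

def delta(terms):
    """One application of Delta = Delta_0 + Delta_1 (eqns 6-9).
    Each term (A, k, l) stands for A * y**k * (1 - y)**l."""
    out = []
    for A, k, l in terms:
        out.append((A * k, k, l + 1))    # Delta_0: left child in the tree
        out.append((-A * l, k + 1, l))   # Delta_1: right child in the tree
    return out

def sigma_deriv(n, y, w):
    """n-th derivative of the sigmoid via Lemma 2:
    sigma^(n) = w**n * Delta^(n-1)[y(1 - y)]."""
    terms = [(1, 1, 1)]                  # the seed term y(1 - y)
    for _ in range(n - 1):
        terms = delta(terms)
    return w**n * sum(A * y**k * (1 - y)**l for A, k, l in terms)

w, x = 1.5, 0.3
y = 1.0 / (1.0 + math.exp(-w * x))

# Eqn (4): sigma^(1) = w y (1 - y).
assert abs(sigma_deriv(1, y, w) - w * y * (1 - y)) < 1e-12

# sigma^(2) = w^2 [y(1-y)^2 - y^2(1-y)] = w^2 y(1-y)(1-2y).
assert abs(sigma_deriv(2, y, w) - w**2 * y * (1 - y) * (1 - 2 * y)) < 1e-12

# Lemma 3: the expanded form of sigma^(n) has 2^(n-1) terms.
terms = [(1, 1, 1)]
for _ in range(4):
    terms = delta(terms)
assert len(terms) == 2 ** 4              # sigma^(5): 16 terms before merging
```

Note that the 2^{n−1} count of Lemma 3 refers to the expanded, Δ-generated form: the code deliberately does not merge terms with equal (k, l), so each application of `delta` doubles the list, mirroring the binary tree described above.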