<<

Neural Networks. Vol. 6, pp. 845-853, 1993 0893-6080/93 $6.00 + .00 Printed in the USA. All fights reserved. Copyright © 1993 Pergamon Press Ltd.

ORIGINAL CONTRIBUTION

On the of the Sigmoid

ALI A. MINAI AND RONALD D. WILLIAMS University of Virginia, Charlottesville

(Received 6 Februa~. 1992; revised and accepted 13 January 1993)

Abstract--The sigmoid fimction & very widely used as a neuron activation fimction in artificial neural networks. which makes its attributes a matter of some interest. This paper presents some general results on the derivatives of the sigmoid. These results relate the coefficients of various derivatives to standard number sequences from combi- natorial theory, and thus provide a standard efficient way of calculating these derivatives. Such derivatives can be used in approximating the effect of input perturbations on the output of single neurons and could also be usefid in statistical modeling applications. Keywords--Neuron activation , Sigmoid function, , Eulerian numbers, Stifling numbers.

1. INTRODUCTION one input. However, it can also be given a more general interpretation. Let i be a neuron with n inputs A), 1 < The sigmoid function has found extensive use as a non- j < n, and let the output y of the neuron be given by: linear for neurons in artificial neural 1 networks. Thus, it is of some interest to explore its y - -- (2.1) characteristics. In this paper, we study the derivatives 1 +e--" of the l-dimensional sigmoid function z= ~ woxj + O = wx + O, (2.2) J I y = a(x; w) 1 + e -w" ' ( 1 ) where 0 is a bias value, x = [x~, x: ..... xn] and w = [wi~, wi: ..... w~]. This defines a sigmoidal step in n where x ~ Y/is the independent variable and w E Y/ + 1 dimensions (henceforth called an n-dimensional is a weight parameter. In particular, we present recur- sigmoid), oriented in the direction of the n-dimensional rence relations for calculating derivatives of any order, vector w (Figure 1 ). and show that the coefficients generated by these re- The use of this sigmoid function in neural networks currences are directly related to standard number derives in part from its utility in Bayesian estimation theoretic sequences. Thus, our formulation can be used of classification probabilities. It arises as the expression to determine higher derivatives of the sigmoid directly for the posterior probability when the weights are in- from standard tables. terpreted as logs of prior probability ratios (Stolorz et al., 1992). The sigmoid is also attractive as an activation 2. MOTIVATION function because of its monotonicity and its simple form. The form of its lower derivatives also makes it The sigmoid in eqn ( 1 ) is also called the logisticfimc- attractive for learning algorithms like back propagation tion, and can be seen as representing a "neuron" with (Werbos, 1974). It is clear from eqns (2.1, 2.2) that the value ofy remains unchanged along n-dimensional hyperplanes Acknowledgements: This research was supported by the Center orthogonal to w. If0 = 0, the n-dimensional hyperplane for Semicustom Integrated Systems at the University of Virginia and z = 0 (y = 0.5), passes through the origin x = 0. If 0 the Virginia Center for Innovative Technology. We would like to thank Prof. Worthy N. Martin and Prof. Bruce A. Chartres for their valuable is not 0, the sigmoid is shifted along w by a distance suggestions. Dr. Minai would also like to thank the Department of -0/II w II, where II II is the Euclidean norm. The shape Neurosurgery, University of Virginia, and especially Prof. William B. of the sigmoid (specifically, the sharpness of its nonlin- Levy for their support. The paper has also benefited considerably earity) is determined by 11w ]l, with a larger value giving from the suggestions of two anonymous reviewers. a sharper sigmoid. Figure 2 shows a 1-dimensional sig- Requests for reprints should be sent to Dr. Ali A. Minai, De- partment of Neurosurgery, Box 420, Health Sciences Center, University moid with different weight values. of Virginia, Charlottesville, VA 22908. It can be shown quite easily ( Minai, 1992) that trac- 845 846 A. A. Minai and R. D. Williams

FIGURE 1. A 2-dimensional sigmoid function" y(x~, x2) = 1/[1 + exp(-wx + 0)]; 0 = threshold, z = w~x~ + w2x2 + O, w = [w~w2]. ing y along any line not orthogonal to w in the input it is sul~cient to consider the corresponding change in space yields a 1-dimensional sigmoid with the effective a 1-dimensional sigmoid. A Taylor series expansion of weight w = IIw IIcos 8, where B is the angle between w the l-dimensional sigmoid can be used to approximate and the direction of the trace. Thus, in order to deter- this change, and this requires the calculation of higher mine the change in y due to an arbitrary change in x, derivatives, depending on the desired level of accuracy.

Y

1.0-

w =0 0.5|

0.0 0.0 x

FIGURE 2. The shape of a 1-dimensional sigmoid function as it varies with the weight w. In the large limit of w, the sigmoid becomes a threshold function. Sigmoid Derivatives 847

Expanding the function of eqn ( 1 ) around x gives: The second can be obtained from this by differentiating again with respect to x. It is instructive 1 d"a(x) .... a(x+~x)=~(x)+ Y~n! ~ tox~ to consider the following general case: n=l

1 _~. [yk( 1 -- ),)t] = kyk-i( 1 -- y)twy( 1 -- y) -- I( 1 -- ),)t-I = ~(x) + X ~. ~(")(x)(~x)", (3) n= I ×[--(wy( I -- y)]j,k = wltvk( 1 -- y)t+l where at"i(x) --- d"a(x)/dx'. The issue is to deter- -- wlyk+l(l -- V) t, (5) mine a t"> . One interesting application of expansions such as which immediately suggests the following recursive these might be in the analysis of layers of sigmoid neu- procedure for obtaining a (,+t) from at'). rons. Using Taylor series expansions of a relatively high Let order, one could accurately write the transformation Ao[Ay~( 1 - y)l] __- Akyk( 1 - y)t+t (6) performed by a of sigmoid neurons locally as a polynomial. Such a description could then be used in A,[Ayk( 1 - JO t] m -Alyk+t( 1 - y)t (7) analysing the effects of changes in the layer's input, thus Aah[G] -= Ab[Aa[G]], (8) characterizing the smoothness of the mapping and its robustness to perturbations. Such analysis could be very where a is a binary sequence (e.g., 01 I0101 ) and b E useful in studying the fault-tolerance of feed forward { O, 1 }. Thus, networks and, perhaps, even their ability to generalize Ao, ioi[G] =- At[Ao[At[At[AoIG]]]]] (which is related to the smoothness of the induced mappings). Another application for such polynomial A[G] --= Ao[G] + At[G] (9) approximations might be in studying the dynamics of A[G + H] -= A[G] + A[H] (I0) recurrent networks with sigmoid neurons, particularly AO[G] --- A[A"-t[G]]. (II) in characterizing the sensitivity to small perturbations. The Appendix gives further details on estimating the Then the following obtain: effect of perturbations. Another motivation for studying the logistic function LEMMA I. a (n) = W3"[a(n-i)]. and its derivatives comes from the area of . The logistic function is widely used as the Proof The conclusion follows from eqn (5) and the canonical link function in the statistical modeling of superposition of derivatives. [] binary data with a Bernoulli distribution (McCullagh & Nelder, 1989). Specifically, when generalized linear LEMMA 2. a t'> = w'-IA"-t[a (I)] = w'-13""-I[wy(l models on binary data, x = [xi...xm], xi ~ {0, I}, - y)] = w'A'-t[y(1 - y)] are used to estimate probabilities for classification, it is convenient to relate a probability, p, to the estimator, Proof The conclusion follows by recursion on Lemma o~wxby 1. []

o(x;w)---w'x=log P =- (p). I-p REMARK I. Following Lemma 2, a ~'> can be written as a t") = w'g"t'>, where ~-t,) __- 3,,-I[ y( 1 - y)]. Several This, in turn, implies the linking function observations can be made directly about the derivatives obtained using the above method. These are presented e' 1 P 1 +e ~ I +e -"' below in the form of lemmas and corollaries. Most of the proofs are straightforward, and are omitted to save (i.e., the logistic function). There is considerable in- space (see Minai, 1992, for all proofs). Some of the terest in the higher derivatives of the logistic function more significant ones are given in full. because they provide direct information about higher- order statistics of the underlying distribution. LEMMA 3. a t'). as generated by 3-, is the sum of 2"-' terms. Th& form & called the expanded or 3,-generated form of a ('>. The process of obtaining at'i from a t'> 3. DETERMINING a (") can be seen as the development of a n-level binary tree An interesting property of the sigmoid is that the de- (Figure 3), with the pth level nodes representing the rivatives of y can be written in terms of y and w only. individual terms of a (P). The first derivative ofy with respect to yj is given by: If, on each application of 3, to a term of a (p), the e -wx left child node is generated by 3-o and the right child O.(1 ) = dy w. dx ( 1 + e-'x) 2 wy( 1 - y). (4) node by 3,~, then this defines a unique ordering on all 848 A. A. Minai and R. D. Williams

wy(l-y) O41)

w2y (I-y)2 -w 2y 2( l-y ) 0(2)

II w3y ( I-y )3 -2w3y 2( 1--'¢ )2 -2w3y2(l-y )2 w3y3(l-y) (3"(3 )

i Ai l " i FIGURE 3. The binary tree formulation of the procedure for calculating successive derivatives of the sigmoid. The arcs are labeled with the subscript for the appropriate A operator. The derivative shown on the right is the sum of all the terms in the corresponding level of the tree. terms of 0"(p+I). If this scheme is used at each level in COROLLARY 3. In the A-generated expression for ~ (~), the generation of a <"), the terms of a (") are uniquely there are (~.2~) = (.-k).-i terms with the power pair { k, n ordered as follows: +l-k}. Because a(n) = w,~(,), ~(,) completely determines LEMMA 4. The mth term of a (") is given by w"As[y( 1 ("). Thus, only ~-(m will be studied henceforth. Note - y)], ,,here m = 0, 1, 2 ..... 2 "-t - 1 and S is the that, like a ("), it has A-generated and simplified forms, n - 1 digit binary representation of m, and is called obtained by dividing a t,) by w". the generating sequence for the term. Another fact about atm and ~'(") is that they possess odd symmetry about the origin for even n, and even REMARK 2. tr I") can also be generated by w"A"[y], symmetry for odd n. This follows trivially from the fact because A[yJ( 1 - y)0] = y( 1 - y). However, the At that the sigmoid has odd symmetry. Also, a (") has n + branch operating on y generates a O-term, producing a 1 roots (zero-crossings), since it is a polynomial of de- single-term a (~). Because this introduces a needless gree n + 1 in y. The first six derivatives are shown in asymmetry into the procedure, the formulation of Lemma 2 is used throughout. Figure 4, plotted over the range 0 _< y -< 1, which cor- responds to -~ _< x _< oo. LEMMA 5. The term generated by As[),( 1 - y)] has the form Cyk( 1 -- y)l, where 4. CALCULATING THE COEFFICIENTS k = 1 + the number of I's in S LEMMA 6. If the ruth term of the A-generated ~(.) has ! = 1 + the number of O's in S the generating sequence S~ ) = { di } ( di is the ith digit of the bina~ sequence), the coefficient ct.,") of the term Proof The powers of both y and ( 1 - y) are initially can be obtained by the following equations: 1. Thereafter, each operation by A~ adds 1 to the power n-2 of), [eqn ( 7 )], and each operation by A0 does the same C(,.n) = l'-[ [dp(-(n + I - kp)) + I dp - l]kp] to the power of (1 - y) [eqn (6)]. The conclusion p=O

follows immediately. I--1 p-I kp= l + Z dq, COROLLARY I. In the A-generated expression for a ("), q=O two terms a~") and (r(q") can be combined if their gen- or, equivalently, by the equations: erating sequences Sp and Sq have the same number of l "s and O's. n-2 c~") = /-I [do(-/p) + I dp - l l(n + 1 -/p)] p=0 COROLLARY 2. Each term in atm has the form Cyk( I p-I _ y)t, where l + k = n + 1 and C is a coefficient. { k, lp=l+~ Idq- II. l } is called the power pair of the term. q=0 Sigmoid Derivatives 849

0.4-

0.2-

-0.2 --

-0.4 -

I I I I I 0 0.2 0.4 0.6 0.8 1

FIGURE 4. The first six derivatives of the sigmoid for w = 1. The domain 0 < y ~ 1 corresponds to -~ _< x < ~. The total number of maxima end minima equals the order of the derivative in each case.

Proof The coefficient in question can be generated as sequence have negative coefficients, while all terms with follows: an even number of 1 's have positive coefficients. Initialize k = 1, l = 1, c = 1 (since cC~I)= 1). For each digit di of St,,~ COROLLARY 6. All terms in ~tn) which have odd k in their power pair will have positive coefficients, while all if dr is 0 terms with even k will have negative coefficients. c=cXk. 1=/+1. LEMMA 7. In the A-generated form of ~ ~, the mth and the ( 2 ~- ~ - m) th terms have coefficients with the else if dr is 1 same absolute value (0 < m < 2 ~-I _ 1 ). They are c = cX -/. called duals of each other. k=k+l. COROLLARY 7. If n is even, the coefficient cm of the mth c in) = c. term and coefficient cu of the Mth term, M = 2 ~-~ - The equations given in the lemma implement this al- m, are related by cm = --CM. If n is odd, they are related gorithm, using the fact that l + k = n + 1 (Corollary by Cm = CM. 2). [] LEMMA 8. Ofall the terms in the A-generated form of COROLLARY 4. All terms of the A-generated form of ~(n), those with generating sequences of the form 010101 ~(~) which have the same power pair { k, l} have coef- •.. or 101010... have the coefficients with the highest ficients with the same sign. absolute value.

COROLLARY 5. In the A-generated form of ~), all Proof This is easily proved by induction. Suppose, terms with an odd number of 1 "s in their generating without loss of generality, that the first (leftmost) ele- 850 A. A. Minai and R. D. Hqlliams

ment of the sequence is 0, giving a coefficient of 1. For quence of A being composed of two operators that in- the next sequence digit, choosing 0 again gives the coef- crement one power while multiplying the coefficient by ficient 1 × 1, while chosing 1 gives the coefficient 1 × the other. -2 (because the power of( 1 - y) is now 2 due to the initial operation by A0). The coefficient with the largest 5. SIMPLER FORMS OF THE DERIVATIVES absolute value that can be produced with a 2-digit se- quence is, therefore, -2, obtained by choosing 1, giving While the A-generated version of ~'{') is quite interesting the sequence 01. At the next step, choosing 0 gives coef- algorithmically, it can be simplified to much more ficient 1 X -2 X 2, while choosing 1 gives the coefficient compact and manageable forms. This is discussed next. 1 x -2 X -2, both of which have the same absolute value. Thus, the largest absolute value coefficient for a LEMMA 9. After simplification, ~(") has n terms, and 3-digit sequence belongs to the term generated by 010 can be written in the form: (and by 011 ). Let the greatest absolute value coefficient for a p-digit sequence be associated with the term ~'(")= ~ (-I ~k-lr('))~--k )'k~l-- V) "+l-k 010101 ... 010. The coefficient in question is given by k=l 1 X -2 X 2 x -3 X 3 x ... X -m X m, where m = = ~ (| ~n-lp(n) .,n+l-l[ 1 __ V)I 12) [0.5p]. The power ofy is m, while that of( 1 - y) is I~1 m + 1 (because there are m O's and m - 1 l's in the sequence). Thus, a choice of 0 as the next digit would This is called the simplified form of or (.I. multiply the coefficient by m, while a choice of 1 will multiply it by -(m + 1 ). Clearly, choosing 1 maximizes Proof Because all terms of ~'(") have generating se- the absolute value, and the highest absolute value coef- quences of n - 1 digits, there are n different 0/1 dis- ficient for a p + 1 digit sequence belongs to the term tributions, ranging from terms with n - 1 l's and no generated by 010101... 0101. At the next step, choos- O's to those with no l's and n - 1 O's. Thus, there will ing 0 will multiply the coefficient by m + 1 (because be n different power pair combinations in ~'t~), and n the power of y is now also m + 1 ), while choosing 1 irreducible terms. The form of eqn (12) follows from will multiply it by -(m + 1). Clearly, choosing 0 is Corollary 2 and the alternating signs from Corollary sufficient to ensure the maximization of absolute value, 6. [] and the p + 2 digit sequence producing this value is The coefficients C~ ") for the first few values ofn are 010101 ... 01010 (and also 010101 ... 01011 ). It is given in Table 1 (also see Theorem 1 ). also obvious that no future gain in coefficient multi- pliers can be realized by choosing the smaller multiplier LEMMA 10. In the simplified version of ~ C'), the coef- at any step. For example, choosing 0 as the p + 1st ficient C~ ~) of the yk( 1 -- y)"+~-k term can be obtained digit multiplies the coefficient by m (instead of-(m by the recursion: + 1 ), and the choices at the p + 2nd step become m if C(n) 0 is chosen and - (m + 2) if 1 is chosen. Because I m ~. =0Vn t'f k< 1 or n<0 X - (m + 2) ] < I - ( m + 1 ) X (m + 1 ) l, this produces C(i ')= I a loss rather than a gain. An identical argument can be (n-l} C~")= kC~"-'~ + (n + 1 - k)Ck_t • (13) made if the first digit chosen is 1. Thus, terms with generating sequences of the form 010101 ... 101010 Proof The only two ways to obtain power pair { k, n •.. have the largest absolute value coefficients among + 1 - k } in the expression for ~'{') are: ( 1 ) Ao operating all terms with generating sequences of the same on a {k, n - k} term of ~'('-~); and (2) A~ operating length. [] on a {k - 1, n + I - k} term of~"c"-t). The former produces coefficient k(- 1 )kC~'-t), and the latter -(n REMARK 3. It might be noted that, for a given n, the + 1 - k)(-1)k-tC~_qJ)= (n + 1 - k)(-1)kC~_q '). absolute values of the coefficients for the 010101 ... Thus, the two add in magnitude to give the coefficient term and the 101010... term are the same. If n is even, (--l)k[kC~ "-I) + (n + 1 -- k)C~_q')], which gives they have opposite signs, and if odd, identical signs. C~") as in (13). [] This is just a special case of the duality results. THEOREM 1. The coefficient C~ ") in the simplified rep- REMARK 4. The proof for Lemma 8 shows that the resentation of ~ {') is equal to the Eulerian number A,.k_ terms with generating sequences 01010... and 10101 (see Graham et al., 1989, for a discussion). ... are not the only ones with the highest coefficients. A necessary condition that a term have the highest ab- Proof The recurrence for generating Eulerian numbers solute vahw coefficient is that its generating sequence is: begin with 10 or 01, and contain no subsequence of Ar,q = (q + 1 )A~-l.q + (r -- q)Ar-t.q-G more than two consecutive O's or 1 's. This is a conse- integerq, r>O. (14) Sigmoid Derivatives 851

TABLE 1 The Coefficients, C~),forthe Eulenan Form ofthe FirstSeven Derivatives Derivative k Number (n) 1 2 3 4 5 6 7 1 1 2 1 1 3 1 4 1 4 1 11 11 1 5 1 26 66 26 1 6 1 57 302 302 57 1 7 1 120 1191 2146 1191 120 Each C}(n) is the coefficient for a term of the form y*(1 - y).+l-k; 1 < k < n, and equals A..~-I, where A.~ are the Eulerian Numbers. The sign is given by (-1) *-1.

with the conditions that AL0 = 0 and Ar, q = 0 if either The following results provide specific information r or q is less than 0. Substituting k - 1 for q, n for r, about ~kz.r(n) • and C~ ") for A,.k-t in eqn (14), eqn (13) is obtained and the result follows. [] LEMMA i 1. If ~ (n) is written in the form of eqn (16), all terms with odd powers have positive coefficients, REMARK 5. Following Lemma 10 and Theorem 1 ~t,) while all terms with even powers have negative coeffi- can be written as: cients.

n ~-c,) = ~ (-I ~k-lAJ,.k-l.vk'l~ _ y),+,-k 15) Proof In generating the terms of ~'(") from those of k=l ~'("-~ ), the yq term of the latter produces two terms in which may be called the Eulerian form of ~ ('). Because ~'<"-t). Denoting this by an operator r, the Eulerian numbers can be easily calculated (or looked up). anj, a (") can be calculated using eqn ( 15 ). F[yq] = qyq-l ),( 1 - y) = qyq - qyq+'. (17) Other forms can also be found for ~'¢"), and these Thus, the yq term in ~'('-~) produces a yq term in ~'¢") may provide further insight. One of these is briefly dis- with the same sign. and a yq+~ term with the opposite cussed next. sign. Because ~-(t) = y _ y2, it is clear that, for all ~'('), Using odd powers ofy will have positive coefficients and even powers will have negative ones. [] ~-(I) = ),( 1 - y) = y - 7 2, (16) Thus, ~-t,,) can be written as: ~'t") may be written in powers of y only (rather than n+l powers of y and 1 - y). This is done as before by de- f~"~ = ~ (--l)k-'K~"~), ~, (18) fining a recurrence that generates ~'t") from ~-c,-t). k=l Clearly, from (12), ~(") has the form: v(n) with,,k =0Vn<0, k< 1, k>n+ I. n+l . (n) ~'(") = Z H~")) 'k. The coefficients ~k for the first few values ofn are k=l given in Table 2.

TABLE 2 The Coefficients, K~,"), for the Stirling Form of the First Six Derivatives

Derivative k Number (n) 1 2 3 4 5 6 7 0 1 1 1 1 2 1 3 2 3 1 7 12 6 4 1 15 50 60 24 5 1 31 180 390 360 120 6 1 63 602 2100 3360 2520 720 Each KJ,") is the coefficient for a term of the form y*; 1 < k < n, and equals (k - 1)!S.+1~, where S.~, are the Stirling Numbers of the Second Kind. The sign is given by (-1) k-l. 852 A. A. Minai and R. D. Williams

THEOREM 2. The coefficient Ktk"~ of y k in ft,) is equal TABLE 3 to ( k - 1 )!Sn+l.k, where Sn. k are the Stirling Numbers The Coefficients, t["~, for the Hyperbolic Tangent Form of the of the Second Kind (see Graham et al., 1989 for a First Six Derivatives description ). Derivative Number Proof Using eqn (17), the recurrence for generating (n) 2 4 6 8 K~,m is: 0 1 1 2 K~"~ = (k - 1 )K~"_~t~ + kK~"-'~. (19) 2 4 2 The proof can be facilitated by considering the triangle 4 8 16 5 16 88 16 of coefficients K generated by eqn ( 19 ), with the rows 6 32 416 272 indexed by n and the columns by k (see Table 2 ). The 7 64 1824 2880 272 assertion of the theorem can be proved by double in- Defining p = 0.5 wx, each tJ,") is the coefficient for a term of the duction on k and n, using the standard recursion for form 2~"*l~sechZ~(p)tanh"*l-Z~(p); 1 < k < [(n + 1)12], and equals Stifling numbers of the Second Kind: S.÷~,k = S..k-~ T.,_I, where T.~ is the number of permutations of the first n integers + kSn.k. with exactly k peaks. The sign is given by (-lY '-k. k = 1" eqn (16) gives KI I) = 1, and from eqn (19),

K', ") = (I - 1)K0c"-') + IKC,"-') = 1 Vn > 0 Many other equivalent forms can, of course, be found for the derivatives using standard combinatorial Because S.+~.t = 1 Vn > 0, the assertion is satisfied for relationships. For example, by writing the sigmoid as k=l. a hyperbolic tangent function: Suppose the assertion holds for column k - 1. Then, proceeding with the induction on n, consider the first 1 e wx 1 non-zero element of column k. Because this corre- I +e -wx 1 +e "x 2 [tanh(wx/2) + 1], (21) sponds to n = k - 1, it is given by and defining p = wx/2, it is possible to write ~-t,) as K(k-i~ t ~v( k-2~ kKtk k-2) k =(k- I )l~,k_ I + 1 ttn+tl/2j = (k - 1. )Kk_~. ..(k-2) since K(kk-2) = 0 ~'~") = 2,+1 ~ t~k"~sech2k(p)tanh"+t-2k(p). (22) k=l = (k - l)(k- 2)!Sk-Lk-t It appears that the coefficients tk(0) are expressible in (by the hypothesis of the induction) terms of a well-known group of numbers: permutations = (k- 1)!Sk-l.k-i, of the first n integers by the number of peaks: t~") (-1 .-k but Sk-~,k-t = Sk.k = 1. Which means that = ) T..k-l, (23) K(k-t) where T,,k are the number of permutations of the first k = (k - I )!Sk, k. n integers with exactly k peaks (see David et al., 1966, Thus, the assertion holds for the first non-zero element pp. 53, 260). The unsigned coefficients tk(") for the first of column k (corresponding to n = k - 1 ). Suppose few values of n are shown in Table 3. The sign is de- the assertion holds for the n - 1 term in column k, termined by (- 1 ),-k. The study of this and other forms i.e., for Ktk"- t). Then, for the n term is left for the future.

K~k"' = (k - 1 )Xtk"_]'' + kX~"-" 6. CONCLUSION = (k - 1 )(k - 2)!S..k-~ + k(k - 1 )!S..k

= (k- l)![S..k_j +kS..k] In this paper, we have derived two very simple and ele- gant formulations for the derivatives of the standard = (k- l)!S.+~.k, sigmoid function used in neural network research. Not which means that the assertion holds for K~,"). Thus, only do these formulations lead to very efficient meth- the double induction is complete, and the assertion is ods for calculating higher derivatives, they also relate proved. [] the sigmoid to well-known combinatorial number families, which can help in the analysis of systems with REMARK 6. Following the above results, ~") may be sigmoidal units. written as:

n+l REFERENCES ~'(") = ~ (--l)k-t(k - l)[Sn+l.kY k (20) k=l David, E N., Kendall, M. G., & Barton, D. E. (1966). Symmetric function and allied tables. Cambridge, UK: Cambridge University which may be called the Stirling form of ~ ~"). Press. Sigmoid Derivatives 853

Graham, R. L., Knuth, D. E., & Patashnik, O. (1989). Concrete Consider a layer of two neurons connected to the same two inputs, mathematics. Reading, MA: Addison-Wesley. xt and x2, through weight vectors w~ and w2, respectively. Let the McCullagh, E, & Nelder, J. A. (1989). Generalized linear models. neuron outputs by y~ and Y2, and define x = [xax2]. We can thus Cambridge, UK: Chapman & Hall. write Minai, A. A. ( 1992 ). The robustness offeed forward neural networks: [Y~Y2] = [a(x; w~)a(x; w2)]. A preliminary investigation. Ph.D. Dissertation. University of Virginia, Charlottesville, VA. Suppose there is a perturbation, +x --- [~XL+X2] in the input. The Stolorz, P., Lapedes, A., & Xia, Y. (1992). Predicting protein sec- change in output is then approximated by ondary structure using neural net and statistical methods. Journal of Molecular Biology, 225, 363-377. +y, ~ ~. (llw, ll II+xllcos ~,)+~'c'~(yl) Werbos, P. ( 1974 ). Beyond regression: New tools for prediction and n-I analysis in the behavioral sciences. Ph.D. Thesis. Harvard Uni- versity, Cambridge, MA. +y2 ~ ~ (llw, llll+xllcos ~2)'~'<~(y2), n-I

APPENDIX where N is the order of the approximation and t~J is the angle between t~x and wi. The approximation holds provided the perturbation lies In this appendix, we briefly describe how our formulation could be within the radius of convergence for the sigmoid. This is O( [[wi []-~) used to estimate the effect of input perturbations on the output of (see Minai, 1992 for details). Thus, the magnitude of the effect at layers of sigmoid neurons. the layer's output is lilY[], where ~y = [~yl~y2].