Exact Calculation of the Product of the Hessian Matrix of Feed-Forward Network Error Functions and a Vector in $O(N)$ Time
Martin Møller
Computer Science Department
Aarhus University
DK-8000 Århus, Denmark

(This work was done while the author was visiting the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.)
Abstract
Several methods for training feed-forward neural networks require second order information from the Hessian matrix of the error function. Although it is possible to calculate the Hessian matrix exactly, it is often not desirable because of the computation and memory requirements involved. Some learning techniques do, however, only need the Hessian matrix times a vector. This paper presents a method to calculate the Hessian matrix times a vector in $O(N)$ time, where $N$ is the number of variables in the network. This is of the same order as the calculation of the gradient of the error function. The usefulness of this algorithm is demonstrated by improvement of existing learning techniques.
1 Introduction
The second derivative information of the error function associated with feed-forward neural networks forms an $N \times N$ matrix, which is usually referred to as the Hessian matrix. Second derivative information is needed in several learning algorithms, e.g., in some conjugate gradient algorithms [Møller 93a] and in recent network pruning techniques [MacKay 91], [Hassibi 92].
Several researchers have recently derived formulae for exact calculation of the elements of the Hessian matrix [Buntine 91], [Bishop 92]. In the general case, exact calculation of the Hessian matrix has $O(N^2)$ time and memory requirements. For that reason it is often not worthwhile to calculate the Hessian matrix explicitly, and approximations are often made, as described in [Buntine 91]. The second order information is, however, not always needed in the form of the Hessian matrix itself, which makes it possible to reduce the time and memory requirements needed to obtain this information. The scaled conjugate gradient algorithm [Møller 93a] and a training algorithm recently proposed by Le Cun involving estimation of eigenvalues of the Hessian matrix [Le Cun 93] are good examples of this. The second order information needed in these methods always takes the form of the Hessian matrix times a vector. In both methods the product of the Hessian and the vector is usually approximated by a one-sided difference equation. This is in many cases a good approximation but can, however, be numerically unstable even when high-precision arithmetic is used.
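For concreteness, the one-sided difference approximation in question has the following form; a minimal sketch, assuming only a user-supplied gradient routine `grad` (a hypothetical name):

```python
import numpy as np

def hessian_vector_fd(grad, w, d, eps=1e-5):
    """One-sided difference approximation of the product H(w)d:
    H(w)d ~ (E'(w + eps*d) - E'(w)) / eps.
    The truncation error is O(eps), but for small eps the
    subtraction of two nearly equal gradients cancels significant
    digits, which is the numerical instability discussed above."""
    return (grad(w + eps * d) - grad(w)) / eps
```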
It is possible to calculate the Hessian matrix times a vector exactly without explicitly having to calculate and store the Hessian matrix itself. Through straightforward analytic evaluations we give explicit formulae for the Hessian matrix times a vector. We prove these formulae and give an algorithm that calculates the product. This algorithm has $O(N)$ time and memory requirements, which is of the same order as the calculation of the gradient of the error function. The algorithm is a generalized version of an algorithm outlined by Yoshida, which was derived by applying an automatic differentiation technique [Yoshida 91]. The automatic differentiation technique is an indirect method of obtaining derivative information and provides no analytic expressions for the derivatives [Dixon 89]. Yoshida's algorithm is only valid for feed-forward networks with connections between adjacent layers; our algorithm works for feed-forward networks with arbitrary connectivity.

The usefulness of the algorithm is demonstrated by discussing possible improvements of existing learning techniques. We focus here on improvements of the scaled conjugate gradient algorithm and on estimation of eigenvalues of the Hessian matrix.
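As a preview of the latter, any routine returning $H(w)d$ can drive a plain power iteration to estimate the dominant eigenvalue of the Hessian, since only matrix-vector products are required. A sketch with a hypothetical `hess_times_vec` routine (this is generic power iteration, not Le Cun's exact procedure):

```python
import numpy as np

def largest_eigenvalue(hess_times_vec, w, n, iters=100):
    """Power iteration on the Hessian using only H(w)d products:
    returns an estimate of the eigenvalue of largest magnitude."""
    d = np.random.default_rng(0).standard_normal(n)
    d /= np.linalg.norm(d)
    for _ in range(iters):
        hd = hess_times_vec(w, d)
        d = hd / np.linalg.norm(hd)
    return d @ hess_times_vec(w, d)     # Rayleigh quotient of unit d
```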
2 Notation
The networks we consider are multilayered feed-forward neural networks with arbitrary connectivity. The network consists of nodes $n_m^l$ arranged in layers $l = 0,\ldots,L$. The number of nodes in layer $l$ is denoted $N_l$. In order to be able to handle the arbitrary connectivity we define for each node $n_m^l$ a set of source nodes and a set of target nodes:

$$S_m^l = \{\, n_s^r \mid \text{there is a connection from } n_s^r \text{ to } n_m^l,\ r < l,\ 1 \le s \le N_r \,\}, \qquad T_m^l = \{\, n_s^r \mid \text{there is a connection from } n_m^l \text{ to } n_s^r,\ r > l,\ 1 \le s \le N_r \,\}. \tag{1}$$

The training set associated with the network is

$$\{\, u_{ps}^0,\ s = 1,\ldots,N_0;\ t_{pj},\ j = 1,\ldots,N_L;\ p = 1,\ldots,P \,\}. \tag{2}$$

The output from a node $n_m^l$ when a pattern $p$ is propagated through the network is

$$u_{pm}^l = f(v_{pm}^l), \quad \text{where } v_{pm}^l = \sum_{n_s^r \in S_m^l} w_{ms}^{lr} u_{ps}^r + w_m^l, \tag{3}$$

and $w_{ms}^{lr}$ is the weight from node $n_s^r$ to node $n_m^l$. $w_m^l$ is the usual bias of node $n_m^l$. $f(v_{pm}^l)$ is an appropriate activation function, e.g., the sigmoid. The net input $v_{pm}^l$ is chosen to be the usual weighted linear summation of inputs. The calculations to be made could, however, easily be extended to other definitions of $v_{pm}^l$. Let an error function $E(w)$ be

$$E(w) = \sum_{p=1}^{P} E_p(u_{p1}^L,\ldots,u_{pN_L}^L;\ t_{p1},\ldots,t_{pN_L}), \tag{4}$$

where $w$ is a vector containing all weights and biases in the network, and $E_p$ is some appropriate error measure associated with pattern $p$ from the training set.

Based on the chain rule we define some basic recursive formulae to calculate first derivative information. These formulae are used frequently in the next section. Formulae based on backward propagation are

$$\frac{\partial v_{pi}^h}{\partial v_{pm}^l} = \sum_{n_s^r \in T_m^l} \frac{\partial v_{pi}^h}{\partial v_{ps}^r} \frac{\partial v_{ps}^r}{\partial v_{pm}^l} = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \frac{\partial v_{pi}^h}{\partial v_{ps}^r}\, w_{sm}^{rl} \tag{5}$$

$$\frac{\partial E_p}{\partial v_{pm}^l} = \sum_{n_s^r \in T_m^l} \frac{\partial E_p}{\partial v_{ps}^r} \frac{\partial v_{ps}^r}{\partial v_{pm}^l} = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \frac{\partial E_p}{\partial v_{ps}^r}\, w_{sm}^{rl} \tag{6}$$

3 Calculation of the Hessian matrix times a vector

This section presents an exact algorithm to calculate the vector $H_p(w)d$, where $H_p(w)$ is the Hessian matrix of the error measure $E_p$ and $d$ is a vector. The coordinates of $d$ are arranged in the same manner as the coordinates of the weight vector $w$.

$$H_p(w)d = d^T \frac{d}{dw}\left(\frac{dE_p}{dw}\right) = d^T \sum_{j=1}^{N_L} \frac{d}{dw}\left(\frac{\partial E_p}{\partial v_{pj}^L} \frac{dv_{pj}^L}{dw}\right) \tag{7}$$
$$= \sum_{j=1}^{N_L} \left[ \frac{\partial^2 E_p}{\partial (v_{pj}^L)^2} \left(d^T \frac{dv_{pj}^L}{dw}\right) \frac{dv_{pj}^L}{dw} + \frac{\partial E_p}{\partial v_{pj}^L}\, d^T \frac{d^2 v_{pj}^L}{dw^2} \right]$$
$$= \sum_{j=1}^{N_L} \left[ \left( \frac{\partial^2 E_p}{\partial (u_{pj}^L)^2} f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L} f''(v_{pj}^L) \right) \left(d^T \frac{dv_{pj}^L}{dw}\right) \frac{dv_{pj}^L}{dw} + \frac{\partial E_p}{\partial v_{pj}^L}\, d^T \frac{d^2 v_{pj}^L}{dw^2} \right]$$

The first and second terms of equation (7) will from now on be referred to as the A- and B-vector respectively. So we have

$$A = \sum_{j=1}^{N_L} \left( \frac{\partial^2 E_p}{\partial (u_{pj}^L)^2} f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L} f''(v_{pj}^L) \right) \left(d^T \frac{dv_{pj}^L}{dw}\right) \frac{dv_{pj}^L}{dw} \quad \text{and} \quad B = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, d^T \frac{d^2 v_{pj}^L}{dw^2}. \tag{8}$$

We first concentrate on calculating the A-vector.

Lemma 1. Let $\varphi_{pm}^l$ be defined as $\varphi_{pm}^l = d^T \frac{dv_{pm}^l}{dw}$. $\varphi_{pm}^l$ can be calculated by forward propagation using the recursive formula

$$\varphi_{pm}^l = \sum_{n_s^r \in S_m^l} \left( d_{ms}^{lr} u_{ps}^r + w_{ms}^{lr} f'(v_{ps}^r)\, \varphi_{ps}^r \right) + d_m^l,\ l > 0; \qquad \varphi_{pi}^0 = 0,\ 1 \le i \le N_0,$$

where $d_{ms}^{lr}$ and $d_m^l$ are the coordinates of $d$ corresponding to $w_{ms}^{lr}$ and $w_m^l$.

Proof. For input nodes we have $\varphi_{pi}^0 = 0$ as desired. Assume the lemma is true for all nodes in layers $k < l$. Then

$$\varphi_{pm}^l = d^T \frac{dv_{pm}^l}{dw} = d^T \frac{d}{dw}\left( \sum_{n_s^r \in S_m^l} w_{ms}^{lr} u_{ps}^r + w_m^l \right) = \sum_{n_s^r \in S_m^l} \left( w_{ms}^{lr} f'(v_{ps}^r)\, d^T \frac{dv_{ps}^r}{dw} + d_{ms}^{lr} u_{ps}^r \right) + d_m^l = \sum_{n_s^r \in S_m^l} \left( d_{ms}^{lr} u_{ps}^r + w_{ms}^{lr} f'(v_{ps}^r)\, \varphi_{ps}^r \right) + d_m^l \quad \Box$$
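As an illustration of Lemma 1, the quantities $v$, $u$, and $\varphi$ might be computed per layer as follows; a minimal sketch assuming a plainly layered network (connections between adjacent layers only), so the sums over $S_m^l$ become matrix products, with hypothetical arguments `W`, `b` (weights and biases) and `dW`, `db` (the matching coordinates of $d$):

```python
import numpy as np

def forward_with_phi(W, b, dW, db, u0, f, fprime):
    """Forward pass computing, for each layer l, the net input v^l
    (equation (3)), the output u^l = f(v^l), and the quantities
    phi^l = d^T dv^l/dw of Lemma 1."""
    us, vs, phis = [u0], [], []
    fphi = np.zeros_like(u0)      # f'(v^0) * phi^0; phi^0 = 0 for inputs
    for Wl, bl, dWl, dbl in zip(W, b, dW, db):
        v = Wl @ us[-1] + bl                      # equation (3)
        phi = dWl @ us[-1] + Wl @ fphi + dbl      # Lemma 1 recursion
        us.append(f(v)); vs.append(v); phis.append(phi)
        fphi = fprime(v) * phi                    # feeds the next layer
    return us, vs, phis
```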
Lemma 2. Assume that the $\varphi_{pm}^l$ factors have been calculated for all nodes in the network. The A-vector can be calculated by backward propagation using the recursive formula

$$A_{mi}^{lh} = \psi_{pm}^l u_{pi}^h, \qquad A_m^l = \psi_{pm}^l,$$

where $A_{mi}^{lh}$ and $A_m^l$ denote the coordinates of the A-vector corresponding to $w_{mi}^{lh}$ and $w_m^l$, and $\psi_{pm}^l$ is

$$\psi_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \psi_{ps}^r w_{sm}^{rl},\ l < L; \qquad \psi_{pj}^L = \left( \frac{\partial^2 E_p}{\partial (u_{pj}^L)^2} f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L} f''(v_{pj}^L) \right) \varphi_{pj}^L,\ 1 \le j \le N_L.$$

Proof.

$$A_{mi}^{lh} = \sum_{j=1}^{N_L} \psi_{pj}^L \frac{\partial v_{pj}^L}{\partial w_{mi}^{lh}} = \sum_{j=1}^{N_L} \psi_{pj}^L \frac{\partial v_{pj}^L}{\partial v_{pm}^l} \frac{\partial v_{pm}^l}{\partial w_{mi}^{lh}} = \left( \sum_{j=1}^{N_L} \psi_{pj}^L \frac{\partial v_{pj}^L}{\partial v_{pm}^l} \right) u_{pi}^h$$

For the output layer we have $A_{ji}^{Lh} = \psi_{pj}^L u_{pi}^h$ as desired. Assume that the lemma is true for all nodes in layers $k > l$. Then, using equation (5),

$$\psi_{pm}^l = \sum_{j=1}^{N_L} \psi_{pj}^L \frac{\partial v_{pj}^L}{\partial v_{pm}^l} = \sum_{j=1}^{N_L} \psi_{pj}^L f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \frac{\partial v_{pj}^L}{\partial v_{ps}^r} w_{sm}^{rl} = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \left( \sum_{j=1}^{N_L} \psi_{pj}^L \frac{\partial v_{pj}^L}{\partial v_{ps}^r} \right) w_{sm}^{rl} = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \psi_{ps}^r w_{sm}^{rl} \quad \Box$$

The calculation of the B-vector is a bit more involved but is basically constructed in the same manner.

Lemma 3. Assume that the $\varphi_{pm}^l$ factors have been calculated for all nodes in the network. The B-vector can be calculated by backward propagation using the recursive formula

$$B_{mi}^{lh} = \delta_{pm}^l f'(v_{pi}^h)\, \varphi_{pi}^h + \beta_{pm}^l u_{pi}^h, \qquad B_m^l = \beta_{pm}^l,$$

where $\delta_{pm}^l$ and $\beta_{pm}^l$ are

$$\delta_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl} \delta_{ps}^r,\ l < L; \qquad \delta_{pj}^L = \frac{\partial E_p}{\partial v_{pj}^L},\ 1 \le j \le N_L;$$
$$\beta_{pm}^l = \sum_{n_s^r \in T_m^l} \left[ f'(v_{pm}^l) w_{sm}^{rl} \beta_{ps}^r + \left( d_{sm}^{rl} f'(v_{pm}^l) + w_{sm}^{rl} f''(v_{pm}^l)\, \varphi_{pm}^l \right) \delta_{ps}^r \right],\ l < L; \qquad \beta_{pj}^L = 0,\ 1 \le j \le N_L.$$

Proof. Observe that the B-vector can be written in the form

$$B = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, d^T \frac{d^2 v_{pj}^L}{dw^2} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \frac{d\varphi_{pj}^L}{dw}.$$

Using the chain rule we can derive analytic expressions for $\delta_{pm}^l$ and $\beta_{pm}^l$:

$$B_{mi}^{lh} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \frac{\partial \varphi_{pj}^L}{\partial w_{mi}^{lh}} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \left[ \frac{\partial \varphi_{pj}^L}{\partial \varphi_{pm}^l} \frac{\partial \varphi_{pm}^l}{\partial w_{mi}^{lh}} + \frac{\partial \varphi_{pj}^L}{\partial v_{pm}^l} \frac{\partial v_{pm}^l}{\partial w_{mi}^{lh}} \right] = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \left[ \frac{\partial \varphi_{pj}^L}{\partial \varphi_{pm}^l} f'(v_{pi}^h)\, \varphi_{pi}^h + \frac{\partial \varphi_{pj}^L}{\partial v_{pm}^l} u_{pi}^h \right]$$

So if the lemma is true, $\delta_{pm}^l$ and $\beta_{pm}^l$ are given by

$$\delta_{pm}^l = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \frac{\partial \varphi_{pj}^L}{\partial \varphi_{pm}^l}, \qquad \beta_{pm}^l = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \frac{\partial \varphi_{pj}^L}{\partial v_{pm}^l}.$$

The rest of the proof is done in two steps. We look at the parts concerned with the $\delta_{pm}^l$ and $\beta_{pm}^l$ factors separately. For all output nodes we have $\delta_{pj}^L = \frac{\partial E_p}{\partial v_{pj}^L}$ as desired. For non-output nodes we have

$$\delta_{pm}^l = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \sum_{n_s^r \in T_m^l} \frac{\partial \varphi_{pj}^L}{\partial \varphi_{ps}^r} \frac{\partial \varphi_{ps}^r}{\partial \varphi_{pm}^l} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \sum_{n_s^r \in T_m^l} \frac{\partial \varphi_{pj}^L}{\partial \varphi_{ps}^r} f'(v_{pm}^l) w_{sm}^{rl} = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl} \delta_{ps}^r.$$

Similarly, $\beta_{pj}^L = 0$ for all output nodes as desired. For non-output nodes we have

$$\beta_{pm}^l = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \sum_{n_s^r \in T_m^l} \left[ \frac{\partial \varphi_{pj}^L}{\partial v_{ps}^r} \frac{\partial v_{ps}^r}{\partial v_{pm}^l} + \frac{\partial \varphi_{pj}^L}{\partial \varphi_{ps}^r} \frac{\partial \varphi_{ps}^r}{\partial v_{pm}^l} \right]$$
$$= \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \sum_{n_s^r \in T_m^l} \left[ \frac{\partial \varphi_{pj}^L}{\partial v_{ps}^r} f'(v_{pm}^l) w_{sm}^{rl} + \frac{\partial \varphi_{pj}^L}{\partial \varphi_{ps}^r} \left( d_{sm}^{rl} f'(v_{pm}^l) + w_{sm}^{rl} f''(v_{pm}^l)\, \varphi_{pm}^l \right) \right]$$
$$= \sum_{n_s^r \in T_m^l} \left[ f'(v_{pm}^l) w_{sm}^{rl} \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \frac{\partial \varphi_{pj}^L}{\partial v_{ps}^r} + \left( d_{sm}^{rl} f'(v_{pm}^l) + w_{sm}^{rl} f''(v_{pm}^l)\, \varphi_{pm}^l \right) \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \frac{\partial \varphi_{pj}^L}{\partial \varphi_{ps}^r} \right]$$
$$= \sum_{n_s^r \in T_m^l} \left[ f'(v_{pm}^l) w_{sm}^{rl} \beta_{ps}^r + \left( d_{sm}^{rl} f'(v_{pm}^l) + w_{sm}^{rl} f''(v_{pm}^l)\, \varphi_{pm}^l \right) \delta_{ps}^r \right]$$

The proof of the formula for $B_m^l$ follows easily from the above derivations and is left to the reader. $\Box$

We are now ready to give an explicit formula for the calculation of the Hessian matrix times a vector. Let $Hd$ be the vector $H_p(w)d$.

Corollary 1. Assume that the $\varphi_{pm}^l$ factors have been calculated for all nodes in the network. The vector $Hd$ can be calculated by backward propagation using the following recursive formula

$$(Hd)_{mi}^{lh} = \delta_{pm}^l f'(v_{pi}^h)\, \varphi_{pi}^h + \left( \psi_{pm}^l + \beta_{pm}^l \right) u_{pi}^h, \qquad (Hd)_m^l = \psi_{pm}^l + \beta_{pm}^l,$$

where $\psi_{pm}^l$, $\delta_{pm}^l$, and $\beta_{pm}^l$ are given as shown in Lemma 2 and Lemma 3.

Proof. By combination of Lemma 2 and Lemma 3. $\Box$
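Continuing the layered-network sketch from the forward pass above, the backward recursions of Lemmas 2 and 3, combined as in Corollary 1, might look as follows; `delta_L` and `psi_L` are the output-layer seeds $\delta^L$ and $\psi^L$, which depend on the chosen error measure $E_p$ (all names hypothetical):

```python
import numpy as np

def backward_hd(W, dW, us, vs, phis, delta_L, psi_L, fprime, fsecond):
    """Backward pass of Corollary 1 for one pattern: returns the
    weight and bias blocks of H_p(w)d.  beta^L = 0 at the output."""
    L = len(W)
    HdW, Hdb = [None] * L, [None] * L
    delta, psi, beta = delta_L, psi_L, np.zeros_like(psi_L)
    # f'(v^h) * phi^h for the source layer of every weight block;
    # zero for the input layer, where phi^0 = 0.
    fphi = [np.zeros_like(us[0])] + [fprime(v) * p
                                     for v, p in zip(vs[:-1], phis[:-1])]
    for l in range(L - 1, -1, -1):
        # Corollary 1: (Hd)^{lh} = delta f'(v^h) phi^h + (psi + beta) u^h
        HdW[l] = np.outer(delta, fphi[l]) + np.outer(psi + beta, us[l])
        Hdb[l] = psi + beta
        if l > 0:                                  # Lemmas 2 and 3
            wd = W[l].T @ delta
            beta = fprime(vs[l-1]) * (W[l].T @ beta + dW[l].T @ delta) \
                 + fsecond(vs[l-1]) * phis[l-1] * wd
            psi = fprime(vs[l-1]) * (W[l].T @ psi)
            delta = fprime(vs[l-1]) * wd
    return HdW, Hdb
```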
If we view first derivatives like $\frac{\partial E_p}{\partial u_{pm}^l}$ and $\frac{\partial E_p}{\partial v_{pm}^l}$ as already available information, then the formula for $Hd$ can be reformulated into a formula based on only one recursive parameter. First we observe that $\delta_{pm}^l$ and $\beta_{pm}^l$ can be written in the form

$$\delta_{pm}^l = \frac{\partial E_p}{\partial v_{pm}^l}, \qquad \beta_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \left( w_{sm}^{rl} \beta_{ps}^r + d_{sm}^{rl} \frac{\partial E_p}{\partial v_{ps}^r} \right) + f''(v_{pm}^l)\, \varphi_{pm}^l\, \frac{\partial E_p}{\partial u_{pm}^l}. \tag{9}$$

Corollary 2. Assume that the $\varphi_{pm}^l$ factors have been calculated for all nodes in the network. The vector $Hd$ can be calculated by backward propagation using the following recursive formula

$$(Hd)_{mi}^{lh} = \frac{\partial E_p}{\partial v_{pm}^l} f'(v_{pi}^h)\, \varphi_{pi}^h + \gamma_{pm}^l u_{pi}^h, \qquad (Hd)_m^l = \gamma_{pm}^l,$$

where $\gamma_{pm}^l$ is

$$\gamma_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \left( w_{sm}^{rl} \gamma_{ps}^r + d_{sm}^{rl} \frac{\partial E_p}{\partial v_{ps}^r} \right) + f''(v_{pm}^l)\, \varphi_{pm}^l\, \frac{\partial E_p}{\partial u_{pm}^l},\ l < L;$$
$$\gamma_{pj}^L = \left( \frac{\partial^2 E_p}{\partial (u_{pj}^L)^2} f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L} f''(v_{pj}^L) \right) \varphi_{pj}^L,\ 1 \le j \le N_L.$$

Proof. By Corollary 1 and equation (9). $\Box$

The formula in Corollary 2 is a generalized version of the one that Yoshida derived for feed-forward networks with only connections between adjacent layers. An algorithm that calculates $\sum_{p=1}^{P} H_p(w)d$ based on Corollary 1 is given below. The algorithm also calculates the gradient vector $G = \sum_{p=1}^{P} \frac{dE_p}{dw}$.

1. Initialize: $Hd = 0$, $G = 0$. Repeat the following steps for $p = 1,\ldots,P$.

2. Forward propagation. For nodes $i = 1$ to $N_0$ do: $\varphi_{pi}^0 = 0$. For layers $l = 1$ to $L$ and nodes $m = 1$ to $N_l$ do:
$$v_{pm}^l = \sum_{n_s^r \in S_m^l} w_{ms}^{lr} u_{ps}^r + w_m^l; \qquad u_{pm}^l = f(v_{pm}^l);$$
$$\varphi_{pm}^l = \sum_{n_s^r \in S_m^l} \left( d_{ms}^{lr} u_{ps}^r + w_{ms}^{lr} f'(v_{ps}^r)\, \varphi_{ps}^r \right) + d_m^l.$$

3. Output layer. For nodes $j = 1$ to $N_L$ do:
$$\delta_{pj}^L = \frac{\partial E_p}{\partial v_{pj}^L}; \qquad \beta_{pj}^L = 0; \qquad \psi_{pj}^L = \left( \frac{\partial^2 E_p}{\partial (u_{pj}^L)^2} f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L} f''(v_{pj}^L) \right) \varphi_{pj}^L.$$
For all nodes $n_s^r \in S_j^L$ do:
$$(Hd)_{js}^{Lr} = (Hd)_{js}^{Lr} + \delta_{pj}^L f'(v_{ps}^r)\, \varphi_{ps}^r + \psi_{pj}^L u_{ps}^r; \qquad (Hd)_j^L = (Hd)_j^L + \psi_{pj}^L;$$
$$G_{js}^{Lr} = G_{js}^{Lr} + \delta_{pj}^L u_{ps}^r; \qquad G_j^L = G_j^L + \delta_{pj}^L.$$

4. Backward propagation. For layers $l = L-1$ down to $1$ and nodes $m = 1$ to $N_l$ do:
$$\delta_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl} \delta_{ps}^r; \qquad \psi_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl} \psi_{ps}^r;$$
$$\beta_{pm}^l = \sum_{n_s^r \in T_m^l} \left[ f'(v_{pm}^l) w_{sm}^{rl} \beta_{ps}^r + \left( d_{sm}^{rl} f'(v_{pm}^l) + w_{sm}^{rl} f''(v_{pm}^l)\, \varphi_{pm}^l \right) \delta_{ps}^r \right].$$
For all nodes $n_s^r \in S_m^l$ do:
$$(Hd)_{ms}^{lr} = (Hd)_{ms}^{lr} + \delta_{pm}^l f'(v_{ps}^r)\, \varphi_{ps}^r + \left( \psi_{pm}^l + \beta_{pm}^l \right) u_{ps}^r; \qquad (Hd)_m^l = (Hd)_m^l + \psi_{pm}^l + \beta_{pm}^l;$$
$$G_{ms}^{lr} = G_{ms}^{lr} + \delta_{pm}^l u_{ps}^r; \qquad G_m^l = G_m^l + \delta_{pm}^l.$$

Clearly this algorithm has $O(N)$ time and memory requirements. More precisely, the time complexity is about 2.5 times the time complexity of a gradient calculation alone.

4 Improvement of existing learning techniques

In this section we justify the importance of the exact calculation of the Hessian times a vector by showing some possible improvements of two different learning algorithms.

4.1 The scaled conjugate gradient algorithm

The scaled conjugate gradient algorithm is a variation of a standard conjugate gradient algorithm. The conjugate gradient algorithms produce non-interfering directions of search if the error function is assumed to be quadratic. Minimization in one direction $d_t$ followed by minimization in another direction $d_{t+1}$ implies that the quadratic approximation to the error has been minimized over the whole subspace spanned by $d_t$ and $d_{t+1}$. The search directions are given by

$$d_{t+1} = -E'(w_{t+1}) + \beta_t d_t, \tag{10}$$

where $w_t$ is a vector containing all weight values at time step $t$ and $\beta_t$ is

$$\beta_t = \frac{|E'(w_{t+1})|^2 - E'(w_{t+1})^T E'(w_t)}{|E'(w_t)|^2}. \tag{11}$$

In the standard conjugate gradient algorithms the step size $\alpha_t$ is found by a line search, which can be very time consuming because it involves several calculations of the error and/or the first derivative. In the scaled conjugate gradient algorithm the step size is estimated by a scaling mechanism, thus avoiding the time-consuming line search. The step size is given by

$$\alpha_t = \frac{-d_t^T E'(w_t)}{d_t^T s_t + \lambda_t |d_t|^2}, \tag{12}$$

where $s_t = E''(w_t) d_t$ and $\lambda_t$ is the scaling parameter.
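With the exact product available, the step size above no longer needs the one-sided difference estimate of $s_t$. A sketch of the computation, assuming hypothetical routines `grad(w)` for $E'(w)$ and `hess_times_vec(w, d)` implementing the algorithm of section 3:

```python
def scg_step_size(grad, hess_times_vec, w, d, lam):
    """Scaled conjugate gradient step size alpha_t (equation (12)),
    with the exact Hessian-vector product s_t = E''(w_t)d_t in place
    of the usual one-sided difference estimate; lam is the scaling
    parameter lambda_t that keeps the denominator positive."""
    s = hess_times_vec(w, d)            # exact H(w)d, computed in O(N)
    return -(d @ grad(w)) / (d @ s + lam * (d @ d))
```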