Exact Calculation of the Product of the Hessian Matrix of Feed-Forward Network Error Functions and a Vector in O(N) Time

Martin Møller
Computer Science Department, Aarhus University, DK-8000 Århus, Denmark

This work was done while the author was visiting the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Abstract

Several methods for training feed-forward neural networks require second order information from the Hessian matrix of the error function. Although it is possible to calculate the Hessian matrix exactly, this is often undesirable because of the computation and memory requirements involved. Some learning techniques do, however, only need the Hessian matrix times a vector. This paper presents a method to calculate the Hessian matrix times a vector in $O(N)$ time, where $N$ is the number of variables in the network. This is of the same order as the calculation of the gradient of the error function. The usefulness of this algorithm is demonstrated by improvements of existing learning techniques.

1 Introduction

The second derivative information of the error function associated with feed-forward neural networks forms an $N \times N$ matrix, which is usually referred to as the Hessian matrix. Second derivative information is needed in several learning algorithms, e.g., in some conjugate gradient algorithms [Møller 93a] and in recent network pruning techniques [MacKay 91], [Hassibi 92].

Several researchers have recently derived formulae for exact calculation of the elements of the Hessian matrix [Buntine 91], [Bishop 92]. In the general case, exact calculation of the Hessian matrix requires $O(N^2)$ time and memory. For that reason it is often not worthwhile to calculate the Hessian matrix explicitly, and approximations are often made, as described in [Buntine 91]. The second order information is, however, not always needed in the form of the full Hessian matrix, which makes it possible to reduce the time and memory requirements needed to obtain it. The scaled conjugate gradient algorithm [Møller 93a] and a training algorithm recently proposed by Le Cun involving estimation of eigenvalues of the Hessian matrix [Le Cun 93] are good examples of this. The second order information needed there is always in the form of the Hessian matrix times a vector. In both methods the product of the Hessian and the vector is usually approximated by a one-sided difference equation (see the sketch below). This is in many cases a good approximation, but it can be numerically unstable even when high precision arithmetic is used.

It is possible to calculate the Hessian matrix times a vector exactly without explicitly having to calculate and store the Hessian matrix itself. Through straightforward analytic evaluation we give explicit formulae for the Hessian matrix times a vector. We prove these formulae and give an algorithm that calculates the product. This algorithm has $O(N)$ time and memory requirements, which is of the same order as the calculation of the gradient of the error function. The algorithm is a generalized version of an algorithm outlined by Yoshida, which was derived by applying an automatic differentiation technique [Yoshida 91]. Automatic differentiation is an indirect method of obtaining derivative information and provides no analytic expressions for the derivatives [Dixon 89]. Yoshida's algorithm is only valid for feed-forward networks with connections between adjacent layers; our algorithm works for feed-forward networks with arbitrary connectivity.
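The one-sided difference approximation referred to above can be made concrete with a short sketch. This is an illustrative example and not part of the paper: the gradient callable, the step size `eps`, and the quadratic test problem are assumptions chosen for the demonstration. The point is that the quality of the approximation hinges on the choice of `eps`, which is what makes it numerically delicate in practice.

```python
import numpy as np

def hessian_vector_fd(grad, w, d, eps=1e-6):
    """One-sided difference approximation of H(w) d.

    grad : callable returning the gradient of the error function
           (an assumed interface, not something defined in the paper).
    The step size eps trades truncation error against round-off error,
    which is the source of the numerical instability mentioned above.
    """
    return (grad(w + eps * d) - grad(w)) / eps

if __name__ == "__main__":
    # Toy check on a quadratic error E(w) = 0.5 w^T A w, whose Hessian is A.
    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 5))
    A = M @ M.T
    grad = lambda w: A @ w
    w, d = rng.standard_normal(5), rng.standard_normal(5)
    print(np.allclose(hessian_vector_fd(grad, w, d), A @ d, atol=1e-4))
```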
The usefulness of the algorithm is demonstrated by discussing possible improvements of existing learning techniques. We here focus on improvements of the scaled conjugate gradient algorithm and on estimation of eigenvalues of the Hessian matrix.

2 Notation

The networks we consider are multilayered feed-forward neural networks with arbitrary connectivity. The network $\mathcal{N}$ consists of nodes $n_m^l$ arranged in layers $l = 0, \ldots, L$. The number of nodes in a layer $l$ is denoted $N_l$. In order to be able to handle the arbitrary connectivity we define for each node $n_m^l$ a set of source nodes and a set of target nodes:

$$S_m^l = \{\, n_s^r \in \mathcal{N} \mid \text{there is a connection from } n_s^r \text{ to } n_m^l;\ r < l;\ 1 \le s \le N_r \,\}$$
$$T_m^l = \{\, n_s^r \in \mathcal{N} \mid \text{there is a connection from } n_m^l \text{ to } n_s^r;\ r > l;\ 1 \le s \le N_r \,\} \qquad (1)$$

The training set associated with network $\mathcal{N}$ is

$$\{\, u_{ps}^0,\ s = 1, \ldots, N_0;\ t_{pj},\ j = 1, \ldots, N_L;\ p = 1, \ldots, P \,\}. \qquad (2)$$

The output from a node $n_m^l$ when a pattern $p$ is propagated through the network is

$$u_{pm}^l = f(v_{pm}^l), \quad \text{where } v_{pm}^l = \sum_{n_s^r \in S_m^l} w_{ms}^{lr} u_{ps}^r + w_m^l, \qquad (3)$$

and $w_{ms}^{lr}$ is the weight from node $n_s^r$ to node $n_m^l$, and $w_m^l$ is the usual bias of node $n_m^l$. $f(v_{pm}^l)$ is an appropriate activation function, e.g., the sigmoid. The net input $v_{pm}^l$ is chosen to be the usual weighted linear summation of inputs; the calculations to be made could, however, easily be extended to other definitions of $v_{pm}^l$. Let an error function $E(w)$ be

$$E(w) = \sum_{p=1}^{P} E_p(u_{p1}^L, \ldots, u_{pN_L}^L;\ t_{p1}, \ldots, t_{pN_L}), \qquad (4)$$

where $w$ is a vector containing all weights and biases in the network, and $E_p$ is some appropriate error measure associated with pattern $p$ from the training set.

Based on the chain rule we define some basic recursive formulae to calculate first derivative information. These formulae are used frequently in the next section. Formulae based on backward propagation are

$$\frac{\partial v_{pi}^h}{\partial v_{pm}^l} = \sum_{n_s^r \in T_m^l} \frac{\partial v_{pi}^h}{\partial v_{ps}^r} \frac{\partial v_{ps}^r}{\partial v_{pm}^l} = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \frac{\partial v_{pi}^h}{\partial v_{ps}^r}\, w_{sm}^{rl} \qquad (5)$$

$$\frac{\partial E_p}{\partial v_{pm}^l} = \sum_{n_s^r \in T_m^l} \frac{\partial E_p}{\partial v_{ps}^r} \frac{\partial v_{ps}^r}{\partial v_{pm}^l} = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \frac{\partial E_p}{\partial v_{ps}^r}\, w_{sm}^{rl} \qquad (6)$$

3 Calculation of the Hessian matrix times a vector

This section presents an exact algorithm to calculate the vector $H_p(w)d$, where $H_p(w)$ is the Hessian matrix of the error measure $E_p$ and $d$ is a vector. The coordinates of $d$ are arranged in the same manner as the coordinates of the weight vector $w$.

$$H_p(w)d = d^T \frac{d}{dw}\!\left(\frac{dE_p}{dw}\right) = d^T \frac{d}{dw} \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \frac{dv_{pj}^L}{dw} \qquad (7)$$
$$= \sum_{j=1}^{N_L} \left[ \frac{\partial^2 E_p}{\partial (v_{pj}^L)^2}\, d^T \frac{dv_{pj}^L}{dw}\, \frac{dv_{pj}^L}{dw} + \frac{\partial E_p}{\partial v_{pj}^L}\, d^T \frac{d^2 v_{pj}^L}{dw^2} \right]$$
$$= \sum_{j=1}^{N_L} \left[ \left( \frac{\partial^2 E_p}{\partial (u_{pj}^L)^2}\, f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L}\, f''(v_{pj}^L) \right) d^T \frac{dv_{pj}^L}{dw}\, \frac{dv_{pj}^L}{dw} + \frac{\partial E_p}{\partial v_{pj}^L}\, d^T \frac{d^2 v_{pj}^L}{dw^2} \right].$$

The first and second terms of equation (7) will from now on be referred to as the A- and B-vector respectively. So we have

$$A = \sum_{j=1}^{N_L} \left( \frac{\partial^2 E_p}{\partial (u_{pj}^L)^2}\, f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L}\, f''(v_{pj}^L) \right) d^T \frac{dv_{pj}^L}{dw}\, \frac{dv_{pj}^L}{dw} \quad \text{and} \quad B = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, d^T \frac{d^2 v_{pj}^L}{dw^2}. \qquad (8)$$

We first concentrate on calculating the A-vector.

Lemma 1 Let $\varphi_{pm}^l$ be defined as $\varphi_{pm}^l = d^T \frac{dv_{pm}^l}{dw}$. $\varphi_{pm}^l$ can be calculated by forward propagation using the recursive formula

$$\varphi_{pm}^l = \sum_{n_s^r \in S_m^l} \left( d_{ms}^{lr} u_{ps}^r + w_{ms}^{lr} f'(v_{ps}^r)\, \varphi_{ps}^r \right) + d_m^l,\ \ l > 0; \qquad \varphi_{pi}^0 = 0,\ \ 1 \le i \le N_0,$$

where $d_{ms}^{lr}$ and $d_m^l$ denote the coordinates of $d$ corresponding to the weight $w_{ms}^{lr}$ and the bias $w_m^l$.

Proof. For input nodes we have $\varphi_{pi}^0 = 0$ as desired. Assume the lemma is true for all nodes in layers $k < l$.

$$\varphi_{pm}^l = d^T \frac{dv_{pm}^l}{dw} = d^T \left( \sum_{n_s^r \in S_m^l} \frac{d}{dw}\!\left( w_{ms}^{lr} u_{ps}^r \right) + \frac{dw_m^l}{dw} \right)$$
$$= \sum_{n_s^r \in S_m^l} \left( w_{ms}^{lr} f'(v_{ps}^r)\, d^T \frac{dv_{ps}^r}{dw} + d_{ms}^{lr} u_{ps}^r \right) + d_m^l = \sum_{n_s^r \in S_m^l} \left( d_{ms}^{lr} u_{ps}^r + w_{ms}^{lr} f'(v_{ps}^r)\, \varphi_{ps}^r \right) + d_m^l \qquad \Box$$
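To make Lemma 1 concrete, the following sketch carries out the forward propagation of equation (3) together with the recursion for $\varphi_{pm}^l$, specialised to a fully connected layered network with sigmoid activations. The function name, the array layout (`weights[k]` holding the weight matrix of layer $k+1$, `dW[k]` the matching block of $d$) and the restriction to adjacent-layer connections are assumptions made for this example; the paper's formulation covers arbitrary feed-forward connectivity.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward_with_phi(x, weights, biases, dW, db):
    """Forward propagation of equation (3) together with Lemma 1.

    u[l]   : node outputs of layer l (u[0] is the input pattern x)
    phi[l] : phi^l = d^T dv^l/dw for every node of layer l

    weights[k], biases[k] hold W^{k+1}, b^{k+1}; dW[k], db[k] hold the
    matching coordinates of the vector d.  Only adjacent-layer
    connections are handled here.
    """
    u = [np.asarray(x, dtype=float)]
    phi = [np.zeros_like(u[0])]          # Lemma 1 base case: phi^0 = 0
    for W, b, DW, Db in zip(weights, biases, dW, db):
        v_l = W @ u[-1] + b              # net input, equation (3)
        # Lemma 1: phi^l = D^l u^{l-1} + W^l (f'(v^{l-1}) * phi^{l-1}) + d_b^l.
        # For the sigmoid, f'(v^{l-1}) = u^{l-1} (1 - u^{l-1}); at the input
        # layer phi^0 = 0, so that term vanishes regardless.
        carried = u[-1] * (1.0 - u[-1]) * phi[-1]
        phi.append(DW @ u[-1] + W @ carried + Db)
        u.append(sigmoid(v_l))
    return u, phi
```

The pass visits each weight once, so the $\varphi$ factors are obtained at the same order of cost as an ordinary forward pass.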
Lemma 2 Assume that the $\varphi_{pm}^l$ factors have been calculated for all nodes in the network. The A-vector can be calculated by backward propagation using the recursive formula

$$A_{mi}^{lh} = \psi_{pm}^l\, u_{pi}^h, \qquad A_m^l = \psi_{pm}^l,$$

where $\psi_{pm}^l$ is

$$\psi_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl}\, \psi_{ps}^r,\ \ l < L;$$
$$\psi_{pj}^L = \left( \frac{\partial^2 E_p}{\partial (u_{pj}^L)^2}\, f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L}\, f''(v_{pj}^L) \right) \varphi_{pj}^L,\ \ 1 \le j \le N_L.$$

Proof.
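As an illustration of the recursion in Lemma 2, the backward half of the A-vector computation can be sketched as follows. Beyond what the lemma itself requires, the sketch assumes a layered sigmoid network and a sum-of-squares error measure, so that $\partial E_p/\partial u_{pj}^L = u_{pj}^L - t_{pj}$ and $\partial^2 E_p/\partial (u_{pj}^L)^2 = 1$; it consumes the lists `u` and `phi` produced by a forward pass such as the one above, and all names are illustrative.

```python
import numpy as np

def a_vector(u, phi, weights, targets):
    """Backward recursion of Lemma 2 for the A-part of H_p(w) d.

    Assumes (beyond the lemma) a layered sigmoid network and the
    sum-of-squares error E_p = 0.5 * ||u^L - t||^2, so that
    dE_p/du^L = u^L - t and d^2E_p/d(u^L)^2 = 1.  Returns per-layer
    weight and bias blocks of the A-vector, ordered like the weights.
    """
    t = np.asarray(targets, dtype=float)
    s = u[-1]                                  # sigmoid outputs u^L
    f1 = s * (1.0 - s)                         # f'(v^L)
    f2 = s * (1.0 - s) * (1.0 - 2.0 * s)       # f''(v^L)
    # Output layer: psi^L = (E'' f'^2 + E' f'') * phi^L   (Lemma 2)
    psi = (f1 ** 2 + (s - t) * f2) * phi[-1]
    A_W, A_b = [], []
    for l in range(len(weights), 0, -1):
        A_W.append(np.outer(psi, u[l - 1]))    # A^{l,l-1}_{mi} = psi^l_m u^{l-1}_i
        A_b.append(psi.copy())                 # bias coordinate A^l_m = psi^l_m
        if l > 1:
            # Hidden layers: psi^{l-1} = f'(v^{l-1}) * (W^l)^T psi^l
            s_prev = u[l - 1]
            psi = s_prev * (1.0 - s_prev) * (weights[l - 1].T @ psi)
    A_W.reverse()
    A_b.reverse()
    return A_W, A_b
```

Combining this backward sweep with the forward pass above yields the A-vector at the cost of one extra forward/backward sweep; the B-vector is treated separately, as the text's "we first concentrate on calculating the A-vector" indicates.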