
Exact Calculation of the Product of the Hessian of Feed-Forward Network Error Functions and a Vector in O(N) Time

Martin Møller
Computer Science Department
Aarhus University
DK-8000 Århus, Denmark

This work was done while the author was visiting the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Abstract

Several methods for training feed-forward neural networks require second order information from the Hessian matrix of the error function. Although it is possible to calculate the Hessian matrix exactly, it is often not desirable because of the computation and memory requirements involved. Some learning techniques do, however, only need the Hessian matrix times a vector. This paper presents a method to calculate the Hessian matrix times a vector in O(N) time, where N is the number of variables in the network. This is of the same order as the calculation of the gradient of the error function. The usefulness of this algorithm is demonstrated by improvement of existing learning techniques.

1 Introduction

The second derivative information of the error function associated with feed-forward neural networks forms an $N \times N$ matrix, which is usually referred to as the Hessian matrix. Second derivative information is needed in several learning algorithms, e.g., in some conjugate gradient algorithms [Møller 93a], and in recent network pruning techniques [MacKay 91], [Hassibi 92]. Several researchers have recently derived formulae for exact calculation of the elements in the Hessian matrix [Buntine 91], [Bishop 92]. In the general case exact calculation of the Hessian matrix has $O(N^2)$ time- and memory requirements. For that reason it is often not worthwhile to calculate the Hessian matrix explicitly, and approximations are often made as described in [Buntine 91]. The second order information is, however, not always needed in the form of the Hessian matrix itself. This makes it possible to reduce the time- and memory requirements needed to obtain this information. The scaled conjugate gradient algorithm [Møller 93a] and a training algorithm recently proposed by Le Cun involving estimation of eigenvalues of the Hessian matrix [Le Cun 93] are good examples of this. The second order information needed here is always in the form of the Hessian matrix times a vector. In both methods the product of the Hessian and the vector is usually approximated by a one-sided difference equation. This is in many cases a good approximation but can, however, be numerically unstable even when high precision arithmetic is used.

It is possible to calculate the Hessian matrix times a vector exactly without explicitly having to calculate and store the Hessian matrix itself. Through straightforward analytic evaluations we give explicit formulae for the Hessian matrix times a vector. We prove these formulae and give an algorithm that calculates the product. This algorithm has $O(N)$ time- and memory requirements, which is of the same order as the calculation of the gradient of the error function. The algorithm is a generalized version of an algorithm outlined by Yoshida, which was derived by applying an automatic differentiation technique [Yoshida 91]. The automatic differentiation technique is an indirect method of obtaining derivative information and provides no analytic expressions of the derivatives [Dixon 89]. Yoshida's algorithm is only valid for feed-forward networks with connections between adjacent layers. Our algorithm works for feed-forward networks with arbitrary connectivity.

The usefulness of the algorithm is demonstrated by discussing possible improvements of existing learning techniques. We here focus on improvements of the scaled conjugate gradient algorithm and on estimation of eigenvalues of the Hessian matrix.

2 Notation

The networks we consider are multilayered feed-forward neural networks with arbitrary connectivity. The network $\mathcal{N}$ consists of nodes $n_m^l$ arranged in layers $l = 0,\ldots,L$. The number of nodes in layer $l$ is denoted $N_l$. In order to be able to handle the arbitrary connectivity we define for each node $n_m^l$ a set of source nodes and a set of target nodes:
$$ S_m^l = \{\, n_s^r \in \mathcal{N} \mid \text{there is a connection from } n_s^r \text{ to } n_m^l,\ r < l,\ 1 \le s \le N_r \,\}, \quad (1) $$
$$ T_m^l = \{\, n_s^r \in \mathcal{N} \mid \text{there is a connection from } n_m^l \text{ to } n_s^r,\ r > l,\ 1 \le s \le N_r \,\}. $$

The training set associated with the network $\mathcal{N}$ is
$$ \{\, (u_{ps}^0,\ s = 1,\ldots,N_0;\ t_{pj},\ j = 1,\ldots,N_L),\ p = 1,\ldots,P \,\}. \quad (2) $$

The output from a node $n_m^l$ when a pattern $p$ is propagated through the network is
$$ u_{pm}^l = f(v_{pm}^l), \quad \text{where} \quad v_{pm}^l = \sum_{n_s^r \in S_m^l} w_{ms}^{lr} u_{ps}^r + w_m^l, \quad (3) $$
and $w_{ms}^{lr}$ is the weight from node $n_s^r$ to node $n_m^l$. $w_m^l$ is the usual bias of node $n_m^l$. $f(v_{pm}^l)$ is an appropriate activation function, e.g., the sigmoid. The net input $v_{pm}^l$ is chosen to be the usual weighted linear summation of inputs. The calculations to be made could, however, easily be extended to other definitions of $v_{pm}^l$. Let an error function $E(w)$ be
$$ E(w) = \sum_{p=1}^{P} E_p(u_{p1}^L,\ldots,u_{pN_L}^L;\ t_{p1},\ldots,t_{pN_L}), \quad (4) $$
where $w$ is a vector containing all weights and biases in the network, and $E_p$ is some appropriate error measure associated with pattern $p$ from the training set.

Based on this notation we define some basic recursive formulae to calculate first derivative information. These formulae are used frequently in the next section. Formulae based on backward propagation are
$$ \frac{\partial v_{pi}^h}{\partial v_{pm}^l} = \sum_{n_s^r \in T_m^l} \frac{\partial v_{pi}^h}{\partial v_{ps}^r}\, \frac{\partial v_{ps}^r}{\partial v_{pm}^l} = \sum_{n_s^r \in T_m^l} \frac{\partial v_{pi}^h}{\partial v_{ps}^r}\, f'(v_{pm}^l)\, w_{sm}^{rl}, \quad (5) $$
$$ \frac{\partial E_p}{\partial v_{pm}^l} = \sum_{n_s^r \in T_m^l} \frac{\partial E_p}{\partial v_{ps}^r}\, \frac{\partial v_{ps}^r}{\partial v_{pm}^l} = \sum_{n_s^r \in T_m^l} \frac{\partial E_p}{\partial v_{ps}^r}\, f'(v_{pm}^l)\, w_{sm}^{rl}. \quad (6) $$

3 Calculation of the Hessian matrix times a vector

This section presents an exact algorithm to calculate the vector $H_p(w)d$, where $H_p(w)$ is the Hessian matrix of the error measure $E_p$, and $d$ is a vector. The coordinates in $d$ are arranged in the same manner as the coordinates in the weight vector $w$.

$$ H_p(w)d = \frac{d}{dw}\left(\Big(\frac{dE_p}{dw}\Big)^{T} d\right) = \frac{d}{dw}\left(\sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \Big(\frac{dv_{pj}^L}{dw}\Big)^{T} d\right) \quad (7) $$
$$ = \sum_{j=1}^{N_L} \left[ \frac{\partial^2 E_p}{\partial (v_{pj}^L)^2}\, \Big(\Big(\frac{dv_{pj}^L}{dw}\Big)^{T} d\Big)\, \frac{dv_{pj}^L}{dw} + \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{d^2 v_{pj}^L}{dw^2}\, d \right] $$
$$ = \sum_{j=1}^{N_L} \left[ \left(\frac{\partial^2 E_p}{\partial (u_{pj}^L)^2}\, f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L}\, f''(v_{pj}^L)\right) \Big(\Big(\frac{dv_{pj}^L}{dw}\Big)^{T} d\Big)\, \frac{dv_{pj}^L}{dw} + \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{d^2 v_{pj}^L}{dw^2}\, d \right]. $$
The first and second terms of equation (7) will from now on be referred to as the A- and B-vector respectively. So we have
$$ A = \sum_{j=1}^{N_L} \left(\frac{\partial^2 E_p}{\partial (u_{pj}^L)^2}\, f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L}\, f''(v_{pj}^L)\right) \Big(\Big(\frac{dv_{pj}^L}{dw}\Big)^{T} d\Big)\, \frac{dv_{pj}^L}{dw} \quad \text{and} \quad B = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{d^2 v_{pj}^L}{dw^2}\, d. \quad (8) $$
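For instance, with the common sum-of-squares measure $E_p = \frac{1}{2}\sum_{j=1}^{N_L}(u_{pj}^L - t_{pj})^2$ (my illustration, not assumed anywhere in the derivation) we have $\frac{\partial E_p}{\partial u_{pj}^L} = u_{pj}^L - t_{pj}$ and $\frac{\partial^2 E_p}{\partial (u_{pj}^L)^2} = 1$, so the coefficient in front of $\big(\frac{dv_{pj}^L}{dw}\big)^T d$ in (8) reduces to
$$ f'(v_{pj}^L)^2 + (u_{pj}^L - t_{pj})\, f''(v_{pj}^L). $$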

We first concentrate on calculating the A-vector.

Lemma 1 Let $\varphi_{pm}^l$ be defined as $\varphi_{pm}^l = \big(\frac{dv_{pm}^l}{dw}\big)^T d$. $\varphi_{pm}^l$ can be calculated by forward propagation using the recursive formula
$$ \varphi_{pm}^l = \sum_{n_s^r \in S_m^l} \left( d_{ms}^{lr} u_{ps}^r + w_{ms}^{lr}\, f'(v_{ps}^r)\, \varphi_{ps}^r \right) + d_m^l,\ \ l > 0; \qquad \varphi_{pi}^0 = 0,\ \ 1 \le i \le N_0. $$

Proof. For input nodes we have $\varphi_{pi}^0 = 0$ as desired. Assume the lemma is true for all nodes in layers $k < l$. We then have
$$ \varphi_{pm}^l = \Big(\frac{dv_{pm}^l}{dw}\Big)^{T} d = \left(\frac{d}{dw}\Big(\sum_{n_s^r \in S_m^l} w_{ms}^{lr} u_{ps}^r + w_m^l\Big)\right)^{T} d $$
$$ = \sum_{n_s^r \in S_m^l} \left( w_{ms}^{lr}\, f'(v_{ps}^r)\, \Big(\frac{dv_{ps}^r}{dw}\Big)^{T} d + d_{ms}^{lr} u_{ps}^r \right) + d_m^l = \sum_{n_s^r \in S_m^l} \left( d_{ms}^{lr} u_{ps}^r + w_{ms}^{lr}\, f'(v_{ps}^r)\, \varphi_{ps}^r \right) + d_m^l. \qquad \Box $$
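Continuing the illustrative sketch from the end of Section 2 (same assumed data layout; forward_phi is an assumed name), Lemma 1 amounts to one extra forward sweep that carries $\varphi$ along with the ordinary activations; the coordinates of $d$ are stored in dW and dbias exactly like the weights and biases.

\begin{verbatim}
def forward_phi(layer_sizes, sources, W, bias, dW, dbias, x):
    """Forward propagation of phi = (dv/dw)^T d (Lemma 1)."""
    u, v = forward(layer_sizes, sources, W, bias, x)
    dsig = lambda s: sigmoid(s) * (1.0 - sigmoid(s))    # f' for the sigmoid
    phi = {(0, s): 0.0 for s in range(layer_sizes[0])}  # phi is zero on inputs
    for l in range(1, len(layer_sizes)):
        for m in range(layer_sizes[l]):
            phi[(l, m)] = dbias[(l, m)] + sum(
                dW[((l, m), src)] * u[src]
                + W[((l, m), src)]
                  * (dsig(v[src]) * phi[src] if src[0] > 0 else 0.0)
                for src in sources[(l, m)])
    return phi

# Sanity check of the lemma: phi[(l, m)] should agree with
# (v_eps[(l, m)] - v[(l, m)]) / eps, where v_eps is the forward pass run
# with weights W + eps*dW and biases bias + eps*dbias, for small eps.
\end{verbatim}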

Lemma 2 Assume that the $\varphi_{pm}^l$ factors have been calculated for all nodes in the network. The A-vector can be calculated by backward propagation using the recursive formula
$$ A_{mi}^{lh} = \tilde\delta_{pm}^l\, u_{pi}^h, \qquad A_m^l = \tilde\delta_{pm}^l, $$
where $\tilde\delta_{pm}^l$ is
$$ \tilde\delta_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl}\, \tilde\delta_{ps}^r,\ \ l < L; $$
$$ \tilde\delta_{pj}^L = \left(\frac{\partial^2 E_p}{\partial (u_{pj}^L)^2}\, f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L}\, f''(v_{pj}^L)\right) \varphi_{pj}^L,\ \ 1 \le j \le N_L. $$

Proof.
$$ A_{mi}^{lh} = \sum_{j=1}^{N_L} \tilde\delta_{pj}^L\, \frac{\partial v_{pj}^L}{\partial w_{mi}^{lh}} = \sum_{j=1}^{N_L} \tilde\delta_{pj}^L\, \frac{\partial v_{pj}^L}{\partial v_{pm}^l}\, \frac{\partial v_{pm}^l}{\partial w_{mi}^{lh}} = \Big(\sum_{j=1}^{N_L} \tilde\delta_{pj}^L\, \frac{\partial v_{pj}^L}{\partial v_{pm}^l}\Big)\, u_{pi}^h. $$
For the output layer we have $A_{ji}^{Lh} = \tilde\delta_{pj}^L\, u_{pi}^h$ as desired. Assume that the lemma is true for all nodes in layers $k > l$. Then
$$ \tilde\delta_{pm}^l = \sum_{j=1}^{N_L} \tilde\delta_{pj}^L\, \frac{\partial v_{pj}^L}{\partial v_{pm}^l} = \sum_{j=1}^{N_L} \tilde\delta_{pj}^L \sum_{n_s^r \in T_m^l} \frac{\partial v_{pj}^L}{\partial v_{ps}^r}\, f'(v_{pm}^l)\, w_{sm}^{rl} $$
$$ = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl} \sum_{j=1}^{N_L} \tilde\delta_{pj}^L\, \frac{\partial v_{pj}^L}{\partial v_{ps}^r} = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl}\, \tilde\delta_{ps}^r. \qquad \Box $$

The calculation of the B-vector is a bit more involved but is basically constructed in the same manner.

Lemma 3 Assume that the $\varphi_{pm}^l$ factors have been calculated for all nodes in the network. The B-vector can be calculated by backward propagation using the recursive formula
$$ B_{mi}^{lh} = \delta_{pm}^l\, f'(v_{pi}^h)\, \varphi_{pi}^h + \gamma_{pm}^l\, u_{pi}^h, \qquad B_m^l = \gamma_{pm}^l, $$
where $\delta_{pm}^l$ and $\gamma_{pm}^l$ are
$$ \delta_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl}\, \delta_{ps}^r,\ \ l < L; \qquad \delta_{pj}^L = \frac{\partial E_p}{\partial v_{pj}^L},\ \ 1 \le j \le N_L; $$
$$ \gamma_{pm}^l = \sum_{n_s^r \in T_m^l} \left( f'(v_{pm}^l)\, w_{sm}^{rl}\, \gamma_{ps}^r + \big( d_{sm}^{rl}\, f'(v_{pm}^l) + w_{sm}^{rl}\, f''(v_{pm}^l)\, \varphi_{pm}^l \big)\, \delta_{ps}^r \right),\ \ l < L; $$
$$ \gamma_{pj}^L = 0,\ \ 1 \le j \le N_L. $$

Proof. Observe that the B-vector can be written in the form
$$ B = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{d^2 v_{pj}^L}{dw^2}\, d = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{d\varphi_{pj}^L}{dw}. $$
Using the chain rule we can derive analytic expressions for $\delta_{pm}^l$ and $\gamma_{pm}^l$:
$$ B_{mi}^{lh} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{\partial \varphi_{pj}^L}{\partial w_{mi}^{lh}} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \left( \frac{\partial \varphi_{pj}^L}{\partial \varphi_{pm}^l}\, \frac{\partial \varphi_{pm}^l}{\partial w_{mi}^{lh}} + \frac{\partial \varphi_{pj}^L}{\partial v_{pm}^l}\, \frac{\partial v_{pm}^l}{\partial w_{mi}^{lh}} \right) $$
$$ = \Big( f'(v_{pi}^h)\, \varphi_{pi}^h \Big) \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{\partial \varphi_{pj}^L}{\partial \varphi_{pm}^l} + u_{pi}^h \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{\partial \varphi_{pj}^L}{\partial v_{pm}^l}. $$
So if the lemma is true, $\delta_{pm}^l$ and $\gamma_{pm}^l$ are given by
$$ \delta_{pm}^l = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{\partial \varphi_{pj}^L}{\partial \varphi_{pm}^l}, \qquad \gamma_{pm}^l = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{\partial \varphi_{pj}^L}{\partial v_{pm}^l}. $$
The rest of the proof is done in two steps. We look at the parts concerned with the $\delta_{pm}^l$ and $\gamma_{pm}^l$ factors separately. For all output nodes we have $\delta_{pj}^L = \frac{\partial E_p}{\partial v_{pj}^L}$ as desired. For non-output nodes we have
$$ \delta_{pm}^l = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{\partial \varphi_{pj}^L}{\partial \varphi_{pm}^l} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \sum_{n_s^r \in T_m^l} \frac{\partial \varphi_{pj}^L}{\partial \varphi_{ps}^r}\, f'(v_{pm}^l)\, w_{sm}^{rl} $$
$$ = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl} \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{\partial \varphi_{pj}^L}{\partial \varphi_{ps}^r} = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl}\, \delta_{ps}^r. $$
Similarly, $\gamma_{pj}^L = 0$ for all output nodes as desired. For non-output nodes we have
$$ \gamma_{pm}^l = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \sum_{n_s^r \in T_m^l} \left( \frac{\partial \varphi_{pj}^L}{\partial v_{ps}^r}\, \frac{\partial v_{ps}^r}{\partial v_{pm}^l} + \frac{\partial \varphi_{pj}^L}{\partial \varphi_{ps}^r}\, \frac{\partial \varphi_{ps}^r}{\partial v_{pm}^l} \right) $$
$$ = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L} \sum_{n_s^r \in T_m^l} \left( \frac{\partial \varphi_{pj}^L}{\partial v_{ps}^r}\, f'(v_{pm}^l)\, w_{sm}^{rl} + \frac{\partial \varphi_{pj}^L}{\partial \varphi_{ps}^r} \big( d_{sm}^{rl}\, f'(v_{pm}^l) + w_{sm}^{rl}\, f''(v_{pm}^l)\, \varphi_{pm}^l \big) \right) $$
$$ = \sum_{n_s^r \in T_m^l} \left( f'(v_{pm}^l)\, w_{sm}^{rl} \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{\partial \varphi_{pj}^L}{\partial v_{ps}^r} + \big( d_{sm}^{rl}\, f'(v_{pm}^l) + w_{sm}^{rl}\, f''(v_{pm}^l)\, \varphi_{pm}^l \big) \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v_{pj}^L}\, \frac{\partial \varphi_{pj}^L}{\partial \varphi_{ps}^r} \right) $$
$$ = \sum_{n_s^r \in T_m^l} \left( f'(v_{pm}^l)\, w_{sm}^{rl}\, \gamma_{ps}^r + \big( d_{sm}^{rl}\, f'(v_{pm}^l) + w_{sm}^{rl}\, f''(v_{pm}^l)\, \varphi_{pm}^l \big)\, \delta_{ps}^r \right). $$
The proof of the formula for $B_m^l$ follows easily from the above derivations and is left to the reader. $\Box$

We are now ready to give an explicit formula for calculation of the Hessian matrix times a vector. Let $Hd$ be the vector $H_p(w)d$.

Corollary 1 Assume that the $\varphi_{pm}^l$ factors have been calculated for all nodes in the network. The vector $Hd$ can be calculated by backward propagation using the following recursive formula
$$ Hd_{mi}^{lh} = \delta_{pm}^l\, f'(v_{pi}^h)\, \varphi_{pi}^h + \big( \tilde\delta_{pm}^l + \gamma_{pm}^l \big)\, u_{pi}^h, \qquad Hd_m^l = \tilde\delta_{pm}^l + \gamma_{pm}^l, $$
where $\delta_{pm}^l$, $\tilde\delta_{pm}^l$ and $\gamma_{pm}^l$ are given as shown in Lemma 2 and Lemma 3.

Proof. By combination of Lemma 2 and Lemma 3. $\Box$

If we view first derivatives like $\frac{\partial E_p}{\partial u_{pm}^l}$ and $\frac{\partial E_p}{\partial v_{pm}^l}$ as already available information, then the formula for $Hd$ can be reformulated into a formula based only on one recursive parameter. First we observe that $\delta_{pm}^l$ and $\gamma_{pm}^l$ can be written in the form
$$ \delta_{pm}^l = \frac{\partial E_p}{\partial v_{pm}^l}, \quad (9) $$
$$ \gamma_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \left( \frac{\partial E_p}{\partial v_{ps}^r}\, d_{sm}^{rl} + w_{sm}^{rl}\, \gamma_{ps}^r \right) + \frac{\partial E_p}{\partial u_{pm}^l}\, f''(v_{pm}^l)\, \varphi_{pm}^l. $$

Corollary 2 Assume that the $\varphi_{pm}^l$ factors have been calculated for all nodes in the network. The vector $Hd$ can be calculated by backward propagation using the following recursive formula
$$ Hd_{mi}^{lh} = \frac{\partial E_p}{\partial v_{pm}^l}\, f'(v_{pi}^h)\, \varphi_{pi}^h + \psi_{pm}^l\, u_{pi}^h, \qquad Hd_m^l = \psi_{pm}^l, $$
where $\psi_{pm}^l$ is
$$ \psi_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} \left( \frac{\partial E_p}{\partial v_{ps}^r}\, d_{sm}^{rl} + w_{sm}^{rl}\, \psi_{ps}^r \right) + \frac{\partial E_p}{\partial u_{pm}^l}\, f''(v_{pm}^l)\, \varphi_{pm}^l,\ \ l < L; $$
$$ \psi_{pj}^L = \left( \frac{\partial^2 E_p}{\partial (u_{pj}^L)^2}\, f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L}\, f''(v_{pj}^L) \right) \varphi_{pj}^L,\ \ 1 \le j \le N_L. $$

Proof. By Corollary 1 and equation (9), with $\psi_{pm}^l = \tilde\delta_{pm}^l + \gamma_{pm}^l$. $\Box$

The formula in Corollary 2 is a generalized version of the one that Yoshida derived for feed-forward networks with only connections between adjacent layers. An algorithm that calculates $\sum_{p=1}^{P} H_p(w)d$ based on Corollary 1 is given below. The algorithm also calculates the gradient vector $G = \sum_{p=1}^{P} \frac{dE_p}{dw}$.

1. Initialize.
   $Hd = 0$; $G = 0$.
   Repeat the following steps for $p = 1,\ldots,P$.

2. Forward propagation.
   For nodes $i = 1$ to $N_0$ do: $\varphi_{pi}^0 = 0$.
   For layers $l = 1$ to $L$ and nodes $m = 1$ to $N_l$ do:
   $$ v_{pm}^l = \sum_{n_s^r \in S_m^l} w_{ms}^{lr} u_{ps}^r + w_m^l; \qquad u_{pm}^l = f(v_{pm}^l), $$
   $$ \varphi_{pm}^l = \sum_{n_s^r \in S_m^l} \left( d_{ms}^{lr} u_{ps}^r + w_{ms}^{lr}\, f'(v_{ps}^r)\, \varphi_{ps}^r \right) + d_m^l. $$

3. Output layer.
   For nodes $j = 1$ to $N_L$ do
   $$ \delta_{pj}^L = \frac{\partial E_p}{\partial v_{pj}^L}; \qquad \gamma_{pj}^L = 0; \qquad \tilde\delta_{pj}^L = \left( \frac{\partial^2 E_p}{\partial (u_{pj}^L)^2}\, f'(v_{pj}^L)^2 + \frac{\partial E_p}{\partial u_{pj}^L}\, f''(v_{pj}^L) \right) \varphi_{pj}^L. $$
   For all nodes $n_s^r \in S_j^L$ do
   $$ Hd_{js}^{Lr} = Hd_{js}^{Lr} + \delta_{pj}^L\, f'(v_{ps}^r)\, \varphi_{ps}^r + \tilde\delta_{pj}^L\, u_{ps}^r; \qquad Hd_j^L = Hd_j^L + \tilde\delta_{pj}^L; $$
   $$ G_{js}^{Lr} = G_{js}^{Lr} + \delta_{pj}^L\, u_{ps}^r; \qquad G_j^L = G_j^L + \delta_{pj}^L. $$

4. Backward propagation.
   For layers $l = L-1$ down to $1$ and nodes $m = 1$ to $N_l$ do:
   $$ \delta_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl}\, \delta_{ps}^r; \qquad \tilde\delta_{pm}^l = f'(v_{pm}^l) \sum_{n_s^r \in T_m^l} w_{sm}^{rl}\, \tilde\delta_{ps}^r, $$
   $$ \gamma_{pm}^l = \sum_{n_s^r \in T_m^l} \left( f'(v_{pm}^l)\, w_{sm}^{rl}\, \gamma_{ps}^r + \big( d_{sm}^{rl}\, f'(v_{pm}^l) + w_{sm}^{rl}\, f''(v_{pm}^l)\, \varphi_{pm}^l \big)\, \delta_{ps}^r \right). $$
   For all nodes $n_s^r \in S_m^l$ do
   $$ Hd_{ms}^{lr} = Hd_{ms}^{lr} + \delta_{pm}^l\, f'(v_{ps}^r)\, \varphi_{ps}^r + \big( \tilde\delta_{pm}^l + \gamma_{pm}^l \big)\, u_{ps}^r; \qquad Hd_m^l = Hd_m^l + \tilde\delta_{pm}^l + \gamma_{pm}^l; $$
   $$ G_{ms}^{lr} = G_{ms}^{lr} + \delta_{pm}^l\, u_{ps}^r; \qquad G_m^l = G_m^l + \delta_{pm}^l. $$

Clearly this algorithm has $O(N)$ time- and memory requirements. More precisely, the time complexity is about 2.5 times the time complexity of a gradient calculation alone.
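To make the structure of the algorithm concrete, the following is a minimal sketch that specializes it to a strictly layered, fully connected network with a sum-of-squares error, so every layer can be handled with vectorized operations. The function name hvp_layered, the tanh activation and the error measure are assumptions of the sketch; they are not prescribed by the algorithm above, which also handles arbitrary connectivity.

\begin{verbatim}
import numpy as np

def f(v):   return np.tanh(v)
def fp(v):  return 1.0 - np.tanh(v) ** 2           # f'
def fpp(v): return -2.0 * np.tanh(v) * fp(v)       # f''

def hvp_layered(W, b, dW, db, x, t):
    """Return the gradient and the exact product H_p(w) d for one pattern.

    W[l-1], b[l-1]   weights/biases of layer l = 1..L, W[l-1]: (N_l, N_{l-1})
    dW[l-1], db[l-1] the corresponding blocks of the vector d
    x, t             input and target, with E_p = 0.5 * |u^L - t|^2
    """
    L = len(W)
    u, v, phi = [x], [None], [np.zeros_like(x, dtype=float)]
    # Step 2: forward propagation of u, v and phi = (dv/dw)^T d (Lemma 1).
    for l in range(1, L + 1):
        v.append(W[l-1] @ u[l-1] + b[l-1])
        u.append(f(v[l]))
        rprev = np.zeros_like(x, dtype=float) if l == 1 else fp(v[l-1]) * phi[l-1]
        phi.append(dW[l-1] @ u[l-1] + W[l-1] @ rprev + db[l-1])
    # Step 3: output layer (dE/du = u - t, d2E/du2 = 1 for this error measure).
    err    = u[L] - t
    delta  = err * fp(v[L])                              # dE_p/dv^L
    dtilde = (fp(v[L])**2 + err * fpp(v[L])) * phi[L]    # boundary of Lemma 2
    gamma  = np.zeros_like(delta)                        # boundary of Lemma 3
    GW, Gb = [None] * L, [None] * L
    HW, Hb = [None] * L, [None] * L
    # Steps 3-4: accumulate G and Hd, then propagate delta, dtilde, gamma back.
    for l in range(L, 0, -1):
        rprev = np.zeros_like(x, dtype=float) if l == 1 else fp(v[l-1]) * phi[l-1]
        GW[l-1] = np.outer(delta, u[l-1]);  Gb[l-1] = delta
        HW[l-1] = np.outer(delta, rprev) + np.outer(dtilde + gamma, u[l-1])
        Hb[l-1] = dtilde + gamma
        if l > 1:                                        # hidden layers only
            back   = W[l-1].T @ delta
            gamma  = fp(v[l-1]) * (W[l-1].T @ gamma + dW[l-1].T @ delta) \
                     + fpp(v[l-1]) * phi[l-1] * back
            dtilde = fp(v[l-1]) * (W[l-1].T @ dtilde)
            delta  = fp(v[l-1]) * back
    return (GW, Gb), (HW, Hb)
\end{verbatim}

A convenient sanity check for such a sketch is to compare its output with a one-sided difference of the gradient, equation (15) below, or with a full Hessian computed numerically on a very small network.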

4 Improvement of existing learning techniques

In this section we justify the importance of the exact calculation of the Hessian times a vector by showing some possible improvements on two different learning algorithms.

4.1 The scaled conjugate gradient algorithm

The scaled conjugate gradient algorithm (SCG) is a variation of a standard conjugate gradient algorithm. The conjugate gradient algorithms produce non-interfering directions of search if the error function is assumed to be quadratic. Minimization in one direction $d_t$ followed by minimization in another direction $d_{t+1}$ implies that the quadratic approximation to the error has been minimized over the whole subspace spanned by $d_t$ and $d_{t+1}$. The search directions are given by
$$ d_{t+1} = -E'(w_{t+1}) + \beta_t d_t, \quad (10) $$
where $w_t$ is a vector containing all weight values at time step $t$ and $\beta_t$ is
$$ \beta_t = \frac{|E'(w_{t+1})|^2 - E'(w_{t+1})^T E'(w_t)}{|E'(w_t)|^2}. \quad (11) $$
In the standard conjugate gradient algorithms the step size $\alpha_t$ is found by a line search, which can be very time consuming because it involves several calculations of the error and/or the first derivative. In the scaled conjugate gradient algorithm the step size is estimated by a scaling mechanism, thus avoiding the time consuming line search. The step size is given by
$$ \alpha_t = \frac{-d_t^T E'(w_t)}{d_t^T s_t + \lambda_t |d_t|^2}, \quad (12) $$
where $s_t$ is
$$ s_t = E''(w_t)\, d_t. \quad (13) $$
$\alpha_t$ is the step size that minimizes the second order approximation to the error function. $\lambda_t$ is a scaling parameter whose function is similar to the scaling parameter found in Levenberg-Marquardt methods [Fletcher 75]. $\lambda_t$ is in each iteration raised or lowered according to how good the second order approximation is to the real error. The weight update formula is given by
$$ \Delta w_t = \alpha_t d_t. \quad (14) $$
$s_t$ has up till now been approximated by a one-sided difference equation of the form
$$ s_t = \frac{E'(w_t + \sigma_t d_t) - E'(w_t)}{\sigma_t}, \qquad 0 < \sigma_t \ll 1. \quad (15) $$
$s_t$ can now be calculated exactly by applying the algorithm from the last section. We tested the SCG algorithm on several test problems using both exact and approximated calculations of $d_t^T s_t$. The experiments indicated a speedup in favor of the exact calculation. Equation (15) is in many cases a good approximation but can, however, be numerically unstable even when high precision arithmetic is used. If the relative error of $E'(w_t)$ is $\varepsilon$, then the relative error of equation (15) can be as high as $\frac{2\varepsilon}{\sigma_t}$ [Ralston 78]. So the relative error gets higher when $\sigma_t$ is lowered. We refer to [Møller 93a] for a detailed description of SCG. For a stochastic version of SCG especially designed for training on large, redundant training sets, see also [Møller 93b].
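Schematically, and assuming two callables grad(w) and hvp(w, d) that return the gradient and an exact Hessian-vector product (for instance the sketch in Section 3), the two alternatives (13) and (15) look as follows; this is an illustration, not the actual SCG implementation.

\begin{verbatim}
def s_exact(hvp, w, d):
    # Equation (13): s_t = E''(w_t) d_t, computed exactly in O(N) time.
    return hvp(w, d)

def s_onesided(grad, w, d, sigma=1e-4):
    # Equation (15): one-sided difference; its relative error grows like
    # 2*eps/sigma when the gradient itself carries a relative error eps.
    return (grad(w + sigma * d) - grad(w)) / sigma
\end{verbatim}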

4.2 Eigenvalue estimation

A recent learning algorithm proposed by Le Cun, Simard and Pearlmutter involves the estimation of the eigenvalues of the Hessian matrix. We will give a brief description of the ideas in this algorithm, mainly in order to explain the use of the eigenvalues and the technique to estimate them. We refer to [Le Cun 93] for a detailed description of this algorithm.

Assume that the Hessian $H(w_t)$ is invertible. We then have by the spectral theorem from linear algebra that $H(w_t)$ has $N$ eigenvectors that form an orthogonal basis in $\mathbb{R}^N$ [Horn 85]. This implies that the inverse of the Hessian matrix, $H(w_t)^{-1}$, can be written in the form
$$ H(w_t)^{-1} = \sum_{i=1}^{N} \frac{e_i e_i^T}{|e_i|^2 \lambda_i}, \quad (16) $$
where $\lambda_i$ is the $i$'th eigenvalue of $H(w_t)$ and $e_i$ is the corresponding eigenvector. Equation (16) implies that the search direction $d_t$ of the Newton algorithm [Fletcher 75] can be written as
$$ d_t = -H(w_t)^{-1} G(w_t) = -\sum_{i=1}^{N} \frac{e_i e_i^T}{|e_i|^2 \lambda_i}\, G(w_t) = -\sum_{i=1}^{N} \frac{e_i^T G(w_t)}{|e_i|^2 \lambda_i}\, e_i, \quad (17) $$
where $G(w_t)$ is the gradient vector. So the Newton search direction can be interpreted as a sum of projections of the gradient vector onto the eigenvectors, weighted with the inverse of the eigenvalues. To calculate all eigenvalues and corresponding eigenvectors costs $O(N^3)$ time, which is infeasible for large $N$. Le Cun et al. argue that only a few of the largest eigenvalues and the corresponding eigenvectors are needed to achieve a considerable speedup in learning. The idea is to reduce the weight change in directions with large curvature, while keeping it large in all other directions. They choose the search direction to be
$$ d_t = -\left( G(w_t) - \sum_{i=1}^{k} \frac{\lambda_{k+1}\, e_i^T G(w_t)}{\lambda_1 |e_i|^2}\, e_i \right), \quad (18) $$
where $i$ now runs from the largest eigenvalue $\lambda_1$ down to the $k$'th largest eigenvalue $\lambda_k$. The eigenvalues of the Hessian matrix are the curvatures in the directions of the corresponding eigenvectors. So equation (18) reduces the component of the gradient along the directions with large curvature. See also [Le Cun 91] for a discussion of this. The learning rate can now be increased by a factor of $\frac{\lambda_1}{\lambda_{k+1}}$, since the components in directions with large curvature have been reduced with the inverse of this factor.

The largest eigenvalue and the corresponding eigenvector can be estimated by an iterative process known as the Power method [Ralston 78]. The Power method can be used successively to estimate the $k$ largest eigenvalues if the components in the directions of already estimated eigenvectors are subtracted in the process. Below we show an algorithm for estimation of the $i$'th eigenvalue and eigenvector. The Power method is here combined with the Rayleigh quotient technique [Ralston 78]. This can accelerate the process considerably.

Choose an initial random vector $e_i^0$. Repeat the following steps for $m = 1,\ldots,M$, where $M$ is a small constant:
$$ e_i^m = H(w_t)\, e_i^{m-1}; \qquad e_i^m = e_i^m - \sum_{j=1}^{i-1} \frac{e_j^T e_i^m}{|e_j|^2}\, e_j; $$
$$ \lambda_i^m = \frac{(e_i^{m-1})^T e_i^m}{|e_i^{m-1}|^2}; \qquad e_i^m = \frac{1}{\lambda_i^m}\, e_i^m. $$
$\lambda_i^M$ and $e_i^M$ are respectively the estimated eigenvalue and eigenvector.

Theoretically it would be enough to subtract the components in the directions of already estimated eigenvectors once, but in practice roundoff errors will generally introduce these components again.

Le Cun et al. approximate the term $H(w_t)\, e_i^m$ with a one-sided difference as shown in equation (15). Now this term can be calculated exactly by use of the algorithm described in the last sections.
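The following sketch combines the deflated Power method and the Rayleigh quotient with an exact Hessian-vector product; the callable hvp(w, d), the argument names and the function name estimate_eigenpair are assumptions of the sketch, and all vectors are assumed to be NumPy arrays.

\begin{verbatim}
def estimate_eigenpair(hvp, w, prev_vecs, e0, M=10):
    """Estimate the i'th largest eigenvalue/eigenvector of H(w).

    hvp(w, d)  exact Hessian-vector product (e.g. the Section 3 sketch)
    prev_vecs  previously estimated eigenvectors e_1, ..., e_{i-1}
    e0         initial random vector; M is a small number of iterations
    """
    e = e0
    lam = None
    for _ in range(M):
        e_prev = e
        e = hvp(w, e_prev)                          # e^m = H(w_t) e^{m-1}
        for ej in prev_vecs:                        # subtract known directions
            e = e - (ej @ e) / (ej @ ej) * ej
        lam = (e_prev @ e) / (e_prev @ e_prev)      # Rayleigh quotient
        e = e / lam                                 # rescale as in the text
    return lam, e
\end{verbatim}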

5 Conclusion

This paper has presented an algorithm for the exact calculation of the product of the Hessian matrix of error functions and a vector. The product is calculated without ever explicitly calculating the Hessian matrix itself. The algorithm has $O(N)$ time- and memory requirements, where $N$ is the number of variables in the network.

The relevance of this algorithm has been demonstrated by showing possible improvements in two different learning techniques, the scaled conjugate gradient learning algorithm and an algorithm recently proposed by Le Cun, Simard and Pearlmutter.

Acknowledgements

It has recently come to the author's knowledge that the same algorithm has been derived independently and at approximately the same time by Barak Pearlmutter, Department of Computer Science and Engineering, Oregon Graduate Institute [Pearlmutter 93]. Thank you to Barak for his nice and immediate recognition of the independence of our work.

I would also like to thank Wray Buntine, Scott Fahlman, Brian Mayoh and Ole Østerby for helpful advice. This research was supported by a grant from the Royal Danish Research Council. Facilities for this research were provided by the National Science Foundation (U.S.) under grant IRI-9214873. All opinions, findings, conclusions and recommendations in this paper are those of the author and do not necessarily reflect the views of the Royal Danish Research Council or the National Science Foundation.

References

[Bishop 92] C. Bishop (1992), Exact Calculation of the Hessian Matrix for the Multilayer Perceptron, Neural Computation, Vol. 4, pp. 494-501.

[Buntine 91] W. Buntine and A. Weigend (1991), Calculating Second Derivatives on Feed-Forward Networks, submitted to IEEE Transactions on Neural Networks.

[Le Cun 91] Y. Le Cun, I. Kanter and S. Solla (1991), Eigenvalues of Covariance Matrices: Application to Neural Network Learning, Physical Review Letters, Vol. 66, pp. 2396-2399.

[Le Cun 93] Y. Le Cun, P.Y. Simard and B. Pearlmutter (1993), Local Computation of the Second Derivative Information in a Multilayer Network, in Proceedings of Neural Information Processing Systems, Morgan Kaufmann, in print.

[Dixon 89] L.C.W. Dixon and R.C. Price (1989), Truncated Newton Method for Sparse Unconstrained Optimization Using Automatic Differentiation, Journal of Optimization Theory and Applications, Vol. 60, No. 2, pp. 261-275.

[Fletcher 75] R. Fletcher (1975), Practical Methods of Optimization, Vol. 1, John Wiley & Sons.

[Hassibi 92] B. Hassibi and D.G. Stork (1992), Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, in Proceedings of Neural Information Processing Systems, Morgan Kaufmann.

[Horn 85] R.A. Horn and C.R. Johnson (1985), Matrix Analysis, Cambridge University Press, Cambridge.

[MacKay 91] D.J.C. MacKay (1991), A Practical Bayesian Framework for Back-Prop Networks, Neural Computation, Vol. 4, No. 3, pp. 448-472.

[Møller 93a] M. Møller (1993), A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning, Neural Networks, in press.

[Møller 93b] M. Møller (1993), Supervised Learning on Large Redundant Training Sets, International Journal of Neural Systems, in press.

[Pearlmutter 93] B.A. Pearlmutter (1993), Fast Exact Multiplication by the Hessian, preprint, submitted.

[Ralston 78] A. Ralston and P. Rabinowitz (1978), A First Course in Numerical Analysis, McGraw-Hill Book Company, Inc.

[Yoshida 91] T. Yoshida (1991), A Learning Algorithm for Multilayered Neural Networks: A Newton Method Using Automatic Differentiation, in Proceedings of the International Joint Conference on Neural Networks, Seattle, Poster.