
Machine Learning

Backpropagation

Sargur Srihari

Topics in Backpropagation

1. Forward Propagation
2. Loss Function and Gradient Descent
3. Derivatives using the chain rule
4. Computational graph for backpropagation
5. Backprop algorithm
6. The Jacobian matrix

A neural network with one hidden layer

• D input variables x_1,..,x_D
• M hidden unit activations:   a_j = Σ_{i=1}^D w_ji^(1) x_i + w_j0^(1),   j = 1,..,M
• Hidden unit activation functions:   z_j = h(a_j)
• K output activations:   a_k = Σ_{j=1}^M w_kj^(2) z_j + w_k0^(2),   k = 1,..,K
• Output activation functions:   y_k = σ(a_k)
• Augmented network:   y_k(x,w) = σ( Σ_{j=1}^M w_kj^(2) h( Σ_{i=1}^D w_ji^(1) x_i + w_j0^(1) ) + w_k0^(2) )
• No. of weights in w:   T = (D+1)M + (M+1)K = M(D+K+1) + K

Forward Propagation

• Each layer is a function of the layer that preceded it
• First layer:   z = h(W^(1)T x + b^(1))
• Second layer:   y = σ(W^(2)T z + b^(2))
• Note that W is a matrix rather than a vector
• Example with D=3, M=3:

x = [x_1, x_2, x_3]^T

w = {  W_1^(1) = [W_11 W_12 W_13]^T,  W_2^(1) = [W_21 W_22 W_23]^T,  W_3^(1) = [W_31 W_32 W_33]^T ;
       W_1^(2) = [W_11 W_12 W_13]^T,  W_2^(2) = [W_21 W_22 W_23]^T,  W_3^(2) = [W_31 W_32 W_33]^T  }

[Figure: the first network layer and the network output written in matrix multiplication notation]
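To make the matrix form concrete, here is a minimal MATLAB sketch of this two-layer forward pass. The sizes, weights, input values and the particular activation functions (tanh for the hidden layer, logistic sigmoid for the output) are assumptions chosen only for illustration; the weight matrices are stored as M x D and K x M so no transpose is needed.

% Minimal two-layer forward pass (all values illustrative)
D = 3; M = 3; K = 2;
W1 = randn(M, D);  b1 = randn(M, 1);   % first-layer weights and biases
W2 = randn(K, M);  b2 = randn(K, 1);   % second-layer weights and biases
h     = @(a) tanh(a);                  % hidden-unit activation (example choice)
sigma = @(a) 1 ./ (1 + exp(-a));       % output activation (example choice)

x = [0.5; -1.2; 2.0];                  % one input vector, D x 1
z = h(W1 * x + b1);                    % hidden-layer activations, M x 1
y = sigma(W2 * z + b2);                % network outputs, K x 1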

Loss and Regularization

y=f (x,w)

E = (1/N) Σ_{i=1}^N E_i( f(x^(i), w), t_i )

• Forward: the input x produces the output y and the loss E_i + R(W), where R(W) is a regularization term
• Backward: the gradient of E_i + R(W) is propagated back through the network


• Goal: determine the weights w from a labeled set of training samples
• The learning procedure has two stages:

1. Evaluate the derivatives of the loss, ∇E(w), with respect to the weights w_1,..,w_T
2. Use the derivative vector to compute the adjustments to the weights

∇E(w) = [ ∂E/∂w_0,  ∂E/∂w_1,  ...,  ∂E/∂w_T ]^T

w^(τ+1) = w^(τ) − η∇E(w^(τ))

Derivative of a composite function with one weight

[Figure: the input weight w feeds two paths; along one, y = f(w) and p = g(y); along the other, z = f(w) and q = h(z); both feed the output node o = k(p, q)]

• Composite function:   o = k(p, q) = k( g(f(w)), h(f(w)) )

!" = !" !% !& + !" !' !( ð$ ð% ð& ð$ ð' ð( ð$

!" !)(%.') !)(%,') = -′(/) 0′(1)+ ℎ4 5 0′(1) ð$ ð% ð'

7 Path 1 Path 2 Machine Learning Srihari Derivative of a composite function with four inputs

E(a,b,c,d) = e = a·b + c·log d

Derivatives by inspection, using the computational graph  e = u + v,  u = a·b,  v = c·t,  t = log d:

∂e/∂u = 1,   ∂e/∂v = 1
∂u/∂a = b,   ∂u/∂b = a,   ∂v/∂t = c,   ∂v/∂c = t = log d,   ∂t/∂d = 1/d

Multiplying the derivatives along each path of the graph gives the derivatives of the output wrt the inputs:

∂e/∂a = b,   ∂e/∂b = a,   ∂e/∂c = log d,   ∂e/∂d = c/d

We want to compute the derivatives of the output wrt the input values a = 2, b = 3, c = 5, d = 10:

∂e/∂a = 3,   ∂e/∂b = 2,   ∂e/∂c = log 10 (≈ 2.30 with the natural log),   ∂e/∂d = c/d = 0.5
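These numbers are easy to check with a few lines of MATLAB. The sketch below assumes log d means the natural logarithm, so ∂e/∂c = ln 10 ≈ 2.30 while ∂e/∂d = c/d = 0.5 as on the graph:

a = 2; b = 3; c = 5; d = 10;
e = @(a,b,c,d) a*b + c*log(d);        % log here is the natural logarithm

% Analytic derivatives read off the computational graph
de_da = b;            % 3
de_db = a;            % 2
de_dc = log(d);       % ln(10) ~= 2.30
de_dd = c/d;          % 0.5

% Numerical check by central differences
delta = 1e-6;
num_da = (e(a+delta,b,c,d) - e(a-delta,b,c,d)) / (2*delta);
num_dd = (e(a,b,c,d+delta) - e(a,b,c,d-delta)) / (2*delta);
disp([de_da num_da; de_dd num_dd])    % analytic and numeric values agree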

[Figure slides: Example of Derivative Computation; Derivatives of f = (x+y)z wrt x, y, z; Derivatives for a node z = f(x, y)]

Composite Function

• Consider a composite function f(g(h(x))), i.e., an outer function f, an inner function g, and a final inner function h(x)
• Say f(x) = e^{sin(x²)}; we can decompose it as f(x) = e^x, g(x) = sin x and h(x) = x², so that f(g(h(x))) = e^{g(h(x))}
• Its computational graph is

• Every connection is an input, every node is a function or operation

Derivatives of Composite Function

• To get the derivatives of f(g(h(x))) = e^{g(h(x))} wrt x

1. We use the chain rule   df/dx = (df/dg)·(dg/dh)·(dh/dx),   where

   df/dg = e^{g(h(x))}     since f(g(h(x))) = e^{g(h(x))} and the derivative of e^x is e^x
   dg/dh = cos(h(x))       since g(h(x)) = sin h(x) and the derivative of sin is cos
   dh/dx = 2x              because h(x) = x² and its derivative is 2x

• Therefore   df/dx = e^{g(h(x))} · cos(h(x)) · 2x = e^{sin x²} · cos x² · 2x
• In each of these cases we pretend that the inner function is a single variable and differentiate with respect to it as such

2. Another way to view it: for f(x) = e^{sin(x²)}, create the temporary variables u = sin v and v = x², so that f(u) = e^u, with the corresponding computational graph

Derivative using Computational Graph

• All we need to do is get the derivative of each node wrt each of its inputs

With u = sin v, v = x², f(u) = e^u

• We can get whichever derivative we want by multiplying the ‘connection’ derivatives

df/dg = e^{g(h(x))},    dh/dx = 2x,    dg/dh = cos(h(x))

Since f(x) = e^x, g(x) = sin x and h(x) = x²:

df/dx = (df/dg)·(dg/dh)·(dh/dx) = e^{g(h(x))} · cos(h(x)) · 2x = e^{sin x²} · cos x² · 2x
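As a quick sanity check, the chain-rule result can be compared with a finite-difference estimate. The evaluation point x = 1.3 below is an arbitrary illustrative choice:

x = 1.3;                                        % arbitrary evaluation point
f = @(x) exp(sin(x.^2));                        % the composite function
df_chain = exp(sin(x^2)) .* cos(x^2) .* 2*x;    % chain-rule derivative
delta = 1e-6;
df_numeric = (f(x+delta) - f(x-delta)) / (2*delta);   % central difference
fprintf('chain rule: %.8f   numeric: %.8f\n', df_chain, df_numeric);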


Evaluating the gradient

• Goal of this section: find an efficient technique for evaluating the gradient of an error function E(w) for a feed-forward neural network

• Gradient evaluation can be performed using a local message passing scheme
• In which information is alternately sent forwards and backwards through the network
• Known as error backpropagation or simply as backprop

Back-propagation Terminology and Usage

• Backpropagation means a variety of different things
• Computing the derivative of the error function wrt the weights
• In a second, separate stage the derivatives are used to compute the adjustments to be made to the weights
• Can be applied to error functions other than sum-of-squared errors
• Used to evaluate other derivatives such as the Jacobian and Hessian matrices
• The second stage of weight adjustment using the calculated derivatives can be tackled using a variety of optimization schemes substantially more powerful than gradient descent

Overview of Backprop algorithm

• Choose random weights for the network
• Feed in an example and obtain a result
• Calculate the error for each node (starting from the last stage and propagating the error backwards)
• Update the weights
• Repeat with other examples until the network converges on the target output

• How to divide up the errors needs a little calculus

Evaluation of Error Function Derivatives

• Derivation of the back-propagation algorithm for:
  • an arbitrary feed-forward topology
  • arbitrary differentiable nonlinear activation functions
  • a broad class of error functions
• Error functions of practical interest are sums of errors associated with each training point

E(w) = Σ_{n=1}^N E_n(w)

• We consider the problem of evaluating ∇E_n(w) for the nth term in the error function

• Derivatives are wrt the weights w1,..wT • This can be used directly for sequential optimization or accumulated over training set (for batch)

Simple Model (Multiple Regression)

• Outputs y_k are linear combinations of inputs x_i:   y_k = Σ_i w_ki x_i
• Error function for a particular input x_n:   E_n = ½ Σ_k (y_nk − t_nk)²,   where y_nk = y_k(x_n, w) and the summation is over all K outputs
• Gradient of the error function wrt a weight w_ji:   ∂E_n/∂w_ji = (y_nj − t_nj) x_ni = δ · x
• This is a local computation involving the product of
  • the error signal y_nj − t_nj associated with the output end of the link w_ji
  • the variable x_ni associated with the input end of the link
• For a particular input x and weight w_ji, the squared error is  E = ½ (y(x,w) − t)²  and  ∂E/∂w_ji = (y_j − t_j) x_i = δ_j · x_i
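A few lines of MATLAB make the locality of this computation concrete; the sizes and values below are made up purely for illustration:

x = [1.0; 0.5; -2.0];          % input vector (3 inputs)
t = [0.2; 1.0];                % targets for 2 outputs
W = [0.1 -0.3  0.2;            % weight matrix, one row per output
     0.4  0.0 -0.1];
y = W * x;                     % linear outputs y_k = sum_i w_ki x_i
delta = y - t;                 % error signal at the output end of each link
dE_dW = delta * x';            % dE/dw_ki = delta_k * x_i (outer product)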

Extension to a more complex multilayer Network

• Each unit computes a weighted sum of its inputs:   a_j = Σ_i w_ji z_i

[Figure: unit j receives inputs z_i through weights w_ji;  a_j = Σ_i w_ji z_i,  z_j = h(a_j)]

• z_i is the activation of a unit (or an input) that sends a connection to unit j, and w_ji is the weight associated with that connection

• Output is transformed by a nonlinear activation function zj=h(aj)

• The variable zi can be an input and unit j could be an output

• For each input x_n in the training set, we calculate the activations of all hidden and output units by applying the above equations
• This process is called forward propagation

Evaluation of the Derivative of E_n wrt a weight w_ji

• The outputs of the various units depend on the particular input n
• We shall omit the subscript n from the network variables

• Note that E_n depends on w_ji only via the summed input a_j to unit j
• We can therefore apply the chain rule for partial derivatives to give

  ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji)

• The derivative wrt the weight is given by the product of the derivative wrt the activity and the derivative of the activity wrt the weight
• We now introduce the useful notation   δ_j ≡ ∂E_n/∂a_j
• where the δs are errors, as we shall see

• Using  a_j = Σ_i w_ji z_i  we can write  ∂a_j/∂w_ji = z_i
• Substituting, we get   ∂E_n/∂w_ji = δ_j z_i
• i.e., the required derivative is obtained by multiplying the value of δ for the unit at the output end of the weight by the value of z at the input end of the weight

• This takes the same form as for the simple linear model

Summarizing the evaluation of the Derivative ∂E_n/∂w_ji

• By the chain rule for partial derivatives:   ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji),   with  a_j = Σ_i w_ji z_i
• Define  δ_j ≡ ∂E_n/∂a_j;  we have  ∂a_j/∂w_ji = z_i
• Substituting, we get   ∂E_n/∂w_ji = δ_j z_i
• Thus the required derivative is obtained by multiplying
  1. the value of δ for the unit at the output end of the weight
  2. the value of z for the unit at the input end of the weight
• We need to figure out how to calculate δ_j for each unit of the network
• For output units, δ_j = y_j − t_j:  if  E = ½ Σ_j (y_j − t_j)²  and  y_j = a_j = Σ_i w_ji z_i  (regression), then  δ_j = ∂E/∂a_j = y_j − t_j
• For hidden units, we again need to make use of the chain rule of derivatives to determine  δ_j ≡ ∂E_n/∂a_j

• Need to figure out how to calculate δj for each unit of network 1 ∂ E If E = (y − t )2and y = a = w z then δ = = y − t For • For output units δj=yj-tj 2 ∑ j j j j ∑ ji i j ∂a j j regression j j • For hidden units, we again need to make use of chain rule of derivatives to ∂E determine δ ≡ n j ∂a j Machine Learning Srihari Calculation of Error for hidden unit δj Blue arrow for forward propagation Red arrows indicate direction of information flow during error backpropagation

• For hidden unit j by chain rule

δ_j ≡ ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j)      where the sum is over all units k to which j sends connections

• Substituting δ_k ≡ ∂E_n/∂a_k and using  a_k = Σ_i w_ki z_i = Σ_i w_ki h(a_i),  so that  ∂a_k/∂a_j = w_kj h′(a_j)

• we get the backpropagation formula for the error derivatives at stage j:

  δ_j = h′(a_j) Σ_k w_kj δ_k

(In this formula, h′(a_j) involves the input to the activation from earlier units, and δ_k is the error derivative at the later unit k.)

Error Backpropagation Algorithm

1. Apply an input vector x_n to the network and forward propagate through the network using  a_j = Σ_i w_ji z_i  and  z_j = h(a_j)
2. Evaluate δ_k for all output units using  δ_k = y_k − t_k
3. Backpropagate the δs using the backpropagation formula  δ_j = h′(a_j) Σ_k w_kj δ_k  to obtain δ_j for each hidden unit
4. Use  ∂E_n/∂w_ji = δ_j z_i  to evaluate the required derivatives

• The value of δ for a particular hidden unit j is obtained by propagating the δs backward from units k higher up in the network

A Simple Example

• Two-layer network • Sum-of-squared error • Output units: linear activation functions, i.e., multiple regression

• Output-unit activations:   y_k = a_k
• Standard sum-of-squared error:   E_n = ½ Σ_k (y_k − t_k)²,   where y_k is the activation of output unit k and t_k is the corresponding target for input x_n
• Hidden units have tanh activation functions:
  h(a) = tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}),   with the simple form for the derivative  h′(a) = 1 − h(a)²

Simple Example: Forward and Backward Prop

For each input in the training set:

• Forward propagation:   a_j = Σ_{i=0}^D w_ji^(1) x_i

  z_j = tanh(a_j),    y_k = Σ_{j=0}^M w_kj^(2) z_j

• Output differences:   δ_k = y_k − t_k
• Backward propagation (δs for the hidden units):   δ_j = h′(a_j) Σ_k w_kj δ_k = (1 − z_j²) Σ_{k=1}^K w_kj δ_k,   using h′(a) = 1 − h(a)²
• Derivatives wrt the first-layer and second-layer weights:   ∂E_n/∂w_ji^(1) = δ_j x_i,    ∂E_n/∂w_kj^(2) = δ_k z_j
• Batch method:   ∂E/∂w_ji = Σ_n ∂E_n/∂w_ji
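These forward and backward passes can be written directly in MATLAB. The sketch below handles a single input x_n; the sizes, weights, input and target are made up for illustration, and the biases are folded in by appending x_0 = z_0 = 1 to match the i = 0 and j = 0 terms in the sums. A single gradient-descent update (next slide) is included at the end.

D = 3; M = 4; K = 2;                    % illustrative sizes
W1 = 0.1*randn(M, D+1);                 % first-layer weights (column 1 multiplies x_0 = 1)
W2 = 0.1*randn(K, M+1);                 % second-layer weights (column 1 multiplies z_0 = 1)
x  = [0.5; -1.0; 2.0];  t = [1; 0];     % one training input and its target

% Forward propagation
a1 = W1 * [1; x];                       % a_j = sum_i w_ji x_i   (i = 0..D)
z  = tanh(a1);                          % z_j = tanh(a_j)
y  = W2 * [1; z];                       % y_k = sum_j w_kj z_j   (linear outputs)

% Backward propagation
dk = y - t;                             % delta_k = y_k - t_k
dj = (1 - z.^2) .* (W2(:, 2:end)' * dk);% delta_j = (1 - z_j^2) sum_k w_kj delta_k

% Error derivatives
dE_dW2 = dk * [1; z]';                  % dEn/dw_kj = delta_k z_j
dE_dW1 = dj * [1; x]';                  % dEn/dw_ji = delta_j x_i

% One gradient-descent update with learning rate eta
eta = 0.1;
W1 = W1 - eta * dE_dW1;
W2 = W2 - eta * dE_dW2;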

Using derivatives to update weights

• Gradient descent: update the weights using   w^(τ+1) = w^(τ) − η∇E(w^(τ))
• where the gradient vector ∇E(w^(τ)) consists of the vector of derivatives evaluated using back-propagation

∇E(w) = dE(w)/dw = [ ∂E/∂w_11^(1), ..., ∂E/∂w_MD^(1), ∂E/∂w_11^(2), ..., ∂E/∂w_KM^(2) ]^T

• There are W = M(D+1) + K(M+1) elements in the vector; the gradient ∇E(w^(τ)) is a W x 1 vector

Numerical example

a_j = Σ_{i=1}^D w_ji^(1) x_i

z_j = σ(a_j),    y_k = Σ_{j=1}^M w_kj^(2) z_j

• Network sizes: D = 3, M = 2, K = 1, N = 1
• Errors:   δ_j = σ′(a_j) Σ_k w_kj δ_k

δ_k = σ′(a_k)(y_k − t_k)

• Error derivatives:   ∂E_n/∂w_ji^(1) = δ_j x_i,    ∂E_n/∂w_kj^(2) = δ_k z_j
• First training example:  x = [1 0 1]^T, whose class label is t = 1
• The sigmoid activation function is applied to the hidden layer and the output layer
• Assume that the learning rate η is 0.9

Outputs, Errors, Derivatives, Weight Update

δ_k = σ′(a_k)(t_k − y_k) = [σ(a_k)(1 − σ(a_k))](1 − σ(a_k))      (here t_k = 1; see the note on sign below)
δ_j = σ′(a_j) Σ_k w_kj δ_k = [σ(a_j)(1 − σ(a_j))] Σ_k w_kj δ_k

Initial input and weight values:
x1  x2  x3  w14   w15   w24  w25  w34   w35  w46   w56   w04   w05  w06
1   0   1   0.2   -0.3  0.4  0.1  -0.5  0.2  -0.3  -0.2  -0.4  0.2  0.1

Net input and output calculation:
Unit   Net input a                                     Output σ(a)
4      0.2 + 0 − 0.5 − 0.4 = −0.7                      1/(1+e^0.7) = 0.332
5      −0.3 + 0 + 0.2 + 0.2 = 0.1                      1/(1+e^−0.1) = 0.525
6      (−0.3)(0.332) − (0.2)(0.525) + 0.1 = −0.105     1/(1+e^0.105) = 0.474

Errors at each node:
Unit   δ
6      (0.474)(1−0.474)(1−0.474) = 0.1311
5      (0.525)(1−0.525)(0.1311)(−0.2) = −0.0065
4      (0.332)(1−0.332)(0.1311)(−0.3) = −0.0087

Weight Update*:
Weight   New value
w46      −0.3 + (0.9)(0.1311)(0.332) = −0.261
w56      −0.2 + (0.9)(0.1311)(0.525) = −0.138
w14      0.2 + (0.9)(−0.0087)(1) = 0.192
w15      −0.3 + (0.9)(−0.0065)(1) = −0.306
w24      0.4 + (0.9)(−0.0087)(0) = 0.4
w25      0.1 + (0.9)(−0.0065)(0) = 0.1
w34      −0.5 + (0.9)(−0.0087)(1) = −0.508
w35      0.2 + (0.9)(−0.0065)(1) = 0.194
w06      0.1 + (0.9)(0.1311) = 0.218
w05      0.2 + (0.9)(−0.0065) = 0.194
w04      −0.4 + (0.9)(−0.0087) = −0.408

* Positive (additive) updates since we used (t_k − y_k) rather than (y_k − t_k)
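The whole table can be reproduced with a short MATLAB script. This sketch simply re-runs the forward pass, the δ computation and the additive (t_k − y_k) weight updates above, with the hidden units ordered as (4, 5) and the output as unit 6:

x = [1; 0; 1];  t = 1;  eta = 0.9;
sig = @(a) 1 ./ (1 + exp(-a));                  % logistic sigmoid

W1 = [0.2 0.4 -0.5;  -0.3 0.1 0.2];             % rows: units 4 and 5; columns: x1 x2 x3
b1 = [-0.4; 0.2];                               % w04, w05
W2 = [-0.3 -0.2];  b2 = 0.1;                    % w46, w56, w06

a_h = W1 * x + b1;   z = sig(a_h);              % hidden outputs: 0.332, 0.525
a_o = W2 * z + b2;   y = sig(a_o);              % network output: 0.474

d6  = y .* (1 - y) .* (t - y);                  % 0.1311
d45 = z .* (1 - z) .* (W2' * d6);               % -0.0087, -0.0065

W2 = W2 + eta * (d6 * z');   b2 = b2 + eta * d6;     % -0.261, -0.138 and 0.218
W1 = W1 + eta * (d45 * x');  b1 = b1 + eta * d45;    % updated first-layer weights and biases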

MATLAB Implementation (Pseudocode)

• Allows for multiple hidden layers
• Allows for training in batches
• Determines derivatives using back-propagation with a sum-of-squared error
• Determines the misclassification rate

Initializations

% This pseudo-code illustrates implementing a several-layer neural
% network. You need to fill in the missing parts to adapt the program
% to your own use. You may have to correct minor mistakes in the program.

%% Prepare the data
load data.mat
train_x = ..
test_x  = ..
train_y = ..
test_y  = ..
% x is the input to the neural network, y is the output

%% Some other preparations
% Number of hidden layers
numOfHiddenLayer = 4;

% Layer sizes
s{1} = size(train_x, 1);
s{2} = 100;
s{3} = 100;
s{4} = 100;
s{5} = 2;

% Initialize the parameters.
% You may set them to zero or give them small random values. Since the
% neural network optimization is non-convex, your algorithm may get
% stuck in a local minimum which may be caused by the initial values
% you assigned.
for i = 1 : numOfHiddenLayer
    W{i} = ..
    b{i} = ..
end

Training epochs, Back-propagation

% The training data is divided into several batches of size 100 for efficiency
batchsize = 100;                              % Here we use 100 as an example
numbatches = size(train_x, 2) / batchsize;    % Num of batches

%% Training part
alpha = 0.01;       % Learning rate
lambda = 0.001;     % Lambda is for regularization
numepochs = 20;     % Num of iterations

% sigm is the logistic sigmoid used below; define it if it is not already available
sigm = @(x) 1 ./ (1 + exp(-x));

losses = [];
train_errors = [];
test_wrongs = [];

for j = 1 : numepochs
    % Randomly rearrange the training data for each epoch.
    % We keep the shuffled index in kk, so that the input and output
    % could be matched together.
    kk = randperm(size(train_x, 2));

    for l = 1 : numbatches
        % Set the activation of the first layer to be the training data
        % while the target is the training labels.
        % Here we perform mini-batch stochastic gradient descent:
        % if batchsize = 1, it would be stochastic gradient descent;
        % if batchsize = N, it would be basic gradient descent.
        a{1} = train_x(:, kk( (l-1)*batchsize+1 : l*batchsize ) );
        y    = train_y(:, kk( (l-1)*batchsize+1 : l*batchsize ) );

        % Forward propagation, layer by layer
        for i = 2 : numOfHiddenLayer + 1
            a{i} = sigm( bsxfun(@plus, W{i-1}*a{i-1}, b{i-1}) );
        end

        % Calculate the error and back-propagate it layer by layer
        d{numOfHiddenLayer + 1} = -(y - a{numOfHiddenLayer + 1}) ...
            .* a{numOfHiddenLayer + 1} .* (1 - a{numOfHiddenLayer + 1});
        for i = numOfHiddenLayer : -1 : 2
            d{i} = W{i}' * d{i+1} .* a{i} .* (1 - a{i});
        end

        % Calculate the gradients we need to update the parameters.
        % L2 regularization is used for W.
        for i = 1 : numOfHiddenLayer
            dW{i} = d{i+1} * a{i}';
            db{i} = sum(d{i+1}, 2);
            W{i} = W{i} - alpha * (dW{i} + lambda * W{i});
            b{i} = b{i} - alpha * db{i};
        end
    end

Performance Evaluation

    %% Do some evaluation to know the performance (still inside the epoch loop)
    a{1} = test_x;

    % Forward propagation
    for i = 2 : numOfHiddenLayer + 1
        % This is essentially doing W{i-1}*a{i-1} + b{i-1}, but since they
        % have different dimensionalities, this addition is not allowed in
        % MATLAB. Another way to do it is to use repmat.
        a{i} = sigm( bsxfun(@plus, W{i-1}*a{i-1}, b{i-1}) );
    end

    % Here we calculate the sum-of-squares error as the loss
    loss = sum(sum((test_y - a{numOfHiddenLayer + 1}).^2)) / size(test_x, 2);

    % Count the no. of misclassifications so that we can compare it with
    % other classification methods.
    % If we let max return two values, the first one represents the max
    % value and the second one represents the corresponding index. Since we
    % care only about the class the model chooses, we drop the max value
    % (using ~ to take its place) and keep the index.
    [~, ind_] = max(a{numOfHiddenLayer + 1});
    [~, ind]  = max(test_y);
    test_wrong = sum( ind_ ~= ind ) / size(test_x, 2) * 100;

    % Calculate the training error over mini-batches
    bs = 2000;                          % minibatch size
    nb = size(train_x, 2) / bs;         % no. of mini-batches
    train_error = 0;
    % Here we go through all the mini-batches
    for ll = 1 : nb
        % Use a submatrix to pick out each mini-batch
        a{1} = train_x(:, (ll-1)*bs+1 : ll*bs );
        yy   = train_y(:, (ll-1)*bs+1 : ll*bs );
        for i = 2 : numOfHiddenLayer + 1
            a{i} = sigm( bsxfun(@plus, W{i-1}*a{i-1}, b{i-1}) );
        end
        train_error = train_error + sum(sum((yy - a{numOfHiddenLayer + 1}).^2));
    end
    train_error = train_error / size(train_x, 2);

    losses = [losses loss];
    test_wrongs = [test_wrongs, test_wrong];
    train_errors = [train_errors train_error];
end                                     % end of the epoch loop

Note: the max calculation returns both the maximum value and its index.

Efficiency of Backpropagation

• Computational efficiency is the main aspect of back-prop
• The number of operations needed to compute the derivatives of the error function scales with the total number W of weights
• A single evaluation of the error function for a single input requires O(W) operations (for large W)
• This is in contrast to O(W²) for numerical differentiation, as seen next

Another Approach: Numerical Differentiation

• Compute derivatives using the method of finite differences
• Perturb each weight in turn and approximate the derivatives by

∂E_n/∂w_ji = [ E_n(w_ji + ε) − E_n(w_ji) ] / ε + O(ε),   where ε << 1

• Accuracy is improved by making ε smaller, until round-off problems arise
• Accuracy can be improved by using central differences:

∂E_n/∂w_ji = [ E_n(w_ji + ε) − E_n(w_ji − ε) ] / (2ε) + O(ε²)

• This is O(W²)
• Useful to check whether software for backprop has been correctly implemented (for some test cases)
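Such a check is easy to code. The sketch below compares the backprop derivative with a central-difference estimate for one first-layer weight of a hypothetical tiny network; all sizes, data and the error function are illustrative assumptions:

% Gradient check by central differences for one weight of a tiny network
D = 3; M = 4; K = 2;
W1 = 0.1*randn(M, D);  W2 = 0.1*randn(K, M);
x = randn(D, 1);  t = randn(K, 1);
En = @(W1, W2) 0.5 * sum((W2 * tanh(W1 * x) - t).^2);   % per-example error

% Backprop derivative for weight w_ji of the first layer (here j=2, i=3)
z  = tanh(W1 * x);
dk = W2 * z - t;                       % output deltas
dj = (1 - z.^2) .* (W2' * dk);         % hidden deltas
backprop_deriv = dj(2) * x(3);         % dEn/dw_ji = delta_j * x_i

% Central-difference estimate for the same weight
epsn = 1e-6;
W1p = W1;  W1p(2,3) = W1p(2,3) + epsn;
W1m = W1;  W1m(2,3) = W1m(2,3) - epsn;
numeric_deriv = (En(W1p, W2) - En(W1m, W2)) / (2*epsn);

fprintf('backprop: %.8f   numeric: %.8f\n', backprop_deriv, numeric_deriv);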


Summary of Backpropagation

• Derivatives of error function wrt weights are obtained by propagating errors backward • It is more efficient than numerical differentiation • It can also be used for other computations • As seen next for Jacobian

The Jacobian Matrix

• For a vector-valued output y = {y_1,..,y_m} with vector input x = {x_1,..,x_n}, the Jacobian matrix organizes all the partial derivatives into an m x n matrix

J_ki = ∂y_k / ∂x_i

• For a neural network this is a K by D+1 matrix (one row per output, one column per input, including the bias input)
• The determinant of the Jacobian matrix is referred to simply as the Jacobian

Jacobian Matrix Evaluation

• In backprop, derivatives of error function wrt weights are obtained by propagating errors backwards through the network

• The technique of backpropagation can also be used to calculate other derivatives • Here we consider the Jacobian matrix • Whose elements are derivatives of network outputs wrt inputs

J_ki = ∂y_k / ∂x_i

• where each such derivative is evaluated with all other inputs held fixed

Use of Jacobian Matrix

• The Jacobian plays a useful role in systems built from several modules
• Each module has to be differentiable
• Suppose we wish to minimize an error E wrt a parameter w in the modular classification system shown here:

∂E/∂w = Σ_{k,j} (∂E/∂y_k)(∂y_k/∂z_j)(∂z_j/∂w)

• The Jacobian matrix for the red module appears in the middle term
• The Jacobian matrix provides a measure of the local sensitivity of the outputs to changes in each of the input variables

Summary of Jacobian Matrix Computation

• Apply the input vector corresponding to the point in input space where the Jacobian matrix is to be found
• Forward propagate to obtain the activations of the hidden and output units in the network
• For each row k of the Jacobian matrix, corresponding to output unit k:
  • backpropagate for all the hidden units in the network
  • finally, backpropagate to the inputs
• The implementation of such an algorithm can be checked using numerical differentiation in the form

∂y_k/∂x_i = [ y_k(x_i + ε) − y_k(x_i − ε) ] / (2ε) + O(ε²)
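For a small network the whole Jacobian can also be approximated directly from this formula. The sketch below uses a hypothetical two-layer tanh network; all sizes and values are illustrative:

% Numerical Jacobian J(k,i) = dy_k/dx_i of a tiny two-layer network
D = 3; M = 4; K = 2;
W1 = 0.1*randn(M, D);  W2 = 0.1*randn(K, M);
net = @(x) W2 * tanh(W1 * x);          % network outputs y(x), K x 1

x0 = randn(D, 1);                      % point where the Jacobian is evaluated
epsn = 1e-6;
J = zeros(K, D);
for i = 1 : D
    xp = x0;  xp(i) = xp(i) + epsn;    % perturb input i upward
    xm = x0;  xm(i) = xm(i) - epsn;    % and downward
    J(:, i) = (net(xp) - net(xm)) / (2*epsn);   % central-difference column
end
disp(J)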

Summary

• Neural network learning, i.e., learning the weights from samples, involves two steps:
  • determine the derivative of the output of a unit wrt each input
  • adjust the weights using the derivatives
• Backpropagation is a general term for computing derivatives

• Evaluate δk for all output units

• (using δk=yk-tk for regression)

• Backpropagate the δ_k s to obtain δ_j for each hidden unit
• The product of a δ with the activation at the input end of a weight provides the derivative for that weight
• Backpropagation is also useful to compute a Jacobian matrix with several inputs and outputs
• Jacobian matrices are useful to determine the effects of different inputs
