Backpropagation
Sargur Srihari

Topics in Backpropagation
1. Forward Propagation
2. Loss Function and Gradient Descent
3. Computing derivatives using chain rule
4. Computational graph for backpropagation
5. Backprop algorithm
6. The Jacobian matrix

A neural network with one hidden layer
• D input variables x_1, .., x_D
• M hidden unit activations
  a_j = ∑_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)},  where j = 1, .., M
• Hidden unit activation functions: z_j = h(a_j)
• K output activations
  a_k = ∑_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)},  where k = 1, .., K
• Output activation functions: y_k = σ(a_k)
• Combining, for the augmented network:
  y_k(x, w) = σ( ∑_{j=1}^{M} w_{kj}^{(2)} h( ∑_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} ) + w_{k0}^{(2)} )
• No. of weights in w: T = (D+1)M + (M+1)K = M(D+K+1) + K

Matrix Multiplication: Forward Propagation
• Each layer is a function of the layer that preceded it
• First layer: z = h(W^{(1)T} x + b^{(1)})
• Second layer: y = σ(W^{(2)T} z + b^{(2)})
• Note that W is a matrix rather than a vector
• Example with D = 3, M = 3:
  x = [x_1, x_2, x_3]
  W_1^{(1)} = [W_11, W_12, W_13]^T,  W_2^{(1)} = [W_21, W_22, W_23]^T,  W_3^{(1)} = [W_31, W_32, W_33]^T
  W_1^{(2)} = [W_11, W_12, W_13]^T,  W_2^{(2)} = [W_21, W_22, W_23]^T,  W_3^{(2)} = [W_31, W_32, W_33]^T
• In matrix multiplication notation, the first network layer and the network output are each a matrix product followed by an elementwise nonlinearity

Loss and Regularization
• y = f(x, w)
• E = (1/N) ∑_{i=1}^{N} E_i( f(x^{(i)}, w), t_i )
• Forward pass: x is propagated to the output y and the loss E_i
• Backward pass: the gradient of E_i + R is propagated back, where R(W) is a regularization term

Gradient Descent
• Goal: determine weights w from a labeled set of training samples
• Learning procedure has two stages
  1. Evaluate derivatives of loss ∇E(w) with respect to weights w_1, .., w_T
  2. Use the derivative vector to compute adjustments to the weights:
     w^{(τ+1)} = w^{(τ)} − η ∇E(w^{(τ)}),  where  ∇E(w) = [ ∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_T ]^T
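As a concrete illustration of the forward pass and the gradient-descent update above, here is a minimal NumPy sketch. The function names, the choice of tanh for h and a logistic sigmoid for σ, and the (M, D)/(K, M) weight-matrix shapes are assumptions made for the example, not taken from the slides.

import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward propagation through one hidden layer."""
    a_hidden = W1 @ x + b1                 # a_j = sum_i w_ji x_i + w_j0
    z = np.tanh(a_hidden)                  # z_j = h(a_j), here h = tanh
    a_out = W2 @ z + b2                    # a_k = sum_j w_kj z_j + w_k0
    y = 1.0 / (1.0 + np.exp(-a_out))       # y_k = sigma(a_k), here a logistic sigmoid
    return y, z

def gradient_descent_step(w, grad_E, eta):
    """One update w^(tau+1) = w^(tau) - eta * grad E(w^(tau))."""
    return w - eta * grad_E

# Example with D = 3 inputs, M = 3 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
D, M, K = 3, 3, 2
W1, b1 = rng.standard_normal((M, D)), np.zeros(M)
W2, b2 = rng.standard_normal((K, M)), np.zeros(K)
y, z = forward(np.array([1.0, 2.0, 3.0]), W1, b1, W2, b2)
print(y)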
Derivative of composite function with one weight
• A single weight w feeds two paths to the output: y = f(w) with p = g(y) on one path, and z = f(w) with q = h(z) on the other
• Composite function: o = k(p, q) = k( g(f(w)), h(f(w)) )
• The chain rule sums the contributions of the two paths (Path 1 + Path 2):
  ∂o/∂w = (∂o/∂p)(∂p/∂y)(∂y/∂w) + (∂o/∂q)(∂q/∂z)(∂z/∂w)
        = (∂k(p,q)/∂p) g′(y) f′(w) + (∂k(p,q)/∂q) h′(z) f′(w)

Derivative of a composite function with four inputs
• E(a, b, c, d) = e = a·b + c·log d
• Computational graph: e = u + v, u = a·b, v = c·t, t = log d
• We want to compute the derivatives of the output wrt the input values a = 2, b = 3, c = 5, d = 10
• Derivatives by inspection:
  ∂e/∂u = 1,  ∂e/∂v = 1,  ∂u/∂a = b,  ∂u/∂b = a,  ∂v/∂c = log d,  ∂v/∂t = c,  ∂t/∂d = 1/d
• Multiplying the derivatives along each path of the graph:
  ∂e/∂a = b = 3,  ∂e/∂b = a = 2,  ∂e/∂c = log d = 1,  ∂e/∂d = c·(1/d) = 0.5

Example of Derivative Computation
• Derivatives of f = (x + y)·z wrt x, y, z, worked on a computational graph (shown graphically on the slide)

Derivatives for a neuron: z = f(x, y)
• (shown graphically on the slide)

Composite Function
• Consider a composite function f(g(h(x))), i.e., an outer function f, an inner function g and a final inner function h(x)
• Say f(x) = e^{sin(x²)}; we can decompose it as f(x) = e^x, g(x) = sin x and h(x) = x², or f(g(h(x))) = e^{g(h(x))}
• Its computational graph chains h, g and f
• Every connection is an input, every node is a function or operation

Derivatives of Composite function
• To get the derivatives of f(g(h(x))) = e^{g(h(x))} wrt x:
  1. We use the chain rule
     df/dx = (df/dg)·(dg/dh)·(dh/dx),  where
     df/dg = e^{g(h(x))}  since f(g(h(x))) = e^{g(h(x))} and the derivative of e^x is e^x
     dg/dh = cos(h(x))  since g(h(x)) = sin h(x) and the derivative of sin is cos
     dh/dx = 2x  because h(x) = x² and its derivative is 2x
     Therefore df/dx = e^{g(h(x))} · cos h(x) · 2x = e^{sin x²} · cos x² · 2x
     In each of these cases we pretend that the inner function is a single variable and differentiate with respect to it
  2. Another way to view it: for f(x) = e^{sin(x²)}, create temporary variables u = sin v and v = x², so that f(u) = e^u, and draw the corresponding computational graph

Derivative using Computational Graph
• All we need to do is get the derivative of each node wrt each of its inputs
• With u = sin v, v = x², f(u) = e^u
• We can get whichever derivative we want by multiplying the 'connection' derivatives:
  df/dg = e^{g(h(x))},  dg/dh = cos(h(x)),  dh/dx = 2x
• Since f(x) = e^x, g(x) = sin x and h(x) = x²:
  df/dx = (df/dg)·(dg/dh)·(dh/dx) = e^{g(h(x))} · cos h(x) · 2x = e^{sin x²} · cos x² · 2x
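The multiply-the-connection-derivatives recipe above is easy to check numerically. Below is a small Python sketch (names are illustrative) that evaluates f(x) = e^{sin(x²)} together with df/dx as the product of the three local derivatives, and compares the result with a finite-difference approximation.

import math

def f_and_dfdx(x):
    v = x * x                    # v = h(x) = x^2
    u = math.sin(v)              # u = g(v) = sin v
    f = math.exp(u)              # f = e^u
    dv_dx = 2.0 * x              # dh/dx = 2x
    du_dv = math.cos(v)          # dg/dh = cos(h(x))
    df_du = math.exp(u)          # df/dg = e^{g(h(x))}
    return f, df_du * du_dv * dv_dx   # chain rule: product of connection derivatives

x = 1.3
f, dfdx = f_and_dfdx(x)
eps = 1e-6
numeric = (f_and_dfdx(x + eps)[0] - f_and_dfdx(x - eps)[0]) / (2 * eps)
print(dfdx, numeric)             # the two estimates should agree closely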
Evaluating the gradient
• Goal of this section: find an efficient technique for evaluating the gradient of an error function E(w) for a feed-forward neural network
• Gradient evaluation can be performed using a local message-passing scheme, in which information is alternately sent forwards and backwards through the network
• Known as error backpropagation, or simply backprop

Back-propagation Terminology and Usage
• Backpropagation means a variety of different things
  • Computing the derivative of the error function wrt the weights
  • In a second, separate stage the derivatives are used to compute the adjustments to be made to the weights
• Can be applied to error functions other than the sum of squared errors
• Used to evaluate other derivative matrices such as the Jacobian and Hessian matrices
• The second stage of weight adjustment using the calculated derivatives can be tackled using a variety of optimization schemes substantially more powerful than gradient descent

Overview of Backprop algorithm
• Choose random weights for the network
• Feed in an example and obtain a result
• Calculate the error for each node (starting from the last stage and propagating the error backwards)
• Update the weights
• Repeat with other examples until the network converges on the target output
• How to divide up the errors needs a little calculus

Evaluation of Error Function Derivatives
• Derivation of the back-propagation algorithm for
  • Arbitrary feed-forward topology
  • Arbitrary differentiable nonlinear activation function
  • Broad class of error functions
• Error functions of practical interest are sums of errors associated with each training data point:
  E(w) = ∑_{n=1}^{N} E_n(w)
• We consider the problem of evaluating ∇E_n(w) for the nth term in the error function
  • Derivatives are wrt the weights w_1, .., w_T
  • This can be used directly for sequential optimization or accumulated over the training set (for batch)

Simple Model (Multiple Linear Regression)
• Outputs y_k are linear combinations of inputs x_i:
  y_k = ∑_i w_{ki} x_i
• Error function for a particular input x_n is
  E_n = (1/2) ∑_k (y_{nk} − t_{nk})²
  where the summation is over all K outputs and y_{nk} = y_k(x_n, w)
• For a particular input x and weight w, the squared error is E = (1/2)(y(x, w) − t)², with
  ∂E/∂w = (y(x, w) − t)·x = δ·x
• Gradient of the error function wrt a weight w_ji:
  ∂E_n/∂w_ji = (y_nj − t_nj) x_ni,  i.e.  ∂E/∂w_ji = (y_j − t_j) x_i = δ_j · x_i
  a local computation involving the product of
  • the error signal y_nj − t_nj associated with the output end of link w_ji
  • the variable x_ni associated with the input end of link w_ji

Extension to a more complex multilayer Network
• Each unit computes a weighted sum of its inputs:
  a_j = ∑_i w_{ji} z_i
• z_i is the activation of a unit (or input) that sends a connection to unit j, and w_ji is the weight associated with that connection
• The output is transformed by a nonlinear activation function z_j = h(a_j)
• The variable z_i can be an input, and unit j could be an output
• For each input x_n in the training set, we calculate the activations of all hidden and output units by applying the above equations
  • This process is called forward propagation

Evaluation of Derivative of E_n wrt a weight w_ji
• The outputs of the various units depend on the particular input n
  • We shall omit the subscript n from the network variables
• Note that E_n depends on w_ji only via the summed input a_j to unit j
• We can therefore apply the chain rule for partial derivatives to give
  ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji)
  The derivative wrt the weight is given by the product of the derivative wrt the activation and the derivative of the activation wrt the weight
• We now introduce a useful notation: δ_j ≡ ∂E_n/∂a_j
  • where the δs are errors, as we shall see
• Using a_j = ∑_i w_{ji} z_i we can write ∂a_j/∂w_ji = z_i
• Substituting, we get ∂E_n/∂w_ji = δ_j z_i
  • i.e., the required derivative is obtained by multiplying the value of δ for the unit at the output end of the weight by the value of z at the input end of the weight
  • This takes the same form as for the simple linear model

Summarizing the evaluation of the derivative ∂E_n/∂w_ji
• By the chain rule for partial derivatives:
  ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji),  with a_j = ∑_i w_{ji} z_i
• Define δ_j ≡ ∂E_n/∂a_j; then ∂a_j/∂w_ji = z_i
• Substituting, we get ∂E_n/∂w_ji = δ_j z_i
• Thus the required derivative is obtained by multiplying the value of δ at the output end of the weight by the value of z at the input end of the weight
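To show how the δ_j z_i pattern turns into code, here is a sketch of the backward pass for the one-hidden-layer network used earlier. It assumes a sum-of-squares error E_n = (1/2) ∑_k (y_k − t_k)² with linear output units (so δ_k = y_k − t_k, as in the simple linear model) and h = tanh; the recursion δ_j = h′(a_j) ∑_k w_kj δ_k for the hidden units is the standard backpropagation formula and is used here without its derivation. All function and variable names are illustrative.

import numpy as np

def backprop_single_example(x, t, W1, b1, W2, b2):
    # Forward propagation
    a1 = W1 @ x + b1                        # hidden pre-activations a_j
    z = np.tanh(a1)                         # z_j = h(a_j), with h = tanh
    y = W2 @ z + b2                         # linear output units, y_k = a_k

    # Backward pass: the deltas
    delta_out = y - t                                # delta_k = dE_n/da_k = y_k - t_k
    delta_hidden = (1 - z**2) * (W2.T @ delta_out)   # delta_j = h'(a_j) * sum_k w_kj delta_k

    # Required derivatives: dE_n/dw = delta at the output end times activation at the input end
    grad_W2 = np.outer(delta_out, z)        # dE_n/dw_kj^(2) = delta_k * z_j
    grad_b2 = delta_out
    grad_W1 = np.outer(delta_hidden, x)     # dE_n/dw_ji^(1) = delta_j * x_i
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_W2, grad_b2

# Illustrative usage with small random weights
rng = np.random.default_rng(1)
D, M, K = 3, 4, 2
grads = backprop_single_example(rng.standard_normal(D), rng.standard_normal(K),
                                rng.standard_normal((M, D)), np.zeros(M),
                                rng.standard_normal((K, M)), np.zeros(K))
print([g.shape for g in grads])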