
Neural Networks: Learning the Network, Part 2
11-785, Fall 2020, Lecture 4

Recap: Universal approximators
• Neural networks are universal approximators
– They can approximate any function, provided they have sufficient architecture
• We have to determine the weights and biases that make them model the target function

Recap: Approach
• Define a divergence div(Y, d) between the actual output Y and the desired output d of the network
– It must be differentiable: we can quantify how much a minuscule change of Y changes div(Y, d)
• Make all neuronal activations differentiable
– Differentiable: we can quantify how much a minuscule change of a neuron's input changes its output
• Differentiability enables us to determine whether a small change in any parameter of the network increases or decreases the divergence
– This will let us optimize the network

Recap: The expected divergence
• Minimize the expected "divergence" between the output of the net f(X; W) and the desired function g(X) over the input space:
Ŵ = argmin_W E_X[ div(f(X; W), g(X)) ]

Recap: Empirical Risk Minimization
• Problem: Computing the expected divergence requires knowledge of g(X) at all X, which we will not have
• Solution: Approximate it by the average divergence over a large number of "training" samples (X_i, d_i) drawn from the input distribution:
Loss(W) = (1/T) Σ_i div(f(X_i; W), d_i)
• Estimate the parameters W to minimize this "loss" instead

Problem Statement
• Given a training set of input-output pairs (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Minimize the following function w.r.t. W:
Loss(W) = (1/T) Σ_i div(f(X_i; W), d_i)
• This is a problem of function minimization
– An instance of optimization

A CRASH COURSE ON FUNCTION OPTIMIZATION

A brief note on derivatives..
• The derivative of a function at any point tells us how much a minute increment to the argument of the function will increment the value of the function
– For any f(x), it is expressed as a multiplier α to a tiny increment Δx that gives the increment to the output: Δy = α Δx
– Based on the fact that, at a fine enough resolution, any smooth, continuous function is locally linear at any point

Scalar function of scalar argument
• When x and y = f(x) are scalar:
– Derivative: Δy = α Δx
– α is often represented (using somewhat inaccurate notation) as dy/dx
– Or alternately (and more reasonably) as df(x)/dx

Multivariate scalar function: Scalar function of vector argument
• Note: X = [x_1, x_2, …, x_n] is now a vector; y = f(X) is still scalar
• Δy = (∂f/∂x_1) Δx_1 + (∂f/∂x_2) Δx_2 + … + (∂f/∂x_n) Δx_n, giving us that the derivative is a row vector:
∂f/∂X = [∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n]
• The partial derivative ∂f/∂x_i gives us how y increments when only x_i is incremented
• The derivative is often represented as ∇_X f(X)
– We will be using this symbol for vector and matrix derivatives
– You may be more familiar with the term "gradient", which is actually defined as the transpose of the derivative

Caveat about following slides
• The following slides speak of optimizing a function w.r.t. a variable "x"
• This is only mathematical notation. In our actual network optimization problem we would be optimizing w.r.t. the network weights "w"
• To reiterate: "x" in the slides represents the variable that we're optimizing a function over, and not the input to a neural network
• Do not get confused!

The problem of optimization
[Figure: a curve f(x) vs. x, marking a global maximum, an inflection point, a local minimum, and the global minimum]
• General problem of optimization: find the value of x where f(x) is minimum

Finding the minimum of a function
[Figure: a curve f(x) vs. x with its minimum marked]
• Find the value x at which df(x)/dx = 0
– Solve df(x)/dx = 0
• The solution is a "turning point"
– Derivatives go from positive to negative or vice versa at this point
• But is it a minimum?
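The recipe above (solve f'(x) = 0 for a turning point, then check the sign of the second derivative, as developed on the next slides) can be sketched numerically. This is a minimal illustration and not part of the lecture: the test function x³ − 3x, the bracketing intervals, and the finite-difference step sizes are all assumptions made for the example.

```python
import numpy as np

# Illustrative sketch (assumed example, not from the slides): locate turning points of
# f(x) = x**3 - 3*x by finding where f'(x) = 0, then classify them with f''(x).

def f(x):
    return x**3 - 3 * x        # has a maximum at x = -1 and a minimum at x = +1

def dfdx(x, h=1e-5):
    # central-difference estimate of the first derivative
    return (f(x + h) - f(x - h)) / (2 * h)

def d2fdx2(x, h=1e-4):
    # central-difference estimate of the second derivative
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

def find_turning_point(lo, hi, tol=1e-10):
    # bisection on f'(x); assumes f' changes sign on [lo, hi]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dfdx(lo) * dfdx(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

for lo, hi in [(-2.0, 0.0), (0.0, 2.0)]:
    x_star = find_turning_point(lo, hi)
    kind = "minimum" if d2fdx2(x_star) > 0 else "maximum"
    print(f"turning point near x = {x_star:.4f}: f'' = {d2fdx2(x_star):+.2f} -> {kind}")
```

On a non-convex function like this one, which bracket you search determines which turning point you find, which is exactly why the classification step matters.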
Turning Points
[Figure: a curve with the sign of the derivative (+, −, 0) marked along it; the derivative is 0 at the maxima and minima]
• Both maxima and minima have zero derivative
• Both are turning points

Derivatives of a curve
[Figure: a curve f(x) and its derivative f'(x)]
• Both maxima and minima are turning points
• Both maxima and minima have zero derivative

Derivative of the derivative of the curve
[Figure: a curve f(x), its derivative f'(x), and its second derivative f''(x)]
• Both maxima and minima are turning points
• Both maxima and minima have zero derivative
• The second derivative f''(x) is negative at maxima and positive at minima!

Solution: Finding the minimum or maximum of a function
• Find the value x at which df(x)/dx = 0: solve df(x)/dx = 0
• The solution x_soln is a turning point
• Check the double derivative at x_soln: compute f''(x_soln) = d²f(x)/dx² at x_soln
• If f''(x_soln) is positive, x_soln is a minimum; otherwise it is a maximum

A note on derivatives of functions of a single variable
[Figure: a curve with a maximum, a minimum, and an inflection point marked; the derivative is 0 at all three critical points, and the second derivative is positive, negative, and zero respectively]
• All locations with zero derivative are critical points
– These can be local maxima, local minima, or inflection points
• The second derivative is
– Positive (or 0) at minima
– Negative (or 0) at maxima
– Zero at inflection points
• It's a little more complicated for functions of multiple variables..

What about functions of multiple variables?
• The optimum point is still a "turning" point
– Shifting in any direction will increase the value
– For smooth functions, minuscule shifts will not result in any change at all
• We must find a point where shifting in any direction by a microscopic amount will not change the value of the function

A brief note on derivatives of multivariate functions

The Gradient of a scalar function
• The derivative ∇_X f(X) of a scalar function f(X) of a multivariate input X is a multiplicative factor that gives us the change in f(X) for tiny variations in X:
Δf(X) = ∇_X f(X) ΔX
– The gradient is the transpose of the derivative: ∇_X f(X)^T

Gradients of scalar functions with multivariate inputs
• Consider f(X) = f(x_1, x_2, …, x_n)
• Relation: Δf(X) = ∇_X f(X) ΔX
• This is a vector inner product. To understand its behavior, let's consider a well-known property of inner products

A well-known vector property
• The inner product between two vectors of fixed lengths is maximum when the two vectors are aligned
– i.e. when the angle between them is 0

Properties of Gradient
[Figure: the inner product ∇_X f(X) ΔX as a function of the angle between ΔX (the blue arrow) and the gradient]
• Δf(X) = ∇_X f(X) ΔX
• For an increment ΔX of any given length, Δf(X) is maximum if ΔX is aligned with ∇_X f(X)^T
– The function f(X) increases most rapidly if the input increment ΔX is exactly in the direction of ∇_X f(X)^T
• The gradient is the direction of fastest increase in f(X)

[Figure: contour plots of f(X) with the gradient vector ∇_X f(X)^T drawn at several points. Moving in the direction of the gradient increases f(X) fastest; moving in the opposite direction decreases f(X) fastest. The gradient is 0 at the maxima and minima of the surface]

Properties of Gradient: 2
[Figure: level curves of f(X) with the gradient vector drawn at a point on a curve]
• The gradient vector ∇_X f(X)^T is perpendicular to the level curve

The Hessian
• The Hessian of a function f(x_1, …, x_n) is given by the second derivative:
∇²_X f(x_1, …, x_n) =
[ ∂²f/∂x_1²       ∂²f/∂x_1∂x_2   …   ∂²f/∂x_1∂x_n ]
[ ∂²f/∂x_2∂x_1    ∂²f/∂x_2²      …   ∂²f/∂x_2∂x_n ]
[      ⋮                ⋮         ⋱        ⋮       ]
[ ∂²f/∂x_n∂x_1    ∂²f/∂x_n∂x_2   …   ∂²f/∂x_n²    ]
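As a small numerical companion to the gradient properties above, the sketch below estimates a gradient by finite differences and checks that, among increments of a fixed tiny length, the function increases most in the gradient direction, and that the gradient vanishes at the minimum. The test function f(X) = x_1² + 3·x_2², the step sizes, and the angular sampling are assumptions for illustration, not from the slides.

```python
import numpy as np

# Illustrative sketch (assumed example): the gradient is the direction of fastest
# increase, and it is zero at the minimum of this bowl-shaped function.

def f(X):
    return X[0]**2 + 3.0 * X[1]**2

def gradient(X, h=1e-6):
    # finite-difference estimate of the (column) gradient, i.e. the transposed derivative
    g = np.zeros_like(X)
    for i in range(len(X)):
        e = np.zeros_like(X)
        e[i] = h
        g[i] = (f(X + e) - f(X - e)) / (2 * h)
    return g

X = np.array([1.0, 2.0])
g = gradient(X)
eps = 1e-3

# try unit directions at many angles and record the increase f(X + eps*d) - f(X)
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
increases = [f(X + eps * np.array([np.cos(a), np.sin(a)])) - f(X) for a in angles]
best = angles[int(np.argmax(increases))]
best_dir = np.array([np.cos(best), np.sin(best)])

print("gradient direction      :", g / np.linalg.norm(g))
print("best increase direction :", best_dir)                # ≈ the gradient direction
print("gradient at the minimum :", gradient(np.zeros(2)))   # ≈ [0, 0]
```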
Returning to direct optimization…

Finding the minimum of a scalar function of a multivariate input
• The optimum point is a turning point
– The gradient will be 0

Unconstrained Minimization of a Function (Multivariate)
1. Solve for the X where the derivative (or gradient) equals zero:
∇_X f(X) = 0
2. Compute the Hessian matrix ∇²_X f(X) at the candidate solution and verify that
– the Hessian is positive definite (eigenvalues positive), to identify local minima
– the Hessian is negative definite (eigenvalues negative), to identify local maxima

Closed Form Solutions are not always available
[Figure: a curve f(X) vs. X with no convenient closed-form minimizer]
• Often it is not possible to simply solve ∇_X f(X) = 0
– The function to minimize/maximize may have an intractable form
• In these situations, iterative solutions are used
– Begin with a "guess" for the optimal X and refine it iteratively until the correct value is obtained

Iterative solutions
[Figure: a curve f(X) vs. X with successive estimates x0, x1, x2, x3, x4, x5 stepping toward the minimum]
• Iterative solutions
– Start from an initial guess x_0 for the optimal X
– Update the guess towards a (hopefully) "better" value of f(X)
– Stop when f(X) no longer decreases
• Problems:
– Which direction to step in
– How big must the steps be

The Approach of Gradient Descent
• Iterative solution:
– Start at some point
– Find the direction in which to shift this point to decrease the error
• This can be found from the derivative of the function:
– A negative derivative means moving right decreases the error
– A positive derivative means moving left decreases the error
– Shift the point in this direction

The Approach of Gradient Descent
• Iterative solution: trivial algorithm
– Initialize x_0
– While f'(x_k) ≠ 0:
• If the sign of f'(x_k) is positive: x_{k+1} = x_k − step
• Else: x_{k+1} = x_k + step
– What must step be to ensure we actually get to the optimum?
• Writing the update as x_{k+1} = x_k − sign(f'(x_k)) · step is identical to the previous algorithm
• Scaling the step by the derivative itself gives the update x_{k+1} = x_k − η_k f'(x_k), where η_k is the "step size"

Gradient descent/ascent (multivariate)
• The gradient descent/ascent method finds the minimum or maximum of a function iteratively
– To find a maximum, move in the direction of the gradient: x_{k+1} = x_k + η_k ∇_X f(x_k)^T
– To find a minimum, move exactly opposite the direction of the gradient: x_{k+1} = x_k − η_k ∇_X f(x_k)^T
• Many solutions exist for the step size η_k

Gradient descent convergence criteria
• The gradient descent algorithm converges when one of the following criteria is satisfied:
|f(x_{k+1}) − f(x_k)| < ε_1
• Or
‖∇_X f(x_k)‖ < ε_2

Overall Gradient Descent Algorithm
• Initialize:
– x_0, k = 0
• do
– x_{k+1} = x_k − η_k ∇_X f(x_k)^T
– k = k + 1
• while |f(x_{k+1}) − f(x_k)| > ε

Convergence of Gradient Descent
• For an appropriate step size, gradient descent will always find the minimum of a convex (bowl-shaped) function
• For non-convex functions it will find a local minimum or an inflection point

Returning to our problem..

Problem Statement
• Given a training set of input-output pairs (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Minimize the following function w.r.t. W:
Loss(W) = (1/T) Σ_i div(f(X_i; W), d_i)
• This is a problem of function minimization
– An instance of optimization
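To tie the crash course back to the problem statement, the sketch below runs the overall gradient descent loop on an empirical loss of the form Loss(W) = (1/T) Σ_i div(f(X_i; W), d_i). Everything concrete in it is an assumption for illustration: a linear model f(X; W) = W·X stands in for the network, the divergence is squared error, and the synthetic data, step size η, and stopping threshold are made up.

```python
import numpy as np

# Illustrative sketch (assumed example): gradient descent on an empirical loss
# Loss(W) = (1/T) * sum_i div(f(X_i; W), d_i), with a linear model standing in
# for the network and squared error as the divergence.

rng = np.random.default_rng(0)
T = 200
X = rng.normal(size=(T, 3))                  # training inputs X_i
W_true = np.array([1.5, -2.0, 0.5])
d = X @ W_true + 0.1 * rng.normal(size=T)    # desired outputs d_i

def loss(W):
    err = X @ W - d
    return np.mean(err**2)                   # (1/T) * sum_i div(f(X_i; W), d_i)

def loss_grad(W):
    err = X @ W - d
    return 2.0 * X.T @ err / T               # gradient of the empirical loss w.r.t. W

# Overall gradient descent loop: W <- W - eta * grad, stopping when the loss
# barely changes, i.e. |Loss(W_{k+1}) - Loss(W_k)| < eps.
W = np.zeros(3)
eta, eps = 0.1, 1e-9
prev = loss(W)
for k in range(10_000):
    W = W - eta * loss_grad(W)
    cur = loss(W)
    if abs(prev - cur) < eps:
        break
    prev = cur

print(f"stopped after {k + 1} steps, Loss = {cur:.4f}")
print("estimated W :", np.round(W, 2))       # ≈ W_true for this convex toy loss
```

Because this toy loss is convex, the loop recovers something close to W_true; for an actual network the same loop only guarantees a local minimum or an inflection point, as noted above.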