
Machine Learning

Gradient Descent

Sargur Srihari [email protected]

Topics

• Simple Descent/Ascent
• Difficulties with Simple Gradient Descent
• Line Search
• Brent’s Method
• Conjugate Gradient Descent
• Weight vectors to minimize error
• Stochastic Gradient Descent

Gradient-Based Optimization

• Most ML algorithms involve optimization
• Minimize/maximize a function f(x) by altering x
  – Usually stated as a minimization
  – Maximization is accomplished by minimizing −f(x)
• f(x) is referred to as the objective or criterion
  – In minimization it is also referred to as the cost, loss, or error
  – Example: f(x) = ½ ||Ax − b||²
  – Denote the optimum value by x* = argmin f(x)

Calculus in Optimization

• Consider a function y = f(x), with x and y real numbers
  – The derivative of the function is denoted f’(x) or dy/dx
• The derivative f’(x) gives the slope of f(x) at point x
  – It specifies how to scale a small change in input to obtain a corresponding change in the output: f(x + ε) ≈ f(x) + ε f’(x)
  – It tells how to make a small change in the input to make a small improvement in y
  – We know that f(x − ε sign(f’(x))) is less than f(x) for small ε; thus we can reduce f(x) by moving x in small steps with sign opposite to the derivative
• This technique is called gradient descent (Cauchy 1847)

Gradient Descent Illustrated

• The given function is f(x) = ½ x², which has a bowl shape with a global minimum at x = 0
  – Since f’(x) = x:
    • For x > 0, f(x) increases with x and f’(x) > 0
    • For x < 0, f(x) decreases with x and f’(x) < 0
• Use f’(x) to follow the function downhill
  – Reduce f(x) by going in the direction opposite to the sign of the derivative f’(x), as in the sketch below
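As a concrete illustration, the following is a minimal Python sketch of gradient descent on f(x) = ½x². The starting point, step size, and iteration count are illustrative choices, not values from the slides.

```python
# Gradient descent on f(x) = 0.5 * x**2, whose global minimum is at x = 0.

def f_prime(x):
    return x                  # derivative of 0.5 * x**2

x = 3.0                       # illustrative starting point
eta = 0.1                     # small positive step size (illustrative)
for _ in range(100):
    x = x - eta * f_prime(x)  # move opposite to the sign of the derivative

print(x)                      # ends very close to the minimum at 0
```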

Minimizing with Multiple Inputs

• We often minimize functions with multiple inputs: f : ℝⁿ → ℝ
• For minimization to make sense there must still be only one (scalar) output

Application in ML: Minimize Error

[Figure: the error surface E(w), a parabola with a single global minimum, shows the desirability of every weight vector; the negated gradient gives the direction in the w0–w1 plane producing steepest descent]

• Gradient descent search determines a weight vector w that minimizes E(w) by
  – Starting with an arbitrary initial weight vector
  – Repeatedly modifying it in small steps
  – At each step, modifying the weight vector in the direction that produces the steepest descent along the error surface

Definition of Gradient Vector

• The gradient (derivative) of E with respect to each component of the vector w:

    ∇E[w] ≡ [ ∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn ]

• Notice that ∇E[w] is a vector of partial derivatives
• It specifies the direction that produces the steepest increase in E
• The negative of this vector specifies the direction of steepest decrease

Directional Derivative

• The directional derivative in direction u (a unit vector) is the slope of the function f in direction u
  – This evaluates to uᵀ∇x f(x)
• To minimize f, find the direction in which f decreases the fastest
  – Do this using
      min_{u, uᵀu=1} uᵀ∇x f(x) = min_{u, uᵀu=1} ||u||₂ ||∇x f(x)||₂ cos θ
    • where θ is the angle between u and the gradient

• Substituting ||u||₂ = 1 and ignoring factors that do not depend on u, this simplifies to min_u cos θ
• This is minimized when u points in the direction opposite to the gradient
• In other words, the gradient points directly uphill, and the negative gradient points directly downhill

Gradient Descent Rule

• w ← w + Δw
  – where Δw = −η∇E[w]
  – η is a positive constant called the learning rate
    • It determines the step size of the gradient descent search
• Component form of gradient descent
  – Can also be written as

    wi ← wi + Δwi
  • where Δwi = −η ∂E/∂wi

Method of Gradient Descent

• The gradient points directly uphill, and the negative gradient points directly downhill
• Thus we can decrease f by moving in the direction of the negative gradient
  – This is known as the method of steepest descent or gradient descent
• Steepest descent proposes a new point

x' = x − η∇x f (x)

– where η is the learning rate, a positive scalar, set to a small constant

Simple Gradient Descent Procedure

Procedure Gradient-Descent (
  θ¹  // initial starting point
  f   // function to be minimized
  δ   // convergence threshold
)
  t ← 1
  do
    θ^(t+1) ← θ^t − η∇f(θ^t)
    t ← t + 1
  while ||θ^t − θ^(t−1)|| > δ
  return θ^t

Intuition: the Taylor expansion of the function f(θ) in the neighborhood of θ^t is f(θ) ≈ f(θ^t) + (θ − θ^t)ᵀ∇f(θ^t). Let θ = θ^(t+1) = θ^t + h; thus f(θ^(t+1)) ≈ f(θ^t) + hᵀ∇f(θ^t). The derivative of f(θ^(t+1)) with respect to h is ∇f(θ^t): for a step h of fixed size, the largest increase occurs at h = ∇f(θ^t) and the largest decrease at h = −∇f(θ^t). In other words, the slope ∇f(θ^t) points in the direction of steepest ascent; if we take a step η in the opposite direction, we decrease the value of f.
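The procedure above translates directly into NumPy; the sketch below is one way to write it. The fixed step size, the iteration cap, and the example quadratic are illustrative assumptions rather than part of the slides.

```python
import numpy as np

def gradient_descent(theta1, grad_f, eta=0.1, delta=1e-6, max_iter=10_000):
    """Repeat theta <- theta - eta * grad_f(theta) until successive iterates differ by less than delta."""
    theta = np.asarray(theta1, dtype=float)
    for _ in range(max_iter):
        theta_next = theta - eta * grad_f(theta)          # update step of the procedure
        if np.linalg.norm(theta_next - theta) <= delta:   # convergence test ||theta^t - theta^(t-1)|| <= delta
            return theta_next
        theta = theta_next
    return theta

# Example: f(theta) = theta_0**2 + theta_1**2 with gradient (2*theta_0, 2*theta_1)
theta_star = gradient_descent([3.0, -2.0], lambda th: 2 * th)
print(theta_star)   # very close to the minimum at (0, 0)
```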

One-dimensional example

• Let f(θ) = θ². This function has its minimum at θ = 0, which we want to determine using gradient descent
• We have f’(θ) = 2θ, and for gradient descent we update θ in the direction of −f’(θ)
  – If θ^t > 0 then f’(θ^t) is positive, thus θ^(t+1) < θ^t
  – If θ^t < 0 then f’(θ^t) = 2θ^t is negative, thus θ^(t+1) > θ^t

Ex: Gradient Descent on Least Squares

• Criterion to minimize: f(x) = ½ ||Ax − b||²
  – Least squares regression: E_D(w) = ½ Σ_{n=1}^{N} { t_n − wᵀφ(x_n) }²
• The gradient is ∇x f(x) = Aᵀ(Ax − b) = AᵀAx − Aᵀb
• Gradient descent:
  1. Set the step size η and tolerance δ to small, positive numbers

  2. While ||AᵀAx − Aᵀb||₂ > δ, do
       x ← x − η (AᵀAx − Aᵀb)
  3. End while
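A small NumPy sketch of this loop follows. The matrix A, vector b, step size, and tolerance are illustrative assumptions chosen so that the iteration converges for this particular problem.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))        # illustrative data
b = rng.normal(size=20)

x = np.zeros(3)
eta, delta = 0.01, 1e-8             # illustrative step size and tolerance
for _ in range(100_000):            # iteration cap as a safety guard
    grad = A.T @ A @ x - A.T @ b    # gradient of 0.5 * ||Ax - b||**2
    if np.linalg.norm(grad) <= delta:
        break
    x = x - eta * grad

# The result agrees with the closed-form least-squares solution:
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```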

Stationary Points, Local Optima

• When f’(x) = 0, the derivative provides no information about the direction of the move
• Points where f’(x) = 0 are known as stationary or critical points
  – Local minimum/maximum: a point where f(x) is lower/higher than at all neighboring points
  – Saddle points: neither maxima nor minima

Presence of Multiple Minima

• Optimization algorithms may fail to find the global minimum when multiple local minima are present
• We generally accept such solutions

Difficulties with Simple Gradient Descent

• Performance depends on the choice of the learning rate η in
    θ^(t+1) ← θ^t − η∇f(θ^t)
• Large learning rate: “overshoots”
• Small learning rate: extremely slow
• Solution
  – Start with a large η and settle on an optimal value
  – This requires a schedule for shrinking η, as in the sketch below
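One simple way to realize such a schedule is a decay rule like the following; the initial rate and decay constant are illustrative assumptions.

```python
def eta_schedule(t, eta0=1.0, decay=0.01):
    """Start with a large learning rate and shrink it as the iteration count t grows."""
    return eta0 / (1.0 + decay * t)

# eta_schedule(0) == 1.0, eta_schedule(100) == 0.5, eta_schedule(1000) is about 0.09
```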

Convergence of Steepest Descent

• Steepest descent converges when every element of the gradient is zero
  – In practice, very close to zero
• We may be able to avoid the iterative algorithm and jump directly to the critical point by solving the equation ∇x f(x) = 0 for x

Choosing η: Line Search

• We can choose η in several different ways
• Popular approach: set η to a small constant
• Another approach is called line search:

• Evaluate f(x − η∇x f(x)) for several values of η and choose the one that results in the smallest objective function value
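A minimal sketch of this grid-style line search follows; the candidate step sizes, the example function, and the number of outer iterations are illustrative choices.

```python
import numpy as np

def line_search_step(f, grad_f, x, etas=(1e-3, 1e-2, 1e-1, 1.0)):
    """Try each candidate eta and keep the point with the smallest objective value."""
    g = grad_f(x)
    return min((x - eta * g for eta in etas), key=f)

f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])

x = np.array([3.0, -2.0])
for _ in range(100):
    x = line_search_step(f, grad_f, x)
print(x)   # close to the minimum at (0, 0)
```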

Line Search (with ascent)

• In gradient ascent, we increase f by moving in the direction of the gradient

    θ^(t+1) ← θ^t + η∇f(θ^t)
• Intuition of line search: adaptive choice of the step size η at each step
  – Choose a direction of ascent and continue in that direction until we start to descend
  – Define a “line” in the direction of the gradient: g(η) = θ^t + η∇f(θ^t)
    • Note that this is a linear function of η and hence it is a “line”

Line Search: determining η

• Given g(η) = θ^t + η∇f(θ^t)
• Find three points η1 < η2 < η3 so that f(g(η2)) is larger than both f(g(η1)) and f(g(η3))

  – We say that η1 < η2 < η3 bracket a maximum
• If we find a suitable new point η′, we can form a new, tighter bracket such as η1 < η′ < η2
  – To find η′, use binary search:

    Choose η′ = (η1 + η3)/2
• The method ensures that the new bracket is half of the old one, as in the sketch below
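A sketch of this bracket-shrinking idea in the ascent setting: here phi(η) stands for f(g(η)), and the helper name, the example function, and the tolerance are illustrative assumptions.

```python
def shrink_bracket(phi, eta1, eta2, eta3, tol=1e-6):
    """Shrink a bracket eta1 < eta2 < eta3 with phi(eta2) >= phi(eta1) and phi(eta2) >= phi(eta3)."""
    while eta3 - eta1 > tol:
        eta_new = 0.5 * (eta1 + eta3)          # midpoint of the outer interval
        if eta_new == eta2:                    # nudge if the midpoint coincides with eta2
            eta_new = 0.5 * (eta2 + eta3)
        if phi(eta_new) > phi(eta2):
            # eta_new becomes the new middle point of a tighter bracket
            eta1, eta2, eta3 = (eta1, eta_new, eta2) if eta_new < eta2 else (eta2, eta_new, eta3)
        else:
            # eta2 stays the middle point; eta_new replaces the end point on its side
            eta1, eta2, eta3 = (eta_new, eta2, eta3) if eta_new < eta2 else (eta1, eta2, eta_new)
    return eta2

# Example: maximize phi(eta) = -(eta - 0.7)**2 starting from the bracket 0 < 0.5 < 2
print(shrink_bracket(lambda e: -(e - 0.7) ** 2, 0.0, 0.5, 2.0))   # approximately 0.7
```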

Brent’s Method

• Faster than line search
• Uses a quadratic fit based on the values of f at η1, η2, η3

[Figure: the solid line is f(g(η)); three points bracket its maximum; the dashed line shows the quadratic fit]

Conjugate Gradient Descent

• In simple gradient ascent:
  – Two consecutive gradients are orthogonal: ∇f(θ^(t+1)) is orthogonal to ∇f(θ^t)
  – Progress is zig-zag: progress toward the maximum slows down

[Figure: zig-zag progress on the quadratic f(x, y) = −(x² + 10y²) and on the exponential f(x, y) = exp[−(x² + 10y²)]]

• Solution:
  – Remember past directions and bias the new direction h_t as a combination of the current gradient g_t and the past direction h_(t−1), as in the sketch below
  – This gives faster convergence than simple gradient ascent
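A sketch of this idea on a simple quadratic. The mixing coefficient β below is the Fletcher-Reeves conjugate-gradient choice, and the matrix A, starting point, and exact line-search formula are illustrative assumptions valid for a quadratic objective (written here for descent).

```python
import numpy as np

# Conjugate directions on the quadratic f(x) = 0.5 * x^T A x with A = diag(2, 20),
# i.e. f(x) = x0**2 + 10 * x1**2.
A = np.diag([2.0, 20.0])
x = np.array([3.0, -2.0])           # illustrative starting point

g = -(A @ x)                        # steepest-descent direction (negative gradient)
h = g.copy()                        # first search direction is the plain gradient
for _ in range(2):                  # an n-dimensional quadratic needs at most n conjugate steps
    alpha = (g @ g) / (h @ A @ h)   # exact line search along h (valid for quadratics)
    x = x + alpha * h
    g_new = -(A @ x)
    beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves mixing coefficient
    h = g_new + beta * h               # bias the new direction with the previous one
    g = g_new

print(x)   # essentially (0, 0) after two steps, where plain gradient steps would zig-zag
```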

Sequential (On-line) Learning

• The maximum likelihood solution is
    w_ML = (ΦᵀΦ)⁻¹ Φᵀ t
• It is a batch technique
  – The entire training set is processed in one go (sketched below)
• It is computationally expensive for large data sets
  – Due to the huge N × M design matrix Φ
• The solution is to use a sequential algorithm where samples are presented one at a time
• This is called stochastic gradient descent
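For contrast with the sequential update on the next slide, here is a batch-solution sketch; Φ and t are illustrative random data, and np.linalg.solve is applied to the normal equations ΦᵀΦ w = Φᵀ t.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))     # illustrative N x M design matrix
t = rng.normal(size=100)            # illustrative targets

w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # w_ML = (Phi^T Phi)^{-1} Phi^T t
print(w_ml)
# Forming Phi^T Phi costs O(N M^2) and solving costs O(M^3), which is what motivates
# the one-sample-at-a-time update on the next slide when N and M are large.
```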

Stochastic Gradient Descent

• The error function E_D(w) = ½ Σ_{n=1}^{N} { t_n − wᵀφ(x_n) }² sums over the data
  – Denoting E_D(w) = Σ_n E_n, update the parameter vector w using
      w^(τ+1) = w^(τ) − η∇E_n
• Substituting for the derivative:
      w^(τ+1) = w^(τ) + η (t_n − w^(τ)ᵀφ(x_n)) φ(x_n)
• w is initialized to some starting vector w^(0)
• η is chosen with care to ensure convergence
• Known as the Least Mean Squares (LMS) algorithm, sketched below
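A minimal sketch of the LMS update; the synthetic data, identity basis functions, learning rate, and number of passes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 3
Phi = rng.normal(size=(N, M))                  # rows play the role of phi(x_n)
w_true = np.array([1.0, -2.0, 0.5])
t = Phi @ w_true + 0.1 * rng.normal(size=N)    # noisy targets

w = np.zeros(M)                                # w^(0)
eta = 0.01                                     # chosen small enough for convergence
for _ in range(50):                            # several passes over the data
    for n in rng.permutation(N):               # present one sample at a time
        w = w + eta * (t[n] - w @ Phi[n]) * Phi[n]   # LMS update

print(w)   # close to w_true
```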

SGD Performance

Most practitioners use SGD for deep learning (DL).

[Figure: fluctuations in the objective as gradient steps are taken in mini-batches]

The procedure consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing, as in the sketch below.
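A minimal mini-batch loop along these lines; the data, batch size, learning rate, squared-error objective, and the simple stopping test are all illustrative assumptions (in practice the objective is usually monitored as a running average over many steps).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # illustrative inputs
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)   # illustrative targets

w = np.zeros(5)
eta, batch = 0.05, 32
prev_obj = np.inf
while True:
    idx = rng.choice(len(X), size=batch, replace=False)   # a few examples
    err = X[idx] @ w - y[idx]                             # outputs minus targets
    w -= eta * X[idx].T @ err / batch                     # step along the average gradient
    obj = np.mean((X @ w - y) ** 2)                       # objective over the whole set
    if prev_obj - obj < 1e-8:                             # stopped decreasing (noisily)
        break
    prev_obj = obj

print(w)   # roughly recovers w_true
```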

Called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples.

Usually finds a good set of weights quickly compared to elaborate optimization techniques.

After training, performance is measured on a different test set: this tests the generalization ability of the machine, its ability to produce sensible answers on new inputs never seen during training.