ECE 595: Machine Learning I
Lecture 05: Gradient Descent
Spring 2020
Stanley Chan
School of Electrical and Computer Engineering Purdue University
© Stanley Chan 2020. All Rights Reserved.

Outline
Mathematical Background
  Lecture 4: Intro to Optimization
  Lecture 5: Gradient Descent
Lecture 5: Gradient Descent
  Gradient Descent
    Descent Direction
    Step Size
    Convergence
  Stochastic Gradient Descent
    Difference between GD and SGD
    Why does SGD work?
Gradient Descent
The algorithm:
  x^(t+1) = x^(t) − α^(t) ∇f(x^(t)),   t = 0, 1, 2, ...,

where α^(t) is called the step size.
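The update rule above can be sketched in a few lines (a minimal illustration; the function name, starting point, and step size below are arbitrary choices, not part of the lecture):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, iters=200):
    """Iterate x <- x - alpha * grad(x) with a fixed step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# Toy example: f(x) = 0.5 * ||x||^2, so grad f(x) = x and the minimizer is 0.
x_min = gradient_descent(lambda x: x, x0=[4.0, -2.0])
print(x_min)
```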
Why is the direction −∇f(x)?
Recall (Lecture 4): If x* is optimal, then

  lim_{ε→0} (1/ε) [f(x* + εd) − f(x*)] = ∇f(x*)^T d ≥ 0,  for all d,

which gives ∇f(x*)^T d ≥ 0 for all d. But if x^(t) is not optimal, then we want f(x^(t) + εd) ≤ f(x^(t)). So

  lim_{ε→0} (1/ε) [f(x^(t) + εd) − f(x^(t))] = ∇f(x^(t))^T d ≤ 0,  for some d,

which gives ∇f(x^(t))^T d ≤ 0.
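The limit above is a one-sided directional derivative, which can be checked numerically with a finite difference (the function, point, and direction are chosen purely for illustration):

```python
import numpy as np

# Check that [f(x + eps*d) - f(x)] / eps approaches grad_f(x)^T d as eps -> 0.
f = lambda x: x[0]**2 + 3 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, -1.0])
d = np.array([0.5, 0.2])   # a descent direction at this x
eps = 1e-6
fd = (f(x + eps * d) - f(x)) / eps
print(fd, grad_f(x) @ d)   # both approximately 2*0.5 + (-6)*0.2 = -0.2
```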
Descent Direction
Pictorial illustration: ∇f(x) is perpendicular to the contour. A search direction d can be either on the positive side, ∇f(x)^T d ≥ 0, or on the negative side, ∇f(x)^T d < 0. Only directions on the negative side can reduce the cost. All such d are called descent directions.
The Steepest d
Previous slide: if x^(t) is not yet optimal, then some d gives ∇f(x^(t))^T d ≤ 0.
So, let us make ∇f(x^(t))^T d as negative as possible:

  d^(t) = argmin_{‖d‖₂ = δ} ∇f(x^(t))^T d.

We need δ to control the magnitude; otherwise d is unbounded. The solution is

  d^(t) = −∇f(x^(t)).

Why? By the Cauchy-Schwarz inequality,

  ∇f(x^(t))^T d ≥ −‖∇f(x^(t))‖₂ ‖d‖₂,

with the minimum attained when d is parallel to −∇f(x^(t)). Setting δ = ‖∇f(x^(t))‖₂ gives d^(t) = −∇f(x^(t)).

Steepest Descent Direction
Pictorial illustration: put a ball around the current point. All d inside the ball are feasible. Pick the one that minimizes ∇f(x)^T d. This direction must be parallel (but with opposite sign) to ∇f(x).
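A quick numerical check of this claim (the gradient value and the random directions below are arbitrary): among unit-norm directions, none makes ∇f(x)^T d smaller than d = −∇f(x)/‖∇f(x)‖, which attains the Cauchy-Schwarz lower bound −‖∇f(x)‖:

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.array([3.0, -4.0])            # suppose grad f(x) = g, so ||g|| = 5

steepest = -g / np.linalg.norm(g)    # the claimed minimizer on the unit ball
best_random = min(
    g @ (d / np.linalg.norm(d))      # g^T d over random unit directions
    for d in rng.standard_normal((1000, 2))
)
print(g @ steepest)                  # -||g|| = -5, the Cauchy-Schwarz bound
print(best_random)                   # never below g @ steepest
```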
Step Size
The algorithm:

  x^(t+1) = x^(t) − α^(t) ∇f(x^(t)),   t = 0, 1, 2, ...,

where α^(t) is called the step size.

1. Fixed step size: α^(t) = α.
2. Exact line search:

     α^(t) = argmin_α f(x^(t) + α d^(t)).

   E.g., if f(x) = (1/2) x^T H x + c^T x, then

     α^(t) = − ∇f(x^(t))^T d^(t) / (d^(t)T H d^(t)).
3. Inexact line search: Armijo / Wolfe conditions. See Nocedal-Wright, Chapter 3.1.

Convergence
Let x* be the global minimizer. Assume the following:
  f is twice differentiable, so that ∇²f exists.
  0 ≺ λ_min I ⪯ ∇²f(x) ⪯ λ_max I for all x ∈ ℝⁿ.
  Gradient descent is run with exact line search.
Then (Nocedal-Wright, Chapter 3, Theorem 3.3):

  f(x^(t+1)) − f(x*) ≤ (1 − λ_min/λ_max)² [f(x^(t)) − f(x*)]
                     ≤ (1 − λ_min/λ_max)⁴ [f(x^(t−1)) − f(x*)]
                     ⋮
                     ≤ (1 − λ_min/λ_max)^(2t) [f(x^(1)) − f(x*)].

Thus, f(x^(t)) → f(x*) as t → ∞.

Understanding Convergence
Gradient descent can be viewed as successive approximation. Approximate the function as

  f(x^t + d) ≈ f(x^t) + ∇f(x^t)^T d + (1/(2α)) ‖d‖².

We can show that the d minimizing the right-hand side is d = −α∇f(x^t). This suggests:
  Use a quadratic function to locally approximate f.
  The iteration converges when the step size α is not too big, i.e., when the curvature 1/α of the approximation is large enough.
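The geometric rate can be observed on a small quadratic (an illustrative sketch; H and the starting point are chosen to be a hard case for steepest descent):

```python
import numpy as np

# f(x) = 0.5 * x^T H x with H = diag(1, 10): lambda_min = 1, lambda_max = 10.
H = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ H @ x
rate = (1.0 - 1.0 / 10.0) ** 2         # (1 - lmin/lmax)^2 from the theorem

x = np.array([10.0, 1.0])
for _ in range(50):
    g = H @ x                          # gradient of the quadratic
    alpha = (g @ g) / (g @ H @ g)      # exact line search step for a quadratic
    f_old = f(x)
    x = x - alpha * g
    assert f(x) <= rate * f_old + 1e-12   # per-step geometric decrease
print(f(x))                            # driven to (nearly) zero
```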
Advice on Gradient Descent
Gradient descent is useful because:
  Simple to implement (compared to ADMM, FISTA, etc.)
  Low computational cost per iteration (no matrix inversion)
  Requires only the first-order derivative (no Hessian)
  The gradient is available in deep networks (via back-propagation)
  Most machine learning packages have built-in (stochastic) gradient descent
  You are welcome to implement your own, but you need to be careful
Be careful with:
  Convex non-differentiable problems, e.g., the ℓ1-norm
  Non-convex problems, e.g., ReLU in deep networks
  Getting trapped in local minima
  Inappropriate step size, a.k.a. learning rate
Consider more "transparent" tools such as CVX when:
  Formulating problems (no need to worry about the algorithm)
  Trying to obtain insights
Stochastic Gradient Descent
Most loss functions in machine learning problems are separable:

  J(θ) = (1/N) Σ_{n=1}^N L(g_θ(x^n), y^n) = (1/N) Σ_{n=1}^N J_n(θ).   (1)

For example:

  Square loss:
    J(θ) = Σ_{n=1}^N (g_θ(x^n) − y^n)²

  Cross-entropy loss:
    J(θ) = −Σ_{n=1}^N [ y^n log g_θ(x^n) + (1 − y^n) log(1 − g_θ(x^n)) ]

  Logistic loss:
    J(θ) = Σ_{n=1}^N log(1 + e^(−y^n θ^T x^n))

Full Gradient vs. Partial Gradient
Vanilla gradient descent:
  θ^(t+1) = θ^t − η_t ∇J(θ^t).   (2)

The main computation is evaluating ∇J(θ^t).
The full gradient of the loss is
  ∇J(θ) = (1/N) Σ_{n=1}^N ∇J_n(θ).   (3)

Stochastic gradient descent:

  ∇J(θ) ≈ (1/|B|) Σ_{n∈B} ∇J_n(θ),   (4)

where B ⊆ {1, ..., N} is a random subset, and |B| is the batch size.

SGD Algorithm
Algorithm (Stochastic Gradient Descent)
1. Given {(x^n, y^n) | n = 1, ..., N}.
2. Initialize θ (zero or random).
3. For t = 1, 2, 3, ...
     Draw a random subset B ⊆ {1, ..., N}.
     Update

       θ^(t+1) = θ^t − η_t (1/|B|) Σ_{n∈B} ∇J_n(θ^t).   (5)
If |B| = 1, then only one sample is used at a time. The approximate gradient is unbiased (see Appendix for proof):

  E[ (1/|B|) Σ_{n∈B} ∇J_n(θ) ] = ∇J(θ).
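The algorithm above can be sketched on a synthetic least-squares problem (a sketch only: the sizes, learning rate, and batch size are illustrative, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares data: y_n = x_n^T theta_true + small noise.
N, d = 1000, 5
X = rng.standard_normal((N, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.01 * rng.standard_normal(N)

theta = np.zeros(d)
for t in range(2000):
    B = rng.choice(N, size=32, replace=False)        # random mini-batch
    grad = X[B].T @ (X[B] @ theta - y[B]) / 32       # (1/|B|) sum of grad J_n
    theta = theta - 0.01 * grad
print(np.linalg.norm(theta - theta_true))            # small
```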
Interpreting SGD
We just showed that the SGD step is unbiased:

  E[ (1/|B|) Σ_{n∈B} ∇J_n(θ) ] = ∇J(θ).

An unbiased gradient implies that each update is
gradient + zero-mean noise
Step size: SGD with a constant step size does not converge. If θ* is a minimizer, then

  ∇J(θ*) = (1/N) Σ_{n=1}^N ∇J_n(θ*) = 0,  but  (1/|B|) Σ_{n∈B} ∇J_n(θ*) ≠ 0,

since B is only a subset. Typical strategy: start with a large step size and gradually decrease it, η_t → 0, e.g., η_t = t^(−a) for some constant a.

Perspectives of SGD
The classical optimization literature makes the following observations. Compared to GD on convex problems:
  SGD offers a trade-off between accuracy and efficiency
  More iterations
  Fewer gradient evaluations per iteration
  Noise is a by-product

Recent studies of SGD for non-convex problems find that:
  SGD works for training deep neural networks
  SGD finds a solution faster
  SGD finds better local minima
  Noise matters
GD compared to SGD

[Figure: comparison of GD and SGD.]
Smoothing the Landscape
Analyzing SGD is an active research topic. Here is one analysis, by Kleinberg et al. (ICML 2018, https://arxiv.org/pdf/1802.06175.pdf). The SGD step can be written as GD plus noise:

  x^(t+1) = x^t − η(∇f(x^t) + w^t) = [x^t − η∇f(x^t)] − ηw^t,

where y^t := x^t − η∇f(x^t) is the "ideal" location returned by GD. Let us analyze y^(t+1):

  y^(t+1) := x^(t+1) − η∇f(x^(t+1)) = (y^t − ηw^t) − η∇f(y^t − ηw^t).

Assume E[w] = 0. Then

  E[y^(t+1)] = y^t − η∇E[f(y^t − ηw^t)].

Smoothing the Landscape
Let us look at E[f(y^t − ηw^t)]:

  E[f(y − ηw)] = ∫ f(y − ηw) p(w) dw,

where p(w) is the distribution of w.
  ∫ f(y − ηw) p(w) dw is a convolution between f and p.
  Since p(w) ≥ 0 for all w, the convolution always smooths the function.
  The learning rate controls the smoothness:
    Too small: under-smoothed; you have not yet escaped the bad local minimum.
    Too large: over-smoothed; you may miss a local minimum.
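The smoothing effect is easy to see numerically. Below, a toy objective with many shallow wiggles is convolved with Gaussian noise by Monte Carlo averaging (the function, noise level, and η are all illustrative choices):

```python
import numpy as np

# f has a global quadratic trend plus fast wiggles (many shallow local minima).
# Averaging f(y - eta*w) over noise w, i.e., convolving f with the noise
# density, smooths the wiggles away.
f = lambda y: y**2 + 0.5 * np.sin(20 * y)

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)       # samples of the gradient noise
eta = 0.2                              # learning rate = smoothing width

ys = np.linspace(-1, 1, 201)
orig = f(ys)
smooth = np.array([np.mean(f(y - eta * w)) for y in ys])   # E[f(y - eta w)]

def count_local_minima(v):
    d = np.diff(v)
    return int(np.sum((d[:-1] < 0) & (d[1:] > 0)))

print(count_local_minima(orig))        # several local minima
print(count_local_minima(smooth))      # far fewer after smoothing
```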
Reading List
Gradient Descent
  S. Boyd and L. Vandenberghe, "Convex Optimization", Chapters 9.2-9.4.
  J. Nocedal and S. Wright, "Numerical Optimization", Chapters 3.1-3.3.
  Y. Nesterov, "Introductory Lectures on Convex Optimization", Chapter 2.
  CMU 10-725 lecture notes: https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/grad-descent.pdf
Stochastic Gradient Descent
  CMU 10-725 lecture notes: https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/stochastic-gd.pdf
  Kleinberg et al. (2018), "When Does SGD Escape Local Minima?", https://arxiv.org/pdf/1802.06175.pdf
Appendix
Proof of Unbiasedness of the SGD Gradient
Lemma: If n is a random variable with uniform distribution over {1, ..., N}, then

  E[ (1/|B|) Σ_{n∈B} ∇J_n(θ) ] = ∇J(θ).
Denote the probability mass function of n by p(n) = 1/N. Then,
  E[ (1/|B|) Σ_{n∈B} ∇J_n(θ) ] = (1/|B|) Σ_{n∈B} E[∇J_n(θ)]
                                = (1/|B|) Σ_{n∈B} Σ_{m=1}^N ∇J_m(θ) p(m)
                                = (1/|B|) Σ_{n∈B} (1/N) Σ_{m=1}^N ∇J_m(θ)
                                = (1/|B|) Σ_{n∈B} ∇J(θ)
                                = ∇J(θ).
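The lemma can also be sanity-checked by Monte Carlo: averaging many mini-batch gradients recovers the full gradient (toy per-sample gradients; the sizes and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
G = rng.standard_normal((N, d))       # row n plays the role of grad J_n(theta)
full_grad = G.mean(axis=0)            # grad J(theta) = (1/N) sum_n grad J_n

batch, trials = 10, 5000
est = np.mean(
    [G[rng.choice(N, size=batch, replace=False)].mean(axis=0)
     for _ in range(trials)],
    axis=0,
)
print(np.abs(est - full_grad).max())  # shrinks like 1/sqrt(trials)
```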
Q&A 1: What is the momentum method?
The momentum method was originally proposed by Polyak (1964). It takes the step

  x^(t+1) = x^t − α(β g^(t−1) + (1 − β) g^t),

where g^t = ∇f(x^t), and 0 < β < 1 is the damping constant. The momentum method can be applied to both gradient descent and stochastic gradient descent. A variant is Nesterov's accelerated gradient (NAG) method (1983); its importance is elaborated by Sutskever et al. (2013). The key idea of NAG is to write x^(t+1) as a linear combination of x^t and the span of the past gradients. Yurii Nesterov proved that such a combination is the best one can do with first-order methods.
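The two-gradient form above can be implemented directly (a sketch: the step size, damping, and quadratic test problem are illustrative, and practical variants usually keep a running average of all past gradients rather than only the last one):

```python
import numpy as np

def momentum_gd(grad, x0, alpha=0.1, beta=0.5, iters=200):
    """x_{t+1} = x_t - alpha * (beta * g_{t-1} + (1 - beta) * g_t)."""
    x = np.asarray(x0, dtype=float)
    g_prev = grad(x)                  # initialize the "previous" gradient
    for _ in range(iters):
        g = grad(x)
        x = x - alpha * (beta * g_prev + (1.0 - beta) * g)
        g_prev = g
    return x

# Quadratic test: f(x) = 0.5 * x^T diag(1, 10) x, so grad f(x) = diag(1, 10) x.
H = np.diag([1.0, 10.0])
x_min = momentum_gd(lambda x: H @ x, [10.0, 1.0])
print(x_min)                          # close to the minimizer [0, 0]
```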
Here are some references on the momentum method:
  Sutskever et al. (2013), "On the importance of initialization and momentum in deep learning", http://proceedings.mlr.press/v28/sutskever13.pdf
  UIC lecture note: https://www2.cs.uic.edu/~zhangx/teaching/agm.pdf
  Cornell lecture note: http://www.cs.cornell.edu/courses/cs6787/2017fa/Lecture3.pdf
  Yurii Nesterov, "Introductory Lectures on Convex Optimization", 2003 (see Assumption 2.1.4 and the discussion thereafter).
  G. Goh, "Why Momentum Really Works", https://distill.pub/2017/momentum/
Q&A 2: With exact line search, will we get to a minimum in one step?
No. Exact line search only lets you converge faster; it does not guarantee convergence in one step. An example is gradient descent on the Rosenbrock function.

[Figure: gradient descent on the Rosenbrock function.]
Q&A 3: Any example of gradient descent?
Consider the loss function

  J(θ) = (1/2) ‖Aθ − y‖².

Then the gradient is

  ∇J(θ) = A^T (Aθ − y).

So the gradient descent step is

  θ^(t+1) = θ^t + η A^T(Aθ^t − y),

where A^T(Aθ^t − y) = ∇J(θ^t). Since the loss is quadratic, you can find the exact line search step size (taking d = −∇J(θ^t)):

  η = −‖d‖² / (d^T A^T A d).
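Putting the pieces together, here is a sketch of this Q&A on a random instance (A, y, and the iteration count are arbitrary; the positive step along d = −∇J(θ) equals −η from the sign convention above):

```python
import numpy as np

# Hypothetical small least-squares instance to test the formulas above.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

theta = np.zeros(5)
for _ in range(100):
    g = A.T @ (A @ theta - y)                # gradient of J
    if np.linalg.norm(g) < 1e-12:            # already converged
        break
    d = -g                                   # steepest descent direction
    alpha = (d @ d) / (d @ (A.T @ (A @ d)))  # exact line search step (= -eta)
    theta = theta + alpha * d

theta_star = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.linalg.norm(theta - theta_star))    # agrees with the lstsq solution
```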
Q&A 4: In finding the steepest direction, why is δ unimportant?
The constraint ‖d‖ = δ is necessary for minimizing ∇f(x)^T d: without it, the problem is unbounded below, and the "solution" is −∞ times any direction d in the negative half-space of ∇f. For any δ, the solution (by the Cauchy-Schwarz inequality) is

  d = −δ ∇f(x) / ‖∇f(x)‖.

You can show that this d minimizes ∇f(x)^T d and satisfies ‖d‖ = δ. Now, if we use this d in the gradient descent step, the step size α will compensate for δ. So we can simply choose δ = 1 and the derivation above still works.
Q&A 5: What is a good batch size for SGD?
There is no definitive answer. Generally you need to look at the validation curve to determine whether to increase or decrease the mini-batch size. Here are some suggestions from the literature.

Bengio (2012), https://arxiv.org/pdf/1206.5533.pdf:
  "[batch size] is typically chosen between 1 and a few hundreds, e.g. [batch size] = 32 is a good default value, with values above 10 taking advantage of the speedup of matrix-matrix products over matrix-vector products."

Masters and Luschi (2018), https://arxiv.org/abs/1804.07612:
  "The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller, often as small as m = 2 or m = 4."