
Lecture 5: Stochastic Gradient Descent

CS4787 — Principles of Large-Scale Systems

Combining two principles we already discussed into one.

• Principle: Write your learning task as an optimization problem and solve it with a scalable optimization algorithm.

• Principle: Use subsampling to estimate a sum with something easier to compute.

Recall: we parameterized the hypotheses we wanted to evaluate with parameters w ∈ Rd, and want to solve the problem

\[
\text{minimize:} \quad R(h_w) = \frac{1}{n} \sum_{i=1}^n L(h_w(x_i), y_i) = f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w) \quad \text{over } w \in \mathbb{R}^d.
\]

Stochastic gradient descent (SGD). Basic idea: in gradient descent, just replace the full gradient (which is a sum) with the gradient of a single randomly chosen example. Initialize the parameters at some value w0 ∈ Rd, and decrease the value of the empirical risk iteratively by sampling a random index ĩt uniformly from {1, . . . , n} and then updating

\[
w_{t+1} = w_t - \alpha_t \cdot \nabla f_{\tilde{i}_t}(w_t)
\]

where as usual wt is the value of the parameter vector at time t, αt is the learning rate or step size, and ∇fi denotes the gradient of the loss of the ith training example. Compared with gradient descent and Newton's method, SGD is simple to implement and each iteration runs faster.
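To make the update concrete, here is a minimal numpy sketch of the SGD loop (not the course's reference implementation; the gradient oracle grad_fi and all parameter names are illustrative assumptions):

import numpy as np

def sgd(grad_fi, w0, n, alphas, num_iters, seed=0):
    """Minimal SGD sketch: grad_fi(w, i) should return ∇f_i(w), the gradient
    of the loss on training example i; alphas[t] is the step size α_t."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for t in range(num_iters):
        i = rng.integers(n)                # sample ĩ_t uniformly from {0, ..., n-1}
        w = w - alphas[t] * grad_fi(w, i)  # w_{t+1} = w_t - α_t ∇f_{ĩ_t}(w_t)
    return w

Each iteration touches a single training example, so its cost does not grow with n; a gradient descent step would instead sum ∇fi(w) over all n examples.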

A potential objection: this is not necessarily going to decrease the loss at every step! So we can't demonstrate convergence by using a proof like the one we used for gradient descent, where we showed that the loss decreases at every iteration of the algorithm. The fact that SGD doesn't always improve the loss at each iteration motivates the question: does SGD even work? And if so, why does SGD work?

Demo. Gradient descent versus stochastic gradient descent on linear regression. Why might it be fine to get an approximate solution to an optimization problem for training?
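The demo code itself is not reproduced here, but a rough sketch of how such a comparison might be set up for least-squares linear regression on synthetic data looks like the following (the dimensions, step size, and variable names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

def full_grad(w):       # ∇f(w) for f(w) = (1/2n) ||Xw - y||^2
    return X.T @ (X @ w - y) / n

def sample_grad(w, i):  # ∇f_i(w) for f_i(w) = (1/2) (x_i^T w - y_i)^2
    return X[i] * (X[i] @ w - y[i])

alpha = 0.05
w_gd, w_sgd = np.zeros(d), np.zeros(d)
for t in range(2000):
    w_gd = w_gd - alpha * full_grad(w_gd)                        # touches all n examples
    w_sgd = w_sgd - alpha * sample_grad(w_sgd, rng.integers(n))  # touches one example
print(np.linalg.norm(w_gd - w_true), np.linalg.norm(w_sgd - w_true))

Per iteration, SGD is roughly n times cheaper than gradient descent; the point of the demo is that an approximate minimizer is usually good enough for the learning task, even though SGD's iterates keep jittering around the optimum.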

Takeaway:

Why does SGD work? Unlike GD, SGD does not necessarily decrease the value of the loss at each step. Let's just try to analyze it in the same way that we did with gradient descent and see what happens. But first, we need some new assumption that characterizes how far the gradient samples can be from the true gradient. Assume that, for some constant G > 0, the magnitude of our gradient samples is bounded, for all w ∈ Rd and all i, by

\[
\left\|\nabla f_i(w)\right\| \le G.
\]

In other words, this is a global bound on the magnitude of the gradient samples, and is similar in spirit to the global bound on the magnitude used in Hoeffding's inequality. As before, we will also assume that for some constant L > 0, for all x in the space and for any vector u ∈ Rd,

\[
u^T \nabla^2 f(x)\, u \le L \left\|u\right\|^2.
\]
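As a concrete illustration (not worked out in these notes) of when these assumptions hold: for logistic regression with labels yi ∈ {−1, +1} and loss fi(w) = log(1 + exp(−yi xiT w)), one can check that

\[
\left\|\nabla f_i(w)\right\| \le \left\|x_i\right\|
\qquad \text{and} \qquad
u^T \nabla^2 f(w)\, u \le \frac{1}{4} \left(\max_i \left\|x_i\right\|^2\right) \left\|u\right\|^2,
\]

so whenever the features are bounded we can take G = maxi ∥xi∥ and L = (1/4) maxi ∥xi∥2. (For least-squares loss, by contrast, ∇fi(w) grows without bound in w, so the bound on G only holds over a bounded region of parameter space.)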

From here, we can analyze SGD like we did with gradient descent, first without assuming convexity. From Taylor's theorem, there exists a ξt such that

\[
\begin{aligned}
f(w_{t+1}) &= f\!\left(w_t - \alpha_t \nabla f_{\tilde{i}_t}(w_t)\right) \\
&= f(w_t) - \alpha_t \nabla f_{\tilde{i}_t}(w_t)^T \nabla f(w_t) + \frac{1}{2} \left(\alpha_t \nabla f_{\tilde{i}_t}(w_t)\right)^T \nabla^2 f(\xi_t) \left(\alpha_t \nabla f_{\tilde{i}_t}(w_t)\right) \\
&\le f(w_t) - \alpha_t \nabla f_{\tilde{i}_t}(w_t)^T \nabla f(w_t) + \frac{\alpha_t^2 L}{2} \left\|\nabla f_{\tilde{i}_t}(w_t)\right\|^2 \\
&\le f(w_t) - \alpha_t \nabla f_{\tilde{i}_t}(w_t)^T \nabla f(w_t) + \frac{\alpha_t^2 G^2 L}{2}.
\end{aligned}
\]

Now we're faced with a problem. The term

\[
-\alpha_t \nabla f_{\tilde{i}_t}(w_t)^T \nabla f(w_t)
\]

is not necessarily nonpositive, so we're not necessarily making any progress in the loss. The key insight: we are making progress in expectation. If we take the expected value of both sides of this expression (where the expectation is taken over the randomness in the sample selection ĩt), we get

\[
\begin{aligned}
E\left[f(w_{t+1})\right] &\le E\left[f(w_t) - \alpha_t \nabla f_{\tilde{i}_t}(w_t)^T \nabla f(w_t) + \frac{\alpha_t^2 G^2 L}{2}\right] \\
&= E\left[f(w_t)\right] - \alpha_t E\left[\nabla f_{\tilde{i}_t}(w_t)^T \nabla f(w_t)\right] + \frac{\alpha_t^2 G^2 L}{2}.
\end{aligned}
\]

Now, the expected value of the sampled gradient ∇fĩt(wt) given wt is

\[
E\left[\nabla f_{\tilde{i}_t}(w_t) \,\middle|\, w_t\right] = \sum_{i=1}^n \nabla f_i(w_t) \cdot P\left(\tilde{i}_t = i \,\middle|\, w_t\right) = \sum_{i=1}^n \nabla f_i(w_t) \cdot \frac{1}{n} = \nabla f(w_t),
\]

so

\[
E\left[f(w_{t+1})\right] \le E\left[f(w_t)\right] - \alpha_t E\left[\left\|\nabla f(w_t)\right\|^2\right] + \frac{\alpha_t^2 G^2 L}{2}.
\]
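The key unbiasedness step here, E[∇fĩt(wt) | wt] = ∇f(wt), is easy to sanity-check numerically. A tiny self-contained sketch for a least-squares objective (names and sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
w = rng.standard_normal(d)
per_example = X * (X @ w - y)[:, None]  # row i is ∇f_i(w) = x_i (x_i^T w - y_i)
full = X.T @ (X @ w - y) / n            # ∇f(w) = (1/n) Σ_i ∇f_i(w)
print(np.allclose(per_example.mean(axis=0), full))  # True: the uniform average of ∇f_i equals ∇f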

Rearranging the terms, summing up over T iterations, and telescoping the sum,

\[
\begin{aligned}
\sum_{t=0}^{T-1} \alpha_t E\left[\left\|\nabla f(w_t)\right\|^2\right] &\le \sum_{t=0}^{T-1} \left(E\left[f(w_t)\right] - E\left[f(w_{t+1})\right]\right) + \sum_{t=0}^{T-1} \frac{\alpha_t^2 G^2 L}{2} \\
&= f(w_0) - E\left[f(w_T)\right] + \frac{G^2 L}{2} \sum_{t=0}^{T-1} \alpha_t^2 \\
&\le f(w_0) - f^* + \frac{G^2 L}{2} \sum_{t=0}^{T-1} \alpha_t^2,
\end{aligned}
\]

where f* is the global optimum of f. This is a pretty nice expression, but we still need to do something useful with the term we bounded on the left. Here's one thing we can do: run SGD for a random number of iterations τ, where we run for τ = t iterations with probability proportional to αt; that is,

\[
P(\tau = t) = \frac{\alpha_t}{\sum_{k=0}^{T-1} \alpha_k}.
\]
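A minimal sketch of this random stopping rule (the schedule and variable names are illustrative; it assumes the iterates w0, . . . , wT−1 were stored during the run):

import numpy as np

rng = np.random.default_rng(0)
T = 1000
alphas = 1.0 / np.sqrt(np.arange(T) + 1.0)    # e.g. the α_t = (t+1)^{-1/2} schedule used below
tau = rng.choice(T, p=alphas / alphas.sum())  # P(τ = t) = α_t / Σ_k α_k
# ...then report the stored iterate w_tau as the output of SGD.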

Then the expected squared norm of the gradient ∇f(wτ) is

\[
E\left[\left\|\nabla f(w_\tau)\right\|^2\right] = \sum_{t=0}^{T-1} E\left[\left\|\nabla f(w_t)\right\|^2\right] \cdot P(\tau = t) = \left(\sum_{k=0}^{T-1} \alpha_k\right)^{-1} \sum_{t=0}^{T-1} \alpha_t E\left[\left\|\nabla f(w_t)\right\|^2\right],
\]

and so we can bound this with

\[
E\left[\left\|\nabla f(w_\tau)\right\|^2\right] \le \left(\sum_{k=0}^{T-1} \alpha_k\right)^{-1} \left(f(w_0) - f^* + \frac{G^2 L}{2} \sum_{t=0}^{T-1} \alpha_t^2\right).
\]

For example, if the learning rate is a constant αt = α, then

\[
E\left[\left\|\nabla f(w_\tau)\right\|^2\right] \le (T\alpha)^{-1} \left(f(w_0) - f^* + \frac{G^2 L}{2} T \alpha^2\right) = \frac{f(w_0) - f^*}{\alpha T} + \frac{\alpha L G^2}{2}.
\]
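As an aside not worked out in these notes: if T is known in advance, we can minimize this bound over the constant step size α (a short calculus exercise), which balances the two terms and gives

\[
\alpha = \sqrt{\frac{2\left(f(w_0) - f^*\right)}{L G^2 T}}
\qquad \Longrightarrow \qquad
E\left[\left\|\nabla f(w_\tau)\right\|^2\right] \le \sqrt{\frac{2\left(f(w_0) - f^*\right) L G^2}{T}},
\]

i.e. an O(1/√T) rate even with a constant step size, at the cost of needing to know (or estimate) f(w0) − f*, L, G, and T up front.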

The second term in this bound, αLG2/2, is our noise ball term: the term that is in some sense “causing” SGD to converge not to a point with zero gradient but rather to some region nearby. We can avoid this by decreasing the learning rate over time. The norm of the gradient of the output wτ will be guaranteed to go to zero if

\[
\sum_{t=0}^{T-1} \alpha_t \quad \text{grows much faster than} \quad \sum_{t=0}^{T-1} \alpha_t^2.
\]

One example of such a step size rule is αt = (t + 1)^{−1/2}. Then we have (these are approximations, but can be made rigorous with a bit of work)

\[
\sum_{t=0}^{T-1} \alpha_t = \sum_{t=0}^{T-1} \frac{1}{\sqrt{t+1}} \approx \int_0^T \frac{1}{\sqrt{x}} \, dx = 2\sqrt{T}
\]

and

\[
\sum_{t=0}^{T-1} \alpha_t^2 = \sum_{t=0}^{T-1} \frac{1}{t+1} \approx \int_1^{T+1} \frac{1}{x} \, dx = \log(T+1).
\]
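A quick numerical sanity check of these two approximations (purely illustrative):

import numpy as np

T = 10_000
t = np.arange(T)
print(np.sum(1.0 / np.sqrt(t + 1)), 2 * np.sqrt(T))  # ≈ 198.5 vs 200.0
print(np.sum(1.0 / (t + 1)), np.log(T + 1))          # ≈ 9.79 vs 9.21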

As a result, with this decreasing learning rate αt = (t + 1)^{−1/2}, we get

\[
\begin{aligned}
E\left[\left\|\nabla f(w_\tau)\right\|^2\right] &\lesssim \left(2\sqrt{T}\right)^{-1} \left(f(w_0) - f^* + \frac{G^2 L}{2} \log(T+1)\right) \\
&= \frac{2\left(f(w_0) - f^*\right) + G^2 L \log(T+1)}{4\sqrt{T}} = \tilde{O}\!\left(\frac{1}{\sqrt{T}}\right).
\end{aligned}
\]

This is, indeed, going to zero as the number of steps T increases.

How does this compare to the expression that we got for gradient descent?

Stochastic gradient descent for strongly convex objectives. The analysis above did not assume convexity. But how does SGD perform on strongly convex problems? As before, we start from the expression

\[
E\left[f(w_{t+1})\right] \le E\left[f(w_t)\right] - \alpha_t E\left[\left\|\nabla f(w_t)\right\|^2\right] + \frac{\alpha_t^2 G^2 L}{2}
\]

and apply the Polyak–Łojasiewicz condition,

\[
\left\|\nabla f(x)\right\|^2 \ge 2\mu \left(f(x) - f^*\right);
\]

this gives us

\[
E\left[f(w_{t+1})\right] \le E\left[f(w_t)\right] - 2\mu\alpha_t E\left[f(w_t) - f^*\right] + \frac{\alpha_t^2 G^2 L}{2}.
\]

Subtracting f* from both sides, we get

\[
E\left[f(w_{t+1}) - f^*\right] \le \left(1 - 2\mu\alpha_t\right) E\left[f(w_t) - f^*\right] + \frac{\alpha_t^2 G^2 L}{2}.
\]

From here we can use the same analysis that we used for gradient descent to analyze the convergence of this algorithm. As before, it will converge to a noise ball if the learning rate is fixed.
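To make the noise-ball claim concrete (this unrolling is not carried out in these notes, but follows by applying the recurrence above repeatedly with a constant step size αt = α < 1/(2µ) and summing the resulting geometric series):

\[
E\left[f(w_T) - f^*\right] \le \left(1 - 2\mu\alpha\right)^T \left(f(w_0) - f^*\right) + \frac{\alpha G^2 L}{4\mu},
\]

so the expected suboptimality decays geometrically, but only down to a floor of αG2L/(4µ); shrinking α shrinks this noise ball at the cost of a slower rate of decay.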
