Optimization for Machine Learning
Stochastic Gradient Methods
Lecturers: Francis Bach & Robert M. Gower
Tutorials: Hadrien Hendrikx, Rui Yuan, Nidham Gazagnadou
African Master's in Machine Intelligence (AMMI), Kigali

Recap: Solving the Finite Sum Training Problem
Training Problem

The training problem is to minimize a loss L(w) over the parameters w. General methods seen in the first part:
● Gradient Descent
● Proximal gradient (ISTA)
● Fast proximal gradient (FISTA)

Finite Sum Training Problem

The training loss is a sum of terms, one datum function f_i per data point:

    min_w f(w) = (1/n) Σ_{i=1}^n f_i(w).

Can we use this sum structure?

Stochastic Gradient Descent
Is it possible to design a method that uses only the gradient of a single datum function at each iteration?

Unbiased Estimate. Let j be a random index sampled from {1, …, n} uniformly at random. Then

    E_j[∇f_j(w)] = (1/n) Σ_{i=1}^n ∇f_i(w) = ∇f(w),

so the gradient of one sampled datum function is an unbiased estimate of the full gradient. (EXE: verify this identity.)

Stochastic Gradient Descent (SGD). Sample j_t ∈ {1, …, n} uniformly at random and iterate

    w^{t+1} = w^t − α ∇f_{j_t}(w^t),

where α > 0 is the stepsize.
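A minimal sketch of the SGD update above. The least-squares datum functions, the data sizes, and the stepsize are illustrative choices, not from the slides:

```python
import numpy as np

def sgd(grad_i, w0, n, alpha, iters, rng):
    """Constant-stepsize SGD on f(w) = (1/n) sum_i f_i(w)."""
    w = w0.copy()
    for _ in range(iters):
        j = rng.integers(n)           # j sampled uniformly from {0, ..., n-1}
        w = w - alpha * grad_i(w, j)  # step along a single datum's gradient
    return w

# Toy finite sum: least squares, f_i(w) = 0.5 * (<x_i, w> - y_i)^2
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_gen = rng.standard_normal(d)                  # generating parameters
y = X @ w_gen + 0.5 * rng.standard_normal(n)    # noisy labels
grad_i = lambda w, j: (X[j] @ w - y[j]) * X[j]

w = sgd(grad_i, np.zeros(d), n, alpha=0.01, iters=5000, rng=rng)
# w ends up close to the minimizer, inside a stepsize-dependent neighborhood
```

Each step touches one row of X, so an iteration costs O(d) instead of the O(nd) of a full gradient.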
Let w* denote the optimal point, w* = argmin_w f(w).

Assumptions for Convergence

Strong Convexity: there exists μ > 0 such that, for all v, w,

    f(w) ≥ f(v) + ⟨∇f(v), w − v⟩ + (μ/2) ||w − v||².

Expected Bounded Stochastic Gradients: there exists B > 0 such that, for all w,

    E_j[||∇f_j(w)||²] ≤ B².
Complexity / Convergence

Theorem. If f is μ-strongly convex and E_j||∇f_j(w)||² ≤ B² for all w, then SGD with constant stepsize 0 < α ≤ 1/(2μ) satisfies

    E||w^t − w*||² ≤ (1 − 2αμ)^t ||w^0 − w*||² + αB²/(2μ).

Proof: Expanding the SGD update,

    ||w^{t+1} − w*||² = ||w^t − w*||² − 2α⟨∇f_{j_t}(w^t), w^t − w*⟩ + α²||∇f_{j_t}(w^t)||².

Taking expectation with respect to j_t and using that ∇f_{j_t}(w^t) is an unbiased estimator of ∇f(w^t),

    E_{j_t}||w^{t+1} − w*||² = ||w^t − w*||² − 2α⟨∇f(w^t), w^t − w*⟩ + α² E_{j_t}||∇f_{j_t}(w^t)||².

Strong convexity gives ⟨∇f(w^t), w^t − w*⟩ ≥ μ||w^t − w*||², and the bounded stochastic gradient assumption bounds the last term by α²B², hence

    E_{j_t}||w^{t+1} − w*||² ≤ (1 − 2αμ)||w^t − w*||² + α²B².

Taking total expectation and unrolling this recursion, with Σ_{k=0}^{t−1}(1 − 2αμ)^k ≤ 1/(2αμ), gives the result. ∎

[Figures: SGD iterates for stepsizes α = 0.01, 0.1, 0.2, 0.5 — larger stepsizes give faster initial progress but stagnate in a larger neighborhood of w*.]
Realistic Assumptions for Convergence

The bounded stochastic gradient assumption conflicts with strong convexity: a strongly convex f has ||∇f(w)|| ≥ μ||w − w*||, which is unbounded, and by Jensen E_j||∇f_j(w)||² ≥ ||∇f(w)||². More realistic assumptions:

● Strong quasi-convexity: f(w*) ≥ f(w) + ⟨∇f(w), w* − w⟩ + (μ/2)||w* − w||² for all w.
● Each f_i is convex and L_i-smooth; write L_max := max_i L_i.
● Finite gradient noise: σ² := (1/n) Σ_{i=1}^n ||∇f_i(w*)||² < ∞.
Assumptions for Convergence

EXE: Calculate the L_i's and L_max for the datum functions f_i of the training problem above.

HINT: A twice differentiable f_i is L_i-smooth if and only if ∇²f_i(w) ≼ L_i I for all w.
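The exercise's target functions are not reproduced here; as a hypothetical instance, take regularized least squares, f_i(w) = ½(⟨x_i, w⟩ − y_i)² + (λ/2)||w||², whose Hessian x_i x_iᵀ + λI is constant in w, so the hint gives L_i = ||x_i||² + λ:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 10, 0.1
X = rng.standard_normal((n, d))

# Hessian of f_i(w) = 0.5*(<x_i, w> - y_i)^2 + (lam/2)*||w||^2 is
# x_i x_i^T + lam*I, with largest eigenvalue ||x_i||^2 + lam.
L = np.sum(X * X, axis=1) + lam       # all L_i in one vectorized line
L_max = L.max()

# Cross-check one L_i against an explicit eigenvalue computation
H0 = np.outer(X[0], X[0]) + lam * np.eye(d)
assert np.isclose(np.linalg.eigvalsh(H0).max(), L[0])
```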
Relationship Between Smoothness Constants
EXE: Show that f is L-smooth with

    L ≤ (1/n) Σ_{i=1}^n L_i ≤ L_max.

Proof: From the Hessian definition of smoothness,

    ∇²f(w) = (1/n) Σ_{i=1}^n ∇²f_i(w)  ⟹  ||∇²f(w)|| ≤ (1/n) Σ_{i=1}^n ||∇²f_i(w)||.

Furthermore, ||∇²f_i(w)|| ≤ L_i for every w, so (1/n) Σ_i ||∇²f_i(w)|| ≤ (1/n) Σ_i L_i ≤ max_i L_i. The final result now follows by taking the max over w, then the max over i.
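The chain of inequalities can be checked numerically. Sticking with the illustrative regularized least-squares example (an assumption, not the slides' data), the Hessian of the sum is XᵀX/n + λI:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 100, 10, 0.1
X = rng.standard_normal((n, d))

L_i = np.sum(X * X, axis=1) + lam     # L_i = ||x_i||^2 + lam
H = X.T @ X / n + lam * np.eye(d)     # Hessian of f = (1/n) sum_i f_i
L = np.linalg.eigvalsh(H).max()       # smoothness constant of f itself

# L <= (1/n) sum_i L_i <= L_max, typically with a large gap on the left
assert L <= L_i.mean() <= L_i.max()
```

The gap between L and L_max matters: SGD stepsizes are limited by L_max, while GD only pays for L.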
Complexity / Convergence

Theorem. Under the realistic assumptions (strong quasi-convexity, each f_i convex and L_i-smooth, gradient noise σ²), SGD with constant stepsize α ≤ 1/(2L_max) satisfies

    E||w^t − w*||² ≤ (1 − αμ)^t ||w^0 − w*||² + 2ασ²/μ.

EXE: The steps of the proof are given in an exercise list for homework!

RMG, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, P. Richtarik (2019) arXiv:1901.09401. SGD: General Analysis and Improved Rates.

[Figure: SGD trajectory for α = 0.5 — fast initial progress, then persistent oscillation in a neighborhood of w*.]
Two remedies for the oscillation:
1) Start with big steps and end with smaller steps.
2) Try averaging the points.

SGD with Shrinking Stepsize
Sample j_t ∈ {1, …, n} uniformly at random and iterate

    w^{t+1} = w^t − α_t ∇f_{j_t}(w^t),

with a shrinking stepsize sequence α_t → 0. How should we sample j?
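A minimal sketch of the shrinking schedule, with α_t = c/(t₀ + t); the schedule constants and the toy least-squares problem are illustrative assumptions:

```python
import numpy as np

def sgd_shrinking(grad_i, w0, n, iters, rng, c=1.0, t0=50.0):
    """SGD with stepsizes alpha_t = c/(t0 + t): big steps early, small late."""
    w = w0.copy()
    for t in range(iters):
        j = rng.integers(n)
        w = w - (c / (t0 + t)) * grad_i(w, j)
    return w

# Toy finite sum: noisy least squares
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ np.ones(d) + 0.5 * rng.standard_normal(n)
grad_i = lambda w, j: (X[j] @ w - y[j]) * X[j]

w = sgd_shrinking(grad_i, np.zeros(d), n, iters=20000, rng=rng)
# unlike constant-stepsize SGD, the iterates no longer stall at a noise floor
```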
Does this converge?

[Figure: SGD with shrinking stepsize compared with Gradient Descent — the SGD iterates no longer stagnate, but progress slows as α_t shrinks.]
Complexity / Convergence

Theorem (shrinking stepsizes). Under the realistic assumptions, SGD with stepsizes α_t ∝ 1/t (after a warm-up phase with constant stepsize) satisfies

    E||w^t − w*||² = O(1/t).

The shrinking stepsizes remove the neighborhood: SGD now converges to w*, but only at a sublinear rate.

Stochastic Gradient Descent Compared with Gradient Descent
[Figure: Gradient Descent vs SGD — the SGD iterates are noisy. Take averages?]

SGD with (Late Start) Averaging
B. T. Polyak and A. B. Juditsky, SIAM Journal on Control and Optimization (1992). Acceleration of stochastic approximation by averaging.
Averaging the iterates, w̄^T = (1/(T+1)) Σ_{t=0}^T w^t, stored naively (keeping every iterate) is not efficient. How to make this efficient? Maintain the average online: w̄^{t+1} = w̄^t + (w^{t+1} − w̄^t)/(t+2).
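A sketch of the online average, with an `avg_start` knob for the "late start" variant; the default stepsize is an illustrative assumption:

```python
import numpy as np

def sgd_with_averaging(grad_i, w0, n, iters, rng, alpha=0.01, avg_start=0):
    """SGD that maintains the average of the iterates from step avg_start on,
    online in O(d) per step, instead of storing every iterate."""
    w = w0.copy()
    w_avg = w0.copy()
    k = 0                                    # iterates in the average so far
    for t in range(iters):
        j = rng.integers(n)
        w = w - alpha * grad_i(w, j)
        if t >= avg_start:                   # "late start": average the tail only
            k += 1
            w_avg = w_avg + (w - w_avg) / k  # running-mean update
    return w, w_avg
```

Setting `avg_start > 0` averages only the last iterates, which the plots below suggest is the better choice.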
SGD With and Without Averaging
[Figures: SGD with and without averaging — the averaged iterates start slow, but can reach higher accuracy. Only use averaging towards the end? Averaging only the last few iterates (in the plot, averaging starts partway through the run) keeps the early speed and the late accuracy.]

Comparison: GD and SGD for Strongly Convex Functions
Hiding constants that depend on σ², L, and L_max:

                          GD                       SGD
Iteration complexity:     O((L/μ) log(1/ε))        O(1/(με))
Cost of an iteration:     O(n)                     O(1)
Total complexity:         O(n (L/μ) log(1/ε))      O(1/(με))

To reach an ε-approximate solution E||w^t − w*||² ≤ ε, which of gradient descent and SGD is cheaper? What happens if ε is small? What happens if n is big?
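A back-of-envelope comparison of the two totals, with κ = L/μ and all constants (and the distinction between L and L_max) suppressed; the values of n, κ, and ε are illustrative assumptions:

```python
import numpy as np

# GD pays n per iteration over ~kappa*log(1/eps) iterations; SGD pays 1
# per iteration over ~kappa/eps iterations (constants hidden).
gd_total = lambda n, kappa, eps: n * kappa * np.log(1.0 / eps)
sgd_total = lambda kappa, eps: kappa / eps

n, kappa = 10**6, 10**2
print(sgd_total(kappa, 1e-3), gd_total(n, kappa, 1e-3))  # SGD far cheaper
print(sgd_total(kappa, 1e-8), gd_total(n, kappa, 1e-8))  # now GD is cheaper
```

Big n favors SGD (its total is independent of n); tiny ε favors GD (log(1/ε) beats 1/ε).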
Comparison SGD vs GD

[Figure: suboptimality vs time — SGD makes fast progress early but then stalls; GD starts slowly but converges linearly; the Stochastic Average Gradient (SAG) combines the cheap iterations of SGD with the linear convergence of GD.]

M. Schmidt, N. Le Roux, F. Bach (2016) Mathematical Programming. Minimizing Finite Sums with the Stochastic Average Gradient.

Practical SGD for Sparse Data

Lazy SGD Updates for Sparse Data

L2 regularizer + finite sum training problem + linear hypothesis: each datum function has the form

    f_i(w) = ℓ(⟨x_i, w⟩, y_i) + (λ/2)||w||²,  with  ∇f_i(w) = ℓ'(⟨x_i, w⟩, y_i) x_i + λw.
Assume each data point x_i is s-sparse. How many operations does each SGD step cost? The update splits into a rescaling of a dense vector, O(d), plus the addition of a sparse vector, O(s).
SGD step:

    w^{t+1} = (1 − αλ) w^t − α ℓ'(⟨x_i, w^t⟩, y_i) x_i.

EXE: Show that by storing w^t = β_t v^t and updating the scalar β_t and only the s coordinates of v^t touched by x_i, each SGD step can be done in O(s) operations.
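A sketch of the lazy update for the squared loss (so ℓ'(⟨x_i, w⟩, y) = ⟨x_i, w⟩ − y; the loss choice and array-based sparse format are assumptions). Since w⁺ = (1 − αλ)βv − αg x_i, setting β⁺ = (1 − αλ)β gives w⁺ = β⁺(v − (αg/β⁺) x_i), and only the nonzeros of x_i are touched:

```python
import numpy as np

def lazy_sgd_step(beta, v, x_idx, x_val, y, alpha, lam):
    """One lazy SGD step on the representation w = beta * v for
    f_i(w) = 0.5*(<x_i, w> - y)^2 + (lam/2)*||w||^2, where x_i is s-sparse
    with nonzero indices x_idx and values x_val.  Cost O(s), not O(d)."""
    pred = beta * np.dot(v[x_idx], x_val)    # <x_i, w> using only s nonzeros
    g = pred - y                             # scalar loss derivative
    beta = (1.0 - alpha * lam) * beta        # O(1): fold the rescaling into beta
    v[x_idx] = v[x_idx] - (alpha * g / beta) * x_val   # O(s) sparse correction
    return beta, v
```

This needs αλ < 1, and since β shrinks every step, in practice one renormalizes (w ← βv, β ← 1) every so often to avoid underflow.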
O(1) scaling + O(s) sparse add = O(s) per update.

Why Machine Learners Like SGD
Though we solve the finite sum training problem

    min_w (1/n) Σ_{i=1}^n f_i(w),

we want to solve the statistical learning problem: minimize the expected loss over an unknown expectation,

    min_w E_{(x,y)∼D}[ℓ(h_w(x), y)].

By sampling a fresh data point at each step, the SGD direction is an unbiased estimate of the gradient of this expected loss: SGD can solve the statistical learning problem!
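A sketch of this streaming view: each step draws a fresh example rather than revisiting a stored training set. The linear-plus-Gaussian data distribution and all constants are illustrative assumptions:

```python
import numpy as np

def streaming_sgd(sample, loss_grad, w0, iters, alpha=0.01):
    """SGD on the statistical learning problem: every step uses a FRESH draw
    from the data distribution, so each step is an unbiased step on the
    *expected* loss, not just the empirical one."""
    w = w0.copy()
    for _ in range(iters):
        x, y = sample()                  # (x, y) ~ D, drawn on the fly
        w = w - alpha * loss_grad(w, x, y)
    return w

# Hypothetical data distribution: linear model plus Gaussian noise
rng = np.random.default_rng(0)
d = 5
w_true = np.arange(1.0, 6.0)            # ground-truth parameters

def sample():
    x = rng.standard_normal(d)
    return x, x @ w_true + 0.1 * rng.standard_normal()

sq_grad = lambda w, x, y: (x @ w - y) * x
w = streaming_sgd(sample, sq_grad, np.zeros(d), iters=20000)
# w approaches the minimizer of the expected loss, here w_true
```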
Coding time!

Appendix

Proof of SGD with Averaging (SGDA), Part I:
Expanding the SGD update and taking expectation with respect to j_t (using the unbiased estimator),

    E_{j_t}||w^{t+1} − w*||² = ||w^t − w*||² − 2α_t⟨∇f(w^t), w^t − w*⟩ + α_t² E_{j_t}||∇f_{j_t}(w^t)||².

Convexity gives ⟨∇f(w^t), w^t − w*⟩ ≥ f(w^t) − f(w*), and the bounded stochastic gradients give E_{j_t}||∇f_{j_t}(w^t)||² ≤ B². Taking total expectation and re-arranging,

    2α_t E[f(w^t) − f(w*)] ≤ E||w^t − w*||² − E||w^{t+1} − w*||² + α_t² B².

Summing up for t = 1 to T telescopes the distance terms, and dividing by Σ_t α_t bounds the suboptimality of the averaged iterate via Jensen's inequality.

Proof Part II: SGD with averaging for non-smooth and strongly convex functions.

Complexity for Strongly Convex
Theorem (shrinking stepsize). For μ-strongly convex f with stepsizes α_t ∝ 1/(μt), SGD with averaging satisfies

    E[f(w̄^T) − f(w*)] = O(log T/(μT))

— a faster sublinear convergence than in the merely convex case.

SGD for Non-Smooth Functions
SGD Theory for Non-Smooth Functions

Assumptions:
● f(w) is convex
● Subgradients are bounded: E_j||g_j(w)||² ≤ B² for g_j(w) ∈ ∂f_j(w)
● There exists a minimizer w*

Convergence for Convex

Theorem for non-smooth (shrinking stepsize). With stepsizes α_t ∝ 1/√t, SGD with averaging satisfies

    E[f(w̄^T) − f(w*)] = O(log T/√T)

— sublinear convergence.

Ohad Shamir and Tong Zhang (2013) International Conference on Machine Learning. Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes.
Complexity for Strongly Convex

Theorem for non-smooth (shrinking stepsize). For μ-strongly convex f with stepsizes α_t ∝ 1/(μt), SGD with averaging satisfies

    E[f(w̄^T) − f(w*)] = O(log T/(μT))

— faster sublinear convergence than in the merely convex case.

Ohad Shamir and Tong Zhang (2013) International Conference on Machine Learning. Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes.