
Stochastic Second-Order Optimization Methods Part I: Convex

Fred Roosta

School of Mathematics and Physics, University of Queensland

Disclaimer

Disclaimer: To preserve brevity and flow, many cited results are simplified, slightly modified, or generalized; please see the cited work for the details. Unfortunately, due to time constraints, many relevant and interesting works are not mentioned in this presentation.

Problem Statement

Problem

$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} \overbrace{F(x) = \mathbb{E}_w f(x; w) = \int_\Omega \underbrace{f(x; w(\omega))}_{\text{e.g., loss/penalty + prediction function}} \, dP\big(w(\omega)\big)}^{\text{Risk}}$$

F: (non-)convex / (non-)smooth
High-dimensional: d ≫ 1

With the empirical (counting) measure over $\{w_i\}_{i=1}^n \subset \mathbb{R}^p$, i.e.,
$$\int_A dP(w) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{w_i \in A\}}, \quad \forall A \subseteq \Omega \;\Longrightarrow\; F(x) = \underbrace{\frac{1}{n} \sum_{i=1}^n f(x; w_i)}_{\text{finite-sum / empirical risk}}$$
“Big data”: n ≫ 1

Humongous Data / High Dimension

Classical algorithms ⇒ high per-iteration costs

Modern variants:

1. Low per-iteration costs

2. Maintain the original convergence rate

First Order Methods

Use only gradient (first-order) information.

E.g.:

$$x^{(k+1)} = x^{(k)} - \alpha^{(k)} \nabla F(x^{(k)})$$

Smooth convex F ⇒ sublinear, O(1/k)
Smooth strongly convex F ⇒ linear, O(ρ^k), ρ < 1
However, the per-iteration cost is high!

Stochastic variants, e.g., (mini-batch) SGD: for some small s, choose $\{w_i\}_{i=1}^s$ at random and

$$x^{(k+1)} = x^{(k)} - \frac{\alpha^{(k)}}{s} \sum_{i=1}^s \nabla f(x^{(k)}; w_i)$$

Cheap per-iteration costs! However, slower to converge:
Smooth convex F ⇒ O(1/√k)
Smooth strongly convex F ⇒ O(1/k)
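To make the update above concrete, here is a minimal NumPy sketch of the mini-batch SGD iteration; grad_f (a per-sample gradient), the fixed step size, and the batch size are illustrative assumptions, not part of the slides.

```python
import numpy as np

def minibatch_sgd(grad_f, x0, data, s=32, alpha=0.1, iters=1000, seed=0):
    """Mini-batch SGD: x <- x - (alpha / s) * sum_{i in batch} grad_f(x, w_i)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = len(data)
    for _ in range(iters):
        batch = rng.choice(n, size=min(s, n), replace=False)  # s samples chosen at random
        g = np.mean([grad_f(x, data[i]) for i in batch], axis=0)
        x = x - alpha * g                                      # fixed step size for simplicity
    return x
```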

Modifications...
Achieve the convergence rate of full GD
Preserve the per-iteration cost of SGD
E.g.: SAG, SDCA, SVRG, Prox-SVRG, Acc-Prox-SVRG, Acc-Prox-SDCA, S2GD, mS2GD, MISO, SAGA, AMSVRG, ...

Second Order Methods

Use both gradient and Hessian information

E.g., classical Newton's method:

$$x^{(k+1)} = x^{(k)} - \alpha^{(k)} \underbrace{[\nabla^2 F(x^{(k)})]^{-1} \nabla F(x^{(k)})}_{\text{non-uniformly scaled gradient}}$$

Smooth convex F ⇒ locally superlinear
Smooth strongly convex F ⇒ locally quadratic and globally linear
However, the per-iteration cost is much higher!

Outline

Part I: Convex
  Smooth: Newton-CG
  Non-Smooth: Proximal Newton; Semi-smooth Newton
Part II: Non-Convex
  Line-Search Based Methods: L-BFGS; Gauss-Newton; Natural Gradient
  Trust-Region Based Methods: Trust-Region; Cubic Regularization
Part III: Discussion and Examples

Finite Sum / Empirical Risk Minimization

FSM/ERM
$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} F(x) = \frac{1}{n} \sum_{i=1}^n f(x; w_i) = \frac{1}{n} \sum_{i=1}^n f_i(x)$$

$f_i$: convex and smooth
F: strongly convex ⇒ unique minimizer $x^\star$
n ≫ 1 and/or d ≫ 1

Iterative Scheme

$$y^{(k)} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ (x - x^{(k)})^T g^{(k)} + \frac{1}{2} (x - x^{(k)})^T H^{(k)} (x - x^{(k)}) \right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha_k \big( y^{(k)} - x^{(k)} \big)$$

With the same iterative scheme, different choices of $g^{(k)}$ and $H^{(k)}$ recover familiar methods:

Newton: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = \nabla^2 F(x^{(k)})$

Gradient Descent: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = I$

Frank-Wolfe: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = 0$

(mini-batch) SGD: $S_g \subset \{1, 2, \ldots, n\} \Longrightarrow g^{(k)} = \frac{1}{|S_g|} \sum_{j \in S_g} \nabla f_j(x^{(k)})$ and $H^{(k)} = I$

Sub-Sampled Newton:

Hessian Sub-Sampling:

$$g^{(k)} = \nabla F(x^{(k)}), \qquad S_H \subset \{1, 2, \ldots, n\} \;\Longrightarrow\; H^{(k)} = \frac{1}{|S_H|} \sum_{j \in S_H} \nabla^2 f_j(x^{(k)})$$

Gradient and Hessian Sub-Sampling:

$$S_g \subset \{1, 2, \ldots, n\} \;\Longrightarrow\; g^{(k)} = \frac{1}{|S_g|} \sum_{j \in S_g} \nabla f_j(x^{(k)})$$
$$S_H \subset \{1, 2, \ldots, n\} \;\Longrightarrow\; H^{(k)} = \frac{1}{|S_H|} \sum_{j \in S_H} \nabla^2 f_j(x^{(k)})$$

Sub-sampled Newton's Method

Let $\mathcal{X} = \mathbb{R}^d$, i.e., unconstrained optimization.

Iterative Scheme
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \left\{ p^T g^{(k)} + \frac{1}{2} p^T H^{(k)} p \right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Hessian Sub-Sampling
$$g^{(k)} = \nabla F(x^{(k)}), \qquad H^{(k)} = \frac{1}{|S_H|} \sum_{j \in S_H} \nabla^2 f_j(x^{(k)})$$

Global convergence, i.e., starting from any initial point.

Iterative Scheme

Descent direction: $H^{(k)} p^{(k)} \approx -g^{(k)}$

Step size (Armijo line search):
$$\alpha_k = \operatorname*{argmax}_{\alpha \le 1} \alpha \quad \text{s.t.} \quad F(x^{(k)} + \alpha p^{(k)}) \le F(x^{(k)}) + \alpha \beta \, (p^{(k)})^T \nabla F(x^{(k)})$$

Update: $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$

Algorithm: Newton's Method with Hessian Sub-Sampling
1: Input: $x^{(0)}$
2: for k = 0, 1, 2, ... until termination do
3:   $g^{(k)} = \nabla F(x^{(k)})$
4:   Choose $S_H^{(k)} \subseteq \{1, 2, \ldots, n\}$
5:   $H^{(k)} = \frac{1}{|S_H^{(k)}|} \sum_{j \in S_H^{(k)}} \nabla^2 f_j(x^{(k)})$
6:   Solve $H^{(k)} p^{(k)} \approx -g^{(k)}$
7:   Find $\alpha^{(k)}$ that passes the Armijo line search
8:   Update $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$
9: end for
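A minimal NumPy sketch of the algorithm above, under the assumption that the user supplies callables F, grad_F, and hess_fi (the per-sample Hessian); the Armijo constant, backtracking factor, and sample size are illustrative placeholders, not values from the slides.

```python
import numpy as np

def subsampled_newton(F, grad_F, hess_fi, x0, n, s=100, beta=1e-4,
                      iters=50, tol=1e-8, seed=0):
    """Newton's method with uniform Hessian sub-sampling and Armijo backtracking."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        g = grad_F(x)                                       # full gradient
        if np.linalg.norm(g) <= tol:
            break
        S = rng.choice(n, size=min(s, n), replace=False)    # S_H: Hessian sample
        H = np.mean([hess_fi(x, j) for j in S], axis=0)     # sub-sampled Hessian
        p = np.linalg.solve(H, -g)                          # (in)exact Newton direction
        alpha = 1.0
        for _ in range(50):                                 # Armijo backtracking
            if F(x + alpha * p) <= F(x) + alpha * beta * (p @ g):
                break
            alpha *= 0.5
        x = x + alpha * p
    return x
```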

Theorem ([Byrd et al., 2011])

If each $f_i$ is strongly convex and twice continuously differentiable with Lipschitz continuous gradient, then

$$\lim_{k \to \infty} \nabla F(x^{(k)}) = 0.$$

Theorem ([Bollapragada et al., 2016])

If each $f_i$ is strongly convex and twice continuously differentiable with Lipschitz continuous gradient, then

$$\mathbb{E}\left[ F(x^{(k)}) - F(x^\star) \right] \le \rho^k \left( F(x^{(0)}) - F(x^\star) \right),$$

where $\rho \le 1 - 1/\kappa^2$, with $\kappa$ being the “condition number” of the problem.

What if F is strongly convex, but each $f_i$ is only weakly convex? E.g.,
$$F(x) = \frac{1}{n} \sum_{i=1}^n \ell_i(a_i^T x)$$
with $a_i \in \mathbb{R}^d$ and $\mathrm{Range}(\{a_i\}_{i=1}^n) = \mathbb{R}^d$, $\ell_i : \mathbb{R} \to \mathbb{R}$ and $\ell_i'' \ge \gamma > 0$.
Each $\nabla^2 f_i(x) = \ell_i''(a_i^T x) \, a_i a_i^T$ is rank one! But
$$\nabla^2 F(x) \succeq \gamma \cdot \underbrace{\lambda_{\min}\Big( \frac{1}{n} \sum_{i=1}^n a_i a_i^T \Big)}_{>0} \, I$$

Sub-Sampling Hessian

$$H = \frac{1}{|S|} \sum_{j \in S} \nabla^2 f_j(x)$$

Lemma ([Roosta and Mahoney, 2016a])
Suppose $\nabla^2 F(x) \succeq \gamma I$ and $0 \preceq \nabla^2 f_i(x) \preceq L I$. Given any $0 < \epsilon < 1$ and $0 < \delta < 1$, if the Hessian is uniformly sub-sampled with
$$|S| \ge \frac{2 \kappa \log(d/\delta)}{\epsilon^2},$$
then
$$\Pr\big( H \succeq (1 - \epsilon) \gamma I \big) \ge 1 - \delta,$$
where $\kappa = L/\gamma$.

Inexact Update: $Hp \approx -g$, i.e., $\|Hp + g\| \le \theta \|g\|$ with $\theta < 1$, using CG.

Why CG and not other solvers (at least theoretically)? H is SPD, and the CG iterate over the Krylov subspace $\mathcal{K}_t$ satisfies
$$p^{(t)} = \operatorname*{argmin}_{p \in \mathcal{K}_t} \; p^T g + \frac{1}{2} p^T H p \;\Longrightarrow\; \big\langle p^{(t)}, g \big\rangle \le -\frac{1}{2} \big\langle p^{(t)}, H p^{(t)} \big\rangle,$$
so every CG iterate is a descent direction.
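As an illustration of the inexact CG solve above, here is a self-contained sketch that only needs Hessian-vector products and stops once the relative residual drops below θ; the function and argument names are illustrative, not from the slides.

```python
import numpy as np

def newton_cg_direction(hess_vec, g, theta=0.1, maxiter=200):
    """Return p with ||H p + g|| <= theta * ||g||, using conjugate gradient.

    hess_vec(v) should compute the Hessian-vector product H @ v.
    """
    p = np.zeros_like(g)
    r = -g - hess_vec(p)                 # residual of H p = -g (r = -g with p = 0)
    d = r.copy()
    tol = theta * np.linalg.norm(g)
    for _ in range(maxiter):
        if np.linalg.norm(r) <= tol:     # ||H p + g|| <= theta ||g||
            break
        Hd = hess_vec(d)
        alpha = (r @ r) / (d @ Hd)
        p = p + alpha * d
        r_new = r - alpha * Hd
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return p
```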

Sub-Sampling Hessian

Theorem ([Roosta and Mahoney, 2016a])
If $\theta \le \sqrt{\dfrac{1 - \epsilon}{\kappa}}$, then, w.h.p.,
$$F(x^{(k+1)}) - F(x^\star) \le \rho \left( F(x^{(k)}) - F(x^\star) \right),$$
where $\rho = 1 - (1 - \epsilon)/\kappa^2$.

Sub-sampled Newton's Method

Local convergence, i.e., in a neighborhood of $x^\star$, and with $\alpha^{(k)} = 1$.

Linear-Quadratic Error Recursion
$$\| x^{(k+1)} - x^\star \| \le \xi_1 \| x^{(k)} - x^\star \| + \xi_2 \| x^{(k)} - x^\star \|^2$$

[Erdogdu and Montanari, 2015]

$$[U_{r+1}, \Lambda_{r+1}] = \mathrm{TSVD}(H, r+1)$$
$$\hat{H}^{-1} = \frac{1}{\lambda_{r+1}} I + U_r \left( \Lambda_r^{-1} - \frac{1}{\lambda_{r+1}} I \right) U_r^T$$
$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\| x - \left( x^{(k)} - \alpha^{(k)} \hat{H}^{-1} \nabla F(x^{(k)}) \right) \right\|$$

Theorem ([Erdogdu and Montanari, 2015])
$$\| x^{(k+1)} - x^\star \| \le \xi_1^{(k)} \| x^{(k)} - x^\star \| + \xi_2^{(k)} \| x^{(k)} - x^\star \|^2,$$
where
$$\xi_1^{(k)} = 1 - \frac{\lambda_{\min}(H^{(k)})}{\lambda_{r+1}(H^{(k)})} + \frac{L_g}{\lambda_{r+1}(H^{(k)})} \sqrt{\frac{\log d}{|S_H^{(k)}|}}, \qquad \xi_2^{(k)} = \frac{L_H}{2 \lambda_{r+1}(H^{(k)})}.$$

Note: For constrained optimization, the method is based on two-metric gradient projection ⇒ starting from an arbitrary point, the algorithm might not recognize, i.e., may fail to stop at, a stationary point.
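For concreteness, a small sketch of the truncated low-rank construction of Ĥ⁻¹ above; since the (sub-sampled) Hessian is symmetric PSD, its truncated SVD coincides with its top eigenpairs, which is what this illustrative helper uses (names are assumptions, not the authors' code).

```python
import numpy as np

def truncated_hessian_inverse(H, r):
    """H_hat^{-1} = lam_{r+1}^{-1} I + U_r (Lam_r^{-1} - lam_{r+1}^{-1} I) U_r^T,
    built from the top r+1 eigenpairs of a symmetric PSD Hessian H (requires r < d)."""
    d = H.shape[0]
    eigvals, eigvecs = np.linalg.eigh(H)      # ascending eigenvalues of symmetric H
    order = np.argsort(eigvals)[::-1]         # sort descending
    lam, U = eigvals[order], eigvecs[:, order]
    Ur, Lam_r = U[:, :r], lam[:r]             # top-r eigenpairs
    lam_rp1 = lam[r]                          # the (r+1)-th largest eigenvalue
    return np.eye(d) / lam_rp1 + Ur @ np.diag(1.0 / Lam_r - 1.0 / lam_rp1) @ Ur.T
```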

Sub-sampling Hessian

$$H = \frac{1}{|S|} \sum_{j \in S} \nabla^2 f_j(x)$$

Lemma ([Roosta and Mahoney, 2016b])
Given any $0 < \epsilon < 1$ and $0 < \delta < 1$, if the Hessian is uniformly sub-sampled with
$$|S| \ge \frac{2 \kappa^2 \log(d/\delta)}{\epsilon^2},$$
then
$$\Pr\Big( (1 - \epsilon) \nabla^2 F(x) \preceq H(x) \preceq (1 + \epsilon) \nabla^2 F(x) \Big) \ge 1 - \delta.$$

Sub-sampled Newton's Method [Roosta and Mahoney, 2016b]

Theorem ([Roosta and Mahoney, 2016b])
With high probability,
$$\| x^{(k+1)} - x^\star \| \le \xi_1 \| x^{(k)} - x^\star \| + \xi_2 \| x^{(k)} - x^\star \|^2,$$
where
$$\xi_1 = \frac{\epsilon}{1 - \epsilon} + \sqrt{\frac{\kappa}{1 - \epsilon}} \, \theta, \qquad \xi_2 = \frac{L_H}{2 (1 - \epsilon) \gamma}.$$
If $\theta = \epsilon / \sqrt{\kappa}$, then $\xi_1$ is problem-independent ⇒ it can be made arbitrarily small!

Theorem ([Roosta and Mahoney, 2016b])

Consider any $0 < \rho_0 < \rho < 1$ and $\epsilon \le \rho_0 / (1 + \rho_0)$. If
$$\| x^{(0)} - x^\ast \| \le \frac{\rho - \rho_0}{\xi_2} \qquad \text{and} \qquad \theta \le \rho_0 \sqrt{\frac{1 - \epsilon}{\kappa}},$$
we get locally Q-linear convergence

$$\| x^{(k)} - x^\ast \| \le \rho \, \| x^{(k-1)} - x^\ast \|, \quad k = 1, \ldots, k_0,$$

with probability $(1 - \delta)^{k_0}$.

Problem-independent local convergence rate
By increasing the Hessian accuracy, a super-linear rate is possible

Putting it all together

Theorem ([Roosta and Mahoney, 2016b])
Under certain assumptions, starting at any $x^{(0)}$:
we have linear convergence,
after a certain number of iterations, we get “problem-independent” linear convergence,
after a certain number of iterations, the step size $\alpha^{(k)} = 1$ passes the Armijo rule for all subsequent iterations.

“Any optimization algorithm for which the unit step length works has some wisdom. It is too much of a fluke if the unit step length [accidentally] works.”

Prof. Jorge Nocedal, IPAM Summer School, 2012

Sub-sampled Newton's Method

Theorem ([Bollapragada et al., 2016])
Suppose

$$\mathbb{E}_i \left\| \nabla^2 f_i(x) - \nabla^2 F(x) \right\|^2 \le \sigma^2.$$

Then

$$\mathbb{E} \| x^{(k+1)} - x^\star \| \le \xi_1 \| x^{(k)} - x^\star \| + \xi_2 \| x^{(k)} - x^\star \|^2,$$
where
$$\xi_1 = \frac{\sigma}{\gamma \sqrt{|S_H^{(k)}|}} + \kappa \theta, \qquad \xi_2 = \frac{L_H}{2 \gamma}.$$

Exploiting the structure....

Example:

$$F(x) = \frac{1}{n} \sum_{i=1}^n \ell_i(a_i^T x) \;\Longrightarrow\; \nabla^2 F(x) = A^T D_x A,$$
where
$$A = \begin{bmatrix} a_1^T \\ \vdots \\ a_n^T \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad D_x = \frac{1}{n} \begin{bmatrix} \ell_1''(a_1^T x) & & \\ & \ddots & \\ & & \ell_n''(a_n^T x) \end{bmatrix} \in \mathbb{R}^{n \times n}.$$

Sketching [Pilanci and Wainwright, 2017]

$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ (x - x^{(k)})^T \nabla F(x^{(k)}) + \frac{1}{2} (x - x^{(k)})^T \nabla^2 F(x^{(k)}) (x - x^{(k)}) \right\}$$
With $\nabla^2 F(x^{(k)}) = B^{(k)T} B^{(k)}$, $B^{(k)} \in \mathbb{R}^{n \times d}$:
$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ (x - x^{(k)})^T \nabla F(x^{(k)}) + \frac{1}{2} \big\| \underbrace{B^{(k)}}_{n \times d} (x - x^{(k)}) \big\|^2 \Big\}$$
Approximate $\nabla^2 F(x^{(k)}) \approx B^{(k)T} S^{(k)T} S^{(k)} B^{(k)}$, with a sketch $S^{(k)} \in \mathbb{R}^{s \times n}$, $d \le s \ll n$:
$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ (x - x^{(k)})^T \nabla F(x^{(k)}) + \frac{1}{2} \big\| \underbrace{S^{(k)} B^{(k)}}_{s \times d} (x - x^{(k)}) \big\|^2 \Big\}$$
(a code sketch of this step follows the list of sketching choices below)

Sub-Gaussian sketches: well-known concentration properties; involve dense/unstructured matrix operations
Randomized orthonormal systems (e.g., Hadamard or Fourier): sub-optimal sample sizes; fast matrix multiplication
Random row sampling: uniform; non-uniform (more on this later)
Sparse JL sketches
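For instance, the sub-Gaussian option above can be plugged into the sketched Newton step as in this minimal sketch for the GLM case, where B = D^{1/2} A is a square root of the Hessian; the Gaussian sketch, the unconstrained solve, and all names are illustrative assumptions.

```python
import numpy as np

def newton_sketch_step(A, ell_pp, g, x, s, seed=0):
    """One unconstrained Newton-sketch step for F(x) = (1/n) sum_i ell_i(a_i^T x).

    A: n x d data matrix, ell_pp: maps the vector A @ x to the second derivatives
    ell_i''(a_i^T x), g: full gradient at x, s: sketch size (d <= s << n)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    D_sqrt = np.sqrt(ell_pp(A @ x) / n)            # diag(D)^{1/2}, D = (1/n) diag(ell_i'')
    B = D_sqrt[:, None] * A                        # B in R^{n x d}, so H = B^T B
    S = rng.standard_normal((s, n)) / np.sqrt(s)   # Gaussian sketch, E[S^T S] = I
    SB = S @ B                                     # s x d sketched square root
    H_sketch = SB.T @ SB                           # sketched Hessian approximation
    p = np.linalg.solve(H_sketch, -g)              # sketched Newton direction
    return x + p
```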

Non-uniform Sampling [Xu et al., 2016]

Non-uniformity among the $\nabla^2 f_i$ ⇒ O(n) uniform samples needed!

Goal: find $|S_H|$ independent of n and immune to non-uniformity. Non-uniform sampling schemes based on:

1. Row norms

2. Leverage scores

LiSSA [Agarwal et al., 2017]

Key idea: use a Taylor expansion (Neumann series) to construct an estimator of the Hessian inverse.
$$\|A\| \le 1, \; A \succ 0 \;\Longrightarrow\; A^{-1} = \sum_{i=0}^{\infty} (I - A)^i$$
$$A_j^{-1} = \sum_{i=0}^{j} (I - A)^i = I + (I - A) A_{j-1}^{-1}, \qquad \lim_{j \to \infty} A_j^{-1} = A^{-1}$$
Uniformly sub-sampled Hessian + truncated Neumann series

Three nested loops involving Hessian-vector products (HVPs), i.e., $\nabla^2 f_{i,j}(x^{(k)}) v_{i,j}$
Leverages fast multiplication by the Hessian for GLMs
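A stripped-down sketch of the truncated Neumann recursion above for estimating H⁻¹g from sub-sampled Hessian-vector products; this is a simplified single-loop variant rather than the full three-loop LiSSA algorithm, and hess_vec_i, scale, and depth are assumptions.

```python
import numpy as np

def neumann_inverse_hvp(hess_vec_i, g, n, depth=50, scale=1.0, seed=0):
    """Estimate H^{-1} g via v_j = g + (I - H_{i_j} / scale) v_{j-1}, where H_{i_j} is a
    single randomly chosen per-sample Hessian; requires ||H / scale|| <= 1."""
    rng = np.random.default_rng(seed)
    v = g.copy()
    for _ in range(depth):
        i = rng.integers(n)                      # pick one component at random
        v = g + v - hess_vec_i(i, v) / scale     # truncated Neumann recursion step
    return v / scale                             # undo the spectral rescaling
```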

Name          Complexity per iteration                     Reference
Newton-CG     O(nnz(A) √κ)                                 folklore
SSN-LS        Õ(nnz(A) log n + p² κ^{3/2})                 [Xu et al., 2016]
SSN-RNS       Õ(nnz(A) + sr(A) p κ^{5/2})                  [Xu et al., 2016]
SRHT          Õ(n p (log n)⁴ + p² (log n)⁴ κ^{3/2})        [Pilanci et al., 2016]
SSN-Uniform   Õ(nnz(A) + p κ̂ κ^{3/2})                      [Roosta et al., 2016]
LiSSA         Õ(nnz(A) + p κ̂ κ̄²)                           [Agarwal et al., 2017]

$$\kappa = \max_x \frac{\lambda_{\max}\big(\nabla^2 F(x)\big)}{\lambda_{\min}\big(\nabla^2 F(x)\big)}, \qquad \hat{\kappa} = \max_x \frac{\max_i \lambda_{\max}\big(\nabla^2 f_i(x)\big)}{\lambda_{\min}\big(\nabla^2 F(x)\big)}, \qquad \bar{\kappa} = \max_x \frac{\max_i \lambda_{\max}\big(\nabla^2 f_i(x)\big)}{\min_i \lambda_{\min}\big(\nabla^2 f_i(x)\big)}, \qquad \kappa \le \hat{\kappa} \le \bar{\kappa}$$

Sub-sampled Newton's Method

Iterative Scheme
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \left\{ p^T g^{(k)} + \frac{1}{2} p^T H^{(k)} p \right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Gradient and Hessian Sub-Sampling

$$g^{(k)} = \frac{1}{|S_g|} \sum_{j \in S_g} \nabla f_j(x^{(k)}), \qquad H^{(k)} = \frac{1}{|S_H|} \sum_{j \in S_H} \nabla^2 f_j(x^{(k)})$$

Algorithm: Newton's Method with Gradient and Hessian Sub-Sampling
1: Input: $x^{(0)}$
2: for k = 0, 1, 2, ... until termination do
3:   Choose $S_g^{(k)} \subseteq \{1, 2, \ldots, n\}$ and set $g^{(k)} = \frac{1}{|S_g^{(k)}|} \sum_{j \in S_g^{(k)}} \nabla f_j(x^{(k)})$
4:   if $\|g^{(k)}\| \le \sigma \epsilon_g$ then
5:     STOP
6:   end if
7:   Choose $S_H^{(k)} \subseteq \{1, 2, \ldots, n\}$ and set $H^{(k)} = \frac{1}{|S_H^{(k)}|} \sum_{j \in S_H^{(k)}} \nabla^2 f_j(x^{(k)})$
8:   Solve $H^{(k)} p^{(k)} \approx -g^{(k)}$
9:   Find $\alpha^{(k)}$ that passes the Armijo line search
10:  Update $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$
11: end for

Global convergence, i.e., starting from any initial point.

[Byrd et al., 2012, Roosta and Mahoney, 2016a, Bollapragada et al., 2016]

We can write $\nabla F(x) = AB$ where
$$A := \begin{bmatrix} | & | & & | \\ \nabla f_1(x) & \nabla f_2(x) & \cdots & \nabla f_n(x) \\ | & | & & | \end{bmatrix} \qquad \text{and} \qquad B := \begin{bmatrix} 1/n \\ 1/n \\ \vdots \\ 1/n \end{bmatrix}.$$

Lemma ([Roosta and Mahoney, 2016a])

Let $\|\nabla f_i(x)\| \le G(x) < \infty$. For any $0 < \epsilon, \delta < 1$, if

$$|S| \ge \frac{G(x)^2 \big( 1 + \sqrt{8 \ln(1/\delta)} \big)^2}{\epsilon^2},$$

then

$$\Pr\big( \|\nabla F(x) - g(x)\| \le \epsilon \big) \ge 1 - \delta.$$
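A tiny helper that evaluates the sample-size bounds from the gradient lemma above and the earlier Hessian lemma (|S| ≥ 2κ log(d/δ)/ε²); the constants G, L, and γ are problem-dependent quantities the user must supply, and the numbers in the usage line are purely illustrative.

```python
import numpy as np

def gradient_sample_size(G, eps, delta):
    """|S_g| >= G^2 (1 + sqrt(8 ln(1/delta)))^2 / eps^2  (gradient sub-sampling lemma)."""
    return int(np.ceil(G**2 * (1 + np.sqrt(8 * np.log(1 / delta)))**2 / eps**2))

def hessian_sample_size(L, gamma, d, eps, delta):
    """|S_H| >= 2 kappa log(d/delta) / eps^2, with kappa = L / gamma (Hessian lemma)."""
    kappa = L / gamma
    return int(np.ceil(2 * kappa * np.log(d / delta) / eps**2))

# Illustrative usage with made-up constants:
print(gradient_sample_size(G=10, eps=0.1, delta=0.01),
      hessian_sample_size(L=100, gamma=1, d=1000, eps=0.1, delta=0.01))
```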

Theorem ([Roosta and Mahoney, 2016a])
If
$$\theta \le \sqrt{\frac{1 - \epsilon_H}{\kappa}},$$
then, w.h.p.,
$$F(x^{(k+1)}) - F(x^\star) \le \rho \left( F(x^{(k)}) - F(x^\star) \right),$$

where $\rho = 1 - (1 - \epsilon_H)/\kappa^2$, and upon “STOP”, we have $\|\nabla F(x^{(k)})\| < (1 + \sigma) \, \epsilon_g$.

Local convergence, i.e., in a neighborhood of $x^\star$, and with $\alpha^{(k)} = 1$.

Error Recursion
$$\| x^{(k+1)} - x^\star \| \le \xi_0 + \xi_1 \| x^{(k)} - x^\star \| + \xi_2 \| x^{(k)} - x^\star \|^2$$

[Roosta and Mahoney, 2016b, Bollapragada et al., 2016]

Theorem ([Roosta and Mahoney, 2016b])
With high probability,

$$\| x^{(k+1)} - x^\star \| \le \xi_0 + \xi_1 \| x^{(k)} - x^\star \| + \xi_2 \| x^{(k)} - x^\star \|^2,$$
where
$$\xi_0 = \frac{\epsilon_g}{(1 - \epsilon_H) \gamma}, \qquad \xi_1 = \frac{\epsilon_H}{1 - \epsilon_H} + \sqrt{\frac{\kappa}{1 - \epsilon_H}} \, \theta, \qquad \xi_2 = \frac{L_H}{2 (1 - \epsilon_H) \gamma}.$$

Theorem ([Roosta and Mahoney, 2016b])
Consider any $0 < \rho_0 + \rho_1 < \rho < 1$. If $\| x^{(0)} - x^\ast \| \le c(\rho_0, \rho_1, \rho)$, $\epsilon_g^{(k)} = \rho^k \epsilon_g$, and
$$\theta \le \rho_0 \sqrt{\frac{1 - \epsilon_H}{\kappa}},$$
we get locally R-linear convergence

$$\| x^{(k)} - x^\ast \| \le c \rho^k$$

with probability $(1 - \delta)^{2k}$.

Problem-independent local convergence rate

Putting it all together

Theorem ([Roosta and Mahoney, 2016b])
Under certain assumptions, starting at any $x^{(0)}$:
we have linear convergence,
after a certain number of iterations, we get “problem-independent” linear convergence,
after a certain number of iterations, the step size $\alpha^{(k)} = 1$ passes the Armijo rule for all subsequent iterations,
upon “STOP” at iteration k, we have $\|\nabla F(x^{(k)})\| < (1 + \sigma) \sqrt{\rho^k} \, \epsilon_g$.

Newton-type Methods for Non-Smooth Problems

Convex Composite Problem
$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} F(x) + R(x)$$

F: convex and smooth
R: convex and non-smooth

Non-Smooth Newton-type [Yu et al., 2010]

Let $\mathcal{X} = \mathbb{R}^d$, i.e., unconstrained optimization.

Iterative Scheme (Non-Quadratic Sub-Problem)
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \Bigg\{ \sup_{g^{(k)} \in \partial (F + R)(x^{(k)})} p^T g^{(k)} + \frac{1}{2} p^T \underbrace{H^{(k)}}_{\text{e.g., QN}} p \Bigg\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Iterative Scheme
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \Bigg\{ p^T \nabla F(x^{(k)}) + \frac{1}{2} p^T \underbrace{H^{(k)}}_{\approx \nabla^2 F(x^{(k)})} p + R(x^{(k)} + p) \Bigg\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Notable examples of proximal Newton methods:

glmnet [Friedman et al., 2010]: ℓ1-regularized GLMs; sub-problems are solved using coordinate descent

QUIC [Hsieh et al., 2014]: graphical lasso problem with factorization tricks; sub-problems are solved using coordinate descent

Inexactness of the sub-problem solver
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \left\{ p^T \nabla F(x^{(k)}) + \frac{1}{2} p^T H^{(k)} p + R(x^{(k)} + p) \right\}$$

$p^\star$ is the minimizer of the above sub-problem iff $p^\star = \mathrm{Prox}_\lambda^{(k)}(p^\star)$, where
$$\mathrm{Prox}_\lambda^{(k)}(p) \triangleq \operatorname*{argmin}_{q \in \mathbb{R}^d} \left\{ \big\langle \nabla F(x^{(k)}) + H^{(k)} p, \, q \big\rangle + R(x^{(k)} + q) + \frac{\lambda}{2} \| q - p \|^2 \right\}.$$

Inexactness:
$$\big\| p^{(k)} - \mathrm{Prox}_\lambda^{(k)}(p^{(k)}) \big\| \le \theta \big\| \mathrm{Prox}_\lambda^{(k)}(0) \big\|,$$
and some sufficient decrease condition on the sub-problem.
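To make the sub-problem concrete, here is a minimal sketch for R(x) = λ‖x‖₁ that approximately solves the quadratic-plus-ℓ1 model with proximal-gradient (ISTA) inner iterations; this is just one possible inner solver, not the one prescribed by the cited works, and all names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of t * ||.||_1 (soft thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_newton_subproblem(grad, H, x, lam, inner_iters=200):
    """Approximately solve  min_p  p^T grad + 0.5 p^T H p + lam * ||x + p||_1
    by proximal gradient on the shifted variable q = x + p."""
    t = 1.0 / np.linalg.norm(H, 2)     # step size 1 / lambda_max(H)
    q = x.copy()                        # corresponds to starting from p = 0
    for _ in range(inner_iters):
        q = soft_threshold(q - t * (grad + H @ (q - x)), t * lam)
    return q - x                        # return p = q - x
```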

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Step size: $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$

Line-Search
$$(F + R)(x^{(k)} + \alpha p^{(k)}) - (F + R)(x^{(k)}) \le \beta \left( \ell_k\big( x^{(k)} + \alpha p^{(k)} \big) - \ell_k\big( x^{(k)} \big) \right),$$
where
$$\ell_k(x) = F(x^{(k)}) + (x - x^{(k)})^T \nabla F(x^{(k)}) + R(x).$$

Global convergence: $x^{(k)}$ converges to an optimal solution starting at any $x^{(0)}$.
Local convergence, for $x^{(0)}$ close enough to $x^\star$:
With $H^{(k)} = \nabla^2 F(x^{(k)})$: an exact solve gives quadratic convergence; an approximate solve with a decaying forcing term gives superlinear convergence; an approximate solve with a fixed forcing term gives linear convergence.
With $H^{(k)}$ satisfying the Dennis-Moré condition: superlinear convergence.

Sub-Sampled Proximal Newton-type [Liu et al., 2017]

FSM/ERM
$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} F(x) + R(x) = \frac{1}{n} \sum_{i=1}^n f_i(a_i^T x) + R(x)$$

$f_i$: strongly convex, smooth, and self-concordant
R: convex and non-smooth

n ≫ 1 and/or d ≫ 1

Dennis-Moré condition:

$$\big| p^{(k)T} \big( H^{(k)} - \nabla^2 F(x^{(k)}) \big) p^{(k)} \big| \le \eta_k \, p^{(k)T} \nabla^2 F(x^{(k)}) p^{(k)}$$

Leverage score sampling to ensure the Dennis-Moré condition

Inexact sub-problem solver

Finite Sum / Empirical Risk Minimization

Self-Concordant

$$\big| v^T \nabla^3 f(x)[v] \, v \big| \le M \big( v^T \nabla^2 f(x) v \big)^{3/2}, \quad \forall x, v \in \mathbb{R}^d$$

M = 2 is called standard self-concordant.

Theorem ([Zhang and Lin, 2015])
Suppose there exist $\gamma > 0$ and $\eta \in [0, 1)$ such that $|f_i'''(t)| \le \gamma \big( f_i''(t) \big)^{1 - \eta}$. Then
$$F(x) = \frac{1}{n} \sum_{i=1}^n f_i(a_i^T x) + \frac{\lambda}{2} \| x \|^2$$
is self-concordant with
$$M = \frac{\gamma \max_i \| a_i \|^{1 + 2\eta}}{\lambda^{\eta + 1/2}}.$$

Proximal Newton-type Methods

General treatment: [Fukushima and Mine, 1981, Lee et al., 2014, Byrd et al., 2016, Becker and Fadili, 2012, Schmidt et al., 2012, Schmidt et al., 2009, Shi and Liu, 2015]
Tailored to specific problems: [Friedman et al., 2010, Hsieh et al., 2014, Yuan et al., 2012, Oztoprak et al., 2012]
Self-concordant: [Li et al., 2017, Kyrillidis et al., 2014, Tran-Dinh et al., 2013]

Semi-Smooth Newton-type Methods

Convex Composite Problem
$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} F(x) + R(x)$$

F: (non-)convex and (non-)smooth
R: convex and non-smooth

Recall: Proximal Newton-type Methods
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \left\{ p^T \nabla F(x^{(k)}) + \frac{1}{2} p^T H^{(k)} p + R(x^{(k)} + p) \right\} \equiv \mathrm{prox}_R^{H^{(k)}}\Big( x^{(k)} - \big( H^{(k)} \big)^{-1} \nabla F(x^{(k)}) \Big) - x^{(k)},$$
where
$$\mathrm{prox}_R^{H^{(k)}}(x) \triangleq \operatorname*{argmin}_{y \in \mathbb{R}^d} \; R(y) + \frac{1}{2} \| y - x \|_{H^{(k)}}^2, \qquad \| y - x \|_H^2 = \langle y - x, \, H (y - x) \rangle.$$

Recall Newton's method for smooth root-finding problems:

$$r(x) = 0 \;\Longrightarrow\; J(x^{(k)}) p^{(k)} = -r(x^{(k)}) \;\Longrightarrow\; x^{(k+1)} = x^{(k)} + p^{(k)}$$

Key Idea
For solving non-linear, non-smooth systems of equations, replace the Jacobian in the Newton iteration by an element of Clarke's generalized Jacobian, which may be nonempty (e.g., when r is locally Lipschitz continuous) even if the true Jacobian does not exist.

Example:
$$r(x) = \mathrm{prox}_R^{B}\big( x - B^{-1} \nabla F(x) \big) - x$$

Clarke's Generalized Jacobian
Let $r : \mathbb{R}^d \to \mathbb{R}^p$ be locally Lipschitz continuous at x (by Rademacher's theorem, r is then almost everywhere differentiable). Let $D_r$ be the set of points at which r is differentiable. The Bouligand subdifferential of r at x is defined as
$$\partial_B r(x) \triangleq \left\{ \lim_{k \to \infty} r'(x_k) \;\middle|\; x_k \in D_r, \; x_k \to x \right\}.$$

Clarke's generalized Jacobian is $\partial r(x) = \mathrm{Conv\text{-}hull}\big( \partial_B r(x) \big)$.

Semi-Smooth Map
$r : \mathbb{R}^d \to \mathbb{R}^p$ is semi-smooth at x if: r is locally Lipschitz continuous; r is directionally differentiable at x; and
$$\| r(x + v) - r(x) - J v \| = o(\| v \|), \qquad v \in \mathbb{R}^d, \; J \in \partial r(x + v).$$

Example: piecewise continuously differentiable functions. A vector-valued function is (strongly) semi-smooth if and only if each of its component functions is (strongly) semi-smooth.

First-order stationarity is written as a fixed-point-type non-linear system of equations, e.g.,

$$r^B(x) \triangleq x - \mathrm{prox}_R^{B}\big( x - B^{-1} \nabla F(x) \big) = 0, \qquad B \succ 0.$$

A (stochastic) semi-smooth Newton method is used to (approximately) solve these systems of nonlinear equations, e.g., as in [Milzarek et al., 2018]:

$$\hat{r}^B(x) \triangleq x - \mathrm{prox}_R^{B}\big( x - B^{-1} g(x) \big),$$
$$J^{(k)} p^{(k)} = -\hat{r}^B(x^{(k)}), \qquad J^{(k)} \in \mathcal{J}_k^B,$$
where
$$\mathcal{J}_k^B \triangleq \left\{ J \in \mathbb{R}^{d \times d} \;\middle|\; J = (I - A) + A B^{-1} H^{(k)}, \; A \in \partial \, \mathrm{prox}_R^{B}\big( u(x) \big) \right\}, \qquad u(x) \triangleq x - B^{-1} g(x).$$
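To make the residual r^B and the generalized Jacobian set above concrete, here is a minimal deterministic sketch for R(x) = λ‖x‖₁ with B = (1/t) I, where the prox is soft thresholding and an element of its generalized Jacobian is a 0/1 diagonal matrix; the plain linear solve and all names are illustrative choices under those assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def semismooth_newton_l1(grad_F, hess_F, x0, lam, t=1.0, iters=50, tol=1e-8):
    """Semi-smooth Newton on r(x) = x - soft_threshold(x - t * grad_F(x), t * lam),
    i.e. B = (1/t) I. An element A of the generalized Jacobian of the prox is the
    0/1 diagonal matrix with A_ii = 1 iff |u_i| > t * lam, where u = x - t * grad_F(x)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        u = x - t * grad_F(x)
        r = x - soft_threshold(u, t * lam)             # residual r^B(x)
        if np.linalg.norm(r) <= tol:
            break
        a = (np.abs(u) > t * lam).astype(float)        # diagonal of A in dprox(u)
        J = np.diag(1.0 - a) + a[:, None] * (t * hess_F(x))   # J = (I - A) + A B^{-1} H
        p = np.linalg.solve(J, -r)                     # Newton-type step
        x = x + p
    return x
```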

Semi-smoothness of proximal mapping does not hold in general.

In certain settings, these vector-valued functions are monotone and can be semi-smooth due to the properties of the proximal mappings.

In such cases, their generalized Jacobian matrix is positive semi-definite due to monotonicity.

Lemma
For a monotone and Lipschitz continuous mapping $r : \mathbb{R}^d \to \mathbb{R}^d$ and any $x \in \mathbb{R}^d$, each element of $\partial_B r(x)$ is positive semi-definite.

General treatment: [Patrinos et al., 2014, Milzarek and Ulbrich, 2014, Xiao et al., 2016, Patrinos and Bemporad, 2013]
Tailored to a specific problem: [Yuan et al., 2018]
Stochastic: [Milzarek et al., 2018]

THANK YOU!

References

Agarwal, N., Bullins, B., and Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187.

Becker, S. and Fadili, J. (2012). A quasi-Newton proximal splitting method. In Advances in Neural Information Processing Systems, pages 2618–2626.

Bollapragada, R., Byrd, R. H., and Nocedal, J. (2016). Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis.

Byrd, R. H., Chin, G. M., Neveitt, W., and Nocedal, J. (2011). On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995.

Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155.

Byrd, R. H., Nocedal, J., and Oztoprak, F. (2016). An inexact successive quadratic approximation method for L-1 regularized optimization. Mathematical Programming, 157(2):375–396.

Erdogdu, M. A. and Montanari, A. (2015). Convergence rates of sub-sampled Newton methods. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pages 3052–3060. MIT Press.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.

Fukushima, M. and Mine, H. (1981). A generalized proximal point algorithm for certain non-convex minimization problems. International Journal of Systems Science, 12(8):989–1000.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2014). QUIC: quadratic approximation for sparse inverse covariance estimation. The Journal of Machine Learning Research, 15(1):2911–2947.

Kyrillidis, A., Mahabadi, R. K., Tran-Dinh, Q., and Cevher, V. (2014). Scalable sparse covariance estimation via self-concordance. In AAAI, pages 1946–1952.

Lee, J. D., Sun, Y., and Saunders, M. A. (2014). Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443.

Li, J., Andersen, M. S., and Vandenberghe, L. (2017). Inexact proximal Newton methods for self-concordant functions. Mathematical Methods of Operations Research, 85(1):19–41.

Liu, X., Hsieh, C.-J., Lee, J. D., and Sun, Y. (2017). An inexact subsampled proximal Newton-type method for large-scale machine learning. arXiv preprint arXiv:1708.08552.

Milzarek, A. and Ulbrich, M. (2014). A semismooth Newton method with multidimensional filter globalization for l1-optimization. SIAM Journal on Optimization, 24(1):298–333.

Milzarek, A., Xiao, X., Cen, S., Wen, Z., and Ulbrich, M. (2018). A stochastic semismooth Newton method for nonsmooth nonconvex optimization. arXiv preprint arXiv:1803.03466.

Oztoprak, F., Nocedal, J., Rennie, S., and Olsen, P. A. (2012). Newton-like methods for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 755–763.

Patrinos, P. and Bemporad, A. (2013). Proximal Newton methods for convex composite optimization. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 2358–2363. IEEE.

Patrinos, P., Stella, L., and Bemporad, A. (2014). Forward-backward truncated Newton methods for convex composite optimization. arXiv preprint arXiv:1402.6655.

Pilanci, M. and Wainwright, M. J. (2017). Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245.

Roosta, F. and Mahoney, M. W. (2016a). Sub-sampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737.

Roosta, F. and Mahoney, M. W. (2016b). Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738.

Schmidt, M., Berg, E., Friedlander, M., and Murphy, K. (2009). Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Artificial Intelligence and Statistics, pages 456–463.

Schmidt, M., Kim, D., and Sra, S. (2012). Projected Newton-type methods in machine learning. Optimization for Machine Learning, (1).

Shi, Z. and Liu, R. (2015). Large scale optimization with proximal stochastic Newton-type gradient descent. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 691–704. Springer.

Tran-Dinh, Q., Kyrillidis, A., and Cevher, V. (2013). Composite self-concordant minimization. arXiv preprint arXiv:1308.2867.

Xiao, X., Li, Y., Wen, Z., and Zhang, L. (2016). A regularized semi-smooth Newton method with projection steps for composite convex programs. Journal of Scientific Computing, pages 1–26.

Xu, P., Yang, J., Roosta, F., Ré, C., and Mahoney, M. W. (2016). Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems (NIPS), pages 3000–3008.

Yu, J., Vishwanathan, S., Günter, S., and Schraudolph, N. N. (2010). A quasi-Newton approach to nonsmooth problems in machine learning. Journal of Machine Learning Research, 11(Mar):1145–1200.

Yuan, G.-X., Ho, C.-H., and Lin, C.-J. (2012). An improved GLMNET for L1-regularized logistic regression. Journal of Machine Learning Research, 13(Jun):1999–2030.

Yuan, Y., Sun, D., and Toh, K.-C. (2018). An efficient semismooth Newton based algorithm for convex clustering. arXiv preprint arXiv:1802.07091.

Zhang, Y. and Lin, X. (2015). DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370.