A Tutorial on Stochastic Approximation

A Tutorial on Stochastic Approximation

Uday V. Shanbhag

Industrial and Manufacturing Engg. Pennsylvania State University University Park, PA 16802

International Conference on Stochastic Programming Trondheim, Norway July 28, 2019

** Author is grateful to Afrooz Jalilzadeh and Jinlong Lei

1 / 106 A Tutorial on Stochastic Approximation Introduction

Problem definitionI

We consider the following problem:

min f (x) E[F (x, ξ(ω))] , (Opt) x∈X ,

n d where X ⊆ R is a closed and convex set, F : X × R → R is a real-valued function, d ξ :Ω → R , and (Ω, F, P) denotes the probability space.

Suppose γk denotes a positive steplength, ∇˜ x F (xk , ωk ) denotes an estimator of the true gradient ∇x f (xk ) at xk , and ΠX (u) denotes the Euclidean projection of u onto the set X . We consider the analysis of the following stochastic approximation scheme. Given an x0 ∈ X , the sequence {xk } is generated as follows. h i xk+1 := ΠX xk − γk ∇˜ x F (xk , ωk ) , k ≥ 0. (SA)

2 / 106 A Tutorial on Stochastic Approximation Introduction

HistoryI

In their seminal paper [14], Robbins and Monro considered the question of finding the root of a stochastic system; we consider the variant of this problem in the context of solving problems. ˜ Suppose wk , ∇x F (xk , ωk ) − ∇x f (xk ). n To motivate the scheme, suppose X , R , and given x0, and consider a Newton method in which xk+1 is updated as follows:

2 −1 xk+1 := xk − (∇ f (xk )) ∇˜ F (xk , ωk ), k = 0, 1,....

By noting that ∇˜ F (xk , ωk ) can be expressed as ∇x f (xk ) + wk , we have that

2 −1 2 −1 xk+1 := xk − (∇ f (xk )) ∇x f (xk ) − (∇ f (xk )) wk , k = 0, 1,....

∗ 2 2 ∗ It follows that if xk → x , then ∇x f (xk ) → 0 and ∇ f (xk ) → ∇ f (x ) 0. Consequently, it has to hold that wk → 0. Challenge: But this does not hold for a of problems (such as settings with i.i.d. random variables with finite ).

3 / 106 A Tutorial on Stochastic Approximation Introduction

HistoryII

To avert this issue, Robbins and Monro [14] considered an alternate scheme in which 2 −1 (∇ f (xk )) by γk , a positive sequence; more specifically, the scheme reduces to the following:

xk+1 := xk − γk ∇˜ x F (xk , ωk ), k = 0, 1,... This iteration can be equivalently written as

xk+1 := xk − γk (∇x f (xk ) + wk ), k = 0, 1,...

P∞ Moreover, the sequence {γk } needed to satisfy the requirements: (i) k=0 γk = ∞ P∞ 2 (Non-summability) and (ii) k=0 γk < ∞ (Square-summability) to guarantee asymptotic convergence

4 / 106 A Tutorial on Stochastic Approximation Introduction

Outline

In this tutorial, we analyze such schemes in the following settings: 1 f is smooth/nonsmooth and strongly convex/convex.

min f (x) E[f (x, ω)] (Opt) x∈X ,

2 f is smooth/smoothable while g is deterministic convex, closed, and proper.

min f (x) + g(x), where f (x) E[f (x, ω)] (Composite) x∈X ,

3 Stochastically constrained stochastic optimization: f and c are smooth and convex.

min f (x) x∈X subject to c(x) ≤ 0, (Cons-Opt) x ∈ X .

where f (x) , E[f (x, ω)] and c(x) , E[c(x, ω)].

5 / 106 A Tutorial on Stochastic Approximation Introduction

Literature

1 Monographs on stochastic approximation [1, 17, 9, 4, 16] 2 We do not conduct a comprehensive lit review here; note our material is sourced as follows. 1 Smooth convex/strongly convex [18]. 2 Nonsmooth convex/strongly convex [16]. 3 Variance-reduced schemes for strongly convex [10] and convex smooth and nonsmooth (but smoothable) [7]

6 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsI

We begin by considering a setting where f (x) is a strongly convex function over n X ⊆ R . Note that strong convexity over a closed and convex set suffices for the existence of a unique solution x ∗.

Assumption 1

Suppose f (x) is an η-strongly convexa and continuously differentiable with L-Lipschitz continuous gradientsb on X .

a T η 2 For any x, y ∈ X , f (x) ≥ f (y) + ∇x f (y) (x − y) + 2 kx − yk . b For any x, y ∈ X , k∇x f (x) − ∇x f (y)k ≤ Lkx − yk.

We assume the existence of a stochastic first-order oracle (SFO) that can produce an estimate of the gradient denoted by ∇˜ f (x, ω).  2 Throughout, we assume that E kx0k < ∞. Furthermore we let Fk denote the history of the method up to time k, i.e., Fk = {x0, ω0, ω1, . . . , ωk−1} for k ≥ 1 and F0 = {x0}.

7 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsII

˜ We assume the following on the moments of wk , ∇x f (xk ) − ∇x f (xk , ωk ). Assumption 2 ( requirements)

2 2 We assume that E[wk | Fk ] = 0 a.s. and E[kwk k | Fk ] ≤ ν a.s. for all k ≥ 0.

The following Lemma is employed for establishing almost-sure convergence and may be found in [13] (cf. Lemma 10, page 49).

8 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsIII

Lemma 1

Let {vk } be a sequence of nonnegative random variables, where E[v0] < ∞, and let {uk } and {µk } be deterministic scalar sequences such that:

E[vk+1|v0,..., vk ] ≤ (1 − uk )vk + µk a.s. for all k ≥ 0,

0 ≤ uk ≤ 1, µk ≥ 0, for all k ≥ 0,

∞ ∞ X X µk uk = ∞, µk < ∞, lim = 0. k→∞ uk k=0 k=0

Then, vk → 0 almost surely as k → ∞.

9 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsIV

Proposition 2 (a.s. convergence under strong convexity)

Consider Algorithm (SA) and suppose Assumptions 1 and 2 hold. Then the sequence ∗ {xk } converges almost surely to x , a unique solution of (Opt).

Proof: Consider algorithm (SA). By the non-expansivity property of the Euclidean 1 ∗ 2 projection operator , for all k ≥ 0, kxk+1 − x k can be bounded as follows:

∗ 2 ∗ ∗ 2 kxk+1 − x k = kΠX (xk − γk (∇x f (xk ) + wk )) − ΠX (x − γk ∇x f (x ))k ∗ ∗ 2 ≤ kxk − x − γk (∇x f (xk ) + wk − ∇x f (x ))k .

10 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsV

By taking conditional expectations and invoking the fact that the conditional expectation of wk was zero or E[wk | Fk ] = 0, we have

h ∗ 2 i ∗ 2 E kxk+1 − x k | Fk ≤ kxk − x k

2 ∗ 2 2 h 2 i + γk k∇x f (xk ) − ∇x f (x )k + γk E kwk k | Fk ∗ T ∗ − 2γk (xk − x ) (∇x f (xk ) − ∇x f (x )) 2 2 ∗ 2 2 2 ≤ (1 − 2ηγk + γk L )kxk − x k + γk ν ,

where the second inequality is a resulting of leveraging the strong monotonicity and  2  Lipschitz continuity of F (x) over X as well as the boundedness of E kwk k | Fk . We may observe that

2 2 2 1 (1 − 2ηγk + γk L ) ≤ (1 − ηγk (2 − γk L )) ≤ (1 − ηγk ) < 1, if γk < η . | {z } 1 ≥1, if γ ≤ k L2

11 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsVI

∗ 2 2 2 2 2 It follows that if vk = kxk − x k , uk = (2ηγk − γk L ) and µk = γk L

E[vk+1 | Fk ] ≤ (1 − uk )vk + µk ,

2 2 1 1 where uk = 2ηγk − γk L < 1 if γk ≤ min{ L2 , η }. P∞ 2 P∞ 2 In addition, k=0 µk = L k=0 γk < ∞, and P∞ P∞ P∞ 2 2 k=0 uk = k=0 2ηγk − k=0 γk L = ∞, since γk is not summable but square summable.

By Lemma 1, vk → 0 almost surely as k → ∞ and the result holds.

12 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Rate of convergenceI

We now derive a rate statement when X is a compact set. We utilize Lemma ?? (appendix).

Proposition 3 (Convergence in and rate statement under strong convexity)

Consider Algorithm (SA) and suppose Assumptions 1 and 2 hold. In addition, suppose X ∗ θ is a bounded set such that kx − x k ≤ C for all x ∈ X and γk = k . Then the sequence ∗ {xk } converges to x in mean and

 2 2 −1 ∗ 2 ∗ 2 max θ M (2ηθ − 1) , E[kx0 − x k ] [kx − x k ] ≤ , E k k where θ > 1/2η.

Proof. We restart the proof of the previous result from the recursion:

h ∗ 2 i 2 2 ∗ 2 2 2 E kxk+1 − x k | Fk ≤ (1 − 2ηγk + γk L )kxk − x k + γk ν ,

13 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Rate of convergenceII

It follows that by taking unconditional expectations and recalling that kx − x ∗k ≤ C, we have that

h k+1 ∗ 2i h ∗ 2i 2 2 2 2 E kx − x k ≤ (1 − 2ηγk )E kxk − x k + γk (ν + L C ).

Recall that if θ > 1/2µ and M2/2 = (ν2 + L2C 2), we obtain the result for for all k (see [16, Ch. 5]).

14 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems Weakening strong convexity assumption

Assumption 3

Suppose f (x) is a convex, continuously differentiable, with L-Lipschitz continuous gradients over X .

We use the following result by Robbins and Siegmund that pertains to almost super-martingales Lemma 4 (Robbins-Siegmund Lemma)

Let vk , uk , αk , and βk be nonnegative random variables, and let the following relations hold almost surely:

∞ ∞ h ˜ i X X E vk+1 | Fk ≤ (1 + αk )vk − uk + βk for all k, αk < ∞, βk < ∞, k=0 k=0 where F˜k denotes the collection v0,..., vk , u0,..., uk , α0, . . . , αk , β0, . . . , βk . Then, a.s. P∞ a we have limk→∞ vk = v and k=0 uk < ∞, where v ≥ 0 is a random variable .

a A stochastic process {X (k), k ∈ Z+} is said to be a (Sub,super)-Martingale process if E[Y (k + 1) | Y (1),..., Y (k)] (≥, ≤) = Y (k). 15 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems AnalysisI

Proposition 5 (a.s.convergence under convexity)

Let Assumptions 2 and 3 hold. Assume that the optimal set X ∗ of (Opt) is nonempty. Then, the sequence {xk } generated by (SA) converges almost surely to some random point in X ∗.

Proof: By definition of the method and the nonexpansive property of the projection operation, we obtain for any x ∗ ∈ X ∗ and k ≥ 0,

∗ 2 ∗ 2 kxk+1 − x k ≤ kxk − x − γk (∇f (xk ) + wk )k ∗ 2 T ∗ 2 2 = kxk − x k − 2γk (∇f (xk ) + wk ) (xk − x ) + γk k∇f (xk ) + wk k . By leveraging convexity and the gradient inequality, we have that

∗ T ∗ f (x ) ≥ f (xk ) + ∇f (xk ) (x − xk ),

implying that T ∗ ∗ −∇f (xk ) (xk − x ) ≤ −(f (xk ) − f (x )). 16 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems AnalysisII

By the previous observation, we have the following:

∗ 2 ∗ 2 ∗ kxk+1 − x k ≤ kxk − x k − 2γk (f (xk ) − f (x )) T ∗ 2 2 − 2γk wk (xk − x ) + γk k∇f (xk ) + wk k .

2 2 2 n ∗ ∗ Since ka + bk ≤ 2kak + 2kbk for any a, b ∈ R , by using f = f (x ), and by adding and subtracting ∇f (x ∗) in the last term, we obtain

∗ 2 ∗ 2 ∗ T ∗ kxk+1 − x k ≤ kxk − x k − 2γk (f (xk ) − f ) − 2γk wk (xk − x ) 2 ∗ 2 2 ∗ 2 + 2γk k∇f (xk ) − ∇f (x )k + 2γk k∇f (x ) + wk k .

Taking the conditional expectation given Fk , using E[wk | Fk ] = 0 and the Lipschitzian property of the gradient, we have

h ∗ 2 i 2 2 ∗ 2 ∗ E kxk+1 − x k | Fk ≤ (1 + 2L γk )kxk − x k − 2γk (f (xk ) − f )

2  ∗ 2 h 2 i + 2γk k∇f (x )k + E kwk k | Fk .

17 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems AnalysisIII

The conditions of Lemma 4 are satisfied. Therefore, almost surely, the sequence ∗ ∗ ∗ P∞ ∗ {kxk+1 − x k} is convergent for any x ∈ X and k=0 γk (f (xk ) − f ) < ∞.

The former relation implies that {xk } is bounded a.s., while the latter implies ∗ P∞ lim infk→∞ f (xk ) = f a.s. in view of the condition k=0 γk = ∞.

Since the set X is closed, all accumulation points of {xk } lie in X . Furthermore, ∗ since f (xk ) → f along a subsequence a.s., by continuity of f it follows that {xk } has a subsequence converging to some random point in X ∗ a.s. Moreover, since ∗ ∗ ∗ {kxk+1 − x k} is convergent for any x ∈ X a.s., the entire sequence {xk } converges to some random point in X ∗ a.s.

18 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems Rate of convergenceI

Proposition 6 (Rate of convergence under convexity)

Let Assumptions 2 and 3 hold. Suppose γ = √1 . Assume that the optimal set X ∗ of k k (Opt) is nonempty. Then for all K, we have that

PK−1 ∗ a2+2(L2C 2+c2) ln(K+1) γk xk [f (¯x ) − f ] ≤ 0 √ , where x¯ k=0 . E K 4 K+1 K , γk xk

Proof. We restart the proof from the recursion.

h ∗ 2 i 2 2 ∗ 2 ∗ E kxk+1 − x k | Fk ≤ (1 + 2L γk )kxk − x k − 2γk (f (xk ) − f )

2  ∗ 2 h 2 i + 2γk k∇f (x )k + E kwk k | Fk .

19 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems Rate of convergenceII

Taking unconditional expectations, we obtain that

∗ 2 2 h ∗ 2i h ∗ 2i 2E[γk (f (xk ) − f )] ≤ (1 + 2L γk )E kxk − x k − E kxk+1 − x k

2  ∗ 2 h 2i + 2γk k∇f (x )k + E kwk k .

Summing from k = 0,..., K − 1, we obtain that

K−1 K−1 X ∗ h ∗ 2i X 2 2 h ∗ 2i 2 2 2 E[γk (f (xk ) − f )] ≤ E kx0 − x k + 2 (L γk E kxk − x k + 2γk c ) k=0 k=0 K−1 ∗ P [γ (f (x ) − f )] 2 PK−1 2 2 2 2 k=0 E k k a0+2 k=0 (L C +c )γk =⇒ 2 K−1 ≤ PK−1 . P γk k=0 γk k=0

20 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems Rate of convergenceIII

∗ By Jensen’s inequality, we have the following bound for E[f (¯xK ) − f (x )], where PK−1 k=0 γk xk x¯K . , we have that , γk xk

PK−1 [γ (f (x ) − f ∗)] 2 [(f (¯x ) − f (x ∗))] ≤ 2 k=0 E k k E K PK−1 k=0 γk a2 + 2 PK−1(L2C 2 + c2)γ2 =⇒ 2 [(f (¯x ) − f (x ∗))] ≤ 0 k=0 k E K PK−1 k=0 γk 2 2 2 2 R K+1 1 a0 + 2(L C + c ) 1 x dx (By upper/lower bounding sums by integrals) ≤ x R √1 0 x dx a2 + 2(L2C 2 + c2) ln(K + 1) ≤ 0 √ . 2 K + 1

Jensen’s inequality. If f (x) is a convex function, then f (E[F (x, ω)]) ≤ E[f (F (x, ω))] for any x.

21 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Summary of findings Convergence and Rate StatementsI

Table: Convergence and Rate Statements

Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong Y Y O k O  E kxk − x k  ln(k)  [f (x, ω)] x ∈ X Convex Y Y O √ – f (¯x − x∗) E k E K Note that oracle complexity represents the number of calls to the stochastic first-order oracle to get an -solution in an expected-value sense. One call is made at every iteration (i.e. a single sample is used at each step).  ∗ 2 1 E kxk − x k ≤ . By utilizing the rate statement and recalling that a single sample is taken at each step, oracle complexity = iteration complexity (in number of projection steps). h ∗ 2i a a E kxk − x k ≤ k ≤  =⇒ k = d  e.

22 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsI

Let us begin with an assumption.

Assumption 4 (Bounded subgradients)

2 2 Suppose the sampled subgradient of f (x) satisfy E[k∇f (x, ω)k ] ≤ M for all x ∈ X where ∇x f (x, ω) ∈ ∂x f (x, ω).

Proposition 7 (Rate of convergence under nonsmoothness and convexity)

Suppose f is µ-strongly convex. Let Assumptions 2 and 4 hold. Assume that the optimal set X ∗ of (Opt) is nonempty. Then for all k, we have that 1 [kx − x ∗k2] ≤ . E k k

Proof. 1 ∗ 2 Suppose Aj := 2 kxj − x k and aj := E[Aj ].

23 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsII

Next, we observe that

1 ∗ 2 Aj+1 = 2 kxj − x k 1 ∗ 2 = 2 kΠX (xj − γj ∇f (xj , ωj )) − ΠX (x )k 1 ∗ 2 ≤ 2 k (xj − γj ∇f (xj , ωj )) − x k 1 2 2 ∗ T = Aj + 2 γj k∇f (xj , ωj )k − γj (xj − x ) ∇f (xj , ξj ). (1)

It can be seen that xj = xj (ω1, . . . , ωj−1) = xj (ω[j−1]), implying that xj is independent of ωj . As a consequence, since E[X ] = E[E[X | Y ]], we have that

∗ T ∗ T E[(xj − x ) ∇f (xj , ωj )] = E[E[(xj − x ) ∇f (xj , ωj ) | ω[j−1]]] ∗ T = E[(xj − x ) E[∇f (xj , ωj ) | ω[j−1]]] ∗ T (Conditionally unbiased) = E[(xj − x ) ∇f (xj )].

24 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsIII

It follows from taking expectations on both sides of

1 2 2 ∗ T Aj+1 ≤ Aj + 2 γj k∇f (xj , ωj )k − γj (xj − x ) ∇f (xj , ωj ), leading to

1 2 2 ∗ T aj+1 ≤ aj + 2 γ M − γj E[(xj − x ) ∇f (xj , ωj )]. By assumption, we have that f (x) is strongly convex over X with a constant µ. Recall by the optimality of x ∗,

∗ T ∗ (x − x ) ∇x f (x ) ≥ 0, ∀x ∈ X ,

∗ ∗ where ∇x f (x ) ∈ ∂x f (x ).

As a consequence, if ∇x f (xj , ω) ∈ ∂x f (xj , ω), then

∗ T ∗ T ∗ ∗ T ∗ E[(xj − x ) ∇f (xj , ω)] = E[(xj − x ) (∇f (xj , ω) − ∇f (x , ω))] + (xj − x ) ∇f (x ) | {z } ≥ 0 ∗ T ∗ ≥ E[(xj − x ) (∇f (xj , ω) − ∇f (x , ω))] ∗ 2 ≥ ckxj − x )k = 2caj . 25 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsIV

It follows from (1), that

1 2 2 aj+1 ≤ (1 − 2cγj )aj + 2 γj M . Generally in stochastic approximation problems, the steplength sequence employed is γj = θ/j where θ is a positive scalar, allowing us to claim that

1 2 2 2 aj+1 ≤ (1 − 2µθ/j)aj + 2 M θ /j . Suppose we now additional impose a requirement that θ > 1/2µ, we obtain that (see [16, Ch. 5])  2 2 −1 max θ M (2µθ − 1) , 2a1 2a ≤ . j j As a result, Q(θ) [kx − x ∗k2] ≤ , E j j where n 2 2 −1 ∗ 2o Q(θ) , max θ M (2µθ − 1) , kx1 − x k .

26 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsV

Furthermore, Q(θ) is minimized at θ = 1/µ. Consequently, n M2 ∗ 2o max µ2 , kx1 − x k [kx − x ∗k2] ≤ . E j j

Next, we assume that x ∗ ∈ int(X ). Furthermore, ∇f (x) is assumed to be Lipschitz continuous with a constant L. As a result, we have that for all x ∈ X ,

f (x ∗) ≥ f (x) + ∇f (x)T (x ∗ − x) = f (x) + (∇f (x) − ∇f (x ∗))T (x − x ∗) f (x) ≤ f (x ∗) − (∇f (x) − ∇f (x ∗))T (x − x ∗) ≤ f (x ∗) + Lkx − x ∗k2.

∗ ∗ 2 Q(θ)L It follows that E[f (xj ) − f (x )] ≤ LE[kxj − x k ] ≤ j . After j iterations, expected error in solution and function value is of the order of 1 − O(j 2 ) and O(j −1), respectively.

27 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsVI

Example 8 ([16])

1 2 ∗ Consider a noise-free problem in which f (x) = 2 κx with κ > 0. Clearly x = 0. Furthermore, G(x, ξ) = ∇f (x) = κx. Suppose θ = 1 and γj = 1/j. Then we have

xj+1 = xj − κxj /j = xj (1 − κ/j).

Suppose κ = 1. Then the optimal solution is found in one iteration. Suppose κ < 1. Then we have

    j   j κ X κ xj+1 = x1Πs=1 1 − = x1exp − ln 1 +  . s s=1 s − κ

A bit of algebra reveals that −κ −2κ xj+1 = O(1)j and f (xj+1) > O(1)j .

In effect the convergence becomes very poor as κ → 0. Specifically, to reduce the error in xj by a factor of 10, we need to increase the number of iterations by 101/κ. For example, if κ = 0.1, x1 = 1, j = 1e5, we have that xj > 0.28. To reduce this to 0.028. we need to increase the number of iterations by 1010 or j = 1015. Furthermore if the function loses strong convexity, the parameter c degenerates to zero and no choice of θ > 1/2c may be available.

28 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationI

An observation drawn from the previous subsection is that κ can be far too small to achieve a reasonable empirical behavior . An important advance was made in [12] in generating a robust stepsize policy. In such a scheme, the simulation length was fixed at some K and the steplength was defined to be constant but dependent on K. Note that such schemes provide approximate solutions

Proposition 9 (Rate of convergence under nonsmoothness and convexity [12])

Suppose f is merely convex. Let Assumption 2 and 4 hold. Assume that the optimal set X ∗ of (Opt) is nonempty. Then for all k, we have that

∗ DX M E[f (˜x1,N ) − f (x )] ≤ √ . (2) N

Proof.

29 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationII

By convexity of f (x), we have that for any x ∈ X ,

T f (x) ≥ f (xj ) + (x − xj ) ∇x f (xj ),

where ∇x f (xj ) ∈ ∂f (xj ). Taking expectations, we obtain

∗ T ∗ E[(xj − x ) ∇x f (xj )] ≥ E[f (xj ) − f (x )]. But, we have that

∗ T 1 2 2 γj E[(xj − x ) g(xj )] ≤ aj − aj+1 + 2 γj M , implying that ∗ 1 2 2 γj E[f (xj ) − f (x )] ≤ aj − aj+1 + 2 γj M . As a result, for 1 ≤ i ≤ j, we have the following:

j j j j X ∗ X 1 X 2 2 1 X 2 2 γt E[f (xt ) − f (x )] ≤ (at − at+1) + 2 γt M ≤ ai + 2 γt M . t=i t=i t=i t=i

30 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationIII

Next, we define vt and DX as

γt vt and DX max kx − x1k2. , Pj , x∈X τ=1 γτ It follows from invoking these definitions, that

" j # 1 Pj 2 2 X ∗ ai + γt M v f (x ) − f (x ) ≤ 2 t=i . (3) E t t Pj t=i t=i γt

Pj Next, we consider points given byx ˜i,j := t=i vt xt . By the convexity of X , we have thatx ˜i,j ∈ X and by the convexity of f (•):

j X f (˜xi,j ) ≤ vt f (xt ). t=i

31 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationIV

1 2 2 From (16) and by noting that a1 ≤ 2 DX and ai ≤ 2Dx for i > 1, we obtain the following:

D2 + 1 Pj γ2M2 [f (˜x ) − f (x ∗)] ≤ X 2 t=i t , 1 ≤ j (4) E 1,j Pj 2 t=i γt 4D2 + 1 Pj γ2M2 [f (˜x ) − f (x ∗)] ≤ X 2 t=i t , 1 ≤ i ≤ j (5) E i,j Pj 2 t=i γt

We are now in a position to develop constant stepsize schemes. Suppose γt = γ for all t = 1,..., N. Then it follows that

D2 + M2Nγ2 [f (˜x , N) − f (x ∗)] ≤ X . E 1 2Nγ By minimizing the right hand side in γ > 0, we obtain that D γ∗ = √X . M N

32 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationV

This leads to the following efficiency estimate:

∗ DX M E[f (˜x1,N ) − f (x )] ≤ √ . (6) N Next, we can also claim that for 1 ≤ K ≤ N,

∗ CN,K DX M 2N 1 E[f (˜xK,N ) − f (x )] ≤ √ , where CN,K , + . (7) N N − K + 1 2 If we scale γ by θ, a positive scalar, implying that

θDX γt = √ , t = 1,..., N. M N The resulting efficiency estimate then becomes the following:   ∗ 1 CN,K DX M E[f (˜xK,N ) − f (x )] ≤ max θ, √ . (8) θ N

33 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationVI

1 − Naturally, the expected error O(N 2 ) is worse than the O(N−1) obtained in the classical approach. But in that setting, the function was a smooth and strongly convex function. Here the error bound is guaranteed regardless of smoothness and the strong convexity assumptions. Further, changing θ does not have a devastating impact; it just rescales the error. Thus the qualifier “robust.”

34 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Summary of findings Convergence and Rate StatementsI

Table: Convergence and Rate Statements

Smooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong Y Y O k O  E kxk − x k  ln(k)  [f (x, ω)] x ∈ X Convex Y Y O √ – f (¯x ) − x∗) E k E K Nonsmooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong N N O k O  E kxk − x k     [f (x, ω)] x ∈ X Convex N N O √1 O 1 f (¯x ) − x∗) E K 2 E K

35 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems GoalI

Consider a deterministic µ-strongly convex and L-smooth optimization problem:

min f (x). (9) x∈X Consider a standard projected gradient method for such a scheme.

xk+1 := ΠX [xk − γ∇x f (xk )] . (10) Then we have the following non-asymptotic linear rate of convergence.

∗ 2 k kxk − x k ≤ O(q ), (11) where q < 1.

However, if we replace ∇x f (xk ) by ∇x f (xk , ωk ), then the (expected) rate of convergence decays to the following sublinear level.

∗ 2 1 E[kxk − x k ] ≤ O( k ). (12)

How can this gap in rates be bridged? Ans: by reducing the bias in directions by increasing batch-sizes.

36 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems A variance-reduced scheme

We consider the following scheme:

xk+1 := ΠX [xk − γk (∇x f (xk ) +w ¯k )] , k ≥ 0, (13)

wherew ¯k is defined as follows.

PNk ˜ j=1 ∇x F (xk ,ωj,k )−∇x f (xk ) w¯k . (14) , Nk

The following assumption is imposed onw ¯k . Assumption 5 (Moment requirements)

2 ν2 We assume that [w ¯k | Fk ] = 0 a.s. and [kw¯k k | Fk ] ≤ a.s. for all k ≥ 0. E E Nk

We make the following assumption on the problem. Assumption 6

Suppose f (x) is a continuously differentiable, µ-strongly convex, and L-smooth function n on X . In addition, suppose X ⊆ R is a closed and convex set.

37 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityI

Lemma 10

+ Suppose Assumption 6 and 5 hold. Let x, y ∈ X and suppose x and cX (x) are defined by  1  x + := Π x − (∇ f (x) + w) and c (x) L(x − x +), (15) X L x X , respectively. Then the following holds. 1 µ f (x +) − g(y) ≤ c (x)T (x − y) − kc (x)k2 − w T (x + − y) − kx − yk2. X 2L X 2

Proof. We begin by recalling the projection inequality

  1 T x + − x − (∇ f (x) + w) (x + − y) ≤ 0. L x

38 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityII

Consequently, we have that

T + T + T + ∇x f (x) (x − y) ≤ cX (x) (x − y) − w (x − y). (16)

Then by the µ-strong convexity and L-smoothness of f (·), we may now derive the following bound.

f (x +) − f (y) = f (x +) − f (x) + f (x) − f (y) L µ ≤ ∇ f (x)T (x + − x) + kx + − xk2 + ∇ f (x)T (x − y) − kx − yk2 x 2 x 2 1 µ = ∇ f (x)T (x + − y) + kc (x)k2 − kx − yk2 x 2L X 2 (16) 1 µ ≤ c (x)T (x + − y) − w T (x + − y) + kc (x)k2 − kx − yk2 X 2L X 2 1 µ = c (x)T (x − y) − kc (x)k2 − w T (x + − y) − kx − yk2. X 2L X 2

39 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityIII

Theorem 11 (a.s. convergence and linear rate of convergence)

Let {xk } be generated by (13). Suppose Assumptions 6 and 5 hold. Then the sequence generated by (13) satisfies the following. 1 We have that k→∞ ∗ xk −−−→ x . a.s. 2 Furthermore, we have that for all k,

∗ 2 ∗ 2 k E[kxk − x k ] ≤ E[kx0 − x k ](q + c) ,

2ν2 where q , (1 − µ/L), q + c < 1, Nk , d L2ck e.

Proof. (a) We begin by noting the following.

 1  x = Π x − (∇f (x ) +w ¯ ) , (17) k+1 X k L k k

40 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityIV

∗ + From Lemma 10, by setting x = xk , w =w ¯k , and y = x , we have that x = xk+1, cX (xk ) = L(xk − xk+1), implying the following inequality.

T ∗ ∗ 1 2 µ ∗ 2 T ∗ −cX (xk ) (xk − x ) ≤ − (f (xk+1) − f (x )) − kcX (xk )k − kxk − x k − w¯k (xk+1 − x ) | {z } 2L 2 ≥ 0 1 µ ≤ − kc (x )k2 − kx − x ∗k2 − w¯T (x − x ∗). (18) 2L X k 2 k k k+1

∗ 2 Since cX (xk ) = L(xk − xk+1), we may bound kxk+1 − x k as follows. 1 1 2 kx − x ∗k2 = kx − c (x ) − x ∗k2 = kx − x ∗k2 + kc (x )k2 − c (x )T (x − x ∗) k+1 k L X k k L2 X k L X k k (18) 1 2  1 µ  ≤ kx − x ∗k2 + kc (x )k2 − kc (x )k2 + kx − x ∗k2 +w ¯T (x − x ∗) k L2 X k L 2L X k 2 k k k+1  µ  2 = 1 − kx − x ∗k2 − w¯T (x − x ∗) L k L k k+1  µ  2 2 = 1 − kx − x ∗k2 − w¯T (x − x¯ ) − w¯T (¯x − x ∗), L k L k k+1 k+1 L k k+1

41 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityV

 1  wherex ¯k+1 , ΠX xk − L ∇x f (xk ) . By (17) and the non-expansivity of the Euclidean T 2 projector, one may obtain that −w¯k (xk+1 − x¯k+1) ≤ kw¯k k kxk+1 − x¯k+1k ≤ kw¯k k /L. Therefore,  µ  2 2 kx − x ∗k2 ≤ 1 − kx − x ∗k2 + kw¯ k2 − w¯T (¯x − x ∗). k+1 L k L2 k L k k+1

Taking expectations conditioned on Fk on both sides of the above equation, we obtain ∗ the next inequality since xk and x¯ k+1 are adapted to Fk .

∗ 2  µ  ∗ 2 2 2 [kx − x k | F ] ≤ 1 − kx − x k + [kw k | F ] ( by [w |F ]=0) E k+1 k L k L2 E k k E k k 2  µ  ∗ 2 2ν ( by Assumption 5) ≤ 1 − [kxk − x k ] + 2 , . L L Nk It follows by using the super-Martingale convergence theorem and recalling that P 1 < ∞, we have that k Nk ∗ 2 a.s. kxk − x k −−−→ 0. k→∞

42 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityVI

(b) By taking unconditional expectations, we obtain that

2 ∗ 2  µ  ∗ 2 2ν E[kxk+1 − x k ] ≤ 1 − E[kxk − x k ] + 2 L L Nk 2 2 2 ∗ 2 2ν 2ν = d E[kxk−1 − x k ] + d 2 + 2 L Nk−1 L Nk 2 2 k ∗ 2 k−1 2ν 2ν = d E[kx0 − x k ] + d 2 + ... + 2 L N1 L Nk ∗ 2  k k−1 k−1 k  ≤ E[kx0 − x k ] d + cd + ... + dc + c ∗ 2 k ≤ E[kx0 − x k ](d + c) ,

2ν2 k ∗ 2 2ν2 where 2 ≤ c [kx0 − x k ] since Nk d 2 k e. L Nk E , L c

43 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

1 We now consider the regime of structured nonsmooth problems

2 Specifically, we consider the following problem:

min f (x) + g(x), where f (x) [f (x, ω)] , (Comp) n , E x∈R f is a smooth convex function and g is a convex, closed, and proper function on X with a tractable proximal evaluation.

(dom(g) , {x : g(x) < +∞}. g is proper if dom(g) is nonempty and g(x) > −∞ for all x ∈ dom(g). g is lower semicontinuous if lim inf g(x) ≥ g(x ) for all x . A function is closed if for each α ∈ , {x ∈ dom(g): g(x) ≤ α} is a closed set. A proper convex x→x0 0 0 R function is closed if and only if it is lsc.) 3 Examples of g(x) include the following:

(i) g(x) , kxk1 (`1 norm) – “encourage sparsity” (ii) g(x) , 1X (x) (Indicator function) – capture constraints

44 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

Algorithm 1 VS-APM

(0) Given x1 ∈ X , y1 = x1 and positive sequences {γk , Nk }; Set λ0 = 0, λ1 = 1; k := 1. (1) y = P (x − γ (∇ f (x ) +w ¯ )); k+1 γk√,g k k x k k 2 1+ 1+4λk (2) λk+1 = 2 ; (λk −1) (3) xk+1 = yk+1 + (yk+1 − yk ); λk+1 (4) If k > K, then stop; else k := k + 1; return to (1).

1 This scheme has two steps

2 Step 1: “Prox step” requires defining the proximal operator Pg (x)   1 2 Pg (x) argmin g(u) + kx − uk . (19) , n u∈R 2

3 When g is proximable or has a tractable proximal evaluation, this minimization problem is either solvable in closed form or cheap to solve. 1 If g(x) = 1X (x), then Pg (x) , ΠX (x). 2 If g(x) = 0, then Pg (x) , x.

45 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

3 If g(x) = kxk1, then  x − 1, x ≥ 1  i i Pg (x) , 0, kxi k ≤ 1 .  xi + 1, xi ≤ −1.

4 Step 2: Averages between consecutive yk , i.e.

1 − λk xk+1 := (1 − βk )yk+1 + βk yk , where βk , . λk+1 In this subsection, we develop rate and oracle complexity statements for (VS-APM) when f is smooth. We begin with a modified assumption.

Assumption 7

(i) The function g(x) is lower semicontinuous and convex with effective domain denoted by dom(g); (ii) f (x) is smooth on an open set containing dom(g); (iii) There exists ∗ C > 0 such that E[kx1 − x k] ≤ C.

46 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems Lemma 12

Given a symmetric positive definite matrix Q, then for any ν1, ν2, ν3: √ T 1 2 2 2 T (ν2 − ν1) Q(ν3 − ν1) = 2 (kν2 − ν1kQ + kν3 − ν1kQ − kν2 − ν3kQ ), where kνkQ , ν Qν.

Lemma 13

Suppose Assumptions 2 and 7 hold. Furthermore, γk = 1/2L for all k. If h(xk ) , 2L(xk − yk+1), µ 2 1 2 T 2 2 F (x) − 2 kx − xk k ≥ F (yk+1) + 4L kh(xk )k + h(xk ) (x − xk ) − L kw¯k k .

Lemma 14

Consider Algorithm 2 and suppose Assumptions 2 and 7 hold for f (x) and g(x). If {γk } is a decreasing sequence and γk ≤ ηk /2, then the following holds for all K ≥ 2:

K−1 2 2 ∗ 2 X 2 2 ν 2C E[F (yK ) − F (x )] ≤ 2 γk k + 2 . γK−1(K − 1) Nk γK−1(K − 1) k=1

Proof.

47 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

1 By the update rule in Algorithm 2, we have

1 2 T yk+1 = argmin g(x) + kx − xk k + (∇x f (xk ) +w ¯k ) x. (20) x 2γk

2 From the optimality condition for (20), 1 0 ∈ ∂g(yk+1) + (yk+1 − xk ) + ∇x f (x) +w ¯k γk 1 =⇒ − (yk+1 − xk ) − ∇x f (xk ) − w¯k ∈ ∂g(yk+1). (21) γk

T 3 By convexity of g(x), we have that g(x) ≥ g(yk+1) + s (x − yk+1) for all s ∈ ∂g(yk+1). Hence, by (21), we obtain the following.

T T g(x) + (∇x f (xk ) +w ¯k ) x ≥ g(yk+1) + (∇x f (xk ) +w ¯k ) yk+1

1 T − (x − yk+1) (yk+1 − xk ). γk

48 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

4 Now by using Lemma 12, we obtain that

T 1 2 g(x) + (∇x f (xk ) +w ¯k ) x + kx − xk k (22) 2γk

T 1 2 1 2 ≥ g(yk+1) + (∇x f (xk ) +w ¯k ) yk+1 + kxk − yk+1k + kx − yk+1k 2γk 2γk

1 T 1 T + (x − yk+1) (yk+1 − xk ) − (x − yk+1) (yk+1 − xk ) γk γk

T 1 2 1 2 = g(yk+1) + (∇x f (xk ) +w ¯k ) yk+1 + kxk − yk+1k + kx − yk+1k . 2γk 2γk

49 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

5 By invoking the convexity of f (x) and Lipschitz continuity of ∇x f (x), we obtain

(Convexity) T f (x) ≥ f (xk ) + ∇x f (xk ) (x − xk ) (L-smoothness) L ≥ f (y ) + ∇f (x )T (x − y ) − kx − y k2 + ∇ f (x )T (x − x ) k+1 k k k+1 2 k k+1 x k k L = f (y ) + (∇ f (x ))T (x − y ) − kx − y k2 k+1 x k k+1 2 k k+1 L = f (y ) + (∇ f (x ) +w ¯ )T (x − y ) − kx − y k2 k+1 x k k k+1 2 k k+1 T − w¯k (x − yk+1), (23)

where the last equality follows from adding and subtractingw ¯k .

50 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

6 By adding (22) and (23), we obtain   1 2 1 2 L 1 2 F (yk+1) − F (x) ≤ kx − xk k − kx − yk+1k + − kxk − yk+1k 2γk 2γk 2 2γk T − w¯k (yk+1 − x)   L 1 2 1 T = − kxk − yk+1k + (xk − yk+1) (xk − x) (24) 2 2γk γk

1 2 T − kxk − yk+1k − w¯k (yk+1 − x), 2γk   L 1 2 1 T = − kxk − yk+1k + (xk − yk+1) (xk − x) (25) 2 γk γk T − w¯k (yk+1 − x),

where the penultimate equality follows from Lemma 12 by choosing Q = I , v1 = xk , v2 = x, and v3 = yk .

51 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

7 By setting x = yk in (24), we have

 L 1  2 1 T F (yk+1) − F (yk ) ≤ − kxk − yk+1k + (xk − yk+1) (xk − yk ) 2 γk γk T − w¯k (yk+1 − yk ). (26)

∗ 8 Similarly, by letting x = x , we can obtain

∗  L 1  2 1 T ∗ Fηk (yk+1) − Fηk (x ) ≤ − kxk − yk+1k + (xk − yk+1) (xk − x ) 2 γk γk T ∗ − w¯k (yk+1 − x ). (27)

9 By invoking Lemma 12 where v1 = xk , v2 = yk+1 and v3 = yk , we obtain

1 T 1  2 2 2 (yk+1 − xk ) (yk − xk ) = kyk − xk k + kyk+1 − xk k − kyk+1 − yk k . γk 2γk

52 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

10 Consequently, (26) can further bounded as follows:

F (yk+1) − F (yk )

 L 1  2 1 T T ≤ − kxk − yk+1k + (xk − yk+1) (xk − yk ) − w¯k (yk+1 − yk ) 2 γk γk

Lemma 12  L 1  2 1  2 2 2 = − kxk − yk+1k + kxk − yk k + kyk+1 − xk k − kyk+1 − yk k 2 γk 2γk T − w¯k (yk+1 − yk )

 L 1  2 1  2 2 = − kxk − yk+1k + kxk − yk k − kyk+1 − yk k 2 2γk 2γk T − w¯k (yk+1 − yk ). (28)

11 Similarly we have that

∗  L 1  2 1  ∗ 2 ∗ 2 F (yk+1) − F (x ) ≤ − kxk − yk+1k + kxk − x k − kyk+1 − x k 2 2γk 2γk T ∗ − w¯k (yk+1 − x ). (29)

53 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

12 By multiplying (28) by (λk − 1) and adding to (29),

λk δk+1 − (λk − 1)δk (30)

 L 1  2 ≤ − λk kyk+1 − xk k 2 2γk

1  2 2 1  ∗ 2 ∗ 2 + (λk − 1) kxk − yk k − kyk+1 − yk k + kxk − x k − kyk+1 − x k 2γk 2γk T ∗ +w ¯k ((λk − 1)yk + x − λk yk+1) , (31)

∗ where δk , F (yk ) − F (x ).

54 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

13 Again by using Lemma 12, we may express the terms in (31) as follows:

1  2 2 1  ∗ 2 ∗ 2 (λk − 1) kxk − yk k − kyk+1 − yk k + kxk − x k − kyk+1 − x k 2γk 2γk

1  2 2 2 = λk kxk − yk k − λk kyk+1 − yk k − kxk − yk k 2γk 2 ∗ 2 ∗ 2 +kyk+1 − yk k + kxk − x k − kyk+1 − x k

1  2 T 2 = −λk kyk+1 − xk k + 2λk (yk+1 − xk ) (yk − xk ) + kyk+1 − xk k 2γk T 2 T ∗  −2(yk+1 − xk ) (yk − xk ) − kyk+1 − xk k + 2(yk+1 − xk ) (x − xk )

1  2 T ∗  = −λk kyk+1 − xk k + 2(yk+1 − xk ) ((λk − 1)yk − λk xk + x ) . 2γk

14 In addition,

T ∗ w¯k ((λk − 1)yk + x − λk yk+1) T ∗ T =w ¯k ((λk − 1)yk + x − λk xk ) +w ¯k (λk xk − λk yk+1) .

55 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

15 From the update rule, 2 2 λk−1 = λk (λk − 1) = λk − λk .

16 Now by multiplying (31) by λk , we obtain the following, where ∗ uk = (λk − 1)yk − λk xk + x :

2 2 λk δk+1 − λk−1δk   2 L 1 2 ≤ λk − kyk+1 − xk k (32) 2 2γk

1  2 T ∗  + −kλk yk+1 − λk xk k + 2(λk yk+1 − λk xk ) ((λk − 1)yk + x − λk xk ) 2γk 2 T T − λk w¯k (xk − yk+1) − λk wk uk   2 L 1 2 2 T = λk − kyk+1 − xk k − λk w¯k (xk − yk+1) (33) 2 2γk

1  ∗ 2 ∗ 2 T + kλk xk − (λk − 1)yk − x k − kλk yk+1 − (λk − 1)yk − x k − λk wk uk 2γk 2 λk 2 1  2 2 T ≤   kw¯k k + kuk k − kuk+1k − λk wk uk , 2 1 − L 2γk γk

56 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

where in the last inequality we used the update rule of algorithm, λk −1 xk+1 = yk+1 + (yk+1 − yk ), to obtain the following: λk+1

∗ ∗ uk+1 = (λk+1 − 1)yk+1 − λk+1xk+1+x = (λk − 1)yk − λk yk+1+x .

17 By multiplying both sides by γk and assuming γk ≤ γk−1, we obtain

2 2 2 γk λk 2 1  2 2 T γk λk δk+1 − γk−1λk−1δk ≤   kw¯k k + kuk k − kuk+1k − γk λk wk uk . 2 1 − L 2 γk (34)

1 1 1 18 By assuming γk ≤ , we obtain − L ≥ , implying that 2L γk 2γk 1   γ λ2 δ − γ λ2 δ ≤ γ2λ2 kw¯ k2 + ku k2 − ku k2 − γ λ w¯ T u . (35) k k k+1 k−1 k−1 k k k k 2 k k+1 k k k k

57 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

Summing (35) from k = 1 to K − 1, we have the following:

K−1 K−1 2 X 2 2 2 1 2 X T γ λ δ ≤ γ λ kw¯ k + ku k − γ λ w¯ u K−1 K−1 K k k k 2 1 k k k k k=1 k=1 K−1 1 X 2 2 2 1 2 =⇒ δK ≤ 2 γk λk kw¯k k + 2 ku1k γK−1λ 2γK−1λ K−1 k=1 K−1 K−1 1 X T − 2 γk λk w¯k uk . γK−1λ K−1 k=1

19 Taking expectations, we note that the last term on the right is zero (under a zero bias assumption), leading to the following:

K−1 2 1 X 2 2 ν 1 2 E[δK ] ≤ 2 γk λk + 2 E[ku1k k] γK−1λ Nk 2γK−1λ K−1 k=1 K−1 K−1 2 2 2 X 2 2 ν 2C ≤ 2 γk k + 2 , γK−1(K − 1) Nk γK−1(K − 1) k=1

58 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

where in the last inequality we used the fact that ky − x ∗k ≤ C for all y ∈ dom(g) k and 2 ≤ λk ≤ k which may be shown inductively.

Corollary 15 (Rate and oracle complexity bounds with smooth f for (VS-APM)) Suppose f (x) is smooth and Assumption 2 and 7 hold. Suppose γk = γ ≤ 1/2L for all k. a 2ν2γ(a−2) 4C 2 (i) Let Nk = bk c where a = 3 + δ and Cb , a−3 + γ . Then the following holds.

K()   ∗ Cb X 1 [F (y − F (x ))] ≤ for all K and N ≤ O , E K+1 K 2 k 2+δ/2 k=1

∗ where E[F (yK()+1) − F (x )] ≤ . 2 ˜ 2 4C 2 (ii) Given a K > 0, let Nk = bk Kc where a > 3 and C , 2ν γ + γ . Then the following holds.

˜ K   ∗ C X 1 ∗ [F (y − F (x ))] ≤ and N ≤ O , where [F (y ) − F (x )] ≤ . E K+1 K 2 k 2 E K+1 k=1

59 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

a 1 a Proof. (i)Let Nk = bk c ≥ 2 k and γk = γ. Then the following holds where 2ν2γ(a−2) 4C 2 Cb , a−3 + γ .

2 K 2 2 2 2 ∗ 2ν γ X k 4C 2ν γ(a − 2) 4C [F (y ) − F (x )] ≤ + ≤ + = Cb , (36) E K+1 K 2 ka γK 2 (a − 3)K 2 γK 2 K 2 k=1 where the first inequality follows from bounding the summation as follows:

K K Z K 3−a X 2−a X 2−a 2−a 1 K 1 a − 2 k = 1 + k ≤ 1 + x dx = − + 1 ≤ + 1 = . a − 3 a − 3 a − 3 a − 3 k=1 k=2 1

∗ Cb Suppose yK+1 satisfies E[F (yK+1) − F (x )] ≤ , implying that K 2 ≤  or K = dCb1/2 / 1/2e. If  ≤ Cb/2, then the oracle complexity can be bounded as follows: √ √ 1+ C/ q K K b Z 2+ Cb/ (2 + C/)1+a X X a X a a b N ≤ k = k ≤ k da = k 1 + a k=1 k=1 k=1 0 p !1+a Cˆ  1  ≤ √ = O . 2  2+δ/2

60 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

2 1 2 (ii) Let Nk = bk Kc ≥ 2 k K. Then similar to part (i), we may bound the expected ˜ 2 4C 2 sub-optimality as follows where C , 2ν γ + γ .

2 K 2 2 2 2 ˜ ∗ 2ν γ X k 4C 2ν γ 4C C [F (y ) − F (x )] ≤ + = + ≤ . E K+1 K 2 k2K γK 2 K 2 γK 2 K 2 k=1

Since K = dC˜1/2/1/2e, the oracle complexity may be bounded as follows:

K K   X X 2 1 2 1 2 2 4 1 N ≤ k K = K (K + 1)(2K + 1) = K (2K + 3K + 1) ≤ K ≤ O . k 6 6 2 k=1 k=1

61 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Summary of findings Convergence and Rate StatementsI

Table: Convergence and Rate Statements

Smooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong Y Y O k O  E kxk − x k  ln(k)  [f (x, ω)] x ∈ X Convex Y Y O √ – f (¯x ) − x∗) E k E K Nonsmooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong N N O k O  E kxk − x k     [f (x, ω)] x ∈ X Convex N N O √1 O 1 f (¯x ) − x∗) E K 2 E K Smooth and structured nonsmooth convex + variance reduction Obj., Cons. Convexity Smooth. a.s. Rate Oracle complexity γk , Nk Metric  k   1  −k h ∗ 2i E[f (x, ω)] , x ∈ X Strong Y Y O q O  γ, daq e E kxk − x k     [f (x, ω)] , x ∈ X Convex Composite Y O 1 O 1 γ, bk3+δ e f (¯x ) − x∗) E K2 2 E K

62 / 106 A Tutorial on Stochastic Approximation Smoothing

1 A key shortcoming of the previous scheme is that the expectation-valued part (namely f (x)) of the problem is smooth 2 Suppose the expectation-valued part of the objective is nonsmooth, i.e. f (x) is an expectation-valued nonsmooth but (1, B2) smoothable function while g(x) is assumed to be a proper, closed, and convex function with an efficient proximal operator. 3 We recall the following definition of a smoothable function [2].

Definition 16 n A convex function f : R → R is said to be (α, β) smoothable if there exists a n continuously differentiable convex function fη : R → R such that

fη(x) ≤ f (x) ≤ fη(x) + ηB

n for all x ∈ R and fη is α/η-smooth, i.e.

α n k∇x fη(x) − ∇x fη(y)k ≤ η kx − yk, ∀x, y ∈ R .

4 There are a host of smoothing functions based on the nature of f .

63 / 106 A Tutorial on Stochastic Approximation Smoothing

q 2 2 For instance, when f (x) = kxk2, then fη(x) = kxk2 + η − η, implying that f is (1, 1)−smoothable function. If f (x) = max(x1, x2,..., xn), then f is (1, log(n))-smoothable and Pn xi /η fη(x) = η log( i=1 e ) − η log(n). (see [3] for more examples). When f is a proper, closed, and convex function, the Moreau envelope is defined as n 1 2o 2 fη(x) , minu f (u) + 2η ku − xk . In fact, f is (1, B )-smoothable when fη is given by the Moreau envelope (see [3]) and B denotes a uniform bound on ksk in x where s ∈ ∂f (x).

5 When f (x, ω) is a proper, closed, and convex function in x for every ω, then f (x, ω) 2 is (1, B )-smoothable for every ω where fη(x, ω) is a suitable smoothing. 6 We proceed to develop a smoothed variant of (VS-APM), referred to as

(sVS-APM), in which ∇x fηk (xk , ωk ) is generated from the stochastic oracle and ηk is driven to zero at a sufficient rate (See Alg. 2).

64 / 106 A Tutorial on Stochastic Approximation Smoothing

Algorithm 2 Smoothed VS-APM (sVS-APM)

(0) Given x1 ∈ X , y1 = x1 and positive sequences {γk , Nk , ηk }; Set λ0 = 0, λ1 = 1; k := 1. (1) y = P (x − γ (∇ f (x ) +w ¯ )); k+1 γk√,g k k x ηk k k 2 1+ 1+4λk (2) λk+1 = 2 ; (λk −1) (3) xk+1 = yk+1 + (yk+1 − yk ); λk+1 (4) If k > K, then stop; else k := k + 1; return to (1).

Note that in this setting, we assume that for every η > 0, one can generate an estimator ∇˜ x fη(xk , ωk ) of the true gradient

Assumption 8

(i) The function g(x) is lower semicontinuous and convex with effective domain denoted by dom(g); (ii) f (x) is nonsmooth but (1, B2) smoothable on an open set containing ∗ dom(g); (iii) There exists C > 0 such that E[kx1 − x k] ≤ C.

We are now ready to prove our main rate result and oracle complexity bound for (sVS-APM).

65 / 106 A Tutorial on Stochastic Approximation Smoothing

Theorem 17 (Rate Statement and Oracle Complexity Bound for (sVS-APM))

Suppose Assumptions 2 and 8 hold. Suppose {λk } is specified in Algorithm 2. Suppose a ηk = 1/k, and γk = 1/2k, and Nk = bk c, where a > 1. ¯ 2ν2a 2 2 (i) If C , a−1 + 4C + B , then the following holds for any K ≥ 1:

C¯ 2ν2a [F (y ) − F (x ∗)] ≤ , where C¯ + 4C 2 + B2. E K+1 K , (a − 1)

˜ ∗ PK 1  (ii) Let  ≤ C/2 and K is such that E[F (yK+1) − F (x )] ≤ . Then k=1 Nk ≤ O 1+a .

a 1 a Proof. (i) If Nk = bk c ≥ 2 k and γk = 1/(2k) is utilized in Lemma 14, we obtain the following

K 2ν2 X 1 4C 2 [δ ] ≤ + . (37) E K+1 K ka K k=1

66 / 106 A Tutorial on Stochastic Approximation Smoothing

For a > 1, we may derive the next bound.

K K Z K 1−a X −a X −a −a 1 − K a k = 1 + k ≤ 1 + k dk = 1 + ≤ . a − 1 a − 1 k=1 k=2 1

2 By invoking (1, B )−smoothability of f and ηK = 1/K, we have that ∗ ∗ 2 FηK (yK+1) ≤ F (yK+1) and −FηK (x ) ≤ −F (x ) + ηB . Hence, the required bound follows from (37)

2ν2a 4C 2 + B2 C¯ 2ν2a [F (y ) − F (x ∗)] ≤ + ≤ , where C¯ + 4C 2 + B2. E K+1 (a − 1)K K K , (a − 1)

∗ C¯ (ii) To find yK+1 satisfying E[F (yK+1) − F (x )] ≤  we have K ≤  which implies that ¯ PK K = dC/e. To obtain the optimal oracle complexity we require k=1 Nk gradients. Hence, the following holds for sufficiently small  such that 2 ≤ C¯/:

¯ K K 1+C/ Z 2+C¯/ ¯ 1+a  ¯ 1+a   X X a X a a (2 + C/) C 1 N ≤ k = k ≤ k da = ≤ ≤ O . k 1 + a  1+a k=1 k=1 k=1 0

67 / 106 A Tutorial on Stochastic Approximation Smoothing Summary of findings Convergence and Rate StatementsI

Table: Convergence and Rate Statements

Smooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong Y Y O k O  E kxk − x k  ln(k)  [f (x, ω)] x ∈ X Convex Y Y O √ – f (¯x ) − x∗) E k E K Nonsmooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong N N O k O  E kxk − x k     [f (x, ω)] x ∈ X Convex N N O √1 O 1 f (¯x ) − x∗) E K 2 E K Smooth stochastic and structured nonsmooth convex + variance reduction Obj., Cons. Convexity Smooth. a.s. Rate Oracle complexity γk , Nk Metric  k   1  −k h ∗ 2i E[f (x, ω)] , x ∈ X Strong Y Y O q O  γ, daq e E kxk − x k     [f (x, ω)] , x ∈ X Convex Composite Y O 1 O 1 γ, bk3+δ c f (¯x ) − x∗) E K2 2 E K Nonsmooth (but smoothable) stochastic and structured nonsmooth convex + variance reduction Obj., Cons. Convexity Smooth. a.s. Rate Oracle complexity γk , Nk Metric     [f (x, ω)] , x ∈ X Convex Nonsmooth Y O 1 O 1 γ, bk1+δ c f (¯x ) − x∗) E K 2+δ E K

68 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

IntroductionI

Suppose the problem has expectation-valued constraints as given by the following:

min f (x) , E[f (x, ω)] subject to c(x) , E[c(x, ω)] ≤ 0, (λ) (Opt) x ∈ X .

Suppose f , c are smooth and convex functions. Then under suitable regularity conditions, z = (x, λ) denotes a primal-dual solution of (Opt) if and only if

(y − z)T H(z) ≥ 0, ∀y ∈ X ,

where  T  [E[∇x f (x, ω) + ∇x c(x, ω) λ] H(z) , . E[c(x, ω)] n In other words, z is a solution of VI(X , H) where H : X → R .

69 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

IntroductionII

We often qualify this problem as a stochastic variational inequality problem, since H is expectation-valued. In fact, the mapping H is monotone on X . This can be concluded as follows.

T (H(z1) − H(z2)) (z1 − z2) T ∇ f (x ) + ∇ c(x )T λ − ∇ f (x ) − ∇ c(x )T λ  (x − x ) = x 1 x 1 1 x 2 x 2 2 1 2 c(x1) − c(x2) (λ1 − λ2) T = (∇x f (x1) − ∇x f (x2)) (x1 − x2) | {z } ≥0 T T + (∇x c(x1)(x1 − x2)) λ1 − (∇x c(x2)(x1 − x2)) λ2 | {z } | {z } ≥c(x2)−c(x1) ≥c(x2)−c(x1) T + (c(x1) − c(x2)) (λ1 − λ2) ≥ 0.

70 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

IntroductionIII

Note that since c(x) is a vector of convex constraints, it follows that

T ∇x cj (x1) (x2 − x1) ≤ cj (x2) − cj (x1), for j = 1,..., m.

=⇒ ∇x c(x1)(x2 − x1) ≤ c(x2) − c(x1).

Under suitable regularity conditions on the problem (such as Slater’s regularity condition), we may also prove that H is Lipschitz.

71 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeI

˜ ˜ Suppose wk , H(zk ) − H(zk , ωk ) where H(zk , ωk ) is an estimator of H(zk ).

Assumption 9 (Mapping requirements)

We assume that H is a monotone and L-Lipschitz continuous mapping on X .

We consider the following two-step (extragradient) scheme for resolving this problem.

xk+ 1 := ΠX (xk − γ(F (xk ) + wk )), 2 (38) xk+1 := ΠX (xk − γ(F (xk+1/2) + wk+1/2).

72 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeII

Lemma 18

Consider the stochastic variational inequality problem defined by SVI(X , H) and let x ∗ denote any solution of SVI(X , H). Suppose Assumptions 2 and 9 hold. Furthermore, suppose X is a nonempty, closed, convex, and bounded set. Consider the sequence of the T ∗ iterates be generated by the extragradient scheme (38) and let uk , 2γk H(xk ) (xk − x ). Then, the following holds for any iterate k:

 γ2  kx − x ∗k2 ≤ 1 + k kx − x ∗k2 − u − 2γ w T (x − x ∗) + γ2t , (39) k+1 β k k k k+1/2 k k k where the scalar tk ≥ 0 is an appropriately defined scalar.

Proof. Let yk = xk − γk (H(xk+1/2) + wk+1/2). Then,

∗ 2 ∗ 2 kxk+1 − x k = kΠX (yk ) − x k ∗ 2 2 T ∗ = kyk − x k + kΠX (yk ) − yk k + 2 (ΠX (yk ) − yk ) (yk − x ).

73 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeIII

2 T ∗ Note that 2kyk − ΠX (yk )k + 2(ΠX (yk ) − yk ) (yk − x ) 2 T ∗ = 2kyk − ΠX (yk )k + 2(ΠX (yk ) − yk ) (yk − ΠX (yk ) + ΠX (yk ) − x ) 2 2 T ∗ = 2kyk − ΠX (yk )k − 2kyk − ΠX (yk )k + 2(ΠX (yk ) − yk ) (ΠX (yk ) − x ) T ∗ = 2(ΠX (yk ) − yk ) (ΠX (yk ) − x ) ≤ 0, where the last inequality follows from the projection property. Consequently, we have that

2 T ∗ 2 kyk − ΠX (yk )k + 2(ΠX (yk ) − yk ) (yk − x ) ≤ −kyk − ΠX (yk )k . (40)

74 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeIV

∗ 2 By invoking (40) in the expansion of kxk+1 − x k , we obtain

∗ 2 ∗ 2 2 ∗ T ∗ kxk+1 − x k = kyk − x k + kΠX (yk ) − yk k + 2(ΠX (yk ) − x ) (yk − x ) ∗ 2 2 ≤ kyk − x k − kyk − ΠX (yk )k ∗ 2 = kxk − γk (H(xk+1/2) + wk+1/2) − x k 2 − kxk − γk (H(xk+1/2) + wk+1/2) − xk+1k ∗ 2 2 2 = kxk − x k + γk kH(xk+1/2) + wk+1/2)k ∗ T − 2γk (xk − x ) (H(xk+1/2) + wk+1/2) 2 2 2 − kxk+1 − xk k − γk kH(xk+1/2) + wk+1/2)k T + 2γk (xk − xk+1) (H(xk+1/2) + wk+1/2) ∗ 2 2 ∗ T = kxk − x k − kxk − xk+1k + 2γk (x − xk+1) (H(xk+1/2) + wk+1/2).

75 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeV

T By adding and subtracting xk (H(xk+1/2)+ wk+1/2), we obtain

∗ 2 ∗ 2 2 ∗ T kxk+1 − x k ≤ kxk − x k − kxk − xk+1k + 2γk (x − xk+1) (H(xk+1/2) + wk+1/2) ∗ 2 2 ∗ T = kxk − x k − kxk − xk+1k + 2γk (x − xk ) (H(xk+1/2) + wk+1/2) T + 2γk (xk − xk+1) (H(xk+1/2) + wk+1/2) ∗ 2 2 ∗ T ≤ kxk − x k − kxk − xk+1k + 2γk (x − xk ) (H(xk+1/2) + wk+1/2) 2 2 2 + kxk − xk+1k + γk kH(xk+1/2) + wk+1/2k ∗ 2 ∗ T ∗ T = kxk − x k + 2γk (x − xk ) H(xk ) + 2γk (x − xk ) (H(xk+1/2) − H(xk )) | {z } Term a ∗ T 2 2 + 2γk (x − xk ) wk+1/2 + γk kH(xk+1/2) + wk+1/2k .

76 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeVI

Next, we observe that Term a can be bounded as follows:

γ2 γ2 Term a ≤ k kx ∗ − x k2 + βkH(x − H(x ))k2 ≤ k kx ∗ − x k2 + βL2kx − x k2 β k k+1/2 k β k k+1/2 k γ2 ≤ k kx ∗ − x k2 + βL2kΠ (x − γ (H(x ) + w )) − Π (x )k2 β k X k k k k X k γ2 = k kx ∗ − x k2 + γ2βL2kH(x ) + w k2. β k k k k

77 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeVII

Therefore, we have that

∗ 2 kxk+1 − x k (41)  γ2  ≤ 1 + k kx − x ∗k2 − 2γ (x − x ∗)T H(x ) + 2γ (x ∗ − x )T w β k k k k k k k+1/2 2  2 2 2 2 2 2 + γk kH(xk+1/2)k + kwk+1/2k + βL kH(xk )k + βL kwk k

2  2 T T  + 2γk βL wk H(xk ) + wk+1/2H(xk+1/2)

=uk  γ2  z }| { ≤ 1 + k kx − x ∗k2 − 2γ (x − x ∗)T H(x ) + 2γ (x ∗ − x )T w (42) β k k k k k k k+1/2

=tk z }| {  B2(1 + βL2)  + γ2 + kw k2 + βL2kw k2 + 2w T H(x ) + 2βL2w T H(x ) , k 4 k+1/2 k k+1/2 k+1/2 k k where we invoke the boundedness of H over X .

78 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeVIII

Proposition 19 (a.s. convergence of ESA)

Consider SVI(X , H). Suppose Assumptions 9 and 2 hold. Suppose X is a nonempty, closed, convex, and bounded set. Then, the extragradient scheme (38) generates a sequence {xk } such that {xk } is bounded a.s. and any limit point of {xk } is a solution of SVI(X , H) in an a.s. sense.

79 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeIX

Proof. Taking expectations conditioned on Fk , we have that

 γ2  [kx − x ∗k2 | F ] ≤ 1 + k kx − x ∗k2 − u − 2γ [(x − x ∗)T w | F ] E k+1 k β k k k E k k+1/2 k  B2(1 + βL2)  + γ2 + [kw k2 | F ] + βL2 [kw k2 | F ] k 4 E k+1/2 k E k k 2  T 2 T  + γk E[2wk+1/2H(xk+1/2) | Fk ] + 2βL E[wk H(xk )|Fk ]  γ2  = 1 + k kx − x ∗k2 − u − 2γ [ [(x − x ∗)T w | F ]| F ] β k k k E E k k+1/2 k+1/2 k  B2(1 + βL2)  + γ2 + [ [kw k2 | F ] | F ] + βL2 [kw k2 | F ] k 4 E E k+1/2 k+1/2 k E k k 2  T 2 T  + γk E[E[2wk+1/2H(xk+1/2) | Fk+1/2] | Fk ] + 2βL E[wk H(xk ) | Fk ] ∗ 2 ≤ (1 + δk ) kxk − x k − uk + ψk , (43) γ2  (B2 + 8ν2)(1 + βL2)  where δ k and ψ γ2 , k , β k , k 4

80 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeX

and the inequalities follow from using the tower law and assumption (A1). The remainder of the proof requires application of the super-martingale convergence theorem (Lemma 4).

This requires that uk ≥ 0 for all k, which follows from noting that F is a monotone map over X , implying that

∗ T ∗ T ∗ ∗ T ∗ (xk − x ) H(xk ) = (xk − x ) (H(xk ) − H(x )) + (xk − x ) H(x ) ≥ 0. P Using assumption (A3), it is observed that k ψk < ∞. Invoking Lemma 4, we ∗ 2 have that {kxk − x k } is a convergent sequence in an a.s. sense.

Then in an a.s. sense, it follows that {xk }k≥0 is a bounded sequence and has a convergent subsequence {xk }k∈K.

We proceed by contradiction; suppose xk converges tox ˆ along subsequence Kwhere xˆ is not necessarily a solution to SVI(X , H).

81 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeXI

Since {δk } is summable in an a.s. sense, {uk } is summable a.s. and from the non-summability of γk , we have that a.s., the following implication holds.

X X T ∗ T ∗ uk = 2γk H(xk ) (xk − x ) < ∞ =⇒ lim H(xk ) (xk − x ) = 0. k∈K k∈K k∈K

a.s. Since xk −−−→ xˆ along the subsequence K from (43) and by the continuity of F k→∞ over X , we obtain

H(ˆx)T (ˆx − x ∗) = 0. (44)

By recalling that x ∗ is a solution of VI(X , H) and since H is a symmetric monotone map (since it is a gradient map of a function), it follows from [5, Ch. 2] that H is monotone plus and therefore pseudomonotone plus (as formalized by the next implication).

h i H(x ∗)T (ˆx − x ∗) ≥ 0 and H(ˆx)T (ˆx − x ∗) = 0 =⇒ H(ˆx) = H(x ∗). (45)

82 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeXII

We may then conclude the following.

H(ˆx)T (x − xˆ) = H(ˆx)T (x − x ∗) + H(ˆx)T (x ∗ − xˆ)

(44) = H(ˆx)T (x − x ∗) = (H(ˆx) − H(x ∗))T (x − x ∗) + H(x ∗)T (x − x ∗).

Therefore from (45), the following holds:

(45) ∀x ∈ X , H(ˆx)T (x − xˆ) = H(x ∗)T (x − xˆ) = H(x ∗)T (x − x ∗) + H(x ∗)T (x ∗ − xˆ) ∗ x ∈ SOL(X ,F ) ≥ H(x ∗)T (x ∗ − xˆ) (45) (44) = H(ˆx)T (x ∗ − xˆ) = 0.

It follows thatx ˆ is a solution to SVI(X , H) and any limit point of {xk } is a solution of SVI(X , H) in an a.s. sense.

83 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysisI

The following rate statements are provided for the sequencex ¯N , an average of the iterates {xk+1/2} generated by (SEG) over the window constructed from Nl to N where Nl , bN/2c and N ≥ 2: PN k=N γk xk+ 1 x¯ l 2 . (46) N , PN γk k=Nl

Lemma 20

Let \(X\) be a nonempty, closed, and convex set in \(\mathbb{R}^n\). Then for all \(y \in X\) and any \(x \in \mathbb{R}^n\), the following hold:
(i) \((\Pi_X(x) - x)^T(y - \Pi_X(x)) \ge 0\); and
(ii) \(\|\Pi_X(x) - y\|^2 \le \|x - y\|^2 - \|x - \Pi_X(x)\|^2\).
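A quick numerical check of both parts of Lemma 20, using the Euclidean unit ball as an assumed instance of X (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def proj_ball(u):
    # Euclidean projection onto the unit ball {x : ||x|| <= 1}.
    nrm = np.linalg.norm(u)
    return u if nrm <= 1.0 else u / nrm

x = 3.0 * rng.normal(size=5)          # arbitrary point, possibly outside X
y = proj_ball(rng.normal(size=5))     # arbitrary point of X
p = proj_ball(x)
# Lemma 20(i): (Pi_X(x) - x)^T (y - Pi_X(x)) >= 0
print((p - x) @ (y - p) >= -1e-12)
# Lemma 20(ii): ||Pi_X(x) - y||^2 <= ||x - y||^2 - ||x - Pi_X(x)||^2
print(np.sum((p - y)**2) <= np.sum((x - y)**2) - np.sum((x - p)**2) + 1e-12)
```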

84 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis II

Lemma 21

Consider VI(X, H) and let \(\{x_k\}\) be generated by (38). Then the following holds for any \(y \in X\) and any index \(k\).

\[
\begin{aligned}
\mathbb{E}[\|x_{k+1} - y\|^2 \mid \mathcal{F}_{k+1/2}] &\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 \\
&\quad - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) + 8\gamma_k^2\nu^2. \qquad (47)
\end{aligned}
\]

Proof. By Lemma 20(ii), we have

\[
\begin{aligned}
\|x_{k+1} - y\|^2 &\le \|x_k - \gamma_k H(x_{k+1/2},\omega_{k+1/2}) - y\|^2 - \|x_k - \gamma_k H(x_{k+1/2},\omega_{k+1/2}) - x_{k+1}\|^2 \\
&= \|x_k - y\|^2 - \|x_{k+1} - x_k\|^2 - 2\gamma_k\big(H(x_{k+1/2}) + w_{k+1/2}\big)^T(x_{k+1} - y). \qquad (48)
\end{aligned}
\]
We have

\[
-H(x_{k+1/2})^T(x_{k+1} - y) = -H(x_{k+1/2})^T(x_{k+1} - x_{k+1/2}) - H(x_{k+1/2})^T(x_{k+1/2} - y). \qquad (49)
\]

85 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis III

Applying (49) to (48) yields

\[
\begin{aligned}
\|x_{k+1} - y\|^2 &\le \|x_k - y\|^2 - \|x_{k+1} - x_k\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1} - x_{k+1/2}) \\
&\quad - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) - 2\gamma_k w_{k+1/2}^T(x_{k+1} - y) \\
&= \|x_k - y\|^2 - \|x_{k+1/2} - x_k\|^2 - \|x_{k+1/2} - x_{k+1}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) \\
&\quad - 2\gamma_k w_{k+1/2}^T(x_{k+1} - y) + 2(x_{k+1} - x_{k+1/2})^T\big(x_k - \gamma_k H(x_{k+1/2}) - x_{k+1/2}\big). \qquad (50)
\end{aligned}
\]
By Lemma 20(i), we have

\[
(x_{k+1/2} - x_{k+1})^T\big(x_{k+1/2} - x_k + \gamma_k H(x_k,\omega_k)\big) \le 0. \qquad (51)
\]

86 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis IV

Using (51) in (50), we obtain

\[
\begin{aligned}
\|x_{k+1} - y\|^2 &\le \|x_k - y\|^2 - \|x_{k+1/2} - x_k\|^2 - \|x_{k+1/2} - x_{k+1}\|^2 \\
&\quad - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) - 2\gamma_k w_{k+1/2}^T(x_{k+1} - y) \\
&\quad + 2\gamma_k (x_{k+1} - x_{k+1/2})^T\big(H(x_k) - H(x_{k+1/2})\big) + 2\gamma_k w_k^T(x_{k+1} - x_{k+1/2}) \\
&\le \|x_k - y\|^2 - \|x_{k+1/2} - x_k\|^2 - \|x_{k+1/2} - x_{k+1}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) \\
&\quad - 2\gamma_k w_{k+1/2}^T(x_{k+1} - y) + 2\gamma_k L\|x_{k+1} - x_{k+1/2}\|\,\|x_k - x_{k+1/2}\| + 2\gamma_k w_k^T(x_{k+1} - x_{k+1/2}) \\
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - \tfrac{1}{2}\|x_{k+1/2} - x_{k+1}\|^2 \\
&\quad - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) - 2\gamma_k w_{k+1/2}^T(x_{k+1/2} - y) + 2\gamma_k (w_k - w_{k+1/2})^T(x_{k+1} - x_{k+1/2}) \\
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) \\
&\quad - 2\gamma_k w_{k+1/2}^T(x_{k+1/2} - y) + 2\gamma_k^2\|w_k - w_{k+1/2}\|^2. \qquad (52)
\end{aligned}
\]

87 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis V

Taking conditional expectations with respect to Fk+1/2, we obtain the following.

\[
\begin{aligned}
\mathbb{E}[\|x_{k+1} - y\|^2 \mid \mathcal{F}_{k+1/2}]
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) \\
&\quad - 2\gamma_k\, \mathbb{E}[w_{k+1/2}^T(x_{k+1/2} - y) \mid \mathcal{F}_{k+1/2}] + 2\gamma_k^2\, \mathbb{E}[\|w_k - w_{k+1/2}\|^2 \mid \mathcal{F}_{k+1/2}] \\
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) \\
&\quad + 4\gamma_k^2\, \mathbb{E}[\|w_k\|^2 + \|w_{k+1/2}\|^2 \mid \mathcal{F}_{k+1/2}] \\
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) + 8\gamma_k^2\nu^2.
\end{aligned}
\]

88 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis VI

Proposition 22 (Dim. steplength: SEG)

Consider the stochastic variational inequality problem SVI(X, H) and let \(x^*\) denote any solution of SVI(X, H). Suppose Assumptions 2 and 9 hold. Furthermore, suppose \(X\) is a nonempty, closed, convex, and bounded set. Let \(\{\bar{x}_N\}\) be defined as in (46), where \(0 < \gamma_k \le 1/L\) for all \(k \ge 0\) and \(\gamma_k = \gamma_0/\sqrt{k}\). Then
\[
\mathbb{E}[G(\bar{x}_N)] = \mathcal{O}\!\left(\tfrac{1}{\sqrt{N}}\right).
\]

89 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis VII

Proof. Recall from (47) that

\[
\begin{aligned}
\mathbb{E}[\|x_{k+1} - y\|^2 \mid \mathcal{F}_{k+1/2}]
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 \\
&\quad - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) + 8\gamma_k^2\nu^2 \qquad (53)\\
&= \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 \\
&\quad - \underbrace{2\gamma_k \big(H(x_{k+1/2}) - H(y)\big)^T(x_{k+1/2} - y)}_{\ge\, 0} - 2\gamma_k H(y)^T(x_{k+1/2} - y) + 8\gamma_k^2\nu^2 \qquad (54)\\
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - 2\gamma_k H(y)^T(x_{k+1/2} - y) + 8\gamma_k^2\nu^2. \qquad (55)
\end{aligned}
\]

Taking expectations on both sides of (55) and recalling that \(1 - 2\gamma_k^2 L^2 \ge 0\), we obtain

\[
2\gamma_k\, \mathbb{E}[H(y)^T(x_{k+1/2} - y)] \le \mathbb{E}[\|x_k - y\|^2] - \mathbb{E}[\|x_{k+1} - y\|^2] + 8\gamma_k^2\nu^2, \qquad \forall y \in X. \qquad (56)
\]

90 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis VIII

From (56), by summing over k from Nl to N, we have the following for all y ∈ X :

\[
2\sum_{k=N_l}^{N}\gamma_k\, \mathbb{E}[H(y)^T(x_{k+1/2} - y)] \le \mathbb{E}[\|x_{N_l} - y\|^2] - \mathbb{E}[\|x_{N+1} - y\|^2] + 8\sum_{k=N_l}^{N}\gamma_k^2\nu^2.
\]
Consequently, we have the following sequence of inequalities:

\[
2\Big(\sum_{k=N_l}^{N}\gamma_k\Big)\, \mathbb{E}[H(y)^T(\bar{x}_N - y)] \le \mathbb{E}[\|x_{N_l} - y\|^2] - \mathbb{E}[\|x_{N+1} - y\|^2] + 8\sum_{k=N_l}^{N}\gamma_k^2\nu^2 \le B^2 + 8\sum_{k=N_l}^{N}\gamma_k^2\nu^2, \qquad (57)
\]
where the second inequality follows from the boundedness of \(X\). Since \(\gamma_k = \gamma_0/\sqrt{k}\), it follows that for all \(y \in X\):

\[
\mathbb{E}[H(y)^T(\bar{x}_N - y)] \le \frac{B^2 + 8\sum_{k=N_l}^{N}\gamma_k^2\nu^2}{2\sum_{k=N_l}^{N}\gamma_k} = \frac{B^2}{2\gamma_0\sum_{k=N_l}^{N} k^{-1/2}} + \frac{4\gamma_0\nu^2\sum_{k=N_l}^{N} k^{-1}}{\sum_{k=N_l}^{N} k^{-1/2}}. \qquad (58)
\]

91 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis IX

We now utilize the following lower bound on the denominator for N ≥ 1:

\[
\sum_{k=N_l}^{N} k^{-1/2} \ge \int_{N/2}^{N} (x+1)^{-1/2}\, dx = 2\sqrt{N+1} - 2\sqrt{N/2+1} \ge 2\sqrt{N/40}. \qquad (59)
\]
Similarly, an upper bound may be constructed:

\[
\sum_{k=N_l}^{N} k^{-1} \le \int_{N/2}^{N} x^{-1}\, dx + \lfloor N/2\rfloor^{-1} \le \log 2 + 1. \qquad (60)
\]
By substituting (59) and (60) in (58), we obtain that the following holds:
\[
\mathbb{E}[H(y)^T(\bar{x}_N - y)] \le \frac{C_4}{\sqrt{N}} \quad \text{for all } y \in X, \qquad \text{where } C_4 \triangleq \frac{\sqrt{40}\,B^2}{4\gamma_0} + 2\sqrt{40}\,(\log 2 + 1)\,\gamma_0\nu^2.
\]
The result follows by taking the supremum over \(y \in X\) since

\[
G(\bar{x}_N) \triangleq \sup_{y\in X}\, \mathbb{E}[H(y)^T(\bar{x}_N - y)] \le \frac{C_4}{\sqrt{N}}.
\]
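As a quick sanity check, the integral bounds (59) and (60) used above can be verified numerically for a few values of N (illustrative only; this is not part of the proof):

```python
import numpy as np

# Numerical check of the lower bound (59) on sum k^{-1/2} and the upper bound
# (60) on sum k^{-1}, over the window k = floor(N/2), ..., N.
for N in (10, 100, 1000, 10000):
    k = np.arange(N // 2, N + 1, dtype=float)
    ok59 = np.sum(k**-0.5) >= 2.0 * np.sqrt(N / 40.0)
    ok60 = np.sum(1.0 / k) <= np.log(2.0) + 1.0
    print(N, ok59, ok60)
```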

92 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis X

Proposition 23 (Constant steplength: SEG)

Consider the stochastic variational inequality problem SVI(X, H) and let \(x^*\) denote any solution of SVI(X, H). Suppose Assumptions 2 and 9 hold. Furthermore, suppose \(X\) is a nonempty, closed, convex, and bounded set. Let \(\{\bar{x}_N\}\) be defined as in (46), where \(0 < \gamma_k = \gamma \le 1/L\) for all \(k \ge 0\). Then
\[
\mathbb{E}[G(\bar{x}_N)] = \mathcal{O}\!\left(\tfrac{1}{\sqrt{N}}\right).
\]
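The metric in Propositions 22–23 is the gap function \(G(\bar{x}_N) = \sup_{y\in X}\mathbb{E}[H(y)^T(\bar{x}_N - y)]\). A minimal sampling-based sketch of a lower estimate of this gap, assuming X is the Euclidean unit ball and H is an affine monotone map (both assumptions chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
A = M @ M.T                       # positive semidefinite, so H(y) = A y + b is monotone
b = rng.normal(size=n)

def H(y):
    return A @ y + b

def gap_lower_estimate(x_bar, num_samples=5000):
    # Sample points y of the unit ball and keep the largest value of H(y)^T (x_bar - y);
    # this only lower-bounds the supremum defining G(x_bar).
    best = 0.0
    for _ in range(num_samples):
        y = rng.normal(size=n)
        y /= max(1.0, np.linalg.norm(y))
        best = max(best, H(y) @ (x_bar - y))
    return best
```

Such an estimate can be used to monitor the decay of the gap along the averaged SEG iterates.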

93 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Convergence and Rate Statements I

Table: Convergence and Rate Statements

Smooth convex
Objective      Constraints  Convexity  Smoothness  a.s.  Rate          Oracle complexity  Metric
E[f(x,ω)]      x ∈ X        Strong     Y           Y     O(1/k)        O(1/ε)             E[‖x_k − x*‖²]
E[f(x,ω)]      x ∈ X        Convex     Y           Y     O(ln(k)/√k)   –                  E[f(x̄_K) − f(x*)]

Nonsmooth convex
Objective      Constraints  Convexity  Smoothness  a.s.  Rate          Oracle complexity  Metric
E[f(x,ω)]      x ∈ X        Strong     N           N     O(1/k)        O(1/ε)             E[‖x_k − x*‖²]
E[f(x,ω)]      x ∈ X        Convex     N           N     O(1/√K)       O(1/ε²)            E[f(x̄_K) − f(x*)]

Smooth stochastic and structured nonsmooth convex + variance reduction
Obj., Cons.         Convexity  Smooth.    a.s.  Rate      Oracle complexity  γ_k, N_k        Metric
E[f(x,ω)], x ∈ X    Strong     Y          Y     O(q^k)    O(1/ε)             γ, ⌈a q^{−k}⌉   E[‖x_k − x*‖²]
E[f(x,ω)], x ∈ X    Convex     Composite  Y     O(1/K²)   O(1/ε²)            γ, ⌊k^{3+δ}⌋    E[f(x̄_K) − f(x*)]

Nonsmooth (but smoothable) stochastic and structured nonsmooth convex + variance reduction
Obj., Cons.         Convexity  Smooth.    a.s.  Rate      Oracle complexity  γ_k, N_k        Metric
E[f(x,ω)], x ∈ X    Convex     Nonsmooth  Y     O(1/K)    O(1/ε^{2+δ})       γ, ⌊k^{1+δ}⌋    E[f(x̄_K) − f(x*)]

Smooth with expectation-valued constraints
Obj., Cons.                        Convexity  Smooth.    a.s.  Rate      Oracle complexity  Metric
E[f(x,ω)], E[c(x,ω)] ≤ 0, x ∈ X    Convex     Nonsmooth  Y     O(1/√K)   O(1/ε²)            E[G(x̄_K)]

94 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Strongly convex and smooth f I

Regularized logistic regression:

\[
\min_{\beta\in\mathbb{R}^d}\; h(\beta) \triangleq \frac{1}{n}\sum_{i=1}^{n}\log\!\big(1 + \exp(-y_i x_i^T\beta)\big) + \frac{\lambda_2}{2}\|\beta\|_2^2 + \lambda_1\|\beta\|_1, \qquad (61)
\]

where λ1 and λ2 are regularization parameters. We compare (VS-APM) with de facto standards such as Prox-SVRG [?] and Prox-SDCA [15] on the two datasets specified in the table below (a minimal sketch of the loss in (61) follows the table).

Data    n      d      source  λ1       λ2
sido0   12678  4932   [6]     10^{-4}  10^{-4}
rcv1    20242  47236  [11]    10^{-5}  10^{-4}

Table: Characteristics of the datasets
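For concreteness, a minimal sketch of the smooth part of (61) (logistic loss plus ridge term) with a mini-batch sampled gradient, and of the ℓ1 prox (soft-thresholding); this is an illustrative sketch, not the (VS-APM) implementation used for the experiments.

```python
import numpy as np

def smooth_grad_minibatch(beta, X, y, lam2, idx):
    # Sampled gradient of (1/n) sum_i log(1 + exp(-y_i x_i^T beta)) + (lam2/2)||beta||_2^2
    # over a mini-batch of row indices `idx`; X is n-by-d, y has entries in {-1, +1}.
    Xb, yb = X[idx], y[idx]
    margins = yb * (Xb @ beta)
    coeff = -yb / (1.0 + np.exp(margins))     # d/dm log(1 + exp(-m)) = -1/(1 + exp(m))
    return Xb.T @ coeff / len(idx) + lam2 * beta

def prox_l1(v, tau):
    # Proximal operator of tau * ||.||_1 (soft-thresholding); use tau = steplength * lam1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```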

95 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Comparison across algorithms I

Constant # of full gradients, λ2 = 10^{-4}:
Method     emp. err.   # full grad.  # prox.  CPU(s)
VS-APM     4.713e-6    10            721      89
prox-SVRG  2.1142e-5   10            126780   12651
prox-SDCA  6.9341e-9   –             126780   2386
SG         4.56e-2     –             126780   2438

Constant # of samples, λ2 = 10^{-7}:
Method     emp. err.   # full grad.  # prox.  CPU(s)
VS-APM     5.3595e-3   0             484      38
prox-SVRG  4.4812e-3   10            126780   12704
prox-SDCA  7.5541e-3   –             126780   2248
SG         7.7042e-3   –             126780   2320

Table: sido0: Constant # full grads and λ2 = 10^{-4} (top); constant # samples and λ2 = 10^{-7} (bottom)

96 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Comparison across algorithms II

Dataset  Method     E[f(y_k) − f*]  # of prox.  CPU(s)
sido0    VS-APM     2.374e-8        1015        180
         prox-SVRG  0.19e+0         4265        180
         prox-SDCA  4.580e-4        14480       180
         SG         2.09e-1         14540       180
rcv1     VS-APM     3.477e-8        1369        300
         prox-SVRG  1.4977e+0       2177        300
         prox-SDCA  1.239e-1        2162        300
         SG         7.878e-1        40750       300

Table: sido0, rcv1: Constant CPU time

97 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Comparison across algorithms III

           VS-APM            Jofre and Thompson [8]
κ          E[f(y_k) − f*]    E[f(x_k) − f*]
2.5e+1     2.591e-06         1.365e-06
2.5e+2     5.845e-06         8.175e-03
2.5e+3     2.294e-05         1.107e-01
2.5e+4     7.337e-04         1.200e-01
2.5e+5     1.698e-02         1.227e-01

Table: sido0: Constant budget, sensitivity to κ

98 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Comparison across algorithms IV

Figure: sido0: Constant time
Figure: rcv1: Constant time

99 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Comparison across algorithms V

Figure: sido0: sensitivity to κ

100 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Smoothed VS-APM I

In this setting, we compare the performance of our iterative smoothing scheme (sVS-APM) with a scheme in which the smoothing parameter is fixed, on the following stochastic utility problem:

\[
\min_{\|x\|_2\le 1}\; \mathbb{E}\!\left[\phi\!\left(\sum_{i=1}^{n}\Big(\tfrac{i}{n} + \xi_i\Big)x_i\right)\right], \qquad (62)
\]
where \(\phi(t) \triangleq \max_{1\le j\le m}(v_j + s_j t)\).
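A minimal sketch of a Monte Carlo evaluation of the objective in (62); the coefficients v, s and the Gaussian model for ξ are illustrative assumptions (they are not specified on this slide).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 10
v = rng.normal(size=m)                 # placeholder coefficients v_j
s = rng.normal(size=m)                 # placeholder slopes s_j

def phi(t):
    # phi(t) = max_j (v_j + s_j * t)
    return np.max(v + s * t)

def sampled_objective(x, num_samples=2000):
    # Monte Carlo estimate of E[ phi( sum_i (i/n + xi_i) x_i ) ] with xi ~ N(0, 1).
    i_over_n = np.arange(1, n + 1) / n
    vals = [phi((i_over_n + rng.normal(size=n)) @ x) for _ in range(num_samples)]
    return float(np.mean(vals))
```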

101 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Smoothed VS-APM II

              sVS-APM                     Fixed smoothing
n    m    µ_k      E[f(y_k) − f*]    µ        E[f(y_k) − f*]
20   10   1/k      1.832e-4          1/K      3.455e-3
          1/(2k)   3.014e-3          1/(2K)   2.157e-2
          1/(3k)   1.269e-2          1/(3K)   6.079e-2
100  25   1/k      1.944e-3          1/K      3.126e-2
          1/(2k)   1.181e-2          1/(2K)   5.130e-2
          1/(3k)   2.411e-2          1/(3K)   5.817e-2
200  10   1/k      1.067e-4          1/K      4.695e-3
          1/(2k)   5.173e-3          1/(2K)   3.957e-2
          1/(3k)   1.594e-2          1/(3K)   6.929e-2

102 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Smoothed VS-APM III

           s-APM                      Fixed smoothing
n     µ_k      E[f(y_k) − f*]    µ        E[f(y_k) − f*]
100   1/k      3.547e-4          1/K      4.693e-4
      1/(2k)   1.850e-4          1/(2K)   7.175e-4
      1/(4k)   9.129e-5          1/(4K)   1.690e-3
1000  1/k      2.447e-3          1/K      9.204e-3
      1/(2k)   2.457e-3          1/(2K)   2.751e-2
      1/(4k)   4.511e-3          1/(4K)   1.020e-1
2000  1/k      6.081e-3          1/K      3.613e-2
      1/(2k)   1.165e-2          1/(2K)   1.396e-1
      1/(3k)   2.829e-2          1/(3K)   1.020e-1

Table: Comparing (sVS-APM) with fixed smoothing; stochastic (L), deterministic (R)

103 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Smoothed VS-APM IV

Figure: (sVS-APM) vs fixed smoothing; n = 200
Figure: (s-APM) vs fixed smoothing; n = 1000

104 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Smoothed VS-APM V

Consider the minimization of \(\|Ax - b\|_1 + \|x\|_1\), where \(A\in\mathbb{R}^{m\times n}\) and \(b\in\mathbb{R}^m\) are randomly generated from a standard normal distribution. We utilize the smooth approximation of \(f\) given by \(f_\mu(x) = \sum_{i=1}^{m} h_\mu^i(x)\), where \(h_\mu^i(x)\) is defined in (63) [3], and compute a prox on \(g(x) = \|x\|_1\).

\[
h_\mu^i(x) = \begin{cases} \dfrac{(A_i x - b_i)^2}{2\mu}, & |A_i x - b_i| \le \mu,\\[4pt] |A_i x - b_i|, & \text{otherwise}. \end{cases} \qquad (63)
\]
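A minimal sketch of evaluating the smoothed term f_µ and its gradient with placeholder data A, b, µ:

```python
import numpy as np

def f_mu_and_grad(x, A, b, mu):
    # Smoothed residual terms h_mu^i per (63): quadratic inside the band |r_i| <= mu,
    # absolute value outside. (The classical Huber smoothing from [3] also subtracts
    # mu/2 on the outer branch; here the expression is coded exactly as displayed.)
    r = A @ x - b
    inner = np.abs(r) <= mu
    vals = np.where(inner, r**2 / (2.0 * mu), np.abs(r))
    dvals = np.where(inner, r / mu, np.sign(r))   # derivative of each h_mu^i w.r.t. r_i
    return vals.sum(), A.T @ dvals                # f_mu(x) and its gradient
```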

105 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Areas not covered

1 Optimality of rates
2 Mirror-prox schemes, primal-dual schemes, randomized block schemes
3 Other problem classes: (i) stochastic saddle-point problems; (ii) stochastic variational inequality problems; (iii) stochastic Nash games.

106 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

[1] A. Nemirovsky and D. Yudin, Problem complexity and method efficiency in optimization, 1983.
[2] A. Beck, First-Order Methods in Optimization, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017.
[3] A. Beck and M. Teboulle, Smoothing and first order methods: A unified framework, SIAM Journal on Optimization, 22 (2012), pp. 557–580.
[4] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press, 2008.
[5] F. Facchinei and J.-S. Pang, Finite Dimensional Variational Inequalities and Complementarity Problems: Vols I and II, Springer-Verlag, NY, Inc., 2003.
[6] I. Guyon, Sido: A pharmacology dataset, URL: http://www.causality.inf.ethz.ch/data/SIDO.html, 2008.
[7] A. Jalilzadeh, U. V. Shanbhag, J. H. Blanchet, and P. W. Glynn, Optimal smoothed variable sample-size accelerated proximal methods for structured nonsmooth stochastic convex programs, arXiv:1803.00718, 2018.
[8] A. Jofré and P. Thompson, On variance reduction for stochastic smooth convex optimization with multiplicative noise, arXiv:1705.02969, 2017.

106 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

[9] H. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, Springer, 2003.
[10] J. Lei and U. V. Shanbhag, Asynchronous schemes for stochastic and misspecified potential games and nonconvex optimization, https://arxiv.org/abs/1711.03963.
[11] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research, 5 (2004), pp. 361–397.
[12] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.
[13] B. Polyak, Introduction to Optimization, Optimization Software, Inc., New York, 1987.
[14] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400–407.
[15] S. Shalev-Shwartz and T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in International Conference on Machine Learning, 2014, pp. 64–72.

106 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

[16] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory, vol. 9 of MPS/SIAM Series on Optimization, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2009.
[17] I. J. Wang and J. C. Spall, Stochastic optimization with inequality constraints using simultaneous perturbations and penalty functions, in Proc. IEEE Conf. on Decision and Control, Maui, HI, 2003, pp. 3808–3813.
[18] F. Yousefian, A. Nedić, and U. V. Shanbhag, On stochastic gradient and subgradient methods with adaptive steplength sequences, Automatica, 48 (2012), pp. 56–67.

106 / 106