A Tutorial on Stochastic Approximation

A Tutorial on Stochastic Approximation

Uday V. Shanbhag

Industrial and Manufacturing Engg. Pennsylvania State University University Park, PA 16802

International Conference on Stochastic Programming Trondheim, Norway July 28, 2019

** Author is grateful to Afrooz Jalilzadeh and Jinlong Lei

1 / 106 A Tutorial on Stochastic Approximation Introduction

Problem definitionI

We consider the following problem:

min f (x) E[F (x, ξ(ω))] , (Opt) x∈X ,

n d where X ⊆ R is a closed and convex set, F : X × R → R is a real-valued function, d ξ :Ω → R , and (Ω, F, P) denotes the probability space.

Suppose γk denotes a positive steplength, ∇˜ x F (xk , ωk ) denotes an estimator of the true gradient ∇x f (xk ) at xk , and ΠX (u) denotes the Euclidean projection of u onto the set X . We consider the analysis of the following stochastic approximation scheme. Given an x0 ∈ X , the sequence {xk } is generated as follows. h i xk+1 := ΠX xk − γk ∇˜ x F (xk , ωk ) , k ≥ 0. (SA)

2 / 106 A Tutorial on Stochastic Approximation Introduction

HistoryI

In their seminal paper [14], Robbins and Monro considered the question of finding the root of a stochastic system; we consider the variant of this problem in the context of solving problems. ˜ Suppose wk , ∇x F (xk , ωk ) − ∇x f (xk ). n To motivate the scheme, suppose X , R , and given x0, and consider a Newton method in which xk+1 is updated as follows:

2 −1 xk+1 := xk − (∇ f (xk )) ∇˜ F (xk , ωk ), k = 0, 1,....

By noting that ∇˜ F (xk , ωk ) can be expressed as ∇x f (xk ) + wk , we have that

2 −1 2 −1 xk+1 := xk − (∇ f (xk )) ∇x f (xk ) − (∇ f (xk )) wk , k = 0, 1,....

∗ 2 2 ∗ It follows that if xk → x , then ∇x f (xk ) → 0 and ∇ f (xk ) → ∇ f (x ) 0. Consequently, it has to hold that wk → 0. Challenge: But this does not hold for a of problems (such as settings with i.i.d. random variables with finite ).

3 / 106 A Tutorial on Stochastic Approximation Introduction

HistoryII

To avert this issue, Robbins and Monro [14] considered an alternate scheme in which 2 −1 (∇ f (xk )) by γk , a positive sequence; more specifically, the scheme reduces to the following:

xk+1 := xk − γk ∇˜ x F (xk , ωk ), k = 0, 1,... This iteration can be equivalently written as

xk+1 := xk − γk (∇x f (xk ) + wk ), k = 0, 1,...

P∞ Moreover, the sequence {γk } needed to satisfy the requirements: (i) k=0 γk = ∞ P∞ 2 (Non-summability) and (ii) k=0 γk < ∞ (Square-summability) to guarantee asymptotic convergence

4 / 106 A Tutorial on Stochastic Approximation Introduction

Outline

In this tutorial, we analyze such schemes in the following settings: 1 f is smooth/nonsmooth and strongly convex/convex.

min f (x) E[f (x, ω)] (Opt) x∈X ,

2 f is smooth/smoothable while g is deterministic convex, closed, and proper.

min f (x) + g(x), where f (x) E[f (x, ω)] (Composite) x∈X ,

3 Stochastically constrained stochastic optimization: f and c are smooth and convex.

min f (x) x∈X subject to c(x) ≤ 0, (Cons-Opt) x ∈ X .

where f (x) , E[f (x, ω)] and c(x) , E[c(x, ω)].

5 / 106 A Tutorial on Stochastic Approximation Introduction

Literature

1 Monographs on stochastic approximation [1, 17, 9, 4, 16] 2 We do not conduct a comprehensive lit review here; note our material is sourced as follows. 1 Smooth convex/strongly convex [18]. 2 Nonsmooth convex/strongly convex [16]. 3 Variance-reduced schemes for strongly convex [10] and convex smooth and nonsmooth (but smoothable) [7]

6 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsI

We begin by considering a setting where f (x) is a strongly convex function over n X ⊆ R . Note that strong convexity over a closed and convex set suffices for the existence of a unique solution x ∗.

Assumption 1

Suppose f (x) is an η-strongly convexa and continuously differentiable with L-Lipschitz continuous gradientsb on X .

a T η 2 For any x, y ∈ X , f (x) ≥ f (y) + ∇x f (y) (x − y) + 2 kx − yk . b For any x, y ∈ X , k∇x f (x) − ∇x f (y)k ≤ Lkx − yk.

We assume the existence of a stochastic first-order oracle (SFO) that can produce an estimate of the gradient denoted by ∇˜ f (x, ω).  2 Throughout, we assume that E kx0k < ∞. Furthermore we let Fk denote the history of the method up to time k, i.e., Fk = {x0, ω0, ω1, . . . , ωk−1} for k ≥ 1 and F0 = {x0}.

7 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsII

˜ We assume the following on the moments of wk , ∇x f (xk ) − ∇x f (xk , ωk ). Assumption 2 ( requirements)

2 2 We assume that E[wk | Fk ] = 0 a.s. and E[kwk k | Fk ] ≤ ν a.s. for all k ≥ 0.

The following Lemma is employed for establishing almost-sure convergence and may be found in [13] (cf. Lemma 10, page 49).

8 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsIII

Lemma 1

Let {vk } be a sequence of nonnegative random variables, where E[v0] < ∞, and let {uk } and {µk } be deterministic scalar sequences such that:

E[vk+1|v0,..., vk ] ≤ (1 − uk )vk + µk a.s. for all k ≥ 0,

0 ≤ uk ≤ 1, µk ≥ 0, for all k ≥ 0,

∞ ∞ X X µk uk = ∞, µk < ∞, lim = 0. k→∞ uk k=0 k=0

Then, vk → 0 almost surely as k → ∞.

9 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsIV

Proposition 2 (a.s. convergence under strong convexity)

Consider Algorithm (SA) and suppose Assumptions 1 and 2 hold. Then the sequence ∗ {xk } converges almost surely to x , a unique solution of (Opt).

Proof: Consider algorithm (SA). By the non-expansivity property of the Euclidean 1 ∗ 2 projection operator , for all k ≥ 0, kxk+1 − x k can be bounded as follows:

∗ 2 ∗ ∗ 2 kxk+1 − x k = kΠX (xk − γk (∇x f (xk ) + wk )) − ΠX (x − γk ∇x f (x ))k ∗ ∗ 2 ≤ kxk − x − γk (∇x f (xk ) + wk − ∇x f (x ))k .

10 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsV

By taking conditional expectations and invoking the fact that the conditional expectation of wk was zero or E[wk | Fk ] = 0, we have

h ∗ 2 i ∗ 2 E kxk+1 − x k | Fk ≤ kxk − x k

2 ∗ 2 2 h 2 i + γk k∇x f (xk ) − ∇x f (x )k + γk E kwk k | Fk ∗ T ∗ − 2γk (xk − x ) (∇x f (xk ) − ∇x f (x )) 2 2 ∗ 2 2 2 ≤ (1 − 2ηγk + γk L )kxk − x k + γk ν ,

where the second inequality is a resulting of leveraging the strong monotonicity and  2  Lipschitz continuity of F (x) over X as well as the boundedness of E kwk k | Fk . We may observe that

2 2 2 1 (1 − 2ηγk + γk L ) ≤ (1 − ηγk (2 − γk L )) ≤ (1 − ηγk ) < 1, if γk < η . | {z } 1 ≥1, if γ ≤ k L2

11 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Convergence of SA for strongly convex functionsVI

∗ 2 2 2 2 2 It follows that if vk = kxk − x k , uk = (2ηγk − γk L ) and µk = γk L

E[vk+1 | Fk ] ≤ (1 − uk )vk + µk ,

2 2 1 1 where uk = 2ηγk − γk L < 1 if γk ≤ min{ L2 , η }. P∞ 2 P∞ 2 In addition, k=0 µk = L k=0 γk < ∞, and P∞ P∞ P∞ 2 2 k=0 uk = k=0 2ηγk − k=0 γk L = ∞, since γk is not summable but square summable.

By Lemma 1, vk → 0 almost surely as k → ∞ and the result holds.

12 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Rate of convergenceI

We now derive a rate statement when X is a compact set. We utilize Lemma ?? (appendix).

Proposition 3 (Convergence in and rate statement under strong convexity)

Consider Algorithm (SA) and suppose Assumptions 1 and 2 hold. In addition, suppose X ∗ θ is a bounded set such that kx − x k ≤ C for all x ∈ X and γk = k . Then the sequence ∗ {xk } converges to x in mean and

 2 2 −1 ∗ 2 ∗ 2 max θ M (2ηθ − 1) , E[kx0 − x k ] [kx − x k ] ≤ , E k k where θ > 1/2η.

Proof. We restart the proof of the previous result from the recursion:

h ∗ 2 i 2 2 ∗ 2 2 2 E kxk+1 − x k | Fk ≤ (1 − 2ηγk + γk L )kxk − x k + γk ν ,

13 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Strongly convex and smooth problems Rate of convergenceII

It follows that by taking unconditional expectations and recalling that kx − x ∗k ≤ C, we have that

h k+1 ∗ 2i h ∗ 2i 2 2 2 2 E kx − x k ≤ (1 − 2ηγk )E kxk − x k + γk (ν + L C ).

Recall that if θ > 1/2µ and M2/2 = (ν2 + L2C 2), we obtain the result for for all k (see [16, Ch. 5]).

14 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems Weakening strong convexity assumption

Assumption 3

Suppose f (x) is a convex, continuously differentiable, with L-Lipschitz continuous gradients over X .

We use the following result by Robbins and Siegmund that pertains to almost super-martingales Lemma 4 (Robbins-Siegmund Lemma)

Let vk , uk , αk , and βk be nonnegative random variables, and let the following relations hold almost surely:

∞ ∞ h ˜ i X X E vk+1 | Fk ≤ (1 + αk )vk − uk + βk for all k, αk < ∞, βk < ∞, k=0 k=0 where F˜k denotes the collection v0,..., vk , u0,..., uk , α0, . . . , αk , β0, . . . , βk . Then, a.s. P∞ a we have limk→∞ vk = v and k=0 uk < ∞, where v ≥ 0 is a random variable .

a A stochastic process {X (k), k ∈ Z+} is said to be a (Sub,super)-Martingale process if E[Y (k + 1) | Y (1),..., Y (k)] (≥, ≤) = Y (k). 15 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems AnalysisI

Proposition 5 (a.s.convergence under convexity)

Let Assumptions 2 and 3 hold. Assume that the optimal set X ∗ of (Opt) is nonempty. Then, the sequence {xk } generated by (SA) converges almost surely to some random point in X ∗.

Proof: By definition of the method and the nonexpansive property of the projection operation, we obtain for any x ∗ ∈ X ∗ and k ≥ 0,

∗ 2 ∗ 2 kxk+1 − x k ≤ kxk − x − γk (∇f (xk ) + wk )k ∗ 2 T ∗ 2 2 = kxk − x k − 2γk (∇f (xk ) + wk ) (xk − x ) + γk k∇f (xk ) + wk k . By leveraging convexity and the gradient inequality, we have that

∗ T ∗ f (x ) ≥ f (xk ) + ∇f (xk ) (x − xk ),

implying that T ∗ ∗ −∇f (xk ) (xk − x ) ≤ −(f (xk ) − f (x )). 16 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems AnalysisII

By the previous observation, we have the following:

∗ 2 ∗ 2 ∗ kxk+1 − x k ≤ kxk − x k − 2γk (f (xk ) − f (x )) T ∗ 2 2 − 2γk wk (xk − x ) + γk k∇f (xk ) + wk k .

2 2 2 n ∗ ∗ Since ka + bk ≤ 2kak + 2kbk for any a, b ∈ R , by using f = f (x ), and by adding and subtracting ∇f (x ∗) in the last term, we obtain

∗ 2 ∗ 2 ∗ T ∗ kxk+1 − x k ≤ kxk − x k − 2γk (f (xk ) − f ) − 2γk wk (xk − x ) 2 ∗ 2 2 ∗ 2 + 2γk k∇f (xk ) − ∇f (x )k + 2γk k∇f (x ) + wk k .

Taking the conditional expectation given Fk , using E[wk | Fk ] = 0 and the Lipschitzian property of the gradient, we have

h ∗ 2 i 2 2 ∗ 2 ∗ E kxk+1 − x k | Fk ≤ (1 + 2L γk )kxk − x k − 2γk (f (xk ) − f )

2  ∗ 2 h 2 i + 2γk k∇f (x )k + E kwk k | Fk .

17 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems AnalysisIII

The conditions of Lemma 4 are satisfied. Therefore, almost surely, the sequence ∗ ∗ ∗ P∞ ∗ {kxk+1 − x k} is convergent for any x ∈ X and k=0 γk (f (xk ) − f ) < ∞.

The former relation implies that {xk } is bounded a.s., while the latter implies ∗ P∞ lim infk→∞ f (xk ) = f a.s. in view of the condition k=0 γk = ∞.

Since the set X is closed, all accumulation points of {xk } lie in X . Furthermore, ∗ since f (xk ) → f along a subsequence a.s., by continuity of f it follows that {xk } has a subsequence converging to some random point in X ∗ a.s. Moreover, since ∗ ∗ ∗ {kxk+1 − x k} is convergent for any x ∈ X a.s., the entire sequence {xk } converges to some random point in X ∗ a.s.

18 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems Rate of convergenceI

Proposition 6 (Rate of convergence under convexity)

Let Assumptions 2 and 3 hold. Suppose γ = √1 . Assume that the optimal set X ∗ of k k (Opt) is nonempty. Then for all K, we have that

PK−1 ∗ a2+2(L2C 2+c2) ln(K+1) γk xk [f (¯x ) − f ] ≤ 0 √ , where x¯ k=0 . E K 4 K+1 K , γk xk

Proof. We restart the proof from the recursion.

h ∗ 2 i 2 2 ∗ 2 ∗ E kxk+1 − x k | Fk ≤ (1 + 2L γk )kxk − x k − 2γk (f (xk ) − f )

2  ∗ 2 h 2 i + 2γk k∇f (x )k + E kwk k | Fk .

19 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems Rate of convergenceII

Taking unconditional expectations, we obtain that

∗ 2 2 h ∗ 2i h ∗ 2i 2E[γk (f (xk ) − f )] ≤ (1 + 2L γk )E kxk − x k − E kxk+1 − x k

2  ∗ 2 h 2i + 2γk k∇f (x )k + E kwk k .

Summing from k = 0,..., K − 1, we obtain that

K−1 K−1 X ∗ h ∗ 2i X 2 2 h ∗ 2i 2 2 2 E[γk (f (xk ) − f )] ≤ E kx0 − x k + 2 (L γk E kxk − x k + 2γk c ) k=0 k=0 K−1 ∗ P [γ (f (x ) − f )] 2 PK−1 2 2 2 2 k=0 E k k a0+2 k=0 (L C +c )γk =⇒ 2 K−1 ≤ PK−1 . P γk k=0 γk k=0

20 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Convex and smooth problems Rate of convergenceIII

∗ By Jensen’s inequality, we have the following bound for E[f (¯xK ) − f (x )], where PK−1 k=0 γk xk x¯K . , we have that , γk xk

PK−1 [γ (f (x ) − f ∗)] 2 [(f (¯x ) − f (x ∗))] ≤ 2 k=0 E k k E K PK−1 k=0 γk a2 + 2 PK−1(L2C 2 + c2)γ2 =⇒ 2 [(f (¯x ) − f (x ∗))] ≤ 0 k=0 k E K PK−1 k=0 γk 2 2 2 2 R K+1 1 a0 + 2(L C + c ) 1 x dx (By upper/lower bounding sums by integrals) ≤ x R √1 0 x dx a2 + 2(L2C 2 + c2) ln(K + 1) ≤ 0 √ . 2 K + 1

Jensen’s inequality. If f (x) is a convex function, then f (E[F (x, ω)]) ≤ E[f (F (x, ω))] for any x.

21 / 106 A Tutorial on Stochastic Approximation Smooth convex problems Summary of findings Convergence and Rate StatementsI

Table: Convergence and Rate Statements

Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong Y Y O k O  E kxk − x k  ln(k)  [f (x, ω)] x ∈ X Convex Y Y O √ – f (¯x − x∗) E k E K Note that oracle complexity represents the number of calls to the stochastic first-order oracle to get an -solution in an expected-value sense. One call is made at every iteration (i.e. a single sample is used at each step).  ∗ 2 1 E kxk − x k ≤ . By utilizing the rate statement and recalling that a single sample is taken at each step, oracle complexity = iteration complexity (in number of projection steps). h ∗ 2i a a E kxk − x k ≤ k ≤  =⇒ k = d  e.

22 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsI

Let us begin with an assumption.

Assumption 4 (Bounded subgradients)

2 2 Suppose the sampled subgradient of f (x) satisfy E[k∇f (x, ω)k ] ≤ M for all x ∈ X where ∇x f (x, ω) ∈ ∂x f (x, ω).

Proposition 7 (Rate of convergence under nonsmoothness and convexity)

Suppose f is µ-strongly convex. Let Assumptions 2 and 4 hold. Assume that the optimal set X ∗ of (Opt) is nonempty. Then for all k, we have that 1 [kx − x ∗k2] ≤ . E k k

Proof. 1 ∗ 2 Suppose Aj := 2 kxj − x k and aj := E[Aj ].

23 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsII

Next, we observe that

1 ∗ 2 Aj+1 = 2 kxj − x k 1 ∗ 2 = 2 kΠX (xj − γj ∇f (xj , ωj )) − ΠX (x )k 1 ∗ 2 ≤ 2 k (xj − γj ∇f (xj , ωj )) − x k 1 2 2 ∗ T = Aj + 2 γj k∇f (xj , ωj )k − γj (xj − x ) ∇f (xj , ξj ). (1)

It can be seen that xj = xj (ω1, . . . , ωj−1) = xj (ω[j−1]), implying that xj is independent of ωj . As a consequence, since E[X ] = E[E[X | Y ]], we have that

∗ T ∗ T E[(xj − x ) ∇f (xj , ωj )] = E[E[(xj − x ) ∇f (xj , ωj ) | ω[j−1]]] ∗ T = E[(xj − x ) E[∇f (xj , ωj ) | ω[j−1]]] ∗ T (Conditionally unbiased) = E[(xj − x ) ∇f (xj )].

24 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsIII

It follows from taking expectations on both sides of

1 2 2 ∗ T Aj+1 ≤ Aj + 2 γj k∇f (xj , ωj )k − γj (xj − x ) ∇f (xj , ωj ), leading to

1 2 2 ∗ T aj+1 ≤ aj + 2 γ M − γj E[(xj − x ) ∇f (xj , ωj )]. By assumption, we have that f (x) is strongly convex over X with a constant µ. Recall by the optimality of x ∗,

∗ T ∗ (x − x ) ∇x f (x ) ≥ 0, ∀x ∈ X ,

∗ ∗ where ∇x f (x ) ∈ ∂x f (x ).

As a consequence, if ∇x f (xj , ω) ∈ ∂x f (xj , ω), then

∗ T ∗ T ∗ ∗ T ∗ E[(xj − x ) ∇f (xj , ω)] = E[(xj − x ) (∇f (xj , ω) − ∇f (x , ω))] + (xj − x ) ∇f (x ) | {z } ≥ 0 ∗ T ∗ ≥ E[(xj − x ) (∇f (xj , ω) − ∇f (x , ω))] ∗ 2 ≥ ckxj − x )k = 2caj . 25 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsIV

It follows from (1), that

1 2 2 aj+1 ≤ (1 − 2cγj )aj + 2 γj M . Generally in stochastic approximation problems, the steplength sequence employed is γj = θ/j where θ is a positive scalar, allowing us to claim that

1 2 2 2 aj+1 ≤ (1 − 2µθ/j)aj + 2 M θ /j . Suppose we now additional impose a requirement that θ > 1/2µ, we obtain that (see [16, Ch. 5])  2 2 −1 max θ M (2µθ − 1) , 2a1 2a ≤ . j j As a result, Q(θ) [kx − x ∗k2] ≤ , E j j where n 2 2 −1 ∗ 2o Q(θ) , max θ M (2µθ − 1) , kx1 − x k .

26 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsV

Furthermore, Q(θ) is minimized at θ = 1/µ. Consequently, n M2 ∗ 2o max µ2 , kx1 − x k [kx − x ∗k2] ≤ . E j j

Next, we assume that x ∗ ∈ int(X ). Furthermore, ∇f (x) is assumed to be Lipschitz continuous with a constant L. As a result, we have that for all x ∈ X ,

f (x ∗) ≥ f (x) + ∇f (x)T (x ∗ − x) = f (x) + (∇f (x) − ∇f (x ∗))T (x − x ∗) f (x) ≤ f (x ∗) − (∇f (x) − ∇f (x ∗))T (x − x ∗) ≤ f (x ∗) + Lkx − x ∗k2.

∗ ∗ 2 Q(θ)L It follows that E[f (xj ) − f (x )] ≤ LE[kxj − x k ] ≤ j . After j iterations, expected error in solution and function value is of the order of 1 − O(j 2 ) and O(j −1), respectively.

27 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Strongly convex problems Strongly convex nonsmooth problemsVI

Example 8 ([16])

1 2 ∗ Consider a noise-free problem in which f (x) = 2 κx with κ > 0. Clearly x = 0. Furthermore, G(x, ξ) = ∇f (x) = κx. Suppose θ = 1 and γj = 1/j. Then we have

xj+1 = xj − κxj /j = xj (1 − κ/j).

Suppose κ = 1. Then the optimal solution is found in one iteration. Suppose κ < 1. Then we have

    j   j κ X κ xj+1 = x1Πs=1 1 − = x1exp − ln 1 +  . s s=1 s − κ

A bit of algebra reveals that −κ −2κ xj+1 = O(1)j and f (xj+1) > O(1)j .

In effect the convergence becomes very poor as κ → 0. Specifically, to reduce the error in xj by a factor of 10, we need to increase the number of iterations by 101/κ. For example, if κ = 0.1, x1 = 1, j = 1e5, we have that xj > 0.28. To reduce this to 0.028. we need to increase the number of iterations by 1010 or j = 1015. Furthermore if the function loses strong convexity, the parameter c degenerates to zero and no choice of θ > 1/2c may be available.

28 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationI

An observation drawn from the previous subsection is that κ can be far too small to achieve a reasonable empirical behavior . An important advance was made in [12] in generating a robust stepsize policy. In such a scheme, the simulation length was fixed at some K and the steplength was defined to be constant but dependent on K. Note that such schemes provide approximate solutions

Proposition 9 (Rate of convergence under nonsmoothness and convexity [12])

Suppose f is merely convex. Let Assumption 2 and 4 hold. Assume that the optimal set X ∗ of (Opt) is nonempty. Then for all k, we have that

∗ DX M E[f (˜x1,N ) − f (x )] ≤ √ . (2) N

Proof.

29 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationII

By convexity of f (x), we have that for any x ∈ X ,

T f (x) ≥ f (xj ) + (x − xj ) ∇x f (xj ),

where ∇x f (xj ) ∈ ∂f (xj ). Taking expectations, we obtain

∗ T ∗ E[(xj − x ) ∇x f (xj )] ≥ E[f (xj ) − f (x )]. But, we have that

∗ T 1 2 2 γj E[(xj − x ) g(xj )] ≤ aj − aj+1 + 2 γj M , implying that ∗ 1 2 2 γj E[f (xj ) − f (x )] ≤ aj − aj+1 + 2 γj M . As a result, for 1 ≤ i ≤ j, we have the following:

j j j j X ∗ X 1 X 2 2 1 X 2 2 γt E[f (xt ) − f (x )] ≤ (at − at+1) + 2 γt M ≤ ai + 2 γt M . t=i t=i t=i t=i

30 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationIII

Next, we define vt and DX as

γt vt and DX max kx − x1k2. , Pj , x∈X τ=1 γτ It follows from invoking these definitions, that

" j # 1 Pj 2 2 X ∗ ai + γt M v f (x ) − f (x ) ≤ 2 t=i . (3) E t t Pj t=i t=i γt

Pj Next, we consider points given byx ˜i,j := t=i vt xt . By the convexity of X , we have thatx ˜i,j ∈ X and by the convexity of f (•):

j X f (˜xi,j ) ≤ vt f (xt ). t=i

31 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationIV

1 2 2 From (16) and by noting that a1 ≤ 2 DX and ai ≤ 2Dx for i > 1, we obtain the following:

D2 + 1 Pj γ2M2 [f (˜x ) − f (x ∗)] ≤ X 2 t=i t , 1 ≤ j (4) E 1,j Pj 2 t=i γt 4D2 + 1 Pj γ2M2 [f (˜x ) − f (x ∗)] ≤ X 2 t=i t , 1 ≤ i ≤ j (5) E i,j Pj 2 t=i γt

We are now in a position to develop constant stepsize schemes. Suppose γt = γ for all t = 1,..., N. Then it follows that

D2 + M2Nγ2 [f (˜x , N) − f (x ∗)] ≤ X . E 1 2Nγ By minimizing the right hand side in γ > 0, we obtain that D γ∗ = √X . M N

32 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationV

This leads to the following efficiency estimate:

∗ DX M E[f (˜x1,N ) − f (x )] ≤ √ . (6) N Next, we can also claim that for 1 ≤ K ≤ N,

∗ CN,K DX M 2N 1 E[f (˜xK,N ) − f (x )] ≤ √ , where CN,K , + . (7) N N − K + 1 2 If we scale γ by θ, a positive scalar, implying that

θDX γt = √ , t = 1,..., N. M N The resulting efficiency estimate then becomes the following:   ∗ 1 CN,K DX M E[f (˜xK,N ) − f (x )] ≤ max θ, √ . (8) θ N

33 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Convex and nonsmooth problems Robust Stochastic ApproximationVI

1 − Naturally, the expected error O(N 2 ) is worse than the O(N−1) obtained in the classical approach. But in that setting, the function was a smooth and strongly convex function. Here the error bound is guaranteed regardless of smoothness and the strong convexity assumptions. Further, changing θ does not have a devastating impact; it just rescales the error. Thus the qualifier “robust.”

34 / 106 A Tutorial on Stochastic Approximation Nonsmooth convex problems Summary of findings Convergence and Rate StatementsI

Table: Convergence and Rate Statements

Smooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong Y Y O k O  E kxk − x k  ln(k)  [f (x, ω)] x ∈ X Convex Y Y O √ – f (¯x ) − x∗) E k E K Nonsmooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong N N O k O  E kxk − x k     [f (x, ω)] x ∈ X Convex N N O √1 O 1 f (¯x ) − x∗) E K 2 E K

35 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems GoalI

Consider a deterministic µ-strongly convex and L-smooth optimization problem:

min f (x). (9) x∈X Consider a standard projected gradient method for such a scheme.

xk+1 := ΠX [xk − γ∇x f (xk )] . (10) Then we have the following non-asymptotic linear rate of convergence.

∗ 2 k kxk − x k ≤ O(q ), (11) where q < 1.

However, if we replace ∇x f (xk ) by ∇x f (xk , ωk ), then the (expected) rate of convergence decays to the following sublinear level.

∗ 2 1 E[kxk − x k ] ≤ O( k ). (12)

How can this gap in rates be bridged? Ans: by reducing the bias in directions by increasing batch-sizes.

36 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems A variance-reduced scheme

We consider the following scheme:

xk+1 := ΠX [xk − γk (∇x f (xk ) +w ¯k )] , k ≥ 0, (13)

wherew ¯k is defined as follows.

PNk ˜ j=1 ∇x F (xk ,ωj,k )−∇x f (xk ) w¯k . (14) , Nk

The following assumption is imposed onw ¯k . Assumption 5 (Moment requirements)

2 ν2 We assume that [w ¯k | Fk ] = 0 a.s. and [kw¯k k | Fk ] ≤ a.s. for all k ≥ 0. E E Nk

We make the following assumption on the problem. Assumption 6

Suppose f (x) is a continuously differentiable, µ-strongly convex, and L-smooth function n on X . In addition, suppose X ⊆ R is a closed and convex set.

37 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityI

Lemma 10

+ Suppose Assumption 6 and 5 hold. Let x, y ∈ X and suppose x and cX (x) are defined by  1  x + := Π x − (∇ f (x) + w) and c (x) L(x − x +), (15) X L x X , respectively. Then the following holds. 1 µ f (x +) − g(y) ≤ c (x)T (x − y) − kc (x)k2 − w T (x + − y) − kx − yk2. X 2L X 2

Proof. We begin by recalling the projection inequality

  1 T x + − x − (∇ f (x) + w) (x + − y) ≤ 0. L x

38 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityII

Consequently, we have that

T + T + T + ∇x f (x) (x − y) ≤ cX (x) (x − y) − w (x − y). (16)

Then by the µ-strong convexity and L-smoothness of f (·), we may now derive the following bound.

f (x +) − f (y) = f (x +) − f (x) + f (x) − f (y) L µ ≤ ∇ f (x)T (x + − x) + kx + − xk2 + ∇ f (x)T (x − y) − kx − yk2 x 2 x 2 1 µ = ∇ f (x)T (x + − y) + kc (x)k2 − kx − yk2 x 2L X 2 (16) 1 µ ≤ c (x)T (x + − y) − w T (x + − y) + kc (x)k2 − kx − yk2 X 2L X 2 1 µ = c (x)T (x − y) − kc (x)k2 − w T (x + − y) − kx − yk2. X 2L X 2

39 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityIII

Theorem 11 (a.s. convergence and linear rate of convergence)

Let {xk } be generated by (13). Suppose Assumptions 6 and 5 hold. Then the sequence generated by (13) satisfies the following. 1 We have that k→∞ ∗ xk −−−→ x . a.s. 2 Furthermore, we have that for all k,

∗ 2 ∗ 2 k E[kxk − x k ] ≤ E[kx0 − x k ](q + c) ,

2ν2 where q , (1 − µ/L), q + c < 1, Nk , d L2ck e.

Proof. (a) We begin by noting the following.

 1  x = Π x − (∇f (x ) +w ¯ ) , (17) k+1 X k L k k

40 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityIV

∗ + From Lemma 10, by setting x = xk , w =w ¯k , and y = x , we have that x = xk+1, cX (xk ) = L(xk − xk+1), implying the following inequality.

T ∗ ∗ 1 2 µ ∗ 2 T ∗ −cX (xk ) (xk − x ) ≤ − (f (xk+1) − f (x )) − kcX (xk )k − kxk − x k − w¯k (xk+1 − x ) | {z } 2L 2 ≥ 0 1 µ ≤ − kc (x )k2 − kx − x ∗k2 − w¯T (x − x ∗). (18) 2L X k 2 k k k+1

∗ 2 Since cX (xk ) = L(xk − xk+1), we may bound kxk+1 − x k as follows. 1 1 2 kx − x ∗k2 = kx − c (x ) − x ∗k2 = kx − x ∗k2 + kc (x )k2 − c (x )T (x − x ∗) k+1 k L X k k L2 X k L X k k (18) 1 2  1 µ  ≤ kx − x ∗k2 + kc (x )k2 − kc (x )k2 + kx − x ∗k2 +w ¯T (x − x ∗) k L2 X k L 2L X k 2 k k k+1  µ  2 = 1 − kx − x ∗k2 − w¯T (x − x ∗) L k L k k+1  µ  2 2 = 1 − kx − x ∗k2 − w¯T (x − x¯ ) − w¯T (¯x − x ∗), L k L k k+1 k+1 L k k+1

41 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityV

 1  wherex ¯k+1 , ΠX xk − L ∇x f (xk ) . By (17) and the non-expansivity of the Euclidean T 2 projector, one may obtain that −w¯k (xk+1 − x¯k+1) ≤ kw¯k k kxk+1 − x¯k+1k ≤ kw¯k k /L. Therefore,  µ  2 2 kx − x ∗k2 ≤ 1 − kx − x ∗k2 + kw¯ k2 − w¯T (¯x − x ∗). k+1 L k L2 k L k k+1

Taking expectations conditioned on Fk on both sides of the above equation, we obtain ∗ the next inequality since xk and x¯ k+1 are adapted to Fk .

∗ 2  µ  ∗ 2 2 2 [kx − x k | F ] ≤ 1 − kx − x k + [kw k | F ] ( by [w |F ]=0) E k+1 k L k L2 E k k E k k 2  µ  ∗ 2 2ν ( by Assumption 5) ≤ 1 − [kxk − x k ] + 2 , . L L Nk It follows by using the super-Martingale convergence theorem and recalling that P 1 < ∞, we have that k Nk ∗ 2 a.s. kxk − x k −−−→ 0. k→∞

42 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Variance-reduced schemes for strongly convex problems Rate analysis under strong convexityVI

(b) By taking unconditional expectations, we obtain that

2 ∗ 2  µ  ∗ 2 2ν E[kxk+1 − x k ] ≤ 1 − E[kxk − x k ] + 2 L L Nk 2 2 2 ∗ 2 2ν 2ν = d E[kxk−1 − x k ] + d 2 + 2 L Nk−1 L Nk 2 2 k ∗ 2 k−1 2ν 2ν = d E[kx0 − x k ] + d 2 + ... + 2 L N1 L Nk ∗ 2  k k−1 k−1 k  ≤ E[kx0 − x k ] d + cd + ... + dc + c ∗ 2 k ≤ E[kx0 − x k ](d + c) ,

2ν2 k ∗ 2 2ν2 where 2 ≤ c [kx0 − x k ] since Nk d 2 k e. L Nk E , L c

43 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

1 We now consider the regime of structured nonsmooth problems

2 Specifically, we consider the following problem:

min f (x) + g(x), where f (x) [f (x, ω)] , (Comp) n , E x∈R f is a smooth convex function and g is a convex, closed, and proper function on X with a tractable proximal evaluation.

(dom(g) , {x : g(x) < +∞}. g is proper if dom(g) is nonempty and g(x) > −∞ for all x ∈ dom(g). g is lower semicontinuous if lim inf g(x) ≥ g(x ) for all x . A function is closed if for each α ∈ , {x ∈ dom(g): g(x) ≤ α} is a closed set. A proper convex x→x0 0 0 R function is closed if and only if it is lsc.) 3 Examples of g(x) include the following:

(i) g(x) , kxk1 (`1 norm) – “encourage sparsity” (ii) g(x) , 1X (x) (Indicator function) – capture constraints

44 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

Algorithm 1 VS-APM

(0) Given x1 ∈ X , y1 = x1 and positive sequences {γk , Nk }; Set λ0 = 0, λ1 = 1; k := 1. (1) y = P (x − γ (∇ f (x ) +w ¯ )); k+1 γk√,g k k x k k 2 1+ 1+4λk (2) λk+1 = 2 ; (λk −1) (3) xk+1 = yk+1 + (yk+1 − yk ); λk+1 (4) If k > K, then stop; else k := k + 1; return to (1).

1 This scheme has two steps

2 Step 1: “Prox step” requires defining the proximal operator Pg (x)   1 2 Pg (x) argmin g(u) + kx − uk . (19) , n u∈R 2

3 When g is proximable or has a tractable proximal evaluation, this minimization problem is either solvable in closed form or cheap to solve. 1 If g(x) = 1X (x), then Pg (x) , ΠX (x). 2 If g(x) = 0, then Pg (x) , x.

45 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

3 If g(x) = kxk1, then  x − 1, x ≥ 1  i i Pg (x) , 0, kxi k ≤ 1 .  xi + 1, xi ≤ −1.

4 Step 2: Averages between consecutive yk , i.e.

1 − λk xk+1 := (1 − βk )yk+1 + βk yk , where βk , . λk+1 In this subsection, we develop rate and oracle complexity statements for (VS-APM) when f is smooth. We begin with a modified assumption.

Assumption 7

(i) The function g(x) is lower semicontinuous and convex with effective domain denoted by dom(g); (ii) f (x) is smooth on an open set containing dom(g); (iii) There exists ∗ C > 0 such that E[kx1 − x k] ≤ C.

46 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems Lemma 12

Given a symmetric positive definite matrix Q, then for any ν1, ν2, ν3: √ T 1 2 2 2 T (ν2 − ν1) Q(ν3 − ν1) = 2 (kν2 − ν1kQ + kν3 − ν1kQ − kν2 − ν3kQ ), where kνkQ , ν Qν.

Lemma 13

Suppose Assumptions 2 and 7 hold. Furthermore, γk = 1/2L for all k. If h(xk ) , 2L(xk − yk+1), µ 2 1 2 T 2 2 F (x) − 2 kx − xk k ≥ F (yk+1) + 4L kh(xk )k + h(xk ) (x − xk ) − L kw¯k k .

Lemma 14

Consider Algorithm 2 and suppose Assumptions 2 and 7 hold for f (x) and g(x). If {γk } is a decreasing sequence and γk ≤ ηk /2, then the following holds for all K ≥ 2:

K−1 2 2 ∗ 2 X 2 2 ν 2C E[F (yK ) − F (x )] ≤ 2 γk k + 2 . γK−1(K − 1) Nk γK−1(K − 1) k=1

Proof.

47 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

1 By the update rule in Algorithm 2, we have

1 2 T yk+1 = argmin g(x) + kx − xk k + (∇x f (xk ) +w ¯k ) x. (20) x 2γk

2 From the optimality condition for (20), 1 0 ∈ ∂g(yk+1) + (yk+1 − xk ) + ∇x f (x) +w ¯k γk 1 =⇒ − (yk+1 − xk ) − ∇x f (xk ) − w¯k ∈ ∂g(yk+1). (21) γk

T 3 By convexity of g(x), we have that g(x) ≥ g(yk+1) + s (x − yk+1) for all s ∈ ∂g(yk+1). Hence, by (21), we obtain the following.

T T g(x) + (∇x f (xk ) +w ¯k ) x ≥ g(yk+1) + (∇x f (xk ) +w ¯k ) yk+1

1 T − (x − yk+1) (yk+1 − xk ). γk

48 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

4 Now by using Lemma 12, we obtain that

T 1 2 g(x) + (∇x f (xk ) +w ¯k ) x + kx − xk k (22) 2γk

T 1 2 1 2 ≥ g(yk+1) + (∇x f (xk ) +w ¯k ) yk+1 + kxk − yk+1k + kx − yk+1k 2γk 2γk

1 T 1 T + (x − yk+1) (yk+1 − xk ) − (x − yk+1) (yk+1 − xk ) γk γk

T 1 2 1 2 = g(yk+1) + (∇x f (xk ) +w ¯k ) yk+1 + kxk − yk+1k + kx − yk+1k . 2γk 2γk

49 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

5 By invoking the convexity of f (x) and Lipschitz continuity of ∇x f (x), we obtain

(Convexity) T f (x) ≥ f (xk ) + ∇x f (xk ) (x − xk ) (L-smoothness) L ≥ f (y ) + ∇f (x )T (x − y ) − kx − y k2 + ∇ f (x )T (x − x ) k+1 k k k+1 2 k k+1 x k k L = f (y ) + (∇ f (x ))T (x − y ) − kx − y k2 k+1 x k k+1 2 k k+1 L = f (y ) + (∇ f (x ) +w ¯ )T (x − y ) − kx − y k2 k+1 x k k k+1 2 k k+1 T − w¯k (x − yk+1), (23)

where the last equality follows from adding and subtractingw ¯k .

50 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

6 By adding (22) and (23), we obtain   1 2 1 2 L 1 2 F (yk+1) − F (x) ≤ kx − xk k − kx − yk+1k + − kxk − yk+1k 2γk 2γk 2 2γk T − w¯k (yk+1 − x)   L 1 2 1 T = − kxk − yk+1k + (xk − yk+1) (xk − x) (24) 2 2γk γk

1 2 T − kxk − yk+1k − w¯k (yk+1 − x), 2γk   L 1 2 1 T = − kxk − yk+1k + (xk − yk+1) (xk − x) (25) 2 γk γk T − w¯k (yk+1 − x),

where the penultimate equality follows from Lemma 12 by choosing Q = I , v1 = xk , v2 = x, and v3 = yk .

51 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

7 By setting x = yk in (24), we have

 L 1  2 1 T F (yk+1) − F (yk ) ≤ − kxk − yk+1k + (xk − yk+1) (xk − yk ) 2 γk γk T − w¯k (yk+1 − yk ). (26)

∗ 8 Similarly, by letting x = x , we can obtain

∗  L 1  2 1 T ∗ Fηk (yk+1) − Fηk (x ) ≤ − kxk − yk+1k + (xk − yk+1) (xk − x ) 2 γk γk T ∗ − w¯k (yk+1 − x ). (27)

9 By invoking Lemma 12 where v1 = xk , v2 = yk+1 and v3 = yk , we obtain

1 T 1  2 2 2 (yk+1 − xk ) (yk − xk ) = kyk − xk k + kyk+1 − xk k − kyk+1 − yk k . γk 2γk

52 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

10 Consequently, (26) can further bounded as follows:

F (yk+1) − F (yk )

 L 1  2 1 T T ≤ − kxk − yk+1k + (xk − yk+1) (xk − yk ) − w¯k (yk+1 − yk ) 2 γk γk

Lemma 12  L 1  2 1  2 2 2 = − kxk − yk+1k + kxk − yk k + kyk+1 − xk k − kyk+1 − yk k 2 γk 2γk T − w¯k (yk+1 − yk )

 L 1  2 1  2 2 = − kxk − yk+1k + kxk − yk k − kyk+1 − yk k 2 2γk 2γk T − w¯k (yk+1 − yk ). (28)

11 Similarly we have that

∗  L 1  2 1  ∗ 2 ∗ 2 F (yk+1) − F (x ) ≤ − kxk − yk+1k + kxk − x k − kyk+1 − x k 2 2γk 2γk T ∗ − w¯k (yk+1 − x ). (29)

53 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

12 By multiplying (28) by (λk − 1) and adding to (29),

λk δk+1 − (λk − 1)δk (30)

 L 1  2 ≤ − λk kyk+1 − xk k 2 2γk

1  2 2 1  ∗ 2 ∗ 2 + (λk − 1) kxk − yk k − kyk+1 − yk k + kxk − x k − kyk+1 − x k 2γk 2γk T ∗ +w ¯k ((λk − 1)yk + x − λk yk+1) , (31)

∗ where δk , F (yk ) − F (x ).

54 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

13 Again by using Lemma 12, we may express the terms in (31) as follows:

1  2 2 1  ∗ 2 ∗ 2 (λk − 1) kxk − yk k − kyk+1 − yk k + kxk − x k − kyk+1 − x k 2γk 2γk

1  2 2 2 = λk kxk − yk k − λk kyk+1 − yk k − kxk − yk k 2γk 2 ∗ 2 ∗ 2 +kyk+1 − yk k + kxk − x k − kyk+1 − x k

1  2 T 2 = −λk kyk+1 − xk k + 2λk (yk+1 − xk ) (yk − xk ) + kyk+1 − xk k 2γk T 2 T ∗  −2(yk+1 − xk ) (yk − xk ) − kyk+1 − xk k + 2(yk+1 − xk ) (x − xk )

1  2 T ∗  = −λk kyk+1 − xk k + 2(yk+1 − xk ) ((λk − 1)yk − λk xk + x ) . 2γk

14 In addition,

T ∗ w¯k ((λk − 1)yk + x − λk yk+1) T ∗ T =w ¯k ((λk − 1)yk + x − λk xk ) +w ¯k (λk xk − λk yk+1) .

55 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

15 From the update rule, 2 2 λk−1 = λk (λk − 1) = λk − λk .

16 Now by multiplying (31) by λk , we obtain the following, where ∗ uk = (λk − 1)yk − λk xk + x :

2 2 λk δk+1 − λk−1δk   2 L 1 2 ≤ λk − kyk+1 − xk k (32) 2 2γk

1  2 T ∗  + −kλk yk+1 − λk xk k + 2(λk yk+1 − λk xk ) ((λk − 1)yk + x − λk xk ) 2γk 2 T T − λk w¯k (xk − yk+1) − λk wk uk   2 L 1 2 2 T = λk − kyk+1 − xk k − λk w¯k (xk − yk+1) (33) 2 2γk

1  ∗ 2 ∗ 2 T + kλk xk − (λk − 1)yk − x k − kλk yk+1 − (λk − 1)yk − x k − λk wk uk 2γk 2 λk 2 1  2 2 T ≤   kw¯k k + kuk k − kuk+1k − λk wk uk , 2 1 − L 2γk γk

56 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

where in the last inequality we used the update rule of algorithm, λk −1 xk+1 = yk+1 + (yk+1 − yk ), to obtain the following: λk+1

∗ ∗ uk+1 = (λk+1 − 1)yk+1 − λk+1xk+1+x = (λk − 1)yk − λk yk+1+x .

17 By multiplying both sides by γk and assuming γk ≤ γk−1, we obtain

2 2 2 γk λk 2 1  2 2 T γk λk δk+1 − γk−1λk−1δk ≤   kw¯k k + kuk k − kuk+1k − γk λk wk uk . 2 1 − L 2 γk (34)

1 1 1 18 By assuming γk ≤ , we obtain − L ≥ , implying that 2L γk 2γk 1   γ λ2 δ − γ λ2 δ ≤ γ2λ2 kw¯ k2 + ku k2 − ku k2 − γ λ w¯ T u . (35) k k k+1 k−1 k−1 k k k k 2 k k+1 k k k k

57 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

Summing (35) from k = 1 to K − 1, we have the following:

K−1 K−1 2 X 2 2 2 1 2 X T γ λ δ ≤ γ λ kw¯ k + ku k − γ λ w¯ u K−1 K−1 K k k k 2 1 k k k k k=1 k=1 K−1 1 X 2 2 2 1 2 =⇒ δK ≤ 2 γk λk kw¯k k + 2 ku1k γK−1λ 2γK−1λ K−1 k=1 K−1 K−1 1 X T − 2 γk λk w¯k uk . γK−1λ K−1 k=1

19 Taking expectations, we note that the last term on the right is zero (under a zero bias assumption), leading to the following:

K−1 2 1 X 2 2 ν 1 2 E[δK ] ≤ 2 γk λk + 2 E[ku1k k] γK−1λ Nk 2γK−1λ K−1 k=1 K−1 K−1 2 2 2 X 2 2 ν 2C ≤ 2 γk k + 2 , γK−1(K − 1) Nk γK−1(K − 1) k=1

58 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

where in the last inequality we used the fact that ky − x ∗k ≤ C for all y ∈ dom(g) k and 2 ≤ λk ≤ k which may be shown inductively.

Corollary 15 (Rate and oracle complexity bounds with smooth f for (VS-APM)) Suppose f (x) is smooth and Assumption 2 and 7 hold. Suppose γk = γ ≤ 1/2L for all k. a 2ν2γ(a−2) 4C 2 (i) Let Nk = bk c where a = 3 + δ and Cb , a−3 + γ . Then the following holds.

K()   ∗ Cb X 1 [F (y − F (x ))] ≤ for all K and N ≤ O , E K+1 K 2 k 2+δ/2 k=1

∗ where E[F (yK()+1) − F (x )] ≤ . 2 ˜ 2 4C 2 (ii) Given a K > 0, let Nk = bk Kc where a > 3 and C , 2ν γ + γ . Then the following holds.

˜ K   ∗ C X 1 ∗ [F (y − F (x ))] ≤ and N ≤ O , where [F (y ) − F (x )] ≤ . E K+1 K 2 k 2 E K+1 k=1

59 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

a 1 a Proof. (i)Let Nk = bk c ≥ 2 k and γk = γ. Then the following holds where 2ν2γ(a−2) 4C 2 Cb , a−3 + γ .

2 K 2 2 2 2 ∗ 2ν γ X k 4C 2ν γ(a − 2) 4C [F (y ) − F (x )] ≤ + ≤ + = Cb , (36) E K+1 K 2 ka γK 2 (a − 3)K 2 γK 2 K 2 k=1 where the first inequality follows from bounding the summation as follows:

K K Z K 3−a X 2−a X 2−a 2−a 1 K 1 a − 2 k = 1 + k ≤ 1 + x dx = − + 1 ≤ + 1 = . a − 3 a − 3 a − 3 a − 3 k=1 k=2 1

∗ Cb Suppose yK+1 satisfies E[F (yK+1) − F (x )] ≤ , implying that K 2 ≤  or K = dCb1/2 / 1/2e. If  ≤ Cb/2, then the oracle complexity can be bounded as follows: √ √ 1+ C/ q K K b Z 2+ Cb/ (2 + C/)1+a X X a X a a b N ≤ k = k ≤ k da = k 1 + a k=1 k=1 k=1 0 p !1+a Cˆ  1  ≤ √ = O . 2  2+δ/2

60 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Structured nonsmooth problems

2 1 2 (ii) Let Nk = bk Kc ≥ 2 k K. Then similar to part (i), we may bound the expected ˜ 2 4C 2 sub-optimality as follows where C , 2ν γ + γ .

2 K 2 2 2 2 ˜ ∗ 2ν γ X k 4C 2ν γ 4C C [F (y ) − F (x )] ≤ + = + ≤ . E K+1 K 2 k2K γK 2 K 2 γK 2 K 2 k=1

Since K = dC˜1/2/1/2e, the oracle complexity may be bounded as follows:

K K   X X 2 1 2 1 2 2 4 1 N ≤ k K = K (K + 1)(2K + 1) = K (2K + 3K + 1) ≤ K ≤ O . k 6 6 2 k=1 k=1

61 / 106 A Tutorial on Stochastic Approximation Acceleration and variance reduction techniques Summary of findings Convergence and Rate StatementsI

Table: Convergence and Rate Statements

Smooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong Y Y O k O  E kxk − x k  ln(k)  [f (x, ω)] x ∈ X Convex Y Y O √ – f (¯x ) − x∗) E k E K Nonsmooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong N N O k O  E kxk − x k     [f (x, ω)] x ∈ X Convex N N O √1 O 1 f (¯x ) − x∗) E K 2 E K Smooth and structured nonsmooth convex + variance reduction Obj., Cons. Convexity Smooth. a.s. Rate Oracle complexity γk , Nk Metric  k   1  −k h ∗ 2i E[f (x, ω)] , x ∈ X Strong Y Y O q O  γ, daq e E kxk − x k     [f (x, ω)] , x ∈ X Convex Composite Y O 1 O 1 γ, bk3+δ e f (¯x ) − x∗) E K2 2 E K

62 / 106 A Tutorial on Stochastic Approximation Smoothing

1 A key shortcoming of the previous scheme is that the expectation-valued part (namely f (x)) of the problem is smooth 2 Suppose the expectation-valued part of the objective is nonsmooth, i.e. f (x) is an expectation-valued nonsmooth but (1, B2) smoothable function while g(x) is assumed to be a proper, closed, and convex function with an efficient proximal operator. 3 We recall the following definition of a smoothable function [2].

Definition 16 n A convex function f : R → R is said to be (α, β) smoothable if there exists a n continuously differentiable convex function fη : R → R such that

fη(x) ≤ f (x) ≤ fη(x) + ηB

n for all x ∈ R and fη is α/η-smooth, i.e.

α n k∇x fη(x) − ∇x fη(y)k ≤ η kx − yk, ∀x, y ∈ R .

4 There are a host of smoothing functions based on the nature of f .

63 / 106 A Tutorial on Stochastic Approximation Smoothing

q 2 2 For instance, when f (x) = kxk2, then fη(x) = kxk2 + η − η, implying that f is (1, 1)−smoothable function. If f (x) = max(x1, x2,..., xn), then f is (1, log(n))-smoothable and Pn xi /η fη(x) = η log( i=1 e ) − η log(n). (see [3] for more examples). When f is a proper, closed, and convex function, the Moreau envelope is defined as n 1 2o 2 fη(x) , minu f (u) + 2η ku − xk . In fact, f is (1, B )-smoothable when fη is given by the Moreau envelope (see [3]) and B denotes a uniform bound on ksk in x where s ∈ ∂f (x).

5 When f (x, ω) is a proper, closed, and convex function in x for every ω, then f (x, ω) 2 is (1, B )-smoothable for every ω where fη(x, ω) is a suitable smoothing. 6 We proceed to develop a smoothed variant of (VS-APM), referred to as

(sVS-APM), in which ∇x fηk (xk , ωk ) is generated from the stochastic oracle and ηk is driven to zero at a sufficient rate (See Alg. 2).

64 / 106 A Tutorial on Stochastic Approximation Smoothing

Algorithm 2 Smoothed VS-APM (sVS-APM)

(0) Given x1 ∈ X , y1 = x1 and positive sequences {γk , Nk , ηk }; Set λ0 = 0, λ1 = 1; k := 1. (1) y = P (x − γ (∇ f (x ) +w ¯ )); k+1 γk√,g k k x ηk k k 2 1+ 1+4λk (2) λk+1 = 2 ; (λk −1) (3) xk+1 = yk+1 + (yk+1 − yk ); λk+1 (4) If k > K, then stop; else k := k + 1; return to (1).

Note that in this setting, we assume that for every η > 0, one can generate an estimator ∇˜ x fη(xk , ωk ) of the true gradient

Assumption 8

(i) The function g(x) is lower semicontinuous and convex with effective domain denoted by dom(g); (ii) f (x) is nonsmooth but (1, B2) smoothable on an open set containing ∗ dom(g); (iii) There exists C > 0 such that E[kx1 − x k] ≤ C.

We are now ready to prove our main rate result and oracle complexity bound for (sVS-APM).

65 / 106 A Tutorial on Stochastic Approximation Smoothing

Theorem 17 (Rate Statement and Oracle Complexity Bound for (sVS-APM))

Suppose Assumptions 2 and 8 hold. Suppose {λk } is specified in Algorithm 2. Suppose a ηk = 1/k, and γk = 1/2k, and Nk = bk c, where a > 1. ¯ 2ν2a 2 2 (i) If C , a−1 + 4C + B , then the following holds for any K ≥ 1:

C¯ 2ν2a [F (y ) − F (x ∗)] ≤ , where C¯ + 4C 2 + B2. E K+1 K , (a − 1)

˜ ∗ PK 1  (ii) Let  ≤ C/2 and K is such that E[F (yK+1) − F (x )] ≤ . Then k=1 Nk ≤ O 1+a .

a 1 a Proof. (i) If Nk = bk c ≥ 2 k and γk = 1/(2k) is utilized in Lemma 14, we obtain the following

K 2ν2 X 1 4C 2 [δ ] ≤ + . (37) E K+1 K ka K k=1

66 / 106 A Tutorial on Stochastic Approximation Smoothing

For a > 1, we may derive the next bound.

K K Z K 1−a X −a X −a −a 1 − K a k = 1 + k ≤ 1 + k dk = 1 + ≤ . a − 1 a − 1 k=1 k=2 1

2 By invoking (1, B )−smoothability of f and ηK = 1/K, we have that ∗ ∗ 2 FηK (yK+1) ≤ F (yK+1) and −FηK (x ) ≤ −F (x ) + ηB . Hence, the required bound follows from (37)

2ν2a 4C 2 + B2 C¯ 2ν2a [F (y ) − F (x ∗)] ≤ + ≤ , where C¯ + 4C 2 + B2. E K+1 (a − 1)K K K , (a − 1)

∗ C¯ (ii) To find yK+1 satisfying E[F (yK+1) − F (x )] ≤  we have K ≤  which implies that ¯ PK K = dC/e. To obtain the optimal oracle complexity we require k=1 Nk gradients. Hence, the following holds for sufficiently small  such that 2 ≤ C¯/:

¯ K K 1+C/ Z 2+C¯/ ¯ 1+a  ¯ 1+a   X X a X a a (2 + C/) C 1 N ≤ k = k ≤ k da = ≤ ≤ O . k 1 + a  1+a k=1 k=1 k=1 0

67 / 106 A Tutorial on Stochastic Approximation Smoothing Summary of findings Convergence and Rate StatementsI

Table: Convergence and Rate Statements

Smooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong Y Y O k O  E kxk − x k  ln(k)  [f (x, ω)] x ∈ X Convex Y Y O √ – f (¯x ) − x∗) E k E K Nonsmooth convex Objective Constraints Convexity Smoothness a.s. Rate Oracle complexity Metric  1   1  h ∗ 2i E[f (x, ω)] x ∈ X Strong N N O k O  E kxk − x k     [f (x, ω)] x ∈ X Convex N N O √1 O 1 f (¯x ) − x∗) E K 2 E K Smooth stochastic and structured nonsmooth convex + variance reduction Obj., Cons. Convexity Smooth. a.s. Rate Oracle complexity γk , Nk Metric  k   1  −k h ∗ 2i E[f (x, ω)] , x ∈ X Strong Y Y O q O  γ, daq e E kxk − x k     [f (x, ω)] , x ∈ X Convex Composite Y O 1 O 1 γ, bk3+δ c f (¯x ) − x∗) E K2 2 E K Nonsmooth (but smoothable) stochastic and structured nonsmooth convex + variance reduction Obj., Cons. Convexity Smooth. a.s. Rate Oracle complexity γk , Nk Metric     [f (x, ω)] , x ∈ X Convex Nonsmooth Y O 1 O 1 γ, bk1+δ c f (¯x ) − x∗) E K 2+δ E K

68 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

IntroductionI

Suppose the problem has expectation-valued constraints as given by the following:

min f (x) , E[f (x, ω)] subject to c(x) , E[c(x, ω)] ≤ 0, (λ) (Opt) x ∈ X .

Suppose f , c are smooth and convex functions. Then under suitable regularity conditions, z = (x, λ) denotes a primal-dual solution of (Opt) if and only if

(y − z)T H(z) ≥ 0, ∀y ∈ X ,

where  T  [E[∇x f (x, ω) + ∇x c(x, ω) λ] H(z) , . E[c(x, ω)] n In other words, z is a solution of VI(X , H) where H : X → R .

69 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

IntroductionII

We often qualify this problem as a stochastic variational inequality problem, since H is expectation-valued. In fact, the mapping H is monotone on X . This can be concluded as follows.

T (H(z1) − H(z2)) (z1 − z2) T ∇ f (x ) + ∇ c(x )T λ − ∇ f (x ) − ∇ c(x )T λ  (x − x ) = x 1 x 1 1 x 2 x 2 2 1 2 c(x1) − c(x2) (λ1 − λ2) T = (∇x f (x1) − ∇x f (x2)) (x1 − x2) | {z } ≥0 T T + (∇x c(x1)(x1 − x2)) λ1 − (∇x c(x2)(x1 − x2)) λ2 | {z } | {z } ≥c(x2)−c(x1) ≥c(x2)−c(x1) T + (c(x1) − c(x2)) (λ1 − λ2) ≥ 0.

70 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

IntroductionIII

Note that since c(x) is a vector of convex constraints, it follows that

T ∇x cj (x1) (x2 − x1) ≤ cj (x2) − cj (x1), for j = 1,..., m.

=⇒ ∇x c(x1)(x2 − x1) ≤ c(x2) − c(x1).

Under suitable regularity conditions on the problem (such as Slater’s regularity condition), we may also prove that H is Lipschitz.

71 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeI

˜ ˜ Suppose wk , H(zk ) − H(zk , ωk ) where H(zk , ωk ) is an estimator of H(zk ).

Assumption 9 (Mapping requirements)

We assume that H is a monotone and L-Lipschitz continuous mapping on X .

We consider the following two-step (extragradient) scheme for resolving this problem.

xk+ 1 := ΠX (xk − γ(F (xk ) + wk )), 2 (38) xk+1 := ΠX (xk − γ(F (xk+1/2) + wk+1/2).

72 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeII

Lemma 18

Consider the stochastic variational inequality problem defined by SVI(X , H) and let x ∗ denote any solution of SVI(X , H). Suppose Assumptions 2 and 9 hold. Furthermore, suppose X is a nonempty, closed, convex, and bounded set. Consider the sequence of the T ∗ iterates be generated by the extragradient scheme (38) and let uk , 2γk H(xk ) (xk − x ). Then, the following holds for any iterate k:

 γ2  kx − x ∗k2 ≤ 1 + k kx − x ∗k2 − u − 2γ w T (x − x ∗) + γ2t , (39) k+1 β k k k k+1/2 k k k where the scalar tk ≥ 0 is an appropriately defined scalar.

Proof. Let yk = xk − γk (H(xk+1/2) + wk+1/2). Then,

∗ 2 ∗ 2 kxk+1 − x k = kΠX (yk ) − x k ∗ 2 2 T ∗ = kyk − x k + kΠX (yk ) − yk k + 2 (ΠX (yk ) − yk ) (yk − x ).

73 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeIII

2 T ∗ Note that 2kyk − ΠX (yk )k + 2(ΠX (yk ) − yk ) (yk − x ) 2 T ∗ = 2kyk − ΠX (yk )k + 2(ΠX (yk ) − yk ) (yk − ΠX (yk ) + ΠX (yk ) − x ) 2 2 T ∗ = 2kyk − ΠX (yk )k − 2kyk − ΠX (yk )k + 2(ΠX (yk ) − yk ) (ΠX (yk ) − x ) T ∗ = 2(ΠX (yk ) − yk ) (ΠX (yk ) − x ) ≤ 0, where the last inequality follows from the projection property. Consequently, we have that

2 T ∗ 2 kyk − ΠX (yk )k + 2(ΠX (yk ) − yk ) (yk − x ) ≤ −kyk − ΠX (yk )k . (40)

74 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeIV

∗ 2 By invoking (40) in the expansion of kxk+1 − x k , we obtain

∗ 2 ∗ 2 2 ∗ T ∗ kxk+1 − x k = kyk − x k + kΠX (yk ) − yk k + 2(ΠX (yk ) − x ) (yk − x ) ∗ 2 2 ≤ kyk − x k − kyk − ΠX (yk )k ∗ 2 = kxk − γk (H(xk+1/2) + wk+1/2) − x k 2 − kxk − γk (H(xk+1/2) + wk+1/2) − xk+1k ∗ 2 2 2 = kxk − x k + γk kH(xk+1/2) + wk+1/2)k ∗ T − 2γk (xk − x ) (H(xk+1/2) + wk+1/2) 2 2 2 − kxk+1 − xk k − γk kH(xk+1/2) + wk+1/2)k T + 2γk (xk − xk+1) (H(xk+1/2) + wk+1/2) ∗ 2 2 ∗ T = kxk − x k − kxk − xk+1k + 2γk (x − xk+1) (H(xk+1/2) + wk+1/2).

75 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeV

T By adding and subtracting xk (H(xk+1/2)+ wk+1/2), we obtain

∗ 2 ∗ 2 2 ∗ T kxk+1 − x k ≤ kxk − x k − kxk − xk+1k + 2γk (x − xk+1) (H(xk+1/2) + wk+1/2) ∗ 2 2 ∗ T = kxk − x k − kxk − xk+1k + 2γk (x − xk ) (H(xk+1/2) + wk+1/2) T + 2γk (xk − xk+1) (H(xk+1/2) + wk+1/2) ∗ 2 2 ∗ T ≤ kxk − x k − kxk − xk+1k + 2γk (x − xk ) (H(xk+1/2) + wk+1/2) 2 2 2 + kxk − xk+1k + γk kH(xk+1/2) + wk+1/2k ∗ 2 ∗ T ∗ T = kxk − x k + 2γk (x − xk ) H(xk ) + 2γk (x − xk ) (H(xk+1/2) − H(xk )) | {z } Term a ∗ T 2 2 + 2γk (x − xk ) wk+1/2 + γk kH(xk+1/2) + wk+1/2k .

76 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeVI

Next, we observe that Term a can be bounded as follows:

γ2 γ2 Term a ≤ k kx ∗ − x k2 + βkH(x − H(x ))k2 ≤ k kx ∗ − x k2 + βL2kx − x k2 β k k+1/2 k β k k+1/2 k γ2 ≤ k kx ∗ − x k2 + βL2kΠ (x − γ (H(x ) + w )) − Π (x )k2 β k X k k k k X k γ2 = k kx ∗ − x k2 + γ2βL2kH(x ) + w k2. β k k k k

77 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeVII

Therefore, we have that

∗ 2 kxk+1 − x k (41)  γ2  ≤ 1 + k kx − x ∗k2 − 2γ (x − x ∗)T H(x ) + 2γ (x ∗ − x )T w β k k k k k k k+1/2 2  2 2 2 2 2 2 + γk kH(xk+1/2)k + kwk+1/2k + βL kH(xk )k + βL kwk k

2  2 T T  + 2γk βL wk H(xk ) + wk+1/2H(xk+1/2)

=uk  γ2  z }| { ≤ 1 + k kx − x ∗k2 − 2γ (x − x ∗)T H(x ) + 2γ (x ∗ − x )T w (42) β k k k k k k k+1/2

=tk z }| {  B2(1 + βL2)  + γ2 + kw k2 + βL2kw k2 + 2w T H(x ) + 2βL2w T H(x ) , k 4 k+1/2 k k+1/2 k+1/2 k k where we invoke the boundedness of H over X .

78 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeVIII

Proposition 19 (a.s. convergence of ESA)

Consider SVI(X , H). Suppose Assumptions 9 and 2 hold. Suppose X is a nonempty, closed, convex, and bounded set. Then, the extragradient scheme (38) generates a sequence {xk } such that {xk } is bounded a.s. and any limit point of {xk } is a solution of SVI(X , H) in an a.s. sense.

79 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeIX

Proof. Taking expectations conditioned on Fk , we have that

 γ2  [kx − x ∗k2 | F ] ≤ 1 + k kx − x ∗k2 − u − 2γ [(x − x ∗)T w | F ] E k+1 k β k k k E k k+1/2 k  B2(1 + βL2)  + γ2 + [kw k2 | F ] + βL2 [kw k2 | F ] k 4 E k+1/2 k E k k 2  T 2 T  + γk E[2wk+1/2H(xk+1/2) | Fk ] + 2βL E[wk H(xk )|Fk ]  γ2  = 1 + k kx − x ∗k2 − u − 2γ [ [(x − x ∗)T w | F ]| F ] β k k k E E k k+1/2 k+1/2 k  B2(1 + βL2)  + γ2 + [ [kw k2 | F ] | F ] + βL2 [kw k2 | F ] k 4 E E k+1/2 k+1/2 k E k k 2  T 2 T  + γk E[E[2wk+1/2H(xk+1/2) | Fk+1/2] | Fk ] + 2βL E[wk H(xk ) | Fk ] ∗ 2 ≤ (1 + δk ) kxk − x k − uk + ψk , (43) γ2  (B2 + 8ν2)(1 + βL2)  where δ k and ψ γ2 , k , β k , k 4

80 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeX

and the inequalities follow from using the tower law and assumption (A1). The remainder of the proof requires application of the super-martingale convergence theorem (Lemma 4).

This requires that uk ≥ 0 for all k, which follows from noting that F is a monotone map over X , implying that

∗ T ∗ T ∗ ∗ T ∗ (xk − x ) H(xk ) = (xk − x ) (H(xk ) − H(x )) + (xk − x ) H(x ) ≥ 0. P Using assumption (A3), it is observed that k ψk < ∞. Invoking Lemma 4, we ∗ 2 have that {kxk − x k } is a convergent sequence in an a.s. sense.

Then in an a.s. sense, it follows that {xk }k≥0 is a bounded sequence and has a convergent subsequence {xk }k∈K.

We proceed by contradiction; suppose xk converges tox ˆ along subsequence Kwhere xˆ is not necessarily a solution to SVI(X , H).

81 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeXI

Since {δk } is summable in an a.s. sense, {uk } is summable a.s. and from the non-summability of γk , we have that a.s., the following implication holds.

X X T ∗ T ∗ uk = 2γk H(xk ) (xk − x ) < ∞ =⇒ lim H(xk ) (xk − x ) = 0. k∈K k∈K k∈K

a.s. Since xk −−−→ xˆ along the subsequence K from (43) and by the continuity of F k→∞ over X , we obtain

H(ˆx)T (ˆx − x ∗) = 0. (44)

By recalling that x ∗ is a solution of VI(X , H) and since H is a symmetric monotone map (since it is a gradient map of a function), it follows from [5, Ch. 2] that H is monotone plus and therefore pseudomonotone plus (as formalized by the next implication).

h i H(x ∗)T (ˆx − x ∗) ≥ 0 and H(ˆx)T (ˆx − x ∗) = 0 =⇒ H(ˆx) = H(x ∗). (45)

82 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Stochastic extragradient schemeXII

We may then conclude the following.

H(ˆx)T (x − xˆ) = H(ˆx)T (x − x ∗) + H(ˆx)T (x ∗ − xˆ)

(44) = H(ˆx)T (x − x ∗) = (H(ˆx) − H(x ∗))T (x − x ∗) + H(x ∗)T (x − x ∗).

Therefore from (45), the following holds:

(45) ∀x ∈ X , H(ˆx)T (x − xˆ) = H(x ∗)T (x − xˆ) = H(x ∗)T (x − x ∗) + H(x ∗)T (x ∗ − xˆ) ∗ x ∈ SOL(X ,F ) ≥ H(x ∗)T (x ∗ − xˆ) (45) (44) = H(ˆx)T (x ∗ − xˆ) = 0.

It follows thatx ˆ is a solution to SVI(X , H) and any limit point of {xk } is a solution of SVI(X , H) in an a.s. sense.

83 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysisI

The following rate statements are provided for the sequencex ¯N , an average of the iterates {xk+1/2} generated by (SEG) over the window constructed from Nl to N where Nl , bN/2c and N ≥ 2: PN k=N γk xk+ 1 x¯ l 2 . (46) N , PN γk k=Nl

Lemma 20

Let \(X\) be a nonempty, closed, and convex set in \(\mathbb{R}^n\). Then for all \(y \in X\) and any \(x \in \mathbb{R}^n\), the following hold:
(i) \((\Pi_X(x) - x)^T(y - \Pi_X(x)) \ge 0\); and
(ii) \(\|\Pi_X(x) - y\|^2 \le \|x - y\|^2 - \|x - \Pi_X(x)\|^2\).
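A quick numerical check of both parts of Lemma 20, using the Euclidean unit ball as an assumed instance of X (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def proj_ball(u):
    # Euclidean projection onto the unit ball {x : ||x|| <= 1}.
    nrm = np.linalg.norm(u)
    return u if nrm <= 1.0 else u / nrm

x = 3.0 * rng.normal(size=5)          # arbitrary point, possibly outside X
y = proj_ball(rng.normal(size=5))     # arbitrary point of X
p = proj_ball(x)
# Lemma 20(i): (Pi_X(x) - x)^T (y - Pi_X(x)) >= 0
print((p - x) @ (y - p) >= -1e-12)
# Lemma 20(ii): ||Pi_X(x) - y||^2 <= ||x - y||^2 - ||x - Pi_X(x)||^2
print(np.sum((p - y)**2) <= np.sum((x - y)**2) - np.sum((x - p)**2) + 1e-12)
```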

84 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis II

Lemma 21

Consider VI(X, H) and let \(\{x_k\}\) be generated by (38). Then the following holds for any \(y \in X\) and any index \(k\).

\[
\begin{aligned}
\mathbb{E}[\|x_{k+1} - y\|^2 \mid \mathcal{F}_{k+1/2}] &\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 \\
&\quad - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) + 8\gamma_k^2\nu^2. \qquad (47)
\end{aligned}
\]

Proof. By Lemma 20(ii), we have

\[
\begin{aligned}
\|x_{k+1} - y\|^2 &\le \|x_k - \gamma_k H(x_{k+1/2},\omega_{k+1/2}) - y\|^2 - \|x_k - \gamma_k H(x_{k+1/2},\omega_{k+1/2}) - x_{k+1}\|^2 \\
&= \|x_k - y\|^2 - \|x_{k+1} - x_k\|^2 - 2\gamma_k\big(H(x_{k+1/2}) + w_{k+1/2}\big)^T(x_{k+1} - y). \qquad (48)
\end{aligned}
\]
We have

\[
-H(x_{k+1/2})^T(x_{k+1} - y) = -H(x_{k+1/2})^T(x_{k+1} - x_{k+1/2}) - H(x_{k+1/2})^T(x_{k+1/2} - y). \qquad (49)
\]

85 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis III

Applying (49) to (48) yields

\[
\begin{aligned}
\|x_{k+1} - y\|^2 &\le \|x_k - y\|^2 - \|x_{k+1} - x_k\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1} - x_{k+1/2}) \\
&\quad - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) - 2\gamma_k w_{k+1/2}^T(x_{k+1} - y) \\
&= \|x_k - y\|^2 - \|x_{k+1/2} - x_k\|^2 - \|x_{k+1/2} - x_{k+1}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) \\
&\quad - 2\gamma_k w_{k+1/2}^T(x_{k+1} - y) + 2(x_{k+1} - x_{k+1/2})^T\big(x_k - \gamma_k H(x_{k+1/2}) - x_{k+1/2}\big). \qquad (50)
\end{aligned}
\]
By Lemma 20(i), we have

\[
(x_{k+1/2} - x_{k+1})^T\big(x_{k+1/2} - x_k + \gamma_k H(x_k,\omega_k)\big) \le 0. \qquad (51)
\]

86 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis IV

Using (51) in (50), we obtain

\[
\begin{aligned}
\|x_{k+1} - y\|^2 &\le \|x_k - y\|^2 - \|x_{k+1/2} - x_k\|^2 - \|x_{k+1/2} - x_{k+1}\|^2 \\
&\quad - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) - 2\gamma_k w_{k+1/2}^T(x_{k+1} - y) \\
&\quad + 2\gamma_k (x_{k+1} - x_{k+1/2})^T\big(H(x_k) - H(x_{k+1/2})\big) + 2\gamma_k w_k^T(x_{k+1} - x_{k+1/2}) \\
&\le \|x_k - y\|^2 - \|x_{k+1/2} - x_k\|^2 - \|x_{k+1/2} - x_{k+1}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) \\
&\quad - 2\gamma_k w_{k+1/2}^T(x_{k+1} - y) + 2\gamma_k L\|x_{k+1} - x_{k+1/2}\|\,\|x_k - x_{k+1/2}\| + 2\gamma_k w_k^T(x_{k+1} - x_{k+1/2}) \\
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - \tfrac{1}{2}\|x_{k+1/2} - x_{k+1}\|^2 \\
&\quad - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) - 2\gamma_k w_{k+1/2}^T(x_{k+1/2} - y) + 2\gamma_k (w_k - w_{k+1/2})^T(x_{k+1} - x_{k+1/2}) \\
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) \\
&\quad - 2\gamma_k w_{k+1/2}^T(x_{k+1/2} - y) + 2\gamma_k^2\|w_k - w_{k+1/2}\|^2. \qquad (52)
\end{aligned}
\]

87 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis V

Taking conditional expectations with respect to Fk+1/2, we obtain the following.

\[
\begin{aligned}
\mathbb{E}[\|x_{k+1} - y\|^2 \mid \mathcal{F}_{k+1/2}]
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) \\
&\quad - 2\gamma_k\, \mathbb{E}[w_{k+1/2}^T(x_{k+1/2} - y) \mid \mathcal{F}_{k+1/2}] + 2\gamma_k^2\, \mathbb{E}[\|w_k - w_{k+1/2}\|^2 \mid \mathcal{F}_{k+1/2}] \\
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) \\
&\quad + 4\gamma_k^2\, \mathbb{E}[\|w_k\|^2 + \|w_{k+1/2}\|^2 \mid \mathcal{F}_{k+1/2}] \\
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) + 8\gamma_k^2\nu^2.
\end{aligned}
\]

88 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis VI

Proposition 22 (Dim. steplength: SEG)

Consider the stochastic variational inequality problem SVI(X, H) and let \(x^*\) denote any solution of SVI(X, H). Suppose Assumptions 2 and 9 hold. Furthermore, suppose \(X\) is a nonempty, closed, convex, and bounded set. Let \(\{\bar{x}_N\}\) be defined as in (46), where \(0 < \gamma_k \le 1/L\) for all \(k \ge 0\) and \(\gamma_k = \gamma_0/\sqrt{k}\). Then
\[
\mathbb{E}[G(\bar{x}_N)] = \mathcal{O}\!\left(\tfrac{1}{\sqrt{N}}\right).
\]

89 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis VII

Proof. Recall from (47) that

\[
\begin{aligned}
\mathbb{E}[\|x_{k+1} - y\|^2 \mid \mathcal{F}_{k+1/2}]
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 \\
&\quad - 2\gamma_k H(x_{k+1/2})^T(x_{k+1/2} - y) + 8\gamma_k^2\nu^2 \qquad (53)\\
&= \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 \\
&\quad - \underbrace{2\gamma_k \big(H(x_{k+1/2}) - H(y)\big)^T(x_{k+1/2} - y)}_{\ge\, 0} - 2\gamma_k H(y)^T(x_{k+1/2} - y) + 8\gamma_k^2\nu^2 \qquad (54)\\
&\le \|x_k - y\|^2 - (1 - 2\gamma_k^2 L^2)\|x_k - x_{k+1/2}\|^2 - 2\gamma_k H(y)^T(x_{k+1/2} - y) + 8\gamma_k^2\nu^2. \qquad (55)
\end{aligned}
\]

Taking expectations on both sides of (55) and recalling that \(1 - 2\gamma_k^2 L^2 \ge 0\), we obtain

\[
2\gamma_k\, \mathbb{E}[H(y)^T(x_{k+1/2} - y)] \le \mathbb{E}[\|x_k - y\|^2] - \mathbb{E}[\|x_{k+1} - y\|^2] + 8\gamma_k^2\nu^2, \qquad \forall y \in X. \qquad (56)
\]

90 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis VIII

From (56), by summing over k from Nl to N, we have the following for all y ∈ X :

\[
2\sum_{k=N_l}^{N}\gamma_k\, \mathbb{E}[H(y)^T(x_{k+1/2} - y)] \le \mathbb{E}[\|x_{N_l} - y\|^2] - \mathbb{E}[\|x_{N+1} - y\|^2] + 8\sum_{k=N_l}^{N}\gamma_k^2\nu^2.
\]
Consequently, we have the following sequence of inequalities:

\[
2\Big(\sum_{k=N_l}^{N}\gamma_k\Big)\, \mathbb{E}[H(y)^T(\bar{x}_N - y)] \le \mathbb{E}[\|x_{N_l} - y\|^2] - \mathbb{E}[\|x_{N+1} - y\|^2] + 8\sum_{k=N_l}^{N}\gamma_k^2\nu^2 \le B^2 + 8\sum_{k=N_l}^{N}\gamma_k^2\nu^2, \qquad (57)
\]
where the second inequality follows from the boundedness of \(X\). Since \(\gamma_k = \gamma_0/\sqrt{k}\), it follows that for all \(y \in X\):

\[
\mathbb{E}[H(y)^T(\bar{x}_N - y)] \le \frac{B^2 + 8\sum_{k=N_l}^{N}\gamma_k^2\nu^2}{2\sum_{k=N_l}^{N}\gamma_k} = \frac{B^2}{2\gamma_0\sum_{k=N_l}^{N} k^{-1/2}} + \frac{4\gamma_0\nu^2\sum_{k=N_l}^{N} k^{-1}}{\sum_{k=N_l}^{N} k^{-1/2}}. \qquad (58)
\]

91 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis IX

We now utilize the following lower bound on the denominator for N ≥ 1:

\[
\sum_{k=N_l}^{N} k^{-1/2} \ge \int_{N/2}^{N} (x+1)^{-1/2}\, dx = 2\sqrt{N+1} - 2\sqrt{N/2+1} \ge 2\sqrt{N/40}. \qquad (59)
\]
Similarly, an upper bound may be constructed:

\[
\sum_{k=N_l}^{N} k^{-1} \le \int_{N/2}^{N} x^{-1}\, dx + \lfloor N/2\rfloor^{-1} \le \log 2 + 1. \qquad (60)
\]
By substituting (59) and (60) in (58), we obtain that the following holds:
\[
\mathbb{E}[H(y)^T(\bar{x}_N - y)] \le \frac{C_4}{\sqrt{N}} \quad \text{for all } y \in X, \qquad \text{where } C_4 \triangleq \frac{\sqrt{40}\,B^2}{4\gamma_0} + 2\sqrt{40}\,(\log 2 + 1)\,\gamma_0\nu^2.
\]
The result follows by taking the supremum over \(y \in X\) since

\[
G(\bar{x}_N) \triangleq \sup_{y\in X}\, \mathbb{E}[H(y)^T(\bar{x}_N - y)] \le \frac{C_4}{\sqrt{N}}.
\]
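As a quick sanity check, the integral bounds (59) and (60) used above can be verified numerically for a few values of N (illustrative only; this is not part of the proof):

```python
import numpy as np

# Numerical check of the lower bound (59) on sum k^{-1/2} and the upper bound
# (60) on sum k^{-1}, over the window k = floor(N/2), ..., N.
for N in (10, 100, 1000, 10000):
    k = np.arange(N // 2, N + 1, dtype=float)
    ok59 = np.sum(k**-0.5) >= 2.0 * np.sqrt(N / 40.0)
    ok60 = np.sum(1.0 / k) <= np.log(2.0) + 1.0
    print(N, ok59, ok60)
```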

92 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Rate analysis X

Proposition 23 (Constant steplength: SEG)

Consider the stochastic variational inequality problem SVI(X, H) and let \(x^*\) denote any solution of SVI(X, H). Suppose Assumptions 2 and 9 hold. Furthermore, suppose \(X\) is a nonempty, closed, convex, and bounded set. Let \(\{\bar{x}_N\}\) be defined as in (46), where \(0 < \gamma_k = \gamma \le 1/L\) for all \(k \ge 0\). Then
\[
\mathbb{E}[G(\bar{x}_N)] = \mathcal{O}\!\left(\tfrac{1}{\sqrt{N}}\right).
\]
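The metric in Propositions 22–23 is the gap function \(G(\bar{x}_N) = \sup_{y\in X}\mathbb{E}[H(y)^T(\bar{x}_N - y)]\). A minimal sampling-based sketch of a lower estimate of this gap, assuming X is the Euclidean unit ball and H is an affine monotone map (both assumptions chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
A = M @ M.T                       # positive semidefinite, so H(y) = A y + b is monotone
b = rng.normal(size=n)

def H(y):
    return A @ y + b

def gap_lower_estimate(x_bar, num_samples=5000):
    # Sample points y of the unit ball and keep the largest value of H(y)^T (x_bar - y);
    # this only lower-bounds the supremum defining G(x_bar).
    best = 0.0
    for _ in range(num_samples):
        y = rng.normal(size=n)
        y /= max(1.0, np.linalg.norm(y))
        best = max(best, H(y) @ (x_bar - y))
    return best
```

Such an estimate can be used to monitor the decay of the gap along the averaged SEG iterates.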

93 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Convergence and Rate Statements I

Table: Convergence and Rate Statements

Smooth convex
Objective      Constraints  Convexity  Smoothness  a.s.  Rate          Oracle complexity  Metric
E[f(x,ω)]      x ∈ X        Strong     Y           Y     O(1/k)        O(1/ε)             E[‖x_k − x*‖²]
E[f(x,ω)]      x ∈ X        Convex     Y           Y     O(ln(k)/√k)   –                  E[f(x̄_K) − f(x*)]

Nonsmooth convex
Objective      Constraints  Convexity  Smoothness  a.s.  Rate          Oracle complexity  Metric
E[f(x,ω)]      x ∈ X        Strong     N           N     O(1/k)        O(1/ε)             E[‖x_k − x*‖²]
E[f(x,ω)]      x ∈ X        Convex     N           N     O(1/√K)       O(1/ε²)            E[f(x̄_K) − f(x*)]

Smooth stochastic and structured nonsmooth convex + variance reduction
Obj., Cons.         Convexity  Smooth.    a.s.  Rate      Oracle complexity  γ_k, N_k        Metric
E[f(x,ω)], x ∈ X    Strong     Y          Y     O(q^k)    O(1/ε)             γ, ⌈a q^{−k}⌉   E[‖x_k − x*‖²]
E[f(x,ω)], x ∈ X    Convex     Composite  Y     O(1/K²)   O(1/ε²)            γ, ⌊k^{3+δ}⌋    E[f(x̄_K) − f(x*)]

Nonsmooth (but smoothable) stochastic and structured nonsmooth convex + variance reduction
Obj., Cons.         Convexity  Smooth.    a.s.  Rate      Oracle complexity  γ_k, N_k        Metric
E[f(x,ω)], x ∈ X    Convex     Nonsmooth  Y     O(1/K)    O(1/ε^{2+δ})       γ, ⌊k^{1+δ}⌋    E[f(x̄_K) − f(x*)]

Smooth with expectation-valued constraints
Obj., Cons.                        Convexity  Smooth.    a.s.  Rate      Oracle complexity  Metric
E[f(x,ω)], E[c(x,ω)] ≤ 0, x ∈ X    Convex     Nonsmooth  Y     O(1/√K)   O(1/ε²)            E[G(x̄_K)]

94 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Strongly convex and smooth f I

Regularized logistic regression:

\[
\min_{\beta\in\mathbb{R}^d}\; h(\beta) \triangleq \frac{1}{n}\sum_{i=1}^{n}\log\!\big(1 + \exp(-y_i x_i^T\beta)\big) + \frac{\lambda_2}{2}\|\beta\|_2^2 + \lambda_1\|\beta\|_1, \qquad (61)
\]

where λ1 and λ2 are regularization parameters. We compare (VS-APM) with de facto standards such as Prox-SVRG [?] and Prox-SDCA [15] on the two datasets specified in the table below (a minimal sketch of the loss in (61) follows the table).

Data    n      d      source  λ1       λ2
sido0   12678  4932   [6]     10^{-4}  10^{-4}
rcv1    20242  47236  [11]    10^{-5}  10^{-4}

Table: Characteristics of the datasets
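For concreteness, a minimal sketch of the smooth part of (61) (logistic loss plus ridge term) with a mini-batch sampled gradient, and of the ℓ1 prox (soft-thresholding); this is an illustrative sketch, not the (VS-APM) implementation used for the experiments.

```python
import numpy as np

def smooth_grad_minibatch(beta, X, y, lam2, idx):
    # Sampled gradient of (1/n) sum_i log(1 + exp(-y_i x_i^T beta)) + (lam2/2)||beta||_2^2
    # over a mini-batch of row indices `idx`; X is n-by-d, y has entries in {-1, +1}.
    Xb, yb = X[idx], y[idx]
    margins = yb * (Xb @ beta)
    coeff = -yb / (1.0 + np.exp(margins))     # d/dm log(1 + exp(-m)) = -1/(1 + exp(m))
    return Xb.T @ coeff / len(idx) + lam2 * beta

def prox_l1(v, tau):
    # Proximal operator of tau * ||.||_1 (soft-thresholding); use tau = steplength * lam1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```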

95 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Comparison across algorithms I

Constant # of full gradients, λ2 = 10^{-4}:
Method     emp. err.   # full grad.  # prox.  CPU(s)
VS-APM     4.713e-6    10            721      89
prox-SVRG  2.1142e-5   10            126780   12651
prox-SDCA  6.9341e-9   –             126780   2386
SG         4.56e-2     –             126780   2438

Constant # of samples, λ2 = 10^{-7}:
Method     emp. err.   # full grad.  # prox.  CPU(s)
VS-APM     5.3595e-3   0             484      38
prox-SVRG  4.4812e-3   10            126780   12704
prox-SDCA  7.5541e-3   –             126780   2248
SG         7.7042e-3   –             126780   2320

Table: sido0: Constant # full grads and λ2 = 10^{-4} (top); constant # samples and λ2 = 10^{-7} (bottom)

96 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Comparison across algorithms II

Dataset  Method     E[f(y_k) − f*]  # of prox.  CPU(s)
sido0    VS-APM     2.374e-8        1015        180
         prox-SVRG  0.19e+0         4265        180
         prox-SDCA  4.580e-4        14480       180
         SG         2.09e-1         14540       180
rcv1     VS-APM     3.477e-8        1369        300
         prox-SVRG  1.4977e+0       2177        300
         prox-SDCA  1.239e-1        2162        300
         SG         7.878e-1        40750       300

Table: sido0, rcv1: Constant CPU time

97 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Comparison across algorithms III

           VS-APM            Jofre and Thompson [8]
κ          E[f(y_k) − f*]    E[f(x_k) − f*]
2.5e+1     2.591e-06         1.365e-06
2.5e+2     5.845e-06         8.175e-03
2.5e+3     2.294e-05         1.107e-01
2.5e+4     7.337e-04         1.200e-01
2.5e+5     1.698e-02         1.227e-01

Table: sido0: Constant budget, sensitivity to κ

98 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Comparison across algorithms IV

Figure: sido0: Constant time
Figure: rcv1: Constant time

99 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Comparison across algorithms V

Figure: sido0: sensitivity to κ

100 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Smoothed VS-APM I

In this setting, we compare the performance of our iterative smoothing scheme (sVS-APM) with a scheme in which the smoothing parameter is fixed, on the following stochastic utility problem:

\[
\min_{\|x\|_2\le 1}\; \mathbb{E}\!\left[\phi\!\left(\sum_{i=1}^{n}\Big(\tfrac{i}{n} + \xi_i\Big)x_i\right)\right], \qquad (62)
\]
where \(\phi(t) \triangleq \max_{1\le j\le m}(v_j + s_j t)\).
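A minimal sketch of a Monte Carlo evaluation of the objective in (62); the coefficients v, s and the Gaussian model for ξ are illustrative assumptions (they are not specified on this slide).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 10
v = rng.normal(size=m)                 # placeholder coefficients v_j
s = rng.normal(size=m)                 # placeholder slopes s_j

def phi(t):
    # phi(t) = max_j (v_j + s_j * t)
    return np.max(v + s * t)

def sampled_objective(x, num_samples=2000):
    # Monte Carlo estimate of E[ phi( sum_i (i/n + xi_i) x_i ) ] with xi ~ N(0, 1).
    i_over_n = np.arange(1, n + 1) / n
    vals = [phi((i_over_n + rng.normal(size=n)) @ x) for _ in range(num_samples)]
    return float(np.mean(vals))
```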

101 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Smoothed VS-APM II

              sVS-APM                     Fixed smoothing
n    m    µ_k      E[f(y_k) − f*]    µ        E[f(y_k) − f*]
20   10   1/k      1.832e-4          1/K      3.455e-3
          1/(2k)   3.014e-3          1/(2K)   2.157e-2
          1/(3k)   1.269e-2          1/(3K)   6.079e-2
100  25   1/k      1.944e-3          1/K      3.126e-2
          1/(2k)   1.181e-2          1/(2K)   5.130e-2
          1/(3k)   2.411e-2          1/(3K)   5.817e-2
200  10   1/k      1.067e-4          1/K      4.695e-3
          1/(2k)   5.173e-3          1/(2K)   3.957e-2
          1/(3k)   1.594e-2          1/(3K)   6.929e-2

102 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Smoothed VS-APM III

           s-APM                      Fixed smoothing
n     µ_k      E[f(y_k) − f*]    µ        E[f(y_k) − f*]
100   1/k      3.547e-4          1/K      4.693e-4
      1/(2k)   1.850e-4          1/(2K)   7.175e-4
      1/(4k)   9.129e-5          1/(4K)   1.690e-3
1000  1/k      2.447e-3          1/K      9.204e-3
      1/(2k)   2.457e-3          1/(2K)   2.751e-2
      1/(4k)   4.511e-3          1/(4K)   1.020e-1
2000  1/k      6.081e-3          1/K      3.613e-2
      1/(2k)   1.165e-2          1/(2K)   1.396e-1
      1/(3k)   2.829e-2          1/(3K)   1.020e-1

Table: Comparing (sVS-APM) with fixed smoothing; stochastic (L), deterministic (R)

103 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Smoothed VS-APM IV

Figure: (sVS-APM) vs fixed smoothing; n = 200
Figure: (s-APM) vs fixed smoothing; n = 1000

104 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Numerics: Smoothed VS-APM V

Consider the minimization of \(\|Ax - b\|_1 + \|x\|_1\), where \(A\in\mathbb{R}^{m\times n}\) and \(b\in\mathbb{R}^m\) are randomly generated from a standard normal distribution. We utilize the smooth approximation of \(f\) given by \(f_\mu(x) = \sum_{i=1}^{m} h_\mu^i(x)\), where \(h_\mu^i(x)\) is defined in (63) [3], and compute a prox on \(g(x) = \|x\|_1\).

\[
h_\mu^i(x) = \begin{cases} \dfrac{(A_i x - b_i)^2}{2\mu}, & |A_i x - b_i| \le \mu,\\[4pt] |A_i x - b_i|, & \text{otherwise}. \end{cases} \qquad (63)
\]
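A minimal sketch of evaluating the smoothed term f_µ and its gradient with placeholder data A, b, µ:

```python
import numpy as np

def f_mu_and_grad(x, A, b, mu):
    # Smoothed residual terms h_mu^i per (63): quadratic inside the band |r_i| <= mu,
    # absolute value outside. (The classical Huber smoothing from [3] also subtracts
    # mu/2 on the outer branch; here the expression is coded exactly as displayed.)
    r = A @ x - b
    inner = np.abs(r) <= mu
    vals = np.where(inner, r**2 / (2.0 * mu), np.abs(r))
    dvals = np.where(inner, r / mu, np.sign(r))   # derivative of each h_mu^i w.r.t. r_i
    return vals.sum(), A.T @ dvals                # f_mu(x) and its gradient
```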

105 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

Areas not covered

1 Optimality of rates
2 Mirror-prox schemes, primal-dual schemes, randomized block schemes
3 Other problem classes: (i) stochastic saddle-point problems; (ii) stochastic variational inequality problems; (iii) stochastic Nash games.

106 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

[1] A. Nemirovsky and D. Yudin, Problem complexity and method efficiency in optimization, 1983.
[2] A. Beck, First-Order Methods in Optimization, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017.
[3] A. Beck and M. Teboulle, Smoothing and first order methods: A unified framework, SIAM Journal on Optimization, 22 (2012), pp. 557–580.
[4] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press, 2008.
[5] F. Facchinei and J.-S. Pang, Finite Dimensional Variational Inequalities and Complementarity Problems: Vols I and II, Springer-Verlag, NY, Inc., 2003.
[6] I. Guyon, Sido: A pharmacology dataset, URL: http://www.causality.inf.ethz.ch/data/SIDO.html, 2008.
[7] A. Jalilzadeh, U. V. Shanbhag, J. H. Blanchet, and P. W. Glynn, Optimal smoothed variable sample-size accelerated proximal methods for structured nonsmooth stochastic convex programs, arXiv:1803.00718, 2018.
[8] A. Jofré and P. Thompson, On variance reduction for stochastic smooth convex optimization with multiplicative noise, arXiv:1705.02969, 2017.

106 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

[9] H. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, Springer, 2003.
[10] J. Lei and U. V. Shanbhag, Asynchronous schemes for stochastic and misspecified potential games and nonconvex optimization, https://arxiv.org/abs/1711.03963.
[11] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research, 5 (2004), pp. 361–397.
[12] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.
[13] B. Polyak, Introduction to Optimization, Optimization Software, Inc., New York, 1987.
[14] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400–407.
[15] S. Shalev-Shwartz and T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in International Conference on Machine Learning, 2014, pp. 64–72.

106 / 106 A Tutorial on Stochastic Approximation Stochastic constraints

[16] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory, vol. 9 of MPS/SIAM Series on Optimization, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2009.
[17] I. J. Wang and J. C. Spall, Stochastic optimization with inequality constraints using simultaneous perturbations and penalty functions, in Proc. IEEE Conf. on Decision and Control, Maui, HI, 2003, pp. 3808–3813.
[18] F. Yousefian, A. Nedić, and U. V. Shanbhag, On stochastic gradient and subgradient methods with adaptive steplength sequences, Automatica, 48 (2012), pp. 56–67.

106 / 106