ORIE 6326:

Subgradient Method

Professor Udell, Operations Research and Information Engineering, Cornell

April 25, 2017

Some slides adapted from Stanford EE364b

Subgradient

recall: the subgradient generalizes the gradient for nonsmooth functions

for f : Rⁿ → R, we say g ∈ Rⁿ is a subgradient of f at x ∈ dom(f) if

f(y) ≥ f(x) + gᵀ(y − x)   for all y ∈ dom(f)

this is just the first-order condition for convexity

notation:
◮ ∂f(x) denotes the set of all subgradients of f at x
◮ ∇̃f(x) denotes a particular choice of an element of ∂f(x)
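for concreteness, here is a minimal Python sketch of one such choice ∇̃f(x) for the example f(x) = ‖x‖₁ (the sign vector; any value in [−1, 1] would also do at zero coordinates):

```python
import numpy as np

def subgrad_l1(x):
    """One particular subgradient of f(x) = ||x||_1: the sign vector.
    At coordinates where x_i = 0, any value in [-1, 1] works; sign(0) = 0 is one valid choice."""
    return np.sign(x)
```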

Outline

Subgradients and optimality

Subgradients and descent

Subgradient method

Analysis of SGM

Examples

Polyak’s optimal step size

Subgradients and sublevel sets

g is a subgradient at x means f(y) ≥ f(x) + gᵀ(y − x)

hence f(y) ≤ f(x) ⟹ gᵀ(y − x) ≤ 0

[figure: sublevel set {x | f(x) ≤ f(x0)} with a subgradient g ∈ ∂f(x0) at x0 and the gradient ∇f(x1) at a point x1 where f is differentiable]

◮ f differentiable at x0: ∇f(x0) is normal to the sublevel set {x | f(x) ≤ f(x0)}
◮ f nondifferentiable at x0: a subgradient defines a supporting hyperplane to the sublevel set through x0


Optimality conditions — unconstrained

recall for f convex, differentiable,

f(x⋆) = inf_x f(x)  ⟺  0 = ∇f(x⋆)

generalization to nondifferentiable convex f:

f(x⋆) = inf_x f(x)  ⟺  0 ∈ ∂f(x⋆)

proof. by definition (!),

0 ∈ ∂f(x⋆)  ⟺  f(y) ≥ f(x⋆) + 0ᵀ(y − x⋆) for all y

...seems trivial but isn't

Example: piecewise linear minimization

f(x) = max_{i=1,...,m} (a_iᵀx + b_i)

x⋆ minimizes f  ⟺  0 ∈ ∂f(x⋆) = conv{a_i | a_iᵀx⋆ + b_i = f(x⋆)}

⟺ there is a λ with

λ ⪰ 0,   1ᵀλ = 1,   ∑_{i=1}^m λ_i a_i = 0

where λ_i = 0 if a_iᵀx⋆ + b_i < f(x⋆)

...but these are the KKT conditions for the epigraph form

minimize    t
subject to  a_iᵀx + b_i ≤ t,  i = 1, ..., m

with dual

maximize    bᵀλ
subject to  λ ⪰ 0,  Aᵀλ = 0,  1ᵀλ = 1

Optimality conditions — constrained

minimize    f_0(x)
subject to  f_i(x) ≤ 0,  i = 1, ..., m

we assume
◮ f_i convex, defined on Rⁿ (hence subdifferentiable)
◮ strict feasibility (Slater's condition)

x⋆ is primal optimal (λ⋆ is dual optimal) iff

f_i(x⋆) ≤ 0,   λ_i⋆ ≥ 0
0 ∈ ∂f_0(x⋆) + ∑_{i=1}^m λ_i⋆ ∂f_i(x⋆)
λ_i⋆ f_i(x⋆) = 0

...generalizes KKT for nondifferentiable f_i

Subgradients and descent

Directional derivative

directional derivative of f at x in the direction δx is

f′(x; δx) ≜ lim_{h ↘ 0} (f(x + hδx) − f(x)) / h

(can be +∞ or −∞)

◮ f convex, finite near x ⟹ f′(x; δx) exists

◮ f differentiable at x iff, for some g (= ∇f(x)) and all δx,

f′(x; δx) = gᵀδx

(i.e., f′(x; δx) is a linear function of δx)

Directional derivative and subdifferential

general formula for convex f:   f′(x; δx) = sup{ gᵀδx | g ∈ ∂f(x) }

[figure: the subdifferential ∂f(x) and a direction δx]


Descent directions

δx is a descent direction for f at x if f′(x; δx) < 0

for differentiable f, δx = −∇f(x) is always a descent direction (except when it is zero)

warning: for nondifferentiable (convex) functions, −∇̃f(x) need not be a descent direction

example: f(x) = |x1| + 2|x2|

[figure: level curves of f in the (x1, x2) plane and a subgradient g]


Subgradients and distance to sublevel sets

if f is convex, f(z) < f(x), g ∈ ∂f(x), then for small enough t > 0,

‖x − tg − z‖₂ < ‖x − z‖₂

thus −g is a descent direction for ‖x − z‖₂, for any z with f(z) < f(x) (e.g., x⋆)

negative subgradient is descent direction for distance to optimal point

proof:

‖x − tg − z‖₂² = ‖x − z‖₂² − 2t gᵀ(x − z) + t²‖g‖₂²
              ≤ ‖x − z‖₂² − 2t (f(x) − f(z)) + t²‖g‖₂²

which is less than ‖x − z‖₂² whenever 0 < t < 2(f(x) − f(z))/‖g‖₂²

Descent directions and optimality

fact: for f convex, finite near x, either
◮ 0 ∈ ∂f(x) (in which case x minimizes f), or
◮ there is a descent direction for f at x

i.e., x is optimal (minimizes f) iff there's no descent direction for f at x

proof: define δx_sd = −argmin_{z ∈ ∂f(x)} ‖z‖₂

if δx_sd = 0, then 0 ∈ ∂f(x), so x is optimal; otherwise f′(x; δx_sd) = −inf_{z ∈ ∂f(x)} ‖z‖₂² < 0, so δx_sd is a descent direction

[figure: the subdifferential ∂f(x) with δx_sd the negative of its minimum-norm element]


Subgradient method

the subgradient method is a simple algorithm to minimize a nondifferentiable convex function f

x^(k+1) = x^(k) − α_k g^(k)

◮ x^(k) is the kth iterate
◮ g^(k) is any subgradient of f at x^(k)
◮ α_k > 0 is the kth step size

warning: the subgradient method is not a descent method, so do not call it subgradient descent [Steven Wright].

instead, keep track of the best point so far

f_best^(k) = min_{i=1,...,k} f(x^(i))
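a minimal Python sketch of the update with best-point tracking; the objective `f`, subgradient oracle `subgrad`, starting point `x0`, and step-size schedule `alphas(k)` are placeholders supplied by the user:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, alphas, iters=1000):
    """Run the subgradient method, tracking the best point seen so far."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(iters):
        g = subgrad(x)                 # any g in the subdifferential at x
        x = x - alphas(k) * g          # x^(k+1) = x^(k) - alpha_k * g^(k)
        fx = f(x)
        if fx < f_best:                # not a descent method: keep the best iterate
            f_best, x_best = fx, x.copy()
    return x_best, f_best
```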


Line search + subgradients = danger

warning: with exact line search, the subgradient method can converge to a suboptimal point!

example: f(x) = |x1| + 2|x2|

[figure: level curves of f in the (x1, x2) plane and a subgradient g]

Step size rules

step sizes are fixed ahead of time (can't do line search)

◮ constant step size: α_k = α (constant)
◮ constant step length: α_k = γ/‖g^(k)‖₂ (so ‖x^(k+1) − x^(k)‖₂ = γ)
◮ square summable but not summable: step sizes satisfy

  ∑_{k=1}^∞ α_k² < ∞,   ∑_{k=1}^∞ α_k = ∞

  e.g., α_k = 1/k
◮ nonsummable diminishing: step sizes satisfy

  lim_{k→∞} α_k = 0,   ∑_{k=1}^∞ α_k = ∞

  e.g., α_k = 1/√k
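for concreteness, the fixed schedules can be passed to the `subgradient_method` sketch above as functions of the iteration counter k; the constants here are illustrative placeholders:

```python
import numpy as np

alpha, gamma = 0.1, 0.05   # illustrative constants; problem-dependent in practice

constant_step   = lambda k: alpha                  # alpha_k = alpha
square_summable = lambda k: 1.0 / (k + 1)          # alpha_k = 1/k (square summable, not summable)
diminishing     = lambda k: 1.0 / np.sqrt(k + 1)   # alpha_k = 1/sqrt(k) (nonsummable diminishing)

# constant step length alpha_k = gamma / ||g^(k)||_2 depends on the current
# subgradient, so it would be computed inside the loop rather than from k alone.
```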

Analysis of SGM


Lipschitz functions

a function f : Rⁿ → R is L-Lipschitz continuous if for every x, y ∈ dom f,

|f(x) − f(y)| ≤ L ‖x − y‖

(often, we simply say f is L-Lipschitz)

notice for any g ∈ ∂f(x),

f(x) − f(y) ≤ gᵀ(x − y) ≤ ‖g‖ ‖x − y‖

so the Lipschitz constant for f satisfies

L ≤ sup_{x ∈ dom f} sup_{g ∈ ∂f(x)} ‖g‖


Lipschitz functions

examples:
◮ |x|: L = 1
◮ max(x, 0): L = 1
◮ ‖x‖₁: L = √n (subgradients are sign vectors)
◮ ½‖x‖₂² for ‖x‖₂ ≤ R: L = R
◮ λ_max(X): L = 1 (subgradients are outer products qqᵀ of unit eigenvectors q)
◮ indicator of a convex set C: not Lipschitz continuous

Assumptions

◮ f is bounded below: p⋆ = inf_x f(x) > −∞
◮ f attains its minimum at x⋆: f(x⋆) = p⋆
◮ f is L-Lipschitz, so ‖g‖₂ ≤ L for all g ∈ ∂f(x), for all x
◮ ‖x^(1) − x⋆‖₂ ≤ R

these assumptions are stronger than needed, just to simplify proofs

Convergence results

define
◮ averaged iterate x̄^(k) = (1/k) ∑_{i=1}^k x^(i)
◮ averaged value f̄^(k) = f(x̄^(k))
◮ f̄ = lim_{k→∞} f̄^(k)
◮ f_best = lim_{k→∞} f_best^(k)

we'll show f̄^(k) → f̄, which is close to p⋆.
◮ notice f_best^(k) ≤ f̄^(k)
◮ advantage of averaging: no need to evaluate f!

◮ constant step size: f̄ − p⋆ ≤ L²α/2, i.e., converges to L²α/2-suboptimal (converges to p⋆ if f differentiable, α small enough)

◮ constant step length: f_best − p⋆ ≤ Lγ/2, i.e., converges to Lγ/2-suboptimal

◮ diminishing step size rule: f_best = p⋆, i.e., converges

Subgradient method decreases distance to optimal set

key quantity: Euclidean distance to the optimal set, not the function value

let x⋆ be any minimizer of f

‖x^(k+1) − x⋆‖₂² = ‖x^(k) − α_k g^(k) − x⋆‖₂²
                = ‖x^(k) − x⋆‖₂² − 2α_k g^(k)ᵀ(x^(k) − x⋆) + α_k² ‖g^(k)‖₂²
                ≤ ‖x^(k) − x⋆‖₂² − 2α_k (f(x^(k)) − p⋆) + α_k² ‖g^(k)‖₂²

using p⋆ = f(x⋆) ≥ f(x^(k)) + g^(k)ᵀ(x⋆ − x^(k))

apply recursively to get

‖x^(k+1) − x⋆‖₂² ≤ ‖x^(1) − x⋆‖₂² − 2 ∑_{i=1}^k α_i (f(x^(i)) − p⋆) + ∑_{i=1}^k α_i² ‖g^(i)‖₂²
                ≤ R² − 2 ∑_{i=1}^k α_i (f(x^(i)) − p⋆) + L² ∑_{i=1}^k α_i²

Subgradient method for constant step size

for constant step size α_k = α, we can use

∑_{i=1}^k α (f(x^(i)) − p⋆) = α ∑_{i=1}^k f(x^(i)) − αkp⋆
                            ≥ αk f(x̄^(k)) − αkp⋆ = αk (f̄^(k) − p⋆)

(where the inequality is Jensen's) to get

f̄^(k) − p⋆ ≤ R²/(2kα) + L²α/2.

◮ righthand side converges to L²α/2 as k → ∞
◮ for fixed k, choose α = R/(L√k) to minimize the bound:

  f̄^(k) − p⋆ ≤ RL/(2√k) + RL/(2√k) = RL/√k.
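to see why this choice of α minimizes the bound, set the derivative with respect to α to zero:

d/dα [ R²/(2kα) + L²α/2 ] = −R²/(2kα²) + L²/2 = 0  ⟹  α = R/(L√k),

and substituting back gives R²/(2kα) = L²α/2 = RL/(2√k), i.e., the bound RL/√k.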

Subgradient method for varying step size

by recursive application of the subgradient inequality, we had

‖x^(k+1) − x⋆‖₂² ≤ ‖x^(1) − x⋆‖₂² − 2 ∑_{i=1}^k α_i (f(x^(i)) − p⋆) + ∑_{i=1}^k α_i² ‖g^(i)‖₂²
                ≤ R² − 2 ∑_{i=1}^k α_i (f(x^(i)) − p⋆) + L² ∑_{i=1}^k α_i²

so for changing step size, use

∑_{i=1}^k α_i (f(x^(i)) − p⋆) ≥ (f_best^(k) − p⋆) (∑_{i=1}^k α_i)

to get

f_best^(k) − p⋆ ≤ (R² + L² ∑_{i=1}^k α_i²) / (2 ∑_{i=1}^k α_i)

Subgradient method for varying step size

constant step length: for α_k = γ/‖g^(k)‖₂ we get

f_best^(k) − p⋆ ≤ (R² + ∑_{i=1}^k α_i² ‖g^(i)‖₂²) / (2 ∑_{i=1}^k α_i) ≤ (R² + γ²k) / (2γk/L),

righthand side converges to Lγ/2 as k → ∞

square summable but not summable step sizes: suppose step sizes satisfy

∑_{k=1}^∞ α_k² < ∞,   ∑_{k=1}^∞ α_k = ∞

then

f_best^(k) − p⋆ ≤ (R² + L² ∑_{i=1}^k α_i²) / (2 ∑_{i=1}^k α_i)

as k → ∞, the numerator converges to a finite number and the denominator converges to ∞, so f_best^(k) → p⋆

Stopping criterion

◮ terminating when (R² + L² ∑_{i=1}^k α_i²) / (2 ∑_{i=1}^k α_i) ≤ ε is really, really, slow

◮ we saw that if α_i = α = R/(L√k), then after k steps the bound is RL/√k, so for accuracy ε we need

  k ≥ R²L²/ε²

  iterations

◮ the truth: there’s no good stopping criterion for the subgradient method ...

Examples

Example: Piecewise linear minimization

minimize f(x) = max_{i=1,...,m} (a_iᵀx + b_i)

to find a subgradient of f: find an index j for which

a_jᵀx + b_j = max_{i=1,...,m} (a_iᵀx + b_i)

and take g = a_j

subgradient method: x^(k+1) = x^(k) − α_k a_j
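a minimal Python sketch of this subgradient oracle, with the data stacked as a matrix A (rows a_iᵀ) and a vector b; the names are illustrative:

```python
import numpy as np

def pwl_value_and_subgrad(A, b, x):
    """f(x) = max_i (a_i^T x + b_i); a subgradient is the row a_j achieving the max."""
    vals = A @ x + b
    j = int(np.argmax(vals))     # an active index (ties: any choice works)
    return vals[j], A[j]         # f(x) and g = a_j
```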

problem instance with n = 20 variables, m = 100 terms, p⋆ ≈ 1.1

f_best^(k) − p⋆, constant step length γ = 0.05, 0.01, 0.005

[figure: f_best^(k) − p⋆ vs. iteration k (log scale, k = 1 to 3000) for the three step lengths]

diminishing step rules α_k = 0.1/√k and α_k = 1/√k, square summable step size rules α_k = 1/k and α_k = 10/k

[figure: f_best^(k) − p⋆ vs. iteration k (log scale, k = 0 to 3000) for the four step size rules]

How to avoid slow convergence

don't use the subgradient method when you want very high accuracy! instead,
◮ for highest accuracy: rewrite problem as LP or SDP; use IPM
◮ for medium accuracy:
  ◮ regularize your objective (so it's strongly convex)

    f̃(x) = f(x) + α‖x − x_0‖²

  ◮ smooth your objective (so it's smooth)

    f̃(x) = E_{y: ‖y − x‖ ≤ δ} [f(y)]

  ◮ more on these later...
◮ for low accuracy: use a constant step size; terminate when you stop improving much or get bored

Polyak's optimal step size

Optimal step size when p⋆ is known

◮ choice due to Polyak:

  α_k = (f(x^(k)) − p⋆) / ‖g^(k)‖₂²

  (can also use when optimal value is estimated)

◮ motivation: start with basic inequality

  ‖x^(k+1) − x⋆‖₂² ≤ ‖x^(k) − x⋆‖₂² − 2α_k (f(x^(k)) − p⋆) + α_k² ‖g^(k)‖₂²

  and choose α_k to minimize the righthand side
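minimizing the righthand side over α_k is a one-variable quadratic problem:

d/dα_k [ −2α_k (f(x^(k)) − p⋆) + α_k² ‖g^(k)‖₂² ] = 0  ⟹  α_k = (f(x^(k)) − p⋆) / ‖g^(k)‖₂²,

and substituting this α_k back in gives ‖x^(k+1) − x⋆‖₂² ≤ ‖x^(k) − x⋆‖₂² − (f(x^(k)) − p⋆)² / ‖g^(k)‖₂², the inequality used below.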

◮ yields

  ‖x^(k+1) − x⋆‖₂² ≤ ‖x^(k) − x⋆‖₂² − (f(x^(k)) − p⋆)² / ‖g^(k)‖₂²

  (in particular, ‖x^(k) − x⋆‖₂ decreases each step)

◮ applying recursively,

  ∑_{i=1}^k (f(x^(i)) − p⋆)² / ‖g^(i)‖₂² ≤ R²

  and so, using Jensen twice,

  ((1/k) ∑_{i=1}^k (f(x^(i)) − p⋆))² ≤ (1/k) ∑_{i=1}^k (f(x^(i)) − p⋆)² ≤ R²L²/k
  (1/k) ∑_{i=1}^k (f(x^(i)) − p⋆) ≤ RL/√k
  f(x̄^(k)) − p⋆ ≤ RL/√k

PWL example with Polyak's step size, α_k = 0.1/√k, α_k = 1/k

[figure: f_best^(k) − p⋆ vs. iteration k (log scale, k = 0 to 3000) for the Polyak step size, α_k = 0.1/√k, and α_k = 1/k]

Finding a point in the intersection of convex sets

C = C_1 ∩ ··· ∩ C_m is nonempty, C_1, ..., C_m ⊆ Rⁿ closed and convex

find a point in C by minimizing

f(x) = max{ dist(x, C_1), ..., dist(x, C_m) }

with dist(x, C_j) = f(x), a subgradient of f is

g = ∇ dist(x, C_j) = (x − P_{C_j}(x)) / ‖x − P_{C_j}(x)‖₂

subgradient update with optimal step size:

x^(k+1) = x^(k) − α_k g^(k)
        = x^(k) − f(x^(k)) (x^(k) − P_{C_j}(x^(k))) / ‖x^(k) − P_{C_j}(x^(k))‖₂
        = P_{C_j}(x^(k))

◮ a version of the famous alternating projections algorithm
◮ at each step, project the current point onto the farthest set
◮ for m = 2 sets, projections alternate onto one set, then the other
◮ convergence: dist(x^(k), C) → 0 as k → ∞
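a minimal Python sketch of this scheme, assuming each set is given by a projection operator supplied as a callable in the list `projections`:

```python
import numpy as np

def alternating_projections(projections, x0, iters=100):
    """Find a point in the intersection by repeatedly projecting onto the farthest set."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # distance to each set, computed via its projection
        dists = [np.linalg.norm(x - P(x)) for P in projections]
        j = int(np.argmax(dists))     # farthest set
        if dists[j] == 0:             # already in every set
            break
        x = projections[j](x)         # subgradient step with Polyak step size = projection
    return x
```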

Alternating projections

first few iterations:

[figure: iterates x^(1), x^(2), x^(3), x^(4) alternating between the sets C_1 and C_2 and approaching x⋆]

... x^(k) eventually converges to a point x⋆ ∈ C_1 ∩ C_2

Example: Positive semidefinite matrix completion

◮ some entries of a matrix in Sⁿ are fixed
◮ find values for the others so the completed matrix is PSD
◮ C_1 = Sⁿ₊, C_2 is the (affine) set in Sⁿ with the specified fixed entries
◮ projection onto C_1 by eigenvalue decomposition and truncation: for X = ∑_{i=1}^n λ_i q_i q_iᵀ,

  P_{C_1}(X) = ∑_{i=1}^n max{0, λ_i} q_i q_iᵀ

◮ projection of X onto C_2 by setting the specified entries to their fixed values
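minimal sketches of the two projections in Python; `mask` (a boolean array marking the fixed entries) and `fixed_values` are illustrative names, not from the original deck:

```python
import numpy as np

def project_psd(X):
    """Project a symmetric matrix onto the PSD cone by truncating negative eigenvalues."""
    lam, Q = np.linalg.eigh(X)               # X = sum_i lam_i q_i q_i^T
    return (Q * np.maximum(lam, 0)) @ Q.T    # sum_i max(0, lam_i) q_i q_i^T

def project_affine(X, mask, fixed_values):
    """Reset the known entries (where mask is True) to their fixed values."""
    Y = X.copy()
    Y[mask] = fixed_values[mask]
    return Y
```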

specific example: 50 × 50 matrix missing about half of its entries

◮ initialize X^(1) with unknown entries set to 0

convergence is linear:

[figure: ‖X^(k+1) − X^(k)‖_F vs. iteration k (log scale, k = 0 to 100), decreasing linearly on the log scale]

Polyak step size when p⋆ isn't known

◮ use step size

  α_k = (f(x^(k)) − f_best^(k) + γ_k) / ‖g^(k)‖₂²

  with ∑_{k=1}^∞ γ_k = ∞, ∑_{k=1}^∞ γ_k² < ∞

◮ f_best^(k) − γ_k serves as an estimate of p⋆
◮ γ_k is in the scale of the objective value

◮ can show f_best^(k) → p⋆
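a minimal Python sketch of this step size, using γ_k = 10/(10 + k) (the sequence used in the example that follows); `f_x`, `f_best`, and `g` are the current value, best value so far, and current subgradient:

```python
import numpy as np

def polyak_estimated_step(f_x, f_best, g, k):
    """Polyak-style step size using f_best - gamma_k as a surrogate for p*."""
    gamma_k = 10.0 / (10.0 + k)          # nonsummable, square-summable sequence
    return (f_x - f_best + gamma_k) / np.linalg.norm(g) ** 2
```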

PWL example with Polyak's step size, using p⋆, and estimated with γ_k = 10/(10 + k)

[figure: f_best^(k) − p⋆ vs. iteration k (log scale, k = 0 to 3000) for the optimal and estimated Polyak step sizes]
