
L. Vandenberghe ECE236C (Spring 2020) 2. Subgradients

• definition

• subgradient calculus

• duality and optimality conditions

• directional derivative

Subgradients 2.1

Basic inequality

recall the basic inequality for differentiable convex functions:

f (y) ≥ f (x) + ∇ f (x)T(y − x) for all y ∈ dom f

[figure: graph of f with the supporting hyperplane at (x, f (x)); the hyperplane has normal direction (∇ f (x), −1)]

• the first-order approximation of f at x is a global lower bound

• ∇ f (x) defines a non-vertical supporting hyperplane to epi f at (x, f (x)):

(∇ f (x), −1)T ((y, t) − (x, f (x))) ≤ 0 for all (y, t) ∈ epi f

Subgradients 2.2

Subgradient

g is a subgradient of a convex function f at x ∈ dom f if

f (y) ≥ f (x) + gT(y − x) for all y ∈ dom f

[figure: graph of f (y) with the affine lower bounds f (x1) + g1T(y − x1), f (x1) + g2T(y − x1), and f (x2) + g3T(y − x2)]

g1, g2 are subgradients at x1; g3 is a subgradient at x2

Subgradients 2.3

Subdifferential

the subdifferential ∂ f (x) of f at x is the set of all subgradients:

∂ f (x) = {g | gT(y − x) ≤ f (y) − f (x), ∀y ∈ dom f }

Properties

• ∂ f (x) is a closed convex set (possibly empty)

this follows from the definition: ∂ f (x) is an intersection of halfspaces

• if x ∈ int dom f then ∂ f (x) is nonempty and bounded

proof on next two pages

Subgradients 2.4

Proof: we show that ∂ f (x) is nonempty when x ∈ int dom f

• (x, f (x)) is in the boundary of the convex set epi f

• therefore there exists a supporting hyperplane to epi f at (x, f (x)):

∃(a, b) ≠ 0:   (a, b)T ((y, t) − (x, f (x))) ≤ 0 for all (y, t) ∈ epi f

• b > 0 gives a contradiction as t → ∞

• b = 0 gives a contradiction for y = x + εa with small ε > 0

• therefore b < 0, and g = (1/|b|) a is a subgradient of f at x

Subgradients 2.5

Proof: ∂ f (x) is bounded when x ∈ int dom f

• for small r > 0, define a set of 2n points

B = {x ± rek | k = 1,..., n} ⊂ dom f

and define M = max_{y∈B} f (y) < ∞

• for every g ∈ ∂ f (x), there is a point y ∈ B with

r‖g‖∞ = gT(y − x)

(choose an index k with |gk| = ‖g‖∞, and take y = x + r sign(gk)ek)

• since g is a subgradient, this implies that

f (x) + r‖g‖∞ = f (x) + gT(y − x) ≤ f (y) ≤ M

• we conclude that ∂ f (x) is bounded:

‖g‖∞ ≤ (M − f (x)) / r for all g ∈ ∂ f (x)

Subgradients 2.6

Example

f (x) = max { f1(x), f2(x)} with f1, f2 convex and differentiable

[figure: graphs of f1(y), f2(y), and their pointwise maximum f (y)]

• if f1(xˆ) = f2(xˆ), subdifferential at xˆ is line segment [∇ f1(xˆ), ∇ f2(xˆ)]

• if f1(xˆ) > f2(xˆ), subdifferential at xˆ is {∇ f1(xˆ)}

• if f1(xˆ) < f2(xˆ), subdifferential at xˆ is {∇ f2(xˆ)}

Subgradients 2.7

Examples

Absolute value f (x) = |x|

[figure: graph of f (x) = |x| and of its subdifferential ∂ f (x), equal to {−1} for x < 0, the interval [−1, 1] at x = 0, and {1} for x > 0]

Euclidean norm f (x) = ‖x‖2

∂ f (x) = {x/‖x‖2} if x ≠ 0,   ∂ f (x) = {g | ‖g‖2 ≤ 1} if x = 0
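These two subdifferentials translate directly into code. A minimal Python/NumPy sketch (function names are my own) that returns one valid subgradient in each case:

```python
import numpy as np

def subgrad_abs(x):
    # a subgradient of f(x) = |x| for scalar x: sign(x) for x != 0,
    # any value in [-1, 1] at x = 0 (0.0 is one valid choice)
    return np.sign(x) if x != 0 else 0.0

def subgrad_norm2(x):
    # a subgradient of f(x) = ||x||_2: x / ||x||_2 for x != 0,
    # any g with ||g||_2 <= 1 at x = 0 (g = 0 is one valid choice)
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else np.zeros_like(x)
```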

Subgradients 2.8

Monotonicity

the subdifferential of a convex function is a monotone operator:

(u − v)T(x − y) ≥ 0 for all x, y, u ∈ ∂ f (x), v ∈ ∂ f (y)

Proof: by definition

f (y) ≥ f (x) + uT(y − x),   f (x) ≥ f (y) + vT(x − y)

combining the two inequalities shows monotonicity

Subgradients 2.9

Examples of non-subdifferentiable functions

the following functions are not subdifferentiable at x = 0

• f : R → R, dom f = R+

f (x) = 1 if x = 0, f (x) = 0 if x > 0

• f : R → R, dom f = R+

f (x) = −√x

the only supporting hyperplane to epi f at (0, f (0)) is vertical

Subgradients 2.10

Subgradients and sublevel sets

if g is a subgradient of f at x, then

f (y) ≤ f (x) =⇒ gT(y − x) ≤ 0

[figure: sublevel set {y | f (y) ≤ f (x)} with the subgradient g drawn as an outward normal at x]

the nonzero subgradients at x define supporting hyperplanes to the sublevel set

{y | f (y) ≤ f (x)}

Subgradients 2.11

Outline

• definition

• subgradient calculus

• duality and optimality conditions

• directional derivative

Subgradient calculus

Weak subgradient calculus: rules for finding one subgradient

• sufficient for most nondifferentiable optimization algorithms

• if you can evaluate f (x), you can usually compute a subgradient

Strong subgradient calculus: rules for finding ∂ f (x) (all subgradients)

• some algorithms, optimality conditions, etc., need entire subdifferential

• can be quite complicated

we will assume that x ∈ int dom f

Subgradients 2.12

Basic rules

Differentiable functions: ∂ f (x) = {∇ f (x)} if f is differentiable at x

Nonnegative linear combination: if f (x) = α1 f1(x) + α2 f2(x) with α1, α2 ≥ 0, then

∂ f (x) = α1∂ f1(x) + α2∂ f2(x)

(right-hand side is addition of sets)

Affine transformation of variables: if f (x) = h(Ax + b), then

∂ f (x) = AT ∂h(Ax + b)

Subgradients 2.13

Pointwise maximum

f (x) = max { f1(x), ..., fm(x)}

define I(x) = {i | fi(x) = f (x)}, the ‘active’ functions at x

Weak result: to compute a subgradient at x, choose any k ∈ I(x) and any subgradient of fk at x

Strong result

∂ f (x) = conv ( ∪_{i∈I(x)} ∂ fi(x) )

• the convex hull of the union of subdifferentials of the ‘active’ functions at x

• if fi’s are differentiable, ∂ f (x) = conv {∇ fi(x) | i ∈ I(x)}

Subgradients 2.14

Example: piecewise-linear function

f (x) = max_{i=1,...,m} (aiT x + bi)

[figure: piecewise-linear f (x) drawn as the maximum of the affine functions aiT x + bi]

the subdifferential at x is a polyhedron

∂ f (x) = conv {ai | i ∈ I(x)}

with I(x) = {i | aiT x + bi = f (x)}
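The weak pointwise-maximum rule makes this example computational: evaluating f already identifies an active index, and the corresponding ai is a subgradient. A minimal NumPy sketch (function and variable names are mine):

```python
import numpy as np

def pwl_value_and_subgrad(A, b, x):
    """f(x) = max_i (a_i^T x + b_i); returns f(x) and one subgradient."""
    vals = A @ x + b          # all affine pieces evaluated at x
    k = int(np.argmax(vals))  # any maximizing index is 'active'
    return vals[k], A[k]      # a_k is a subgradient by the weak rule
```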

Subgradients 2.15

Example: ℓ1-norm

f (x) = ‖x‖1 = max_{s∈{−1,1}^n} sT x

the subdifferential is a product of intervals

∂ f (x) = J1 × · · · × Jn,   Jk = {1} if xk > 0,   Jk = [−1, 1] if xk = 0,   Jk = {−1} if xk < 0

[figure: the three subdifferentials below, drawn as subsets of R2]

∂ f (0,0) = [−1,1] × [−1,1],   ∂ f (1,0) = {1} × [−1,1],   ∂ f (1,1) = {(1,1)}
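One subgradient is obtained by picking any point of each interval Jk; sign(xk), with sign(0) = 0, is the usual choice. A one-line sketch assuming x is a NumPy array (the function name is mine):

```python
import numpy as np

def subgrad_l1(x):
    # g_k = sign(x_k) picks {1}, {-1}, or the point 0 from [-1, 1];
    # any other value in [-1, 1] would also be valid where x_k == 0
    return np.sign(x)
```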

Subgradients 2.16

Pointwise supremum

f (x) = sup_{α∈A} fα(x),   fα(x) convex in x for every α

Weak result: to find a subgradient at xˆ,

• find any β for which f (xˆ) = fβ(xˆ) (assuming maximum is attained)

• choose any g ∈ ∂ fβ(xˆ)

(Partial) strong result: define I(x) = {α ∈ A | fα(x) = f (x)}

conv ( ∪_{α∈I(x)} ∂ fα(x) ) ⊆ ∂ f (x)

equality requires extra conditions (for example, A compact, fα continuous in α)

Subgradients 2.17

Exercise: maximum eigenvalue

Problem: explain how to find a subgradient of

f (x) = λmax(A(x)) = sup_{‖y‖2=1} yT A(x) y

where A(x) = A0 + x1 A1 + · · · + xn An with symmetric coefficients Ai

Solution: to find a subgradient at xˆ,

• choose any unit eigenvector y with eigenvalue λmax(A(xˆ))

• the gradient of yT A(x)y at xˆ is a subgradient of f :

(yT A1 y, ..., yT An y) ∈ ∂ f (xˆ)
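A sketch of this recipe in NumPy, assuming the coefficients are given as a matrix A0 and a list A_list of symmetric matrices (these names are mine). numpy.linalg.eigh returns eigenvalues in ascending order, so the last eigenvector corresponds to λmax:

```python
import numpy as np

def lambda_max_subgrad(A0, A_list, x):
    """f(x) = lambda_max(A0 + sum_i x_i A_i); returns f(x) and one subgradient."""
    A = A0 + sum(xi * Ai for xi, Ai in zip(x, A_list))
    w, V = np.linalg.eigh(A)       # eigenvalues ascending, orthonormal eigenvectors
    y = V[:, -1]                   # unit eigenvector for lambda_max
    g = np.array([y @ Ai @ y for Ai in A_list])
    return w[-1], g
```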

Subgradients 2.18

Minimization

f (x) = inf_y h(x, y),   h jointly convex in (x, y)

Weak result: to find a subgradient at xˆ,

• find yˆ that minimizes h(xˆ, y) (assuming minimum is attained)

• find subgradient (g,0) ∈ ∂h(xˆ, yˆ)

Proof: for all x, y,

h(x, y) ≥ h(xˆ, yˆ) + gT(x − xˆ) + 0T(y − yˆ) = f (xˆ) + gT(x − xˆ)

therefore f (x) = inf_y h(x, y) ≥ f (xˆ) + gT(x − xˆ)

Subgradients 2.19

Exercise: distance to convex set

Problem: explain how to find a subgradient of

f (x) = inf_{y∈C} ‖x − y‖2

where C is a closed convex set

Solution: to find a subgradient at xˆ,

• if f (xˆ) = 0 (that is, xˆ ∈ C), take g = 0

• if f (xˆ) > 0, find projection yˆ = P(xˆ) on C and take

g = (xˆ − yˆ) / ‖xˆ − yˆ‖2 = (xˆ − P(xˆ)) / ‖xˆ − P(xˆ)‖2
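The recipe is practical whenever the projection P is easy to evaluate. As an illustration (my choice, not from the slides), take C to be a box {y | l ⪯ y ⪯ u}, whose projection is coordinate-wise clipping:

```python
import numpy as np

def dist_subgrad_box(x, l, u):
    """f(x) = dist(x, C) with C = {y : l <= y <= u}; returns f(x) and one subgradient."""
    p = np.clip(x, l, u)              # projection of x onto the box
    r = x - p
    d = np.linalg.norm(r)
    if d == 0:                        # x in C: f(x) = 0 and g = 0 works
        return 0.0, np.zeros_like(x)
    return d, r / d                   # g = (x - P(x)) / ||x - P(x)||_2
```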

Subgradients 2.20

Composition

f (x) = h( f1(x),..., fk(x)), h convex and nondecreasing, fi convex

Weak result: to find a subgradient at xˆ,

• find z ∈ ∂h( f1(xˆ),..., fk(xˆ)) and gi ∈ ∂ fi(xˆ)

• then g = z1g1 + · · · + zkgk ∈ ∂ f (xˆ)

the rule reduces to the standard formula for differentiable h, fi

Proof:

f (x) ≥ h( f1(xˆ) + g1T(x − xˆ), ..., fk(xˆ) + gkT(x − xˆ))

     ≥ h( f1(xˆ), ..., fk(xˆ)) + zT (g1T(x − xˆ), ..., gkT(x − xˆ))

     = h( f1(xˆ), ..., fk(xˆ)) + (z1g1 + · · · + zkgk)T(x − xˆ)

     = f (xˆ) + gT(x − xˆ)

(the first inequality uses that h is nondecreasing; the second uses z ∈ ∂h( f1(xˆ), ..., fk(xˆ)))
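As an illustration of the composition rule (my example, not from the slides), take h to be log-sum-exp, which is convex and nondecreasing in each argument; its gradient z is the softmax vector, so g = z1g1 + · · · + zkgk. A sketch assuming each fi is available as an oracle returning (fi(x), gi); all names are mine:

```python
import numpy as np

def logsumexp_comp_subgrad(f_oracles, x):
    """Subgradient of f(x) = log(sum_i exp(f_i(x))) via the composition rule."""
    vals, grads = zip(*(o(x) for o in f_oracles))   # each oracle returns (f_i(x), g_i)
    vals = np.array(vals)
    m = vals.max()
    z = np.exp(vals - m)
    z /= z.sum()                                    # z = softmax(vals), a subgradient of h
    g = sum(zi * gi for zi, gi in zip(z, grads))    # g = sum_i z_i g_i
    fval = m + np.log(np.sum(np.exp(vals - m)))     # numerically stable log-sum-exp
    return fval, g
```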

Subgradients 2.21

Optimal value function

define f (u, v) as the optimal value of the convex problem

minimize   f0(x)
subject to fi(x) ≤ ui,   i = 1, ..., m
           Ax = b + v

(functions fi are convex; optimization variable is x)

Weak result: suppose f (uˆ, vˆ) is finite and strong duality holds with the dual

maximize   inf_x ( f0(x) + Σi λi( fi(x) − uˆi) + νT(Ax − b − vˆ) )
subject to λ ⪰ 0

if λˆ, νˆ are optimal dual variables (for right-hand sides uˆ, vˆ) then (−λˆ, −νˆ) ∈ ∂ f (uˆ, vˆ)

Subgradients 2.22

Proof: by weak duality, for the problem with right-hand sides u, v

f (u, v) ≥ inf_x ( f0(x) + Σi λˆi( fi(x) − ui) + νˆT(Ax − b − v) )

         = inf_x ( f0(x) + Σi λˆi( fi(x) − uˆi) + νˆT(Ax − b − vˆ) ) − λˆT(u − uˆ) − νˆT(v − vˆ)

         = f (uˆ, vˆ) − λˆT(u − uˆ) − νˆT(v − vˆ)

Subgradients 2.23

Expectation

f (x) = E h(x, u),   u random, h convex in x for every u

Weak result: to find a subgradient at xˆ,

• choose a function u ↦ g(u) with g(u) ∈ ∂x h(xˆ, u)

• then g = E g(u) ∈ ∂ f (xˆ)

Proof: by convexity of h and definition of g(u),

f (x) = E h(x, u)

      ≥ E ( h(xˆ, u) + g(u)T(x − xˆ) )

      = f (xˆ) + gT(x − xˆ)
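This rule is what justifies replacing E g(u) by a sample average (or a single sample, as in the stochastic subgradient method) when the expectation cannot be evaluated exactly. A small sketch for f (x) = E |uT x − b| with sampled data; all names are mine and the returned vector is only a Monte-Carlo estimate of E g(u):

```python
import numpy as np

def sampled_subgrad(x, U, b):
    """Sample-average estimate of a subgradient of f(x) = E |u^T x - b|,
    given rows u_i of U and targets b; per-sample g(u) = sign(u^T x - b) u."""
    r = U @ x - b
    return (np.sign(r) @ U) / len(b)   # average of the per-sample subgradients
```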

Subgradients 2.24

Outline

• definition

• subgradient calculus

• duality and optimality conditions

• directional derivative

Optimality conditions — unconstrained

x⋆ minimizes f (x) if and only if 0 ∈ ∂ f (x⋆)

this follows directly from the definition of subgradient:

f (y) ≥ f (x⋆) + 0T(y − x⋆) for all y ⇐⇒ 0 ∈ ∂ f (x⋆)

Subgradients 2.25

Example: piecewise-linear minimization

f (x) = max_{i=1,...,m} (aiT x + bi)

Optimality condition

0 ∈ conv {ai | i ∈ I(x⋆)}   where I(x) = {i | aiT x + bi = f (x)}

• in other words, x⋆ is optimal if and only if there is a λ with

λ ⪰ 0,   1T λ = 1,   Σ_{i=1}^m λi ai = 0,   λi = 0 for i ∉ I(x⋆)

• these are the optimality conditions for the equivalent linear program

minimize   t                              maximize   bT λ
subject to Ax + b ⪯ t1                    subject to AT λ = 0
                                                     λ ⪰ 0,   1T λ = 1
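For a numerical check, the primal LP on the left can be handed to any LP solver; here is a sketch using scipy.optimize.linprog with the stacked variable (x, t) (the solver choice and function names are mine):

```python
import numpy as np
from scipy.optimize import linprog

def pwl_minimize(A, b):
    """Solve minimize_x max_i (a_i^T x + b_i) via the LP: min t s.t. Ax + b <= t*1."""
    m, n = A.shape
    c = np.r_[np.zeros(n), 1.0]             # objective: minimize t
    A_ub = np.c_[A, -np.ones(m)]            # Ax - t*1 <= -b
    res = linprog(c, A_ub=A_ub, b_ub=-b,
                  bounds=[(None, None)] * (n + 1))  # all variables free
    return res.x[:n], res.x[n]              # minimizer x and optimal value t
```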

Subgradients 2.26

Optimality conditions — constrained

minimize   f0(x)
subject to fi(x) ≤ 0,   i = 1, ..., m

assume dom fi = Rn, so the functions fi are subdifferentiable everywhere

Karush–Kuhn–Tucker conditions

if strong duality holds, then x⋆, λ⋆ are primal, dual optimal if and only if

1. x⋆ is primal feasible

2. λ⋆ ⪰ 0

3. λi⋆ fi(x⋆) = 0 for i = 1, ..., m

4. x⋆ is a minimizer of L(x, λ⋆) = f0(x) + Σ_{i=1}^m λi⋆ fi(x):

   0 ∈ ∂ f0(x⋆) + Σ_{i=1}^m λi⋆ ∂ fi(x⋆)

Subgradients 2.27

Outline

• definition

• subgradient calculus

• duality and optimality conditions

• directional derivative

Directional derivative

Definition (for general f ): the directional derivative of f at x in the direction y is

f′(x; y) = lim_{α↓0} ( f (x + αy) − f (x) ) / α

         = lim_{t→∞} t ( f (x + (1/t) y) − f (x) )

(if the limit exists)

• f′(x; y) is the right derivative of g(α) = f (x + αy) at α = 0

• f′(x; y) is homogeneous in y:

f′(x; λy) = λ f′(x; y) for λ ≥ 0

Subgradients 2.28

Directional derivative of a convex function

Equivalent definition (for convex f ): replace lim with inf

f′(x; y) = inf_{α>0} ( f (x + αy) − f (x) ) / α

         = inf_{t>0} ( t f (x + (1/t) y) − t f (x) )

Proof

• the function h(y) = f (x + y) − f (x) is convex in y, with h(0) = 0

• its perspective th(y/t) is nonincreasing in t (ECE236B ex. A2.5); hence

f′(x; y) = lim_{t→∞} t h(y/t) = inf_{t>0} t h(y/t)

Subgradients 2.29

Properties

consequences of the expressions (for convex f )

f′(x; y) = inf_{α>0} ( f (x + αy) − f (x) ) / α

         = inf_{t>0} ( t f (x + (1/t) y) − t f (x) )

• f′(x; y) is convex in y (partial minimization of a convex function in (y, t))

• f′(x; y) defines a lower bound on f in the direction y:

f (x + αy) ≥ f (x) + α f′(x; y) for all α ≥ 0

Subgradients 2.30

Directional derivative and subgradients

for convex f and x ∈ int dom f

f′(x; y) = sup_{g∈∂ f (x)} gT y

[figure: the set ∂ f (x), a direction y, and the maximizing subgradient gˆ with f′(x; y) = gˆT y]

f′(x; y) is the support function of ∂ f (x)

• generalizes f′(x; y) = ∇ f (x)T y for differentiable functions

• implies that f′(x; y) exists for all x ∈ int dom f , all y (see page 2.4)

Subgradients 2.31

Proof: if g ∈ ∂ f (x), then from page 2.29

f′(x; y) ≥ inf_{α>0} ( f (x) + αgT y − f (x) ) / α = gT y

it remains to show that f′(x; y) = gˆT y for at least one gˆ ∈ ∂ f (x)

• f′(x; y) is convex in y with domain Rn, hence subdifferentiable at all y

• let gˆ be a subgradient of f′(x; y) at y: then for all v, λ ≥ 0,

λ f′(x; v) = f′(x; λv) ≥ f′(x; y) + gˆT(λv − y)

• taking λ → ∞ shows that f′(x; v) ≥ gˆT v; from the lower bound on page 2.30,

f (x + v) ≥ f (x) + f′(x; v) ≥ f (x) + gˆT v for all v

hence gˆ ∈ ∂ f (x)

• taking λ = 0 we see that f′(x; y) ≤ gˆT y

Subgradients 2.32

Descent directions and subgradients

y is a descent direction of f at x if f′(x; y) < 0

• the negative gradient of a differentiable f is a descent direction (if ∇ f (x) ≠ 0)

• a negative subgradient is not always a descent direction

Example: f (x1, x2) = |x1| + 2|x2|

[figure: contour lines of f in the (x1, x2)-plane, with the subgradient g = (1, 2) drawn at the point (1, 0)]

g = (1,2) ∈ ∂ f (1,0), but y = (−1,−2) is not a descent direction at (1,0)
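The example is easy to confirm numerically: along y = (−1, −2) the term 2|x2| grows twice as fast as |x1| shrinks, so f increases for every step size. A quick check (names are mine):

```python
import numpy as np

f = lambda x: abs(x[0]) + 2 * abs(x[1])

x = np.array([1.0, 0.0])
y = np.array([-1.0, -2.0])          # negative of the subgradient g = (1, 2)
for t in [1e-3, 1e-2, 1e-1]:
    print(t, f(x + t * y) - f(x))   # always positive: f increases along y
```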

Subgradients 2.33

Steepest descent direction

Definition: (normalized) steepest descent direction at x ∈ int dom f is

∆xnsd = argmin_{‖y‖2≤1} f′(x; y)

∆xnsd is the primal solution y of the pair of dual problems (BV §8.1.3)

minimize (over y)   f′(x; y)              maximize (over g)   −‖g‖2
subject to          ‖y‖2 ≤ 1              subject to          g ∈ ∂ f (x)

• the dual optimal g⋆ is the subgradient with least norm

• f′(x; ∆xnsd) = −‖g⋆‖2

• if 0 ∉ ∂ f (x), then ∆xnsd = −g⋆/‖g⋆‖2

• ∆xnsd can be expensive to compute

[figure: ∂ f (x), the least-norm subgradient g⋆, and the direction ∆xnsd, with gT∆xnsd = f′(x; ∆xnsd)]

Subgradients 2.34

Subgradients and distance to sublevel sets

if f is convex, f (y) < f (x), g ∈ ∂ f (x), then for small t > 0,

‖x − tg − y‖2² = ‖x − y‖2² − 2t gT(x − y) + t²‖g‖2²

               ≤ ‖x − y‖2² − 2t ( f (x) − f (y)) + t²‖g‖2²

               < ‖x − y‖2²

• −g is a descent direction for ‖x − y‖2, for any y with f (y) < f (x)

• in particular, −g is a descent direction for the distance to any minimizer of f
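This inequality is the basic argument behind the subgradient method: stepping along −g with a diminishing step size shrinks the distance to the minimizers even though f itself need not decrease. A minimal sketch (the step-size rule and names are my choices, not part of the slides):

```python
import numpy as np

def subgradient_method(subgrad_oracle, x0, iters=1000):
    """subgrad_oracle(x) returns (f(x), g) with g a subgradient of f at x."""
    x, fbest, xbest = x0.copy(), np.inf, x0.copy()
    for k in range(1, iters + 1):
        fx, g = subgrad_oracle(x)
        if fx < fbest:                  # keep the best iterate: f need not decrease
            fbest, xbest = fx, x.copy()
        x = x - (1.0 / k) * g           # diminishing step size t_k = 1/k
    return xbest, fbest
```

It can be run, for instance, with the piecewise-linear oracle sketched earlier, passing lambda x: pwl_value_and_subgrad(A, b, x) as subgrad_oracle.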

Subgradients 2.35

References

• A. Beck, First-Order Methods in Optimization (2017), chapter 3.

• D. P. Bertsekas, A. Nedić, A. E. Ozdaglar, Convex Analysis and Optimization (2003), chapter 4.

• J.-B. Hiriart-Urruty, C. Lemaréchal, Convex Analysis and Minimization Algorithms (1993), chapter VI.

• Yu. Nesterov, Lectures on Convex Optimization (2018), section 3.1.

• B. T. Polyak, Introduction to Optimization (1987), section 5.1.

Subgradients 2.36