L. Vandenberghe ECE236C (Spring 2020) 2. Subgradients
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative
Subgradients 2.1

Basic inequality

recall the basic inequality for differentiable convex functions:

    f (y) ≥ f (x) + ∇ f (x)T(y − x) for all y ∈ dom f

[figure: graph of f with the supporting hyperplane with normal (∇ f (x), −1) at (x, f (x))]

• the first-order approximation of f at x is a global lower bound
• ∇ f (x) defines a non-vertical supporting hyperplane to the epigraph of f at (x, f (x)):

    ∇ f (x)T(y − x) − (t − f (x)) ≤ 0 for all (y,t) ∈ epi f
Subgradients 2.2

Subgradient

g is a subgradient of a convex function f at x ∈ dom f if
f (y) ≥ f (x) + gT(y − x) for all y ∈ dom f
[figure: graph of f (y) with affine lower bounds f (x1) + g1T(y − x1), f (x1) + g2T(y − x1), and f (x2) + g3T(y − x2)]

g1, g2 are subgradients at x1; g3 is a subgradient at x2
Subgradients 2.3

Subdifferential

the subdifferential ∂ f (x) of f at x is the set of all subgradients:
∂ f (x) = {g | gT(y − x) ≤ f (y) − f (x), ∀y ∈ dom f }
Properties
• ∂ f (x) is a closed convex set (possibly empty)
this follows from the definition: ∂ f (x) is an intersection of halfspaces
• if x ∈ int dom f then ∂ f (x) is nonempty and bounded
proof on next two pages
Subgradients 2.4

Proof: we show that ∂ f (x) is nonempty when x ∈ int dom f
• (x, f (x)) is in the boundary of the convex set epi f
• therefore there exists a supporting hyperplane to epi f at (x, f (x)):
    ∃(a, b) ≠ 0:  aT(y − x) + b(t − f (x)) ≤ 0 for all (y,t) ∈ epi f
• b > 0 gives a contradiction as t → ∞
• b = 0 gives a contradiction for y = x + εa with small ε > 0
• therefore b < 0 and g = (1/|b|) a is a subgradient of f at x
Subgradients 2.5

Proof: ∂ f (x) is bounded when x ∈ int dom f
• for small r > 0, define a set of 2n points
B = {x ± rek | k = 1,..., n} ⊂ dom f
and define M = max { f (y) | y ∈ B } < ∞

• for every g ∈ ∂ f (x), there is a point y ∈ B with
    r ‖g‖∞ = gT(y − x)
(choose an index k with |gk| = ‖g‖∞, and take y = x + r sign(gk) ek)

• since g is a subgradient, this implies that
    f (x) + r ‖g‖∞ = f (x) + gT(y − x) ≤ f (y) ≤ M
• we conclude that ∂ f (x) is bounded:
    ‖g‖∞ ≤ (M − f (x)) / r for all g ∈ ∂ f (x)
Subgradients 2.6

Example
f (x) = max { f1(x), f2(x)} with f1, f2 convex and differentiable
[figure: graph of f (y) = max { f1(y), f2(y)}]
• if f1(xˆ) = f2(xˆ), subdifferential at xˆ is line segment [∇ f1(xˆ), ∇ f2(xˆ)]
• if f1(xˆ) > f2(xˆ), subdifferential at xˆ is {∇ f1(xˆ)}
• if f1(xˆ) < f2(xˆ), subdifferential at xˆ is {∇ f2(xˆ)}
Subgradients 2.7

Examples
Absolute value f (x) = |x|
[figure: graph of f (x) = |x| and of ∂ f (x), which is {−1} for x < 0, [−1,1] at x = 0, and {1} for x > 0]
Euclidean norm f (x) = ‖x‖2

    ∂ f (x) = {x/‖x‖2} if x ≠ 0,    ∂ f (x) = {g | ‖g‖2 ≤ 1} if x = 0
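These formulas are easy to sanity-check numerically. The sketch below (plain Python; the function names are my own, not from the slides) returns one subgradient of the Euclidean norm and verifies the defining inequality f (y) ≥ f (x) + gT(y − x) at a few test points.

```python
import math

def subgrad_norm2(x):
    """One subgradient of f(x) = ||x||_2 at x.

    For x != 0, f is differentiable with gradient x / ||x||_2;
    at x = 0, any g with ||g||_2 <= 1 works (we pick g = 0).
    """
    nrm = math.sqrt(sum(v * v for v in x))
    if nrm > 0:
        return [v / nrm for v in x]
    return [0.0] * len(x)

def check_subgradient(f, x, g, points):
    """Verify f(y) >= f(x) + g^T (y - x) at each test point y."""
    fx = f(x)
    return all(
        f(y) >= fx + sum(gi * (yi - xi) for gi, yi, xi in zip(g, y, x)) - 1e-12
        for y in points
    )

f = lambda x: math.sqrt(sum(v * v for v in x))
pts = [[1.0, 2.0], [-3.0, 0.5], [0.0, 0.0], [0.1, -0.1]]
assert check_subgradient(f, [3.0, 4.0], subgrad_norm2([3.0, 4.0]), pts)
assert check_subgradient(f, [0.0, 0.0], subgrad_norm2([0.0, 0.0]), pts)
```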
Subgradients 2.8

Monotonicity

the subdifferential of a convex function is a monotone operator:
(u − v)T(x − y) ≥ 0 for all x, y, u ∈ ∂ f (x), v ∈ ∂ f (y)
Proof: by definition
    f (y) ≥ f (x) + uT(y − x),    f (x) ≥ f (y) + vT(x − y)

combining the two inequalities shows monotonicity
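Monotonicity is easy to observe for f (x) = |x|, whose subdifferential is known in closed form; a minimal sketch (function names are illustrative):

```python
def subdiff_abs(x):
    """Full subdifferential of f(x) = |x| as an interval (lo, hi)."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)          # at 0: the whole interval [-1, 1]

def monotone(x, y):
    """Check (u - v)(x - y) >= 0 for every u in df(x), v in df(y).

    (u - v)(x - y) is bilinear in (u, v), so its minimum over the two
    intervals is attained at a corner; checking corners suffices.
    """
    lo_u, hi_u = subdiff_abs(x)
    lo_v, hi_v = subdiff_abs(y)
    return min((u - v) * (x - y)
               for u in (lo_u, hi_u) for v in (lo_v, hi_v)) >= 0

assert all(monotone(x, y)
           for x in (-2.0, 0.0, 1.5) for y in (-2.0, 0.0, 1.5))
```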
Subgradients 2.9

Examples of non-subdifferentiable functions

the following functions are not subdifferentiable at x = 0
• f : R → R, dom f = R+
f (x) = 1 if x = 0, f (x) = 0 if x > 0
• f : R → R, dom f = R+, f (x) = −√x
the only supporting hyperplane to epi f at (0, f (0)) is vertical
Subgradients 2.10

Subgradients and sublevel sets

if g is a subgradient of f at x, then
f (y) ≤ f (x) =⇒ gT(y − x) ≤ 0
[figure: sublevel set {y | f (y) ≤ f (x)} with supporting hyperplane normal g at x]
the nonzero subgradients at x define supporting hyperplanes to the sublevel set
{y | f (y) ≤ f (x)}
Subgradients 2.11

Outline
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative

Subgradient calculus
Weak subgradient calculus: rules for finding one subgradient
• sufficient for most nondifferentiable convex optimization algorithms
• if you can evaluate f (x), you can usually compute a subgradient
Strong subgradient calculus: rules for finding ∂ f (x) (all subgradients)
• some algorithms, optimality conditions, etc., need entire subdifferential
• can be quite complicated
we will assume that x ∈ int dom f
Subgradients 2.12

Basic rules
Differentiable functions: ∂ f (x) = {∇ f (x)} if f is differentiable at x
Nonnegative linear combination if f (x) = α1 f1(x) + α2 f2(x) with α1, α2 ≥ 0, then
∂ f (x) = α1∂ f1(x) + α2∂ f2(x)
(right-hand side is addition of sets)
Affine transformation of variables: if f (x) = h(Ax + b), then
∂ f (x) = AT ∂h(Ax + b)
Subgradients 2.13

Pointwise maximum
    f (x) = max { f1(x),..., fm(x)}

define I(x) = {i | fi(x) = f (x)}, the ‘active’ functions at x
Weak result: to compute a subgradient at x, choose any k ∈ I(x) and any subgradient of fk at x
Strong result

    ∂ f (x) = conv ( ∪_{i∈I(x)} ∂ fi(x) )
• the convex hull of the union of subdifferentials of ‘active’ functions at x
• if fi’s are differentiable, ∂ f (x) = conv {∇ fi(x) | i ∈ I(x)}
Subgradients 2.14

Example: piecewise-linear function
    f (x) = max_{i=1,...,m} (aiT x + bi)
[figure: piecewise-linear f (x) formed by the maximum of the affine functions aiT x + bi]

the subdifferential at x is a polyhedron
∂ f (x) = conv {ai | i ∈ I(x)}
with I(x) = {i | aiT x + bi = f (x)}
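The weak rule "pick any active index" is straightforward to implement; a sketch in plain Python (helper names are illustrative):

```python
def pwl_subgradient(A, b, x, tol=1e-9):
    """One subgradient of f(x) = max_i (a_i^T x + b_i).

    Weak rule: pick any active index k with a_k^T x + b_k = f(x);
    then a_k is a subgradient at x.
    """
    vals = [sum(ai * xi for ai, xi in zip(row, x)) + bi
            for row, bi in zip(A, b)]
    fx = max(vals)
    k = next(i for i, v in enumerate(vals) if v >= fx - tol)  # first active index
    return fx, A[k]

# example in R^1: f(x) = max(x, -x, 2x - 1)
A = [[1.0], [-1.0], [2.0]]
b = [0.0, 0.0, -1.0]
fx, g = pwl_subgradient(A, b, [1.0])
assert fx == 1.0 and g in ([1.0], [2.0])   # rows 1 and 3 are both active at x = 1
```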
Subgradients 2.15

Example: ℓ1-norm
    f (x) = ‖x‖1 = max_{s∈{−1,1}^n} sT x

the subdifferential is a product of intervals
    ∂ f (x) = J1 × · · · × Jn,    Jk = [−1,1] if xk = 0,  {1} if xk > 0,  {−1} if xk < 0
∂ f (0,0) = [−1,1] × [−1,1],    ∂ f (1,0) = {1} × [−1,1],    ∂ f (1,1) = {(1,1)}
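One subgradient of the ℓ1-norm follows by picking an element of each interval Jk (taking 0 from [−1,1] where xk = 0); a minimal sketch:

```python
def subgrad_l1(x):
    """One subgradient of ||x||_1: sign(x_k) per coordinate, 0 at x_k = 0."""
    return [1.0 if v > 0 else -1.0 if v < 0 else 0.0 for v in x]

def l1(x):
    return sum(abs(v) for v in x)

x = [1.0, 0.0, -2.0]
g = subgrad_l1(x)

# subgradient inequality ||y||_1 >= ||x||_1 + g^T (y - x) at a few points
for y in ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [-2.0, 3.0, 0.5]):
    assert l1(y) >= l1(x) + sum(gi * (yi - xi) for gi, yi, xi in zip(g, y, x))
```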
Subgradients 2.16

Pointwise supremum
    f (x) = sup_{α∈A} fα(x),    fα(x) convex in x for every α
Weak result: to find a subgradient at xˆ,
• find any β for which f (xˆ) = fβ(xˆ) (assuming maximum is attained)
• choose any g ∈ ∂ fβ(xˆ)
(Partial) strong result: define I(x) = {α ∈ A | fα(x) = f (x)}
    conv ( ∪_{α∈I(x)} ∂ fα(x) ) ⊆ ∂ f (x)

equality requires extra conditions (for example, A compact, fα continuous in α)
Subgradients 2.17

Exercise: maximum eigenvalue
Problem: explain how to find a subgradient of
    f (x) = λmax(A(x)) = sup_{‖y‖2=1} yT A(x)y

where A(x) = A0 + x1A1 + ··· + xn An with symmetric coefficients Ai
Solution: to find a subgradient at xˆ,
• choose any unit eigenvector y with eigenvalue λmax(A(xˆ))
• the gradient of yT A(x)y at xˆ is a subgradient of f :

    (yT A1y,..., yT Any) ∈ ∂ f (xˆ)
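For 2 × 2 symmetric matrices the top eigenvalue and eigenvector are available in closed form, so this recipe can be checked without a linear-algebra library. A sketch under that assumption (the matrices and names are chosen for illustration):

```python
import math

def lam_max(M):
    """Largest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    return (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

def top_eigvec(M):
    """Unit eigenvector for the largest eigenvalue (symmetric 2x2)."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    lam = lam_max(M)
    v = (b, lam - a) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)

def A_of_x(A0, As, x):
    """A(x) = A0 + x1*A1 + x2*A2 (all matrices 2x2 symmetric)."""
    return [[A0[i][j] + sum(xk * Ak[i][j] for xk, Ak in zip(x, As))
             for j in range(2)] for i in range(2)]

A0 = [[1.0, 0.0], [0.0, -1.0]]
As = [[[0.0, 1.0], [1.0, 0.0]],      # A1
      [[1.0, 0.0], [0.0, 1.0]]]      # A2

f = lambda x: lam_max(A_of_x(A0, As, x))

xh = [0.5, 0.2]
y = top_eigvec(A_of_x(A0, As, xh))
# subgradient of f at xh: g_k = y^T A_k y
g = [sum(y[i] * Ak[i][j] * y[j] for i in range(2) for j in range(2)) for Ak in As]

# check the subgradient inequality f(z) >= f(xh) + g^T (z - xh)
for z in ([0.0, 0.0], [1.0, -1.0], [0.6, 0.3]):
    assert f(z) >= f(xh) + sum(gk * (zk - xk)
                               for gk, zk, xk in zip(g, z, xh)) - 1e-9
```

The inequality holds because yT A(z)y is affine in z, lower-bounds λmax(A(z)), and touches f at xˆ.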
Subgradients 2.18

Minimization
    f (x) = inf_y h(x, y),    h jointly convex in (x, y)
Weak result: to find a subgradient at xˆ,
• find yˆ that minimizes h(xˆ, y) (assuming minimum is attained)
• find subgradient (g,0) ∈ ∂h(xˆ, yˆ)
Proof: for all x, y,

    h(x, y) ≥ h(xˆ, yˆ) + gT(x − xˆ) + 0T(y − yˆ) = f (xˆ) + gT(x − xˆ)

therefore f (x) = inf_y h(x, y) ≥ f (xˆ) + gT(x − xˆ)
Subgradients 2.19

Exercise: Euclidean distance to convex set
Problem: explain how to find a subgradient of
    f (x) = inf_{y∈C} ‖x − y‖2

where C is a closed convex set
Solution: to find a subgradient at xˆ,
• if f (xˆ) = 0 (that is, xˆ ∈ C), take g = 0
• if f (xˆ) > 0, find projection yˆ = P(xˆ) on C and take
    g = (xˆ − yˆ)/‖xˆ − yˆ‖2 = (xˆ − P(xˆ))/‖xˆ − P(xˆ)‖2
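With C taken to be the Euclidean unit ball (chosen here only because its projection is explicit), the recipe reads:

```python
import math

def proj_ball(x):
    """Projection onto C = {y : ||y||_2 <= 1} (a convenient example of C)."""
    n = math.sqrt(sum(v * v for v in x))
    return list(x) if n <= 1 else [v / n for v in x]

def dist_subgrad(x):
    """f(x) = dist(x, C) and one subgradient, following the recipe above."""
    p = proj_ball(x)
    d = math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))
    if d == 0:
        return 0.0, [0.0] * len(x)          # x in C: take g = 0
    return d, [(xi - pi) / d for xi, pi in zip(x, p)]

d, g = dist_subgrad([3.0, 4.0])             # ||x||_2 = 5, so dist = 4
assert abs(d - 4.0) < 1e-9
assert all(abs(gi - ti) < 1e-9 for gi, ti in zip(g, [0.6, 0.8]))

# subgradient inequality dist(z, C) >= dist(x, C) + g^T (z - x)
for z in ([0.0, 0.0], [6.0, 8.0], [1.0, -1.0]):
    dz, _ = dist_subgrad(z)
    assert dz >= d + sum(gi * (zi - xi)
                         for gi, zi, xi in zip(g, z, [3.0, 4.0])) - 1e-9
```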
Subgradients 2.20

Composition
f (x) = h( f1(x),..., fk(x)), h convex and nondecreasing, fi convex
Weak result: to find a subgradient at xˆ,
• find z ∈ ∂h( f1(xˆ),..., fk(xˆ)) and gi ∈ ∂ fi(xˆ)
• then g = z1g1 + ··· + zkgk ∈ ∂ f (xˆ)

this reduces to the standard formula when h and the fi are differentiable
Proof:
    f (x) ≥ h ( f1(xˆ) + g1T(x − xˆ), ..., fk(xˆ) + gkT(x − xˆ) )
          ≥ h ( f1(xˆ),..., fk(xˆ)) + zT ( g1T(x − xˆ),..., gkT(x − xˆ) )
          = h ( f1(xˆ),..., fk(xˆ)) + (z1g1 + ··· + zkgk)T(x − xˆ)
          = f (xˆ) + gT(x − xˆ)
Subgradients 2.21

Optimal value function

define f (u, v) as the optimal value of the convex problem
    minimize   f0(x)
    subject to fi(x) ≤ ui,  i = 1,..., m
               Ax = b + v
(functions fi are convex; optimization variable is x)
Weak result: suppose f (uˆ, vˆ) is finite and strong duality holds with the dual
    maximize   inf_x ( f0(x) + Σi λi ( fi(x) − uˆi) + νT(Ax − b − vˆ) )
    subject to λ ⪰ 0

if λˆ, νˆ are optimal dual variables (for right-hand sides uˆ, vˆ), then (−λˆ, −νˆ) ∈ ∂ f (uˆ, vˆ)
Subgradients 2.22

Proof: by weak duality for the problem with right-hand sides u, v,

    f (u, v) ≥ inf_x ( f0(x) + Σi λˆi ( fi(x) − ui) + νˆT(Ax − b − v) )
             = inf_x ( f0(x) + Σi λˆi ( fi(x) − uˆi) + νˆT(Ax − b − vˆ) ) − λˆT(u − uˆ) − νˆT(v − vˆ)
             = f (uˆ, vˆ) − λˆT(u − uˆ) − νˆT(v − vˆ)
Subgradients 2.23

Expectation
    f (x) = E h(x,u),    u random, h(x,u) convex in x for every u
Weak result: to find a subgradient at xˆ,
• choose a function u ↦ g(u) with g(u) ∈ ∂x h(xˆ,u)
• then g = E g(u) ∈ ∂ f (xˆ)
Proof: by convexity of h and definition of g(u),
    f (x) = E h(x,u) ≥ E [ h(xˆ,u) + g(u)T(x − xˆ) ]
= f (xˆ) + gT(x − xˆ)
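A stochastic illustration with h(x,u) = |x − u| and an empirical expectation over drawn samples (the sample size, seed, and distribution are arbitrary choices):

```python
import random

def h(x, u):
    return abs(x - u)                    # convex in x for every u

def g_of_u(x, u):
    """A selection g(u) in the subdifferential of x -> |x - u|."""
    return 1.0 if x > u else -1.0 if x < u else 0.0

random.seed(0)
samples = [random.uniform(-1.0, 1.0) for _ in range(1000)]

def f(x):                                # empirical expectation E h(x, u)
    return sum(h(x, u) for u in samples) / len(samples)

xh = 0.3
g = sum(g_of_u(xh, u) for u in samples) / len(samples)   # g = E g(u)

# g is a subgradient of the empirical f at xh
for y in (-1.0, -0.2, 0.0, 0.5, 2.0):
    assert f(y) >= f(xh) + g * (y - xh) - 1e-12
```

The check succeeds exactly because g averages per-sample subgradients and each h(·,u) obeys the subgradient inequality.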
Subgradients 2.24

Outline
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative

Optimality conditions — unconstrained

x⋆ minimizes f (x) if and only if 0 ∈ ∂ f (x⋆)

[figure: graph of f with a horizontal supporting line at x⋆]

this follows directly from the definition of subgradient:
    f (y) ≥ f (x⋆) + 0T(y − x⋆) for all y ⇐⇒ 0 ∈ ∂ f (x⋆)
Subgradients 2.25

Example: piecewise-linear minimization
    f (x) = max_{i=1,...,m} (aiT x + bi)
Optimality condition
    0 ∈ conv {ai | i ∈ I(x⋆)}    where I(x) = {i | aiT x + bi = f (x)}
• in other words, x⋆ is optimal if and only if there is a λ with

    λ ⪰ 0,  1T λ = 1,  Σ_{i=1}^m λi ai = 0,  λi = 0 for i ∉ I(x⋆)
• these are the optimality conditions for the equivalent linear program
    minimize   t                          maximize   bT λ
    subject to Ax + b ⪯ t1                subject to AT λ = 0
                                                     λ ⪰ 0,  1T λ = 1
Subgradients 2.26

Optimality conditions — constrained
minimize f0(x) subject to fi(x) ≤ 0, i = 1,..., m
assume dom fi = Rn, so the functions fi are subdifferentiable everywhere
Karush–Kuhn–Tucker conditions

if strong duality holds, then x⋆, λ⋆ are primal, dual optimal if and only if

1. x⋆ is primal feasible
2. λ⋆ ⪰ 0
3. λi⋆ fi(x⋆) = 0 for i = 1,..., m
4. x⋆ is a minimizer of L(x, λ⋆) = f0(x) + Σ_{i=1}^m λi⋆ fi(x):

    0 ∈ ∂ f0(x⋆) + Σ_{i=1}^m λi⋆ ∂ fi(x⋆)
Subgradients 2.27

Outline
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative

Directional derivative
Definition (for general f ): the directional derivative of f at x in the direction y is
    f ′(x; y) = lim_{α↓0} ( f (x + αy) − f (x)) / α
              = lim_{t→∞} t ( f (x + y/t) − f (x))
(if the limit exists)
• f ′(x; y) is the right derivative of g(α) = f (x + αy) at α = 0
• f ′(x; y) is positively homogeneous in y:

    f ′(x; λy) = λ f ′(x; y) for λ ≥ 0
Subgradients 2.28

Directional derivative of a convex function
Equivalent definition (for convex f ): replace lim with inf
    f ′(x; y) = inf_{α>0} ( f (x + αy) − f (x)) / α
              = inf_{t>0} t ( f (x + y/t) − f (x))
Proof
• the function h(y) = f (x + y) − f (x) is convex in y, with h(0) = 0
• its perspective t h(y/t) is nonincreasing in t (ECE236B ex. A2.5); hence

    f ′(x; y) = lim_{t→∞} t h(y/t) = inf_{t>0} t h(y/t)
Subgradients 2.29

Properties

consequences of the expressions (for convex f )
    f ′(x; y) = inf_{α>0} ( f (x + αy) − f (x)) / α
              = inf_{t>0} t ( f (x + y/t) − f (x))
• f ′(x; y) is convex in y (partial minimization of a convex function in (y, t))
• f ′(x; y) defines a lower bound on f in the direction y:

    f (x + αy) ≥ f (x) + α f ′(x; y) for all α ≥ 0
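The inf characterization can be seen numerically: for convex f the difference quotients are nondecreasing in α, so shrinking α approaches the directional derivative from above. A sketch with the smooth example f (x) = x² (where f ′(1; 1) = 2):

```python
def f(x):
    return x * x                         # smooth convex example

def quotients(f, x, y, alphas):
    """Difference quotients (f(x + a*y) - f(x)) / a for step sizes a."""
    return [(f(x + a * y) - f(x)) / a for a in alphas]

q = quotients(f, 1.0, 1.0, [1.0, 0.5, 0.1, 0.01])   # approximately 3.0, 2.5, 2.1, 2.01

# nonincreasing as alpha shrinks, with infimum f'(1; 1) = 2
assert all(a >= b for a, b in zip(q, q[1:]))
assert min(q) >= 2.0
```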
Subgradients 2.30

Directional derivative and subgradients

for convex f and x ∈ int dom f ,
    f ′(x; y) = sup_{g∈∂ f (x)} gT y

[figure: ∂ f (x) with the maximizing subgradient gˆ for a given direction y]

f ′(x; y) is the support function of ∂ f (x)
• generalizes f ′(x; y) = ∇ f (x)T y for differentiable functions
• implies that f ′(x; y) exists for all x ∈ int dom f , all y (see page 2.4)
Subgradients 2.31

Proof: if g ∈ ∂ f (x), then from page 2.29

    f ′(x; y) ≥ inf_{α>0} ( f (x) + α gT y − f (x)) / α = gT y

it remains to show that f ′(x; y) = gˆT y for at least one gˆ ∈ ∂ f (x)
• f ′(x; y) is convex in y with domain Rn, hence subdifferentiable at all y
• let gˆ be a subgradient of the function y ↦ f ′(x; y) at y: then for all v, λ ≥ 0,

    λ f ′(x; v) = f ′(x; λv) ≥ f ′(x; y) + gˆT(λv − y)

• taking λ → ∞ shows that f ′(x; v) ≥ gˆT v; from the lower bound on page 2.30,

    f (x + v) ≥ f (x) + f ′(x; v) ≥ f (x) + gˆT v for all v

hence gˆ ∈ ∂ f (x)

• taking λ = 0 we see that f ′(x; y) ≤ gˆT y
Subgradients 2.32

Descent directions and subgradients

y is a descent direction of f at x if f ′(x; y) < 0
• the negative gradient of a differentiable f is a descent direction (if ∇ f (x) ≠ 0)
• a negative subgradient is not always a descent direction
Example: f (x1, x2) = |x1| + 2|x2|
[figure: contour lines of f with the subgradient g = (1,2) drawn at the point (1,0)]
g = (1,2) ∈ ∂ f (1,0), but y = (−1,−2) is not a descent direction at (1,0)
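The example can be verified directly: moving from (1,0) along −g increases f for every small step. A sketch in plain Python:

```python
def f(x1, x2):
    return abs(x1) + 2 * abs(x2)

x = (1.0, 0.0)
g = (1.0, 2.0)

# g is a subgradient at x = (1, 0): check f(y) >= f(x) + g^T (y - x)
for y in ((0.0, 0.0), (2.0, 1.0), (-1.0, -1.0)):
    assert f(*y) >= f(*x) + g[0] * (y[0] - x[0]) + g[1] * (y[1] - x[1]) - 1e-12

# but -g is not a descent direction: along x - t*g, with 0 < t < 1,
# f equals (1 - t) + 4t = 1 + 3t, which increases with t
for t in (0.01, 0.05, 0.1):
    assert f(x[0] - t * g[0], x[1] - t * g[1]) > f(*x)
```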
Subgradients 2.33

Steepest descent direction
Definition: (normalized) steepest descent direction at x ∈ int dom f is
    ∆xnsd = argmin_{‖y‖2≤1} f ′(x; y)
∆xnsd is the primal solution y of the pair of dual problems (BV §8.1.3)
    minimize (over y)  f ′(x; y)          maximize (over g)  −‖g‖2
    subject to ‖y‖2 ≤ 1                   subject to g ∈ ∂ f (x)
• the dual optimal g⋆ is the least-norm subgradient
• f ′(x; ∆xnsd) = −‖g⋆‖2
• if 0 ∉ ∂ f (x), then ∆xnsd = −g⋆/‖g⋆‖2
• ∆xnsd can be expensive to compute
[figure: ∂ f (x), the least-norm subgradient g⋆, and ∆xnsd, with g⋆T ∆xnsd = f ′(x; ∆xnsd)]
Subgradients 2.34

Subgradients and distance to sublevel sets

if f is convex, f (y) < f (x), and g ∈ ∂ f (x), then for small t > 0,
    ‖x − tg − y‖2² = ‖x − y‖2² − 2t gT(x − y) + t²‖g‖2²
                   ≤ ‖x − y‖2² − 2t ( f (x) − f (y)) + t²‖g‖2²
                   < ‖x − y‖2²
• −g is a descent direction for ‖x − y‖2, for any y with f (y) < f (x)
• in particular, −g is a descent direction for the distance to any minimizer of f
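Combining this with the example on page 2.33: for f (x1, x2) = |x1| + 2|x2| the step along −g from (1,0) increases f yet decreases the distance to the minimizer. A quick check:

```python
import math

def f(x):
    return abs(x[0]) + 2 * abs(x[1])     # minimized at (0, 0)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

x = [1.0, 0.0]
g = [1.0, 2.0]                           # subgradient of f at x (page 2.33)
y = [0.0, 0.0]                           # the minimizer: f(y) < f(x)

# the step x - t*g increases f, yet moves closer to the minimizer:
# ||x - t*g - y||^2 = 1 - 2t + 5t^2 < 1 for 0 < t < 2/5
for t in (0.05, 0.1, 0.2):
    xt = [x[0] - t * g[0], x[1] - t * g[1]]
    assert f(xt) > f(x)                  # not a descent direction for f
    assert dist(xt, y) < dist(x, y)      # but descent for distance to y
```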
Subgradients 2.35

References
• A. Beck, First-Order Methods in Optimization (2017), chapter 3.
• D. P. Bertsekas, A. Nedić, A. E. Ozdaglar, Convex Analysis and Optimization (2003), chapter 4.
• J.-B. Hiriart-Urruty, C. Lemaréchal, Convex Analysis and Minimization Algorithms (1993), chapter VI.
• Yu. Nesterov, Lectures on Convex Optimization (2018), section 3.1.
• B. T. Polyak, Introduction to Optimization (1987), section 5.1.
Subgradients 2.36