Subgradients
Ryan Tibshirani, Convex Optimization 10-725/36-725

Last time: gradient descent
Consider the problem

min_x f(x)

for f convex and differentiable, dom(f) = R^n. Gradient descent: choose initial x^(0) ∈ R^n, repeat

x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),  k = 1, 2, 3, ...
Step sizes t_k chosen to be fixed and small, or by backtracking line search
If ∇f is Lipschitz, gradient descent has convergence rate O(1/ε), i.e., it needs O(1/ε) iterations to reach ε-suboptimality
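As a refresher, here is a minimal sketch of gradient descent with backtracking line search; the function names and the quadratic test problem are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(f, grad, x0, beta=0.5, max_iter=100):
    """Gradient descent with backtracking: shrink t by factor beta until
    the sufficient-decrease condition f(x - t*g) <= f(x) - (t/2)||g||^2 holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        t = 1.0
        while f(x - t * g) > f(x) - 0.5 * t * (g @ g):
            t *= beta
        x = x - t * g
    return x

# Illustrative problem: f(x) = 1/2 ||x - c||_2^2, minimized at x = c
c = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: 0.5 * (x - c) @ (x - c),
                          lambda x: x - c,
                          np.zeros(2))
```

For this quadratic the very first backtracking test already succeeds at t = 1, so the iterates jump straight to c.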
Downsides:
• Requires f differentiable ← next lecture
• Can be slow to converge ← two lectures from now
Outline
Today: crucial mathematical underpinnings!
• Subgradients
• Examples
• Subgradient rules
• Optimality characterizations
Subgradients
Remember that for convex and differentiable f,
f(y) ≥ f(x) + ∇f(x)^T (y − x)  for all x, y
I.e., linear approximation always underestimates f
A subgradient of a convex function f at x is any g ∈ R^n such that

f(y) ≥ f(x) + g^T (y − x)  for all y
• Always exists
• If f is differentiable at x, then g = ∇f(x) uniquely
• Actually, the same definition works for nonconvex f (however, subgradients need not exist)
Examples of subgradients
Consider f : R → R, f(x) = |x|

[Figure: plot of f(x) = |x| over x ∈ [−2, 2]]
• For x ≠ 0, unique subgradient g = sign(x)
• For x = 0, subgradient g is any element of [−1, 1]
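A quick numerical check of the claim at x = 0 (the grids are arbitrary): every g ∈ [−1, 1] satisfies the subgradient inequality |y| ≥ |0| + g·(y − 0) for all y.

```python
import numpy as np

ys = np.linspace(-3, 3, 121)
# Each g in [-1, 1] gives a line through the origin lying below |y|
ok = all(np.all(np.abs(ys) >= g * ys) for g in np.linspace(-1, 1, 21))
```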
Consider f : R^n → R, f(x) = ‖x‖_2

[Figure: surface plot of f(x) = ‖x‖_2 over (x_1, x_2)]
• For x ≠ 0, unique subgradient g = x/‖x‖_2
• For x = 0, subgradient g is any element of {z : ‖z‖_2 ≤ 1}
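A hedged sketch of this rule in code; the particular test point and the random check points are just for verifying the subgradient inequality numerically:

```python
import numpy as np

def subgrad_l2(x):
    """One subgradient of f(x) = ||x||_2: x/||x||_2 for x != 0;
    at x = 0 we may pick any g with ||g||_2 <= 1, e.g. g = 0."""
    n = np.linalg.norm(x)
    return x / n if n > 0 else np.zeros_like(x)

# Check f(y) >= f(x) + g^T (y - x) at random points
rng = np.random.default_rng(0)
x = np.array([3.0, -4.0])
g = subgrad_l2(x)   # = [0.6, -0.8]
holds = all(np.linalg.norm(y) >= np.linalg.norm(x) + g @ (y - x) - 1e-9
            for y in rng.normal(size=(100, 2)))
```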
Consider f : R^n → R, f(x) = ‖x‖_1

[Figure: surface plot of f(x) = ‖x‖_1 over (x_1, x_2)]
• For x_i ≠ 0, unique ith component g_i = sign(x_i)
• For x_i = 0, ith component g_i is any element of [−1, 1]
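A quick numerical sanity check of this componentwise rule (the test vector and random points are arbitrary):

```python
import numpy as np

x = np.array([1.5, -2.0, 0.0])
g = np.sign(x)   # g_i = sign(x_i) for x_i != 0; the choice 0 at x_3 = 0 lies in [-1, 1]

# Subgradient inequality ||y||_1 >= ||x||_1 + g^T (y - x) at random points
rng = np.random.default_rng(1)
holds = all(np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x) - 1e-9
            for y in rng.normal(size=(100, 3)))
```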
Let f_1, f_2 : R^n → R be convex and differentiable, and consider f(x) = max{f_1(x), f_2(x)}

[Figure: plot of f(x) = max{f_1(x), f_2(x)} over x ∈ [−2, 2]]
• For f_1(x) > f_2(x), unique subgradient g = ∇f_1(x)
• For f_2(x) > f_1(x), unique subgradient g = ∇f_2(x)
• For f_1(x) = f_2(x), subgradient g is any point on the line segment between ∇f_1(x) and ∇f_2(x)
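A small numerical illustration with the illustrative choices f_1(x) = x² and f_2(x) = 2x + 3 (not from the slides); they tie at x = −1 and x = 3:

```python
import numpy as np

f1, g1 = lambda x: x**2, lambda x: 2*x
f2, g2 = lambda x: 2*x + 3, lambda x: 2.0
f = lambda x: np.maximum(f1(x), f2(x))

def subgrad_max(x, theta=0.5):
    """Gradient of the active function; at a tie, any convex combination
    theta*g1 + (1-theta)*g2 with theta in [0, 1] is a subgradient."""
    if f1(x) > f2(x):
        return g1(x)
    if f2(x) > f1(x):
        return g2(x)
    return theta * g1(x) + (1 - theta) * g2(x)

# At the tie point x = 3: g1 = 6, g2 = 2, so any g in [2, 6] works
g = subgrad_max(3.0)   # 4.0 with theta = 0.5
ys = np.linspace(-5, 5, 201)
holds = np.all(f(ys) >= f(3.0) + g * (ys - 3.0) - 1e-9)
```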
Subdifferential
Set of all subgradients of convex f is called the subdifferential:
∂f(x) = {g ∈ R^n : g is a subgradient of f at x}
• ∂f(x) is closed and convex (even for nonconvex f)
• Nonempty (can be empty for nonconvex f)
• If f is differentiable at x, then ∂f(x) = {∇f(x)}
• If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g
Connection to convex geometry
Convex set C ⊆ R^n; consider the indicator function I_C : R^n → R,

I_C(x) = I{x ∈ C} = { 0 if x ∈ C;  ∞ if x ∉ C }
For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x; recall
N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}
Why? By definition of subgradient g,
I_C(y) ≥ I_C(x) + g^T (y − x)  for all y
• For y ∉ C, I_C(y) = ∞, so the inequality holds trivially
• For y ∈ C, this means 0 ≥ g^T (y − x)
[Figure: normal cones to a convex set at two boundary points]
Subgradient calculus
Basic rules for convex functions:
• Scaling: ∂(af) = a · ∂f, provided a > 0
• Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2
• Affine composition: if g(x) = f(Ax + b), then

∂g(x) = A^T ∂f(Ax + b)

• Finite pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then

∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) ),

the convex hull of the union of subdifferentials of all active functions at x
• General pointwise maximum: if f(x) = max_{s∈S} f_s(x), then

∂f(x) ⊇ cl conv( ∪_{s : f_s(x) = f(x)} ∂f_s(x) )

and under some regularity conditions (on S, f_s), we get equality above
• Norms: important special case, f(x) = ‖x‖_p. Let q be such that 1/p + 1/q = 1; then

‖x‖_p = max_{‖z‖_q ≤ 1} z^T x

Hence ∂f(x) = argmax_{‖z‖_q ≤ 1} z^T x
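For instance, with p = 1 (so q = ∞), a maximizer of z^T x over ‖z‖_∞ ≤ 1 is z_i = sign(x_i) wherever x_i ≠ 0, with z_i free in [−1, 1] where x_i = 0; this recovers the ℓ1 subdifferential from earlier. A quick check on an arbitrary vector:

```python
import numpy as np

x = np.array([3.0, -1.0, 0.0, 2.0])
z_star = np.sign(x)    # attains the max; the coordinate at x_i = 0 is free in [-1, 1]
attained = z_star @ x  # equals ||x||_1
```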
Why subgradients?
Subgradients are important for two reasons:
• Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality
• Convex optimization: if you can compute subgradients, then you can minimize (almost) any convex function
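The subgradient method itself is covered later in the course; here is a minimal one-dimensional sketch to make the second point concrete. The diminishing steps t_k = 1/k and best-iterate bookkeeping are standard choices; the test function is illustrative:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, n_iter=2000):
    """x^(k) = x^(k-1) - t_k * g^(k-1) with t_k = 1/k; track the best
    iterate, since subgradient steps need not decrease f."""
    x = x0
    best_x, best_f = x0, f(x0)
    for k in range(1, n_iter + 1):
        x = x - (1.0 / k) * subgrad(x)
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x, best_f

# Minimize the nondifferentiable f(x) = |x|, using subgrad(x) = sign(x)
best_x, best_f = subgradient_method(abs, np.sign, 5.0)
```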
Optimality condition
For any f (convex or not),
f(x⋆) = min_x f(x)  ⇐⇒  0 ∈ ∂f(x⋆)

I.e., x⋆ is a minimizer if and only if 0 is a subgradient of f at x⋆. This is called the subgradient optimality condition
Why? Easy: g = 0 being a subgradient means that for all y
f(y) ≥ f(x⋆) + 0^T (y − x⋆) = f(x⋆)
Note the implication for a convex and differentiable function f, with ∂f(x) = {∇f(x)}
Derivation of first-order optimality
Example of the power of subgradients: we can use what we have learned so far to derive the first-order optimality condition. Recall that for f convex and differentiable, the problem
min_x f(x) subject to x ∈ C

is solved at x if and only if

∇f(x)^T (y − x) ≥ 0  for all y ∈ C
Intuitively, this says that the gradient increases as we move away from x. How to see this? First recast the problem as
min_x f(x) + I_C(x)

Now apply subgradient optimality: 0 ∈ ∂(f(x) + I_C(x))
But

0 ∈ ∂(f(x) + I_C(x))
⇐⇒ 0 ∈ {∇f(x)} + N_C(x)
⇐⇒ −∇f(x) ∈ N_C(x)
⇐⇒ −∇f(x)^T x ≥ −∇f(x)^T y  for all y ∈ C
⇐⇒ ∇f(x)^T (y − x) ≥ 0  for all y ∈ C

as desired
Note: the condition 0 ∈ ∂f(x) + N_C(x) is a fully general condition for optimality in a convex problem. But it is not always easy to work with (the KKT conditions, later, are easier)
Example: lasso optimality conditions
Given y ∈ R^n, X ∈ R^{n×p}, the lasso problem can be parametrized as

min_{β ∈ R^p} (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1

where λ ≥ 0. Subgradient optimality:

0 ∈ ∂( (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1 )
⇐⇒ 0 ∈ −X^T (y − Xβ) + λ∂‖β‖_1
⇐⇒ X^T (y − Xβ) = λv for some v ∈ ∂‖β‖_1, i.e.,

v_i ∈ { {1} if β_i > 0;  {−1} if β_i < 0;  [−1, 1] if β_i = 0 },  i = 1, ..., p
Write X_1, ..., X_p for the columns of X. Then subgradient optimality reads:

X_i^T (y − Xβ) = λ · sign(β_i)  if β_i ≠ 0
|X_i^T (y − Xβ)| ≤ λ  if β_i = 0
Note: the subgradient optimality conditions do not directly lead to an expression for a lasso solution ... however they do provide a way to check lasso optimality
They are also helpful in understanding the lasso estimator; e.g., if |X_i^T (y − Xβ)| < λ, then β_i = 0
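These conditions translate directly into a numerical optimality check. The function name and tolerance below are my own; the X = I test uses the closed-form soft-thresholding solution derived on the next slide:

```python
import numpy as np

def lasso_optimal(X, y, beta, lam, tol=1e-8):
    """Check subgradient optimality for the lasso:
    X_i^T (y - X beta) = lam * sign(beta_i) where beta_i != 0,
    |X_i^T (y - X beta)| <= lam where beta_i == 0."""
    r = X.T @ (y - X @ beta)
    active = beta != 0
    return (np.all(np.abs(r[active] - lam * np.sign(beta[active])) <= tol)
            and np.all(np.abs(r[~active]) <= lam + tol))

# With X = I the solution is elementwise soft-thresholding of y
y = np.array([2.0, -0.5, 1.5])
lam = 1.0
beta_hat = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
ok = lasso_optimal(np.eye(3), y, beta_hat, lam)
bad = lasso_optimal(np.eye(3), y, y, lam)   # beta = y is not optimal for lam > 0
```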
Example: soft-thresholding
Simplified lasso problem with X = I:

min_{β ∈ R^n} (1/2)‖y − β‖_2^2 + λ‖β‖_1

This we can solve directly using subgradient optimality. The solution is β = S_λ(y), where S_λ is the soft-thresholding operator:

[S_λ(y)]_i = { y_i − λ if y_i > λ;  0 if −λ ≤ y_i ≤ λ;  y_i + λ if y_i < −λ },  i = 1, ..., n
Check: from the last slide, the subgradient optimality conditions are

y_i − β_i = λ · sign(β_i)  if β_i ≠ 0
|y_i − β_i| ≤ λ  if β_i = 0
Now plug in β = S_λ(y) and check these are satisfied:
• When y_i > λ, β_i = y_i − λ > 0, so y_i − β_i = λ = λ · 1
• When y_i < −λ, the argument is similar
• When |y_i| ≤ λ, β_i = 0, and |y_i − β_i| = |y_i| ≤ λ

[Figure: soft-thresholding in one variable, the map y_i ↦ [S_λ(y)]_i]
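The operator and the check above, in code (the function name is illustrative):

```python
import numpy as np

def soft_threshold(y, lam):
    """Elementwise soft-thresholding: shrink toward 0 by lam, zeroing
    every coordinate with |y_i| <= lam."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([2.0, 0.3, -1.7])
beta = soft_threshold(y, 1.0)   # -> [1.0, 0.0, -0.7]

# Verify the subgradient optimality conditions slide by slide:
active = beta != 0
assert np.all(np.isclose(y[active] - beta[active], 1.0 * np.sign(beta[active])))
assert np.all(np.abs(y[~active]) <= 1.0)
```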
Example: distance to a convex set
Recall the distance function to a convex set C:
dist(x, C) = min_{y∈C} ‖y − x‖_2
This is a convex function. What are its subgradients?
Write dist(x, C) = ‖x − P_C(x)‖_2, where P_C(x) is the projection of x onto C. Then when dist(x, C) > 0,

∂dist(x, C) = { (x − P_C(x)) / ‖x − P_C(x)‖_2 }

This has only one element, so in fact dist(x, C) is differentiable and this is its gradient
We will only show one direction, i.e., that

(x − P_C(x)) / ‖x − P_C(x)‖_2 ∈ ∂dist(x, C)

Write u = P_C(x). Then by the first-order optimality conditions for a projection,

(u − x)^T (y − u) ≥ 0  for all y ∈ C

Hence C ⊆ H = {y : (u − x)^T (y − u) ≥ 0}
Claim: for any y,
dist(y, C) ≥ (x − u)^T (y − u) / ‖x − u‖_2

Check: first, for y ∈ H, the right-hand side is ≤ 0
Now for y ∉ H, we have (x − u)^T (y − u) = ‖x − u‖_2 ‖y − u‖_2 cos θ, where θ is the angle between x − u and y − u. Thus

(x − u)^T (y − u) / ‖x − u‖_2 = ‖y − u‖_2 cos θ = dist(y, H) ≤ dist(y, C)

as desired
Using the claim, we have for any y
dist(y, C) ≥ (x − u)^T (y − x + x − u) / ‖x − u‖_2
= ‖x − u‖_2 + ((x − u) / ‖x − u‖_2)^T (y − x)
Hence g = (x − u)/‖x − u‖_2 is a subgradient of dist(x, C) at x
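A concrete check, taking C to be the unit box [−1, 1]^n (an illustrative choice, for which P_C is coordinatewise clipping):

```python
import numpy as np

def proj_box(x):
    """Projection onto C = [-1, 1]^n is coordinatewise clipping."""
    return np.clip(x, -1.0, 1.0)

def dist_box(x):
    return np.linalg.norm(x - proj_box(x))

x = np.array([3.0, 0.5])
u = proj_box(x)                       # u = [1.0, 0.5]
g = (x - u) / np.linalg.norm(x - u)   # the gradient when dist(x, C) > 0

# Subgradient inequality dist(y, C) >= dist(x, C) + g^T (y - x) at random y
rng = np.random.default_rng(2)
holds = all(dist_box(y) >= dist_box(x) + g @ (y - x) - 1e-9
            for y in rng.normal(scale=3.0, size=(200, 2)))
```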
References and further reading
• S. Boyd, lecture notes for EE 264B, Stanford University, Spring 2010–2011
• R. T. Rockafellar (1970), Convex Analysis, Chapters 23–25
• L. Vandenberghe, lecture notes for EE 236C, UCLA, Spring 2011–2012