
Subgradients

Ryan Tibshirani
10-725/36-725

Last time: gradient descent

Consider the problem

min_x f(x)

for f convex and differentiable, dom(f) = R^n. Gradient descent: choose initial x^(0) ∈ R^n, repeat

x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, ...

Step sizes t_k chosen to be fixed and small, or by backtracking line search

If ∇f is Lipschitz, gradient descent has convergence rate O(1/ε), i.e., it needs O(1/ε) iterations to reach ε-suboptimality

Downsides:
• Requires f differentiable ← next lecture
• Can be slow to converge ← two lectures from now
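To make the recap concrete, here is a minimal sketch of the iteration above with a fixed step size (the quadratic example objective, the step size t = 0.1, and the iteration count are my own illustrative choices, not from the slides):

# Minimal gradient descent sketch with a fixed step size (illustrative only).
import numpy as np

def grad_descent(grad, x0, t=0.1, iters=100):
    # x^(k) = x^(k-1) - t * grad(x^(k-1))
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# Example: minimize f(x) = 0.5 * ||x - a||_2^2, whose gradient is x - a.
a = np.array([1.0, -2.0])
x_star = grad_descent(lambda x: x - a, x0=np.zeros(2))
print(x_star)   # approximately equal to a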

Outline

Today: crucial mathematical underpinnings!
• Subgradients
• Examples
• Subgradient rules
• Optimality characterizations

Subgradients

Remember that for convex and differentiable f,

f(y) ≥ f(x) + ∇f(x)^T (y − x) for all x, y

I.e., the first-order approximation always underestimates f

A subgradient of a convex function f at x is any g ∈ R^n such that

f(y) ≥ f(x) + g^T (y − x) for all y

• Always exists
• If f is differentiable at x, then g = ∇f(x) uniquely
• Actually, the same definition works for nonconvex f (however, subgradients need not exist)

Examples of subgradients

Consider f : R → R, f(x) = |x|

[Plot of f(x) = |x| over x ∈ [−2, 2]]

• For x ≠ 0, unique subgradient g = sign(x)
• For x = 0, subgradient g is any element of [−1, 1]
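A minimal numerical sketch of the x = 0 case (the grid of test points, the candidate values of g, and the tolerance are my own choices): every g ∈ [−1, 1] should satisfy the subgradient inequality |y| ≥ |0| + g · (y − 0) for all y.

# Check that every g in [-1, 1] is a subgradient of f(x) = |x| at x = 0,
# i.e., |y| >= g * y holds for all test points y.
import numpy as np

y = np.linspace(-2, 2, 401)
for g in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    assert np.all(np.abs(y) >= g * y - 1e-12)
print("every tested g in [-1, 1] satisfies the subgradient inequality at x = 0")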

Consider f : R^n → R, f(x) = ‖x‖_2

[Plot of f(x) = ‖x‖_2 as a function of (x_1, x_2)]

• For x ≠ 0, unique subgradient g = x/‖x‖_2

• For x = 0, subgradient g is any element of {z : ‖z‖_2 ≤ 1}

Consider f : R^n → R, f(x) = ‖x‖_1

[Plot of f(x) = ‖x‖_1 as a function of (x_1, x_2)]

• For x_i ≠ 0, unique ith component g_i = sign(x_i)

• For x_i = 0, ith component g_i is any element of [−1, 1]
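A short sketch of one valid subgradient of ‖x‖_1, choosing 0 at zero components (any value in [−1, 1] would do there); the helper name and the spot-check are my own:

# One subgradient of f(x) = ||x||_1: sign(x_i) where x_i != 0, and 0 (an
# arbitrary choice from [-1, 1]) where x_i == 0.
import numpy as np

def l1_subgradient(x):
    return np.sign(x)   # np.sign returns 0 at exact zeros

x = np.array([1.5, 0.0, -2.0])
g = l1_subgradient(x)
# Spot-check the subgradient inequality ||y||_1 >= ||x||_1 + g^T (y - x):
rng = np.random.default_rng(0)
for _ in range(1000):
    y = rng.normal(size=3)
    assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x) - 1e-10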

Let f_1, f_2 : R^n → R be convex and differentiable, and consider f(x) = max{f_1(x), f_2(x)}

[Plot of f(x) = max{f_1(x), f_2(x)} over x ∈ [−2, 2]]

• For f_1(x) > f_2(x), unique subgradient g = ∇f_1(x)

• For f_2(x) > f_1(x), unique subgradient g = ∇f_2(x)

• For f_1(x) = f_2(x), subgradient g is any point on the line segment between ∇f_1(x) and ∇f_2(x)
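As a concrete sketch of the tie case (the specific functions are my own example): take f_1(x) = (x + 1)^2 and f_2(x) = (x − 1)^2, which tie at x = 0; any convex combination of f_1'(0) = 2 and f_2'(0) = −2 should be a subgradient of the max there.

# At a tie point of f(x) = max{f1(x), f2(x)}, any convex combination of the
# two gradients is a subgradient. Example: f1(x) = (x+1)^2, f2(x) = (x-1)^2.
import numpy as np

f = lambda x: np.maximum((x + 1) ** 2, (x - 1) ** 2)
y = np.linspace(-3, 3, 601)
for theta in [0.0, 0.25, 0.5, 0.75, 1.0]:
    g = theta * 2.0 + (1 - theta) * (-2.0)   # theta*f1'(0) + (1-theta)*f2'(0)
    assert np.all(f(y) >= f(0.0) + g * (y - 0.0) - 1e-12)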

Subdifferential

Set of all subgradients of convex f is called the subdifferential:

∂f(x) = {g ∈ R^n : g is a subgradient of f at x}

• ∂f(x) is closed and convex (even for nonconvex f)
• Nonempty (can be empty for nonconvex f)
• If f is differentiable at x, then ∂f(x) = {∇f(x)}
• If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g

Connection to convex geometry

For C ⊆ R^n, consider the indicator function I_C : R^n → R,

I_C(x) = I{x ∈ C} = 0 if x ∈ C, and ∞ if x ∉ C

For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x; recall

N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}

Why? By definition of subgradient g,

I_C(y) ≥ I_C(x) + g^T (y − x) for all y

• For y ∉ C, I_C(y) = ∞ and the inequality holds trivially
• For y ∈ C, this means 0 ≥ g^T (y − x), i.e., g^T x ≥ g^T y, which is exactly the normal cone condition
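A tiny numerical illustration with C taken to be the unit box [0, 1]^2 (my own example, not from the slides): at the corner x = (1, 1), any g with nonnegative entries satisfies g^T x ≥ g^T y for all y ∈ C, and hence lies in N_C(x) = ∂I_C(x).

# Normal cone of the unit box C = [0,1]^2 at the corner x = (1,1):
# any g >= 0 satisfies g^T x >= g^T y for every y in C.
import numpy as np

rng = np.random.default_rng(4)
x = np.array([1.0, 1.0])
g = np.array([0.7, 2.0])                   # a vector with nonnegative entries
Y = rng.uniform(0.0, 1.0, size=(1000, 2))  # random points of C
assert np.all(Y @ g <= g @ x + 1e-12)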


Subgradient calculus

Basic rules for convex functions:

• Scaling: ∂(af) = a · ∂f provided a > 0

• Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2

• Affine composition: if g(x) = f(Ax + b), then

∂g(x) = A^T ∂f(Ax + b)

• Finite pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then

∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) )

i.e., the convex hull of the union of subdifferentials of the active functions at x

• General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then

∂f(x) ⊇ cl( conv( ∪_{s : f_s(x) = f(x)} ∂f_s(x) ) )

and under some regularity conditions (on S, f_s), we get an equality above

• Norms: important special case, f(x) = ‖x‖_p. Let q be such that 1/p + 1/q = 1, then

‖x‖_p = max_{‖z‖_q ≤ 1} z^T x

Hence ∂f(x) = argmax_{‖z‖_q ≤ 1} z^T x
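For instance (a worked case of my own, with p = ∞ and q = 1, where e_i denotes the ith standard basis vector): if the largest-magnitude entry x_i of x is unique, then sign(x_i) e_i attains max_{‖z‖_1 ≤ 1} z^T x, so it should be a subgradient of ‖x‖_∞ at x.

# Subgradient of f(x) = ||x||_inf via the dual-norm characterization:
# when the largest |x_i| is unique, g = sign(x_i) * e_i is a subgradient.
import numpy as np

x = np.array([0.3, -2.0, 1.1])
i = np.argmax(np.abs(x))
g = np.zeros_like(x)
g[i] = np.sign(x[i])

rng = np.random.default_rng(2)
for _ in range(1000):
    y = rng.normal(size=3)
    assert np.max(np.abs(y)) >= np.max(np.abs(x)) + g @ (y - x) - 1e-10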

Why subgradients?

Subgradients are important for two reasons:

• Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality

• Convex optimization: if you can compute subgradients, then you can minimize (almost) any convex function

Optimality condition

For any f (convex or not),

f(x*) = min_x f(x)  ⇐⇒  0 ∈ ∂f(x*)

I.e., x* is a minimizer if and only if 0 is a subgradient of f at x*. This is called the subgradient optimality condition

Why? Easy: g = 0 being a subgradient means that for all y

f(y) ≥ f(x*) + 0^T (y − x*) = f(x*)

Note the implication for a convex and differentiable function f, with ∂f(x) = {∇f(x)}: the condition reduces to ∇f(x*) = 0

Derivation of first-order optimality

Example of the power of subgradients: we can use what we have learned so far to derive the first-order optimality condition. Recall that for f convex and differentiable, the problem

min_x f(x) subject to x ∈ C

is solved at x if and only if

∇f(x)^T (y − x) ≥ 0 for all y ∈ C

Intuitively, this says that f cannot decrease to first order as we move from x toward any other point of C. How to see this? First recast the problem as

min_x f(x) + I_C(x)

Now apply subgradient optimality: 0 ∈ ∂(f(x) + I_C(x))

But

0 ∈ ∂(f(x) + I_C(x))
⇐⇒ 0 ∈ {∇f(x)} + N_C(x)
⇐⇒ −∇f(x) ∈ N_C(x)
⇐⇒ −∇f(x)^T x ≥ −∇f(x)^T y for all y ∈ C
⇐⇒ ∇f(x)^T (y − x) ≥ 0 for all y ∈ C

as desired

Note: the condition 0 ∈ ∂f(x) + N_C(x) is a fully general condition for optimality in a convex problem. But this is not always easy to work with (KKT conditions, later, are easier)

Example: lasso optimality conditions

Given y ∈ R^n, X ∈ R^{n×p}, the lasso problem can be parametrized as:

min_{β ∈ R^p} (1/2) ‖y − Xβ‖_2^2 + λ‖β‖_1

where λ ≥ 0. Subgradient optimality:

0 ∈ ∂( (1/2) ‖y − Xβ‖_2^2 + λ‖β‖_1 )
⇐⇒ 0 ∈ −X^T (y − Xβ) + λ ∂‖β‖_1
⇐⇒ X^T (y − Xβ) = λ v for some v ∈ ∂‖β‖_1, i.e.,

v_i ∈ { {1}      if β_i > 0
      { {−1}     if β_i < 0 ,   i = 1, ..., p
      { [−1, 1]  if β_i = 0

Write X_1, ..., X_p for the columns of X. Then subgradient optimality reads:

X_i^T (y − Xβ) = λ · sign(β_i)   if β_i ≠ 0
|X_i^T (y − Xβ)| ≤ λ             if β_i = 0

Note: the subgradient optimality conditions do not directly lead to an expression for a lasso solution ... however they do provide a way to check lasso optimality

They are also helpful in understanding the lasso estimator; e.g., if |X_i^T (y − Xβ)| < λ, then β_i = 0
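A sketch of such a check (the function name, tolerance, and interface are my own placeholders): given a candidate solution, verify the two conditions above coordinate by coordinate.

# Check the lasso subgradient optimality conditions for a candidate beta:
#   X_i^T (y - X beta) = lam * sign(beta_i)  if beta_i != 0
#   |X_i^T (y - X beta)| <= lam              if beta_i == 0
import numpy as np

def is_lasso_optimal(X, y, beta, lam, tol=1e-8):
    r = X.T @ (y - X @ beta)            # X^T (y - X beta), one entry per coordinate
    nonzero = beta != 0
    ok_nonzero = np.all(np.abs(r[nonzero] - lam * np.sign(beta[nonzero])) <= tol)
    ok_zero = np.all(np.abs(r[~nonzero]) <= lam + tol)
    return ok_nonzero and ok_zero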

Example: soft-thresholding

Simplified lasso problem with X = I:

min_{β ∈ R^n} (1/2) ‖y − β‖_2^2 + λ‖β‖_1

This we can solve directly using subgradient optimality. Solution is β = S_λ(y), where S_λ is the soft-thresholding operator:

[S_λ(y)]_i = { y_i − λ   if y_i > λ
             { 0         if −λ ≤ y_i ≤ λ ,   i = 1, ..., n
             { y_i + λ   if y_i < −λ

Check: from the last slide, subgradient optimality conditions are

y_i − β_i = λ · sign(β_i)   if β_i ≠ 0
|y_i − β_i| ≤ λ             if β_i = 0

Now plug in β = S_λ(y) and check these are satisfied:

• When y_i > λ, β_i = y_i − λ > 0, so y_i − β_i = λ = λ · 1

• When y_i < −λ, the argument is similar

• When |y_i| ≤ λ, β_i = 0, and |y_i − β_i| = |y_i| ≤ λ

[Plot of soft-thresholding in one variable]
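A minimal sketch of S_λ in code, together with a check of the optimality conditions above on random data (the data and the choice λ = 0.5 are my own placeholders):

# Soft-thresholding operator S_lambda(y), the closed-form solution of the
# simplified lasso problem min_beta 0.5 * ||y - beta||_2^2 + lam * ||beta||_1.
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Verify subgradient optimality (X = I case) on random data.
rng = np.random.default_rng(3)
y, lam = rng.normal(size=6), 0.5
beta = soft_threshold(y, lam)
r = y - beta
nz = beta != 0
assert np.allclose(r[nz], lam * np.sign(beta[nz]))
assert np.all(np.abs(r[~nz]) <= lam + 1e-12)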

Example: distance to a convex set

Recall the distance function to a convex set C:

dist(x, C) = min_{y ∈ C} ‖y − x‖_2

This is a convex function. What are its subgradients?

Write dist(x, C) = ‖x − P_C(x)‖_2, where P_C(x) is the projection of x onto C. Then when dist(x, C) > 0,

∂dist(x, C) = { (x − P_C(x)) / ‖x − P_C(x)‖_2 }

Only has one element, so in fact dist(x, C) is differentiable and this is its gradient
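A sketch with C taken to be the unit Euclidean ball (my own choice, since its projection is simple): the claimed gradient (x − P_C(x))/‖x − P_C(x)‖_2 should match a finite-difference estimate of the distance function.

# Gradient of dist(x, C) for C = unit Euclidean ball, compared against a
# finite-difference approximation of the distance function.
import numpy as np

def proj_ball(x):                     # projection onto {z : ||z||_2 <= 1}
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n

def dist(x):
    return np.linalg.norm(x - proj_ball(x))

x = np.array([2.0, -1.0, 0.5])        # a point outside the ball
g = (x - proj_ball(x)) / dist(x)      # claimed gradient of dist at x

eps = 1e-6
g_fd = np.array([(dist(x + eps * e) - dist(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
assert np.allclose(g, g_fd, atol=1e-5)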

We will only show one direction, i.e., that

(x − P_C(x)) / ‖x − P_C(x)‖_2 ∈ ∂dist(x, C)

Write u = P_C(x). Then by the first-order optimality conditions for a projection,

(u − x)^T (y − u) ≥ 0 for all y ∈ C

Hence C ⊆ H = {y : (u − x)^T (y − u) ≥ 0}

Claim: for any y,

dist(y, C) ≥ (x − u)^T (y − u) / ‖x − u‖_2

Check: first, for y ∈ H, the right-hand side is ≤ 0 while dist(y, C) ≥ 0, so the claim holds

Now for y ∉ H, we have (x − u)^T (y − u) = ‖x − u‖_2 ‖y − u‖_2 cos θ, where θ is the angle between x − u and y − u. Thus

(x − u)^T (y − u) / ‖x − u‖_2 = ‖y − u‖_2 cos θ = dist(y, H) ≤ dist(y, C)

as desired

Using the claim, we have for any y

dist(y, C) ≥ (x − u)^T (y − x + x − u) / ‖x − u‖_2
           = ‖x − u‖_2 + ( (x − u)/‖x − u‖_2 )^T (y − x)

Hence g = (x − u)/‖x − u‖_2 is a subgradient of dist(x, C) at x

References and further reading

• S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010-2011
• R. T. Rockafellar (1970), "Convex analysis", Chapters 23–25
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012
