
Subgradients
Barnabas Poczos & Ryan Tibshirani
Convex Optimization 10-725/36-725

Recall gradient descent

We want to solve

    min_{x ∈ R^n} f(x),

for f convex and differentiable.

Gradient descent: choose initial x^(0) ∈ R^n, repeat

    x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, ...

If ∇f is Lipschitz, gradient descent has convergence rate O(1/k).

Downsides:
• Requires f differentiable (addressed next lecture)
• Can be slow to converge (addressed two lectures from now)

Outline

Today:
• Subgradients
• Examples
• Subgradient rules
• Optimality characterizations

Subgradients

Remember that for convex, differentiable f : R^n → R,

    f(y) ≥ f(x) + ∇f(x)^T (y − x)   for all x, y.

I.e., the linear approximation always underestimates f.

A subgradient of convex f : R^n → R at x is any g ∈ R^n such that

    f(y) ≥ f(x) + g^T (y − x)   for all y.

• Always exists
• If f is differentiable at x, then g = ∇f(x) uniquely
• Actually, the same definition works for nonconvex f (however, subgradients need not exist)

Examples

Consider f : R → R, f(x) = |x|.

[Plot of f(x) = |x|]

• For x ≠ 0, unique subgradient g = sign(x)
• For x = 0, subgradient g is any element of [−1, 1]

Consider f : R^n → R, f(x) = ‖x‖_2.

[Surface plot of f(x) = ‖x‖_2 over (x_1, x_2)]

• For x ≠ 0, unique subgradient g = x/‖x‖_2
• For x = 0, subgradient g is any element of {z : ‖z‖_2 ≤ 1}

Consider f : R^n → R, f(x) = ‖x‖_1.

[Surface plot of f(x) = ‖x‖_1 over (x_1, x_2)]

• For x_i ≠ 0, the ith component is uniquely g_i = sign(x_i)
• For x_i = 0, the ith component g_i is any element of [−1, 1]

Let f_1, f_2 : R^n →
R be convex and differentiable, and consider f(x) = max{f_1(x), f_2(x)}.

[Plot of f(x) = max{f_1(x), f_2(x)} in one variable]

• For f_1(x) > f_2(x), unique subgradient g = ∇f_1(x)
• For f_2(x) > f_1(x), unique subgradient g = ∇f_2(x)
• For f_1(x) = f_2(x), subgradient g is any point on the line segment between ∇f_1(x) and ∇f_2(x)

Subdifferential

The set of all subgradients of convex f is called the subdifferential:

    ∂f(x) = {g ∈ R^n : g is a subgradient of f at x}

• ∂f(x) is closed and convex (even for nonconvex f)
• Nonempty for convex f (can be empty for nonconvex f)
• If f is differentiable at x, then ∂f(x) = {∇f(x)}
• If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g

Connection to convex geometry

For a convex set C ⊆ R^n, consider the indicator function I_C : R^n → R,

    I_C(x) = I{x ∈ C} = 0 if x ∈ C, ∞ if x ∉ C.

For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x:

    N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}

Why? Recall the definition of a subgradient g:

    I_C(y) ≥ I_C(x) + g^T (y − x)   for all y

• For y ∉ C, I_C(y) = ∞ and the inequality holds trivially
• For y ∈ C, this means 0 ≥ g^T (y − x)

[Figure: normal cones at points on the boundary of a convex set]

Subgradient calculus

Basic rules for convex functions:
• Scaling: ∂(af) = a · ∂f, provided a > 0
• Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2
• Affine composition: if g(x) = f(Ax + b), then ∂g(x) = A^T ∂f(Ax + b)
• Finite pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then

      ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) ),

  the convex hull of the union of subdifferentials of all active functions at x
• General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then

      ∂f(x) ⊇ cl{ conv( ∪_{s : f_s(x) = f(x)} ∂f_s(x) ) },

  and under some regularity conditions (on S and f_s) equality holds
• Norms: an important special case is f(x) = ‖x‖_p. Let q be such that 1/p + 1/q = 1; then

      ‖x‖_p = max_{‖z‖_q ≤ 1} z^T x.

  Hence

      ∂f(x) = { y : ‖y‖_q ≤ 1 and y^T x = max_{‖z‖_q ≤ 1} z^T x }

Why subgradients?
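The subgradient definition and the ℓ1 example above lend themselves to a quick numerical check. A minimal sketch in Python with NumPy (the helper names are my own, not from the slides): it takes the subgradient g with g_i = sign(x_i), which is valid since 0 ∈ [−1, 1] at the zero components, and verifies the defining inequality f(y) ≥ f(x) + g^T (y − x) at many random points y.

```python
import numpy as np

def l1_subgradient(x):
    """One subgradient of f(x) = ||x||_1: sign(x_i) where x_i != 0,
    and the (valid) choice 0 from [-1, 1] where x_i == 0."""
    return np.sign(x)

def check_subgradient(f, g, x, trials=1000, seed=0):
    """Empirically check f(y) >= f(x) + g^T (y - x) at random points y."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        y = rng.normal(size=x.shape)
        if f(y) < f(x) + g @ (y - x) - 1e-9:
            return False
    return True

f = lambda v: np.abs(v).sum()      # f(x) = ||x||_1
x = np.array([1.5, 0.0, -2.0])
g = l1_subgradient(x)              # g = [1, 0, -1]
print(check_subgradient(f, g, x))  # True: the inequality holds at every sampled y
```

Note this is only a sampled sanity check, not a proof; replacing g with a vector whose entries exceed 1 in absolute value makes the inequality fail at some points.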
Subgradients are important for two reasons:
• Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality
• Convex optimization: if you can compute subgradients, then you can minimize (almost) any convex function

Optimality condition

For any f (convex or not),

    f(x*) = min_{x ∈ R^n} f(x)   ⟺   0 ∈ ∂f(x*)

I.e., x* is a minimizer if and only if 0 is a subgradient of f at x*.

Why? Easy: g = 0 being a subgradient means that for all y,

    f(y) ≥ f(x*) + 0^T (y − x*) = f(x*).

Note the implication for the differentiable case, where ∂f(x) = {∇f(x)}.

Projection onto a convex set

Given a closed, convex set C ⊆ R^n and a point y ∈ R^n, we define the projection operator onto C as

    P_C(y) = argmin_{x ∈ C} ‖y − x‖_2

Optimality characterization: x* = P_C(y) if and only if

    ⟨y − x*, x* − x⟩ ≥ 0   for all x ∈ C.

This is sometimes called a variational inequality.

How to see this? Note that x* = P_C(y) minimizes the criterion

    f(x) = (1/2)‖y − x‖_2^2 + I_C(x),

where I_C is the indicator function of C. Hence we know this is equivalent to

    0 ∈ ∂f(x*) = −(y − x*) + N_C(x*),

i.e.,

    y − x* ∈ N_C(x*),

which exactly means

    (y − x*)^T x* ≥ (y − x*)^T x   for all x ∈ C.

Soft-thresholding

The lasso problem can be parametrized as

    min_{β ∈ R^p} (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1,

where λ ≥ 0. Consider the simplified problem with X = I:

    min_{β ∈ R^n} (1/2)‖y − β‖_2^2 + λ‖β‖_1

Claim: the solution of the simple problem is β̂ = S_λ(y), where S_λ is the soft-thresholding operator,

    [S_λ(y)]_i = y_i − λ   if y_i > λ
                 0         if −λ ≤ y_i ≤ λ
                 y_i + λ   if y_i < −λ

Why? Subgradients of f(β) = (1/2)‖y − β‖_2^2 + λ‖β‖_1 are

    g = β − y + λs,

where s_i = sign(β_i) if β_i ≠ 0 and s_i ∈ [−1, 1] if β_i = 0.

Now just plug in β = S_λ(y) and check that we can get g = 0.

[Plot of the soft-thresholding operator in one variable]

References

• S. Boyd, Lecture Notes for EE 264B, Stanford University, Spring 2010-2011
• R. T. Rockafellar (1970), "Convex Analysis", Chapters 23-25
• L.
Vandenberghe, Lecture Notes for EE 236C, UCLA, Spring 2011-2012
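As a numerical companion to the soft-thresholding slide, here is a minimal NumPy sketch (function names are illustrative, not from the slides): it applies S_λ componentwise and checks the optimality condition 0 ∈ ∂f(β̂) by recovering the s that would make the subgradient β̂ − y + λs vanish and verifying it satisfies the sign constraints.

```python
import numpy as np

def soft_threshold(y, lam):
    """Soft-thresholding operator S_lambda, applied componentwise."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def is_optimal(beta, y, lam, tol=1e-9):
    """Check 0 in subdifferential of f(b) = 0.5*||y - b||^2 + lam*||b||_1:
    need beta - y + lam*s = 0 for some s with s_i = sign(beta_i) when
    beta_i != 0, and s_i in [-1, 1] when beta_i == 0."""
    s_needed = (y - beta) / lam  # the s that would zero out the subgradient
    nz = beta != 0
    ok_nonzero = np.all(np.isclose(s_needed[nz], np.sign(beta[nz])))
    ok_zero = np.all(np.abs(s_needed[~nz]) <= 1 + tol)
    return bool(ok_nonzero and ok_zero)

y = np.array([3.0, 0.4, -1.2, -0.1])
lam = 0.5
beta_hat = soft_threshold(y, lam)      # [2.5, 0.0, -0.7, 0.0]
print(is_optimal(beta_hat, y, lam))    # True: 0 is a subgradient at beta_hat
```

Note how the two small components of y are mapped exactly to zero, while the large ones are shrunk toward zero by λ; by contrast, β = y itself fails the check whenever λ > 0.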