Subgradients
Barnabas Poczos & Ryan Tibshirani, Convex Optimization 10-725/36-725
Recall gradient descent

We want to solve min_{x ∈ R^n} f(x), for f convex and differentiable

Gradient descent: choose initial x^(0) ∈ R^n, repeat

    x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, ...

If ∇f is Lipschitz, gradient descent has convergence rate O(1/k)

Downsides:
• Requires f differentiable ← next lecture
• Can be slow to converge ← two lectures from now
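As a quick illustration, here is a minimal sketch of the iteration above in Python (the function name grad_descent, the fixed step size, and the quadratic test function are assumptions, not part of the slides):

```python
import numpy as np

def grad_descent(grad_f, x0, t, num_iters=100):
    """Run x^(k) = x^(k-1) - t * grad_f(x^(k-1)) with a fixed step size t."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - t * grad_f(x)
    return x

# Example: minimize f(x) = (1/2) * ||x - c||_2^2, whose gradient is x - c
c = np.array([1.0, -2.0])
print(grad_descent(lambda x: x - c, x0=np.zeros(2), t=0.1))  # converges to c
```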
Outline

Today:
• Subgradients
• Examples
• Subgradient rules
• Optimality characterizations
Subgradients

Remember that for convex f : R^n → R,

    f(y) ≥ f(x) + ∇f(x)^T (y − x),   for all x, y

I.e., the linear approximation always underestimates f

A subgradient of convex f : R^n → R at x is any g ∈ R^n such that

    f(y) ≥ f(x) + g^T (y − x),   for all y

• Always exists
• If f differentiable at x, then g = ∇f(x) uniquely
• Actually, the same definition works for nonconvex f (however, subgradients need not exist)
Examples

Consider f : R → R, f(x) = |x|

[Plot of f(x) = |x|]

• For x ≠ 0, unique subgradient g = sign(x)
• For x = 0, subgradient g is any element of [−1, 1]
Consider f : R^n → R, f(x) = ‖x‖_2

[Surface plot of f(x) = ‖x‖_2 over (x_1, x_2)]

• For x ≠ 0, unique subgradient g = x/‖x‖_2
• For x = 0, subgradient g is any element of {z : ‖z‖_2 ≤ 1}
Consider f : R^n → R, f(x) = ‖x‖_1

[Surface plot of f(x) = ‖x‖_1 over (x_1, x_2)]

• For x_i ≠ 0, unique ith component g_i = sign(x_i)
• For x_i = 0, ith component g_i is any element of [−1, 1]
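A small sketch of this rule (the helper name l1_subgradient is hypothetical), returning one valid subgradient of ‖x‖_1 by picking g_i = 0 at coordinates where x_i = 0:

```python
import numpy as np

def l1_subgradient(x):
    """One subgradient of ||x||_1: g_i = sign(x_i) if x_i != 0, and g_i = 0
    (an arbitrary choice from [-1, 1]) if x_i == 0."""
    return np.sign(x)

x = np.array([1.5, 0.0, -0.3])
g = l1_subgradient(x)
# Spot-check the subgradient inequality ||y||_1 >= ||x||_1 + g^T (y - x)
y = np.random.randn(3)
assert np.abs(y).sum() >= np.abs(x).sum() + g @ (y - x) - 1e-12
```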
Let f_1, f_2 : R^n → R be convex, differentiable, and consider f(x) = max{f_1(x), f_2(x)}

[Plot of f(x) = max{f_1(x), f_2(x)}]

• For f_1(x) > f_2(x), unique subgradient g = ∇f_1(x)
• For f_2(x) > f_1(x), unique subgradient g = ∇f_2(x)
• For f_1(x) = f_2(x), subgradient g is any point on the line segment between ∇f_1(x) and ∇f_2(x)
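A sketch of this case for one assumed pair of smooth convex functions (the quadratics below are illustrative, not from the slides); at a tie point, any convex combination of the two gradients is a valid subgradient:

```python
# Example pair: f1(x) = (x - 1)^2 and f2(x) = (x + 1)^2
f1, df1 = lambda x: (x - 1)**2, lambda x: 2*(x - 1)
f2, df2 = lambda x: (x + 1)**2, lambda x: 2*(x + 1)

def max_subgradient(x, theta=0.5):
    """Subgradient of f = max{f1, f2}: the gradient of the strictly larger
    function, or a convex combination (weight theta) of both at a tie."""
    if f1(x) > f2(x):
        return df1(x)
    if f2(x) > f1(x):
        return df2(x)
    return theta * df1(x) + (1 - theta) * df2(x)

print(max_subgradient(2.0))   # f2 active: returns 6.0
print(max_subgradient(0.0))   # tie: any value in [-2, 2] is valid; theta=0.5 gives 0.0
```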
Subdifferential

The set of all subgradients of convex f is called the subdifferential:

    ∂f(x) = {g ∈ R^n : g is a subgradient of f at x}

• ∂f(x) is closed and convex (even for nonconvex f)
• Nonempty for convex f (can be empty for nonconvex f)
• If f is differentiable at x, then ∂f(x) = {∇f(x)}
• If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g
Connection to convex geometry

Convex set C ⊆ R^n, consider the indicator function I_C : R^n → R,

    I_C(x) = I{x ∈ C} = 0 if x ∈ C,  ∞ if x ∉ C

For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x,

    N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}

Why? Recall the definition of subgradient g,

    I_C(y) ≥ I_C(x) + g^T (y − x) for all y

• For y ∉ C, I_C(y) = ∞, so the inequality holds trivially
• For y ∈ C, this means 0 ≥ g^T (y − x)
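A small numerical check of this normal-cone picture for an assumed example, the box C = [0, 1]^2 with a point on its boundary:

```python
import numpy as np

# Assumed example: C = [0, 1]^2, with x on the face {x_1 = 1}
x = np.array([1.0, 0.5])
g = np.array([2.0, 0.0])   # candidate normal vector: nonzero only in the active coordinate

# g in N_C(x) means g^T x >= g^T y for every y in C; spot-check on random points of C
ys = np.random.rand(1000, 2)        # uniform samples from the unit box
assert np.all(g @ x >= ys @ g - 1e-12)
```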
Subgradient calculus

Basic rules for convex functions:

• Scaling: ∂(af) = a · ∂f provided a > 0
• Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2
• Affine composition: if g(x) = f(Ax + b), then

      ∂g(x) = A^T ∂f(Ax + b)

• Finite pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then

      ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) ),

  the convex hull of the union of subdifferentials of all active functions at x
• General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then

      ∂f(x) ⊇ cl{ conv( ∪_{s : f_s(x) = f(x)} ∂f_s(x) ) }

  and under some regularity conditions (on S, f_s), we get equality

• Norms: important special case, f(x) = ‖x‖_p. Let q be such that 1/p + 1/q = 1, then

      ‖x‖_p = max_{‖z‖_q ≤ 1} z^T x

  Hence

      ∂f(x) = { y : ‖y‖_q ≤ 1 and y^T x = max_{‖z‖_q ≤ 1} z^T x }
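As a quick sanity check of the norm case with p = q = 2 (the helper below is illustrative): for x ≠ 0, the maximizer of z^T x over ‖z‖_2 ≤ 1 is x/‖x‖_2, recovering the subgradient from the earlier ‖·‖_2 example.

```python
import numpy as np

def l2_subgradient(x):
    """For p = q = 2 and x != 0: the unique maximizer of z^T x over ||z||_2 <= 1
    is x / ||x||_2, so it is the only subgradient of ||x||_2 at x."""
    return x / np.linalg.norm(x)

x = np.array([3.0, 4.0])
y = l2_subgradient(x)
assert np.isclose(np.linalg.norm(y), 1.0)        # feasible for the dual ball
assert np.isclose(y @ x, np.linalg.norm(x))      # attains the max value ||x||_2
```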
Why subgradients?

Subgradients are important for two reasons:
• Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality
• Convex optimization: if you can compute subgradients, then you can minimize (almost) any convex function
Optimality condition

For any f (convex or not),

    f(x*) = min_{x ∈ R^n} f(x)  ⟺  0 ∈ ∂f(x*)

I.e., x* is a minimizer if and only if 0 is a subgradient of f at x*

Why? Easy: g = 0 being a subgradient means that for all y

    f(y) ≥ f(x*) + 0^T (y − x*) = f(x*)

Note the implication for the differentiable case, where ∂f(x) = {∇f(x)}
Projection onto a convex set

Given a closed, convex set C ⊆ R^n, and a point y ∈ R^n, we define the projection operator onto C as

    P_C(y) = argmin_{x ∈ C} ‖y − x‖_2

Optimality characterization: x* = P_C(y) if and only if

    ⟨y − x*, x* − x⟩ ≥ 0 for all x ∈ C

Sometimes called a variational inequality
How to see this? Note that x* = P_C(y) minimizes the criterion

    f(x) = (1/2)‖y − x‖_2^2 + I_C(x)

where I_C is the indicator function of C. Hence we know this is equivalent to

    0 ∈ ∂f(x*) = −(y − x*) + N_C(x*)

i.e.,

    y − x* ∈ N_C(x*)

which exactly means

    (y − x*)^T x* ≥ (y − x*)^T x for all x ∈ C
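A numerical sketch with C taken to be the unit ℓ_2 ball (an assumed example with a closed-form projection), spot-checking the variational inequality on random points of C:

```python
import numpy as np

def project_l2_ball(y):
    """Projection of y onto C = {x : ||x||_2 <= 1} (closed form for this C)."""
    nrm = np.linalg.norm(y)
    return y if nrm <= 1 else y / nrm

y = np.array([2.0, 1.0])
x_star = project_l2_ball(y)

# Variational inequality: (y - x*)^T x* >= (y - x*)^T x for all x in C;
# spot-check on random points of C
d = y - x_star
xs = np.random.randn(1000, 2)
xs /= np.maximum(np.linalg.norm(xs, axis=1, keepdims=True), 1.0)   # push points into C
assert np.all(d @ x_star >= xs @ d - 1e-12)
```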
Soft-thresholding

The lasso problem can be parametrized as

    min_{β ∈ R^p} (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1

where λ ≥ 0. Consider the simplified problem with X = I:

    min_{β ∈ R^n} (1/2)‖y − β‖_2^2 + λ‖β‖_1

Claim: the solution of the simple problem is β̂ = S_λ(y), where S_λ is the soft-thresholding operator,

    [S_λ(y)]_i = y_i − λ   if y_i > λ
                 0         if −λ ≤ y_i ≤ λ
                 y_i + λ   if y_i < −λ

Why? Subgradients of f(β) = (1/2)‖y − β‖_2^2 + λ‖β‖_1 are

    g = β − y + λs,

where s_i = sign(β_i) if β_i ≠ 0 and s_i ∈ [−1, 1] if β_i = 0

Now just plug in β = S_λ(y) and check that we can get g = 0

[Plot: soft-thresholding in one variable]
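A minimal implementation of S_λ together with a numerical check that g = β − y + λs can indeed be made zero at β = S_λ(y) (the specific y and λ below are arbitrary):

```python
import numpy as np

def soft_threshold(y, lam):
    """Componentwise soft-thresholding operator S_lambda."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y, lam = np.array([3.0, 0.2, -1.5]), 1.0
beta = soft_threshold(y, lam)                    # [2.0, 0.0, -0.5]

# Subgradient g = beta - y + lam * s, with s_i = sign(beta_i) when beta_i != 0,
# and s_i = (y_i - beta_i) / lam (which lies in [-1, 1]) when beta_i == 0
s = np.where(beta != 0, np.sign(beta), (y - beta) / lam)
g = beta - y + lam * s
assert np.allclose(g, 0.0)                       # 0 in the subdifferential => beta is optimal
```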