Machine Learning Theory
Lecture 20: Mirror Descent
Nicholas Harvey
November 21, 2018

In this lecture we will present the Mirror Descent algorithm, which is a common generalization of Gradient Descent and Randomized Weighted Majority. This will require some preliminary results in convex analysis.

1 Conjugate Duality

A good reference for the material in this section is [5, Part E].

Definition 1.1. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ be a function. Define $f^* : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ by
$$ f^*(y) = \sup_{x \in \mathbb{R}^n} \; y^\mathsf{T} x - f(x). $$
This is the convex conjugate or Legendre-Fenchel transform or Fenchel dual of $f$.

For each linear functional $y$, the convex conjugate $f^*(y)$ gives the greatest amount by which $y$ exceeds the function $f$. Alternatively, we can think of $f^*(y)$ as the downward shift needed for the linear function $y$ to just touch or "support" $\operatorname{epi} f$.

1.1 Examples

Let us consider some simple one-dimensional examples.

Example 1.2. Let $f(x) = cx$ for some $c \in \mathbb{R}$. We claim that $f^* = \delta_{\{c\}}$, i.e.,
$$ f^*(x) = \begin{cases} 0 & \text{if } x = c \\ +\infty & \text{otherwise.} \end{cases} $$
This is called the indicator function of $\{c\}$. Note that $f$ is itself a linear functional that obviously supports $\operatorname{epi} f$, so $f^*(c) = 0$. Any other linear functional $x \mapsto yx - r$ cannot support $\operatorname{epi} f$ for any $r$ (we have $\sup_x (yx - cx) = \infty$ if $y \neq c$), so $f^*(y) = \infty$ if $y \neq c$. Note here that a line ($f$) is getting mapped to a single point ($f^*$).

Example 1.3. Let $f(x) = |x|$. We claim that $f^* = \delta_{[-1,1]}$ (the indicator function of $[-1,1]$). For any $y \in [-1,1]$, the linear functional $x \mapsto yx$ supports $\operatorname{epi} f$ at the point $(0,0)$, so $f^*(y) = 0$. On the other hand, if $y > 1$ then the linear functional $x \mapsto yx - r$ cannot support $\operatorname{epi} f$ for any $r$ (we have $\sup_x (yx - |x|) = \infty$ for $y > 1$), so $f^*(y) = \infty$. Similarly for $y < -1$.

Example 1.4. Let $f(x) = \frac{1}{2} x^\mathsf{T} x$. We claim that $f^* = f$. We have
$$ f^*(y) = \sup_{x \in \mathbb{R}^n} \; y^\mathsf{T} x - \tfrac{1}{2} x^\mathsf{T} x \;\leq\; \sup_{x \in \mathbb{R}^n} \; \|y\|_2 \|x\|_2 - \tfrac{1}{2} \|x\|_2^2. $$
This upper bound is maximized when $\|x\|_2 = \|y\|_2$, and the inequality is achieved when $x = y$. Thus $f^*(y) = \frac{1}{2} y^\mathsf{T} y = f(y)$, so $f^* = f$.

Example 1.5 (Negative entropy). Define $f : \mathbb{R}^n_{>0} \to \mathbb{R}$ by $f(x) = \sum_{i=1}^n x_i \ln x_i$. We saw in our earlier lectures on convexity that $f$ is convex. We claim that $f^*(y) = \sum_{i=1}^n e^{y_i - 1}$. By Claim 1.9, proving the result for $n = 1$ also establishes the general result.

By definition $f^*(y) = \sup_{z > 0} (yz - z \ln z)$. The derivative of $yz - z \ln z$ is $y - \ln z - 1$. The unique critical point satisfies $z = e^{y-1}$ and it is a maximizer. Thus $f^*(y) = y e^{y-1} - e^{y-1}(y - 1) = e^{y-1}$.

Example 1.6. Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$ and let $f(x) = \frac{1}{2} \|x\|^2$. Then $f^*(y) = \frac{1}{2} \|y\|_*^2$, where $\|\cdot\|_*$ denotes the dual norm.

References. [3, Example 3.27].

1.2 Properties

Claim 1.7 (Young-Fenchel Inequality). For any $x, y \in \mathbb{R}^n$,
$$ y^\mathsf{T} x \;\leq\; f(x) + f^*(y). $$

Proof.
$$ f^*(y) + f(x) \;=\; \sup_{x' \in \mathbb{R}^n} \big( y^\mathsf{T} x' - f(x') \big) + f(x) \;\geq\; \big( y^\mathsf{T} x - f(x) \big) + f(x) \;=\; y^\mathsf{T} x. $$

Claim 1.8. $f^*$ is closed and convex (regardless of whether $f$ is).

Proof. For each $x$, define $g_x(y) = y^\mathsf{T} x - f(x)$. Note that $g_x$ is an affine function of $y$, so $g_x$ is closed and convex. As $f^* = \sup_{x \in \mathbb{R}^n} g_x$, Lemma 5.8 implies that $f^*$ is closed and convex.

Claim 1.9 (Conjugate of Separable Function). Let $f : \mathbb{R}^a \times \mathbb{R}^b \to \mathbb{R}$ be defined by $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$. Then $f^*(x_1, x_2) = f_1^*(x_1) + f_2^*(x_2)$.

Proof. Straight from the definitions, we have
$$ \begin{aligned} f^*(y_1, y_2) &= \sup_{(z_1, z_2) \in \mathbb{R}^a \times \mathbb{R}^b} (y_1, y_2)^\mathsf{T} (z_1, z_2) - f(z_1, z_2) \\ &= \sup_{(z_1, z_2) \in \mathbb{R}^a \times \mathbb{R}^b} y_1^\mathsf{T} z_1 + y_2^\mathsf{T} z_2 - f_1(z_1) - f_2(z_2) \\ &= \sup_{z_1 \in \mathbb{R}^a} \big( y_1^\mathsf{T} z_1 - f_1(z_1) \big) + \sup_{z_2 \in \mathbb{R}^b} \big( y_2^\mathsf{T} z_2 - f_2(z_2) \big) \\ &= f_1^*(y_1) + f_2^*(y_2). \end{aligned} $$

Claim 1.10. Suppose $f$ is a closed, convex function. Then $f^{**} = f$.

References. [2, Proposition 7.1.1], [3, Exercise 3.39].
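As a quick sanity check on Example 1.5 (our addition, not part of the notes), the sketch below numerically evaluates the one-dimensional conjugate $f^*(y) = \sup_{z>0}(yz - z \ln z)$ by bounded maximization and compares it with the closed form $e^{y-1}$. It assumes NumPy and SciPy are available; the helper name `conjugate_1d` is ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_1d(f, y, lo=1e-9, hi=50.0):
    """Numerically evaluate f*(y) = sup_z (y*z - f(z)) over z in (lo, hi)."""
    res = minimize_scalar(lambda z: -(y * z - f(z)),
                          bounds=(lo, hi), method="bounded")
    return -res.fun  # negate: we minimized the negative of the objective

neg_entropy = lambda z: z * np.log(z)

for y in [-1.0, 0.0, 0.5, 2.0]:
    numeric = conjugate_1d(neg_entropy, y)
    closed_form = np.exp(y - 1.0)  # Example 1.5: f*(y) = e^{y-1}
    print(f"y={y:5.1f}  numeric={numeric:.6f}  e^(y-1)={closed_form:.6f}")
```

The two columns agree to the solver's tolerance, as Example 1.5 predicts.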
The following claim shows that vectors $x$ and $y$ achieving equality in Claim 1.7 are rather special.

Claim 1.11. Suppose that $f$ is closed and convex. The following are equivalent:
$$ y \in \partial f(x) \tag{1.1a} $$
$$ x \in \partial f^*(y) \tag{1.1b} $$
$$ \langle y, x \rangle = f(x) + f^*(y) \tag{1.1c} $$

References. See [7, Slide 7-15], [5, Part E, Corollary 1.4.4]. In the differentiable setting, (1.1a) $\iff$ (1.1c) appears in [3, pp. 95].

Proof.
(1.1a) $\Rightarrow$ (1.1c): Suppose $y \in \partial f(x)$. Then $f^*(y) = \sup_u \langle y, u \rangle - f(u) = \langle y, x \rangle - f(x)$, by the subgradient inequality.

(1.1c) $\Rightarrow$ (1.1b): For any $v \in \mathbb{R}^n$, we have
$$ \begin{aligned} f^*(v) &= \sup_u \langle v, u \rangle - f(u) \\ &\geq \langle v, x \rangle - f(x) \\ &= \langle v - y, x \rangle - f(x) + \langle x, y \rangle \\ &= \langle v - y, x \rangle + f^*(y), \end{aligned} $$
by (1.1c). This shows that $x \in \partial f^*(y)$.

(1.1b) $\Rightarrow$ (1.1a): Let $g = f^*$. Then $g$ is closed and convex by Claim 1.8. If $x \in \partial g(y)$ then $y \in \partial g^*(x)$, by the chain of implications (1.1a) $\Rightarrow$ (1.1c) $\Rightarrow$ (1.1b) applied to $g$. But $g^* = f$ by Claim 1.10, so this establishes the desired result.

2 Bregman Divergence

A good reference for the material in this section is [8].

Let $\mathcal{X}$ be a closed convex set. Let $f : \mathcal{X} \to \mathbb{R}$ be a continuously-differentiable and convex function. The first-order approximation of $f$ at $x$ is
$$ f(x) \approx f(y) + \langle \nabla f(y), x - y \rangle. $$
Since $f$ is convex, the subgradient inequality implies that the left-hand side is at least the right-hand side. The amount by which the left-hand side exceeds the right-hand side is the Bregman divergence.

Definition 2.1. The Bregman divergence is defined to be
$$ D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle. $$

2.1 Examples

Example 2.2. Define $f : \mathbb{R}^n \to \mathbb{R}$ by $f(x) = \|x\|_2^2$. Then
$$ \begin{aligned} D_f(x, y) &= f(x) - f(y) - \langle \nabla f(y), x - y \rangle \\ &= \|x\|_2^2 - \|y\|_2^2 - 2 \langle y, x - y \rangle \\ &= \|x\|_2^2 + \|y\|_2^2 - 2 \langle y, x \rangle \\ &= \|x - y\|_2^2. \end{aligned} $$

Example 2.3 (Negative entropy). Recall that the negative entropy function is $f : \mathbb{R}^n_{>0} \to \mathbb{R}$ defined by $f(x) = \sum_{i=1}^n x_i \ln x_i$. Then
$$ \begin{aligned} D_f(x, y) &= f(x) - f(y) - \nabla f(y)^\mathsf{T} (x - y) \\ &= \sum_{i=1}^n x_i \ln x_i - \sum_{i=1}^n y_i \ln y_i - \sum_{i=1}^n (\ln y_i + 1)(x_i - y_i) \\ &= \sum_{i=1}^n x_i \ln x_i - \sum_{i=1}^n x_i \ln y_i - \sum_{i=1}^n x_i + \sum_{i=1}^n y_i \\ &= \sum_{i=1}^n x_i \ln(x_i / y_i) - \sum_{i=1}^n x_i + \sum_{i=1}^n y_i \\ &= D_{\mathrm{KL}}(x \,\|\, y), \end{aligned} \tag{2.1} $$
the generalized KL-divergence between $x$ and $y$, which we introduced in Lecture 16. In the case that $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i = 1$, this is the ordinary KL divergence (or "relative entropy") between $x$ and $y$.

Negative entropy will be particularly important to us, so we prove one property of it now.

Claim 2.4. Negative entropy is 1-strongly convex with respect to the $\ell_1$ norm.

To prove this, we require the following theorem.

Theorem 2.5 (Pinsker's Inequality). For any distributions $p, q$, we have $D_{\mathrm{KL}}(p \,\|\, q) \geq \frac{1}{2} \|p - q\|_1^2$.

References. Wikipedia, Lecture notes of Sanjeev Khudanpur.

Proof (of Claim 2.4). As in Example 2.3, let $f(x) = \sum_{i=1}^n x_i \ln x_i$. Then
$$ \begin{aligned} f(y) &= f(x) + \langle \nabla f(x), y - x \rangle + D_{\mathrm{KL}}(y \,\|\, x) & \text{(by (2.1))} \\ &\geq f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} \|x - y\|_1^2 & \text{(by Theorem 2.5),} \end{aligned} $$
which is the definition of 1-strong convexity with respect to $\|\cdot\|_1$.

There are also some interesting examples involving matrices.

Example 2.6. Let $f(X) = \operatorname{tr}(X \log X)$. Then $D_f(X, Y) = \operatorname{tr}(X \log X - X \log Y - X + Y)$. This is called the von Neumann divergence, or quantum relative entropy.

Example 2.7. Let $f(X) = -\log \det X$. Then $D_f(X, Y) = \operatorname{tr}(X Y^{-1} - I) - \log \det(X Y^{-1})$. This is called the log-det divergence.

2.2 Properties

Claim 2.8. $D_f(x, y)$ is convex in $x$.

Proof. This is immediate from the definition, since $f(x)$ is convex in $x$ and $-\langle \nabla f(y), x - y \rangle$ is linear in $x$.

Note. $D_f(x, y)$ is not generally convex in $y$. Consider the case $f(x) = \exp(x)$ and $x = 4$. Then
$$ \begin{aligned} D_f(4, 0) &= e^4 - 5 < 50 \\ D_f(4, 1) &= e^4 - 4e > 43 \\ D_f(4, 2) &= e^4 - 3e^2 < 33. \end{aligned} $$
As $D_f(4, 1) > \big( D_f(4, 0) + D_f(4, 2) \big) / 2$, $D_f(x, y)$ is not convex in $y$.
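Definition 2.1 translates directly into a few lines of code. The sketch below is our illustration (the helper name `bregman` is ours, not from the notes); it evaluates $D_f$ for the two examples above and confirms numerically that it reduces to the squared Euclidean distance and the generalized KL-divergence, respectively.

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>  (Definition 2.1)."""
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

# Example 2.2: f(x) = ||x||_2^2, so D_f(x, y) = ||x - y||_2^2.
sq = lambda x: np.dot(x, x)
grad_sq = lambda x: 2 * x

# Example 2.3: negative entropy, so D_f is the generalized KL-divergence.
neg_ent = lambda x: np.sum(x * np.log(x))
grad_neg_ent = lambda x: np.log(x) + 1
gen_kl = lambda x, y: np.sum(x * np.log(x / y) - x + y)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.1, 0.6, 0.3])
print(bregman(sq, grad_sq, x, y), sq(x - y))               # both 0.02
print(bregman(neg_ent, grad_neg_ent, x, y), gen_kl(x, y))  # equal values
```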
Lemma 2.9. Let $f$ be closed, convex and differentiable. Fix any $x, y \in \mathcal{X}$. Define $\hat{x} = \nabla f(x)$ and $\hat{y} = \nabla f(y)$. Then
$$ \nabla f^*(\hat{x}) = x \tag{2.2} $$
$$ D_f(x, y) = D_{f^*}(\hat{y}, \hat{x}) \tag{2.3} $$

References.
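Since negative entropy has the explicit conjugate $f^*(y) = \sum_i e^{y_i - 1}$ (Example 1.5), both identities in Lemma 2.9 can be checked numerically in that case. The following sketch is our addition, not part of the notes; it reuses the `bregman` helper from above.

```python
import numpy as np

# Negative entropy and its conjugate (Example 1.5), with explicit gradients:
# grad f(x) = ln(x) + 1 and grad f*(y) = e^{y-1}.
f       = lambda x: np.sum(x * np.log(x))
grad_f  = lambda x: np.log(x) + 1
f_conj  = lambda y: np.sum(np.exp(y - 1))
grad_fc = lambda y: np.exp(y - 1)

def bregman(h, grad_h, u, v):
    return h(u) - h(v) - np.dot(grad_h(v), u - v)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.1, 0.6, 0.3])
x_hat, y_hat = grad_f(x), grad_f(y)

print(grad_fc(x_hat))                          # (2.2): recovers x exactly
print(bregman(f, grad_f, x, y))                # D_f(x, y)
print(bregman(f_conj, grad_fc, y_hat, x_hat))  # (2.3): same value
```

Here (2.2) holds exactly, since $\nabla f^*(\nabla f(x)) = e^{(\ln x + 1) - 1} = x$, and the last two printed values agree as (2.3) asserts.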
