Machine Learning Theory Lecture 20: Mirror Descent
Nicholas Harvey, November 21, 2018

In this lecture we will present the Mirror Descent algorithm, which is a common generalization of Gradient Descent and Randomized Weighted Majority. This will require some preliminary results in convex analysis.

1 Conjugate Duality

A good reference for the material in this section is [5, Part E].

Definition 1.1. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ be a function. Define $f^* : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ by
$$ f^*(y) = \sup_{x \in \mathbb{R}^n} \; y^{\mathsf{T}} x - f(x). $$
This is the convex conjugate or Legendre-Fenchel transform or Fenchel dual of $f$.

For each linear functional $y$, the convex conjugate $f^*(y)$ gives the greatest amount by which $y$ exceeds the function $f$. Alternatively, we can think of $f^*(y)$ as the downward shift needed for the linear function $y$ to just touch or "support" $\operatorname{epi} f$.

1.1 Examples

Let us consider some simple one-dimensional examples.

Example 1.2. Let $f(x) = cx$ for some $c \in \mathbb{R}$. We claim that $f^* = \delta_{\{c\}}$, i.e.,
$$ f^*(x) = \begin{cases} 0 & \text{if } x = c \\ +\infty & \text{otherwise.} \end{cases} $$
This is called the indicator function of $\{c\}$. Note that $f$ is itself a linear functional that obviously supports $\operatorname{epi} f$, so $f^*(c) = 0$. Any other linear functional $x \mapsto yx - r$ cannot support $\operatorname{epi} f$ for any $r$ (we have $\sup_x (yx - cx) = \infty$ if $y \neq c$), so $f^*(y) = \infty$ if $y \neq c$. Note here that a line ($f$) is getting mapped to a single point ($f^*$).

Example 1.3. Let $f(x) = |x|$. We claim that $f^* = \delta_{[-1,1]}$ (the indicator function of $[-1,1]$). For any $y \in [-1,1]$, the linear functional $x \mapsto yx$ supports $\operatorname{epi} f$ at the point $(0,0)$, so $f^*(y) = 0$. On the other hand, if $y > 1$ then the linear functional $x \mapsto yx - r$ cannot support $\operatorname{epi} f$ for any $r$ (we have $\sup_x (yx - |x|) = \infty$ for $y > 1$), so $f^*(y) = \infty$. Similarly for $y < -1$.

Example 1.4. Let $f(x) = \frac{1}{2} x^{\mathsf{T}} x$. We claim that $f^* = f$. We have
$$ f^*(y) = \sup_{x \in \mathbb{R}^n} \; y^{\mathsf{T}} x - \tfrac{1}{2} x^{\mathsf{T}} x \;\leq\; \sup_{x \in \mathbb{R}^n} \; \|y\|_2 \|x\|_2 - \tfrac{1}{2} \|x\|_2^2. $$
This upper bound is maximized when $\|x\|_2 = \|y\|_2$, and the inequality is achieved when $x = y$. Thus $f^*(y) = \frac{1}{2} y^{\mathsf{T}} y = f(y)$, so $f^* = f$.
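As a quick numerical sanity check of Example 1.4 (not part of the original notes), the sketch below approximates the one-dimensional conjugate $f^*(y) = \sup_x (yx - f(x))$ by maximizing over a fine grid; the helper name `conjugate_1d` is ours.

```python
import numpy as np

def conjugate_1d(f, y, xs):
    """Approximate f*(y) = sup_x (y*x - f(x)) over the sample points xs."""
    return np.max(y * xs - f(xs))

# Example 1.4 in one dimension: f(x) = (1/2) x^2 should satisfy f* = f.
xs = np.linspace(-10, 10, 200001)
f = lambda x: 0.5 * x**2

for y in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    approx = conjugate_1d(f, y, xs)
    exact = 0.5 * y**2          # the claimed closed form f*(y) = f(y)
    assert abs(approx - exact) < 1e-6, (y, approx, exact)
```

The grid maximum sits within half a grid step of the true maximizer $x = y$, and since the objective is quadratic around its peak the error is on the order of the squared step size, far below the tolerance used here.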
Example 1.5 (Negative entropy). Define $f : \mathbb{R}^n_{>0} \to \mathbb{R}$ by $f(x) = \sum_{i=1}^n x_i \ln x_i$. We saw in our earlier lectures on convexity that $f$ is convex. We claim that $f^*(y) = \sum_{i=1}^n e^{y_i - 1}$. By Claim 1.9, proving the result for $n = 1$ also establishes the general result.

By definition $f^*(y) = \sup_{z > 0} (yz - z \ln z)$. The derivative of $yz - z \ln z$ is $y - \ln z - 1$. The unique critical point satisfies $z = e^{y-1}$ and it is a maximizer. Thus $f^*(y) = y e^{y-1} - e^{y-1}(y - 1) = e^{y-1}$.

Example 1.6. Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$ and let $f(x) = \frac{1}{2} \|x\|^2$. Then $f^*(y) = \frac{1}{2} \|y\|_*^2$, where $\|\cdot\|_*$ is the dual norm.

References. [3, Example 3.27].

1.2 Properties

Claim 1.7 (Young-Fenchel Inequality). For any $x, y \in \mathbb{R}^n$,
$$ y^{\mathsf{T}} x \leq f(x) + f^*(y). $$

Proof.
$$ f^*(y) + f(x) = \Big( \sup_{x' \in \mathbb{R}^n} y^{\mathsf{T}} x' - f(x') \Big) + f(x) \geq \big( y^{\mathsf{T}} x - f(x) \big) + f(x) = y^{\mathsf{T}} x. $$

Claim 1.8. $f^*$ is closed and convex (regardless of whether $f$ is).

Proof. For each $x$, define $g_x(y) = y^{\mathsf{T}} x - f(x)$. Note that $g_x$ is an affine function of $y$, so $g_x$ is closed and convex. As $f^* = \sup_{x \in \mathbb{R}^n} g_x$, Lemma 5.8 implies that $f^*$ is closed and convex.

Claim 1.9 (Conjugate of Separable Function). Let $f : \mathbb{R}^a \times \mathbb{R}^b \to \mathbb{R}$ be defined by $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$. Then $f^*(y_1, y_2) = f_1^*(y_1) + f_2^*(y_2)$.

Proof. Straight from the definitions, we have
$$\begin{aligned}
f^*(y_1, y_2) &= \sup_{(z_1, z_2) \in \mathbb{R}^a \times \mathbb{R}^b} (y_1, y_2)^{\mathsf{T}} (z_1, z_2) - f(z_1, z_2) \\
&= \sup_{(z_1, z_2) \in \mathbb{R}^a \times \mathbb{R}^b} y_1^{\mathsf{T}} z_1 + y_2^{\mathsf{T}} z_2 - f_1(z_1) - f_2(z_2) \\
&= \sup_{z_1 \in \mathbb{R}^a} \big( y_1^{\mathsf{T}} z_1 - f_1(z_1) \big) + \sup_{z_2 \in \mathbb{R}^b} \big( y_2^{\mathsf{T}} z_2 - f_2(z_2) \big) \\
&= f_1^*(y_1) + f_2^*(y_2).
\end{aligned}$$

Claim 1.10. Suppose $f$ is a closed, convex function. Then $f^{**} = f$.

References. [2, Proposition 7.1.1], [3, Exercise 3.39].

The following claim shows that vectors $x$ and $y$ achieving equality in Claim 1.7 are rather special.

Claim 1.11. Suppose that $f$ is closed and convex. The following are equivalent:
$$ y \in \partial f(x) \tag{1.1a} $$
$$ x \in \partial f^*(y) \tag{1.1b} $$
$$ \langle y, x \rangle = f(x) + f^*(y) \tag{1.1c} $$

References. See [7, Slide 7-15], [5, Part E, Corollary 1.4.4]. In the differentiable setting, (1.1a) $\iff$ (1.1c) appears in [3, pp. 95].

Proof.
(1.1a) $\Rightarrow$ (1.1c): Suppose $y \in \partial f(x)$.
Then $f^*(y) = \sup_u \langle y, u \rangle - f(u) = \langle y, x \rangle - f(x)$, by the subgradient inequality.

(1.1c) $\Rightarrow$ (1.1b): For any $v \in \mathbb{R}^n$, we have
$$\begin{aligned}
f^*(v) &= \sup_u \langle v, u \rangle - f(u) \\
&\geq \langle v, x \rangle - f(x) \\
&= \langle v - y, x \rangle - f(x) + \langle x, y \rangle \\
&= \langle v - y, x \rangle + f^*(y),
\end{aligned}$$
by (1.1c). This shows that $x \in \partial f^*(y)$.

(1.1b) $\Rightarrow$ (1.1a): Let $g = f^*$. Then $g$ is closed and convex by Claim 1.8. If $x \in \partial g(y)$ then $y \in \partial g^*(x)$, by the implications (1.1a) $\Rightarrow$ (1.1c) $\Rightarrow$ (1.1b) applied to $g$. But $g^* = f$ by Claim 1.10, so this establishes the desired result.

2 Bregman Divergence

A good reference for the material in this section is [8].

Let $\mathcal{X}$ be a closed convex set. Let $f : \mathcal{X} \to \mathbb{R}$ be a continuously-differentiable and convex function. The first-order approximation of $f$ at $y$ is
$$ f(x) \approx f(y) + \langle \nabla f(y), x - y \rangle. $$
Since $f$ is convex, the subgradient inequality implies that the left-hand side is at least the right-hand side. The amount by which the left-hand side exceeds the right-hand side is the Bregman divergence.

Definition 2.1. The Bregman divergence is defined to be
$$ D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle. $$

2.1 Examples

Example 2.2. Define $f : \mathbb{R}^n \to \mathbb{R}$ by $f(x) = \|x\|_2^2$. Then
$$\begin{aligned}
D_f(x, y) &= f(x) - f(y) - \langle \nabla f(y), x - y \rangle \\
&= \|x\|_2^2 - \|y\|_2^2 - 2 \langle y, x - y \rangle \\
&= \|x\|_2^2 + \|y\|_2^2 - 2 \langle y, x \rangle \\
&= \|x - y\|_2^2.
\end{aligned}$$

Example 2.3 (Negative entropy). Recall that the negative entropy function is $f : \mathbb{R}^n_{>0} \to \mathbb{R}$ defined by $f(x) = \sum_{i=1}^n x_i \ln x_i$. Then
$$\begin{aligned}
D_f(x, y) &= f(x) - f(y) - \nabla f(y)^{\mathsf{T}} (x - y) \\
&= \sum_{i=1}^n x_i \ln x_i - \sum_{i=1}^n y_i \ln y_i - \sum_{i=1}^n (\ln y_i + 1)(x_i - y_i) \\
&= \sum_{i=1}^n x_i \ln x_i - \sum_{i=1}^n x_i \ln y_i - \sum_{i=1}^n x_i + \sum_{i=1}^n y_i \\
&= \sum_{i=1}^n x_i \ln(x_i / y_i) - \sum_{i=1}^n x_i + \sum_{i=1}^n y_i \\
&= D_{\mathrm{KL}}(x \,\|\, y),
\end{aligned} \tag{2.1}$$
the generalized KL-divergence between $x$ and $y$, which we introduced in Lecture 16. In the case that $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i = 1$, this is the ordinary KL divergence (or "relative entropy") between $x$ and $y$.

Negative entropy will be particularly important to us, so we prove one property of it now.

Claim 2.4. Negative entropy is 1-strongly convex with respect to the $\ell_1$ norm.
To prove this, we require the following theorem.

Theorem 2.5 (Pinsker's Inequality). For any distributions $p, q$, we have $D_{\mathrm{KL}}(p \,\|\, q) \geq \frac{1}{2} \|p - q\|_1^2$.

References. Wikipedia; lecture notes of Sanjeev Khudanpur.

Proof (of Claim 2.4). As in Example 2.3, let $f(x) = \sum_{i=1}^n x_i \ln x_i$. Then
$$\begin{aligned}
f(y) &= f(x) + \langle \nabla f(x), y - x \rangle + D_{\mathrm{KL}}(y \,\|\, x) && \text{(by (2.1))} \\
&\geq f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} \|x - y\|_1^2 && \text{(by Theorem 2.5).}
\end{aligned}$$

There are also some interesting examples involving matrices.

Example 2.6. Let $f(X) = \operatorname{tr}(X \log X)$. Then $D_f(X, Y) = \operatorname{tr}(X \log X - X \log Y - X + Y)$. This is called the von Neumann divergence, or quantum relative entropy.

Example 2.7. Let $f(X) = -\log \det X$. Then $D_f(X, Y) = \operatorname{tr}(X Y^{-1} - I) - \log \det(X Y^{-1})$. This is called the log-det divergence.

2.2 Properties

Claim 2.8. $D_f(x, y)$ is convex in $x$.

Proof. This is immediate from the definition, since $f(x)$ is convex in $x$ and $-\langle \nabla f(y), x - y \rangle$ is linear in $x$.

Note. $D_f(x, y)$ is not generally convex in $y$. Consider the case $f(x) = \exp(x)$ and $x = 4$. Then
$$\begin{aligned}
D_f(4, 0) &= e^4 - 5 < 50 \\
D_f(4, 1) &= e^4 - 4e > 43 \\
D_f(4, 2) &= e^4 - 3e^2 < 33.
\end{aligned}$$
As $D_f(4, 1) > \big( D_f(4, 0) + D_f(4, 2) \big) / 2$, $D_f(x, y)$ is not convex in $y$.

Lemma 2.9. Let $f$ be closed, convex and differentiable. Fix any $x, y \in \mathcal{X}$. Define $\hat{x} = \nabla f(x)$ and $\hat{y} = \nabla f(y)$. Then
$$ \nabla f^*(\hat{x}) = x \tag{2.2} $$
$$ D_f(x, y) = D_{f^*}(\hat{y}, \hat{x}) \tag{2.3} $$

References.
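The Bregman-divergence examples above are easy to check numerically. The sketch below (our own helper, `bregman`, evaluating Definition 2.1 directly) verifies Example 2.2, Example 2.3, and the non-convexity note after Claim 2.8.

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>  (Definition 2.1)."""
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

# Example 2.2: f(x) = ||x||_2^2 gives D_f(x, y) = ||x - y||_2^2.
f2 = lambda x: np.dot(x, x)
g2 = lambda x: 2 * x
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(bregman(f2, g2, x, y), np.sum((x - y) ** 2))

# Example 2.3: negative entropy gives the KL divergence (here p, q sum to 1).
fent = lambda x: np.sum(x * np.log(x))
gent = lambda x: np.log(x) + 1
p, q = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
assert np.isclose(bregman(fent, gent, p, q), np.sum(p * np.log(p / q)))

# Note after Claim 2.8: with f = exp, y -> D_f(4, y) is not convex.
fe, ge = np.exp, np.exp
d0, d1, d2 = (bregman(fe, ge, 4.0, t) for t in (0.0, 1.0, 2.0))
assert d1 > (d0 + d2) / 2   # midpoint value lies above the chord
```

The last assertion is exactly the computation in the note: $D_f(4,1) \approx 43.7$ exceeds the average of $D_f(4,0) \approx 49.6$ and $D_f(4,2) \approx 32.4$.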