Machine Learning Theory Lecture 20: Mirror Descent
Nicholas Harvey
November 21, 2018
In this lecture we will present the Mirror Descent algorithm, which is a common generalization of Gradient Descent and Randomized Weighted Majority. This will require some preliminary results in convex analysis.
1 Conjugate Duality
A good reference for the material in this section is [5, Part E].
Definition 1.1. Let f : R^n → R ∪ {∞} be a function. Define f* : R^n → R ∪ {∞} by

    f*(y) = sup_{x ∈ R^n} ( yᵀx − f(x) ).

This is the convex conjugate or Legendre-Fenchel transform or Fenchel dual of f. For each linear functional y, the convex conjugate f*(y) gives the greatest amount by which y exceeds the function f. Alternatively, we can think of f*(y) as the downward shift needed for the linear function y to just touch or “support” epi f.
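As a sanity check (not part of the original notes), the supremum in this definition can be approximated numerically. The sketch below, with a hypothetical helper `conjugate_on_grid`, maximizes yx − f(x) over a grid for f(x) = x², whose conjugate works out to f*(y) = y²/4:

```python
# Hypothetical numerical check (not from the notes): approximate the convex
# conjugate f*(y) = sup_x (y*x - f(x)) by maximizing over a fine grid.
def conjugate_on_grid(f, y, lo=-10.0, hi=10.0, steps=100001):
    best = float("-inf")
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        best = max(best, y * x - f(x))
    return best

f = lambda x: x * x           # f(x) = x^2
y = 3.0
approx = conjugate_on_grid(f, y)
exact = y * y / 4             # f*(y) = y^2/4, supremum attained at x = y/2
print(abs(approx - exact) < 1e-3)
```

The grid bounds are an assumption; they must contain the maximizer x = y/2 for the approximation to be valid.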
1.1 Examples
Let us consider some simple one-dimensional examples.
Example 1.2. Let f(x) = cx for some c ∈ R. We claim that f* = δ_{c}, i.e.,

    f*(y) = 0 if y = c, and f*(y) = +∞ otherwise.

This is called the indicator function of {c}. Note that f is itself a linear functional that obviously supports epi f, so f*(c) = 0. Any other linear functional x ↦ yx − r cannot support epi f for any r (we have sup_x (yx − cx) = ∞ if y ≠ c), so f*(y) = ∞ if y ≠ c. Note here that a line (f) is getting mapped to a single point (f*).
Example 1.3. Let f(x) = |x|. We claim that f* = δ_{[−1,1]} (the indicator function of [−1, 1]). For any y ∈ [−1, 1], the linear functional x ↦ yx supports epi f at the point (0, 0); so f*(y) = 0. On the other hand, if y > 1 then the linear functional x ↦ yx − r cannot support epi f for any r (we have sup_x (yx − |x|) = ∞ for y > 1), so f*(y) = ∞. Similarly for y < −1.

Example 1.4. Let f(x) = (1/2) xᵀx. We claim that f* = f. We have

    f*(y) = sup_{x ∈ R^n} ( yᵀx − (1/2) xᵀx ) ≤ sup_{x ∈ R^n} ( ‖y‖₂ ‖x‖₂ − (1/2) ‖x‖₂² ).

This upper bound is maximized when ‖x‖₂ = ‖y‖₂, and the inequality is achieved when x = y. Thus f*(y) = (1/2) yᵀy = f(y), so f* = f.

Example 1.5 (Negative entropy). Define f : R^n_{>0} → R by f(x) = Σ_{i=1}^n x_i ln x_i. We saw in our earlier lectures on convexity that f is convex. We claim that f*(y) = Σ_{i=1}^n e^{y_i − 1}. By Claim 1.9, proving the result for n = 1 also establishes the general result.

By definition f*(y) = sup_{z>0} (yz − z ln z). The derivative of yz − z ln z is y − ln z − 1. The unique critical point satisfies z = e^{y−1}, and it is a maximizer. Thus f*(y) = y e^{y−1} − e^{y−1} (y − 1) = e^{y−1}.

Example 1.6. Let ‖·‖ be a norm on R^n and let f(x) = (1/2) ‖x‖². Then f*(y) = (1/2) ‖y‖_*², where ‖·‖_* denotes the dual norm.

References. [3, Example 3.27].
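The closed form in Example 1.5 can be spot-checked numerically in one dimension. This sketch (not from the notes; the helper name and grid bounds are my own choices) approximates sup_{z>0} (yz − z ln z) and compares it to e^{y−1}:

```python
import math

# Hypothetical check (not from the notes) of Example 1.5 with n = 1:
# for f(z) = z ln z on z > 0, the conjugate should be f*(y) = e^(y-1).
def neg_entropy_conjugate(y, steps=200000, hi=20.0):
    best = float("-inf")
    for i in range(1, steps):
        z = hi * i / steps                  # grid over (0, hi)
        best = max(best, y * z - z * math.log(z))
    return best

y = 1.5
approx = neg_entropy_conjugate(y)
exact = math.exp(y - 1)                     # maximizer z = e^(y-1) lies in (0, 20)
print(abs(approx - exact) < 1e-3)
```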
1.2 Properties
Claim 1.7 (Young-Fenchel Inequality). For any x, y ∈ R^n,
    yᵀx ≤ f(x) + f*(y).
Proof.
    f*(y) + f(x) = sup_{x′ ∈ R^n} ( yᵀx′ − f(x′) ) + f(x) ≥ ( yᵀx − f(x) ) + f(x) = yᵀx.

Claim 1.8. f* is closed and convex (regardless of whether f is).

Proof. For each x, define g_x(y) = yᵀx − f(x). Note that g_x is an affine function of y, so g_x is closed and convex. As f* = sup_{x ∈ R^n} g_x, Lemma 5.8 implies that f* is closed and convex.

Claim 1.9 (Conjugate of Separable Function). Let f : R^a × R^b → R be defined by f(x_1, x_2) = f_1(x_1) + f_2(x_2). Then f*(y_1, y_2) = f_1*(y_1) + f_2*(y_2).
Proof. Straight from the definitions, we have

    f*(y_1, y_2) = sup_{(z_1, z_2) ∈ R^a × R^b} ( (y_1, y_2)ᵀ(z_1, z_2) − f(z_1, z_2) )
                 = sup_{(z_1, z_2) ∈ R^a × R^b} ( y_1ᵀz_1 + y_2ᵀz_2 − f_1(z_1) − f_2(z_2) )
                 = sup_{z_1 ∈ R^a} ( y_1ᵀz_1 − f_1(z_1) ) + sup_{z_2 ∈ R^b} ( y_2ᵀz_2 − f_2(z_2) )
                 = f_1*(y_1) + f_2*(y_2).
Claim 1.10. Suppose f is a closed, convex function. Then f** = f.

References. [2, Proposition 7.1.1], [3, Exercise 3.39].

The following claim shows that vectors x and y achieving equality in Claim 1.7 are rather special.

Claim 1.11. Suppose that f is closed and convex. The following are equivalent:

    y ∈ ∂f(x)                        (1.1a)
    x ∈ ∂f*(y)                       (1.1b)
    ⟨y, x⟩ = f(x) + f*(y)            (1.1c)
References. See [7, Slide 7-15], [5, Part E, Corollary 1.4.4]. In the differentiable setting, (1.1a) ⇐⇒ (1.1c) appears in [3, p. 95].
Proof.
(1.1a)⇒(1.1c): Suppose y ∈ ∂f(x). Then f*(y) = sup_u ( ⟨y, u⟩ − f(u) ) = ⟨y, x⟩ − f(x), by the subgradient inequality.
(1.1c)⇒(1.1b): For any v ∈ R^n, we have
    f*(v) = sup_u ( ⟨v, u⟩ − f(u) )
          ≥ ⟨v, x⟩ − f(x)
          = ⟨v − y, x⟩ − f(x) + ⟨x, y⟩
          = ⟨v − y, x⟩ + f*(y),    by (1.1c).

This shows that x ∈ ∂f*(y).
(1.1b)⇒(1.1a): Let g = f*. Then g is closed and convex by Claim 1.8. If x ∈ ∂g(y) then y ∈ ∂g*(x), by the already-established implications (1.1a)⇒(1.1c)⇒(1.1b). But g* = f by Claim 1.10, so this establishes the desired result.
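To make the equivalence concrete, here is a small numerical illustration (my own, not from the notes) using f(x) = x²/2 in one dimension, where f* = f by Example 1.4 and ∂f(x) = {x}. The Young-Fenchel inequality holds with equality exactly when y = x:

```python
# Hypothetical check (not from the notes) of Claims 1.7 and 1.11 for
# f(x) = x^2/2 in one dimension, where f* = f and ∂f(x) = {x}.
def f(x): return 0.5 * x * x
def f_star(y): return 0.5 * y * y            # Example 1.4: f* = f

x = 1.7
# Equality in Young-Fenchel at the subgradient y = x:
print(abs(x * x - (f(x) + f_star(x))) < 1e-12)
# Strict inequality for y = 2.0, which is not in ∂f(x):
print(2.0 * x < f(x) + f_star(2.0))
```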
2 Bregman Divergence
A good reference for the material in this section is [8].

Let X be a closed convex set. Let f : X → R be a continuously differentiable and convex function. The first-order approximation of f around y is

    f(x) ≈ f(y) + ⟨∇f(y), x − y⟩.

Since f is convex, the subgradient inequality implies that the left-hand side is at least the right-hand side. The amount by which the left-hand side exceeds the right-hand side is the Bregman divergence.

Definition 2.1. The Bregman divergence is defined to be

    D_f(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩.
2.1 Examples
Example 2.2. Define f : R^n → R by f(x) = ‖x‖₂². Then

    D_f(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩
              = ‖x‖₂² − ‖y‖₂² − 2⟨y, x − y⟩
              = ‖x‖₂² + ‖y‖₂² − 2⟨y, x⟩
              = ‖x − y‖₂².
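This identity is easy to verify numerically. The sketch below (not from the notes; the helper name `bregman_sq` is my own) computes the Bregman divergence directly from Definition 2.1, using ∇f(y) = 2y:

```python
# Hypothetical check (not from the notes) of Example 2.2: for f(x) = ||x||_2^2,
# the Bregman divergence D_f(x, y) should equal ||x - y||_2^2.
def bregman_sq(x, y):
    f = lambda v: sum(vi * vi for vi in v)
    grad_fy = [2 * yi for yi in y]           # grad f(y) = 2y
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_fy, x, y))
    return f(x) - f(y) - inner

x, y = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
lhs = bregman_sq(x, y)
rhs = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
print(abs(lhs - rhs) < 1e-12)
```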
Example 2.3 (Negative entropy). Recall that the negative entropy function is f : R^n_{>0} → R defined by f(x) = Σ_{i=1}^n x_i ln x_i. Then

    D_f(x, y) = f(x) − f(y) − ∇f(y)ᵀ(x − y)
              = Σ_{i=1}^n x_i ln x_i − Σ_{i=1}^n y_i ln y_i − Σ_{i=1}^n (ln y_i + 1)(x_i − y_i)
              = Σ_{i=1}^n x_i ln x_i − Σ_{i=1}^n x_i ln y_i − Σ_{i=1}^n x_i + Σ_{i=1}^n y_i
              = Σ_{i=1}^n x_i ln(x_i / y_i) − Σ_{i=1}^n x_i + Σ_{i=1}^n y_i
              = D_KL(x ‖ y),                                              (2.1)

the generalized KL-divergence between x and y, which we introduced in Lecture 16. In the case that Σ_{i=1}^n x_i = Σ_{i=1}^n y_i = 1, this is the ordinary KL divergence (or “relative entropy”) between x and y.

Negative entropy will be particularly important to us, so we prove one property of it now.
Claim 2.4. Negative entropy is 1-strongly convex with respect to the ℓ_1 norm.

To prove this, we require the following theorem.

Theorem 2.5 (Pinsker’s Inequality). For any distributions p, q, we have D_KL(p ‖ q) ≥ (1/2) ‖p − q‖₁².

References. Wikipedia; lecture notes of Sanjeev Khudanpur.
Proof (of Claim 2.4). As in Example 2.3, let f(x) = Σ_{i=1}^n x_i ln x_i. Then,

    f(y) = f(x) + ⟨∇f(x), y − x⟩ + D_KL(y ‖ x)               (by (2.1))
         ≥ f(x) + ⟨∇f(x), y − x⟩ + (1/2) ‖x − y‖₁²           (by Theorem 2.5).
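Pinsker's inequality can be spot-checked numerically on random distributions. This sketch is my own (not from the notes); the helper names `kl` and `l1` are assumptions:

```python
import math, random

# Hypothetical spot-check (not from the notes) of Pinsker's inequality:
# D_KL(p || q) >= (1/2) * ||p - q||_1^2 for random distributions p, q.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def l1(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

random.seed(0)
ok = True
for _ in range(1000):
    p = [random.random() + 1e-6 for _ in range(4)]
    q = [random.random() + 1e-6 for _ in range(4)]
    sp, sq = sum(p), sum(q)
    p = [v / sp for v in p]                  # normalize to distributions
    q = [v / sq for v in q]
    ok = ok and (kl(p, q) >= 0.5 * l1(p, q) ** 2 - 1e-12)
print(ok)
```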
There are also some interesting examples involving matrices.
Example 2.6. Let f(X) = tr(X log X). Then D_f(X, Y) = tr(X log X − X log Y − X + Y). This is called the von Neumann divergence, or quantum relative entropy.

Example 2.7. Let f(X) = − log det X. Then D_f(X, Y) = tr(XY⁻¹ − I) − log det(XY⁻¹). This is called the log-det divergence.
2.2 Properties
Claim 2.8. Df (x, y) is convex in x.
Proof. This is immediate from the definition since f(x) is convex in x and −h ∇f(y), x − y i is linear in x.
Note. D_f(x, y) is not generally convex in y. Consider the case f(x) = exp(x) and x = 4. Then

    D_f(4, 0) = e⁴ − 5 < 50
    D_f(4, 1) = e⁴ − 4e > 43
    D_f(4, 2) = e⁴ − 3e² < 33

As D_f(4, 1) > ( D_f(4, 0) + D_f(4, 2) ) / 2, D_f(x, y) is not convex in y.

Lemma 2.9. Let f be closed, convex and differentiable. Fix any x, y ∈ X. Define x̂ = ∇f(x) and ŷ = ∇f(y). Then
    ∇f*(x̂) = x                                  (2.2)
    D_f(x, y) = D_{f*}(ŷ, x̂)                    (2.3)
References. [8, Eq. (6)].
Proof. By Claim 1.11,
    f*(x̂) = ⟨x̂, x⟩ − f(x),    f*(ŷ) = ⟨ŷ, y⟩ − f(y),    and    ∇f*(x̂) = x.

This proves (2.2). Thus
    D_{f*}(ŷ, x̂) = f*(ŷ) − f*(x̂) − ⟨∇f*(x̂), ŷ − x̂⟩
                 = ( ⟨∇f(y), y⟩ − f(y) ) − ( ⟨∇f(x), x⟩ − f(x) ) − ⟨x, ∇f(y) − ∇f(x)⟩
                 = f(x) − f(y) − ⟨∇f(y), x − y⟩
                 = D_f(x, y).

This proves (2.3).
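Lemma 2.9 can be verified numerically for negative entropy in one dimension, where f(z) = z ln z has ∇f(z) = ln z + 1 and, by Example 1.5, f*(w) = e^{w−1} with ∇f*(w) = e^{w−1}. This is my own sketch, not from the notes:

```python
import math

# Hypothetical check (not from the notes) of Lemma 2.9 for f(z) = z ln z.
def D_f(x, y):                    # Bregman divergence of f(z) = z ln z
    return x * math.log(x) - y * math.log(y) - (math.log(y) + 1) * (x - y)

def D_fstar(v, u):                # Bregman divergence of f*(w) = e^(w-1)
    return math.exp(v - 1) - math.exp(u - 1) - math.exp(u - 1) * (v - u)

x, y = 2.0, 0.7
x_hat, y_hat = math.log(x) + 1, math.log(y) + 1      # dual points grad f(x), grad f(y)
print(abs(D_f(x, y) - D_fstar(y_hat, x_hat)) < 1e-9)
```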
Lemma 2.10 (Generalized Pythagoras Identity).

    D_f(x, y) + D_f(z, x) − D_f(z, y) = ( ∇f(x) − ∇f(y) )ᵀ(x − z).
Example 2.11. Consider the case of f(x) = ‖x‖₂². Applying (5.1) with a = x − z and b = x − y (so that a − b = y − z), we obtain

    ‖x − y‖₂² + ‖z − x‖₂² − ‖z − y‖₂² = 2(x − y)ᵀ(x − z) = ( ∇f(x) − ∇f(y) )ᵀ(x − z).
Proof (of Lemma 2.10).

    D_f(x, y) + D_f(z, x) − D_f(z, y)
        = ( f(x) − f(y) − ⟨∇f(y), x − y⟩ ) + ( f(z) − f(x) − ⟨∇f(x), z − x⟩ )
            − ( f(z) − f(y) − ⟨∇f(y), z − y⟩ )
        = ⟨∇f(y), z − x⟩ − ⟨∇f(x), z − x⟩
        = ( ∇f(x) − ∇f(y) )ᵀ(x − z).
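The identity holds for any differentiable convex f, not just squared norms. Here is a numerical check (my own, not from the notes) using negative entropy in R², with the generalized KL divergence from Example 2.3:

```python
import math

# Hypothetical check (not from the notes) of Lemma 2.10 for negative entropy:
# D_f(x,y) + D_f(z,x) - D_f(z,y) = <grad f(x) - grad f(y), x - z>.
def D(x, y):                                  # generalized KL divergence
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

def grad(x):                                  # gradient of sum_i x_i ln x_i
    return [math.log(xi) + 1 for xi in x]

x, y, z = [0.5, 1.5], [1.0, 0.2], [2.0, 0.8]
lhs = D(x, y) + D(z, x) - D(z, y)
rhs = sum((gx - gy) * (xi - zi)
          for gx, gy, xi, zi in zip(grad(x), grad(y), x, z))
print(abs(lhs - rhs) < 1e-9)
```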
Claim 2.12. ∇_x D_f(x, y) = ∇f(x) − ∇f(y).

Proof.

    ∇_x ( f(x) − f(y) − ⟨∇f(y), x − y⟩ ) = ∇_x f(x) − ∇_x ⟨∇f(y), x⟩ = ∇f(x) − ∇f(y).
2.2.1 Projections
Let X ⊆ R^n be a closed, convex set. Assume that f : X → R is a strictly convex function, which implies that D_f(x, y) is strictly convex in x.

Definition 2.13. The projection of y onto X under the Bregman divergence is

    Π_X^f(y) = argmin_{x ∈ X} D_f(x, y).

The minimizer is uniquely determined since D_f(x, y) is strictly convex in x.
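As an illustration (my own, not from the notes): for negative entropy, the Bregman projection of a point y onto the probability simplex is simply the normalization y / ‖y‖₁. The sketch below verifies this in R² by brute force over the simplex points x = (t, 1 − t):

```python
import math

# Hypothetical illustration (not from the notes): under negative entropy, the
# Bregman projection onto the simplex should be y normalized to sum to 1.
def gen_kl(x, y):                             # generalized KL divergence
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

y = [0.6, 0.9]
best_t, best_val = None, float("inf")
for i in range(1, 10000):
    t = i / 10000
    val = gen_kl([t, 1 - t], y)
    if val < best_val:
        best_t, best_val = t, val

expected = y[0] / (y[0] + y[1])               # first coordinate of y / ||y||_1
print(abs(best_t - expected) < 1e-3)
```

The brute-force grid stands in for the argmin in Definition 2.13; in higher dimensions one would solve the constrained minimization directly.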
The next claim gives optimality conditions for the Bregman projection, which are analogous to results that we discussed for Euclidean projections.
Claim 2.14. Suppose that f is differentiable. Fix any y ∈ R^n and let π = Π_X^f(y). Then