Bregman Divergence and Mirror Descent

1 Bregman Divergence

Motivation

• Generalize the squared Euclidean distance to a class of divergences that all share similar properties.
• Lots of applications in machine learning, clustering, etc.

Definition 1 (Bregman divergence) Let $\psi : \Omega \to \mathbb{R}$ be a function that is: a) strictly convex, b) continuously differentiable, c) defined on a closed convex set $\Omega$. Then the Bregman divergence is defined as

\[
\Delta_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y),\, x - y \rangle, \qquad \forall\, x, y \in \Omega. \tag{1}
\]
That is, $\Delta_\psi(x, y)$ is the difference between the value of $\psi$ at $x$ and the first-order Taylor expansion of $\psi$ around $y$ evaluated at the point $x$.

Examples

• Squared Euclidean distance. Let $\psi(x) = \frac{1}{2}\|x\|^2$. Then $\Delta_\psi(x, y) = \frac{1}{2}\|x - y\|^2$.

• Negative entropy. Let $\psi(x) = \sum_i x_i \log x_i$ and $\Omega = \{x \in \mathbb{R}^n_+ : \mathbf{1}' x = 1\}$, where $\mathbf{1} = (1, 1, \ldots, 1)'$. Then $\Delta_\psi(x, y) = \sum_i x_i \log \frac{x_i}{y_i}$ for $x, y \in \Omega$. This is called the relative entropy, or Kullback–Leibler (KL) divergence, between probability distributions $x$ and $y$.

• $L_p$ norm. Let $p \ge 1$ and $\frac{1}{p} + \frac{1}{q} = 1$, and let $\psi(x) = \frac{1}{2}\|x\|_q^2$. Then $\Delta_\psi(x, y) = \frac{1}{2}\|x\|_q^2 + \frac{1}{2}\|y\|_q^2 - \left\langle x,\, \nabla \frac{1}{2}\|y\|_q^2 \right\rangle$. Note that $\frac{1}{2}\|y\|_q^2$ is not necessarily continuously differentiable, which makes this case not precisely consistent with our definition.
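As a quick sanity check, the following sketch evaluates the general definition (1) numerically for the two differentiable examples above and compares against their closed forms (the helper names and test points are illustrative).

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Generic Bregman divergence from definition (1)."""
    return psi(x) - psi(y) - np.dot(grad_psi(y), x - y)

# Example 1: psi(x) = 1/2 ||x||^2  ->  Delta_psi(x, y) = 1/2 ||x - y||^2
psi_sq, grad_sq = lambda x: 0.5 * np.dot(x, x), lambda x: x

# Example 2: negative entropy on the simplex  ->  Delta_psi(x, y) = KL(x || y)
psi_ent, grad_ent = lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0

rng = np.random.default_rng(0)
x = rng.random(5); x /= x.sum()          # two points on the probability simplex
y = rng.random(5); y /= y.sum()

print(np.isclose(bregman(psi_sq, grad_sq, x, y), 0.5 * np.sum((x - y) ** 2)))
print(np.isclose(bregman(psi_ent, grad_ent, x, y), np.sum(x * np.log(x / y))))
```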

1.1 Properties of Bregman divergence

• Strict convexity in the first argument $x$. Trivial by the strict convexity of $\psi$.

• Nonnegativity: $\Delta_\psi(x, y) \ge 0$ for all $x, y$, and $\Delta_\psi(x, y) = 0$ if and only if $x = y$. Trivial by strict convexity.

• Asymmetry: in general, $\Delta_\psi(x, y) \ne \Delta_\psi(y, x)$; e.g., the KL-divergence. Symmetrization is not always useful.

• Non-convexity in the second argument. Let $\Omega = [1, \infty)$ and $\psi(x) = -\log x$. Then $\Delta_\psi(x, y) = -\log x + \log y + \frac{x - y}{y}$. One can check that its second-order derivative in $y$ is $\frac{1}{y^2}\left(\frac{2x}{y} - 1\right)$, which is negative when $2x < y$.

• Linearity in $\psi$. For any $a > 0$, $\Delta_{\psi + a\varphi}(x, y) = \Delta_\psi(x, y) + a\,\Delta_\varphi(x, y)$.

• Gradient in $x$: $\frac{\partial}{\partial x}\Delta_\psi(x, y) = \nabla\psi(x) - \nabla\psi(y)$. The gradient in $y$ is trickier, and not commonly used.

• Generalized triangle inequality:

\begin{align}
\Delta_\psi(x, y) + \Delta_\psi(y, z) &= \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle + \psi(y) - \psi(z) - \langle \nabla\psi(z), y - z \rangle \tag{2}\\
&= \Delta_\psi(x, z) + \langle x - y,\, \nabla\psi(z) - \nabla\psi(y) \rangle. \tag{3}
\end{align}

• Special case: $\psi$ is called strongly convex with respect to some norm $\|\cdot\|$ with modulus $\sigma$ if
\[
\psi(x) \ge \psi(y) + \langle \nabla\psi(y), x - y \rangle + \frac{\sigma}{2}\|x - y\|^2. \tag{4}
\]
Note the norm here is not necessarily the Euclidean norm. When the norm is Euclidean, this condition is equivalent to $\psi(x) - \frac{\sigma}{2}\|x\|^2$ being convex. For example, the $\psi(x) = \sum_i x_i\log x_i$ used in the KL-divergence is 1-strongly convex over the simplex $\Omega = \{x \in \mathbb{R}^n_+ : \mathbf{1}' x = 1\}$ with respect to the $L_1$ norm (not so trivial). When $\psi$ is $\sigma$-strongly convex, we have
\[
\Delta_\psi(x, y) \ge \frac{\sigma}{2}\|x - y\|^2. \tag{5}
\]

Proof: By definition, $\Delta_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle \ge \frac{\sigma}{2}\|x - y\|^2$.

• Duality. Suppose $\psi$ is strongly convex. Then
\[
(\nabla\psi^*)(\nabla\psi(x)) = x, \qquad \Delta_\psi(x, y) = \Delta_{\psi^*}\!\left(\nabla\psi(y), \nabla\psi(x)\right). \tag{6}
\]

Proof: (for the first equality only) Recall
\[
\psi^*(y) = \sup_{z \in \Omega}\{\langle z, y \rangle - \psi(z)\}. \tag{7}
\]
The sup must be attained because $\psi$ is strongly convex and $\Omega$ is closed, and $x$ is a maximizer if and only if $y = \nabla\psi(x)$. So
\[
\psi^*(y) + \psi(x) = \langle x, y \rangle \iff y = \nabla\psi(x). \tag{8}
\]
Since $\psi = \psi^{**}$, we have $\psi^*(y) + \psi^{**}(x) = \langle x, y \rangle$, which means $y$ is the maximizer in
\[
\psi^{**}(x) = \sup_z\{\langle x, z \rangle - \psi^*(z)\}. \tag{9}
\]
This means $x = \nabla\psi^*(y)$. To summarize, $(\nabla\psi^*)(\nabla\psi(x)) = x$.

• Mean of distribution. Suppose U is a random variable over an open set S with distribution µ. Then

\[
\min_{x \in S}\; \mathbb{E}_{U \sim \mu}[\Delta_\psi(U, x)] \tag{10}
\]
is optimized at $\bar{u} := \mathbb{E}_\mu[U] = \int_{u \in S} u\,\mu(u)\,du$.
Proof: For any $x \in S$, we have

\begin{align}
&\mathbb{E}_{U \sim \mu}[\Delta_\psi(U, x)] - \mathbb{E}_{U \sim \mu}[\Delta_\psi(U, \bar{u})] \tag{11}\\
&= \mathbb{E}_\mu\left[\psi(U) - \psi(x) - (U - x)'\nabla\psi(x) - \psi(U) + \psi(\bar{u}) + (U - \bar{u})'\nabla\psi(\bar{u})\right] \tag{12}\\
&= \psi(\bar{u}) - \psi(x) + x'\nabla\psi(x) - \bar{u}'\nabla\psi(\bar{u}) + \mathbb{E}_\mu\left[-U'\nabla\psi(x) + U'\nabla\psi(\bar{u})\right] \tag{13}\\
&= \psi(\bar{u}) - \psi(x) - (\bar{u} - x)'\nabla\psi(x) \tag{14}\\
&= \Delta_\psi(\bar{u}, x). \tag{15}
\end{align}
This must be nonnegative, and it is 0 if and only if $x = \bar{u}$.

• Generalized Pythagorean theorem. If $x^*$ is the projection of $x_0$ onto a convex set $C \subseteq \Omega$:
\[
x^* = \operatorname*{argmin}_{x \in C}\Delta_\psi(x, x_0), \tag{16}
\]
then for all $y \in C$,
\[
\Delta_\psi(y, x_0) \ge \Delta_\psi(y, x^*) + \Delta_\psi(x^*, x_0). \tag{17}
\]
In the Euclidean case, this means the angle $\angle\, y\, x^*\, x_0$ is obtuse. More generally:

Lemma 2 Suppose $L$ is a proper convex function whose domain is an open set containing $C$; $L$ is not necessarily differentiable. Let $x^*$ be
\[
x^* = \operatorname*{argmin}_{x \in C}\{L(x) + \Delta_\psi(x, x_0)\}. \tag{18}
\]
Then for any $y \in C$ we have
\[
L(y) + \Delta_\psi(y, x_0) \ge L(x^*) + \Delta_\psi(x^*, x_0) + \Delta_\psi(y, x^*). \tag{19}
\]

The projection in (16) is just a special case with $L = 0$. This property is the key to the analysis of many optimization algorithms using Bregman divergences.
Proof: Denote $J(x) = L(x) + \Delta_\psi(x, x_0)$. Since $x^*$ minimizes $J$ over $C$, there must exist a subgradient $d \in \partial J(x^*)$ such that
\[
\langle d, x - x^* \rangle \ge 0, \qquad \forall\, x \in C. \tag{20}
\]
Since $\partial J(x^*) = \{g + \nabla_{x = x^*}\Delta_\psi(x, x_0) : g \in \partial L(x^*)\} = \{g + \nabla\psi(x^*) - \nabla\psi(x_0) : g \in \partial L(x^*)\}$, there must be a subgradient $g \in \partial L(x^*)$ such that
\[
\langle g + \nabla\psi(x^*) - \nabla\psi(x_0),\, x - x^* \rangle \ge 0, \qquad \forall\, x \in C. \tag{21}
\]
Therefore, using the defining property of subgradients, we have for all $y \in C$ that
\begin{align}
L(y) &\ge L(x^*) + \langle g, y - x^* \rangle \tag{22}\\
&\ge L(x^*) + \langle \nabla\psi(x_0) - \nabla\psi(x^*),\, y - x^* \rangle \tag{23}\\
&= L(x^*) - \langle \nabla\psi(x_0), x^* - x_0 \rangle + \psi(x^*) - \psi(x_0) \tag{24}\\
&\quad + \langle \nabla\psi(x_0), y - x_0 \rangle - \psi(y) + \psi(x_0) \tag{25}\\
&\quad - \langle \nabla\psi(x^*), y - x^* \rangle + \psi(y) - \psi(x^*) \tag{26}\\
&= L(x^*) + \Delta_\psi(x^*, x_0) - \Delta_\psi(y, x_0) + \Delta_\psi(y, x^*). \tag{27}
\end{align}
Rearranging completes the proof.
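Several of these properties are easy to verify numerically. The sketch below (an illustration with arbitrary test vectors) uses the unnormalized negative entropy $\psi(x) = \sum_i x_i\log x_i$ on the positive orthant, for which $\nabla\psi(x) = \log x + 1$ and $\psi^*(u) = \sum_i e^{u_i - 1}$, to check the duality identity (6); it also checks the generalized Pythagorean inequality (17) for the Bregman projection onto the probability simplex, which for this $\psi$ is just a renormalization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized negative entropy on the positive orthant and its conjugate.
psi       = lambda x: np.sum(x * np.log(x))
grad_psi  = lambda x: np.log(x) + 1.0
psi_star  = lambda u: np.sum(np.exp(u - 1.0))
grad_star = lambda u: np.exp(u - 1.0)

def breg(f, gf, a, b):
    return f(a) - f(b) - np.dot(gf(b), a - b)

x = rng.random(6) + 0.1
y = rng.random(6) + 0.1

# Duality (6): grad psi* inverts grad psi, and the divergence transposes its arguments.
print(np.allclose(grad_star(grad_psi(x)), x))
print(np.isclose(breg(psi, grad_psi, x, y),
                 breg(psi_star, grad_star, grad_psi(y), grad_psi(x))))

# Generalized Pythagorean inequality (17): projecting a positive point x0 onto
# the simplex under this psi amounts to normalization.
x0 = rng.random(6) + 0.1
x_star = x0 / x0.sum()
y_simplex = rng.dirichlet(np.ones(6))
lhs = breg(psi, grad_psi, y_simplex, x0)
rhs = breg(psi, grad_psi, y_simplex, x_star) + breg(psi, grad_psi, x_star, x0)
print(lhs >= rhs - 1e-12)   # here it even holds with equality, since the constraint is affine
```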

2 Mirror Descent for Batch Optimization

Suppose we want to minimize a function $f$ over a set $C$. Recall the subgradient descent rule

\begin{align}
x_{k+\frac{1}{2}} &= x_k - \eta_k g_k, \qquad \text{where } g_k \in \partial f(x_k) \tag{28}\\
x_{k+1} &= \operatorname*{argmin}_{x \in C}\frac{1}{2}\left\|x - x_{k+\frac{1}{2}}\right\|^2 = \operatorname*{argmin}_{x \in C}\frac{1}{2}\left\|x - (x_k - \eta_k g_k)\right\|^2. \tag{29}
\end{align}

This can be interpreted as follows. First, approximate $f$ around $x_k$ by a first-order Taylor expansion:

\[
f(x) \approx f(x_k) + \langle g_k, x - x_k \rangle. \tag{30}
\]
Then penalize the displacement by $\frac{1}{2\eta_k}\|x - x_k\|^2$. So the update rule is to find a regularized minimizer of the model:
\[
x_{k+1} = \operatorname*{argmin}_{x \in C}\left\{ f(x_k) + \langle g_k, x - x_k \rangle + \frac{1}{2\eta_k}\|x - x_k\|^2 \right\}. \tag{31}
\]
It is trivial to see that this is exactly equivalent to (29).

Mirror descent extension. To generalize (31) beyond the Euclidean distance, it is straightforward to use the Bregman divergence as a measure of displacement:
\[
x_{k+1} = \operatorname*{argmin}_{x \in C}\left\{ f(x_k) + \langle g_k, x - x_k \rangle + \frac{1}{\eta_k}\Delta_\psi(x, x_k) \right\} \tag{32}
\]

\[
= \operatorname*{argmin}_{x \in C}\left\{ \eta_k f(x_k) + \eta_k\langle g_k, x - x_k \rangle + \Delta_\psi(x, x_k) \right\}. \tag{33}
\]
It is again equivalent to two steps:
\[
x_{k+\frac{1}{2}} = \operatorname*{argmin}_{x}\left\{ f(x_k) + \langle g_k, x - x_k \rangle + \frac{1}{\eta_k}\Delta_\psi(x, x_k) \right\} \tag{34}
\]

\[
x_{k+1} = \operatorname*{argmin}_{x \in C}\Delta_\psi(x, x_{k+\frac{1}{2}}). \tag{35}
\]
The first-order optimality condition for (34) is
\begin{align}
& g_k + \frac{1}{\eta_k}\left(\nabla\psi(x_{k+\frac{1}{2}}) - \nabla\psi(x_k)\right) = 0 \tag{36}\\
\iff\; & \nabla\psi(x_{k+\frac{1}{2}}) = \nabla\psi(x_k) - \eta_k g_k \tag{37}\\
\iff\; & x_{k+\frac{1}{2}} = (\nabla\psi)^{-1}\left(\nabla\psi(x_k) - \eta_k g_k\right) = (\nabla\psi^*)\left(\nabla\psi(x_k) - \eta_k g_k\right). \tag{38}
\end{align}
For example, with the KL-divergence over the simplex, the update rule becomes

\[
x_{k+\frac{1}{2}}(i) = x_k(i)\,\exp(-\eta_k g_k(i)). \tag{39}
\]
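Combining (39) with the Bregman projection (35), which for the KL-divergence over the simplex is just a renormalization, gives the exponentiated-gradient form of mirror descent. Below is a minimal sketch on a toy linear objective over the simplex; the objective, dimension, and step-size schedule are illustrative choices.

```python
import numpy as np

def mirror_descent_kl(grad, x1, T, eta):
    """Entropic mirror descent on the simplex:
    x_{k+1/2}(i) = x_k(i) * exp(-eta_k * g_k(i))   (39)
    x_{k+1}      = x_{k+1/2} / sum(x_{k+1/2})      KL projection (35)."""
    x = x1.copy()
    for k in range(1, T + 1):
        g = grad(x)
        x = x * np.exp(-eta(k) * g)
        x /= x.sum()
    return x

# Toy problem: minimize <c, x> over the probability simplex; the optimum puts
# all mass on the smallest coordinate of c.
n = 10
c = np.linspace(1.0, 2.0, n)
x1 = np.full(n, 1.0 / n)
x_final = mirror_descent_kl(lambda x: c, x1, T=500, eta=lambda k: 1.0 / np.sqrt(k))
print(x_final.argmax() == c.argmin())   # True: mass concentrates on coordinate 0
print(c @ x_final - c.min())            # small suboptimality
```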

2.1 Rate of convergence for subgradient descent with Euclidean distance

We now analyze the rates of convergence of subgradient descent as in (31) and (33). It takes four steps.

1. Bounding a single update:

\begin{align}
\|x_{k+1} - x^*\|_2^2 &\le \left\|x_{k+\frac{1}{2}} - x^*\right\|_2^2 = \|x_k - \eta_k g_k - x^*\|_2^2 \qquad (\text{the first } \le \text{ is by the Pythagorean theorem (17)}) \tag{40}\\
&= \|x_k - x^*\|_2^2 - 2\eta_k\langle g_k, x_k - x^* \rangle + \eta_k^2\|g_k\|_2^2 \tag{41}\\
&\le \|x_k - x^*\|_2^2 - 2\eta_k\left(f(x_k) - f(x^*)\right) + \eta_k^2\|g_k\|_2^2. \tag{42}
\end{align}

2. Telescope over k = 1,...,T (summing them up):

\[
\|x_{T+1} - x^*\|_2^2 \le \|x_1 - x^*\|_2^2 - 2\sum_{k=1}^{T}\eta_k\left(f(x_k) - f(x^*)\right) + \sum_{k=1}^{T}\eta_k^2\|g_k\|_2^2. \tag{43}
\]

3. Bounding by $\|g_k\|_2^2 \le G^2$ and $\|x_1 - x^*\|_2^2 \le R^2 := \max_{x \in C}\|x_1 - x\|_2^2$:

\[
2\sum_{k=1}^{T}\eta_k\left(f(x_k) - f(x^*)\right) \le R^2 + G^2\sum_{k=1}^{T}\eta_k^2. \tag{44}
\]

4. Denote $\epsilon_k = f(x_k) - f(x^*)$ and rearrange:

\[
\min_{k \in \{1, \ldots, T\}}\epsilon_k \le \frac{R^2 + G^2\sum_{k=1}^{T}\eta_k^2}{2\sum_{k=1}^{T}\eta_k}. \tag{45}
\]

Denote $[T] := \{1, 2, \ldots, T\}$. By setting the step size $\eta_k = \frac{R}{G\sqrt{k}}$, we can achieve

\[
\min_{k \in [T]}\epsilon_k \le RG\,\frac{1 + \sum_{k=1}^{T}\frac{1}{k}}{2\sum_{k=1}^{T}\frac{1}{\sqrt{k}}} \le RG\,\frac{2 + \int_1^T \frac{1}{x}\,dx}{2\int_1^{T+1}\frac{1}{\sqrt{x}}\,dx} = O\!\left(\frac{RG\log T}{\sqrt{T}}\right). \tag{46}
\]

Remark 1 The term $\log T$ in the bound can actually be removed by using the following simple fact. Given $c > 0$, $b \in \mathbb{R}^d_+$, and a positive definite matrix $D$, we have

\[
\min_{x \in \mathbb{R}^d_+}\frac{c + \frac{1}{2}x' D x}{b' x} = \sqrt{\frac{2c}{b' D^{-1} b}}, \qquad \text{where the optimal } x = \sqrt{\frac{2c}{b' D^{-1} b}}\; D^{-1} b. \tag{47}
\]

One can prove it by writing out the KKT conditions for the equivalent convex problem (with a perspective function) $\inf_{x, u}\frac{1}{u}\left(c + \frac{1}{2}x' D x\right)$, s.t. $x \in \mathbb{R}^d_+$, $u > 0$, and $b' x = u$. Now apply this result to (45) with all $\eta_k = \frac{R}{G\sqrt{T}}$ ($k \in [T]$); then we get
\[
\min_{k \in [T]}\epsilon_k \le \frac{RG}{\sqrt{T}}. \tag{48}
\]

So to drive $\min_{k \in [T]}\epsilon_k$ below a threshold $\epsilon > 0$, it suffices to take $T$ steps where

\[
T \ge \frac{R^2 G^2}{\epsilon^2}. \tag{49}
\]
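As an illustration of this fixed-horizon scheme, the sketch below runs projected subgradient descent with the constant step size $\eta_k = \frac{R}{G\sqrt{T}}$ on a simple nonsmooth problem with a known minimizer (minimizing $\|x - c\|_2$ over the unit ball, where $G = 1$ and the diameter gives $R \le 2$); the problem instance is an arbitrary example.

```python
import numpy as np

def projected_subgradient(subgrad, proj, x1, T, R, G):
    """Projected subgradient descent (31) with eta_k = R / (G * sqrt(T));
    returns the best objective value seen, i.e. min_k f(x_k)."""
    x, best_f = x1.copy(), np.inf
    eta = R / (G * np.sqrt(T))
    for _ in range(T):
        f_x, g = subgrad(x)
        best_f = min(best_f, f_x)
        x = proj(x - eta * g)
    return best_f

# Nonsmooth test problem: f(x) = ||x - c||_2 over the unit ball with ||c|| > 1,
# so the minimizer is c / ||c|| and f* = ||c|| - 1.
rng = np.random.default_rng(2)
c = rng.normal(size=20); c *= 3.0 / np.linalg.norm(c)

def subgrad(x):
    d = x - c
    nd = np.linalg.norm(d)
    return nd, d / nd                     # f(x) and a subgradient (||g|| = 1, so G = 1)

proj = lambda z: z / max(1.0, np.linalg.norm(z))
x1 = proj(rng.normal(size=20))

for T in (100, 10_000):
    gap = projected_subgradient(subgrad, proj, x1, T, R=2.0, G=1.0) - (np.linalg.norm(c) - 1.0)
    print(T, gap)                         # stays below the bound R*G/sqrt(T) from (48)
```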

Note the method requires that the horizon $T$ be specified a priori, because the step size $\eta_k$ needs this information. We next give a more intricate approach which does not require a pre-specified horizon.

Remark 2 The term $\log T$ in the bound can also be removed as follows. Here we redefine $R^2$ as the squared diameter $\max_{x, y \in C}\|x - y\|_2^2$. Instead of telescoping over $k = 1, \ldots, T$, let us telescope from $k = T/2$ to $T$ (without loss of generality, let $T$ be an even integer):
\begin{align}
\|x_{T+1} - x^*\|_2^2 &\le \left\|x_{T/2} - x^*\right\|_2^2 - 2\sum_{k=T/2}^{T}\eta_k\left(f(x_k) - f(x^*)\right) + \sum_{k=T/2}^{T}\eta_k^2\|g_k\|_2^2 \tag{50}\\
\implies\quad & 2\sum_{k=T/2}^{T}\eta_k\left(f(x_k) - f(x^*)\right) \le R^2 + G^2\sum_{k=T/2}^{T}\eta_k^2 \tag{51}\\
\implies\quad & \min_{k \in \{T/2, \ldots, T\}}\epsilon_k \le \frac{R^2 + G^2\sum_{k=T/2}^{T}\eta_k^2}{2\sum_{k=T/2}^{T}\eta_k} \tag{52}\\
\left(\text{plug in } \eta_k = \tfrac{R}{G\sqrt{k}}\right)\quad & = RG\,\frac{1 + \sum_{k=T/2}^{T}\frac{1}{k}}{2\sum_{k=T/2}^{T}\frac{1}{\sqrt{k}}} \le RG\,\frac{1 + \int_{T/2-1}^{T}\frac{1}{x}\,dx}{2\int_{T/2}^{T+1}\frac{1}{\sqrt{x}}\,dx} \le \frac{2RG}{\sqrt{T}}. \tag{53}
\end{align}

The trick is to exploit $\log T - \log(\frac{T}{2} - 1) \approx \log 2$ in the numerator. In step (51), we bounded $\|x_{T/2} - x^*\|_2^2$ by $R^2$, because in general we cannot bound it by $\|x_1 - x^*\|_2^2$. In the sequel, we will simply write
\[
\min_{k \in [T]}\epsilon_k \le \frac{RG}{\sqrt{T}},
\]
ignoring the constants.

2.2 Rate of convergence for mirror descent

The rate of convergence of subgradient descent often depends on $R$ and $G$, which may unfortunately depend on the dimension of the problem. For example, suppose $C$ is the simplex. Then $R \le \sqrt{2}$. If each coordinate of each gradient $g_i$ is upper bounded by $M$, then $G$ can be as large as $M\sqrt{n}$, i.e., it depends on the dimension of $x$.

We next see how this dependency can be removed by extending the Euclidean distance to a Bregman divergence. Clearly steps 2 to 4 above can be easily extended by replacing $\|x_{k+1} - x^*\|_2^2$ with $\Delta_\psi(x^*, x_{k+1})$. So the only challenge left is to extend step 1. This is actually possible via Lemma 2. We further assume $\psi$ is $\sigma$-strongly convex on $C$. In (33), consider $\eta_k(f(x_k) + \langle g_k, x - x_k \rangle)$ as the $L$ in Lemma 2. Then
\begin{align}
\eta_k\left(f(x_k) + \langle g_k, x^* - x_k \rangle\right) + \Delta_\psi(x^*, x_k) &\ge \eta_k\left(f(x_k) + \langle g_k, x_{k+1} - x_k \rangle\right) + \Delta_\psi(x_{k+1}, x_k) \tag{54}\\
&\quad + \Delta_\psi(x^*, x_{k+1}). \tag{55}
\end{align}
Canceling some terms and rearranging, we obtain
\begin{align}
\Delta_\psi(x^*, x_{k+1}) &\le \Delta_\psi(x^*, x_k) + \eta_k\langle g_k, x^* - x_{k+1} \rangle - \Delta_\psi(x_{k+1}, x_k) \tag{56}\\
&= \Delta_\psi(x^*, x_k) + \eta_k\langle g_k, x^* - x_k \rangle + \eta_k\langle g_k, x_k - x_{k+1} \rangle - \Delta_\psi(x_{k+1}, x_k) \tag{57}\\
&\le \Delta_\psi(x^*, x_k) - \eta_k\left(f(x_k) - f(x^*)\right) + \eta_k\langle g_k, x_k - x_{k+1} \rangle - \frac{\sigma}{2}\|x_k - x_{k+1}\|^2 \tag{58}\\
&\le \Delta_\psi(x^*, x_k) - \eta_k\left(f(x_k) - f(x^*)\right) + \eta_k\|g_k\|_*\|x_k - x_{k+1}\| - \frac{\sigma}{2}\|x_k - x_{k+1}\|^2 \tag{59}\\
&\le \Delta_\psi(x^*, x_k) - \eta_k\left(f(x_k) - f(x^*)\right) + \frac{\eta_k^2}{2\sigma}\|g_k\|_*^2. \tag{60}
\end{align}
Comparing with (42), we have successfully replaced $\|x_{k+1} - x^*\|_2^2$ with $\Delta_\psi(x^*, x_{k+1})$. Again upper bound $\Delta_\psi(x^*, x_1)$ by $R^2$ and $\|g_k\|_*$ by $G$, and we obtain
\[
\min_{k \in [T]}\epsilon_k \le \frac{RG}{\sqrt{\sigma T}}. \tag{61}
\]

Note the norm on $g_k$ is the dual norm. To see the advantage of mirror descent, suppose $C$ is the $n$-dimensional simplex, and we use the KL-divergence, for which $\psi$ is 1-strongly convex with respect to the $L_1$ norm. The dual norm of the $L_1$ norm is the $L_\infty$ norm. Then we can bound $\Delta_\psi(x^*, x_1)$ by using the KL-divergence, and it is at most $\log n$ if we set $x_1 = \frac{1}{n}\mathbf{1}$ and $x^*$ lies in the probability simplex. $G$ can be upper bounded by $M$, and $R^2$ by $\log n$. So with regard to the value of $RG$, mirror descent yields $M\sqrt{\log n}$, which is smaller than that of subgradient descent by an order of $O(\sqrt{n/\log n})$. Note the saving of $\Theta(\sqrt{n})$ comes from the norm of the gradient ($G$), by replacing the $L_2$ norm with the $L_\infty$ norm, at a slight cost of increasing $R$ from $O(1)$ to $\sqrt{\log n}$.
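A quick numerical illustration of where this saving comes from (purely for intuition, with arbitrary numbers): if every gradient coordinate sits at its bound $M$, the two choices of norm for $G$ differ by a factor of $\sqrt{n}$, while $R$ only grows from $\sqrt{2}$ to $\sqrt{\log n}$.

```python
import numpy as np

M, n = 1.0, 10_000
g = M * np.ones(n)                      # worst case: every coordinate at the bound M
G_l2, G_linf = np.linalg.norm(g, 2), np.linalg.norm(g, np.inf)
print(G_l2, G_linf)                     # M*sqrt(n) vs. M
print((np.sqrt(2) * G_l2) / (np.sqrt(np.log(n)) * G_linf))   # ratio of the two RG values ~ sqrt(n/log n)
```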

Remark 4 Note $R^2$ is an upper bound on $\Delta_\psi(x^*, x_1)$, rather than the real diameter $\max_{x, y \in C}\Delta_\psi(x, y)$. This is important because for the KL-divergence defined on the probability simplex, the latter is actually infinite, while $\max_{x \in \Omega}\Delta_\psi(x, \frac{1}{n}\mathbf{1}) = \log n$.

2.3 Possibilities for accelerated rates

When the objective function has additional properties, the rates can be significantly improved. Here we see two examples.

Acceleration 1: $f$ is strongly convex. We say $f$ is strongly convex with respect to another convex function $\psi$ with modulus $\lambda$ if

\[
f(x) \ge f(y) + \langle g, x - y \rangle + \lambda\,\Delta_\psi(x, y), \qquad \forall\, g \in \partial f(y). \tag{62}
\]
Note we do not assume $f$ is differentiable. Now in the step from (57) to (58), we can plug in the definition of strong convexity:
\begin{align}
\Delta_\psi(x^*, x_{k+1}) &= \ldots + \eta_k\langle g_k, x^* - x_k \rangle + \ldots \qquad \text{(copy of (57))} \tag{63}\\
&\le \ldots - \eta_k\left(f(x_k) - f(x^*) + \lambda\,\Delta_\psi(x^*, x_k)\right) + \ldots \tag{64}\\
&\le \ldots \tag{65}\\
&\le (1 - \lambda\eta_k)\,\Delta_\psi(x^*, x_k) - \eta_k\left(f(x_k) - f(x^*)\right) + \frac{\eta_k^2}{2\sigma}\|g_k\|_*^2. \tag{66}
\end{align}
Denote $\delta_k = \Delta_\psi(x^*, x_k)$ and set $\eta_k = \frac{1}{\lambda k}$. Then
\[
\delta_{k+1} \le \frac{k-1}{k}\,\delta_k - \frac{1}{\lambda k}\,\epsilon_k + \frac{G^2}{2\sigma\lambda^2 k^2}
\;\Longrightarrow\;
k\,\delta_{k+1} \le (k-1)\,\delta_k - \frac{1}{\lambda}\,\epsilon_k + \frac{G^2}{2\sigma\lambda^2 k}. \tag{67}
\]
Now telescope (sum up both sides from $k = 1$ to $T$):
\[
T\,\delta_{T+1} \le -\frac{1}{\lambda}\sum_{k=1}^{T}\epsilon_k + \frac{G^2}{2\sigma\lambda^2}\sum_{k=1}^{T}\frac{1}{k}
\;\Longrightarrow\;
\min_{k \in [T]}\epsilon_k \le \frac{G^2}{2\sigma\lambda}\,\frac{1}{T}\sum_{k=1}^{T}\frac{1}{k} \le \frac{G^2}{2\sigma\lambda}\,\frac{O(\log T)}{T}. \tag{68}
\]
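As a concrete (Euclidean) instance of this schedule, the sketch below applies subgradient descent with $\eta_k = \frac{1}{\lambda k}$ to an arbitrary test function that is $1$-strongly convex with respect to $\psi = \frac{1}{2}\|\cdot\|^2$; the best suboptimality indeed decays roughly like $\log T / T$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
c = rng.normal(scale=2.0, size=n)

# f(x) = 0.5*||x - c||^2 + ||x||_1 is 1-strongly convex (lambda = 1) w.r.t. the
# Euclidean psi; its minimizer is the soft-thresholding of c at level 1.
f = lambda x: 0.5 * np.sum((x - c) ** 2) + np.sum(np.abs(x))
subgrad = lambda x: (x - c) + np.sign(x)
x_star = np.sign(c) * np.maximum(np.abs(c) - 1.0, 0.0)
f_star = f(x_star)

x, best_gap = np.zeros(n), np.inf
for k in range(1, 10_001):
    x = x - (1.0 / k) * subgrad(x)       # eta_k = 1 / (lambda * k) as in (67)
    best_gap = min(best_gap, f(x) - f_star)
print(best_gap)                          # decays roughly like log(T) / T
```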

Acceleration 2: f has Lipschitz continuous gradient. If the gradient of f is Lipschitz continuous, there exists L > 0 such that

\[
\|\nabla f(x) - \nabla f(y)\|_* \le L\,\|x - y\|, \qquad \forall\, x, y. \tag{69}
\]
Sometimes we just directly say $f$ is smooth. It is also known that this is equivalent to
\[
f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2}\|x - y\|^2. \tag{70}
\]
We bound the $\langle g_k, x^* - x_{k+1} \rangle$ term in (56) as follows:
\begin{align}
\langle g_k, x^* - x_{k+1} \rangle &= \langle g_k, x^* - x_k \rangle + \langle g_k, x_k - x_{k+1} \rangle \tag{71}\\
&\le f(x^*) - f(x_k) + f(x_k) - f(x_{k+1}) + \frac{L}{2}\|x_k - x_{k+1}\|^2 \tag{72}\\
&= f(x^*) - f(x_{k+1}) + \frac{L}{2}\|x_k - x_{k+1}\|^2. \tag{73}
\end{align}
Plugging into (56), we get
\[
\Delta_\psi(x^*, x_{k+1}) \le \Delta_\psi(x^*, x_k) + \eta_k\left(f(x^*) - f(x_{k+1}) + \frac{L}{2}\|x_k - x_{k+1}\|^2\right) - \frac{\sigma}{2}\|x_k - x_{k+1}\|^2. \tag{74}
\]
Setting $\eta_k = \frac{\sigma}{L}$, we get
\[
\Delta_\psi(x^*, x_{k+1}) \le \Delta_\psi(x^*, x_k) - \frac{\sigma}{L}\left(f(x_{k+1}) - f(x^*)\right). \tag{75}
\]
Telescoping, we get
\[
\min_{k \in \{2, \ldots, T+1\}} f(x_k) - f(x^*) \le \frac{L\,\Delta_\psi(x^*, x_1)}{\sigma T} \le \frac{L R^2}{\sigma T}. \tag{76}
\]
This gives an $O(\frac{1}{T})$ convergence rate. But if we are smarter, like Nesterov, the rate can be improved to $O(\frac{1}{T^2})$. We will not go into the details, but the algorithm and proof are again based on Lemma 2. This is often called the accelerated proximal gradient method.

2.4 Composite Objective

Suppose the objective function is $h(x) = f(x) + r(x)$, where $f$ is smooth and $r(x)$ is simple, like $\|x\|_1$. If we directly apply the above rates to optimize $h$, we get an $O(\frac{1}{\sqrt{T}})$ rate of convergence because $h$ is not smooth. It would be nice if we could enjoy the $O(\frac{1}{T})$ rate as in smooth optimization. Fortunately this is possible thanks to the simplicity of $r(x)$, and we only need to extend the proximal operator (33) as follows:
\[
x_{k+1} = \operatorname*{argmin}_{x \in C}\left\{ f(x_k) + \langle g_k, x - x_k \rangle + r(x) + \frac{1}{\eta_k}\Delta_\psi(x, x_k) \right\} \tag{77}
\]

\[
= \operatorname*{argmin}_{x \in C}\left\{ \eta_k f(x_k) + \eta_k\langle g_k, x - x_k \rangle + \eta_k r(x) + \Delta_\psi(x, x_k) \right\}. \tag{78}
\]
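In the Euclidean case with $r(x) = \lambda\|x\|_1$ and $C = \mathbb{R}^d$, the update (77) has a closed-form solution by soft-thresholding, giving the familiar proximal-gradient (ISTA) iteration. Below is a minimal sketch on a synthetic sparse least-squares problem; the data, regularization weight, and step size are illustrative assumptions.

```python
import numpy as np

def ista(A, b, lam, T):
    """Composite update (77) with Euclidean psi and r(x) = lam * ||x||_1:
    linearize f(x) = 0.5*||Ax - b||^2 and solve the prox step in closed form."""
    L = np.linalg.eigvalsh(A.T @ A).max()     # Lipschitz constant of grad f
    eta = 1.0 / L                             # eta_k = sigma / L with sigma = 1
    x = np.zeros(A.shape[1])
    for _ in range(T):
        z = x - eta * (A.T @ (A @ x - b))     # gradient step on the smooth part
        x = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)   # prox of eta * r
    return x

rng = np.random.default_rng(4)
A = rng.normal(size=(50, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.normal(size=50)
x_hat = ista(A, b, lam=0.1, T=2000)
print(np.count_nonzero(x_hat))                # the prox step sets many coordinates exactly to zero
```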

In (77)–(78) we use a first-order Taylor approximation of $f$ around $x_k$, but keep $r(x)$ exact. Assuming this proximal operator can be computed efficiently, we can show that all the above rates carry over. We here only show the case of a general $f$ (not necessarily strongly convex or with Lipschitz continuous gradient), and leave the other two cases as an exercise. In fact, we can again achieve an $O(\frac{1}{T^2})$ rate when $f$ has Lipschitz continuous gradient.

Consider $\eta_k\left(f(x_k) + \langle g_k, x - x_k \rangle + r(x)\right)$ as the $L$ in Lemma 2. Then
\begin{align}
&\eta_k\left(f(x_k) + \langle g_k, x^* - x_k \rangle + r(x^*)\right) + \Delta_\psi(x^*, x_k) \tag{79}\\
&\ge \eta_k\left(f(x_k) + \langle g_k, x_{k+1} - x_k \rangle + r(x_{k+1})\right) + \Delta_\psi(x_{k+1}, x_k) + \Delta_\psi(x^*, x_{k+1}). \tag{80}
\end{align}
Following exactly the derivations from (56) to (60), we obtain
\begin{align}
\Delta_\psi(x^*, x_{k+1}) &\le \Delta_\psi(x^*, x_k) + \eta_k\langle g_k, x^* - x_{k+1} \rangle + \eta_k\left(r(x^*) - r(x_{k+1})\right) - \Delta_\psi(x_{k+1}, x_k) \tag{81}\\
&\le \ldots \tag{82}\\
&\le \Delta_\psi(x^*, x_k) - \eta_k\left(f(x_k) + r(x_{k+1}) - f(x^*) - r(x^*)\right) + \frac{\eta_k^2}{2\sigma}\|g_k\|_*^2. \tag{83}
\end{align}
This is almost the same as (60), except that we would like to have $r(x_k)$ here, not $r(x_{k+1})$. Fortunately this is not a problem as long as we use a slightly different way of telescoping. Denote $\delta_k = \Delta_\psi(x^*, x_k)$; then

\[
f(x_k) + r(x_{k+1}) - f(x^*) - r(x^*) \le \frac{1}{\eta_k}\left(\delta_k - \delta_{k+1}\right) + \frac{\eta_k}{2\sigma}\|g_k\|_*^2. \tag{84}
\]
Summing up from $k = 1$ to $T$, we obtain
\begin{align}
r(x_{T+1}) - r(x_1) + \sum_{k=1}^{T}\left(h(x_k) - h(x^*)\right) &\le \frac{\delta_1}{\eta_1} + \sum_{k=2}^{T}\delta_k\left(\frac{1}{\eta_k} - \frac{1}{\eta_{k-1}}\right) - \frac{\delta_{T+1}}{\eta_T} + \frac{G^2}{2\sigma}\sum_{k=1}^{T}\eta_k \tag{85}\\
&\le R^2\left(\frac{1}{\eta_1} + \sum_{k=2}^{T}\left(\frac{1}{\eta_k} - \frac{1}{\eta_{k-1}}\right)\right) + \frac{G^2}{2\sigma}\sum_{k=1}^{T}\eta_k \tag{86}\\
&= \frac{R^2}{\eta_T} + \frac{G^2}{2\sigma}\sum_{k=1}^{T}\eta_k. \tag{87}
\end{align}
Suppose we choose $x_1 = \operatorname*{argmin}_x r(x)$, which ensures $r(x_{T+1}) - r(x_1) \ge 0$. Setting $\eta_k = \frac{R}{G}\sqrt{\frac{\sigma}{k}}$, we get
\[
\sum_{k=1}^{T}\left(h(x_k) - h(x^*)\right) \le \frac{RG}{\sqrt{\sigma}}\left(\sqrt{T} + \frac{1}{2}\sum_{k=1}^{T}\frac{1}{\sqrt{k}}\right) = \frac{RG}{\sqrt{\sigma}}\,O(\sqrt{T}). \tag{88}
\]
Therefore $\min_{k \in [T]}\{h(x_k) - h(x^*)\}$ decays at the rate of $O(\frac{RG}{\sqrt{\sigma T}})$.

3 Online and Stochastic Learning

The protocol of online learning is shown in Algorithm 1. The player's goal in online learning is to minimize the regret: the cumulative loss it incurs minus the minimal possible loss $\sum_k f_k(x)$ over all fixed $x$:
\[
\text{Regret} = \sum_{k=1}^{T} f_k(x_k) - \min_{x}\sum_{k=1}^{T} f_k(x). \tag{89}
\]

Note there is no assumption made on how the rival picks $f_k$; it can be adversarial. After obtaining $f_k$ at iteration $k$, let us update the model to $x_{k+1}$ by using the mirror descent rule on the function $f_k$ only:
\[
x_{k+1} = \operatorname*{argmin}_{x \in C}\left\{ f_k(x_k) + \langle g_k, x - x_k \rangle + \frac{1}{\eta_k}\Delta_\psi(x, x_k) \right\}, \qquad \text{where } g_k \in \partial f_k(x_k). \tag{90}
\]

Algorithm 1: Protocol of online learning

1 The player initializes a model $x_1$.
2 for $k = 1, 2, \ldots$ do
3   The player proposes a model $x_k$.
4   The rival picks a function $f_k$.
5   The player suffers a loss $f_k(x_k)$.
6   The player gets access to $f_k$ and uses it to update its model to $x_{k+1}$.

Then it is easy to derive the regret bound. Using fk in step (60), we have

\[
f_k(x_k) - f_k(x^*) \le \frac{1}{\eta_k}\left(\Delta_\psi(x^*, x_k) - \Delta_\psi(x^*, x_{k+1})\right) + \frac{\eta_k}{2\sigma}\|g_k\|_*^2. \tag{91}
\]
Summing up from $k = 1$ to $T$ and using the same process as in (85) to (88), we get
\[
\sum_{k=1}^{T}\left(f_k(x_k) - f_k(x^*)\right) \le \frac{RG}{\sqrt{\sigma}}\,O(\sqrt{T}). \tag{92}
\]
So the regret grows in the order of $O(\sqrt{T})$.
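As a concrete instance, the sketch below runs the update (90) with the KL-divergence on the probability simplex against linear losses $f_k(x) = \langle \ell_k, x\rangle$ (the classic experts setting); the loss sequence and step size are illustrative, and the measured regret indeed grows sublinearly, on the order of $\sqrt{T\log n}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, T = 10, 5_000
x = np.full(n, 1.0 / n)                  # x_1 = uniform distribution
losses = rng.random((T, n))              # rival plays linear losses f_k(x) = <l_k, x>
eta = np.sqrt(np.log(n) / T)             # step size of order R / (G * sqrt(T))

cum_player, cum_fixed = 0.0, np.zeros(n)
for k in range(T):
    l = losses[k]
    cum_player += l @ x                  # the player suffers f_k(x_k)
    cum_fixed += l
    x = x * np.exp(-eta * l)             # mirror descent update (90) with KL psi
    x /= x.sum()

regret = cum_player - cum_fixed.min()    # regret (89) against the best fixed x (a vertex)
print(regret, np.sqrt(T * np.log(n)))    # regret is O(sqrt(T log n))
```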

$f$ is strongly convex. Use exactly (66) with $f_k$ in place of $f$, and we can derive the $O(\log T)$ regret bound immediately.

$f$ has Lipschitz continuous gradient. The result in (75) can NOT be extended to the online setting, because if we replace $f$ by $f_k$ we will get $f_k(x_{k+1}) - f_k(x^*)$ on the right-hand side; telescoping will not give a regret bound. In fact, it is known that in the online setting, having a Lipschitz continuous gradient by itself cannot reduce the regret bound from $O(\sqrt{T})$ (as for nonsmooth objectives) to $O(\log T)$.

Composite objective. In the online setting, both the player and the rival know $r(x)$, and the rival changes $f_k(x)$ at each iteration. The loss incurred at each iteration is $h_k(x_k) = f_k(x_k) + r(x_k)$. The update rule is
\[
x_{k+1} = \operatorname*{argmin}_{x \in C}\left\{ f_k(x_k) + \langle g_k, x - x_k \rangle + r(x) + \frac{1}{\eta_k}\Delta_\psi(x, x_k) \right\}, \qquad \text{where } g_k \in \partial f_k(x_k). \tag{93}
\]
Note that in this setting, (84) becomes

\[
f_k(x_k) + r(x_{k+1}) - f_k(x^*) - r(x^*) \le \frac{1}{\eta_k}\left(\delta_k - \delta_{k+1}\right) + \frac{\eta_k}{2\sigma}\|g_k\|_*^2. \tag{94}
\]

Although we have $r(x_{k+1})$ here rather than $r(x_k)$, it is fine because $r$ does not change through the iterations. Choosing $x_1 = \operatorname*{argmin}_x r(x)$ and telescoping in the same way as from (85) to (88), we immediately obtain
\[
\sum_{k=1}^{T}\left(h_k(x_k) - h_k(x^*)\right) \le \frac{RG}{\sqrt{\sigma}}\,O(\sqrt{T}). \tag{95}
\]
So the regret grows at $O(\sqrt{T})$. When the $f_k$ are strongly convex, we can get an $O(\log T)$ regret for the composite case. But as expected, Lipschitz continuity of $\nabla f_k$ alone cannot reduce the regret from $O(\sqrt{T})$ to $O(\log T)$.

3.1 Stochastic optimization

Let us consider optimizing a function which takes the form of an expectation:
\[
\min_x F(x) := \mathbb{E}_{\omega \sim p}[f(x; \omega)], \tag{96}
\]
where $p$ is a distribution over $\omega$. This subsumes a lot of machine learning models. For example, the SVM objective is
\[
F(x) = \frac{1}{m}\sum_{i=1}^{m}\max\{0,\, 1 - c_i\langle a_i, x \rangle\} + \frac{\lambda}{2}\|x\|^2. \tag{97}
\]

Algorithm 2: Protocol of stochastic optimization

1 The player initializes a model $x_1$.
2 for $k = 1, 2, \ldots$ do
3   The player proposes a model $x_k$.
4   The rival randomly draws $\omega_k$ from $p$, which defines a function $f_k(x) := f(x; \omega_k)$.
5   The player suffers a loss $f_k(x_k)$.
6   The player gets access to $f_k$ and uses it to update its model to $x_{k+1}$ by, e.g., mirror descent (90).

It can be interpreted as (96) where $\omega$ is uniformly distributed in $\{1, 2, \ldots, m\}$ (i.e. $p(\omega = i) = \frac{1}{m}$), and
\[
f(x; i) = \max\{0,\, 1 - c_i\langle a_i, x \rangle\} + \frac{\lambda}{2}\|x\|^2. \tag{98}
\]
When $m$ is large, it can be costly to calculate $F$ and its subgradient. So a simple idea is to base the updates on a single randomly chosen data point. This can be considered a special case of the online learning in Algorithm 1, where the rival in step 4 now randomly picks $f_k$ as $f(x; \omega_k)$ with $\omega_k$ drawn independently from $p$. Ideally we hope that by using the mirror descent updates, $x_k$ will gradually approach the minimizer of $F(x)$. Intuitively this is quite reasonable, because by using $f_k$ we can compute an unbiased estimate of $F(x_k)$ and an unbiased estimate of a subgradient of $F$ at $x_k$ (because the $\omega_k$ are sampled i.i.d. from $p$). This is a particular case of stochastic optimization, and we recap it in Algorithm 2.

In fact, the method is valid in a more general setting. For simplicity, let us just say the rival plays $\omega_k$ at iteration $k$. Then an online learning algorithm $A$ is simply a deterministic mapping from an ordered set $\{\omega_1, \ldots, \omega_k\}$ to $x_{k+1}$. Denote by $A(\omega_0)$ the initial model $x_1$. Then the following theorem is the key to online-to-batch conversion.

Theorem 3 Suppose an online learning algorithm $A$ has regret bound $R_k$ after running Algorithm 1 for $k$ iterations. Suppose $\omega_1, \ldots, \omega_{T+1}$ are drawn i.i.d. from $p$. Define $\hat{x} = A(\omega_{j+1}, \ldots, \omega_T)$, where $j$ is drawn uniformly at random from $\{0, \ldots, T\}$. Then
\[
\mathbb{E}[F(\hat{x})] - \min_x F(x) \le \frac{R_{T+1}}{T+1}, \tag{99}
\]
where the expectation is with respect to the randomness of $\omega_1, \ldots, \omega_T$, and $j$. Similarly we can have high-probability bounds, which can be stated in a form like (not exactly true)
\[
F(\hat{x}) - \min_x F(x) \le \frac{R_{T+1}}{T+1}\log\frac{1}{\delta} \tag{100}
\]
with probability $1 - \delta$, where the probability is with respect to the randomness of $\omega_1, \ldots, \omega_T$, and $j$.

Proof of Theorem 3.
\begin{align}
\mathbb{E}[F(\hat{x})] &= \mathbb{E}_{j, \omega_1, \ldots, \omega_{T+1}}\left[f(\hat{x}; \omega_{T+1})\right] = \mathbb{E}_{j, \omega_1, \ldots, \omega_{T+1}}\left[f(A(\omega_{j+1}, \ldots, \omega_T); \omega_{T+1})\right] \tag{101}\\
&= \mathbb{E}_{\omega_1, \ldots, \omega_{T+1}}\left[\frac{1}{T+1}\sum_{j=0}^{T} f(A(\omega_{j+1}, \ldots, \omega_T); \omega_{T+1})\right] \qquad \text{(as $j$ is drawn uniformly at random)} \tag{102}\\
&= \frac{1}{T+1}\,\mathbb{E}_{\omega_1, \ldots, \omega_{T+1}}\left[\sum_{j=0}^{T} f(A(\omega_1, \ldots, \omega_{T-j}); \omega_{T+1-j})\right] \qquad \text{(shift the iteration index, by i.i.d.-ness of the $\omega_i$)} \tag{103}\\
&= \frac{1}{T+1}\,\mathbb{E}_{\omega_1, \ldots, \omega_{T+1}}\left[\sum_{s=1}^{T+1} f(A(\omega_1, \ldots, \omega_{s-1}); \omega_s)\right] \qquad \text{(change of variable $s = T - j + 1$)} \tag{104}\\
&\le \frac{1}{T+1}\,\mathbb{E}_{\omega_1, \ldots, \omega_{T+1}}\left[\min_x \sum_{s=1}^{T+1} f(x; \omega_s) + R_{T+1}\right] \qquad \text{(apply the regret bound)} \tag{105}\\
&\le \min_x \mathbb{E}_\omega[f(x; \omega)] + \frac{R_{T+1}}{T+1} \qquad \text{(expectation of min is smaller than min of expectation)} \tag{106}\\
&= \min_x F(x) + \frac{R_{T+1}}{T+1}. \tag{107}
\end{align}

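To make the conversion concrete, the following sketch instantiates Algorithm 2 for the SVM objective (97)–(98): it runs stochastic subgradient descent with the strongly convex step size $\eta_k = \frac{1}{\lambda k}$, and forms $\hat{x}$ exactly as in Theorem 3 by rerunning the algorithm on a uniformly random suffix of the sample sequence. The data and parameters are synthetic illustrations.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic data for the SVM objective (97): m points a_i with labels c_i.
m, d, lam = 500, 20, 0.1
A = rng.normal(size=(m, d))
c = np.sign(A @ rng.normal(size=d))
F = lambda x: np.mean(np.maximum(0.0, 1.0 - c * (A @ x))) + 0.5 * lam * x @ x

def sgd(indices):
    """The online algorithm A applied to a sample sequence: stochastic
    subgradient descent on f(x; i) from (98) with eta_k = 1 / (lam * k)."""
    x = np.zeros(d)
    for k, i in enumerate(indices, start=1):
        g = lam * x
        if c[i] * (A[i] @ x) < 1.0:
            g = g - c[i] * A[i]
        x = x - g / (lam * k)
    return x

# Online-to-batch conversion of Theorem 3: draw omega_1..omega_T i.i.d., draw j
# uniformly from {0,...,T}, and output x_hat = A(omega_{j+1}, ..., omega_T).
T = 20_000
omegas = rng.integers(0, m, size=T)
j = rng.integers(0, T + 1)
x_hat = sgd(omegas[j:])
print(F(x_hat))   # in expectation, F(x_hat) <= min_x F(x) + R_{T+1}/(T+1)
```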