Total Jensen Divergences: Definition, Properties and k-Means++ Clustering
Frank Nielsen (Sony Computer Science Laboratories, Inc.) and Richard Nock (UAG-CEREGMIA)
www.informationgeometry.org
September 2013

Divergences: distortion measures

Let F be a smooth convex function, called the generator.

◮ Skew Jensen divergences, for α ∈ (0, 1):
  J_α(p : q) = αF(p) + (1 − α)F(q) − F(αp + (1 − α)q) = (F(p)F(q))_α − F((pq)_α),
  where (pq)_γ = γp + (1 − γ)q = q + γ(p − q) and
  (F(p)F(q))_γ = γF(p) + (1 − γ)F(q) = F(q) + γ(F(p) − F(q)).

◮ Bregman divergences:
  B(p : q) = F(p) − F(q) − ⟨p − q, ∇F(q)⟩.
  Up to the scaling factor 1/(α(1 − α)), the skew Jensen divergences tend to the Bregman divergences at the extremal skew values:
  lim_{α→0} J_α(p : q)/(α(1 − α)) = B(p : q),   lim_{α→1} J_α(p : q)/(α(1 − α)) = B(q : p).

◮ Statistical Bhattacharyya divergence:
  Bhat_α(p₁ : p₂) = −log ∫ p₁(x)^α p₂(x)^{1−α} dν(x) = J_α(θ₁ : θ₂)
  for distributions p₁ = p_{θ₁} and p₂ = p_{θ₂} belonging to the same exponential family [5].

Geometrically designed divergences

(Figure: plot of a convex generator F with the points (p, F(p)) and (q, F(q)), showing the Jensen divergence J(p, q) as the vertical gap above the midpoint (p + q)/2, together with the Bregman divergence B(p : q) and the total Bregman divergence tB(p : q).)

Total Bregman divergences

A conformal divergence rescales a divergence D by a conformal factor ρ:
  D′(p : q) = ρ(p, q) D(p : q),
where ρ plays the role of a "regularizer" [8]. The total Bregman divergence is invariant under rotations of the axes of the design space:
  tB(p : q) = B(p : q)/√(1 + ⟨∇F(q), ∇F(q)⟩) = ρ_B(q) B(p : q),
  ρ_B(q) = 1/√(1 + ⟨∇F(q), ∇F(q)⟩).
Total squared Euclidean divergence:
  tE(p, q) = (1/2) ⟨p − q, p − q⟩/√(1 + ⟨q, q⟩).

Total Jensen divergences

  tB(p : q) = ρ_B(q) B(p : q),   ρ_B(q) = 1/√(1 + ⟨∇F(q), ∇F(q)⟩),
  tJ_α(p : q) = ρ_J(p, q) J_α(p : q),   ρ_J(p, q) = 1/√(1 + (F(p) − F(q))²/⟨p − q, p − q⟩).

The square root of the Jensen-Shannon divergence is a metric [2]:
  JS(p, q) = (1/2) Σ_{i=1}^d p_i log(2p_i/(p_i + q_i)) + (1/2) Σ_{i=1}^d q_i log(2q_i/(p_i + q_i)).

Lemma. The square root of the total Jensen-Shannon divergence is not a metric.

Total Jensen divergence: illustration

(Figure: generator plots for a pair (p, q) and for a rotated pair (p′, q′); the skew Jensen divergence J_α and its total version tJ_α are measured between the chord point (F(p)F(q))_α and the graph point F((pq)_α).)

(Figure: α indexes the point on the graph of F, β the point on the interpolated chord. Two kinds of total Jensen divergences can be defined according to whether β > 1, β ∈ [0, 1], or β < 0, but one of them always admits a closed form.)

Total Jensen divergences versus total Bregman divergences

Total Jensen is not a generalization of total Bregman. In the limit cases α ∈ {0, 1} (with the same 1/(α(1 − α)) scaling as above), we have:
  lim_{α→0} tJ_α(p : q)/(α(1 − α)) = ρ_J(p, q) B(p : q) ≠ ρ_B(q) B(p : q),
  lim_{α→1} tJ_α(p : q)/(α(1 − α)) = ρ_J(p, q) B(q : p) ≠ ρ_B(p) B(q : p),
since ρ_J(p, q) ≠ ρ_B(q) in general.

Squared chord slope index appearing in ρ_J (with Δ = p − q and Δ_F = F(p) − F(q)):
  s² = Δ_F²/Δ² = (Δ⊤∇F(ε))²/(Δ⊤Δ) = ⟨∇F(ε), ∇F(ε)⟩ = ‖∇F(ε)‖².

Conformal factor from the mean value theorem

When p ≃ q, ρ_J(p, q) ≃ ρ_B(q), and the total Jensen divergence tends to the total Bregman divergence for any value of α. By the mean value theorem,
  ρ_J(p, q) = 1/√(1 + ⟨∇F(ε), ∇F(ε)⟩) = ρ_B(ε),   for some ε ∈ [p, q].
For univariate generators, the value of ε is explicit:
  ε = ∇F^{−1}(Δ_F/Δ) = ∇F*(Δ_F/Δ),
where F* is the Legendre convex conjugate [5]; ε is a Stolarsky mean of p and q [7]. Hence
  tJ_α(p : q) = ρ_B(ε) J_α(p : q).
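The quantities above can be checked numerically. The following is a minimal Python/NumPy sketch (not from the slides) for the Shannon generator F(x) = Σ_i x_i log x_i − x_i; the function names (skew_jensen, rho_J, rho_B, total_jensen) are illustrative choices, not notation from the paper.

import numpy as np

def F(x):
    # Shannon generator F(x) = sum_i x_i log x_i - x_i (as on the slides), x > 0 componentwise
    return float(np.sum(x * np.log(x) - x))

def grad_F(x):
    return np.log(x)

def skew_jensen(p, q, alpha):
    # J_alpha(p : q) = alpha F(p) + (1 - alpha) F(q) - F(alpha p + (1 - alpha) q)
    return alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)

def rho_J(p, q):
    # Conformal factor 1/sqrt(1 + (F(p) - F(q))^2 / <p - q, p - q>)
    delta = p - q
    sq = float(np.dot(delta, delta))
    if sq == 0.0:
        return 1.0  # guard p == q: J_alpha(p : p) = 0, so the factor does not matter
    return 1.0 / np.sqrt(1.0 + (F(p) - F(q)) ** 2 / sq)

def rho_B(q):
    # Conformal factor of the total Bregman divergence: 1/sqrt(1 + <grad F(q), grad F(q)>)
    g = grad_F(q)
    return 1.0 / np.sqrt(1.0 + float(np.dot(g, g)))

def total_jensen(p, q, alpha):
    # tJ_alpha(p : q) = rho_J(p, q) * J_alpha(p : q)
    return rho_J(p, q) * skew_jensen(p, q, alpha)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
print(total_jensen(p, q, 0.5))
# When p is close to q, rho_J(p, q) is close to rho_B(q), matching the mean value theorem slide.
print(rho_J(q + 1e-4, q), rho_B(q))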
Centroids and statistical robustness

Centroids (barycenters) are the minimizers of average (weighted) divergences:
  L(x; w) = Σ_{i=1}^n w_i tJ_α(p_i : x),   c_α = arg min_{x ∈ X} L(x; w).
◮ Is the centroid unique?
◮ Is it robust to outliers [3]?
The centroid can be estimated with the iterative convex-concave procedure (CCCP) [5].

Robustness of Jensen centroids (univariate generator)

Theorem. The Jensen centroid is robust for a strictly convex and smooth generator f if |f′((p + y)/2)| is bounded on the domain X for any prescribed p.

◮ Jensen-Shannon: X = R₊, f(x) = x log x − x, f′(x) = log x, f″(x) = 1/x.
  |f′((p + y)/2)| = |log((p + y)/2)| is unbounded when y → +∞: the JS centroid is not robust.
◮ Jensen-Burg: X = R₊, f(x) = −log x, f′(x) = −1/x, f″(x) = 1/x².
  |f′((p + y)/2)| = 2/(p + y) is bounded for y ∈ (0, +∞), and the corresponding influence function z(y) stays finite as y → ∞: the JB centroid is robust.

Clustering: no closed-form centroid, no cry!

k-means++ [1] picks the seeds at random, each new seed being drawn with probability proportional to its divergence to the closest seed already chosen; no centroid calculation is required (a seeding sketch is given after these slides).

Divergence-based k-means++

Theorem. Suppose there exist some U and V such that, for all x, y, z:
  tJ_α(x : z) ≤ U (tJ_α(x : y) + tJ_α(y : z))   (triangular inequality),
  tJ_α(x : z) ≤ V tJ_α(z : x)   (symmetric inequality).
Then the average potential of total Jensen seeding with k clusters satisfies
  E[tJ_α] ≤ 2U²(1 + V)(2 + log k) tJ_{opt,α},
where tJ_{opt,α} is the minimal total Jensen potential achieved by a clustering into k clusters.

Divergence-based k-means++: two assumptions H

◮ First, the maximal condition number of the Hessian of F, that is, the ratio between the maximal and minimal eigenvalues (> 0) of the Hessian of F, is upper-bounded by K₁.
◮ Second, we assume the Lipschitz-type condition Δ_F²/⟨Δ, Δ⟩ ≤ K₂ on F, for some K₂ > 0.

Lemma. Assume 0 < α < 1. Then, under assumption H, for any p, q, r ∈ S, there exists ε > 0 such that:
  tJ_α(p : r) ≤ (2(1 + K₂)K₁²/ε) ((1/(1 − α)) tJ_α(p : q) + (1/α) tJ_α(q : r)).

Divergence-based k-means++ (continued)

Corollary. The total skew Jensen divergence satisfies the following triangular inequality:
  tJ_α(p : r) ≤ (2(1 + K₂)K₁²/(εα(1 − α))) (tJ_α(p : q) + tJ_α(q : r)),
  U = 2(1 + K₂)K₁²/ε.

Lemma. The symmetric inequality condition holds for V = K₁²(1 + K₂)/ε, for some 0 < ε < 1.

Total Jensen divergences: recap

A total Jensen divergence is a conformal divergence with a non-separable, double-sided conformal factor.
◮ Invariant to axis rotations of the "design space".
◮ Equivalent to total Bregman divergences [8, 4] only when p ≃ q.
◮ The square root of the total Jensen-Shannon divergence is not a metric (whereas the square root of the ordinary Jensen-Shannon divergence is).
◮ Jensen centroids are not always robust (e.g., the Jensen-Shannon centroid is not).
◮ Total Jensen k-means++ requires no centroid computation and comes with a guaranteed approximation factor.
Conformal divergences are also of interest for SVMs [9] (double-sided separable factors) and in information geometry [6] (flattening of the space).

Thank you.
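To complement the divergence-based k-means++ slides, here is a minimal seeding sketch in Python/NumPy. It assumes the total_jensen function from the earlier sketch is in scope; the name tj_kmeanspp_seeding and the choice of seeding direction tJ_α(x : seed) are illustrative assumptions, not prescriptions from the paper.

import numpy as np

def tj_kmeanspp_seeding(points, k, alpha=0.5, rng=None):
    # k-means++ seeding of Arthur & Vassilvitskii [1], with the squared Euclidean distance
    # replaced by the total skew Jensen divergence; no centroid is ever computed.
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    seeds = [points[rng.integers(n)]]  # first seed: uniform at random
    # Divergence of every point to its closest seed chosen so far
    # (clamped at 0 to absorb tiny negative rounding errors).
    dist = np.array([max(total_jensen(x, seeds[0], alpha), 0.0) for x in points])
    for _ in range(1, k):
        # Next seed: sampled proportionally to the current total Jensen potential.
        probs = dist / dist.sum()
        seeds.append(points[rng.choice(n, p=probs)])
        new = [max(total_jensen(x, seeds[-1], alpha), 0.0) for x in points]
        dist = np.minimum(dist, new)
    return seeds

# Toy usage on 100 random positive 3-dimensional points.
gen = np.random.default_rng(0)
data = [gen.uniform(0.1, 1.0, size=3) for _ in range(100)]
seeds = tj_kmeanspp_seeding(data, k=3, alpha=0.5, rng=gen)
print(len(seeds))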
@article{totalJensen-arXiv1309.7109,
  author = "Frank Nielsen and Richard Nock",
  title  = "Total {J}ensen divergences: {D}efinition, Properties and $k$-Means++ Clustering",
  year   = "2013",
  eprint = "arXiv/1309.7109"
}

www.informationgeometry.org

Bibliographic references

[1] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[2] Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In IEEE International Symposium on Information Theory, page 31, 2004.
[3] F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Mathematical Statistics, 1986.
[4] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.
[5] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
[6] Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari. A dually flat structure on the space of escort distributions. Journal of Physics: Conference Series, 201(1):012012, 2010.
[7] Kenneth B. Stolarsky. Generalizations of the logarithmic mean. Mathematics Magazine, 48(2):87–92, 1975.
[8] Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, pages 475–483, 2011.
[9] Si Wu and Shun-ichi Amari. Conformal transformation of kernel functions: a data-dependent way to improve support vector machine classifiers. Neural Processing Letters, 15(1):59–67, 2002.