Total Jensen Divergences: Definition, Properties and K-Means++ Clustering

Frank Nielsen (1), Richard Nock (2)
www.informationgeometry.org
(1) Sony Computer Science Laboratories, Inc. (2) UAG-CEREGMIA
September 2013
© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

Divergences: Distortion measures

$F$ is a smooth convex function, the generator.

◮ Skew Jensen divergences:
$$J'_\alpha(p : q) = \alpha F(p) + (1-\alpha) F(q) - F(\alpha p + (1-\alpha) q) = (F(p)F(q))_\alpha - F((pq)_\alpha),$$
where $(pq)_\gamma = \gamma p + (1-\gamma) q = q + \gamma (p - q)$ and $(F(p)F(q))_\gamma = \gamma F(p) + (1-\gamma) F(q) = F(q) + \gamma (F(p) - F(q))$.
◮ Bregman divergences:
$$B(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle,$$
$$\lim_{\alpha \to 0} J_\alpha(p : q) = B(p : q), \qquad \lim_{\alpha \to 1} J_\alpha(p : q) = B(q : p),$$
where $J_\alpha = \frac{1}{\alpha(1-\alpha)} J'_\alpha$ denotes the scaled skew Jensen divergence (the unscaled $J'_\alpha$ itself vanishes in both limits).
◮ Statistical Bhattacharyya divergence:
$$\mathrm{Bhat}_\alpha(p_1 : p_2) = -\log \int p_1(x)^{\alpha} \, p_2(x)^{1-\alpha} \, \mathrm{d}\nu(x) = J'_\alpha(\theta_1 : \theta_2)$$
for exponential families [5].

Geometrically designed divergences

[Figure: plot of the convex generator $F : (x, F(x))$, showing the points $(p, F(p))$ and $(q, F(q))$, the midpoint $\frac{p+q}{2}$, and the quantities $J(p, q)$, $B(p : q)$ and $tB(p : q)$.]

Total Bregman divergences

Conformal divergence with conformal factor $\rho$:
$$D'(p : q) = \rho(p, q) \, D(p : q).$$
The conformal factor plays the role of a "regularizer" [8].

Invariance by rotation of the axes of the design space:
$$tB(p : q) = \frac{B(p : q)}{\sqrt{1 + \langle \nabla F(q), \nabla F(q) \rangle}} = \rho_B(q) \, B(p : q), \qquad \rho_B(q) = \frac{1}{\sqrt{1 + \langle \nabla F(q), \nabla F(q) \rangle}}.$$

Total squared Euclidean divergence:
$$tE(p, q) = \frac{1}{2} \, \frac{\langle p - q, p - q \rangle}{\sqrt{1 + \langle q, q \rangle}}.$$

Total Jensen divergences

$$tB(p : q) = \rho_B(q) \, B(p : q), \qquad \rho_B(q) = \sqrt{\frac{1}{1 + \langle \nabla F(q), \nabla F(q) \rangle}},$$
$$tJ_\alpha(p : q) = \rho_J(p, q) \, J_\alpha(p : q), \qquad \rho_J(p, q) = \sqrt{\frac{1}{1 + \frac{(F(p) - F(q))^2}{\langle p - q, p - q \rangle}}}.$$

Jensen-Shannon divergence, whose square root is a metric [2]:
$$JS(p, q) = \frac{1}{2} \sum_{i=1}^{d} p_i \log \frac{2 p_i}{p_i + q_i} + \frac{1}{2} \sum_{i=1}^{d} q_i \log \frac{2 q_i}{p_i + q_i}.$$

Lemma. The square root of the total Jensen-Shannon divergence is not a metric.
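To make the definitions above concrete, here is a minimal Python sketch of the skew Jensen, Bregman, total Bregman and total Jensen divergences for vectors stored as lists. The function names, the sample points and the choice of generator $F(x) = \sum_i x_i \log x_i - x_i$ are assumptions made for illustration; the total Jensen divergence is applied here to the unscaled $J'_\alpha$, which differs from the scaled version only by the constant factor $\frac{1}{\alpha(1-\alpha)}$.

```python
import math

def dot(u, v):
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

def skew_jensen(F, p, q, alpha):
    """Unscaled skew Jensen divergence J'_alpha(p : q)."""
    mix = [alpha * a + (1 - alpha) * b for a, b in zip(p, q)]
    return alpha * F(p) + (1 - alpha) * F(q) - F(mix)

def bregman(F, gradF, p, q):
    """Bregman divergence B(p : q) = F(p) - F(q) - <p - q, grad F(q)>."""
    diff = [a - b for a, b in zip(p, q)]
    return F(p) - F(q) - dot(diff, gradF(q))

def rho_bregman(gradF, q):
    """One-sided conformal factor rho_B(q)."""
    g = gradF(q)
    return 1.0 / math.sqrt(1.0 + dot(g, g))

def rho_jensen(F, p, q):
    """Double-sided conformal factor rho_J(p, q); requires p != q."""
    diff = [a - b for a, b in zip(p, q)]
    return 1.0 / math.sqrt(1.0 + (F(p) - F(q)) ** 2 / dot(diff, diff))

def total_bregman(F, gradF, p, q):
    return rho_bregman(gradF, q) * bregman(F, gradF, p, q)

def total_jensen(F, p, q, alpha):
    return rho_jensen(F, p, q) * skew_jensen(F, p, q, alpha)

# Illustrative generator (an assumption): F(x) = sum_i x_i log x_i - x_i.
def F(x):
    return sum(v * math.log(v) - v for v in x)

def gradF(x):
    return [math.log(v) for v in x]

p = [0.2, 0.5, 1.3]
q = [0.7, 0.4, 0.9]
alpha = 1e-6
scaled = skew_jensen(F, p, q, alpha) / (alpha * (1 - alpha))
print("scaled J_alpha for small alpha:", scaled)
print("B(p : q):", bregman(F, gradF, p, q))
print("tJ_0.5(p : q):", total_jensen(F, p, q, 0.5))
```

For a small `alpha`, the scaled skew Jensen value should be close to the Bregman value, which is the limit stated above.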
Total Jensen divergence: Illustration

[Figure: the skew Jensen divergence $J_\alpha(p : q)$ and the total Jensen divergence $tJ_\alpha(p : q)$ on the graph of $F$, with the points $(F(p)F(q))_\alpha$, $(F(p)F(q))_\beta$ and $F((pq)_\alpha)$, shown for two pairs $(p, q)$ and $(p', q')$ with origins $O$ and $O'$.]

Total Jensen divergence: Illustration

$\alpha$ on the graph plot, $\beta$ on the interpolated segment.

Two kinds of total Jensen divergences (but one always yields a closed form).

[Figure: two panels showing the regimes $\beta > 1$, $\beta \in [0, 1]$ and $\beta < 0$ for the point $(F(p)F(q))_\beta$ relative to $F((pq)_\alpha)$, between $p$ and $q$.]

Total Jensen divergences/Total Bregman divergences

Total Jensen is not a generalization of total Bregman. In the limit cases $\alpha \in \{0, 1\}$, we have:
$$\lim_{\alpha \to 0} tJ_\alpha(p : q) = \rho_J(p, q) \, B(p : q) \neq \rho_B(q) \, B(p : q),$$
$$\lim_{\alpha \to 1} tJ_\alpha(p : q) = \rho_J(p, q) \, B(q : p) \neq \rho_B(p) \, B(q : p),$$
since $\rho_J(p, q) \neq \rho_B(q)$.

Squared chord slope index in $\rho_J$, with $\Delta = p - q$ and $\Delta_F = F(p) - F(q)$:
$$s^2 = \frac{\Delta_F^2}{\langle \Delta, \Delta \rangle} = \frac{\Delta^\top \nabla F(\epsilon) \, \Delta^\top \nabla F(\epsilon)}{\Delta^\top \Delta} = \langle \nabla F(\epsilon), \nabla F(\epsilon) \rangle = \| \nabla F(\epsilon) \|^2.$$

Conformal factor from mean value theorem

When $p \simeq q$, $\rho_J(p, q) \simeq \rho_B(q)$, and the total Jensen divergence tends to the total Bregman divergence for any value of $\alpha$.
$$\rho_J(p, q) = \frac{1}{\sqrt{1 + \langle \nabla F(\epsilon), \nabla F(\epsilon) \rangle}} = \rho_B(\epsilon), \qquad \text{for } \epsilon \in [p, q].$$

For univariate generators, the value of $\epsilon$ is explicit:
$$\epsilon = \nabla F^{-1}\!\left( \frac{\Delta_F}{\Delta} \right) = \nabla F^{*}\!\left( \frac{\Delta_F}{\Delta} \right),$$
where $F^*$ is the Legendre convex conjugate [5]. This $\epsilon$ is a Stolarsky mean [7], and
$$tJ_\alpha(p : q) = \rho_B(\epsilon) \, J_\alpha(p : q).$$

Centroids and statistical robustness

Centroids (barycenters) are minimizers of average (weighted) divergences:
$$L(x; w) = \sum_{i=1}^{n} w_i \times tJ_\alpha(p_i : x), \qquad c_\alpha = \arg\min_{x \in \mathcal{X}} L(x; w).$$

◮ Is it unique?
◮ Is it robust to outliers [3]?

The centroid is computed with the iterative convex-concave procedure (CCCP) [5].
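The mean-value form of the conformal factor given above for univariate generators can be checked numerically. The following sketch assumes the univariate generator $F(x) = x \log x - x$, for which $F'(x) = \log x$ and $(F')^{-1}(s) = e^s$; the function names and the sample values of $p$ and $q$ are illustrative choices, not taken from the slides.

```python
import math

# Assumed univariate generator: F(x) = x log x - x on x > 0.
def F(x):
    return x * math.log(x) - x

def dF(x):
    """Derivative F'(x) = log x."""
    return math.log(x)

def dF_inv(s):
    """Inverse of the derivative: (F')^{-1}(s) = exp(s)."""
    return math.exp(s)

def rho_jensen(p, q):
    """rho_J(p, q) computed from the chord slope Delta_F / Delta."""
    slope = (F(p) - F(q)) / (p - q)
    return 1.0 / math.sqrt(1.0 + slope ** 2)

def rho_bregman(x):
    """rho_B(x) computed from the tangent slope F'(x)."""
    return 1.0 / math.sqrt(1.0 + dF(x) ** 2)

def mean_value_point(p, q):
    """epsilon such that F'(epsilon) equals the chord slope between p and q."""
    return dF_inv((F(p) - F(q)) / (p - q))

p, q = 0.5, 3.0
eps = mean_value_point(p, q)
print("epsilon:", eps)
print("rho_J(p, q):", rho_jensen(p, q))
print("rho_B(epsilon):", rho_bregman(eps))
```

The two printed conformal factors should agree up to rounding, and `eps` should lie between `p` and `q`, as the mean value theorem requires.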
Robustness of Jensen centroids (univariate generator)

Theorem. The Jensen centroid is robust for a strictly convex and smooth generator $f$ if $\left| f'\!\left( \frac{p+y}{2} \right) \right|$ is bounded on the domain $\mathcal{X}$ for any prescribed $p$.

◮ Jensen-Shannon: $\mathcal{X} = \mathbb{R}^+$, $f(x) = x \log x - x$, $f'(x) = \log x$, $f''(x) = 1/x$.
$\left| f'\!\left( \frac{p+y}{2} \right) \right| = \left| \log \frac{p+y}{2} \right|$ is unbounded when $y \to +\infty$. The JS centroid is not robust.
◮ Jensen-Burg: $\mathcal{X} = \mathbb{R}^+$, $f(x) = -\log x$, $f'(x) = -1/x$, $f''(x) = \frac{1}{x^2}$.
$\left| f'\!\left( \frac{p+y}{2} \right) \right| = \left| \frac{2}{p+y} \right|$ is always bounded for $y \in (0, +\infty)$.
$$z(y) = 2 p^2 \left( \frac{1}{p} - \frac{2}{p + y} \right)$$
When $y \to \infty$, we have $|z(y)| \to 2p < \infty$. The JB centroid is robust.

Clustering: No closed-form centroid, no cry!

k-means++ [1] picks its seeds at random and requires no centroid calculation.

Divergence-based k-means++

Theorem. Suppose there exist some $U$ and $V$ such that, $\forall x, y, z$:
$$tJ_\alpha(x : z) \leq U \left( tJ_\alpha(x : y) + tJ_\alpha(y : z) \right) \quad \text{(triangular inequality)},$$
$$tJ_\alpha(x : z) \leq V \, tJ_\alpha(z : x) \quad \text{(symmetric inequality)}.$$
Then the average potential of total Jensen seeding with $k$ clusters satisfies
$$E[tJ_\alpha] \leq 2 U^2 (1 + V)(2 + \log k) \, tJ_{\mathrm{opt},\alpha},$$
where $tJ_{\mathrm{opt},\alpha}$ is the minimal total Jensen potential achieved by a clustering in $k$ clusters.

Divergence-based k-means++: Two assumptions H

Assumptions H:

◮ First, the maximal condition number of the Hessian of $F$, that is, the ratio between the maximal and minimal eigenvalue ($> 0$) of the Hessian of $F$, is upper bounded by $K_1$.
◮ Second, we assume the Lipschitz condition on $F$ that $\Delta_F^2 / \langle \Delta, \Delta \rangle \leq K_2$, for some $K_2 > 0$.

Lemma. Assume $0 < \alpha < 1$. Then, under assumption H, for any $p, q, r \in S$, there exists $\epsilon > 0$ such that:
$$tJ_\alpha(p : r) \leq \frac{2 (1 + K_2) K_1^2}{\epsilon} \left( \frac{1}{1 - \alpha} \, tJ_\alpha(p : q) + \frac{1}{\alpha} \, tJ_\alpha(q : r) \right).$$
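The divergence-based seeding discussed above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it follows the k-means++ scheme of [1], with the total Jensen divergence taking the place of the squared Euclidean distance. The generator $F(x) = \sum_i x_i \log x_i - x_i$, the value $\alpha = 0.5$, the argument order `divergence(seed, point)`, the use of the unscaled skew Jensen divergence, and all function names are assumptions.

```python
import math
import random

ALPHA = 0.5  # skew parameter, chosen arbitrarily for this illustration

def F(x):
    """Assumed generator: F(x) = sum_i x_i log x_i - x_i on positive vectors."""
    return sum(v * math.log(v) - v for v in x)

def total_jensen(p, q, alpha=ALPHA):
    """Total Jensen divergence built on the unscaled skew Jensen divergence."""
    if p == q:
        return 0.0  # avoids the 0/0 in the conformal factor
    mix = [alpha * a + (1 - alpha) * b for a, b in zip(p, q)]
    jensen = alpha * F(p) + (1 - alpha) * F(q) - F(mix)
    sq_norm = sum((a - b) ** 2 for a, b in zip(p, q))
    rho = 1.0 / math.sqrt(1.0 + (F(p) - F(q)) ** 2 / sq_norm)
    return max(0.0, rho * jensen)  # clamp tiny negative rounding errors

def seed(points, k, divergence=total_jensen, rng=None):
    """Pick k seeds: the first uniformly, the others with probability
    proportional to the divergence to the nearest seed chosen so far."""
    rng = rng or random.Random()
    seeds = [rng.choice(points)]
    cost = [divergence(seeds[0], x) for x in points]
    while len(seeds) < k:
        if sum(cost) <= 0.0:
            raise ValueError("fewer than k distinct points")
        new = rng.choices(points, weights=cost, k=1)[0]
        seeds.append(new)
        # Update each point's cost with respect to its nearest seed.
        cost = [min(c, divergence(new, x)) for c, x in zip(cost, points)]
    return seeds

points = [(0.1, 0.2), (0.15, 0.25), (5.0, 6.0),
          (5.5, 6.5), (20.0, 1.0), (21.0, 1.5)]
seeds = seed(points, k=3, rng=random.Random(0))
potential = sum(min(total_jensen(c, x) for c in seeds) for x in points)
print("seeds:", seeds)
print("potential:", potential)
```

No centroid is ever computed: the seeds are data points, and `potential` is the quantity that the theorem above bounds in expectation.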
Divergence-based k-means++

Corollary. The total skew Jensen divergence satisfies the following triangular inequality:
$$tJ_\alpha(p : r) \leq \frac{2 (1 + K_2) K_1^2}{\epsilon \, \alpha (1 - \alpha)} \left( tJ_\alpha(p : q) + tJ_\alpha(q : r) \right),$$
$$U = \frac{2 (1 + K_2) K_1^2}{\epsilon}.$$

Lemma. The symmetric inequality condition holds for $V = K_1^2 (1 + K_2) / \epsilon$, for some $0 < \epsilon < 1$.

Total Jensen divergences: Recap

Total Jensen divergence = conformal divergence with non-separable double-sided conformal factor.

◮ Invariant to axis rotation of the "design space".
◮ Equivalent to total Bregman divergences [8, 4] only when $p \simeq q$.
◮ The square root of the total Jensen-Shannon divergence is not a metric (whereas the square root of the Jensen-Shannon divergence is a metric).
◮ Jensen centroids are not always robust (e.g., the Jensen-Shannon centroid).
◮ Total Jensen k-means++ does not require centroid computations and has a guaranteed approximation factor.

Conformal divergences are also of interest in SVM [9] (double-sided separable) and in information geometry [6] (flattening).

Thank you.

@article{totalJensen-arXiv1309.7109,
  author="Frank Nielsen and Richard Nock",
  title="Total {J}ensen divergences: {D}efinition, Properties and $k$-Means++ Clustering",
  year="2013",
  eprint="arXiv/1309.7109"
}

www.informationgeometry.org

Bibliographic references

[1] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[2] Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In IEEE International Symposium on Information Theory, pages 31–31, 2004.
[3] F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Mathematical Statistics, 1986.
[4] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.
[5] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
[6] Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari. A dually flat structure on the space of escort distributions. Journal of Physics: Conference Series, 201(1):012012, 2010.
[7] Kenneth B. Stolarsky. Generalizations of the logarithmic mean. Mathematics Magazine, 48(2):87–92, 1975.
[8] Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, pages 475–483, 2011.
[9] Si Wu and Shun-ichi Amari. Conformal transformation of kernel functions: a data dependent way to improve support vector machine classifiers.
