Total Jensen divergences: Definition, Properties and k-Means++ Clustering

Frank Nielsen (1), Richard Nock (2)
www.informationgeometry.org

(1) Sony Computer Science Laboratories, Inc. (2) UAG-CEREGMIA

September 2013

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

Divergences: Distortion measures

$F$ is a smooth convex function, called the generator.

◮ Skew Jensen divergences:

$$J'_\alpha(p : q) = \alpha F(p) + (1-\alpha) F(q) - F(\alpha p + (1-\alpha) q) = (F(p)F(q))_\alpha - F((pq)_\alpha),$$

where $(pq)_\gamma = \gamma p + (1-\gamma) q = q + \gamma (p - q)$ and $(F(p)F(q))_\gamma = \gamma F(p) + (1-\gamma) F(q) = F(q) + \gamma (F(p) - F(q))$.

◮ Bregman divergences:

$$B(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle,$$

$$\lim_{\alpha \to 0} J_\alpha(p : q) = B(p : q), \qquad \lim_{\alpha \to 1} J_\alpha(p : q) = B(q : p),$$

where $J_\alpha = \frac{1}{\alpha(1-\alpha)} J'_\alpha$ denotes the scaled skew Jensen divergence (the unscaled $J'_\alpha$ vanishes in both limits).

◮ Statistical Bhattacharyya divergence:

$$\mathrm{Bhat}_\alpha(p_1 : p_2) = -\log \int p_1(x)^\alpha \, p_2(x)^{1-\alpha} \, \mathrm{d}\nu(x) = J'_\alpha(\theta_1 : \theta_2)$$

for exponential families [5].

Geometrically designed divergences

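[Figure: plot of the convex generator $F : x \mapsto (x, F(x))$, with the points $(p, F(p))$ and $(q, F(q))$, the midpoint $\frac{p+q}{2}$, and the divergences $J(p, q)$, $B(p : q)$ and $tB(p : q)$ marked on the graph.]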
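The skew Jensen and Bregman divergences defined above can be checked numerically. The following is a minimal sketch for a scalar generator; the choice $F(x) = x \log x - x$, the sample points and the helper names are assumptions made for illustration, and the scaled divergence is taken to be $J'_\alpha / (\alpha(1-\alpha))$.

```python
import math

def skew_jensen(F, p, q, alpha):
    """Unscaled skew Jensen divergence J'_alpha(p : q) for a scalar generator F."""
    return alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)

def scaled_skew_jensen(F, p, q, alpha):
    """Scaled divergence J_alpha = J'_alpha / (alpha * (1 - alpha))."""
    return skew_jensen(F, p, q, alpha) / (alpha * (1 - alpha))

def bregman(F, dF, p, q):
    """Bregman divergence B(p : q) = F(p) - F(q) - (p - q) * F'(q)."""
    return F(p) - F(q) - (p - q) * dF(q)

# Illustrative generator (an assumption): F(x) = x log x - x on x > 0.
F = lambda x: x * math.log(x) - x
dF = lambda x: math.log(x)

p, q = 2.0, 5.0
near_zero = scaled_skew_jensen(F, p, q, 1e-6)
near_one = scaled_skew_jensen(F, p, q, 1 - 1e-6)
print(near_zero, bregman(F, dF, p, q))  # the two values should nearly agree
print(near_one, bregman(F, dF, q, p))   # likewise for the reverse Bregman divergence
```

Taking $\alpha$ close to 0 or 1 rather than exactly 0 or 1 avoids the division by zero in the scaling factor.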

Total Bregman divergences

Conformal divergence, with conformal factor $\rho$:

$$D'(p : q) = \rho(p, q) D(p : q)$$

The conformal factor plays the role of a “regularizer” [8].

Invariance by rotation of the axes of the design space:

$$tB(p : q) = \frac{B(p : q)}{\sqrt{1 + \langle \nabla F(q), \nabla F(q) \rangle}} = \rho_B(q) B(p : q), \qquad \rho_B(q) = \frac{1}{\sqrt{1 + \langle \nabla F(q), \nabla F(q) \rangle}}.$$
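As a minimal sketch of the total Bregman divergence, the code below evaluates $B$, $\rho_B$ and $tB$ for vectors given as Python lists. The generator $F(x) = \sum_i (x_i \log x_i - x_i)$, the sample points and the function names are assumptions chosen for illustration; for this generator $B$ reduces to the extended Kullback-Leibler divergence, which gives a convenient sanity check.

```python
import math

def bregman(F, grad_F, p, q):
    """B(p : q) = F(p) - F(q) - <p - q, grad F(q)>, points given as lists."""
    g = grad_F(q)
    return F(p) - F(q) - sum((pi - qi) * gi for pi, qi, gi in zip(p, q, g))

def rho_B(grad_F, q):
    """Conformal factor 1 / sqrt(1 + <grad F(q), grad F(q)>)."""
    return 1.0 / math.sqrt(1.0 + sum(gi * gi for gi in grad_F(q)))

def total_bregman(F, grad_F, p, q):
    """tB(p : q) = rho_B(q) * B(p : q)."""
    return rho_B(grad_F, q) * bregman(F, grad_F, p, q)

# Illustrative generator (an assumption, not from the slides):
# F(x) = sum_i (x_i log x_i - x_i) on positive vectors, grad F(x)_i = log x_i.
F = lambda x: sum(xi * math.log(xi) - xi for xi in x)
grad_F = lambda x: [math.log(xi) for xi in x]

p, q = [1.0, 2.0], [3.0, 1.0]
print(bregman(F, grad_F, p, q), total_bregman(F, grad_F, p, q))
```

Since $\rho_B(q) \leq 1$, the total Bregman divergence never exceeds the ordinary one.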

Total squared Euclidean divergence:

$$tE(p, q) = \frac{1}{2} \frac{\langle p - q, p - q \rangle}{\sqrt{1 + \langle q, q \rangle}}.$$

Total Jensen divergences

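$$tB(p : q) = \rho_B(q) B(p : q), \qquad \rho_B(q) = \sqrt{\frac{1}{1 + \langle \nabla F(q), \nabla F(q) \rangle}}$$

$$tJ_\alpha(p : q) = \rho_J(p, q) J_\alpha(p : q), \qquad \rho_J(p, q) = \sqrt{\frac{1}{1 + \frac{(F(p) - F(q))^2}{\langle p - q, p - q \rangle}}}$$

Jensen-Shannon divergence, whose square root is a metric [2]:

$$JS(p, q) = \frac{1}{2} \sum_{i=1}^d p_i \log \frac{2 p_i}{p_i + q_i} + \frac{1}{2} \sum_{i=1}^d q_i \log \frac{2 q_i}{p_i + q_i}$$

Lemma. The square root of the total Jensen-Shannon divergence is not a metric.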
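The definition of the total Jensen divergence can be sketched in a few lines. The code below is an illustration under stated assumptions: it uses the unscaled $J'_\alpha$ (divide by $\alpha(1-\alpha)$ for the scaled variant), the Shannon generator $F(x) = \sum_i x_i \log x_i$, and hypothetical function names and sample distributions. With $\alpha = 1/2$ and this generator, $J'_\alpha$ coincides with the Jensen-Shannon formula above.

```python
import math

def skew_jensen(F, p, q, alpha):
    """J'_alpha(p : q) = alpha F(p) + (1 - alpha) F(q) - F(alpha p + (1 - alpha) q)."""
    mix = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return alpha * F(p) + (1 - alpha) * F(q) - F(mix)

def rho_J(F, p, q):
    """Conformal factor 1 / sqrt(1 + (F(p) - F(q))^2 / <p - q, p - q>), for p != q."""
    delta_F = F(p) - F(q)
    sq_dist = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return 1.0 / math.sqrt(1.0 + delta_F ** 2 / sq_dist)

def total_skew_jensen(F, p, q, alpha=0.5):
    """tJ'_alpha(p : q) = rho_J(p, q) * J'_alpha(p : q)."""
    return rho_J(F, p, q) * skew_jensen(F, p, q, alpha)

# Illustrative generator (an assumption): Shannon negative entropy.
F = lambda x: sum(xi * math.log(xi) for xi in x)

p, q = [0.2, 0.8], [0.6, 0.4]
print(skew_jensen(F, p, q, 0.5), total_skew_jensen(F, p, q, 0.5))
```

Unlike $\rho_B(q)$, the factor $\rho_J(p, q)$ depends on both arguments, which is why the sketch needs $p \neq q$.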

Total Jensen divergence: Illustration

[Figure: graph of $F$ drawn in two coordinate frames with origin $O$, one for the pair $(p, q)$ and one for a second pair $(p', q')$. Each frame marks the chord points $(F(p)F(q))_\alpha$ and $(F(p)F(q))_\beta$, the graph point $F((pq)_\alpha)$, and the corresponding skew Jensen and total skew Jensen divergences.]

Total Jensen divergence: Illustration

$\alpha$ on graph plot, $\beta$ on interpolated segment.

Two kinds of total Jensen divergences (but one always yields a closed form).

[Figure: two panels over the interval between $p$ and $q$, each showing the position of $(F(p)F(q))_\beta$ relative to $F((pq)_\alpha)$ for the cases $\beta > 1$, $\beta \in [0, 1]$ and $\beta < 0$.]

Total Jensen divergences/Total Bregman divergences

Total Jensen is not a generalization of total Bregman. In the limit cases $\alpha \in \{0, 1\}$, we have:

$$\lim_{\alpha \to 0} tJ_\alpha(p : q) = \rho_J(p, q) B(p : q) \neq \rho_B(q) B(p : q),$$

$$\lim_{\alpha \to 1} tJ_\alpha(p : q) = \rho_J(p, q) B(q : p) \neq \rho_B(p) B(q : p),$$

since in general $\rho_J(p, q) \neq \rho_B(q)$.

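Squared chord index in $\rho_J$, with $\Delta_F = F(p) - F(q)$ and $\Delta = p - q$:

$$s^2 = \frac{\Delta_F^2}{\|\Delta\|^2} = \frac{\Delta^\top \nabla F(\epsilon) \, \Delta^\top \nabla F(\epsilon)}{\Delta^\top \Delta} = \langle \nabla F(\epsilon), \nabla F(\epsilon) \rangle = \|\nabla F(\epsilon)\|^2.$$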

Conformal factor from the mean value theorem

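When $p \simeq q$, $\rho_J(p, q) \simeq \rho_B(q)$, and the total Jensen divergence tends to the total Bregman divergence for any value of $\alpha$.

$$\rho_J(p, q) = \frac{1}{\sqrt{1 + \langle \nabla F(\epsilon), \nabla F(\epsilon) \rangle}} = \rho_B(\epsilon), \quad \text{for } \epsilon \in [p, q].$$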

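For univariate generators, the value of $\epsilon$ is explicit:

$$\epsilon = (\nabla F)^{-1}\left(\frac{\Delta_F}{\Delta}\right) = \nabla F^*\left(\frac{\Delta_F}{\Delta}\right),$$

where $\Delta_F = F(p) - F(q)$, $\Delta = p - q$, and $F^*$ is the Legendre convex conjugate [5]. This $\epsilon$ is a Stolarsky mean [7] of $p$ and $q$, and

$$tJ_\alpha(p : q) = \rho_B(\epsilon) J_\alpha(p : q).$$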
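The univariate formula for the mean-value point can be worked through for a concrete generator. The sketch below assumes $F(x) = x \log x - x$, for which $F'(x) = \log x$ and $(F')^{-1}(y) = e^y$; the generator, sample points and function names are illustrative choices, not taken from the slides.

```python
import math

# Illustrative generator (an assumption): F(x) = x log x - x on x > 0,
# so F'(x) = log x and (F')^{-1}(y) = exp(y).
F = lambda x: x * math.log(x) - x
dF = math.log
dF_inv = math.exp

def epsilon(p, q):
    """Mean-value point (F')^{-1}((F(p) - F(q)) / (p - q)), for p != q."""
    return dF_inv((F(p) - F(q)) / (p - q))

def rho_J(p, q):
    """Double-sided conformal factor computed from the chord slope."""
    slope = (F(p) - F(q)) / (p - q)
    return 1.0 / math.sqrt(1.0 + slope ** 2)

def rho_B(x):
    """Single-sided conformal factor 1 / sqrt(1 + F'(x)^2)."""
    return 1.0 / math.sqrt(1.0 + dF(x) ** 2)

p, q = 2.0, 5.0
eps = epsilon(p, q)
print(eps, rho_J(p, q), rho_B(eps))  # the last two values should agree
```

For this particular generator the mean-value point is the identric mean of $p$ and $q$, one member of the Stolarsky family.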

Centroids and statistical robustness

Centroids (barycenters) are minimizers of average (weighted) divergences:

$$L(x; w) = \sum_{i=1}^n w_i \times tJ_\alpha(p_i : x), \qquad c_\alpha = \arg\min_{x \in \mathcal{X}} L(x; w),$$

◮ Is it unique?
◮ Is it robust to outliers [3]?

Iterative convex-concave procedure (CCCP) [5]
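To give a feel for a CCCP-style iteration, here is a minimal sketch for the plain (non-total) Jensen centroid with $\alpha = 1/2$ and a univariate generator, in the spirit of the Burbea-Rao centroid iteration of [5]. It deliberately ignores the conformal factor $\rho_J$, so it is not the total Jensen centroid itself. The update rule $c \leftarrow (F')^{-1}\left(\sum_i w_i F'\left(\frac{p_i + c}{2}\right)\right)$, the Burg generator $F(x) = -\log x$, the data and the function name are assumptions for illustration.

```python
def jensen_centroid_cccp(points, weights, dF, dF_inv, iters=200):
    """Fixed-point iteration for the (non-total) Jensen centroid, alpha = 1/2.

    Each step solves F'(c_new) = sum_i w_i F'((p_i + c) / 2) for c_new.
    """
    c = sum(w * p for w, p in zip(weights, points))  # start from the weighted mean
    for _ in range(iters):
        c = dF_inv(sum(w * dF(0.5 * (p + c)) for w, p in zip(weights, points)))
    return c

# Illustrative generator (an assumption): Burg entropy F(x) = -log x on x > 0.
dF = lambda x: -1.0 / x
dF_inv = lambda y: -1.0 / y

points = [1.0, 2.0, 4.0]
weights = [1 / 3, 1 / 3, 1 / 3]
c = jensen_centroid_cccp(points, weights, dF, dF_inv)
print(c)
```

For these three points the iteration should settle near $c = 2$, where the stationarity condition $F'(c) = \sum_i w_i F'\left(\frac{p_i + c}{2}\right)$ holds.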

Robustness of Jensen centroids (univariate generator)

Theorem. The Jensen centroid is robust for a strictly convex and smooth generator $f$ if $\left|f'\left(\frac{p+y}{2}\right)\right|$ is bounded on the domain $\mathcal{X}$ for any prescribed $p$.

◮ Jensen-Shannon: $\mathcal{X} = \mathbb{R}_+$, $f(x) = x \log x - x$, $f'(x) = \log x$, $f''(x) = 1/x$.
$\left|f'\left(\frac{p+y}{2}\right)\right| = \left|\log \frac{p+y}{2}\right|$ is unbounded when $y \to +\infty$. The JS centroid is not robust.

◮ Jensen-Burg: $\mathcal{X} = \mathbb{R}_+$, $f(x) = -\log x$, $f'(x) = -1/x$, $f''(x) = \frac{1}{x^2}$.
$\left|f'\left(\frac{p+y}{2}\right)\right| = \left|\frac{2}{p+y}\right|$ is always bounded for $y \in (0, +\infty)$.

$$z(y) = 2p^2 \left( \frac{1}{p} - \frac{2}{p+y} \right)$$

When $y \to \infty$, we have $|z(y)| \to 2p < \infty$. The JB centroid is robust.
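The boundedness criterion of the theorem can be illustrated numerically. This small sketch evaluates $\left|f'\left(\frac{p+y}{2}\right)\right|$ for both generators as the outlier $y$ grows, together with the quantity $z(y)$ written above for the Jensen-Burg case; the value $p = 1$ and the grid of $y$ values are arbitrary choices for illustration.

```python
import math

p = 1.0
ys = [1e1, 1e3, 1e6, 1e9]  # increasingly distant outliers

# Jensen-Shannon generator f(x) = x log x - x: |f'((p + y) / 2)| = |log((p + y) / 2)|
js_vals = [abs(math.log((p + y) / 2)) for y in ys]

# Jensen-Burg generator f(x) = -log x: |f'((p + y) / 2)| = 2 / (p + y)
jb_vals = [2.0 / (p + y) for y in ys]

# z(y) = 2 p^2 (1/p - 2/(p + y)) for the Jensen-Burg case
z_vals = [2 * p ** 2 * (1 / p - 2 / (p + y)) for y in ys]

for y, js, jb, z in zip(ys, js_vals, jb_vals, z_vals):
    print(f"y={y:g}  JS: {js:.4f}  JB: {jb:.3e}  z(y): {z:.6f}")
```

The Jensen-Shannon column keeps growing with $y$, while the Jensen-Burg column stays below $2/p$ and $z(y)$ levels off near $2p$.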

Clustering: No closed-form centroid, no cry!

k-means++ [1] picks its seeds at random among the data points, so no centroid computation is needed.
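A k-means++-style seeding with a total Jensen divergence in place of the squared Euclidean distance might look as follows. This is a minimal sketch for scalar data: the function names, the generator $F(x) = x \log x - x$, the use of the unscaled $J'_\alpha$, and the orientation of the divergence (data point first, seed second) are all assumptions made for illustration.

```python
import math
import random

def total_jensen(F, p, q, alpha=0.5):
    """rho_J(p, q) * J'_alpha(p : q) for scalar points (0 when p == q)."""
    if p == q:
        return 0.0
    j = alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)
    slope = (F(p) - F(q)) / (p - q)
    return j / math.sqrt(1.0 + slope ** 2)

def seed_kmeanspp(points, k, divergence, rng=random):
    """Pick k seeds among the points; each new seed is sampled with probability
    proportional to its divergence to the closest seed chosen so far.
    Assumes k does not exceed the number of distinct points."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        costs = [min(divergence(x, s) for s in seeds) for x in points]
        seeds.append(rng.choices(points, weights=costs, k=1)[0])
    return seeds

# Illustrative generator and data (assumptions).
F = lambda x: x * math.log(x) - x
points = [1.0, 1.1, 5.0, 5.2, 20.0]
seeds = seed_kmeanspp(points, 3, lambda x, s: total_jensen(F, x, s), random.Random(0))
print(seeds)
```

Only divergence evaluations between data points are required, which is what makes this seeding usable when the centroid has no closed form.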

Divergence-based k-means++

Theorem. Suppose there exist some $U$ and $V$ such that, $\forall x, y, z$:

$$tJ_\alpha(x : z) \leq U \left( tJ_\alpha(x : y) + tJ_\alpha(y : z) \right) \quad \text{(triangular inequality)}$$

$$tJ_\alpha(x : z) \leq V \, tJ_\alpha(z : x) \quad \text{(symmetric inequality)}$$

Then the average potential of total Jensen seeding with $k$ clusters satisfies

$$E[tJ_\alpha] \leq 2 U^2 (1 + V)(2 + \log k) \, tJ_{\mathrm{opt},\alpha},$$

where $tJ_{\mathrm{opt},\alpha}$ is the minimal total Jensen potential achieved by a clustering in $k$ clusters.

Divergence-based k-means++: Two assumptions H

H:

◮ First, the maximal condition number of the Hessian of $F$, that is, the ratio between the maximal and minimal eigenvalues ($> 0$) of the Hessian of $F$, is upper bounded by $K_1$.
◮ Second, we assume the Lipschitz condition on $F$ that $\Delta_F^2 / \langle \Delta, \Delta \rangle \leq K_2$, for some $K_2 > 0$.

Lemma. Assume $0 < \alpha < 1$. Then, under assumption H, for any $p, q, r \in S$, there exists $\epsilon > 0$ such that:

$$tJ_\alpha(p : r) \leq \frac{2 (1 + K_2) K_1^2}{\epsilon} \left( \frac{1}{1 - \alpha} tJ_\alpha(p : q) + \frac{1}{\alpha} tJ_\alpha(q : r) \right).$$

Divergence-based k-means++

Corollary. The total skew Jensen divergence satisfies the following triangular inequality:

$$tJ_\alpha(p : r) \leq \frac{2 (1 + K_2) K_1^2}{\epsilon \alpha (1 - \alpha)} \left( tJ_\alpha(p : q) + tJ_\alpha(q : r) \right).$$

$$U = \frac{2 (1 + K_2) K_1^2}{\epsilon}$$

Lemma. The symmetric inequality condition holds for $V = K_1^2 (1 + K_2) / \epsilon$, for some $0 < \epsilon < 1$.
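To see how the constants combine, the sketch below plugs $U$ and $V$ into the factor $2U^2(1+V)(2+\log k)$ of the seeding theorem. The formulas are taken as read from the slides, with $U = 2(1+K_2)K_1^2/\epsilon$ and $V = K_1^2(1+K_2)/\epsilon$; the function names and the numeric values of $K_1$, $K_2$, $\epsilon$ and $k$ are hypothetical, and $\log$ is assumed to be the natural logarithm.

```python
import math

def seeding_constants(K1, K2, eps):
    """U and V as read from the slides (triangular and symmetric inequality constants)."""
    U = 2 * (1 + K2) * K1 ** 2 / eps
    V = K1 ** 2 * (1 + K2) / eps
    return U, V

def approximation_factor(U, V, k):
    """Multiplicative factor 2 U^2 (1 + V) (2 + log k); natural log is an assumption."""
    return 2 * U ** 2 * (1 + V) * (2 + math.log(k))

# Hypothetical values, chosen only to show the arithmetic.
U, V = seeding_constants(K1=2.0, K2=1.0, eps=0.5)
for k in (1, 10, 100):
    print(k, approximation_factor(U, V, k))
```

The factor grows only logarithmically in $k$ but quadratically in $U$, so the conditioning of the generator dominates the guarantee.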

Total Jensen divergences: Recap

Total Jensen divergence = conformal divergence with a non-separable double-sided conformal factor.

◮ Invariant to axis rotation of the “design space”
◮ Equivalent to total Bregman divergences [8, 4] only when $p \simeq q$
◮ The square root of the total Jensen-Shannon divergence is not a metric (whereas the square root of the Jensen-Shannon divergence is a metric)
◮ Jensen centroids are not always robust (e.g., the Jensen-Shannon centroid)
◮ Total Jensen k-means++ does not require centroid computations and comes with a guaranteed approximation

Interest of conformal divergences in SVM [9] (double-sided separable) and in information geometry [6] (flattening).

Thank you.

@article{totalJensen-arXiv1309.7109,
  author = "Frank Nielsen and Richard Nock",
  title = "Total {J}ensen divergences: {D}efinition, Properties and $k$-Means++ Clustering",
  year = "2013",
  eprint = "arXiv/1309.7109"
}

www.informationgeometry.org

Bibliographic references

[1] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[2] Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In IEEE International Symposium on Information Theory, pages 31–31, 2004.

[3] F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Mathematical Statistics, 1986.

[4] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.

[5] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.

[6] Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari. A dually flat structure on the space of escort distributions. Journal of Physics: Conference Series, 201(1):012012, 2010.

[7] Kenneth B. Stolarsky. Generalizations of the . Mathematics Magazine, 48(2):87–92, 1975.

[8] Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, pages 475–483, 2011.

[9] Si Wu and Shun-ichi Amari. Conformal transformation of kernel functions: a data dependent way to improve support vector machine classifiers. Neural Processing Letters, 15(1):59–67, 2002.
