All in the Exponential Family: Bregman Duality in Thermodynamic Variational Inference


A. Conjugate Duality

The Bregman divergence associated with a convex function $f : \Omega \to \mathbb{R}$ can be written as (Banerjee et al., 2005):

$$D_f[p : q] = f(p) - f(q) - \langle p - q, \nabla f(q) \rangle$$

The family of Bregman divergences includes many familiar quantities, including the KL divergence corresponding to the negative entropy generator $f(p) = \int p \log p \, d\omega$. Geometrically, the divergence can be viewed as the difference between $f(p)$ and its linear approximation around $q$. Since $f$ is convex, a first-order estimator lies below the function, yielding $D_f[p : q] \geq 0$.

For our purposes, we let $f = \psi$, $\psi(\beta) = \log Z_\beta$, over the domain of probability distributions indexed by the natural parameters of an exponential family (e.g. (13)):

$$D_\psi[\beta_p : \beta_q] = \psi(\beta_p) - \psi(\beta_q) - \langle \beta_p - \beta_q, \nabla_\beta \psi(\beta_q) \rangle \tag{34}$$

This is a common setting in the field of information geometry (Amari, 2016), which introduces dually flat manifold structures based on the natural parameters and the mean parameters.

A.1. KL Divergence as a Bregman Divergence

For an exponential family with partition function $\psi(\beta)$ and sufficient statistics $T(\omega)$ over a random variable $\omega$, the Bregman divergence $D_\psi$ corresponds to a KL divergence. Recalling that $\nabla_\beta \psi(\beta) = \eta_\beta = \mathbb{E}_{\pi_\beta}[T(\omega)]$ from (16), we simplify the definition (34) to obtain

$$\begin{aligned}
D_\psi[\beta_p : \beta_q] &= \psi(\beta_p) - \psi(\beta_q) - \beta_p \cdot \eta_q + \beta_q \cdot \eta_q \\
&= \psi(\beta_p) - \psi(\beta_q) - \mathbb{E}_q[\beta_p \cdot T(\omega)] + \mathbb{E}_q[\beta_q \cdot T(\omega)] \\
&= \underbrace{\mathbb{E}_q\big[\beta_q \cdot T(\omega) - \psi(\beta_q)\big] + \mathbb{E}_q[\log \pi_0(\omega)]}_{\mathbb{E}_q \log q(\omega)} - \underbrace{\Big(\mathbb{E}_q\big[\beta_p \cdot T(\omega) - \psi(\beta_p)\big] + \mathbb{E}_q[\log \pi_0(\omega)]\Big)}_{\mathbb{E}_q \log p(\omega)} \\
&= \mathbb{E}_q \log \frac{q(\omega)}{p(\omega)} \\
&= D_{\mathrm{KL}}[q(\omega) \,\|\, p(\omega)]
\end{aligned} \tag{35}$$

where we have added and subtracted terms involving the base measure $\pi_0(\omega)$, and used the definition of our exponential family from (13). The Bregman divergence $D_\psi$ is thus equal to the KL divergence with the arguments reversed.
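The identity (35) can be checked numerically on a one-parameter example. The sketch below (ours, not from the paper) uses the Bernoulli family with natural parameter $\beta$, log-partition $\psi(\beta) = \log(1 + e^\beta)$, and mean parameter $\eta = \sigma(\beta)$; all function names are our own.

```python
import math

def psi(beta):
    # Bernoulli log-partition: psi(beta) = log(1 + e^beta)
    return math.log1p(math.exp(beta))

def mean(beta):
    # Mean parameter eta = d psi / d beta = sigmoid(beta)
    return 1.0 / (1.0 + math.exp(-beta))

def bregman(bp, bq):
    # D_psi[beta_p : beta_q] as in (34), using grad psi(beta_q) = mean(bq)
    return psi(bp) - psi(bq) - (bp - bq) * mean(bq)

def kl_bernoulli(eq, ep):
    # D_KL[q || p] between Bernoulli distributions with means eq, ep
    return eq * math.log(eq / ep) + (1 - eq) * math.log((1 - eq) / (1 - ep))

bp, bq = 1.3, -0.4
# (35): Bregman divergence of the log-partition = KL with arguments reversed
assert abs(bregman(bp, bq) - kl_bernoulli(mean(bq), mean(bp))) < 1e-12
```

The agreement is exact up to floating-point error, since for the Bernoulli family both sides reduce to $(\beta_q - \beta_p)\eta_q + \psi(\beta_p) - \psi(\beta_q)$.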
A.2. Dual Divergence

We can leverage convex duality to derive an alternative divergence based on the conjugate function $\psi^*$:

$$\begin{aligned}
\psi^*(\eta) &= \sup_\beta \, \eta \cdot \beta - \psi(\beta) \quad \implies \quad \eta = \nabla_\beta \psi(\beta) \\
&= \eta \cdot \beta_\eta - \psi(\beta_\eta)
\end{aligned} \tag{36}$$

The conjugate measures the maximum distance between the line $\eta \cdot \beta$ and the function $\psi(\beta)$, which occurs at the unique point $\beta_\eta$ where $\eta = \nabla_\beta \psi(\beta)$. This yields a bijective mapping between $\eta$ and $\beta$ for minimal exponential families (Wainwright & Jordan, 2008). Thus, a distribution $p$ may be indexed by either its natural parameters $\beta_p$ or its mean parameters $\eta_p$.

Noting that $(\psi^*)^* = \psi(\beta) = \sup_\eta \, \eta \cdot \beta - \psi^*(\eta)$ (Boyd & Vandenberghe, 2004), we can use a similar argument as above to write this correspondence as $\beta = \nabla_\eta \psi^*(\eta)$. We can then write the dual divergence $D_{\psi^*}$ as:

$$\begin{aligned}
D_{\psi^*}[\eta_p : \eta_q] &= \psi^*(\eta_p) - \psi^*(\eta_q) - \langle \eta_p - \eta_q, \nabla_\eta \psi^*(\eta_q) \rangle \\
&= \psi^*(\eta_p) - \psi^*(\eta_q) - \eta_p \cdot \beta_q + \eta_q \cdot \beta_q \\
&= \psi^*(\eta_p) + \psi(\beta_q) - \eta_p \cdot \beta_q
\end{aligned} \tag{37}$$

where we have used (36) to simplify the underlined terms. Similarly,

$$\begin{aligned}
D_\psi[\beta_p : \beta_q] &= \psi(\beta_p) - \psi(\beta_q) - \langle \beta_p - \beta_q, \nabla_\beta \psi(\beta_q) \rangle \\
&= \psi(\beta_p) - \psi(\beta_q) - \beta_p \cdot \eta_q + \beta_q \cdot \eta_q \\
&= \psi(\beta_p) + \psi^*(\eta_q) - \beta_p \cdot \eta_q
\end{aligned} \tag{38}$$

Comparing (37) and (38), we see that the divergences are equivalent with the arguments reversed, so that:

$$D_\psi[\beta_p : \beta_q] = D_{\psi^*}[\eta_q : \eta_p] \tag{39}$$

This indicates that the Bregman divergence $D_{\psi^*}$ should also be a KL divergence, but with the same order of arguments. We derive this fact directly in (44), after investigating the form of the conjugate function $\psi^*$.

A.3. Conjugate $\psi^*$ as Negative Entropy

We first treat the case of an exponential family with no base measure $\pi_0(\omega)$, with derivations including a base measure in App. A.4. For a distribution $p$ in an exponential family, indexed by $\beta_p$ or $\eta_p$, we can write $\log p(\omega) = \beta_p \cdot T(\omega) - \psi(\beta_p)$. Then, (36) becomes:

$$\begin{aligned}
\psi^*(\eta_p) &= \beta_p \cdot \eta_p - \psi(\beta_p) & (40) \\
&= \beta_p \cdot \mathbb{E}_p[T(\omega)] - \psi(\beta_p) & (41) \\
&= \mathbb{E}_p \log p(\omega) & (42) \\
&= -H_p(\omega) & (43)
\end{aligned}$$

since $\beta_p$ and $\psi(\beta_p)$ are constant with respect to $\omega$.
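Both the negative-entropy form (40)-(43) of the conjugate and the reversal identity (39) can be confirmed in a small sketch (ours, not from the paper), again for the Bernoulli family, where the maximizer in (36) is $\beta_\eta = \mathrm{logit}(\eta)$ and the conjugate has the closed form $\psi^*(\eta) = \eta \log \eta + (1-\eta)\log(1-\eta)$:

```python
import math

def psi(b): return math.log1p(math.exp(b))        # Bernoulli log-partition
def sigmoid(b): return 1.0 / (1.0 + math.exp(-b))  # mean map eta(beta)
def logit(e): return math.log(e / (1.0 - e))       # inverse map beta(eta)

def conj(eta):
    # psi*(eta) evaluated at the maximizer beta_eta = logit(eta), as in (36)
    b = logit(eta)
    return eta * b - psi(b)

def neg_entropy(eta):
    # Bernoulli negative entropy, the claimed closed form in (43)
    return eta * math.log(eta) + (1 - eta) * math.log(1 - eta)

def breg(f, grad, a, b):
    # Generic one-dimensional Bregman divergence D_f[a : b]
    return f(a) - f(b) - (a - b) * grad(b)

assert abs(conj(0.3) - neg_entropy(0.3)) < 1e-12   # (40)-(43)

bp, bq = 0.8, -1.1
d_psi = breg(psi, sigmoid, bp, bq)                    # D_psi[beta_p : beta_q]
d_dual = breg(conj, logit, sigmoid(bq), sigmoid(bp))  # D_psi*[eta_q : eta_p]
assert abs(d_psi - d_dual) < 1e-12                    # the reversal (39)
```

Here `grad` supplies $\nabla \psi = \sigma$ in the primal and $\nabla \psi^* = \mathrm{logit}$ in the dual, matching the bijection $\eta = \nabla_\beta \psi(\beta)$, $\beta = \nabla_\eta \psi^*(\eta)$.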
Utilizing $\psi^*(\eta_p) = \mathbb{E}_p \log p(\omega)$ from above, the dual divergence with $q$ becomes:

$$\begin{aligned}
D_{\psi^*}[\eta_p : \eta_q] &= \psi^*(\eta_p) - \psi^*(\eta_q) - \langle \eta_p - \eta_q, \nabla_\eta \psi^*(\eta_q) \rangle \\
&= \mathbb{E}_p \log p(\omega) - \psi^*(\eta_q) - \eta_p \cdot \beta_q + \eta_q \cdot \beta_q \\
&= \mathbb{E}_p \log p(\omega) - \eta_p \cdot \beta_q + \psi(\beta_q) \\
&= \mathbb{E}_p \log p(\omega) - \mathbb{E}_p[T(\omega) \cdot \beta_q] + \psi(\beta_q) \\
&= \mathbb{E}_p \log p(\omega) - \mathbb{E}_p \log q(\omega) \\
&= D_{\mathrm{KL}}[p(\omega) \,\|\, q(\omega)]
\end{aligned} \tag{44}$$

Thus, the conjugate function is the negative entropy and induces the KL divergence as its Bregman divergence (Wainwright & Jordan, 2008).

Note that, by ignoring the base distribution over $\omega$, we have instead assumed that $\pi_0(\omega) := u(\omega)$ is uniform over the domain. In the next section, we illustrate that the effect of adding a base distribution is to turn the conjugate function into a KL divergence, with the base $\pi_0(\omega)$ in the second argument. This is consistent with our derivation of negative entropy, since $D_{\mathrm{KL}}[p_\beta(\omega) \,\|\, u(\omega)] = -H_{p_\beta}(\Omega) + \text{const}$.

A.4. Conjugate $\psi^*$ as a KL Divergence

As noted above, the derivation of the conjugate $\psi^*(\eta)$ in (40)-(43) ignored the possibility of a base distribution in our exponential family. We see that $\psi^*(\eta)$ takes the form of a KL divergence when considering a base measure $\pi_0(\omega)$:

$$\begin{aligned}
\psi^*(\eta) &= \sup_\beta \, \beta \cdot \eta - \psi(\beta) & (45) \\
&= \beta_\eta \cdot \eta - \psi(\beta_\eta) \\
&= \mathbb{E}_{\pi_{\beta_\eta}}[\beta_\eta \cdot T(\omega)] - \psi(\beta_\eta) \\
&= \mathbb{E}_{\pi_{\beta_\eta}}[\beta_\eta \cdot T(\omega)] - \psi(\beta_\eta) \pm \mathbb{E}_{\pi_{\beta_\eta}}[\log \pi_0(\omega)] \\
&= \mathbb{E}_{\pi_{\beta_\eta}}[\log \pi_{\beta_\eta}(\omega) - \log \pi_0(\omega)] \\
&= D_{\mathrm{KL}}[\pi_{\beta_\eta}(\omega) \,\|\, \pi_0(\omega)] & (46)
\end{aligned}$$

Note that we have added and subtracted a factor of $\mathbb{E}_{\pi_{\beta_\eta}} \log \pi_0(\omega)$ in the fourth line, where our base measure is $\pi_0(\omega) = q(z|x)$ in the case of the TVO. Comparing with the derivations in (41)-(42), we need to include a term of $\mathbb{E}_p \log \pi_0(\omega)$ in moving to an expected log-probability $\mathbb{E}_p \log p(\omega)$, with the extra, subtracted base measure term transforming the negative entropy into a KL divergence.

In the TVO setting, this corresponds to

$$\psi^*(\eta) = D_{\mathrm{KL}}[\pi_{\beta_\eta}(z|x) \,\|\, q(z|x)] \,. \tag{47}$$

When including a base distribution, the induced Bregman divergence is still the KL divergence since, as in the derivation of (35), both $\mathbb{E}_p \log p(\omega)$ and $\mathbb{E}_p \log q(\omega)$ will contain terms involving the base distribution $\mathbb{E}_p \log \pi_0(\omega)$.
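On a finite support, (46)-(47) can be verified directly. The sketch below (ours; a discrete toy model stands in for $q_\phi(z|x)$ and the unnormalized $p_\theta(x,z)$) builds the geometric mixture path, reads off $\psi(\beta) = \log Z_\beta$ and the mean parameter $\eta$, and checks that the conjugate equals $D_{\mathrm{KL}}[\pi_\beta \,\|\, \pi_0]$:

```python
import numpy as np

rng = np.random.default_rng(0)
pi0 = rng.random(6); pi0 /= pi0.sum()  # base measure pi_0 (role of q(z|x))
p = rng.random(6)                       # unnormalized target (role of p(x, z))

beta = 0.7
w = pi0 ** (1 - beta) * p ** beta       # geometric mixture path
Z = w.sum()
pi_beta = w / Z                         # pi_beta(omega)
psi = np.log(Z)                         # psi(beta) = log Z_beta

T = np.log(p / pi0)                     # sufficient statistic T = log p / pi_0
eta = pi_beta @ T                       # mean parameter eta = E_{pi_beta}[T]
conj = beta * eta - psi                 # psi*(eta) via (36), with beta_eta = beta
kl = pi_beta @ np.log(pi_beta / pi0)    # KL[pi_beta || pi_0]
assert abs(conj - kl) < 1e-12           # the identity (46)
```

The check works because $\pi_\beta(\omega) = \pi_0(\omega)\exp(\beta T(\omega) - \psi(\beta))$, so the KL to the base measure collapses to $\beta\eta - \psi(\beta)$ exactly as in (46).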
B. Rényi Divergence Variational Inference

In this section, we show that each intermediate partition function $\log Z_\beta$ corresponds to a scaled version of the Rényi VI objective $\mathcal{L}_\alpha$ (Li & Turner, 2016).

To begin, we recall the definition of Rényi's $\alpha$-divergence:

$$D_\alpha[p \,\|\, q] = \frac{1}{\alpha - 1} \log \int q(\omega)^{1-\alpha} \, p(\omega)^\alpha \, d\omega$$

Note that this involves geometric mixtures similar to (14). Pulling out the factor of $\log p(x)$ to consider normalized distributions over $z \mid x$, we obtain the objective of Li & Turner (2016). This is similar to the ELBO, but instead subtracts a Rényi divergence of order $\alpha$:

$$\begin{aligned}
\psi(\beta) &= \log \int q_\phi(z|x)^{1-\beta} \, p_\theta(x, z)^\beta \, dz \\
&= \beta \log p(x) - (1 - \beta) \, D_\beta[p_\theta(z|x) \,\|\, q_\phi(z|x)] \\
&= \beta \log p(x) - \beta \, D_{1-\beta}[q_\phi(z|x) \,\|\, p_\theta(z|x)] \\
&:= \beta \, \mathcal{L}_{1-\beta}
\end{aligned}$$

where we have used the skew symmetry property $D_\alpha[p \,\|\, q] = \frac{\alpha}{1-\alpha} D_{1-\alpha}[q \,\|\, p]$ for $0 < \alpha < 1$ (Van Erven & Harremos, 2014). Note that $\mathcal{L}_0 = 0$ and $\mathcal{L}_1 = \log p_\theta(x)$, as in Li & Turner (2016) and Sec. 3.
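The scaling identity $\psi(\beta) = \beta \, \mathcal{L}_{1-\beta}$ and the skew symmetry property can both be checked on a discrete toy model (ours, not from the paper; arrays stand in for $q_\phi(z|x)$ and $p_\theta(x,z)$):

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.random(8); q /= q.sum()  # q_phi(z|x)
joint = rng.random(8)             # p_theta(x, z), unnormalized over z
px = joint.sum()                  # p_theta(x)
post = joint / px                 # p_theta(z|x)

def renyi(p, r, alpha):
    # Renyi alpha-divergence D_alpha[p || r] on a finite support
    return np.log(np.sum(r ** (1 - alpha) * p ** alpha)) / (alpha - 1)

beta = 0.3
psi = np.log(np.sum(q ** (1 - beta) * joint ** beta))  # psi(beta) = log Z_beta
L = np.log(px) - renyi(q, post, 1 - beta)              # L_{1-beta} of Li & Turner
assert abs(psi - beta * L) < 1e-12                     # psi(beta) = beta * L_{1-beta}

# skew symmetry: D_alpha[p || q] = alpha/(1-alpha) * D_{1-alpha}[q || p]
alpha = 0.4
assert abs(renyi(post, q, alpha)
           - alpha / (1 - alpha) * renyi(q, post, 1 - alpha)) < 1e-12
```

Both assertions reduce to the same log-sum of the geometric mixture, so the equalities hold up to floating-point error.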
C. TVO using Taylor Series Remainders

Recall that in Sec. 4, we viewed the KL divergence $D_\psi[\beta : \beta_0]$ as the remainder in a first-order Taylor approximation of $\psi(\beta)$ around $\beta_0$. The TVO objectives correspond to the linear term in this approximation, with the gap in the $\mathrm{TVO}_L(\theta, \phi, x)$ and $\mathrm{TVO}_U(\theta, \phi, x)$ bounds amounting to a sum of KL divergences or Taylor remainders. Thus, the TVO may be viewed as a first-order method.

Yet we may also ask what happens when considering other approximation orders. We proceed to show that thermodynamic integration arises from a zero-order approximation, while the symmetrized KL divergence corresponds to a similar application of the fundamental theorem of calculus in the mean parameter space $\eta_\beta = \nabla_\beta \psi(\beta)$. In App. E, we briefly describe how 'higher-order' TVO objectives might be constructed, although these will no longer be guaranteed to provide upper or lower bounds on the likelihood.

We will repeatedly utilize the integral form of the Taylor remainder theorem, which characterizes the error in a $k$-th order approximation of $\psi(x)$ around $a$, with $\beta \in [a, x]$.² This identity can be derived using the fundamental theorem of calculus and repeated integration by parts (see, e.g., Kountourogiannis & Loya (2003) and references therein):

$$R_k^\psi(x) = \int_a^x \frac{\nabla_\beta^{k+1} \psi(\beta)}{k!} \, (x - \beta)^k \, d\beta \tag{48}$$

² We use the generic variable $x$, not to be confused with the data $x$, for notational simplicity.

C.1. Thermodynamic Integration as 0th Order Remainder

Consider a zero-order Taylor approximation of $\psi(1)$ around $a = 0$, which simply uses $\psi(0)$ as an estimator. Applying the remainder theorem, we obtain the identity (6) underlying thermodynamic integration in the TVO:

$$\begin{aligned}
\psi(1) &= \psi(0) + R_0^\psi(1) & (49) \\
\psi(1) - \psi(0) &= \int_0^1 \nabla_\beta \psi(\beta) \, d\beta & (50) \\
\log p_\theta(x) &= \int_0^1 \mathbb{E}_{\pi_\beta} \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \, d\beta & (51)
\end{aligned}$$

where the last line follows from the definition of $\eta_\beta = \nabla_\beta \psi(\beta) = \nabla_\beta \log Z_\beta$ in (16). This yields an integral representation of the KL divergence:

$$D_{\mathrm{KL}}[\pi_{\beta_{k-1}} \,\|\, \pi_{\beta_k}] = \int_{\beta_{k-1}}^{\beta_k} (\beta_k - \beta) \, \mathrm{Var}_{\pi_\beta}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] d\beta \tag{53}$$

Following similar arguments in the reverse direction, we can obtain an integral form for the TVO upper bound gap $R_1^\psi(\beta_{k-1}) = D_{\mathrm{KL}}[\pi_{\beta_k} \,\|\, \pi_{\beta_{k-1}}]$ via the first-order approximation of $\psi(\beta_{k-1})$ around $a = \beta_k$:

$$\begin{aligned}
R_1^\psi(\beta_{k-1}) &= \psi(\beta_{k-1}) - \psi(\beta_k) + (\beta_k - \beta_{k-1}) \, \nabla_\beta \psi(\beta_k) \\
&= (\beta_k - \beta_{k-1}) \, \nabla_\beta \psi(\beta_k) - \big(\psi(\beta_k) - \psi(\beta_{k-1})\big) \\
&= \int_{\beta_k}^{\beta_{k-1}} \frac{\nabla_\beta^2 \psi(\beta)}{1!} \, (\beta_{k-1} - \beta) \, d\beta
\end{aligned} \tag{54}$$

Note that the TVO upper bound (10) arises from the second line, with $R_1^\psi(\beta_{k-1}) \geq 0$ and the term $(\beta_k - \beta_{k-1}) \, \nabla_\beta \psi(\beta_k)$.
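The thermodynamic integration identity (51) can be checked numerically on the same kind of discrete toy model (ours, not from the paper): integrating $\mathbb{E}_{\pi_\beta}[\log p_\theta(x,z)/q_\phi(z|x)]$ over $\beta \in [0, 1]$ with a fine trapezoid rule should recover $\log p_\theta(x)$ to quadrature accuracy.

```python
import numpy as np

rng = np.random.default_rng(2)
q = rng.random(10); q /= q.sum()  # q_phi(z|x), normalized, so psi(0) = 0
joint = rng.random(10)             # p_theta(x, z), unnormalized over z
log_px = np.log(joint.sum())       # psi(1) = log p_theta(x)

def grad_psi(beta):
    # E_{pi_beta}[log p_theta(x,z)/q_phi(z|x)] = grad_beta psi(beta), as in (16)
    w = q ** (1 - beta) * joint ** beta
    pi = w / w.sum()
    return pi @ np.log(joint / q)

betas = np.linspace(0.0, 1.0, 2001)
vals = np.array([grad_psi(b) for b in betas])
h = betas[1] - betas[0]
ti = h * (vals[1:] + vals[:-1]).sum() / 2.0  # trapezoid estimate of (51)
assert abs(ti - log_px) < 1e-5
```

The identity is exact; the small tolerance only absorbs the $O(h^2)$ trapezoid error. Coarser grids reproduce the TVO picture, where the left and right Riemann sums of the same integrand give the lower and upper bounds.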