On Representations of Divergence Measures and Related Quantities in Exponential Families

entropy Article On Representations of Divergence Measures and Related Quantities in Exponential Families Stefan Bedbur and Udo Kamps * Institute of Statistics, RWTH Aachen University, 52056 Aachen, Germany; [email protected] * Correspondence: [email protected] Abstract: Within exponential families, which may consist of multi-parameter and multivariate distributions, a variety of divergence measures, such as the Kullback–Leibler divergence, the Cressie– Read divergence, the Rényi divergence, and the Hellinger metric, can be explicitly expressed in terms of the respective cumulant function and mean value function. Moreover, the same applies to related entropy and affinity measures. We compile representations scattered in the literature and present a unified approach to the derivation in exponential families. As a statistical application, we highlight their use in the construction of confidence regions in a multi-sample setup. Keywords: exponential family; cumulant function; mean value function; divergence measure; distance measure; affinity MSC: 60E05; 62H12; 62F25 1. Introduction Citation: Bedbur, S.; Kamps, U. On There is a broad literature on divergence and distance measures for probability distri- Representations of Divergence butions, e.g., on the Kullback–Leibler divergence, the Cressie–Read divergence, the Rényi Measures and Related Quantities in divergence, and Phi divergences as a general family, as well as on associated measures of Exponential Families. Entropy 2021, entropy and affinity. For definitions and details, we refer to [1]. These measures have been 23, 726. https://doi.org/10.3390/ extensively used in statistical inference. Excellent monographs on this topic were provided e23060726 by Liese and Vajda [2], Vajda [3], Pardo [1], and Liese and Miescke [4]. Within an exponential family as defined in Section2, which may consist of multi- Academic Editor: Maria Longobardi parameter and multivariate distributions, several divergence measures and related quantities are seen to have nice explicit representations in terms of the respective cumulant Received: 12 May 2021 Accepted: 5 June 2021 function and mean value function. These representations are contained in different sources. Published: 8 June 2021 Our focus is on a unifying presentation of main quantities, while not aiming at an ex- haustive account. As an application, we derive confidence regions for the parameters of Publisher’s Note: MDPI stays neutral exponential distributions based on different divergences in a simple multi-sample setup. with regard to jurisdictional claims in For the use of the aforementioned measures of divergence, entropy, and affinity, we published maps and institutional affil- refer to the textbooks [1–4] and exemplarily to [5–10] for statistical applications, including iations. the construction of test procedures as well as methods based on dual representations of divergences, and to [11] for a classification problem. 2. Exponential Families Let Q 6= Æ be a parameter set, m be a s-finite measure on the measurable space (X , B), Copyright: © 2021 by the authors. P = f 2 g (X B) Licensee MDPI, Basel, Switzerland. and PJ : J Q be an exponential family (EF) of distributions on , with This article is an open access article m-density ( k ) distributed under the terms and fJ(x) = C(J) exp Zj(J) Tj(x) h(x) , x 2 X , (1) conditions of the Creative Commons ∑ j=1 Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ of PJ for J 2 Q, where C, Z1, ... , Zk : Q ! R are real-valued functions on Q and 1 1 4.0/). h, T1, ... , Tk : (X , B) ! (R , B ) are real-valued Borel-measurable functions with h ≥ 0. Entropy 2021, 23, 726. https://doi.org/10.3390/e23060726 https://www.mdpi.com/journal/entropy Entropy 2021, 23, 726 2 of 14 Usually, m is either the counting measure on the power set of X (for a family of dis- crete distributions) or the Lebesgue measure on the Borel sets of X (in the continuous case). Without loss of generality and for a simple notation, we assume that h > 0 (the set fx 2 X : h(x) = 0g is a null set for all P 2 P). Let n denote the s-finite measure with m-density h. We assume that representation (1) is minimal in the sense that the number k of sum- mands in the exponent cannot be reduced. This property is equivalent to Z1, ... , Zk being affinely independent mappings and T1, ... , Tk being n-affinely independent mappings; see, e.g., [12] (Cor. 8.1). Here, n-affine independence means affine independence on the complement of every null set of n. To obtain simple formulas for divergence measures in the following section, it is convenient to use the natural parameter space Z ∗ t X = z 2 Rk : ez T h dm < ¥ ∗ and the (minimal) canonical representation fPz : z 2 Z(Q)g of P with m-density ∗ ∗ ztT(x) fz (x) = C (z) e h(x) , x 2 X , (2) ∗ ∗ t ∗ of Pz and normalizing constant C (z) for z = (z1, ... , zk) 2 Z(Q) ⊂ X , where Z = t t (Z1, ... , Zk) denotes the (column) vector of the mappings Z1, ... , Zk and T = (T1, ... , Tk) denotes the (column) vector of the statistics T1, ... , Tk. For simplicity, we assume that P is regular, i.e., we have that Z(Q) = X∗ (P is full) and that X∗ is open; see [13]. In particular, this guarantees that T is minimal sufficient and complete for P; see, e.g., [14] (pp. 25–27). The cumulant function k(z) = − ln(C∗(z)) , z 2 X∗ , associated with P is strictly convex and infinitely often differentiable on the convex set X∗; see [13] (Theorem 1.13 and Theorem 2.2). It is well-known that the Hessian matrix of k at z ∗ coincides with the covariance matrix of T under Pz and that it is also equal to the Fisher information matrix I(z) at z. Moreover, by introducing the mean value function ∗ p(z) = Ez [T] , z 2 X , (3) we have the useful relation p = rk , (4) where rk denotes the gradient of k; see [13] (Cor. 2.3). p is a bijective mapping from X∗ to the interior of the convex support of nT , i.e., the closed convex hull of the support of nT ; see [13] (p. 2 and Theorem 3.6). Finally, note that representation (2) can be rewritten as ∗ ztT(x)−k(z) fz (x) = e h(x) , x 2 X , (5) for z 2 X∗. 3. Divergence Measures Divergence measures may be applied, for instance, to quantify the “disparity” of a distribution to some reference distribution or to measure the “distance” between two distributions within some family in a certain sense. If the distributions in the family are dominated by a s-finite measure, various divergence measures have been introduced by means of the corresponding densities. In parametric statistical inference, they serve to construct statistical tests or confidence regions for underlying parameters; see, e.g., [1]. Entropy 2021, 23, 726 3 of 14 Definition 1. Let F be a set of distributions on (X , B). A mapping D : F × F ! R is called a divergence (or divergence measure) if: (i) D(P, Q) ≥ 0 for all P, Q 2 F and D(P, Q) = 0 , P = Q (positive definite- ness). If additionally (ii) D(P, Q) = D(Q, P) for all P, Q 2 F (symmetry) is valid, D is called a distance (or distance measure or semi-metric). If D then moreover meets (iii) D(P1, P2) ≤ D(P1, Q) + D(Q, P2) for all P1, P2, Q 2 F (triangle inequality), D is said to be a metric. Some important examples are the Kullback–Leibler divergence (KL-divergence): Z f1 DKL(P1, P2) = f1 ln dm , f2 the Jeffrey distance: DJ(P1, P2) = DKL(P1, P2) + DKL(P2, P1) as a symmetrized version, the Rényi divergence: 1 Z q 1−q DR (P1, P2) = ln f f dm , q 2 R n f0, 1g , (6) q q(q − 1) 1 2 ( ) = ( ) along with the related Bhattacharyya distance DB P1, P2 DR1/2 P1, P2 /4, the Cressie– Read divergence (CR-divergence): Z " q−1 # 1 f1 DCRq (P1, P2) = f1 − 1 dm , q 2 R n f0, 1g , (7) q(q − 1) f2 which is the same as the Chernoff a-divergence up to a parameter transformation, the re- ( ) = ( ) lated Matusita distance DM P1, P2 DCR1/2 P1, P2 /2, and the Hellinger metric: 1/2 Z p p 2 DH(P1, P2) = f1 − f2 dm (8) for distributions P1, P2 2 F with m-densities f1, f2, provided that the integrals are well- defined and finite. 2 n f g DKL, DRq , and DCRq for q R 0, 1 are divergences, and DJ, DR1/2 , DB, DCR1/2 , 2 and DM (= DH), since they moreover satisfy symmetry, are distances on F × F. DH is known to be a metric on F × F. In parametric models, it is convenient to use the parameters as arguments and briefly write, e.g., ( ) ( ) 2 DKL J1, J2 for DKL PJ1 , PJ2 , J1, J2 Q , if the parameter J 2 Q is identifiable, i.e., if the mapping J 7! PJ is one-to-one on Q. This property is met for the EF P in Section2 with minimal canonical representation (5); see, e.g., [13] (Theorem 1.13(iv)). It is known from different sources in the literature that the EF structure admits simple formulas for the above divergence measures in terms of the corresponding cumulant function and/or mean value function. For the KL-divergence, we refer to [15] (Cor. 3.2) and [13] (pp. 174–178), and for the Jeffrey distance also to [16]. Entropy 2021, 23, 726 4 of 14 Theorem 1.

On Representations of Divergence Measures and Related Quantities in Exponential Families

IR-IITBHU at TREC 2016 Open Search Track: Retrieving Documents Using Divergence from Randomness Model in Terrier

Subnational Inequality Divergence

A Generalized Divergence for Statistical Inference

Divergence Measures for Statistical Data Processing—An Annotated Bibliography

Robust Hypothesis Testing with Α-Divergence Gokhan¨ Gul,¨ Student Member, IEEE, Abdelhak M

Divergence Tendencies in the European Integration Process: a Danger for the Sustainability of the E(M)U?

GDP PER CAPITA VERSUS MEDIAN HOUSEHOLD INCOME: WHAT GIVES RISE to DIVERGENCE OVER TIME? Brian Nolan, Max Roser, and Stefan

On the Use of Entropy to Improve Model Selection Criteria

Digitizing Omics Profiles by Divergence from a Baseline

Stagnating Median Incomes Despite Economic Growth: Explaining the Divergence in 27 OECD Countries

Lecture 7: Hypothesis Testing and KL Divergence 1 Introducing The

Improved Security Proofs in Lattice-Based Cryptography: Using the Rényi Divergence Rather Than the Statistical Distance