GAIT: A Geometric Approach to Information Theory
Jose Gallego, Ankit Vani, Max Schwarzer, Simon Lacoste-Julien†
Mila and DIRO, Université de Montréal
† Canada CIFAR AI Chair

Abstract

We advocate the use of a notion of entropy that reflects the relative abundances of the symbols in an alphabet, as well as the similarities between them. This concept was originally introduced in theoretical ecology to study the diversity of ecosystems. Based on this notion of entropy, we introduce geometry-aware counterparts for several concepts and theorems in information theory. Notably, our proposed divergence exhibits performance on par with state-of-the-art methods based on the Wasserstein distance, but enjoys a closed-form expression that can be computed efficiently. We demonstrate the versatility of our method via experiments on a broad range of domains: training generative models, computing image barycenters, approximating empirical measures and counting modes.

1 Introduction

Shannon's seminal theory of information (1948) has been of paramount importance in the development of modern machine learning techniques. However, standard information measures deal with probability distributions over an alphabet considered as a mere set of symbols and disregard additional geometric structure, which might be available in the form of a metric or similarity function. As a consequence of this, information theory concepts derived from the Shannon entropy (such as cross entropy and the Kullback-Leibler divergence) are usually blind to the geometric structure in the domains over which the distributions are defined. This blindness limits the applicability of these concepts. For example, the Kullback-Leibler divergence cannot be optimized for empirical measures with non-matching supports.

Optimal transport distances, such as Wasserstein, have emerged as practical alternatives with theoretical grounding. These methods have been used to compute barycenters (Cuturi and Doucet, 2014) and train generative models (Genevay et al., 2018). However, optimal transport is computationally expensive as it generally lacks closed-form solutions and requires the solution of linear programs or the execution of matrix scaling algorithms, even when solved only in approximate form (Cuturi, 2013). Approaches based on kernel methods (Gretton et al., 2012; Li et al., 2017; Salimans et al., 2018), which take a functional analytic view on the problem, have also been widely applied. However, further exploration of the interplay between kernel methods and information theory is lacking.

Contributions. We i) introduce to the machine learning community a similarity-sensitive definition of entropy developed by Leinster and Cobbold (2012). Based on this notion of entropy we ii) propose geometry-aware counterparts for several information theory concepts. We iii) present a novel notion of divergence which incorporates the geometry of the space when comparing probability distributions, as in optimal transport. However, while the former methods require the solution of an optimization problem or a relaxation thereof via matrix-scaling algorithms, our proposal enjoys a closed-form expression and can be computed efficiently. We refer to this collection of concepts as Geometry-Aware Information Theory: GAIT.

Paper structure. We introduce the theory behind the GAIT entropy and provide motivating examples justifying its use. We then introduce and characterize a divergence as well as a definition of mutual information derived from the GAIT entropy. Finally, we demonstrate applications of our methods, including training generative models, approximating measures and finding barycenters. We also show that the GAIT entropy can be used to estimate the number of modes of a probability distribution.
Notation. Calligraphic letters denote sets, bold letters represent matrices and vectors, and double-barred letters denote probability distributions and information-theoretic functionals. To emphasize certain computational aspects, we alternatively denote a distribution P over a finite space X as a vector of probabilities p. I, 1 and J denote the identity matrix, a vector of ones and a matrix of ones, with context-dependent dimensions. For vectors v, u and α ∈ R, v/u and v^α denote element-wise division and exponentiation. ⟨·, ·⟩ denotes the Frobenius inner product between two vectors or matrices. ∆_n ≜ {x ∈ R^n | ⟨1, x⟩ = 1 and x_i ≥ 0} denotes the probability simplex over n elements. δ_x denotes a Dirac distribution at point x. We adopt the conventions 0 · log(0) = 0 and x log(0) = −∞ for x > 0.

Reproducibility. Our experiments can be reproduced via: https://github.com/jgalle29/gait

[Figure 1: H_1^{K_r}[p_θ] interpolates towards the Shannon entropy as r → ∞. Figure 2: A 3-point space with two highly similar elements. Figure 3: H_1^K for distributions over the space in Fig. 2.]

2 Geometry-Aware Information Theory

Suppose that we are given a finite space X with n elements along with a symmetric function that measures the similarity between elements, κ : X × X → [0, 1]. Let K be the matrix induced by κ on X; i.e., K_{x,y} ≜ κ_{x,y} ≜ κ(x, y) = κ(y, x). K_{x,y} = 1 indicates that the elements x and y are identical, while K_{x,y} = 0 indicates full dissimilarity. We assume that κ(x, x) = 1 for all x ∈ X. We call (X, κ) a (finite) similarity space. For brevity, we denote (X, κ) by X whenever κ is clear from the context.

Of particular importance are the similarity spaces arising from metric spaces. Let (X, d) be a metric space and define κ(x, y) ≜ e^{−d(x,y)}. Here, the symmetry and range conditions imposed on κ are trivially satisfied. The triangle inequality in (X, d) induces a multiplicative transitivity on (X, κ): for all x, y, z ∈ X, κ(x, y) ≥ κ(x, z) κ(z, y). Moreover, for any metric space of the negative type, the matrix of its associated similarity space is positive definite (Reams, 1999, Lemma 2.5).

In this section, we present a theoretical framework which quantifies the "diversity" or "entropy" of a probability distribution defined on a similarity space, as well as a notion of divergence between such distributions.

2.1 Entropy and diversity

Let P be a probability distribution on X. P induces a similarity profile K_P : X → [0, 1], given by K_P(x) ≜ E_{y∼P}[κ(x, y)] = (Kp)_x.¹ K_P(x) represents the expected similarity between element x and a random element of the space sampled according to P. Intuitively, it assesses how "satisfied" we would be by selecting x as a one-point summary of the space. In other words, it measures the ordinariness of x, and thus 1/K_P(x) is the rarity or distinctiveness of x (Leinster and Cobbold, 2012). Note that the distinctiveness depends crucially on both the similarity structure of the space and the probability distribution at hand.

¹This denotes the x-th entry of the result of the matrix-vector multiplication Kp.

Much like the interpretation of Shannon's entropy as the expected surprise of observing a random element of the space, we can define a notion of diversity as expected distinctiveness: Σ_{x∈X} P(x) · 1/K_P(x). This arithmetic weighted average is a particular instance of the family of power (or Hölder) means. Given w ∈ ∆_n and x ∈ R^n_{≥0}, the weighted power mean of order β is defined as M_{w,β}(x) ≜ ⟨w, x^β⟩^{1/β}. Motivated by this averaging scheme, Leinster and Cobbold (2012) proposed the following definition:

Definition 1. (Leinster and Cobbold, 2012) (GAIT Entropy) The GAIT entropy of order α ≥ 0 of a distribution P on a finite similarity space (X, κ) is given by:

    H_α^K[P] ≜ log M_{p, 1−α}(1 / (Kp))                           (1)
             = (1 / (1 − α)) log Σ_{i=1}^n p_i / (Kp)_i^{1−α}.    (2)
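For concreteness, the following minimal NumPy sketch evaluates Eq. (2) on a toy metric space with κ(x, y) = e^{−d(x,y)}. It is an illustration of Definition 1 under our own naming and toy values, not the reference implementation from the repository linked above, and it covers only the case α ≠ 1.

```python
import numpy as np

def gait_entropy(p, K, alpha):
    """GAIT entropy of order alpha != 1, following Eq. (2):
    H_alpha^K[P] = 1/(1 - alpha) * log sum_i p_i / (Kp)_i^(1 - alpha)."""
    Kp = K @ p                    # similarity profile: K_P(x_i) = (Kp)_i
    mask = p > 0                  # zero-probability points contribute nothing
    total = np.sum(p[mask] / Kp[mask] ** (1.0 - alpha))
    return np.log(total) / (1.0 - alpha)

# Toy similarity space: three points on the real line, kappa(x, y) = exp(-|x - y|).
x = np.array([0.0, 0.1, 5.0])
K = np.exp(-np.abs(x[:, None] - x[None, :]))
p = np.array([0.4, 0.4, 0.2])

print(gait_entropy(p, K, alpha=2.0))           # geometry-aware entropy
print(gait_entropy(p, np.eye(3), alpha=2.0))   # K = I ignores the similarity structure
```

Since Eq. (2) only requires the matrix-vector product Kp followed by element-wise operations, its cost is quadratic in the number of points, in line with the closed-form efficiency discussed in the introduction.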
It is evident that whenever K = I, this definition reduces to the Rényi entropy (Rényi, 1961). Moreover, a continuous extension of Eq. (1) to α = 1 via a L'Hôpital argument reveals a similarity-sensitive version of Shannon's entropy:

    H_1^K[P] = −⟨p, log(Kp)⟩ = −E_{x∼P}[log K_P(x)].    (3)

Let us dissect this definition via two simple examples. First, consider a distribution p_θ = [θ, 1 − θ]^⊤ over the points {x, y} at distance r ≥ 0, and define the similarity κ_{x,y} ≜ e^{−r}. As the points get further apart, the Gram matrix K_r transitions from J to I. Fig. 1 displays the behavior of H_1^{K_r}[p_θ]. We observe that when r is large we recover the usual shape of the Shannon entropy for a Bernoulli variable. In contrast, for low values of r, the curve approaches a constant zero function. In this case, we regard both elements of the space as identical: no matter how we distribute the probability among them, we have low uncertainty about the qualities of random samples. Moreover, the exponential of the maximum entropy […]

[…] similarity space to the support of P.

• Identical elements: If two elements are identical (two equal rows in K), then combining them into one and adding their probabilities leaves the entropy unchanged.

• Continuity: H_α^K[P] is continuous in α ∈ [0, ∞] for fixed P, and continuous in P (w.r.t. the standard topology on ∆) for fixed α ∈ (0, ∞).

• α-Monotonicity: H_α^K[P] is non-increasing in α.

The role of α. […]
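As a complement to the example and properties above, the following self-contained sketch (our own, with illustrative values) adds the α = 1 branch of Eq. (3) to the earlier snippet, traces the two-point space of Fig. 1 for several separations r, and numerically checks the α-monotonicity and identical-elements properties; the helper name gait_entropy and all toy spaces are assumptions made for exposition.

```python
import numpy as np

def gait_entropy(p, K, alpha):
    """GAIT entropy H_alpha^K[P]: Eq. (3) for alpha = 1, Eq. (2) otherwise."""
    Kp = K @ p
    mask = p > 0
    if np.isclose(alpha, 1.0):
        return -np.sum(p[mask] * np.log(Kp[mask]))      # -<p, log(Kp)>
    return np.log(np.sum(p[mask] / Kp[mask] ** (1.0 - alpha))) / (1.0 - alpha)

# Two-point example of Fig. 1: p_theta = [theta, 1 - theta], kappa_xy = exp(-r).
p2 = np.array([0.5, 0.5])
for r in [0.01, 1.0, 10.0]:
    K_r = np.array([[1.0, np.exp(-r)], [np.exp(-r), 1.0]])
    print(r, gait_entropy(p2, K_r, alpha=1.0))  # ~0 for small r, ~log(2) for large r

# alpha-Monotonicity on a toy 3-point space: H_alpha is non-increasing in alpha.
x = np.array([0.0, 0.3, 4.0])
K = np.exp(-np.abs(x[:, None] - x[None, :]))
p = np.array([0.5, 0.3, 0.2])
values = [gait_entropy(p, K, a) for a in [0.0, 0.5, 1.0, 2.0, 5.0]]
assert all(v1 >= v2 - 1e-12 for v1, v2 in zip(values, values[1:]))

# Identical elements: merging two identical rows and adding their probabilities
# leaves the entropy unchanged.
K3 = np.array([[1.0, 1.0, 0.2],
               [1.0, 1.0, 0.2],
               [0.2, 0.2, 1.0]])   # the first two elements are identical
p3 = np.array([0.3, 0.4, 0.3])
K2 = np.array([[1.0, 0.2],
               [0.2, 1.0]])
p2m = np.array([0.7, 0.3])
assert np.isclose(gait_entropy(p3, K3, 1.0), gait_entropy(p2m, K2, 1.0))
```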