Information Divergences and the Curious Case of the Binary Alphabet
Jiantao Jiao∗, Thomas Courtade†, Albert No∗, Kartik Venkat∗, and Tsachy Weissman∗
∗Department of Electrical Engineering, Stanford University; †EECS Department, University of California, Berkeley
Email: {jiantao, albertno, kvenkat, [email protected]}, [email protected]
This work was supported in part by the NSF Center for Science of Information under grant agreement CCF-0939370.

Abstract—Four problems related to information divergence measures defined on finite alphabets are considered. In three of the cases we consider, we illustrate a contrast which arises between the binary-alphabet and larger-alphabet settings. This is surprising in some instances, since characterizations for the larger-alphabet settings do not generalize their binary-alphabet counterparts. For example, we show that f-divergences are not the unique decomposable divergences on binary alphabets that satisfy the data processing inequality, despite contrary claims in the literature.

I. INTRODUCTION

Divergence measures play a central role in information theory and other branches of mathematics. Many special classes of divergences, such as Bregman divergences, f-divergences, and Kullback-Leibler-type f-distance measures, enjoy various properties which make them particularly useful in problems related to learning, clustering, inference, optimization, and quantization, to name a few. In this paper, we investigate the relationships between these three classes of divergences, each of which will be defined formally in due course, and the subclasses of divergences which satisfy desirable properties such as monotonicity with respect to data processing. Roughly speaking, we address the following four questions:

QUESTION 1: If a decomposable divergence satisfies the data processing inequality, must it be an f-divergence?

QUESTION 2: Is Kullback-Leibler (KL) divergence the unique KL-type f-distance measure which satisfies the data processing inequality?

QUESTION 3: Is KL divergence the unique Bregman divergence which is invariant to statistically sufficient transformations of the data?

QUESTION 4: Is KL divergence the unique Bregman divergence which is also an f-divergence?

Of the above four questions, only QUESTION 4 has an affirmative answer (despite indications to the contrary having appeared in the literature; a point which we address further in Section III-A). However, this assertion is slightly deceptive. Indeed, if the alphabet size n is at least 3, then all four questions can be answered in the affirmative. Thus, counterexamples only arise in the binary setting, when n = 2.

This is perhaps unexpected. Intuitively, the data processing inequality is not a stringent requirement. In this sense, the answers to the above series of questions imply that the class of "interesting" divergence measures can be very small when n ≥ 3 (e.g., restricted to the class of f-divergences, or a multiple of KL divergence). However, in the binary alphabet setting, the class of "interesting" divergence measures is strikingly rich, a point which will be emphasized in our results. In many ways, this richness is surprising since the binary alphabet is usually viewed as the simplest setting one can consider. In particular, we would expect that the class of interesting divergence measures corresponding to binary alphabets would be less rich than the counterpart class for larger alphabets. However, as we will see, the opposite is true.

The observation of this dichotomy between binary and larger alphabets is not without precedent. For example, Fischer proved the following in his 1972 paper [?].

Theorem 1 Suppose n ≥ 3. The inequality
$$\sum_{k=1}^{n} p_k f(p_k) \le \sum_{k=1}^{n} p_k f(q_k) \qquad (1)$$
holds for all probability distributions $P = (p_1, p_2, \ldots, p_n)$ and $Q = (q_1, q_2, \ldots, q_n)$ if, and only if, $f$ is of the form
$$f(p) = c \log p + b \quad \text{for all } p \in (0,1), \qquad (2)$$
where $b$ and $c \le 0$ are constants.

As implied by his supposition that n ≥ 3 in Theorem 1, Fischer observed and appreciated the distinction between the binary and larger alphabet settings when considering so-called Shannon-type inequalities of the form (1). Indeed, in the same paper [?], Fischer gave the following result:

Theorem 2 The functions of the form
$$f(q) = \int \frac{G(q)}{q}\, dq, \qquad q \in (0,1), \qquad (3)$$
with $G$ arbitrary, nonpositive, and satisfying $G(1-q) = G(q)$ for $q \in (0,1)$, are the only absolutely continuous functions¹ on $(0,1)$ satisfying (1) when $n = 2$.

¹There also exist functions f on (0,1) which are not absolutely continuous and satisfy (1). See [?].

Only in the special case where G is taken to be constant in (3) do we find that f is of the form (2). We direct the interested reader to [?, Chap. 4] for a detailed discussion.
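For intuition, note that choosing $f(p) = c \log p + b$ with $c \le 0$ in (1) recovers Gibbs' inequality, $\sum_k p_k \log p_k \ge \sum_k p_k \log q_k$. The following minimal sketch (not from the paper; the alphabet size, distributions, and tolerance are arbitrary choices for illustration) numerically checks (1) for this choice of f:

```python
import numpy as np

def shannon_type_sides(P, Q, f):
    """Return both sides of inequality (1): sum_k p_k f(p_k) and sum_k p_k f(q_k)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sum(P * f(P)), np.sum(P * f(Q))

# f of the form (2) with c = -1, b = 0; the left side of (1) is then the Shannon
# entropy H(P) and the right side is the cross-entropy H(P, Q), so (1) is Gibbs' inequality.
f = lambda p: -np.log(p)

rng = np.random.default_rng(0)
for _ in range(1000):
    P = rng.dirichlet(np.ones(4))   # random distributions on an alphabet of size 4
    Q = rng.dirichlet(np.ones(4))
    lhs, rhs = shannon_type_sides(P, Q, f)
    assert lhs <= rhs + 1e-12
print("inequality (1) held in all random trials for f(p) = -log p")
```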
In part, the present paper was inspired and motivated by Theorems 1 and 2, and the distinction they draw between binary and larger alphabets. Indeed, the answers to questions QUESTION 1 – QUESTION 3 are in the same spirit as Fischer's results. For instance, our answer to question QUESTION 2 demonstrates that the functional inequality (1) and a data processing requirement are still not enough to demand f take the form (2) when n = 2. To wit, we prove an analog of Theorem 2 for this setting (see Section III-B).

This paper is organized as follows. In Section II, we recall several important classes of divergence measures and define what it means for a divergence measure to satisfy the data processing, sufficiency, or separability properties. In Section III, we investigate each of the questions posed above and state our main results. The Appendices contain all proofs.

II. PRELIMINARIES: DIVERGENCES, DATA PROCESSING, AND SUFFICIENCY PROPERTIES

Let $\bar{\mathbb{R}} \triangleq [-\infty, +\infty]$ denote the extended real line. Throughout this paper, we only consider finite alphabets. To this end, let $\mathcal{X} = \{1, 2, \ldots, n\}$ denote the alphabet, which is of size $n$, and let $\Gamma_n = \{(p_1, p_2, \ldots, p_n) : \sum_{i=1}^{n} p_i = 1,\ p_i \ge 0,\ i = 1, 2, \ldots, n\}$ be the set of probability measures on $\mathcal{X}$, with $\Gamma_n^+ = \{(p_1, p_2, \ldots, p_n) : \sum_{i=1}^{n} p_i = 1,\ p_i > 0,\ i = 1, 2, \ldots, n\}$ denoting its relative interior.

We refer to a nonnegative function $D : \Gamma_n \times \Gamma_n \to \bar{\mathbb{R}}$ simply as a divergence function (or, divergence measure). Of course, essentially all common divergences – including Bregman and f-divergences, which are defined shortly – fall into this general class. In this paper, we are primarily interested in divergence measures which satisfy either of two properties: the data processing property and the sufficiency property.

In the course of defining these properties, we will consider (possibly stochastic) transformations $P_{Y|X} : \mathcal{X} \mapsto \mathcal{Y}$, where $Y \in \mathcal{X}$. That is, $P_{Y|X}$ is a Markov kernel with source and target both equal to $\mathcal{X}$ (equipped with the discrete $\sigma$-algebra). If $X \sim P_X$, we will write $P_X \to P_{Y|X} \to P_Y$ to denote that $P_Y$ is the marginal distribution of $Y$ generated by passing $X$ through the channel $P_{Y|X}$. That is, $P_Y(\cdot) \triangleq \sum_{x \in \mathcal{X}} P_X(x) P_{Y|X}(\cdot \mid x)$.

Now, we are in a position to formally define the data processing property.

Definition 1 (Data Processing) A divergence function $D$ satisfies the data processing property if, for all $P_X, Q_X \in \Gamma_n$, we have
$$D(P_X, Q_X) \ge D(P_Y, Q_Y) \qquad (4)$$
for any transformation $P_{Y|X} : \mathcal{X} \mapsto \mathcal{Y}$, where $P_Y$ and $Q_Y$ are defined via $P_X \to P_{Y|X} \to P_Y$ and $Q_X \to P_{Y|X} \to Q_Y$, respectively.

A weaker version of the data processing inequality is the sufficiency property. A transformation $P_{Y|X} : \mathcal{X} \mapsto \mathcal{Y}$ is said to be a sufficient transformation of $X$ for $Z$ if $Y$ is a sufficient statistic of $X$ for $Z$. We remind the reader that $Y$ is a sufficient statistic of $X$ for $Z$ if the following two Markov chains hold:
$$Z - X - Y, \qquad Z - Y - X. \qquad (6)$$

Definition 2 (Sufficiency) A divergence function $D$ satisfies the sufficiency property if, for all $P_{X|1}, P_{X|2} \in \Gamma_n$ and $Z \in \{1, 2\}$, we have
$$D(P_{X|1}, P_{X|2}) \ge D(P_{Y|1}, P_{Y|2}) \qquad (7)$$
for any sufficient transformation $P_{Y|X} : \mathcal{X} \mapsto \mathcal{Y}$ of $X$ for $Z$, where $P_{Y|z}$ is defined by $P_{X|z} \to P_{Y|X} \to P_{Y|z}$ for $z \in \{1, 2\}$.

Clearly the sufficiency property is weaker than the data processing property because our attention is restricted to only those (possibly stochastic) transformations $P_{Y|X}$ for which $Y$ is a sufficient statistic of $X$ for $Z$. Given the definition of a sufficient statistic, we note that the inequality in (7) can be replaced with equality to yield an equivalent definition.

Henceforth, we will simply say that a divergence function $D(\cdot,\cdot)$ satisfies DATA PROCESSING when it satisfies the data processing property. Similarly, we say that a divergence function $D(\cdot,\cdot)$ satisfies SUFFICIENCY when it satisfies the sufficiency property.

Remark 1 In defining both DATA PROCESSING and SUFFICIENCY, we have required that $Y \in \mathcal{X}$. This is necessary since the divergence function $D(\cdot,\cdot)$ is only defined on $\Gamma_n \times \Gamma_n$.
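As a concrete illustration of Definition 1 (an example added here, not taken from the paper): KL divergence is the standard instance of a divergence satisfying DATA PROCESSING. The sketch below draws random distributions and a random Markov kernel on an alphabet of size 4 (all choices arbitrary) and numerically checks inequality (4):

```python
import numpy as np

def kl_divergence(P, Q):
    """KL divergence D(P || Q) in nats, assuming P and Q have full support."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.sum(P * np.log(P / Q)))

def push_through_channel(P, W):
    """P_X -> P_{Y|X} -> P_Y, where W[x, y] = P_{Y|X}(y | x) and each row of W sums to 1."""
    return np.asarray(P, float) @ W

rng = np.random.default_rng(1)
n = 4
for _ in range(1000):
    P = rng.dirichlet(np.ones(n))
    Q = rng.dirichlet(np.ones(n))
    W = rng.dirichlet(np.ones(n), size=n)        # random Markov kernel on X
    PY, QY = push_through_channel(P, W), push_through_channel(Q, W)
    # Inequality (4): the divergence cannot increase after passing through the channel W.
    assert kl_divergence(P, Q) >= kl_divergence(PY, QY) - 1e-10
print("data processing inequality (4) held for KL divergence in all random trials")
```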
We make one more definition following [?].

Definition 3 (Decomposability) A divergence function $D$ is said to be decomposable (or, separable) if there exists a bivariate function $\delta(u, v) : [0,1]^2 \to \bar{\mathbb{R}}$ such that
$$D(P, Q) = \sum_{i=1}^{n} \delta(p_i, q_i) \qquad (8)$$
for all $P = (p_1, \ldots, p_n)$ and $Q = (q_1, \ldots, q_n)$ in $\Gamma_n$.

Having defined divergences in general, we will now recall three important classes of divergences which will be of interest to us: Bregman divergences, f-divergences, and KL-type divergences.

A. Bregman Divergences

Let $G(P) : \Gamma_n \to \mathbb{R}$ be a convex function defined on $\Gamma_n$, differentiable on $\Gamma_n^+$. For two probability measures $P = (p_1, \ldots, p_n)$ and $Q = (q_1, \ldots, q_n)$ in $\Gamma_n$, the Bregman divergence generated by $G$ is defined by
$$D^G(P, Q) \triangleq G(P) - G(Q) - \langle \nabla G(Q),\, P - Q \rangle, \qquad (9)$$
where $\langle \nabla G(Q), P - Q \rangle$ denotes the standard inner product on $\mathbb{R}^n$.
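As a quick illustration of (9) (an example added here, not taken from the paper): generating (9) with the negative Shannon entropy $G(P) = \sum_i p_i \log p_i$ yields exactly the KL divergence, the classical fact underlying QUESTION 4. A minimal numerical sketch of this identity, with arbitrary test distributions:

```python
import numpy as np

def bregman_neg_entropy(P, Q):
    """Bregman divergence (9) generated by G(P) = sum_i p_i log p_i (negative entropy)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    G = lambda R: np.sum(R * np.log(R))
    grad_G_Q = np.log(Q) + 1.0                    # gradient of G evaluated at Q
    return float(G(P) - G(Q) - np.dot(grad_G_Q, P - Q))

def kl_divergence(P, Q):
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.sum(P * np.log(P / Q)))

rng = np.random.default_rng(2)
P, Q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
# The Bregman divergence generated by negative entropy coincides with KL divergence.
print(bregman_neg_entropy(P, Q), kl_divergence(P, Q))
assert np.isclose(bregman_neg_entropy(P, Q), kl_divergence(P, Q))
```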