
Conditional Entropy

Let $Y$ be a discrete random variable with outcomes $\{y_1,\ldots,y_m\}$, which occur with probabilities $p_Y(y_j)$. The average information you gain when told the outcome of $Y$ is:
\[
H_Y = -\sum_{j=1}^{m} p_Y(y_j)\log p_Y(y_j).
\]

Conditional Entropy (contd.)

Let $X$ be a discrete random variable with outcomes $\{x_1,\ldots,x_n\}$, which occur with probabilities $p_X(x_i)$. Consider the one-dimensional distribution

\[
p_{Y|X=x_i}(y_j) = p_{Y|X}(y_j\,|\,x_i),
\]
i.e., the distribution of $Y$ outcomes given that $X = x_i$. The average information you gain when told the outcome of $Y$ is:

\[
H_{Y|X=x_i} = -\sum_{j=1}^{m} p_{Y|X}(y_j\,|\,x_i)\log p_{Y|X}(y_j\,|\,x_i).
\]

Conditional Entropy (contd.)

The conditional entropy is the expected value of the entropy of $p_{Y|X=x_i}$:

\[
H_{Y|X} = \left\langle H_{Y|X=x_i} \right\rangle.
\]
It follows that:

\[
\begin{aligned}
H_{Y|X} &= \sum_{i=1}^{n} p_X(x_i)\, H_{Y|X=x_i} \\
&= \sum_{i=1}^{n} p_X(x_i)\left(-\sum_{j=1}^{m} p_{Y|X}(y_j\,|\,x_i)\log p_{Y|X}(y_j\,|\,x_i)\right) \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_X(x_i)\, p_{Y|X}(y_j\,|\,x_i)\log p_{Y|X}(y_j\,|\,x_i).
\end{aligned}
\]

Since $p_X(x_i)\,p_{Y|X}(y_j\,|\,x_i) = p_{XY}(x_i,y_j)$, the entropy, $H_{Y|X}$, of the conditional distribution, $p_{Y|X}$, is therefore:
\[
H_{Y|X} = -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i).
\]
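As a concrete illustration, $H_Y$, $H_{Y|X=x_i}$, and $H_{Y|X}$ can be computed directly from a joint probability table. The following is a minimal Python sketch; the $2\times 2$ joint table and the entropy helper are made up for illustration, and logarithms are taken base 2 so that entropies are in bits.

import numpy as np

# Made-up joint distribution p_XY(x_i, y_j): rows index x, columns index y.
p_XY = np.array([[0.30, 0.10],
                 [0.10, 0.50]])

p_X = p_XY.sum(axis=1)   # marginal p_X(x_i)
p_Y = p_XY.sum(axis=0)   # marginal p_Y(y_j)

def entropy(p):
    # H = -sum_k p_k log2 p_k, skipping zero-probability outcomes.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Entropy of the marginal, H_Y.
H_Y = entropy(p_Y)

# Conditional entropies H_{Y|X=x_i}: entropy of each renormalized row.
H_Y_given_xi = [entropy(p_XY[i] / p_X[i]) for i in range(len(p_X))]

# Conditional entropy H_{Y|X} = sum_i p_X(x_i) H_{Y|X=x_i}.
H_Y_given_X = sum(p_X[i] * H_Y_given_xi[i] for i in range(len(p_X)))

print(H_Y, H_Y_given_xi, H_Y_given_X)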

Figure 1: There is less information in the conditional, $H_{Y|X}$, than in the marginal, $H_Y$ (Theorem 1.2).

Theorem 1.2

There is less information in the conditional, $p_{Y|X}$, than in the marginal, $p_Y$:

\[
H_{Y|X} - H_Y \le 0.
\]

Proof:

\[
H_{Y|X} - H_Y = -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i) + \sum_{j=1}^{m} p_Y(y_j)\log p_Y(y_j).
\]

Theorem 1.2 (contd.)

\[
H_{Y|X} - H_Y = -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i) + \sum_{j=1}^{m}\left(\sum_{i=1}^{n} p_{XY}(x_i,y_j)\right)\log p_Y(y_j).
\]
\[
H_{Y|X} - H_Y = \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log\left(\frac{p_Y(y_j)}{p_{Y|X}(y_j\,|\,x_i)}\right).
\]

Theorem 1.2 (contd.)

Using the inequality $\log a \le (a-1)\log e$, it follows that:

\[
H_{Y|X} - H_Y \le \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\left(\frac{p_Y(y_j)}{p_{Y|X}(y_j\,|\,x_i)} - 1\right)\log e.
\]
\[
H_{Y|X} - H_Y \le \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\,\frac{p_Y(y_j)}{p_{Y|X}(y_j\,|\,x_i)}\log e - \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log e.
\]

Theorem 1.2 (contd.)

\[
H_{Y|X} - H_Y \le \sum_{i=1}^{n}\sum_{j=1}^{m} p_X(x_i)\,p_{Y|X}(y_j\,|\,x_i)\,\frac{p_Y(y_j)}{p_{Y|X}(y_j\,|\,x_i)}\log e - \log e.
\]
\[
\begin{aligned}
H_{Y|X} - H_Y &\le \sum_{i=1}^{n}\sum_{j=1}^{m} p_X(x_i)\,p_Y(y_j)\log e - \log e \\
&\le \left[\sum_{i=1}^{n} p_X(x_i)\left(\sum_{j=1}^{m} p_Y(y_j)\right)\right]\log e - \log e \\
&\le \log e - \log e \\
&\le 0.
\end{aligned}
\]
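Theorem 1.2 can also be spot-checked numerically by drawing random joint distributions and confirming that $H_{Y|X} \le H_Y$ in every trial; a minimal Python sketch (the grid size, trial count, and seed are arbitrary choices):

import numpy as np

def entropy(p):
    # H = -sum p log2 p over nonzero entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
for _ in range(1000):
    # Random joint distribution over a 3x4 grid of (x, y) outcomes.
    p_XY = rng.random((3, 4))
    p_XY /= p_XY.sum()
    p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)
    H_Y = entropy(p_Y)
    H_Y_given_X = sum(p_X[i] * entropy(p_XY[i] / p_X[i]) for i in range(3))
    assert H_Y_given_X <= H_Y + 1e-12   # Theorem 1.2: H_{Y|X} <= H_Y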

Figure 2: The information in the joint, $H_{XY}$, is the sum of the information in the conditional, $H_{Y|X}$, and the marginal, $H_X$ (Theorem 1.3).

Theorem 1.3

The information in the joint, $p_{XY}$, is the sum of the information in the conditional, $p_{Y|X}$, and the marginal, $p_X$:

\[
H_{XY} = H_{Y|X} + H_X.
\]

Proof:

\[
\begin{aligned}
H_{XY} &= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{XY}(x_i,y_j) \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log\left[p_X(x_i)\,p_{Y|X}(y_j\,|\,x_i)\right] \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\left[\log p_X(x_i) + \log p_{Y|X}(y_j\,|\,x_i)\right].
\end{aligned}
\]

Theorem 1.3 (contd.)

\[
\begin{aligned}
H_{XY} &= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_X(x_i) - \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i) \\
&= H_X + H_{Y|X}.
\end{aligned}
\]
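As a numerical check of Theorem 1.3, the joint, marginal, and conditional entropies of an illustrative joint table can be compared directly; a minimal Python sketch (the $2\times 3$ joint table is made up):

import numpy as np

def entropy(p):
    # H = -sum p log2 p over nonzero entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint distribution (rows index x_i, columns index y_j).
p_XY = np.array([[0.20, 0.05, 0.15],
                 [0.10, 0.30, 0.20]])
p_X = p_XY.sum(axis=1)

H_XY = entropy(p_XY.ravel())    # joint entropy H_{XY}
H_X = entropy(p_X)              # marginal entropy H_X
H_Y_given_X = sum(p_X[i] * entropy(p_XY[i] / p_X[i])   # conditional entropy H_{Y|X}
                  for i in range(len(p_X)))

assert np.isclose(H_XY, H_X + H_Y_given_X)   # Theorem 1.3: H_{XY} = H_X + H_{Y|X}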

Mutual Information

The mutual information, $I_{XY}$, between $X$ and $Y$ is defined to be:

\[
I_{XY} = H_Y - H_{Y|X} = I_{YX} = H_X - H_{X|Y}.
\]
The mutual information is a measure of the statistical dependence of two random variables; it is zero when $X$ and $Y$ are statistically independent.

Mutual Information (contd.)

\[
I_{XY} = H_Y - H_{Y|X} = -\sum_{j=1}^{m} p_Y(y_j)\log p_Y(y_j) + \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i).
\]
\[
I_{XY} = -\sum_{j=1}^{m}\left(\sum_{i=1}^{n} p_{XY}(x_i,y_j)\right)\log p_Y(y_j) + \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i).
\]

Mutual Information (contd.)

\[
\begin{aligned}
I_{XY} &= \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log\left(\frac{p_{Y|X}(y_j\,|\,x_i)}{p_Y(y_j)}\right) \\
&= \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log\left(\frac{p_{XY}(x_i,y_j)}{p_X(x_i)\,p_Y(y_j)}\right) \\
&= \sum_{j=1}^{m}\sum_{i=1}^{n} p_{YX}(y_j,x_i)\log\left(\frac{p_{YX}(y_j,x_i)}{p_Y(y_j)\,p_X(x_i)}\right) \\
&= I_{YX} = H_X - H_{X|Y}.
\end{aligned}
\]
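The equivalent expressions for $I_{XY}$ above ($H_Y - H_{Y|X}$, $H_X - H_{X|Y}$, and the double sum over the joint) can be checked against each other numerically; a minimal Python sketch with a made-up joint table (all entries positive so the logarithms are defined):

import numpy as np

def entropy(p):
    # H = -sum p log2 p over nonzero entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint distribution with all entries positive.
p_XY = np.array([[0.25, 0.05],
                 [0.15, 0.55]])
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

# I_XY = H_Y - H_{Y|X}: condition on X by renormalizing rows.
H_Y_given_X = sum(p_X[i] * entropy(p_XY[i] / p_X[i]) for i in range(len(p_X)))
I_a = entropy(p_Y) - H_Y_given_X

# I_XY = H_X - H_{X|Y}: condition on Y by renormalizing columns.
H_X_given_Y = sum(p_Y[j] * entropy(p_XY[:, j] / p_Y[j]) for j in range(len(p_Y)))
I_b = entropy(p_X) - H_X_given_Y

# Double-sum form: sum_ij p_XY log2( p_XY / (p_X p_Y) ).
I_c = np.sum(p_XY * np.log2(p_XY / np.outer(p_X, p_Y)))

assert np.isclose(I_a, I_b) and np.isclose(I_a, I_c)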

Four Cases

• $X$ and $Y$ are statistically independent:
\[
H_{XY} = H_X + H_Y
\]

Figure 3: $X$ and $Y$ are statistically independent ($H_X$ and $H_Y$ are disjoint).

Four Cases (contd.)

• $X$ is completely dependent on $Y$:

\[
H_{XY} = H_Y
\]

Figure 4: $X$ is a function of $Y$ ($I_{XY} = H_X$).

Four Cases (contd.)

• $Y$ is completely dependent on $X$:

\[
H_{XY} = H_X
\]

Figure 5: $Y$ is a function of $X$ ($I_{XY} = H_Y$).

Four Cases (contd.)

• $X$ and $Y$ are not independent but neither is completely dependent on the other:

\[
H_{XY} = H_X + H_Y - I_{XY}
\]

Figure 6: The general case; $H_X$ and $H_Y$ overlap in $I_{XY}$, leaving $H_{X|Y}$ and $H_{Y|X}$.
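The case of complete dependence can be verified numerically: with a made-up deterministic mapping $Y = f(X)$, the joint entropy collapses to $H_X$ and the mutual information to $H_Y$; a minimal Python sketch:

import numpy as np

def entropy(p):
    # H = -sum p log2 p over nonzero entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up example of complete dependence: Y = f(X), with x_1, x_2 -> y_1 and x_3 -> y_2.
# Rows index x, columns index y; each row has a single nonzero column.
p_XY = np.array([[0.2, 0.0],
                 [0.3, 0.0],
                 [0.0, 0.5]])
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

H_XY = entropy(p_XY.ravel())
H_X, H_Y = entropy(p_X), entropy(p_Y)
I_XY = H_X + H_Y - H_XY

assert np.isclose(H_XY, H_X)   # knowing X determines Y, so the joint adds nothing to H_X
assert np.isclose(I_XY, H_Y)   # H_{Y|X} = 0, so all of H_Y is shared information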

Informational Channel

Figure 7: Information channel. The input entropy, $H_X$, and the output entropy, $H_Y$, share the mutual information, $I_{XY}$; $H_{Y|X}$ is the noise and $H_{X|Y}$ is the loss.

Kullback-Leibler Distance

Let $p_X$ and $q_X$ be probability mass functions for two discrete random variables over the same set of events, $\{x_1,\ldots,x_N\}$. The Kullback-Leibler distance (or KL divergence) is defined as follows:
\[
KL(p_X\,\|\,q_X) = \sum_{i=1}^{N} p_X(x_i)\log\left(\frac{p_X(x_i)}{q_X(x_i)}\right).
\]
The KL divergence is a measure of how different two probability distributions are. Note that in general

\[
KL(p_X\,\|\,q_X) \neq KL(q_X\,\|\,p_X).
\]
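The asymmetry is easy to demonstrate numerically; a minimal Python sketch with two made-up distributions and a base-2 KL helper:

import numpy as np

def kl(p, q):
    # KL(p || q) = sum_i p_i log2(p_i / q_i); requires q_i > 0 wherever p_i > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Two made-up distributions over the same three events.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.1, 0.3, 0.6])

print(kl(p, q), kl(q, p))   # the two values differ: KL is not symmetric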

Kullback-Leibler Distance (contd.)

Let $p_{XY}$ be the joint p.m.f. for discrete random variables $X$ and $Y$ and let $p_X$ and $p_Y$ be the corresponding marginal distributions:
\[
\begin{aligned}
p_X(x_i) &= \sum_{j=1}^{M} p_{XY}(x_i,y_j) \\
p_Y(y_j) &= \sum_{i=1}^{N} p_{XY}(x_i,y_j).
\end{aligned}
\]
We observe that

\[
I_{XY} = KL(p_{XY}\,\|\,q_{XY}),
\]
where $q_{XY}(x_i,y_j) = p_X(x_i)\cdot p_Y(y_j)$. In other words, mutual information is the KL divergence between a joint distribution, $p_{XY}$, and the product of its marginal distributions, $p_X$ and $p_Y$.
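This identity can also be checked numerically: for an illustrative joint table, the KL divergence between $p_{XY}$ and the product of its marginals matches $I_{XY} = H_X + H_Y - H_{XY}$; a minimal Python sketch:

import numpy as np

def entropy(p):
    # H = -sum p log2 p over nonzero entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    # KL(p || q) = sum_i p_i log2(p_i / q_i); requires q_i > 0 wherever p_i > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Made-up joint distribution (rows index x_i, columns index y_j).
p_XY = np.array([[0.25, 0.05],
                 [0.15, 0.55]])
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

# q_XY(x_i, y_j) = p_X(x_i) p_Y(y_j): the joint the pair would have if independent.
q_XY = np.outer(p_X, p_Y)

I_from_kl = kl(p_XY.ravel(), q_XY.ravel())
I_from_entropies = entropy(p_X) + entropy(p_Y) - entropy(p_XY.ravel())

assert np.isclose(I_from_kl, I_from_entropies)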