
Conditional Entropy

Let $Y$ be a discrete random variable with outcomes $\{y_1,\ldots,y_m\}$, which occur with probabilities $p_Y(y_j)$. The average information you gain when told the outcome of $Y$ is:
\[
H_Y = -\sum_{j=1}^{m} p_Y(y_j)\log p_Y(y_j).
\]

Conditional Entropy (contd.)

Let $X$ be a discrete random variable with outcomes $\{x_1,\ldots,x_n\}$, which occur with probabilities $p_X(x_i)$. Consider the one-dimensional distribution

\[
p_{Y|X=x_i}(y_j) = p_{Y|X}(y_j\,|\,x_i),
\]
i.e., the distribution of $Y$ outcomes given that $X = x_i$. The average information you gain when told the outcome of $Y$ is:

\[
H_{Y|X=x_i} = -\sum_{j=1}^{m} p_{Y|X}(y_j\,|\,x_i)\log p_{Y|X}(y_j\,|\,x_i).
\]

Conditional Entropy (contd.)

The conditional entropy is the expected value of the entropy of $p_{Y|X=x_i}$:

\[
H_{Y|X} = \left\langle H_{Y|X=x_i} \right\rangle.
\]
It follows that:

\[
\begin{aligned}
H_{Y|X} &= \sum_{i=1}^{n} p_X(x_i)\, H_{Y|X=x_i} \\
&= \sum_{i=1}^{n} p_X(x_i)\left(-\sum_{j=1}^{m} p_{Y|X}(y_j\,|\,x_i)\log p_{Y|X}(y_j\,|\,x_i)\right) \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_X(x_i)\, p_{Y|X}(y_j\,|\,x_i)\log p_{Y|X}(y_j\,|\,x_i).
\end{aligned}
\]

Since $p_X(x_i)\,p_{Y|X}(y_j\,|\,x_i) = p_{XY}(x_i,y_j)$, the entropy, $H_{Y|X}$, of the conditional distribution, $p_{Y|X}$, is therefore:
\[
H_{Y|X} = -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i).
\]
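As a concrete illustration, $H_Y$, $H_{Y|X=x_i}$, and $H_{Y|X}$ can be computed directly from a joint probability table. The following is a minimal Python sketch; the $2\times 2$ joint table and the entropy helper are made up for illustration, and logarithms are taken base 2 so that entropies are in bits.

import numpy as np

# Made-up joint distribution p_XY(x_i, y_j): rows index x, columns index y.
p_XY = np.array([[0.30, 0.10],
                 [0.10, 0.50]])

p_X = p_XY.sum(axis=1)   # marginal p_X(x_i)
p_Y = p_XY.sum(axis=0)   # marginal p_Y(y_j)

def entropy(p):
    # H = -sum_k p_k log2 p_k, skipping zero-probability outcomes.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Entropy of the marginal, H_Y.
H_Y = entropy(p_Y)

# Conditional entropies H_{Y|X=x_i}: entropy of each renormalized row.
H_Y_given_xi = [entropy(p_XY[i] / p_X[i]) for i in range(len(p_X))]

# Conditional entropy H_{Y|X} = sum_i p_X(x_i) H_{Y|X=x_i}.
H_Y_given_X = sum(p_X[i] * H_Y_given_xi[i] for i in range(len(p_X)))

print(H_Y, H_Y_given_xi, H_Y_given_X)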

Figure 1: There is less information in the conditional, $H_{Y|X}$, than in the marginal, $H_Y$ (Theorem 1.2).

Theorem 1.2

There is less information in the conditional, $p_{Y|X}$, than in the marginal, $p_Y$:

\[
H_{Y|X} - H_Y \le 0.
\]

Proof:

\[
H_{Y|X} - H_Y = -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i) + \sum_{j=1}^{m} p_Y(y_j)\log p_Y(y_j).
\]

Theorem 1.2 (contd.)

\[
H_{Y|X} - H_Y = -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i) + \sum_{j=1}^{m}\left(\sum_{i=1}^{n} p_{XY}(x_i,y_j)\right)\log p_Y(y_j).
\]
\[
H_{Y|X} - H_Y = \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log\left(\frac{p_Y(y_j)}{p_{Y|X}(y_j\,|\,x_i)}\right).
\]

Theorem 1.2 (contd.)

Using the inequality $\log a \le (a-1)\log e$, it follows that:

\[
H_{Y|X} - H_Y \le \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\left(\frac{p_Y(y_j)}{p_{Y|X}(y_j\,|\,x_i)} - 1\right)\log e.
\]
\[
H_{Y|X} - H_Y \le \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\,\frac{p_Y(y_j)}{p_{Y|X}(y_j\,|\,x_i)}\log e - \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log e.
\]

Theorem 1.2 (contd.)

\[
H_{Y|X} - H_Y \le \sum_{i=1}^{n}\sum_{j=1}^{m} p_X(x_i)\,p_{Y|X}(y_j\,|\,x_i)\,\frac{p_Y(y_j)}{p_{Y|X}(y_j\,|\,x_i)}\log e - \log e.
\]
\[
\begin{aligned}
H_{Y|X} - H_Y &\le \sum_{i=1}^{n}\sum_{j=1}^{m} p_X(x_i)\,p_Y(y_j)\log e - \log e \\
&\le \left[\sum_{i=1}^{n} p_X(x_i)\left(\sum_{j=1}^{m} p_Y(y_j)\right)\right]\log e - \log e \\
&\le \log e - \log e \\
&\le 0.
\end{aligned}
\]
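Theorem 1.2 can also be spot-checked numerically by drawing random joint distributions and confirming that $H_{Y|X} \le H_Y$ in every trial; a minimal Python sketch (the grid size, trial count, and seed are arbitrary choices):

import numpy as np

def entropy(p):
    # H = -sum p log2 p over nonzero entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
for _ in range(1000):
    # Random joint distribution over a 3x4 grid of (x, y) outcomes.
    p_XY = rng.random((3, 4))
    p_XY /= p_XY.sum()
    p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)
    H_Y = entropy(p_Y)
    H_Y_given_X = sum(p_X[i] * entropy(p_XY[i] / p_X[i]) for i in range(3))
    assert H_Y_given_X <= H_Y + 1e-12   # Theorem 1.2: H_{Y|X} <= H_Y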

Figure 2: The information in the joint, $H_{XY}$, is the sum of the information in the conditional, $H_{Y|X}$, and the marginal, $H_X$ (Theorem 1.3).

Theorem 1.3

The information in the joint, $p_{XY}$, is the sum of the information in the conditional, $p_{Y|X}$, and the marginal, $p_X$:

\[
H_{XY} = H_{Y|X} + H_X.
\]

Proof:

\[
\begin{aligned}
H_{XY} &= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{XY}(x_i,y_j) \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log\left[p_X(x_i)\,p_{Y|X}(y_j\,|\,x_i)\right] \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\left[\log p_X(x_i) + \log p_{Y|X}(y_j\,|\,x_i)\right].
\end{aligned}
\]

Theorem 1.3 (contd.)

\[
\begin{aligned}
H_{XY} &= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_X(x_i) - \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i) \\
&= H_X + H_{Y|X}.
\end{aligned}
\]
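As a numerical check of Theorem 1.3, the joint, marginal, and conditional entropies of an illustrative joint table can be compared directly; a minimal Python sketch (the $2\times 3$ joint table is made up):

import numpy as np

def entropy(p):
    # H = -sum p log2 p over nonzero entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint distribution (rows index x_i, columns index y_j).
p_XY = np.array([[0.20, 0.05, 0.15],
                 [0.10, 0.30, 0.20]])
p_X = p_XY.sum(axis=1)

H_XY = entropy(p_XY.ravel())    # joint entropy H_{XY}
H_X = entropy(p_X)              # marginal entropy H_X
H_Y_given_X = sum(p_X[i] * entropy(p_XY[i] / p_X[i])   # conditional entropy H_{Y|X}
                  for i in range(len(p_X)))

assert np.isclose(H_XY, H_X + H_Y_given_X)   # Theorem 1.3: H_{XY} = H_X + H_{Y|X}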

Mutual Information

The mutual information, $I_{XY}$, between $X$ and $Y$ is defined to be:

\[
I_{XY} = H_Y - H_{Y|X} = I_{YX} = H_X - H_{X|Y}.
\]
The mutual information is a measure of the statistical dependence of two random variables; it is zero when $X$ and $Y$ are statistically independent.

Mutual Information (contd.)

\[
I_{XY} = H_Y - H_{Y|X} = -\sum_{j=1}^{m} p_Y(y_j)\log p_Y(y_j) + \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i).
\]
\[
I_{XY} = -\sum_{j=1}^{m}\left(\sum_{i=1}^{n} p_{XY}(x_i,y_j)\right)\log p_Y(y_j) + \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log p_{Y|X}(y_j\,|\,x_i).
\]

Mutual Information (contd.)

\[
\begin{aligned}
I_{XY} &= \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log\left(\frac{p_{Y|X}(y_j\,|\,x_i)}{p_Y(y_j)}\right) \\
&= \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i,y_j)\log\left(\frac{p_{XY}(x_i,y_j)}{p_X(x_i)\,p_Y(y_j)}\right) \\
&= \sum_{j=1}^{m}\sum_{i=1}^{n} p_{YX}(y_j,x_i)\log\left(\frac{p_{YX}(y_j,x_i)}{p_Y(y_j)\,p_X(x_i)}\right) \\
&= I_{YX} = H_X - H_{X|Y}.
\end{aligned}
\]
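The equivalent expressions for $I_{XY}$ above ($H_Y - H_{Y|X}$, $H_X - H_{X|Y}$, and the double sum over the joint) can be checked against each other numerically; a minimal Python sketch with a made-up joint table (all entries positive so the logarithms are defined):

import numpy as np

def entropy(p):
    # H = -sum p log2 p over nonzero entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint distribution with all entries positive.
p_XY = np.array([[0.25, 0.05],
                 [0.15, 0.55]])
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

# I_XY = H_Y - H_{Y|X}: condition on X by renormalizing rows.
H_Y_given_X = sum(p_X[i] * entropy(p_XY[i] / p_X[i]) for i in range(len(p_X)))
I_a = entropy(p_Y) - H_Y_given_X

# I_XY = H_X - H_{X|Y}: condition on Y by renormalizing columns.
H_X_given_Y = sum(p_Y[j] * entropy(p_XY[:, j] / p_Y[j]) for j in range(len(p_Y)))
I_b = entropy(p_X) - H_X_given_Y

# Double-sum form: sum_ij p_XY log2( p_XY / (p_X p_Y) ).
I_c = np.sum(p_XY * np.log2(p_XY / np.outer(p_X, p_Y)))

assert np.isclose(I_a, I_b) and np.isclose(I_a, I_c)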

Four Cases

• $X$ and $Y$ are statistically independent:
\[
H_{XY} = H_X + H_Y
\]

Figure 3: $X$ and $Y$ are statistically independent ($H_X$ and $H_Y$ are disjoint).

Four Cases (contd.)

• $X$ is completely dependent on $Y$:

\[
H_{XY} = H_Y
\]

Figure 4: $X$ is a function of $Y$ ($I_{XY} = H_X$).

Four Cases (contd.)

• $Y$ is completely dependent on $X$:

\[
H_{XY} = H_X
\]

Figure 5: $Y$ is a function of $X$ ($I_{XY} = H_Y$).

Four Cases (contd.)

• $X$ and $Y$ are not independent but neither is completely dependent on the other:

\[
H_{XY} = H_X + H_Y - I_{XY}
\]

Figure 6: The general case; $H_X$ and $H_Y$ overlap in $I_{XY}$, leaving $H_{X|Y}$ and $H_{Y|X}$.
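The case of complete dependence can be verified numerically: with a made-up deterministic mapping $Y = f(X)$, the joint entropy collapses to $H_X$ and the mutual information to $H_Y$; a minimal Python sketch:

import numpy as np

def entropy(p):
    # H = -sum p log2 p over nonzero entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up example of complete dependence: Y = f(X), with x_1, x_2 -> y_1 and x_3 -> y_2.
# Rows index x, columns index y; each row has a single nonzero column.
p_XY = np.array([[0.2, 0.0],
                 [0.3, 0.0],
                 [0.0, 0.5]])
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

H_XY = entropy(p_XY.ravel())
H_X, H_Y = entropy(p_X), entropy(p_Y)
I_XY = H_X + H_Y - H_XY

assert np.isclose(H_XY, H_X)   # knowing X determines Y, so the joint adds nothing to H_X
assert np.isclose(I_XY, H_Y)   # H_{Y|X} = 0, so all of H_Y is shared information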

Informational Channel

Figure 7: Information channel. The input entropy, $H_X$, and the output entropy, $H_Y$, share the mutual information, $I_{XY}$; $H_{Y|X}$ is the noise and $H_{X|Y}$ is the loss.

Kullback-Leibler Distance

Let $p_X$ and $q_X$ be probability mass functions for two discrete random variables over the same set of events, $\{x_1,\ldots,x_N\}$. The Kullback-Leibler distance (or KL divergence) is defined as follows:
\[
KL(p_X\,\|\,q_X) = \sum_{i=1}^{N} p_X(x_i)\log\left(\frac{p_X(x_i)}{q_X(x_i)}\right).
\]
The KL divergence is a measure of how different two probability distributions are. Note that in general

\[
KL(p_X\,\|\,q_X) \neq KL(q_X\,\|\,p_X).
\]
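The asymmetry is easy to demonstrate numerically; a minimal Python sketch with two made-up distributions and a base-2 KL helper:

import numpy as np

def kl(p, q):
    # KL(p || q) = sum_i p_i log2(p_i / q_i); requires q_i > 0 wherever p_i > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Two made-up distributions over the same three events.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.1, 0.3, 0.6])

print(kl(p, q), kl(q, p))   # the two values differ: KL is not symmetric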

Kullback-Leibler Distance (contd.)

Let $p_{XY}$ be the joint p.m.f. for discrete random variables $X$ and $Y$ and let $p_X$ and $p_Y$ be the corresponding marginal distributions:
\[
\begin{aligned}
p_X(x_i) &= \sum_{j=1}^{M} p_{XY}(x_i,y_j) \\
p_Y(y_j) &= \sum_{i=1}^{N} p_{XY}(x_i,y_j).
\end{aligned}
\]
We observe that

\[
I_{XY} = KL(p_{XY}\,\|\,q_{XY}),
\]
where $q_{XY}(x_i,y_j) = p_X(x_i)\cdot p_Y(y_j)$. In other words, mutual information is the KL divergence between a joint distribution, $p_{XY}$, and the product of its marginal distributions, $p_X$ and $p_Y$.
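This identity can also be checked numerically: for an illustrative joint table, the KL divergence between $p_{XY}$ and the product of its marginals matches $I_{XY} = H_X + H_Y - H_{XY}$; a minimal Python sketch:

import numpy as np

def entropy(p):
    # H = -sum p log2 p over nonzero entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    # KL(p || q) = sum_i p_i log2(p_i / q_i); requires q_i > 0 wherever p_i > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Made-up joint distribution (rows index x_i, columns index y_j).
p_XY = np.array([[0.25, 0.05],
                 [0.15, 0.55]])
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

# q_XY(x_i, y_j) = p_X(x_i) p_Y(y_j): the joint the pair would have if independent.
q_XY = np.outer(p_X, p_Y)

I_from_kl = kl(p_XY.ravel(), q_XY.ravel())
I_from_entropies = entropy(p_X) + entropy(p_Y) - entropy(p_XY.ravel())

assert np.isclose(I_from_kl, I_from_entropies)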