Conditional Entropy

Let Y be a discrete random variable with outcomes $\{y_1, \ldots, y_m\}$, which occur with probabilities $p_Y(y_j)$. The average information you gain when told the outcome of Y is:
\[
H_Y = -\sum_{j=1}^{m} p_Y(y_j) \log p_Y(y_j).
\]

Conditional Entropy (contd.)

Let X be a discrete random variable with outcomes $\{x_1, \ldots, x_n\}$, which occur with probabilities $p_X(x_i)$. Consider the 1D distribution
\[
p_{Y|X=x_i}(y_j) = p_{Y|X}(y_j \mid x_i),
\]
i.e., the distribution of Y outcomes given that $X = x_i$. The average information you gain when told the outcome of Y is:
\[
H_{Y|X=x_i} = -\sum_{j=1}^{m} p_{Y|X}(y_j \mid x_i) \log p_{Y|X}(y_j \mid x_i).
\]

Conditional Entropy (contd.)

The conditional entropy is the expected value (over X) of the entropy of $p_{Y|X=x_i}$:
\[
H_{Y|X} = \left\langle H_{Y|X=x_i} \right\rangle.
\]
It follows that:
\[
H_{Y|X} = \sum_{i=1}^{n} p_X(x_i) H_{Y|X=x_i}
= \sum_{i=1}^{n} p_X(x_i) \left( -\sum_{j=1}^{m} p_{Y|X}(y_j \mid x_i) \log p_{Y|X}(y_j \mid x_i) \right)
= -\sum_{i=1}^{n} \sum_{j=1}^{m} p_X(x_i)\, p_{Y|X}(y_j \mid x_i) \log p_{Y|X}(y_j \mid x_i).
\]
The entropy, $H_{Y|X}$, of the conditional distribution, $p_{Y|X}$, is therefore:
\[
H_{Y|X} = -\sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_{Y|X}(y_j \mid x_i).
\]

Figure 1: There is less information in the conditional, $H_{Y|X}$, than in the marginal, $H_Y$ (Theorem 1.2).

Theorem 1.2

There is less information in the conditional, $p_{Y|X}$, than in the marginal, $p_Y$:
\[
H_{Y|X} - H_Y \le 0.
\]
Proof:
\[
H_{Y|X} - H_Y = -\sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_{Y|X}(y_j \mid x_i) + \sum_{j=1}^{m} p_Y(y_j) \log p_Y(y_j).
\]

Theorem 1.2 (contd.)

\[
H_{Y|X} - H_Y = -\sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_{Y|X}(y_j \mid x_i) + \sum_{j=1}^{m} \left( \sum_{i=1}^{n} p_{XY}(x_i, y_j) \right) \log p_Y(y_j).
\]
\[
H_{Y|X} - H_Y = \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log \frac{p_Y(y_j)}{p_{Y|X}(y_j \mid x_i)}.
\]

Theorem 1.2 (contd.)

Using the inequality $\log a \le (a - 1) \log e$, it follows that:
\[
H_{Y|X} - H_Y \le \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \left( \frac{p_Y(y_j)}{p_{Y|X}(y_j \mid x_i)} - 1 \right) \log e.
\]
\[
H_{Y|X} - H_Y \le \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \frac{p_Y(y_j)}{p_{Y|X}(y_j \mid x_i)} \log e - \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log e.
\]

Theorem 1.2 (contd.)

\[
H_{Y|X} - H_Y \le \sum_{i=1}^{n} \sum_{j=1}^{m} p_X(x_i)\, p_{Y|X}(y_j \mid x_i) \frac{p_Y(y_j)}{p_{Y|X}(y_j \mid x_i)} \log e - \log e.
\]
\[
H_{Y|X} - H_Y \le \sum_{i=1}^{n} \sum_{j=1}^{m} p_X(x_i)\, p_Y(y_j) \log e - \log e
\le \left[ \sum_{i=1}^{n} p_X(x_i) \left( \sum_{j=1}^{m} p_Y(y_j) \right) \right] \log e - \log e
\le \log e - \log e
\le 0.
\]

Figure 2: The information in the joint, $H_{XY}$, is the sum of the information in the conditional, $H_{Y|X}$, and the marginal, $H_X$ (Theorem 1.3).

Theorem 1.3

The information in the joint, $p_{XY}$, is the sum of the information in the conditional, $p_{Y|X}$, and the marginal, $p_X$:
\[
H_{XY} = H_{Y|X} + H_X.
\]
Proof:
\[
H_{XY} = -\sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_{XY}(x_i, y_j)
= -\sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_X(x_i)\, p_{Y|X}(y_j \mid x_i)
= -\sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \left[ \log p_X(x_i) + \log p_{Y|X}(y_j \mid x_i) \right].
\]

Theorem 1.3 (contd.)

\[
H_{XY} = -\sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_X(x_i) - \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_{Y|X}(y_j \mid x_i)
= H_X + H_{Y|X}.
\]

Mutual Information

The mutual information, $I_{XY}$, between X and Y is defined to be:
\[
I_{XY} = H_Y - H_{Y|X} = I_{YX} = H_X - H_{X|Y}.
\]
The mutual information is a measure of the statistical independence of two random variables.

Mutual Information (contd.)

\[
I_{XY} = H_Y - H_{Y|X} = -\sum_{j=1}^{m} p_Y(y_j) \log p_Y(y_j) + \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_{Y|X}(y_j \mid x_i).
\]
\[
I_{XY} = -\sum_{j=1}^{m} \left( \sum_{i=1}^{n} p_{XY}(x_i, y_j) \right) \log p_Y(y_j) + \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_{Y|X}(y_j \mid x_i).
\]

Mutual Information (contd.)

\[
I_{XY} = \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log \frac{p_{Y|X}(y_j \mid x_i)}{p_Y(y_j)}
= \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log \frac{p_{XY}(x_i, y_j)}{p_X(x_i)\, p_Y(y_j)}
= \sum_{j=1}^{m} \sum_{i=1}^{n} p_{YX}(y_j, x_i) \log \frac{p_{YX}(y_j, x_i)}{p_Y(y_j)\, p_X(x_i)}
= I_{YX} = H_X - H_{X|Y}.
\]
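These identities are easy to check numerically. The sketch below is an added illustration, not part of the original notes: it builds a small, arbitrary joint p.m.f. $p_{XY}$, computes $H_X$, $H_Y$, $H_{XY}$, and the conditional entropies directly from the definitions above, and confirms Theorem 1.2, Theorem 1.3, and the symmetry $I_{XY} = I_{YX}$. It assumes NumPy and base-2 logarithms; the notes leave the base of the logarithm unspecified.

```python
import numpy as np

# Arbitrary example joint p.m.f. p_XY(x_i, y_j); rows index x_i, columns index y_j.
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
assert np.isclose(p_xy.sum(), 1.0)

def entropy(p):
    """H = -sum_k p_k log2 p_k, with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cond_entropy(p_joint, p_marg):
    """H_{Y|X} = -sum_ij p_XY(x_i, y_j) log2 p_{Y|X}(y_j | x_i),
    where p_{Y|X}(y_j | x_i) = p_XY(x_i, y_j) / p_X(x_i)."""
    h = 0.0
    for i in range(p_joint.shape[0]):
        for j in range(p_joint.shape[1]):
            p_ij = p_joint[i, j]
            if p_ij > 0:
                h -= p_ij * np.log2(p_ij / p_marg[i])
    return h

p_x = p_xy.sum(axis=1)                     # marginal p_X(x_i)
p_y = p_xy.sum(axis=0)                     # marginal p_Y(y_j)

H_x, H_y = entropy(p_x), entropy(p_y)
H_xy = entropy(p_xy.ravel())               # joint entropy H_XY
H_y_given_x = cond_entropy(p_xy, p_x)      # H_{Y|X}
H_x_given_y = cond_entropy(p_xy.T, p_y)    # H_{X|Y}

assert H_y_given_x <= H_y + 1e-12                         # Theorem 1.2
assert np.isclose(H_xy, H_x + H_y_given_x)                # Theorem 1.3
assert np.isclose(H_y - H_y_given_x, H_x - H_x_given_y)   # I_XY = I_YX

print(f"H_X = {H_x:.4f}, H_Y = {H_y:.4f}, H_XY = {H_xy:.4f}")
print(f"H_Y|X = {H_y_given_x:.4f}, H_X|Y = {H_x_given_y:.4f}, "
      f"I_XY = {H_y - H_y_given_x:.4f} bits")
```

Nothing here depends on the particular numbers in `p_xy`; any valid joint p.m.f. passes the same assertions, which is exactly what Theorems 1.2 and 1.3 claim.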
Four Cases

• X and Y are statistically independent:
\[
H_{XY} = H_X + H_Y.
\]

Figure 3: X and Y are statistically independent.

Four Cases (contd.)

• X is completely dependent on Y:
\[
H_{XY} = H_Y, \qquad I_{XY} = H_X.
\]

Figure 4: X is a function of Y.

Four Cases (contd.)

• Y is completely dependent on X:
\[
H_{XY} = H_X, \qquad I_{XY} = H_Y.
\]

Figure 5: Y is a function of X.

Four Cases (contd.)

• X and Y are not independent, but neither is completely dependent on the other:
\[
H_{XY} = H_X + H_Y - I_{XY}.
\]

Figure 6: The general case.

Informational Channel

Figure 7: Information channel: the input carries $H_X$ and the output carries $H_Y$; $I_{XY}$ is the information transmitted through the channel, $H_{X|Y}$ is the loss, and $H_{Y|X}$ is the noise.

Kullback-Leibler Distance

Let $p_X$ and $q_X$ be probability mass functions for two discrete r.v.'s over the same set of events, $\{x_1, \ldots, x_N\}$. The Kullback-Leibler distance (or KL divergence) is defined as follows:
\[
\mathrm{KL}(p_X \,\|\, q_X) = \sum_{i=1}^{N} p_X(x_i) \log \frac{p_X(x_i)}{q_X(x_i)}.
\]
The KL divergence is a measure of how different two probability distributions are. Note that, in general, $\mathrm{KL}(p_X \,\|\, q_X) \neq \mathrm{KL}(q_X \,\|\, p_X)$.

Kullback-Leibler Distance (contd.)

Let $p_{XY}$ be the joint p.m.f. for discrete r.v.'s X and Y, and let $p_X$ and $p_Y$ be the corresponding marginal distributions:
\[
p_X(x_i) = \sum_{j=1}^{M} p_{XY}(x_i, y_j), \qquad
p_Y(y_j) = \sum_{i=1}^{N} p_{XY}(x_i, y_j).
\]
We observe that
\[
I_{XY} = \mathrm{KL}(p_{XY} \,\|\, q_{XY}), \quad \text{where} \quad q_{XY}(x_i, y_j) = p_X(x_i) \cdot p_Y(y_j).
\]
In other words, mutual information is the KL divergence between a joint distribution, $p_{XY}$, and the product of its marginal distributions, $p_X$ and $p_Y$.
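As with the entropy identities, this last observation can be checked numerically. The sketch below is again an added illustration, not part of the original notes; it assumes NumPy, base-2 logarithms, and the same arbitrary example joint used earlier. It computes $I_{XY}$ once from the entropies and once as $\mathrm{KL}(p_{XY} \,\|\, p_X p_Y)$, and also shows that the KL divergence is not symmetric.

```python
import numpy as np

def entropy(p):
    """H = -sum_k p_k log2 p_k, with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    """KL(p || q) = sum_k p_k log2(p_k / q_k); assumes q_k > 0 wherever p_k > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Same arbitrary example joint p.m.f. as before (rows: x_i, columns: y_j).
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])

p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# q_XY(x_i, y_j) = p_X(x_i) * p_Y(y_j): the joint that independence would give.
q_xy = np.outer(p_x, p_y)

# I_XY from entropies (H_X + H_Y - H_XY, per the general case above) and from the KL form.
I_from_entropies = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
I_from_kl = kl(p_xy.ravel(), q_xy.ravel())
assert np.isclose(I_from_entropies, I_from_kl)

print(f"I_XY = {I_from_kl:.4f} bits")
print(f"KL(p_XY || q_XY) = {kl(p_xy.ravel(), q_xy.ravel()):.4f}, "
      f"KL(q_XY || p_XY) = {kl(q_xy.ravel(), p_xy.ravel()):.4f}")
```

If X and Y were independent, `p_xy` would equal `q_xy` and both KL values would be exactly zero, matching the first of the four cases above.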