Conditional Entropy

Let Y be a discrete random variable with outcomes {y_1, ..., y_m}, which occur with probabilities p_Y(y_j). The average information you gain when told the outcome of Y is:

$$
H_Y = -\sum_{j=1}^{m} p_Y(y_j)\log p_Y(y_j).
$$

Conditional Entropy (contd.)

Let X be a discrete random variable with outcomes {x_1, ..., x_n}, which occur with probabilities p_X(x_i). Consider the 1D distribution

$$
p_{Y|X=x_i}(y_j) = p_{Y|X}(y_j \mid x_i),
$$

i.e., the distribution of Y outcomes given that X = x_i. The average information you gain when told the outcome of Y is:

$$
H_{Y|X=x_i} = -\sum_{j=1}^{m} p_{Y|X}(y_j \mid x_i)\log p_{Y|X}(y_j \mid x_i).
$$

Conditional Entropy (contd.)

The conditional entropy is the expected value of the entropy of p_{Y|X=x_i}:

$$
H_{Y|X} = \left\langle H_{Y|X=x_i} \right\rangle.
$$

It follows that:

$$
\begin{aligned}
H_{Y|X} &= \sum_{i=1}^{n} p_X(x_i)\, H_{Y|X=x_i} \\
&= \sum_{i=1}^{n} p_X(x_i)\left(-\sum_{j=1}^{m} p_{Y|X}(y_j \mid x_i)\log p_{Y|X}(y_j \mid x_i)\right) \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_X(x_i)\, p_{Y|X}(y_j \mid x_i)\log p_{Y|X}(y_j \mid x_i).
\end{aligned}
$$

The entropy, H_{Y|X}, of the conditional distribution, p_{Y|X}, is therefore:

$$
H_{Y|X} = -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log p_{Y|X}(y_j \mid x_i).
$$

Figure 1: There is less information in the conditional than in the marginal (Theorem 1.2).

Theorem 1.2

There is less information in the conditional, p_{Y|X}, than in the marginal, p_Y:

$$
H_{Y|X} - H_Y \le 0.
$$

Proof:

$$
H_{Y|X} - H_Y = -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log p_{Y|X}(y_j \mid x_i) + \sum_{j=1}^{m} p_Y(y_j)\log p_Y(y_j).
$$

Theorem 1.2 (contd.)

Since p_Y(y_j) = \sum_i p_{XY}(x_i, y_j),

$$
H_{Y|X} - H_Y = -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log p_{Y|X}(y_j \mid x_i) + \sum_{j=1}^{m}\left(\sum_{i=1}^{n} p_{XY}(x_i, y_j)\right)\log p_Y(y_j),
$$

so that

$$
H_{Y|X} - H_Y = \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log\frac{p_Y(y_j)}{p_{Y|X}(y_j \mid x_i)}.
$$

Theorem 1.2 (contd.)

Using the inequality \log a \le (a-1)\log e, it follows that:

$$
H_{Y|X} - H_Y \le \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\left(\frac{p_Y(y_j)}{p_{Y|X}(y_j \mid x_i)} - 1\right)\log e
$$

$$
H_{Y|X} - H_Y \le \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\,\frac{p_Y(y_j)}{p_{Y|X}(y_j \mid x_i)}\log e - \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log e.
$$

Theorem 1.2 (contd.)

$$
H_{Y|X} - H_Y \le \sum_{i=1}^{n}\sum_{j=1}^{m} p_X(x_i)\, p_{Y|X}(y_j \mid x_i)\,\frac{p_Y(y_j)}{p_{Y|X}(y_j \mid x_i)}\log e - \log e
$$

$$
\begin{aligned}
H_{Y|X} - H_Y &\le \sum_{i=1}^{n}\sum_{j=1}^{m} p_X(x_i)\, p_Y(y_j)\log e - \log e \\
&\le \left[\sum_{i=1}^{n} p_X(x_i)\right]\left[\sum_{j=1}^{m} p_Y(y_j)\right]\log e - \log e \\
&\le \log e - \log e \\
&\le 0.
\end{aligned}
$$

Figure 2: The information in the joint is the sum of the information in the conditional and the marginal (Theorem 1.3).

Theorem 1.3

The information in the joint, p_{XY}, is the sum of the information in the conditional, p_{Y|X}, and the marginal, p_X:

$$
H_{XY} = H_{Y|X} + H_X.
$$

Proof:

$$
\begin{aligned}
H_{XY} &= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log p_{XY}(x_i, y_j) \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log\left[p_X(x_i)\, p_{Y|X}(y_j \mid x_i)\right] \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\left[\log p_X(x_i) + \log p_{Y|X}(y_j \mid x_i)\right].
\end{aligned}
$$

Theorem 1.3 (contd.)

$$
\begin{aligned}
H_{XY} &= -\sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log p_X(x_i) - \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log p_{Y|X}(y_j \mid x_i) \\
&= H_X + H_{Y|X}.
\end{aligned}
$$

Mutual Information

The mutual information, I_{XY}, between X and Y is defined to be:

$$
I_{XY} = H_Y - H_{Y|X} = I_{YX} = H_X - H_{X|Y}.
$$

The mutual information measures the statistical dependence between two random variables; it is zero when X and Y are independent.

Mutual Information (contd.)

$$
I_{XY} = H_Y - H_{Y|X} = -\sum_{j=1}^{m} p_Y(y_j)\log p_Y(y_j) + \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log p_{Y|X}(y_j \mid x_i)
$$

$$
I_{XY} = -\sum_{j=1}^{m}\left(\sum_{i=1}^{n} p_{XY}(x_i, y_j)\right)\log p_Y(y_j) + \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log p_{Y|X}(y_j \mid x_i).
$$

Mutual Information (contd.)

$$
\begin{aligned}
I_{XY} &= \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log\frac{p_{Y|X}(y_j \mid x_i)}{p_Y(y_j)} \\
&= \sum_{i=1}^{n}\sum_{j=1}^{m} p_{XY}(x_i, y_j)\log\frac{p_{XY}(x_i, y_j)}{p_X(x_i)\, p_Y(y_j)} \\
&= \sum_{j=1}^{m}\sum_{i=1}^{n} p_{YX}(y_j, x_i)\log\frac{p_{YX}(y_j, x_i)}{p_Y(y_j)\, p_X(x_i)} \\
&= I_{YX} = H_X - H_{X|Y}.
\end{aligned}
$$
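The identities above are easy to check numerically. The following is a minimal sketch (not part of the original notes) in Python/NumPy with base-2 logarithms, so all quantities are in bits; the joint table p_xy and the helper entropy() are hypothetical choices made only for illustration. It computes H_X, H_Y, H_XY, H_Y|X, H_X|Y, and I_XY from a small joint p.m.f. and verifies Theorem 1.2, Theorem 1.3, and the symmetry I_XY = I_YX.

```python
import numpy as np

# Hypothetical joint p.m.f. p_XY(x_i, y_j): rows index x, columns index y.
# Any strictly positive table summing to 1 would do; this one is arbitrary.
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
assert np.isclose(p_xy.sum(), 1.0)

def entropy(p):
    """H = -sum p log2 p, with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)          # marginal p_X(x_i)
p_y = p_xy.sum(axis=0)          # marginal p_Y(y_j)

H_x  = entropy(p_x)
H_y  = entropy(p_y)
H_xy = entropy(p_xy.ravel())

# Conditional entropies from the defining formulas:
# H_{Y|X} = -sum_ij p_XY(x_i,y_j) log p_{Y|X}(y_j|x_i), and symmetrically for H_{X|Y}.
H_y_given_x = -np.sum(p_xy * np.log2(p_xy / p_x[:, None]))
H_x_given_y = -np.sum(p_xy * np.log2(p_xy / p_y[None, :]))

I_xy = H_y - H_y_given_x        # mutual information
I_yx = H_x - H_x_given_y

print(f"H_X={H_x:.4f}  H_Y={H_y:.4f}  H_XY={H_xy:.4f}")
print(f"H_Y|X={H_y_given_x:.4f}  H_X|Y={H_x_given_y:.4f}  I_XY={I_xy:.4f}")

assert H_y_given_x <= H_y + 1e-12             # Theorem 1.2: H_{Y|X} <= H_Y
assert np.isclose(H_xy, H_y_given_x + H_x)    # Theorem 1.3: H_XY = H_{Y|X} + H_X
assert np.isclose(I_xy, I_yx)                 # symmetry of mutual information
```

The base of the logarithm only rescales every quantity by the same constant, so the checked identities are unaffected by the choice of bits versus nats.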
Four Cases

• X and Y are statistically independent:

$$
H_{XY} = H_X + H_Y.
$$

Figure 3: X and Y are statistically independent.

Four Cases (contd.)

• X is completely dependent on Y:

$$
H_{XY} = H_Y, \qquad I_{XY} = H_X.
$$

Figure 4: X is a function of Y.

Four Cases (contd.)

• Y is completely dependent on X:

$$
H_{XY} = H_X, \qquad I_{XY} = H_Y.
$$

Figure 5: Y is a function of X.

Four Cases (contd.)

• X and Y are not independent, but neither is completely dependent on the other:

$$
H_{XY} = H_X + H_Y - I_{XY}.
$$

Figure 6: The general case.

Informational Channel

Figure 7: Information channel: the input carries H_X and the output carries H_Y; the channel transmits I_{XY}, loses H_{X|Y}, and adds the noise H_{Y|X}.

Kullback-Leibler Distance

Let p_X and q_X be probability mass functions for two discrete random variables over the same set of events, {x_1, ..., x_N}. The Kullback-Leibler distance (or KL divergence) is defined as follows:

$$
KL(p_X \,\|\, q_X) = \sum_{i=1}^{N} p_X(x_i)\log\frac{p_X(x_i)}{q_X(x_i)}.
$$

The KL divergence is a measure of how different two probability distributions are. Note that in general KL(p_X || q_X) \ne KL(q_X || p_X).

Kullback-Leibler Distance (contd.)

Let p_{XY} be the joint p.m.f. for discrete random variables X and Y, and let p_X and p_Y be the corresponding marginal distributions:

$$
p_X(x_i) = \sum_{j=1}^{M} p_{XY}(x_i, y_j), \qquad
p_Y(y_j) = \sum_{i=1}^{N} p_{XY}(x_i, y_j).
$$

We observe that I_{XY} = KL(p_{XY} || q_{XY}), where q_{XY}(x_i, y_j) = p_X(x_i) · p_Y(y_j). In other words, mutual information is the KL divergence between a joint distribution, p_{XY}, and the product of its marginal distributions, p_X and p_Y.
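As a companion to the earlier sketch (again an illustrative assumption, not part of the original notes), the snippet below computes the KL divergence directly from its definition, confirms numerically that I_XY = KL(p_XY || p_X p_Y) for the same hypothetical joint table, and shows on two made-up distributions p and q that the divergence is not symmetric. The helper kl() and all the example tables are my own choices.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_i p_i log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Same hypothetical joint p.m.f. as in the previous sketch.
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# q_XY(x_i, y_j) = p_X(x_i) * p_Y(y_j): the product of the marginals.
q_xy = np.outer(p_x, p_y)

# Mutual information as the KL divergence from the joint to the product of marginals.
I_xy = kl(p_xy.ravel(), q_xy.ravel())
print(f"I_XY = KL(p_XY || p_X p_Y) = {I_xy:.4f} bits")

# KL is not symmetric in general: compare the two directions
# for two hypothetical distributions over the same three events.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
print(f"KL(p||q) = {kl(p, q):.4f}, KL(q||p) = {kl(q, p):.4f}")
```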
