4 Mutual Information and Channel Capacity

In Section 2, we have seen the use of a quantity called entropy to measure the amount of randomness in a random variable. In this section, we introduce several more information-theoretic quantities. These quantities are important in the study of Shannon's results.

4.1 Information-Theoretic Quantities

Definition 4.1. Recall that the entropy of a discrete random variable X is defined in Definition 2.41 to be

H(X) = -\sum_{x \in S_X} p_X(x) \log_2 p_X(x) = -E[\log_2 p_X(X)].     (16)

Similarly, the entropy of a discrete random variable Y is given by

H(Y) = -\sum_{y \in S_Y} p_Y(y) \log_2 p_Y(y) = -E[\log_2 p_Y(Y)].     (17)

In our context, X and Y are the input and the output of a discrete memoryless channel, respectively. In such a situation, we have introduced some new notations in Section 3.1:

S_X \equiv \mathcal{X},   p_X(x) \equiv p(x)   (the vector \mathbf{p}),   p_{Y|X}(y|x) \equiv Q(y|x)   (the \mathbf{Q} matrix),
S_Y \equiv \mathcal{Y},   p_Y(y) \equiv q(y)   (the vector \mathbf{q}),   p_{X,Y}(x,y) \equiv p(x,y)   (the \mathbf{P} matrix).

Under such notations, (16) and (17) become

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) = -E[\log_2 p(X)]     (18)

and

H(Y) = -\sum_{y \in \mathcal{Y}} q(y) \log_2 q(y) = -E[\log_2 q(Y)].     (19)

Definition 4.2. The joint entropy for two random variables X and Y is given by

H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log_2 p(x,y) = -E[\log_2 p(X,Y)].

Example 4.3. Random variables X and Y have the following joint pmf matrix P:

\mathbf{P} = \begin{bmatrix}
1/8  & 1/16 & 1/16 & 1/4 \\
1/16 & 1/8  & 1/16 & 0   \\
1/32 & 1/32 & 1/16 & 0   \\
1/32 & 1/32 & 1/16 & 0
\end{bmatrix}

Find H(X), H(Y), and H(X,Y).
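To make the calculation concrete, the three entropies of Example 4.3 can be evaluated numerically directly from (18), (19), and Definition 4.2. The following is a minimal sketch (assuming NumPy; the helper name and the layout rows = x, columns = y are my own choices):

```python
import numpy as np

# Joint pmf matrix P of Example 4.3 (rows indexed by x, columns by y).
P = np.array([[1/8,  1/16, 1/16, 1/4],
              [1/16, 1/8,  1/16, 0],
              [1/32, 1/32, 1/16, 0],
              [1/32, 1/32, 1/16, 0]])

def entropy(pmf):
    """Entropy in bits of a pmf (joint or marginal); zero entries are
    skipped, following the convention 0 log 0 = 0."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-(pmf * np.log2(pmf)).sum())

p = P.sum(axis=1)        # marginal pmf of X (row sums)
q = P.sum(axis=0)        # marginal pmf of Y (column sums)

print(entropy(p))        # H(X)   = 7/4 = 1.75 bits
print(entropy(q))        # H(Y)   = 2 bits (Y turns out to be uniform)
print(entropy(P))        # H(X,Y) = 27/8 = 3.375 bits
```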

Definition 4.4. The (conditional) entropy of Y when we know X = x is denoted by H(Y|X = x) or simply H(Y|x). It can be calculated from

H(Y|x) = -\sum_{y \in \mathcal{Y}} Q(y|x) \log_2 Q(y|x).

• Note that the above formula is what we should expect it to be. When we want to find the entropy of Y, we use (19): H(Y) = -\sum_{y \in \mathcal{Y}} q(y) \log_2 q(y). When we have an extra piece of information that X = x, we should update the probability about Y from the unconditional probability q(y) to the conditional probability Q(y|x).

• Note that when we consider Q(y|x) with the value of x fixed and the value of y varied, we simply get the whole x-row from the Q matrix. So, to find H(Y|x), we simply find the "usual" entropy from the probability values in the row corresponding to x in the Q matrix.

Example 4.5. Consider the following DMC (actually a BAC):

[Channel diagram: X \to Y with Q(0|0) = 1/2, Q(1|0) = 1/2, and Q(1|1) = 1; equivalently,

\mathbf{Q} = \begin{bmatrix} 1/2 & 1/2 \\ 0 & 1 \end{bmatrix}.]

Originally, P[Y = y] = q(y) = \begin{cases} 1/4, & y = 0, \\ 3/4, & y = 1, \\ 0, & \text{otherwise.} \end{cases}

(a) Suppose we know that X = 0. The "x = 0" row in the Q matrix gives
Q(y|0) = \begin{cases} 1/2, & y = 0, 1, \\ 0, & \text{otherwise;} \end{cases}
that is, given x = 0, the RV Y will be uniform.

(b) Suppose we know that X = 1. The "x = 1" row in the Q matrix gives
Q(y|1) = \begin{cases} 1, & y = 1, \\ 0, & \text{otherwise;} \end{cases}
that is, given x = 1, the RV Y is degenerated (deterministic).
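For Example 4.5, the two conditional entropies can be read off from the rows of Q. A small numerical sketch (assuming NumPy; the helper name is mine):

```python
import numpy as np

# Q matrix of the BAC in Example 4.5 (rows indexed by x, columns by y).
Q = np.array([[0.5, 0.5],
              [0.0, 1.0]])

def entropy(pmf):
    """Entropy in bits; zero entries contribute nothing (0 log 0 = 0)."""
    pmf = np.asarray(pmf, dtype=float)
    pmf = pmf[pmf > 0]
    return float(-(pmf * np.log2(pmf)).sum())

print(entropy(Q[0]))   # H(Y|x=0) = 1 bit  (given x = 0, Y is uniform)
print(entropy(Q[1]))   # H(Y|x=1) = 0 bits (given x = 1, Y is deterministic)
```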

Definition 4.6. The (average) conditional entropy of Y when we know X is denoted by H(Y|X). It can be calculated from

H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|x).

Example 4.7. In Example 4.5, find H(Y|X).

4.8. An alternative way to calculate H(Y|X) can be derived by first rewriting it as

H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|x) = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} Q(y|x) \log_2 Q(y|x)
       = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log_2 Q(y|x) = -E[\log_2 Q(Y|X)].

Note that Q(y|x) = \frac{p(x,y)}{p(x)}. Therefore,

H(Y|X) = -E[\log_2 Q(Y|X)] = -E\left[\log_2 \frac{p(X,Y)}{p(X)}\right]
       = (-E[\log_2 p(X,Y)]) - (-E[\log_2 p(X)]) = H(X,Y) - H(X).

Example 4.9. In Example 4.5, verify that H(Y|X) = H(X,Y) - H(X).

Example 4.10. Continue from Example 4.3. Random variables X and Y have the following joint pmf matrix P:

\mathbf{P} = \begin{bmatrix}
1/8  & 1/16 & 1/16 & 1/4 \\
1/16 & 1/8  & 1/16 & 0   \\
1/32 & 1/32 & 1/16 & 0   \\
1/32 & 1/32 & 1/16 & 0
\end{bmatrix}

Find H(Y|X) and H(X|Y).
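A short numerical sketch for Example 4.10 (assuming NumPy; helper names are mine). It computes H(Y|X) both from Definition 4.6 and from the identity in 4.8, and H(X|Y) from the corresponding identity H(X|Y) = H(X,Y) - H(Y):

```python
import numpy as np

# Joint pmf matrix P from Examples 4.3 and 4.10 (rows = x, columns = y).
P = np.array([[1/8,  1/16, 1/16, 1/4],
              [1/16, 1/8,  1/16, 0],
              [1/32, 1/32, 1/16, 0],
              [1/32, 1/32, 1/16, 0]])

def entropy(pmf):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-(pmf * np.log2(pmf)).sum())

p = P.sum(axis=1)   # marginal pmf of X
q = P.sum(axis=0)   # marginal pmf of Y

# Definition 4.6: H(Y|X) = sum_x p(x) H(Y|x), where the x-row of Q is P[x, :] / p(x).
H_Y_given_X = sum(px * entropy(row / px) for px, row in zip(p, P))

# Identity from 4.8: H(Y|X) = H(X,Y) - H(X); the two methods must agree.
assert abs(H_Y_given_X - (entropy(P) - entropy(p))) < 1e-12

print(H_Y_given_X)               # H(Y|X) = 13/8 = 1.625 bits
print(entropy(P) - entropy(q))   # H(X|Y) = H(X,Y) - H(Y) = 11/8 = 1.375 bits
```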

Definition 4.11. The mutual information^13 I(X;Y) between two random variables X and Y is defined as

I(X;Y) = H(X) - H(X|Y)     (20)
       = H(Y) - H(Y|X)     (21)
       = H(X) + H(Y) - H(X,Y)     (22)
       = E\left[\log_2 \frac{p(X,Y)}{p(X) q(Y)}\right] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log_2 \frac{p(x,y)}{p(x) q(y)}     (23)
       = E\left[\log_2 \frac{P_{X|Y}(X|Y)}{p(X)}\right] = E\left[\log_2 \frac{Q(Y|X)}{q(Y)}\right].     (24)

• Mutual information quantifies the reduction in the uncertainty of one random variable due to the knowledge of another.

• Mutual information is a measure of the amount of information one random variable contains about another [5, p 13].

• It is also natural to think of I(X;Y) as a measure of how far X and Y are from being independent.
  ◦ Technically, it is the (Kullback-Leibler) divergence between the joint and product-of-marginal distributions.

4.12. Some important properties

(a) H(X,Y) = H(Y,X) and I(X;Y) = I(Y;X). However, in general, H(X|Y) ≠ H(Y|X).

(b) I and H are always ≥ 0.

(c) There is a one-to-one correspondence between Shannon's information measures and set theory. We may use an information diagram, which

^13 The name mutual information and the notation I(X;Y) were introduced by [Fano, 1961, Ch 2].

is a variation of a Venn diagram, to represent the relationships between Shannon's information measures. This is similar to the use of the Venn diagram to represent the relationships between probability measures. These diagrams are shown in Figure 7.

Figure 7: Venn diagram and its use to represent the relationships between information measures and the relationships between probabilities. [The three panels show a Venn diagram with regions A\B, A ∩ B, B\A (union A ∪ B); an information diagram with regions H(X|Y), I(X;Y), H(Y|X) inside H(X) and H(Y) (union H(X,Y)); and a probability diagram with P(A\B), P(A ∩ B), P(B\A) (union P(A ∪ B)).]

• Chain rule for information measures:

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).

(d) I(X; Y ) ≥ 0 with equality if and only if X and Y are independent.

• When this property is applied to the information diagram (or to definitions (20), (21), and (22) of I(X;Y)), we have
  (i) H(X|Y) ≤ H(X),
  (ii) H(Y|X) ≤ H(Y),
  (iii) H(X,Y) ≤ H(X) + H(Y).
  Moreover, each of the inequalities above becomes an equality if and only if X and Y are independent.

(e) We have seen in Section 2.4 that

0 ≤ H(X) ≤ \log_2 |\mathcal{X}|,     (25)

where the lower bound is attained when X is deterministic (degenerated) and the upper bound is attained when X is uniform.

Similarly,

0 ≤ H(Y) ≤ \log_2 |\mathcal{Y}|,     (26)

where the lower bound is attained when Y is deterministic (degenerated) and the upper bound is attained when Y is uniform.

For conditional entropy, we have

0 ≤ H(Y|X) ≤ H(Y),     (27)

where the lower bound is attained when Y = g(X) for some function g, and the upper bound is attained when X and Y are independent. Similarly,

0 ≤ H(X|Y) ≤ H(X),     (28)

where the lower bound is attained when X = g(Y) for some function g, and the upper bound is attained when X and Y are independent.

For mutual information, we have

0 ≤ I(X;Y) ≤ H(X),     (29)

where the lower bound is attained when X and Y are independent, and the upper bound is attained when X = g(Y) for some function g; and

0 ≤ I(X;Y) ≤ H(Y),     (30)

where the lower bound is attained when X and Y are independent, and the upper bound is attained when Y = g(X) for some function g.

Combining (25), (26), (29), and (30), we have

0 ≤ I(X;Y) ≤ min\{H(X), H(Y)\} ≤ min\{\log_2 |\mathcal{X}|, \log_2 |\mathcal{Y}|\}.     (31)

(f) H(X|X) = 0 and I(X;X) = H(X).

Example 4.13. Find the mutual information I(X;Y) between the two random variables X and Y whose joint pmf matrix is given by

\mathbf{P} = \begin{bmatrix} 1/2 & 1/4 \\ 1/4 & 0 \end{bmatrix}.
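A quick numerical check of Example 4.13 using formula (22) (assuming NumPy; the helper name is mine):

```python
import numpy as np

# Joint pmf matrix of Example 4.13.
P = np.array([[1/2, 1/4],
              [1/4, 0]])

def entropy(pmf):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-(pmf * np.log2(pmf)).sum())

p = P.sum(axis=1)                          # marginal pmf of X
q = P.sum(axis=0)                          # marginal pmf of Y
I = entropy(p) + entropy(q) - entropy(P)   # (22): I(X;Y) = H(X) + H(Y) - H(X,Y)
print(I)                                   # approximately 0.123 bits
```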

Example 4.14. Find the mutual information I(X;Y) between the two random variables X and Y whose

\mathbf{p} = \begin{bmatrix} 1/4 & 3/4 \end{bmatrix} \quad \text{and} \quad \mathbf{Q} = \begin{bmatrix} 1/4 & 3/4 \\ 3/4 & 1/4 \end{bmatrix}.
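Example 4.14 starts from p and Q rather than the joint pmf, which is exactly the I(p, Q) form needed later for capacity. A sketch of the calculation via (21) (assuming NumPy; helper names are mine):

```python
import numpy as np

# Input pmf p and transition matrix Q of Example 4.14.
p = np.array([1/4, 3/4])
Q = np.array([[1/4, 3/4],
              [3/4, 1/4]])

def entropy(pmf):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    pmf = np.asarray(pmf, dtype=float)
    pmf = pmf[pmf > 0]
    return float(-(pmf * np.log2(pmf)).sum())

q = p @ Q                                            # output pmf q = pQ
H_Y_given_X = sum(px * entropy(row) for px, row in zip(p, Q))
I = entropy(q) - H_Y_given_X                         # (21): I(X;Y) = H(Y) - H(Y|X)
print(I)                                             # approximately 0.143 bits
```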

4.2 Operational Channel Capacity

4.15. In Section 3, we have studied how to compute the error probability P(E) for digital communication systems over DMC. At the end of that section, we studied block encoding where the channel is used n times to transmit a k-bit info-block.

In this section, our consideration is "reversed": rather than computing the error probability of a given scheme, we ask at what rates the error probability can be made as small as we wish.

4.16. In this and the next subsections, we introduce a quantity called channel capacity, which is crucial in benchmarking communication systems. Recall that, in Section 2 where source coding was discussed, we were interested in the minimum rate (in bits per source symbol) needed to represent a source. Here, we are interested in the maximum rate (in bits per channel use) that can be sent through a given channel reliably.

4.17. Here, reliable communication means arbitrarily small error probability can be achieved.

• This seems to be an impossible goal.
  ◦ If the channel introduces errors, how can one correct them all?
    ∗ Any correction process is also subject to error, ad infinitum.

Definition 4.18. Given a DMC, its "operational" channel capacity is the maximum rate at which reliable communication over the channel is possible.

• The channel capacity is the maximum rate in bits per channel use at which information can be sent with arbitrarily low error probability.

4.19. Claude Shannon showed, in his 1948 landmark paper, that this operational channel capacity is the same as the information channel capacity, which we will discuss in the next subsection. From this, we can omit the

words "operational" and "information" and simply refer to both quantities as the channel capacity.

Example 4.20. In Example 4.34, we will find that the capacity of a BSC with crossover probability p = 0.1 is approximately 0.531 bits per channel use. This means that for any rate R < 0.531 and any error probability P(E) that we desire, as long as it is greater than 0, we can find a suitable n, a rate-R encoder, and a corresponding decoder which will yield an error probability that is at least as low as our set value.

• Usually, for a very low desired value of P(E), we may need a large value of n.

Example 4.21. Repetition coding is not good enough.

[Figure 8: Performance of repetition coding with majority voting at the decoder. (a) Error probability (log scale) versus the BSC's crossover probability p. (b) Achievable rate [bpcu] versus the required error probability when p = 0.1.]

• From Figure 8b, it is clear that repetition coding cannot reliably achieve such a transmission rate. In fact, when we require the error probability to be less than 0.1, the required repetition code needs n ≥ 3. (For simplicity, let's assume only odd values of n can be used here.) However, this means the rate is ≤ 1/3 ≈ 0.33, which is already significantly less than 0.531.

• In fact, for any rate > 0, we can see from Figure 8b that a communication system based on repetition coding is not "reliable" according

to Definition 4.17. For example, for rate = 0.02 bits per channel use, repetition coding cannot satisfy the requirement that the error probability must be less than 10^{-15}. In fact, Figure 8b shows that as we reduce the error probability to 0, the rate goes to 0 as well. Therefore, there is no positive rate that works for all error probabilities.

• However, because the channel capacity is 0.531 [bpcu], there must exist other encoding techniques which give better error probability than repetition coding.
  ◦ Although Shannon's result gives us the channel capacity, it does not give us any explicit instruction on how to construct codes which can achieve that value.
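The trade-off shown in Figure 8b can be reproduced with a few lines of code. The sketch below (my own, assuming the majority-voting analysis from Section 3) evaluates the error probability of the length-n repetition code over a BSC with p = 0.1, together with its rate 1/n:

```python
from math import comb

p = 0.1          # BSC crossover probability (as in Figure 8b)

def majority_vote_error(n, p):
    """P(E) of an n-fold repetition code with majority voting: the decoder
    fails when more than half of the n transmitted copies are flipped."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

for n in range(1, 31, 2):   # odd block lengths only, as in the notes
    print(f"n = {n:2d}   rate = {1/n:.3f} bpcu   P(E) = {majority_vote_error(n, p):.3e}")

# Driving P(E) toward 0 forces n up and hence the rate 1/n toward 0,
# which is exactly the behavior shown in Figure 8b.
```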

4.3 Information Channel Capacity

4.22. In Section 4.1, we have studied how to compute the value of the mutual information I(X;Y) between two random variables X and Y. Recall that, here, X and Y are the channel input and output, respectively. We have also seen, in Example 4.13, how to compute I(X;Y) when the joint pmf matrix P is given. Furthermore, we have also worked on Example 4.14, in which the value of the mutual information is computed from the prior probability vector p and the channel transition probability matrix Q. This second type of calculation is crucial in the computation of channel capacity. This kind of calculation is so important that we may write the mutual information I(X;Y) as I(p, Q).

Definition 4.23. Given a DMC, we define its "information" channel capacity as

C = \max_{\mathbf{p}} I(X;Y) = \max_{\mathbf{p}} I(\mathbf{p}, \mathbf{Q}),     (32)

where the maximum is taken over all possible input pmfs p.

• Again, as mentioned in 4.19, Shannon showed that the "information" channel capacity defined here is equal to the "operational" channel capacity defined in Definition 4.18.
  ◦ Thus, we may drop the word "information" in most discussions of channel capacity.

Example 4.24. The capacity of a BAC whose Q(1|0) = 0.9 and Q(0|1) = 0.4 can be found by first realizing that I(X;Y) here is a function of a single variable: p_0. The plot of I(X;Y) as a function of p_0 gives some rough estimates of the answers. One can directly solve for the optimal p_0 by simply taking the derivative of I(X;Y) with respect to p_0 and setting it equal to 0. This gives the capacity value of 0.0918 bpcu, which is achieved by p = [0.5376, 0.4624].

[Figure 9: Maximization of mutual information to find the capacity of a BAC with Q = [0.1, 0.9; 0.4, 0.6]. The left panel shows the channel with input pmf p(0) = p_0, p(1) = 1 - p_0; the right panel plots I(X;Y) against p_0. The capacity of 0.0918 bits is achieved by p ≈ [0.5376, 0.4624].]

[Figure 10: Maximization of mutual information to find the capacity of a BSC with Q = [0.6, 0.4; 0.4, 0.6]. The left panel shows the channel with input pmf p(0) = p_0, p(1) = 1 - p_0; the right panel plots I(X;Y) against p_0. The capacity of 0.029 bits is achieved by p = [0.5, 0.5].]
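The curve in Figure 9 can be regenerated numerically. The sketch below (my own helper names, assuming NumPy) evaluates I(X;Y) = H(Y) - H(Y|X) for the BAC of Example 4.24 over a fine grid of p_0 and picks the maximizer; the notes report the peak as approximately 0.0918 bpcu at p_0 ≈ 0.5376:

```python
import numpy as np

# Transition matrix of the BAC in Example 4.24: Q(1|0) = 0.9, Q(0|1) = 0.4.
Q = np.array([[0.1, 0.9],
              [0.4, 0.6]])

def entropy(pmf):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    pmf = np.asarray(pmf, dtype=float)
    pmf = pmf[pmf > 0]
    return float(-(pmf * np.log2(pmf)).sum())

def mutual_information(p0):
    """I(X;Y) as a function of p0 = P[X = 0], via I = H(Y) - H(Y|X)."""
    p = np.array([p0, 1 - p0])
    q = p @ Q                                             # output pmf
    return entropy(q) - sum(px * entropy(row) for px, row in zip(p, Q))

p0_grid = np.linspace(0.0, 1.0, 10_001)                   # crude 1-D search
values = np.array([mutual_information(p0) for p0 in p0_grid])
best = values.argmax()
print(p0_grid[best], values[best])   # should be close to (0.5376, 0.0918)
```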

4.25. Blahut-Arimoto Algorithm [5, Section 10.8]: Alternatively, in 1972, Arimoto [1] and Blahut [2] independently developed an iterative algorithm to help us approximate the pmf p* which achieves capacity C. To do this, start with any (guess) input pmf p_0(x), and define a sequence of pmfs p_r(x), r = 0, 1, ..., according to the following iterative prescription:

(a) q_r(y) = \sum_{x \in \mathcal{X}} p_r(x) Q(y|x) for all y \in \mathcal{Y}.

(b) c_r(x) = 2^{\sum_{y \in \mathcal{Y}} Q(y|x) \log_2 \frac{Q(y|x)}{q_r(y)}} for all x \in \mathcal{X}.

(c) It can be shown that

\log_2\left(\sum_{x} p_r(x) c_r(x)\right) \le C \le \log_2\left(\max_{x} c_r(x)\right).

• If the lower bound and the upper bound above are close enough, we take p_r(x) as our answer, and the corresponding capacity is simply the average of the two bounds.

• Otherwise, we compute the pmf

p_{r+1}(x) = \frac{p_r(x) c_r(x)}{\sum_{x} p_r(x) c_r(x)} for all x \in \mathcal{X},

and repeat the steps above with index r replaced by r + 1.
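The prescription above translates almost line by line into code. The following is a minimal sketch (my own function name and NumPy-based implementation, not code from the notes) for a DMC with transition matrix Q whose rows are indexed by the input and columns by the output:

```python
import numpy as np

def blahut_arimoto(Q, tol=1e-9, max_iter=100_000):
    """Approximate the capacity (in bits per channel use) of a DMC with
    transition matrix Q; returns (capacity estimate, optimizing input pmf)."""
    Q = np.asarray(Q, dtype=float)
    p = np.full(Q.shape[0], 1.0 / Q.shape[0])          # initial guess: uniform p_0
    for _ in range(max_iter):
        q = p @ Q                                      # step (a): q_r(y)
        # step (b): c_r(x) = 2 ** sum_y Q(y|x) log2( Q(y|x) / q_r(y) )
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(Q > 0, Q * np.log2(Q / q), 0.0)
        c = 2.0 ** terms.sum(axis=1)
        low, high = np.log2(p @ c), np.log2(c.max())   # step (c): bounds on C
        if high - low < tol:
            break
        p = p * c / (p @ c)                            # update to p_{r+1}(x)
    return (low + high) / 2, p

# Example usage: a BSC with crossover probability 0.1; the estimate should be
# close to the value 0.531 bpcu quoted in Examples 4.20 and 4.34.
C, p_star = blahut_arimoto(np.array([[0.9, 0.1],
                                     [0.1, 0.9]]))
print(C, p_star)
```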

4.4 Special Cases for Calculation of Channel Capacity

In this section, we study special cases of DMCs whose capacity values can be found (relatively) easily.

Example 4.26. Find the channel capacity of a noiseless binary channel (a BSC whose crossover probability is p = 0).

Example 4.27. Noisy Channel with Nonoverlapping Outputs: Find the channel capacity of a DMC whose

\mathbf{Q} = \begin{bmatrix} 1/8 & 7/8 & 0 & 0 \\ 0 & 0 & 1/3 & 2/3 \end{bmatrix}.

In this example, the channel appears to be noisy, but really is not. Even though the output of the channel is a random consequence of the input, the input can be determined from the output, and hence every transmitted bit can be recovered without error.

4.28. Reminder:

(a) Some definitions involving entropy

(i) Binary entropy function: h(p) = -p \log_2 p - (1-p) \log_2(1-p)

(ii) H(X) = -\sum_{x} p(x) \log_2 p(x)

(iii) H(\mathbf{p}) = -\sum_{i} p_i \log_2 p_i

(b) A key entropy property that will be used frequently in this section is that, for any random variable X, H(X) ≤ \log_2 |\mathcal{X}| with equality iff X is uniform.

4.29. A DMC is a noisy channel with nonoverlapping outputs when there is only one non-zero element in each column of its Q matrix. For such a channel,

C = \log_2 |\mathcal{X}|,

which is achieved by a uniform p(x).

Definition 4.30. A DMC is called symmetric if (1) all the rows of its probability transition matrix Q are permutations of each other and (2) so are the columns.

Example 4.31. For each of the following Q, is the corresponding DMC symmetric?

\begin{bmatrix} 0.3 & 0.2 & 0.5 \\ 0.5 & 0.3 & 0.2 \\ 0.2 & 0.5 & 0.3 \end{bmatrix}, \quad
\begin{bmatrix} 0.3 & 0.2 & 0.5 \\ 0.5 & 0.2 & 0.3 \\ 0.2 & 0.5 & 0.3 \end{bmatrix}, \quad
\begin{bmatrix} 1/3 & 1/6 & 1/2 \\ 1/3 & 1/2 & 1/6 \end{bmatrix}, \quad
\begin{bmatrix} 0.1 & 0.9 \\ 0.4 & 0.6 \end{bmatrix}

4.32. Q: Does a symmetric DMC always have a square Q matrix?

Example 4.33. Find the channel capacity of a DMC whose

\mathbf{Q} = \begin{bmatrix} 0.3 & 0.2 & 0.5 \\ 0.5 & 0.3 & 0.2 \\ 0.2 & 0.5 & 0.3 \end{bmatrix}.

Solution: First, recall that the capacity C of a given DMC can be found by (32): C = \max_{\mathbf{p}} I(X;Y) = \max_{\mathbf{p}} I(\mathbf{p}, \mathbf{Q}).

Here, because all the rows of Q are permutations of one another, H(Y|x) = H(r) is the same for every x, where r is any row of Q; hence H(Y|X) = \sum_x p(x) H(Y|x) = H(r) does not depend on p, and I(X;Y) = H(Y) - H(r). We see that, to maximize I(X;Y), we need to maximize H(Y). Of course, we know that the maximum value of H(Y) is \log_2 |\mathcal{Y}|, which happens when Y is uniform. Therefore, if we can find p which makes Y uniform, then this same p will give the channel capacity.

Remark: If we can’t find p that makes Y uniform, then C < log2 |Y| − H(r) and we have to find a different technique to calculate C. Example 4.34. Find the channel capacity of a BSC whose crossover prob- ability is p = 0.1.
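A numerical cross-check of Examples 4.33 and 4.34 is easy using the expression C = \log_2 |\mathcal{Y}| - H(r) (r being any row of Q) that the argument above leads to, and which is stated for weakly symmetric channels in 4.36 below. The sketch assumes NumPy; the helper name is mine:

```python
import numpy as np

def entropy(pmf):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    pmf = np.asarray(pmf, dtype=float)
    pmf = pmf[pmf > 0]
    return float(-(pmf * np.log2(pmf)).sum())

# Example 4.33: the 3x3 symmetric channel above, so C = log2|Y| - H(r)
# with r any row of Q.
r = np.array([0.3, 0.2, 0.5])
print(np.log2(3) - entropy(r))              # roughly 0.0995 bpcu

# Example 4.34: the BSC is also symmetric, so with crossover p = 0.1,
# C = 1 - h(0.1), which is the 0.531 bpcu quoted in Example 4.20.
print(1 - entropy(np.array([0.1, 0.9])))    # roughly 0.531 bpcu
```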

Definition 4.35. A DMC is called weakly symmetric if (1) all the rows of its probability transition matrix Q are permutations of each other and (2) all the column sums are equal. • It should be clear from the definition that a symmetric channel is au- tomatically weakly symmetric.

4.36. For a weakly symmetric channel,

C = \log_2 |\mathcal{Y}| - H(\mathbf{r}),

where r is any row from the Q matrix. The capacity is achieved by a uniform pmf on the channel input.

• Important special case: for a BSC, C = 1 - h(p), where p is the crossover probability.

4.37. Properties of channel capacity

(a) C ≥ 0

(b) C ≤ min\{\log_2 |\mathcal{X}|, \log_2 |\mathcal{Y}|\}

Example 4.38. Find the channel capacity of a DMC whose

\mathbf{Q} = \begin{bmatrix} 0.9 & 0.05 & 0.05 \\ 0.05 & 0.9 & 0.05 \\ 0.025 & 0.025 & 0.95 \end{bmatrix}.

Suppose four choices are provided: (a) 1.0944, (b) 1.5944, (c) 2.0944, (d) 2.5944. (A numerical check via the Blahut-Arimoto iteration from 4.25 is sketched after Example 4.39.)

Example 4.39. Another case where capacity can be easily calculated: Find the channel capacity of a DMC of which all the rows of its Q matrix are the same.
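Returning to Example 4.38: this Q is neither symmetric nor weakly symmetric (its column sums differ), but property 4.37(b) already caps C at \log_2 3 ≈ 1.585 bits, which rules out choices (b)-(d). A self-contained numerical estimate, reusing the Blahut-Arimoto iteration sketched in 4.25 (my own code, not from the notes):

```python
import numpy as np

# Q matrix of Example 4.38 (rows = inputs, columns = outputs).
Q = np.array([[0.9,   0.05,  0.05],
              [0.05,  0.9,   0.05],
              [0.025, 0.025, 0.95]])

p = np.full(3, 1/3)                    # start from a uniform input pmf
for _ in range(5000):                  # Blahut-Arimoto iterations (see 4.25)
    q = p @ Q
    c = 2.0 ** (Q * np.log2(Q / q)).sum(axis=1)   # all entries of Q are > 0 here
    p = p * c / (p @ c)

# Average of the lower and upper bounds from step (c) of 4.25.
print((np.log2(p @ c) + np.log2(c.max())) / 2)    # capacity estimate in bpcu
```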

4.40. In this section, we worked with “toy” examples in which finding capacity is relatively easy. In general, there is no closed-form solution for computing capacity. When we have to deal with cases that do not fit in any special family of Q described in the examples above, the maximum may be found by standard nonlinear optimization techniques or the Blahut-Arimoto Algorithm discussed in 4.25.

4.5 Shannon's Coding Theorem

4.41. Shannon's (Noisy Channel) Coding Theorem [Shannon, 1948]

(a) Reliable communication over a (discrete memoryless) channel is possible if the communication rate R satisfies R < C, where C is the channel capacity. In particular, for any R < C, there exist codes (encoders and decoders) with sufficiently large n such that

P(E) \le 2^{-n E(R)},

where E(R) is
• a positive function of R for R < C, and
• completely determined by the channel characteristics.

(b) At rates higher than capacity, reliable communication is impossible.

4.42. Significance of Shannon's (noisy channel) coding theorem:

(a) It expresses the limit to reliable communication.

(b) It provides a yardstick to measure the performance of communication systems.
• A system performing near capacity is a nearly optimal system and does not have much room for improvement.
• On the other hand, a system operating far from this fundamental bound can be improved (mainly through coding techniques).

4.43. Shannon's nonconstructive proof of his coding theorem

• Shannon introduced a method of proof called random coding.
• Instead of looking for the best possible coding scheme and analyzing its performance, which is a difficult task,
  ◦ all possible coding schemes are considered
    ∗ by generating the code randomly with an appropriate distribution,
  ◦ and the performance of the system is averaged over them.

  ◦ Then it is proved that if R < C, the average error probability tends to zero.
• Again, Shannon proved that
  ◦ as long as R < C,
  ◦ for any arbitrarily small (but still positive) probability of error,
  ◦ one can find (there exists) at least one code (with sufficiently long block length n) that performs better than the specified probability of error.
• If we use the suggested scheme and generate a code at random, the constructed code is likely to be good for long block lengths.
• However, such a code has no structure and is therefore very difficult to decode.

4.44. Practical codes:

• In addition to achieving low probabilities of error, useful codes should be "simple", so that they can be encoded and decoded efficiently.
• Shannon's theorem does not provide a practical coding scheme.
• Since Shannon's paper, a variety of techniques have been used to construct good error-correcting codes.
  ◦ The entire field of coding theory has been developed during this search.
• Turbo codes have come close to achieving capacity for Gaussian channels.
