An Introduction to Information Theory
Vahid Meghdadi
Reference: Elements of Information Theory by Cover and Thomas
September 2007

Contents

1 Entropy
2 Joint and conditional entropy
3 Mutual information
4 Data Compression or Source Coding
5 Channel capacity
  5.1 Examples
    5.1.1 Noiseless binary channel
    5.1.2 Binary symmetric channel
    5.1.3 Binary erasure channel
    5.1.4 Two-fold channel
6 Differential entropy
  6.1 Relation between differential and discrete entropy
  6.2 Joint and conditional entropy
  6.3 Some properties
7 The Gaussian channel
  7.1 Capacity of Gaussian channel
  7.2 Band-limited channel
  7.3 Parallel Gaussian channel
8 Capacity of SIMO channel
9 Exercise (to be completed)

1 Entropy

Entropy is a measure of the uncertainty of a random variable. The uncertainty, or the amount of information contained in a message (or in a particular realization of a random variable), is defined as the logarithm of the inverse of its probability: \log(1/P_X(x)). So a less likely outcome carries more information. Let X be a discrete random variable with alphabet \mathcal{X} and probability mass function P_X(x) = \Pr\{X = x\}, x ∈ \mathcal{X}. For convenience P_X(x) will be denoted by p(x). The entropy of X is defined as follows:

Definition 1. The entropy H(X) of a discrete random variable is defined by

H(X) = E\left[\log \frac{1}{p(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}    (1)

Entropy indicates the average information contained in X. When the base of the logarithm is 2, the entropy is measured in bits. For example, the entropy of a fair coin toss is 1 bit.

Note: The entropy is a function of the distribution of X. It does not depend on the actual values taken by the random variable, but only on the probabilities.

Note: H(X) ≥ 0.

Example 1. Let

X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}    (2)

Show that the entropy of X is

H(X) = -p \log p - (1 - p) \log(1 - p)    (3)

Sometimes this entropy is denoted by H(p, 1 - p). Note that the entropy is maximized for p = 0.5 and is zero for p = 1 or p = 0 (see Figure 1). This makes sense because when p = 0 or p = 1 there is no uncertainty about the random variable X and hence no information in revealing its outcome. The maximum uncertainty is when the two events are equiprobable.

Figure 1: H(p) versus p.

Example 2. Suppose X can take on K values. Show that the entropy is maximized when X is uniformly distributed on these K values, and that in this case H(X) = \log K.

Solution: Calculating H(X) results in:

H(X) = \sum_{x=1}^{K} p(x) \log \frac{1}{p(x)}

Since \log(x) is a concave function, we can use Jensen's inequality for a concave function f(x), stated below (the \lambda_i are non-negative and sum to one):

\sum_i \lambda_i f(x_i) \le f\left(\sum_i \lambda_i x_i\right)

It is then clear that

H(X) = \sum_{x} p(x) \log \frac{1}{p(x)} \le \log\left(\sum_{x} p(x) \, \frac{1}{p(x)}\right) = \log K

So H(X) ≤ log K, which means that the maximum possible value of H(X) is log K. Choosing p(X = i) = 1/K for every i, we obtain H(X) = \log K. So the uniform distribution on the K values maximizes the entropy, and this maximum entropy is \log K.
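As a quick numerical illustration of Examples 1 and 2 (this sketch is not part of the original notes, and the helper name entropy is mine), the following Python fragment computes the entropy in bits of a discrete distribution, shows that the binary entropy H(p) peaks at p = 0.5 and vanishes at p = 0 and p = 1, and checks that a uniform distribution over K values attains log2(K):

import math

def entropy(pmf):
    """Entropy in bits of a discrete pmf given as a sequence of probabilities."""
    return sum(-p * math.log2(p) for p in pmf if p > 0)

# Example 1: the binary entropy H(p) is maximal at p = 0.5 and zero at p = 0 or p = 1.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"H({p}) = {entropy([p, 1 - p]):.4f} bits")

# Example 2: among distributions on K values, the uniform one attains log2(K) bits.
K = 4
print(entropy([1 / K] * K), math.log2(K))             # both equal 2.0
print(entropy([0.7, 0.1, 0.1, 0.1]) <= math.log2(K))  # True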
There is another way to solve this problem, using Lagrange multipliers. We are going to maximize

H(X) = -\sum_{k=1}^{K} P(X = k) \log P(X = k)

The constraint of this maximization is

g(p_1, p_2, \ldots, p_K) = \sum_{k=1}^{K} p_k = 1

So by varying the p_k we try to find the maximum point of H. For all k from 1 to K we should maximize H + \lambda(g - 1). Hence we require:

\frac{\partial}{\partial p_k}\left(H + \lambda(g - 1)\right) = 0

It means that:

\frac{\partial}{\partial p_k}\left(-\sum_{k=1}^{K} p_k \log p_k + \lambda\left(\sum_{k=1}^{K} p_k - 1\right)\right) = 0

After calculating the differentiation we obtain a set of K independent equations:

-\log_2 p_k - \frac{1}{\ln 2} + \lambda = 0

Solving this equation gives \log_2 p_k = \lambda - \frac{1}{\ln 2}, whose right-hand side does not depend on k, so all the p_k have the same value. Together with the constraint this gives

p_k = \frac{1}{K}

As a conclusion, the uniform distribution yields the greatest entropy.

2 Joint and conditional entropy

We saw the entropy of a single random variable (RV); we now extend it to a pair of RVs.

Definition 2. The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as:

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)    (4)

which can also be expressed as

H(X, Y) = -E\left[\log p(X, Y)\right]    (5)

The conditional entropy is defined in the same way. It is the expectation of the entropies of the conditional distributions:

Definition 3. If (X, Y) ∼ p(x, y), then the conditional entropy H(Y|X) is defined as:

H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x)    (6)
       = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x)    (7)
       = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)    (8)
       = -E_{p(x,y)}\left[\log p(Y|X)\right]    (9)

Theorem 1. (chain rule)

H(X, Y) = H(X) + H(Y|X)    (10)

The proof comes from the fact that \log p(X, Y) = \log p(X) + \log p(Y|X); taking the expectation of both sides gives the result.

3 Mutual information

The mutual information is a measure of the amount of information that one random variable contains about another random variable. It is the reduction in uncertainty of one random variable due to the knowledge of the other:

I(X; Y) = H(X) - H(X|Y)    (11)
        = E_{p(x,y)}\left[\log \frac{p(X, Y)}{p(X) p(Y)}\right]    (12)
        = H(Y) - H(Y|X)    (13)
        = I(Y; X)    (14)

The diagram of Figure 2 represents the relation between the conditional entropies and the mutual information.

Figure 2: Relation between entropy and mutual information (the regions H(X|Y), H(Y|X) and I(X; Y)).

The chain rule can also be stated for the mutual information. First we define the conditional mutual information as:

I(X; Y|Z) = H(X|Z) - H(X|Y, Z)    (15)

Using this definition the chain rule can be written:

I(X_1, X_2; Y) = H(X_1, X_2) - H(X_1, X_2|Y)    (16)
               = H(X_1) + H(X_2|X_1) - H(X_1|Y) - H(X_2|X_1, Y)    (17)
               = I(X_1; Y) + I(X_2; Y|X_1)    (18)

Some properties:

• I(X; Y) ≥ 0
• I(X; Y|Z) ≥ 0
• H(X) ≤ log |\mathcal{X}|, where |\mathcal{X}| denotes the number of elements in the range of X, with equality if and only if X has a uniform distribution over \mathcal{X}.
• Conditioning reduces entropy: H(X|Y) ≤ H(X), with equality if and only if X and Y are independent.
• H(X_1, X_2, \ldots, X_n) ≤ \sum_{i=1}^{n} H(X_i), with equality if and only if the X_i are independent.
• Let (X, Y) ∼ p(x, y) = p(x) p(y|x). The mutual information I(X; Y) is a concave function of p(x) for fixed p(y|x) and a convex function of p(y|x) for fixed p(x).
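The identities of Sections 2 and 3 are easy to check numerically. The following Python sketch (not from the notes; the joint pmf and the helper name H are arbitrary illustrative choices) computes H(X), H(Y), H(X, Y), H(Y|X) and I(X; Y) for a small joint distribution and verifies the chain rule (10) together with the equivalent identity I(X; Y) = H(X) + H(Y) - H(X, Y):

import math

def H(probs):
    """Entropy in bits of an iterable of probabilities (zero terms are skipped)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint pmf p(x, y) on {0, 1} x {0, 1}, chosen only for illustration.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals p(x) and p(y).
p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(v for (_, b), v in p_xy.items() if b == y) for y in (0, 1)}

H_XY = H(p_xy.values())
H_X, H_Y = H(p_x.values()), H(p_y.values())

# H(Y|X) from equation (8): minus the sum over (x, y) of p(x, y) log p(y|x).
H_Y_given_X = sum(-v * math.log2(v / p_x[x]) for (x, _), v in p_xy.items() if v > 0)

I_XY = H_Y - H_Y_given_X                          # equation (13)
print(abs(H_XY - (H_X + H_Y_given_X)) < 1e-12)    # chain rule (10): True
print(abs(I_XY - (H_X + H_Y - H_XY)) < 1e-12)     # equivalent identity: True

Both printed checks evaluate to True, as expected from equations (10), (11) and (13).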
4 Data Compression or Source Coding

Two classes of coding are used: lossless source coding and lossy source coding. In this section, only lossless coding will be considered. Data compression can be achieved by assigning short descriptions to the most frequent outcomes of the data source and longer descriptions to the less probable ones. For example, in Morse code the letter "e" is coded by just a dot. In this chapter some principles of compression will be given.

Definition 4. A source code C for a random variable X is a mapping from \mathcal{X}, the range of X, to D, the set of finite-length strings of symbols from a D-ary alphabet. Let C(x) denote the codeword corresponding to x and let l(x) denote the length of C(x).

Example 3. If you toss a coin, \mathcal{X} = \{tail, head\}, and a possible code is C(head) = 0, C(tail) = 11, so that l(head) = 1 and l(tail) = 2.

Definition 5. The expected length of a code C for a random variable X with probability mass function p(x) is given by

L(C) = \sum_{x \in \mathcal{X}} p(x) l(x)    (19)

For example, for the code of Example 3 with a fair coin, the expected length is

L(C) = \frac{1}{2} \times 2 + \frac{1}{2} \times 1 = 1.5

Definition 6. A code is singular if C(x_1) = C(x_2) for some x_1 ≠ x_2. A non-singular code describes a single symbol unambiguously; however, non-singularity by itself does not guarantee that sequences of symbols (see the extension below) can be decoded without ambiguity.

Definition 7. The extension of a code C is the code obtained as:

C(x_1 x_2 \ldots x_n) = C(x_1) C(x_2) \cdots C(x_n)    (20)

It means that a long message can be coded by concatenating the codewords of its symbols. For example, if C(x_1) = 11 and C(x_2) = 00, then C(x_1 x_2) = 1100.

Definition 8. A code is uniquely decodable if its extension is non-singular. In other words, any encoded string has only one possible source string, and there is no ambiguity.

Definition 9. A code is called a prefix code, or an instantaneous code, if no codeword is a prefix of any other codeword.
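To make Definitions 5, 7 and 9 concrete, here is a short Python sketch (not part of the notes; the helper names encode and decode are mine) built around the code of Example 3. It computes the expected length (19), checks the prefix condition, encodes a message with the extension of the code, and decodes it instantaneously, codeword by codeword:

# A sketch around the code of Example 3: C(head) = '0', C(tail) = '11'.
code = {"head": "0", "tail": "11"}
p = {"head": 0.5, "tail": 0.5}   # fair coin, as in the expected-length example

# Expected length, equation (19): L(C) = 0.5 * 1 + 0.5 * 2 = 1.5 bits per symbol.
L = sum(p[x] * len(code[x]) for x in code)

# Definition 9: no codeword may be a prefix of another codeword.
is_prefix_free = not any(
    a != b and code[b].startswith(code[a]) for a in code for b in code
)

def encode(symbols):
    """Extension of the code (Definition 7): concatenate the codewords."""
    return "".join(code[s] for s in symbols)

def decode(bits):
    """Instantaneous decoding: parse left to right, emitting a symbol as soon
    as the buffered bits form a codeword (valid because the code is prefix-free)."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out

msg = ["head", "tail", "tail", "head"]
print(L, is_prefix_free)            # 1.5 True
print(decode(encode(msg)) == msg)   # True

The left-to-right decoding loop works precisely because the code is a prefix code: as soon as the buffered bits match a codeword, no longer codeword can begin with them, so the symbol can be emitted immediately.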