University of Central Florida
School of Electrical Engineering and Computer Science
EEL-6532: Information Theory and Coding. Spring 2010 - dcm

Lecture 2 - Wednesday January 13, 2010

Shannon Entropy

In his endeavor to construct mathematically tractable models of communication, Shannon concentrated on stationary and ergodic(1) sources of classical information. A stationary source of information emits symbols with a probability that does not change over time, and an ergodic source emits information symbols with a probability equal to the frequency of their occurrence in a long sequence. Stationary ergodic sources of information have a finite but arbitrary and potentially long correlation time.

In the late 1940s Shannon introduced a measure of the quantity of information a source could generate [13]. Earlier, in 1927, another scientist from Bell Labs, Ralph Hartley, had proposed to take the logarithm of the total number of possible messages as a measure of the amount of information in a message generated by a source of information, arguing that the logarithm tells us how many digits or characters are required to convey the message. Shannon recognized the relationship between thermodynamic entropy and informational entropy and, on von Neumann's advice, he called the negative logarithm of the probability of an event entropy(2).

(1) A stochastic process is said to be ergodic if time averages are equal to ensemble averages, in other words, if its statistical properties such as its mean and variance can be deduced from a single, sufficiently long sample (realization) of the process.
(2) It is rumored that von Neumann told Shannon "It is already in use under that name and besides it will give you a great edge in debates because nobody really knows what entropy is anyway" [2].

Consider an event which happens with probability p; we wish to quantify the information content of a message communicating the occurrence of this event, and we impose the condition that the measure should reflect the "surprise" brought by the occurrence of this event. An initial guess for a measure of this surprise would be 1/p: the lower the probability of the event, the larger the surprise. But this simplistic approach does not withstand scrutiny, because the surprise should be additive. If an event is composed of two independent events which occur with probabilities q and r, then the probability of the event is p = qr, but we see that

    \frac{1}{p} \neq \frac{1}{q} + \frac{1}{r}.

On the other hand, if the surprise is measured by the logarithm of 1/p, then the additivity property is obeyed:

    \log \frac{1}{p} = \log \frac{1}{q} + \log \frac{1}{r}.

Given a probability distribution with \sum_i p_i = 1, we see that the uncertainty is in fact equal to the average surprise:

    \sum_i p_i \log \frac{1}{p_i}.

The entropy is a measure of the uncertainty of a single random variable X before it is observed, or the average uncertainty removed by observing it. This quantity is called entropy due to its similarity to the thermodynamic entropy.

Entropy: the entropy of a random variable X with a probability density function p_X(x) is

    H(X) = -\sum_x p_X(x) \log p_X(x).

The entropy of a random variable is non-negative. Indeed, the probability p_X(x) is a real number between 0 and 1, therefore \log p_X(x) ≤ 0 and H(X) ≥ 0.

Let X be a binary random variable and p = p_X(x = 1) be the probability that X takes the value 1; then the entropy of X is

    H(p) = -p \log p - (1 - p) \log(1 - p).

If the logarithm is taken in base 2, the binary entropy is measured in bits. Figure 1 shows H(p) as a function of p for a binary random variable.
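The logarithmic measure of surprise and the binary entropy are easy to explore numerically. The short Python sketch below illustrates both; the helper names surprise and binary_entropy, and the sample probabilities q = 0.25 and r = 0.5, are chosen here only for illustration.

    import math

    def surprise(p):
        # Self-information ("surprise") of an event with probability p, in bits.
        return -math.log2(p)

    def binary_entropy(p):
        # H(p) = -p log2(p) - (1 - p) log2(1 - p); by convention H(0) = H(1) = 0.
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

    # Additivity: for independent events with probabilities q and r, the surprise
    # of the joint event (probability p = q * r) is the sum of the two surprises.
    q, r = 0.25, 0.5
    assert abs(surprise(q * r) - (surprise(q) + surprise(r))) < 1e-12

    print(binary_entropy(0.5))   # 1.0 bit, the maximum
    print(binary_entropy(0.1))   # about 0.469 bits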
The entropy has a maximum of 1 bit when p = 1/2 and goes to zero when p = 0 or p = 1. Intuitively, we expect the entropy to be zero when the outcome is certain and to reach its maximum when both outcomes are equally likely. It is easy to see that:

(i) H(X) > 0 for 0 < p < 1;
(ii) H(X) is symmetric about p = 0.5;
(iii) \lim_{p \to 0} H(X) = \lim_{p \to 1} H(X) = 0;
(iv) H(X) is increasing for 0 < p < 0.5, decreasing for 0.5 < p < 1, and has a maximum at p = 0.5;
(v) the binary entropy is a concave function of p, the probability of an outcome.

Before discussing this last property of the binary entropy we review a few properties of convex and concave functions. A function f(x) is convex over an interval (a, b) if

    f(\gamma x_1 + (1 - \gamma) x_2) \leq \gamma f(x_1) + (1 - \gamma) f(x_2)  for all x_1, x_2 \in (a, b) and 0 \leq \gamma \leq 1.

The function is strictly convex iff equality holds only for \gamma = 0 or \gamma = 1. A function f(x) is concave if and only if -f(x) is convex over the same interval.

It is easy to prove that if the second derivative of the function f(x) is non-negative, then the function is convex. Let x_0 = \gamma x_1 + (1 - \gamma) x_2. From the Taylor series expansion

    f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(\xi)}{2}(x - x_0)^2,  with \xi between x_0 and x,

and from the fact that the second derivative is non-negative, it follows that f''(\xi)(x - x_0)^2 ≥ 0. When x = x_1, then x_1 - x_0 = (1 - \gamma)(x_1 - x_2) and

    f(x_1) \geq f(x_0) + f'(x_0)(x_1 - x_0), or f(x_1) \geq f(x_0) + f'(x_0)(1 - \gamma)(x_1 - x_2).

When x = x_2, then x_2 - x_0 = \gamma(x_2 - x_1) and

    f(x_2) \geq f(x_0) + f'(x_0)(x_2 - x_0), or f(x_2) \geq f(x_0) + f'(x_0)\gamma(x_2 - x_1).

Multiplying the first inequality by \gamma, the second by (1 - \gamma), and adding them, the first-derivative terms cancel and it follows that

    \gamma f(x_1) + (1 - \gamma) f(x_2) \geq f(\gamma x_1 + (1 - \gamma) x_2). □

Convex functions enjoy a number of useful properties; for example, if X is a discrete random variable with the probability density function p_X(x_i) and f(x) is a convex function, then f(x) satisfies Jensen's inequality:

    \sum_i p_X(x_i) f(x_i) \geq f\left( \sum_i p_X(x_i) x_i \right).

It is easy to prove that f(x) is concave if and only if

    f\left( \frac{x_1 + x_2}{2} \right) \geq \frac{f(x_1) + f(x_2)}{2}.

Figure 1 illustrates the fact that the binary entropy H(p) is a concave function of p; the function lies above any chord, in particular above the chord connecting the points (p_1, H(p_1)) and (p_2, H(p_2)):

    H\left( \frac{p_1 + p_2}{2} \right) \geq \frac{H(p_1) + H(p_2)}{2}.

Table 1 shows some values of H(X) for 0.0001 ≤ p ≤ 0.5.

Table 1: The entropy of a binary random variable for 0.0001 ≤ p ≤ 0.5.

    p        H(X)      p      H(X)     p      H(X)     p      H(X)
    0.0001   0.001     0.01   0.081    0.2    0.722    0.4    0.971
    0.001    0.011     0.1    0.469    0.3    0.881    0.5    1.000

Figure 1: The entropy of a binary random variable as a function of the probability of an outcome.

Now we consider two random variables, X and Y, with the probability density functions p_X(x) and p_Y(y); let p_{XY}(x, y) be the joint probability density function of X and Y. To quantify the uncertainty about the pair (x, y) we introduce the joint entropy of the two random variables.

Joint entropy: the joint entropy of two random variables X and Y is defined as

    H(X, Y) = -\sum_{x,y} p_{XY}(x, y) \log p_{XY}(x, y).

If we have acquired all the information about the random variable X, we may ask how much uncertainty is still there about the pair of the two random variables, (X, Y). To answer this question we introduce the conditional entropy.

Conditional entropy: the conditional entropy of the random variable Y given X is defined as

    H(Y | X) = -\sum_x \sum_y p_{XY}(x, y) \log p_{Y|X}(y | x).
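Before turning to an example, the convexity and concavity statements above can be checked numerically. The following Python lines reproduce the entries of Table 1 and test Jensen's inequality and the midpoint form of concavity; the convex test function f(x) = x**2 and the sample outcomes and probabilities are chosen here only for illustration.

    import math

    def h(p):
        # Binary entropy in bits, as defined earlier.
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    # Reproduce the entries of Table 1, rounded to three decimals.
    for p in (0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5):
        print(p, round(h(p), 3))

    def f(x):
        # A simple convex test function.
        return x ** 2

    # Jensen's inequality for the convex f and a discrete distribution: E[f(X)] >= f(E[X]).
    xs = [1.0, 2.0, 5.0]     # sample outcomes
    ps = [0.2, 0.5, 0.3]     # sample probabilities, summing to 1
    assert sum(p * f(x) for p, x in zip(ps, xs)) >= f(sum(p * x for p, x in zip(ps, xs)))

    # Concavity of the binary entropy: the curve lies above the chord, so the
    # midpoint inequality H((p1 + p2)/2) >= (H(p1) + H(p2))/2 holds.
    p1, p2 = 0.1, 0.6
    assert h((p1 + p2) / 2) >= (h(p1) + h(p2)) / 2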
Consider two random variables X and Y. Each of them takes values over a five-letter alphabet consisting of the symbols a, b, c, d, e. The joint distribution of the two random variables is given in Table 2.

Table 2: The joint probability distribution matrix of the random variables X and Y; the rows correspond to the values of X and the columns to the values of Y.

    p_{X,Y}(x, y)    a       b       c       d       e
    a                1/10    1/20    1/40    1/80    1/80
    b                1/20    1/40    1/80    1/80    1/10
    c                1/40    1/80    1/80    1/10    1/20
    d                1/80    1/80    1/10    1/20    1/40
    e                1/80    1/10    1/20    1/40    1/80

The marginal distributions of X and Y can be computed from the relations

    p_X(x) = \sum_y p_{XY}(x, y)  and  p_Y(y) = \sum_x p_{XY}(x, y)

as follows:

    p_X(a) = \sum_y p_{XY}(x = a, y) = \frac{1}{10} + \frac{1}{20} + \frac{1}{40} + \frac{1}{80} + \frac{1}{80} = \frac{1}{5}.

Similarly, we obtain

    p_X(x = b) = p_X(x = c) = p_X(x = d) = p_X(x = e) = \frac{1}{5}

and

    p_Y(y = a) = p_Y(y = b) = p_Y(y = c) = p_Y(y = d) = p_Y(y = e) = \frac{1}{5}.

The entropy of X is thus

    H(X) = -\sum_x p_X(x) \log p_X(x) = 5 \left[ \frac{1}{5} \log 5 \right] = \log 5 bits.

Similarly, the entropy of Y is

    H(Y) = -\sum_y p_Y(y) \log p_Y(y) = 5 \left[ \frac{1}{5} \log 5 \right] = \log 5 bits.
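These hand computations are easy to mechanize. The Python sketch below encodes Table 2, recovers the marginals and the entropies H(X) and H(Y), and, as an added illustration, also evaluates the joint entropy H(X, Y) and the conditional entropy H(Y | X) from the definitions given above.

    import math

    # Joint distribution of Table 2; rows are values of X, columns are values of Y.
    table = [
        [1/10, 1/20, 1/40, 1/80, 1/80],   # x = a
        [1/20, 1/40, 1/80, 1/80, 1/10],   # x = b
        [1/40, 1/80, 1/80, 1/10, 1/20],   # x = c
        [1/80, 1/80, 1/10, 1/20, 1/40],   # x = d
        [1/80, 1/10, 1/20, 1/40, 1/80],   # x = e
    ]

    # Marginals: p_X(x) = sum_y p_XY(x, y) and p_Y(y) = sum_x p_XY(x, y).
    p_x = [sum(row) for row in table]                                # each is 1/5
    p_y = [sum(table[i][j] for i in range(5)) for j in range(5)]     # each is 1/5

    # Entropies in bits (logarithms in base 2).
    H_X = -sum(p * math.log2(p) for p in p_x)                    # log2(5), about 2.322 bits
    H_Y = -sum(p * math.log2(p) for p in p_y)                    # log2(5), about 2.322 bits
    H_XY = -sum(p * math.log2(p) for row in table for p in row)  # about 4.197 bits

    # Conditional entropy from its definition, with p_{Y|X}(y|x) = p_XY(x, y) / p_X(x);
    # numerically this equals H(X, Y) - H(X) = 1.875 bits.
    H_Y_given_X = -sum(table[i][j] * math.log2(table[i][j] / p_x[i])
                       for i in range(5) for j in range(5))

    print(H_X, H_Y, H_XY, H_Y_given_X)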