Entropy Rate

School of Information Science, Japan Advanced Institute of Science and Technology (JAIST)
2009 2-2 Course: Information Theory
Tetsuo Asano and Tad Matsumoto
Email: {t-asano, matumoto}@jaist.ac.jp
Asahidai 1-1, Nomi, Ishikawa 923-1292, Japan
http://www.jaist.ac.jp

Preliminary Experiment

[Block diagram: Information Source -> Encoder -> Fax Tx -> (+ Noise) -> Fax Rx -> Decoder -> Destination]

In the previous chapter, we learned that information can be evaluated by the uncertainty of the source output random variable, i.e., by its entropy.

Question: How does the entropy of the sequence grow with its length? Does the growth rate have a limit if the source outputs an infinitely long sequence?

Outline

We will make a journey through the following topics:

1. Revisit of Markov Chain
2. Entropy Rate
   - Stationary Process
   - Asymptotic Properties
   - Entropy Rate Chain Rule
   - Conditioning

Revisit of Markov Chain (1)

Definition 5.1.1: Markov Chain or Markov Process
A discrete stochastic process X_1, X_2, ..., X_k, ... forms a Markov chain (or Markov process) if, for n = 1, 2, ...,

    \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1) = \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n)

with x_1, x_2, \ldots, x_n, x_{n+1} \in \mathcal{X}.

Theorem 5.1.1: Joint Distribution

    p(x_1, \ldots, x_{n-1}, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2) \cdots p(x_n \mid x_{n-1})

Proof: Obvious from the definition.

Definition 5.1.2: Time Invariance
The Markov process is time-invariant if the conditional probability p(x_n | x_{n-1}) does not depend on n, i.e.,

    \Pr(X_{n+1} = a \mid X_n = b) = \Pr(X_2 = a \mid X_1 = b)   for all a, b \in \mathcal{X}.

Revisit of Markov Chain (2)

Definition 5.1.3: States
If X_1, X_2, ..., X_n, ... forms a Markov chain, X_n is called the state at time n. The matrix P = {P_{ij}}, with P_{ij} = \Pr\{X_n = j \mid X_{n-1} = i\}, is called the probability transition matrix. Obviously,

    p(x_{n+1}) = \sum_{x_n} p(x_n)\, P_{x_n x_{n+1}},   1 \le x_n \le m,  1 \le x_{n+1} \le m.

Definition 5.1.4: Stationarity
If the distribution on the states at time n+1 is the same as that at time n, the Markov process is called stationary. A stationary Markov process has a stationary probability distribution.

Example 5.1.1: Two-State Markov Process
A two-state Markov process, whose state diagram is shown below, has the probability transition matrix

    P = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}

[State diagram: states S_1 and S_2; S_1 -> S_2 with probability \alpha, S_2 -> S_1 with probability \beta; self-loops with probabilities 1-\alpha and 1-\beta.]

Revisit of Markov Chain (3)

Theorem 5.1.2: State Probability
The stationary state probability vector \mu = [\mu_1, \mu_2, \ldots, \mu_m] satisfies the following equations:

    \mu P = \mu   and   \sum_{i=1}^{m} \mu_i = 1.

Proof: Obvious from the definition.

Example 5.1.2: Two-State Markov Process
The stationary state probability vector \mu of the two-state Markov process in Example 5.1.1 is given by

    [\mu_1, \mu_2] = \left[ \frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta} \right],

with which the entropy of the state X_n at time n is

    H(X_n) = H\left( \frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta} \right).
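As a numerical check of Example 5.1.2, the following is a minimal Python sketch that solves \mu P = \mu together with \sum_i \mu_i = 1 and compares the result with the closed-form stationary vector and its entropy; the values \alpha = 0.2 and \beta = 0.4 are assumed purely for illustration.

    import numpy as np

    def stationary_distribution(P):
        """Solve mu P = mu together with sum(mu) = 1 for a transition matrix P."""
        m = P.shape[0]
        # Stack the balance equations (P^T - I) mu = 0 with the normalization
        # constraint and solve the (consistent) linear system by least squares.
        A = np.vstack([P.T - np.eye(m), np.ones((1, m))])
        b = np.concatenate([np.zeros(m), [1.0]])
        mu, *_ = np.linalg.lstsq(A, b, rcond=None)
        return mu

    def entropy(p):
        """Entropy in bits of a probability vector p, with 0 log 0 taken as 0."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    alpha, beta = 0.2, 0.4                       # assumed example values
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])

    mu = stationary_distribution(P)
    print(mu)                                               # numerical [mu_1, mu_2]
    print([beta / (alpha + beta), alpha / (alpha + beta)])  # closed form of Example 5.1.2
    print(entropy(mu))                                      # H(X_n) from Example 5.1.2

For these values the numerical and closed-form vectors agree, as expected; for larger chains the same least-squares step picks out the unique probability vector satisfying the linear constraints.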
Entropy Rate (1)

Definition 5.2.1: Entropy Rate
The entropy rate of a stochastic process {X_i} is defined by

    H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n).

Examples:

(1) Typewriter: A typewriter having m equally likely output letters can produce m^n sequences of length n, all of which are equally likely. Therefore,

    H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \log m   bits per symbol.

(2) Independent and identically distributed (i.i.d.) random variables (special case): If X_1, X_2, ..., X_n are i.i.d. random variables,

    H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n \to \infty} \frac{n H(X_1)}{n} = H(X_1).

(3) Independent but NOT identically distributed random variables:

    H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i).

Entropy Rate (2)

Definition 5.2.2: Conditional Entropy Rate
The conditional entropy rate of a stationary stochastic process {X_i} is defined by

    H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_2, X_1).

Theorem 5.2.1:

    H(X_{n+1} \mid X_n, \ldots, X_2, X_1) \to H'(\mathcal{X}) < \infty   as n \to \infty.

Proof: Since conditioning decreases entropy and the process is stationary,

    H(X_{n+1} \mid X_n, \ldots, X_2, X_1) \le H(X_{n+1} \mid X_n, \ldots, X_2) = H(X_n \mid X_{n-1}, \ldots, X_1).

The sequence is therefore non-increasing, and since entropy is non-negative it has a limit.

Theorem 5.2.2 (Cesàro mean): If a_n \to a and b_n = \frac{1}{n} \sum_{i=1}^{n} a_i, then b_n \to a.
Proof: Obvious.

Theorem 5.2.3: H'(\mathcal{X}) = H(\mathcal{X})
Proof: By the chain rule of entropy,

    \frac{1}{n} H(X_n, \ldots, X_2, X_1) = \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1).

Entropy Rate (3)

Proof of Theorem 5.2.3 (continued): Apply Theorem 5.2.2 with H(X_n \mid X_{n-1}, \ldots, X_1) playing the role of a_n and \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) playing the role of b_n; both converge to the same limit. Therefore,

    H(\mathcal{X}) = \lim_{n \to \infty} \frac{H(X_1, \ldots, X_n)}{n} = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1) = H'(\mathcal{X}).

Theorem 5.2.4: Entropy Rate of Markov Chain
For a stationary Markov chain,

    H(\mathcal{X}) = H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1) = \lim_{n \to \infty} H(X_n \mid X_{n-1}) = H(X_2 \mid X_1).

Proof: Obvious from the definitions of the Markov chain's stationarity and the entropy rate.

Entropy Rate (4)

Exercise: Prove the following (a numerical check of both formulas is sketched after the Summary):

(1) For a stationary Markov chain with transition probability matrix P = {P_{ij}} and stationary state vector \mu = {\mu_i},

    H(\mathcal{X}) = - \sum_{i,j} \mu_i P_{ij} \log P_{ij}.

(2) For the two-state stationary Markov chain of Example 5.1.1,

    H(\mathcal{X}) = \frac{\beta}{\alpha+\beta} H(\alpha) + \frac{\alpha}{\alpha+\beta} H(\beta).

Entropy Rate (5)

Entropy of English:
(1) There are 26 letters and the space symbol.
(2) The appearance probability of each letter, ignoring the space, is summarized in the table below:

    Letter  Prob. (%)    Letter  Prob. (%)    Letter  Prob. (%)
    A       8.29         J       0.21         S       6.33
    B       1.43         K       0.48         T       9.27
    C       3.68         L       3.68         U       2.53
    D       4.29         M       3.23         V       1.03
    E       12.8         N       7.16         W       1.62
    F       2.20         O       7.28         X       0.20
    G       1.71         P       2.93         Y       1.57
    H       4.54         Q       0.11         Z       0.09
    I       7.16         R       6.90

If we have no knowledge about the appearance probabilities of the letters, we need

    \log_2 26 = 4.7   bits

to encode one English letter. If we know the appearance probability of each letter, we need

    H(S) = - \sum_{i=1}^{26} p_i \log p_i = 4.17   bits.

Entropy Rate (7)

If we can evaluate the conditional probabilities p(x_i | x_{i-1}), p(x_i | x_{i-1}, x_{i-2}), ..., p(x_i | x_{i-1}, x_{i-2}, ..., x_{i-n}), empirically or theoretically, and create a Markov model of the letter appearances, we can further reduce the rate required to encode English. Shannon's landmark paper presents artificially created English sentences!

[Example sentences generated using empirical knowledge of p(x_i), p(x_i | x_{i-1}), and p(x_i | x_{i-1}, x_{i-2}).]

Entropy Rate (8)

[Example sentences generated using empirical knowledge of p(x_i | x_{i-1}, x_{i-2}, x_{i-3}), and of the word appearance probabilities p(w_i) and p(w_i | w_{i-1}, w_{i-2}).]

With the 4th-order model, Shannon showed that 2.8 bits are enough to express one English letter!

Summary

We have made a journey through the following topics:

1. Revisit of Markov Chain
2. Entropy Rate
   - Stationary Process
   - Asymptotic Properties
   - Entropy Rate Chain Rule
   - Conditioning
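As a companion to the Exercise in Entropy Rate (4), here is a minimal Python sketch (again with \alpha = 0.2 and \beta = 0.4 assumed only for illustration) that evaluates the general formula H(\mathcal{X}) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij} and compares it with the two-state closed form.

    import numpy as np

    def stationary_distribution(P):
        """Solve mu P = mu together with sum(mu) = 1 for a transition matrix P."""
        m = P.shape[0]
        A = np.vstack([P.T - np.eye(m), np.ones((1, m))])
        b = np.concatenate([np.zeros(m), [1.0]])
        mu, *_ = np.linalg.lstsq(A, b, rcond=None)
        return mu

    def markov_entropy_rate(P):
        """Entropy rate in bits of a stationary Markov chain: -sum_ij mu_i P_ij log2 P_ij."""
        mu = stationary_distribution(P)
        rate = 0.0
        for i in range(P.shape[0]):
            for j in range(P.shape[1]):
                if P[i, j] > 0:                  # 0 log 0 is treated as 0
                    rate -= mu[i] * P[i, j] * np.log2(P[i, j])
        return rate

    def h2(p):
        """Binary entropy H(p) in bits."""
        return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    alpha, beta = 0.2, 0.4                       # assumed example values
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])

    print(markov_entropy_rate(P))                # general formula of Exercise (1)
    print(beta / (alpha + beta) * h2(alpha)      # closed form of Exercise (2)
          + alpha / (alpha + beta) * h2(beta))

Both prints give the same value, illustrating that for the two-state chain the general formula reduces to the weighted binary entropies of the exercise.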

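Finally, as a check of the single-letter entropy figure in the Entropy of English discussion, a minimal Python sketch that recomputes H(S) from the letter probabilities in the table of Entropy Rate (5); the rounded percentages are renormalized, since they sum to slightly more than 100%.

    import numpy as np

    # Letter appearance probabilities (%) from the table in Entropy Rate (5).
    letter_percent = {
        'A': 8.29, 'B': 1.43, 'C': 3.68, 'D': 4.29, 'E': 12.8, 'F': 2.20,
        'G': 1.71, 'H': 4.54, 'I': 7.16, 'J': 0.21, 'K': 0.48, 'L': 3.68,
        'M': 3.23, 'N': 7.16, 'O': 7.28, 'P': 2.93, 'Q': 0.11, 'R': 6.90,
        'S': 6.33, 'T': 9.27, 'U': 2.53, 'V': 1.03, 'W': 1.62, 'X': 0.20,
        'Y': 1.57, 'Z': 0.09,
    }

    p = np.array(list(letter_percent.values())) / 100.0
    p = p / p.sum()                              # renormalize the rounded percentages

    print(np.log2(26))                           # 4.7 bits: no knowledge of letter statistics
    print(-np.sum(p * np.log2(p)))               # about 4.17 bits: H(S) with letter statistics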