School of Information Science

Entropy Rate

2009 2-2 Course: Tetsuo Asano and Tad Matsumoto. Email: {t-asano, matumoto}@jaist.ac.jp

Japan Advanced Institute of Science and Technology, Asahidai 1-1, Nomi, Ishikawa 923-1292, Japan. http://www.jaist.ac.jp


Preliminary Experiment

[Block diagram: Information Source → Encoder → Fax Tx → (+ Noise) → Fax Rx → Decoder → Destination]

In the previous chapter, we learned that information can be evaluated by the uncertainty of the source output random variable.

Question: How does the entropy of the sequence grow with its length? Does the growth rate have a limit if the source outputs a sequence of infinite length?

Outline

We will make a journey through the following topics:
1. Revisit of Markov Chain
2. Entropy Rate
   - Stationary Process
   - Asymptotic Properties
   - Entropy Rate Chain Rule
   - Conditioning

Revisit of Markov Chain (1)

Definition 5.1.1: Markov Chain or Markov Process

A discrete stochastic process X1, X2, …, Xk, … forms a Markov chain (or Markov process) if, for n = 1, 2, …,

$$\Pr(X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1) = \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n)$$

with $x_1, x_2, \ldots, x_n, x_{n+1} \in \mathcal{X}$.

Theorem 5.1.1: Joint Distribution

$$p(x_1, \ldots, x_{n-1}, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2) \cdots p(x_n \mid x_{n-1})$$

Proof: Obvious from the definition.
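As a quick illustration of Theorem 5.1.1, the minimal sketch below multiplies an initial distribution by the successive transition probabilities to obtain the probability of an entire path. The numbers and the helper name joint_prob are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch of Theorem 5.1.1: the joint probability of a Markov path
# factors as p(x1) * p(x2|x1) * ... * p(xn|x_{n-1}).
# The distributions below are made-up illustrative numbers.

p1 = {0: 0.6, 1: 0.4}              # initial distribution p(x1)
p_cond = {0: {0: 0.9, 1: 0.1},     # p(x_{n+1} = . | x_n = 0)
          1: {0: 0.5, 1: 0.5}}     # p(x_{n+1} = . | x_n = 1)

def joint_prob(path):
    """Probability of the whole path x1, x2, ..., xn under the Markov model."""
    prob = p1[path[0]]
    for prev, curr in zip(path, path[1:]):
        prob *= p_cond[prev][curr]
    return prob

print(joint_prob([0, 0, 1, 1]))    # 0.6 * 0.9 * 0.1 * 0.5 = 0.027
```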

Definition 5.1.2: Time Invariance:

A Markov process is time-invariant if the conditional probability $p(x_n \mid x_{n-1})$ does not depend on n, i.e.,

$$\Pr(X_{n+1} = a \mid X_n = b) = \Pr(X_2 = a \mid X_1 = b)$$

Revisit of Markov Chain (2)

Definition 5.1.3: States

If X1, X2, …, Xn, … forms a Markov chain, Xn is called the state at time n.

The matrix $P = \{P_{ij}\}$, with $P_{ij} = \Pr\{X_n = j \mid X_{n-1} = i\}$, is called the probability transition matrix. Obviously,

$$p(x_{n+1}) = \sum_{x_n} p(x_n)\, P_{x_n x_{n+1}}, \qquad 1 \le x_n \le m, \; 1 \le x_{n+1} \le m$$

Definition 5.1.4: Stationarity

If the distribution on the states at time n+1 is the same as that at time n, the Markov process is called stationary. A stationary Markov process has a stationary probability distribution.

Example 5.1.1: Two-State Markov Process

A two-state Markov process, whose state diagram is shown below [states $S_1$ and $S_2$, cross-transition probabilities $\alpha$ and $\beta$, self-loop probabilities $1-\alpha$ and $1-\beta$], has the probability transition matrix:

$$P = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}$$
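The following sketch applies the one-step update $p(x_{n+1}) = \sum_{x_n} p(x_n) P_{x_n x_{n+1}}$ repeatedly to the two-state chain of Example 5.1.1; the values of $\alpha$ and $\beta$ are arbitrary choices made only for illustration.

```python
# Sketch of the one-step update p(x_{n+1}) = sum_{x_n} p(x_n) * P[x_n][x_{n+1}]
# for the two-state chain of Example 5.1.1. alpha and beta are arbitrary choices.

alpha, beta = 0.3, 0.2
P = [[1 - alpha, alpha],   # transition probabilities out of state S1
     [beta, 1 - beta]]     # transition probabilities out of state S2

def step(p):
    """Apply the transition matrix once to a state distribution p."""
    return [sum(p[i] * P[i][j] for i in range(2)) for j in range(2)]

p = [1.0, 0.0]             # start deterministically in S1
for _ in range(20):
    p = step(p)
print(p)                   # approaches the stationary distribution of the chain
```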

Revisit of Markov Chain (3)

Theorem 5.1.2: State Probability

The stationary state probability vector $\mu$ satisfies the following equations:

$$\mu P = \mu \quad \text{and} \quad \sum_{i=1}^{m} \mu_i = 1, \qquad \text{with } \mu = [\mu_1, \mu_2, \ldots, \mu_m]$$

Proof: Obvious from the definition.

Example 5.1.2: Two-State Markov Process

The stationary state probability vector $\mu$ of the two-state Markov process in Example 5.1.1 is given by:

$$[\mu_1, \mu_2] = \left[\frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta}\right]$$

with which the entropy of the state $X_n$ at time n is:

$$H(X_n) = H\!\left(\frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta}\right)$$
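A small numerical check of Example 5.1.2 (not part of the slides): the closed-form $\mu$ satisfies $\mu P = \mu$, and the state entropy $H(X_n)$ follows directly. The values of $\alpha$ and $\beta$ are arbitrary illustrative assumptions.

```python
import math

# Numerical check of Example 5.1.2: mu = [beta/(alpha+beta), alpha/(alpha+beta)]
# satisfies mu P = mu, and the state entropy is H(X_n) = H(mu).
# alpha and beta are arbitrary illustrative values.

alpha, beta = 0.3, 0.2
P = [[1 - alpha, alpha], [beta, 1 - beta]]
mu = [beta / (alpha + beta), alpha / (alpha + beta)]

mu_P = [sum(mu[i] * P[i][j] for i in range(2)) for j in range(2)]
print(mu, mu_P)                                # the two vectors agree

H_state = -sum(m * math.log2(m) for m in mu)   # H(X_n) in bits
print(H_state)
```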

Entropy Rate (1)

Definition 5.2.1: Entropy Rate

The entropy rate of a stochastic process {Xi} is defined by:

$$H(X) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$

Examples:

(1) Typewriter: A typewriter having m equally likely output letters can produce $m^n$ sequences of length n, all of which are equally likely. Therefore,

$$H(X) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \log m \text{ bits per symbol.}$$

(2) Independent and Identically Distributed (i.i.d.) random variables (special case):

If X1, X2, …, Xn are i.i.d. random variables,

$$H(X) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n\to\infty} \frac{n H(X_1)}{n} = H(X_1)$$

(3) Independent but NOT identically distributed random variables:

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i)$$
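The i.i.d. case can be checked directly by brute force: the sketch below enumerates all length-n sequences and confirms that $\frac{1}{n} H(X_1, \ldots, X_n)$ equals $H(X_1)$. The per-letter distribution is an arbitrary illustrative assumption.

```python
import math
from itertools import product

# Sketch: for an i.i.d. source, (1/n) H(X1, ..., Xn) equals H(X1) for every n.
# The per-letter distribution below is an arbitrary illustrative choice.

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}

def H(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

n = 4
# joint distribution of n independent letters, by brute-force enumeration
joint = {seq: math.prod(p[s] for s in seq) for seq in product(p, repeat=n)}

print(H(p))          # H(X1) = 1.5 bits
print(H(joint) / n)  # also 1.5 bits per symbol
```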

Entropy Rate (2)

Definition 5.2.2: Conditional Entropy Rate

The conditional entropy rate of a stationary stochastic process {Xi} is defined by:

$$H'(X) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_2, X_1)$$

Theorem 5.2.1:

$$H(X_{n+1} \mid X_n, \ldots, X_2, X_1) \;\to\; H'(X) < \infty \quad \text{as } n \to \infty$$

Proof: Since conditioning decreases entropy,

$$H(X_{n+1} \mid X_n, \ldots, X_2, X_1) \le H(X_{n+1} \mid X_n, \ldots, X_2) = H(X_n \mid X_{n-1}, \ldots, X_1),$$

where the equality follows from stationarity. Since entropy is non-negative and the sequence is non-increasing, it has to have a limit.

Theorem 5.2.2: If $a_n \to a$ and $b_n = \frac{1}{n}\sum_{i=1}^{n} a_i$, then $b_n \to a$.

Proof: Obvious.

Theorem 5.2.3: $H'(X) = H(X)$

Proof: By the chain rule of entropy,

$$\frac{H(X_n, \ldots, X_2, X_1)}{n} = \frac{1}{n}\sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$$

Entropy Rate (3)

Proof of Theorem 5.2.3 (continued): Using the result of Theorem 5.2.2, where

$H(X_n \mid X_{n-1}, \ldots, X_1)$ corresponds to $a_n$ and

$$\frac{1}{n}\sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$$

corresponds to $b_n$, so both converge to the same limit. Therefore,

$$H(X) = \lim_{n\to\infty} \frac{H(X_1, \ldots, X_n)}{n} = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$$

$$= \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1) = H'(X)$$

Theorem 5.2.4: Entropy Rate of Markov Chain

For a stationary Markov chain,

$$H(X) = H'(X) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1) = \lim_{n\to\infty} H(X_n \mid X_{n-1}) = H(X_2 \mid X_1)$$

Proof: Obvious from the definitions of the Markov chain's stationarity and the entropy rate.
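Theorem 5.2.4 can also be observed numerically. The sketch below (not from the slides) computes the block entropy $H(X_1, \ldots, X_n)$ of a two-state stationary chain by enumerating all paths and shows that $\frac{1}{n}H(X_1,\ldots,X_n)$ tends to $H(X_2 \mid X_1)$; the values of $\alpha$ and $\beta$ are arbitrary, and the closed form used for $H(X_2 \mid X_1)$ is the one derived in the exercise on the next slide.

```python
import math
from itertools import product

# Numerical check of Theorem 5.2.4 for a two-state stationary chain:
# (1/n) H(X1, ..., Xn) approaches H(X2 | X1). alpha and beta are arbitrary.

alpha, beta = 0.3, 0.2
P = [[1 - alpha, alpha], [beta, 1 - beta]]
mu = [beta / (alpha + beta), alpha / (alpha + beta)]   # start in the stationary distribution

def path_prob(path):
    prob = mu[path[0]]
    for i, j in zip(path, path[1:]):
        prob *= P[i][j]
    return prob

def block_entropy(n):
    """H(X1, ..., Xn) in bits, by summing over all 2**n paths."""
    total = 0.0
    for path in product((0, 1), repeat=n):
        q = path_prob(path)
        total -= q * math.log2(q)
    return total

# entropy rate of the Markov chain: H(X2 | X1) = -sum_i mu_i sum_j P_ij log P_ij
H_rate = -sum(mu[i] * P[i][j] * math.log2(P[i][j])
              for i in range(2) for j in range(2))

for n in (1, 2, 4, 8, 12):
    print(n, block_entropy(n) / n, H_rate)   # the ratio tends to H_rate
```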

Entropy Rate (4)

Exercise: Prove the following:

(1) For a stationary Markov chain with transition probability matrix $P = \{P_{ij}\}$ and stationary state vector $\mu = \{\mu_i\}$,

$$H(X) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}$$

(2) For a two-state stationary Markov chain,

$$H(X) = \frac{\beta}{\alpha+\beta} H(\alpha) + \frac{\alpha}{\alpha+\beta} H(\beta)$$

[State diagram of the two-state Markov chain: states $S_1$ and $S_2$, cross-transition probabilities $\alpha$ and $\beta$, self-loop probabilities $1-\alpha$ and $1-\beta$.]
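As a sanity check (not a proof) of the exercise, the sketch below evaluates both expressions for the two-state chain and confirms they coincide numerically; $\alpha$, $\beta$ and the helper Hb are illustrative assumptions.

```python
import math

# Numerical sanity check (not a proof) that the two expressions in the exercise
# agree for a two-state stationary chain. alpha and beta are arbitrary values.

def Hb(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

alpha, beta = 0.3, 0.2
P = [[1 - alpha, alpha], [beta, 1 - beta]]
mu = [beta / (alpha + beta), alpha / (alpha + beta)]

lhs = -sum(mu[i] * P[i][j] * math.log2(P[i][j])   # -sum_ij mu_i P_ij log P_ij
           for i in range(2) for j in range(2))
rhs = beta / (alpha + beta) * Hb(alpha) + alpha / (alpha + beta) * Hb(beta)
print(lhs, rhs)                                   # the two values coincide
```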

Entropy Rate (5)

Entropy of English:

(1) There are 26 letters plus the space symbol.
(2) The appearance probability of each letter, space excluded, is summarized in the table below:

Letter  Probability   Letter  Probability   Letter  Probability
A       8.29%         J       0.21%         S       6.33%
B       1.43%         K       0.48%         T       9.27%
C       3.68%         L       3.68%         U       2.53%
D       4.29%         M       3.23%         V       1.03%
E       12.8%         N       7.16%         W       1.62%
F       2.20%         O       7.28%         X       0.20%
G       1.71%         P       2.93%         Y       1.57%
H       4.54%         Q       0.11%         Z       0.09%
I       7.16%         R       6.90%

If we have no knowledge about the appearance probabilities of the letters, we need $\log_2 26 = 4.7$ bits to encode one English letter. If we have knowledge about the appearance probabilities, we need:

$$H(S) = -\sum_{i=1}^{26} p_i \log p_i = 4.17 \text{ bits}$$
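The sketch below recomputes this entropy from the table (the percentages are normalized so they sum to exactly 1) and compares it with $\log_2 26$; the result should come out close to the 4.17 bits quoted above.

```python
import math

# Entropy of English letters computed from the table above. The percentages are
# copied from the table and normalized so that they sum to exactly 1.

freq = {'A': 8.29, 'B': 1.43, 'C': 3.68, 'D': 4.29, 'E': 12.8, 'F': 2.20,
        'G': 1.71, 'H': 4.54, 'I': 7.16, 'J': 0.21, 'K': 0.48, 'L': 3.68,
        'M': 3.23, 'N': 7.16, 'O': 7.28, 'P': 2.93, 'Q': 0.11, 'R': 6.90,
        'S': 6.33, 'T': 9.27, 'U': 2.53, 'V': 1.03, 'W': 1.62, 'X': 0.20,
        'Y': 1.57, 'Z': 0.09}

total = sum(freq.values())
H = -sum((f / total) * math.log2(f / total) for f in freq.values())

print(math.log2(26))  # about 4.70 bits: no knowledge of the letter statistics
print(H)              # close to the 4.17 bits quoted on the slide
```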

Entropy Rate (7)

If we can evaluate the conditional probabilities $p(x_i \mid x_{i-1})$, $p(x_i \mid x_{i-1}, x_{i-2})$, …, $p(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_{i-n})$ empirically or theoretically and create a Markov model of the letter appearances, we can further reduce the rate required to encode English. Shannon's landmark paper presents artificially created English sentences generated from such models!
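To make the empirical estimation concrete, here is a minimal sketch (not from the slides) that counts unigrams and bigrams in a toy text and estimates the first-order conditional entropy as $H(X_{i-1}, X_i) - H(X_{i-1})$. The repeated pangram is only a stand-in; a meaningful estimate for English requires a large corpus.

```python
import math
from collections import Counter

# Sketch of estimating H(X_i | X_{i-1}) from bigram counts, using
# H(X_{i-1}, X_i) - H(X_{i-1}). The toy text is only a stand-in for a real corpus.

text = "the quick brown fox jumps over the lazy dog " * 50
text = "".join(c for c in text.lower() if c.isalpha())   # letters only, no spaces

unigrams = Counter(text)
bigrams = Counter(zip(text, text[1:]))

def entropy(counter):
    """Entropy in bits of the empirical distribution behind a Counter."""
    n = sum(counter.values())
    return -sum(c / n * math.log2(c / n) for c in counter.values())

H1 = entropy(unigrams)      # per-letter entropy estimate
H2 = entropy(bigrams) - H1  # estimate of H(X_i | X_{i-1})
print(H1, H2)               # conditioning on the previous letter lowers the rate
```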

Using empirical knowledge of:
- the letter probabilities $p(x_i)$
- $p(x_i \mid x_{i-1})$
- $p(x_i \mid x_{i-1}, x_{i-2})$

Entropy Rate (8)

Using empirical knowledge of:
- $p(x_i \mid x_{i-1}, x_{i-2}, x_{i-3})$
- the word appearance probability $p(w_i)$
- $p(w_i \mid w_{i-1}, w_{i-2})$

With the 4th order model, Shannon showed that 2.8 bits are enough to express one English letter!

Summary

We have made a journey through the following topics:
1. Revisit of Markov Chain
2. Entropy Rate
   - Stationary Process
   - Asymptotic Properties
   - Entropy Rate Chain Rule
   - Conditioning