INFORMATION THEORY & CODING Week 7: Entropy Rate

Dr. Rui Wang

Department of Electrical and Electronic Engineering, Southern University of Science and Technology (SUSTech)

Email: [email protected]

October 27, 2020

Review Summary

McMillan inequality

Uniquely decodable codes $\Leftrightarrow \sum_i D^{-\ell_i} \le 1$.

Huffman code

$$L^* = \min_{\sum_i D^{-\ell_i} \le 1} \sum_i p_i \ell_i, \qquad H_D(X) \le L^* < H_D(X) + 1.$$
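
As a quick sanity check of these bounds, here is a minimal Python sketch, assuming a binary code alphabet ($D = 2$) and an illustrative PMF of my own choosing, that builds a Huffman code with the standard-library heapq module and verifies $H(X) \le L^* < H(X) + 1$:

```python
import heapq
from math import log2

# Illustrative source PMF (an assumption, not from the slides).
pmf = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}

# Build a binary Huffman code: repeatedly merge the two least likely nodes,
# prefixing '0' to one subtree's codewords and '1' to the other's.
heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(pmf.items())]
heapq.heapify(heap)
counter = len(heap)          # tie-breaker so tuples never compare the dicts
while len(heap) > 1:
    p1, _, c1 = heapq.heappop(heap)
    p2, _, c2 = heapq.heappop(heap)
    merged = {s: "0" + w for s, w in c1.items()}
    merged.update({s: "1" + w for s, w in c2.items()})
    heapq.heappush(heap, (p1 + p2, counter, merged))
    counter += 1
code = heap[0][2]

H = -sum(p * log2(p) for p in pmf.values())           # entropy H(X) in bits
L = sum(pmf[s] * len(w) for s, w in code.items())     # expected codeword length
print(code, H, L)
assert H <= L < H + 1                                  # optimal-code bound
```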

Outline

On average, $nH(X) + 1$ bits suffice to describe $n$ i.i.d. random variables. But what if the random variables are dependent?

Markov Chain: the simplest way to model the correlations among random variables in a stochastic process.

Entropy Rate: the average number of bits needed to describe one random variable in a stochastic process.

How to Model Dependence: Markov Chains

A stochastic process $\{X_i\}$ is an indexed sequence of random variables $(X_1, X_2, \ldots)$ characterized by the joint PMF $p(x_1, x_2, \ldots, x_n)$, where $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ for $n = 1, 2, \ldots$.

Definition
A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index, i.e.,

$$\Pr[X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n] = \Pr[X_{1+\ell} = x_1, X_{2+\ell} = x_2, \ldots, X_{n+\ell} = x_n]$$

for every $n$, every shift $\ell$, and all $x_1, x_2, \ldots, x_n \in \mathcal{X}$.

Markov Chains

Definition

A discrete stochastic process $X_1, X_2, \ldots$ is said to be a Markov chain or a Markov process if, for $n = 1, 2, \ldots$,

$$\Pr[X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1] = \Pr[X_{n+1} = x_{n+1} \mid X_n = x_n]$$

for all $x_1, x_2, \ldots, x_n, x_{n+1} \in \mathcal{X}$.

In this case, the joint PMF can be written as

$$p(x_1, x_2, \ldots, x_n) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2) \cdots p(x_n \mid x_{n-1}).$$

Hence, a Markov chain is completely characterized by the initial distribution $p(x_1)$ and the transition probabilities $p(x_n \mid x_{n-1})$, $n = 2, 3, 4, \ldots$
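
As a small illustration of this point, here is a minimal sketch, with an assumed two-state alphabet, initial distribution, and transition matrix (none of them from the slides), that samples a path $X_1, \ldots, X_n$ using nothing but $p(x_1)$ and $p(x_n \mid x_{n-1})$:

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["S", "R"]                      # assumed two-state alphabet
p1 = np.array([0.6, 0.4])                # assumed initial distribution p(x1)
P = np.array([[0.8, 0.2],                # assumed transition probabilities:
              [0.3, 0.7]])               # row i is p(x_n = . | x_{n-1} = i)

def sample_path(n):
    """Sample X_1, ..., X_n from p(x1) and the transition probabilities only."""
    x = [rng.choice(len(states), p=p1)]
    for _ in range(n - 1):
        x.append(rng.choice(len(states), p=P[x[-1]]))
    return [states[i] for i in x]

print(sample_path(10))
```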

Markov Chains

Definition
A Markov chain is called time invariant if the transition probability $p(x_{n+1} \mid x_n)$ does NOT depend on $n$, i.e., for $n = 1, 2, \ldots$,

$$\Pr[X_{n+1} = b \mid X_n = a] = \Pr[X_2 = b \mid X_1 = a], \quad \forall a, b \in \mathcal{X}.$$

We deal with time invariant Markov chains. If $\{X_i\}$ is a Markov chain, $X_n$ is called the state at time $n$. A time invariant Markov chain is characterized by its initial state and a probability transition matrix $P = [P_{ij}]$, $i, j \in \{1, 2, \ldots, m\}$, where $P_{ij} = \Pr[X_{n+1} = j \mid X_n = i]$.

Markov Chain Example: Simple Weather Model

X = {Sunny: S, Rainy: R}

p(S|S) = 1 − β, p(R|R) = 1 − α, p(R|S) = β, p(S|R) = α

$$P = \begin{pmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{pmatrix}$$

Markov Chain Example: Simple Weather Model

Probability of seeing a sequence SSRR:

$$p(SSRR) = p(S)\,p(S|S)\,p(R|S)\,p(R|R) = p(S)(1-\beta)\,\beta\,(1-\alpha)$$

Suppose the first day is "Sunny" with probability $\gamma$. What is the weather distribution of the second day, the third day, and so on?

Stationary Distribution

If the PMF of the random variable at time $n$ is $\mu_i^{(n)} = \Pr[X_n = i]$, the PMF at time $n+1$, say $\mu_j^{(n+1)} = \Pr[X_{n+1} = j]$, can be written as

$$\mu_j^{(n+1)} = \sum_i \mu_i^{(n)} \Pr[X_{n+1} = j \mid X_n = i] = \sum_i \mu_i^{(n)} P_{ij}.$$

$\{\mu_i^{(n)} \mid \forall i\}$ is called a stationary distribution if $\mu_i^{(n)} = \mu_i^{(n+1)}$ for all $i$.
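
To see these definitions in action (and to answer the weather question from the previous slide), here is a minimal sketch with assumed values of $\alpha$, $\beta$ and $\gamma$ that iterates $\mu^{(n+1)} = \mu^{(n)} P$ and watches the distribution settle at the stationary distribution:

```python
import numpy as np

alpha, beta, gamma = 0.3, 0.1, 0.9        # illustrative parameter values
P = np.array([[1 - beta, beta],           # states ordered (S, R)
              [alpha, 1 - alpha]])

mu = np.array([gamma, 1 - gamma])         # day-1 distribution [Pr(S), Pr(R)]
for day in range(2, 12):
    mu = mu @ P                           # mu^(n+1) = mu^(n) P
    print(day, mu)

# The limit is the stationary distribution [alpha/(alpha+beta), beta/(alpha+beta)].
print(np.array([alpha, beta]) / (alpha + beta))
```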

Stationary Distribution

If $\mu = [\mu_S, \mu_R] = \left[\frac{\alpha}{\alpha+\beta}, \frac{\beta}{\alpha+\beta}\right]$ and

$$P = \begin{pmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{pmatrix},$$

then
$$p(X_{n+1} = S) = p(S|S)\,\mu_S + p(S|R)\,\mu_R = (1-\beta)\frac{\alpha}{\alpha+\beta} + \alpha\frac{\beta}{\alpha+\beta} = \frac{\alpha}{\alpha+\beta} = \mu_S.$$

Stationary Distribution

How to calculate the stationary distribution? The stationary distribution $\mu_i$, $i = 1, 2, \ldots, |\mathcal{X}|$, satisfies

$$\mu_j = \sum_{i=1}^{|\mathcal{X}|} \mu_i P_{ij} \quad\text{and}\quad \sum_{i=1}^{|\mathcal{X}|} \mu_i = 1.$$
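
One way to solve these equations numerically, sketched here with an illustrative transition matrix, is to stack the balance equations $\mu(P - I) = 0$ with the normalization constraint and do a least-squares solve:

```python
import numpy as np

def stationary_distribution(P):
    """Solve mu = mu P together with sum(mu) = 1 via least squares."""
    m = P.shape[0]
    # Transpose so the unknown mu is a column vector: (P^T - I) mu = 0.
    A = np.vstack([P.T - np.eye(m), np.ones((1, m))])
    b = np.concatenate([np.zeros(m), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

# Assumed weather-model matrix with alpha = 0.3, beta = 0.1.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
print(stationary_distribution(P))   # approximately [0.75, 0.25]
```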

Entropy Rate

When the $X_i$'s are i.i.d., the entropy

$$H(X^n) = H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i) = nH(X).$$

With dependent sequences of $X_i$'s, how does $H(X^n)$ grow with $n$?

The entropy rate characterizes this growth rate.

Entropy Rate

Definition 1: average entropy per symbol

$$H(X) = \lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n}$$

Definition 2: conditional entropy of the last r.v. given the past

$$H'(X) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)$$

Entropy Rate

Theorem 4.2.2

For a stationary stochastic process, $H(X_n \mid X_{n-1}, \ldots, X_1)$ is nonincreasing in $n$ and has a limit $H'(X)$.

Proof.

$$H(X_{n+1} \mid X_1, X_2, \ldots, X_n) \le H(X_{n+1} \mid X_n, \ldots, X_2) = H(X_n \mid X_{n-1}, \ldots, X_1),$$
where the inequality holds because conditioning reduces entropy and the equality follows from stationarity.

- $H(X_n \mid X_{n-1}, \ldots, X_1)$ decreases as $n$ increases
- Entropy is nonnegative: $H(X_n \mid X_{n-1}, \ldots, X_1) \ge 0$
- Hence the limit must exist (a nonincreasing sequence bounded below converges).

Entropy Rate

Theorem 4.2.1
For a stationary stochastic process, $H(X) = H'(X)$.

Proof. By the chain rule,
$$\frac{1}{n} H(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1).$$
By Theorem 4.2.2, $H(X_n \mid X_{n-1}, \ldots, X_1) \to H'(X)$.
Cesàro mean: if $a_n \to a$ and $b_n = \frac{1}{n}\sum_{i=1}^{n} a_i$, then $b_n \to a$. So
$$\frac{1}{n} H(X_1, \ldots, X_n) \to H'(X).$$
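
A small numerical illustration of the theorem, using an assumed stationary two-state Markov chain (the matrix and its stationary distribution are my own choices): it computes $H(X_1, \ldots, X_n)$ exactly by enumeration and compares the per-symbol entropy with the conditional entropy $H(X_n \mid X_{n-1}, \ldots, X_1)$ obtained from the chain rule:

```python
import numpy as np
from itertools import product

# Assumed stationary two-state Markov chain (not from the slides).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
mu = np.array([0.75, 0.25])          # its stationary distribution, so the process is stationary

def joint_entropy(n):
    """Exact H(X_1, ..., X_n) in bits, enumerating all 2^n sequences."""
    H = 0.0
    for seq in product(range(2), repeat=n):
        p = mu[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a, b]
        H -= p * np.log2(p)
    return H

for n in (2, 4, 8, 12):
    per_symbol = joint_entropy(n) / n                      # (1/n) H(X_1,...,X_n)
    conditional = joint_entropy(n) - joint_entropy(n - 1)  # H(X_n | X_{n-1},...,X_1)
    print(n, per_symbol, conditional)                      # both tend to the same limit
```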

AEP for Stationary Ergodic Process (Chapter 16)

$$-\frac{1}{n} \log p(X_1, \ldots, X_n) \to H(X)$$

$$p(X_1, \ldots, X_n) \approx 2^{-nH(X)}$$

Typical sequences lie in a typical set of about $2^{nH(X)}$ elements.

We can use $nH(X)$ bits to represent typical sequences.
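
A rough empirical check of the AEP statement, for an assumed stationary two-state Markov chain: sample one long path and compare $-\frac{1}{n}\log p(X_1, \ldots, X_n)$ with the chain's entropy rate (computed here with the Markov-chain formula that appears on the following slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed stationary two-state Markov chain (matrix and mu are illustrative).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
mu = np.array([0.75, 0.25])              # its stationary distribution

n = 50_000
x = np.empty(n, dtype=int)
x[0] = rng.choice(2, p=mu)
for t in range(1, n):
    x[t] = rng.choice(2, p=P[x[t - 1]])

# Per-symbol log-probability of the sampled sequence.
logp = np.log2(mu[x[0]]) + np.log2(P[x[:-1], x[1:]]).sum()
rate = -(mu[:, None] * P * np.log2(P)).sum()   # entropy rate H(X2 | X1) in bits
print(-logp / n, rate)                          # these two should be close
```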

Entropy Rate for Markov Chain

For a stationary Markov chain, the entropy rate is

$$H(X) = H'(X) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1) = \lim_{n\to\infty} H(X_n \mid X_{n-1}) = H(X_2 \mid X_1).$$

Let $P_{ij} = \Pr[X_2 = j \mid X_1 = i]$. By definition, the entropy rate of a stationary Markov chain is
$$H(X) = H(X_2 \mid X_1) = \sum_i \mu_i \Big( -\sum_j P_{ij} \log P_{ij} \Big) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}.$$

To Calculate Entropy Rate

1. Find the stationary distribution $\mu_i$:

$$\mu_i = \sum_j \mu_j P_{ji} \quad\text{and}\quad \sum_{i=1}^{|\mathcal{X}|} \mu_i = 1.$$

2. Use the transition probabilities $P_{ij}$:
$$H(X) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}$$
(both steps are sketched in code below).
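
A minimal sketch combining the two steps for an arbitrary (illustrative) transition matrix, using base-2 logarithms so the result is in bits per symbol:

```python
import numpy as np

def entropy_rate(P):
    """Entropy rate of a stationary Markov chain with transition matrix P (bits/symbol)."""
    m = P.shape[0]
    # Step 1: stationary distribution from mu = mu P and sum(mu) = 1.
    A = np.vstack([P.T - np.eye(m), np.ones((1, m))])
    b = np.concatenate([np.zeros(m), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Step 2: H(X) = -sum_ij mu_i P_ij log2 P_ij (entries with P_ij = 0 contribute 0).
    logs = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
    return -np.sum(mu[:, None] * P * logs)

# Illustrative 3-state chain (an assumption, not from the slides).
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
print(entropy_rate(P))
```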

Entropy Rate of Weather Model

Stationary distribution: $\mu(S) = \frac{\alpha}{\alpha+\beta}$, $\mu(R) = \frac{\beta}{\alpha+\beta}$

$$P = \begin{pmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{pmatrix}$$

$$H(X) = \mu(S) H(\beta) + \mu(R) H(\alpha) = \frac{\alpha}{\alpha+\beta} H(\beta) + \frac{\beta}{\alpha+\beta} H(\alpha) \le H\!\left(\frac{2\alpha\beta}{\alpha+\beta}\right),$$
where the last inequality follows from Jensen's inequality (concavity of $H$).

Maximum when $\alpha = \beta = 1/2$: the chain degenerates to an independent (i.i.d.) process.
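
A quick numerical check of the weather-model result, with assumed values of $\alpha$ and $\beta$ and a small helper H2 for the binary entropy function: the closed form $\mu(S)H(\beta) + \mu(R)H(\alpha)$ matches the general formula $-\sum_{ij}\mu_i P_{ij}\log P_{ij}$ and stays below the Jensen bound:

```python
import numpy as np

def H2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

alpha, beta = 0.3, 0.1                       # illustrative parameter values
P = np.array([[1 - beta, beta],
              [alpha, 1 - alpha]])
mu_S, mu_R = alpha / (alpha + beta), beta / (alpha + beta)

closed_form = mu_S * H2(beta) + mu_R * H2(alpha)              # mu(S)H(beta) + mu(R)H(alpha)
general = -sum(m * p * np.log2(p) for m, row in zip([mu_S, mu_R], P) for p in row)
upper = H2(2 * alpha * beta / (alpha + beta))                 # Jensen upper bound

print(closed_form, general, upper)   # closed_form == general, both <= upper
```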

Reading & Homework

Reading: Whole Chapter 4

Homework: Problems 4.7(a-d), 4.9
