Lecture 6: Entropy Rate
• Entropy rate H(X)
• Random walk on graph
Dr. Yao Xie, ECE587, Information Theory, Duke University

Coin tossing versus poker
• Toss a fair coin and see a sequence
Head, Tail, Tail, Head ···
p(x1, x2, ..., xn) ≈ 2^{−nH(X)}
• Play a card game with a friend and see a sequence
A♣ K♥ Q♦ J♠ 10♣ ···
p(x1, x2, ..., xn) ≈ ?
How to model dependence: Markov chain
• A stochastic process X1, X2, ···
– States {X1, ..., Xn}, each state Xi ∈ X
– The next step depends only on the previous state
p(xn+1|xn,..., x1) = p(xn+1|xn).
– Transition probability
  Pi j: the transition probability of i → j
– p(xn+1) = ∑_{xn} p(xn) p(xn+1|xn)
– p(x1, x2, ··· , xn) = p(x1) p(x2|x1) ··· p(xn|xn−1)
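The chain-rule factorization above can be sketched in code. This is a minimal illustration with an assumed 2-state chain; the transition probabilities and initial distribution are made-up numbers, not from the lecture.

```python
import random

# Assumed 2-state transition probabilities P[i][j] = p(next = j | current = i)
P = {0: {0: 0.9, 1: 0.1},
     1: {0: 0.5, 1: 0.5}}
p_init = {0: 0.5, 1: 0.5}  # assumed initial distribution p(x1)

def joint_prob(seq):
    # p(x1, ..., xn) = p(x1) * prod_i p(x_{i+1} | x_i), by the Markov property
    prob = p_init[seq[0]]
    for a, b in zip(seq, seq[1:]):
        prob *= P[a][b]
    return prob

def sample_path(n, seed=0):
    # Simulate n steps: the next state depends only on the current one
    rng = random.Random(seed)
    x = rng.choices([0, 1], weights=[p_init[0], p_init[1]])[0]
    path = [x]
    for _ in range(n - 1):
        x = rng.choices([0, 1], weights=[P[x][0], P[x][1]])[0]
        path.append(x)
    return path
```

For example, joint_prob([0, 0, 1]) multiplies p(x1 = 0), P[0][0], and P[0][1], exactly the factorization on the slide.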
Hidden Markov model (HMM)
• Used extensively in speech recognition, handwriting recognition, machine learning.
• Markov process X1, X2,..., Xn, unobservable
• Observe a random process Y1, Y2,..., Yn, such that
Yi ∼ p(yi|xi)
• We can build a probability model
p(x^n, y^n) = p(x1) ∏_{i=1}^{n−1} p(xi+1|xi) ∏_{i=1}^{n} p(yi|xi)
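The HMM joint probability above factors into an initial term, transition terms, and emission terms. A small sketch with an assumed toy model (all numbers illustrative):

```python
# Toy HMM: hidden states 0/1, observations 'a'/'b' (all values assumed)
A = [[0.8, 0.2], [0.3, 0.7]]          # transitions p(x_{i+1} | x_i)
B = [{'a': 0.9, 'b': 0.1},            # emissions p(y_i | x_i = 0)
     {'a': 0.2, 'b': 0.8}]            # emissions p(y_i | x_i = 1)
pi = [0.6, 0.4]                       # initial distribution p(x1)

def joint(xs, ys):
    # p(x^n, y^n) = p(x1) * prod_{i<n} p(x_{i+1}|x_i) * prod_i p(y_i|x_i)
    p = pi[xs[0]]
    for a, b in zip(xs, xs[1:]):
        p *= A[a][b]
    for x, y in zip(xs, ys):
        p *= B[x][y]
    return p
```

Each factor in the code matches one factor in the slide's product formula.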
Time-invariant Markov chain
• A Markov chain is time invariant if the conditional probability p(xn|xn−1) does not depend on n
p(Xn+1 = b|Xn = a) = p(X2 = b|X1 = a), for all a, b ∈ X
• For such a Markov chain, define the transition matrix

      [ P11 ··· P1n ]
  P = [  ⋮   ⋱   ⋮  ]
      [ Pn1 ··· Pnn ]
Simple weather model
• X = {Sunny: S, Rainy: R}
• p(S|S) = 1 − β, p(R|R) = 1 − α, p(R|S) = β, p(S|R) = α

      [ 1 − β    β   ]
  P = [   α    1 − α ]
[Figure: two-state transition diagram between S and R with transition probabilities α and β]
• Probability of seeing a sequence SSRR:
p(SSRR) = p(S)p(S|S)p(R|S)p(R|R) = p(S)(1 − β)β(1 − α)
• How will such sequences behave after many days of observations?
• What sequences of observations are more typical?
• What is the probability of seeing a typical sequence?
Stationary distribution
• Stationary distribution: a distribution µ on the states such that the distribution at time n + 1 is the same as the distribution at time n.
• Our weather example:
  – If µ(S) = α/(α + β), µ(R) = β/(α + β)

        [ 1 − β    β   ]
    P = [   α    1 − α ]
– Then
p(Xn+1 = S) = p(S|S)µ(S) + p(S|R)µ(R)
            = (1 − β) · α/(α + β) + α · β/(α + β)
            = α/(α + β) = µ(S).
• How to calculate stationary distribution
– Stationary distribution µi, i = 1, ··· , |X| satisfies
µi = ∑_j µ j P ji  (i.e., µ = µP),  and  ∑_{i=1}^{|X|} µi = 1.
– “Detailed balance”: µi Pi j = µ j P ji for every pair of states i, j
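The fixed-point condition µ = µP can be solved numerically by iterating µ ← µP. A minimal sketch for the weather chain, assuming example values α = 0.3, β = 0.1 (not from the lecture):

```python
# Power iteration for the stationary distribution mu = mu P
alpha, beta = 0.3, 0.1              # assumed example parameters
P = [[1 - beta, beta],              # row S: p(S|S), p(R|S)
     [alpha, 1 - alpha]]            # row R: p(S|R), p(R|R)

mu = [0.5, 0.5]                     # any starting distribution
for _ in range(1000):
    # one step of mu <- mu P
    mu = [sum(mu[j] * P[j][i] for j in range(2)) for i in range(2)]

# For this chain the closed form is mu(S) = alpha/(alpha+beta),
# mu(R) = beta/(alpha+beta), so the iterate should match it
```

Since the chain is ergodic, the iteration converges geometrically; the second eigenvalue here is 1 − α − β.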
Stationary process
• A stochastic process is stationary if the joint distribution of any subset is invariant to time-shift
p(X1 = x1, ··· , Xn = xn) = p(X2 = x1, ··· , Xn+1 = xn).
• Example: coin tossing
p(X1 = head, X2 = tail) = p(X2 = head, X3 = tail) = p(1 − p).
Entropy rate

• When the Xi are i.i.d., the entropy is H(X^n) = H(X1, ··· , Xn) = ∑_{i=1}^{n} H(Xi) = nH(X)
• With a dependent sequence Xi, how does H(X^n) grow with n? Still linearly?
• Entropy rate characterizes the growth rate
• Definition 1: average entropy per symbol
H(X) = lim_{n→∞} (1/n) H(X^n)
• Definition 2: rate of information innovation
H′(X) = lim_{n→∞} H(Xn|Xn−1, ··· , X1)
• H′(X) exists for stationary Xi
H(Xn|X1, ··· , Xn−1) ≤ H(Xn|X2, ··· , Xn−1)   (conditioning reduces entropy)
                     = H(Xn−1|X1, ··· , Xn−2)   (stationarity)

– So H(Xn|X1, ··· , Xn−1) is non-increasing in n
– It is bounded below by 0
– Hence the limit exists
• H(X) = H′(X) for stationary Xi
(1/n) H(X1, ··· , Xn) = (1/n) ∑_{i=1}^{n} H(Xi|Xi−1, ··· , X1)   (chain rule)
′ • Each H(Xn|X1, ··· , Xn−1) → H (X)
• Cesàro mean: if an → a and bn = (1/n) ∑_{i=1}^{n} ai, then bn → a.
• So (1/n) H(X1, ··· , Xn) → H′(X)
AEP for stationary process
−(1/n) log p(X1, ··· , Xn) → H(X)
• p(X1, ··· , Xn) ≈ 2^{−nH(X)}
• Typical sequences form a typical set of size ≈ 2^{nH(X)}
• We can use about nH(X) bits to represent a typical sequence
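The AEP statement can be checked by simulation: along a long sample path, −(1/n) log2 p(X1, ..., Xn) should settle near the entropy rate. A sketch using the weather chain with assumed parameters α = 0.3, β = 0.1 (illustrative values, not from the lecture):

```python
import math
import random

alpha, beta = 0.3, 0.1                       # assumed example parameters
P = [[1 - beta, beta], [alpha, 1 - alpha]]   # states: 0 = S, 1 = R
mu = [alpha / (alpha + beta), beta / (alpha + beta)]  # stationary distribution

rng = random.Random(1)
n = 100_000
x = rng.choices([0, 1], weights=mu)[0]
log_p = math.log2(mu[x])
for _ in range(n - 1):
    nx = rng.choices([0, 1], weights=P[x])[0]
    log_p += math.log2(P[x][nx])             # accumulate log p of the path
    x = nx

# Entropy rate H(X) = -sum_{i,j} mu_i P_ij log2 P_ij
H_rate = -sum(mu[i] * P[i][j] * math.log2(P[i][j])
              for i in range(2) for j in range(2))
print(-log_p / n, H_rate)   # the two numbers should be close for large n
```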
Entropy rate for Markov chain
• For Markov chain
H(X) = lim H(Xn|Xn−1, ··· , X1) = lim H(Xn|Xn−1) = H(X2|X1)
• By definition p(X2 = j|X1 = i) = Pi j
• Entropy rate of Markov chain

  H(X) = − ∑_{i, j} µi Pi j log Pi j
Calculating the entropy rate is fairly easy
1. Find stationary distribution µi
2. Use the transition probabilities Pi j:

   H(X) = − ∑_{i, j} µi Pi j log Pi j
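The two steps above translate directly into a short routine. This is a sketch assuming an ergodic chain, so the stationary distribution can be found by power iteration; the 3-state example matrix is made up for illustration.

```python
import math

def entropy_rate(P, iters=2000):
    # Step 1: find the stationary distribution by iterating mu <- mu P
    # (assumes the chain is ergodic so the iteration converges)
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[j] * P[j][i] for j in range(n)) for i in range(n)]
    # Step 2: H(X) = -sum_{i,j} mu_i P_ij log2 P_ij (skip zero entries)
    return -sum(mu[i] * P[i][j] * math.log2(P[i][j])
                for i in range(n) for j in range(n) if P[i][j] > 0)

# Example (numbers assumed): a 3-state chain
P3 = [[0.5, 0.25, 0.25],
      [0.2, 0.6, 0.2],
      [0.3, 0.3, 0.4]]
```

As a sanity check, a chain with uniform rows [[0.5, 0.5], [0.5, 0.5]] is an i.i.d. fair coin, so its entropy rate is 1 bit.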
Entropy rate of weather model

Stationary distribution µ(S) = α/(α + β), µ(R) = β/(α + β)

      [ 1 − β    β   ]
  P = [   α    1 − α ]
H(X) = α/(α + β) [−β log β − (1 − β) log(1 − β)] + β/(α + β) [−α log α − (1 − α) log(1 − α)]
     = α/(α + β) H(β) + β/(α + β) H(α)
     ≤ H(2αβ/(α + β)) ≤ H(√(αβ))   (by concavity of H)
Maximum when α = β = 1/2: the chain degenerates to an independent process
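The closed form and the concavity bound can be verified numerically. A sketch with assumed values α = 0.3, β = 0.1 (illustrative, not from the lecture):

```python
import math

def Hb(p):
    # binary entropy in bits, H(p) = -p log2 p - (1-p) log2 (1-p)
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

alpha, beta = 0.3, 0.1                       # assumed example values
muS, muR = alpha / (alpha + beta), beta / (alpha + beta)

# Closed form from the slide
H_rate = muS * Hb(beta) + muR * Hb(alpha)

# The two upper bounds from the concavity chain
bound1 = Hb(2 * alpha * beta / (alpha + beta))
bound2 = Hb(math.sqrt(alpha * beta))
print(H_rate, bound1, bound2)
```

With these values the entropy rate sits below both bounds, and at α = β = 1/2 all three quantities equal 1 bit.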
Random walk on graph
• An undirected graph with m nodes {1,..., m}
• Edge (i, j) has weight Wi j ≥ 0 (Wi j = W ji)
• A particle walks randomly from node to node
• Random walk X1, X2, ··· : a sequence of vertices
• Given Xn = i, next step chosen from neighboring nodes with probability
Pi j = Wi j / ∑_k Wik
Entropy rate of random walk on graph
• Let Wi = ∑_j Wi j,  W = ∑_{i, j: i > j} Wi j
• The stationary distribution is µi = Wi / (2W)
• Can verify this is a stationary distribution: µP = µ
• The stationary probability of node i is proportional to the total weight of edges emanating from it (locality)
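The claim µi = Wi/(2W) can be verified directly on a small example: build P from the weights, form µ, and check µP = µ. The 3-node weighted graph below is an assumed example, not from the lecture.

```python
import math

# Small weighted undirected graph (weights assumed), W[i][j] = W[j][i]
W = [[0, 1, 2],
     [1, 0, 3],
     [2, 3, 0]]
m = len(W)
Wi = [sum(row) for row in W]          # total weight at node i
Wtot = sum(Wi) / 2                    # each undirected edge counted once

# Transition probabilities P_ij = W_ij / W_i and candidate stationary mu_i = W_i / (2W)
P = [[W[i][j] / Wi[i] for j in range(m)] for i in range(m)]
mu = [Wi[i] / (2 * Wtot) for i in range(m)]

# Check stationarity: (mu P)_j should equal mu_j for every j
muP = [sum(mu[i] * P[i][j] for i in range(m)) for j in range(m)]

# Entropy rate of the walk: H(X) = -sum_{i,j} mu_i P_ij log2 P_ij
H = -sum(mu[i] * P[i][j] * math.log2(P[i][j])
         for i in range(m) for j in range(m) if P[i][j] > 0)
```

The check works because µi Pi j = Wi j/(2W) is symmetric in i and j, which is exactly the detailed-balance condition.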
Summary
• AEP for a stationary process X1, X2, ··· with Xi ∈ X: as n → ∞
p(x1, ··· , xn) ≈ 2^{−nH(X)}
• Entropy rate
H(X) = lim_{n→∞} H(Xn|Xn−1, ..., X1) = lim_{n→∞} (1/n) H(X1, ..., Xn)
• Random walk on graph: µi = Wi / (2W)