Entropy Rates & Markov Chains Information Theory Duke University, Fall 2020

ECE 587 / STA 563: Lecture 4 - Entropy Rates & Markov Chains
Information Theory, Duke University, Fall 2020
Author: Galen Reeves
Last Modified: September 1, 2020

Outline of lecture:
4.1 Entropy Rate
4.2 Markov Chains
4.3 Card Shuffling and MCMC
4.4 Entropy of the English Language [Optional]

4.1 Entropy Rate

• A stochastic process $\{X_i\}$ is an indexed sequence of random variables. They need not be independent nor identically distributed. For each $n$, the distribution on $X^n = (X_1, X_2, \cdots, X_n)$ is characterized by the joint pmf
    $$p_{X^n}(x^n) = p(x_1, x_2, \cdots, x_n)$$

• A stochastic process provides a useful probabilistic model. Examples include:
  ◦ the state of a deck of cards after each shuffle
  ◦ the location of a molecule each millisecond
  ◦ the temperature each day
  ◦ the sequence of letters in a book
  ◦ the closing price of the stock market each day

• With dependent sequences, how does the entropy $H(X^n)$ grow with $n$?
  (1) Definition 1: Average entropy per symbol
    $$H(\mathcal{X}) = \lim_{n \to \infty} \frac{H(X_1, X_2, \cdots, X_n)}{n}$$
  (2) Definition 2: Rate of information innovation
    $$H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_1, X_2, \cdots, X_{n-1})$$

• If $\{X_i\}$ are iid $\sim p(x)$, then both limits exist and are equal to $H(X)$:
    $$H(\mathcal{X}) = \lim_{n \to \infty} \frac{n H(X)}{n} = H(X)$$
    $$H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_1, X_2, \cdots, X_{n-1}) = H(X)$$
  But what if the symbols are not iid?

• A stochastic process is stationary if the joint distribution of subsets is invariant to shifts in the time index, i.e.
    $$\mathbb{P}[X_1 = x_1, X_2 = x_2, \cdots, X_n = x_n] = \mathbb{P}[X_{1+k} = x_1, X_{2+k} = x_2, \cdots, X_{n+k} = x_n]$$
  for every $n$, every shift $k$, and all $x^n \in \mathcal{X}^n$.

• Examples:
  ◦ State of a card shuffle when started with a uniformly random permutation? Yes
  ◦ State of a card shuffle when started from a sorted deck? No
  ◦ Daily high temperature when $X_1$ corresponds to January 1? No
  ◦ Daily high temperature when $X_1$ is distributed uniformly over days in the year?
    Yes

• Theorem: For a stationary process, the limits in $H(\mathcal{X})$ and $H'(\mathcal{X})$ exist and are equal.

• Proof: $H'(\mathcal{X})$ exists
  ◦ Observe that
    $$H(X_n \mid X_1, \cdots, X_{n-1}) \le H(X_n \mid X_2, \cdots, X_{n-1}) \quad \text{(conditioning cannot increase entropy)}$$
    $$\phantom{H(X_n \mid X_1, \cdots, X_{n-1})} = H(X_{n-1} \mid X_1, \cdots, X_{n-2}) \quad \text{(since stationary)}$$
  ◦ Therefore, $H(X_n \mid X_1, \cdots, X_{n-1})$ is non-increasing in $n$.
  ◦ Also, $H(X_n \mid X_1, \cdots, X_{n-1}) \ge 0$, so the sequence is bounded below.
  ◦ We can apply the monotone convergence theorem to conclude that the limit must exist.

• Proof: $H(\mathcal{X}) = H'(\mathcal{X})$
  ◦ By the chain rule for entropy,
    $$\frac{1}{n} H(X_1, \cdots, X_n) = \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_1, \cdots, X_{i-1})$$
  ◦ From the arguments above, we know that
    $$H(X_i \mid X_1, \cdots, X_{i-1}) \to H'(\mathcal{X})$$
  ◦ To conclude, we use the fact that convergence of a sequence implies convergence of its Cesàro mean:
    $$a_n \to a \implies \underbrace{\frac{1}{n} \sum_{i=1}^{n} a_i}_{\text{Cesàro mean}} \to a$$

• Recall the AEP for iid sources:
    $$-\frac{1}{n} \log p(X_1, X_2, \cdots, X_n) \to H(X)$$

• Theorem: (AEP for stochastic sources) For a stationary ergodic process,
    $$-\frac{1}{n} \log p(X_1, X_2, \cdots, X_n) \to H(\mathcal{X})$$
  ◦ Proof given in Section 16.8 of [CT].
  ◦ You will see an example of a process that is stationary but not ergodic in your homework.
  ◦ Note that ergodic means "time averages converge to ensemble averages".

4.2 Markov Chains

• A discrete-time stochastic process $\{X_1, X_2, \cdots\}$ is said to be a Markov chain or Markov process if, for all $n = 1, 2, \cdots$,
    $$\mathbb{P}[X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \cdots, X_1 = x_1] = \mathbb{P}[X_{n+1} = x_{n+1} \mid X_n = x_n]$$
  for all $x^{n+1} \in \mathcal{X}^{n+1}$.

• A Markov chain is time invariant if the conditional probability $p(x_{n+1} \mid x_n)$ does not depend on $n$, i.e.
    $$\mathbb{P}[X_{n+1} = b \mid X_n = a] = \mathbb{P}[X_2 = b \mid X_1 = a] \quad \text{for all } a, b \in \mathcal{X}$$

• A time-invariant Markov chain with $\mathcal{X} = \{1, 2, \cdots, M\}$ is characterized by an $M \times M$ probability transition matrix $P$ where
    $$P_{i,j} = \mathbb{P}[X_{n+1} = j \mid X_n = i], \quad i, j \in \{1, 2, \cdots, M\}$$
  The vector of probabilities at time $n$ is denoted by
    $$\mu_n = (\mathbb{P}[X_n = 1], \mathbb{P}[X_n = 2], \cdots, \mathbb{P}[X_n = M])$$
  and is updated according to
    $$\mu_{n+1} = \mu_n P$$

• Example: Consider the following time-invariant Markov chain with $\mathcal{X} = \{1, 2, 3\}$ and probability transition matrix
    $$P = \begin{bmatrix} 1-\alpha & \alpha & 0 \\ \beta & 1-\alpha-\beta & \alpha \\ 0 & \beta & 1-\beta \end{bmatrix}$$
  [Figure: state transition diagram on states 1, 2, 3 with self-loop probabilities $1-\alpha$, $1-\alpha-\beta$, $1-\beta$, right-moving transitions $\alpha$, and left-moving transitions $\beta$.]

• A time-invariant Markov process is completely specified by three items:
  (1) the set of all states $\mathcal{X}$
  (2) the $|\mathcal{X}| \times |\mathcal{X}|$ transition probability matrix $P$
  (3) the starting distribution $\mu_1$

• A distribution $\mu$ is a stationary distribution of a time-invariant Markov process if
    $$\mu = \mu P$$

• Example: For the Markov process described in the previous example, the stationary distribution is
    $$\mu = \left( \frac{1}{1 + r + r^2}, \; \frac{r}{1 + r + r^2}, \; \frac{r^2}{1 + r + r^2} \right)$$
  where $r = \alpha/\beta$. Note that if $\alpha = \beta$, then this is the uniform distribution, since each state has probability $1/3$.

• Theorem: Consider a time-invariant Markov process that is irreducible and aperiodic. Then
  (1) The stationary distribution $\mu$ is unique.
  (2) Independently of the starting distribution $\mu_1$, the distribution $\mu_n$ converges to the stationary distribution $\mu$ as $n \to \infty$.
  (3) The Markov process is stationary if, and only if, the starting distribution $\mu_1$ is chosen to be the stationary (steady-state) distribution.

• Theorem: For a stationary time-invariant Markov process, the entropy rate is given by
    $$H(\mathcal{X}) = H(X_2 \mid X_1)$$
  where the conditional entropy is calculated using the stationary distribution.

4.3 Card Shuffling and MCMC

• Markov Chain Monte Carlo (MCMC) is a widely used technique for computing functions of distributions.
  The basic idea is as follows:
  ◦ Suppose you want to compute the expectation $\mathbb{E}[f(X)]$ where $X \sim Q$.
  ◦ If you could draw samples $\{X_i\}$ iid $\sim Q$, then you could approximate the expectation using the law of large numbers:
    $$\frac{1}{n} \sum_{i=1}^{n} f(X_i) \to \mathbb{E}[f(X)]$$
  ◦ However, drawing independent samples can be tricky. One method is to set up an irreducible and aperiodic Markov chain $\{Y_n\}$ whose stationary distribution is equal to $Q$. Then, if one starts from an arbitrary point $Y_1$ and waits long enough, the distribution of $Y_n$ will converge to $Q$ and $Y_n$ will be (approximately) independent of the starting point.
  ◦ The key question is how long it takes before the distribution of $Y_n$ is sufficiently close to $Q$ and sufficiently independent of $Y_1$.

• Let $X$ be the state of a deck of cards. There are $|\mathcal{X}| = 52!$ possibilities. Thus, if $X$ is distributed uniformly, the entropy is
    $$H(X) = \log 52! = \sum_{i=1}^{52} \log(i) \approx 226 \text{ (bits)}$$

• Markov chain of one-at-a-time shuffling. Let $X_1, X_2, \cdots, X_n$ be a sequence of card shuffles where $X_{n+1}$ is the state of the deck after a card is selected uniformly at random from one of the 52 locations and placed on the top.
  ◦ The entropy satisfies
    $$H(X_{n+1} \mid X_n, X_{n-1}, \cdots, X_1) = H(X_{n+1} \mid X_n) = \log 52 \approx 5.7 \text{ (bits)}$$
  ◦ By symmetry, the stationary distribution of this Markov chain is the uniform distribution. The Markov chain is irreducible and aperiodic, and thus $H(X_n) \to \log(52!)$ as $n \to \infty$.

• How many one-at-a-time shuffles are needed to produce a state that is uniformly distributed and independent of the starting point $X_0$?
  ◦ Independence requires that $I(X_0; X_n) = 0$. Thus, $X_n$ is uniformly distributed and independent of $X_0$ if and only if
    $$H(X_n \mid X_0) = \log 52!$$
  ◦ Let $N_n$ denote the location of the card that is moved to the top in the $n$-th shuffle, for $n = 1, 2, 3, \cdots$.
  ◦ The joint entropy can be expressed as
    $$H(X_0, X_1, X_2, \cdots, X_n) = H(X_0, N_1, N_2, \cdots, N_n) = H(X_0) + n H(N_1)$$
    and also as
    $$H(X_0, X_1, \cdots, X_n) = H(X_0) + H(X_1, \cdots, X_n \mid X_0) = H(X_0) + H(X_n \mid X_0) + H(X_1, \cdots, X_{n-1} \mid X_n, X_0)$$
  ◦ Thus, the conditional entropy after $n$ shuffles is upper bounded by
    $$H(X_n \mid X_0) \le n H(N_1) = n \log(52)$$
    This provides a lower bound on the minimum number of shuffles that are needed:
    $$n < \frac{\log 52!}{\log 52} \approx 39.6 \implies H(X_n \mid X_0) < \log(52!)$$
    so at least 40 shuffles are required.
  ◦ The bound above is necessary, but is it sufficient? No. In fact, it can be shown that
    $$H(X_n \mid X_0) < \log 52! \quad \text{for all } n$$
    In other words, there is always some memory of the initial state $X_0$.

• Modified one-at-a-time shuffle: Let's instead try a modified one-at-a-time shuffle: on the $n$-th shuffle, we draw a number $M_n$ uniformly on $\{n, n+1, n+2, \cdots, 52\}$ and place the $M_n$-th card on the top. After $n = 52$ no more changes are made to the deck.
  ◦ The joint entropy can be decomposed as
    $$H(X_0, X_1, \cdots, X_n) = H(X_0, M_1, M_2, \cdots, M_n) = H(X_0) + \sum_{i=1}^{n} H(M_i)$$
  ◦ Given the first state $X_0$ and the last state $X_n$, one can reconstruct the sequence $\{M_i\}_{i=1}^{n}$ by looking at the top cards in the deck. Thus,
    $$H(X_1, \cdots, X_{n-1} \mid X_n, X_0) = 0$$
  ◦ Combining the above displays yields
    $$H(X_{51} \mid X_0) = \sum_{i=1}^{51} H(M_i) = \sum_{i=1}^{51} \log(52 - i + 1) = \log 52!$$
  So, after exactly 51 shuffles we have a truly random permutation, regardless of our starting point.
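
The counting argument for the modified one-at-a-time shuffle can be sanity-checked by brute force on a small deck. The sketch below is an illustration under stated assumptions, not part of the original notes: the function names `modified_shuffle` and `all_outcomes` and the deck size `k = 5` are hypothetical choices. It enumerates every pick sequence $(M_1, \cdots, M_{k-1})$ and checks that each of the $k!$ permutations is produced exactly once, and it also evaluates the $\log 52! / \log 52$ bound for the unmodified shuffle.

```python
import itertools
import math


def modified_shuffle(deck, picks):
    """Apply the modified one-at-a-time shuffle to `deck` (index 0 is the top).

    picks[n-1] plays the role of M_n: a 1-indexed position in {n, ..., len(deck)}
    whose card is moved to the top on the n-th shuffle.
    """
    deck = list(deck)
    for n, m in enumerate(picks, start=1):
        if not n <= m <= len(deck):
            raise ValueError(f"pick {m} on shuffle {n} is out of range")
        deck.insert(0, deck.pop(m - 1))
    return tuple(deck)


def all_outcomes(k):
    """Final deck for every possible pick sequence (M_1, ..., M_{k-1})."""
    deck = tuple(range(k))
    pick_ranges = [range(n, k + 1) for n in range(1, k)]
    return [modified_shuffle(deck, picks)
            for picks in itertools.product(*pick_ranges)]


k = 5  # hypothetical small deck so that full enumeration is cheap
outcomes = all_outcomes(k)
print("pick sequences:", len(outcomes))
print("distinct final decks:", len(set(outcomes)))

# Entropy bound for the unmodified shuffle on a 52-card deck (in bits).
bits_uniform = sum(math.log2(i) for i in range(1, 53))  # log2(52!)
ratio = bits_uniform / math.log2(52)
min_shuffles = math.ceil(ratio)
print("log2(52!) =", round(bits_uniform, 2))
print("log 52! / log 52 =", round(ratio, 2))
print("shuffles required by the bound:", min_shuffles)
```

Because the pick sequences are equally likely and map one-to-one onto permutations, the final deck is uniform regardless of the starting order, which is the small-deck analogue of $H(X_{51} \mid X_0) = \log 52!$.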

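As a numerical companion to the three-state example chain of Section 4.2, the following sketch checks the closed-form stationary distribution and evaluates the entropy rate using $H(X_2 \mid X_1) = -\sum_i \mu_i \sum_j P_{i,j} \log P_{i,j}$. The values `alpha = 0.2` and `beta = 0.4`, the function names, and the iteration count are assumptions made for illustration only; they do not come from the notes.

```python
import math


def transition_matrix(alpha, beta):
    """Transition matrix P of the three-state example chain."""
    return [
        [1 - alpha, alpha, 0.0],
        [beta, 1 - alpha - beta, alpha],
        [0.0, beta, 1 - beta],
    ]


def stationary_distribution(alpha, beta):
    """Closed form (1, r, r^2) / (1 + r + r^2) with r = alpha / beta."""
    r = alpha / beta
    z = 1 + r + r * r
    return [1 / z, r / z, r * r / z]


def step(mu, P):
    """One update mu_{n+1} = mu_n P (row vector times matrix)."""
    m = len(P)
    return [sum(mu[i] * P[i][j] for i in range(m)) for j in range(m)]


def entropy_rate(mu, P):
    """H(X_2 | X_1) in bits, with X_1 drawn from the distribution mu."""
    h = 0.0
    for i, row in enumerate(P):
        for p in row:
            if p > 0:
                h -= mu[i] * p * math.log2(p)
    return h


alpha, beta = 0.2, 0.4  # hypothetical parameter values
P = transition_matrix(alpha, beta)
mu = stationary_distribution(alpha, beta)

# Start from a point mass on state 1 and iterate mu_{n+1} = mu_n P.
mu_n = [1.0, 0.0, 0.0]
for _ in range(500):
    mu_n = step(mu_n, P)

rate = entropy_rate(mu, P)
print("closed-form stationary distribution:", mu)
print("distribution after 500 steps:", mu_n)
print("entropy rate (bits per step):", rate)
```

The point-mass start is deliberately far from stationary, so agreement between `mu_n` and `mu` illustrates convergence from an arbitrary starting distribution for this particular chain.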