arXiv:1910.00941v1 [cs.IT] 2 Oct 2019

A Self-contained Analysis of the Lempel-Ziv Compression Algorithm

Madhu Sudan∗   David Xiang†

October 2, 2019

Abstract

This article gives a self-contained analysis of the performance of the Lempel-Ziv compression algorithm on (hidden) Markovian sources. Specifically we include a full proof of the assertion that the compression rate approaches the entropy rate of the chain being compressed.

∗ Harvard John A. Paulson School of Engineering and Applied Sciences, Harvard University, 33 Oxford Street, Cambridge, MA 02138, USA. [email protected]. Work supported in part by a Simons Investigator Award and NSF Award CCF 1715187.
† Harvard College. [email protected].

1 Introduction

In the late 1970's Abraham Lempel and Jacob Ziv [LZ76, ZL77, ZL78] gave some extremely simple, clever and efficient algorithms that were able to universally compress outputs of "nice" stochastic processes down to their entropy rate. While their algorithms are well-known, the analysis of these algorithms is not as widely understood. The aim of this article is to remedy this situation by providing a self-contained statement and analysis of their algorithm for the special case of "(hidden) Markov models". Our primary hope is that this article can form the basis of lectures in undergraduate courses that teach this algorithm along with its analysis, to students across a broad spectrum of disciplines: the notions we use here are essentially the same as those used in elementary courses on discrete probability (as in, say, [MU17, Chapter 1]), basic facts about Markov chains (e.g., [LPW17, MR95]), and elementary information theory [CT06, Chapter 2]. The actual exposition here is from scratch, though we owe our understanding of the specific claims and proofs, as well as most of the analysis, to the original articles and to unpublished notes of Bob Gallager from the 1990s [Gal94].

We now turn to stating the main theorem we wish to prove. In Section 1.1 we introduce some of the basic terminology that will allow us to define a hidden Markov model and its entropy rate. We also define what it means for a compression algorithm to be universal (for the class of hidden Markov models). In Section 1.2 we then give the version of the Lempel-Ziv algorithm that we describe and analyze, and in Section 1.3 we state the main theorem about this algorithm, namely that this version of the Lempel-Ziv algorithm is universal for hidden Markov models.

1.1 Definitions

A random sequence $Z_0, Z_1, Z_2, \ldots$, with $Z_t \in [k]$, is said to be a (time-invariant) finite state Markov chain if there is a $k \times k$ matrix $M$ such that for every $i, j \in [k]$ and $t \ge 0$ it is the case that $\Pr[Z_{t+1} = j \mid Z_t = i] = M_{ij}$. Such a Markov chain is said to be specified by the matrix $M$ and the distribution $\Pi_0 \in \Delta([k])$ of the random variable $Z_0$. We say that $Z_0, Z_1, Z_2, \ldots$ is a $k$-state Markov chain. The Markov chain is thus essentially specified by a weighted directed graph (possibly with self loops) corresponding to the matrix $M$. The chain is said to be irreducible if the underlying graph is strongly connected. Equivalently, a chain $Z_0, \ldots, Z_t, \ldots$ is irreducible if for every $i, j \in [k]$ there exists $t$ such that $\Pr[Z_t = j \mid Z_0 = i] > 0$. Similarly a Markov chain is said to be aperiodic if the greatest common divisor of the cycle lengths in the underlying graph is 1. Formally, we say that the chain $Z_0, Z_1, Z_2, \ldots$ has a cycle of length $\ell > 0$ if there exist $t$ and a state $i$ such that $\Pr[Z_{t+\ell} = i \mid Z_t = i] > 0$; the chain is said to be aperiodic if the greatest common divisor of its cycle lengths is 1.

Throughout this article we will consider only irreducible aperiodic Markov chains. (The study can be extended to the periodic case easily, but the reducible case is actually different.) Irreducible and aperiodic chains have a unique stationary distribution $\Pi$ (satisfying $\Pi = \Pi \cdot M$).

Definition 1.1 (Hidden Markov Models). A sequence $X_0, X_1, X_2, \ldots$, with $X_t \in \Sigma$, is said to be a Hidden Markov Model if there exists a $k$-state Markov chain $Z_0, Z_1, Z_2, \ldots$, specified by matrix $M$ and initial distribution $\Pi_0$, and $k$ distributions $P^{(1)}, \ldots, P^{(k)} \in \Delta(\Sigma)$ such that $X_t \sim P^{(Z_t)}$. We use the notation $\mathcal{M}$ to denote the ingredients specifying a hidden Markov model, namely the tuple $(k, M, \Pi_0, P^{(1)}, \ldots, P^{(k)})$.

Definition 1.2 (Entropy Rate). The entropy rate of a hidden Markov model $\mathcal{M}$ with output denoted $X_0, X_1, \ldots$ is given by $\lim_{t \to \infty} \frac{1}{t} \cdot H(X_0, \ldots, X_{t-1})$ when the limit exists. The entropy rate is denoted $H(\mathcal{M})$.

Proposition 1.3. The entropy rate (i.e., the limit) always exists for irreducible, aperiodic Markov chains. Furthermore it does not depend on the initial distribution $\Pi_0$.

Proof. We start with the "furthermore" part. This part follows from the fact that irreducible aperiodic chains converge to their stationary distribution, i.e., for every $\epsilon > 0$ there exists $T_0 = T_0(\epsilon)$ such that for all $Z_0$ and every $t \ge T_0$ the distribution of $X_t$ is $\epsilon$-close (in statistical distance) to the stationary probability distribution. Now consider $n$ sufficiently large. We have

$$H(X_t,\ldots,X_n) \;\le\; H(X_0,\ldots,X_n) \;\le\; H(X_t,\ldots,X_n) + t \log |\Sigma|.$$

Now, we'll compare $H(X_t,\ldots,X_n) = H(X_t) + \sum_{k=t}^{n-1} H(X_{k+1} \mid X_k)$ to $H(X'_t,\ldots,X'_n) = H(X'_t) + \sum_{k=t}^{n-1} H(X'_{k+1} \mid X'_k)$, where $X'_t,\ldots,X'_n$ are obtained by starting with $X'_t$ distributed according to the stationary probability distribution. We first bound the distance $|H(X'_{k+1} \mid X'_k) - H(X_{k+1} \mid X_k)|$ by noting

$$
\begin{aligned}
\big|H(X'_{k+1} \mid X'_k) - H(X_{k+1} \mid X_k)\big|
&= \Big|\sum_{i \in \Sigma} \Pr[X'_k = i]\, H(X'_{k+1} \mid X'_k = i) - \Pr[X_k = i]\, H(X_{k+1} \mid X_k = i)\Big| \\
&\le \sum_{i \in \Sigma} \big|\Pr[X'_k = i] - \Pr[X_k = i]\big| \cdot H(X_{k+1} \mid X_k = i) \\
&\le 2\epsilon \log |\Sigma|,
\end{aligned}
$$

where the first inequality uses $H(X'_{k+1} \mid X'_k = i) = H(X_{k+1} \mid X_k = i)$, which holds since both chains have the same transition matrix $M$.

Now the distance $|H(X'_t) - H(X_t)|$ can be controlled by standard bounds which relate statistical distance to maximal entropy difference. To be exact, for $X'_t, X_t$ that are $\epsilon$-close in statistical distance (writing $H(\epsilon)$ for the binary entropy of $\epsilon$),

$$|H(X'_t) - H(X_t)| \;\le\; H(\epsilon) + \epsilon \log |\Sigma|.$$

Combining these bounds, we have that $|H(X_t,\ldots,X_n) - H(X'_t,\ldots,X'_n)| \le 2n\epsilon \log |\Sigma|$, so that

$$H(X'_t,\ldots,X'_n) - 2n\epsilon \log |\Sigma| \;\le\; H(X_0,\ldots,X_n) \;\le\; H(X'_t,\ldots,X'_n) + t \log |\Sigma| + 2n\epsilon \log |\Sigma|.$$

Dividing by $n$ and taking the limit as $n \to \infty$ then shows that the normalized entropy of a Markov model, regardless of the initial distribution, converges to the same value.

The first part of the proposition now follows from the fact that when starting from the stationary probability distribution $\Pi$, we have $H(X_t \mid X_0,\ldots,X_{t-1}) \le H(X_t \mid X_1,\ldots,X_{t-1}) = H(X_{t-1} \mid X_0,\ldots,X_{t-2})$. (The inequality comes from "conditioning does not increase entropy" and the equality comes from the identity of the two distributions under stationarity.) It follows that the sequence $v_t \triangleq \frac{1}{t} \cdot H(X_0,\ldots,X_t)$ is non-increasing (each new conditional-entropy increment is at most the running average) and lies in the interval $[0, \log |\Sigma|]$, and so the limit $\lim_{t\to\infty} v_t$ exists.
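To see Proposition 1.3 in action, here is a minimal numerical sketch in Python (our illustration, not part of the original analysis; the two-state matrix $M$ and the two initial distributions below are arbitrary, hypothetical choices). It computes $v_t = \frac{1}{t} H(X_0,\ldots,X_{t-1})$ for a directly observed chain (i.e., $X_t = Z_t$) via the chain rule, and checks that the values from both starting distributions approach the entropy rate $\sum_i \Pi_i H(M_{i\cdot})$ determined by the stationary distribution $\Pi$:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector, with 0*log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

# Hypothetical 2-state chain; irreducible and aperiodic since all entries are positive.
M = np.array([[0.9, 0.1],
              [0.4, 0.6]])

def v(t, pi0):
    """v_t = H(X_0,...,X_{t-1})/t for the directly observed chain (X_t = Z_t),
    using the chain rule H(X_0,...,X_{t-1}) = H(X_0) + sum_k H(X_{k+1} | X_k)."""
    pi = np.asarray(pi0, dtype=float)
    h = entropy(pi)                              # H(X_0)
    for _ in range(t - 1):
        # H(X_{k+1} | X_k) = sum_i Pr[X_k = i] * H(M[i, :])
        h += sum(pi[i] * entropy(M[i]) for i in range(len(pi)))
        pi = pi @ M                              # distribution of X_{k+1}
    return h / t

# Stationary distribution Pi (satisfying Pi = Pi * M): left eigenvector of M
# for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(M.T)
Pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
Pi /= Pi.sum()
rate = sum(Pi[i] * entropy(M[i]) for i in range(len(Pi)))

for t in (10, 100, 1000):
    print(t, v(t, [1.0, 0.0]), v(t, [0.0, 1.0]))
print("entropy rate:", rate)  # both columns of v_t approach this value
```

Both sequences of $v_t$ converge to the same limit, illustrating the two parts of the proposition: the limit exists, and it is independent of the initial distribution.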

Definition 1.4 (Good Compressor, Universality). For $\epsilon, \delta > 0$, a compression algorithm $A : \Sigma^* \to \{0,1\}^*$ is an $(\epsilon,\delta)$-Good Compressor for a hidden Markov model $\mathcal{M}$ of entropy rate $H(\mathcal{M})$ if there exists $n_0$ such that for all $n \ge n_0$ we have

$$\Pr_{X_0,\ldots,X_{n-1}} \Big[\, |A(X_0,\ldots,X_{n-1})| \le H(\mathcal{M}) \cdot (1+\epsilon) \cdot n \,\Big] \;\ge\; 1 - \delta.$$

We say that $A$ is Universal (for the class of HMMs) if it is an $(\epsilon,\delta)$-good compressor for every hidden Markov model $\mathcal{M}$ and every $\epsilon, \delta > 0$.

1.2 The Lempel-Ziv Algorithm

We describe the algorithm structurally below. First we fix a prefix-free encoding of the positive integers, denoted $[\cdot]$, such that for every $i$, $|[i]| \le \log i + 2 \log\log i$. We also fix an arbitrary binary encoding of $\Sigma \cup \{\lambda\}$, denoted $\mathrm{bin}$, satisfying $\mathrm{bin}(b) \in \{0,1\}^{\ell}$ where $\ell = \lceil \log(|\Sigma|+1) \rceil$.

Given $\bar{X} = X_0,\ldots,X_{n-1}$, write $\bar{X} = \sigma_0 \circ \sigma_1 \circ \cdots \circ \sigma_m$ where the $\sigma_i \in \Sigma^*$ satisfy the following properties (a code sketch of this parse appears after the list):

(1) $\sigma_0 = \lambda$ (the empty string)

(2) For $i \in [m-1]$, $\sigma_i$ is the unique string such that

(2a) $\sigma_i = \sigma_{i'} \circ b$ for some $i' < i$ and some $b \in \Sigma$
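To make properties (1)-(2a) concrete, here is a minimal Python sketch of the parse (our illustration with hypothetical helper names, not the paper's pseudocode). Each phrase $\sigma_i$ is recorded as a pointer to the earlier phrase $\sigma_{i'}$ together with the one new symbol $b$. The encoder uses a simple $\lceil \log(t+1) \rceil$-bit pointer for the $t$-th phrase in place of the prefix-free integer code $[\cdot]$ fixed above, so its output length agrees with the intended encoding only up to lower-order terms:

```python
def lz_parse(x):
    """Split x into phrases sigma_1,...,sigma_m, each equal to a previously
    seen phrase (possibly sigma_0, the empty string) extended by one symbol."""
    index = {"": 0}            # sigma_0 = lambda
    phrases = []               # list of (pointer i' to prefix phrase, new symbol b)
    i = 0
    while i < len(x):
        j = i
        # grow the current phrase while it still matches a known phrase
        while j < len(x) and x[i:j + 1] in index:
            j += 1
        prefix_id = index[x[i:j]]             # longest previously seen prefix
        sym = x[j] if j < len(x) else ""      # "" only if the last phrase is truncated
        index[x[i:j + 1]] = len(index)        # register the new phrase
        phrases.append((prefix_id, sym))
        i = j + 1
    return phrases

def encode(phrases, alphabet):
    """Encode each phrase as (pointer, symbol); bin() maps Sigma plus lambda
    to fixed-width blocks of ell = ceil(log(|Sigma|+1)) bits, as in the text."""
    ell = max(1, len(alphabet).bit_length())
    bin_code = {s: format(k + 1, "0{}b".format(ell)) for k, s in enumerate(alphabet)}
    bin_code[""] = "0" * ell                  # encoding of lambda
    out = []
    for t, (pid, sym) in enumerate(phrases):
        ptr_width = max(1, t.bit_length())    # phrase t points into {0, ..., t}
        out.append(format(pid, "0{}b".format(ptr_width)))
        out.append(bin_code[sym])
    return "".join(out)

phrases = lz_parse("abababbabaabab")
print(phrases)  # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'b'), (3, 'a'), (5, 'b')]
print(encode(phrases, "ab"))
```

On the example string the phrases are a, b, ab, abb, aba, abab: each one is a previously seen phrase extended by a single symbol, exactly as property (2a) requires, and concatenating them recovers the input.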