Lossless Source Coding

Gadiel Seroussi Marcelo Weinberger

Information Theory Research Hewlett-Packard Laboratories Palo Alto, California

Lossless Source Coding

1. Fundamentals

Notation

• A = discrete (usually finite) alphabet
• α = |A| = size of A (when finite)
• x_1^n = x^n = x_1 x_2 x_3 … x_n = finite sequence over A
• x_1^∞ = x^∞ = x_1 x_2 x_3 … x_t … = infinite sequence over A
• x_i^j = x_i x_{i+1} … x_j = sub-sequence (i sometimes omitted if i = 1)

• p_X(x) = Prob(X = x) = probability mass function (PMF) on A (subscript X and argument x dropped if clear from context)
• X ~ p(x) : X obeys PMF p(x)

• E_p[F] = expectation of F w.r.t. PMF p (subscript and brackets may be dropped)
• p̂_{x_1^n}(x) = empirical distribution obtained from x_1^n
• log x = logarithm to base 2 of x, unless the base is otherwise specified
• ln x = natural logarithm of x
• H(X), H(p) = entropy of a random variable X or PMF p, in bits; also
• H(p) = −p log p − (1−p) log(1−p), 0 ≤ p ≤ 1 : binary entropy function
• D(p||q) = relative entropy (information divergence) between PMFs p and q

Coding in a communication/storage system

[Block diagrams: (i) data source → encoder → channel (with noise) → decoder → destination; (ii) the same system with the encoder split into source encoder + channel encoder, and the decoder into channel decoder + source decoder]


❑ Shannon, “A mathematical theory of communication,” Bell Syst. Tech. Journal, 1948
  • Theoretical foundations of source and channel coding
  • Fundamental bounds and coding theorems in a probabilistic setting
    – in a nutshell: perfect communication in the presence of noise is possible as long as the entropy of the source is below the capacity of the channel
  • Fundamental theorems essentially non-constructive: we’ve spent the last 52 years realizing Shannon’s promised paradise in practice
    – very successful: enabled the current digital revolution (multimedia, internet, wireless communication, mass storage, …)
  • Separation theorem: source and channel coding can be done independently

Source Coding

[Block diagram: data source → source encoder → (channel encoder → noisy channel → channel decoder, together acting as a noiseless channel) → source decoder → destination]

Source coding = Data compression ⇒ efficient use of bandwidth/space

Data Compression

[Block diagram: data D → compressor (encoder) → compressed data C → channel → C → decompressor (decoder) → decompressed data D’]

the goal: size(C) < size(D)
compression ratio: ρ = size(C)/size(D) < 1, in appropriate units (e.g., bits/symbol)

❑ Channel:
  • Communications channel (“from here to there”)
  • Storage channel (“from now to then”)

❑ Lossless compression: D = D’ ← the case of interest here
❑ Lossy compression: D’ is an approximation of D under some metric

Data Sources

data source → D = x_1 x_2 x_3 … x_t … x_n = x_1^n

❑ Symbols x_i ∈ A = a countable (usually finite) alphabet
❑ Probabilistic source: the x_i are random variables; x_1^n obeys some probability distribution P on A^n (the ensemble of possible sequences emitted by the source)
  • we are often interested in n → ∞ : x_1^∞ is a random process
    – stationary (time-invariant): x_i^∞ and x_j^∞ are equal as random processes, ∀ i, j ≥ 1
    – ergodic: time averages converge (to ensemble averages)

    – memoryless: the x_i are statistically independent

    – independent, identically distributed (i.i.d.): memoryless, and x_i ~ p_X ∀ i

❑ Individual sequence: the x_i are just symbols, not assumed to be a realization of a random process. The “data source” will include a probability assignment, but it will be derived from the data under certain constraints, and with certain objectives

Statistics on Individual Sequences

❑ Empirical distributions (memoryless, Bernoulli model)
  p̂_{x_1^n}(a) = (1/n) |{x_i : 1 ≤ i ≤ n, x_i = a}|,  a ∈ A
  • we can compute empirical statistics of any order (joint, conditional, etc.)
  • sequence probability according to its own empirical distribution:
    P̂_{x_1^n}(x_1^n) = ∏_{i=1}^{n} p̂_{x_1^n}(x_i)
  • this is the highest probability assigned to the sequence by any distribution from the model class (maximum-likelihood estimator)

  • Example: A = {0,1}, n_0 = |{i : x_i = 0}|, n_1 = n − n_0 = |{i : x_i = 1}|
    p̂(0) = n_0/n, p̂(1) = n_1/n :  P̂(x_1^n) = p̂(0)^{n_0} p̂(1)^{n_1} = n_0^{n_0} n_1^{n_1} / n^n
  • Notice that if x_1^n is in fact the outcome of a random process, then its empirical statistics are themselves random variables
  • e.g., expressions of the type Pr_p(|p̂(a) − p(a)| ≥ ε)
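A quick numeric companion to these definitions (a sketch; the helper name empirical_stats is illustrative, not from the slides):

```python
from collections import Counter
from math import log2

def empirical_stats(x):
    """Zero-order (Bernoulli) empirical distribution of sequence x, the
    probability P-hat(x) that x assigns to itself (the ML probability),
    and the per-symbol empirical entropy -(1/n) log P-hat(x)."""
    n = len(x)
    counts = Counter(x)
    p_hat = {a: c / n for a, c in counts.items()}
    log_prob = sum(c * log2(c / n) for c in counts.values())  # log2 P-hat(x)
    return p_hat, 2 ** log_prob, -log_prob / n

p_hat, P_hat, H_hat = empirical_stats("1011010100010")
print(p_hat)   # {'1': 6/13, '0': 7/13} (approximately)
print(P_hat)   # no memoryless model assigns this sequence a higher probability
print(H_hat)   # empirical entropy in bits/symbol (~0.996 here)
```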

Statistical Models for Data Sources

❑ Markov source of order k ≥ 0 : finite memory
  p(x_{t+1} | x_1^t) = p(x_{t+1} | x_{t−k+1}^t),  t ≥ k   (some convention for t < k)

  • i.i.d. = Markov of order 0
❑ Finite State Machine (FSM)
  • state space S = {s_0, s_1, ..., s_{K−1}}
  • initial state s_0
  • transition probability q(s|s’,a), s,s’ ∈ S, a ∈ A
  • output probability p(a|s), a ∈ A, s ∈ S
  • unifilar ⇔ deterministic transitions: next-state function f : S × A → S
  • every Markov source is equivalent to a unifilar FSM with K ≤ |A|^k, but in general, finite state ≠ finite memory

  Example (three-state unifilar FSM): p(0|s_0) = 0.9, p(0|s_1) = 0.5, p(0|s_2) = 0.1
  Steady state: [π_0 π_1 π_2] · [[.9 .1 0], [.5 0 .5], [.1 0 .9]] = [π_0 π_1 π_2]
  ⇒ state probs. [π_0 π_1 π_2] = [5/8 1/16 5/16], symbol probs. [p_0 p_1] = [5/8 3/8]

Statistical Models for Data Sources (cont.)

❑ Tree sources (FSMX)
  • finite memory ≤ k (Markov)
  • the number of past symbols needed to determine the state might be < k for some states; in the example, s_0 ↔ 0, s_1 ↔ 01, s_2 ↔ 11
  • by merging nodes from the full Markov tree, we get a model with a smaller number of free parameters
  • the set of tree sources with unbalanced trees has measure zero in the space of Markov sources of any given order
  • yet, tree source models have proven very useful in practice, and are associated with some of the best compression algorithms to date
  • more about this later ...

Entropy

X ~ p(x) :  H(X) ≜ −∑_{x∈A} p(x) log p(x)   [0 log 0 = 0]

entropy of X (or of the PMF p(·)), measured in bits

H(X) = E_p[−log p(X)]
  • H measures the uncertainty or self-information of X
  • we also write H(p) : a random variable is not actually needed; p(·) could be an empirical distribution

Example: A = {0,1}, p_X(1) = p, p_X(0) = 1−p (overloaded notation!)

H_2(p) = −p log p − (1−p) log(1−p)   (binary entropy function)
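These quantities take one line each in code; a minimal sketch (function names are illustrative):

```python
from math import log2

def entropy(p):
    """H(p) in bits for a PMF given as a sequence of probabilities (0 log 0 = 0)."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def h2(p):
    """Binary entropy function H_2(p)."""
    return entropy([p, 1 - p])

print(h2(0.5))                              # 1.0 -> an unbiased coin is worth 1 bit
print(h2(0.1))                              # ~0.469
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits
```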

Main properties:
• H_2(p) ≥ 0, H_2(p) is ∩-convex (concave), 0 ≤ p ≤ 1
• H_2(p) → 0 as p → 0 or 1, with infinite slope
• H_2(p) is maximal at p = 0.5, H_2(0.5) = 1 ⇒ the entropy of an unbiased coin is 1 bit

Entropy (cont.)

❑ For a general finite alphabet A, H(X) is maximal when X ~ p_u, where p_u(a) = 1/|A| for all a ∈ A (uniform distribution)
  • Jensen’s inequality: if f is a ∪-convex function, then E f(X) ≥ f(EX)
  • −log x is a ∪-convex function of x

  • H(X) = E[log(1/p(X))] = −E[−log(1/p(X))] ≤ log E[1/p(X)] = log |A| = H(p_u)

❑ Empirical entropy
  • entropy computed on an empirical distribution
  • Example: recall that the probability assigned to an individual binary sequence by its own zero-order empirical distribution is
    P̂(x_1^n) = p̂(0)^{n_0} p̂(1)^{n_1} = n_0^{n_0} n_1^{n_1} / n^n
  • normalized, in bits/symbol, we have
    −(1/n) log P̂(x_1^n) = −(n_0/n) log(n_0/n) − (n_1/n) log(n_1/n) = H_2(p̂(0)) = Ĥ(x_1^n)

  in fact, −(1/n) log P̂(x_1^n) = Ĥ(x_1^n) holds for a large class of probability models

Joint and Conditional Entropies

❑ The joint entropy of random variables (X,Y) ~ p(x,y) is defined as
  H(X,Y) = −∑_{x,y} p(x,y) log p(x,y)
  • this can be extended to any number of random variables: H(X_1, X_2, ..., X_n)

Notation: H(X_1, X_2, ..., X_n) = joint entropy of X_1, X_2, ..., X_n (0 ≤ H ≤ n log |A|)

(1/n) H(X_1, X_2, ..., X_n) = normalized per-symbol entropy (0 ≤ (1/n)H ≤ log |A|)
  • if (X,Y) are statistically independent, then H(X,Y) = H(X) + H(Y)
❑ The conditional entropy is defined as
  H(Y|X) = ∑_x p(x) H(Y|X=x) = −E_{p(x,y)} log p(Y|X)
❑ Chain rule: H(X,Y) = H(X) + H(Y|X)

❑ Conditioning reduces uncertainty (on the average): H(X|Y) ≤ H(X)

  • but H(X|Y=y) ≥ H(X) is possible

Entropy Rates

❑ Entropy rate of a random process, in bits/symbol:
  H(X_1^∞) = lim_{n→∞} (1/n) H(X_1^n),  if the limit exists!

❑ A related limit, based on conditional entropy, in bits/symbol:
  H*(X_1^∞) = lim_{n→∞} H(X_n | X_{n−1}, X_{n−2}, …, X_1),  if the limit exists!

Theorem: For a stationary random process, both limits exist, and H*(X_1^∞) = H(X_1^∞)
❑ Examples:
  • X_1, X_2, ... i.i.d.:  H(X_1^∞) = lim_{n→∞} H(X_1, X_2, ..., X_n)/n = lim_{n→∞} n H(X_1)/n = H(X_1)
  • X_1^∞ stationary k-th order Markov:
    H(X_1^∞) = H*(X_1^∞) = lim_{n→∞} H(X_n | X_{n−1}, ..., X_1) = lim_{n→∞} H(X_n | X_{n−1}, ..., X_{n−k}) = H(X_{k+1} | X_k, ..., X_1)
    (first equality by the theorem, second by the Markov property, last by stationarity)
The theorem provides a very useful tool to compute entropy rates for a broad family of source models

Entropy Rates - Examples

Recall the three-state unifilar FSM: p(0|s_0) = 0.9, p(0|s_1) = 0.5, p(0|s_2) = 0.1
Steady state: state probs. [π_0 π_1 π_2] = [5/8 1/16 5/16], symbol probs. [p_0 p_1] = [5/8 3/8]

❑ Zero-order entropy: H_2(0.375) ≈ 0.954
❑ Markov process entropy: H(X|S) = ∑_{i=0}^{2} π_i H_2(p(0|s_i)) = (5/8) H_2(0.9) + (1/16) H_2(0.5) + (5/16) H_2(0.1) ≈ 0.502
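The steady-state probabilities and both entropies above can be checked with a short computation (a sketch using numpy; variable names are illustrative):

```python
import numpy as np

# State-transition matrix of the 3-state unifilar FSM (row = current state).
M = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.0, 0.5],
              [0.1, 0.0, 0.9]])
p0 = np.array([0.9, 0.5, 0.1])          # p(0|s_i) for each state

def h2(p):
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Stationary distribution = left eigenvector of M for eigenvalue 1.
w, v = np.linalg.eig(M.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
print(pi)                                         # ~[5/8, 1/16, 5/16]
print(h2(pi @ p0))                                # ~0.954 (zero-order entropy, P(0) = 5/8)
print(sum(pi[i] * h2(p0[i]) for i in range(3)))   # ~0.502 (entropy rate H(X|S))
```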

❑ Individual sequence, fitted with the FSM model:
  x = 0000001111111111100000100000001111111110
  Empirical entropy: p̂(0|s_0) = 16/19, p̂(0|s_1) = 1/3, p̂(0|s_2) = 1/9,
  [π̂_0 π̂_1 π̂_2] = [19/40 3/40 18/40],  Ĥ(x|S) = 0.594

Relative Entropy

❑ The relative entropy (or Kullback-Leibler distance, or information divergence) between two PMFs p(x) and q(x) is defined as
  D(p||q) = ∑_x p(x) log (p(x)/q(x)) = E_p [log (p(X)/q(X))]

Theorem: D(p||q) ≥ 0, with equality iff p = q
  – Proof (using strict concavity of log, and Jensen’s inequality):
    −D(p||q) = ∑_x p(x) log (q(x)/p(x)) ≤ log ∑_x p(x) (q(x)/p(x)) = log ∑_x q(x) ≤ 0
    the summations are over values of x where p(x)q(x) ≠ 0; other terms contribute either 0 or ∞ to D. Since log is strictly concave, equality holds iff p(x)/q(x) = 1 ∀x. ∎

  • D is not symmetric, and therefore not a distance in the metric sense
  • however, it is a very useful way to express ‘proximity’ of distributions

in a sense, D(p||q) measures the inefficiency of assuming that the distribution is q when it is actually p
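A direct transcription of the definition, also showing the asymmetry (a sketch; the helper name is illustrative):

```python
from math import log2, inf

def kl_divergence(p, q):
    """D(p||q) in bits; p and q are dicts mapping symbols to probabilities.
    Terms with p(x) = 0 contribute 0; p(x) > 0 with q(x) = 0 gives +inf."""
    d = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return inf
        d += px * log2(px / qx)
    return d

p = {'a': 0.5, 'b': 0.5}
q = {'a': 0.9, 'b': 0.1}
print(kl_divergence(p, q))   # ~0.737
print(kl_divergence(q, p))   # ~0.531 -> D(p||q) != D(q||p) in general
```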

Maximal Entropy

Theorem: Let X ~ p be a PMF over the non-negative integers such that E_p X = μ. Then H(X) is maximized when p(x) = exp(λ_0 + λ_1 x), with λ_0, λ_1 chosen to satisfy the constraints (a geometric distribution)
  • a similar theorem holds for moments of any order

  • Proof: Consider a PMF q satisfying the constraint. Then show H(q) ≤ H(p) using the non-negativity of D(q||p), E_p X = E_q X, and E_p 1 = E_q 1. ∎

Corollary: For X as above, H(X) ≤ (μ+1) log(μ+1) − μ log μ

Source Codes

❑ A source code C for a random variable X is a mapping C : A → D*, where D is a finite coding alphabet of size d, and D* is the set of finite strings over D
  • Definitions: C(x) = codeword corresponding to x, l(x) = |C(x)| (length)
❑ The expected length of C(X), for X ~ p(x), is

  L(C) = E_p[l(X)] = ∑_x p(x) l(x)

Examples (D = {0,1}):
  A = {a,b,c,d}:                    A = {a,b,c}:
  p(a) = 1/2    C(a) = 0            p(a) = 1/3    C(a) = 0
  p(b) = 1/4    C(b) = 10           p(b) = 1/3    C(b) = 10
  p(c) = 1/8    C(c) = 110          p(c) = 1/3    C(c) = 11
  p(d) = 1/8    C(d) = 111
  H(X) = 1.75 bits                  H(X) = log 3 ≈ 1.58 bits
  L(C) = 1.75 bits                  L(C) = 5/3 ≈ 1.66 bits
  (in the first case, l(x) = −log p(x) for all x ∈ A)
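Both examples can be reproduced in a few lines (a sketch; helper names are illustrative):

```python
from math import log2

def entropy(p):
    return -sum(px * log2(px) for px in p.values() if px > 0)

def expected_length(p, code):
    """L(C) = sum_x p(x) |C(x)| for a PMF p and a code C, both given as dicts."""
    return sum(p[x] * len(code[x]) for x in p)

p1 = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
C1 = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(entropy(p1), expected_length(p1, C1))   # 1.75, 1.75 -> dyadic PMF, L = H

p2 = {'a': 1/3, 'b': 1/3, 'c': 1/3}
C2 = {'a': '0', 'b': '10', 'c': '11'}
print(entropy(p2), expected_length(p2, C2))   # ~1.585, ~1.667 -> L > H
```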

Source Codes (cont.)

❑ A code C : A → D* extends naturally to a code C* : A* → D* defined by
  C*(λ) = λ,  C*(x_1 x_2 ... x_n) = C(x_1) C(x_2) ... C(x_n)
❑ C is called uniquely decodable (UD) if its extension C* is injective
❑ C is called a prefix (or instantaneous) code if no codeword of C is a prefix of any other codeword
  • a prefix code is uniquely decodable (prefix codes ⊂ UD codes ⊂ all codes)
  • prefix codes are “self-punctuating”

  Code examples:
    X    not UD    UD, not prefix    prefix code
    a    0         10                0
    b    010       00                10
    c    01        11                110
    d    10        110               111
  • with the “not UD” code, the sample string 010 is ambiguous (b, ad, or ca)
  • with the prefix code, the sample string 100011000111... decodes instantaneously as b aa c aa d ...
  • a prefix code can always be described by a tree with codewords at the leaves

Prefix Codes

❑ Kraft’s inequality
  • The codeword lengths l_1, l_2, ..., l_m of any d-ary prefix code satisfy
    ∑_{i=1}^{m} d^{−l_i} ≤ 1
  Conversely, given a set of lengths that satisfies this inequality, there exists a prefix code with these word lengths
    – the theorem holds also for countably infinite codes
    – in fact, the theorem holds for any UD code (McMillan)
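A sketch of both directions in code: checking the Kraft sum for a set of lengths, and building a binary prefix code from lengths that satisfy it using the canonical-code assignment (one standard construction; helper names are illustrative):

```python
def kraft_sum(lengths, d=2):
    """Left-hand side of Kraft's inequality for d-ary codeword lengths."""
    return sum(d ** (-l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Binary prefix code with the given lengths (assumes the Kraft sum <= 1):
    process lengths in increasing order; each codeword is the previous one
    plus 1, left-shifted to the new length (canonical code)."""
    assert kraft_sum(lengths) <= 1
    code, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev_len)
        code.append(format(value, f"0{l}b"))
        value += 1
        prev_len = l
    return code

print(kraft_sum([1, 2, 3, 3]))                 # 1.0
print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```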

Proof sketch: embed the code tree in the full d-ary tree of depth l_max. A codeword at depth l_i has d^{l_max − l_i} leaf descendants at depth l_max, these leaf sets are disjoint, and some leaves may lie outside the code, so
  ∑_i d^{l_max − l_i} ≤ d^{l_max}

The Entropy Lower Bound

❑ The expected code length L of any prefix code for a PMF p with probabilities p_1, p_2, ..., p_m satisfies
  L(C) ≥ H_d(p) = H(p)/log d   (d-ary entropy, in “dits/symbol”)
  with equality iff {p_i} = {d^{−l_i}}
  – Proof: Let c = ∑_i d^{−l_i} ≤ 1 (Kraft), q_i = c^{−1} d^{−l_i} (normalized distribution). Then
    L − H_d(p) = ∑_i p_i l_i + ∑_i p_i log_d p_i = −∑_i p_i log_d d^{−l_i} + ∑_i p_i log_d p_i
               = ∑_i p_i log_d (p_i/q_i) + log_d (1/c) = D_d(p||q) + log_d (1/c) ≥ 0 ∎

From now on, we assume d = 2 for simplicity (binary codes)

Lossless Source Coding

2. Basic coding techniques

The Shannon Code

❑ The lower bound L(C) ≥ H(p) would be attained if we could have a code with lengths l_i = −log p_i. But the l_i must be integers, and the −log p_i generally are not
❑ Simple approximation: take l_i = ⌈−log p_i⌉
  The lengths satisfy Kraft: ∑_i 2^{−l_i} ≤ ∑_i 2^{log p_i} = ∑_i p_i = 1 ⇒ there is a prefix code with these lengths (Shannon code)

❑ Optimal for dyadic distributions: all p_i’s powers of 2 ⇒ L = H(p)
  • not optimal in general
❑ In general, the Shannon code satisfies
  L = ∑_i p_i ⌈−log p_i⌉ ≤ ∑_i p_i (−log p_i + 1) = H(p) + 1
  ⇒ the optimal prefix code satisfies H(p) ≤ L ≤ H(p) + 1
❑ The upper bound cannot be improved: L ≥ l_min ≥ 1, but we can have H(p) → 0
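A sketch computing Shannon code lengths and checking the bounds above (helper names are illustrative):

```python
from math import ceil, log2

def shannon_lengths(p):
    """Codeword lengths l_i = ceil(-log2 p_i) of the Shannon code."""
    return [ceil(-log2(pi)) for pi in p]

p = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_lengths(p)                    # [2, 2, 3, 4]
kraft = sum(2 ** -l for l in lengths)           # <= 1, so a prefix code exists
H = -sum(pi * log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))
print(lengths, kraft)
print(H, L)                                     # H ~ 1.85 <= L = 2.4 <= H + 1
```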

Huffman Codes

❑ Shannon codes are very simple but generally sub-optimal. In 1952, Huffman presented a construction of optimal prefix codes.

  Huffman algorithm. Given p_1, p_2, ..., p_m:
  1. k ← m+1
  2. find the smallest pair of unused p_i, p_j
  3. form p_k = p_i + p_j
  4. mark p_i, p_j ‘used’
  5. if the only unused probability is p_k, stop
  6. k ← k + 1, go to 2

  Construction by example (probabilities .05, .15, .15, .15, .20, .30):
    p(x)   C(x)   l(x)
    .05    000    3
    .15    001    3
    .15    100    3
    .15    101    3
    .20    01     2
    .30    11     2
    L = 2.5,  H = 2.433...
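A compact sketch of the merging procedure using a heap; it returns code lengths only (ties may be broken differently than in the example, but the expected length is the same):

```python
import heapq

def huffman_lengths(probs):
    """Huffman codeword length of each symbol, obtained by repeatedly
    merging the two smallest-probability items (binary Huffman)."""
    heap = [(p, [i]) for i, p in enumerate(probs)]  # (subtree prob, its symbols)
    lengths = [0] * len(probs)
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, syms1 = heapq.heappop(heap)
        p2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:        # every symbol in the merged subtree
            lengths[s] += 1            # moves one level deeper in the code tree
        heapq.heappush(heap, (p1 + p2, syms1 + syms2))
    return lengths

p = [0.05, 0.15, 0.15, 0.15, 0.20, 0.30]
lengths = huffman_lengths(p)
print(lengths)                                      # [3, 3, 3, 3, 2, 2]
print(sum(pi * li for pi, li in zip(p, lengths)))   # L = 2.5
```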

Huffman Codes

Theorem: Codes constructed with the Huffman algorithm are optimal; i.e., if C* is a Huffman code for a PMF p, and C is a prefix code with the same number of words, then L_p(C*) ≤ L_p(C).

  • Let p_1 ≥ p_2 ≥ ... ≥ p_m be the probabilities in p
  Lemma: For any PMF, there is an optimal prefix code satisfying
  1. p_i > p_j ⇒ l_i ≤ l_j
  2. the two longest codewords have the same length, they differ only in the last bit, and they correspond to the least likely symbols
  (Huffman codes satisfy the Lemma by construction)

Proof of the Theorem: By induction on m. Trivial for m = 2. Let C_m be a Huffman code for p. W.l.o.g., the first step in the construction of C_m merged p_m and p_{m−1}. Clearly, the remaining steps constructed a Huffman code C_{m−1} for a PMF p’ with probabilities p_1, p_2, ..., p_{m−2}, p_{m−1}+p_m. Now,
  L(C_{m−1}) = ∑_{i=1}^{m−2} l_i p_i + (l_{m−1} − 1)(p_{m−1} + p_m) = L(C_m) − p_{m−1} − p_m

Let C’_m be an optimal code for p satisfying the Lemma. Applying the same merging to C’_m, we obtain a code C’_{m−1} for p’, with L(C’_m) = L(C’_{m−1}) + p_{m−1} + p_m. Since C_{m−1} is optimal (by induction), we must have L(C’_{m−1}) ≥ L(C_{m−1}) ⇒ L(C’_m) ≥ L(C_m) ∎

Redundancy of Huffman Codes

❑ Redundancy: excess average code length over entropy
  • the redundancy of a Huffman code for a PMF p satisfies 0 ≤ L(C) − H(p) ≤ 1
  • the redundancy can get arbitrarily close to 1 when H(p) → 0, but how large is it typically?
❑ Gallager [1978] proved
  L(C) − H(p) ≤ P_1 + c
  where P_1 is the probability of the most likely symbol, and c = 1 − log e + log log e ≈ 0.086.
  For P_1 ≥ 1/2,
  L(C) − H(p) ≤ 2 − H_2(P_1) − P_1 ≤ P_1

  (for the example of the previous slide: r = L − H = 0.067, Gallager bound = 0.386)
❑ Precise characterization of the Huffman redundancy has been a very difficult problem
  • most recent results in [Szpankowski, IEEE IT ’01]

A Coding Theorem

❑ For a sequence of symbols from a data source, the per-symbol redundancy can be reduced by using an alphabet extension
  A^n = { (a_1, a_2, ..., a_n) | a_i ∈ A }
  and an optimal code C^n for super-symbols (X_1, X_2, ..., X_n) ~ p(x_1, x_2, ..., x_n).
  Then, H(X_1, X_2, ..., X_n) ≤ L(C^n) ≤ H(X_1, X_2, ..., X_n) + 1. Dividing by n, we get:

Coding Theorem (Shannon): The minimum expected codeword length per symbol satisfies
  (1/n) H(X_1, X_2, ..., X_n) ≤ L*_n ≤ (1/n) H(X_1, X_2, ..., X_n) + 1/n.
Furthermore, if X^∞ is a random process with an entropy rate, then
  L*_n → H(X^∞) as n → ∞

Shannon tells us that there are codes that attain the fundamental compression limits asymptotically. But, how do we get there in practice?

Ideal Code Length

❑ A probability assignment is a function P : A* → [0,1] satisfying
  ∑_{a∈A} P(sa) = P(s) ∀ s ∈ A*, with P(λ) = 1
❑ P is not a PMF on A*, but it is a PMF on any complete subset of A*
  • complete subset = leaves of a complete tree rooted at λ, e.g., A^n
❑ The ideal code length for a string x_1^n relative to P is defined as
  l*(x_1^n) = −log P(x_1^n)
❑ The Shannon code attains the ideal code length for every string x_1^n, up to an integer-constraint excess o(1) which we shall ignore
  • notice that attaining the ideal code length point-wise for every string is a stronger requirement than attaining the entropy on the average
❑ The Shannon code, as defined, is infeasible in practice (as would be a Huffman code on A^n for large n)
  • while the code length for x_1^n is relatively easy to compute given P(x_1^n), it is not clear how the codeword assignment proceeds
  • as defined, it appears that one needs to look at the whole x_1^n before encoding; we would like to encode sequentially as we get the x_i
  the evolution that led to the solution of both issues ⇒ arithmetic coding

The Shannon-Fano Code

❑ A codeword assignment for the Shannon code
  • Let X ~ P(x) take values in M = {0,1,...,m−1}, with P(0) ≥ P(1) ≥ ... ≥ P(m−1) > 0
  • Define F(x) = ∑_{a<x} P(a), x ∈ M   (F is strictly increasing)
  • C(x) = the first l_x = ⌈−log P(x)⌉ bits of the binary expansion of F(x)

  Example:
    x   P      F      l_x   C(x)
    0   0.5    0      1     .0
    1   0.25   0.5    2     .10
    2   0.125  0.75   3     .110
    3   0.125  0.875  3     .111

Elias Coding - Arithmetic Coding

❑ To encode x_1^n we take M = A^n, ordered lexicographically
  • to compute F(x_1^n) directly, we would need to add an exponential number of probabilities, and compute with huge precision -- infeasible
❑ Sequential probability assignment
  P(x_1^n) = P(x_1^{n−1}) P(x_n | x_1^{n−1})   ← what the model will provide at each step
❑ Sequential encoding
  F(x_1^n) = ∑_{y_1^n < x_1^n} P(y_1^n) = ∑_{y_1^{n−1} < x_1^{n−1}} P(y_1^{n−1}) + ∑_{y < x_n} P(x_1^{n−1} y) = F(x_1^{n−1}) + ∑_{y < x_n} P(x_1^{n−1} y)
  (the string x_1^n is represented by the interval [F(x_1^n), F(x_1^n + 1)), whose width is P(x_1^n))

Arithmetic Coding - Example

P(0) = 0.25, P(1) = 0.75 (static i.i.d. model); input sequence 1 1 1 1 0
  Successive intervals (starting from [0,1), with “0” taking the lower sub-interval):
  [0,1) → [0.25, 1) → [0.4375, 1) → [0.578125, 1) → [0.68359, 1) → [0.68359, 0.762695)
  P(x_1^5) = (3/4)^4 · (1/4) = 81/1024 ≈ 0.0791
  No. of code bits = ⌈−log_2 P(x_1^5)⌉ = 4;  Code = .1100   (traced in the sketch below)
❑ Computational challenges
  • precision of floating-point operations – register length
  • the active interval shrinks, but small numerical changes can lead to changes in many bits of the binary representation – carry-over problem
  • encoding/decoding delay – how many cycles does it take from the moment a digit enters the encoder until it can be output by the decoder?
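The interval arithmetic of this example can be traced with exact rationals, sidestepping for the moment the precision issues just listed (a sketch; symbol ‘0’ takes the lower part of the interval, as in the figure):

```python
from fractions import Fraction
from math import ceil, log2

def elias_intervals(bits, p0):
    """Successive [low, low + width) intervals of the Elias/arithmetic coder
    for a binary i.i.d. model with P(0) = p0."""
    low, width = Fraction(0), Fraction(1)
    for b in bits:
        if b == "0":
            width *= p0               # keep the lower fraction p0 of the interval
        else:
            low += width * p0         # skip past the '0' sub-interval
            width *= 1 - p0
        print(b, float(low), float(low + width))
    return low, width

low, width = elias_intervals("11110", Fraction(1, 4))
print(float(width))              # P(x) = 81/1024 ~ 0.0791
print(ceil(-log2(width)))        # 4 code bits; .1100 = 0.75 lies in the final interval
```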

Arithmetic Coding

❑ Arithmetic coding [Elias ca. ‘60, Rissanen ‘75, Pasco ‘76] solves the problems of precision and carry-over in the sequential computation of F(x_1^n), making it practical with bounded delay and modest memory requirements
  • refinements and contributions by many researchers in the past 25 years
❑ When carefully designed, AC attains a code length
  −log P(x_1^n) + O(1),
  i.e., ideal up to an additive constant
❑ It reduces the lossless compression problem to one of finding the best probability assignment for the given data x_1^n, the one which will provide the shortest ideal code length

the problem is not to find the best code for a given probability distribution, it is to find the best probability assignment for the data at hand

Lossless Source Coding

4. Lempel-Ziv coding

The Lempel-Ziv Algorithms

❑ A family of data compression algorithms first presented in

[LZ77] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Trans. Inform. Theory, vol. IT-23, pp. 337–343, May 1977.

[LZ78] J. Ziv and A. Lempel, “Compression of individual sequences via variable rate coding,” IEEE Trans. Inform. Theory, vol. IT-24, pp. 530–536, Sept. 1978.

❑ Many desirable features, the conjunction of which was unprecedented
  • simple and elegant
  • universal for individual sequences in the class of finite-state encoders
    – arguably, every real-life computer is a finite-state automaton
  • convergence to the entropy for stationary ergodic sources
  • string matching and dictionaries, no explicit probability model
  • very practical, with fast and effective implementations applicable to a wide range of data types

Two Main Variants

❑ [LZ77] and [LZ78] present different algorithms with common elements
  • The main mechanism in both schemes is pattern matching: find string patterns that have occurred in the past, and compress them by encoding a reference to the previous occurrence

... r e s t o r e o n e s t o n e ...

❑ Both schemes are in wide practical use
  • many variations exist on each of the major schemes
  • we focus on LZ78, which admits a simpler analysis with a stronger result. The proof here follows [Cover & Thomas ‘91], attributed to [Wyner & Ziv]. It differs from the original proof in [LZ78].
  • the scheme is based on the notion of incremental parsing

Incremental Parsing

❑ Parse the input sequence x_1^n into phrases, each new phrase being the shortest substring that has not appeared so far in the parsing
  x_1^n = 1, 0, 11, 01, 010, 00, 10, ...   (assume A = {0,1}; phrases numbered 1–7)
❑ Each new phrase is of the form wb, w = a previous phrase, b ∈ {0,1}
  • a new phrase can be described as (i,b), where i = index(w) (phrase #; phrase #0 = λ)
  • in the example: (0,1), (0,0), (1,1), (2,1), (4,0), (2,0), (1,0)   (reproduced in the sketch below)
  • let c(n) = number of phrases in x_1^n
  • a phrase description takes ≤ 1 + log c(n) bits
  • in the example, 28 bits to describe 13 input bits: bad deal! it gets better as n → ∞
  • decoding is straightforward
  • in practice, we do not need to know c(n) before we start encoding
    – use increasing-length codes that the decoder can keep track of
  Lemma:  c(n) ≤ n / ((1 − ε_n) log n),  where ε_n → 0 as n → ∞
  Proof idea: c(n) is maximal when we take all phrases as short as possible. Let n_k = ∑_{j=1}^{k} j 2^j = (k−1) 2^{k+1} + 2, with n_k ≤ n
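A sketch of incremental parsing that reproduces the pointer list of the example (the dictionary maps each parsed phrase to its index; phrase #0 is the empty string λ):

```python
def lz78_parse(x):
    """Incremental (LZ78) parsing: return the list of (index(w), b) pairs,
    where each new phrase is w + b and w is a previously parsed phrase."""
    dictionary = {"": 0}              # phrase #0 = the empty string
    pairs, w = [], ""
    for b in x:
        if w + b in dictionary:       # keep extending the current phrase
            w += b
        else:
            pairs.append((dictionary[w], b))
            dictionary[w + b] = len(dictionary)
            w = ""
    if w:                             # leftover (incomplete) last phrase
        pairs.append((dictionary[w[:-1]], w[-1]))
    return pairs

print(lz78_parse("1011010100010"))
# [(0,'1'), (0,'0'), (1,'1'), (2,'1'), (4,'0'), (2,'0'), (1,'0')]
```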

Universality of LZ78

❑ Let
  Q_k(x_{−(k−1)}, ..., x_{−1}, x_0, x_1, ..., x_n) = Q(x_{−(k−1)}^0) ∏_{j=1}^{n} Q(x_j | x_{j−k}^{j−1})
  be any k-th order Markov probability assignment for x_1^n, with arbitrary initial state (x_{−(k−1)}, ..., x_0)
❑ Assume x_1^n is parsed into distinct phrases y_1, y_2, ..., y_c. Define:
  • v_i = index of the start of y_i, i.e., y_i = (x_{v_i}, ..., x_{v_{i+1}−1})
  • s_i = (x_{v_i−k}, ..., x_{v_i−1}) = the k bits preceding y_i in x_1^n;  s_1 = (x_{−(k−1)}, ..., x_0)
  • c_{ls} = number of phrases y_i of length l and preceding state s ∈ {0,1}^k
  • we have ∑_{l,s} c_{ls} = c and ∑_{l,s} l·c_{ls} = n
  (where are we going with all this?)

Ziv’s inequality: For any distinct parsing of x_1^n, and any Q_k, we have

  log Q_k(x_1, ..., x_n | s_1) ≤ −∑_{l,s} c_{ls} log c_{ls}

The lemma upper-bounds the probability of any sequence under any probability assignment from the class, based on properties of any distinct parsing of the sequence (including the incremental parsing)

Universality of LZ78 (proof of Ziv’s inequality)

Proof of Ziv’s inequality:

  Q_k(x_1, x_2, ..., x_n | s_1) = Q(y_1, y_2, ..., y_c | s_1) = ∏_{i=1}^{c} Q(y_i | s_i)
  log Q_k(x_1, x_2, ..., x_n | s_1) = ∑_{i=1}^{c} log Q(y_i | s_i)
    = ∑_{l,s} ∑_{i: |y_i|=l, s_i=s} log Q(y_i | s_i)
    = ∑_{l,s} c_{ls} ∑_{i: |y_i|=l, s_i=s} (1/c_{ls}) log Q(y_i | s_i)
    ≤ ∑_{l,s} c_{ls} log ( (1/c_{ls}) ∑_{i: |y_i|=l, s_i=s} Q(y_i | s_i) )   (Jensen)

  Since the y_i are distinct, we have ∑_{i: |y_i|=l, s_i=s} Q(y_i | s_i) ≤ 1
  ⇒ log Q_k(x_1, x_2, ..., x_n | s_1) ≤ ∑_{l,s} c_{ls} log (1/c_{ls}) ∎

Universality for Individual Sequences: Theorem

Theorem: For any sequence x_1^n and any k-th order probability assignment Q_k, we have

  c(n) log c(n) / n ≤ −(1/n) log Q_k(x_1^n | s_1) + (1 + o(1)) k / log n + O(log log n / log n)

Proof: The lemma ⇒
  log Q(x_1^n | s_1) ≤ −∑_{l,s} c_{ls} log c_{ls} = −c log c − c ∑_{l,s} π_{ls} log π_{ls},   where π_{ls} ≜ c_{ls}/c

We have ∑_{l,s} π_{ls} = 1 and ∑_{l,s} l·π_{ls} = n/c. Define random variables (U,V) with P(U=l, V=s) = π_{ls}.
Then EU = n/c, and
  −(1/n) log Q(x_1^n | s_1) ≥ (c/n) log c − (c/n) H(U,V) ≥ (c/n) log c − (c/n)(H(U) + H(V))
Now, H(V) ≤ k, and by the maximum-entropy theorem for mean-constrained random variables,
  H(U) ≤ (n/c + 1) log(n/c + 1) − (n/c) log(n/c)  ⇒  (c/n) H(U,V) ≤ (c/n) k + (c/n) log(n/c) + o(1)
Recall c/n ≤ (1 + o(1))/log n  ⇒  (c/n) log(n/c) ≤ O(log log n / log n)
  ⇒ −(1/n) log Q(x_1^n | s_1) ≥ c log c / n − (1 + o(1)) k / log n − O(log log n / log n)  ∎

Universality for Individual Sequences: Discussion

❑ The theorem holds for any k-th order probability assignment Q_k, and in particular, for the k-th order empirical distribution of x_1^n, which gives an ideal code length equal to the empirical entropy

  −(1/n) log P̂(x_1^n) = Ĥ(x_1^n)

❑ The asymptotic O(log log n / log n) term in the redundancy has been improved to O(1/log n) – no better upper bound can be achieved
  • obtained with tools from renewal theory

Compressibility

❑ Finite-memory compressibility
  FM_k(x_1^n) = inf_{Q_k, s_1} ( −(1/n) log Q_k(x_1^n | s_1) )   (k-th order, finite sequence; Q_k is optimized for x_1^n, for each k)
  FM_k(x_1^∞) = limsup_{n→∞} FM_k(x_1^n)   (k-th order, infinite sequence)
  FM(x_1^∞) = lim_{k→∞} FM_k(x_1^∞)   (FM compressibility)
  (we must let n → ∞ before k → ∞, otherwise the definitions are meaningless!)
❑ Lempel-Ziv compression ratio
  LZ(x_1^n) = c(n)(log c(n) + 1) / n   (finite sequence)

  LZ(x_1^∞) = limsup_{n→∞} LZ(x_1^n)   (LZ compression ratio)

Theorem: For any sequence x_1^∞,  LZ(x_1^∞) ≤ FM(x_1^∞)

Probabilistic Setting

Theorem: Let X_{−∞}^∞ be a stationary ergodic random process. Then,

  LZ(X_1^∞) ≤ H(X_1^∞) with probability 1

Proof: via approximation of the stationary ergodic process with Markov processes of increasing order, and the previous theorems

  Q_k(x_{−(k−1)}^0, x_1^n) ≜ P_X(x_{−(k−1)}^0) ∏_{j=1}^{n} P_X(x_j | x_{j−k}^{j−1}),   X ~ P_X   (k-th order Markov approximation)

  H(X_j | X_{j−k}^{j−1}) → H(X^∞) as k → ∞

The Parsing Tree

x_1^n = 1, 0, 11, 01, 010, 00, 10, ...

  [parsing tree: root = phrase #0 (λ), one node per parsed phrase 1–7]
  phrase #   code
  0          λ
  1          (0,1)
  2          (0,0)
  3          (1,1)
  4          (2,1)
  5          (4,0)
  6          (2,0)
  7          (1,0)

  • coding could be made more efficient by “recycling” the codes of nodes that have a complete set of children in the dictionary (e.g., nodes 1, 2 above)
    – this will not affect the asymptotics
    – many (many many) tricks and hacks exist in practical implementations

The LZ Probability Assignment

x_1^n = 1, 0, 11, 01, 010, ...
❑ Slightly different tree evolution (anticipatory parsing)
❑ A weight is kept at every node
  • number of times the node was traversed through, + 1
❑ A node acts as a conditioning state, assigning to its children probabilities proportional to their weights
❑ Example: string s = 101101010
  P(0|s) = 4/7, P(1|s0) = 3/4, P(1|s01) = 1/3
  P(011|s) = (4/7)·(3/4)·(1/3) = 1/7   (notice the ‘telescoping’: P(s011) = 1/7!)
❑ In general,
  P(x_1^n) = 1/(c(n)+1)!
  −log P = c(n) log c(n) + o(c(n) log c(n))   ← the LZ code length!
  every lossless compression algorithm defines a probability assignment, even if it wasn’t meant to!

Other Properties

❑ The individual-sequences result applies also to FSM probability assignments
❑ The “worst sequence”
  • the counting sequence 0 1 00 01 10 11 000 001 010 011 100 101 110 111 ...
  • maximizes c(n) ⇒ incompressible with LZ78
❑ Generalization to larger alphabets is straightforward
❑ LZW modification: the extension symbol b is not sent; it is determined by the first symbol of the next phrase instead [Welch 1984] (see the sketch after this list)
  • the dictionary is initialized with all single-symbol strings
  • works very well in practice
  • breakthrough in the popularization of LZ, led to UNIX compress
❑ In real life we use bounded dictionaries, and need to reset them from time to time
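A minimal sketch of the LZW variant just described: the dictionary is initialized with all single-symbol strings, and only phrase indices are emitted (the extension symbol is recovered by the decoder from the first symbol of the next phrase; helper name and output format are illustrative):

```python
def lzw_encode(data, alphabet="01"):
    """LZW encoding sketch: emit the dictionary index of each longest match;
    the extended phrase (match + next symbol) is added to the dictionary
    but the extension symbol itself is never transmitted."""
    dictionary = {c: i for i, c in enumerate(alphabet)}
    w, output = "", []
    for c in data:
        if w + c in dictionary:
            w += c                               # keep extending the match
        else:
            output.append(dictionary[w])         # emit index of longest match
            dictionary[w + c] = len(dictionary)  # new phrase, symbol not sent
            w = c
    if w:
        output.append(dictionary[w])
    return output

print(lzw_encode("1011010100010"))   # a list of phrase indices
```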

Lempel-Ziv 77

❑ Exhaustive parsing, as opposed to incremental
  • a new phrase is formed by the longest match anywhere in a finite past window, plus the new symbol
  • a pointer to the location of the match, its length, and the new symbol are sent
❑ Has a weaker proof of universality, but actually works better in practice

  1, 0, 11, 010, 100, 010101, 11011, ...
        (2,1) (3,2) (4,2) (6,5) (14,5)
  pointer = (offset back, match length)
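A sketch of the exhaustive-parsing step; the (offset back, match length, next symbol) triples follow one common convention, and ties between equally long matches may be broken differently than on the slide, but the phrase boundaries for this example come out the same:

```python
def lz77_parse(x, window=1 << 12):
    """Greedy LZ77-style parsing: at each position, find the longest match of
    the upcoming text starting within the last `window` symbols, and emit
    (offset back, match length, next symbol)."""
    i, out = 0, []
    while i < len(x):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):   # candidate match starts
            l = 0
            while i + l < len(x) - 1 and x[j + l] == x[i + l]:
                l += 1                           # the match may run past i
            if l > best_len:                     # (self-referential matches ok)
                best_off, best_len = i - j, l
        out.append((best_off, best_len, x[i + best_len]))
        i += best_len + 1
    return out

# 7 phrases: 1 | 0 | 11 | 010 | 100 | 010101 | 11011
print(lz77_parse("101101010001010111011"))
```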

Lempel-Ziv in the Real World

❑ The most popular data compression algorithm in use
  • virtually every computer in the world runs some variant of LZ
  • LZ78
    – compress
    – GIF
    – TIFF
    – V.42 modems
  • LZ77
    – gzip, pkzip (LZ77 + Huffman for pointers and symbols)
    – png
  • many more implementations in software and hardware
    – MS Windows dll - software distribution
    – tape drives
    – printers
    – network routers
    – various commercially available VLSI designs
    – ...
