Techniques Part 1: Entropy Coding
Lecture 3: Entropy and Arithmetic Coding

Juha Kärkkäinen

07.11.2017

Entropy

As we have seen, the codeword lengths $\ell_s$, $s \in \Sigma$, of a complete prefix code for an alphabet $\Sigma$ satisfy $\sum_{s \in \Sigma} 2^{-\ell_s} = 1$. Thus the prefix code defines a probability distribution over $\Sigma$: $P(s) = 2^{-\ell_s}$ for all $s \in \Sigma$. Conversely, we can convert probabilities into codeword lengths:

$$\ell_s = -\log P(s) \quad \text{for all } s \in \Sigma.$$

For an arbitrary distribution, these codeword lengths are generally not integral and do not represent an actual prefix code. However, the average code length derived from these lengths is a quantity called the entropy.

Definition. Let $P$ be a probability distribution over an alphabet $\Sigma$. The entropy of $P$ is
$$H(P) = -\sum_{s \in \Sigma} P(s) \log P(s).$$

The quantity $-\log P(s)$ is called the self-information of the symbol $s$.

The concept of (information theoretic) entropy was introduced in 1948 by Claude Shannon in his paper “A Mathematical Theory of Communication”, which established the discipline of information theory. The following result is also from that paper.

Theorem (Noiseless Coding Theorem). Let $C$ be an optimal prefix code for an alphabet $\Sigma$ with the probability distribution $P$. Then
$$H(P) \;\le\; \sum_{s \in \Sigma} P(s)\,|C(s)| \;<\; H(P) + 1.$$

The same paper contains the Noisy-Channel Coding Theorem, which deals with coding in the presence of errors.

The entropy is an absolute lower bound on compressibility in the average sense. Because of integral code lengths, a prefix code may not quite match the lower bound, but the upper bound shows that it can get fairly close.

Example

symbol                    a     e     i     o     u     y
probability               0.2   0.3   0.1   0.2   0.1   0.1
self-information          2.32  1.74  3.32  2.32  3.32  3.32
Huffman codeword length   2     2     3     2     4     4

H(P) ≈ 2.45 while the average Huffman code length is 2.5. We will later see how to get even closer to the entropy, but it is never possible to get below it.
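As a quick numeric check of the example above, here is a small Python sketch (for illustration only; the variable names are our own choices) that computes the entropy and the average Huffman code length:

    from math import log2

    P        = {'a': 0.2, 'e': 0.3, 'i': 0.1, 'o': 0.2, 'u': 0.1, 'y': 0.1}
    code_len = {'a': 2,   'e': 2,   'i': 3,   'o': 2,   'u': 4,   'y': 4}   # Huffman lengths from the table

    H   = -sum(p * log2(p) for p in P.values())     # entropy H(P)
    avg = sum(P[s] * code_len[s] for s in P)        # average Huffman code length

    print(round(H, 2), round(avg, 2))               # -> 2.45 2.5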

We will next prove the Noiseless Coding Theorem, and for that we will need Gibbs’ Inequality.

Lemma (Gibbs’ Inequality). For any two probability distributions $P$ and $Q$ over $\Sigma$,
$$-\sum_{s \in \Sigma} P(s) \log P(s) \;\le\; -\sum_{s \in \Sigma} P(s) \log Q(s),$$
with equality if and only if $P = Q$.

Proof. Since $\ln x \le x - 1$ for all $x > 0$,
$$\sum_{s \in \Sigma} P(s) \ln \frac{Q(s)}{P(s)} \;\le\; \sum_{s \in \Sigma} P(s) \left( \frac{Q(s)}{P(s)} - 1 \right) \;=\; \sum_{s \in \Sigma} Q(s) - \sum_{s \in \Sigma} P(s) \;=\; 0.$$
Dividing both sides by $\ln 2$, we can change $\ln$ into $\log$. Then

$$\sum_{s \in \Sigma} P(s) \log Q(s) - \sum_{s \in \Sigma} P(s) \log P(s) \;=\; \sum_{s \in \Sigma} P(s) \log \frac{Q(s)}{P(s)} \;\le\; 0,$$
which proves the inequality. Since $\ln x = x - 1$ if and only if $x = 1$, all the inequalities above are equalities if and only if $P = Q$.

Proof of Noiseless Coding Theorem

Let us prove the upper bound first. For all $s \in \Sigma$, let $\ell_s = \lceil -\log P(s) \rceil$. Then
$$\sum_{s \in \Sigma} 2^{-\ell_s} \;\le\; \sum_{s \in \Sigma} 2^{\log P(s)} \;=\; \sum_{s \in \Sigma} P(s) \;=\; 1,$$
and by Kraft’s inequality, there exists a (possibly non-optimal and even redundant) prefix code with codeword length $\ell_s$ for each $s \in \Sigma$ (known as a Shannon code). The average code length of this code is
$$\sum_{s \in \Sigma} P(s) \ell_s \;=\; \sum_{s \in \Sigma} P(s) \lceil -\log P(s) \rceil \;<\; \sum_{s \in \Sigma} P(s) (-\log P(s) + 1) \;=\; H(P) + 1.$$

Now we prove the lower bound. Let $C$ be an optimal prefix code with codeword length $\ell_s$ for each $s \in \Sigma$. We can assume that $C$ is complete, i.e., $\sum_{s \in \Sigma} 2^{-\ell_s} = 1$. Define a distribution $Q$ by setting $Q(s) = 2^{-\ell_s}$. By Gibbs’ inequality, the average code length of $C$ satisfies

$$\sum_{s \in \Sigma} P(s) \ell_s \;=\; \sum_{s \in \Sigma} P(s) \left( -\log 2^{-\ell_s} \right) \;=\; -\sum_{s \in \Sigma} P(s) \log Q(s) \;\ge\; H(P).$$
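To see the Shannon code construction from the proof in action, the following small Python sketch (an illustration reusing the vowel distribution from the earlier example) computes the lengths ⌈−log P(s)⌉ and checks both bounds of the theorem:

    from math import ceil, log2

    P = {'a': 0.2, 'e': 0.3, 'i': 0.1, 'o': 0.2, 'u': 0.1, 'y': 0.1}

    lengths = {s: ceil(-log2(p)) for s, p in P.items()}    # Shannon code lengths
    H   = -sum(p * log2(p) for p in P.values())
    avg = sum(P[s] * lengths[s] for s in P)

    assert sum(2.0 ** -l for l in lengths.values()) <= 1   # Kraft's inequality holds
    assert H <= avg < H + 1                                # the two bounds of the theorem
    print(lengths, round(avg, 2))                          # average length 3.0 here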

One can prove tighter upper bounds than $H(P) + 1$ on the average Huffman code length in terms of the maximum probability $p_{\max}$ of a single symbol:

$$\sum_{s \in \Sigma} P(s)\,|C(s)| \;\le\; \begin{cases} H(P) + p_{\max} + 0.086 & \text{when } p_{\max} < 0.5 \\ H(P) + p_{\max} & \text{when } p_{\max} \ge 0.5 \end{cases}$$

The case when $p_{\max}$ is close to 1 represents the worst case for prefix coding, because then the self-information $-\log p_{\max}$ is much less than 1, but a codeword can never be shorter than 1 bit.

Example

symbol                    a      b
probability               0.99   0.01
self-information          0.014  6.64
Huffman codeword length   1      1

H(P) ≈ 0.081 while the average Huffman code length is 1. The Huffman code is more than 10 times longer than the entropy.
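Note that this example is consistent with the tighter bound above: here $p_{\max} = 0.99 \ge 0.5$, so the bound gives an average code length of at most $H(P) + p_{\max} \approx 0.081 + 0.99 = 1.071$, and the observed average length 1 is indeed below this.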

One way to get closer to entropy is to use a single codeword for multiple symbols.

Example

symbol pair               aa      ab      ba      bb
probability               0.9801  0.0099  0.0099  0.0001
self-information          0.029   6.658   6.658   13.288
Huffman codeword length   1       2       3       3

H(P) ≈ 0.162 for the pair distribution, while the average Huffman code length is 1.0299. We are now encoding two symbols with just over one bit on average.
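The pair probabilities above are simply products of the single-symbol probabilities. A small Python sketch (an illustration, with our own variable names) of building the pair distribution and checking that its entropy is twice the single-symbol entropy:

    from math import log2
    from itertools import product

    P  = {'a': 0.99, 'b': 0.01}
    P2 = {x + y: P[x] * P[y] for x, y in product(P, repeat=2)}   # pair probabilities

    H  = -sum(p * log2(p) for p in P.values())
    H2 = -sum(p * log2(p) for p in P2.values())

    print({s: round(p, 4) for s, p in P2.items()})   # {'aa': 0.9801, 'ab': 0.0099, 'ba': 0.0099, 'bb': 0.0001}
    print(round(H2, 3), round(2 * H, 3))             # both approximately 0.162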

The more symbols are encoded with one codeword, the closer to entropy we get. Taking this approach to the logical extreme — encoding the whole message with a single codeword — leads to arithmetic coding.

Arithmetic coding

Let $\Sigma = \{s_1, s_2, \dots, s_\sigma\}$ be an alphabet with an ordering $s_1 < s_2 < \dots < s_\sigma$. Let $P$ be a probability distribution over $\Sigma$ and define the cumulative distribution $F(s_i)$, for all $s_i \in \Sigma$, as

$$F(s_i) = \sum_{j=1}^{i} P(s_j).$$

We associate a symbol $s_i$ with the interval $[F(s_i) - P(s_i), F(s_i))$.

Example

symbol        a         b           c
probability   0.2       0.5         0.3
interval      [0, 0.2)  [0.2, 0.7)  [0.7, 1.0)

The intervals play the same role as the dyadic intervals in the proof of Kraft’s inequality, except that now we do not require the intervals to be dyadic.

We can extend the interval representation from symbols to strings. For any $n \ge 1$, define a probability distribution $P_n$ and a cumulative distribution $F_n$ over $\Sigma^n$, the set of strings of length $n$, as
$$P_n(X) = \prod_{i=1}^{n} P(x_i) \qquad\qquad F_n(X) = \sum_{Y \in \Sigma^n,\, Y \le X} P_n(Y)$$
for any $X = x_1 x_2 \dots x_n \in \Sigma^n$, where $Y \le X$ uses the lexicographical ordering of strings. The interval for $X$ is $[F_n(X) - P_n(X), F_n(X))$. The interval can be computed incrementally.

Input: string X = x_1 x_2 … x_n ∈ Σ^n and distributions P and F over Σ.
Output: interval [l, r) for X.

    [l, r) ← [0.0, 1.0)
    for i = 1 to n do
        r' ← F(x_i);  l' ← r' − P(x_i)                 // [l', r') is the interval for x_i
        p ← r − l;  r ← l + p · r';  l ← l + p · l'    // both updates use the old l, so r must be updated first
    return [l, r)

Example: Computing the interval for the string cab.

symbol        a         b           c
probability   0.2       0.5         0.3
interval      [0, 0.2)  [0.2, 0.7)  [0.7, 1.0)

Subdivide the interval for c.

string        ca           cb            cc
probability   0.06         0.15          0.09
interval      [0.7, 0.76)  [0.76, 0.91)  [0.91, 1.0)

Subdivide the interval for ca.

string        caa            cab             cac
probability   0.012          0.03            0.018
interval      [0.7, 0.712)   [0.712, 0.742)  [0.742, 0.76)

Output [0.712, 0.742).
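A runnable version of the incremental interval computation, as a minimal Python sketch (the function names and the use of floating-point arithmetic are illustration choices):

    P = {'a': 0.2, 'b': 0.5, 'c': 0.3}

    def symbol_intervals(P):
        # map each symbol s to its interval [F(s) - P(s), F(s))
        intervals, cum = {}, 0.0
        for s in sorted(P):                  # alphabet order a < b < c
            intervals[s] = (cum, cum + P[s])
            cum += P[s]
        return intervals

    def string_interval(X, P):
        # interval [l, r) for the string X, by repeated subdivision
        iv = symbol_intervals(P)
        l, r = 0.0, 1.0
        for x in X:
            lo, hi = iv[x]
            p = r - l
            l, r = l + p * lo, l + p * hi    # both updates use the old l
        return l, r

    print(string_interval('cab', P))         # -> approximately (0.712, 0.742)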

We can similarly associate an interval with each string over the code alphabet (which we assume to be binary here), but using equal probabilities. The interval for a binary string $B \in \{0,1\}^\ell$ is
$$I(B) = \left[ \frac{\mathrm{val}(B)}{2^\ell}, \frac{\mathrm{val}(B) + 1}{2^\ell} \right),$$
where $\mathrm{val}(B)$ is the value of $B$ as a binary number.

- This is the same interval as in the proof of Kraft’s inequality and is always dyadic.
- Another characterization of I(B) is that it contains exactly the fractional binary numbers beginning with B. In fact, in the literature, arithmetic coding is often described using binary fractions instead of intervals.

Example: For B = 101110, val(B) = 46 and thus

$$I(B) = \left[ \frac{46}{64}, \frac{47}{64} \right) = [.71875, .734375) = [.101110, .101111).$$

The arithmetic coding procedure is:

    source string ↦ source interval ↦ code interval ↦ code string

The missing step is the mapping from source intervals to code intervals. A good choice for the code interval is the longest dyadic interval that is completely contained in the source interval (a small code sketch of this step appears after the examples below). This ensures that the code intervals cannot overlap when the source intervals do not overlap, and thus the code is a prefix code for $\Sigma^n$.

Example: The string cab is encoded as 101110:

    cab ↦ [.712, .742) ↦ [.71875, .734375) = [.101110, .101111) ↦ 101110

The code is usually not optimal or even complete, because there are gaps between the code intervals. However, the length of the code interval for a source interval of length p is always more than p/4. Thus the code length for a string $X$ is less than $-\log P_n(X) + 2$, and the average code length is less than $H(P_n) + 2$. Since $H(P_n) = nH(P)$, the average code length per symbol is $H(P) + 2/n$.

A common description of arithmetic coding in the literature maps the source interval to any number in that interval and uses the binary representation of the number as the code. However, this is not necessarily a prefix code for $\Sigma^n$, and one has to be careful about how to end the code sequence.

Example: For strings of length 1 in our running example, we might obtain the following codes:

source string     a         b           c
source interval   [0, 0.2)  [0.2, 0.7)  [0.7, 1.0)
code number       0         .5₁₀ = .1₂  .75₁₀ = .11₂
code string       0         1           11
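Returning to the interval-to-code-interval step (the longest dyadic interval contained in the source interval), here is a minimal Python sketch; the function name and the float arithmetic are illustration choices only, and a practical coder would use the integer techniques described next:

    from math import ceil

    def code_string(l, r):
        # bits of the longest dyadic interval [v/2^k, (v+1)/2^k) inside [l, r)
        k = 1
        while True:
            v = ceil(l * 2**k)               # leftmost candidate start at level k
            if (v + 1) / 2**k <= r:          # candidate dyadic interval fits inside [l, r)
                return format(v, '0{}b'.format(k))
            k += 1

    print(code_string(0.712, 0.742))         # -> 101110, i.e. the interval [.71875, .734375)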

A problem with arithmetic coding is the potentially high precision needed in the interval computations. The number of digits required is proportional to the length of the final code string.

Practical implementations avoid high-precision numbers using a combination of two techniques:

- Rounding: Replace high-precision numbers with low-precision approximations. As long as the final intervals for two distinct strings can never overlap, the coding still works correctly. Rounding may increase the average code length, though.

- Renormalization: When the interval gets small enough, renormalize it, essentially “zooming in”. This way the intervals never get extremely small.

Practical implementations commonly use integer arithmetic, both to improve performance and to have full control of precision and the details of rounding.

Example: Arithmetic Coding in Practice

We use integer arithmetic with the integers [0, 64] representing the unit interval [0, 1.0]. For example, 13 represents 13/64 = 0.203125.

Due to limited precision, the interval sizes are not a perfect representation of the true probabilities. The alphabet intervals for our running example are:

symbol                a         b         c
true probability      0.2       0.5       0.3
self-information      2.32      1.0       1.74
interval              [0, 13)   [13, 45)  [45, 64)
implied probability   0.203125  0.5       0.296875
code length           2.30      1.0       1.75

This may have a tiny effect on the compression rate, but much smaller than using a prefix code would have.
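For instance, the implied probabilities and code lengths in the table can be computed directly from the integer intervals (a small Python illustration):

    from math import log2

    TOTAL = 64
    intervals = {'a': (0, 13), 'b': (13, 45), 'c': (45, 64)}

    for s, (lo, hi) in intervals.items():
        q = (hi - lo) / TOTAL                    # implied probability
        print(s, q, round(-log2(q), 2))          # implied code length in bits
    # a 0.203125 2.3
    # b 0.5 1.0
    # c 0.296875 1.75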

A real implementation would, of course, use larger integers, resulting in a higher precision and a smaller reduction in compression rate.

Now let us start encoding the string cab:

    [l, r)   ← [0, 64)                            // initialize
    [l', r') ← [45, 64)                           // interval for symbol c
    p ← 64 − 0 = 64
    l ← l + p·l'/64 = 0 + 64·45/64 = 45
    r ← l + p·r'/64 = 0 + 64·64/64 = 64           // r is computed with the old value of l

When multiplying two probabilities, we have to divide by 64 in order to keep the right scale. In a moment, we will see what happens when the result is not an integer.

The source interval [45, 64) is now completely in the second half of the unit interval [0, 64). This means that all the subintervals, including the final code interval, are in the second half too. Thus we know that the first bit of the code string is 1.

Now we output 1, i.e., set the first bit of the code string to 1. This corresponds to setting the code interval to [32, 64). Then we renormalize by zooming in to the second half:

    output 1
    l ← (l − 32) × 2 = 26
    r ← (r − 32) × 2 = 64

The code interval becomes [0, 64) again.

If the source interval had been completely in the first half, we would have output 0 and multiplied both end points by 2.

Now [l, r) = [26, 64) and we process the next symbol.

    [l', r') ← [0, 13)                            // interval for symbol a
    p ← 64 − 26 = 38
    l ← l + ⌊p·l'/64⌋ = 26 + ⌊38·0/64⌋ = 26
    r ← l + ⌊p·r'/64⌋ = 26 + ⌊38·13/64⌋ = 26 + ⌊7.7⌋ = 33

Here we had to do some rounding, because the exact values could not be represented with the limited precision. This can again cause a tiny increase in the average code length.

Other rounding schemes are possible. However, one has to be careful not to create an overlap between intervals that should be distinct. Our rounding scheme is OK in this respect. (Why?)

Now [l, r) = [26, 33). This interval is completely in neither the first half nor the second half. However, it is completely in the “middle” half [16, 48). We do not know yet what to output, but we still zoom in. Instead of outputting, we increment a variable called delayed_bits (initialized to 0):

    delayed_bits ← delayed_bits + 1 = 1
    l ← (l − 16) × 2 = 20
    r ← (r − 16) × 2 = 34

[20, 34) is still in the middle half, so we increment delayed_bits and renormalize again:

    delayed_bits ← delayed_bits + 1 = 2
    l ← (l − 16) × 2 = 8
    r ← (r − 16) × 2 = 36

Being able to zoom in to the middle half as well ensures that the length of the source interval always remains larger than one quarter of the unit interval, 16 in this case.

The delayed_bits variable keeps track of how many times we have renormalized without outputting anything. The output to produce will be resolved when we zoom in to the first or the second half. If the next zooming is to the left, the code intervals involved develop as follows:

The first interval, from before zooming in the middle, was a proper dyadic interval. The next two intervals, resulting from zooming in the middle, are not dyadic due to wrong alignment. The last interval is dyadic again, and the transition from the first to the last interval corresponds to the bits 011.

The general procedure when zooming in to the first half is to output 0 followed by delayed_bits 1’s, and then to set delayed_bits to zero. Similarly, when zooming in to the second half, output 1 followed by delayed_bits 0’s, and set delayed_bits to zero.

Let us return to the computation. The source interval is [l, r) = [8, 36) and we process the next source symbol.

    [l', r') ← [13, 45)                           // interval for symbol b
    p ← 36 − 8 = 28
    l ← l + ⌊p·l'/64⌋ = 8 + ⌊28·13/64⌋ = 8 + ⌊5.7⌋ = 13
    r ← l + ⌊p·r'/64⌋ = 8 + ⌊28·45/64⌋ = 8 + ⌊19.7⌋ = 27

[13, 27) is in the first half, so we output and renormalize:

    output 011
    delayed_bits ← 0
    l ← l × 2 = 26
    r ← r × 2 = 54

Now there are no more source symbols. The final scaled source interval is [26, 54). The current renormalized code interval is [0, 64). We need two more bits:

    output 10

Now the code interval is [32, 48), which is completely inside the source interval. The general rule for the last two bits is:

- If the source interval contains the second quadrant [16, 32), output 01 plus delayed_bits 1’s.

- If the source interval contains the third quadrant [32, 48), output 10 plus delayed_bits 0’s.

If the source interval does not contain either quadrant, it must be inside the first, second or middle half and we can renormalize. If the source interval contains both quadrants, either output is OK.

The final output is 101110. This is the same as we got with exact arithmetic, so in this case the rounding did not change the result.
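The whole computation above can be collected into a short encoder. The following Python sketch implements the integer-arithmetic scheme of this example (6-bit precision, the rounding and renormalization rules above); the names and structure are illustration choices, and a real coder would use wider integers and also transmit the message length n needed by the decoder:

    TOTAL, HALF, QUART = 64, 32, 16
    LOW  = {'a': 0,  'b': 13, 'c': 45}           # integer symbol intervals [LOW, HIGH)
    HIGH = {'a': 13, 'b': 45, 'c': 64}

    def encode(X):
        l, r = 0, TOTAL                          # source interval, scaled to [0, 64)
        delayed = 0                              # number of delayed (undecided) bits
        out = []

        def emit(bit):                           # output a bit plus the delayed bits
            nonlocal delayed
            out.append(bit + ('1' if bit == '0' else '0') * delayed)
            delayed = 0

        for x in X:
            p = r - l                            # current interval length
            # narrow the interval; both updates use the old l
            l, r = l + p * LOW[x] // TOTAL, l + p * HIGH[x] // TOTAL
            while True:                          # renormalize as long as possible
                if r <= HALF:                    # completely in the first half
                    emit('0'); l, r = 2 * l, 2 * r
                elif l >= HALF:                  # completely in the second half
                    emit('1'); l, r = 2 * (l - HALF), 2 * (r - HALF)
                elif l >= QUART and r <= HALF + QUART:   # completely in the middle half
                    delayed += 1
                    l, r = 2 * (l - QUART), 2 * (r - QUART)
                else:
                    break

        # final bits: the interval now contains the 2nd or the 3rd quadrant
        if l <= QUART and r >= HALF:             # contains [16, 32): output 01 plus delayed 1's
            out.append('0' + '1' * (delayed + 1))
        else:                                    # contains [32, 48): output 10 plus delayed 0's
            out.append('1' + '0' * (delayed + 1))
        return ''.join(out)

    print(encode('cab'))                         # -> 101110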

Decoding an arithmetic code performs much the same process, but now the process is controlled by the code string and produces the source string as output. Here is an outline of the decoding procedure in our example:

- We start with the code interval [0, 64). Each bit read from the code string halves the code interval.

- In the beginning, the source interval is [0, 64) too, and it is divided into three candidate intervals: [0, 13), [13, 45) and [45, 64).

- Whenever the code interval falls within one of the candidate intervals, that interval becomes the new source interval and we output the corresponding symbol.

- If the new source interval is within one of the three halves of the unit interval, we renormalize, scaling both the source and the code interval.

- The new source interval is divided into three candidate intervals and the process continues.

- The process ends when n symbols have been output.

The full details are left as an exercise.
