Techniques Part 1: Entropy Coding
Lecture 3: Entropy and Arithmetic Coding

Juha Kärkkäinen

07.11.2017

Entropy

As we have seen, the codeword lengths $\ell_s$, $s \in \Sigma$, of a complete prefix code for an alphabet $\Sigma$ satisfy $\sum_{s \in \Sigma} 2^{-\ell_s} = 1$. Thus the prefix code defines a probability distribution over $\Sigma$: $P(s) = 2^{-\ell_s}$ for all $s \in \Sigma$. Conversely, we can convert probabilities into codeword lengths:

$$\ell_s = -\log P(s) \quad \text{for all } s \in \Sigma.$$

For an arbitrary distribution, these codeword lengths are generally not integral and do not represent an actual prefix code. However, the average code length derived from these lengths is a quantity called the entropy.

Definition. Let $P$ be a probability distribution over an alphabet $\Sigma$. The entropy of $P$ is
$$H(P) = -\sum_{s \in \Sigma} P(s) \log P(s).$$

The quantity $-\log P(s)$ is called the self-information of the symbol $s$.

The concept of (information theoretic) entropy was introduced in 1948 by Claude Shannon in his paper “A Mathematical Theory of Communication”, which established the discipline of information theory. The following result is also from that paper.

Theorem (Noiseless Coding Theorem). Let $C$ be an optimal prefix code for an alphabet $\Sigma$ with the probability distribution $P$. Then
$$H(P) \;\le\; \sum_{s \in \Sigma} P(s)\,|C(s)| \;<\; H(P) + 1.$$

The same paper contains the Noisy-Channel Coding Theorem, which deals with coding in the presence of errors.

The entropy is an absolute lower bound on compressibility in the average sense. Because of integral code lengths, a prefix code may not quite match the lower bound, but the upper bound shows that it can get fairly close.

Example

symbol                    a     e     i     o     u     y
probability               0.2   0.3   0.1   0.2   0.1   0.1
self-information          2.32  1.74  3.32  2.32  3.32  3.32
Huffman codeword length   2     2     3     2     4     4

H(P) ≈ 2.45 while the average Huffman code length is 2.5. We will later see how to get even closer to the entropy, but it is never possible to get below it.
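As a quick numeric check of the example above, here is a small Python sketch (for illustration only; the variable names are our own choices) that computes the entropy and the average Huffman code length:

    from math import log2

    P        = {'a': 0.2, 'e': 0.3, 'i': 0.1, 'o': 0.2, 'u': 0.1, 'y': 0.1}
    code_len = {'a': 2,   'e': 2,   'i': 3,   'o': 2,   'u': 4,   'y': 4}   # Huffman lengths from the table

    H   = -sum(p * log2(p) for p in P.values())     # entropy H(P)
    avg = sum(P[s] * code_len[s] for s in P)        # average Huffman code length

    print(round(H, 2), round(avg, 2))               # -> 2.45 2.5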

We will next prove the Noiseless Coding Theorem, and for that we will need Gibbs’ Inequality.

Lemma (Gibbs’ Inequality). For any two probability distributions $P$ and $Q$ over $\Sigma$,
$$-\sum_{s \in \Sigma} P(s) \log P(s) \;\le\; -\sum_{s \in \Sigma} P(s) \log Q(s),$$
with equality if and only if $P = Q$.

Proof. Since $\ln x \le x - 1$ for all $x > 0$,
$$\sum_{s \in \Sigma} P(s) \ln \frac{Q(s)}{P(s)} \;\le\; \sum_{s \in \Sigma} P(s) \left( \frac{Q(s)}{P(s)} - 1 \right) \;=\; \sum_{s \in \Sigma} Q(s) - \sum_{s \in \Sigma} P(s) \;=\; 0.$$
Dividing both sides by $\ln 2$, we can change $\ln$ into $\log$. Then

$$\sum_{s \in \Sigma} P(s) \log Q(s) - \sum_{s \in \Sigma} P(s) \log P(s) \;=\; \sum_{s \in \Sigma} P(s) \log \frac{Q(s)}{P(s)} \;\le\; 0,$$
which proves the inequality. Since $\ln x = x - 1$ if and only if $x = 1$, all the inequalities above are equalities if and only if $P = Q$.

Proof of Noiseless Coding Theorem

Let us prove the upper bound first. For all $s \in \Sigma$, let $\ell_s = \lceil -\log P(s) \rceil$. Then
$$\sum_{s \in \Sigma} 2^{-\ell_s} \;\le\; \sum_{s \in \Sigma} 2^{\log P(s)} \;=\; \sum_{s \in \Sigma} P(s) \;=\; 1,$$
and by Kraft’s inequality, there exists a (possibly non-optimal and even redundant) prefix code with codeword length $\ell_s$ for each $s \in \Sigma$ (known as a Shannon code). The average code length of this code is
$$\sum_{s \in \Sigma} P(s) \ell_s \;=\; \sum_{s \in \Sigma} P(s) \lceil -\log P(s) \rceil \;<\; \sum_{s \in \Sigma} P(s) (-\log P(s) + 1) \;=\; H(P) + 1.$$

Now we prove the lower bound. Let $C$ be an optimal prefix code with codeword length $\ell_s$ for each $s \in \Sigma$. We can assume that $C$ is complete, i.e., $\sum_{s \in \Sigma} 2^{-\ell_s} = 1$. Define a distribution $Q$ by setting $Q(s) = 2^{-\ell_s}$. By Gibbs’ inequality, the average code length of $C$ satisfies

$$\sum_{s \in \Sigma} P(s) \ell_s \;=\; \sum_{s \in \Sigma} P(s) \left( -\log 2^{-\ell_s} \right) \;=\; -\sum_{s \in \Sigma} P(s) \log Q(s) \;\ge\; H(P).$$
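To see the Shannon code construction from the proof in action, the following small Python sketch (an illustration reusing the vowel distribution from the earlier example) computes the lengths ⌈−log P(s)⌉ and checks both bounds of the theorem:

    from math import ceil, log2

    P = {'a': 0.2, 'e': 0.3, 'i': 0.1, 'o': 0.2, 'u': 0.1, 'y': 0.1}

    lengths = {s: ceil(-log2(p)) for s, p in P.items()}    # Shannon code lengths
    H   = -sum(p * log2(p) for p in P.values())
    avg = sum(P[s] * lengths[s] for s in P)

    assert sum(2.0 ** -l for l in lengths.values()) <= 1   # Kraft's inequality holds
    assert H <= avg < H + 1                                # the two bounds of the theorem
    print(lengths, round(avg, 2))                          # average length 3.0 here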

One can prove tighter upper bounds than $H(P) + 1$ on the average Huffman code length in terms of the maximum probability $p_{\max}$ of a single symbol:

$$\sum_{s \in \Sigma} P(s)\,|C(s)| \;\le\; \begin{cases} H(P) + p_{\max} + 0.086 & \text{when } p_{\max} < 0.5 \\ H(P) + p_{\max} & \text{when } p_{\max} \ge 0.5 \end{cases}$$

The case when $p_{\max}$ is close to 1 represents the worst case for prefix coding, because then the self-information $-\log p_{\max}$ is much less than 1, but a codeword can never be shorter than 1 bit.

Example

symbol                    a      b
probability               0.99   0.01
self-information          0.014  6.64
Huffman codeword length   1      1

H(P) ≈ 0.081 while the average Huffman code length is 1. The Huffman code is more than 10 times longer than the entropy.
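Note that this example is consistent with the tighter bound above: here $p_{\max} = 0.99 \ge 0.5$, so the bound gives an average code length of at most $H(P) + p_{\max} \approx 0.081 + 0.99 = 1.071$, and the observed average length 1 is indeed below this.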

One way to get closer to entropy is to use a single codeword for multiple symbols.

Example

symbol pair               aa      ab      ba      bb
probability               0.9801  0.0099  0.0099  0.0001
self-information          0.029   6.658   6.658   13.288
Huffman codeword length   1       2       3       3

H(P) ≈ 0.162 for the pair distribution, while the average Huffman code length is 1.0299. We are now encoding two symbols with just over one bit on average.
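The pair probabilities above are simply products of the single-symbol probabilities. A small Python sketch (an illustration, with our own variable names) of building the pair distribution and checking that its entropy is twice the single-symbol entropy:

    from math import log2
    from itertools import product

    P  = {'a': 0.99, 'b': 0.01}
    P2 = {x + y: P[x] * P[y] for x, y in product(P, repeat=2)}   # pair probabilities

    H  = -sum(p * log2(p) for p in P.values())
    H2 = -sum(p * log2(p) for p in P2.values())

    print({s: round(p, 4) for s, p in P2.items()})   # {'aa': 0.9801, 'ab': 0.0099, 'ba': 0.0099, 'bb': 0.0001}
    print(round(H2, 3), round(2 * H, 3))             # both approximately 0.162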

The more symbols are encoded with one codeword, the closer to entropy we get. Taking this approach to the logical extreme — encoding the whole message with a single codeword — leads to arithmetic coding.

Arithmetic coding

Let $\Sigma = \{s_1, s_2, \dots, s_\sigma\}$ be an alphabet with an ordering $s_1 < s_2 < \dots < s_\sigma$. Let $P$ be a probability distribution over $\Sigma$ and define the cumulative distribution $F(s_i)$, for all $s_i \in \Sigma$, as

$$F(s_i) = \sum_{j=1}^{i} P(s_j).$$

We associate a symbol $s_i$ with the interval $[F(s_i) - P(s_i), F(s_i))$.

Example

symbol        a         b           c
probability   0.2       0.5         0.3
interval      [0, 0.2)  [0.2, 0.7)  [0.7, 1.0)

The intervals play the same role as the dyadic intervals in the proof of Kraft’s inequality, except that now we do not require the intervals to be dyadic.

We can extend the interval representation from symbols to strings. For any $n \ge 1$, define a probability distribution $P_n$ and a cumulative distribution $F_n$ over $\Sigma^n$, the set of strings of length $n$, as
$$P_n(X) = \prod_{i=1}^{n} P(x_i) \qquad\qquad F_n(X) = \sum_{Y \in \Sigma^n,\, Y \le X} P_n(Y)$$
for any $X = x_1 x_2 \dots x_n \in \Sigma^n$, where $Y \le X$ uses the lexicographical ordering of strings. The interval for $X$ is $[F_n(X) - P_n(X), F_n(X))$. The interval can be computed incrementally.

Input: string X = x_1 x_2 … x_n ∈ Σ^n and distributions P and F over Σ.
Output: interval [l, r) for X.

    [l, r) ← [0.0, 1.0)
    for i = 1 to n do
        r' ← F(x_i);  l' ← r' − P(x_i)                 // [l', r') is the interval for x_i
        p ← r − l;  r ← l + p · r';  l ← l + p · l'    // both updates use the old l, so r must be updated first
    return [l, r)

Example: Computing the interval for the string cab.

symbol        a         b           c
probability   0.2       0.5         0.3
interval      [0, 0.2)  [0.2, 0.7)  [0.7, 1.0)

Subdivide the interval for c.

string        ca           cb            cc
probability   0.06         0.15          0.09
interval      [0.7, 0.76)  [0.76, 0.91)  [0.91, 1.0)

Subdivide the interval for ca.

string        caa            cab             cac
probability   0.012          0.03            0.018
interval      [0.7, 0.712)   [0.712, 0.742)  [0.742, 0.76)

Output [0.712, 0.742).
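A runnable version of the incremental interval computation, as a minimal Python sketch (the function names and the use of floating-point arithmetic are illustration choices):

    P = {'a': 0.2, 'b': 0.5, 'c': 0.3}

    def symbol_intervals(P):
        # map each symbol s to its interval [F(s) - P(s), F(s))
        intervals, cum = {}, 0.0
        for s in sorted(P):                  # alphabet order a < b < c
            intervals[s] = (cum, cum + P[s])
            cum += P[s]
        return intervals

    def string_interval(X, P):
        # interval [l, r) for the string X, by repeated subdivision
        iv = symbol_intervals(P)
        l, r = 0.0, 1.0
        for x in X:
            lo, hi = iv[x]
            p = r - l
            l, r = l + p * lo, l + p * hi    # both updates use the old l
        return l, r

    print(string_interval('cab', P))         # -> approximately (0.712, 0.742)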

We can similarly associate an interval with each string over the code alphabet (which we assume to be binary here), but using equal probabilities. The interval for a binary string $B \in \{0,1\}^\ell$ is
$$I(B) = \left[ \frac{\mathrm{val}(B)}{2^\ell}, \frac{\mathrm{val}(B) + 1}{2^\ell} \right),$$
where $\mathrm{val}(B)$ is the value of $B$ as a binary number.

- This is the same interval as in the proof of Kraft’s inequality and is always dyadic.
- Another characterization of I(B) is that it contains exactly the fractional binary numbers beginning with B. In fact, in the literature, arithmetic coding is often described using binary fractions instead of intervals.

Example: For B = 101110, val(B) = 46 and thus

$$I(B) = \left[ \frac{46}{64}, \frac{47}{64} \right) = [.71875, .734375) = [.101110, .101111).$$

The arithmetic coding procedure is:

    source string ↦ source interval ↦ code interval ↦ code string

The missing step is the mapping from source intervals to code intervals. A good choice for the code interval is the longest dyadic interval that is completely contained in the source interval (a small code sketch of this step appears after the examples below). This ensures that the code intervals cannot overlap when the source intervals do not overlap, and thus the code is a prefix code for $\Sigma^n$.

Example: The string cab is encoded as 101110:

    cab ↦ [.712, .742) ↦ [.71875, .734375) = [.101110, .101111) ↦ 101110

The code is usually not optimal or even complete, because there are gaps between the code intervals. However, the length of the code interval for a source interval of length p is always more than p/4. Thus the code length for a string $X$ is less than $-\log P_n(X) + 2$, and the average code length is less than $H(P_n) + 2$. Since $H(P_n) = nH(P)$, the average code length per symbol is $H(P) + 2/n$.

A common description of arithmetic coding in the literature maps the source interval to any number in that interval and uses the binary representation of the number as the code. However, this is not necessarily a prefix code for $\Sigma^n$, and one has to be careful about how to end the code sequence.

Example: For strings of length 1 in our running example, we might obtain the following codes:

source string     a         b           c
source interval   [0, 0.2)  [0.2, 0.7)  [0.7, 1.0)
code number       0         .5₁₀ = .1₂  .75₁₀ = .11₂
code string       0         1           11
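Returning to the interval-to-code-interval step (the longest dyadic interval contained in the source interval), here is a minimal Python sketch; the function name and the float arithmetic are illustration choices only, and a practical coder would use the integer techniques described next:

    from math import ceil

    def code_string(l, r):
        # bits of the longest dyadic interval [v/2^k, (v+1)/2^k) inside [l, r)
        k = 1
        while True:
            v = ceil(l * 2**k)               # leftmost candidate start at level k
            if (v + 1) / 2**k <= r:          # candidate dyadic interval fits inside [l, r)
                return format(v, '0{}b'.format(k))
            k += 1

    print(code_string(0.712, 0.742))         # -> 101110, i.e. the interval [.71875, .734375)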

A problem with arithmetic coding is the potentially high precision needed in the interval computations. The number of digits required is proportional to the length of the final code string.

Practical implementations avoid high-precision numbers using a combination of two techniques:

- Rounding: Replace high-precision numbers with low-precision approximations. As long as the final intervals for two distinct strings can never overlap, the coding still works correctly. Rounding may increase the average code length, though.

- Renormalization: When the interval gets small enough, renormalize it, essentially “zooming in”. This way the intervals never get extremely small.

Practical implementations commonly use integer arithmetic, both to improve performance and to have full control of precision and the details of rounding.

Example: Arithmetic Coding in Practice

We use integer arithmetic with the integers [0, 64] representing the unit interval [0, 1.0]. For example, 13 represents 13/64 = 0.203125.

Due to limited precision, the interval sizes are not a perfect representation of the true probabilities. The alphabet intervals for our running example are:

symbol                a         b         c
true probability      0.2       0.5       0.3
self-information      2.32      1.0       1.74
interval              [0, 13)   [13, 45)  [45, 64)
implied probability   0.203125  0.5       0.296875
code length           2.30      1.0       1.75

This may have a tiny effect on the compression rate, but much smaller than using a prefix code would have.
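For instance, the implied probabilities and code lengths in the table can be computed directly from the integer intervals (a small Python illustration):

    from math import log2

    TOTAL = 64
    intervals = {'a': (0, 13), 'b': (13, 45), 'c': (45, 64)}

    for s, (lo, hi) in intervals.items():
        q = (hi - lo) / TOTAL                    # implied probability
        print(s, q, round(-log2(q), 2))          # implied code length in bits
    # a 0.203125 2.3
    # b 0.5 1.0
    # c 0.296875 1.75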

A real implementation would, of course, use larger integers, resulting in a higher precision and a smaller reduction in compression rate.

Now let us start encoding the string cab:

    [l, r)   ← [0, 64)                            // initialize
    [l', r') ← [45, 64)                           // interval for symbol c
    p ← 64 − 0 = 64
    l ← l + p·l'/64 = 0 + 64·45/64 = 45
    r ← l + p·r'/64 = 0 + 64·64/64 = 64           // r is computed with the old value of l

When multiplying two probabilities, we have to divide by 64 in order to keep the right scale. In a moment, we will see what happens when the result is not an integer.

The source interval [45, 64) is now completely in the second half of the unit interval [0, 64). This means that all the subintervals, including the final code interval, are in the second half too. Thus we know that the first bit of the code string is 1.

Now we output 1, i.e., set the first bit of the code string to 1. This corresponds to setting the code interval to [32, 64). Then we renormalize by zooming in to the second half:

    output 1
    l ← (l − 32) × 2 = 26
    r ← (r − 32) × 2 = 64

The code interval becomes [0, 64) again.

If the source interval had been completely in the first half, we would have output 0 and multiplied both end points by 2.

Now [l, r) = [26, 64) and we process the next symbol.

    [l', r') ← [0, 13)                            // interval for symbol a
    p ← 64 − 26 = 38
    l ← l + ⌊p·l'/64⌋ = 26 + ⌊38·0/64⌋ = 26
    r ← l + ⌊p·r'/64⌋ = 26 + ⌊38·13/64⌋ = 26 + ⌊7.7⌋ = 33

Here we had to do some rounding, because the exact values could not be represented with the limited precision. This can again cause a tiny increase in the average code length.

Other rounding schemes are possible. However, one has to be careful not to create an overlap between intervals that should be distinct. Our rounding scheme is OK in this respect. (Why?)

Now [l, r) = [26, 33). This interval is completely in neither the first half nor the second half. However, it is completely in the “middle” half [16, 48). We do not know yet what to output, but we still zoom in. Instead of outputting, we increment a variable called delayed_bits (initialized to 0):

    delayed_bits ← delayed_bits + 1 = 1
    l ← (l − 16) × 2 = 20
    r ← (r − 16) × 2 = 34

[20, 34) is still in the middle half, so we increment delayed_bits and renormalize again:

    delayed_bits ← delayed_bits + 1 = 2
    l ← (l − 16) × 2 = 8
    r ← (r − 16) × 2 = 36

Being able to zoom in to the middle half as well ensures that the length of the source interval always remains larger than one quarter of the unit interval, 16 in this case.

The delayed_bits variable keeps track of how many times we have renormalized without outputting anything. The output to produce will be resolved when we zoom in to the first or the second half. If the next zooming is to the left, the code intervals involved develop as follows:

The first interval, from before zooming in the middle, was a proper dyadic interval. The next two intervals, resulting from zooming in the middle, are not dyadic due to wrong alignment. The last interval is dyadic again, and the transition from the first to the last interval corresponds to the bits 011.

The general procedure when zooming in to the first half is to output 0 followed by delayed_bits 1’s, and then to set delayed_bits to zero. Similarly, when zooming in to the second half, output 1 followed by delayed_bits 0’s, and set delayed_bits to zero.

Let us return to the computation. The source interval is [l, r) = [8, 36) and we process the next source symbol.

    [l', r') ← [13, 45)                           // interval for symbol b
    p ← 36 − 8 = 28
    l ← l + ⌊p·l'/64⌋ = 8 + ⌊28·13/64⌋ = 8 + ⌊5.7⌋ = 13
    r ← l + ⌊p·r'/64⌋ = 8 + ⌊28·45/64⌋ = 8 + ⌊19.7⌋ = 27

[13, 27) is in the first half, so we output and renormalize:

    output 011
    delayed_bits ← 0
    l ← l × 2 = 26
    r ← r × 2 = 54

Now there are no more source symbols. The final scaled source interval is [26, 54). The current renormalized code interval is [0, 64). We need two more bits:

    output 10

Now the code interval is [32, 48), which is completely inside the source interval. The general rule for the last two bits is:

- If the source interval contains the second quadrant [16, 32), output 01 plus delayed_bits 1’s.

- If the source interval contains the third quadrant [32, 48), output 10 plus delayed_bits 0’s.

If the source interval does not contain either quadrant, it must be inside the first, second or middle half and we can renormalize. If the source interval contains both quadrants, either output is OK.

The final output is 101110. This is the same as we got with exact arithmetic, so in this case the rounding did not change the result.
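The whole computation above can be collected into a short encoder. The following Python sketch implements the integer-arithmetic scheme of this example (6-bit precision, the rounding and renormalization rules above); the names and structure are illustration choices, and a real coder would use wider integers and also transmit the message length n needed by the decoder:

    TOTAL, HALF, QUART = 64, 32, 16
    LOW  = {'a': 0,  'b': 13, 'c': 45}           # integer symbol intervals [LOW, HIGH)
    HIGH = {'a': 13, 'b': 45, 'c': 64}

    def encode(X):
        l, r = 0, TOTAL                          # source interval, scaled to [0, 64)
        delayed = 0                              # number of delayed (undecided) bits
        out = []

        def emit(bit):                           # output a bit plus the delayed bits
            nonlocal delayed
            out.append(bit + ('1' if bit == '0' else '0') * delayed)
            delayed = 0

        for x in X:
            p = r - l                            # current interval length
            # narrow the interval; both updates use the old l
            l, r = l + p * LOW[x] // TOTAL, l + p * HIGH[x] // TOTAL
            while True:                          # renormalize as long as possible
                if r <= HALF:                    # completely in the first half
                    emit('0'); l, r = 2 * l, 2 * r
                elif l >= HALF:                  # completely in the second half
                    emit('1'); l, r = 2 * (l - HALF), 2 * (r - HALF)
                elif l >= QUART and r <= HALF + QUART:   # completely in the middle half
                    delayed += 1
                    l, r = 2 * (l - QUART), 2 * (r - QUART)
                else:
                    break

        # final bits: the interval now contains the 2nd or the 3rd quadrant
        if l <= QUART and r >= HALF:             # contains [16, 32): output 01 plus delayed 1's
            out.append('0' + '1' * (delayed + 1))
        else:                                    # contains [32, 48): output 10 plus delayed 0's
            out.append('1' + '0' * (delayed + 1))
        return ''.join(out)

    print(encode('cab'))                         # -> 101110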

Decoding an arithmetic code performs much the same process, but now the process is controlled by the code string and produces the source string as output. Here is an outline of the decoding procedure in our example:

- We start with the code interval [0, 64). Each bit read from the code string halves the code interval.

- In the beginning, the source interval is [0, 64) too, and it is divided into three candidate intervals: [0, 13), [13, 45) and [45, 64).

- Whenever the code interval falls within one of the candidate intervals, that interval becomes the new source interval and we output the corresponding symbol.

- If the new source interval is within one of the three halves of the unit interval, we renormalize, scaling both the source and the code interval.

- The new source interval is divided into three candidate intervals and the process continues.

- The process ends when n symbols have been output.

The full details are left as an exercise.
