Entropy and Huffman Coding

Entropy as a Measure of Information Content

Entropy of a random variable. Let X be a random variable that takes on values from the set {x1, x2, . . . , xn} with respective probabilities p1, p2, . . . , pn, where \sum_i p_i = 1. Then the entropy of X, denoted H(X), represents the average amount of information contained in X and is defined by

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i.

Note that entropy is measured in bits. We also write Hn(p1, . . . , pn) for the entropy of the distribution (p1, . . . , pn).
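To make the definition concrete, here is a minimal Python sketch (the helper name entropy is mine, not part of the notes) that evaluates H(X) from a list of probabilities.

    import math

    def entropy(probs):
        # H(X) = -sum_i p_i * log2(p_i); zero-probability outcomes contribute nothing
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0 bit: a fair binary outcome
    print(entropy([1.0]))        # -0.0, i.e. 0 bits: a certain outcome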

Arguments in favor of the above definition of information:

• the definition is consistent with the following extreme cases:

1. If n = 2 and p1 = p2 = 1/2, then H(X) = 1 bit; i.e. when an event (e.g. X = x1) has an equally likely chance of occurring or not occurring, then its outcome possesses one bit of information. This is the maximum amount of information a binary outcome may possess.

2. If pi = 1 for some 1 ≤ i ≤ n, then H(X) = 0; i.e. any random variable whose outcome is certain possesses no information.

• moreover, the above definition is the only definition which satisfies the following three properties of information which seem reasonable under any definition:

– Normalization: H2(1/2, 1/2) = 1

– Continuity: H2(p, 1 − p) is a continuous function of p on the interval (0, 1)

– Grouping:

H_m(p_1, \ldots, p_m) = H_{m-1}(p_1 + p_2, p_3, \ldots, p_m) + (p_1 + p_2) H_2\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)
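The normalization and grouping properties are easy to check numerically. The short sketch below (reusing the entropy helper from the previous sketch, with a sample distribution of my own choosing) verifies both.

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Normalization: H_2(1/2, 1/2) = 1
    print(entropy([0.5, 0.5]))                      # 1.0

    # Grouping, checked on the sample distribution (0.1, 0.2, 0.3, 0.4)
    p = [0.1, 0.2, 0.3, 0.4]
    lhs = entropy(p)
    rhs = entropy([p[0] + p[1], p[2], p[3]]) + (p[0] + p[1]) * entropy(
        [p[0] / (p[0] + p[1]), p[1] / (p[0] + p[1])])
    print(abs(lhs - rhs) < 1e-12)                   # True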

Claude Shannon (1916-2001). Pioneer in

• applying Boolean logic to electronic circuit design

• studying the complexity of Boolean circuits

• signal processing: determined lower bounds on the number of samples needed to achieve a desired estimation accuracy

• game theory and computer chess: pioneered the use of minimax search in chess-playing programs

• information theory: first to give a precise definition for the concept of information

• coding theory: Channel Coding Theorem, Optimal Coding Theorem

Example 1. Calculate the amount of information contained in a weather forecast if the possible outcomes are {normal, rain, fog, hot, windy} with respective probabilities 0.8, 0.10, 0.04, 0.03, and 0.03.

Example 2. Calculate the entropy of the probability distribution (1/2, 1/4, 1/8, 1/16, 1/16).

Example 3. Verify that independently tossing a fair coin n times imparts n bits of information.
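For a quick check of Examples 1-3, the same entropy helper can be applied directly; the numbers in the comments are what the formula should give (Example 2 works out exactly because the probabilities are powers of 1/2).

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Example 1: the weather forecast
    print(entropy([0.8, 0.10, 0.04, 0.03, 0.03]))       # roughly 1.08 bits
    # Example 2: a dyadic distribution, giving exactly 1.875 bits
    print(entropy([1/2, 1/4, 1/8, 1/16, 1/16]))
    # Example 3 with n = 10: 2^10 equally likely outcomes give 10 bits
    print(entropy([2 ** -10] * 2 ** 10))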

Introduction to Codes

A code is a set of words C over some alphabet Ψ. The elements of C are called codewords. Let X be a finite set of objects. Then an encoding of X is a map φ : X → Ψ*, where Ψ* denotes the set of finite-length strings over Ψ. Thus, the image of the map is a set of codewords. Moreover, when we speak of a code C, we generally are referring to some encoding φ whose image is C. More definitions:

• encoding φ is one-to-one or non-singular if φ is a one-to-one map

• an extension of an encoding φ is a map φ∗ which maps finite-length strings from X to finite-length strings over Ψ, and is defined in terms of φ in the following manner:

φ*(x1 ··· xn) = φ(x1)φ(x2) ··· φ(xn).

• a code (encoding) is called uniquely decodable iff its extension is one-to-one

• a code is called a prefix code if no codeword is a proper prefix of another codeword.

Theorem 1 (Exercise). Every prefix code is uniquely decodable.
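A quick way to test the prefix property on small codes (useful for Examples 4 and 5 below) is to sort the codewords and compare neighbors. The helper below is a sketch of mine, not part of the notes.

    def is_prefix_code(codewords):
        # After sorting, any proper prefix ends up immediately before a word it prefixes.
        words = sorted(codewords)
        return all(not words[i + 1].startswith(words[i])
                   for i in range(len(words) - 1))

    # Morse-style fragment over {0, 1}: E = 0 is a proper prefix of I = 00.
    print(is_prefix_code(["0", "00", "1"]))        # False
    # Any fixed-length code (such as 7-bit ASCII) is automatically a prefix code.
    print(is_prefix_code(["000", "001", "010"]))   # True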

Example 4. Consider the following famous binary code (if we let a “dot” denote 0 and a “dash” denote 1) known as the International Morse Code, which encodes the alphabet {A–Z} (see below). Is this code injective? a prefix code? uniquely decodable? What if we increase the alphabet to {0, 1, #}, where e.g. φ('A') = 01#, and # represents a pause between letters?

CHARACTER   INTERNATIONAL MORSE CODE

A  01       J  0111     S  000
B  1000     K  101      T  1
C  1010     L  0100     U  001
D  100      M  11       V  0001
E  0        N  10       W  011
F  0010     O  111      X  1001
G  110      P  0110     Y  1011
H  0000     Q  1101     Z  1100
I  00       R  010

Example 5. Same questions as in Example 4, but now assume the code is the ASCII code (American Standard Code for Information Interchange). Hint: each letter is encoded into a seven-bit word.

(row = high-order hex digit, column = low-order hex digit of the character code)

      0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL

Theorem 2.

1. Kraft Inequality: For any binary prefix code with codeword lengths l1, . . . , ln,

\sum_{i=1}^{n} 2^{-l_i} \le 1.

Conversely, given a set of codeword lengths that satisfy this inequality, there exists a binary prefix code with these word lengths.

2. McMillan's Result: Kraft's inequality holds for all uniquely decodable codes.

Proof of Theorem 2, Part 1. Let C denote a finite binary prefix code and lmax denote the maximum length of a codeword in C. Now consider a perfect binary tree T of height lmax. Then the following facts should seem self-evident upon sufficient consideration.

1. T has 2^lmax leaves

2. there is a one-to-one correspondence between binary words of length not exceeding lmax and nodes of T

3. there is a one-to-one mapping from C into the set of nodes of T

4. every leaf of T has at most one ancestor in C

5. every codeword wi ∈ C is the ancestor of exactly 2^(lmax − li) leaves of T

From the above facts we see that

\sum_{i=1}^{n} 2^{l_{max} - l_i} \le 2^{l_{max}},

and hence (dividing both sides of the above inequality by 2^lmax)

\sum_{i=1}^{n} 2^{-l_i} \le 1.

Conversely, suppose that lengths l1, . . . , ln satisfy

\sum_{i=1}^{n} 2^{-l_i} \le 1.

Let T and lmax be as above, and assume without loss of generality that l1 ≤ l2 ≤ · · · ≤ ln. Define the binary code C in the following manner. Basis step: let w1 = 0^l1, the string of l1 zeros. Inductive step: assume that there exists 1 ≤ k ≤ n − 1 such that codewords w1, . . . , wk have been defined in such a way that |wi| = li for all 1 ≤ i ≤ k. Since k < n, the Kraft inequality gives \sum_{i=1}^{k} 2^{-l_i} < 1, which implies \sum_{i=1}^{k} 2^{l_{max} - l_i} < 2^{l_{max}}. Hence there exists a leaf L of T of which no member of w1, . . . , wk is an ancestor. Choose the first such leaf L and set wk+1 equal to the ancestor of L having length lk+1. Continuing in this manner, a prefix code C with the desired word lengths is attained. QED
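The converse half can also be seen through a standard equivalent construction (not the leaf-walking argument above): sort the lengths, and take each codeword to be the first li bits of the binary expansion of the running sum of 2^(-lj) over the previously assigned words. The sketch below, with names of my own choosing, implements that idea.

    from fractions import Fraction

    def prefix_code_from_lengths(lengths):
        # Greedy construction; valid whenever the Kraft sum is at most 1.
        lengths = sorted(lengths)
        assert sum(Fraction(1, 2 ** l) for l in lengths) <= 1, "Kraft inequality violated"
        codewords, s = [], Fraction(0)
        for l in lengths:
            # Codeword = first l bits of the binary expansion of the running sum s.
            codewords.append(format(int(s * 2 ** l), "b").zfill(l))
            s += Fraction(1, 2 ** l)
        return codewords

    print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']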

Efficient Codes

We now show the beautiful connection between entropy and coding theory. We may think of a computer file (or any other entity that possesses information content) as a finite string over some alphabet X = {x1, . . . , xn}. Moreover, we know that letters of the alphabet occur with different frequencies. For example, if X is the set of ASCII characters, then the letter “e” occurs with much greater frequency than “EOF”, the symbol that denotes the end of a file. Furthermore, through empirical studies, we can obtain good estimates for the values pi, 1 ≤ i ≤ n, where pi denotes the proportion of the file that consists of the i-th symbol xi.

Now suppose we want to represent file F as a binary string, so as to store it or transmit it over a computer network. Let φ be an encoding of X into binary strings (words). We define the average code length with respect to φ as

L_\varphi = \sum_i p_i |\varphi(x_i)|,

where pi represents the probability of xi appearing in the file. Thus, given a file F that consists of m letters from X and an encoding φ, the size of F with respect to φ is denoted |F|φ and equals m · Lφ. And so, to minimize the size of the file, we must find an encoding φ for which Lφ is minimized. We call such encodings length-optimal or, in the case of files, size-optimal.
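As a small illustration of Lφ and |F|φ, the sketch below compares a fixed-length code with a variable-length code on a hypothetical three-symbol alphabet (the probabilities and codewords are made up for the example).

    def average_length(probs, codewords):
        # L_phi = sum_i p_i * |phi(x_i)|
        return sum(p * len(w) for p, w in zip(probs, codewords))

    probs = [0.7, 0.2, 0.1]
    print(round(average_length(probs, ["00", "01", "10"]), 3))   # 2.0 bits per symbol
    print(round(average_length(probs, ["0", "10", "11"]), 3))    # 1.3 bits per symbol
    # A file of m = 1000 such symbols then needs about m * L_phi bits under each encoding.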

Theorem 3 (Claude Shannon). Let X = {x1, . . . , xn}, and suppose that pi is the probability (or weighted frequency) of xi. Let φ : X → C be uniquely-decodable. Then

1. Lφ ≥ H(X).

2. There exists an encoding φ̂ for which

H(X) ≤ Lφ̂ < H(X) + 1.

To prove Theorem 3, we introduce an asymmetric distance measure between two finite probability distributions.

Kullback-Leibler Distance. Let p = {p1, . . . , pn} and r = {r1, . . . , rn} be two probability distributions. Then the Kullback-Leibler distance between p and r, denoted D(p||r) is given by

D(p||r) = \sum_{i=1}^{n} p_i \log\left(\frac{p_i}{r_i}\right).
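A direct translation of the definition into Python follows; logarithms are taken base 2 to stay consistent with measuring information in bits (the notes leave the base implicit), and terms with pi = 0 are treated as contributing 0.

    import math

    def kl_distance(p, r):
        # D(p || r) = sum_i p_i * log2(p_i / r_i)
        return sum(pi * math.log2(pi / ri) for pi, ri in zip(p, r) if pi > 0)

    p = [0.5, 0.25, 0.25]
    r = [1/3, 1/3, 1/3]
    print(kl_distance(p, p))                       # 0.0: distance from p to itself
    print(kl_distance(p, r), kl_distance(r, p))    # two different positive values: asymmetric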

The following is stated without proof.

Lemma 1. D(p||r) ≥ 0, for all distributions p and r.

Proof of Theorem 3.

Part 1. Let li denote the length of φ(xi) for all 1 ≤ i ≤ n.

L_\varphi - H(X) = \sum_{i=1}^{n} p_i l_i - \sum_{i=1}^{n} p_i \log \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log 2^{-l_i} + \sum_{i=1}^{n} p_i \log p_i.

Now let c = \sum_{i=1}^{n} 2^{-l_i} and r_i = \frac{2^{-l_i}}{c}. Then

L_\varphi - H(X) = -\sum_{i=1}^{n} p_i \log\left(c \cdot \frac{2^{-l_i}}{c}\right) + \sum_{i=1}^{n} p_i \log p_i =

\sum_{i=1}^{n} p_i \log \frac{1}{c} - \sum_{i=1}^{n} p_i \log r_i + \sum_{i=1}^{n} p_i \log p_i = \log \frac{1}{c} + \sum_{i=1}^{n} p_i \log \frac{p_i}{r_i} = \log \frac{1}{c} + D(p||r) \ge 0,

where the final inequality holds because c ≤ 1 (by McMillan's result, since φ is uniquely decodable), so that log(1/c) ≥ 0, and D(p||r) ≥ 0 by Lemma 1. QED

Part 2. Let li = ⌈log(1/pi)⌉. These lengths satisfy the Kraft inequality since

\sum_{i=1}^{n} 2^{-\lceil \log \frac{1}{p_i} \rceil} \le \sum_{i=1}^{n} 2^{-\log \frac{1}{p_i}} = \sum_{i=1}^{n} p_i = 1.

Thus, by Theorem 2, we know that a prefix code φ̂ : X → C exists with these lengths. Moreover, for all 1 ≤ i ≤ n,

\log \frac{1}{p_i} \le l_i < \log \frac{1}{p_i} + 1.

Multiplying by pi and summing over i, we conclude that

H(X) ≤ Lφ̂ < H(X) + 1. QED
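Part 2 of the proof is constructive: take li = ⌈log(1/pi)⌉ and build a prefix code with those lengths via Theorem 2. The sketch below (the sample distribution and names are mine) checks the resulting bound H(X) ≤ Lφ̂ < H(X) + 1 numerically.

    import math

    def shannon_lengths(probs):
        # Codeword lengths l_i = ceil(log2(1 / p_i)), as in Part 2 of the proof.
        return [math.ceil(math.log2(1 / p)) for p in probs]

    probs = [0.4, 0.3, 0.2, 0.1]
    lengths = shannon_lengths(probs)
    H = -sum(p * math.log2(p) for p in probs)
    L = sum(p * l for p, l in zip(probs, lengths))
    print(lengths)            # [2, 2, 3, 4]
    print(H <= L < H + 1)     # True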

Huffman Coding

We now provide a greedy algorithm due to D.A. Huffman (1952) which will always find an encoding having minimum average length.

Huffman’s Algorithm:

• Name of Algorithm: Huffman’s Algorithm

• Input:

– symbol set X = {x1, . . . , xn} listed in increasing order of probability

– probabilities {p1, . . . , pn}

• Output: a length-optimal encoding φ

• Begin Algorithm

• base case 1. if X = {x} then return encoding φ, where φ(x) = ε, the empty string

• base case 2. if X = {x1, x2} then return encoding φ, where φ(x1) = 0 and φ(x2) = 1

• recursive case. combine x1 and x2 (the two least probable symbols) into a new symbol y having probability p1 + p2. Let X̂ = {y, x3, . . . , xn} and let φ̂ be the encoding obtained upon applying Huffman's Algorithm to X̂ with probabilities {p1 + p2, p3, . . . , pn}

• define φ as

– φ(xi) = φ̂(xi) for i ≥ 3

– φ(x1) = φ̂(y) · 0

– φ(x2) = φ̂(y) · 1

• return φ

• End Algorithm
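For completeness, here is an iterative, heap-based sketch of the algorithm just described (the dictionary-based bookkeeping and all names are my own; the recursion above merges the same pairs). It repeatedly pops the two least probable groups, prepends a bit to every codeword in each, and pushes the merged group back.

    import heapq

    def huffman_code(symbols, probs):
        if len(symbols) == 1:
            return {symbols[0]: ""}               # base case 1: the empty string
        # Heap entries: (probability, tie-breaking counter, {symbol: codeword-so-far}).
        heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(zip(symbols, probs))]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)       # the two least probable groups
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, counter, merged))
            counter += 1
        return heap[0][2]

    # A hypothetical five-symbol alphabet (probabilities chosen arbitrarily):
    # e gets the shortest codeword, a and b the longest.
    print(huffman_code(["a", "b", "c", "d", "e"], [0.05, 0.1, 0.15, 0.3, 0.4]))

The exact bit patterns may differ from a hand computation, since ties can be broken either way at each merge, but every Huffman code for the same distribution has the same (optimal) average length.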

Example 6. Let X = {1, 2, 3, 4, 5} with respective probabilities {.15, .15, .2, .25, .25}. Use Huffman's Algorithm to find a length-optimal encoding for X.

Example 7. Let X = {1, 2, 3, 4} with respective weights {4, 4, 5, 5}. Use Huffman's Algorithm to find a length-optimal encoding for X.

Theorem 4. Huffman's Algorithm is correct! In other words, if φ is the encoding produced by Huffman's Algorithm and φ2 is any other prefix encoding, then Lφ ≤ Lφ2.

The proof of the following lemma is left as an exercise.

Lemma 2. For a distribution p = {p1, . . . , pn}, there exists a length-optimal prefix code C such that

1. if pi > pj, then li ≤ lj

2. the two longest codewords have the same length

3. the two longest codewords differ only in the last bit. These two words are the encodings of the two least likely symbols in X

Proof of Theorem 4. The proof uses induction on n = |X |.

Basis Step. n = 1. In this case X = {x} and φ(x) = ε, the empty string. Note that in this case Lφ = H(X) = 0 and φ is clearly length-optimal.

Now suppose n = 2. Then X = {x1, x2} and Huffman's Algorithm yields φ(x1) = 0 and φ(x2) = 1. Obviously φ is a length-optimal prefix code.

Induction Step. Suppose that Huffman's Algorithm yields length-optimal codes for all symbol sets of size n or less, and let |X| = n + 1. Assume that x1 and x2 have the least probabilities p1 and p2. Let X̂ denote the symbol set {y, x3, . . . , xn+1} with corresponding probabilities p = {p1 + p2, p3, . . . , pn+1}. By the induction assumption, Huffman's Algorithm produces a length-optimal encoding φ̂ of X̂. Now, according to the algorithm, φ is defined as

• φ(xi) = φ̂(xi) for i ≥ 3

• φ(x1) = φ̂(y) · 0

• φ(x2) = φ̂(y) · 1

Since φ̂ is a prefix code, it is clear that φ is also a prefix code. We now show that φ is length-optimal. To see this we note that

L_\varphi = \sum_{i=1}^{n+1} p_i l_i = p_1 + p_2 + p_1(l_1 - 1) + p_2(l_2 - 1) + \sum_{i=3}^{n+1} p_i l_i = p_1 + p_2 + L_{\hat{\varphi}}.

Thus, minimizing Lφ̂ is equivalent to minimizing Lφ, since p1 + p2 is a constant. Verifying this is left as an exercise and requires Lemma 2. QED

Exercises.

1. Prove that the entropy function Hm(p1, . . . , pm) satisfies the grouping property.

2. Suppose that a multiple-choice exam has fifty questions with four responses each, where the correct response is randomly assigned a letter a-d. If a student who knows nothing about the exam subject takes the exam, how much information is in the scantron that records her responses? How much information is in the scantron of a student who has complete mastery of the subject? Assume all questions are answered independently. Hint: think in terms of what the professor expects to see when grading these exams. Suppose now the instructor notices that, over all questions, the correct answer was marked 67% of the time, the second-best response 20%, the third-best 10%, and the worst 3%. On average, how much information can be found in an exam?

3. Prove that every prefix code is uniquely decodable. Hint: use mathematical induction on the length of the string y to be decoded; i.e. φ*(x1 x2 ··· xn) = y, and you must show that the sequence of objects x1, x2, . . . , xn is unique. You may assume that φ is one-to-one.

4. Is it possible to define a uniquely decodable encoding of 5 objects if the respective codeword lengths are to be 1, 2, 3, 3, 3? Explain.

5. Give an example of a code that is not a prefix code, but is still uniquely decodable.

6. A fair coin is flipped until the first head occurs. Let random variable X denote the number of flips required. Find the entropy H(X) in bits. The following identities, valid for |r| < 1, may be useful:

\sum_{n=1}^{\infty} r^n = \frac{r}{1 - r},

\sum_{n=1}^{\infty} n r^n = \frac{r}{(1 - r)^2}.

7. Let φ : X → {0, 1}* be a prefix code for a finite set X. Given a binary string y of length n, describe an efficient procedure for decoding y into a unique sequence x1, . . . , xm of objects in X. State the asymptotic running time (in terms of n) of your procedure.

8. The inventor of Morse code, Samuel Morse (1791-1872), needed to know the frequency of letters in English text so that he could give the simplest codewords to the most frequently used letters. He did it simply by counting the number of letters in sets of printers' type. The figures he came up with were:

12,000  E                 2,500  F
 9,000  T                 2,000  W, Y
 8,000  A, I, N, O, S     1,700  G, P
 6,400  H                 1,600  B
 6,200  R                 1,200  V
 4,400  D                   800  K
 4,000  L                   500  Q
 3,400  U                   400  J, X
 3,000  C, M                200  Z

Use this data to compute the entropy of a random variable that outputs one of the English letters. Assuming that a 1Mb text file is Huffman encoded according to the above frequencies, what will be the size of the encoded binary file?

9. Compute the Kullback-Leibler distance between the distributions p = (1/3, 1/3, 1/3) and q = (7/8, 1/16, 1/16).

10. Given object set X = {x1, x2, x3, x4, x5} with respective probabilities 0.1, 0.35, 0.05, 0.2, 0.3, find a Huffman code for X .
