Entropy and Huffman Coding

Entropy as a Measure of Information Content

Entropy of a random variable. Let X be a random variable that takes on values from the set {x1, x2, . . . , xn} with respective probabilities p1, p2, . . . , pn, where \sum_i p_i = 1. Then the entropy of X, denoted H(X), represents the average amount of information contained in X and is defined by

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i.

Note that entropy is measured in bits. We also write Hn(p1, . . . , pn) for the entropy of the distribution (p1, . . . , pn).
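To make the definition concrete, here is a minimal Python sketch (the helper name entropy is mine, not part of the notes) that evaluates H(X) from a list of probabilities.

    import math

    def entropy(probs):
        # H(X) = -sum_i p_i * log2(p_i); zero-probability outcomes contribute nothing
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0 bit: a fair binary outcome
    print(entropy([1.0]))        # -0.0, i.e. 0 bits: a certain outcome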

Arguments in favor of the above definition of information:

• the definition is consistent with the following extreme cases:

1. If n = 2 and p1 = p2 = 1/2, then H(X) = 1 bit; i.e. when an event (e.g. X = x1) has an equally likely chance of occurring or not occurring, then its outcome possesses one bit of information. This is the maximum amount of information a binary outcome may possess.

2. If pi = 1 for some 1 ≤ i ≤ n, then H(X) = 0; i.e. any random variable whose outcome is certain possesses no information.

• moreover, the above definition is the only definition which satisfies the following three properties of information which seem reasonable under any definition:

– Normalization: H2(1/2, 1/2) = 1

– Continuity: H2(p, 1 − p) is a continuous function of p on the interval (0, 1)

– Grouping:

H_m(p_1, \ldots, p_m) = H_{m-1}(p_1 + p_2, p_3, \ldots, p_m) + (p_1 + p_2) H_2\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)
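The normalization and grouping properties are easy to check numerically. The short sketch below (reusing the entropy helper from the previous sketch, with a sample distribution of my own choosing) verifies both.

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Normalization: H_2(1/2, 1/2) = 1
    print(entropy([0.5, 0.5]))                      # 1.0

    # Grouping, checked on the sample distribution (0.1, 0.2, 0.3, 0.4)
    p = [0.1, 0.2, 0.3, 0.4]
    lhs = entropy(p)
    rhs = entropy([p[0] + p[1], p[2], p[3]]) + (p[0] + p[1]) * entropy(
        [p[0] / (p[0] + p[1]), p[1] / (p[0] + p[1])])
    print(abs(lhs - rhs) < 1e-12)                   # True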

Claude Shannon (1916-2001). Pioneer in

• applying Boolean logic to electronic circuit design

• studying the complexity of Boolean circuits

• signal processing: determined lower bounds on the number of samples needed to achieve a desired estimation accuracy

• game theory and computer chess: pioneered the use of minimax search in chess-playing programs

• information theory: first to give a precise definition for the concept of information

• coding theory: Channel Coding Theorem, Optimal Coding Theorem

Example 1. Calculate the amount of information contained in a weather forecast if the possible outcomes are {normal, rain, fog, hot, windy} with respective probabilities 0.8, 0.10, 0.04, 0.03, and 0.03.

Example 2. Calculate the entropy of the probability distribution (1/2, 1/4, 1/8, 1/16, 1/16).

Example 3. Verify that independently tossing a fair coin n times imparts n bits of information.
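For a quick check of Examples 1-3, the same entropy helper can be applied directly; the numbers in the comments are what the formula should give (Example 2 works out exactly because the probabilities are powers of 1/2).

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Example 1: the weather forecast
    print(entropy([0.8, 0.10, 0.04, 0.03, 0.03]))       # roughly 1.08 bits
    # Example 2: a dyadic distribution, giving exactly 1.875 bits
    print(entropy([1/2, 1/4, 1/8, 1/16, 1/16]))
    # Example 3 with n = 10: 2^10 equally likely outcomes give 10 bits
    print(entropy([2 ** -10] * 2 ** 10))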

Introduction to Codes

A code is a set of words C over some alphabet Ψ. The elements of C are called codewords. Let X be a finite set of objects. Then an encoding of X is a map φ : X → Ψ*, where Ψ* denotes the set of finite-length strings over Ψ. Thus, the image of the map is a set of codewords. Moreover, when we speak of a code C, we generally are referring to some encoding φ whose image is C. More definitions:

• encoding φ is one-to-one or non-singular if φ is a one-to-one map

• an extension of an encoding φ is a map φ∗ which maps finite-length strings from X to finite-length strings over Ψ, and is defined in terms of φ in the following manner:

φ*(x1 ··· xn) = φ(x1)φ(x2) ··· φ(xn).

• a code (encoding) is called uniquely decodable iff its extension is one-to-one

• a code is called a prefix code if no codeword is a proper prefix of another codeword.

Theorem 1 (Exercise). Every prefix code is uniquely decodable.
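A quick way to test the prefix property on small codes (useful for Examples 4 and 5 below) is to sort the codewords and compare neighbors. The helper below is a sketch of mine, not part of the notes.

    def is_prefix_code(codewords):
        # After sorting, any proper prefix ends up immediately before a word it prefixes.
        words = sorted(codewords)
        return all(not words[i + 1].startswith(words[i])
                   for i in range(len(words) - 1))

    # Morse-style fragment over {0, 1}: E = 0 is a proper prefix of I = 00.
    print(is_prefix_code(["0", "00", "1"]))        # False
    # Any fixed-length code (such as 7-bit ASCII) is automatically a prefix code.
    print(is_prefix_code(["000", "001", "010"]))   # True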

Example 4. Consider the following famous binary code (if we let a “dot” denote 0 and a “dash” denote 1) known as the International Morse Code, which encodes the alphabet {A–Z} (see below). Is this code injective? a prefix code? uniquely decodable? What if we increase the alphabet to {0, 1, #}, where e.g. φ('A') = 01#, and # represents a pause between letters?

CHARACTER   INTERNATIONAL MORSE CODE

A  01       J  0111     S  000
B  1000     K  101      T  1
C  1010     L  0100     U  001
D  100      M  11       V  0001
E  0        N  10       W  011
F  0010     O  111      X  1001
G  110      P  0110     Y  1011
H  0000     Q  1101     Z  1100
I  00       R  010

Example 5. Same questions as in Example 4, but now assume the code is the ASCII code (American Standard Code for Information Interchange). Hint: each letter is encoded into a seven-bit word.

(row = high-order hex digit, column = low-order hex digit of the character code)

      0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL

Theorem 2.

1. Kraft Inequality: For any binary prefix code with codeword lengths l1, . . . , ln,

\sum_{i=1}^{n} 2^{-l_i} \le 1.

Conversely, given a set of codeword lengths that satisfy this inequality, there exists a binary prefix code with these word lengths.

2. McMillan's Result: Kraft's inequality holds for all uniquely decodable codes.

Proof of Theorem 2, Part 1. Let C denote a finite binary prefix code and lmax denote the maximum length of a codeword in C. Now consider a perfect binary tree T of height lmax. Then the following facts should seem self-evident upon sufficient consideration.

1. T has 2^lmax leaves

2. there is a one-to-one correspondence between binary words of length not exceeding lmax and nodes of T

3. there is a one-to-one mapping from C into the set of nodes of T

4. every leaf of T has at most one ancestor in C

5. every codeword wi ∈ C is the ancestor of exactly 2^(lmax − li) leaves of T

From the above facts we see that

\sum_{i=1}^{n} 2^{l_{max} - l_i} \le 2^{l_{max}},

and hence (dividing both sides of the above inequality by 2^lmax)

\sum_{i=1}^{n} 2^{-l_i} \le 1.

Conversely, suppose that lengths l1, . . . , ln satisfy

\sum_{i=1}^{n} 2^{-l_i} \le 1.

Let T and lmax be as above, and assume without loss of generality that l1 ≤ l2 ≤ · · · ≤ ln. Define the binary code C in the following manner. Basis step: let w1 = 0^l1, the string of l1 zeros. Inductive step: assume that there exists 1 ≤ k ≤ n − 1 such that codewords w1, . . . , wk have been defined in such a way that |wi| = li for all 1 ≤ i ≤ k. Since k < n, the Kraft inequality gives \sum_{i=1}^{k} 2^{-l_i} < 1, which implies \sum_{i=1}^{k} 2^{l_{max} - l_i} < 2^{l_{max}}. Hence there exists a leaf L of T of which no member of w1, . . . , wk is an ancestor. Choose the first such leaf L and set wk+1 equal to the ancestor of L having length lk+1. Continuing in this manner, a prefix code C with the desired word lengths is attained. QED
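The converse half can also be seen through a standard equivalent construction (not the leaf-walking argument above): sort the lengths, and take each codeword to be the first li bits of the binary expansion of the running sum of 2^(-lj) over the previously assigned words. The sketch below, with names of my own choosing, implements that idea.

    from fractions import Fraction

    def prefix_code_from_lengths(lengths):
        # Greedy construction; valid whenever the Kraft sum is at most 1.
        lengths = sorted(lengths)
        assert sum(Fraction(1, 2 ** l) for l in lengths) <= 1, "Kraft inequality violated"
        codewords, s = [], Fraction(0)
        for l in lengths:
            # Codeword = first l bits of the binary expansion of the running sum s.
            codewords.append(format(int(s * 2 ** l), "b").zfill(l))
            s += Fraction(1, 2 ** l)
        return codewords

    print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']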

Efficient Codes

We now show the beautiful connection between entropy and coding theory. We may think of a computer file (or any other entity that possesses information content) as a finite string over some alphabet X = {x1, . . . , xn}. Moreover, we know that letters of the alphabet occur with different frequencies. For example, if X is the set of ASCII characters, then the letter “e” occurs with much greater frequency than “EOF”, the symbol that denotes the end of a file. Furthermore, through empirical studies, we can obtain good estimates for the values pi, 1 ≤ i ≤ n, where pi denotes the proportion of the file that consists of the i-th symbol xi.

Now suppose we want to represent file F as a binary string, so as to store it or transmit it over a computer network. Let φ be an encoding of X into binary strings (words). We define the average code length with respect to φ as

L_\varphi = \sum_i p_i |\varphi(x_i)|,

where pi represents the probability of xi appearing in the file. Thus, given a file F that consists of m letters from X and an encoding φ, the size of F with respect to φ is denoted |F|φ and equals m · Lφ. And so, to minimize the size of the file, we must find an encoding φ for which Lφ is minimized. We call such encodings length-optimal or, in the case of files, size-optimal.
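As a small illustration of Lφ and |F|φ, the sketch below compares a fixed-length code with a variable-length code on a hypothetical three-symbol alphabet (the probabilities and codewords are made up for the example).

    def average_length(probs, codewords):
        # L_phi = sum_i p_i * |phi(x_i)|
        return sum(p * len(w) for p, w in zip(probs, codewords))

    probs = [0.7, 0.2, 0.1]
    print(round(average_length(probs, ["00", "01", "10"]), 3))   # 2.0 bits per symbol
    print(round(average_length(probs, ["0", "10", "11"]), 3))    # 1.3 bits per symbol
    # A file of m = 1000 such symbols then needs about m * L_phi bits under each encoding.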

Theorem 3 (Claude Shannon). Let X = {x1, . . . , xn}, and suppose that pi is the probability (or weighted frequency) of xi. Let φ : X → C be uniquely-decodable. Then

1. Lφ ≥ H(X).

2. There exists an encoding φ̂ for which

H(X) ≤ Lφ̂ < H(X) + 1.

To prove Theorem 3, we introduce an asymmetric distance measure between two finite probability distributions.

Kullback-Leibler Distance. Let p = {p1, . . . , pn} and r = {r1, . . . , rn} be two probability distributions. Then the Kullback-Leibler distance between p and r, denoted D(p||r) is given by

D(p||r) = \sum_{i=1}^{n} p_i \log\left(\frac{p_i}{r_i}\right).
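A direct translation of the definition into Python follows; logarithms are taken base 2 to stay consistent with measuring information in bits (the notes leave the base implicit), and terms with pi = 0 are treated as contributing 0.

    import math

    def kl_distance(p, r):
        # D(p || r) = sum_i p_i * log2(p_i / r_i)
        return sum(pi * math.log2(pi / ri) for pi, ri in zip(p, r) if pi > 0)

    p = [0.5, 0.25, 0.25]
    r = [1/3, 1/3, 1/3]
    print(kl_distance(p, p))                       # 0.0: distance from p to itself
    print(kl_distance(p, r), kl_distance(r, p))    # two different positive values: asymmetric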

The following is stated without proof.

Lemma 1. D(p||r) ≥ 0, for all distributions p and r.

Proof of Theorem 3.

Part 1. Let li denote the length of φ(xi) for all 1 ≤ i ≤ n.

L_\varphi - H(X) = \sum_{i=1}^{n} p_i l_i - \sum_{i=1}^{n} p_i \log \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log 2^{-l_i} + \sum_{i=1}^{n} p_i \log p_i.

Now let c = \sum_{i=1}^{n} 2^{-l_i} and r_i = \frac{2^{-l_i}}{c}. Then

L_\varphi - H(X) = -\sum_{i=1}^{n} p_i \log\left(c \cdot \frac{2^{-l_i}}{c}\right) + \sum_{i=1}^{n} p_i \log p_i =

\sum_{i=1}^{n} p_i \log \frac{1}{c} - \sum_{i=1}^{n} p_i \log r_i + \sum_{i=1}^{n} p_i \log p_i = \log \frac{1}{c} + \sum_{i=1}^{n} p_i \log \frac{p_i}{r_i} = \log \frac{1}{c} + D(p||r) \ge 0,

where the final inequality holds because c ≤ 1 (by McMillan's result, since φ is uniquely decodable), so that log(1/c) ≥ 0, and D(p||r) ≥ 0 by Lemma 1. QED

Part 2. Let li = ⌈log(1/pi)⌉. These lengths satisfy the Kraft inequality since

\sum_{i=1}^{n} 2^{-\lceil \log \frac{1}{p_i} \rceil} \le \sum_{i=1}^{n} 2^{-\log \frac{1}{p_i}} = \sum_{i=1}^{n} p_i = 1.

Thus, by Theorem 2, we know that a prefix code φ̂ : X → C exists with these lengths. Moreover, for all 1 ≤ i ≤ n,

\log \frac{1}{p_i} \le l_i < \log \frac{1}{p_i} + 1.

Multiplying by pi and summing over i, we conclude that

H(X) ≤ Lφ̂ < H(X) + 1. QED
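Part 2 of the proof is constructive: take li = ⌈log(1/pi)⌉ and build a prefix code with those lengths via Theorem 2. The sketch below (the sample distribution and names are mine) checks the resulting bound H(X) ≤ Lφ̂ < H(X) + 1 numerically.

    import math

    def shannon_lengths(probs):
        # Codeword lengths l_i = ceil(log2(1 / p_i)), as in Part 2 of the proof.
        return [math.ceil(math.log2(1 / p)) for p in probs]

    probs = [0.4, 0.3, 0.2, 0.1]
    lengths = shannon_lengths(probs)
    H = -sum(p * math.log2(p) for p in probs)
    L = sum(p * l for p, l in zip(probs, lengths))
    print(lengths)            # [2, 2, 3, 4]
    print(H <= L < H + 1)     # True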

Huffman Coding

We now provide a greedy algorithm due to D.A. Huffman (1952) which will always find an encoding having minimum average length.

Huffman’s Algorithm:

• Name of Algorithm: Huffman’s Algorithm

• Input:

– symbol set X = {x1, . . . , xn} listed in increasing order of probability

– probabilities {p1, . . . , pn}

• Output: a length-optimal encoding φ

• Begin Algorithm

• base case 1. if X = {x} then return encoding φ, where φ(x) = ε, the empty string

• base case 2. if X = {x1, x2} then return encoding φ, where φ(x1) = 0 and φ(x2) = 1

• recursive case. combine x1 and x2 (the two least probable symbols) into a new symbol y having probability p1 + p2. Let X̂ = {y, x3, . . . , xn} and let φ̂ be the encoding obtained upon applying Huffman's Algorithm to X̂ with probabilities {p1 + p2, p3, . . . , pn}

• define φ as

– φ(xi) = φ̂(xi) for i ≥ 3

– φ(x1) = φ̂(y) · 0

– φ(x2) = φ̂(y) · 1

• return φ

• End Algorithm
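For completeness, here is an iterative, heap-based sketch of the algorithm just described (the dictionary-based bookkeeping and all names are my own; the recursion above merges the same pairs). It repeatedly pops the two least probable groups, prepends a bit to every codeword in each, and pushes the merged group back.

    import heapq

    def huffman_code(symbols, probs):
        if len(symbols) == 1:
            return {symbols[0]: ""}               # base case 1: the empty string
        # Heap entries: (probability, tie-breaking counter, {symbol: codeword-so-far}).
        heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(zip(symbols, probs))]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)       # the two least probable groups
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, counter, merged))
            counter += 1
        return heap[0][2]

    # A hypothetical five-symbol alphabet (probabilities chosen arbitrarily):
    # e gets the shortest codeword, a and b the longest.
    print(huffman_code(["a", "b", "c", "d", "e"], [0.05, 0.1, 0.15, 0.3, 0.4]))

The exact bit patterns may differ from a hand computation, since ties can be broken either way at each merge, but every Huffman code for the same distribution has the same (optimal) average length.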

Example 6. Let X = {1, 2, 3, 4, 5} with respective probabilities {.15, .15, .2, .25, .25}. Use Huffman's Algorithm to find a length-optimal encoding for X.

Example 7. Let X = {1, 2, 3, 4} with respective weights {4, 4, 5, 5}. Use Huffman's Algorithm to find a length-optimal encoding for X.

Theorem 4. Huffman's Algorithm is correct! In other words, if φ is the encoding produced by Huffman's Algorithm and φ2 is any other prefix encoding, then Lφ ≤ Lφ2.

The proof of the following lemma is left as an exercise.

Lemma 2. For a distribution p = {p1, . . . , pn}, there exists a length-optimal prefix code C such that

1. if pi > pj, then li ≤ lj

2. the two longest codewords have the same length

3. the two longest codewords differ only in the last bit. These two words are the encodings of the two least likely symbols in X

Proof of Theorem 4. The proof uses induction on n = |X |.

Basis Step. n = 1. In this case X = {x} and φ(x) = ε, the empty string. Note that in this case Lφ = H(X) = 0 and φ is clearly length-optimal.

Now suppose n = 2. Then X = {x1, x2} and Huffman's Algorithm yields φ(x1) = 0 and φ(x2) = 1. Obviously φ is a length-optimal prefix code.

Induction Step. Suppose that Huffman's Algorithm yields length-optimal codes for all symbol sets of size n or less, and let |X| = n + 1. Assume that x1 and x2 have the least probabilities p1 and p2. Let X̂ denote the symbol set {y, x3, . . . , xn+1} with corresponding probabilities p = {p1 + p2, p3, . . . , pn+1}. By the induction assumption, Huffman's Algorithm produces a length-optimal encoding φ̂ of X̂. Now, according to the algorithm, φ is defined as

• φ(xi) = φ̂(xi) for i ≥ 3

• φ(x1) = φ̂(y) · 0

• φ(x2) = φ̂(y) · 1

Since φ̂ is a prefix code, it is clear that φ is also a prefix code. We now show that φ is length-optimal. To see this we note that

L_\varphi = \sum_{i=1}^{n+1} p_i l_i = p_1 + p_2 + p_1(l_1 - 1) + p_2(l_2 - 1) + \sum_{i=3}^{n+1} p_i l_i = p_1 + p_2 + L_{\hat{\varphi}}.

Thus, minimizing Lφ̂ is equivalent to minimizing Lφ, since p1 + p2 is a constant. Verifying this is left as an exercise and requires Lemma 2. QED

Exercises.

1. Prove that the entropy function Hm(p1, . . . , pm) satisfies the grouping property.

2. Suppose that a multiple-choice exam has fifty questions with four responses each, where the correct response is randomly assigned a letter a-d. If a student who knows nothing about the exam subject takes the exam, how much information is in the scantron that records her responses? How much information is in the scantron of a student who has complete mastery of the subject? Assume all questions are answered independently. Hint: think in terms of what the professor expects to see when grading these exams. Suppose now the instructor notices that, over all questions, the correct answer was marked 67% of the time, the second-best response 20%, the third-best 10%, and the worst 3%. On average, how much information can be found in an exam?

3. Prove that every prefix code is uniquely decodable. Hint: use mathematical induction on the length of the string y to be decoded; i.e. φ*(x1 x2 ··· xn) = y, and you must show that the sequence of objects x1, x2, . . . , xn is unique. You may assume that φ is one-to-one.

4. Is it possible to define a uniquely decodable encoding of 5 objects if the respective codeword lengths are to be 1, 2, 3, 3, 3? Explain.

5. Give an example of a code that is not a prefix code, but is still uniquely decodable.

6. A fair coin is flipped until the first head occurs. Let random variable X denote the number of flips required. Find the entropy H(X) in bits. The following identities, valid for |r| < 1, may be useful:

\sum_{n=1}^{\infty} r^n = \frac{r}{1 - r},

\sum_{n=1}^{\infty} n r^n = \frac{r}{(1 - r)^2}.

7. Let φ : X → {0, 1}* be a prefix code for a finite set X. Given a binary string y of length n, describe an efficient procedure for decoding y into a unique sequence x1, . . . , xm of objects in X. State the asymptotic running time (in terms of n) of your procedure.

8. The inventor of Morse code, Samuel Morse (1791-1872), needed to know the frequency of letters in English text so that he could give the simplest codewords to the most frequently used letters. He did it simply by counting the number of letters in sets of printers' type. The figures he came up with were:

12,000  E                 2,500  F
 9,000  T                 2,000  W, Y
 8,000  A, I, N, O, S     1,700  G, P
 6,400  H                 1,600  B
 6,200  R                 1,200  V
 4,400  D                   800  K
 4,000  L                   500  Q
 3,400  U                   400  J, X
 3,000  C, M                200  Z

Use this data to compute the entropy of a random variable that outputs one of the English letters. Assuming that a 1Mb text file is Huffman encoded according to the above frequencies, what will be the size of the encoded binary file?

9. Compute the Kullback-Leibler distance between the distributions p = (1/3, 1/3, 1/3) and q = (7/8, 1/16, 1/16).

10. Given object set X = {x1, x2, x3, x4, x5} with respective probabilities 0.1, 0.35, 0.05, 0.2, 0.3, find a Huffman code for X .
