Entropy and Huffman Codes

Entropy as a Measure of Information Content

Entropy of a random variable. Let X be a random variable that takes on values from the set {x_1, x_2, ..., x_n} with respective probabilities p_1, p_2, ..., p_n, where \sum_i p_i = 1. Then the entropy of X, H(X), represents the average amount of information contained in X and is defined by

    H(X) = −\sum_{i=1}^{n} p_i \log_2 p_i .

Note that entropy is measured in bits. Also, H_n(p_1, ..., p_n) denotes another way of writing the entropy function.

Arguments in favor of the above definition of information:

• the definition is consistent with the following extreme cases:

  1. If n = 2 and p_1 = p_2 = 1/2, then H(X) = 1 bit; i.e. when an event (e.g. X = x_1) has an equally likely chance of occurring or not occurring, then its outcome possesses one bit of information. This is the maximum amount of information a binary outcome may possess.

  2. In the case when p_i = 1 for some 1 ≤ i ≤ n, then H(X) = 0; i.e. any random variable whose outcome is certain possesses no information.

• moreover, the above definition is the only one which satisfies the following three properties of information, each of which seems reasonable under any definition:

  – Normalization: H_2(1/2, 1/2) = 1

  – Continuity: H_2(p, 1 − p) is a continuous function of p on the interval (0, 1)

  – Grouping:

        H_m(p_1, ..., p_m) = H_{m−1}(p_1 + p_2, p_3, ..., p_m) + (p_1 + p_2) H_2\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)

Claude Shannon (1916-2001). Pioneer in

• applying Boolean logic to electronic circuit design
• studying the complexity of Boolean circuits
• signal processing: determined lower bounds on the number of samples needed to achieve a desired estimation accuracy
• game theory: inventor of the minimax algorithm
• information theory: first to give a precise definition of the concept of information
• coding theory: Channel Coding Theorem, Optimal Coding Theorem

Example 1. Calculate the amount of information contained in a weather forecast if the possible outcomes are {normal, rain, fog, hot, windy} and their respective probabilities are 0.8, 0.10, 0.04, 0.03, and 0.03.

Example 2. Calculate the entropy of the probability distribution (1/2, 1/4, 1/8, 1/16, 1/16).

Example 3. Verify that independently tossing a fair coin n times imparts n bits of information.

(A numerical check of Examples 1-3 is sketched after Theorem 1 below.)

Introduction to Codes

A code is a set of words C over some alphabet Ψ. The elements of C are called codewords. Let X be a finite set of objects. Then an encoding of X is a map φ from X into the set of words over Ψ; thus the image of the map is a set of codewords. Moreover, when we speak of a code C, we are generally referring to some encoding φ whose image is C.

More definitions:

• encoding φ is one-to-one or non-singular if φ is a one-to-one map

• an extension of an encoding φ is a map φ* which maps finite-length strings over X to finite-length strings over Ψ, and is defined in terms of φ in the following manner:

      φ*(x_1 ⋯ x_n) = φ(x_1) φ(x_2) ⋯ φ(x_n).

• a code (encoding) is called uniquely decodable iff its extension is one-to-one

• a code is called a prefix code if no codeword is a proper prefix of another codeword.

Theorem 1. Exercise: every prefix code is uniquely decodable.
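Returning to Examples 1-3: the entropy values there are easy to check numerically. The following is a minimal Python sketch, not part of the original notes; the function name entropy and the printed estimates are mine, while the probabilities are the ones given in the examples.

```python
from math import log2

def entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i), measured in bits (terms with p_i = 0 are skipped)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Example 1: weather forecast with probabilities 0.8, 0.10, 0.04, 0.03, 0.03
print(entropy([0.8, 0.10, 0.04, 0.03, 0.03]))    # roughly 1.08 bits

# Example 2: the distribution (1/2, 1/4, 1/8, 1/16, 1/16)
print(entropy([1/2, 1/4, 1/8, 1/16, 1/16]))      # exactly 1.875 bits

# Example 3: n independent fair coin tosses give 2^n equally likely outcomes
n = 5
print(entropy([2 ** -n] * 2 ** n))               # 5.0, i.e. n bits
```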
Example 4. Consider the following famous binary code (if we let a "dot" denote 0 and a "dash" denote 1) known as the Morse code, which encodes the alphabet {A, ..., Z} (see below). Is this code injective? a prefix code? uniquely decodable? What if we enlarge the alphabet to {0, 1, #}, where e.g. φ('A') = 01# and # represents a pause between letters?

CHARACTER / INTERNATIONAL MORSE CODE

    A  01      B  1000    C  1010    D  100     E  0       F  0010
    G  110     H  0000    I  00      J  0111    K  101     L  0100
    M  11      N  10      O  111     P  0110    Q  1101    R  010
    S  000     T  1       U  001     V  0001    W  011     X  1001
    Y  1011    Z  1100

Example 5. Same questions as in Example 4, but now assume the code is the ASCII code (American Standard Code for Information Interchange). Hint: each letter is encoded into a seven-bit word. In the table below, the row label is the high hexadecimal digit of a character's code and the column label is the low digit.

          0    1    2    3    4    5    6    7
    0    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL
    1    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB
    2    SP   !    "    #    $    %    &    '
    3    0    1    2    3    4    5    6    7
    4    @    A    B    C    D    E    F    G
    5    P    Q    R    S    T    U    V    W
    6    `    a    b    c    d    e    f    g
    7    p    q    r    s    t    u    v    w

          8    9    A    B    C    D    E    F
    0    BS   HT   LF   VT   FF   CR   SO   SI
    1    CAN  EM   SUB  ESC  FS   GS   RS   US
    2    (    )    *    +    ,    -    .    /
    3    8    9    :    ;    <    =    >    ?
    4    H    I    J    K    L    M    N    O
    5    X    Y    Z    [    \    ]    ^    _
    6    h    i    j    k    l    m    n    o
    7    x    y    z    {    |    }    ~    DEL

Theorem 2.

1. Kraft Inequality: For any binary prefix code with codeword lengths l_1, ..., l_n,

       \sum_{i=1}^{n} 2^{-l_i} \le 1.

   Conversely, given a set of codeword lengths that satisfy this inequality, there exists a binary prefix code with these word lengths.

2. McMillan's Result: Kraft's inequality holds for all uniquely decodable codes.

Proof of Theorem 2, Part 1. Let C denote a finite binary prefix code and let l_max denote the maximum length of a codeword in C. Now consider a perfect binary tree T of height l_max. Then the following facts should seem self-evident upon sufficient consideration.

1. T has 2^{l_max} leaves
2. there is a one-to-one correspondence between binary words of length not exceeding l_max and nodes of T
3. there is a one-to-one mapping from C into the set of nodes of T
4. every leaf of T has at most one ancestor in C
5. every codeword w_i ∈ C is the ancestor of exactly 2^{l_max − l_i} leaves of T

From the above facts we see that

    \sum_{i=1}^{n} 2^{l_max − l_i} \le 2^{l_max},

and hence (dividing both sides of the above inequality by 2^{l_max})

    \sum_{i=1}^{n} 2^{-l_i} \le 1.

Conversely, suppose that lengths l_1, ..., l_n satisfy

    \sum_{i=1}^{n} 2^{-l_i} \le 1.

Let T and l_max be as above. Define the binary code C in the following manner. Basis step: let w_1 = 0^{l_1}. Inductive step: assume that there exists 1 ≤ k ≤ n − 1 such that codewords w_1, ..., w_k have been defined in such a way that |w_i| = l_i for all 1 ≤ i ≤ k. Then by the Kraft inequality, \sum_{i=1}^{k} 2^{-l_i} < 1, which implies \sum_{i=1}^{k} 2^{l_max − l_i} < 2^{l_max}. Hence there exists a leaf L of T of which no member of w_1, ..., w_k is an ancestor. Choose the first such leaf L and set w_{k+1} to the ancestor of L having length l_{k+1}. Continuing in this manner, a prefix code C with the desired word lengths is attained. QED
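The constructive half of Theorem 2 is easy to experiment with. The Python sketch below is not from the notes and all names in it are mine; it checks the Kraft inequality and then builds a prefix code with prescribed lengths using the standard "canonical" assignment over sorted lengths, which is one concrete way of realizing the leaf-picking argument in the proof.

```python
def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality: sum_i 2^(-l_i)."""
    return sum(2.0 ** -l for l in lengths)

def is_prefix_code(words):
    """True iff no codeword is a proper prefix of another codeword."""
    return not any(u != w and w.startswith(u) for u in words for w in words)

def prefix_code_from_lengths(lengths):
    """Build a binary prefix code with the given codeword lengths,
    assuming the lengths satisfy the Kraft inequality."""
    assert kraft_sum(lengths) <= 1.0, "Kraft inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    code, prev_len = 0, lengths[order[0]]
    for i in order:
        code <<= lengths[i] - prev_len           # extend with zeros to the new length
        codewords[i] = format(code, "b").zfill(lengths[i])
        code += 1                                # the next codeword lies just past this one
        prev_len = lengths[i]
    return codewords

lengths = [1, 2, 3, 3]                           # Kraft sum: 1/2 + 1/4 + 1/8 + 1/8 = 1
C = prefix_code_from_lengths(lengths)
print(C, is_prefix_code(C))                      # ['0', '10', '110', '111'] True
```

Applying is_prefix_code to the binary Morse table above also settles part of Example 4: since E = 0 is a proper prefix of I = 00, the binary Morse encoding is not a prefix code.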
Efficient Codes

We now show the beautiful connection between entropy and coding theory. We may think of a computer file (or any other entity that possesses information content) as a finite string over some alphabet X = {x_1, ..., x_n}. Moreover, we know that letters of the alphabet occur with different frequencies. For example, if X is the set of ASCII characters, then the letter "e" occurs with much greater frequency than "EOF", the symbol that denotes the end of a file. Furthermore, through empirical studies, we can obtain good estimates of the values p_i, 1 ≤ i ≤ n, where p_i denotes the proportion of the file that is comprised of the i-th symbol x_i.

Now suppose we want to represent a file F as a binary string, so as to store it and transmit it on a computer network. Let φ be an encoding of X into binary strings (words). We define the average codelength with respect to φ as

    L_φ = \sum_i p_i |φ(x_i)|,

where p_i represents the probability of x_i appearing in the file.

Thus, given a file F that consists of m letters from X, and given some encoding φ, the size of file F with respect to the encoding φ is denoted by |F|_φ and equals m · L_φ. And so, to minimize the size of the file, we must find an encoding φ for which L_φ is minimized. We call such encodings length-optimal or, in the case of files, size-optimal.

Theorem 3 (Claude Shannon). Let X = {x_1, ..., x_n}, and suppose that p_i is the probability (or weighted frequency) of x_i. Let φ : X → C be uniquely decodable. Then

1. L_φ ≥ H(X).

2. There exists an encoding φ̂ for which H(X) ≤ L_φ̂ < H(X) + 1.

To prove Theorem 3, we introduce an asymmetric distance measure between two finite probability distributions.

Kullback-Leibler Distance. Let p = {p_1, ..., p_n} and r = {r_1, ..., r_n} be two probability distributions. Then the Kullback-Leibler distance between p and r, denoted D(p||r), is given by

    D(p||r) = \sum_{i=1}^{n} p_i \log\left(\frac{p_i}{r_i}\right).

The following is stated without proof.

Lemma 1. D(p||r) ≥ 0 for all distributions p and r.

Proof of Theorem 3, Part 1. Let l_i denote the length of φ(x_i) for all 1 ≤ i ≤ n. Then

    L_φ − H(X) = \sum_{i=1}^{n} p_i l_i − \sum_{i=1}^{n} p_i \log \frac{1}{p_i}
               = −\sum_{i=1}^{n} p_i \log 2^{−l_i} + \sum_{i=1}^{n} p_i \log p_i.

Now let c = \sum_{i=1}^{n} 2^{−l_i} and r_i = 2^{−l_i}/c.
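To make the quantities in Theorem 3 and its proof concrete, here is a small numerical check in Python. It is not from the notes; the names are mine, and base-2 logarithms are assumed throughout, matching the bit units used above.

```python
from math import log2

def entropy(p):
    """H = -sum_i p_i * log2(p_i)."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def avg_codelength(p, code):
    """L_phi = sum_i p_i * |phi(x_i)|."""
    return sum(pi * len(w) for pi, w in zip(p, code))

def kl_distance(p, r):
    """D(p||r) = sum_i p_i * log2(p_i / r_i)."""
    return sum(pi * log2(pi / ri) for pi, ri in zip(p, r) if pi > 0)

# The distribution of Example 2, with a prefix code whose lengths equal log2(1/p_i).
p    = [1/2, 1/4, 1/8, 1/16, 1/16]
code = ["0", "10", "110", "1110", "1111"]

print(avg_codelength(p, code), entropy(p))   # both 1.875: the bound L_phi >= H is met with equality

# A fixed-length 3-bit code for the same five symbols is also a prefix code, but not length-optimal.
print(avg_codelength(p, ["000", "001", "010", "011", "100"]))   # 3.0 >= 1.875

# The constants defined at the end of the proof, computed for the first code:
c = sum(2.0 ** -len(w) for w in code)
r = [2.0 ** -len(w) / c for w in code]
print(kl_distance(p, r))                     # 0.0 here; nonnegative in general, by Lemma 1
```

In this example p_i = 2^{−l_i}, so c = 1 and r coincides with p, which is why both the distance and the gap L_φ − H(X) vanish.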
