Huffman Coding (Programming Part) Due: Thursday, November
Total Page:16
File Type:pdf, Size:1020Kb
15-122 Assignment 6 Page 1 of 12 15-122 : Principles of Imperative Computation Fall 2012 Assignment 6: Huffman Coding (Programming Part) Due: Thursday, November 8, 2012 by 23:59 For the programming portion of this week's homework, you'll write a C0 implementation of a popular compression technique called Huffman coding. In fact, you will write parts of it twice, once on C0 and once in C. • huffman.c0 (described in Sections 1, 2, 3, and 4) • huffman.c (described in Section 5) You should submit these files electronically by the due date. Detailed submission instructions can be found below. Important Note: For this lab you must restrict yourself to 25 or fewer Autolab sub- missions. As always, you should compile and run some test cases locally before submitting to Autolab. 15-122 Assignment 6 Page 2 of 12 Assignment: Huffman Coding (25 points in total) Starter code. Download the file handout-hw6.zip from the course website. When you untar it, you will find several C0 and C files and a lib/ directory with some provided libraries. You should not modify or submit the library code in the lib/ directory, nor should you rely on the internals of these implementations. When we are testing your code, we will use our own implementations of these libraries with the same interface, but possibly different internals. Compiling and running. You will compile and run your code using the standard C0 tools and the gcc compiler. You should compile and run the C0 version with % cc0 -d -o huff0 huffman.c0 huffman-main.c0 % ./huff0 You should compile and run the C version with % gcc -DDEBUG -g -Wall -Wextra -Werror -std=c99 -pedantic -o huff \ lib/xalloc.c lib/heaps.c freqtable.c huffman.c huffman-main.c % ./huff Submitting. Once you've completed some files, you can submit them to Autolab. There are two ways to do this: From the terminal on Andrew Linux (via cluster or ssh) type: % handin hw6 huffman.c0 huffman.c Your score will then be available on the Autolab website. Your files can also be submitted to the web interface of Autolab. To do so, please tar them, for example: % tar -cvf hw6_solutions.tar huffman.c0 huffman.c Then, go to https://autolab.cs.cmu.edu/15122-f12 and submit them as your solu- tion to homework 6 (huffmanlab). You may submit your assignment up to 25 times. When we grade your assignment, we will consider the most recent version submitted before the due date. 15-122 Assignment 6 Page 3 of 12 Annotations. Be sure to include appropriate //@requires, //@ensures, //@assert, and //@loop_invariant annotations in your program. You should provide loop invariants and any assertions that you use to check your reasoning. If you write any auxiliary functions, in- clude precise and appropriate pre- and postconditions. Also include corresponding REQUIRES, ENSURES, and ASSERT macros in your C programs, especially where they help explain program properties that are important but not obvious. You should write these as you are writing the code rather than after you're done: docu- menting your code as you go along will help you reason about what it should be doing, and thus help you write code that is both clearer and more correct. Annotations are part of your score for the programming problems; you will not receive maximum credit if your annotations are weak or missing. Unit testing. You should write unit tests for your code. For this assignment, unit tests will not be graded, but they will help you check the correctness of your code, pinpoint the location of bugs, and save you hours of frustration. Style. Strive to write code with good style: indent every line of a block to the same level, use descriptive variable names, keep lines to 80 characters or fewer, document your code with comments, etc. If you find yourself writing the same code over and over, you should write a separate function to handle that computation and call it whenever you need it. We will read your code when we grade it, and good style is sure to earn our good graces. Feel free to ask on Piazza if you're unsure of what constitutes good style. Task 0 (5 points) On this assignment, 5 points will be given for style. We will examine you C0 code as well as your C code. 15-122 Assignment 6 Page 4 of 12 Data Compression: Overview Whenever we represent data in a computer, we have to choose some sort of encoding with which to represent it. When representing strings in C0, for instance, we use ASCII codes to represent the individual characters. Other encodings are possible as well. The UNICODE standard (as another encoding example) defines a variety of character encodings with a variety of different properties. The simplest, UTF-32, uses 32 bits per character. Under the ASCII encoding, each character is represented using 7 bits, so a string of length n requires 7n bits of storage, which we usually round up to 8n bits or n bytes. For example, consider the string "more free coffee"; ignoring spaces, it can be represented in ASCII as follows with 14 × 7 = 98 bits: 1101101 · 1101111 · 1110010 · 1100101 · 1100110 · 1110010 · 1100101 · 1100101 · 1100011 · 1101111 · 1100110 · 1100110 · 1100101 · 1100101 This encoding of the string is rather wasteful, though. In fact, since there are only 6 distinct characters in the string, we should be able to represent it using a custom encoding that uses only dlog 6e = 3 bits to encode each character. If we were to use the custom encoding shown in Figure 1, Character Code `c' 000 `e' 001 `f' 010 `m' 011 `o' 100 `r' 101 Figure 1: A custom fixed-length encoding for the non-whitespace characters in the string "more free coffee". the string would be represented with only 14 × 3 = 42 bits: 011 · 100 · 101 · 001 · 010 · 101 · 001 · 001 · 000 · 100 · 010 · 010 · 001 · 001 In both cases, we may of course omit the separator \·" between codes; they are included only for readability. If we confine ourselves to representing each character using the same number of bits, i.e., a fixed-length encoding, then this is the best we can do. But if we allow ourselves a variable- length encoding, then we can take advantage of special properties of the data: for instance, in the sample string, the character 'e' occurs very frequently while the characters 'c' and 'm' occur very infrequently, so it would be worthwhile to use a smaller bit pattern to encode the character 'e' even at the expense of having to use longer bit patterns to encode 'c' 15-122 Assignment 6 Page 5 of 12 Character Code `e' 0 `o' 100 `m' 1010 `c' 1011 `r' 110 `f' 111 Figure 2: A custom variable-length encoding for the non-whitespace characters in the string "more free coffee". and 'm'. The encoding shown in Figure 2 employs such a strategy, and using it, the sample string can be represented with only 34 bits: 1010 · 100 · 110 · 0 · 111 · 110 · 0 · 0 · 1011 · 100 · 111 · 111 · 0 · 0 Since this encoding is prefix-free|no code word is a prefix of any other code word|the \·" separators are redundant here, too. It can be proven that this encoding is optimal for this particular string: no other encoding can represent the string using fewer than 34 bits. Moreover, the encoding is optimal for any string that has the same distribution of characters as the sample string. In this assignment, you will implement a method for constructing such optimal encodings developed by David Huffman. Huffman Coding: A Brief History Huffman coding is an algorithm for constructing optimal prefix-free encodings given a fre- quency distribution over characters. It was developed in 1951 by David Huffman when he was a Ph.D student at MIT taking a course on information theory taught by Robert Fano. It was towards the end of the semester, and Fano had given his students a choice: they could either take a final exam to demonstrate mastery of the material, or they could write a term paper on something pertinent to information theory. Fano suggested a number of possible topics, one of which was efficient binary encodings: while Fano himself had worked on the subject with his colleague Claude Shannon, it was not known at the time how to efficiently construct optimal encodings. Huffman struggled for some time to make headway on the problem and was about to give up and start studying for the final when he hit upon a key insight and invented the algorithm that bears his name, thus outdoing his professor, making history, and attaining an \A" for the course. Today, Huffman coding enjoys a variety of applications: it is used as part of the DEFLATE algorithm for producing ZIP files and as part of several multimedia codecs like JPEG and MP3. 15-122 Assignment 6 Page 6 of 12 1 Huffman Trees Recall that an encoding is prefix-free if no code word is a prefix of any other code word. Prefix-free encodings can be represented as binary full trees with characters stored at the leaves: a branch to the left represents a 0 bit, a branch to the right represents a 1 bit, and the path from the root to a leaf gives the code word for the character stored at that leaf. For example, the encodings from Figures 1 and 2 are represented by the binary trees in Figures 3 and 4, respectively. c e f m o r Figure 3: The custom encoding from Figure 1 as a binary tree.