Huffman Coding
Compsci 201, Spring 2014, Huffman Coding

Snarf the huff project via Eclipse. You are urged to work in groups of two. Each group should submit ONE program. Be sure to include the name and NetID of each person in your group in each of the TWO README files included in the submission.

This is new! We haven’t done assignments in pairs before in this class. (And, as this is the last assignment, we won’t be doing them later, either.) Working in a pair comes with a special set of responsibilities to maintain academic honesty. The most basic is this: the project must be the joint work of both members of the pair. One person doing the great majority of the work doesn’t count. Luckily, getting this right is easy: only work on the assignment when both people are there! If both members of the pair are working in front of the same computer, you’re in the clear. Plus, doing it this way keeps you from having to deal with transferring files back and forth between people, which is a huge source of pain and suffering. If you have questions about working in pairs, let us know!

What to Know in Doing the Huffman Assignment
• One submission will contain the code and both READMEs (named README_netID.txt).
• See below for a complete description of Huffman coding for use in a Compsci 201 assignment developed in the mid 90s.
• Understand what you have to do before starting to code. Read through the howto document and this document.

Background
There are many techniques used to compress digital data. This assignment covers two algorithms: Huffman coding and the Burrows-Wheeler transform. Only Huffman coding is a compression algorithm, but it offers greater data compression if the data is first transformed using Burrows-Wheeler. The Burrows-Wheeler transform is an extra-credit assignment; you must do Huffman encoding for compression. Several algorithms for data compression have been patented, e.g., the MP3 audio codec (which uses Huffman coding as one of its steps).
Huffman coding was invented by David Huffman while he was a graduate student at MIT in 1950 when given the option of a term paper or a final exam. In an autobiography Huffman had this to say about the epiphany that led to his invention of the coding method that bears his name:

“– but a week before the end of the term I seemed to have nothing to show for weeks of effort. I knew I’d better get busy fast, do the standard final, and forget the research problem. I remember, after breakfast that morning, throwing my research notes in the wastebasket. And at that very moment, I had a sense of sudden release, so that I could see a simple pattern in what I had been doing, that I hadn’t been able to see at all until then. The result is the thing for which I’m probably best known: the Huffman Coding Procedure. I’ve had many breakthroughs since then, but never again all at once, like that. It was very exciting.”

Huffman’s original paper is available, though it’s a tough read. The Wikipedia reference is extensive, as is this online material developed as one of the original Nifty Assignments. Both jpeg and mp3 encodings use Huffman Coding as part of their compression algorithms. In this assignment you’ll implement a complete program to compress and uncompress data using Huffman coding.

Assignment Overview
For this assignment you’ll build what are conceptually two programs: one to compress (huff) and the other to uncompress (unhuff) files that are compressed by the first program. However, there is really just a single program with the choice of compressing a file or uncompressing a file specified by choosing a menu-option in the GUI front-end to the code you write. Abstractly you’re writing a program to read an input file and create a corresponding output file, either from uncompressed to compressed or vice versa. For extra credit you’ll add another step to the compression process: the Burrows-Wheeler transform (BWT).
However, you do BWT after completing Huffman Coding; the BWT writeup is separate.

The Huff class is a simple main that launches a GUI with a connected IHuffProcessor implementation. The implementation corresponds to a model in the model-view architecture we’ve been using in class: the view/GUI makes calls on the model/IHuffProcessor methods, which in turn may display information in the view/GUI. The code you write will also create files of compressed or uncompressed data when the GUI front-end calls methods you will write. You’ll implement methods and store state in your IHuffProcessor implementation so that it can either compress/huff or uncompress/unhuff. You’re welcome to implement additional classes as well, but you don’t need to (except for the Burrows-Wheeler transform, which is optional).

You’re writing code based on the greedy Huffman algorithm discussed in class and in this detailed online explanation of Huffman Coding. Be sure to read that explanation, the notes from class, and refer appropriately to the howto for this assignment. The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like winrar or zip, which use slightly different algorithms permitting a higher degree of compression than Huffman coding.

The Huffman Compression Program
The Howto compression section has complete information on how to create a compressed file. Basically you first create a Huffman tree to derive per-character encodings, then you write bits based on these encodings. The Huff main program has a GUI front-end whose menu offers three choices: count characters, compress, uncompress (and quit as a fourth choice). You can’t compress until you can count/create a tree, so make sure counting/tree-creation/encoding works. The howto document has details on how the program works.
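The count/tree/encoding sequence described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the IHuffProcessor implementation the assignment asks for: the names HuffSketch and Node are hypothetical, input is treated as 8-bit values, and a String of '0' and '1' characters stands in for the real bit-at-a-time output and for the header, both of which are described in the howto.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Illustrative only: HuffSketch and Node are hypothetical names, not classes from the huff project.
final class HuffSketch {

    static final class Node implements Comparable<Node> {
        final int value;   // byte value 0..255 in a leaf, -1 in an internal node
        final int weight;  // total occurrences of all values below this node
        final Node left;
        final Node right;

        Node(int value, int weight, Node left, Node right) {
            this.value = value;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }

        @Override
        public int compareTo(Node other) { return Integer.compare(weight, other.weight); }
    }

    /** Step 1: count how often each 8-bit value occurs. */
    static int[] countBytes(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }
        return counts;
    }

    /** Step 2: greedy construction, repeatedly merging the two lightest trees. */
    static Node buildTree(int[] counts) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (int value = 0; value < counts.length; value++) {
            if (counts[value] > 0) {
                queue.add(new Node(value, counts[value], null, null));
            }
        }
        if (queue.isEmpty()) {
            return null; // empty input: there is no tree to build
        }
        while (queue.size() > 1) {
            Node first = queue.remove();
            Node second = queue.remove();
            queue.add(new Node(-1, first.weight + second.weight, first, second));
        }
        return queue.remove();
    }

    /** Step 3: the path to each leaf (0 = left, 1 = right) is that value's code. */
    static void collectCodes(Node node, String path, Map<Integer, String> codes) {
        if (node == null) {
            return;
        }
        if (node.isLeaf()) {
            // A tree with a single leaf has an empty path, so give it a 1-bit code.
            codes.put(node.value, path.isEmpty() ? "0" : path);
            return;
        }
        collectCodes(node.left, path + "0", codes);
        collectCodes(node.right, path + "1", codes);
    }

    /** Step 4: concatenate the codes; the String stands in for real bit output. */
    static String encode(byte[] data) {
        Map<Integer, String> codes = new TreeMap<>();
        collectCodes(buildTree(countBytes(data)), "", codes);
        StringBuilder bits = new StringBuilder();
        for (byte b : data) {
            bits.append(codes.get(b & 0xFF));
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        byte[] data = "aaaabbc".getBytes(StandardCharsets.US_ASCII);
        String bits = encode(data);
        System.out.println(bits + " (" + bits.length() + " bits for " + data.length * 8 + " input bits)");
    }
}
```

For the input "aaaabbc" the most frequent value gets a 1-bit code and the other two get 2-bit codes, so the body shrinks from 56 bits to 10. A real compressed file also has to store enough header information to rebuild the tree, which this sketch leaves out.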
Programming Advice
Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. To facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, it will be difficult to get a completely working program. The howto development section has more information on incremental development.

The Huffman Decompression Program
To uncompress a file your program previously compressed you’ll need to read header information from the compressed file your program creates. The header information is data that you’ll use in writing code that recreates the Huffman tree that was originally used to compress the data. This is key: to uncompress you need the same tree you used to compress. After recreating the tree your code will read one bit at a time to uncompress the data and recreate the original file that was compressed. There’s complete information in the howto uncompression section on doing this. Basically you read the header information to recreate the tree, then do a tree-walk one bit at a time to find the characters stored in the leaves of the Huffman tree. Each time you find a leaf you print the value there and restart the walk at the root. This process recreates the original, uncompressed file.

Empirical Analysis
You should run the program HuffMark, which reads every file in a directory and compresses it to another file in the same directory with a “.hf” suffix. You may want to modify this benchmarking program to print more data than it currently does, and to run it on both the calgary directory and the waterloo directory. The calgary directory represents the Calgary Corpus, a standard compression suite of files for empirical analysis (you can see this reference for comparisons on the Calgary Corpus); the waterloo directory is a collection of .tiff images used in some compression benchmarking.
You can, of course, run on other data/collections. Be sure to check whether the file you compress and uncompress is the same as the original. See the howto diff section for more information. The benchmarking program skips files with .hf suffixes, but you’ll want to change that at some point in your analysis. Your analysis should focus on a few questions:
1. Which compresses more, binary files or text files?
2. Can you gain additional compression by double-compressing an already compressed file? If so, is there eventually a limit to when this no longer saves space on ordinary files? What if you built a file that was intentionally designed to compress a lot? When would it no longer be worthwhile to recompress?

If you use different methods to store information in the header of your compressed file, e.g., you store the tree and you also store counts, then reporting on differences is a good idea. Information on how to create the header is in the howto.

Grading
The program is worth 25 points. You get correctness/algorithmic points according to whether your program compresses and uncompresses to the original both text and other (e.g., jpg) files. You get engineering points according to how well your code is written, how robust your program is, whether the program detects when compression would not help (the resulting file would be bigger), and how you write your header information. You get analysis points based on the information you report from compressing/uncompressing a corpus or two.

ONLY ONE PARTNER SUBMITS THE CODE, ANALYSIS, AND BOTH READMEs. MAKE SURE THAT YOUR SUBMISSION IS CORRECT BY CHECKING THAT THE SUBMIT HISTORY HAS ALL THE CODE, ANALYSIS AND BOTH READMES!!!!!

DESCRIPTION                                                      POINTS
compression/decompression of any text file                       25%
compression/decompression of any file (including binary files)   25%
robustness (crash on non-huffed files, force compression, ...)   20%
program style (design, documentation, etc.)                      10%
empirical analysis                                               20%
README must be complete for any credit
Huffman Coding Grading Standards

You can also get extra credit for this assignment by implementing multiple header styles and empirically testing which is better.
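As a companion to the tree walk described under The Huffman Decompression Program, here is a minimal sketch of the bit-by-bit decoding loop. It rests on several assumptions that are not part of the assignment: UnhuffSketch, Node, and exampleTree are hypothetical names, the tree is built by hand rather than recreated from a header, the bits arrive as a String of '0' and '1' characters instead of being read from a file, and decoding simply stops when the bits run out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: UnhuffSketch and Node are hypothetical names, not classes from the huff project.
final class UnhuffSketch {

    static final class Node {
        final int value;   // byte value 0..255 in a leaf, -1 in an internal node
        final Node left;
        final Node right;

        Node(int value) {
            this(value, null, null);
        }

        Node(int value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }
    }

    /**
     * Walks the tree one bit at a time: 0 goes left, 1 goes right.
     * Reaching a leaf emits its value and restarts the walk at the root.
     * Assumes the tree has at least two leaves.
     */
    static byte[] decode(Node root, String bits) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Node current = root;
        for (int i = 0; i < bits.length(); i++) {
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current == null) {
                throw new IllegalArgumentException("bit sequence does not match the tree");
            }
            if (current.isLeaf()) {
                out.write(current.value);
                current = root;
            }
        }
        if (current != root) {
            throw new IllegalArgumentException("input ended in the middle of a code");
        }
        return out.toByteArray();
    }

    /** Hand-built tree for the codes a = 1, c = 00, b = 01. */
    static Node exampleTree() {
        Node leftSubtree = new Node(-1, new Node('c'), new Node('b'));
        return new Node(-1, leftSubtree, new Node('a'));
    }

    public static void main(String[] args) {
        byte[] decoded = decode(exampleTree(), "1111010100");
        System.out.println(new String(decoded, StandardCharsets.US_ASCII)); // prints "aaaabbc"
    }
}
```

The two exceptions are one small way to approach the robustness criterion: input that does not match the tree is rejected instead of producing garbage. How the real program knows where the compressed data ends is a header/format decision covered in the howto, not something this sketch settles.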