Compsci 201, Spring 2014, Huffman Coding

Snarf the huff project via Eclipse.

You are urged to work in groups of two. Each group should submit ONE program per group. Be sure to include name and NetID of each person in your group in each of the TWO README files that are submitted with the submission.

This is new! We haven’t done assignments in pairs before in this class. (And, as this is the last assignment, we won’t be doing them later, either.) Working in a pair comes with a special set of responsibilities to maintain academic honesty. The most basic is this: the project must be the joint work of both members of the pair. One person doing the great majority of the work doesn’t count.

Luckily, getting this right is easy: only work on the assignment when both people are there! If both members of the pair are working in front of the same computer, you’re in the clear. Plus, doing it this way keeps you from having to deal with transferring files back and forth between people, which is a huge source of pain and suffering. If you have questions about working in pairs, let us know!

What to Know in Doing the Huffman Assignment

• One submission will contain the code and both READMEs (named README_netID.txt)
• See below for a complete description of Huffman coding for use in a Compsci 201 assignment developed in the mid 90s.
• Understand what you have to do before starting to code. Read through the howto document and this document.

Background

There are many techniques used to compress digital data. This assignment covers two algorithms: Huffman coding and the Burrows-Wheeler transform. Of the two, only Huffman coding is a compression algorithm; it achieves greater compression if the data is first transformed using Burrows-Wheeler. The Burrows-Wheeler transform is an extra-credit assignment; you must do Huffman encoding for compression. Several algorithms for data compression have been patented, e.g., the MP3 audio codec (which uses Huffman coding as one of its steps).

Huffman coding was invented by David Huffman while he was a graduate student at MIT in 1951 when given the option of a term paper or a final exam. In an autobiography Huffman had this to say about the epiphany that led to his invention of the coding method that bears his name: “– but a week before the end of the term I seemed to have nothing to show for weeks of effort. I knew I’d better get busy fast, do the standard final, and forget the research problem. I remember, after breakfast that morning, throwing my research notes in the wastebasket. And at that very moment, I had a sense of sudden release, so that I could see a simple pattern in what I had been doing, that I hadn’t been able to see at all until then. The result is the thing for which I’m probably best known: the Huffman Coding Procedure. I’ve had many breakthroughs since then, but never again all at once, like that. It was very exciting.”

Huffman’s original paper is available, though it’s a tough read. The Wikipedia reference is extensive as is this online material developed as one of the original Nifty Assignments. Both jpeg and mp3 encodings use Huffman Coding as part of their compression algorithms. In this assignment you’ll implement a complete program to compress and uncompress data using Huffman coding.

Assignment Overview

For this assignment you’ll build what are conceptually two programs: one to compress (huff) and the other to uncompress (unhuff) files that are compressed by the first program. However, there is really just a single program with the choice of compressing a file or uncompressing a file specified by choosing a menu-option in the GUI front-end to the code you write. Abstractly you’re writing a program to read an input file and create a corresponding output file, either from uncompressed to compressed or vice versa. For extra credit you’ll add another step to the compression process: the Burrows-Wheeler transform (BWT). However, you do BWT after completing Huffman Coding; the BWT writeup is separate.

The Huff class is a simple main that launches a GUI with a connected IHuffProcessor implementation. The implementation corresponds to a model in the model-view architecture we’ve been using in class: the view/GUI makes calls on the model/IHuffProcessor methods which in turn may display information in the view/GUI. The code you write will also create files of compressed or uncompressed data when the GUI front-end calls methods you will write. You’ll implement methods and store state in your IHuffProcessor implementation so that it can either compress/huff or uncompress/unhuff. You’re welcome to implement additional classes as well, but you don’t need to (except for the Burrows-Wheeler transform, which is optional).

You’re writing code based on the greedy Huffman algorithm discussed in class and in this detailed online explanation of Huffman Coding. Be sure to read that explanation, the notes from class, and refer appropriately to the howto for this assignment.

The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like winrar or zip, which use slightly different algorithms permitting a higher degree of compression than Huffman coding.

The Huffman Compression Program

The Howto compression section has complete information on how to create a compressed file. Basically you first create a Huffman tree to derive per-character encodings, then you write bits based on these encodings. The Huff main program has a GUI front-end whose menu offers three choices: count characters, compress, uncompress (and quit as a fourth choice). You can’t compress until you can count/create a tree, so make sure counting/tree-creation/encoding works. The howto document has details on how the program works.

Programming Advice

Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, it will be difficult to get a completely working program. The howto development section has more information on incremental development.

The Huffman Decompression Program

To uncompress a file your program previously compressed you’ll need to read header information from the compressed file your program creates. The header information is data that you’ll use in writing code that recreates the Huffman tree that was originally used to compress the data — this is key, to uncompress you need the same tree you used to compress. After recreating the tree your code will read one-bit-at-a-time to uncompress the data and recreate the original file that was compressed. There’s complete information in the howto uncompression section on doing this. Basically you read the header information to recreate the tree, then do a tree-walk one bit at a time to find the characters stored in the leaves of the Huffman tree. Each time you find a leaf you print the value there. This process recreates the original, uncompressed file.

Empirical Analysis

You should run the program HuffMark, which will read every file in a directory and compress it to another file in the same directory with a “.hf” suffix. You may want to modify this benchmarking program to print more data than it currently does. Run it on both the calgary directory, which represents the Calgary Corpus, a standard compression suite of files for empirical analysis (you can see this reference for comparisons on the Calgary Corpus), and on the waterloo directory, which is a collection of .tiff images used in some compression benchmarking. You can, of course, run on other data/collections. Be sure to check whether the file you compress and uncompress is the same as the original. See the howto diff section for more information.

The benchmarking program skips files with .hf suffixes, but you’ll want to change that at some point in your analysis.

Your analysis should focus on a few questions:

1. Which compresses more, binary files or text files?
2. Can you gain additional compression by double-compressing an already compressed file? If so, is there eventually a limit to when this no longer saves space on ordinary files? What if you built a file that was intentionally designed to compress a lot…when would it be no longer worthwhile to recompress?

If you use different methods to store information in the header of your compressed file, e.g., you store the tree and you also store counts, then reporting on differences is a good idea. Information on how to create the header is in the howto.

Grading

The program is worth 25 points. You get correctness/algorithmic points according to whether your program compresses and uncompresses to the original both text and other (e.g., jpg) files. You get engineering points according to how well your code is written, how robust your program is, whether the program detects when compression would not save space (the resulting file would be bigger), and how you write your header information. You get analysis points based on the information you report from compressing/uncompressing a corpus or two.

ONLY ONE PARTNER SUBMITS THE CODE, ANALYSIS, AND BOTH READMEs. MAKE SURE THAT YOUR SUBMISSION IS CORRECT BY CHECKING THAT THE SUBMIT HISTORY HAS ALL THE CODE, ANALYSIS AND BOTH READMES!!!!!

DESCRIPTION                                                        POINTS
compression/decompression of any                                   25%
compression/decompression of any file (including binary files)     25%
robustness (crash on non-huffed files, force compression..)        20%
program style (design, documentation, etc.)                        10%
empirical analysis                                                 20%

README must be complete for any credit

Huffman Coding Grading Standards

You can also get extra credit for this assignment by implementing multiple header styles and empirically testing which is better.

Your README file should include the names of all the people with whom you collaborated, and the TAs/UTAs you consulted with. You should include an estimate of how long you spent on the program and what your thoughts are about the assignment. Submit your README and all of your source code using Eclipse with assignment name huff. Remember that each person in a group should submit a separate README, and each README must include the names of the people in the group. Only one group member submits code.

Huffman Coding: From ASCII Coding to Huffman Coding

Many programming languages use ASCII coding for characters (ASCII stands for American Standard Code for Information Interchange). Some recent languages, e.g., Java, use UNICODE which, because it can encode a bigger set of characters, is more useful for languages like Japanese and Chinese which have a larger set of characters than are used in English.

We'll use ASCII encoding of characters as an example. In ASCII, every character is encoded with the same number of bits: 8 bits per character. Since there are 256 different values that can be encoded with 8 bits, there are potentially 256 different characters in the ASCII character set. The common characters, e.g., alphanumeric characters, punctuation, control characters, etc., use only 7 bits; there are 128 different characters that can be encoded with 7 bits. In C++ for example, the type char is divided into subtypes unsigned-char and (the default signed) char. As we'll see, Huffman coding compresses data by using fewer bits to encode more frequently occurring characters so that not all characters are encoded with 8 bits. In Java there are no unsigned types and char values use 16 bits (Unicode compared to ASCII). Substantial compression results regardless of the character-encoding used by a language or platform.

A Simple Coding Example

We'll look at how the string "go go gophers" is encoded in ASCII, how we might save bits using a simpler coding scheme, and how Huffman coding is used to compress the data resulting in still more savings.

With an ASCII encoding (8 bits per character) the 13 character string "go go gophers" requires 104 bits. The table below on the left shows how the coding works.

coding a message

ASCII coding               3-bit coding
char    ASCII   binary     char    code   binary
g       103     1100111    g       0      000
o       111     1101111    o       1      001
p       112     1110000    p       2      010
h       104     1101000    h       3      011
e       101     1100101    e       4      100
r       114     1110010    r       5      101
s       115     1110011    s       6      110
space   32      0100000    space   7      111

The string "go go gophers" would be written (coded numerically) as 103 111 32 103 111 32 103 111 112 104 101 114 115. Although not easily readable by humans, this would be written as the following stream of bits (the spaces would not be written, just the 0's and 1's)

1100111 1101111 0100000 1100111 1101111 0100000 1100111 1101111 1110000 1101000 1100101 1110010 1110011

Since there are only eight different characters in "go go gophers", it's possible to use only 3 bits to encode the different characters. We might, for example, use the encoding in the table on the right above, though other 3-bit encodings are possible.

Now the string "go go gophers" would be encoded as 0 1 7 0 1 7 0 1 2 3 4 5 6 or, as bits:

000 001 111 000 001 111 000 001 010 011 100 101 110

By using three bits per character, the string "go go gophers" uses a total of 39 bits instead of 104 bits. More bits can be saved if we use fewer than three bits to encode characters like g, o, and space that occur frequently and more than three bits to encode characters like e, p, h, r, and s that occur less frequently in "go go gophers". This is the basic idea behind Huffman coding: to use fewer bits for more frequently occurring characters. We'll see how this is done using a tree that stores characters at the leaves, and whose root-to-leaf paths provide the bit sequence used to encode the characters.

Towards a Coding Tree

A tree view of the ASCII character set

Using a tree (actually a binary trie, more on that later) all characters are stored at the leaves of a complete tree. In the diagram to the right, the tree has eight levels, meaning that every root-to-leaf path has seven edges. A left-edge (black in the diagram) is numbered 0, a right-edge (blue in the diagram) is numbered 1. The ASCII code for any character/leaf is obtained by following the root-to-leaf path and concatenating the 0's and 1's. For example, the character 'a', which has ASCII value 97 (1100001 in binary), is shown with root-to-leaf path of right-right-left-left-left-left-right.

The structure of the tree can be used to determine the coding of any leaf by using the 0/1 edge convention described. If we use a different tree, we get a different coding. As an example, the tree below on the right yields the coding shown on the left.

char    binary
'g'     10
'o'     11
'p'     0100
'h'     0101
'e'     0110
'r'     0111
's'     000
' '     001

Using this coding, "go go gophers" is encoded (spaces wouldn't appear in the bitstream) as:

10 11 001 10 11 001 10 11 0100 0101 0110 0111 000

This is a total of 37 bits, which saves two bits from the encoding in which each of the 8 characters has a 3-bit encoding that is shown above! The bits are saved by coding frequently occurring characters like 'g' and 'o' with fewer bits (here two bits) than characters that occur less frequently like 'p', 'h', 'e', and 'r'.
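As a quick check of the arithmetic, here is a minimal Java sketch that encodes a string with the table above. The class and method names (TableEncoder, gopherTable, encode) are made up for this illustration, and the bits are kept as '0'/'1' characters in a String purely for readability; a real compressor would write actual bits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TableEncoder {
    // Code table copied from the tree-derived coding shown above.
    static Map<Character, String> gopherTable() {
        Map<Character, String> table = new LinkedHashMap<>();
        table.put('g', "10");
        table.put('o', "11");
        table.put('p', "0100");
        table.put('h', "0101");
        table.put('e', "0110");
        table.put('r', "0111");
        table.put('s', "000");
        table.put(' ', "001");
        return table;
    }

    // Concatenates the code of every character in the text, in order.
    static String encode(String text, Map<Character, String> table) {
        StringBuilder bits = new StringBuilder();
        for (char ch : text.toCharArray()) {
            String code = table.get(ch);
            if (code == null) {
                throw new IllegalArgumentException("no code for character: " + ch);
            }
            bits.append(code);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        String bits = encode("go go gophers", gopherTable());
        System.out.println(bits + " (" + bits.length() + " bits)");
    }
}
```

Running main prints the bit string followed by its length, which should agree with the 37 bits counted above.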

The character-encoding induced by the tree can be used to decode a stream of bits as well as encode a string into a stream of bits. You can try to decode the following bitstream; the answer with an explanation follows:

01010110011100100001000101011001110110001101101100000010101011001110110

To decode the stream, start at the root of the encoding tree, and follow a left branch for a 0, a right branch for a 1. When you reach a leaf, write the character stored at the leaf, and start again at the top of the tree. To start, the bits are 010101100111. This yields left-right-left-right to the letter 'h', followed (starting again at the root) with left-right-right-left to the letter 'e', followed by left-right-right-right to the letter 'r'. Continuing until all the bits are processed yields

her sphere goes here
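The decoding walk can be sketched in Java as follows. This is only an illustration: the names TreeDecoder, Node, buildTree, and decode are assumptions, the tree is rebuilt from the code table rather than drawn by hand, and bits are again represented as '0'/'1' characters.

```java
class TreeDecoder {
    // The coding from the table above, as parallel arrays.
    static final char[] CHARS = {'g', 'o', 'p', 'h', 'e', 'r', 's', ' '};
    static final String[] CODES = {"10", "11", "0100", "0101", "0110", "0111", "000", "001"};

    static class Node {
        Node left, right;   // both null for a leaf
        char value;         // meaningful only at a leaf
    }

    // Rebuilds the coding tree by laying down each character's root-to-leaf path.
    static Node buildTree(char[] chars, String[] codes) {
        Node root = new Node();
        for (int i = 0; i < chars.length; i++) {
            Node cur = root;
            for (char bit : codes[i].toCharArray()) {
                if (bit == '0') {
                    if (cur.left == null) cur.left = new Node();
                    cur = cur.left;
                } else {
                    if (cur.right == null) cur.right = new Node();
                    cur = cur.right;
                }
            }
            cur.value = chars[i];
        }
        return root;
    }

    // Goes left on 0 and right on 1; at a leaf, emits its character and restarts at the root.
    static String decode(String bits, Node root) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.left : cur.right;
            if (cur == null) {
                throw new IllegalArgumentException("bit sequence does not match the tree");
            }
            if (cur.left == null && cur.right == null) {
                out.append(cur.value);
                cur = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = buildTree(CHARS, CODES);
        String bits = "01010110011100100001000101011001110110001101101100000010101"
                    + "011001110110";
        System.out.println(decode(bits, root));
    }
}
```

Note that decode needs no separators between codes: reaching a leaf is what marks the end of each character.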

Prefix codes and Huffman Codes

When all characters are stored in leaves, and every interior/(non-leaf) node has two children, the coding induced by the 0/1 convention outlined above has what is called the prefix property: no bit-sequence encoding of a character is the prefix of any other bit-sequence encoding. This makes it possible to decode a bitstream using the coding tree by following root-to-leaf paths. The tree shown above for "go go gophers" is an optimal tree: there are no other trees with the same characters that use fewer bits to encode the string "go go gophers". There are other trees that use 37 bits; for example you can simply swap any sibling nodes and get a different encoding that uses the same number of bits. We need an algorithm for constructing an optimal tree which in turn yields a minimal per-character encoding/compression. This algorithm is called Huffman coding, and was published by D. Huffman in 1952. It is an example of a greedy algorithm.

Huffman Coding

We'll use Huffman's algorithm to construct a tree that is used for data compression. In the previous section we saw examples of how a stream of bits can be generated from an encoding, e.g., how "go go gophers" was written as 1011001101100110110100010101100111000. We also saw how the tree can be used to decode a stream of bits. We'll discuss how to construct the tree here.

We'll assume that each character has an associated weight equal to the number of times the character occurs in a file, for example. In the "go go gophers" example, the characters 'g' and 'o' have weight 3, the space has weight 2, and the other characters have weight 1. When compressing a file we'll need to calculate these weights; we'll ignore this step for now and assume that all character weights have been calculated. Huffman's algorithm assumes that we're building a single tree from a group (or forest) of trees. Initially, all the trees have a single node with a character and the character's weight. Trees are combined by picking two trees, and making a new tree from the two trees. This decreases the number of trees by one at each step since two trees are combined into one tree. The algorithm is as follows:

1. Begin with a forest of trees. All trees are one node, with the weight of the tree equal to the weight of the character in the node. Characters that occur most frequently have the highest weights. Characters that occur least frequently have the smallest weights.
2. Repeat this step until there is only one tree:

Choose two trees with the smallest weights, call these trees T1 and T2. Create a new tree whose root has a weight equal to the sum of the weights T1 + T2 and whose left subtree is T1 and whose right subtree is T2.

3. The single tree left after the previous step is an optimal encoding tree.
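The three steps above can be sketched in Java with a priority queue holding the forest. This is a minimal sketch, not the assignment's required design: the names HuffmanBuild, Node, build, and encodedBits are assumptions, and the input is assumed to contain at least one character.

```java
import java.util.PriorityQueue;

class HuffmanBuild {
    static class Node implements Comparable<Node> {
        final char value;        // meaningful only for leaves
        final int weight;
        final Node left, right;  // both null for a leaf

        Node(char value, int weight, Node left, Node right) {
            this.value = value;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }

        @Override
        public int compareTo(Node other) {
            return Integer.compare(weight, other.weight);
        }
    }

    // Step 1: a forest of one-node trees. Step 2: repeatedly join the two lightest trees.
    static Node build(char[] chars, int[] weights) {
        PriorityQueue<Node> forest = new PriorityQueue<>();
        for (int i = 0; i < chars.length; i++) {
            forest.add(new Node(chars[i], weights[i], null, null));
        }
        while (forest.size() > 1) {
            Node t1 = forest.remove();
            Node t2 = forest.remove();
            forest.add(new Node('\0', t1.weight + t2.weight, t1, t2));
        }
        return forest.remove();   // Step 3: the single remaining tree
    }

    // Total bits needed: each leaf contributes its weight times its depth.
    static int encodedBits(Node node, int depth) {
        if (node.isLeaf()) {
            return node.weight * depth;
        }
        return encodedBits(node.left, depth + 1) + encodedBits(node.right, depth + 1);
    }

    public static void main(String[] args) {
        char[] chars = {'g', 'o', ' ', 'p', 'h', 'e', 'r', 's'};
        int[] weights = {3, 3, 2, 1, 1, 1, 1, 1};
        Node root = build(chars, weights);
        System.out.println("root weight: " + root.weight);
        System.out.println("bits to encode: " + encodedBits(root, 0));
    }
}
```

Which of several equal-weight trees the queue returns first is an implementation detail, so the shape of the tree may differ from the diagrams, but the total number of bits does not.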

We'll use the string "go go gophers" as an example. Initially we have the forest shown below. The nodes are shown with a weight/count that represents the number of times the node's character occurs.

We pick two minimal nodes. There are five nodes with the minimal weight of one; it doesn't matter which two we pick. In a program, the deterministic aspects of the program will dictate which two are chosen, e.g., the first two in an array, or the elements returned by a priority queue implementation. We create a new tree whose root is weighted by the sum of the weights chosen. We now have a forest of seven trees as shown here:

Choosing two minimal trees yields another tree with weight two as shown below. There are now six trees in the forest of trees that will eventually build an encoding tree.

Again we must choose the two trees of minimal weight. The lowest weight is the 'e'-node/tree with weight equal to one. There are three trees with weight two; we can choose any of these to create a new tree whose weight will be three.

Now there are two trees with weight equal to two. These are joined into a new tree whose weight is four. There are four trees left, one whose weight is four and three with a weight of three.

Two minimal (three weight) trees are joined into a tree whose weight is six. In the diagram below we choose the 'g' and 'o' trees (we could have chosen the 'g' tree and the space-'e' tree or the 'o' tree and the space-'e' tree.) There are three trees left.

The minimal trees have weights of three and four. These are joined into a tree whose weight is seven, leaving two trees.

Finally, the last two trees are joined into a final tree whose weight is thirteen, the sum of the two weights six and seven. Note that this tree is different from the tree we used to illustrate Huffman coding above, and the bit patterns for each character are different, but the total number of bits used to encode "go go gophers" is the same.

The character encoding induced by the last tree is shown below where again, 0 is used for left edges and 1 for right edges.

char    binary
'g'     00
'o'     01
'p'     1110
'h'     1101
'e'     101
'r'     1111
's'     1100
' '     100

The string "go go gophers" would be encoded as shown (with spaces used for easier reading, the spaces wouldn't appear in the real encoding).

00 01 100 00 01 100 00 01 1110 1101 101 1111 1100

Once again, 37 bits are used to encode "go go gophers". There are several trees that yield an optimal 37-bit encoding of "go go gophers". The tree that actually results from a programmed implementation of Huffman's algorithm will be the same each time the program is run for the same weights (assuming no randomness is used in creating the tree).

Why is Huffman Coding Greedy?

Huffman's algorithm is an example of a greedy algorithm. It's called greedy because the two smallest nodes are chosen at each step, and this local decision results in a globally optimal encoding tree. In general, greedy algorithms use small-grained, or local minimal/maximal choices to result in a global minimum/maximum. Making change using U.S. money is another example of a greedy algorithm.

• Problem: give change in U.S. coins for any amount (say under $1.00) using the minimal number of coins.
• Solution (assuming coin denominations of $0.25, $0.10, $0.05, and $0.01, called quarters, dimes, nickels, and pennies, respectively): use the highest-value coin that you can, and give as many of these as you can. Repeat the process until the correct change is given.
• Example: make change for $0.91. Use 3 quarters (the highest coin we can use, and as many as we can use). This leaves $0.16. To make change use a dime (leaving $0.06), a nickel (leaving $0.01), and a penny. The total change for $0.91 is three quarters, a dime, a nickel, and a penny. This is a total of six coins; it is not possible to make change for $0.91 using fewer coins.
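The change-making rule is short enough to write out directly. The sketch below uses made-up names (GreedyChange, makeChange) and works in cents; it assumes the denominations are listed from largest to smallest and include a one-cent coin. It also runs the same loop on a coin set without nickels, the case discussed next.

```java
import java.util.ArrayList;
import java.util.List;

class GreedyChange {
    // Greedy rule: use as many of the largest coin as fit, then move on to the next coin.
    // denominations must be in descending order and end with 1.
    static List<Integer> makeChange(int cents, int[] denominations) {
        List<Integer> coins = new ArrayList<>();
        for (int coin : denominations) {
            while (cents >= coin) {
                coins.add(coin);
                cents -= coin;
            }
        }
        return coins;
    }

    public static void main(String[] args) {
        // Standard U.S. coins: six coins for 91 cents.
        System.out.println(makeChange(91, new int[] {25, 10, 5, 1}));
        // No nickels: greedy uses seven coins for 31 cents, though four would do.
        System.out.println(makeChange(31, new int[] {25, 10, 1}));
    }
}
```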

The solution/algorithm is greedy because the largest denomination coin is chosen to use at each step, and as many are used as possible. This locally optimal step leads to a globally optimal solution. Note that the algorithm does not work with different denominations. For example, if there are no nickels, the algorithm will make change for $0.31 using one quarter and six pennies, a total of seven coins. However, it's possible to use three dimes and one penny, a total of four coins. This shows that greedy algorithms are not always optimal algorithms.

Implementing/Programming Huffman Coding

In this section we'll see the basic programming steps in implementing Huffman coding. More details can be found in the language specific descriptions.

There are two parts to an implementation: a compression program and an uncompression/decompression program. You need both to have a useful compression utility. We'll assume these are separate programs, but they share many classes, functions, modules, code or whatever unit-of-programming you're using. We'll call the program that reads a regular file and produces a compressed file the compression or huffing program. The program that does the reverse, producing a regular file from a compressed file, will be called the uncompression or unhuffing program.

The Compression or Huffing Program

To compress a file (sequence of characters) you need a table of bit encodings, e.g., an ASCII table, or a table giving a sequence of bits that's used to encode each character. This table is constructed from a coding tree using root-to-leaf paths to generate the bit sequence that encodes each character.

Assuming you can write a specific number of bits at a time to a file, a compressed file is made using the following top-level steps. These steps will be developed further into sub-steps, and you'll eventually implement a program based on these ideas and sub-steps.

1. Build a table of per-character encodings. The table may be given to you, e.g., an ASCII table, or you may build the table from a Huffman coding tree.
2. Read the file to be compressed (the plain file) and process one character at a time. To process each character find the bit sequence that encodes the character using the table built in the previous step and write this bit sequence to the compressed file.

As an example, we'll use the table below on the left, which is generated from the tree on the right. Ignore the weights on the nodes; we'll use those when we discuss how the tree is created.

Another Huffman Tree/Table Example

char    binary
'a'     100
'r'     101
'e'     11
'n'     0001
't'     011
's'     010
'o'     0000
' '     001

To compress the string/file "streets are stone stars are not", we read one character at a time and write the sequence of bits that encodes each character. To encode "streets are" we would write the following bits:

010011101111101101000110010111

The bits would be written in the order 010, 011, 101, 11, 11, 011, 010, 001, 100, 101, 11.

That's the compression program. Two things are missing from the compressed file: (1) some information (called the header) must be written at the beginning of the compressed file that will allow it to be uncompressed; (2) some information must be written at the end of the file that will be used by the uncompression program to tell when the compressed bit sequence is over (this is the bit sequence for the pseudo-eof character described later).

Building the Table for Compression/Huffing

To build a table of optimal per-character bit sequences you'll need to build a Huffman coding tree using the greedy Huffman algorithm. The table is generated by following every root-to-leaf path and recording the left/right 0/1 edges followed. These paths make the optimal encoding bit sequences for each character.

There are three steps in creating the table:

1. Count the number of times every character occurs. Use these counts to create an initial forest of one-node trees. Each node has a character and a weight equal to the number of times the character occurs. An example of one node trees shows what the initial forest looks like.
2. Use the greedy Huffman algorithm to build a single tree. The final tree will be used in the next step.
3. Follow every root-to-leaf path creating a table of bit sequence encodings for every character/leaf.
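Put together, the three steps might look like the following Java sketch. All names here (CodeTableBuilder, countChars, buildTree, fillTable, buildTable) are illustrative assumptions rather than part of any provided code, and the sketch assumes the text contains at least two distinct characters so that every leaf gets a non-empty path.

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

class CodeTableBuilder {
    static class Node implements Comparable<Node> {
        char value;          // meaningful only for leaves
        int weight;
        Node left, right;    // both null for a leaf

        Node(char value, int weight, Node left, Node right) {
            this.value = value;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        @Override
        public int compareTo(Node other) {
            return Integer.compare(weight, other.weight);
        }
    }

    // Step 1: count how many times every character occurs.
    static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char ch : text.toCharArray()) {
            counts.merge(ch, 1, Integer::sum);
        }
        return counts;
    }

    // Step 2: greedy Huffman algorithm over a forest of one-node trees.
    static Node buildTree(Map<Character, Integer> counts) {
        PriorityQueue<Node> forest = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            forest.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (forest.size() > 1) {
            Node t1 = forest.remove();
            Node t2 = forest.remove();
            forest.add(new Node('\0', t1.weight + t2.weight, t1, t2));
        }
        return forest.remove();
    }

    // Step 3: follow every root-to-leaf path, appending 0 for left and 1 for right.
    static void fillTable(Node node, String path, Map<Character, String> table) {
        if (node.left == null && node.right == null) {
            table.put(node.value, path);
            return;
        }
        fillTable(node.left, path + "0", table);
        fillTable(node.right, path + "1", table);
    }

    static Map<Character, String> buildTable(String text) {
        Map<Character, String> table = new TreeMap<>();
        fillTable(buildTree(countChars(text)), "", table);
        return table;
    }

    public static void main(String[] args) {
        System.out.println(buildTable("go go gophers"));
    }
}
```

The exact codes printed depend on how the priority queue breaks ties between equal weights, but any table produced this way has the prefix property and gives the same total encoded length.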

Header Information

You must store some initial information in the compressed file that will be used by the uncompression/unhuffing program. Basically you must store the tree used to compress the original file. This tree is used by the uncompression program.

There are several alternatives for storing the tree. Some are outlined here; you may explore others as part of the specifications of your assignment.

• Store the character counts at the beginning of the file. You can store counts for every character, or counts for the non-zero characters. If you do the latter, you must include some method for indicating the character, e.g., store character/count pairs.
• You could use a "standard" character frequency, e.g., for any English language text you could assume weights/frequencies for every character and use these in constructing the tree for both compression and uncompression.
• You can store the tree at the beginning of the file. One method for doing this is to do a pre-order traversal, writing each node visited. You must differentiate leaf nodes from internal/non-leaf nodes. One way to do this is write a single bit for each node, say 1 for leaf and 0 for non-leaf. For leaf nodes, you will also need to write the character stored. For non-leaf nodes there's no information that needs to be written, just the bit that indicates there's an internal node.
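The last alternative, a pre-order traversal with one flag bit per node, can be sketched as a matching pair of write and read methods. This is one possible layout, not a required header format: the names TreeHeader, write, and read are assumptions, each leaf value is assumed to fit in 8 bits (a pseudo-EOF leaf with a value above 255 would need a wider field), and bits are kept as '0'/'1' characters for readability.

```java
class TreeHeader {
    static class Node {
        final int value;         // 8-bit chunk stored at a leaf; -1 for internal nodes
        final Node left, right;  // both null for a leaf

        Node(int value) {
            this.value = value;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.value = -1;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Pre-order: 1 plus the 8-bit value for a leaf; 0 for an internal node, then both subtrees.
    static void write(Node node, StringBuilder bits) {
        if (node.isLeaf()) {
            bits.append('1');
            String binary = Integer.toBinaryString(node.value);
            for (int i = binary.length(); i < 8; i++) {
                bits.append('0');   // left-pad the value to exactly 8 bits
            }
            bits.append(binary);
        } else {
            bits.append('0');
            write(node.left, bits);
            write(node.right, bits);
        }
    }

    // Mirror image of write; pos[0] is the index of the next unread bit.
    static Node read(String bits, int[] pos) {
        char flag = bits.charAt(pos[0]++);
        if (flag == '1') {
            int value = Integer.parseInt(bits.substring(pos[0], pos[0] + 8), 2);
            pos[0] += 8;
            return new Node(value);
        }
        Node left = read(bits, pos);
        Node right = read(bits, pos);
        return new Node(left, right);
    }

    public static void main(String[] args) {
        Node tree = new Node(new Node('a'), new Node(new Node('b'), new Node('c')));
        StringBuilder header = new StringBuilder();
        write(tree, header);
        System.out.println(header + " (" + header.length() + " bits)");
    }
}
```

With this layout a tree with n leaves takes 9 bits per leaf plus 1 bit per internal node, which is often much smaller than storing a count for every possible character.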

The pseudo-eof character

When you write output the operating system typically buffers the output for efficiency. This means output is actually written to disk when some internal buffer is full, not every time you write to a stream in a program. Operating systems also typically require that disk files have sizes that are multiples of some architecture/operating system specific unit, e.g., a byte or word. On many systems all file sizes are multiples of 8 or 16 bits so that it isn't possible to have a 122 bit file.

In particular, it is not possible to write just one single bit to a file; all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you cannot be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Your decompressing/unhuff program must have some mechanism to account for these extra or "padding" bits since these bits do not represent compressed information.

Your decompression/unhuff program cannot simply read bits until there are no more left since your program might then read the extra padding bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.

    int bits;
    while ((bits = input.readbits(1)) != -1) {
        // process bits
    }

To avoid this problem, you can use a pseudo-EOF character and write a loop that stops when the pseudo-EOF character is read in (in compressed form). The code below illustrates how reading a compressed file works using a pseudo-EOF character:
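A minimal sketch of such a loop, under stated assumptions: PSEUDO_EOF is given the hypothetical value 256 (one more than the largest 8-bit chunk), BitSource is a small stand-in for a bit-at-a-time input stream that returns -1 when no bits remain, and the tiny three-code table in main is invented purely for the demonstration.

```java
class PseudoEofDecoder {
    static final int PSEUDO_EOF = 256;   // assumed value, not taken from the provided code

    static class Node {
        Node left, right;   // both null for a leaf
        int value = -1;     // chunk value (or PSEUDO_EOF) at a leaf
    }

    // Stand-in for a bit stream: hands out one bit per call, -1 when exhausted.
    static class BitSource {
        private final String bits;
        private int pos = 0;

        BitSource(String bits) {
            this.bits = bits;
        }

        int readBit() {
            return pos < bits.length() ? bits.charAt(pos++) - '0' : -1;
        }
    }

    // Builds the coding tree from parallel arrays of values and their bit codes.
    static Node buildTree(int[] values, String[] codes) {
        Node root = new Node();
        for (int i = 0; i < values.length; i++) {
            Node cur = root;
            for (char bit : codes[i].toCharArray()) {
                if (bit == '0') {
                    if (cur.left == null) cur.left = new Node();
                    cur = cur.left;
                } else {
                    if (cur.right == null) cur.right = new Node();
                    cur = cur.right;
                }
            }
            cur.value = values[i];
        }
        return root;
    }

    // Walks the tree one bit at a time and stops at the pseudo-EOF leaf, not at end of input.
    static String decode(BitSource in, Node root) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        while (true) {
            int bit = in.readBit();
            if (bit == -1) {
                throw new IllegalStateException("ran out of bits before pseudo-EOF");
            }
            cur = (bit == 0) ? cur.left : cur.right;
            if (cur == null) {
                throw new IllegalStateException("bits do not match the tree");
            }
            if (cur.left == null && cur.right == null) {
                if (cur.value == PSEUDO_EOF) {
                    break;               // everything after this is padding
                }
                out.append((char) cur.value);
                cur = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = buildTree(new int[] {'g', 'o', PSEUDO_EOF}, new String[] {"0", "10", "11"});
        // "goo" + pseudo-EOF is 0 10 10 11 (7 bits); one padding 0 fills the byte.
        System.out.println(decode(new BitSource("01010110"), root));
    }
}
```

The trailing padding bit is never examined because the loop exits as soon as the pseudo-EOF leaf is reached; a loop that read until -1 would have decoded it as an extra character.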

When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. You will have to write these bits explicitly. These bits will be recognized by the program unhuff and used in the decompression process. This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (you may need to think about this to really believe it). In other words, when decompressing you will read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate because all decompression is done. If reading a bit fails because there are no more bits (the bit-reading function reports failure, e.g., by returning -1), the compressed file is not well formed.

Every time a file is compressed the count of the number of times the pseudo-EOF character occurs should be one; this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created and used in creating the tree used for compression.

Huffman Howto, The Program You Develop

You're given classes to read and write individual bits-at-a-time, these are described below. You're also given a main program Huff.java that creates an instance of the non-functioning implementation of the IHuffProcessor interface named SimpleHuffProcessor. Choosing options from the GUI using this implementation as shown on the left, below, generates an error-dialog as shown on the right since none of the methods are currently implemented (they each throw an exception). You implement the methods for this assignment.

When you write your methods in SimpleHuffProcessor to read or write bits you'll need to create either BitInputStream or BitOutputStream objects to read bits-at-a-time (or write them). Information and help on how to do this is given below, but you should probably scan this howto completely before coding.

Fast-reading and Out-of-memory

If your program generates an out-of-memory error when reading large files, use the Options menu in the GUI to choose Slow Reading as shown in the screen shot below.

This makes reading files slower but the GUI/View code won't map the entire file into memory before reading when you compress or uncompress a file.

Compressing using Huffman Coding

The three steps below summarize how compression works and provide some advice on coding.

1. To compress a file, count how many times every bit-sequence occurs in the file. These counts are used to build weighted nodes that will be leaves in the Huffman tree. Although this writeup sometimes refers to "characters", you should use int variables/values in your code rather than char. Note that the method for reading bits-at-a-time from a BitInputStream returns an int, so using int variables makes sense (this might be different in the Burrows-Wheeler code you write). Any wording in this write-up that uses the word character means an 8-bit chunk, and this chunk-size could (in theory) change. Do not use any variables of type byte in your program. Use only int variables, or perhaps char variables if you implement the Burrows-Wheeler project.

2. From these counts build the Huffman tree. First create one node per character, weighted with the number of times the character occurs, and insert each node into a priority queue. Then choose two minimal nodes, join these nodes together as children of a newly created node, and insert the newly created node into the priority queue (see notes from class). The new node is weighted with the sum of the two minimal nodes taken from the priority queue. Continue this process until only one node is left in the priority queue. This is the root of the Huffman tree.

3. Create a table or map of 8-bit chunks (represented as an int value) to Huffman-codings. The map of chunk-codings is formed by traversing the path from the root of the Huffman tree to each leaf. Each root-to-leaf path creates a chunk-coding for the value stored in the leaf. When going left in the tree append a zero to the path; when going right append a one. The map has the 8-bit int chunks as keys and the corresponding Huffman/chunk-coding String as the value associated with the key. The map can be an array of the appropriate size (roughly 256, but be careful of PSEUDO_EOF) or you can use a Map implementation. An array is the simplest approach for this part of the huff/compress process; using a Map is not necessary, but it's fine to use one.
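As a rough illustration of step 1, the sketch below counts 8-bit chunks using int values only. It is not the assignment's code: it reads from a plain java.io.InputStream (whose read() returns an int from 0 to 255, or -1 at end of input) as a stand-in for BitInputStream, and it assumes PSEUDO_EOF is 256. The class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class FrequencyCountSketch {
    static final int ALPH_SIZE = 256;        // number of distinct 8-bit chunks
    static final int PSEUDO_EOF = ALPH_SIZE; // assumed value: first int that does not fit in 8 bits

    // Counts how often each 8-bit chunk occurs; the pseudo-EOF slot is set to 1 explicitly.
    static int[] countChunks(InputStream in) throws IOException {
        int[] counts = new int[ALPH_SIZE + 1];
        int chunk;
        while ((chunk = in.read()) != -1) {  // read() yields 0..255, or -1 when input is exhausted
            counts[chunk]++;
        }
        counts[PSEUDO_EOF] = 1;
        return counts;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "go go gophers".getBytes(StandardCharsets.US_ASCII);
        int[] counts = countChunks(new ByteArrayInputStream(data));
        System.out.println("g=" + counts['g'] + " o=" + counts['o'] + " space=" + counts[' ']);
    }
}
```

The array has ALPH_SIZE + 1 slots so that the pseudo-EOF value has a place of its own.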

Once you've tested the code above you'll be ready to create the compressed output file. To do this you'll read the input file a second time, but the GUI front-end does this for you when it calls the method IHuffProcessor.compress to do the compression. For each 8-bit chunk read, write the corresponding encoding of the 8-bit chunk (obtained from the map of encodings) to the compressed file. You write bits using a BitOutputStream object; you don't write Strings/chars. Instead you write one bit, either a zero or a one, for each corresponding character '0' or '1' in the string that is the encoding.

To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file (e.g., using a preorder traversal), or it might be created from 8-bit chunk counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff). There's more information below on storing/reading information to re-create the tree.

Help With Coding

The sections below contain explanations and advice on different aspects of the code you'll write to compress data.

• Pseudo-EOF character
• Priority Queues
• Writing Bits
• Converting Huffman tree to Map/table
• Implementing and Debugging the program
• Bitops/reading and writing bits

Pseudo-EOF character

(For more details, see the complete huffman coding discussion.)

The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit-at-a-time to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks or 256-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you should not use code like the loop below because the last few bits read may not have been written by your program, but rather as a result of buffering and writing bits in 8-bit chunks.
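The kind of loop warned about above can be sketched as follows. This is only an illustration: readBit is a hypothetical stand-in for reading one bit from a BitInputStream, and the one-byte array plays the role of a file into which a program wrote just 3 bits.

```java
class NaiveReadSketch {
    // Hypothetical stand-in for a one-bit read; returns -1 when no bits remain.
    static int readBit(byte[] data, int bitIndex) {
        if (bitIndex >= data.length * 8) {
            return -1;
        }
        return (data[bitIndex / 8] >> (7 - bitIndex % 8)) & 1;
    }

    // Problematic pattern: reads until the data runs out, so padding bits count as data.
    static int countBitsNaively(byte[] data) {
        int n = 0;
        while (readBit(data, n) != -1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] file = { (byte) 0b10100000 }; // only the first 3 bits (1, 0, 1) were "really" written
        System.out.println(countBitsNaively(file) + " bits read, but only 3 were written");
    }
}
```

The loop reports 8 bits because the 5 padding bits are indistinguishable from real data.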

To avoid this problem, there are two solutions: store the number of real bits in the header of the compressed file or use a pseudo-EOF character whose Huffman-coding is written to the compressed file. Then when you read the compressed file your code stops when the encoding for the pseudo-EOF character is read. The pseudocode below shows how to read a compressed file using the pseudo-EOF technique.
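A minimal sketch of the pseudo-EOF technique, under stated assumptions: Node is a hypothetical stand-in for the provided TreeNode, readBit stands in for a one-bit read from a BitInputStream, PSEUDO_EOF is assumed to be 256, and the tree is assumed to have at least two leaves. None of these names come from the assignment code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PseudoEofDecodeSketch {
    static final int PSEUDO_EOF = 256; // assumed value, one past the largest 8-bit chunk

    // Hypothetical tree node; the assignment's TreeNode may look different.
    static class Node {
        final int value;
        final Node left, right;
        Node(int value) { this.value = value; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.value = -1; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Stand-in for a one-bit read; returns -1 when no bits remain.
    static int readBit(byte[] data, int bitIndex) {
        if (bitIndex >= data.length * 8) return -1;
        return (data[bitIndex / 8] >> (7 - bitIndex % 8)) & 1;
    }

    // Walks the tree one bit at a time and stops at the pseudo-EOF leaf.
    static byte[] decode(Node root, byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Node current = root;
        int i = 0;
        while (true) {
            int bit = readBit(data, i++);
            if (bit == -1) {
                throw new IllegalStateException("ran out of bits before pseudo-EOF: file is not well formed");
            }
            current = (bit == 0) ? current.left : current.right;
            if (current.isLeaf()) {
                if (current.value == PSEUDO_EOF) {
                    return out.toByteArray(); // any padding bits after this point are never read
                }
                out.write(current.value);
                current = root;
            }
        }
    }

    // Codes in this tiny tree: 'a' = 0, 'b' = 10, pseudo-EOF = 11.
    static Node sampleTree() {
        return new Node(new Node('a'), new Node(new Node('b'), new Node(PSEUDO_EOF)));
    }

    public static void main(String[] args) {
        byte[] compressed = { 0b01011000 }; // 0, 10, 11, then three padding zeros
        System.out.println(new String(decode(sampleTree(), compressed), StandardCharsets.US_ASCII));
    }
}
```

Running out of bits is treated as an error here, which matches the idea that a properly compressed file always contains the pseudo-EOF code.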

When you're writing the compressed file be sure that the last bits written are the Huffman-coding bits that correspond to the pseudo-EOF char. You will have to write these bits explicitly. These bits will be recognized and used in the decompression process. This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (you may need to think about this to really believe it). In other words, when decompressing you will read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate because all decompression is done. If reading a bit fails because there are no more bits (the bit-reading method returns -1) the compressed file is not well formed. Your program should cope with files that are not well-formed, be sure to test for this, i.e., test unhuff with plain (uncompressed) files.

My program generates this error when such a file is found.

In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.

In the file IHuffConstants the number of characters counted is specified by ALPH_SIZE which has value 256. Although only 256 values can be represented by 8 bits, these values are between 0 and 255, inclusive. One character is used as the pseudo-EOF character -- it must be a value not representable with 8 bits; the smallest such value is 256, which requires 9 bits to represent. However, ideally your program should be able to work with n-bit chunks, not just 8-bit chunks.

Priority Queues

You're given a TreeNode that implements Comparable. You can use this class in storing weighted character/chunk objects in a priority queue to make a Huffman tree.
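The sketch below shows one way a Comparable node can be combined with java.util.PriorityQueue to build the tree. Node here is a hypothetical stand-in for the provided TreeNode, whose fields and constructors may differ, and skipping chunks with a zero count is an assumption of this sketch.

```java
import java.util.PriorityQueue;

class HuffTreeSketch {
    // Hypothetical stand-in for the provided TreeNode; ordered by weight only.
    static class Node implements Comparable<Node> {
        final int value, weight;
        final Node left, right;
        Node(int value, int weight, Node left, Node right) {
            this.value = value;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        @Override
        public int compareTo(Node other) { return Integer.compare(weight, other.weight); }
    }

    // Assumes at least one count is positive (true once pseudo-EOF has count 1).
    static Node buildTree(int[] counts) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (int v = 0; v < counts.length; v++) {
            if (counts[v] > 0) {               // assumption: chunks that never occur get no leaf
                pq.add(new Node(v, counts[v], null, null));
            }
        }
        while (pq.size() > 1) {
            Node a = pq.remove();              // the two minimal nodes
            Node b = pq.remove();
            pq.add(new Node(-1, a.weight + b.weight, a, b));
        }
        return pq.remove();                    // the last node left is the root
    }

    public static void main(String[] args) {
        int[] counts = new int[257];           // 256 chunks plus an assumed pseudo-EOF slot
        for (char ch : "go go gophers".toCharArray()) {
            counts[ch]++;
        }
        counts[256] = 1;
        System.out.println("root weight = " + buildTree(counts).weight);
    }
}
```

The root's weight equals the sum of all counts, which is a quick sanity check when debugging.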

Creating a Map/Table from a Huffman-tree

(for more details, see the complete huffman coding discussion.)

To create a table or map of coded bit values for each 8-bit chunk you'll need to traverse the Huffman tree (e.g., inorder, preorder, etc.) making an entry in the map each time you reach a leaf. For example, if you reach a leaf that stores the 8-bit chunk 'C', following a path left-left-right-right-left, then an entry in the 'C'-th location of the map should be set to 00110. You'll need to make a decision about how to store the bit patterns in the map --- the answer for this assignment is to use a string whose only characters are '0' and '1'. The string represents the path from the root of the Huffman tree to a leaf, and the value in the leaf has a Huffman coding represented by that root-to-leaf path. This means you'll need to follow every root-to-leaf path in the Huffman tree, building the root-to-leaf path during the traversal. When you reach a leaf, the path is that leaf value's encoding. One way to do this is with a method that takes a TreeNode parameter and a String that represents the path to the node. Initially the string is empty "" and the node is the global root. When your code traverses left, a "0" is added to the path, and similarly a "1" is added when going right.

    ......
    recurse(root.left, path + "0");
    recurse(root.right, path + "1");
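A fuller version of that recursion might look like the sketch below. Node is a hypothetical stand-in for TreeNode, and a java.util.Map is used for the table only to keep the example short; an array of String indexed by chunk value would work the same way.

```java
import java.util.Map;
import java.util.TreeMap;

class CodeTableSketch {
    // Hypothetical tree node; the assignment's TreeNode may look different.
    static class Node {
        final int value;
        final Node left, right;
        Node(int value) { this.value = value; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.value = -1; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Records the root-to-leaf path of every leaf: "0" for left, "1" for right.
    static void recurse(Node node, String path, Map<Integer, String> codes) {
        if (node.isLeaf()) {
            codes.put(node.value, path);
            return;
        }
        recurse(node.left, path + "0", codes);
        recurse(node.right, path + "1", codes);
    }

    static Map<Integer, String> buildCodes(Node root) {
        Map<Integer, String> codes = new TreeMap<>();
        recurse(root, "", codes);
        return codes;
    }

    public static void main(String[] args) {
        // Small hand-built tree: 'a' and ' ' under the left child, 't' as the right child.
        Node root = new Node(new Node(new Node('a'), new Node(' ')), new Node('t'));
        System.out.println(buildCodes(root));
    }
}
```

Each recursive call builds a new String, so the path for one branch never leaks into a sibling branch.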

Writing Bits in the Compressed File

There are three steps in writing a compressed file from the information your code determined and stored: the counts and encodings. All this code is written/called from the IHuffProcessor.compress method, which is called from the GUI after the IHuffProcessor.preprocessCompress method has been called to set state appropriately in your model.

1. Write a magic number at the beginning of the compressed file. You can access the IHuffConstants.MAGIC_NUMBER value either without the IHuffConstants modifier in your IHuffProcessor implementation (because the latter interface extends the former) or using the complete IHuffConstants.MAGIC_NUMBER identifier. When you uncompress you'll read this number to ensure you're reading a file your program compressed. Your program should be able to uncompress files it creates. For extra credit you should be able to process standard headers, specified by magic numbers STORE_COUNTS and STORE_TREE in the IHuffConstants interface. There's also a value for custom headers.

For example, in my working program that only works with my compressed files, not other standard formats, I have the following code:

    // write out the magic number
    out.writeBits(BITS_PER_INT, MAGIC_NUMBER);

then in another part of the class (in another method)

    int magic = in.readBits(BITS_PER_INT);
    if (magic != MAGIC_NUMBER) {
        throw new IOException("magic number not right");
    }

In general, a file with the wrong magic number should not generate an error, but should notify the user. For example, in my program the exception above ultimately causes the user to see what's shown below. This is because the exception is caught and the viewer's showError method called appropriately. Your code should at least print a message, and ideally generate an error dialog as shown.

Header Information

2. Write information after the magic number that allows the Huffman tree to be recreated. The simplest thing to do here is write ALPH_SIZE counts as int values, but you can also write the tree. Writing the counts will create a header in standard count format or SCF. This is a header of 256 counts, one 32-bit int value for each 8-bit chunk, in order from 0-255. You don't need a count for pseudo-EOF because it's one.

In my non-saving-space code using SCF, my header is written by the following code. Note that BITS_PER_INT is 32 in Java.

    for (int k = 0; k < ALPH_SIZE; k++) {
        out.writeBits(BITS_PER_INT, myCounts[k]);
    }

This header is then read as follows, this doesn't do much, but shows how reading/writing the header are related.

    for (int k = 0; k < ALPH_SIZE; k++) {
        int bits = in.readBits(BITS_PER_INT);
        myCounts[k] = bits;
    }

Tree Header for Extra Credit

To write the tree for extra credit, think about the 20 questions program.

For example, you can use a 0 or 1 bit to differentiate between internal nodes and leaves. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character).

For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)

0 0 1 001100001 1 000100000 1 001110100

The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that represent 97 (001100001 is 97 in binary), the Unicode/ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32 (000100000 is 32 in binary), the ASCII value of a space. The next 1 indicates the right child of the root is a leaf, it stores the Unicode/ASCII value for a 't' which is 116 (001110100 is 116 in binary).

Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero (again, think about the 20-questions/animal program).

Standard Tree Format in the Huff program/suite uses a pre-order traversal, a single zero-bit for internal nodes, a single one-bit for a leaf, and nine bits for the value stored in a leaf. For extra/A credit your program should be able to read/write this format.
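The pre-order scheme described above can be sketched as follows. To keep the example self-contained, bits are collected as '0' and '1' characters in a String rather than written through BitOutputStream, and Node is a hypothetical stand-in for TreeNode; real code would write and read actual bits instead.

```java
class TreeHeaderSketch {
    static final int BITS_PER_LEAF_VALUE = 9; // nine bits per leaf value, as described above

    // Hypothetical tree node; the assignment's TreeNode may look different.
    static class Node {
        final int value;
        final Node left, right;
        Node(int value) { this.value = value; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.value = -1; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Pre-order: 0 for an internal node, 1 followed by nine value bits for a leaf.
    static void writeTree(Node node, StringBuilder bits) {
        if (node.isLeaf()) {
            bits.append('1');
            for (int shift = BITS_PER_LEAF_VALUE - 1; shift >= 0; shift--) {
                bits.append((node.value >> shift) & 1);
            }
        } else {
            bits.append('0');
            writeTree(node.left, bits);
            writeTree(node.right, bits);
        }
    }

    // Mirror image of writeTree; pos[0] is the index of the next unread bit.
    static Node readTree(String bits, int[] pos) {
        if (bits.charAt(pos[0]++) == '1') {
            int value = Integer.parseInt(bits.substring(pos[0], pos[0] + BITS_PER_LEAF_VALUE), 2);
            pos[0] += BITS_PER_LEAF_VALUE;
            return new Node(value);
        }
        Node left = readTree(bits, pos);
        Node right = readTree(bits, pos);
        return new Node(left, right);
    }

    public static void main(String[] args) {
        Node root = new Node(new Node(new Node('a'), new Node(' ')), new Node('t'));
        StringBuilder bits = new StringBuilder();
        writeTree(root, bits);
        System.out.println(bits);
        Node copy = readTree(bits.toString(), new int[] { 0 });
        System.out.println((char) copy.right.value);
    }
}
```

For the three-leaf tree in the example, the output matches the bit sequence shown earlier once the spaces are removed.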

3. Write the bits needed to encode each character of the input file. For example, if the coding for 'a' is "01011" then your code will have to write 5 bits, in the order 0, 1, 0, 1, 1 every time the program is compressing/encoding the chunk 'a'. You'll re-read the file being compressed, look up each chunk/character's encoding and print a 0 or 1 bit for each '0' or '1' character in the encoding.
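The sketch below illustrates turning an encoding string into individual bits. BitSink is a hypothetical stand-in for BitOutputStream so that the example runs on its own; with the provided class, each bit would presumably be written with a call such as writeBits(1, bit), though that is an assumption about its use rather than something this sketch verifies.

```java
import java.io.ByteArrayOutputStream;

class WriteCodeSketch {
    // Hypothetical stand-in for BitOutputStream: packs bits into bytes, first bit in the high position.
    static class BitSink {
        private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        private int buffer = 0;
        private int used = 0;

        void writeBit(int bit) {
            buffer = (buffer << 1) | (bit & 1);
            if (++used == 8) {
                bytes.write(buffer);
                buffer = 0;
                used = 0;
            }
        }

        // Pads the final partial byte with zero bits, the way buffered output ends on a byte boundary.
        byte[] finish() {
            while (used != 0) {
                writeBit(0);
            }
            return bytes.toByteArray();
        }
    }

    // Writes one bit per '0' or '1' character of the encoding.
    static void writeCode(String code, BitSink out) {
        for (int i = 0; i < code.length(); i++) {
            out.writeBit(code.charAt(i) == '1' ? 1 : 0);
        }
    }

    public static void main(String[] args) {
        BitSink out = new BitSink();
        writeCode("01011", out); // the chunk 'a' occurs twice in this made-up input
        writeCode("01011", out);
        for (byte b : out.finish()) {
            System.out.println(Integer.toBinaryString((b & 0xFF) | 0x100).substring(1));
        }
    }
}
```

Ten code bits end up occupying two bytes, with six padding bits at the end, which is exactly why a pseudo-EOF code (or a bit count) is needed when reading the file back.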

Implementing and Debugging

It's a good idea to either create more than one class to help manage the complexity in these programs or to add methods/code incrementally after each has been tested. Because the same data structures need to be used to ensure that a file compressed using your huff algorithm can be decompressed, you should be able to share several parts of the implementation. You can use classes to exploit this similarity.

Debugging Code

Designing debugging functions as part of the original program will make the program development go more quickly since you will be able to verify that pieces of the program, or certain classes, work properly. Building in the debugging scaffolding from the start will make it much easier to test and develop your program. When testing, use small examples of test files maybe even as simple as "go go gophers" that help you verify that your program and classes are functioning as intended.

You might want to write encoding bits out first as strings or printable int values rather than as raw bits of zeros and ones which won't be readable except to other computer programs. A Compress class, for example, could support printAscii functions and printBits to print in human readable or machine readable formats.

We cannot stress enough how important it is to develop your program a few steps at a time. At each step, you should have a functioning program, although it may not do everything the first time it's run. By developing in stages, you may find it easier to isolate bugs and you will be more likely to get a program working faster. In other words, do not write hundreds of lines of code before compiling and testing.

Using BitInputStream

In order to read and write in a bit-at-a-time manner, two classes are provided BitInputStream and BitOutputStream.

Bit read/write subprograms

To see how the readBits routine works, note that the code segment below is functionally equivalent to the Unix command cat foo --- it reads BITS_PER_WORD bits at a time (which is 8 bits as defined in IHuffConstants) and echoes what is read.

Note that executing the Java statement System.out.print('7') results in 16 bits being written because a Java char uses 16 bits (the 16 bits correspond to the character '7'). Executing System.out.println(7) results in 32 bits being written because a Java int uses 32 bits. Executing obs.writeBits(3,7) results in 3 bits being written (to the BitOutputStream obs) --- all the bits are 1 because the number 7 is represented in base two by 000111.

When using writeBits to write a specified number of bits, some bits may not be written immediately because of buffering. To ensure that all bits are written, the last bits must be explicitly flushed. The function flush must be called either explicitly or by calling .

Although readBits can be called to read a single bit at a time (by setting the parameter to 1), the return value from the method is an int. You'll need to be able to access just one bit of this int (inbits in code above). In order to access just the right-most bit a bitwise and & can be used:
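A minimal sketch of that check, with a made-up helper name: the value passed in plays the role of the int returned by a read.

```java
class LowBitSketch {
    // Returns the right-most bit (0 or 1) of an int using bitwise and.
    static int lowBit(int inbits) {
        return inbits & 1;
    }

    public static void main(String[] args) {
        System.out.println(lowBit(6)); // 6 is 110 in base two
        System.out.println(lowBit(7)); // 7 is 111 in base two
        // Caution: -1 & 1 is also 1, so test for a failed read (-1) before masking.
        System.out.println(lowBit(-1));
    }
}
```

Checking for the -1 return value first avoids mistaking a failed read for a one-bit.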

Alternatively, you can mod by 2, e.g., inbits % 2 and check to see if the remainder is 0 or 1 to determine if the right-most bit is 0 or 1. Using bitwise-and is faster than using mod, but this speed is minor compared to what you'll spend reading the file.

InputStream objects

In Java, it's simple to construct one input stream from another. The Viewer/GUI code that drives the model will send an InputStream object to the model for readable files; it will also send an OutputStream for writeable files. The client/model code you write will need to wrap this stream in an appropriate BitInputStream or BitOutputStream object.

    public int uncompress(InputStream in, OutputStream out) ...
        BitInputStream bis = new BitInputStream(in);
        ...

Of course exceptions may need to be caught or rethrown. For input, you'll need to always create a BitInputStream object to read chunks or bits from. For the output stream, you may need to create a BitOutputStream to write individual bits, so you should create such a stream -- for uncompressing it's possible to just write without creating a BitOutputStream using the OutputStream.write method, but you'll find it simpler to use BitOutputStream.writeBits method.

Forcing Compression

If compressing a file results in a file larger than the file being compressed (this is always possible) then no compressed file should be created and a message should be shown indicating that this is the case. Here's a screen shot from what happens in my program.
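One possible way to decide whether compression would produce a larger file is to add up the bits before writing anything. The sketch below assumes a 32-bit magic number followed by a standard count format header of ALPH_SIZE 32-bit counts, ignores padding to a byte boundary, and uses made-up method names; the force parameter is modeled on the force-compression option of IHuffProcessor.compress.

```java
class CompressionSizeSketch {
    static final int BITS_PER_WORD = 8;
    static final int BITS_PER_INT = 32;
    static final int ALPH_SIZE = 256;

    // counts and codes are indexed by chunk value; index ALPH_SIZE is the assumed pseudo-EOF slot.
    static long compressedBits(int[] counts, String[] codes) {
        long bits = BITS_PER_INT                      // magic number
                  + (long) ALPH_SIZE * BITS_PER_INT;  // assumed SCF header
        for (int v = 0; v < counts.length; v++) {
            if (codes[v] != null) {
                bits += (long) counts[v] * codes[v].length();
            }
        }
        return bits;
    }

    static long originalBits(int[] counts) {
        long bits = 0;
        for (int v = 0; v < ALPH_SIZE; v++) {         // pseudo-EOF is not part of the original file
            bits += (long) counts[v] * BITS_PER_WORD;
        }
        return bits;
    }

    // Compress when forced, or when the result is not larger than the original.
    static boolean shouldCompress(int[] counts, String[] codes, boolean force) {
        return force || compressedBits(counts, codes) <= originalBits(counts);
    }

    public static void main(String[] args) {
        int[] counts = new int[ALPH_SIZE + 1];
        String[] codes = new String[ALPH_SIZE + 1];
        counts['a'] = 1;         codes['a'] = "0";
        counts[ALPH_SIZE] = 1;   codes[ALPH_SIZE] = "1";
        System.out.println(compressedBits(counts, codes) + " vs " + originalBits(counts));
        System.out.println(shouldCompress(counts, codes, false));
    }
}
```

For a one-character file the header alone dwarfs the original, so this check refuses to compress unless forced.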

You can choose a force compression option from the GUI/Options menu. If this is chosen/checked, the value of the third parameter to IHuffProcessor.compress is true, and your code should "compress" a file even though the resulting file will be bigger. Otherwise (force is false), if the compressed file is bigger, your program should not compress and should generate an error such as the one shown above.

Same File Uncompressed/Diff

When you compress a file, e.g., foo.txt to foo.txt.hf and then uncompress it to foo.txt.unhf, you'll want to see whether the .unhf file is the same as the original. For very small text files you can verify this by eyeballing the file. But for large files, and for non-text files (e.g., .jpg, .mp3) you'll need a program to help with this.

To help with this you can use the Diff.java program. Launching this will prompt you to choose files. You should select two files, not just one. To select two files use either command-click or control-click according to Mac/Windows to select the second file. The program will then tell you whether the two files you've chosen are the same or are different. On Linux and Macs there's a command-line tool called diff that does this, but the Java program can be used for purposes of the Huffman assignment.
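If you'd rather check from code, a byte-for-byte comparison is short to write. The sketch below is not the provided Diff.java; it is a minimal illustration using java.nio.file.Files that loads both files fully into memory, so it suits small and medium files. The class and method names are made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class SameFileSketch {
    // True when both files hold exactly the same bytes.
    static boolean sameContents(Path a, Path b) throws IOException {
        return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for foo.txt and foo.txt.unhf.
        Path original = Files.createTempFile("foo", ".txt");
        Path roundTrip = Files.createTempFile("foo", ".unhf");
        try {
            byte[] text = "go go gophers".getBytes(StandardCharsets.US_ASCII);
            Files.write(original, text);
            Files.write(roundTrip, text);
            System.out.println(sameContents(original, roundTrip) ? "same" : "different");
        } finally {
            Files.deleteIfExists(original);
            Files.deleteIfExists(roundTrip);
        }
    }
}
```

Comparing bytes rather than lines of text means the same check works for .jpg and .mp3 files.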