CS 200 Algorithms and Data Structures, Spring 2013 Programming Assignment #3

Compressing Data using Huffman Coding

Due on April 2 by 5:00PM

Objectives

In this assignment, you will implement classes for Huffman coding. You will write:
(1) An implementation of Huffman coding using a data structure
(2) A class to encode a string
Your system should provide a complete implementation of the interfaces and skeleton files that are provided.

Background: Huffman Coding

Huffman coding was developed by David Huffman in a term paper he wrote in 1951 while he was a graduate student at MIT. Huffman coding is a fundamental algorithm in data compression, the subject devoted to reducing the number of bits required to represent information. It is used extensively to compress bit strings representing text, and it also plays an important role in compressing audio and image files.

Based on the symbols and their frequencies, the goal is to construct a rooted binary tree where the symbols are the labels of the leaves. The algorithm begins with a forest of single-node trees, one per symbol, where the weight of each tree is the frequency of its symbol. At each step, we combine the two trees having the least total weight into a single tree by introducing a new root, placing the tree with larger weight as its left subtree and the tree with smaller weight as its right subtree. The weight of the new tree is the sum of the weights of its two subtrees. The algorithm completes when the forest has been reduced to a single tree.
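The merging procedure above can be sketched with a priority queue acting as the forest. This is only an illustration under stated assumptions: the class `HuffmanBuildSketch`, its nested `Node` type, and the `build` method are hypothetical names and are not the interfaces in the provided skeleton files. The `'*'` label for internal nodes follows Figure 1. When two trees have equal weight, this sketch breaks the tie arbitrarily.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative sketch only; names are hypothetical and are not the
// assignment's HuffmanTree or HuffmanTreeNode interfaces.
class HuffmanBuildSketch {

    static final class Node {
        final char symbol;   // '*' marks an internal node, as in Figure 1
        final int weight;    // frequency of a leaf, or sum of the two children
        final Node left;
        final Node right;

        Node(char symbol, int weight, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }
    }

    static Node build(char[] symbols, int[] weights) {
        // The forest: a min-heap of trees ordered by total weight.
        PriorityQueue<Node> forest =
                new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.weight));
        for (int i = 0; i < symbols.length; i++) {
            forest.add(new Node(symbols[i], weights[i], null, null));
        }
        // Repeatedly merge the two lightest trees until one tree remains.
        while (forest.size() > 1) {
            Node smaller = forest.poll();
            Node larger = forest.poll();
            // Larger weight becomes the left subtree, smaller the right.
            forest.add(new Node('*', smaller.weight + larger.weight, larger, smaller));
        }
        return forest.poll();
    }

    public static void main(String[] args) {
        // Frequencies taken from Table 1.
        Node root = build(new char[] {'A', 'B', 'E', 'G', 'K', 'M'},
                          new int[] {8, 10, 12, 15, 20, 35});
        System.out.println(root.weight);                                // 100
        System.out.println(root.left.weight + " " + root.right.weight); // 62 38
    }
}
```

With the Table 1 frequencies, every weight in the forest is distinct at every step, so the result does not depend on tie-breaking.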

Table 1 is an example of a document containing only six distinct characters. Frequency is the number of times the character appears in the document. Figure 1 depicts the process of building a Huffman tree from the information in Table 1. At the end of the process, each character has a Huffman code associated with it.

The decoding procedure starts at the root of the Huffman tree with the first bit in the stream. Each bit determines whether to go left or right in the tree: 0 selects the left child and 1 selects the right child. When you reach a leaf node, write the character stored in that leaf to the output stream. For the next bit in the bit stream, your algorithm restarts from the root node of the Huffman tree.

Figure 2 depicts an example of decoding. Starting from the root of this tree, the bit stream “111” selects the right child three times in a row and reaches the leaf node associated with the character “A”. The decoder then repeats these steps for the successive bits in the stream until it reaches the end of the stream.
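The decoding walk can be sketched as follows. The class `DecodeSketch`, its `Node` type, and the helper `tableOneTree` are hypothetical names used only for illustration; the tree is built by hand so that it yields the codes listed in Table 1. The sketch assumes the input consists only of the characters '0' and '1' and treats anything other than '0' as a 1.

```java
// Illustrative sketch only; names are hypothetical and are not the
// assignment's Decoder or HuffmanTreeNode interfaces.
class DecodeSketch {

    static final class Node {
        final char symbol;   // meaningful only for leaves
        final Node left;
        final Node right;

        Node(char symbol, Node left, Node right) {
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    static Node leaf(char c) {
        return new Node(c, null, null);
    }

    static Node inner(Node left, Node right) {
        return new Node('*', left, right);
    }

    // Hand-built tree matching the codes in Table 1
    // (M=00, G=010, E=011, K=10, B=110, A=111).
    static Node tableOneTree() {
        return inner(inner(leaf('M'), inner(leaf('G'), leaf('E'))),
                     inner(leaf('K'), inner(leaf('B'), leaf('A'))));
    }

    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node current = root;
        for (int i = 0; i < bits.length(); i++) {
            // 0 selects the left child, 1 selects the right child.
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current.isLeaf()) {
                out.append(current.symbol);
                current = root;   // restart from the root for the next bit
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(tableOneTree(), "11100011")); // AME
    }
}
```

The bit stream 11100011 splits into 111, 00, and 011, which Table 1 maps to A, M, and E.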

Task Description

Part 1. Build a Frequency Table

The first task is to build a frequency table. Please refer to the example in Table 1. At this point, you should fill in the first two columns of this table: Character and Frequency. The Huffman code is not known yet. In this table:
1) Only characters that appear in the input string are included.
2) Characters are case sensitive.
3) Characters are listed in the order of their appearance in the input string.
4) The minimum number of characters is 2.
5) The frequency refers to the number of appearances of the character. You do not need to normalize this number.
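The rules above can be sketched with a single pass over the input. This is a minimal illustration that uses a `LinkedHashMap` as a stand-in for the frequency table, because it preserves insertion order; the class `FrequencySketch` and its `count` method are hypothetical names, and the provided skeleton (`HuffmanFrequencyTable`, `TableItem`) may require a different structure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and are not the
// assignment's HuffmanFrequencyTable or TableItem interfaces.
class FrequencySketch {

    static Map<Character, Integer> count(String input) {
        // LinkedHashMap keeps keys in order of first appearance.
        Map<Character, Integer> table = new LinkedHashMap<>();
        for (char c : input.toCharArray()) {
            // Characters are compared exactly, so 'a' and 'A' are distinct.
            table.merge(c, 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(count("eeyjjjj")); // {e=2, y=1, j=4}
    }
}
```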

Character   Frequency   Huffman Code
A           8           111
B           10          110
E           12          011
G           15          010
K           20          10
M           35          00

Table 1. Frequency Table

Initial Forest: A,8  B,10  E,12  G,15  K,20  M,35
Step 1: combine B,10 and A,8 into *,18 (left B,10; right A,8). Forest: *,18  E,12  G,15  K,20  M,35
Step 2: combine G,15 and E,12 into *,27 (left G,15; right E,12). Forest: *,18  *,27  K,20  M,35
Step 3: combine K,20 and *,18 into *,38 (left K,20; right *,18). Forest: *,38  *,27  M,35
Step 4: combine M,35 and *,27 into *,62 (left M,35; right *,27). Forest: *,62  *,38
Step 5: combine *,62 and *,38 into *,100 (left *,62; right *,38). In the final tree, each edge to a left child is labeled 0 and each edge to a right child is labeled 1.

Figure 1. Huffman Coding of Symbols in Table 1.


[Figure: the Huffman tree for Table 1 (root *,100 with subtrees *,62 and *,38), with 0/1 edge labels. Encoded bit stream: 11100011]

Figure 2. Decoding Example

Part 2. Build a Huffman Coding Tree

Follow the algorithm described in the Background section and build a Huffman coding tree based on the information stored in the frequency table. (Example: Rosen 10.2, Example 5)

Part 3. List the Huffman Codes

Based on the Huffman coding tree built in Part 2, create the Huffman code for each character that appears in the input string. Please note that you DO NOT need to perform bit operations in this assignment. Please store the encoded information as a String object. For example, to represent the bit stream “11110000”, you can create a String object with the value “11110000”:

String huffmanCode = "11110000";

The Huffman code should be stored in the corresponding column of the frequency table. (Please refer to the “Huffman Code” column in Table 1.)
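One way to list the codes is a recursive traversal that appends "0" when descending to a left child and "1" when descending to a right child, recording the accumulated string at each leaf. The sketch below is an illustration under assumptions: `CodeListingSketch`, its `Node` type, and the `collect` and `tableOneTree` helpers are hypothetical names, and a `TreeMap` is used here only to get a predictable print order, not the order required for the frequency table.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; names are hypothetical and are not the
// assignment's HuffmanTree or HuffmanTreeNode interfaces.
class CodeListingSketch {

    static final class Node {
        final char symbol;   // meaningful only for leaves
        final Node left;
        final Node right;

        Node(char symbol, Node left, Node right) {
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    static Node leaf(char c) {
        return new Node(c, null, null);
    }

    static Node inner(Node left, Node right) {
        return new Node('*', left, right);
    }

    // Hand-built tree for the frequencies in Table 1.
    static Node tableOneTree() {
        return inner(inner(leaf('M'), inner(leaf('G'), leaf('E'))),
                     inner(leaf('K'), inner(leaf('B'), leaf('A'))));
    }

    // Walks the tree, storing the path to each leaf as a String of '0'/'1'.
    static void collect(Node node, String prefix, Map<Character, String> codes) {
        if (node.isLeaf()) {
            codes.put(node.symbol, prefix);
            return;
        }
        collect(node.left, prefix + "0", codes);
        collect(node.right, prefix + "1", codes);
    }

    static Map<Character, String> codesOf(Node root) {
        Map<Character, String> codes = new TreeMap<>();
        collect(root, "", codes);
        return codes;
    }

    public static void main(String[] args) {
        // Prints {A=111, B=110, E=011, G=010, K=10, M=00}
        System.out.println(codesOf(tableOneTree()));
    }
}
```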

Part 4. Encoding a String

Based on the Huffman codes generated in Part 3, your software should convert the input string into encoded bits. For each character in the input string, look up its bit stream in the frequency table and replace the character with that bit stream.
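The lookup-and-replace step can be sketched as below. A plain `Map` stands in for the frequency table, and the class `EncodeSketch` and its `encode` method are hypothetical names rather than the skeleton's `Encoder` interface. The codes in the usage example are the ones shown for the input `eeyjjjj` in the sample output of PA3_Test.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and are not the
// assignment's Encoder or HuffmanFrequencyTable interfaces.
class EncodeSketch {

    static String encode(String input, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : input.toCharArray()) {
            String code = codes.get(c);
            if (code == null) {
                throw new IllegalArgumentException("No code for character: " + c);
            }
            bits.append(code);   // replace the character with its bit stream
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('e', "10");
        codes.put('y', "11");
        codes.put('j', "0");
        System.out.println(encode("eeyjjjj", codes)); // 1010110000
    }
}
```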

Part 5. Decoding Bits

Decode the bit stream using the Huffman coding tree generated in Part 2.


Part 6. Testing Your Software

Test your software with the included testing program: PA3_Test.java

(1) PA3_Test
This program tests:
1) Frequency Table
2) Encoding
3) Decoding
It takes the input string to be encoded as its argument. If you want to include a space in your input string, please put quotation marks around the string. (This might work only on the command line, not in the Eclipse IDE.)

% java PA3_Test eeyjjjj
======
char frequency code
------
e 2 10
y 1 11
j 4 0
======

Encoded bit stream
1010110000
Total number of bits without Huffman coding: 112
Total number of bits with Huffman coding: 10

Decoded String: eeyjjjj
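The two totals in the sample output are consistent with counting 16 bits per character for the unencoded string (7 × 16 = 112) and one bit per character of the encoded String (10). The 16-bit figure is an inference from the sample numbers and matches the width of a Java `char`; it is not stated in the assignment. A minimal sketch of that arithmetic, with the hypothetical class name `BitCountSketch`:

```java
// Illustrative sketch only; the class and method names are hypothetical.
class BitCountSketch {

    // Assumes each input character costs Character.SIZE (16) bits.
    static int bitsWithoutHuffman(String input) {
        return input.length() * Character.SIZE;
    }

    // The encoded stream is a String of '0'/'1', so its length is the bit count.
    static int bitsWithHuffman(String encoded) {
        return encoded.length();
    }

    public static void main(String[] args) {
        System.out.println(bitsWithoutHuffman("eeyjjjj"));   // 112
        System.out.println(bitsWithHuffman("1010110000"));   // 10
    }
}
```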

When several characters occur the same number of times, there may be more than one valid set of codes. Similarly, a non-leaf node and a leaf node can have the same frequency, which can also lead to more than one valid set of codes.

% java PA3_Test "My Test works totally fine"
======
char frequency code
------
M 1 01111
y 2 1000
  4 000
T 1 01110
e 2 0011
s 2 0010
t 3 101
w 1 01101
o 2 0101
r 1 01100
k 1 1111
a 1 1110
l 2 0100
f 1 1101
i 1 1100
n 1 1001
======

Encoded bit stream
01111100000001110001100101010000110101010110011110010000101010110111100100010010000001101110010010011
Total number of bits without Huffman coding: 416
Total number of bits with Huffman coding: 101

Decoded String: My Test works totally fine

Deliverables

Submit a tarball of your Java source code including:
Decoder.java
Encoder.java
HuffmanFrequencyTable.java
HuffmanTree.java
HuffmanTreeNode.java
TableItem.java

Keep all of your source code in a single flat directory. The skeleton files are provided. Please do not modify PA3_Test.java or the interface files.

Note: You are required to work as a team on this assignment. You and your teammate should submit only ONE copy of the assignment. Please write the implementers’ names at the top of each source file.

Grading

This assignment will account for 5% of your final grade. The grading itself will be done on a 50-point scale.

Late Policy

Please check the late policy available from the course web page.
