
Huffman Assignment Intro

James Wei
Professor Peck
April 2, 2014

Outline
• Intro to Huffman
• The Huffman algorithm
• Classes of Huffman
• Understanding the design
• Implementation
• Analysis
• Grading
• Wrap-up

Intro to Huffman
• Huffman is a compression algorithm
• Works by examining characters in a file: those appearing more frequently are replaced with shortcuts, and those appearing less frequently with "longcuts"
• We can then rewrite a file with different bit encodings for each character, such that the overall length is shorter
• Of course, we must also store the bit mappings

Intro to Huffman
• You will be writing code to do the following:
  • Read a file and count the number of appearances of every character
  • Create a Huffman tree/encodings from the counts
  • Write a header that contains the Huffman tree data to the compressed file
  • Write a compressed file
  • Read the header to recreate the Huffman tree
  • Uncompress a file

Intro to Huffman
• A few definitions before we proceed:
  • Bit: a 0 or 1; you will be reading individual bits using the provided utility classes
  • Character: a chunk of 8 bits, which we will store in an int; do not use the byte data type in this assignment; sometimes referred to as a word
  • Huffman encoding: the sequence of bits that a character is mapped to and will be replaced with in the compressed file
  • PSEUDO_EOF: a character that indicates the end of file (EOF); it must be written to the compressed file and read back when uncompressing
  • Magic number: a special character indicating that a file is a Huffman-compressed file

Intro to Huffman
• A brief note on bits:
  • You will be working with individual bits here
  • We will consider a chunk of 8 bits to be a character, which could be a letter, number, whitespace, or symbol
  • The 8 bits can represent any number from 0-255; these numbers correspond with an ASCII encoding (see: http://www.asciitable.com/)
  • We will represent a character using its 8 bits stored in an int, which gives us a value of 0-255 that corresponds with an ASCII code

Intro to Huffman
• A brief note on bits:
  • For example, take the letter 'A'
  • 'A' has an ASCII code of 65
  • In binary, 65 == 0100 0001
  • When we read n bits, we treat those bits as the binary representation of some number, which is stored in an int
  • So if the character in the file is 'A' and we read in 8 bits, we will get an int equal to 65 (more on this later)
  • When we write, we will write *one* bit at a time

The Huffman Algorithm
• The algorithm rewrites bit encodings so that frequently used characters can be written in fewer bits
• For example, if I have a file that reads: "gggg"
• Then my bit representation of this is: 01100111 01100111 01100111 01100111
• I create a shortcut: "g" => 1
• And my new file is much shorter! "gggg" => 1111

The Huffman Algorithm
• How do we create encodings that do not conflict with each other?
• Create a Huffman tree!
• Overview of the process:
  • Count the number of appearances for every character in the file
  • Arrange the characters as leaf nodes in a tree, with the least frequent furthest from the root
  • Create the encoding for each character by tracing a path from the root to its node

The Huffman Algorithm
• Count the number of appearances for every character in the file
• Pretty straightforward…
• For this example we'll use these counts:

  Character  Count
  A          29
  B          14
  C          9
  D          17
  E          45
  F          11
  G          5

The Huffman Algorithm
• Arrange the characters as leaf nodes in a tree, with the least frequent furthest from the root
• Start by picking the lowest two counts and joining them together as sibling nodes
• Next, create a parent node for the two, and weight the parent as the combined weights of its children
• Add the parent node to the pool of nodes to choose from, then repeat until there is only one node left; that node is the root

The Huffman Algorithm
• Building the tree for these counts, step by step (each step merges the two lowest-weight nodes and weights the new parent with their combined counts):
  • G (5) + C (9) => parent with weight 14
  • F (11) + B (14) => 25
  • 14 (G/C) + D (17) => 31
  • 25 (F/B) + A (29) => 54
  • 31 + E (45) => 76
  • 54 + 76 => 130, which becomes the root

The Huffman Algorithm
• Create the encoding for each character by tracing a path to its node from the root
• Starting from the root, trace the path to each leaf
• Every time you go left, append a 0; every time you go right, append a 1
• The result at each leaf is its Huffman encoding (a short code sketch of this trace follows)
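A minimal sketch of that trace in Java, assuming the provided TreeNode exposes its value and its left/right children (the field and method names below are placeholders, not the assignment's exact API):

    // Walk the tree: append 0 when going left, 1 when going right,
    // and record the accumulated path once a leaf is reached.
    void buildEncodings(TreeNode node, String path, String[] encodings) {
        if (node.left == null && node.right == null) {
            encodings[node.value] = path;   // leaf: path is this character's encoding
            return;
        }
        buildEncodings(node.left, path + "0", encodings);
        buildEncodings(node.right, path + "1", encodings);
    }

    // Called once the tree is built, e.g.: buildEncodings(root, "", encodings);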

The Huffman Algorithm
• The finished encodings for our example:

  Character  Encoding
  A          01
  B          001
  C          1001
  D          101
  E          11
  F          000
  G          1000

Classes of Huffman
• The relevant classes of this assignment:
  • Huff: main class to run the Huffman program
  • SimpleHuffProcessor: what you will complete
  • HuffMark: benchmarking program to be used for your analysis
  • IHuffConstants: contains constants that you will use in SimpleHuffProcessor
  • Diff: utility tool that compares two files and returns whether they are byte-equivalent
  • TreeNode: a node in the Huffman tree; implements Comparable and compares by weight
  • Bit(In/Out)putStream: two classes that read and write bit-by-bit; they extend InputStream/OutputStream

Understanding the design
• The only class you need to modify is SimpleHuffProcessor, which implements IHuffProcessor
• Within this class are three methods:
  • preprocessCompress: called before a compression; reads a file and creates Huffman encodings
  • compress: uses the Huffman encodings to write a compressed file
  • uncompress: reads a compressed file, uses the header to rebuild the Huffman encodings, and decodes the file

Implementing preprocess
• The first method we need to implement: int preprocessCompress(InputStream) throws IOException
• This method must:
  • Find the weights of all characters in the file
  • Build a Huffman tree from the weights
  • Traverse the Huffman tree to determine encodings for every character
  • Find and return the total number of bits saved

Implementing compress
• The next method we need to implement: int compress(InputStream, OutputStream, boolean) throws IOException
• This method must:
  • Check that compression actually saves bits (unless force compression is true)
  • Write a magic number to the beginning of the file
  • Write the weight of each character to the file
  • Read the original file and write each character's Huffman encoding to the new file
  • Write a PSEUDO_EOF to the end of the file
  • Return the number of bits written to the file

Implementing uncompress
• The last method we need to implement: int uncompress(InputStream, OutputStream) throws IOException
• This method must:
  • Check for the existence of the magic number
  • Read the weights from the header and rebuild the Huffman tree
  • Read one bit at a time, traversing down the tree until we hit a leaf, and write the character represented by the leaf to the file
  • Return the number of bits written to the file
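Put together, the class you edit looks roughly like the skeleton below. This is only a sketch of the shape implied by the three signatures above; check IHuffProcessor itself for the exact declarations (it may require more than these three methods).

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class SimpleHuffProcessor implements IHuffProcessor {

        public int preprocessCompress(InputStream in) throws IOException {
            // count weights, build the Huffman tree, create encodings, return bits saved
            return 0;   // placeholder
        }

        public int compress(InputStream in, OutputStream out, boolean force) throws IOException {
            // write magic number, header, encoded characters, and PSEUDO_EOF; return bits written
            return 0;   // placeholder
        }

        public int uncompress(InputStream in, OutputStream out) throws IOException {
            // check magic number, rebuild the tree from the header, decode; return bits written
            return 0;   // placeholder
        }
    }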
Preprocess in detail
• How do we determine character weights?
• We need to use the InputStream we are given
• Create a new BitInputStream using the InputStream as an argument to its constructor
• Create a data structure to store the weights (array, map, your choice)
• Read IHuffConstants.BITS_PER_WORD bits using the BitInputStream's readBits method; this method returns an int storing the value of the bits
• Update the weight of the returned character

Preprocess in detail
• Some pseudocode to get you started:

    preprocessCompress(...) {
        BitInputStream bis = new BitInputStream(the given InputStream);
        int[] weights = new int[one slot per character value];
        int current = bis.readBits(IHuffConstants.BITS_PER_WORD);
        while (successfully read bits) {
            // update the weight for current
            // read the next character
        }
    }

Preprocess in detail
• How do we build the Huffman tree with code?
• After counting the weights of each character, create a TreeNode for each one
• What data structure should we store our TreeNodes in so that we can easily obtain the two with the smallest weights each iteration?
• When are we done building the tree?

Preprocess in detail
• Some pseudocode to get you started:

    preprocessCompress(...) {
        PriorityQueue<TreeNode> pq = new PriorityQueue<>();
        for (every character with a nonzero weight) {
            pq.add(new TreeNode(character, weight));
        }
        while (pq has more than one node) {
            TreeNode left = pq.remove();
            TreeNode right = pq.remove();
            // create a new TreeNode as their parent, add it back to pq
        }
        // the one node left in pq is the root of the Huffman tree
    }
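Remember that preprocessCompress must also return the number of bits saved. One possible accounting, once the weights and the encodings (from the traversal sketched earlier) are known; the exact bookkeeping for the magic number, header, and PSEUDO_EOF is defined by the assignment spec and is only noted in comments here:

    // Original cost: every character takes BITS_PER_WORD bits in the uncompressed file.
    // Compressed cost: every character takes the length of its Huffman encoding.
    int originalBits = 0;
    int compressedBits = 0;
    for (int ch = 0; ch < weights.length; ch++) {
        if (weights[ch] > 0) {
            originalBits   += weights[ch] * IHuffConstants.BITS_PER_WORD;
            compressedBits += weights[ch] * encodings[ch].length();
        }
    }
    // Also charge the compressed side for the magic number, the header,
    // and the PSEUDO_EOF encoding before computing the savings.
    int bitsSaved = originalBits - compressedBits;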

Compress in detail
• The flow of compress:
  • Check for positive bits saved
  • Write the magic number
  • Write the weights of each character
  • Then, repeatedly attempt to read one character from the file (a sketch of this loop follows below):
    • If a character was read: look up its encoding and write the encoding
    • If not: write PSEUDO_EOF and return the number of bits written
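A sketch of that read/encode/write loop, assuming the encodings built during preprocessing and assuming readBits signals the end of the input by returning -1 (the variable names here are hypothetical):

    int ch = bitsIn.readBits(IHuffConstants.BITS_PER_WORD);
    while (ch != -1) {                       // assumed end-of-input signal
        String code = encodings[ch];         // encoding created in preprocessCompress
        for (int i = 0; i < code.length(); i++) {
            bitsOut.writeBits(1, code.charAt(i) - '0');   // write each 0/1 one bit at a time
        }
        ch = bitsIn.readBits(IHuffConstants.BITS_PER_WORD);
    }
    // After the loop, write the PSEUDO_EOF encoding the same way.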

Compress in detail
• If you are interested in extra credit…
  • The header you write is simply a list of all character weights
  • For extra credit, additionally write code that creates the header from a preorder traversal of the tree, instead of the weights, so that uncompress can build the tree faster
  • For full credit you need *BOTH* header implementations (but don't write both to the compressed file!)
  • You will also need to write about the differences between the header implementations in your analysis to get full points for the extra credit

Uncompress in detail
• The flow of uncompress:
  • Check the magic number
  • Read the weights from the header
  • Build the tree; set current to the root
  • Then, repeatedly read a single bit from the input stream (sketched below):
    • Zero: move to the left child; one: move to the right child
    • If the node is a leaf and its value is PSEUDO_EOF: return the number of bits written
    • If the node is a leaf otherwise: write its value to the file and set current back to the root
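The decoding loop from this flow, sketched in Java. It assumes placeholder TreeNode field names (value, left, right), a PSEUDO_EOF constant defined in IHuffConstants, and that readBits returns -1 once the input runs out:

    TreeNode current = root;                               // root of the rebuilt tree
    while (true) {
        int bit = bitsIn.readBits(1);                      // read a single bit
        if (bit == -1) {
            // ran out of bits before seeing PSEUDO_EOF: treat as a corrupt file
            break;
        }
        current = (bit == 0) ? current.left : current.right;
        if (current.left == null && current.right == null) {      // reached a leaf
            if (current.value == IHuffConstants.PSEUDO_EOF) {
                break;                                             // done decoding
            }
            bitsOut.writeBits(IHuffConstants.BITS_PER_WORD, current.value);
            current = root;                                        // start over from the root
        }
    }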

BitInputStream/BitOutputStream
• You don't need to know the inner workings of these classes, just how to use them
• BitInputStream: mainly you will use int readBits(int)
  • The single argument is how many bits to read
• BitOutputStream: mainly you will use int writeBits(int, int)
  • The first argument is how many bits to write
  • The second argument is what value to write

Analysis
• It's the last assignment, so of course there's an analysis part!
• Run HuffMark on both the calgary and the waterloo directory; you may want to modify HuffMark to display more data (focus on the compress method, for the most part)
• You may also want to modify HuffMark to not skip over files with the .hf extension; you will want to compress already-compressed files for your analysis!

Analysis
• Once you're comfortable with using HuffMark, answer the following in your write-up:
  • Which compresses more, binary or text files?
  • How much additional compression do you get by compressing an already compressed file? When does this become ineffective?
  • Can you design a file that should compress a lot? When is it no longer worthwhile to keep compressing that file?
  • If you did the extra credit, compare the performance and effectiveness of your two header implementations

Grading
• Your grade is based on:
  • 25% - compressing/decompressing text files
  • 25% - compressing/decompressing binary files
  • 20% - robustness, basic error handling
    • Examples include not compressing if the bits saved is negative (unless force compression is active), and terminating if no magic number/pseudo-EOF is found
    • In cases where you end decompression early, do not worry if the file has already been created and is not deleted; you will not be penalized for that
  • 10% - coding style and documentation
  • 20% - full and complete analysis

Wrap-up
• Recommended plan of attack:
  • Read over the assignment write-up; look at the snarf code and understand the basic flow of the program, as well as what you will be adding
  • Implement preprocessCompress
  • Test by creating a short dummy file and running preprocess on it; you can check your tree for correctness by writing a brief recursive algorithm to print out its preorder traversal

Wrap-up
• Recommended plan of attack:
  • Implement compress
  • Test by compressing some files and checking that they have shrunk in size; try creating a file that is very short, so that the compressed file will be larger than the original. Does your program detect this and abort compression?
  • If you open a .hf file in a text editor, you should see a lot of gibberish
  • There is not too much testing you can really do here without uncompress, unfortunately

Wrap-up
• Recommended plan of attack:
  • Implement uncompress
  • Test by uncompressing some .hf files produced by your compress method; check the following:
    • Does your .unhf file match the original? Use Diff to check that they are exactly the same.
    • If you try to uncompress a file that was NOT compressed with your code, does uncompress correctly detect the missing magic number and abort?
    • Comment out your code that adds the pseudo-EOF to the end of the .hf file; does uncompress detect this and abort?

Good luck!

Start early!