Text Compression: Examples Huffman coding: go go gophers “abcde” in the different formats ASCII 3 bits Huffman Encodings ASCII: g 103 1100111 000 00 o 111 1101111 001 01 01100001011000100110001101100100… 13 ASCII: 8 bits/character p 112 1110000 010 1100 Fixed: Unicode: 16 bits/character h 104 1101000 011 1101 6 7 000001010011100 e 101 1100101 100 1110 0 1 g 3 4 r 114 1110010 101 1111 o Symbol ASCII Fixed Var. s 115 1110011 110 101 3 3 s * 2 2 length length 0 1 0 0 sp. 32 1000000 111 101 1 2 a 01100001 000 000 p h e r 0 1 0 1  0 Encoding uses : 1 1 1 1 b 01100010 001 11  0 left/1 right Var: a b c d e 6 c 01100011 010 01  How many bits? 37!! 4 000110100110 g o 1  Savings? Worth it? 0 2 2 d 01100100 011 001 3 3 3 e 01100101 100 10 0 1 0 1 p h e r c e b 0 1 1 1 1 1 s * 1 2 CPS 100 a d 8.1 CPS 100 8.2

Huffman Coding Building a Huffman tree

 D.A Huffman in early 1950’s  Begin with a forest of single-node trees (leaves)  Before compressing data, analyze the input stream  Each node/tree/leaf is weighted with character count  Represent data using variable length  Node stores two values: character and count  Variable length codes though Prefix codes  There are n nodes in forest, n is size of alphabet?  Each letter is assigned a codeword  Codeword is for a given letter is produced by traversing  Repeat until there is only one node left: root of tree the Huffman tree  Remove two minimally weighted trees from forest  Property: No codeword produced is the prefix of another  Create new tree with minimal trees as children,  Letters appearing frequently have short codewords, • New tree root's weight: sum of children (character ignored) while those that appear rarely have longer ones

 Huffman coding is optimal per-character coding method  Does this process terminate? How do we get minimal trees?  Remove minimal trees, hummm……

CPS 100 8.3 CPS 100 8.4 Building a tree Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

2

1 1 F C

11 6 5 5 4 4 3 3 3 3 2 2 2 2 2 1 1 1 11 6 5 5 4 4 3 3 3 3 2 2 2 2 2 2 1 CPS 100 8.5 CPS 100 8.6 I E N S M A B O T G D L R U P F C I E N S M A B O T G D L R U P

Building a tree Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

4

2 3 2 3 2 2 L R 1 1 1 2 1 1 1 2 C F P U C F P U

11 6 5 5 4 4 3 3 3 3 3 2 2 2 2 2 11 6 5 5 4 4 4 3 3 3 3 3 2 2 2 CPS 100 8.7 CPS 100 8.8 I E N S M A B O T G D L R I E N S M A B O T G D Building a tree Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

23 37 23

21 16 12 11 21 16 12 11

10 11 8 8 6 6 10 11 8 8 6 6 I I 5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3 E N S M A B E N S M A B 3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2 O T G D L R O T G D L R 1 1 1 2 1 1 1 2 C F P U C F P U

23 21 16 37 23 CPS 100 8.9 CPS 100 8.10

Building a tree Mary Shaw

 Software engineering and software architecture “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”  Tools for constructing 60 large software systems  Development is a small 37 23 piece of total cost, maintenance is larger, 21 16 12 11 depends on well-designed and developed techniques 10 11 8 8 6 6 I  Interested in computer 5 5 5 6 4 4 4 4 3 3 science, programming, E N S M A B curricula, and canoeing, 3 2 3 3 2 2 2 2 health-care costs O T G D L R  ACM Fellow, Alan Perlis 1 1 1 2 Professor of Compsci at CMU C F P U

60 CPS 100 8.11 CPS 100 8.12 Huffman Complexities Writing out to file

 How do we measure? Size of input file, size of alphabet  How do we go from characters to encodings?  Which is typically bigger?  Build Huffman tree  Root-to-leaf path generates encoding  Accumulating character counts: ______ How can we do this in O(1) time, though not really  Need way of writing bits out to file  Building the heap/ from counts ____  Platform dependent?  Initializing heap guaranteed  Complicated to write bits and read in same ordering  Building Huffman tree ____  Why?  See BitInputStream and BitOutputStream classes  Create table of encodings from tree ____  Depend on each other, bit ordering preserved  Why?  Write tree and compressed file _____  How do we know bits come from compressed file?  Store a magic number

CPS 100 8.13 CPS 100 8.14

Decoding a message Decoding a message

01100000100001001101 1100000100001001101

60 60

37 23 37 23

21 16 12 11 21 16 12 11

10 11 8 8 6 6 10 11 8 8 6 6 I I 5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3 E N S M A B E N S M A B 3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2 O T G D L R O T G D L R 1 1 1 2 1 1 1 2 C F P U C F P U

CPS 100 8.15 CPS 100 8.16 Decoding a message Decoding a message

100000100001001101 00000100001001101

60 60

37 23 37 23

21 16 12 11 21 16 12 11

10 11 8 8 6 6 10 11 8 8 6 6 I I 5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3 E N S M A B E N S M A B 3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2 O T G D L R O T G D L R 1 1 1 2 1 1 1 2 C F P U C F P U

CPS 100 8.17 CPS 100 8.18

Decoding a message Decoding a message

0000100001001101 1

60 60

37 23 37 23

21 16 12 11 21 16 12 11

10 11 8 8 6 6 10 11 8 8 6 6 I I 5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3 E N S M A B E N S M A B 3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2 O T G D L R O T G D L R 1 1 1 2 1 1 1 2 C F P U C F P U

CPS 100 G 8.19 CPS 100 GOOD 8.20 Decoding a message Huffman coding: go go gophers ASCII 3 bits Huffman g 103 1100111 000 ?? g o p h e r s * 01100000100001001101 o 111 1101111 001 ?? 3 3 1 1 1 1 1 2 p 112 1110000 010 60 h 104 1101000 011 2 2 3 e 101 1100101 100 37 23 r 114 1110010 101 p h e r s * s 115 1110011 110 1 1 1 1 1 21 16 12 11 sp. 32 1000000 111 2  choose two smallest weights 10 11 6 6 8 8  combine nodes + weights I  5 5 5 6 4 4 4 4 3 3 Repeat  Priority queue? 6 E N S M A B 4 3 2 3 3 2 2 2 2  Encoding uses tree: 2 2 g o  0 left/1 right O T G D L R 3 3 1 1 1 2  How many bits? p h e r C F P U 1 1 1 1

CPS 100 GOOD 8.21 CPS 100 8.22

Huffman Tree 2 Huffman Tree 2

 “A SIMPLE STRING TO BE ENCODED USING A  “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”  E.g. “ A SIMPLE” ⇔  E.g. “ A SIMPLE” ⇔ “10101101001000101001110011100000” “10101101001000101001110011100000”

CPS 100 8.23 CPS 100 8.24 Huffman Tree 2 Huffman Tree 2

 “A SIMPLE STRING TO BE ENCODED USING A  “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”  E.g. “ A SIMPLE” ⇔  E.g. “ A SIMPLE” ⇔ “10101101001000101001110011100000” “10101101001000101001110011100000”

CPS 100 8.25 CPS 100 8.26

Huffman Tree 2 Huffman Tree 2

 “A SIMPLE STRING TO BE ENCODED USING A  “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”  E.g. “ A SIMPLE” ⇔  E.g. “ A SIMPLE” ⇔ “10101101001000101001110011100000” “10101101001000101001110011100000”

CPS 100 8.27 CPS 100 8.28 Huffman Tree 2 Huffman Tree 2

 “A SIMPLE STRING TO BE ENCODED USING A  “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”  E.g. “ A SIMPLE” ⇔  E.g. “ A SIMPLE” ⇔ “10101101001000101001110011100000” “10101101001000101001110011100000”

CPS 100 8.29 CPS 100 8.30

Huffman Tree 2 Other methods

 “A SIMPLE STRING TO BE ENCODED USING A  MINIMAL NUMBER OF BITS”  Lempel-Ziv  E.g. “ A SIMPLE” ⇔  Build the coding table on the fly while reading “10101101001000101001110011100000” document  Coding table changes dynamically  Protocol between encoder and decoder so that everyone is always using the right coding scheme  Works well in practice (compress, , etc.)  More complicated methods  Burrows-Wheeler (bunzip2)  PPM statistical methods

CPS 100 8.31 CPS 100 8.32