Text Compression: Examples Huffman Coding

Text Compression: Examples Huffman Coding

Text Compression: Examples Huffman coding: go go gophers “abcde” in the different formats ASCII 3 bits Huffman Encodings ASCII: g 103 1100111 000 00 o 111 1101111 001 01 01100001011000100110001101100100… 13 ASCII: 8 bits/character p 112 1110000 010 1100 Fixed: Unicode: 16 bits/character h 104 1101000 011 1101 6 7 000001010011100 e 101 1100101 100 1110 0 1 g 3 4 r 114 1110010 101 1111 o Symbol ASCII Fixed Var. s 115 1110011 110 101 3 3 s * 2 2 length length 0 1 0 0 sp. 32 1000000 111 101 1 2 a 01100001 000 000 p h e r 0 1 0 1 0 Encoding uses tree: 1 1 1 1 b 01100010 001 11 a b c d e 0 left/1 right Var: 6 c 01100011 010 01 How many bits? 37!! 4 000110100110 g o 1 Savings? Worth it? 0 2 2 d 01100100 011 001 3 3 3 e 01100101 100 10 0 1 0 1 p h e r c e b 0 1 1 1 1 1 s * 1 2 CPS 100 a d 8.1 CPS 100 8.2 Huffman Coding Building a Huffman tree D.A Huffman in early 1950’s Begin with a forest of single-node trees (leaves) Before compressing data, analyze the input stream Each node/tree/leaf is weighted with character count Represent data using variable length codes Node stores two values: character and count Variable length codes though Prefix codes There are n nodes in forest, n is size of alphabet? Each letter is assigned a codeword Codeword is for a given letter is produced by traversing Repeat until there is only one node left: root of tree the Huffman tree Remove two minimally weighted trees from forest Property: No codeword produced is the prefix of another Create new tree with minimal trees as children, Letters appearing frequently have short codewords, • New tree root's weight: sum of children (character ignored) while those that appear rarely have longer ones Huffman coding is optimal per-character coding method Does this process terminate? How do we get minimal trees? Remove minimal trees, hummm…… CPS 100 8.3 CPS 100 8.4 Building a tree Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 2 1 1 F C 11 6 5 5 4 4 3 3 3 3 2 2 2 2 2 1 1 1 11 6 5 5 4 4 3 3 3 3 2 2 2 2 2 2 1 CPS 100 8.5 CPS 100 8.6 I E N S M A B O T G D L R U P F C I E N S M A B O T G D L R U P Building a tree Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 4 2 3 2 3 2 2 L R 1 1 1 2 1 1 1 2 C F P U C F P U 11 6 5 5 4 4 3 3 3 3 3 2 2 2 2 2 11 6 5 5 4 4 4 3 3 3 3 3 2 2 2 CPS 100 8.7 CPS 100 8.8 I E N S M A B O T G D L R I E N S M A B O T G D Building a tree Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 23 37 23 21 16 12 11 21 16 12 11 10 11 8 8 6 6 10 11 8 8 6 6 I I 5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3 E N S M A B E N S M A B 3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2 O T G D L R O T G D L R 1 1 1 2 1 1 1 2 C F P U C F P U 23 21 16 37 23 CPS 100 8.9 CPS 100 8.10 Building a tree Mary Shaw Software engineering and software architecture “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” Tools for constructing 60 large software systems Development is a small 37 23 piece of total cost, maintenance is larger, 21 16 12 11 depends on well-designed and developed techniques 10 11 8 8 6 6 I Interested in computer 5 5 5 6 4 4 4 4 3 3 science, programming, E N S M A B curricula, and canoeing, 3 2 3 3 2 2 2 2 health-care costs O T G D L R ACM Fellow, Alan Perlis 1 1 1 2 Professor of Compsci at CMU C F P U 60 CPS 100 8.11 CPS 100 8.12 Huffman Complexities Writing code out to file How do we measure? Size of input file, size of alphabet How do we go from characters to encodings? Which is typically bigger? Build Huffman tree Root-to-leaf path generates encoding Accumulating character counts: ______ How can we do this in O(1) time, though not really Need way of writing bits out to file Building the heap/priority queue from counts ____ Platform dependent? Initializing heap guaranteed Complicated to write bits and read in same ordering Building Huffman tree ____ Why? See BitInputStream and BitOutputStream classes Create table of encodings from tree ____ Depend on each other, bit ordering preserved Why? Write tree and compressed file _____ How do we know bits come from compressed file? Store a magic number CPS 100 8.13 CPS 100 8.14 Decoding a message Decoding a message 01100000100001001101 1100000100001001101 60 60 37 23 37 23 21 16 12 11 21 16 12 11 10 11 8 8 6 6 10 11 8 8 6 6 I I 5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3 E N S M A B E N S M A B 3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2 O T G D L R O T G D L R 1 1 1 2 1 1 1 2 C F P U C F P U CPS 100 8.15 CPS 100 8.16 Decoding a message Decoding a message 100000100001001101 00000100001001101 60 60 37 23 37 23 21 16 12 11 21 16 12 11 10 11 8 8 6 6 10 11 8 8 6 6 I I 5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3 E N S M A B E N S M A B 3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2 O T G D L R O T G D L R 1 1 1 2 1 1 1 2 C F P U C F P U CPS 100 8.17 CPS 100 8.18 Decoding a message Decoding a message 0000100001001101 1 60 60 37 23 37 23 21 16 12 11 21 16 12 11 10 11 8 8 6 6 10 11 8 8 6 6 I I 5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3 E N S M A B E N S M A B 3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2 O T G D L R O T G D L R 1 1 1 2 1 1 1 2 C F P U C F P U CPS 100 G 8.19 CPS 100 GOOD 8.20 Decoding a message Huffman coding: go go gophers ASCII 3 bits Huffman g 103 1100111 000 ?? g o p h e r s * 01100000100001001101 o 111 1101111 001 ?? 3 3 1 1 1 1 1 2 p 112 1110000 010 60 h 104 1101000 011 2 2 3 e 101 1100101 100 37 23 r 114 1110010 101 p h e r s * s 115 1110011 110 1 1 1 1 1 21 16 12 11 sp. 32 1000000 111 2 choose two smallest weights 10 11 6 6 8 8 combine nodes + weights I 5 5 5 6 4 4 4 4 3 3 Repeat Priority queue? 6 E N S M A B 4 3 2 3 3 2 2 2 2 Encoding uses tree: 2 2 g o 0 left/1 right O T G D L R 3 3 1 1 1 2 How many bits? p h e r C F P U 1 1 1 1 CPS 100 GOOD 8.21 CPS 100 8.22 Huffman Tree 2 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” ⇔ E.g. “ A SIMPLE” ⇔ “10101101001000101001110011100000” “10101101001000101001110011100000” CPS 100 8.23 CPS 100 8.24 Huffman Tree 2 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS” E.g.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    8 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us