Huffman Trees
Total Page:16
File Type:pdf, Size:1020Kb
Data compression Huffman Trees Compression reduces the size of a file: ・To save space when storing it. Greedy Algorithm for Data Compression ・To save time when transmitting it. ・Most files have lots of redundancy. Tyler Moore Who needs compression? ・Moore's law: # transistors on a chip doubles every 18–24 months. CS 2123, The University of Tulsa ・Parkinson's law: data expands to fill space available. ・Text, images, sound, video, … “ Everyday, we create 2.5 quintillion bytes of data—so much that Some slides created by or adapted from Dr. Kevin Wayne. For more information see 90% of the data in the world today has been created in the last https://www.cs.princeton.edu/courses/archive/fall12/cos226/lectures.php two years alone. ” — IBM report on big data (2011) Basic concepts ancient (1950s), best technology recently developed. 3 Applications Lossless compression and expansion Generic file compression. Message. Binary data B we want to compress. ・Files: GZIP, BZIP, 7z. Compress. Generates a "compressed" representation C (B). Archivers: PKZIP. Expand. Reconstructs original bitstream B. ・ uses fewer bits (you hope) ・File systems: NTFS, HFS+, ZFS. Multimedia. Compress Expand bitstream B compressed version C(B) original bitstream B ・Images: GIF, JPEG. 0110110101... 1101011111... 0110110101... ・Sound: MP3. ・Video: MPEG, DivX™, HDTV. Basic model for data compression Communication. ・ITU-T T4 Group 3 Fax. ・V.42bis modem. Compression ratio. Bits in C (B) / bits in B. ・Skype. Ex. 50–75% or better compression ratio for natural language. Databases. Google, Facebook, .... 4 5 Rdenudcany in Enlgsih lnagugae Variable-length codes Q. How mcuh rdenudcany is in the Enlgsih lnagugae? Use different number of bits to encode different chars. “ ... randomising letters in the middle of words [has] little or no Ex. Morse code: • • • − − − • • • effect on the ability of skilled readers to understand the text. This is easy to denmtrasote. In a pubiltacion of New Scnieitst you Issue. Ambiguity. could ramdinose all the letetrs, keipeng the first two and last two SOS ? the same, and reibadailty would hadrly be aftcfeed. My ansaylis V7 ? did not come to much beucase the thoery at the time was for IAMIE ? shape and senqeuce retigcionon. Saberi's work sugsegts we may have some pofrweul palrlael prsooscers at work. The resaon for EEWNI ? this is suerly that idnetiyfing coentnt by paarllel prseocsing speeds up regnicoiton. We only need the first and last two letetrs to spot chganes in meniang. ” — Graham Rawlinson In practice. Use a medium gap to separate codewords. codeword for S is a prefix of codeword for V A. Quite a bit. 14 19 Variable-length codes Prefix-free codes: trie representation Q. How do we avoid ambiguity? Q. How to represent the prefix-free code? A. Ensure that no codeword is a prefix of another. A. A binary trie! Codeword table Trie representation ・Chars in leaves. Codeword table Trie representation key value key value Ex 1. Fixed-length code. ! 101 0 1 ・Codeword is path from root to leaf. ! 101 0 1 A 0 A A 0 A Ex 2. Append special stop char to each codeword.B 1111 0 1 B 1111 0 1 C 110 C 110 Ex 3. General prefix-free code. D 100 0 1 0 1 D 100 0 1 0 1 R 1110 D ! C R 1110 D ! C 0 1 0 1 R B R B Compressed bitstring Compressed bitstring 011111110011001000111111100101 30 bits 011111110011001000111111100101 30 bits A B RA CA DA B RA ! A B RA CA DA B RA ! Codeword table Trie representation Codeword table Trie representation Codeword table Trie representation Codeword table Trie representation key value key value key value key value 0 1 0 1 ! 101 ! 101 0 1 ! 101 ! 101 0 1 A 0 A A 11 A 0 A A 11 0 1 0 1 B 1111 B 00 0 1 0 1 B 1111 B 00 0 1 0 1 C 110 C 010 C 110 C 010 0 1 0 1 B A 0 1 0 1 B A D 100 D 100 0 1 0 1 D 100 D 100 0 1 0 1 R 1110 D ! C R 1110 D ! C 0 1 R 011 C R D ! 0 1 R 011 C R D ! R B R B Compressed bitstring Compressed bitstring Compressed bitstring Compressed bitstring 011111110011001000111111100101 30 bits 11000111101011100110001111101 29 bits 011111110011001000111111100101 30 bits 11000111101011100110001111101 29 bits A B RA CA DA B RA ! A B R A C A D A B R A ! A B RA CA DA B RA ! A B R A C A D A B R A ! Two pre!x-free codes Two pre!x-free codes Codeword table Trie representation 20 Codeword table Trie representation 21 key value key value ! 101 0 1 ! 101 0 1 A 11 A 11 B 00 0 1 0 1 B 00 0 1 0 1 C 010 B A C 010 B A D 100 0 1 0 1 D 100 0 1 0 1 R 011 C R D ! R 011 C R D ! Compressed bitstring Compressed bitstring 11000111101011100110001111101 29 bits 11000111101011100110001111101 29 bits A B R A C A D A B R A ! A B R A C A D A B R A ! Two pre!x-free codes Two pre!x-free codes Average weighted code length Prefix-free codes: compression and expansion Compression. ・Method 1: start at leaf; follow path up to the root; print bits in reverse. ・Method 2: create ST of key-value pairs.Codeword table Trie representation key value ! 101 0 1 A 0 A Expansion. B 1111 0 1 C 110 Definition Start at root. D 100 0 1 0 1 ・ R 1110 D ! C 0 1 Given a set of symbols s 2 S and corresponding frequencies fs where Go left if bit is 0; go right if 1. ・ R B P s2S fs = 1, the average weighted code length using a binary trie is ・If leaf node, print char and return to root.Compressed bitstring 011111110011001000111111100101 30 bits P f · Depth(s). A B RA CA DA B RA ! s2S s Codeword table Trie representation Codeword table Trie representation key value key value 0 1 ! 101 ! 101 0 1 A 0 A A 11 0 1 B 1111 B 00 0 1 0 1 C 110 C 010 0 1 0 1 B A D 100 D 100 0 1 0 1 R 1110 D ! C 0 1 R 011 C R D ! R B Compressed bitstring Compressed bitstring 011111110011001000111111100101 30 bits 11000111101011100110001111101 29 bits A B RA CA DA B RA ! A B R A C A D A B R A ! Two pre!x-free codes Codeword table Trie representation 22 key value 2 / 10 ! 101 0 1 A 11 B 00 0 1 0 1 C 010 B A D 100 0 1 0 1 R 011 C R D ! Compressed bitstring 11000111101011100110001111101 29 bits A B R A C A D A B R A ! Shannon-Fano Codes Exercise Shannon-FanoTwo pre codes!x-free codes Q. How to find best prefix-free code? Shannon-Fano algorithm: ・Partition symbols S into two subsets S0 and S1 of (roughly) equal freq. ・Codewords for symbols in S0 start with 0; for symbols in S1 start with 1. ・Recur in S0 and S1. char freq encoding char freq encoding A 5 0... B 2 1... C 1 0... D 1 1... R 2 1... S0 = codewords starting with 0 ! 1 1... S1 = codewords starting with 1 Problem 1. How to divide up symbols? Problem 2. Not optimal! 27 3 / 10 Procedure for Creating Huffman Tree and Codes (RLW) Huffman Code Examples 1 Initialize each symbol into a one-node tree with weight corresponding to the probability of the symbol's occurrence 2 REPEAT a. Select the two trees with smallest weights (break ties randomly). b. Combine two trees into one tree whose root is the sum of weights of two trees UNTIL one tree remains 3 Assign 0 to left edge and 1 to right edge in tree. 4 Huffman code is the binary value constructed from the path from root to leaf 4 / 10 5 / 10 Huffman Coding: Implementation Huffman Coding: Implementation We will use nested lists to represent the Huffman tree structure in Python To efficiently implement Huffman codes, must use a priority queue Place trees onto the queue with associated frequency Add merged trees onto the priority queue with updated frequencies 6 / 10 7 / 10 Huffman Coding: Implementing the Tree Huffman Coding: Implementing the Codes def codes(tree , prefix = ""): i f len ( t r e e ) == 1 : from heapq import heapify , heappush, heappop return [tree, prefix] def huffman(seq, frq): return codes(tree [0] , prefix+"0")+ n codes(tree [1] , prefix+"1") t r e e s = l i s t ( zip ( frq , seq ) ) heapify(trees) # A min−heap based on freq def codesd(tr ): while len ( t r e e s ) > 1 : # Until all are combined cmap = fg fa, a = heappop(trees) # Get the two smallest trees def codesh(tree , prefix = ""): fb, b = heappop(trees) i f len ( t r e e ) == 1 : heappush(trees , (fa+fb, [a, b])) # Combine and re−add cmap[ tree]=prefix return t r e e s [ 0 ] [ − 1 ] e l s e : codesh(tree [0] , prefix+"0") codesh(tree [1] , prefix+"1") codesh(tr ,"") return cmap 8 / 10 9 / 10 Huffman Coding: Implementation seq = "abcdefghi" frq = [4,5,6,9,11,12,15,16,20] htree = huffman(seq,frq) p r i n t h t r e e p r i n t codes(htree) ch = codesd(htree) p r i n t ch text = "abbafabgee" p r i n t text , " encodes to : " p r i n t "".join([ch[x] f o r x in t e x t ] ) """ [['i ',[['a','b'],'e ']],[['f','g'],[['c','d'],'h']]] ['i ','00','a','0100','b','0101','e','011','f ','100', 'g','101','c','1100','d','1101','h','111'] f 'a':'0100','c':'1100','b':'0101','e':'011', 'd':'1101','g':'101','f ':'100','i ':'00','h':'111' g abbafabgee encodes to : 010001010101010010001000101101011011 10 / 10.