Data Compression
Reference: Chapter 22, Algorithms in C, 2nd Edition, Robert Sedgewick.
Reference: Introduction to Data Compression, Guy Blelloch.

Robert Sedgewick and Kevin Wayne • Copyright © 2006 • http://www.Princeton.EDU/~cos226


Data Compression

Compression reduces the size of a file:
- To save space when storing it.
- To save time when transmitting it.
- Most files have lots of redundancy.

Who needs compression?
- Moore's law: the number of transistors on a chip doubles every 18-24 months.
- Parkinson's law: data expands to fill space available.
- Text, images, sound, video, …

"All of the books in the world contain no more information than is broadcast as video in a single large American city in a single year. Not all bits have equal value." -Carl Sagan

Basic concepts ancient (1950s), best technology recently developed.


Applications of Data Compression

Generic file compression.
- Files: GZIP, BZIP, BOA.
- Archivers: PKZIP.
- File systems: NTFS.

Multimedia.
- Images: GIF, JPEG.
- Sound: MP3.
- Video: MPEG, DivX™, HDTV.

Communication.
- ITU-T T4 Group 3 Fax.
- V.42bis modem.

Databases. Google.


Encoding and Decoding

Message. Binary data M we want to compress.
Encode. Generate a "compressed" representation C(M) that hopefully uses fewer bits.
Decode. Reconstruct original message or some approximation M'.

    M -> Encoder -> C(M) -> Decoder -> M'

Compression ratio. Bits in C(M) / bits in M.

Lossless. M = M', 50-75% or lower.
Ex. Natural language, source code, executables.

Lossy. M ≠ M', 10% or lower.
Ex. Images, sound, video.


Ancient Ideas

Ancient ideas.
- Braille.
- Morse code.
- Natural languages.
- Mathematical notation.
- Decimal number system.

"Poetry is the art of lossy data compression."


Natural Encoding

Natural encoding. (19 × 51) + 6 = 975 bits.
(The extra 6 bits are needed to encode the number of characters per line.)

    000000000000000000000000000011111111111111000000000
    000000000000000000000000001111111111111111110000000
    000000000000000000000001111111111111111111111110000
    000000000000000000000011111111111111111111111111000
    000000000000000000001111111111111111111111111111110
    000000000000000000011111110000000000000000001111111
    000000000000000000011111000000000000000000000011111
    000000000000000000011100000000000000000000000000111
    000000000000000000011100000000000000000000000000111
    000000000000000000011100000000000000000000000000111
    000000000000000000011100000000000000000000000000111
    000000000000000000001111000000000000000000000001110
    000000000000000000000011100000000000000000000111000
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011000000000000000000000000000000000000000000000011

    19-by-51 raster of letter 'q' lying on its side


Run-Length Encoding

Run-length encoding (RLE).
- Exploit long runs of repeated characters.
- Binary alphabet: runs alternate between 0 and 1; output counts.
- "File inflation" possible if runs are short.

Applications.
- JPEG.
- ITU-T T4 fax machines (black and white graphics).

Natural encoding. (19 × 51) + 6 = 975 bits.
Run-length encoding. (63 × 6) + 6 = 384 bits (63 6-bit run lengths).

    raster row                                            run lengths (RLE)
    000000000000000000000000000011111111111111000000000   28 14 9
    000000000000000000000000001111111111111111110000000   26 18 7
    000000000000000000000001111111111111111111111110000   23 24 4
    000000000000000000000011111111111111111111111111000   22 26 3
    000000000000000000001111111111111111111111111111110   20 30 1
    000000000000000000011111110000000000000000001111111   19 7 18 7
    000000000000000000011111000000000000000000000011111   19 5 22 5
    000000000000000000011100000000000000000000000000111   19 3 26 3
    000000000000000000011100000000000000000000000000111   19 3 26 3
    000000000000000000011100000000000000000000000000111   19 3 26 3
    000000000000000000011100000000000000000000000000111   19 3 26 3
    000000000000000000001111000000000000000000000001110   20 4 23 3 1
    000000000000000000000011100000000000000000000111000   22 3 20 3 3
    011111111111111111111111111111111111111111111111111   1 50
    011111111111111111111111111111111111111111111111111   1 50
    011111111111111111111111111111111111111111111111111   1 50
    011111111111111111111111111111111111111111111111111   1 50
    011111111111111111111111111111111111111111111111111   1 50
    011000000000000000000000000000000000000000000000011   1 2 46 2

    19-by-51 raster of letter 'q' lying on its side, with the run lengths of each row


Fixed Length Coding

Fixed length encoding.
- Use same number of bits for each symbol.
- N symbols ⇒ ⌈lg N⌉ bits per symbol.

7-bit ASCII encoding

    char   dec   encoding
    NUL      0   0000000
    …        …   …
    a       97   1100001
    b       98   1100010
    c       99   1100011
    d      100   1100100
    …        …   …
    ~      126   1111110
    DEL    127   1111111

    a       b       r       a       c       a       d       a       b       r       a
    1100001 1100010 1110010 1100001 1100011 1100001 1100100 1100001 1100010 1110010 1100001

    7 × 11 = 77 bits

3-bit abracadabra encoding

    char   encoding
    a      000
    b      001
    c      010
    d      011
    r      100

    a   b   r   a   c   a   d   a   b   r   a
    000 001 100 000 010 000 011 000 001 100 000

    3 × 11 = 33 bits


Variable Length Encoding

Variable-length encoding. Use different number of bits to encode different characters.

Ex. Morse code.

Ambiguity. Without separators, • • • – – – • • • can be decoded as any of:
- SOS
- IAMIE
- EEWNI
- V7


Uniquely Decodable Codes

Variable-length encoding. Use different number of bits to encode different characters.

Q. How do we avoid ambiguity?
A1. Append special stop symbol to each codeword.
A2. Ensure no encoding is a prefix of another (for example, 101 is a prefix of 1011).

    char   encoding
    a      0
    b      111
    c      1011
    d      100
    r      110
    !      1010

    a b   r   a c    a d   a b   r   a !
    0 111 110 0 1011 0 100 0 111 110 0 1010

    variable length coding: 28 bits


ITU-T T4 Group 3 Fax

Group 3 fax. Transmit image comprised of up to 1728 pels per line, typically mostly white. (A pel is a picture element: black or white.)

RLE. Compute run-lengths of white and black pels.

Prefix-free code. Encode run-lengths using following prefix-free code.

    run    white       black
    0      00110101    0000110111
    1      000111      010
    2      0111        11
    3      1000        10
    …      …           …
    63     00110100    000001100111
    64     11011       0000001111
    128    10010       000011001000
    192    010111      000011001001
    …      …           …
    1728   010011011   0000001100101

Example. A run of 194 is sent as 192 + 2.

    3W   1B  2W   2B 194W        …
    1000 010 0111 11 010111 0111 …


Prefix-Free Code: Encoding and Decoding

How to represent? Use a binary trie.
- Symbols are stored in leaves.
- Encoding is path to leaf.

Encoding.
- Method 1: start at leaf; follow path up to the root, and print bits in reverse order.
- Method 2: create a symbol table (ST) of symbol-encoding pairs.

Decoding.
- Start at root of tree.
- Go left if bit is 0; go right if 1.
- If leaf node, print symbol and return to root.

[Figure: binary trie for the code below. Left edges are labeled 0 and right edges 1; a is a leaf at depth 1, d, r and b are leaves at depth 3, and ! and c are leaves at depth 4.]

    char   encoding
    a      0
    b      111
    c      1011
    d      100
    r      110
    !      1010


How to Transmit the Trie

How to transmit the trie?
- Send preorder traversal of trie.
  - we use * as sentinel for internal nodes
  - what if there is no sentinel?
- Send number of characters to decode.
- Send bits (packed 8 to the byte).

    *a**d*!c*rb                       preorder traversal of the trie
    12                                number of characters to decode
    0111110010110100011111001010      bits

- If message is long, overhead of sending trie is small.


Prefix-Free Decoding Implementation

    public class HuffmanDecoder {
        private Node root = new Node();

        private class Node {
            char ch;
            Node left, right;

            // build tree from preorder traversal, e.g. *a**d*!c*rb
            Node() {
                ch = StdIn.readChar();
                if (ch == '*') {
                    left = new Node();
                    right = new Node();
                }
            }

            // body is left empty on the slide; assumed here: a node is
            // internal exactly when it has children
            boolean isInternal() { return left != null && right != null; }
        }

        public void decode() {
            int N = StdIn.readInt();
            for (int i = 0; i < N; i++) {
                Node x = root;
                while (x.isInternal()) {
                    // use bits in real applications instead of chars
                    char bit = StdIn.readChar();
                    if      (bit == '0') x = x.left;
                    else if (bit == '1') x = x.right;
                }
                System.out.print(x.ch);
            }
        }
    }

Input following the trie:

    12
    0111110010110100011111001010


Huffman Codes

[Photo: David Huffman]


Huffman Coding

Q. How to create a good prefix-free code?
A. Huffman code. [David Huffman, 1950]

To compute Huffman code:
- Count frequencies p_s for each symbol s in message.
- Start with a node corresponding to each symbol s with weight p_s.
- Repeat:
  - select two trees with min weight p1 and p2
  - merge into single tree with weight p1 + p2

Applications. JPEG, MP3, MPEG, PKZIP, GZIP, …


Huffman Coding Example

    Char   Freq   Huff
    E      125    110
    T       93    011
    A       80    000
    O       76    001
    I       73    1011
    N       71    1010
    S       65    1001
    R       61    1000
    H       55    1111
    L       41    0101
    D       40    0100
    C       31    11100
    U       27    11101
    Total  838    3.62 bits per character on average

[Figure: Huffman tree for the table above. The root has weight 838 and its two subtrees have weights 330 and 508.]


Huffman Tree Construction

Theorem. [Huffman] Huffman coding is an optimal prefix-free code: no prefix-free code uses fewer bits.
Corollary. "Greed is good."

Implementation.
- Pass 1: tabulate symbol frequencies and build trie.
- Pass 2: encode file by traversing trie or lookup table.

Running time. Use binary heap ⇒ O(M + N log N), where M is the input size and N is the number of distinct symbols.


Huffman Encoding

    // tabulate frequencies
    int[] freq = new int[128];
    for (int i = 0; i < input.length(); i++)
        freq[input.charAt(i)]++;

    // initialize priority queue with singleton elements
    MinPQ<Node> pq = new MinPQ<Node>();
    for (int i = 0; i < 128; i++)
        if (freq[i] > 0)
            pq.insert(new Node((char) i, freq[i], null, null));

    // repeatedly merge two smallest trees
    while (pq.size() > 1) {
        Node x = pq.delMin();
        Node y = pq.delMin();
        // arguments: sentinel character, frequency count, two subtrees
        Node parent = new Node('*', x.freq + y.freq, x, y);
        pq.insert(parent);
    }
    root = pq.delMin();


ITU-T T4 Group 3 Fax: Revisited

Group 3 fax. Transmit image comprised of up to 1728