
Data Compression

Compression reduces the size of a file:

To save TIME when transmitting it.

To save SPACE when storing it.

Most files have lots of redundancy.

Outline: run-length encoding, Huffman encoding, entropy, LZW.

References: Chapter 22, Algorithms in C, 2nd Edition, Robert Sedgewick;
Introduction to Data Compression, Guy Blelloch.

Who Needs Compression?

Moore's law: # transistors on a chip doubles every 18-24 months.

Parkinson's law: data expands to fill space available.

Text, images, sound, video, . . .

"All of the books in the world contain no more information than is
broadcast as video in a single large American city in a single year.
Not all bits have equal value."  -Carl Sagan

Basic concepts ancient (1950s), best technology recently developed.

Princeton University • COS 226 • Algorithms and Data Structures • Spring 2004 • Kevin Wayne • http://www.Princeton.EDU/~cos226

Applications of Data Compression

Generic file compression.
Files: GZIP, BZIP, BOA.
Archivers: PKZIP.
File systems: NTFS.

Multimedia.
Images: GIF, JPEG, CorelDraw.
Sound: MP3.
Video: MPEG, DivX™, HDTV.

Communication.
Fax: ITU-T T4 Group 3.
Modem: V.42bis.

Databases. Google.

Encoding and Decoding

Message. Binary data M we want to compress.

Encode. Generate a "compressed" representation C(M) that hopefully uses fewer bits.

Decode. Reconstruct original message or some approximation M'.

M → Encoder → C(M) → Decoder → M'

Compression ratio: bits in C(M) / bits in M.

Lossless. M = M', compression ratio 50-75% or lower. Ex: Java source code, executables.

Lossy. M ~ M', compression ratio 10% or lower. Ex: images, sound, video.

Simple Ideas

Ancient ideas.
Human language.
Braille.

Fixed-length coding.
Use same number of bits for each symbol.
N symbols ⇒ ⌈lg N⌉ bits per symbol.
7-bit ASCII code for text.

char  dec  binary
NUL     0  0000000
…       …  …
>      62  0111110
?      63  0111111
@      64  1000000
A      65  1000001
B      66  1000010
C      67  1000011
…       …  …
~     126  1111110
DEL   127  1111111

a b r a c a d a b r a
1100001 1100010 1110010 1100001 1100011 1100001 1110100 1100001 1100010 1110010 1100001

7 × 11 = 77 bits.

Run-Length Encoding

[Figure: raster of the letter 'q' lying on its side — nineteen 51-bit rows,
each a few long runs of 0s and 1s; each row is annotated with its run
lengths, e.g. 28 14 9 for a row of 28 zeros, 14 ones, 9 zeros.]

Natural encoding: 51 × 19 + 6 = 975 bits.
Run-length encoding: 63 × 6 + 6 = 384 bits.

Run-Length Encoding

Run-length encoding (RLE).
Exploit long runs of repeated characters.
Replace run by count followed by repeated character, but don't bother if run is less than 3.
– AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD
– 4A3BAA5B8CDABCB3A4B3CD
Annoyance: how to represent counts.
Runs in binary file alternate between 0 and 1, so output count only.
"File inflation" if runs are short.

Applications.
Black and white graphics.
– compression ratio improves with resolution!
JPEG (Joint Photographic Experts Group).

Variable-Length Encoding

Variable-length encoding.
Use DIFFERENT number of bits to encode different characters.

Relative letter frequencies in English text (%):

Char  Freq    Char  Freq
E     12.51   M     2.53
T      9.25   F     2.30
A      8.04   P     2.00
O      7.60   G     1.96
I      7.26   W     1.92
N      7.09   Y     1.73
S      6.54   B     1.54
R      6.12   V     0.99
H      5.49   K     0.67
L      4.14   X     0.19
D      3.99   J     0.16
C      3.06   Q     0.11
U      2.71   Z     0.09

[Figure: Linotype machine, 1886.]
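To make the character-based scheme concrete, here is a minimal Java sketch (an illustration, not code from the deck; the class name is invented). It emits count-plus-character for runs of 3 or more and copies shorter runs verbatim, assuming single-digit run lengths and a digit-free input, as in the slide's example.

public class RunLength {
   // Encode runs of length >= 3 as count+char; copy shorter runs as-is.
   // Assumes runs are shorter than 10 and the input contains no digits.
   public static String encode(String s) {
      StringBuilder out = new StringBuilder();
      int i = 0;
      while (i < s.length()) {
         int j = i;
         while (j < s.length() && s.charAt(j) == s.charAt(i)) j++;
         int run = j - i;                       // length of this run
         if (run >= 3) out.append(run).append(s.charAt(i));
         else for (int k = 0; k < run; k++) out.append(s.charAt(i));
         i = j;
      }
      return out.toString();
   }

   public static void main(String[] args) {
      // example from the slide
      System.out.println(encode("AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD"));
      // prints 4A3BAA5B8CDABCB3A4B3CD
   }
}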

Variable-Length Encoding

Variable-length encoding.
Use DIFFERENT number of bits to encode different characters.

char  encoding
a     1
b     0
c     10
d     01
r     00

a b r  a c  a d  a b r  a
1 0 00 1 10 1 01 1 0 00 1

Compare: 3-bit fixed-length coding takes 3 × 11 = 33 bits.

Variable-Length Decoding

But, then how do we decode? Variable-length codes can be ambiguous:
the 15 bits above also parse as

c  r  a a d  b a c  r  a
10 00 1 1 01 0 1 10 00 1

How do we avoid ambiguity?
One solution: ensure no encoding is a prefix of another.
Ex: 00 is a prefix of 0001.

char  encoding
a     1
b     001
c     0000
d     0001
r     01

a b   r  a c    a d    a b   r  a
1 001 01 1 0000 1 0001 1 001 01 1

Variable-length coding: 23 bits.

Prefix-Free Code: Implementation

How to represent? Use a binary trie.
– symbols are stored in leaves
– encoding is path to leaf (0 = left, 1 = right)

Encoding.
Method 1: start at leaf corresponding to symbol, follow path up to the root, and print bits in reverse order.
Method 2: create ST of symbol-encoding pairs.

Decoding.
Start at root of tree.
Take left branch if bit is 0; right branch if 1.
If leaf node, print symbol and return to root.

char  encoding
a     1
b     001
c     0000
d     0001
r     01

How to Transmit the Trie

How to transmit the trie?
Send lookup table.
Or send preorder traversal of trie.
– we use * as sentinel for internal nodes
– what if there is no sentinel?

Ex:  ****cdbra  10010110000100011001011

If message is long, overhead of sending trie is small.

12 13 Prefix-Free Decoding Implementation: Recall COS 126

public class HuffmanDecoder { OK, but how do I find a good prefix-free coding scheme? Greed. private char c; private HuffmanDecoder left, right; To compute Huffman prefix-free code: public HuffmanDecoder() { build tree from Count character frequencies p for each symbol s in file. c = CharStdIn.readChar(); preorder traversal s if (c == '*') { Start with a forest of trees, each consisting of a single vertex left = new HuffmanDecoder(); c*d*b*r*a right = new HuffmanDecoder(); corresponding to each symbol s with weight ps. } Repeat: } 10010110000100011001011 – select two trees with min weight p1 and p2 public void decode() { use bits in real applications – merge into single tree with weight p1 +p2 HuffmanDecoder x = this; instead of chars while(!CharStdIn.isEmpty()) { char bit = CharStdIn.readChar(); if (bit == '0') x = x.left; else if (bit == '1') x = x.right; Applications: JPEG, MP3, MPEG, PKZIP. if (x.left == null && x.right == null) { System.out.print(x.c); x = this; } } }

14 15

Huffman Coding Example

Char   Freq   Huff
E      125    110
T       93    011
A       80    000
O       76    001
I       73    1011
N       71    1010
S       65    1001
R       61    1000
H       55    1111
L       41    0101
D       40    0100
C       31    11100
U       27    11101
Total  838    3.62 bits/char on average

[Figure: the corresponding Huffman trie; the root has weight 838, with
internal-node weights 330, 508, 156, 174, 270, 238, … above the leaves.]

Huffman Tree Implementation

private class HuffmanTree implements Comparable {
   char c;                    // '*' for internal nodes
   int freq;                  // weight of this subtree
   HuffmanTree left, right;

   // print preorder traversal
   void preorder() {
      System.out.print(c);
      if (left != null && right != null) {
         left.preorder();
         right.preorder();
      }
   }

   ...
}

Huffman Encoding Implementation

public HuffmanEncoder(String input) {

   // tabulate frequencies
   int[] freq = new int[SYMBOLS];
   for (int i = 0; i < input.length(); i++)
      freq[input.charAt(i)]++;

   // initialize priority queue with singleton elements
   PQ pq = new PQ();
   for (int i = 0; i < SYMBOLS; i++)
      if (freq[i] > 0)
         pq.insert(new HuffmanTree((char) i, freq[i], null, null));

   // repeatedly merge two smallest trees
   while (pq.size() > 1) {
      HuffmanTree x = (HuffmanTree) pq.delMin();
      HuffmanTree y = (HuffmanTree) pq.delMin();
      // '*' marks an internal node
      HuffmanTree parent = new HuffmanTree('*', x.freq + y.freq, x, y);
      pq.insert(parent);
   }

   // root of Huffman tree
   tree = (HuffmanTree) pq.delMin();
}

Huffman Encoding

Theorem (Huffman, 1952). Huffman coding is an optimal prefix-free code:
no prefix-free variable-length code uses fewer bits.

Corollary. Greed is good.

Implementation.
Two passes.
– tabulate symbol frequencies and build trie
– encode file by traversing trie or lookup table
Use priority queue for delete-min and insert.
O(M + N log N), where M = length of message, N = # distinct symbols.
PQ implementation important if each symbol represents one English word.

Difficulties.
Have to transmit encoding (trie).
Not optimal (unless block size grows to infinity)!
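The deck builds the trie but never shows the second pass's lookup table being derived from it. Below is a minimal sketch of how that might look (an addition, not from the slides; it assumes the HuffmanTree fields shown earlier and a table st indexed by character). The codeword for a symbol is simply its root-to-leaf path, 0 for left and 1 for right.

// Hypothetical helper, not from the slides: fill st[c] with the codeword
// for each symbol c by walking the trie.
private void buildCode(String[] st, HuffmanTree x, String path) {
   if (x.left == null && x.right == null) {
      st[x.c] = path;                      // leaf: path is the codeword
      return;
   }
   buildCode(st, x.left,  path + '0');     // left branch appends 0
   buildCode(st, x.right, path + '1');     // right branch appends 1
}

Encoding is then one table lookup per input character: call buildCode(st, tree, "") once, then append st[input.charAt(i)] for each i.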


What Data Can be Compressed?

US Patent 5,533,051 on "Methods for Data Compression": claims to be capable of compressing all files.

Slashdot reports of the Zero Space Tuner™ and BinaryAccelerator™:

"ZeoSync has announced a breakthrough in data compression that allows
for 100:1 lossless compression of random data. If this is true, our
problems just got a lot smaller. . . ."

[Figures: Closed-cycle mill by Robert Fludd, 1618; gravity engine by Bob Schadewald.]

Reference: Museum of Unworkable Devices by Donald E. Simanek
http://www.lhup.edu/~dsimanek/museum/unwork.htm

What Data Can be Compressed?

Impossible to losslessly compress all files.
Consider all 1000-bit messages: 2^1000 possible messages.
Only 2^999 + 2^998 + … + 1 < 2^1000 of them can be encoded with ≤ 999 bits,
so some message has no shorter encoding.
Only about 1 in 2^499 can even be encoded with ≤ 500 bits!

A Difficult File To Compress

One million pseudo-random characters (a - p):

fclkkacifobjofmkgdcoiicnfmcpcjfccabckjamolnihkbgobcjbngjiceeelpfgcjiihpp
enefllhglfemdemgahlbpiggmllmnefnhjelmgjncjcidlhkglhceninidmmgnobkeglpnad
... [one million characters in all]

The program that generated it:

public class Rand {
   public static void main(String[] args) {
      for (int i = 0; i < 1000000; i++) {
         char c = 'a';
         c += (char) (Math.random() * 16);   // 16 letters: a - p
         System.out.print(c);
      }
   }
}

Rand.java is only 231 bytes, but its output is hard to compress
(assume random seed is fixed).

% javac Rand.java
% java Rand > temp.txt
% compress -c temp.txt > temp.Z
% gzip -c temp.txt > temp.gz
% bzip2 -c temp.txt > temp.bz2

Resulting file sizes (bytes):

    231  Rand.java
1000000  temp.txt
 576861  temp.Z
 570872  temp.gz
 499329  temp.bz2


Information Theory

Intrinsic difficulty of compression.
Short program generates large data file.
Optimal compression has to discover program!
Undecidable problem.

So how do we know if our algorithm is doing well?
Want lower bound on # bits required by ANY compression scheme.

Language Model

How do compression algorithms work?
Exploit biases of input messages.
Word Princeton occurs more frequently than Yale.
White patches occur in typical images.

Compression is all about probability.
Formulate probabilistic model to predict symbols.
– simple: character counts, repeated strings
– complex: models of a human face
Use model to encode message.
Use same model to decode message.

Example. Order 0 Markov model.
Each symbol s generated independently at random, with fixed probability p(s).

Entropy

Entropy. (Shannon, 1948)

   H(S) = Σ_{s ∈ S} p(s) · log2(1 / p(s))

Information content of symbol s is proportional to log2(1 / p(s)).
Weighted average of information content over all symbols.
Interface between coding and model.

Model     p(a)   p(b)                         H(S)
Model 1   1/2    1/2                          1
Model 2   0.900  0.100                        0.469
Model 3   0.990  0.010                        0.0808
Model 4   1      0                            0

Model     p(a)   p(b)   p(c)   p(d)   p(e)    H(S)
Model 5   1/5    1/5    1/5    1/5    1/5     2.322

Entropy and Compression

Shannon's theorem (1948). Avg # bits per symbol ≥ entropy.
If data source S is an order 0 Markov model, then ANY compression scheme
must use ≥ H(S) bits per symbol on average.
Cornerstone result of information theory.

Huffman's coding theorem (1952). Huffman code is optimal.
If data source S is an order 0 Markov model, then
H(S) ≤ avg # bits per symbol ≤ H(S) + 1.

Is there any hope of doing better?

Yes. Huffman wastes up to 1 bit per symbol.
– if H(S) is close to 0, this matters
– can do better with "arithmetic coding"

Yes. Source may not be order 0 Markov model.
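For concreteness, here is a minimal Java sketch (not part of the original deck) that evaluates the entropy formula; it reproduces the H(S) column of the model table above.

public class Entropy {
   // H(S) = sum of p(s) * log2(1/p(s)); symbols with p(s) = 0 contribute
   // nothing, since p * log2(1/p) -> 0 as p -> 0 (Model 4).
   static double entropy(double[] p) {
      double h = 0.0;
      for (double ps : p)
         if (ps > 0) h += ps * Math.log(1.0 / ps) / Math.log(2.0);
      return h;
   }

   public static void main(String[] args) {
      System.out.println(entropy(new double[] { 0.5, 0.5 }));   // 1.0    (Model 1)
      System.out.println(entropy(new double[] { 0.9, 0.1 }));   // 0.469  (Model 2)
      System.out.println(entropy(new double[] { 0.2, 0.2, 0.2, 0.2, 0.2 }));
                                                                // 2.322  (Model 5)
   }
}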


Entropy of the English Language

Q. How much redundancy is in the English language?

"... randomising letters in the middle of words [has] little or no effect
on the ability of skilled readers to understand the text. This is easy to
denmtrasote. In a pubiltacion of New Scnieitst you could ramdinose all
the letetrs, keipeng the first two and last two the same, and reibadailty
would hadrly be aftcfeed. My ansaylis did not come to much beucase the
thoery at the time was for shape and senqeuce retigcionon. Saberi's
work sugsegts we may have some pofrweul palrlael prsooscers at work.
The resaon for this is suerly that idnetiyfing coentnt by paarllel
prseocsing speeds up regnicoiton. We only need the first and last two
letetrs to spot chganes in meniang."

A. Quite a bit.

Entropy of the English Language

How much information is in each character of the English language?
How can we measure it? Model = English text.

Shannon's experiment (1951).
Asked humans to predict next character given previous text.
The number of guesses required for right answer:

# of guesses   1     2     3     4     5     ≥ 6
Probability    0.79  0.08  0.03  0.02  0.02  0.05

Shannon's estimate: 0.6 - 1.3 bits per character.

Lossless Compression Ratio

Year  Scheme            Bits/char
----  ASCII             7.00
1950  Huffman           4.70
1977  LZ77              3.94
1984  LZMW              3.32
1987  LZH               3.30
1987  Move-to-front     3.24
1987  LZB               3.18
1987  Gzip              2.71
1988  PPMC              2.48
1988  SAKDC             2.47
1994  PPM               2.34
1995  Burrows-Wheeler   2.29
1997  BOA               1.99
1999  RK                1.89

Entropy estimates for English:

Entropy            Bits/char
Char by char       4.5
8 chars at a time  2.4
Asymptotic         1.3

Statistical Methods

Estimate symbol frequencies, and assign codewords based on this.

Static model. Same distribution for all texts.
Fast.
Not optimal: different texts exhibit different distributions.
Ex: ASCII, Morse code, Linotype.

Dynamic model. Generate model based on text.
Preliminary pass needed to generate model.
Must transmit the model.
Ex: Huffman code.

Adaptive model. Progressively learn and update model as you read text.
More accurate models produce better compression.
Decoding must start from beginning.
Ex: LZW.


LZW Algorithm

Lempel-Ziv-Welch (variant of LZ78).
Adaptively maintain symbol table (ST) of useful strings.
If input matches word in ST, output index instead of string.

Algorithm.
Find longest word W in ST that is a prefix of the string starting at the current index.
Output index for W.
Add Wx to ST, where x = next symbol after W.

Example.
Dictionary: a, aa, ab, aba, abb, abaa, abaab, abaaa.
String starting at current index: ...abaababbb...
W = abaab, x = a.
Output index for W, insert abaaba into ST.

LZW Example

Encode "itty bitty bit bin" (_ = space). The dictionary initially
contains the single characters (…, 32 _, …, 98 b, …, 105 i, …) plus
the control codes 256 SEND and 257 STOP.

Input  Send     Add to dictionary
SEND    256
i       105     258  it
t       116     259  tt
t       116     260  ty
y       121     261  y_
_        32     262  _b
b        98     263  bi
it      258     264  itt
ty      260     265  ty_
_b      262     266  _bi
it      258     267  it_
_bi     266     268  _bin
n       110
STOP    257
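A minimal Java sketch of the encoding loop (an illustration, not code from the deck). For brevity it uses a HashMap as the symbol table and emits decimal indices; a real implementation would use the trie described below and write fixed-width (e.g. 12-bit) codes.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LZWEncoder {
   // Repeatedly find the longest prefix W already in the table,
   // output its index, and add W + next-symbol to the table.
   public static List<Integer> encode(String s) {
      Map<String, Integer> st = new HashMap<>();
      for (char c = 0; c < 128; c++)           // seed table with single chars
         st.put(String.valueOf(c), (int) c);
      int next = 258;                          // 256 = SEND, 257 = STOP
      List<Integer> out = new ArrayList<>();
      out.add(256);                            // SEND
      int i = 0;
      while (i < s.length()) {
         String w = s.substring(i, i + 1);     // grow w to the longest match
         while (i + w.length() < s.length()
                && st.containsKey(s.substring(i, i + w.length() + 1)))
            w = s.substring(i, i + w.length() + 1);
         out.add(st.get(w));                   // output index for W
         if (i + w.length() < s.length())      // add Wx to the table
            st.put(s.substring(i, i + w.length() + 1), next++);
         i += w.length();
      }
      out.add(257);                            // STOP
      return out;
   }

   public static void main(String[] args) {
      // example from the slide: prints [256, 105, 116, 116, 121, 32, 98,
      // 258, 260, 262, 258, 266, 110, 257]
      System.out.println(encode("itty bitty bit bin"));
   }
}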

LZW Implementation

Implementation.
Use trie to create symbol table on-the-fly.
– prefix of every word is also in ST

Encode.
– lookup string suffix in trie
– output ST index at bottom
– add new node to bottom of trie

Decode.
– lookup string suffix in trie
– output ST word at bottom
– add new node to bottom of trie

[Figure: trie over {a, b} containing a, aa, ab, aba, abb, abaa, abaab,
abaaa; encoding ...abaaba... outputs the index of abaab and adds the
node abaaba.]

What to do when ST gets too large?
Throw away and start over. (GIF)
Throw away when not effective. (Unix compress)

LZW in the Real World

Lempel-Ziv and friends.
LZ77: not patented → widely used in open source.
LZ78.
LZW: patent #4,558,302 expired in US on June 20, 2003; some versions copyrighted.
Deflate = LZ77 variant + Huffman.

PNG: LZ77.
Winzip, gzip, jar: deflate.
Unix compress: LZW.
Pkzip: LZW + Shannon-Fano.
GIF, TIFF, V.42bis modem: LZW.
Google: uses zlib, which is based on deflate (never expands a file).


Summary

Lossless compression.

Simple approaches: RLE.

Represent fixed-length symbols with variable-length codes: Huffman.

Represent variable-length symbols with fixed-length codes: LZW.

Lossy compression.

JPEG, MPEG, MP3.

Signal processing, wavelets, fractals, SVD, . . .

Algorithms not covered in COS 226.

Limits on compression. Entropy.
