Data Compression
Compression reduces the size of a file:
- to save space when storing it
- to save time when transmitting it

Most files have lots of redundancy.

Who needs compression?
- Moore's law: # transistors on a chip doubles every 18-24 months.
- Parkinson's law: data expands to fill the space available.
- Text, images, sound, video, ...

"All of the books in the world contain no more information than is broadcast as video in a single large American city in a single year. Not all bits have equal value."  -Carl Sagan

Basic concepts are ancient (1950s); the best technology has been developed recently.

Outline: introduction; Huffman codes; an application; entropy; LZW codes.

References: Algorithms, 2nd edition, Chapter 22; Intro to Algs and Data Structs, Section 6.5.
Copyright © 2007 by Robert Sedgewick and Kevin Wayne.

Applications of Data Compression

Generic file compression:
- files: GZIP, BZIP, BOA
- archivers: PKZIP
- file systems: NTFS

Multimedia:
- images: GIF, JPEG
- sound: MP3
- video: MPEG, DivX™, HDTV

Communication:
- ITU-T T4 Group 3 fax
- V.42bis modem

Databases: Google.

Encoding and Decoding

Message. Binary data M we want to compress.
Encode. Generate a "compressed" representation C(M), which uses fewer bits (you hope).
Decode. Reconstruct the original message or some approximation M'.

    M -> Encoder -> C(M) -> Decoder -> M'

Compression ratio. Bits in C(M) / bits in M.

Lossless. M = M'; ratio 50-75% or lower. This lecture's focus.
Ex. natural language, source code, executables.

Lossy. M ≠ M'; ratio 10% or lower.
Ex. images, sound, video.

Fixed-length encoding
- Use the same number of bits for each symbol.
- A k-bit code supports 2^k different symbols.

Ex. 7-bit ASCII:

    char  dec  code
    NUL     0  0000000
    ...
    a      97  1100001
    b      98  1100010
    c      99  1100011
    d     100  1100100
    ...
    ~     126  1111110
          127  1111111   (special end-of-message code for this lecture, written !)

    a       b       r       a       c       a       d       a       b       r       a       !
    1100001 1100010 1110010 1100001 1100011 1100001 1100100 1100001 1100010 1110010 1100001 1111111

12 symbols × 7 bits per symbol = 84 bits in code.
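As a sketch of the general idea, the following class (the name FixedLengthCode and the pattern-assignment order are my own, not from the lecture) assigns each distinct symbol the next k-bit pattern, with k = ⌈lg M⌉ for M distinct symbols, and concatenates the patterns:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FixedLengthCode {
    // Assign each distinct symbol a k-bit pattern, k = ceil(lg M) for M distinct
    // symbols, in order of first appearance, then concatenate the patterns.
    public static String encode(String msg) {
        Map<Character, Integer> code = new LinkedHashMap<>();
        for (char c : msg.toCharArray())
            code.putIfAbsent(c, code.size());          // next unused pattern number
        int m = code.size();
        int k = Math.max(1, 32 - Integer.numberOfLeadingZeros(m - 1)); // ceil(lg m)
        StringBuilder sb = new StringBuilder();
        for (char c : msg.toCharArray()) {
            String bits = Integer.toBinaryString(code.get(c));
            sb.append("0".repeat(k - bits.length())).append(bits);     // pad to k bits
        }
        return sb.toString();
    }
}
```

For "abracadabra!" there are 6 distinct symbols, so k = 3 and the output is 12 × 3 = 36 bits, matching the custom-code count below.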
Ancient Ideas

- Braille
- Morse code
- natural languages
- mathematical notation
- the decimal number system

"Poetry is the art of lossy data compression."

Ex. 3-bit custom code:

    char  code
    a     000
    b     001
    c     010
    d     011
    r     100
    !     111

    a   b   r   a   c   a   d   a   b   r   a   !
    000 001 100 000 010 000 011 000 001 100 000 111

12 symbols × 3 bits per symbol = 36 bits in code.
Important detail: the decoder needs to know the code!

Fixed-length encoding: general scheme
- Count the number of different symbols.
- ⌈lg M⌉ bits suffice to support M different symbols.

Ex. genomic sequences:
- 4 different codons
- 2 bits suffice
- 2N bits to encode a genome with N codons

    a  c  t  a  c  a  g  a  t  g  a
    00 01 10 00 01 00 11 00 10 11 00

Amazing but true: the initial genomic databases of the 1990s did not use such a code!

The decoder needs to know the code:
- the cost can be amortized over a large number of files with the same code
- in general, an N-char file can be encoded with N⌈lg M⌉ + 16⌈lg M⌉ bits

Variable-length encoding

Use a different number of bits to encode different characters.

Ex. Morse code:

    char  code
    E     .
    I     . .
    S     . . .
    V     . . . -

Issue: ambiguity. E is a prefix of I and S; I is a prefix of S; S is a prefix of V.
Is . . . - - - . . . an SOS? IAMIE? EEWNI? V7O?

Q. How do we avoid ambiguity?
A1. Append a special stop symbol to each codeword.
A2. Ensure that no encoding is a prefix of another.

Ex. custom prefix-free code:

    char  code
    a     0
    b     111
    c     1010
    d     100
    r     110
    !     1011

    a b   r   a c    a d   a b   r   a !
    0 111 110 0 1010 0 100 0 111 110 0 1011

28 bits in code.

Note 1: fixed-length codes are prefix-free.
Note 2: the cost of including the code can be amortized over similar messages.
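Both properties above can be checked mechanically. A minimal sketch (the class name PrefixFreeCode and its method names are my own): isPrefixFree tests that no codeword is a prefix of another, and decode shows why that property makes greedy left-to-right decoding unambiguous.

```java
import java.util.Map;

public class PrefixFreeCode {
    // A code is prefix-free if no codeword is a prefix of another.
    public static boolean isPrefixFree(String[] codewords) {
        for (int i = 0; i < codewords.length; i++)
            for (int j = 0; j < codewords.length; j++)
                if (i != j && codewords[j].startsWith(codewords[i]))
                    return false;
        return true;
    }

    // Prefix-freeness makes greedy decoding unambiguous: accumulate bits until
    // the accumulated string matches a codeword, emit its symbol, start over.
    public static String decode(String bits, Map<String, Character> code) {
        StringBuilder out = new StringBuilder();
        StringBuilder cur = new StringBuilder();
        for (char b : bits.toCharArray()) {
            cur.append(b);
            Character ch = code.get(cur.toString());
            if (ch != null) {
                out.append(ch);
                cur.setLength(0);
            }
        }
        return out.toString();
    }
}
```

Decoding the 28-bit message with the custom code table above recovers "abracadabra!".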
Prefix-Free Code: Encoding and Decoding

How to represent a prefix-free code? Use a binary trie:
- symbols are stored in the leaves
- a symbol's encoding is the path from the root to its leaf (0 = left, 1 = right)

Encoding.
- Method 1: start at the symbol's leaf, follow the path up to the root, and print the bits in reverse order.
- Method 2: create a symbol table of symbol-encoding pairs.

Decoding.
- Start at the root of the tree.
- Go left if the bit is 0; go right if it is 1.
- Upon reaching a leaf node, print its symbol and return to the root.

    char  encoding
    a     0
    b     111
    c     1010
    d     100
    r     110
    !     1011

How to transmit the trie

- Send a preorder traversal of the trie.
  (We use * as a sentinel for internal nodes. What if there is no character available to serve as a sentinel?)
- Send the number of characters to decode.
- Send the bits (packed 8 to the byte; real applications use bits instead of chars).

    preorder traversal   *a**d*c!*rb
    # chars to decode    12
    the message bits     0111110010100100011111001011

If the message is long, the overhead of transmitting the trie is small.

Prefix-Free Decoding Implementation

    public void decode()
    {
        int N = StdIn.readInt();
        for (int i = 0; i < N; i++)
        {
            // walk down the trie, one bit at a time, until reaching a leaf
            Node x = root;
            while (x.isInternal())
            {
                char bit = StdIn.readChar();
                if      (bit == '0') x = x.left;
                else if (bit == '1') x = x.right;
            }
            System.out.print(x.ch);
        }
    }

The decoder rebuilds the trie from the preorder traversal as it constructs each node:

    public class HuffmanDecoder
    {
        private Node root = new Node();

        private class Node
        {
            char ch;
            Node left, right;
            Node()
            {   // build the tree from the preorder traversal
                ch = StdIn.readChar();
                if (ch == '*')   // internal node: recursively read its two subtrees
                {
                    left = new Node();
                    right = new Node();
                }
            }

            boolean isInternal() { return ch == '*'; }
        }
    }

Introduction to compression: summary

Variable-length codes can provide better compression than fixed-length codes:

    a       b       r       a       c       a       d       a       b       r       a       !
    1100001 1100010 1110010 1100001 1100011 1100001 1100100 1100001 1100010 1110010 1100001 1111111   (84 bits)
    000     001     100     000     010     000     011     000     001     100     000     111       (36 bits)
    0       111     110     0       1010    0       100     0       111     110     0       1011      (28 bits)

Every trie defines a variable-length code.

Q. What is the best variable-length code for a given message?
A. The Huffman code. [David Huffman, 1950]

Huffman coding example

[Figure: trie construction for "abracadabra!". Start with leaf weights a=5, b=2, r=2, c=1, d=1, !=1; repeatedly merge the two minimum-weight tries (1+1=2, 1+2=3, 2+2=4, 3+4=7, 5+7=12) until a single trie of weight 12 remains.]

Huffman Coding

To compute the Huffman code:
- count the frequency p_s of each symbol s in the message
- start with one node per symbol s, with weight p_s
- repeat until a single trie is formed:
  - select the two tries with minimum weights p1 and p2
  - merge them into a single trie with weight p1 + p2

Applications. JPEG, MP3, MPEG, PKZIP, GZIP, ...

Huffman trie construction code

    int[] freq = new int[128];
    for (int i = 0; i < input.length(); i++)       // tabulate frequencies
        freq[input.charAt(i)]++;

    MinPQ<Node> pq = new MinPQ<Node>();            // initialize PQ
    for (int i = 0; i < 128; i++)
        if (freq[i] > 0)
            pq.insert(new Node((char) i, freq[i], null, null));

    while (pq.size() > 1)                          // merge trees
    {
        Node x = pq.delMin();
        Node y = pq.delMin();
        Node parent = new Node('*', x.freq + y.freq, x, y);  // internal node marked '*';
        pq.insert(parent);                                   // weight = total of two subtrees
    }
    root = pq.delMin();

Huffman encoding

Theorem. [Huffman] Huffman coding is an optimal prefix-free code: no prefix-free code uses fewer bits.

Implementation.
- pass 1: tabulate the symbol frequencies and build the trie
- pass 2: encode the file by traversing the trie (or a lookup table)

Running time. Use a binary heap: O(M + N log N), where N is the number of distinct symbols and M the number of output bits.
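The trie construction above can be restated as a self-contained program. This sketch uses java.util.PriorityQueue in place of the lecture's MinPQ, and the class and method names are my own; rather than encoding, it returns the total bits a Huffman code needs for the message (sum over leaves of depth × frequency), which any tie-breaking order of equal-weight merges leaves unchanged.

```java
import java.util.PriorityQueue;

public class HuffmanSketch {
    private static class Node implements Comparable<Node> {
        final char ch;
        final int freq;
        final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        public int compareTo(Node that) { return Integer.compare(freq, that.freq); }
    }

    // Build the Huffman trie for the message and return the total number of
    // bits needed to encode it with the resulting code.
    public static int encodedBits(String input) {
        int[] freq = new int[128];
        for (int i = 0; i < input.length(); i++)
            freq[input.charAt(i)]++;                       // tabulate frequencies
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (char c = 0; c < 128; c++)
            if (freq[c] > 0)
                pq.add(new Node(c, freq[c], null, null));  // one leaf per symbol
        while (pq.size() > 1) {                            // merge two min-weight tries
            Node x = pq.poll(), y = pq.poll();
            pq.add(new Node('*', x.freq + y.freq, x, y));
        }
        return cost(pq.poll(), 0);
    }

    // sum over leaves of depth * frequency
    private static int cost(Node x, int depth) {
        if (x.isLeaf()) return depth * x.freq;
        return cost(x.left, depth + 1) + cost(x.right, depth + 1);
    }
}
```

For "abracadabra!" this yields 28 bits, matching the prefix-free code shown earlier.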
An application: compress a bitmap

Typical black-and-white scanned image:
- 300 pixels/inch, 8.5 by 11 inches
- 300 × 8.5 × 300 × 11 = 8.415 million bits
- bits are mostly white

Typical amount of text on a page: 40 lines × 75 chars per line = 3000 chars.
Can we do better? [stay tuned]

Natural encoding of a bitmap: one bit per pixel.

    000000000000000000000000000011111111111111000000000
    000000000000000000000000001111111111111111110000000
    000000000000000000000001111111111111111111111110000
    000000000000000000000011111111111111111111111111000
    000000000000000000001111111111111111111111111111110
    000000000000000000011111110000000000000000001111111
    000000000000000000011111000000000000000000000011111
    000000000000000000011100000000000000000000000000111
    000000000000000000011100000000000000000000000000111
    000000000000000000011100000000000000000000000000111
    000000000000000000011100000000000000000000000000111
    000000000000000000001111000000000000000000000001110
    000000000000000000000011100000000000000000000111000
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011000000000000000000000000000000000000000000000011

    19-by-51 raster of the letter 'q' lying on its side

Run-length encoding of a bitmap

Run-length encoding and Huffman codes in the wild ... to encode the number of bits per line ... natural encoding.
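A minimal run-length encoder for one row of such a bitmap (the class name RunLength and the convention that each row starts with a run of 0s, possibly of length zero, are my assumptions, not the lecture's):

```java
import java.util.ArrayList;
import java.util.List;

public class RunLength {
    // Encode a row of '0'/'1' pixels as alternating run lengths.
    // The first run counts 0s, so a row beginning with '1' gets an
    // initial run of length zero.
    public static int[] runs(String row) {
        List<Integer> out = new ArrayList<>();
        char cur = '0';
        int len = 0;
        for (char c : row.toCharArray()) {
            if (c == cur) len++;
            else { out.add(len); cur = c; len = 1; }
        }
        out.add(len);
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

A row like "0001111100" becomes runs of 3, 5, and 2; long runs of white pixels collapse into single small numbers, which is why this pays off on mostly-white scanned pages.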