Data Compression

Compression reduces the size of a file:
  • to save space when storing it;
  • to save time when transmitting it.
Most files have lots of redundancy.

Who needs compression?
  • Moore's law: # transistors on a chip doubles every 18-24 months.
  • Parkinson's law: data expands to fill the space available.
  • Text, images, sound, video, ...

"All of the books in the world contain no more information than is
broadcast as video in a single large American city in a single year.
Not all bits have equal value."  -Carl Sagan

Basic concepts are ancient (1950s); the best technology was developed recently.

Outline: introduction • Huffman codes • an application • entropy • LZW codes

References: Algorithms, 2nd edition, Chapter 22; Intro to Algs and Data Structs, Section 6.5.
Copyright © 2007 by Robert Sedgewick and Kevin Wayne.

Applications of Data Compression

Generic file compression.
  • Files: GZIP, BZIP, BOA.
  • Archivers: PKZIP.
  • File systems: NTFS.
Multimedia.
  • Images: GIF, JPEG.
  • Sound: MP3.
  • Video: MPEG, DivX™, HDTV.
Communication.
  • ITU-T T4 Group 3 Fax.
  • V.42bis modem.
Databases. Google.

Encoding and Decoding

Message. Binary data M we want to compress.
Encode. Generate a "compressed" representation C(M), which uses fewer bits (you hope).
Decode. Reconstruct the original message or some approximation M'.

    M → Encoder → C(M) → Decoder → M'

Compression ratio. Bits in C(M) / bits in M.
  • Lossless: M = M'. Ratio 50-75% or lower.
    Ex. natural language, source code, executables.
  • Lossy: M ≠ M'. Ratio 10% or lower.
    Ex. images, sound, video.

This lecture: lossless.

Fixed-Length Encoding

  • Use the same number of bits for each symbol.
  • A k-bit code supports 2^k different symbols.

Ex. 7-bit ASCII:

    char  decimal  code
    NUL      0     0000000
    ...
    a       97     1100001
    b       98     1100010
    c       99     1100011
    d      100     1100100
    ...
    ~      126     1111110
           127     1111111   (special end-of-message code for this lecture)

    a       b       r       a       c       a       d       a       b       r       a       !
    1100001 1100010 1110010 1100001 1100011 1100001 1100100 1100001 1100010 1110010 1100001 1111111

12 symbols × 7 bits per symbol = 84 bits in code.
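The fixed-length scheme above can be sketched in a few lines of Java. This is a minimal illustration (class and method names are my own, not from the slides): each character of "abracadabra" becomes 7 bits, and the 7-bit end-of-message code 1111111 is appended, for 12 × 7 = 84 bits.

```java
// Sketch: fixed-length 7-bit ASCII encoding, as in the example above.
public class FixedLengthDemo {
    static String encode7bitAscii(String msg) {
        StringBuilder sb = new StringBuilder();
        for (char c : msg.toCharArray()) {
            // 7-bit binary representation, padded with leading zeros
            String bits = Integer.toBinaryString(c);
            while (bits.length() < 7) bits = "0" + bits;
            sb.append(bits);
        }
        sb.append("1111111");   // special end-of-message code (char 127)
        return sb.toString();
    }

    public static void main(String[] args) {
        String code = encode7bitAscii("abracadabra");
        // 12 symbols (11 chars + end marker) x 7 bits = 84 bits
        System.out.println(code.length());   // prints 84
    }
}
```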
Ancient Ideas

Ancient ideas:
  • Braille.
  • Morse code.
  • Natural languages.
  • Mathematical notation.
  • The decimal number system.

"Poetry is the art of lossy data compression."

Fixed-Length Encoding (continued)

Ex. 3-bit custom code:

    char  code
    a     000
    b     001
    c     010
    d     011
    r     100
    !     111

    a   b   r   a   c   a   d   a   b   r   a   !
    000 001 100 000 010 000 011 000 001 100 000 111

12 symbols × 3 bits per symbol = 36 bits in code.
Important detail: the decoder needs to know the code!

Fixed-Length Encoding: General Scheme

  • Count the number of different symbols.
  • ⌈lg M⌉ bits suffice to support M different symbols.

Ex. genomic sequences.
  • 4 different codons ⇒ 2 bits suffice.

    char  code
    a     00
    c     01
    t     10
    g     11

    a  c  t  a  c  a  g  a  t  g  a
    00 01 10 00 01 00 11 00 10 11 00

  • 2N bits to encode a genome with N codons.
  • Amazing but true: initial genomic databases in the 1990s did not use such a code!

The decoder needs to know the code:
  • can amortize the cost over a large number of files with the same code;
  • in general, can encode an N-char file with N ⌈lg M⌉ + 16 ⌈lg M⌉ bits.

Variable-Length Encoding

Use a different number of bits to encode different characters.

Ex. Morse code. Issue: ambiguity. The sequence ••• ––– ••• could be SOS, IAMIE, EEWNI, or V7O:
  • S = •••  is a prefix of V = •••–
  • E = •    is a prefix of I and S
  • I = ••   is a prefix of S

Q. How do we avoid ambiguity?
A1. Append a special stop symbol to each codeword.
A2. Ensure that no encoding is a prefix of another.

Ex. custom prefix-free code:

    char  code
    a     0
    b     111
    c     1010
    d     100
    r     110
    !     1011

    a b   r   a c    a d   a b   r   a !
    0 111 110 0 1010 0 100 0 111 110 0 1011

28 bits in code.

Note 1: fixed-length codes are prefix-free.
Note 2: can amortize the cost of including the code over similar messages.
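The prefix-free code above is just a table lookup on the encoding side; concatenating codewords is unambiguous precisely because no codeword is a prefix of another. A minimal sketch (class name is my own):

```java
import java.util.Map;

// Sketch: encode "abracadabra!" with the custom prefix-free code above.
public class PrefixFreeDemo {
    static final Map<Character, String> CODE = Map.of(
        'a', "0", 'b', "111", 'c', "1010",
        'd', "100", 'r', "110", '!', "1011");

    static String encode(String msg) {
        StringBuilder sb = new StringBuilder();
        for (char c : msg.toCharArray()) sb.append(CODE.get(c));
        return sb.toString();
    }

    public static void main(String[] args) {
        String bits = encode("abracadabra!");
        System.out.println(bits);            // 0111110010100100011111001011
        System.out.println(bits.length());   // 28 bits, vs. 36 for the 3-bit fixed-length code
    }
}
```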
Prefix-Free Code: Encoding and Decoding

How to represent a prefix-free code? Use a binary trie.
  • Symbols are stored in leaves.
  • The encoding of a symbol is the path from the root to its leaf (0 = go left, 1 = go right).

    char  encoding
    a     0
    b     111
    c     1010
    d     100
    r     110
    !     1011

Encoding.
  • Method 1: start at the leaf; follow the path up to the root, and print the bits in reverse order.
  • Method 2: create a symbol table of symbol-encoding pairs.

Decoding.
  • Start at the root of the tree.
  • Go left if the bit is 0; go right if it is 1.
  • At a leaf node, print the symbol and return to the root.

How to Transmit the Trie

  • Send a preorder traversal of the trie.
    - We use * as a sentinel for internal nodes. (What if there is no sentinel character available? Use bits instead of chars, as in real applications.)
  • Send the number of characters to decode.
  • Send the bits (packed 8 to the byte).

    preorder traversal    *a**d*c!*rb
    # chars to decode     12
    the message bits      0111110010100100011111001011

If the message is long, the overhead of transmitting the trie is small.

Prefix-Free Decoding Implementation

    public class HuffmanDecoder
    {
       private Node root = new Node();

       private class Node
       {
          char ch;
          Node left, right;

          Node()   // build tree from preorder traversal: *a**d*c!*rb
          {
             ch = StdIn.readChar();
             if (ch == '*')
             {
                left  = new Node();
                right = new Node();
             }
          }

          boolean isInternal() { return left != null && right != null; }
       }

       public void decode()
       {
          int N = StdIn.readInt();
          for (int i = 0; i < N; i++)
          {
             Node x = root;
             while (x.isInternal())
             {
                char bit = StdIn.readChar();   // use bits instead of chars in real applications
                if      (bit == '0') x = x.left;
                else if (bit == '1') x = x.right;
             }
             System.out.print(x.ch);
          }
       }
    }
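The decoder above reads from StdIn; as a self-contained sketch of the same idea (names other than the preorder string are my own), we can rebuild the trie from the preorder traversal "*a**d*c!*rb" and decode the 28 message bits:

```java
// Sketch: rebuild the trie from its preorder traversal, then decode
// by walking root-to-leaf on each bit, as described above.
public class TrieDecodeDemo {
    static int pos;   // cursor into the preorder string

    static class Node {
        char ch;
        Node left, right;
        Node(String preorder) {
            ch = preorder.charAt(pos++);
            if (ch == '*') {              // '*' marks an internal node
                left  = new Node(preorder);
                right = new Node(preorder);
            }
        }
        boolean isInternal() { return left != null && right != null; }
    }

    static String decode(Node root, String bits, int n) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        for (int k = 0; k < n; k++) {
            Node x = root;
            while (x.isInternal())        // 0 = left, 1 = right, until a leaf
                x = (bits.charAt(i++) == '0') ? x.left : x.right;
            sb.append(x.ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        pos = 0;
        Node root = new Node("*a**d*c!*rb");
        System.out.println(decode(root, "0111110010100100011111001011", 12));
        // prints abracadabra!
    }
}
```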
Introduction to Compression: Summary

Variable-length codes can provide better compression than fixed-length codes:

    a b r a c a d a b r a !
    7-bit ASCII  (84 bits):  1100001 1100010 1110010 1100001 1100011 1100001 1100100 1100001 1100010 1110010 1100001 1111111
    3-bit fixed  (36 bits):  000 001 100 000 010 000 011 000 001 100 000 111
    prefix-free  (28 bits):  0 111 110 0 1010 0 100 0 111 110 0 1011

Every trie defines a variable-length code.

Q. What is the best variable-length code for a given message?

Huffman Coding

A. The Huffman code. [David Huffman, 1950]

To compute a Huffman code:
  • count the frequency p_s for each symbol s in the message;
  • start with one node corresponding to each symbol s (with weight p_s);
  • repeat until a single trie is formed:
    - select the two tries with minimum weights p1 and p2;
    - merge them into a single trie with weight p1 + p2.

Applications. JPEG, MP3, MPEG, PKZIP, GZIP, ...

Huffman coding example. For "abracadabra!" (frequencies a:5, b:2, r:2, c:1, d:1, !:1), repeatedly merging the two lightest tries yields the code a = 0, d = 100, c = 1010, ! = 1011, r = 110, b = 111.

Huffman Trie Construction Code

    int[] freq = new int[128];
    for (int i = 0; i < input.length(); i++)   // tabulate frequencies
       freq[input.charAt(i)]++;

    MinPQ<Node> pq = new MinPQ<Node>();        // initialize PQ
    for (int i = 0; i < 128; i++)
       if (freq[i] > 0)
          pq.insert(new Node((char) i, freq[i], null, null));

    while (pq.size() > 1)                      // merge trees
    {
       Node x = pq.delMin();
       Node y = pq.delMin();
       // internal node: '*' marker, total frequency, two subtrees
       Node parent = new Node('*', x.freq + y.freq, x, y);
       pq.insert(parent);
    }
    root = pq.delMin();

Huffman Encoding

Theorem. [Huffman] Huffman coding is an optimal prefix-free code: no prefix-free code uses fewer bits.

Implementation.
  • Pass 1: tabulate symbol frequencies and build the trie.
  • Pass 2: encode the file by traversing the trie or a lookup table.
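The construction code above uses the lecture's MinPQ class; the same algorithm can be sketched with java.util.PriorityQueue (a hypothetical stand-in, not the slides' implementation). The total encoded length equals the sum over leaves of frequency × depth, and for "abracadabra!" it comes to 28 bits, matching the prefix-free code above.

```java
import java.util.PriorityQueue;

// Sketch: Huffman trie construction with java.util.PriorityQueue,
// plus a check of the total encoded message length.
public class HuffmanDemo {
    static class Node implements Comparable<Node> {
        char ch; int freq; Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        public int compareTo(Node that) { return this.freq - that.freq; }
    }

    static Node buildTrie(String input) {
        int[] freq = new int[128];
        for (int i = 0; i < input.length(); i++) freq[input.charAt(i)]++;
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (int i = 0; i < 128; i++)
            if (freq[i] > 0) pq.add(new Node((char) i, freq[i], null, null));
        while (pq.size() > 1) {          // merge the two lightest tries
            Node x = pq.poll(), y = pq.poll();
            pq.add(new Node('*', x.freq + y.freq, x, y));
        }
        return pq.poll();
    }

    // total bits = sum over leaves of (frequency x depth)
    static int totalBits(Node x, int depth) {
        if (x.left == null) return x.freq * depth;
        return totalBits(x.left, depth + 1) + totalBits(x.right, depth + 1);
    }

    public static void main(String[] args) {
        Node root = buildTrie("abracadabra!");
        System.out.println(totalBits(root, 0));   // prints 28
    }
}
```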
Running time. Using a binary heap: O(M + N log N), where M = number of output bits and N = number of distinct symbols.

Can we do better? [stay tuned]

An Application: Compress a Bitmap

A typical black-and-white scanned image:
  • 300 pixels/inch, 8.5 by 11 inches;
  • 300 × 8.5 × 300 × 11 = 8.415 million bits;
  • bits are mostly white.

Typical amount of text on a page: 40 lines × 75 chars per line = 3000 chars.

Natural Encoding of a Bitmap

One bit per pixel. Ex. a 19-by-51 raster of the letter 'q' lying on its side:

    000000000000000000000000000011111111111111000000000
    000000000000000000000000001111111111111111110000000
    000000000000000000000001111111111111111111111110000
    000000000000000000000011111111111111111111111111000
    000000000000000000001111111111111111111111111111110
    000000000000000000011111110000000000000000001111111
    000000000000000000011111000000000000000000000011111
    000000000000000000011100000000000000000000000000111
    000000000000000000011100000000000000000000000000111
    000000000000000000011100000000000000000000000000111
    000000000000000000011100000000000000000000000000111
    000000000000000000001111000000000000000000000001110
    000000000000000000000011100000000000000000000111000
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011111111111111111111111111111111111111111111111111
    011000000000000000000000000000000000000000000000011

Run-Length Encoding of a Bitmap

Instead of the natural encoding, encode the number of consecutive bits of each color on each line. Run-length encoding and Huffman codes are used together in the wild.
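The run-length idea for one bitmap row can be sketched as follows (a minimal illustration with my own class name; real formats pack the counts into a fixed or Huffman-coded number of bits):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: run-length encoding of one bitmap row; store the lengths of
// alternating runs instead of the raw bits.
public class RunLengthDemo {
    static List<Integer> runs(String row) {
        List<Integer> runs = new ArrayList<>();
        int count = 1;
        for (int i = 1; i < row.length(); i++) {
            if (row.charAt(i) == row.charAt(i - 1)) count++;
            else { runs.add(count); count = 1; }
        }
        runs.add(count);
        return runs;
    }

    public static void main(String[] args) {
        // first row of the 'q' raster above: 28 zeros, 14 ones, 9 zeros
        String row = "0".repeat(28) + "1".repeat(14) + "0".repeat(9);
        System.out.println(runs(row));   // prints [28, 14, 9]
    }
}
```

Three small counts replace 51 raw bits, which is why this works so well on mostly-white scanned pages.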
