
Data Compression

Compression reduces the size of a file:

To save TIME when transmitting it.

To save SPACE when storing it.

Most files have lots of redundancy.

Outline: run-length encoding, Huffman encoding, entropy, LZW.

References: Chapter 22, Algorithms in C, 2nd Edition, Robert Sedgewick;
Introduction to Data Compression, Guy Blelloch.

Who Needs Compression?

Moore's law: # transistors on a chip doubles every 18-24 months.

Parkinson's law: data expands to fill space available.

Text, images, sound, video, . . .

"All of the books in the world contain no more information than is
broadcast as video in a single large American city in a single year.
Not all bits have equal value."  -Carl Sagan

Basic concepts ancient (1950s), best technology recently developed.

Princeton University • COS 226 • Algorithms and Data Structures • Spring 2004 • Kevin Wayne • http://www.Princeton.EDU/~cos226

Applications of Data Compression

Generic file compression.
Files: GZIP, BZIP, BOA.
Archivers: PKZIP.
File systems: NTFS.

Multimedia.
Images: GIF, JPEG, CorelDraw.
Sound: MP3.
Video: MPEG, DivX™, HDTV.

Communication.
Fax: ITU-T T4 Group 3.
Modem: V.42bis.

Databases. Google.

Encoding and Decoding

Message. Binary data M we want to compress.

Encode. Generate a "compressed" representation C(M) that hopefully uses fewer bits.

Decode. Reconstruct original message or some approximation M'.

M → Encoder → C(M) → Decoder → M'

Compression ratio: bits in C(M) / bits in M.

Lossless. M = M', compression ratio 50-75% or lower. Ex: Java source code, executables.

Lossy. M ~ M', compression ratio 10% or lower. Ex: images, sound, video.

Simple Ideas

Ancient ideas.
Human language.
Braille.

Fixed-length coding.
Use same number of bits for each symbol.
N symbols ⇒ ⌈lg N⌉ bits per symbol.
7-bit ASCII code for text.

char  dec  binary
NUL     0  0000000
…       …  …
>      62  0111110
?      63  0111111
@      64  1000000
A      65  1000001
B      66  1000010
C      67  1000011
…       …  …
~     126  1111110
DEL   127  1111111

a b r a c a d a b r a
1100001 1100010 1110010 1100001 1100011 1100001 1110100 1100001 1100010 1110010 1100001

7 × 11 = 77 bits.

Run-Length Encoding

[Figure: raster of the letter 'q' lying on its side — nineteen 51-bit rows,
each a few long runs of 0s and 1s; each row is annotated with its run
lengths, e.g. 28 14 9 for a row of 28 zeros, 14 ones, 9 zeros.]

Natural encoding: 51 × 19 + 6 = 975 bits.
Run-length encoding: 63 × 6 + 6 = 384 bits.

Run-Length Encoding

Run-length encoding (RLE).
Exploit long runs of repeated characters.
Replace run by count followed by repeated character, but don't bother if run is less than 3.
– AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD
– 4A3BAA5B8CDABCB3A4B3CD
Annoyance: how to represent counts.
Runs in binary file alternate between 0 and 1, so output count only.
"File inflation" if runs are short.

Applications.
Black and white graphics.
– compression ratio improves with resolution!
JPEG (Joint Photographic Experts Group).

Variable-Length Encoding

Variable-length encoding.
Use DIFFERENT number of bits to encode different characters.

Relative letter frequencies in English text (%):

Char  Freq    Char  Freq
E     12.51   M     2.53
T      9.25   F     2.30
A      8.04   P     2.00
O      7.60   G     1.96
I      7.26   W     1.92
N      7.09   Y     1.73
S      6.54   B     1.54
R      6.12   V     0.99
H      5.49   K     0.67
L      4.14   X     0.19
D      3.99   J     0.16
C      3.06   Q     0.11
U      2.71   Z     0.09

[Figure: Linotype machine, 1886.]
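To make the character-based scheme concrete, here is a minimal Java sketch (an illustration, not code from the deck; the class name is invented). It emits count-plus-character for runs of 3 or more and copies shorter runs verbatim, assuming single-digit run lengths and a digit-free input, as in the slide's example.

public class RunLength {
   // Encode runs of length >= 3 as count+char; copy shorter runs as-is.
   // Assumes runs are shorter than 10 and the input contains no digits.
   public static String encode(String s) {
      StringBuilder out = new StringBuilder();
      int i = 0;
      while (i < s.length()) {
         int j = i;
         while (j < s.length() && s.charAt(j) == s.charAt(i)) j++;
         int run = j - i;                       // length of this run
         if (run >= 3) out.append(run).append(s.charAt(i));
         else for (int k = 0; k < run; k++) out.append(s.charAt(i));
         i = j;
      }
      return out.toString();
   }

   public static void main(String[] args) {
      // example from the slide
      System.out.println(encode("AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD"));
      // prints 4A3BAA5B8CDABCB3A4B3CD
   }
}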

Variable-Length Encoding

Variable-length encoding.
Use DIFFERENT number of bits to encode different characters.

char  encoding
a     1
b     0
c     10
d     01
r     00

a b r  a c  a d  a b r  a
1 0 00 1 10 1 01 1 0 00 1

Compare: 3-bit fixed-length coding takes 3 × 11 = 33 bits.

Variable-Length Decoding

But, then how do we decode? Variable-length codes can be ambiguous:
the 15 bits above also parse as

c  r  a a d  b a c  r  a
10 00 1 1 01 0 1 10 00 1

How do we avoid ambiguity?
One solution: ensure no encoding is a prefix of another.
Ex: 00 is a prefix of 0001.

char  encoding
a     1
b     001
c     0000
d     0001
r     01

a b   r  a c    a d    a b   r  a
1 001 01 1 0000 1 0001 1 001 01 1

Variable-length coding: 23 bits.

Prefix-Free Code: Implementation

How to represent? Use a binary trie.
– symbols are stored in leaves
– encoding is path to leaf (0 = left, 1 = right)

Encoding.
Method 1: start at leaf corresponding to symbol, follow path up to the root, and print bits in reverse order.
Method 2: create ST of symbol-encoding pairs.

Decoding.
Start at root of tree.
Take left branch if bit is 0; right branch if 1.
If leaf node, print symbol and return to root.

char  encoding
a     1
b     001
c     0000
d     0001
r     01

How to Transmit the Trie

How to transmit the trie?
Send lookup table.
Or send preorder traversal of trie.
– we use * as sentinel for internal nodes
– what if there is no sentinel?

Ex:  ****cdbra  10010110000100011001011

If message is long, overhead of sending trie is small.

12 13 Prefix-Free Decoding Implementation: Recall COS 126

public class HuffmanDecoder { OK, but how do I find a good prefix-free coding scheme? Greed. private char c; private HuffmanDecoder left, right; To compute Huffman prefix-free code: public HuffmanDecoder() { build tree from Count character frequencies p for each symbol s in file. c = CharStdIn.readChar(); preorder traversal s if (c == '*') { Start with a forest of trees, each consisting of a single vertex left = new HuffmanDecoder(); c*d*b*r*a right = new HuffmanDecoder(); corresponding to each symbol s with weight ps. } Repeat: } 10010110000100011001011 – select two trees with min weight p1 and p2 public void decode() { use bits in real applications – merge into single tree with weight p1 +p2 HuffmanDecoder x = this; instead of chars while(!CharStdIn.isEmpty()) { char bit = CharStdIn.readChar(); if (bit == '0') x = x.left; else if (bit == '1') x = x.right; Applications: JPEG, MP3, MPEG, PKZIP. if (x.left == null && x.right == null) { System.out.print(x.c); x = this; } } }

14 15

Huffman Coding Example

Char   Freq   Huff
E      125    110
T       93    011
A       80    000
O       76    001
I       73    1011
N       71    1010
S       65    1001
R       61    1000
H       55    1111
L       41    0101
D       40    0100
C       31    11100
U       27    11101
Total  838    3.62 bits/char on average

[Figure: the corresponding Huffman trie; the root has weight 838, with
internal-node weights 330, 508, 156, 174, 270, 238, … above the leaves.]

Huffman Tree Implementation

private class HuffmanTree implements Comparable {
   char c;                    // '*' for internal nodes
   int freq;                  // weight of this subtree
   HuffmanTree left, right;

   // print preorder traversal
   void preorder() {
      System.out.print(c);
      if (left != null && right != null) {
         left.preorder();
         right.preorder();
      }
   }

   ...
}

Huffman Encoding Implementation

public HuffmanEncoder(String input) {

   // tabulate frequencies
   int[] freq = new int[SYMBOLS];
   for (int i = 0; i < input.length(); i++)
      freq[input.charAt(i)]++;

   // initialize priority queue with singleton elements
   PQ pq = new PQ();
   for (int i = 0; i < SYMBOLS; i++)
      if (freq[i] > 0)
         pq.insert(new HuffmanTree((char) i, freq[i], null, null));

   // repeatedly merge two smallest trees
   while (pq.size() > 1) {
      HuffmanTree x = (HuffmanTree) pq.delMin();
      HuffmanTree y = (HuffmanTree) pq.delMin();
      // '*' marks an internal node
      HuffmanTree parent = new HuffmanTree('*', x.freq + y.freq, x, y);
      pq.insert(parent);
   }

   // root of Huffman tree
   tree = (HuffmanTree) pq.delMin();
}

Huffman Encoding

Theorem (Huffman, 1952). Huffman coding is an optimal prefix-free code:
no prefix-free variable-length code uses fewer bits.

Corollary. Greed is good.

Implementation.
Two passes.
– tabulate symbol frequencies and build trie
– encode file by traversing trie or lookup table
Use priority queue for delete-min and insert.
O(M + N log N), where M = length of message, N = # distinct symbols.
PQ implementation important if each symbol represents one English word.

Difficulties.
Have to transmit encoding (trie).
Not optimal (unless block size grows to infinity)!
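The deck builds the trie but never shows the second pass's lookup table being derived from it. Below is a minimal sketch of how that might look (an addition, not from the slides; it assumes the HuffmanTree fields shown earlier and a table st indexed by character). The codeword for a symbol is simply its root-to-leaf path, 0 for left and 1 for right.

// Hypothetical helper, not from the slides: fill st[c] with the codeword
// for each symbol c by walking the trie.
private void buildCode(String[] st, HuffmanTree x, String path) {
   if (x.left == null && x.right == null) {
      st[x.c] = path;                      // leaf: path is the codeword
      return;
   }
   buildCode(st, x.left,  path + '0');     // left branch appends 0
   buildCode(st, x.right, path + '1');     // right branch appends 1
}

Encoding is then one table lookup per input character: call buildCode(st, tree, "") once, then append st[input.charAt(i)] for each i.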


What Data Can be Compressed?

US Patent 5,533,051 on "Methods for Data Compression": claims to be capable of compressing all files.

Slashdot reports of the Zero Space Tuner™ and BinaryAccelerator™:

"ZeoSync has announced a breakthrough in data compression that allows
for 100:1 lossless compression of random data. If this is true, our
problems just got a lot smaller. . . ."

[Figures: Closed-cycle mill by Robert Fludd, 1618; gravity engine by Bob Schadewald.]

Reference: Museum of Unworkable Devices by Donald E. Simanek
http://www.lhup.edu/~dsimanek/museum/unwork.htm

What Data Can be Compressed?

Impossible to losslessly compress all files.
Consider all 1000-bit messages: 2^1000 possible messages.
Only 2^999 + 2^998 + … + 1 < 2^1000 of them can be encoded with ≤ 999 bits,
so some message has no shorter encoding.
Only about 1 in 2^499 can even be encoded with ≤ 500 bits!

A Difficult File To Compress

One million pseudo-random characters (a - p):

fclkkacifobjofmkgdcoiicnfmcpcjfccabckjamolnihkbgobcjbngjiceeelpfgcjiihpp
enefllhglfemdemgahlbpiggmllmnefnhjelmgjncjcidlhkglhceninidmmgnobkeglpnad
... [one million characters in all]

The program that generated it:

public class Rand {
   public static void main(String[] args) {
      for (int i = 0; i < 1000000; i++) {
         char c = 'a';
         c += (char) (Math.random() * 16);   // 16 letters: a - p
         System.out.print(c);
      }
   }
}

Rand.java is only 231 bytes, but its output is hard to compress
(assume random seed is fixed).

% javac Rand.java
% java Rand > temp.txt
% compress -c temp.txt > temp.Z
% gzip -c temp.txt > temp.gz
% bzip2 -c temp.txt > temp.bz2

Resulting file sizes (bytes):

    231  Rand.java
1000000  temp.txt
 576861  temp.Z
 570872  temp.gz
 499329  temp.bz2


Information Theory

Intrinsic difficulty of compression.
Short program generates large data file.
Optimal compression has to discover program!
Undecidable problem.

So how do we know if our algorithm is doing well?
Want lower bound on # bits required by ANY compression scheme.

Language Model

How do compression algorithms work?
Exploit biases of input messages.
Word Princeton occurs more frequently than Yale.
White patches occur in typical images.

Compression is all about probability.
Formulate probabilistic model to predict symbols.
– simple: character counts, repeated strings
– complex: models of a human face
Use model to encode message.
Use same model to decode message.

Example. Order 0 Markov model.
Each symbol s generated independently at random, with fixed probability p(s).

Entropy

Entropy. (Shannon, 1948)

   H(S) = Σ_{s ∈ S} p(s) · log2(1 / p(s))

Information content of symbol s is proportional to log2(1 / p(s)).
Weighted average of information content over all symbols.
Interface between coding and model.

Model     p(a)   p(b)                         H(S)
Model 1   1/2    1/2                          1
Model 2   0.900  0.100                        0.469
Model 3   0.990  0.010                        0.0808
Model 4   1      0                            0

Model     p(a)   p(b)   p(c)   p(d)   p(e)    H(S)
Model 5   1/5    1/5    1/5    1/5    1/5     2.322

Entropy and Compression

Shannon's theorem (1948). Avg # bits per symbol ≥ entropy.
If data source S is an order 0 Markov model, then ANY compression scheme
must use ≥ H(S) bits per symbol on average.
Cornerstone result of information theory.

Huffman's coding theorem (1952). Huffman code is optimal.
If data source S is an order 0 Markov model, then
H(S) ≤ avg # bits per symbol ≤ H(S) + 1.

Is there any hope of doing better?

Yes. Huffman wastes up to 1 bit per symbol.
– if H(S) is close to 0, this matters
– can do better with "arithmetic coding"

Yes. Source may not be order 0 Markov model.
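For concreteness, here is a minimal Java sketch (not part of the original deck) that evaluates the entropy formula; it reproduces the H(S) column of the model table above.

public class Entropy {
   // H(S) = sum of p(s) * log2(1/p(s)); symbols with p(s) = 0 contribute
   // nothing, since p * log2(1/p) -> 0 as p -> 0 (Model 4).
   static double entropy(double[] p) {
      double h = 0.0;
      for (double ps : p)
         if (ps > 0) h += ps * Math.log(1.0 / ps) / Math.log(2.0);
      return h;
   }

   public static void main(String[] args) {
      System.out.println(entropy(new double[] { 0.5, 0.5 }));   // 1.0    (Model 1)
      System.out.println(entropy(new double[] { 0.9, 0.1 }));   // 0.469  (Model 2)
      System.out.println(entropy(new double[] { 0.2, 0.2, 0.2, 0.2, 0.2 }));
                                                                // 2.322  (Model 5)
   }
}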


Entropy of the English Language

Q. How much redundancy is in the English language?

"... randomising letters in the middle of words [has] little or no effect
on the ability of skilled readers to understand the text. This is easy to
denmtrasote. In a pubiltacion of New Scnieitst you could ramdinose all
the letetrs, keipeng the first two and last two the same, and reibadailty
would hadrly be aftcfeed. My ansaylis did not come to much beucase the
thoery at the time was for shape and senqeuce retigcionon. Saberi's
work sugsegts we may have some pofrweul palrlael prsooscers at work.
The resaon for this is suerly that idnetiyfing coentnt by paarllel
prseocsing speeds up regnicoiton. We only need the first and last two
letetrs to spot chganes in meniang."

A. Quite a bit.

Entropy of the English Language

How much information is in each character of the English language?
How can we measure it? Model = English text.

Shannon's experiment (1951).
Asked humans to predict next character given previous text.
The number of guesses required for right answer:

# of guesses   1     2     3     4     5     ≥ 6
Probability    0.79  0.08  0.03  0.02  0.02  0.05

Shannon's estimate: 0.6 - 1.3 bits per character.

Lossless Compression Ratio

Year  Scheme            Bits/char
----  ASCII             7.00
1950  Huffman           4.70
1977  LZ77              3.94
1984  LZMW              3.32
1987  LZH               3.30
1987  Move-to-front     3.24
1987  LZB               3.18
1987  Gzip              2.71
1988  PPMC              2.48
1988  SAKDC             2.47
1994  PPM               2.34
1995  Burrows-Wheeler   2.29
1997  BOA               1.99
1999  RK                1.89

Entropy estimates for English:

Entropy            Bits/char
Char by char       4.5
8 chars at a time  2.4
Asymptotic         1.3

Statistical Methods

Estimate symbol frequencies, and assign codewords based on this.

Static model. Same distribution for all texts.
Fast.
Not optimal: different texts exhibit different distributions.
Ex: ASCII, Morse code, Linotype.

Dynamic model. Generate model based on text.
Preliminary pass needed to generate model.
Must transmit the model.
Ex: Huffman code.

Adaptive model. Progressively learn and update model as you read text.
More accurate models produce better compression.
Decoding must start from beginning.
Ex: LZW.


LZW Algorithm

Lempel-Ziv-Welch (variant of LZ78).
Adaptively maintain symbol table (ST) of useful strings.
If input matches word in ST, output index instead of string.

Algorithm.
Find longest word W in ST that is a prefix of the string starting at the current index.
Output index for W.
Add Wx to ST, where x = next symbol after W.

Example.
Dictionary: a, aa, ab, aba, abb, abaa, abaab, abaaa.
String starting at current index: ...abaababbb...
W = abaab, x = a.
Output index for W, insert abaaba into ST.

LZW Example

Encode "itty bitty bit bin" (_ = space). The dictionary initially
contains the single characters (…, 32 _, …, 98 b, …, 105 i, …) plus
the control codes 256 SEND and 257 STOP.

Input  Send     Add to dictionary
SEND    256
i       105     258  it
t       116     259  tt
t       116     260  ty
y       121     261  y_
_        32     262  _b
b        98     263  bi
it      258     264  itt
ty      260     265  ty_
_b      262     266  _bi
it      258     267  it_
_bi     266     268  _bin
n       110
STOP    257
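A minimal Java sketch of the encoding loop (an illustration, not code from the deck). For brevity it uses a HashMap as the symbol table and emits decimal indices; a real implementation would use the trie described below and write fixed-width (e.g. 12-bit) codes.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LZWEncoder {
   // Repeatedly find the longest prefix W already in the table,
   // output its index, and add W + next-symbol to the table.
   public static List<Integer> encode(String s) {
      Map<String, Integer> st = new HashMap<>();
      for (char c = 0; c < 128; c++)           // seed table with single chars
         st.put(String.valueOf(c), (int) c);
      int next = 258;                          // 256 = SEND, 257 = STOP
      List<Integer> out = new ArrayList<>();
      out.add(256);                            // SEND
      int i = 0;
      while (i < s.length()) {
         String w = s.substring(i, i + 1);     // grow w to the longest match
         while (i + w.length() < s.length()
                && st.containsKey(s.substring(i, i + w.length() + 1)))
            w = s.substring(i, i + w.length() + 1);
         out.add(st.get(w));                   // output index for W
         if (i + w.length() < s.length())      // add Wx to the table
            st.put(s.substring(i, i + w.length() + 1), next++);
         i += w.length();
      }
      out.add(257);                            // STOP
      return out;
   }

   public static void main(String[] args) {
      // example from the slide: prints [256, 105, 116, 116, 121, 32, 98,
      // 258, 260, 262, 258, 266, 110, 257]
      System.out.println(encode("itty bitty bit bin"));
   }
}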

LZW Implementation

Implementation.
Use trie to create symbol table on-the-fly.
– prefix of every word is also in ST

Encode.
– lookup string suffix in trie
– output ST index at bottom
– add new node to bottom of trie

Decode.
– lookup string suffix in trie
– output ST word at bottom
– add new node to bottom of trie

[Figure: trie over {a, b} containing a, aa, ab, aba, abb, abaa, abaab,
abaaa; encoding ...abaaba... outputs the index of abaab and adds the
node abaaba.]

What to do when ST gets too large?
Throw away and start over. (GIF)
Throw away when not effective. (Unix compress)

LZW in the Real World

Lempel-Ziv and friends.
LZ77: not patented → widely used in open source.
LZ78.
LZW: patent #4,558,302 expired in US on June 20, 2003; some versions copyrighted.
Deflate = LZ77 variant + Huffman.

PNG: LZ77.
Winzip, gzip, jar: deflate.
Unix compress: LZW.
Pkzip: LZW + Shannon-Fano.
GIF, TIFF, V.42bis modem: LZW.
Google: uses zlib, which is based on deflate (never expands a file).


Summary

Lossless compression.

Simple approaches: RLE.

Represent fixed-length symbols with variable-length codes: Huffman.

Represent variable-length symbols with fixed-length codes: LZW.

Lossy compression.

JPEG, MPEG, MP3.

Signal processing, wavelets, fractals, SVD, . . .

Algorithms not covered in COS 226.

Limits on compression. Entropy.
