Data Compression

Compression reduces the size of a file:
• To save space when storing it.
• To save time when transmitting it.
• Most files have lots of redundancy.

Who needs compression?
• Moore's law: # transistors on a chip doubles every 18-24 months.
• Parkinson's law: data expands to fill space available.
• Text, images, sound, video, …

"All of the books in the world contain no more information than is broadcast as video in a single large American city in a single year. Not all bits have equal value." -Carl Sagan

Lecture outline: introduction, Huffman codes, an application, entropy, LZW codes.

Basic concepts ancient (1950s), best technology recently developed.

References:
• Algorithms, 2nd edition, Chapter 22.
• Intro to Algs and Data Structs, Section 6.5.

Copyright © 2007 by Robert Sedgewick and Kevin Wayne.

Applications of Data Compression

Generic file compression.
• Files: GZIP, BZIP, BOA.
• Archivers: PKZIP.
• File systems: NTFS.

Multimedia.
• Images: GIF, JPEG.
• Sound: MP3.
• Video: MPEG, DivX™, HDTV.

Communication.
• ITU-T T4 Group 3 Fax.
• V.42bis modem.

Databases. Google.

Encoding and Decoding

Message. Binary data M we want to compress.
Encode. Generate a "compressed" representation C(M) that uses fewer bits (you hope).
Decode. Reconstruct original message or some approximation M'.

    M → Encoder → C(M) → Decoder → M'

Compression ratio. Bits in C(M) / bits in M.

This lecture: lossless compression.
• Lossless. M = M', ratio 50-75% or lower. Ex. natural language, source code, executables.
• Lossy. M ≠ M', ratio 10% or lower. Ex. images, sound, video.

Fixed length encoding

• Use same number of bits for each symbol.
• k-bit code supports 2^k different symbols.

Ex. 7-bit ASCII:

    char  decimal  code
    NUL   0        0000000
    ...
    a     97       1100001
    b     98       1100010
    c     99       1100011
    d     100      1100100
    ...
    ~     126      1111110
    !     127      1111111

('!' = 127 is this lecture's special end-of-message code.)

    a b r a c a d a b r a !  →
    1100001 1100010 1110010 1100001 1100011 1100001 1100100 1100001 1100010 1110010 1100001 1111111

12 symbols × 7 bits per symbol = 84 bits in code.


Ancient Ideas

Ancient ideas.
• Braille.
• Morse code.
• Natural languages.
• Mathematical notation.
• Decimal number system.

"Poetry is the art of lossy data compression."

Fixed length encoding

• Use same number of bits for each symbol.
• k-bit code supports 2^k different symbols.

Ex. 3-bit custom code:

    char  code
    a     000
    b     001
    c     010
    d     011
    r     100
    !     111

    a b r a c a d a b r a !  →
    000 001 100 000 010 000 011 000 001 100 000 111

12 symbols × 3 bits per symbol = 36 bits in code.
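A fixed-length encoder is only a few lines of Java. This sketch is not from the lecture; the class name is illustrative, and the table literals mirror the 3-bit code above.

    import java.util.HashMap;
    import java.util.Map;

    public class FixedLengthEncoder
    {
        public static String encode(String msg)
        {
            Map<Character, String> code = new HashMap<Character, String>();
            code.put('a', "000"); code.put('b', "001"); code.put('c', "010");
            code.put('d', "011"); code.put('r', "100"); code.put('!', "111");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < msg.length(); i++)
                sb.append(code.get(msg.charAt(i)));    // 3 bits per symbol
            return sb.toString();
        }

        public static void main(String[] args)
        {   // 12 symbols × 3 bits = 36 bits
            System.out.println(encode("abracadabra!"));
        }
    }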

Important detail: decoder needs to know the code! 6 8 Fixed length encoding: general scheme Variable-length encoding

! count number of different symbols. Use different number of bits to encode different characters. ! $lg M% bits suffice to support M different symbols Q. How do we avoid ambiguity? Ex. genomic sequences S • • • prefix of V char code A1. Append special stop symbol to each codeword. ! 4 different codons A2. Ensure that no encoding is a prefix of another. E • prefix of I, S a 00 I •• prefix of S ! 2 bits suffice c 01 char code V •••- t 10 Ex. custom prefix-free code a 0 g 11 2N bits to encode b 111 genome with N codons c 1010 a c t a c a g a t g a d 100 00 01 10 00 01 00 11 00 10 11 00 r 110 ! 1011 28 bits in code ! Amazing but true: initial databases in 1990s did not use such a code! a b r a c a d a b r a ! 0 1 1 1 1 1 0 0 1 0 1 0 0 1 0 0 0 1 1 1 1 1 0 0 1 0 1 1

Decoder needs to know the code ! can amortize over large number of files with the same code Note 1: fixed-length codes are prefix-free ! in general, can encode an N-char file with N $lg M% + 16 $lg M% bits Note 2: can amortize cost of including the code over similar messages 9 11
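Property A2 can be verified mechanically. A small sketch (not from the slides): after sorting, a codeword that is a prefix of another sorts immediately before some codeword extending it.

    import java.util.Arrays;

    public class PrefixFree
    {
        public static boolean isPrefixFree(String[] codes)
        {
            String[] a = codes.clone();
            Arrays.sort(a);                    // a prefix sorts directly before its extensions
            for (int i = 0; i + 1 < a.length; i++)
                if (a[i + 1].startsWith(a[i])) return false;
            return true;
        }

        public static void main(String[] args)
        {   // the custom prefix-free code above (a, b, c, d, r, !)
            String[] code = { "0", "111", "1010", "100", "110", "1011" };
            System.out.println(isPrefixFree(code));    // true
        }
    }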

Variable Length Encoding

Use different number of bits to encode different characters.

Ex. Morse code.

Issue: ambiguity. What does • • • - - - • • • mean: SOS? IAMIE? EEWNI? V7O?

Prefix-Free Code: Encoding and Decoding

How to represent? Use a binary trie.
• Symbols are stored in leaves.
• Encoding is path from root to leaf.

    char  encoding
    a     0
    b     111
    c     1010
    d     100
    r     110
    !     1011

[Trie figure: a is the left child of the root (0); d (100), r (110), b (111) sit at depth 3; c (1010) and ! (1011) at depth 4.]

Encoding.
• Method 1: start at leaf; follow path up to the root, and print bits in reverse order.
• Method 2: create ST of symbol-encoding pairs.

Decoding.
• Start at root of tree.
• Go left if bit is 0; go right if 1.
• If leaf node, print symbol and return to root.

How to transmit the trie

How to transmit the trie?
• Send preorder traversal of trie.
  – we use * as sentinel for internal nodes.
  – what if there is no sentinel?
• Send number of characters to decode.
• Send bits (packed 8 to the byte). (Real applications use bits instead of chars.)

    preorder traversal of trie:  *a**d*c!*rb
    # chars to decode:           12
    the message bits:            0111110010100100011111001011

If message is long, overhead of transmitting trie is small.

Prefix-Free Decoding Implementation

    public void decode()
    {
        int N = StdIn.readInt();
        for (int i = 0; i < N; i++)
        {
            Node x = root;
            while (x.isInternal())
            {
                char bit = StdIn.readChar();
                if      (bit == '0') x = x.left;
                else if (bit == '1') x = x.right;
            }
            System.out.print(x.ch);
        }
    }

Ex. The input 12 0111110010100100011111001011 decodes to a b r a c a d a b r a !.
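The sending side of "preorder traversal with * sentinels" is symmetric to decode(). A minimal sketch (not on the slides), assuming the Node type defined on the next slide:

    private void writeTrie(Node x)
    {
        if (x.isInternal())
        {
            StdOut.print('*');        // sentinel for internal node
            writeTrie(x.left);
            writeTrie(x.right);
        }
        else StdOut.print(x.ch);      // leaf: emit the symbol itself
    }

Calling writeTrie(root) on the trie above prints *a**d*c!*rb.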


Prefix-Free Decoding Implementation

    public class HuffmanDecoder
    {
        private Node root = new Node();

        private class Node
        {
            char ch;
            Node left, right;

            Node()                     // build tree from preorder traversal *a**d*c!*rb
            {
                ch = StdIn.readChar();
                if (ch == '*')
                {
                    left  = new Node();
                    right = new Node();
                }
            }

            boolean isInternal()       // body elided on the slide; internal nodes have both children
            { return left != null && right != null; }
        }
    }

Introduction to compression: summary

Variable-length codes can provide better compression than fixed-length.

7-bit ASCII (84 bits):
    a b r a c a d a b r a !
    1100001 1100010 1110010 1100001 1100011 1100001 1100100 1100001 1100010 1110010 1100001 1111111

3-bit custom code (36 bits):
    a b r a c a d a b r a !
    000 001 100 000 010 000 011 000 001 100 000 111

Prefix-free code (28 bits):
    a b r a c a d a b r a !
    0 111 110 0 1010 0 100 0 111 110 0 1011

Every trie defines a variable-length code.

Q. What is the best variable length code for a given message?

Huffman coding example

[Figure: Huffman trie construction for a b r a c a d a b r a !. Start with one node per symbol, weighted by frequency (a:5, b:2, r:2, c:1, d:1, !:1). Repeatedly merge the two minimum-weight tries: c + ! (weight 2), then d + that trie (weight 3), then r + b (weight 4), then the weight-3 and weight-4 tries (weight 7), and finally a (weight 5) + the weight-7 trie, giving a single trie of weight 12 that defines the prefix-free code above.]

Huffman Coding

Q. What is the best variable length code for a given message?
A. Huffman code. [David Huffman, 1950]

To compute Huffman code:
• count frequency p_s for each symbol s in message.
• start with one node corresponding to each symbol s (with weight p_s).
• repeat until single trie formed:
  – select two tries with min weight p1 and p2.
  – merge into single trie with weight p1 + p2.

Applications. JPEG, MP3, MPEG, PKZIP, GZIP, …

Huffman trie construction code

    int[] freq = new int[128];                   // tabulate frequencies
    for (int i = 0; i < input.length(); i++)
        freq[input.charAt(i)]++;

    MinPQ pq = new MinPQ();                      // initialize PQ
    for (int i = 0; i < 128; i++)
        if (freq[i] > 0)
            pq.insert(new Node((char) i, freq[i], null, null));

    while (pq.size() > 1)                        // merge trees
    {
        Node x = pq.delMin();
        Node y = pq.delMin();
        Node parent = new Node('*', x.freq + y.freq, x, y);
        // '*' = internal node marker; weight = total frequency; x, y = two subtrees
        pq.insert(parent);
    }
    root = pq.delMin();
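Once the trie is built, the codeword table used for encoding can be derived in one traversal. A sketch in the same style (the method name and String-array table are illustrative, not the lecture's code):

    private void buildCode(String[] table, Node x, String s)
    {
        if (x.isInternal())
        {
            buildCode(table, x.left,  s + '0');   // left edge appends 0
            buildCode(table, x.right, s + '1');   // right edge appends 1
        }
        else table[x.ch] = s;                     // leaf: record codeword for symbol
    }

After buildCode(table, root, ""), encoding is a single scan that appends table[ch] for each input char.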


Huffman encoding

Theorem. [Huffman] Huffman coding is an optimal prefix-free code (no prefix-free code uses fewer bits).

Implementation.
• pass 1: tabulate symbol frequencies and build trie.
• pass 2: encode file by traversing trie or lookup table.

Running time. Use binary heap ⇒ O(M + N log N), where M = output bits and N = distinct symbols.

Can we do better? [stay tuned]

An application: compress a bitmap

Typical black-and-white scanned image.
• 300 pixels/inch, 8.5-by-11 inches.
• 300 × 8.5 × 300 × 11 = 8.415 million bits.
• Bits are mostly white.

Typical amount of text on a page: 40 lines × 75 chars per line = 3000 chars.


Natural encoding of a bitmap

one bit per pixel

000000000000000000000000000011111111111111000000000
000000000000000000000000001111111111111111110000000
000000000000000000000001111111111111111111111110000
000000000000000000000011111111111111111111111111000
000000000000000000001111111111111111111111111111110
000000000000000000011111110000000000000000001111111
000000000000000000011111000000000000000000000011111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000001111000000000000000000000001110
000000000000000000000011100000000000000000000111000
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011000000000000000000000000000000000000000000000011

19-by-51 raster of letter 'q' lying on its side

Run-length encoding of a bitmap

• natural encoding: (19 × 51) + 6 = 975 bits.
• run-length encoding: (63 × 6) + 6 = 384 bits. (63 6-bit run lengths, plus 6 bits to encode the number of bits per line.)

Run lengths for the 19-by-51 raster of 'q' (runs alternate between white and black):
28 14 9 / 26 18 7 / 23 24 4 / 22 26 3 / 20 30 1 / 19 7 18 7 / 19 5 22 5 / 19 3 26 3 / 19 3 26 3 / 19 3 26 3 / 19 3 26 3 / 20 4 23 3 1 / 22 3 20 3 3 / 1 50 / 1 50 / 1 50 / 1 50 / 1 50 / 1 2 46 2

Run-length encoding and Huffman codes in the wild

ITU-T T4 Group 3 Fax for black-and-white bitmap images (~1980).
• up to 1728 pixels per line.
• typically mostly white.

Step 1. Use run-length encoding.
Step 2. Encode run lengths using two Huffman codes: one for white and one for black.

    run    white      black
    0      00110101   0000110111
    1      000111     010
    2      0111       11
    3      1000       10
    ...    ...        ...
    63     00110100   000001100111
    64+    11011      0000001111
    128+   10010      000011001000
    ...    ...        ...
    1728+  010011011  0000001100101

Ex. 3W 1B 2W 2B 194W …  →  1000 010 0111 11 010111 0111 …  (194 = 192 + 2: the makeup code for 192 is followed by the code for 2.)

Huffman codes built from frequencies in huge sample.

Run-length encoding

• Exploit long runs of repeated characters.
• Bitmaps: runs alternate between 0 and 1; just output run lengths.
• Issue: how to encode run lengths. Ex. 001001001001001 (15 bits) has run lengths 2 1 2 1 2 1 2 1 2 1; encoding each length in 2 bits ('10' for 2, '01' for 1) yields 10 01 10 01 10 01 10 01 10 01 (20 bits).
• Does not compress when runs are short.

Runs are long in typical applications (like black-and-white bitmaps). A short Java sketch of such an encoder follows.
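A minimal run-length encoder sketch for the bitmap convention above (assumptions: runs alternate starting with 0, and run lengths are printed as decimal numbers rather than packed into fixed-width bit fields):

    public class RunLength
    {
        public static void encode(String bits)
        {
            char cur = '0';                        // runs alternate, starting with 0
            int run = 0;
            for (int i = 0; i < bits.length(); i++)
            {
                if (bits.charAt(i) == cur) run++;
                else
                {
                    System.out.print(run + " ");   // emit run length, switch bit value
                    cur = bits.charAt(i);
                    run = 1;
                }
            }
            System.out.println(run);
        }

        public static void main(String[] args)
        {   // first line of the 'q' raster above: prints 28 14 9
            encode("000000000000000000000000000011111111111111000000000");
        }
    }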

BW bitmap compression: another approach

Fax machine (~1980).
• slow scanner produces lines in sequential order.
• compress to save time (reduce number of bits to send).

Electronic documents (~2000).
• high-resolution scanners produce huge files.
• compress to save space (reduce number of bits to save).

Idea:
• use OCR to get back to ASCII (!)
• use Huffman on ASCII string (!)

Ex. Typical page.
• 40 lines, 75 chars/line ≈ 3000 chars.
• compress to ≈ 2000 chars with Huffman code.
• reduce file size by a factor of 500 (!?)

Bottom line: any extra information about a file can yield dramatic gains.

Perpetual Motion Machines

Universal data compression algorithms are the analog of perpetual motion machines.

[Figures: "Closed-cycle mill" by Robert Fludd, 1618, and "Gravity engine" by Bob Schadewald.]

Reference: Museum of Unworkable Devices by Donald E. Simanek http://www.lhup.edu/~dsimanek/museum/unwork.htm


What data can be compressed?

US Patent 5,533,051 on "Methods for Data Compression", which is capable of compressing all files.

Slashdot reports of the Zero Space Tuner™ and BinaryAccelerator™:

"ZeoSync has announced a breakthrough in data compression that allows for 100:1 lossless compression of random data. If this is true, our bandwidth problems just got a lot smaller.…"

Theorem. Impossible to losslessly compress all files.

Pf 1. (see the worked count below)
• consider all 1,000-bit messages.
• 2^1000 possible messages.
• only 2^999 + 2^998 + … + 1 can be encoded with ≤ 999 bits.
• only 1 in 2^499 can be encoded with ≤ 500 bits!

Pf 2 (by contradiction).
• given a file M, compress it to get a smaller file M1.
• compress that file to get a still smaller file M2.
• continue until reaching file size 0.
• implication: all files can be compressed with 0 bits!

Practical test for any compression algorithm:
• given a file M, compress it to get a (smaller, you hope) file M1.
• compress that file to get a still smaller file M2.
• continue until file size does not decrease.
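To make the counting in Pf 1 concrete: the number of nonempty bitstrings with at most 999 bits is the geometric sum

    2^999 + 2^998 + … + 2^1 + 2^0 = 2^1000 − 1 < 2^1000,

so there are strictly fewer short codewords than 1,000-bit messages, and some message cannot shrink. Likewise at most 2^501 − 1 ≈ 2^501 messages can be encoded with ≤ 500 bits, a fraction 2^501 / 2^1000 = 2^−499 of all 2^1000 messages: only about 1 in 2^499.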

A difficult file to compress

Intrinsic difficulty of compression.
• Short program generates large data file.
• Optimal compression algorithm has to discover program!
• Undecidable problem.

Q. How do we know if our algorithm is doing well?
A. Want lower bound on # bits required by any compression scheme.

Ex. One million pseudo-random characters (a-p):

    fclkkacifobjofmkgdcoiicnfmcpcjfccabckjamolnihkbgobcjbngjiceeelpfgcjiihppenefllhglfemdemgahlbpi
    ggmllmnefnhjelmgjncjcidlhkglhceninidmmgnobkeglpnadanfbecoonbiehglmpnhkkamdffpacjmgojmcaabpcjce
    cplfbgamlidceklhfkkmioljdnoaagiheiapaimlcnlljniggpeanbmojgkccogpmkmoifioeikefjidbadgdcepnhdpfj
    …
    cehglelckbhjilafccfipgebpc…

A difficult file to compress

    public class Rand
    {
        public static void main(String[] args)
        {
            for (int i = 0; i < 1000000; i++)
            {
                char c = 'a';
                c += (char) (Math.random() * 16);
                System.out.print(c);
            }
        }
    }

231 bytes, but output is hard to compress (assume random seed is fixed).

    % javac Rand.java
    % java Rand > temp.txt
    % compress -c temp.txt > temp.Z
    % gzip     -c temp.txt > temp.gz
    % bzip2    -c temp.txt > temp.bz2

    % ls -l
        231  Rand.java
    1000000  temp.txt
     576861  temp.Z
     570872  temp.gz
     499329  temp.bz2

(resulting file sizes, in bytes)

Language model

Q. How do compression algorithms work?
A. They exploit statistical biases of input messages.
• Ex: white patches occur in typical images.
• Ex: word Princeton occurs more frequently than word Yale.

Basis of compression: probability.
• Formulate probabilistic model to predict symbols.
  – simple: character counts, repeated strings.
  – complex: models of a human face.
• Use model to encode message.
• Use same model to decode message.

Ex. Order 0 Markov model.
• R symbols generated independently at random.
• probability of occurrence of i-th symbol: p_i (fixed).
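An order 0 model can be estimated directly from a message by character counts. A minimal sketch (the 128-symbol ASCII alphabet is an assumption carried over from the earlier code):

    public class Order0Model
    {
        public static double[] estimate(String input)
        {
            int[] freq = new int[128];
            for (int i = 0; i < input.length(); i++)
                freq[input.charAt(i)]++;                     // count occurrences
            double[] p = new double[128];
            for (int c = 0; c < 128; c++)
                p[c] = (double) freq[c] / input.length();    // relative frequency estimates p_i
            return p;
        }
    }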

Entropy

A measure of information. [Shannon, 1948]

    H(M) = p_0 lg(1/p_0) + p_1 lg(1/p_1) + … + p_{R-1} lg(1/p_{R-1})

• information content of symbol s is proportional to lg(1/p(s)).
• weighted average of information content over all symbols.
• interface between coding and model.

Ex. 4 binary models (R = 2):

    model  p0     p1     H(M)
    1      1/2    1/2    1
    2      0.900  0.100  0.469
    3      0.990  0.010  0.0808
    4      1      0      0

Ex. fair die (R = 6):

    p(1)  p(2)  p(3)  p(4)  p(5)  p(6)  H(M)
    1/6   1/6   1/6   1/6   1/6   1/6   2.585

Entropy of the English Language

Q. How much redundancy is in the English language?

"... randomising letters in the middle of words [has] little or no effect on the ability of skilled readers to understand the text. This is easy to denmtrasote. In a pubiltacion of New Scnieitst you could ramdinose all the letetrs, keipeng the first two and last two the same, and reibadailty would hadrly be aftcfeed. My ansaylis did not come to much beucase the thoery at the time was for shape and senqeuce retigcionon. Saberi's work sugsegts we may have some pofrweul palrlael prsooscers at work. The resaon for this is suerly that idnetiyfing coentnt by paarllel prseocsing speeds up regnicoiton. We only need the first and last two letetrs to spot chganes in meniang."

A. Quite a bit.
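The H(M) values in the binary-model and fair-die tables can be checked with a few lines (using lg x = ln x / ln 2; a zero-probability symbol contributes nothing):

    public class Entropy
    {
        public static double entropy(double[] p)
        {
            double h = 0.0;
            for (double pi : p)
                if (pi > 0)
                    h += pi * (Math.log(1.0 / pi) / Math.log(2.0));
            return h;
        }

        public static void main(String[] args)
        {
            System.out.println(entropy(new double[] { 0.5, 0.5 }));   // 1.0
            System.out.println(entropy(new double[] { 0.9, 0.1 }));   // 0.469
            double d = 1.0 / 6.0;                                     // fair die
            System.out.println(entropy(new double[] { d, d, d, d, d, d }));  // 2.585
        }
    }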


Entropy and compression

Theorem. [Shannon, 1948] If data source is an order 0 Markov model, any compression scheme must use ≥ H(M) bits per symbol on average.
• Cornerstone result of information theory.
• Ex: to transmit results of fair die, need ≥ 2.58 bits per roll.

Theorem. [Huffman, 1952] If data source is an order 0 Markov model, Huffman code uses ≤ H(M) + 1 bits per symbol on average.

Q. Is there any hope of doing better than Huffman coding?
A1. Yes. Huffman wastes up to 1 bit per symbol.
  – if H(M) is close to 0, this difference matters.
  – can do better with "arithmetic coding".
A2. Yes. Source may not be order 0 Markov model.

Entropy of the English Language

Q. How much information is in each character of the English language?
Q. How can we measure it?

A. [Shannon's 1951 experiment]
• Asked subjects to predict next character given previous text (model = English text).
• The number of guesses required for right answer:

    # of guesses  1     2     3     4     5     ≥ 6
    fraction      0.79  0.08  0.03  0.02  0.02  0.05

• Shannon's estimate: about 1 bit per char [0.6 - 1.3].

Compression less than 1 bit/char for English? If not, keep trying!

LZW Algorithm

Lempel-Ziv-Welch. [variant of LZ78]
• Create ST associating a fixed-length codeword with some previous substring.
• When input matches string in ST, output associated codeword.
• Length of strings in ST grows, hence compression.

To send (encode) M:
• Find longest string s in ST that is a prefix of unsent part of M.
• Send codeword associated with s.
• Add s + x to ST, where x is next char in M.

Ex. ST: a, aa, ab, aba, abb, abaa, abaab, abaaa.
• unsent part of M: abaababbb…
• s = abaab, x = a.
• Output codeword associated with s; insert abaaba into ST.
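These rules translate directly into a short program. The sketch below uses a java.util.HashMap for the ST; the lecture's implementation (later slides) uses a specialized TST and emits fixed-width binary codewords instead. The class name and List-of-Integer output are illustrative; the 128 ASCII entries and STOP = 255 follow the slides.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SimpleLZW
    {
        public static List<Integer> encode(String input)
        {
            Map<String, Integer> st = new HashMap<String, Integer>();
            for (int c = 0; c < 128; c++)              // initialize ST with ASCII
                st.put("" + (char) c, c);
            int next = 128;                            // next codeword to assign
                                                       // (overflow check omitted, as in the lecture code)
            List<Integer> out = new ArrayList<Integer>();
            int i = 0;
            while (i < input.length())
            {
                String s = "" + input.charAt(i);       // find longest prefix s in ST
                while (i + s.length() < input.length()
                       && st.containsKey(s + input.charAt(i + s.length())))
                    s = s + input.charAt(i + s.length());
                out.add(st.get(s));                    // send codeword associated with s
                if (i + s.length() < input.length())   // add s + next char to ST
                    st.put(s + input.charAt(i + s.length()), next++);
                i += s.length();
            }
            out.add(255);                              // STOP
            return out;
        }
    }

encode("abracadabracadabra") yields 97 98 114 97 99 97 100 128 130 132 134 129 97 255, matching the encoding example on the next slide.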

Statistical Methods

Static model. Same model for all texts.
• Fast.
• Not optimal: different texts have different statistical properties.
• Ex: ASCII, Morse code.

Dynamic model. Generate model based on text.
• Preliminary pass needed to generate model.
• Must transmit the model.
• Ex: Huffman code.

Adaptive model. Progressively learn and update model as you read text.
• More accurate modeling produces better compression.
• Decoding must start from beginning.
• Ex: LZW.

LZW encoding example

Input: 7-bit ASCII chars. Output: 8-bit codewords. (ST initially holds codewords 0-127 for the single ASCII chars; 255 is STOP.)

    input  code  add to ST
    a      97    ab (128)
    b      98    br (129)
    r      114   ra (130)
    a      97    ac (131)
    c      99    ca (132)
    a      97    ad (133)
    d      100   da (134)
    ab     128   abr (135)
    ra     130   rac (136)
    ca     132   cad (137)
    da     134   dab (138)
    br     129   bra (139)
    a      97
    STOP   255

LZW encoding example

Input: 7-bit ASCII, 19 chars (133 bits). Output: 8-bit codewords, 14 codewords (112 bits).

    input:  a  b  r   a  c  a  d   ab  ra  ca  da  br  a  STOP
    code:   97 98 114 97 99 97 100 128 130 132 134 129 97 255

LZW encoder: Java implementation

    public class LZWEncoder
    {
        public static void main(String[] args)
        {
            LookAheadIn in = new LookAheadIn();   // input stream with lookahead
            LZWst st = new LZWst();               // specialized TST

            while (!in.isEmpty())
            {
                int codeword = st.getput(in);     // encode text and build TST
                StdOut.println(codeword);         // postprocess to encode in binary
            }
        }
    }

Use specialized TST:
• initialized with ASCII chars and codes.
• getput() method returns code of longest prefix s and adds s + next char to symbol table.

Key point: no need to send ST (!)
Need input stream with backup. [stay tuned]

LZW encode ST implementation

Q. How to do longest prefix match?
A. Use a trie for the ST.

Encode.
• lookup string suffix in trie.
• output ST index at bottom.
• add new node to bottom of trie.

Note that all substrings are in ST. (ASCII ST: keys 0-127 are the single chars, 255 is STOP; new entries: ab 128, br 129, ra 130, ac 131, ca 132, ad 133, da 134, abr 135, rac 136, cad 137, dab 138, bra 139.)

LZW encoder: Java implementation (TST scaffolding)

    public class LZWst
    {
        private int i;              // next codeword to assign
        private int codeword;       // codeword to return
        private Node[] roots;       // array of TSTs

        public LZWst()              // initialize with ASCII
        {
            roots = new Node[128];
            for (i = 0; i < 128; i++)
                roots[i] = new Node((char) i, i);
        }

        private class Node          // standard node code
        {
            Node(char c, int codeword)
            { this.c = c; this.codeword = codeword; }
            char c;
            Node left, mid, right;
            int codeword;
        }

        public int getput(LookAheadIn in)   // see next slide
    }

LZW encoder: Java implementation (TST search/insert)

Caution: tricky recursive code.

    public int getput(LookAheadIn in)
    {
        char c = in.readChar();
        if (c == '!') return 255;
        roots[c] = getput(c, roots[c], in);
        in.backup();                        // return codeword of longest prefix
        return codeword;
    }

    public Node getput(char c, Node x, LookAheadIn in)   // recursive search and insert
    {
        if (x == null)
        { x = new Node(c, i++); return x; }
        if      (c < x.c) x.left  = getput(c, x.left,  in);
        else if (c > x.c) x.right = getput(c, x.right, in);
        else
        {
            char next = in.readChar();
            codeword = x.codeword;
            x.mid = getput(next, x.mid, in);
        }
        return x;
    }

(Check for codeword overflow omitted.)

LZW Algorithm

Lempel-Ziv-Welch. [variant of LZ78]
• Create ST and associate an integer with each useful string.
• When input matches string in ST, output associated integer.
• Length of strings in ST grows, hence compression.
• Decode by rebuilding ST from code.

To send (encode) M:
• Find longest string s in ST that is a prefix of unsent part of M.
• Send integer associated with s.
• Add s + x to ST, where x is next char in M.

To decode received message to M:
• Let s be ST entry associated with received integer.
• Add s to M.
• Add p + x to ST, where x is first char in s and p is previous value of s.
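Those decode rules also fit in a few lines. An array-based decode counterpart to the SimpleLZW sketch earlier (the 4096-entry table and STOP handling are assumptions; the lecture's own decoder appears on a later slide):

    public static String decode(java.util.List<Integer> codewords)
    {
        String[] st = new String[4096];
        int next = 128;                           // next codeword to assign
        for (int c = 0; c < 128; c++)
            st[c] = "" + (char) c;

        StringBuilder out = new StringBuilder();
        String prev = "";
        for (int codeword : codewords)
        {
            if (codeword == 255) break;           // STOP
            String s;
            if (codeword == next)                 // tricky situation (see below)
                s = prev + prev.charAt(0);
            else
                s = st[codeword];
            out.append(s);                        // add s to M
            if (prev.length() > 0)
                st[next++] = prev + s.charAt(0);  // add p + x to ST
            prev = s;
        }
        return out.toString();
    }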

LZW encoder: Java implementation (input stream with lookahead)

    public class LookAheadIn
    {
        In in = new In();
        char last;
        boolean backup = false;

        public void backup()
        { backup = true; }

        public char readChar()
        {
            if (!backup)
            { last = in.readChar(); }
            backup = false;
            return last;
        }

        public boolean isEmpty()
        { return !backup && in.isEmpty(); }
    }

Provides input stream with one-character lookahead: a backup() call means that the last readChar() call was a lookahead.

LZW decoding example

Role of keys and values switched: use an array to implement the ST.

    codeword  output  add to ST
    97        a
    98        b       ab (128)
    114       r       br (129)
    97        a       ra (130)
    99        c       ac (131)
    97        a       ca (132)
    100       d       ad (133)
    128       ab      da (134)
    130       ra      abr (135)
    132       ca      rac (136)
    134       da      cad (137)
    129       br      dab (138)
    97        a       bra (139)
    255       STOP

To decode received message to M:
• Let s be ST entry associated with received integer.
• Add s to M.
• Add p + x to ST, where x is first char in s and p is previous value of s.

LZW decoder: Java implementation

    public class LZWDecoder
    {
        public static void main(String[] args)
        {
            String[] st = new String[256];        // initialize ST with ASCII
            int i;
            for (i = 0; i < 128; i++)
            { st[i] = Character.toString((char) i); }
            st[255] = "!";

            String prev = "";
            while (!StdIn.isEmpty())              // decode text and build ST
            {
                int codeword = StdIn.readInt();   // preprocess to decode from binary
                String s;
                if (codeword == i)                // tricky situation!
                    s = prev + prev.charAt(0);
                else
                    s = st[codeword];
                StdOut.print(s);
                if (prev.length() > 0)
                { st[i++] = prev + s.charAt(0); }
                prev = s;
            }
            StdOut.println();
        }
    }

LZW implementation details

How big to make ST?
• how long is message?
• whole message similar model?
• …
• [many variations have been developed]

What to do when ST fills up?
• throw away and start over. [GIF]
• throw away when not effective. [Unix compress]
• Ex: ababababab …
• [many other variations]

Why not put longer substrings in ST?
• …
• [many variations have been developed]

LZW decoding example (tricky situation)

Encode input abababab (sent as a, b, ab, aba, b): output 97 98 128 130 98 255; ST additions ab (128), ba (129), aba (130), abab (131).

Decode:

    codeword  output  add to ST
    97        a
    98        b       ab (128)
    128       ab      ba (129)
    130       aba     aba (130)    codeword 130 is needed before it is added to the ST!
    98        b       abab (131)
    255       STOP

When the received codeword equals the next one to be assigned, the decoder knows the missing entry must be prev + first char of prev (here ab + a = aba).

LZW in the real world

Lempel-Ziv and friends.
• LZ77. (not patented ⇒ widely used in open source)
• LZ78.
• LZW. (patent #4,558,302 expired in US on June 20, 2003; some versions copyrighted)
• deflate = LZ77 variant + Huffman. (never expands a file)

PNG: LZ77.
Winzip, gzip, jar: deflate.
Unix compress: LZW.
Pkzip: LZW + Shannon-Fano.
GIF, TIFF, V.42bis modem: LZW.
Google: zlib, which is based on deflate.

Lossless compression ratio benchmarks

Calgary corpus: standard data compression benchmark.

    Year  Scheme           Bits/char
    1967  ASCII            7.00
    1950  Huffman          4.70
    1977  LZ77             3.94
    1984  LZMW             3.32
    1987  LZH              3.30
    1987  Move-to-front    3.24
    1987  LZB              3.18
    1987  Gzip             2.71
    1988  PPMC             2.48
    1988  SAKDC            2.47
    1994  PPM              2.34
    1995  Burrows-Wheeler  2.29    (next assignment)
    1997  BOA              1.99
    1999  RK               1.89

Entropy estimates for English text:

    Model              Bits/char
    Char by char       4.5
    8 chars at a time  2.4
    Asymptotic         1.3

Data compression summary

Lossless compression.
• Represent fixed-length symbols with variable-length codes. [Huffman]
• Represent variable-length symbols with fixed-length codes. [LZW]

Lossy compression. [not covered in this course]
• JPEG, MPEG, MP3.
• FFT, wavelets, fractals, SVD, …

Limits on compression. Shannon entropy. Theoretical limits closely match what we can achieve in practice!

Practical compression. Use extra knowledge whenever possible.

Butch: I don’t mean to be a sore loser, but when it’s done, if I’m dead, kill him. Sundance: Love to. Butch: No, no, not yet. Not until me and Harvey get the rules straightened out. Harvey: Rules? In a knife fight? No rules. Butch: Well, if there ain’t going to be any rules, let’s get the fight started...
