Compression Outline

CPS 296.3: Algorithms in the Real World

• Introduction: Lossy vs. Lossless, Benchmarks, …
• Information Theory: Entropy, etc.
• Probability Coding: Huffman + Arithmetic Coding
• Applications of Probability Coding: PPM + others
• III. Lempel-Ziv Algorithms:
  – LZ77, gzip
  – LZ78, compress (Not covered in class)
• Other Lossless Algorithms: Burrows-Wheeler
• Lossy algorithms for images: JPEG, MPEG, ...
• Compressing graphs and meshes: BBK

296.3 Page 1 296.3 Page 2

Lempel-Ziv Algorithms

• LZ77 (Sliding Window)
  – Variants: LZSS (Lempel-Ziv-Storer-Szymanski)
  – Applications: gzip, Squeeze, LHA, PKZIP, ZOO
• LZ78 (Dictionary Based)
  – Variants: LZW (Lempel-Ziv-Welch), LZC
  – Applications: compress, GIF, CCITT (modems), ARC, PAK
• Traditionally LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.

LZ77 (Sliding Window)

                Cursor
                  ↓
  a a c a a c a b | c a b a b a c
  ← Dictionary  →   ← Lookahead →
  (previously coded)    Buffer

• Dictionary and buffer “windows” are fixed length and slide with the cursor.
• Repeat:
  – Output (p, l, c) where
      p = position of the longest match that starts in the dictionary (relative to the cursor)
      l = length of the longest match
      c = next char in the buffer beyond the longest match
  – Advance the window by l + 1.
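The loop above can be sketched in Python (a minimal illustration with our own function name and the toy window sizes from the example, not gzip's actual code):

```python
def lz77_encode(s, dict_size=6, buf_size=4):
    """LZ77 loop sketch: repeatedly emit (p, l, c) and advance by l + 1.

    p = distance back from the cursor to the start of the match,
    l = match length (at most buf_size),
    c = the next character beyond the match.
    """
    out = []
    cursor = 0
    while cursor < len(s):
        best_p, best_l = 0, 0
        # the match must START in the dictionary window ...
        for start in range(max(0, cursor - dict_size), cursor):
            l = 0
            # ... but it may run past the cursor into the lookahead buffer
            while (l < buf_size and cursor + l < len(s) - 1
                   and s[start + l] == s[cursor + l]):
                l += 1
            if l > best_l:
                best_p, best_l = cursor - start, l
        out.append((best_p, best_l, s[cursor + best_l]))
        cursor += best_l + 1
    return out
```

On the example string this produces (0, 0, a), (1, 1, c), (3, 4, b), (3, 3, a), (1, 2, c), writing (0, 0, a) where the slides write (_, 0, a).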


LZ77: Example

Input: a a c a a c a b c a b a a a c
(Dictionary size = 6, Buffer size = 4)

Outputs (position, length, next char) as the window slides:
  (_, 0, a)   no match yet; emit the first a
  (1, 1, c)   match “a” one back; next char c
  (3, 4, b)   match “aaca” starting 3 back (runs past the cursor into the buffer); next char b
  (3, 3, a)   match “cab” 3 back; next char a
  (1, 2, c)   match “aa” 1 back; next char c

LZ77 Decoding

• Decoder keeps the same dictionary window as the encoder.
• For each message it looks up the match in the dictionary and inserts a copy at the end of the string.
• What if l > p? (Only part of the message is in the dictionary.)
  E.g. dict = abcd, codeword = (2, 9, e)
  – Simply copy from left to right:
      for (i = 0; i < length; i++)
        out[cursor + i] = out[cursor - offset + i];
  – Out = abcdcdcdcdcdce
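The left-to-right copy is exactly what makes l > p work. A sketch of the decoder (our own helper name), replaying a whole sequence of (p, l, c) triples:

```python
def lz77_decode(codewords):
    """Replay (p, l, c) triples; the char-by-char left-to-right copy
    handles overlapping matches (l > p) just like the C loop above."""
    out = []
    for p, l, c in codewords:
        start = len(out) - p
        for i in range(l):
            # when l > p this re-reads characters appended earlier
            # during the same copy
            out.append(out[start + i])
        out.append(c)
    return "".join(out)
```

Feeding it dict = abcd followed by (2, 9, e) reproduces the example, and it round-trips the encoder's output for the example string.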


LZ77 Optimizations used by gzip

• LZSS: Output one of the following two formats:
    (0, position, length)  or  (1, char)
• Uses the second format if length < 3.

On the same input, a a c a a c a b c a b a a a c:
  (1, a)      literal a
  (1, a)      literal a (best match is shorter than 3)
  (1, c)      literal c
  (0, 3, 4)   match of length 4, starting 3 back
  …

Optimizations used by gzip (cont.)

1. Huffman code the positions, lengths, and chars.
2. Non-greedy: possibly use a shorter match so that the next match is better.
3. Use a hash table to store the dictionary.
   – Hash keys are all strings of length 3 in the dictionary window.
   – Find the longest match within the correct hash bucket.
   – Puts a limit on the length of the search within a bucket.
   – Within each bucket, store in order of position.
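The two-format LZSS output can be sketched as follows (our own minimal version with the toy sizes from the example; real gzip uses a 32 KB window, a 258-byte maximum match, and the hash table described in item 3):

```python
def lzss_encode(s, dict_size=6, buf_size=4, min_len=3):
    """LZSS sketch: emit (0, position, length) for matches of length
    >= min_len, and (1, char) literals otherwise."""
    out = []
    cursor = 0
    while cursor < len(s):
        best_p, best_l = 0, 0
        for start in range(max(0, cursor - dict_size), cursor):
            l = 0
            while (l < buf_size and cursor + l < len(s)
                   and s[start + l] == s[cursor + l]):
                l += 1
            if l > best_l:
                best_p, best_l = cursor - start, l
        if best_l >= min_len:
            out.append((0, best_p, best_l))
            cursor += best_l        # no trailing char in this format
        else:
            out.append((1, s[cursor]))
            cursor += 1
    return out
```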


The Hash Table

Dictionary window (positions 7–21):

  … 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 …
    a a c a  a  c  a  b  c  a  b  a  a  a  c

Buckets of length-3 strings, most recent position first:
  a a c → 19, 10, 7
  a c a → 11, 8
  c a a → 9
  c a b → 15, 12
  …

Theory behind LZ77

[A. D. Wyner and J. Ziv, “The Sliding-Window Lempel-Ziv Algorithm is Asymptotically Optimal,” Proceedings of the IEEE, Vol. 82, No. 6, June 1994.]

• Will compress long enough strings to the source entropy as the window size goes to infinity.
• Source entropy for a substring of length n is given by:

    H_n = Σ_{X ∈ Aⁿ} p(X) log(1/p(X))

• Uses a logarithmic code (e.g. the gamma code) for the position.
• Problem: “long enough” is really, really long.
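The bucket table above can be reproduced with a short sketch (hypothetical helper; gzip itself hashes the 3 bytes down to a bucket index rather than keying on the string):

```python
def hash_buckets(window, base=0):
    """Bucket every length-3 substring of the dictionary window,
    keeping positions most-recent-first, as in the table above."""
    buckets = {}
    for i in range(len(window) - 2):
        # insert at the front so each bucket lists recent positions first
        buckets.setdefault(window[i:i + 3], []).insert(0, base + i)
    return buckets
```

With `base=7` this matches the positions shown for the example window.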


Comparison to Lempel-Ziv 78

Both LZ77 and LZ78 and their variants keep a “dictionary” of recent strings that have been seen. The differences are:
• How the dictionary is stored (LZ78 is a trie)
• How it is extended (LZ78 only extends an existing entry by one character)
• How it is indexed (LZ78 indexes the nodes of the trie)
• How elements are removed

Lempel-Ziv Algorithms Summary

• Adapts well to changes in the file (e.g. a tar file with many file types within it).
• Initial algorithms did not use probability coding and performed poorly in terms of compression. More modern versions (e.g. gzip) do use probability coding as a “second pass” and compress much better.
• The algorithms are becoming outdated, but the ideas are used in many of the newer algorithms.
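To make the trie point concrete, here is a minimal LZ78 sketch (our own code): the dictionary is a trie stored as a Python dict, and each output (index, char) extends an existing entry by exactly one character:

```python
def lz78_encode(s):
    """LZ78 sketch: the dictionary is a trie; each (index, char)
    output extends an existing entry by one character."""
    trie = {}          # (parent_index, char) -> child index: a trie in a dict
    out = []
    next_index = 1     # index 0 is the empty phrase (the trie root)
    i = 0
    while i < len(s):
        node = 0
        # walk down the trie as far as the input matches
        while i < len(s) and (node, s[i]) in trie:
            node = trie[(node, s[i])]
            i += 1
        c = s[i] if i < len(s) else ''
        out.append((node, c))
        if c:
            trie[(node, c)] = next_index   # new node = old phrase + c
            next_index += 1
            i += 1
    return out
```

For example, "aacaacabcab" parses into the phrases a, ac, aa, c, ab, ca, b.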


Compression Outline

• Introduction: Lossy vs. Lossless, Benchmarks, …
• Information Theory: Entropy, etc.
• Probability Coding: Huffman + Arithmetic Coding
• Applications of Probability Coding: PPM + others
• Lempel-Ziv Algorithms: LZ77, gzip, compress, …
• Other Lossless Algorithms:
  – Burrows-Wheeler
  – ACB
• Lossy algorithms for images: JPEG, MPEG, ...
• Compressing graphs and meshes: BBK

Burrows-Wheeler

• Currently near the best “balanced” algorithm for text.
• Breaks the file into fixed-size blocks and encodes each block separately.
• For each block:
  – Sort each character by its full context. This is called the block sorting transform.
  – Use the move-to-front transform to encode the sorted characters.
• The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.
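The move-to-front step can be sketched as follows (our own helper; a full Burrows-Wheeler coder would follow it with a Huffman or arithmetic coder):

```python
def move_to_front(chars, alphabet):
    """Replace each character by its index in a table that moves
    just-seen characters to the front, so the runs of similar
    characters typical after block sorting become small numbers."""
    table = sorted(alphabet)
    out = []
    for c in chars:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))   # move c to the front
    return out
```

For instance, the sorted-character string oeecdd from the decode example maps to [3, 3, 0, 2, 3, 0] over the alphabet {c, d, e, o}.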


Burrows-Wheeler: Example

Let’s encode: decode. The context of each character “wraps” around; the last context char is most significant when sorting.

All rotations of input:

  Context       Char
  e c o d e  |  d
  c o d e d  |  e
  o d e d e  |  c
  d e d e c  |  o
  d e c o d  |  e
  e d e c o  |  d

Sort by context:

  Context       Output
  d e d e c  |  o
  c o d e d  |  e
  d e c o d  |  e
  o d e d e  |  c
  e c o d e  |  d   ⇐ first char of input
  e d e c o  |  d

Burrows-Wheeler Decoding

Key idea: the entire sorted table can be constructed from the sorted (output) column alone!
First: sorting the output gives the last column of context:

  Context   Output
  c  |  o
  d  |  e
  d  |  e
  e  |  c
  e  |  d
  o  |  d

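The forward transform of the example can be sketched directly (our own code, following the slides' convention of sorting by the wrapped preceding context with the last context character most significant):

```python
def bwt_encode(s):
    """Block-sorting transform as in the example: sort the characters
    by their wrapped preceding context (last context char compared
    first) and return the characters plus the row of the first char."""
    n = len(s)
    # reversed context of s[i]: s[i-1], s[i-2], ... compared in order,
    # so an ordinary sort compares the last context character first
    rows = sorted(range(n),
                  key=lambda i: [s[(i - k) % n] for k in range(1, n)])
    return "".join(s[i] for i in rows), rows.index(0)
```

For "decode" this yields the output column oeecdd with the pointer at row 4 (0-based), the row marked ⇐ in the table.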

Burrows-Wheeler Decoding

Now sort the pairs formed by the last column of context and the output column; this gives the last two columns of context:

  Context   Output          Context    Output
  c  |  o                   e c  |  o
  d  |  e                   e d  |  e
  d  |  e        →          o d  |  e
  e  |  c                   d e  |  c
  e  |  d                   d e  |  d
  o  |  d                   c o  |  d

Burrows-Wheeler Decoding

Repeat until the entire table is complete. The pointer to the first character provides unique decoding:

  Context       Output
  d e d e c  |  o
  c o d e d  |  e
  d e c o d  |  e
  o d e d e  |  c
  e c o d e  |  d   ⇐
  e d e c o  |  d

The message was d in the first position, preceded in wrapped fashion by ecode: decode.
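The rebuild-the-table procedure can be written directly (our own sketch; it is quadratic and only meant to mirror the slides, not to be efficient):

```python
def bwt_decode_naive(S, start):
    """Rebuild the context table column by column, as above: glue the
    known context columns onto the output column and re-sort, with the
    last context character most significant (stable sort keeps equal
    contexts aligned with their outputs)."""
    n = len(S)
    contexts = [""] * n
    for _ in range(n - 1):
        contexts = sorted((contexts[i] + S[i] for i in range(n)),
                          key=lambda ctx: ctx[::-1])
    # message = char at the pointer, then its wrapped preceding context
    return S[start] + contexts[start]
```

After the first pass `contexts` is the one-character column c, d, d, e, e, o; after the second it is ec, ed, od, de, de, co; after n − 1 passes the table is complete.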


Burrows-Wheeler Decoding

Optimization: we don’t really have to rebuild the whole context table. What character comes after the first character, d₁? Just find d₁ in the last column of context and see what follows it: e₁.

  Context       Output
  d e d e c  |  o
  c o d e d  |  e₁
  d e c o d  |  e₂
  o d e d e  |  c
  e c o d e  |  d₁   ⇐
  e d e c o  |  d₂

Observation: instances of the same output character appear in the same order in the last column of context. (Proof is an exercise.)

Burrows-Wheeler: Decoding

The “rank” of a character is the position it would take if the output column were sorted using a stable sort:

  Context   Output   Rank
  c  |  o  |  6
  d  |  e  |  4
  d  |  e  |  5
  e  |  c  |  1
  e  |  d  |  2
  o  |  d  |  3


Burrows-Wheeler Decode

  function BW_Decode(In, Start, n)
    S = MoveToFrontDecode(In, n)
    R = Rank(S)
    j = Start
    for i = 1 to n do
      Out[i] = S[j]
      j = R[j]

Rank gives the position of each char in sorted order.

Decode Example

  S     Rank(S)
  o₄    6
  e₂    4
  e₆    5
  c₃    1
  d₁    2   ⇐ Start
  d₅    3

  Out: d₁ e₂ c₃ o₄ d₅ e₆   (= decode)
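In Python (0-based indices, our own names; MoveToFrontDecode is assumed to have already run, so the input here is the character string S):

```python
def rank(S):
    """Position each character of S takes under a stable sort."""
    order = sorted(range(len(S)), key=lambda i: S[i])
    R = [0] * len(S)
    for pos, i in enumerate(order):
        R[i] = pos
    return R

def bw_decode(S, start):
    """BW_Decode with 0-based indices: follow rank pointers starting
    from the row that holds the first character of the message."""
    R = rank(S)
    out, j = [], start
    for _ in range(len(S)):
        out.append(S[j])
        j = R[j]
    return "".join(out)
```

With S = oeecdd, rank gives [5, 3, 4, 0, 1, 2] (the slide's 6, 4, 5, 1, 2, 3 shifted to 0-based), and starting at the d₁ row reproduces decode.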


Overview of Text Compression

• PPM and Burrows-Wheeler both encode a single character based on the immediately preceding context.
• LZ77 and LZ78 encode multiple characters based on matches found in a block of preceding text.
• Can you mix these ideas, i.e., code multiple characters based on the immediately preceding context?
  – BZ does this, but they don’t give details on how it works – current best compressor
  – ACB also does this – close to best

ACB (Associative Coder of Buyanovsky)

• Keep the dictionary sorted by context (the last character of the context is the most significant):

  Context    Contents
  dec     |  ode
  d       |  ecode
  decod   |  e
  de      |  code
  deco    |  de

• Find the longest match for the context.
• Find the longest match for the contents.
• Code:
  – the distance between the two matches in the sorted order
  – the length of the contents match
• Has aspects of Burrows-Wheeler and LZ77.

