
15-853: Algorithms in the Real World

Compression Outline

– Introduction: Lossy vs. Lossless, Benchmarks, …
– Information Theory: Entropy, etc.
– Probability Coding: Huffman + Arithmetic Coding
– Applications of Probability Coding: PPM + others
– Lempel-Ziv Algorithms:
  – LZ77
  – LZ78 (not covered in class)
– Other Lossless Algorithms: Burrows-Wheeler
– Lossy algorithms for images: JPEG, MPEG, ...
– Compressing graphs and meshes: BBK


Lempel-Ziv Algorithms

LZ77 (Sliding Window)
– Variants: LZSS (Lempel-Ziv-Storer-Szymanski)
– Applications: gzip, Squeeze, LHA, PKZIP, ARC, PAK

LZ78 (Dictionary Based)
– Variants: LZW (Lempel-Ziv-Welch), LZC
– Applications: compress, GIF, CCITT (modems)

Traditionally LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.

LZ77: Sliding Window Lempel-Ziv

  a a c a a c a b c a b a b a c
  [ Dictionary (previously coded) ][ Cursor ][ Lookahead Buffer ]

Dictionary and buffer "windows" are fixed length and slide with the cursor.

Repeat:
  Output (p, l, c) where
    p = position of the longest match that starts in the dictionary (relative to the cursor)
    l = length of the longest match
    c = next char in buffer beyond the longest match
  Advance window by l + 1
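The Repeat loop above can be sketched directly. This is a minimal illustration rather than any production encoder: the brute-force match search is quadratic, the function name and the default window sizes (6 and 4, matching the example on the next slide) are our own choices, and the first triple uses p = 0 where the slide writes "_".

```python
def lz77_encode(s, dict_size=6, buf_size=4):
    """Greedy LZ77: emit (p, l, c) triples.
    p = distance back from the cursor to the match start,
    l = match length, c = next char beyond the match."""
    out = []
    cursor = 0
    while cursor < len(s):
        best_p, best_l = 0, 0
        lo = max(0, cursor - dict_size)
        # try every match start inside the dictionary window
        for start in range(lo, cursor):
            l = 0
            # a match may run past the cursor into the lookahead buffer
            while (l < buf_size and cursor + l < len(s) - 1
                   and s[start + l] == s[cursor + l]):
                l += 1
            if l > best_l:
                best_p, best_l = cursor - start, l
        c = s[cursor + best_l]
        out.append((best_p, best_l, c))
        cursor += best_l + 1   # advance window by l + 1
    return out
```

Running it on the example string reproduces the triples shown on the example slide.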


LZ77: Example

  a a c a a c a b c a b a a a c   (_,0,a)
  a a c a a c a b c a b a a a c   (1,1,c)
  a a c a a c a b c a b a a a c   (3,4,b)
  a a c a a c a b c a b a a a c   (3,3,a)
  a a c a a c a b c a b a a a c   (1,2,c)

Dictionary (size = 6), Buffer (size = 4); each triple gives (position of longest match, length of longest match, next character).

LZ77 Decoding

Decoder keeps the same dictionary window as the encoder. For each message it looks it up in the dictionary and inserts a copy at the end of the string.

What if l > p? (Only part of the message is in the dictionary.)
E.g. dict = abcd, codeword = (2,9,e)
– Simply copy from left to right:
    for (i = 0; i < length; i++)
      out[cursor+i] = out[cursor-offset+i]
– Out = abcdcdcdcdcdce
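The decoder's left-to-right copy can be sketched as follows (the function name is illustrative). Appending one character at a time is exactly what makes the l > p case work: later copies read characters written earlier in the same match.

```python
def lz77_decode(triples):
    """Rebuild the string from (p, l, c) triples. The left-to-right
    copy handles the l > p case, where the match overlaps itself."""
    out = []
    for p, l, c in triples:
        start = len(out) - p
        for i in range(l):
            out.append(out[start + i])  # may read chars appended this round
        out.append(c)
    return "".join(out)
```

The slide's example works if we first emit the dictionary contents abcd as literal triples: the (2,9,e) codeword then expands to cdcdcdcdce.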


LZ77 Optimizations used by gzip

LZSS: Output one of the following two formats:
  (0, position, length) or (1, char)
Uses the second format if length < 3.

  a a c a a c a b c a b a a a c   (1,a)
  a a c a a c a b c a b a a a c   (1,a)
  a a c a a c a b c a b a a a c   (1,c)
  a a c a a c a b c a b a a a c   (0,3,4)

Optimizations used by gzip (cont.)

1. Huffman code the positions, lengths, and chars.
2. Non-greedy: possibly use a shorter match so that the next match is better.
3. Use a hash table to store the dictionary.
   – Hash keys are all strings of length 3 in the dictionary window.
   – Find the longest match within the correct hash bucket.
   – Put a limit on the length of the search within a bucket.
   – Within each bucket, store in order of position.
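The LZSS format decision can be grafted onto the same greedy search. A minimal sketch, with MIN_MATCH = 3 mirroring "uses the second format if length < 3"; the tiny window sizes again follow the example slides (real gzip uses a 32 KB window and Huffman codes this output afterwards):

```python
MIN_MATCH = 3  # matches shorter than 3 cost more to code than they save

def lzss_encode(s, dict_size=6, buf_size=4):
    """LZSS variant of the LZ77 loop: emit (0, position, length) for a
    match of length >= MIN_MATCH, otherwise a literal (1, char)."""
    out, cursor = [], 0
    while cursor < len(s):
        best_p, best_l = 0, 0
        for start in range(max(0, cursor - dict_size), cursor):
            l = 0
            while (l < buf_size and cursor + l < len(s)
                   and s[start + l] == s[cursor + l]):
                l += 1
            if l > best_l:
                best_p, best_l = cursor - start, l
        if best_l >= MIN_MATCH:
            out.append((0, best_p, best_l))   # copy format
            cursor += best_l
        else:
            out.append((1, s[cursor]))        # literal format
            cursor += 1
    return out
```

On the example string the first four codes match the slide: (1,a), (1,a), (1,c), (0,3,4).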


The Hash Table

  Position:  7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
  Window:  … a a c a  a  c  a  b  c  a  b  a  a  a  c …

Buckets (length-3 keys):
  a a c : 19, 10, 7
  a c a : 11, 8
  c a a : 9
  c a b : 15, 12

Theory behind LZ77

Sliding Window LZ is Asymptotically Optimal [Wyner-Ziv, 94]: it will compress long enough strings to the source entropy as the window size goes to infinity.

  H_n = (1/n) Σ_{X ∈ A^n} p(X) log(1/p(X))

  H = lim_{n→∞} H_n

Uses a logarithmic code (e.g. gamma) for the position.

Problem: "long enough" is really, really long.
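The hash-table dictionary can be sketched in a few lines: keys are the length-3 strings in the window, buckets hold positions, and the match search probes a single bucket with a chain limit. Function names and the max_chain default are illustrative, not gzip's.

```python
from collections import defaultdict

def build_hash_table(s, lo, hi):
    """Index every length-3 string in the dictionary window s[lo:hi],
    keyed by its 3 characters, positions kept in order."""
    table = defaultdict(list)
    for i in range(lo, hi - 2):
        table[s[i:i + 3]].append(i)
    return table

def longest_match(s, cursor, table, max_chain=4096):
    """Find the longest match starting in the dictionary by probing only
    the bucket for the next 3 chars; max_chain bounds the search."""
    best_pos, best_len = -1, 0
    for pos in reversed(table.get(s[cursor:cursor + 3], [])[-max_chain:]):
        l = 3  # the bucket key already guarantees the first 3 chars match
        while cursor + l < len(s) and s[pos + l] == s[cursor + l]:
            l += 1
        if l > best_len:
            best_pos, best_len = pos, l
    return best_pos, best_len
```

With the dictionary window ending at the cursor, only one bucket is ever examined per lookup, which is the whole point of the optimization.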


Comparison to Lempel-Ziv 78

Both LZ77 and LZ78 and their variants keep a "dictionary" of recent strings that have been seen. The differences are:
– How the dictionary is stored (LZ78 is a trie)
– How it is extended (LZ78 only extends an existing entry by one character)
– How it is indexed (LZ78 indexes the nodes of the trie)
– How elements are removed

Lempel-Ziv Algorithms Summary

– Adapts well to changes in the file (e.g. a file with many file types within it).
– Initial algorithms did not use probability coding and performed poorly in terms of compression. More modern versions (e.g. gzip) do use probability coding as a "second pass" and compress much better.
– The algorithms are becoming outdated, but the ideas are used in many of the newer algorithms.
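LZW, the LZ78 variant named earlier, shows the trie-like dictionary growth concretely: each step adds one existing entry extended by one character, and only dictionary indices are output. A minimal encoder sketch, assuming a small fixed initial alphabet:

```python
def lzw_encode(s, alphabet="abc"):
    """LZW (an LZ78 variant): grow the dictionary by one character per
    step; output only dictionary indices, never explicit characters."""
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    out, w = [], ""
    for ch in s:
        if w + ch in dictionary:
            w += ch                 # keep extending the current match
        else:
            out.append(dictionary[w])
            dictionary[w + ch] = len(dictionary)  # extend an entry by one char
            w = ch
    if w:
        out.append(dictionary[w])
    return out
```

A real implementation would use a trie keyed per node rather than a flat string-keyed dict, but the growth rule is the same.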


Compression Outline

– Introduction: Lossy vs. Lossless, Benchmarks, …
– Information Theory: Entropy, etc.
– Probability Coding: Huffman + Arithmetic Coding
– Applications of Probability Coding: PPM + others
– Lempel-Ziv Algorithms: LZ77, gzip, compress, …
– Other Lossless Algorithms:
  – Burrows-Wheeler
  – ACB
– Lossy algorithms for images: JPEG, MPEG, ...
– Compressing graphs and meshes: BBK

Burrows-Wheeler

– Currently near the best "balanced" algorithm for text.
– Breaks the file into fixed-size blocks and encodes each block separately.
– For each block:
  – Sort each character by its full context. This is called the block sorting transform.
  – Use the move-to-front transform to encode the sorted characters.
– The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.
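The move-to-front step can be sketched as follows. This is an illustrative encoder: the caller-supplied initial alphabet ordering is an assumption (real coders fix one, e.g. byte order), and the test input is the sorted-character string from the decode example later in these slides.

```python
def move_to_front_encode(s, alphabet):
    """Move-to-front: code each char by its current position in the
    list, then move it to the front. Runs of equal characters (common
    after the block sorting transform) become runs of small numbers."""
    order = list(alphabet)
    out = []
    for ch in s:
        i = order.index(ch)
        out.append(i)
        order.insert(0, order.pop(i))  # move the char to the front
    return out
```

The small numbers this produces are what make the final entropy-coding pass effective.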


Burrows Wheeler: Example

Let's encode: d1e2c3o4d5e6. (We've numbered the characters to distinguish them.) The context "wraps" around; the last char is most significant.

  Context  Char                 Context  Output
  dedec3   o4                   dedec3   o4
  coded1   e2       Sort        coded1   e2
  odede2   c3      Context      decod5   e6
  decod5   e6        ⇒          odede2   c3
  ecode6   d1                   ecode6   d1
  edeco4   d5                   edeco4   d5

Burrows-Wheeler (Continued)

Theorem: After sorting, equal valued characters appear in the same order in the output as in the most significant position of the context.

  Context  Output               Context  Output
  ecode6   d1                   dedec3   o4
  coded1   e2                   coded1   e2
  odede2   c3                   decod5   e6
  dedec3   o4                   odede2   c3
  edeco4   d5                   ecode6   d1 ⇐
  decod5   e6                   edeco4   d5

Proof sketch: Since the chars have equal value in the most-significant position of the context, they will be ordered by the rest of the context, i.e. the previous chars. This is also the order of the output, since it is sorted by the previous characters.
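The block sorting transform on this example can be reproduced in a few lines, under the slide's convention that contexts wrap around and are compared most-significant (nearest preceding) character first. This naive sketch builds a full context key per character, so it is quadratic; practical implementations use suffix sorting instead.

```python
def block_sort_transform(s):
    """Block sorting transform: sort each character by its full wrapped
    context, comparing the nearest preceding character first. Returns
    the permuted characters and where the original first char landed."""
    n = len(s)
    # context of s[i], listed from most significant (nearest) char outward
    def key(i):
        return [s[(i - 1 - k) % n] for k in range(n - 1)]
    order = sorted(range(n), key=key)
    out = "".join(s[i] for i in order)
    start = order.index(0)  # pointer the decoder needs
    return out, start
```

On "decode" this yields the sorted output column of the example, o e e c d d, with the start pointer at the d that was originally first.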


Burrows-Wheeler: Decoding

Consider dropping all but the last character of the context.

  Context  Output
  a        c
  a        b
  a        b
  b        a
  b        a ⇐
  c        a

– What follows the underlined a?
– What follows the underlined b?
– What is the whole string?

Answer: b, a, abacab

Burrows-Wheeler: Decoding

What about now?

  Context  Output  Rank
  a        c ⇐     6
  a        a       1
  a        b       4
  b        b       5
  b        a       2
  c        a       3

Can also use the "rank". The "rank" is the position of a character if it were sorted using a stable sort.

Answer: cabbaa


Burrows-Wheeler Decode

  function BW_Decode(In, Start, n)
    S = MoveToFrontDecode(In, n)
    R = Rank(S)
    j = Start
    for i = 1 to n do
      Out[i] = S[j]
      j = R[j]

Rank gives the position of each char in sorted order.

Decode Example

  S    Rank(S)
  o4   6
  e2   4
  e6   5
  c3   1
  d1   2   ⇐ Start
  d5   3

  Out = d1 e2 c3 o4 d5 e6
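BW_Decode translates almost line for line into Python (0-indexed here). The MoveToFrontDecode step is omitted, so this sketch assumes the sorted characters S have already been recovered; Python's sorted is stable, which is exactly what Rank requires.

```python
def bw_decode(s, start):
    """Invert the block sorting transform: rank[j] is the position
    character s[j] would take under a stable sort of s."""
    order = sorted(range(len(s)), key=lambda j: s[j])  # stable sort
    rank = [0] * len(s)
    for pos, j in enumerate(order):
        rank[j] = pos
    out, j = [], start
    for _ in range(len(s)):
        out.append(s[j])  # emit the char, then follow the rank pointer
        j = rank[j]
    return "".join(out)
```

Feeding it the example's sorted characters and start pointer recovers the original string.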


Overview of Text Compression

– PPM and Burrows-Wheeler both encode a single character based on the immediately preceding context.
– LZ77 and LZ78 encode multiple characters based on matches found in a block of preceding text.
– Can you mix these ideas, i.e., code multiple characters based on the immediately preceding context?
  – BZ does this, but they don't give details on how it works – current best compressor.
  – ACB also does this – close to best.

ACB (Associate Coder of Buyanovsky)

Keep the dictionary sorted by context (the last character is the most significant).

  Context  Contents
  decode
  dec      ode
  decod    e
  de       code
  deco     de

– Find the longest match for the context.
– Find the longest match for the contents.
– Code:
  – the distance between the matches in the sorted order,
  – the length of the contents match.

Has aspects of Burrows-Wheeler and of LZ77.

