
Algorithms in the Real World

Lempel-Ziv, Burrows-Wheeler, ACB

Page 1 Compression Outline

Introduction: Lossy vs. Lossless, Benchmarks, …
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms:
– LZ77, gzip
– LZ78, compress (Not covered in class)
Other Lossless Algorithms: Burrows-Wheeler
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK

Page 2 Lempel-Ziv Algorithms

LZ77 (Sliding Window)
Variants: LZSS (Lempel-Ziv-Storer-Szymanski)
Applications: gzip, Squeeze, LHA, PKZIP, …

LZ78 (Dictionary Based)
Variants: LZW (Lempel-Ziv-Welch), LZC
Applications: compress, GIF, CCITT (modems), ARC, PAK

Traditionally LZ77 compressed better but ran slower; the gzip version, however, is almost as fast as any LZ78 implementation.

Page 3 LZ77: Sliding Window Lempel-Ziv

[Figure: a cursor in the string a a c a a c a b c a b a b a c, with the Dictionary (previously coded) to its left and the Lookahead Buffer to its right]

Dictionary and buffer “windows” are fixed length and slide with the cursor.

Repeat: Output (o, l, c), where
o = position (offset) of the longest match that starts in the dictionary (relative to the cursor)
l = length of the longest match
c = next char in the buffer beyond the longest match
Then advance the window by l + 1.

Page 4 LZ77: Example

Input: a a c a a c a b c a b a a a c
(Dictionary size = 6, buffer size = 4; at each step the longest match and the next character are marked.)

Outputs, in order, as the window slides:
(_, 0, a)
(1, 1, c)
(3, 4, b)
(3, 3, a)
(1, 2, c)
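To make the loop concrete, here is a minimal Python sketch of the encoder described above. The function name, the window defaults, and the brute-force match search are illustrative choices (gzip replaces the search with hash tables, as described later); (0, 0, a) plays the role of the empty match (_, 0, a).

def lz77_encode(s, dict_size=6, buf_size=4):
    """Return a list of (offset, length, char) triples."""
    out, cursor = [], 0
    while cursor < len(s):
        best_off, best_len = 0, 0
        # Try every offset whose match starts inside the dictionary window.
        for off in range(1, min(cursor, dict_size) + 1):
            length = 0
            # The match starts in the dictionary but may run past the cursor.
            while (length < buf_size and cursor + length + 1 < len(s)
                   and s[cursor - off + length] == s[cursor + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        out.append((best_off, best_len, s[cursor + best_len]))
        cursor += best_len + 1      # advance past the match and the extra char
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]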

Page 5 LZ77 Decoding

The decoder keeps the same dictionary window as the encoder. For each message it looks the match up in the dictionary and inserts a copy at the end of the string.

What if l > o? (Only part of the message is in the dictionary.) E.g. dict = abcd, codeword = (2,9,e).
• Simply copy from left to right:
for (i = 0; i < length; i++)
  out[cursor+i] = out[cursor-offset+i]
• Out = abcdcdcdcdcdce
• The first character is in the dictionary, and each time a character is read another is written, so we never run out of characters to read.
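A matching decoder sketch in Python, assuming the (offset, length, char) triples from the encoder above. The left-to-right copy is exactly what makes l > o work: every character it reads has already been written.

def lz77_decode(triples):
    out = []
    for offset, length, char in triples:
        start = len(out) - offset
        for i in range(length):
            out.append(out[start + i])   # may read a char written this step
        out.append(char)
    return "".join(out)

print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'),
                   (3, 3, 'a'), (1, 2, 'c')]))    # aacaacabcabaaac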

Page 6 LZ77 Optimizations used by gzip

LZSS: Output one of the following two formats:
(0, position, length) or (1, char)
Use the second format if length < 3.

Input: a a c a a c a b c a b a a a c

Outputs, in order, as the window slides:
(1, a)
(1, a)
(1, c)
(0, 3, 4)
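A sketch of the LZSS variant in the same style; only the two token formats and the length-3 threshold come from the slide, the rest mirrors the earlier illustrative encoder. Note that a literal advances the cursor by one, and a match carries no trailing character.

MIN_MATCH = 3

def lzss_encode(s, dict_size=6, buf_size=4):
    """Emit (1, char) literals and (0, offset, length) matches."""
    out, cursor = [], 0
    while cursor < len(s):
        best_off, best_len = 0, 0
        for off in range(1, min(cursor, dict_size) + 1):
            length = 0
            while (length < buf_size and cursor + length < len(s)
                   and s[cursor - off + length] == s[cursor + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        if best_len < MIN_MATCH:
            out.append((1, s[cursor]))          # short match: literal is cheaper
            cursor += 1
        else:
            out.append((0, best_off, best_len))
            cursor += best_len
    return out

print(lzss_encode("aacaacabcabaaac"))
# starts (1,'a'), (1,'a'), (1,'c'), (0,3,4), as on the slide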

Page 7 Optimizations used by gzip (cont.)

1. Huffman code the positions, lengths, and chars.
2. Non-greedy: possibly use a shorter match so that the next match is better.
3. Use a hash table to store the dictionary.
– Hash keys are all strings of length 3 in the dictionary window.
– Find the longest match within the correct hash bucket.
– Put a limit on the length of the search within a bucket.
– Within each bucket, store entries in order of position to make deleting easier when the window moves.

Page 8 The Hash Table

Positions: … 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 …
String:    … a a c a  a  c  a  b  c  a  b  a  a  a  c  …

Hash buckets (positions stored most recent first):
aac → 19, 10, 7
aca → 11, 8
caa → 9
cab → 15, 12
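A small sketch of building such a table, using a Python dict of chains rather than gzip's actual data structure; the leading padding just makes the indices line up with the positions shown above.

from collections import defaultdict, deque

def build_chains(s, start, end):
    """Index every length-3 string in s[start:end] by where it starts."""
    chains = defaultdict(deque)
    for pos in range(start, end - 2):
        chains[s[pos:pos + 3]].appendleft(pos)   # most recent position first
    return chains

s = " " * 7 + "aacaacabcabaaac"    # window occupies positions 7..21
chains = build_chains(s, 7, 22)
print(list(chains["aac"]))         # [19, 10, 7]
print(list(chains["cab"]))         # [15, 12]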

Page 9 Theory behind LZ77

“The Sliding-Window Lempel-Ziv Algorithm is Asymptotically Optimal,” A. D. Wyner and J. Ziv, Proceedings of the IEEE, Vol. 82, No. 6, June 1994.

LZ77 will compress long enough strings down to the source entropy as the window size goes to infinity.

Special case of the proof: Assume an infinite dictionary window, characters generated one at a time independently, and look for matches of length one only.

If the i’th character of the alphabet occurs with probability p_i, the expected distance to the nearest match is given by the following (like tossing a coin until heads appears):

E[o_i] = p_i + (1 − p_i)(1 + E[o_i])
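The recurrence reads: with probability p_i the very next character matches (distance 1); otherwise we have moved one position and face the same expected distance again. Solving takes one line of algebra:

E[o_i] = 1 + (1 − p_i) E[o_i]  ⇒  p_i E[o_i] = 1  ⇒  E[o_i] = 1/p_i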

Solving, we find that E[o_i] = 1/p_i, which is intuitive. If we can encode o_i using ≈ log o_i bits, then the expected codeword length is

∑_i p_i log(1/p_i),

which is the entropy of the single-character distribution.

Page 10 Logarithmic Length Encoding

How can we encode o_i using ≈ log o_i bits?

The gamma code uses two parts: first ⌈log o_i⌉ is encoded in unary using ⌈log o_i⌉ bits, then o_i is encoded in binary using ⌈log o_i⌉ bits. Total length of the codeword: 2⌈log o_i⌉. Off by a factor of 2!

Improvement: code ⌈log o_i⌉ in binary using ⌈log⌈log o_i⌉⌉ bits instead, and first code ⌈log⌈log o_i⌉⌉ in unary using ⌈log⌈log o_i⌉⌉ bits. Total length of the codeword: 2⌈log⌈log o_i⌉⌉ + ⌈log o_i⌉.

Etc.: 2⌈log⌈log⌈log o_i⌉⌉⌉ + ⌈log⌈log o_i⌉⌉ + ⌈log o_i⌉ …
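A sketch of the gamma code (and the improved “log log” variant, usually called the delta code) in Python, assuming integers ≥ 1 and the common convention that the binary part keeps its leading 1 bit, so codeword lengths are 2⌊log o_i⌋ + 1, matching the slide up to rounding.

def gamma_encode(x):
    b = bin(x)[2:]                  # binary representation of x
    return "0" * (len(b) - 1) + b   # unary length prefix, then binary

def gamma_decode(bits):
    zeros = 0
    while bits[zeros] == "0":       # the unary part gives the binary length
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

def delta_encode(x):
    b = bin(x)[2:]                        # gamma-code the length instead,
    return gamma_encode(len(b)) + b[1:]   # then x minus its leading 1 bit

for x in [1, 2, 9, 33]:
    print(x, gamma_encode(x), delta_encode(x))
# e.g. 9 → gamma 0001001 (7 bits), delta 00100001 (8 bits);
# for 33 gamma takes 11 bits but delta only 10: delta wins for large x.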

Page 11 Theory behind LZ77

General case: for any n, there is a window size w (typically exponential in n) such that on average a match of size n is found, and the number of bits needed to encode the position and length of the match approaches the source entropy for substrings of length n:

H_n = ∑_{X∈A^n} p(X) log(1/p(X))

This is optimal in the situation where each character might depend on the previous n−1, even though n is unknown. It uses a logarithmic code for the position and match length. (Note that the match length is typically short compared to the position.) Problem: a “long enough” window is really, really long.

Page 12 Comparison to Lempel-Ziv 78

Both LZ77 and LZ78 and their variants keep a “dictionary” of recent strings that have been seen. The differences are:
– How the dictionary is stored (LZ78 is a trie)
– How it is extended (LZ78 only extends an existing entry by one character)
– How it is indexed (LZ78 indexes the nodes of the trie)
– How elements are removed

Page 13 Lempel-Ziv Algorithms Summary

Lempel-Ziv adapts well to changes in the file (e.g. a file with many file types within it). The initial algorithms did not use probability coding and performed poorly in terms of compression. More modern versions (e.g. gzip) use probability coding as a “second pass” and compress much better. The algorithms are becoming outdated, but their ideas are used in many of the newer algorithms.

Page 14 Compression Outline

Introduction: Lossy vs. Lossless, Benchmarks, …
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms: LZ77, gzip, compress, …
Other Lossless Algorithms:
– Burrows-Wheeler
– ACB
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK

Page 15 Burrows-Wheeler

Currently near the best “balanced” algorithm for text. Breaks the file into fixed-size blocks and encodes each block separately. For each block:
– First sort each character by its full context. This is called the block sorting transform.
– Then use the move-to-front transform to encode the sorted characters (a sketch of this step follows below).
The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.
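For concreteness, a minimal move-to-front coder; the alphabet argument and the example string (the sorted column “oeecdd” from the example that follows) are illustrative. Recently seen symbols get small indices, which a subsequent probability coder can exploit.

def mtf_encode(s, alphabet):
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))   # move the used symbol to the front
    return out

def mtf_decode(codes, alphabet):
    table = list(alphabet)
    out = []
    for i in codes:
        out.append(table[i])
        table.insert(0, table.pop(i))
    return "".join(out)

codes = mtf_encode("oeecdd", "cdeo")
print(codes)                            # [3, 3, 0, 2, 3, 0]
print(mtf_decode(codes, "cdeo"))        # oeecdd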

Page 16 Burrows Wheeler: Example

Let’s encode: decode. The context “wraps” around, and the last char is most significant. In the output, characters with similar contexts are near each other.

All rotations of the input:

Context Char
ecode   d
coded   e
odede   c
decod   e
dedec   o
edeco   d

Sorted, using the context as the key:

Context Output
dedec   o
coded   e
decod   e
odede   c
ecode   d ⇐ start
edeco   d
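A Python sketch of the block sorting transform exactly as this slide defines it: each character is keyed by its preceding context, nearest character most significant. (The more common BWT formulation sorts the forward rotations instead, which orders the rows differently; this sketch follows the slide.)

def bwt_encode(s):
    n = len(s)
    rows = []
    for i in range(n):
        # Context of s[i]: the n-1 preceding chars (wrapping), nearest first,
        # so that a plain string compare puts the last char most significant.
        context = "".join(s[(i - k) % n] for k in range(1, n))
        rows.append((context, s[i], i))
    rows.sort()
    out = "".join(ch for _, ch, _ in rows)
    start = next(r for r, (_, _, i) in enumerate(rows) if i == 0)
    return out, start

print(bwt_encode("decode"))   # ('oeecdd', 4): row 4 is the ⇐ start row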

Page 17 Burrows Wheeler Decoding

Key Idea: We can construct the entire sorted table from the sorted column alone! First: sorting the output gives the last column of the context:

Context Output
c       o
d       e
d       e
e       c
e       d
o       d

Page 18 Burrows Wheeler Decoding

Now sort the pairs formed by the last column of the context and the output column to get the last two columns of the context:

Context Output        Context Output
c       o             ec      o
d       e             ed      e
d       e             od      e
e       c             de      c
e       d             de      d
o       d             co      d

Page 19 Burrows Wheeler Decoding

Repeat until the entire table is complete. The pointer to the first character then provides unique decoding.

Context Output
dedec   o
coded   e
decod   e
odede   c
ecode   d ⇐
edeco   d

The message has d in the first position, preceded in wrapped fashion by ecode: decode.

Page 20 Burrows Wheeler Decoding

Optimization: We don’t really have to rebuild the whole context table. What character comes after the first character, d1? We just have to find d1 in the last column of the context and see what follows it: e1.

Context  Output
dedec    o
coded1   e1
decod2   e2
odede1   c
ecode2   d1 ⇐
edeco    d2

Observation: instances of the same output character appear in the same order in the last column of the context. (Proof is an exercise.)

Page 21 Burrows-Wheeler: Decoding

Context Output Rank
c       o      6
d       e      4
d       e      5
e       c      1
e       d ⇐    2
o       d      3

The “rank” is the position a character would have if the output column were sorted using a stable sort.

Page 22 Burrows-Wheeler Decode

Function BW_Decode(In, Start, n)
  S = MoveToFrontDecode(In, n)
  R = Rank(S)
  j = Start
  for i = 1 to n do
    Out[i] = S[j]
    j = R[j]

Rank gives position of each char in sorted order.
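A runnable Python version of BW_Decode, assuming MoveToFrontDecode has already been applied so the character string S is passed directly; Rank is computed with a stable sort, as defined on the previous page. Indices are 0-based here, while the pseudocode above is 1-based.

def rank(S):
    order = sorted(range(len(S)), key=lambda j: S[j])  # stable sort of chars
    R = [0] * len(S)
    for pos, j in enumerate(order):
        R[j] = pos                  # where S[j] lands in sorted order
    return R

def bw_decode(S, start):
    R = rank(S)
    out, j = [], start
    for _ in range(len(S)):
        out.append(S[j])
        j = R[j]
    return "".join(out)

print(bw_decode("oeecdd", 4))       # decode (start row 4 from bwt_encode)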

Page 23 Decode Example

S     Rank(S)  Out
o4    6        d1 ⇐
e2    4        e2
e6    5        c3
c3    1        o4
d1 ⇐  2        d5
d5    3        e6

(The subscripts number each character by its position in the decoded string “decode”.)

Page 24 Overview of Text Compression

PPM and Burrows-Wheeler both encode a single character based on the immediately preceding context. LZ77 and LZ78 encode multiple characters based on matches found in a block of preceding text. Can you mix these ideas, i.e., code multiple characters based on the immediately preceding context?
– BZ does this, but the authors don’t give details on how it works
– ACB also does this, and is close to BZ

Page 25 ACB (Associative Coder of Buyanovsky)

Keep the dictionary sorted by context (the last character is the most significant).
• Find the longest match of the current context in the context part of the dictionary
• Find the longest match of the look-ahead buffer in the contents part of the dictionary
• Code:
– the distance between the two matches in the sorted order
– the length of the contents match
– the next character that doesn’t match
• Shift the look-ahead window; update the context and contents

Has aspects of both Burrows-Wheeler and LZ77.

Page 26 ACB Example

We have seen “decode” so far; suppose the current position is decode|odcoe. Sort (by context) all places in the dictionary (contents) where a match for the text in the look-ahead buffer could occur:

Context  Contents
         decode
dec      ode
d        ecode
decod    e
de       code
decode
deco     de

Best matches: “decode” in context; “odc” in contents (2 chars match: “od”).
Output: (-4, 2, c)

The decoder can find the best match in context and continue from there.
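To tie the pieces together, here is a toy Python sketch of one ACB coding step that reproduces the example's output. Rebuilding and re-sorting the dictionary on every step, and the arbitrary tie-breaking, are simplifications; a real implementation maintains the sorted dictionary incrementally.

def acb_code_step(history, lookahead):
    # Every split of the history is a (context, contents) dictionary entry;
    # sort the splits by context read right to left (last char most significant).
    splits = sorted(range(len(history) + 1),
                    key=lambda i: history[:i][::-1])

    def lcp(a, b):                      # length of the common prefix
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    # Row whose context best matches the current context (the whole history):
    cur = max(range(len(splits)),
              key=lambda r: lcp(history[:splits[r]][::-1], history[::-1]))
    # Row whose contents best matches the look-ahead buffer:
    best = max(range(len(splits)),
               key=lambda r: lcp(history[splits[r]:], lookahead))
    length = lcp(history[splits[best]:], lookahead)
    return best - cur, length, lookahead[length]

print(acb_code_step("decode", "odcoe"))   # (-4, 2, 'c'), as on the slide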
