Data Compression Techniques, Part 2: Text Compression
Lecture 6: Dictionary Compression

Juha Kärkkäinen

15.11.2017

Dictionary Compression

The compression techniques we have seen so far replace individual symbols with variable length codewords. In dictionary compression, variable length substrings are replaced by short, possibly even fixed length codewords. Compression is achieved by replacing long strings with shorter codewords.

The general scheme is as follows:

- The dictionary D is a collection of strings, often called phrases. For completeness, the dictionary includes all single symbols.

- The text T is parsed into a sequence of phrases:

  T = T1 T2 ... Tz, Ti ∈ D.

The sequence is called a parsing or a factorization of T with respect to D.

- The text is encoded by replacing each phrase Ti with a codeword that acts as a pointer to the dictionary.

Example

Here is a simple static dictionary encoding for English text:

- The dictionary consists of some set of English words plus individual symbols.

- Compute the frequencies of the words in some corpus of English texts. Compute the frequencies of symbols in the corpus from which the dictionary words have been removed.

- Number the words and symbols in descending order of their frequencies.

- To encode a text, replace each dictionary word and each symbol that does not belong to a dictionary word with its corresponding number. Encode the sequence of numbers using γ coding; a sketch follows below.
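The γ code assigns short codes to small numbers, which is why the dictionary is sorted by frequency. As an illustration, here is a minimal γ coder in Python (not part of the original slides; it assumes the numbering starts from 1, so that every number has a valid γ code):

    def gamma_code(k):
        # Elias gamma code of a positive integer k: the binary
        # representation of k preceded by one zero per bit after the first
        bits = bin(k)[2:]
        return "0" * (len(bits) - 1) + bits

    # gamma_code(1) == "1", gamma_code(5) == "00101":
    # frequent words get small numbers and thus short codes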

Lempel-Ziv compression

In 1977 and 1978, Jacob Ziv and Abraham Lempel published two adaptive dictionary compression algorithms that soon came to dominate practical text compression. Numerous variants have been published and implemented, and they are still the most commonly used algorithms in general purpose compression tools.

The common feature of the two algorithms and all their variants is that the dictionary consists of substrings of the already processed part of the text. This means that the dictionary adapts to the text.

The two algorithms are known as LZ77 and LZ78, and most related methods can be categorized as a variant of one or the other. The primary difference is the encoding of the phrases:

- LZ77 uses direct pointers to the preceding text.

- LZ78 uses pointers to a separate dictionary.

LZ78

The original LZ78 encodes a string T = t0 t1 ... t_{n−1} as follows:

- The dictionary consists of phrases numbered from 0 upwards: D = {Z0, Z1, Z2, ...}. Each new phrase inserted into the dictionary gets the next free number. Initially, D = {Z0 = ε} (ε = empty string).

- Suppose we have so far computed the parsing T1 ... T_{j−1} of T[0...i) and the next phrase Tj starts at position i. Let Zk ∈ D be the longest phrase in the dictionary that is a prefix of T[i...n−2], i.e., of the remaining text minus its last symbol, so that the extension symbol below always exists.

  The next phrase is Tj = T[i...i+|Zk|] = Zk t_{i+|Zk|}, i.e., Zk plus a symbol, and it is inserted into the dictionary as Zj.

- The phrase Tj is encoded as the pair ⟨k, t_{i+|Zk|}⟩. Using fixed length codes, the pair needs ⌈log j⌉ + ⌈log σ⌉ bits.

Example

Let Σ = {a, b, c, d} and T = badadadabaab.

  index:        0   1      2      3      4      5      6      7
  phrase:       ε   b      a      d      ad     ada    ba     ab
  encoding:         ⟨0,b⟩  ⟨0,a⟩  ⟨0,d⟩  ⟨2,d⟩  ⟨4,a⟩  ⟨1,a⟩  ⟨2,b⟩
  code length:      0+2    1+2    2+2    2+2    3+2    3+2    3+2
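As an illustration, here is a minimal LZ78 parsing sketch in Python (not from the slides; a real implementation would store the dictionary as a trie rather than a hash table of strings):

    def lz78_parse(text):
        dictionary = {"": 0}              # Z_0 = the empty string
        pairs = []                        # output: (k, extra symbol)
        i, n = 0, len(text)
        while i < n:
            # longest dictionary phrase that is a prefix of the rest,
            # leaving at least one symbol for the explicit extension
            length = 0
            while i + length + 1 < n and text[i:i + length + 1] in dictionary:
                length += 1
            pairs.append((dictionary[text[i:i + length]], text[i + length]))
            dictionary[text[i:i + length + 1]] = len(dictionary)  # new Z_j
            i += length + 1
        return pairs

    # lz78_parse("badadadabaab") reproduces the pairs of the example above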

LZW

LZW is a simple optimization of LZ78 used, e.g., in the unix tool compress.

- Initially, the dictionary D contains all individual symbols: D = {Z1, ..., Zσ}.

- Suppose the next phrase Tj starts at position i. Let Zk ∈ D be the longest phrase in the dictionary that is a prefix of T[i...n). Now the next text phrase is Tj = Zk and the phrase added to the dictionary is Z_{σ+j} = Tj t_{i+|Tj|}.

- The phrase Tj is encoded as ⟨k⟩, requiring ⌈log(σ + j − 1)⌉ bits. Omitting the symbol codes saves space in practice, even though the index codes can be longer and the phrases can be shorter.

Example

Let Σ = {a, b, c, d} and T = badadadabaab.

  phrase:       b    a    d    ad   ada  ba   a    b
  encoding:     ⟨2⟩  ⟨1⟩  ⟨4⟩  ⟨6⟩  ⟨8⟩  ⟨5⟩  ⟨1⟩  ⟨2⟩
  code length:  2    3    3    3    3    4    4    4

  dict. phrase: a  b  c  d  ba  ad  da  ada  adab  baa  ab
  index:        1  2  3  4  5   6   7   8    9     10   11
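A matching LZW encoder sketch (again illustrative Python with a string dictionary; phrase numbers start at 1 as above):

    def lzw_encode(text, alphabet):
        dictionary = {s: k for k, s in enumerate(alphabet, start=1)}
        codes = []
        i, n = 0, len(text)
        while i < n:
            length = 1                    # single symbols are always present
            while i + length < n and text[i:i + length + 1] in dictionary:
                length += 1
            codes.append(dictionary[text[i:i + length]])
            if i + length < n:            # insert the phrase plus the next symbol
                dictionary[text[i:i + length + 1]] = len(dictionary) + 1
            i += length
        return codes

    # lzw_encode("badadadabaab", "abcd") == [2, 1, 4, 6, 8, 5, 1, 2]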

There is a tricky detail in decoding LZW:

- In order to insert the phrase Z_{σ+j} = Tj t_{i+|Tj|} into the dictionary, the decoder needs to know the symbol t_{i+|Tj|}. For the encoder this is not a problem, but the decoder only knows that t_{i+|Tj|} is the first symbol of the next phrase T_{j+1}. Thus the insertion is delayed until the next phrase has been decoded.

- The encoder inserts Z_{σ+j} into the dictionary without a delay and has the option of choosing T_{j+1} = Z_{σ+j}. If this happens, the decoder is faced with a reference to a phrase that is not yet in the dictionary! However, in this case the decoder knows that the unknown symbol t_{i+|Tj|} is the first symbol of Z_{σ+j}, which is the same as the first symbol of Tj. Given t_{i+|Tj|}, the decoder can set Z_{σ+j} = Tj t_{i+|Tj|}, insert it into the dictionary, and decode T_{j+1} normally.

The phrase T_{j+1} = Z_{σ+j} is problematic because it is self-referential.

Example

In our example, the phrase T5 = Z8 = ada is self-referential.
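A decoder sketch (illustrative Python) that implements both the delayed insertion and the self-referential special case:

    def lzw_decode(codes, alphabet):
        phrases = [None] + list(alphabet)      # phrase numbers start at 1
        out, prev = [], None
        for k in codes:
            if k < len(phrases):
                cur = phrases[k]
            else:
                # self-referential case: k is the phrase about to be created;
                # its first symbol equals the first symbol of prev
                cur = prev + prev[0]
            if prev is not None:
                phrases.append(prev + cur[0])  # delayed insertion
            out.append(cur)
            prev = cur
        return "".join(out)

    # lzw_decode([2, 1, 4, 6, 8, 5, 1, 2], "abcd") == "badadadabaab"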

LZ77

The original LZ77 works as follows:

- A phrase Tj starting at position i is encoded as a triple of the form ⟨distance, length, symbol⟩. A triple ⟨d, l, s⟩ means that

  Tj = T[i...i+l] = T[i−d...i−d+l) s.

  In other words, the string T[i...i+l) of length l has another occurrence d positions earlier in the text.

- The values d and l should satisfy d ∈ [1..dmax] and l ∈ [0..lmax]. In other words, the earlier occurrence should be no longer than lmax and should start within a window T[i−dmax...i−1].

- The algorithm searches the window for the longest possible match under the above constraints, i.e., it tries to maximize l.

- The triple is encoded with fixed length codes using ⌈log dmax⌉ + ⌈log(lmax + 1)⌉ + ⌈log σ⌉ bits.

- The decoder is really simple and fast as it can just copy each phrase from the already decoded part of the text. Even the self-referential case l > d, when the strings T[i...i+l) and T[i−d...i−d+l) overlap, just requires the copying to happen in left-to-right order (see the sketch below).
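A complete decoder is only a few lines (illustrative Python); the symbol-by-symbol copy makes the overlapping case come out right automatically:

    def lz77_decode(triples):
        out = []
        for d, l, s in triples:
            start = len(out) - d
            for k in range(l):            # left-to-right copy handles l > d
                out.append(out[start + k])
            out.append(s)
        return "".join(out)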

Example

Let Σ = {a, b, c, d}, T = badadadabaab, and dmax = lmax = 8.

  phrase:    b        a        d        adadab   aa       b
  encoding:  ⟨1,0,b⟩  ⟨1,0,a⟩  ⟨1,0,d⟩  ⟨2,5,b⟩  ⟨2,1,a⟩  ⟨1,0,b⟩
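An encoder sketch with the exhaustive window search (illustrative Python; on this input it reproduces the triples above):

    def lz77_encode(text, dmax, lmax):
        triples = []
        i, n = 0, len(text)
        while i < n:
            best_d, best_l = 1, 0
            for d in range(1, min(i, dmax) + 1):
                l = 0
                # i + l < n - 1 leaves one symbol for the explicit s;
                # comparing text[i+l-d] allows self-referential overlap
                while l < lmax and i + l < n - 1 and text[i + l - d] == text[i + l]:
                    l += 1
                if l > best_l:
                    best_d, best_l = d, l
            triples.append((best_d, best_l, text[i + best_l]))
            i += best_l + 1
        return triples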

Later variants have improved the encoding of the phrases:

- Avoid coding the extra symbol every time by replacing the triples with two types of codes, ⟨length, distance⟩ and ⟨symbol⟩ (with some indicator to specify the type of the code).

- Use variable length codes instead of fixed length codes. Variations range from ad hoc schemes to semiadaptive Huffman and even tANS coding. The most advanced compressors can further have a complex model to estimate the probability of each bit in the encoding and then use arithmetic coding to encode them. Many popular compression programs, such as gzip and 7zip, use these types of LZ77 variants.

LZ77 can also be optimized for encoding speed by replacing the exhaustive search of the window with an efficient data structure.

- Many different data structures, including binary trees, hash tables and suffix trees, have been used for the purpose (a hash table sketch follows after this list).

- Fast searching enables larger window sizes or even unbounded distances and lengths. Increasing the window size can lead to longer phrases and thus better compression.

- On the other hand, the compression can suffer from the longer codes needed for larger values. With the fixed length codes of the original LZ77, this is another reason to use a small window. With variable length codes, a small upper bound is not important if a smaller distance has a shorter code. Then the longest possible phrase is not always the optimal choice.

- A recent algorithm by Ferragina, Nitto and Venturini (SODA 2009) solves this optimization problem efficiently for many encoding schemes, i.e., it finds the parsing that minimizes the total length of the final encoding.
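As an illustration of the hash table approach, here is a sketch (not any particular compressor's scheme; the parameter min_len and the table layout are choices of this sketch): index each position by its first min_len symbols, so that only promising candidates are compared.

    def longest_match(text, i, table, min_len=3):
        # table maps each min_len-gram to the list of positions < i
        # where it occurs, in increasing order
        best_d, best_l = 0, 0
        for j in reversed(table.get(text[i:i + min_len], [])):
            l = 0
            while i + l < len(text) and text[j + l] == text[i + l]:
                l += 1
            if l > best_l:
                best_d, best_l = i - j, l
        return best_d, best_l

    # as the encoder advances, it registers each position:
    #     table.setdefault(text[i:i + min_len], []).append(i)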

LZFG

As a final example of LZ-type compression methods, let us briefly look at LZFG, which is a kind of hybrid of the LZ77 and LZ78 algorithms:

- LZFG is like LZ77 but with the restriction that the earlier occurrence of each phrase has to begin at a previous phrase boundary. There is no restriction on the end of the phrase.

- Each phrase is encoded as a ⟨length, distance⟩ pair, but the distance now points to a separate array recording the positions of phrase boundaries. In this sense, LZFG is an LZ78 type method.

- Using a large or unbounded window size is easier with LZFG than with LZ77, because the distance values are smaller and the data structures for finding phrases are simpler.

Example

Let T = badadadabaab. Assume two types of codes, ⟨length, distance⟩ and ⟨symbol⟩, and no length or distance limits.

  phrase:    b    a    d    adada   ba      a    b
  encoding:  ⟨b⟩  ⟨a⟩  ⟨d⟩  ⟨5, 2⟩  ⟨2, 4⟩  ⟨a⟩  ⟨b⟩
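The following parsing sketch (illustrative Python) reproduces the example; it reads the distance as the number of phrase boundaries counted backwards, which is one consistent interpretation of the encoding above, and falls back to a symbol code for matches shorter than two:

    def lzfg_parse(text):
        boundaries = []                  # starting positions of phrases
        out = []
        i, n = 0, len(text)
        while i < n:
            best_l, best_b = 0, 0
            for b_idx, b in enumerate(boundaries):
                l = 0
                # the copy may run past position i (overlap is allowed)
                while i + l < n and text[b + l] == text[i + l]:
                    l += 1
                if l > best_l:
                    best_l, best_b = l, b_idx
            if best_l >= 2:              # otherwise a symbol code is cheaper
                out.append((best_l, len(boundaries) - best_b))
            else:
                best_l = 1
                out.append(text[i])
            boundaries.append(i)         # each phrase starts a new boundary
            i += best_l
        return out

    # lzfg_parse("badadadabaab") == ['b', 'a', 'd', (5, 2), (2, 4), 'a', 'b']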

An important attribute of Lempel–Ziv compression methods is the size of their effective dictionary, i.e., the number of possible distinct phrases. For a text of length n with a parsing of size z, the effective dictionary sizes are bounded by:

  LZ78:            (z + 1)σ
  LZW:             σ + z
  LZ77 (original): dmax(lmax + 1)σ
  LZ77 (variant):  n² + σ
  LZFG:            zn + σ

In general, the effective dictionary size of LZ78 type algorithms grows slowly with n, while LZ77 type algorithms can have a much faster growth rate.

- A larger dictionary usually leads to longer and thus fewer phrases. This does not necessarily mean better compression, because the code sizes increase too, but fewer phrases can speed up decoding.

- A faster dictionary growth rate can improve compression significantly on some highly compressible texts (exercise).

Grammar Compression

A special type of semiadaptive dictionary compression is grammar compression, which represents a text as a context-free grammar.

Example

T = a rose is a rose is a rose.

  S → ABB
  A → a rose
  B → is A

The grammar should generate exactly one string. Such a grammar is called a straight-line grammar because of the following properties:

- There are no branches, i.e., each non-terminal is the left-hand side of only one rule. Otherwise, multiple strings could be generated.

- There are no loops, i.e., no cyclic dependencies between non-terminals. Otherwise, infinite strings could be generated.

A straight-line grammar in Chomsky normal form is called a straight-line program.

The size of a grammar is the total length of the right-hand sides. The smallest grammar problem, i.e., computing the smallest straight-line grammar that generates a given string, is NP-hard.
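Decoding, i.e., expanding a straight-line grammar back into its unique string, is a simple recursion. Here is a sketch (illustrative Python; the right-hand sides above implicitly contain the spaces of T, which the sketch writes out explicitly):

    def expand(grammar, symbol):
        if symbol not in grammar:        # terminal symbol
            return symbol
        # no branches and no loops, so the recursion terminates
        # and yields exactly one string
        return "".join(expand(grammar, s) for s in grammar[symbol])

    # the example grammar, with spaces written out in the right-hand sides
    grammar = {"S": ["A", "B", "B"],
               "A": list("a rose"),
               "B": list(" is ") + ["A"]}
    # expand(grammar, "S") == "a rose is a rose is a rose"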

Nevertheless, there are many algorithms for constructing small grammars for a text, for example:

- LZ78 parsing is easily transformed into a grammar with one rule for each phrase.

- The best provable approximation ratio O(log(n/g)), where g is the size of the smallest grammar, has been achieved by algorithms that transform an LZ77 parsing into a grammar.

- Greedy algorithms add one rule at a time as long as they can find a rule that reduces the size of the grammar. The right-hand side of each new rule is chosen greedily by some criterion, for example:

  - the longest substring with at least two occurrences,
  - the most frequent substring of length at least two,
  - the string that produces the biggest reduction in size.

Re-Pair

Re-Pair is a greedy grammar compression algorithm that operates as follows:

1. Find the pair of symbols XY that is the most frequent in the text T. If no pair occurs twice in T, stop.

2. Create a new non-terminal Z and add the rule Z → XY to the grammar. Replace all occurrences of XY in T by Z. Go to step 1.

The whole process can be performed in linear time using suitable data structures. The details are omitted, but the key observation is that, if n_{XY} is the number of occurrences of the most frequent pair XY in a given step, then the replacement reduces the size of the grammar by n_{XY} − 2. Thus we can spend O(n_{XY}) time to perform the step.
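A naive Re-Pair sketch (illustrative Python) that simply rescans the sequence in every step, and therefore runs in quadratic rather than linear time:

    from collections import Counter

    def repair(text):
        seq = list(text)
        rules = []                       # rules[k] = right-hand side of the k-th non-terminal
        while True:
            # counts of adjacent pairs (overlapping runs are overcounted
            # slightly in this sketch, which is harmless)
            counts = Counter(zip(seq, seq[1:]))
            if not counts or counts.most_common(1)[0][1] < 2:
                break
            pair = counts.most_common(1)[0][0]
            new = len(rules)             # fresh non-terminal, an int here
            rules.append(pair)
            out, i = [], 0
            while i < len(seq):          # replace left to right, no overlaps
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                    out.append(new)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return rules, seq

    # on the example text below this produces rules corresponding to
    # A, B, C, ... (up to ties between equally frequent pairs)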

Re-Pair often achieves the smallest grammar among practical algorithms.

Example

T = singing do wah diddy diddy dum diddy do.

  rule added   text after replacement
  A → ␣d       singingAo␣wahAiddyAiddyAumAiddyAo
  B → dd       singingAo␣wahAiByAiByAumAiByAo
  C → Ai       singingAo␣wahCByCByAumCByAo
  D → By       singingAo␣wahCDCDAumCDAo
  E → CD       singingAo␣wahEEAumEAo
  F → in       sFgFgAo␣wahEEAumEAo
  G → Ao       sFgFgG␣wahEEAumEG
  H → Fg       sHHG␣wahEEAumEG

(Here ␣ makes the space symbol visible: the first rule replaces the pair "space, d".)

Here is a simple encoding of the final result:

- The number r of rules and the length z of the final text, in γ code.

- The right-hand sides of the rules, using 2⌈log(σ + i − 1)⌉-bit fixed length codes to encode the ith rule.

- The compressed text, using ⌈log(σ + r)⌉-bit fixed length codes.

Better compression can be achieved with a more sophisticated encoding.

A common feature of most dictionary compression algorithms is the asymmetry of encoding and decoding:

- The encoder needs to do a lot of work in choosing the phrases or rules.

- The decoder only needs to replace each phrase. Thus the decoder is often simple and fast.

- LZ77-type decoders are particularly simple and fast as they maintain no dictionary other than the already decoded part of the text.

- LZ78-type and grammar-based decoders need some extra effort in constructing and using the dictionary. The best possible compression may require a complex model and advanced entropy coding to encode the phrases, which can make the decoder dramatically slower. Thus there is a tradeoff between speed and compression ratio, and some compressors are optimized for speed and others for maximum compression.
