Entropy bounds for grammar compression
Michał Gańczorz, Institute of Computer Science, University of Wrocław, Poland
[email protected]

Abstract
Grammar compression represents a string as a context-free grammar. Achieving compression requires encoding such a grammar as a binary string; there are a few commonly used encodings. We bound the size of practically used encodings for several heuristical compression methods, including the Re-Pair and Greedy algorithms: the standard encoding of Re-Pair, which combines entropy coding and a special encoding of the grammar, achieves 1.5|S|H_k(S), where H_k(S) is the k-th order entropy of S. We also show that by stopping after some iteration we can achieve |S|H_k(S). This is particularly interesting, as it explains a phenomenon observed in practice: introducing too many nonterminals causes the bit-size to grow. We generalize our approach to other compression methods like Greedy and a wide class of irreducible grammars as well as to other practically used bit encodings (including naive, which uses fixed-length codes). Our approach not only proves the bounds but also partially explains why Greedy and Re-Pair are much better in practice than other grammar-based methods. In some cases, we argue that our estimations are optimal. The tools used in our analysis are of independent interest: we prove new, optimal bounds on the zeroth-order entropy of the parsing of a string.

2012 ACM Subject Classification Theory of computation →

Keywords and phrases grammar compression, empirical entropy, parsing, RePair

Funding Supported under National Science Centre, Poland project number 2017/26/E/ST6/00191.

1 Introduction

Grammar compression is a type of dictionary compression in which we represent the input as a context-free grammar producing exactly the input string. Computing the smallest grammar (according to the sum of production lengths) is known to be NP-complete [6, 31, 5] and the best approximation algorithms [28, 6, 18, 30] have logarithmic approximation ratios, which renders them impractical. On the other hand, we have heuristical algorithms [22, 26, 2, 20, 38, 35], which achieve compression ratios comparable to other dictionary methods [22, 2], despite having poor theoretical guarantees [6, 17]. Heuristics use various bit encodings, e.g. Huffman coding [22, 2], which is standard for other compression methods as well. The apparent success of heuristics fuelled theoretical research that tried to explain or evaluate their performance: one branch of such an approach tried to estimate their approximation ratios [6, 17] (no heuristic was shown to achieve a logarithmic ratio though), the other tried to estimate the bit-size of their output, which makes it possible to compare them with other compressors as well.

In general, estimating the bit-size of any compressor is hard. However, this is not the case for compressors based on the (higher-order) entropy, which includes Huffman coding, arithmetical coding, PPM, and others. In their case, the bit-size of the output is very close to the (k-order) empirical entropy of the input string (H_k), thus instead of comparing to those compressors, we can compare the size of the output with the k-order entropy of the input string. Moreover, in practice, it seems that k-order entropy is a good estimation of the possible compression rate of data, and entropy-based compression is widely used. Furthermore, there are multiple results that tie the output of a compressor to k-order entropy. Such analysis was carried out for the BWT [23], LZ78 [21], and LZ77 [32] compression methods. In some sense, the last two generalised classical results on LZ-algorithm coding for ergodic sources [37, 38]; similar work was also performed for a large class of grammar compressors [20] with a certain encoding.

When discussing the properties of grammar compressors, it is crucial to distinguish the grammar construction and the grammar bit-encoding. There are much fewer results that link the performance of grammar compressors to k-th order entropy than to optimal grammar size, despite the wide popularity of grammar-based methods and the necessity of using proper bit encodings in any practical compressor. The possible reasons are that heuristical compressors often employ their own, practically tested and efficient, yet difficult to analyze, encodings.

1.0.0.1 Our contribution

The main result of this work is showing that a large spectrum of grammar compressors combined with different practically-used encodings achieve bounds related to |S|H_k(S). The considered encodings are: fully naive, in which each letter and nonterminal is encoded with a code of length ⌈log |G|⌉; naive, in which the starting production string is encoded with an entropy coder and the remaining grammar rules are encoded as in the fully naive variant; incremental, in which the starting string is encoded with an entropy coder and the grammar is encoded separately with a specially designed encoding (this is the encoding used originally for Re-Pair [22]); and entropy coding, which entropy-encodes the concatenation of productions' right-hand sides.

First, we show that Re-Pair stopped at the right moment achieves |S|H_k(S) + o(|S| log σ) bits for k = o(log_σ |S|) for naive, incremental, and entropy encoding. Moreover, at this moment, the size of the dictionary is O(|S|^c), c < 1, where c affects linearly the actual value hidden in o(|S| log σ). This implies that grammars produced by Re-Pair and similar methods (usually) have a small alphabet, while many other compression algorithms, like LZ78, do not have this property [38]; this is of practical importance, as encoding a dictionary can be costly.

Then we prove that in general Re-Pair's output size can be bounded by 1.5|S|H_k(S) + o(|S| log σ) bits for both incremental and entropy encoding. For Greedy we give similar bounds: it achieves 1.5|S|H_k(S) + o(|S| log σ) bits with entropy coding, 2|S|H_k(S) + o(|S| log σ) using fully naive encoding, and if stopped after O(|S|^c) iterations it achieves |S|H_k(S) + o(|S| log σ) bits with entropy encoding. The last result is of practical importance: each iteration of Greedy requires O(|S|) time, and so we can reduce the running time from O(|S|^2) to O(|S|^{1+c}) while obtaining comparable, if not better, compression rate.

Stopping the compressor during its runtime seems counter-intuitive but is consistent with empirical observations [14, 12], and our results shed some light on the reasons behind this phenomenon. Furthermore, there are approaches suggesting partial decompression of grammar compressors in order to achieve a better (bit) compression rate [4]. Yet, as we show, this does not generalize to all grammar compressors: in general, stopping a grammar compressor after a certain iteration or partially decompressing the grammar so that the new grammar has at most O(|S|^c) nonterminals, c < 1, could yield a grammar of size Θ(|S|). In particular, we show that there are irreducible grammars [20] which have this property. In comparison, Re-Pair and Greedy, both when stopped after O(|S|^c) iterations, produce a small grammar of size at most O(|S|/log_σ |S|), which is an information-theoretic bound for a grammar size for any S. Moreover, the fact that the grammar is small is crucial in proving the aforementioned bounds. This hints why Re-Pair and Greedy in practice produce grammars of smaller size compared to other grammar compressors: they significantly decrease the size of the grammar in a small number of steps.

Finally, we apply our methods to the general class of irreducible grammars [20] and show

that entropy coding of an irreducible grammar uses at most 2|S|Hk(S) + o(|S| log σ) bits and fully naive encoding uses at most 6|S|Hk(S) + o(|S| log σ) bits. No such general bounds were known before. To prove our result we show a generalization of the result by Kosaraju and Manzini [21, Lem. 2.3], which in turn generalizes Ziv’s Lemma [8, Section 13.5.5].

▶ Theorem 1 (cf. [21, Lem. 2.3]). Let S be a string over a σ-sized alphabet, Y_S = y_1 y_2 … y_{|Y_S|} its parsing, and let L = Lengths(Y_S) be the word |y_1||y_2|…|y_{|Y_S|}| over the alphabet {1, …, |S|}. Then:

\[ |Y_S| H_0(Y_S) \le |S| H_k(S) + |Y_S| k \log\sigma + |L| H_0(L). \]

Moreover, if k = o(log_σ |S|) and |Y_S| = O(|S|/log_σ |S|) then:
\[ |Y_S| H_0(Y_S) \le |S| H_k(S) + O\!\left(\frac{|S| k \log\sigma}{\log_\sigma |S|} + \frac{|S| \log\log_\sigma |S|}{\log_\sigma |S|}\right) \le |S| H_k(S) + o(|S| \log\sigma); \]
the same bound applies to the Huffman coding of the string S′ obtained by replacing different phrases of Y_S with different letters.

Comparing to [21, Lem. 2.3], our proof is more intuitive, simpler, and removes the greatest pitfall of the previous result: the dependency on the highest frequency of a phrase in the parsing. The idea of a parsing is closely related to dictionary compression methods, as most of these methods pick some substring and replace it with a new symbol, creating a phrase in a parsing (e.g. Lempel-Ziv algorithms). Thus, we use Theorem 1 to estimate the size of the entropy-coded output of grammar compressors. We also use Theorem 1 in a less trivial way: with its help, we can lower bound the zeroth-order entropy of the string; we use such lower bounds when proving the upper bounds for both Greedy and Re-Pair. This is one of the crucial steps in showing the 1.5|S|H_k(S) + o(|S| log σ) bound, as it allows us to connect the entropy of the string with the bit encoding of the grammar. The general idea is that when estimating the lower bound of string S, it is sometimes easier to lower bound the entropy of some parsing of S, thus we set k = 0 in Theorem 1 and show that there exists a parsing with large entropy. By Theorem 1 it follows that the zeroth-order entropy of a parsing lower-bounds the entropy of the string (modulo small factors).

An interesting corollary of Theorem 1 is that, for k = 0, for any parsing Y_S the entropy |Y_S|H_0(Y_S) may be larger than |S|H_0(S) by at most |L|H_0(L), which is o(|S|) for |L| = o(|S|), i.e. applying a parsing to S cannot increase the encoding cost significantly. Theorem 1 is an important tool to connect the parsing entropy to the k-order entropy of the input, as it holds for every parsing. On the other hand, it is not difficult to give examples of strings S and their parsings Y_S for which the bound from Theorem 1 is not tight, i.e. there are strings where the inequality holds even without the |Y_S| k log σ summand. One such example is a de Bruijn sequence with any parsing whose phrases are no longer than log_σ |S|. Thus it is natural to ask whether the bound can be improved, e.g. if we are allowed to choose a parsing. Theorem 2 gives a (partial) positive answer.

▶ Theorem 2. Let S be a string over a σ-sized alphabet. Then for any integer l we can construct a parsing Y_S = y_1 y_2 … y_{|Y_S|} of size |Y_S| ≤ ⌈|S|/l⌉ + 1 satisfying:

\[ |Y_S| H_0(Y_S) \le \frac{|S|}{l} \sum_{i=0}^{l-1} H_i(S) + O(\log |S|). \]

Moreover, |y_1|, |y_{|Y_S|}| ≤ l and all the other phrases have length l.
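For illustration, below is a toy Python sketch (our code, not the paper's construction) that cuts S into a short first block followed by length-l blocks, which is the shape of the parsing promised by Theorem 2, and evaluates the zeroth-order entropy of the resulting phrase sequence. The theorem asserts that some parsing of this shape attains the stated bound; this sketch merely evaluates one particular chunking.

```python
from collections import Counter
from math import log2

def block_parsing_entropy(s, l, offset=0):
    """Parse s as one block of length `offset` (possibly empty, offset < l)
    followed by blocks of length l (the last may be shorter).
    Returns (|Y_S|, |Y_S| * H_0(Y_S))."""
    phrases = ([s[:offset]] if offset else []) + [s[i:i + l] for i in range(offset, len(s), l)]
    freq = Counter(phrases)
    c = len(phrases)
    return c, sum(f * log2(c / f) for f in freq.values())

# Example: parse into blocks of length 3 starting at offset 1.
print(block_parsing_entropy("abcabcabcabc", 3, offset=1))
```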

Theorem 2 can be readily applied to text encodings with random access based on parsing the string into equal-length phrases [9, 13, 16] and improve their performance guarantee from |S|H_k(S) + O(|S|/log_σ |S| · (k log σ + log log |S|)) to (|S|/l) Σ_{i=0}^{l−1} H_i(S) + O(log |S| + |S| log log |S|/log_σ |S|) (such structures consist of an entropy-coded string and additional structures which use O(|S| log log |S|/log_σ |S|) bits). As implied by the recent work [11], the bounds in neither Theorem 1 nor Theorem 2 can be improved; moreover, some of our estimations on grammars' size are optimal as well (e.g. those achieving |S|H_k(S) + o(|S| log σ)).

1.0.0.2 Related results

With the use of previously known tools [21] it was shown that Re-Pair with fully naive encoding achieves 2|S|H_k(S) + o(|S| log σ) bits for k = o(log_σ n) [24]. Recently it was shown [27] that irreducible grammars with the encoding of Kieffer and Yang [20] achieve |S|H_k(S) + o(|S| log σ); this was obtained by adapting the aforementioned results of Kieffer and Yang [20]. Note that this result is a simple corollary of Theorem 1, which predates that result: for a string S, Kieffer and Yang showed [20] that their encoding consumes at most |Y_S|H_0(Y_S) + O(|Y_S|) bits for a parsing Y_S induced by an irreducible grammar; they also showed that |Y_S| = O(|S|/log_σ |S|). Combining those facts with Theorem 1 yields the bound of |S|H_k(S) + o(|S| log σ).

While the bound from [27] seems stronger at first sight, it is of a different flavour: it applies to a sophisticated encoding that is not used in practice and it requires the grammar to be in the irreducible form, so, for instance, it applies to Re-Pair only after some postprocessing of the grammar (which can completely change the grammar structure), and for Greedy it is assumed that the algorithm is run to the end, while the original formulation suggested stopping the compressor after a certain iteration [2, 3]. Furthermore, it seems hard to apply the Kieffer and Yang encoding to compressed data structures. Lastly, the bound seems to be weakly connected to real-life performance, as it puts a wide range of algorithms into the same class despite the fact that many of them perform differently in practice. The last disadvantage seems to be especially important: by exploiting the properties of Re-Pair and Greedy we were able to prove that they achieve |S|H_k(S) with the factor 1.5, compared to 2 for irreducible grammars with the entropy encoding (and for k = 0 it can be shown that these bounds are tight); moreover, we show that Re-Pair and Greedy stopped after a certain iteration have a small dictionary, again, contrary to irreducible grammars.

2 Strings and their parsing

A string is a sequence of elements, called letters, from a finite set, called the alphabet, and it is denoted as w = w_1 w_2 ··· w_k, where each w_i is a letter; its length |w| is k. The alphabet's size is denoted by σ; the alphabet is usually clear from the context, and Γ is used when some explicit name is needed. We often analyse words over different alphabets. For two words w, w′, ww′ denotes their concatenation. By w[i . . j] we denote w_i w_{i+1} ··· w_j; this is a subword of w; ε denotes the empty word. For a pair of words v, w, |w|_v denotes the number of different (possibly overlapping) subwords of w equal to v; if v = ε then for uniformity of presentation we set |w|_ε = |w|. Usually, S denotes the input string.

A grammar compression represents an input string S as a context-free grammar generating the unique string S. The right-hand side of the start nonterminal is called the starting string (often denoted as S′) and by the grammar G we mean the collection of the other rules; together they are called the full grammar and denoted as (S′, G). For a nonterminal X we denote its rule's right-hand side by rhs(X). For a grammar G its right-hand sides, denoted by rhs(G), is the set of G's productions' right-hand sides; for a full grammar (S′, G) we also add S′ to the set and denote it by rhs(S′, G). We denote by |rhs(G)| (|rhs(S′, G)|) the sum of the strings' lengths in rhs(G) (rhs(S′, G), respectively). By |N(G)| we denote the number of nonterminals. A grammar is in CNF if all right-hand sides of the grammar consist of two symbols. The string generated by a nonterminal X in a grammar G is the expansion exp(X) of X in G. All reasonable grammar compressions guarantee that no two nonterminals have the same expansion, that |rhs(X)| > 1 for each nonterminal X, and that each nonterminal occurs in the derivation tree; we implicitly assume this in the following. A grammar (full grammar) G ((S′, G), respectively) is said to be small if |rhs(G)| = O(|S|/log_σ |S|) (|rhs(S′, G)| = O(|S|/log_σ |S|), respectively). This matches the folklore information-theoretic lower bound on the size of a grammar for a string of length |S|. In practice, the starting string and the grammar may be encoded in different ways, especially when G is in CNF, hence we make a distinction between these two. Note that for many (though not all) grammar compressors both theoretical considerations and proofs as well as practical evaluation show that the size of the grammar is considerably smaller than the size of the starting string.

A parsing of a string S is any representation S = y_1 y_2 ··· y_c, where each y_i ≠ ε. We call y_i a phrase. We denote a parsing as Y_S = y_1, …, y_c and treat it as a word of length c over the alphabet {y_1, …, y_c}; in particular |Y_S| = c is its size. Then Lengths(Y_S) = |y_1|, |y_2|, …, |y_c| ∈ N^* and we treat it as a word over the alphabet {1, 2, …, |S|}. In grammar compression, the starting string S′ of a full grammar (S′, G) induces a parsing of the input string; also, the concatenation of the strings in rhs(S′, G) induces a parsing plus some additional nonterminals.

▶ Definition 3. For a word S = s_1 s_2 … s_{|S|} its k-order empirical entropy is:

\[ H_k(S) = -\frac{1}{|S|} \sum_{i=k+1}^{|S|} \log \mathbb{P}(s_i \mid S[i-k \ldots i-1]), \]
where P(s_i | w) is the empirical probability of letter s_i occurring in context w, i.e. P(s_i | S[i−k…i−1]) = |S|_{S[i−k…i]} / |S|_{S[i−k…i−1]} if S[i−k…i−1] is not a suffix of S or S[i−k…i−1] = ε; and P(s_i | S[i−k…i−1]) = |S|_{S[i−k…i]} / (|S|_{S[i−k…i−1]} − 1) otherwise.
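For intuition, here is a small Python sketch (ours, not part of the paper) of computing H_k(S) directly from Definition 3; as a simplification it ignores the boundary correction for contexts that are suffixes of S, so it may differ negligibly from the exact definition.

```python
from collections import Counter
from math import log2

def empirical_entropy(s, k):
    """Approximate H_k(s) per Definition 3, ignoring the suffix correction.
    For k = 0 this is the exact zeroth-order empirical entropy."""
    if k == 0:
        freq = Counter(s)
        n = len(s)
        return -sum(f / n * log2(f / n) for f in freq.values())
    ctx = Counter(s[i:i + k] for i in range(len(s) - k + 1))      # |S|_w for |w| = k
    ext = Counter(s[i:i + k + 1] for i in range(len(s) - k))      # |S|_{wa} for |wa| = k+1
    total = -sum(log2(ext[s[i - k:i + 1]] / ctx[s[i - k:i]]) for i in range(k, len(s)))
    return total / len(s)

# Example: H_0 and H_1 of a short string.
print(empirical_entropy("abracadabra", 0), empirical_entropy("abracadabra", 1))
```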

We are mostly interested in the Hk of the input string S and in the H0 for parsing YS of S. The first is a natural measure of the input, to which we compare the size of the output of the grammar compressor, and the second corresponds to the size of the entropy coding of the starting string returned by a grammar compressor.

3 Entropy upper bounds on grammar compression

In this section, we use Theorem 1 to bound the size of grammars returned by popular compressors: Re-Pair [22], Greedy [2], and a general class of methods that produce irreducible grammars. We consider a couple of natural, simple, and practically used bit-encodings of grammars. Interestingly, we obtain bounds of the form α|S|H_k(S) + o(|S| log σ) for a constant α for all of them.

3.1 Encoding of grammars

We first discuss the possible encodings of the (full) grammar and give some estimations on their sizes. All considered encodings assume a linear order on nonterminals: if rhs(X) contains Y then X ≥ Y. In this way, we can encode a rule as the sequence of nonterminals on its right-hand side; in particular, instead of storing the nonterminal names we store their positions in the above ordering. The considered encodings are simple, natural and correspond closely or exactly to the encodings used in grammar compressors like Re-Pair [22] or Sequitur [25]. Some other algorithms, e.g. Greedy [2], use specialized encodings, but at some point they still encode the grammar using an entropy coder with some additional information or assign codes to each nonterminal in the grammar. Thus, most of these custom encodings are roughly equivalent to (or not better than) entropy coding.

Encoding of CNF grammars deserves special attention and is a problem investigated on its own. It is known that a grammar G in CNF can be encoded using |N(G)| log(|N(G)|) + 2|N(G)| + o(|N(G)|) bits [33, 34], which is close to the information-theoretic lower bound of log(|N(G)|!) + 2|N(G)| [33]. On the other hand, heuristics use simpler encodings, e.g. Re-Pair was implemented and tested with several encodings, which were based on a division of the nonterminals of G into z groups g_1, …, g_z, where X ∈ g_i, X → AB, if and only if A ∈ g_{i−1} and B ∈ g_j or B ∈ g_{i−1} and A ∈ g_j, for some j ≤ i − 1, where g_0 is the input alphabet. Then each group is encoded separately. Though no theoretical bounds were given, we show (in the Appendix) that some of these encodings achieve the previously mentioned lower bound of log(|N(G)|!) + 2|N(G)| bits (plus smaller order terms). The above encodings are difficult to analyse due to heuristical optimisations; instead, we analyse an incremental encoding, which is a simplified version of one of the original methods used to encode Re-Pair output [22]. It matches the theoretical lower bound except for a larger constant hidden in O(|N(G)|). To be precise, we consider the following encodings:

Fully naive: We concatenate the right-hand sides of the full grammar. Then each nonterminal and letter is assigned a bitcode of the same length. We store |rhs(X)| for each X; as it is often small, it is sufficient to store it in unary.

Naive: The starting string is entropy-coded, the rules are coded as in the fully naive variant.

Entropy-coded: The rules' right-hand sides are concatenated with the starting string and they are coded using an entropy coder.

Incremental: We use this encoding only for CNF grammars, though it can be extended to general grammars as it is possible to transform any grammar to such a form. Given a full grammar (S′, G) we use an entropy coder for S′ and the encoding defined below for G. It has additional requirements on the order on letters and nonterminals: if X ≤ Y and X′, Y′ are the first symbols in the productions for X, Y then X′ ≤ Y′. Given any grammar, this property can be achieved by permuting the nonterminals, but we must drop the assumption that the right-hand side of a given nonterminal X occurs before X in the sequence. Then the grammar G can be viewed as a sequence of nonterminals:

X_1 → Z_1 U_1, …, X_{|N(G)|} → Z_{|N(G)|} U_{|N(G)|}. We encode the differences ΔX_i = Z_i − Z_{i−1} with Elias δ-codes, and the U_i's naively using ⌈log(|N(G)| + σ)⌉ bits.
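As an illustration, here is a small Python sketch (ours; the function names and the exact accounting are assumptions, not the paper's implementation) estimating the bit-size of the incremental encoding: Elias δ-codes for the differences of the rules' first symbols plus a fixed-length code for the second symbols, as described above.

```python
from math import ceil, floor, log2

def elias_delta_len(x):
    """Number of bits of the Elias delta code of an integer x >= 1."""
    n = floor(log2(x))                            # x has n+1 binary digits
    return n + 2 * floor(log2(n + 1)) + 1

def incremental_bits(rules, sigma):
    """rules: list of pairs (Z_i, U_i) with symbols numbered 0..sigma+|N(G)|-1,
    already reordered so that Z_1 <= Z_2 <= ...  Returns an estimate of the
    incremental-encoding size in bits (lower-order terms ignored)."""
    per_second = ceil(log2(sigma + len(rules)))   # naive code for each U_i
    bits, prev = 0, 0
    for z, u in rules:
        bits += elias_delta_len(z - prev + 1)     # shift by 1 so the code argument is >= 1
        bits += per_second
        prev = z
    return bits
```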

We upper-bound the sizes of grammars under various encodings; the first two estimations are straightforward, the third one requires some calculations. The estimation in Lemma 5 requires a nontrivial argument.

▶ Lemma 4. Let S be a string and (S′, G) a full grammar that generates it. Then:
- fully naive uses at most |rhs(S′, G)| log(σ + |N(G)|) + O(|rhs(S′, G)|) bits;
- naive uses at most |S′|H_0(S′) + |rhs(G)| log(σ + |N(G)|) + O(|rhs(S′, G)|) bits;
- incremental uses at most |S′|H_0(S′) + |N(G)| log(σ + |N(G)|) + O(|rhs(S′, G)|) bits.

The proof idea of Lemma 5 is to show that rhs(S′, G) induces a parsing of S, with at most |N(G)| additional symbols, and then to apply Theorem 1. The latter requires that different nonterminals have different expansions; all practical grammar compressors have this property.

▶ Lemma 5. Let S be a string over an alphabet of size σ, k = o(log_σ |S|), and (S′, G) a full grammar generating it, where no two nonterminals have the same expansion. Let S_G be the concatenation of the strings from rhs(S′, G). If |S_G| = O(|S|/log_σ |S|) then |S_G|H_0(S_G) ≤ |S|H_k(S) + |N(G)| log |S| + o(|S| log σ); the same bound applies to the entropy coding of (S′, G).

3.2 Re-Pair

Re-Pair is one of the best-known grammar compression heuristics. It starts with the input string S and in each step replaces a most frequent pair AB in S with a new symbol X and adds a rule X → AB. For the special case when A = B the replacements are realized by a left-to-right scan, i.e. a maximal substring A^n is replaced by X^{n/2} for even n and X^{(n−1)/2}A for odd n; moreover, it never replaces a pair with only two overlapping occurrences. The choice of a pair to replace depends only on the overall number of occurrences, both overlapping and non-overlapping, thus the algorithm may replace a pair A′A′ which has fewer non-overlapping occurrences than some other pair AB. Re-Pair is simple, fast, and provides a compression ratio better than some of the standard dictionary methods like gzip [22]. It found usage in various applications [14, 19, 7, 10, 36, 15]. We refer to the current state of S, i.e. S with some pairs replaced, as the working string.

We prove that Re-Pair stopped at the right moment achieves H_k (plus some smaller terms), using any of the naive, incremental, or entropy encodings. To this end, we show that Re-Pair reduces the input string to length α|S|/log_σ |S| for an appropriate α and stop the algorithm when the string gets below this size. Then we use the fact that different nonterminals produced by Re-Pair have different expansions ([24, Lem. 4]): at this iteration, by estimations on the number of possible different substrings, we show that for adequate α, β there exists a pair of nonterminals AB, with |exp(A)|, |exp(B)| < (log_σ |S|)/β, such that the pair AB occurs frequently, i.e. ≈ |S|^{1−2/β} times. Then, we use the fact that the frequency of the most frequent pair never increases during the algorithm's runtime ([24, Lem. 3]), and so the grammar size at this point must be ≈ |S|^{2/β}. Then, on the one hand, we can show that the encoding of the grammar constructed so far takes at most ≈ |S|^c log |S| = o(|S|) bits, c < 1, and on the other hand Theorem 1 yields that the entropy coding of the working string is |S|H_k(S) plus some smaller terms.
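To make the description concrete, here is a naive Python sketch of the Re-Pair loop (our simplification: it recounts pairs from scratch each round, counts overlapping occurrences, and skips the corner case of a pair with only two overlapping occurrences; practical implementations use priority queues and run in linear time).

```python
from collections import Counter

def repair(s, max_rules=None):
    """Naive Re-Pair sketch: repeatedly replace a most frequent adjacent pair
    with a fresh nonterminal, using a left-to-right scan for the replacement.
    Returns the final working string and the dictionary of rules."""
    w = list(s)
    rules = {}                                   # nonterminal -> (left, right)
    while max_rules is None or len(rules) < max_rules:
        pairs = Counter(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        x = ("N", len(rules))                    # fresh nonterminal
        rules[x] = (a, b)
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                out.append(x)
                i += 2
            else:
                out.append(w[i])
                i += 1
        w = out
    return w, rules

# Example: stopping early by limiting the number of rules (cf. Theorem 6).
print(repair("abababbababb", max_rules=2))
```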

▶ Theorem 6. Let S be a string over a σ-size alphabet and k = o(log_σ |S|). Let ξ > 0, β > 2 be constants. When the size of the working string of Re-Pair is first below (ξ + 3β)|S|/log_σ |S| + 2|S|^{2/β} + 1, then the number of nonterminals in the grammar is at most (1 + 2/ξ)(|S|^{2/β} · log_σ |S|); for every ξ > 0, β > 2 and S such a point always exists. Moreover, at this moment, the size of the entropy coding of the working string is at most |S|H_k(S) + o(|S| log σ) and the bit-size of Re-Pair stopped at this point is at most |S|H_k(S) + o(|S| log σ) for naive, entropy, and incremental encoding.

Theorem 6 says that Re-Pair achieves H_k when stopped at the appropriate time. Surprisingly, continuing to the end can lead to a worse compression rate. In fact, limiting the size of the dictionary for Re-Pair (as well as for similar methods) in practice results in a better compression rate for larger files [14, 12]. This is not obvious; in particular, Larsson and Moffat [22] believed that it is the other way around, and this belief was supported by results on smaller-size data. This is partially explained by Theorem 8, in which we give a 1.5|S|H_k(S) + o(|S| log σ) bound on Re-Pair run to the end with incremental or entropy encoding (a 2|S|H_k(S) + o(|S| log σ) bound for fully naive encoding was known earlier [24]). Before proving Theorem 8 we first show an example, see Lemma 7, that demonstrates that the 1.5 factor from Theorem 8 is tight, assuming certain encodings, for k = 0. The construction employs large alphabets, i.e. σ = Θ(|S|), and as Theorem 8 assumes k = o(log_σ n), this means that we consider k = 0. While such large alphabets do not reflect practical cases, in which σ is much smaller than |S|, in the case of grammar compression this example still gives some valuable intuition: replacing a substring w decreases the size counted in symbols but may not always decrease the encoding size, as we have to store additional information regarding the replaced string, which is costly for some encodings.

▶ Lemma 7. There exists a family of strings S such that Re-Pair with both incremental and entropy encoding uses at least (3/2)|S|H_0(S) − o(|S| log σ) bits, assuming that we encode a grammar of size g (i.e. with g nonterminals) using at least g log g − O(g) bits. Moreover, |S|H_0(S) = Ω(|S| log σ), which implies that the cost of the encoding, denoted by Re-Pair(S), satisfies
\[ \limsup_{|S| \to \infty} \frac{\text{Re-Pair}(S)}{|S| H_0(S)} \ge \frac{3}{2}. \]

The n-th string for Lemma 7 is over an alphabet Γ_n = {a_1, a_2, …, a_n, #} and is equal to S_n = a_1#a_2#a_3# ··· a_{n−1}#a_n# a_n#a_{n−1}# ··· a_2#a_1#. The strings from Lemma 7 show that at some iteration the bit encoding of Re-Pair can increase. Even though the presented example requires a large alphabet and is somewhat artificial, we cannot ensure that a similar instance does not occur at some iteration of Re-Pair, as the size of the alphabet of the working string increases. In the above example the size of the grammar was significant. This is the only way the bit size can increase, as by Theorem 1 adding new symbols does not increase the entropy encoding of the working string significantly.

The key observation in the proof of Theorem 8, which can be of independent interest, is that for large-entropy strings adding rules to a grammar does not increase the total encoding cost, i.e. |S|H_0(S) + |N(G)| log |N(G)|, that much. Intuitively, even though adding the rules to a grammar G costs about |N(G)| log |N(G)| bits by Lemma 4, if the string S has entropy H_0(S) ≈ log |S|, then after replacing a pair with two occurrences |S|H_0(S) will decrease by ≈ 2 log |S|. Thus the decrease in |S|H_0(S) after replacing a pair should offset the cost of the grammar in the encoding cost. A similar, yet weaker, statement is true if H_0(S) ≥ (log |S|)/β: some replacements may not decrease |S|H_0(S) (as in the example given in Lemma 7), but if we do enough replacements, |S|H_0(S) will eventually decrease and in total it will offset the cost of encoding the grammar to some degree. We show the special case of this observation for β = 2, i.e. if the entropy is H_0(S) ≥ (log |S|)/2, then the encoding cost is bounded by (3/2)|S|H_0(S) for both incremental and entropy coding. The tools we use in the full proof show that this is true not only for Re-Pair, but also for a wide range of grammar compressors (see weakly non-redundant grammars, paragraph B.2.0.1, in the Appendix).

In the proof of Theorem 8 we look at the working string S′ after |S|/log^{1+ε} |S| of Re-Pair's iterations. Up to this point the grammar G_0 must have at most |S|/log^{1+ε} |S| rules, thus the encoding of those rules takes o(|S|) bits so far. Denote by S″ the final working string and by G′ the final grammar. Then we estimate the encoding of the grammar G = G′ \ G_0 and the encoding of S″; this can be viewed as estimating the size of the grammar (S″, G) obtained by running Re-Pair on S′. If |S′| ≤ |S|/log^{1+ε} |S| then even the trivial encoding takes o(|S|) bits. Otherwise, we consider incremental and entropy coding separately; consider first the case of incremental encoding. Then the encoding size of (S″, G) is upper bounded by roughly |S″|H_0(S″) + |N(G)| log |S|, and we show that this is bounded by (3/2)|S′|H_0(S′). For this, we write |S″|H_0(S″) + |N(G)| log |S| as ½|S″|H_0(S″) + (½|S″|H_0(S″) + |N(G)| log |S|). As the first term is around ½|S′|H_0(S′) by Theorem 1 applied with k = 0 (S″ induces a parsing of S′), we need to estimate the ½|S″|H_0(S″) + |N(G)| log |S| summand. Note that |N(G)| ≤ (|S′| − |S″|)/2, as each nonterminal corresponds to at least two replacements of a pair; thus it is enough to estimate |S″| log |S| in terms of |S′|H_0(S′). To this end we use Theorem 1 to lower bound |S′|H_0(S′): consider a trivial parsing Y_P of S′ into consecutive pairs of letters; a calculation shows that we can lower bound the entropy of this parsing |Y_P|H_0(Y_P) by roughly (|S′|/2) log |S| (as at this point each pair occurs at most O(log^{1+ε} |S|) times). Thus |S′|H_0(S′) ≥ |Y_P|H_0(Y_P) ≥ (|S′|/2) log |S|, and together with |N(G)| ≤ (|S′| − |S″|)/2 this allows us to estimate the ½|S″|H_0(S″) + |N(G)| log |S| ≤ log |S| · (½|S″| + |N(G)|) summand by |S′|H_0(S′). At the end, we again use Theorem 1 to show that (3/2)|S′|H_0(S′) ≤ (3/2)|S|H_k(S) + o(|S| log σ). As we consider different grammars (for S and S′), some additional technical care is needed, i.e. we must show that the encodings of G′ and G′ \ G_0 are asymptotically similar. Arguing about the full entropy encoding is more involved and requires some additional considerations, but the main idea is similar.
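The accounting behind the 3/2 factor can be summarized in the following chain of approximate inequalities (our restatement of the sketch above; all o(·) error terms from Theorem 1 are omitted):

\begin{align*}
|S''|H_0(S'') + |N(G)|\log|S|
 &= \tfrac{1}{2}|S''|H_0(S'') + \left(\tfrac{1}{2}|S''|H_0(S'') + |N(G)|\log|S|\right)\\
 &\le \tfrac{1}{2}|S''|H_0(S'') + \log|S|\cdot\left(\tfrac{1}{2}|S''| + |N(G)|\right)\\
 &\le \tfrac{1}{2}|S'|H_0(S') + \log|S|\cdot\tfrac{1}{2}|S'| && \text{by Theorem 1 with } k=0 \text{ and } |N(G)| \le \tfrac{|S'|-|S''|}{2}\\
 &\le \tfrac{1}{2}|S'|H_0(S') + |S'|H_0(S') && \text{by } |S'|H_0(S') \ge \tfrac{|S'|}{2}\log|S|\\
 &= \tfrac{3}{2}|S'|H_0(S').
\end{align*}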

▶ Theorem 8. Let S be a string over a σ-size alphabet, k = o(log_σ |S|). Then the size of the Re-Pair output is at most (3/2)|S|H_k(S) + o(|S| log σ) for incremental and entropy encoding.

3.3 Irreducible grammars and their properties

Kieffer et al. introduced the concept of irreducible grammars [20], which formalises the idea that there is no immediate way to make the grammar smaller. Many heuristics fall into this category: Sequential, Sequitur, LongestMatch, and Greedy, even though some were invented before the notion was introduced [6]. They also developed an encoding of such grammars which was used as a universal code for finite-state sources [20].

▶ Definition 9. A full grammar (S, G) is irreducible if:
(IG1) no two distinct nonterminals have the same expansion;
(IG2) every nonterminal, except the starting symbol, occurs at least twice in rhs(S, G);
(IG3) no pair occurs twice (without overlaps) in S and G's right-hand sides.

Unfortunately, most irreducible grammars have the same drawback as Re-Pair: they can introduce new symbols without decreasing the entropy of the starting string while increasing the bit size of the grammar. In particular, the example from Lemma 7 applies to irreducible grammars (i.e. the grammar produced by Re-Pair in Lemma 7 is irreducible). Ideally, for an irreducible grammar we would like to repeat the argument used for Re-Pair: we can stop at some iteration (or decompress some nonterminals as in [4]) such that the grammar is small and has O(|S|^c) nonterminals, for some constant c < 1. It turns out that there are irreducible grammars which do not have this property, see Lemma 10. Moreover, grammar compressors that work in a top-down manner, like LongestMatch, tend to produce such grammars. Furthermore, the grammar in Lemma 10 has size Θ(|S|/log_σ |S|), which is both an upper and a lower bound for the size of an irreducible grammar.

▶ Lemma 10. For a fixed σ there exists a family of irreducible grammars {G_i}_{i≥0}, where G_i generates a string S_i, such that |S_1| < |S_2| < ···, the size of G_i is Θ(|S_i|/log_σ |S_i|), and for every constant 0 < c < 1 decompressing any set of nonterminals so that only O(|S_i|^c) nonterminals remain yields a grammar of size Θ(|S_i|) = ω(|S_i|/log_σ |S_i|), i.e. the grammar is no longer small.

Moreover, some encodings of such grammars can give larger output than Re-Pair’s, i.e. the 1.5 bound does not hold for irreducible grammars.

▶ Lemma 11. There exists a family of strings {S_i}_{i=1}^{∞} such that for each string S_i there exists an irreducible grammar G_i whose entropy coding takes at least 2|S_i|H_0(S_i) − o(|S_i| log σ) bits. Moreover |S_i|H_0(S_i) = Ω(|S_i| log σ), i.e. the cost of the encoding, denoted by Irreducible(S), satisfies
\[ \limsup_{i \to \infty} \frac{\text{Irreducible}(S_i)}{|S_i| H_0(S_i)} \ge 2. \]

Still, we give upper bounds for encodings of irreducible grammars, though with worse constants. The rough idea of the proof of Theorem 12 is as follows: for an irreducible grammar (S′, G) we consider the concatenation S_G of the strings in rhs(S′, G). Then we show that S_G can be obtained by permuting the letters of two strings S″ and S_N, where S″ is a parsing of S and S_N can be obtained by permuting and removing letters from S″. Thus |S_G|H_0(S_G) = |S″S_N|H_0(S″S_N) ≤ 2|S″|H_0(S″); by applying Theorem 1 to S″ we get that 2|S″|H_0(S″) is bounded by 2|S|H_k(S) + o(|S| log σ).

The proof of Theorem 13, similarly to the proof of Theorem 8, uses Theorem 1 to lower bound the entropy of S_G: in an irreducible grammar, each pair occurs only once in the strings of rhs(S′, G), thus we can find a parsing Y_{S_G} of S_G into pairs and letters where |S_G|/3 phrases are unique (note that all rules can be of length 3). Thus the zeroth-order entropy of the parsing is |Y_{S_G}|H_0(Y_{S_G}) ≥ (|S_G|/3) log(|S_G|/3). Then by Theorem 1 we get that (|S_G|/3) log(|S_G|/3) ≤ |Y_{S_G}|H_0(Y_{S_G}) − O(|Y_{S_G}|) ≤ |S_G|H_0(S_G); plugging in the estimation on |S_G|H_0(S_G) from Theorem 12 yields the claim.

▶ Theorem 12. Let S be a string over a σ-sized alphabet, k = o(log_σ |S|). Then the size of the entropy coding of any irreducible full grammar generating S is at most 2|S|H_k(S) + o(|S| log σ).

▶ Theorem 13. Let S be a string over a σ-sized alphabet, k = o(log_σ |S|). The size of the fully naive coding of any irreducible full grammar generating S is at most 6|S|H_k(S) + o(|S| log σ).

3.4 Greedy

Greedy [2] can be viewed as a non-binary Re-Pair: in each round, it replaces a substring occurring in some strings of rhs(S, G), obtaining (S′, G′), such that |rhs(S′, G′)| is the smallest possible. For this, at each round, for the substrings of the strings in rhs(S, G) it calculates the maximum number of non-overlapping occurrences. This can be realised by creating a suffix tree at each iteration, and so its asymptotic construction time has been bounded only by O(|S|^2) so far. Note that, contrary to Re-Pair, it can correctly estimate the decrease in the number of symbols when replacing a pair AA in a maximal substring A^n. It is known to produce small grammars in practice, both in terms of nonterminal size and bit size. Moreover, it is notorious for being hard to analyse in terms of the approximation ratio [6, 17].

Greedy has similar properties as Re-Pair: after each iteration the frequency of the most frequent pair on the right-hand side of the grammar does not increase and the size of the grammar can be estimated in terms of this frequency. In particular, there always is a point of its execution in which the number of nonterminals is O(|S|^c), c < 1, and the full grammar is of size O(|S|/log_σ |S|). The entropy encoding at this time yields |S|H_k(S) + o(|S| log σ) bits, while Greedy run till the end achieves (3/2)|S|H_k(S) using entropy coding and 2|S|H_k(S) using fully naive encoding, so the same as in the case of Re-Pair.
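A rough Python sketch of one Greedy round under the simplifications above (our code, with quadratic-time enumeration instead of the suffix tree used in practice): for every candidate substring it counts non-overlapping occurrences across all right-hand sides and picks the replacement that shrinks |rhs(S′, G′)| the most.

```python
def count_nonoverlapping(text, pat):
    """Left-to-right count of non-overlapping occurrences of pat in text."""
    count, i = 0, 0
    while i + len(pat) <= len(text):
        if text[i:i + len(pat)] == pat:
            count += 1
            i += len(pat)
        else:
            i += 1
    return count

def best_greedy_replacement(rhs_strings):
    """One Greedy round (simplified): replacing f non-overlapping occurrences
    of a length-l substring by a fresh symbol changes the total size by
    l + f - f*l symbols, so the saving is f*l - f - l; return the best substring."""
    best, best_saving = None, 0
    for s in rhs_strings:
        for l in range(2, len(s) + 1):
            for start in range(len(s) - l + 1):
                pat = s[start:start + l]
                f = sum(count_nonoverlapping(t, pat) for t in rhs_strings)
                saving = f * l - f - l
                if saving > best_saving:
                    best, best_saving = pat, saving
    return best, best_saving

# Example: the best substring to replace in a toy set of right-hand sides.
print(best_greedy_replacement(["abcabcabc", "xabcx"]))
```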

In practice stopping Greedy early is beneficial: similarly as in Re-Pair, there can exist a point where we add new symbols that do not decrease the bit-size of the output. Indeed, it was suggested [2] to stop Greedy as soon as the decrease in the grammar size is small enough, yet this was not experimentally evaluated. Moreover, as the time needed for one iteration is linear, we can stop after O(|S|^c) iterations, ending with O(|S|^{1+c}) running time; again, our analysis suggests that the factor c depends on the constant in the additional summand. The proofs for Greedy are similar in spirit to those for Re-Pair, yet the technical details differ. In particular, we use the (non-trivial) fact that Greedy produces an irreducible grammar and that at every iteration conditions (IG1–IG2) are satisfied, i.e. that distinct nonterminals have distinct expansions and that every nonterminal is used at least twice. All of those properties are a consequence of Greedy being a global algorithm, see [6].

▶ Theorem 14. Let S be a string over a σ-sized alphabet and k = o(log_σ |S|). Let ξ > 0, β > 2 be constants. Let (S′, G) be the full grammar generating S after some iteration of Greedy. When |rhs(S′, G)| is first below (2ξ + 10β)|S|/log_σ |S| + 8|S|^{2/β} + 2, the number of nonterminals |N(G)| of G is at most (1 + 2/ξ)|S|^{2/β} log_σ |S|; for every ξ > 0, β > 2 and S such a point always exists. Moreover, at this moment the bit-size of the entropy coding of (S′, G) is at most |S|H_k(S) + o(|S| log σ).

▶ Theorem 15. Let S be a string over a σ-sized alphabet and k = o(log_σ |S|). Then the size of the fully naive encoding of the full grammar produced by Greedy is at most 2|S|H_k(S) + o(|S| log σ); moreover, the size of the entropy encoding of the full grammar produced by Greedy is at most (3/2)|S|H_k(S) + o(|S| log σ).

4 Entropy bound for string parsing

We now prove Theorem 1 and Theorem 2. We make a connection between the entropy of the parsing of a string S, i.e. |Y_S|H_0(Y_S), and the k-order empirical entropy |S|H_k(S) of S; this can be seen as a refinement and strengthening of results that relate |Y_S| log |Y_S| to H_k(S) [21, Lem. 2.3]. To compare the two results, our result establishes upper bounds when the phrases of Y_S are encoded using an entropy coder, while the previous ones apply to the trivial encoding of Y_S, which assigns to each letter log |Y_S| bits. The proofs follow in a couple of simple steps. First, we recall a strengthening of the fact that the entropy lower-bounds the size of the encoding of the string using any prefix codes: instead of assigning natural lengths (of codes) to letters of Y_S we can assign them some probabilities (and think of −log(p) as the "length" of the prefix code). Then we define the probabilities of phrases in the parsing Y_S in such a way that they relate to higher-order entropies. Depending on the result we either look at a fixed, k-letter context or at the (i − 1)-letter context for the i-th letter of a phrase. This already yields the first claim of Theorem 1; to obtain the second we substitute the estimation on the parsing size and make some basic estimations on the entropy of the lengths, which is textbook knowledge [8, Lemma 13.5.4].

▶ Lemma 16 ([1]). Let w be a string over alphabet Γ and let p : Γ → R_+ be a function such that Σ_{s∈Γ} p(s) ≤ 1. Then |w|H_0(w) ≤ −Σ_{s∈Γ} |w|_s log p(s).

To use Lemma 16 we need an appropriate valuation of p for the phrases of the parsing. We assign a p-value to each individual letter in each phrase and then, for a phrase y, p(y) is the product of the values of consecutive letters of y. In the case of Theorem 2, to the j-th letter of a phrase y we assign the probability of this letter occurring in its (j − 1)-letter context in S, i.e. by Definition 3 in most cases P(y[j] | y[1 … j−1]) = |S|_{y[1…j]} / |S|_{y[1…j−1]}, i.e. we think that we encode the letter using (j−1)-th order entropy. The difference in the case of Theorem 1 is that we assign 1/σ to the first k letters of the phrase; the remaining ones are assigned the probability of the letter occurring in its k-letter context. The phrase costs are simply the (negated) logarithms of the phrase probabilities, which intuitively can be viewed as the cost of bit-encoding of a phrase.

▶ Definition 17 (Phrase probability, parsing cost). Given a string S and its parsing Y_S = y_1 y_2 … y_c, the phrase probability P(y_i) and the k-bounded phrase probability are:

\[ \mathbb{P}(y_i) = \prod_{j=1}^{|y_i|} \mathbb{P}(y_i[j] \mid y_i[1 \ldots j-1]) \quad\text{and}\quad \mathbb{P}_k(y_i) = \frac{1}{\sigma^{\min(|y_i|,k)}} \prod_{j=k+1}^{|y_i|} \mathbb{P}(y_i[j] \mid y_i[j-k \ldots j-1]), \]

where P(a|w) is the empirical probability of letter a occurring in context w, cf. Definition 3. The phrase cost and the parsing cost are C(y_i) = −log P(y_i) and C(Y_S) = Σ_{i=1}^{|Y_S|} C(y_i). Similarly, the k-bounded phrase cost and k-bounded parsing cost are C_k(y_i) = −log P_k(y_i) and C_k(Y_S) = Σ_{i=1}^{|Y_S|} C_k(y_i).
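A small Python sketch (ours) of the k-bounded parsing cost C_k(Y_S) from Definition 17, with the empirical conditional probabilities measured in S and, as a simplification, without the boundary correction of Definition 3; it assumes k ≥ 1 and that the phrases indeed form a parsing of s.

```python
from collections import Counter
from math import log2

def k_bounded_parsing_cost(s, phrases, k, sigma):
    """C_k(Y_S): the first min(|y|, k) letters of each phrase cost log(sigma) bits
    each; every later letter y[j] costs -log P(y[j] | y[j-k .. j-1]), where the
    conditional probability is the empirical one measured in s (k >= 1)."""
    ctx = Counter(s[i:i + k] for i in range(len(s) - k + 1))       # |S|_w,  |w| = k
    ext = Counter(s[i:i + k + 1] for i in range(len(s) - k))       # |S|_{wa}, |wa| = k+1
    cost = 0.0
    for y in phrases:
        cost += min(len(y), k) * log2(sigma)
        for j in range(k, len(y)):
            cost += -log2(ext[y[j - k:j + 1]] / ctx[y[j - k:j]])
    return cost

# Example: cost of a parsing of "abracadabra" into 4 phrases, with k = 1.
print(k_bounded_parsing_cost("abracadabra", ["abr", "aca", "dab", "ra"], 1, sigma=5))
```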

When comparing the C_k(Y_S) cost and |S|H_k(S), the latter always uses −log P(a|w) bits for each symbol a occurring in context w, while C_k(Y_S) uses log σ bits on each of the first k letters of each phrase; thus intuitively it loses up to log σ bits on each of those |Y_S|k letters.

▶ Lemma 18. Let S be a string and let Y_S be its parsing. Then for any k: C_k(Y_S) ≤ |S|H_k(S) + |Y_S|k log σ.

▶ Lemma 19. Let S be a string. Then for any l there exists a parsing Y_S of length |Y_S| ≤ ⌈|S|/l⌉ + 1 such that each phrase except the first and the last has length exactly l and C(Y_S) ≤ (|S|/l) Σ_{i=0}^{l−1} H_i(S) + log |S|.

Ideally, we would like to plug the phrase probabilities for Y_S into Lemma 16 and so obtain that the parsing cost C(Y_S) upper-bounds the entropy coding of Y_S, i.e. |Y_S|H_0(Y_S). But the assumption of Lemma 16 (that the values of the function p sum to at most 1) may not hold, as we can have phrases of different lengths and so their probabilities can somehow mix. Thus we also take into account the lengths of the phrases: we multiply the phrase probability by the probability of |y|, i.e. the frequency of |y| in Lengths(Y_S). After simple calculations, we conclude that |Y_S|H_0(Y_S) is upper bounded by the parsing cost C(Y_S) plus the zeroth-order entropy of the phrases' lengths, i.e. |L|H_0(L) for L = Lengths(Y_S).

▶ Theorem 20. Let S be a string over a σ-size alphabet, Y_S its parsing, c = |Y_S| and L = Lengths(Y_S). Then:
\[ |Y_S|H_0(Y_S) \le C(Y_S) + |L|H_0(L) \quad\text{and}\quad |Y_S|H_0(Y_S) \le C_k(Y_S) + |L|H_0(L). \]

When S ∈ a^* (and so C(S) = C_k(S) = H_0(S) = 0) the entropy |Y_S|H_0(Y_S) is exactly the entropy of the lengths |L|H_0(L), thus the summand |L|H_0(L) on the right-hand side is necessary.

By textbook arguments [8, Lem. 13.5.4], the entropy of the lengths is always O(|S|), and for small enough |Y_S| it is O(|S| log log_σ |S|/log_σ |S|); moreover, in the case of Theorem 2 it can be bounded by O(log |S|). Theorems 1 and 2 follow directly from Lemmas 18, 19, Theorem 20 and the above estimations.

References
1 Janos Aczél. On Shannon's inequality, optimal coding, and characterizations of Shannon's and Rényi's entropies. In Symposia Mathematica, volume 15, pages 153–179, 1973.

2 Alberto Apostolico and Stefano Lonardi. Some theory and practice of greedy off-line textual substitution. In Data Compression Conference, 1998. DCC’98. Proceedings, pages 119–128. IEEE, 1998. 3 Alberto Apostolico and Stefano Lonardi. Compression of biological sequences by greedy off-line textual substitution. In Proceedings of the Conference on Data Compression, DCC ’00, pages 143–152, Washington, DC, USA, 2000. IEEE Computer Society. URL: http: //dl.acm.org/citation.cfm?id=789087.789789. 4 Martin Bokler and Eric Hildebrandt. Rule reduction—a method to improve the compression performance of grammatical compression algorithms. AEU-International Journal of Electronics and Communications, 65(3):239–243, 2011. 5 Katrin Casel, Henning Fernau, Serge Gaspers, Benjamin Gras, and Markus L. Schmid. On the complexity of grammar-based compression over fixed alphabets. In Ioannis Chatzigiannakis, Michael Mitzenmacher, Yuval Rabani, and Davide Sangiorgi, editors, 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, volume 55 of LIPIcs, pages 122:1–122:14. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016. doi:10.4230/LIPIcs.ICALP.2016.122. 6 Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Trans. Information Theory, 51(7):2554– 2576, 2005. URL: http://dx.doi.org/10.1109/TIT.2005.850116, doi:10.1109/TIT.2005. 850116. 7 Francisco Claude and Gonzalo Navarro. A fast and compact web graph representation. In Nivio Ziviani and Ricardo A. Baeza-Yates, editors, SPIRE, volume 4726 of LNCS, pages 118–129. Springer, 2007. URL: http://dx.doi.org/10.1007/978-3-540-75530-2_11, doi: 10.1007/978-3-540-75530-2\_11. 8 Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006. 9 Paolo Ferragina and Rossano Venturini. A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci., 372(1):115–121, 2007. doi:10.1016/j.tcs.2006.12.012. 10 Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci., 410(51):5354–5364, 2009. URL: http://dx.doi.org/10. 1016/j.tcs.2009.09.012, doi:10.1016/j.tcs.2009.09.012. 11 Michał Gańczorz. Entropy lower bounds for dictionary compression. In 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy., pages 11:1– 11:18, 2019. doi:10.4230/LIPIcs.CPM.2019.11. 12 Michał Gańczorz and Artur Jeż. Improvements on Re-Pair grammar compressor. 2017 Data Compression Conference (DCC), pages 181–190, 2017. 13 Rodrigo González and Gonzalo Navarro. Statistical encoding of succinct data structures. In Moshe Lewenstein and Gabriel Valiente, editors, Combinatorial Pattern Matching, 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings, volume 4009 of Lecture Notes in Computer Science, pages 294–305. Springer, 2006. doi:10.1007/11780441\_27. 14 Rodrigo González and Gonzalo Navarro. Compressed text indexes with fast locate. In Bin Ma and Kaizhong Zhang, editors, CPM, volume 4580 of LNCS, pages 216–227. Springer, 2007. URL: http://dx.doi.org/10.1007/978-3-540-73437-6_23, doi:10.1007/ 978-3-540-73437-6\_23. 15 Rodrigo González, Gonzalo Navarro, and Héctor Ferrada. Locally compressed suffix arrays. J. Exp. Algorithmics, 19:1.1:1.1–1.1:1.30, January 2015. URL: http://doi.acm.org/10.1145/ 2594408, doi:10.1145/2594408. 
16 Roberto Grossi, Rajeev Raman, Srinivasa Rao Satti, and Rossano Venturini. Dynamic compressed strings with random access. In Automata, Languages, and Programming - 40th International Colloquium, ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part I, pages 504–515. Springer, 2013. doi:10.1007/978-3-642-39206-1_43.

17 Danny Hucke, Markus Lohrey, and Carl Philipp Reh. The smallest grammar problem revisited. In Shunsuke Inenaga, Kunihiko Sadakane, and Tetsuya Sakai, editors, SPIRE, volume 9954 of LNCS, pages 35–49, 2016. doi:10.1007/978-3-319-46049-9\_4. 18 Artur Jeż. A really simple approximation of smallest grammar. Theoretical Computer Science, 616:141–150, 2016. URL: http://dx.doi.org/10.1016/j.tcs.2015.12.032, doi: 10.1016/j.tcs.2015.12.032. 19 Takuya Kida, Tetsuya Matsumoto, Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa. Collage system: a unifying framework for compressed pattern matching. Theor. Comput. Sci., 1(298):253–272, 2003. URL: http://dx.doi.org/10.1016/S0304-3975(02) 00426-7, doi:10.1016/S0304-3975(02)00426-7. 20 John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737–754, 2000. doi:10.1109/ 18.841160. 21 S. Rao Kosaraju and Giovanni Manzini. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM J. Comput., 29(3):893–911, 1999. doi:10.1137/S0097539797331105. 22 N. Jesper Larsson and Alistair Moffat. Offline dictionary-based compression. In Data Compression Conference, pages 296–305. IEEE Computer Society, 1999. doi:10.1109/DCC. 1999.755679. 23 Giovanni Manzini. An analysis of the Burrows&Mdash;Wheeler Transform. J. ACM, 48(3):407– 430, May 2001. URL: http://doi.acm.org/10.1145/382780.382782, doi:10.1145/382780. 382782. 24 Gonzalo Navarro and Luís M. S. Russo. Re-Pair achieves high-order entropy. In DCC, page 537. IEEE Computer Society, 2008. URL: http://dx.doi.org/10.1109/DCC.2008.79, doi:10.1109/DCC.2008.79. 25 Craig G Nevill-Manning and Ian H Witten. Compression and explanation using hierarchical grammars. The Computer Journal, 40(2_and_3):103–116, 1997. 26 Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. J. Artif. Intell. Res. (JAIR), 7:67–82, 1997. doi:10.1613/jair.374. 27 Carlos Ochoa and Gonzalo Navarro. RePair and all irreducible grammars are upper bounded by high-order empirical entropy. IEEE Transactions on Information Theory, 65(5):3160–3164, 2019. doi:10.1109/TIT.2018.2871452. 28 Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302(1-3):211–222, 2003. doi:10.1016/ S0304-3975(02)00777-6. 29 Kunihiko Sadakane and Roberto Grossi. Squeezing succinct data structures into entropy bounds. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1230–1239. Society for Industrial and Applied Mathematics, 2006. 30 Hiroshi Sakamoto. A fully linear-time approximation algorithm for grammar-based compression. J. Discrete Algorithms, 3(2-4):416–430, 2005. doi:10.1016/j.jda.2004.08.016. 31 James A. Storer and Thomas G. Szymanski. The macro model for data compression. In Richard J. Lipton, Walter A. Burkhard, Walter J. Savitch, Emily P. Friedman, and Alfred V. Aho, editors, STOC, pages 30–39. ACM, 1978. 32 Wojciech Szpankowski. Asymptotic properties of data compression and suffix trees. IEEE transactions on Information Theory, 39(5):1647–1659, 1993. 33 Yasuo Tabei, Yoshimasa Takabatake, and Hiroshi Sakamoto. A succinct grammar compression. In Annual Symposium on Combinatorial Pattern Matching, pages 235–246. Springer, 2013. 34 Yoshimasa Takabatake, Yasuo Tabei, and Hiroshi Sakamoto. Variable-length codes for space- efficient grammar-based compression. 
In International Symposium on String Processing and Information Retrieval, pages 398–410. Springer, 2012. 35 Michal Vasinek and Jan Platos. Prediction and evaluation of zero order entropy changes in grammar-based codes. Entropy, 19(5):223, 2017. doi:10.3390/e19050223.

36 Raymond Wan. Browsing and Searching Compressed Documents. PhD thesis, Department of Computer Science and Software Engineering, University of Melbourne, 2003. URL: http://hdl.handle.net/11343/38920. 37 Aaron D. Wyner and Jacob Ziv. The sliding-window Lempel-Ziv algorithm is asymptotically optimal. Proceedings of the IEEE, 82(6):872–877, Jun 1994. 38 Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. Information Theory, 24(5):530–536, 1978. doi:10.1109/TIT.1978.1055934.

Appendix

A Additional material for Section 2

For simplicity, if i > j then w[i . . . j] = ε. We extend the notation |w|_v to sets of words, i.e. |w|_V = Σ_{v∈V} |w|_v. The size of a nonterminal |rhs(X)| is the length of its right-hand side; for a grammar G or a full grammar (S, G), |rhs(G)| and |rhs(S, G)| denote the sum of lengths of the strings in rhs(G) and rhs(S, G), respectively.

B Expanded commentary for Section 3

We prove that Re-Pair and Greedy stopped after a certain iteration achieve |S|H_k(S) + o(|S| log σ) bits and their number of different nonterminals is small, i.e. O(|S|^c) for some c < 1. Our analysis shows that we can decrease the value c at the expense of the constant hidden in o(|S| log σ). This is particularly important in practical applications as it translates to a grammar which has a small output alphabet and, as in most compressors every symbol is encoded using a prefix-free code, this reduces the bits required to store this code. In comparison, certain dictionary methods, like LZ78, do not have this property, as they can produce a string over a large, i.e. Ω(|S|/log_σ |S|)-sized, alphabet, which implies a large dictionary size.

B.1 Proofs for Section 3.1 (Encoding of grammars)

We first complete the promised argument that at least some of the original encodings of Re-Pair match the previously presented bound on CNF encoding.

▶ Lemma 21. Re-Pair's original grammar encoding together with an arithmetic coder encodes a grammar G using |N(G)| log(|N(G)| + σ) + O(|N(G)| + σ) bits.

Proof. As mentioned before, one of the encodings originally used for Re-Pair [22] is based on a division of the nonterminals of G into groups g_1, …, g_z, where X → AB ∈ g_i if and only if A ∈ g_{i−1} and B ∈ g_j, or B ∈ g_{i−1} and A ∈ g_j, for some j ≤ i − 1. To complete the definition, g_0 contains all letters of the input alphabet. In short, group g_i contains the nonterminals which are at height i in the derivation tree of G. It was shown [22] that each group g_i, i > 0, can be represented as a bitvector of size p_{i−1}^2 − p_{i−2}^2 in which p_i − p_{i−1} entries are ones, where p_i = Σ_{j≤i} |g_j|. Those arrays were then encoded using arithmetic coding [22]. When the probabilities P(1) = 1/(|N(G)| + σ) and P(0) = (|N(G)| + σ − 1)/(|N(G)| + σ) are used in such a coding, each group is encoded using at most (p_i − p_{i−1}) log(|N(G)| + σ) + (p_{i−1}^2 − p_{i−2}^2) log((|N(G)| + σ)/(|N(G)| + σ − 1)) + O(1) bits. Summing over all groups we get that the size is bounded by

\[ \sum_{i=1}^{z} \left( (p_i - p_{i-1}) \log(|N(G)| + \sigma) + (p_{i-1}^2 - p_{i-2}^2) \log\frac{|N(G)| + \sigma}{|N(G)| + \sigma - 1} + O(1) \right). \]
As the sum telescopes, this can be bounded by:

\begin{align*}
&(p_z - p_0) \log(|N(G)| + \sigma) + (p_{z-1}^2 - p_0^2) \log\frac{|N(G)| + \sigma}{|N(G)| + \sigma - 1} + O(z)\\
&\quad \le |N(G)| \log(|N(G)| + \sigma) + (|N(G)| + \sigma)^2 \log\frac{|N(G)| + \sigma}{|N(G)| + \sigma - 1} + O(|N(G)|),
\end{align*}

where the inequality follows as p_z = |N(G)| + σ and p_0 = σ. As x log(x/(x−1)) = O(1), we have that the total cost of encoding is at most:

\[ |N(G)| \log(|N(G)| + \sigma) + O(|N(G)| + \sigma). \] ◀

It is worth mentioning that instead of using a fixed arithmetic coder, the original solution used an adaptive one, which can be beneficial in practice.

Proof of Lemma 4. In each of the encodings we need to store the sizes of the right-hand sides of the nonterminals; storing these values in unary gives sufficient bounds for each method, i.e. it consumes O(|rhs(S′, G)|) bits. In the case of entropy coding, it is sometimes more practical to add separator characters to the concatenation of the rules' right-hand sides; observe that this adds extra O(|rhs(S′, G)|) bits. To see this, observe that if we have a string w ∈ Σ^* encoded with some prefix code p : Σ → {0, 1}^* in a way that the encoding uses B bits, then we can devise a new code p′ : (Σ ∪ {#}) → {0, 1}^* which encodes w with B + |w| bits and supports the new character # by simply setting p′(#) = 0 and p′(a) = 1p(a) for a ≠ #. Thus adding f occurrences of a new letter to the string w increases the encoding size by at most |w| + f. As in our case f < |w| = |rhs(S′, G)|, the claim holds.

The bound for the fully naive encoding is obvious, as symbols on the right-hand sides are either one of the original σ symbols or one of the |N(G)| nonterminals. In the naive encoding the starting string is encoded using an entropy coder (e.g. Huffman), and the grammar is encoded in the same way as in the fully naive encoding.

For the case of the incremental encoding, let us first argue that we can always reorder nonterminals appropriately. Consider the following procedure: start with a sequence R which contains the letters of the original alphabet in an arbitrary order. Throughout the procedure the sequence R contains the already processed letters and nonterminals. We append nonterminals to R in the following way: take the first unprocessed nonterminal X → YZ with Y being the earliest possible nonterminal or letter occurring in R, and append X to R. At the end of the above procedure it is necessary to rename the nonterminals so that they correspond to their positions in the created sequence. In the incremental encoding we encode the second element of each pair naively. As the first elements are sorted, we store only the consecutive differences. For these differences we use Elias δ-codes. These codes, for a number x, consume at most log x + 2 log log(1 + x) + 1 bits. As all these differences sum up to |N(G)| + σ, it can be shown that this encoding takes at most O(|N(G)| + σ) bits [29, Theorem 2.2]. Observe that encoding these differences in unary is also within the required bounds. ◀
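The separator trick used in the proof can be written down directly; a toy Python sketch of ours:

```python
def extend_with_separator(code, sep="#"):
    """Given a prefix code (dict: symbol -> bitstring) for an alphabet without sep,
    build the code p' from the proof: p'(sep) = '0' and p'(a) = '1' + p(a) otherwise.
    The result is again a prefix code; every original codeword grows by one bit,
    so a string w with f separators costs at most B + |w| + f bits."""
    new_code = {sep: "0"}
    for a, bits in code.items():
        new_code[a] = "1" + bits
    return new_code

# Example: extend a code for {a, b} with a separator.
print(extend_with_separator({"a": "0", "b": "1"}))
```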

I Lemma 22. Let S be a string over alphabet Σ ⊆ {1,..., |S|}. Then there is a Huffman-based encoder which uses at most |S|H0(S) + O(S) bits to encode S (together with a dictionary).

Proof. It is a well-known fact, that Huffman encoding of a string S takes at most |S|H0(S) + |S| bits. Observe that we do not need to store the dictionary explicitly: assuming that we use a deterministic procedure to build the Huffman tree we only need to store frequencies of each letter. It is sufficient to store the frequencies in unary, i.e. we store string 1f1 01f2 0 ... 01f|S| , where fi is a frequency of i-th letter, note that some fis could be 0. Storing the frequencies takes 2|S| − 1 = O(|S|) bits. J

I Lemma 23 (Full version of Lemma 5). Let S be a string over an alphabet of size σ and 0 (S ,G) a full grammar generating it, and k = o(logσ |S|). Let SG be a concatenation of 18 Entropy bounds for grammar compression

strings in rhs(S0,G). Assume that no two nonterminals in (S0,G) have the same expansion.  |S|  If |SG| = O then logσ |S|

|SG|H0(SG) ≤ |S|Hk(S) + |N (G)| log |S| + o(|S| log σ) .   In particular, if additionally |N (G)| = o |S| then logσ |S|

|SG|H0(SG) ≤ |S|Hk(S) + o(|S| log σ) .

The same bounds apply to the entropy coding of (S0,G).

0 Proof. From the full grammar (S ,G) we create a parsing YS of string S of size |YS| ≤ | rhs(G)| + |S0| by the following procedure: Take the starting string S0 as S00. While there is a nonterminal X such that it occurs in S00 and it was not processed before, replace one of its occurrences in S00 with the rhs(X). Clearly, S00 is over the input alphabet and nonterminals 00 0  |S|  and |S | = |S | + | rhs(G)| − |N (G)| = |SG| − |N (G)| = O , as each right-hand side logσ |S| 00 is substituted and each nonterminal is replaced once. String S induces the parsing YS of S and applying Theorem 1 to YS yields

|YS|H0(YS) ≤ |S|Hk(S) + o(|S| log σ) . (1)

It is left to compare |YS|H0(YS) with the entropy of SG, i.e. |SG|H0(SG). Note that up to permuting of letters, S00 is a concatenation of S0 and rhs (G) with one occurrence of each nonterminal removed. From the definition of entropy, adding one letter to string increases encoding cost by at most logarithmic factor, i.e. for any string w and letter a it holds that:

|wa|H0(wa) − |w|H0(w) X |wa| X |w| = |wa|b log − |w|b log |wa|b |w|b b∈Σwa b∈Σw X  |w| + 1 |w|  |wa| |w| = |w|b log − log + |wa|a log − |w|a log as |wa|a = |w|b |w|b |w|b |wa|a |w|a b∈Σwa\{a} X |w| + 1  |w| + 1 |w|  |w| + 1 = |w|b log + |w|a log − log + log |w| |wa|a |w|a |wa|a b∈Σwa\{a} |w| + 1 ≤ |w| log + log(|w| + 1) |w| ≤ log(|w| + 1) + log e = log(|w| + 1) + O(1) ,

where we assumed that the |w|a log(·) summand is 0 if |w|a = 0. Thus we obtain the bound on the entropy of SG:

|SG|H0(SG) ≤ |YS|H0(YS) + |N (G)| log |SG| + O(|N (G)|)

≤ |S|Hk(S) + |N (G)| log |S| + O(|N (G)|) + o(|S| log σ) . by (1) and as |SG| = O(|S|)

The additional assumption on |N (G)| in the second statement yields the second bound. The bounds on the entropy encoding follow from the fact that, by Lemma 22, Huffman-based encoder uses at most |SG|H0(SG) + O(|SG|) bits for a string SG. J M. Gańczorz 19

B.2 Proofs for Section 3.2 (Re-Pair)

We state the properties of Re-Pair, some were used in the previous analysis [24]:

I Lemma 24 ([24, Lem. 3]). The frequency of the most frequent pair in the working string does not increase during Re-Pair’s execution.

I Lemma 25. If the most frequent pair in the working string occurs at least z times then 2|S| |N (G)| < z .

Proof. By Lemma 24 all replaced pairs had frequency at least z thus each such a replacement removed at least z/2 letters from the starting string (note that some occurrences could overlap). As it initially had |S| letters, this cannot be done |S|/(z/2) times. J

We also need the property that different nonterminals expand to different substrings as we want to apply Theorem 1 to a working string, i.e. we need the fact that the working string induces a parsing. This can be proved by simple induction, [24, Lem. 4] gives a more general statement.

I Lemma 26 ([24, Lem. 4]). Let G be a grammar generated by Re-Pair at any iteration. Then no two nonterminals X,Y ∈ G have the same expansion.

Equipped with the above properties we are ready to prove the Theorem 6. proof of Theorem 6. We first prove the first part. Assume that the working string S0 of (ξ+3β)|S| 2 0 Re-Pair has size c ≥ + 2|S| β + 1. Then S induces a parsing YS = y1y2 . . . yc, as by logσ |S| Lemma 26 different non-terminals expand to different phrases. A phrase y in YS is long if logσ |S| |y| ≥ β − 1, and short otherwise. Let cl be the number of long phrases. Then   logσ |S| β · |S| cl · − 1 ≤ |S| =⇒ cl ≤ . β logσ |S| − β

1 If β > 3 logσ |S| then c > |S|, thus the first part of the Theorem trivially holds. So 1 assume that β ≤ 3 logσ |S|. There are exactly c − 1 digrams in YS. Any long phrase can partake in 2 pairs, so the number of remaining pairs, which consist of two short phrases, is at least:   (ξ + 3β) · |S| 2 2β · |S| c − 1 − 2 · cl ≥ + 2|S| β + 1 − 1 − logσ |S| logσ |S| − β (ξ + 3β) · |S| 2β · |S| 2 1 β ≥ − 1 + 2|S| as β ≤ logσ |S| logσ |S| logσ |S| − 3 logσ |S| 3 (ξ + 3β) · |S| 3β · |S| 2 > − + 2|S| β logσ |S| logσ |S| ξ|S| 2 ≥ + 2|S| β (2) logσ |S|

On the other hand, the number of different short phrases is at most:

 logσ |S|  β −1 log |S| X i σ 1 σ < σ β = |S| β , (3) i=1 20 Entropy bounds for grammar compression

1 1 2 as it is a geometric sequence. As there are at most |S| β ·|S| β = |S| β different pairs consisting of short phrases, there must exist a pair occurring at least:

ξ|S| 2 2 + 2|S| β 1− β logσ |S| ξ · |S| 2 ≥ + 2 (4) |S| β logσ |S|

times, which is larger than 2. Note that this, in particular, implies that if the working string is as long as c then there is another iteration of Re-Pair. Since there is a pair with frequency of value in (4) by Lemma 25 we can estimate the number of nonterminals at this point, |N (G)|, in grammar G as:

2|S| 2 2 β |N (G)| < 2 ≤ |S| · logσ |S| . (5) ξ|S| +2|S| β ξ logσ |S| 2 |S| β

(ξ+3β)|S| 2 Now consider the first iteration when |S0| < + 2|S| β + 1. By the above reasoning, logσ |S| 2 2 β at this point the grammar must be of size at most ξ |S| · logσ |S| + 1, this finishes the proof of the first part of the Theorem. Applying Theorem 1 we can bound the entropy of the working string S0 by:

0 0 |S |H0(S ) ≤ |S|Hk(S) + o(|S| log σ) , (6)   as |S0| = O |S| . logσ |S| Combining (5) with Lemma 4 gives that the size of the naive and the incremental 0 0 0 encoding of full grammar (S ,G) is bounded by |S |H0(S ) + o(|S| log σ), note that we use the assumption that β > 2. Together with (6) this gives that the encoding size is bounded by |S|Hk(S) + o(|S| log σ) for both the naive and the incremental encoding. Finally, the bound for the entropy coding of full grammar (S0,G) follows directly from Lemma 5, as   | rhs(S0,G)| = |S0| + 2|N (G)| = O |S| . logσ S J

Proof of Lemma 7. Fix n and an alphabet Γ = {a1, a2, . . . , an, #}. Consider the word S which contains all letters ai with # in between and ended with #, first in order from 1 to n, then in order n to 1:

S = a1#a2#a3# ··· an−1#an#an#an−1 ··· a2#a1# .

Then 4n 4n |S|H (S) = 2n log + 2n log 0 2 2n = 2n log (2n · 2) |S| = log |S| . 2

Every pair occurs twice in S. We assume that Re-Pair takes pairs in left-to-right order in case of a tie (which is true for standard implementations), hence Re-Pair will produce a starting string

0 S = X1X2 ...XnXnXn−1 ...X1 . M. Gańczorz 21

Its entropy is |S| |S|/2 |S0|H (S0) = log 0 2 2 |S| = log |S| − |S| , 2 i.e. the entropy of working string decreased only by a lower order term |S| during the |S| execution of Re-Pair. The produced grammar contains 4 rules, thus by the assumption that grammar of size g takes at least g log g − O(g) bits the total cost of encoding is at least: |S| |S| |S| |S| 3|S| log |S| − |S| + log − O = log |S| − O(S) . 2 4 4 4 4

The same claim holds for entropy encoding of the grammar: Let SG be a string obtained by concatenating the right-hand sides of (S0,G). We have 3|S| |S |H (S ) = log |S| − O(|S|) , G 0 G 4 as each of letters ai and nonterminals Xi occurs constant number of times in SG. J

B.2.0.1 Weakly non-redundant grammars We define a class of weakly non-redundant grammars, which include Re-Pair and Greedy. These grammars have some nice and natural properties, which we later employ in our analysis of Re-Pair and Greedy: Firstly, the full grammar of a weakly non-redundant grammar can be obtained by a series of compression steps, in which we replace only substrings of the working string (so the rules of the grammars are never altered), and those substrings occur at least twice in the working string. In particular, each such compression reduces the length of the starting string and does not increase the sums of lengths of the right-hand sides and starting string of the full grammar; this is showed in Lemma 28. Secondly, if the entropy of the input string is large, we can estimate the cost of incremental and entropy codings of a full grammar in terms of the zeroth-order entropy of the input string, see Lemma 29 and 1 Lemma 30. These results are of their own interest: for strings of entropy H0(S) ≥ 2 log |S| even not-so-reasonable grammar transform (with entropy encoding) gives an output of 3 size 2 |S|H0(S) + O(|S|). This is nontrivial, as considered transforms can introduce up to |S|/2 nonterminals, and each nonterminal in entropy coding seems to require additional log |S| bits. Moreover, it seems that we can extend this result so that for strings of entropy 1 H0(S) ≥ β log |S| the output size will be ((1 + β)/2)|S|H0(S) + O(|S|). 0 I Definition 27. (weakly non-redundant grammars) A full grammar (S ,G) is weakly non- redundant if every nonterminal X, except the starting symbol, occurs at least twice in the derivation tree of (S0,G), for every nonterminal X we have | rhs(X)| ≥ 2 and no two different nonterminals have the same expansion. For example grammar: S → AA, A → Ba, B → aa is weakly non-redundant, S → AB, A → cc, B → aa is not. We can obtain any weakly non-redundant grammar by simple (non-recursive) procedure that at each step replaces some substring which occurs at least twice.

0 I Lemma 28. Let S be a string and (S ,G) be a weakly non-redundant grammar. Then (S0,G) can be obtained by a series of replacements, where each replacement operates only on current starting S00, replaces at least 2 occurrences of some substring w, where |w| > 1 and w occurs at least 2 times, and adds a rule Xw → w to the grammar. 22 Entropy bounds for grammar compression

Proof. Take any weakly non-redundant grammar (S0,G) generating S. Consider the following inductive reasoning: Take some nonterminal X → ab, where ab are original symbols of S, i.e. a, b are leaves of the derivation tree. Replace each occurrence of ab in S which corresponds to nonterminal X in the derivation tree. This gives a new string S00. Now we can remove X from (S0,G) and obtain a weakly non-redundant grammar for S00, where lemma holds by the induction hypothesis. J In the proofs of the following lemmas, the crucial assumption is that | rhs(X)| ≥ 2, for every nonterminal X, as then each introduced nonterminal shortens the starting string by at least 2. 0 0 0 By Lemma 4 the encoding of a full grammar (S ,G) uses |S |H0(S ) + |N (G)| log |N (G)| bits plus lower order terms, thus the following Lemma says that we can obtain non-trivial bound for the encoding of weakly non-redundant grammars when the entropy of string S generated by (S0,G) is large.

I Lemma 29. Let S be a string and n ∈ N a number. Assume |S| ≤ n and |S|H0(S) ≥ |S| log n 0 2 − γ, where γ can depend on S. Let (S ,G) be a weakly non-redundant grammar generating S. Then: 3 |S0|H (S0) + |N (G)| log n ≤ |S|H (S) + O(|S| + γ) . 0 2 0 Proof. By Lemma 28 we can assume that grammar is produced by series of replacements on the starting string S, that is, we never modify right-hand sides of G after adding a rule. Because each replacement removes at least two symbols from the working string we have

|S| ≥ |S0| + 2|N (G)| . (7)

Observe that S0 induces a parsing of S, as by grammar properties different symbol expands to a different substring. Let L = Lengths(S0). Then estimating the entropy of parsing from Theorem 1, and |L|H0(L) from Lemma 38 (see the Appendix Section C) yields:

0 0 |S |H0(S ) ≤ |S|H0(S) + |L|H0(L) ≤ |S|H0(S) + O(|S|) . (8)

0 As |S | ≤ |S| and by assumption |S| ≤ n we have that |S|H0(S) ≤ |S| log n, and thus:

0 0 0 |S |H0(S ) ≤ |S | log n . (9) Now we have: 1 1 |S0|H (S0) + |N (G)| log n = |S0|H (S0) + |S0|H (S0) + |N (G)| log n 0 2 0 2 0 1 1 ≤ |S|H (S) + O(|S|) + |S0|H (S0) + |N (G)| log n by (8) 2 0 2 0 1 1 ≤ |S|H (S) + O(|S|) + |S0| log n + |N (G)| log n by (9) 2 0 2 1 1 ≤ |S|H (S) + O(|S|) + |S| log n by (7) 2 0 2 3 ≤ |S|H (S) + O(|S| + γ) by assumption 2 0 J We now show a similar property for the entropy encoding, its proof is more involved than in the case of Lemma 29, i.e. the incremental encoding. It uses a similar idea as proof of Theorem 1, that is, we assign some p values to each symbol on the right-hand side of (S0,G) and apply Lemma 16. M. Gańczorz 23

I Lemma 30. Let n ∈ N be a number and S be a string, |S| ≤ n. Assume |S|H0(S) ≥ |S| log n 0 2 − γ, where γ can depend on S. Let (S ,G) be a full weakly non-redundant grammar 0 generating S. Let SG be a string obtained by concatenation of the right-hand sides of (S ,G). Then: 3 |S |H (S ) ≤ |S|H (S) + O(|S| + γ) . G 0 G 2 0 Proof. By Lemma 28 we can assume that grammar is produced by series of replacements on the starting string S, that is we never modify the right-hand sides of G after adding a rule. Let Γ be the original alphabet of string S and SG be a concatenation of strings in rhs(S0,G). We show that

0 |S |Γ + 2|SG|N ≤ |S| . (10)

This clearly holds at the beginning. Suppose that we replace k copies of w in S0 by a new nonterminal X and add a rule X → w. Let |w|Γ and |w|N denote the number of occurrences of letters and nonterminals in w, then |w|Γ + |w|N = |w| ≥ 2. After the replacement the values of (10) change as follows:

0 0 (|S |Γ − k|w|Γ) + 2(|SG|N − (k − 1)|w|N + k) = |S |Γ + 2|SG|N − (k − 1)(|w|Γ + 2|w|N − 2) + 2 − |w|Γ 0 ≤ |S |Γ + 2|SG|N − (|w|Γ + 2|w|N − 2) + 2 − |w|Γ 0 = |S |Γ + 2|SG|N − 2(|w|Γ + |w|N − 2) 0 ≤ |S |Γ + 2|SG|N ≤ |S| .

Now, observe that for each letter a ∈ Γ it holds that:

0 |S |a + 2| rhs(G)|a ≤ |S|a . (11)

This is shown by an easy induction: clearly, it holds at the beginning. When we replace 0 k ≥ 2 occurrences of a word w with a nonterminal X and add the rule X → w then |S |a drops by k|w|a while 2| rhs(G)|a increases by 2|w|a ≤ k|w|a. We use Lemma 16 together with (10) and (11) to bound the entropy of SG. Define |S|a 0 1 p(a) = |S| as the empirical probability of letter a in S and let p (a) = 2 p(a) for original 0 1 letters of S and p (X) = 2|S| for nonterminal symbols. Those values satisfy the condition of Lemma 16: X X X p0(a) = p0(a) + p0(X) a∈Γ∪N (G) a∈Γ X∈N (G) X 1 X 1 = p(a) + 2 2|S| a∈Γ X∈N (G)

1 X |S|a 1 ≤ + |N (G)| · 2 |S| 2|S| a∈Γ 1 1 ≤ + |S| · by (10), as |N (G)| ≤ |S | 2 2|S| G N ≤ 1 .

Thus we can bound the entropy of SG using Lemma 16:

|SG|H0(SG) (12) 24 Entropy bounds for grammar compression

X 0 ≤ −|SG|a log p (a) a X 0 X 0 = −|SG|a log p (a) + −|SG|X log p (X) a∈Γ X∈N (G) X X 1 = −|S | log p(a) + |S | + |S | log |S| + |S | as p0(X) = G a G Γ G X G N 2|S| a∈Γ X∈N (G) X 1 1  = − |S0| + |S0| + | rhs(G)| log p(a) + |S | log |S| + |S | 2 a 2 a a G N G a∈Γ X 1  X 1 1 ≤ − |S0| + | rhs(G)| log p(a) + |S0| log |S| + |S | log |S| + |S | as p(a) ≥ 2 a a 2 a G N G |S| a∈Γ a∈Γ X 1 1 ≤ − |S| log p(a) + |S0| log |S| + |S | log |S| + |S | by (11) 2 a 2 Γ G N G a∈Γ 1 1  = |S|H (S) + |S0| + |S | log |S| + |S | 2 0 2 Γ G N G 1 |S| ≤ |S|H (S) + log |S| + |S | by (10) 2 0 2 G 1 ≤ |S|H (S) + |S|H (S) + γ + |S | by assumption 2 0 0 G 3 ≤ |S|H (S) + O(|S| + γ) as |S | ≤ |S| 2 0 G

B.2.0.2 Proof of Theorem 8

We prove the following fact which we will need in the proof of Theorem 8 and also later on:

0 0 I Lemma 31. Let S, S be strings such that S can be obtained from S by reordering and/or 0 0 removing letters from S. Then |S|H0(S) ≥ |S |H0(S ).

Proof. Let Σ be the alphabet of S. We have:

X |S| |S|H0(S) = |S|a log . |S|a a∈Σ

Clearly reordering letters does not change the above value. Consider the string S00 obtained by deleting a single occurrence of letter b in the string ∗ 00 00 S. Observe that the inequality holds in the trivial case of S ∈ b , as then |S |H0(S ) = |S|H0(S) = 0, so in the following we assume that |S|b < |S|. Then:

00 00 X |S| − 1 |S| − 1 |S |H0(S ) = |S|a log + (|S|b − 1) log , |S|a |S|b − 1 a∈Σ\{b}

|S|−1 where we assume that (|S|b − 1) log = 0 if |S|b = 1. As log x is increasing in range |S|b−1 (0, +∞), for each a 6= b it holds that

|S| − 1 |S| |S|a log ≤ |S|a log . |S|a |S|a M. Gańczorz 25

For b consider that

 |S|b−1! |S| − 1 |S| − |S|b (|S|b − 1) log = log 1 + |S|b − 1 |S|b − 1

 |S|b ! |S| |S| − |S|b (|S|b) log = log 1 + |S|b |S|b

n and the sequence fn = (1 + c/n) is known to be (strictly) increasing for every value c > 0 (as it monotonically tends to ec). Thus

|S| − 1 |S| (|S|b − 1) log ≤ (|S|b) log , |S|b − 1 |S|b as those are of the two consecutive elements of such sequence (for c = |S|−|S|b). J Proof of Theorem 8. Fix some  > 0 and consider the iteration in which the number of l |S| m nonterminals in the grammar is equal to log1+ |S| , call this grammar G0, i.e. we have l |S| m |N (G0)| = log1+ |S| . If no such an iteration exists, then consider the state after the last l |S| m 0 iteration, then |N (G0)| < log1+ |S| . Let S be a working string at the current state, thus 0 (S ,G0) is a full grammar generating S. By Theorem 6 with ξ = 2, β = 4 we know that if the size of the grammar is at least 1+p|S|·log |S| then the size of the working string must be at most |S0| ≤ 14|S| +2p|S|+1. σ logσ |S| p l |S| m 0  |S|  Thus, as 1 + |S| · log |S| = o 1+ it follows that |S | = O . σ log |S| logσ |S| 0 l |S| m l |S| m We make two assumptions: |S | > log1+ |S| and |N (G0)| = log1+ |S| (the latter does not hold if Re-Pair finished the run with smaller grammar), as otherwise it is easy to prove 0 l |S| m 0 the claim: If |S | ≤ log1+ |S| then the size of the final grammar is at most |S |/2 + |N (G0)|, as one replacement shortens the working string by at least 2 thus we can make at most |S0|/2 iterations from the current state. Thus, if at least one of these assumptions is not satisfied l |S| m we finish the run of Re-Pair with a grammar of size at most O log1+ |S| , in such case by Lemma 4 we get the bound |S|Hk(S) + o(|S| log σ) for incremental encoding ant the same bound for entropy encoding by Lemma 5. 0 l |S| m l |S| m We now show that the assumptions: |S | > log1+ |S| and |N (G0)| = log1+ |S| imply 0 0 that the entropy, i.e. |S |H0(S ), is high. From Lemma 25 we get that each pair occurs at 2|S| 1+ 0 most |N (G)| ≤ 2 log |S| times in S . Consider the parsing of working string S0 into consecutive pairs (with possibly last letter P P  left-out), denote it by YS0 . Let LP = Lengths YS0 , observe that those are all 2s except possibly the last, which can be 1. By applying Theorem 1 with k = 0 we get:

0 0 P P |S |H0(S ) + |LP |H0(LP ) ≥ |YS0 |H0(YS0 ) . (13)

∗ |S0|−1/2 If LP ∈ 2 then clearly H0(LP ) = 0, otherwise LP = 2 1 and so its entropy is |S0| − 1 (|S0| + 1)/2 (|S0| + 1)/2 |L |H (L ) ≤ log + log p 0 p 2 (|S0| − 1)/2 1

|S0|−1  1  2 ≤ log 1 + + log |S0| (|S0| − 1)/2 ≤ log e + log |S0| . 26 Entropy bounds for grammar compression

On the other hand, as each pair occurs at most 2 log1+ |S| times, we can lower bound the entropy by lower bounding each summand corresponding to pair in the definition of H0 P 1+ by log(|YS0 |/ log ), i.e.

 0  P P P |S | − 1 1+ |Y 0 |H (Y 0 ) ≥ |Y 0 | log /(2 log |S|) S 0 S S 2 |S0| − 1  |S|  ≥ log 2 4 log2+2 |S| |S0| − 1 |S0| − 1 = log |S| − log(4 log2+2 |S|) 2 2 |S0| − 1 ≥ log |S| − (|S0| − 1)(1 + (1 + ) log log |S|) 2 |S0| ≥ log |S| − γ0 , 2

0 0 log |S| where γ = (|S | − 1)(1 + (1 + ) log log |S|) + 2 = o(|S| log σ). Plugging those two estimations into (13) yields:

|S0| |S0|H (S0) ≥ log |S| − γ , (14) 0 2 where γ = γ0 + log e + log |S0| = o(|S| log σ). On the other hand, by Theorem 1,

0 0 |S |H0(S ) ≤ |S|Hk(S) + o(|S| log σ) . (15)

l |S| m 0  |S|  Recall that |N (G0)| = 1+ , |S | = O and so the encoding at this point log |S| logσ |S| takes |S|Hk(S) + o(|S| log σ) bits for both incremental and entropy encoding by Lemma 4 and Lemma 5 respectively. We now use the properties of newly defined weakly non-redundant 0 log |S| grammars (see B.2.0.1), which essentially state that if the entropy is H0(S ) ' 2 then 3 0 0 encoding size can grow to at most 2 |S |H0(S ), assuming that adding a non terminal costs log |S| bits (Lemma 29 and Lemma 30). We argue that this is the case for both incremental and entropy coding. We first prove the bound for incremental encoding. Let S00 be a string returned by Re-Pair at the end of its runtime and G0 be a grammar at this point. Observe that S00 induces a parsing of S, as S0 induces a parsing of S and S00 induces a parsing of S0. Let i be such that 0 the returned grammar G has |N (G0)| + i rules, i.e. there were i compression steps between 00 0 G0 and the end. Thus (S ,G ) is a full grammar which generates S, thus by Lemma 4 the size of incremental encoding is at most

00 00 0 0 00 0 |S |H0(S ) + |N (G )| log(σ + |N (G )|) + O(|S | + |N (G )|) 00 00 0 00 0 0 ≤|S |H0(S ) + |N (G )| log(2|S|) + O(|S | + |N (G )|) as σ, |N (G )| ≤ |S| 00 00 0 00 0 =|S |H0(S ) + |N (G )| log |S| + O(|S | + |N (G )|) 00 00 00 0 =|S |H0(S ) + (|N (G0)| + i) log |S| + O(|S | + |N (G )|) . (16)

We can look at the last i iterations as we start with string S0 and end up with S00 0 0 and some grammar G \ G0, i.e. grammar G without nonterminals from G0. As this grammar meets the conditions of a weakly non-redundant grammar, and S0 satisfies (14), i.e. 0 0 0 |S | |S |H0(S ) ≥ 2 log |S| − γ, we can apply Lemma 29 with n = |S| (recall that this lemma M. Gańczorz 27 requires |S0| ≤ n): 3 i log |S| + |S00|H (S00) ≤ |S0|H (S0) + O(|S0| + γ) . (17) 0 2 0

0  |S|  l |S| m 0 Note that |N (G )| = |N (G0)| + i = O , as |N (G0)| = 1+ and i < |S | = logσ |S| log |S|   O |S| . Combining this fact with (16) and (17) we obtain that the encoding of grammar logσ |S| (S00,G0) returned by Re-Pair is at most:

00 00 00 0 |S |H0(S ) + (|N (G0)| + i) log |S| + O(|S | + |N (G )|)

3 0 0 0 00 0 ≤ |S |H0(S ) + O(|S | + γ) + |N (G0)| log |S| +O(|S | + |N (G )|) 2 | {z } = |S| ·log |S| log1+ |S| 3  |S|  ≤ |S0|H (S0) + O + O(|S0| + γ + |S00| + |N (G0)|) 2 0 log |S| 3 ≤ |S0|H (S0) + o(|S| log σ) 2 0 3 ≤ |S|H (S) + o(|S| log σ) , by (15) 2 k which ends the proof for the incremental encoding. We now move to the case of entropy coding. Let SG be the concatenation of the right-hand 0 sides of (S ,G0). Using Lemma 5 we estimate the entropy of SG as :

|SG|H0(SG) ≤ |S|Hk(S) + |N (G0)| log |S| + o(|S| log σ) . (18)

0 On the other hand we can show that |SG|H0(SG) is large, (note that we use the fact that S can be obtained from SG by removing/reordering letters) i.e.:

0 0 |SG|H0(SG) ≥ |S |H0(S ) by Lemma 31 |S0| ≥ log |S| − γ from (14) 2 |S | = G log |S| − γ − 2|N (G )| log |S| as |S | = |S0| + 2|N (G )| 2 0 G 0 |S | ≥ G log |S| − γ00 , (19) 2 00 where γ = γ + 2|N (G0)| log |S| = o(|S| log σ). 0 00 0 Define SG as the concatenation of the right-hand sides of a full grammar (S ,G ), where G0 is a grammar at the end of the runtime of Re-Pair and S00 is the working string returned by Re-Pair. Again, we can look at the last iterations of Re-Pair like we would start some 00 0 compressor with input string SG and end up with (S ,G \ G0), as we can view the remaining 00 0 iterations like they would do replacements on SG. As (S ,G \ G0) meets the conditions of a weakly non-redundant grammar and SG has large entropy (19) applying Lemma 30 with 00 0 n = |S| to (S ,G \ G0) gives us: 3 |S0 |H (S0 ) ≤ |S |H (S ) + O(|S | + γ00) . (20) G 0 G 2 G 0 G G Combining (20) with (18) gives us: 3 |S0 |H (S0 ) ≤ (|S|H (S) + |N (G )| log |S| + o(|S| log σ)) + O(|S | + γ00) G 0 G 2 k 0 G 3 ≤ |S|H (S) + o(|S| log σ) , 2 k 28 Entropy bounds for grammar compression

l |S| m 0  |S|  as |N (G0)| = 1+ and |SG| = |S | + 2|N (G0)| = O . log |S| logσ |S| The bound on the entropy coding comes from the fact that by Lemma 22 entropy coder uses 0 0 0 0 3 |SG|H0(SG) + O(|SG|) bits to encode the string SG, thus the bound |S|Hk(S) + o(|S| log σ)   2 holds, as |S0 | = O |S| . G logσ |S| J

B.3 Proofs for Section 3.3 (Irreducible grammars and their properties)

B.3.0.1 Irreducible grammars — lower bounds

Proof of Lemma 10. For an integer i consider the grammar which for each possible binary string wa of length 2 < |wa| ≤ i has a rule Xwa → Xwa and Xw → w for w of length 2 and the starting rule S contains each Xw for |w| = i twice, in lexicographic order on indices. For instance for i = 3:

S → X000X000X001X001 ··· X111X111;

X000 → X000; X001 → X001; ... ;

X00 → 00; X01 → 01; ....

i i This grammar is irreducible, has size Θ(2 ) and generates a string Si of size Θ(i2 ), i.e. the grammar is of size Θ(|Si|/ logσ |Si|). We show that decompressing any set of nonterminals such that m nonterminals remain yields a grammar with a starting rule of size at least i2i+1 −2i log m. After the decompression 0 the starting rule S is of the following form: each Xw is either untouched, fully expanded 0 to string w or replaced with Xw0 B, where B is a binary string satisfying w = w B. For the uniformity of presentation we will consider the last two cases together, introducing dummy nonterminal X → . Then, there are at most m + 1 different nonterminals (including X) on the right-hand sides of G. Let n0, n1, n2, . . . , nm denote the number of occurrences of such 0 nonterminals in starting string S ; in principle, it could be that ni = 0, when some expanded nonterminal did not occur in S0. Observe that

m X i+1 nj = 2 . (21) j=1

At the same time, when a nonterminal occurs nj times, then it appeared for expanding at least nj/2 different nonterminals and so it introduced at least nj log(nj/2) new letters, as those nj/2 nonterminals need to have a common prefix (when nj = 0 then we set nj log(nj/2) = 0, which is fine by the continuity of the function x log x). Thus we extended the starting rule by at least

m X nj log(nj/2) (22) j=1

new letters. The function x log(x/2) is convex and so (by Jensen inequality) the value of (22) is M. Gańczorz 29

minimized, when all nj are equal, that is   m m Pm ! X X j=1 nj n log(n /2) ≥ n log x log x is concave j j  j 2m j=1 j=1  2i  = 2i+1 log by (21) m = i2i+1 − 2i log m , as claimed. c i c Lastly, it is easy to see that if m = O(|Si| ) = O((i2 ) ) for some 0 < c < 1 then for some constant α i2i+1 − 2i log m ≥ i2i+1 − 2i log α(i2i)c = i2i+1(1 − c) − 2i(c log i + log α)

i+1 which is linear (in |Si| = i2 ). J Proof of Lemma 11. We consider the family of grammars from Lemma 10 and strings i+1 generated by them. For a parameter i each such string Si have length 2 · i, its zeroth- order entropy is clearly 1, thus |Si|H0(Si) = |Si|. Now, in the entropy coding we take the concatenation of productions’ right-hand sides SG. Each nonterminal Xw occurs twice in i i−1 2 i+1 SG, and the total number of nonterminals Xw is 2 + 2 + ... 2 = 2 − 4, as for each binary word of length j, 2 ≤ j ≤ i there is a nonterminal representing it. Hence, we can lower bound the entropy of SG: |S | |S |H (S ) ≥ 2 · 2i+1 − 4 log G G 0 G 2 ≥ 2i+2 log 2i−O(1) − O(log 2i−O(1)) ≥ 2i+2i − O(2i+2)

= 2|Si| − o(|Si| log σ) .

As |Si| can be arbitrarily large, the claim holds. J

B.3.0.2 Irreducible grammars — upper bounds We state some properties of irreducible grammars which will be used in proof of Theorem 12, proof of Theorem 13, and to obtain bounds for Greedy. The following lemma states a well-known property of irreducible grammars, which in facts holds even if only (IG2) is satisfied.

0 I Lemma 32 ([6, Lemma 4]). Let (S ,G) be a full grammar generating S in which each nonterminal (except for the starting symbol) occurs at least twice on the right-hand side, that is it satisfies condition (IG2). Then the sum of expansions of the right sides is at most 2|S|. First, we show that irreducible grammars are necessarily small, similar to what was shown in [20], yet we will use more general lemma, which is useful in discussion of the Greedy algorithm.

0 I Lemma 33. Let (S ,G) be a full grammar satisfying conditions (IG1–IG2), let ξ, β > 0 be constants and let S be the string generated by grammar (S0,G). If | rhs(S0,G)| ≥ (2ξ+10β)|S| + logσ |S| 1− 2 2 ξ|S| β 8|S| β + 2 then there is a digram occurring at least + 4 times in the right-hand sides logσ |S| of (S0,G). 30 Entropy bounds for grammar compression

0 I Corollary 34. For any grammar full (S ,G) satisfying conditions (IG1–IG3) it holds that   | rhs(S0,G)| ≤ 42|S| + 8p|S| + 2 = O |S| , where S is a string generated by (S0,G). logσ |S| logσ |S| Proof of Lemma 33. The proof is similar to the proof of Theorem 6. Consider symbols in strings in rhs (S0,G), we distinguish between long and short symbols, depending on the length logσ |S| of their expansion: a symbol X is long if it is a nonterminal, and | exp(X)| ≥ β − 1, otherwise it is short. By cl denote number of occurrences of long symbols on the right-hand 0 0 side of full grammar (S ,G), and by cs number of short occurrences, let also c = | rhs(S ,G)|, then c = cl + cs. From Lemma 32 the sum of expansions’ lengths is at most 2|S| and so   logσ |S| 2β|S| cl · − 1 ≤ 2|S| =⇒ cl ≤ . β logσ |S| − β

1 Assume β ≤ 5 logσ |S|, as otherwise c > 2|S| thus the Lemma trivially holds. Observe that on the right-hand sides there are exactly c − 1 − |N (G)| digrams. As each production is of size at least 2 (thus 2|N (G)| ≤ c) and any long symbol can partake in 2 digrams, the number of digrams which consists of two short symbols is at least: c c − 1 − |N (G)| − 2 · c ≥ c − 1 − − 2 · c as 2|N (G)| ≤ c l 2 l c = − 1 − 2 · c 2 l   1 (2ξ + 10β) · |S| 2 4β · |S| ≥ + 8|S| β + 2 − 1 − 2 logσ |S| logσ |S| − β (ξ + 5β) · |S| 2 4β · |S| 1 β ≥ + 4|S| − 1 as β ≤ logσ |S| logσ |S| logσ |S| − 5 logσ |S| 5 (ξ + 5β) · |S| 2 5β · |S| ≥ + 4|S| β − logσ |S| logσ |S| ξ|S| 2 ≥ + 4|S| β . logσ |S| On the other hand, by (IG1) distinct nonterminals corresponds to distinct substrings of S, thus the number of different short symbols is at most:

 logσ |S|  β −1 log |S| X i σ 1 σ < σ β = |S| β , i=1

1 1 2 thus the number of different digrams which consists of two short symbols is |S| β ·|S| β = |S| β . Then there must exist a digram occurring at least

ξ|S| 2 2 + 4|S| β 1− β logσ |S| ξ|S| 2 ≥ + 4 |S| β logσ |S|

times. J

0 Proof of Theorem 12. Let (S ,G) be an irreducible full grammar generating S and SG be a concatenation of strings in rhs (S0,G). As in the proof of Lemma 5 from (S0,G) we will construct a parsing of S by expanding nonterminals: set S00 to be the starting string S0 and while possible, take X that occurs in S00 and was not chosen before then replace a single occurrence of X in S00 with rhs(X). M. Gańczorz 31

Since every nonterminal is used in the production of S, we end up with S00 in which every nonterminal was expanded once. 00 Let SN be a string in which every nonterminal of G occurs once. Observe that S SN has the same count of every symbol as rhs(S0,G): each replacement performed on S00 removes a single occurrence of on nonterminal and each nonterminal is processed once. Thus it is 00 00 enough to estimate |S SN |H0(S SN ) = |SG|H0(SG), as the order of symbols does not affect zeroth-order entropy. 00 00 00 We first estimate |S |H0(S ). To this end observe that S induces a parsing of S and 00 00 0 0 |S | < |S SN | = | rhs(S ,G)| and as (S ,G) is irreducible, by Corollary 34 it is small, i.e.     | rhs(S0,G)| = O |S| . Hence also |S00| = O |S| and by Theorem 1 we have: logσ |S| logσ |S|

00 00 |S |H0(S ) ≤ |S|Hk(S) + o(|S| log σ) . (23)

Each nonterminal is present in S00: by (IG2) each nonterminal had at least two occurrences in the right-hand sides of (S0,G), and only one occurrence was removed in the creation of S00. Now, consider the string S00S00, by the above reasoning we can reorder characters of S00S00 00 and remove some of them to obtain the string S SN . Thus, as by Lemma 31 removing and reordering letters from string w can only decrease |w|H0(w) we have:

00 00 |SG|H0(SG) = (|S SN |)H0(S SN ) same symbol count 00 00 00 00 ≤ (|S S |)H0(S S ) greater symbol count 00 00 = 2|S |H0(S )

≤ 2|S|Hk(S) + o(|S| log σ) by (23) , and the same bound holds for entropy coding, which by Lemma 22 consumes |SG|H0(SG) + |S| O(|SG|) bits, as |SG| = = O(|S| log σ). logσ |S| J 0 Proof of Theorem 13. Let (S ,G) be an irreducible full grammar generating S and SG be a concatenation of strings in rhs (S0,G). Since (S0,G) is irreducible by Corollary 34  |S|  |SG| = O . Moreover, by (IG3) no digram occurs twice (without overlaps) in logσ |S| 0 1 the right-hand sides of (S ,G). Therefore there exist 3 |SG| non-overlapping pairs in SG that occur exactly once: we can pair letters in each production, except the last letter if production’s length is odd. Let YSG be a parsing of SG into those pair and single symbols. Then 1 1 |Y |H (Y ) ≥ |S | log |S | SG 0 SG 3 G 3 G 1 = |S | log |S | − O(|S |) 3 G G G 1 = |S | log |S | − o(|S| log σ) . 3 G G

On the other hand, applying Theorem 1 for k = 0 and using the estimation on |L|H0(L) = O(|SG|) from Lemma 38 (see the Appendix Section C) we get:

|SG|H0(SG) ≥ |YSG |H0(YSG ) − o(|S| log σ) and thus 1 |S |H (S ) ≥ |S | log |S | − o(|S| log σ) . G 0 G 3 G G 32 Entropy bounds for grammar compression

Now, from Theorem 12:

2|S|Hk(S) + o(|S| log σ) ≥ |SG|H0(SG) 1 ≥ |S | log |S | − o(|S| log σ) 3 G G

and so

|SG| log |SG| ≤ 6|S|Hk(S) + o(|S| log σ) .

The naive coding of SG uses at most |SG| log |SG| + |SG| = |SG| log |SG| + o(|S| log σ) bits, note that SG includes all letters from the original alphabet. J

B.4 Proofs for Section 3.4 (Greedy)

We state properties of Greedy that are similar to those of Re-Pair.

I Lemma 35 (cf. Lemma 24). Frequency of the most frequent pair in the working string and grammar right-hand sides does not increase during the execution of Greedy.

Proof. Assume that after some iteration which introduced symbol X ← w frequency of some pair AB increased. Then the only possibility is that either A = X or B = X, as otherwise the frequency of AB cannot increase. By symmetry let us assume that A = X, for the moment assume also that A 6= B. Let Ak be the last symbol of w; then AkB occurred at least as many times as AB in the working string and grammar before replacement. The case with A = B is shown in the same way; the case with X = B is symmetric. J

I Lemma 36 (cf. Lemma 25). Let z be a frequency of the most frequent pair in the working string and grammar right-hand sides at some point in execution of Greedy. Then at this 2|S| point the number of nonterminals in the grammar is at most |N (G)| ≤ z−4 .

0 Proof. Replacing fw occurrences of a substring w of length |w| decreases the | rhs(S ,G)| by

fw(|w| − 1) − |w| = (fw − 1)(|w| − 1) − 1 . | {z } |{z} w is replaced with one symbol rule of length |w| added

In particular, replacing f occurrences of a pair of symbols shortens the grammar by f − 2 symbols. Now, by Lemma 35 in each previous iteration, there was a pair with frequency at least z. As Greedy replaces the string which shortens the working string and right-hand sides of rules by the maximal possible value, in each previous iteration |S0| + | rhs(G)| was decreased by at least z/2 − 2 = (z − 4)/2 (note that occurrences of pairs may overlap) so the 2|S| maximal number of previous iterations is z−4 and this is also the bound on the number of added nonterminals. J

I Lemma 37 ([6]). Greedy produces irreducible grammar. At each iteration of Greedy the conditions (IG1–IG2) are satisfied for the grammar (S, G), i.e. no two different nonterminals have the same expansions and each nonterminal is used a least twice.

Proof. Both claims were proved in [6], the first one is explicitly stated, the second claim follows from Greedy being a global algorithm (see [6, Section D]). J M. Gańczorz 33

Proof of Theorem 14. Let (S0,G) be a full grammar generating S at some point in the execution of Greedy. As Greedy produces irreducible grammar and at each iterations conditions (IG1–IG2) are satisfied by Lemma 37, we can apply Lemma 33: if | rhs(S0,G)| ≥ (2ξ+10β)|S| + logσ |S| 1− 2 2 ξ|S| β 8|S| β + 2 then there exists a digram occurring at least + 4 times. Lemma 36 applied logσ |S| at this point in the execution gives the bound on the number of nonterminals: 2|S| |N (G)| ≤ 1− 2  β  ξ|S| + 4 − 4 logσ |S|

2 2 ≤ |S| β log |S| . ξ σ

1− 2 β As some digram occurs ξ|S| + 4 > 2 times, then there must be the next iteration, and logσ |S| (2ξ+10β)|S| 2 thus there must exist iteration when | rhs(S0,G)| < + 8|S| β + 2 for the first time. logσ |S| 2 2 β The size of the grammar at this moment is at most ξ |S| logσ |S| + 1.   As at this moment | rhs(S0,G)| = O |S| and |N (G)| log |S| = o(|S|) for β > 2, by logσ |S| Lemma 5 the entropy coding uses at most |S|Hk(S) + o(|S| log σ) bits. J Proof of Theorem 15. Let (S00,G0) be a full grammar produced by Greedy at the end of 0 00 0 the execution and let SG be the concatenation of the right-hand sides of (S ,G ). Similarly as in the proof of Theorem 8, we will lower bound the entropy of the grammar (S0,G), which is a grammar at a certain iteration, and apply the results for weakly non- redundant grammars, see Section B.2.0.1. l |S| m Fix some  > 0 and consider the iteration in which |N (G)| = log1+ |S| ; if there is no such an iteration then consider the grammar produced at the end. Let (S0,G) be a full 0 grammar at this iteration, and let SG be a concatenation of the right-hand sides of (S ,G). p By Lemma 36 if the size of the grammar is at least 2 |S| logσ |S| + 1, then every pair p occurs less than |S|logσ |S| + 4 times. Moreover, as grammar produced by Greedy at each iteration satisfies (IG1–IG2), by Lemma 33 with ξ = 1, β = 4 if every pair occurs less than p|S|log |S| + 4 times then | rhs(S0,G)| < 42|S| + 8p|S| + 2. Thus if the grammar is of σ logσ |S| p 42|S| p p size at least 2 |S| log |S| + 1 then |SG| < + 8 |S| + 2. As 2 |S| · log |S| + 1 = σ logσ |S| σ l |S| m  |S|  o 1+ it follows that |SG| = O . Moreover, by Lemma 5 we have that: log |S| logσ |S|

|SG|H0(SG) ≤ |S|Hk(S) + |N (G)| log |S| + o(|S| log σ)

≤ |S|Hk(S) + o(|S| log σ) (24)

|S| 0 We assume that |SG| > log1+ |S| , otherwise, as | rhs(S ,G)| never increases, it holds that 0 0 0 0 0 |SG| ≤ |SG|, and then |SG|H0(SG) ≤ |SG| log |SG| ≤ |SG| log |SG| = o(|S|) thus both the naive and entropy coding takes o(S) bits. l |S| m Now, we have that either |N (G)| = log1+ |S| , or Greedy finished the execution with l |S| m l |S| m smaller grammar, i.e. |N (G)| < log1+ |S| . If |N (G)| = log1+ |S| then by Lemma 36 2|S| 1+ no digram occurs more than |N (G)| + 4 ≤ 2 log |S| + 4 times on the right-hand sides of (S0,G), and if Greedy finished the execution with smaller grammar we have that no digram occurs more than once (without overlaps) on the right-hand sides of (S0,G). Moreover, we 1 0 0 can find at least 2 (| rhs(S ,G)| − |N (G)| − 1) digrams on the right-hand side of (S ,G), as we can pair symbols naively in each right-hand side; we subtract |N (G)| factor because rules 34 Entropy bounds for grammar compression

1 and starting string can have odd length. Thus SG contains 2 (|SG| − |N (G)| − 1) digrams which occur at most 2 log1+ |S| + 4 times each. Let Y P be a parsing of S into phrases of SG G length 1 and 2 where the phrases of length 2 correspond to the digrams mentioned above, and let L = Lengths(Y P ). By applying Theorem 1 with k = 0 we get that: P SG

|S |H (S ) + |L |H (L ) ≥ |Y P |H (Y P ) , (25) G 0 G P 0 P SG 0 SG

∗ note that as LP ∈ {1, 2} , it holds that |LP |H0(LP ) ≤ |LP | log |{1, 2}| ≤ |SG|. On the other hand, as each phrase in Y P occurs at most 2 log1+ |S| + 4 times we can SG P P lower bound the entropy |YS |H0(YS ), by lower bounding each summand corresponding to |Y P | SG the phrase of length 2 in the definition of H0 by log 2 log1+ |S|+4 , i.e ! |Y P | P P 1 SG |Y |H0(Y ) ≥ (|SG| − |N (G)| − 1) log SG SG 2 2 log1+ |S| + 4  1  1 2 (|SG| − 1) ≥ (|SG| − |N (G)| − 1) log 2 2 log1+ |S| + 4 |S |  |S | − 1  |N (G)| + 1  |S | − 1  ≥ G log G − log G 2 4 log1+ |S| + 8 2 4 log1+ |S| + 8 |S | |S | ≥ G log (|S | − 1) − G log(4 log1+ |S| + 8) − γ 2 G 2 1

|N (G)|+1  |SG|−1  l |S| m where γ1 = 2 log 4 log1+ |S|+8 , observe that as |N (G)| = log1+ |S| and |SG| ≤ |S| we have that γ1 = o(S)

|S | ≥ G log (|S | − 1) − γ − γ 2 G 2 1

|SG| 1+  |S|  where γ2 = log(4 log |S| + 8), observe that as |SG| = O we have γ2 = 2 logσ |S| o(|S| log σ)

  |SG| |SG| |SG| ≥ log |SG| − log − γ2 − γ1 2 2 |SG| − 1 |S | = G log |S | − γ , (26) 2 G

|SG|  |SG|   x  where γ = log + γ2 + γ1. Observe that as O x log = O(1) we have that 2 |SG|−1 x−1 γ = o(|S| log σ).

Combining (25) and (26) gives us lower bound on the entropy of SG:

|S | |S |H (S ) + |L |H (L ) ≥ G log |S | − γ G 0 G P 0 P 2 G |S | |S |H (S ) ≥ G log |S | − γ0 , (27) G 0 G 2 G

0 where γ = γ + |LP |H0(LP ) = o(|S| log σ). 0 00 0 0 We recall that |SG| = | rhs(S ,G )| ≤ | rhs(S ,G)| = |SG| (as the size of the full grammar does not increase during the execution, see Lemma 35), moreover in fully naive encoding 00 0 0 each symbol of the (S ,G ) is encoded using at most dlog |SG|e bits. Thus, the fully naive M. Gańczorz 35 encoding consumes at most:

0 0 0 |SG| log |SG| + |SG| ≤ |SG| log |SG| + |SG| 0 ≤ 2|SG|H0(SG) + 2γ + |SG| by (27)

≤ 2|S|Hk(S) + o(|S| log σ) , by (24) which finishes the proof for the case of fully naive encoding. In the case of entropy coding, we make the same observation as in the proof of Theorem 8: we can look at the last iterations as if the algorithm had started with string SG and would have finished with grammar (S00,G0 \ G), note that concatenation of the right-hand sides 00 0 0 00 0 of (S ,G \ G) gives SG. As (S ,G \ G) meets the conditions of a weakly non-redundant grammar and the entropy of SG is large (27) we can apply Lemma 30 with n = |SG|: 3 |S0 |H (S0 ) ≤ |S |H (S ) + O(|S | + γ0) G 0 G 2 G 0 G G 3 ≤ |S|H (S) + o(|S| log σ) , by (24) 2 k thus by Lemma 22 the claim holds for the entropy coding. J

C Additional material for Section 4

The idea of assigning values to phrases (see Definition 17) was used before, for instance by Kosaraju and Manzini in estimations of entropy of LZ77 and LZ78 [21], yet their definition depends on k symbols preceding the phrase. This idea was adapted from methods used to estimate the entropy of the source model [37], see also [8, Sec. 13.5]. A similar idea was used in the construction of compressed text representations [13]. This work used a constructive argument: it calculated the bit strings for each phrase using arithmetic coding and context modeller. Later it was observed that those two can be replaced with Huffman coding [9]. Still, both of these representations were based on the assumption that text is parsed into short (i.e α logσ n, α < 1) phrases of equal length and used specific compression methods in the proof. Observe that the definition of phrase probability and parsing cost are also valid for k = 0, as we assumed that w[i . . . j] =  when i > j, and |S| = |S|.

Proof Lemma 18. The lemma follows from the definition of Ck(YS) and Hk: in |S|Hk(S) each letter a occurring in context v in S contributes log P(a|v) to the sum. Now observe that Ck(YS) is almost the same as |S|Hk(S), but the first k letters of each phrase contribute log σ instead of log P(a|v). J 0 l−1 i Proof lemma 19. Consider l parsings YS ,...,YS of S where in YS the first phrase has i letters and the other phrases are of length l, except maybe the last phrase; to streamline the 0 i argument, we add a leading empty phrase to YS . Denote YS = yi,0, yi,1,.... We estimate the Pl−1 i sum i=0 C(YS). The costs of each of the first phrases is upper-bounded by log |S|, as for any phrase y the phrase cost C(y) is at most log |S|:

|y| Y C(y) = − log P(y[j] | y[1 . . . j]) j=1 |y| Y |S|y[1...j] ≤ − log |S| j=1 y[1...j−1] 36 Entropy bounds for grammar compression

Note that by by Definition 3 some factors may be |S|wa , we bound them by |S|wa , as |S|w−1 |S|w − log |S|wa ≤ − log |S|wa |S|w−1 |S|w

|S|  = − log y |S| ≤ log(|S|/1) .

Pl−1 i Then i=0 C(YS) without the costs of all these phrases is:

i l−1 |Y | |yi,p| X XS X − log P(yi,p[j] | yi,p[1 . . . j − 1]) (28) i=0 p=1 j=1

Pl−1 and we claim that this is exactly |S| i=0 Hi(S). Together with the estimation of the cost of the first phrases this yields the claim, as

l−1 l−1 X i X C(YS) ≤ l log |S| + |S| Hi(S) i=0 i=0

and so one of the l parsings has a cost that is at most the right-hand side divided by l. To see that (28) is indeed the sum of entropies observe for each position m of the word we count log of the probability of this letter occurring in preceding i = 0, 1,... min(l − 1, m − 1)- letter context exactly once: this is clear for m ≥ l, as the consecutive parsings are offsetted S S S by one position; for m < l observe that in Y0 ,Y1 ,...Ym−1 the letter at position m we count log of probability of this letter occurring in preceding m − 1, m − 2,..., 0 letter context, while S S in Ym ,Ym+1,... we include it in the first phrase, so it is not counted in (28). J

proof of Theorem 20. We prove only the first inequality, the proof of the second one is similar.

Let Y denote the set of different phrases of YS. For each y ∈ Y define

|L| |L| p(y) = |y| ·P(y) = |y| · 2−C(y) . |L| |L|

P We want to show that y∈Y p(y) ≤ 1, that is, it satisfies the conditions of Lemma 16. First, we prove that for each l it holds that

X P(y) ≤ 1 . y∈Γl

This claim is similar to [21, Lem. A.1], we prove it by induction on l. For l = 1 we have

X X P(s) = P(s | ) s∈Γ s∈Γ

X |S|s = |S| s∈Γ = 1 . M. Gańczorz 37

For l > 1 we group elements in sum by their l − 1 letter prefixes: X X X P(y) = P(y0s) y∈Γl y0∈Γl−1 s∈Γ X X 0 0 = P(y ) · P(s |y ) y0∈Γl−1 s∈Γ X 0 X 0 = P(y ) · P(s |y ) y0∈Γl−1 s∈Γ X ≤ P(y0) · 1 y0∈Γl−1 ≤ 1 , where the last (and only) inequality follows from the induction hypothesis. Then

X X |L||y| p(y) = ·P(y) |YS| y∈Y y∈Y |S| X |L|l X = · P(y) |YS| l=1 y: y∈Y |y|=l |S| X |L|l ≤ · 1 |YS| l=1 = 1 .

Thus p satisfies the assumption of Lemma 16 and we can apply it on YS and p:

|YS | X |YS|H0(YS) ≤ − log p(yi) i=1 |YS |   X |L||yi| = − log ·P(y ) |L| i i=1

|YS | |YS | X |L||yi| X = − log − log P(y ) |L| i i=1 i=1

|YS | X ≤ |L|H0(L) + C(yi) i=1 = |L|H0(L) + C(YS) . J

We now estimate the entropy of lengths |L|H0(L), in particular in case of small parsings, i.e. when |YS| = o(|S|). Those estimations include standard results on the entropy of lengths [8, Lemma 13.5.4] and some simple calculations. (Note that a weaker estimation with stronger assumption was used implicitly in the work of Kosaraju and Manzini [21, Lemma A.3]).

I Lemma 38 (Entropy of lengths). Let S be a string, YS its parsing and L = Lengths(YS). Then: |S| |L|H (L) ≤ |L| log + |L|(1 + log e) . 0 |L| 38 Entropy bounds for grammar compression

 |S|  In particular, if |YS| = O then logσ |S|

|L|H0(L) = o(|S| log σ) ,

and for any value of |YS|:

|L|H0(L) = O(|S|) .

|L|l Proof. Introduce random variable U such that P r[U = l] = |L| . Then |L|H0(L) = |L|H(U) |S| and E[U] = |L| , where here H is the entropy function for random variables and E is the expected value. It is known [8, Lemma 13.5.4] that:

H(U) ≤ E[U + 1] log E[U + 1] − E[U] log E[U] .

Translating those results back to the setting of empirical entropy we obtain :  |S|  |S| |S| |S| H (L) ≤ 1 + log 1 + − log 0 |L| |L| |L| |L|  |S| |S| |S| + |L| |L| = log 1 + + log · |L| |L| |L| |S|  |S| |S|  |L| ≤ log 2 · + log 1 + |L| |L| |S| ! |S|  1 |S|/|L| ≤ 1 + log + log 1 + |L| |S|/|L| |S| < 1 + log + log e . |L| Multiplying by |L| yields the desired inequality.  |S|   |S|  Moving to the second claim: assume |YS| = O , thus |L| = |YS| = O . logσ |S| logσ |S| Then: e|S| |L|H (L) < |L| log + |L| 0 |L|  |S| |S|  ec = O log + |L| as x log is increasing in x for x ≤ c logσ |S| |S|/ logσ |S| x |S| log log |S| = O σ + |L| logσ |S| = o(|S| log σ)

Concerning the last claim, when |L| is arbitrary (but at most |S|), by definition |L|(1+e) = O(|S|) and the function f(x) = x log(c/x) is maximized for x = c/e, for any c; for which it log e has value c e ∈ O(c), this is easily shown by computing the derivative. Thus we have: |S| |L|H (L) < |L| log + |L|(1 + log e) 0 |L| log e ≤ |S| + |L|(1 + log e) e = O(|S|) ,

so the lemma holds. J M. Gańczorz 39

Proofs of Theorem 1 and Theorem 2. The first inequality of Theorem 1 follow from The- orem 20 and Lemma 18:

|YS|H0(S) ≤ Ck(YS) + |L|H0(L) by Theorem 20

≤ |S|Hk(S) + |YS|k log σ + |L|H0(L) . by Lemma 18

The second inequality of Theorem 1 comes directly form bounds in Lemma 38 and the 0 assumption that k = o(logσ |S|). The bound on the encoding of string S follows directly from Lemma 22. Similarly, Theorem 2 follows from Theorem 20 and Lemma 19:

|YS|H0(YS) ≤ C(YS) + |L|H0(L) by Theorem 20 l−1 |S| X ≤ H (S) + log |S| + |L|H (L) by Lemma 19 l i 0 i=0

In this case, we can bound |L|H0(L) by O(log |S|) as parsing from Lemma 19 consist of phrases of the same length, except for the first and the last one, which can be shorter:

|L| |L|H (L) ≤ |L| log + 2 log |L| = O(log |S|) 0 |L| − 2 J