Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models
Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models

Moses Charikar* (Princeton University), Eric Lehman† (MIT), Ding Liu* (Princeton University), Rina Panigrahy‡ (Cisco Systems), Manoj Prabhakaran* (Princeton University), April Rasala† (MIT), Amit Sahai* (Princeton University), abhi shelat† (MIT)

February 20, 2002

*e-mail: {moses,dingliu,mp,sahai}@cs.princeton.edu
†e-mail: {e lehman,arasala,abhi}@theory.lcs.mit.edu
‡e-mail: [email protected]

Abstract

We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar is known to be hard to approximate to within a constant factor, and an o(log n / log log n) approximation would require progress on a long-standing algebraic problem [10]. Previously, the best proved approximation ratio was O(n^{1/2}) for the BISECTION algorithm [8]. Our main result is an exponential improvement of this ratio; we give an O(log(n/g*)) approximation algorithm, where g* is the size of the smallest grammar. We then consider other computable variants of Kolmogorov complexity. In particular, we give an O(log^2 n) approximation for the smallest non-deterministic finite automaton with advice that produces a given string. We also apply our techniques to "advice-grammars" and "edit-grammars", two other natural models of string complexity.

1 Introduction

This paper addresses the smallest grammar problem; namely, what is the smallest context-free grammar that generates exactly one given string σ? For example, the smallest context-free grammar generating the string

    σ = a rose is a rose is a rose

is as follows:

    S → BBA
    A → a rose
    B → A is

The size of a grammar is defined to be the total number of symbols on the right sides of all rules. The smallest grammar is known to be hard to approximate within a small constant factor. Further, even for a very restricted class of strings, approximating the smallest grammar within a factor of o(log n / log log n) would require progress on a well-studied algebraic problem [10]. Previously, the best proven approximation ratio was O(n^{1/2}) for the BISECTION algorithm [8]. Our main result is an exponential improvement of this ratio.

Several rich lines of research connect this elegant problem to fields such as Kolmogorov complexity, data compression, pattern identification, and approximation algorithms for hierarchical problems.

The size of the smallest context-free grammar generating a given string is a natural, but more tractable, variant of Kolmogorov complexity. (The Kolmogorov complexity of a string is the description length of the smallest Turing machine that outputs that string.) The Turing machine model for representing strings is too powerful to be exploited effectively; in general, the Kolmogorov complexity of a string is uncomputable. However, weakening the model from Turing machines to context-free grammars reduces the complexity of the problem from the realm of undecidability to mere intractability. How well the smallest grammar can be approximated remains an open question. In this work we make significant progress toward an answer: our main result is that the "grammar complexity" of a string is O(log(n/g*))-approximable in polynomial time, where g* is the size of the smallest grammar.

This perspective on the smallest grammar problem suggests a general direction of inquiry: can the complexity of a string be determined or approximated with respect to other natural models? Along this line, we consider the problem of optimally representing an input string by a non-deterministic finite automaton guided by an advice string. We show that the optimal representation of a string is O(log^2 n)-approximable in this model. Then we consider context-free grammars with multiple productions per nonterminal, again guided by an advice string. This modified grammar model turns out, unexpectedly, to be essentially equivalent to the original.

The smallest grammar problem is also important in the area of data compression. Instead of storing a long string, one can store a small grammar that generates it, provided such a grammar can be found. This line of thought has led to a flurry of research [8, 12, 16, 7, 1, 9] in the data compression community. Kieffer and Yang show that a good solution to the grammar problem leads to a good universal compression algorithm for finite Markov sources [7]. Empirical results indicate that grammar-based data compression is competitive with other techniques in practice [8, 12]. Surprisingly, however, the most prominent algorithms for the smallest grammar problem have been shown to exploit the grammar model poorly [10]. In particular, the SEQUITUR [12], BISECTION [8], LZ78 [16], and SEQUENTIAL [14] algorithms all have approximation ratios that are Ω(n^{1/3}). In this paper, we achieve a logarithmic ratio.

The smallest grammar problem is also relevant to identifying important patterns in a string, since such patterns naturally correspond to nonterminals in a compact grammar. In fact, an original motivation for the problem was to identify regularities in DNA sequences [11]. Since then, smallest grammar algorithms have been used to highlight patterns in musical scores and uncover properties of language from example texts [4]. All this is possible because a string represented by a context-free grammar remains relatively comprehensible. This comprehensibility is an important attraction of grammar-based compression relative to otherwise competitive compression schemes. For example, the best pattern matching algorithm that operates on a string compressed as a grammar is asymptotically faster than the equivalent for LZ77 [6].

Finally, work on the smallest grammar problem extends the study of approximation algorithms to hierarchical objects, such as grammars, as opposed to "flat" objects, such as graphs, CNF formulas, etc. This is a significant shift, since many real-world problems have a hierarchical nature, but standard approximation techniques such as linear and semi-definite programming are not easily applied to this new domain.

The remainder of this paper is organized as follows. Section 2 defines terms and notation. Section 3 contains our main result, an O(log(n/g*)) approximation algorithm for the smallest grammar problem. Section 4 follows up the Kolmogorov complexity connection and includes the results on the complexity measure in other models. We conclude with some open problems and new directions.

2 Definitions

A grammar G is a 4-tuple (Σ, Γ, S, Δ), where Σ is a set of terminals, Γ is a set of nonterminals, S ∈ Γ is the start symbol, and Δ is a set of rules of the form T → (Σ ∪ Γ)*. A symbol may be either a terminal or a nonterminal. The size of a grammar is defined to be the total number of symbols on the right sides of all rules. Note that a grammar of size g can be readily encoded using O(g log g) bits, and so our measure is close to this natural alternative.¹

¹ More sophisticated ways to encode a grammar [7, 14] relate the encoding size in bits more closely to the above definition of the size of a grammar, and give theoretical justification to this definition.

Since G generates exactly one finite string, there is exactly one rule in Δ defining each nonterminal in Γ.
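For concreteness, the definitions so far can be checked mechanically on the rose example from the introduction. The sketch below is our illustration, not code from the paper: it treats the words of the example as terminals, represents Δ as a Python dict with one rule per nonterminal, and computes the expansion of a symbol string (defined formally below) and the size of a grammar.

```python
# Straight-line grammar for "a rose is a rose is a rose".
# Each nonterminal has exactly one rule, as required above.
GRAMMAR = {
    "S": ["B", "B", "A"],
    "A": ["a", "rose"],
    "B": ["A", "is"],
}

def expansion(symbols, grammar):
    """Replace each nonterminal by its definition until only terminals remain."""
    out = []
    for sym in symbols:
        if sym in grammar:                      # nonterminal: expand its rule
            out.extend(expansion(grammar[sym], grammar))
        else:                                   # terminal: keep as-is
            out.append(sym)
    return out

def grammar_size(grammar):
    """Total number of symbols on the right sides of all rules."""
    return sum(len(rhs) for rhs in grammar.values())

print(" ".join(expansion(["S"], GRAMMAR)))      # a rose is a rose is a rose
print(grammar_size(GRAMMAR))                    # 7
```

The acyclicity requirement on the nonterminals is what guarantees that this recursion terminates.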
Furthermore, G is acyclic; that is, there exists an ordering of the nonterminals in Γ such that each nonterminal precedes all nonterminals in its definition. The expansion of a string of symbols η is obtained by replacing each nonterminal in η by its definition until only terminals remain. The expansion of η is denoted ⟨η⟩, and the length of ⟨η⟩ is denoted [η].

Throughout, symbols (terminals and nonterminals) are uppercase, and strings of symbols are lowercase Greek. We use σ for the input string, n for its length, g for the size of a grammar that generates σ, and g* for the size of the smallest such grammar.

|G| denotes the size of the grammar G minus the number of nonterminals. Without loss of generality, we do not allow unary rules of the form A → B; then |G| is at least half the size of G.

An LZ77 representation of a string σ is a sequence of pairs (p1, l1), ..., (ps, ls), each representing a substring of σ, whose concatenation is σ. If pi = 0, the pair represents a single terminal. If pi ≠ 0, then the pair represents a portion of the prefix of σ that is represented by the preceding i − 1 pairs; namely, the li terminals beginning at position pi in σ. Note that, unlike in standard LZ77, we require that the substring referred to by (pi, li) be fully contained within the prefix of σ generated by the previous pairs. The size of the LZ77 representation of a string is s, the number of pairs employed. The smallest LZ77 representation can easily be found using a greedy algorithm in polynomial time. More efficient algorithms are possible; for example, [5] gives an O(n log n) time algorithm for our variant of LZ77.

As observed in [12], a grammar G for a string can be converted in a simple way to an LZ77 representation. It is not hard to see that this transformation gives an LZ77 list of at most |G| + 1 pairs. Thus we have:

Lemma 1 The size of the smallest grammar for a string is lower bounded by the size of the smallest LZ77 representation for that string.

3 An O(log(n/g*)) Approximation Algorithm

Our O(log(n/g*))-approximation algorithm essentially inverts this process, mapping an LZ77 sequence to a grammar.
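The greedy parse mentioned above can be sketched as follows. This is our illustration, not the paper's code; under our conventions a pair with p = 0 carries the terminal itself, and positions are 1-indexed as in the definition.

```python
def lz77_parse(s):
    """Greedy left-to-right parse into pairs (p, l). A pair (0, c) emits the
    single terminal c; a pair (p, l) with p > 0 copies the l terminals starting
    at 1-indexed position p, taken entirely from the already-parsed prefix."""
    pairs, i = [], 0
    while i < len(s):
        prefix = s[:i]
        best_p = best_l = 0
        l = 1
        while i + l <= len(s):
            idx = prefix.find(s[i:i + l])   # match must lie fully in the prefix
            if idx == -1:
                break
            best_p, best_l = idx + 1, l     # record the longest match so far
            l += 1
        if best_l > 0:
            pairs.append((best_p, best_l))
            i += best_l
        else:
            pairs.append((0, s[i]))         # no prior occurrence: new terminal
            i += 1
    return pairs

def lz77_expand(pairs):
    """Invert the parse; each pair references only the prefix built so far."""
    out = ""
    for p, l in pairs:
        out += l if p == 0 else out[p - 1:p - 1 + l]
    return out

sigma = "a rose is a rose is a rose"
assert lz77_expand(lz77_parse(sigma)) == sigma
```

Because each pair must expand to a substring of the prefix already produced, decoding never reads past what has been written; this is exactly the containment restriction that distinguishes the variant above from standard LZ77.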