Hilberg's Conjecture: Experimental Verification and Theoretical Results
Łukasz Dębowski ([email protected])
Institute of Computer Science, Polish Academy of Sciences

Outline
1 Introduction to Hilberg's conjecture
2 Empirical verification
3 Inefficiency of the Lempel-Ziv code
4 Vocabulary growth
5 Random descriptions of a random world
6 Conclusion

1 Introduction to Hilberg's conjecture

Texts — randomness vs. determinism

"If a Martian scientist sitting before his radio in Mars accidentally received from Earth the broadcast of an extensive speech which he recorded perfectly through the perfection of Martian apparatus and studied at his leisure, what criteria would he have to determine whether the reception represented the effect of animate process on Earth, or merely the latest thunderstorm on Earth? It seems that the only criteria would be the arrangement of occurrences of the elements, and the only clue to the animate origin would be this: the arrangement of the occurrences would be neither of rigidly fixed regularity such as frequently found in wave emissions of purely physical origin nor yet a completely random scattering of the same."
— George Kingsley Zipf (1965:187)

Stochastic processes
Probability space: (Ω, J, P). Alphabet: X. Random variables: X_i : Ω → X.
Stochastic process: (X_i)_{i∈ℤ}. Blocks: X_k^l = (X_i)_{k≤i≤l}.
A process (X_i)_{i∈ℤ} is stationary ⟺ the distribution P(X_{i+1}^{i+n}) does not depend on i.

Entropy
Entropy of a random variable:
    H(X_k^l) := E[−log P(X_k^l)].
It measures the uncertainty of the random variable:
- H(X_1^n) = n log card X — all values are equally probable;
- H(X_1^n) = 0 — the random variable is almost surely constant.
Block entropy of a stationary process: H(n) := H(X_{i+1}^{i+n}).
Entropy rate of a stationary process: h = lim_{n→∞} H(n)/n.

Hilberg's conjecture
Shannon (1951) estimated the entropy of English text, assuming it is drawn from a stationary process. Hilberg (1990) replotted these estimates in log-log scale and observed an approximately straight line. This line corresponds to
    H(n) ≈ B n^β + h n,    (1)
where β ≈ 0.5 and h ≈ 0. Shannon provided estimates of H(n) only for n ≤ 100 characters. Hilberg supposed that relationship (1) can be extrapolated and that h = 0 holds asymptotically.

The original vs. the relaxed Hilberg conjecture
A process (X_i)_{i∈ℤ} is asymptotically deterministic ⟺ h = 0. (Each random variable is then a function of the infinite past.)
Consider mutual information I(X; Y) = H(X) + H(Y) − H(X, Y).
The original Hilberg conjecture is
    H(n) ∝ n^β,    (2)
whereas the relaxed Hilberg conjecture is
    I(X_1^n; X_{n+1}^{2n}) = 2H(n) − H(2n) ∝ n^β.    (3)
Relationship (3) follows from (2), but it does not imply h = 0.
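To make the last claim concrete, here is a short worked derivation (a sketch assuming the idealized form H(n) = B n^β + h n of (1); it is not part of the original slides):

    % Substitute H(n) = B n^beta + h n into the mutual information (3):
    \begin{align*}
    I(X_1^n; X_{n+1}^{2n}) &= 2H(n) - H(2n) \\
      &= 2\bigl(B n^{\beta} + h n\bigr) - \bigl(B (2n)^{\beta} + 2 h n\bigr) \\
      &= B\,(2 - 2^{\beta})\, n^{\beta} \propto n^{\beta}.
    \end{align*}
    % The linear term h n cancels, so (3) holds for any value of h:
    % this is why the relaxed conjecture does not imply h = 0, while
    % setting h = 0 gives exactly the implication from (2) to (3).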
Why is Hilberg's conjecture (HC) important?
- HC corroborates Zipf's insight that texts produced by humans diverge both from pure randomness and from pure determinism. (In a sense, they would be both random and deterministic.)
- The relaxed HC also distinguishes natural language from k-parameter sources. (A basic model in statistics.)
- HC, in its original form, implies that texts are in a sense deterministic and infinitely compressible. (We have to explain why modern text compressors cannot achieve that.)
- HC can be linked with Zipf's law and Herdan's law. (These are celebrated laws of quantitative linguistics.)
- Stochastic processes that satisfy HC are mathematical terra incognita. (Understanding their construction and properties can lead to progress both in mathematics and in applications such as computational linguistics.)

2 Empirical verification

Three approaches to the estimation of block entropy
There are three methods of estimating the block entropy of texts:
- the guessing method (upper and lower bound),
- the compression rate of universal codes (upper bound),
- subword complexity and maximal repetition (lower bound).
(There are a few more methods of estimating the entropy rate.)

Guessing method (Shannon 1951)
[Figure: conditional entropy H(n + 1) − H(n) vs. context length n.]

Shannon's data in log-log scale (Hilberg 1990)
[Figure: Shannon's estimates replotted in log-log scale.] Fit: β ≈ 0.5, h = 0.

Compression rate of universal codes
[Figure: an upper bound on H(n)/n vs. text length n — compression rate in bits per character against block length in characters (1 to 10^6), for "20,000 Leagues under the Sea" and a unigram text, each compressed with the plain and switch codes.] Fit: β ≤ 0.898, h = 0. (A code sketch of this method appears after this section's conclusions.)

Estimates from subword complexity
[Figure: a lower bound on H(n) vs. block length n (up to 12 characters) for texts by Shakespeare, Swift, and Casanova de Seingalt.] Fit: β ≥ 0.602, h = 0.

Conclusions from the experiments
- Hilberg's conjecture seems to hold in its original form (h = 0), but we need more experiments to check whether it can be extrapolated to block lengths n ≥ 10.
- There is some gap between the upper and lower estimates of β: we have obtained β ∈ [0.602, 0.898]. If the gap were zero, the process generating texts would be ergodic. (There would be no random topic persisting in the infinitely long text.)
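For illustration, the compression-based upper bound is easy to approximate with an off-the-shelf universal compressor. The following sketch uses Python's standard lzma module and a hypothetical input file book.txt in place of the dedicated "plain" and "switch" codes from the talk; it illustrates the method, not the authors' code.

    # A minimal sketch of the compression-based upper bound on H(n)/n.
    # Assumptions (not from the talk): Python's standard LZMA compressor
    # serves as the universal code, and "book.txt" is a hypothetical
    # plain-text file.
    import lzma
    import math

    text = open("book.txt", "rb").read()

    lengths, rates = [], []
    n = 1
    while n <= len(text):
        # Compressed length in bits upper-bounds H(n) for a universal code.
        bits = 8 * len(lzma.compress(text[:n]))
        lengths.append(n)
        rates.append(bits / n)  # upper bound on H(n)/n in bits per character
        n *= 10

    # If H(n) ~ B n^beta with h = 0, then log(H(n)/n) vs. log n is a line
    # of slope beta - 1. Estimate the slope by ordinary least squares.
    xs = [math.log(n) for n in lengths]
    ys = [math.log(r) for r in rates]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    print("estimated beta (from the upper bound):", 1 + slope)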
3 Inefficiency of the Lempel-Ziv code

A skeptic's remark
Hilberg's conjecture in its original form implies that a typical text of one million letters could theoretically be compressed into a string of roughly one thousand letters. This is far beyond the power of any known text compressor! How is this possible? What blocks the optimal compression? What does the optimal compression look like?
- One idea: modern text compressors work mostly by detecting repeated strings and replacing them with shorter identifiers. They cannot compress texts beyond the maximal repetition.
- Another idea: giving the ISBN number suffices to identify a printed literary text that remains in cultural circulation. Thus, given enough memory, "hypercompression" is achievable. Is something similar possible in the world of stationary stochastic processes? We suppose that it is.

Maximal repetition
Definition. The maximal repetition in a text w is
    L(w) := max {|s| : w = x_1 s y_1 = x_2 s y_2 and x_1 ≠ x_2},
where s, x_i, and y_i are substrings of w. (A computational sketch follows at the end of this section.)

Maximal repetition and Hilberg's conjecture
Definition. For a random variable X, the topological entropy is
    H_top(X) = log card {x : P(X = x) > 0}.
Theorem. If a stationary stochastic process (X_i)_{i∈ℤ} satisfies
    H_top(X_{i+1}^{i+n}) ≤ B n^β
for certain constants 0 < β < 1 and B > 0, then there exists A > 0 such that, for α = 1/β, almost surely
    L(X_1^m) ≥ A (log m)^α.

35 texts in 3 languages + unigram text
[Figure: maximal length of repeat vs. block length, both in characters, for German, English, and French texts and a character unigram text.] Fit: A ≈ 0.093, α ≈ 2.64.

Regular Hilberg processes
Definition. A stationary process (X_i)_{i∈ℤ} is called a regular Hilberg process if
    H(n) = Θ(n^β),
    E L(X_1^n) = Θ((log n)^α)
for a certain β ∈ (0, 1) and α ≥ 1/β.

Bound for the Lempel-Ziv code
The Lempel-Ziv (LZ) code is the oldest known universal code. It compresses text by replacing repeated strings with (generally shorter) identifiers. The length of the LZ code satisfies
    |C(w)| ≥ (|w| / (L(w) + 1)) · log (|w| / (L(w) + 1)).
For regular Hilberg processes, the length of the LZ code is therefore orders of magnitude larger than the block entropy:
    H(n) = Θ(n^β),    E |C(X_1^n)| = Ω(n / (log n)^α).
Similar bounds hold for a few other known universal codes.
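For illustration, L(w) can be computed directly for moderate inputs. The sketch below (my own illustration, not code from the talk) exploits the fact that if a substring of length k occurs at two distinct positions, so does its prefix of length k − 1, so the repeat property is monotone in k and binary search applies.

    # A minimal sketch computing the maximal repetition L(w): the length
    # of the longest substring occurring at two distinct positions of w
    # (the two occurrences may overlap).
    def max_repetition(w: str) -> int:
        def has_repeat(k: int) -> bool:
            # Does some substring of length k occur at least twice?
            seen = set()
            for i in range(len(w) - k + 1):
                s = w[i:i + k]
                if s in seen:
                    return True
                seen.add(s)
            return False

        # has_repeat is monotone in k, so binary search over k is valid.
        lo, hi = 0, max(len(w) - 1, 0)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if has_repeat(mid):
                lo = mid
            else:
                hi = mid - 1
        return lo

    print(max_repetition("abracadabra"))  # 4: "abra" occurs twice

The set-based check is roughly quadratic in |w| up to logarithmic factors; suffix trees or suffix arrays bring the computation down to linear time.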
4 Vocabulary growth

Herdan's law (an integrated version of Zipf's law)
Consider texts in a natural language (such as English):
    V — the number of different words in the text,
    n — the length of the text.
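Herdan's law asserts that V grows roughly like a power of the text length, V ∝ n^γ with 0 < γ < 1. The exponent can be measured on prefixes of a text; below is a minimal sketch (my illustration: the whitespace tokenization and the input file book.txt are assumptions, not details from the talk).

    # A minimal sketch of measuring vocabulary growth:
    # V(n) = number of distinct words among the first n running words.
    # Herdan's law predicts V(n) growing roughly like n^gamma, gamma < 1.
    import math

    words = open("book.txt", encoding="utf-8").read().split()

    ns, vs = [], []
    vocabulary = set()
    for i, word in enumerate(words, start=1):
        vocabulary.add(word.lower())
        if i % 1000 == 0:
            ns.append(i)
            vs.append(len(vocabulary))

    # Fit gamma as the least-squares slope of log V(n) against log n.
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in vs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    gamma = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    print("estimated Herdan exponent gamma:", gamma)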