Hilberg’s Conjecture: Experimental Verification and Theoretical Results

Łukasz Dębowski, [email protected], Institute of Computer Science, Polish Academy of Sciences

1 Introduction to Hilberg’s conjecture

2 Empirical verification

3 Inefficiency of Lempel-Ziv code

4 Vocabulary growth

5 Random descriptions of a random world

6 Conclusion

Texts — randomness vs. determinism

If a scientist sitting before his radio accidentally received the broadcast of an extensive speech, which he recorded perfectly through the perfection of Martian apparatus and studied at his leisure, what criteria would he have to determine whether the reception represented the effect of an animate process on Earth, or merely the latest thunderstorm on Earth? It seems that the only criteria would be the arrangement of occurrences of the elements, and the only clue to the animate origin would be this: the arrangement of the occurrences would be neither of rigidly fixed regularity, such as is frequently found in wave emissions of purely physical origin, nor yet a completely random scattering of the same.

— George Kingsley Zipf (1965: 187)

Stochastic processes

Probability space: (Ω, J, P). Alphabet: X. Random variables: Xi : Ω → X.

Stochastic process: (Xi)i∈Z. Blocks: X_k^l = (Xi)_{k≤i≤l}.

Process (Xi)i∈Z is stationary ⇔ (by definition) P(X_{i+1}^{i+n}) does not depend on i.

Entropy

Entropy of a random variable:

H(X_k^l) := E[−log P(X_k^l)].

It measures the uncertainty of a random variable:
H(X_1^n) = n log card X — all values are equally probable,
H(X_1^n) = 0 — the random variable is almost surely constant.

Block entropy of a stationary process:

H(n) := H(X_{i+1}^{i+n}).

Entropy rate of a stationary process:

h = lim_{n→∞} H(n)/n.

Hilberg’s conjecture

Shannon (1951) estimated the entropy of text in English, assuming it is drawn from a stationary process. Hilberg (1990) replotted these estimates in the log-log scale and observed an approximately straight line. This line corresponds to

H(n) ≈ Bn^β + hn, (1)

where β ≈ 0.5 and h ≈ 0. Shannon provided estimates of H(n) for n ≤ 100 characters. Hilberg supposed that relationship (1) can be extrapolated and that h = 0 holds asymptotically.

The original vs. the relaxed Hilberg conjecture

Process (Xi)i∈Z is asymptotically deterministic ⇔ (by definition) h = 0. (Each random variable is then a function of the infinite past.)

Consider mutual information

I(X; Y) = H(X) + H(Y) − H(X, Y).
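For blocks of a stationary process, this mutual information becomes I(X_1^n; X_{n+1}^{2n}) = 2H(n) − H(2n), which can be estimated directly from n-gram frequencies. A minimal sketch (plug-in estimates on a toy string; `block_entropy` and `block_mutual_information` are our own names, and plug-in estimates are strongly biased for long blocks):

```python
from collections import Counter
from math import log2

def block_entropy(text, n):
    """Plug-in estimate of H(n): entropy of the empirical
    distribution of overlapping character n-grams."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum(c / total * log2(c / total) for c in counts.values())

def block_mutual_information(text, n):
    """Plug-in estimate of I(X_1^n; X_{n+1}^{2n}) = 2 H(n) - H(2n)."""
    return 2 * block_entropy(text, n) - block_entropy(text, 2 * n)

sample = "good morning to you, " * 20
for n in (1, 2, 4):
    print(n, round(block_mutual_information(sample, n), 3))
```

On a periodic toy string the estimate saturates quickly; the conjecture concerns how this quantity grows with n on real texts.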

The original Hilberg conjecture is

H(n) ∝ n^β, (2)

whereas the relaxed Hilberg conjecture is

I(X_1^n; X_{n+1}^{2n}) = 2H(n) − H(2n) ∝ n^β. (3)

Relationship (3) follows from (2) but does not imply h = 0.

Why is Hilberg’s conjecture (HC) important?

HC corroborates Zipf’s insight that texts produced by humans diverge from both pure randomness and pure determinism. (In a sense, they would be both random and deterministic.)
The relaxed HC also distinguishes natural sources from k-parameter sources. (A basic model in statistics.)
HC, in its original form, implies that texts are in a sense deterministic and infinitely compressible. (We have to explain why modern text compressors cannot achieve that.)
HC can be linked with Zipf’s law and Herdan’s law. (These are celebrated laws of quantitative linguistics.)
Stochastic processes that satisfy HC are mathematical terra incognita. (Understanding their construction and properties can lead to progress both in mathematics and in applications such as computational linguistics.)

Three approaches to estimation of block entropy

There are three methods of estimating the block entropy of texts:
guessing method (upper and lower bound),
compression rate of universal codes (upper bound),
subword complexity and maximal repetition (lower bound).
(There are a few more methods of estimating the entropy rate.)

Guessing method (Shannon 1951)

Conditional entropy H(n + 1) − H(n) vs. context length n: (figure)

Shannon’s data in log-log scale (Hilberg 1990)

Fitted parameters: β ≈ 0.5, h = 0.

Compression rate of universal codes

An upper bound of H(n)/n vs. text length n:

(figure: compression rate [bpc] vs. block length [characters]; curves: plain and switch codes on 20,000 Leagues under the Sea and on a unigram text)

Fitted parameters: β ≤ 0.898, h = 0.

Estimates from subword complexity

A lower bound of H(n) vs. block length n:

(figure: entropy vs. block length; curves: Shakespeare and Casanova de Seingalt)

Fitted parameters: β ≥ 0.602, h = 0.

Conclusions from the experiments

Hilberg’s conjecture seems to hold in its original form (h = 0), but we need more experiments to check whether it can be extrapolated to block lengths n ≥ 10. There is some gap between the upper and the lower estimates of β: we have obtained β ∈ [0.602, 0.898]. If the gap were zero, the process generating texts would be ergodic. (There would be no random topic persisting in the infinitely long text.)

A skeptic’s remark

Hilberg’s conjecture in its original form implies that a typical text of one million letters could in theory be compressed into a string of roughly one thousand letters. This is far beyond the power of any known text compressor! How is it possible? What blocks the optimal compression? What does the optimal compression look like?

One idea: modern text compressors work mostly by detecting repeated strings and replacing them with shorter identifiers. They cannot compress texts beyond the maximal repetition.

Another idea: giving the ISBN number is sufficient to identify a printed literary text that remains in cultural circulation. Thus, given enough memory, “hypercompression” is achievable. Is something similar possible in the world of stationary stochastic processes? We suppose that it is.

Maximal repetition

Definition The maximal repetition in text w is defined as

L(w) := max {|s| : w = x_1 s y_1 = x_2 s y_2 and x_1 ≠ x_2},

where s, x_i, and y_i are substrings of text w.

Maximal repetition and Hilberg’s conjecture

Definition

For a random variable X, the topological entropy is

Htop(X) = log card {x : P(X = x) > 0}.

Theorem

If a stationary stochastic process (Xi)i∈Z satisfies

Htop(X_{i+1}^{i+n}) ≤ Bn^β for certain constants 0 < β < 1 and B > 0, then there exists A > 0 such that, for α = 1/β, almost surely we have

L(X_1^m) ≥ A(log m)^α.

35 texts in 3 languages + unigram text

(figure: maximal length of repeat [characters] vs. block length [characters]; curves: German, English, and French texts and a character unigram text)

Fitted parameters: A ≈ 0.093, α ≈ 2.64.

Regular Hilberg processes

Definition

A stationary process (Xi)i∈Z is called a regular Hilberg process if

H(n) = Θ(n^β),
E L(X_1^n) = Θ((log n)^α)

for a certain β ∈ (0, 1) and α ≥ 1/β.

Bound for the Lempel-Ziv code

The Lempel-Ziv (LZ) code is the oldest known universal code. It compresses text by replacing repeated strings with (generally shorter) identifiers. The length of the LZ code satisfies

|C(w)| ≥ (|w| / (L(w) + 1)) log (|w| / (L(w) + 1)).
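Both quantities entering this bound are easy to compute for a concrete string. A sketch (a naive O(|w|²) maximal-repetition search and an LZ78-style phrase count; the helper names are ours and nothing here is optimized):

```python
from math import log2

def max_repetition(w):
    """L(w): length of the longest substring occurring at two
    distinct positions of w (occurrences may overlap)."""
    best = 0
    for length in range(1, len(w)):
        seen, repeated = set(), False
        for i in range(len(w) - length + 1):
            s = w[i:i + length]
            if s in seen:
                repeated = True
                break
            seen.add(s)
        if not repeated:
            break  # no repeat of this length implies none longer
        best = length
    return best

def lz78_phrase_count(w):
    """Number of phrases in the LZ78 incremental parsing of w;
    the code length is roughly count * log2(count) bits."""
    phrases, current, count = {""}, "", 0
    for ch in w:
        current += ch
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ""
    return count + (1 if current else 0)

w = "good morning to you, " * 8
m, L = len(w), max_repetition(w)
lower = m / (L + 1) * log2(m / (L + 1))  # the bound on |C(w)|
print(L, lz78_phrase_count(w), round(lower, 2))
```

On highly repetitive strings L(w) is large, so the lower bound collapses; the theorem above says that for regular Hilberg processes L(w) grows only logarithmically, which keeps the bound large.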

For regular Hilberg processes, the length of the LZ code is orders of magnitude larger than the block entropy:

H(n) = Θ(n^β), E |C(X_1^n)| = Ω(n / (log n)^α).

Similar bounds hold for a few other known universal codes.

Herdan’s law (an integrated version of Zipf’s law)

Consider texts in a natural language (such as English): V — the number of different words in the text, n — the length of the text. We observe Herdan’s law, i.e., the relationship

V ∝ n^β,

where β lies between 0.5 and 1, depending on the text. We will show that Hilberg’s conjecture implies Herdan’s law.
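The exponent β can be estimated by a log-log regression of vocabulary size against prefix length. A minimal sketch (the helper names are ours; a real experiment would read a long text file instead of the toy word list below):

```python
from math import log

def vocabulary_growth(words):
    """(n, V(n)) pairs at geometrically spaced prefix lengths n."""
    pairs, m = [], 2
    while m <= len(words):
        pairs.append((m, len(set(words[:m]))))
        m *= 2
    return pairs

def herdan_exponent(pairs):
    """Least-squares slope of log V(n) against log n."""
    xs = [log(n) for n, _ in pairs]
    ys = [log(v) for _, v in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# extreme case: all "words" distinct gives the maximal exponent beta = 1
beta = herdan_exponent(vocabulary_growth([str(i) for i in range(10000)]))
```

Natural-language texts fall strictly between the two extremes (β = 0 for a constant vocabulary, β = 1 for all-new words), which is exactly the regime Herdan’s law describes.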

A context-free grammar that generates one text

A1 → A2 A2 A4 A5 dear children A5 A3 all.
A2 → A3 you A5
A3 → A4 to
A4 → Good morning
A5 → ,

Good morning to you, Good morning to you, Good morning, dear children, Good morning to all.
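That derivation can be replayed mechanically. A sketch (the grammar above encoded as a dictionary; the nonterminal names and spacing conventions are our own, and tokens that are not rule names count as terminals):

```python
def expand(symbol, rules):
    """Expand a nonterminal of a context-free grammar that
    generates exactly one text."""
    return "".join(expand(t, rules) if t in rules else t
                   for t in rules[symbol])

rules = {
    "A1": ["A2", "A2", "A4", "A5", "dear children", "A5", "A3", "all."],
    "A2": ["A3", "you", "A5"],
    "A3": ["A4", " to "],
    "A4": ["Good morning"],
    "A5": [", "],
}
print(expand("A1", rules))
# Good morning to you, Good morning to you, Good morning, dear children, Good morning to all.
```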

The grammar-based codes

A function Γ such that Γ(w) is an admissible grammar that generates text w is called a grammar transform. Certain grammar transforms can be turned into universal codes if we apply a certain encoding of an arbitrary grammar into a string. We may suppose that the number of distinct words in text X_1^n can be approximated by the number of distinct nonterminals V(X_1^n) in an admissibly minimal grammar-based code C(X_1^n) for text X_1^n.

The first result (non-zero entropy rate)

Admissibly minimal grammar-based codes satisfy

|C(u)| + |C(v)| − |C(uv)| ≤ B V(uv)(1 + L(uv)).

The left-hand side is an estimate of the mutual information I(X; Y) = H(X) + H(Y) − H(X, Y).

Theorem

Let V(X_1^n) be the number of distinct nonterminals in an admissibly minimal grammar-based code C(X_1^n). If for a stationary process (Xi)i∈Z over a finite alphabet with a strictly positive entropy rate we have I(X_1^n; X_{n+1}^{2n}) = Ω(n^β) for some β ∈ (0, 1), then

lim sup_{n→∞} E[V(X_1^n)(1 + L(X_1^n))] / n^β > 0.

The second result (zero entropy rate)

We suppose that admissibly minimal grammar-based codes are universal, i.e., they satisfy E |C(X_1^n)| − H(n) = o(n). This would guarantee E |C(X_1^n)| = o(n) if H(n) = o(n).

Theorem

Let V(X_1^n) be the number of distinct nonterminals in an admissibly minimal grammar-based code C(X_1^n). If for a stationary process (Xi)i∈Z over a finite alphabet the code satisfies E |C(X_1^n)| = o(n), then

lim sup_{n→∞} E[V(X_1^{2n})(1 + L(X_1^{2n}))] / ( n [1/E L(X_1^n) − 1/E L(X_1^{2n})] ) > 0.

The second result for regular Hilberg processes

For a regular Hilberg process we have:

lim sup_{n→∞} E[V(X_1^n)(1 + L(X_1^n))] / (n / (log n)^{α+1}) > 0.

Processes satisfying Hilberg’s conjecture?

Hilberg’s conjecture:

H(n) ≈ Bn^β + hn.

There are a few processes that satisfy HC with h > 0. We have some idea how to construct a process that satisfies HC with h = 0 but have not completed the construction yet.

The Santa Fe process

A linguistic interpretation

Process (Xi)i∈Z is a sequence of random statements consistently describing the state of an “earlier drawn” random object (Zk)k∈N. The statement Xi = (k, z) asserts that the k-th bit of (Zk)k∈N has value Zk = z.

Let a process (Xi)i∈Z have the form

Xi := (Ki, Z_{Ki}),

where (Ki)i∈Z and (Zk)k∈N are independent IID processes,

P(Ki = k) = k^{−1/β} / ζ(1/β), β ∈ (0, 1),
P(Zk = z) = 1/2, z ∈ {0, 1}.
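The process just defined can be simulated directly. A sketch (our own helper names; the Zipf-like distribution of Ki is truncated at a finite k_max, which only approximates the infinite-support definition above):

```python
import random

def santa_fe(n, beta=0.5, k_max=10_000, seed=0):
    """Sample X_1..X_n with X_i = (K_i, Z_{K_i}):
    K_i Zipf-like with P(K_i = k) proportional to k^(-1/beta),
    Z_k a fair bit drawn once per k and then fixed forever."""
    rng = random.Random(seed)
    weights = [k ** (-1.0 / beta) for k in range(1, k_max + 1)]
    ks = rng.choices(range(1, k_max + 1), weights=weights, k=n)
    bits = {}  # Z_k, drawn lazily but never changed
    return [(k, bits.setdefault(k, rng.randint(0, 1))) for k in ks]

sample = santa_fe(1000)
```

The statements are mutually consistent by construction: every statement about bit k reports the same value, which is what produces the slowly decaying mutual information between distant blocks.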

We have lim_{n→∞} I(X_1^n; X_{n+1}^{2n}) / n^β > 0.

A mixing Santa Fe process

A linguistic interpretation

Object (Z_{i,k})_{k∈N} described by text (Xi)i∈Z is a function of time i.

Let a process (Xi)i∈Z have the form

Xi := (Ki, Z_{i,Ki}),

where (Ki)i∈Z and (Z_{i,k})_{i∈Z}, k ∈ N, are independent,

P(Ki = k) = k^{−1/β} / ζ(1/β), (Ki)i∈Z ∼ IID,

whereas the (Z_{i,k})_{i∈Z} are Markov chains with

P(Z_{i,k} = z) = 1/2, P(Z_{i,k} = z | Z_{i−1,k} = z) = 1 − p_k.
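Sampling this mixing variant only requires advancing every bit's Markov chain at each time step. A sketch (again truncated at a finite k_max, and with p_k set to P(Ki = k) for illustration, the largest flip probability compatible with the condition stated below):

```python
import random

def mixing_santa_fe(n, beta=0.5, k_max=500, seed=0):
    """Sample X_i = (K_i, Z_{i,K_i}) where each bit process
    (Z_{i,k})_i is a Markov chain flipping with probability p_k."""
    rng = random.Random(seed)
    raw = [k ** (-1.0 / beta) for k in range(1, k_max + 1)]
    total = sum(raw)
    p = [w / total for w in raw]                   # p_k = P(K_i = k)
    z = [rng.randint(0, 1) for _ in range(k_max)]  # initial bits Z_{0,k}
    ks = rng.choices(range(1, k_max + 1), weights=raw, k=n)
    out = []
    for k in ks:
        for j in range(k_max):  # one time step for every bit process
            if rng.random() < p[j]:
                z[j] = 1 - z[j]
        out.append((k, z[k - 1]))
    return out

xs = mixing_santa_fe(100)
```

Frequently described bits (small k) also flip most often, so the described object slowly drifts while short-range consistency of the statements is preserved.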

We have lim_{n→∞} I(X_1^n; X_{n+1}^{2n}) / n^β > 0 for p_k ≤ P(Ki = k).

Hilberg’s conjecture with h = 0?

Hilberg’s conjecture with h = 0 states that the number of distinct texts that get reproduced in the evolution of culture is severely limited. There is some fractal scaling: the longer the reproduced texts are, proportionally the fewer of them there are. Hence some text interaction occurs: smaller reproduced texts get combined into larger reproduced texts. Idea of memetic evolution: the texts get combined according to how well they fit together (a “fitness function”?). The process of generating texts by such a model of cultural evolution is nonstationary, but it may have a stationary limit for which Hilberg’s conjecture holds.

Conclusions

Hilberg’s conjecture is a hypothesis about a power-law growth of block entropy for texts in natural language. It may have profound implications for text compression, statistical language modeling, and understanding the evolution of culture. Further fundamental mathematical research is needed (models of processes, entropy estimation methods).

www.ipipan.waw.pl/~ldebowsk