<<

Random Language Model

E. DeGiuli Institut de Physique Th´eoriquePhilippe Meyer, Ecole´ Normale Sup´erieure, PSL University, Sorbonne Universit´es,CNRS, 75005 Paris, France

Many complex generative systems use languages to create structured objects. We consider a model of random languages, defined by weighted context-free grammars. As the distribution of grammar weights broadens, a transition is found from a random phase, in which sentences are indistinguishable from noise, to an organized phase in which nontrivial information is carried. This marks the emergence of deep structure in the language, and can be understood by a competition between energy and entropy.

It is a remarkable fact that structures of the most are the lowest order of the that sup- astounding complexity can be encoded into sequences ports hierarchical structure. of digits from a finite alphabet. Indeed, the complex- Despite their ubiquity in models of complex generative ity of life is written in the genetic code, with alphabet systems, grammars have hitherto played a minor role in A, T, C, G , proteins are coded from strings of 20 amino physics, and most known results on grammars are theo- { } acids, and human-written text is composed in small, fixed rems regarding worst-case behavior [12], which need not alphabets. This ‘infinite use of finite means’ [1] was for- represent the typical case. Human languages show Zipf’s malized by Post and Chomsky with the notion of gen- law [13–15], a power-law dependence of word frequency erative grammar [2, 3], and has been elaborated upon on its rank, and many sequences, including human text, since, both by linguists and computer scientists [4]. A show long-range information-theoretic correlations [16– generative grammar consists of an alphabet of hidden 18], which can be created by a CFG [18]; but are these symbols, an alphabet of observable symbols, and a set typical features of some ensemble of grammars? In this of rules, which allow certain combinations of symbols to work we initiate this research program by proposing and be replaced by others. From an initial start symbol S, simulating an ensemble of CFGs, so that grammars can one progressively applies the rules until only observable be considered as physical systems [19]. We will find symbols remain; any sentence produced this way is said that CFGs possess two natural ‘temperature’ scales that to be ‘grammatical,’ and the set of all such sentences control grammar complexity, one at the surface interface, is called the language of the grammar. The sequence and another in the tree interior. As either of these tem- of rule applications is called a derivation. For exam- peratures is lowered, there is a phase transition, which ple, the grammar S SS, S (S),S () has a corresponds to the emergence of nontrivial information { → → → } single hidden symbol S and two observable symbols, ( propagation. We characterize this phase transition using and ), and produces the infinite set of all strings of well- results from simulations, and understand its location by formed parentheses. A simple derivation in this grammar a balance between energy and entropy. is S SS (S)S (())S (())(). Besides their Generative grammars: A generative grammar is de- → → → → original use in linguistics, where the observable symbols fined by an alphabet χ and a set of rules . The al- R are typically taken to be words, and grammars produce phabet has N hidden, ‘non-terminal’ symbols χH , and sentences (Fig 1a) [3, 5], generative grammars have found T observable, ‘terminal’ symbols χO. The most gen- application in manifold domains: in the secondary struc- eral rule is of the form a a . . . a b b . . . b , where 1 2 n → 1 2 m ture of RNA (Fig 1b) [6, 7], in design [4], in ai χH , bi χ = χH χO. In a CFG the rules are spe- self-assembly [8], in protein sequence analysis [9], and in ∈ ∈ ∪ quasicrystals [10], to name a few. S (b) S The complexity of a language is limited by conditions imposed on its grammar, as described by the Chomsky NP VP g S c arXiv:1809.01201v2 [cond-mat.dis-nn] 27 Mar 2019 hierarchy, which, in increasing complexity, distinguishes Det N V PP a S u regular, context-free, context-sensitive, and recursively S S the P NP c g enumerable grammars [11]. Each class of grammar has a bear walked S S u characteristic graphical structure of its derivations: reg- Det N g c a into S S ular grammars produce linear derivations, context-free a u grammars produce trees (Fig 1), and context-sensitive the cave a g (a) and recursively enumerable grammars produce more elab- orate graphs. Associated with an increase in complexity is an increased difficulty of [4]. Because biologi- FIG. 1. Illustrative derivation trees for (a) simple English cal instantiations of grammars must have been discovered sentence, and (b) RNA secondary structure (after [6]). The by evolution, there is a strong bias toward simpler gram- latter is a derivation of the sequence ‘gacuaagcugaguc’ and mars; we consider context-free grammars (CFGs), which shows its folded structure. Terminal symbols are encircled. 2 cialized to the form a b b . . . b , and we will insist where π is the number of times the rule a bc ap- 1 → 1 2 m abc → that m 1, so that there is no ‘empty’ string. Without pears in the configuration σ, and likewise ρaB is the loss of generality,≥ we consider CFGs in Chomsky normal number of times the rule a B appears. This de- form, in which case all rules are of the form [4] a b c fines a conditional probability→ measure on configurations → E(σ,o , ) or a A, where a, b, c χH and A χO. Note that we P(σ, o , ) = e− |T G /Z( , ) where may→ have b = a, or b =∈c, or a = b =∈ c. Any derivation |T G T G E(σ,o , ) in Chomsky reduced form can be drawn on a binary tree. Z( , ) = e− |T G . (3) T G Beginning from the start symbol S χH , rules are ap- σi,ot { X } plied until the string contains only∈ observable symbols. Such a string is called a sentence. The set of all sentences All configurations have S at the root node. For sim- is the language of the grammar. Given a string of ob- plicity, in this work we consider as a model for the servables = A ...A and a grammar , one can ask tree topology probability P( ) = Wtree/Ztree with 1 ` ∂Ω Ω T |G S G W ( ) = p| T |(1 p)| T |, where p is the emission whether there exists a derivation that produces from tree T − the start symbol S; if so, is said to be grammatical.S probability, the probability that a hidden node becomes S A as defined above can only distin- an observable node. p controls the size of trees; we will guish grammatical from ungrammatical sentences. A choose it such that the tree size distribution is cutoff richer model is obtained by giving each rule a non- above a length ξ = 1000. Some facts about the resulting negative real valued weight. Such a weighted grammar binary trees are recorded in Supplementary Material [21]. is useful in applications, because weights can be contin- A model with weights of the form (1) is called a uously driven by a learning process, and can be used to weighted CFG (WCFG). In the particular case where define probabilities of parses. Moreover, a weighted gram- 1 = b,c Mabc = A OaA for all a, it is easy to see that M and O are conditional probabilities: Mabc = P(a mar can be put into the Gibbs form, as shown below. For P P → CFGs, to every rule of the form a bc we assign a weight bc a non-terminal) and OaA = P(a A a → terminal).| → In this case the model is called a→ probabilistic| → Mabc, and to every rule of the form a A we assign a → CFG (PCFG). In the main text, we consider a weighted weight OaA. Each candidate derivation of a sentence has two dif- CFG, model W; in SI, we show that our results are ro- ferent types of degrees of freedom. There is the topol- bust in model P, a PCFG. There are tradeoffs between ogy of the tree, namely the identity (terminal or non- these models: model P is easier to sample, because it has T Z( , ) = 1 from normalization of probability, and thus terminal) of each node, as well as the variables, both ter- T G minal and non-terminal, on the nodes. We write Ω for is factorized. But model W is more amenable to theory, the set of internal factors, i.e. factors of the form a T bc, since it is less constrained. and ∂Ω for the boundary factors, i.e. those associated→ to Random Language Model: Each grammar defines a A Trules. The number of boundary factors is written probabilities for sentences. To extract the universal prop- ` →, which is also the number of leaves. Since derivations erties of grammars, which do not depend on all details of areT trees, the number of internal factors is ` 1. We M and O, we need a measure on the space of grammars. will write σ for non-terminal symbols, and o forT terminals;− What is an appropriate measure? From Eq.(2), log M these can be enumerated in an arbitrary way 1,...,N and and log O are analogous to coupling constants in statisti- cal mechanics. A simple model is to assume a Gaussian 1,...,T , respectively. Given , we can write σi for the T distribution for these, so that M and O are lognormal. value of the non-terminal on site i, and similarly oj for This can be motivated as follows: language evolution is a the terminal on site j. The number of σi is 2` 1, while T − dynamical process, which must be slow in order for lan- the number of oj is ` . We write for the pair M,O, σ for σ , and o for oT . G guage to remain comprehensible at any given moment. { i} { t} To define a probability measure on derivations, it is If each log Mabc and log OaB are the accumulation of in- convenient to factorize it into the part specifying , and dependent, additive increments [22], these will lead to a the remainder. In this way we separate the the treeT shape lognormal. We define deep and surface sparsities as, re- from the influence of the grammar on variables. For a spectively, fixed the weight of a configuration is T 1 2 Mabc 1 2 OaB sd = log , ss = log N 3 M NT O W (σ, o , ) = Mσ σ σ Oσ o , (1) a,b,c   a,B   |T G α1 α2 α3 α1 α2 X X α Ω α ∂Ω (4) Y∈ T ∈Y T where each α = (α , α , α ) is a factor in the order σ where M = 1/N 2 and O = 1/T are the correspond- 1 2 3 α1 → σα2 σα3 . Note that Mabc = Macb in general, thus the left ing uniform probabilities; it is convenient to use this and right branches are distinguished6 [20]. We can write normalization even for model W where weights are not E W = e− with strictly normalized. A lognormal distribution of gram- mar weights is E = πabc(σ) log Mabc ρaB(σ, o) log OaB (2) − − 1 dsd sss a,b,c a,B PG(M,O) Z− J e− e− (5) X X ≡ G 3

1 1 10 100 1 ˜d 2.5 0.9 0 0.9 10 2 10-1

0.8 0.8 1.5 10-2 10-1 0.7 0.7 1 10-3 10-2 10-1 100 101 102 103 0.6 -2 10 0.5 0.5 0.6 0 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 -3 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 -3 -2 -1 0 1 2 3 0 0.5 1 10 10 10 10 10 10 10 (a) (b) (a) (b)

FIG. 2. Shannon entropy of random CFGs as functions of 3 FIG. 3. (a) Zipf plot of hidden symbols for N = 40. Here ˜d = ˜d = d/N . (a) Block entropy of hidden configurations for 3 th d/N . (b) Order parameter Q2, with bars indicating 20 and indicated k and N. (b) Block entropy of observed strings; 80th percentile ranges over grammars at each parameter value. symbols as in (a). The constant value for d >  depends th∗ th Inset: same plot in log-log axes. on the surface temperature s. Bars indicate 20 and 80 percentiles.

o1, o2, . . . , ok the Shannon block entropy rate is log M log O where J = e− a,b,c abc− a,B aB , and the space 1 Hs( ; k) = log 1/P(o1, o2, . . . , ok ) (7) of M and O is definedP by appropriateP normalization and G k |G positivity constraints. We define the Random Language For CFGs we can also consider the block entropy rate of Model as the ensemble of grammars drawn from Eq.5. deep configurations, An alternative motivation of (5) is that this is the maximum-entropy measure when the grammar-averages 1 Hd( ; k) = log 1/P(σ1, σ2, . . . , σk ) (8) sd and ss are constrained. sd and ss measure the density G k |G of rules about their respective median values M and O. where the symbols are taken from a (leftmost) derivation. When sd and ss are finite, all rules must have a finite probability: this reflects the fact that, given any finite In both cases the ensemble average is taken with the ac- tual probability of occurrence, P(o ) for Hs, and P(σ ) amount of data, one can only put a lower bound on the |G |G probability of any particular rule. In model W the La- for Hd. The grammar averages Hd(k) and Hs(k) are shown grange multipliers d and s satisfy in Fig. 2, for k as indicated; here and in the following, N 3 NT the bars show the 20th and 80th percentiles, indicating s = , s = . (6) d s the observable range of Hd and Hs over the ensemble of 2d 2s grammars [23]. The dependence on d is striking: for When  , s 0, which is the value corresponding  N 3/ log2 N, both H (1) and H (1) are flat. In this d → ∞ d → d & s d to a completely uniform deep grammar, that is, when for regime, Hd(1) log N, indicating that although configu- a non-terminal a, all rules a bc have the same probabil- rations strictly≈ follow the rules of a WCFG, deep configu- → ity 1/N 2. This is clearly the limit in which the grammar rations are nearly indistinguishable from completely ran- 3 2 carries no information. As d is lowered, sd increases, dom configurations. However, at d =  N / log N and the grammar carries more information. In terms there is a pronounced transition, and both∗ ≈ entropies be- of how deterministic the rules are, d plays the role of gin to drop. This transition corresponds to the emergence temperature, with random hot and deterministic of deep structure. ↔ ↔ cold; we will refer to it as the deep temperature. This The first block entropy Hd( ; 1) measures information analogy can also be seen formally: in SI, we show that if in the single-character distribution,G while the differential the energy E is replaced by βE, then (6) is replaced by entropies δHd( ; k) = (k+1)Hd( ; k+1) kHd( ; k) mea- 2 3 G G − G sd = β N /(2d), such that lowering d is equivalent to sure incremental information in the higher-order distribu- increasing β. Similarly, s controls information transmis- tions [17]. The Shannon entropy rate including all corre- sion at the surface; we call it the surface temperature. lations can either be obtained from limk Hd( ; k), or →∞ G To investigate the role of d on language structure, from limk δHd( ; k). These coincide, but the latter we sampled grammars from the RLM at fixed values converges→∞ faster [17].G In SI, we show that δH ( ; k), and d G T = 27, s/(NT ) = 0.01. Since the surface sparsity is thus the limiting rate, appears to collapse with ˜d log N. large, there is already some simple structure at the sur- For all entropies the sample-to-sample fluctuations de- face; we will explore how deep structure emerges as N and crease rapidly with k, suggesting that the limiting rates d are varied. For each value of N and d, we created 120 are self-averaging. distinct grammars, from which we sample 200 sentences To further investigate the nature of the transition, we (see SI for more details). Altogether approximately 7200 show in Fig. 3a a Zipf plot: the frequency of each sym- distinct languages were constructed. bol, arranged in decreasing order. Fig. 3a shows the Zipf The information content of a grammar is natu- plot for deep structure; the Zipf plot for surface struc- rally encoded by Shannon entropies. For aG sequence ture is similar, but less dramatic (see SI). We see a sharp 4 change at  : for d >  , the frequencies of hidden sym- This must be compared with the ‘energetic’ terms ∗ ∗ bols are nearly uniform, while below  , the distribution log Wtree = (` n) log(1 p) + ` log p 2` log 2 for is closer to exponential ( In SI, we show∗ that a power- p near 1/2, and−E, Eq.(2).− In E, π is positively∼ − corre- law regime for the observable symbols appears when T is lated with M, since rules with a higher weight are more large). The permutation symmetry among hidden sym- frequently used; hence we can obtain a simple scaling es- bols is thus spontaneously broken at  . timate E N 3π log m NT ρ log o where π is the mean ∗ ∼ − − What is the correct order parameter to describe this value of πabc, and log m is the value of a typical positive transition? The ferromagnetic order parameter is mr = fluctuation of log Mabc, and similarly for O. From the sum Nδ 1 , where i is a site. This does not show any rules π = Ω = ` n and ρ = ∂Ω = ` h σi,r − i a,b,c abc | | − a,B aB | | signal of a transition, despite the fact that the start sym- we have π = (` n)/N 3, ρ = `/(NT ). The mean value of P − P bol explicitly breaks the replica symmetry. A more inter- log Mabc is log M, and the mean value of log OaB is log O. esting choice is one of Edwards-Anderson type, such as These contributions lead to a constant value of E. The EA Q = Nδσi,r 1 Nδσi,s 1 where r and s label dif- positive fluctuations in log M and log O that couple to E rs h − ih − i ferent sentences produced from the same grammar, and scale as N 3 and NT , respectively, leading to 2d 2s σi is a specified site [24]. However, sentences produced by a CFG do not have fixed derivation trees, so we need q q to compare symbols in relative position. For each interior N 3 NT E ` ` + const (10) rule a bc we can define ∼ − 2 − 2 → s d r s 2 Qabc( ) = δσ ,a N δσ ,bδσ ,c 1 , (9) G h α1 α2 α3 − i Combining this with S, the effective free energy F = averaged over all interior vertices α, and averaged over E log Wtree S reflects a competition between energy − − derivations. Here σα1 is the head symbol at vertex α, and entropy. If we consider N and d as varying, then 3 2 and σα , σα are the left and right symbols, respectively. there is a scale  = N / log N where the energetic 2 3 ∗ Q measures patterns in rule application at each branching fluctuations balance entropy. For d  , the energy  ∗ of a derivation tree. It is thus an order parameter for deep of a configuration is unimportant, and the grammar is structure. Upon averaging over grammars in the absence thus irrelevant: the language produced by the WCFG of any fields, the permutation symmetry must be restored: must then be indistinguishable from random sequences, Qabc = q0 +δab ql +δac qr +δbc qh +δabδac q . As shown in as found empirically above. In contrast, for d  , ∗  ∗ SI, these components show a transition, but there is sig- the language reflects those sequences with high intrinsic nificant noise below  , despite there being 120 replicas at weight, and their entropy is less important. The char- ∗ each point. Evidently, Qabc has large fluctuations below acteristic scale  identified by these simple arguments 2 ∗  . This suggests a definition Q2 Q , plotted agrees with that found empirically above, and locates the ∗ ≡ a,b,c abc in Fig 3b. The signal is clear: on the large scale, Q2 has emergence of deep structure. However, further work is 3 P a scaling form Q2 N f (N / ) and is small above  . needed to predict the behavior of Q2, Hs, and Hd. ≈3 ∗ ∗ The scaling Q2 N suggests that below the transition, Learning human languages: Around 6000 lan- all hidden symbols∼ start to carry information in the deep guages are spoken around the world [25]; given fractured structure. and highly sparse input, how does a child come to learn Theory: How can we gain some theoretical in- the precise syntax of one of these many languages? This sight into the RLM? Consider the entropy of an observed question has a long history in linguistics and cognitive sci- string of length `, composed of n sentences of length ence [26, 27]. One scenario for learning is known as the

`k, k `k = `. The entropy of this string derives from 3 Principles and Parameters (P&P) theory [28]. This posits distinct combinatorial levels: (i) each sentence can be rep- that the child is biologically endowed with a general class resentedP by a derivation tree with many different topolo- of grammars, the ‘principles,’ and by exposure to one gies; (ii) each derivation tree can host a variety of internal particular language, fixes its syntax by setting some num- hidden variables; and (iii) given the hidden variables, the ber of parameters, assumed to be binary. For example, observed symbols can themselves vary. the head-directionality parameter controls whether verbs Some scaling considerations are useful. Each derivation come before or after objects, like English and Japanese, tree can have many topologies: the entropy of binary trees respectively. A vast effort has been devoted to mapping scales as `k log 4, so that the total tree entropy scales as out the possible parameters of human languages [25, 29]. St ` log 4. Each derivation tree has 2`k 1 hidden The richness of the discovered structure has been used variables,∼ so that the total number of hidden DOF− is 2` as criticism of the approach [30]: if the child needs to − n, and the corresponding deep entropy scales as Sd set many parameters, then do these all need to be in- (2` n) log N. Finally, the sentences have an entropy∼ nate? This would be a heavy evolutionary burden, and a − So ` log T . challenge to efficient learning. We∼ see that when typical sentences are of length ` The RLM can shed some light on this debate. First, 1, so that ` n `, these numbers are independenth i  of since only 2 living human languages are known to pos- partitioning,− to leading∼ order. For large ` we get the sess syntax beyond CFG [31], we consider WCFGs a valid scaling S ` log(4N 2T ). h i starting point [32]. Following experimental work [27], we ∼ 5 picture the learning process as follows. Initially, the child (MIT press, Cambridge, 2014). does not know the rules of the grammar, so it begins [6] D. B. Searls, Nature 420, 211 (2002). with some small number of hidden symbols and assigns [7] B. Knudsen and J. Hein, Nucleic acids research 31, 3423 uniform values to the weights M and O. To learn is to (2003). [8] E. Winfree, X. Yang, and N. C. Seeman, in DNA increase the likelihood of the grammar by adjusting the based computers II, DIMACS series in discrete mathemat- weights and adding new hidden symbols. As weights are ics and theoretical computer science, Vol. 44 (American driven away from uniform values, the temperatures d and Mathematical Soc., Providence, R.I., 1999) p. 191. s decrease. Eventually the transition to deep structure [9] J. P. Barton, A. K. Chakraborty, S. Cocco, H. Jacquin, is encountered, and the grammar begins to carry infor- and R. Monasson, Journal of Statistical Physics 162, 1267 mation. (2016). [10] J. G. Escudero, in Symmetries in Science IX (Springer, In the absence of any bias, this transition would oc- Boston, 1997) pp. 139–152. cur suddenly and dramatically, spontaneously breaking [11] M. A. Nowak, N. L. Komarova, and P. Niyogi, Nature all N 3 directions in M space simultaneously, as in Fig. 3b. 417, 611 (2002). However, in realistic child language learning, the child’s [12] For example, from Ref.4, Theorem 7.17 on the size of environment acts as a field on this likelihood-ascent, and derivation trees, Theorem 7.31 on the conversion of an can cause the structure-emerging transitions to occur at automaton to a CFG, and Theorem 7.32 on the complex- different critical deep temperatures, depending on their ity of conversion to Chomsky normal form (see below). [13] G. K. Zipf, The psycho-biology of language: An introduc- coupling to the field. For example, a left-right symmetry tion to dynamic philology (Routledge, Milton Park, 2013). breaking could correspond to setting the head direction- [14] R. F. i Cancho and R. V. Sol´e,Proceedings of the Na- ality parameter. tional Academy of Sciences 100, 788 (2003). Although this description is schematic, we insist that [15] A. Corral, G. Boleda, and R. Ferrer-i Cancho, PloS one the various symmetry-breaking transitions, which could 10, e0129031 (2015). [16] W. Ebeling and T. P¨oschel, EPL (Europhysics Letters) give rise to parameters, are emergent properties of the 26, 241 (1994). model. Thus if there are indeed many parameters to [17] T. Sch¨urmannand P. Grassberger, Chaos: An Interdisci- be set, these do not all need to be innate: the child only plinary Journal of Nonlinear Science 6, 414 (1996). needs the basic structure of a WCFG, and the rest is [18] H. W. Lin and M. Tegmark, Entropy 19, 299 (2017). emergent. The P&P theory is thus consistent with ex- [19] G. Parisi, Physica A 263, 557 (1999). istence of many parameters. If the RLM can be solved, [20] Indeed if the left-right branches are not distinguished, by which we mean that the partition function Z can be CFGs do not have any more expressive power than regular grammars [33]. computed, then the series of symmetry-breaking transi- [21] Supplementary Material includes details on binary trees, tions that occur in the presence of a field can be inferred, sampling methods, robustness in PCFG, differential en- and a map of syntax in CFGs could be deduced. This is tropies, and equation derivations, and Refs. [34, 35]. a tantalizing goal for future work. [22] D. Sornette and R. Cont, Journal de Physique I 7, 431 Conclusion: We introduced a model of random (1997). languages, which captures the generative aspect of com- [23] The error bars in measurements are then smaller by factor approximately √120 11. plex systems. The model has a transition in parameter [24] D. Gross, I. Kanter, and∼ H. Sompolinsky, Physical review space that corresponds to the emergence of deep struc- letters 55, 304 (1985). ture. Since the interaction is long-range, we expect that [25] M. C. Baker, The atoms of language: The mind’s hidden the RLM, or a variant, is exactly solvable. We hope that rules of grammar (Basic books, New York, 2008). this will be clarified in the future. [26] R. C. Berwick, P. Pietroski, B. Yankama, and N. Chom- sky, Cognitive Science 35, 1207 (2011). This work benefited from discussions with C. Callan, [27] C. Yang, S. Crain, R. C. Berwick, N. Chomsky, and J. J. J. Kurchan, G. Parisi, R. Monasson, G. Semerjian, P. Bolhuis, Neuroscience and Biobehavioral Reviews (2017). Urbani, F. Zamponi, A. Zee, and Z. Zeravcic. [28] N. Chomsky, Lectures on government and binding: The Pisa lectures, 9 (Walter de Gruyter, 1993). [29] U. Shlonsky, Language and linguistics compass 4, 417 (2010). [30] G. Ramchand and P. Svenonius, Language Sciences 46, [1] W. Von Humboldt, Humboldt:’On language’: On the di- 152 (2014). versity of human language construction and its influence [31] Only Swiss-German and Bambara have confirmed fea- on the mental development of the human species (Cam- tures beyond CFG [36, 37]. bridge University Press, 1999). [32] Note also that some lexicalized models used for machine [2] E. L. Post, American journal of mathematics 65, 197 learning, such as [38], are WCFGs with multi-indexed (1943). hidden variables. [3] N. Chomsky, (Walter de Gruyter, [33] J. Esparza, P. Ganty, S. Kiefer, and M. Luttenberger, Berlin, 2002). Information Processing Letters 111, 614 (2011). [4] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Intro- [34] S. Chib and E. Greenberg, The american statistician 49, duction to automata theory, languages, and computation, 327 (1995). 3rd ed. (Pearson, Boston, Ma, 2007). [35] P. Flajolet and R. Sedgewick, Analytic combinatorics [5] N. Chomsky, Aspects of the Theory of Syntax, Vol. 11 (cambridge University press, 2009). 6

[36] C. Culy, Linguistics and Philosophy 8, 345 (1985). Intelligence (Springer, 1985) pp. 79–89. [37] S. M. Shieber, in Philosophy, Language, and Artificial [38] M. Collins, Computational linguistics 29, 589 (2003). Random Language Model – Supplementary Information

E. DeGiuli Institut de Physique Th´eoriquePhilippe Meyer, Ecole´ Normale Sup´erieure, PSL University, Sorbonne Universit´es,CNRS, 75005 Paris, France

Sampling Methods: We consider both WCFGs and 1 1 PCFGs. Since their distribution is factorized, PCFGs are trivial to sample: we begin at the top of the tree with S, 0.8 0.8

and choose whether this branches into two non-terminals 0.6 0.6 with probability 1 p, or becomes a terminal node with probability p. S is− replaced by non-terminals or termi- 0.4 0.4

nals according to the probabilities specified in M and O, 10-3 10-2 10-1 100 101 102 103 10-3 10-2 10-1 100 101 102 respectively. This process is then repeated, where we act (a) (b) by replacement on the left-most non-terminal (for CFGs the order of replacement does not affect the derivation), FIG. 1. Differential entropy of hidden symbols of random 3 replacing non-terminals according to the probabilities M WCFGs as functions of ˜d = d/N , for indicated k and N. 2 th or O; we continue until no non-terminals remain. It is (a) versus ˜d log N; (b) versus ˜d log N. Bars indicate 20 th possible that sequences grow unboundedly; we stop any and 80 percentiles. that go beyond length 4000. This occurs for less than 4 1 a fraction 10− of samples. It is clear that the emission 1 probability p could be absorbed into M and O: the advan- 0.9 0.9 tage of our model is that we strictly control the typical 0.8 0.8 sentence length, so that this does not appear as a con- 0.7 0.7 founding variable in the analysis, and we avoid creation 0.6 of infinite trees. 0.6 0.5 WCFGs are sampled with the Metropolis-Hastings 10-3 10-2 10-1 100 101 102 103 10-3 10-2 10-1 100 101 102 103 algorithm [1]. For a given WCFG, we define a re- (a) (b) lated PCFG by M˜ = M / M , O˜ = abc abc b0,c0 ab0c0 aB FIG. 2. Shannon entropy of random PCFGs as functions OaB/ B OaB0 , for all a. This PCFG is used as a 0 P 3 candidate-generating density. In the Metropolis-Hastings of ˜d = d/N . (a) Block entropy of hidden configurations P for indicated k and N. (b) Block entropy of observed strings. rule, the individual factors Mabc,OaB all cancel, leav- Symbols are as in (a), although a different subset of param- ing only the normalization factors from b ,c Mab0c0 , 0 0 eters is shown. The constant value for d >  depends on OaB , which have smaller fluctuations than the indi- th ∗ th B0 0 P the surface temperature s. Bars indicate 20 and 80 per- vidual Mabc,OaB. This ensures efficient sampling. centiles. P For model W, for each value of N and d, we created 120 distinct grammars, from which we sample 200 sen- tences. We fix the emission probability p so that the the Catalan number mean sentence length is ` 15, and the total length of h i ≈ 1 2(` 1) 4` text sampled for each grammar is 3000. C` 1 = − (2) ≈ − ` ` 1 ∼ 4√π(` 1)3/2 Block entropies are computed using the method of  −  − Grassberger (Eq. 8 in [2]). Since a deep block entropy of leading to order k has a phase space with N k configurations, we can only effectively compute entropies for which N k 3000, . ∞ ` 1 2p Ztree = p C` 1(p(1 p)) − = , (3) and similarly surface block entropies can be computed for − − 1 + 2p 1 arXiv:1809.01201v2 [cond-mat.dis-nn] 27 Mar 2019 k T . 3000. Shown entropies in the main text satisfy this X`=1 | − | bound. They are absent from finite ` effects: we saw no where we used the result for the generating function of dependence when entropies were computed using only the the Catalan numbers. As expected there is a singularity first half of the sampled sentences. at p = 1/2. Let us write p = 1/2 +  with  > 0. Then Binary trees: Our model for the probability of a Ztree = 1 and one can show that the distribution of tree tree topology is T sizes follows `/ξ 1 Ω ∂Ω e− P( ) = (1 p)| T |p| T |, (1) P(`) , (4) T |G Ztree − ∝ `3/2 For a binary tree with ` leaves, Ω = ` 1 and ∂Ω = where ξ = 1/(42). In our numerical results we have set `. The number of binary trees with| T |` leaves− is given[3]| T | by ξ = 1000. 2

Scaling symmetry with temperature: We can been proposed that this exponent arises from constraints generalize the model by adding a bias β to the en- of efficient communication [4, 5]. ergy, such that the probability of a configuration is Equation-of-state: In model W, the grammar par- βE(σ,o , ) P(σ, o , ) = e− |T G /Z( , ). This is equiva- tition function is |T G β T G β lent to replacing Mabc by Mabc and OaB by OaB. If we β β dsd sss also rescale M as M 0 = M , and O0 = O , then the spar- ZG = dM dO J e− e− (7) 2 2 Z Z sities rescale as sd0 = β sd, ss0 = β ss. Thus the rescaled equations of state are One easily sees that

∂ log ZG ∂ log ZG 2 3 2 s = , s = (8) β N β NT d − ∂ s − ∂ s 0 = , s 0 = . (5) d s d 2 s 2 d s After a change of variable mabc = log Mabc/M, oaB = log OaB/O, ZG is Gaussian and we find Hence decreasing d and s is equivalent to increasing β, 1 1 N 3 as well as rescaling the median values M and O. It can πN 3 2 πNT 2 NT be shown that grammar-averages of moments of the par- ZG = J (9) d s tition function Z( , )m also have this scaling property.     T G Differential entropy: Deep differential entropies are leading to defined from block entropies by N 3 NT s = , s = (10) d 2 s 2 δH ( ; k) = (k + 1)H ( ; k + 1) kH ( ; k) (6) d s d G d G − d G In model P, we have a normalization δ function that can be integrated using its Fourier transform.− The final result To examine the behavior as k is increased, we performed cannot be put into a very explicit form. additional simulations over an ensemble of 30 WCFGs at Deep structure symmetries: In the absence of each d and N, with 5000 sampled sentences per gram- fields, the order parameter Qabc has a grammar average mar. The resulting δHd are plotted in Fig. 1, versus (a) 2 that must respect permutation symmetry. It then is of ˜ log N, and (b) ˜ log N, where ˜ =  /N 3. Collapse d d d d the form is better in the latter case, suggesting that the limiting rate obtained in the limit k depends on this reduced → ∞ Qabc = q0 + δab ql + δac qr + δbc qh + δabδac q , (11) variable. ∗ 2 Model P: We consider PCFGs with T = 27, s/NT = where N q0 +N(ql +qr +qh)+q = 0 from Qabc = 0. ∗ b,c 0.01, and N = 10, 20, 40, 80 with varying  . For each q0 and ql are plotted, for model W, in Fig.6. (qr behaves d P parameter set, 300 grammars were constructed, and 200 the same as ql, by symmetry. q and qh display even larger ∗ sentences were sampled. Altogether 24000 languages were fluctuations.) The bands show 20th and 80th percentiles considered. Results are shown in Figs. 2,3. All qualita- over the 120 grammars sampled at each parameter value. tive behavior is identical to that found for model W. Scal- We see that these quantities display wild fluctuations, n ing collapses are the same, but the choice of n in log N much larger than the quantity Q2. may differ: as shown in Fig.3cd, for large N the value n = 1 collapses better than n = 2, which is optimal for small N.

Zipf plots: Zipf plots for the observed symbols are [1] S. Chib and E. Greenberg, The american statistician 49, shown in Fig.3. At large d, these are not flat, unlike 327 (1995). the Zipf plots for hidden symbols, because we have fixed [2] T. Sch¨urmannand P. Grassberger, Chaos: An Interdisci- s/(NT ) = 0.01. Still, as the transition at d =  is en- plinary Journal of Nonlinear Science 6, 414 (1996). countered, the distributions broaden. In WCFGs∗ with a [3] P. Flajolet and R. Sedgewick, Analytic combinatorics larger observable alphabet, a power-law regime emerges, (cambridge University press, 2009). [4] G. K. Zipf, The psycho-biology of language: An introduc- as shown in Fig.5. For rank R greater than N, the expo- tion to dynamic philology (Routledge, Milton Park, 2013). nent is near 2, while at smaller R, the behavior depends − [5] R. F. i Cancho and R. V. Sol´e,Proceedings of the National on N, T , and d. The small R regime also depends on Academy of Sciences 100, 788 (2003). ˜s (not shown). For comparison, a slope 1, commonly [6] A. Corral, G. Boleda, and R. Ferrer-i Cancho, PloS one observed in human language [4–6], is also− shown. It has 10, e0129031 (2015). 3

101

˜d 2.5 100 100 100 2

1.5 -1 10-1 10 1 10-2 10-2 -2 0.5 10

0 10-3 0 0.2 0.4 0.6 0.8 1 10-3 10-2 10-1 100 101 102 103 10-3 10-2 10-1 100 101 102 103 10-4 10-3 10-2 10-1 100 101 102 (a) (b) (c) (d)

3 FIG. 3. Results for model P. (a) Zipf plot of hidden symbols for N = 40. Here ˜d = d/N . (b-d) Order parameter Q2, with bars indicating 20th and 80th percentile ranges over grammars at each parameter value. (d) shows a plot vs log N rather than log2 N. The collapse is better at large N but worse at small N.

101 101

100 100

10-1 10-1

10-2 10-2 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 (a) (b)

FIG. 4. Zipf plots. Frequency of observed symbols, in decreasing order. (a) WCFG (b) PCFG. In both cases N = 40, and labels are as in Fig. 3a.

100 100

-2 10-2 10

10-4 10-4 10-1 100 101 10-1 100 101 (a) (b)

FIG. 5. Zipf plot in WCFG with (a) T = 1000 and (b) T = 10000. Frequency of observed symbols, in decreasing order, together with indicated power laws as a function of the rank R.

(a) (b)

3 th th FIG. 6. Grammar-averages of scalars (a) q0 and (b) ql as functions of ˜d = d/N . Bands indicate 20 and 80 percentiles.