arXiv:1302.5526v2 [physics.soc-ph] 31 May 2013

Stochastic dynamics of lexicon learning in an uncertain and nonuniform world

Rainer Reisenauer
Fakultät für Physik, Technische Universität München, James-Franck-Str. 1, 85748 Garching, Germany and
SUPA, School of Physics and Astronomy, University of Edinburgh, Edinburgh EH9 3JZ, UK

Kenny Smith
Language Evolution and Computation Research Unit, School of Philosophy, Psychology
and Language Sciences, University of Edinburgh, Edinburgh EH8 9LL, UK

Richard A. Blythe
SUPA, School of Physics and Astronomy, University of Edinburgh, Edinburgh EH9 3JZ, UK

(Dated: May 31, 2013)

We study the time taken by a language learner to correctly identify the meaning of all words in a lexicon under conditions where many plausible meanings can be inferred whenever a word is uttered. We show that the most basic form of cross-situational learning—whereby information from multiple episodes is combined to eliminate incorrect meanings—can perform badly when words are learned independently and meanings are drawn from a nonuniform distribution. If learners further assume that no two words share a common meaning, we find a phase transition between a maximally efficient learning regime, where the learning time is reduced to the shortest it can possibly be, and a partially efficient regime where incorrect candidate meanings for words persist at late times. We obtain exact results for the word-learning process through an equivalence to a statistical mechanical problem of enumerating loops in the space of word-meaning mappings, which exactly solves the interacting learning dynamics.

On average, children learn ten words a day, thereby amassing a lexicon of 60,000 words by adulthood [1]. This speed of learning is remarkable given that every time a speaker says a word, a hearer cannot be certain of its intended meaning [2]. Our aim is to identify which of the many mechanisms proposed for eliminating this uncertainty can actually deliver such rapid word learning. In this work, we pursue this aim with quantitative methods, in the long tradition of applying statistical mechanics to problems in learning [3–6] and communication [7–9].

Empirical research suggests that two basic types of learning mechanism are involved in word learning. First, a learner can apply various heuristics—e.g., attention to gaze direction [10] or prior experience of a language's structure [11]—at the moment a word is produced to hypothesize a set of plausible meanings. However, these heuristics may leave some residual uncertainty as to a word's intended meaning in a single instance of use. If the heuristics are weak, the set of candidate meanings could be very large. This residual uncertainty can be eliminated by comparing separate instances of a word's use: if only one meaning is plausible across all such instances, it is a very strong candidate for the word's intended meaning. This second mechanism is referred to as cross-situational learning [12, 13]. Formally, it can be couched as a process whereby associations between words and meanings are strengthened when they co-occur [13–16], as in neural network models for learning [3–6, 17]. It can also be viewed as an error-correction process [7–9] where a target set of associations is reconstructed from noisy data.

There is little consensus as to which word-learning mechanisms are most important in a real-world setting [18–22]. In part this is because word-learning experiments (e.g. [20, 23, 24]) are necessarily confined to small lexicons. A major question is whether the strategies observed in experiments allow realistically large lexicons to be learned rapidly: this can be fruitfully addressed through stochastic dynamical models of word learning [15, 25–27]. In these models, a key control parameter is the context size: the number of plausible, but unintended, meanings that typically accompany a single word's true meaning. Even when contexts are large, the rapid rate of word learning seen in children is reproduced in models where words are learned independently by cross-situational learning [15, 25–27]. This suggests that powerful heuristics, capable of filtering out large numbers of spurious meanings, are not required. However, a recent simulation study [28] shows that this conclusion relies on the assumption that these unintended meanings are uniformly distributed. In the more realistic scenario where different meanings are inferred at different rates, word-learning rates can decrease dramatically as context sizes increase. Powerful heuristics may be necessary after all.

One heuristic, of great interest to empiricists (e.g. [29–32]) and modelers (e.g. [15, 26–28, 33]), is the mutual exclusivity constraint [29]. Here, a learner assumes that no two words may have the same meaning. This generates nontrivial interactions between words, which makes the analysis of the corresponding models difficult. For example, if one begins with a master equation, as in [15, 25, 26], expressions become unwieldy to write down, let alone solve. Here, we adopt a fundamentally different approach which entails identifying the criteria that must be satisfied for a lexicon to be learned. This allows existing results for the simple case of independently-learned words and uniform meaning distributions [26] to be generalized to arbitrary meaning distributions and to the interacting problem to boot. Our main result is that mutual exclusivity induces a dynamical phase transition at a critical context size, below which the lexicon is learned at the fastest possible rate (i.e., the time needed to encounter each word once). As far as we are aware, the ability of a single heuristic to deliver such fast learning has not been anticipated in earlier work.
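The cross-situational mechanism invoked above is easy to state operationally. The following minimal Python sketch (our illustration only; the function name and episode data are hypothetical and not taken from any cited model) keeps, for each word, the intersection of all contexts in which that word has occurred, and treats a word as learned once a single candidate meaning survives.

def learn_cross_situationally(episodes):
    # episodes: iterable of (word, context) pairs, where each context is the
    # set of meanings inferred in that episode (target plus confounders).
    candidates = {}
    for word, context in episodes:
        if word in candidates:
            candidates[word] &= set(context)   # keep only meanings seen every time
        else:
            candidates[word] = set(context)
    # A word counts as learned once exactly one candidate meaning remains.
    return {word for word, c in candidates.items() if len(c) == 1}

# Hypothetical episodes: "square" is disambiguated after two exposures,
# while "circle" remains ambiguous.
episodes = [("square", {"SQUARE", "CIRCLE"}),
            ("circle", {"CIRCLE", "TRIANGLE"}),
            ("square", {"SQUARE", "TRIANGLE"})]
print(learn_cross_situationally(episodes))   # -> {'square'}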

We begin by defining our model for lexicon learning. The lexicon comprises W words, and each word i is uttered as a Poisson process with rate φ_i. In all cases, we take words to be produced according to the Zipf distribution, φ_i = 1/(µi), that applies for the ∼10^4 most frequent words in English [34–36]. Here, µ = ∑_{i=1}^{W} (1/i), so that one word appears on average per unit time. Each time a word i is presented, the intended target meaning is assumed always to be inferred by the learner by applying some heuristics. At the same time, a set of non-target confounding meanings, called the context, is also inferred. In the purest version of cross-situational learning [13, 26], a learner assumes that all meanings that have appeared every time a word has been uttered are plausible candidate meanings for that word. The word becomes learned when the target is the only meaning to have appeared in each episode. In the noninteracting case, each word is learned independently—see Fig. 1a. In the interacting case, mutual exclusivity acts to further exclude the meanings of learned words as candidates for other words. We take this exclusion to occur at the instant a word is learned, which means a single learning event may trigger an avalanche of other learning events by repeated application of mutual exclusivity. An example of this nontrivial effect, which is hard to handle within standard approaches [15, 26], is shown in Fig. 1b. Here, learning "square" causes "circle" to be learned at the same time.

FIG. 1. Acquisition of a three-word lexicon. Solid shapes are meanings that have appeared in every episode alongside a word; open shapes are therefore excluded as candidate meanings. (a) In the noninteracting case, only the meaning of the word "square" is learned. (b) In the interacting case, mutual exclusivity further removes meanings (shown hatched) of learned words, both prospectively and retrospectively (shown by arrows). All three words are learned in this example.

We consider the noninteracting case first, both to introduce our more powerful analytical approach and to pinpoint the origin of the catastrophic increase in learning times noted in [28]. Two conditions must be satisfied for the lexicon to be learned by a given time: (C1) all words must have been exposed at least once; and (C2) no confounding meaning may have appeared in every episode that any given word was uttered. To express these conditions mathematically, we introduce two stochastic indicator variables. We take E_i(t) = 1 if word i has been uttered before time t, and zero otherwise; and A_{i,j}(t) = 1 if confounding meaning j has appeared in every context alongside word i up to time t (or if word i has never been presented), and zero otherwise. Conditions (C1) and (C2) then imply that the probability that the lexicon has been learned by time t is

L(t) = \Big\langle \prod_i E_i(t) \prod_{j \neq i} [1 - A_{i,j}(t)] \Big\rangle = \Big\langle \prod_{i \neq j} [1 - A_{i,j}(t)] \Big\rangle   (1)

where the angle brackets denote an average over all sequences of episodes that may occur up to time t. The second equality holds because A_{i,j}(t) = 1 ∀ j ≠ i if E_i(t) = 0. This expression is valid for any distribution over contexts. For brevity, we consider a single, highly illustrative construction that we call resampled Zipf (RZ). It is based on the idea that meaning frequencies should follow a similar distribution to word forms [28]. It works by associating an ordered set, M_i, of M confounding meanings with each word i. The kth meaning in each set has an a priori statistical weight 1/k. Whenever word i appears, meanings are repeatedly sampled from M_i with their a priori weights, and added to the context if they are not already present, until a context of C distinct meanings has been constructed. When words are learned independently, the learning time depends only on M, W and C, and not on which meanings are present in any given set M_i [26].

We seek the time, t*, at which the lexicon is learned with some high probability 1 − ǫ. In the RZ model, each context is an independent sample from a fixed distribution. Hence, the correlation functions ⟨A_{i_1,j_1} A_{i_2,j_2} ⋯⟩ in (1) all decay exponentially in time. To find t* to good accuracy in the small-ǫ limit, only the slowest decay mode for each word i is needed. Higher-order correlation functions depend on many meanings co-occurring, and so decay more rapidly than lower-order correlation functions. As shown in Appendix A, we find that at late times (1) is well approximated by

L(t) \sim \prod_i \left[ 1 - e^{-\phi_i (1 - a_i^*) t} \right]   (2)

where a_i* is the fraction of episodes in which word i's most frequent confounder appears alongside the target. This expression generalizes results for independently-learned words [15, 25, 26] from uniform to arbitrary nonuniform confounder distributions.

The RZ model has the further simplification that a_i* takes a common value, a*, for all words i. Then, it is known from previous calculations [26] for Zipf-distributed word frequencies that the learning time is

t^* \sim \frac{\mu W}{1 - a^*}\, \mathcal{W}_0\!\left( \frac{W}{-\ln(1 - \epsilon)} \right)   (3)

where 𝒲_0(z) is the principal branch of the Lambert W function [37]. For large argument, this function behaves as a logarithm.
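To make the RZ construction concrete, the sketch below (our own illustration; function names and parameter values are assumptions, not taken from the paper) builds a context of C distinct confounders using the a priori weights 1/k and estimates a*, the appearance probability of the most frequent confounder, by Monte Carlo. This mirrors the way that quantity is obtained numerically for the comparison in Fig. 2, since no closed-form expression for it is available.

import random

def rz_context(M, C, rng=random):
    # Build a context of C distinct confounders from an ordered candidate set
    # of M meanings, the k-th meaning carrying a priori weight 1/k.
    weights = [1.0 / k for k in range(1, M + 1)]
    context = set()
    while len(context) < C:
        context.add(rng.choices(range(1, M + 1), weights=weights)[0])
    return context

def estimate_a_star(M, C, episodes=20000, rng=random):
    # Fraction of episodes in which the rank-1 (most frequent) confounder appears.
    return sum(1 in rz_context(M, C, rng) for _ in range(episodes)) / episodes

print(estimate_a_star(M=50, C=10))   # illustrative parameter values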

In Fig. 2, we compare the analytical result (3) with learning times obtained from direct Monte Carlo simulations, conducted as detailed in [26]. The only complication is that we unfortunately have no analytic expression for a* arising from the RZ procedure. We therefore obtain the frequency of the most common confounder for given C and M from independent Monte Carlo samples. The agreement between (3) and simulation is very good.

FIG. 2. Time to learn a lexicon of W words independently to a residual probability ǫ = 0.01, with C of M confounders present in each episode, for (W, M) = (50, 50), (50, 100), and (100, 50). Points: data from Monte Carlo simulations (over 10,000 sampled lexicons in each case). Lines: the analytical result, Eq. (3).

Fig. 2 also shows that the learning time increases super-exponentially with the context size. We have found that the probability that the kth most frequent confounder appears in a context of size C fits the form p_k ≈ 1 − (1 − w_k)^{C e^{λC}}, where w_k is the a priori probability and λ is a fitting parameter that depends on M and k. As noted by Vogt [28], the repeated sampling without replacement implies that p_k ≥ 1 − (1 − w_k)^C. Our analysis further reveals that the learning time is entirely determined by the frequency of the most common confounder, a*, through (3). We note that this is true even when other confounders have comparable appearance frequencies (C ≤ 5).

We now turn to the case where the mutual exclusivity constraint serves to exclude the meanings of learned words as possible meanings for other words. In this case, it is important to distinguish between labeled and unlabeled meanings: an unlabeled meaning is not the target meaning of any word in the lexicon, and hence cannot be excluded using the mutual exclusivity constraint. To generalize Eq. (1) to this problem, we must identify the conditions for the lexicon to be learned. Condition (C1) still applies: each word must be uttered at least once for a learner to be able to learn it. Condition (C2) now applies only to unlabeled confounding meanings: these can only be excluded if they fail to appear in a context, as before. When these two conditions are satisfied, there is a third—necessary and sufficient—condition for the lexicon to be learned that takes into account all the interactions and avalanches generated by the mutual exclusivity constraint. This is condition (C3): no candidate loops exist at time t. A candidate loop, ℓ = (i_1, i_2, ..., i_n), is a subset of distinct, labeled meanings whereby each meaning i_k has appeared alongside the word associated with meaning i_{k−1} (or i_n if k = 1) every time it has been uttered. Inspection of Fig. 1b shows that the one candidate loop (∎, ●) that exists after the third episode is destroyed in the fourth. Then, in the fifth episode, the final word appears, and since no unlabeled meaning is a candidate for any word, the entire three-word lexicon is learned.

To see why condition (C3) is necessary and sufficient in general when (C1) and (C2) hold, we first show that a candidate loop must exist if the lexicon has not been learned. Suppose word i_1 has not been learned. Then, at least one meaning, i_2, must confound word i_1. Word i_2 must also not have been learned, otherwise meaning i_2 would not confound word i_1. Hence, word i_2 must be confounded by a meaning, i_3, and so on. As there is a finite set of words, this sequence of meanings must eventually form a loop.

We now show the lexicon cannot have been learned if a candidate loop exists by first assuming that it has been learned under these conditions. Then, if word i_1 was learned at time t, word i_2 must have been learned before time t for mutual exclusivity to act (even if words i_1 and i_2 are learned as part of the same avalanche). Iterating this argument around the loop, one finds that word i_1 can only have become learned at time t if it had already been learned at some earlier time. This contradiction therefore implies that the absence of candidate loops and a learned lexicon are equivalent.

We again use indicator variables to translate conditions (C1)–(C3) into an exact expression for the learning probability. Introducing C_ℓ(t) = A_{i_1,i_2}(t) A_{i_2,i_3}(t) ⋯ A_{i_n,i_1}(t), which equals 1 if the loop ℓ persists at time t, we have

L(t) = \Big\langle \prod_{i=1}^{W} E_i(t) \prod_{j > W} [1 - A_{i,j}(t)] \prod_{\ell} [1 - C_\ell(t)] \Big\rangle   (4)

again valid for any distribution of confounding meanings. Here, meanings 1 to W correspond to words 1 to W, and so meanings with an index j > W are unlabeled. The product over ℓ is over all possible candidate loops. This expression has the remarkable property that it is expressed concisely in terms of the word and confounder appearance frequencies alone: the avalanche dynamics triggered by mutual exclusivity do not enter explicitly. This property, reminiscent of the avalanche dynamics of Abelian sandpile models [38], reduces analysis of the learning probability to the statistical mechanical problem of enumerating candidate loops.
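Condition (C3) can be checked directly. In the sketch below (ours; the data structure is hypothetical), labeled meanings are identified with their words and confounds[i] holds the labeled meanings that have appeared in every episode of word i so far; given (C1) and (C2), the lexicon is learned exactly when no candidate loop survives, which the mutual-exclusivity avalanche itself detects by repeatedly clearing words whose confounder sets have emptied.

def has_candidate_loop(confounds):
    # confounds[i]: set of labeled meanings (named by their words) that still
    # confound word i, i.e. have appeared in every episode of i so far.
    remaining = {word: set(js) for word, js in confounds.items()}
    cleared_any = True
    while cleared_any:
        cleared = [word for word, js in remaining.items() if not js]
        cleared_any = bool(cleared)
        for word in cleared:
            del remaining[word]                 # word is learned ...
            for js in remaining.values():
                js.discard(word)                # ... so mutual exclusivity removes it
    return bool(remaining)                      # survivors can only sit on loops

# Two-word loop analogous to the one discussed for Fig. 1b: each word's
# target meaning still confounds the other word.
print(has_candidate_loop({"square": {"circle"}, "circle": {"square"}}))   # True
print(has_candidate_loop({"square": set(), "circle": {"square"}}))        # False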

In the interacting problem, the structure of each candidate set M_i is important, as this determines which words interact. We consider a model which has no unlabeled meanings and where each set M_i is a sample of M non-target meanings obtained via the RZ prescription. Then, in each episode, C meanings are drawn from the relevant candidate set using RZ again, but with an a priori weight 1/k, where k is the rank of a meaning within the set M_i when ordered by the frequency of the corresponding words. Thus meanings of high-frequency words are high-frequency confounders. Learning times from Monte Carlo simulations are shown in Fig. 3.

FIG. 3. As Fig. 2 but with the mutual exclusivity constraint, for (W, M) = (100, 50) and (200, 50). Points: data from Monte Carlo simulations (100,000 lexicons for C ≤ 20, at least 2,500 lexicons for larger C). Dotted lines: time for the entire lexicon to have been exposed with residual probability ǫ = 0.01. Dashed lines: time for the slowest decaying candidate loop to remain with probability ǫ. Solid line: time to learn the lexicon independently, Eq. (3), for comparison.

We observe two distinct learning-time regimes. At small C, the learning time is constant, and close to the time it takes for all words in the lexicon to appear at least once. (This time is given by Eq. (3) with a* = 0.) In this regime, learning is as fast as it can possibly be: mutual exclusivity is maximally efficient and reverses the undesirable increase in learning times that arises from nonuniform confounder distributions. Above a critical context size, the learning time rises, but remains much smaller than when words are learned independently: mutual exclusivity is partially efficient in this regime.

Our exact result (4) can be used to explain these observations, details of which appear in Appendix B. For the RZ model as described above, it turns out that only one confounder loop, ℓ = (1, 2), is relevant at late times. Consequently, the learning probability L(t) is asymptotically given as the product of two factors. The first gives the probability that all words have been encountered by time t, and approaches unity exponentially with rate 1/(µW). The second is the probability that the loop ℓ = (1, 2) has not decayed away: this approaches unity with rate 3(1 − a*)/(2µ). The appearance frequency of the most frequent confounder, a*, increases with context size. When a* < 1 − 2/(3W), the slowest relaxational mode of the learning probability is associated with each word being uttered at least once, whereas for larger values, the slowest mode comes from eliminating the confounder loop. In this latter partially-efficient regime, the lexicon learning time is predicted as t* = −2µ ln ǫ / [3(1 − a*)] for small ǫ, in very good agreement with simulation data (see Fig. 3).

We describe the sudden change in the dominant relaxational behavior—a phenomenon seen also in driven diffusive systems [39]—as a dynamical phase transition. It is broadly reminiscent of transitions exhibited by combinatorial optimization problems, whereby the number of unsatisfied constraints increases from zero above a critical difficulty threshold [40]. In the present case the learning problem remains solvable in both regimes, but there is a transition from a regime where it is solved in constant time to one where the time grows super-exponentially in the difficulty of the problem (here, the context size).
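The two regimes can be evaluated numerically. The sketch below (ours; it ignores the crossover near the critical point, assumes SciPy is available, and uses illustrative parameter values) computes the maximally-efficient learning time from Eq. (3) with a* = 0 via the Lambert W function, and switches to the partially-efficient prediction t* = −2µ ln ǫ / [3(1 − a*)] above the critical value a*_c = 1 − 2/(3W), which illustrates the plateau-then-rise behaviour as a* grows with context size.

import math
from scipy.special import lambertw   # assumes SciPy is available

def interacting_learning_time(W, a_star, eps=0.01):
    # Learning time t* under mutual exclusivity in the two regimes:
    # below a*_c the exposure of all words is limiting (Eq. (3) with a* = 0),
    # above a*_c the decay of the slowest confounder loop is limiting.
    mu = sum(1.0 / i for i in range(1, W + 1))
    a_crit = 1.0 - 2.0 / (3.0 * W)
    if a_star < a_crit:
        return mu * W * lambertw(W / -math.log(1.0 - eps)).real
    return -2.0 * mu * math.log(eps) / (3.0 * (1.0 - a_star))

W = 100
for a in (0.5, 0.99, 0.9999):   # illustrative values of a*
    print(a, interacting_learning_time(W, a))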

To summarize, we have found that mutual exclusivity is an extremely powerful word-learning heuristic. It can yield lexicon learning times in the presence of uncertainty that coincide with the time taken for each word to be heard at least once. Empirical data (summarized in [26]) suggests that this is easily fast enough for realistic lexicons of W = 60,000 words to be learned. To enter the partially-efficient regime, each word's most frequent confounder would need to be present in at least 99.99% of all episodes: even then, learning is over W times faster than when mutual exclusivity is not applied. The dynamical transition between a maximally- and a partially-efficient regime also appears to be present in a variety of word-learning models we have investigated, e.g., those in which confounder frequencies are uncorrelated with their corresponding word frequencies, or those using less memory-intensive learning strategies [41]. We also expect the transition to be evident in models where the target meaning does not always appear, at least in the regime where learning is possible [15, 27]. We believe the analytical methods introduced in this work should allow more detailed quantities to be calculated, e.g., the distribution of learning times for a given word, which would shed light on such phenomena as the childhood vocabulary explosion at around 18 months [42]. Similar thinking may also allow analysis of other nonequilibrium dynamical systems whose master equations are hard to solve directly. Finally, our results suggest new empirical questions, such as whether high-frequency confounders correlate with high-frequency words, and the extent to which learners are able to apply the mutual-exclusivity constraint retroactively. We therefore contend that statistical physicists can contribute much to the understanding of how children learn the meaning of words.

Acknowledgments — We thank Mike Cates and Cait MacPhee for comments on the manuscript.

Appendix A: Learning time in the noninteracting case

In the main text, we derived a formula—given there as Eq. (1)—for the probability L(t) that a lexicon of words is learned by time t if they are learned independently by cross-situational learning. This read

L(t) = \Big\langle \prod_i E_i(t) \prod_{j \neq i} [1 - A_{i,j}(t)] \Big\rangle   (A1)

where E_i(t) = 1 only if word i has been presented by time t, and A_{i,j}(t) is 1 if word i has never been presented or if, in every presentation up to time t, the confounding meaning j ≠ i has always appeared alongside it. Under all other conditions, these indicator variables are zero. The angle brackets denote an average over all possible exposure sequences.

In the main text, this equation was also presented in an alternative form, which follows from the fact that E_i(t) = 0 implies that A_{i,j}(t) = 1 for all j ≠ i. Hence, for all allowed combinations of E_i(t) and A_{i,j}(t), we have the identity [1 − E_i(t)] A_{i,j}(t) = 1 − E_i(t), which can be rearranged to obtain E_i(t)[1 − A_{i,j}(t)] = 1 − A_{i,j}(t). Assuming that there is at least one confounding meaning for each word, the E_i(t) variables in the above equation are then redundant, and the more concise form

L(t) = \Big\langle \prod_i \prod_{j \neq i} [1 - A_{i,j}(t)] \Big\rangle   (A2)

then applies.

In the main text, we discussed models where contexts of confounding meanings were independently sampled from distributions that may be word-dependent, but remain fixed over time. In particular, this implies that the contexts appearing against different words are independent, and we have factorization of the average into word-dependent factors:

L(t) = \prod_i \Big\langle \prod_{j \neq i} [1 - A_{i,j}(t)] \Big\rangle.   (A3)

Since the confounder distributions are fixed, we find after n_i presentations of word i that

A_{i,j_1}(t) A_{i,j_2}(t) \cdots A_{i,j_k}(t) = \begin{cases} 1 & \text{with probability } a_i(j_1, \ldots, j_k)^{n_i} \\ 0 & \text{otherwise} \end{cases}   (A4)

where a_i(j_1, ..., j_k) is the joint probability that all k meanings j_1, j_2, ..., j_k appear in a single episode. Since word i is presented as a Poisson process with frequency φ_i, we find that

\langle A_{i,j_1}(t) A_{i,j_2}(t) \cdots A_{i,j_k}(t) \rangle = \sum_{n_i = 0}^{\infty} \frac{(\phi_i t)^{n_i}}{n_i!} \, e^{-\phi_i t} \, a_i(j_1, \ldots, j_k)^{n_i} = e^{-\phi_i [1 - a_i(j_1, \ldots, j_k)] t}.   (A5)

Therefore, on multiplying out the average in (A3), we find a sum of exponential decays. We are interested in the slowest decay mode, which corresponds to the highest possible value of a_i(j_1, ..., j_k) among all possible sets of confounding meanings. As noted in the main text, any combination of meanings j_1, j_2, ..., j_k cannot appear more frequently than the least frequent meaning among that subset. If, for each word, the individual meaning frequencies a_i(j) are distinct for different j, there will be a unique maximum appearance frequency, and the slowest decay is given by a_i* = max_j {a_i(j)}. Multiplying the factors for each word i together yields Eq. (2) of the main text. We note that in the special case where the most frequent meaning is r-fold degenerate, we acquire a prefactor r in front of the dominant exponential decay.

For the case where a_i* is the same for all words i, Eq. (3) in the main text is obtained by taking the logarithm of L(t), replacing the sum with an integral, and expanding the logarithm to first order. For a Zipf distribution of word frequencies, φ_i = 1/(µi) with µ = ∑_{i=1}^{W} (1/i), this procedure yields [26]

\ln L(t) \approx -\int_1^W \mathrm{d}x \, \exp\!\left( -\frac{(1 - a^*) t}{\mu x} \right) \approx -\frac{\mu W^2}{(1 - a^*) t} \exp\!\left( -\frac{(1 - a^*) t}{\mu W} \right)   (A6)

where we have used the asymptotics of the exponential integral [43] to obtain the second approximate equality. The solution of the equation ln L(t*) = ln(1 − ǫ) yields the learning time given by Eq. (3). This involves the Lambert W function, which is defined by solutions of the equation \mathcal{W}(z)\, e^{\mathcal{W}(z)} = z [37].
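Eq. (A5) is easy to verify by direct simulation. The sketch below (ours; the parameter values are arbitrary) draws a Poisson number of presentations for a single word, lets one confounder appear independently with probability a in each presentation, and compares the empirical average of the indicator A(t) with e^{−φ(1−a)t}.

import math
import random

def sample_poisson(lam, rng=random):
    # Knuth-style inversion; adequate for the moderate means used here.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def mean_A(phi, a, t, trials=50000, rng=random):
    # <A(t)>: probability that the confounder appeared in every one of the
    # Poisson(phi*t) presentations up to time t (A = 1 if there were none).
    hits = 0
    for _ in range(trials):
        n = sample_poisson(phi * t, rng)
        hits += all(rng.random() < a for _ in range(n))
    return hits / trials

phi, a, t = 1.0, 0.8, 5.0
print(mean_A(phi, a, t), math.exp(-phi * (1.0 - a) * t))   # should agree closely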

Appendix B: Learning time for the interacting RZ model

In the interacting RZ model described in the main text, it is assumed that all meanings are labeled, and so Eq. (4) for the learning probability simplifies to

L(t) = \Big\langle \prod_{i=1}^{W} E_i(t) \prod_{\ell} [1 - C_\ell(t)] \Big\rangle,   (B1)

where here C_\ell(t) = A_{i_1,i_2}(t) A_{i_2,i_3}(t) \cdots A_{i_n,i_1}(t) for an ordered subset ℓ = (i_1, i_2, ..., i_n) of the W meanings.

Numerical investigations of the RZ sampling procedure reveal that, when the context size is large, the most frequent confounders are all very likely to appear (a_{i,j} ≈ 1 for the lowest j), while the relative frequencies of non-appearance diverge with C, i.e., (1 − a_{i,j+1})/(1 − a_{i,j}) → ∞ as C is increased. Since it is these non-appearance probabilities, 1 − a_{i,j}, that enter into the decay rates of correlation functions (see above), it follows that the slowest-decaying confounder loops are those that are (a) short; and (b) comprise only the most frequent meanings. The slowest decay of all loops therefore comes from ℓ = (1, 2). We have found that a good match between theory and numerical data is obtained by assuming this is the only loop that contributes to the late-time behavior of L(t).

To obtain the theoretical prediction, we first make this single-loop approximation:

L(t) = \Big\langle \prod_{i=1}^{W} E_i(t) \, [1 - A_{1,2}(t) A_{2,1}(t)] \Big\rangle = \prod_{i=1}^{W} \langle E_i(t) \rangle \left[ 1 - \frac{\langle E_1(t) A_{1,2}(t) \rangle \langle E_2(t) A_{2,1}(t) \rangle}{\langle E_1(t) \rangle \langle E_2(t) \rangle} \right],   (B2)

where we have used the fact that the contexts presented alongside different words are uncorrelated. ⟨E_i(t)⟩ is the probability that an event governed by a Poisson process with frequency φ_i has occurred at least once by time t. Hence,

\prod_{i=1}^{W} \langle E_i(t) \rangle = \prod_{i=1}^{W} \left[ 1 - e^{-\phi_i t} \right].   (B3)

This is of the same form as Eq. (2) of the main text, but with a* = 0, and so from (A6) we have that

\prod_{i=1}^{W} \langle E_i(t) \rangle \approx \exp\!\left( -\frac{\mu W^2}{t} \, e^{-t/(\mu W)} \right).   (B4)

Turning now to the second term in (B2), we use again the identity [1 − E_i(t)] A_{i,j}(t) = 1 − E_i(t) from the previous section to find that E_i(t) A_{i,j}(t) = 1 − [1 − A_{i,j}(t)] − [1 − E_i(t)]. Hence,

\Lambda_{i,j} \equiv \frac{\langle E_i(t) A_{i,j}(t) \rangle}{\langle E_i(t) \rangle} = \frac{e^{-\phi_i [1 - a_i(j)] t} - e^{-\phi_i t}}{1 - e^{-\phi_i t}}.   (B5)

If a_i(j) is close to unity, as is the case for the high-frequency meanings in the RZ model, we have at late times that

\Lambda_{i,j} \sim e^{-\phi_i [1 - a_i(j)] t}.   (B6)

Combining this result with (B4) in (B2), and noting that φ_i = 1/(µi), we arrive at

L(t) \sim \exp\!\left( -\frac{\mu W^2}{t} \, e^{-t/(\mu W)} \right) \left[ 1 - \exp\!\left( -\frac{3 (1 - a^*) t}{2 \mu} \right) \right]   (B7)

which gives an asymptotic expression for the learning probability as a function of time. To convert this into a learning time, we need to solve the equation L(t*) = 1 − ǫ. Unfortunately, we have not been able to do this exactly. It is, however, straightforward to identify the slowest decay mode of L(t) by expanding out:

L(t) \approx 1 - \frac{\mu W^2}{t} \exp\!\left( -\frac{t}{\mu W} \right) - \exp\!\left( -\frac{3 (1 - a^*) t}{2 \mu} \right) + \cdots   (B8)

Thus, as stated in the main text, we find that the mode associated with exposure of the entire lexicon decays at rate 1/(µW), and that the mode associated with elimination of the confounder loop decays at rate 3(1 − a*)/(2µ). Now, as ǫ → 0, the learning time t* must diverge towards infinity. Hence, as ǫ is reduced, the subleading term in (B8) can be made arbitrarily small, and the solution of

\epsilon = \begin{cases} \dfrac{\mu W^2}{t^*} \exp\!\left( -\dfrac{t^*}{\mu W} \right) & \text{for } a^* < 1 - \dfrac{2}{3W} \\[1ex] \exp\!\left( -\dfrac{3 (1 - a^*) t^*}{2 \mu} \right) & \text{for } a^* > 1 - \dfrac{2}{3W} \end{cases}   (B9)

yields the learning time t* to better and better accuracy in the limit ǫ → 0. Formally, as ǫ → 0, the function φ(a*) = t*/ln ǫ exhibits a nonanalyticity at a* = 1 − 2/(3W). It is in this sense that we regard this model as exhibiting a dynamical phase transition.

For more general models, in which more than one candidate loop enters at large times, we have found that including only loops of length 2 in the product in (4) yields very good agreement with simulation data. More precisely, numerically-determined roots of L(t) = 1 − ǫ, with L(t) given by the approximate expression

L(t) \approx \prod_{i=1}^{W} \left[ 1 - e^{-\phi_i t} \right] \prod_{\langle i,j \rangle} \left[ 1 - \Lambda_{i,j} \Lambda_{j,i} \right]   (B10)

correspond well with simulation data, and furthermore provide evidence for our claim that the dynamical phase transition reported here is not a peculiarity of the specific model discussed in the main text.
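As a usage sketch (ours; parameters are illustrative), the learning time can also be obtained directly from the single-loop approximation (B7) by bisection on L(t*) = 1 − ǫ, rather than from the limiting forms in (B9); L(t) is monotonically increasing in t, so bisection is safe.

import math

def L_single_loop(t, W, a_star):
    # Eq. (B7): probability that all words have been exposed, times the
    # probability that the loop (1, 2) has decayed away.
    mu = sum(1.0 / i for i in range(1, W + 1))
    exposed = math.exp(-(mu * W * W / t) * math.exp(-t / (mu * W)))
    loop_gone = 1.0 - math.exp(-3.0 * (1.0 - a_star) * t / (2.0 * mu))
    return exposed * loop_gone

def learning_time(W, a_star, eps=0.01, t_max=1e9):
    lo, hi = 1e-6, t_max
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if L_single_loop(mid, W, a_star) < 1.0 - eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(learning_time(W=100, a_star=0.5))     # exposure-limited regime
print(learning_time(W=100, a_star=0.999))   # loop-limited regime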

[1] P. Bloom, How Children Learn the Meanings of Words (MIT Press, Cambridge, MA, 2000).
[2] W. V. O. Quine, Word and Object (MIT Press, 1960) pp. 26–79.
[3] T. L. H. Watkin and A. Rau, Rev. Mod. Phys. 65 (1993).
[4] J. J. Hopfield, PNAS 79, 2554 (1982).
[5] D. J. Amit, H. Gutfreund, and H. Sompolinsky, Phys. Rev. A 32 (1985).
[6] D. J. Amit, H. Gutfreund, and H. Sompolinsky, Phys. Rev. Lett. 55, 1530 (1985).
[7] N. Sourlas, Nature 339, 693 (1989).
[8] Y. Kabashima, T. Murayama, and D. Saad, Phys. Rev. Lett. 84, 1355 (2000).
[9] H. Nishimori, Statistical physics of spin glasses and information processing: an introduction (Oxford University Press, Oxford, 2001).

[10] D. A. Baldwin, Child Dev. 62, 875 (1991).
[11] See e.g. P. Bloom and L. Markson, Trends Cogn. Sci. 2, 67 (1998) for a review.
[12] S. Pinker, Learnability and cognition: The acquisition of argument structure (MIT Press, Cambridge, MA, 1989).
[13] J. M. Siskind, Cognition 61, 1 (1996).
[14] P. Vogt and A. D. M. Smith, in Proceedings of the Annual Machine Learning Conference of Belgium and The Netherlands (Benelearn) (Brussels, 2004).
[15] P. F. C. Tilles and J. F. Fontanari, J. Math. Psych. 56, 396 (2012).
[16] C. Yu and L. B. Smith, Psychol. Rev. 119, 21 (2012).
[17] F. Pulvermüller, Behav. Brain Sci. 22, 253 (1999).
[18] L. Gleitman, Lang. Acquis. 1, 3 (1990).
[19] S. Pinker, Lingua 92, 377 (1994).
[20] C. Yu and L. B. Smith, Psychol. Sci. 18, 414 (2007).
[21] M. C. Frank, N. D. Goodman, and J. B. Tenenbaum, Psychol. Sci. 20, 578 (2009).
[22] T. N. Medina, J. Snedeker, J. C. Trueswell, and L. R. Gleitman, PNAS 108, 9014 (2011).
[23] L. Smith and C. Yu, Cognition 106, 1558 (2008).
[24] K. Smith, A. D. M. Smith, and R. A. Blythe, Cognitive Sci. 35, 480 (2011).
[25] K. Smith, A. D. M. Smith, R. A. Blythe, and P. Vogt, in Symbol Grounding and Beyond, Lecture Notes in Computer Science, Vol. 4221 (Springer, Heidelberg, 2006) pp. 31–44.
[26] R. A. Blythe, K. Smith, and A. D. M. Smith, Cognitive Sci. 34, 620 (2010).
[27] P. F. C. Tilles and J. F. Fontanari, EPL 99, 60001 (2012).
[28] P. Vogt, Cognitive Sci. 36, 726 (2012).
[29] E. M. Markman and G. F. Wachtel, Cognitive Psychol. 20, 121 (1988).
[30] R. M. Golinkoff, C. B. Mervis, and K. Hirsh-Pasek, J. Child Lang. 21, 125 (1994).
[31] J. Halberda, Cognition 87, 23 (2003).
[32] E. M. Markman, J. L. Wasow, and M. B. Hansen, Cognitive Psychol. 47, 241 (2003).
[33] M. C. Frank, N. D. Goodman, and J. B. Tenenbaum, Adv. Neural Inf. Process Syst. 20, 20 (2007).
[34] G. K. Zipf, Human behavior and the principle of least effort: An introduction to human ecology (Addison-Wesley, Cambridge, MA, 1949).
[35] R. Ferrer i Cancho and R. V. Solé, J. Quant. Ling. 8, 165 (2001).
[36] A. M. Petersen, J. N. Tenenbaum, S. Havlin, H. E. Stanley, and M. Perc, Sci. Rep. 2, 943 (2012).
[37] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, in Advances in Computational Mathematics (1996) pp. 329–359.
[38] D. Dhar, Phys. Rev. Lett. 64, 1613 (1990).
[39] J. de Gier and F. H. L. Essler, Phys. Rev. Lett. 95, 240601 (2005).
[40] M. Mézard, G. Parisi, and R. Zecchina, Science 297, 812 (2002).
[41] R. Reisenauer, K. Smith, A. D. M. Smith, and R. A. Blythe, in preparation.
[42] B. McMurray, Science 317, 631 (2007).
[43] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, 9th ed. (Dover, NY, 1965).