
Modelling function words improves unsupervised word segmentation

Mark Johnson 1,2, Anne Christophe 3,4, Katherine Demuth 2,6 and Emmanuel Dupoux 3,5

1 Department of Computing, Macquarie University, Sydney, Australia
2 Santa Fe Institute, Santa Fe, New Mexico, USA
3 École Normale Supérieure, Paris, France
4 Centre National de la Recherche Scientifique, Paris, France
5 École des Hautes Études en Sciences Sociales, Paris, France
6 Department of Linguistics, Macquarie University, Sydney, Australia

Abstract

Inspired by experimental psychological findings suggesting that function words play a special role in word learning, we make a simple modification to an Adaptor Grammar based Bayesian word segmentation model to allow it to learn sequences of monosyllabic "function words" at the beginnings and endings of collocations of (possibly multi-syllabic) words. This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case "function words", setting a new state-of-the-art of 92.4% token f-score. Our function word model assumes that function words appear at the left periphery, and while this is true of languages such as English, it is not true universally. We show that a learner can use Bayesian model selection to determine the location of function words in their language, even though the input to the model only consists of unsegmented sequences of phones. Thus our computational models support the hypothesis that function words play a special role in word learning.

1 Introduction

Over the past two decades psychologists have investigated the role that function words might play in human language acquisition. Their experiments suggest that function words play a special role in the acquisition process: children learn function words before they learn the vast bulk of the associated content words, and they use function words to help identify content words.

The goal of this paper is to determine whether computational models of human language acquisition can provide support for the hypothesis that function words are treated specially in human language acquisition. We do this by comparing two computational models of word segmentation which differ solely in the way that they model function words. Following Elman et al. (1996) and Brent (1999) our word segmentation models identify word boundaries from unsegmented sequences of phones corresponding to utterances, effectively performing unsupervised learning of a lexicon. For example, given input consisting of unsegmented utterances such as the following:

juwɑnttusiðəbʊk

a word segmentation model should segment this as ju wɑnt tu si ðə bʊk, which is the IPA representation of "you want to see the book".

We show that a model equipped with the ability to learn some rudimentary properties of the target language's function words is able to learn the vocabulary of that language more accurately than a model that is identical except that it is incapable of learning these generalisations about function words. This suggests that there are acquisition advantages to treating function words specially that human learners could take advantage of (at least to the extent that they are learning similar generalisations as our models), and thus supports the hypothesis that function words are treated specially in human lexical acquisition. As a reviewer points out, we present no evidence that children use function words in the way that our model does, and we want to emphasise that we make no such claim.

While absolute accuracy is not directly relevant to the main point of the paper, we note that the models that learn generalisations about function words perform unsupervised word segmentation at 92.4% token f-score on the standard Bernstein-Ratner (1987) corpus, which improves the previous state-of-the-art by more than 4%.

As a reviewer points out, the changes we make to our models to incorporate function words can be viewed as "building in" substantive information about possible human languages. The model that achieves the best token f-score expects function words to appear at the left edge of phrases. While this is true for languages such as English, it is not true universally. By comparing the posterior probability of two models — one in which function words appear at the left edges of phrases, and another in which function words appear at the right edges of phrases — we show that a learner could use Bayesian posterior probabilities to determine that function words appear at the left edges of phrases in English, even though they are not told the locations of word boundaries or which words are function words.

This paper is structured as follows. Section 2 describes the specific word segmentation models studied in this paper, and the way we extended them to capture certain properties of function words. The word segmentation experiments are presented in section 3, and section 4 discusses how a learner could determine whether function words occur on the left-periphery or the right-periphery in the language they are learning. Section 5 concludes and describes possible future work. The rest of this introduction provides background on function words, the Adaptor Grammar models we use to describe lexical acquisition and the Bayesian inference procedures we use to infer these models.

1.1 Psychological evidence for the role of function words in word learning

Traditional descriptive linguistics distinguishes function words, such as determiners and prepositions, from content words, such as nouns and verbs, corresponding roughly to the distinction between functional categories and lexical categories of modern generative linguistics (Fromkin, 2001). Function words differ from content words in at least the following ways:

1. there are usually far fewer function word types than content word types in a language
2. function word types typically have much higher token frequency than content word types
3. function words are typically morphologically and phonologically simple (e.g., they are typically monosyllabic)
4. function words typically appear in peripheral positions of phrases (e.g., prepositions typically appear at the beginning of prepositional phrases)
5. each function word class is associated with specific content word classes (e.g., determiners and prepositions are associated with nouns, auxiliary verbs and complementisers are associated with main verbs)
6. semantically, content words denote sets of objects or events, while function words denote more complex relationships over the entities denoted by content words
7. historically, the rate of innovation of function words is much lower than the rate of innovation of content words (i.e., function words are typically "closed class", while content words are "open class")

Properties 1–4 suggest that function words might play a special role in language acquisition because they are especially easy to identify, while property 5 suggests that they might be useful for identifying lexical categories. The models we study here focus on properties 3 and 4, in that they are capable of learning specific sequences of monosyllabic words in peripheral (i.e., initial or final) positions of phrase-like units.

A number of psychological experiments have shown that infants are sensitive to the function words of their language within their first year of life (Shi et al., 2006; Hallé et al., 2008; Shafer et al., 1998), often before they have experienced the "word learning spurt". Crucially for our purpose, infants of this age were shown to exploit frequent function words to segment neighboring content words (Shi and Lepage, 2008; Hallé et al., 2008). In addition, 14 to 18-month-old children were shown to exploit function words to constrain lexical access to known words; for instance, they expect a noun after a determiner (Cauvet et al., 2014; Kedar et al., 2006; Zangl and Fernald, 2007). It is also plausible that function words play a crucial role in children's acquisition of more complex syntactic phenomena (Christophe et al., 2008; Demuth and McCullough, 2009), so it is interesting to investigate the roles they might play in computational models of language acquisition.

1.2 Adaptor grammars

Adaptor grammars are a framework for Bayesian inference of a certain class of hierarchical non-parametric models (Johnson et al., 2007b). They define distributions over the trees specified by a context-free grammar, but unlike probabilistic context-free grammars, they "learn" distributions over the possible subtrees of a user-specified set of "adapted" nonterminals. (Adaptor grammars are non-parametric, i.e., not characterisable by a finite set of parameters, if the set of possible subtrees of the adapted nonterminals is infinite.)

Adaptor grammars are useful when the goal is to learn a potentially unbounded set of entities that need to satisfy hierarchical constraints. As section 2 explains in more detail, word segmentation is such a case: words are composed of syllables and belong to phrases or collocations, and modelling this structure improves word segmentation accuracy.

Adaptor Grammars are formally defined in Johnson et al. (2007b), which should be consulted for technical details. Adaptor Grammars (AGs) are an extension of Probabilistic Context-Free Grammars (PCFGs), which we describe first. A Context-Free Grammar (CFG) G = (N, W, R, S) consists of disjoint finite sets of nonterminal symbols N and terminal symbols W, a finite set of rules R of the form A → α where A ∈ N and α ∈ (N ∪ W)⋆, and a start symbol S ∈ N. (We assume there are no "ε-rules" in R, i.e., we require that |α| ≥ 1 for each A → α ∈ R.)

A Probabilistic Context-Free Grammar (PCFG) is a quintuple (N, W, R, S, θ) where N, W, R and S are the nonterminals, terminals, rules and start symbol of a CFG respectively, and θ is a vector of non-negative reals indexed by R that satisfies $\sum_{A \to \alpha \in R_A} \theta_{A \to \alpha} = 1$ for each A ∈ N, where R_A = {A → α : A → α ∈ R} is the set of rules expanding A.

Informally, θ_{A→α} is the probability of a node labelled A expanding to a sequence of nodes labelled α, and the probability of a tree is the product of the probabilities of the rules used to construct each non-leaf node in it. More precisely, for each X ∈ N ∪ W a PCFG associates distributions G_X over the set T_X of trees generated by X as follows:

If X ∈ W (i.e., if X is a terminal) then G_X is the distribution that puts probability 1 on the single-node tree labelled X.

If X ∈ N (i.e., if X is a nonterminal) then:

$$G_X \;=\; \sum_{X \to B_1 \ldots B_n \,\in\, R_X} \theta_{X \to B_1 \ldots B_n}\, \mathrm{TD}_X(G_{B_1}, \ldots, G_{B_n}) \qquad (1)$$

where R_X is the subset of rules in R expanding nonterminal X ∈ N, and:

$$\mathrm{TD}_X(G_1, \ldots, G_n)\bigl(X(t_1, \ldots, t_n)\bigr) \;=\; \prod_{i=1}^{n} G_i(t_i)$$

That is, TD_X(G_1, …, G_n) is a distribution over the set T_X of trees generated by nonterminal X, where each subtree t_i of a tree X(t_1, …, t_n) rooted in X is generated independently from G_i. The PCFG generates the distribution G_S over the set T_S of trees generated by the start symbol S; the distribution over the strings it generates is obtained by marginalising over the trees.

In a Bayesian PCFG one puts Dirichlet priors Dir(α) on the rule probability vector θ, such that there is one Dirichlet parameter α_{A→α} for each rule A → α ∈ R. There are Markov Chain Monte Carlo (MCMC) and Variational Bayes procedures for estimating the posterior distribution over rule probabilities θ and parse trees given data consisting of terminal strings alone (Kurihara and Sato, 2006; Johnson et al., 2007a).

PCFGs can be viewed as recursive mixture models over trees. While PCFGs are expressive enough to describe a range of linguistically-interesting phenomena, PCFGs are parametric models, which limits their ability to describe phenomena where the set of basic units, as well as their properties, are the target of learning. Lexical acquisition is an example of a phenomenon that is naturally viewed as non-parametric inference, where the number of lexical entries (i.e., words) as well as their properties must be learnt from the data.

It turns out there is a straight-forward modification to the PCFG distribution (1) that makes it suitably non-parametric. As Johnson et al. (2007b) explain, by inserting a Dirichlet Process (DP) or Pitman-Yor Process (PYP) into the generative mechanism (1) the model "concentrates" mass on a subset of trees (Teh et al., 2006). Specifically, an Adaptor Grammar identifies a subset A ⊆ N of adapted nonterminals. In an Adaptor Grammar the unadapted nonterminals N \ A expand via (1), just as in a PCFG, but the distributions of the adapted nonterminals A are "concentrated" by passing them through a DP or PYP:

$$H_X \;=\; \sum_{X \to B_1 \ldots B_n \,\in\, R_X} \theta_{X \to B_1 \ldots B_n}\, \mathrm{TD}_X(G_{B_1}, \ldots, G_{B_n})$$

$$G_X \;=\; \mathrm{PYP}(H_X, a_X, b_X)$$

Here a_X and b_X are parameters of the PYP associated with the adapted nonterminal X. As Goldwater et al. (2011) explain, such Pitman-Yor Processes naturally generate power-law distributed data.

Informally, Adaptor Grammars can be viewed as caching entire subtrees of the adapted nonterminals. Roughly speaking, the probability of generating a particular subtree of an adapted nonterminal is proportional to the number of times that subtree has been generated before. This "rich get richer" behaviour causes the distribution of subtrees to follow a power-law (the power is specified by the a_X parameter of the PYP). The PCFG rules expanding an adapted nonterminal X define the "base distribution" of the associated DP or PYP, and the a_X and b_X parameters determine how much mass is reserved for "new" trees.
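To make this generative story concrete, the following is a minimal Python sketch of our own (not the authors' software): it samples trees from a toy grammar in which unadapted nonterminals expand by the PCFG mechanism of equation (1), while each adapted nonterminal is passed through a Pitman-Yor style cache, so that previously generated subtrees are reused with probability roughly proportional to how often they have been generated before. The toy rules, the discount and concentration values, and the tuple encoding of trees are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict

# Toy grammar: a nonterminal maps to a list of (probability, right-hand side);
# anything not listed here is treated as a terminal symbol.
RULES = {
    "Sentence": [(0.5, ["Word"]), (0.5, ["Word", "Sentence"])],
    "Word":     [(0.5, ["Phone"]), (0.5, ["Phone", "Word"])],
    "Phone":    [(0.25, ["b"]), (0.25, ["u"]), (0.25, ["k"]), (0.25, ["i"])],
}
ADAPTED = {"Word"}                 # nonterminals whose subtrees are cached
a, b = 0.5, 1.0                    # assumed PYP discount and concentration
cache = defaultdict(lambda: defaultdict(int))   # nonterminal -> subtree -> count

def sample_base(nonterminal):
    """Expand via the PCFG rules: the base distribution H_X of the adaptor."""
    rhs = random.choices([r for _, r in RULES[nonterminal]],
                         weights=[p for p, _ in RULES[nonterminal]])[0]
    return (nonterminal,) + tuple(sample_tree(sym) for sym in rhs)

def sample_tree(symbol):
    if symbol not in RULES:        # terminal symbol: a single-node tree
        return symbol
    if symbol not in ADAPTED:      # unadapted nonterminal: plain PCFG expansion
        return sample_base(symbol)
    # Pitman-Yor "Chinese restaurant": reuse a cached subtree with probability
    # proportional to (count - a), or draw a new one from the base distribution
    # with probability proportional to (a * num_cached + b).
    counts = cache[symbol]
    total, k = sum(counts.values()), len(counts)
    r = random.uniform(0, total + b) if counts else 1.0
    if counts and r < total - a * k:
        for tree, c in counts.items():
            r -= c - a
            if r <= 0:
                counts[tree] += 1
                return tree
    tree = sample_base(symbol)
    counts[tree] += 1
    return tree

random.seed(0)
for _ in range(5):
    print(sample_tree("Sentence"))
```

Running the sketch a few times shows the caching effect: once a particular Word subtree has been generated, later Sentences tend to reuse it, which is exactly the memoisation of word types that the segmentation models below rely on.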

There are several different procedures for inferring the parse trees and the rule probabilities given a corpus of strings: Johnson et al. (2007b) describe a MCMC sampler and Cohen et al. (2010) describe a Variational Bayes procedure. We use the MCMC procedure here since this has been successfully applied to word segmentation problems in previous work (Johnson, 2008).

2 Word segmentation with Adaptor Grammars

Perhaps the simplest word segmentation model is the unigram model, where utterances are modeled as sequences of words, and where each word is a sequence of segments (Brent, 1999; Goldwater et al., 2009). A unigram model can be expressed as an Adaptor Grammar with one adapted nonterminal Word (we indicate adapted nonterminals by underlining them in grammars here; regular expressions are expanded into right-branching productions).

Sentence → Word+ (2)
Word → Phone+ (3)

The first rule (2) says that a sentence consists of one or more Words, while the second rule (3) states that a Word consists of a sequence of one or more Phones; we assume that there are rules expanding Phone into all possible phones. Because Word is an adapted nonterminal, the adaptor grammar memoises Word subtrees, which corresponds to learning the phone sequences for the words of the language.

The more sophisticated Adaptor Grammars discussed below can be understood as specialising either the first or the second of the rules in (2–3). The next two subsections review the Adaptor Grammar word segmentation models presented in Johnson (2008) and Johnson and Goldwater (2009): section 2.1 reviews how phonotactic syllable-structure constraints can be expressed with Adaptor Grammars, while section 2.2 reviews how phrase-like units called "collocations" capture inter-word dependencies. Section 2.3 presents the major novel contribution of this paper by explaining how we modify these adaptor grammars to capture some of the special properties of function words.

2.1 Syllable structure and phonotactics

The rule (3) models words as sequences of independently generated phones: this is what Goldwater et al. (2009) called the "monkey model" of word generation (it instantiates the metaphor that word types are generated by a monkey randomly banging on the keys of a typewriter). However, the words of a language are typically composed of one or more syllables, and explicitly modelling the internal structure of words typically improves word segmentation considerably.

Johnson (2008) suggested replacing (3) with the following model of word structure:

Word → Syllable^{1:4} (4)
Syllable → (Onset) Rhyme (5)
Onset → Consonant+ (6)
Rhyme → Nucleus (Coda) (7)
Nucleus → Vowel+ (8)
Coda → Consonant+ (9)

Here and below superscripts indicate iteration (e.g., a Word consists of 1 to 4 Syllables, while an Onset consists of an unbounded number of Consonants), while parentheses indicate optionality (e.g., a Rhyme consists of an obligatory Nucleus followed by an optional Coda). We assume that there are rules expanding Consonant and Vowel to the set of all consonants and vowels respectively (this amounts to assuming that the learner can distinguish consonants from vowels). Because Onset, Nucleus and Coda are adapted, this model learns the possible syllable onsets, nuclei and codas of the language, even though neither syllable structure nor word boundaries are explicitly indicated in the input to the model.

The model just described assumes that word-internal syllables have the same structure as word-peripheral syllables, but in languages such as English word-peripheral onsets and codas can be more complex than the corresponding word-internal onsets and codas. For example, the word "string" begins with the onset cluster str, which is relatively rare word-internally. Johnson (2008) showed that word segmentation accuracy improves if the model can learn different consonant sequences for word-initial onsets and word-final codas.
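Since the grammar fragments in this and the following subsections all use the superscript and parenthesis shorthands just described, the small Python sketch below (our own illustration, with bounds written inline as "Syllable1:4" and helper nonterminal names of our own choosing, not the paper's) shows how those abbreviations can be unfolded mechanically into ordinary right-branching CFG rules.

```python
from itertools import product

def expand(lhs, rhs):
    """Yield plain (lhs, rhs) productions for one abbreviated rule."""
    alternatives = []          # for each symbol, the list of ways it can be realised
    extra = []                 # helper rules introduced for "+" symbols
    for sym in rhs:
        if sym.startswith("(") and sym.endswith(")"):      # optional symbol
            alternatives.append([[], [sym[1:-1]]])
        elif sym.endswith("+"):                            # one or more, right-branching
            base, helper = sym[:-1], sym[:-1] + "s"
            extra += [(helper, [base]), (helper, [base, helper])]
            alternatives.append([[helper]])
        elif ":" in sym:                                   # bounded iteration, e.g. Syllable1:4
            base = sym.rstrip("0123456789:")
            lo, hi = map(int, sym[len(base):].split(":"))
            alternatives.append([[base] * n for n in range(lo, hi + 1)])
        else:
            alternatives.append([[sym]])
    for combo in product(*alternatives):
        yield lhs, [s for part in combo for s in part]
    yield from extra

for rule in [("Word", ["Syllable1:4"]),
             ("Syllable", ["(Onset)", "Rhyme"]),
             ("Onset", ["Consonant+"])]:
    for lhs, rhs in expand(*rule):
        print(lhs, "->", " ".join(rhs))
```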

It is easy to express this as an Adaptor Grammar: (4) is replaced with (10–11) and (12–17) are added to the grammar.

Word → SyllableIF (10)
Word → SyllableI Syllable^{0:2} SyllableF (11)
SyllableIF → (OnsetI) RhymeF (12)
SyllableI → (OnsetI) Rhyme (13)
SyllableF → (Onset) RhymeF (14)
OnsetI → Consonant+ (15)
RhymeF → Nucleus (CodaF) (16)
CodaF → Consonant+ (17)

In this grammar the suffix "I" indicates a word-initial element, and "F" indicates a word-final element. Note that the model simply has the ability to learn that different clusters can occur word-peripherally and word-internally; it is not given any information about the relative complexity of these clusters.

2.2 Collocation models of inter-word dependencies

Goldwater et al. (2009) point out the detrimental effect that inter-word dependencies can have on word segmentation models that assume that the words of an utterance are independently generated. Informally, a model that generates words independently is likely to incorrectly segment multi-word expressions such as "the doggie" as single words because the model has no way to capture word-to-word dependencies, e.g., that "doggie" is typically preceded by "the". Goldwater et al. show that word segmentation accuracy improves when the model is extended to capture bigram dependencies.

Adaptor grammar models cannot express bigram dependencies, but they can capture similar inter-word dependencies using phrase-like units that Johnson (2008) calls collocations. Johnson and Goldwater (2009) showed that word segmentation accuracy improves further if the model learns a nested hierarchy of collocations. This can be achieved by replacing (2) with (18–21).

Sentence → Colloc3+ (18)
Colloc3 → Colloc2+ (19)
Colloc2 → Colloc1+ (20)
Colloc1 → Word+ (21)

Informally, Colloc1, Colloc2 and Colloc3 define a nested hierarchy of phrase-like units. While not designed to correspond to syntactic phrases, by examining the sample parses induced by the Adaptor Grammar we noticed that the collocations often correspond to noun phrases, prepositional phrases or verb phrases. This motivates the extension to the Adaptor Grammar discussed below.

2.3 Incorporating "function words" into collocation models

The starting point and baseline for our extension is the adaptor grammar with syllable structure phonotactic constraints and three levels of collocational structure (5–21), as prior work has found that this yields the highest word segmentation token f-score (Johnson and Goldwater, 2009).

Our extension assumes that the Colloc1–Colloc3 constituents are in fact phrase-like, so we extend the rules (19–21) to permit an optional sequence of monosyllabic words at the left edge of each of these constituents. Our model thus captures two of the properties of function words discussed in section 1.1: they are monosyllabic (and thus phonologically simple), and they appear on the periphery of phrases. (We put "function words" in scare quotes below because our model only approximately captures the linguistic properties of function words.)

Specifically, we replace rules (19–21) with the following sequence of rules:

Colloc3 → (FuncWords3) Colloc2+ (22)
Colloc2 → (FuncWords2) Colloc1+ (23)
Colloc1 → (FuncWords1) Word+ (24)
FuncWords3 → FuncWord3+ (25)
FuncWord3 → SyllableIF (26)
FuncWords2 → FuncWord2+ (27)
FuncWord2 → SyllableIF (28)
FuncWords1 → FuncWord1+ (29)
FuncWord1 → SyllableIF (30)

This model memoises (i.e., learns) both the individual "function words" and the sequences of "function words" that modify the Colloc1–Colloc3 constituents. Note also that "function words" expand directly to SyllableIF, which in turn expands to a monosyllable with a word-initial onset and word-final coda. This means that "function words" are memoised independently of the "content words" that Word expands to; i.e., the model learns distinct "function word" and "content word" vocabularies. Figure 1 depicts a sample parse generated by this grammar.

Figure 1: A sample parse generated by the "function word" Adaptor Grammar with rules (10–18) and (22–30). To simplify the parse we only show the root node and the adapted nonterminals, and replace word-internal structure by the word's orthographic form; in bracket notation the parse is (Sentence (Colloc3 (FuncWords3 (FuncWord3 you) (FuncWord3 want) (FuncWord3 to)) (Colloc2 (Colloc1 (Word see)) (Colloc1 (FuncWords1 (FuncWord1 the)) (Word book))))).
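Sections 3 and 4 below compare this left-peripheral grammar with a mirror-image variant in which the optional "function word" sequences attach at the right edge of each collocation instead. As a concrete illustration (our own sketch, reusing the shorthand notation above rather than the authors' grammar files), the rules that differ between the two models can be generated as follows.

```python
def peripheral_rules(side):
    """The collocation rules of (22)-(30), with the optional "function word"
    sequence attached at the left or the right edge of each collocation."""
    rules = []
    for level, child in [(3, "Colloc2+"), (2, "Colloc1+"), (1, "Word+")]:
        funcs = "(FuncWords%d)" % level
        body = [funcs, child] if side == "left" else [child, funcs]
        rules.append(("Colloc%d" % level, body))
        # "function words" are single syllables with word-initial onsets and
        # word-final codas, memoised separately from the Word category
        rules.append(("FuncWords%d" % level, ["FuncWord%d+" % level]))
        rules.append(("FuncWord%d" % level, ["SyllableIF"]))
    return rules

for side in ("left", "right"):
    print("--- %s-peripheral model ---" % side)
    for lhs, rhs in peripheral_rules(side):
        print(lhs, "->", " ".join(rhs))
```

Only the position of the optional (FuncWordsN) constituent in (22)–(24) changes between the two models; the rules (25)–(30) are shared.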

This grammar builds in the fact that function words appear on the left periphery of phrases. This is true of languages such as English, but is not true cross-linguistically. For comparison purposes we also include results for a mirror-image model that permits "function words" on the right periphery, a model which permits "function words" on both the left and right periphery (achieved by changing rules 22–24), as well as a model that analyses all words as monosyllabic.

Section 4 explains how a learner could use Bayesian model selection to determine that function words appear on the left periphery in English by comparing the posterior probability of the data under our "function word" Adaptor Grammar to that obtained using a grammar which is identical except that rules (22–24) are replaced with mirror-image rules in which "function words" are attached to the right periphery.

3 Word segmentation results

This section presents results of running our Adaptor Grammar models on subsets of the Bernstein-Ratner (1987) corpus of child-directed English. We use the Adaptor Grammar software available from http://web.science.mq.edu.au/~mjohnson/ with the same settings as described in Johnson and Goldwater (2009), i.e., we perform Bayesian inference with "vague" priors for all hyperparameters (so there are no adjustable parameters in our models), and perform 8 different MCMC runs of each condition with table-label resampling for 2,000 sweeps of the training data. At every 10th sweep of the last 1,000 sweeps we use the model to segment the entire corpus (even if it is only trained on a subset of it), so we collect 800 sample segmentations of each utterance. The most frequent segmentation in these 800 sample segmentations is the one we score in the evaluations below.

3.1 Word segmentation with "function word" models

Here we evaluate the word segmentations found by the "function word" Adaptor Grammar model described in section 2.3 and compare it to the baseline grammar with collocations and phonotactics from Johnson and Goldwater (2009). Figure 2 presents the standard token and lexicon (i.e., type) f-score evaluations for word segmentations proposed by these models (Brent, 1999), and Table 1 summarises the token f-scores and boundary precision and recall for the major models discussed in this paper. It is interesting to note that adding "function words" improves token f-score by more than 4%, corresponding to a 40% reduction in overall error rate.

Model                 Token f-score   Boundary precision   Boundary recall
Baseline                  0.872             0.918               0.956
+ left FWs                0.924             0.935               0.990
+ left + right FWs        0.912             0.957               0.953

Table 1: Mean token f-scores and boundary precision and recall results averaged over 8 trials, each consisting of 8 MCMC runs of models trained and tested on the full Bernstein-Ratner (1987) corpus (the standard deviations of all values are less than 0.006; Wilcoxon sign tests show the means of all token f-scores differ, p < 2e-4).

When the training data is very small the Monosyllabic grammar produces the highest accuracy results, presumably because a large proportion of the words in child-directed speech are monosyllabic. However, at around 25 sentences the more complex models that are capable of finding multi-syllabic words start to become more accurate.

It's interesting that after about 1,000 sentences the model that allows "function words" only on the right periphery is considerably less accurate than the baseline model. Presumably this is because it tends to misanalyse multi-syllabic words on the right periphery as sequences of monosyllabic words.

The model that allows "function words" only on the left periphery is more accurate than the model that allows them on both the left and right periphery when the input data ranges from about 100 to about 1,000 sentences, but when the training data is larger than about 1,000 sentences both models are equally accurate.
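The token and boundary scores reported above can be illustrated with a small sketch of our own (toy data, and our own reading of the standard span- and boundary-based scoring, given only for illustration): each proposed word is scored as a (start, end) span over the unsegmented phone string, and each word-internal split point as a boundary.

```python
def spans(words):
    """Each word as a (start, end) span over the unsegmented phone string."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def boundaries(words):
    """Word-internal split positions; utterance edges are not scored."""
    out, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        out.add(pos)
    return out

def prf(predicted, gold, extract):
    tp = fp = fn = 0
    for p, g in zip(predicted, gold):
        P, G = extract(p), extract(g)
        tp += len(P & G); fp += len(P - G); fn += len(G - P)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = [["ju", "wɑnt", "tu", "si", "ðə", "bʊk"]]   # "you want to see the book"
pred = [["ju", "wɑnt", "tusi", "ðə", "bʊk"]]       # one segmentation error
print("token    P/R/F:", prf(pred, gold, spans))
print("boundary P/R/F:", prf(pred, gold, boundaries))
```

In this representation, picking the segmentation to score for each utterance (the most frequent of its 800 samples) is just taking the mode of a list of word tuples, e.g. collections.Counter(map(tuple, samples)).most_common(1).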




Figure 2: Token and lexicon (i.e., type) f-score on the Bernstein-Ratner (1987) corpus as a function of training data size for the baseline model, the model where "function words" can appear on the left periphery, a model where "function words" can appear on the right periphery, and a model where "function words" can appear on both the left and the right periphery. For comparison purposes we also include results for a model that assumes that all words are monosyllabic.

3.2 Content and function words found by "function word" model

As noted earlier, the "function word" model generates function words via adapted nonterminals other than the Word category. In order to better understand just how the model works, we give the 5 most frequent words in each word category found during 8 MCMC runs of the left-peripheral "function word" grammar above:

Word: book, doggy, house, want, I
FuncWord1: a, the, your, little[1], in
FuncWord2: to, in, you, what, put
FuncWord3: you, a, what, no, can

Interestingly, these categories seem fairly reasonable. The Word category includes open-class nouns and verbs, the FuncWord1 category includes noun modifiers such as determiners, while the FuncWord2 and FuncWord3 categories include prepositions and auxiliary verbs.

[1] The phone 'l' is generated by both Consonant and Vowel, so "little" can be (incorrectly) analysed as one syllable.

Thus, the present model, initially aimed at segmenting words from continuous speech, shows three interesting characteristics that are also exhibited by human infants: it distinguishes between function words and content words (Shi and Werker, 2001), it allows learners to acquire at least some of the function words of their language (e.g., Shi et al., 2006); and furthermore, it may also allow them to start grouping together function words according to their category (Cauvet et al., 2014; Shi and Melançon, 2010).

4 Are "function words" on the left or right periphery?

We have shown that a model that expects function words on the left periphery performs more accurate word segmentation on English, where function words do indeed typically occur on the left periphery, leaving open the question: how could a learner determine whether function words generally appear on the left or the right periphery of phrases in the language they are learning? This question is important because knowing the side where function words preferentially occur is related to the question of the direction of syntactic headedness in the language, and an accurate method for identifying the location of function words might be useful for initialising a syntactic learner. Experimental evidence suggests that infants as young as 8 months of age already expect function words on the correct side for their language — left-periphery for Italian infants and right-periphery for Japanese infants (Gervain et al., 2008) — so it is interesting to see whether purely distributional learners such as the ones studied here can identify the correct location of function words in phrases.

We experimented with a variety of approaches that use a single adaptor grammar inference process, but none of these were successful. For example, we hoped that given an Adaptor Grammar that permits "function words" on both the left and right periphery, the inference procedure would decide that the right-periphery rules simply are not used in a language like English. Unfortunately we did not find this in our experiments; the right-periphery rules were used almost as often as the left-periphery rules (recall that a large fraction of the words in English child-directed speech are monosyllabic).

In this section, we show that learners could instead use Bayesian model selection to determine that function words appear on the left periphery in English by comparing the marginal probability of the data for the left-periphery and the right-periphery models. Specifically, we used Bayesian model selection techniques to determine whether a left-peripheral or a right-peripheral model better fits the unsegmented utterances that constitute the training data.[2] While Bayesian model selection is in principle straight-forward, it turns out to require the ratio of two integrals (for the "evidence" or marginal likelihood) that are often intractable to compute. Given a training corpus D of unsegmented sentences and model families G1 and G2 (here the "function word" adaptor grammars with left-peripheral and right-peripheral attachment respectively), the Bayes factor K is the ratio of the marginal likelihoods of the data:

$$K \;=\; \frac{\mathrm{P}(D \mid G_1)}{\mathrm{P}(D \mid G_2)}$$

where the marginal likelihood or "evidence" for a model G is obtained by integrating over all of the hidden or latent structure and parameters θ:

$$\mathrm{P}(D \mid G) \;=\; \int_{\Delta} \mathrm{P}(D, \theta \mid G)\, d\theta \qquad (31)$$

Here the variable θ ranges over the space Δ of all possible parses for the utterances in D and all possible configurations of the Pitman-Yor processes and their parameters that constitute the "state" of the Adaptor Grammar G. While the probability of any specific Adaptor Grammar configuration θ is not too hard to calculate (the MCMC sampler for Adaptor Grammars can print this after each sweep through D), the integral in (31) is in general intractable.

Textbooks such as Murphy (2012) describe a number of methods for calculating P(D | G), but most of them assume that the parameter space Δ is continuous and so cannot be directly applied here. The Harmonic Mean estimator (32), which we used here, is a popular estimator for (31) because it only requires the ability to calculate P(D, θ | G) for samples from P(θ | D, G):

$$\mathrm{P}(D \mid G) \;\approx\; \left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\mathrm{P}(D, \theta_i \mid G)} \right)^{-1} \qquad (32)$$

where θ_1, …, θ_n are n samples from P(θ | D, G), which can be generated by the MCMC procedure.

[2] Note that neither the left-peripheral nor the right-peripheral model is correct: even strongly left-headed languages like English typically contain a few right-headed constructions. For example, "ago" is arguably the head of the phrase "ten years ago".
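Because the joint probabilities P(D, θ_i | G) of a whole corpus are astronomically small, estimator (32) has to be evaluated in log space. The short sketch below (our own, with made-up log joint probabilities; the real values would come from the sampler's output) computes the log harmonic-mean estimate for each model and the resulting log Bayes factor.

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_harmonic_mean(log_joints):
    """log P(D | G) estimated as log n - logsumexp(-log P(D, theta_i | G))."""
    return math.log(len(log_joints)) - log_sum_exp([-x for x in log_joints])

# hypothetical log P(D, theta_i | G) values collected from the MCMC runs
left_runs  = [-101250.0, -101310.5, -101298.2, -101275.9]   # left-peripheral grammar
right_runs = [-107800.3, -107912.7, -107850.1, -107841.6]   # right-peripheral grammar

log_K = log_harmonic_mean(left_runs) - log_harmonic_mean(right_runs)
print("estimated log Bayes factor for left-peripheral attachment: %.1f" % log_K)
```

Note that the estimate is dominated by the single smallest sampled log joint probability, which is one intuitive way of seeing why this estimator is so unstable, as discussed below.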

Figure 3 depicts how the Bayes factor in favour of left-peripheral attachment of "function words" varies as a function of the number of utterances in the training data D (calculated from the last 1,000 sweeps of 8 MCMC runs of the corresponding adaptor grammars). As that figure shows, once the training data contains more than about 1,000 sentences the evidence for the left-peripheral grammar becomes very strong. On the full training data the estimated log Bayes factor is over 6,000, which would constitute overwhelming evidence in favour of left-peripheral attachment.

Figure 3: Bayes factor in favour of left-peripheral "function word" attachment as a function of the number of sentences in the training corpus, calculated using the Harmonic Mean estimator (see warning in text).

Unfortunately, as Murphy and others warn, the Harmonic Mean estimator is extremely unstable (Radford Neal calls it "the worst MCMC method ever" in his blog), so we think it is important to confirm these results using a more stable estimator. However, given the magnitude of the differences and the fact that the two models being compared are of similar complexity, we believe that these results suggest that Bayesian model selection can be used to determine properties of the language being learned.

5 Conclusions and future work

This paper showed that the word segmentation accuracy of a state-of-the-art Adaptor Grammar model is significantly improved by extending it so that it explicitly models some properties of function words. We also showed how Bayesian model selection can be used to identify that function words appear on the left periphery of phrases in English, even though the input to the model only consists of an unsegmented sequence of phones.

Of course this work only scratches the surface in terms of investigating the role of function words in language acquisition. It would clearly be very interesting to examine the performance of these models on other corpora of child-directed English, as well as on corpora of child-directed speech in other languages. Our evaluation focused on word segmentation, but we could also evaluate the effect that modelling "function words" has on other aspects of the model, such as its ability to learn syllable structure.

The models of "function words" we investigated here only capture two of the 7 linguistic properties of function words identified in section 1 (i.e., that function words tend to be monosyllabic, and that they tend to appear phrase-peripherally), so it would be interesting to develop and explore models that capture other linguistic properties of function words. For example, following the suggestion by Hochmann et al. (2010) that human learners use frequency cues to identify function words, it might be interesting to develop computational models that do the same thing. In an Adaptor Grammar the frequency distribution of function words might be modelled by specifying the prior for the Pitman-Yor Process parameters associated with the function words' adapted nonterminals so that it prefers to generate a small number of high-frequency items.

It should also be possible to develop models which capture the fact that function words tend not to be topic-specific. Johnson et al. (2010) and Johnson et al. (2012) show how Adaptor Grammars can model the association between words and non-linguistic "topics"; perhaps these models could be extended to capture some of the semantic properties of function words.

It would also be interesting to further explore the extent to which Bayesian model selection is a useful approach to linguistic "parameter setting". In order to do this it is imperative to develop better methods than the problematic "Harmonic Mean" estimator used here for calculating the evidence (i.e., the marginal probability of the data) that can handle the combination of discrete and continuous hidden structure that occurs in computational linguistic models.

As well as substantially improving the accuracy of unsupervised word segmentation, this work is interesting because it suggests a connection between unsupervised word segmentation and the induction of syntactic structure. It is reasonable to expect that hierarchical non-parametric Bayesian models such as Adaptor Grammars may be useful tools for exploring such a connection.

Acknowledgments

This work was supported in part by the Australian Research Council's Discovery Projects funding scheme (project numbers DP110102506 and DP110102593), the European Research Council (ERC-2011-AdG-295810 BOOTPHON), the Agence Nationale pour la Recherche (ANR-10-LABX-0087 IEC, and ANR-10-IDEX-0001-02 PSL*), and the Mairie de Paris, the École des Hautes Études en Sciences Sociales, the École Normale Supérieure, and the Fondation Pierre Gilles de Gennes.

References

N. Bernstein-Ratner. 1987. The phonology of parent-child speech. In K. Nelson and A. van Kleeck, editors, Children's Language, volume 6, pages 159–174. Erlbaum, Hillsdale, NJ.

M. Brent. 1999. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34:71–105.

E. Cauvet, R. Limissuri, S. Millotte, K. Skoruppa, D. Cabrol, and A. Christophe. 2014. Function words constrain on-line recognition of verbs and nouns in French 18-month-olds. Language Learning and Development, pages 1–18.

A. Christophe, S. Millotte, S. Bernal, and J. Lidz. 2008. Bootstrapping lexical and syntactic acquisition. Language and Speech, 51(1-2):61–75.

S. B. Cohen, D. M. Blei, and N. A. Smith. 2010. Variational inference for adaptor grammars. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 564–572, Los Angeles, California, June. Association for Computational Linguistics.

K. Demuth and E. McCullough. 2009. The prosodic (re-)organization of children's early English articles. Journal of Child Language, 36(1):173–200.

J. Elman, E. Bates, M. H. Johnson, A. Karmiloff-Smith, D. Parisi, and K. Plunkett. 1996. Rethinking Innateness: A Connectionist Perspective on Development. MIT Press/Bradford Books, Cambridge, MA.

V. Fromkin, editor. 2001. Linguistics: An Introduction to Linguistic Theory. Blackwell, Oxford, UK.

J. Gervain, M. Nespor, R. Mazuka, R. Horie, and J. Mehler. 2008. Bootstrapping word order in prelexical infants: A Japanese-Italian cross-linguistic study. Cognitive Psychology, 57(1):56–74.

S. Goldwater, T. L. Griffiths, and M. Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.

S. Goldwater, T. L. Griffiths, and M. Johnson. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12:2335–2382.

P. A. Hallé, C. Durand, and B. de Boysson-Bardies. 2008. Do 11-month-old French infants process articles? Language and Speech, 51(1-2):23–44.

J.-R. Hochmann, A. D. Endress, and J. Mehler. 2010. Word frequency as a cue for identifying function words in infancy. Cognition, 115(3):444–457.

M. Johnson and S. Goldwater. 2009. Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 317–325, Boulder, Colorado, June. Association for Computational Linguistics.

M. Johnson, T. Griffiths, and S. Goldwater. 2007a. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 139–146, Rochester, New York. Association for Computational Linguistics.

M. Johnson, T. L. Griffiths, and S. Goldwater. 2007b. Adaptor Grammars: A framework for specifying compositional nonparametric Bayesian models. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 641–648. MIT Press, Cambridge, MA.

M. Johnson, K. Demuth, M. Frank, and B. Jones. 2010. Synergies in learning words and their referents. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1018–1026.

M. Johnson, K. Demuth, and M. Frank. 2012. Exploiting social information in grounded language learning via grammatical reduction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 883–891, Jeju Island, Korea, July. Association for Computational Linguistics.

M. Johnson. 2008. Using Adaptor Grammars to identify synergies in the unsupervised acquisition of linguistic structure. In Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics, pages 398–406, Columbus, Ohio. Association for Computational Linguistics.

Y. Kedar, M. Casasola, and B. Lust. 2006. Getting there faster: 18- and 24-month-old infants' use of function words to determine reference. Child Development, 77(2):325–338.

K. Kurihara and T. Sato. 2006. Variational Bayesian grammar induction for natural language. In Y. Sakakibara, S. Kobayashi, K. Sato, T. Nishino, and E. Tomita, editors, Grammatical Inference: Algorithms and Applications, pages 84–96. Springer.

K. P. Murphy. 2012. Machine learning: a probabilistic perspective. The MIT Press.

V. L. Shafer, D. W. Shucard, J. L. Shucard, and L. Gerken. 1998. An electrophysiological study of infants' sensitivity to the sound patterns of English speech. Journal of Speech, Language and Hearing Research, 41(4):874.

R. Shi and M. Lepage. 2008. The effect of functional morphemes on word segmentation in preverbal infants. Developmental Science, 11(3):407–413.

R. Shi and A. Melançon. 2010. Syntactic categorization in French-learning infants. Infancy, 15:517–533.

R. Shi and J. Werker. 2001. Six-month-old infants' preference for lexical words. Psychological Science, 12:71–76.

R. Shi, A. Cutler, J. Werker, and M. Cruickshank. 2006. Frequency and form as determinants of functor sensitivity in English-acquiring infants. The Journal of the Acoustical Society of America, 119(6):EL61–EL67.

Y. W. Teh, M. Jordan, M. Beal, and D. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581.

R. Zangl and A. Fernald. 2007. Increasing flexibility in children's online processing of grammatical and nonce determiners in fluent speech. Language Learning and Development, 3(3):199–231.
