Modelling function words improves unsupervised word segmentation

Mark Johnson 1,2, Anne Christophe 3,4, Katherine Demuth 2,6 and Emmanuel Dupoux 3,5

1 Department of Computing, Macquarie University, Sydney, Australia
2 Santa Fe Institute, Santa Fe, New Mexico, USA
3 École Normale Supérieure, Paris, France
4 Centre National de la Recherche Scientifique, Paris, France
5 École des Hautes Études en Sciences Sociales, Paris, France
6 Department of Linguistics, Macquarie University, Sydney, Australia
Abstract

Inspired by experimental psychological findings suggesting that function words play a special role in word learning, we make a simple modification to an Adaptor Grammar based Bayesian word segmentation model to allow it to learn sequences of monosyllabic “function words” at the beginnings and endings of collocations of (possibly multi-syllabic) words. This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case “function words”, setting a new state-of-the-art of 92.4% token f-score. Our function word model assumes that function words appear at the left periphery, and while this is true of languages such as English, it is not true universally. We show that a learner can use Bayesian model selection to determine the location of function words in their language, even though the input to the model only consists of unsegmented sequences of phones. Thus our computational models support the hypothesis that function words play a special role in word learning.

1 Introduction

Over the past two decades psychologists have investigated the role that function words might play in human language acquisition. Their experiments suggest that function words play a special role in the acquisition process: children learn function words before they learn the vast bulk of the associated content words, and they use function words to help identify content words.

The goal of this paper is to determine whether computational models of human language acquisition can provide support for the hypothesis that function words are treated specially in human language acquisition. We do this by comparing two computational models of word segmentation which differ solely in the way that they model function words. Following Elman et al. (1996) and Brent (1999), our word segmentation models identify word boundaries from unsegmented sequences of phonemes corresponding to utterances, effectively performing unsupervised learning of a lexicon. For example, given input consisting of unsegmented utterances such as the following:

    juwɑnttusiðəbʊk

a word segmentation model should segment this as ju wɑnt tu si ðə bʊk, which is the IPA representation of “you want to see the book”.
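As a purely illustrative sketch of the task (ours, not part of the models described in this paper), a candidate segmentation of such an utterance can be represented by the set of positions at which word boundaries are inserted into the phone string; the function name and boundary positions below are our own example.

    # Hypothetical illustration of the word segmentation task: the input is an
    # unsegmented sequence of phones, and a candidate analysis is a set of
    # positions at which word boundaries are inserted.

    def segment(phones, boundaries):
        """Split a list of phones into words at the given boundary positions."""
        cuts = [0] + sorted(boundaries) + [len(phones)]
        return ["".join(phones[i:j]) for i, j in zip(cuts, cuts[1:])]

    # "youwanttoseethebook" in IPA, as a list of phones.
    phones = ["j", "u", "w", "ɑ", "n", "t", "t", "u", "s", "i", "ð", "ə", "b", "ʊ", "k"]
    print(segment(phones, {2, 6, 8, 10, 12}))   # ['ju', 'wɑnt', 'tu', 'si', 'ðə', 'bʊk']

The models below never see the boundary positions; they must infer them (and the lexicon) from the unsegmented phone strings alone.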
We show that a model equipped with the ability to learn some rudimentary properties of the target language's function words is able to learn the vocabulary of that language more accurately than a model that is identical except that it is incapable of learning these generalisations about function words. This suggests that there are acquisition advantages to treating function words specially that human learners could take advantage of (at least to the extent that they are learning similar generalisations as our models), and thus supports the hypothesis that function words are treated specially in human lexical acquisition. As a reviewer points out, we present no evidence that children use function words in the way that our model does, and we want to emphasise that we make no such claim.

While absolute accuracy is not directly relevant to the main point of the paper, we note that the models that learn generalisations about function words perform unsupervised word segmentation at 92.4% token f-score on the standard Bernstein-Ratner (1987) corpus, which improves the previous state-of-the-art by more than 4%.

As a reviewer points out, the changes we make to our models to incorporate function words can be viewed as “building in” substantive information about possible human languages.
The model that achieves the best token f-score expects function words to appear at the left edge of phrases. While this is true for languages such as English, it is not true universally. By comparing the posterior probability of two models, one in which function words appear at the left edges of phrases and another in which function words appear at the right edges of phrases, we show that a learner could use Bayesian posterior probabilities to determine that function words appear at the left edges of phrases in English, even though the learner is not told the locations of word boundaries or which words are function words.

This paper is structured as follows. Section 2 describes the specific word segmentation models studied in this paper, and the way we extended them to capture certain properties of function words. The word segmentation experiments are presented in section 3, and section 4 discusses how a learner could determine whether function words occur on the left periphery or the right periphery in the language they are learning. Section 5 concludes and describes possible future work. The rest of this introduction provides background on function words, the Adaptor Grammar models we use to describe lexical acquisition, and the Bayesian inference procedures we use to infer these models.

1.1 Psychological evidence for the role of function words in word learning

Traditional descriptive linguistics distinguishes function words, such as determiners and prepositions, from content words, such as nouns and verbs, corresponding roughly to the distinction between functional categories and lexical categories of modern generative linguistics (Fromkin, 2001). Function words differ from content words in at least the following ways:

1. there are usually far fewer function word types than content word types in a language
2. function word types typically have much higher token frequency than content word types
3. function words are typically morphologically and phonologically simple (e.g., they are typically monosyllabic)
4. function words typically appear in peripheral positions of phrases (e.g., prepositions typically appear at the beginning of prepositional phrases)
5. each function word class is associated with specific content word classes (e.g., determiners and prepositions are associated with nouns, auxiliary verbs and complementisers are associated with main verbs)
6. semantically, content words denote sets of objects or events, while function words denote more complex relationships over the entities denoted by content words
7. historically, the rate of innovation of function words is much lower than the rate of innovation of content words (i.e., function words are typically “closed class”, while content words are “open class”)

Properties 1–4 suggest that function words might play a special role in language acquisition because they are especially easy to identify, while property 5 suggests that they might be useful for identifying lexical categories. The models we study here focus on properties 3 and 4, in that they are capable of learning specific sequences of monosyllabic words in peripheral (i.e., initial or final) positions of phrase-like units.

A number of psychological experiments have shown that infants are sensitive to the function words of their language within their first year of life (Shi et al., 2006; Hallé et al., 2008; Shafer et al., 1998), often before they have experienced the “word learning spurt”. Crucially for our purpose, infants of this age were shown to exploit frequent function words to segment neighbouring content words (Shi and Lepage, 2008; Hallé et al., 2008). In addition, 14- to 18-month-old children were shown to exploit function words to constrain lexical access to known words: for instance, they expect a noun after a determiner (Cauvet et al., 2014; Kedar et al., 2006; Zangl and Fernald, 2007). In addition, it is plausible that function words play a crucial role in children's acquisition of more complex syntactic phenomena (Christophe et al., 2008; Demuth and McCullough, 2009), so it is interesting to investigate the roles they might play in computational models of language acquisition.

1.2 Adaptor grammars

Adaptor grammars are a framework for Bayesian inference of a certain class of hierarchical non-parametric models (Johnson et al., 2007b). They define distributions over the trees specified by a context-free grammar, but unlike probabilistic context-free grammars, they “learn” distributions over the possible subtrees of a user-specified set of “adapted” nonterminals. (Adaptor grammars are non-parametric, i.e., not characterisable by a finite set of parameters, if the set of possible subtrees of the adapted nonterminals is infinite.)
Adaptor grammars are useful when the goal is to learn a potentially unbounded set of entities that need to satisfy hierarchical constraints. As section 2 explains in more detail, word segmentation is such a case: words are composed of syllables and belong to phrases or collocations, and modelling this structure improves word segmentation accuracy.

Adaptor Grammars are formally defined in Johnson et al. (2007b), which should be consulted for technical details. Adaptor Grammars (AGs) are an extension of Probabilistic Context-Free Grammars (PCFGs), which we describe first. A Context-Free Grammar (CFG) G = (N, W, R, S) consists of disjoint finite sets of nonterminal symbols N and terminal symbols W, a finite set of rules R of the form A → α where A ∈ N and α ∈ (N ∪ W)⋆, and a start symbol S ∈ N. (We assume there are no “ε-rules” in R, i.e., we require that |α| ≥ 1 for each A → α ∈ R.)

A Probabilistic Context-Free Grammar (PCFG) is a quintuple (N, W, R, S, θ) where N, W, R and S are the nonterminals, terminals, rules and start symbol of a CFG respectively, and θ is a vector of non-negative reals indexed by R that satisfies

    Σ_{A→α ∈ R_A} θ_{A→α} = 1   for each A ∈ N,

where R_A = {A → α : A → α ∈ R} is the set of rules expanding A.

Informally, θ_{A→α} is the probability of a node labelled A expanding to a sequence of nodes labelled α, and the probability of a tree is the product of the probabilities of the rules used to construct each non-leaf node in it. More precisely, for each X ∈ N ∪ W a PCFG associates a distribution G_X over the set T_X of trees generated by X as follows:

If X ∈ W (i.e., if X is a terminal) then G_X is the distribution that puts probability 1 on the single-node tree labelled X.

If X ∈ N (i.e., if X is a nonterminal) then:

    G_X = Σ_{X→B_1…B_n ∈ R_X} θ_{X→B_1…B_n} TD_X(G_{B_1}, …, G_{B_n})    (1)

where R_X is the subset of rules in R expanding nonterminal X ∈ N, and:

    TD_X(G_1, …, G_n)(t) = ∏_{i=1}^{n} G_i(t_i),

where t is the tree whose root node is labelled X and whose subtrees are t_1, …, t_n. That is, TD_X(G_1, …, G_n) is a distribution over the set T_X of trees generated by nonterminal X, where each subtree t_i is generated independently from G_i. The PCFG generates the distribution G_S over the set T_S of trees generated by the start symbol S; the distribution over the strings it generates is obtained by marginalising over the trees.

In a Bayesian PCFG one puts Dirichlet priors Dir(α) on the rule probability vector θ, such that there is one Dirichlet parameter α_{A→α} for each rule A → α ∈ R. There are Markov Chain Monte Carlo (MCMC) and Variational Bayes procedures for estimating the posterior distribution over rule probabilities θ and parse trees given data consisting of terminal strings alone (Kurihara and Sato, 2006; Johnson et al., 2007a).

PCFGs can be viewed as recursive mixture models over trees. While PCFGs are expressive enough to describe a range of linguistically interesting phenomena, PCFGs are parametric models, which limits their ability to describe phenomena where the set of basic units, as well as their properties, are the target of learning. Lexical acquisition is an example of a phenomenon that is naturally viewed as non-parametric inference, where the number of lexical entries (i.e., words) as well as their properties must be learnt from the data.

It turns out there is a straightforward modification to the PCFG distribution (1) that makes it suitably non-parametric. As Johnson et al. (2007b) explain, by inserting a Dirichlet Process (DP) or Pitman-Yor Process (PYP) into the generative mechanism (1), the model “concentrates” mass on a subset of trees (Teh et al., 2006). Specifically, an Adaptor Grammar identifies a subset A ⊆ N of adapted nonterminals. In an Adaptor Grammar the unadapted nonterminals N \ A expand via (1), just as in a PCFG, but the distributions of the adapted nonterminals A are “concentrated” by passing them through a DP or PYP:

    H_X = Σ_{X→B_1…B_n ∈ R_X} θ_{X→B_1…B_n} TD_X(G_{B_1}, …, G_{B_n})
    G_X = PYP(H_X, a_X, b_X)

Here a_X and b_X are parameters of the PYP associated with the adapted nonterminal X. As Goldwater et al. (2011) explain, such Pitman-Yor Processes naturally generate power-law distributed data.

Informally, Adaptor Grammars can be viewed as caching entire subtrees of the adapted nonterminals. Roughly speaking, the probability of generating a particular subtree of an adapted nonterminal is proportional to the number of times that subtree has been generated before. This “rich get richer” behaviour causes the distribution of subtrees to follow a power law (the power is specified by the a_X parameter of the PYP). The PCFG rules expanding an adapted nonterminal X define the “base distribution” of the associated DP or PYP, and the a_X and b_X parameters determine how much mass is reserved for “new” trees.
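To make this caching behaviour concrete, the following forward-sampling sketch is ours alone: the toy grammar, the parameter values and the simplification of treating every cached subtree token as its own “table” are assumptions for illustration, not the authors' implementation. An adapted nonterminal either reuses a previously generated subtree or draws a fresh one from its PCFG base distribution.

    import random
    from collections import defaultdict

    # Toy PCFG (ours): nonterminal -> list of (rhs, probability); lowercase symbols are terminals.
    RULES = {
        "Sentence": [(("Word", "Sentence"), 0.5), (("Word",), 0.5)],
        "Word":     [(("a",), 0.4), (("b", "a"), 0.3), (("b", "a", "b"), 0.3)],
    }
    ADAPTED = {"Word"}        # nonterminals whose subtrees are cached
    PYP_A, PYP_B = 0.5, 1.0   # assumed discount a_X and concentration b_X

    cache = defaultdict(list)  # adapted nonterminal -> previously generated subtrees

    def expand_base(symbol):
        """Expand one nonterminal using the PCFG base distribution H_X."""
        rhs_options, probs = zip(*RULES[symbol])
        rhs = random.choices(rhs_options, weights=probs)[0]
        return (symbol,) + tuple(sample_tree(child) for child in rhs)

    def sample_tree(symbol):
        """Sample a tree rooted in `symbol`, reusing cached subtrees for adapted nonterminals."""
        if symbol not in RULES:        # terminal: a single-node tree
            return symbol
        if symbol not in ADAPTED:      # unadapted nonterminal: ordinary PCFG expansion
            return expand_base(symbol)
        previous = cache[symbol]
        n = len(previous)
        # Simplification: treat every cached token as its own table, so a new subtree is
        # generated with probability (b + a*n)/(b + n), and each previously generated
        # subtree is reused with probability (1 - a)/(b + n): the "rich get richer" effect.
        if random.random() < (PYP_B + PYP_A * n) / (PYP_B + n):
            tree = expand_base(symbol)
        else:
            tree = random.choice(previous)
        cache[symbol].append(tree)
        return tree

    random.seed(1)
    for _ in range(5):
        print(sample_tree("Sentence"))

Because reused subtrees are whole cached trees, frequently generated Words quickly come to dominate the cache, which is the behaviour the segmentation models below exploit.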
There are several different procedures for inferring the parse trees and the rule probabilities given a corpus of strings: Johnson et al. (2007b) describe an MCMC sampler and Cohen et al. (2010) describe a Variational Bayes procedure. We use the MCMC procedure here since this has been successfully applied to word segmentation problems in previous work (Johnson, 2008).

2 Word segmentation with Adaptor Grammars

Perhaps the simplest word segmentation model is the unigram model, where utterances are modelled as sequences of words, and where each word is a sequence of segments (Brent, 1999; Goldwater et al., 2009). A unigram model can be expressed as an Adaptor Grammar with one adapted nonterminal Word (we indicate adapted nonterminals by underlining them in grammars here; regular expressions are expanded into right-branching productions).

    Sentence → Word+    (2)
    Word → Phone+    (3)

The first rule (2) says that a sentence consists of one or more Words, while the second rule (3) states that a Word consists of a sequence of one or more Phones; we assume that there are rules expanding Phone into all possible phones. Because Word is an adapted nonterminal, the adaptor grammar memoises Word subtrees, which corresponds to learning the phone sequences for the words of the language.

The more sophisticated Adaptor Grammars discussed below can be understood as specialising either the first or the second of the rules in (2–3). The next two subsections review the Adaptor Grammar word segmentation models presented in Johnson (2008) and Johnson and Goldwater (2009): section 2.1 reviews how phonotactic syllable-structure constraints can be expressed with Adaptor Grammars, while section 2.2 reviews how phrase-like units called “collocations” capture inter-word dependencies. Section 2.3 presents the major novel contribution of this paper by explaining how we modify these adaptor grammars to capture some of the special properties of function words.

2.1 Syllable structure and phonotactics

The rule (3) models words as sequences of independently generated phones: this is what Goldwater et al. (2009) called the “monkey model” of word generation (it instantiates the metaphor that word types are generated by a monkey randomly banging on the keys of a typewriter). However, the words of a language are typically composed of one or more syllables, and explicitly modelling the internal structure of words typically improves word segmentation considerably.

Johnson (2008) suggested replacing (3) with the following model of word structure:

    Word → Syllable^{1:4}    (4)
    Syllable → (Onset) Rhyme    (5)
    Onset → Consonant+    (6)
    Rhyme → Nucleus (Coda)    (7)
    Nucleus → Vowel+    (8)
    Coda → Consonant+    (9)

Here and below superscripts indicate iteration (e.g., a Word consists of 1 to 4 Syllables, while an Onset consists of an unbounded number of Consonants), while parentheses indicate optionality (e.g., a Rhyme consists of an obligatory Nucleus followed by an optional Coda). We assume that there are rules expanding Consonant and Vowel to the set of all consonants and vowels respectively (this amounts to assuming that the learner can distinguish consonants from vowels). Because Onset, Nucleus and Coda are adapted, this model learns the possible syllable onsets, nuclei and codas of the language, even though neither syllable structure nor word boundaries are explicitly indicated in the input to the model.

The model just described assumes that word-internal syllables have the same structure as word-peripheral syllables, but in languages such as English word-peripheral onsets and codas can be more complex than the corresponding word-internal onsets and codas. For example, the word “string” begins with the onset cluster str, which is relatively rare word-internally. Johnson (2008) showed that word segmentation accuracy improves if the model can learn different consonant sequences for word-initial onsets and word-final codas.
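The iteration and optionality abbreviations used in rules (4)–(9), and in the extended grammars below, are just shorthand for ordinary CFG productions. The helper functions below are our own illustration of how such abbreviations can be spelled out, with X+ handled by right-branching productions as stated above; the function names and rule encoding are hypothetical, not part of the authors' released software.

    # Our own sketch of expanding the grammar abbreviations into plain CFG rules.
    # A rule is encoded as (lhs, [rhs symbols]).

    def expand_plus(nt):
        """X+ : one or more X, via right-branching productions Xs -> X | X Xs."""
        seq = nt + "s"
        return [(seq, [nt]), (seq, [nt, seq])]

    def expand_range(lhs, nt, lo, hi):
        """X^{lo:hi} : between lo and hi copies of X, one production per length."""
        return [(lhs, [nt] * n) for n in range(lo, hi + 1)]

    def expand_optional(lhs, rhs):
        """Each '(X)' item is optional: emit productions with and without it."""
        alternatives = [[]]
        for sym in rhs:
            if sym.startswith("(") and sym.endswith(")"):
                alternatives = [a + [sym[1:-1]] for a in alternatives] + [list(a) for a in alternatives]
            else:
                alternatives = [a + [sym] for a in alternatives]
        return [(lhs, a) for a in alternatives]

    # Rule (4): Word -> Syllable^{1:4}
    print(expand_range("Word", "Syllable", 1, 4))
    # Rule (5): Syllable -> (Onset) Rhyme
    print(expand_optional("Syllable", ["(Onset)", "Rhyme"]))
    # Rule (6): Onset -> Consonant+  (Onset rewrites to the right-branching Consonants)
    print([("Onset", ["Consonants"])] + expand_plus("Consonant"))

Nothing here is specific to the segmentation models; it only spells out the shorthand so that the grammars below can be read as ordinary CFG rules.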
It is easy to express this as an Adaptor Grammar: (4) is replaced with (10–11) and (12–17) are added to the grammar.

    Word → SyllableIF    (10)
    Word → SyllableI Syllable^{0:2} SyllableF    (11)
    SyllableIF → (OnsetI) RhymeF    (12)
    SyllableI → (OnsetI) Rhyme    (13)
    SyllableF → (Onset) RhymeF    (14)
    OnsetI → Consonant+    (15)
    RhymeF → Nucleus (CodaF)    (16)
    CodaF → Consonant+    (17)

In this grammar the suffix “I” indicates a word-initial element, and “F” indicates a word-final element. Note that the model simply has the ability to learn that different clusters can occur word-peripherally and word-internally; it is not given any information about the relative complexity of these clusters.

2.2 Collocation models of inter-word dependencies

Goldwater et al. (2009) point out the detrimental effect that inter-word dependencies can have on word segmentation models that assume that the words of an utterance are independently generated. Informally, a model that generates words independently is likely to incorrectly segment multi-word expressions such as “the doggie” as single words because the model has no way to capture word-to-word dependencies, e.g., that “doggie” is typically preceded by “the”. Goldwater et al. show that word segmentation accuracy improves when the model is extended to capture bigram dependencies.

Adaptor grammar models cannot express bigram dependencies, but they can capture similar inter-word dependencies using phrase-like units that Johnson (2008) calls collocations. Johnson and Goldwater (2009) showed that word segmentation accuracy improves further if the model learns a nested hierarchy of collocations. This can be achieved by replacing (2) with (18–21).

    Sentence → Colloc3+    (18)
    Colloc3 → Colloc2+    (19)
    Colloc2 → Colloc1+    (20)
    Colloc1 → Word+    (21)

Informally, Colloc1, Colloc2 and Colloc3 define a nested hierarchy of phrase-like units. While not designed to correspond to syntactic phrases, by examining the sample parses induced by the Adaptor Grammar we noticed that the collocations often correspond to noun phrases, prepositional phrases or verb phrases. This motivates the extension to the Adaptor Grammar discussed below.

2.3 Incorporating “function words” into collocation models

The starting point and baseline for our extension is the adaptor grammar with syllable-structure phonotactic constraints and three levels of collocational structure (5–21), as prior work has found that this yields the highest word segmentation token f-score (Johnson and Goldwater, 2009).

Our extension assumes that the Colloc1–Colloc3 constituents are in fact phrase-like, so we extend the rules (19–21) to permit an optional sequence of monosyllabic words at the left edge of each of these constituents. Our model thus captures two of the properties of function words discussed in section 1.1: they are monosyllabic (and thus phonologically simple), and they appear on the periphery of phrases. (We put “function words” in scare quotes below because our model only approximately captures the linguistic properties of function words.)

Specifically, we replace rules (19–21) with the following sequence of rules:

    Colloc3 → (FuncWords3) Colloc2+    (22)
    Colloc2 → (FuncWords2) Colloc1+    (23)
    Colloc1 → (FuncWords1) Word+    (24)
    FuncWords3 → FuncWord3+    (25)
    FuncWord3 → SyllableIF    (26)
    FuncWords2 → FuncWord2+    (27)
    FuncWord2 → SyllableIF    (28)
    FuncWords1 → FuncWord1+    (29)
    FuncWord1 → SyllableIF    (30)

This model memoises (i.e., learns) both the individual “function words” and the sequences of “function words” that modify the Colloc1–Colloc3 constituents. Note also that “function words” expand directly to SyllableIF, which in turn expands to a monosyllable with a word-initial onset and word-final coda. This means that “function words” are memoised independently of the “content words” that Word expands to; i.e., the model learns distinct “function word” and “content word” vocabularies. Figure 1 depicts a sample parse generated by this grammar.
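Section 3 compares this left-periphery grammar against a mirror-image variant in which the optional “function word” sequence attaches at the right edge of each collocation. The sketch below is our own illustration of how the two variants differ only in where that optional sequence is attached; the rule encoding and function names are hypothetical, not the released grammar files.

    # Build the collocation rules for the "left periphery" model (22-24) and its
    # mirror-image "right periphery" variant.  "(X)" marks optionality, "Y+" iteration.

    def colloc_rules(periphery="left"):
        levels = [("Colloc3", "Colloc2"), ("Colloc2", "Colloc1"), ("Colloc1", "Word")]
        rules = []
        for parent, child in levels:
            n = parent[-1]                          # "3", "2" or "1"
            func = "(FuncWords" + n + ")"           # optional "function word" sequence
            body = [child + "+"]
            # Attach the optional sequence at the chosen edge of the constituent.
            rules.append((parent, ([func] + body) if periphery == "left" else (body + [func])))
            # Rules (25)-(30): FuncWordsN -> FuncWordN+ ; FuncWordN -> SyllableIF
            rules.append(("FuncWords" + n, ["FuncWord" + n + "+"]))
            rules.append(("FuncWord" + n, ["SyllableIF"]))
        return rules

    for lhs, rhs in colloc_rules("left"):
        print(lhs, "->", " ".join(rhs))

Changing only this attachment point yields the left-periphery grammar and the right-periphery grammar whose posterior probabilities section 4 compares.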
[Figure 1: A sample parse generated by the “function word” Adaptor Grammar with rules (10–18) and (22–30). The Colloc3 begins with the FuncWord3s “you”, “want” and “to”, followed by a Colloc2 containing two Colloc1s: the first is the Word “see”, and the second is the FuncWord1 “the” followed by the Word “book”. To simplify the parse we only show the root node and the adapted nonterminals, and replace word-internal structure by the word's orthographic form.]

Table 1: Mean token f-scores and boundary precision and recall, averaged over 8 trials, each consisting of 8 MCMC runs of models trained and tested on the full Bernstein-Ratner (1987) corpus. The standard deviations of all values are less than 0.006; Wilcoxon sign tests show the means of all token f-scores differ, p < 2e-4.

    Model                 Token f-score   Boundary precision   Boundary recall
    Baseline              0.872           0.918                0.956
    + left FWs            0.924           0.935                0.990
    + left + right FWs    0.912           0.957                0.953

This grammar builds in the fact that function words appear on the left periphery of phrases. This is true of languages such as English, but is not true cross-linguistically. For comparison purposes we also include results for a mirror-image model that permits “function words” on the right periphery, a model which permits “function words” on both the left and right periphery (achieved by changing rules 22–24), as well as a model that analyses all words as monosyllabic.

Section 4 explains how a learner could use Bayesian model selection to determine that function words appear on the left periphery in English by comparing the posterior probability of the data under our “function word” Adaptor Grammar to that obtained using a grammar which is identical except that rules (22–24) are replaced with mirror-image rules in which “function words” are attached to the right periphery.

3 Word segmentation results

This section presents results of running our Adaptor Grammar models on subsets of the Bernstein-Ratner (1987) corpus of child-directed English. We use the Adaptor Grammar software available from http://web.science.mq.edu.au/~mjohnson/ with the same settings as described in Johnson and Goldwater (2009), i.e., we perform Bayesian inference with “vague” priors for all hyperparameters (so there are no adjustable parameters in our models), and perform 8 different MCMC runs of each condition with table-label resampling for 2,000 sweeps of the training data. At every 10th sweep of the last 1,000 sweeps we use the model to segment the entire corpus (even if it is only trained on a subset of it), so we collect 800 sample segmentations of each utterance. The most frequent segmentation in these 800 sample segmentations is the one we score in the evaluations below.

3.1 Word segmentation with “function word” models

Here we evaluate the word segmentations found by the “function word” Adaptor Grammar model described in section 2.3 and compare it to the baseline grammar with collocations and phonotactics from Johnson and Goldwater (2009). Figure 2 presents the standard token and lexicon (i.e., type) f-score evaluations for word segmentations proposed by these models (Brent, 1999), and Table 1 summarises the token f-scores and boundary precision and recall for the major models discussed in this paper. It is interesting to note that adding “function words” improves token f-score by more than 4%, corresponding to a 40% reduction in overall error rate.
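The token f-score and the boundary precision and recall reported in Table 1 can be computed directly from aligned gold and predicted segmentations. The sketch below is our own restatement of these standard metrics (word tokens scored by their character spans, boundaries by their utterance-internal positions), not the evaluation script used for the experiments.

    # Score predicted segmentations against gold segmentations (our own sketch).

    def spans(words):
        """Map a list of words to the set of (start, end) character spans they occupy."""
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out

    def boundaries(words):
        """Utterance-internal word boundary positions (excluding the final edge)."""
        positions, pos = set(), 0
        for w in words[:-1]:
            pos += len(w)
            positions.add(pos)
        return positions

    def prf(predicted_sets, gold_sets):
        """Micro-averaged precision, recall and f-score over a corpus."""
        tp = sum(len(p & g) for p, g in zip(predicted_sets, gold_sets))
        npred = sum(len(p) for p in predicted_sets)
        ngold = sum(len(g) for g in gold_sets)
        precision = tp / npred if npred else 0.0
        recall = tp / ngold if ngold else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    gold = [["ju", "wɑnt", "tu", "si", "ðə", "bʊk"]]
    pred = [["ju", "wɑnt", "tusi", "ðə", "bʊk"]]
    print("token P/R/F:", prf([spans(p) for p in pred], [spans(g) for g in gold]))
    print("boundary P/R/F:", prf([boundaries(p) for p in pred], [boundaries(g) for g in gold]))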
When the training data is very small the Monosyllabic grammar produces the highest accuracy results, presumably because a large proportion of the words in child-directed speech are monosyllabic. However, at around 25 sentences the more complex models that are capable of finding multi-syllabic words start to become more accurate.

It's interesting that after about 1,000 sentences the model that allows “function words” only on the right periphery is considerably less accurate than the baseline model. Presumably this is because it tends to misanalyse multi-syllabic words on the right periphery as sequences of monosyllabic words.

The model that allows “function words” only on the left periphery is more accurate than the model that allows them on both the left and right periphery when the input data ranges from about 100 to about 1,000 sentences, but when the training data