Modelling function words improves unsupervised word segmentation

Mark Johnson 1,2, Anne Christophe 3,4, Katherine Demuth 2,6 and Emmanuel Dupoux 3,5

1 Department of Computing, Macquarie University, Sydney, Australia
2 Santa Fe Institute, Santa Fe, New Mexico, USA
3 École Normale Supérieure, Paris, France
4 Centre National de la Recherche Scientifique, Paris, France
5 École des Hautes Études en Sciences Sociales, Paris, France
6 Department of Linguistics, Macquarie University, Sydney, Australia
Abstract

Inspired by experimental psychological findings suggesting that function words play a special role in word learning, we make a simple modification to an Adaptor Grammar based Bayesian word segmentation model to allow it to learn sequences of monosyllabic “function words” at the beginnings and endings of collocations of (possibly multi-syllabic) words. This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case “function words”, setting a new state-of-the-art of 92.4% token f-score. Our function word model assumes that function words appear at the left periphery, and while this is true of languages such as English, it is not true universally. We show that a learner can use Bayesian model selection to determine the location of function words in their language, even though the input to the model only consists of unsegmented sequences of phones. Thus our computational models support the hypothesis that function words play a special role in word learning.

1 Introduction

Over the past two decades psychologists have investigated the role that function words might play in human language acquisition. Their experiments suggest that function words play a special role in the acquisition process: children learn function words before they learn the vast bulk of the associated content words, and they use function words to help identify content words.

The goal of this paper is to determine whether computational models of human language acquisition can provide support for the hypothesis that function words are treated specially in human language acquisition. We do this by comparing two computational models of word segmentation which differ solely in the way that they model function words. Following Elman et al. (1996) and Brent (1999), our word segmentation models identify word boundaries from unsegmented sequences of phonemes corresponding to utterances, effectively performing unsupervised learning of a lexicon. For example, given input consisting of unsegmented utterances such as the following:

    juwɑnttusiðəbʊk

a word segmentation model should segment this as ju wɑnt tu si ðə bʊk, which is the IPA representation of “you want to see the book”.
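As a purely illustrative sketch of the task (ours, not part of the models described in this paper), a candidate segmentation of such an utterance can be represented by the set of positions at which word boundaries are inserted into the phone string; the function name and boundary positions below are our own example.

    # Hypothetical illustration of the word segmentation task: the input is an
    # unsegmented sequence of phones, and a candidate analysis is a set of
    # positions at which word boundaries are inserted.

    def segment(phones, boundaries):
        """Split a list of phones into words at the given boundary positions."""
        cuts = [0] + sorted(boundaries) + [len(phones)]
        return ["".join(phones[i:j]) for i, j in zip(cuts, cuts[1:])]

    # "youwanttoseethebook" in IPA, as a list of phones.
    phones = ["j", "u", "w", "ɑ", "n", "t", "t", "u", "s", "i", "ð", "ə", "b", "ʊ", "k"]
    print(segment(phones, {2, 6, 8, 10, 12}))   # ['ju', 'wɑnt', 'tu', 'si', 'ðə', 'bʊk']

The models below never see the boundary positions; they must infer them (and the lexicon) from the unsegmented phone strings alone.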
We show that a model equipped with the ability to learn some rudimentary properties of the target language's function words is able to learn the vocabulary of that language more accurately than a model that is identical except that it is incapable of learning these generalisations about function words. This suggests that there are acquisition advantages to treating function words specially that human learners could take advantage of (at least to the extent that they are learning similar generalisations as our models), and thus supports the hypothesis that function words are treated specially in human lexical acquisition. As a reviewer points out, we present no evidence that children use function words in the way that our model does, and we want to emphasise that we make no such claim.

While absolute accuracy is not directly relevant to the main point of the paper, we note that the models that learn generalisations about function words perform unsupervised word segmentation at 92.4% token f-score on the standard Bernstein-Ratner (1987) corpus, which improves the previous state-of-the-art by more than 4%.

As a reviewer points out, the changes we make to our models to incorporate function words can be viewed as “building in” substantive information about possible human languages.
The model that achieves the best token f-score expects function words to appear at the left edge of phrases. While this is true for languages such as English, it is not true universally. By comparing the posterior probability of two models, one in which function words appear at the left edges of phrases and another in which function words appear at the right edges of phrases, we show that a learner could use Bayesian posterior probabilities to determine that function words appear at the left edges of phrases in English, even though the learner is not told the locations of word boundaries or which words are function words.

This paper is structured as follows. Section 2 describes the specific word segmentation models studied in this paper, and the way we extended them to capture certain properties of function words. The word segmentation experiments are presented in section 3, and section 4 discusses how a learner could determine whether function words occur on the left periphery or the right periphery in the language they are learning. Section 5 concludes and describes possible future work. The rest of this introduction provides background on function words, the Adaptor Grammar models we use to describe lexical acquisition, and the Bayesian inference procedures we use to infer these models.

1.1 Psychological evidence for the role of function words in word learning

Traditional descriptive linguistics distinguishes function words, such as determiners and prepositions, from content words, such as nouns and verbs, corresponding roughly to the distinction between functional categories and lexical categories of modern generative linguistics (Fromkin, 2001). Function words differ from content words in at least the following ways:

1. there are usually far fewer function word types than content word types in a language
2. function word types typically have much higher token frequency than content word types
3. function words are typically morphologically and phonologically simple (e.g., they are typically monosyllabic)
4. function words typically appear in peripheral positions of phrases (e.g., prepositions typically appear at the beginning of prepositional phrases)
5. each function word class is associated with specific content word classes (e.g., determiners and prepositions are associated with nouns, auxiliary verbs and complementisers are associated with main verbs)
6. semantically, content words denote sets of objects or events, while function words denote more complex relationships over the entities denoted by content words
7. historically, the rate of innovation of function words is much lower than the rate of innovation of content words (i.e., function words are typically “closed class”, while content words are “open class”)

Properties 1–4 suggest that function words might play a special role in language acquisition because they are especially easy to identify, while property 5 suggests that they might be useful for identifying lexical categories. The models we study here focus on properties 3 and 4, in that they are capable of learning specific sequences of monosyllabic words in peripheral (i.e., initial or final) positions of phrase-like units.

A number of psychological experiments have shown that infants are sensitive to the function words of their language within their first year of life (Shi et al., 2006; Hallé et al., 2008; Shafer et al., 1998), often before they have experienced the “word learning spurt”. Crucially for our purpose, infants of this age were shown to exploit frequent function words to segment neighbouring content words (Shi and Lepage, 2008; Hallé et al., 2008). In addition, 14- to 18-month-old children were shown to exploit function words to constrain lexical access to known words: for instance, they expect a noun after a determiner (Cauvet et al., 2014; Kedar et al., 2006; Zangl and Fernald, 2007). In addition, it is plausible that function words play a crucial role in children's acquisition of more complex syntactic phenomena (Christophe et al., 2008; Demuth and McCullough, 2009), so it is interesting to investigate the roles they might play in computational models of language acquisition.

1.2 Adaptor grammars

Adaptor grammars are a framework for Bayesian inference of a certain class of hierarchical non-parametric models (Johnson et al., 2007b). They define distributions over the trees specified by a context-free grammar, but unlike probabilistic context-free grammars, they “learn” distributions over the possible subtrees of a user-specified set of “adapted” nonterminals. (Adaptor grammars are non-parametric, i.e., not characterisable by a finite set of parameters, if the set of possible subtrees of the adapted nonterminals is infinite.)
Adaptor grammars are useful when the goal is to learn a potentially unbounded set of entities that need to satisfy hierarchical constraints. As section 2 explains in more detail, word segmentation is such a case: words are composed of syllables and belong to phrases or collocations, and modelling this structure improves word segmentation accuracy.

Adaptor Grammars are formally defined in Johnson et al. (2007b), which should be consulted for technical details. Adaptor Grammars (AGs) are an extension of Probabilistic Context-Free Grammars (PCFGs), which we describe first. A Context-Free Grammar (CFG) G = (N, W, R, S) consists of disjoint finite sets of nonterminal symbols N and terminal symbols W, a finite set of rules R of the form A → α where A ∈ N and α ∈ (N ∪ W)⋆, and a start symbol S ∈ N. (We assume there are no “ε-rules” in R, i.e., we require that |α| ≥ 1 for each A → α ∈ R.)

A Probabilistic Context-Free Grammar (PCFG) is a quintuple (N, W, R, S, θ) where N, W, R and S are the nonterminals, terminals, rules and start symbol of a CFG respectively, and θ is a vector of non-negative reals indexed by R that satisfies

    Σ_{A→α ∈ R_A} θ_{A→α} = 1   for each A ∈ N,

where R_A = {A → α : A → α ∈ R} is the set of rules expanding A.

Informally, θ_{A→α} is the probability of a node labelled A expanding to a sequence of nodes labelled α, and the probability of a tree is the product of the probabilities of the rules used to construct each non-leaf node in it. More precisely, for each X ∈ N ∪ W a PCFG associates a distribution G_X over the set T_X of trees generated by X as follows:

If X ∈ W (i.e., if X is a terminal) then G_X is the distribution that puts probability 1 on the single-node tree labelled X.

If X ∈ N (i.e., if X is a nonterminal) then:

    G_X = Σ_{X→B_1…B_n ∈ R_X} θ_{X→B_1…B_n} TD_X(G_{B_1}, …, G_{B_n})    (1)

where R_X is the subset of rules in R expanding nonterminal X ∈ N, and:

    TD_X(G_1, …, G_n)(t) = ∏_{i=1}^{n} G_i(t_i),

where t is the tree whose root node is labelled X and whose subtrees are t_1, …, t_n. That is, TD_X(G_1, …, G_n) is a distribution over the set T_X of trees generated by nonterminal X, where each subtree t_i is generated independently from G_i. The PCFG generates the distribution G_S over the set T_S of trees generated by the start symbol S; the distribution over the strings it generates is obtained by marginalising over the trees.

In a Bayesian PCFG one puts Dirichlet priors Dir(α) on the rule probability vector θ, such that there is one Dirichlet parameter α_{A→α} for each rule A → α ∈ R. There are Markov Chain Monte Carlo (MCMC) and Variational Bayes procedures for estimating the posterior distribution over rule probabilities θ and parse trees given data consisting of terminal strings alone (Kurihara and Sato, 2006; Johnson et al., 2007a).

PCFGs can be viewed as recursive mixture models over trees. While PCFGs are expressive enough to describe a range of linguistically interesting phenomena, PCFGs are parametric models, which limits their ability to describe phenomena where the set of basic units, as well as their properties, are the target of learning. Lexical acquisition is an example of a phenomenon that is naturally viewed as non-parametric inference, where the number of lexical entries (i.e., words) as well as their properties must be learnt from the data.

It turns out there is a straightforward modification to the PCFG distribution (1) that makes it suitably non-parametric. As Johnson et al. (2007b) explain, by inserting a Dirichlet Process (DP) or Pitman-Yor Process (PYP) into the generative mechanism (1), the model “concentrates” mass on a subset of trees (Teh et al., 2006). Specifically, an Adaptor Grammar identifies a subset A ⊆ N of adapted nonterminals. In an Adaptor Grammar the unadapted nonterminals N \ A expand via (1), just as in a PCFG, but the distributions of the adapted nonterminals A are “concentrated” by passing them through a DP or PYP:

    H_X = Σ_{X→B_1…B_n ∈ R_X} θ_{X→B_1…B_n} TD_X(G_{B_1}, …, G_{B_n})
    G_X = PYP(H_X, a_X, b_X)

Here a_X and b_X are parameters of the PYP associated with the adapted nonterminal X. As Goldwater et al. (2011) explain, such Pitman-Yor Processes naturally generate power-law distributed data.

Informally, Adaptor Grammars can be viewed as caching entire subtrees of the adapted nonterminals. Roughly speaking, the probability of generating a particular subtree of an adapted nonterminal is proportional to the number of times that subtree has been generated before. This “rich get richer” behaviour causes the distribution of subtrees to follow a power law (the power is specified by the a_X parameter of the PYP). The PCFG rules expanding an adapted nonterminal X define the “base distribution” of the associated DP or PYP, and the a_X and b_X parameters determine how much mass is reserved for “new” trees.
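To make this caching behaviour concrete, the following forward-sampling sketch is ours alone: the toy grammar, the parameter values and the simplification of treating every cached subtree token as its own “table” are assumptions for illustration, not the authors' implementation. An adapted nonterminal either reuses a previously generated subtree or draws a fresh one from its PCFG base distribution.

    import random
    from collections import defaultdict

    # Toy PCFG (ours): nonterminal -> list of (rhs, probability); lowercase symbols are terminals.
    RULES = {
        "Sentence": [(("Word", "Sentence"), 0.5), (("Word",), 0.5)],
        "Word":     [(("a",), 0.4), (("b", "a"), 0.3), (("b", "a", "b"), 0.3)],
    }
    ADAPTED = {"Word"}        # nonterminals whose subtrees are cached
    PYP_A, PYP_B = 0.5, 1.0   # assumed discount a_X and concentration b_X

    cache = defaultdict(list)  # adapted nonterminal -> previously generated subtrees

    def expand_base(symbol):
        """Expand one nonterminal using the PCFG base distribution H_X."""
        rhs_options, probs = zip(*RULES[symbol])
        rhs = random.choices(rhs_options, weights=probs)[0]
        return (symbol,) + tuple(sample_tree(child) for child in rhs)

    def sample_tree(symbol):
        """Sample a tree rooted in `symbol`, reusing cached subtrees for adapted nonterminals."""
        if symbol not in RULES:        # terminal: a single-node tree
            return symbol
        if symbol not in ADAPTED:      # unadapted nonterminal: ordinary PCFG expansion
            return expand_base(symbol)
        previous = cache[symbol]
        n = len(previous)
        # Simplification: treat every cached token as its own table, so a new subtree is
        # generated with probability (b + a*n)/(b + n), and each previously generated
        # subtree is reused with probability (1 - a)/(b + n): the "rich get richer" effect.
        if random.random() < (PYP_B + PYP_A * n) / (PYP_B + n):
            tree = expand_base(symbol)
        else:
            tree = random.choice(previous)
        cache[symbol].append(tree)
        return tree

    random.seed(1)
    for _ in range(5):
        print(sample_tree("Sentence"))

Because reused subtrees are whole cached trees, frequently generated Words quickly come to dominate the cache, which is the behaviour the segmentation models below exploit.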
There are several different procedures for inferring the parse trees and the rule probabilities given a corpus of strings: Johnson et al. (2007b) describe an MCMC sampler and Cohen et al. (2010) describe a Variational Bayes procedure. We use the MCMC procedure here since this has been successfully applied to word segmentation problems in previous work (Johnson, 2008).

2 Word segmentation with Adaptor Grammars

Perhaps the simplest word segmentation model is the unigram model, where utterances are modelled as sequences of words, and where each word is a sequence of segments (Brent, 1999; Goldwater et al., 2009). A unigram model can be expressed as an Adaptor Grammar with one adapted nonterminal Word (we indicate adapted nonterminals by underlining them in grammars here; regular expressions are expanded into right-branching productions).

    Sentence → Word+    (2)
    Word → Phone+    (3)

The first rule (2) says that a sentence consists of one or more Words, while the second rule (3) states that a Word consists of a sequence of one or more Phones; we assume that there are rules expanding Phone into all possible phones. Because Word is an adapted nonterminal, the adaptor grammar memoises Word subtrees, which corresponds to learning the phone sequences for the words of the language.

The more sophisticated Adaptor Grammars discussed below can be understood as specialising either the first or the second of the rules in (2–3). The next two subsections review the Adaptor Grammar word segmentation models presented in Johnson (2008) and Johnson and Goldwater (2009): section 2.1 reviews how phonotactic syllable-structure constraints can be expressed with Adaptor Grammars, while section 2.2 reviews how phrase-like units called “collocations” capture inter-word dependencies. Section 2.3 presents the major novel contribution of this paper by explaining how we modify these adaptor grammars to capture some of the special properties of function words.

2.1 Syllable structure and phonotactics

The rule (3) models words as sequences of independently generated phones: this is what Goldwater et al. (2009) called the “monkey model” of word generation (it instantiates the metaphor that word types are generated by a monkey randomly banging on the keys of a typewriter). However, the words of a language are typically composed of one or more syllables, and explicitly modelling the internal structure of words typically improves word segmentation considerably.

Johnson (2008) suggested replacing (3) with the following model of word structure:

    Word → Syllable^{1:4}    (4)
    Syllable → (Onset) Rhyme    (5)
    Onset → Consonant+    (6)
    Rhyme → Nucleus (Coda)    (7)
    Nucleus → Vowel+    (8)
    Coda → Consonant+    (9)

Here and below superscripts indicate iteration (e.g., a Word consists of 1 to 4 Syllables, while an Onset consists of an unbounded number of Consonants), while parentheses indicate optionality (e.g., a Rhyme consists of an obligatory Nucleus followed by an optional Coda). We assume that there are rules expanding Consonant and Vowel to the set of all consonants and vowels respectively (this amounts to assuming that the learner can distinguish consonants from vowels). Because Onset, Nucleus and Coda are adapted, this model learns the possible syllable onsets, nuclei and codas of the language, even though neither syllable structure nor word boundaries are explicitly indicated in the input to the model.

The model just described assumes that word-internal syllables have the same structure as word-peripheral syllables, but in languages such as English word-peripheral onsets and codas can be more complex than the corresponding word-internal onsets and codas. For example, the word “string” begins with the onset cluster str, which is relatively rare word-internally. Johnson (2008) showed that word segmentation accuracy improves if the model can learn different consonant sequences for word-initial onsets and word-final codas.
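The iteration and optionality abbreviations used in rules (4)–(9), and in the extended grammars below, are just shorthand for ordinary CFG productions. The helper functions below are our own illustration of how such abbreviations can be spelled out, with X+ handled by right-branching productions as stated above; the function names and rule encoding are hypothetical, not part of the authors' released software.

    # Our own sketch of expanding the grammar abbreviations into plain CFG rules.
    # A rule is encoded as (lhs, [rhs symbols]).

    def expand_plus(nt):
        """X+ : one or more X, via right-branching productions Xs -> X | X Xs."""
        seq = nt + "s"
        return [(seq, [nt]), (seq, [nt, seq])]

    def expand_range(lhs, nt, lo, hi):
        """X^{lo:hi} : between lo and hi copies of X, one production per length."""
        return [(lhs, [nt] * n) for n in range(lo, hi + 1)]

    def expand_optional(lhs, rhs):
        """Each '(X)' item is optional: emit productions with and without it."""
        alternatives = [[]]
        for sym in rhs:
            if sym.startswith("(") and sym.endswith(")"):
                alternatives = [a + [sym[1:-1]] for a in alternatives] + [list(a) for a in alternatives]
            else:
                alternatives = [a + [sym] for a in alternatives]
        return [(lhs, a) for a in alternatives]

    # Rule (4): Word -> Syllable^{1:4}
    print(expand_range("Word", "Syllable", 1, 4))
    # Rule (5): Syllable -> (Onset) Rhyme
    print(expand_optional("Syllable", ["(Onset)", "Rhyme"]))
    # Rule (6): Onset -> Consonant+  (Onset rewrites to the right-branching Consonants)
    print([("Onset", ["Consonants"])] + expand_plus("Consonant"))

Nothing here is specific to the segmentation models; it only spells out the shorthand so that the grammars below can be read as ordinary CFG rules.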
It is easy to express this as an Adaptor Grammar: (4) is replaced with (10–11) and (12–17) are added to the grammar.

    Word → SyllableIF    (10)
    Word → SyllableI Syllable^{0:2} SyllableF    (11)
    SyllableIF → (OnsetI) RhymeF    (12)
    SyllableI → (OnsetI) Rhyme    (13)
    SyllableF → (Onset) RhymeF    (14)
    OnsetI → Consonant+    (15)
    RhymeF → Nucleus (CodaF)    (16)
    CodaF → Consonant+    (17)

In this grammar the suffix “I” indicates a word-initial element, and “F” indicates a word-final element. Note that the model simply has the ability to learn that different clusters can occur word-peripherally and word-internally; it is not given any information about the relative complexity of these clusters.

2.2 Collocation models of inter-word dependencies

Goldwater et al. (2009) point out the detrimental effect that inter-word dependencies can have on word segmentation models that assume that the words of an utterance are independently generated. Informally, a model that generates words independently is likely to incorrectly segment multi-word expressions such as “the doggie” as single words because the model has no way to capture word-to-word dependencies, e.g., that “doggie” is typically preceded by “the”. Goldwater et al. show that word segmentation accuracy improves when the model is extended to capture bigram dependencies.

Adaptor grammar models cannot express bigram dependencies, but they can capture similar inter-word dependencies using phrase-like units that Johnson (2008) calls collocations. Johnson and Goldwater (2009) showed that word segmentation accuracy improves further if the model learns a nested hierarchy of collocations. This can be achieved by replacing (2) with (18–21).

    Sentence → Colloc3+    (18)
    Colloc3 → Colloc2+    (19)
    Colloc2 → Colloc1+    (20)
    Colloc1 → Word+    (21)

Informally, Colloc1, Colloc2 and Colloc3 define a nested hierarchy of phrase-like units. While not designed to correspond to syntactic phrases, by examining the sample parses induced by the Adaptor Grammar we noticed that the collocations often correspond to noun phrases, prepositional phrases or verb phrases. This motivates the extension to the Adaptor Grammar discussed below.

2.3 Incorporating “function words” into collocation models

The starting point and baseline for our extension is the adaptor grammar with syllable-structure phonotactic constraints and three levels of collocational structure (5–21), as prior work has found that this yields the highest word segmentation token f-score (Johnson and Goldwater, 2009).

Our extension assumes that the Colloc1–Colloc3 constituents are in fact phrase-like, so we extend the rules (19–21) to permit an optional sequence of monosyllabic words at the left edge of each of these constituents. Our model thus captures two of the properties of function words discussed in section 1.1: they are monosyllabic (and thus phonologically simple), and they appear on the periphery of phrases. (We put “function words” in scare quotes below because our model only approximately captures the linguistic properties of function words.)

Specifically, we replace rules (19–21) with the following sequence of rules:

    Colloc3 → (FuncWords3) Colloc2+    (22)
    Colloc2 → (FuncWords2) Colloc1+    (23)
    Colloc1 → (FuncWords1) Word+    (24)
    FuncWords3 → FuncWord3+    (25)
    FuncWord3 → SyllableIF    (26)
    FuncWords2 → FuncWord2+    (27)
    FuncWord2 → SyllableIF    (28)
    FuncWords1 → FuncWord1+    (29)
    FuncWord1 → SyllableIF    (30)

This model memoises (i.e., learns) both the individual “function words” and the sequences of “function words” that modify the Colloc1–Colloc3 constituents. Note also that “function words” expand directly to SyllableIF, which in turn expands to a monosyllable with a word-initial onset and word-final coda. This means that “function words” are memoised independently of the “content words” that Word expands to; i.e., the model learns distinct “function word” and “content word” vocabularies. Figure 1 depicts a sample parse generated by this grammar.
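Section 3 compares this left-periphery grammar against a mirror-image variant in which the optional “function word” sequence attaches at the right edge of each collocation. The sketch below is our own illustration of how the two variants differ only in where that optional sequence is attached; the rule encoding and function names are hypothetical, not the released grammar files.

    # Build the collocation rules for the "left periphery" model (22-24) and its
    # mirror-image "right periphery" variant.  "(X)" marks optionality, "Y+" iteration.

    def colloc_rules(periphery="left"):
        levels = [("Colloc3", "Colloc2"), ("Colloc2", "Colloc1"), ("Colloc1", "Word")]
        rules = []
        for parent, child in levels:
            n = parent[-1]                          # "3", "2" or "1"
            func = "(FuncWords" + n + ")"           # optional "function word" sequence
            body = [child + "+"]
            # Attach the optional sequence at the chosen edge of the constituent.
            rules.append((parent, ([func] + body) if periphery == "left" else (body + [func])))
            # Rules (25)-(30): FuncWordsN -> FuncWordN+ ; FuncWordN -> SyllableIF
            rules.append(("FuncWords" + n, ["FuncWord" + n + "+"]))
            rules.append(("FuncWord" + n, ["SyllableIF"]))
        return rules

    for lhs, rhs in colloc_rules("left"):
        print(lhs, "->", " ".join(rhs))

Changing only this attachment point yields the left-periphery grammar and the right-periphery grammar whose posterior probabilities section 4 compares.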
[Figure 1: A sample parse generated by the “function word” Adaptor Grammar with rules (10–18) and (22–30). The Colloc3 begins with the FuncWord3s “you”, “want” and “to”, followed by a Colloc2 containing two Colloc1s: the first is the Word “see”, and the second is the FuncWord1 “the” followed by the Word “book”. To simplify the parse we only show the root node and the adapted nonterminals, and replace word-internal structure by the word's orthographic form.]

Table 1: Mean token f-scores and boundary precision and recall, averaged over 8 trials, each consisting of 8 MCMC runs of models trained and tested on the full Bernstein-Ratner (1987) corpus. The standard deviations of all values are less than 0.006; Wilcoxon sign tests show the means of all token f-scores differ, p < 2e-4.

    Model                 Token f-score   Boundary precision   Boundary recall
    Baseline              0.872           0.918                0.956
    + left FWs            0.924           0.935                0.990
    + left + right FWs    0.912           0.957                0.953

This grammar builds in the fact that function words appear on the left periphery of phrases. This is true of languages such as English, but is not true cross-linguistically. For comparison purposes we also include results for a mirror-image model that permits “function words” on the right periphery, a model which permits “function words” on both the left and right periphery (achieved by changing rules 22–24), as well as a model that analyses all words as monosyllabic.

Section 4 explains how a learner could use Bayesian model selection to determine that function words appear on the left periphery in English by comparing the posterior probability of the data under our “function word” Adaptor Grammar to that obtained using a grammar which is identical except that rules (22–24) are replaced with mirror-image rules in which “function words” are attached to the right periphery.

3 Word segmentation results

This section presents results of running our Adaptor Grammar models on subsets of the Bernstein-Ratner (1987) corpus of child-directed English. We use the Adaptor Grammar software available from http://web.science.mq.edu.au/~mjohnson/ with the same settings as described in Johnson and Goldwater (2009), i.e., we perform Bayesian inference with “vague” priors for all hyperparameters (so there are no adjustable parameters in our models), and perform 8 different MCMC runs of each condition with table-label resampling for 2,000 sweeps of the training data. At every 10th sweep of the last 1,000 sweeps we use the model to segment the entire corpus (even if it is only trained on a subset of it), so we collect 800 sample segmentations of each utterance. The most frequent segmentation in these 800 sample segmentations is the one we score in the evaluations below.

3.1 Word segmentation with “function word” models

Here we evaluate the word segmentations found by the “function word” Adaptor Grammar model described in section 2.3 and compare it to the baseline grammar with collocations and phonotactics from Johnson and Goldwater (2009). Figure 2 presents the standard token and lexicon (i.e., type) f-score evaluations for word segmentations proposed by these models (Brent, 1999), and Table 1 summarises the token f-scores and boundary precision and recall for the major models discussed in this paper. It is interesting to note that adding “function words” improves token f-score by more than 4%, corresponding to a 40% reduction in overall error rate.
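The token f-score and the boundary precision and recall reported in Table 1 can be computed directly from aligned gold and predicted segmentations. The sketch below is our own restatement of these standard metrics (word tokens scored by their character spans, boundaries by their utterance-internal positions), not the evaluation script used for the experiments.

    # Score predicted segmentations against gold segmentations (our own sketch).

    def spans(words):
        """Map a list of words to the set of (start, end) character spans they occupy."""
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out

    def boundaries(words):
        """Utterance-internal word boundary positions (excluding the final edge)."""
        positions, pos = set(), 0
        for w in words[:-1]:
            pos += len(w)
            positions.add(pos)
        return positions

    def prf(predicted_sets, gold_sets):
        """Micro-averaged precision, recall and f-score over a corpus."""
        tp = sum(len(p & g) for p, g in zip(predicted_sets, gold_sets))
        npred = sum(len(p) for p in predicted_sets)
        ngold = sum(len(g) for g in gold_sets)
        precision = tp / npred if npred else 0.0
        recall = tp / ngold if ngold else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    gold = [["ju", "wɑnt", "tu", "si", "ðə", "bʊk"]]
    pred = [["ju", "wɑnt", "tusi", "ðə", "bʊk"]]
    print("token P/R/F:", prf([spans(p) for p in pred], [spans(g) for g in gold]))
    print("boundary P/R/F:", prf([boundaries(p) for p in pred], [boundaries(g) for g in gold]))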
When the training data is very small the Monosyllabic grammar produces the highest accuracy results, presumably because a large proportion of the words in child-directed speech are monosyllabic. However, at around 25 sentences the more complex models that are capable of finding multi-syllabic words start to become more accurate.

It's interesting that after about 1,000 sentences the model that allows “function words” only on the right periphery is considerably less accurate than the baseline model. Presumably this is because it tends to misanalyse multi-syllabic words on the right periphery as sequences of monosyllabic words.

The model that allows “function words” only on the left periphery is more accurate than the model that allows them on both the left and right periphery when the input data ranges from about 100 to about 1,000 sentences, but when the training data