Improving Word Segmentation by Simultaneously Learning Phonotactics

Improving Word Segmentation by Simultaneously Learning Phonotactics Daniel Blanchard Jeffrey Heinz Computer & Information Sciences Linguistics & Cognitive Science University of Delaware University of Delaware [email protected] [email protected] Abstract six months infants can begin to segment words out of speech (Bortfeld et al., 2005). Here we present The most accurate unsupervised word seg- an efficient word segmentation system aimed to mentation systems that are currently avail- model how infants accomplish the task. able (Brent, 1999; Venkataraman, 2001; While an algorithm that could reliably extract Goldwater, 2007) use a simple unigram orthographic representations of both novel and fa- model of phonotactics. While this sim- miliar words from acoustic data is something we plifies some of the calculations, it over- would like to see developed, following earlier re- looks cues that infant language acquisition searchers, we simplify the problem by using a text researchers have shown to be useful for that does not contain any word boundary markers. segmentation (Mattys et al., 1999; Mattys Hereafter, we use the phrase “word segmentation” and Jusczyk, 2001). Here we explore the to mean some process which adds word boundaries utility of using bigram and trigram phono- to a text that does not contain them. tactic models by enhancing Brent’s (1999) This paper’s focus is on unsupervised, incre- MBDP-1 algorithm. The results show mental word segmentation algorithms; i.e., those the improved MBDP-Phon model outper- that do not rely on preexisting knowledge of a par- forms other unsupervised word segmenta- ticular language, and those that segment the cor- tion systems (e.g., Brent, 1999; Venkatara- pus one utterance at a time. This is in contrast man, 2001; Goldwater, 2007). to supervised word segmentation algorithms (e.g., 1 Introduction Teahan et al., 2000), which are typically used for segmenting text in documents written in languages How do infants come to identify words in the that do not put spaces between their words like speech stream? As adults, we break up speech Chinese. (Of course, unsupervised word segmen- into words with such ease that we often think tation algorithms also have this application.) This that there are audible pauses between words in the also differs from batch segmentation algorithms same sentence. However, unlike some written lan- (Goldwater, 2007; Johnson, 2008b; Fleck, 2008), guages, speech does not have any completely reli- which process the entire corpus at least once be- able markers for the breaks between words (Cole fore outputting a segmentation of the corpus. Un- and Jakimik, 1980). In fact, languages vary on how supervised incremental algorithms are of interest they signal the ends of words (Cutler and Carter, to some psycholinguists and acquisitionists inter- 1987), which makes the task even more daunting. ested in the problem of language learning, as well Adults at least have a lexicon they can use to rec- as theoretical computer scientists who are inter- ognize familiar words, but when an infant is first ested in what unsupervised, incremental models born, they do not have a pre-existing lexicon to are capable of achieving. consult. In spite of these challenges, by the age of Phonotactic patterns are the rules that deter- c 2008. Licensed under the Creative Commons mine what sequences of phonemes or allophones Attribution-Noncommercial-Share Alike 3.0 Unported li- cense (http://creativecommons.org/licenses/by-nc-sa/3.0/). are allowable within words. Learning the phono- Some rights reserved. tactic patterns of a language is usually modeled 65 CoNLL 2008: Proceedings of the 12th Conference on Computational Natural Language Learning, pages 65–72 Manchester, August 2008 separately from word segmentation; e.g., current 5. Concatenate the words in the order specified phonotactic learners such as Coleman and Pierre- by s, and remove the word delimiters (#). humbert (1997), Heinz (2007), or Hayes and Wil- It is important to note that this model treats the son (2008) are given word-sized units as input. generation of the text as a single event in the prob- However, infants appear to simultaneously learn ability space, which allows Brent to make a num- which phoneme combinations are allowable within ber of simplifying assumptions. As the values for words and how to extract words from the input. It n, L, f, and s completely determine the segmenta- is reasonable that the two processes feed into one tion, the probability of a particular segmentation, another, and when infants acquire a critical mass of wm, can be calculated as: phonotactic knowledge, they use it to make judge- ments about what phoneme sequences can occur P (wm) = P (n, L, f, s) (1) within versus across word boundaries (Mattys and To allow the model to operate on one utterance at Jusczyk, 2001). We use this insight, also suggested a time, Brent states the probability of each word in by Venkataraman (2001) and recently utilized by the text as a recursive function, R(w ), where w Fleck (2008) in a different manner, to enhance k k is the text up to and including the word at position Brent’s (1999) model MBDP-1, and significantly k, w . Furthermore, there are two specific cases increase segmentation accuracy. We call this mod- k for R: familiar words and novel words. If w is ified segmentation model MBDP-Phon. k familiar, the model already has the word in its lex- 2 Related Work icon, and its score is calculated as in Equation 2. 2 f(wk) f(wk) 1 2.1 Word Segmentation R(wk) = − (2) k · f(wk) The problem of unsupervised word segmentation has attracted many earlier researchers over the Otherwise, the word is novel, and its score is cal- 1 past fifty years (e.g., Harris, 1954; Olivier, 1968; culated using Equation 3 (Brent and Tao, 2001), de Marcken, 1995; Brent, 1999). In this section, R(wk) = we describe the base model MBDP-1, along with 2 6 n PΣ(a1)...PΣ(aq) n 1 (3) π2 k 1 P (#) −n two other segmentation approaches, Venkataraman · · − Σ · (2001) and Goldwater (2007). In 4, we compare § where PΣ is the probability of a particular MBDP-Phon to these models in more detail. For phoneme occurring in the text. The third term of a thorough review of word segmentation literature, the equation for novel words is where the model’s see Brent (1999) or Goldwater (2007). unigram phonotactic model comes into play. We detail how to plug a more sophisticated phonotac- 2.1.1 MBDP-1 tic learning model into this equation in 3. With § Brent’s (1999) MBDP-1 (Model Based Dy- the generative model established, MBDP-1 uses a namic Programming) algorithm is an implemen- Viterbi-style search algorithm to find the segmentation of the INCDROP framework (Brent, 1997) tation for each utterance that maximizes the R val- that uses a Bayesian model of how to generate an ues for each word in the segmentation. unsegmented text to insert word boundaries. The Venkataraman (2001) notes that considering the generative model consists of five steps: generation of the text as a single event is un- 1. Choose a number of word types, n. likely to be how infants approach the segmentation problem. However, MBDP-1 uses an incre- 2. Pick n distinct strings from Σ+#, which will mental search algorithm to segment one utterance make up the lexicon, L. Entries in L are la- at a time, which is more plausible as a model of beled W1 ...Wn. W0 = $, where $ is the infants’ word segmentation. utterance boundary marker. 1Brent (1999) originally described the novel word score 2 6 n Pσ (Wn ) n 1 k k k− as R(wk) = π2 k nk 1 nk n , · · 1 − Pσ (Wj ) · k 3. Pick a function, f, which maps word types to − nk · j=1 their frequency in the text. where Pσ is the probability of all the phonemes in the word occurring together, but the denominatorP of the third term was dropped in Brent and Tao (2001). This change drastically 4. Choose a function, s, to map positions in the speeds up the model, and only reduces segmentation accuracy text to word types. by 0.5%. ∼ 66 2.1.2 Venkataraman (2001) humbert, 1997; Heinz, 2007; Hayes and Wilson, MBDP-1 is not the only incremental unsuper- 2008). While Hayes and Wilson present a more vised segmentation model that achieves promis- complex Maximum Entropy phonotactic model in ing results. Venkataraman’s (2001) model tracks their paper than the one we add to MBDP-1, they MBDP-1’s performance so closely that Batchelder also evaluate a simple n-gram phonotactic learner (2002) posits that the models are performing the operating over phonemes. The input to the mod- same operations, even though the authors describe els is a list of English onsets and their frequency them differently. in the lexicon, and the basic trigram learner simply Venkataraman’s model uses a more traditional, keeps track of the trigrams it has seen in the cor- smoothed n-gram model to describe the distribu- pus. They test the model on novel words with ac- tion of words in an unsegmented text.2 The most ceptable rhymes—some well-formed (e.g., [kIp]), probable segmentation is retrieved via a dynamic and some less well-formed (e.g., [stwIk])—so any programming algorithm, much like Brent (1999). ill-formedness is attributable to onsets. This ba- We use MBDP-1 rather than Venkataraman’s sic trigram model explains 87.7% of the variance approach as the basis for our model only because it in the scores that Scholes (1966) reports his 7th was more transparent how to plug in a phonotactic grade students gave when subjected to the same learning module at the time this project began.

Load more