
A Rudimentary Lexicon and Semantics Help Bootstrap Phoneme Acquisition

Abdellah Fourtassi and Emmanuel Dupoux

Laboratoire de Sciences Cognitives et Psycholinguistique, ENS/EHESS/CNRS, Paris
{abdellah.fourtassi, emmanuel.dupoux}@gmail.com

Abstract

Infants spontaneously discover the relevant phonemes of their language without any direct supervision. This acquisition is puzzling because it seems to require the availability of high levels of linguistic structures (lexicon, semantics), which logically presuppose that infants already have a set of phonemes. We show how this circularity can be broken by testing, in real-size language corpora, a scenario whereby infants would learn approximate representations at all levels, and then refine them in a mutually constraining way. We start with corpora of spontaneous speech that have been encoded in a varying number of detailed, context-dependent allophones. We derive, in an unsupervised way, an approximate lexicon and a rudimentary semantic representation. Despite the fact that all these representations are poor approximations of the ground truth, they help reorganize the fine grained categories into phoneme-like categories with a high degree of accuracy.

One of the most fascinating facts about infants is the speed at which they acquire their native language. During the first year alone, i.e., before they are able to speak, infants achieve impressive landmarks regarding three key language components. First, they tune in on the phonemic categories of their language (Werker and Tees, 1984). Second, they learn to segment the continuous speech stream into discrete units (Jusczyk and Aslin, 1995). Third, they start to recognize frequent words (Ngon et al., 2013), as well as the semantics of many of them (Bergelson and Swingley, 2012).

Even though these landmarks have been documented in detail over the past 40 years of research, little is still known about the mechanisms that are operative in the infant's brain to achieve such a result. Current work in early acquisition has proposed two competing but incomplete hypotheses that purport to account for this stunning developmental path. The bottom-up hypothesis holds that infants converge onto the linguistic units of their language through a statistical analysis of their input. In contrast, the top-down hypothesis emphasizes the role of higher levels of linguistic structure in learning the lower level units.

1 A chicken-and-egg problem

1.1 Bottom-up is not enough

Several studies have documented the fact that infants become attuned to the native sounds of their language, starting at 6 months of age (see Gervain & Mehler, 2010 for a review). Some researchers have claimed that such an early attunement is due to a statistical learning mechanism that only takes into account the distributional properties of the sounds present in the native input (Maye et al., 2002). Unsupervised clustering algorithms running on simplified input have, indeed, provided a proof of principle for bottom-up learning of phonemic categories from speech (see for instance Vallabha et al., 2007).

It is clear, however, that distributional learning cannot account for the entire developmental pattern. In fact, phoneme tokens in real speech exhibit high acoustic variability and result in phonemic categories with a high degree of overlap (Hillenbrand et al., 1995). When purely bottom-up clustering algorithms are tested on realistic input, they end up with either too large a number of subphonemic units (Varadarajan et al., 2008) or too small a number of coarse grained categories (Feldman et al., 2013a).

1.2 The top-down hypothesis

Inspection of the developmental data shows that infants do not wait to have completed the acquisition of their native phonemes to start to learn words. In fact, lexical and phonological acquisition largely overlap. Infants can recognize highly frequent word forms, like their own names, by as early as 4 months of age (Mandel et al., 1995). Vice versa, the refinement of phonemic categories does not stop at 12 months. The sensitivity to phonetic contrasts has been reported to continue to develop at 3 years of age (Nittrouer, 1996) and beyond (Hazan and Barrett, 2000), on par with the development of the lexicon.

Some researchers have therefore suggested that there might be a learning synergy which allows infants to base some of their acquisition not only on bottom-up information, but also on statistics over lexical items, or even on the basis of word meaning (Feldman et al., 2013a; Feldman et al., 2013b; Yeung and Werker, 2009).

These experiments and computational models, however, have focused on simplified input and/or used already segmented words. It remains to be shown whether the said top-down strategies scale up when real-size corpora and more realistic representations are used. There are indeed indications that, in the absence of a proper phonological representation, lexical learning becomes very difficult. For example, word segmentation algorithms that work on the basis of phoneme-like units tend to degrade quickly if phonemes are replaced by contextual allophones (Boruta et al., 2011) or with the output of phone recognizers (Jansen et al., 2013; Ludusan et al., 2014).

In brief, we are facing a chicken-and-egg problem: lexical and semantic information could help to learn the phonemes, but phonemes are needed to acquire lexical information.

1.3 Breaking the circularity: An incremental discovery procedure

Here, we explore the idea that instead of learning adult-like hierarchically organized representations in a sequential fashion (phonemes, words, semantics), infants learn approximate, provisional linguistic representations in parallel. These approximate representations are subsequently used to improve each other.

More precisely, we make four assumptions. First, we assume that infants start by paying attention to fine grained variation in the acoustic input, thus constructing perceptual phonetic categories that are not phonemes, but segments encoding fine grained phonetic details (Werker and Curtin, 2005; Pierrehumbert, 2003). Second, we assume that these units enable infants to segment proto-words from continuous speech and store them in this detailed format. Importantly, this proto-lexicon will not be adult-like: it will contain badly segmented word forms, and store several alternant forms for the same word. Ngon et al. (2013) have shown that 11 month old infants recognize frequent sound sequences that do not necessarily map to adult words. Third, we assume that infants can use this imperfect lexicon to acquire some semantic representation. As shown in Shukla et al. (2011), infants can simultaneously segment words and associate them with a visual referent. Fourth, we assume that as their exposure to language develops, infants reorganize these initial categories along the relevant dimensions of their native language, based on cues from all these representations.

The aim of this work is to provide a proof of principle for this general scenario, using real-size corpora in two typologically different languages, and state-of-the-art learning algorithms.

The paper is organized as follows. We begin by describing how we generated the input and how we modeled different levels of representation. Then, we explain how information from the higher levels (word forms and semantics) can be used to refine the learning of the lower level (phonetic categories). Next, we present the results of our simulations and discuss the potential implications for the language learning process.

2 Modeling the representations

Here, we describe how we model different levels of representation (phonetic categories, lexicon and semantics) starting from raw speech in English and Japanese.

2.1 Corpus

We use two speech corpora: the Buckeye Speech corpus (Pitt et al., 2007), which contains 40 hours of spontaneous conversations in American English, and the 40 hours core of the Corpus of Spontaneous Japanese (Maekawa et al., 2000), which contains spontaneous conversations and public speeches in different fields, ranging from engineering to humanities.

Following Boruta (2012), we use an inventory of 25 phonemes for transcribing Japanese, and for English, we use the set of 45 phonemes in the phonemic transcription of Pitt et al. (2007).

2.2 Phonetic categories

Here, we describe how we model the perceptual phonetic categories infants learn in a first step, before converging on the functional categories (phonemes). We make the assumption that these initial categories correspond to fine grained allophones, i.e., different systematic realizations of phonemes, depending on context. Allophonic variation can range from categorical effects due to phonological rules to gradient effects due to coarticulation, i.e., the phenomenon whereby adjacent sounds affect the physical realization of a given phoneme. An example of a rather categorical allophonic rule is given by /r/ devoicing in French:

/r/ → [χ] before a voiceless obstruent
/r/ → [ʁ] elsewhere

Figure 1: Allophonic variation of French /r/

The phoneme /r/ surfaces as voiced ([ʁ]) before a voiced obstruent, like in [kanaʁ ʒon] ("canard jaune", yellow duck), and as voiceless ([χ]) before a voiceless obstruent, as in [kanaχ puχpʁ] ("canard pourpre", purple duck). The challenge facing the learner is, therefore, to distinguish pairs of segments that are in an allophonic relationship ([ʁ], [χ]) from pairs that are two distinct phonemes and can carry a meaning difference ([ʁ], [l]).

Previous work has generated allophonic variation artificially (Martin et al., 2013). Here, we follow Fourtassi et al. (2014b) in using a linguistically and statistically controlled method, starting from audio recordings and using a standard Hidden Markov Model (HMM) phone recognizer to generate the allophones, as follows.

We convert the raw speech waveform into successive 10ms frames containing a vector of Mel Frequency Cepstrum Coefficients (MFCC). We use 12 MFC coefficients (plus the energy) computed over a 25ms window, to which we add the first and second order derivatives, yielding 39 dimensions per frame.

The HMM training starts with one three-state model per phoneme. Each state is modeled by a mixture of 17 diagonal Gaussians. After training, each phoneme model is cloned into context-dependent triphone models, one for each context in which the phoneme actually occurs (for example, the phoneme /A/ occurs in the context [d–A–g], as in the word /dAg/, "dog"). The triphone models cloned from the phonemes are then retrained, but, this time, only on the relevant subset of the data, corresponding to the given triphone context. Finally, these detailed models are clustered back into inventories of various sizes (from 2 to 20 times the size of the phonemic inventory) and retrained. Clustering is done state by state using a phonetic feature-based decision tree, and results in tying together the HMM states of linguistically similar triphones so as to maximize the likelihood of the data. The HMMs were built using the HMM Toolkit (HTK: Young et al., 2006).
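For illustration, the sketch below reproduces this 39-dimensional front-end in Python. It is a stand-in, not the authors' pipeline: the paper used HTK, whereas this sketch uses librosa, whose MFCC implementation and treatment of the energy term differ in detail.

```python
import librosa
import numpy as np

def mfcc_39(wav_path, sr=16000):
    """Frames of 12 MFCCs plus an energy-like c0, with first and second
    order derivatives appended: shape (n_frames, 39). A sketch of the
    front-end of Section 2.2; parameter choices are approximations."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(0.010 * sr)   # 10 ms frame step
    win = int(0.025 * sr)   # 25 ms analysis window
    # 13 coefficients: c1..c12 plus c0, which stands in for HTK's energy term
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                hop_length=hop, n_fft=win)
    delta = librosa.feature.delta(mfcc)            # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)  # second derivatives
    return np.vstack([mfcc, delta, delta2]).T
```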

2.3 The proto-lexicon

Finding word boundaries in the continuous sequence of phones is part of the problem infants have to solve without direct supervision. We model this segmentation using a state-of-the-art unsupervised word segmentation model based on the Adaptor Grammar framework (Johnson et al., 2007). The input consists of a phonetic transcription of the corpus, with boundaries between words eliminated (we vary this transcription to correspond to different inventories with different granularity in the allophonic representation, as explained above). The model tries to reconstruct the boundaries based on a Pitman-Yor process (Pitman and Yor, 1997), which uses a language-general statistical learning process to find a compact representation of the input. The algorithm stores high frequency chunks and re-uses them to parse novel utterances. We use a grammar which learns a hierarchy of three levels of chunking and take the intermediate level to correspond to the lexical level. This grammar was shown by Fourtassi et al. (2013) to avoid both over-segmentation and under-segmentation.

2.4 The proto-semantics

It has been shown that infants can keep track of co-occurrence statistics (see Lany and Saffran (2013) for a review). This ability can be used to develop a sense of semantic similarity, as suggested by Harris (1954). The intuition behind the distributional hypothesis is that words that are similar in meaning occur in similar contexts. In order to model the acquisition of this semantic similarity from a transcribed and segmented corpus, we use one of the simplest and most commonly used distributional semantic models, Latent Semantic Analysis (LSA: Landauer & Dumais, 1997). The LSA algorithm takes as input a matrix consisting of rows representing word types and columns representing contexts in which tokens of the word type occur. A context is defined as a fixed number of utterances. Singular value decomposition (a kind of matrix factorization) is used to extract a more compact representation. The cosine of the angle between vectors in the resulting space is used to measure the semantic similarity between words. Two words have a high semantic similarity if they have similar distributions, i.e., if they co-occur in most contexts. The model parameters, namely the dimension of the semantic space and the number of utterances to be taken as defining the context of a given word form, are set in an unsupervised way to optimize the latent structure of the semantic model (Fourtassi and Dupoux, 2013). Thus, we use 20 utterances as a semantic window and set the semantic space to 100 dimensions.
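To make this concrete, here is a minimal LSA sketch in plain numpy (our own naming; the paper does not specify an implementation). It builds the word-by-context count matrix over 20-utterance blocks, truncates the SVD to 100 dimensions, and defines the cosine similarity that is reused by the cues below. A dense SVD is fine at sketch scale; a sparse solver would be preferable for large vocabularies.

```python
import numpy as np
from collections import Counter

def lsa_vectors(blocks, dim=100):
    """LSA word vectors from a corpus split into context blocks.

    `blocks` is a list of contexts (here, 20-utterance chunks), each a
    list of word tokens. Returns a dict mapping word -> dense vector."""
    vocab = sorted({w for block in blocks for w in block})
    index = {w: i for i, w in enumerate(vocab)}
    # Word-by-context count matrix (rows: word types, columns: blocks).
    M = np.zeros((len(vocab), len(blocks)))
    for j, block in enumerate(blocks):
        for w, count in Counter(block).items():
            M[index[w], j] = count
    # Truncated SVD keeps the `dim` largest singular components.
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    V = U[:, :dim] * S[:dim]
    return {w: V[index[w]] for w in vocab}

def cosine(u, v):
    """Cosine of the angle between two word vectors, the semantic
    similarity measure used throughout this paper."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```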

3 Method

Here we explore whether the approximate high level representations, built bottom-up and without supervision, still contain useful information one can use to refine the phonetic categories into phoneme-like units. To this end, we extract potential cues from the lexical and the semantic information, and test their performance in discriminating allophonic contrasts from non-allophonic (phonemic) contrasts.

3.1 Top-down cues

3.1.1 Lexical cue

The top-down information from the lexicon is based on the insight of Martin et al. (2013). It rests on the idea that true lexical minimal pairs are not very frequent in human languages, as compared to minimal pairs due to mere phonological processes (Figure 1). The latter create alternants of the same word, since adjacent sounds condition the realization of the first and final phoneme. Therefore, finding a pair of words differing in the first or last segment (as in [kanaχ] and [kanaʁ]) is good evidence that these two phones ([ʁ], [χ]) are allophones of one another. Conversely, if a pair of phones does not form any minimal pair, it is classified as non-allophonic (phonemic).

However, this binary strategy clearly gives rise to false alarms in the (albeit relatively rare) case of true minimal pairs like [kanaχ] ("duck") and [kanal] ("canal"), where ([χ], [l]) will be mistakenly labeled as allophonic. In order to mitigate the problem of false alarms, we use Boruta's continuous version (Boruta, 2011) and define the lexical cue of a pair of phones, Lex(x, y), as the number of lexical minimal pairs that vary on the first segment (xA, yA) or the last segment (Ax, Ay). The higher this number, the more the pair of phones is likely to be considered allophonic.

The lexical cue is consistent with experimental findings. For example, Feldman et al. (2013b) showed that 8 month-old infants pay attention to word level information, and demonstrated that they do not discriminate between sound contrasts that occur in minimal pairs (as suggested by our cue), and, conversely, discriminate contrasts that occur in non-minimal pairs.

3.1.2 Semantic cue

The semantic cue is based on the intuition that true minimal pairs ([kanaχ] and [kanal]) are associated with different events, whereas alternants of the same word ([kanaχ] and [kanaʁ]) are expected to co-occur with similar events.

We operationalize the semantic cue associated with a pair of phones, Sem(x, y), as the average semantic similarity between all the lexical minimal pairs generated by this pair of phones. The higher the average semantic similarity, the more the learner is prone to classify them as allophonic. We take as a measure of the semantic similarity the cosine of the angle between the word vectors of the pairs that vary on the final segment, cos(Ax, Ay), or the first segment, cos(xA, yA).

This strategy is similar in principle to the phenomenon of acquired distinctiveness, according to which pairing two target stimuli with distinct events enhances their perceptual differentiation, and acquired equivalence, whereby pairing two target stimuli with the same event impairs their subsequent differentiation (Lawrence, 1949). In the same vein, Yeung and Werker (2009) tested 9-month-old English-learning infants in a task that consists in discriminating two non-native phonetic categories. They found that infants succeeded only when the categories co-occurred with two distinct visual cues.
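The two cues can be made concrete with the following sketch (our own naming and data layout, not the authors' code; `cosine` and the word vectors come from the LSA sketch above). Words are represented as tuples of phone symbols, and x ≠ y is assumed.

```python
def minimal_pairs(lexicon, x, y):
    """All pairs of words differing only in x vs. y at the first or last
    segment, i.e. (xA, yA) or (Ax, Ay). `lexicon` is a set of phone tuples."""
    pairs = []
    for w in lexicon:
        for i in {0, len(w) - 1}:          # first or last segment only
            if w[i] == x:
                v = w[:i] + (y,) + w[i + 1:]
                if v in lexicon:
                    pairs.append((w, v))
    return pairs

def lex_cue(lexicon, x, y):
    """Lex(x, y): the number of minimal pairs generated by x and y.
    High values suggest an allophonic pair."""
    return len(minimal_pairs(lexicon, x, y))

def sem_cue(lexicon, vectors, x, y):
    """Sem(x, y): the average cosine similarity between the word vectors
    of those minimal pairs; alternants of one word should score high."""
    sims = [cosine(vectors[w], vectors[v])
            for w, v in minimal_pairs(lexicon, x, y)]
    return sum(sims) / len(sims) if sims else 0.0
```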

              Segmentation                        Lexicon
              English           Japanese         English           Japanese
Allo./phon.   F     P     R     F     P     R    F     P     R     F     P     R
 2            0.61  0.57  0.65  0.45  0.44  0.47 0.29  0.42  0.22  0.23  0.54  0.15
 4            0.52  0.46  0.59  0.38  0.34  0.43 0.22  0.37  0.15  0.16  0.50  0.10
10            0.51  0.45  0.59  0.34  0.30  0.38 0.21  0.34  0.16  0.16  0.41  0.10
20            0.42  0.38  0.47  0.28  0.26  0.32 0.21  0.29  0.17  0.16  0.32  0.10

Table 1: Scores of the segmentation and the resulting lexicon, as a function of the average number of allophones per phoneme. P = Precision, R = Recall, F = F-score.
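For reference, the token-level scores reported in Table 1 can be computed as in the sketch below (our own minimal implementation of the Brent (1999) / Goldwater et al. (2009) measures defined in Section 4.1; utterances are lists of word strings, one character per phone symbol).

```python
def token_prf(predicted, gold):
    """Token-level precision, recall and F-score over segmented utterances.

    A word token counts as correct only if both of its boundaries match
    the gold segmentation, i.e. its (start, end) span is identical."""
    def spans(utterance):
        out, pos = set(), 0
        for w in utterance:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out

    correct = posited = true = 0
    for p_utt, g_utt in zip(predicted, gold):
        p, g = spans(p_utt), spans(g_utt)
        correct += len(p & g)   # tokens with both boundaries correct
        posited += len(p)       # tokens proposed by the segmenter
        true += len(g)          # tokens in the ideal segmentation
    P = correct / posited
    R = correct / true
    return P, R, 2 * P * R / (P + R)
```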

3.1.3 Combined cue

Finally, we consider the combination of both cues in one single cue, where the contextual information (semantics) is used as a weighting scheme for the lexical information, as follows:

Comb(x, y) = Σ_{(Ax, Ay) ∈ L²} cos(Ax, Ay) + Σ_{(xA, yA) ∈ L²} cos(xA, yA)    (1)

where {Ax ∈ L} is the set of words in the lexicon L that end in the phone x, and {(Ax, Ay) ∈ L²} is the set of phonological minimal pairs in L × L that vary on the final segment.

The lexical cue is incremented by one for every minimal pair. The combined cue is, instead, incremented by one times the cosine of the angle between the word vectors of this pair. When the words have similar distributions, the angle goes to zero and the cosine goes to 1; when the words have orthogonal distributions, the angle goes to 90° and the cosine goes to 0.

The semantic information here would basically enable us to avoid false alarms generated by potential true minimal pairs like the above-mentioned example of [kanaχ] and [kanal]. Such a pair will probably score high as far as the lexical cue is concerned, but it will score low on the semantic level. Thus, by taking the combination, the model will be less prone to mistakenly classify ([χ], [l]) as allophones.

3.2 Task

For each corpus we list all possible pairs of allophones. Some of these pairs are allophones of the same phoneme (allophonic pairs) and others are allophones of different phonemes (non-allophonic pairs). The task is a same-different classification, whereby each of these pairs is given a score from the cue that is being tested. A good cue gives higher scores to allophonic pairs.

Only pairs of phones that generate at least one lexical minimal pair are considered. Phonetic variation that does not cause lexical variation is "invisible" to top-down strategies, and is, therefore, more probably clustered through purely bottom-up strategies (Fourtassi et al., 2014b).

3.3 Evaluation

We use the same evaluation procedure as Martin et al. (2013). This is carried out by computing the associated ROC curve (varying the z-score threshold and computing the resulting proportions of misses and false alarms). We then derive the Area Under the Curve (AUC), which also corresponds to the probability that, given two pairs of phones, one allophonic and one not, they are correctly classified on the basis of the score. A value of 0.5 represents chance and a value of 1 represents perfect performance.

In order to lessen the potential influence of the structure of the corpus (mainly the order of the utterances) on the results, we use a statistical resampling scheme. The corpus is divided into small blocks of 20 utterances each (the semantic window). In each run, we draw randomly with replacement from this set of blocks a sample of the same size as the original corpus. This sample is then used to retrain the acoustic models and generate a phonetic inventory that we use to re-transcribe the corpus and re-compute the cues. We report scores averaged over 5 such runs.
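Continuing the sketch above (and reusing `minimal_pairs`, `cosine`, and the LSA vectors), the combined cue of equation (1) and the pairwise form of the AUC can be written as follows; the z-score sweep and the block resampling are omitted, and all names are illustrative.

```python
import numpy as np

def comb_cue(lexicon, vectors, x, y):
    """Comb(x, y), equation (1): each minimal pair contributes the cosine
    between its two word vectors, rather than a flat increment of one."""
    return sum(cosine(vectors[w], vectors[v])
               for w, v in minimal_pairs(lexicon, x, y))

def auc(allophonic_scores, phonemic_scores):
    """Area under the ROC curve, computed directly as the probability that
    a randomly drawn allophonic pair outscores a randomly drawn phonemic
    pair (ties count one half), as described in Section 3.3."""
    a = np.asarray(allophonic_scores, dtype=float)[:, None]
    p = np.asarray(phonemic_scores, dtype=float)[None, :]
    wins = (a > p).sum() + 0.5 * (a == p).sum()
    return float(wins / (a.size * p.size))
```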

4 Results and discussion

4.1 Segmentation

We first explore how phonetic variation influences the quality of the segmentation and the resulting lexicon. For the evaluation, we use the same measures as Brent (1999) and Goldwater et al. (2009), namely Segmentation Precision (P), Recall (R) and F-score (F). Segmentation precision is defined as the number of correct word tokens found, out of all tokens posited. Recall is the number of correct word tokens found, out of all tokens in the ideal segmentation. The F-score is defined as the harmonic mean of Precision and Recall:

F = 2 × P × R / (P + R)

(For instance, with the English two-allophone segmentation scores in Table 1, F = 2 × 0.57 × 0.65 / (0.57 + 0.65) ≈ 0.61.)

We define similar measures for word types (lexicon). Table 1 shows the scores as a function of the number of allophones per phoneme. For both corpora, the segmentation performance decreases as we increase the number of allophones. As for the lexicon, the recall scores show that only 15 to 22% of the 'words' found by the algorithm in the English corpus are real words; in Japanese, this number is even lower (between 10 and 15%). This pattern can be attributed in part to the fact that increasing the number of allophones increases the number of word forms, which therefore occur with less frequency, making the statistical learning harder. Table 2 shows the average number of word forms per word as a function of the average number of allophones per phoneme, in the case of ideal segmentation.

Allo./Phon.   W. forms/Word
              English   Japanese
 2            1.56      1.20
 4            2.03      1.64
10            2.69      2.11
20            3.47      2.83

Table 2: Average number of word forms per word as a function of the average number of allophones per phoneme.

Another effect seen in Table 1 is the lower overall performance on Japanese compared to English. This difference was shown by Fourtassi et al. (2013) to be linked to the intrinsic segmentation ambiguity of Japanese, caused by the fact that Japanese words contain more embedded words compared to English.

4.2 Allophonic vs phonemic status of sound contrasts

Here we test the performance of the cues described above in discriminating allophonic contrasts from phonemic ones. We vary the number of allophones per phoneme, on the one hand (Figure 2a), and the amount of data available to the learner, on the other hand, in the case of two allophones per phoneme (Figure 2b). In both situations, we compare the case wherein the lexical and semantic cues are computed on the output of the unsupervised segmentation (right), to the control case where these cues are computed on the ideally segmented speech (left).

We see that the overall accuracy of the cues is quite high, even in the case of bad word segmentation and a very small amount of data.

The lexical cue is robust to extreme variation and to the scarcity of data. Indeed, it does not seem to vary monotonically either with the number of allophones or with the size of the corpus. The associated AUC score generally remains above the value of 0.7 (chance level is 0.5). The semantics, on the other hand, gets better as the variability decreases and as the amount of data increases. This is a natural consequence of the fact that the semantic structure is more accurate with more data and with word forms consistent enough to sustain reasonable co-occurrence statistics.

The comparison with the ideal segmentation shows, interestingly, that the semantics is more robust to segmentation errors than the lexical cue. In fact, while the lexical strategy performs, overall, better than the semantics under the ideal segmentation, the pattern reverses as we move to a more realistic (unsupervised) segmentation.

These results suggest that both lexical and semantic strategies can be crucial to learning the phonemic status of phonetic categories, since they provide non-redundant information. This finding is summarized by the combined cue, which resists both variation and segmentation errors, overall, better than each of the cues taken alone.

From a developmental point of view, this shows that infants can, in principle, benefit from higher level linguistic structures to refine their phonetic categories, even if these structures are rudimentary. Previous studies about top-down strategies have mainly emphasized the role of word forms; the results of this work show that the semantics can be at least as useful. Note that the notion of semantics used here is weaker than the classic notion of referential semantics, as in word-concept matching. The latter might, indeed, not be fully operative at the early stages of child development, since it requires some advanced conceptual abilities (like forming symbolic representations and understanding a speaker's referential intentions) (Waxman and Gelman, 2009). What we call the "semantics" of a word in this study is the general context provided by its co-occurrence with other words. Infants have been shown to have a powerful mechanism for tracking co-occurrence relationships both in the speech and the visual domain (Lany and Saffran, 2013). Our experiments demonstrate that a similar mechanism could be enough to develop a sense of semantic similarity that can successfully be used to refine phonetic categories.

[Figure 2 here: line plots of AUC against the number of allophones per phoneme (panel a) and against corpus size in hours (panel b), for the lexical, semantic, and combined cues, in English and Japanese, under both ideal and unsupervised segmentation.]

Figure 2: Same-different scores (AUC) for the different cues as a function of the average number of allophones per phoneme (a), and as a function of the size of the corpus, in the case of two allophones per phoneme (b). The scores are shown for both ideal and unsupervised word segmentation in English and Japanese. The points show the mean scores over 5 runs. The lines are smoothed interpolations (local regressions) through the means. The grey band shows a 95% confidence interval.

5 General discussion and future work

Phonemes are abstract categories that form the basis for words in the lexicon. There is a traditional view that they should be defined by their ability to contrast word meanings (Trubetzkoy, 1939). Their full acquisition, therefore, requires lexical and semantic top-down information. However, since the quality of the semantic representations depends on the quality of the phonemic representations that are used to build the lexicon, we face a chicken-and-egg problem. In this paper, we proposed a way to break the circularity by building approximate representations at all levels.

The infants' initial attunement to language-specific categories was represented in a way that closely mirrors the linguistic and statistical properties of the speech. We showed that this detailed (proto-phonemic) inventory enabled word segmentation from continuous transcribed speech, but, as expected, resulted in a low quality lexicon. The poorly segmented corpus was then used to derive a semantic similarity matrix between pairs of words, based on their co-occurrence statistics. The results showed that information from the derived lexicon and semantics, albeit very rudimentary, helps discriminate between allophonic and phonemic contrasts with a high degree of accuracy. Thus, this work strongly supports the claim that the lexicon and semantics play a role in the refinement of the phonemic inventory (Feldman et al., 2013a; Frank et al., 2014), and, interestingly, that this role remains functional under more realistic assumptions (unsupervised word segmentation, and bottom-up inferred semantics). We also found that lexical and semantic information were not redundant and could be usefully combined, the former being more resistant to the scarcity of data and variation, and the latter being more resistant to segmentation errors.

That being said, this work relies on the assumption that infants start with initial perceptual categories (allophones), but we did not show how such categories could be constructed from raw speech. More work is needed to explore the robustness of the model when these units are learned in an unsupervised fashion (Lee and Glass, 2012; Huijbregts et al., 2011; Jansen and Church, 2011; Varadarajan et al., 2008).

This work could be seen as a proof of principle for an iterative learning algorithm, whereby phonemes emerge from the interaction of low level perceptual categories, word forms, and the semantics (see Werker and Curtin (2005) for a similar theoretical proposition). The algorithm has yet to be implemented, but it has to address at least two major issues. First, some sound pairs are not captured by top-down cues because they do not surface as minimal word forms. For instance, in English, /h/ and /ŋ/ occur in different syllable positions and therefore cannot appear in any minimal pair. Second, even if we have enough information about how phonetic categories are organized in the perceptual space, we still need to know how many categories are relevant in a particular language (i.e., where to stop the categorization process).

For the first problem, Fourtassi et al. (2014b) showed that the gap could, in principle, be filled by bottom-up information (like acoustic similarity). As for the second problem, a possible direction could be found in the notion of Self-Consistency. In fact, Fourtassi et al. (2014a) proposed that an optimal level of clustering is also a level that globally optimizes the predictive power of the lexicon. Too detailed allophones result in too many synonyms. Too broad classes result in too many homophones. Somewhere in the middle, the optimal number of phonemes optimizes how well lexical items predict each other. Future work will address these issues in more detail in order to propose a complete phoneme learning algorithm.

Acknowledgments

This work was supported in part by the European Research Council (ERC-2011-AdG-295810 BOOTPHON), the Agence Nationale pour la Recherche (ANR-10-LABX-0087 IEC, ANR-10-IDEX-0001-02 PSL*), the Fondation de France, the Ecole de Neurosciences de Paris, and the Région Ile de France (DIM cerveau et pensée).

References

Elika Bergelson and Daniel Swingley. 2012. At 6 to 9 months, human infants know the meanings of many common nouns. Proceedings of the National Academy of Sciences, 109(9).

Luc Boruta, Sharon Peperkamp, Benoît Crabbé, and Emmanuel Dupoux. 2011. Testing the robustness of online word segmentation: Effects of linguistic diversity and phonetic variation. In Proceedings of CMCL, pages 1–9. Association for Computational Linguistics.

Luc Boruta. 2011. Combining indicators of allophony. In Proceedings of ACL-SRW, pages 88–93.

Luc Boruta. 2012. Indicateurs d'allophonie et de phonémicité. Doctoral dissertation, Université Paris-Diderot - Paris VII.

M. Brent. 1999. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34:71–105.

N. Feldman, T. Griffiths, S. Goldwater, and J. Morgan. 2013a. A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4):751–778.

N. Feldman, B. Myers, K. White, T. Griffiths, and J. Morgan. 2013b. Word-level information influences phonetic learning in adults and infants. Cognition, 127:427–438.

Abdellah Fourtassi and Emmanuel Dupoux. 2013. A corpus-based evaluation method for distributional semantic models. In Proceedings of the ACL Student Research Workshop, pages 165–171, Sofia, Bulgaria. Association for Computational Linguistics.

Abdellah Fourtassi, Benjamin Börschinger, Mark Johnson, and Emmanuel Dupoux. 2013. Why is English so easy to segment? In Proceedings of CMCL, pages 1–10. Association for Computational Linguistics.

Abdellah Fourtassi, Ewan Dunbar, and Emmanuel Dupoux. 2014a. Self-consistency as an inductive bias in early language acquisition. In Proceedings of the 36th Annual Meeting of the Cognitive Science Society.

Abdellah Fourtassi, Thomas Schatz, Balakrishnan Varadarajan, and Emmanuel Dupoux. 2014b. Exploring the relative role of bottom-up and top-down information in phoneme learning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Stella Frank, Naomi Feldman, and Sharon Goldwater. 2014. Weak semantic context helps phonetic learning in a model of infant language acquisition. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Judit Gervain and Jacques Mehler. 2010. Speech perception and language acquisition in the first year of life. Annual Review of Psychology, 61:191–218.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Valerie Hazan and Sarah Barrett. 2000. The development of phonemic categorization in children aged 6 to 12. Journal of Phonetics, 28:377–396.

James Hillenbrand, Laura A. Getty, Michael J. Clark, and Kimberlee Wheeler. 1995. Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97:3099–3109.

M. Huijbregts, M. McLaren, and D. van Leeuwen. 2011. Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection. In Proceedings of ICASSP, pages 4436–4439.

A. Jansen and K. Church. 2011. Towards unsupervised training of speaker independent acoustic models. In Proceedings of INTERSPEECH, pages 1693–1696.

Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, Mike Seltzer, Pascal Clark, Ian McGraw, Balakrishnan Varadarajan, Erin Bennett, Benjamin Börschinger, Justin Chiu, Ewan Dunbar, Abdallah Fourtassi, David Harwath, Chia-ying Lee, Keith Levin, Atta Norouzian, Vijay Peddinti, Rachel Richardson, Thomas Schatz, and Samuel Thomas. 2013. A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition. In Proceedings of ICASSP.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Adaptor Grammars: A framework for specifying compositional nonparametric Bayesian models. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 641–648. MIT Press, Cambridge, MA.

Peter W. Jusczyk and Richard N. Aslin. 1995. Infants' detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29(1):1–23.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211–240.

J. Lany and J. Saffran. 2013. Statistical learning mechanisms in infancy. In J. Rubenstein and P. Rakic, editors, Comprehensive Developmental Neuroscience: Neural Circuit Development and Function in the Brain, volume 3, pages 231–248. Elsevier, Amsterdam.

D. H. Lawrence. 1949. Acquired distinctiveness of cues: I. Transfer between discriminations on the basis of familiarity with the stimulus. Journal of Experimental Psychology, 39(6):770–784.

C. Lee and J. Glass. 2012. A nonparametric Bayesian approach to acoustic model discovery. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 40–49.

Bogdan Ludusan, Maarten Versteegh, Aren Jansen, Guillaume Gravier, Xuan-Nga Cao, Mark Johnson, and Emmanuel Dupoux. 2014. Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems. In Proceedings of LREC.

Kikuo Maekawa, Hanae Koiso, Sadaoki Furui, and Hitoshi Isahara. 2000. Spontaneous speech corpus of Japanese. In LREC, pages 947–952, Athens, Greece.

D. R. Mandel, P. W. Jusczyk, and D. B. Pisoni. 1995. Infants' recognition of the sound patterns of their own names. Psychological Science, 6(5):314–317.

Andrew Martin, Sharon Peperkamp, and Emmanuel Dupoux. 2013. Learning phonemes with a proto-lexicon. Cognitive Science, 37(1):103–124.

J. Maye, J. F. Werker, and L. Gerken. 2002. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82:B101–B111.

C. Ngon, A. Martin, E. Dupoux, D. Cabrol, M. Dutat, and S. Peperkamp. 2013. (Non)words, (non)words, (non)words: evidence for a protolexicon during the first year of life. Developmental Science, 16(1):24–34.

S. Nittrouer. 1996. Discriminability and perceptual weighting of some acoustic cues to speech perception by 3-year-olds. Journal of Speech and Hearing Research, 39:278–297.

J. B. Pierrehumbert. 2003. Phonetic diversity, statistical learning, and acquisition of phonology. Language and Speech, 46(2-3):115–154.

J. Pitman and M. Yor. 1997. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855–900.

M. A. Pitt, L. Dilley, K. Johnson, S. Kiesling, W. Raymond, E. Hume, and E. Fosler-Lussier. 2007. Buckeye corpus of conversational speech.

M. Shukla, K. White, and R. Aslin. 2011. Prosody guides the rapid mapping of auditory word forms onto visual objects in 6-mo-old infants. Proceedings of the National Academy of Sciences, 108(15):6038–6043.

N. S. Trubetzkoy. 1939. Grundzüge der Phonologie (Principles of Phonology). Vandenhoeck & Ruprecht, Göttingen, Germany.

G. K. Vallabha, J. L. McClelland, F. Pons, J. F. Werker, and S. Amano. 2007. Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33):13273.

Balakrishnan Varadarajan, Sanjeev Khudanpur, and Emmanuel Dupoux. 2008. Unsupervised learning of acoustic sub-word units. In Proceedings of ACL-08: HLT, Short Papers, pages 165–168. Association for Computational Linguistics.

Sandra R. Waxman and Susan A. Gelman. 2009. Early word-learning entails reference, not merely associations. Trends in Cognitive Sciences, 13(6):258–263.

J. F. Werker and S. Curtin. 2005. PRIMIR: A developmental framework of infant speech processing. Language Learning and Development, 1(2):197–234.

Janet F. Werker and Richard C. Tees. 1984. Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1):49–63.

H. Yeung and J. Werker. 2009. Learning words' sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition, 113:234–243.

Steve J. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland. 2006. The HTK Book Version 3.4. Cambridge University Press.
