Weak Semantic Context Helps Phonetic Learning in a Model of Infant Language Acquisition
Stella Frank ([email protected])
ILCC, School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK

Naomi H. Feldman ([email protected])
Department of Linguistics, University of Maryland, College Park, MD 20742, USA

Sharon Goldwater ([email protected])
ILCC, School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK
Abstract

Learning phonetic categories is one of the first steps to learning a language, yet is hard to do using only distributional phonetic information. Semantics could potentially be useful, since words with different meanings have distinct phonetics, but it is unclear how many word meanings are known to infants learning phonetic categories. We show that attending to a weaker source of semantics, in the form of a distribution over topics in the current context, can lead to improvements in phonetic category learning. In our model, an extension of a previous model of joint word-form and phonetic category inference, the probability of word-forms is topic-dependent, enabling the model to find significantly better phonetic vowel categories and word-forms than a model with no semantic knowledge.

Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1073-1083, Baltimore, Maryland, USA, June 23-25 2014. (c) 2014 Association for Computational Linguistics

1 Introduction

Infants begin learning the phonetic categories of their native language in their first year (Kuhl et al., 1992; Polka and Werker, 1994; Werker and Tees, 1984). In theory, semantic information could offer a valuable cue for phoneme induction[1] by helping infants distinguish between minimal pairs, as linguists do (Trubetzkoy, 1939). However, due to a widespread assumption that infants do not know the meanings of many words at the age when they are learning phonetic categories (see Swingley, 2009 for a review), most recent models of early phonetic category acquisition have explored the phonetic learning problem in the absence of semantic information (de Boer and Kuhl, 2003; Dillon et al., 2013; Feldman et al., 2013a; McMurray et al., 2009; Vallabha et al., 2007).

[1] The models in this paper do not distinguish between phonetic and phonemic categories, since they do not capture phonological processes (and there are also none present in our synthetic data). We thus use the terms interchangeably.

Models without any semantic information are likely to underestimate infants' ability to learn phonetic categories. Infants learn language in the wild, and quickly attune to the fact that words have (possibly unknown) meanings. The extent of infants' semantic knowledge is not yet known, but existing evidence shows that six-month-olds can associate some words with their referents (Bergelson and Swingley, 2012; Tincoff and Jusczyk, 1999, 2012), leverage non-acoustic contexts such as objects or articulations to distinguish similar sounds (Teinonen et al., 2008; Yeung and Werker, 2009), and map meaning (in the form of objects or images) to new word-forms in some laboratory settings (Friedrich and Friederici, 2011; Gogate and Bahrick, 2001; Shukla et al., 2011). These findings indicate that young infants are sensitive to co-occurrences between linguistic stimuli and at least some aspects of the world.

In this paper we explore the potential contribution of semantic information to phonetic learning by formalizing a model in which learners attend to the word-level context in which phones appear (as in the lexical-phonetic learning model of Feldman et al., 2013a) and also to the situations in which word-forms are used. The modeled situations consist of combinations of categories of salient activities or objects, similar to the activity contexts explored by Roy et al. (2012), e.g., 'getting dressed' or 'eating breakfast'. We assume that child learners are able to infer a representation of the situational context from their non-linguistic environment. However, in our simulations we approximate the environmental information by running a topic model (Blei et al., 2003) over a corpus of child-directed speech to infer a topic distribution for each situation. These topic distributions are then used as input to our model to represent situational contexts.

The situational information in our model is similar to that assumed by theories of cross-situational word learning (Frank et al., 2009; Smith and Yu, 2008; Yu and Smith, 2007), but our model does not require learners to map individual words to their referents. Even in the absence of word-meaning mappings, situational information is potentially useful because similar-sounding words uttered in similar situations are more likely to be tokens of the same lexeme (containing the same phones) than similar-sounding words uttered in different situations.

In simulations of vowel learning, inspired by Vallabha et al. (2007) and Feldman et al. (2013a), we show a clear improvement over previous models in both phonetic and lexical (word-form) categorization when situational context is used as an additional source of information. This improvement is especially noticeable when the word-level context is providing less information, arguably the more realistic setting. These results demonstrate that relying on situational co-occurrence can improve phonetic learning, even if learners do not yet know the meanings of individual words.

2 Background and overview of models

Infants attend to distributional characteristics of their input (Maye et al., 2002, 2008), leading to the hypothesis that phonetic categories could be acquired on the basis of bottom-up distributional learning alone (de Boer and Kuhl, 2003; Vallabha et al., 2007; McMurray et al., 2009). However, this would require sound categories to be well separated, which often is not the case—for example, see Figure 1, which shows the English vowel space that is the focus of this paper.

Figure 1: The English vowel space (generated from Hillenbrand et al. (1995), see Section 6.2), plotted using the first two formants.

Recent work has investigated whether infants could overcome such distributional ambiguity by incorporating top-down information, in particular, the fact that phones appear within words. At six months, infants begin to recognize word-forms such as their name and other frequently occurring words (Mandel et al., 1995; Jusczyk and Hohne, 1997), without necessarily linking a meaning to these forms. This "protolexicon" can help differentiate phonetic categories by adding word contexts in which certain sound categories appear (Swingley, 2009; Feldman et al., 2013b). To explore this idea further, Feldman et al. (2013a) implemented the Lexical-Distributional (LD) model, which jointly learns a set of phonetic vowel categories and a set of word-forms containing those categories. Simulations showed that the use of lexical context greatly improved phonetic learning.

Our own Topic-Lexical-Distributional (TLD) model extends the LD model to include an additional type of context: the situations in which words appear. To motivate this extension and clarify the differences between the models, we now provide a high-level overview of both models; details are given in Sections 3 and 4.

2.1 Overview of LD model

Both the LD and TLD models are computational-level models of phonetic (specifically, vowel) categorization where phones (vowels) are presented to the model in the context of words.[2] The task is to infer a set of phonetic categories and a set of lexical items on the basis of the data observed for each word token x_i. In the original LD model, the observations for token x_i are its frame f_i, which consists of a list of consonants and slots for vowels, and the list of vowel tokens w_i. (The TLD model includes additional observations, described below.)

[2] For a related model that also tackles the word segmentation problem, see Elsner et al. (2013). In a model of phonological learning, Fourtassi and Dupoux (submitted) show that semantic context information similar to that used here remains useful despite segmentation errors.

A single vowel token, w_ij, is a two-dimensional vector representing the first two formants (peaks in the frequency spectrum, ordered from lowest to highest). For example, a token of the word kitty would have the frame f_i = k_t_, containing two consonant phones, /k/ and /t/, with two vowel phone slots in between, and two vowel formant vectors,
w_i0 = [464, 2294] and w_i1 = [412, 2760].[3]

[3] In simulations we also experiment with frames in which consonants are not represented perfectly.

Given the data, the model must assign each vowel token to a vowel category, w_ij = c. Both the LD and the TLD models do this using intermediate lexemes, l, which contain vowel category assignments, v_lj = c, as well as a frame f_l. If a word token is assigned to a lexeme, x_i = l, the vowels within the word are assigned to that lexeme's vowel categories, w_ij = v_lj = c.[4] The word and lexeme frames must match, f_i = f_l.

[4] The notation is overloaded: w_ij refers both to the vowel formants and the vowel category assignments, and x_i refers to both the token identity and its assignment to a lexeme.

Lexical information helps with phonetic categorization because it can disambiguate highly overlapping categories, such as the ae and eh categories in Figure 1. A purely distributional learner who observes a cluster of data points in the ae-eh region is likely to assume all these points belong to a single category because the distributions of the categories are so similar. However, a learner who attends to lexical context will notice a difference: contexts that only occur with ae will be observed in one part of the ae-eh region, while contexts that only occur with eh will be observed in a different (though partially overlapping) space. The learner then has evidence of two different categories occurring in different sets of lexemes.

Simulations with the LD model show that using lexical information to constrain phonetic learning can greatly improve categorization accuracy (Feldman et al., 2013a), but it can also introduce errors. When two word tokens contain the same consonant frame but different vowels (i.e., minimal pairs), the model is more likely to categorize those two vowels together. Thus, the model has trouble distinguishing minimal pairs. Although young children also have trouble with minimal pairs (Stager and Werker, 1997; Thiessen, 2007), the LD model may overestimate the degree of the problem. We hypothesize that if a learner is able to associate words with the contexts of their use (as children likely are), this could provide a weak source of information for disambiguating minimal pairs even without knowing their exact meanings. That is, if the learner hears kV1t and kV2t in different situational contexts, they are likely to be different lexical items (and V1 and V2 different phones), despite the lexical similarity between them.

2.2 Overview of TLD model

To demonstrate the benefit of situational information, we develop the Topic-Lexical-Distributional (TLD) model, which extends the LD model by assuming that words appear in situations analogous to documents in a topic model. Each situation h is associated with a mixture of topics theta_h, which is assumed to be observed. Thus, for the ith token in situation h, denoted x_hi, the observed data will be its frame f_hi, vowels w_hi, and topic vector theta_h.

From an acquisition perspective, the observed topic distribution represents the child's knowledge of the context of the interaction: she can distinguish bathtime from dinnertime, and is able to recognize that some topics appear in certain contexts (e.g. animals on walks, vegetables at dinnertime) and not in others (few vegetables appear at bathtime). We assume that the child would learn these topics from observing the world around her and the co-occurrences of entities and activities in the world. Within any given situation, there might be a mixture of different (actual or possible) topics that are salient to the child. We assume further that as the child learns the language, she will begin to associate specific words with each topic as well.

Thus, in the TLD model, the words used in a situation are topic-dependent, implying meaning, but without pinpointing specific referents. Although the model observes the distribution of topics in each situation (corresponding to the child observing her non-linguistic environment), it must learn to associate each (phonetically and lexically ambiguous) word token with a particular topic from that distribution. The occurrence of similar-sounding words in different situations with mostly non-overlapping topics will provide evidence that those words belong to different topics and that they are therefore different lexemes. Conversely, potential minimal pairs that occur in situations with similar topic distributions are more likely to belong to the same topic and thus the same lexeme.

Although we assume that children infer topic distributions from the non-linguistic environment, we will use transcripts from CHILDES to create the word/phone learning input for our model. These transcripts are not annotated with environmental context, but Roy et al. (2012) found that topics learned from similar transcript data using a topic model were strongly correlated with immediate activities and contexts. We therefore obtain the topic distributions used as input to the TLD model by
training an LDA topic model (Blei et al., 2003) on a superset of the child-directed transcript data we use for lexical-phonetic learning, dividing the transcripts into small sections (the 'documents' in LDA) that serve as our distinct situations h. As noted above, the learned document-topic distributions theta are treated as observed variables in the TLD model to represent the situational context. The topic-word distributions learned by LDA are discarded, since these are based on the (correct and unambiguous) words in the transcript, whereas the TLD model is presented with phonetically ambiguous versions of these word tokens and must learn to disambiguate them and associate them with topics.

3 Lexical-Distributional Model

In this section we describe more formally the generative process for the LD model (Feldman et al., 2013a), a joint Bayesian model over phonetic categories and a lexicon, before describing the TLD extension in the following section.

The set of phonetic categories and the lexicon are both modeled using non-parametric Dirichlet Process priors, which return a potentially infinite number of categories or lexemes. A DP is parametrized as DP(alpha, H), where alpha is a real-valued hyperparameter and H is a base distribution. H may be continuous, as when it generates phonetic categories in formant space, or discrete, as when it generates lexemes as a list of phonetic categories. A draw from a DP, G ~ DP(alpha, H), returns a distribution over a set of draws from H, i.e., a discrete distribution over a set of categories or lexemes generated by H. In the mixture model setting, the category assignments are then generated from G, with the datapoints themselves generated by the corresponding components from H. If H is infinite, the support of the DP is likewise infinite. During inference, we marginalize over G.

3.1 Phonetic Categories: IGMM

Following previous models of vowel learning (de Boer and Kuhl, 2003; Vallabha et al., 2007; McMurray et al., 2009; Dillon et al., 2013) we assume that vowel tokens are drawn from a Gaussian mixture model. The Infinite Gaussian Mixture Model (IGMM) (Rasmussen, 2000) includes a DP prior, as described above, in which the base distribution H_C generates multivariate Gaussians drawn from a Normal Inverse-Wishart prior.[5] Each observation, a formant vector w_ij, is drawn from the Gaussian corresponding to its category assignment c_ij:

  mu_c, Sigma_c ~ H_C = NIW(mu_0, Sigma_0, nu_0)   (1)
  G_C ~ DP(alpha_c, H_C)                           (2)
  c_ij ~ G_C                                       (3)
  w_ij | c_ij = c ~ N(mu_c, Sigma_c)               (4)

[5] This compound distribution is equivalent to Sigma_c ~ IW(Sigma_0, nu_0), mu_c | Sigma_c ~ N(mu_0, Sigma_c / nu_0).

The above model generates a category assignment c_ij for each vowel token w_ij. This is the baseline IGMM model, which clusters vowel tokens using bottom-up distributional information only; the LD model adds top-down information by assigning categories in the lexicon, rather than on the token level.

3.2 Lexicon

In the LD model, vowel phones appear within words drawn from the lexicon. Each such lexeme is represented as a frame plus a list of vowel categories v_l. Lexeme assignments for each token are drawn from a DP with a lexicon-generating base distribution H_L. The category for each vowel token in the word is determined by the lexeme; the formant values are drawn from the corresponding Gaussian as in the IGMM:

  G_L ~ DP(alpha_l, H_L)                (5)
  x_i = l ~ G_L                         (6)
  w_ij | v_lj = c ~ N(mu_c, Sigma_c)    (7)

H_L generates lexemes by first drawing the number of phones from a geometric distribution and the number of consonant phones from a binomial distribution. The consonants are then generated from a DP with a uniform base distribution (but note they are fixed at inference time, i.e., are observed categorically), while the vowel phones v_l are generated by the IGMM DP above, v_lj ~ G_C. Note that two draws from H_L may result in identical lexemes; these are nonetheless considered to be separate (homophone) lexemes.

4 Topic-Lexical-Distributional Model

The TLD model retains the IGMM vowel phone component, but extends the lexicon of the LD model by adding topic-specific lexicons, which capture the notion that lexeme probabilities are topic-dependent. Specifically, the TLD model replaces
the Dirichlet Process lexicon with a Hierarchical Dirichlet Process (HDP; Teh (2006)). In the HDP lexicon, a top-level global lexicon is generated as in the LD model. Topic-specific lexicons are then drawn from the global lexicon, containing a subset of the global lexicon (but since the size of the global lexicon is unbounded, so are the topic-specific lexicons). These topic-specific lexicons are used to generate the tokens in a similar manner to the LD model. There are a fixed number of lower-level topic-lexicons; these are matched to the number of topics in the LDA model used to infer the topic distributions (see Section 6.4).

More formally, the global lexicon is generated as a top-level DP: G_L ~ DP(alpha_l, H_L) (see Section 3.2; remember H_L includes draws from the IGMM over vowel categories). G_L is in turn used as the base distribution in the topic-level DPs, G_k ~ DP(alpha_k, G_L). In the Chinese Restaurant Franchise metaphor often used to describe HDPs, G_L is a global menu of dishes (lexemes). The topic-specific lexicons are restaurants, each with its own distribution over dishes; this distribution is defined by seating customers (word tokens) at tables, each of which serves a single dish from the menu: all tokens x at the same table t are assigned to the same lexeme l_t. Inference (Section 5) is defined in terms of tables rather than lexemes; if multiple tables draw the same dish from G_L, tokens at these tables share a lexeme.

In the TLD model, tokens appear within situations, each of which has a distribution over topics theta_h. Each token x_hi has a co-indexed topic assignment variable, z_hi, drawn from theta_h, designating the topic-lexicon from which the table for x_hi is to be drawn. The formant values for w_hij are drawn in the same way as in the LD model, given the lexeme assignment at x_hi. This results in the following model, shown in Figure 2:

  G_L ~ DP(alpha_l, H_L)                             (8)
  G_k ~ DP(alpha_k, G_L)                             (9)
  z_hi ~ Mult(theta_h)                               (10)
  x_hi = t | z_hi = k ~ G_k                          (11)
  w_hij | x_hi = t, v_ltj = c ~ N(mu_c, Sigma_c)     (12)

Figure 2: TLD model, depicting, from left to right, the IGMM component, the LD lexicon component, the topic-specific lexicons, and finally the token x_hi, appearing in document h, with observed vowel formants w_hij and frame f_hi. The lexeme assignment x_hi and the topic assignment z_hi are inferred, the latter using the observed document-topic distribution theta_h. Note that f_hi is deterministic given the lexeme assignment. Squared nodes depict hyperparameters. lambda is the set of hyperparameters used by H_L when generating lexical items (see Section 3.2).

5 Inference: Gibbs Sampling

We use Gibbs sampling to infer three sets of variables in the TLD model: assignments to vowel categories in the lexemes, assignments of tokens to topics, and assignments of tokens to tables (from which the assignment to lexemes can be read off).

5.1 Sampling lexeme vowel categories

Each vowel in the lexicon must be assigned to a category in the IGMM. The posterior probability of a category assignment is composed of the DP prior over categories and the likelihood of the observed vowels belonging to that category. We use w_lj to denote the set of vowel formants at position j in words that have been assigned to lexeme l. Then,

  P(v_lj = c | w, x, l^\lj) propto P(v_lj = c | v^\lj) p(w_lj | v_lj = c, w^\lj)   (13)

The first (DP prior) factor is defined as:

  P(v_lj = c | v^\lj) = n_c / (sum_c' n_c' + alpha_c)       if c exists
                        alpha_c / (sum_c' n_c' + alpha_c)   if c is new       (14)

where n_c is the number of other vowels in the lexicon, v^\lj, assigned to category c. Note that there is always positive probability of creating a new category.

The likelihood of the vowels is calculated by marginalizing over all possible means and variances of the Gaussian category parameters, given
the NIW prior. For a single point (if |w_lj| = 1), this predictive posterior is in the form of a Student-t distribution; for the more general case see Feldman et al. (2013a), Eq. B3.

5.2 Sampling table & topic assignments

We jointly sample x and z, the variables assigning tokens to tables and topics. Resampling the table assignment includes the possibility of changing to a table with a different lexeme or drawing a new table with a previously seen or novel lexeme. The joint conditional probability of a table and topic assignment, given all other current token assignments, is:

  P(x_hi = t, z_hi = k | w_hi, theta_h, t^\hi, l, w^\hi)
    = P(k | theta_h) P(t | k, l_t, t^\hi) prod_{c in C} p(w_hi | v_lt = c, w_c^\hi)   (15)

The first factor, the prior probability of topic k in document h, is given by theta_hk obtained from the LDA. The second factor is the prior probability of assigning word x_hi to table t with lexeme l given topic k. It is given by the HDP, and depends on whether the table t exists in the HDP topic-lexicon for k and, likewise, whether any table in the topic-lexicon has the lexeme l:

  P(t | k, l, t^\hi) propto
    n_kt / (n_k + alpha_k)                                        if t in k
    alpha_k / (n_k + alpha_k) * m_l / (m + alpha_l)               if t new, l known
    alpha_k / (n_k + alpha_k) * alpha_l / (m + alpha_l)           if t and l new     (16)

Here n_kt is the number of other tokens at table t, n_k is the total number of tokens in topic k, m_l is the number of tables across all topics with the lexeme l, and m is the total number of tables.

The third factor, the likelihood of the vowel formants w_hi in the categories given by the lexeme v_l, is of the same form as the likelihood of vowel categories when resampling lexeme vowel assignments. However, here it is calculated over the set of vowels in the token assigned to each vowel category (i.e., the vowels at indices where v_lt = c). For a new lexeme, we approximate the likelihood using 100 samples drawn from the prior, each weighted by alpha/100 (Neal, 2000).

5.3 Hyperparameters

The three hyperparameters governing the HDP over the lexicon, alpha_l and alpha_k, and the DP over vowel categories, alpha_c, are estimated using a slice sampler. The remaining hyperparameters for the vowel category and lexeme priors are set to the same values used by Feldman et al. (2013a).

6 Experiments

6.1 Corpus

We test our model on situated child-directed speech, taken from the C1 section of the Brent corpus in CHILDES (Brent and Siskind, 2001; MacWhinney, 2000). This corpus consists of transcripts of speech directed at infants between the ages of 9 and 15 months, captured in a naturalistic setting as parent and child went about their day. This ensures variability of situations.

Utterances with unintelligible words or quotes are removed. We restrict the corpus to content words by retaining only words tagged as adj, n, part and v (adjectives, nouns, particles, and verbs). This is in line with evidence that infants distinguish content and function words on the basis of acoustic signals (Shi and Werker, 2003). Vowel categorization improves when attending only to more prosodically and phonologically salient tokens (Adriaans and Swingley, 2012), which generally appear within content, not function words. The final corpus consists of 13138 tokens and 1497 word types.

6.2 Hillenbrand Vowels

The transcripts do not include phonetic information, so, following Feldman et al. (2013a), we synthesize the formant values using data from Hillenbrand et al. (1995). This dataset consists of a set of 1669 manually gathered formant values from 139 American English speakers (men, women and children) for 12 vowels. For each vowel category, we construct a Gaussian from the mean and covariance of the datapoints belonging to that category, using the first and second formant values measured at steady state. We also construct a second dataset using only datapoints from adult female speakers.

Each word in the dataset is converted to a phonemic representation using the CMU pronunciation dictionary, which returns a sequence of Arpabet phoneme symbols. If there are multiple possible pronunciations, the first one is used. Each vowel phoneme in the word is then replaced by formant values drawn from the corresponding Hillenbrand Gaussian for that vowel.
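The synthesis procedure above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the (F1, F2) measurements below are toy stand-ins for the Hillenbrand data, and the function names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the Hillenbrand steady-state (F1, F2) measurements;
# the real dataset has 1669 tokens from 139 speakers for 12 vowels.
MEASUREMENTS = {
    "IY": np.array([[342.0, 2322.0], [320.0, 2400.0], [360.0, 2250.0]]),
    "AE": np.array([[588.0, 1952.0], [610.0, 2000.0], [570.0, 1900.0]]),
}

def fit_vowel_gaussians(measurements):
    """One Gaussian per vowel category, built from the mean and
    covariance of that category's datapoints."""
    return {v: (pts.mean(axis=0), np.cov(pts.T))
            for v, pts in measurements.items()}

def synthesize(pronunciation, gaussians):
    """Replace each vowel phoneme in an Arpabet pronunciation with
    formant values sampled from the corresponding Gaussian; consonants
    are kept as the frame, with '_' marking vowel slots."""
    frame, vowel_tokens = [], []
    for phone in pronunciation:
        if phone in gaussians:
            frame.append("_")
            mu, cov = gaussians[phone]
            vowel_tokens.append(rng.multivariate_normal(mu, cov))
        else:
            frame.append(phone)
    return frame, vowel_tokens

gaussians = fit_vowel_gaussians(MEASUREMENTS)
frame, vowels = synthesize(["K", "IY", "T", "IY"], gaussians)  # 'kitty'
```

Here the token for kitty comes out as the frame k_t_ plus two sampled formant vectors, mirroring the w_i0, w_i1 example of Section 2.1.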
6.3 Merging Consonant Categories

The Arpabet encoding used in the phonemic representation includes 24 consonants. We construct datasets both using the full set of consonants—the 'C24' dataset—and with less fine-grained consonant categories. Distinguishing all consonant categories assumes perfect learning of consonants prior to vowel categorization and is thus somewhat unrealistic (Polka and Werker, 1994), but provides an upper limit on the information that word-contexts can give.

In the 'C15' dataset, the voicing distinction is collapsed, leaving 15 consonant categories. The collapsed categories are B/P, G/K, D/T, CH/JH, V/F, TH/DH, S/Z, SH/ZH, R/L, while HH, M, NG, N, W, Y remain separate phonemes. This dataset mirrors the finding in Mani and Plunkett (2010) that 12 month old infants are not sensitive to voicing mispronunciations.

The 'C6' dataset distinguishes between only six coarse consonant phonemes, corresponding to stops (B, P, G, K, D, T), affricates (CH, JH), fricatives (V, F, TH, DH, S, Z, SH, ZH, HH), nasals (M, NG, N), liquids (R, L), and semivowels/glides (W, Y). This dataset makes minimal assumptions about the consonant categories that infants could use in this learning setting.

Decreasing the number of consonants increases the ambiguity in the corpus: bat not only shares a frame (b_t) with boat and bite, but also, in the C15 dataset, with put, pad and bad (b/p_d/t), and in the C6 dataset, with dog and kite, among many others (STOP_STOP). Table 1 shows the percentage of types and tokens that are ambiguous in each dataset, that is, words in frames that match multiple wordtypes. Note that we always evaluate against the gold word identities, even when these are not distinguished in the model's input. These datasets are intended to evaluate the degree of reliance on consonant information in the LD and TLD models, and to what extent the topics in the TLD model can replace this information.

  Dataset          C24    C15    C6
  Input Types      1487   1426   1203
  Frames           1259   1078   702
  Ambig Types %    27.2   42.0   80.4
  Ambig Tokens %   41.3   56.9   77.2

Table 1: Corpus statistics showing the increasing amount of ambiguity as consonant categories are merged. Input types are the number of word types with distinct input representations (as opposed to gold orthographic word types, of which there are 1497). Ambiguous types and tokens are those with frames that match multiple (orthographic) word types.

6.4 Topics

The input to the TLD model includes a distribution over topics for each situation, which we infer in advance from the full Brent corpus (not only the C1 subset) using LDA. Each transcript in the Brent corpus captures about 75 minutes of parent-child interaction, and thus multiple situations will be included in each file. The transcripts do not delimit situations, so we do this somewhat arbitrarily by splitting each transcript after 50 CDS utterances, resulting in 203 situations for the Brent C1 dataset. As well as function words, we also remove the five most frequent content words (be, go, get, want, come). On average, situations are only 59 words long, reflecting the relative lack of content words in CDS utterances.

We infer 50 topics for this set of situations using the mallet toolkit (McCallum, 2002). Hyperparameters are inferred, which leads to a dominant topic that includes mainly light verbs (have, let, see, do). The other topics are less frequent but capture stronger semantic meaning (e.g. yummy, peach, cookie, daddy, bib in one topic, shoe, let, put, hat, pants in another). The word-topic assignments are used to calculate unsmoothed situation-topic distributions theta used by the TLD model.

6.5 Evaluation

We evaluate against adult categories, i.e., the 'gold-standard', since all learners of a language eventually converge on similar categories. (Since our model is not a model of the learning process, we do not compare the infant learning process to the learning algorithm.) We evaluate both the inferred phonetic categories and words using the clustering evaluation measure V-Measure (VM; Rosenberg and Hirschberg, 2007).[6] VM is the harmonic mean of two components, similar to F-score, where the components (VC and VH) are measures of cross entropy between the gold and model categorization.

[6] Other clustering measures, such as 1-1 matching and pairwise precision and recall (accuracy and completeness) showed the same trends, but VM has been demonstrated to be the most stable measure when comparing solutions with varying numbers of clusters (Christodoulopoulos et al., 2010).
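As a reference point, V-Measure can be computed directly from label entropies. The sketch below follows Rosenberg and Hirschberg's definitions, with VH as homogeneity and VC as completeness (scikit-learn's `v_measure_score` computes the same quantity); it is an illustration, not the evaluation code used in the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a labelling."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(X | Y) for paired labellings xs, ys."""
    n = len(xs)
    h = 0.0
    for y in set(ys):
        sub = [x for x, yy in zip(xs, ys) if yy == y]
        h += len(sub) / n * entropy(sub)
    return h

def v_measure(gold, pred):
    """Harmonic mean of homogeneity (VH) and completeness (VC)."""
    vh = 1.0 if entropy(gold) == 0 else 1 - conditional_entropy(gold, pred) / entropy(gold)
    vc = 1.0 if entropy(pred) == 0 else 1 - conditional_entropy(pred, gold) / entropy(pred)
    return 0.0 if vh + vc == 0 else 2 * vh * vc / (vh + vc)
```

A relabelled but otherwise perfect clustering scores 1.0, while merging everything into a single cluster has zero homogeneity and therefore scores 0, which is why the supervowel categories discussed below are penalized by VH.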
For vowels, VM measures how well the inferred phonetic categorizations match the gold categories; for lexemes, it measures whether tokens have been assigned to the same lexemes both by the model and the gold standard. Words are evaluated against gold orthography, so homophones, e.g. hole and whole, are distinct gold words.

6.6 Results

We compare all three models—TLD, LD, and IGMM—on the vowel categorization task, and TLD and LD on the lexical categorization task (since IGMM does not infer a lexicon). The datasets correspond to two sets of conditions: firstly, either using vowel categories synthesized from all speakers or only adult female speakers, and secondly, varying the coarseness of the observed consonant categories. Each condition (model, vowel speakers, consonant set) is run five times, using 1500 iterations of Gibbs sampling with hyperparameter sampling. Overall, we find that TLD outperforms the other models in both tasks, across all conditions.

Vowel categorization results are shown in Figure 3. IGMM performs substantially worse than both TLD and LD, with scores more than 30 points lower than the best results for these models, clearly showing the value of the protolexicon and replicating the results found by Feldman et al. (2013a) on this dataset. Furthermore, TLD consistently outperforms the LD model, finding better phonetic categories, both for vowels generated from the combined categories of all speakers ('all') and vowels generated from adult female speakers only ('w'), although the latter are clearly much easier for both models to learn. Both models perform less well when the consonant frames provide less information, but the TLD model performance degrades less than the LD performance.

Figure 3: Vowel evaluation. 'all' refers to datasets with vowels synthesized from all speakers, 'w' to datasets with vowels synthesized from adult female speakers' vowels. The bars show a 95% Confidence Interval based on 5 runs. IGMM-all results in a VM score of 53.9 (CI=0.5); IGMM-w has a VM score of 65.0 (CI=0.2), not shown.

Both the TLD and the LD models find 'supervowel' categories, which cover multiple vowel categories and are used to merge minimal pairs into a single lexical item. Figure 4 shows example vowel categories inferred by the TLD model, including two supervowels. The TLD supervowels are used much less frequently than the supervowels found by the LD model, containing, on average, only two-thirds as many tokens.

Figure 4: Vowels found by the TLD model; supervowels are indicated in red. The gold-standard vowels are shown in gold in the background but are mostly overlapped by the inferred categories.

Figure 5 shows that TLD also outperforms LD on the lexeme/word categorization task. Again performance decreases as the consonant categories become coarser, but the additional semantic information in the TLD model compensates for the lack of consonant information. In the individual components of VM, TLD and LD have similar VC ("recall"), but TLD has higher VH ("precision"), demonstrating that the semantic information given by the topics can separate potentially ambiguous words, as hypothesized.

Overall, the contextual semantic information
added in the TLD model leads to both better phonetic categorization and to a better protolexicon, especially when the input is noisier, using degraded consonants. Since infants are not likely to have perfect knowledge of phonetic categories at this stage, semantic information is a potentially rich source of information that could be drawn upon to offset noise from other domains. The form of the semantic information added in the TLD model is itself quite weak, so the improvements shown here are in line with what infant learners could achieve.

Figure 5: Lexeme evaluation. 'all' refers to datasets with vowels synthesized from all speakers, 'w' to datasets with vowels synthesized from adult female speakers' vowels.

The contextual semantic information that the TLD model tracks is similar to that potentially used in other linguistic learning tasks. Theories of cross-situational word learning (Smith and Yu, 2008; Yu and Smith, 2007) assume that sensitivity to situational co-occurrences between words and non-linguistic contexts is a precursor to learning the meanings of individual words. Under this view, contextual semantics is available to infants well before they have acquired large numbers of semantic minimal pairs. However, recent experimental evidence indicates that learners do not always retain detailed information about the referents that are present in a scene when they hear a word (Medina et al., 2011; Trueswell et al., 2013). This evidence poses a direct challenge to theories of cross-situational word learning. Our account does not necessarily require learners to track co-occurrences between words and individual objects, but instead focuses on more abstract information about salient events and topics in the environment; it will be important to investigate to what extent infants encode this information and use it in phonetic learning.

Regardless of the specific way in which infants encode semantic information, our method of adding this information by using LDA topics from transcript data was shown to be effective. This method is practical because it can approximate semantic information without relying on extensive manual annotation.

The LD model extended the phonetic categorization task by adding word contexts; the TLD model presented here goes even further, adding larger situational contexts. Both forms of top-down information help the low-level task of classifying acoustic signals into phonetic categories, furthering a holistic view of language learning with interaction across multiple levels.

7 Conclusion

Language acquisition is a complex task, in which many heterogeneous sources of information may be useful. In this paper, we investigated whether contextual semantic information could be of help when learning phonetic categories. We found that this contextual information can improve phonetic learning performance considerably, especially in situations where there is a high degree of phonetic ambiguity in the word-forms that learners hear. This suggests that previous models that have ignored semantic information may have underestimated the information that is available to infants. Our model illustrates one way in which language learners might harness the rich information that is present in the world without first needing to acquire a full inventory of word meanings.

Acknowledgments

This work was supported by EPSRC grant EP/H050442/1 and a James S. McDonnell Foundation Scholar Award to the final author.

References

Frans Adriaans and Daniel Swingley. Distributional learning of vowel categories is supported by prosody in infant-directed speech. In Proceedings of the 34th Annual Conference of the Cognitive Science Society (CogSci), 2012.

E. Bergelson and D. Swingley. At 6-9 months, human infants know the meanings of many
common nouns. Proceedings of the National Academy of Sciences, 109(9):3253–3258, Feb 2012.

David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems 16, 2003.

Michael R. Brent and Jeffrey M. Siskind. The role of exposure to isolated words in early vocabulary development. Cognition, 81(2):B33–B44, 2001.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. Two decades of unsupervised POS induction: How far have we come? In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 575–584, Cambridge, MA, October 2010.

Bart de Boer and Patricia K. Kuhl. Investigating the role of infant-directed speech with a computer model. Acoustics Research Letters Online, 4(4):129, 2003.

Brian Dillon, Ewan Dunbar, and William Idsardi. A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37(2):344–377, Mar 2013.

Micha Elsner, Sharon Goldwater, Naomi Feldman, and Frank Wood. A cognitive model of early lexical acquisition with phonetic variability. In Proceedings of the 18th Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

Naomi H. Feldman, Thomas L. Griffiths, Sharon Goldwater, and James L. Morgan. A role for the developing lexicon in phonetic category acquisition. Psychological Review, 2013a.

Naomi H. Feldman, Emily B. Myers, Katherine S. White, Thomas L. Griffiths, and James L. Morgan. Word-level information influences phonetic learning in adults and infants. Cognition, 127(3):427–438, 2013b.

Abdellah Fourtassi and Emmanuel Dupoux. A rudimentary lexicon and semantics help bootstrap phoneme acquisition. Submitted.

Michael C. Frank, Noah D. Goodman, and Joshua B. Tenenbaum. Using speakers' referential intentions to model early cross-situational word learning. Psychological Science, 20(5):578–585, 2009.

Manuela Friedrich and Angela D. Friederici. Word learning in 6-month-olds: Fast encoding—weak retention. Journal of Cognitive Neuroscience, 23(11):3228–3240, Nov 2011.

Lakshmi J. Gogate and Lorraine E. Bahrick. Intersensory redundancy and 7-month-old infants' memory for arbitrary syllable-object relations. Infancy, 2(2):219–231, Apr 2001.

J. Hillenbrand, L. A. Getty, M. J. Clark, and K. Wheeler. Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5 Pt 1):3099–3111, May 1995.

P. W. Jusczyk and Elizabeth A. Hohne. Infants' memory for spoken words. Science, 277(5334):1984–1986, Sep 1997.

Patricia K. Kuhl, Karen A. Williams, Francisco Lacerda, Kenneth N. Stevens, and Björn Lindblom. Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255(5044):606–608, 1992.

Brian MacWhinney. The CHILDES Project: Tools for Analyzing Talk. Lawrence Erlbaum Associates, 2000.

D. R. Mandel, P. W. Jusczyk, and D. B. Pisoni. Infants' recognition of the sound patterns of their own names. Psychological Science, 6(5):314–317, Sep 1995.

Nivedita Mani and Kim Plunkett. Twelve-month-olds know their cups from their keps and tups. Infancy, 15(5):445–470, Sep 2010.

Jessica Maye, Daniel J. Weiss, and Richard N. Aslin. Statistical phonetic learning in infants: facilitation and feature generalization. Developmental Science, 11(1):122–134, Jan 2008.

Jessica Maye, Janet F. Werker, and LouAnn Gerken. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82(3):B101–B111, Jan 2002.

Andrew McCallum. MALLET: A machine learning for language toolkit, 2002.

Bob McMurray, Richard N. Aslin, and Joseph C. Toscano. Statistical learning of phonetic categories: insights from a computational approach. Developmental Science, 12(3):369–378, May 2009.

Tamara Nicol Medina, Jesse Snedeker, John C. Trueswell, and Lila R. Gleitman. How words can and cannot be learned by observation. Proceedings of the National Academy of Sciences, 108(22):9014–9019, 2011.

Radford Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.

Linda Polka and Janet F. Werker. Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20(2):421–435, 1994.

Carl Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 13, 2000.

Andrew Rosenberg and Julia Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 12th Conference on Empirical Methods in Natural Language Processing (EMNLP), 2007.

Brandon C. Roy, Michael C. Frank, and Deb Roy. Relating activity contexts to early word learning in dense longitudinal data. In Proceedings of the 34th Annual Conference of the Cognitive Science Society (CogSci), 2012.

Rushen Shi and Janet F. Werker. The basis of preference for lexical words in 6-month-old infants. Developmental Science, 6(5):484–488, 2003.

M. Shukla, K. S. White, and R. N. Aslin. Prosody guides the rapid mapping of auditory word forms onto visual objects in 6-mo-old infants. Proceedings of the National Academy of Sciences, 108(15):6038–6043, Apr 2011.

Linda B. Smith and Chen Yu. Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106(3):1558–1568, 2008.

Christine L. Stager and Janet F. Werker. Infants listen for more phonetic detail in speech perception than in word-learning tasks. Nature, 388:381–382, 1997.

D. Swingley. Contributions of infant word learning to language development. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1536):3617–3632, Nov 2009.

Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL), pages 985–992, Sydney, 2006.

Tuomas Teinonen, Richard N. Aslin, Paavo Alku, and Gergely Csibra. Visual speech contributes to phonetic learning in 6-month-old infants. Cognition, 108:850–855, 2008.

Erik D. Thiessen. The effect of distributional information on children's use of phonemic contrasts. Journal of Memory and Language, 56(1):16–34, Jan 2007.

R. Tincoff and P. W. Jusczyk. Some beginnings of word comprehension in 6-month-olds. Psychological Science, 10(2):172–175, Mar 1999.

Ruth Tincoff and Peter W. Jusczyk. Six-month-olds comprehend words that refer to parts of the body. Infancy, 17(4):432–444, Jul 2012.

N. S. Trubetzkoy. Grundzüge der Phonologie. Vandenhoeck und Ruprecht, Göttingen, 1939.

John C. Trueswell, Tamara Nicol Medina, Alon Hafri, and Lila R. Gleitman. Propose but verify: Fast mapping meets cross-situational word learning. Cognitive Psychology, 66:126–156, 2013.

G. K. Vallabha, J. L. McClelland, F. Pons, J. F. Werker, and S. Amano. Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33):13273–13278, Aug 2007.

Janet F. Werker and Richard C. Tees. Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7:49–63, 1984.

H. Henny Yeung and Janet F. Werker. Learning words' sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition, 113(2):234–243, Nov 2009.

Chen Yu and Linda B. Smith. Rapid word learning under uncertainty via cross-situational statistics. Psychological Science, 18(5):414–420, 2007.