
Statistics Learning and Universal Grammar: Modeling Word Segmentation

Timothy Gambell
59 Bishop Street
New Haven, CT 06511 USA
[email protected]

Charles Yang
Department of Linguistics, Yale University
New Haven, CT 06511 USA
[email protected]

Abstract

This paper describes a computational model of word segmentation and presents simulation results on realistic acquisition data. In particular, we explore the capacity and limitations of statistical learning mechanisms that have recently gained prominence in cognitive psychology and linguistics.

1 Introduction

Two facts about language learning are indisputable. First, only a human baby, but not her pet kitten, can learn a language. It is clear, then, that there must be some element in our biology that accounts for this unique ability. Chomsky's Universal Grammar (UG), an innate form of knowledge specific to language, is an account of what this ability is. This position gains support from formal learning theory [1-3], which sharpens the logical conclusion [4,5] that no (realistically efficient) learning is possible without a priori restrictions on the learning space. Second, it is also clear that no matter how much of a head start the child has through UG, language is learned. Phonology, lexicon, and grammar, while governed by universal principles and constraints, do vary from language to language, and they must be learned on the basis of linguistic experience. In other words, and indeed a truism, both endowment and learning contribute to language acquisition, the result of which is an extremely sophisticated body of linguistic knowledge. Consequently, both must be taken into account, explicitly, in a theory of language acquisition [6,7].

Controversies arise when it comes to the relative contributions of innate knowledge and experience-based learning. Some researchers, in particular linguists, approach language acquisition by characterizing the scope and limits of the innate principles of Universal Grammar that govern the world's languages. Others, in particular psychologists, tend to emphasize the role of experience and the child's domain-general learning ability. Such a division of research agendas understandably stems from the division of labor between endowment and learning: plainly, things that are built in needn't be learned, and things that can be garnered from experience needn't be built in.

The important paper of Saffran, Aslin, & Newport [8] on statistical learning (SL) suggests that children may be powerful learners after all. Very young infants can exploit transitional probabilities between syllables for the task of word segmentation, with only minimal exposure to an artificial language. Subsequent work has demonstrated SL in other domains, including artificial grammar learning [9], music [10], and vision [11], as well as in other species [12]. This raises the possibility of learning as an alternative to the innate endowment of linguistic knowledge [13].

We believe that the computational modeling of psychological processes, with special attention to concrete mechanisms and quantitative evaluations, can play an important role in the endowment vs. learning debate. Linguists' investigations of UG are rarely developmental, even less so corpus-oriented. Developmental psychologists, by contrast, often stop at identifying components in a cognitive task [14], without an account of how such components work together in an algorithmic manner. On the other hand, if computation is to be of relevance to linguistics, psychology, and cognitive science in general, being merely computational will not suffice. A model must be psychologically plausible, and ready to face its implications in broad empirical contexts [7]. For example, how does it generalize to typologically different languages? How does the model's behavior compare with that of human language learners and processors?

In this article, we present a simple computational model of word segmentation and some of its formal and developmental issues in child language acquisition. Specifically, we show that SL using transitional probabilities cannot reliably segment words when scaled to a realistic setting (e.g., child-directed English). To be successful, it must be constrained by knowledge of phonological structure. Indeed, the model reveals that SL may well be an artifact (an impressive one, nonetheless) that plays no role in actual word segmentation in human children.

2 Statistics does not Refute UG

It has been suggested [15, 8] that word segmentation from continuous speech may be achieved by using transitional probabilities (TP) between adjacent syllables A and B, where TP(A→B) = P(AB)/P(A), with P(AB) being the frequency of B following A, and P(A) the total frequency of A. Word boundaries are postulated at local minima, where the TP is lower than its neighbors. For example, given a sufficient amount of exposure to English, the learner may establish that, in the four-syllable sequence "prettybaby", TP(pre→tty) and TP(ba→by) are both higher than TP(tty→ba): a word boundary can be (correctly) postulated. It is remarkable that 8-month-old infants can extract three-syllable words from the continuous speech of an artificial language after only two minutes of exposure [8].
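To make the procedure concrete, the following is a minimal sketch in Python of TP estimation and local-minimum segmentation over syllabified utterances. The function names, the data layout, and the zero default for unseen syllable pairs are our illustrative assumptions, not the original implementation:

```python
from collections import defaultdict

def train_tps(utterances):
    """Estimate TP(A -> B) = P(AB) / P(A) from syllabified utterances.

    utterances: iterable of lists of syllables (word boundaries removed,
    utterance breaks preserved, as in the corpus described below).
    """
    pair_count = defaultdict(int)  # frequency of B immediately following A
    syll_count = defaultdict(int)  # total frequency of A
    for syllables in utterances:
        for a in syllables:
            syll_count[a] += 1
        for a, b in zip(syllables, syllables[1:]):
            pair_count[(a, b)] += 1
    return {(a, b): n / syll_count[a] for (a, b), n in pair_count.items()}

def segment_by_tp_minima(syllables, tp):
    """Postulate a word boundary at every local minimum of the TP curve."""
    tps = [tp.get((a, b), 0.0) for a, b in zip(syllables, syllables[1:])]
    boundaries = [i + 1 for i in range(1, len(tps) - 1)
                  if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]]
    words, start = [], 0
    for b in boundaries:
        words.append(syllables[start:b])
        start = b
    words.append(syllables[start:])
    return words
```

For ["pre", "tty", "ba", "by"], a dip at TP(tty→ba) yields the two words pretty and baby. Note that under this sketch's strict definition, the first and last TPs of an utterance can never be local minima, a first hint of the monosyllable problem discussed below.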
To be effective, a learning algorithm (indeed, any algorithm) must have an appropriate representation of the relevant learning data. We thus need to be cautious about interpreting the success of SL, as the authors themselves note [16]. If anything, it seems that the findings strengthen, rather than weaken, the case for (innate) linguistic knowledge. A classic argument for innateness [4, 5, 17] comes from the fact that syntactic operations are defined over specific types of data structures (constituents and phrases), but not over, say, linear strings of words, or numerous other logical possibilities. While infants seem to keep track of statistical information, any conclusion drawn from such findings must presuppose that children know what kind of statistical information to keep track of. After all, an infinite range of statistical correlations exists in the acoustic input: e.g., what is the probability of a syllable rhyming with the next? What is the probability of two adjacent vowels both being nasal? The fact that infants can use SL to segment syllable sequences at all entails that, at the minimum, they know the relevant unit of information over which correlative statistics are gathered: in this case, syllables, rather than segments or front vowels.

A host of questions then arises. First, how do they know this? It is quite possible that the primacy of syllables as the basic unit of speech is innately available, as suggested by neonate speech perception studies [18]. Second, where do the syllables come from? While the experiments in [8] used uniformly CV syllables, many languages, including English, make use of a far more diverse range of syllable types. Moreover, the syllabification of speech is far from trivial, and (most likely) involves both innate knowledge of phonological structure and the discovery of language-specific instantiations [14]. All these problems have to be solved before SL for word segmentation can take place.

3 The Model

To give a precise evaluation of SL in a realistic setting, we constructed a series of (embarrassingly simple) computational models tested on child-directed English.

The learning data consist of a random sample of child-directed English sentences from the CHILDES database [19]. The words were phonetically transcribed using the Carnegie Mellon Pronouncing Dictionary, and were then grouped into syllables. Spaces between words are removed; however, utterance breaks are available to the modeled learner. Altogether, there are 226,178 words, consisting of 263,660 syllables.

Implementing SL-based segmentation is straightforward. One first gathers pair-wise TPs from the training data, which are then used to identify local minima and postulate word boundaries during the on-line processing of syllable sequences. Scoring is done for each utterance and then averaged. Viewed as an information retrieval problem, it is customary [20] to report both the precision and the recall of the performance.
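The paper does not spell out its scoring code; the following is one plausible reading, assuming a predicted word token counts as correct only when both of its boundaries coincide with a word in the gold segmentation, with per-utterance scores averaged over the corpus. The function names and the span-matching criterion are our reconstruction:

```python
def score_utterance(predicted, gold):
    """Per-utterance token precision and recall.

    predicted, gold: lists of words, each word a list of syllables.
    A predicted word is a hit only if it spans exactly the same
    syllables as a gold word (both boundaries correct).
    """
    def spans(words):
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out
    hits = len(spans(predicted) & spans(gold))
    return hits / len(predicted), hits / len(gold)

def evaluate(model, corpus):
    """Average per-utterance precision/recall over the whole corpus.

    corpus: gold segmentations (lists of words); the model sees only
    the concatenated syllables of each utterance.
    """
    ps, rs = [], []
    for gold in corpus:
        flat = [s for w in gold for s in w]
        p, r = score_utterance(model(flat), gold)
        ps.append(p)
        rs.append(r)
    return sum(ps) / len(ps), sum(rs) / len(rs)
```

Under these assumptions, the TP learner above would be evaluated as, e.g., evaluate(lambda syls: segment_by_tp_minima(syls, tp), corpus).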
The segmentation results using TP local minima are remarkably poor, even under the assumption that the learner has already syllabified the input perfectly. Precision is 41.6%, and recall is 23.3%: over half of the words extracted by the model are not actual English words, while close to 80% of actual words fail to be extracted. It is straightforward to see why this is the case. In order for SL to be effective, the TP at an actual word boundary must be lower than its neighbors. Obviously, this condition cannot be met if the input is a sequence of monosyllabic words, for which a boundary must be postulated after every syllable; there are no local minima to speak of. While the pseudowords in [8] are uniformly three syllables long, much of child-directed English consists of sequences of monosyllabic words: corpus statistics reveal that, on average, a monosyllabic word is followed by another monosyllabic word 85% of the time. As long as this is the case, SL cannot, in principle, work.

4 Statistics Needs UG

This is not to say that SL cannot be effective for word segmentation. Its application must be constrained, like that of any learning algorithm, however powerful, as suggested by formal learning theories [1-3]. The performance improves dramatically, in fact, if the learner is equipped with even a small amount of prior knowledge about phonological structure. Specifically, we assume, uncontroversially, that each word can have only one primary stress. (This would not work for function words, however.) If the learner knows this, then it may limit the search for local minima to the window between two syllables that both bear primary stress, e.g., between the two a's in the sequence "languageacquisition". This assumption is plausible given that 7.5-month-old infants are sensitive to strong/weak prosodic distinctions [14]. When stress information suffices, no SL is employed, so "bigbadwolf" breaks into three words for free. Once this simple principle is built in, the stress-delimited SL algorithm achieves a precision of 73.5% and a recall of 71.2%, which compare favorably with the best performance reported in the literature [20]. (That work, however, uses a computationally prohibitive and psychologically implausible algorithm that iteratively optimizes the entire lexicon.)
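The stress-delimited learner is described above at the level of principle; the sketch below is one way to realize it. Picking the single lowest-TP break point within each inter-stress window is our assumption, since the text leaves open exactly how local minima are consulted within the window:

```python
def segment_with_stress(syllables, stressed, tp):
    """Stress-delimited SL: each word bears exactly one primary stress.

    syllables: list of syllables in an utterance;
    stressed:  parallel list of booleans (primary stress or not);
    tp:        pair-wise transitional probabilities, as above.

    Exactly one word boundary must fall between two consecutive primary
    stresses; TP decides only where within that window.  Adjacent
    stresses ("bigbadwolf") need no statistics at all.
    """
    strong = [i for i, s in enumerate(stressed) if s]
    boundaries = []
    for left, right in zip(strong, strong[1:]):
        if right == left + 1:
            boundaries.append(right)             # boundary for free
        else:
            window = range(left + 1, right + 1)  # candidate break points
            boundaries.append(min(
                window,
                key=lambda i: tp.get((syllables[i - 1], syllables[i]), 0.0)))
    words, start = [], 0
    for b in boundaries:
        words.append(syllables[start:b])
        start = b
    words.append(syllables[start:])
    return words
```

For "bigbadwolf", the three consecutive stresses force boundaries without consulting TP at all; for "languageacquisition", the one-stress-per-word constraint guarantees a single boundary somewhere between the two stressed a's, and TP merely locates it.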
The computational models complement the experimental finding that prosodic information takes priority over statistical information when both are available [21]. Yet again, one needs to be cautious about the improved performance, and a number of unresolved issues need to be addressed by future work. It remains possible that SL is not used at all in actual word segmentation. Once the one-word-one-stress principle is built in, we may consider a model that does not use any statistics, hence avoiding a computational cost that is likely to be considerable. (While we don't know how infants keep track of TPs, there is clearly quite some work to do: syllables in English number in the thousands, and the potential number of pair-wise TPs is quadratic in that number.) Such a model simply stores previously extracted words in memory to bootstrap new words. Young children's familiar segmentation errors ("I was have" from be-have, "hiccing up" from hicc-up, "two dults" from a-dult) suggest that this process does take place. Moreover, there is evidence that 8-month-old infants can store familiar sounds in memory [22]. And finally, there are plenty of single-word utterances (up to 10% [23]) that give many words for free. The implementation of a purely symbolic learner that recycles known words yields even better performance: a precision of 81.5% and a recall of 90.1%.
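The purely symbolic learner is given here only in prose, so the following is a minimal sketch of how such a word-recycling segmenter might work. The greedy edge-stripping against the current lexicon is our simplification, and the one-word-one-stress constraint of the actual model is omitted:

```python
def symbolic_learner(utterances):
    """A statistics-free segmenter that recycles known words.

    Previously extracted words are peeled off the edges of each
    utterance; any residue left in the middle is hypothesized to be
    a new word and stored.  Single-word utterances seed the lexicon
    for free.
    """
    lexicon = set()
    for syllables in utterances:
        rest = list(syllables)
        stripped = True
        while rest and stripped:
            stripped = False
            # strip the longest known word from the left edge ...
            for n in range(len(rest), 0, -1):
                if tuple(rest[:n]) in lexicon:
                    rest, stripped = rest[n:], True
                    break
            # ... and from the right edge
            for n in range(len(rest), 0, -1):
                if tuple(rest[-n:]) in lexicon:
                    rest, stripped = rest[:-n], True
                    break
        if rest:              # residue becomes a new lexical entry
            lexicon.add(tuple(rest))
    return lexicon
```

On this sketch, hearing "behave" in isolation and later recognizing "be" at an utterance edge would leave the residue "have" as a hypothesized word, mirroring the children's errors cited above.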
5 Conclusion

Further work, both experimental and computational, will need to address a few pressing questions in order to gain a better assessment of the relative contributions of SL and UG to language acquisition. These include, most pertinently to the problem of word segmentation:

• Can statistical learning be used in the acquisition of language-specific phonotactics, a prerequisite to syllabification and a prelude to word segmentation?

• Given that prosodic constraints are critical for the success of SL in word segmentation, future work needs to quantify the availability of stress information in spoken corpora.

• Can further experiments, carried out over realistic linguistic input, further tease apart the multiple strategies used in word segmentation [14]? What are the psychological mechanisms (algorithms) that integrate these strategies?

• How does word segmentation, statistical or otherwise, work for agglutinative languages (e.g., Turkish) and polysynthetic languages (e.g., Mohawk), where the division between words, morphology, and syntax is quite different from more clear-cut cases like English?

Computational modeling can make explicit the balance between statistics and UG, in the same vein as recent findings [24] on when and where SL is effective and possible. UG can help SL by providing specific constraints on its application, and modeling may raise new questions for further experimental studies. In related work [6,7], we have augmented traditional theories of UG (derivational phonology, and the Principles and Parameters framework) with a component of statistical learning, with novel and desirable consequences. Yet in all cases, statistical learning, while perhaps domain-general, is constrained by what appears to be innate and domain-specific knowledge of linguistic structures, such that learning can operate on specific aspects of the input evidence.

References

1. Gold, E. M. (1967). Language identification in the limit. Information and Control, 10, 447-474.

2. Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27, 1134-1142.

3. Vapnik, V. (1995). The Nature of Statistical Learning Theory. Berlin: Springer.

4. Chomsky, N. (1959). Review of Verbal Behavior by B. F. Skinner. Language, 35, 26-57.

5. Chomsky, N. (1975). Reflections on Language. New York: Pantheon.

6. Yang, C. D. (1999). A selectionist theory of language development. In Proceedings of the 37th Meeting of the Association for Computational Linguistics. College Park, MD. 431-435.

7. Yang, C. D. (2002). Knowledge and Learning in Natural Language. Oxford: Oxford University Press.

8. Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.

9. Gomez, R. L., & Gerken, L. A. (1999). Artificial grammar learning by one-year-olds leads to specific and abstract knowledge. Cognition, 70, 109-135.

10. Saffran, J. R., Johnson, E. K., Aslin, R. N., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27-52.

11. Fiser, J., & Aslin, R. N. (2002). Statistical learning of new visual feature combinations by infants. PNAS, 99, 15822-15826.

12. Hauser, M., Newport, E. L., & Aslin, R. N. (2001). Segmentation of the speech stream in a non-human primate: Statistical learning in cotton-top tamarins. Cognition, 78, B41-B52.

13. Bates, E., & Elman, J. (1996). Learning rediscovered. Science, 274, 1849-1850.

14. Jusczyk, P. W. (1999). How infants begin to extract words from speech. Trends in Cognitive Sciences, 3, 323-328.

15. Chomsky, N. (1955/1975). The Logical Structure of Linguistic Theory. Manuscript, Harvard University and Massachusetts Institute of Technology. Published in 1975, New York: Plenum.

16. Saffran, J. R., Aslin, R. N., & Newport, E. L. (1997). Letters. Science, 276, 1177-1181.

17. Crain, S., & Nakayama, M. (1987). Structure dependency in grammar formation. Language, 63, 522-543.

18. Bijeljac-Babic, R., Bertoncini, J., & Mehler, J. (1993). How do four-day-old infants categorize multisyllabic utterances? Developmental Psychology, 29, 711-721.

19. MacWhinney, B. (1995). The CHILDES Project: Tools for Analyzing Talk. Hillsdale, NJ: Lawrence Erlbaum.

20. Brent, M. (1999). Speech segmentation and word discovery: A computational perspective. Trends in Cognitive Sciences, 3, 294-301.

21. Johnson, E. K., & Jusczyk, P. W. (2001). Word segmentation by 8-month-olds: When speech cues count more than statistics. Journal of Memory and Language, 44, 1-20.

22. Jusczyk, P. W., & Hohne, E. A. (1997). Infants' memory for spoken words. Science, 277, 1984-1986.

23. Brent, M. R., & Siskind, J. M. (2001). The role of exposure to isolated words in early vocabulary development. Cognition, 81, B33-B44.

24. Newport, E. L., & Aslin, R. N. (2004). Learning at a distance: I. Statistical learning of non-adjacent dependencies. Cognitive Psychology, 48, 127-162.