Statistics Learning and Universal Grammar: Modeling Word Segmentation
Timothy Gambell
59 Bishop Street
New Haven, CT 06511, USA
[email protected]

Charles Yang
Department of Linguistics, Yale University
New Haven, CT 06511, USA
[email protected]

Abstract

This paper describes a computational model of word segmentation and presents simulation results on realistic acquisition data. In particular, we explore the capacity and limitations of statistical learning mechanisms that have recently gained prominence in cognitive psychology and linguistics.

1 Introduction

Two facts about language learning are indisputable. First, only a human baby, but not her pet kitten, can learn a language. It is clear, then, that there must be some element in our biology that accounts for this unique ability. Chomsky's Universal Grammar (UG), an innate form of knowledge specific to language, is an account of what this ability is. This position gains support from formal learning theory [1-3], which sharpens the logical conclusion [4,5] that no (realistically efficient) learning is possible without a priori restrictions on the learning space. Second, it is also clear that no matter how much of a head start the child has through UG, language is learned. Phonology, lexicon, and grammar, while governed by universal principles and constraints, do vary from language to language, and they must be learned on the basis of linguistic experience. In other words, and indeed as a truism, both endowment and learning contribute to language acquisition, the result of which is an extremely sophisticated body of linguistic knowledge. Consequently, both must be taken into account, explicitly, in a theory of language acquisition [6,7].

Controversies arise when it comes to the relative contributions of innate knowledge and experience-based learning. Some researchers, in particular linguists, approach language acquisition by characterizing the scope and limits of the innate principles of Universal Grammar that govern the world's languages. Others, in particular psychologists, tend to emphasize the role of experience and the child's domain-general learning ability. Such a division of research agendas understandably stems from the division of labor between endowment and learning: plainly, things that are built in needn't be learned, and things that can be garnered from experience needn't be built in.

The important paper of Saffran, Aslin, & Newport [8] on statistical learning (SL) suggests that children may be powerful learners after all. Very young infants can exploit transitional probabilities between syllables for the task of word segmentation, with only minimal exposure to an artificial language. Subsequent work has demonstrated SL in other domains, including artificial grammar learning [9], music [10], and vision [11], as well as in other species [12]. This raises the possibility of learning as an alternative to the innate endowment of linguistic knowledge [13].

We believe that the computational modeling of psychological processes, with special attention to concrete mechanisms and quantitative evaluations, can play an important role in the endowment vs. learning debate. Linguists' investigations of UG are rarely developmental, even less so corpus-oriented. Developmental psychologists, by contrast, often stop at identifying the components in a cognitive task [14], without an account of how such components work together in an algorithmic manner. On the other hand, if computation is to be of relevance to linguistics, psychology, and cognitive science in general, being merely computational will not suffice. A model must be psychologically plausible, and ready to face its implications in broad empirical contexts [7]. For example, how does it generalize to typologically different languages? How does the model's behavior compare with that of human language learners and processors?

In this article, we present a simple computational model of word segmentation and some of its formal and developmental issues in child language acquisition. Specifically, we show that SL using transitional probabilities cannot reliably segment words when scaled to a realistic setting (e.g., child-directed English). To be successful, it must be constrained by knowledge of phonological structure. Indeed, the model reveals that SL may well be an artifact (an impressive one, nonetheless) that plays no role in actual word segmentation in human children.

2 Statistics does not Refute UG

It has been suggested [15, 8] that word segmentation from continuous speech may be achieved by using transitional probabilities (TP) between adjacent syllables A and B, where TP(A→B) = P(AB)/P(A), with P(AB) being the frequency of B following A, and P(A) the total frequency of A. Word boundaries are postulated at local minima, where the TP is lower than its neighbors. For example, given a sufficient amount of exposure to English, the learner may establish that, in the four-syllable sequence "prettybaby", TP(pre→tty) and TP(ba→by) are both higher than TP(tty→ba): a word boundary can be (correctly) postulated. It is remarkable that 8-month-old infants can extract three-syllable words in the continuous speech of an artificial language from only two minutes of exposure [8].

To be effective, a learning algorithm, indeed any algorithm, must have an appropriate representation of the relevant learning data. We thus need to be cautious about the interpretation of the success of SL, as the authors themselves note [16]. If anything, it seems that the findings strengthen, rather than weaken, the case for (innate) linguistic knowledge. A classic argument for innateness [4, 5, 17] comes from the fact that syntactic operations are defined over specific types of data structures, constituents and phrases, but not over, say, linear strings of words, or numerous other logical possibilities. While infants seem to keep track of statistical information, any conclusion drawn from such findings must presuppose that children know what kind of statistical information to keep track of. After all, an infinite range of statistical correlations exists in the acoustic input: e.g., what is the probability of a syllable rhyming with the next? What is the probability of two adjacent vowels both being nasal? The fact that infants can use SL to segment syllable sequences at all entails that, at a minimum, they know the relevant unit of information over which correlative statistics are gathered: in this case, it is syllables, rather than segments or front vowels.

A host of questions then arises. First, how do they know this? It is quite possible that the primacy of syllables as the basic unit of speech is innately available, as suggested by neonate speech perception studies [18]. Second, where do the syllables come from? While the experiments in [8] used uniformly CV syllables, many languages, including English, make use of a far more diverse range of syllable types. Moreover, the syllabification of speech is far from trivial, and (most likely) involves both innate knowledge of phonological structure and the discovery of language-specific instantiations [14]. All these problems have to be solved before SL for word segmentation can take place.

3 The Model

To give a precise evaluation of SL in a realistic setting, we constructed a series of (embarrassingly simple) computational models tested on child-directed English.

The learning data consist of a random sample of child-directed English sentences from the CHILDES database [19]. The words were phonetically transcribed using the Carnegie Mellon Pronunciation Dictionary, and were then grouped into syllables. Spaces between words are removed; however, utterance breaks are available to the modeled learner. Altogether, there are 226,178 words, consisting of 263,660 syllables.

Implementing SL-based segmentation is straightforward. One first gathers pairwise TPs from the training data, which are then used to identify local minima and postulate word boundaries in the on-line processing of syllable sequences. Scoring is done for each utterance and then averaged. Viewed as an information retrieval problem, it is customary [20] to report both the precision and recall of the performance.

The segmentation results using TP local minima are remarkably poor, even under the assumption that the learner has already syllabified the input perfectly. Precision is 41.6%, and recall is 23.3%: over half of the words extracted by the model are not actual English words, while close to 80% of actual words fail to be extracted. And it is straightforward to see why this is the case. In order for SL to be effective, a TP at an actual word boundary must be lower than its neighbors. Obviously, this condition cannot be met if the input is a sequence of monosyllabic words, for which a boundary must be postulated after every syllable; there are no local minima to speak of. While the pseudowords in [8] are uniformly three syllables long, much of child-directed English consists of sequences of monosyllabic words: corpus statistics reveal that, on average, a monosyllabic word is followed by another monosyllabic word 85% of the time. As long as this is the case, SL cannot, in principle, work.
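As a concrete illustration, the procedure can be sketched in a few lines of Python: gather pairwise TPs from boundary-free syllable sequences, postulate a word boundary at each TP local minimum, and score the result as an information-retrieval problem. This is a minimal sketch under our own assumptions; the toy syllable corpus and all function names are illustrative, not the authors' original implementation.

```python
from collections import Counter

def train_tps(utterances):
    """Estimate TP(A->B) = P(AB) / P(A) from utterances given as
    lists of syllables with word boundaries removed."""
    pair, single = Counter(), Counter()
    for syls in utterances:
        for a, b in zip(syls, syls[1:]):
            pair[(a, b)] += 1
        single.update(syls)
    return {(a, b): n / single[a] for (a, b), n in pair.items()}

def segment(syls, tp):
    """Postulate a boundary wherever a TP is lower than both
    neighboring TPs (a local minimum); return the resulting words."""
    tps = [tp.get((a, b), 0.0) for a, b in zip(syls, syls[1:])]
    words, start = [], 0
    for i in range(1, len(tps) - 1):
        if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]:
            words.append(tuple(syls[start:i + 1]))  # boundary after syls[i]
            start = i + 1
    words.append(tuple(syls[start:]))
    return words

def score(pred, gold):
    """Precision and recall over extracted word tokens, compared
    as (start, end) spans so repeated words are counted correctly."""
    def spans(words):
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out
    p, g = spans(pred), spans(gold)
    hits = len(p & g)
    return hits / len(p), hits / len(g)

# Toy corpus: "pretty baby", "pretty doggie", "big baby" as syllables.
tp = train_tps([["pre", "tty", "ba", "by"],
                ["pre", "tty", "do", "ggie"],
                ["big", "ba", "by"]])
print(segment(["pre", "tty", "ba", "by"], tp))  # [('pre', 'tty'), ('ba', 'by')]
print(segment(["big", "ba", "by"], tp))         # [('big', 'ba', 'by')]
```

The first utterance is segmented correctly: TP(tty→ba) = 0.5 is a local minimum between two within-word TPs of 1.0. The second, which begins with a monosyllabic word, yields no interior local minimum and comes back as a single undivided chunk, which is exactly the failure mode diagnosed above for monosyllable-heavy child-directed English.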
4 Statistics Needs UG

This is not to say that SL cannot be effective for word segmentation. Its application must be constrained, like that of any learning algorithm however powerful, as suggested by formal learning theories [1-3]. The performance improves dramatically, in fact, if the learner is equipped with even a small amount of prior knowledge about phonological … words yields even better performance: a precision of 81.5% and a recall of 90.1%.

5 Conclusion

Further work, both experimental and computational, will need to address a few pressing questions, in order to gain a better assessment of the relative contributions of SL and UG to language acquisition.