Proceedings of the

First Workshop on Psycho-computational Models of Human Language Acquisition

Held in cooperation with COLING-2004

28-29 August 2004 Geneva, Switzerland


INVITED SPEAKERS:

Walter Daelemans (University of Antwerp, Belgium and Tilburg University, the Netherlands)
B. Elan Dresher (University of Toronto, Canada)
Charles Yang (Yale University, USA)

ORGANIZER:

William Gregory Sakas (City University of New York, USA)

PROGRAM COMMITTEE:

Robert Berwick (MIT, USA)
Antal van den Bosch (Tilburg University, the Netherlands)
Ted Briscoe (University of Cambridge, UK)
Damir Cavar (Indiana University, USA)
Morten H. Christiansen (Cornell University, USA)
Stephen Clark (University of Edinburgh, UK)
James Cussens (University of York, UK)
Walter Daelemans (University of Antwerp, Belgium and Tilburg University, the Netherlands)
Jeffrey Elman (University of California, San Diego, USA)
Gerard Kempen (Leiden University, the Netherlands and the Max Planck Institute, Nijmegen)
Vincenzo Lombardo (University of Torino, Italy)
Larry Moss (University of Indiana, USA)
Miles Osborne (University of Edinburgh, UK)
Dan Roth (University of Illinois at Urbana-Champaign, USA)
Ivan Sag (Stanford University, USA)
Jeffrey Siskind (Purdue University, USA)
Mark Steedman (University of Edinburgh, UK)
Menno van Zaanen (Tilburg University, the Netherlands)
Charles Yang (Yale University, USA)


WORKSHOP ASSISTANTS:

Xuan Nga Cao (City University of New York, USA)
Mari Fujimoto (City University of New York, USA)
Lydiya Tornyova (City University of New York, USA)

SPONSOR:

The 20th International Conference on Computational Linguistics

FURTHER INFORMATION:

William Gregory Sakas
Ph.D. Programs in Linguistics and Computer Science
Department of Computer Science, North Bldg 1008
Hunter College, City University of New York
695 Park Ave
New York, NY 10021 USA

email: [email protected] [email protected]

WWW: http://www.colag.cs.hunter.cuny.edu/psychocomp

Introduction

Every day, we use language so effortlessly that we often overlook its complexity. The fact that language is complex is indisputable. Indeed, even after decades of scrutiny, highly-trained adult scientists cannot agree on a definitive analysis of the underlying mechanism that ultimately determines how our sounds, words, and sentences go together – yet for a child the task is effortless. Children as young as one and a half years old (and younger) continually exploit much of language's underpinnings while going about the business of making sense of the linguistic environment that surrounds them. By the time a child reaches kindergarten, he or she has almost full mastery of an elaborate structure that eludes adequate scientific description. How children accomplish this – how they come to acquire 'knowledge' of language's essential organization – is one of the most fundamental, beguiling, and surprisingly open questions of modern science.

This workshop brings together researchers for whom at least one line of investigation is to computationally model the acquisition process and to ascertain substantive interrelationships between a model and linguistic and psycholinguistic theory. Progress in this agenda not only directly informs developmental psycholinguistic and linguistic research, but, in my opinion, will also have the long-term benefit of informing applied computational linguistics in areas that involve the automated acquisition of knowledge from a human or human-computer linguistic environment.

The level of sophistication and breadth of applied computational linguistics techniques have skyrocketed in the past two decades. There is now a battery of computational formalisms and statistical methods to 'choose from,' all of which have yielded remarkable success in many applied domains that involve the computer learning of natural language (e.g. speech recognition, web technologies, corpus analysis, etc.). These achievements have dramatically spurred even more research and funding, to the point where the evolution of the science of computational linguistics can be seen as quickly outpacing that of psycholinguistics.

However, there are signs that the computational linguistics community has become progressively more aware that language technologies might benefit by incorporating learning strategies employed by humans. Although research involving the psycho-computational modeling of human language acquisition has long been active in the areas of psycholinguistics and formal learning theory, it has, arguably, only recently become a growing part of the computational linguistics agenda. This is evidenced by the occasional special session at an ACL meeting (e.g., ACL-1999 – Thematic Session on Computational Psycholinguistics), current workshops at both COLING-2004 (this workshop) and ACL-2004 (Incremental Parsing: Bringing Engineering and Cognition Together), and regular invitations to developmental psycholinguists to deliver plenary addresses at recent ACL meetings. This cross-discipline attentiveness is clearly very healthy. By incorporating theoretical advances in computational psycholinguistics into computational language learning technologies, it might well help reduce the possibility that applied research will run into a psycho-computational bottleneck – the point at which state-of-the-art computational methods cannot be improved further in the development of user-transparent computer-human language applications.

This workshop brings together a wide range of computational psycholinguistics research that is involved with the study of language acquisition: 34% of author contributions come from researchers holding positions in computer science or related departments, 33% from linguistics departments, 30% from psychology or cognitive science departments, and 3% from other departments.1 The articles present investigations involving a broad diversity of formalisms, learning strategies, modeling techniques and linguistic phenomena. Linguistic footings range from (variations on): Universal Grammar, constructionist frameworks, and categorial grammar, to novel formulations of structural representation, to ‘none.’ Learning strategies include: distributional and corpus techniques, connectionist implementations, cue-based learning, and hybrid models that apply several strategies. Phenomena that are modeled include: the acquisition of semantics, linguistic (principles and) parameter setting, lexical subcategorization, child language production, atypical acquisition, phonological acquisition and morphological acquisition. Several papers involve cross-linguistic research and/or use actual child-directed speech (from corpora).

1 An "author contribution" is calculated as 1 / the number of authors on a paper.

Notably, most papers (though not all) address acquisition at the sub-word, word, or multi-word level. Few models assign structure or meaning to an entire utterance (or discourse), although many papers suggest that a presented model could be (easily) scaled up – a worthwhile direction for future research. It is also worth remarking that articles addressing formal learning issues (e.g., PAC learning, identification in the limit, grammar induction, etc.) or incorporating formalisms from mainstream computational linguistics (e.g., any of the many variants of probabilistic grammars) are underrepresented (the workshop contains only one such paper). Future meetings along the lines of this workshop might benefit from attracting research efforts related to these approaches.

I would sincerely like to thank the program committee for above-and-beyond effort given the tight timetable, the diversity of the papers, and the several frustrating problems caused by spam-blockers; the workshop assistants who were a tremendous help with collating the reviews, organizing the articles for the proceedings, dealing with email and designing the conference web site; and, finally, the members of the COLING-2004 Workshop Program Committee, who were extremely helpful (and patient) on more than one occasion.

William Gregory Sakas
New York City
June 2004


Table of Contents

A Quantitative Evaluation of Naturalistic Models of Language Acquisition; the Efficiency of the Triggering Learning Algorithm Compared to a Categorial Grammar Learner Paula Buttery ...... 1

On Statistical Parameter Setting Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues and Giancarlo Schrementi ...... 9

Putting Meaning into Grammar Learning Nancy Chang ...... 17

Grammatical Inference and First Language Acquisition Alexander Clark...... 25

A Developmental Model of Syntax Acquisition in the Construction Grammar Framework with Cross-Linguistic Validation in English and Japanese Peter Ford Dominey and Toshio Inui ...... 33

On the Acquisition of Phonological Representations B. Elan Dresher...... 41

Statistics Learning and Universal Grammar: Modeling Word Segmentation Timothy Gambell and Charles Yang ...... 49

Modelling Syntactic Development in a Cross-Linguistic Context Fernand Gobet, Daniel Freudenthal and Julian M. Pine...... 53

A Computational Model of Emergent Simple Syntax: Supporting the Natural Transition from the One-Word Stage to the Two-Word Stage Kris Jack, Chris Reed and Annalu Waller...... 61

On a Possible Role for Pronouns in the Acquisition of Verbs Aarre Laakso and Linda Smith...... 69

Some Tests of an Unsupervised Model of Language Acquisition Bo Pedersen, Shimon Edelman, Zach Solan, David Horn and Eytan Ruppin...... 77

Modelling Atypical Syntax Processing Michael S. C. Thomas and Martin Redington ...... 85

Combining Utterance-Boundary and Predictability Approaches to Speech Segmentation Aris Xanthos...... 93


A quantitative evaluation of naturalistic models of language acquisition; the efficiency of the Triggering Learning Algorithm compared to a Categorial Grammar Learner

Paula Buttery
Natural Language and Information Processing Group, Computer Laboratory, Cambridge University, 15 JJ Thomson Avenue, Cambridge, CB3 0FD, UK
[email protected]

Abstract

Naturalistic theories of language acquisition assume learners to be endowed with some innate language knowledge. The purpose of this innate knowledge is to facilitate language acquisition by constraining a learner's hypothesis space. This paper discusses a naturalistic learning system (a Categorial Grammar Learner (CGL)) that differs from previous learners (such as the Triggering Learning Algorithm (TLA) (Gibson and Wexler, 1994)) by employing a dynamic definition of the hypothesis-space which is driven by the Bayesian Incremental Parameter Setting algorithm (Briscoe, 1999). We compare the efficiency of the TLA with the CGL when acquiring an independently and identically distributed English-like language in noiseless conditions. We show that when convergence to the target grammar occurs (which is not guaranteed), the expected number of steps to convergence for the TLA is shorter than that for the CGL initialized with uniform priors. However, the CGL converges more reliably than the TLA. We discuss the trade-off of efficiency against more reliable convergence to the target grammar.

1 Introduction

A normal child acquires the language of her environment without any specific training. Chomsky (1965) claims that, given the "relatively slight exposure" to examples and "remarkable complexity" of language, it would be "an extraordinary intellectual achievement" for a child to acquire a language if not specifically designed to do so. His Argument from the Poverty of the Stimulus suggests that if we know X, and X is undetermined by learning experience, then X must be innate. For an example consider structure dependency in language syntax:

A question in English can be formed by inverting the auxiliary verb and subject noun-phrase: (1a) "Dinah was drinking a saucer of milk"; (1b) "was Dinah drinking a saucer of milk?"

Upon exposure to this example, a child could hypothesize infinitely many question-formation rules, such as: (i) swap the first and second words in the sentence; (ii) front the first auxiliary verb; (iii) front words beginning with w.

The first two of these rules are refuted if the child encounters the following: (2a) "the cat who was grinning at Alice was disappearing"; (2b) "was the cat who was grinning at Alice disappearing?"

If a child is to converge upon the correct hypothesis unaided she must be exposed to sufficient examples so that all false hypotheses are refuted. Unfortunately such examples are not readily available in child-directed speech; even the constructions in examples (2a) and (2b) are rare (Legate, 1999). To compensate for this lack of data Chomsky suggests that some principles of language are already available in the child's mind. For example, if the child had innately "known" that all grammar rules are structurally-dependent upon syntax she would never have hypothesized rules (i) and (iii). Thus, Chomsky theorizes that a human mind contains a Universal Grammar which defines a hypothesis-space of "legal" grammars.1 This hypothesis-space must be both large enough to contain grammars for all of the world's languages and small enough to ensure successful acquisition given the sparsity of data.

Language acquisition is the process of searching the hypothesis-space for the grammar that most closely describes the language of the environment. With estimates of the number of living languages being around 6800 (Ethnologue, 2004) it is not sensible to model the hypothesis-space of grammars explicitly; rather, it must be modeled parametrically. Language acquisition is then the process of setting these parameters. Chomsky (1981) suggested that parameters should represent points of variation between languages; however, the only requirement for parameters is that they define the current hypothesis-space.

1 Discussion of structural dependence as evidence of the Argument from the Poverty of Stimulus is illustrative, the significance being that innate knowledge in any form will place constraints on the hypothesis-space.

The properties of the parameters used by this learner (the CGL) are as follows: (1) Parameters are lexical; (2) Parameters are inheritance based; (3) Parameter setting is statistical.

1 - Lexical Parameters

The CGL employs parameter setting as a means to acquire a lexicon, differing from other parametric learners (such as the Triggering Learning Algorithm (TLA) (Gibson and Wexler, 1994) and the Structural Triggers Learner (STL) (Fodor, 1998b; Sakas and Fodor, 2001)), which acquire general syntactic information rather than the syntactic properties associated with individual words.2

In particular, a categorial grammar is acquired. The syntactic properties of a word are contained in its lexical entry in the form of a syntactic category. A word that may be used in multiple syntactic situations (or sub-categorization frames) will have multiple entries in the lexicon.

Syntactic categories are constructed from a finite set of primitive categories combined with two operators (/ and \) and are defined by their members' ability to combine with other constituents; thus constituents may be thought of as either functions or arguments. The arguments of a functional constituent are shown to the right of the operators and the result to the left. The forward slash operator (/) indicates that the argument must appear to the right of the function and a backward slash (\) indicates that it must appear on the left. Consider the following CFG structure which describes the properties of a transitive verb:

s → np vp
vp → tv np
tv → gets, finds, ...

Assume that there is a set of primitive categories {s, np}. A vp must be in the category of functional constituents that takes an np from the left and returns an s. This can be written s\np. Likewise a tv takes an np from the right and returns a vp (whose type we already know). A tv may be written (s\np)/np.

Rules may be used to combine categories. We assume that our learner is innately endowed with the rules of function application, function composition and generalized weak permutation (Briscoe, 1999) (see Figures 1 and 2).

• Forward Application (>): X/Y Y → X
• Backward Application (<): Y X\Y → X
• Forward Composition (> B): X/Y Y/Z → X/Z
• Backward Composition (< B): Y\Z X\Y → X\Z
• Generalized Weak Permutation (P): ((X | Y1) ... | Yn) → ((X | Yn) ... | Y1), where | is a variable over \ and /.

Figure 1: Illustration of forward/backward application (>, <) and forward composition (> B), shown as a derivation of "Alice may eat the cake".

Figure 2: Illustration of generalized weak permutation (P), shown as a derivation of "the rabbit she saw".

The lexicon for a language will contain a finite subset of all possible syntactic categories, the size of which depends on the language. Steedman (2000) suggests that for English the lexical functional categories never need more than five arguments and that these are needed only in a limited number of cases such as for the verb bet in the sentence I bet you five pounds for England to win.

The categorial grammar parameters of the CGL are concerned with defining the set of syntactic categories present in the language of the environment. Converging on the correct set aids acquisition by constraining the learner's hypothesized syntactic categories for an unknown word.

2 The concept of lexical parameters and the lexical-linking of parameters is to be attributed to Borer (1984).
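To make the combinatory rules listed in this section concrete, the following is a minimal sketch of how application and composition can be applied to category objects. It is an illustration only, not the CGL implementation; the class and function names are invented for exposition, and generalized weak permutation is omitted.

# Minimal sketch of forward/backward application and forward composition
# over categorial-grammar categories (illustrative only).
class Cat:
    """A syntactic category: either primitive or result + slash + argument."""
    def __init__(self, result=None, slash=None, arg=None, prim=None):
        self.prim, self.result, self.slash, self.arg = prim, result, slash, arg

    def __eq__(self, other):
        return repr(self) == repr(other)

    def __repr__(self):
        if self.prim:
            return self.prim
        return "(%r%s%r)" % (self.result, self.slash, self.arg)

def forward_apply(left, right):
    """X/Y  Y  ->  X"""
    if left.slash == "/" and left.arg == right:
        return left.result
    return None

def backward_apply(left, right):
    """Y  X\\Y  ->  X"""
    if right.slash == "\\" and right.arg == left:
        return right.result
    return None

def forward_compose(left, right):
    """X/Y  Y/Z  ->  X/Z"""
    if left.slash == "/" and right.slash == "/" and left.arg == right.result:
        return Cat(result=left.result, slash="/", arg=right.arg)
    return None

# Example: a transitive verb (s\np)/np combines with an object np and a subject np.
s, np = Cat(prim="s"), Cat(prim="np")
iv = Cat(result=s, slash="\\", arg=np)     # s\np
tv = Cat(result=iv, slash="/", arg=np)     # (s\np)/np
vp = forward_apply(tv, np)                 # -> s\np
print(backward_apply(np, vp))              # -> s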

A parameter (with a value of either ACTIVE or INACTIVE) is associated with every possible syntactic category to indicate whether the learner considers the category to be part of the target grammar.

Some previous parametric learners (TLA and STL) have been primarily concerned with overall syntactic phenomena rather than the syntactic properties of individual words. Movement parameters (such as the V2 parameter of the TLA) may be captured by the CGL using innate rules or multiple lexical entries. For instance, Dutch and German word order is captured by assuming that verbs in these languages systematically have two categories, one determining main clause order and the other subordinate clause orders.

2 - Inheritance Based Parameters

The complex syntactic categories of a categorial grammar are a sub-categorization of simpler categories; consequently categories may be arranged in a hierarchy with more complex categories inheriting from simpler ones. Figure 3 shows a fragment of a possible hierarchy. This hierarchical organization of parameters provides the learner with several benefits: (1) The hierarchy can enforce an order on learning; constraints may be imposed such that a parent parameter must be acquired before a child parameter (for example, in Figure 3, the learner must acquire intransitive verbs before transitive verbs may be hypothesized). (2) Parameter values may be inherited as a method of acquisition. (3) The parameters are stored efficiently.

Figure 3: Partial hierarchy of syntactic categories (s dominating s/s and s\np; s\np dominating [s\np]/np and [s\np]/[s\np]). Each category is associated with a parameter indicating either ACTIVE or INACTIVE status.

3 - Statistical Parameter Setting

The learner uses a statistical method to track relative frequencies of parameter-setting-utterances in the input.3 We use the Bayesian Incremental Parameter Setting (BIPS) algorithm (Briscoe, 1999) to set the categorial parameters. Such an approach sets the parameters to the values that are most likely given all the accumulated evidence. This represents a compromise between two extremes: implementations of the TLA are memoryless, allowing parameter values to oscillate; some implementations of the STL set a parameter once, for all time.

Using the BIPS algorithm, evidence from an input utterance will either strengthen the current parameter settings or weaken them. Either way, there is re-estimation of the probabilities associated with possible parameter values. Values are only assigned when sufficient evidence has been accumulated, i.e. once the associated probability reaches a threshold value. By employing this method, it becomes unlikely for parameters to switch between settings as the consequence of an erroneous utterance.

Another advantage of using a Bayesian approach is that we may set default parameter values by assigning Bayesian priors; if a parameter's default value is strongly biased against the accumulated evidence then it will be difficult to switch. Also, we no longer need to worry about ambiguity in parameter-setting-utterances (Clark, 1992; Fodor, 1998b): the Bayesian approach allows us to solve this problem "for free" since indeterminacy just becomes another case of error due to misclassification of input data (Buttery and Briscoe, 2004).

2 Overview of the Categorial Grammar Learner

The learning system is composed of three modules: a semantics learning module, a syntax learning module and a memory module. For each utterance heard the learner receives an input stream of word tokens paired with possible semantic hypotheses. For example, on hearing the utterance "Dinah drinks milk" the learner may receive the pairing: ({dinah, drinks, milk}, drinks(dinah, milk)).

2.1 The Semantic Module

The semantic module attempts to learn the mapping between word tokens and semantic symbols, building a lexicon containing the meaning associated with each word sense. This is achieved by analyzing each input utterance and its associated semantic hypotheses using cross-situational techniques (following Siskind (1996)).

For a trivial example consider the utterances "Alice laughs" and "Alice eats cookies"; they might have word tokens paired with semantic expressions as follows: ({alice, laughs}, laugh(alice)), ({alice, eats, cookies}, eat(alice, cookies)). From these two utterances it is possible to ascertain that the meaning associated with the word token alice must be alice since it is the only semantic element that is common to both utterances.

3 Other statistical parameter setting models include Yang's Variational model (2002) and the Guessing STL (Fodor, 1998a).
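As a rough illustration of the threshold-based setting described in the Statistical Parameter Setting section above, the toy sketch below accumulates evidence for a single categorial parameter and only assigns a value once the associated probability crosses a threshold. The simple pseudo-count update, the threshold value and all names are assumptions made for exposition; this is not Briscoe's (1999) BIPS formulation.

# Toy threshold-based statistical parameter setting (illustrative only).
class Parameter:
    def __init__(self, prior_active=1.0, prior_inactive=1.0, threshold=0.9):
        self.a = prior_active          # pseudo-counts supporting ACTIVE
        self.b = prior_inactive        # pseudo-counts supporting INACTIVE
        self.threshold = threshold
        self.value = None              # unset until enough evidence accumulates

    def observe(self, supports_active):
        """Each parameter-setting utterance strengthens or weakens the setting."""
        if supports_active:
            self.a += 1
        else:
            self.b += 1
        p_active = self.a / (self.a + self.b)
        if p_active >= self.threshold:
            self.value = "ACTIVE"
        elif (1 - p_active) >= self.threshold:
            self.value = "INACTIVE"

p = Parameter()                        # uniform prior
for evidence in [True, True, False, True] + [True] * 20:
    p.observe(evidence)                # one erroneous utterance cannot flip it
print(p.value)                         # "ACTIVE" only after sufficient evidence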

2.2 The Syntactic Module

The learning system links the semantic module and syntactic module by using a typing assumption: the semantic arity of a word is usually the same as its number of syntactic arguments. For example, if it is known that likes maps to like(x, y), then the typing assumption suggests that its syntactic category will be in one of the following forms: a\b\c, a/b\c, a\b/c, a/b/c or, more concisely, a | b | c (where a, b and c may be basic or complex syntactic categories themselves).

By employing the typing assumption the number of arguments in a word's syntactic category can be hypothesized. Thus, the objective of the syntactic module is to discover the arguments' category types and locations.

The module attempts to create valid parse trees starting from the syntactic information already assumed by the typing assumption (following Buttery (2003)). A valid parse is one that adheres to the rules of the categorial grammar as well as the constraints imposed by the current settings of the parameters. If a valid parse cannot be found the learner assumes the typing assumption to have failed and backtracks to allow type raising.

2.3 Memory Module

The memory module records the current state of the hypothesis-space. The syntactic module refers to this information to place constraints upon which syntactic categories may be hypothesized. The module consists of two hierarchies of parameters which may be set using the BIPS algorithm:

Categorial Parameters determine whether a category is in use within the learner's current model of the input language. An inheritance hierarchy of all possible syntactic categories (for up to five arguments) is defined and a parameter associated with each one (Villavicencio, 2002). Every parameter (except those associated with primitive categories such as S) is originally set to INACTIVE, i.e. no categories (except primitives) are known upon the commencement of learning. A categorial parameter may only be set to ACTIVE if its parent category is already active and there has been satisfactory evidence that the associated category is present in the language of the environment.

Word Order Parameters determine the underlying order in which constituents occur. They may be set to either FORWARD or BACKWARD depending on whether the constituents involved are generally located to the right or left. An example is the parameter that specifies the direction of the subject of a verb: if the language of the environment is English this parameter would be set to BACKWARD since subjects generally appear to the left of the verb. Evidence for the setting of word order parameters is collected from word order statistics of the input language.

3 The acquisition of an English-type language

The English-like language of the three-parameter system studied by Gibson and Wexler has the parameter settings and associated unembedded surface-strings as shown in Figure 4. For this task we assume that the surface-strings of the English-like language are independent and identically distributed in the input to the learner.

Specifier: 0 (Left)   Complement: 1 (Right)   V2: 0 (off)
1. Subj Verb
2. Subj Verb Obj
3. Subj Verb Obj Obj
4. Subj Aux Verb
5. Subj Aux Verb Obj
6. Subj Aux Verb Obj Obj
7. Adv Subj Verb
8. Adv Subj Verb Obj
9. Adv Subj Verb Obj Obj
10. Adv Subj Aux Verb
11. Adv Subj Aux Verb Obj
12. Adv Subj Aux Verb Obj Obj

Figure 4: Parameter settings and surface-strings of Gibson and Wexler's English-like Language.

3.1 Efficiency of Trigger Learning Algorithm

For the TLA to be successful it must converge to the correct parameter settings of the English-like language. Berwick and Niyogi (1996) modeled the TLA as a Markov process (see Figure 5). Using this model it is possible to calculate the probability of converging to the target from each starting grammar and the expected number of steps before convergence.

Probability of Convergence:

Consider starting from Grammar 3: after the process finishes looping it has a 3/5 probability of moving to Grammar 4 (from which it will never converge) and a 2/5 probability of moving to Grammar 7 (from which it will definitely converge); therefore there is a 40% probability of converging to the target grammar when starting at Grammar 3.

Expected number of Steps to Convergence:

Let S_n be the expected number of steps from state n to the target state. For starting grammars 6, 7 and 8, which definitely converge, we know:

S_6 = 1 + (5/6) S_6    (1)
S_7 = 1 + (2/3) S_7 + (1/18) S_8    (2)
S_8 = 1 + (1/12) S_6 + (1/36) S_7 + (8/9) S_8    (3)

and for the times when we do converge from grammars 3 and 1 we can expect:

S_1 = 1 + (3/5) S_1    (4)
S_3 = 1 + (31/33) S_3    (5)

Figure 6 shows the probability of convergence and the expected number of steps to convergence for each of the starting grammars. The expected number of steps to convergence ranges from infinity (for starting grammars 2 and 4) down to 2.5 for Grammar 1. If the distribution over the starting grammars is uniform then the overall probability of converging is the sum of the probabilities of converging from each state divided by the total number of states:

(1.00 + 1.00 + 1.00 + 1.00 + 0.40 + 0.66) / 8 = 0.63    (6)

and the expected number of steps given that you converge is the weighted average of the number of steps from each possibly converging state:

(5.47 + 14.87 + 6 + 21.98 × 0.4 + 2.5 × 0.66) / (1.00 + 1.00 + 1.00 + 1.00 + 0.40 + 0.66) = 7.26    (7)

3.2 Efficiency of Categorial Grammar Learner

The input data to the CGL would usually be an utterance annotated with a logical form; the only data available here, however, is surface-strings consisting of word types. Hence, for the purpose of comparison with the TLA the semantic module of our learner is by-passed; we assume that mappings to semantic forms have previously been acquired and that the subject and objects of surface-strings are known. For example, given surface-string 1 (Subj Verb) we assume the mapping Verb ↦ verb(x), which provides Verb with a syntactic category of the form a|b by the typing assumption (where a, b are unknown syntactic categories and | is an operator over \ and /); we also assume Subj to map to a primitive syntactic category SB, since it is the subject of Verb.

The criterion for success for the CGL when acquiring Gibson and Wexler's English-like language is a lexicon containing the following:4

Adv: S/S
Aux: [S\SB]/[S\SB]
Obj: OB
Subj: SB
Verb: S\SB, [S\SB]/OB, [[S\SB]/OB]/OB

where S (sentence), SB (subject) and OB (object) are primitive categories which are innate to the learner, with SB and OB assumed to be derivable from the semantic module.

During the learning process the CGL will have constructed a category hierarchy by setting appropriate categorial parameters to true (see Figure 7). The learner will have also constructed a word-order hierarchy (Figure 8), setting parameters to FORWARD or BACKWARD. These hierarchies are used during the learning process to constrain hypothesized syntactic categories. For this task the setting of the word-order parameters becomes trivial and their role in constraining hypotheses negligible; consequently, the rest of our argument will relate to categorial parameters only.

Figure 8: Word-order parameter settings required to parse Gibson and Wexler's English-like language (gendir = /, subjdir = \, vargdir = /).

For the purpose of this analysis parameters are initialized with uniform priors and are originally set INACTIVE. Since the input is noiseless, the switching threshold is set such that parameters may be set ACTIVE upon the evidence from one surface-string.

It is a requirement of the parameter setting device that the parent-types of hypothesized syntax categories are ACTIVE before new parameters are set. Thus, the learner is not allowed to hypothesize the syntactic category for a transitive verb [[S\SB]/OB] before it has learnt the category for an intransitive verb [S\SB]; this behaviour constrains over-generation. Additionally, it is usually not possible to derive a word's full syntactic category (i.e. without any remaining unknowns) unless it is the only new word in the clause.

As a consequence of these issues, the order in which the surface-strings appear to the learner affects the speed of acquisition.

4 Note that the lexicon would usually contain orthographic entries for the words in the language rather than word type entries.
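Because each of equations (1)-(4) is linear, the expected values can be checked mechanically. The sketch below solves them with exact fractions (S_7 and S_8 form a small simultaneous system once S_6 is known); the resulting values agree with entries in Figure 6 up to rounding. This is an illustrative calculation added here, not part of the learner.

# Solving the expected-steps recurrences (1)-(4) with exact fractions.
from fractions import Fraction as F

S6 = 1 / (1 - F(5, 6))          # eq. (1): S6 = 1 + (5/6) S6  ->  6
S1 = 1 / (1 - F(3, 5))          # eq. (4): S1 = 1 + (3/5) S1  ->  5/2

# eq. (2) rearranges to S7 = 3 + S8/6; substitute into eq. (3):
# (1/9) S8 = 1 + (1/12) S6 + (1/36) (3 + S8/6)
rhs_const = 1 + F(1, 12) * S6 + F(1, 36) * 3
S8 = rhs_const / (F(1, 9) - F(1, 36) * F(1, 6))
S7 = 3 + S8 / 6

print(float(S6), float(S1), float(S7), float(S8))
# 6.0  2.5  5.478...  14.869...   (cf. the 6.00, 2.50, 5.47 and 14.87 in Figure 6)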

For instance, the learner prefers to see the surface-string Subj Verb before Subj Verb Obj so that it can acquire the maximum information without wasting any strings. For the English-type language described by Gibson and Wexler the learner can optimally acquire the whole lexicon after seeing only 5 surface-strings (one string needed for each new complex syntactic category to be learnt). However, the strings appear to the learner in a random order so it is necessary to calculate the expected number of strings (or steps) before convergence.

The learner must necessarily see the string Subj Verb before it can learn any other information. With 12 surface-strings the probability of seeing Subj Verb is 1/12 and the expected number of strings before it is seen is 12. The learner can now learn from 3 surface-strings: Subj Verb Obj, Subj Aux Verb and Adv Subj Verb. Figure 9 shows a Markov structure of the process. From the model we can calculate the expected number of steps to converge to be 24.53.

4 Conclusions

The TLA and CGL were compared for efficiency (expected number of steps to convergence) when acquiring the English-type grammar of the three-parameter system studied by Gibson and Wexler. The expected number of steps for the TLA was found to be 7.26 but the algorithm only converged 63% of the time. The expected number of steps for the CGL is 24.53 but the learner converges more reliably; a trade-off between efficiency and success. With noiseless input the CGL can only fail if there are insufficient input strings or if Bayesian priors are heavily biased against the target. Furthermore, the CGL can be made robust to noise by increasing the probability threshold at which a parameter may be set ACTIVE; the TLA has no mechanism for coping with noisy data.

The CGL learns incrementally; the hypothesis-space from which it can select possible syntactic categories expands dynamically and, as a consequence of the hierarchical structure of parameters, the speed of acquisition increases over time. For instance, in the starting state there is only a 1/12 probability of learning from surface-strings whereas in state k (when all but one category has been acquired) there is a 1/2 probability. It is likely that with a more complex learning task the benefits of this incremental approach will outweigh the slow starting costs. Related work on the effects of incremental learning on STL performance (Sakas, 2000) draws similar conclusions. Future work hopes to compare the CGL with other parametric learners (such as the STL) in larger domains.

References

R Berwick and P Niyogi. 1996. Learning from triggers. Linguistic Inquiry, 27(4):605-622.
H Borer. 1984. Parametric Syntax: Case Studies in Semitic and Romance Languages. Foris, Dordrecht.
E Briscoe. 1999. The acquisition of grammar in an evolving population of language agents. Machine Intelligence, 16.
P Buttery and T Briscoe. 2004. The significance of errors to parametric models of language acquisition. Technical Report SS-04-05, American Association of Artificial Intelligence, March.
P Buttery. 2003. A computational model for first language acquisition. In CLUK-6, Edinburgh.
N Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press.
N Chomsky. 1981. Lectures on Government and Binding. Foris Publications.
R Clark. 1992. The selection of syntactic knowledge. Language Acquisition, 2(2):83-149.
Ethnologue. 2004. Languages of the world, 14th edition. SIL International. http://www.ethnologue.com/.
J Fodor. 1998a. Parsing to learn. Journal of Psycholinguistic Research, 27(3):339-374.
J Fodor. 1998b. Unambiguous triggers. Linguistic Inquiry, 29(1):1-36.
E Gibson and K Wexler. 1994. Triggers. Linguistic Inquiry, 25(3):407-454.
J Legate. 1999. Was the argument that was made empirical? Ms, Massachusetts Institute of Technology.
W Sakas and J Fodor. 2001. The structural triggers learner. In S Bertolo, editor, Language Acquisition and Learnability, chapter 5. Cambridge University Press, Cambridge, UK.
W Sakas. 2000. Ambiguity and the Computational Feasibility of Syntax Acquisition. Ph.D. thesis, City University of New York.
J Siskind. 1996. A computational study of cross situational techniques for learning word-to-meaning mappings. Cognition, 61(1-2):39-91, Nov/Oct.
M Steedman. 2000. The Syntactic Process. MIT Press/Bradford Books.
A Villavicencio. 2002. The acquisition of a unification-based generalised categorial grammar. Ph.D. thesis, University of Cambridge.
C Yang. 2002. Knowledge and Learning in Natural Language. Oxford University Press.

Figure 5: Gibson and Wexler's TLA as a Markov structure. Circles represent possible grammars (a configuration of parameter settings). The target grammar lies at the centre of the structure. Arrows represent the possible transitions between grammars. Note that the TLA is constrained to only allow movement between grammars that differ by one parameter value. The probability of moving between Grammar Gi and Grammar Gj is a measure of the number of target surface-strings that are in Gj but not Gi, normalized by the total number of target surface-strings as well as the number of alternate grammars the learner can move to. For example, the probability of moving from Grammar 3 to Grammar 7 is 2/12 * 1/3 = 1/18, since there are 2 target surface-strings allowed by Grammar 7 that are not allowed by Grammar 3 out of a possible 12, and three grammars that differ from Grammar 3 by one parameter value.
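The transition probability described in this caption can be written out directly. In the sketch below the surface-string sets are placeholders (the full string sets licensed by each grammar are not reproduced in this excerpt), so only the shape of the calculation is Gibson and Wexler's.

# Transition probability as in the caption: strings gained by moving to G_j,
# normalised by the 12 target strings and the 3 neighbouring grammars.
def transition_prob(strings_i, strings_j, target_strings, n_neighbours=3):
    gained = len((strings_j & target_strings) - strings_i)
    return gained / len(target_strings) / n_neighbours

target = set(range(12))                 # the 12 target surface-strings
grammar_3 = set(range(10))              # placeholder: licenses 10 of them
grammar_7 = set(range(12))              # placeholder: licenses all 12
print(transition_prob(grammar_3, grammar_7, target))   # 2/12 * 1/3 = 1/18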

Initial Language   Initial Grammar   Prob. of Converging   Expected no. of Steps
VOS -V2            110               0.66                  2.50
VOS +V2            111               0.00                  n/a
OVS -V2            100               0.40                  21.98
OVS +V2            101               0.00                  n/a
SVO -V2            010               1.00                  0.00
SVO +V2            011               1.00                  6.00
SOV -V2            000               1.00                  5.47
SOV +V2            001               1.00                  14.87

Figure 6: Probability and expected number of steps to convergence from each starting grammar to an English-like grammar (SVO -V2) when using the TLA.

Figure 7: Category hierarchy required to parse Gibson and Wexler's English-like language (top dominates SB, OB and S; S dominates S/S and S\SB; S\SB dominates [S\SB]/OB and [S\SB]/[S\SB]; [S\SB]/OB dominates [[S\SB]/OB]/OB).

Figure 9: The CGL as a Markov structure. The states represent the set of known syntactic categories: state S - {}, state a - {S\SB}, state b - {S\SB, S/S}, state c - {S\SB, [S\SB]/OB}, state d - {S\SB, [S\SB]/[S\SB]}, state e - {S\SB, S/S, [S\SB]/OB}, state f - {S\SB, [S\SB]/OB, [[S\SB]/OB]/OB}, state g - {S\SB, [S\SB]/[S\SB], S/S}, state h - {S\SB, [S\SB]/[S\SB], [S\SB]/OB}, state i - {S\SB, S/S, [S\SB]/OB, [S\SB]/[S\SB]}, state j - {S\SB, S/S, [S\SB]/OB, [[S\SB]/OB]/OB}, state k - {S\SB, [S\SB]/OB, [[S\SB]/OB]/OB, [S\SB]/[S\SB]}, state l - {S\SB, [S\SB]/OB, [[S\SB]/OB]/OB, [S\SB]/[S\SB], S/S}.

On Statistical Parameter Setting

Damir Ćavar, Joshua Herring, Giancarlo Schrementi
Computer Science, Indiana University, Bloomington, IN, 47405
[email protected]

Toshikazu Ikuta, Paul Rodrigues
Linguistics Dept., Indiana University, Bloomington, IN, 46405
[email protected]

Abstract

We present a model and an experimental platform of a bootstrapping approach to statistical induction of natural language properties that is constraint based with voting components. The system is incremental and unsupervised. In the following discussion we focus on the components for morphological induction. We show that the much harder problem of incremental unsupervised morphological induction can outperform comparable all-at-once algorithms with respect to precision. We discuss how we use such systems to identify cues for induction in a cross-level architecture.

1 Introduction

In recent years there has been a growing amount of work focusing on the computational modeling of language processing and acquisition, implying a cognitive and theoretical relevance both of the models as such, as well as of the language properties extracted from raw linguistic data.1 In the computational linguistic literature several attempts to induce grammar or linguistic knowledge from such data have shown that at different levels a high amount of information can be extracted, even with no or minimal supervision. Different approaches tried to show how various puzzles of language induction could be solved. From this perspective, language acquisition is the process of segmentation of non-discrete acoustic input, mapping of segments to symbolic representations, mapping representations on higher-level representations such as phonology, morphology and syntax, and even induction of semantic properties. Due to space restrictions, we cannot discuss all these approaches in detail. We will focus on the close domain of morphology.

Approaches to the induction of morphology as presented in e.g. Schone and Jurafsky (2001) or Goldsmith (2001) show that the morphological properties of a small subset of languages can be induced with high accuracy; however, most of the existing approaches are motivated by applied or engineering concerns, and thus make assumptions that are less cognitively plausible: a. Large corpora are processed all at once, though unsupervised incremental induction of grammars is rather the approach that would be relevant from a psycholinguistic perspective; b. Arbitrary decisions about selections of sets of elements are made, based on frequency or frequency profile rank,2 though such decisions should rather be derived or avoided in general.

The most important aspects missing in these approaches, however, are the link to different linguistic levels and the support of a general learning model that makes predictions about how knowledge is induced on different linguistic levels and what the dependencies between information at these levels are. Further, there is no study focusing on the type of supervision that might be necessary for the guidance of different algorithm types towards grammars that resemble theoretical and empirical facts about language acquisition and processing and the final knowledge of language. While many theoretical models of language acquisition use innateness as a crutch to avoid outstanding difficulties, both on the general and abstract level of I-language as well as the more detailed level of E-language (see, among others, Lightfoot (1999) and Fodor and Teller (2000)), there is also significant research being done which shows that children take advantage of statistical regularities in the input for use in the language-learning task (see Batchelder (1997) and related references within).

In language acquisition theories the dominant view is that knowledge of one linguistic level is bootstrapped from knowledge of one, or even several, different levels.

1 See Batchelder (1998) for a discussion of these aspects.

2 Just to mention some of the arbitrary decisions made in various approaches: e.g. Mintz (1996) selects a small set of all words, the most frequent words, to induce word types via clustering; Schone and Jurafsky (2001) select words with frequency higher than 5 to induce morphological segmentation.

To mention just a few such approaches: Grimshaw (1981) and Pinker (1984) assume that semantic properties are used to bootstrap syntactic knowledge, and Mazuka (1998) suggested that prosodic properties of language establish a bias for specific syntactic properties, e.g. headedness or branching direction of constituents. However, these approaches are based on conceptual considerations and psycholinguistic empirical grounds; the formal models and computational experiments are missing. It is unclear how the induction processes across linguistic domains might work algorithmically, and the quantitative experiments on large scale data are missing.

As for algorithmic approaches to cross-level induction, the best example of an initial attempt to exploit cues from one level to induce properties of another is presented in Déjean (1998), where morphological cues are identified for induction of syntactic structure. Along these lines, we will argue for a model of statistical cue-based learning, introducing a view on bootstrapping as proposed in Elghamry (2004) and Elghamry and Ćavar (2004), that relies on identification of elementary cues in the language input and incremental induction and further cue identification across all linguistic levels.

1.1 Cue-based learning

Presupposing input driven learning, it has been shown in the literature that initial segmentations into words (or word-like units) is possible with unsupervised methods (e.g. Brent and Cartwright (1996)), that induction of morphology is possible (e.g. Goldsmith (2001), Schone and Jurafsky (2001)) and even the induction of syntactic structures (e.g. Van Zaanen (2001)). As mentioned earlier, the main drawback of these approaches is the lack of incrementality, certain arbitrary decisions about the properties of elements taken into account, and the lack of integration into a general model of bootstrapping across linguistic levels.

As proposed in Elghamry (2004), cues are elementary language units that can be identified at each linguistic level, dependent or independent of prior induction processes. That is, intrinsic properties of elements like segments, syllables, morphemes, words, phrases etc. are the ones available for induction procedures. Intrinsic properties are for example the frequency of these units, their size, and the number of other units they are built of. Extrinsic properties are taken into account as well, where extrinsic stands for distributional properties, the context, relations to other units of the same type on one, as well as across linguistic levels. In this model, extrinsic and intrinsic properties of elementary language units are the cues that are used for grammar induction only.

As shown in Elghamry (2004) and Elghamry and Ćavar (2004), there are efficient ways to identify a kernel set of such units in an unsupervised fashion, without any arbitrary decision where to cut the set of elements and on the basis of what kind of features. They present an algorithm that selects the set of kernel cues on the lexical and syntactic level, as the smallest set of words that co-occurs with all other words. Using this set of words it is possible to cluster the lexical inventory into open and closed class words, as well as to identify the subclasses of nouns and verbs in the open class. The direction of the selectional preferences of the language is derived as an average of point-wise Mutual Information on each side of the identified cues and types, which is a self-supervision aspect that biases the search direction for a specific language. This resulting information is understood as derivation of secondary cues, which then can be used to induce selectional properties of verbs (frames), as shown in Elghamry (2004).

The general claim thus is:
• Cues can be identified in an unsupervised fashion in the input.
• These cues can be used to induce properties of the target grammar.
• These properties represent cues that can be used to induce further cues, and so on.

The hypothesis is that this snowball effect can reduce the search space of the target grammar incrementally. The main research questions are now: to what extent do different algorithms provide cues for other linguistic levels, what kind of information do they require as supervision in the system in order to gain the highest accuracy at each linguistic level, and how does the linguistic information of one level contribute to the information on another?

In the following, the architectural considerations of such a computational model are discussed, resulting in an example implementation that is applied to morphology induction, where morphological properties are understood to represent cues for lexical clustering as well as syntactic structure, and vice versa, similar to the ideas formulated in Déjean (1998), among others.

1.2 Incremental Induction Architecture

The basic architectural principle we presuppose is incrementality, where utterances are processed incrementally. The basic language unit is an utterance, with clear prosodic breaks before and after.

The induction algorithm consumes such utterances and breaks them into basic linguistic units, generating for each step hypotheses about the linguistic structure of each utterance, based on the grammar built so far and statistical properties of the single linguistic units. Here we presuppose a successful segmentation into words, i.e. feeding the system utterances with unambiguous word boundaries. We implemented the following pipeline architecture:

The GEN module consumes input and generates hypotheses about its structural descriptions (SD). EVAL consumes a set of SDs and selects the set of best SDs to be added to the knowledge base. The knowledge base is a component that not only stores SDs but also organizes them into optimal representations, here morphology grammars.

All three modules are modular, containing a set of algorithms that are organized in a specific fashion. Our intention is to provide a general platform that can serve for the evaluation and comparison of different approaches at every level of the induction process. Thus, the system is rather designed to be more general, applicable to the problem of segmentation, as well as type and grammar induction.

We assume the input to consist of an alphabet: a non-empty set A of n symbols {s1, s2, ..., sn}. A word w is a non-empty list of symbols w = [s1, s2, ..., sn], with s ∈ A. The corpus is a non-empty list C of words C = [w1, w2, ..., wn]. In the following, the individual modules for the morphology induction task are described in detail.

1.2.1 GEN

For the morphology task GEN is compiled from a set of basically two algorithms. One algorithm is a variant of Alignment Based Learning (ABL), as described in Van Zaanen (2001). As mentioned, the basic ideas in ABL go back to concepts of substitutability and/or complementarity, as discussed in Harris (1961). The concept of substitutability generally applies to the central part of the induction procedure itself, i.e. substitutable elements (e.g. substrings, words, structures) are assumed to be of the same type (represented e.g. with the same symbol).

The advantage of ABL for grammar induction is its constraining characteristics with respect to the set of hypotheses about potential structural properties of a given input. While a brute-force method would generate all possible structural representations for the input in a first order explosion and subsequently filter out irrelevant hypotheses, ABL reduces the set of possible SDs from the outset to the ones that are motivated by previous experience/input or a pre-existing grammar.

Such constraining characteristics make ABL attractive from a cognitive point of view, both because hopefully the computational complexity is reduced on account of the smaller set of potential hypotheses, and also because learning of new items, rules, or structural properties is related to a general learning strategy and previous experience only. The approaches that are based on a brute-force first order explosion of all possible hypotheses with subsequent filtering of relevant or irrelevant structures are both memory-intensive and require more computational effort.

The algorithm is not supposed to make any assumptions about types of morphemes. There is no expectation, including use of notions like stem, prefix, or suffix. We assume only linear sequences. The properties of single morphemes, being stems or suffixes, should be a side effect of their statistical properties (including their frequency and co-occurrence patterns, as will be explained in the following), and their alignment in the corpus, or within words.

There are no rules about language built in, such as what a morpheme must contain or how frequent it should be. All of this knowledge is induced statistically.

In the ABL Hypotheses Generation, a given word in the utterance is checked against morphemes in the grammar. If an existing morpheme LEX aligns with the input word INP, a hypothesis is generated suggesting a morphological boundary at the alignment positions:

INP (speaks) + LEX (speak) = HYP [speak, s]

Another design criterion for the algorithm is complete language independence. It should be able to identify morphological structures of Indo-European type languages, as well as agglutinative languages (e.g. Japanese and Turkish) and polysynthetic languages like some Bantu dialects or American Indian languages. In order to guarantee this behavior, we extended the Alignment Based hypothesis generation with a pattern identifier that extracts patterns of character sequences of the types:

1. A — B — A
2. A — B — A — B
3. A — B — A — C

This component is realized with cascaded regular expressions that are able to identify and return the substrings that correspond to the repeating sequences.3

All possible alignments for the existing grammar at the current state are collected in a hypothesis list and sent to the EVAL component, described in the following. A hypothesis is defined as a tuple H = <w, f, g>, with w the input word, f its frequency in C, and g a list of substrings that represent a linear list of morphemes in w, g = [m1, m2, ..., mn].

1.2.2 EVAL

EVAL is a voting based algorithm that subsumes a set of independent algorithms that judge the list of SDs from the GEN component, using statistical and information theoretic criteria. The specific algorithms are grouped into memory and usability oriented constraints.

Taken as a whole, the system assumes two (often competing) cognitive considerations. The first of these forms a class of what we term "time-based" constraints on learning. These constraints are concerned with the processing time required of a system to make sense of items in an input stream, whereby "time" is understood to mean the number of steps required to generate or parse SDs rather than the actual temporal duration of the process. To that end, they seek to minimize the amount of structure assigned to an utterance, which is to say they prefer to deal with as few rules as possible. The second of these cognitive considerations forms a class of "memory-based" constraints. Here, we are talking about constraints that seek to minimize the amount of memory space required to store an utterance by maximizing the efficiency of the storage process. In the specific case of our model, which deals with morphological structure, this means that the memory-based constraints search the input string for regularities (in the form of repeated substrings) that then need only be stored once (as a pointer) rather than each time they are found. In the extreme case, the time-based constraints prefer storing the input "as is", without any processing at all, where the memory-based constraints prefer a rule for every character, as this would assign maximum structure to the input.

Parsable information falls out of the tension between these two conflicting constraints, which can then be applied to organize the input into potential syntactic categories. These can then be used to set the parameters for the internal adult parsing system.

Each algorithm is weighted. In the current implementation these weights are set manually. In future studies we hope to use the weighting for self-supervision.4 Each algorithm assigns a numerical rank to each hypothesis, multiplied with the corresponding weight, a real number between 0 and 1.

On the one hand, our main interest lies in the comparison of the different algorithms and a possible interaction or dependency between them. Also, we expect the different algorithms to be of varying importance for different types of languages.

Mutual Information (MI)

For the purpose of this experiment we use a variant of standard Mutual Information (MI), see e.g. MacKay (2003). Information theory tells us that the presence of a given morpheme restricts the possibilities of the occurrence of morphemes to the left and right, thus lowering the amount of bits needed to store its neighbors. Thus we should be able to calculate the amount of bits needed by a morpheme to predict its right and left neighbors respectively. To calculate this, we have designed a variant of mutual information that is concerned with a single direction of information.

This is calculated in the following way. For every morpheme y that occurs to the right of x we sum the point-wise MI between x and y, but we relativize the point-wise MI by the probability that y follows x, given that x occurs. This then gives us the expectation of the amount of information that x tells us about which morpheme will be to its right. Note that p(<xy>) is the probability of the bigram <xy> occurring and is not equal to p(<yx>), which is the probability of the bigram <yx> occurring. We calculate the MI on the right side of x ∈ G by:

Σ_{y ∈ G} p(<xy> | x) lg [ p(<xy>) / (p(x) p(y)) ]

and the MI on the left of x ∈ G respectively by:

Σ_{y ∈ G} p(<yx> | x) lg [ p(<yx>) / (p(y) p(x)) ]

One way we use this as a metric is by summing up the left and right MI for each morpheme in a hypothesis.

3 This addition might be understood to be a sort of supervision in the system. However, recent research on human cognitive abilities, and especially on the ability to identify patterns in the speech signal by very young infants (Marcus et al., 1999), shows that we can assume such an ability to be part of the cognitive abilities, maybe not even language specific.

4 One possible way to self-supervise the weights in this architecture is by taking into account the revisions subsequent components make when they optimize the grammar. If rules or hypotheses have to be removed from the grammar due to general optimization constraints on the grammars as such, the weight of the responsible algorithm can be lowered, decreasing its general value in the system in the long run. The relevant evaluations with this approach are not yet finished.
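The directional MI above can be computed straightforwardly from morpheme unigram and bigram counts. The sketch below does so for the right-hand sum; the left-hand sum is symmetric. The relative-frequency probability estimates and the toy counts are assumptions for illustration, since the paper does not spell out its estimator.

from collections import Counter
from math import log2

# Single-direction (rightward) MI of a morpheme x, as in the first sum above.
def right_mi(x, unigrams, bigrams):
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    total = 0.0
    for (a, b), c in bigrams.items():
        if a != x:
            continue
        p_xy = c / n_bi                        # p(<xy>)
        p_xy_given_x = c / unigrams[x]         # p(<xy> | x): how often y follows x
        pmi = log2(p_xy / ((unigrams[x] / n_uni) * (unigrams[b] / n_uni)))
        total += p_xy_given_x * pmi
    return total

unigrams = Counter({"speak": 10, "s": 6, "ing": 4})        # toy counts
bigrams = Counter({("speak", "s"): 6, ("speak", "ing"): 4})
print(right_mi("speak", unigrams, bigrams))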

12 hypothesis. We then look for the hypothesis that The second equation doesn't give variable results in the maximal value of this sum. The pointer lengths, but it is preferred since it doesn't tendency for this to favor hypotheses with many carry the heavy computational burden of morphemes is countered by our criterion of calculating the frequency rank. favoring hypotheses that have fewer morphemes, We calculate the description length for each GEN discussed later. hypothesis only,5 by summing up the cost of each Another way to use the left and right MI is in morpheme in the hypothesis. Those with low judging the quality of morpheme boundaries. In a description lengths are favored. good boundary, the morpheme on the left side Relative Entropy (RE) should have high right MI and the morpheme on We are using RE as a measure for the cost of the right should have high left MI. Unfortunately, adding a hypothesis to the existing grammar. We MI is not reliable in the beginning because of the look for hypotheses that when added to the low frequency of morphemes. However, as the grammar will result in a low divergence from the lexicon is extended during the induction procedure, original grammar. reliable frequencies are bootstrapping this We calculate RE as a variant of the Kullback- segmentation evaluation. Leibler Divergence, see MacKay (2003). Given Minimum Description Length (DL) grammar G1, the grammar generated so far, and G2 The principle of Minimum Description Length the grammar with the extension generated for the (MDL), as used in recent work on grammar new input increment, P(X) is the probability mass induction and unsupervised language acquisition, function (pmf) for grammar G2, and Q(X) the pmf e.g. Goldsmith (2001) and De Marcken (1996), for grammar G1: explains the grammar induction process as an P(x) P(x)lg iterative minimization procedure of the grammar ∑ Q(x) size, where the smaller grammar corresponds to the x∈X Note that with every new iteration a new element best grammar for the given data/corpus. can appear, that is not part of G . Our variant of RE The description length metric, as we use it here, 1 takes this into account by calculating the costs for tells us how many bits of information would be such a new element x to be the point-wise entropy required to store a word given a hypothesis of the of this element in P(X), summing up over all new morpheme boundaries, using the so far generated elements: grammar. For each morpheme in the hypothesis 1 that doesn't occur in the grammar we need to store ∑ P(x)lg the string representing the morpheme. For x∈X P(x) morphemes that do occur in our grammar we just These two sums then form the RE between the need to store a pointer to that morphemes entry in original grammar and the new grammar with the the grammar. We use a simplified calculation, addition of the hypothesis. Hypotheses with low taken from Goldsmith (2001), of the cost of storing RE are favored. a string that takes the number of bits of This metric behaves similarly to description information required to store a letter of the length, that is discussed above, in that both are alphabet and multiply it by the length of the string. calculating the distance between our original lg(len(alphabet))* len(morpheme) grammar and the grammar with the inclusion of the We have two different methods of calculating new hypothesis. The primary difference is RE also the cost of the pointer. 
The first assigns a variable takes into account how the pmf differs in the two the cost based on the frequency of the morpheme grammars and that our variation punishes new that it is pointing to. So first we calculate the morphemes based upon their frequency relative to frequency rank of the morpheme being pointed to, the frequency of other morphemes. Our (e.g. the most frequent has rank 1, the second rank implementation of MDL does not consider 2, etc.). We then calculate: frequency in this way, which is why we are floor(lg( freq_ rank) −1) including RE as an independent metric. to get a number of bits similar to the way Morse Further Metrics code assigns lengths to various letters. In addition to the mentioned metric, we take into The second is simpler and only calculates the account the following criteria: a. Frequency of entropy of the grammar of morphemes and uses this as the cost of all pointers to the grammar. The entropy equation is as follows: 5 1 We do not calculate the sizes of the grammars with ∑ p(x)lg and without the given hypothesis, just the amount each x∈G p(x) given hypothesis would add to the grammar, favoring the least increase of total grammar size.
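To make the interaction of these scores concrete, the following sketch computes right and left MI, the entropy-based description length, and the RE variant for toy counts. It is only an illustration of the formulas above: the Grammar container, the bigram counts, and the helper names are hypothetical stand-ins for the system's actual data structures.

```python
import math
from collections import Counter

def lg(x):
    """Base-2 logarithm, as in the formulas above."""
    return math.log(x, 2)

class Grammar:
    """Toy stand-in for the induced morpheme lexicon: unigram and bigram counts."""
    def __init__(self, morphemes, bigrams):
        self.uni = Counter(morphemes)      # morpheme -> frequency
        self.bi = Counter(bigrams)         # (left, right) morpheme pair -> frequency
        self.n = sum(self.uni.values())
        self.n_bi = sum(self.bi.values())

    def p(self, m):
        return self.uni[m] / self.n

    def p_pair(self, a, b):
        return self.bi[(a, b)] / self.n_bi

    def right_mi(self, x):
        """Sum over morphemes y following x of p(<xy>|x) * lg(p(<xy>) / (p(x)p(y)))."""
        return sum((c / self.uni[x]) * lg(self.p_pair(a, b) / (self.p(x) * self.p(b)))
                   for (a, b), c in self.bi.items() if a == x)

    def left_mi(self, x):
        """Sum over morphemes y preceding x of p(<yx>|x) * lg(p(<yx>) / (p(y)p(x)))."""
        return sum((c / self.uni[x]) * lg(self.p_pair(a, b) / (self.p(a) * self.p(x)))
                   for (a, b), c in self.bi.items() if b == x)

    def entropy(self):
        """Entropy of the morpheme distribution, used as a flat pointer cost."""
        return sum(self.p(m) * lg(1 / self.p(m)) for m in self.uni)

def description_length(hypothesis, grammar, alphabet_size):
    """Known morphemes cost one pointer (grammar entropy); new ones are stored as strings."""
    pointer = grammar.entropy()
    return sum(pointer if m in grammar.uni else lg(alphabet_size) * len(m)
               for m in hypothesis)

def relative_entropy(p_new, p_old):
    """KL-style divergence of the extended grammar from the current one; elements
    missing from the old grammar contribute their point-wise entropy."""
    return sum(px * lg(px / p_old[x]) if p_old.get(x, 0) > 0 else px * lg(1 / px)
               for x, px in p_new.items())

# Tiny example: a lexicon built from "walk-ed", "walk-s", "jump-ed"
g = Grammar(["walk", "ed", "walk", "s", "jump", "ed"],
            [("walk", "ed"), ("walk", "s"), ("jump", "ed")])
print(g.right_mi("walk") + g.left_mi("walk"))       # MI contribution of "walk"
print(description_length(["walk", "ing"], g, 26))   # one pointer plus one new string
```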

Further Metrics

In addition to the mentioned metrics, we take into account the following criteria: a. Frequency of morpheme boundaries; b. Number of morpheme boundaries; c. Length of morphemes.

The frequency of morpheme boundaries is given by the number of hypotheses that contain this boundary. The basic intuition is that the higher this number is, i.e. the more alignments are found at a certain position within a word, the more likely this position represents a morpheme boundary. We favor hypotheses with high values for this criterion.

The number of morpheme boundaries indicates how many morphemes the word was split into. To prevent the algorithm from degenerating into the state where each letter is identified as a morpheme, we favor hypotheses with a low number of morpheme boundaries.

The length of the morphemes is also taken into account. We favor hypotheses with long morphemes, to prevent the same degenerate state as the above criterion.

1.2.3 Linguistic Knowledge

The acquired lexicon is stored in a hypothesis space which keeps track of the words from the input and the corresponding hypotheses. The hypothesis space is defined as a list of hypotheses: S = [H1, H2, ..., Hn]. Further, each morpheme that occurred in the SDs of words in the hypothesis space is kept with its frequency information, as well as bigrams that consist of morpheme pairs in the SDs and their frequency.[6]

Footnote 6: Due to space restrictions we do not formalize this further. A complete documentation and the source code are available at: http://jones.ling.indiana.edu/~abugi/.

Similar to the specification of signatures in Goldsmith (2001), we list every morpheme with the set of morphemes it co-occurs with. Signatures are lists of morphemes. Grammar construction is performed by replacement of morphemes with a symbol, if they have equal signatures.

The hypothesis space is virtually divided into two sections, long-term and short-term storage. Long-term storage is not revised further in the current version of the algorithm. The short-term storage is cyclically cleaned up by eliminating the signatures with a low likelihood, given the long-term storage.
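The signature bookkeeping just described can be pictured with a small sketch: it collects, for each morpheme, the set of morphemes it co-occurs with, and groups morphemes whose signatures are identical under a shared category symbol. The function names and the category labels are illustrative only, not the system's actual implementation.

```python
from collections import defaultdict

def build_signatures(segmented_words):
    """Map each morpheme to the set of morphemes it co-occurs with in the
    structural descriptions (here simplified to: the same segmented word)."""
    signatures = defaultdict(set)
    for morphemes in segmented_words:
        for m in morphemes:
            signatures[m].update(x for x in morphemes if x != m)
    return signatures

def merge_equal_signatures(signatures):
    """Morphemes with identical signatures are replaced by a single symbol,
    compressing the grammar."""
    groups = defaultdict(list)
    for morpheme, sig in signatures.items():
        groups[frozenset(sig)].append(morpheme)
    symbol_for = {}
    for i, members in enumerate(groups.values()):
        if len(members) > 1:
            for m in members:
                symbol_for[m] = f"C{i}"   # category symbol standing in for the group
    return symbol_for

# Example: hypotheses for "walk-ed", "jump-ed", "walk-s", "jump-s"
words = [["walk", "ed"], ["jump", "ed"], ["walk", "s"], ["jump", "s"]]
sigs = build_signatures(words)
print(sigs["walk"])                   # {'ed', 's'}
print(merge_equal_signatures(sigs))   # walk/jump share a signature, as do ed/s
```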
2 The experimental setting

In the following we discuss the experimental setting. We used the Brown corpus,[7] the child-oriented speech portion of the CHILDES Peter corpus,[8] and Caesar's "De Bello Gallico" in Latin.[9] From the Brown corpus we used the files ck01–ck09, with an average number of 2000 words per chapter. The total number of words in these files is 18071. The randomly selected portion of "De Bello Gallico" contained 8300 words. The randomly selected portion of the Peter corpus contains 58057 words.

Footnote 7: The Brown Corpus of Standard American English, consisting of 1,156,329 words from American texts printed in 1961, organized into 59,503 utterances and compiled by W.N. Francis and H. Kucera at Brown University.

Footnote 8: Documented in L. Bloom (1970) and available at http://xml.talkbank.org:8888/talkbank/file/CHILDES/Eng-USA/Bloom70/Peter/.

Footnote 9: This was taken from the Gutenberg archive at http://www.gutenberg.net/etext/10657. The Gutenberg header and footer were removed for the experimental run.

The system reads in each file and dumps log information during runtime that contains the information for online and offline evaluation, as described below in detail.

The gold standard for evaluation is based on human segmentation of the words in the respective corpora. We create for every word a manual segmentation for the given corpora, used for online evaluation of the system's accuracy of hypothesis generation during runtime. Due to complicated cases, where linguists are undecided about the accurate morphological segmentation, a team of five linguists cooperated on this task.

The offline evaluation is based on the grammar that is generated and dumped during runtime after each input file is processed. The grammar is manually annotated by a team of linguists, indicating for each construction whether it was segmented correctly and exhaustively. An additional evaluation criterion was to mark undecided cases, where even linguists do not agree. This information was, however, not used in the final evaluation.

2.1 Evaluation

We used two methods to evaluate the performance of the algorithm. The first analyzes the accuracy of the morphological rules produced by the algorithm after an increment of n words. The second looks at how accurately the algorithm parsed each word that it encountered as it progressed through the corpus.

The morphological rule analysis looks at each grammar rule generated by the algorithm and judges it on the correctness of the rule and the resulting parse. A grammar rule consists of a stem and the suffixes and prefixes that can be attached to it, similar to the signatures used in Goldsmith (2001). The grammar rule was then marked as to whether it consisted of legitimate suffixes and prefixes for that stem, and also as to whether the stem of the rule was a true stem, as opposed to a stem plus another morpheme that was not identified by the algorithm. The number of rules that were correct in these two categories was then summed, and precision and recall figures were calculated for the trial. The trials described in the graph below were run on three increasingly large portions of the general fiction section of the Brown Corpus. The first trial was run on one randomly chosen chapter, the second trial on two chapters, and the third on three chapters. The graph shows the harmonic average (F-score) of precision and recall.

(Figure: F-score of the morphological rule analysis for the three Brown Corpus trials.)

The second analysis is conducted as the algorithm is running and examines each parse the system produces. The algorithm's parses are compared with the "correct" morphological parse of the word using the following method to derive a numerical score for a particular parse. The first part of the score is the distance in characters between each morphological boundary in the two parses, with a score of one point for each character space. The second part is a penalty of two points for each morphological boundary that occurs in one parse and not the other. These scores were examined within a moving window of words that progressed through the corpus as the algorithm ran. The average scores of words in each such window were calculated as the window advanced. The purpose of this method was to allow the performance of the algorithm to be judged at a given point without prior performance in the corpus affecting the analysis of the current window. The following graph shows the average performance of the windows of analyzed words as the algorithm progresses through five randomly chosen chapters of general fiction in the Brown Corpus, amounting to around 10,000 words. The window size for the following graph was set to 40 words.

(Figure: average window scores as the algorithm progresses through five general-fiction chapters of the Brown Corpus; window size 40 words.)
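The per-word scoring scheme and the moving-window averaging can be sketched as follows, under one reading of the description above: boundaries are character offsets, a predicted boundary is paired with the nearest gold boundary within a small tolerance and costs the character distance, and every unmatched boundary on either side costs the flat two-point penalty. The tolerance constant and the function names are assumptions of this sketch, not part of the published method.

```python
MAX_SHIFT = 2   # assumed tolerance (in characters) for pairing up two boundaries

def parse_score(predicted, gold):
    """Distance between two segmentations of one word, each given as a set of
    boundary positions (character offsets)."""
    score = 0
    matched = set()
    for b in sorted(predicted):
        candidates = [g for g in gold if g not in matched and abs(g - b) <= MAX_SHIFT]
        if candidates:
            nearest = min(candidates, key=lambda g: abs(g - b))
            score += abs(nearest - b)     # one point per character of displacement
            matched.add(nearest)
        else:
            score += 2                    # boundary with no counterpart in the gold parse
    score += 2 * len(gold - matched)      # gold boundaries the system missed
    return score

def windowed_averages(scores, window=40):
    """Average score of each moving window of words (window size 40 above)."""
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# "walking" segmented as walk|ing: boundary after the fourth character
print(parse_score({4}, {4}))     # 0: identical parses
print(parse_score({3}, {4}))     # 1: boundary off by one character
print(parse_score({4, 6}, {4}))  # 2: one spurious boundary
```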
The evaluations on Latin were based on the initial 4000 words of "De Bello Gallico" in a pretest. In the very initial phase we reached a precision of 99.5% and a recall of 13.2%. This is, however, the preliminary result for the initial phase only. We expect that for a larger corpus the recall will increase considerably, given the rich morphology of Latin, potentially with negative consequences for precision.

The results on the Peter corpus are shown in the following table:

After file    Precision    Recall
01            .9957        .8326
01-03         .9968        .8121
01-05         .9972        .8019
01-07         .9911        .7710
01-09         .9912        .7666

We notice a more or less stable precision value with decreasing recall, due to a higher number of words. The Peter corpus also contains many very specific transcriptions and tokens that are indeed unique, thus it is rather surprising to get such results at all. The following graphic shows the F-score for the Peter corpus:

(Figure: F-score on the Peter corpus after each input file.)

3 Conclusion

The evaluations on two related morphology systems show that with a restrictive setting of the parameters in the described algorithm, approx. 99% precision can be reached, with a recall higher than 60% for the portion of the Brown corpus, and even higher for the Peter corpus.

We are able to identify phases in the generation of rules, which for English turn out to be: a. initially inflectional morphology on verbs, with the plural "s" on nouns, and b. subsequently other types of morphemes. We believe that this phenomenon is purely driven by the frequency of these morphemes in the corpora. In the manually segmented portion of the Brown corpus we identified on the token level 11.3% inflectional morphemes, 6.4% derivational morphemes, and 82.1% stems. On average there are twice as many inflectional morphemes in the corpus as derivational ones.

Given very strict parameter settings, focusing on the description length of the grammar, our system would need a long time till it would discover prefixes, not to mention infixes. By relaxing the weight of description length we can inhibit the generation and identification of prefixing rules, however, at the cost of precision.

Given these results, the inflectional paradigms can be claimed to be extractable even with an incremental approach. As such, this means that central parts of the lexicon can be induced very early along the time line.

The existing signatures for each morpheme can be used as simple clustering criteria.[10] Clustering will separate dependent morphemes (affixes) from independent morphemes (stems). Their basic distinction is that affixes will usually have a long signature, i.e. many elements they co-occur with, as well as a high frequency, while for stems the opposite is true.[11] Along these lines, morphemes with a similar signature can be replaced by symbols, expressing the same type information and compressing the grammar further. This type information, especially for rare morphemes, is essential in the subsequent induction of syntactic structure. Due to space limitations, we cannot discuss in detail subsequent steps in the cross-level induction procedures. Nevertheless, the model presented here provides an important pointer to the mechanics of how grammatical parameters might come to be set. Additionally, we provide a method by which to test the roles different statistical algorithms play in this process. By adjusting the weights of the contributions made by various constraints, we can approach an understanding of the optimal ordering of algorithms that play a role in the computational framework of language acquisition.

This is but a first step toward what we hope will eventually become a platform for a detailed study of various induction algorithms and evaluation metrics.

Footnote 10: Length of the signature and frequency of each morpheme are mapped onto a feature vector.

Footnote 11: This way, similar to the clustering of words into open and closed class on the basis of feature vectors, as described in Elghamry and Ćavar (2004), the morphemes can be separated into open and closed class.

References

E. O. Batchelder. 1997. Computational evidence for the use of frequency information in discovery of the infant's first lexicon. PhD dissertation, CUNY.

E. O. Batchelder. 1998. Can a computer really model cognition? A case study of six computational models of infant word discovery. In M. A. Gernsbacher and S. J. Derry, editors, Proceedings of the 20th Annual Conference of the Cognitive Science Society, pages 120–125. Lawrence Erlbaum, University of Wisconsin-Madison.

L. Bloom, L. Hood, and P. Lightbown. 1974. Imitation in language development: If, when and why. Cognitive Psychology, 6, 380–420.

M.R. Brent and T.A. Cartwright. 1996. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61: 93–125.

H. Déjean. 1998. Concepts et algorithmes pour la découverte des structures formelles des langues. Doctoral dissertation, Université de Caen Basse-Normandie.

K. Elghamry. 2004. A generalized cue-based approach to the automatic acquisition of subcategorization frames. Doctoral dissertation, Indiana University.

K. Elghamry and D. Ćavar. 2004. Bootstrapping cues for cue-based bootstrapping. Mscr., Indiana University.

J. Fodor and V. Teller. 2000. Decoding syntactic parameters: The superparser as oracle. Proceedings of the Twenty-Second Annual Conference of the Cognitive Science Society, 136–141.

J. Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2): 153–198.

Z.S. Harris. 1961. Structural Linguistics. University of Chicago Press, Chicago.

J. Grimshaw. 1981. Form, function, and the language acquisition device. In C.L. Baker and J.J. McCarthy (eds.), The Logical Problem of Language Acquisition. Cambridge, MA: MIT Press.

D.J.C. MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press.

C.G. de Marcken. 1996. Unsupervised Language Acquisition. PhD dissertation, MIT.

G.F. Marcus, S. Vijayan, S. Bandi Rao, and P.M. Vishton. 1999. Rule-learning in seven-month-old infants. Science, 283: 77–80.

R. Mazuka. 1998. The Development of Language Processing Strategies: A Cross-linguistic Study between Japanese and English. Lawrence Erlbaum.

T.H. Mintz. 1996. The roles of linguistic input and innate mechanisms in children's acquisition of grammatical categories. Unpublished doctoral dissertation, University of Rochester.

S. Pinker. 1984. Language Learnability and Language Development. Harvard University Press, Cambridge, MA.

S. Pinker. 1994. The Language Instinct. New York, NY: W. Morrow and Co.

P. Schone and D. Jurafsky. 2001. Knowledge-free induction of inflectional morphologies. In Proceedings of NAACL-2001, Pittsburgh, PA, June 2001.

M.M. van Zaanen and P. Adriaans. 2001. Comparing two unsupervised grammar induction systems: Alignment-based learning vs. EMILE. Tech. Rep. TR2001.05, University of Leeds.

M.M. van Zaanen. 2001. Bootstrapping Structure into Language: Alignment-Based Learning. Doctoral dissertation, The University of Leeds.

Putting Meaning into Grammar Learning

Nancy Chang UC Berkeley, Dept. of Computer Science and International Computer Science Institute 1947 Center St., Suite 600 Berkeley, CA 94704 USA [email protected]

Abstract Moreover, by the time children make the leap from single words to complex combinations, they have This paper proposes a formulation of grammar amassed considerable conceptual knowledge, in- learning in which meaning plays a fundamental cluding familiarity with a wide variety of entities role. We present a computational model that aims and events and sophisticated pragmatic skills (such to satisfy convergent constraints from cognitive lin- as using joint attention to infer communicative in- guistics and crosslinguistic developmental evidence tentions (Tomasello, 1995) and subtle lexical dis- within a statistically driven framework. The target tinctions (Bloom, 2000)). The developmental evi- grammar, input data and goal of learning are all de- dence thus suggests that the input to grammar learn- signed to allow a tight coupling between language ing may in principle include not just surface strings learning and comprehension that drives the acqui- but also meaningful situation descriptions with rich sition of new constructions. The model is applied semantic and pragmatic information. to learn lexically specific multi-word constructions from annotated child-directed transcript data. This paper formalizes the grammar learning problem in line with the observations above, tak- 1 Introduction ing seriously the ideas that the target of learning, for both lexical items and larger phrasal and clausal What role does meaning play in the acquisition of units, is a bipolar structure in which meaning is on grammar? Computational approaches to grammar par with form, and that meaningful language use learning have tended to exclude semantic informa- drives language learning. The resulting core com- tion entirely, or else relegate it to lexical representa- putational problem can be seen as a restricted type tions. Starting with Gold’s (1967) influential early of relational learning. In particular, a key step of work on language identifiability in the limit and the learning task can be cast as learning relational continuing with work in the formalist learnability correspondences, that is, associations between form paradigm, grammar learning has been equated with relations (typically word order) and meaning rela- syntax learning, with the target of learning consist- tions (typically role-filler bindings). Such correla- ing of relatively abstract structures that govern the tions are essential for capturing complex multi-unit combination of symbolic linguistic units. Statis- constructions, both lexically specific constructions tical, corpus-based efforts have likewise restricted and more general grammatical constructions. their attention to inducing syntactic patterns, though in part due to more practical considerations, such as The remainder of the paper is structured as fol- the lack of large-scale semantically tagged corpora. lows. Section 2 states the learning task and pro- But a variety of cognitive, linguistic and develop- vides an overview of the model and its assump- mental considerations suggest that meaning plays a tions. We then present algorithms for inducing central role in the acquisition of linguistic units at structured mappings, based on either specific input all levels. 
We start with the proposition that lan- examples or the current set of constructions (Sec- guage use should drive language learning — that is, tion 3), and describe how these are evaluated using the learner’s goal is to improve its ability to commu- criteria based on minimum description length (Ris- nicate, via comprehension and production. Cogni- sanen, 1978). Initial results from applying the learn- tive and constructional approaches to grammar as- ing algorithms to a small corpus of child-directed sume that the basic unit of linguistic knowledge utterances demonstrate the viability of the approach needed to support language use consists of pairings (Section 4). We conclude with a discussion of the of form and meaning, or constructions (Langacker, broader implications of this approach for language 1987; Goldberg, 1995; Fillmore and Kay, 1999). learning and use.
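As a rough illustration of what such a bipolar target structure looks like as a data object, the sketch below pairs word-order constraints with role-filler bindings over shared constituents. The class names, the constituent types, and the construction name are illustrative stand-ins chosen for this example; they are not the formalism introduced in the next section.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class WordOrder:
    """Form relation: constituent `first` precedes constituent `second`."""
    first: str
    second: str

@dataclass
class RoleFiller:
    """Meaning relation: a role of one constituent is filled by another constituent."""
    bearer: str   # constituent contributing the role, e.g. "t2"
    role: str     # e.g. "thrower" or "throwee"
    filler: str   # constituent identified with that role, e.g. "t1"

@dataclass
class Construction:
    """A form-meaning pairing over named constituents (a simplification for illustration)."""
    name: str
    constituents: List[Tuple[str, str]] = field(default_factory=list)  # (name, type)
    form: List[WordOrder] = field(default_factory=list)
    meaning: List[RoleFiller] = field(default_factory=list)

# A lexically specific pattern licensing strings like "you throw the ball":
throw_pattern = Construction(
    name="THROW-TRANSITIVE-like",
    constituents=[("t1", "Referent"), ("t2", "Throw"), ("t3", "Referent")],
    form=[WordOrder("t1", "t2"), WordOrder("t2", "t3")],
    meaning=[RoleFiller("t2", "thrower", "t1"), RoleFiller("t2", "throwee", "t3")],
)
```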

17 2 Overview of the learning problem We highlight a few relevant aspects of the for- We begin with an informal description of our learn- malism, exemplified in Figure 1. Each construc- ing task, to be formalized below. At all stages tion has sections labeled form and meaning list- of language learning, children are assumed to ex- ing the entities (or roles) and constraints (type con- ploit general cognitive abilities to make sense of straints marked with :, filler constraints marked the flow of objects and events they experience. To with £¥¤ , and identification (or coindexation) con- make sense of linguistic events — sounds and ges- straints marked with £§¦ ) of the respective do- tures used in their environments for communica- mains. These two sections, also called the form and tive purposes — they also draw on specifically meaning poles, capture the basic intuition that con- structions are form-meaning pairs. A subscripted linguistic knowledge of how forms map to mean- ¨ ings, i.e., constructions. Comprehension consists of or © allows reference to the form or meaning two stages: identifying the constructions involved pole of any construction, and the keyword self al- and how their meanings are related (analysis), and lows self-reference. Thus, the   construc- matching these constructionally sanctioned mean- tion simply links a form whose orthography role (or ings to the actual participants and relations present feature) is bound to the string “throw” to a mean- in context (resolution). The set of linguistic con- ing that is constrained to be of type Throw, a sepa- structions will typically provide only a partial anal- rately defined conceptual schema corresponding to ysis of the utterance in the given context; when this throwing events (including roles for a thrower and happens, the agent may still draw on general infer- throwee). (Although not shown in the examples, the ence to match even a partial analysis to the context. formalism also includes a subcase of notation for The goal of construction learning is to acquire expressing constructional inheritance.) a useful set of constructions, or grammar. This grammar should allow constructional analysis to construction  form

produce increasingly complete interpretations of ut-  self  .orth “throw” terances in context, thus requiring minimal recourse meaning to general resolution and inference procedures. In

self : Throw the limit the grammar should stabilize, while still

being useful for comprehending novel input. A use-

*+( ,.- ful grammar should also reflect the statistical prop- construction ! "$#&%')( constituents

erties of the input data, in that more frequent or spe-

%4& "56&72+-2')')( &% t1 : /0-213-2++(

cific constructions should be learned before more &% t2 : ! 98:&%';*+<=*+(

infrequent and more general constructions.

%4& "56&72+-2')')( &% t3 : /0-213-2++( Formally, we define our learning task as follows:

form

Given an initial grammar and a sequence of train-  t1  before t2

ing examples consisting of an utterance paired with 

t2  before t3 ¢¡

its context, find the best grammar to fit seen data meaning

!>

and generalize to new data. The remainder of this t2 .thrower t1

!> section describes the hypothesis space, prior knowl- t2 .throwee t3 edge and input data relevant to the task. 2.1 Hypothesis space: embodied constructions Figure 1: Embodied Construction Grammar repre-

The space of possible grammars (or sets of con- sentation of the lexical   and lexically spe- structions) is defined by Embodied Construc- cific  @?A BCD3EGFEIH:J construction (licensing tion Grammar (ECG), a computationally explicit expressions like You throw the ball). unification-based formalism for capturing insights Multi-unit constructions such as the  @? from the construction grammar and cognitive lin- BCD3EGFEIH:J construction also list their con- guistics literature (Bergen and Chang, in press; stituents, each of which is itself a form-meaning Chang et al., 2002). ECG is designed to support construction. These multi-unit constructions serve the analysis process mentioned above, which deter- as the target representation for the specific learn- mines what constructions and schematic meanings ing task at hand. The key representational insight are present in an utterance, resulting in a semantic here is that the form and meaning constraints typi- specification (or semspec).1 simulation using active representations (or embodied schemas) 1ECG is intended to support a simulation-based model of to produce context-sensitive inferences. See Bergen and Chang language understanding, with the semspec parameterizing a (in press) for details.

18 cally involve relations among the form and meaning These include schemas for people, objects (e.g. Ball poles of the constructional constituents. For cur- in the ®gure), locations, and actions familiar to chil- rent purposes we limit the potential form relations dren by the time they enter the two-word stage (typ- to word order, although many other form relations ically toward the end of the second year). Actions are in principle allowed. In the meaning domain, like the Throw schema referred to in the example the primary relation is identification, or uni®cation, ¢¡¤£¦¥¨§ construction and in the ®gure have roles between two meaning entities. In particular, we whose ®llers are subject to type constraints, re¯ect- will focus on role-®ller bindings, in which a role of ing children's knowledge of what kinds of entities one constituent is identi®ed with another constituent can take place in different events. or with one of its roles. The example construc- 2.2.2 Lexical constructions tion pairs two word order constraints over its con- stituents' form poles with two identi®cation con- The input to learning includes a set of lexical con- straints over its constituents' meaning poles (these structions, represented using the ECG formalism, specify the ®llers of the thrower and throwee roles linking simple forms (i.e. words) to speci®c con-

ceptual items. Examples of these include the © and

of a Throw event, respectively). ¤  constructions in the ®gure, as well as the

Note that both lexical constructions and the ¢¡¤£¦¥¨§ multi-unit constructions needed to express gram- construction formally de®ned in Figure 1. matical patterns can be seen as graphs of varying Lexical learning is not the focus of the current work, complexity. Each domain (form or meaning) can but a number of previous computational approaches be represented as a subgraph of elements and re- have shown how simple mappings may be acquired lations among them. Lexical constructions involve from experience (Regier, 1996; Bailey, 1997; Roy a simple mapping between these two subgraphs, and Pentland, 1998). whereas complex constructions with constituents 2.2.3 Construction analyzer require structured relational mappings over the As mentioned earlier, the ECG construction formal- two domains, that is, mappings between form and ism is designed to support processes of language meaning relations whose arguments are themselves use. In particular, the model makes use of a con- linked by known constructions. struction analyzer that identi®es the constructions 2.2 Prior knowledge responsible for a given utterance, much like a syn- The model makes a number of assumptions based tactic parser in a traditional language understanding on the child language literature about prior knowl- system identi®es which parse rules are responsible. edge brought to the task, including conceptual In this case, however, the basic representational unit knowledge, lexical knowledge and the language is a form-meaning pair. The analyzer must there- comprehension process described earlier. Figure 2 fore also supply a semantic interpretation, called the depicts how these are related in a simple example; semspec, indicating which conceptual schemas are each is described in more detail below. involved and how they are related. The analyzer is also required to be robust to input that is not cov- meaning HROW-TT RANSITIVE ered by its current grammar, since that situation is form constructional the norm during language learning. t1 t2 FORM t3 MEANING

meaning Bryant (2003) describes an implemented con- I I Speaker form struction analyzer program that meets these needs. Throw meaning The construction analyzer takes as input a set of throw HROW thrower

form T throwee ECG constructions (linguistic knowledge), a set of the

meaning ECG schemas (conceptual knowledge) and an utter- ball THE- BALL Ball form ance. The analyzer draws on partial parsing tech- niques previously applied to syntactic parsing (Ab- ney, 1996): utterances not covered by known con- Figure 2: A constructional analysis of I throw the structions yield partially ®lled semspecs, and un- ball, with form elements on the left, meaning ele- known forms in the input are skipped. As a result, ments (conceptual schemas) on the right and con- even a small set of simple constructions can provide structions linking the two domains in the center. skeletal interpretations of complex utterances. Figure 2 gives an iconic representation of the re-

2.2.1 Conceptual knowledge sult of analyzing the utterance I throw the ball us-

¤¨ ¢¡¤£¦¥¨§ Conceptual knowledge is represented using an on- ing the ¢¡¤£¦¥¨§ ¢£ and con- tology of typed feature structures, or schemas. structions shown earlier, along with some additional

19 lexical constructions (not shown). The analyzer scene. A simple feature structure representation of matches each input form with its lexical construc- a sample input token is shown here; boxed numbers

tion (if available) and corresponding meaning, and indicate that the relevant entities are identi®ed:

# $  $ then matches the clausal construction by checking 

text  throw the ball

$ 

Form  $

the relevant word order relations (implicitly rep- 

intonation  falling

$  $ resented by the dotted arrow in the ®gure) and  0 1 2

Participants  Mother , Naomi , Ball

$  $

role bindings (denoted by the double-headed arrows  Throw $

 1



! $

within the meaning domain) asserted on its candi-  Scene thrower Naomi

$ 

2 "

 $

date constituents. Note that at the stage shown, no  throwee Ball

 # $ $

 0

 

speaker Mother $ $ construction for the has yet been learned, result-  1 addressee  Naomi

ing in a partial analysis. At an even earlier stage 

Discourse  speech act imperative

 %

%  ¢¡¤£¦¥¨§ © ¢£¦ ¤ ¨ of learning, before the con- activity  play 2 struction is learned, the lexical constructions are joint attention  Ball matched without resulting in the role-®ller bindings on the Throw action schema. Many details have been omitted, and a number Finally, note that the semspec produced by con- of simplifying assumptions have been made. But structional analysis (right-hand side of the ®gure) the rough outline given here nevertheless captures must be matched to the current situational con- the core computational problem faced by the child text using a contextual interpretation, or resolu- learner in acquiring multi-word constructions in a tion, process. Like other resolution (e.g. refer- framework putting meaning on par with form. ence resolution) procedures, this process relies on category/type constraints and (provisional) identi- 3 Learning algorithms ®cation bindings. The resolution procedure at- We model the learning task as a search through the tempts to unify each schema and constraint ap- space of possible grammars, with new constructions pearing in the semspec with a type-compatible en- incrementally added based on encountered data. As tity or relation in the context. In the example, in the child learning situation, the goal of learning the schemas on the right-hand side of the ®gure is to converge on an optimal set of constructions, should be identi®ed during resolution with particu- i.e., a grammar that is both general enough to en- lar schema instances available in context (e.g., the compass signi®cant novel data and speci®c enough Speaker schema should be linked to the speci®c to accurately predict previously seen data. contextually available discourse speaker, the Ball A suitable overarching computational framework schema to a particular ball instance, etc.). for guiding the search is provided by the mini- 2.3 Input data mum description length (MDL) heuristic (Rissanen, 1978), which is used to ®nd the optimal analysis The input is characterized as a set of input tokens, of data in terms of (a) a compact representation of each consisting of an utterance form (a string of the data (i.e., a grammar); and (b) a compact means known and novel word-forms) paired with a speci®c of describing the original data in terms of the com- communicative context (a set of linked conceptual pressed representation (i.e., constructional analyses schemas corresponding to the participants, salient using the grammar). The MDL heuristic exploits a scene and discourse information available in the sit- tradeoff between competing preferences for smaller uation). The learning model receives only positive grammars (encouraging generalization) and for sim- examples, as in the child learning case. Note, how- pler analyses of the data (encouraging the retention ever, that the interpretation a given utterance has of speci®c/frequent constructions). in context depends on the current state of linguis- The rest of this section makes the learning frame- tic knowledge. Thus the same utterance at different work concrete. Section 3.1 describes several heuris- stages may lead to different learning behavior. 
tics for moving through the space of grammars (i.e., The speci®c training corpus used in learn- how to update a grammar with new constructions ing experiments is a subset of the Sachs corpus based on input data), and Section 3.2 describes how of the CHILDES database of parent-child tran- to chose among these candidate moves to ®nd op- scripts(Sachs, 1983; MacWhinney, 1991), with ad- timal points in the search space (i.e., speci®c MDL ditional annotations made by developmental psy- criteria for evaluating new grammars). These speci- chologists as part of a study of motion utterances ®cations extend previous methods to accommodate (Dan I. Slobin, p.c.). These annotations indicate the relational structures of the ECG formalism and semantic and pragmatic features available in the the process-based assumptions of the model.

20 3.1 Updating the grammar shown in Section 2.3 and depicted schematically in The grammar may be updated in three ways: Figure 4. Given the utterance “throw the ball” and a grammar including constructions for throw and ball hypothesis forming new structured maps to ac- (but not the), the analyzer produces a semspec in- count for mappings present in the input but un- cluding a Ball schema and a Throw schema, without explained by the current grammar; indicating any relations between them. The reso- lution process matches these schemas to the actual reorganization exploiting regularities in the set of context, which includes a particular throwing event known constructions (merge two similar con- in which the addressee (Naomi) is the thrower of structions into a more general construction, or a particular ball. The resulting resolved analysis compose two constructions that cooccur into a looks like Figure 4 but without the new construction larger construction); and (marked with dashed lines): the two lexical con- reinforcement incrementing the weight associated structions are shown mapping to particular utterance with constructions that are successfully used forms and contextual items. during comprehension. UTTERANCE CONSTRUCTS CONTEXT

Hypothesis. The first operation addresses the Mom Discourse core computational challenge of learning new struc- intonation: falling Naomi speaker: tured maps. The key idea here is that the learner is addressee: Throw speech act: imperative assumed to have access to a partial analysis based throw THROW thrower joint attention: throwee activity: play on linguistic knowledge, as well as a fuller situa- temporality: ongoing tion interpretation it can infer from context. Any difference between the two can directly prompt the NEW CONSTRUCTION formation of new constructions that will improve the agent’s ability to handle subsequent instances of ball BALL Ball Block similar utterances in similar contexts. In particu- lar, certain form and meaning relations that are un- matched by the analysis but present in context may Figure 4: Hypothesizing a relational mapping for be mapped using the procedure in Figure 3. the utterance throw ball. Heavy solid lines indi- cate structures matched during analysis; heavy dot-

Hypothesize construction. Given utterance in ¢ situational context ¡ and current grammar : ted lines indicate the newly hypothesized mapping.

1. Call the construction analysis/resolution pro- Next, an unmatched form relation (the word or-

¡ ¢ cesses on ( , , ) to produce a semspec con-

der edge between throw and ball) is found, fol- ¤

sisting of form and meaning graphs £ and . ¤ Nodes and edges of £ and are marked as lowed by a corresponding unmatched meaning re-

matched or unmatched by the analysis. lation (the binding between the Throw.throwee role

¥ ¥ £ 2. Find rel ¥ (A ,B ), an unmatched edge in cor- and the specific Ball in context); these are shown responding to an unused form relation over the in the figure using heavy dashed lines. Crucially,

matched form poles of two constructs A and B. these relations meet the condition in step 3 that

¦ ¦ 3. Find rel ¦ (A ,B ), an unmatched edge (or the relations be pseudo-isomorphic. This condition

subgraph) in ¤ corresponding to an unused captures three common patterns of relational form- meaning relation (or set of bindings) over the meaning mappings, i.e., ways in which a meaning

corresponding matched meaning poles A ¦ and

¨ ¨ ¨

¦ ¦ ¦

B ¦ . rel (A ,B ) is required to be pseudo- relation rel over A and B can be correlated

¥ ¥ ¥

© © isomorphic to rel (A ,B ). with a form relation rel © over A and B (e.g., word

4. Create a new construction § with constituents order); these are illustrated in Figure 5, where we

A and B and form and meaning constraints cor- assume a simple form relation:

¥ ¥ ¦ ¦ ¦ responding to rel ¥ (A ,B ) and rel (A ,B ),

respectively. ¨

(a) strictly isomorphic: B ¨ is a role-filler of A (or

vice versa) (A .r1 B )

Figure 3: Construction hypothesis. ¨

(b) shared role-filler: A ¨ and B each have a role

The algorithm creates new constructions map- filled by the same entity (A .r1 B .r2)

ping form and meaning relations whose arguments ¨

are already constructionally mapped. It is best illus- (c) sibling role-fillers: A ¨ and B fill roles of the

trated by example, based on the sample input token same schema (Y.r1 A , Y.r2 B )

21 A A A f m Reorganize constructions. Reorganize 5 to con- solidate similar and co-occurring constructions: (a) relf r

6 Merge: Pairs of constructions with significant Bf B Bm shared structure (same number of constituents, minimal ontological distance (i.e., distance in A f A Am r1 the type ontology) between corresponding con- stituents, maximal overlap in constraints) may relf x (b) be merged into a new construction containing r2 Bf B Bm the shared structure; the original constructions are rewritten as subcases of the new construc- tion along with the non-overlappinginformation. A A f A m r1 6 Compose: Pairs of constructions that co-occur (c) relf Y frequently with compatible constraints (are part r2 of competing analyses using the same con- Bf B Bm stituent, or appear in a constituency relation- ship) may be composed into one construction. Figure 5: Pseudo-isomorphic relational mappings over constructs A and B: (a) strictly isomorphic; (b) shared role-®ller; and (c) sibling role-®llers. Figure 7: Construction reorganization. three constituents. This condition enforces structural similarity be- Reinforcement. Each construction is associated tween the two relations while recognizing that con- with a weight, which is incremented each time it is structions may involve relations that are not strictly used in an analysis that is successfully matched to isomorphic. (The example mapping shown in the the context. A successful match covers a majority ®gure is strictly isomorphic.) The resulting con- of the contextually available bindings. struction is shown formally in Figure 6. Both hypothesis and reorganization provide means of proposing new constructions; we now construction ¢¡¤£¦¥¦§©¨  constituents specify how proposed constructions are evaluated.

t1 : ¢¡¤£¦¥¦§ 3.2 Evaluating grammar cost

t2 : 

form The MDL criteria used in the model is based on the

  8 t1 before t2 cost of the grammar 7 given the data :

meaning

 © 

8>=@? ACB 9:7D=EGFHB 9 8I; 7D=

t1 .throwee t2 cost 9:7<; size cost

J 9QPR= size 9:7D=@? size

Figure 6: Example learned construction. KMLON

K K

JU

F EGT E 9WV¦=

Reorganization. Besides hypothesizing con- size 9QPR=S? length L¦K

structions based on new data, the model also allows

J

7D=@? 9Q[%=

new constructions to be formed via constructional cost 9 8I; score X

reorganization, essentially by applying general cat- LZY

\ `

J J

9Q[%=S? 9 ; ; =

egorization principles to the current grammar, as de- score weight E^]_B type

X `.LZ\

scribed in Figure 7. \¦L

X

X

 ! #"$$ E For example, the construction E height sem®t

and a similar  ! #$%¢&(' construction can be

A F merged into a general  )+*,.-&/ construc- where and are learning parameters that con- tion; the resulting subcase constructions each retain trol the relative bias toward model simplicity and

the appropriate type constraint. Similarly, a general data compactness. The size( 7 ) is the sum over the

0213

K

"4! P F

and  )+*,.-&/ construc- size of each construction in the grammar ( is

K T

tion may occur in many analyses in which they com- the number of constituents in P , is the number

P V pete for the / constituent. Since they have of constraints in , and each element reference

compatible constraints in both form and meaning (in in P has a length, measured as slot chain length). 7 the latter case based on the same conceptual Throw The cost (complexity) of the data 8 given is the

schema), repeated co-occurrence may lead to the sum of the analysis scores of each input token [

formation of a larger construction that includes all using 7 . This score sums over the constructions

22

¡

in the analysis of , where weight ¢ reflects rel- constructions as expected (e.g. throw bear, throw ¤¥£ ative (in)frequency, £ type denotes the number of books, you throw), later in learning generalizing

ontology items of type ¦ , summed over all the con- over different event participants (throw OBJECT, stituents in the analysis and discounted by parame- PERSON throw, etc.).

ter § . The score also includes terms for the height A quantitative measure of comprehension over of the derivation graph and the semantic fit provided time, coverage, was defined as the percentage of to-

by the analyzer as a measure of semantic coherence. tal bindings © in the data accounted for at each learn- In sum, these criteria favor constructions that are ing step. This metric indicates how new construc- simply described (relative to the available meaning tions incrementally improve the model’s compre- representations and the current set of constructions), hensive capacity, shown in Figure 8. The throw sub- frequently useful in analysis, and specific to the data set, for example, contains 45 bindings to the roles of encountered. The MDL criteria thus approximate the Throw schema (thrower, throwee, and goal loca- Bayesian learning, where the minimizing of cost tion). At the start of learning, the model has no com- corresponds to maximizing the posterior probabil- binatorial constructions and can account for none of ity, the structural prior corresponds to the grammar these. But the model gradually amasses construc- size, and likelihood corresponds to the complexity tions with greater coverage, and by the tenth input of the data relative to the grammar. token, the model learns new constructions that ac- count for the majority of the bindings in the data. 4 Learning verb islands The learning trajectories do appear distinct: The model was applied to the data set described in throw constructions show a gradual build-up before Section 2.3 to determine whether lexically specific plateauing, while fall has a more fitful climb con- multi-word constructions could be learned using the verging at a higher coverage rate than throw. It is MDL learning framework described. This task rep- interesting to note that the throw subset has a much resents an important first step toward general gram- higher percentage of imperative utterances than fall matical constructions, and is of cognitive interest, (since throwing is pragmatically more likely to be since item-based patterns appear to be learned on done on command); the learning strategy used in independent trajectories (i.e., each verb forms its the current model focuses on relational mappings own “island” of organization (Tomasello, 2003)). and misses the association of an imperative speech- act with the lack of an expressed agent, providing a We give results for drop ( ¨ =10 examples), throw

possible explanation for the different trajectories. ¨ ( ¨ =25), and fall ( =50). While further experimentation with larger train- 100 ing sets is needed, the results indicate that the model is able to acquire useful item-based constructions

80 like those learned by children from a small number examples. More importantly, the learned construc-

60 tions permit a limited degree of generalization that allows for increasingly complete coverage (or com- prehension) of new utterances, fulfilling the goal of 40

Percent correct bindings the learning model. Differences in verb learning drop (n=10, b=18) throw (n=25, b=45) fall (n=53, b=86) lend support to the verb island hypothesis and illus- 20 trate how the particular semantic, pragmatic and sta- tistical properties of different verbs can affect their

0 0 10 20 30 40 50 60 70 80 90 100 learning course. Percent training examples encountered 5 Discussion and future directions Figure 8: Incrementally improving comprehension The work presented here is intended to offer an al- for three verb islands. ternative formulation of the grammar learning prob- Given the small corpus sizes, the focus for this lem in which meaning in context plays a pivotal role experiment is not on the details of the statisti- in the acquisition of grammar. Specifically, mean- cal learning framework but instead on a qualita- ing is incorporated directly into the target grammar tive evaluation of whether learned constructions im- (via the construction representation), the input data prove the model’s comprehension over time, and (via the context representation) and the evaluation how verbs may differ in their learning trajectories. criteria (which is usage-based, i.e. to improve com- Qualitatively, the model first learned item-specific prehension). To the extent possible, the assump-

23 tionsmadewithrespecttostructuresandprocesses Paul Bloom. 2000. How Children Learn the Meanings availabletoahumanlanguagelearnerinthisstage of Words. MIT Press, Cambridge, MA. are consistent with evidence from across the cog- John Bryant. 2003. Constructional analysis. Master’s nitive spectrum. Though only preliminary conclu- thesis, University of California at Berkeley. sions can be made, the model is a concrete compu- Nancy Chang, Jerome Feldman, Robert Porzel, and tational step toward validating a meaning-oriented Keith Sanders. 2002. Scaling cognitive linguistics: Formalisms for language understanding. In Proc. approach to grammar learning. 1st International Workshop on Scalable Natural Lan- The model draws from a number of computa- guage Understanding, Heidelberg, Germany. tional forerunners from both logical and probabilis- G.F. DeJong and R. Mooney. 1986. Explanation-based tic traditions, including Bayesian models of word learning: An alternative view. Machine Learning, learning (Bailey, 1997; Stolcke, 1994) for the over- 1(2):145–176. all optimization model, and work by Wolff (1982) Charles Fillmore and Paul Kay. 1999. Construction modeling language acquisition (primarily produc- grammar. CSLI, Stanford, CA. To appear. tion rules) using data compression techniques sim- E.M. Gold. 1967. Language identification in the limit. ilar to the MDL approach taken here. The use of Information and Control, 16:447–474. the results of analysis to hypothesize new mappings Adele E. Goldberg. 1995. Constructions: A Construc- tion Grammar Approach to Argument Structure. Uni- can be seen as related to both explanation-based versity of Chicago Press. learning (DeJong and Mooney, 1986) and inductive Ronald W. Langacker. 1987. Foundations of Cognitive logic programming (Muggleton and Raedt, 1994). Grammar, Vol. 1. Stanford University Press. The model also has some precedents in the work Brian MacWhinney. 1991. The CHILDES project: of Siskind (1997) and Thompson (1998), both of Tools for analyzing talk. Erlbaum, Hillsdale, NJ. which based learning on the discovery of isomor- Stephen Muggleton and Luc De Raedt. 1994. Inductive phic structures in syntactic and semantic representa- logic programming: Theory and methods. Journal of tions, though in less linguistically rich formalisms. Logic Programming, 19/20:629–679. In current work we are applying the model to the Terry Regier. 1996. The Human Semantic Potential. full corpus of English verbs, as well as crosslin- MIT Press, Cambridge, MA. Jorma Rissanen. 1978. Modeling by shortest data de- guistic data including Russian case markers and scription. Automatica, 14:465–471. Mandarin directional particles and aspect markers. Deb Roy and Alex Pentland. 1998. Learning audio- These experiments will further test the robustness visually grounded words from natural input. In Proc. of the model’s theoretical assumptions and protect AAAI workshop, Grounding Word Meaning. against model overfitting and typological bias. We J. Sachs. 1983. Talking about the there and then: the are also developing alternative means of evaluating emergence of displaced reference in parent-child dis- the system’s progress based on a rudimentary model course. In K.E. Nelson, editor, Children's language, of production, which would enable it to label scene volume 4, pages 1–28. Lawrence Erlbaum Associates. descriptions using its current grammar and thus fa- Jeffrey Mark Siskind, 1997. 
A computational study cilitate detailed studies of how the system general- of cross-situational techniques for learning word-to- meaning mappings, chapter 2. MIT Press. izes (and overgeneralizes) to unseen data. Andreas Stolcke. 1994. Bayesian Learning of Prob- Acknowledgments abilistic Language Models. Ph.D. thesis, Computer Science Division, University of California at Berke- We thank members of the ICSI Neural Theory of Lan- ley. guage group and two anonymous reviewers. Cynthia A. Thompson. 1998. Semantic Lexicon Acquisi- tion for Learning Natural Language Interfaces. Ph.D. References thesis, Department of Computer Sciences, University Steven Abney. 1996. Partial parsing via finite-state cas- of Texas at Austin, December. cades. In Workshop on Robust Parsing, 8th European Michael Tomasello. 1995. Joint attention as social Summer School in Logic, Language and Information, cognition. In C.D. Moore P., editor, Joint atten- pages 8–15, Prague, Czech Republic. tion: Its origins and role in development. Lawrence David R. Bailey. 1997. When Push Comes to Shove: A Erlbaum Associates, Hillsdale, NJ. Educ/Psych Computational Model of the Role of Motor Control in BF720.A85.J65 1995. the Acquisition of Action Verbs. Ph.D. thesis, Univer- Michael Tomasello. 2003. Constructing a Language: A sity of California at Berkeley. Usage-Based Theory of Language Acquisition. Har- Benjamin K. Bergen and Nancy Chang. in press. vard University Press, Cambridge, MA. Simulation-based language understanding in Embod- J. Gerard Wolff. 1982. Language acquisition, data com- ied Construction Grammar. In Construction Gram- pression and generalization. Language & Communi- mar(s): Cognitive and Cross-language dimensions. cation, 2(1):57–89. John Benjamins. 24 Grammatical Inference and First Language Acquisition

Alexander Clark ([email protected]) ISSCO / TIM, University of Geneva UNI-MAIL, Boulevard du Pont-d’Arve, CH-1211 Gene`ve 4, Switzerland

Abstract stated has no bounds on the amount of data or com- putation required for the learner. In spite of the inap- One argument for parametric models of language plicability of this particular paradigm, in a suitable has been learnability in the context of first language analysis there are quite strong arguments that bear acquisition. The claim is made that “logical” ar- directly on this problem. guments from learnability theory require non-trivial Grammatical inference is the study of machine constraints on the class of languages. Initial formal- learning of formal languages. It has a vast formal isations of the problem (Gold, 1967) are however vocabulary and has been applied to a wide selec- inapplicable to this particular situation. In this pa- tion of different problems, where the “languages” per we construct an appropriate formalisation of the under study can be (representations of) parts of nat- problem using a modern vocabulary drawn from sta- ural languages, sequences of nucleotides, moves of tistical learning theory and grammatical inference a robot, or some other sequence data. For any con- and looking in detail at the relevant empirical facts. clusions that we draw from formal discussions to We claim that a variant of the Probably Approxi- have any applicability to the real world, we must mately Correct (PAC) learning framework (Valiant, be sure to select, or construct, from the rich set of 1984) with positive samples only, modified so it is formal devices available an appropriate formalisa- not completely distribution free is the appropriate tion. Even then, we should be very cautious about choice. Some negative results derived from crypto- making inferences about how the infant child must graphic problems (Kearns et al., 1994) appear to ap- or cannot learn language: subsequent developments ply in this situation but the existence of algorithms in GI might allow a more nuanced description in with provably good performance (Ron et al., 1995) which these conclusions are not valid. The situation and subsequent work, shows how these negative re- is complicated by the fact that the field of grammti- sults are not as strong as they initially appear, and cal inference, much like the wider field of machine that recent algorithms for learning regular languages learning in general, is in a state of rapid change. partially satisfy our criteria. We then discuss the applicability of these results to parametric and non- In this paper we hope to address this problem by parametric models. justifying the selection of the appropriate learning framework starting by looking at the actual situa- 1 Introduction tion the child is in, rather than from an a priori deci- sion about the right framework. We will not attempt For some years, the relevance of formal results a survey of grammatical inference techniques; nor in grammatical inference to the empirical question shall we provide proofs of the theorems we use here. of first language acquisition by infant children has Arguments based on formal learnability have been been recognised (Wexler and Culicover, 1980). Un- used to support the idea of parameter based theo- fortunately, for many researchers, with a few no- ries of language (Chomsky, 1986). As we shall see table exceptions (Abe, 1988), this begins and ends below, under our analysis of the problem these ar- with Gold’s famous negative results in the identifi- guments are weak. Indeed, they are more pertinent cation in the limit paradigm. 
This paradigm, though to questions about the autonomy and modularity of still widely used in the grammatical inference com- language learning: the question whether learning of munity, is clearly of limited relevance to the issue some level of linguistic knowledge – morphology at hand, since it requires the model to be able to or syntax, for example – can take place in isolation exactly identify the target language even when an from other forms of learning, such as the acquisition adversary can pick arbitrarily misleading sequences of word meaning, and without interaction, ground- of examples to provide. Moreover, the paradigm as ing and so on.

25 Positive results can help us to understand how hu- plained. Thus, when looking for an appropriate for- mans might learn languages by outlining the class of mulation of the problem, we should recall for exam- algorithms that might be used by humans, consid- ple the fact that different children do not converge to ered as computational systems at a suitable abstract exactly the same knowledge of language as is some- level. Conversely, negative results might be help- times claimed, nor do all of them acquire a language ful if they could demonstrate that no algorithms of a competently at all, since there is a small proportion certain class could perform the task ± in this case we of children who though apparently neurologically could know that the human child learns his language normal fail to acquire language. In the context of in some other way. our discussion later on, these observations lead us We shall proceed as follows: after brie¯y de- to accept slightly less stringent criteria where we al- scribing FLA, we describe the various elements of low a small probability of failure and do not demand a model of learning, or framework. We then make perfect equality of hypothesis and target. a series of decisions based on the empirical facts about FLA, to construct an appropriate model or 3 Grammatical Inference models, avoiding unnecessary idealisation wherever The general ®eld of machine learning has a spe- possible. We proceed to some strong negative re- cialised sub®eld that deals with the learning of for- sults, well-known in the GI community that bear on mal languages. This ®eld, Grammatical Inference the questions at hand. The most powerful of these (GI), is characterised above all by an interest in for- (Kearns et al., 1994) appears to apply quite directly mal results, both in terms of formal characterisa- to our chosen model. We then discuss an interest- tions of the target languages, and in terms of formal ing algorithm (Ron et al., 1995) which shows that proofs either that particular algorithms can learn ac- this can be circumvented, at least for a subclass of cording to particular de®nitions, or that sets of lan- regular languages. Finally, after discussing the pos- guage cannot be learnt. In spite of its theoretical sibilities for extending this result to all regular lan- bent, GI algorithms have also been applied with guages, and beyond, we conclude with a discussion some success. Natural language, however is not the of the implications of the results presented for the only source of real-world applications for GI. Other distinction between parametric and non-parametric domains include biological sequence data, arti®cial models. languages, such as discovering XML schemas, or sequences of moves of a robot. The ®eld is also 2 First Language Acquisition driven by technical motives and the intrinsic ele- Let us ®rst examine the phenomenon we are con- gance and interest of the mathematical ideas em- cerned with: ®rst language acquisition. In the space ployed. In summary it is not just about language, of a few years, children almost invariably acquire, and accordingly it has developed a rich vocabulary in the absence of explicit instruction, one or more of to deal with the wide range of its subject matter. the languages that they are exposed to. 
A multitude In particular, researchers are often concerned of subsidiary debates have sprung up around this with formal results ± that is we want algorithms central issue covering questions about critical peri- where we can prove that they will perform in a cer- ods ± the ages at which this can take place, the ex- tain way. Often, we may be able to empirically es- act nature of the evidence available to the child, and tablish that a particular algorithm performs well, in the various phases of linguistic use through which the sense of reliably producing an accurate model, the infant child passes. In the opinion of many re- while we may be unable to prove formally that the searchers, explaining this ability is one of the most algorithm will always perform in this way. This important challenges facing linguists and cognitive can be for a number of reasons: the mathematics scientists today. required in the derivation of the bounds on the er- A dif®culty for us in this paper is that many of rors may be dif®cult or obscure, or the algorithm the idealisations made in the study of this ®eld are may behave strangely when dealing with sets of data in fact demonstrably false. Classical assumptions, which are ill-behaved in some way. such as the existence of uniform communities of The basic framework can be considered as a language users, are well-motivated in the study of game played between two players. One player, the the ªsteady stateº of a system, but less so when teacher, provides information to another, the learner, studying acquisition and change. There is a regret- and from that information the learner must identify table tendency to slip from viewing these idealisa- the underlying language. We can break down this tions correctly ± as counter-factual idealizations ± to situation further into a number of elements. We as- viewing them as empirical facts that need to be ex- sume that the languages to be learned are drawn

26 in some way from a possibly in®nite class of lan- must perform. We consider that some of the em- guages, L, which is a set of formal mathematical pirical questions do not yet have clear answers. In objects. The teacher selects one of these languages, those cases, we shall make the choice that makes the which we call the target, and then gives the learner learning task more dif®cult. In other cases, we may a certain amount of information of various types not have a clear idea of how to formalise some in- about the target. After a while, the learner then re- formation source. We shall start by making a signif- turns its guess, the hypothesis, which in general will icant idealisation: we consider language acquisition be a language drawn from the same class L. Ide- as being a single task. Natural languages as tradi- ally the learner has been able to deduce or induce tionally describe have different levels. At the very or abduce something about the target from the in- least we have morphology and syntax; one might formation we have given it, and in this case the hy- also consider inter-sentential or discourse as an ad- pothesis it returns will be identical to, or close in ditional level. We con¯ate all of these into a single some technical sense, to the target. If the learner task: learning a formal language; in the discussion can conistently do this, under whatever constraints below, for the sake of concreteness and clarity, we we choose, then we say it can learn that class of lan- shall talk in terms of learning syntax. guages. To turn this vague description into some- thing more concrete requires us to specify a number 4.1 The Language of things. The ®rst question we must answer concerns the lan- • What sort of mathematical object should we guage itself. A formal language is normally de®ned use to represent a language? as follows. Given a ®nite alphabet Σ, we de®ne the set of all strings (the free monoid) over Σ as Σ∗. What is the target class of languages? ∗ • We want to learn a language L ⊂ Σ . The alpha- • What information is the learner given? bet Σ could be a set of phonemes, or characters, or a set of words, or a set of lexical categories (part • What computational constraints does the of speech tags). The language could be the set of learner operate under? well-formed sentences, or the set of words that obey • How close must the target be to the hypothesis, the phonotactics of the language, and so on. We re- and how do we measure it? duce all of the different learning tasks in language to a single abstract task ± identifying a possibly in- This paper addresses the extent to which negative ®nite set of strings. This is overly simplistic since results in GI could be relevant to this real world sit- transductions, i.e. mappings from one string to an- uation. As always, when negative results from the- other, are probably also necessary. We are using ory are being applied, a certain amount of caution here a standard de®nition of a language where every is appropriate in examining the underlying assump- string is unambiguously either in or not in the lan- tions of the theory and the extent to which these are guage.. This may appear unrealistic ± if the formal applicable. As we shall see, in our opinion, none language is meant to represent the set of grammati- of the current negative results, though powerful, are cal sentences, there are well-known methodological applicable to the empirical situation. 
We shall ac- problems with deciding where exactly to draw the cordingly, at various points, make strong pessimistic line between grammatical and ungrammatical sen- assumptions about the learning environment of the tences. An alternative might be to consider accept- child, and show that even under these unrealistically ability rather than grammaticality as the de®ning stringent stipulations, the negative results are still criterion for inclusion in the set. Moreover, there inapplicable. This will make the conclusions we is a certain amount of noise in the input ± There come to a little sharper. Conversely, if we wanted are other possibilities. We could for example use a to show that the negative results did apply, to be fuzzy set ± i.e. a function from Σ∗ → [0, 1] where convincing we would have to make rather optimistic each string has a degree of membership between 0 assumptions about the learning environment. and 1. This would seem to create more problems than it solves. A more appealing option is to learn 4 Applying GI to FLA distributions, again functions f from Σ∗ → [0, 1] We now have the delicate task of selecting, or rather but where Ps∈L f(s) = 1. This is of course the constructing, a formal model by identifying the vari- classic problem of language modelling, and is com- ous components we have identi®ed above. We want pelling for two reasons. First, it is empirically well to choose the model that is the best representation grounded ± the probability of a string is related to its of the learning task or tasks that the infant child frequency of occurrence, and secondly, we can de-

27 duce from the speech recognition capability of hu- and erratic, not a universal of global child-raising. mans that they must have some similar capability. Furthermore this appears to have no effect on the Both possibilities ± crisp languages, and distri- child. Children do also get indirect pragmatic feed- butions ± are reasonable. The choice depends on back if their utterances are incomprehensible. In our what one considers the key phenomena to be ex- opinion, both of these would be better modelled by plained are ± grammaticality judgments by native what is called a membership query: the algorithm speakers, or natural use and comprehension of the may generate a string and be informed whether that language. We favour the latter, and accordingly string is in the language or not. However, we feel think that learning distributions is a more accurate that this is too erratic to be considered an essential and more dif®cult choice. part of the process. Another question is whether the input data is presented as a ¯at string or annotated 4.2 The class of languages with some sort of structural evidence, which might A common confusion in some discussions of this be derived from prosodic or semantic information. topic is between languages and classes of lan- Unfortunately there is little agreement on what the guages. Learnability is a property of classes of constituent structure should be ± indeed many lin- languages. If there is only one language in the guistic theories do not have a level of constituent class of languages to be learned then the learner structure at all, but just dependency structure. can just guess that language and succeed. A class Semantic information is also claimed as an im- with two languages is again trivially learnable if portant source. The hypothesis is that children can you have an ef®cient algorithm for testing member- use lexical semantics, coupled with rich sources of ship. It is only when the set of languages is expo- real-world knowlege to infer the meaning of utter- nentially large or in®nite, that the problem becomes ances from the situational context. That would be non-trivial, from a theoretical point of view. The an extremely powerful piece of information, but it is class of languages we need is a class of languages clearly absurd to claim that the meaning of an utter- that includes all attested human languages and ad- ance is uniquely speci®ed by the situational context. ditionally all ªpossibleº human languages. Natu- If true, there would be no need for communication ral languages are thought to fall into the class of or information transfer at all. Of course the context mildly context-sensitive languages, (Vijay-Shanker puts some constraints on the sentences that will be and Weir, 1994), so clearly this class is large uttered, but it is not clear how to incorporate this enough. It is, however, not necessary that our class fact without being far too generous. In summary it be this large. Indeed it is essential for learnability appears that only positive evidence can be unequiv- that it is not. As we shall see below, even the class ocally relied upon though this may seem a harsh and of regular languages contains some subclasses that unrealistic environment. are computationally hard to learn. Indeed, we claim it is reasonable to de®ne our class so it does not con- 4.4 Presentation tain languages that are clearly not possible human languages. 
We have now decided that the only evidence avail- able to the learner will be unadorned positive sam- 4.3 Information sources ples drawn from the target language. There are var- Next we must specify the information that our learn- ious possibilities for how the samples are selected. ing algorithm has access to. Clearly the primary The choice that is most favourable for the learner is source of data is the primary linguistic data (PLD), where they are slected by a helpful teacher to make namely the utterances that occur in the child's envi- the learning process as easy as possible (Goldman ronment. These will consist of both child-directed and Mathias, 1996). While it is certainly true that speech and adult-to-adult speech. These are gen- carers speak to small children in sentences of sim- erally acceptable sentences that is to say sentences ple structure (Motherese), this is not true for all of that are in the language to be learned. These are the data that the child has access to, nor is it uni- called positive samples. One of the most long- versally valid. Moreover, there are serious techni- running debates in this ®eld is over whether the cal problems with formalising this, namely what is child has access to negative data ± unacceptable sen- called 'collusion' where the teacher provides exam- tences that are marked in some way as such. The ples that encode the grammar itself, thus trivialising consensus (Marcus, 1993) appears to be that they do the learning process. Though attempts have been not. In middle-class Western families, children are made to limit this problem, they are not yet com- provided with some sort of feedback about the well- pletely satisfactory. The next alternative is that the formedness of their utterances, but this is unreliable examples are selected randomly from some ®xed

distribution. This appears to us to be the appropriate choice, subject to some limitations on the distributions that we discuss below. The final option, the most difficult for the learner, is where the sequence of samples can be selected by an intelligent adversary, in an attempt to make the learner fail, subject only to the weak requirement that each string in the language appears at least once. This is the approach taken in the identification in the limit paradigm (Gold, 1967), and is clearly too stringent. The remaining question then concerns the distribution from which the samples are drawn: whether the learner has to be able to learn for every possible distribution, only for distributions from a particular class, or only for one particular distribution.

4.5 Resources

Beyond the requirement of computability, we will wish to place additional limitations on the computational resources that the learner can use. Children learn the language in a limited period of time, which limits both the amount of data they have access to and the amount of computation they can use, so it seems appropriate to disallow algorithms that use unbounded or very large amounts of data or time. As is standard, we formalise this by putting polynomial bounds on the sample complexity and the computational complexity. Since the individual samples are of varying length, we need to allow the computational complexity to depend on the total length of the sample. A key question is what the parameters of the sample complexity polynomial should be; we discuss this further below.

4.6 Convergence Criteria

Next we address the issue of reliability: the extent to which all children acquire language. First, variability in the achievement of particular linguistic milestones is high, with numerous causes including deafness, mental retardation, cerebral palsy, specific language impairment and autism. Generally, autistic children appear neurologically and physically normal, but about half may never speak; autism, on some accounts, has an incidence of about 0.2%. We can therefore require learning to happen with arbitrarily high probability, but requiring it to happen with probability one is unreasonable. A related question concerns convergence: the extent to which children exposed to a linguistic environment end up with the same language as others. Clearly they end up very close, since otherwise communication could not happen, but there is ample evidence from studies of variation (Labov, 1975) that there are non-trivial differences between adults who have grown up with near-identical linguistic experiences, about the interpretation and syntactic acceptability of simple sentences, quite apart from the wide, purely lexical variation that is easily detected. A famous example in English is "Each of the boys didn't come". Moreover, language change requires some children to end up with slightly different grammars from the older generation. At the very most, we should require that the hypothesis be close to the target. The function we use to measure the 'distance' between hypothesis and target depends on whether we are learning crisp languages or distributions. If we are learning distributions, the obvious choice is the Kullback-Leibler divergence – a very strict measure. For crisp languages, the probability of the symmetric difference with respect to some distribution is natural.

4.7 PAC-learning

These considerations lead us to some variant of the Probably Approximately Correct (PAC) model of learning (Valiant, 1984). We require the algorithm to produce a good hypothesis with arbitrarily high probability: for any confidence δ > 0, it must produce a good hypothesis with probability more than 1 − δ. Next, we require a good hypothesis to be arbitrarily close to the target: for any precision ε > 0, the hypothesis must be less than ε away from the target. We allow the amount of data the algorithm can use to increase as the confidence and precision get smaller. We define PAC-learning in the following way: given a finite alphabet Σ and a class of languages L over Σ, an algorithm PAC-learns the class L if there is a polynomial q such that for every confidence δ > 0 and precision ε > 0, for every distribution D over Σ*, and for every language L in L, whenever the number of samples exceeds q(1/ε, 1/δ, |Σ|, |L|), the algorithm produces a hypothesis H such that, with probability greater than 1 − δ, Pr_D(H Δ L) < ε. Here we use A Δ B to mean the symmetric difference between two sets. The polynomial q is called the sample complexity polynomial. We also limit the amount of computation to some polynomial in the total length of the data seen. Note first of all that this is a worst-case bound – we are not requiring merely that the hypothesis comes close on average. Additionally, this model is what is called 'distribution-free', meaning that the algorithm must work for every combination of distribution and language. This is a very stringent requirement, mitigated only by the fact that the error is calculated with respect to the same distribution that the samples are drawn from. Thus, if there is a subset of Σ* with low aggregate probability under D, the algorithm will not get many samples from this region, but will not be penalised very much for errors in that region.
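As a concrete illustration of the criterion just defined, the sketch below estimates the error Pr_D(H Δ L) of a hypothesis by drawing samples from a fixed distribution and checks the (ε, δ) requirement over repeated runs. It is our own toy example, not an algorithm from the paper: the target language, the distribution and the "memorising" learner are all made-up placeholders.

```python
import random

# Toy stand-ins (not from the paper): a target language L, a distribution D
# whose support is L, and a naive learner that just memorises its sample.
TARGET = ["a", "ab", "abb", "abbb", "abbbb"]    # the target language L
WEIGHTS = [0.4, 0.3, 0.15, 0.1, 0.05]           # the distribution D over L

def draw(n):
    return random.choices(TARGET, weights=WEIGHTS, k=n)

def memorising_learner(sample):
    """Deliberately naive: the hypothesis is just the set of strings seen."""
    return set(sample)

def error(hypothesis):
    """Pr_D(H Δ L): probability mass of the symmetric difference under D.
    Here H is a subset of L, so this is the mass of the unseen strings."""
    return sum(w for s, w in zip(TARGET, WEIGHTS) if s not in hypothesis)

def pac_check(epsilon, delta, sample_size, runs=2000):
    """Empirically check the (epsilon, delta) requirement: the proportion of
    runs whose error exceeds epsilon should be at most delta."""
    failures = sum(error(memorising_learner(draw(sample_size))) > epsilon
                   for _ in range(runs))
    return failures / runs <= delta

print(pac_check(epsilon=0.1, delta=0.05, sample_size=60))
```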

From our point of view, there are two problems with this framework: first, we only want to draw positive samples, but the distributions are over all strings in Σ*, and include some that give zero probability to all strings in the language concerned. Secondly, this is too pessimistic, because the distribution has no relation to the language: intuitively it is reasonable to expect the distribution to be derived in some way from the language, or from the structure of a grammar generating the language. Indeed there is a causal connection in reality, since the sample of the language the child is exposed to is generated by people who do in fact know the language.

One alternative that has been suggested is the PAC learning with simple distributions model introduced by (Denis, 2001). This is based on ideas from complexity theory, where the samples are drawn according to a universal distribution defined by the conditional Kolmogorov complexity. While mathematically correct, this is inappropriate as a model of FLA for a number of reasons: first, learnability is proven only on a single very unusual distribution, and relies on particular properties of this distribution; secondly, there are some very large constants in the sample complexity polynomial.

The solution we favour is to define some natural class of distributions based on a grammar or automaton generating the language. Given a class of languages defined by some generative device, there is normally a natural stochastic variant of the device which defines a distribution over that language. Thus regular languages can be defined by a finite-state automaton, and these can be naturally extended to probabilistic finite-state automata. Similarly, context-free languages are normally defined by context-free grammars, which can be extended again to probabilistic or stochastic CFGs. We therefore propose a slight modification of the PAC framework. For every class of languages L, defined by some formal device, define a class of distributions, D, defined by a stochastic variant of that device. Then for each language L, we select the set of distributions whose support is equal to the language, subject to a polynomial bound q on the complexity of the distribution in terms of the complexity of the target language: D_L^+ = {D ∈ D : L = supp(D) ∧ |D| < q(|L|)}. Samples are drawn from one of these distributions.

There are two technical problems here: first, this does not penalise over-generalisation. Since the distribution is over positive examples, negative examples have zero weight, so we need some penalty function over negative examples, or alternatively we must require the hypothesis to be a subset of the target. Secondly, this definition is too vague: the exact way in which you extend the "crisp" language to a stochastic one can have serious consequences. When dealing with regular languages, for example, though the class of languages defined by deterministic automata is the same as that defined by non-deterministic automata, the same is not true for their stochastic variants. Additionally, one can have exponential blow-ups in the number of states when determinising automata. Similarly, for CFGs, (Abney et al., 1999) showed that two parametrisations of stochastic context-free languages are equivalent, but that there are blow-ups in both directions when converting between them. We do not have a completely satisfactory solution to this problem at the moment; an alternative is to consider learning the distributions rather than the languages.

In the case of learning distributions, we have the same framework, but the samples are drawn according to the distribution being learned, T, and we require that the hypothesis H has small divergence from the target: D(T||H) < ε. Since the divergence is infinite if the hypothesis gives probability zero to a string in the target, this has the consequence that the target must assign a non-zero probability to every string.
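To make the notion of a "natural stochastic variant" of a generative device more concrete, here is a small, self-contained sketch (our own illustration, with a hand-built automaton) of a probabilistic deterministic finite-state automaton. It assigns every string a probability, its probabilities sum to one, and its support is exactly the regular language accepted by the underlying automaton, so a distribution of this kind is the sort of object the class D_L^+ above is intended to contain.

```python
# A hand-built probabilistic deterministic finite-state automaton (PDFA),
# purely as an illustration of the "natural stochastic variant" of a
# finite-state automaton described above.  It generates the regular
# language a b* with a geometric distribution over lengths.

# transitions[state][symbol] = (next_state, probability of emitting symbol)
transitions = {
    "q0": {"a": ("q1", 1.0)},       # from the start state we must emit 'a'
    "q1": {"b": ("q1", 0.5)},       # then emit 'b' with probability 0.5 ...
}
stop_prob = {"q0": 0.0, "q1": 0.5}  # ... or stop with probability 0.5
START = "q0"

def string_probability(s):
    """Probability the PDFA assigns to the whole string s (0 if s is not in
    the support, i.e. not in the underlying regular language)."""
    state, p = START, 1.0
    for symbol in s:
        if symbol not in transitions.get(state, {}):
            return 0.0
        next_state, step_p = transitions[state][symbol]
        state = next_state
        p *= step_p
    return p * stop_prob[state]

# The distribution sums to one over its support: a, ab, abb, abbb, ...
print(string_probability("a"))      # 0.5
print(string_probability("abb"))    # 0.125
print(string_probability("ba"))     # 0.0 -- outside the language, zero mass
print(sum(string_probability("a" + "b" * k) for k in range(30)))  # close to 1.0
```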

5 Negative Results

Now that we have a fairly clear idea of various ways of formalising the situation, we can consider the extent to which formal results apply. We start by considering negative results, which in machine learning come in two types. First, there are information-theoretic bounds on sample complexity, derived from the Vapnik-Chervonenkis (VC) dimension of the space of languages, a measure of the complexity of the set of hypotheses. If we add a parameter to the sample complexity polynomial that represents the complexity of the concept to be learned, then this will remove these problems. This can be the size of a representation of the target, which will be a polynomial in the number of states, or simply the number of non-terminals or states. This is very standard in most fields of machine learning.

The second problem relates not to the amount of information but to the computation involved. Results derived from cryptographic limitations on computational complexity can be proved based on widely held and well-supported assumptions that certain hard cryptographic problems are insoluble. In what follows we assume that there are no efficient algorithms for common cryptographic problems such as factoring Blum integers, inverting the RSA function, recognizing quadratic residues or learning noisy parity functions.

There may be algorithms that will learn with reasonable amounts of data but that require unfeasibly large amounts of computation to find. There are a number of powerful negative results on learning in the purely distribution-free situation we considered and rejected above. (Kearns and Valiant, 1989) showed that acyclic deterministic automata are not learnable even with positive and negative examples. Similarly, (Abe and Warmuth, 1992) showed a slightly weaker, representation-dependent result on learning non-deterministic automata with a large alphabet, by showing that there are strings such that maximising the likelihood of the string is NP-hard. Again, this does not strictly apply to the partially distribution-free situation we have chosen.

However, there is one very strong result that appears to apply. A straightforward consequence of (Kearns et al., 1994) shows that acyclic deterministic probabilistic FSAs over a two-letter alphabet cannot be learned under another cryptographic assumption (the noisy parity assumption). Therefore any class of languages that includes this comparatively weak family will not be learnable in our framework. But this rests upon the assumption that the class of possible human languages must include some cryptographically hard functions. It appears that our formal apparatus does not distinguish between these cryptographic functions, which have been consciously designed to be hard to learn, and natural languages, which presumably have evolved to be easy to learn, since there is no evolutionary pressure to make them hard to decrypt – no intelligent predators eavesdropping, for example. Clearly this is a flaw in our analysis: we need to find some more nuanced description of the class of possible human languages that excludes these hard languages or distributions.

6 Positive results

There is a positive result that shows a way forward. A PDFA is µ-distinguishable if the distributions generated from any two states differ by at least µ in the L∞-norm, i.e. there is a string with a difference in probability of at least µ. (Ron et al., 1995) showed that µ-distinguishable acyclic PDFAs can be PAC-learned, using the KL divergence as the error function, in time polynomial in n, 1/ε, 1/δ, 1/µ and |Σ|. They use a variant of a standard state-merging algorithm. Since these automata are acyclic, the languages they define are always finite. This additional criterion of distinguishability suffices to guarantee learnability. This work can be extended to cyclic automata (Clark and Thollard, 2004a; Clark and Thollard, 2004b), and thus to the class of all regular languages, with the addition of a further parameter which bounds the expected length of a string generated from any state. The use of distinguishability seems innocuous; in syntactic terms it is a consequence of the plausible condition that for any pair of distinct non-terminals there is some fairly likely string generated by one and not the other. Similarly, strings of symbols in natural language tend to have limited length. An alternate way of formalising this is to define a class of distinguishable automata, where the distinguishability of the automata is lower-bounded by an inverse polynomial in the number of states. This is formally equivalent, but avoids adding terms to the sample complexity polynomial. In summary, this would be a valid solution if all human languages actually lay within the class of regular languages. Note also the general properties of this kind of algorithm: provably learning an infinite class of languages, with infinite support, using only polynomial amounts of data and computation.

It is worth pointing out that the algorithm does not need to "know" the values of the parameters. Define a new parameter t, and set, for example, n = t, L = t, δ = e^(−t), ε = t^(−1) and µ = t^(−1). This gives a sample complexity polynomial in a single parameter, q(t). Given a certain amount of data N, we can just choose the largest value of t such that q(t) < N, and set the parameters accordingly.
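The single-parameter trick just described can be spelled out mechanically. In the sketch below, all of n, L, δ, ε and µ are tied to one parameter t, and the largest affordable t is chosen from the amount of data N actually available. The polynomial q used here is a made-up placeholder, not the actual bound from (Ron et al., 1995) or (Clark and Thollard, 2004a).

```python
import math

# Illustrative placeholder for the sample complexity polynomial q(t) obtained
# by setting n = t, L = t, delta = e^(-t), epsilon = 1/t, mu = 1/t.
# Any monotonically increasing polynomial in t plays the same role here.
def q(t):
    return 50 * t ** 4

def parameters_for_data(N):
    """Pick the largest t with q(t) < N, then derive every model parameter
    from that single value, as described in the text."""
    t = 1
    while q(t + 1) < N:
        t += 1
    return {
        "t": t,
        "n": t,                  # bound on the number of states
        "L": t,                  # bound on expected string length
        "delta": math.exp(-t),   # confidence
        "epsilon": 1.0 / t,      # precision
        "mu": 1.0 / t,           # distinguishability
    }

# With more data, every parameter setting automatically becomes more demanding.
print(parameters_for_data(10_000))      # t = 3  -> epsilon ~ 0.33,  delta ~ 0.05
print(parameters_for_data(10_000_000))  # t = 21 -> epsilon ~ 0.048, delta ~ 7.6e-10
```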

7 Parametric models

We can now examine the relevance of these results to the distinction between parametric and non-parametric models. Parametric models are those where the class of languages is parametrised by a small set of finite-valued (binary) parameters, where the number of parameters is small compared to the log2 of the complexity of the languages. Without this latter constraint the notion is mathematically vacuous, since, for example, any context-free grammar in Chomsky normal form can be parametrised with N^3 + NM + 1 binary parameters, where N is the number of non-terminals and M the number of terminals. This constraint is also necessary for parametric models to make testable empirical predictions about language universals, developmental evidence, and the relationships between the two (Hyams, 1986). We neglect here the important issue of lexical learning: we assume, implausibly, that lexical learning can take place completely before syntax learning commences.

It has in the past been stated that the finiteness of a language class suffices to guarantee learnability even under a PAC-learning criterion (Bertolo, 2001). This is, in general, false, and arises from neglecting constraints on the sample complexity and on the computational complexities both of learning and of parsing. The negative result of (Kearns et al., 1994) discussed above applies also to parametric models. The specific class of noisy parity functions that they prove are unlearnable is parametrised by a number of binary parameters in a way very reminiscent of a parametric model of language. The mere fact that there are a finite number of parameters does not suffice to guarantee learnability if the resulting class of languages is exponentially large, or if there is no polynomial algorithm for parsing. This does not imply that all parametrised classes of languages will be unlearnable, only that having a small number of parameters is neither necessary nor sufficient to guarantee efficient learnability. If the parameters are shallow, relate to easily detectable properties of the languages, and are independent, then learning can occur efficiently (Yang, 2002). If they are "deep" and inter-related, learning may be impossible. Learnability depends more on simple statistical properties of the distributions of the samples than on the structure of the class of languages.

Our conclusion, then, is ultimately that the theory of learnability will not be able to resolve disputes about the nature of first language acquisition: these problems will have to be answered by empirical research, rather than by mathematical analysis.

Acknowledgements

This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, funded in part by the Swiss Federal Office for Education and Science (OFES). This publication only reflects the authors' views.

References

N. Abe and M. K. Warmuth. 1992. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9:205-260.
N. Abe. 1988. Feasible learnability of formal grammars and the theory of natural language acquisition. In Proceedings of COLING 1988, pages 1-6.
S. Abney, D. McAllester, and F. Pereira. 1999. Relating probabilistic grammars and automata. In Proceedings of ACL '99.
Stefano Bertolo. 2001. A brief overview of learnability. In Stefano Bertolo, editor, Language Acquisition and Learnability. Cambridge University Press.
Noam Chomsky. 1986. Knowledge of Language: Its Nature, Origin, and Use. Praeger.
Alexander Clark and Franck Thollard. 2004a. PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research, 5:473-497, May.
Alexander Clark and Franck Thollard. 2004b. Partially distribution-free learning of regular languages from positive samples. In Proceedings of COLING, Geneva, Switzerland.
F. Denis. 2001. Learning regular languages from simple positive examples. Machine Learning, 44(1/2):37-66.
E. M. Gold. 1967. Language identification in the limit. Information and Control, 10(5):447-474.
S. A. Goldman and H. D. Mathias. 1996. Teaching a smarter learner. Journal of Computer and System Sciences, 52(2):255-267.
N. Hyams. 1986. Language Acquisition and the Theory of Parameters. D. Reidel.
M. Kearns and G. Valiant. 1989. Cryptographic limitations on learning boolean formulae and finite automata. In 21st Annual ACM Symposium on Theory of Computing, pages 433-444, New York. ACM.
M. J. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie. 1994. On the learnability of discrete distributions. In Proc. of the 25th Annual ACM Symposium on Theory of Computing, pages 273-282.
W. Labov. 1975. Empirical foundations of linguistic theory. In R. Austerlitz, editor, The Scope of American Linguistics. Peter de Ridder Press.
G. F. Marcus. 1993. Negative evidence in language acquisition. Cognition, 46:53-85.
D. Ron, Y. Singer, and N. Tishby. 1995. On the learnability and usage of acyclic probabilistic finite automata. In COLT 1995, pages 31-40, Santa Cruz, CA, USA. ACM.
L. Valiant. 1984. A theory of the learnable. Communications of the ACM, 27(11):1134-1142.
K. Vijay-Shanker and David J. Weir. 1994. The equivalence of four extensions of context-free grammars. Mathematical Systems Theory, 27(6):511-546.
Kenneth Wexler and Peter W. Culicover. 1980. Formal Principles of Language Acquisition. MIT Press.
C. Yang. 2002. Knowledge and Learning in Natural Language. Oxford University Press.

A Developmental Model of Syntax Acquisition in the Construction Grammar Framework with Cross-Linguistic Validation in English and Japanese

Peter Ford Dominey
Sequential Cognition and Language Group
Institut des Sciences Cognitives, CNRS
69675 Bron CEDEX, France
[email protected]

Toshio Inui
Graduate School of Informatics, Kyoto University
Yoshida-honmachi, Sakyo-ku, 606-8501, Kyoto, Japan
[email protected]

constructions in a generalized, generative manner. Abstract An alternative functionalist perspective holds The current research demonstrates a system that learning plays a much more central role in inspired by cognitive neuroscience and language acquisition. The infant develops an developmental psychology that learns to inventory of grammatical constructions as construct mappings between the grammatical mappings from form to meaning (Goldberg 1995). structure of sentences and the structure of their These constructions are initially rather fixed and meaning representations. Sentence to meaning specific, and later become generalized into a more mappings are learned and stored as abstract compositional form employed by the adult grammatical constructions. These are stored (Tomasello 1999, 2003). In this context, and retrieved from a construction inventory construction of the relation between perceptual and based on the constellation of closed class cognitive representations and grammatical form items uniquely identifying each construction. plays a central role in learning language (e.g. These learned mappings allow the system to Feldman et al. 1990, 1996; Langacker 1991; processes natural language sentences in order Mandler 1999; Talmy 1998). to reconstruct complex internal representations These issues of learnability and innateness have of the meanings these sentences describe. The provided a rich motivation for simulation studies system demonstrates error free performance that have taken a number of different forms. and systematic generalization for a rich subset Elman (1990) demonstrated that recurrent of English constructions that includes complex networks are sensitive to predictable structure in hierarchical grammatical structure, and grammatical sequences. Subsequent studies of generalizes systematically to new sentences of grammar induction demonstrate how syntactic the learned construction categories. Further testing demonstrates (1) the capability to structure can be recovered from sentences (e.g. accommodate a significantly extended set of Stolcke & Omohundro 1994). From the constructions, and (2) extension to Japanese, a “grounding of language in meaning” perspective free word order language that is structurally (e.g. Feldman et al. 1990, 1996; Langacker 1991; quite different from English, thus Goldberg 1995) Chang & Maia (2001) exploited demonstrating the extensibility of the structure the relations between action representation and mapping model. simple verb frames in a construction grammar approach. In effort to consider more complex 1 Introduction grammatical forms, Miikkulainen (1996) demonstrated a system that learned the mapping The nativist perspective on the problem of between relative phrase constructions and multiple language acquisition holds that the data to which the child is exposed is for maintaining state information during the highly indeterminate, and underspecifies the processing of the next embedded clause in a mapping to be learned. This “poverty of the recursive manner. stimulus” is a central argument for the existence of In a more generalized approach, Dominey a genetically specified universal grammar, such (2000) exploited the regularity that sentence to that language acquisition consists of configuring meaning mapping is encoded in all languages by the UG for the appropriate target language word order and grammatical marking (bound or (Chomsky 1995). In this framework, once a given free) (Bates et al. 1982). 
That model was based on parameter is set, its use should apply to new

33 the functional neurophysiology of cognitive “gugle” to the object of push. sequence and language processing and an associated neural network model that has been WordToReferent(i,j) = WordToReferent(i,j) + demonstrated to simulate interesting aspects of OCA(k,i) * SEA(m,j) * infant (Dominey & Ramus 2000) and adult max(α, SentenceToScene(m,k)) (1) language processing (Dominey et al. 2003). 2.2 Open vs Closed Class Word Categories 2 Structure mapping for language learning Our approach is based on the cross-linguistic The mapping of sentence form onto meaning observation that open class words (e.g. nouns, (Goldberg 1995) takes place at two distinct levels verbs, adjectives and adverbs) are assigned to their in the current model: Words are associated with thematic roles based on word order and/or individual components of event descriptions, and grammatical function words or morphemes (Bates grammatical structure is associated with functional et al. 1982). Newborn infants are sensitive to the roles within scene events. The first level has been perceptual properties that distinguish these two addressed by Siskind (1996), Roy & Pentland categories (Shi et al. 1999), and in adults, these categories are processed by dissociable (2002) and Steels (2001) and we treat it here in a neurophysiological systems (Brown et al. 1999). relatively simple but effective manner. Our Similarly, artificial neural networks can also learn principle interest lies more in the second level of to make this function/content distinction (Morgan mapping between scene and sentence structure. et al. 1996). Thus, for the speech input that is Equations 1-7 implement the model depicted in provided to the learning model open and closed Figure 1, and are derived from a class words are directed to separate processing neurophysiologically motivated model of streams that preserve their order and identity, as sensorimotor sequence learning (Dominey et al. indicated in Figure 2. 2003). 2.1 Word Meaning Equation (1) describes the associative memory, WordToReferent, that links word vectors in the OpenClassArray (OCA) with their referent vectors in the SceneEventArray (SEA)1. In the initial learning phases there is no influence of syntactic knowledge and the word-referent associations are stored in the WordToReferent matrix (Eqn 1) by associating every word with every referent in the current scene (α = 1), exploiting the cross- situational regularity (Siskind 1996) that a given word will have a higher coincidence with referent to which it refers than with other referents. This initial word learning contributes to learning the mapping between sentence and scene structure (Eqn. 4, 5 & 6 below). Then, knowledge of the syntactic structure, encoded in SentenceToScene Figure 1. Structure-Mapping Architecture. 1. Lexical categorization. can be used to identify the appropriate referent (in 2. Open class words in Open Class Array are translated to Predicted the SEA) for a given word (in the OCA), Referents in the PRA via the WordtoReferent mapping. 3. PRA elements are mapped onto their roles in the SceneEventArray by the corresponding to a zero value of α in Eqn. 1. In SentenceToScene mapping, specific to each sentence type. 4. This this “syntactic bootstrapping” for the new word mapping is retrieved from Construction Inventory, via the ConstructionIndex that encodes the closed class words that “gugle,” for example, syntactic knowledge of characterize each grammatical construction type. 
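One way to read Equation (1) is as a Hebbian-style outer-product update between word vectors and scene-referent vectors, gated by max(α, SentenceToScene). The sketch below is our own minimal rendering of that reading, not the authors' implementation: the vectors are random placeholders, and the array sizes only loosely follow those given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_WORDS, N_SCENE, DIM = 6, 6, 25     # sizes loosely echoing the paper

# OCA[k]: vector for the k-th open-class word; SEA[m]: vector for the m-th
# scene element.  Both are random placeholders for illustration only.
OCA = rng.random((N_WORDS, DIM))
SEA = rng.random((N_SCENE, DIM))

word_to_referent = np.zeros((DIM, DIM))              # the associative memory
sentence_to_scene = rng.random((N_SCENE, N_WORDS))   # placeholder syntactic mapping

def update_word_to_referent(W, oca, sea, s2s, alpha):
    """One pass of Equation (1): for every word k and scene element m, add the
    outer product oca[k] (x) sea[m], weighted by max(alpha, SentenceToScene(m, k))."""
    W = W.copy()
    for k in range(oca.shape[0]):
        for m in range(sea.shape[0]):
            gate = max(alpha, s2s[m, k])
            W += gate * np.outer(oca[k], sea[m])
    return W

# Early learning: alpha = 1, so every word is associated with every referent
# (pure cross-situational statistics).
word_to_referent = update_word_to_referent(word_to_referent, OCA, SEA,
                                           sentence_to_scene, alpha=1.0)

# Later learning: alpha = 0, so the update is gated by the (learned)
# SentenceToScene mapping, i.e. the "syntactic bootstrapping" regime.
word_to_referent = update_word_to_referent(word_to_referent, OCA, SEA,
                                           sentence_to_scene, alpha=0.0)
print(word_to_referent.shape)   # (25, 25)
```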
Agent-Event-Object structure of the sentence 2.3 Mapping Sentence to Meaning “John pushed the gugle” can be used to assign Meanings are encoded in an event predicate, argument representation corresponding to the 1 In Eqn 1, the index k = 1 to 6, corresponding to the maximum SceneEventArray in Figure 1 (e.g. push(Block, number of words in the open class array (OCA). Index m = 1 to 6, corresponding to the maximum number of elements in the scene event triangle) for “The triangle pushed the block”). array (SEA). Indices i and j = 1 to 25, corresponding to the word and There, the sentence to meaning mapping can be scene item vector sizes, respectively. 34 characterized in the following successive steps. referents to scene elements (in SEA). The First, words in the Open Class Array are decoded resulting, SentenceToSceneCurrent encodes the into their corresponding scene referents (via the correspondence between word order (that is WordToReferent mapping) to yield the Predicted preserved in the PRA Eqn 2) and thematic roles in Referents Array that contains the translated words the SEA. Note that the quality of while preserving their original order from the OCA SentenceToSceneCurrent will depend on the (Eqn 2) 2. quality of acquired word meanings in WordToReferent. Thus, syntactic learning n requires a minimum baseline of semantic PRA(k,j) =ƒ OCA(k,i) * WordToReferent(i,j) (2) knowledge. i= 1 Next, each sentence type will correspond to a SentenceToSceneCurrent(m,k) = specific form to meaning mapping between the n (4) PRA and the SEA. encoded in the ƒ PRA(k,i)*SEA(m,i) SentenceToScene array. The problem will be to i=1 retrieve for each sentence type, the appropriate corresponding SentenceToScene mapping. To Given the SentenceToSceneCurrent mapping solve this problem, we recall that each sentence for the current sentence, we can now associate it in type will have a unique constellation of closed the ConstructionInventory with the corresponding class words and/or bound morphemes (Bates et al. function word configuration or ConstructionIndex 1982) that can be coded in a ConstructionIndex for that sentence, expressed in (Eqn 5)4. (Eqn.3) that forms a unique identifier for each sentence type. ConstructionInventory(i,j) = ConstructionInventory(i,j) The ConstructionIndex is a 25 element vector. + ConstructionIndex(i) Each function word is encoded as a single bit in a * SentenceToScene-Current(j) (5) 25 element FunctionWord vector. When a function word is encountered during sentence Finally, once this learning has occurred, for processing, the current contents of new sentences we can now extract the ConstructionIndex are shifted (with wrap-around) SentenceToScene mapping from the learned by n + m bits where n corresponds to the bit that is ConstructionInventory by using the on in the FunctionWord, and m corresponds to the ConstructionIndex as an index into this associative number of open class words that have been memory, illustrated in Eqn. 65. encountered since the previous function word (or the beginning of the sentence). Finally, a vector SentenceToScene(i) = addition is performed on this result and the n (6) FunctionWord vector. Thus, the appropriate ƒConstructionInventory(i,j) * ConstructinIndex(j) SentenceToScene mapping for each sentence type i=1 can be indexed in ConstructionInventory by its corresponding ConstructionIndex. To accommodate the dual scenes for complex events Eqns. 4-7 are instantiated twice each, to

ConstructionIndex = fcircularShift(ConstructionIndex, represent the two components of the dual scene. In FunctionWord) (3) the case of simple scenes, the second component of the dual scene representation is null. The link between the ConstructionIndex and the We evaluate performance by using the corresponding SentenceToScene mapping is WordToReferent and SentenceToScene knowledge established as follows. As each new sentence is to construct for a given input sentence the processed, we first reconstruct the specific “predicted scene”. That is, the model will SentenceToScene mapping for that sentence (Eqn 4)3, by mapping words to referents (in PRA) and references array (PRA). Index i = 1 to 25, corresponding to the word and scene item vector sizes. 4 Note that we have linearized SentenceToSceneCurrent from 2 to 2 Index k = 1 to 6, corresponding to the maximum number of scene 1 dimensions to make the matrix multiplication more transparent. items in the predicted references array (PRA). Indices i and j = 1 to Thus index j varies from 1 to 36 corresponding to the 6x6 dimensions 25, corresponding to the word and scene item vector sizes, of SentenceToSceneCurrent. respectively. 5 Again to simplify the matrix multiplication, SentenceToScene 3 Index m = 1 to 6, corresponding to the maximum number of has been linearized to one dimension, based on the original 6x6 elements in the scene event array (SEA). Index k = 1 to 6, matrix. Thus, index i = 1 to 36, and index j = 1 to 25 corresponding to corresponding to the maximum number of words in the predicted the dimension of the ConstructionIndex. 35 construct an internal representation of the scene (random) syntactic knowledge on semantic that should correspond to the input sentence. This learning in the initial learning stages. Evaluation is achieved by first converting the Open-Class- of the performance of the model after this training Array into its corresponding scene items in the indicated that for all sentences, there was error-free Predicted-Referents-Array as specified in Eqn. 2. performance. That is, the PredictedScene The referents are then re-ordered into the proper generated from each sentence corresponded to the scene representation via application of the actual scene paired with that sentence. An SentenceToScene transformation as described in important test of language learning is the ability to Eqn. 76. generalize to new sentences that have not previously been tested. Generalization in this form PSA(m,i) = PRA(k,i) * SentenceToScene(m,k) (7) also yielded error free performance. In this experiment, only 2 grammatical constructions were When learning has proceeded correctly, the learned, and the lexical mapping of words to their predicted scene array (PSA) contents should match scene referents was learned. Word meaning provides the basis for extracting more complex those of the scene event array (SEA) that is syntactic structure. Thus, these word meanings are directly derived from input to the model. We then fixed and used for the subsequent experiments. quantify performance error in terms of the number of mismatches between PSA and SEA. 3.1.2 Passive forms The second experiment examined learning with 3 Learning Experiments the introduction of passive grammatical forms, Three sets of results will be presented. First the thus employing grammatical forms 1-4. demonstration of the model sentence to meaning mapping for a reduced set of constructions is 3. Passive: The triangle was pushed by the block. presented as a proof of concept. 
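The ConstructionIndex mechanism of Equation (3), together with the associative storage and retrieval of Equations (5) and (6), can be sketched as follows. This is our own re-implementation of the prose description, with a made-up closed-class vocabulary; the shift rule and vector sizes follow the text's description rather than the authors' code, so details may differ.

```python
import numpy as np

CLOSED_CLASS = {"the": 0, "to": 1, "was": 2, "by": 3, "that": 4, "it": 5}
INDEX_SIZE = 25                     # ConstructionIndex is a 25-element vector

def construction_index(sentence):
    """Build the ConstructionIndex for a sentence (Eqn. 3): on each closed-class
    word, circularly shift the current index by n + m bits (n = bit of the
    function word, m = open-class words seen since the last function word),
    then add the function-word bit vector."""
    idx = np.zeros(INDEX_SIZE)
    open_count = 0
    for word in sentence.lower().split():
        if word in CLOSED_CLASS:
            n = CLOSED_CLASS[word]
            fw_vec = np.zeros(INDEX_SIZE)
            fw_vec[n] = 1.0
            idx = np.roll(idx, n + open_count)   # shift with wrap-around
            idx = idx + fw_vec
            open_count = 0
        else:
            open_count += 1
    return idx

# The index is the key into an associative memory pairing each construction
# with its SentenceToScene mapping (Eqns. 5 and 6), linearised to one dimension.
MAPPING_SIZE = 36                   # 6 x 6 SentenceToScene, linearised
inventory = np.zeros((INDEX_SIZE, MAPPING_SIZE))

def store(index, sentence_to_scene):
    """Eqn. 5: outer-product storage of a (key, mapping) pair."""
    global inventory
    inventory += np.outer(index, sentence_to_scene)

def retrieve(index):
    """Eqn. 6: read the mapping back out with the index as the cue."""
    return inventory.T @ index

# Different constellations of closed-class words give different keys:
active = construction_index("the block pushed the triangle")
passive = construction_index("the triangle was pushed by the block")
print(np.array_equal(active, passive))   # False: the two constructions differ

store(active, np.arange(MAPPING_SIZE, dtype=float))
print(np.allclose(retrieve(active), np.arange(MAPPING_SIZE) * (active @ active)))  # True
```

The point of the demo is only that distinct constellations of closed-class items yield distinct keys, and that the outer-product memory returns the stored mapping (up to a scale factor) when cued with the same key.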
This will be 4. Dative Passive: The moon was given to the followed by a test of generalization to a new triangle by the block. extended set of grammatical constructions. Finally, in order to validate the cross-linguistic validity of the underlying principals, the model is A new set of pairs was tested with Japanese, a free word-order language generated that employed grammatical that is qualitatively quite distinct from English. constructions, with two- and three- arguments, and active and passive grammatical forms for the 3.1 Proof of Concept with Two Constructions narration. Word meanings learned in Experiment 1 were used, so only the structural mapping from 3.1.1 Initial Learning of Active Forms for grammatical to scene structure was learned. With Simple Event Meanings exposure to less than 100 , error The first experiment examined learning with free performance was achieved. Note that only the sentence, meaning pairs with sentences only in the WordToReferent mappings were retained from active voice, corresponding to the grammatical Experiment 1. Thus, the 4 grammatical forms forms 1 and 2. were learned from the initial naive state. This means that the ConstructionIndex and 1. Active: The block pushed the triangle. ConstructionInventory mechanism correctly 2. Dative: The block gave the triangle to the discriminates and learns the mappings for the moon. different grammatical constructions. In the generalization test, the learned values were fixed, For this experiment, the model was trained on and the model demonstrated error-free 544 pairs. Again, meaning is performance on new sentences for all four coded in a predicate-argument format, e.g. grammatical forms that had not been used during push(block, triangle) for sentence 1. During the the training. first 200 trials (scene/sentence pairs), value α in Eqn. 1 was 1 and thereafter it was 0. This was 3.1.3 Relative forms for Complex Events necessary in order to avoid the effect of erroneous The complexity of the scenes/meanings and corresponding grammatical forms in the previous 6 In Eqn 7, index i = 1 to 25 corresponding to the size of the scene experiments were quite limited. Here we consider and word vectors. Indices m and k = 1 to 6, corresponding to the dimension of the predicted scene array, and the predicted references complex mappings that involve array, respectively. relativised sentences and dual event scenes. A 36 small corpus of complex pairs 3.2 Generalization to Extended Construction were generated corresponding to the grammatical Set construction types 5-10 As illustrated above the model can accommodate 10 distinct form-meaning mappings or 5. The block that pushed the triangle touched the moon. grammatical constructions, including constructions 6. The block pushed the triangle that touched the involving "dual" events in the meaning moon. representation that correspond to relative clauses. 7. The block that pushed the triangle was touched by Still, this is a relatively limited size for the the moon. construction inventory. The current experiment 8. The block pushed the triangle that was touched the demonstrates how the model generalizes to a moon. number of new and different relative phrases, as 9. The block that was pushed by the triangle touched well as additional sentence types including: the moon. conjoined (John took the key and opened the door), 10. The block was pushed by the triangle that touched reflexive (The boy said that the dog was chased by the moon. 
the cat), and reflexive pronoun (The block said that it pushed the cylinder) sentence types, for a total of After exposure to less than 100 sentences 38 distinct abstract grammatical constructions. The generated from these relativised constructions, the consideration of these sentence types requires us to model performed without error for these 6 address how their meanings are represented. construction types. In the generalization test, the Conjoined sentences are represented by the two learned values were fixed, and the model corresponding events, e.g. took(John, key), demonstrated error-free performance on new open(John, door) for the conjoined example above. sentences for all six grammatical forms that had Reflexives are represented, for example, as not been used during the training. said(boy), chased(cat, dog). This assumes indeed, for reflexive verbs (e.g. said, saw), that the meaning representation includes the second event 3.1.4 Combined Test as an argument to the first. Finally, for the The objective of the final experiment was to reflexive pronoun types, in the meaning verify that the model was capable of learning the representation the pronoun's referent is explicit, as 10 grammatical forms together in a single learning in said(block), push(block, cylinder) for "The session. Training material from the previous block said that it pushed the cylinder." experiments were employed that exercised the For this testing, the ConstructionInventory is ensemble of 10 grammatical forms. After implemented as a lookup table in which the exposure to less than 150 pairs, ConstructionIndex is paired with the corresponding the model performed without error. Likewise, in SentenceToScene mapping during a single learning the generalization test the learned values were trial. Based on the tenets of the construction fixed, and the model demonstrated error-free grammar framework (Goldberg 1995), if a performance on new sentences for all ten sentence is encountered that has a form (i.e. grammatical forms that had not been used during ConstructionIndex) that does not have a the training. corresponding entry in the ConstructionInventory, This set of experiments in ideal conditions then a new construction is defined. Thus, one demonstrates a proof of concept for the system, exposure to a sentence of a new construction type though several open questions can be posed based allows the model to generalize to any new sentence on these results. First, while the demonstration of that type. In this sense, developing the capacity with 10 grammatical constructions is interesting, to handle a simple initial set of constructions leads we can ask if the model will generalize to an to a highly extensible system. Using the training extended set of constructions. Second, we know procedures as described above, with a pre-learned that the English language is quite restricted with lexicon (WordToReferent), the model successfully respect to its word order, and thus we can ask learned all of the constructions, and demonstrated whether the theoretical framework of the model generalization to new sentences that it was not will generalize to free word order languages such trained on. as Japanese. These questions are addressed in the That the model can accommodate these 38 following three sections. different grammatical constructions with no modifications indicates its capability to generalize. This translates to a (partial) validation of the hypothesis that across languages, thematic role assignment is encoded by a limited set of

37 parameters including word order and grammatical unique to each construction type, and associates marking, and that distinct grammatical the correct SentenceToScene mapping with each of constructions will have distinct and identifying them. As for the English constructions, once ensembles of these parameters. However, these learned, a given construction could generalize to results have been obtained with English that is a new untrained sentences. relatively fixed word-order language, and a more This demonstration with Japanese is an rigorous test of this hypothesis would involve important validation that at least for this subset of testing with a free word-order language such as constructions, the construction-based model is Japanese. applicable both to fixed word order languages such as English, as well as free word order languages 3.3 Generalization to Japanese such as Japanese. This also provides further The current experiment will test the model with validation for the proposal of Bates and sentences in Japanese. Unlike English, Japanese MacWhinney (et al. 1982) that thematic roles are allows extensive liberty in the ordering of words, indicated by a constellation of cues including with grammatical roles explicitly marked by grammatical markers and word order. postpositional function words -ga, -ni, -wo, -yotte. This word-order flexibility of Japanese with 3.4 Effects of Noise respect to English is illustrated here with the English active and passive di-transitive forms that The model relies on lexical categorization of each can be expressed in 4 different common open vs. closed class words both for learning manners in Japanese: lexical semantics, and for building the ConstructionIndex for phrasal semantics. While we 1. The block gave the circle to the triangle. can cite strong evidence that this capability is 1.1 Block-ga triangle-ni circle-wo watashita . expressed early in development (Shi et al. 1999) it 1.2 Block-ga circle-wo triangle-ni watashita . is still likely that there will be errors in lexical 1.3 Triangle-ni block-ga circle-wo watashita . categorization. The performance of the model for 1.4 Circle-wo block-ga triangle-ni watashita . learning lexical and phrasal semantics for active 2. The circle was given to the triangle by the transitive and ditransitive structures is thus block. examined under different conditions of lexical 2.1 Circle-ga block-ni-yotte triangle-ni watasareta. categorization errors. A lexical categorization error 2.2 Block-ni-yotte circle-ga triangle-ni watasareta . consists of a given word being assigned to the 2.3 Block-ni-yotte triangle-ni circle-ga watasareta . wrong category and processed as such (e.g. an 2.4 Triangle-ni circle-ga block-ni-yotte watasareta open class word being processed as a closed class . word, or vice-versa). Figure 2 illustrates the performance of the model with random errors of In the “active” Japanese sentences, the this type introduced at levels of 0 to 20 percent postpositional function words -ga, -ni and –wo errors. explicitly mark agent, recipient and, object whereas in the passive, these are marked respectively by –ni-yotte, -ga, and –ni. For both the active and passive forms, there are four different legal word-order permutations that preserve and rely on this marking. Japanese thus provides an interesting test of the model’s ability to accommodate such freedom in word order. 
Employing the same method as described in the previous experiment, we thus expose the model to pairs generated from 26 Japanese constructions that employ the equivalent of active, passive, relative forms and their permutations. We predicted that by processing the -ga, -ni, -yotte and -wo markers as closed class elements, the model would be able to discriminate and identify the distinct grammatical constructions and learn the corresponding mappings. Indeed, the model successfully discriminates between all of the construction types based on the ConstructionIndex

Figure 2. The effects of Lexical Categorization Errors (mis-categorization of an open-class word as a closed-class word or vice-versa) on performance (Scene Interpretation Errors) over Training Epochs. The 0% trace indicates performance in the absence of noise, with a rapid elimination of errors. The successive introduction of categorization errors yields a corresponding progressive impairment in learning. While sensitive to the errors, the system demonstrates a desired graceful degradation.

38 We can observe that there is a graceful The obvious weakness is that it does not go degradation, with interpretation errors further. That is, it cannot accommodate new progressively increasing as categorization errors construction types without first being exposed to a rise to 20 percent. In order to further asses the training example of a well formed pair. Interestingly, however, this noise, after training with noise, we then tested appears to reflect a characteristic stage of human performance on noise-free input. The interpretation development, in which the infant relies on the use error values in these conditions were 0.0, 0.4, 2.3, of constructions that she has previously heard (see 20.7 and 33.6 out of a maximum of 44 for training Tomasello 2003). Further on in development, with 0, 5, 10, 15 and 20 percent lexical however, as pattern finding mechanisms operate categorization errors, respectively. This indicates on statistically relevant samples of this data, the that up to 10 percent input lexical categorization child begins to recognize structural patterns, errors allows almost error free learning. At 15 corresponding for example to noun phrases (rather percent input errors the model has still than solitary nouns) in relative clauses. When this significantly improved with respect to the random is achieved, these phrasal units can then be inserted behavior (~45 interpretation errors per epoch). into existing constructions, thus providing the basis Other than reducing the lexical and phrasal for “on the fly” processing of novel relativised learning rates, no efforts were made to optimize constructions. This suggests how the abstract the performance for these degraded conditions, construction model can be extended to a more thus there remains a certain degree of freedom for generalized compositional capability. We are improvement. The main point is that the model currently addressing this issue in an extension of does not demonstrate a catastrophic failure in the the proposed model, in which recognition of presence of lexical categorization errors. linguistic markers (e.g. “that”, and directly successive NPs) are learned to signal embedded relative phrases (see Miikkulainen 1996). Future work will address the impact of 4 Discussion ambiguous input. The classical example “John The research demonstrates an implementation of saw the girl with the telescope” implies that a a model of sentence-to-meaning mapping in the given grammatical form can map onto multiple developmental and neuropsychologically inspired meaning structures. In order to avoid this violation construction grammar framework. The strength of of the one to one mapping, we must concede that the model is that with relatively simple “innate” form is influenced by context. Thus, the model learning mechanisms, it can acquire a variety of will fail in the same way that humans do, and grammatical constructions in English and Japanese should be able to succeed in the same way that based on exposure to pairs, humans do. That is, when context is available to with only the lexical categories of open vs. closed disambiguate then ambiguity can be resolved. This class being prespecified. This lexical will require maintenance of the recent discourse categorization can be provided by frequency context, and the influence of this on grammatical analysis, and/or acoustic properties specific to the construction selection to reduce ambiguity. two classes (Blanc et al. 2003; Shi et al. 1999). 
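As a concrete illustration of the degradation experiment, a minimal sketch of how lexical categorization noise of this kind might be injected is given below; the function names and data format are assumptions, not the authors' code.

import random

def noisy_is_closed_class(true_is_closed_class, error_rate, rng=random):
    """Return the (possibly mis-assigned) lexical category of a word.
    With probability error_rate an open-class word is treated as closed-class,
    or vice versa, mimicking the 0-20 percent error levels plotted in Figure 2."""
    if rng.random() < error_rate:
        return not true_is_closed_class
    return true_is_closed_class

# corrupt a labelled sentence at the 10 percent error level
sentence = [("the", True), ("block", False), ("pushed", False),
            ("the", True), ("cylinder", False)]
corrupted = [(word, noisy_is_closed_class(is_closed, 0.10))
             for word, is_closed in sentence]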
The 5 Acknowledgements model learns grammatical constructions, and generalizes in a systematic manner to new This work was supported by the ACI sentences within the class of learned constructions. Computational Neuroscience Project, The This demonstrates the cross-linguistic validity of Eurocores OMLL project and the HFSP our implementation of the construction grammar Organization. approach (Goldberg 1995, Tomasello 2003) and of the “cue competition” model for coding of References grammatical structure (Bates et al. 1982). The Bates E, McNew S, MacWhinney B, Devescovi A, point of the Japanese study was to demonstrate this Smith S (1982) Functional constraints on cross-linguistic validity – i.e. that nothing extra sentence processing: A cross linguistic study, was needed, just the identification of constructions Cognition (11) 245-299. based on lexical category information. Of course a better model for Japanese and Hungarian etc. that Blanc JM, Dodane C, Dominey P (2003) Temporal processing for syntax acquisition. exploits the explicit marking of grammatical roles th of NPs would have been interesting – but it Proc. 25 Ann. Mtg. Cog. Science Soc. Boston wouldn’t have worked for English! Brown CM, Hagoort P, ter Keurs M (1999) Electrophysiological signatures of visual lexical

39 processing: Open- and closed-class words. Miikkulainen R (1996) Subsymbolic case-role Journal of Cognitive Neuroscience. 11 :3, 261- analysis of sentences with embedded clauses. 281 Cognitive Science, 20:47-73. Chang NC, Maia TV (2001) Grounded learning of Morgan JL, Shi R, Allopenna P (1996) Perceptual grammatical constructions, AAAI Spring Symp. bases of rudimentary grammatical categories: On Learning Grounded Representations, Toward a broader conceptualization of Stanford CA. bootstrapping, pp 263-286, in Morgan JL, Chomsky N. (1995) The Minimalist Program. Demuth K (Eds) Signal to syntax, Lawrence MIT Erlbaum, Mahwah NJ, USA. Crangle C. & Suppes P. (1994) Language and Pollack JB (1990) Recursive distributed Learning for Robots, CSLI lecture notes: no. 41, representations. Artificial Intelligence, 46:77- Stanford. 105. Dominey PF, Ramus F (2000) Neural network Roy D, Pentland A (2002). Learning Words from processing of natural language: I. Sensitivity to Sights and Sounds: A Computational Model. serial, temporal and abstract structure of Cognitive Science, 26(1), 113-146. language in the infant. Lang. and Cognitive Shi R., Werker J.F., Morgan J.L. (1999) Newborn Processes, 15(1) 87-127 infants' sensitivity to perceptual cues to lexical Dominey PF (2000) Conceptual Grounding in and grammatical words, Cognition, Volume 72, Simulation Studies of Language Acquisition, Issue 2, B11-B21. Evolution of Communication, 4(1), 57-85. Siskind JM (1996) A computational study of cross- Dominey PF, Hoen M, Lelekov T, Blanc JM situational techniques for learning word-to- (2003) Neurological basis of language and meaning mappings, Cognition (61) 39-91. sequential cognition: Evidence from simulation, Siskind JM (2001) Grounding the lexical semantics aphasia and ERP studies, Brain and Language, of verbs in visual perception using force 86, 207-225 dynamics and event logic. Journal of AI Elman J (1990) Finding structure in time. Research (15) 31-90 Cognitive Science, 14:179-211. Steels, L. (2001) Language Games for Feldman JA, Lakoff G, Stolcke A, Weber SH Autonomous Robots. IEEE Intelligent Systems, (1990) Miniature language acquisition: A vol. 16, nr. 5, pp. 16-22, New York: IEEE Press. touchstone for cognitive science. In Proceedings Stolcke A, Omohundro SM (1994) Inducing of the 12th Ann Conf. Cog. Sci. Soc. 686-693, probabilistic grammars by Bayesian model MIT, Cambridge MA merging/ In Grammatical Inference and nd Feldman J., G. Lakoff, D. Bailey, S. Narayanan, T. Applications: Proc. 2 Intl. Colloq. On Regier, A. Stolcke (1996). L0: The First Five Grammatical Inference, Springer Verlag. Years. Artificial Intelligence Review, v10 103- Talmy L (1988) Force dynamics in language and 129. cognition. Cognitive Science, 10(2) 117-149. Goldberg A (1995) Constructions. U Chicago Tomasello M (1999) The item-based nature of Press, Chicago and London. children's early syntactic development, Trends in Hirsh-Pasek K, Golinkof RM (1996) The origins of Cognitive Science, 4(4):156-163 grammar: evidence from early language Tomasello, M. (2003) Constructing a language: A comprehension. MIT Press, Boston. usage-based theory of language acquisition. Kotovsky L, Baillargeon R, The development of Harvard University Press, Cambridge. calibration-based reasoning about collision events in young infants. 1998, Cognition, 67, 311-351 Langacker, R. (1991). Foundations of Cognitive Grammar. Practical Applications, Volume 2. Stanford University Press, Stanford. Mandler J (1999) Preverbal representations and language, in P. 
Bloom, MA Peterson, L Nadel and MF Garrett (Eds) Language and Space, MIT Press, 365-384

40 On the Acquisition of Phonological Representations

B. Elan DRESHER Department of Linguistics University of Toronto Toronto, Ontario Canada M5S 3H1 [email protected]

Abstract equipped with innate phonetic feature detect- Language learners must acquire the grammar ors, one might suppose that they can use (rules, constraints, principles) of their lan- these to extract phonetic features from the guage as well as representations at various signal. These extracted phonetic features levels. I will argue that representations are would then constitute phonological repre- part of the grammar and must be acquired sentations (surface, or phonetic, representa- together with other aspects of grammar; thus, tions). Once these are acquired, they can grammar acquisition may not presuppose serve as a basis from which learners can ac- knowledge of representations. Further, I will quire the rest of the grammar, namely, the argue that the goal of a learning model phonological rules (and/or constraints) and should not be to try to match or approximate the lexical, or underlying, representations. target forms directly, because strategies to do This idea of acquisition by stages, with re- so are defeated by the disconnect between presentations preceding rules, has enduring principles of grammar and the effects they appeal, though details vary with the prevail- produce. Rather, learners should use target ing theory of grammar; versions of this the- forms as evidence bearing on the selection of ory can be found in (Bloch, 1941) and the correct grammar. I will draw on two (Pinker, 1994:264–5). The idea could not be areas of phonology to illustrate these argu- implemented in American Structuralist pho- ments. The first is the grammar of stress, or nology, however (Chomsky, 1964), and I metrical phonology, which has received will argue that it remains untenable today. I much attention in the learning model litera- will discuss two areas of phonology in which ture. The second concerns the acquisition of representations must be acquired together phonological features and contrasts. This with the grammar, rather than prior to it. The aspect of acquisition turns out, contrary to first concerns the grammar of stress, or first appearances, to pose challenging prob- metrical phonology. The second concerns the lems for learning models. acquisition of phonological features. These pose different sorts of problems for learning 1 Introduction models. The first has been the subject of con- I will discuss the extent to which representa- siderable discussion. The second, to my tions are intertwined with the grammar, and knowledge, has not been discussed in the consequences of this fact for acquisition context of formal learning models. Though it models. I will focus on phonological rep- has often been assumed, as mentioned above, resentations, but the argument extends to that acquisition of features might be the most other components of the grammar. straightforward aspect of phonological acqui- One might suppose that phonological rep- sition, I will argue that it presents challeng- resentations can be acquired directly from the ing problems for learning models. acoustic signal. If, for example, children are

41 2 Representations of stress (2) Acquired representations Phonetic representations are not simply bun- a. Amrica b. Mnitba dles of features. Consider stress, for example. x x Line 2 Depending on the language, stress may be (x) (x x) Line 1 indicated phonetically by pitch, duration, x(x x) (x x)(x) Line 0 loudness, or by some combination of these L L L L L L H L dimensions. So even language learners gifted Ameri ca Mani to:ba with phonetic feature detectors will have to sort out what the specific correlates of stress are in their language. For purposes of the en- Looking at the word America, these repre- suing discussion, I will assume that this sentations indicate that the first syllable A is much can be acquired prior to further unfooted, that the next two syllables meri acquisition of the phonology. constitute a trochaic foot, and that the final But simply deciding which syllables have syllable ca is extrametrical. Manitoba has stress does not yield a surface representation two feet, hence two stresses, of which the of the stress contour of a word. According to second is stronger than the first. The Ls and metrical theory (Liberman and Prince 1977, Hs under the first line of the metrical grid Halle and Idsardi 1995, Hayes 1995), stress designate light and heavy syllables, respec- results from grouping syllables into feet; the tively. The distinction is important in Eng- strongest foot is assigned the main stress, the lish: The syllable to: in Manitoba is heavy, other feet are associated with secondary hence capable of making up a foot by itself, stress. Moreover, some syllables at the edges and it receives the stress. If it were light, then of the stress domain may be designated as Manitoba would have stress on the extrametrical, and not included in feet. antepenultimate syllable, as in America. For example, I assume that learners who How does a learner know to assign these have sorted out which acoustic cues signal surface structures? Not just from the acoustic stress can at some point assign the stress signal, or from the schematic stress contours contours depicted in (1) to English words. in (1). Observe that an unstressed syllable The height of the column over each syllable, can have several metrical representations: it S, indicates how much relative stress it has. can be footed, like the first syllable in However, these are not the surface represen- America; it can be the weak position of a tations. They indicate levels of stress, but no foot, like the second syllable of Manitoba; or metrical organization. it can be extrametrical, like the final syllables in both words. One cannot tell from the (1) Representations of stress contours before sound which of these representations to as- setting metrical parameters sign. The only way to know this is to acquire a. Amrica b. Mnitba the grammar of stress, based on evidence drawn from the observed contours in (1). x x Line 2 x x x Similar remarks hold for determining syl- Line 1 lable quantity. English divides syllables into x x x x x x x x Line 0 light and heavy: a light syllable ends in a S S S S S S S S short vowel, and a heavy syllable contains America Manito:ba either a long vowel or is closed by a con- sonant. In many other languages, though, a According to conventional accounts of closed syllable containing a short vowel is English stress, the metrical structures as- considered to be light, contrary to the English signed to these words are as in (2). categorization. Learners must decide how to

42 classify such syllables, and the decision can- terms of concepts such as heavy syllables, not be made on phonetic grounds alone. heads, feet, and so on. In syntax, various parameters have been posited that refer spe- 3 Acquisition of metrical structure cifically to anaphors, or to functional projec- How, then, are these aspects of phonological tions of various types. These entities do not structure acquired? Following Chomsky come labelled as such in the input, but must (1981), I will suppose that metrical structures themselves be constructed by the learner. So, are governed by a finite number of para- to echo the title character in Plato’s dialogue meters, whose value is to be set on the basis The Meno, how can learners determine if of experience. The possible values of a main stress falls on the first or last foot if parameter are limited and given in advance.1 they do not know what a foot is, or how to Parameter setting models must overcome a identify one? This can be called the Episte- basic problem: the relation between a para- mological Problem: in this case we know meter and what it does is indirect, due to the about something in the abstract, but we do fact that there are many parameters, and they not recognize that thing when it is front of us. interact in complex ways (Dresher and Kaye, Because of the Credit Problem and the 1990). For example, in English main stress is Epistemological Problem, parameter setting tied to the right edge of the word. But that is not like learning to hit a target, where one does not mean that stress is always on the can correct one’s aim by observing where last syllable: it could be on the penultimate previous shots land. The relation between syllable, as in Manitoba, or on the antepen- number of parameters correct and apparent ultimate, as in America. What is consistent in closeness to the target is not smooth (Turkel, these examples is that main stress devolves 1996): one parameter wrong may result in onto the strong syllable of the rightmost foot. forms that appear to be way off the target, Where this syllable and foot is in any given whereas many parameters wrong may word depends on how a variety of parameters produce results that appear to be better are set. Some surprising consequences follow (Dresher, 1999). This discrepancy between from the nontransparent relationship between grammar and outputs defeats learning models a parameter and its effects. that blindly try to match output forms The first one is that a learner who has (Gibson and Wexler, 1994), or that are based some incorrectly set parameters might know on a notion of goodness-of-fit (Clark and that something is wrong, but might not know Roberts, 1993). In terms of Fodor (1998), which parameter is the source of the prob- there are no unambiguous triggers: thus, lem. This is known as the Credit Problem learning models that seek them in individual (cf. Clark 1989, 1992, who calls this the target forms are unlikely to be successful. Selection Problem): a learner cannot reliably I have argued (Dresher, 1999) that Plato’s assign credit or blame to individual solution – a series of questions posed in a parameters when something is wrong. specified order – is the best approach we There is a second way in which parameters have. One version of this approach is the can pose problems to a learner. Some para- cue-based learner of (Dresher and Kaye, meters are stated in terms of abstract entities 1990). 
In this model, not only are the prin- and theory-internal concepts that the learner ciples and parameters of Universal Grammar may not initially be able to identify. For ex- innate, but learners must be born with some ample, the theory of stress is couched in kind of a road map that guides them in setting the parameters. Some ingredients of 1For some other approaches to the acquisition of this road map are the following: stress see (Daelemans Gillis and Durieux, 1994), First, Universal Grammar associates every (Gupta and Touretzky, 1994), (Tesar, 1998, 2004), and parameter with a cue, something in the data (Tesar and Smolensky, 1998).
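The skeleton below is one way to picture such a cue-based learner in code; it is an illustrative reading of the scheme, not Dresher and Kaye's actual system, and the two toy stress cues and the data format are assumptions.

def set_parameters(learning_path, data, grammar=None):
    """learning_path: ordered (parameter, cue) pairs; the ordering is the
    learning path. Each cue inspects the data, given the partially set
    grammar, and returns a value for its parameter, so later cues can
    rely on structure built with earlier settings."""
    grammar = dict(grammar or {})
    for parameter, cue in learning_path:      # order fixed in advance
        grammar[parameter] = cue(data, grammar)
    return grammar

# Toy cues: each data item is assumed to record the observed main-stress
# position and the number of syllables of a word.
def cue_main_stress_edge(data, grammar):
    right = sum(d["main_stress_pos"] >= d["length"] / 2 for d in data)
    return "right" if right > len(data) / 2 else "left"

def cue_final_extrametrical(data, grammar):
    # stress never observed on the final syllable
    return all(d["main_stress_pos"] != d["length"] - 1 for d in data)

learning_path = [("main_stress_edge", cue_main_stress_edge),
                 ("final_syllable_extrametrical", cue_final_extrametrical)]
# grammar = set_parameters(learning_path, observed_words)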

43 that signals the learner how that parameter is fore, to identify a phoneme one must be able to be set. The cue might be a pattern that the to assign to it a representation in terms of learner must look for, or simply the presence feature specifications. What are these repre- of some element in a particular context. sentations? Since Saussure, it has been a Second, parameter setting proceeds in a central assumption of much linguistic theory (partial) order set by Universal Grammar: that a unit is defined not only in terms of its this ordering specifies a learning path (Light- substance, but also in negative terms, with foot 1989). The setting of a parameter later respect to the units it contrasts with. On this on the learning path depends on the results of way of thinking, an /i/ that is part of a three- earlier ones. vowel system /i a u/ is not necessarily the Hence, cues can become increasingly ab- same thing as an /i/ that is part of a seven- e a o Ѩ u/. In a three-vowel گ stract and grammar-internal the further along vowel system /i the learning path they are. As learners ac- system, no more than two features are re- quire more of the system, their representa- quired to distinguish each vowel from all the tions become more sophisticated, and they others; in a seven-vowel system, at least one are able to build on what they have already 2 more feature is required. learned to set more parameters. Jakobson and Halle (1956) suggested that If this approach is correct, there is no distinctive features are necessarily binary be- parameter-independent learning algorithm. cause of how they are acquired, through a This is because the learning path is depend- series of ‘binary fissions’. They propose that ent on the particular parameters. Also, the the order of these contrastive splits, which cues must be discovered for each parameter. form what I will call a contrastive hierarchy Thus, a learning algorithm for one part of the (Dresher 2003a, b) is partially fixed, thereby grammar cannot be applied to another part of 3 allowing for certain developmental sequen- the grammar in an automatic way. ces and ruling out others. This idea has been fruitfully applied in acquisition studies, 4. Segmental representations where it is a natural way of describing devel- Up to now we have been looking at an aspect oping phonological inventories (Pye Ingram of phonological representation above the and List, 1987), (Ingram, 1989), (Levelt, level of the segment. I have argued that ac- 1989), (Dinnsen et al., 1990), (Dinnsen, quisition of this aspect of surface phono- 1992), and (Rice and Avery, 1995). logical representation cannot simply be based Consider, for example, the development of on attending to the acoustic signal, but segment types in onset position in Dutch requires a more elaborate learning model. (Fikkert, 1994): But what about acquisition of the phonemic inventory of a language? One might suppose (3) Development of Dutch onset consonants that this be achieved prior to the acquisition (Fikkert 1994) of the phonology itself. consonant Since the pioneering work of Trubetzkoy u m and Jakobson, phonological theory has pos- obstruent sonorant ited that phonemes are characterized in terms urum urum of a limited set of distinctive features. There- plosive g fricativ g e nasa g l liquid/glide g

2For details of parameter ordering, defaults, and /P/ /F/ /N/ /L/J/ cues in the acquisition of stress, see (Dresher and Kaye, 1990) and (Dresher, 1999). At first there are no contrasts. The value of 3 For further discussion and critiques of cue-based the consonant defaults to the least marked (u) models see (Nyberg, 1991), (Gillis Durieux and Daele- mans, 1995), (Bertolo et al. 1997), and (Tesar, 2004). onset, namely an obstruent plosive, desig-

44 ѐ u nated here as /P/. The first contrast is be- b. / / triggers labial harmony, but / / and /Ѩ / do not. Though phonetically tween obstruent and sonorant. The former re- u mains the unmarked (u), or default, option; [+labial], there is no evidence that / / and /Ѩ/ are specified for this feature. the marked (m) sonorant defaults to nasal, /N/. At this point children differ. Some ex- Acquiring phonological specifications is pand the obstruent branch first, bringing in not the same as identifying phonetic features. marked fricatives, /F/, in contrast with Surface phonetics do not determine the pho- plosives. Others expand the sonorant branch, nological specifications of a segment. Man- introducing marked sonorants, which may be chu /i/ is phonetically [+ATR], but does not either liquids, /L/, or glides, /J/. Continuing Ѩ in this way we will eventually have a tree bear the feature phonologically; /u/ and / / that gives all and only the contrasting fea- are phonetically [+labial], but are not specif- tures in the language. ied for that feature. How does a learner de- duce phonological (contrastive) specifica- 5 5. Acquiring segmental representations tions from surface phonetics? Let us consider how such representations It must be the case that phoneme acqui- might be acquired. To illustrate, we will look sition requires learners to take into account at the vowel system of Classical Manchu phonological processes, and not just the local (Zhang, 1996), which nicely illustrates the phonetics of individual segments (Dresher types of problems a learning model will have and van der Hulst, 1995). Thus, the phonolo- to overcome. Zhang (1996) proposes the con- gical status of Manchu vowels is demonstrat- trastive hierarchy in (4) for Classical Man- ed most clearly by attending to the effects of chu, where the order of the features is [low]> the vowel on neighbouring segments. This conclusion is strengthened when we [coronal]>[labial]>[ATR]. consider that the distinction between /u/ and U (4) Classical Manchu vowel system (Zhang / / in Classical Manchu is phonetically evi- 1996)4 dent only after back consonants; elsewhere, [low] they merge to [u]. To determine the under- – + lying identity of a surface [u], therefore, a language learner must observe its patterning [coronal] [labial] +ru– –ru+ with other vowels: if it co-occurs with

[+ATR] vowels, it is /u/; otherwise, it is /U/. /i/ [ATR] [ATR] /ѐ/ +ty– +ty– The nonlocal and diverse character of the /u/ /Ѩ/ /ђ/ /a/ evidence bearing on the feature specifica- tions of segments poses a challenge to Part of the evidence for these specifica- learning models. tions comes from the following observations: Finally, let us consider the acquisition of the hierarchy of contrastive features in each (5) Evidence for the specifications in (4) language. Examples such as the acquisition u ђ i a. / / and / / trigger ATR harmony, but / / of Dutch onsets given above appear to accord does not, though /i/ is phonetically well with the notion of a learning path, [+ATR], suggesting that /i/ lacks a whereby learners proceed to master individ- phonological specification for [ATR]. ual feature contrasts in order. If this order were the same for all languages, then this

4Zhang (1996) assumes privative features: [F] vs. 5Phonological contrasts that play a role in phono- the absence of [F], rather than [+F] vs. [–F]. The logical representations are thus different from their distinction between privative and binary features is not phonetic manifestations, the subject of studies such as crucial to the matters under discussion here. (Flemming, 1995).
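A compact way to see how an ordered feature hierarchy yields contrastive, rather than full phonetic, specifications is the successive-division sketch below. The feature order follows the hierarchy in (4); the phoneme symbols (i, u, U, e, a, o) and the full phonetic matrices are stand-ins assumed for illustration only.

def contrastive_specs(inventory, feature_order, phonetics):
    """Assign only contrastive feature values by successive binary division
    of the inventory, taking features in the fixed order of the hierarchy.
    phonetics[p][f] is the full phonetic value ('+' or '-') of feature f."""
    specs = {p: {} for p in inventory}

    def divide(group, features):
        if len(group) <= 1 or not features:
            return
        f, rest = features[0], features[1:]
        plus = {p for p in group if phonetics[p][f] == "+"}
        minus = group - plus
        if plus and minus:                 # the feature is contrastive in this group
            for p in plus:
                specs[p][f] = "+"
            for p in minus:
                specs[p][f] = "-"
            divide(plus, rest)
            divide(minus, rest)
        else:                              # no contrast here: skip the feature
            divide(group, rest)

    divide(set(inventory), list(feature_order))
    return specs

# Assumed phonetic matrices for the six Classical Manchu vowels:
PHONETICS = {
    "i": {"low": "-", "coronal": "+", "labial": "-", "ATR": "+"},
    "u": {"low": "-", "coronal": "-", "labial": "+", "ATR": "+"},
    "U": {"low": "-", "coronal": "-", "labial": "+", "ATR": "-"},
    "e": {"low": "+", "coronal": "-", "labial": "-", "ATR": "+"},
    "a": {"low": "+", "coronal": "-", "labial": "-", "ATR": "-"},
    "o": {"low": "+", "coronal": "-", "labial": "+", "ATR": "-"},
}
specs = contrastive_specs(PHONETICS, ["low", "coronal", "labial", "ATR"], PHONETICS)
# specs["i"] == {"low": "-", "coronal": "+"}             : no contrastive [ATR]
# specs["u"] == {"low": "-", "coronal": "-", "ATR": "+"} : no contrastive [labial]

On these assumed inputs the procedure reproduces the specifications in (4): /i/ receives no contrastive [ATR] and /u/ no contrastive [labial], exactly the pattern appealed to in (5).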

45 much would not have to be acquired. How- Perhaps this paradox is only apparent. How- ever, it appears that the feature hierarchies ever it is resolved, the issue raises an inter- vary somewhat across languages (Dresher, esting problem for models of acquisition. 2003a, b). The existence of variation raises the question of how learners determine the 7 Acknowledgements order for their language. The problem is This research was supported in part by grant difficult, because establishing the correct 410-2003-0913 from the Social Sciences and ordering, as shown by the active contrasts in Humanities Research Council of Canada. I a language, appears to involve different kinds would like to thank the members of the pro- of potentially conflicting evidence. In the ject on Contrast in Phonology at the case of metrical parameters, the relevant evi- University of Toronto (http://www.chass. dence could be reduced to particular cues, or utoronto.ca/~contrast/) for discussion. so it appears. Whether the setting of feature hierarchies can be parameterized in a similar References way remains to be demonstrated. Stefano Bertolo Kevin Broihir Edward Gibson and Kenneth Wexler. 1997. Char- 6 Conclusion acterizing learnability conditions for cue- I will conclude by raising one further based learners in parametric language sys- problem for learning models that is suggested tems. In Tilman Becker and Hans-Ulrich by the Manchu vowel system. We have ob- Krieger, editors, Proceedings of the Fifth served that in Classical Manchu, /ђ/ is the Meeting on the Mathematics of Language. [+ATR] counterpart of /a/. Both vowels are http://www.dfki.de/events/ mol/. [+low]. Since [low] is ordered first among Bernard Bloch. 1941. Phonemic overlapping. the vowel features in the Manchu hierarchy, American Speech 16:278–284. Reprinted we might suppose that learners determine in Martin Joos, editor, Readings in Lingui- which vowels are [+low] and which are not stics I, Second edition, 93–96. New York: at an early stage in the process, before as- American Council of Learned Societies, signing the other features. However, a vowel 1958. that is phonetically [ђ] is ambiguous as to its Noam Chomsky. 1964. Current issues in lin- featural classification. In many languages, guistic theory. In Jerry A. Fodor and including descendants of Classical Manchu Jerrold J. Katz, editors, The Structure of (Zhang, 1996, Dresher & Zhang, 2003) such Language, 50–118. Englewood Cliffs, NJ: vowels are classified as [–low]. What helps Prentice-Hall. to place /ђ/ as a [+low] vowel in Classical Noam Chomsky. 1981. Principles and para- meters in syntactic theory. In Norbert Manchu is the knowledge that it is the Hornstein and David Lightfoot, editors, [+ATR] counterpart of /a/. That is, in order to ђ Explanation In Linguistics: The Logical assign the feature [+low] to / /, it helps to Problem of Language Acquisition, 32–75. know that it is [+ATR]. But, by hypothesis, London: Longman. [low] is assigned before [ATR]. Similarly, the Robin Clark. 1989. On the relationship bet- determination that /i/ is contrastively ween the input data and parameter setting. [+coronal] is tied in with its not being con- In Proceedings of NELS 19, 48–62. trastively [–labial]; but [coronal] is assigned GLSA, University of Massachusetts, prior to [labial]. Amherst. It appears, then, that whatever order we Robin Clark. 1992. The selection of syntactic choose to assign features, it is necessary to knowledge. Language Acquisition 2:83– have some advance knowledge about classi- 149. 
fication with respect to features ordered later.

46 Robin Clark and Ian Roberts. 1993. A com- vowel systems. Paper presented at the putational model of language learnability Twenty-Ninth Annual Meeting of the and language change. Linguistic Inquiry Berkeley Linguistics Society, February 24:299–345. 2003. To appear in the Proceedings. Walter Daelemans Steven Gillis and Gert Paula Fikkert. 1994. On the Acquisition of Durieux. 1994. The acquisition of stress: A Prosodic Structure (HIL Dissertations 6). data-oriented approach. Computational Dordrecht: ICG Printing. Linguistics 20:421–451. Edward Flemming. 1995. Auditory represen- Daniel A. Dinnsen. 1992. Variation in devel- tations in phonology. Doctoral disserta- oping and fully developed phonetic inven- tion, UCLA. tories. In Charles Ferguson Lise Menn and Janet Dean Fodor. 1998. Unambiguous trig- Carol Stoel-Gammon, editors, Phonologi- gers. Linguistic Inquiry 29:1–36. cal Development: Models, Research, Im- Edward Gibson and Kenneth Wexler. 1994. plications, 191–210. Timonium, MD: Triggers. Linguistic Inquiry 25:407–454. York Press,. Steven Gillis Gert Durieux and Walter Daniel A. Dinnsen Steven B. Chin Mary Daelemans. 1996. A computational model Elbert and Thomas W. Powell. 1990. of P&P: Dresher & Kaye (1990) revisited. Some constraints on functionally disorder- In Frank Wijnen and Maaike Verrips, edit- ed phonologies: Phonetic inventories and ors, Approaches to Parameter Setting, phonotactics. Journal of Speech and 135–173. Vakgroep Algemene Taalweten- Hearing Research 33:28–37. schap, Universiteit van Amsterdam. B. Elan Dresher. 1999. Charting the learning Prahlad Gupta and David Touretzky. 1994. path: Cues to parameter setting. Linguistic Connectionist models and linguistic Inquiry 30:27–67. theory: Investigations of stress systems in B. Elan Dresher. 2003a. Contrast and asym- language. Cognitive Science 18:1–50. metries in inventories. In Anna-Maria di Morris Halle and William J. Idsardi. 1995. Sciullo, editor, Asymmetry in Grammar, General properties of stress and metrical Volume 2: Morphology, Phonology, structure. In John Goldsmith, editor, The Acquisition, 239–257. Amsterdam: John Handbook of Phonological Theory, 403– Benjamins. 443. Cambridge, MA: Blackwell. B. Elan Dresher. 2003b. The contrastive Bruce Hayes. 1995. Metrical Stress Theory: hierarchy in phonology. In Daniel Currie Principles and Case Studies. Chicago: Hall, editor, Toronto Working Papers in University of Chicago Press. Linguistics (Special Issue on Contrast in David Ingram. 1989. First Language Acquis- Phonology) 20, 47–62. Toronto: Depart- ition: Method, Description and Explana- ment of Linguistics, University of tion. Cambridge: Cambridge University Toronto. Press. B. Elan Dresher and Harry van der Hulst. Roman Jakobson and Morris Halle. 1956. 1995. Global determinacy and learnability Fundamentals of Language. The Hague: in phonology. In John Archibald, editor, Mouton. Phonological Acquisition and Phonologi- Clara C. Levelt. 1989. An essay on child cal Theory, 1–21. Hillsdale, NJ: Lawrence phonology. M.A. thesis, Leiden Uni- Erlbaum. versity. B. Elan Dresher and Jonathan Kaye. 1990. A Mark Liberman and Alan Prince. 1977. On computational learning model for metrical stress and linguistic rhythm. Linguistic phonology. Cognition 34:137–195. Inquiry 8:249–336. B. Elan Dresher and Xi Zhang. 2003. Phono- David Lightfoot. 1989. The child’s trigger logical contrast and phonetics in Manchu experience: Degree-0 learnability (with

47 commentaries). Behavioral and Brain Sciences 12:321–375. Eric H. Nyberg 3rd. 1991. A non-deterministic, success-driven model of parametric setting in language acquisition. Doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA. Steven Pinker. 1994. The Language Instinct. New York: William Morrow. Plato. Meno. Various editions. Clifton Pye, David Ingram and Helen List. 1987. A comparison of initial consonant acquisition in English and Quiché. In Keith E. Nelson and Ann Van Kleeck, editors, Children's Language (Vol. 6), 175–190. Hillsdale, NJ: Erlbaum. Keren Rice and Peter Avery. 1995. Variability in a deterministic model of language acquisition: A theory of segmental elaboration. In John Archibald, editor, Phonological Acquisition and Phonological Theory, 23–42. Hillsdale, NJ: Lawrence Erlbaum. Bruce Tesar. 1998. An iterative strategy for language learning. Lingua 104:131–145. Bruce Tesar. 2004. Using inconsistency detection to overcome structural ambiguity. Linguistic Inquiry 35:219–253. Bruce Tesar and Paul Smolensky. 1998. Learnability in Optimality Theory. Linguistic Inquiry 29:229–268. William J. Turkel. 1996. Smoothness in a parametric subspace. Ms., University of British Columbia, Vancouver. Xi Zhang. 1996. Vowel systems of the Manchu-Tungus languages of China. Doctoral dissertation, University of Toronto.

48 Statistics Learning and Universal Grammar: Modeling Word Segmentation

Timothy Gambell
59 Bishop Street
New Haven, CT 06511, USA
[email protected]

Charles Yang
Department of Linguistics, Yale University
New Haven, CT 06511, USA
[email protected]

Abstract vision of labor between endowment and learning± plainly, things that are built in needn't be learned, This paper describes a computational model of word and things that can be garnered from experience segmentation and presents simulation results on re- needn't be built in. alistic acquisition In particular, we explore the ca- The important paper of Saffran, Aslin, & New- pacity and limitations of statistical learning mecha- port [8] on statistical learning (SL), suggests that nisms that have recently gained prominence in cog- children may be powerful learners after all. Very nitive psychology and linguistics. young infants can exploit transitional probabilities between syllables for the task of word segmenta- 1 Introduction tion, with only minimum exposure to an arti®cial Two facts about language learning are indisputable. language. Subsequent work has demonstrated SL First, only a human baby, but not her pet kitten, can in other domains including arti®cial grammar learn- learn a language. It is clear, then, that there must ing [9], music [10], vision [11], as well as in other be some element in our biology that accounts for species [12]. This then raises the possibility of this unique ability. Chomsky's Universal Grammar learning as an alternative to the innate endowment (UG), an innate form of knowledge speci®c to lan- of linguistic knowledge [13]. guage, is an account of what this ability is. This po- We believe that the computational modeling of sition gains support from formal learning theory [1- psychological processes, with special attention to 3], which sharpens the logical conclusion [4,5] that concrete mechanisms and quantitative evaluations, no (realistically ef®cient) learning is possible with- can play an important role in the endowment vs. out priori restrictions on the learning space. Sec- learning debate. Linguists' investigations of UG are ond, it is also clear that no matter how much of a rarely developmental, even less so corpus-oriented. head start the child has through UG, language is Developmental psychologists, by contrast, often learned. Phonology, lexicon, and grammar, while stop at identifying components in a cognitive task governed by universal principles and constraints, do [14], without an account of how such components vary from language to language, and they must be work together in an algorithmic manner. On the learned on the basis of linguistic experience. In other hand, if computation is to be of relevance other words±indeed a truism±both endowment and to linguistics, psychology, and cognitive science in learning contribute to language acquisition, the re- general, being merely computational will not suf- sult of which is extremely sophisticated body of ®ce. A model must be psychological plausible, and linguistic knowledge. Consequently, both must be ready to face its implications in the broad empirical taken in account, explicitly, in a theory of language contexts [7]. For example, how does it generalize acquisition [6,7]. to typologically different languages? How does the Controversies arise when it comes to the relative model's behavior compare with that of human lan- contributions by innate knowledge and experience- guage learners and processors? based learning. 
Some researchers, in particular lin- In this article, we will present a simple compu- guists, approach language acquisition by charac- tational model of word segmentation and some of terizing the scope and limits of innate principles its formal and developmental issues in child lan- of Universal Grammar that govern the world's lan- guage acquisition. Speci®cally we show that SL guage. Others, in particular psychologists, tend to using transitional probabilities cannot reliably seg- emphasize the role of experience and the child's ment words when scaled to a realistic setting (e.g., domain-general learning ability. Such division of child-directed English). To be successful, it must research agenda understandably stems from the di- be constrained by the knowledge of phonological 49 structure. Indeed, the model reveals that SL may CV syllables, many languages, including English, well be an artifact±an impressive one, nonetheless± make use of a far more diverse range of syllabic that plays no role in actual word segmentation in types. And then, syllabi®cation of speech is far human children. from trivial, which (most likely) involve both in- nate knowledge of phonological structures as well 2 Statistics does not Refute UG as discovering language-speci®c instantiations [14]. All these problems have to be solved before SL for It has been suggested [15, 8] that word segmenta- word segmentation can take place. tion from continuous speech may be achieved by using transitional probabilities (TP) between ad- jacent syllables A and B, where , TP(A→B) = 3 The Model P(AB)/P(A), with P(AB) being the frequency of B following A, and P(A) the total frequency of A. To give a precise evaluation of SL in a realis- Word boundaries are postulated at local minima, tic setting, we constructed a series of (embarrass- where the TP is lower than its neighbors. For ex- ingly simple) computational models tested on child- ample, given suf®cient amount of exposure to En- directed English. glish, the learner may establish that, in the four- The learning data consists of a random sam- syllable sequence ªprettybabyº, TP(pre→tty) and TP(ba→by) are both higher than TP(tty→ba): a ple of child-directed English sentences from the CHILDES database [19] The words were then pho- word boundary can be (correctly) postulated. It is remarkable that 8-month-old infants can extract netically transcribed using the Carnegie Mellon Pro- three-syllable words in the continuous speech of an nunciation Dictionary, and were then grouped into syllables. Spaces between words are removed; how- arti®cial language from only two minutes of expo- sure [8]. ever, utterance breaks are available to the modeled learner. Altogether, there are 226,178 words, con- To be effective, a learning algorithm±indeed any sisting of 263,660 syllables. algorithm±must have an appropriate representation of the relevant learning data. We thus need to be Implementing SL-based segmentation is straight- cautious about the interpretation of the success of forward. One ®rst gathers pair-wise TPs from the SL, as the authors themselves note [16]. If any- training data, which are used to identify local min- thing, it seems that the ®ndings strengthen, rather ima and postulate word boundaries in the on-line than weaken, the case for (innate) linguistic knowl- processing of syllable sequences. Scoring is done edge. A classic argument for innateness [4, 5, for each utterance and then averaged. 
Viewed as an 17] comes from the fact that syntactic operations information retrieval problem, it is customary [20] are de®ned over speci®c types of data structures± to report both precision and recall of the perfor- constituents and phrases±but not over, say, linear mance. strings of words, or numerous other logical possibil- The segmentation results using TP local minima ities. While infants seem to keep track of statistical are remarkably poor, even under the assumption information, any conclusion drawn from such ®nd- that the learner has already syllabi®ed the input per- ings must presuppose children knowing what kind fectly. Precision is 41.6%, and recall is 23.3%; over of statistical information to keep track of. After all, half of the words extracted by the model are not ac- an in®nite range of statistical correlations exists in tual English words, while close to 80% of actual the acoustic input: e.g., What is the probability of a words fail to be extracted. And it is straightfor- syllable rhyming with the next? What is the proba- ward why this is the case. In order for SL to be bility of two adjacent vowels being both nasal? The effective, a TP at an actual word boundary must fact that infants can use SL to segment syllable se- be lower than its neighbors. Obviously, this con- quences at all entails that, at the minimum, they dition cannot be met if the input is a sequence of know the relevant unit of information over which monosyllabic words, for which a space must be pos- correlative statistics is gathered: in this case, it is tulated for every syllable; there are no local min- the syllables, rather than segments, or front vowels. ima to speak of. While the pseudowords in [8] A host of questions then arises. First, How do are uniformly three-syllables long, much of child- they know so? It is quite possible that the primacy directed English consists of sequences of monosyl- of syllables as the basic unit of speech is innately labic words: corpus statistics reveals that on aver- available, as suggested in neonate speech perception age, a monosyllabic word is followed by another studies [18]? Second, where do the syllables come monosyllabic word 85% of time. As long as this from? While the experiments in [8] used uniformly is the case, SL cannot, in principle, work. 50 4 Statistics Needs UG words yields even better performance: a precision of 81.5% and recall of 90.1%. This is not to say that SL cannot be effective for word segmentation. Its application, must be 5 Conclusion constrained±like that of any learning algorithm however powerful±as suggested by formal learning Further work, both experimental and computational, theories [1-3]. The performance improves dramat- will need to address a few pressing questions, in or- ically, in fact, if the learner is equipped with even der to gain a better assessment of the relative contri- a small amount of prior knowledge about phono- bution of SL and UG to language acquisition. These logical structures. Speci®cally, we assume, uncon- include, more pertinent to the problem of word seg- troversially, that each word can have only one pri- mentation: mary stress. (This would not work for functional • Can statistical learning be used in the acquisi- words, however.) If the learner knows this, then tion of language-speci®c phonotactics, a pre- it may limit the search for local minima only in requisite to syllabi®cation and a prelude to the window between two syllables that both bear word segmentation? 
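A minimal sketch of the TP-based segmenter just described is given below, assuming the input has already been syllabified; the helper names and the per-utterance scoring are illustrative, not the authors' code.

from collections import defaultdict

def train_tp(utterances):
    """Estimate TP(A -> B) = P(AB) / P(A) from syllabified utterances
    (lists of syllables with word boundaries removed)."""
    pair, single = defaultdict(int), defaultdict(int)
    for utt in utterances:
        for a, b in zip(utt, utt[1:]):
            pair[(a, b)] += 1
        for syllable in utt:
            single[syllable] += 1
    return {(a, b): n / single[a] for (a, b), n in pair.items()}

def segment(utt, tp):
    """Postulate a word boundary at every local minimum of TP."""
    probs = [tp.get((a, b), 0.0) for a, b in zip(utt, utt[1:])]
    words, start = [], 0
    for i in range(1, len(probs) - 1):
        if probs[i] < probs[i - 1] and probs[i] < probs[i + 1]:
            words.append(utt[start:i + 1])   # boundary after syllable i
            start = i + 1
    words.append(utt[start:])
    return words

def precision_recall(found, truth):
    """Score postulated words against the true segmentation of one utterance."""
    hits = sum(w in truth for w in found)
    return hits / len(found), hits / len(truth)

# With enough exposure, TP(pre->tty) and TP(ba->by) exceed TP(tty->ba), so
# segment(["pre", "tty", "ba", "by"], tp) yields [["pre", "tty"], ["ba", "by"]];
# a run of monosyllabic words, by contrast, has no local minimum at all, which
# is why plain TP segmentation fails on child-directed English.

The stress-delimited variant discussed in the text would additionally restrict the search for local minima to the stretch between two syllables that both bear primary stress, postulating boundaries for free wherever stress alone settles the matter.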
primary stress, e.g., between the two a's in the sequence ªlanguageacquisitionº. This assumption • Given that prosodic constraints are critical for is plausible given that 7.5-month-old infants are the success of SL in word segmentation, future sensitive to strong/weak prosodic distinctions [14]. work needs to quantify the availability of stress When stress information suf®ces, no SL is em- information in spoken corpora. ployed, so ªbigbadwolfº breaks into three words • Can further experiments, carried over realistic for free. Once this simple principle is built in, the linguistic input, further tease apart the multi- stress-delimited SL algorithm can achieve the pre- ple strategies used in word segmentation [14]? cision of 73.5% and 71.2%, which compare favor- What are the psychological mechanisms (algo- ably to the best performance reported in the litera- rithms) that integrate these strategies? ture [20]. (That work, however, uses an computa- tionally prohibitive and psychological implausible • How does word segmentation, statistical or algorithm that iteratively optimizes the entire lexi- otherwise, work for agglutinative (e.g., Turk- con.) ish) and polysynthetic languages (e.g. Mo- The computational models complement the ex- hawk), where the division between words, perimental study that prosodic information takes morphology, and syntax is quite different from priority over statistical information when both are more clear-cut cases like English? available [21]. Yet again one needs to be cautious Computational modeling can make explicit the about the improved performance, and a number of balance between statistics and UG, and are in the unresolved issues need to be addressed by future same vein as the recent ®ndings [24] on when/where work. It remains possible that SL is not used at SL is effective/possible. UG can help SL by all in actual word segmentation. Once the one- providing speci®c constraints on its application, word-one-stress principle is built in, we may con- and modeling may raise new questions for fur- sider a model that does not use any statistics, hence ther experimental studies. In related work [6,7], avoiding the computational cost that is likely to we have augmented traditional theories of UG± be considerable. (While we don't know how in- derivational phonology, and the Principles and Pa- fants keep track of TPs, there are clearly quite some rameters framework±with a component of statisti- work to do. Syllables in English number in the cal learning, with novel and desirable consequences. thousands; now take the quadratic for the potential Yet in all cases, statistical learning, while perhaps number of pair-wise TPs.) It simply stores previ- domain-general, is constrained by what appears to ously extracted words in the memory to bootstrap be innate and domain-speci®c knowledge of linguis- new words. Young children's familiar segmenta- tic structures, such that learning can operate on spe- tion errors±ºI was haveº from be-have, ªhiccing upº ci®c aspects of the input evidence from hicc-up, ªtwo dultsº, from a-dult±suggest that this process does take place. Moreover, there is ev- References idence that 8-month-old infants can store familiar 1. Gold, E. M. (1967). Language identi®cation in sounds in the memory [22]. And ®nally, there are the limit. Information and Control, 10:447-74. plenty of single-word utterances±up to 10% [23]± that give many words for free. The implementation 2. Valiant, L. (1984). A theory of the learnable. 
of a purely symbolic learner that recycles known Communication of the ACM. 1134-1142. 51 3. Vapnik, V. (1995). The Nature of Statistical 18. Bijeljiac-Babic, R., Bertoncini, J., & Mehler, Learning Theory. Berlin: Springer. J. (1993). How do four-day-old infants cate- gorize multisyllabic utterances. Developmen- 4. Chomsky, N. (1959). Review of Verbal Behav- tal psychology, 29, 711-21. ior by B.F. Skinner. Language, 35, 26-57. 19. MacWhinney, B. (1995). The CHILDES 5. Chomsky, N. (1975). Reflections on Language. Project: Tools for Analyzing Talk. Hillsdale: New York: Pantheon. Lawrence Erlbaum. 6. Yang, C. D. (1999). A selectionist theory of 20. Brent, M. (1999). Speech segmentation and language development. In Proceedings of 37th word discovery: a computational perspective. Meeting of the Association for Computational Trends in Cognitive Science, 3, 294-301. Linguistics. Maryland, MD. 431-5. 21. Johnson, E.K. & Jusczyk, P.W. (2001) Word 7. Yang, C. D. (2002). Knowledge and Learning segmentation by 8-month-olds: When speech in Natural Language. Oxford: Oxford Univer- cues count more than statistics. Journal of sity Press. Memory and Language, 44, 1-20. 8. Saffran, J.R., Aslin, R.N., & Newport, E.L. 22. Jusczyk, P. W., & Hohne, E. A. (1997). In- (1996). Statistical learning by 8-month old in- fants' memory for spoken words. Science, 277, fants. Science, 274, 1926-1928. 1984-6. 9. Gomez, R.L., & Gerken, L.A. (1999). Arti®- 23. Brent, M.R., & Siskind, J.M. (2001). The role cial grammar learning by one-year-olds leads of exposure to isolated words in early vocabu- to speci®c and abstract knowledge. Cognition, lary development. Cognition, 81, B33-44. 70, 109-135. 24. Newport, E.L., & Aslin, R.N. (2004). Learn- ing at a distance: I. Statistical learning of non- 10. Saffran, J.R., Johnson, E.K., Aslin R.N. & adjacent dependencies. Cognitive Psychology, Newport, E.L. (1999). Statistical learning of 48, 127-62. tone sequences by human infants and adults. Cognition, 70, 27-52. 11. Fiser, J., & Aslin, R.N. (2002). Statistical learning of new visual feature combinations by infants. PNAS, 99, 15822-6. 12. Hauser, M., Newport, E.L., & Aslin, R.N. (2001). Segmentation of the speech stream in a non-human primate: Statistical learning in cotton-top tamarins. Cognition, 78, B41-B52. 13. Bates, E., & Elman, J. (1996). Learning redis- covered. Science, 274, 1849-50. 14. Jusczyk, P.W. (1999). How infants begin to ex- tract words from speech. Trends in Cognitive Sciences, 3, 323-8. 15. Chomsky, N. (1955/1975). The Logical Struc- ture of Linguistic Theory. Manuscript, Har- vard University and Massachusetts Institute of Technology. Published in 1975 by New York: Plenum. 16. Saffran, J.R., Aslin, R.N., & Newport, E.L. (1997). Letters. Science, 276, 1177-1181 17. Crain, S., & Nakayama, M. (1987). Structure dependency in grammar formation. Language, 63:522-543. 52 Modelling syntactic development in a cross-linguistic context

Fernand GOBET
Centre for Cognition and Neuroimaging
Department of Human Sciences
Brunel University
Uxbridge UB8 3PH, UK
[email protected]

Daniel FREUDENTHAL and Julian M. PINE
School of Psychology
University of Nottingham
Nottingham NG7 2RD, UK
[email protected]
[email protected]

Abstract

Mainstream linguistic theory has traditionally assumed that children come into the world with rich innate knowledge about language and grammar. More recently, computational work using distributional algorithms has shown that the information contained in the input is much richer than proposed by the nativist approach. However, neither of these approaches has been developed to the point of providing detailed and quantitative predictions about the developmental data. In this paper, we champion a third approach, in which computational models learn from naturalistic input and produce utterances that can be directly compared with the utterances of language-learning children. We demonstrate the feasibility of this approach by showing how MOSAIC, a simple distributional analyser, simulates the optional-infinitive phenomenon in English, Dutch, and Spanish. The model accounts for young children's tendency to use both correct finites and incorrect (optional) infinitives in finite contexts, for the generality of this phenomenon across languages, and for the sparseness of other types of errors (e.g., word order errors). It thus shows how these phenomena, which have traditionally been taken as evidence for innate knowledge of Universal Grammar, can be explained in terms of a simple distributional analysis of the language to which children are exposed.

1 Introduction

Children acquiring the syntax of their native language are faced with a task of considerable complexity, which they must solve using only noisy and potentially inconsistent input. Mainstream linguistic theory has addressed this learnability problem by proposing the nativist hypothesis that children come into the world with rich innate knowledge about language and grammar (Chomsky, 1981; Piattelli-Palmarini, 2002; Pinker, 1984). However, there is also strong empirical evidence that the amount of information present in the input is considerably greater than has traditionally been assumed by the nativist approach. In particular, computer simulations have shown that a distributional analysis of the statistics of the input can provide a significant amount of syntactic information (Redington & Chater, 1997).

One limitation of the distributional approach is that analyses have rarely been done with naturalistic input (e.g. mothers' child-directed speech) and have so far not been linked to the detailed analysis of a linguistic phenomenon found in human data (e.g., Christiansen & Chater, 2001). Indeed, neither the nativist nor the distributional approach has been developed to the point of providing detailed and quantitative predictions about the developmental dynamics of the acquisition of language. In order to remedy this weakness, our group has recently been exploring a different approach. This approach, which we think is a more powerful way of understanding how children acquire their native language, has involved developing a computational model (MOSAIC: Model Of Syntax Acquisition In Children) that learns from naturalistic input and produces utterances that can be directly compared with the utterances of language-learning children. This makes it possible to derive quantitative predictions about empirical phenomena observed in children learning different languages and about the developmental dynamics of these phenomena.

MOSAIC, which is based upon a simple distributional analyser, has been used to simulate a number of phenomena in language acquisition. These include: the verb-island phenomenon (Gobet & Pine, 1997; Jones, Gobet, & Pine, 2000); negation errors in English (Croker, Pine, & Gobet, 2003); patterns of pronoun case marking error in English (Croker, Pine, & Gobet, 2001); patterns of subject omission error in English (Freudenthal, Pine, & Gobet, 2002b); and the optional-infinitive phenomenon (Freudenthal, Pine, & Gobet, 2001, 2002a, 2003). MOSAIC has also been used to simulate data from three different languages (English, Dutch, and Spanish), which has helped us to understand how these phenomena are affected by differences in the structure of the language that the child is learning.

In this paper, we illustrate our approach by showing how MOSAIC can account in detail for the optional-infinitive phenomenon in two languages (English and Dutch) and its quasi-absence in a third language (Spanish). This phenomenon is of particular interest as it has generally been taken to reflect innate grammatical knowledge on the part of the child (Wexler, 1994, 1998).

We begin by highlighting the theoretical challenges faced in applying our model to data from three different languages. Then, after describing the optional-infinitive phenomenon, we describe MOSAIC, with an emphasis on the mechanisms that will be crucial in explaining the empirical data. We then consider the data from the three languages, and show to what extent the same model can simulate these data. When dealing with English, we describe the methods used to collect and analyse children's data in some detail. While these details may seem out of place in a conference on computational linguistics, we emphasise that they are critical to our approach: first, our approach requires fine-grained empirical data, and, second, the analysis of the data produced by the model is as close as possible to that used with children's data. We conclude by discussing the implications of our approach for developmental psycholinguistics.

2 Three languages: three challenges

The attempt to use MOSAIC to model data in three different languages involves facing up to a number of challenges, each of which is instructive for different reasons. An obvious problem when modelling English data is that English has an impoverished system of verb morphology that makes it difficult to determine which form of the verb a child is producing in any given utterance. This problem militates against conducting objective quantitative analyses of children's early verb use and has resulted in there being no detailed quantitative description of the developmental patterning of the optional-infinitive phenomenon in English (in contrast to other languages like Dutch). We have addressed this problem by using exactly the same (automated) methods to classify the utterances produced by the child and by the model. These methods, which do not rely on the subjective judgment of the coder (e.g. on Bloom's, 1970, method of rich interpretation), proved to be sufficiently powerful to capture the development of the optional infinitive in English, and to do so at a relatively fine level of detail.

One potential criticism of these simulations of English is that we may have tuned the model's parameters in order to optimise the goodness of fit to the human data. An obvious consequence of overfitting the data in this way would be that MOSAIC's ability to simulate the phenomenon would break down when the model was applied to a new language. The simulations of Dutch show that this is not the case: with this language, which has a richer morphology than English, the model was still able to reproduce the key characteristics of the optional-infinitive stage.

Spanish, the syntax of which is quite different to English and Dutch, offered an even more sensitive test of the model's mechanisms. The Dutch simulations relied heavily on the presence of compound finites in the child-directed speech used as input. However, although Spanish child-directed speech has a higher proportion of compound finites than Dutch, children learning Spanish produce optional-infinitive errors less often than children learning Dutch. Somewhat counter-intuitively, the simulations correctly reproduce the relative scarcity of optional-infinitive errors in Spanish, showing that the model is sensitive to subtle regularities in the way compound finites are used in Dutch and Spanish.

3 The optional-infinitive phenomenon

Between two and three years of age, children learning English often produce utterances that appear to lack inflections, such as past tense markers or third person singular agreement markers. For example, children may produce utterances such as:

(1a) That go there*
(2a) He walk home*

instead of:

(1b) That goes there
(2b) He walked home

Traditionally, such utterances have been interpreted in terms of absence of knowledge of the appropriate inflections (Brown, 1973) or the dropping of inflections as a result of performance limitations in production (L. Bloom, 1970; P. Bloom, 1990; Pinker, 1984; Valian, 1991). More recently, however, it has been argued that they reflect the child's optional use of (root) infinitives (e.g. go) in contexts where a finite form (e.g. went, goes) is obligatory in the adult language (Wexler, 1994, 1998).

This interpretation reflects the fact that children produce (root) infinitives not only in English, where the infinitive is a zero-marked form, but also in languages such as Dutch, where the infinitive carries its own infinitival marker. For instance, children learning Dutch may produce utterances such as:

(3a) Pappa eten* (Daddy to eat)
(4a) Mamma drinken* (Mummy to drink)

instead of:

(3b) Pappa eet (Daddy eats)
(4b) Mamma drinkt (Mummy drinks)

The optional-infinitive phenomenon is particularly interesting as it occurs in languages that differ considerably in their underlying grammar, and is subject to considerable developmental and cross-linguistic variation. It is also intriguing because children in the optional-infinitive stage typically make few other grammatical errors. For example, they make few errors in their use of the basic word order of their language: English-speaking children may say he go, but not go he.

Technically, the optional-infinitive phenomenon revolves around the notion of finiteness. Finite forms are forms that are marked for Tense and/or Agreement (e.g. went, goes). Non-finite forms are forms that are not marked for Tense or Agreement. This includes the infinitive form (go), the past participle (gone), and the progressive participle (going). In English, finiteness marking increases with development: as they grow older, children produce an increasing proportion of unambiguous finite forms.

4 Description of the model

MOSAIC is a computational model that analyses the distributional characteristics present in the input. It learns to produce increasingly long utterances from naturalistic (child-directed) input, and produces output consisting of actual utterances, which can be directly compared to children's speech. This allows for a direct comparison of the output of the model at different stages with the children's developmental data.

The model learns from text-based input (i.e., it is assumed that the phonological stream has been segmented into words). Utterances are processed in a left-to-right fashion. MOSAIC uses two learning mechanisms, based on discrimination and generalisation, respectively. The first mechanism grows an n-ary discrimination network (Feigenbaum & Simon, 1984; Gobet et al., 2001) consisting of nodes connected by test links. Nodes encode single words or phrases. Test links encode the difference between the contents of consecutive nodes. (Figure 1 illustrates the structure of the type of discrimination net used.) As the model sees more and more input, the number of nodes and links increases, and so does the amount of information held in the nodes, and, as a consequence, the average length of the phrases it can output. The node creation probability (NCP) is computed as follows:

NCP = (N / M)^L

where M is a parameter arbitrarily set to 70,000 in the English and Spanish simulations, N = number of nodes in the net (N ≤ M), and L = length of the phrase being encoded. Node creation probability is thus dependent both on the length of the utterance (longer utterances are less likely to yield learning) and on the amount of knowledge already acquired. In a small net, learning is slow. When the number of nodes in the net increases, the node creation probability increases and, as a result, the learning rate also increases. This is consistent with data showing that children learn new words more easily as they get older (Bates & Carnavale, 1993).

[Figure 1: Illustration of a MOSAIC discrimination net. The Figure also illustrates how an utterance can be generated. Because she and he have a generative link, the model can output the novel utterance she sings. (For simplicity, preceding context is ignored in this Figure.)]

While the first learning mechanism is based on discrimination, the second is based on generalisation. When two nodes share a certain percentage (set to 10% for these simulations) of the nodes (phrases) following and preceding them, a new type of link, a generative link, is created between them (see Figure 1 for an example). Generative links connect words that have occurred in similar contexts in the input, and thus are likely to be of the same word class. As no linguistic constructs are given to the model, the development of approximate linguistic classes, such as those of noun or verb, is an emergent property of the distributional analysis of the input. An important feature of MOSAIC is that the creation and removal of generative links is dynamic. Since new nodes are constantly being created in the network, the percentage overlap between two nodes varies over time; as a consequence, a generative link may drop below the threshold and so be removed.
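To make the node-creation schedule above concrete, the short sketch below computes NCP = (N / M)^L for a few net sizes and phrase lengths. It is only an illustration of the formula as stated in the text (the function name and all example values other than M = 70,000 are our own), not the authors' implementation of MOSAIC.

def node_creation_probability(n_nodes: int, phrase_length: int, m: int = 70_000) -> float:
    """NCP = (N / M) ** L: learning is slow in a small net and for long phrases."""
    assert 0 < n_nodes <= m, "the paper assumes N <= M"
    return (n_nodes / m) ** phrase_length

# Longer phrases are less likely to yield learning; larger nets learn faster.
for n in (1_000, 10_000, 50_000):
    for length in (1, 3, 5):
        print(f"N={n:>6}  L={length}  NCP={node_creation_probability(n, length):.2e}")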

The model generates output by traversing the network and outputting the contents of the visited links. When the model traverses test links only, the utterances it produces must have been present in the input. Where the model traverses generative links during output, novel utterances can be generated. An utterance is generated only if its final word was the final word in the utterance when it was encoded (this is accomplished by the use of an end marker). Thus, the model is biased towards generating utterances from sentence-final position, which is consistent with empirical data from language-learning children (Naigles & Hoff-Ginsberg, 1998; Shady & Gerken, 1999; Wijnen, Kempen, & Gillis, 2001).

5 Modelling the optional-infinitive phenomenon in English

Despite the theoretical interest of the optional-infinitive phenomenon, there is, to our knowledge, no quantitative description of the developmental dynamics of the use of optional infinitives in English, with detail comparable to that provided in other languages, such as Dutch (Wijnen et al., 2001). The following analyses fill this gap.

5.1 Children's data: Methods

We selected the speech of two children (Anne, from 1 year 10 months to 2 years 9 months; and Becky, from 2 years to 2 years 11 months). These data were taken from the Manchester corpus (Theakston, Lieven, Pine, & Rowland, 2001), which is available in the CHILDES database (MacWhinney, 2000). Recordings were made twice every three weeks over a period of one year and lasted for approximately one hour per session.

Given that optional-infinitive phenomena are harder to identify in English than in languages such as Dutch or German (due to the relatively low number of unambiguous finite forms), the analysis focused on the subset of utterances that contain a verb with he, she, it, this (one), or that (one) as its subject. Restricting the analysis in this way avoids utterances such as I go, which could be classified both as non-finite and finite, and therefore makes it possible to more clearly separate non-finites, simple finites, compound finites, and ambiguous utterances.

Identical (automatic) analyses of the data and model were carried out in a way consistent with previous work on Dutch (Wijnen et al., 2001). Utterances that had the copula (i.e., forms of the verb to be) as a main verb were removed. Utterances that contained a non-finite form as the only verb were classified as non-finites. Utterances with an unambiguous finite form (walks, went) were counted as finite, while those containing a finite verb form plus a non-finite form (has gone) were classified as compound finites. The remaining utterances were classified as ambiguous and counted separately; they contained an ambiguous form (such as bought in he bought) as the main verb, which can be classified either as a finite past tense form or as a (non-finite) perfect participle (in the phrase he bought, the word has may have been omitted).

5.2 Children's data: Results

The children's speech was partitioned into three developmental stages, defined by mean length of utterance (MLU). The resulting distributions, portrayed in Figure 2, show that the proportion of non-finites decreases as a function of MLU, while the proportion of compound finites increases. There is also a slight increase in the proportion of simple finites, although this is much less pronounced than the increase in the proportion of compound finites.

5.3 Simulations

The model received as input speech from the children's respective mothers. The size of the input was 33,000 utterances for Anne's model, and 27,000 for Becky's model. Note that, while the analyses are restricted to a subset of the children's corpora, the entire mothers' corpora were used as input during learning. The input was fed through the model several times, and output was generated after every run of the model, until the MLU of the output was comparable to that of the end stage in the two children. The output files were then compared to the children's data on the basis of MLU.

The model shows a steady decline in the proportion of non-finites as a function of MLU, coupled with a steady increase in the proportion of compound finites (Figure 3). On average, the model's production of optional infinitives in third person singular contexts drops from 31.5% to 16%, compared with 47% to 12.5% in children. MOSAIC thus provides a good fit to the developmental pattern in the children's data (not including the ambiguous category: r² = .65, p < .01, RMSD = 0.096 for Anne and her model; r² = .88, p < .001, RMSD = 0.104 for Becky and her model). One obvious discrepancy between the models and the children's output is that both models at MLU 2.1 produce too many simple finite utterances. Further inspection of these utterances reveals that they contain a relatively high proportion of finite modals such as can and will and finite forms of the dummy modal do such as does and did. These forms are unlikely to be used as the only verb in children's early utterances, as their function is to modulate the meaning of the main verb rather than to encode the central relational meaning of the sentence.
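The automatic coding scheme of Section 5.1 can be pictured with a small, purely illustrative script. The tag sets and function below are hypothetical simplifications of ours (the copula filter is omitted), not the classifier actually used in the study; they merely show how utterances with a third-person subject can be routed into the four categories.

SUBJECTS = {"he", "she", "it", "this", "that"}
FINITE = {"goes", "went", "walks", "does", "did", "can", "will"}   # unambiguously finite forms
NONFINITE = {"go", "walk", "gone", "going"}                        # infinitives and participles
AMBIGUOUS = {"walked", "bought"}                                   # past tense or past participle

def classify(utterance: str) -> str:
    words = utterance.lower().rstrip("?.!").split()
    if not words or words[0] not in SUBJECTS:
        return "excluded"                      # only he/she/it/this/that subjects are analysed
    verbs = [w for w in words if w in FINITE | NONFINITE | AMBIGUOUS]
    if any(v in FINITE for v in verbs) and any(v in NONFINITE for v in verbs):
        return "compound finite"               # finite auxiliary/modal plus a non-finite form
    if any(v in FINITE for v in verbs):
        return "simple finite"
    if any(v in NONFINITE for v in verbs):
        return "non-finite"
    if any(v in AMBIGUOUS for v in verbs):
        return "ambiguous"
    return "excluded"                          # no verb found

for u in ("he walk home", "he walked home", "that goes there", "he can go there"):
    print(u, "->", classify(u))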

An important reason why MOSAIC accounts for the data is that it is biased towards producing sentence-final utterances. In English, non-finite utterances can be learned from compound finite questions in which finiteness is marked on the auxiliary rather than the lexical verb. A phrase like He walk home can be learned from Did he walk home?, and a phrase like That go there can be learned from Does that go there? As MLU increases, the relative frequency of non-finite utterances in the output decreases, because the model learns to produce more and more of the compound finite utterances from which these utterances have been learned. MOSAIC therefore predicts that as the proportion of non-finite utterances decreases, there will be a complementary increase in the proportion of compound finites.

[Figure 2 (panels a: Data for Anne; b: Data for Becky; y-axis: Proportion; x-axis: MLU): Distribution of non-finites, simple finites, compound finites, and ambiguous utterances for Anne and Becky as a function of developmental phase. Only utterances with he, she, it, that (one), or this (one) as a subject are included.]

[Figure 3 (panels a: Model for Anne; b: Model for Becky; y-axis: Proportion; x-axis: MLU): Distribution of non-finites, simple finites, compound finites, and ambiguous utterances for the models of Anne and Becky as a function of developmental phase. Only utterances with he, she, it, that (one), or this (one) as a subject are included.]

6 Modelling optional infinitives in Dutch

Children acquiring Dutch seem to use a larger proportion of non-finite verbs in finite contexts (e.g., hij lopen, bal trappen) than children learning English. Thus, in Dutch, a very high percentage of children's early utterances with verbs (about 80%) are optional-infinitive errors. This percentage decreases to around 20% by MLU 3.5 (Wijnen, Kempen & Gillis, 2001).

As in English, optional infinitives in Dutch can be learned from compound finites (auxiliary/modal + infinitive). However, an important difference between English and Dutch is that in Dutch verb position is dependent on finiteness. Thus, in the simple finite utterance Hij drinkt koffie (He drinks coffee) the finite verb form drinkt precedes its object argument koffie, whereas in the compound finite utterance Hij wil koffie drinken (He wants coffee drink), the non-finite verb form drinken is restricted to utterance-final position and is hence preceded by its object argument koffie. Interestingly, children appear to be sensitive to this feature of Dutch from very early in development, and MOSAIC is able to simulate this sensitivity. However, the fact that verb position is dependent on finiteness in Dutch also means that whereas non-finite verb forms are restricted to sentence-final position, finite verb forms tend to occur earlier in the utterance. MOSAIC therefore simulates the very high proportion of optional infinitives in early child Dutch as a function of the interaction between its utterance-final bias and increasing MLU. That is, the high proportion of non-finites early on is explained by the fact that the model mostly produces sentence-final phrases, which, as a result of Dutch grammar, have a large proportion of non-finites.
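To make the utterance-final bias concrete, the sketch below extracts sentence-final word sequences from a few child-directed utterances; this is how a non-finite fragment such as he walk home can be carved out of the compound-finite question Did he walk home?. The function and the fixed three-word window are our own simplification for illustration, not MOSAIC's actual traversal mechanism.

def sentence_final_phrases(utterance: str, max_len: int = 3) -> list[str]:
    """Return the utterance-final word sequences that an end-marker bias favours."""
    words = utterance.rstrip("?.!").lower().split()
    return [" ".join(words[-k:]) for k in range(1, min(max_len, len(words)) + 1)]

for u in ("Did he walk home?", "Does that go there?", "Hij wil koffie drinken."):
    print(u, "->", sentence_final_phrases(u))
# 'Did he walk home?' yields 'home', 'walk home', 'he walk home': the non-finite
# fragment can later be produced on its own as an optional-infinitive utterance.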

As shown in Figure 4, the model's production of optional infinitives drops from 69% to 28%, compared with 77% to 18% in the data of the child on whose input data the model had been trained. In these simulations, the input data consisted of a sample of approximately 13,000 utterances of child-directed speech. Because of the lower input size, the M used in the NCP formula was set to 50,000.

[Figure 4 (panels a: Data for Peter; b: Model for Peter; y-axis: Proportion; x-axis: MLU): Distribution of non-finites, simple finites and compound finites for Peter and his model, as a function of developmental phase.]

7 Modelling optional infinitives in Spanish

Wexler (1994, 1998) argues that the optional-infinitive stage does not occur in pro-drop languages, that is, languages like Spanish in which verbs do not require an overt subject. Whether MOSAIC can simulate the low frequency of optional-infinitive errors in early child Spanish is therefore of considerable theoretical interest, since the ability of Wexler's theory to explain cross-linguistic data is presented as one of its main strengths. Note that simulating the pattern of finiteness marking in early child Spanish is not a trivial task. This is because although optional-infinitive errors are much less common in Spanish than they are in Dutch, compound finites are actually more common in Spanish child-directed speech than they are in Dutch child-directed speech (in the corpora we have used, they make up 36% and 30% of all parents' utterances including verbs, respectively).

[Figure 5 (panels a: Data for Juan; b: Model for Juan; y-axis: Proportion; x-axis: MLU): Distribution of non-finites, simple finites and compound finites for Juan and his model, as a function of developmental phase.]

Figure 5a shows the data for a Spanish child, Juan (Aguado-Orea & Pine, 2002), and Figure 5b the outcome of the simulations run using MOSAIC. The parental corpus used as input consisted of about 27,000 utterances. The model's production of optional infinitives drops from 21% to 13%, compared with 23% to 4% in the child. Both the child and the model show a lower proportion of optional-infinitive errors than in Dutch. The presence of (some rare) optional-infinitive errors in the model's output is explained by the same mechanism as in English and Dutch: a bias towards learning the end of utterances. For example, the input ¿Quieres beber café? (Do you want to drink coffee?) may later lead to the production of beber café. But why does the model produce so few optional-infinitive errors in Spanish when the Spanish input data contain so many compound finites? The answer is that finite verb forms are much more likely to occur in utterance-final position in Spanish than they are in Dutch, which makes them much easier to learn.
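This positional argument can be illustrated with a toy count. The mini-corpora below are invented examples of ours (not data from the study), tagged by hand; the only point is that the proportion of finite verb forms occurring in utterance-final position, which is what the model's end-of-utterance bias is sensitive to, comes out higher for the Spanish-style examples than for the Dutch-style ones.

# Each utterance is a list of (word, tag) pairs: "FIN" = finite verb,
# "INF" = non-finite verb, "-" = anything else. Examples are invented.
dutch = [
    [("hij", "-"), ("drinkt", "FIN"), ("koffie", "-")],
    [("hij", "-"), ("wil", "FIN"), ("koffie", "-"), ("drinken", "INF")],
    [("pappa", "-"), ("gaat", "FIN"), ("eten", "INF")],
]
spanish = [
    [("papa", "-"), ("bebe", "FIN")],
    [("quieres", "FIN"), ("beber", "INF"), ("cafe", "-")],
    [("el", "-"), ("nino", "-"), ("come", "FIN")],
]

def final_rate(corpus, tag):
    """Proportion of verbs with this tag that occur in utterance-final position."""
    total = sum(1 for utt in corpus for _, t in utt if t == tag)
    final = sum(1 for utt in corpus if utt[-1][1] == tag)
    return final / total if total else 0.0

for name, corpus in (("Dutch", dutch), ("Spanish", spanish)):
    print(name, "finite forms utterance-final:", round(final_rate(corpus, "FIN"), 2))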

8 Conclusion

In this paper, we have shown that the same simple model accounts for the data in three languages that differ substantially in their underlying structure. To our knowledge, this is the only model of language acquisition which simultaneously (1) learns from naturalistic input (actual child-directed utterances), where the statistics and frequency distribution of the input are similar to that experienced by children; (2) produces actual utterances, which can be directly compared to those of children; (3) has a developmental component; (4) accounts for speech generativity and increasing MLU; (5) makes quantitative predictions; and (6) has simulated phenomena from more than one language.

An essential feature of our approach is to limit the number of degrees of freedom in the simulations. We have used an identical model for simulating the same class of phenomena in three languages. The method of data analysis was also the same, and, in all cases, the model's and child's output were coded automatically and identically. The use of realistic input was also crucial in that it guaranteed that cross-linguistic differences were reflected in the input.

The simulations showed that simple mechanisms were sufficient for obtaining a good fit to the data in three different languages, in spite of obvious syntactic differences and very different proportions of optional-infinitive errors. The interaction between a sentence-final processing bias and increasing MLU enabled us to capture the reason why English, Dutch and Spanish offer different patterns of optional-infinitive errors: the difference in the relative position of finites and non-finites is larger in Dutch than in English, and Spanish verbs are predominantly finite. We suggest that any model that learns to produce progressively longer utterances from realistic input, and in which learning is biased towards the end of utterances, will simulate these results.

The production of actual utterances (as opposed to abstract output) by the model makes it possible to analyse the output with respect to several (seemingly) unrelated phenomena, so that the nontrivial predictions of the learning mechanisms can be assessed. Thus, the same output can be utilized to study phenomena such as optional-infinitive errors (as in this paper), evidence for verb-islands (Jones et al., 2000), negation errors (Croker et al., 2003), and subject omission (Freudenthal et al., 2002b). It also makes it possible to assess the relative importance of factors such as increasing MLU that are implicitly assumed by many current theorists but not explicitly factored into their models.

An important contribution of Wexler's (1994, 1998) nativist theory of the optional-infinitive stage has been to provide an integrated account of the different patterns of results observed across languages, of the fact that children use both correct finite forms and incorrect (optional) infinitives, and of the scarcity of other types of errors (e.g. verb placement errors). His approach, however, requires a complex theoretical apparatus to explain the data, and does not provide any quantitative predictions. Here, we have shown how a simple model with few mechanisms and no free parameters can account for the same phenomena not only qualitatively, but also quantitatively.

The simplicity of the model inevitably means that some aspects of the data are ignored. Children learning a language have access to a range of sources of information (e.g. phonology, semantics), which the model does not take into consideration. Also, generating output from the model means producing everything the model can output. Clearly, children produce only a subset of what they can say. Furthermore, any rote-learned utterance that the model produces early on in its development will continue to be produced during the later stages. This inability to unlearn is clearly a weakness of the model, but one that we hope to correct in subsequent research.

The results clearly show that the interaction between a simple distributional analyser and the statistical properties of naturalistic child-directed speech can explain a considerable amount of the developmental data, without the need to appeal to innate linguistic knowledge. The fact that such a relatively simple model provides such a good fit to the developmental data in three languages suggests that (1) aspects of children's multi-word speech data such as the optional-infinitive phenomenon do not necessarily require a nativist interpretation, and (2) nativist theories of syntax acquisition need to pay more attention to the role of input statistics and increasing MLU as determinants of the shape of the developmental data.

9 Acknowledgements

This research was supported by the Leverhulme Trust and the Economic and Social Research Council. We thank Javier Aguado-Orea for sharing his corpora on early language acquisition in Spanish children and for discussions on the Spanish simulations.

References

Aguado-Orea, J., & Pine, J. M. (2002). Assessing the productivity of verb morphology in early child Spanish. Paper presented at the IX International Congress for the Study of Child Language, Madison, Wisconsin.

Bates, E., & Carnavale, G. F. (1993). New directions in research on child development. Developmental Review, 13, 436-470.

Bloom, L. (1970). Language development: Form and function in emerging grammars. Cambridge, MA: MIT Press.

Bloom, P. (1990). Subjectless sentences in child language. Linguistic Inquiry, 21, 491-504.

Brown, R. (1973). A first language. Boston, MA: Harvard University Press.

Chomsky, N. (1981). Lectures on government and binding. Dordrecht, NL: Foris.

Christiansen, M. H., & Chater, N. (2001). Connectionist psycholinguistics: Capturing the empirical data. Trends in Cognitive Sciences, 5, 82-88.

Croker, S., Pine, J. M., & Gobet, F. (2001). Modelling children's case-marking errors with MOSAIC. In E. M. Altmann, A. Cleeremans, C. D. Schunn & W. D. Gray (Eds.), Proceedings of the 4th International Conference on Cognitive Modeling (pp. 55-60). Mahwah, NJ: Erlbaum.

Croker, S., Pine, J. M., & Gobet, F. (2003). Modelling children's negation errors using probabilistic learning in MOSAIC. In F. Detje, D. Dörner & H. Schaub (Eds.), Proceedings of the Fifth International Conference on Cognitive Modeling (pp. 69-74). Bamberg: Universitäts-Verlag.

Feigenbaum, E. A., & Simon, H. A. (1984). EPAM-like models of recognition and learning. Cognitive Science, 8, 305-336.

Freudenthal, D., Pine, J. M., & Gobet, F. (2001). Modelling the optional infinitive stage in MOSAIC: A generalisation to Dutch. In E. M. Altmann, A. Cleeremans, C. D. Schunn & W. D. Gray (Eds.), Proceedings of the 4th International Conference on Cognitive Modeling (pp. 79-84). Mahwah, NJ: Erlbaum.

Freudenthal, D., Pine, J. M., & Gobet, F. (2002a). Modelling the development of Dutch optional infinitives in MOSAIC. In W. Gray & C. D. Schunn (Eds.), Proceedings of the 24th Annual Meeting of the Cognitive Science Society (pp. 322-327). Mahwah, NJ: Erlbaum.

Freudenthal, D., Pine, J. M., & Gobet, F. (2002b). Subject omission in children's language: The case for performance limitations in learning. In W. Gray & C. D. Schunn (Eds.), Proceedings of the 24th Annual Meeting of the Cognitive Science Society (pp. 328-333). Mahwah, NJ: Erlbaum.

Freudenthal, D., Pine, J. M., & Gobet, F. (2003). The role of input size and generativity in simulating language acquisition. In F. Schmalhofer, R. M. Young & G. Katz (Eds.), Proceedings of EuroCogSci 03: The European Cognitive Science Conference 2003 (pp. 121-126). Mahwah, NJ: Erlbaum.

Gobet, F., Lane, P. C. R., Croker, S., Cheng, P. C. H., Jones, G., Oliver, I., & Pine, J. M. (2001). Chunking mechanisms in human learning. Trends in Cognitive Sciences, 5, 236-243.

Gobet, F., & Pine, J. M. (1997). Modelling the acquisition of syntactic categories. In Proceedings of the 19th Annual Meeting of the Cognitive Science Society (pp. 265-270). Hillsdale, NJ: Erlbaum.

Jones, G., Gobet, F., & Pine, J. M. (2000). A process model of children's early verb use. In L. R. Gleitman & A. K. Joshi (Eds.), Proceedings of the Twenty Second Annual Meeting of the Cognitive Science Society (pp. 723-728). Mahwah, NJ: Erlbaum.

MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed.). Mahwah, NJ: Erlbaum.

Naigles, L., & Hoff-Ginsberg, E. (1998). Why are some verbs learned before other verbs: Effects of input frequency and structure on children's early verb use. Journal of Child Language, 25, 95-120.

Piattelli-Palmarini, M. (2002). The barest essentials. Nature, 416, 129.

Pinker, S. (1984). Language learnability and language development. Cambridge, MA: Harvard University Press.

Redington, M., & Chater, N. (1997). Probabilistic and distributional approaches to language acquisition. Trends in Cognitive Sciences, 1, 273-279.

Shady, M., & Gerken, L. (1999). Grammatical and caregiver cues in early sentence comprehension. Journal of Child Language, 26, 163-176.

Theakston, A. L., Lieven, E. V. M., Pine, J. M., & Rowland, C. F. (2001). The role of performance limitations in the acquisition of verb-argument structure: An alternative account. Journal of Child Language, 28, 127-152.

Valian, V. (1991). Syntactic subjects in the early speech of American and Italian children. Cognition, 40, 21-81.

Wexler, K. (1994). Optional infinitives, head movement and the economy of derivation in child grammar. In N. Hornstein & D. Lightfoot (Eds.), Verb movement. Cambridge, MA: Cambridge University Press.

Wexler, K. (1998). Very early parameter setting and the unique checking constraint: A new explanation of the optional infinitive stage. Lingua, 106, 23-79.

Wijnen, F., Kempen, M., & Gillis, S. (2001). Root infinitives in Dutch early child language. Journal of Child Language, 28, 629-660.

A Computational Model of Emergent Simple Syntax: Supporting the Natural Transition from the One-Word Stage to the Two-Word Stage.

Kris Jack, Chris Reed, and Annalu Waller
Division of Applied Computing, University of Dundee, Dundee, Scotland, DD1 4HN
[kjack | creed | awaller]@computing.dundee.ac.uk

Abstract

This paper introduces a system that simulates the transition from the one-word stage to the two-word stage in child language production. Two-word descriptions are syntactically generated and compete against one-word descriptions from the outset. Two-word descriptions become dominant as word combinations are repeatedly recognised, forming syntactic categories and resulting in an emergent simple syntax. The system demonstrates a similar maturation as children, as evidenced by phenomena such as overextensions and mismatching, and the use of one-word descriptions being replaced by two-word descriptions over time.

1 Introduction

Studies of first language acquisition in children have documented general stages in linguistic development. Neither the trigger nor the mechanism that takes a child from one stage to the next are known. Stages arise gradually with no precise start or end points, overlapping one another (Ingram, 1989).

The aim of this research is to develop a system that autonomously acquires conceptual representations of individual words (the "one-word stage") and also, simultaneously, is capable of developing representations of valid multi-word structures, i.e. simple syntax (the "two-word stage"). Two-word descriptions are expected to emerge as a result of the system state and not be artificially triggered.

The system accepts sentences containing a maximum of two words. It is designed to be scalable, allowing larger, more natural sentence sizes also. System input is therefore a mixture of both one-word and two-word sentences. The system is required to produce valid descriptions, particularly in the two-word stage. Rules that enforce syntactic order, and allow for the production of semantically correct descriptions from novel concepts, are desirable.

This paper is sectioned as follows: pre-one-word stage linguistic abilities in children are briefly discussed to explain why initial system functionality assumptions are made; the defining characteristics of both the one-word stage and two-word stage in children are introduced as possible benchmarks for the system; a detailed description of system design and implementation, with examples of the learning process and games played by the system, is presented; a discussion of current results along with their possible implications follows; a brief review of related works that have influenced this research is given, citing major influences; the direction and aims of future research are described briefly; and finally, conclusions are drawn.

2 Pre-One-Word Stage Children

Linguistic abilities can be found in children prior to word production. In terms of comprehension, children can distinguish between their mother's voice and a stranger's voice, male and female voices, and sentences spoken in their mother's native language and sentences spoken in a different language. They also show categorical perception to voice, can use formant transition information to mark articulation, and show intonation sensitivity (Pinker, 1994, Jusczyk, 1999).

In terms of production, children produce noises, such as discomfort noises (0-2 months), comfort noises (2-4 months), and "play" vocally with pitch and loudness variations (4-7 months) (Pinker, 1994). The babbling stage (6-8 months) is characterised by the production of recognisable syllables. The syllables are often repeated, such as [mamama] and [papapa], with the easiest-to-produce sounds often being associated with members of the family (Jakobson, 1971).

From this evidence it is reasonable to draw conclusions about linguistic abilities in the young child that can be used to frame assumptions for use in the system. It is assumed that the system can receive and produce strings that can be broken down into their component words. These words can be compared and equalities can be detected.

61 3 One-Word Stage and Two-Word Stages system learning (Steels and Kaplan, 2001). The learning process involves a user selecting an object The system is required to produce one-word in a scene and naming it. The guessing game descriptions in early stages that develop into two- involves a user saying a phrase, and the system word descriptions, where appropriate, in latter pointing to the object that the phrase refers to. The stages.. The recognition of each stage is based on naming game involves a user pointing to an object the number of words that the system uses at a and the system naming it The system is not particular point. In children, the one and two-word physically grounded, so all games are simulated. stages have notable features. The learning process allows the system to The one-word, or holophrastic, stage (9-18 acquire associations between phrases and concepts months), is characterised by one-word while the games test system comprehension and vocalisations that are consistently associated with system production respectively. The learning concepts. These concepts can be either concrete or process takes a string and concept as input, and abstract, such as —mama“, referring to the concrete produces no output. Comprehension takes a string concept of the child‘s mother, and —more“, an as input, and produces a concept as output, abstract concept which can be applied in a variety whereas production takes a concept as input, and of situations (Piaget, 1960). produces a string as output. Two phenomena that occur during this stage are underextensions and overextensions. An 4.2 Strings and Concepts underextension is the formation of a word to A string is a list of characters with a fixed order. concept association that is too narrow, such as A blank space is used to separate words within the —dog“ referring to only the family dog. string, of which there can be either one or two. Overextension, similarly, is an association that is The system can break strings down into their too broad, such as —dog“ referring to all four component words. legged animals. Mismatches, or idiosyncratic A concept is a list of feature values. The referencing also occur, resulting in a word being system recognises six feature values; red, blue, associated with an unrelated concept, such as green, white, circle, and square. There are no in- —dog“ referring to a table (Pinker, 1994). These built associations between any of the feature associations change over time. values. This form of learning is supported by the The two-word stage (18-24 months) introduces imageability theory (Paiviom 1971). No claims simple syntax into the child‘s language faculty. concerning concept acquisition and formation are Children appear to determine the most important made in this paper. All concepts are hard coded words in a sentence and, almost all of the time, use from the outset. them in the same order as an adult would The full list of objects used in the games are (Gleitman and Newport, 1995). Brown (1973) derived from shape and colour combinations; red defines a typology to express semantic relations in square, red circle, blue square, blue circle, green the two-word stage. It contains ten sets of square, green circle, white square, and white relations, but only one will be considered in this circle. Individual feature values can also act as paper; attribute + entity (—red circle“). 
During this concepts, therefore the full list is concepts is the stage, children already demonstrate a three word list of object plus the list of feature values. comprehension level (Tomasello and Kruger, 1992). The concepts relating to their sentences 4.3 Groups may therefore be more detailed than the phrases To associate a string with a concept, the system themselves. stores a list of groups. Each group contains an ID, The system is expected to make the transition one or more description pairs, an observed from the one-word stage to the two-word stage frequency, and zero or more occurrence without changes to the functionality of the system. supporter links. Once the system begins to run, input is restricted to The ID acts as a unique identifier, allowing the that of sensory (concept based) and vocal (string group to be found. A description pair is a string representation) data. and a concept. Groups must have at least one description pair since their primary function is to 4 System Design and Implementation relate a string to a concept. The observed 4.1 Introduction frequency represents the number of times that the description pair‘s components have been The system is designed to learn phrase-to- associated through system input. concept associations and demonstrate it through The occurrence supporter links are a set of group playing games: a guessing game and a naming IDs. Each ID in the set refers to a group that game. Games are often used to test, and encourage

62 contains a superset of either the description pair, or the same value for one component of the ID Description Pair OF OSLs TF description pair and a superset of the other e.g. The #1 [ª red circleº ; red circle] 1 [] 1 description pair [—red“; red] 1 would be supported #2 [ª red squareº ; red 1 [] 1 by the description pair [—red square“; red square]. square] A worked example is provided in the next section. #3 [ª redº; red] 0 [#1,# 2] 2 The links therefore record the number of #4 [ª #3 circleº; #3 circle] 0 [#1] 1 occurrences of the group‘s description pair. The #5 [ª #3 squareº ; #3 0 [#2] 1 occurrence supporter link reinforces the square] description pair‘s association and increases the #6 [ª circleº; circle], 0 [] 0 [ª squareº; square] total frequency of the group. The total frequency #7 [ª #3 #6º ; #3 #6] 0 [#2] 1 is the group‘s observed frequency plus the observed frequency of all of its supporters, never Table 1: Sample data including a supporter more than once. 4.4.2 Find equal and unequal parts Finally, group equality is defined by groups The new group is compared to all of the groups sharing the same description pair. in the system. Comparisons are based on the 4.4 The Learning Process groups‘ description pairs alone. Strings are compared separately from concepts. A string At each stage in the learning process, a match is found if one of the strings is a subset, or description pair is entered into the system. The exact match, of the other. Subsets of strings must system does not attempt to parse the correctness of contain complete words. Words are regarded as the description. All data is considered to be atomic units. Concepts are compared in the same positive. The general learning process algorithm is fashion as strings, where feature values are the detailed in the rest of this section. Specific atomic units. Successful comparisons create a set examples are also provided in Table 1, showing the of equal parts and unequal parts. Comparison groups‘ values; ID, description pair, occurrence results are only used when equal parts exist. This frequency (OF), occurrence supporter links approach is similar to alignment based learning, (OSLs), and total frequency (TF). Five steps are but with the additional component of concepts (van followed to incorporate the new data: Zaanen, 2000). 1. Identify the description pair. In comparing the new group, group #2, to the 2. Find equal and unequal parts. existing group, group #1, the equal part [—red“; 3. Update system based on equal parts.. red] and the unequal part [—circle“; circle], 4. Update system based on unequal parts. [—square“; square] are found. The comparison 5. Re-enter new groups into the system. algorithm is essential to the operation of the 4.4.1 Identify the description pair system. It is used in the learning process and in the If the description pair exists in a group that is games. Without it, no string or concept relations 2 already in the system, then that group‘s observed could be drawn . frequency is incremented. Otherwise, the system 4.4.3 Update system based on equal parts creates a new group containing the new When an equal part is found, a new group is description. It is given a unique ID and an created. In the example, an equal part is found observed frequency of one. Assume that the between group #1 and group #2. Group #3 is system already contains a group based on the created as a result. The new group is given an description pair [—red circle“; red circle]. This has observed frequency of zero. 
The IDs of the groups an ID of one. Assume also that the new that were compared (group #1 and group #2) are description pair entered is [—red square“; red added to the new group‘s (group #3) occurrence square]. Its group has an ID of two (group #2). supporter links. If the group already exists, then as All description pairs entered into the system are well as the existing group‘s observed frequency called concrete description pairs, this is, the being incremented, the IDs of the groups that were system has encountered them directly as input. compared are added to the occurrence supporter The new group is referred to as a concrete group, links. IDs can only appear once in the set of since it contains a concrete description pair. occurrence supporters links, so if an ID is already in it, then it is not added.
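A minimal sketch of the comparison step described in Section 4.4.2, under heavy simplifying assumptions of our own: a description pair is reduced to a (string, concept) tuple, words and feature values are treated as unordered atomic units, and the group bookkeeping (IDs, observed frequencies, occurrence supporter links) is left out. This is an illustration, not the system's actual implementation.

def compare(pair_a, pair_b):
    """Return the equal part and the unequal parts of two description pairs."""
    words_a, feats_a = set(pair_a[0].split()), set(pair_a[1])
    words_b, feats_b = set(pair_b[0].split()), set(pair_b[1])
    if not (words_a & words_b) or not (feats_a & feats_b):
        return None, None            # comparison results are only used when equal parts exist
    equal = (" ".join(sorted(words_a & words_b)), sorted(feats_a & feats_b))
    unequal = [
        (" ".join(sorted(words_a - words_b)), sorted(feats_a - feats_b)),
        (" ".join(sorted(words_b - words_a)), sorted(feats_b - feats_a)),
    ]
    return equal, unequal

equal, unequal = compare(("red circle", ["red", "circle"]),
                         ("red square", ["red", "square"]))
print(equal)    # ('red', ['red'])  -> the basis for a new group such as #3
print(unequal)  # [('circle', ['circle']), ('square', ['square'])]  -> a multi-group such as #6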

1 The convention of strings appearing in quotes (ª redº ), and concepts appearing in italics (red) is 2 The system assumes full compositionality. Idioms adopted throughout this paper. and metaphors are not considered at this stage. 63 Up until this point, all groups‘ description pairs 4.4.5 Re-enter new groups into the system have contained a string and concept. Description All groups that have been created through steps pairs can also contain links to other groups‘ strings 3 and 4 are compared to all other groups in the and groups‘ concepts. These description pairs are system. Results of comparisons are dealt with by referred to as abstract description pairs. If all repeating steps 3-5 with the new results. By use a elements of the abstract description pair are links recursive step like this, all groups are compared to to other groups then it is fully abstract, else it is one another in the system. All group equalities are partially abstract. A group that contains an therefore created when the round is complete. The abstract description pair is called an abstract amount of information available from every new group. The group is fully abstract if its abstract group entered into the system is therefore description pair is fully abstract, else it is a maximised. partially abstract group. Once a group has been created (as group #3 was), based on a description 4.5 The Significance of Groups Types comparison, the system attempts to make two Four different types of group have been abstract groups. identified in the previous section. Although all The new abstract groups (group #4 and group groups share the same properties, they can be seen #5) are based on substitutions of the new group‘s to represent difference aspects of language. It is ID (group #3) into each of the groups that were the combination and interaction of these groups originally compared. Group #4 is therefore created that gives rise to emergent simple syntax. This by substituting group #3 into group #1. Similarly, syntax is bi-gram collocations, but since the system group #5 is created by substituting group #3 into is scalable, it is referred to as simple syntax. group #2. The new abstract groups are given an observed 4.5.1 Concrete Groups frequency of zero (ID‘s equal four and five). Note Concrete groups acquire the meaning of that abstract groups always have an observed individual lexemes (associate concepts with frequency of zero as they can never been directly strings). They are verifiable in the real world observed. The ID of the appropriate group used in through the use of scene based games. comparison and later creation is added to the 4.5.2 Multi-Groups occurrence supporters links. Each abstract group therefore has a total frequency equal to that of the Multi-groups form syntactic categories based on group of which it is an abstract form. similarities between description pair usage. Under the current system, groups can only have a 4.4.4 Update system based on unequal parts maximum of two description pairs. If this were to Unequal parts are only considered if equal parts be expanded, it is clear that large syntactic are found in the comparison. Otherwise, the categories such as noun and verb equivalents unequal parts would be the complete set of data would arise. from both groups, which does not provide useful 4.5.3 Partially and Fully Abstract Groups information for comparisson. For every set of unequal parts that is found, a new group is created. 
Partially and fully abstract groups act as phrasal If there is more than one unequal part then the rules in the system. Abstract values contained group will contain more than one description pair. within the group‘s description pairs can relate to Such a group is referred to as a multi-group. Two both concrete groups and multi-groups. Abstract unequal parts were found earlier in comparing groups that relate to multi-groups offer a choice of group #1 and group #2. They are [—circle“; circle] substitutions. and [—square“; square]. Group #6 is therefore For example, group #7 (Table 1) relates a single created using these two description pairs. group to a multi-group. By substitution of groups The creation of a multi-group allows for a fully #3 and #6 into group #7, the concrete pairings of abstract group to be created. The system uses the [—red circle“; red circle] and [—red square“; red data from the new multi-group (group #6) and the square] are produced. The string data are directly group created through equal parts (group #3). equivalent to: Both groups are substituted back into the group S -> Adj. N, that was originally being compared (group #1). where Adj. = {—red“} The resulting group (group #7) is fully abstract as and N = {—circle“, —square“} both equal parts and unequal parts have been used When a description pair is entered into the to reconstruct the original group (group #1). system, the process of semantic bootstrapping takes place. Lexical items (strings) are associated with their meanings (concepts). When group

64 comparisons are made, syntactic bootstrapping contain any of the abstract forms are found. Each begins. Associations are made between all multi-group‘s description pair becomes a combinations of lexical items throughout the replacement for the appropriate input‘s abstract system, and all combinations of meanings value. throughout the system. The new input, which is still in abstract form, is The system stores lexical item-meaning searched for, using holophrastic matching again. associations, lexical item-lexical item associations Since the groups found are not exact matches of and meaning-meaning associations. This basic the original input, their total frequency is framework allows for the production of complex multiplied by an abstract factor. The abstract phrasal rules. factor is a value between zero and one inclusive. The higher the factor, the greater the effect that 4.6 Comprehension and Production Through abstract groups have on the results. Syntactic Games matches can therefore produce different results The guessing game tests comprehension while based on the value of abstract factor. The abstract the naming game test production. Comprehension factor is not changed from the initiation to takes a string as input, and produces a concept as termination of the system. output, whereas production takes a concept as Groups found during the search are added to a input, and produces a string as output. The new list of possible results. The appropriate comprehension and the production algorithms are elements are substituted into the groups abstract the same, except the first is string based, and the values to make them concrete. If an abstract value second is concept based. is acting as a substitute (by being found originally The algorithm performs two tasks: finding in a multi-group) then the original input value is concrete groups with exact matches to the input, used, not the replacement element. This allows the and finding abstract groups with possible matches abstract group to act as a syntactic rule, but it is to the input. Holophrastic matching uses only penalised by the abstract factor so it does not have concrete groups. Syntactic matching performs as much influence as concrete groups, that have holophrastic matching, followed by further been found to occur through direct input matches using abstract groups. Note that the associations. system only performs syntactic matching, which The groups found throughout the entire syntactic includes holophrastic matching. Holophrastic search are now contained in a second list of matching is never performed alone, unless in possible results. This list is reduced by removing testing stages. duplicate groups. For each group that is removed, For holophrastic matches, the system searches its observed frequency and occurrence supporter through its list of groups. Their description pairs links are added to the duplicate that is kept in the are compared to the input being searched for. list. There is therefore re-use of the comparison The two lists from each matching routine are algorithm introduced in the learning process. merged and sorted by total frequency. The When a match is found, the group is added to a list string\concept of the group with the highest total of possible results. frequency is outputted by the system. If holophrastic matching is being performed alone, then this list of possible results is sorted by 5 Testing and Results total frequency. 
The group with the highest total The system is tested within the following areas: frequency is output by the system. 1. Comprehension and production of all Syntactic matching begins by performing fourteen concepts. The rate at which full holophrastic matching, but does not output a result comprehension and full production are until all abstract groups have been matched too. It achieved is compared. is therefore an extension of holophrastic matching. 2. Correctness of production matches for Once a first fun of holophrastic matching is compound concepts. The correctness of performed, the input is converted into abstract production matches are studied over a form. This is performed at the word/feature value number of rounds. level. The most likely element is found by 3. Type of production matches for compound searching through the groups, comparing it to the concepts. The type of production matches description pair, and selecting the group with the favoured, holophrastic or syntactic, are highest total frequency from those found. compared over a number of rounds The group IDs replace the appropriate element in A match of concept to word or word to concept the input (just as substitutions were made during is considered correct if the string describes the the learning process). All multi-groups that concept fully. For example, [—red“; red] and [—red

65 square“; red square] are correct, but [—red“; red when a syntactic match is run, the matching square] and [—red square“; red] are incorrect. One algorithm has been split. The number of correct point is given for each correct match, zero for each strings produced using holophrastic data and the incorrect match. number of correct strings produced using syntactic Note that all test results are based on the average data alone are compared (see Figure 2). of ten different system trials. Each result shows a The data demonstrate that the system uses broad tendency that will likely be smoothed if mostly holophrastic matches in early rounds more trials are run. All input is randomly (comparable to the one-word stage). This is generated. The abstract factor is set to 0.4 for all eliminated in further rounds, in favour or syntactic tests. matches alone (the two-word stage). Note that although the holophrastic stage may appear to be 5.1 Comprehension Vs. Production producing two-words, these words are considered Full comprehension occurs much sooner (see to be one-word. For example, —allgone“ is Figure 1), on average, than full production. This considered to be one-word in early stages of result is found in children also. Although linguistic development, as opposed to —all gone“ production and comprehension compete quite (Ingram, 1989). steadily in early stages of the system, comprehension reaches its maximum, on average, 6 in 20% of the time that production takes to reach 5 its maximum. 4

Figure 1: Shows number of correct comprehension and production matches.

Full comprehension (fourteen points) is achieved, on average, by round 50, while full production comes at round 250. Both holophrastic data and syntactic data contribute to the successes. Underextensions are found during comprehension. For example, in early rounds, "green" is used to describe only green squares. This phenomenon is quickly eliminated in the trials, but with a larger set of concepts and vocabulary it is likely to persist for more than a few rounds.

5.2 Correctness of Holophrastic Vs Syntactic Matches

At the end of each round, production is tested using the eight compound concepts alone. These are based on the eight observable objects in the simulated scene. Only compound concepts can demonstrate simple syntax in this system, as singular concepts have associations to single-word strings.

The system uses syntactic matching alone, but syntactic matching includes holophrastic matching, as discussed earlier. To determine whether holophrastic data or syntactic data is being used, the number of correct matches of each kind is recorded (Figure 2).

Figure 2: Shows number of correct holophrastic and syntactic matches.

The syntactic data continues to rise, until it achieves full production. The holophrastic stage never achieves full production, but peaks, then reduces to zero. This trend occurs as holophrastic underextensions, such as "red" representing red square, become more likely than "red square" representing red square.

Early syntactic matches are based on novel string productions for novel string concepts. Holophrastic matching is incapable of producing novel strings from novel concepts, as it deals with concrete concepts. Abstract concepts, however, allow new string combinations to be produced, such as "blue square" from blue square, even though neither the string nor the concept has been encountered before. Such an abstraction may come from a multi-group that associates "blue" with "red", while containing a group that contains "red square" also. The novel string "blue square" is therefore abstracted.

5.3 Use of Holophrastic Vs Syntactic Matches

The system does not always produce the correct strings when a concept is entered. The strings that are produced are a result of either holophrastic or syntactic matching. Regardless of correctness, the number of times that holophrastic matches are made over syntactic matches can be compared (see Figure 3).

Figure 3: Shows distribution of holophrastic and syntactic matches.

The system relies completely on one-word descriptions at the outset, but soon syntactically derived two-word descriptions become prevalent. It is likely that the one-word stage will last longer if larger concept and vocabulary sets are in use. The system shows the same form of transition as can be seen in children from the one-word stage to the two-word stage, without the use of an artificial trigger. The shift is gradual, although the use of larger concept and vocabulary sets, plus different abstract factor values, will affect the transition. The greater the number of words in multi-groups (the greater the size of syntactic categories), the lower the abstract factor required to encourage the emergence of simple syntax.

6 Related Works

Supporters of computational modelling in language acquisition often promote the practical importance of running simulations, where evolutionary effects can be recreated in short time periods (Zuidema, 2001).

Although this paper is focussed on an individual system, or agent, acquiring language, it has been influenced by research into social learning (Oliphant and Batali, 1997; Kirby, 1999; Steels and Kaplan, 2002). Social learning demonstrates the convergence upon a common language, or set of languages, from an uncoordinated proto-language, within a population of agents. Social learning allows for the playing of games between agents, similar to those in this paper, with the results being used as further system input, to support or deny associations. This research can be viewed as a form of social learning with one agent (the string and concept generator) performing the teacher role, and the other agent (the system) performing the learner role.

Simulations of both the babbling stage and the one-word stage have been developed (Scheler, 1997; Abidi, 1999). ACCLAIM, a one-word stage simulator, demonstrates that systems can react appropriately to changes in situations. For example, when a cessation event is triggered, it produces "Stop", and when an object is requested, it produces "More". Both examples are typical of children during the one-word stage (Bloom, 1973).

Several systems exist that use perceptions to encourage language acquisition (Howell, Becker, and Jankowicz, 2001; Roy, 2001). ELBA learns both nouns and verbs from video scenes, starting with a blank lexicon. Such systems have helped in the selection of both appropriate input sources and feature values to use in this research. This system will also be physically grounded in future.

The research presented in this paper describes a system that drives linguistic development. Other systems have used similar techniques, based on syntactic and semantic bootstrapping (Howell and Becker, 2001), but have not explained how multiple-word acquisition is achieved from a single-word basis.

Steels (1998) introduces frames that group lexical elements together by the roles that they play, very similar to groups in this paper. Frames are more dynamic than groups, however, structurally adapting when words reoccur. Groups do not adapt in this way; new groups are created to describe similarities rather than adapting existing ones. Steels also introduces multiple-word sentences, but it is unclear why agents invent a multiple-word description over creating a new single-word description. The invention is triggered and does not emerge. This research is based on real multiple-word inputs, so the reason for invention is not necessary, unlike the reason for adoption, i.e. why the system adopts two-word descriptions.

The comparison algorithm, as previously noted, is similar to alignment based learning (van Zaanen, 2000). The system in this research performs perfect alignment, requiring exact word matches when finding equal parts and unequal parts. This system also uses concepts, reducing the number of incorrect groupings, or constituents, when there is ambiguity in text. Unsupervised grammar induction can also be found in EMILE (van Zaanen and Adriaans, 2001). EMILE identifies substitution classes by means of clustering. These classes are comparable to this system's groups, although no concepts are used.

7 Future Research

As the system stands, it uses a small input set. Further developments are focussed on expanding the system. All ten of Brown's relations should be implemented. Larger concept and vocabulary sets are therefore required. Extensions to these sets are likely to affect underextensions, mismatches, the length of pre-syntactic usage time, and the overall growth pattern of simple syntax.
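The comparison algorithm discussed in the related work above performs perfect alignment, requiring exact word matches when splitting paired strings into equal and unequal parts. The following is a minimal sketch of that splitting step only; it is an illustration of exact-match alignment in general, not the system's own implementation, and the function name is invented here.

```python
def align_exact(sentence_a, sentence_b):
    """Split two word sequences into shared (equal) and differing (unequal) parts,
    using exact word matches only, in the spirit of alignment based learning.

    Illustrative sketch: the shared parts are taken to be the longest common
    prefix and suffix; the remainders are the unequal parts.
    """
    a, b = sentence_a.split(), sentence_b.split()
    i = 0                                    # longest common prefix
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    j = 0                                    # longest common suffix after the prefix
    while j < min(len(a), len(b)) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    equal_prefix = a[:i]
    equal_suffix = a[len(a) - j:] if j else []
    return equal_prefix, equal_suffix, a[i:len(a) - j], b[i:len(b) - j]


if __name__ == "__main__":
    # The differing slots ("red" vs "blue") are candidates for one group;
    # the shared context ("the", "square") is what licenses the alignment.
    print(align_exact("the red square", "the blue square"))
    # -> (['the'], ['square'], ['red'], ['blue'])
```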

8 Conclusion

This paper offers a potential explanation of the mechanism by which the two-word stage emerges from the one-word stage. It suggests that syntactic data is sought out from the beginning of language acquisition. This syntactic data is always competing with the associations of holophrastic data. Syntax is strengthened when patterns are consistently found between strings and concepts, and is used in favour of holophrastic data when it is sufficiently frequent. The simple syntax continues to grow in strength, ultimately being used in favour of holophrastic data in all production and comprehension tasks.

This system provides the foundation for more complex, hierarchical syntax to emerge. The type and volume of input is the only constraint upon the system. The entry into post two-word stages is predicted from the system's robust architecture.

9 Acknowledgements

The first author is sponsored by a studentship from the EPSRC. Thanks to the workshop reviewers for their helpful and much appreciated advice.

References

S. Abidi, 1996. A Neural Network Simulation of Child Language Development at the One-word Stage. In Proceedings of the IASTED Int. Conf. on Modelling, Simulation and Optimization, Gold Coast, Australia.

L. Bloom, 1973. One Word at a Time: The Use of Single-word Utterances Before Syntax. The Hague: Mouton.

R.W. Brown, 1986. Language and categories. In "A Study of Thinking", ed. J.S. Bruner, J.J. Goodnow, and G.A. Austin, pages 247-312. New York: John Wiley, 1956. Reprint, New Brunswick: Transaction.

L.R. Gleitman and E.L. Newport, 1995. The Invention of Language by Children: Environmental and Biological Influences on the Acquisition of Language. In "An Invitation to Cognitive Science", L.R. Gleitman and M. Liberman, eds., 2nd ed., Vol. 1. Cambridge, Mass., London: MIT Press.

S.R. Howell and S. Becker, 2001. Modelling language acquisition: Grammar from the Lexicon? In Proceedings of the Cognitive Science Society.

S.R. Howell, S. Becker, and D. Jankowicz, 2001. Modelling Language Acquisition: Lexical Grounding Through Perceptual Features. In Proceedings of the 2001 Workshop on Developmental Embodied Cognition.

J.R. Hurford, M. Studdert-Kennedy, and C. Knight, 1998. The Emergence of Syntax. In "Approaches to the Evolution of Language: Social and Cognitive Bases". Cambridge: Cambridge University Press.

D. Ingram, 1989. First Language Acquisition: Method, Description and Explanation. Cambridge: Cambridge University Press.

R. Jakobson, 1971. Why "mama" and "papa"? In "Child Language: A Book of Readings", A. Bar-Adon and W.F. Leopold, eds., pages 213-217. Englewood Cliffs, NJ: Prentice-Hall.

P.W. Jusczyk, 1999. How infants begin to extract words from speech. Trends in Cognitive Science, 3(9):323-328.

S. Kirby, 1999. Syntax out of learning: The cultural evolution of structured communication in a population of induction algorithms. In Proceedings of ECAL99 European Conference on Artificial Life, D. Floreano et al., eds., pages 694-703. Berlin: Springer-Verlag.

M. Oliphant and J. Batali, 1997. Learning and the emergence of coordinated communication. Centre for Research in Language Newsletter, 11(1).

A. Paivio, 1971. Imagery and Verbal Processes. New York: Holt, Rinehart & Winston.

J. Piaget, 1960. The Language and Thought of the Child. 3rd ed. Routledge and K. Paul, Routledge Paperbacks.

S. Pinker, 1994. The Language Instinct: The New Science of Language and Mind. Allen Lane, Penguin Press.

D. Roy, 2001. Grounded spoken language acquisition: Experiments in word learning. IEEE Transactions on Multimedia.

G. Scheler, 1997. The transition from babbling to the one-word stage: A computational model. In Proceedings of GALA '97.

L. Steels, 1998. The origins of syntax in visually grounded robotic agents. Artificial Intelligence, 103:1-24.

L. Steels and F. Kaplan, 2001. AIBO's first words: The social learning of language and meaning. Evolution of Communication, 4(1):3-32. Amsterdam: John Benjamins Publishing Company.

M. Tomasello and A.C. Kruger, 1992. Joint attention in action: Acquiring verbs in ostensive and non-ostensive contexts. Journal of Child Language, 19:311-333.

M. van Zaanen, 2000. Learning structure using alignment based learning. In Proceedings of the Third Annual Doctoral Research Colloquium (CLUK), pages 75-82.

M. van Zaanen and P. Adriaans, 2001. Alignment-based learning versus EMILE: A comparison. In Proceedings of the Belgian-Dutch Conference on AI (BNAIC).

W.H. Zuidema, 2001. Emergent syntax: the unremitting value of computational modelling for understanding the origins of complex language. In ECAL01, pages 641-644. Prague: Springer.

On a Possible Role for Pronouns in the Acquisition of Verbs

Aarre Laakso and Linda Smith
Department of Psychology, 1101 E. 10th Street, Bloomington, IN 47405

Abstract

Given the restrictions on the subjects and objects that any given verb may take, it seems likely that children might learn verbs partly by exploiting statistical regularities in co-occurrences between verbs and noun phrases. Pronouns are the most common NPs in the speech that children hear. We demonstrate that pronouns systematically partition several important classes of verbs, and that a simple statistical learner can exploit these regularities to narrow the range of possible verbs that are consistent with an incomplete utterance. Taken together, these results suggest that children could use regularities in pronoun-verb co-occurrences to help learn verbs, though whether this is actually so remains a topic for further research.

1 Introduction

Pronouns stand for central elements of adult conceptual schemes; as Quine pointed out, pronouns "are the basic media of reference" (Quine, 1980, p. 13). In fact, most syntactic subjects in spontaneous spoken adult discourse are pronouns (Chafe, 1994), and English-speaking mothers often begin with a high-frequency pronoun when speaking to their children, with you and I occurring most frequently (e.g., Valian, 1991). Parents use the inanimate pronoun it far more frequently as the subject of an intransitive sentence than of a transitive one (Cameron-Faulkner et al., 2003, p. 860). As Cameron-Faulkner et al. note, this suggests that intransitive sentences are used more often than transitives for talking about inanimate objects. It also suggests, we would note, that the use of the inanimate pronoun might be a cue for the child as to whether the verb is transitive or intransitive. Similarly, Lieven and Pine (Lieven et al., 1997; Pine and … to speak, "pronoun islands", something like Tomasello's (1992) "verb islands."

Many researchers have suggested that word-word relations in general, and syntactic frames specifically, are particularly important for learning verbs (e.g., Gleitman, 1990; Gleitman and Gillette, 1995). What has not been studied, to our knowledge, is how pronouns specifically may help children learn verbs by virtue of systematic co-occurrences. We have begun to address this issue in two steps. First, we measured the statistical regularities among the uses of pronouns and verbs in a large corpus of parent and child speech. We found strong regularities in the use of pronouns with several broad classes of verbs. Second, using the corpus data, we trained a connectionist network to guess which verb belongs in a sentence given only the subject and object, demonstrating that it is possible in principle for a statistical learner to use the regularities in parental speech to deduce information about an unknown verb.

2 Experiment 1

The first experiment consisted of a corpus analysis to identify patterns of co-occurrence between pronouns and verbs in the child's input.

2.1 Method

Parental utterances from the CHILDES database (MacWhinney, 2000) were coded for syntactic categories, then subjected to cluster analysis. The mean age of target children represented in the transcripts that were coded for this experiment was 3;0 (SD 1;2).

2.1.1 Materials

The following corpora were used: Bates, Bliss, Bloom 1970, Brown, Clark, Cornell, Demetras Working, Gleason, Hall, Higginson, Kuczaj, MacWhinney, Morisset,

69 MRTYX" ERH WXSVIH MX MR E =]A@< HEXEFEWI$ BLI RYQFIV SJ ZEVMEFPIW" XLI RYQFIV SJ PIZIPW SJ IEGL ETTPMGEXMSR SGGEWMSREPP] EWWMKRIH XLI WEQI ZEVMEFPI ( WTIEOIVW" ( EHHVIWWIIW" , JVEQIW" ERH + XVERWGVMTX XS EPP GSHIVW" MR SVHIV XS QIEWYVI W]RXEGXMG VIPEXMSRW!" ERH XLI RYQFIV SJ GSHIVW *!" VIPMEFMPMX]$ 7MZI YRHIVKVEHYEXI GSHIVW [IVI XVEMRIH XLI TVSFEFMPMX] SJ GLERGI EKVIIQIRX MW ZIV] PS[$ SR XLI GSHMRK XEWO ERH XLI YWI SJ XLI W]WXIQ$ 2PXLSYKL XLIVI EVI WSQI WYFWXERXMZI IVVSVW .1. !('#%$+(% YWYEPP] [MXL GSQTPI\ IQFIHHIH GPEYWIW SV SXLIV YRYWYEP GSRWXVYGXMSRW!" QER] SJ XLI HMWGVITERGMIW 6EGL GSHIV [EW TVIWIRXIH" MR WIUYIRGI" [MXL EVI WMQTPI WTIPPMRK QMWXEOIW SV JEMPYVIW XS XVMQ IEGL QEMR XMIV PMRI SJ IEGL XVERWGVMTX WLI [EW [SVHW XS XLIMV VSSXW$ EWWMKRIH" XSKIXLIV [MXL WIZIVEP PMRIW SJ GSRXI\X0 CI SRP] GSRWMHIVIH TEVIRXEP GLMPH#HMVIGXIH XLI IRXMVI XVERWGVMTX [EW EPWS EZEMPEFPI F] GPMGOMRK WTIIGL ?45A!" HIJMRIH EW YXXIVERGIW [LIVI XLI E PMRO SR XLI GSHMRK TEKI$ 7SV IEGL PMRI" WLI WTIEOIV [EW E TEVIRX ERH XLI EHHVIWWII [EW E MRHMGEXIH E! [LIXLIV XLI WTIEOIV [EW E TEVIRX" XEVKIX GLMPH$ 2 XSXEP SJ ')"'-+ ?45A YXXIVERGIW XEVKIX GLMPH" SV SXLIV0 F! [LIXLIV XLI EHHVIWWII [IVI GSHIH" MRGPYHMRK E XSXEP SJ '-",(( GPEYWIW$ [EW E TEVIRX" XEVKIX GLMPH" SV SXLIV0 G! XLI =SVI XLER E UYEVXIV '-$(+%! SJ XLI ?45A W]RXEGXMG JVEQIW SJ YT XS ( GPEYWIW MR XLI GPEYWIW GSRXEMRIH RS ZIVF EX EPP0 XLIWI [IVI YXXIVERGI0 H! JSV IEGL GPEYWI" YT XS ( WYFNIGXW" I\GPYHIH JVSQ JYVXLIV EREP]WMW$ 4PEYWIW XLEX [IVI EY\MPMEVMIW" ZIVFW" HMVIGX SFNIGXW" MRHMVIGX SFNIGXW UYIWXMSRW &+$-+%!" TEWWMZIW %$%'%!" ERH ERH SFPMUYIW$ 3IGEYWI QER] YXXIVERGIW [IVI GSTYPEW &&$-+%! [IVI EPWS I\GPYHIH JVSQ JYVXLIV QYPXM#GPEYWEP" XLI YRMX SJ EREP]WMW JSV EWWIWWMRK EREP]WMW$ BLI EREP]WMW [EW GSRHYGXIH YWMRK SRP] TVSRSYR#ZIVF GS#SGGYVVIRGIW [EW XLI GPEYWI GPEYWIW XLEX [IVI MRXVERWMXMZIW &,$')% SJ XSXEP VEXLIV XLER XLI YXXIVERGI$ ?45A GPEYWIW!" XVERWMXMZIW ')$(+%! SV BLI W]RXEGXMG JVEQIW [IVI/ RS ZIVF" UYIWXMSR" HMXVERWMXMZIW &$)-%!" E XSXEP SJ &'"(,, GPEYWIW$ TEWWMZI" GSTYPE" MRXVERWMXMZI" XVERWMXMZI ERH . "%)+&*) HMXVERWMXMZI$ BLIWI [IVI GSRWMHIVIH XS FI QYXYEPP] I\GPYWMZI" M$I$" IEGL GPEYWI [EW XEKKIH EW BLI QSWX JVIUYIRX RSYRW MR XLI GSVTYW_FSXL FIPSRKMRK XS SRI ERH SRP] SRI JVEQI" EGGSVHMRK XS WYFNIGXW ERH SFNIGXW_EVI TVSRSYRW" EW WLS[R MR [LMGL SJ XLI JSPPS[MRK JVEQIW MX QEXGLIH JMVWX/ &! 7MKYVIW & ERH '$ BLI SFNIGXW HMZMHIH XLI QSWX BLI () 0%," JVEQI MRGPYHIH GPEYWIW ^ WYGL EW GSQQSR ZIVFW MRXS XLVII QEMR GPEWWIW/ ZIVFW XLEX `DIWa SV `>;a ^ [MXL RS QEMR ZIVF$ '! BLI XEOI XLI TVSRSYR &. ERH GSRGVIXI RSYRW EW SFNIGXW" +/%-.&)( JVEQI MRGPYHIH ER] GPEYWI YWMRK E ZIVFW XLEX XEOI GSQTPIQIRX GPEYWIW" ERH ZIVFW UYIWXMSR [SVH ^ WYGL EW `CLIVI HMH ]SY KS1a ^ SV XLEX XEOI WTIGMJMG GSRGVIXI RSYRW EW SFNIGXW$ BLI LEZMRK MRZIVXIH [SVH SVHIV ^ WYGL EW `5MH ]SY KS WYFNIGXW HMZMHIH XLI QSWX GSQQSR ZIVFW MRXS JSYV XS XLI FERO1a ^ FYX RSX QIVIP] E UYIWXMSR QEVO ^ QEMR GPEWWIW/ ZIVFW [LSWI WYFNIGX MW EPQSWX WYGL EW `DSY [IRX XS XLI FERO1a (! BLI *!--&0% EP[E]W " ZIVFW [LSWI WYFNIGX MW EPQSWX EP[E]W JVEQI MRGPYHIH GPEYWIW MR XLI TEWWMZI ZSMGI" WYGL 1)/" ZIVFW XLEX XEOI SV 1)/ EPQSWX IUYEPP] EW EW `:SLR [EW LMX F] XLI FEPP$a )! BLI #)*/'! WYFNIGX" ERH SXLIV ZIVFW$ BLI ZIVFW HMZMHIH XLI JVEQI MRGPYHIH GPEYWIW [MXL XLI GSTYPE EW XLI QSWX GSQQSR SFNIGX RSYRW MRXS E RYQFIV SJ QEMR ZIVF" WYGL EW `:SLR MW ERKV]$a *! 
BLI GPEWWIW" MRGPYHMRK SFNIGXW SJ XIPPMRK ERH PSSOMRK &(.,!(-&.&0% JVEQI MRGPYHIH GPEYWIW [MXL RS HMVIGX ZIVFW" SFNIGXW SJ LEZMRK ERH [ERXMRK ZIVFW" ERH SFNIGX" WYGL EW `:SLR VER$a BLI .,!(-&.&0% JVEQI SFNIGXW SJ TYXXMRK ERH KIXXMRK ZIVFW$ BLI ZIVFW MRGPYHIH GPEYWIW [MXL E HMVIGX SFNIGX FYX RS EPWS HMZMHIH XLI QSWX GSQQSR WYFNIGX RSYRW MRXS MRHMVIGX SFNIGX" WYGL EW `:SLR LMX XLI FEPP$a +! BLI E RYQFIV SJ GPEWWIW" MRGPYHMRK WYFNIGXW SJ LEZMRK $&.,!(-&.&0% JVEQI MRGPYHIH GPEYWIW [MXL ER ERH [ERXMRK ZIVFW" ERH WYFNIGXW SJ XLMROMRK ERH MRHMVIGX SFNIGX" WYGL EW `:SLR KEZI =EV] E OMWW$a ORS[MRK ZIVFW$ 2PP RSYRW [IVI GSHIH MR XLIMV WMRKYPEV JSVQW" [LIXLIV XLI] [IVI WMRKYPEV SV TPYVEP I$K$" `FS]Wa !00 [EW GSHIH EW `FS]a!" ERH EPP ZIVFW [IVI GSHIH MR 000 XLIMV MRJMRMXMZI JSVQW" [LEXIZIV XIRWI XLI] [IVI MR 1!00

I$K$" `VERa [EW GSHIH EW `VYRa!$ 1000

9R XSXEP" *.".,, YXXIVERGIW [IVI GSHIH JVSQ &'( !00 XVERWGVMTXW$ A'' SJ XLI GSHIVW GSHIH , SJ XLSWI 0 . " - ' & + * + - # XVERWGVMTXW JSV XLI TYVTSWI SJ QIEWYVMRK + & & & ) % % & ) , % % $ . + $ ( + VIPMEFMPMX]$ 2ZIVEKI MRXIV#GSHIV VIPMEFMPMX] ( QIEWYVIH JSV IEGL GSHIV EW XLI TIVGIRXEKI SJ . MXIQW GSHIH I\EGXP] XLI WEQI [E] XLI] [IVI 7MKYVI &/ BLI &% QSWX JVIUYIRX WYFNIGXW MR ?45A GSHIH F] IEGL SXLIV GSHIV! [EW -+$&%$ 8MZIR XLI F] XLIMV RYQFIV SJ SGGYVVIRGIW

70 ;KP=H EP !#" EP ! " !# PQNJ +, )) +.$/ !" PDNKS ), (& ++$+ ! LQOD (+ ') +($& % DKH@ *( '/ *+$( $ >NA=G ), ', **$* # " HA=RA (- '( **$* KLAJ ), '+ *'$-

(' *0 0 3 0 . 2 0 , ) @K (+, '&+ *'$& ) . ) - ) * + ) ( , & & 1 ( ( */ 0 , & 1 0 SA=N (+ '& *&$& / ( ) P=GA KBB (* / )-$+ LQP (-, /) ))$- CAP )*. -* ('$) 6ECQNA (0 ;DA '& IKOP BNAMQAJP K>FA?PO EJ 945: P=GA '&, (( (&$. >U PDAEN JQI>AN KB K??QNNAJ?AO$ LQP KJ *( . '/$& !.!. '+4)5 6,(6 6(.+ i! (5 (1 2)-+*6 >QU +& / '.$& CERA .+ '* ',$+ ;DA RAN>O PD=P P=GA '0 =O PDAEN IKOP ?KIIKJ D=RA )*& (, -$, K>FA?P EJ?HQ@A RAN>O KB IKPEKJ =J@ PN=JOBAN# =O ;=>HA '0 O IKOP ?KIIKJHU QOA@ SEPD ODKSJ EJ ;=>HA '$ K>FA?P '0. !.!.! '+4)5 6,(6 6(.+ *203/+0+16 */(75+5 8KOP RAN>O PD=P @E@ JKP P=GA '0 =O PDAEN IKOP ;KP=H 2?H=QOA3 2?H=QOA3 !#" ! " ?KIIKJ K>FA?P EJOPA=@ PKKG ?KILHAIAJP ?H=QOAO$ PDEJG '.- '-/ /+$- ;DAOA =NA LNEI=NEHU LOU?DKHKCE?=H RAN>O# =O ODKSJ NAIAI>AN )' () -*$( EJ ;=>HA ($ HAP -. +- -)$' !.!." '+4)5 6,(6 6(.+ *21*4+6+ 12715 (5 2)-+*65 GJKS (&- '*' ,.$' =OG (/ '- +.$, 8KOP NAI=EJEJC RAN>O EJ PDA ?KNLQO PKKG QJEMQA CK ++ )( +.$( OAPO KB K>FA?PO$ 6KN AT=ILHA# PDA IKOP ?KIIKJ S=JP )'- '.) +-$- K>FA?P QOA@ SEPD .$ # S=O !,,(# BKHHKSA@ >U '0 IA=J (+ '* +,$& =J@ /0,.31 PDA IKOP ?KIIKJ K>FA?P QOA@ SEPD -) 3 PAHH ''+ *+ )/$' S=O % *$# BKHHKSA@ >U '0# !),"(# =J@ &,1/$$ PNU +' '. )+$) O=U '-+ +) )&$) !.!.# '+4)5 6,(6 6(.+ I (5 ( 57)-+*6 HKKG *. '* (/$( O SDKOA IKOP ?KIIKJ OQ>FA?P EO I EJ?HQ@A JAA@ ,* '. (.$' !$0 !() KQP KB () QOAO SEPD = OQ>FA?P# KN '&& "# OAA (,, -) (-$* %1$// !('%((# /+$* "# 0&'+( !('(%(,)# .&$, "# =J@ HEGA '() )( (,$& /$$ !/+%(&-# *+$/ "$ 9=NAJPO SANA JKP @EO?QOOEJC ODKS ), / (+$& PDAEN C=I>HEJC D=>EPO SEPD PDAEN ?DEH@NAJ V !$0 S=O I=GA '++ () '*$. >AEJC QOA@ PK EJ@E?=PA PDA ALEOPAIE? OP=PQO KB = $&'.* " %*3'4 /145 (1//10.9 64*) 8,5+ OQ>OAMQAJP ?H=QOA# =O SANA PDA KPDAN RAN>O$ (1/2.*/*05 (.&64*4. !.!.$ '+4)5 6,(6 6(.+ # " (5 ( 57)-+*6 ;KP=H 7 !#" 7 ! " UKQ UKQ O SDKOA IKOP ?KIIKJ OQ>FA?P EO 3,1 !#" ! " EJ?HQ@A )'($ !., KQP KB EPO ')* PKP=H QOAO SEPD = >AP () () '&& & & OQ>FA?P# KN ,*$( "# 2 +0 !'/(%(-&# -'$' "# =J@ CQAOO (( (' /+$* & & +$$# !))%,+# +&$. "$ ;DAOA RAN>O =NA >AEJC QOA@ PDEJG (,) ('( .&$, ). '*$* PK EJ@E?=PA PDA @AKJPE? OP=PQO KB = OQ>OAMQAJP OAA (&- /+ *+$/ +& (*$' ?H=QOA# EJ?HQ@EJC @EOLKOEPEKJ KN EJ?HEJ=PEKJ# IA=J )( '+ *,$/ '( )-$+ RKHEPEKJ# =J@ ?KILQHOEKJ$ GJKS ),& '+/ **$( './ +($+ NAIAI>AN () / )/$' '( +($( !.!.% '+4)5 6,(6 6(.+ 827 24 & (5 ( 57)-+*6 HEGA ')* (& '*$/ ., ,*$( O PD=P P=GA I =J@ 3,1 IKNA KN HAOO AMQ=HHU S=JP (-& )* '($, '/( -'$' =O OQ>FA?P EJ?HQ@A *$ + !'+ KQP KB )( QOAO# KN JAA@ ,+ + -$- )) +&$. *,$/ # SEPD I =J@ '( KB )( QOAO# KN )-$+ # SEPD $&'.* !" #1/* 7*3'4 (1//10.9 64*) 8,5+ 3,1"# (+,2 !I0 '+/%),&# **$( 1 3,10 './%),&# 46'-*(5 I 13 !o . +($+ "# =J@ .$*$*!$. !I0 /%()# )/$' 1 3,10 '(%()# +($( "$

71 ".".$ ,/620>= :3 ) %% .91 %''$ a) G:A6I>DCH 7:IL::C EGDCDJCH 6C9 K:G7H# -DL:K:G! 2=: D7?:8IH .&! 53! D"%%9 6C9 !0..9 ;DGB:9 I=:N 9D CDI H=DL I=6I I=:H: G:I>:H 86C 7: 6 8AJHI:G >C K:G7 HE68:! 6EE:6G>C< ;G:FJ:CIAN L>I= JH:9 ID 8J: I=: 6HHD8>6I:9 LDG9H# I=: K:G7H 4&-- 6C9 -00, "4# # )A;2<5829> " ".".% ,/620>= :3 (*) .91 ! ) 2D 9:BDCHIG6I: I=6I I=: G:I>:H >C EGDCDJC" 2=: D7?:8IH 0/&! 345''! #08! 6C9 409 D88JGG:9 K:G7 8D"D88JGG:C8:H >C E6G:CI6A HE::8= ID 8=>A9G:C BDHI ;G:FJ:CIAN L>I= (&4! 6C9 ;G:FJ:CIAN L>I= 154# 86C 68IJ6AAN 7: :MEAD>I:9 7N 6 HI6I>HI>86A A:6GC:G! 2=: D7?:8IH 4)&.! ) * .! ) & 2 ! #&%! 6C9 .054) L: IG6>C:9 6C 6JID6HHD8>6IDG DC I=: 8DGEJH 96I6! D88JGG:9 BDHI ;G:FJ:CIAN L>I= 154 6C9! >C HDB: I=:C I:HI:9 >I DC >C8DBEA:I: JII:G6C8:H ID H:: =DL 86H:H! 6AHD ;G:FJ:CIAN L>I= (&4# L:AA >I LDJA9 P;>AA >C I=: 7A6C@HQ L=:C <>K:C DCAN ".".& ,/620>= :3 "a+ .91 ,a&) 6 EGDCDJC! DG DCAN 6 K:G7# *C 6JID6HHD8>6IDG >H 6 2=: D7?:8IH $00,*&! 30.&! .0/&9! $0''&&! .*-,! 8DCC:8I>DC>HI C:ILDG@ I=6I >H IG6>C:9 ID I6@: :68= 6C9 +5*$& ;DGB:9 6 8AJHI:G >C K:G7 HE68:! >CEJI E6II:GC 6C9 G:EGD9J8: >I 6I I=: DJIEJI# .C I=: 6EE:6G>C< ;G:FJ:CIAN L>I= K:G7H HJ8= 6H )"6& 6C9 EGD8:HH! >I 8DBEG:HH:H I=: E6II:GC I=GDJ<= 6 HB6AA 7"/4! 6H L:AA 6H! >C HDB: 86H:H! (*6&! 4",&! 1052! H:I D; =>99:C JC>IH >C I=: B>99A:! ;DG8>C< I=: %2*/,! 6C9 &"4# C:ILDG@ ID ;>C9 I=: HI6I>HI>86A G:I>:H 6BDC< I=: :A:B:CIH >C I=: >CEJI 96I6# 2=: C:ILDG@ >H ".".! -?/620>= :3 )"#&$ .91 $&', IG6>C:9 7N 768@EGDE6<6I>DC! L=>8= >I:G6I>K:AN 2=: HJ7?:8I 6EE:6G:9 BDHI ;G:FJ:CIAN L>I= I=: G:9J8:H I=: 9>H8G:E6C8>:H 7:IL::C I=: C:ILDG@RH K:G7H 4)*/, 6C9 ,/07# 68IJ6A DJIEJIH 6C9 I=: I6G<:I DJIEJIH (I=: H6B: 6H ".# (5=0?==5:9 I=: >CEJIH ;DG 6C 6JID6HHD8>6IDG # .C DJG 86H:! I=: >CEJIH (6C9 I=JH I=: DJIEJIH 6G: *AI=DJ<= EGDCDJCH 6G: H:B6CI>86AAN PA><=I!Q HJ7?:8I"K:G7"D7?:8I PH:CI:C8:H#Q 0C8: I=: I=:>G E6GI>8JA6G G:;:G:CIH 9:I:GB>C67A: DCAN ;GDB C:ILDG@ =6H A:6GC:9 I=: G:I>:H >C=:G:CI >C 6 8DCI:MI! I=:N B6N CDC:I=:A:HH 7: EDI:CI ;DG8:H DC 8DGEJH D; 8DBEA:I: 130 H:CI:C8:H! I:HI>C< >I DC :6GAN A:M>86A A:6GC>C< 7N HI6I>HI>86AAN ED>CI>C< ID >C8DBEA:I: H:CI:C8:H (:#<#! P. 555 =>BQ 6AADLH JH HDB: 8A6HH:H D; K:G7H 6H 7:>C< BDG: A>@:AN I=6C ID H:: L=6I >I =6H DCH=>E DI=:GH# 2=: G:HJAIH D; +ME:G>B:CI % 8A:6GAN H=DL 7:IL::C I=: <>K:C E6GIH (HJ7?:8I P.Q 6C9 D7?:8I I=6I I=:G: 6G: HI6I>HI>86A G:I>:H >C I=: 8D" P=>BQ >C DJG :M6BEA: 6C9 I=: B>HH>C< E6GIH (I=: D88JGG:C8:H D; EGDCDJCH 6C9 K:G7H I=6I I=: 8=>A9 K:G7 >C DJG :M6BEA: # 8DJA9 JH: ID 9>H8G>B>C6I: 8A6HH:H D; K:G7H# #.! *2>4:1 1E:8>;>86AAN! L=:C ;DAADL:9 7N *4! I=: K:G7 >H A>@:AN ID 9:H8G>7: E=NH>86A BDI>DC! IG6CH;:G! DG #.!.! (.>. EDHH:HH>DC# 4=:C ;DAADL:9 6 G:A6I>K:AN 8DBEA:M 2=: C:ILDG@ IG6>C>C< 96I6 8DCH>HI:9 D; I=: 8DBEA:B:CI 8A6JH:! 7N 8DCIG6HI! I=: K:G7 >H A>@:AN HJ7?:8I! K:G7! 6C9 D7?:8I D; 6AA 8D9:9 JII:G6C8:H ID 6IIG>7JI: 6 EHN8=DAD<>86A HI6I:# ,>C:G I=6I 8DCI6>C:9 I=: ($ BDHI 8DBBDC HJ7?:8IH! 9>HI>C8I>DCH B6N 6AHD 7: B69: L>I= DI=:G D7?:8IH! K:G7H 6C9 D7?:8IH# 2=:G: L:G: (!)'( HJ8= >C8AJ9>C< EGDE:G C6B:H 6C9 CDJCH# 3:G7H JII:G6C8:H# 2=: >CEJIH JH:9 6 AD86A>HI 8D9>C< ;DAADL:9 7N .&! 53! D"%%9! 6C9 !0..9 6G: A>@:AN L=:G:>C I=:G: L6H DC: 6C9 DCAN DC: >CEJI JC>I DJI ID =6K: ID 9D L>I= I:AA>C< DG ADD@>C<# 3:G7H D; ($ 68I>K6I:9 ;DG :68= HJ7?:8I! 6C9 A>@:L>H: ;DG ;DAADL:9 7N 0/&! 345''! 4)&.! )*.! DG )&2 6G: A>@:AN :68= K:G7 6C9 :68= D7?:8I# *7H:CI 6C9 DB>II:9 ID =6K: ID 9D L>I= <:II>C< DG EJII>C<# 3:G7H 6GC 8DC8G:I: D7?:8IH HJ8= 6H :M6BEA:! I=: JII:G6C8: P/D=C GJCHQ LDJA9 =6K: ' $00,*&! .*-,! 
DG +5*$& 6G: A>@:AN ID =6K: ID 9D L>I= JC>IH 68I>K6I:9 :K:C I=DJ<= >I DCAN =6H & =6K>C< DG L6CI>C<# ,>C: 9>HI>C8I>DCH B6N 6AHD 7: LDG9HOI=: I=>G9 JC>I 7:>C< I=: PCD D7?:8IQ JC>I# B69: 688DG9>C< ID HJ7?:8I# .; I=: HJ7?:8I >H ! I=: 4>I= ($ JC>IH :68= ;DG HJ7?:8I! K:G7 6C9 D7?:8I! K:G7 >H A>@:AN ID =6K: ID 9D L>I= I=>C@>C< DG I=:G: L:G: 6 IDI6A D; %($ >CEJI JC>IH ID I=: @CDL>C; I=: HJ7?:8I >H 905! 3)&! 7&! C:ILDG@# *8I>K: >CEJI JC>IH =69 6 K6AJ: D; %! 6C9 )&! DG 4)&9! I=: K:G7 >H A>@:AN ID =6K: ID 9D L>I= >C68I>K: >CEJI JC>IH =69 6 K6AJ: D; $# =6K>C< DG L6CI>C<# 2=>H G:IN BDHI A>@:AN G:;A:8IH I=: :8DAD@:<7 '<045>20>?<2 8=>A9G:COE6G:CIH P@CDLQ 6C9 8=>A9G:C PL6CIQ O 2=: C:ILDG@ 8DCH>HI:9 D; 6 ILD"A6N:G %($")"%($ 7JI >I 8DJA9 CDC:I=:A:HH 7: JH:;JA >C JC>I 6JID6HHD8>6IDG L>I= 6 AD<>HI>8 68I>K6I>DC 9>HI>CH=>C< I=:H: ILD 8A6HH:H D; K:G7H# ;JC8I>DC 6I I=: =>99:C A6N:G 6C9 6 I=G:: H:E6G6I: 2=: G:HJAIH I=JH ;6G H=DL I=6I I=:G: 6G: HD;IB6M 68I>K6I>DC ;JC8I>DCH (DC: :68= ;DG I=: EDI:CI>6AAN JH67A: G:I>:H >C I=: HI6I>HI>86A HJ7?:8I! K:G7 6C9 D7?:8I 6I I=: DJIEJI A6N:GOH::

72 2CAOL? ($ 9MCHA NB? MI@NG;R ;=NCP;NCIH @OH=NCIH" !. $(/1+0/ QBC=B ?HMOL?M NB;N ;FF NB? IONJONM CH NB? <;HE MOG 8B? H?NQILEM F?;LH G;HS I@ NB? =I#I==OLL?H=? NI &" NIA?NB?L QCNB NB? =LIMM#?HNLIJS ?LLIL L?AOF;LCNC?M I CH NB? >;N;$ 2IL ?R;GJF?" G?;MOL?" ;FFIQM OM NI CHN?LJL?N NB? H?NQILE QB?H N?MN?> IH NB? I P?L $a."" QBC=B ;L? ;GIHA NB? GIMN <;=EJLIJ;A;NCIH ;FAILCNBG 6C?>GCFF?L ;H> =IGGIH P?L QCNB %, CH NB? CHJON M?? 1L;OH" &..(! NI G;J CNM CHJONM <;=E IHNI CNM 8; +a0 ;L? NB? IONJONM$ :? =BIM? NI OM? ?CABN OHCNM CH NB? BC>>?H GIMN ;=NCP;N?> P?L F;S?L IH NB? <;MCM I@ MIG? JCFIN ?RJ?LCG?HNM NB;N QCNB NB? 'a-+" OHCN ;=NCP;N?> CH NB? I NB? HOG>?H OHCNM$ 5?NQILEM QCNB JIMCNCIH @CAOL? HIN MBIQH!" ;H> NB?S ;L? ;FMI @?Q?L BC>>?H OHCNM ?CNB?L >C> HIN F?;LH NB? ;GIHA NB? P?L QCNB JLII?M HIN G?L?FS F?;LH NB? - BC>>?H OHCNM F?;LH?> KOC=EFS ?> NI L?F;NCP? @L?KO?H=C?M I@ JLIHIOHM QCNB P?L;N;$ ?R;GJF?" NB? P?L #", M?? 2CAOL? * IH J;A? - QCNB NB? I NB? I CH NB? JL?PCIOM J;L;AL;JB" CM MNLIHAFS ;MMI=C;N?> QCNB NB? P?L #",$ 8B? >C@@?L?H=? G;S GIMN =F?;LFS QB?H NB? H?NQILE CM 2CAOL? (/ 5?NQILE ;L=BCN?=NOL? JLIGJN?> MCGOFN;H?IOMFS QCNB 0*- ;M NB? MO 'a-+" ;M NB? I;N; Q;M L;H>IGFS ;MMCAH?> NI NQI ALIOJM/ JL?@?LL?> ;H>" NBIOAB #", MNCFF N;E?M M?=IH> JF;=?" .%% I@ NB? >;N; Q;M OM?> @IL NL;CHCHA NB? H?NQILE" , " ' ' ;H> & ) * / L;HE NBCL> ;H> @IOLNB" QBCF? &%% Q;M L?M?LP?> @IL P;FC>;NCHA NB? L?MJ?=NCP?FSU=IHMCMN?HN QCNB NB? L?MOFNM CH 8;C@@?L?HN &$ 8BCM >?GIHMNL;N?M NB;N NB? H?NQILE GI>?F CM L;H>IG CHCNC;F Q?CABNM" @CP? H?NQILEM Q?L? NL;CH?> M?HMCNCP? NI BCAB#IL>?L =ILL?F;NCIHM ;GIHA QIL>M OHNCF NB? =LIMM#?HNLIJS IH NB? P;FC>;NCIH M?N CH NB? CHJON" HIN G?L?FS NB? @CLMN#IL>?L =ILL?F;NCIHM L?;=B?> ; GCHCGOG @IL ?;=B I@ NB?G$ 8L;CHCHA P?L< I==OLL?H=?M$ MNIJJ?> ;@N?L ;JJLIRCG;N?FS &*% ?JI=BM I@ 8B?M? L?MOFNM >I HIN >?J?H> IH OMCHA ;H NL;CHCHA" IH ;P?L;A?$ 0N NB;N JICHN" NB? H?NQILEM ;ONI;MMI=C;NCIH H?NQILE" ;H> Q? >I HIN =F;CG NB;N Q?L? ;=BC?PCHA ;L?H CH @;=N OM? ;H ;ONI;MMI=C;NCIH ;L=BCN?=NOL? C>?HNC@SCHA MO ICM=IP?L BCAB?L#IL>?L =ILL?F;NCIHM QCFF M?N =IOF> B;P? O=? L?MOFNM MCGCF;L NI NB? IH?M MBIQH B?L?$ 0H QCNB MIG? FIMM I@ A?H?L;FCT;NCIH" ;ONI;MMI=C;NIL Q;M =BIM?H IHFS ;M ; MCGJF? G?;HM NI ;PIC> IP?L@CNNCHA$ I@ >?GIHMNL;NCHA CH JLCH=CJF? NB;N ; MN;NCMNC=;F !.1." %(/0*,) F?;LH?L =;H ?RNL;=N NB? MN;NCMNC=;F L?AOF;LCNC?M @LIG NB? >;N;$ 0@N?L NL;CHCHA" NB? H?NQILEM Q?L? N?MN?> QCNB CH=IGJF?N? CHJONM =ILL?MJIH>CHA NI CMIF;N?> P?L JLIHIOHM$ 2IL ?R;GJF?" NI M?? QB;N ; H?NQILE B;> F?;LH?> ; QCNB :? B;P? MBIQH NB;N NB?L? ;L? MN;NCMNC=;F ; MCHAF? CHJON OHCN ;=NCP;N?>UNB? IH? L?AOF;LCNC?M CH =I#I==OLL?H=?M CHA NI %, ;M MO P?LL?H B?;L @LIG OHCNM Q?L? M?N NI %$ 0=NCP;NCIHM ;N NB? IONJON OHCNM NB?CL J;L?HNM$ :? B;P? ;FMI MBIQH NB;N ; MCGJF? Q?L? L?=IL>?>$ 8B? L?MOFNM JL?M?HN?> CHA MO?L L?AOF;LCNC?M NB;N ;L? HIN I;N;" ;H> OM? NB?G NI JL?>C=N NB? P?L< CH ;H CH=IGJF?N? M?HN?H=?$ 3IQ GCABN NBCM B?FJ =BCF>L?H F?;LH

73 VERBS, 4N THE FIRST PLACE! HEARING A VERB FRAMED BY CONDITION (IN WHICH THE VERB IS USED WITH NOUNS OR PRONOUNS MAY HELP THE CHILD ISOLATE THE VERB PRONOUNS THAT ARE CONSISTENT WITH THE REGULARITIES ITSELF[HAVING SIMPLE! SHORT CONSISTENT! AND HIGH" IN THE CORPUS AND AN \INCONSISTENT] CONDITION (IN FREQUENCY SLOT FILLERS COULD MAKE IT THAT MUCH WHICH THE VERB IS USED WITH NOUNS OR PRONOUNS EASIER TO SEGMENT THE RELEVANT WORD IN FRAMES LIKE THAT ARE INCONSISTENT WITH THE REGULARITIES IN THE \3E @@@ IT#] ;ECOND! THE INFORMATION PROVIDED BY CORPUS # E ARE DESIGNING AN EXPERIMENT TO WITH PRONOUNS! AND THEN HEARS ANOTHER VERB BEING TEST THIS HYPOTHESIS# USED WITH THE SAME OR A SIMILAR PATTERN OF E THE INPUT MAY THUS HELP THE CHILD NARROW DOWN THE ARE ESPECIALLY KEEN TO FIND OUT WHAT SORTS OF CUES CLASS TO WHICH AN UNKNOWN VERB BELONGS! ALLOWING CHILDREN MIGHT BE USING TO IDENTIFY VERB CLASSES IN THE LEARNER TO FOCUS ON FURTHER REFINING HER GRASP SUCH LANGUAGES# 3ENCE! WORK IS UNDERWAY TO OF THE VERB THROUGH SUBSEQUENT EXPOSURES# COLLECT COMPARABLE DATA FROM 5APANESE AND HETHER CHILDREN ARE ACTUALLY SENSITIVE TO THESE VERB"HEAVY LANGUAGES WITH FREQUENT ARGUMENT REGULARITIES REMAINS AN OPEN QUESTION# 793: OF (! BY WHICH TIME MOST CHILDREN CAN PRODUCE THE /79 &+;;.96 '.,7062;276! 5PAIK?, 5PAIK? MOST COMMON VERBS (0ALE AND 2ENSON! &**) # 9HDO@KLDMQ 6K@LL! C<@F 8IGIHLMKN>MDIH REGULARITIES DEMONSTRATED IN THIS PAPER IS THAT =CDF? ?DK@>M@? LJ@@>C! CHILDREN^S COMPREHENSION OF ORDINARY VERBS !7062;2=. (,2.6,. $),*&% *)%! SHOULD BE BETTER WHEN THEY ARE USED IN FRAMES THAT ;@ 3! /C +6- THAN WHEN THEY ARE USED IN FRAMES THAT ARE "2:84+,.5.6; 7/ !76:,27<: #?8.92.6,. 26 INCONSISTENT WITH THOSE REGULARITIES# -SSESSING (8.+3260 +6- *92;260! /CD>CDF?K@H!

74 (=@9NAGJ9D 5=K=9J;@ 2=L@G )@AD< 19F?M9?= $"+''# )GEHML=JK $)+#$' #$(! '(#! 6JMB IF STRUDTURBM SPURDFS PG AJMMBRE @BO 9RNBO ;UJOF! #*)"! 9O WIBT TIFRF WPRE NFBOJOH! 19F?M9?= ';IMAKALAGF #+% ''! JS! 4O ,JGE 9 1G?A;9D 4GAFL G> 8A=O, FE! 6JMB IF AJMMBRE @BO 9RNBO ;UJOF! .BNCRJEHF, 7,+ RPMF PG SYOTBX JO VFRC MFBROJOH! 4O 7@= 3BRVBRE ?OJVFRSJTY :RFSS! .9F<:GGC G> )@AD< 19F?M9?=, FES! :BUM 7BRTJO IF =J=F;= -BMEWJO! #**(! 6FXJDBMMY CBSFE MFBROJOH BOE GF 3=MJ9D 3=LOGJCK #%%$ /)33 %$!, =BO FBRMY HRBNNBTJDBM EFVFMPQNFOT! 0GMJF9D G> 1RBODJSDP, .,! )@AD< 19F?M9?= $&+#)( $#*! 7JDIBFM >PNBSFMMP! #**$! ,AJKL 8=J:K& ' )9K= -RJBO 7BDAIJOOFY! $"""! 7@= )./1*+6 6LM

+9JDP -J9EE9LA;9D *=N=DGHE=FL! 4JGB=;L& 7GGDK >GJ 'F9DPQAF? 79DC!VPM! $+ >IF .BNCRJEHF+ .BNCRJEHF ?OJVFRSJTY :RFSS! /BTBCBSF! 7BIWBI, 85+ 6BWRFODF 0RMCBUN @JRHJOJB @BMJBO! #**#! =YOTBDTJD SUCKFDTS JO TIF ,SSPDJBTFS! FBRMY SQFFDI PG ,NFRJDBO BOE 4TBMJBO DIJMERFO! 5UMJBO 7! :JOF, BOE 0MFOB @! 7! 6JFVFO! #**%! )G?FALAGF &"+$# )#!

Figure: Average network output response to the object it. Subjects are shown in the left column, verbs in the middle, and objects on the right. Within each column, output units are ordered according to the frequency of the corresponding words in the input (lower numbers are higher frequency). The width of each bar reflects the average activation of the corresponding unit in our networks.

Figure: Average network output response to the subject you. Same conventions as the previous figure.

Figure: Average network output response to the subject you and an object presented simultaneously. Same conventions as the previous figures.

Some Tests of an Unsupervised Model of Language Acquisition

Bo Pedersen and Shimon Edelman
Department of Psychology, Cornell University
Ithaca, NY 14853, USA
{bp64,se37}@cornell.edu

Zach Solan, David Horn, Eytan Ruppin
Faculty of Exact Sciences, Tel Aviv University
Tel Aviv, Israel 69978
{zsolan,horn,ruppin}@post.tau.ac.il

Abstract

We outline an unsupervised language acquisition algorithm and offer some psycholinguistic support for a model based on it. Our approach resembles the Construction Grammar in its general philosophy, and the Tree Adjoining Grammar in its computational characteristics. The model is trained on a corpus of transcribed child-directed speech (CHILDES). The model's ability to process novel inputs makes it capable of taking various standard tests of English that rely on forced-choice judgment and on magnitude estimation of linguistic acceptability. We report encouraging results from several such tests, and discuss the limitations revealed by other tests in our present method of dealing with novel stimuli.

1 The empirical problem of language acquisition

The largely unsupervised, amazingly fast and almost invariably successful learning stint that is language acquisition by children has long been the envy of computer scientists (Bod, 1998; Clark, 2001; Roberts and Atwell, 2002) and a daunting enigma for linguists (Chomsky, 1986; Elman et al., 1996). Computational models of language acquisition or "grammar induction" are usually divided into two categories, depending on whether they subscribe to the classical generative theory of syntax, or invoke "general-purpose" statistical learning mechanisms. We believe that polarization between classical and statistical approaches to syntax hampers the integration of the stronger aspects of each method into a common powerful framework. On the one hand, the statistical approach is geared to take advantage of the considerable progress made to date in the areas of distributed representation and probabilistic learning, yet generic "connectionist" architectures are ill-suited to the abstraction and processing of symbolic information. On the other hand, classical rule-based systems excel in just those tasks, yet are brittle and difficult to train.

We are developing an approach to the acquisition of distributional information from raw input (e.g., transcribed speech corpora) that also supports the distillation of structural regularities comparable to those captured by Context Sensitive Grammars out of the accrued statistical knowledge. In thinking about such regularities, we adopt Langacker's notion of grammar as "simply an inventory of linguistic units" ((Langacker, 1987), p.63). To detect potentially useful units, we identify and process partially redundant sentences that share the same word sequences. We note that the detection of paradigmatic variation within a slot in a set of otherwise identical aligned sequences (syntagms) is the basis for the classical distributional theory of language (Harris, 1954), as well as for some modern work (van Zaanen, 2000). Likewise, the pattern (the syntagm together with the equivalence class of complementary-distribution symbols that may appear in its open slot) is the main representational building block of our system, ADIOS (for Automatic DIstillation Of Structure).

Our goal in the present short paper is to illustrate some of the capabilities of the representations learned by our method vis a vis standard tests used by developmental psychologists, by second-language instructors, and by linguists. Thus, the main computational principles behind the ADIOS model are outlined here only briefly. The algorithmic details of our approach and accounts of its learning from CHILDES corpora appear elsewhere (Solan et al., 2003a; Solan et al., 2003b; Solan et al., 2004; Edelman et al., 2004).

2 The principles behind the ADIOS algorithm

The representational power of ADIOS and its capacity for unsupervised learning rest on three principles: (1) probabilistic inference of pattern significance, (2) context-sensitive generalization, and (3) recursive construction of complex patterns. Each of these is described briefly below.
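The slot-variation idea described above (grouping the fillers that appear in the open slot of otherwise identical sequences) can be illustrated in a few lines of code. This is a deliberately simplified stand-in, assuming a single slot, exact context matching and a plain count threshold; it is not the ADIOS procedure itself, which operates over a graph and applies a statistical significance test to candidate patterns.

```python
from collections import defaultdict

def candidate_equivalence_classes(sentences, min_fillers=2):
    """Group sentences that are identical except at one position and collect the
    varying words as a candidate equivalence class for that slot (illustrative only)."""
    slots = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            context = (tuple(words[:i]), tuple(words[i + 1:]))
            slots[context].add(word)
    return {ctx: fillers for ctx, fillers in slots.items() if len(fillers) >= min_fillers}

sentences = ["the cat is hungry", "the dog is hungry", "the horse is hungry"]
print(candidate_equivalence_classes(sentences))
# {(('the',), ('is', 'hungry')): {'cat', 'dog', 'horse'}}  (set order may vary)
```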

Figure 1: Left: a pattern (presented in tree form), capturing a long-range dependency (equivalence class labels are underscored). This and other examples here were distilled from a 400-sentence corpus generated by a 40-rule Context Free Grammar. Right: the same pattern recast as a set of rewriting rules that can be seen as a Context Free Grammar fragment.

Figure 2: Left: because ADIOS does not rewire all the occurrences of a specific pattern, but only those that share the same context, its power is comparable to that of Context Sensitive Grammars. In this example, equivalence class #75 is not extended to subsume the subject position, because that position appears in a different context (e.g., immediately to the right of the symbol BEGIN). Thus, long-range agreement is enforced and over-generalization prevented. Right: the context-sensitive "rules" corresponding to pattern #210.
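The rewriting rules on the right of Figure 1 can be read directly as a small grammar fragment. The sketch below stores a few of those rules and expands a symbol depth-first into a word string. The RULES entries labelled with pattern and class names are taken from the figure; the top-level DEMO pattern is a hypothetical addition for the purpose of the example, not one of the learned patterns.

```python
import random

# A few of the rewriting rules shown in Figure 1 (terminals are plain words;
# P* and E* symbols refer to other patterns / equivalence classes).
RULES = {
    "P51": [["the", "E50"]],                                   # P51 -> "the" E50
    "E50": [["bird"], ["cat"], ["cow"], ["dog"], ["horse"], ["rabbit"]],
    "E60": [["flies"], ["jumps"], ["laughs"]],
    "E85": [["annoyes"], ["bothers"], ["disturbes"], ["worries"]],
    # Hypothetical top-level pattern for the demo only; it just strings the
    # attested pieces together and is NOT one of the figure's rules.
    "DEMO": [["that", "P51", "E60", "E85", "Jim"]],
}

def expand(symbol):
    """Depth-first expansion of a symbol into a list of words."""
    if symbol not in RULES:          # terminal
        return [symbol]
    body = random.choice(RULES[symbol])
    words = []
    for s in body:
        words.extend(expand(s))
    return words

print(" ".join(expand("DEMO")))
# e.g. "that the bird jumps disturbes Jim"
```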

Probabilistic inference of pattern significance. ADIOS represents a corpus of sentences as an initially highly redundant directed graph, which can be informally visualized as a tangle of strands that are partially segregated into bundles. Each of these consists of some strands clumped together; a bundle is formed when two or more strands join together and run in parallel, and is dissolved when more strands leave the bundle than stay in. In a given corpus, there will be many bundles, with each strand (sentence) possibly participating in several. Our algorithm, described in detail in (Solan et al., 2004), identifies significant bundles that balance high compression (small size of the bundle "lexicon") against good generalization (the ability to generate new grammatical sentences by splicing together various strand fragments, each of which belongs to a different bundle).

Context sensitivity of patterns. A pattern is an abstraction of a bundle of sentences that are identical up to variation in one place, where one of several symbols (the members of the equivalence class associated with the pattern) may appear (Figure 1). Because this variation is only allowed in the context specified by the pattern, the generalization afforded by a set of patterns is inherently safer than in approaches that posit globally valid categories ("parts of speech") and rules ("grammar"). The reliance of ADIOS on many context-sensitive patterns rather than on traditional rules can be compared both to the Construction Grammar (discussed later) and to Langacker's concept of the grammar as a collection of "patterns of all intermediate degrees of generality" ((Langacker, 1987), p.46).

Hierarchical structure of patterns. The ADIOS graph is rewired every time a new pattern is detected, so that a bundle of strings subsumed by it is represented by a single new edge. Following the rewiring, which is context-specific, potentially far-apart symbols that used to straddle the newly abstracted pattern become close neighbors. Patterns thus become hierarchically structured in that their elements may be either terminals (i.e., fully specified strings) or other patterns. Moreover, patterns may refer to themselves, which opens the door for recursion.

3 Related computational and linguistic formalisms and psycholinguistic findings

Unlike ADIOS, very few existing algorithms for unsupervised language acquisition use raw, unannotated corpus data (as opposed, say, to sentences converted into sequences of POS tags). The only work described in a recent review (Roberts and Atwell, 2002) as completely unsupervised, the GraSp model (Henrichsen, 2002), does attempt to induce syntax from raw transcribed speech, yet it is not completely data-driven in that it makes a prior commitment to a particular theory of syntax (Categorial Grammar, complete with a pre-specified set of allowed categories). Because of the unique nature of our chosen challenge, finding structure in language rather than imposing it, the following brief survey of grammar induction focuses on contrasts and comparisons to approaches that generally stop short of attempting to do what our algorithm does. We distinguish between approaches that are motivated computationally (Local Grammar and Variable Order Markov models, and Tree Adjoining Grammar, discussed elsewhere (Edelman et al., 2004)) and those whose main motivation is linguistic and cognitive psychological (Cognitive and Construction grammars, discussed below).

Local Grammar and Markov models. In capturing the regularities inherent in multiple criss-crossing paths through a corpus, ADIOS superficially resembles finite-state Local Grammars (Gross, 1997) and Variable Order Markov (VOM) models (Guyon and Pereira, 1995). The VOM approach starts by postulating a maximum-n structure, which is then fitted to the data by maximizing the likelihood of the training corpus. The ADIOS philosophy differs from the VOM approach in several key respects. First, rather than fitting a model to the data, we use the data to construct a (recursively structured) graph. Thus, our algorithm naturally addresses the inference of the graph's structure, a task that is more difficult than the estimation of parameters for a given configuration. Second, because ADIOS works from the bottom up in a recursive, data-driven fashion, it is less susceptible to complexity issues. It can be used on huge graphs, and may yield very large patterns, which in a VOM model would correspond to an unmanageably high order n. Third, ADIOS transcends the idea of VOM structure, in the following sense. Consider a set of patterns of the form b1[c1]b2[c2]b3, etc. The equivalence classes [·] may include vertices of the graph (both words and word patterns turned into nodes), wild cards (i.e., any node), as well as ambivalent cards (any node or no node). This means that the terminal-level length of the string represented by a pattern does not have to be of a fixed length. This goes conceptually beyond the variable order Markov structure: b2[c2]b3 do not have to appear in a Markov chain of a finite order ||b2||+||c2||+||b3||, because the size of [c2] is ill-defined, as explained above. Fourth, as we showed earlier (Figure 2), ADIOS incorporates both context-sensitive substitution and recursion.

Tree Adjoining Grammar. The proper place in the Chomsky hierarchy for the class of strings accepted by our model is between Context Free and Context Sensitive Languages. The pattern-based representations employed by ADIOS have counterparts for each of the two composition operations, substitution and adjoining, that characterize a Tree Adjoining Grammar, or TAG, developed by Joshi and others (Joshi and Schabes, 1997). Specifically, both substitution and adjoining are subsumed in the relationships that hold among ADIOS patterns, such as the membership of one pattern in another. Consider a pattern Pi and its equivalence class E(Pi); any other pattern Pj ∈ E(Pi) can be seen as substitutable in Pi. Likewise, if Pj ∈ E(Pi), Pk ∈ E(Pi) and Pk ∈ E(Pj), then the pattern Pj can be seen as adjoinable to Pi. Because of this correspondence between the TAG operations and the ADIOS patterns, we believe that the latter represent regularities that are best described by the Mildly Context-Sensitive Language formalism (Joshi and Schabes, 1997). Importantly, because the ADIOS patterns are learned from data, they already incorporate the constraints on substitution and adjoining that in the original TAG framework must be specified manually.

Psychological and linguistic evidence for pattern-based representations. Recent advances in understanding the psychological role of representations based on what we call patterns, or constructions (Goldberg, 2003), focus on the use of statistical cues such as conditional probabilities in pattern learning (Saffran et al., 1996; Gómez, 2002), and on the importance of exemplars and constructions in children's language acquisition (Cameron-Faulkner et al., 2003). Converging evidence for the centrality of pattern-like structures is provided by corpus-based studies of prefabs: sequences, continuous or discontinuous, of words that appear to be prefabricated, that is, stored and retrieved as a whole, rather than being subject to syntactic processing (Wray, 2002). Similar ideas concerning the ubiquity in syntax of structural peculiarities hitherto marginalized as "exceptions" are now being voiced by linguists (Culicover, 1999; Croft, 2001).

Cognitive Grammar; Construction Grammar. The main methodological tenets of ADIOS (populating the lexicon with "units" of varying complexity and degree of entrenchment, and using cognition-general mechanisms for learning and representation) fit the spirit of the foundations of Cognitive Grammar (Langacker, 1987). At the same time, whereas the cognitive grammarians typically face the chore of hand-crafting structures that would reflect the logic of language as they perceive it, ADIOS discovers the primitives of grammar empirically and autonomously. The same is true also for the comparison between ADIOS and the various Construction Grammars (Goldberg, 2003; Croft, 2001), which are all hand-crafted. A construction grammar consists of elements that differ in their complexity and in the degree to which they are specified: an idiom such as "big deal" is a fully specified, immutable construction, whereas the expression "the X, the Y" (as in "the more, the better" (Kay and Fillmore, 1999)) is a partially specified template. The patterns learned by ADIOS likewise vary along the dimensions of complexity and specificity (e.g., not every pattern has an equivalence class).

Figure 3: A typical pattern extracted from the CHILDES collection (MacWhinney and Snow, 1985). Hundreds of such patterns and equivalence classes (underscored) together constitute a concise representation of the raw data. Some of the phrases that can be described/generated by these patterns are: let's change her...; I thought you were gonna change her...; I was going to change your...; none of these appear in the training data, illustrating the ability of ADIOS to generalize. The generation process operates as a depth-first search of the tree corresponding to a pattern. For details see (Solan et al., 2003a; Solan et al., 2004).

4 ADIOS: a psycholinguistic evaluation

To illustrate the applicability of our method to real data, we first describe briefly the outcome of running it on a subset of the CHILDES collection (MacWhinney and Snow, 1985), consisting of transcribed speech directed at children. The corpus we selected contained 300,000 sentences (1.3 million tokens) produced by parents. After 14 real-time days, the algorithm (version 7.3) identified 3400 patterns and 3200 equivalence classes. The outcome was encouraging: the algorithm found intuitively significant patterns and produced semantically adequate corresponding equivalence sets. The algorithm's ability to recombine and reuse the acquired patterns is exemplified in the legend of Figure 3, which lists some of the novel sentences it generated.

The input module. The ADIOS system's input module allows it to process a novel sentence by forming its distributed representation in terms of activities of existing patterns. We stress that this module plays a crucial role in the tests described below, all of which require dealing with novel inputs. Figure 4 shows the activation of two patterns (#141 and #120) by a phrase that contains a word in a novel context (stay), as well as another word never before encountered in any context (5pm).

Acceptability of correct and perturbed novel sentences. To test the quality of the representations (patterns and their associated equivalence classes) acquired by ADIOS, we have examined their ability to support various kinds of grammaticality judgments. The first experiment we report sought to make a distinction between a set of (presumably grammatical) CHILDES sentences not seen by the algorithm during training, and the same sentences in which the word order has been perturbed. We first trained the model on 10,000 sentences from CHILDES, then compared its performance on (1) 1000 previously unseen sentences and (2) the same sentences in each of which a single random word-order switch has been carried out. The results, shown in Figure 5, indicate a substantial sensitivity of the ADIOS input module to simple deviations from grammaticality in novel data, even after a very brief training.

Learnability of nonadjacent dependencies. Within the ADIOS framework, the "nonadjacent dependencies" that characterize the artificial languages used by (Gómez, 2002) translate, simply, into patterns with embedded equivalence classes.
Figure 4: The two most active patterns responding to the partially novel input Joe and Beth are staying until 5pm. Leaf activation, which is proportional to the mutual information between input words and various members of the equivalence classes, is propagated upward by taking the average at each junction (Solan et al., 2003a).
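The averaging rule described in the caption can be made concrete with a short sketch. This is not the authors' implementation: the node structure and the leaf scores below are hypothetical stand-ins for the mutual-information-based leaf activations that ADIOS actually computes.

from dataclasses import dataclass, field
from typing import List

# A minimal sketch (not the authors' code) of the upward propagation of
# activation described in the Figure 4 caption: each leaf of a pattern tree
# holds an activation score for the input word aligned with it, and every
# internal junction takes the average of its children's activations.
@dataclass
class PatternNode:
    children: List["PatternNode"] = field(default_factory=list)
    leaf_activation: float = 0.0  # used only when the node is a leaf

def propagate(node: PatternNode) -> float:
    """Return the activation of `node`, averaging over its children."""
    if not node.children:
        return node.leaf_activation
    return sum(propagate(child) for child in node.children) / len(node.children)

# Example: a junction over three leaves whose (hypothetical) activations came
# from word/equivalence-class mutual-information scores.
pattern = PatternNode(children=[PatternNode(leaf_activation=a) for a in (1.0, 0.92, 1.0)])
print(propagate(pattern))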

Figure 5: Grammaticality of perturbed sentences (CHILDES data). The figure shows a histogram of the input module output values for two kinds of stimuli: novel grammatical sentences (dark/blue), and sentences obtained from these by a single word-order permutation (light/red).

Gómez showed that the ability of subjects to learn a language L1 of the form {aXd, bXe, cXf}¹, as measured by their ability to distinguish it implicitly from L2 = {aXe, bXf, cXd}, depends on the amount of variation introduced at X. We replicated this experiment by training ADIOS on 432 strings from L1, with |X| = 2, 6, 12, 24. The stimuli were the same strings as in the original experiment, with the individual letters serving as the basic symbols. A subsequent test resulted in a perfect acceptance of L1 and a perfect rejection of L2. Training with the original words (rather than letters) as the basic symbols resulted in L2 rejection rates of 0%, 55%, 100%, and 100%, respectively, for |X| = 2, 6, 12, 24. Thus, the ADIOS performance both mirrors that of the human subjects and suggests a potentially interesting new effect (of the granularity of the input stimuli) that may be explored in further psycholinguistic studies.

A developmental test. The CASL test (Comprehensive Assessment of Spoken Language) is widely used in the USA to assess language comprehension in children (Carrow-Woolfolk, 1999). One of its many components is a grammaticality judgment test, which consists of 57 sentences and is administered as follows: a sentence is read to the child, who then has to decide whether or not it is correct. If not, the child has to suggest a correct version of the sentence. For every incorrect sentence, the test lists 2-3 acceptable correct ones. The present version of the ADIOS algorithm can compare sentences but cannot score single sentences. We therefore ignored 11 out of the 57 sentences, which were correct to begin with. The remaining 46 incorrect sentences and their corrected versions were scored by ADIOS (which for this test had been trained on a 300,000-sentence corpus from the CHILDES database); the highest scoring sentence in each trial was interpreted as the model's choice. The model labeled 17 of the test sentences correctly, yielding a score of 108 (100 = norm) for the age interval 7-0 through 7-2. This score is the norm for the age interval 8-3 through 8-5.²

¹Symbols a-f here stand for nonce words such as pel, vot, or dak, whereas X denotes a slot in which a subset of 24 other nonce words may appear.
²ADIOS was undecided about the majority of the other sentences on which it did not score correctly.
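The structure of the Gómez stimuli can be illustrated with a few lines of code. This is only a sketch of the language design described above; the marker words pel/vot/dak come from the footnote, while the ending and filler words below are invented placeholders, not the original stimuli.

# A small sketch of the Gómez (2002) stimulus structure used in this test:
# L1 = {aXd, bXe, cXf} and L2 = {aXe, bXf, cXd}, where X ranges over a filler
# set of size |X| = 2, 6, 12 or 24.
def build_language(pairings, fillers):
    """Return the set of three-word strings defined by (first, last) pairings."""
    return {f"{first} {x} {last}" for first, last in pairings for x in fillers}

markers = {"a": "pel", "b": "vot", "c": "dak"}
endings = {"d": "rud", "e": "jic", "f": "tood"}   # hypothetical nonce endings
fillers = [f"filler{i}" for i in range(6)]        # |X| = 6 here

L1 = build_language([(markers["a"], endings["d"]),
                     (markers["b"], endings["e"]),
                     (markers["c"], endings["f"])], fillers)
L2 = build_language([(markers["a"], endings["e"]),
                     (markers["b"], endings["f"]),
                     (markers["c"], endings["d"])], fillers)
# L1 and L2 share all their words and differ only in the nonadjacent
# marker-ending dependency, which is what the learner has to pick up.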

Figure 6: The results of several grammaticality tests (the Göteborg ESL test is described in the text).

ESL test (forced choice). We next used a standard test developed for English as a Second Language (ESL) classes, which has been administered in Göteborg (Sweden) to more than 10,000 upper secondary level students (that is, children who typically had 9 years of school, but only 6-7 years of English). The test consists of 100 three-choice questions, such as She asked me ___ at once (choices: come, to come, coming) and The tickets have been paid for, so you ___ not worry (choices: may, dare, need); the average score for the population mentioned is 65%. As before, the choice given the highest score by the algorithm won; if two choices received the same top score, the answer was "don't know". The algorithm's performance in this and several other tests is summarized in Figure 6 (these tests have been conducted with an earlier version of the algorithm (Solan et al., 2003a)). In the ESL test, ADIOS scored at just under 60%; compare this to the 45% precision (with 20% recall) achieved by a straightforward bi-gram benchmark.³

ESL test (magnitude estimation). In this experiment, six subjects were asked to provide magnitude estimates of linguistic acceptability (Gurman-Bard et al., 1996) for all the 3 x 100 sentences in the Göteborg ESL test. The test was paper based and included the instructions from (Keller, 2000). No measures were taken to randomize the order of the sentences or otherwise control the experiment. The same 300 sentences were processed by ADIOS, whose responses were normalized by dividing the output by the sum of each triplet's score. The results indicate a significant correlation (R² = 6.3%, p < 0.001) between the scores produced by the subjects and by ADIOS. In some cases the scores produced by ADIOS are equal, which usually indicates that there are too many unfamiliar words. Omitting these sentences yields a significant R² = 9.7%, p < 0.001; removing sentences for which the choices score almost equally (within 10%) results in R² = 12.7%, p < 0.001.⁴

Figure 7: Magnitude estimation study from Keller, plotted against the ADIOS score on the same sentences (R² = 0.53, p < 0.05). The sentences (ranked by increasing score) are:
How many men did you destroy the picture of?
How many men did you destroy a picture of?
How many men did you take the picture of?
How many men did you take a picture of?
Which man did you destroy the picture of?
Which man did you destroy a picture of?
Which man did you take the picture of?
Which man did you take a picture of?

Modeling Keller's data. A manuscript by Frank Keller lists magnitude estimation data for eight sentences.⁵ We compared these to the scores produced by ADIOS, and obtained a significant correlation (Figure 7). The input module seems capable of dealing with the substitution of a with the or of take with destroy, and it does reasonably well on the substitution of How many men with Which man. We conjecture that this performance can be improved by a more sophisticated normalization of the score produced by the input module, which should do a better job quantifying the cover (Edelman, 2004) of a novel sentence by the stored patterns. The limitations of the present version of the model became apparent when we tested it on the 52 sentences from Keller's dissertation, using his magnitude estimation method (Keller, 2000).⁶ For these sentences, no correlation was found between the human and the model scores. One of the more challenging aspects of this set is the central role of pronoun binding in many of the sentences, e.g., The woman/Each woman saw Peter's photograph of her/herself/him/himself. Moreover, this test set contains examples of context effects, where information in an earlier sentence can help resolve a later ambiguity. Thus, many of the grammatical contrasts that appear in Keller's test sentences are too subtle for the present version of the ADIOS input module to handle.

Acceptability of correct and perturbed artificial sentences. In this experiment 64 random sentences were produced with a CFG. For uniformity the sentence length was kept within 15-20 words. 16 of the sentences had two adjacent words switched and another 16 had two random words switched. The 64 sentences were presented to 17 subjects, who placed each on a computer screen at a lateral position reflecting the perceived acceptability. As expected, the perturbed sentences were rated as less acceptable than the non-perturbed ones (R² = 50.3% with p < 0.01). We controlled for sentence number, for how high on the screen the sentence was placed, for the reaction time and for sentence length; only the latter had a significant contribution to the correlation. The random permutations scored significantly (p < 0.01) lower than the adjacent permutations. Furthermore, the variance in the scores of the randomly permuted sentences was significantly larger (p < 0.005), suggesting that this kind of permutation violates the sentence structure more severely, but may also sometimes create acceptable sentences by chance. Previous tests showed that ADIOS is very good at recognizing perturbed CFG-generated sentences as such, but it remains to be seen whether or not ADIOS also exhibits differential behavior on the adjacent and non-adjacent permutations.

Acceptability of ADIOS-generated sentences. ADIOS was trained on 12,700 sentences (out of a total of 12,966 sentences) in the ATIS (Air Travel Information System) corpus; the remaining 226 sentences were used for precision/recall tests. Because ADIOS is sensitive to the presentation order of the training sentences, 30 instances were trained on randomized versions of the training set. Eight human subjects were then asked to estimate acceptability of 20 sentences from the original corpus, intermixed randomly with 20 sentences generated by the trained versions of ADIOS. The precision, calculated as the average number of sentences accepted by the subjects divided by the total number of sentences in the set (20), was 0.73 ± 0.2 for sentences from the original corpus and 0.67 ± 0.07 for the sentences generated by ADIOS. Thus, the ADIOS-generated sentences are, on the average, as acceptable to human subjects as the original ones.

5 Concluding remarks

The ADIOS approach to the representation of linguistic knowledge resembles Construction Grammar in its general philosophy (e.g., in its reliance on structural generalizations rather than on syntax projected by the lexicon), and Tree Adjoining Grammar in its computational capacity (e.g., in its apparent ability to accept Mildly Context Sensitive Languages). The representations learned by the ADIOS algorithm are truly emergent from the (unannotated) corpus data. Previous studies focused on the algorithm that makes such learning possible (Solan et al., 2004; Edelman et al., 2004). In the present paper, we concentrated on testing the input module that allows the acquired patterns to be used in processing novel stimuli.

The results of the tests we described here are encouraging, but there is clearly room for improvement. We believe that the most pressing issue in this regard is developing a conceptually and computationally well-founded approach to the notion of cover (that is, a distributed representation of a novel sentence in terms of the existing patterns). Intuitively, the best case, which should receive the top score, is when there is a single pattern that precisely covers the entire input, possibly in addition to other evoked patterns that are only partially active. We are currently investigating various approaches to scoring distributed representations in which several patterns are highly active. A crucial constraint that applies to such cases is that a good cover should give proper expression to the subtleties of long-range dependencies and binding, many of which are already captured by the ADIOS learning algorithm.

Acknowledgments. Supported by the US-Israel Binational Science Foundation and by the Thanks to Scandinavia Graduate Scholarship at Cornell.

³Chance performance in this test is 33%. We note that the corpus used here was too small to train an n-gram model for n > 2; thus, our algorithm effectively overcomes the problem of sparse data by putting the available data to a better use.
⁴Four of the subjects only filled out the test partially (the numbers of responses were 300, 300, 186, 159, 96, 60), but the correlation was highly significant despite the missing data.
⁵http://elib.uni-stuttgart.de/opus/volltexte/1999/81/pdf/81.pdf
⁶We remark that this methodology is not without its problems. As one of our linguistically naive subjects remarked, "The instructions were (purposefully?) vague about what I was supposed to judge: understandability, grammar, correct use of language, or getting the point through...". Indeed, the scores in a magnitude experiment must be composites of several factors, at the very least well-formedness and meaningfulness. We are presently exploring various means of acquiring and dealing with such multidimensional "acceptability" data.

References

R. Bod. 1998. Beyond grammar: an experience-based theory of language. CSLI Publications, Stanford, US.
T. Cameron-Faulkner, E. Lieven, and M. Tomasello. 2003. A construction-based analysis of child directed speech. Cognitive Science, 27:843-874.
E. Carrow-Woolfolk. 1999. Comprehensive Assessment of Spoken Language (CASL). AGS Publishing, Circle Pines, MN.
N. Chomsky. 1986. Knowledge of language: its nature, origin, and use. Praeger, New York.
A. Clark. 2001. Unsupervised Language Acquisition: Theory and Practice. Ph.D. thesis, COGS, U. of Sussex.
W. Croft. 2001. Radical Construction Grammar: syntactic theory in typological perspective. Oxford U. Press, Oxford.
P. W. Culicover. 1999. Syntactic nuts: hard cases, syntactic theory, and language acquisition. Oxford U. Press, Oxford.
S. Edelman, Z. Solan, D. Horn, and E. Ruppin. 2004. Bridging computational, formal and psycholinguistic approaches to language. In Proc. of the 26th Conference of the Cognitive Science Society, Chicago, IL.
S. Edelman. 2004. Bridging language with the rest of cognition: computational, algorithmic and neurobiological issues and methods. In M. Gonzalez-Marquez, M. J. Spivey, S. Coulson, and I. Mittelberg, eds., Proc. of the Ithaca workshop on Empirical Methods in Cognitive Linguistics. John Benjamins.
J. L. Elman, E. A. Bates, M. H. Johnson, A. Karmiloff-Smith, D. Parisi, and K. Plunkett. 1996. Rethinking innateness: A connectionist perspective on development. MIT Press, Cambridge, MA.
A. E. Goldberg. 2003. Constructions: a new theoretical approach to language. Trends in Cognitive Sciences, 7:219-224.
R. L. Gómez. 2002. Variability and detection of invariant structure. Psychological Science, 13:431-436.
M. Gross. 1997. The construction of local grammars. In E. Roche and Y. Schabès, eds., Finite-State Language Processing, pages 329-354. MIT Press, Cambridge, MA.
E. Gurman-Bard, D. Robertson, and A. Sorace. 1996. Magnitude estimation of linguistic acceptability. Language, 72:32-68.
I. Guyon and F. Pereira. 1995. Design of a linguistic postprocessor using Variable Memory Length Markov Models. In Proc. 3rd Int'l Conf. Document Analysis and Recognition, pages 454-457, Montreal, Canada.
Z. S. Harris. 1954. Distributional structure. Word, 10:140-162.
P. J. Henrichsen. 2002. GraSp: Grammar learning from unlabeled speech corpora. In Proceedings of CoNLL-2002, pages 22-28, Taipei, Taiwan.
A. Joshi and Y. Schabes. 1997. Tree-Adjoining Grammars. In G. Rozenberg and A. Salomaa, eds., Handbook of Formal Languages, volume 3, pages 69-124. Springer, Berlin.
P. Kay and C. J. Fillmore. 1999. Grammatical constructions and linguistic generalizations: the What's X Doing Y? construction. Language, 75:1-33.
F. Keller. 2000. Gradience in Grammar: Experimental and Computational Aspects of Degrees of Grammaticality. Ph.D. thesis, U. of Edinburgh.
R. W. Langacker. 1987. Foundations of cognitive grammar, volume I: theoretical prerequisites. Stanford U. Press, Stanford, CA.
B. MacWhinney and C. Snow. 1985. The Child Language Data Exchange System. Journal of Child Language, 12:271-296.
A. Roberts and E. Atwell. 2002. Unsupervised grammar inference systems for natural language. Technical Report 2002.20, School of Computing, U. of Leeds.
J. R. Saffran, R. N. Aslin, and E. L. Newport. 1996. Statistical learning by 8-month-old infants. Science, 274:1926-1928.
Z. Solan, E. Ruppin, D. Horn, and S. Edelman. 2003a. Automatic acquisition and efficient representation of syntactic structures. In S. Thrun, editor, Advances in Neural Information Processing, volume 15, Cambridge, MA. MIT Press.
Z. Solan, E. Ruppin, D. Horn, and S. Edelman. 2003b. Unsupervised efficient learning and representation of language structure. In R. Alterman and D. Kirsh, eds., Proc. 25th Conference of the Cognitive Science Society, Hillsdale, NJ. Erlbaum.
Z. Solan, D. Horn, E. Ruppin, and S. Edelman. 2004. Unsupervised context sensitive language acquisition from a large corpus. In L. Saul, editor, Advances in Neural Information Processing, volume 16, Cambridge, MA. MIT Press.
M. van Zaanen. 2000. ABL: Alignment-Based Learning. In COLING 2000 - Proceedings of the 18th International Conference on Computational Linguistics, pages 961-967.
A. Wray. 2002. Formulaic language and the lexicon. Cambridge U. Press, Cambridge, UK.

Modelling Atypical Syntax Processing

Michael S. C. THOMAS Martin REDINGTON School of Psychology School of Psychology Birkbeck College, Malet St., Birkbeck College, Malet St., London WC1E 7HX London WC1E 7HX [email protected] [email protected]

Abstract

We evaluate the inferences that can be drawn from dissociations in syntax processing identified in developmental disorders and acquired language deficits. We use an SRN to simulate empirical data from Dick et al. (2001) on the relative difficulty of comprehending different syntactic constructions under normal conditions and conditions of damage. We conclude that task constraints and internal computational constraints interact to predict patterns of difficulty. Difficulty is predicted by frequency of constructions, by the requirement of the task to focus on local vs. global sequence information, and by the ability of the system to maintain sequence information. We generate a testable prediction on the empirical pattern that should be observed under conditions of developmental damage.

1 Dissociations in language function

Behavioural dissociations in language, identified both in cases of acquired brain damage in adults and in developmental disorders, have often been used to infer the functional components of the underlying language system. Generally these attempted fractionations appeal to broad distinctions within language. However, fine-scaled dissociations have also been proposed, such as the loss of individual semantic categories or of particular linguistic features in inflecting verbs. Here, we consider the implications of developmental and acquired deficits for the nature of syntax processing.

1.1 Developmental deficits

A comparison of developmental disorders such as autism, Downs syndrome, Williams syndrome, Fragile-X syndrome, and Specific Language Impairment reveals that dissociations can occur between phonology, lexical semantics, morpho-syntax, and pragmatics. The implications of such fractionations remain controversial but will be contingent on understanding the developmental origins of language structures (Karmiloff-Smith, 1998). These processes remain to be clarified even for the normal course of development.

In the area of syntax, Fowler (1998) concluded that a consistent picture emerges. Individuals with learning disabilities are systematic in their grammatical knowledge, follow the normal course of development, and show similar orders of difficulty in acquiring constructions. However, such individuals can often handle only limited levels of syntactic complexity and therefore development seems to terminate at a lower level. While there is great variability in linguistic function both across different disorders and within single disorders, this cannot be attributed solely to differences in 'general cognitive functioning' (e.g., as assessed by problem solving ability). Syntax acquisition is therefore to some extent independent of IQ. However, adults with developmental disorders who have successfully acquired syntax typically have mental ages of at least 6 or 7, an age at which typically developing children also have well-structured language. The variability in outcome has been attributed to various factors specific to language, including verbal working memory and the quality of phonological representations (Fowler, 1998; McDonald, 1997). Most notably, disorders with different cognitive abilities show similarity in syntactic acquisition. The apparent lack of deviance across heterogeneous disorders has been used to argue for a model of language acquisition that is heavily constrained by the brain that is acquiring the language (Newport, 1990).

1.2 Acquired deficits in adulthood

One of the broadest distinctions in acquired language deficits is between Broca's and Wernicke's aphasia. Broca's aphasics are sometimes described as having greater deficits in grammar processing, and Wernicke's aphasics as having greater deficits in lexical processing. The dissociation is taken to support the idea that the division between grammar and the lexicon is one of the constraints that the brain brings to language acquisition.

Dick et al. (2001) recently argued that four types of evidence undermine this claim: (1) all aphasics have naming deficits to some extent;

(2) apparently agrammatic patients retain knowledge of grammar that can be exhibited in grammaticality judgements; (3) grammar deficits are found in many populations both with and without damage to Broca's area, the reputed seat of syntax in the brain; and (4) aphasic symptoms of language comprehension can be simulated in normal adults by placing them in stressed conditions (e.g., via manipulating the speech input or giving the subject a distracter task). Dick et al. pointed out that in syntax comprehension, the constructions most resilient in both aphasic patients and normal adults with simulated aphasia are those that are most regular or most frequent, and conversely those liable to errors are non-canonical and/or low frequency. Dick et al. (2001) illustrated these arguments in an experiment that compared comprehension of four complex syntactic structures:

- Actives (e.g., The dog [subject] is biting the cow [object])
- Subject Clefts (e.g., It is the dog [subject] that is biting the cow [object])
- Passives (e.g., The cow [object] is bitten by the dog [subject])
- Object Clefts (e.g., It is the cow [object] that the dog [subject] is biting)

The latter two constructions are lower frequency, and have non-canonical word orders in which the object precedes the subject. Dick et al. tested 56 adults with different types of aphasia on a task that involved identifying the agent of spoken sentences. Patients with all types of aphasia demonstrated lower performance on Passives and Object Clefts than on Actives and Subject Clefts. Moreover, normal adults given the same task but with a degraded speech signal (either speeded up, low-pass filtered, or with noise added) or in combination with a distracter task (such as remembering a set of digits) produced a similar profile of performance to the aphasics (see Figure 1).

Figure 1. Aphasic and simulated (human) aphasic data from Dick et al. (2001). [Two panels of the Agent/Patient task, performance (%) on Active, Subject Cleft, Object Cleft and Passive constructions: upper panel, elderly controls vs. anomic, conduction, Broca's and Wernicke's aphasics; lower panel, normal adults under stressed conditions (compressed speech + visual digits, noise + visual digits, low-pass filtering + compression).]

Dick et al. (2001) argued that the common pattern of deficits could be explained by the Competition Model (MacWhinney & Bates, 1989), which proposes that the difficulty of acquiring certain aspects of language and their retention after brain damage could be explained by considering cue validity (the reliability of a source of information in predicting the structure of a target language) and cue cost (the difficulty of processing each cue). Cues high in validity and low in cost, such as Subject-Verb-Object word order in English, should be acquired more easily and be relatively spared in adult breakdown. The proposal is that for a given language, any domain-general processing system placed under sub-optimal conditions should exhibit a similar pattern of developmental or acquired deficits. Thus Dick et al. predicted that a connectionist model trained on an appropriate frequency-weighted corpus would show equivalent vulnerability of non-canonical word orders and low frequency constructions under conditions of damage. In contrast to the inferences drawn from developmental deficits, the focus here is on attributing similarities in patterns of acquired deficits to features of the problem domain rather than constraints of the language system.

2 Computational modelling

Proposals that site the explanation of behavioural data in the frequency structure of the problem domain (here, the relative frequency of the construction types) are insufficient for three reasons: (1) language comprehension is not about passive reception. The language learner must do something with the words in order to derive the meanings of sentences. It is the nature of the transformations required that crucially determines task difficulty, which statistics of language input alone cannot reveal. (2) Whatever the statistics of the environment, such information must be accessed by an implemented learning system. This system may be differentially sensitive to certain features of the input, and it may find certain transformations more computationally expensive than others, further modulating task difficulty. (3) In the context of atypical syntax processing in developmental and acquired disorders, behavioural deficits are caused by changes in internal computational constraints.

Without an implemented, parameterised learning system, we can have no understanding of how sub-optimal processing conditions generate behavioural deficits in syntax processing. To date, this issue has been relatively under-explored.

The choice of learning system is evidently of importance here. In this paper, we explore the behaviour of a connectionist network, since these systems have been widely applied to phenomena within cognitive and language development (Elman et al., 1996) and more recently to capturing both atypical development and acquired deficits in adults (Thomas & Karmiloff-Smith, 2002, 2003).

3 Simulation Design

Our starting point is a set of models of syntax acquisition proposed by Christiansen and Dale (2001). These authors employed a simple recurrent network (SRN; Elman, 1990), an architecture that is the dominant connectionist model of sequence processing in language studies and in sequence learning more generally. As is typical of current connectionist models of syntax processing, the Christiansen and Dale (henceforth C&D) model focuses on small fragments of grammar and a small vocabulary. Nevertheless, it provides a useful platform to begin considering the effects of processing constraints on syntax processing.

The following models performed a prediction task at the word level. At each time step, the network was presented with the current word and had to predict the next word in the sentence. This component of the task induces sensitivity to syntactic structures. A localist representation was used, with each input unit corresponding to a single word. The artificial corpus consisted of 54 words and included 6 nouns, 10 verbs, 5 adjectives, and 10 function words. Nouns and verbs had inflected forms represented by separate word units (N: stem, pluralised; V: stem, past tense, progressive, 3rd person singular).

C&D investigated the effect of several cues on syntax acquisition, such as prosody, stress, and word length. Prosody was represented as utterance boundary information that occurred at the end of an utterance with 92% probability. The utterance boundary cue was represented by an additional input and output unit.

Distributional cues of where words appeared in various sentences, along with utterance boundary information, were available to all networks. We refer to the networks that received only these cues as the "basic" model. We also tested a second set of "multiple cue" networks that also received cues about word length and stress. Word length was encoded with thermometer encoding, with one to three units being activated according to the number of syllables in the input word. In English, longer words tend to be content words. This was reflected in the vocabulary items that were selected for the grammar. Stress was encoded as a single unit that was activated for content words, which are stressed more heavily. The word length and stress units were present both as inputs and outputs, so that multiple cue networks had 59 input and output units to represent the words and cues.

3.1 The materials

The input corpus was a stochastic phrase structure grammar, derived from the materials used by C&D (2001). The grammar featured a range of constructions (imperatives, interrogatives and declarative statements). Frequencies were based on those observed in child-directed language. We added passives, subject and object cleft constructions to the grammar, which is illustrated in Figure 2.

Figure 2. Stochastic phrase structure grammar, including the probabilities of each construction

S -> Imperative [0.1] | Interrogative [0.3] | Declarative [0.6]
Declarative -> NP V-int [0.35] | NP V-tran NP active [0.28] | NP V-tran NP passive [0.042] | subject cleft [0.014] | object cleft [0.014] | NP-Adj [0.1] | That-NP [0.075] | You-P [0.125]
NP-Adj -> NP is/are adjective
That-NP -> that/those is/are NP
You-P -> you are NP
Imperative -> VP
Interrogative -> Wh-Question [0.65] | Aux-Question [0.35]
Wh-Question -> where / who / what is/are NP [0.5] | where / who / what do / does NP VP [0.5]
Aux-Question -> do / does NP VP [0.33] | do / does NP wanna VP [0.33] | is / are NP adjective [0.34]
NP -> a / the N-sing / N-plur
VP -> V-int | V-trans NP
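To make the role of a grammar like Figure 2 concrete, the following is a minimal sketch, not the authors' code, of how sentences can be sampled from a stochastic phrase structure grammar of this general shape; the rules and tiny vocabulary below are an invented stand-in for the 54-word lexicon and the full rule set described above.

import random

# Each rule maps a non-terminal to weighted expansions; terminals are words.
GRAMMAR = {
    "S": [(["Declarative"], 0.6), (["Interrogative"], 0.3), (["Imperative"], 0.1)],
    "Declarative": [(["NP", "V-int"], 0.35), (["NP", "V-tran", "NP"], 0.28),
                    (["Passive"], 0.042), (["SubjCleft"], 0.014), (["ObjCleft"], 0.014),
                    (["NP", "is", "adjective"], 0.1)],
    "Passive": [(["NP", "is", "V-pp", "by", "NP"], 1.0)],
    "SubjCleft": [(["it", "is", "NP", "that", "is", "V-ing", "NP"], 1.0)],
    "ObjCleft": [(["it", "is", "NP", "that", "NP", "is", "V-ing"], 1.0)],
    "Imperative": [(["V-tran", "NP"], 1.0)],
    "Interrogative": [(["who", "is", "V-ing", "NP"], 1.0)],
    "NP": [(["the", "N"], 0.5), (["a", "N"], 0.5)],
    "N": [(["dog"], 0.5), (["cow"], 0.5)],
    "V-int": [(["sleeps"], 1.0)],
    "V-tran": [(["bites"], 1.0)],
    "V-ing": [(["biting"], 1.0)],
    "V-pp": [(["bitten"], 1.0)],
    "adjective": [(["happy"], 1.0)],
}

def expand(symbol):
    """Recursively expand a symbol into a list of terminal words."""
    if symbol not in GRAMMAR:
        return [symbol]
    expansions, weights = zip(*GRAMMAR[symbol])
    chosen = random.choices(expansions, weights=weights)[0]
    return [word for part in chosen for word in expand(part)]

corpus = [" ".join(expand("S")) for _ in range(10)]
print("\n".join(corpus))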
The four sentence types appeared with the following frequency: (Declarative) Active: 16.8%, Subject Cleft: 0.84%, Object Cleft: 0.84%, Passives: 2.52%. This gave a Passive-to-Active ratio of roughly 1:7, and a ratio of OVS to SVO sentences of 1:21. Dick and Elman (2001) found that for English, the Passive-to-Active ratio ranged from 1:2 to 1:9 across corpora and that subject and object clefts appear in less than 0.05% of English sentences. They found that the relative frequency of word orders depended on whether one compares the passive OVS against transitive (SVO) or intransitive (SV) sentences and reported ratios that varied from 1:5 to 1:63 depending on corpus (spoken or written). The simulation frequencies were therefore an approximate fit, with the Subject and Object Clefts slightly higher than in English due to the requirement to have at least a handful appear in our training corpus.

We generated a corpus of 10,000 sentences from this grammar as our training materials for the network, and a set of 100 test sentences for each of the active, passive, subject cleft and object cleft constructions.

3.2 Simulation One

The Dick et al. (2001) task consisted of presenting participants with a spoken sentence, and two pictures corresponding to the agent and patient of the sentence. The participant's task was to indicate with a binary choice which of the pictures was the agent of the sentence. For example, for sentences such as 'the dog is biting the cow', participants were asked to "press the button for the side of the animal that is doing the bad action". Our next step was to implement this task in the model.
One approach would be to train the network to output at each processing step not only the next predicted word in the sentence but also the thematic role of the current input. If the current input is a noun, this would be agent or patient. Joanisse (2000) proposed just such a solution to parsing in a connectionist model of anaphor resolution. We will refer to the implementation of activating units for agent or patient (solely) on the same cycle as the relevant noun as the "Discrete" mapping problem of relating nouns to roles.

The mapping problem adds to the difficulty of the prediction task. We can assess the extent of this difficulty by measuring performance on the prediction component alone, against the metrics of two statistical models. The bigram and trigram models are statistical descriptions of the sentence set that predict the next word given the previous two or three words of context, respectively, and these were derived from the observed frequencies in the training set.

Lastly, for the purposes of this simulation, we do not distinguish between the syntactic roles of subject and object, and the semantic roles of agent and patient, even though a more complex model may separate these levels and include a process that maps between them. Although these simulations conflate the syntactic and semantic categories, we use the terms agent / patient for clarity in linking to the Dick et al. empirical data.

3.2.1 Method

For Simulation 1, we added two output units to the C&D network. The network was trained to activate the first extra unit when the current input element was the subject / agent of the sentence, and to activate the second extra unit when the object / patient of the sentence was presented. For all other inputs, the target activation of both units was zero. Thus, the number of input and output units was 55 and 57 respectively for the basic model, and 59 and 61 units for the multiple-cue model.

The network's ability to correctly predict the next word was measured over the 55 word output units using the cosine between the target and actual output vectors. On novel sentences, a perfect network will only be able to predict the next item probabilistically. However, over many test items, this measure gives a fair view of the network's performance and we followed C&D (2001) in using this measure.

We initially chose our parameters based on those used by C&D (2001). Our learning rate was 0.1, and we trained the network for ten epochs. We performed a simple search of the parameter space for the number of hidden units to establish a "normal" condition (see Thomas & Karmiloff-Smith, 2003, for discussion of parameters defining normality). Eighty hidden units, the number used by C&D, gave adequate results for both models. This value was used to define the normal model.

We first evaluate normal performance at the end of training, then under the developmental deficit of a reduction in hidden units in the start state, and finally under the acquired deficit of a random lesion to a proportion of connection weights from the trained network.

3.2.2 Results

On the prediction component of the task, both models demonstrated better prediction ability than the bigram model, and marginally less prediction ability than the trigram model. This is in contrast to C&D's original prediction-only SRN model, which exceeded trigram model performance. It shows that the requirement to derive agent and patient roles increased the complexity of the learning problem, interfering with prediction ability.

The role-assignment component of the task was indexed by the activation of the agent and patient units when presented with the second noun of the sentence. At presentation of the first noun, there was no information available in the test sentences that would allow the network to distinguish between the possible interpretations of the sentence. At the second noun, the most active of the two units was assumed to drive the interpretation of the sentence and subsequent picture identification in the Dick et al. task. Therefore, the network's response was "correct" for Active and Subject Cleft sentences if the "patient" unit had the highest activation, and for Passive and Object Cleft sentences if the "agent" unit had the highest activation.
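The Simulation 1 set-up described in 3.2.1 can be sketched schematically as follows. This is a reimplementation under stated assumptions, not the authors' code: a simple recurrent network that predicts the next word over 55 localist word units and additionally drives two role units (agent, patient), trained with plain per-step backpropagation; the sizes follow the basic model.

import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HID, N_OUT = 55, 80, 57          # 57 outputs = 55 words + agent + patient

W_ih = rng.normal(0, 0.1, (N_HID, N_IN))
W_hh = rng.normal(0, 0.1, (N_HID, N_HID))   # recurrent (context) weights
W_ho = rng.normal(0, 0.1, (N_OUT, N_HID))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(x, h_prev, lr=0.1, target=None):
    """One SRN time step; optionally apply a single gradient update."""
    global W_ih, W_hh, W_ho
    h = sigmoid(W_ih @ x + W_hh @ h_prev)
    y = sigmoid(W_ho @ h)
    if target is not None:
        # Squared-error deltas; hidden deltas are backpropagated through the
        # output weights only (truncated, as in standard SRN training without
        # backprop through time).
        d_out = (y - target) * y * (1 - y)
        d_hid = (W_ho.T @ d_out) * h * (1 - h)
        W_ho -= lr * np.outer(d_out, h)
        W_ih -= lr * np.outer(d_hid, x)
        W_hh -= lr * np.outer(d_hid, h_prev)
    return y, h

# One illustrative step: present word 3, ask the net to predict word 7 next
# and to flag the current word as the agent (role unit at index 55).
x = np.zeros(N_IN); x[3] = 1.0
t = np.zeros(N_OUT); t[7] = 1.0; t[55] = 1.0
y, h = step(x, np.zeros(N_HID), target=t)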

[Figure 3. Simulated Agent / Patient task, Discrete Mapping model: performance (%) of the basic and multiple-cue models on Active, Subject Cleft, Object Cleft and Passive sentences.]
[Figures 4-7. Activation of the agent and patient units over example Active ("a boy eats the cat"), Passive ("the crocodiles are eaten by the cat"), Subject Cleft ("it is a boy that is kissing the bunny") and Object Cleft ("it is a boy that a dog is eating") sentences.]
[Figure 8. Discrete Mapping model, Acquired Deficit: normal vs. 5%, 10% and 20% lesions.]
[Figure 9. Discrete Mapping model, Developmental Deficit: 20, 40 and 80 hidden units.]

The scores, measured in terms of the proportion of correct interpretations for the test sentences for each construction, are shown in Figure 3.

Somewhat surprisingly, both the basic and multiple-cue models exhibited better performance on the Passive and Object Cleft sentences than on Active and Subject Cleft sentences. (These differences were statistically reliable.) The main difference between the two models was lower performance on Subject Cleft in the basic model, implying that cues to content-word status help to disambiguate the two cleft constructions.

Examining the profiles of performance for each sentence type gives some insight into the dynamics of the networks. Figures 4 to 7 show the activation of the agent and patient units for the multiple-cue model during the processing of examples of each construction, selected at random. The Subject Cleft sentence shown in Figure 5 is typical of the pattern for both Active and Subject Cleft sentences. That is, agent unit activation is close to 1.0 at the first noun, while patient unit activation is close to zero. At the second noun, the network is usually able to correctly distinguish the patient, but some agent unit activation also occurs. Therefore, using our decision criteria, the network is not always able to correctly identify the patient, and scores on Active and Subject Cleft sentences are not perfect.

In contrast, in the example Passive and Object Cleft sentences, the network incorrectly activates the agent unit at presentation of the first noun. At this point, the network has no information that could possibly allow it to distinguish between the two different kinds of sentence, and so its response is driven by the relative frequency of the constructions. However, for the second noun (the agent), although the patient unit does show some activation, the agent unit is clearly favoured. Generally, the advantage of the agent unit for the Passive and Object Cleft sentences is greater than the advantage of the patient unit for the Active and Subject Cleft sentences. This can be explained by a general bias in the network in favour of the agent unit. In the training set, agents (subjects) occur much more frequently than patients (objects). All of the interrogatives and imperatives only have agents, and these comprise 30% of the training sentences. Thus, paradoxically, the network suffers when attempting to produce activation on the patient unit, and this impacts on the Active and Subject Cleft performance, despite the much greater frequency of these constructions.

Figures 8 and 9 illustrate the effects of initially reducing the numbers of hidden units in the network and of lesioning connections in the endstate.
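The two damage manipulations just mentioned can be sketched as follows. This is a hedged illustration, not the authors' code, applied to weight matrices of the kind used in the earlier SRN sketch; all variable names are hypothetical.

import numpy as np

rng = np.random.default_rng(1)

def developmental_deficit(n_hidden):
    """Start state with fewer hidden units (e.g. 20 or 40 instead of 80)."""
    return {
        "W_ih": rng.normal(0, 0.1, (n_hidden, 55)),
        "W_hh": rng.normal(0, 0.1, (n_hidden, n_hidden)),
        "W_ho": rng.normal(0, 0.1, (57, n_hidden)),
    }

def acquired_deficit(weights, proportion):
    """Lesion a trained network by zeroing a random proportion of each matrix."""
    lesioned = {}
    for name, W in weights.items():
        mask = rng.random(W.shape) >= proportion   # keep (1 - proportion) of weights
        lesioned[name] = W * mask
    return lesioned

normal_net = developmental_deficit(80)            # stand-in for a trained endstate
small_net = developmental_deficit(20)              # developmental condition
damaged_net = acquired_deficit(normal_net, 0.10)   # 10% lesion, acquired condition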

In both cases, non-optimal processing conditions exaggerated the pattern of task difficulty, with Actives and Subject Clefts failing to be learned or showing greater impairment after lesioning. Object Clefts are the most easily learnt and most robust to damage, despite their non-canonical word order and low frequency. With the task definition of responding "agent" to the second noun, this construction gains most from the prevalence of the agent status of nouns in the corpus.

This interpretation of the Dick et al. agent-identification task does not provide an adequate fit to the human data, either for normal or atypical performance. Why not? This implementation of the task requires that the network keep track of two roles at the same time and assign those roles at the correct moment. It is therefore driven by the independent probability of a noun being an agent or a patient at multiple time points through the sentence. The result is a de-emphasis of global sequence information and an emphasis on local lexical information, leading to a relative advantage of responding 'agent' to any noun.

In the Dick et al. task, the participant is asked to make a single decision based on the entire sentence, rather than continuously monitor word-by-word probabilities. Responses occurred between 2 and 4 seconds after sentence onset, with words presented at around 3 words-per-second. In the next section, we therefore provide an alternate implementation of the task based on a single categorisation decision for the whole sentence. But Simulation 1 serves as a demonstration that the statistics of the input set alone do not generate the task difficulty. It is the mappings required of the network. Moreover, we might predict that a modification of the Dick et al. study to encourage on-line monitoring of roles would alter the pattern of task difficulty. Thus, the four options might be presented as pictures (each noun twice, once as agent, once as patient), and the participants' eye-gaze direction recorded as the sentence unfolds.

3.3 Simulation Two

An alternate implementation of the Dick et al. task is that the network should be required to make a single categorisation of the whole sentence as to whether the agent precedes the patient, or the patient precedes the agent. This implementation follows the assumption that task performance is driven by higher-level sentence-based information rather than lexically-based information. A single unit can serve to categorise the input sentence as agent-then-patient or patient-then-agent. During training, the target activation for the unit is applied continuously throughout the entire utterance. We therefore call this the Continuous Mapping problem for sentence comprehension. Like the Discrete Mapping problem, the Continuous version has also been employed in previous connectionist models of parsing (Miikkulainen & Mayberry, 1999). (Note that Morris, Cottrell & Elman, 2000, used an implementation that combines Discrete and Continuous methods, providing a training signal that is activated when a word appears and is then maintained until the end of the sentence.) The Continuous method generates a training signal for comprehension. It does not constrain on-line comprehension, which may be subject to garden-pathing and dynamic revision.

3.3.1 Method

A single output unit was trained to produce an activation of 1 for sentences with Subject-Object word order (active and subject cleft constructions), and 0 for Object-Subject word order (passives and object cleft constructions). Apart from this difference, the basic and multiple-cue models were identical in all other respects, with 55 input and output units in the basic model, and 59 units in the multiple cue model. As before, we trained the network on 10,000 sentences generated by the stochastic phrase structure grammar, and tested the trained network on sets of 100 Active, Passive, Subject Cleft and Object Cleft sentences. One hundred and twenty hidden units were required to define the 'normal condition' for these simulations.
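The Continuous Mapping training signal described above is simple enough to write down directly. The sketch below is not the authors' code; sentences are lists of words and construction labels are assumed to be known for the training corpus.

SUBJECT_OBJECT = {"active", "subject_cleft"}

def continuous_targets(sentence, construction):
    """One target value per word for the word-order unit: 1 for Subject-Object
    sentences (actives, subject clefts), 0 for Object-Subject sentences
    (passives, object clefts), applied at every time step."""
    value = 1.0 if construction in SUBJECT_OBJECT else 0.0
    return [value] * len(sentence)

print(continuous_targets("the dog is biting the cow".split(), "active"))
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(continuous_targets("the cow is bitten by the dog".split(), "passive"))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]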
3.3.2 Results

As with Simulation 1, the prediction ability of both basic and multiple-cue models suffered due to the burden imposed by the mapping task. Although the networks' performance reliably exceeded a bigram prediction model, the trigram statistical model was slightly superior.

The network's ability to correctly "interpret" the test sentences was measured as follows. If the semantic output unit's activation at the time of second noun presentation was greater than 0.5, then the response was assumed to indicate that the sentence had Subject-Object word order and the agent was the first noun. If the activation was less than or equal to 0.5, then the response was assumed to indicate that the sentence had Object-Subject word order and the agent was the second noun. Although the target output for the network was consistent throughout each sentence, we selected the presentation of the second noun as our point of measurement, as this was where the network's discrimination ability was greatest. Figure 10 depicts performance on the four constructions.

On Active, Subject Cleft, and Passive sentences the basic model showed appropriate performance, but it failed to correctly distinguish the Object Cleft sentences.

[Figure 10. Simulated Agent / Patient task, Continuous Mapping model: performance (%) of the basic and multiple-cue models on Active, Subject Cleft, Object Cleft and Passive sentences.]
[Figures 11-14. Activation of the sentence-type output unit (1 = SVO) over example Active ("a boy eats the cat"), Passive ("the crocodiles are eaten by the cat"), Subject Cleft ("it is a boy that is kissing the bunny") and Object Cleft ("it is a boy that a dog is eating") sentences.]
[Figure 15. Continuous Mapping model, Acquired Deficit: normal vs. 5%, 10% and 20% lesions.]
[Figure 16. Continuous Mapping model, Developmental Deficit: 80, 100 and 120 hidden units.]

Doubling the hidden units did not markedly alter this pattern. The multiple-cue model showed a much better fit to the human data, performing at close to ceiling for the Active, Passive and Subject Cleft constructions, and scoring in excess of 85% correct on Object Cleft constructions. The content-word cues provided in the multiple-cue model again appeared important in disambiguating the cleft constructions.

Focusing on the multiple-cue model, Figures 11-14 show the activation of the network's semantic output unit over a random sentence from each of the four test constructions. For the Active sentence, the network maintains a fairly constant high level of activation throughout the sentence. That is, it starts with the "assumption" that sentences will have a Subject-Object word order, and becomes more certain of this result (as shown by rising output activation) as the sentence proceeds.

For the Passive sentence, again, the network starts out assuming that the sentence will have the more frequent Subject-Object word order. But on seeing "eaten by", the network reverses its original diagnosis. However, the influence of this cue noticeably fades as the sentence proceeds. It persists enough that by the second noun, the network (just) manages to indicate correctly that the sentence has Object-Subject word order.

The Cleft constructions show a very different pattern. For the Subject Clefts, the network begins with a low output value from the semantic unit. This increases slightly as the first determiner and noun are presented, but the most valuable cue arrives with the words "that is kissing". These provide a perfect indicator (in this context) that the sentence has Subject-Object word order, and the activation of the semantic unit jumps dramatically, staying near ceiling for the rest of the sentence.

Finally, examining the Object Cleft sentence, output activation again starts low and rises only modestly during presentation of the first noun. However, the presence of a second noun following immediately after the first pulls the activation back down, to correctly indicate that the sentence has Object-Subject word order. Notice that, as with the Passive sentence, as the distance increases from the cue that marks the (less common) status of the Object Cleft sentence, so the activation level of the semantic unit tends to drift back to the default of the more frequent constructions.

Figures 15 and 16 illustrate, respectively, the effects of reducing the initial numbers of hidden units in the network and of lesioning connections in the endstate. In the case of acquired damage, non-optimal processing conditions exaggerate the pattern of task difficulty, with Passives and Object Clefts showing greater impairment after lesioning, in line with the empirical data in Figure 1. Interestingly, in the case of the developmental deficit, the pattern is subtly different. While Object Clefts show increased vulnerability, Passives are far more resilient to developmental damage.

We carried out further analysis of this difference. Using the examples in Figs. 13 and 14, the cues predicting Object-Subject order for Passives turned out to be the inflected verb 'eaten' followed by 'by', i.e., two lexical cues (the second redundant). For Object Clefts, the cue for Object-Subject order was sequence-based information: in this construction, two nouns are not separated by a verb. This is marked by the arrival of a second noun prior to a verb, that is, the words 'a' and 'dog'. While both lexical and sequence cues are low frequency by virtue of their constructions, they differ in that the Passive cue comprises lexical items unique to this construction, while the Object Cleft cue involves a particular sequence of lexical items that also appear in other constructions. Examination of activation dynamics reveals that both low frequency cues are lost after acquired damage. However, the network with the developmental deficit retains the ability to learn the lexically-based cue that marks the Passive, but has insufficient resources to learn the sequence-based cue that marks the Object Cleft construction.

Three points are evident here. First, the model makes a strong empirical prediction that when developmental deficits are compared to acquired deficits, passive constructions will be relatively less vulnerable. This renders the model testable and therefore falsifiable. Second, the model demonstrates the differential computational requirements of tasks driven by local (lexically-based) and global (sequence-based) information in a parsing task. Third, the model reveals the distinction between acquired and developmental deficits, with compensation possible in the latter case for cues with low processing cost (see Thomas & Karmiloff-Smith, 2002, for discussion).

4 Discussion

Implemented learning models are an essential requirement to begin an exploration of the internal constraints that influence successful and atypical syntax processing. Our model necessarily makes simplifications to begin this exploration (e.g., the distribution and frequency of lexical items across constructions is not in reality uniform; cleft constructions may have different stress / prosodic cues). A precise quantitative fit to the empirical data must await models that include those factors. However, the current model is sufficient to demonstrate the importance of the mapping task in specifying difficulty (over and above the statistics of the input); how internal processing constraints influence performance; and how local and global information show a differential contribution to, and vulnerability in, sequence processing in a recurrent connectionist network.

5 Acknowledgements

This research was supported by grants from the British Academy and the Medical Research Council (G0300188) to Michael Thomas.

References

Christiansen, M. & Dale, R. 2001. Integrating distributional, prosodic and phonological information in a connectionist model of language acquisition. In Proceedings of the 23rd Annual Conference of the Cognitive Science Society (p. 220-225). Mahwah, NJ: LEA.
Dick, F. & Elman, J. 2001. The frequency of major sentence types over discourse levels: A corpus analysis. CRL Newsletter, 13.
Dick, F., Bates, E., Wulfeck, B., Aydelott, J., Dronkers, N., & Gernsbacher, M. 2001. Language deficits, localization, and grammar: Evidence for a distributive model of language breakdown in aphasic patients and neurologically intact individuals. Psychological Review, 108(3): 759-788.
Elman, J. 1990. Finding structure in time. Cognitive Science, 14, 179-211.
Elman, J., et al. (1996). Rethinking innateness. Cambridge, Mass.: MIT Press.
Fowler, A. (1998). Language in mental retardation: Associations with and dissociations from general cognition. In J. Burack et al., Handbook of Mental Retardation and Development (p. 290-333). Cambridge, UK: CUP.
Joanisse, M. 2000. Connectionist phonology. Unpublished Ph.D. Dissertation, University of Southern California.
Karmiloff-Smith, A. (1998). Development itself is the key to understanding developmental disorders. Trends in Cognitive Sciences, 2(10): 389-398.
MacWhinney, B. & Bates, E. 1989. The cross-linguistic study of sentence processing. New York: CUP.
McDonald, J. 1997. Language acquisition: The acquisition of linguistic structure in normal and special populations. Annu. Rev. Psychol., 48, 215-241.
Miikkulainen, R. & Mayberry, M. 1999. Disambiguation and grammar as emergent soft constraints. In B. MacWhinney (ed.), Emergence of Language. Hillsdale, NJ: LEA.
Morris, W., Cottrell, G., and Elman, J. 2000. A connectionist simulation of the empirical acquisition of grammatical relations. In S. Wermter & R. Sun (eds.), Hybrid Neural Systems. Heidelberg: Springer Verlag.
Newport, E. 1990. Maturational constraints on language learning. Cognitive Science, 14, 11-28.
Thomas, M.S.C. & Karmiloff-Smith, A. (2002). Are developmental disorders like cases of adult brain damage? Implications from connectionist modelling. Behavioural and Brain Sciences, 25(6), 727-788.
Thomas, M.S.C. & Karmiloff-Smith, A. 2003. Modelling language acquisition in atypical phenotypes. Psychological Review, 110(4), 647-682.

Combining Utterance-Boundary and Predictability Approaches to Speech Segmentation

Aris XANTHOS
Linguistics Department, University of Lausanne
UNIL - BFSH2
1015 Lausanne
Switzerland
[email protected]

Abstract

This paper investigates two approaches to speech segmentation based on different heuristics: the utterance-boundary strategy, and the predictability strategy. On the basis of former empirical results as well as theoretical considerations, it is suggested that the utterance-boundary approach could be used as a preprocessing step in order to lighten the task of the predictability approach, without damaging the resulting segmentation. This intuition leads to the formulation of an explicit model, which is empirically evaluated for a task of word segmentation on a child-oriented phonemically transcribed French corpus. The results show that the hybrid algorithm outperforms its component parts while reducing the total memory load involved.

1 Introduction

The design of speech segmentation¹ methods has been much studied ever since Harris' seminal propositions (1955). Research conducted since the mid 1990's by cognitive scientists (Brent and Cartwright, 1996; Saffran et al., 1996) has established it as a paradigm of its own in the field of computational models of language acquisition.

In this paper, we investigate two boundary-based approaches to speech segmentation. Such methods "attempt to identify individual word-boundaries in the input, without reference to words per se" (Brent and Cartwright, 1996). The first approach we discuss relies on the utterance-boundary strategy, which consists in reusing the information provided by the occurrence of specific phoneme sequences at utterance beginnings or endings in order to hypothesize boundaries inside utterances (Aslin et al., 1996; Christiansen et al., 1998; Xanthos, 2004). The second approach is based on the predictability strategy, which assumes that speech should be segmented at locations where some measure of the uncertainty about the next symbol (phoneme or syllable for instance) is high (Harris, 1955; Gammon, 1969; Saffran et al., 1996; Hutchens and Adler, 1998; Xanthos, 2003).

Our implementation of the utterance-boundary strategy is based on n-gram statistics. It was previously found to perform a "safe" word segmentation, that is, with a rather high precision, but also too conservative, as witnessed by a not so high recall (Xanthos, 2004). As regards the predictability strategy, we have implemented an incremental interpretation of the classical successor count (Harris, 1955). This approach also relies on the observation of phoneme sequences, the length of which is however not restricted to a fixed value. Consequently, the memory load involved by the successor count algorithm is expected to be higher than for the utterance-boundary approach, and its performance substantially better.

The experiments presented in this paper were inspired by the intuition that both algorithms could be combined in order to make the most of their respective strengths. The utterance-boundary typicality could be used as a computationally inexpensive preprocessing step, finding some true boundaries without inducing too many false alarms; then, the heavier machinery of the successor count would be used to accurately detect more boundaries, its burden being lessened as it would process the chunks produced by the first algorithm rather than whole utterances.

2 Description of the algorithms

2.1 Segmentation by thresholding

Many distributional segmentation algorithms described in the literature can be seen as instances of the following abstract procedure (Harris, 1955; Gammon, 1969; Saffran et al., 1996; Hutchens and Adler, 1998; Bavaud and Xanthos, 2002). Let S be the set of phonemes (or segments) in a given language. In the most general case, the input of the algorithm is an utterance of length l, that is, a sequence of l phonemes u := s_1 … s_l (where s_i denotes the i-th phoneme of u). Then, for 1 ≤ i ≤ l − 1, we insert a boundary after s_i iff D(u, i) > T(u, i), where the values of the decision variable D(u, i) and of the threshold T(u, i) may depend on both the whole sequence and the actual position examined (Xanthos, 2003).

The output of such algorithms can be evaluated in reference to the segmentation performed by a human expert, using traditional measures from the signal detection framework. It is usual to give evaluations both for word and boundary detection (Batchelder, 2002). The word precision is the probability for a word isolated by the segmentation procedure to be present in the reference segmentation, and the word recall is the probability for a word occurring in the true segmentation to be correctly isolated. Similarly, the segmentation precision is the probability that an inferred boundary actually occurs in the true segmentation, and the segmentation recall is the probability for a true boundary to be detected.

In the remainder of this section, we will use this framework to show how the two algorithms we investigate rely on different definitions of D(u, i) and T(u, i).

2.2 Frequency estimates

Let U ⊆ S* be the set of possible utterances in the language under examination. Suppose we are given a corpus C ⊆ U made of T successive utterances.

The absolute frequency of an n-gram w ∈ S^n in the corpus is given by n(w) := Σ_{t=1}^{T} n_t(w), where n_t(w) denotes the absolute frequency of w in the t-th utterance of C. In the same way, we define the absolute frequency of w in utterance-initial position as n(w|I) := Σ_{t=1}^{T} n_t(w|I), where n_t(w|I) denotes the absolute frequency of w in utterance-initial position in the t-th utterance of C (which is 1 iff the utterance begins with w and 0 otherwise). Similarly, the absolute frequency of w in utterance-final position is given by n(w|F) := Σ_{t=1}^{T} n_t(w|F).

Accordingly, the relative frequency of w is obtained as f(w) := n(w) / Σ_{w̃∈S^n} n(w̃). Its relative frequencies in utterance-initial and -final position respectively are given by f(w|I) := n(w|I) / Σ_{w̃∈S^n} n(w̃|I) and f(w|F) := n(w|F) / Σ_{w̃∈S^n} n(w̃|F).[2]

Both algorithms described below process the input incrementally, one utterance after another. This implies that the frequency measures defined in this section are in fact evolving all along the processing of the corpus. In general, for a given input utterance, we chose to update n-gram frequencies first (over the whole utterance) before performing the segmentation.

[2] Note that in general, Σ_{w̃∈S^n} n(w̃|F) = Σ_{w̃∈S^n} n(w̃|I) = T̃, where T̃ ≤ T is the number of utterances in C that have a length greater than or equal to n.
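To make this bookkeeping concrete, here is a minimal sketch of the two ingredients defined so far, written in Python rather than the Perl used for the experiments reported below; all class and function names are illustrative only, not the author's code. It keeps separate counters for n(w), n(w|I) and n(w|F), updated one utterance at a time, and phrases the abstract procedure of section 2.1 as a function parameterized by the decision variable D and the threshold T.

    from collections import Counter

    def ngrams(u, n):
        """All n-grams (as tuples) of an utterance given as a sequence of phonemes."""
        return [tuple(u[i:i + n]) for i in range(len(u) - n + 1)]

    class FrequencyTables:
        """Absolute frequencies n(w), n(w|I), n(w|F) for n-grams up to length max_n."""
        def __init__(self, max_n):
            self.max_n = max_n
            self.total = Counter()    # n(w)
            self.initial = Counter()  # n(w|I)
            self.final = Counter()    # n(w|F)

        def update(self, u):
            """Count one utterance; called before segmenting it (cf. section 2.2)."""
            for n in range(1, self.max_n + 1):
                if len(u) < n:
                    continue
                for w in ngrams(u, n):
                    self.total[w] += 1
                self.initial[tuple(u[:n])] += 1   # 1 iff the utterance begins with w
                self.final[tuple(u[-n:])] += 1    # 1 iff the utterance ends with w

        def rel(self, table, w):
            """Relative frequency of w among stored n-grams of the same length.
            Transparent but inefficient; real code would cache per-length totals."""
            denom = sum(c for v, c in table.items() if len(v) == len(w))
            return table[w] / denom if denom else 0.0

    def segment(u, decision, threshold):
        """Abstract procedure of section 2.1: insert a boundary after the i-th
        phoneme iff D(u, i) > T(u, i); returns boundary positions 1 <= i <= l-1."""
        return [i for i in range(1, len(u))
                if decision(u, i) > threshold(u, i)]

The two algorithms defined next can both be read as instances of this scheme, each supplying its own D and T.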
2.3 Utterance-boundary typicality

We use the same implementation of the utterance-boundary strategy that is described in more detail by Xanthos (2004). Intuitively, the idea is to segment utterances at positions where sequences occur that are typical of utterance boundaries. Of course, this implies that the corpus is segmented into utterances, which seems a reasonable assumption as far as language acquisition is concerned. In this sense, the utterance-boundary strategy may be viewed as a kind of learning by generalization.

Probability theory provides us with a straightforward way of evaluating how much an n-gram w ∈ S^n is typical of utterance endings. Namely, we know that the events "occurrence of n-gram w" and "occurrence of an n-gram in utterance-final position" are independent iff p(w ∩ F) = p(w)p(F), or equivalently iff p(w|F) = p(w). Thus, using maximum-likelihood estimates, we may define the typicality of w in utterance-final position as:

t(w|F) := f(w|F) / f(w)    (1)

This measure is higher than 1 iff w is more likely to occur in utterance-final position (than in any position), lower iff it is less likely to occur there, and equal to 1 iff its probability is independent of its position.

In the context of a segmentation procedure, this suggests a "natural" constant threshold T(u, i) := 1 (which can optionally be fine-tuned in order to obtain a more or less conservative result). Regarding the decision variable, if we were dealing with an utterance u of infinite length, we could simply set the order r ≥ 1 of the typicality computation and define D(u, i) as t(s_{i−(r−1)} … s_i | F) (where s_i denotes the i-th phoneme of u). Since the algorithm is more likely to process an utterance of finite length l, there is a problem when considering a potential boundary close to the beginning of the utterance, in particular when r > i. In this case, we can compute the typicality of smaller sequences, thus defining the decision variable as t(s_{i−(r̃−1)} … s_i | F), where r̃ := min(r, i).

As was already suggested by Harris (1955), our implementation actually combines the typicality in utterance-final position with its analogue in utterance-initial position. This is done by taking the average of both statistics, and we have found it empirically efficient to weight it by the relative lengths of the conditioning sequences:

D(u, i) := (r̃ / (r̃ + r̃')) · t(w|F) + (r̃' / (r̃ + r̃')) · t(w'|I)    (2)

where w := s_{i−(r̃−1)} … s_i ∈ S^{r̃}, w' := s_{i+1} … s_{i+r̃'} ∈ S^{r̃'}, r̃ := min(r, i) and r̃' := min(r, l − i). This definition helps compensate for the asymmetry of arguments when i is either close to 1 or close to l.

Finally, in the simulations below, we apply a mechanism that consists in incrementing n(w|F) and n(w'|I) (by one) whenever D(u, i) > T(u, i). The aim of this is to enable the discovery of new utterance-boundary typical sequences. It was found to considerably raise the recall as more utterances are processed, at the cost of a slight reduction in precision (Xanthos, 2004).
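Assuming the hypothetical FrequencyTables sketch above, the typicality of equation (1) and the weighted decision variable of equation (2) could be coded as follows; this is an illustrative sketch, and the re-incrementation of n(w|F) and n(w'|I) on inserted boundaries, described in the last paragraph, is omitted for brevity.

    def typicality(tables, w, positional):
        """t(w|X) = f(w|X) / f(w), where positional is the initial or final table."""
        fw = tables.rel(tables.total, w)
        return tables.rel(positional, w) / fw if fw > 0 else 0.0

    def ubt_decision(tables, u, i, r):
        """Equation (2): length-weighted average of the utterance-final typicality
        of the r~ phonemes ending at position i and the utterance-initial
        typicality of the r~' phonemes that follow it."""
        r_left = min(r, i)             # r~
        r_right = min(r, len(u) - i)   # r~'
        t_final = typicality(tables, tuple(u[i - r_left:i]), tables.final)
        t_initial = typicality(tables, tuple(u[i:i + r_right]), tables.initial)
        return (r_left * t_final + r_right * t_initial) / (r_left + r_right)

    def ubt_segment(tables, u, r, threshold=1.0):
        """Segment one utterance with the utterance-boundary strategy."""
        tables.update(u)  # frequencies are updated before segmenting (section 2.2)
        return [i for i in range(1, len(u))
                if ubt_decision(tables, u, i, r) > threshold]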
2.4 Successor count

The second algorithm we investigate in this paper is an implementation of Harris' successor count (Harris, 1955), the historical source of all predictability-based approaches to segmentation. It relies on the assumption that, in general, the diversity of possible phoneme transitions is high after a word boundary and decreases as we consider transitions occurring further inside a word.

The diversity of transitions following an n-gram w ∈ S^n is evaluated by the successor count (or successor variety), simply defined as the number of different phonemes that can occur after it:

succ(w) := |{s ∈ S | n(ws) > 0}|    (3)

Transposing the indications of Harris in the terms of section 2.1, for an utterance u := s_1 … s_l, we define D(u, i) as succ(w), where w := s_1 … s_i, and T(u, i) as max[D(u, i − 1), D(u, i + 1)]. Here again, a "backward" measure can be defined, the predecessor count:

predec(w) := |{s ∈ S | n(sw) > 0}|    (4)

Accordingly, we have D'(u, i) = predec(w'), where w' := s_{i+1} … s_l, and T'(u, i) := max[D'(u, i − 1), D'(u, i + 1)]. In order to combine both statistics, we have found it efficient to use a composite decision rule, where a boundary is inserted after phoneme s_i iff D(u, i) > T(u, i) or D'(u, i) > T'(u, i).

These decision variables differ from those used in the utterance-boundary approach in that there is no fixed bound on the length of their arguments. As will be discussed in section 3, this has important consequences for the complexity of the algorithm. Also, the threshold used for the successor count depends explicitly on both u and i: rather than seeking values higher than a given threshold, this method looks for peaks of the decision variable monitored over the input, whether the actual value is high or not. This is a more or less arbitrary feature of this class of algorithms, and much work remains to be done in order to provide theoretical justifications rather than mere empirical evaluations.
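A corresponding sketch of the successor-count segmenter, again in illustrative Python (the actual implementation stores these distributions in the trees discussed in section 3, whereas the dictionaries below keep whole prefixes and suffixes as keys): transition sets are accumulated incrementally, and a boundary is proposed wherever either statistic shows a strict local peak, following the composite rule above.

    from collections import defaultdict

    class TransitionCounts:
        """Successor and predecessor diversity for prefixes/suffixes of utterances."""
        def __init__(self):
            self.successors = defaultdict(set)    # prefix -> set of next phonemes
            self.predecessors = defaultdict(set)  # suffix -> set of previous phonemes

        def update(self, u):
            """Record every prefix/successor and suffix/predecessor pair of u."""
            for i in range(len(u)):
                self.successors[tuple(u[:i])].add(u[i])
            for i in range(1, len(u) + 1):
                self.predecessors[tuple(u[i:])].add(u[i - 1])

        def succ(self, w):
            """succ(w): number of distinct phonemes observed after w (equation 3)."""
            return len(self.successors.get(tuple(w), ()))

        def predec(self, w):
            """predec(w): number of distinct phonemes observed before w (equation 4)."""
            return len(self.predecessors.get(tuple(w), ()))

    def sc_segment(counts, u):
        """Composite peak rule of section 2.4: boundary after position i iff
        succ(s_1..s_i) is a strict local peak, or predec(s_{i+1}..s_l) is."""
        counts.update(u)
        l = len(u)
        D = [counts.succ(u[:i]) for i in range(l + 1)]     # forward: prefix of length i
        Dp = [counts.predec(u[i:]) for i in range(l + 1)]  # backward: suffix from i
        return [i for i in range(1, l)
                if D[i] > max(D[i - 1], D[i + 1]) or Dp[i] > max(Dp[i - 1], Dp[i + 1])]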

3 Complexity issues

It is not easy to evaluate the complexity of the algorithms discussed in this paper, which consists mainly in the space and time needed to store and retrieve the necessary information for the computation of n-gram frequencies. Of course, this depends much on the actual implementation. For instance, in a rather naive approach, utterances can be stored as such, and the memory load is then roughly equivalent to the size of the corpus, but computing the frequency of an n-gram requires scanning the whole memory.

A first optimization is to count utterances rather than merely store them. Some programming languages have a very convenient and efficient built-in data structure for storing elements indexed by a string[3], such as the frequency associated with an utterance. However, the actual gain depends on the redundancy of the corpus at the utterance level, and even in an acquisition corpus, many utterances occur only once. The time needed to compute the frequency of an n-gram is reduced accordingly, and due to the average efficiency of hash coding, the time involved by the storage of an utterance is approximately as low as in the naive case above.

It is possible to store not only the frequency of utterances, but also that of their subparts. In this approach, storing an n-gram and retrieving its frequency need comparable time resources, expected to be low if hashing is performed. Of course, from the point of view of memory load, this is much more expensive than the two previous implementations discussed. However, we can take advantage of the fact that in an utterance of length l, every n-gram w with 1 ≤ n < l is the prefix and/or suffix of at least one (n+1)-gram w'. Thus, it is much more compact to store them in a directed tree, the root of which is the empty string, and where each node corresponds to a phoneme in a given context[4], and each child of a node to a possible successor of that phoneme in its context. The frequency of an n-gram can be stored in a special child of the node representing the terminal phoneme of the n-gram.

This implementation (tree storage) will be used in the simulations described below. It is not claimed to be more psychologically plausible than another, but we believe the size in nodes of the trees built for a given corpus provides an intuitive and accurate way of comparing the memory requirements of the algorithms we discuss. From the point of view of time complexity, however, the tree structure is less optimal than a flat hash table, since the time needed for the storage or retrieval of an n-gram grows linearly with n.

[3] This type of storage is known as hash coding.
[4] Defined by the whole sequence of its parent nodes.
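The directed tree just described is essentially a trie. A minimal sketch follows, with hypothetical names (not the author's Perl code), in which each node stands for a phoneme in the context of its ancestors and the node's counter plays the role of the "special child" holding the n-gram's frequency.

    class TrieNode:
        """Node of the n-gram tree: one phoneme in the context of its ancestors."""
        __slots__ = ("children", "count")

        def __init__(self):
            self.children = {}
            self.count = 0

    class NgramTrie:
        """Directed tree rooted at the empty string (cf. section 3)."""
        def __init__(self):
            self.root = TrieNode()

        def add(self, w):
            """Store one occurrence of the n-gram w (a sequence of phonemes)."""
            node = self.root
            for phoneme in w:
                node = node.children.setdefault(phoneme, TrieNode())
            node.count += 1

        def freq(self, w):
            """Absolute frequency of w; 0 if w was never stored."""
            node = self.root
            for phoneme in w:
                node = node.children.get(phoneme)
                if node is None:
                    return 0
            return node.count

        def size(self):
            """Number of nodes, usable as a memory-load measure (section 4.3)."""
            stack, n = [self.root], 0
            while stack:
                node = stack.pop()
                n += 1
                stack.extend(node.children.values())
            return n

Storing or retrieving an n-gram walks one node per phoneme, hence the linear-in-n access time noted above, while shared prefixes are stored only once.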
4 Empirical evaluation

4.1 Experimental setup

Both algorithms described above were implemented in Perl[5] and evaluated using a phonemically transcribed and child-oriented French corpus (Kilani-Schoch corpus[6]). We have extracted from the original corpus all the utterances of Sophie's parents (mainly her mother) between ages 1;6.14 and 2;6.25 (year;month.day). These were transcribed phonemically in a semi-automatic fashion, using the BRULEX database (Content et al., 1990) and making the result closer to oral French with a few hand-crafted rules. Eventually, the first 10'000 utterances were used for simulations. This corresponds to 37'663 words (992 types) and 103'325 phonemes (39 types).

In general, we will compare the results observed for the successor count used on its own ("SC alone" on the figures) with those obtained when the utterance-boundary typicality is used for preprocessing[7]. The latter were recorded for 1 ≤ r ≤ 5, where r is the order for the computation of typicalities. The threshold value for typicality was set to 1 (see section 2.3). The results of the algorithms for word segmentation were evaluated by comparing their output to the segmentation given in the original transcription, using precision and recall for word and boundary detection (computed over the whole corpus). The memory load is measured by the number of nodes in the trees built by each algorithm, and the processing time is the number of seconds needed to process the whole corpus.

[5] Perl was chosen here because of the ease it provides when it comes to textual statistics; however, execution is notoriously slower than with C or C++, and this should be kept in mind when interpreting the large differences in processing time reported in section 4.4.
[6] Sophie, a French-speaking Swiss child, was recorded at home by her mother every ten days in situations of play (Kilani-Schoch and Dressler, 2001). The transcription and coding were done according to CHILDES conventions (MacWhinney, 2000).
[7] Results of the utterance-boundary approach alone are given in (Xanthos, 2004).
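The precision and recall figures reported below can be accumulated from per-utterance counts such as the following illustrative sketch (names are hypothetical), which identifies words with the spans delimited by boundaries and utterance edges.

    def evaluation_counts(true_boundaries, found_boundaries, utterance_length):
        """Per-utterance hit/proposed/reference counts for boundary and word
        detection. Boundaries are positions 1..l-1; words are the (start, end)
        spans delimited by boundaries and utterance edges."""
        true_b, found_b = set(true_boundaries), set(found_boundaries)

        def words(boundaries):
            cuts = [0] + sorted(boundaries) + [utterance_length]
            return {(cuts[k], cuts[k + 1]) for k in range(len(cuts) - 1)}

        true_w, found_w = words(true_b), words(found_b)
        return {
            "boundary_hits": len(true_b & found_b),
            "boundaries_found": len(found_b),
            "boundaries_true": len(true_b),
            "word_hits": len(true_w & found_w),
            "words_found": len(found_w),
            "words_true": len(true_w),
        }

Summing these counts over all utterances and dividing hits by found items gives corpus-level precision, and hits by true items gives recall, for boundaries and words respectively.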


4.2 Segmentation performance

When used in isolation, our implementation of the successor count has a segmentation precision as high as 82.5%, with a recall of 50.5%; the word precision and recall are 57% and 40.8% respectively. For comparison, the highest segmentation precision obtained with utterance-boundary typicality alone is 80.8% (for r = 5), but the corresponding recall does not exceed 37.6%, and the highest word precision is 44.4% (r = 4), with a word recall of 31.4%. As expected, the successor count performs much better than the utterance-boundary typicality in isolation.

Using utterance-boundary typicality as a preprocessing step has a remarkable impact on the performance of the resulting algorithm. Figure 1 shows the segmentation performance obtained for boundary detection with the successor count alone or in combination with preprocessing (for 1 ≤ r ≤ 5). The segmentation precision is always lower with preprocessing, but the difference dwindles as r grows: for r = 5, it reaches 79.9%, so only 2.1% are lost. On the contrary, the segmentation recall is always higher with preprocessing. It reaches a peak of 79.3% for r = 3, and stays as high as 71.2% for r = 5, meaning a 20.7% difference with the successor count alone.

[Figure 1: Segmentation precision and recall obtained with the successor count alone and with utterance-boundary preprocessing on n-grams.]

Concerning the detection of whole words (figure 2), the word precision is strictly increasing with r and ranges between 15.2% and 60.2%, the latter being a 3.2% increase with regard to the successor count alone. The word recall is lower when preprocessing is performed with n = 1 (-18.2%), but higher in all other cases, with a peak of 56% for n = 4 (+15.2%).

[Figure 2: Word precision and recall obtained with the successor count alone and with utterance-boundary preprocessing on n-grams.]

Overall, we can say the segmentation performance exhibited by our hybrid algorithm confirms our expectations regarding the complementarity of the two strategies examined: their combination is clearly superior to each of them taken independently. There may be a slight loss in precision, but it is massively counterbalanced by the gain in recall.

4.3 Memory load

The second hypothesis we made was that the preprocessing step would reduce the memory load of the successor count algorithm. In our implementation, the space used by each algorithm can be measured by the number of nodes of the trees storing the distributions. Five distinct trees are involved: three for the utterance-boundary approach (one for the distribution of n-grams in general and two for their distributions in utterance-initial and -final position), and two for the predictability approach (one for successors and one for predecessors). The memory load of each algorithm is obtained by summation of these values.

As can be seen in figure 3, the size of the trees built by the successor count is drastically reduced by preprocessing. The successor count alone uses as many as 99'405 nodes; after preprocessing, the figures range between 7'965 for n = 1 and 38'786 for n = 5 (SC, on the figure)[8]. However, the additional space used by the n-gram distributions needed to compute the utterance-boundary typicality (UBT) grows quickly with n, and the total number of nodes even exceeds that of the successor count alone when n = 5. Still, for lower values of n, preprocessing leads to a substantial reduction in total memory load.

[Figure 3: Memory load (in thousands of nodes) measured with the successor count alone and with utterance-boundary preprocessing on n-grams (see text).]

[8] These values are highly negatively correlated with the number of boundaries (true or false) inserted by preprocessing (r = −0.96).
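As a rough illustration of this bookkeeping (assuming the hypothetical NgramTrie sketch from section 3), the two memory-load figures compared here are simply sums of node counts over the relevant trees.

    def memory_load(trees):
        """Total memory load as the summed node counts of the trees involved
        (three for the utterance-boundary statistics, two for the successor and
        predecessor distributions), as in section 4.3."""
        return sum(tree.size() for tree in trees.values())

    # e.g. memory_load({"ngrams": t1, "initial": t2, "final": t3,
    #                   "successors": t4, "predecessors": t5})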


4.4 Processing time

It seems unlikely that the combination of the two algorithms does not exhibit any drawback. We have said in section 3 that storing distributions in a tree was not optimal from the point of view of time complexity, so we did not have high expectations on this topic. Nevertheless, we recorded the time used by the algorithms for the sake of completeness. CPU time[9] was measured in seconds, using built-in functions of Perl, and the durations we report were averaged over 10 runs of the simulation[10].

What can be seen in figure 4 is that although the time used by the successor count computation is slightly reduced by preprocessing, this does not compensate for the additional time required by the preprocessing itself. On average, the total time is multiplied by 1.6 when preprocessing is performed. Again, this is really a consequence of the chosen implementation, as this factor could be reduced to 1.15 by storing distributions in flat hash tables rather than tree structures.

[Figure 4: Processing time (in seconds) measured with the successor count alone and with utterance-boundary preprocessing on n-grams.]

[9] On a Pentium III 700MHz.
[10] This does not give a very accurate evaluation of processing time, and we plan to express it in terms of number of computational steps.

5 Conclusions and discussion

In this paper, we have investigated two approaches to speech segmentation based on different heuristics: the utterance-boundary strategy and the predictability strategy. On the basis of former empirical results as well as theoretical considerations regarding their performance and complexity, we have suggested that the utterance-boundary approach could be used as a preprocessing step in order to lighten the task of the predictability approach without damaging the segmentation.

This intuition was translated into an explicit model, then implemented and evaluated for a task of word segmentation on a child-oriented, phonetically transcribed French corpus. Our results show that:

• the combined algorithm outperforms its component parts considered separately;

• the total memory load of the combined algorithm can be substantially reduced by the preprocessing step;

• however, the processing time of the combined algorithm is generally longer, and possibly much longer depending on the implementation.

These findings are in line with recent research advocating the integration of various strategies for speech segmentation. In his work on computational morphology, Goldsmith (2001) uses Harris' successor count as a means to reduce the search space of a more powerful algorithm based on minimum description length (Marcken, 1996). We go one step further and show that an utterance-boundary heuristic can be used in order to reduce the complexity of the successor count algorithm[11].

[11] At least as regards memory load, which could be more restrictive in a developmental perspective.

Besides complexity issues, there is a problem of data sparseness with the successor count, as it decreases very quickly while the size n of the context grows. In the case of our quite redundant child-oriented corpus, the (weighted) average of the successor count[12] for a random n-gram, Σ_{w∈S^n} f(w) · succ(w), gets lower than 1 for n ≥ 9 (see figure 5). This means that in most utterances, no more boundaries can be inserted after the first 9 phonemes (respectively before the last 9 phonemes), unless we get close enough to the other extremity of the utterance for the predecessor (respectively successor) count to operate. As regards the utterance-boundary typicality, on the other hand, the position in the utterance makes no difference. As a consequence, many examples can be found in our corpus where the middle part of a long utterance would be undersegmented by the successor count alone, whereas preprocessing provides it with more tractable chunks. This is illustrated by the following segmentations of the utterance [phonemic transcription] (Daddy doesn't like carrots), where vertical bars denote boundaries predicted by the utterance-boundary typicality (for r = 5), and dashes represent boundaries inferred by the successor count:

[phonemic transcription]   SC
[phonemic transcription]   UBT (r = 5)
[phonemic transcription]   UBT + SC

[Figure 5: Average successor count for n-grams (based on the corpus described in section 4.1).]

[12] The predecessor count behaves much the same.

This suggests that the utterance-boundary strategy could be more than an additional device that safely predicts some boundaries that the successor count alone might have found or not: it could actually have a functional relationship with it. If the predictability strategy has some relevance for speech segmentation in early infancy (Saffran et al., 1996), then it may be necessary to counterbalance the data sparseness; this is what these authors implicitly do by using first-order transition probabilities, and it would be easy to define an n-th order successor count in the same way. Yet another possibility would be to "reset" the successor count after each boundary inserted. Further research should bring computational and psychological evidence for or against such ways to address representativity issues.
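One possible reading of such an n-th order successor count, sketched below with illustrative names, is to condition the successor diversity on only the last n phonemes rather than on the whole prefix, which directly limits the sparseness problem for long utterances; this is a speculative variant suggested by the paragraph above, not part of the evaluated model.

    from collections import defaultdict

    class OrderNSuccessorCount:
        """Successor diversity conditioned on the last n phonemes only,
        by analogy with n-th order transition probabilities."""
        def __init__(self, n):
            self.n = n
            self.successors = defaultdict(set)

        def update(self, u):
            """Record, for each position, the phoneme following its n-phoneme context."""
            for i in range(1, len(u)):
                context = tuple(u[max(0, i - self.n):i])  # at most n phonemes
                self.successors[context].add(u[i])

        def succ(self, u, i):
            """Successor count at boundary position i (after the i-th phoneme)."""
            context = tuple(u[max(0, i - self.n):i])
            return len(self.successors[context])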

We conclude this paper by raising an issue that was already discussed by Gammon (1969), and might well be tackled with our methodology. It seems that various segmentation strategies correlate more or less with different segmentation levels. We wonder if these different kinds of sensitivity could be used to make inferences about the hierarchical structure of utterances.

6 Acknowledgements

The author is grateful to Marianne Kilani-Schoch and the mother of Sophie for providing the acquisition corpus (see p.4), as well as to François Bavaud, Marianne Kilani-Schoch and two anonymous reviewers for useful comments on earlier versions of this paper.

References

R.N. Aslin, J.Z. Woodward, N.P. Lamendola, and T.G. Bever. 1996. Models of word segmentation in fluent maternal speech to infants. In J.L. Morgan and K. Demuth, editors, Signal to Syntax: Bootstrapping from Speech to Grammar in Early Language Acquisition, pages 117–134. Lawrence Erlbaum Associates, Mahwah (NJ).

E. Batchelder. 2002. Bootstrapping the lexicon: A computational model of infant speech segmentation. Cognition, 83:167–206.

F. Bavaud and A. Xanthos. 2002. Thermodynamique et statistique textuelle: concepts et illustrations. In Actes des 6e Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2002), pages 101–111.

M.R. Brent and T.A. Cartwright. 1996. Distributional regularity and phonotactics are useful for segmentation. Cognition, 61:93–125.

M.H. Christiansen, J. Allen, and M. Seidenberg. 1998. Learning to segment speech using multiple cues. Language and Cognitive Processes, 13:221–268.

A. Content, P. Mousty, and M. Radeau. 1990. Brulex: Une base de données lexicales informatisée pour le français écrit et parlé. L'Année Psychologique, 90:551–566.

E. Gammon. 1969. Quantitative approximations to the word. In Papers presented to the International Conference on Computational Linguistics COLING-69.

J. Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198.

Z.S. Harris. 1955. From phoneme to morpheme. Language, 31:190–222.

J.L. Hutchens and M.D. Adler. 1998. Finding structure via compression. In Proceedings of the International Conference on Computational Natural Language Learning, pages 79–82.

M. Kilani-Schoch and W.U. Dressler. 2001. Filler + infinitive and pre- and protomorphology demarcation in a French acquisition corpus. Journal of Psycholinguistic Research, 30(6):653–685.

B. MacWhinney. 2000. The CHILDES Project: Tools for Analyzing Talk. Third Edition. Lawrence Erlbaum Associates, Mahwah (NJ).

C.G. de Marcken. 1996. Unsupervised Language Acquisition. PhD dissertation, Massachusetts Institute of Technology.

J.R. Saffran, E.L. Newport, and R.N. Aslin. 1996. Word segmentation: The role of distributional cues. Journal of Memory and Language, 35:606–621.

A. Xanthos. 2003. Du k-gramme au mot: variation sur un thème distributionnaliste. Bulletin de linguistique et des sciences du langage (BIL), 21.

A. Xanthos. 2004. An incremental implementation of the utterance-boundary approach to speech segmentation. To appear in the Proceedings of Computational Linguistics in the Netherlands 2003 (CLIN 2003).

Author Index

Buttery, Paula ...... 1
Ćavar, Damir ...... 9
Chang, Nancy ...... 17
Clark, Alexander ...... 25
Dominey, Peter Ford ...... 33
Dresher, B. Elan ...... 41
Edelman, Shimon ...... 77
Freudenthal, Daniel ...... 53
Gambell, Timothy ...... 49
Gobet, Fernand ...... 53
Herring, Joshua ...... 9
Horn, David ...... 77
Ikuta, Toshikazu ...... 9
Inui, Toshio ...... 33
Jack, Kris ...... 61
Laakso, Aarre ...... 69
Pedersen, Bo ...... 77
Pine, Julian M. ...... 53
Redington, Martin ...... 85
Reed, Chris ...... 61
Rodrigues, Paul ...... 9
Ruppin, Eytan ...... 77
Schrementi, Giancarlo ...... 9
Smith, Linda ...... 69
Solan, Zach ...... 77
Thomas, Michael S. C. ...... 85
Waller, Annalu ...... 61
Xanthos, Aris ...... 93
Yang, Charles ...... 49
