Bootstrapping into Filler-Gap: An Acquisition Story

Marten van Schijndel and Micha Elsner
The Ohio State University
{vanschm,melsner}@ling.ohio-state.edu

Abstract

Analyses of filler-gap dependencies usually involve complex syntactic rules or heuristics; however, recent results suggest that filler-gap comprehension begins earlier than seemingly simpler constructions such as ditransitives or passives. Therefore, this work models filler-gap acquisition as a byproduct of learning word orderings (e.g. SVO vs OSV), which must be done at a very young age anyway in order to extract meaning from language. Specifically, this model, trained on part-of-speech tags, represents the preferred locations of semantic roles relative to a verb as Gaussian mixtures over real numbers. This approach learns role assignment in filler-gap constructions in a manner consistent with current developmental findings and is extremely robust to initialization variance. Additionally, this model is shown to be able to account for a characteristic error made by learners during this period (A and B gorped interpreted as A gorped B).

   Age    13mo   15mo    20mo   25mo
   Wh-S   No     Yes     Yes    Yes
   Wh-O   No     (Yes)   Yes    Yes
   1-1    -      Yes     -      No

Figure 1: The developmental timeline of subject (Wh-S) and object (Wh-O) wh-clause extraction comprehension suggested by experimental results (Seidl et al., 2003; Gagliardi et al., 2014). Parentheses indicate weak comprehension. The final row shows the timeline of 1-1 role bias errors (Naigles, 1990; Gertner and Fisher, 2012). Missing nodes denote a lack of studies.

1 Introduction

The phenomenon of filler-gap, where the argument of a predicate appears outside its canonical position in the phrase structure (e.g. [the apple]_i that the boy ate t_i or [what]_i did the boy eat t_i), has long been an object of study for syntacticians (Ross, 1967) due to its apparent processing complexity. Such complexity is due, in part, to the arbitrary length of the dependency between a filler and its gap (e.g. [the apple]_i that Mary said the boy ate t_i).

Recent studies indicate that comprehension of filler-gap constructions begins around 15 months (Seidl et al., 2003; Gagliardi et al., 2014). This finding raises the question of how such a complex phenomenon could be acquired so early, since children at that age do not yet have a very advanced grasp of language (e.g. ditransitives do not seem to be generalized until at least 31 months; Goldberg et al. 2004, Bello 2012). This work shows that filler-gap comprehension in English may be acquired through learning word orderings rather than relying on hierarchical syntactic knowledge.

This work describes a cognitive model of the developmental timecourse of filler-gap comprehension with the goal of setting a lower bound on the modeling assumptions necessary for an ideal learner to display filler-gap comprehension. In particular, the model described in this paper takes chunked child-directed speech as input and learns orderings over semantic roles. These orderings then permit the model to successfully resolve filler-gap dependencies.^1 Further, the model presented here is also shown to initially reflect an idiosyncratic role assignment error observed in development (e.g. A and B kradded interpreted as A kradded B; Gertner and Fisher, 2012), though after training, the model is able to avoid the error. As such, this work may be said to model a learner from 15 months to between 25 and 30 months.

^1 This model does not explicitly learn gap positions, but rather assigns thematic roles to arguments based on where those arguments are expected to manifest. This approach to filler-gap comprehension is supported by findings that show people do not actually link fillers to gap positions but instead link the filler to a verb with missing arguments (Pickering and Barry, 1991).

2 Background

The developmental timeline during which children acquire the ability to process filler-gap constructions is not well-understood. Language comprehension precedes production, and the developmental literature on the acquisition of filler-gap constructions is sparsely populated due to difficulties in designing experiments to test filler-gap comprehension in preverbal infants. Older studies typically looked at verbal children and the mistakes they make to gain insight into the acquisition process (de Villiers and Roeper, 1995).

Recent studies, however, indicate that filler-gap comprehension likely begins earlier than production (Seidl et al., 2003; Gagliardi and Lidz, 2010; Gagliardi et al., 2014). Therefore, studies of verbal children are probably actually testing the acquisition of production mechanisms (planning, motor skills, greater facility with lexical access, etc) rather than the acquisition of filler-gap. Note that these may be related, since filler-gap could introduce greater processing load which could overwhelm the child's fragile production capacity (Phillips, 2010).

Seidl et al. (2003) showed that children are able to process wh-extractions from subject position (e.g. [who]_i t_i ate pie) as young as 15 months, while similar extractions from object position (e.g. [what]_i did the boy eat t_i) remain unparseable until around 20 months of age.^2 This line of investigation has been reopened and expanded by Gagliardi et al. (2014), whose results suggest that the experimental methodology employed by Seidl et al. (2003) was flawed in that it presumed infants have ideal performance mechanisms. By providing more trials of each condition and controlling for the pragmatic felicity of test statements, Gagliardi et al. (2014) provide evidence that 15-month-old infants can process wh-extractions from both subject and object positions. Object extractions are more difficult to comprehend than subject extractions, however, perhaps due to additional processing load in object extractions (Gibson, 1998; Phillips, 2010). Similarly, Gagliardi and Lidz (2010) show that relativized extractions with a wh-relativizer (e.g. find [the boy]_i who t_i ate the apple) are easier to comprehend than relativized extractions with that as the relativizer (e.g. find [the boy]_i that t_i ate the apple).

Yuan et al. (2012) demonstrate that 19-month-olds use their knowledge of nouns to learn both verbs and their associated argument structure. In their study, infants were shown video of a person talking on a phone using a nonce verb with either one or two nouns (e.g. Mary kradded Susan). Under the assumption that infants look longer at things that correspond to their understanding of a prompt, the infants were then shown two images that potentially depicted the described action: one picture where two actors acted independently (reflecting an intransitive proposition) and one picture where one actor acted on the other (reflecting a transitive proposition).^3 Even though the infants had no extralinguistic knowledge about the verb, they consistently treated the verb as transitive if two nouns were present and intransitive if only one noun was present.

Similarly, Gertner and Fisher (2012) show that intransitive phrases with conjoined subjects (e.g. John and Mary gorped) are given a transitive interpretation (i.e. John gorped Mary) at 21 months (henceforth termed '1-1 role bias'), though this effect is no longer present at 25 months (Naigles, 1990). This finding suggests both that learners will ignore canonical structure in favor of using all possible arguments and that children have a bias to assign a unique semantic role to each argument. It is important to note, however, that cross-linguistically children do not seem to generalize beyond two arguments until after at least 31 months of age (Goldberg et al., 2004; Bello, 2012), so a predicate occurring with three nouns would still likely be interpreted as merely transitive rather than ditransitive.

Computational modeling provides a way to test the computational level of processing (Marr, 1982). That is, given the input (child-directed speech, adult-directed speech, and environmental experiences), it is possible to probe the computational processes that result in the observed output. However, previous computational models of grammar induction (Klein and Manning, 2004), including infant grammar induction (Kwiatkowski et al., 2012), have not addressed filler-gap comprehension.^4

The closest work to that presented here is the work on BabySRL (Connor et al., 2008; Connor et al., 2009; Connor et al., 2010). BabySRL is a computational model of semantic role acquisition using a similar set of assumptions to the current work. BabySRL learns weights over ordering constraints (e.g. preverbal, second noun, etc.) to acquire semantic role labelling while still exhibiting 1-1 role bias. However, no analysis has evaluated the ability of BabySRL to acquire filler-gap constructions. Further comparison to BabySRL may be found in Section 6.

^2 Since the wh-phrase is in the same (or a very similar) position as the original subject when the wh-phrase takes subject position, it is not clear that these constructions are true extractions (Culicover, 2013); however, this paper will continue to refer to them as such for ease of exposition.

^3 There were two actors in each image to avoid biasing the infants to look at the image with more actors.

^4 As one reviewer notes, Joshi et al. (1990) and subsequent work show that filler-gap phenomena can be formally captured by mildly context-sensitive grammar formalisms; these have the virtue of scaling up to adult grammar, but due to their complexity, do not seem to have been described as models of early acquisition.

   Susan   said   John   gave   girl   book
    -3      -2     -1      0      1      2

Table 1: An example of a chunked sentence (Susan said John gave the girl a red book) with the sentence positions labelled; noun chunks are represented by their nominal heads (Susan, John, girl, book).

               µ     σ      π
   G_SC       -1    0.5    .999
   G_SN       -1    3      .001
   G_OC        1    0.5    .999
   G_ON        1    3      .001
   Φ (skip penalty): .00001

Table 2: Initial values for the mean (µ), standard deviation (σ), and prior (π) of each Gaussian, as well as the skip penalty (Φ), used in this paper.

3 Assumptions

The present work restricts itself to acquiring filler-gap comprehension in English. The model presented here learns a single, non-recursive ordering for the semantic roles in each sentence relative to the verb, since several studies have suggested that early child grammars may consist of simple linear grammars dictated by semantic roles (Diessel and Tomasello, 2001; Jackendoff and Wittenberg, in press). This work assumes learners can already identify nouns and verbs, which is supported by Shi et al. (1999), who show that children at an extremely young age can distinguish between content and function words, and by Waxman and Booth (2001), who show that children can distinguish between different types of content words. Further, since Waxman and Booth (2001) demonstrate that, by 14 months, children are able to distinguish nouns from modifiers, this work assumes learners can already chunk nouns and access the nominal head. To handle recursion, this work assumes that children treat the final verb in each sentence as the main verb (implicitly assuming sentence segmentation), which ideally assigns roles to each of the nouns in the sentence.

Due to the findings of Yuan et al. (2012), this work adopts a 'syntactic bootstrapping' theory of acquisition (Gleitman, 1990), where structural properties (e.g. number of nouns) inform the learner about semantic properties of a predicate (e.g. how many semantic roles it confers). Since infants infer the number of semantic roles, this work further assumes they already have expectations about where these roles tend to be realized in sentences, if they appear. These positions may correspond to different semantic roles for different predicates (e.g. the subject of run and of melt); however, the role for predicates with a single argument is usually assigned to the noun that precedes the verb, while a second argument is usually assigned after the verb. The semantic properties of these roles may be learned lexically for each predicate, but that is beyond the scope of this work. Therefore, this work uses syntactic and semantic roles interchangeably (e.g. subject and agent).

Finally, following the finding by Gertner and Fisher (2012) that children interpret intransitives with conjoined subjects as transitives, this work assumes that semantic roles have a one-to-one correspondence with nouns in a sentence (similarly used as a soft constraint in the semantic role labelling work of Titov and Klementiev, 2012).

4 Model

The model represents the preferred locations of semantic roles relative to the verb as distributions over real numbers. This idea is adapted from Boersma (1997), who uses it to learn constraint rankings in optimality theory.

In this work, the final (main) verb is placed at position 0; words (and chunks) before the verb are given progressively more negative positions, and words after the verb are given progressively more positive positions (see Table 1). Learner expectations of where an argument will appear relative to the verb are modelled as two-component Gaussian mixtures: one mixture of Gaussians (G_S·) corresponds to the subject argument, and another (G_O·) corresponds to the object argument. There is no mixture for a third argument since children do not generalize beyond two arguments until later in development (Goldberg et al., 2004; Bello, 2012).

One component of each mixture learns to represent the canonical position for the argument (G_·C) while the other (G_·N) represents some alternate, non-canonical position such as the filler position in filler-gap constructions. To reflect the fact that learners have had 15 months of exposure to their language before acquiring filler-gap, the mixture is initialized so that there is a stronger probability associated with the canonical Gaussian than with the non-canonical Gaussian of each mixture.^5 Finally, the one-to-one role bias is explicitly encoded such that the model cannot use a label that has already been used elsewhere in the sentence.

^5 Akhtar (1999) finds that learners may not have strong expectations of canonical argument positions until four years of age, but the results of the current study are extremely robust to changes in initialization, as discussed in Section 7 of this paper, so this assumption is mostly adopted for ease of exposition.
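To make the initialization concrete, the following minimal sketch encodes the Table 2 parameters and scores a candidate argument position under one mixture component; the dictionary layout and the helper name are illustrative conveniences, not the paper's implementation.

```python
from scipy.stats import norm

# Initial parameters from Table 2. Each role (S = subject, O = object) has a
# canonical (C) and a non-canonical (N) Gaussian over verb-relative positions
# (final verb at 0, preverbal negative, postverbal positive; cf. Table 1).
MIXTURES = {
    'S': {'C': dict(mu=-1.0, sigma=0.5, pi=0.999),
          'N': dict(mu=-1.0, sigma=3.0, pi=0.001)},
    'O': {'C': dict(mu=1.0, sigma=0.5, pi=0.999),
          'N': dict(mu=1.0, sigma=3.0, pi=0.001)},
}

def component_score(role, theta, position, mixtures=MIXTURES):
    """Prior-weighted density of component G_{role,theta} generating a noun
    at the given verb-relative position."""
    c = mixtures[role][theta]
    return c['pi'] * norm.pdf(position, c['mu'], c['sigma'])

# In "Susan said John gave girl book" (Table 1), "John" sits at position -1:
# initially a strong canonical subject and a very poor canonical object.
print(component_score('S', 'C', -1))  # high
print(component_score('O', 'C', -1))  # near zero
```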

[Figure 2: Visual representations of (Left) the model's initial expectations of where arguments will appear, given the parameters in Table 2, and (Right) the model's converged expectations of where arguments will appear. Each panel plots probability against position relative to the verb; plots not reproduced here.]

Thus, the initial model conditions (see Figure 2) are most likely to realize an SVO ordering, though it is also possible to obtain SOV (by sampling a negative number from the blue curve) or even OSV (by also sampling the preverbal object from the red curve). The model is most likely to hypothesize a preverbal object when the subject role has already been assigned to something very close to 0 and, in addition, there is no postverbal noun competing for the object label. In other words, the model infers that an object extraction may have occurred if there is a 'missing' postverbal argument.

Finally, the probability of a given sequence is the product of the component label probabilities for the argument positions (e.g. G_SC generating an argument at position -2, etc). Since many sentences have more than two nouns, the model is allowed to skip nouns by multiplying a penalty term (Φ) into the product for each skipped noun; the cost of skipping is set at 0.00001 for this study, though see Section 7 for discussion of the constraints on this parameter. See Table 2 for the initialization parameters and Figure 2 for a visual representation of the initial expectations of the model.
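The sequence probability just described can be sketched as follows. The code is self-contained (repeating the Table 2 parameters as flat tuples), and the labelling format is an assumed convenience rather than the paper's data structure.

```python
from scipy.stats import norm

SKIP = 1e-5  # skip penalty Phi from Table 2
PARAMS = {('S', 'C'): (-1.0, 0.5, 0.999), ('S', 'N'): (-1.0, 3.0, 0.001),
          ('O', 'C'): (1.0, 0.5, 0.999), ('O', 'N'): (1.0, 3.0, 0.001)}

def sequence_probability(labelling, params=PARAMS, skip=SKIP):
    """Product over nouns of the prior-weighted component density at each
    labelled position; every skipped noun (label None) contributes Phi."""
    prob = 1.0
    for position, label in labelling:
        if label is None:
            prob *= skip
        else:
            mu, sigma, pi = params[label]
            prob *= pi * norm.pdf(position, mu, sigma)
    return prob

# "[What] did the boy eat?": nouns at -3 (what) and -1 (boy), verb at 0.
# Labelling "what" with the non-canonical object Gaussian amounts to
# hypothesizing an object extraction; the alternative is to skip it.
p_extract = sequence_probability([(-3, ('O', 'N')), (-1, ('S', 'C'))])
p_skip = sequence_probability([(-3, None), (-1, ('S', 'C'))])
# Extraction is only learnable if labellings like p_extract can beat
# skipping; Section 7 states this constraint on the initial parameters.
print(p_extract, p_skip)
```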

This work uses a model with 2-component mixtures for both subjects and objects, since a symmetric model best fit the data according to the Bayesian Information Criterion (BIC).^6 However, follow-up training experiments find that the non-canonical subject Gaussian only improves the likelihood of the data by erroneously modeling the postverbal nouns of imperative statements. The lack of a subject in English imperatives allows the model to improve the likelihood of the data by using the non-canonical subject Gaussian to capture the fictitious postverbal arguments. When imperatives are filtered out of the training corpus, the symmetric model obtains a worse BIC fit than a model that lacks the non-canonical subject Gaussian. Therefore, if one makes the assumption that imperatives are prosodically-marked for learners (e.g. the learner is the implicit subject), the best model lacks a non-canonical subject.^7 The remainder of this paper assumes a symmetric model to demonstrate what happens if such an assumption is not made; the results for the evaluations described in this paper are similar in either case.

Figure 2 (Right) shows the converged, final state of the model. The model expects the first argument (usually agent) to be assigned preverbally and expects the second argument (say, patient) to be assigned postverbally; however, there is now a larger chance that a second argument will appear preverbally.

This model differs from other non-recursive computational models of grammar induction (e.g. Goldwater and Griffiths, 2007) since it is not based on Hidden Markov Models. Instead, it determines the best ordering for the sentence as a whole. This approach bears some similarity to a Generalized Mallows model (Chen et al., 2009), but the current formulation was chosen due to being independently posited as cognitively plausible (Boersma, 1997).

^6 The BIC rewards improved log-likelihood but penalizes increased model complexity.

^7 This finding suggests that a Dirichlet Process or other means of dynamically determining the number of components in each mixture would converge to a model that lacks non-canonical subjects if imperative filtering were employed.

                Eve (n = 4820)        Adam (n = 4461)
                P     R     F         P     R     F
   Initial     .54   .64   .59       .53   .60   .56
   Trained     .52   .69   .59*      .51   .65   .57*
   Initial_c   .56   .66   .60       .55   .62   .58
   Trained_c   .54   .71   .61*      .53   .67   .59*

Table 3: Overall accuracy on the Eve and Adam sections of the BabySRL corpus. Bottom rows reflect accuracy when non-agent roles are collapsed into a single role. Note that improvements are numerically slight since filler-gap is relatively rare (Schuler, 2011). *p << .01

   Subject Extraction filter: S x V . . .
   Object Extraction filter:  O . . . V . . .

                Eve (n = 1345)        Adam (n = 1287)
                P     R     F         P     R     F
   Initial_c   .53   .57   .55       .53   .52   .52
   Trained_c   .55   .67   .61*      .54   .63   .58*

Table 4: (Above) Filters to extract filler-gap constructions: A) the subject and verb are not adjacent, B) the object precedes the verb. (Below) Filler-gap accuracy on the Eve and Adam sections of the BabySRL corpus when non-agent roles are collapsed into a single role. *p << .01

5 Evaluation

The model in this work is trained on child-directed speech (CDS) using the transcribed portions (Connor et al., 2008) of CHILDES (MacWhinney, 2000). Chunking is performed using a basic noun-chunker from NLTK (Bird et al., 2009). Based on an initial analysis of chunker performance, yes is hand-corrected to not be a noun. Poor chunker performance is likely due to a mismatch between the chunker's training and testing domains (Wall Street Journal text vs transcribed speech), but chunking noise may be a good estimation of learner uncertainty, so the remaining text is left uncorrected. All noun phrase chunks are then replaced with their final noun (presumed the head) to approximate the ability of children to distinguish nouns from modifiers (Waxman and Booth, 2001). Finally, for each sentence, the model assigns sentence positions to each word, with the final verb at zero.

Viterbi Expectation-Maximization is performed over each sentence in the corpus to infer the parameters of the model. During the Expectation step, the model uses the current Gaussian parameters to label the nouns in each sentence with argument roles. Since the model is not lexicalized, these roles correspond to the semantic roles most commonly associated with subject and object. The model then chooses the best label sequence for each sentence.

These newly labelled sentences are used during the Maximization step to determine the Gaussian parameters that maximize the likelihood of that labelling. The mean of each Gaussian is updated to the mean position of the words it labels. Similarly, the standard deviation of each Gaussian is updated with the standard deviation of the positions it labels. A learning rate of 0.3 is used to prevent large parameter jumps. The prior probability of each Gaussian is updated as the ratio of that Gaussian's labellings to the total number of labellings from that mixture in the corpus (both steps are sketched below):

    π_ρθ = |G_ρθ| / |G_ρ·|                                        (1)

where ρ ∈ {S, O} and θ ∈ {C, N}.
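A minimal sketch of one Viterbi EM iteration under these assumptions: the Expectation step exhaustively searches the labellings that respect the one-to-one role constraint (scored by a sequence_probability function like the one sketched in Section 4), and the Maximization step applies the damped updates and the prior update of Equation (1). The exact form of the learning-rate damping and all helper names are assumptions for illustration.

```python
from itertools import combinations, permutations

LABELS = [(r, t) for r in 'SO' for t in 'CN']  # (role, component) pairs

def best_labelling(noun_positions, score):
    """E-step for one sentence: try every way of giving at most one noun to
    each role (all other nouns skipped) and keep the labelling that
    maximizes `score` (e.g. sequence_probability from Section 4)."""
    n = len(noun_positions)
    best, best_p = [None] * n, score([(p, None) for p in noun_positions])
    for k in (1, 2):
        for slots in combinations(range(n), min(k, n)):
            for choice in permutations(LABELS, len(slots)):
                if len({r for r, _ in choice}) < len(slots):
                    continue  # one-to-one bias: each role used at most once
                lab = [None] * n
                for i, l in zip(slots, choice):
                    lab[i] = l
                p = score(list(zip(noun_positions, lab)))
                if p > best_p:
                    best, best_p = lab, p
    return best

def m_step(positions, mu, sigma, rate=0.3):
    """M-step for one Gaussian: move its mean and standard deviation toward
    the statistics of the positions it labelled, damped by the learning
    rate; the interpolation form is an assumption."""
    mean = sum(positions) / len(positions)
    std = (sum((x - mean) ** 2 for x in positions) / len(positions)) ** 0.5
    return mu + rate * (mean - mu), sigma + rate * (std - sigma)

def update_priors(counts):
    """Equation (1): each component's prior is its share of its mixture's
    labellings, e.g. counts = {'S': {'C': 90, 'N': 10}, 'O': {...}}."""
    return {rho: {theta: c / sum(comps.values())
                  for theta, c in comps.items()}
            for rho, comps in counts.items()}
```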
Best results seem to be obtained when the skip-penalty is loosened by an order of magnitude during testing. Essentially, this forces the model to adhere tightly to the perceived argument structure during training, learning more rigid parameters, while the model is allowed more leeway during testing to skip arguments it has less confidence in. Convergence (see Figure 2) tends to occur after four iterations but can take up to ten iterations depending on the initial parameters.

Since the model is unsupervised, it is trained on a given corpus (e.g. Eve) before being tested on the role annotations of that same corpus. The Eve corpus was used for development purposes,^8 and the Adam data was used only for testing.

For testing, this study uses the semantic role annotations in the BabySRL corpus. These annotations were obtained by automatically semantic-role labelling portions of CHILDES with the system of Punyakanok et al. (2008) and then roughly hand-correcting them (Connor et al., 2008). The BabySRL corpus is annotated with 5 different roles, but the model described in this paper only uses 2 roles. Therefore, overall accuracy results (see Table 3) are presented both for the raw BabySRL corpus and for a collapsed BabySRL corpus where all non-agent roles are collapsed into a single role (denoted by a subscript c in all tables).

Since children do not generalize above two arguments during the modelled age range (Goldberg et al., 2004; Bello, 2012), the collapsed numbers more closely reflect the performance of a learner at this age than the raw numbers. The increase in accuracy obtained from collapsing non-agent arguments indicates that children may initially generalize incorrectly to some verbs and would need to learn lexically-specific role assignments (e.g. double-object constructions of give). Since the current work is interested in general filler-gap comprehension at this age, including over unknown verbs, the remaining analyses in this paper consider performance when non-agent arguments are collapsed.^9

Next, a filler-gap version of the BabySRL corpus is created using a coarse filtering process: the new corpus is comprised of all sentences where an associated object precedes the final verb and all sentences where the relevant subject is not immediately followed by the final verb (see Table 4; a sketch of these filters follows). For these filler-gap evaluations, the model is trained on the full version of the corpus in question (e.g. Eve) before being tested on the filler-gap subset of that corpus. The overall results of the filler-gap evaluation (see Table 4) indicate that the model improves significantly at parsing filler-gap constructions after training.

^8 This is included for transparency, though the initial parameters have very little bearing on the final results, as stated in Section 7, so the danger of overfitting to development data is very slight.

^9 Though performance is slightly worse when arguments are not collapsed, all the same patterns emerge.
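A sketch of the coarse Table 4 filters, under the simplifying assumption that each sentence has already been reduced to a flat sequence of role tags; the tag vocabulary and function name are illustrative.

```python
def is_filler_gap(tokens):
    """Table 4 filters: keep a sentence if its subject is not immediately
    followed by the final verb (subject extraction, S x V . . .) or if its
    object precedes the final verb (object extraction, O . . . V . . .).
    `tokens` is a list of tags such as ['S', 'V'] or ['O', 'S', 'V']."""
    verb = max(i for i, t in enumerate(tokens) if t == 'V')  # final verb
    subject_filter = any(i < verb - 1
                         for i, t in enumerate(tokens) if t == 'S')
    object_filter = any(i < verb
                        for i, t in enumerate(tokens) if t == 'O')
    return subject_filter or object_filter

# "[What] did the boy eat?" -> object before the final verb: kept.
print(is_filler_gap(['O', 'S', 'V']))   # True
# "The boy ate the apple." -> canonical SVO: filtered out.
print(is_filler_gap(['S', 'V', 'O']))   # False
```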

   Eve           Subj (n = 691)        Obj (n = 654)
                 P     R     F         P     R     F
   Initial_c    .66   .83   .74       .35   .31   .33
   Trained_c    .64   .84   .72†      .45   .52   .48*

   Adam          Subj (n = 886)        Obj (n = 1050)
                 P     R     F         P     R     F
   Initial_c    .69   .81   .74       .33   .27   .30
   Trained_c    .66   .81   .73       .44   .48   .46*

   Eve           Wh- (n = 689)         That (n = 125)
                 P     R     F         P     R     F
   Initial_c    .63   .45   .53       .43   .48   .45
   Trained_c    .73   .75   .74*      .44   .57   .50†

   Adam          Wh- (n = 748)         That (n = 189)
                 P     R     F         P     R     F
   Initial_c    .50   .37   .42       .50   .50   .50
   Trained_c    .61   .65   .63*      .47   .56   .51†

Table 5: (Top) Subject-extraction accuracy and object-extraction accuracy and (Bottom) wh-relative accuracy and that-relative accuracy, calculated over the Eve and Adam sections of the BabySRL corpus with non-agent roles collapsed into a single role. †p = .02  *p << .01

The performance of the model on role assignment in filler-gap constructions may be analyzed further in terms of how the model performs on subject-extractions compared with object-extractions and in terms of how the model performs on that-relatives compared with wh-relatives (see Table 5).

The model actually performs worse at subject-extractions after training than before training. This is unsurprising because, prior to training, subjects have little-to-no competition for preverbal role assignments; after training, there is a preverbal extracted-object category, which the model can erroneously use. This slight, though significant in Eve, deficit is counter-balanced by a very substantial and significant improvement in object-extraction labelling accuracy.

Similarly, training confers a large and significant improvement for role assignment in wh-relative constructions, but it yields less of an improvement for that-relative constructions. This difference mimics a finding observed in the developmental literature, where children seem slower to acquire comprehension of that-relatives than of wh-relatives (Gagliardi and Lidz, 2010).

6 Comparison to BabySRL

The acquisition of semantic role labelling (SRL) by the BabySRL model (Connor et al., 2008; Connor et al., 2009; Connor et al., 2010) bears many similarities to the current work and is, to our knowledge, the only comparable line of inquiry to the current one. The primary function of BabySRL is to model the acquisition of semantic role labelling while making an idiosyncratic error which infants also make (Gertner and Fisher, 2012): the 1-1 role bias error (John and Mary gorped interpreted as John gorped Mary). Similar to the model presented in this paper, BabySRL is based on simple ordering features such as argument position relative to the verb and argument position relative to the other arguments.

This section will demonstrate that the model in this paper initially reflects 1-1 role bias comparably to BabySRL, though it progresses beyond this bias after training.^10 Further, the model in this paper is able to reflect the concurrent acquisition of filler-gap, whereas BabySRL does not seem well-suited to such a task. Finally, BabySRL performs undesirably in intransitive settings, whereas the model in this paper does not.

Connor et al. (2008) demonstrate that a supervised perceptron classifier, based on positional features and trained on the silver role label annotations of the BabySRL corpus, manifests 1-1 role bias errors. Follow-up studies show that supervision may be lessened (Connor et al., 2009) or removed (Connor et al., 2010) and BabySRL will still reflect a substantial 1-1 role bias.

Connor et al. (2008) and Connor et al. (2009) run direct analyses of how frequently their models make 1-1 role bias errors. A comparable evaluation may be run on the current model by generating 1000 sentences with a structure of NNV and reporting how many times the model chooses a subject-first labelling (see Table 6; a sketch of this evaluation follows).^11

^10 All evaluations in this section are preceded by training on the chunked Eve corpus.

^11 While Table 6 analyzes erroneous labellings of NNV structure, the 'Obj' columns of Table 5 (Top) show model accuracy on NNV structures.
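A sketch of this evaluation, reusing the assumed best_labelling interface from the EM sketch in Section 5; the trial setup (repeated stochastic labelling) is an interpretation of 'generating 1000 sentences', and the same scheme extends to the NVN/NV agent-prediction replication reported in Table 7.

```python
def role_bias_error_rate(label_sentence, n_trials=1000):
    """Present NNV sentences (two nouns at verb-relative positions -2 and
    -1, final verb at 0) and report how often the model labels them
    subject-first (SOV), i.e. commits the 1-1 role bias error of Table 6.
    `label_sentence` stands in for the model's (possibly stochastic)
    labeller; its interface is an assumption for illustration."""
    errors = 0
    for _ in range(n_trials):
        labelling = label_sentence([-2.0, -1.0])
        roles = [lab[0] if lab else None for lab in labelling]
        if roles == ['S', 'O']:  # both nouns labelled, subject first: SOV
            errors += 1
    return errors / n_trials
```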

                              Error rate
   Initial                       .36
   Trained                       .11
   Initial (given 2 args)        .66
   Trained (given 2 args)        .13
   2008 arg-arg position         .65
   2008 arg-verb position         0
   2009 arg-arg position         .82
   2009 arg-verb position        .63

Table 6: 1-1 role bias error in this model compared to the models of Connor et al. (2008) and Connor et al. (2009); that is, how frequently each model labelled an NNV sentence SOV. Since the Connor et al. models are perceptron-based, they require both arguments to be labelled. The model presented in this paper does not share this restriction, so the raw error rate for this model is presented in the first two lines; the error rate once this additional restriction is imposed is given in the second two lines.

                              NVN     NV
   Sents in Eve               1173    1513
   Sents in Adam              1029    1353
   Initial                     .67     1
   Trained                     .65    .96
   Weak (10) lexical           .71    .59
   Strong (365) lexical        .74    .41
   Gold Args                   .77    .58

Table 7: Agent-prediction recall accuracy in transitive (NVN) and intransitive (NV) settings of the model presented in this paper (middle) and the combined model of Connor et al. (2010) (bottom), which has features for argument-argument relative position as well as argument-predicate relative position and so is closest to the model presented in this paper.

The results of Connor et al. (2008) and Connor et al. (2009) depend on whether BabySRL uses argument-argument relative position as a feature or argument-verb relative position as a feature (there is no combined model). Further, the model from Connor et al. (2009) presented here has a unique argument constraint, similar to the model in this paper, in order to make comparison as direct as possible.

The 1-1 role bias error rate (before training) of the model presented in this paper is comparable to that of Connor et al. (2008) and Connor et al. (2009), which shows that the current model provides comparable developmental modeling benefits to the BabySRL models. Further, similar to real children (see Figure 1), the model presented in this paper develops beyond this error by the end of its training,^12 whereas the BabySRL models still make this error after training.

Connor et al. (2010) look at how frequently their model correctly labels the agent in transitive and intransitive sentences with unknown verbs (to demonstrate that it exhibits an agent-first bias). This evaluation can be replicated for the current study by generating 1,000 sentences with the transitive form of NVN and a further 1,000 sentences with the intransitive form of NV (see Table 7). Since Connor et al. (2010) investigate the effects of different initial lexicons, this evaluation compares against the resulting BabySRL from each initializer: they initially seed their part-of-speech tagger with either the 10 or 365 most frequent nouns in the corpus, or they dispense with the tagger and use gold part-of-speech tags.

As with subject extraction, the model in this paper gets less accurate after training because of the newly minted extracted-object category that can be mistakenly used in these canonical settings. While the model of Connor et al. (2010) outperforms the model presented here in a transitive setting, their model does much worse in an intransitive setting. The difference in transitive settings stems from increased lexicalization, as is apparent from their results alone; the model presented here initially performs close to their weakly lexicalized model, though training impedes agent-prediction accuracy due to an increased probability of non-canonical objects.

For the intransitive case, however, whereas the model presented in this paper is generally able to successfully label the lone noun as the subject, the model of Connor et al. (2010) chooses to label lone nouns as objects about 40% of the time.
This likely and intransitive sentences with unknown verbs (to stems from their model’s reliance on argument- demonstrate that it exhibits an agent-first bias). argument relative position as a feature; when there This evaluation can be replicated for the current is no additional argument to use for reference, the study by generating 1,000 sentences with the tran- model’s accuracy decreases. This is borne out by sitive form of NVN and a further 1,000 sentences their model (not shown in Table 7) that omits with the intransitive form of NV (see Table 7). the argument-argument relative position feature Since Connor et al. (2010) investigate the effects and solely relies on verb-argument position, which achieves up to 70% accuracy in intransitive set- shows model accuracy on NNV structures. tings. Even in that case, however, BabySRL still 12It is important to note that the unique argument chooses to label lone nouns as objects 30% of the constraint prevents the current model from actually time. The fact that intransitive sentences are more getting the correct, conjoined-subject parse, but it no longer exhibits agent-first bias, an important step for common than transitive sentences in both the Eve acquiring passives, which occurs between 3 and 4 years and Adam sections of the BabySRL corpus sug- (Thatcher et al., 2008). gests that learners should be more likely to assign

The overall reason for the different results between the current work and BabySRL is that BabySRL relies on positional features that measure the relative position of two individual elements (e.g. where a given noun is relative to the verb). Since the model in this paper operates over global orderings, it implicitly takes into account the positions of other nouns as it models argument position relative to the verb; object and subject are in competition as labels for preverbal nouns, so a preverbal object is usually only assigned once a subject has already been detected.

Further, while BabySRL consistently reflects 1-1 role bias (corresponding to a pre-25-month-old learner), it also learns to productively label five roles, which developmental studies have shown does not take place until at least 31 months (Goldberg et al., 2004; Bello, 2012). Finally, it does not seem likely that BabySRL could be easily extended to capture filler-gap acquisition. The argument-verb position features impede acquisition of filler-gap by classifying preverbal arguments as agents, and the argument-argument position features inhibit accurate labelling in intransitive settings and result in an agent-first bias which would tend to label extracted objects as agents. In fact, these observations suggest that any linear classifier which relies on positioning features will have difficulties modeling filler-gap acquisition.

In sum, the unlexicalized model presented in this paper is able to achieve greater labelling accuracy than the lexicalized BabySRL models in intransitive settings, though this model does perform slightly worse in the less common transitive setting. Further, the unsupervised model in this paper initially reflects developmental 1-1 role bias as well as the supervised BabySRL models do, and it is able to progress beyond this bias. Finally, the model presented here provides a cognitive model of the acquisition of filler-gap comprehension, which BabySRL does not seem well-suited to model.

7 Discussion

This paper has presented a simple cognitive model of filler-gap acquisition which is able to capture several findings from developmental psychology. Training significantly improves role labelling in the case of object-extractions, which improves the overall accuracy of the model. This boost is accompanied by a slight decrease in labelling accuracy in subject-extraction settings. The asymmetric ease of subject versus object comprehension is well-documented in both children and adults (Gibson, 1998), and while training improves the model's ability to process object-extractions, there is still a gap between object-extraction and subject-extraction comprehension even after training.

Further, the model exhibits better comprehension of wh-relatives than that-relatives, similar to children (Gagliardi and Lidz, 2010). This could also be an area where a lexicalized model could do better. As Gagliardi and Lidz (2010) point out, whereas wh-relatives such as who or which always signify a filler-gap construction, that can occur for many different reasons (demonstrative, determiner, complementizer, etc) and so is a much weaker filler-gap cue. A lexical model could potentially pick up on clues which could indicate when that is a relativizer, or simply improve its comprehension of wh-relatives even more.

It is interesting to note that the current model does not make use of that as a cue at all and yet is still slower at acquiring that-relatives than wh-relatives. This fact suggests that the findings of Gagliardi and Lidz (2010) may be partially explained by a frequency effect: perhaps the input to children is simply biased such that wh-relatives are much more common than that-relatives (as shown in Table 5).

This model also initially reflects the 1-1 role bias observed in children (Gertner and Fisher, 2012) as well as in previous models (Connor et al., 2008; Connor et al., 2009; Connor et al., 2010), without sacrificing accuracy in canonical intransitive settings.

Finally, this model is extremely robust to different initializations. The canonical Gaussian expectations can begin far from the verb (±3) or close to the verb (±0.1), and the standard deviations of the distributions and the skip-penalty can vary widely; the model always converges to give results comparable to those presented here. The only constraint on the initial parameters is that the probability of the extracted object occurring preverbally must exceed the skip-penalty (i.e. extraction must be possible). In short, this paper describes a simple, robust cognitive model of the development of a learner from 15 months until somewhere between 25 and 30 months old (since 1-1 role bias is no longer present but no more than two arguments are being generalized).

In future, it would be interesting to incorporate lexicalization into the model presented in this paper, as this feature seems likely to bridge the gap between this model and BabySRL in transitive settings. Lexicalization should also help further distinguish modifiers from arguments and improve the overall accuracy of the model.

It would also be interesting to investigate how well this model generalizes to languages besides English. Since the model is able to use the verb position as a semi-permeable boundary between canonical subjects and objects, it may not work as well in verb-final languages, and it thus makes the prediction that filler-gap comprehension may be acquired later in development in such languages due to a greater reliance on hierarchical syntax.

1091 well in verb-final languages, and thus makes the Michael Connor, Yael Gertner, Cynthia Fisher, and prediction that filler-gap comprehension may be Dan Roth. 2009. Minimally supervised model of acquired later in development in such languages early . In Proceedings of the due to a greater reliance on hierarchical syntax. Thirteenth Conference on Computational Natu- Ordering is one of the definining characteris- ral Language Learning. tics of a language that must be acquired by learn- Michael Connor, Yael Gertner, Cynthia Fisher, and ers (e.g. SVO vs SOV), and this work shows that Dan Roth. 2010. Starting from scratch in se- filler-gap comprehension can be acquired as a by- mantic role labelling. In Proceedings of ACL product of learning orderings rather than having to 2010. resort to higher-order syntax. Note that this model Peter Culicover. 2013. Explaining syntax: repre- cannot capture the constraints on filler-gap usage sentations, structures, and computation. Oxford which require a hierarchical grammar (e.g. subja- University Press. cency), but such knowledge is really only needed for successful production of filler-gap construc- Jill de Villiers and Thomas Roeper. 1995. Bar- tions, which occurs much later (around 5 years; riers, binding, and acquisition of the dp-np dis- tinction. Language Acquisition, 4(1):73–104. de Villiers and Roeper, 1995). Further, the kind of ordering system proposed in this paper may form Holger Diessel and Michael Tomasello. 2001. The an initial basis for learning such grammars (Jack- acquisition of finite complement clauses in en- endoff and Wittenberg, in press). glish: A corpus-based analysis. Cognitive Lin- guistics, 12:1–45. 8 Acknowledgements Annie Gagliardi and Jeffrey Lidz. 2010. Mor- Thanks to Peter Culicover, William Schuler, Laura phosyntactic cues impact filler-gap dependency Wagner, and the attendees of the OSU 2013 Fall resolution in 20- and 30-month-olds. In Poster Linguistics Colloquium Fest for feedback on this session of BUCLD35. work. This work was partially funded by an OSU Annie Gagliardi, Tara M. Mease, and Jeffrey Dept. of Linguistics Targeted Investment for Ex- Lidz. 2014. Discontinuous development cellence (TIE) grant for collaborative interdisci- in the acquisition of filler-gap dependen- plinary projects conducted during the academic cies: Evidence from 15- and 20-month- year 2012-13. olds. Harvard unpublished manuscript: http://www.people.fas.harvard.edu/ gagliardi. ∼ Yael Gertner and Cynthia Fisher. 2012. Predicted References errors in children’s early sentence comprehen- Nameera Akhtar. 1999. Acquiring basic word or- sion. Cognition, 124:85–94. der: evidence for data-driven learning of syn- Edward Gibson. 1998. Linguistic complexity: tactic structure. Journal of Child Language, Locality of syntactic dependencies. Cognition, 26:339–356. 68(1):1–76. Sophia Bello. 2012. Identifying indirect objects Lila R. Gleitman. 1990. The structural sources of in French: An elicitation task. In Proceedings verb meanings. Language Acquisition, 1:3–55. of the 2012 annual conference of the Canadian Linguistic Association. Adele E. Goldberg, Devin Casenhiser, and Nitya Sethuraman. 2004. Learning argument struc- Steven Bird, Ewan Klein, and Edward Loper. ture generalizations. Cognitive Linguistics, 2009. Natural Language Processing with Python: 14(3):289–316. Analyzing Text with the Natural Language Toolkit. O’Reilly, Beijing. Sharon Goldwater and Tom Griffiths. 2007. A fully Bayesian approach to unsupervised part- Paul Boersma. 1997. 
How we learn variation, op- of-speech tagging. In Proceedings of the 45th tionality, and probability. Proceedings of the In- Annual Meeting of the Association for Compu- stitute of Phonetic Sciences of the University of tational Linguistics. Amsterdam, 21:43–58. Ray Jackendoff and Eva Wittenberg. in press. Harr Chen, S.R.K. Branavan, Regina Barzilay, and What you can say without syntax: A hierarchy David R. Karger. 2009. Content modeling using of grammatical complexity. In Fritz Newmeyer latent permutations. Journal of Artificial Intel- and Lauren Preston, editors, Measuring Linguis- ligence Research, 36:129–163. tic Complexity. Oxford University Press. Michael Connor, Yael Gertner, Cynthia Fisher, and Aravind K. Joshi, K. Vijay Shanker, and David Dan Roth. 2008. Baby srl: Modeling early lan- Weir. 1990. The convergence of mildly context- guage acquisition. In Proceedings of the Twelfth sensitive grammar formalisms. Technical Report Conference on Computational Natural Language MS-CIS-90-01, Department of Computer and In- Learning. formation Science, University of Pennsylvania.

Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.

Tom Kwiatkowski, Sharon Goldwater, Luke S. Zettlemoyer, and Mark Steedman. 2012. A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In Proceedings of EACL 2012.

Brian MacWhinney. 2000. The CHILDES project: Tools for analyzing talk. Lawrence Erlbaum Associates, Mahwah, NJ, third edition.

David Marr. 1982. Vision. A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman and Company.

Letitia R. Naigles. 1990. Children use syntax to learn verb meanings. Journal of Child Language, 17:357-374.

Colin Phillips. 2010. Some arguments and non-arguments for reductionist accounts of syntactic phenomena. Language and Cognitive Processes, 28:156-187.

Martin Pickering and Guy Barry. 1991. Sentence processing without empty categories. Language and Cognitive Processes, 6(3):229-259.

Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257-287.

John R. Ross. 1967. Constraints on Variables in Syntax. Ph.D. thesis, Massachusetts Institute of Technology.

William Schuler. 2011. Effects of filler-gap dependencies on working memory requirements for parsing. In Proceedings of COGSCI, pages 501-506, Austin, TX. Cognitive Science Society.

Amanda Seidl, George Hollich, and Peter W. Jusczyk. 2003. Early understanding of subject and object wh-questions. Infancy, 4(3):423-436.

Rushen Shi, Janet F. Werker, and James L. Morgan. 1999. Newborn infants' sensitivity to perceptual cues to lexical and grammatical words. Cognition, 72(2):B11-B21.

Katherine Thatcher, Holly Branigan, Janet McLean, and Antonella Sorace. 2008. Children's early acquisition of the passive: Evidence from syntactic priming. In Proceedings of the Child Language Seminar 2007, pages 195-205, University of Reading.

Ivan Titov and Alexandre Klementiev. 2012. Crosslingual induction of semantic roles. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics.

Sandra R. Waxman and Amy E. Booth. 2001. Seeing pink elephants: Fourteen-month-olds' interpretations of novel nouns and adjectives. Cognitive Psychology, 43:217-242.

Sylvia Yuan, Cynthia Fisher, and Jesse Snedeker. 2012. Counting the nouns: Simple structural cues to verb meaning. Child Development, 83(4):1382-1399.
