Universal Dependencies for Learner English

Yevgeni Berzak, CSAIL MIT; Jessica Kenney, EECS & Linguistics MIT; Carolyn Spadine, Linguistics MIT; Jing Xian Wang, EECS MIT

Lucia Lam, MechE MIT; Keiko Sophie Mori, Linguistics MIT; Sebastian Garza, Linguistics MIT; Boris Katz, CSAIL MIT

Abstract

We introduce the Treebank of Learner English (TLE), the first publicly available syntactic treebank for English as a Second Language (ESL). The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English (FCE) corpus. The UD annotations are tied to a pre-existing error annotation of the FCE, whereby full syntactic analyses are provided for both the original and error corrected versions of each sentence. Further on, we delineate ESL annotation guidelines that allow for consistent syntactic treatment of ungrammatical English. Finally, we benchmark POS tagging and dependency parsing performance on the TLE dataset and measure the effect of grammatical errors on parsing accuracy. We envision the treebank to support a wide range of linguistic and computational research on second language acquisition as well as automatic processing of ungrammatical language.[1]

[1] The treebank is available at universaldependencies.org. The annotation manual used in this project and a graphical query engine are available at esltreebank.org.

1 Introduction

The majority of the English text available worldwide is generated by non-native speakers (Crystal, 2003). Such texts introduce a variety of challenges, most notably grammatical errors, and are of paramount importance for the scientific study of language acquisition as well as for NLP. Despite the ubiquity of non-native English, there is currently no publicly available syntactic treebank for English as a Second Language (ESL).

To address this shortcoming, we present the Treebank of Learner English (TLE), a first of its kind resource for non-native English, containing 5,124 sentences manually annotated with POS tags and dependency trees. The TLE sentences are drawn from the FCE dataset (Yannakoudakis et al., 2011), and authored by English learners from 10 different native language backgrounds. The treebank uses the Universal Dependencies (UD) formalism (De Marneffe et al., 2014; Nivre et al., 2016), which provides a unified annotation framework across different languages and is geared towards multilingual NLP (McDonald et al., 2013). This characteristic allows our treebank to support computational analysis of ESL using not only English based but also multilingual approaches which seek to relate ESL phenomena to native language syntax.

While the annotation inventory and guidelines are defined by the English UD formalism, we build on previous work in learner language analysis (Díaz-Negrillo et al., 2010; Dickinson and Ragheb, 2013) to formulate an additional set of annotation conventions aiming at a uniform treatment of ungrammatical learner language. Our annotation scheme uses a two-layer analysis, whereby a distinct syntactic annotation is provided for the original and the corrected version of each sentence. This approach is enabled by a pre-existing error annotation of the FCE (Nicholls, 2003) which is used to generate an error corrected variant of the dataset. Our inter-annotator agreement results provide evidence for the ability of the annotation scheme to support consistent annotation of ungrammatical structures.

Finally, a corpus that is annotated with both grammatical errors and syntactic dependencies paves the way for empirical investigation of the relation between grammaticality and syntax. Understanding this relation is vital for improving tagging and parsing performance on learner language (Geertzen et al., 2013), syntax based grammatical error correction (Tetreault et al., 2010; Ng et al., 2014), and many other fundamental challenges in NLP. In this work, we take the first step in this direction by benchmarking tagging and parsing accuracy on our dataset under different training regimes, and obtaining several estimates for the impact of grammatical errors on these tasks.

To summarize, this paper presents three contributions. First, we introduce the first large scale syntactic treebank for ESL, manually annotated with POS tags and universal dependencies. Second, we describe a linguistically motivated annotation scheme for ungrammatical learner English and provide empirical support for its consistency via inter-annotator agreement analysis. Third, we benchmark a state of the art parser on our dataset and estimate the influence of grammatical errors on the accuracy of automatic POS tagging and dependency parsing.

The remainder of this paper is structured as follows. We start by presenting an overview of the treebank in section 2. In sections 3 and 4 we provide background information on the annotation project, and review the main annotation stages leading to the current form of the dataset. The ESL annotation guidelines are summarized in section 5. Inter-annotator agreement analysis is presented in section 6, followed by parsing experiments in section 7. Finally, we review related work in section 8 and present the conclusion in section 9.

2 Treebank Overview

The TLE currently contains 5,124 sentences (97,681 tokens) with POS tag and dependency annotations in the English Universal Dependencies (UD) formalism (De Marneffe et al., 2014; Nivre et al., 2016). The sentences were obtained from the FCE corpus (Yannakoudakis et al., 2011), a collection of upper intermediate English learner essays, containing error annotations with 75 error categories (Nicholls, 2003). Sentence level segmentation was performed using an adaptation of the NLTK sentence tokenizer.[2] Under-segmented sentences were split further manually. Word level tokenization was generated using the Stanford PTB word tokenizer.[3]

The treebank represents learners with 10 different native language backgrounds: Chinese, French, German, Italian, Japanese, Korean, Portuguese, Spanish, Russian and Turkish. For every native language, we randomly sampled 500 automatically segmented sentences, under the constraint that selected sentences have to contain at least one grammatical error that is not punctuation or spelling.

The TLE annotations are provided in two versions. The first version is the original sentence authored by the learner, containing grammatical errors. The second, corrected sentence version, is a grammatical variant of the original sentence, generated by correcting all the grammatical errors in the sentence according to the manual error annotation provided in the FCE dataset. The resulting corrected sentences constitute a parallel corpus of standard English. Table 1 presents basic statistics of both versions of the annotated sentences.

                       original           corrected
  sentences            5,124              5,124
  tokens               97,681             98,976
  sentence length      19.06 (std 9.47)   19.32 (std 9.59)
  errors per sentence  2.67 (std 1.9)     -
  authors              924
  native languages     10

Table 1: Statistics of the TLE. Standard deviations are denoted in parentheses.

To avoid potential annotation biases, the annotations of the treebank were created manually from scratch, without utilizing any automatic annotation tools. To further assure annotation quality, each annotated sentence was reviewed by two additional annotators. To the best of our knowledge, TLE is the first large scale English treebank constructed in a completely manual fashion.

[2] http://www.nltk.org/api/nltk.tokenize.html
[3] http://nlp.stanford.edu/software/tokenizer.shtml
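As a rough illustration of the segmentation and sampling procedure described in section 2, the sketch below uses NLTK's off-the-shelf sentence tokenizer and a hypothetical per-sentence error predicate; the record format, field names and the has_relevant_error parameter are our own assumptions, and the TLE itself was built with an adapted tokenizer and the FCE error annotations rather than this code.

import random
from nltk.tokenize import sent_tokenize  # requires a one-time nltk.download('punkt')

# Hypothetical record format: each essay carries its author's native language.
essays = [
    {"l1": "French", "text": "I am writing to complain. The show were bad."},
    # ... more essays
]

def segment(essays):
    """Split essays into sentences, keeping the native language label."""
    for essay in essays:
        for sent in sent_tokenize(essay["text"]):
            yield {"l1": essay["l1"], "sent": sent}

def sample_per_l1(sentences, has_relevant_error, n=500, seed=0):
    """Sample up to n sentences per native language that contain at least one
    grammatical error other than punctuation or spelling."""
    random.seed(seed)
    by_l1 = {}
    for s in sentences:
        if has_relevant_error(s["sent"]):
            by_l1.setdefault(s["l1"], []).append(s)
    return {l1: random.sample(pool, min(n, len(pool))) for l1, pool in by_l1.items()}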

3 Annotator Training

The treebank was annotated by six students, five undergraduates and one graduate. Among the undergraduates, three are linguistics majors and two are engineering majors with a linguistics minor. The graduate student is a linguist specializing in syntax. An additional graduate student in NLP participated in the final debugging of the dataset.

Prior to annotating the treebank sentences, the annotators were trained for about 8 weeks. During the training, the annotators attended tutorials on dependency grammars, and learned the English UD guidelines,[4] the Penn Treebank POS guidelines (Santorini, 1990), the grammatical error annotation scheme of the FCE (Nicholls, 2003), as well as the ESL guidelines described in section 5 and in the annotation manual.

[4] http://universaldependencies.org/#en

Furthermore, the annotators completed six annotation exercises, in which they were required to annotate POS tags and dependencies for practice sentences from scratch. The exercises were done individually, and were followed by group meetings in which annotation disagreements were discussed and resolved. Each of the first three exercises consisted of 20 sentences from the UD gold standard for English, the English Web Treebank (EWT) (Silveira et al., 2014). The remaining three exercises contained 20-30 ESL sentences from the FCE. Many of the ESL guidelines were introduced or refined based on the disagreements in the ESL practice exercises and the subsequent group discussions. Several additional guidelines were introduced in the course of the annotation process.

During the training period, the annotators also learned to use a search tool that enables formulating queries over word and POS tag sequences as regular expressions and obtaining their annotation statistics in the EWT. After experimenting with both textual and graphical interfaces for performing the annotations, we converged on a simple text based format described in section 4.1, where the annotations were filled in using a spreadsheet or a text editor, and tested with a script for detecting annotation typos. The annotators continued to meet and discuss annotation issues on a weekly basis throughout the entire duration of the project.

4 Annotation Procedure

The formation of the treebank was carried out in four steps: annotation, review, disagreement resolution and targeted debugging.

4.1 Annotation

In the first stage, the annotators were given sentences for annotation from scratch. We use a CoNLL based textual template in which each word is annotated in a separate line. Each line contains 6 columns, the first of which has the word index (IND) and the second the word itself (WORD). The remaining four columns had to be filled in with a Universal POS tag (UPOS), a Penn Treebank POS tag (POS), a head word index (HIND) and a dependency relation (REL) according to version 1 of the English UD guidelines.

The annotation section of the sentence is preceded by a metadata header. The first field in this header, denoted with SENT, contains the FCE error coded version of the sentence. The annotators were instructed to verify the error annotation, and add new error annotations if needed. Corrections to the sentence segmentation are specified in the SEGMENT field.[5] Further down, the field TYPO is designated for literal annotation of spelling errors and ill formed words that happen to form valid words (see section 5.2).

[5] The released version of the treebank splits the sentences according to the markings in the SEGMENT field when those apply both to the original and corrected versions of the sentence. Resulting segments without grammatical errors in the original version are currently discarded.

The example below presents a pre-annotated original sentence given to an annotator.

#SENT=That time I had to sleep in a tent.
#SEGMENT=
#TYPO=
#IND WORD UPOS POS HIND REL
1 That
2 time
3 I
4 had
5 to
6 sleep
7 in
8 a
9 tent
10 .

Upon completion of the original sentence, the annotators proceeded to annotate the corrected sentence version. To reduce annotation time, annotators used a script that copies over annotations from the original sentence and updates head indices of tokens that appear in both sentence versions. Head indices and relation labels were filled in only if the head word of the token appeared in both the original and corrected sentence versions. Tokens with automatically filled annotations included an additional # sign in a seventh column of each word's annotation. The # signs had to be removed, and the corresponding annotations either approved or changed as appropriate. Tokens that did not appear in the original sentence version were annotated from scratch.
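For concreteness, the following Python sketch reads files in the template described above into a simple data structure. The function and field names are ours, the reader assumes blank lines separate sentence blocks, and it is not part of the released tooling.

def read_tle_template(path):
    """Parse the 6-column TLE annotation template described above.
    Returns a list of sentences, each with its metadata header fields
    (#SENT, #SEGMENT, #TYPO, ...) and a list of token annotations."""
    sentences, meta, tokens = [], {}, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():                 # blank line ends a sentence block
                if tokens or meta:
                    sentences.append({"meta": meta, "tokens": tokens})
                    meta, tokens = {}, []
            elif line.startswith("#IND"):        # column header row, skip
                continue
            elif line.startswith("#"):           # metadata field, e.g. #SENT=...
                key, _, value = line[1:].partition("=")
                meta[key] = value
            else:                                # IND WORD UPOS POS HIND REL
                fields = line.split()
                # Pre-annotated rows may contain only IND and WORD; zip keeps
                # whatever columns are present.
                tokens.append(dict(zip(["ind", "word", "upos", "pos", "hind", "rel"], fields)))
    if tokens or meta:
        sentences.append({"meta": meta, "tokens": tokens})
    return sentences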

4.2 Review

All annotated sentences were randomly assigned to a second annotator (henceforth reviewer), in a double blind manner. The reviewer's task was to mark all the annotations that they would have annotated differently. To assist the review process, we compiled a list of common annotation errors, available in the released annotation manual.

The annotations were reviewed using an active editing scheme in which an explicit action was required for all the existing annotations. The scheme was introduced to prevent reviewers from overlooking annotation issues due to passive approval. Specifically, an additional # sign was added at the seventh column of each token's annotation. The reviewer then had to either "sign off" on the existing annotation by erasing the # sign, or provide an alternative annotation following the # sign.

4.3 Disagreement Resolution

In the final stage of the annotation process all annotator-reviewer disagreements were resolved by a third annotator (henceforth judge), whose main task was to decide in favor of the annotator or the reviewer. Similarly to the review process, the judging task was carried out in a double blind manner. Judges were allowed to resolve annotator-reviewer disagreements with a third alternative, as well as introduce new corrections for annotation issues overlooked by the reviewers.

Another task performed by the judges was to mark acceptable alternative annotations for ambiguous structures determined through review disagreements or otherwise present in the sentence. These annotations were specified in an additional metadata field called AMBIGUITY. The ambiguity markings are provided along with the resolved version of the annotations.

4.4 Final Debugging

After applying the resolutions produced by the judges, we queried the corpus with debugging tests for specific linguistic constructions. This additional testing phase further reduced the number of annotation errors and inconsistencies in the treebank. Including the training period, the treebank creation lasted over a year, with an aggregate of more than 2,000 annotation hours.
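The active editing scheme of section 4.2 can be illustrated with a short sketch that extracts the annotator-reviewer disagreements to be passed on to a judge. The row layout in the example comment is hypothetical and the code is illustrative rather than the project's actual scripts.

def review_disagreements(reviewed_rows):
    """Extract annotator-reviewer disagreements from an actively edited file.
    Each token row initially ends with a '#' column; the reviewer either
    erases the '#' (approval) or writes an alternative annotation after it.
    Illustrative sketch only."""
    changed = []
    for row in reviewed_rows:
        fields = row.split()
        if "#" in fields:
            alternative = fields[fields.index("#") + 1:]
            if alternative:                      # reviewer proposed a different annotation
                changed.append((fields[0], " ".join(alternative)))
    return changed

# Hypothetical row: "5 where AUX VBD 6 auxpass # ADV WRB 6 advmod" would be
# reported as a disagreement for token 5.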

5 Annotation Scheme for ESL

Our annotations use the existing inventory of English UD POS tags and dependency relations, and follow the standard UD annotation guidelines for English. However, these guidelines were formulated with grammatical usage of English in mind and do not cover non canonical syntactic structures arising due to grammatical errors.[6] To encourage consistent and linguistically motivated annotation of such structures, we formulated a complementary set of ESL annotation guidelines.

[6] The English UD guidelines do address several issues encountered in informal genres, such as the relation "goeswith", which is used for fragmented words resulting from typos.

Our ESL annotation guidelines follow the general principle of literal reading, which emphasizes syntactic analysis according to the observed language usage. This strategy continues a line of work in SLA which advocates for centering analysis of learner language around morpho-syntactic surface evidence (Ragheb and Dickinson, 2012; Dickinson and Ragheb, 2013). Similarly to our framework, which includes a parallel annotation of corrected sentences, such strategies are often presented in the context of multi-layer annotation schemes that also account for error corrected sentence forms (Hirschmann et al., 2007; Díaz-Negrillo et al., 2010; Rosen et al., 2014).

Deploying a strategy of literal annotation within UD, a formalism which enforces cross-linguistic consistency of annotations, will enable meaningful comparisons between non-canonical structures in English and canonical structures in the author's native language. As a result, a key novel characteristic of our treebank is its ability to support cross-lingual studies of learner language.

5.1 Literal Annotation

With respect to POS tagging, literal annotation implies adhering as much as possible to the observed morphological forms of the words. Syntactically, argument structure is annotated according to the usage of the word rather than its typical distribution in the relevant context. The following list of conventions defines the notion of literal reading for some of the common non canonical structures associated with grammatical errors. In the #SENT examples below, [x] marks an extraneous word and [x -> y] an erroneous word together with its FCE correction.

Argument Structure

Extraneous prepositions. We annotate all nominal dependents introduced by extraneous prepositions as nominal modifiers. In the following sentence, "him" is marked as a nominal modifier (nmod) instead of an indirect object (iobj) of "give".

#SENT=...I had to give [to] him water...
...
21 I PRON PRP 22 nsubj
22 had VERB VBD 5 parataxis
23 to PART TO 24 mark
24 give VERB VB 22 xcomp
25 to ADP IN 26 case
26 him PRON PRP 24 nmod
27 water NOUN NN 24 dobj
...

Omitted prepositions. We treat nominal dependents of a predicate that are lacking a preposition as arguments rather than nominal modifiers. In the example below, "money" is marked as a direct object (dobj) instead of a nominal modifier (nmod) of "ask". As "you" functions in this context as a second argument of "ask", it is annotated as an indirect object (iobj) instead of a direct object (dobj).

#SENT=...I have to ask you for the money [of -> for] the tickets back.
...
12 I PRON PRP 13 nsubj
13 have VERB VBP 2 conj
14 to PART TO 15 mark
15 ask VERB VB 13 xcomp
16 you PRON PRP 15 iobj
17 the DET DT 18 det
18 money NOUN NN 15 dobj
19 of ADP IN 21 case
20 the DET DT 21 det
21 tickets NOUN NNS 18 nmod
22 back ADV RB 15 advmod
23 . PUNCT . 2 punct

Tense. Cases of erroneous tense usage are annotated according to the morphological tense of the verb. For example, below we annotate "shopping" with present participle VBG, while the correction "shop" is annotated in the corrected version of the sentence as VBP.

#SENT=...when you [shopping -> shop]...
...
4 when ADV WRB 6 advmod
5 you PRON PRP 6 nsubj
6 shopping VERB VBG 12 advcl
...

Word Formation. Erroneous word formations that are contextually plausible and can be assigned with a PTB tag are annotated literally. In the following example, "stuffs" is handled as a plural count noun.

#SENT=...into fashionable [stuffs -> stuff]...
...
7 into ADP IN 9 case
8 fashionable ADJ JJ 9 amod
9 stuffs NOUN NNS 2 ccomp
...

In particular, ill formed adjectives that have a plural suffix receive a standard adjectival POS tag. When applicable, such cases also receive an additional marking for unnecessary agreement in the error annotation using the attribute "ua".

#SENT=...[interestings -> interesting] things...
...
6 interestings ADJ JJ 7 amod
7 things NOUN NNS 3 dobj
...

Similarly, in the example below we annotate "necessaryiest" as a superlative.

#SENT=The necessaryiest things...
1 The DET DT 3 det
2 necessaryiest ADJ JJS 3 amod
3 things NOUN NNS 0 root
...

5.2 Exceptions to Literal Annotation

Although our general annotation strategy for ESL follows literal sentence readings, several types of word formation errors make such readings uninformative or impossible, essentially forcing certain words to be annotated using some degree of interpretation (Rosén and De Smedt, 2010). We hence annotate the following cases in the original sentence according to an interpretation of an intended word meaning, obtained from the FCE error correction.

Spelling. Spelling errors are annotated according to the correctly spelled version of the word. To support error analysis of automatic annotation tools, misspelled words that happen to form valid words are annotated in the metadata field TYPO for POS tags with respect to the most common usage of the misspelled word form. In the example below, the TYPO field contains the typical POS annotation of "where", which is clearly unintended in the context of the sentence.

#SENT=...we [where -> were] invited to visit...
#TYPO=5 ADV WRB
...
4 we PRON PRP 6 nsubjpass
5 where AUX VBD 6 auxpass
6 invited VERB VBN 0 root
7 to PART TO 8 mark
8 visit VERB VB 6 xcomp
...

Word Formation. Erroneous word formations that cannot be assigned with an existing PTB tag are annotated with respect to the correct word form.

#SENT=I am [writting -> writing]...
1 I PRON PRP 3 nsubj
2 am AUX VBP 3 aux
3 writting VERB VBG 0 root
...

Wrong word formations that result in a valid, but contextually implausible word form are also annotated according to the word correction. In the example below, the nominal form "sale" is likely to be an unintended result of an ill formed verb. Similarly to spelling errors that result in valid words, we mark the typical literal POS annotation in the TYPO metadata field.

#SENT=...they do not [sale -> sell] them...
#TYPO=15 NOUN NN
...
12 they PRON PRP 15 nsubj
13 do AUX VBP 15 aux
14 not PART RB 15 neg
15 sale VERB VB 0 root
16 them PRON PRP 15 dobj
...

Taken together, our ESL conventions cover many of the annotation challenges related to grammatical errors present in the TLE. In addition to the presented overview, the complete manual of ESL guidelines used by the annotators is publicly available. The manual contains further details on our annotation scheme, additional annotation guidelines and a list of common annotation errors. We plan to extend and refine these guidelines in future releases of the treebank.

6 Editing Agreement

We utilize our two step review process to estimate agreement rates between annotators.[7] We measure agreement as the fraction of annotation tokens approved by the editor. Table 2 presents the agreement between annotators and reviewers, as well as the agreement between reviewers and the judges. Agreement measurements are provided for both the original and the corrected versions of the dataset.

[7] All experimental results on agreement and parsing exclude punctuation tokens.

                                 UPOS    POS     HIND    REL
  Annotator-Reviewer, original   98.83   98.35   97.74   96.98
  Annotator-Reviewer, corrected  99.02   98.61   97.97   97.20
  Reviewer-Judge, original       99.72   99.68   99.37   99.15
  Reviewer-Judge, corrected      99.80   99.77   99.45   99.28

Table 2: Inter-annotator agreement on the entire TLE corpus. Agreement is measured as the fraction of tokens that remain unchanged after an editing round. The four evaluation columns correspond to universal POS tags, PTB POS tags, unlabeled attachment, and dependency labels. Cohen's Kappa scores (Cohen, 1960) for POS tags and dependency labels in all evaluation conditions are above 0.96.

Overall, the results indicate a high agreement rate in the two editing tasks. Importantly, the gap between the agreement on the original and corrected sentences is small. Note that this result is obtained despite the introduction of several ESL annotation guidelines in the course of the annotation process, which inevitably increased the number of edits related to grammatical errors. We interpret this outcome as evidence for the effectiveness of the ESL annotation scheme in supporting consistent annotations of learner language.
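The agreement figures in Table 2 are fractions of tokens left unchanged by the editor, while the Kappa scores additionally correct for chance agreement. Below is a minimal sketch of both computations, assuming two parallel label sequences (e.g. the annotator's and the reviewer's POS tags) with punctuation tokens already removed; the variable names in the usage comment are hypothetical.

from collections import Counter

def agreement_and_kappa(labels_a, labels_b):
    """Raw agreement and Cohen's kappa for two parallel label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from the two marginal label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Example: observed, kappa = agreement_and_kappa(annotator_upos, reviewer_upos)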

7 Parsing Experiments

The TLE enables studying parsing for learner language and exploring relationships between grammatical errors and parsing performance. Here, we present parsing benchmarks on our dataset, and provide several estimates for the extent to which grammatical errors degrade the quality of automatic POS tagging and dependency parsing.

Our first experiment measures tagging and parsing accuracy on the TLE and approximates the global impact of grammatical errors on automatic annotation via performance comparison between the original and error corrected sentence versions. In this, and subsequent experiments, we utilize version 2.2 of the Turbo tagger and Turbo parser (Martins et al., 2013), state of the art tools for statistical POS tagging and dependency parsing.

Table 3 presents tagging and parsing results on a test set of 500 TLE sentences (9,591 original tokens, 9,700 corrected tokens). Results are provided for three different training regimes. The first regime uses the training portion of version 1.3 of the EWT, the UD English treebank, containing 12,543 sentences (204,586 tokens). The second training mode uses 4,124 training sentences (78,541 original tokens, 79,581 corrected tokens) from the TLE corpus. In the third setup we combine these two training corpora. The remaining 500 TLE sentences (9,549 original tokens, 9,695 corrected tokens) are allocated to a development set, not used in this experiment. Parsing of the test sentences was performed on predicted POS tags.

  Test set   Train Set      UPOS    POS     UAS     LA      LAS
  TLEorig    EWT            91.87   94.28   86.51   88.07   81.44
  TLEcorr    EWT            92.9    95.17   88.37   89.74   83.8
  TLEorig    TLEorig        95.88   94.94   87.71   89.26   83.4
  TLEcorr    TLEcorr        96.92   95.17   89.69   90.92   85.64
  TLEorig    EWT+TLEorig    93.33   95.77   90.3    91.09   86.27
  TLEcorr    EWT+TLEcorr    94.27   96.48   92.15   92.54   88.3

Table 3: Tagging and parsing results on a test set of 500 sentences from the TLE corpus. EWT is the English UD treebank. TLEorig are original sentences from the TLE. TLEcorr are the corresponding error corrected sentences.

The EWT training regime, which uses out of domain texts written in standard English, provides the lowest performance on all the evaluation metrics. Performance in this regime is also negatively affected by systematic differences in the EWT annotation of possessive pronouns, expletives and names compared to the UD guidelines, which are utilized in the TLE. In particular, the EWT annotates possessive pronoun UPOS as PRON rather than DET, which leads the UPOS results in this setup to be lower than the PTB POS results. Improved results are obtained using the TLE training data, which, despite its smaller size, is closer in genre and syntactic characteristics to the TLE test set. The strongest PTB POS tagging and parsing results are obtained by combining the EWT with the TLE training data, yielding 95.77 POS accuracy and a UAS of 90.3 on the original version of the TLE test set.

The dual annotation of sentences in their original and error corrected forms enables estimating the impact of grammatical errors on tagging and parsing by examining the performance gaps between the two sentence versions. Averaged across the three training conditions, the POS tagging accuracy on the original sentences is lower than the accuracy on the sentence corrections by 1.0 UPOS and 0.61 POS. Parsing performance degrades by 1.9 UAS, 1.59 LA and 2.21 LAS.

To further elucidate the influence of grammatical errors on parsing quality, table 4 compares performance on tokens in the original sentences appearing inside grammatical error tags to those appearing outside such tags. Although grammatical errors may lead to tagging and parsing errors with respect to any element in the sentence, we expect erroneous tokens to be more challenging to analyze compared to grammatical tokens.

  Tokens           Train Set      UPOS    POS     UAS     LA      LAS
  Ungrammatical    EWT            87.97   88.61   82.66   82.66   74.93
  Grammatical      EWT            92.62   95.37   87.26   89.11   82.7
  Ungrammatical    TLEorig        90.76   88.68   83.81   83.31   77.22
  Grammatical      TLEorig        96.86   96.14   88.46   90.41   84.59
  Ungrammatical    EWT+TLEorig    89.76   90.97   86.32   85.96   80.37
  Grammatical      EWT+TLEorig    94.02   96.7    91.07   92.08   87.41

Table 4: Tagging and parsing results on the original version of the TLE test set for tokens marked with grammatical errors (Ungrammatical) and tokens not marked for errors (Grammatical).

This comparison indeed reveals a substantial difference between the two types of tokens, with an average gap of 5.0 UPOS, 6.65 POS, 4.67 UAS, 6.56 LA and 7.39 LAS. Note that differently from the global measurements in the first experiment, this analysis, which focuses on the local impact of remove/replace errors, suggests a stronger effect of grammatical errors on the dependency labels than on the dependency structure.
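For reference, the evaluation measures reported in Tables 3 and 4 follow the standard definitions sketched below. This is an illustrative implementation rather than the evaluation script used in the paper, and it assumes punctuation tokens have already been excluded.

def tagging_parsing_scores(gold, pred):
    """UPOS and POS accuracy, unlabeled attachment (UAS), label accuracy (LA)
    and labeled attachment (LAS). Each input is a flat list of token tuples
    (upos, pos, head, rel) over the same tokens."""
    n = len(gold)
    upos = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    pos  = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
    uas  = sum(g[2] == p[2] for g, p in zip(gold, pred)) / n
    la   = sum(g[3] == p[3] for g, p in zip(gold, pred)) / n
    las  = sum(g[2] == p[2] and g[3] == p[3] for g, p in zip(gold, pred)) / n
    return {"UPOS": upos, "POS": pos, "UAS": uas, "LA": la, "LAS": las}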

Finally, we measure tagging and parsing performance relative to the fraction of sentence tokens marked with grammatical errors. Similarly to the previous experiment, this analysis focuses on remove/replace rather than insert errors.

Figure 1 presents the average sentential performance as a function of the percentage of tokens in the original sentence marked with grammatical errors. In this experiment, we train the parser on the EWT training set and test on the entire TLE corpus. Performance curves are presented for POS, UAS and LAS on the original and error corrected versions of the annotations. We observe that while the performance on the corrected sentences is close to constant, original sentence performance decreases as the percentage of erroneous tokens in the sentence grows.

[Figure 1: Mean per sentence POS accuracy, UAS and LAS of the Turbo tagger and Turbo parser, as a function of the percentage of original sentence tokens marked with grammatical errors. The tagger and the parser are trained on the EWT corpus, and tested on all 5,124 sentences of the TLE. Points connected by continuous lines denote performance on the original TLE sentences. Points connected by dashed lines denote performance on the corresponding error corrected sentences. The number of sentences whose errors fall within each percentage range appears in parentheses: 0-5 (362), 5-10 (1033), 10-15 (1050), 15-20 (955), 20-25 (613), 25-30 (372), 30-35 (214), 35-40 (175).]

Overall, our results suggest a negative, albeit limited effect of grammatical errors on parsing. This outcome contrasts with a study by Geertzen et al. (2013) which reported a larger performance gap of 7.6 UAS and 8.8 LAS between sentences with and without grammatical errors. We believe that our analysis provides a more accurate estimate of this impact, as it controls for both sentence content and sentence length. The latter factor is crucial, since it correlates positively with the number of grammatical errors in the sentence, and negatively with parsing accuracy.
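The analysis behind Figure 1 can be reproduced in outline as follows. The field names are hypothetical, and the 5% bins mirror the ranges shown in the figure; this is a sketch of the bucketing, not the code used to produce the plot.

def bin_by_error_rate(sentences, bin_width=5, max_pct=40):
    """Group sentences by the percentage of their tokens marked with
    grammatical errors and average a per-sentence score within each bin.
    Each sentence is a dict with hypothetical fields 'n_tokens',
    'n_error_tokens' and 'score' (e.g. sentence-level UAS)."""
    bins = {lo: [] for lo in range(0, max_pct, bin_width)}
    for s in sentences:
        pct = 100.0 * s["n_error_tokens"] / s["n_tokens"]
        lo = min(int(pct // bin_width) * bin_width, max_pct - bin_width)
        bins[lo].append(s["score"])
    return {f"{lo}-{lo + bin_width}": sum(v) / len(v) for lo, v in bins.items() if v}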
An additional syntactic dataset for and distributional forms as the error corrected ESL, currently not available publicly, are 1,000 words. However, we account for morphological sentences from the EFCamDat dataset (Geertzen forms only when these constitute valid existing et al., 2013), annotated with Stanford dependen- PTB POS tags and are contextually plausible. Fur- cies (De Marneffe and Manning, 2008). This thermore, while the internal structure of invalid dataset was used to measure the impact of gram- word forms is an interesting object of investiga- matical errors on parsing by comparing perfor- tion, we believe that it is more suitable for anno- mance on sentences with grammatical errors to er- tation as word features rather than POS tags. Our ror free sentences. The TLE enables a more direct treebank supports the addition of such features to way of estimating the magnitude of this perfor- the existing annotations. mance gap by comparing performance on the same The work of Ragheb and Dickinson (2009; sentences in their original and error corrected ver- 2012; 2013) proposes ESL annotation guidelines sions. Our comparison suggests that the effect of for POS tags and syntactic dependencies based on grammatical errors on parsing is smaller that the the CHILDES annotation framework. This ap- one reported in this study.

9 Conclusion

We present the first large scale treebank of learner language, manually annotated and double-reviewed for POS tags and universal dependencies. The annotation is accompanied by a linguistically motivated framework for handling syntactic structures associated with grammatical errors. Finally, we benchmark automatic tagging and parsing on our corpus, and measure the effect of grammatical errors on tagging and parsing quality. The treebank will support empirical study of learner syntax in NLP and second language acquisition.

10 Acknowledgements

We thank Anna Korhonen for helpful discussions and insightful comments on this paper. We also thank Dora Alexopoulou, Andrei Barbu, Markus Dickinson, Sue Felshin, Jeroen Geertzen, Yan Huang, Detmar Meurers, Sampo Pyysalo, Roi Reichart and the anonymous reviewers for valuable feedback on this work. This material is based upon work supported by the Center for Brains, Minds, and Machines (CBMM), funded by NSF STC award CCF-1231216.

References

Aoife Cahill, Binod Gyawali, and James V. Bruno. 2014. Self-training for parsing learner text. In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, pages 66-73.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37.

David Crystal. 2003. English as a Global Language. Ernst Klett Sprachen.

Marie-Catherine De Marneffe and Christopher D. Manning. 2008. Stanford typed dependencies manual. Technical report, Stanford University.

Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In Proceedings of LREC, pages 4585-4592.

Ana Díaz-Negrillo, Detmar Meurers, Salvador Valera, and Holger Wunsch. 2010. Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum, 36(1-2):139-154.

Markus Dickinson and Marwa Ragheb. 2009. Dependency annotation for learner corpora. In Proceedings of the Eighth Workshop on Treebanks and Linguistic Theories (TLT-8), pages 59-70.

Markus Dickinson and Marwa Ragheb. 2013. Annotation for learner English guidelines, v. 0.1. Technical report, Indiana University, Bloomington, IN, June 9, 2013.

Jennifer Foster, Joachim Wagner, and Josef van Genabith. 2008. Adapting a WSJ-trained parser to grammatically noisy text. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 221-224. Association for Computational Linguistics.

Jennifer Foster. 2007. Treebanks gone bad. International Journal of Document Analysis and Recognition (IJDAR), 10(3-4):129-145.

Jeroen Geertzen, Theodora Alexopoulou, and Anna Korhonen. 2013. Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In Proceedings of the 31st Second Language Research Forum. Somerville, MA: Cascadilla Proceedings Project.

Hagen Hirschmann, Seanna Doolittle, and Anke Lüdeling. 2007. Syntactic annotation of non-canonical linguistic structures.

André F. T. Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In ACL (2), pages 617-622.

Ryan T. McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Oscar Täckström, et al. 2013. Universal dependency annotation for multilingual parsing. In ACL (2), pages 92-97.

Ryo Nagata, Edward Whittaker, and Vera Sheinman. 2011. Creating a manually error-tagged and shallow-parsed learner corpus. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 1210-1219. Association for Computational Linguistics.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In CoNLL Shared Task, pages 1-14.

Diane Nicholls. 2003. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 Conference, pages 572-581.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Marwa Ragheb and Markus Dickinson. 2012. Defining syntax for learner language annotation. In COLING (Posters), pages 965-974.

Victoria Rosén and Koenraad De Smedt. 2010. Syntactic annotation of learner corpora. Systematisk, variert, men ikke tilfeldig, pages 120-132.

Alexandr Rosen, Jirka Hana, Barbora Štindlová, and Anna Feldman. 2014. Evaluating and automating the annotation of a learner corpus. Language Resources and Evaluation, 48(1):65-92.

Beatrice Santorini. 1990. Part-of-speech tagging guidelines for the Penn Treebank Project (3rd revision). Technical Reports (CIS).

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel R. Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014).

Joel Tetreault, Jennifer Foster, and Martin Chodorow. 2010. Using parse features for preposition selection and error detection. In Proceedings of the ACL 2010 Conference Short Papers, pages 353-358. Association for Computational Linguistics.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In ACL, pages 180-189.