
Generating Artificial Errors for Grammatical Error Correction

Mariano Felice and Zheng Yuan
Computer Laboratory, University of Cambridge, United Kingdom
[email protected]  [email protected]

Abstract

This paper explores the generation of artificial errors for correcting grammatical mistakes made by learners of English as a second language. Artificial errors are injected into a set of error-free sentences in a probabilistic manner using statistics from a corpus. Unlike previous approaches, we use linguistic information to derive error generation probabilities and build corpora to correct several error types, including open-class errors. In addition, we also analyse the variables involved in the selection of candidate sentences. Experiments using the NUCLE corpus from the CoNLL 2013 shared task reveal that: 1) training on artificially created errors improves precision at the expense of recall and 2) different types of linguistic information are better suited for correcting different error types.

1 Introduction

Building error correction systems using machine learning techniques can require a considerable amount of annotated data which is difficult to obtain. Available error-annotated corpora are often focused on particular groups of people (e.g. non-native students), error types (e.g. spelling, syntax), genres (e.g. university essays, letters) or topics so it is not clear how representative they are or how well systems based on them will generalise. On the other hand, building new corpora is not always a viable solution since error annotation is expensive. As a result, researchers have tried to overcome these limitations either by compiling corpora automatically from the web (Mizumoto et al., 2011; Tajiri et al., 2012; Cahill et al., 2013) or using artificial corpora which are cheaper to produce and can be tailored to their needs.

Artificial error generation allows researchers to create very large error-annotated corpora with little effort and control variables such as topic and error types. Errors can be injected into candidate texts using a deterministic approach (e.g. fixed rules) or probabilities derived from manually annotated samples in order to mimic real data. Although artificial errors have been used in previous work, we present a new approach based on linguistic information and evaluate it using the test data provided for the CoNLL 2013 shared task on grammatical error correction (Ng et al., 2013).

Our work makes the following contributions. First, we are the first to use linguistic information (such as part-of-speech (PoS) information or semantic classes) to characterise contexts of naturally occurring errors and replicate them in error-free text. Second, we apply our technique to a larger number of error types than any other previous approach, including open-class errors. The resulting datasets are used to train error correction systems aimed at learners of English as a second language (ESL). Finally, we provide a detailed description of the variables that affect artificial error generation.

2 Related work

The use of artificial data to train error correction systems has been explored by other researchers using a variety of techniques.

Izumi et al. (2003), for example, use artificial errors to target article mistakes made by Japanese learners of English. A corpus is created by replacing a, an, the or the zero article by a different article chosen at random in more than 7,500 correct sentences and used to train a maximum entropy model. Results show an improvement for omission errors but no change for replacement errors.
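This kind of uniform random injection is simple to sketch. The snippet below is only an illustration of the idea, not the setup used by Izumi et al.; the function name, tokenisation and example sentence are invented for the purpose of the example:

    import random

    ARTICLES = ["a", "an", "the", ""]  # "" stands for the zero article

    def inject_article_error(tokens):
        """Replace one article with a different one chosen uniformly at random
        (deleting it when the zero article is drawn)."""
        positions = [i for i, t in enumerate(tokens) if t.lower() in ("a", "an", "the")]
        if not positions:
            return tokens
        i = random.choice(positions)
        choices = [a for a in ARTICLES if a != tokens[i].lower()]
        new = random.choice(choices)
        return tokens[:i] + ([new] if new else []) + tokens[i + 1:]

    print(inject_article_error("She adopted a puppy from the shelter".split()))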

Brockett et al. (2006) describe the use of a statistical machine translation (SMT) system for correcting a set of 14 countable/uncountable nouns which are often confusing for ESL learners. Their training corpus consists of a large number of sentences extracted from news articles which were deliberately modified to include typical countability errors based on evidence from a Chinese learner corpus. Their approach to artificial error injection is deterministic, using hand-coded rules to change quantifiers (much → many), generate plurals (advice → advices) or insert unnecessary determiners. Experiments show their system was generally able to beat the standard Word 2003 grammar checker, although it produced a relatively higher rate of erroneous corrections.

SMT systems are also used by Ehsan and Faili (2013) to correct grammatical errors and context-sensitive spelling mistakes in English and Farsi. Training corpora are obtained by injecting artificial errors into well-formed sentences using predefined error templates. Whenever an original sentence from the corpus matches one of these templates, a pair of correct and incorrect sentences is generated. This process is repeated multiple times if a single sentence matches more than one error template, thereby generating many pairs for the same original sentence. A comparison between the proposed systems and rule-based grammar checkers shows they are complementary, with a hybrid system achieving the best performance.

2.1 Probabilistic approaches

A few researchers have explored probabilistic methods in an attempt to mimic real data more accurately. Foster and Andersen (2009), for example, describe a tool for generating artificial errors based on statistics from other corpora, such as the Cambridge Learner Corpus (CLC; http://www.cup.cam.ac.uk/gb/elt/catalogue/subject/custom/item3646603/Cambridge-International-Corpus-Cambridge-Learner-Corpus/). Their experiments show a drop in accuracy when artificial sentences are used as a replacement for real incorrect sentences, suggesting that they may not be as useful as genuine text. Their report also includes an extensive summary of previous work in the area.

Rozovskaya and Roth propose more sophisticated probabilistic methods to generate artificial errors for articles (2010a) and prepositions (2010b; 2011), also based on statistics from an ESL corpus. In particular, they compile a set of sentences from the English Wikipedia and apply the following generation methods:

General: Target words (e.g. articles) are replaced with others of the same class with probability x (varying from 0.05 to 0.18). Each new word is chosen uniformly at random.

Distribution before correction (in ESL data): Target words in the error-free text are changed to match the distribution observed in ESL error-annotated data before any correction is made.

Distribution after correction (in ESL data): Target words in the error-free text are changed to match the distribution observed in ESL error-annotated data after corrections are made.

Native language-specific distributions: It has been observed that second language production is affected by a learner's native language (L1) (Lee and Seneff, 2008; Leacock et al., 2010). A common example is the difficulty in using English articles appropriately by learners whose L1 has no article system, such as Russian or Japanese. Because word choice errors follow systematic patterns (i.e. they do not occur randomly), this information is extremely valuable for generating errors more accurately. L1-specific errors can be imitated by estimating word confusions in an error-annotated ESL corpus and using these distributions to change target words accordingly in error-free text.

More specifically, if we estimate P(source | target) in an error-tagged corpus (i.e. the probability of an incorrect source word being used when the correct target is expected), we can generate more accurate confusion sets where each candidate has an associated probability depending on the observed word. For example, supposing that a group of learners use the preposition to in 10% of cases where the preposition for should be used (that is, P(source=to | target=for) = 0.10), we can replicate this error pattern by replacing the occurrences of the preposition for with to with a probability of 0.10 in a corpus of error-free sentences. When the source and target words are the same, P(source=x | target=x) expresses the probability that a learner produces the correct/expected word.
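The replication step itself can be sketched in a few lines. The snippet below is a minimal illustration, assuming the confusion distributions have already been estimated elsewhere; the dictionary and its values simply mirror the to/for example above and are not real corpus estimates:

    import random

    # P(source | target) for a single target word, mirroring the example above.
    CONFUSIONS = {
        "for": {"for": 0.90, "to": 0.10},
    }

    def replicate_errors(tokens):
        """Replace each token that has a confusion distribution with a source
        word sampled from P(source | target); other tokens are left untouched."""
        output = []
        for token in tokens:
            distribution = CONFUSIONS.get(token.lower())
            if distribution:
                sources, probabilities = zip(*distribution.items())
                token = random.choices(sources, weights=probabilities, k=1)[0]
            output.append(token)
        return output

    print(replicate_errors("I bought a present for my sister".split()))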

Because errors are generally sparse (and therefore error rates are low), replicating mistakes based on observed probabilities can easily lead to low recall. In order to address this issue during artificial error generation, Rozovskaya et al. (2012) propose an inflation method that boosts confusion probabilities in order to generate a larger proportion of artificial instances. This reformulation is shown to improve F-scores when correcting determiners and prepositions.

Experiments reveal that these approaches yield better results than assuming uniform probabilistic distributions where all errors and corrections are equally likely. In particular, classifiers trained on artificially generated data outperformed those trained on native error-free text (Rozovskaya and Roth, 2010a; Rozovskaya and Roth, 2011). However, it has also been shown that using artificially generated data as a replacement for non-native error-corrected data can lead to poorer performance (Sjöbergh and Knutsson, 2005; Foster and Andersen, 2009). This would suggest that artificial errors are more useful than native data but less useful than corrected non-native data.

Rozovskaya and Roth also control other variables in their experiments. On the one hand, they only evaluate their systems on sentences that have no spelling mistakes so as to avoid degrading performance. This is particularly important when training classifiers on features extracted with linguistic tools (such as parsers or taggers) as they could provide inaccurate results for malformed input. On the other hand, the authors work on a limited set of error types (mainly articles and prepositions) which are closed word classes and therefore have reduced confusion sets. Thus, it becomes interesting to investigate how their ideas extrapolate to open-class error types, like verb form or content word errors.

Their probabilistic error generation approach has also been used by other researchers. Imamura et al. (2012), for example, applied this method to generate artificial incorrect sentences for Japanese particle correction with an inflation factor ranging from 0.0 (no errors) to 2.0 (double error rates). Their results show that the performance of artificial corpora depends largely on the inflation rate but can achieve good results when domain adaptation is applied.

In a more exhaustive study, Cahill et al. (2013) investigate the usefulness of automatically-compiled sentences from Wikipedia revisions for correcting preposition errors. A number of classifiers are trained using error-free text, automatically-compiled annotated corpora and artificial sentences generated using error probabilities derived from Wikipedia revisions and Lang-8 (http://lang-8.com/). Their results reveal a number of interesting points, namely that artificial errors provide competitive results and perform robustly across different test sets. A learning curve analysis also shows system performance increases as more training data is used, both real and artificial.

More recently, some teams have also reported improvements by using artificial data in their submissions to the CoNLL 2013 shared task. Rozovskaya et al. (2013) apply their inflation method to train a classifier for determiner errors that achieves state-of-the-art performance while Yuan and Felice (2013) use naively-generated artificial errors within an SMT framework that places them third in terms of precision.

3 Advanced generation of artificial errors

Our work is based on the hypothesis that using carefully generated artificial errors improves the performance of error correction systems. This implies generating errors in a way that resembles available error-annotated data, using similar texts and accurate injection methods. Like other probabilistic approaches, our method assumes we have access to an error-corrected reference corpus from which we can compute error generation probabilities.

3.1 Base text selection

We analyse a set of variables that we consider important for collecting suitable texts for error injection, namely:

Topic: Replicating errors on texts about the same topic as the training/test data is more likely to produce better results than out-of-domain data, as vocabulary and word senses are more likely to be similar. In addition, similar texts are more likely to exhibit suitable contexts for error injection and consequently help the system focus on particularly useful information.

Genre: In cases where no a priori information about topic is available (for example, because the test set is unknown or the system will be used in different scenarios), knowing the genre or type of text the system will process can also be useful. Example genres include expository (descriptions, essays, reports, etc.), narrative (stories), persuasive (reviews, advertisements, etc.), procedural (instructions, recipes, experiments, etc.) and transactional texts (letters, interviews, etc.).

Style/register: As with the previous aspects, style (colloquial, academic, etc.) and register (from formal written to informal spoken) also affect production and should therefore be modelled accurately in the training data.

Text complexity/language proficiency: Candidate texts should exhibit the same reading complexity as training/test texts and be written by or targeted at learners with similar English proficiency. Otherwise, the overlap in vocabulary and grammatical structures is more likely to be small and thus hinder error injection.

Native language: Because second language production is known to be affected by a learner's L1, using candidate texts produced by groups of the same L1 as the training/test data should provide more suitable contexts for error injection. When such texts are not available, using data by speakers of other L1s that exhibit similar phenomena (e.g. no article system, agglutinative languages, etc.) might also be useful. However, finding error-free texts written in English by a specific population can be difficult, which is why most approaches resort to native English text.

In our experiments, the aforementioned variables are manually controlled although we believe many of them could be assessed automatically. For example, topics could be estimated using text similarity measures, genres could be predicted using structural information and L1s could be inferred using a native language identifier (see the First Edition of the Shared Task on Native Language Identification (Tetreault et al., 2013) at https://sites.google.com/site/nlisharedtask2013/). For an analysis of other variables such as domain and error distributions, the reader should refer to Cahill et al. (2013).

3.2 Error replication

Our approach to artificial error generation is similar to the one proposed by Rozovskaya and Roth (2010a) in that we also estimate probabilities in a corpus of ESL learners which are then used to distort error-free text. However, unlike them, we refine our probabilities by imposing restrictions on the linguistic functions of the words and the contexts where they occur. Because we extend generation to open-class error types (such as verb form errors), this refinement becomes necessary to overcome disambiguation issues and lead to more accurate replication. Our work is the first to exploit linguistic information for error generation, as described below.

Error type distributions: We compute the probability of each error type p(t) occurring over the total number of relevant instances (e.g. noun phrases are relevant instances for article errors). During generation, p(t) is uniformly distributed over all the possible choices for the error type (e.g. for articles, choices are a, an, the or the zero article). Relevant instances are detected in the base text and changed for an alternative at random using the estimated probabilities. The probability of leaving relevant instances unchanged is 1 − p(t).

Morphology: We believe morphological information such as person or number is particularly useful for identifying and correcting specific error types, such as articles, noun number or subject-verb agreement. Thus, we compute the conditional probability of words in specific classes for different morphological contexts (such as noun number or PoS). The following example shows confusion probabilities for singular head nouns requiring an:

    P(source-det=an    | target-det=an, head-noun=NN) = 0.942
    P(source-det=the   | target-det=an, head-noun=NN) = 0.034
    P(source-det=a     | target-det=an, head-noun=NN) = 0.015
    P(source-det=other | target-det=an, head-noun=NN) = 0.005
    P(source-det=∅     | target-det=an, head-noun=NN) = 0.004
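Distributions of this kind can be estimated by simple counting over aligned edits. The sketch below assumes the annotated corpus has already been reduced to (source determiner, target determiner, head-noun tag) tuples, including unchanged instances; this input format is an assumption made for illustration and is not the NUCLE annotation scheme itself:

    from collections import Counter, defaultdict

    def estimate_confusions(instances):
        """Estimate P(source | target, head_tag) from (source, target, head_tag)
        tuples.  Unchanged instances must be included so that
        P(source = target | ...) reflects how often the expected word is
        actually produced."""
        counts = defaultdict(Counter)
        for source, target, head_tag in instances:
            counts[(target, head_tag)][source] += 1
        probabilities = {}
        for context, counter in counts.items():
            total = sum(counter.values())
            probabilities[context] = {src: n / total for src, n in counter.items()}
        return probabilities

    # Toy data shaped like the example above: mostly correct uses of "an"
    # before a singular noun (NN), with a few "the"/"a" confusions.
    instances = [("an", "an", "NN")] * 94 + [("the", "an", "NN")] * 4 + [("a", "an", "NN")] * 2
    print(estimate_confusions(instances)[("an", "NN")])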

PoS disambiguation: Most approaches to artificial error generation are aimed at correcting closed-class words such as articles or prepositions, which rarely occur with a different part of speech in the text. However, when we consider open-class error types, we should perform PoS disambiguation since the same surface form could play different roles in a sentence. For example, consider generating artificial verb form errors for the verb to play after observing its distribution in an error-annotated corpus. By using PoS tags, we can easily determine if an occurrence of the word play is a verb or a noun and thus compute or apply the appropriate probabilities:

    P(source=play  | target=play, PoS=V) = 0.98
    P(source=plays | target=play, PoS=V) = 0.02

    P(source=play  | target=play, PoS=N) = 0.84
    P(source=plays | target=play, PoS=N) = 0.16

Semantic classes: We hypothesise that semantic information about concepts in the sentences can shed light on specific usage patterns that may otherwise be hidden. For example, we could refine confusion sets for prepositions according to the type of object they are applied to (a location, a recipient, an instrument, etc.):

    P(prep=in    | noun class=location) = 0.39
    P(prep=to    | noun class=location) = 0.31
    P(prep=at    | noun class=location) = 0.16
    P(prep=from  | noun class=location) = 0.07
    P(prep=∅     | noun class=location) = 0.05
    P(prep=other | noun class=location) = 0.03

By abstracting from surface forms, we can also generate faithful errors for words that have not been previously observed, e.g. we may have not seen hospital but we may have seen school, my sister's house or church.

Word senses: Polysemous words with the same PoS can exhibit different patterns of usage for each of their meanings (e.g. one meaning may co-occur with a specific preposition more often than the others). For this reason, we introduce probabilities for each word sense in an attempt to capture more accurate usage. As an example, consider a hypothetical situation in which a group of learners confuse prepositions used with the word bank as a financial institution (bank_1 below) but they produce the right preposition when it refers to a river bed (bank_2):

    P(prep=in | noun=bank_1) = 0.76
    P(prep=at | noun=bank_1) = 0.18
    P(prep=on | noun=bank_1) = 0.06

    P(prep=on | noun=bank_2) = 1.00

Although it is rare that occurrences of the same word will refer to different meanings within a document (the so-called 'one sense per discourse' assumption (Gale et al., 1992)), this is not the case when large corpora containing different documents are used for characterising and generating errors. In such scenarios, word sense disambiguation should produce more accurate results.
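Section 4.1 below describes the tools actually used for these two types of information (WordNet classes obtained through NLTK and the WordNet::SenseRelate disambiguator). As a minimal, self-contained illustration of how a noun can be mapped to a WordNet lexicographer class and a sense identifier using NLTK alone, with the first synset standing in for proper disambiguation:

    from nltk.corpus import wordnet as wn  # requires the WordNet data: nltk.download('wordnet')

    def semantic_class(word, pos=wn.NOUN):
        """Return the WordNet lexicographer class of the first synset,
        e.g. 'noun.location' or 'noun.group' (cf. Table 2)."""
        synsets = wn.synsets(word, pos=pos)
        return synsets[0].lexname() if synsets else None

    def word_sense(word, pos=wn.NOUN):
        """Return a sense identifier such as 'bank.n.01' for the chosen synset."""
        synsets = wn.synsets(word, pos=pos)
        return synsets[0].name() if synsets else None

    # Confusion probabilities can then be keyed by class or sense instead of
    # surface form, e.g. P(prep | noun class=location) or P(prep | noun=bank.n.01).
    print(semantic_class("hospital"), word_sense("bank"))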
Table 1 lists the actual probabilities computed from each type of information and the errors they are able to generate.

Information              Probability                                                                    Generated error types
Error type distribution  P(error type)                                                                  ArtOrDet, Nn, Prep, SVA, Vform
Morphology               P(source=determiner | target=determiner, head noun tag)                       ArtOrDet, SVA
                         P(source=verb tag | target=verb tag, subj head noun tag)
PoS disambiguation       P(source=word | target=word, PoS)                                              Nn, Vform
Semantic classes         P(source=determiner | target=determiner, head noun class)                      ArtOrDet, Prep
                         P(source=preposition | target=preposition, head noun class)
Word senses              P(source=preposition | target=preposition, verb sense + obj head noun sense)   ArtOrDet, Prep, SVA
                         P(source=preposition | target=preposition, head noun sense)
                         P(source=preposition | target=preposition, dep adj sense)
                         P(source=determiner | target=determiner, head noun sense)
                         P(source=verb tag | target=verb tag, subj head noun sense)

Table 1: Probabilities computed for each type of linguistic information. Error codes correspond to the five error types in the CoNLL 2013 shared task: ArtOrDet (article or determiner), Nn (noun number), Prep (prepositions), SVA (subject-verb agreement) and Vform (verb form).

4 Experimental setup

4.1 Corpora and tools

We use the NUCLE v2.3 corpus (Dahlmeier et al., 2013) released for the CoNLL 2013 shared task on error correction, which comprises error-annotated essays written in English by students at the National University of Singapore. These essays cover topics such as environmental pollution, health care, welfare, technology, etc. All the sentences were manually annotated by human experts using a set of 27 error types, but we used the filtered version containing only the five types selected for the shared task: ArtOrDet (article or determiner), Nn (noun number), Prep (preposition), SVA (subject-verb agreement) and Vform (verb form) errors. The training set of the NUCLE corpus contains 57,151 sentences and 1,161,567 tokens while the test set comprises 1,381 sentences and 29,207 tokens. The training portion of the corpus was used to estimate the required conditional probabilities and train a few variations of our systems while the test set was reserved to evaluate performance.

Candidate native texts for error injection were extracted from the English Wikipedia, controlling the variables described in Section 3.1 as follows:

Topic: We chose an initial set of 50 Wikipedia articles based on keywords in the NUCLE training data and proceeded to collect related articles by following hyperlinks in their 'See also' section. We retrieved a total of 494 articles which were later preprocessed to remove wikicode tags, yielding 54,945 sentences and approximately 1,123,739 tokens.

Genre: Both NUCLE and Wikipedia contain expository texts, although they are not necessarily similar.

Style/register: Written, academic and formal.

Text complexity/language proficiency: Essays in the NUCLE corpus are written by advanced university students and are therefore comparable to standard English Wikipedia articles. For less sophisticated language, the Simple English Wikipedia could be an alternative.

Native language: English Wikipedia articles are mostly written by native speakers whereas NUCLE essays are not. This is the only discordant variable.

PoS tagging was performed using RASP (Briscoe et al., 2006). Word sense disambiguation was carried out using the WordNet::SenseRelate::AllWords Perl module (Pedersen and Kolhatkar, 2009), which assigns a sense from WordNet (Miller, 1995) to each content word in a text. As for semantic information, we use WordNet classes, which are readily available in NLTK (Bird et al., 2009). WordNet classes correspond to a classification in lexicographers' files (http://wordnet.princeton.edu/man/lexnames.5WN.html) and are defined for content words as shown in Table 2, depending on their location in the hierarchy.

Part of speech   WordNet classification
Adjective        all, pertainyms, participial
Adverb           all
Noun             act, animal, artifact, attribute, body, cognition, communication, event, feeling, food, group, location, motive, object, person, phenomenon, plant, possession, process, quantity, relation, shape, state, substance, time
Verb             body, change, cognition, communication, competition, consumption, contact, creation, emotion, motion, perception, possession, social, stative, weather

Table 2: WordNet classes for content words.

4.2 Error generation

For each type of information in Table 1, we compute the corresponding conditional probabilities using the NUCLE training set. These probabilities are then used to generate five different artificial corpora using the inflation method (Rozovskaya et al., 2012), as listed in Table 3.

Name    Composition
ED      errors based on error type distributions
MORPH   errors based on morphology
POS     errors based on PoS disambiguation
SC      errors based on semantic classes
WSD     errors based on word senses

Table 3: Generated artificial corpora based on different types of linguistic information.
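The inflation step can be sketched as follows. This is one possible reading of the method rather than the implementation used in the paper: error probabilities are multiplied by an inflation factor and the correct word keeps the remaining probability mass, so that more artificial errors are produced than the raw estimates alone would yield.

    def inflate(target, distribution, factor):
        """Boost P(source | target) for every source != target by `factor`
        and give the correct word the remaining probability mass
        (cf. the inflation method of Rozovskaya et al., 2012)."""
        errors = {src: p * factor for src, p in distribution.items() if src != target}
        error_mass = sum(errors.values())
        if error_mass >= 1.0:          # guard against over-inflation
            return {src: p / error_mass for src, p in errors.items()}
        errors[target] = 1.0 - error_mass
        return errors

    # With the to/for distribution from Section 2.1, a factor of 3 triples the
    # chance of injecting an error: {'to': 0.30, 'for': 0.70}.
    print(inflate("for", {"for": 0.90, "to": 0.10}, 3.0))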
4.3 System training

We approach the error correction task as a translation problem from incorrect into correct English. Systems are built using an SMT framework and different combinations of NUCLE and our artificial corpora, where the source side contains incorrect sentences and the target side contains their corrected versions. Our setup is similar to the one described by Yuan and Felice (2013) in that we train a PoS-factored phrase-based model (Koehn, 2010) using Moses (Koehn et al., 2007), Giza++ (Och and Ney, 2003) for word alignment and the IRSTLM Toolkit (Federico et al., 2008) for language modelling. However, unlike them, we do not optimise decoding parameters but use default values instead.

We build 11 different systems in total: a baseline system using only the NUCLE training set, one system per artificial corpus and other additional systems using combinations of the NUCLE training data and our artificial corpora. Each of these systems uses a single translation model that tackles all error types at the same time.

5 Results

Each system was evaluated in terms of precision, recall and F1 on the NUCLE test data using the M2 Scorer (Dahlmeier and Ng, 2012), the official evaluation script for the CoNLL 2013 shared task. Table 4 shows results of evaluation on the original test set (containing only one gold standard correction per error) and a revised version (which allows for alternative corrections submitted by participating teams).

                  Original                                   Revised
System            C    M     U    P      R      F1          C    M     U    P      R      F1
NUCLE (baseline)  181  1462  513  0.2608 0.1102 0.1549      200  1483  495  0.2878 0.1188 0.1682
ED                53   1590  150  0.2611 0.0323 0.0574      62   1621  141  0.3054 0.0368 0.0657
MORPH             74   1569  333  0.1818 0.0450 0.0722      83   1600  324  0.2039 0.0493 0.0794
POS               42   1601  99   0.2979 0.0256 0.0471      42   1641  99   0.2979 0.0250 0.0461
SC                80   1563  543  0.1284 0.0487 0.0706      87   1596  536  0.1396 0.0517 0.0755
WSD               82   1561  305  0.2119 0.0499 0.0808      91   1592  296  0.2351 0.0541 0.0879
NUCLE+ED          173  1470  411  0.2962 0.1053 0.1554      194  1489  390  0.3322 0.1153 0.1712
NUCLE+MORPH       163  1480  427  0.2763 0.0992 0.1460      182  1501  408  0.3085 0.1081 0.1601
NUCLE+POS         164  1479  365  0.3100 0.0998 0.1510      182  1501  347  0.3440 0.1081 0.1646
NUCLE+SC          162  1481  488  0.2492 0.0986 0.1413      181  1502  469  0.2785 0.1075 0.1552
NUCLE+WSD         163  1480  413  0.2830 0.0992 0.1469      181  1502  395  0.3142 0.1075 0.1602

Table 4: Evaluation of our correction systems over the original and revised NUCLE test set using the M2 Scorer. Columns C, M and U show the number of correct, missed and unnecessary corrections suggested by each system. Results in bold show improvements over the baseline.

Results reveal our ED and POS corpora are able to improve precision for both test sets. It is surprising, however, that the least informed dataset (ED) is one of the best performers, although this seems reasonable if we consider it is the only dataset that includes artificial instances for all error types (see Table 1). Hybrid datasets containing the NUCLE training set plus an artificial corpus also generally improve precision, except for NUCLE+SC. It could be argued that the reason for this improvement is corpus size, since our hybrid datasets are double the size of each individual set, but the small differences in precision between the ED and POS datasets and their corresponding hybrid versions seem to contradict that hypothesis. In fact, results would suggest artificial and naturally occurring errors are not interchangeable but rather complementary.

The observed improvement in precision, however, comes at the expense of recall, for which none of the systems is able to beat the baseline. This contradicts results by Rozovskaya and Roth (2010a), who show their error inflation method increases recall, although this could be due to differences in the training paradigm and data. Still, results are encouraging since precision is generally preferred over recall in error correction scenarios (Yuan and Felice, 2013).
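For reference, the precision, recall and F1 values in Tables 4 to 6 follow directly from the correct (C), missed (M) and unnecessary (U) counts reported by the M2 Scorer, as the following check against the baseline row of Table 4 shows:

    def prf(c, m, u):
        """Precision, recall and F1 from correct, missed and unnecessary counts."""
        precision = c / (c + u)
        recall = c / (c + m)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    print(prf(181, 1462, 513))   # ~(0.2608, 0.1102, 0.1549), the NUCLE baseline row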
             ArtOrDet               Nn                     Prep                   SVA/Vform              Other
System       P      R      F1      P      R      F1      P      R      F1      P      R      F1      C  M  U
NUCLE (b)    0.2716 0.1551 0.1974  0.4625 0.0934 0.1555  0.1333 0.0386 0.0599  0.2604 0.1016 0.1462  0  0  34
ED           0.2813 0.0391 0.0687  0.6579 0.0631 0.1152  0.0233 0.0032 0.0056  0.0000 0.0000 —       0  0  5
MORPH        0.1862 0.1058 0.1349  —      0.0000 —       0.0000 0.0000 —       0.1429 0.0041 0.0079  0  0  7
POS          0.0000 0.0000 —       0.4405 0.0934 0.1542  0.0000 0.0000 —       0.1515 0.0203 0.0358  0  0  10
SC           0.1683 0.0739 0.1027  —      0.0000 —       0.0986 0.0932 0.0959  0.0000 0.0000 —       0  0  21
WSD          0.2219 0.1029 0.1406  0.0000 0.0000 —       0.1905 0.0257 0.0453  0.1875 0.0122 0.0229  0  0  8
NUCLE+ED     0.3185 0.1348 0.1894  0.5465 0.1187 0.1950  0.1304 0.0386 0.0596  0.2658 0.0854 0.1292  0  0  35
NUCLE+MORPH  0.2857 0.1507 0.1973  0.4590 0.0707 0.1225  0.1719 0.0354 0.0587  0.2817 0.0813 0.1262  0  0  30
NUCLE+POS    0.3384 0.1290 0.1868  0.4659 0.1035 0.1694  0.1884 0.0418 0.0684  0.2625 0.0854 0.1288  0  0  29
NUCLE+SC     0.2890 0.1290 0.1784  0.4500 0.0682 0.1184  0.1492 0.0868 0.1098  0.2836 0.0772 0.1214  0  0  34
NUCLE+WSD    0.3003 0.1449 0.1955  0.4667 0.0707 0.1228  0.1948 0.0482 0.0773  0.2632 0.0813 0.1242  0  0  30

Table 5: Error type analysis of our correction systems over the original NUCLE test set using the M2 Scorer. Columns C, M and U show the number of correct, missed and unnecessary corrections outside the main categories suggested by each system. Results in bold show improvements over the baseline.

             ArtOrDet               Nn                     Prep                   SVA/Vform              Other
System       P      R      F1      P      R      F1      P      R      F1      P      R      F1      C  M  U
NUCLE (b)    0.3519 0.2026 0.2572  0.6163 0.1302 0.2150  0.2069 0.0682 0.1026  0.4105 0.1718 0.2422  0  0  34
ED           0.4063 0.0579 0.1014  0.7297 0.0684 0.1250  0.0465 0.0077 0.0132  0.1818 0.0183 0.0332  0  0  5
MORPH        0.2270 0.1311 0.1662  —      0.0000 —       0.0000 0.0000 —       0.2857 0.0092 0.0179  0  0  7
POS          0.0000 0.0000 —       0.5465 0.1169 0.1926  0.0000 0.0000 —       0.4242 0.0631 0.1098  0  0  10
SC           0.2112 0.0944 0.1305  —      0.0000 —       0.1088 0.1221 0.1151  0.0000 0.0000 —       0  0  21
WSD          0.2781 0.1313 0.1784  0.0000 0.0000 —       0.2143 0.0347 0.0598  0.2000 0.0138 0.0259  0  0  8
NUCLE+ED     0.4334 0.1849 0.2592  0.7000 0.1552 0.2540  0.1685 0.0575 0.0857  0.4744 0.1630 0.2426  0  0  35
NUCLE+MORPH  0.3791 0.2006 0.2624  0.6308 0.1017 0.1752  0.2295 0.0536 0.0870  0.4714 0.1454 0.2222  0  0  30
NUCLE+POS    0.4601 0.1761 0.2547  0.6087 0.1383 0.2254  0.2424 0.0613 0.0979  0.4430 0.1549 0.2295  0  0  29
NUCLE+SC     0.3961 0.1773 0.2450  0.6154 0.0993 0.1709  0.1844 0.1250 0.1490  0.4848 0.1410 0.2184  0  0  34
NUCLE+WSD    0.3994 0.1933 0.2605  0.6308 0.1017 0.1752  0.2432 0.0690 0.1075  0.4667 0.1535 0.2310  0  0  30

Table 6: Error type analysis of our correction systems over the revised NUCLE test set using the M2 Scorer. Columns C, M and U show the number of correct, missed and unnecessary corrections outside the main categories suggested by each system. Results in bold show improvements over the baseline.

We also evaluated performance by error type on the original (Table 5) and revised (Table 6) test data using an estimation approach similar to the one in CoNLL 2013. Results show that the performance of each dataset varies by error type, suggesting that certain types of information are better suited for specific error types. In particular, we find that on the original test set, ED achieves the highest precision for articles and determiners, WSD maximises precision for prepositions and SC achieves the highest recall and F1. When using hybrid sets, results improve overall, with the highest precision being as follows: NUCLE+POS (ArtOrDet), NUCLE+ED (Nn), NUCLE+WSD (Prep) and NUCLE+SC (SVA/Vform). As expected, the use of alternative annotations in the revised test set improves results but it does not reveal any qualitative difference between datasets.

Finally, when compared to other systems in the CoNLL 2013 shared task in terms of F1, our best systems would rank 9th on both test sets. This would suggest that using an off-the-shelf SMT system trained on a combination of real and artificial data can yield better results than other machine learning techniques (Yi et al., 2013; van den Bosch and Berck, 2013; Berend et al., 2013) or rule-based approaches (Kunchukuttan et al., 2013; Putra and Szabo, 2013; Flickinger and Yu, 2013; Sidorov et al., 2013).

6 Conclusions

This paper presents early results on the generation and use of artificial errors for grammatical error correction. Our approach uses conditional probabilities derived from an ESL error-annotated corpus to replicate errors in native error-free data. Unlike previous work, we propose using linguistic information such as PoS or sense disambiguation to refine the contexts where errors occur and thus replicate them more accurately. We use five different types of information to generate our artificial corpora, which are later evaluated in isolation as well as coupled to the original ESL training data. General results show error distributions and PoS information produce the best results, although this varies when we analyse each error type separately. These results should allow us to generate errors more efficiently in the future by using the best approach for each error type.

We have also observed that precision improves at the expense of recall and this is more pronounced when using purely artificial sets. Finally, artificially generated errors seem to be a complement rather than an alternative to genuine data.

7 Future work

There are a number of issues we plan to address in future research, as described below.

Scaling up artificial data: The experiments presented here use a small and manually selected collection of Wikipedia articles. However, we plan to study the performance of our systems as corpus size is increased. We are currently using a larger selection of Wikipedia articles to produce new artificial datasets ranging from 50K to 5M sentences. The resulting corpora will be used to train new error correction systems and study how precision and recall vary as more data is added during the training process, similar to Cahill et al. (2013).

Reducing differences between datasets: As shown in Table 1, we are unable to produce the same set of errors for each different type of information. This is a limitation of our conditional probabilities, which encode different information in each case. In consequence, comparing overall results between datasets seems unfair as they do not target the same error types. In order to overcome this problem, we will define new probabilities so that we can generate the same types of error in all cases.

Exploring larger contexts: Our current probabilities model error contexts in a limited way, mostly by considering relations between two or three words (e.g. article+noun, verb+preposition+noun, etc.). In order to improve error injection, we will define new probabilities using larger contexts, such as P(source=verb | target=verb, subject class, auxiliary verbs, object class) for verb form errors. Using more specific contexts can also be useful for correcting complex error types, such as the use of pronouns, which often requires analysing coreference chains.

Using new linguistic information: In this work we have used five types of linguistic information. However, we believe other types of information and their associated probabilities could also be useful, especially if we aim to correct more error types. Examples include spelling, grammatical relations (dependencies) and word order (syntax). Additionally, we believe the use of semantic role labels can be explored as an alternative to semantic classes, as they have proved useful for error correction (Liu et al., 2010).

Mixed error generation: In our current experiments, each artificial corpus is generated using only one type of information at a time. However, having found that certain types of information are more suitable than others for correcting specific error types (see Tables 5 and 6), we believe better artificial corpora could be created by generating instances of each error type using only the most appropriate linguistic information.

Acknowledgments

We would like to thank Prof Ted Briscoe for his valuable comments and suggestions, as well as Cambridge English Language Assessment, a division of Cambridge Assessment, for supporting this research.

References

Gabor Berend, Veronika Vincze, Sina Zarrieß, and Richárd Farkas. 2013. LFG-based features for noun number and article grammatical errors. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 62–67, Sofia, Bulgaria.

Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Ted Briscoe, John Carroll, and Rebecca Watson. 2006. The second release of the RASP system. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 77–80, Sydney, Australia.

Chris Brockett, William B. Dolan, and Michael Gamon. 2006. Correcting ESL errors using phrasal SMT techniques. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 249–256, Sydney, Australia.

Aoife Cahill, Nitin Madnani, Joel Tetreault, and Diane Napolitano. 2013. Robust systems for preposition error correction using Wikipedia revisions. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 507–517, Atlanta, Georgia.

Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572, Montreal, Canada.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: the NUS Corpus of Learner English. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, pages 22–31, Atlanta, Georgia.

Nava Ehsan and Heshaam Faili. 2013. Grammatical and context-sensitive error correction using a statistical machine translation framework. Software: Practice and Experience, 43(2):187–206.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH 2008), pages 1618–1621, Brisbane, Australia.

Dan Flickinger and Jiye Yu. 2013. Toward more precision in correction of grammatical errors. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 68–73, Sofia, Bulgaria.

Jennifer Foster and Øistein Andersen. 2009. GenERRate: generating errors for use in grammatical error detection. In Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications, pages 82–90, Boulder, Colorado.

William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, pages 233–237, Harriman, New York.

Kenji Imamura, Kuniko Saito, Kugatsu Sadamitsu, and Hitoshi Nishikawa. 2012. Grammar error correction using pseudo-error sentences and domain adaptation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 388–392, Jeju Island, Korea.

Emi Izumi, Kiyotaka Uchimoto, Toyomi Saiga, Thepchai Supnithi, and Hitoshi Isahara. 2003. Automatic error detection in the Japanese learners' English spoken data. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics - Volume 2, pages 145–148, Sapporo, Japan.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180, Prague, Czech Republic.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.

Anoop Kunchukuttan, Ritesh Shah, and Pushpak Bhattacharyya. 2013. IITB system for CoNLL 2013 shared task: a hybrid approach to grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 82–87, Sofia, Bulgaria.

Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault. 2010. Automated Grammatical Error Detection for Language Learners. Morgan and Claypool Publishers.

John Lee and Stephanie Seneff. 2008. An analysis of grammatical errors in non-native speech in English. In Proceedings of the 2008 IEEE Spoken Language Technology Workshop, pages 89–92, Goa, India.

Xiaohua Liu, Bo Han, Kuan Li, Stephan Hyeonjun Stiller, and Ming Zhou. 2010. SRL-based verb selection for ESL. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1068–1076, Cambridge, MA.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 147–155, Chiang Mai, Thailand.

Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12, Sofia, Bulgaria.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Ted Pedersen and Varada Kolhatkar. 2009. WordNet::SenseRelate::AllWords: a broad coverage word sense tagger that maximizes semantic relatedness. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Demonstration Session, pages 17–20, Boulder, Colorado.

Desmond Darma Putra and Lili Szabo. 2013. UdS at CoNLL 2013 shared task. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 88–95, Sofia, Bulgaria.

Alla Rozovskaya and Dan Roth. 2010a. Training paradigms for correcting errors in grammar and usage. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 154–162, Los Angeles, California.

Alla Rozovskaya and Dan Roth. 2010b. Generating confusion sets for context-sensitive error correction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 961–970, Cambridge, Massachusetts.

Alla Rozovskaya and Dan Roth. 2011. Algorithm selection and model adaptation for ESL correction tasks. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 924–933, Portland, Oregon.

Alla Rozovskaya, Mark Sammons, and Dan Roth. 2012. The UI system in the HOO 2012 shared task on error correction. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 272–280, Montreal, Canada.

Alla Rozovskaya, Kai-Wei Chang, Mark Sammons, and Dan Roth. 2013. The University of Illinois system in the CoNLL-2013 shared task. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 13–19, Sofia, Bulgaria.

Grigori Sidorov, Anubhav Gupta, Martin Tozer, Dolors Catala, Angels Catena, and Sandrine Fuentes. 2013. Rule-based system for automatic grammar correction using syntactic n-grams for English language learning (L2). In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 96–101, Sofia, Bulgaria.

Jonas Sjöbergh and Ola Knutsson. 2005. Faking errors to avoid making errors: very weakly supervised learning for error detection in writing. In Proceedings of RANLP 2005, pages 506–512, Borovets, Bulgaria.

Toshikazu Tajiri, Mamoru Komachi, and Yuji Matsumoto. 2012. Tense and aspect error correction for ESL learners using global context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 198–202, Jeju Island, Korea.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. 2013. A report on the First Native Language Identification Shared Task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 48–57, Atlanta, Georgia.

Antal van den Bosch and Peter Berck. 2013. Memory-based grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 102–108, Sofia, Bulgaria.

Bong-Jun Yi, Ho-Chang Lee, and Hae-Chang Rim. 2013. KUNLP grammatical error correction system for CoNLL-2013 shared task. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 123–127, Sofia, Bulgaria.

Zheng Yuan and Mariano Felice. 2013. Constrained grammatical error correction using statistical machine translation. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 52–61, Sofia, Bulgaria.