Developing a new grammar checker for English as a second language
Cornelia Tschichold, Franck Bodmer, Etienne Cornu, François Grosjean, Lysiane Grosjean, Natalie Kübler, Nicolas Léwy & Corinne Tschumi
Laboratoire de traitement du langage et de la parole
Université de Neuchâtel, Avenue du Premier-Mars 26, CH-2000 Neuchâtel

Abstract

In this paper we describe the prototype of a new grammar checker specifically geared to the needs of French speakers writing in English. Most commercial grammar checkers on the market today are meant to be used by native speakers of a language who have good intuitions about their own language competence. Non-native speakers of a language, however, have different intuitions and are very easily confused by false alarms, i.e. error messages given by the grammar checker when there is in fact no error in the text. In our project aimed at developing a complete writing tool for the non-native speaker, we concentrated on building a grammar checker that keeps the rate of over-flagging down and on developing a user-friendly writing environment which contains, among other things, a series of on-line helps. The grammar checking component, which is the focus of this paper, uses island processing (or chunking) rather than a full parse. This approach is both rapid and appropriate when a text contains many errors. We explain how we use automata to identify multi-word units, detect errors (which we first isolated in a corpus of errors) and interact with the user. We end with a short evaluation of our prototype and compare it to three currently available commercial grammar checkers.

1 Introduction

Many word-processing systems today include grammar checkers which can be used to locate various grammatical problems in a text. These tools are clearly aimed at native speakers, even if they can be of some help to non-native speakers as well. However, non-native speakers make more errors than native speakers, and their errors are quite different (Corder 1981). Because of this, they require grammar checkers designed for their specific needs (Granger & Meunier 1994).

The prototype we have developed is aimed at French native speakers writing in English. From the very start, we worked on the idea that our prototype would be commercialized. In order to find out users' real needs, we first conducted a survey among potential users and experts in the field concerning such issues as coverage and the interface. In addition, we studied which errors needed to be dealt with. To do this, we integrated the information on errors found in published lists of typical learner errors, e.g. Fitikides (1963), with our own corpus of errors obtained from English texts written by French native speakers. Some 27,000 words of text produced over 2,800 errors, which were classified and sorted. We have used this corpus to decide which errors to concentrate on and to evaluate the correction procedures developed. The following two tables give the percentages of errors found in our corpus, broken down by the major categories, followed by the subcategories pertaining to the verb.

Error type                   Percentage
spelling & pronunciation       10.3 %
adjectives                      5.3 %
adverbs                         4.4 %
nouns                          19.6 %
verbs                          24.5 %
word combinations               8.3 %
sentence                       27.6 %

Table 1: Percentage of errors by major categories

Error type                   Percentage
morphology                      1.2 %
agreement                       1.4 %
tenses                          8.8 %
lexicon                         8.0 %
phrase following the verb       5.1 %

Table 2: Percentage of verb errors by subcategories

The prototype includes a set of writing aids, a problem word highlighter and a grammar checker. The writing aids include two monolingual and a bilingual dictionary (simulated), a verb conjugator, a small translating module for certain fixed expressions, and a comprehensive on-line grammar. The problem word highlighter is used to show all the potential lexical errors in a text. While we hoped that the grammar checker would cover as many different types of errors as possible, it quickly became clear that certain errors could not be handled satisfactorily with the checker, e.g. using library (based on French librairie) instead of bookstore in a sentence like I need to go to the library in order to buy a present for my father. Instead of flagging every instance of library in a text, something other grammar checkers often do, we developed the problem word highlighter. It allows the user to view all the problematic words at one glance, and it offers help, i.e. explanations and examples for each word, on request. Potential errors such as false friends, confusions, foreign loans, etc. are tackled by the highlighter. The heart of the prototype is the grammar checker, which we describe below. Further details about the writing aids can be found in Tschumi et al. (1996).
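As an illustration of how such a highlighter might be organized, here is a minimal Python sketch in which each problem word is mapped to an explanation and an example that the user can call up on request. The word list, the messages and the function names are our own illustrative assumptions, not part of the prototype described in this paper.

import re

# Hypothetical entries: each problem word is mapped to a short explanation
# and an example, in the spirit of the highlighter described above.
PROBLEM_WORDS = {
    "library": ("Possible false friend: French 'librairie' means 'bookstore'.",
                "I bought the book in a bookstore, then read it in the library."),
    "actually": ("Possible false friend: French 'actuellement' means 'currently'.",
                 "He is currently abroad. / Actually, I disagree."),
}

def highlight_problem_words(text):
    """Return (start, end, word) spans for every potential lexical problem.

    The user can then ask for the explanation and example attached to a word,
    instead of having every occurrence flagged as an error."""
    spans = []
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group(0).lower()
        if word in PROBLEM_WORDS:
            spans.append((match.start(), match.end(), word))
    return spans

if __name__ == "__main__":
    sample = "I need to go to the library in order to buy a present for my father."
    for start, end, word in highlight_problem_words(sample):
        explanation, example = PROBLEM_WORDS[word]
        print(f"{word} [{start}:{end}]: {explanation} Example: {example}")

The point of such a design is the one made above: the word is only highlighted, and the explanation is shown on request, rather than every occurrence being reported as an error.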
2 The grammar checker

Texts written by non-native speakers are more difficult to parse than texts written by native speakers because of the number and types of errors they contain. It is doubtful whether a complete parse of a sentence containing these kinds of errors can be achieved with today's technology (Thurmair 1990). To get around this, we chose an approach using island processing. Such a method, similar to the chunking approach described in Abney (1991), makes it possible to extract from the text most of the information needed to detect errors without wasting time and resources on trying to parse an ungrammatical sentence fully. Chunking can be seen as an intermediary step between tagging and parsing, but it can also be used to get around the problem of a full parse when dealing with ill-formed text.

Once the grammar checker has been activated, the text which is to be checked goes through a number of stages. It is first segmented into sentences and words. The individual words are then looked up in the dictionary, which is an extract of CELEX (see Burnage 1990). It includes all the words that occur in our corpus of texts, with all their possible readings. For example, the word the is listed in our dictionary as a determiner and as an adverb; table has an entry both as a noun and as a verb. In this sense, our dictionary is not a simplified and scaled-down version of a full dictionary, but simply a shorter version. In the next stage, an algorithm based on neural networks (see Bodmer 1994) disambiguates all words which belong to more than one syntactic category. Furthermore, some multi-word units are identified and labeled with a single syntactic category, e.g. a lot of (PRON). After this stage, island parsing can begin. A first step consists in identifying simple noun phrases. On the basis of these NPs, preprocessing automata assemble complex noun phrases and assign features to them whenever possible. In a second step, other preprocessing automata identify the verb group and assign tense, voice and aspect features to it. Finally, in a third step, error detection automata are run, some of which involve interacting with the user. Each of these three steps will be described below.
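The following minimal Python sketch illustrates two of the stages just described: dictionary lookup, in which every word keeps all its possible readings, and the grouping of multi-word units under a single syntactic category. The toy dictionary, the tag names and the unit list are illustrative assumptions; the prototype itself uses an extract of CELEX and a neural-network disambiguator.

# A minimal sketch of two of the stages described above: dictionary lookup
# (each word keeps all its possible readings) and multi-word unit grouping.
# The tiny dictionary and the unit list are illustrative, not the CELEX extract.
DICTIONARY = {
    "the":   ["DET", "ADV"],
    "table": ["N", "V"],
    "a":     ["DET"],
    "lot":   ["N"],
    "of":    ["PREP"],
}

MULTI_WORD_UNITS = {
    ("a", "lot", "of"): "PRON",   # labeled with a single syntactic category
}

def lookup(tokens):
    """Attach every possible reading to each token."""
    return [(tok, DICTIONARY.get(tok.lower(), ["UNKNOWN"])) for tok in tokens]

def group_multi_word_units(tagged):
    """Collapse known multi-word units into one token with one category."""
    result, i = [], 0
    while i < len(tagged):
        for unit, category in MULTI_WORD_UNITS.items():
            n = len(unit)
            if tuple(tok.lower() for tok, _ in tagged[i:i + n]) == unit:
                result.append((" ".join(unit), [category]))
                i += n
                break
        else:
            result.append(tagged[i])
            i += 1
    return result

if __name__ == "__main__":
    tokens = "A lot of the table".split()
    print(group_multi_word_units(lookup(tokens)))
    # [('a lot of', ['PRON']), ('the', ['DET', 'ADV']), ('table', ['N', 'V'])]

Only after such grouping and disambiguation does the island parsing proper begin, as described in the next section.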
3 The preprocessing stage

The noun phrase parser identifies simple non-recursive noun phrases such as Det+Adj+N or N+N. The method used for this process involves an algorithm of the type described in Church (1988) which was trained on a manually marked part of our corpus. The module is thus geared to the particular type of second language text the checker needs to deal with. The resulting information is passed on to a preprocessing module consisting of a number of automata groups. The automata used here (as well as in subsequent modules) are finite-state automata similar to those described in Allen (1987) or Silberztein (1993). This type of automaton is well known for its efficiency and versatility.

In the preprocessing module, a first set of automata scan the text for noun phrases, identify the head of each NP and assign the features for person and number to it. Other sets of automata then mark temporal noun phrases, e.g. this past week or six months ago. In a similar fashion, some prepositional phrases are given specific features if they denote time or place, e.g. [...] feature. This is illustrated in the following automaton:

Automaton 1
<NP @[CAT = N, +HUMAN] (NP_TYPE => NP_HUM)>

Every automaton starts by looking for its anchor, an arc marked by "@" (not necessarily the first arc in the automaton). The above automaton first looks for a noun (inside a noun phrase) which occurs in the list of nouns denoting human beings. If it finds one, it puts the value NP_HUMAN in the register called NP_TYPE, a register which is associated with the NP as a whole.

The second group of preprocessing automata deals with the verb group. Periphrastic verb forms are analyzed for tense, aspect and phase, and the result is stored in registers. The content of these registers can later be called upon by the detection automata. After these two preprocessing stages, the error detection automata can begin their work.

4 Error detection

Once important islands in the sentence have been identified, it becomes much easier to write detection automata which look for specific types of errors. Because no overall parse is attempted, error detection has to rely on well-described contexts. Such an approach, which reduces overflagging, also has the advantage of describing errors precisely. Errors can thus not only be identified but can also be explained to the user. Suggestions can also be made as to how errors can be corrected.
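To make the idea of context-based detection more concrete, the Python sketch below shows what one detection rule could look like once islands and their registers are available: it anchors on a verb group, looks leftwards for the nearest noun phrase, compares their number registers, and produces an explanation together with a suggestion. The island representation, the register names and the agreement check are illustrative assumptions, not the automaton formalism actually used in the prototype.

from dataclasses import dataclass, field

@dataclass
class Island:
    """An island (chunk) produced by the preprocessing stage, with its registers."""
    kind: str                      # e.g. "NP" or "VG"
    words: list
    registers: dict = field(default_factory=dict)

def check_subject_verb_agreement(islands):
    """Anchor on a verb group, look left for the nearest NP, compare features.

    Returns (message, suggestion) pairs; an empty list means nothing to flag."""
    diagnostics = []
    for i, island in enumerate(islands):
        if island.kind != "VG":    # the anchor of this illustrative rule
            continue
        subject = next((isl for isl in reversed(islands[:i]) if isl.kind == "NP"), None)
        if subject is None:
            continue
        np_number = subject.registers.get("NUMBER")
        vg_number = island.registers.get("NUMBER")
        if np_number and vg_number and np_number != vg_number:
            diagnostics.append((
                f"The verb '{' '.join(island.words)}' does not agree with "
                f"its subject '{' '.join(subject.words)}'.",
                "Check the number of the subject and adjust the verb ending.",
            ))
    return diagnostics

if __name__ == "__main__":
    sentence = [
        Island("NP", ["the", "students"], {"NUMBER": "plural", "PERSON": "3"}),
        Island("VG", ["has", "finished"], {"NUMBER": "singular", "TENSE": "present perfect"}),
    ]
    for message, suggestion in check_subject_verb_agreement(sentence):
        print(message, "--", suggestion)

Because each such rule only inspects a well-described local context, it can both avoid overflagging and attach a precise explanation and suggestion to anything it does flag, which is the behaviour described above.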