
The Influence of Spelling Errors on Content Scoring Performance

Andrea Horbach, Yuning Ding, Torsten Zesch
Language Technology Lab, Department of Computer Science and Applied Cognitive Science,
University of Duisburg-Essen, Germany
{andrea.horbach|torsten.zesch}@uni-due.de, [email protected]

Abstract

Spelling errors occur frequently in educational settings, but their influence on automatic scoring is largely unknown. We therefore investigate the influence of spelling errors on content scoring performance using the example of the short answer data set of the Automated Student Assessment Prize (ASAP). We conduct an annotation study on the nature of spelling errors in the ASAP dataset and utilize these findings in machine learning experiments that measure the influence of spelling errors on automatic content scoring. Our main finding is that scoring methods using both token and character n-gram features are robust against spelling errors up to the error frequency seen in ASAP.

1 Introduction

Spelling errors occur frequently in educational assessment situations, not only in language learning scenarios, but also with native speakers, especially when answers are written without the help of a spell-checker.[1] In automatic content scoring for short answer questions, a model is learnt about which content needs to be present in a correct answer. Spelling mistakes interfere with this process, as they should be mostly ignored for content scoring. It is still largely unknown how severe the problem is in a practical setting.

[1] Note that we do not distinguish between the terms error and mistake used by Ellis (1994) to denote competence and performance errors, respectively. We use the two terms interchangeably.

Consider the following answer to the first prompt of the short answer data set of the Automated Student Assessment Prize (ASAP):[2]

[2] https://www.kaggle.com/c/asap-sas

(1) Some additional information you will need are the material. You also need to know the size of the contaneir to measure how the acid rain effected it. You need to know how much vineager is used for each sample. Another thing that would help is to know how big the sample stones are by measureing the best possible way.

In this answer, three non-word spelling errors occur (contaneir, vineager, and measureing). In addition, there is also one real-word spelling error, which leads to an existing word: effected, which should be affected.

While a teacher who is manually scoring learner answers can simply try to ignore spelling mistakes as far as possible, automatic scoring methods must include a spell-checking component to normalize an occurrence of vineager to vinegar. Thus, spell-checking components are also a part of some content scoring systems, such as the top two performing systems in the ASAP challenge (Tandalla, 2012; Zbontar, 2012). However, it is unclear what impact spelling errors really have on the performance of content scoring systems.

Many systems in the ASAP challenge, as well as some participating systems in the SemEval 2013 Student Response Analysis Task (Heilman and Madnani, 2013; Levy et al., 2013), used shallow features such as token n-grams (Zbontar, 2012; Conort, 2012). If a token in the test data is misspelled, then there is no way of knowing that it has the same meaning as the correct spelling of the word in the training data. At the same time, individual spelling error instances often do not occur uniquely in a dataset: Depending on factors such as the learner group (for example native speakers or language learners with a certain native language) or the data collection method (handwriting vs. typing), some spelling errors will occur frequently while others will be rare.
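To make this intuition concrete, here is a small illustration (ours, not taken from the paper) that contrasts token identity with character n-gram overlap for the example's vineager:

```python
def char_ngrams(token, n_min=2, n_max=4):
    """All character n-grams of a token (the granularity later used in Section 5)."""
    return {token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)}

correct, misspelled = "vinegar", "vineager"
# As token features, the two spellings are simply different strings ...
print(correct == misspelled)                       # False
# ... but they still share many character n-grams.
print(sorted(char_ngrams(correct) & char_ngrams(misspelled)))
# ['in', 'ine', 'ne', 'vi', 'vin', 'vine']
```

Token-level features thus lose the connection between the two spellings entirely, while character-level features retain a partial signal.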

The misspelled form vineager, for example, might be frequent enough that an occurrence feature for the misspelled version provides valuable information which a classifier can learn. Whether this observation mitigates the effect of spelling errors depends on the frequency of individual errors and therefore also on the shape of error distributions.

Contributions  In this paper, we investigate how the presence or absence of spelling errors influences the task of content scoring, taking the afore-mentioned influence criteria of error frequency and error distribution into account. We conduct our analyses and experiments on the frequently used ASAP content scoring dataset (Higgins et al., 2014). The dataset contains 10 different prompts about different topics ranging from sciences over biology to literature questions. Each prompt comes with 2,200 answers on average. Although this dataset has been used in a lot of studies concerning content scoring, much about the spelling errors in the dataset is still unknown. Our manual annotations and corpus analyses will therefore also provide insight on the number, the nature and the distribution of spelling errors in this dataset.

First, we present an analysis of the frequency and distribution of non-word spelling errors in the ASAP corpus and compare several spelling dictionaries. We provide a gold-standard correction for the non-word errors found automatically by a spell checker in the test section of the data. We compare error correction methods based on phonetic similarity and edit distance and extend them with a domain-specific method that prefers suggestions occurring in the material for a specific prompt.

Next, we investigate the effect of manipulating the number and distribution of spelling errors on the performance of an automatic content scoring system. We experiment with two ways of regulating the number of misspellings. We automatically and manually spell-check the corpus to replace non-word spelling errors by their corrected version. This only allows us to decrease the number of errors. To increase the amount of spelling errors further, we also introduce errors artificially in two conditions: (i) adding random noise as a worst-case scenario, and (ii) adding mistakes according to the error distribution in the test data.

We find that token and character n-gram scoring features are largely robust against spelling errors with a frequency present in our data. Character n-gram features are contributing towards this robustness. When introducing more errors, we see a substantial drop in performance, such that the importance of spell-checking components in content scoring depends on the frequency of errors in the data.

2 Annotating Spelling Errors

In order to evaluate the influence of spelling errors on content scoring, we need an error-annotated corpus. However, a full manual annotation of the complete dataset, which contains around one million tokens, was beyond our means. Instead, we decided to annotate a representative sample of the ASAP corpus which we utilize to evaluate the performance of spelling error detection methods. This allows us to estimate whether we can draw reliable conclusions from applying existing spell-checking methods to the full dataset.

We manually annotated the first 20 answers in each prompt using WebAnno (Yimam et al., 2013). In order to facilitate the annotation process, we automatically pre-annotate potential spelling errors using the Jazzy spelling dictionary.[3] Two annotators (non-native speakers and two of the authors of this paper) reviewed the error candidates and either accepted or rejected them, but could also mark additional spelling errors which were not detected automatically.

[3] https://sourceforge.net/projects/jazzy/files/Dictionaries/English/english.0.zip/download

In this manual annotation process, we distinguish between non-word and real-word spelling errors. We annotate a mistake as a real-word error if another word with a different root is clearly intended in the context, such as "Their are two samples". We do not distinguish between spelling errors and grammatical errors among the non-word errors, i.e., we do not filter out non-words that could originate from grammatical errors such as incorrect 3rd person forms like dryed instead of dried. We do not mark grammatical errors that lead to a real-word error, such as wrong prepositions. Equally, we do not mark lexically unsuitable words which are morphologically possible, but do not fit in the context, such as counter partner in a context where counter part was clearly intended.

In total, we annotated 9,995 tokens and reached an inter-annotator agreement of 0.87 Cohen's kappa (Cohen, 1960) on the binary decision whether a word is a spelling mistake or not.
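The agreement figure above is plain (unweighted) Cohen's kappa on the binary is-error decision; a minimal sketch of the computation, assuming two parallel lists of 0/1 labels from the two annotators:

```python
def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators' binary is-error decisions."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    chance = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                 for c in set(labels_a) | set(labels_b))
    return (observed - chance) / (1 - chance)

# e.g. two annotators judging five pre-annotated candidates (1 = spelling error)
print(cohens_kappa([1, 1, 0, 1, 0], [1, 0, 0, 1, 0]))   # ~0.62
```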

Main sources of disagreement were (i) misses of real-word spelling errors not marked in the pre-annotation (e.g. koala beer), and (ii) disagreements as to whether compounds may be written as one word or not (e.g. micro debris vs. microdebris), a decision which is often ambiguous. In case of disagreement between the annotators, the final decision is made through adjudication by both annotators.

The resulting dataset contains 297 spelling errors, including 48 real-word errors which will not be considered for further evaluations and experiments. The resulting ratio of spelling errors in the dataset is about 3%, which is in line with the expected frequency of spelling errors in human-typed text (Kukich, 1992). However, it would be interesting follow-up work to determine the frequency of spelling errors in other content scoring datasets.

2.1 Evaluating Automatic Spell-checking

Using our annotated answers, we now evaluate different spell-checking dictionaries. Note that the size and quality of those dictionaries influence the trade-off between precision and recall of error detection. For example, a very small dictionary yields almost perfect recall, as most spelling errors are not in the dictionary and are flagged accordingly. However, precision would be rather low, as some of the detected errors are perfectly valid words that are just missing from the dictionary. On the other hand, a very large dictionary lowers recall, as some words that are definitely a spelling error in the context of the writing task in this dataset might be valid words in some very special context. For example, the HunSpell dictionary contains the abbreviation AZT, standing among others for 'azidothymidine' and 'Azerbaijan time'. In the context of our learner answers, the string would likely never occur as a valid word, but should rather be counted as a non-word misspelling.

In order to find a suitable dictionary for our task, we evaluate the following setups: As baseline dictionary, we use the one that comes with the Jazzy spellchecker.[4] It is relatively small (about 47,000 entries) and does not contain inflected forms (such as third person singular). Thus, we also use the English HunSpell dictionary with more than 120,000 entries.[5] Both are general-purpose dictionaries, which we can adapt in order to get better performance. First, we remove all-uppercase abbreviations from the dictionary (-abbr), as they can lead to the above-mentioned problem.[6] Second, we extend the dictionary with prompt-specific lexical material (+prompt), which we extract automatically from the reading material and scoring rubrics associated with each prompt. This step adds about 600 tokens to the dictionary. Third, we combine both strategies (-abbr +prompt), keeping a word if it is contained both in the list of abbreviations and in the prompt material.

[4] http://jazzy.sourceforge.net
[5] https://sourceforge.net/projects/hunspell/files/Spelling%20dictionaries/en_US/
[6] Note that in our experiments, we lowercase all material before the comparison so that we factor out capitalization problems.

After checking the first results, we found that some artifacts influence the results. First, the tokenizer splits words like can't into two tokens ca and n't, which are then detected as spelling errors. Second, the learner answers often contain bullets used in lists, such as a), b), etc. For the final results, we do not count these cases as spelling errors.

Results  Table 1 gives an overview of the results. As expected, the rather small Jazzy dictionary has very high recall, as many words are not found in the dictionary, including almost all spelling errors but also a lot of valid words, which results in low precision. Using the larger HunSpell dictionary lowers recall a bit, but dramatically improves precision. Excluding abbreviations has less effect than expected. It might even hurt a bit if words such as DNA are removed, which are frequently used in the biology prompts. Extending the dictionary with the small number of prompt-specific terms brings large boosts in detection precision with an – in comparison – moderate drop in recall. Excluding the abbreviations before adding the prompt-specific terms recovers most of the lost recall and trades in some precision, but yields the best F-score.

Dictionary                P     R     F
Jazzy                     .25   .98   .39
HunSpell                  .63   .89   .74
HunSpell -abbr            .63   .95   .76
HunSpell +prompt          .88   .88   .88
HunSpell -abbr +prompt    .86   .94   .90

Table 1: Evaluation of different error detection dictionaries
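A simplified sketch of this kind of dictionary-based detection and its evaluation. It stands in for the actual Jazzy/HunSpell setup: the word-list file, the -abbr filter and the +prompt extension are approximated with plain Python sets, and all names are illustrative.

```python
def load_dictionary(wordlist_path, prompt_text="", drop_abbreviations=False):
    """Build a detection dictionary in the spirit of the variants above:
    a plain word list, optionally without all-uppercase abbreviations (-abbr),
    optionally extended with tokens from the prompt material (+prompt)."""
    with open(wordlist_path, encoding="utf-8") as f:
        entries = {line.strip() for line in f if line.strip()}
    if drop_abbreviations:
        entries = {w for w in entries if not w.isupper()}
    dictionary = {w.lower() for w in entries}        # comparison is lowercased
    dictionary.update(tok.lower() for tok in prompt_text.split())
    return dictionary

def flag_nonwords(tokens, dictionary):
    """Mark alphabetic tokens that are missing from the dictionary."""
    return {tok for tok in tokens if tok.isalpha() and tok.lower() not in dictionary}

def precision_recall_f(flagged, gold):
    """Compare detected error tokens against the manually annotated ones."""
    tp = len(flagged & gold)
    p = tp / len(flagged) if flagged else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)
```

The smaller the word list passed in, the more tokens end up flagged, which is exactly the precision/recall trade-off discussed above.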

It might be surprising that we do not reach perfect recall in error detection, i.e. that there are words in the dictionary which we mark as incorrect. These include tokens from the prompt material which were annotated as incorrect (such as microdebris instead of micro debris), as well as erroneously tokenized words such as a ddition, where both parts have been marked as part of a non-word spelling error although a appears in the dictionary.

HunSpell -abbr +prompt gives us overall the best performance and is thus used in the following experiments.

3 Annotating Error Corrections

In order to evaluate the performance of error correction methods, we manually correct part of the data. We use the HunSpell -abbr +prompt dictionary with the Jazzy spell-checker to detect and correct all errors in the test data part of ASAP. A list of these errors and their corrections is then presented to two human annotators (the same as in the previous annotation task). We showed each instance within a 20-character window to the left and right to allow for a decision in context, but annotators could inspect the full learner answer if that window was not sufficient. Annotators performed two consecutive tasks: First, they accepted or rejected a word as a spelling error, thus sorting out words that should not have been flagged as an error in the first place. Next, they either accepted a proposed correction or changed it to a different one.

This approach was much less time consuming than performing manual error detection and correction on raw data, and allowed us to annotate the complete test data section of the corpus with 6,400 proposed error candidates. As the recall of the error detection approach is expected to be around 90-95% (see evaluation results in Table 1), we will miss some errors. However, we decided that an imperfect annotation of the full test data section is more useful than an almost perfect annotation of only parts of the answers.

For prompts 1 and 2, two annotators checked all instances on the training as well as on the test data. On these items, annotators reached an inter-annotator agreement of κ=0.90 on the decision whether a word should be considered an error. For those candidates considered an error by both annotators, they found the same correction in 86% of all cases. In addition, the test data for all 10 prompts was annotated by one annotator each. On this data, out of 6,400 error candidates, about 5,200 were accepted as errors. The resulting error detection precision of 81% is close to the values shown in Table 1.

There was a surprisingly high number of errors for which it was not possible to annotate a correction, because the answer was so garbled that annotators could not find a target hypothesis. An example for such a case is the following: [...] but they don;t tell me to subtract the end mass from the slurhnf. While it is somewhat plausible from the wider context that slurhnf should be something along the lines of start mass, it remains unclear what the student meant. We checked some of these candidates with a native speaker, but still remained with almost 3% of uncorrectable errors. We ignore these cases when we evaluate different spell-checkers.

3.1 Evaluating Automatic Error Correction

We compare four setups of error correction methods. As a baseline, we use the original Jazzy spellchecker[7], which only generates candidates with the same phonetic code using Metaphone encoding. If there is more than one candidate, we select one randomly.

[7] http://sourceforge.net/projects/jazzy

As a variant to random error candidate selection, we additionally use prompt knowledge (+prompt), i.e. we make use of the prompt material and of the frequency of words in all answers for a prompt. We prefer material occurring in the prompt of an answer; if there are several candidates, we take the one occurring most frequently in the data.

We observed in our annotations that a number of identified errors (1,065) are the result of tokenization errors on the side of the students, i.e. they often omit whitespace between two words. Often these errors evolve around punctuation marks (e.g. content.they), in which case they are easy to detect and correct. In cases without punctuation showing the token boundary, we check whether an unknown word can be split into two in-dictionary words and accept the candidate if the resulting word pair also occurs in the answers for that prompt.
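A sketch of the whitespace-repair heuristic just described, under the assumption that the dictionary is a set of lowercased words and the answers for the prompt are available as one lowercased string; it is not the authors' implementation, and the function name is ours.

```python
def repair_missing_whitespace(token, dictionary, prompt_answers_text):
    """Split a run-together token into two known words, if plausible."""
    # easy case: the missing whitespace sits next to a punctuation mark,
    # e.g. "content.they" -> "content. they"
    for mark in ".,;:!?":
        left, sep, right = token.partition(mark)
        if sep and left.lower() in dictionary and right.lower() in dictionary:
            return f"{left}{sep} {right}"
    # otherwise try every split point and keep one whose two halves are
    # dictionary words and whose word pair occurs in the prompt's answers
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if (left.lower() in dictionary and right.lower() in dictionary
                and f"{left} {right}".lower() in prompt_answers_text):
            return f"{left} {right}"
    return None  # leave the token to the regular correction step
```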

We also built our own spellchecker, which does not only take phonetically identical candidates into account, but all candidates up to a certain Levenshtein distance (3 was an optimal value in our case), using LibLevenshtein[8] for an efficient implementation, but prefers candidates with a shorter distance if possible. In analogy to the Metaphone setup, we test (i) a basic version where a candidate is randomly selected should several occur and (ii) a prompt-specific version.

[8] https://github.com/universal-automata/liblevenshtein-java

Table 2 shows correction accuracy of the different methods on the annotated gold-standard, i.e. we check how often the correction method found the same correction as annotated, ignoring capitalization and ignoring words we could not manually correct. We also show coverage values, which specify for how many gold standard errors the respective method was able to provide a correction candidate at all. For our following experiments, we select the best-performing Levenshtein +prompt method.

Method   Variant               acc   coverage
Jazzy    Metaphone             .51   .85
Jazzy    Metaphone +prompt     .55   .82
Our      Levenshtein           .46   .96
Our      Levenshtein +prompt   .69   .95

Table 2: Performance of different error correction methods
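For concreteness, the following is an illustrative sketch of the Levenshtein +prompt strategy that performs best in Table 2. It is not the authors' implementation: they rely on LibLevenshtein for efficient candidate generation, while this sketch naively scans a word list, and the names dictionary, prompt_vocab and answer_freq are ours.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct(misspelling, dictionary, prompt_vocab, answer_freq, max_dist=3):
    """Prefer candidates with smaller edit distance; break ties by preferring
    words from the prompt material, then the word most frequent in the
    answers for that prompt."""
    scored = [(edit_distance(misspelling, w), w) for w in dictionary]
    scored = [(d, w) for d, w in scored if d <= max_dist]
    if not scored:
        return None
    best = min(d for d, _ in scored)
    tied = [w for d, w in scored if d == best]
    tied.sort(key=lambda w: (w not in prompt_vocab, -answer_freq.get(w, 0)))
    return tied[0]
```

The tie-breaking key encodes the two preferences from Section 3.1: candidates from the prompt material first, then the variant that occurs most frequently in the answers for that prompt.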

4 Dataset Analysis

To get a better understanding of the nature of spelling errors, we provide additional analyses on our annotations.

4.1 Error Detection Analysis

For the error detection annotations, we compare the length of a token in characters to its likelihood of being misspelled and find that longer words have higher chances to be misspellings (see Figure 1).

[Figure 1: Probability for words of a certain length to be misspelled in our annotated data. y-axis: misspelling rate (%), 0-40; x-axis: length in characters, 1-15.]

To further drill down on the nature of errors, we compute the probability of spelling errors across different coarse-grained POS tags. We map the Penn Treebank tagset to 12 coarse-grained tags as described in (Petrov et al., 2011). Table 3 shows that errors occur mainly in content words (only 17 out of 255 annotated errors occur in function words).

POS    # instances   P(error|POS)
.      1             0.1
ADJ    26            4.2
ADP    5             0.5
ADV    27            4.8
CONJ   -             -
DET    3             0.3
NOUN   140           6.2
NUM    -             -
PRON   3             0.4
PRT    1             0.3
VERB   45            2.2
X      4             0.2

Table 3: Probability for tokens from a certain POS class to be misspelled.

Next, we investigate how many errors are automatically detected in ASAP using the best performing dictionary. Table 4 provides an overview of the frequency of errors for each prompt, as well as the type-token ratio for error tokens. We see that many errors occur more than once, which might have consequences for content scoring if a model is able to associate frequent misspellings with a certain label. Table 5 shows as an example the top 10 most frequent misspellings for prompt 2. We see that there are a few very frequent misspellings centered around important vocabulary for that prompt and a long tail of infrequent misspellings (not shown in the table).

We also check whether there is a correlation between the number of spelling errors in an answer and the content score assigned by a teacher. We normalize by the number of tokens in the answer to avoid length artifacts and find no significant correlation. This is in line with our general assumption that spelling errors are ignored by teachers when scoring a learner answer.
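The counts behind Table 3 and Figure 1, as well as the correlation check, amount to simple relative frequencies; a minimal sketch, assuming the annotated sample is available as (token, POS, is_error) triples, which is a hypothetical representation of the WebAnno output:

```python
from collections import Counter

def misspelling_rate_by(annotated_tokens, group_of):
    """Percentage of misspelled tokens per group, e.g. per POS tag (Table 3)
    or per token length in characters (Figure 1)."""
    totals, errors = Counter(), Counter()
    for triple in annotated_tokens:
        g = group_of(triple)
        totals[g] += 1
        errors[g] += int(triple[2])
    return {g: 100.0 * errors[g] / totals[g] for g in totals}

def pearson(xs, ys):
    """Correlation of per-answer error rate with the assigned score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# rate_by_pos = misspelling_rate_by(tokens, lambda t: t[1])
# rate_by_len = misspelling_rate_by(tokens, lambda t: len(t[0]))
# r = pearson([e / n for e, n in per_answer_counts], scores)
```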

prompt   # errors   TTR
1        1.2        .69
2        1.2        .64
3        1.0        .62
4        2.0        .51
5        7.7        .43
6        5.8        .62
7        1.5        .63
8        2.3        .42
9        2.1        .59
10       2.2        .43
∅        2.3        .56

Table 4: Average number of spelling mistakes per 100 tokens (punctuation excluded) and type-token ratio for errors for the individual ASAP prompts.

Misspelling    #
streched       117
strech         31
strechable     18
nt             16
streching      15
expirement     13
streached      10
expiriment     9
strechiest     8
streches       6

Table 5: Top-10 most frequent misspellings for prompt 2

4.2 Error Correction Analysis

In order to understand the nature of spelling mistakes better, we perform additional analyses on the corrected test data. First, we categorized errors according to the Levenshtein distance between an error and its corrected version (see Figure 2). A number of instances with very high distances originate from errors involving tokenization, e.g. several words concatenated without a whitespace. To avoid such artifacts in the analysis, we counted only cases where both the original token and its corrected version did not include any whitespace.

[Figure 2: Distribution of errors across different Levenshtein distances. y-axis: count, 0-3,000; x-axis: Levenshtein distance, 0-8.]

There is still a surprisingly high number of words with a Levenshtein distance greater than 1. An example for a word with a high Levenshtein distance would be satalight instead of satellite. This shows that finding the right correction can be a challenging task, as correction candidates with a lower distance often exist. For example, in the answer The students could have impared the experiment by (...), it becomes clear from the question context (Describe two ways the student could have improved the experimental design) that the right correction for impared is improved and not impaired.

5 Influence of Spelling Errors on Scoring

In the following experiments, we vary the amount of spelling errors in the data systematically. We use automatic and manual spell-checking to decrease the error rate and add artificial errors for a higher number of errors.

5.1 Experimental Setup

In our experimental studies, we examine the influence of spelling deviations and spell-checking on content scoring. We train one classification model for each of the ten ASAP prompts, using the published data split into training data and "public leaderboard" data for testing. We preprocess the data using the ClearNLP segmenter and POS tagger provided through DKPro Core (Eckart de Castilho and Gurevych, 2014). We use a standard feature set often used in content scoring (Higgins et al., 2014) and extract token 1-3 grams and character 2-4 grams using the top 10,000 most frequent n-grams in each feature group. We then train an SVM classifier (Hall et al., 2009) with default parameter settings provided through DKPro TC (Daxenberger et al., 2014). We evaluate using both accuracy and quadratically weighted kappa (QWK, Cohen (1968)), as proposed in the Kaggle competition for this dataset, and present results averaged across all 10 prompts.

One important property of this feature setup is that the character n-gram features could be able to cover useful information from misspelled words. If the word experiment is, for example, misspelled as expirment, there are n-grams shared between these two versions, such as the character n-grams exp, men or ent. Therefore, we also use a reduced feature set, where we only work with token n-gram features, in order to quantify the size of this effect.
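The described setup is built with DKPro TC on top of a Weka SVM; the snippet below is a rough functional analogue in scikit-learn (our substitution, not the authors' toolchain), together with a quadratically weighted kappa implementation following Cohen (1968). Names such as train_answers and train_scores are placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import LinearSVC

# token 1-3 grams and character 2-4 grams, top 10,000 most frequent each
features = FeatureUnion([
    ("tok", CountVectorizer(analyzer="word", ngram_range=(1, 3), max_features=10000)),
    ("char", CountVectorizer(analyzer="char", ngram_range=(2, 4), max_features=10000)),
])
model = Pipeline([("features", features), ("svm", LinearSVC())])

def quadratic_weighted_kappa(y_true, y_pred, n_labels=4):
    """QWK for integer scores 0..n_labels-1 (score ranges differ per ASAP prompt)."""
    observed = np.zeros((n_labels, n_labels))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    weights = np.array([[(i - j) ** 2 for j in range(n_labels)]
                        for i in range(n_labels)], dtype=float) / (n_labels - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# model.fit(train_answers, train_scores)      # one model per prompt
# predictions = model.predict(test_answers)
# print(quadratic_weighted_kappa(test_scores, predictions))
```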

5.2 Experiment 1 - Decreased Error Frequency

In a first machine learning experiment, we investigate the influence of spell-checking on the performance of content scoring. We systematically vary three sets of influence factors: First, we use either our automatically corrected learner answers as a realistic spell-checking scenario or the corrected gold-standard version of the learner answers. We consider the latter an oracle condition to estimate an upper bound of improvement that filters out noise introduced by the spell checker. Second, we use two different feature sets: either the full feature set covering both token and character n-grams or the reduced feature set with only token n-grams. Third, we vary which part of the data is spell-checked: we either correct both the training and the test data, or only the training data, or only the test data. In the oracle condition, we have annotations only for the test set, so that we use only the test condition here.

Table 6 shows the results. We see that it makes little difference whether we spell-check the data (be it automatically or manually). One possible explanation for the very small difference is that there are many answers without any spelling mistakes at all. Thus, we also compare the performance for answers with different minimal numbers of errors. Table 7 shows the breakdown of the results per number of errors. We observe a reduced performance in kappa for answers with more spelling errors, but do not see this repeated in the accuracy, i.e. misclassified answers with more errors have a higher tendency to be completely misclassified.

                    tokens & chars    tokens only
                    QWK    acc        QWK    acc
baseline            .68    .71        .66    .70
spell-check train   .66    .70        .66    .70
spell-check test    .68    .70        .66    .70
spell-check both    .68    .70        .66    .70
gold test           .68    .70        .66    .70

Table 6: Scoring performance on ASAP with and without spell checking

# errors   # test items   QWK   acc
0          522            .66   .70
≥ 1        269            .65   .69
≥ 2        132            .63   .68
≥ 3        52             .62   .70

Table 7: Scoring performance on ASAP when using only answers with a certain minimal number of errors for testing.

5.3 Experiment 2 - Simulating Increased Error Frequencies

As we have seen in the above experimental study, there is little difference between the scoring quality on original data vs. spell-checked data. We already ruled out that this might be due to a noisy spell-checker by also evaluating on the annotated gold standard. Another potential reason for our findings is that the amount of errors present in the data is just not large enough to make a difference. To check that, we artificially introduce different amounts of new errors into the learner answers. In this way, we can also simulate corpora with different properties, so that practitioners can check the average amount of spelling errors in their data and get an estimation of what performance drop to expect.

Generating Spelling Errors  In order to generate additional spelling errors, we use two different models:

Random Error Generation introduces errors by either adding, deleting, or substituting a letter or by swapping two letters in randomly selected words. This error generation process is a worst-case experiment in the sense that there is no predictable pattern in the produced errors.

Informed Error Generation produces errors according to the distribution of errors in the data. This means we introduce errors only to words that were misspelled in our annotated gold-standard, and we introduce errors by using the misspelled versions actually occurring in the data, considering the error distribution. In this way, there are chances that – like in real life – some errors will be more frequent than others such that a classifier might be able to learn from them.
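A sketch of the two generators under stated assumptions: random_error applies one random character operation, and informed_error samples an attested misspelling from the gold standard. The structure observed_misspellings, mapping a correct word to a count dictionary of its annotated variants, is hypothetical.

```python
import random

def random_error(word):
    """Random model: add, delete, or substitute one letter, or swap two letters."""
    if not word:
        return word
    letters = "abcdefghijklmnopqrstuvwxyz"
    i = random.randrange(len(word))
    op = random.choice(["add", "delete", "substitute", "swap"])
    if op == "add":
        return word[:i] + random.choice(letters) + word[i:]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "substitute":
        return word[:i] + random.choice(letters) + word[i + 1:]
    if op == "swap" and i + 1 < len(word):
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word

def informed_error(word, observed_misspellings):
    """Informed model: sample one of the misspellings attested for this word,
    weighted by how often each variant occurred in the annotated data."""
    variants = observed_misspellings.get(word.lower())
    if not variants:
        return word          # word was never misspelled in the gold standard
    forms, counts = zip(*variants.items())
    return random.choices(forms, weights=counts, k=1)[0]
```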

We use both models with different configurations of the experimental setup. We vary whether the errors are added to all words (all) or only to content words (cw). This is because we observed that mainly longer and content words are misspelled. We argue that these words can be more important in content scoring than small function words. Therefore, those realistic errors might do more harm than random errors. In both conditions, we make sure that the overall error rate across all tokens matches the desired percentage. Additionally, we use either only token features (tok), which will be more sensitive towards spelling errors, or also include the character features (tok+char), which we know to be more robust.

Figure 3 shows the performance of the two error generation models in their different variants for different amounts of artificial errors. Note that there was a natural upper bound for the amount of errors which can be introduced using the informed error generator. As expected, we see that content words are more important for scoring than function words, as introducing errors to only content words yields a larger performance drop. We see for both generation models that a scoring model using character n-grams is largely robust against spelling mistakes while a model using only information on the token level is not. We also see that a more realistic error generation process is not as detrimental to the scoring performance as random errors. Of course, our error model might be slightly over-optimistic, and in real life with such a high number of errors we might see new orthographic variants for individual words that were not covered in our annotations. We therefore believe the realistic curve to be somewhere between the informed and the random model.

[Figure 3: Scoring performance on ASAP with increasing amounts of additional introduced errors. y-axis: QWK, 0.45-0.70; x-axis: misspelled words (%), 10-50; curves: random/informed error generation, tok vs. tok+char features, all words (all) vs. content words (cw).]

6 Related Work

We are not aware of other works studying the impact of spelling errors on content scoring performance. In general, spell checking is often used as a preprocessing step in educational applications, especially those dealing with input written by learners, either non-natives or natives. In some works the influence of spell-checking is explicitly addressed. Pilan et al. (2016) predict the proficiency level of language learners using textbook material as training data and find that spell-checking improves classification performance. Keiper et al. (2016) show that normalizing reading comprehension answers written by language learners is beneficial for POS tagging accuracy.

In some areas, however, spelling errors can also be a useful source of information: In the related domain of native language identification, a discipline also dealing with learner texts, Chen et al. (2017) found that spelling errors provide valuable information when determining the native language of an essay's author.

7 Conclusions

We presented a corpus study on spelling errors in the ASAP dataset and provide gold-standard annotations for error detection and correction on large parts of the data.

Next, we examined the influence of spelling errors on content scoring performance. Surprisingly, we found very little influence of spelling mistakes on grading performance for our model and on the ASAP dataset. In our setup, spell-checking seems negligible.

There are several explanations for that: First, we found that we observe a drop in performance if we artificially increase the number of spelling errors. This drop is especially pronounced if (a) no character n-gram information is used for scoring and (b) the errors follow no specific pattern. If a dataset is expected to contain a higher percentage of spelling errors, it will therefore be helpful to correct errors automatically and/or to mitigate the effect of misspelled words by the usage of character n-gram features.

Second, our scoring models relied on shallow features. We assume that scoring models using higher linguistic processing such as dependency triples might suffer more substantially, a question that will be pursued further in future work.

Our gold standard annotations are available under https://github.com/ltl-ude/asap-spelling in order to encourage more work on spell-checking in the educational domain.

Acknowledgments

This work is funded by the German Federal Ministry of Education and Research under grant no. FKZ 01PL16075.

References

Lingzhen Chen, Carlo Strapparava, and Vivi Nastase. 2017. Improving native language identification by using spelling errors. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Vancouver, Canada, pages 542-546. http://aclweb.org/anthology/P17-2086.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1):37-46.

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70:213-220.

X. Conort. 2012. Short answer scoring: Explanation of gxav solution. ASAP Short Answer Scoring Competition System Description. Retrieved July 28, 2014.

Johannes Daxenberger, Oliver Ferschke, Iryna Gurevych, Torsten Zesch, et al. 2014. DKPro TC: A Java-based framework for supervised learning experiments on textual data. In ACL (System Demonstrations), pages 61-66.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT) at COLING, pages 1-11.

Rod Ellis. 1994. The Study of Second Language Acquisition. Oxford University Press.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter 11(1):10-18.

Michael Heilman and Nitin Madnani. 2013. ETS: Domain adaptation and stacking for short answer scoring. In SemEval@NAACL-HLT, pages 275-279.

Derrick Higgins, Chris Brew, Michael Heilman, Ramon Ziai, Lei Chen, Aoife Cahill, Michael Flor, Nitin Madnani, Joel R. Tetreault, Daniel Blanchard, Diane Napolitano, Chong Min Lee, and John Blackmore. 2014. Is getting the right answer just about choosing the right words? The role of syntactically-informed features in short answer scoring. Computation and Language. http://arxiv.org/abs/1403.0801.

Lena Keiper, Andrea Horbach, and Stefan Thater. 2016. Improving POS tagging of German learner language in a reading comprehension scenario. In LREC.

Karen Kukich. 1992. Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR) 24(4):377-439.

Omer Levy, Torsten Zesch, Ido Dagan, and Iryna Gurevych. 2013. UKP-BIU: Similarity and entailment metrics for student response analysis. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 285-289.

Slav Petrov, Dipanjan Das, and Ryan T. McDonald. 2011. A universal part-of-speech tagset. CoRR abs/1104.2086. http://arxiv.org/abs/1104.2086.

Ildiko Pilan, Elena Volodina, and Torsten Zesch. 2016. Predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan, pages 2101-2111.

Luis Tandalla. 2012. Scoring short answer essays. ASAP Short Answer Scoring Competition System Description. Retrieved July 28, 2014.

Seid Muhie Yimam, Iryna Gurevych, Richard Eckart de Castilho, and Chris Biemann. 2013. WebAnno: A flexible, web-based and visually supported system for distributed annotations. In ACL (Conference System Demonstrations), pages 1-6.

Jure Zbontar. 2012. Short answer scoring by stacking. ASAP Short Answer Scoring Competition System Description. Retrieved July 28, 2014.
