Modeling Argument Strength in Student Essays
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Beijing, China, July 2015, pp. 543-552.

Isaac Persing and Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas
Richardson, TX 75083-0688
{persingq,vince}@hlt.utdallas.edu

Abstract

While recent years have seen a surge of interest in automated essay grading, including work on grading essays with respect to particular dimensions such as prompt adherence, coherence, and technical quality, there has been relatively little work on grading the essay dimension of argument strength, which is arguably the most important aspect of argumentative essays. We introduce a new corpus of argumentative student essays annotated with argument strength scores and propose a supervised, feature-rich approach to automatically scoring the essays along this dimension. Our approach significantly outperforms a baseline that relies solely on heuristically applied sentence argument function labels by up to 16.1%.

1 Introduction

Automated essay scoring, the task of employing computer technology to evaluate and score written text, is one of the most important educational applications of natural language processing (NLP) (see Shermis and Burstein (2003) and Shermis et al. (2010) for an overview of the state of the art in this task). A major weakness of many existing scoring engines such as the Intelligent Essay Assessor (Landauer et al., 2003) is that they adopt a holistic scoring scheme, which summarizes the quality of an essay with a single score and thus provides very limited feedback to the writer. In particular, it is not clear which dimension of an essay (e.g., style, coherence, relevance) a score should be attributed to. Recent work addresses this problem by scoring a particular dimension of essay quality such as coherence (Miltsakaki and Kukich, 2004), technical errors, relevance to prompt (Higgins et al., 2004; Persing and Ng, 2014), organization (Persing et al., 2010), and thesis clarity (Persing and Ng, 2013). Essay grading software that provides feedback along multiple dimensions of essay quality, such as E-rater/Criterion (Attali and Burstein, 2006), has also begun to emerge.

Our goal in this paper is to develop a computational model for scoring the essay dimension of argument strength, which is arguably the most important aspect of argumentative essays. Argument strength refers to the strength of the argument an essay makes for its thesis. An essay with a high argument strength score presents a strong argument for its thesis and would convince most readers. While there has been work on designing argument schemes (e.g., Burstein et al. (2003), Song et al. (2014), Stab and Gurevych (2014a)) for annotating arguments manually (e.g., Song et al. (2014), Stab and Gurevych (2014b)) and automatically (e.g., Falakmasir et al. (2014), Song et al. (2014)) in student essays, little work has been done on scoring the argument strength of student essays. It is worth mentioning that some work has investigated the use of automatically determined argument labels for heuristic (Ong et al., 2014) and learning-based (Song et al., 2014) essay scoring, but their focus is holistic essay scoring, not argument strength essay scoring.

In sum, our contributions in this paper are two-fold. First, we develop a scoring model for the argument strength dimension on student essays using a feature-rich approach. Second, in order to stimulate further research on this task, we make our data set consisting of argument strength annotations of 1000 essays publicly available. Since progress in argument strength modeling is hindered in part by the lack of a publicly annotated corpus, we believe that our data set will be a valuable resource to the NLP community.

2 Corpus Information

We use as our corpus the 4.5 million word International Corpus of Learner English (ICLE) (Granger et al., 2009), which consists of more than 6000 essays on a variety of different topics written by university undergraduates from 16 countries and 16 native languages who are learners of English as a Foreign Language. 91% of the ICLE texts are written in response to prompts that trigger argumentative essays. We select 10 such prompts, and from the subset of argumentative essays written in response to them, we select 1000 essays to annotate for training and testing of our essay argument strength scoring system. Table 1 shows three of the 10 topics selected for annotation. Fifteen native languages are represented in the set of annotated essays.

Topic | Languages | Essays
"Most university degrees are theoretical and do not prepare students for the real world. They are therefore of very little value." | 13 | 131
"The prison system is outdated. No civilized society should punish its criminals: it should rehabilitate them." | 11 | 80
"In his novel Animal Farm, George Orwell wrote 'All men are equal but some are more equal than others.' How true is this today?" | 10 | 64

Table 1: Some examples of writing topics. Languages and Essays give the number of native languages represented and the number of essays selected for each topic.

3 Corpus Annotation

We ask human annotators to score each of the 1000 argumentative essays along the argument strength dimension. Our annotators were selected from over 30 applicants who were familiarized with the scoring rubric and given sample essays to score. The six who were most consistent with the expected scores were given additional essays to annotate. Annotators evaluated the argument strength of each essay using a numerical score from one to four at half-point increments (see Table 2 for a description of each score).[1] This contrasts with previous work on essay scoring, where the corpus is annotated with a binary decision (i.e., good or bad) for a given scoring dimension (e.g., Higgins et al. (2004)). Hence, our annotation scheme not only provides a finer-grained distinction of argument strength (which can be important in practice), but also makes the prediction task more challenging.

[1] See our website at http://www.hlt.utdallas.edu/~persingq/ICLE/ for the complete list of argument strength annotations.

Score | Description of Argument Strength
4 | essay makes a strong argument for its thesis and would convince most readers
3 | essay makes a decent argument for its thesis and could convince some readers
2 | essay makes a weak argument for its thesis or sometimes even argues against it
1 | essay does not make an argument or it is often unclear what the argument is

Table 2: Descriptions of the meaning of scores.

To ensure consistency in annotation, we randomly select 846 essays to be graded by multiple annotators. Though annotators exactly agree on the argument strength score of an essay only 26% of the time, the scores they apply fall within 0.5 points of each other in 67% of essays and within 1.0 point in 89% of essays. For the sake of our experiments, whenever the two annotators disagree on an essay's argument strength score, we assign the essay the average of the two scores rounded down to the nearest half point. Table 3 shows the number of essays that receive each of the seven scores for argument strength.

score  | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0
essays |   2 |  21 | 116 | 342 | 372 | 132 |  15

Table 3: Distribution of argument strength scores.
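As a concrete illustration of the disagreement-resolution rule above, the adjudicated score is the two annotators' average rounded down to the nearest half point. The sketch below is ours, not the authors'; the function name is illustrative.

import math

def resolve_disagreement(score_a, score_b):
    """Average the two annotators' scores and round down to the nearest
    half point, e.g., 2.5 and 3.0 average to 2.75, which rounds down to 2.5."""
    avg = (score_a + score_b) / 2.0
    return math.floor(avg * 2) / 2.0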
4 Score Prediction

We cast the task of predicting an essay's argument strength score as a regression problem. Using regression captures the fact that some pairs of scores are more similar than others (e.g., an essay with an argument strength score of 2.5 is more similar to an essay with a score of 3.0 than it is to one with a score of 1.0). A classification system, by contrast, may sometimes believe that the scores 1.0 and 4.0 are most likely for a particular essay, even though these scores are at opposite ends of the score range. In the rest of this section, we describe how we train and apply our regressor.

Training the regressor. Each essay in the training set is represented as an instance whose label is the essay's gold score (one of the values shown in Table 3), with a set of baseline features (Section 5) and up to seven other feature types we propose (Section 6). After creating training instances, we train a linear regressor with regularization parameter c for scoring test essays using the linear SVM regressor implemented in the LIBSVM software package (Chang and Lin, 2001). All SVM-specific learning parameters are set to their default values except c, which we tune to maximize performance on held-out validation data.
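The training procedure can be sketched as follows. This is an illustrative reimplementation rather than the authors' code: it uses scikit-learn's libsvm-backed SVR with a linear kernel as a stand-in for the LIBSVM package, the c grid and the validation metric (mean absolute error) are assumptions not specified in this excerpt, and X_train, y_train, X_val, y_val are placeholders for the feature vectors and gold scores produced by the feature extraction of Sections 5 and 6.

from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

def train_argument_strength_regressor(X_train, y_train, X_val, y_val,
                                       c_grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Train a linear SVM regressor, tuning only the regularization
    parameter c on held-out validation data; all other SVM-specific
    parameters keep their default values."""
    best_model, best_err = None, float("inf")
    for c in c_grid:
        model = SVR(kernel="linear", C=c).fit(X_train, y_train)
        err = mean_absolute_error(y_val, model.predict(X_val))
        if err < best_err:
            best_model, best_err = model, err
    return best_model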
Applying the regressor. After training the regressor, we use it to score the test set essays. Test instances are created in the same way as the training instances. The regressor may assign an essay any score in the range 1.0-4.0.

5 Baseline Systems

In this section, we describe two baseline systems. One of them relies on sentence argument function labels applied heuristically by the following rules.

# | Rule
1 | Sentences that begin with a comparison discourse connective or contain any string prefixes from "conflict" or "oppose" are tagged OPPOSES.
2 | Sentences that begin with a contingency connective are tagged SUPPORTS.
3 | Sentences containing any string prefixes from "suggest", "evidence", "shows", "Essentially", or "indicate" are tagged CLAIM.
4 | Sentences in the first, second, or last paragraph that contain string prefixes from "hypothes" or "predict", but do not contain string prefixes from "conflict" or "oppose", are tagged HYPOTHESIS.
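These rules lend themselves to a straightforward implementation. The sketch below is illustrative only: the comparison and contingency connective lists are placeholders (this excerpt does not enumerate them; in practice they would be drawn from a discourse connective lexicon such as the PDTB's), "string prefix" is interpreted as a word beginning with the given string, the rules are assumed to be applied in the order listed, and sentences matching no rule are left unlabeled here.

COMPARISON_CONNECTIVES = {"however", "although", "but", "whereas", "while"}      # placeholder list
CONTINGENCY_CONNECTIVES = {"because", "if", "since", "so", "thus", "therefore"}  # placeholder list

def contains_prefix(words, prefixes):
    """True if any word in the sentence starts with one of the string prefixes."""
    return any(word.startswith(p) for word in words for p in prefixes)

def label_sentence(sentence, paragraph_index, num_paragraphs):
    """Apply rules 1-4 in order; return None if no rule fires."""
    words = sentence.split()
    if not words:
        return None
    first = words[0].lower().strip(",;:")
    # Rule 1: comparison connective at sentence start, or "conflict"/"oppose" prefixes anywhere
    if first in COMPARISON_CONNECTIVES or contains_prefix(words, ("conflict", "oppose")):
        return "OPPOSES"
    # Rule 2: contingency connective at sentence start
    if first in CONTINGENCY_CONNECTIVES:
        return "SUPPORTS"
    # Rule 3: claim-indicating prefixes anywhere in the sentence
    if contains_prefix(words, ("suggest", "evidence", "shows", "Essentially", "indicate")):
        return "CLAIM"
    # Rule 4: hypothesis prefixes, restricted to the first, second, or last paragraph
    if paragraph_index in (0, 1, num_paragraphs - 1):
        if contains_prefix(words, ("hypothes", "predict")) and not contains_prefix(words, ("conflict", "oppose")):
            return "HYPOTHESIS"
    return None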