
Automated Essay Scoring based on Two-Stage Learning

Jiawei Liu†, Yang Xu‡, and Yaguang Zhu†
†Fopure (Hangzhou) Technology Co., Ltd., Hangzhou, China
‡School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Email: [email protected], [email protected], [email protected]

Abstract—Current state-of-the-art feature-engineered and end-to-end Automated Essay Scoring (AES) methods are proven to be unable to detect adversarial samples, e.g. essays composed of permuted sentences and prompt-irrelevant essays. Focusing on this problem, we develop a Two-Stage Learning Framework (TSLF) which integrates the advantages of both feature-engineered and end-to-end AES methods. In experiments, we compare TSLF against a number of strong baselines, and the results demonstrate the effectiveness and robustness of our models. TSLF surpasses all the baselines on five-eighths of prompts and achieves new state-of-the-art average performance when no negative samples are present. After adding some adversarial essays to the original datasets, TSLF outperforms the feature-engineered and end-to-end baselines to a great extent and shows great robustness.

I. INTRODUCTION

Automated Essay Scoring (AES), which extracts various features from essays and then scores them on a numeric range, can improve the efficiency of writing assessment and greatly reduce human effort. In general, AES models can be divided into two main streams. The models of the first stream are feature-engineered models, which are driven by handcrafted features such as the number of words and grammar errors [1, 2]. The advantage is that the handcrafted features are explainable and flexible, and can be modified and adapted to different scoring criteria. However, some deep semantic features, which are extracted by understanding the essays and are especially essential for prompt-dependent writing tasks, are hard to capture with feature-engineered models.

The other stream is the end-to-end model, which is driven by the rapid development of deep learning techniques [3-6]. Specifically, based on [7, 8], essays are represented as low-dimensional vectors, followed by a dense layer that transforms these deep-encoded vectors (involving deep semantic meanings) into corresponding ratings. Although end-to-end models are good at extracting deep semantic features, they can hardly integrate handcrafted features like spelling errors and grammar errors, which are proven to be vital for the effectiveness of AES models. In this paper, we argue that both handcrafted features and deep-encoded features are necessary and should be exploited to enhance AES models.

It is reported that some well-designed adversarial inputs can be exploited to cheat AES models, so that writers who are familiar with the systems' workings can maximize their scores [9]. Generally, there are two categories of adversarial inputs. One is composed of well-written permuted paragraphs, which have been successfully detected by [5] based on a coherence model [10]. The other consists of prompt-irrelevant essays, which remain to be dealt with. Focusing on the problems and arguments mentioned above, in this paper we develop a Two-Stage Learning Framework (TSLF), which makes full use of the advantages of feature-engineered and end-to-end methods. In the first stage, we calculate three scores, including a semantic score, a coherence score and a prompt-relevant score, based on Long Short-Term Memory (LSTM) neural networks. The semantic score is prompt-independent and is utilized to evaluate essays at a deep semantic level. The coherence score is exploited to detect essays composed of permuted paragraphs. The connections between prompts and essays are evaluated based on prompt-relevant scores, which are defined to detect prompt-irrelevant samples. In the second stage, we concatenate these three scores with some handcrafted features, and the result is fed to the eXtreme Gradient Boosting model (XGBoost) [11] for further training. The details of TSLF are illustrated in Figure 1. In experiments, TSLF together with a number of strong baselines is evaluated on the public Automated Student Assessment Prize (ASAP) dataset [3], which consists of 8 prompts. Our contributions in this paper are summarized as follows.

• The results on the original ASAP dataset demonstrate the effectiveness of integrating both feature-engineered models' advantages and end-to-end models' advantages. TSLF outperforms the baselines on five-eighths of prompts and achieves new state-of-the-art performance on average.
• After adding some adversarial samples to the original ASAP dataset, TSLF surpasses all baselines to a great degree and shows great robustness. The results demonstrate the validity of our coherence model and prompt-relevant model for detecting negative samples.
• With respect to the handcrafted features, current AES models only consider spelling errors. However, other grammar errors such as article errors and preposition errors are also very important for a valid AES system. To the best of our knowledge, we are the first to introduce a Grammar Error Correction (GEC) system into AES models.

Fig. 1. Two-Stage Learning Framework for AES. In the first stage, based on deep neural networks, we calculate the semantic score, coherence score and prompt-relevant score, denoted Se, Ce and Pe respectively. Ce and Pe are proposed to detect adversarial samples. In the second stage, we concatenate these three scores with some handcrafted features and feed the result to a boosting tree model for further training.

II. TWO-STAGE LEARNING FRAMEWORK (TSLF)

In Section I, we argue that deep-encoded features and handcrafted features are both necessary for a valid AES system. In this section, we introduce our Two-Stage Learning Framework (TSLF), which combines the advantages of feature-engineered models and end-to-end models. As shown in Figure 1, during the first stage we calculate the semantic score S_e, the coherence score C_e and the prompt-relevant score P_e, where C_e is utilized to detect the adversarial samples composed of well-written permuted paragraphs and P_e is designed for prompt-irrelevant samples. In the second stage, these three scores together with some handcrafted features are concatenated and fed to a boosting tree model for further training.

A. Sentence Embedding

It is proven that the context-dependent embedding method named Bidirectional Encoder Representations from Transformers (BERT) achieved new state-of-the-art results on downstream tasks such as text classification [12]. Due to these exciting achievements, in this paper sentence embeddings are derived from the pre-trained BERT model¹. For a sentence s = {t_0, t_1, ..., t_n, t_{n+1}}, t_i (0 ≤ i ≤ n+1) indicates the i-th word in the sentence, t_0 is the special tag CLS used for classification tasks, and t_{n+1} is another special tag SEP utilized to split sentences. Every word in the sentence, including CLS and SEP, is encoded into a low-dimensional embedding w_i (w_i ∈ R^d) by BERT. In this paper, the average of the hidden states of the penultimate transformer layer along the time axis is exploited to represent the sentence. Concretely, the representation of sentence s is

s_{snt} = \frac{1}{n+2} \sum_{i=0}^{n+1} w_i^{-2}        (1)

where s_{snt} denotes the sentence's embedding and the superscript −2 of w_i indicates that the word representations are taken from the penultimate transformer layer. We do not make use of the last layer's representations because the last layer is too close to the target functions of the pre-training tasks, i.e., the masked language model task and the next sentence prediction task [12], and therefore its representations may be biased toward those targets.

¹https://github.com/google-research/bert. In this paper, we utilize the uncased model with 12 layers, 768 hidden units, 12 heads and 110M parameters.
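To make Equation (1) concrete, the following is a minimal sketch using the HuggingFace transformers library and PyTorch. The library, model name and function names are our assumptions for illustration; the paper itself builds on the official google-research/bert release.

```python
import torch
from transformers import BertTokenizer, BertModel

# Sketch of Equation (1): average the penultimate transformer layer's
# hidden states (including [CLS] and [SEP]) to embed one sentence.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def sentence_embedding(sentence: str) -> torch.Tensor:
    # The tokenizer adds the special [CLS] (t_0) and [SEP] (t_{n+1}) tags.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[-2] is the penultimate layer: shape (1, n+2, 768).
    penultimate = outputs.hidden_states[-2]
    # Mean over the time axis implements s_snt = (1/(n+2)) * sum_i w_i^{-2}.
    return penultimate.mean(dim=1).squeeze(0)  # shape (768,)

emb = sentence_embedding("Automated essay scoring saves human effort.")
print(emb.shape)  # torch.Size([768])
```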
B. First Stage

Semantic Score. In the first stage, we utilize an LSTM to map essays into low-dimensional embeddings, which are then fed to a dense output layer for scoring. Concretely, consider an essay e = {s_1, s_2, ..., s_m}, where s_t (1 ≤ t ≤ m, s_t ∈ R^d) is the t-th sentence embedding in the essay and d is the dimension of the sentence embeddings. The encoding process of the LSTM is described as follows:

i_t = \sigma(W_i \cdot s_t + U_i \cdot h_{t-1} + b_i)
f_t = \sigma(W_f \cdot s_t + U_f \cdot h_{t-1} + b_f)
\tilde{c}_t = \sigma(W_c \cdot s_t + U_c \cdot h_{t-1} + b_c)
c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1}        (2)
o_t = \sigma(W_o \cdot s_t + U_o \cdot h_{t-1} + b_o)
h_t = o_t \circ \tanh(c_t)

h_t denotes the hidden state of sentence s_t. W_i, W_f, W_c, W_o and U_i, U_f, U_c, U_o are the weight matrices for the input gate, forget gate, candidate state and output gate respectively, and b_i, b_f, b_c, b_o stand for the bias vectors. σ denotes the sigmoid function and ◦ means element-wise multiplication. Hence, for the essay e we obtain the hidden state set H = {h_1, h_2, ..., h_m}. In this paper, we utilize the last hidden state rather than the average hidden state [3, 5] to define the final essay representation; in Section III, we demonstrate that the average hidden state does not perform as well as the last hidden state.

The essay representation h_m is then fed into a dense layer that transforms the low-dimensional vector into a scalar value. However, different writing tasks may have different score ranges. For instance, in the writing task of the Test of English as a Foreign Language (TOEFL), scores range from 0 to 30, while the score range of the International English Language Testing System (IELTS) is (0, 9). Since a fixed score range is necessary across different writing tasks, we project the output of the dense layer to a scaled value in the range (0, 1) by utilizing the sigmoid function:

S_e = \mathrm{sigmoid}(w_s \cdot h_m + b_s)        (3)

S_e indicates the semantic score of essay e, w_s is the weight matrix of the dense layer and b_s stands for the bias. The objective we choose is the Mean Squared Error (MSE) loss. Given a training set with N samples E = {e_1, e_2, ..., e_N}, the objective is

\mathrm{obj}(S_E, \tilde{S}_E) = \frac{1}{N} \sum_{i=1}^{N} (S_i - \tilde{S}_i)^2        (4)

where S_E is the predicted score set of the training samples and \tilde{S}_E is the original hand-marked score set.

Coherence Score. The first kind of adversarial sample, essays containing permuted well-written paragraphs, is detected with a coherence model. Obviously, coherence scores for well-organized essays should be higher than those of permuted essays. Unlike the clique strategy used in [5, 10], our coherence model is LSTM-based; its details are illustrated in Figure 1, and its structure is the same as that of the model used to obtain the semantic score. Similarly, for an essay e = {s_1, s_2, ..., s_m}, the final coherence score is

C_e = \mathrm{sigmoid}(w_c \cdot h_m + b_c)        (5)

where C_e denotes the coherence score, w_c is the weight matrix, b_c is the bias and h_m is the final hidden state of the LSTM. The objective of our coherence model is also built on the MSE:

\mathrm{obj}(C_E, \tilde{C}_E) = \frac{1}{N} \sum_{i=1}^{N} (C_i - \tilde{C}_i)^2        (6)

C_E stands for the predicted coherence score set of the training samples, \tilde{C}_E denotes the gold coherence score set and N is the number of training samples. During training, we assume that the gold coherence scores of the original essays are equal to the corresponding hand-marked scores, an assumption also adopted by [5]. For the permuted essays, the gold coherence scores are set to 0.

Prompt-relevant Score. The prompt-relevant score is again calculated with an LSTM, as illustrated in Figure 1. For a specific prompt composed of n sentences p = {s_1, s_2, ..., s_n} and an essay e = {s_1, s_2, ..., s_m} with m sentences, we first combine p and e into \tilde{e} = {s_1, s_2, ..., s_{m+n}}, where s_i (1 ≤ i ≤ m+n) denotes a sentence embedding. By exploiting the LSTM, the sentence set \tilde{e} is encoded into a hidden state set H = {h_1, h_2, ..., h_{m+n}}. Then, the last hidden state h_{m+n} is fed to a dense layer followed by a sigmoid activation to obtain the prompt-relevant score:

P_e = \mathrm{sigmoid}(w_p \cdot h_{m+n} + b_p)        (7)

where P_e is the prompt-relevant score, and w_p and b_p indicate the weight matrix and bias respectively. The objective is again based on the MSE:

\mathrm{obj}(P_E, \tilde{P}_E) = \frac{1}{N} \sum_{i=1}^{N} (P_i - \tilde{P}_i)^2        (8)

P_E stands for the predicted prompt-relevant score set of the training samples, \tilde{P}_E denotes the gold prompt-relevant score set and N is the number of training samples. In the training set, the gold scores of prompt-relevant essays are taken to be the hand-marked scores; for prompt-irrelevant essays, the gold prompt-relevant scores are defined as 0.
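Since the semantic, coherence and prompt-relevant scorers share one architecture (Equations 2-8: an LSTM over sentence embeddings, a dense projection of the last hidden state, and a sigmoid trained with MSE), a single module suffices. Below is a minimal PyTorch sketch under the paper's stated hyperparameters (single-layer LSTM, hidden size 1024); the class and variable names are ours, not the authors'.

```python
import torch
import torch.nn as nn

class LSTMScorer(nn.Module):
    """Shared architecture for the semantic, coherence and prompt-relevant
    scorers: score = sigmoid(w · h_last + b), trained with an MSE loss."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 1024):
        super().__init__()
        # Single-layer unidirectional LSTM, as in the paper.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, sents: torch.Tensor) -> torch.Tensor:
        # sents: (batch, num_sentences, embed_dim) sentence embeddings.
        # For the prompt-relevant score, prompt and essay sentences are
        # concatenated along the sequence axis before this call.
        _, (h_n, _) = self.lstm(sents)
        h_last = h_n[-1]                      # last hidden state h_m
        return torch.sigmoid(self.proj(h_last)).squeeze(-1)

scorer = LSTMScorer()
batch = torch.randn(16, 12, 768)              # 16 essays, 12 sentences each
scores = scorer(batch)                        # values in (0, 1)
loss = nn.MSELoss()(scores, torch.rand(16))   # MSE objective, gold in (0, 1)
```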
C. Second Stage

We argue that handcrafted features are also necessary for the improvement of AES systems. In this stage, S_e, C_e and P_e are applied together with some handcrafted features for further training. The score S_e contains semantic-related information, while C_e and P_e are exploited to counter the adversarial samples.

TABLE I
HANDCRAFTED FEATURES USED IN OUR TWO-STAGE AES MODEL

INDEX  FEATURE DESCRIPTION
1      Number of grammar errors
2      Essay length in words and characters
3      Mean and variance of the word length in characters
4      Mean and variance of the sentence length in words
5      Number of clauses in an essay
6      Vocabulary size of the essay

Handcrafted Features. Some of the handcrafted features utilized in this paper are from [6] and are shown in Table I. Intuitively, essays with more grammar errors tend to be assigned lower scores. In addition, we observe that essays with the lowest scores usually contain far fewer words and sentences than other essays; hence, length-based features are taken into consideration. The complexity of words and sentences is evaluated in terms of the mean and variance of word length and sentence length. Besides, the number of clauses is another aspect that indicates the grammatical ability of writers. Vocabulary size, i.e., the number of unique words in an essay, is utilized to evaluate writers' vocabulary level.
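The features in Table I reduce to simple text statistics once the grammar-error count is available from the GEC pipeline described next. The sketch below is ours, not the authors' code; in particular, the clause count uses a crude subordinating-marker heuristic, whereas the paper presumably relies on a parser.

```python
import re
import statistics

def handcrafted_features(essay: str, num_grammar_errors: int) -> list:
    """Features from Table I. `num_grammar_errors` is assumed to come
    from an external GEC/spell-check pipeline."""
    words = essay.split()
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    word_lens = [len(w) for w in words] or [0]
    sent_lens = [len(s.split()) for s in sentences] or [0]
    # Rough clause proxy: count subordinating markers. This heuristic is
    # our assumption, not the paper's method.
    clauses = len(re.findall(r"\b(that|which|because|although|while|if)\b",
                             essay.lower()))
    return [
        num_grammar_errors,                      # 1. grammar errors
        len(words), len(essay),                  # 2. length in words/chars
        statistics.mean(word_lens),              # 3. word-length mean/variance
        statistics.pvariance(word_lens),
        statistics.mean(sent_lens),              # 4. sentence-length mean/variance
        statistics.pvariance(sent_lens),
        clauses,                                 # 5. clause count
        len(set(w.lower() for w in words)),      # 6. vocabulary size
    ]
```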

Grammar Error Correction (GEC). From our investigation, traditional feature-engineered AES methods are only concerned with spelling errors. However, other grammar errors such as preposition errors and article errors are also necessary for evaluating essays. Hence, a valid GEC system is needed. Actually, many distinguished achievements have been made in the field of GEC. [13] trained a statistical machine translation model for this task and obtained state-of-the-art performance by exploiting Moses². [14] proposed three boost learning methods based on the traditional sequence-to-sequence framework and achieved a new state-of-the-art result. Since GEC is not the dominant part of this paper, we adapt the pre-trained GEC model³ to conduct grammar error correction. The open-source software JamSpell⁴ is chosen for spell checking.

XGBoost Learning. XGBoost⁵ is an open-source software package well known for its efficiency, flexibility and portability [11]. It provides a boosting model termed Gradient Boosting Decision Tree (GBDT) and consists of two basic models, a tree model and a linear model. Given the handcrafted feature set H, the overall score O_e of an essay is defined as

O_e = \mathrm{XGBoost}([H; S_e; C_e; P_e])        (9)

where [ ; ] denotes the concatenation operation.

²http://www.statmt.org/moses/
³https://github.com/grammatical/baselines-emnlp2016
⁴https://github.com/bakwc/JamSpell
⁵https://github.com/dmlc/xgboost

D. Training

In the ASAP dataset, the score ranges differ from prompt to prompt. For consistency, the gold scores in the training set are first normalized to the range (0, 1); during testing, predicted scores are rescaled back to the original ranges. For training the objectives in the first stage, we utilize Adam to update the trainable variables. The hyperparameters β1, β2 and ε are set to 0.9, 0.999 and 1e−6 respectively, and the initial learning rate is set to 0.00001; these hyperparameters are also adopted in the source code of BERT [12]. All LSTM networks used in this paper are single-layer with a hidden size of 1024. We also tested multi-layer and bidirectional LSTM models, but the results were not as good as we expected. To avoid overfitting, dropout with a proportion of 0.5 is applied during training.

During the second-stage learning, we apply the tree-based boosting model with a learning rate of 0.001. The maximum tree depth is set to 6 and the logistic function is chosen as the objective; other parameters are left at their defaults. Early stopping is set to 100 rounds: if the loss does not improve for 100 consecutive training rounds, training stops.

At each training step of the coherence model, we permute each original essay in the training batch to generate adversarial samples. Specifically, the training batch size becomes 32 after we add the adversarial samples to the original training batch, whose size is 16. To generate prompt-irrelevant samples, for every prompt with m essays in the training set, we randomly select m negative samples from other prompts so that the imbalance problem is avoided. The gold scores of these adversarial samples are set to 0.
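A sketch of the second-stage learning (Equation 9) with the stated settings (learning rate 0.001, maximum depth 6, logistic objective, 100-round early stopping), using the xgboost scikit-learn wrapper on dummy data. This is our illustration, not the authors' code; note that the placement of early_stopping_rounds differs across xgboost versions (constructor in 1.6+, fit() before that).

```python
import numpy as np
import xgboost as xgb

# Second stage (Equation 9): concatenate the handcrafted feature set H with
# the first-stage scores [Se, Ce, Pe] and train a boosting tree regressor.
rng = np.random.default_rng(0)              # dummy data for illustration only
H = rng.random((500, 9))
Se, Ce, Pe = rng.random(500), rng.random(500), rng.random(500)
y = rng.random(500)                         # normalized gold scores in (0, 1)

X = np.concatenate([H, Se[:, None], Ce[:, None], Pe[:, None]], axis=1)
X_train, X_dev, y_train, y_dev = X[:400], X[400:], y[:400], y[400:]

model = xgb.XGBRegressor(
    objective="reg:logistic",    # logistic objective; targets lie in (0, 1)
    learning_rate=0.001,
    max_depth=6,
    n_estimators=10000,
    early_stopping_rounds=100,   # stop after 100 non-improving rounds
)
model.fit(X_train, y_train, eval_set=[(X_dev, y_dev)], verbose=False)
overall = model.predict(X_dev)   # O_e, later rescaled to each prompt's range
```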
III. EXPERIMENTS

A. Experimental Setup

Dataset and Corpus. The most widely used dataset for AES is the Automated Student Assessment Prize (ASAP)⁶, which has been utilized to evaluate the performance of AES systems in [3, 4, 6, 15]. ASAP is composed of 8 prompts and 12976 essays, written by students from Grade 7 to Grade 10. More details are summarized in Table II. Similar to [3, 5], we apply 5-fold cross validation on ASAP.

TABLE II
DETAILS OF THE ASAP DATASET

PROMPT ID  ESSAYS  SCORE RANGE
1          1783    2-12
2          1800    1-6
3          1726    0-3
4          1772    0-3
5          1805    0-4
6          1800    0-4
7          1569    0-30
8          723     0-60

Evaluation Metric. For consistency, the predicted scores together with the gold scores in the test set are uniformly re-scaled into [0, 10]. Similar to [3-6], we test the performance of TSLF with the Quadratic Weighted Kappa (QWK), the official evaluation metric⁷, which measures the agreement between the predicted scores and the gold scores (see the sketch below).

Strong Baselines. The baselines compared against TSLF are listed in Table III. EASE (Enhanced AI Scoring Engine) is a feature-engineered AES engine (introduced in Section IV) that took third place in the public ASAP competition using part-of-speech (POS), n-gram and length-based features. In EASE, the extracted handcrafted features are fed into a regression model for training; in this paper, we choose Support Vector Regression (SVR) and Bayesian Linear Ridge Regression (BLRR) [16] for comparison. With respect to end-to-end AES baselines, we choose three models proposed in [3], namely CNN, LSTM and CNN+LSTM, which achieved state-of-the-art performance on ASAP. In CNN, all word embeddings are fed to a convolution layer to extract essay representations. In LSTM, essay representations are extracted by a recurrent layer rather than a convolution layer. In CNN+LSTM, the outputs of the convolution layer are fed into the recurrent layer for further extraction. Essay representations are then transformed to scalars through a sigmoid activation.

⁶https://www.kaggle.com/c/asap-aes
⁷https://github.com/benhamner/asap-aes

B. ASAP Test without Adversarial Essays

In this subsection, we test the performance of the baselines and TSLF on the original ASAP dataset. As mentioned above, ASAP is composed of 8 prompts. For better comparison, we report the QWK scores of the baselines and TSLF on each prompt, and average the scores to evaluate the overall performance.
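QWK can be reproduced with scikit-learn's quadratically weighted Cohen's kappa; since kappa is defined over discrete labels, the re-scaled scores are rounded to integers first. A minimal sketch (ours, not the official asap-aes script):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk(pred, gold, low=0, high=10):
    """Quadratic Weighted Kappa on scores uniformly re-scaled into [0, 10].
    Kappa expects discrete labels, so scores are rounded to integers."""
    pred = np.rint(np.clip(pred, low, high)).astype(int)
    gold = np.rint(np.clip(gold, low, high)).astype(int)
    return cohen_kappa_score(pred, gold, weights="quadratic")

print(qwk([9.7, 4.2, 1.1], [10, 4, 2]))  # agreement close to 1.0
```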

TABLE III
RESULTS ON THE ASAP DATASET WITHOUT ADVERSARIAL SAMPLES

MODELS      PROMPT1  PROMPT2  PROMPT3  PROMPT4  PROMPT5  PROMPT6  PROMPT7  PROMPT8  AVERAGE
EASE(SVR)   0.781    0.621    0.630    0.749    0.782    0.771    0.727    0.534    0.699
EASE(BLRR)  0.761    0.606    0.621    0.742    0.784    0.775    0.730    0.617    0.705
CNN         0.804    0.656    0.637    0.762    0.752    0.765    0.750    0.680    0.726
LSTM        0.808    0.697    0.689    0.805    0.818    0.827    0.811    0.598    0.756
CNN+LSTM    0.821    0.688    0.694    0.805    0.807    0.819    0.808    0.644    0.761
TSLF-1      0.757    0.698    0.725    0.796    0.810    0.783    0.727    0.544    0.730
TSLF-2      0.808    0.718    0.693    0.698    0.771    0.720    0.722    0.616    0.718
TSLF-ALL    0.852    0.736    0.731    0.801    0.823    0.792    0.762    0.684    0.773

The results are shown in Table III. The results of EASE are from [16]: EASE(SVR) means the extracted features of EASE are fed to SVR for regression, and EASE(BLRR) means the training is conducted with BLRR. The results of CNN, LSTM and CNN+LSTM are from [3], which achieved the previous state-of-the-art performance on ASAP. TSLF-1, TSLF-2 and TSLF-ALL represent our models: TSLF-1 is similar to other end-to-end models and utilizes the semantic score as the overall essay score; TSLF-2 is a feature-engineered model that only utilizes the handcrafted features listed in Table I to predict the essay score; TSLF-ALL denotes the whole TSLF pipeline. Since there are no adversarial essays in ASAP, during the second-stage learning we remove the coherence score and prompt-relevant score from the concatenated features in Equation 9.

Analysis. From Table III, we observe that TSLF-ALL outperforms all baselines on five-eighths of prompts and leads the average performance, which is consistent with our expectation. After analyzing the ASAP dataset, we note that the samples whose scores are much lower than the average are usually very short essays. Besides, many essays contain grammar errors and spelling errors. From our perspective, these length-based features and errors are difficult for end-to-end models to make full use of. However, thanks to the second-stage learning, TSLF-ALL can take advantage of not only the deep-encoded features generated by end-to-end models but also the handcrafted features, which accounts for its impressive results. In addition, with respect to the feature-engineered models in Table III, we find that TSLF-2 achieves better average performance than EASE while using fewer features. We speculate that the improvements of TSLF-2 can be credited to the introduction of the GEC system and the advanced ability of the tree boosting regression model.

Furthermore, TSLF-ALL consistently performs much better than TSLF-1 and TSLF-2. To better understand the connections between TSLF-1, TSLF-2 and TSLF-ALL, we first shuffle the whole ASAP dataset, using 80% of the essays for training and the rest for testing. Every 3 training epochs of TSLF-1 and TSLF-ALL, we calculate the QWK scores on the test data; the results are illustrated in Figure 2, where the blue solid line indicates TSLF-1, the orange dashed line denotes TSLF-2 and the green solid line represents TSLF-ALL. The results in Figure 2 are consistent with our observations in Table III. Moreover, the performance of TSLF-ALL shows an ascending trend and always keeps pace with TSLF-1. In addition, TSLF-ALL always performs much better than the other two models, especially at the beginning of the training process, which demonstrates that the performance of AES can benefit from integrating both deep-encoded features and handcrafted features.

Fig. 2. The relationship between TSLF-1, TSLF-2 and TSLF-ALL.

Fig. 3. The performance of different essay representations.

In [3, 5], the essay representation is calculated by a mean-over-time operation on the LSTM's hidden states. Actually, the essay representation can also be obtained from the last hidden state. Hence, we compare the performance of the two kinds of essay representation; the results are illustrated in Figure 3. It is clear that the last hidden state is superior to the average hidden state. In this paper, the semantic score, coherence score and prompt-relevant score are all calculated from the last hidden state of the LSTM due to its efficiency.
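The two pooling choices differ by a single line; the PyTorch fragment below (ours) contrasts the mean-over-time pooling of [3, 5] with the last-hidden-state pooling adopted here.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=768, hidden_size=1024, batch_first=True)
sents = torch.randn(1, 12, 768)          # one essay of 12 sentence embeddings
outputs, (h_n, _) = lstm(sents)          # outputs: (1, 12, 1024)

mean_over_time = outputs.mean(dim=1)     # pooling used in [3, 5]
last_hidden = h_n[-1]                    # pooling used in this paper
```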

C. ASAP Test with Adversarial Essays

In this subsection, we test the ability of the AES system to detect adversarial essays. As mentioned above, we utilize the coherence score to detect essays made of well-written permuted paragraphs, and apply the prompt-relevant score to detect prompt-irrelevant essays.

Fig. 4. The predicted overall scores of negative samples.

TABLE IV
RESULTS ON THE ASAP DATASET WITH ADVERSARIAL SAMPLES

METHODS    ADVER1  ADVER2  ADVER1+ADVER2
TSLF-1     0.119   0.169   0.160
TSLF-2     0.128   0.178   0.060
TSLF-ALL   0.651   0.883   0.709

Adversarial Essays. Generally, there are three possible ways to add adversarial samples to the original ASAP dataset: (1) we only add permuted essays to ASAP; (2) we only add prompt-irrelevant essays to ASAP; (3) we add both permuted essays and prompt-irrelevant essays to ASAP. In condition 1, for every essay in the ASAP dataset, we permute its sentences to generate the corresponding adversarial essay. In condition 2, we randomly select the same number of essays from other prompt sets to generate prompt-irrelevant samples. In condition 3, we create equal numbers of permuted and prompt-irrelevant essays and make sure that the total number of negative samples equals the number of gold essays in a prompt (a sketch of conditions 1 and 2 follows at the end of this subsection). We report the average performance on ASAP using QWK scores, and the results are listed in Table IV.

Analysis. From Table IV, we observe that the end-to-end model TSLF-1 and the feature-engineered model TSLF-2 are weak when adversarial essays are added to the test set, which is consistent with our expectations. However, TSLF-ALL is much more robust because of the addition of the coherence score and prompt-relevant score. Besides, we find that the permuted essays are much more difficult to detect than the prompt-irrelevant essays: in experiments, the prompt-relevant model needs only 3 epochs to converge while the coherence model converges after 120 epochs. Since there are massive numbers of permuted samples for any specific essay, and they are added to the training batch in every epoch, we attribute the much larger number of training epochs to the large number of permuted essays. To better understand the robustness of our model, we illustrate the predicted overall scores of the negative samples in Figure 4. The red points denote the predicted scores, and more than 82% of the points are below 0.2 (the green line). Clearly, most of the negative samples, including the permuted essays and prompt-irrelevant essays, are assigned extremely low marks. Therefore, we argue that the coherence model and the prompt-relevant model are valid and effective for improving the robustness of AES systems.
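As referenced above, a rough sketch of how the adversarial samples of conditions 1 and 2 might be generated; this is our illustration, and the function names are hypothetical.

```python
import random

def permuted_essay(sentences, max_tries=10):
    """Condition 1: shuffle a gold essay's sentence order (gold score -> 0)."""
    permuted = sentences[:]
    for _ in range(max_tries):       # retry so the order actually changes
        random.shuffle(permuted)
        if permuted != sentences:
            break
    return permuted

def prompt_irrelevant(essays_by_prompt, prompt_id, k):
    """Condition 2: sample k essays written for other prompts (gold -> 0)."""
    pool = [e for p, essays in essays_by_prompt.items()
            if p != prompt_id for e in essays]
    return random.sample(pool, k)
```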
IV. RELATED WORK

Automated Essay Scoring. As mentioned above, AES models are divided into feature-engineered models and end-to-end models. With regard to the first category, [1, 2, 6] trained a Rank Support Vector Machine (RankSVM) to predict essay scores using pre-defined handcrafted features. Besides, the open-source AES system⁸ named Enhanced AI Scoring Engine (EASE) applied several regression methods to score essays on the basis of handcrafted features. As for the end-to-end AES methods, [4] encoded an essay into a low-dimensional embedding by exploiting a deep LSTM network; at the output layer of their model, the essay embedding was fed through a linear unit to predict the essay score. It was reported that their approach outperformed the feature-engineered models on the ASAP dataset. [3] first utilized Convolutional Neural Networks (CNN) to extract context features at the word level. The outputs of the CNN layers were then fed to a following LSTM layer to obtain the final essay embedding; again, a linear unit was leveraged to map the essay embedding to a specific score. Similar to the CNN+LSTM model proposed in [3], [5] removed the CNN layer and extracted essay features directly with an LSTM network. The CNN+LSTM model was demonstrated to perform slightly better than the single LSTM model [3].

Efforts to Detect Adversarial Samples. It is well known that AES models are easily deceived by adversarial inputs if no strategy is designed to help the models detect well-crafted negative samples. In [9], the researchers asked experts who were familiar with e-Rater to write deceptive essays to trick e-Rater [17], which validated the fragility of existing AES systems. As was expected, essays composed of some repeated well-written paragraphs achieved higher-than-deserved grades. To improve the robustness of AES systems, [5] utilized the window-based coherence model [10] to detect adversarially crafted inputs. The results showed that their joint learning model was much more stable than other models that do not consider detecting adversarial samples. The adversarial samples roughly fall into two categories: one is based on well-written permuted paragraphs and has been studied by [5], and the other comprises prompt-irrelevant samples, which remained to be dealt with.

⁸https://github.com/edx/ease

V. CONCLUSION

To ensure the effectiveness and robustness of AES systems, we propose the Two-Stage Learning Framework (TSLF), which exploits both deep-encoded features and handcrafted features. In the first stage, three kinds of scores are proposed, including the semantic score S_e, the coherence score C_e and the prompt-relevant score P_e. Based on LSTM, S_e is utilized to evaluate essays by taking advantage of deep semantic information, and C_e and P_e are applied to help AES systems detect adversarial samples. In the second stage, we concatenate the handcrafted features and these three scores, and feed the result to a boosting tree model for further training. The experimental results demonstrate the effectiveness of TSLF: compared to a number of strong baselines, our model outperforms them on five-eighths of the prompts of the ASAP dataset and leads the average performance. Besides, TSLF is much more stable when adversarial samples are added to the test set, which is consistent with our expectation. In summary, the effectiveness and robustness of AES systems can benefit from integrating both deep-encoded features and handcrafted features.
REFERENCES

[1] H. Yannakoudakis, T. Briscoe, and B. Medlock, "A new dataset and method for automatically grading ESOL texts," in Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 180–189.
[2] H. Chen, J. Xu, and B. He, "Automated essay scoring by capturing relative writing quality," Computer Journal, vol. 57, no. 9, pp. 1318–1330, 2014.
[3] K. Taghipour and H. T. Ng, "A neural approach to automated essay scoring," in Conference on Empirical Methods in Natural Language Processing, 2016.
[4] D. Alikaniotis, H. Yannakoudakis, and M. Rei, "Automatic text scoring using neural networks," pp. 715–725, 2016.
[5] Y. Farag, H. Yannakoudakis, and T. Briscoe, "Neural automated essay scoring and coherence modeling for adversarially crafted input," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, 2018, pp. 263–271.
[6] C. Jin, B. He, K. Hui, and L. Sun, "TDNN: a two-stage deep neural network for prompt-independent automated essay scoring," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1, 2018, pp. 1088–1097.
[7] T. Mikolov and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, 2013.
[8] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, vol. 14, 2014, pp. 1532–1543.
[9] D. E. Powers, J. C. Burstein, M. Chodorow, M. E. Fowles, and K. Kukich, "Stumping e-rater: challenging the validity of automated essay scoring," Computers in Human Behavior, vol. 18, no. 2, pp. 103–134, 2002.
[10] J. Li and E. Hovy, "A model of coherence based on distributed sentence representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 2039–2048.
[11] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," CoRR, vol. abs/1603.02754, 2016. [Online]. Available: http://arxiv.org/abs/1603.02754
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[13] M. Junczys-Dowmunt and R. Grundkiewicz, "Phrase-based machine translation is state-of-the-art for automatic grammatical error correction," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, November 2016, pp. 1546–1556.
[14] T. Ge, F. Wei, and M. Zhou, "Fluency boost learning and inference for neural grammatical error correction," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1, 2018, pp. 1055–1065.
[15] M. Cozma, A. M. Butnaru, and R. T. Ionescu, "Automated essay scoring with string kernels and word embeddings," arXiv preprint arXiv:1804.07954, 2018.
[16] P. Phandi, K. M. A. Chai, and H. T. Ng, "Flexible domain adaptation for automated essay scoring using correlated linear regression," 2015, pp. 431–439.
[17] J. Burstein, K. Kukich, S. Wolff, C. Lu, M. Chodorow, L. Braden-Harder, and M. D. Harris, "Automated scoring using a hybrid feature identification technique," in International Conference on Computational Linguistics, 1998, pp. 206–210.