Enhancing Automated Essay Scoring Performance via Fine-tuning Pre-trained Language Models with Combination of Regression and Ranking

Ruosong Yang†‡, Jiannong Cao†, Zhiyuan Wen†, Youzheng Wu‡, Xiaodong He‡
†Department of Computing, The Hong Kong Polytechnic University
‡JD AI Research
†{csryang, csjcao, cszwen}@comp.polyu.edu.hk
‡{wuyouzheng1, xiaodong.he}@jd.com

Abstract

Automated Essay Scoring (AES) is a critical text regression task that automatically assigns scores to essays based on their writing quality. Recently, the performance of sentence prediction tasks has been largely improved by using pre-trained language models via fusing representations from different layers, constructing an auxiliary sentence, using multi-task learning, etc. However, to solve the AES task, previous works utilize shallow neural networks to learn essay representations and constrain calculated scores with regression loss or ranking loss, respectively. Shallow neural networks trained on limited samples struggle to capture the deep semantics of texts, and without an accurate scoring function, ranking loss and regression loss measure two different aspects of the calculated scores. To improve AES performance, we find a new way to fine-tune pre-trained language models with multiple losses for the same task. In this paper, we propose to utilize a pre-trained language model to learn text representations first. With scores calculated from the representations, mean square error loss and a batch-wise ListNet loss with dynamic weights constrain the scores simultaneously. We utilize Quadratic Weighted Kappa to evaluate our model on the Automated Student Assessment Prize dataset. Our model outperforms not only state-of-the-art neural models by nearly 3 percent but also the latest statistical model. Especially on the two narrative prompts, our model performs much better than all other state-of-the-art models.

1 Introduction

Automated Essay Scoring (AES) automatically evaluates the writing quality of essays. Evaluating essay assignments costs a lot of time. Besides, the same instructor scoring the same essay at different times may assign different scores (intra-rater variation), and different raters scoring the same essay may assign different scores (inter-rater variation) (Smolentzov, 2013). To alleviate teachers' burden and avoid both intra-rater and inter-rater variation, AES is necessary and essential. An early AES system, e-rater (Chodorow and Burstein, 2004), has been used to score TOEFL writings.

Recently, large pre-trained language models, such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019), and XLNet (Yang et al., 2019), have shown extraordinary ability in representation and generalization, and have gained better performance on many downstream tasks such as text classification and regression. There are many new approaches to fine-tuning pre-trained language models. Sun et al. (2019a) proposed to construct an auxiliary sentence to solve aspect-based sentiment classification tasks. Cohan et al. (2019) added extra separator tokens to obtain representations of each sentence for sequential sentence classification tasks. Sun et al. (2019b) summarized several fine-tuning methods, including fusing text representations from different layers, utilizing multi-task learning, etc. To our knowledge, there are no existing works that improve AES with pre-trained language models. Before introducing our new way to use pre-trained language models, we briefly review existing works on AES.

Existing works utilize different methods to learn text representations and to constrain scores, which are the two key steps in AES models. For text representation learning, various neural networks are used to learn essay representations, such as the Recurrent Neural Network (RNN) (Taghipour and Ng, 2016; Tay et al., 2018), the Convolutional Neural Network (CNN) (Taghipour and Ng, 2016), and the Recurrent Convolutional Neural Network (RCNN) (Dong et al., 2017). However, simple neural networks like RNNs and CNNs focus on word-level information, making it difficult to capture

long-distance dependencies between words. Besides, shallow neural networks trained on a small volume of labeled data can hardly learn deep semantics. As for score constraints, prediction and ranking are two popular solutions. From the prediction perspective, the task is a regression or classification problem (Taghipour and Ng, 2016; Tay et al., 2018; Dong et al., 2017). From the recommendation perspective, learning-to-rank methods (Yannakoudakis et al., 2011; Chen and He, 2013) aim to rank all essays in the same order as that given by the gold scores. However, without a precise score mapping function, regression constraints alone cannot ensure the right ranking order, and ranking-based models alone cannot guarantee accurate scores. In general, there are two key challenges for the AES task: one is how to learn better essay representations to evaluate writing quality, and the other is how to learn a more accurate score mapping function.

Motivated by the great success of pre-trained language models such as BERT in learning text representations with deep semantics, it is reasonable to utilize BERT to learn essay representations. Since self-attention is a key component of the BERT model, it can capture the interactions between any two words in a whole essay (a long text). Previous work (Sun et al., 2019b) shows that fusing text representations from different layers does not improve performance effectively. For the AES task, the length of essays approaches the length limit of the BERT model, so it is hard to construct an auxiliary sentence. Meanwhile, only score labels are available, so it is also difficult to utilize multi-task learning. To summarize, existing AES works utilize either regression loss or ranking loss. Regression loss requires obtaining accurate score values, while ranking loss aims at a precise score order. Unlike multi-task learning, which requires different fully connected networks for different tasks, we propose to constrain the same task with multiple losses to fine-tune the BERT model. In addition, since it is impossible to rank all essays in one batch, the model is also required to learn accurate scores. During training, the weight of the regression loss increases while that of the ranking loss decreases.

In this paper, we propose R2BERT (BERT Model with Regression and Ranking). In our model, BERT is used to learn text representations that capture deep semantics. Then a fully connected neural network is used to map the representations to scores. Finally, regression loss and batch-wise ranking loss constrain the scores together and are jointly optimized with dynamic combination weights. To evaluate our model, an open dataset, the Automated Student Assessment Prize (ASAP), is used. Measured by Quadratic Weighted Kappa (QWK), our model outperforms state-of-the-art neural models on the average QWK score over all eight prompts by nearly 3 percent and also performs better than the latest statistical model. Especially on the two narrative prompts (7 and 8), even the regression-only model performs comparably to or better than other models, and our model with the combined loss gains much better performance. To explain the model's effectiveness, we also illustrate the attention weights on two example essays (an argumentative essay and a narrative essay). Self-attention can capture most conjunction words, which reveal the logical structure, and most key concepts, which show the topic shifting of narratives.

In summary, our contributions are:

• We propose a new method called multi-loss to fine-tune BERT models on AES tasks. We are also the first to combine regression and ranking in these tasks. The experiment results show that the combined loss improves performance significantly.

• Experiment results also show that our model achieves the best average QWK score and outperforms other state-of-the-art neural models on almost every prompt.

• To show the effectiveness of self-attention in the BERT model, we illustrate the weights of different words on two examples, including one argumentative essay and one narrative essay.

2 Related Works

Ke and Ng (2019) summarized recent works on automated essay scoring. In general, an AES solution has three parts, namely text representation learning, a score mapping function, and score constraints. Almost all works utilize a linear combination function to map each text representation to a score. In the rest of this section, we introduce the various score constraints together with the approaches used for text representation learning.

According to their score constraints, existing works fall into three categories, namely prediction,

recommendation, and reinforcement learning based models.

Prediction is the most general approach, including classification and regression. For classification, the models directly predict labels that point to different scores. In comparison, regression models constrain calculated scores to be the same as the gold ones. Generally, hand-crafted features and neural network based features are the two popular ways to learn text representations. Early works mainly focus on the construction of hand-crafted features such as statistical features and linguistic features. There are several early AES systems, including e-rater (Chodorow and Burstein, 2004), PEG (Project Essay Grade) (Shermis and Burstein, 2003), and IntelliMetric (Elliot, 2003). e-rater utilized ten linguistic features, including eight representing aspects of writing quality and two representing content. PEG used a larger feature set with more than 30 elements of writing quality. IntelliMetric aggregated all the features into five types, namely Focus/Coherence, Organization, Elaboration/Development, Sentence Structure, and Mechanics/Conventions. Cozma et al. (2018) combined string kernels and word embeddings to extract features. With the success of deep learning, researchers started to utilize various neural networks to learn text representations. Taghipour and Ng (2016) explored several neural networks, such as Long Short-Term Memory (LSTM) and CNN, and found that an ensemble model combining LSTM and CNN performs best. Dong et al. (2017) proposed a hierarchical text model that utilized CNN to learn sentence representations and LSTM to learn text representations. Tay et al. (2018) introduced a model called SKIPFLOW, which aimed to capture neural coherence features of the text by considering adjacent hidden states in the LSTM model.

From the recommendation view, learning to rank is another popular approach to this task. Yannakoudakis et al. (2011) first addressed this problem as a rank preference problem. Based on statistical features, RankSVM, a pairwise learning to rank model, was used as the score constraint. Chen and He (2013) utilized a listwise learning to rank model based on several linguistic features.

Reinforcement learning based models are also possible solutions. Wang et al. (2018b) utilized a dilated LSTM to learn text representations, and score calculation was guided by a quadratic weighted kappa based reward function.

For text representation, previous works only consider the relations among sentences. In this paper, we focus on all the interactions between any two words. Besides, existing works only utilize regression or ranking loss, respectively. We combine the two losses dynamically in our model.

3 R2BERT

In this section, we first introduce the framework of our model and briefly review the BERT model as well as self-attention. In addition, we illustrate the regression model along with some useful tricks. Finally, we present the batch-wise learning to rank model and the combination method.

Our model, as shown in Figure 1, takes a batch of essays as input. After preprocessing (adding a special token, [CLS], at the beginning of each essay), each token is transformed into its embedding and fed into the BERT model. The representation of each essay is the output vector corresponding to [CLS]. Essay scores are obtained by passing the representations into the score mapping function. The scores are constrained by regression loss and ranking loss, which are optimized jointly with the dynamic combination. During training, the weight of the regression loss gradually increases while that of the ranking loss decreases.

[Figure 1: R2BERT framework. A batch of essays, each prefixed with [CLS], is embedded and encoded by BERT; the [CLS] output vectors pass through the score mapping function, and the resulting scores are constrained jointly by the regression loss and the ranking loss, whose weights shift during training.]
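To make the framework concrete, below is a minimal sketch of the forward pass, assuming a model built on the HuggingFace transformers library (the paper uses its earlier pytorch-transformers release); the class and variable names are our own illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class R2BERTSketch(nn.Module):
    """Sketch: BERT encoder -> [CLS] vector -> linear score mapping -> sigmoid."""

    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.score_head = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]  # hidden state at the [CLS] position
        # Scores are squashed into [0, 1], matching the normalized gold scores.
        return torch.sigmoid(self.score_head(cls_vec)).squeeze(-1)
```

A batch of such scores then feeds both loss terms described in the rest of this section.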

3.1 BERT

BERT (Devlin et al., 2019) refers to Bidirectional Encoder Representations from Transformers, one of the most popular models of recent years. More specifically, BERT is an extremely large pre-trained language model that was trained on enormous corpora, in total more than 3000M words. Two target tasks, namely masked language modeling and next sentence prediction, are used to train the model. Many downstream natural language processing tasks have benefited from using pre-trained BERT to get text representations, such as sentence classification, question answering, and common sense inference. To handle regression problems, the target task is replaced by a fully connected neural network, and the whole BERT model is fine-tuned on the new dataset.

Generally, BERT has two parameter intensive settings:

BERT_BASE: 12 layers, 768 hidden dimensions, and 12 attention heads (in the transformer), with 110M parameters in total;

BERT_LARGE: 24 layers, 1024 hidden dimensions, and 16 attention heads (in the transformer), with 340M parameters in total.

3.2 Self-attention

Self-attention (Vaswani et al., 2017) is the key to the success of BERT; it is a mechanism by which a sequence computes attention weights over itself. Given a text, we construct a matrix in which each row corresponds to a word embedding and take three copies Q, K, V, referring to query, key, and value. The new word representations are calculated via the attention shown in Formula 1 and Formula 2, where d is the size of the word embedding, n_Q, n_K, and n_V denote the number of words in each text, and Q[i] is the i-th word representation in the query text Q.

\mathrm{Att}(Q, K) = \left[ \mathrm{softmax}\left( \frac{Q[i] \cdot K^T}{\sqrt{d}} \right) \right]_{i=0}^{n_Q - 1}  (1)

V_{att}(Q, K, V) = \mathrm{Att}(Q, K) \cdot V \in \mathbb{R}^{n_Q \times d}  (2)
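As a sanity check on Formulas 1 and 2, here is a small sketch of single-head self-attention where, as in the text above, Q, K, and V are plain copies of the embedding matrix; real BERT additionally applies learned projections to each copy and uses multiple heads.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (n, d) word embeddings; returns (n, d) contextualized vectors."""
    q, k, v = x, x, x                       # three copies: query, key, value
    d = x.size(-1)
    # Formula 1: row-wise softmax over scaled dot products.
    att = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return att @ v                          # Formula 2
```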
3.3 Feature Extraction

Given a sample essay t = \{w_1, w_2, ..., w_N\} as input, where N is the number of words, we preprocess it into a new sequence t' = \{[CLS], w_1, w_2, ..., w_N\}, where [CLS] is a special token. Assuming \mathrm{BERT}(\cdot) is the pre-trained BERT model, we obtain the hidden representations of all the input words, h = \mathrm{BERT}(t') \in \mathbb{R}^{r_h \times |t'|}, where |t'| is the length of the input sequence and r_h is the dimension of the hidden state. Finally, the hidden representation corresponding to [CLS], r = h_{[CLS]}, is used as the text representation.

3.4 Regression

With the obtained text representation r, a fully connected neural network \mathrm{FCNN}(\cdot) is used as the score mapping function. More specifically, FCNN is a linear combination function, where W is the weight matrix and b is the bias, as shown in Formula 3. To learn better parameters, the mean score of all training essays is used to initialize the bias b. In addition, \sigma = \mathrm{Sigmoid}(\cdot), a non-linear activation function, is used to normalize the calculated score into [0, 1], as shown in Formula 4.

\mathrm{FCNN}(r) = W r + b  (3)

s = \sigma(\mathrm{FCNN}(r))  (4)

Mean square error is a widely used loss function for regression tasks. Given a dataset D = \{(t_i, s^*_i) \mid i \in [1 : m]\}, where m is the number of samples, t_i refers to the i-th essay and s^*_i is its gold score. With s_i denoting the calculated score of the i-th essay, the regression objective L_m is shown in Formula 5.

L_m = \mathrm{MSE}(s, s^*) = \frac{1}{m} \sum_{i=1}^{m} (s_i - s^*_i)^2  (5)
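A minimal sketch of Formulas 3 to 5, including the bias initialization trick; the mean score value shown is a made-up example, and the layer size assumes BERT_BASE.

```python
import torch
import torch.nn as nn

hidden_size = 768          # r_h for BERT_BASE
mean_train_score = 0.62    # made-up example: mean gold score of the train set

fcnn = nn.Linear(hidden_size, 1)                # Formula 3: W r + b
nn.init.constant_(fcnn.bias, mean_train_score)  # trick: bias b starts at the mean score

def score(r: torch.Tensor) -> torch.Tensor:
    return torch.sigmoid(fcnn(r)).squeeze(-1)   # Formula 4: squash into [0, 1]

mse = nn.MSELoss()                              # Formula 5: L_m over the batch
```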

3.5 Batch-wise Learning to Rank Model

ListNet (Cao et al., 2007) ranks a list of objects at a time and measures the agreement between the predicted ranking list and the ground truth. In our problem, all the essays form one large list. However, it is impossible to rank all the essays in one batch. We sacrifice some accuracy and only rank the essays within each batch, which we call batch-wise ListNet.

Before introducing the objective of ListNet, we give several basic definitions. Suppose we are given a set of essays identified by the numbers \{1, 2, ..., m\}. A permutation \pi on the essays is defined as a bijection from \{1, 2, ..., m\} to itself, written as \pi = \langle \pi(1), \pi(2), ..., \pi(m) \rangle, where \pi(i) refers to the essay at position i in the permutation. We also assume any permutation is possible; the set of all possible permutations is denoted as \Omega_m. As aforementioned, we assume the batch size is m, and the calculated score of the essay at position i is s_{\pi(i)}. As given in the original paper (Cao et al., 2007), the permutation probability is defined as Formula 6, where \Phi(\cdot) is an increasing and strictly positive function.

P_s(\pi) = \prod_{j=1}^{m} \frac{\Phi(s_{\pi(j)})}{\sum_{k=j}^{m} \Phi(s_{\pi(k)})}  (6)

The top one probability P_s(j) is defined as Formula 7, where j refers to each essay in the batch.

P_s(j) = \frac{\Phi(s_j)}{\sum_{k=1}^{m} \Phi(s_k)}  (7)

Using the top one probability, given the two lists of scores s and s^* as aforementioned, cross entropy can be used to represent the distance (the batch-wise loss function L_r) between the two score lists, as shown in Formula 8.

L_r = \mathrm{CE}(s^*, s) = -\sum_{j=1}^{m} P_{s^*}(j) \log(P_s(j))  (8)
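A minimal sketch of the batch-wise ListNet loss in Formulas 7 and 8, under the common choice \Phi = exp (as in the original ListNet paper), which turns the top one probabilities into softmaxes over the batch.

```python
import torch
import torch.nn.functional as F

def batchwise_listnet_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """pred, gold: 1-D tensors of calculated and gold scores for one batch."""
    p_gold = F.softmax(gold, dim=0)           # P_{s*}(j), Formula 7 with Phi = exp
    log_p_pred = F.log_softmax(pred, dim=0)   # log P_s(j)
    return -(p_gold * log_p_pred).sum()       # cross entropy, Formula 8
```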
3.6 Combination of Regression and Ranking

The key problem of loss combination is determining the weight of each loss. In the scoring scenario, teachers always prefer to score each essay rather than rank all the essays. Besides, the batch-wise learning to rank approach cannot guarantee a precise global order. Following the combination method proposed by Wu et al. (2009), the weight of the ranking loss decreases while that of the regression loss increases during training. The weights are calculated by Formula 9, where \tau_e is a function of the current epoch e, given in Formula 10.

L = \tau_e \times L_m + (1 - \tau_e) \times L_r  (9)

\tau_e = \frac{1}{1 + \exp(-\gamma (e - E/2))}  (10)

In Formula 10, E is the total number of epochs, e is the current epoch, and \gamma is a hyper-parameter chosen such that \tau_1 = 0.000001.
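The schedule in Formulas 9 and 10 is simple enough to sketch directly; with the settings reported later in the paper (E = 30 epochs, \gamma = 0.99999), the first-epoch regression weight is \tau_1 = 1/(1 + e^{14}) \approx 10^{-6}, matching the constraint stated above.

```python
import math

def tau(e: int, E: int, gamma: float = 0.99999) -> float:
    """Formula 10: weight of the regression loss at epoch e out of E epochs."""
    return 1.0 / (1.0 + math.exp(-gamma * (e - E / 2.0)))

def combined_loss(loss_m, loss_r, e, E):
    """Formula 9: ranking dominates early in training, regression late."""
    t = tau(e, E)
    return t * loss_m + (1.0 - t) * loss_r
```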

4 Experiment

In this section, the ASAP dataset is introduced first. Then we describe the experiment settings and the evaluation metric. In addition, baseline models, experiment results, and analyses are presented. Furthermore, we visualize the attention weights of different words on two examples.

4.1 Dataset

The Automated Student Assessment Prize dataset comes from a Kaggle competition (https://www.kaggle.com/c/asap-aes/data) and contains eight essay prompts of different genres, including argumentative essays, response essays, and narrative essays. Each essay was given a score by instructors. Some statistical information is shown in Table 1.

Set | #Essays | Genre | Avg Len. | Range
1   | 1783    | ARG   | 350      | 2-12
2   | 1800    | ARG   | 350      | 1-6
3   | 1726    | RES   | 150      | 0-3
4   | 1772    | RES   | 150      | 0-3
5   | 1805    | RES   | 150      | 0-4
6   | 1800    | RES   | 150      | 0-4
7   | 1569    | NAR   | 250      | 0-30
8   | 723     | NAR   | 650      | 0-60

Table 1: Statistics of the ASAP dataset. Range means the score range; for genre, ARG, RES, and NAR denote argumentative, response, and narrative essays respectively.

4.2 Experiment Settings

Following previous work, we utilize 5-fold cross-validation to evaluate all models with a 60/20/20 split for the train, validation, and test sets, as provided by Taghipour and Ng (2016). We implement our model based on the PyTorch implementation of BERT (https://github.com/huggingface/pytorch-transformers) and use the BERT_BASE model due to the limit of GPU memory. We truncate all essays to a max length of 512 words, following the setting of BERT. Also because of the GPU memory limit, the batch size is set to 16. Since essays in the ASAP dataset are much longer than those in GLUE (Wang et al., 2018a), we fine-tune our model for 30 epochs and select the best model based on performance on the validation set. We adjust the fine-tuning learning rate from 1e-5 to 9e-5 in steps of 1e-5, and 4e-5 performs best. \gamma in Formula 10 is set to 0.99999. For tokenization and vocabulary, we use the preprocessing tools provided with the BERT model. We also normalize all score ranges to [0, 1]; all scores are rescaled back to the original prompt-specific scale for calculating Quadratic Weighted Kappa. Following previous works, we conduct the evaluation in a prompt-specific fashion.

4.3 Evaluation Metric

Following previous works, Quadratic Weighted Kappa (QWK) is used as the evaluation metric, which measures the agreement between calculated scores and gold ones.

To calculate QWK, a weight matrix W is constructed first, as shown in Formula 11, where i and j are a gold score and a calculated score respectively, and N is the number of possible ratings.

W_{i,j} = \frac{(i - j)^2}{(N - 1)^2}  (11)

In addition, a matrix O is calculated, such that O_{i,j} denotes the number of essays that receive a rating i from the human annotator and a rating j from the AES system. Another matrix E with the expected counts is calculated as the outer product of the histogram vectors of the two ratings. The matrix E is then normalized such that the sum of its elements equals that of O. Finally, with the matrices O and E, the QWK score is calculated according to Formula 12.

\kappa = 1 - \frac{\sum_{i,j} W_{i,j} O_{i,j}}{\sum_{i,j} W_{i,j} E_{i,j}}  (12)
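The metric is easy to misread from prose alone, so here is a small NumPy sketch of Formulas 11 and 12 for integer ratings in [0, N); the function and variable names are ours.

```python
import numpy as np

def quadratic_weighted_kappa(gold, pred, n_ratings):
    """gold, pred: iterables of integer ratings in [0, n_ratings)."""
    W = np.array([[(i - j) ** 2 / (n_ratings - 1) ** 2
                   for j in range(n_ratings)]
                  for i in range(n_ratings)])      # Formula 11
    O = np.zeros((n_ratings, n_ratings))
    for g, p in zip(gold, pred):                   # observed rating pairs
        O[g, p] += 1
    E = np.outer(O.sum(axis=1), O.sum(axis=0))     # expected counts from histograms
    E = E * O.sum() / E.sum()                      # normalize: sum(E) == sum(O)
    return 1.0 - (W * O).sum() / (W * E).sum()     # Formula 12
```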

ID | Model          | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | Average
1  | LSTM(last)     | 0.165 | 0.215 | 0.231 | 0.436 | 0.381 | 0.299 | 0.323 | 0.149 | 0.275
2  | BiLSTM(last)   | 0.226 | 0.276 | 0.239 | 0.502 | 0.375 | 0.412 | 0.361 | 0.188 | 0.322
3  | LSTM(mean)     | 0.582 | 0.517 | 0.516 | 0.702 | 0.604 | 0.670 | 0.661 | 0.566 | 0.602
4  | BiLSTM(mean)   | 0.591 | 0.491 | 0.498 | 0.702 | 0.643 | 0.692 | 0.683 | 0.563 | 0.608
5  | *EASE(SVR)     | 0.781 | 0.630 | 0.621 | 0.749 | 0.782 | 0.771 | 0.727 | 0.534 | 0.699
6  | *EASE(BLRR)    | 0.761 | 0.621 | 0.606 | 0.742 | 0.784 | 0.775 | 0.730 | 0.617 | 0.705
7  | CNN+LSTM       | 0.821 | 0.688 | 0.694 | 0.805 | 0.807 | 0.819 | 0.808 | 0.644 | 0.761
8  | LSTM-CNN-att   | 0.822 | 0.682 | 0.672 | 0.814 | 0.803 | 0.811 | 0.801 | 0.705 | 0.764
9  | RL1            | 0.766 | 0.659 | 0.688 | 0.778 | 0.805 | 0.791 | 0.760 | 0.545 | 0.724
10 | SKIPFLOW       | 0.832 | 0.684 | 0.695 | 0.788 | 0.815 | 0.810 | 0.800 | 0.697 | 0.764
11 | *HISK+BOSWE    | 0.845 | 0.729 | 0.684 | 0.829 | 0.833 | 0.830 | 0.804 | 0.729 | 0.785
12 | RankingOnly    | 0.791 | 0.687 | 0.665 | 0.811 | 0.797 | 0.821 | 0.821 | 0.651 | 0.756
13 | RegressionOnly | 0.800 | 0.679 | 0.679 | 0.822 | 0.803 | 0.797 | 0.837 | 0.725 | 0.768
14 | R2BERT         | 0.817 | 0.719 | 0.698 | 0.845 | 0.841 | 0.847 | 0.839 | 0.744 | 0.794

Table 2: QWK evaluation scores on the ASAP dataset; columns 1-8 are the eight prompts (* marks statistical models).

4.4 Baselines and Implementation Details

In this section, we list several baseline models as well as state-of-the-art models.

• *EASE: The Enhanced AI Scoring Engine (EASE) is a publicly available, open-source statistical AES system (https://github.com/edx/ease) that achieved excellent results in the ASAP competition. EASE utilizes hand-crafted features such as length-based features, POS tags, and word overlap, together with different regression techniques. Following previous works, we report the results of EASE with the settings of Support Vector Regression (SVR) and Bayesian Linear Ridge Regression (BLRR).

• LSTM: We use two-layer LSTM and BiLSTM models, each with either mean pooling or the last output to obtain the essay representation (see the sketch after this list). Mean pooling means the average vector of all the hidden states, while the last output refers to the last hidden state. Then a fully connected linear layer with a \sigma activation function is used to obtain scores. In these four models, GloVe (Pennington et al., 2014) is used to initialize the word embedding matrix, and the dimension is 300.

• CNN+LSTM: This model, proposed by Taghipour and Ng (2016), assembles CNN and LSTM to obtain scores. We use the performance reported in the paper.

• LSTM-CNN-att: Dong et al. (2017) proposed to use attention mechanisms and hierarchical neural networks to learn essay representations. We also use the experiment results reported in their paper.

• RL1: Wang et al. (2018b) proposed a reinforcement learning based model. In that paper, QWK is used as the reward function, and classification is used to obtain the scores. The performance reported in the paper is used.

• SKIPFLOW: Tay et al. (2018) proposed this model, which considers coherence when learning text representations. The experiment results in the paper are used.

• *HISK+BOSWE: Cozma et al. (2018) proposed another statistical model. It utilizes string kernels and word embeddings to extract text features and gained higher performance.

• Our models: We report not only the performance of R2BERT but also the results of the regression-only version (RegressionOnly) and the ranking-only version (RankingOnly). All experiments are conducted on a Linux machine with a single Tesla P40 GPU.
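Since the pooling choice matters considerably in Table 2, here is a minimal sketch of the LSTM baselines; the hidden size is our own assumption, as the text only fixes the 300-dimensional GloVe embeddings.

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    """(Bi)LSTM over GloVe embeddings; 'mean' pooling or the last hidden state."""

    def __init__(self, vocab_size, emb_dim=300, hidden=300,
                 bidirectional=False, pooling="mean"):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # initialized from GloVe
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=bidirectional)
        self.pooling = pooling
        self.head = nn.Linear(hidden * (2 if bidirectional else 1), 1)

    def forward(self, token_ids):
        states, _ = self.lstm(self.emb(token_ids))     # (batch, seq, dirs*hidden)
        rep = states.mean(dim=1) if self.pooling == "mean" else states[:, -1]
        return torch.sigmoid(self.head(rep)).squeeze(-1)
```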

4.5 Experiment Results and Analysis

Table 2 shows the empirical results of all the deep learning models as well as the statistical models. First, we discuss the comparison between the LSTM based models. Mean pooling performs better than the last output in all LSTM based models. Since essays in the dataset contain hundreds of words, it is difficult for an LSTM to capture long-distance dependencies, and compared with the last output, average pooling alleviates this problem. Meanwhile, bidirectional LSTM based models perform comparably to or even better than the unidirectional ones, because the bidirectional models can capture complete context information. However, all these models show lower performance than EASE, which means that well-designed hand-crafted features are more effective than simple neural networks. These models still perform worse than state-of-the-art models.

Next, we compare the published state-of-the-art results. RL1 (Wang et al., 2018b), the reinforcement learning based model, shows rather low performance among recent works, since it utilizes a dilated LSTM to learn essay representations, which ignores sentence-level structure information. It still outperforms the basic LSTM based models, which shows the effectiveness of the QWK reward function. CNN+LSTM (Taghipour and Ng, 2016) is an ensemble model that shows performance comparable to LSTM-CNN-att (Dong et al., 2017), the hierarchical model, on Prompts 1, 2, 4, 5, 6, and 7, and even gains much higher performance on Prompt 3. Both models outperform the LSTM based models, which means the ensemble model can make up for the shortcomings of single neural networks and performs comparably with hierarchical models. Besides, LSTM-CNN-att and SKIPFLOW (Tay et al., 2018) are both hierarchical models. They capture the explicit structure by modeling the relationship of adjacent sentences (semantics) in each essay, so they perform better on Prompts 1 and 2, which contain argumentative essays. The SKIPFLOW model in particular gains much better performance on Prompt 1, and LSTM-CNN-att also performs better on Prompt 8. However, a well-designed statistical model, HISK+BOSWE (Cozma et al., 2018), outperforms all previous neural models and also performs best on the two argumentative prompts.

Compared with previous state-of-the-art neural models, the RegressionOnly model outperforms all other neural models on the average QWK score, which shows the great power of the pre-trained language model (BERT) in capturing deep semantic information. Especially on the two narrative prompts (Prompts 7 and 8), the RegressionOnly model outperforms other models by a large margin, which suggests that self-attention is particularly suitable for narrative essays since it can capture their key concepts, as shown in Figure 2. The RankingOnly model shows much lower performance on Prompt 8 as well as on the average QWK score, perhaps because it is difficult to reconstruct the global order perfectly from batch-wise orders.

R2BERT outperforms the RegressionOnly and RankingOnly models on every prompt by a large margin except Prompt 7. This result means that ranking and regression are indeed two complementary objectives, and combining them via dynamic weights improves performance effectively. In general, R2BERT gains a much higher average QWK score than the aforementioned neural models and performs best on almost every prompt except Prompt 1. This illustrates a successful way to enhance BERT on downstream tasks: only utilizing BERT to learn text representations is not enough; suitable auxiliary objectives are also necessary. More importantly, our model also outperforms HISK+BOSWE, the latest statistical model, which proves the great power of neural networks.

ID | Model          | First 512 | Last 512
1  | RankingOnly    | 0.657     | 0.644
2  | RegressionOnly | 0.724     | 0.725
3  | R2BERT         | 0.743     | 0.745

Table 3: QWK evaluation scores on Prompt 8 of the ASAP dataset with different parts of the whole essays.

BERT limits the length of each input text to a maximum of 512 tokens. In Prompt 8, the average length of the essays is about 650 words, which exceeds this limit. We therefore use the first 512 words or the last 512 words instead of the whole essay. Table 3 shows the experimental results; our three models achieve similar performance with either part. How to fully use whole essays with BERT is a direction for future work. In Table 2, we use the average of the two performances as the result on Prompt 8 for each model.
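A minimal sketch of the truncation used for Prompt 8, assuming the standard BERT tokenizer; reserving two positions for [CLS] and [SEP] is our own detail, not specified in the paper.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def truncate_essay(text: str, keep: str = "first", limit: int = 512):
    """Keep the first or last `limit` tokens of an over-long essay."""
    tokens = tokenizer.tokenize(text)
    budget = limit - 2                     # room for [CLS] and [SEP] (our assumption)
    kept = tokens[:budget] if keep == "first" else tokens[-budget:]
    return tokenizer.convert_tokens_to_ids(["[CLS]"] + kept + ["[SEP]"])
```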

[Figure 2: Self-attention visualization on examples of Prompts 1 and 7. Prompt 1 (argumentative): "Write a letter to your local newspaper in which you state your opinion on the effects computers have on people. Persuade the readers to agree with you." Prompt 7 (narrative): "Write a story about a time when you were patient or write a story about a time when someone you know was patient or write a story in your own way about patience." The figure shades each word of an example essay for each prompt according to its attention weight.]

In Figure 2, we visualize the self-attention word weights of two essays, an argumentative essay from Prompt 1 and a narrative essay from Prompt 7. For lack of space, we only show part of each essay. In the figure, a word in darker red gains a lower attention weight. The argumentative example needs to convince people that computers benefit our lives. Self-attention has identified several connectors such as "because", "and", and "even", and some words introducing arguments, including "about", "that", etc. These words reveal the explicit logical structure of argumentative essays. The narrative example uses the story of getting a dog to show the writer's patience. Self-attention captures story details such as "dog", "parents", "family", "stomach", and "butterflies", as well as the topic words "patient" and "patience". All these words show the topic shifting of the narrative.
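The paper does not specify which layer or head is visualized; as one plausible recipe, the sketch below averages the heads of the last layer and reads the [CLS] row as a per-token weight, using the output_attentions flag of the transformers library.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def cls_attention_weights(text: str) -> dict:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    last_layer = out.attentions[-1][0].mean(dim=0)  # average over heads
    weights = last_layer[0]                         # attention from [CLS] to each token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return dict(zip(tokens, weights.tolist()))
```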

4.6 Runtime and Memory

In this section, we analyze runtime and memory, where memory means the total number of parameters. Since little previous work provides source code, it is difficult to estimate the number of parameters of other models accurately, and our three models only differ in their losses, so they share the same number of parameters. We therefore only compare the LSTM, BiLSTM, and R2BERT models. First, we estimate the total number of parameters of the three models. Then we record the total training time on all training samples of Prompt 6; note that the simple neural networks need more training epochs to converge, while the BERT model only needs a few epochs of fine-tuning. To compare inference time, we record the inference time per sample. All results are shown in Table 4. BERT obviously has more parameters and spends much more training and inference time. However, its inference time per sample is near 1 second, which is practical in real educational scenarios.

Model  | TR     | IPS     | #Param
LSTM   | 2m53s  | 0.0013s | 1.4M
BiLSTM | 3m15s  | 0.0014s | 1.4M
R2BERT | 22m20s | 0.9103s | 110M

Table 4: Comparison of runtime and memory. TR means the total training runtime on the train set, and IPS means the inference runtime per test sample. #Param refers to the number of parameters.

5 Conclusion and Future Works

From the experimental results, we can draw several conclusions: 1) BERT is a significantly effective model for improving the performance of downstream natural language processing tasks. 2) Regression loss and ranking loss are two complementary losses. 3) Simply fine-tuning BERT is not enough; a multi-loss objective is an effective approach to fine-tuning the BERT model. 4) Self-attention is useful for capturing conjunction words and key concepts in essays. In the future, we will investigate how to utilize whole long texts with the pre-trained BERT model.

Acknowledgments

The work was supported by the Hong Kong RGC Collaborative Research Fund with project code C6030-18GF and Hong Kong Red Swastika Society Tai Po Secondary School with project code P20-0021.

References

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136. ACM.

Hongbo Chen and Ben He. 2013. Automated essay scoring by maximizing human-machine agreement. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1741–1752, Seattle, Washington, USA. Association for Computational Linguistics.

Martin Chodorow and Jill Burstein. 2004. Beyond essay length: Evaluating e-rater's performance on TOEFL essays. ETS Research Report Series, 2004(1):i–38.

Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, and Daniel S. Weld. 2019. Pretrained language models for sequential sentence classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3684–3690.

Mădălina Cozma, Andrei Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 503–509, Melbourne, Australia. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Fei Dong, Yue Zhang, and Jie Yang. 2017. Attention-based recurrent convolutional neural network for automatic essay scoring. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 153–162, Vancouver, Canada. Association for Computational Linguistics.

Scott Elliot. 2003. IntelliMetric: From here to validity. Automated essay scoring: A cross-disciplinary perspective, pages 71–86.

Zixuan Ke and Vincent Ng. 2019. Automated essay scoring: a survey of the state of the art. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 6300–6308. AAAI Press.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Mark D. Shermis and Jill C. Burstein. 2003. Automated essay scoring: A cross-disciplinary perspective. Routledge.

André Smolentzov. 2013. Automated essay scoring: Scoring essays in Swedish.

Chi Sun, Luyao Huang, and Xipeng Qiu. 2019a. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 380–385.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019b. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer.

Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1882–1891, Austin, Texas. Association for Computational Linguistics.

Yi Tay, Minh C. Phan, Luu Anh Tuan, and Siu Cheung Hui. 2018. SkipFlow: Incorporating neural coherence features for end-to-end automatic text scoring. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 5948–5955.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Yucheng Wang, Zhongyu Wei, Yaqian Zhou, and Xuanjing Huang. 2018b. Automatic essay scoring incorporating rating schema via reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 791–797, Brussels, Belgium. Association for Computational Linguistics.

Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. 2009. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1923–1926. ACM.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189, Portland, Oregon, USA. Association for Computational Linguistics.
