Arxiv:2102.13136V1
Total Page:16
File Type:pdf, Size:1020Kb
Automated essay scoring using efficient transformer-based language models Christopher M. Ormerod, Akanksha Malhotra, and Amir Jafari ABSTRACT. Automated Essay Scoring (AES) is a cross-disciplinary effort involving Education, Linguistics, and Natural Language Processing (NLP). The efficacy of an NLP model in AES tests it ability to evaluate long-term dependencies and extrapolate meaning even when text is poorly written. Large pretrained transformer-based language models have dominated the current state- of-the-art in many NLP tasks, however, the computational requirements of these models make them expensive to deploy in practice. The goal of this paper is to challenge the paradigm in NLP that bigger is better when it comes to AES. To do this, we evaluate the performance of several fine-tuned pretrained NLP models with a modest number of parameters on an AES dataset. By ensembling our models, we achieve excellent results with fewer parameters than most pretrained transformer-based models. 1. Introduction The idea that a computer could analyze writing style dates back to the work of Page in 1968 [31]. Many engines in production today rely on explicitly defined hand-crafted features designed by experts to measure the intrinsic characteristics of writing [14]. These features are combined with frequency-based methods and statistical models to form a collection of methods that are broadly termed Bag-of-Word (BOW) methods [49]. While BOW methods have been very successful from a purely statistical standpoint, [22] showed they tend to be brittle with respect novel uses of language and vulnerable to adversarially crafted inputs. Neural Networks learn features implicitly rather than explicitly. It has been shown that ini- tial neural network AES engines tend to be more more accurate and more robust to gaming than BOW methods [12, 15]. In the broader NLP community, the recurrent neural network approaches used have been replaced by transformer-based approaches, like the Bidirectional En- coder Representations from Transformers (BERT) [13]. These models tend to possess an order arXiv:2102.13136v1 [cs.CL] 25 Feb 2021 of magnitude more parameters than recurrent neural networks, but also boast state-of-the-art re- sults in the General Language Understanding Evaluation (GLUE) benchmarks [42, 45]. One of the main problems in deploying these models is their computational and memory requirements [28]. This study explores the effectiveness of efficient versions of transformer-based models in the domain of AES. There are two aspects of AES that distinguishes it from GLUE tasks that might benefit from the efficiencies introduced in these models; firstly, the text being evaluated can be almost arbitrarily long. Secondly, essays written by students often contain many more spelling issues than would be present in typical GLUE tasks. It could be the case that fewer 1 2 CHRISTOPHERM.ORMEROD,AKANKSHAMALHOTRA,ANDAMIRJAFARI and more often updated parameters might possibly be better in this situation or more generally where smaller training sets are used. With respect to essay scoring the Automated Student Assessment Prize (ASAP) AES data- set on Kaggle is a standard openly accessible data-set by which we may evaluate the performance of a given AES model [35]. Since the original test set is no longer available, we use the five-fold validation split presented in [46] for a fair and accurate comparison. The accuracy of BERT and XLNet have been shown to be very solid on the Kaggle dataset [33, 47]. To our knowledge, combining BERT with hand-crafted features form the current state-of-the-art [40]. The recent works of have challenged the paradigm that bigger models are necessarily better [11, 21, 24, 29]. The models in these papers possess some fundamental architectural character- istics that allow for a drastic reduction in model size, some of which may even be an advantage in AES. For this study, we consider the performance of the Albert models [26], Reformer mod- els [21], a version of the Electra model [11], and the Mobile-BERT model [39] on the ASAP AES data-set. Not only are each of these models more efficient, we show that simple ensembles provide the best results to our knowledge on the ASAP AES dataset. There are several reasons that this study is important. Firstly, the size of models scale quadratically with maximum length allowed, meaning that essays may be longer than the max- imal length allowed by most pretrained transformer-based models. By considering efficiencies in underlying transformer architectures we can work on extending that maximum length. Sec- ondly, as noted by [28], one of the barriers to effectively putting these models in production is the memory and size constraints of having fine-tuned models for every essay prompt. Lastly, we seek models that impose a smaller computational expense, which in turn has been linked by [37] to a much smaller carbon footprint. 2. Approaches to Automated Essay Scoring From an assessment point of view, essay tests are useful in evaluating a student’s ability to analyze and synthesize information, which assesses the upper levels of Blooms Taxonomy. Many modern rubrics use a multitude of scoring dimensions to evaluate an essay, including organization, elaboration, and writing style. An AES engine is a model that assigns scores to a piece of text closely approximating the way a person would score the text as a final score or in multiple dimensions. To evaluate the performance of an engine we use standard inter-rater reliability statistics. It has become standard practice in the development of a training set for an AES engine that each essay is evaluated by two different raters from which we may obtain a resolved score. The resolved score is usually either the same as the two raters in the case that they agree and an adjudicated score in cases in which they do not. In the case of the ASAP AES data-set, the resolved score for some items is taken to be the sum of the two raters, and a resolved score for others. The goal of a good model is to have a higher agreement with the resolved score in comparison with the agreement two raters have with each other. The most widely used statistic used to evaluate the agreement of two different collections of scores is the quadratic weighted kappa (QWK), defined by w x (1) κ = ij ij PPwijmij PP AUTOMATED ESSAY SCORING USING EFFICIENT TRANSFORMER-BASEDLANGUAGEMODELS 3 where xi,j is the observed probability mi,j = xij(1 xij), − and (i j)2 wij = 1 − , − (k 1)2 − where k is the number of classes. The other measurements used in the industry are the standard- ized mean difference (SMD) and the exact match or accuracy (Acc). The total number of essays and the human-human agreement for the two raters and the score ranges are shown in table 1. The usual protocol for training a statistical model is that some portion of the training set is set aside for evaluation while the remaining set is used for training. A portion of the training set is often isolated for purposes of early stopping and hyperparameter tuning. In evaluating the Kaggle dataset, we use the 5-fold cross validation splits defined by [46] so that our results are comparable to other works [1, 12, 33, 40, 47]. The resulting QWK is defined to be the average of the QWK values on each of the five different splits. Essay Prompt 1 2 3 4 5 6 7 8 Rater score range 1-6 1-6 0-3 0-3 0-4 0-4 0-12 5-30 Resolved score 2-12 1-6 0-3 0-3 0-4 0-4 2-24 10-60 range Average Length 350 350 150 150 150 150 250 650 Training examples 1783 1800 1726 1772 1805 1800 1569 723 QWK 0.721 0.814 0769 0.851 0.753 0.776 0.721 0.624 Acc 65.3% 78.3% 74.9% 77.2% 58.0% 62.3% 29.2% 27.8% SMD 0.008 0.027 0.055 0.005 0.001 0.011 0.006 0.069 TABLE 1. A summary of the Automated Student Assessment Prize Automated Essay Scoring data-set. 3. Methodology Most engines currently in production rely on Latent Semantic Analysis or a multitude of hand-crafted features that measure style and prose. Once sufficiently many features are com- piled, a traditional machine learning classifier, like logistic regression, is applied and fit to a training corpus. As a representative of this class of model we include the results of the “En- hanced AI Scoring Engine”, which is open source 1. When modelling language with neural networks, the first layer of most neural networks are embedding layers, which send every token to an element of a semantic vector space. When training a neural network model from scratch, the word embedding vocabulary often comprises of the set of tokens that appear in the training set. There are several problems with this ap- proach that come as a result of word sparsity in language and the presence of spelling mistakes. Alternatively, one may use a pretrained word embedding [30] built from a large corpus with a large vocabulary with a sufficiently high dimensional semantic vector space. From an efficiency 1https://github.com/edx/ease 4 CHRISTOPHERM.ORMEROD,AKANKSHAMALHOTRA,ANDAMIRJAFARI standpoint, these word embeddings alone can account for billions of parameters. We can shrink our embedding and address some issues arising from word sparsity by fixing a vocabulary of subwords using a version of the byte-pair-encoding (BPE) [25, 34]. As such, pretrained models like BERT and those we use in this study utilize subwords to fix the size of the vocabulary.