Automated Essay Scoring with Discourse-Aware Neural Models

Farah Nadeem1, Huy Nguyen2, Yang Liu2, and Mari Ostendorf1

1 Department of Electrical and Computer Engineering, University of Washington, {farahn, ostendor}@uw.edu
2 LAIX Inc., {huy.nguyen, yang.liu}@liulishuo.com

Abstract

Automated essay scoring systems typically rely on hand-crafted features to predict essay quality, but such systems are limited by the cost of feature engineering. Neural networks offer an alternative to feature engineering, but they typically require more annotated data. This paper explores network structures, contextualized embeddings and pre-training strategies aimed at capturing discourse characteristics of essays. Experiments on three essay scoring tasks show benefits from all three strategies in different combinations, with simpler architectures being more effective when less training data is available.

1 Introduction

In the context of large scale testing and online learning systems, automated essay scoring (AES) is an important problem. There has been work both on improving the performance of these systems and on validity studies (Shermis, 2014). The ability to evaluate student writing has always been important for language teaching and learning; now it also extends to science, since the focus is shifting towards assessments that can more accurately gauge construct knowledge as compared to multiple choice questions (Shermis, 2014). Most existing systems for automatic essay scoring leverage hand-crafted features, ranging from word counts to argumentation structure and coherence, in logistic regression models (Chodorow and Burstein, 2004; Shermis and Burstein, 2013; Klebanov et al., 2016; Nguyen and Litman, 2018). Improving feature-based models requires extensive redesigning of features (Taghipour and Ng, 2016). Due to high variability in types of student essays, feature-based systems are often individually designed for specific prompts (Burstein et al., 2013). This poses a challenge for building essay scoring systems.

These problems (and the success of deep learning in other areas of language processing) have led to the development of neural methods for automatic essay scoring, moving away from feature engineering. A variety of studies (mostly LSTM-based) have reported AES performance comparable to or better than feature-based models (Taghipour and Ng, 2016; Cummins and Rei, 2018; Wang et al., 2018; Jin et al., 2018; Farag et al., 2018; Zhang and Litman, 2018). However, the current state-of-the-art models still use a combination of neural models and hand-crafted features (Liu et al., 2019).

While vanilla RNNs, particularly LSTMs, are good at representing text sequences, essays are longer structured documents and are less well suited to an RNN representation. Thus, our work looks at advancing AES by exploring other architectures that incorporate document structure for longer documents. Discourse structure and coherence are important aspects of essay writing and are often explicitly a part of grading rubrics. We explore methods that aim at discourse-aware models, through design of the model structure, use of discourse-based auxiliary pretraining tasks, and use of contextualized embeddings trained with cross-sentence context (Devlin et al., 2018). In order to better understand the relative advantages of these methods, we compare performance on three essay scoring tasks with different characteristics, contrasting results with a strong feature-based system.

Our work makes two main contributions. First, we demonstrate that both discourse-aware structures and discourse-related pre-training (via auxiliary tasks or contextualized embeddings) benefit performance of neural network systems. In a TOEFL essay scoring task, we obtain a substantial improvement over the state of the art. Second, we show that complex contextualized embedding models are not useful for tasks with small annotated training sets. Simpler discourse-aware neural models are still useful, but they benefit from combination with a feature-based model.

2 Method

2.1 Neural Models

The overall system involves a neural network that maps an essay to a vector, which is then used with ordinal regression (McCullagh, 1980) for essay scoring. For this work we consider two neural models that incorporate document structure:

• Hierarchical recurrent network with attention (HAN) (Yang et al., 2016)
• Bidirectional context with attention (BCA) (Nadeem and Ostendorf, 2018)

Both models are LSTM based. HAN captures the hierarchical structure within a document by using two stacked layers of LSTMs. The first layer takes word embeddings as input and outputs contextualized word representations. Self attention is used to compute a sentence vector as a weighted average of the contextualized word vectors. The second LSTM takes sentence vectors as input and outputs a document vector based on averaging using self attention at the sentence level.

BCA extends HAN to account for cross-sentence dependencies. For each word, using the contextualized word vectors output from the first LSTM, a look-back and a look-ahead context vector are computed based on the similarity with words in the previous and following sentence, respectively. The final word representation is then created as a concatenation of the LSTM output and the look-back and look-ahead context vectors, and is used to create a sentence vector using attention weights, which feeds into the second LSTM. The representation of cross-sentence dependencies makes this model discourse aware.

2.2 Auxiliary Training Tasks

Neural networks typically require more training data than feature-based models, but unlike these models, neural networks can make use of related tasks to improve performance through pretraining. We use additional data chosen with the idea that having related tasks for pretraining can help the model learn aspects that impact the main classification problem. We use the following tasks:

• Natural language inference (NLI): given a pair of sentences, predict their relation as neutral, contradictory, or entailment.
• Discourse marker prediction (DM): given a pair of sentences, predict the category of discourse marker that connects them, e.g. "however" (corresponding to the idea opposition category).

The NLI task has been shown to improve performance for several NLP tasks (Cozma et al., 2018). The DM prediction task is used since discourse structure is an important aspect of essay writing. Both tasks involve sentence pairs, so they impact the first-level LSTM of the HAN and BCA models.

The use of contextualized embeddings can also be thought of as pre-training with an auxiliary task of language modeling (or masked language modeling). In this work, we chose BERT embeddings (Devlin et al., 2018), which use a bidirectional transformer architecture trained on two tasks, masked language modeling and next sentence prediction. We hypothesized that the next sentence prediction task would capture aspects of discourse coherence.

2.3 Training Methods

All HAN models and a subset of BCA models are initialized with pretrained Glove word embeddings1 (Pennington et al., 2014). All models are trained with the essay training data.

For models that are pretrained, the word-level LSTM and the bidirectional context with attention (for BCA) are common to all tasks used in training. Given the word-level representations, the model computes attention weights over words for the target task (DM, NLI or essay scoring). The sentence representation is then computed by averaging the word representations using task-specific attention weights. For the pretraining tasks, the sentence representations of the two sentences in the pair are concatenated, passed through a feedforward neural network, and used with task-specific weights and biases to predict the label. For pretraining the BCA with the auxiliary tasks, the forward context vector is computed for the first sentence and the backward context vector is computed for the second sentence. This allows the model to learn the similarity projection matrix during pretraining.

1 http://nlp.stanford.edu/data/glove.42B.300d.zip
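To make the sentence-pair setup concrete, the sketch below shows one way the pretraining head described above could be implemented. It is an illustrative PyTorch sketch rather than the released code: the feedforward sizes follow Table 3, but the class name, the ReLU activations, and the 300-dimensional sentence vectors (from the 150-unit bidirectional word-level LSTM) are assumptions.

# Sketch of a task-specific sentence-pair head for the NLI and DM pretraining tasks:
# the two sentence vectors are concatenated, passed through a two-layer feedforward
# network, and mapped to task-specific output labels.
import torch
import torch.nn as nn

class PairTaskHead(nn.Module):
    def __init__(self, sent_dim=300, n_labels=8, hidden1=500, hidden2=250):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(2 * sent_dim, hidden1), nn.ReLU(),
            nn.Linear(hidden1, hidden2), nn.ReLU(),
        )
        self.out = nn.Linear(hidden2, n_labels)  # task-specific weights and biases

    def forward(self, sent1_vec, sent2_vec):
        pair = torch.cat([sent1_vec, sent2_vec], dim=-1)
        return self.out(self.ff(pair))  # logits over NLI (4) or DM (8) categories

# Separate heads for each auxiliary task share the underlying sentence encoder, e.g.
# nli_head = PairTaskHead(n_labels=4); dm_head = PairTaskHead(n_labels=8)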

For the essay scoring task there is another sentence-level LSTM on top of the word-level LSTM, with sentence-level attention, followed by task-specific weights and biases. Pretraining is followed by training with the essay data, with all model parameters updated during training, except for the auxiliary task-specific word-level attention, feedforward networks, weights and biases. The network used for BCA with pretraining tasks is shown in Figure 1. The hyper-parameters were tuned for the auxiliary tasks and the essay scoring task. To incorporate BERT embeddings in our model, we freeze the BERT model, and learn contextualized token embeddings for our data using the base uncased model. The tokens are from the second-to-last hidden layer, since we are not fine-tuning the model and the last layer is likely to be more tuned to the original BERT training tasks. These embeddings are then used as input to the BCA model (BERT-BCA), which is then trained on the essay scoring task.

3 Experiments

3.1 Data

The first set of essay data is the ETS Corpus of Non-Native Written English from the Linguistic Data Consortium (LDC) (Blanchard et al., 2013), consisting of 12,100 TOEFL essays.2 The data has essay scores given as high, medium or low. Two train/test splits are used:

• Split 1 from LDC, 11,000 training essays and 1,100 test essays
• Split 2 from (Klebanov et al., 2016), 6,074 training essays and 2,023 test essays

Split 1 is a larger publicly available set, and split 2 is used in the prior published work on this data. The data distribution is shown in Table 1. The data is skewed, with the medium score being the majority class.

  Data set    Essays    High    Medium    Low
  Train/dev   11,000    3,835    5,964   1,202
  Test         1,100      367      604     129
  Train/dev    6,074    2,102    3,318     655
  Test         2,023      700    1,101     222

Table 1: Label distribution in the LDC TOEFL dataset. Data is split into training and test sets: split 1 (upper part) and split 2 (lower part).

To evaluate model performance on smaller data sets, we use essays in Sets 1 and 2 of the Automated Student Assessment Prize (ASAP) Competition.3 We chose the first two sets from the ASAP data, since they are persuasive essays, and are likely to benefit more from discourse-aware pretraining. The two essay sets have topics in computer usage and library censorship, respectively. Data statistics of the two essay sets are shown in Table 2. Since only the training samples are available for both sets, we report results for 5-fold cross-validation using the same splits as (Taghipour and Ng, 2016).

  Data set   Essays   Avg. len   Score range
  1           1,783        350          2-12
  2           1,800        350           1-6

Table 2: Data statistics for essay sets 1 and 2 of the ASAP corpus.

Pretraining tasks use two data sets. The NLI task uses the Stanford natural language inference (SNLI) data set (Bowman et al., 2015). We cast our NLI task as a four-way classification task, because a subset of the data does not have gold labels. Unlabeled examples were used with an "X" label. While tuning on the main task, we found that including the fourth NLI label gave better performance on essay scoring than not using it.

The DM task is based on a collection of over 13K free books from www.smashwords.com, an online book distribution platform.4 Labeled discourse marker data was created by identifying sentence pairs that had a discourse marker at the start of the second sentence. We used 87 discourse markers, which were then mapped to seven groups, for a total of 581,650 sentence pairs. A set of 95,450 randomly selected consecutive sentence pairs without discourse markers was added to the data set as negative examples, leading to an eight-way classification task. Example discourse marker categories include:

• Idea opposition: nonetheless, on the other hand, however
• Idea justification: in other words, for example, alternatively
• Time relation: meanwhile, in the past, simultaneously

The complete set of labels, number of samples and mapping scheme are given in Appendix A.

2 https://catalog.ldc.upenn.edu/LDC2014T06
3 http://www.kaggle.com/c/asap-aes
4 The data set published by (Zhu et al., 2015) is no longer available, so we compiled our own data set.

Figure 1: Network structure for BCA with pretraining tasks.
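To make the hierarchical document structure of Section 2.1 concrete, the following is a minimal PyTorch sketch of a HAN-style encoder: a word-level BiLSTM with self-attention produces sentence vectors, and a sentence-level BiLSTM with self-attention produces the document vector. The layer sizes follow Table 3, but the class names are illustrative, and the BCA look-back/look-ahead context vectors and the ordinal regression output layer are omitted; this is a sketch under those assumptions, not the released implementation.

# Minimal HAN-style document encoder sketch (illustrative, not the released code).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Self-attention pooling: weighted average of the LSTM hidden states."""
    def __init__(self, hidden_size, attn_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, attn_size)
        self.context = nn.Linear(attn_size, 1, bias=False)

    def forward(self, states):                      # states: (batch, seq, hidden)
        scores = self.context(torch.tanh(self.proj(states)))
        weights = torch.softmax(scores, dim=1)      # attention over the sequence
        return (weights * states).sum(dim=1)        # (batch, hidden)

class HANEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, word_hidden=150, sent_hidden=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.word_lstm = nn.LSTM(emb_dim, word_hidden, bidirectional=True, batch_first=True)
        self.word_attn = AttentionPool(2 * word_hidden, attn_size=75)
        self.sent_lstm = nn.LSTM(2 * word_hidden, sent_hidden, bidirectional=True, batch_first=True)
        self.sent_attn = AttentionPool(2 * sent_hidden, attn_size=50)

    def forward(self, docs):                        # docs: (batch, n_sents, n_words) token ids
        b, s, w = docs.shape
        word_states, _ = self.word_lstm(self.embed(docs.view(b * s, w)))
        sent_vecs = self.word_attn(word_states).view(b, s, -1)
        sent_states, _ = self.sent_lstm(sent_vecs)
        return self.sent_attn(sent_states)          # document vector for the scoring layer

The document vector then feeds an output layer; the paper scores essays with ordinal regression (McCullagh, 1980) over the score categories, and BCA additionally concatenates look-back and look-ahead context vectors to each word state before sentence-level pooling.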

3.2 Training Configurations

We explore the following setups to train AES models for the LDC-TOEFL essays:

1. Training using only LDC essay data;
2. Pretraining with one task (either NLI or DM prediction), followed by training with the essay data;
3. Pretraining alternating between the two auxiliary tasks (NLI-DM), followed by training with the essay data; and
4. Training the BCA model with only the essay data, using static BERT token embeddings as input to the model.

For the ASAP data, we used the third training configuration.

For the pretraining tasks, 10% of the training data is used as a held-out development set. On the pretraining tasks, the BCA model achieves an accuracy of 0.60 (8 classes) on the development set of DM data, and an accuracy of 0.78 (4 classes) on the dedicated test set of SNLI data (Bowman et al., 2015).

Ten-fold cross validation was used for the LDC essay data, five-fold for the ASAP data. A vocabulary size of 75,000 was used for all the experiments, except those trained with BERT token embeddings. Dropout and early stopping were used for regularization, including variational recurrent dropout (Gal and Ghahramani, 2016) at both LSTM layers. Hyper-parameter tuning was used to find the optimal dropout and determine early stopping. Network sizes, dropout and number of epochs over the training data are listed in Table 3.

  Shared parameters
  Word level LSTM                          150
  Word level attention weight size          75
  Sentence level LSTM                      150
  Sentence level attention weight size      50
  Dropout rate                        0.25-0.5
  BERT embedding size                      768
  Auxiliary task parameters
  Feed-forward network layer 1             500
  Feed-forward network layer 2             250
  Training epochs
  Essay data                             35-45
  NLI data                               15-25
  DM data                                  5-7

Table 3: Hyper-parameters.

3.3 Baselines

We develop a feature-based model that combines text readability (Vajjala and Meurers, 2014; Vajjala, 2018) and argument mining features (Nguyen and Litman, 2018). In our implementation, we remove one set of basic features, e.g., word counts, spelling errors, etc., since they are present in both models, and keep the set from (Vajjala and Meurers, 2014). Given the extracted features, a gradient boosting algorithm is used to learn a regression model. Predicted scores are scaled and rounded to calculate Quadratic Weighted Kappa (QWK) against the true scores. These two feature sets are chosen because they incorporate discourse features in AES. In (Vajjala and Meurers, 2014), the authors used the addDiscourse toolkit (Pitler et al., 2009), which takes as input the syntactic tree of the sentence and tags the discourse connectives, e.g., therefore, however, and their senses, e.g., CONTINGENCY.Cause, COMPARISON.Contrast. These automated annotations are then used to calculate connective-based features, e.g., the number of discourse connectives per sentence and the number of each sense.

5 Trained models and code are available at https://github.com/Farahn/AES
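For reference, quadratic weighted kappa can be computed as follows. This is a generic sketch of the standard QWK definition applied to integer scores after scaling and rounding, not the authors' evaluation script.

# Quadratic Weighted Kappa (QWK) sketch: agreement between predicted and true scores,
# penalized by the squared distance between score categories.
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    n = max_score - min_score + 1
    observed = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        observed[t - min_score, p - min_score] += 1

    # Expected counts from the outer product of the two marginal histograms.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(y_true)

    # Quadratic disagreement weights between score categories i and j.
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    weights = (i - j) ** 2 / (n - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Example for the 1-6 score range of ASAP set 2:
# quadratic_weighted_kappa([2, 3, 4], [2, 4, 4], min_score=1, max_score=6)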

In (Nguyen and Litman, 2018), an end-to-end pipeline system was built to parse input essays for argument structures. The system identifies argument components, e.g., claims and premises, in essay sentences, and determines if a support relation is present between each pair of components. Based on that, the authors extract 33 argumentative features used for their AES model.

In addition, we build neural baselines using existing sentence representations as input to a document-level LSTM. Specifically, we compare: i) the BERT sentence encoder, taking the sentence representation from the second-to-last hidden layer of BERT (as in BERT-BCA), and ii) the Universal Sentence Encoder (USE) (Cer et al., 2018), which is trained on multiple downstream tasks, including classification. Unlike for BERT, there are no sequential sentence tasks used in training USE, so we claim that USE is not discourse-aware. The vectors output from the LSTM are then averaged using attention weights to generate a document representation, as in the HAN and BCA models, so these baselines are also hierarchical models and will be referred to as BERT-HAN and USE-HAN, respectively. For both setups, the sentence vectors are frozen and not updated during training; initial experiments found no performance gain from fine-tuning.

3.4 Results

3.4.1 LDC TOEFL Essays

The results are shown in Table 4, together with previously reported results for feature-based automatic essay scoring systems from (Klebanov et al., 2016) (Klebanov16) and (Nguyen and Litman, 2018) (Nguyen18). Significance testing was done on the test set using bootstrap.

  Model                    Split 1   Split 2
  Arg (Klebanov16)            -       0.344
  Length (Klebanov16)         -       0.518
  Arg + Len (Klebanov16)      -       0.540
  Nguyen18                    -       0.622
  Feature baseline          0.659     0.642
  USE-HAN                   0.626     0.618
  BERT-HAN                  0.688     0.680
  HAN                       0.635     0.623
  NLI-HAN                   0.643     0.630
  DM-HAN                    0.651     0.654
  NLI-DM-HAN                0.655     0.644
  BCA                       0.637     0.636
  NLI-BCA                   0.652     0.647
  DM-BCA                    0.661     0.661
  NLI-DM-BCA                0.659     0.663
  BERT-BCA                  0.729     0.715

Table 4: Results for the essay scoring task on the LDC TOEFL corpus for both splits, reported in QWK.

All neural models outperform previously reported results on split 2, with the exception of USE-HAN, as does the augmented feature-based baseline implemented here. Using the new feature-based system as the baseline for significance testing, only the results from BERT-BCA give a statistically significant improvement (p < 0.01). The two models that do not explicitly use discourse cues, HAN and USE-HAN, have the lowest scores of the neural models. The best result is obtained when we combine contextualized token-level embeddings from BERT with the cross-sentence attention in BCA. This indicates that the two methods are complementary and useful for writing evaluation.

Figure 2 shows the confusion matrices for the USE-HAN baseline, DM-BCA and BERT-BCA systems for LDC TOEFL split 1. The confusion matrices indicate that both USE-HAN and DM-BCA over-predict the essay scores compared to BERT-BCA, i.e. assign a higher scoring category than the true score. The problem is most severe for USE-HAN, which correctly labels only 40% of the low test samples.

3.4.2 ASAP Essays

Results are reported for 5-fold CV. For each of the splits, 20% of the data is used to tune the dropout rate, learning rate and number of iterations. Since there was only a small variation in the optimal parameters across the 5 folds, we used the average of the parameters from the first two sets for training all five folds. The test QWK is computed by taking the true labels and predictions for all 5 test sets. For the ASAP data set, we report performance of our feature baseline, the best sentence representation model, and the best pretrained BCA model. In addition, we present a simple combination of the feature-based and BCA models, averaging the scores predicted by the two models. The results are shown in Table 5.

488 (a) USE-HAN Baseline (b) DM-BCA (c) BERT-BCA

Figure 2: Confusion matrices for the USE-HAN baseline vs. the best neural models on LDC TOEFL split 1.

  Model                   ASAP 1   ASAP 2
  TSLF (Liu 2019)          0.852    0.736
  Feature baseline         0.833    0.692
  BERT-HAN                 0.748    0.627
  NLI-DM-BCA               0.800    0.671
  NLI-DM-BCA+features      0.840    0.711

Table 5: Results for the essay scoring task for ASAP sets 1 and 2, reported in QWK.

                  sim2               simall
  Model        Min       σ        Min       σ
  USE-HAN    -0.180   0.214     -0.440   0.400
  BERT-HAN   -0.012   0.013     -0.047  -0.005
  HAN         0.021  -0.023      0.069  -0.051
  DM-HAN     -0.414   0.365     -0.437   0.460
  BCA        -0.394   0.362     -0.571   0.632
  DM-BCA     -0.448   0.433     -0.554   0.589
  BERT-BCA    0.052  -0.071      0.186  -0.153

Table 6: Correlation of sim2 and simall with the true essay scores for LDC TOEFL split 1.

For both ASAP sets, feature-based models perform better than the neural models. We hypothesize that this is due to having less training data than for the TOEFL essays. Using the pretrained BERT-HAN model does significantly worse than the pretrained NLI-DM-BCA model. Combining the best neural and feature-based models gives a small, but insignificant, performance gain. A more sophisticated combination would likely yield better results.

The current state of the art is the two-stage learning framework (TSLF) (Liu et al., 2019). The model has two components, one using sentence representations from BERT input to an RNN (similar to our BERT-HAN), and a second component that uses hand-crafted features. The BERT sentence representations are used to learn an essay score, a prompt-relevance score and a "coherence" score, trained on original and permuted essays. Document representations from the neural network and the hand-crafted features are then used together in a gradient-boosting decision tree to predict the final essay score.

Figure 3: Sentence similarity for DM-BCA; left is a high scoring essay (ID 108264), right is a low scoring essay (ID 10226).

4 Analysis and Discussion

We hypothesized that good quality essays would be more coherent. To see if this is captured by the learned sentence representations, we examined sentence similarities in the TOEFL essays in relation to the essay score. Taking the sentence vector outputs from the second LSTM layer for each essay for a particular model on LDC split 1, we compute the cosine similarity of each sentence with its neighboring sentence (sim2) and with all other sentences (simall). We then compute the correlation of the mean, min and standard deviation of both sim2 and simall with the true labels. The mean gave no meaningful differences between models, but there were differences for the min and standard deviation (σ), which are presented in Table 6.
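The similarity statistics can be reproduced with a short sketch like the one below. It assumes the per-essay sentence vectors have already been extracted from the second LSTM layer, and it uses Pearson correlation with the low/medium/high labels encoded as integers; both of those details are assumptions for illustration rather than specifics stated in the paper.

# Sketch of the sim2 / simall analysis: cosine similarity of each sentence with its
# neighbor (sim2) and with all other sentences (simall), followed by correlation of the
# per-essay min and standard deviation of those similarities with the true scores.
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def essay_similarity_stats(sent_vecs):
    """sent_vecs: array of shape (n_sentences, dim) for one essay."""
    n = len(sent_vecs)
    sim2 = [cosine(sent_vecs[i], sent_vecs[i + 1]) for i in range(n - 1)]
    simall = [cosine(sent_vecs[i], sent_vecs[j])
              for i in range(n) for j in range(n) if i != j]
    return {"sim2_min": min(sim2), "sim2_std": float(np.std(sim2)),
            "simall_min": min(simall), "simall_std": float(np.std(simall))}

def correlate_with_scores(per_essay_stats, scores, key):
    """Correlation of one similarity statistic with essay scores (e.g. low/med/high as 0/1/2)."""
    values = [stats[key] for stats in per_essay_stats]
    r, _ = pearsonr(values, scores)
    return r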

In terms of the correlation between essay scores and the min/variance of sentence similarity, the highest correlations are associated with the models that use explicit discourse-aware approaches: DM pre-training and/or the BCA architecture (without BERT). The correlation values indicate that these sentence representations capture aspects of text structure that are reflected in a positive trend for the variance and negative trends for minimum sentence similarity. This suggests that discussion of multiple topics/aspects, as opposed to a single theme, tends to result in high scoring essays, as visualized in Figure 3 for DM-BCA. The fact that low-scoring essays have higher cross-sentence similarity likely reflects a less varied use of vocabulary rather than higher coherence.

Both BERT-HAN and BERT-BCA lead to representations for which sentence similarity has lower variance and lower correlation of the standard deviation with essay quality. The BERT-BCA sentence embedding similarities, illustrated for the same essays in Figure 4, seem to reflect a fundamentally different, but clearly also useful, representation. In both cases, the BERT embeddings are learned using the next sentence prediction objective (together with the masked language model objective). We hypothesize that the AES performance improvement with BERT, i.e., BERT-HAN and BERT-BCA, may be due to contextualized word representations (within and cross-sentence) reducing the need for BCA cross-sentence attention, as seen by the good performance of the BERT-HAN model, which has no explicit cross-sentence dependencies.

Figure 4: Sentence similarity for BERT-HAN; left is a high scoring essay (ID 108264), right is a low scoring essay (ID 10226).

An initial investigation of sentence-level attention weights suggests that weights tend to be more uniform for low scoring essays and show more variation for higher scoring ones. However, we observe no meaningful difference between the different models.

For both BERT-HAN and BERT-BCA, we froze the sentence and token embeddings (respectively) for use in our models. Our experiments indicated that it is hard to fine-tune the BERT model with the limited training data available for the LDC TOEFL and ASAP training sets. Experiments showed that freezing the model and using its tokens as input gave similar performance to fine-tuning BERT, and was much easier to optimize. For the ASAP data, initial experiments using BERT token embeddings as input to BCA gave significantly worse performance than the best BCA model. Fine-tuning in this case also proved more challenging, and results indicated that it did not perform better than freezing sentence embeddings.

5 Related Work

Neural networks have already shown promising results for AES. Our work differs from prior efforts primarily in the particular architecture that we use. Most prior work uses LSTMs (Farag et al., 2018; Wang et al., 2018; Cummins and Rei, 2018) or a combination of LSTMs and CNNs (Taghipour and Ng, 2016; Zhang and Litman, 2018), cast as linear or logistic regression problems. In contrast, we use a hierarchically structured model with ordinal regression. The work by (Farag et al., 2018) is similar in that they model local text coherence, though the coherence features are used for detecting adversarial examples and not directly in essay scoring. The neural essay scoring system presented in (Cummins and Rei, 2018) also uses a multitask framework, but the auxiliary task is grammatical error detection. In our work, we found that adding grammatical error features improved an existing feature-based system, and we expect that grammar error detection would be a useful auxiliary task for our neural model as well.

There is no single data set that all systems report on, which makes it difficult to compare results. For the TOEFL data, where prior published work uses feature-based systems (Klebanov et al., 2016; Nguyen and Litman, 2018), our system provides state-of-the-art results. For the ASAP data, where there are published studies using neural networks, the best scoring systems use ensembling and/or combine neural and feature-based approaches (Liu et al., 2019; Taghipour and Ng, 2016). Such methods would likely also benefit our model, but the focus here was on the use of auxiliary pretraining tasks.

Our study explored the hierarchical attention network (HAN) (Yang et al., 2016) and the bidirectional context with attention (BCA) network (Nadeem and Ostendorf, 2018). Other neural network architectures for document classification could also be explored, e.g., (Le and Mikolov, 2014; Ji and Smith, 2017; Card et al., 2018).

Numerous previous studies have looked at using external data to improve the performance of neural classifiers. One study that influenced our work is (Jernite et al., 2017), which showed that discourse-based tasks such as sentence order and conjunction prediction can improve neural sentence representations for several NLP tasks. This study used the Book Corpus data (Zhu et al., 2015) and the Gutenberg data (Stroube, 2003) for discourse-based tasks. Our task is similar, but we use a larger set of discourse markers.

Representations from pretrained models, including (Devlin et al., 2018; Cer et al., 2018; Peters et al., 2018), have led to performance improvements across a variety of downstream NLP tasks. As shown in the previous section, token and sentence embeddings from BERT (Devlin et al., 2018) were useful for the essay scoring task, for which more data was available. In contrast to our work, which did not find the BERT sentence embeddings as useful for the ASAP data (when used in a hierarchical document model), BERT was found to be useful for ASAP in (Liu et al., 2019), where neural and hand-crafted features are used jointly in classification. While we experimented with both freezing and fine-tuning BERT, we observed no difference in model performance with fine-tuning. Work by (Peters et al., 2019) has shown that fine-tuning BERT vs. freezing can give significant performance improvements for textual similarity tasks, but the difference is not significant for natural language inference tasks.

6 Conclusions

In this work we show that using a neural model with cross-sentence dependencies and a discourse-based training task can improve performance on automatic essay scoring over both the feature-based state-of-the-art models and hierarchical LSTMs for the LDC TOEFL essay data. The natural language inference task, although useful for other text classification tasks, does not contribute as much to essay scoring. Using pretrained BERT tokens can further improve performance on the TOEFL data, indicating that other discourse-aware tasks, such as next sentence prediction, help essay scoring. For the ASAP data sets, our augmented feature-based system outperforms our best neural models, which may be due to the small amount of training data. The better results in (Liu et al., 2019) are achieved with a model that learns the combination of hand-crafted features and the neural document representation. Thus, for tasks with limited labeled data, there is still a place for hand-crafted features.

Like other neural models, our approach suffers from a lack of interpretability. While our analysis of sentence similarity with the DM-BCA model provides some useful insights into differences between high and low scoring TOEFL essays, the best scoring model did not have the same behavior. This remains an open problem.

References

Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. 2013. TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Jill Burstein, Joel Tetreault, and Nitin Madnani. 2013. The e-rater automated essay scoring system. Handbook of automated essay evaluation: Current applications and new directions, pages 55–67.

Dallas Card, Chenhao Tan, and Noah A. Smith. 2018. Neural models for documents with metadata. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2031–2040.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Martin Chodorow and Jill Burstein. 2004. Beyond essay length: evaluating e-rater's performance on TOEFL essays. ETS Research Report Series, 2004(1):i–38.

Mădălina Cozma, Andrei M. Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings. arXiv preprint arXiv:1804.07954.

Ronan Cummins and Marek Rei. 2018. Neural multi-task learning in automated assessment. arXiv preprint arXiv:1801.06830.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Youmna Farag, Helen Yannakoudakis, and Ted Briscoe. 2018. Neural automated essay scoring and coherence modeling for adversarially crafted input. arXiv preprint arXiv:1804.06898.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027.

Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557.

Yangfeng Ji and Noah Smith. 2017. Neural discourse structure for text categorization. arXiv preprint arXiv:1702.01829.

Cancan Jin, Ben He, Kai Hui, and Le Sun. 2018. TDNN: A two-stage deep neural network for prompt-independent automated essay scoring. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1088–1097.

Beata Beigman Klebanov, Christian Stab, Jill Burstein, Yi Song, Binod Gyawali, and Iryna Gurevych. 2016. Argumentation: Content, structure, and relationship with essay quality. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pages 70–75.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Jiawei Liu, Yang Xu, and Lingzhe Zhao. 2019. Automated essay scoring based on two-stage learning. arXiv preprint arXiv:1901.07744.

Peter McCullagh. 1980. Regression models for ordinal data. Journal of the Royal Statistical Society, Series B (Methodological), pages 109–142.

Farah Nadeem and Mari Ostendorf. 2018. Estimating linguistic complexity for science texts. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 45–55.

Huy V. Nguyen and Diane J. Litman. 2018. Argument mining for improving the automated scoring of persuasive essays. In Proceedings of AAAI, pages 5892–5899.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Matthew Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987.

Emily Pitler, Annie Louis, and Ani Nenkova. 2009. Automatic sense prediction for implicit discourse relations in text. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 683–691. Association for Computational Linguistics.

Mark D. Shermis. 2014. State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20:53–76.

Mark D. Shermis and Jill Burstein. 2013. Handbook of automated essay evaluation: Current applications and new directions. Routledge.

Bryan Stroube. 2003. Literary freedom: Project Gutenberg. Crossroads, 10(1):3–3.

Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1882–1891.

Sowmya Vajjala. 2018. Automated assessment of non-native learner essays: Investigating the role of linguistic features. International Journal of Artificial Intelligence in Education, 28(1):79–105.

Sowmya Vajjala and Detmar Meurers. 2014. Readability assessment for text simplification: From analysing documents to identifying sentential simplifications. ITL-International Journal of Applied Linguistics, 165(2):194–222.

Yucheng Wang, Zhongyu Wei, Yaqian Zhou, and Xuanjing Huang. 2018. Automatic essay scoring incorporating rating schema via reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 791–797.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Haoran Zhang and Diane Litman. 2018. Co-attention based neural network for source-dependent essay scoring. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 399–409.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.

A Discourse marker data

The data for the discourse marker prediction task was created using over 13,000 books from www.smashwords.com. Sentence pairs with 87 discourse markers were selected, mapped to seven groups. The distribution of labels is shown in Table 7. The mapping of labels to groups is given below:

• Idea justification: in other words, in particular, this means that, in fact, for example, alternatively, for instance, to exemplify, specifically, instead, indeed, as an example, as an alternative, actually, as an illustration, as a matter of fact

• Time relation: meanwhile, in the past, simultaneously, thereafter, after a while, by then, in turn, in the future, at the same time, previously, in the meantime

• Idea support: for this reason, therefore, thus, consequently, hence, as a consequence, as a result, that is the reason why, the reason is that, accordingly, this shows that, for that reason, thereby, one of the main reasons

• Idea opposition: nonetheless, on the other hand, however, conversely, on the contrary, in comparison, by contrast, in opposition, in contrast, still, by comparison, nevertheless

• Idea expansion: in like manner, likewise, in addition, also, moreover, equally important, what is more, additionally, in the same way, furthermore, besides, in addition to this, similarly

• Alternative: else, otherwise

• Conclusion: ultimately, in the end, in closing, finally, in brief, last but not least, in sum, to summarize, lastly, at the end of the day, in short, after all, in conclusion, to conclude, overall, eventually, at last, all in all, on the whole, briefly, in summary

  Category             Number of samples
  Idea justification              144022
  Time relation                    24600
  Idea support                     67223
  Idea opposition                 181949
  Idea expansion                   67800
  Alternative                       7203
  Conclusion                       88853
  Negative samples                 95450

Table 7: Categories and data distribution for the discourse marker prediction task.
