Automatic Text Scoring Using Neural Networks

Dimitrios Alikaniotis, Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK
Helen Yannakoudakis, The ALTA Institute, Computer Laboratory, University of Cambridge, Cambridge, UK
Marek Rei, The ALTA Institute, Computer Laboratory, University of Cambridge, Cambridge, UK

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 715–725, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics

Abstract

Automated Text Scoring (ATS) provides a cost-effective and consistent alternative to human marking. However, in order to achieve good performance, the predictive features of the system need to be manually engineered by human experts. We introduce a model that forms word representations by learning the extent to which specific words contribute to the text's score. Using Long-Short Term Memory networks to represent the meaning of texts, we demonstrate that a fully automated framework is able to achieve excellent results over similar approaches. In an attempt to make our results more interpretable, and inspired by recent advances in visualizing neural networks, we introduce a novel method for identifying the regions of the text that the model has found more discriminative.

1 Introduction

Automated Text Scoring (ATS) refers to the set of statistical and natural language processing techniques used to automatically score a text on a marking scale. The advantages of ATS systems have been established since Project Essay Grade (PEG) (Page, 1967; Page, 1968), one of the earliest systems whose development was largely motivated by the prospect of reducing labour-intensive marking activities. In addition to providing a cost-effective and efficient approach to large-scale grading of (extended) text, such systems ensure a consistent application of marking criteria, therefore facilitating equity in scoring.

There is a large body of literature with regard to ATS systems of text produced by non-native English-language learners (Page, 1968; Attali and Burstein, 2006; Rudner and Liang, 2002; Elliot, 2003; Landauer et al., 2003; Briscoe et al., 2010; Yannakoudakis et al., 2011; Sakaguchi et al., 2015, among others), overviews of which can be found in various studies (Williamson, 2009; Dikli, 2006; Shermis and Hammer, 2012). Implicitly or explicitly, previous work has primarily treated text scoring as a supervised text classification task, and has utilized a large selection of techniques, ranging from the use of syntactic parsers, via vectorial semantics combined with dimensionality reduction, to generative and discriminative machine learning.

As multiple factors influence the quality of texts, ATS systems typically exploit a large range of textual features that correspond to different properties of text, such as grammar, vocabulary, style, topic relevance, and discourse coherence and cohesion. In addition to lexical and part-of-speech (POS) ngrams, linguistically deeper features such as types of syntactic constructions, grammatical relations and measures of sentence complexity are among some of the properties that form an ATS system's internal marking criteria. The final representation of a text typically consists of a vector of features that have been manually selected and tuned to predict a score on a marking scale.

Although current approaches to scoring, such as regression and ranking, have been shown to achieve performance that is indistinguishable from that of human examiners, there is substantial manual effort involved in reaching these results on different domains, genres, prompts and so forth. Linguistic features intended to capture the aspects of writing to be assessed are hand-selected and tuned for specific domains. In order to perform well on different data, separate models with distinct feature sets are typically tuned.

Prompted by recent advances in deep learning and the ability of such systems to surpass state-of-the-art models in similar areas (Tang, 2015; Tai et al., 2015), we propose the use of recurrent neural network models for ATS. Multi-layer neural networks are known for automatically learning useful features from data, with lower layers learning basic feature detectors and upper levels learning more high-level abstract features (Lee et al., 2009). Additionally, recurrent neural networks are well-suited for modeling the compositionality of language and have been shown to perform very well on the task of language modeling (Mikolov et al., 2011; Chelba et al., 2013). We therefore propose to apply these network structures to the task of scoring, in order to both improve the performance of ATS systems and learn the required feature representations for each dataset automatically, without the need for manual tuning. More specifically, we focus on predicting a holistic score for extended-response writing items.[1]

However, automated models are not a panacea, and their deployment depends largely on the ability to examine their characteristics, whether they measure what is intended to be measured, and whether their internal marking criteria can be interpreted in a meaningful and useful way. The deep architecture of neural network models, however, makes it rather difficult to identify and extract those properties of text that the network has identified as discriminative. Therefore, we also describe a preliminary method for visualizing the information the model is exploiting when assigning a specific score to an input text.

2 Related Work

In this section, we describe a number of the more influential and/or recent approaches in automated text scoring of non-native English-learner writing.

Project Essay Grade (Page, 1967; Page, 1968; Page, 2003) is one of the earliest automated scoring systems, predicting a score using linear regression over vectors of textual features considered to be proxies of writing quality. Intelligent Essay Assessor (Landauer et al., 2003) uses Latent Semantic Analysis to compute the semantic similarity between texts at specific grade points and a test text, which is assigned a score based on the ones in the training set to which it is most similar. Lonsdale and Strong-Krause (2003) use the Link Grammar parser (Sleator and Temperley, 1995) to analyse and score texts based on the average sentence-level scores calculated from the parser's cost vector.

The Bayesian Essay Test Scoring sYstem (Rudner and Liang, 2002) investigates multinomial and Bernoulli Naive Bayes models to classify texts based on shallow content and style features. e-Rater (Attali and Burstein, 2006), developed by the Educational Testing Service, was one of the first systems to be deployed for operational scoring in high-stakes assessments. The model uses a number of different features, including aspects of grammar, vocabulary and style (among others), whose weights are fitted to a marking scheme by regression.

Chen et al. (2010) use a voting algorithm and address text scoring within a weakly supervised bag-of-words framework. Yannakoudakis et al. (2011) extract deep linguistic features and employ a discriminative learning-to-rank model that outperforms regression.

Recently, McNamara et al. (2015) used a hierarchical classification approach to scoring, utilizing linguistic, semantic and rhetorical features, among others. Farra et al. (2015) utilize variants of logistic and linear regression and develop models that score persuasive essays based on features extracted from opinion expressions and topical elements.

There have also been attempts to incorporate more diverse features to text scoring models. Klebanov and Flor (2013) demonstrate that essay scoring performance is improved by adding to the model information about percentages of highly associated, mildly associated and dis-associated pairs of words that co-exist in a given text. Somasundaran et al. (2014) exploit lexical chains and their interaction with discourse elements for evaluating the quality of persuasive essays with respect to discourse coherence. Crossley et al. (2015) identify student attributes, such as standardized test scores, as predictive of writing success and use them in conjunction with textual features to develop essay scoring models.

In 2012, Kaggle,[2] sponsored by the Hewlett Foundation, hosted the Automated Student Assessment Prize (ASAP) contest, aiming to demonstrate the capabilities of automated text scoring systems (Shermis, 2015). The dataset released consists of around twenty thousand texts (60% of which are marked), produced by middle-school English-speaking students, which we use as part of our experiments to develop our models.

[1] The task is also referred to as Automated Essay Scoring. Throughout this paper, we use the terms text and essay (scoring) interchangeably.
[2] http://www.kaggle.com/c/asap-aes/

3 Models

3.1 C&W Embeddings

Collobert and Weston (2008) and Collobert et al. (2011) introduce a neural network architecture (Fig. 1a) that learns a distributed representation for each word w in a corpus based on its local context. Concretely, suppose [...]

$$f(s),\, b_o \in \mathbb{R}^{1}, \qquad W_{oh} \in \mathbb{R}^{H \times 1}, \qquad W_{hi} \in \mathbb{R}^{D \times H}, \qquad s \in \mathbb{R}^{D}, \qquad b_h \in \mathbb{R}^{H}$$

where $M$, $W_{oh}$, $W_{hi}$, $b_o$, $b_h$ are learnable parameters, $D$, $H$ are hyperparameters controlling the size of the input and the hidden layer, respectively; $\sigma$ is the application of an element-wise non-linear function (htanh in this case). The model learns word embeddings by ranking
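As a rough illustration of how the parameter shapes listed for the C&W model fit together, the sketch below computes a scalar score from an input vector s of size D through one hidden layer of size H with a hard-tanh non-linearity. The functional form f(s) = W_oh^T htanh(W_hi^T s + b_h) + b_o is an assumption chosen to be consistent with those shapes; the equation itself is not reproduced in this passage. The names `htanh`, `score` and `init_params`, and the initialisation range, are hypothetical.

```python
import random


def htanh(x):
    """Hard tanh: clip a scalar to the interval [-1, 1]."""
    return max(-1.0, min(1.0, x))


def score(s, W_hi, b_h, W_oh, b_o):
    """Scalar score for input vector s.

    Assumed form (not taken from the paper text):
        f(s) = W_oh^T htanh(W_hi^T s + b_h) + b_o
    Shapes: s has D entries, W_hi is D x H, b_h has H entries,
    W_oh is H x 1, b_o is a scalar.
    """
    D, H = len(W_hi), len(W_hi[0])
    assert len(s) == D and len(b_h) == H and len(W_oh) == H
    # Hidden layer: one hard-tanh unit per column of W_hi.
    hidden = [
        htanh(sum(W_hi[d][h] * s[d] for d in range(D)) + b_h[h])
        for h in range(H)
    ]
    # Output layer: a single linear unit.
    return sum(W_oh[h][0] * hidden[h] for h in range(H)) + b_o


def init_params(D, H, seed=0):
    """Small random weights and zero biases (illustrative choice only)."""
    rng = random.Random(seed)
    W_hi = [[rng.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(D)]
    b_h = [0.0] * H
    W_oh = [[rng.uniform(-0.1, 0.1)] for _ in range(H)]
    b_o = 0.0
    return W_hi, b_h, W_oh, b_o
```

For instance, `score(s, *init_params(D, H))` returns a single float for any list `s` of length `D`; in the model described here, `s` would be built from the embeddings of the words in a local context window.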
