Automatic Text Scoring Using Neural Networks

Dimitrios Alikaniotis
Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK
[email protected]

Helen Yannakoudakis
The ALTA Institute, Computer Laboratory, University of Cambridge, Cambridge, UK
[email protected]

Marek Rei
The ALTA Institute, Computer Laboratory, University of Cambridge, Cambridge, UK
[email protected]

Abstract

Automated Text Scoring (ATS) provides a cost-effective and consistent alternative to human marking. However, in order to achieve good performance, the predictive features of the system need to be manually engineered by human experts. We introduce a model that forms word representations by learning the extent to which specific words contribute to the text's score. Using Long-Short Term Memory networks to represent the meaning of texts, we demonstrate that a fully automated framework is able to achieve excellent results over similar approaches. In an attempt to make our results more interpretable, and inspired by recent advances in visualizing neural networks, we introduce a novel method for identifying the regions of the text that the model has found more discriminative.

1 Introduction

Automated Text Scoring (ATS) refers to the set of statistical and natural language processing techniques used to automatically score a text on a marking scale. The advantages of ATS systems have been established since Project Essay Grade (PEG) (Page, 1967; Page, 1968), one of the earliest systems whose development was largely motivated by the prospect of reducing labour-intensive marking activities. In addition to providing a cost-effective and efficient approach to large-scale grading of (extended) text, such systems ensure a consistent application of marking criteria, therefore facilitating equity in scoring.

There is a large body of literature with regards to ATS systems of text produced by non-native English-language learners (Page, 1968; Attali and Burstein, 2006; Rudner and Liang, 2002; Elliot, 2003; Landauer et al., 2003; Briscoe et al., 2010; Yannakoudakis et al., 2011; Sakaguchi et al., 2015, among others), overviews of which can be found in various studies (Williamson, 2009; Dikli, 2006; Shermis and Hammer, 2012). Implicitly or explicitly, previous work has primarily treated text scoring as a supervised text classification task, and has utilized a large selection of techniques, ranging from the use of syntactic parsers, via vectorial semantics combined with dimensionality reduction, to generative and discriminative machine learning.

As multiple factors influence the quality of texts, ATS systems typically exploit a large range of textual features that correspond to different properties of text, such as grammar, vocabulary, style, topic relevance, and discourse coherence and cohesion. In addition to lexical and part-of-speech (POS) ngrams, linguistically deeper features such as types of syntactic constructions, grammatical relations and measures of sentence complexity are among some of the properties that form an ATS system's internal marking criteria. The final representation of a text typically consists of a vector of features that have been manually selected and tuned to predict a score on a marking scale.

Although current approaches to scoring, such as regression and ranking, have been shown to achieve performance that is indistinguishable from that of human examiners, there is substantial manual effort involved in reaching these results on different domains, genres, prompts and so forth. Linguistic features intended to capture the aspects of writing to be assessed are hand-selected and tuned for specific domains. In order to perform well on different data, separate models with distinct feature sets are typically tuned.


Prompted by recent advances in deep learning and the ability of such systems to surpass state-of-the-art models in similar areas (Tang, 2015; Tai et al., 2015), we propose the use of recurrent neural network models for ATS. Multi-layer neural networks are known for automatically learning useful features from data, with lower layers learning basic feature detectors and upper levels learning more high-level abstract features (Lee et al., 2009). Additionally, recurrent neural networks are well-suited for modeling the compositionality of language and have been shown to perform very well on the task of language modeling (Mikolov et al., 2011; Chelba et al., 2013). We therefore propose to apply these network structures to the task of scoring, in order to both improve the performance of ATS systems and learn the required feature representations for each dataset automatically, without the need for manual tuning. More specifically, we focus on predicting a holistic score for extended-response writing items.[1]

However, automated models are not a panacea, and their deployment depends largely on the ability to examine their characteristics: whether they measure what is intended to be measured, and whether their internal marking criteria can be interpreted in a meaningful and useful way. The deep architecture of neural network models, however, makes it rather difficult to identify and extract those properties of text that the network has identified as discriminative. Therefore, we also describe a preliminary method for visualizing the information the model is exploiting when assigning a specific score to an input text.

[1] The task is also referred to as Automated Essay Scoring. Throughout this paper, we use the terms text and essay (scoring) interchangeably.

2 Related Work

In this section, we describe a number of the more influential and/or recent approaches in automated text scoring of non-native English-learner writing.

Project Essay Grade (Page, 1967; Page, 1968; Page, 2003) is one of the earliest automated scoring systems, predicting a score using linear regression over vectors of textual features considered to be proxies of writing quality. Intelligent Essay Assessor (Landauer et al., 2003) uses Latent Semantic Analysis to compute the semantic similarity between texts at specific grade points and a test text, which is assigned a score based on the ones in the training set to which it is most similar. Lonsdale and Strong-Krause (2003) use the Link Grammar parser (Sleator and Temperley, 1995) to analyse and score texts based on the average sentence-level scores calculated from the parser's cost vector.

The Bayesian Essay Test Scoring sYstem (Rudner and Liang, 2002) investigates multinomial and Bernoulli Naive Bayes models to classify texts based on shallow content and style features. e-Rater (Attali and Burstein, 2006), developed by the Educational Testing Service, was one of the first systems to be deployed for operational scoring in high-stakes assessments. The model uses a number of different features, including aspects of grammar, vocabulary and style (among others), whose weights are fitted to a marking scheme by regression.

Chen et al. (2010) use a voting algorithm and address text scoring within a weakly supervised bag-of-words framework. Yannakoudakis et al. (2011) extract deep linguistic features and employ a discriminative learning-to-rank model that outperforms regression.

Recently, McNamara et al. (2015) used a hierarchical classification approach to scoring, utilizing linguistic, semantic and rhetorical features, among others. Farra et al. (2015) utilize variants of logistic and linear regression and develop models that score persuasive essays based on features extracted from opinion expressions and topical elements.

There have also been attempts to incorporate more diverse features into text scoring models. Klebanov and Flor (2013) demonstrate that essay scoring performance is improved by adding to the model information about percentages of highly associated, mildly associated and dis-associated pairs of words that co-exist in a given text. Somasundaran et al. (2014) exploit lexical chains and their interaction with discourse elements for evaluating the quality of persuasive essays with respect to discourse coherence. Crossley et al. (2015) identify student attributes, such as standardized test scores, as predictive of writing success and use them in conjunction with textual features to develop essay scoring models.

In 2012, Kaggle,[2] sponsored by the Hewlett Foundation, hosted the Automated Student Assessment Prize (ASAP) contest, aiming to demonstrate the capabilities of automated text scoring systems (Shermis, 2015). The dataset released consists of around twenty thousand texts (60% of which are marked), produced by middle-school English-speaking students, which we use as part of our experiments to develop our models.

[2] http://www.kaggle.com/c/asap-aes/

3 Models

3.1 C&W Embeddings

Collobert and Weston (2008) and Collobert et al. (2011) introduce a neural network architecture (Fig. 1a) that learns a distributed representation for each word w in a corpus based on its local context. Concretely, suppose we want to learn a representation for some target word w_t found in an n-sized sequence of words S = (w_1, ..., w_t, ..., w_n), based on the other words which exist in the same sequence (∀w_i ∈ S | w_i ≠ w_t). In order to derive this representation, the model learns to discriminate between S and some 'noisy' counterpart S′ in which the target word w_t has been substituted for a randomly sampled word from the vocabulary: S′ = (w_1, ..., w_c, ..., w_n | w_c ∼ V). In this way, every word w is more predictive of its local context than any other random word in the corpus.

Every word in V is mapped to a real-valued vector in Ω via a mapping function C(·) such that C(w_i) = ⟨M_{?i}⟩, where M ∈ R^{D×|V|} is the embedding matrix and ⟨M_{?i}⟩ is the ith column of M. The network takes S as input by concatenating the vectors of the words found in it: s_t = ⟨C(w_1)^⊤ ∥ ... ∥ C(w_t)^⊤ ∥ ... ∥ C(w_n)^⊤⟩ ∈ R^{nD}. Similarly, S′ is formed by substituting C(w_t) for C(w_c) ∼ M | w_c ≠ w_t.

The input vector is then passed through a hard tanh layer defined as

$$\mathrm{htanh}(x) = \begin{cases} -1 & x < -1 \\ x & -1 \le x \le 1 \\ 1 & x > 1 \end{cases} \quad (1)$$

which feeds a single linear unit in the output layer. The function that is computed by the network is ultimately given by (4):

$$s_t = \langle M_{?1}^{\top} \| \dots \| M_{?t}^{\top} \| \dots \| M_{?n}^{\top} \rangle \quad (2)$$

$$i = \sigma(W_{hi} s_t + b_h) \quad (3)$$

$$f(s_t) = W_{oh} i + b_o \quad (4)$$

where M, W_oh, W_hi, b_o, b_h are learnable parameters, with f(s), b_o ∈ R, W_oh ∈ R^{1×H}, W_hi ∈ R^{H×D}, s ∈ R^D and b_h ∈ R^H; D, H are hyperparameters controlling the size of the input and the hidden layer, respectively; σ is the application of an element-wise non-linear function (htanh in this case).

The model learns word embeddings by ranking the activation of the true sequence S higher than the activation of its 'noisy' counterpart S′. The objective of the model then becomes to minimize the hinge loss, which ensures that the activations of the original and 'noisy' ngrams will differ by at least 1:

$$\mathrm{loss}_{context}(\mathit{target}, \mathit{corrupt}) = \sum_{k}^{E} \left[ 1 - f(s_t) + f(s_{c_k}) \right]_{+}, \quad \forall k \in \mathbb{Z} \quad (5)$$

where E is another hyperparameter controlling the number of 'noisy' sequences we give along with the correct sequence (Mikolov et al., 2013; Gutmann and Hyvärinen, 2012).
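To make the objective concrete, the following is a minimal numpy sketch of Eqs. (1)-(5): a forward pass through the hard-tanh hidden layer and the hinge loss over E 'noisy' n-grams. The variable names and layout are illustrative assumptions of this sketch, not the released implementation.

```python
import numpy as np

def htanh(x):
    # Hard tanh of Eq. (1): clip activations to [-1, 1].
    return np.clip(x, -1.0, 1.0)

def f(seq_ids, M, W_hi, b_h, W_oh, b_o):
    # Eqs. (2)-(4): concatenate the embedding columns of the n-gram,
    # apply the hard-tanh hidden layer, then a single linear output unit.
    s_t = np.concatenate([M[:, i] for i in seq_ids])   # s_t in R^{nD}
    i_h = htanh(W_hi @ s_t + b_h)                      # hidden layer, R^H
    return (W_oh @ i_h + b_o).item()                   # f(s_t), a scalar

def loss_context(seq_ids, M, W_hi, b_h, W_oh, b_o, E, vocab_size, rng):
    # Eq. (5): hinge loss ranking the true n-gram at least 1 above each of
    # E 'noisy' n-grams whose target (centre) word is replaced at random.
    t = len(seq_ids) // 2                              # centre position
    f_true = f(seq_ids, M, W_hi, b_h, W_oh, b_o)
    total = 0.0
    for _ in range(E):
        corrupt = list(seq_ids)
        corrupt[t] = int(rng.integers(vocab_size))     # w_c ~ V
        total += max(0.0, 1.0 - f_true + f(corrupt, M, W_hi, b_h, W_oh, b_o))
    return total
```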


Figure 1: Architecture of the original C&W model (left) and of our extended version (right).

3.2 Augmented C&W model

Following Tang (2015), we extend the previous model to capture not only the local linguistic environment of each word, but also how each word contributes to the overall score of the essay. The aim here is to construct representations which, along with the linguistic information given by the linear order of the words in each sentence, are able to capture usage information. Words such as is, are, to, at, which appear with any essay score, are considered to be under-informative in the sense that they will activate equally both on high and low scoring essays. Informative words, on the other hand, are the ones which would have an impact on the essay score (e.g., spelling mistakes).

In order to capture those score-specific word embeddings (SSWEs), we extend (4) by adding a further linear unit in the output layer that performs linear regression, predicting the essay score. Using (2), the activations of the network (presented in Fig. 1b) are given by:

$$f_{ss}(s) = W_{oh_1} i + b_{o_1} \quad (6)$$

$$f_{context}(s) = W_{oh_2} i + b_{o_2} \quad (7)$$

where f_ss(s) ∈ [min(score), max(score)], W_oh1 ∈ R^{1×H} and b_o1 ∈ R.

The error we minimize for f_ss (where ss stands for score specific) is the mean squared error between the predicted ŷ and the actual essay score y:

$$\mathrm{loss}_{score}(s) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \quad (8)$$

From (5) and (8) we compute the overall loss function as a weighted linear combination of the two loss functions (9), back-propagating the error gradients to the embedding matrix M:

$$\mathrm{loss}_{overall}(s) = \alpha \cdot \mathrm{loss}_{context}(s, s') + (1 - \alpha) \cdot \mathrm{loss}_{score}(s) \quad (9)$$

where α is the hyper-parameter determining how the two error functions should be weighted. α values closer to 0 will place more weight on the score-specific aspect of the embeddings, whereas values closer to 1 will favour the contextual information.
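A minimal sketch of how the two objectives combine (Eqs. (8)-(9)); loss_context here stands for the hinge loss of Eq. (5), e.g. as sketched at the end of §3.1, and the function names are assumptions of this sketch:

```python
import numpy as np

def loss_score(y_pred, y_gold):
    # Eq. (8): mean squared error between predicted and gold essay scores.
    y_pred, y_gold = np.asarray(y_pred, float), np.asarray(y_gold, float)
    return float(np.mean((y_pred - y_gold) ** 2))

def loss_overall(l_context, l_score, alpha):
    # Eq. (9): weighted combination of the two losses. alpha close to 0
    # emphasises the score-specific signal; alpha = 1 recovers the plain
    # C&W contextual objective.
    return alpha * l_context + (1.0 - alpha) * l_score
```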

Fig. 2 shows the advantage of using SSWEs in the present setting. Based solely on the information provided by the linguistic environment, words such as computer and laptop are going to be placed together with their mis-spelled counterparts copmuter and labtop (Fig. 2a). This, however, does not reflect the fact that the mis-spelled words tend to appear in lower scoring essays. Using SSWEs, the correctly spelled words are pulled apart in the vector space from the incorrectly spelled ones, retaining, however, the information that labtop and copmuter are still contextually related (Fig. 2b).

Figure 2: Comparison between standard and score-specific word embeddings. By virtue of appearing in similar environments, standard neural embeddings will place the correct and the incorrect spelling closer in the vector space. However, since the mistakes are found in lower scoring essays, SSWEs are able to discriminate between the correct and the incorrect versions without loss in contextual meaning.

3.3 Long-Short Term Memory Network

We use the SSWEs obtained by our model to derive continuous representations for each essay. We treat each essay as a sequence of tokens and explore the use of uni- and bi-directional (Graves, 2012) Long-Short Term Memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) in order to embed these sequences in a vector of fixed size. Both uni- and bi-directional LSTMs have been effectively used for embedding long sequences (Hermann et al., 2015). LSTMs are a kind of recurrent neural network (RNN) architecture in which the output at time t is conditioned on the input s both at time t and at time t − 1:

$$y_t = W_{yh} h_t + b_y \quad (10)$$

$$h_t = \mathcal{H}(W_{hs} s_t + W_{hh} h_{t-1} + b_h) \quad (11)$$

where s_t is the input at time t, and H is usually an element-wise application of a non-linear function. In LSTMs, H is substituted for a composite function defining h_t as:

$$i_t = \sigma(W_{is} s_t + W_{ih} h_{t-1} + W_{ic} c_{t-1} + b_i) \quad (12)$$

$$f_t = \sigma(W_{fs} s_t + W_{fh} h_{t-1} + W_{fc} c_{t-1} + b_f) \quad (13)$$

$$c_t = i_t \odot g(W_{cs} s_t + W_{ch} h_{t-1} + b_c) + f_t \odot c_{t-1} \quad (14)$$

$$o_t = \sigma(W_{os} s_t + W_{oh} h_{t-1} + W_{oc} c_t + b_o) \quad (15)$$

$$h_t = o_t \odot h(c_t) \quad (16)$$

where g, σ and h are element-wise non-linear functions such as the logistic sigmoid (1/(1 + e^{−x})) and the hyperbolic tangent ((e^{2z} − 1)/(e^{2z} + 1)); ⊙ is the Hadamard product; W, b are the learned weights and biases respectively; and i, f, o and c are the input, forget, output gates and the cell activation vectors respectively.

Training the LSTM in a uni-directional manner (i.e., from left to right) might leave out important information about the sentence. For example, our interpretation of a word at some point t_i might be different once we know the word at t_{i+5}. An effective way to get around this issue has been to train the LSTM in a bidirectional manner. This requires doing both a forward and a backward pass of the sequence (i.e., feeding the words from left to right and from right to left). The hidden layer element in (10) can therefore be re-written as the concatenation of the forward and backward hidden vectors:

$$y_t = W_{yh} \langle \overleftarrow{h}_t^{\top} \| \overrightarrow{h}_t^{\top} \rangle + b_y \quad (17)$$

We feed the embedding of each word found in each essay to the LSTM one at a time, zero-padding shorter sequences. We form D-dimensional essay embeddings by taking the activation of the LSTM layer at the timestep where the last word of the essay was presented to the network. In the case of bi-directional LSTMs, the two independent passes of the essay (from left to right and from right to left) are concatenated together to predict the essay score. These essay embeddings are then fed to a linear unit in the output layer which predicts the essay score (Fig. 3). We use the mean square error between the predicted and the gold score as our loss function, and optimize with RMSprop (Dauphin et al., 2015), propagating the errors back to the word embeddings.[3]

Figure 3: A single-layer Long Short Term Memory (LSTM) network. The word vectors w_i enter the input layer one at a time. The hidden layer that has been formed at the last timestep is used to predict the essay score using linear regression. We also explore the use of bi-directional LSTMs (dashed arrows). For 'deeper' representations, we can stack more LSTM layers after the hidden layer shown here.

[3] The maximum time for jointly training a particular SSWE + LSTM combination took about 55–60 hours on an Amazon EC2 g2.2xlarge instance (average time was 27–30 hours).
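For concreteness, a numpy sketch of one step of the composite function in Eqs. (12)-(16) and of the uni-directional essay-embedding procedure. The parameter dictionary p and the element-wise (diagonal) treatment of the peephole terms are simplifying assumptions of this sketch; training (RMSprop, dropout, back-propagation into the embeddings) is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(s_t, h_prev, c_prev, p):
    # One application of the composite function of Eqs. (12)-(16); the
    # peephole weights W_ic, W_fc, W_oc are taken as vectors applied
    # element-wise here for simplicity.
    i = sigmoid(p['W_is'] @ s_t + p['W_ih'] @ h_prev + p['W_ic'] * c_prev + p['b_i'])  # (12)
    f = sigmoid(p['W_fs'] @ s_t + p['W_fh'] @ h_prev + p['W_fc'] * c_prev + p['b_f'])  # (13)
    c = i * np.tanh(p['W_cs'] @ s_t + p['W_ch'] @ h_prev + p['b_c']) + f * c_prev      # (14)
    o = sigmoid(p['W_os'] @ s_t + p['W_oh'] @ h_prev + p['W_oc'] * c + p['b_o'])       # (15)
    h = o * np.tanh(c)                                                                 # (16)
    return h, c

def embed_essay(word_vectors, p, d_lstm):
    # Feed the word embeddings one at a time; the hidden activation at the
    # last timestep is the fixed-size essay embedding, which a linear unit
    # then maps to a score as in Eq. (10).
    h, c = np.zeros(d_lstm), np.zeros(d_lstm)
    for s_t in word_vectors:
        h, c = lstm_step(s_t, h, c, p)
    return h
```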

3.4 Other Baselines

We train a Support Vector Regression model (see Section 4), which is one of the most widely used approaches in text scoring. We parse the data using the RASP parser (Briscoe et al., 2006) and extract a number of different features for assessing the quality of the essays. More specifically, we use character and part-of-speech unigrams, bigrams and trigrams; word unigrams, bigrams and trigrams where we replace open-class words with their POS; and the distribution of common nouns, prepositions, and coordinators. Additionally, we extract and use as features the rules from the phrase-structure tree based on the top parse for each sentence, as well as an estimate of the error rate based on manually-derived error rules. Ngrams are weighted using tf–idf, while the rest are count-based and scaled so that all features have approximately the same order of magnitude. The final input vectors are unit-normalized to account for varying text-length biases.

Further to the above, we also explore the use of the Distributed Memory Model of Paragraph Vectors (PV-DM) proposed by Le and Mikolov (2014), as a means to directly obtain essay embeddings. PV-DM takes as input word vectors which make up ngram sequences and uses those to predict the next word in the sequence. A feature of PV-DM, however, is that each 'paragraph' is assigned a unique vector which is used in the prediction. This vector, therefore, acts as a 'memory', retaining information from all contexts that have appeared in this paragraph. Paragraph vectors are then fed to a linear regression model to obtain essay scores (we refer to this model as doc2vec).

Additionally, we explore the effect of our score-specific method for learning word embeddings, when compared against three different kinds of word embeddings:

• word2vec embeddings (Mikolov et al., 2013) trained on our training set (see Section 4).

• Publicly available word2vec embeddings (Mikolov et al., 2013) pre-trained on the Google News corpus (ca. 100 billion words), which have been very effective in capturing solely contextual information.

• Embeddings that are constructed on the fly by the LSTM, by propagating the errors from its hidden layer back to the embedding matrix (i.e., we do not provide any pre-trained word embeddings).[4]

[4] Another option would be to use standard C&W embeddings; however, this is equivalent to using SSWEs with α = 1, which we found to produce low results.

4 Dataset

The Kaggle dataset contains 12,976 essays ranging from 150 to 550 words each, marked by two raters (Cohen's κ = 0.86). The essays were written by students ranging from Grade 7 to Grade 10, comprising eight distinct sets elicited by eight different prompts, each with distinct marking criteria and score range.[5] For our experiments, we use the resolved combined score between the two raters, which is calculated as the average between the two raters' scores (if the scores are close), or is determined by a third expert (if the scores are far apart). Currently, the state-of-the-art on this dataset has achieved a Cohen's κ = 0.81 (using quadratic weights). However, the test set was released without the gold score annotations, rendering any comparisons futile, and we are therefore restricted to splitting the given training set to create a new test set.

The sets were divided as follows: 80% of the entire dataset was reserved for training/validation, and 20% for testing. 80% of the training/validation subset was used for actual training, while the remaining 20% was used for validation (in absolute terms for the entire dataset: 64% training, 16% validation, 20% testing). To facilitate future work, we release the ids of the validation and test set essays we used in our experiments, in addition to our source code and various hyperparameter values.[6]

[5] Five prompts employed a holistic scoring rubric, one was scored with a two-trait rubric, and two were scored with a multi-trait rubric, but reported as a holistic score (Shermis and Hammer, 2012).

[6] The code, by-model hyperparameter configurations and the IDs of the testing set are available at https://github.com/dimalik/ats/.
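For concreteness, a minimal sketch of the 64/16/20 partition described above; the seed and the ID list are placeholders, and the actual held-out IDs are the ones released with the source code.

```python
import random

def split_ids(essay_ids, seed=0):
    # 80% train/validation vs. 20% test, then 80/20 within
    # train/validation: 64% train, 16% validation, 20% test overall.
    ids = list(essay_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_test = round(0.20 * n)
    n_val = round(0.16 * n)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test
```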

Model                                   Spearman's ρ   Pearson r   RMSE   Cohen's κ
doc2vec                                 0.62           0.63        4.43   0.85
SVM                                     0.78           0.77        8.85   0.75
LSTM                                    0.59           0.60        6.80   0.54
BLSTM                                   0.70           0.50        7.32   0.36
Two-layer LSTM                          0.58           0.55        7.16   0.46
Two-layer BLSTM                         0.68           0.52        7.31   0.48
word2vec + LSTM                         0.68           0.77        5.39   0.76
word2vec + BLSTM                        0.75           0.86        4.34   0.85
word2vec + Two-layer LSTM               0.76           0.71        6.02   0.69
word2vec + Two-layer BLSTM              0.78           0.83        4.79   0.82
word2vec_pre-trained + Two-layer BLSTM  0.79           0.91        3.20   0.92
SSWE + LSTM                             0.80           0.94        2.90   0.94
SSWE + BLSTM                            0.80           0.92        3.21   0.95
SSWE + Two-layer LSTM                   0.82           0.93        3.00   0.94
SSWE + Two-layer BLSTM                  0.91           0.96        2.40   0.96

Table 1: Results of the different models on the Kaggle dataset. All resulting vectors were trained using linear regression. We optimized the parameters using a separate validation set (see text) and report the results on the test set.

5 Experiments

5.1 Results

The hyperparameters for our model were as follows: the sizes of the layers H and D, the learning rate η, the window size n, the number of 'noisy' sequences E, and the weighting factor α. The hyperparameters of the LSTM were the size of the LSTM layer D_LSTM as well as the dropout rate r. Since the search space would be massive for grid search, the best hyperparameters were determined using Bayesian Optimization (Snoek et al., 2012). In this context, the performance of our models on the validation set is modeled as a sample from a Gaussian process (GP) by constructing a probabilistic model for the error function and then exploiting this model to make decisions about where to next evaluate the function. The hyperparameters for our baselines were also determined using the same methodology.

All models are trained on our training set (see Section 4), except the one prefixed 'word2vec_pre-trained', which uses pre-trained embeddings on the Google News Corpus. We report the Spearman's rank correlation coefficient ρ, Pearson's product-moment correlation coefficient r, and the root mean square error (RMSE) between the predicted scores and the gold standard on our test set, which are considered more appropriate metrics for evaluating essay scoring systems (Yannakoudakis and Cummins, 2015). However, we also report Cohen's κ with quadratic weights, which was the evaluation metric used in the Kaggle competition (a sketch of how these metrics can be computed follows below). Performance of the models is shown in Table 1.

In terms of correlation, SVMs produce competitive results (ρ = 0.78 and r = 0.77), outperforming doc2vec, LSTM and BLSTM, as well as their deep counterparts. As described above, the SVM model has rich linguistic knowledge and consists of hand-picked features which have achieved excellent performance in similar tasks (Yannakoudakis et al., 2011). However, in terms of RMSE, it is among the lowest performing models (8.85), together with 'BLSTM' and 'Two-layer BLSTM'. Deep models in combination with word2vec (i.e., 'word2vec + Two-layer LSTM' and 'word2vec + Two-layer BLSTM') and SVMs are comparable in terms of r and ρ, though not in terms of RMSE, where the former produce better results, with RMSE improving by half (4.79). doc2vec also produces competitive RMSE results (4.43), though correlation is much lower (ρ = 0.62 and r = 0.63).

The two BLSTMs trained with word2vec embeddings are among the most competitive models in terms of correlation and outperform all the models, except the ones using pre-trained embeddings and SSWEs. Increasing the number of hidden layers and/or adding bi-directionality does not always improve performance, but it clearly helps in this case, and performance improves compared to their uni-directional counterparts.

Using pre-trained word embeddings improves the results further. More specifically, we found 'word2vec_pre-trained + Two-layer BLSTM' to be the best configuration, increasing correlation to ρ = 0.79 and r = 0.91, and reducing RMSE to 3.2.
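The four metrics reported in Table 1 can be computed with standard routines; a minimal sketch using scipy and scikit-learn, where rounding the continuous predictions to integer scores before computing κ is an assumption of this sketch:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

def evaluate(pred, gold):
    # Spearman's rho, Pearson's r, RMSE and quadratically weighted
    # Cohen's kappa (the metric of the Kaggle ASAP competition).
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    rho = spearmanr(pred, gold).correlation
    r = pearsonr(pred, gold)[0]
    rmse = float(np.sqrt(np.mean((pred - gold) ** 2)))
    kappa = cohen_kappa_score(np.rint(pred).astype(int),
                              gold.astype(int), weights='quadratic')
    return rho, r, rmse, kappa
```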

We note, however, that this is not an entirely fair comparison, as these embeddings are trained on a much larger corpus than our training set (which we use to train our models). Nevertheless, when we use our SSWE models we are able to outperform 'word2vec_pre-trained + Two-layer BLSTM', even though our embeddings are trained on fewer data points. More specifically, our best model ('SSWE + Two-layer BLSTM') improves correlation to ρ = 0.91 and r = 0.96, as well as RMSE to 2.4, giving a maximum increase of around 10% in correlation.[7] Given the results of the pre-trained model, we believe that the performance of our best SSWE model will further improve should more training data be given to it.

[7] Our approach outperforms all the other models in terms of Cohen's κ too.

5.2 Discussion

Our SSWE + LSTM approach, having no prior knowledge of the grammar of the language or the domain of the text, is able to score the essays in a very human-like way, outperforming other state-of-the-art systems. Furthermore, while we tuned the models' hyperparameters on a separate validation set, we did not perform any further pre-processing of the text other than simple tokenization.

In the essay scoring literature, text length tends to be a strong predictor of the overall score. In order to investigate any possible effects of essay length, we also calculate the correlation between the gold scores and the length of the essays. We find that the correlations on the test set are relatively low (r = 0.3, ρ = 0.44), and therefore conclude that there are no such strong effects.

As described above, we used Bayesian Optimization to find optimal hyperparameter configurations in fewer steps than in regular grid search. Using this approach, the optimization model showed some clear preferences for some parameters which were associated with better scoring models:[8] the number of 'noisy' sequences E, the weighting factor α and the size of the LSTM layer D_LSTM. The optimal α value was consistently set to 0.1, which shows that our SSWE approach was necessary to capture the usage of the words. Performance dropped considerably as α increased (less weight on SSWEs and more on the contextual aspect). When using α = 1, which is equivalent to using the basic C&W model, we found that performance was considerably lower (e.g., correlation dropped to ρ = 0.15). The number of 'noisy' sequences was set to 200, which was the highest possible setting we considered, although this might be related more to the size of the corpus (see Mikolov et al. (2013) for a similar discussion) rather than to our approach. Finally, the optimal value for D_LSTM was 10 (the lowest value investigated), which again may be corpus-dependent.

[8] For the best scoring model the hyperparameters were as follows: D = 200, H = 100, η = 1e−7, n = 9, E = 200, α = 0.1, D_LSTM = 10, r = 0.5.

6 Visualizing the black box

In this section, inspired by recent advances in (de-)convolutional neural networks in computer vision (Simonyan et al., 2013) and text summarization (Denil et al., 2014), we introduce a novel method of generating interpretable visualizations of the network's performance. In the present context, this is particularly important as one advantage of the manual methods discussed in §2 is that we are able to know on what grounds the model made its decisions and which features are most discriminative.

At the outset, our goal is to assess the 'quality' of our word vectors. By 'quality' we mean the level to which a word appearing in a particular context would prove to be problematic for the network's prediction. In order to identify 'high' and 'low' quality vectors, we perform a single pass of an essay from left to right and let the LSTM make its score prediction. Normally, we would provide the gold scores and adjust the network weights based on the error gradients. Instead, we provide the network with a pseudo-score by taking the maximum score this specific essay can take[9] and provide this as the 'gold' score. If the word vector is of 'high' quality (i.e., associated with higher scoring texts), then there is going to be little adjustment to the weights in order to predict the highest score possible. Conversely, providing the minimum possible score (here 0), we can assess how 'bad' our word vectors are. Vectors which require minimal adjustment to reach the lowest score are considered of 'lower' quality. Note that since we do a complete pass over the network (without doing any weight updates), the vector quality is going to be essay dependent.

[9] Note that in the Kaggle dataset essays from different essay sets have different maximum scores. Here we take as ỹ_max the essay set maximum rather than the global maximum.
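A sketch of the probe described above, approximating the required gradient magnitude by finite differences purely for illustration (a real implementation would read the gradients off back-propagation); predict stands for a hypothetical function mapping an essay's word-vector matrix to a score:

```python
import numpy as np

def grad_norm(predict, essay_vecs, idx, pseudo_score, eps=1e-4):
    # ||dL/dx||_2 for the word vector at position idx, where L is the
    # squared error between the network's prediction and a pseudo-score
    # (the essay-set maximum, or the minimum, here 0).
    base = np.asarray(essay_vecs, dtype=float)
    l0 = (predict(base) - pseudo_score) ** 2
    grad = np.zeros(base.shape[1])
    for d in range(base.shape[1]):
        bumped = base.copy()
        bumped[idx, d] += eps
        grad[d] = ((predict(bumped) - pseudo_score) ** 2 - l0) / eps
    return float(np.linalg.norm(grad))

def word_quality(predict, essay_vecs, idx, y_max, y_min=0.0):
    # Combined visualization score L(y_max, f(x)) - L(y_min, f(x)): the two
    # magnitudes indicate how far the vector would have to move to reach
    # the highest versus the lowest attainable score.
    return (grad_norm(predict, essay_vecs, idx, y_max)
            - grad_norm(predict, essay_vecs, idx, y_min))
```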

1. . . . way to show that Saeng is a determined . . .
2. . . . sometimes I do . Being patience is being . . .
3. . . . which leaves the reader satisfied . . .
4. . . . is in this picture the cyclist is riding a dry and area which could mean that it is very and the looks to be going down hill there looks to be a lot of turns . . .
5. . . . The only reason im putting this in my own way is because know one is patient in my family . . .
6. . . . Whether they are building hand-eye coordination , researching a country , or family and friends through @CAPS3 , @CAPS2 , @CAPS6 the internet is highly and I hope you feel the same way .

Table 2: Several example visualizations created by our LSTM. The full text of the essay is shown in black and the ‘quality’ of the word vectors appears in color on a range from dark red (low quality) to dark green (high quality).

Concretely, using the network function f(x) as computed by Eqs. (12)-(17), we can approximate the loss induced by feeding the pseudo-scores by taking the magnitude of each error vector, Eqs. (18)-(19). Since lim_{‖w‖₂→0} ŷ = y, this magnitude should tell us how much an embedding needs to change in order to achieve the gold score (here the pseudo-score). In the case where we provide the minimum as a pseudo-score, a ‖w‖₂ value closer to zero would indicate an incorrectly used word. For the results reported here, we combine the magnitudes produced from giving the maximum and minimum pseudo-scores into a single score, computed as L(ỹ_max, f(x)) − L(ỹ_min, f(x)), where:

$$L(\tilde{y}, f(x)) \approx \|w\|_2 \quad (18)$$

$$w = \nabla L(x) \triangleq \left. \frac{\partial L}{\partial x} \right|_{(\tilde{y}, f(x))} \quad (19)$$

where ‖w‖₂ is the vector Euclidean norm ‖w‖ = √(Σ_{i=1}^{N} w_i²); L(·) is the mean squared error as in Eq. (8); and ỹ is the essay pseudo-score.

We show some examples of this visualization procedure in Table 2. The model is capable of providing positive feedback. Correctly placed punctuation or long-distance dependencies (as in Sentence 6, are . . . researching) are particularly favoured by the model. Conversely, the model does not deal well with proper names, but is able to cope with POS mistakes (e.g., Being patience or the internet is highly and . . .). However, as seen in Sentence 3, the model is not perfect and returns a false negative in the case of satisfied.

One potential drawback of this approach is that the gradients are calculated only after the end of the essay. This means that if a word appears multiple times within an essay, sometimes correctly and sometimes incorrectly, the model would not be able to distinguish between them. Two possible solutions to this problem are to either provide the gold score at each timestep, which results in a very computationally expensive endeavour, or to feed sentences or phrases of smaller size for which the scoring would be more consistent.[10]

[10] We note that the same visualization technique can be used to show the 'goodness' of phrases/sentences. Within the phrase setting, after feeding the last word of the phrase to the network, the LSTM layer will contain the phrase embedding. Then, we can assess the 'goodness' of this embedding by evaluating the error gradients after predicting the highest/lowest score.

7 Conclusion

In this paper, we introduced a deep neural network model capable of representing both local contextual and usage information as encapsulated by essay scoring. This model yields score-specific word embeddings used later by a recurrent neural network in order to form essay representations.

We have shown that this kind of architecture is able to surpass similar state-of-the-art systems, as well as systems based on manual feature engineering which have achieved results close to the upper bound in past work. We also introduced a novel way of exploring the basis of the network's internal scoring criteria, and showed that such models are interpretable and can be further exploited to provide useful feedback to the author.

Acknowledgments

The first author is supported by the Onassis Foundation. We would like to thank the three anonymous reviewers for their valuable feedback.

References

Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-Rater v.2.0. Journal of Technology, Learning, and Assessment, 4(3):1–30.

Ted Briscoe, John Carroll, and Rebecca Watson. 2006. The second release of the RASP system. In Proceedings of the COLING/ACL, volume 6.

Ted Briscoe, Ben Medlock, and Øistein E. Andersen. 2010. Automated assessment of ESOL free text examinations. Technical Report UCAM-CL-TR-790, University of Cambridge, Computer Laboratory, November.

Ciprian Chelba, Tomáš Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint.

Y.Y. Chen, C.L. Liu, T.H. Chang, and C.H. Lee. 2010. An unsupervised automated essay scoring system. IEEE Intelligent Systems, pages 61–67.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 160–167, July.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. March.

Scott Crossley, Laura K. Allen, Erica L. Snow, and Danielle S. McNamara. 2015. Pssst... textual features... there is more to automatic essay scoring than just you! In Proceedings of the Fifth International Conference on Learning Analytics And Knowledge, pages 203–207. ACM.

Yann N. Dauphin, Harm de Vries, and Yoshua Bengio. 2015. Equilibrated adaptive learning rates for non-convex optimization. February.

Misha Denil, Alban Demiraj, Nal Kalchbrenner, Phil Blunsom, and Nando de Freitas. 2014. Modelling, visualising and summarising documents with a single convolutional neural network. June.

Semire Dikli. 2006. An overview of automated scoring of essays. Journal of Technology, Learning, and Assessment, 5(1).

S. Elliot. 2003. Intellimetric™: From here to validity. In M.D. Shermis and J. Burstein, editors, Automated Essay Scoring: A Cross-Disciplinary Perspective, pages 71–86. Lawrence Erlbaum Associates.

Noura Farra, Swapna Somasundaran, and Jill Burstein. 2015. Scoring persuasive essays using opinions and their targets. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 64–74.

Alex Graves. 2012. Supervised Sequence Labelling with Recurrent Neural Networks. Springer Berlin Heidelberg.

Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361, February.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. June.

S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Beata Beigman Klebanov and Michael Flor. 2013. Word association profiles and their use for automated scoring of essays. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1148–1158.

Thomas K. Landauer, Darrell Laham, and Peter W. Foltz. 2003. Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M.D. Shermis and J.C. Burstein, editors, Automated Essay Scoring: A Cross-Disciplinary Perspective, pages 87–112.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. May.

Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09).

Deryle Lonsdale and D. Strong-Krause. 2003. Automated rating of ESL essays. In Proceedings of the HLT-NAACL 2003 Workshop: Building Educational Applications Using Natural Language Processing.

Danielle S. McNamara, Scott A. Crossley, Rod D. Roscoe, Laura K. Allen, and Jianmin Dai. 2015. A hierarchical classification approach to automated essay scoring. Assessing Writing, 23:35–59.

Tomáš Mikolov, Stefan Kombrink, Anoop Deoras, Lukáš Burget, and Jan Černocký. 2011. RNNLM-Recurrent neural network language modeling toolkit. In ASRU 2011 Demo Session.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Ellis B. Page. 1967. Grading essays by computer: progress report. In Proceedings of the Invitational Conference on Testing Problems, pages 87–100.

Ellis B. Page. 1968. The use of the computer in analyzing student essays. International Review of Education, 14(2):210–225, June.

E.B. Page. 2003. Project essay grade: PEG. In M.D. Shermis and J.C. Burstein, editors, Automated Essay Scoring: A Cross-Disciplinary Perspective, pages 43–54.

L.M. Rudner and Tahung Liang. 2002. Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning and Assessment, 1(2):3–21.

Keisuke Sakaguchi, Michael Heilman, and Nitin Madnani. 2015. Effective feature integration for automated short answer scoring. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications.

M. Shermis and B. Hammer. 2012. Contrasting state-of-the-art automated scoring of essays: analysis. Technical report, The University of Akron and Kaggle.

Mark D. Shermis. 2015. Contrasting state-of-the-art in the machine scoring of short-form constructed responses. Educational Assessment, 20(1):46–65.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. December.

D.D.K. Sleator and D. Temperley. 1995. Parsing English with a link grammar. In Proceedings of the 3rd International Workshop on Parsing Technologies. ACL.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. Jun.

Swapna Somasundaran, Jill Burstein, and Martin Chodorow. 2014. Lexical chaining for measuring discourse coherence quality in test-taker essays. In COLING, pages 950–961.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Sep.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Feb.

Duyu Tang. 2015. Sentiment-specific representation learning for document-level sentiment analysis. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining - WSDM '15. Association for Computing Machinery (ACM).

D.M. Williamson. 2009. A framework for implementing automated scoring. Technical report, Educational Testing Service.

Helen Yannakoudakis and Ronan Cummins. 2015. Evaluating the performance of automated text scoring systems. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics (ACL).

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages 180–189.
