A Memory-Augmented Neural Model for Automated Grading Leave Authors Anonymous Leave Authors Anonymous Leave Authors Anonymous for Submission for Submission for Submission City, Country City, Country City, Country e-mail address e-mail address e-mail address

ABSTRACT scoring (AES) systems has been used in these tests to reduce The need for automated grading tools for essay writing and the time and cost of grading essays. Moreover, as massive open-ended assignments has received increasing attention open online courses (MOOCs) become widespread and the due to the unprecedented scale of Massive Online Courses number of students enrolled in one course increases, the need (MOOCs) and the fact that more and more students are relying for grading and providing feedback on written assignments on computers to complete and submit their school work. In are ever critical. this paper, we propose an efficient memory networks-powered As part of the automated grading system, AES has employed automated grading model . The idea of our model stems from numerous efforts to improving its performance. AES uses the philosophy that with enough graded samples for each score statistical and Natural Language Processing (NLP) techniques in the rubric, such samples can be used to grade future work to automatically predict a score for an essay based on the essay that is found to be similar. For each possible score in the rubric, prompt and rubric. Essay writing is usually a common student a student response graded with the same score is collected. assessment process in schools and universities. In this task, These selected responses represent the grading criteria spec- students are required to write essays of various length, given a ified in the rubric and are stored in the memory component. prompt or essay topic. Our model learns to predict a score for an ungraded response by computing the relevance between the ungraded response Most existing AES systems are built on the basis of predefined and each selected response in memory. The evaluation was features, e.g. number of words, average word length, and conducted on the Kaggle Automated Student Assessment Prize number of spelling errors, and a machine learning algorithm (ASAP) dataset. The results show that our model achieves [4]. It is normally a heavy burden to find out effective features state-of-the-art performance in 7 out of 8 essay sets and can for AES. Moreover, the performance of the AES systems is be trained efficiently due to the simplicity of model structure. constrained by the effectiveness of the predefined features. Recently another kind of approach has emerged, employing ACM Classification Keywords neural network models to learn the features automatically in I.2.7. ARTIFICIAL INTELLIGENCE: Natural Language Pro- an end-to-end manner [29]. By this means, a direct predic- cessing tion of essay scores can be achieved without performing any feature extraction. The model based on long short-term mem- Author Keywords ory (LSTM) networks in [29] has demonstrated promise in Automated grading; neural networks; memory networks; accomplishing multiple types of automated grading tasks. word embeddings; natural language processing Neural Networks have achieved promising results on various NLP tasks, including [1,5], sentiment INTRODUCTION analysis [6], and [13, 31, 18, 28]. Neural Automated grading is a critical part of Massive Open Online Network models, in terms of NLP tasks, use word vectors to Courses (MOOCs) system and any intelligent tutoring systems learn distributed representations from text. The advantages are (ITS) at scale. Many studies have been conducted to improve that these models do not require hand-engineered features and automated grading for assignments with simple fixed-form an- can be trained to solve tasks in an end-to-end fashion. swers, short-answers [3, 15, 19, 26, 21], or long-form answers [26,2,7, 14]. Some standard tests, such as Test of English as Recent work [29] has exploited several Recurrent Neural Net- a Foreign Language (TOEFL) and Graduate Record Examina- work (RNN) models to solve AES tasks. The results show that tion (GRE), assess student writing skills. Manually grading neural-based models outperform even strong baselines. Mem- these essay will be time-consuming. Thus automated essay ory Networks (MN) [31, 18, 28] have been recently introduced to deal with complex reasoning and inferencing NLP tasks Permission to make digital or hard copies of all or part of this work for personal or and have been shown to outperform RNNs on some complex classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation reasoning tasks [28]. MN is a class of models which contains on the first page. Copyrights for components of this work owned by others than the an external scalable memory and a controller to read from author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and write to that memory. The notion of neural networks with and/or a fee. Request permissions from [email protected]. memory was introduced to solve complex reasoning and in- CHI’16, May 07–12, 2016, San Jose, CA, USA ferring AI-tasks which require remembering external contexts. © 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM. Some work [18, 28] has shown the success of MN on different ISBN 123-4567-24-567/08/06. . . $15.00 DOI: http://dx.doi.org/10.475/123_4 1 kinds of tasks, e.g. bAbI tasks [30], MovieQA, and WikiQA prediction directly. One research direction of this approach [18]. is to apply techniques to constructing specific answer patterns manually or to training from large To our knowledge, no study has been conducted to investigate training dataset with strong supervision support [2,3,24]. An- the feasibility and effectiveness of MN applied in automated other direction is to compare the students’ answers with a es- grading tasks. In this study, we develop a generic model for tablished standard answer with an unsupervised text-similarity such tasks using Memory Networks inspired by their capabil- approach [21]. ity to store rich representations of data and reason over that data in memory. For each essay score, we select one essay Most studies mentioned above are dealing with simple fixed- exhibiting the same score from student responses as a sample form answers or short-answers assignments. Some complex for that grade. All collected sample responses are loaded into assignments have long form answer instead of short, simple the memory of the model. The model is trained with the rest one. Essay writing with a given topic is a typical assignment of student responses in a manner on these with long form answers and AES has become one important data to compute the relevance between the representation of research branch of automated grading system. an ungraded response and that of each sample. The intuition is that as a part of a scoring rubric, a number of sample re- AES is generally treated as a machine learning problem. We sponses of variable quality are usually provided to students can group the existing AES solutions from different points of view. Most developed AES system is based on a number and graders to help them better understand the rubric. These of predefined features. These features include essay length, collected responses are characterized with expectations of number of words, lexicon and grammar, syntactic features, quality described in the rubric. The model is expected to learn readability, text coherence, essay organization, and so on [4]. the grading criteria from these responses. We evaluate our model on a publicly available essay grading data set from the Recently, there emerges another trial to treat the whole essays Kaggle Automated Student Assessment Prize (ASAP) compe- as inputs and learn the features automatically in an end-to-end manner [29]. Without pres-working on features extraction, tition (https://www.kaggle.com/c/asap-aes). Our experiments show that our model achieves state-of-the-art results on this work burden was lightened. Moreover, the predicting accurate dataset and training of the model is found to be efficient and is improved by removing the dependency of effectiveness of cost-effective. predefined features. The rest of the paper is organized as follows. Section 2 gives Based on learning techniques utilized in existing solutions, we an overview of related work in this research area. Section divide them into three categories: regression based approach, 3 provides detailed information of our model. Section 4 de- classification based approach and preference ranking based ap- scribes the ASAP dataset and evaluation metrics used to test proach. PEG-system and E-rater are two examples that belong our framework. Furthermore, it contains the details of our im- to regression based approach. Specifically, when the scores plementation and experimental setup to help other researchers range of the essays is wide, the regression based approach is replicate our work. In section 5, we present the results of our normally adopted since it treats the essay score as a continuous value. framework and compare them with other models. Finally, we discuss the results and conclude the paper. Besides essay writing, some complex assignments such as medicinal assignments utilized regression model as well [9]. RELATED WORK Some work such as [27, 14] treated the AES task as a classifi- cation problem. Each possible score is converted into a class Automated Grading label. By using classic classification algorithms, AES system MOOCs were introduced in 2008 and become more popular predicts which class an essay should belongs to. Since it treats recently. Most MOOCs systems provide automated grading each score as a class label, this kind of approach is not suitable as their important features to prove the efficiency of their in- for a very large range of scores. Recently preference ranking teraction with massive number of online users. Some specific based approach was also proposed by [33]. assignment types have been adopted for automated grading since the correct answers of these kinds of assignments have According to the prompts or essay topics the AES system deals some simple fixed-forms, such as multi-choice questions. Pro- with, the existing solutions can be divided into two groups: gramming assignments are the represents of these kinds of prompt-specific and generic. The prompt-specific approach assignments with simple form answer such as âAIJyes⢠A˘ I˙ or train the AES system with essays from one specific topic. âAIJno⢠A˘ I[˙ 8, 11]. Not satisfied with providing answers for This kind of AES system normally has excellent performance one specific assignment, more efforts have been devoted to on the specific topic it was trained. Most of existing works providing feedback on many different assignments according belongs to this prompt-specific approach. Generic approach to the shared features of the programming codes [22, 25]. train the AES system with essays from different prompts. As an example, the work of [4] proposed a domain adaptation However, many assignment types cannot be responded well technique which is based on Bayesian linear ridge regression, only with simple feedback. Some studies have been con- to achieve a generic prompt adaption AES system. This kind ducted with the attempt to fixing this problem by using semi- of approach normally neglects the prompt related features but automatic grading approach. This kind of approach aims to focus on writing quality. optimize the collaboration between humans and machines and provide short-answers [20,3]. Another approach is to provide

2 Memory Networks is the dimension of word vector. PE is a simple and efficient Memory Networks (MN) [31] and Neural Turing Machines way to represent a response, and does not need to learn extra (NTM) [10] are two classes of neural networks models with parameters. It has been used in other tasks [32]. external memory. MN store all information (e.g. knowledge Alternative way to represent a response is to feed each word base, background context) into external memory, assign a rel- vector from a response into Recurrent Neural Networks (RNN) evance probability to each memory slot using content-based [29]. Compared to traditional forward neural networks, hidden addressing schemes, and read contents from each memory slot states of RNN are able to retain the sequential information. By by taking the their weighted sum with relevance probabili- feeding a response into RNN, all the information which are ties. End-to-End Memory Networks (MemN2N) [28] can be useful for the grading ideally should be stored in the hidden trained end-to-end compared to MN, and hence require less su- states. Instead of taking the last hidden state as the essay pervision. Key-value Memory Networks [18] have a key-value representation, it is recommended to calculate the mean of all paired memory and is built upon MemN2N. Key-value paired hidden states to retrieve the representation for a long response. structure in memory is a generalized way of storing content in memory. The contents in key memory are used to calculate the relevance probabilities. The contents in value memory are Memory Addressing read into the model to help make the final prediction. After generating the representation of the responses, we select a sample from student response for every possible score, which NTM form another family of neural networks models with is graded with the same score. The selected samples work as external memory. The NTM controller uses both content and a representation of the criteria in the rubric for all possible location-based mechanism to access the memory. On the scores. Expert knowledge can be used here to choose most other hand, MN only uses content-based mechanism. The representative sample for each score or even generate a number fundamental difference between these two models is the MN of ideal samples. The motivation is that the model is highly do not have a mechanism to change the content in memory, likely to distinguish the difference within the criteria for each while the NTM can modify the content of the memory in each score with these representative samples. For our experiment, episode. This leads to the fact that MN is easier to be trained we randomly pick a sample from student responses for each in practice. score, which is graded with that score.

MODEL All sampled responses are loaded into the memory as an array An illustration of our model is given in Figure1, which is of vectors m1,m2,,mh, where h is the total number of sampled inspired by the work of memory networks applied in question essays. An ungraded response is denoted as x. The basic idea answering [18, 28]. Our model is comprised of four layers: of memory addressing is that it assigns a weight/importance input representation layer, memory addressing layer, memory to each sampled response mi by calculating a dot product reading layer, and output layer. Input representation layer is between x and mi followed by a softmax. T T responsible for generating a vector representation for a student pi = So ftmax(xA · miB ) (1) response. Memory addressing layer loads selected samples yi y j of student responses to memory, and assigns a weight to each where So ftmax(yi) = e /∑ j e , A is a k × d matrix and so is memory piece. Afterward memory reading layer gathers the B. Defined in this way p is a weight vector over all sampled content from memory by taking weighted sum of each memory responses. A and B are learned matrices used to transfer the piece based on the weights calculated from previous layer, and response representation to a d-dimensional features space. The produces a resulting state. Finally the output layer makes the intuiation is that the responses with the same grade are highly prediction on the basis of the resulting state. Neural networks likely to have the similar representation in the feature space. models are usually featured with multiple computational layers to learn a more abstract representation of the input. Our model Memory Reading is extended to have the structure of multiple layers (hops) by After weight vector p is calculated, the output of the memory stacking memory addressing layer and memory reading layer is computed as a weighted sum of each piece of memory in m: repeatedly. T o = ∑ pimiC (2) Input Representation i Each student response is represented as a vector of dimension where C is a k × d matrix used to transfer the response rep- din our model. Given a student response x = {x1,x2,x3,...,xn}, resentation to the feature space. The k × d matrix C may be where n is the length of the response, we map each word into identical to A, but from our experiment, we found that training a word vector wi = Wxi. All word vectors come from a word a separate C leads to a better performance. From the equation, embedding matrix W ∈ Rd×V , where d is the dimension of we can see that weight vector p controls the amount of content word vector and V is the vocabulary size. To represent an es- that is read from each memory piece. say in a vector, we selected position encoding (PE) described in [28], By the scheme of PE, the vector representation of a re- Multiple Hops sponse is calculated by m = ∑ j l j ·Wxi j, where · is an element- The success of neural networks is due to its ability of learn- wise multiplication. l j is a column vector with the structure ing multiple layers of neurons and each layer can transform lk j = (1 − j/J) − (k/d)(1 − 2 j/J) (assuming 1-based index- the representation at previous level into a higher level of ab- ing), where J is the total number of words in the response, d stract representation. Inspired by this idea, we stack multiple

3 Figure 1. An illustration of memory networks for AES. The score range is 0 - 3. For each score, only one sample with the same score is selected from student responses. There are 4 samples in total in memory. Input representation layer is not included. memory addressing step and memory reading step together to gradient descent by minimizing a standard cross entropy loss handle multiple hops operations. between the predicted scores ˆ and the actual score s. After receiving the output o from equation2, the ungraded EXPERIMENTAL SETUP response u is updated with: Dataset u2 = Relu(R1(u + o)) (3) Dataset used in this study comes from Kaggle Automated T Student Assessment Prize (ASAP) competition sponsored by where R1 is a k × k matrix, u = xA and Relu(y) = max(0,y). Then memory addressing step and reading memory step are William and Flora Hewlett Foundation (Hewlett). There are 8 sets of essays and each set is generated from a single prompt. repeated, using a different matrix R j on each hop j. The memory addressing step is modified accordingly to use the All responses collected in the dataset were written by students updated representation of the ungraded response. ranging from grade 7 to grade 10. Score range varies on essay sets. All essays were graded by at least 2 human graders. The pi = So ftmax(u j · miB) (4) average length of the essays differs for each essay set, ranging from 150 words to 650 words. Selected details for each essay set is shown in Table1. Output Layer After a fixed number H hops, the resulting state uH is used to predict a final score over the possible scores: Evaluation Metric Quadratic weighted Kappa (QWK) is used to measure the sˆ = So ftmax(uHW + b) (5) agreement between the human grader and the model. We choose to use this metric because it is the official evaluation where W is k × r matrix, r is the number of possible scores metric of the ASAP competition. Other work such as [4, 29, and b is the bias value. Note that the number of output nodes 24] that uses the ASAP dataset also uses this evaluation metric. equals to the length of score range. We calculate a distribution QWK is calculated using over all possible scores and select most probable score as the ∑ wi, jOi, j prediction. k = 1 − i, j (6) ∑ wi, jEi, j The whole network is trained in end-to-end fashion without i, j any hand-engineered features, and the matrices A,B,C,W and where matrices O, w and E are the matrices of observed R1,...,RH are learned through backpropagation and stochastic scores, weights, and expected scores respectively. Matrix

4 Set Grade level # Essays Avg len Max len Min score Max score Mean score 1 8 1,783 350 911 2 12 8 2 10 1,800 350 118 1 6 3 3 10 1,726 150 395 0 3 1 4 10 1,772 150 383 0 3 1 5 8 1,805 150 452 0 4 2 6 10 1,800 150 489 0 4 2 7 7 1,569 250 659 0 30 16 8 10 723 650 983 0 60 36 Table 1. Selected Details of ASAP dataset

Oi, j corresponds to the number of student responses that re- ceive a score i by the first grader and a score j by the second grader (the model in our experiment). The weight matrix are 2 2 wi, j = (i − j) /(N − 1) , where N is the number of possible scores. Matrix E is calculated by taking the outer product between the score vectors of the two graders, which are then normalized to have the same sum as O.

Implementation Details The model was implemented using Tensorflow framework [16]. We used Adam stochastic gradient descent [12] for optimizing the learned parameters. The learning rate was set to 0.002 and batch size for each iteration to 32 for all models. As final Figure 2. An illustration of baseline LSTM model for AES prediction layer, we used a fully connected layer on top of output from memory reading layer with a softmax activation function. The model learned the parameters by minimizing a standard cross-entropy loss between predicted score and the The reason we use this system as baseline is that it achieved correct score. best QWK scores among all open-source systems participated For regularization we used L2 loss on all learned parameters in ASAP competition. [31] described a set of reliable features with lambda set to 0.3 and limited the norm of the gradients and reported the results of two models using these features: to be below 10. Moreover, we added gradient noise sampled support vector regression (SVR) and Bayesian linear ridge from a Gaussian distribution with mean 0 and variance 0.001 regression (BLRR). when training the memory networks. [29] examined several neural networks models, e.g. RNN We used the publicly available pre-trained Glove word embed- and Convolutional Neural Networks (CNN), on ASAP dataset. dings [23], which was trained on 42 billion tokens of web data, In their experiments, Long Short Term Memory networks from Common Crawl (http://commoncrawl.org/). The dimen- (LSTM) [36], a variant of RNN, achieved the best performance. sion of each word vector is 300. [17] is another LSTM is designed to have three gates in each hidden node: popular algorithm and pre-trained word em- input gate, forget gate, and output gate. By controlling these beddings are also publicly available from this algorithm. As three gates, LSMT has the capability of attaining long-term results shown in [23], Glove outperforms word2vec on word dependencies. The structure of the LSTM model described in analogy, word similarity, and named entity recognition tasks. [10] is presented in Figure2. 5-fold cross validation was used to evaluate our model. For To verify the efficacy of GloVe word embeddings and external each fold, the data was split into two parts: 80% of the data memory, we developed a simple multi-layer forward neural as the training data and 20% as the testing data. The sampled networks (FNN) model, which is similar to our model with response for each score is selected from the training data. A respect to the model structure, but without an external mem- model was trained on each essay set due to the fact that score ory. We refer this baseline model as FNN for the rest of paper range varies among 8 essay sets. We trained each model for for convenience. As shown in Figure3, each word of a stu- 200 epochs using batch gradient descent. dent response is first converted to a continuous vector using GloVe word embeddings. The vector representation for the Baselines response is obtained by applying PE on all word vectors from In [29], their system are compared with Enhanced AI Scoring the response. Afterward the representation is fed into 4 hidden Engine (EASE), an open-source AES system, to demonstrate layers, each of which has 100 hidden nodes. Apply a softmax the improvements on performance. EASE, like traditional operation on the resulting states of last hidden layer at output NLP techniques, requires fine-grained hand-engineered fea- layer to predict the final score. The model is also trained using tures and builds a regression model on top of these features. Adam Optimizer by minimising the standard cross entropy

5 is even comparable with the complex model (LSTM). This proves that using Glove word embeddings with PE to represent a student response is able to capture important features useful for grading the response. The effectiveness of the external memory is proved by the fact that MN accomplishes better performance on 7 sets (set 4 is equal) than FNN does. The comparison between FNN and other models indicates that representing a student response using GloVe with PE and adding external memory are two key factors which may lead Figure 3. An illustration of baseline FNN. Use GloVe with PE to repre- to the good performance on ASAP dataset. sent a student response. The representation is fed into 4-layer networks In ASAP dataset, two human graders are assigned to each and each layer has 100 hidden nodes. student response and each grader gives a score separately. The final gold-standard score for each response is calculated based on these two scores. In Column Human of Table2, between sˆ and truth score s. FNN is properly defined by the we calculated the QWK scores between these two graders to equations below: measure the agreement between two graders. T h0 = Relu(A x) (7) Time Cost hi = Relu(Rihi−1), f or i ≥ 1 (8) To study the time efficiency of our model, we recorded the average training time for each epoch on three models: FNN, sˆ = So ftmax(hHW) (9) MN, and LSTM. The number of hidden nodes in LSTM is 100, where x is the representation generated by GloVe with PE for which is the the same number of hidden nodes in FNN and a student response. hi is the output of hidden layer i. H is MN. However, in practice the number is usually set to a larger the total number of hidden layers. A, Ri ,and W are weight number, e.g. 300 or 500. All these models are implemented matrices. The bias vectors are omitted in the equations. in Tensorflow, trained with same hyperparameters and run on the same GPU (GTX 1080) server. When calculating the RESULTS average time cost, we do not include the time for loading In this section, we describe the results of our experiments on GloVe word embeddings and ASAP dataset, and the time for ASAP dataset and compare these results with baselines men- building model before the data is fed into it. tioned above. Column MN of Table2 presents the QWK scores of our model. Column EASE (SVR) and column Table3 presents the average training time of each epoch on 8 EASE(BLRR) contain the results from EASE with two differ- essay sets. It is clear that the LSTM model is computationally ent regression methods. We also compare our model to other expensive and requires more computational resource. The neural models in [29] and the best results from [29] is listed in complex calculation in each hidden node and long sequence column LSTM+CNN of Table2. Note that their best results of the input slows down the training process. On the other reported in the paper are obtained by ensembling results from hand, MN is 9 times faster than LSTM since the computation 10 runs of LSTM and 10 runs of CNN. However, in our exper- of GloVe with PE is a simple element-wise sum and MN is iment, the results are recorded from a single run of a single insensitive to the length of a response. FNN is the fastest since model after optimizing the hyperparameters. This is not fair the structure of FNN is the simplest. Unlike MN, FNN does to compare their best results with ours directly. Therefore we not need to loop through each memory piece to measure the also pick the best performance achieved by a single model relevance of two student responses at training time. from their paper and list in Column LSTM of Table2. In DISCUSSION AND CONCLUSION their setup, the number of hidden nodes in LSTM is 300 and In this study, we develop a generic model for automated grad- pre-trained word embeddings released by [34] is used. ing tasks using memory networks and word embeddings. To As indicated in Table2, our model outperforms in 7 out of 8 our best knowledge this is the first study that memory networks sets (except for set 7) and improves the average QWK score are applied for this kind of task. Our model is tested on ASAP by 4.0% compared to the baseline LSTM. Even compared to dataset and achieves state-of-the-art performance in 7 out of 8 their best ensembled model (LSTM+CNN), our model still essay sets. Similar to other neural networks models for AES, achieved better performance in 7 essay sets (except for essay our model can be trained in an end-to-end fashion and does 7). As expected, our model surpasses EASE in all 8 sets and not require any hand-engineered features. Compared to RNN, improves average QWK score by 10%. CNN, using GloVe word embeddings with PE to represent a student response makes our model simple and cost-effective. The results from the FNN model mentioned above is presented Adding external memory improves the performance over FNN in column FNN of Table2. In our experiments, FNN has 4 model, which means our model is able to take advantage of hidden layers and each layer has 100 hidden nodes, whose sampled responses stored in the external memory. structure is similar to that of our model except that the external memory is removed. When comparing these results to the best Our model can be generalized to automatically grade assign- results from EASE, we find that this basic model outperforms ments from other subjects. As shown above, there are two EASE in 7 out of 8 sets of essays (except for essay set 1) and key factors to the performance: reliable representation and

6 Set MN FNN EASE(SVR) EASE(BLRR) LSTM LSTM+CNN Human 1 0.83 0.75 0.78 0.76 0.78 0.82 0.72 2 0.72 0.7 0.62 0.61 0.69 0.69 0.81 3 0.72 0.7 0.63 0.62 0.68 0.69 0.77 4 0.82 0.8 0.75 0.74 0.8 0.81 0.85 5 0.83 0.8 0.78 0.78 0.82 0.81 0.75 6 0.83 0.79 0.77 0.78 0.81 0.82 0.78 7 0.79 0.73 0.73 0.73 0.81 0.81 0.72 8 0.68 0.63 0.53 0.62 0.59 0.64 0.63 Avg 0.78 0.74 0.7 0.71 0.75 0.76 0.75 Table 2. QWK scores on ASAP dataset.

Set FNN MN LSTM REFERENCES 1 0.2 1.1 15.5 1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2 0.2 1 19.5 2014. Neural Machine Translation by Jointly Learning to 3 0.2 1 7 Align and Translate. (1 Sept. 2014). 4 0.1 1 7 2. S P Balfour. 2013. Assessing writing in MOOCs: 5 0.2 1 8 Automated essay scoring and calibrated peer review. In 6 0.2 1 8.1 Research and Practice in Assessment 8, Vol. 1. 40–48. 7 0.2 1.5 10 8 0.1 1.4 6.5 3. Michael Brooks, Sumit Basu, Charles Jacobs, and Lucy Avg 0.2 1.1 10.2 Vanderwende. 2014. Divide and correct: using clusters to grade short answers at scale. In L@S. Table 3. Average runtime (seconds) of each training epoch 4. Hongbo Chen and Ben He. 2013. Automated Essay Scoring by Maximizing Human-Machine Agreement. In memory component. In order to apply our model to other EMNLP. kinds of assignment, learning a good vector representation for 5. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, the assignment is the first step. It is analogous to how the re- Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, gression model is built for supervised NLP tasks: first extract and Yoshua Bengio. 2014. Learning Phrase numerical hand-engineered features from text and then apply Representations using RNN Encoder-Decoder for a regression model on these generated features to predict true Statistical Machine Translation. (3 June 2014). labels. In the context of neural networks, a vector is required to represent the student response. Learning the vector can be 6. Cícero Nogueira dos Santos and Maira Gatti. 2014. Deep a part of the predictive model. For example, the word embed- Convolutional Neural Networks for dings in [10] are learned from their predictive model. These of Short Texts. In COLING. 69–78. vectors can also come from pre-trained models, like GloVe 7. Rehab Duwairi. 2006. A framework for the computerized and word2vec. The next step is to select characterized samples assessment of university student essays. Comput. Human and store these samples to memory. The purpose of this step Behav. 22 (2006), 381–388. is to teach the model to understand the grading strategy and eventually associate a vector representation to a score. 8. G E Forsythe and N Wirth. 1965. Automatic Grading Programs. Commun. Commun. ACM 8 5 (1965), However, we only test our model on one dataset. There is a 275–278. need to explore our model with more data sets that contain var- ious formats of assignments to verify our model. Furthermore, 9. Chase Geigle, Chengxiang Zhai, and Duncan C Ferguson. the representation of the assignment and the mechanism for 2016. An Exploration of Automated Grading of Complex measuring relevance among assignments is still elementary. Assignments. In L@S. Future work should therefore focus on these two areas to im- 10. Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. prove the generalizability of the model. A lot of effort is still Neural Turing Machines. (20 Oct. 2014). needed to better interpret memory networks and explain the key factors behind our performance improvement. 11. Michael T Helmick. 2007. Interface-based programming assignments and automatic grading of java programs. In ACKNOWLEDGMENTS ITiCSE. We acknowledge funding from multiple NSF grants (ACI- 12. Diederik P Kingma and Jimmy Ba. 2014. Adam: A 1440753, DRL-1252297, DRL-1109483, DRL-1316736 & Method for Stochastic Optimization. CoRR DRL-1031398), the U.S. Department of Education (IES abs/1412.6980 (2014). R305A120125 & R305C100024 and GAANN), the ONR, and the Gates Foundation.

7 13. Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, 23. Jeffrey Pennington, Richard Socher, and Christopher D James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Manning. 2014. Glove: Global Vectors for Word Paulus, and Richard Socher. 2016. Ask Me Anything: Representation. In EMNLP, Vol. 14. 1532–1543. Dynamic Memory Networks for Natural Language 24. Peter Phandi, Kian Ming Adam Chai, and Hwee Tou Ng. Processing. In ICML. 2015. Flexible Domain Adaptation for Automated Essay 14. Leah S Larkey. 1998. Automatic Essay Grading Using Scoring Using Correlated . In EMNLP. Text Categorization Techniques. In SIGIR. 25. Chris Piech, Jonathan Huang, Andy Nguyen, Mike 15. Claudia Leacock and Martin Chodorow. 2003. C-rater: Phulsuksombati, Mehran Sahami, and Leonidas J Guibas. Automated Scoring of Short-Answer Questions. Comput. 2015. Learning Program Embeddings to Propagate Hum. 37 (2003), 389–405. Feedback on Student Code. CoRR abs/1505.05969 16. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene (2015). Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, 26. Carolyn Penstein Rosé, Antonio Roque, Dumisizwe Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Bhembe, and Kurt VanLehn. 2003. A Hybrid Text Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Classification Approach For Analysis Of Student Essays. Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan 27. Lawrence M Rudner and Tahung Liang. 2002. Automated Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Essay Scoring Using Bayes’ Theorem. The Journal of Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Technology, Learning and Assessment 1, 2 (1 June 2002). Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent 28. Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Rob Fergus. 2015. End-To-End Memory Networks. In Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Advances in Neural Information Processing Systems 28, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: C Cortes, N D Lawrence, D D Lee, M Sugiyama, and Large-Scale Machine Learning on Heterogeneous R Garnett (Eds.). Curran Associates, Inc., 2440–2448. Systems. (2015). 29. Kaveh Taghipour and Hwee Tou Ng. 2016. A Neural 17. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Approach to Automated Essay Scoring. In EMNLP. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their 30. Jason Weston, Antoine Bordes, Sumit Chopra, Compositionality. In Advances in Neural Information Alexander M Rush, Bart van Merriënboer, Armand Processing Systems 26, C J C Burges, L Bottou, Joulin, and Tomas Mikolov. 2015. Towards AI-Complete M Welling, Z Ghahramani, and K Q Weinberger (Eds.). Question Answering: A Set of Prerequisite Toy Tasks. Curran Associates, Inc., 3111–3119. (19 Feb. 2015). 18. Alexander Miller, Adam Fisch, Jesse Dodge, 31. Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Amir-Hossein Karimi, Antoine Bordes, and Jason Memory Networks. CoRR abs/1410.3916 (2014). Weston. 2016. Key-Value Memory Networks for Directly Reading Documents. CoRR abs/1606.03126 (2016). 32. Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic Memory Networks for Visual and Textual 19. Tom Mitchell, Terry Russell, Peter Broomhead, and Question Answering. (4 March 2016). Nicola Aldridge. 2002. Towards Robust Computerised Marking of Free-text Responses towards Robust 33. Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. Computerised Marking of Free-text Responses. 2011. A New Dataset and Method for Automatically Grading ESOL Texts. In Proceedings of the 49th Annual 20. P Mitros, V Paruchuri, J Rogosic, and others. 2013. An Meeting of the Association for Computational Linguistics: integrated framework for the grading of freeform Human Language Technologies - Volume 1 (HLT ’11). responses. The Sixth Conference of (2013). Association for Computational Linguistics, Stroudsburg, 21. Michael Mohler and Rada Mihalcea. 2009. Text-to-Text PA, USA, 180–189. for Automatic Short Answer 34. Will Y Zou, Richard Socher, Daniel M Cer, and Grading. In EACL. Christopher D Manning. 2013. Bilingual Word 22. Andy Nguyen, Chris Piech, Jonathan Huang, and Embeddings for Phrase-Based Machine Translation. In Leonidas J Guibas. 2014. Codewebs: scalable homework EMNLP. search for massive open online programming courses. In WWW.

8