Automatic Essay Scoring of Swedish Essays Using Neural Networks
By Mathias Lilja
Department of Statistics
Uppsala University
Supervisor: Patrik Andersson
2018

Contents

1 Introduction
2 Background
  2.1 Automated Essay Scoring
  2.2 Existing Automatic Essay Scoring
  2.3 Evaluation of Automatic Essay Scoring
  2.4 Automatic Essay Scoring in Swedish
  2.5 A neural approach to Automatic Essay Scoring
3 Method
  3.1 Data
  3.2 Cross-validation
  3.3 Neural Networks
    3.3.1 Feedforward Network
    3.3.2 Loss function
    3.3.3 Neural network training
    3.3.4 Overfitting
    3.3.5 Recurrent Neural Network
    3.3.6 Pooling
  3.4 Working with textual data
    3.4.1 One-hot word vectors and word embeddings
    3.4.2 Pre-trained word embeddings
  3.5 Setup
4 Results
5 Discussion
6 Conclusion
7 References

Abstract

We propose a neural network-based system for automatically grading essays written in Swedish. Previous systems either rely on laboriously crafted features extracted by human experts or are limited to essays written in English. By using different variations of Long Short-Term Memory (LSTM) networks, our system automatically learns the relation between Swedish high-school essays and their assigned scores. Using all of the intermediate states from the LSTM network proved to be crucial in order to understand the essays. Furthermore, we evaluate different ways of representing words as dense vectors, which ultimately has a substantial effect on the overall performance. We compare our results to those achieved by the first and previously only automatic essay scoring system designed for the Swedish language. Although no state-of-the-art performance is reached, indications of the potential of a neural-based grading system are found.

1 Introduction

Grading student essays is an important yet potentially hard and slow process. Traditionally, the grading is carried out by a human rater who more often than not also works as a professional teacher. Every hour spent on grading is time perhaps better spent on teaching and interacting with students. The rater is also vulnerable to subjective bias, generated by factors such as economic incentives, personal opinion and general state of mind.

Swedish national exams have traditionally been graded by teachers employed at the same school as the students taking the exam. Furthermore, the exams are not anonymous, exposing the human rater to even more sources of bias. The grade from the national exam may not necessarily be used as the final course grade, but it could certainly have a substantial influence on it. Fortunately, recent decisions made by the Swedish government together with the Swedish National Agency for Education have initiated an effort towards the digitization of the national exams (Skolverket 2018). Along with it, there is also an intention of making the exams anonymous as well as switching to external raters.

Automated Essay Scoring (AES) refers to the statistical and natural language processing (NLP) techniques used to grade student essays without the direct involvement of humans. Most existing AES systems are constructed and used on English texts. To our knowledge, only one previous study has been published where an AES system is implemented on Swedish essays (Östling et al. 2013). This study uses a linear classifier with manual feature engineering. Since then, recent advancements within neural networks have allowed for the creation of new state-of-the-art AES systems, pushing the precision of AES closer to perfection. All things considered, there is a lot of potential gain from using AES for the grading of the Swedish national exams, especially at a time when the Swedish school system is being modernized. Therefore, the intention of this study is to make use of the latest findings within the AES research area and to create, as well as evaluate, a neural network-based AES system on high-school essays written in Swedish. Hopefully, this will reveal some of the advantages of, as well as the difficulties with, a potential implementation of an AES system in the Swedish school system.

2 Background

2.1 Automated Essay Scoring

Automated Essay Scoring (AES) is defined as the computer technology that evaluates and scores written prose (Shermis and Barrera 2002). AES systems are mainly developed to assist teachers in low-stakes classroom assessment, and authorities as well as testing companies in high-stakes assessments, such as the Test of English as a Foreign Language (TOEFL iBT) and Graduate Record Examination (GRE) tests. The main purpose of AES is to help overcome issues of time, cost, reliability and generalizability present in writing assessments (Dikli 2006).

Most AES systems have implicitly or explicitly treated AES as a supervised text classification problem, utilizing a number of techniques within the fields of natural language processing and machine learning. Usually, an AES system works as follows. A collection of student texts in plain text form (also known as a text corpus) is inputted into the AES software. The first stage of the AES is to preprocess the texts, that is, to make the texts readable by the computer and useful for further analysis. Basic examples of this include stripping the texts of whitespace and removing certain characters such as punctuation, both of which are examples of tokenization, the task of splitting text sequences into pieces referred to as tokens. Other common methods employed in the preprocessing stage are normalization, stemming and lemmatization (Schütze, Manning, and Raghavan 2008).
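As an illustration of this preprocessing stage, the sketch below combines lowercasing (a simple form of normalization) with punctuation stripping and whitespace tokenization. It is a minimal stand-in, not the exact pipeline used later in this thesis:

```python
import re

def tokenize(text):
    """Minimal preprocessing: lowercase (normalization), strip
    punctuation, and split on whitespace (tokenization)."""
    text = text.lower()                   # normalization
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation; \w keeps å, ä, ö
    return text.split()                   # whitespace tokenization

print(tokenize("Detta är en uppsats. Den är kort!"))
# ['detta', 'är', 'en', 'uppsats', 'den', 'är', 'kort']
```

Stemming or lemmatization would then be applied to the resulting tokens, for example mapping inflected forms such as "uppsatser" and "uppsatsen" to the common base form "uppsats".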
The second stage typically involves feature extraction, which is the process that maps the text sequences to a vector of measurable quantities; the most common example is the occurrence count (frequency) of each unique word in the texts (Goldberg 2017). In most existing AES models, the feature extraction is a manual process performed by the researcher. It is considered the most difficult part of the construction of an AES system, and it is a real challenge for humans to take into account all the factors affecting the grade. Furthermore, the effectiveness of the AES system is constrained by the manually chosen features (Chen and He 2013).

Lastly, a machine learning algorithm is implemented, using the predefined features as input. The type of model used in this last step is usually divided into three categories: regression, classification and preference ranking (Dong, Zhang, and Yang 2017). In the regression approach, often applied when the score range of the essays is wide, each score is treated as a continuous value, and regression models such as linear regression and Bayesian linear ridge regression are used (Attali and Burstein 2004; Phandi, Chai, and Ng 2015). In the classification approach, each score is instead seen as a separate class, and common classification techniques such as support vector machines (SVM) and naive Bayes (NB) are applied (Östling et al. 2013; Larkey 1998). Preference ranking has recently been proposed, where the scoring task is seen as a ranking problem (Yannakoudakis, Briscoe, and Medlock 2011).

Formally, an AES system is trained to minimize the deviation between the system-generated scores and the human scores, given a set of training data. This can be formulated as

$$\min_{\theta} \sum_{i=1}^{N} f(y_i^*, y_i) \quad \text{s.t.} \quad y_i = g_\theta(t_i), \; i = 1, 2, \ldots, N, \tag{1}$$

where $N$ is the number of texts in the training set, and $y_i^*$ and $y_i$ are the human score and the predicted AES system score of the $i$-th essay, respectively. The function $f$ measures the difference between $y_i^*$ and $y_i$, $t_i$ is the feature representation of the $i$-th essay, and $g_\theta$ maps the features $t_i$ to the score $y_i$ (Dong, Zhang, and Yang 2017).
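As a concrete toy instance of (1), the sketch below assumes bag-of-words feature vectors $t_i$, a linear model $g_\theta(t_i) = t_i^\top \theta$ and squared-error loss $f(y^*, y) = (y^* - y)^2$, minimized by gradient descent. The model class and loss are illustrative assumptions, not the particular choices made in this thesis:

```python
import numpy as np

# Toy corpus: tokenized essays and their human scores y_i*.
essays = [["bra", "text", "bra"],
          ["kort", "text"],
          ["mycket", "bra", "och", "lång", "text"]]
human_scores = np.array([3.0, 1.0, 4.0])

# Feature extraction: t_i = word-frequency (bag-of-words) vector.
vocab = sorted({w for essay in essays for w in essay})
T = np.array([[essay.count(w) for w in vocab] for essay in essays], dtype=float)

# Minimize (1) with g_theta(t) = t . theta and f(y*, y) = (y* - y)^2.
theta = np.zeros(len(vocab))
for _ in range(500):
    residuals = T @ theta - human_scores      # y_i - y_i*
    grad = 2 * T.T @ residuals / len(essays)  # gradient of the (scaled) summed loss
    theta -= 0.1 * grad                       # gradient-descent step

print(np.round(T @ theta, 2))  # predictions approach [3.0, 1.0, 4.0]
```

The three categories above differ mainly in the choice of $f$ and the interpretation of $y_i$: a classifier would replace the squared error with a classification loss over discrete score classes, while preference ranking compares pairs of essays rather than predicting scores directly.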
2.2 Existing Automatic Essay Scoring

The first AES system is commonly credited to the Project Essay Grader (PEG), developed by Ellis Page in 1966 (Page 1967; Page 1968). In similarity to most AES systems, PEG first uses a training sample to form its weights and then inputs new and unseen data in the testing stage. Its primary method is to use the correlation between different intrinsic variables such as fluency, grammar and punctuation. Criticism of PEG has pointed out that the system mainly captures the surface structure of the texts and neglects the deeper semantic aspects (Chung and O'Neil Jr 1997). It was also possible to trick the PEG system by writing longer essays (Dikli 2006).

Another AES system which is also in use today is the Intelligent Essay Assessor (IEA), produced by Pearson Knowledge Analysis Technologies. The IEA scoring system is implemented and used in the high-stakes exam Pearson Test of English. Unlike other AES systems, the IEA is not used in combination with human raters but is instead the sole rater (Williamson, Xi, and Breyer 2012). In comparison to PEG, the IEA uses a semantic text-analysis technique referred to as latent semantic analysis (Lemaire and Dessus 2001).

Perhaps the best-known AES system is the electronic essay rater (e-rater; Burstein and Marcu 2000; Burstein 2003; Attali and Burstein 2004), which is developed and maintained by the Educational Testing Service (ETS). The scoring system of the e-rater uses regression in combination with predefined features representing different aspects of writing and content quality (Williamson, Xi, and Breyer 2012). The number and type of extracted features have changed during the lifetime of the e-rater. However, ETS assures that these features are not only predictive of raters' scores, but also have a meaningful correspondence to the features that raters are instructed to consider when they award scores (ETS 2018a).
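To make the feature-based regression design shared by PEG and the e-rater concrete, here is a hedged sketch in the spirit of (1). The surface features below (essay length, average word length, punctuation count) are illustrative stand-ins for Page's "intrinsic variables"; they are not the actual, proprietary feature sets of either system:

```python
import numpy as np

def surface_features(text):
    """Crude surface proxies for fluency and punctuation use.
    Illustrative only; not PEG's or the e-rater's real features."""
    words = text.split()
    return np.array([
        len(words),                                       # essay length (fluency proxy)
        sum(len(w) for w in words) / max(len(words), 1),  # average word length
        text.count(",") + text.count("."),                # punctuation count
    ])

texts = ["En kort text.",
         "En något längre text, med fler ord och tecken.",
         "En ännu längre uppsats, som använder fler ord och mer interpunktion."]
scores = np.array([1.0, 2.0, 3.0])  # hypothetical human scores

# Least-squares regression from surface features (plus intercept) to scores.
X = np.column_stack([np.ones(len(texts)),
                     np.stack([surface_features(t) for t in texts])])
beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
print(np.round(X @ beta, 2))  # fitted scores

# Such a model sees only surface structure, so padding an essay with extra
# words raises its predicted score, the weakness PEG's critics pointed out.
```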