
Automatic Essay Scoring for Low Level Learners of English as a Second Language

Andrew Mellor

Submitted to the University of Wales in fulfillment of the requirements for the Degree of Doctor of Philosophy

Swansea University

2010

Abstract

This thesis investigates the automatic assessment of essays written by Japanese low level learners of English as a second language. A number of essay features are investigated for their ability to predict human assessments of quality. These features include unique lexical signatures (Meara, Jacobs & Rodgers, 2002), distinctiveness, essay length, various measures of lexical diversity, mean sentence length and some properties of word distributions. Findings suggest that no one feature is sufficient to account for essay quality but essay length is a strong predictor for low level learners in time constrained tasks. Combinations of several features are much more powerful in predicting quality than single features. Some simple systems incorporating some of these features are also considered. One is a two-dimensional ‘quantity/content’ model based on essay length and lexical diversity. Various measures of lexical diversity are used for the content dimension. Another system considered is a clustering algorithm based on various lexical features. A third system is a Bayesian algorithm which classifies essays according to semantic content. Finally, an alternative process based on capture-recapture analysis is also considered for special cases of assessment. One interesting finding is that although many essay features only have moderate associations with quality, extreme values at both ends of the scale are often very reliable indicators of high quality or poor quality essays. These easily identifiable high quality or low quality essays can act as training samples for classification algorithms such as Bayesian classifiers. The clustering algorithm used in this study correlated particularly strongly with human essay ratings. This suggests that multivariate statistical methods may help realise more accurate essay prediction.

Table of contents

Chapter One: Introduction 1 1.1 Aims of the thesis 1 1.2 Background 1 1.3 Identifying features 2 1.4 Developing assessment systems 3 1.5 Influence of other fields 4 1.6 Automation 4 1.7 Outline of the thesis 5

Chapter Two: Literature Review 8 2.1 Introduction 8 2.2 Page (1994) 9 2.2.1 Summary 9 2.2.2 Comments 10 2.3 Chodorow & Burstein (2004) 12 2.3.1 Summary 12 2.3.2 Comments 15 2.4 Landauer, Laham, & Foltz (2003) 18 2.4.1 Summary 18 2.4.2 Comments 20 2.5 Larsen-Freeman & Strom (1977) 22 2.5.1 Summary 22 2.5.2 Comments 24 2.6 Arnaud (1984) 26 2.6.1 Summary 26 2.6.2 Comments 28 2.7 Engber (1995) 30 2.7.1 Summary 30 2.7.2 Comments 32 2.8 Ferris (1994) 34 2.8.1 Summary 34 2.8.2 Comments 36 2.9 Laufer & Nation (1995) 38 2.9.1 Summary 38 2.9.2 Comments 41 2.10 Meara & Bell (2001) 43 2.10.1 Summary 43 2.10.2 Comments 45 2.11 Meara, Jacobs & Rodgers (2002) 47 2.11.1 Summary 47 2.11.2 Comments 49 2.12 Discussion 50 2.12.1 Features 51 2.12.1.1 Essay length 51 2.12.1.2 Lexical diversity 52 2.12.1.3 Other considerations 56 2.12.2 Automatic scoring methods 57 2.12.3 Measures from authorship attribution 58 2.13 Conclusion 59

Chapter Three: Lexical Signatures 62 3.1 Introduction 62 3.2 Lexical signatures in low level learners 62 3.2.1 Introduction 62 3.2.1.1 Aims of the experiment 62 3.2.2 Methodology 63 3.2.2.1 Participants and task 63 3.2.2.2 Measuring lexical signatures 63 3.2.2.3 Constructing an index of uniqueness 65 3.2.2.4 Measurement reliability 65 3.2.3 Results 66 3.2.3.1 Lexical signatures 66 3.2.3.2 Index of uniqueness 66 3.2.3.3 Analysis of measurement reliability 67 3.2.4 Discussion 67 3.2.5 Conclusion 68 3.3 Investigating an index of uniqueness 68 3.3.1 Introduction 68 3.3.1.1 Aims of the experiment 69 3.3.2 Methodology 69 3.3.2.1 Participants and tasks 69 3.3.2.2 Sampling of subsets 70 3.3.2.3 Quality assessments of essays 71 3.3.2.4 Lexical signature analysis 72 3.3.2.5 Index of uniqueness 72 3.3.3 Results 72 3.3.3.1 General test characteristics 72 3.3.3.2 Measurement reliability 73 3.3.3.3 Reliability over two tasks 73 3.3.3.4 Uniqueness index scores and quality assessments 74 3.3.4 Discussion 75 3.3.4.1 Problems of reliability 75 3.3.4.2 Considerations of word selection 76 3.3.5 Conclusion 78 3.4 Conclusion 78

Chapter Four: Distinctiveness 79 4.1 Introduction 79 4.2 Distinctiveness in L2 learner texts 80 4.2.1 Introduction 80 4.2.1.1 Aims of the experiment 80 4.2.2 Methodology 80 4.2.2.1 Participants and tasks 80 4.2.2.2 Calculating distinctiveness 80 4.2.3 Results 81 4.2.3.1 Distinctiveness scores 81 4.2.3.2 Eliminating inconsistent students 81 4.2.3.3 Distinctiveness and quality assessments 82 4.2.4 Discussion 85 4.2.5 Conclusion 86 4.3 Testing reliability of a measure of distinctiveness on identical tasks 86 4.3.1 Introduction 86 4.3.1.1 Aims of the experiment 86 4.3.2 Methodology 86 4.3.2.1 Participants and tasks 86 4.3.2.2 Details of the analysis 86 4.3.3 Results 87 4.3.3.1 Distinctiveness scores 87 4.3.3.2 Consistency of learners 88 4.3.4 Discussion 88 4.3.5 Conclusion 89 4.4 Discussion 89 4.4.1 A comparison of results of the two experiments 89 4.4.2 Threats to reliability and validity 90 4.4.3 Reward structure 92 4.5 Conclusion 93

Chapter Five: Exploratory Studies 95 5.1 Introduction 95 5.2 Lexical features in L2 texts of varying proficiency 95 5.2.1 Introduction 95 5.2.1.1 Aims of the experiment 96 5.2.2 Methodology 96 5.2.2.1 Participants and tasks 96 5.2.2.2 Analysis 97 5.2.3 Results 98 5.2.3.1 Mean word length and mean sentence length 98 5.2.3.2 Entropy, Yule’s K and the D estimate 98 5.2.4 Discussion 101 5.2.5 Conclusion 103 5.3 Word distribution properties of L1 and L2 texts 103 5.3.1 Introduction 103 5.3.1.1 Aims of the experiment 103 5.3.2 Methodology 104 5.3.2.1 Participants and tasks 104 5.3.2.2 Analysis 104 5.3.3 Results 105 5.3.3.1 Word frequency distribution 105 5.3.3.2 Word rank distributions 107 5.3.3.3 Word length distributions 109 5.3.4 Discussion 110 5.3.5 Conclusion 111 5.4 The type-token ratio and lexical variation 111 5.4.1 Introduction 111 5.4.1.1 Aims of the experiment 112 5.4.2 Methodology 112 5.4.2.1 Participants and tasks 112 5.4.2.2 Analysis 112 5.4.3 Results 112 5.4.3.1 Reliability over two tasks 112 5.4.3.2 Relationship to quality assessments 113 5.4.4 Discussion 114 5.4.5 Conclusion 115 5.5 LFP and P_Lex 115 5.5.1 Introduction 115 5.5.1.1 Aims of the experiment 116 5.5.2 Methodology 116 5.5.2.1 Participants and tasks 116 5.5.2.2 Analysis 116 5.5.3 Results 117 5.5.3.1 LFP 117 5.5.3.2 P_Lex 118 5.5.4 Discussion 118 5.5.5 Conclusion 119 5.6 Conclusion 120

Chapter Six: A quantity/content model 122 6.1 Introduction 122 6.1.1 Aims of the experiment 122 6.2 Methodology 122 6.2.1 Participants and tasks 122 6.2.2 Essay ratings 122 6.2.3 Analysis 123 6.2.3.1 Selection of quantity and content measures 123 6.2.3.2 Sampled features 124 6.2.3.3 Calculation of quantity and content dimensions 125 6.2.3.4 Comparison of content variables 125 6.3 Results 126 6.3.1 Basic characteristics 126 6.3.2 Quantity/content graphs 127 6.3.2.1 Sample word types TTR(100) 127 6.3.2.2 Guiraud Index 128 6.3.2.3 Yule’s K 128 6.3.2.4 Hapax(100) 129 6.3.2.5 The D estimate 130 6.3.2.6 Advanced Guiraud 130 6.3.3 Correlation analysis 131 6.3.4 Partial correlation analysis 132 6.3.5 Multiple regression analysis 133 6.3.6 Results conclusion 133 6.4 Discussion 134 6.4.1 Overview 134 6.4.2 Advanced Guiraud 134 6.4.3 Classifying essays 135 6.4.4 Linearity 135 6.5 Conclusion 136

Chapter Seven: Automatic scoring algorithms 138 7.1 Introduction 138 7.2 A clustering algorithm 138 7.2.1 Introduction 138 7.2.1.1 Aims of the experiment 138 7.2.2 Methodology 138 7.2.2.1 Participants and tasks 138 7.2.2.2 Holistic assessments 138 7.2.2.3 Analysis 139 7.2.2.3.1 Selection of features 139 7.2.2.3.2 Calculation of features 141 7.2.2.3.3 Principal component analysis 142 7.2.2.3.4 Comparison of features with principal components 143 7.2.2.3.5 Clustering 144 7.2.3 Results 145 7.2.4 Discussion 146 7.2.4.1 Reliability 146 7.2.4.2 Clustering 147 7.2.5 Conclusion 148 7.3 A Bayesian algorithm 148 7.3.1 Introduction 148 7.3.1.1 Aims of the experiment 150 7.3.2 Methodology 150 7.3.2.1 Participants and tasks 150 7.3.2.2 Analysis 150 7.3.2.2.1 Selecting the training set 151 7.3.2.2.2 Bayesian classification 152 7.3.3 Results 153 7.3.3.1 Bayesian algorithm 153 7.3.3.2 Reliability of training essays 153 7.3.4 Discussion 154 7.3.4.1 Comparison of algorithm ratings with pooled human ratings 154 7.3.4.2 Comparison of Bayesian classifier with pooled human ratings 154 7.3.4.3 Adjustments to the algorithm 155 7.3.5 Conclusion 156 7.4 Discussion 156 7.4.1 Comparison between clustering and Bayesian algorithms 156 7.4.2 Inter-algorithm reliability 158 7.4.3 Essay length 158 7.4.4 Selecting an algorithm 160 7.5 Conclusion 161

Chapter Eight: Capture-recapture analysis 163 8.1 Introduction 163 8.1.1 Background to capture-recapture analysis 163 8.1.2 Application of capture-recapture analysis to L2 lexical studies 164 8.1.3 Aims of the experiment 166 8.2 Methodology 166 8.2.1 Participants and tasks 166 8.2.2 Analysis 166 8.3 Results 167 8.4 Discussion 169 8.4.1 Sampling reliability 169 8.4.2 Minimum essay length 171 8.4.3 Using capture-recapture analysis for assessment 173 8.5 Conclusion 173

Chapter Nine: Discussion 175 9.1 Introduction 175 9.2 Lexical features 175 9.2.1 Properties of lexical features for assessment purposes 175 9.2.1.1 Identifying quality differences in essays of low level learners 175 9.2.1.2 Reliability across tasks 176 9.2.1.3 Automatic calculation 177 9.2.1.4 Validity 178 9.2.2 Length of essay 180 9.3 L2 vocabulary research 182 9.4 Automating the process 185 9.5 Limitations of the present study 187 9.6 Future research 190 9.6.1 Choice of system 191 9.6.2 Selection of variables 192 9.7 Conclusion 194

Chapter Ten: Conclusion 196 10.1 Introduction 196 10.2 Uniqueness index 196 10.3 Distinctiveness 197 10.4 Other features 198 10.5 Essay length and lexical diversity 199 10.6 Clustering algorithm 199 10.7 Bayesian algorithm 200 10.8 Capture-recapture analysis 201 10.9 Conclusion 202

Bibliography 203

Appendices 208 1.1 A list of specially-designed computer programs 208

3.1 Cartoon Task 1 210 3.2 Cartoon Task 2 211 3.3 Sample essays for Cartoon Task 1 212 3.4 Combinations 213 3.5 Sample essays for Cartoon Task 2 214 3.6 Index of uniqueness using words occurring frequently in two different tasks 215

5.1 Sample learner essays and native text for review task 216 5.2 List of function words from the British National Corpus 218

6.1 Sample essays for TV task 219

7.1 Principal components graphed against standardized variables 220 7.2 Euclidean distance 223 7.3 Kappa statistic 224 7.4 Technical problems associated with Bayesian classifiers 225

Acknowledgements

I would like to thank my supervisor, Professor Paul Meara. Without his expert guidance, advice and encouragement, I would not have been able to complete this thesis. I would also like to thank all the other members of the Swansea PhD group, past and present, who have all helped or inspired me to some extent. I was lucky enough to start and finish this PhD program at the same time as David Coulson who kept me going on more than one occasion. Eric Bray made an invaluable contribution by his expert rating of essays and his advice as someone who had been there before. I am grateful for the support of my parents, my wife, Maki, and my children, Ibuki and Yujin.

List of Tables

Table 2.1: Expected correlation of multiple judges with six judges...... 10 Table 2.2: The most common features in e-rater99...... 13 Table 2.3: The most common features in e-rater01...... 14 Table 2.4: Reliability of length models, e-rater and human rater (HR) scores...... 14 Table 2.5: Squared partial correlations of e-rater with human raters...... 14 Table 2.6: Squared partial correlations between features and human ratings...... 15 Table 2.7: Objective statistics of student essays...... 23 Table 2.8: Average holistic scores by proficiency...... 34 Table 2.9: 28 quantitative, lexical and syntactic features in the analysis...... 34 Table 2.10: Means and standard deviations for 18 features by group...... 35 Table 2.11: Results of a stepwise regression analysis...... 35 Table 2.12: Statistics for participials in Group 1...... 37 Table 2.13: Mean percentages of word families at different frequency levels...... 40 Table 2.14: Correlations between the LFP and the VLT...... 40 Table 2.15: Lexical signatures for subsets of high, medium and low frequency words...... 48 Table 2.16: Distinct and unique signatures by subset size...... 49

Table 3.1: Words selected for analysis...... 63 Table 3.2: Four examples of subsets...... 64 Table 3.3: Binary representations of essays for (came, my, sea, clothes, threw, it)...... 64 Table 3.4: Unique signatures for subset (came, my, sea, clothes, threw, it)...... 65 Table 3.5: Statistics for subsets of various size...... 66 Table 3.6: Comparison of six-word subset to the original study...... 66 Table 3.7: Summary of uniqueness index scores for 64 essays...... 67 Table 3.8: Average, highest and lowest reliability coefficients for ten iterations...... 67 Table 3.9: Possible subset combinations by sample and subset size...... 70 Table 3.10: Sampling rates by number of subsets for a six-word subset...... 70 Table 3.11: Number of subsets to achieve sampling rate of 1.68%...... 71 Table 3.12: Candidate grades for Tasks 1 & 2...... 71 Table 3.13: Thirty selected words for Tasks 1 & 2...... 72 Table 3.14: Task characteristics...... 72 Table 3.15: Reliability of basic features over the two tasks...... 73 Table 3.16: Unique lexical signatures and uniqueness index scores for both tasks...... 73 Table 3.17: Number of thirty words per essay...... 77 Table 3.18: Thirty words that appeared frequently in both tasks...... 77

Table 4.1: Distinctiveness scores and lexical features for two tasks...... 81 Table 4.2: Distinctiveness and other measures for revised analysis...... 82 Table 4.3: Holistic ratings of written tasks...... 82 Table 4.4: Distinctiveness scores with/without unique words ...... 87 Table 4.5: Total words, word types and type-token ratio...... 88 Table 4.6: Reliability for different tasks and same tasks...... 89

Table 5.1: Mean scores for MWL and MSL for each group...... 98 Table 5.2: Mean scores for entropy, Yule’s K and the D estimate for each group...... 98 Table 5.3: Top five Zipf ranked words for each text...... 108 Table 5.4: Mean word length for native and learner texts...... 110 Table 5.5: Mean, maximum and minimum values for TTR and LV...... 113 Table 5.6: Ratings of essays ordered by lexical score...... 114 Table 5.7: Not-in-lists words for Tasks 1 & 2...... 116 Table 5.8: LFPs for 14 learners on two tasks...... 117 Table 5.9: Reliability statistics for LFP over Task 1 & 2...... 118 Table 5.10: P_Lex, and difficult/easy words for Tasks 1 & 2 (T1 & T2)...... 118

Table 6.1: The number of essays for each rating category...... 123 Table 6.2: Essay characteristics of 34 learner essays...... 126 Table 6.3: Correlation of essay ratings, essay length and lexical diversity...... 131 Table 6.4: Partial correlations with essay ratings controlled for essay length...... 132 Table 6.5: R and r² values for two variable model with essay length...... 133

Table 7.1: Percentage of variance accounted for by the PCs...... 143 Table 7.2: Features and weightings used in the clustering...... 145 Table 7.3: Decision agreement (DA) and Kappa for clustering algorithm...... 146 Table 7.4: Automatic treatment of essays agreed on by raters...... 147 Table 7.5: Essays identified by the training set conditions...... 152 Table 7.6: Decision agreement and Kappa reliability for Bayesian algorithm...... 153 Table 7.7: Training set essays rated high by human raters...... 153 Table 7.8: Automatic treatment of essays agreed on by raters...... 154 Table 7.9: Bayesian classifier treatment of essays agreed on by raters...... 155 Table 7.10: Decision agreement between raters and automatic methods...... 156 Table 7.11: Classification of essays agreed on by both raters...... 157 Table 7.12: Agreed essays by raters and algorithms...... 157 Table 7.13: Agreement of two automatic algorithms...... 158 Table 7.14: Agreement of essay length with raters and algorithms...... 159

Table 8.1: Results of full capture-recapture analysis...... 167 Table 8.2: Capture-recapture analysis using words only occurring once in a task...... 167 Table 8.3: Values of N for two simulations...... 170

Table 9.1: Variables from Ferris (1994) in terms of their ease of calculation...... 178

Table A1: Thirty words appearing frequently in both tasks...... 215 Table A2: Unique lexical signatures and uniqueness index scores for both tasks...... 215 Table A3: Two rater representation...... 224 Table A4: Example two rater representation...... 224

List of Figures

Figure 2.1: A graph of a 75%-10%-10%-5% LFP 42

Figure 3.1: Uniqueness index scores for selected tasks 74

Figure 4.1: Distinctiveness for selected learners on both tasks 83 Figure 4.2: Essay length for selected learners on both tasks 84 Figure 4.3: Number of word types for selected learners on both tasks 84 Figure 4.4: The relationship between distinctiveness and quality 91 Figure 4.5: Distinctiveness weighting using reciprocal 93 Figure 4.6: Distinctiveness weighting using reciprocal square root 93

Figure 5.1: Entropy for increasing essay length for two essays 99 Figure 5.2: Yule’s K for increasing essay length for two essays 100 Figure 5.3: The D estimate for increasing essay length for two essays 101 Figure 5.4: Word frequency distribution for learner essays 105 Figure 5.5: Word frequency distribution for native samples 105 Figure 5.6: Hapax legomena for native and learner essays 106 Figure 5.7: Zipf ranked words 1 to 15 107 Figure 5.8: Deviations from typical occurrence of a and the 108 Figure 5.9: Distribution of word lengths for native and learner essays 109

Figure 6.1: Four quadrants in a standardized variable graph 126 Figure 6.2: Quantity v content using TTR(100) as content 127 Figure 6.3: Quantity v content using Guiraud Index as content 128 Figure 6.4: Quantity v content using Yule’s K as content 129 Figure 6.5: Quantity v content using Hapax(100) as content 129 Figure 6.6: Quantity v content using the D estimate as content 130 Figure 6.7: Quantity v content using an estimate of AG as content 131

Figure 7.1: Scree chart of PCs 143

Figure 8.1: Estimates for each essay on both analyses 168 Figure 8.2: Range of N estimates for various sample sizes for population size 110 171 Figure 8.3: Number of word types and essay length 171 Figure 8.4: Number of types occurring only once and essay length 172

Figure A1: Uniqueness index scores using words from both tasks 215 Figure A2: PC1 v essay length 220 Figure A3: PC2 v TTR(100) 220 Figure A4: PC3 v mean sentence length 221 Figure A5: PC4 v error estimate 221 Figure A6: PC5 v mean word length 222 Figure A7: PC6 v Hapax(100) 222

Chapter One: Introduction

1.1 Aims of the thesis

There are two main aims of this study. The first is to identify quantitative lexical features that can contribute to assessments of quality of second language (L2) learner essays within an automatic scoring system. The second aim is to evaluate some systems for assessing L2 essays incorporating some of these features. An additional, more general, aim is to contribute to L2 vocabulary research. This contribution may involve the relationship between features and quality in L2 productions, the complementary properties of various features of L2 writing and the use of statistical techniques to aid research in L2 vocabulary.

1.2 Background

There are several reasons why this study is important. Some relate to practicalities of assessment and some to more general research. Firstly, there is a need for tools to evaluate learner essays that can be used in a variety of classroom and testing situations. Automatic assessment may play a number of roles. It can be used instead of human rating, possibly for reasons of cost, time or resources. Human raters may be expensive to recruit and train. Human rating takes time. Since a lot of rating involves teachers, this often diverts time away from other activities. Also, sometimes it may be difficult to recruit human raters. Trained raters are a scarce resource. Automatic assessment may also be used to complement human raters. In some testing situations, multiple raters are deployed to ensure reliability. However, in a lot of classroom situations, a second rater is not an option. An automatic rating system could be used as a second rater to ensure reliability. It can also complement human rating when objectivity becomes an issue. Teachers sometimes find it difficult to objectively assess their own students. In these situations, an automatic system can provide a check against bias as well as boost reliability.

A second reason why this study is important is that automatic assessment enables teachers to promote written activities and assessments. Many teachers would like to include more writing in classroom and assessment situations but may be put off by the time and effort involved with assessment. Automatic assessment is also a useful technology for learners to manage their own learning. Learners could use automatic assessment applications to assess their own writing and to receive some simple feedback.

This study also makes contributions to research in a number of ways. For example, research is necessary to help understand more clearly questions relating to the assessment of essay quality and language ability through written tasks. However, human rating of essays can be a problematic area. It is notoriously unreliable and often it is not clear what aspects of an essay influence rating. Even experienced raters are sometimes unable to articulate clearly the reasons for their decisions. An automatic essay scoring system will not operate the same way as a human rater. Some features that appeal to human raters may be difficult for a computer to measure. It may be necessary to find other features of essays accessible to computers approximating features that appeal to human raters. By examining some of the essay features that vary according to quality, we may be able to shed light on some of the distinctions between good essays and poor essays. Statistical patterns can help us uncover hidden truths about language use. This is demonstrated by the success of corpus linguistics in identifying patterns of language use. Sinclair (1991) notes that analyses made possible by large corpora have shown that people’s intuitions about language do not always correspond with actual language use. Human intuitions are not necessarily false but they may represent only part of the picture. Statistical analysis may help complete this picture by identifying patterns not immediately obvious through human analysis. It is quite possible that statistical analysis can similarly highlight patterns of language use that impact on essay quality ratings.

However, although this study may shed light on some aspects of language use in written productions and on which essay features influence raters, it is important to stress that this is not the primary aim of the study. The aim is to explore the features that might help predict human ratings of essays rather than attempt to explain why essays are rated in a particular way.

A final reason why this study is important is to make a contribution to research into lexical features more generally. A lot of research has been done into various features but not so much into how they work together. Nation (2007) calls for a greater use of multiple measures. One aspect of this is recognizing the complementary nature of various features. This involves understanding more about properties of features and their strengths and weaknesses as well as how they operate with other features. This research can help in that direction.

The experimentation in this thesis is split into two parts. The first part involves identifying features associated with essay quality. The second part involves evaluating systems that could help assess essays.

1.3 Identifying features

First of all, some basic features of essays and their relation to essay quality are considered. In this study, an essay feature has a broad meaning including any property of an essay such as length or a count of particular words or classes of words. It also encompasses more complex measures such as lexical diversity. There has been much research into various features and their relationship to essay quality. This can help guide the experimentation. Some very simple features have been found to have a strong relationship with essay quality. These include the total number of words in an essay (tokens) or the number of different words (types). Similarly, many studies have considered various measures of lexical diversity such as the type-token ratio (TTR). Recently, more complex measures of lexical diversity such as the D estimate (Malvern & Richards, 1997) and Lexical Frequency Profiles (Laufer & Nation, 1995) have been developed.
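As a minimal illustration of the simplest of these features (the thesis software itself was written in Delphi; the Python sketch below is only an editorial illustration, not one of the programs used in this study), tokens, types and the type-token ratio can be computed as follows:

import re

def basic_features(essay: str) -> dict:
    """Count tokens, word types and the type-token ratio for one essay."""
    tokens = re.findall(r"[a-z']+", essay.lower())  # crude tokenizer: lower-case word strings
    types = set(tokens)
    ttr = len(types) / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "types": len(types), "ttr": ttr}

print(basic_features("The dog saw the other dog and ran to the sea"))
# -> {'tokens': 11, 'types': 8, 'ttr': 0.727...}

More complex measures such as the D estimate and Lexical Frequency Profiles build on the same underlying token and type counts.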

Features may be evaluated in their own right as indicators of essay quality or as one of a battery of measures. The latter case involves concentrating on the complementary aspects of features. This also underlines the importance of considering a variety of measures. Any assessment system is unlikely to be fixed. Each situation may vary according to the task, the students, the purpose of the assessment, assessment conditions etc. Each possible scenario may benefit from a slightly different type of assessment system including different features.

1.4 Developing assessment systems

The second part of the experimentation considers some basic systems for assessing essays. Recently, automated essay scoring has become a field of its own and a number of automated systems have been developed to grade essays of both first language (L1) students and L2 learners. Although these systems have shown some success, they tend to be geared towards large-scale testing situations and are often unsuitable for smaller scale L2 situations. There are two notable characteristics of these systems. The first is that they require training samples to develop a model to grade the essays. These training samples may consist of graded essays, model essays or expert material. The second characteristic is the large number of essays required for these systems to operate effectively. This is partly a result of the statistical procedures involved in these assessment systems and partly because of the requirement for training data. This number of essays is typically in the many hundreds or even thousands. The result is that these systems are often unwieldy and inflexible. They have been criticized for their inflexibility and their unforeseen effects on educational environments (Herrington & Moran, 2001). For example, the requirement of training samples to build a model means that, in practice, essay topics are set by the testing organization. Therefore, teachers lose control over the writing process. This requirement of training samples makes these systems difficult to use where only a small number of essays exist and entails time and planning for training samples to be produced. This often precludes these systems from being used in real time. Nevertheless, there are many things that can be learned from an examination of these existing automatic systems. However, there is a need for a more flexible form of automatic assessment that can be used for a variety of topics, with smaller numbers of essays and which can be deployed in a much quicker timeframe.

1.5 Influence of other fields

This study is based in applied linguistics. It aims to deploy a variety of linguistic features to help assess L2 written tasks. In particular, it draws heavily on research and techniques used in L2 vocabulary. Lexical features are useful because they are wide-ranging, easy to calculate and there is a large body of quantitative research dedicated to them. At the same time, this study also draws on other fields. Statistical techniques are a core ingredient of most assessment systems. Some of these statistical techniques have also been developed in state-of-the-art fields such as information retrieval and pattern recognition.

The field of authorship attribution also provides inspiration. This field involves the use of statistical analysis of linguistic features to identify authorship of disputed texts. Authorship attribution has a long history and many statistical measures and procedures have been developed in this field. The aims of the field may be different but some of the features and statistical tools may be applicable to assessment. The methodology in this field may also be instructive. Typically, absolute proof of authorship is difficult to achieve but the aim is to produce a body of overwhelming evidence from various sources so that one can make decisions with a high degree of probability of being correct. A similar approach to assessment may be necessary. Evidence collected from a variety of features and techniques is likely to give a more reliable assessment estimate.

1.6 Automation

In line with the aims of this study, the analyses conducted in all the experiments are automated as far as possible using specially designed computer applications. Applications were designed using Delphi 6 software (Borland, 1983). Where other means are used, they are acknowledged. Using a computer program to analyze data offers many advantages, not least speed and ease of calculation. However, computers also place some limitations on the analysis. There are certain things that are easy for a human rater to do but that are difficult for a computer to do and so alternative strategies may be required.

1.7 Outline of the thesis

This thesis can be split into three parts as follows:

Literature review (Chapter Two)
Experimental work (Chapters Three to Eight)
Discussion and conclusion (Chapters Nine and Ten)

In the first part, Chapter Two, a literature review lays out the background to this thesis. Some experiments that are considered important in presenting the case for automatic assessment are summarized and critiqued.

The first three papers in Chapter Two consider some of the essay scoring applications currently in use. The paper by Page (1994) reports on Project Essay Grade, an automatic system used for rating native speaker essays. This report makes a strong case for the efficiency of automatic assessment. Chodorow and Burstein (2004) analyze some of the features of another automatic system, e-rater, and its use on L2 writing in the TOEFL written test of English. They also address the charge that e-rater is overly influenced by essay length. Landauer, Laham & Foltz (2003) describe The Intelligent Essay Assessor (IEA) and the dynamics of latent semantic analysis, which offers an alternative approach to automatic assessment based on semantic relations.

The next group of papers relate to L2 learners and the use of specific objective measures and their relationship to quality assessments. The paper by Larsen-Freeman & Strom (1977) highlights the usefulness of essay length and error-free measures of grammatical complexity. Arnaud (1984) introduces some measures of lexical diversity and considers their relationship with essay quality. The cases for lexical diversity and error measurement are developed by Engber (1995), who argues that lexical choice and error together exert a greater influence on holistic impression than either feature individually. Ferris (1994) looks at a wider range of features of L2 writing and the link between these features and quality judgments. This paper also introduces some useful statistical techniques that can be used to analyse multi-feature data.

The remaining three papers look at some methods that have been developed to measure L2 vocabulary knowledge but which can also inform about essay quality. The first two papers involve two extrinsic measures of lexical diversity, Lexical Frequency Profiles (Laufer & Nation, 1995) and P_Lex (Meara & Bell, 2001). These measure lexical diversity in relation to the frequency of words in general English. They may inform about essay quality under an assumption that essays including more infrequent vocabulary are better essays. Finally, Meara, Jacobs & Rodgers (2002) introduce a new type of lexical metric that measures the unique lexical choices of learners in essays. They argue that essays which include more unique lexical items are likely to indicate learners with a larger vocabulary. Chapter Two is then concluded with a discussion of how this empirical work shapes the experimentation reported on in the following chapters.

The second part of the thesis describes the experimental work. This experimental work itself can be split into two parts. Chapters Three to Five concentrate on lexical features that could be incorporated into an automatic scoring system. Chapters Six to Eight concentrate on some simple ideas for automatic systems.

Chapter Three includes two experiments involving the lexical signature approach of Meara, Jacobs & Rodgers (2002). Using this approach, an index of uniqueness is developed and its suitability for assessment purposes investigated. Chapter Four is an extension of this research and develops a similar concept of distinctiveness as an alternative to uniqueness. Chapter Five contains four experiments investigating various lexical features often used in authorship attribution or elsewhere in applied linguistics. These features include essay length, lexical diversity, and features of word distributions.

Chapters Six to Eight look at some simple systems that could be used for assessment and which may include some of the features identified in the earlier chapters. Chapter Six proposes a simple two-dimensional ‘quantity v content’ model to assess essays. The quantity dimension is measured by essay length and the content dimension by lexical diversity. An experiment is conducted to gauge the efficiency of this model and to examine various measures of lexical diversity as the content dimension. Chapter Seven includes two experiments involving more complex systems of assessment. Two different assessment algorithms are used on the same set of essays and the results compared. One algorithm utilizes cluster analysis and the other a Bayesian classifier. In the first algorithm, cluster analysis is used alongside principal component analysis to try to identify groups of essays according to spatial representation in a number of dimensions. In the second algorithm, a Bayesian classifier is used to categorize the essays as good essays or poor essays. The latter algorithm gets around the need for an external training set of essays or expert materials by using a training set identified from within the set of essays. The two algorithms using cluster analysis and Bayesian analysis are tested on the same set of learner essays to get an insight into inter-algorithm reliability. Chapter Eight investigates a method taken from biology called capture-recapture. It is proposed as a method of assessing essay quality by making predictions of the vocabulary size of the writer of the essay. It could be used in assessment situations where more than one piece of writing from a learner is available.
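As a rough sketch of the kind of processing these two algorithms involve (the feature set, thresholds and library calls below are illustrative assumptions, not the Chapter Seven implementations), the two approaches might look like this in Python:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def cluster_essays(feature_matrix, n_components=3, n_clusters=3):
    """Standardize lexical features, reduce them with PCA and group essays into clusters."""
    z = StandardScaler().fit_transform(feature_matrix)
    components = PCA(n_components=n_components).fit_transform(z)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(components)

def bayes_with_internal_training_set(essays, lengths, good_cut=150, poor_cut=40):
    """Label clearly long/short essays as a provisional training set (an assumed
    criterion, for illustration only), then classify every essay from its word counts."""
    labels = {i: 1 for i, n in enumerate(lengths) if n >= good_cut}        # provisional 'good'
    labels.update({i: 0 for i, n in enumerate(lengths) if n <= poor_cut})  # provisional 'poor'
    counts = CountVectorizer().fit_transform(essays)
    train = sorted(labels)
    classifier = MultinomialNB().fit(counts[train], [labels[i] for i in train])
    return classifier.predict(counts)  # 1 = classified as good, 0 = classified as poor

The details in Chapter Seven differ, particularly in how the internal training set is identified, but the overall flow of reducing features before clustering, and of classifying against a training set drawn from within the essay set itself, is the same.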

The final part of the thesis includes two chapters. Chapter Nine is a discussion of some of the issues raised by the experimental chapters. It focuses on reliability and validity concerns and identifies some limitations of this study. It also evaluates the contribution of this study to research in L2 vocabulary acquisition and looks at how this research could be continued in the future. Chapter Ten is a brief conclusion which summarizes the main findings of this thesis.

Chapter Two: Literature Review

2.1 Introduction

In this literature review, some papers describing existing automatic scoring systems and potentially useful quantitative features are reviewed. The suitability of these systems and features for a more flexible role in L2 assessment is considered.

The first three papers look at three systems of automatic essay scoring. Page (1994) makes an excellent case for automatic essay scoring and describes the development of Project Essay Grade, a system for rating native speaker essays. Chodorow & Burstein (2004) describe e-rater, a system developed by Educational Testing Service that has been used to automatically rate the TOEFL Test of Written English. They also investigate the charge that the system is overly dependent on essay length. Landauer, Laham & Foltz (2003) report on experiments involving The Intelligent Essay Assessor which employs latent semantic analysis to assess essays according to their semantic content.

The remaining papers involve various features and methodologies that may be useful in automatic assessment. Some relate features directly to essay quality while others focus on general proficiency. In the first paper, Larsen-Freeman & Strom (1977) assess various features as potential indices of L2 development. The next two papers by Arnaud (1984) and Engber (1995) focus on measures of lexical diversity. Arnaud explores the relationships of both lexical diversity and essay length with essay quality. Engber also investigates lexical diversity but focuses on how, alongside lexical error, it can account for essay quality. Ferris (1994) investigates a wider array of features using a discriminant analysis and stepwise multiple regression to identify essay features affecting quality. A large number of lexical and syntactic features are incorporated into the analysis to discriminate between essays of two proficiency groups.

The papers by Laufer & Nation (1995) and Meara & Bell (2001) argue that learners tend to acquire higher frequency vocabulary before lower frequency vocabulary. They have developed techniques to measure proficiency by analyzing the frequency level of vocabulary used in essays. The paper by Meara, Jacobs & Rodgers (2002) proposes an innovative approach to essay evaluation using lexical signatures. This lexical signature approach is based on a concept of uniqueness and the authors argue that more proficient learners are likely to produce more unique patterns of lexis. Measuring the degree of uniqueness of lexis in an essay may give insights about the proficiency of the writer and enable us to make predictions about the quality of the essay.

2.2 Page (1994)

2.2.1 Summary

This is a progress report on Project Essay Grade (PEG), an ongoing project using computers to evaluate L1 writing of American high school students.

Page argues that, ideally, essay quality should be judged by humans but there are some problems with this in practice. One problem is that teachers are busy and often do not have enough time to rate essays. This may mean that many essays either go unmarked or are not assigned in the first place. Another problem may be that essay ratings suffer from low inter-rater reliability. This may in part reflect inaccuracy and bias. Increasing the number of judges improves reliability by reducing this judge inaccuracy and bias. However, as the number of judges increases, so do costs, often rendering multiple judges infeasible.

One source of inter-rater variation arises because human judges do not always use the same criteria to judge essays. Criteria are also a major problem when devising a computer analysis. Many criteria that appeal to human judges may be infeasible for computers to assess. Page refers to intrinsic qualities of an essay that might appeal to readers and judges as trins. In contrast, Page refers to features that can be measured and may be markers of these intrinsic qualities as proxes. Trins may include criteria such as fluency, diction, grammar and punctuation. For the trin of diction, a prox of word length might be suitable since longer words are generally less common. For the trin of complexity of sentence structure, proxes such as the number of prepositions or the number of relative clauses can be considered.

To test the effectiveness of some of these proxes, an experiment was conducted using 599 essays from 12th grade students collected as part of the National Assessment of Educational Progress (NAEP) in 1990. The task involved responding to a question about whether the city government should spend money fixing abandoned railway tracks or converting old warehouses. The essays had two ratings supplied by NAEP, one for task completion and the other for overall quality. Page utilized the overall quality rating and recruited more judges to rate on a six-point holistic scale. The pooled ratings of six judges were taken as the holistic scores for the essays. The ratings of any two judges correlated with each other at an average of r = 0.564. Page went on to investigate reliability ratings of groups of judges. As the number of judges rises, so should reliability. Given that the reliability of one judge was r = 0.564, the expected reliability of six pooled judges can be predicted using the Spearman-Brown formula as r = 0.886.
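For reference, the Spearman-Brown prediction here follows the standard formula; with the single-judge figure reported above it reproduces Page's value:

\[ r_k = \frac{k\,r}{1 + (k - 1)\,r}, \qquad r_6 = \frac{6 \times 0.564}{1 + 5 \times 0.564} \approx 0.886 \]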

An automatic prediction of essay quality using a number of variables was carried out using a multiple regression analysis. Unfortunately, the variables included in the analysis were not specified in detail in this paper. However, Page reports that, in previous studies, essay length was found to be the most important variable. In this regression model, the fourth root of essay length was used. The rationale was that judges expect and credit a certain amount of writing but credit for writing over and above that amount tails off very quickly. This dynamic was best modeled by using the fourth root of essay length. The correlation between the computer prediction generated by the multiple regression analysis and the holistic score of the pooled judges was r = 0.877. This fared very well in comparison with the expected inter-rater reliability of the six judges (r = 0.886).

Table 2.1: Expected correlation of multiple judges with six judges

Number of judges | Expected correlation with six judges
1 | .706
2 | .799
3 | .839
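The figures in Table 2.1 can be reproduced on the assumption (not stated explicitly in the paper) that the expected correlation between the pooled rating of m judges and the pooled rating of the six judges is the square root of the product of their Spearman-Brown reliabilities. A short Python check:

def stepped_up(r, k):
    """Spearman-Brown reliability of the pooled rating of k judges."""
    return k * r / (1 + (k - 1) * r)

r = 0.564  # average correlation between any two judges
for m in (1, 2, 3):
    expected = (stepped_up(r, m) * stepped_up(r, 6)) ** 0.5
    print(m, round(expected, 3))  # 0.707, 0.799, 0.839, matching Table 2.1 to rounding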

Page then investigated how many judges would be required to achieve the same reliability as the computer prediction with the results in Table 2.1. The computer method seems to be more accurate than a panel of two or three judges rating each essay. In fact this computer method was consistently more reliable than any group of two or three raters from the group of six raters.

To check the validity of the regressions, the set of essays was randomly split into two groups containing two-thirds and one-third of the essays. The two-thirds group was used to establish weights for the predictor variables and the remaining one-third was used for testing these predictor variables against the ratings of the judges. This procedure was repeated three times and correlations of r = 0.864, 0.863 and 0.849 were returned. Finally, the validity of the results was tested across years. A sample of 495 essays on the same topic from 1988 was used to establish weightings which were then tested against the judge ratings of 1990 essays. The correlation from 1988 predictors and 1990 judge ratings was r = 0.828.
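A sketch of this split-and-validate procedure is given below, with the fourth root of essay length standing in for the unspecified predictor set (the only variable Page singles out); the data are simulated, so the correlations printed are purely illustrative:

import numpy as np

def fit_and_test(lengths, scores, rng):
    """Fit a one-predictor model on a random two-thirds of the essays and
    correlate its predictions with judge scores on the remaining third."""
    order = rng.permutation(len(scores))
    split = (2 * len(scores)) // 3
    train, test = order[:split], order[split:]
    x = lengths ** 0.25  # fourth root of essay length
    slope, intercept = np.polyfit(x[train], scores[train], 1)
    predictions = intercept + slope * x[test]
    return np.corrcoef(predictions, scores[test])[0, 1]

rng = np.random.default_rng(0)
lengths = rng.integers(50, 600, size=599).astype(float)          # simulated essay lengths
scores = 1.2 * lengths ** 0.25 + rng.normal(0.0, 0.5, size=599)  # simulated judge scores
print([round(fit_and_test(lengths, scores, rng), 3) for _ in range(3)])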

2.2.2 Comments

This study sets out a very strong case for the use of computer methods to assess written tasks in an L1 context. On the whole, many of these same arguments are relevant in L2 contexts. Human rater reliability is a major concern. In addition, although L2 rating criteria may mean different predictor variables are used, essay length is still likely to be important in L2 essays. However, whether the statistical methods used in this experiment are appropriate for smaller scale testing situations is in some doubt.

The performance of computer analysis in predicting holistic scores in this analysis is quite remarkable. The computer method performs consistently better than groups of two or three independent judges. This is largely because of weaknesses in the ratings of individual judges. The sources of this weakness include inaccuracy and bias in individual judges as well as different rating criteria applied by judges. This weakness leads to low inter-rater reliabilities between judges. In order to counter this low reliability, multiple judges can be used. But the three or more judges needed to raise the inter-rater reliability to a level achieved by automatic means are, in practice, not economically viable. Continued research can only be expected to further improve the performance of automatic methods.

Whether this method that works so well in an L1 context is applicable in L2 contexts is an important question. The basic concerns seem applicable to L2. Firstly, as in L1, many L2 teachers find grading essays a time-consuming task. As in L1, there are few opportunities to have another teacher rate or help rate essays and costs also impinge on the use of judges. Although rating criteria vary between L1 and L2, raters are subject to the same kinds of error, bias and inconsistency in the rating process which adversely affects inter-rater reliability. An effect not addressed directly in this study is that of rater independence in relation to students. When teachers act as raters, it is sometimes difficult for them to judge their students’ output objectively. Knowledge of students may introduce bias into the grading and rating of essays.

The actual variables used in this experiment were not specified. The trins, intrinsic qualities that identify a good essay, are likely to be different in L1 and L2. In L2, emphasis will be more on language ability than on content and ideas. Therefore, different proxes, indicators of these trins, will probably be used as independent variables. However, essay length, which Page found to be influential, is also likely to be important in L2 essays. Page found that essay length was most important in short essays but less important in longer essays. L2 texts have a tendency to be short and many studies have suggested an association between essay length and holistic rating, for example, Larsen-Freeman & Strom (1977).

The argument for automatic assessment delivered by this experiment is strong. However, there is one weakness. The statistical methods involved favour testing situations with a large number of essays. Multiple regression is a statistical analysis which demands a high number of datasets. In this case, the number of datasets corresponds to the number of essays. In fact, Hatch & Lazaraton (1991) suggest thirty datasets for each independent variable. So, for example, if five variables were used, at least 150 datasets would be necessary. Although Page does not specify how many independent variables were used in this experiment, the high number of essays available, 600, was more than sufficient. However, it is not uncommon for teachers to have far fewer essays that require rating. In these situations, multiple regression may not be the most appropriate form of analysis.

This study shows that unless there are upwards of three teachers to rate each essay, automatic rating may be technically superior to human rating. Use of human judges on such a scale is prohibitively expensive in most cases. This suggests that automatic assessment may be useful for more than speedy online applications where human rating is impossible. It may also be superior to using humans in more general testing situations. However, the challenge is to make automatic assessment a more versatile alternative for use in a wider variety of test formats.

2.3 Chodorow & Burstein (2004)

2.3.1 Summary

Chodorow & Burstein describe e-rater, an automated scoring system developed by Educational Testing Service (ETS). E-rater has been used in large-scale L1 testing environments and also on the TOEFL Test of Written English. In this experiment, Chodorow & Burstein address the effects of essay length on TOEFL written test scores awarded by human raters and by e-rater.

This study was concerned with two points related to e-rater and its role in rating the written portion of the TOEFL test. Firstly, Sheehan (2001) argues that an early version of e-rater, e-rater99, depends on features influenced by essay length rather than writing quality. Secondly, there was evidence of variation in TOEFL written scores according to native language. Accordingly, this experiment investigates the following:

1) performance of e-rater against human raters when essay length is controlled.
2) features of essays that are effective at scoring essays when length is controlled.
3) whether differences in TOEFL written scores according to native language are due to essay length.
4) whether e-rater scores show the same difference by native language as human raters.

TOEFL written answers to seven prompts were used, including the following example: Do you agree or disagree with the following statement? Playing a game is fun only when you win.

Students taking the test had thirty minutes to write an essay on one of the given prompts. Essays were scored by two human raters on a scale of one to six. The final score awarded was the mean of the two rater scores unless they varied by more than one point. In that case, a third rater graded the essay and the score was the mean of the third rater and the closest score from the first two raters. ETS provided a scoring rubric which contained no reference to essay length.
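The adjudication rule described above can be written as a short function (a sketch; ratings are assumed to be integers from one to six):

def final_score(rater1: int, rater2: int, third_rating: int = None) -> float:
    """Mean of two ratings, unless they differ by more than one point, in which
    case a third rating is averaged with the closer of the original two."""
    if abs(rater1 - rater2) <= 1:
        return (rater1 + rater2) / 2
    closest = rater1 if abs(rater1 - third_rating) <= abs(rater2 - third_rating) else rater2
    return (third_rating + closest) / 2

print(final_score(4, 5))     # 4.5
print(final_score(2, 5, 4))  # third rating needed: (4 + 5) / 2 = 4.5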

E-rater measures over fifty features from four types: syntactic, discourse, topical and lexical. The lexical features were added for the e-rater01 version. However, there was no direct measure of essay length included. In this experiment, two versions of e-rater, e-rater99 and e-rater01, were trained on a set of 265 essays for each prompt. This training set was graded by human raters and used by e-rater to create a stepwise linear regression model. This model found eight to ten features to best score each prompt, giving a unique model for each prompt. The score produced was rounded to the nearest whole number for the predicted score. The training essays included essays with all possible scores from human raters. A set of 500 mixed essays with proportions of each score matching the general scoring profile of TOEFL test-takers was also rated for each prompt using the model. Also, large sets of essays for each prompt from three language groups of Spanish, Japanese and Arabic speaking test-takers were rated.

The most common features in e-rater99 and e-rater01, appearing in four or more of the seven models, are shown in Table 2.2 and Table 2.3.

Table 2.2: The most common features in e-rater99

Feature | No. of models (out of seven) | Feature type
number of auxiliary verbs divided by number of words in essay | 7 | syntactic
number of auxiliary verbs | 7 | syntactic
score from topical analysis by argument | 7 | topical
number of verb+ing words | 6 | discourse
score from topical analysis by essay | 5 | topical
number of argument development contrast phrases | 4 | discourse

Table 2.3: The most common features in e-rater01

Feature | No. of models (out of seven) | Feature type
number of words of various lengths (5, 6, 7, 8 characters) | 7 | lexical
number of different word types | 7 | lexical
number of auxiliary verbs in essay | 5 | syntactic
score from topical analysis by argument | 6 | topical
number of argument development words in first paragraph | 5 | discourse

To investigate the relationship of essay length to human rater and e-rater scores, a length based model using polynomial regression was fitted to the same training essays and mixed essays. This length based model incorporates a measure of the number of words and the number of words squared. The proportion of exact agreements and adjacent agreements with the human raters was calculated. Performance reliability was also measured as a Kappa statistic (K), the proportion of actual agreement of model scores and human rater scores after chance agreement has been removed. Similarly, the agreements of the two versions of e-rater, e-rater99 and e-rater01, and the correlations of the two human raters were calculated. This information for all the prompts combined is shown in Table 2.4.
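In standard notation (the paper does not report the fitted coefficients), the length-based model and the Kappa statistic described above take the forms

\[ \hat{s} = b_0 + b_1 w + b_2 w^2 \qquad \text{and} \qquad K = \frac{p_o - p_e}{1 - p_e}, \]

where $w$ is the number of words in the essay, $p_o$ is the observed proportion of agreement between model and human scores, and $p_e$ is the proportion of agreement expected by chance.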

Table 2.4: Reliability of length models, e-rater and human rater (HR) scores

Model | Combined r² | Exact agreement (K) | Adjacent agreement (K)
length model | .53 | .49 (.31) | .95 (.84)
e-rater99 | .50 | .46 (.27) | .92 (.76)
e-rater01 | .60 | .53 (.37) | .96 (.88)
HR1 v HR2 | .59 | .56 (.42) | .96 (.86)

The length model outperforms e-rater99 by correlating better with human raters and having a higher proportion of exact agreement and adjacent agreement with human raters. However, e-rater01 performs better than both the older version of e-rater and the length based model, and is comparable with the agreement between the two raters.

To investigate the essay length effects further, squared partial correlations between e-rater and human raters with essay length removed were calculated. The results are shown in Table 2.5 for the combined prompts.

Table 2.5: Squared partial correlations of e-rater with human raters

Model | Human raters | Single rater
e-rater99 | .06 | .04
e-rater01 | .15 | .11
Partial variance between two raters = .26
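The squared partial correlations in Table 2.5 are presumably the square of the standard first-order partial correlation with essay length partialled out:

\[ r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\,r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}, \]

where $x$ is the e-rater score, $y$ the human rater score and $z$ essay length.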

The low partial correlations between e-rater and human raters support the idea that e-rater is based largely on length of essay. E-rater99 shows little variance outside that provided by essay length but e-rater01 is slightly better. In order to find out how sensitive individual features were to essay length, the squared partial correlations between each feature and the human rater score were calculated and are shown in Table 2.6.

Table 2.6: Squared partial correlations between features and human ratings

E-rater model | Feature | Sq. partial correlation
99 | number of auxiliary verbs | <.001
99 | number of auxiliary verbs divided by number of words in essay | <.001
99 | number of verb+ing words | .011
99/01 | score from topical analysis by argument | .076
99 | score from topical analysis by essay | .075
01 | number of different word types | .099
01 | number of words of various lengths (5, 6, 7, 8 characters) | .134

All the features have low correlations, but the new lexical features in e-rater01 perform relatively well.

The study also investigated the relationship between essay length and native language of test-takers. Spanish-speaking test-takers tended to score higher than Japanese and Arabic speaking test-takers but also tended to write longer essays. Results of an ANOVA suggested that the length of essay was not entirely the reason for the higher scores by Spanish-speaking students and that a native language effect did exist.

2.3.2 Comments

This study gives a rare insight into how one of the main automatic assessment packages actually works and raises several points of interest. Firstly, although the effect of essay length on automatic rating can be seen as troublesome from a validation angle, there could actually be ways to positively incorporate it into the assessment model. In addition, there are issues relating to inter-rater reliability in a categorical rating system. Finally, although this type of large-scale testing situation is quite atypical, there are aspects that can be adapted to smaller scale applications, such as the type of feature selected.

This study highlights the strong correlation between length of essay and ratings by humans and automatic rating systems. However, it is not clear to what degree human raters are actually

influenced by the length of an essay. There is some evidence that raters take essay length into account when rating essays. Cumming, Kantor & Powers (2002) found that “assessing quantity of production” was a behaviour often exhibited by raters assessing TOEFL essays in a “think aloud” study. This study seems to show that raters actively assess essay length but it does not really tell us how much this influences their rating. Perhaps longer essays simply give the rater more chance to notice positive aspects of the essay. Or perhaps raters are not influenced by essay length at all and the correlation is due to the fact that longer essays just tend to be better essays.

The case with automatic ratings is different. Here the influence of essay length may be direct. Automatic systems are based on predicting the scores of human raters, and because essay length is often the best single predictor of human scores, automatic systems will identify it as the most efficient feature to predict human ratings. In this experiment, features selected at the training stage may be effective only because they are indirect measures of length. Six of the seven most common features of e-rater99 indirectly reward length because they are simple totals that increase along with essay length. In fact, two of the three features included in all models are the same except that one is controlled for length of essay (number of auxiliary verbs divided by number of words in the essay) while the other is not (number of auxiliary verbs). In this case, the latter measure can only be a proxy for essay length. Even the features in the better-performing e-rater01 seem to be mostly totals that will increase with essay length.

The assessment community has an uncomfortable relationship with essay length. It is recognized that human ratings are highly correlated with essay length, but many assessment organizations are at pains to emphasize that rating rewards quality rather than quantity. Accordingly, length is not included in the scoring guidelines for the TOEFL written test. While human rating mechanisms are not clear, the features employed in automatic assessment are open to scrutiny. Both the reliability and validity of automated systems such as e-rater depend on their scores correlating highly with those of human raters. To achieve high levels of reliability and validity, the trade-off designers must accept is ceding some control of feature selection, in this case, to the stepwise regression analysis which selects the best combination of features on the basis of statistical efficiency. Although the designers selected the initial fifty features as input to the regression analysis, the combination of best features is left to the statistical analysis. This combination may undermine the validity of the method when examined with hindsight.

Most academic papers written by the main producers of assessment software tend to read like commercial brochures. They are full of claims about their product and its reliability compared with human raters but short on details of how their systems actually work and how they arrived at these claims of reliability. This paper clearly shows the features used and how the reliability statistics are calculated. The results may also give us a clue as to why, in general, these companies feel such commercial secrecy is necessary. It may be less a matter of stopping the competition from learning their model than of not letting the consumer know what the system is actually doing, because it might put them off.

Nevertheless, there may be an argument that essay length has some merit when evaluating second language learners, particularly in a timed essay format. Many researchers have proposed a model of lexical competence that includes the three dimensions of vocabulary size, richness and fluency, for example, Daller, Milton & Treffers-Daller (2007). Since second language students who can access and retrieve language forms more readily are likely to write relatively more in a timed format, essay length could act, in Page’s terms, as a prox for the trin of lexical fluency.

This study also highlights the problems of inter-rater reliability for categorical data when there are few categories. In this study, reliability is based on the proportion of exact and adjacent matches with human raters. Using adjacent matches, the Kappa reliability adjusted for chance agreement is .88. This compares to a Kappa of only .37 for exact matches. Within a narrow scoring framework such as TOEFL, with a scoring system from one to six, and particularly when a high proportion of the scores fall into bands 3, 4 and 5, reliability based on being within one score of human raters seems less than ideal. The proportions of the total test-taking community getting TOEFL scores 3, 4 and 5 are 20%, 37% and 26% respectively. These bands cover 83% of TOEFL test takers. According to the adjacent match reliability statistic, a rating of 4 in this study means we can only be 88% sure that this candidate is a score 3, 4 or 5. Other testing situations may require fewer categories. In those situations it may be necessary to find ways to boost the exact agreement ratios because adjacent agreement will not carry much meaning. Interestingly, the e-rater reliability is better than human rater reliability under adjacent matches, .88 to .86, but inferior under exact matches, .37 to .42.
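The Kappa statistics referred to here correct observed agreement for the agreement expected by chance, which can be estimated from each rater's marginal distribution over the score bands. The sketch below is one common way of computing exact and adjacent-agreement Kappa; it is not necessarily the exact procedure used in the study, and the score vectors are invented.

```python
# Chance-corrected agreement (Kappa). Exact agreement uses tolerance 0;
# "adjacent" agreement counts scores within one band as agreeing.
import numpy as np

def kappa(r1, r2, categories, tol=0):
    r1, r2 = np.asarray(r1), np.asarray(r2)
    # Observed proportion of (exact or adjacent) agreement.
    p_o = np.mean(np.abs(r1 - r2) <= tol)
    # Chance agreement estimated from the two raters' marginal distributions.
    p1 = np.array([np.mean(r1 == c) for c in categories])
    p2 = np.array([np.mean(r2 == c) for c in categories])
    p_e = sum(p1[i] * p2[j]
              for i, ci in enumerate(categories)
              for j, cj in enumerate(categories)
              if abs(ci - cj) <= tol)
    return (p_o - p_e) / (1 - p_e)

scores_a = [3, 4, 4, 5, 2, 4, 3, 5, 4, 3]   # hypothetical 1-6 band scores
scores_b = [3, 4, 5, 5, 3, 4, 4, 4, 4, 2]
bands = range(1, 7)
print(kappa(scores_a, scores_b, bands, tol=0))  # exact agreement
print(kappa(scores_a, scores_b, bands, tol=1))  # adjacent agreement
```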

The large number of essays available (nearly 11500 for seven prompts) for this study is impressive. It allows for the building of models using many training and cross validation essays. Outside of a few big institutionalized examinations, assessment more often involves a much

smaller number of essays. Those situations may not afford the luxury of a training set of pre-rated essays. Perhaps this type of large scale study can help identify some core features that may be useful in other testing situations. For example, in this study, e-rater99 produced four features which were incorporated in all models while e-rater01 found two features that were in all seven models. Of course these features may only effectively work for these particular essays but it may be worth investigating how they work more generally.

2.4 Landauer, Laham, & Foltz (2003)

2.4.1 Summary

The Intelligent Essay Assessor (IEA) is based on the process of latent semantic analysis (LSA) and has been used in the assessment of L1 essays. It aims to measure conceptual content of essays rather than grammar, style or mechanics.

LSA is the mathematical representation of co-occurrence relations among words and passages using statistical computations involving singular value decomposition (SVD). In LSA, each essay is transformed into a vector in semantic hyperspace and compared to other essay vectors. Two measurements are derived to help rate essays. The cosine of each essay vector to other vectors is used to determine the quality of that essay. This is complemented by another measurement, vector length, which measures domain relevance in relation to the semantic space created. In these experiments, three methods of determining the range of the semantic space and judging quality are described. The main method is to use a set of essays pre-rated by humans. However, two other methods are also explored. In one, expert model essays or outside sources are used instead of pre-rated essays and in the other, the internal composition of the set of essays determines the parameters without any other outside input.
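As a rough illustration of the mechanics described above (and not the IEA itself), the sketch below builds a term-by-document matrix from a few invented essays, reduces it with truncated SVD, and uses cosine similarity to hypothetical pre-rated essays as a quality signal, with vector length as a crude relevance signal. The corpus, scores and number of dimensions are all assumptions for illustration; real LSA applications use hundreds of dimensions and far larger corpora.

```python
# Minimal LSA-style sketch: essays become vectors in a reduced "semantic space"
# via SVD; cosine similarity to pre-rated essays gives a crude quality signal.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rated_essays = [            # hypothetical pre-rated essays
    "the heart pumps blood through arteries and veins",
    "blood is pumped by the heart to the lungs and body",
    "i like football and music on the weekend",
]
rated_scores = np.array([5.0, 4.0, 1.0])
new_essay = ["the heart sends blood around the body through vessels"]

# Term-by-document matrix, then SVD down to a small number of dimensions.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(rated_essays + new_essay)
svd = TruncatedSVD(n_components=2, random_state=0)
vectors = svd.fit_transform(X)

rated_vecs, new_vec = vectors[:-1], vectors[-1:]
sims = cosine_similarity(new_vec, rated_vecs)[0]

# Score the new essay as a similarity-weighted average of its nearest neighbours.
weights = np.clip(sims, 0, None)
print("predicted score:", np.sum(weights * rated_scores) / np.sum(weights))
print("vector length (domain relevance):", np.linalg.norm(new_vec))
```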

This study reports on three main experiments:
1) Heart studies - involving 94 undergraduates writing essays both before and after reading for a psychology course. The tests were scored by two human raters.
2) Standardized tests - 787 essays on two topics for the GMAT and 900 essays on a single topic for grade school children.
3) Classroom studies - 845 undergraduate essays on six different topics from two institutions.

In the experiments involving standardized tests and classroom tests, the semantic space was constructed with essays rated by humans. In these experiments, the IEA score agreed with raters

as well as single raters agreed with each other. Inter-rater reliability was checked between the two human raters, IEA and single raters, and IEA and pooled raters. For standardized essays, the values were 0.86, 0.85 and 0.88 respectively. For classroom essays, the values were 0.75, 0.73 and 0.78 respectively. The effect of the number of pre-scored essays on reliability was also investigated. Reliability values varied from 0.53 with six pre-scored essays to 0.69 with 25 essays to 0.75 with 400 essays.

For the heart studies experiment, the semantic space was constructed using paragraphs about the heart from an encyclopedia and high school biology textbooks. Each essay was scored by finding the ten most similar essay vectors and taking an average cosine weight. The scores correlated at 0.65 with pooled scores from two raters. In an alternative method using these same essays, the semantic space was constructed using the student essays themselves and scores were calculated through comparison with other similar essays. The scores using this method correlated at 0.62 with pooled scores from two raters.

The correlation of scores with length of essay was also checked. Standardized essays tended to have a higher correlation than the classroom essays.

Although the correlation between LSA scores and individual rater scores was not as high as that between LSA scores and pooled rater scores, it was still relatively high. The authors suggest using LSA with one rater when a second rater is not an option in order to get a more reliable estimate of the score or to spot inconsistencies in human grading.

The following features of LSA are also of interest:
1) Confidence measures for score quality can be estimated by comparing the cosine between the vector of an essay and a group of the most similar essays. If the nearest neighbour grades are too variable or the cosine value is too low, it may mean that the score for that essay may not be reliable.
2) Plagiarism can be detected by noting essay vectors that are unusually similar. Suspicious essays can then be picked out and examined by hand.
3) Coherence in an essay can be evaluated by comparing similarity of vectors for each successive sentence in the essay (a simple sketch of points 2 and 3 follows below).
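A minimal sketch of points 2) and 3) is given below. It uses plain TF-IDF vectors and cosine similarity rather than a reduced LSA space, and the similarity threshold and example texts are purely illustrative.

```python
# Flag suspiciously similar essay pairs and score coherence as the average
# similarity of successive sentences. Thresholds and texts are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similar_pairs(essays, threshold=0.9):
    """Return index pairs of essays whose vectors are unusually similar."""
    vecs = TfidfVectorizer().fit_transform(essays)
    sims = cosine_similarity(vecs)
    return [(i, j) for i in range(len(essays)) for j in range(i + 1, len(essays))
            if sims[i, j] > threshold]

def coherence(essay):
    """Mean similarity between each sentence and the next one."""
    sentences = [s.strip() for s in essay.split(".") if s.strip()]
    if len(sentences) < 2:
        return 0.0
    vecs = TfidfVectorizer().fit_transform(sentences)
    sims = cosine_similarity(vecs)
    return float(np.mean([sims[i, i + 1] for i in range(len(sentences) - 1)]))

essays = ["My town is small but friendly. People in my town are friendly and kind.",
          "My town is small but friendly. People in my town are friendly and kind.",
          "I study English every day. My hobby is playing the guitar."]
print(similar_pairs(essays))      # the first two (identical) essays are flagged
print(coherence(essays[0]))
```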

2.4.2 Comments

LSA is a complex process. To understand it more clearly, it is useful to differentiate between the statistical analysis and the linguistic model. These two aspects both potentially enhance the validity of this automatic assessment technique for L1 learners. The suitability of this technique for L2 learners is less clear, but there may be useful aspects that could be adapted to an L2 context.

As a statistical analysis, LSA seems to be a form of principal component analysis (PCA). PCA is basically a technique to streamline sets of multivariate data. PCA helps reduce the dimensionality of such data by finding inter-correlations between variables. The result of a PCA is a reduced set of new variables which account for most of the variation in the original data. This smaller number of variables can then be used in other techniques such as clustering. In an analysis for rating essays, the original set of variables could be all the words in the whole set of essays and the vector for each essay would show the occurrence or non-occurrence of each word in that essay. After SVD, a smaller number of new variables in order of variability between essays can be produced. These new variables, although perhaps lacking any obvious intrinsic meaning, could form the dimensions of a semantic space in which the essays could be mapped and evaluated according to their proximity to ‘expert’ essays. The new set of variables in LSA is typically in the range 100-400. It is worth noting that, in LSA, the exact number of new variables selected to form the semantic space can have a marked impact on the effectiveness of the results. Both too many and too few variables may lead to less effective results (Landauer & Dumais 1997).

LSA is also presented as a linguistic model of lexical acquisition and semantic representation. Landauer & Dumais (1997) argue that this model may help explain the “poverty of the stimulus” problem whereby children appear to acquire far more language than they have direct exposure to through learning. If learning involved processes similar to the mathematical processes in LSA, then children would be able to learn a lot of words through indirect learning. This latent semantic learning is achieved through identifying contexts where words do or do not co-occur with other words. Landauer & Dumais (1997) have found through modeling lexical acquisition with LSA that three quarters of learning could be indirect.

LSA research has mostly focused on native language testing situations. It is supposed to tell us about the writer’s conceptual knowledge of a particular topic by the degree of similarity to expert writers, subject material or other essay writers. It is likely that the words that are

important in such an analysis may be those with particular relevance to that topic or field. For second language learners, essay grading is not so much about the conceptual content of essays but about language ability. Question topics are likely to be more general and the vocabulary less topic specific. Whether LSA has a role to play in L2 testing situations is still unclear as there seems to have been little research in that area yet. However, LSA has been used to track lexical development in L2 learners (Crossley et al., 2008). In a longitudinal study of six L2 learners of English, the LSA scores of spoken utterances increased over the one year span of the experiment in line with other measures of lexical development.

If we compare IEA to other automatic methods of rating essays such as PEG and e-rater, IEA has a definite advantage in terms of validity if one buys into the linguistic model of LSA. Moreover, the form of the analysis itself seems to provide more validity than these other models. The methodology used by Page or ETS depends mostly on various types of word counts. In LSA, the semantic relationships of words themselves are analyzed. So, although the word order information is still lost in LSA, word meaning information is preserved. This seems to be a step up from largely count-based analyses. However, one drawback to LSA is that despite a more complex model, which may incorporate more information, it does not appear to perform any better than the other models. It would be interesting to see if procedures such as LSA could be utilized alongside other measures to give an enhanced methodology. In this respect, the other methods seem to have an edge in flexibility. Because they are not constrained by any theory, they are free to incorporate anything that works into their models. In fact, the latest version of e-rater seems to have incorporated an LSA-like feature (word-vector score and word-vector cosine correlation) as one part of its analysis (Lee, Gentile & Kantor, 2008). Another disadvantage that LSA shares with other patented commercial applications is that it is difficult to find out exactly how it works. It is marketed for large-scale testing situations and opportunities to experiment with it are limited to online sites such as University of Memphis Coh-Metrix website (McNamara et al., 2005) which provide some simple LSA-like output.

These drawbacks notwithstanding, LSA-type procedures offer some hints for the creation of a more user-friendly, versatile testing instrument. One thing holding back the use of automatic assessment in a greater variety of testing situations is the necessity of having pre-rated essays or model essays. Therefore, evaluating essays solely by proximity to other test essays holds promise. One such method employed in this experiment reported high levels of reliability, albeit a little lower than when compared with ‘expert’ papers. This is an approach that should be explored in more detail. Procedures such as PCA might also be helpful more generally by

helping select words that offer maximum information for other analyses such as cluster analysis.

2.5 Larsen-Freeman & Strom (1977)

2.5.1 Summary

This study aims to construct an index of development that can be used to identify the proficiency of second language learners. This index is targeted primarily at researchers so that results from different studies can be compared more meaningfully. It would be applicable to different ages and to different native language groups. It is also hoped that it will not be restricted to English as an L2. In this experiment, written essays were considered because texts are easy to work with but it is hoped the index can also be used with spoken English.

The experiment involved 48 undergraduate learners of English as an L2 with no more than two of these learners sharing the same native language. Twenty eight of the subjects were male and twenty were female. Essays were written for the UCLA ESL Placement Examination. The test was conducted under examination conditions but with no time limit. However, a minimum essay length was set (but not directly specified in the paper). The subject of the compositions was not specified. The essays were assessed by two researchers and assigned to five quality levels: poor, fair, average, good, and excellent. There was a high level of agreement in these ratings with an inter-rater reliability of 0.9713 which was statistically significant.

Essays were examined for a variety of features. Examples of errors and native-like usage were noted. Writing mechanics were assessed in terms of punctuation, capitalization and spelling. Organization, clarity, syntactic sophistication and lexical choice were also assessed. The ability to write grammatically was examined by looking at morphology, syntax, prepositions, tenses, aspect, articles, subject-verb agreement, case and negation. Sentence construction problems such as sentence fragments and run-on sentences were also identified.

Forming an index from all these features proved problematic for various reasons. Firstly, it is difficult to weight the relative importance of errors. For example, it is not easy to say if misuse of an article is a more serious error than inappropriate use of a lexical item. Secondly, an index based on errors may be problematic. It might not be safe to assume a linear relationship between error frequency and proficiency. A decreasing number of errors might be expected as proficiency increases. However, error is not always linear in its relationship to quality. In this study, essays rated poor contained fewer article errors than essays rated fair. Only essays rated good or excellent contained fewer errors than essays rated poor. Thirdly, if an index is based on

particular features, those particular features may be biased against certain native language groups which experience interference with that feature.

After considering these problems, it was decided to examine some objective features. The following features were investigated: essay length, the average number of words per T-unit and the number of error-free T-units. A T-unit is defined by Hunt (1965) as a “minimal terminable unit ... minimal as to length, and each ... grammatically capable of being terminated with a capital letter and a period.” For the number of error-free T-units, the count was controlled for length of essay, so only the 37 of the 48 essays written to a length of 200 words were examined for this feature.

The results of the examination of features found the following trends:
1) There was a decreasing number of spelling errors with proficiency despite the use of more sophisticated vocabulary.
2) Syntactic sophistication improved with proficiency from simple sentences in essays rated poor to complex sentences with relative clauses and multiple embeddings in essays rated excellent.
3) Tense usage improved with proficiency.
4) Errors in morphology decreased slightly from essays rated fair to essays rated excellent.
5) There was no obvious trend in errors related to prepositions.
6) Essays rated fair had more errors than essays rated poor but then the number of errors decreased with increasing proficiency.

The objective analyses produced the results summarized in Table 2.7.

Table 2.7: Objective statistics of student essays

                                        Poor     Fair     Average   Good     Excellent
Average number of words per essay       132.53   150.55   177.33    218.91   228.00
Number of essays (n=48)                 (11)     (12)     (6)       (14)     (5)
Average number of words per T-unit      11.58    12.50    12.92     14.28    14.46
Standard deviation                      4.90     2.82     4.22      2.15     3.37
Number of essays (n=48)                 (11)     (12)     (6)       (14)     (5)
Average number of error-free T-units    3.5      11.7     17.2      25.5     28.5
Standard deviation                      3.1      5.1      9.7       7.5      4.9
Number of essays (n=37)                 (9)      (9)      (3)       (12)     (4)

Even though there was no time constraint on the essay, there was a steady increase in average essay length with increasing proficiency. Although there was a consistent increase with

proficiency in the average number of words per T-unit, the result was not statistically significant. The range of values in each group was relatively high as shown in the standard deviations. Some extreme values may have undermined the trend. For example, the longest average number of words per T-unit was 23 which occurred in a poor essay and some excellent essays had very short T-units. Among the 37 essays examined for error free T-units, there was a highly significant relationship between average number of error-free units and proficiency.

The authors conclude that two features, essay length and the number of error free T-units, show great promise for an index of development. They hope to continue their research by using a similar approach to analyze transcripts of spoken interviews.

2.5.2 Comments

This paper raises many points of interest about the construction of an index of development. In particular, it highlights some of the main differences between the way humans rate essays and how essays might be rated automatically by computer. One aspect of this is the type of feature that can be incorporated into an automated computer application. Another aspect is how large amounts of information can be processed.

This experiment was conducted very thoroughly and a lot of information was produced. This information was split into two types. Firstly, a subjective analysis reported on various trends over the different proficiency levels in terms of writing mechanics, lexical choice, and grammar. A second, more objective analysis targeted three counts: essay length, the average number of words per T-unit and the number of error-free T-units. However, the more we look at these two analyses, the less obvious the differences become. The difference seems to be more in the way the data is reported than in the nature of the analyses themselves. In the former analysis, there are subjective elements, as in the decisions about errors, but there is also an objective element in that these features are counted. Similarly, the second analysis involves one count that is very objective, essay length, and two others that are more subjective. Both splitting an essay into T-units and deciding whether these T-units are error free or not are subjective decisions. In addition, Nihalani (1981) notes that identifying T-units in low level essays is particularly difficult. This is echoed by various other researchers such as Gaies (1980) who argue that T-units are not suitable for analysis of the output of low level learners.

Another way of defining objective and subjective features may be to divide them into features that can be analyzed automatically by a computer and those that require human judgment. Using

this distinction, essay length is definitely objective but the other features may be subjective to a certain extent. However, we can probably split the subjective features up further into features that can reasonably be estimated automatically and ones that are more problematic. Clarity would probably belong in the latter category. It is likely, though, that most of the others could be candidates for estimation in some way. For example, perhaps syntactic sophistication could be predicted by the presence of words that typically occur in constructions of a certain sophistication. For example, mastery of relative clauses might be flagged by the occurrence of relative pronouns. Perhaps run-on sentences could be identified by a count of conjunctions between successive full stops. Perhaps lexical choice could be estimated by an analysis of words by their frequency in general English. Less frequent word usage may be a predictor of superior lexical choice. Unfortunately, we would probably lose the very robust T-unit to the category of features difficult to estimate. It is not easy to see how a T-unit could be identified automatically. A measure of average sentence length can be easily estimated by means of a word count and a punctuation count. Although a T-unit is related to a sentence, its difference is likely to be crucial to its efficiency as an indicator.
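The sketch below illustrates a few of these proxies: mean sentence length from word and punctuation counts, the rate of relative pronouns as a crude marker of relative clause use, and a count of conjunctions within a sentence as a rough run-on flag. The word lists and the two-conjunction threshold are illustrative assumptions, not validated measures.

```python
# Rough automatic proxies for some of the subjective features discussed above.
import re

RELATIVE_PRONOUNS = {"who", "whom", "whose", "which", "that"}   # illustrative list
CONJUNCTIONS = {"and", "but", "so", "or"}                        # illustrative list

def mean_sentence_length(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(text.split()) / max(len(sentences), 1)

def relative_pronoun_rate(text):
    words = [w.lower().strip(".,!?") for w in text.split()]
    return sum(w in RELATIVE_PRONOUNS for w in words) / max(len(words), 1)

def possible_run_ons(text):
    # Sentences containing two or more coordinating conjunctions are flagged
    # as possible run-ons - a crude heuristic only.
    flagged = 0
    for sentence in re.split(r"[.!?]+", text):
        words = [w.lower().strip(",") for w in sentence.split()]
        if sum(w in CONJUNCTIONS for w in words) >= 2:
            flagged += 1
    return flagged

essay = "I like my school and my friends and we play soccer. The teacher who helps us is kind."
print(mean_sentence_length(essay), relative_pronoun_rate(essay), possible_run_ons(essay))
```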

Although many of these features may be estimated automatically, it is clear that human analysis is able to consider a lot more aspects of an essay than an automatic analysis. However, inter-rater reliability is often a problem with human rating of essays. Some of this inter-rater variability may come from the different ways raters prioritize and process all this information. In this study, the authors were faced with making arbitrary decisions about importance of errors that may lead to bias against some L2 learners. When trying to decide which features were the best to incorporate into an analysis, the authors effectively reached the conclusion that all information might be important. It seems the problem with human rating may not be in identifying important aspects of an essay, but in processing all this information to make decisions consistently and reliably.

Processing information consistently and reliably is something that computers ought to be able to do well. Occurrence of particular features over different essays can be more reliably compared automatically than by humans. It may also be possible for computer analysis to find patterns in data that might not easily be noticed by humans. This may help identify key features that discriminate well between essays of different quality. Computers can also probably make better judgments than humans against outside criteria. For example, it is easy for a computer to compare vocabulary used in an essay against a list of the most frequent words in English.

This study recognizes two features that may be useful, essay length and the number of error free T-units. Even though the latter measure is difficult to use in an automatic program, there are plenty of other features that can be employed. The key to effective automatic assessment could well be not in the individual features but in identifying a number of complementary features that can be automated and then incorporating these features into one analysis.

2.6 Arnaud (1984)

2.6.1 Summary

This paper considers the validity of discrete item vocabulary tests for second language learners by investigating how they reflect performance in real-life writing situations. Vocabulary test scores are compared with some measures of lexical diversity elicited from a written task.

For this experiment, Arnaud had one hundred French students of English complete a 25-item vocabulary test which involved providing English translations for native language words. The students also completed a written task based on the following question: What do you suggest to improve secondary education in France?

The vocabulary test and the written task were completed with an interval of one week. In addition, for comparative purposes, four American native speakers wrote an essay based on the following question: What do you suggest to improve high school education in the U.S.A?

Before the main analysis, the essays were reviewed subjectively. Arnaud selected the exact form of the lexical diversity statistics after making several observations about various characteristics of the essays. Lexical diversity in its basic form is the number of different types as a proportion of the number of tokens:

Lexical diversity = types / tokens

Firstly, it was noted that lexical diversity can be affected by non-lexical aspects of language such as cohesion of the text. An essay including repetitions caused by a lack of cohesion may get lower lexical diversity scores. It was also found that lexical items seemed to create the most effect in the essays and some essays included only a very limited range of lexical items. Therefore, lexical variation, using lexical types in relation to total tokens, was deemed an appropriate measure:

Lexical variation = lexical types / tokens

Arnaud also noticed that some students had a vocabulary that was insufficient to express value judgments beyond the use of general words like good and bad. This, Arnaud argues, is related to rareness of vocabulary and it was decided to include a measure of lexical rareness. Rareness was appraised via a French Ministry of Education list of 1522 words which should be known by French students. Any word not included in this list was judged to be rare. Vocabulary rareness was calculated as rare types as a proportion of total lexical types:

Vocabulary rareness = rare types / lexical types
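The three ratios described above are straightforward to compute once an essay's tokens, its lexical (content) types and a reference list of 'known' words are available. The sketch below assumes these have already been identified; the sets shown are tiny stand-ins for the real lists.

```python
# Arnaud-style ratios, given pre-identified tokens, lexical types and a
# reference list of "known" words. All inputs here are illustrative.
def lexical_diversity(tokens):
    return len(set(tokens)) / len(tokens)

def lexical_variation(tokens, lexical_types):
    # Lexical types as a proportion of all tokens.
    return len(lexical_types) / len(tokens)

def vocabulary_rareness(lexical_types, known_words):
    # Types not on the reference list are treated as "rare".
    rare = [t for t in lexical_types if t not in known_words]
    return len(rare) / len(lexical_types)

tokens = "my school is big and my school has a beautiful garden".split()
lexical_types = {"school", "big", "beautiful", "garden", "has"}   # illustrative
known_words = {"school", "big", "has", "garden"}                  # illustrative list
print(lexical_diversity(tokens))
print(lexical_variation(tokens, lexical_types))
print(vocabulary_rareness(lexical_types, known_words))
```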

It was also noted that the communicative effect of some essays was hindered by a large number of errors. Therefore a count of lexical errors was included. Non-lexical errors were not included in this count.

Results of the experiment were as follows. Essays varied in length from 180 words to nearly 600 words with a mean length of 313 words. The four native speaker essays varied from just under 300 words to just over 550 words.

Lexical diversity measures are known to be affected by the length of the essay. Arnaud demonstrated that this effect occurs in this set of essays. Accordingly, lexical measures were calculated on samples of equal length from each essay. The sample size was equal to the length of the shortest essay, i.e. 180 words. Rather than taking the first 180 words in each essay, 180 words were randomly sampled from each essay. The sampling of essays itself brings statistical error into the process. In order to gauge this, the reliabilities of each measure were calculated by correlations of repeated samplings over twenty essays. Correlations varied from r = .73 to .97. Because sample length was standardized for analysis, the measure of lexical variation simply became the number of lexical types in the 180-word sample.
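The random sampling step can be sketched as follows: draw repeated 180-word samples from an essay and observe how stable a measure such as the number of different types is across samplings. The essay tokens below are synthetic placeholders.

```python
# Repeated 180-word sampling from an essay, recording the number of different
# types in each sample to get a feel for the sampling error involved.
import random

def sample_types(tokens, sample_size=180, seed=None):
    rng = random.Random(seed)
    sample = rng.sample(tokens, sample_size)   # sample without replacement
    return len(set(sample))

rng = random.Random(0)
essay_tokens = [f"word{rng.randint(1, 150)}" for _ in range(320)]  # synthetic 320-token essay
values = [sample_types(essay_tokens, 180, seed=i) for i in range(20)]
print(values)
```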

The distributions of the three lexical measures, lexical variation, lexical rareness and lexical error were investigated. Lexical variation scores ranged from less than 40 to over 70 with a mean value of 55. Some of the French students scored as well as the native speakers in terms of lexical variation. The distribution of lexical rareness ranged from 0.1 to 0.6 with a mean of 0.28. The distribution of lexical error ranged from slightly over 0 to about 18 with a mean of 6.57.

There were no significant correlations between essay length and any of the three lexical measures. Comparisons with vocabulary test scores showed significant correlations between test

scores and lexical variation (r = .36) and test scores and lexical error (r = -.21). There were also significant correlations between lexical variation and lexical rareness (r = .27) and also lexical variation and lexical error (r = -.24). Because of the correlations between test scores and lexical variation, and test scores and lexical error, Arnaud concludes that discrete item vocabulary tests are valid indicators of real life language performance. The author suggests checking these lexical diversity measures over two essays written by the same students. He also suggests using the following formula for lexical diversity, which he terms vocabulary richness (VR):

VR = LV + LR - 2 × LE

where LV is lexical variation, LR is lexical rareness and LE is lexical error.

2.6.2 Comments

Although there may be doubts as to whether this experiment fulfills the stated aim of validating discrete-item vocabulary tests by a comparison with measures of lexical diversity, we nevertheless gain a lot of valuable insights about measures of lexical diversity and their relationship with essay length and evaluations of essay quality. In particular, it is demonstrated that simple measures such as lexical diversity, length of essay, and lexical error are individually inadequate either to predict quality in practice or to provide a theoretical model for a valid assessment system. However, these simple measures together may offer a simple yet powerful model for evaluating overall essay quality.

Ultimately, the experiment fails in achieving the stated aim. Whether vocabulary tests can be sufficiently validated by measures of lexical diversity seems to be debatable. The rationale was to validate these tests by providing comparison with actual productive language use but lexical diversity may be too vague a concept for measuring productive vocabulary knowledge. This approach to validity involves arguing that a feature/test measures a trait by showing that it correlates with an already accepted method of measuring that trait. The problem in this case is that it is not clear whether any measure of lexical diversity is generally considered a measure of vocabulary knowledge. This is epitomized by the fact that Arnaud has not exactly decided how to measure lexical diversity at the beginning of the experiment. As the experiment proceeds, so the concept of lexical diversity evolves from a simple type-token ratio to a multidimensional concept including lexical variation, rareness and error. This study, in fact, highlights all too clearly many reasons why lexical diversity may not be a suitable candidate for validating other measures of vocabulary. Firstly, there is confusion in the literature about how measures of lexical diversity should be calculated. Different authors use different methods of calculation. In addition, lexical diversity is clearly affected by the length of the essay. Once one gets past these

problems of how to calculate lexical diversity, one runs into the thorny issue of deciding what it actually measures.

There is one thing that might have been done differently in such an experiment. From an experimental design point of view, it might have been better to decide the form of the lexical features to use before subjectively analyzing the essays. This pre-analysis could have been done using essays from outside the study. By selecting features according to characteristics of the essays under analysis, the generalizability of the results to other sets of essays is compromised.

Although the study may not have achieved its initial aims and the methods employed may be less than ideal, it does, however, address some very important and interesting issues. One is the role of essay length and simple measures of lexical diversity in essay assessment. Another is the fact that simple multidimensional approaches to measuring vocabulary knowledge are far superior to single measure approaches both in practical terms and in terms of validity.

This study clearly shows the merits and limitations of using essay length or a simple measure of lexical diversity to assess essays. Both these measures are useful to a certain extent. For example, their simplicity lends them to easy interpretation. Lexical diversity is versatile in that it can be changed in tune with a particular context as Arnaud’s selection of measures shows. The disadvantages of lexical diversity measures are well documented. They can be affected by essay length and sometimes comparisons are difficult across studies because different calculations have been used by different researchers.

Arnaud also supplies more evidence to appraise the value of essay length as a predictor of essay quality. Arnaud reports that short essays are good indicators of poor quality essays while average to long essays do not seem to fit a length-reflects-quality argument. Long essays were often assessed as average while essays assessed as high were often not as long. Contradictory findings relating to essay length and quality assessments over various studies simply reflect the commonsense reality that essay length is related to quality to some extent. This extent may vary from case to case. For example, there may be a limit to the amount of text a very low-level learner can produce in a timed format. Similarly, common sense should suggest that in a timed format, a longer essay is more likely to be written by a learner of high proficiency than a learner of low proficiency. But it is just as easy to see that essay length cannot be equated to quality. For example, one might imagine two learners with different writing styles where one writes freely as much as possible without editing while the other writes and leaves time for reviewing and

editing. If they are of a similar proficiency, the first may produce more text but with more errors and more internal problems that detract from the quality whereas the latter may have less text of a higher quality. So, essay length may be an indicator of quality to varying degrees depending on the conditions of the test and the nature of the learners. It is clear, however, that both essay length and simple measures of lexical diversity are inadequate by themselves to assess a complex trait like vocabulary knowledge or language ability.

Although features such as essay length or lexical diversity cannot predict total essay quality by themselves, a simple combination of such features may be able to account for assessments in a far better way practically. For example, even a simple “quantity” and “content” two-dimensional model including essay length and lexical diversity seems powerful and in tune with assessment realities. From a rater’s viewpoint, a long essay which has good vocabulary usage, an average length essay with average vocabulary, or a short essay with a poor use of vocabulary should all be easy to assess. A short essay exhibiting good vocabulary or a long essay exhibiting poor vocabulary may be more problematic to assess. This kind of “quantity” versus “content” two-dimensional model would probably perform quite well for assessing low level learners in a timed essay format. Similarly, the more complex concept of vocabulary richness proposed by Arnaud at the end of the study is a more complete model of vocabulary knowledge than a unitary measure of lexical diversity. It includes measures of lexical variation, rareness and error. It seems to correspond more with current studies that emphasize the multidimensional facets of vocabulary knowledge such as vocabulary size, depth of knowledge and lexical fluency. A challenge is to relate these theoretical dimensions of vocabulary knowledge to easily measurable traits of essays and likewise to assessment realities. The theoretical dimensions can help guide the selection of variables. Conversely, statistical measures found to discriminate between essays may also help inform theory.

2.7 Engber (1995)

2.7.1 Summary

This paper investigates the relationship between lexical proficiency and reader quality ratings of timed essays. In particular, it highlights the role of lexical diversity and lexical errors in quality judgments.

Sixty-six timed essays from intermediate to advanced level ESL learners from a variety of language backgrounds were holistically scored by anonymous raters and these scores were compared with the following four measures of lexical diversity based on the essays:

a) lexical variation
b) error-free lexical variation
c) percentage of lexical error
d) lexical density

The timed essay was written as a 35-minute portion of a two-hour test. The following title was chosen to give a culturally unbiased uniform product that would generate a sufficient amount of concrete vocabulary: How will studying in the US help your country?

The essays varied in length from 119 to 378 words. Ten teachers graded the essays on a six-point scale adapted from the TOEFL Test of Written English. For each essay, the highest and lowest scores were dropped and the average score was based on the remaining eight grades. Average scores varied from 1.6 to 5.6 with a mean of 3.4 and a standard deviation of 0.8. Inter-rater reliability was high with r = 0.93.

The measures of lexical diversity were calculated as follows:

1) lexical variation including error (LV1)
LV1 = lexical types per segment / lexical tokens per segment
i) Lexical items were defined as full verbs, nouns, adjectives and adverbs with an adjectival base.
ii) The sum of lexical types treated inflected forms of the same word as the same item.
iii) Lexical variation tends to decrease as sample size increases, so each essay was divided into 126-word segments, based on 1/3 of the length of the longest essay, 378 words. LV1 was then calculated for each segment (it is not clear from the paper how segments with fewer than 126 words were treated).

2) lexical variation excluding errors (LV2)
LV2 = (lexical types per segment - sum of lexical errors per segment) / lexical tokens per segment
i) Grammatical and syntactic errors were ignored.

3) percentage of lexical errors (%LE)
%LE = (lexical errors / lexical tokens) × 100

4) lexical density (LD)
LD = lexical tokens / total tokens
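A minimal computational sketch of the four measures defined above is given below, assuming that the lexical tokens, lexical types and lexical errors of a segment have already been identified by hand or by some other procedure; the counts in the example are invented.

```python
# The four measures, given pre-identified counts for a 126-word segment.
def lv1(lexical_types, lexical_tokens):
    return lexical_types / lexical_tokens

def lv2(lexical_types, lexical_errors, lexical_tokens):
    return (lexical_types - lexical_errors) / lexical_tokens

def percent_lexical_error(lexical_errors, lexical_tokens):
    return 100.0 * lexical_errors / lexical_tokens

def lexical_density(lexical_tokens, total_tokens):
    return lexical_tokens / total_tokens

# Illustrative counts only.
print(lv1(48, 60), lv2(48, 4, 60), percent_lexical_error(4, 60), lexical_density(60, 126))
```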

The correlation of holistic scores with lexical density was r = 0.23 and not significant. The lexical density values compared favourably with Ure’s (1971) standard of at least a lexical density of 40% in native speaker texts. There were 10 essays with lexical density slightly less than 40%. In 36% of essays, lexical density was greater than 45%, in 73% of essays, it was greater than 42% and in 85% of essays, it was greater than 40%. The correlation of holistic scores with percentage of lexical error was r = -0.43 which is significant at p < .01. This suggests that score increases as error decreases. With lexical variation including errors, the correlation was r = 0.45, significant at p < .01, and with error excluded it was r = 0.57 at p < .01. The latter three results suggest that readers of essays are negatively affected by lexical errors.

Error and lexical variation together account better for holistic score than either error or lexical variation individually. This suggests that readers give higher scores to essays that use a variety of lexis correctly.

2.7.2 Comments

This paper has two important findings. The first is that both lexical diversity and lexical error have a considerable effect on holistic impressions of L2 writing. The second important finding is that together lexical diversity and lexical error account for holistic impressions better than either one can by itself. However, despite these positive findings, both lexical diversity and lexical error pose challenges if they are to be incorporated into automatic assessment. This is because they involve judgments that are difficult for a computer to make.

An important consideration with these measures is how practical they are. One element of this is whether the calculation can be automated and done by computer. Unfortunately, judging lexical errors is difficult even for experienced teachers and examiners. Some of the problems with lexical errors are:
• judging whether something is an error or not.

• judging whether an error is a lexical error or some other kind of error.
• judging to what degree complex compound errors are lexical errors.
• counting errors - a simple count of errors may treat all errors as equally serious.

Also, it may be useful to make a distinction between errors and mistakes. An error is associated with a lack of knowledge or a feature not yet acquired. On the other hand, a mistake involves something the learner already knows. If a learner uses a word one time and misspells it, it may be assumed that the learner has a gap in knowledge. However, if the learner uses a word a number of times, but only misspells it one time, one might assume that the learner knows how to spell it and has made a mistake much as a native speaker may do.

Accordingly, error analysis of an essay often needs careful consideration by an experienced professional and is likely to take at least as much time as a holistic evaluation of the essay. Therefore, it seems difficult to integrate lexical error measurement into automatic assessment.

A very simple related measure is the type-token ratio. This is simple to calculate automatically because it only demands some simple word counts. However, both lexical variation and lexical density include only lexical items in part of the calculation. This makes the calculation more complex since a decision needs to be made as to what constitutes a lexical item. An exhaustive list of lexical items for a program to check might be possible for task specific applications but cannot be compiled easily in more general cases. Nevertheless, it is still possible to estimate lexical items indirectly. First, a list of function words could be compiled and the number of function types and tokens for the essay calculated. Then an estimate of lexical types and tokens could be calculated by subtracting function types and tokens from total types and tokens. This is a relatively simple method of calculating a feature indirectly. It raises the possibility of measuring lexical error indirectly. Perhaps error could be estimated by somehow recognizing words that are correct. Another possible method of estimation would be to identify a feature that is associated with error.
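The indirect estimate described here can be sketched as follows: subtract function-word types and tokens from the totals to approximate lexical types and tokens. The function word list below is a tiny illustrative fragment, not a complete list.

```python
# Approximate lexical types and tokens by removing function words from the totals.
FUNCTION_WORDS = {"the", "a", "an", "is", "are", "was", "were", "in", "on", "at",
                  "of", "to", "and", "but", "or", "i", "you", "he", "she", "it",
                  "we", "they", "my", "your", "this", "that", "with", "for"}  # fragment only

def estimated_lexical_counts(text):
    tokens = [w.lower().strip(".,!?") for w in text.split()]
    types = set(tokens)
    function_tokens = [t for t in tokens if t in FUNCTION_WORDS]
    function_types = {t for t in types if t in FUNCTION_WORDS}
    lexical_tokens = len(tokens) - len(function_tokens)
    lexical_types = len(types) - len(function_types)
    return lexical_types, lexical_tokens

essay = "My town is small but it has a beautiful park and a famous castle."
print(estimated_lexical_counts(essay))
```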

This study shows that raters are influenced not only by lexical choices but also by lexical error. Measures that incorporate aspects of both lexical choice and lexical error are the best at predicting overall quality. However, error is difficult to calculate automatically. One challenge is to devise an alternative predictor of error that is simpler to calculate.

2.8 Ferris (1994)

2.8.1 Summary

Ferris investigates lexical and syntactic features of L2 essays written by learners of different proficiencies. Discriminant analysis is employed to see how well certain features can identify essays from the two proficiency groups and a stepwise multiple regression is used to see how well these features can predict holistic ratings.

Essays on the effects of culture shock written as a 35-minute portion of a university placement test were collected. One hundred and sixty students from four different L1 backgrounds took part, forty each from Arabic, Chinese Mandarin, Japanese and Spanish L1 backgrounds. The essays were rated from 1 to 10 by three independent judges and these ratings were summed to give a total holistic score ranging from 3 to 30. Learners were split into two proficiency levels: Group 1 consisted of 60 low level learners and Group 2 consisted of 100 high level learners. The average holistic scores for each group are shown in Table 2.8.

Table 2.8: Average holistic scores by proficiency

          Average score
Group 1   14.8
Group 2   22.9

The essays were analyzed for the occurrence of 62 quantitative lexical and syntactic features. These features were refined to 28 for further analysis as shown in Table 2.9.

Table 2.9: 28 quantitative lexical and syntactic features in the analysis

1 Number of words; 2 Words per sentence; 3 Word length; 4 Present tense verbs; 5 Past tense/perfect aspect; 6 1st/2nd person pronouns; 7 3rd person pronouns; 8 Impersonal pronouns; 9 Adverbials; 10 Special lexical classes; 11 Relative clauses; 12 Modals; 13 Negations; 14 Stative forms; 15 Coordination; 16 Passives; 17 Complementation; 18 Prepositional phrases; 19 Participials; 20 Nominal forms; 21 Coherence features; 22 Definite article reference; 23 Deictic reference; 24 Repetition; 25 Comparatives; 26 Lexical inclusion; 27 Synonymy; 28 Reduced structures

A discriminant analysis was carried out to see how well the 28 features distinguished between the two proficiency groups and correlation coefficients of stepwise multiple regression were

analyzed to see how well the same features predicted holistic scores. Table 2.10 shows standardized means and standard deviations (for an average length of 225 words) of occurrences of 18 features which showed a significant difference between the two groups.

Table 2.10: Means and standard deviations for 18 features by group

Feature                        Group 1 mean   Group 1 sd   Group 2 mean   Group 2 sd
Number of words                187.43         25.63        251.38         32.71
Specific lexical classes       4.55           1.30         7.32           2.45
Complementation                4.52           1.76         7.52           2.12
Prepositional phrases          16.87          5.33         23.33          6.24
Synonymy / antonymy            1.82           .64          3.25           1.32
Nominal forms                  4.13           .96          6.88           2.05
Stative forms                  7.75           1.87         11.01          3.02
Impersonal pronouns            6.12           2.54         9.02           2.89
Passives                       .93            .37          1.80           .86
Relative clauses               1.10           .24          1.94           .65
Deictic reference              12.75          3.18         16.84          4.13
Definite article reference     5.80           1.09         8.18           2.53
Coherence features             6.87           .78          8.77           1.96
Participials                   .42            .11          .95            .43
Negations                      2.07           .55          2.84           .62
Present tense                  15.80          3.73         18.87          5.09
Adverbials                     2.73           .88          3.70           1.21
1st/2nd person pronouns        12.98          3.23         16.14          4.57

Results of the discriminant analysis showed that:
• the 28 features divided the learners into two groups with 82% accuracy.
• higher level learners used more textual features than lower level learners.
• the data supports earlier research that a factor involving passives, nominalizations, conjuncts and prepositions was positively correlated with holistic scores. All were used with greater frequency by learners of higher proficiency.
• advanced learners have more lexical and syntactic tools available, use more cohesive devices and use more devices which show pragmatic sensitivity than less advanced learners.

Table 2.11: Results of a stepwise regression analysis

Step/feature                     Contribution to r2 (%)
Number of words                  37.6
Synonymy / antonymy              6.0
Word length factor               3.3
Passives                         2.5
3rd person/impersonal pronouns   0.9

F(1,157) = 11.92; p < .0001; r2 = 0.503

The results of the stepwise multiple regression in Table 2.11 show the features that were good predictors of holistic scores. Ferris concludes that since variety of lexical choices, syntactic constructions and cohesive devices seem to correlate with higher holistic scores, teaching should include more emphasis on these features and also student awareness of these features should be encouraged.

2.8.2 Comments

This analysis also clearly shows the correlation of essay length and other features with holistic ratings. However, there are some weaknesses in this kind of analysis where some features occur only rarely. The merits of discriminant analysis and multiple regression in automatic assessment are also considered.

One striking feature of this experiment was the influence of essay length on the multiple regression. It was the first feature to come out of the stepwise regression, accounting for 37.6% of the variance out of the 50.3% accounted for by the first five features. Despite this major influence, no mention is made of it by the author. Discussion and recommendations only involve the features that account for 6% or less of the variance.

This study highlights the dangers of using features that occur only infrequently. At this point, it is worth remembering that the occurrences in Table 2.10 are adjusted figures based on an average essay length of 225 words. This means that Group 1 occurrences are overstated by a factor of 225/187.43 and Group 2 occurrences understated by a factor of 225/251.38. For example, the real mean occurrence of participials for Group 1 is 0.42 x (187.43/225) = 0.35 and in Group 2 is 0.95 x (251.38/225) = 1.06. These means are based on count data which is discrete, i.e. taking only values 0, 1, 2 .... This means there is little scope for variation on the downside of these already low means. Scores for participials in Group 1 will be mostly 0 but will include many scores of one or more as well. Group 2 will have many scores of one, two or more but some of 0 as well. Therefore we are likely to see considerable overlap between the groups.

Low values may be less problematic for some kinds of feature than for others. Two possible types of feature are acquisitional and stylistic. Passives are an example of an acquisitional feature. Sometimes acquisition of a feature may give clues to proficiency. In that case, presence of such a feature could help in assessment. On the other hand, absence of an acquisitional feature from an essay does not tell us much about proficiency. This is because we do not know whether the student has not acquired the feature yet or just has not used it in this task. Stylistic

features may also be useful in assessment if they are used to a different degree by students of different proficiencies. However, if these features occur rarely, they will be less useful because of the problem of reliability. If a feature occurs infrequently, the ranges of occurrence will be relatively high. If a feature is used twice in one essay and not at all in another essay, then in terms of reliability that is a huge difference. Reliability of variables is one of the biggest threats to validity of multiple regression analyses.

Regression and discriminant analyses assume that variables are normally distributed but rarely occurring features do not follow normal distributions. To illustrate this, consider again the case of participials for Group 1 as shown in Table 2.12.

Table 2.12: Statistics for participials in Group 1

Number of students                           60
Mean number of participials (unadjusted)     0.35
Standard deviation                           0.09

In a normal distribution, 68% of points lie within +/- one standard deviation of the mean, 95% within two standard deviations and 99.7% within three standard deviations. But for participials, +/- 3 standard deviations gives us the range 0.08 to 0.62. Rather than 99.7% of values being in this range, incredibly none of the values are in this range and 100% are outside it. This is because this is discrete count data taking values 0,1,2,3....

Clearly, rarely occurring features can be problematic. In this study, nine of the eighteen significant features could be classified as rare with average values below five in either Group 1 or 2.

Multiple regression analysis is a useful approach since it allows various factors to be taken into account in determining scores. Any single lexical feature is unlikely to be able to predict holistic scores by itself. Therefore, one way to proceed is to consider several features together to strive toward a better predictor. Discriminant analysis allows possible predictor features to be identified and then multiple regression can help build a model to predict scores. However, caution is needed at this point. These features might only inform about scores in this case. If that is true, they are not actually predicting but describing. The real proof would be to try the model built from multiple regression on another set of essays to see how it performs.
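The validation step argued for here can be sketched as a simple two-stage workflow: a discriminant analysis to check how well a set of features separates the proficiency groups, followed by a regression model fitted on one subset of essays and evaluated on essays held out from model building. The feature matrix, group labels and holistic scores below are random placeholders, not Ferris's data.

```python
# Two-stage sketch: discriminant analysis for group separation, then a
# regression model evaluated on held-out essays. All data are simulated.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(160, 28))                              # 160 essays x 28 features (placeholder)
group = (X[:, 0] + rng.normal(size=160) > 0).astype(int)    # two proficiency groups
holistic = 15 + 3 * X[:, 0] + rng.normal(size=160)          # holistic scores

# Step 1: how well do the features separate the two groups?
print("classification accuracy:",
      cross_val_score(LinearDiscriminantAnalysis(), X, group, cv=5).mean())

# Step 2: fit the regression on one set of essays, evaluate on held-out essays.
X_train, X_test, y_train, y_test = train_test_split(X, holistic, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("r^2 on held-out essays:", model.score(X_test, y_test))
```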

In this study, the results of the regression show that five features including lexical and syntactic

features account for over 50% of the variance of the holistic scores. Three of the features, essay length, synonymy and word length, are lexical and account for 46.9%. These lexical features are probably applicable in other cases because of their general nature while the other two features, passives and 3rd person/impersonal pronouns, seem more likely to be influenced by the context of this particular writing task.

This study develops a possible approach for assessing student compositions. A model using a number of features may be able to account for quality where one single feature may not. Essay length in conjunction with lexical and syntactic features seems to be able to account reasonably well for judgments of quality. The study again underlines the importance of essay length, which is not only a good predictor of holistic scores but also more useful than rarely occurring features because of the reliability afforded by its magnitude. Infrequent occurrences of some syntactic and lexical features, while being useful indicators of acquisition, are problematic because their rarity may violate the normality and reliability assumptions of the statistical models involved.

2.9 Laufer & Nation (1995)

2.9.1 Summary

This paper examines the vocabulary of written essays in terms of frequency in general English. Laufer & Nation describe the Lexical Frequency Profile and investigate its reliability and validity for assessing the vocabulary content of learner written productions.

The Lexical Frequency Profile (LFP) shows the percentage of words in a piece of writing belonging to different frequency levels. There are four frequency levels that can be used: the first 1000 most frequent words, the second 1000 most frequent words, the University Word List (UWL) (Nation, 1990) and any other words. Suppose an essay contains 200 word families and 150 belong to the first 1000 words, 20 to the second, 20 to the UWL and 10 are not in any list. Then the LFP of the essay is 75%-10%-10%-5%.
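As a rough illustration of how a profile of this kind can be computed, the sketch below derives an LFP from placeholder word lists. The sets first_1000, second_1000 and uwl stand in for the real frequency lists, the tokenisation is deliberately naive, and word forms rather than word families are counted.

```python
# Minimal sketch of a Lexical Frequency Profile (LFP) calculation.
# The word lists here are tiny placeholders; a real profile would use the
# full 1st 1000, 2nd 1000 and UWL lists and count word families rather
# than raw word forms.
def lexical_frequency_profile(text, first_1000, second_1000, uwl):
    types = set(text.lower().split())          # naive tokenisation, no lemmatisation
    counts = {"1st 1000": 0, "2nd 1000": 0, "UWL": 0, "Not in lists": 0}
    for word in types:
        if word in first_1000:
            counts["1st 1000"] += 1
        elif word in second_1000:
            counts["2nd 1000"] += 1
        elif word in uwl:
            counts["UWL"] += 1
        else:
            counts["Not in lists"] += 1
    total = sum(counts.values())
    return {band: 100.0 * n / total for band, n in counts.items()}

# Toy usage: 'research' stands in for a UWL item, 'sprocket' for an off-list word.
print(lexical_frequency_profile(
    "the man saw the woman and described the research sprocket",
    first_1000={"the", "man", "saw", "woman", "and"},
    second_1000={"described"},
    uwl={"research"},
))
```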

In practice, different LFP profiles may be constructed depending on the proficiency of the learners involved. For less proficient learners, three levels consisting of the 1000 most frequent words, the second 1000 most frequent words, and any other vocabulary may be suitable. However, for more advanced learners, three levels consisting of the second 1000 most frequent words, the UWL, and words not in either of these two lists nor in the 1000 most frequent words may be more appropriate. The authors argue that the LFP has advantages over other measures of lexical diversity, which all have weaknesses.

An experiment was conducted to investigate the reliability and validity of the LFP. The subjects were 65 foreign learners of English split into three groups of varying proficiency. Group 1 was the lowest proficiency level and consisted of 22 learners studying in New Zealand. The other 43 learners were studying in the English department of an Israeli university. Group 2 was the middle proficiency level and consisted of twenty of these learners who were in their first semester. Group 3, the highest proficiency level, consisted of the other 23 learners who had finished two semesters.

Two essays of length 300-350 words were written by each group in the space of one week. The following essay was done by all subjects: Should a government be allowed to limit the number of children a family can have? Discuss this idea considering basic human rights and the danger of population explosion.

For the second essay, subjects had a choice of three essay topics on other controversial issues. Subjects also took the active version of the Vocabulary Levels Test (VLT) (Laufer & Nation, 1999).

The completed essays were put into computer form and the following data processing was carried out:
• only the first 300 words were used.
• if a word was used incorrectly it was omitted.
• if a word was used correctly but misspelled, the error was corrected and counted as a word familiar to the student.
• a wrong derivative was not considered an error.
• proper nouns were omitted.

The LFP was calculated by recording the percentage of words in the four following categories: the 1000 most frequent words, the second 1000 most frequent words, the UWL, and words not in any of these three lists.

The following questions regarding validity were considered:
a. Is there a significant difference between LFPs of different language proficiency levels?
b. Does LFP correlate highly with scores on the active version of the Vocabulary Levels Test?

Mean percentages of words in the four categories are shown in Table 2.13.

Table 2.13: Mean percentages of word families at different frequency levels

          1st 1000            2nd 1000            UWL                 Not in lists
          Essay 1   Essay 2   Essay 1   Essay 2   Essay 1   Essay 2   Essay 1   Essay 2
Group 1   86.5      87.5      7.1       7.0       3.2       4.1       3.3       2.8
SD        3.8       5.3       2.0       2.3       1.8       2.5       2.3       1.8
Group 2   79.7      79.4      6.7       6.8       8.1       7.8       5.6       6.6
SD        5.3       4.5       1.7       2.2       2.3       2.3       3.5       3.3
Group 3   77.0      74.0      6.6       5.6       8.1       10.1      7.5       8.7
SD        6.1       5.9       2.6       2.5       3.2       2.9       2.9       3.5
F-test    19.35     33.1      0.29      1.89      24.86     27.40     10.46     22.74
p-value   .0001     .0001     .75       .16       .0001     .0001     .0001     .0001

In regard to the first question, an ANOVA revealed a significant difference between the groups in the percentage of the 1000 most frequent words. For the first essay, Group 1 was different to Groups 2 & 3, using more words of the highest frequency. For the second essay, all groups were significantly different, with Group 1 having the largest incidence of the highest frequency words and Group 3 the least. There was no significant difference between incidence of the second 1000 most frequent words. For UWL words, there was a significant difference for the first essay between Group 1 and the other groups with Group 1 using fewer words than the others. For the second essay, there was a significant difference between the three groups with Group 1 using the fewest words and Group 3 the most. The three groups were also significantly different in terms of use of not in list words for both compositions with Group 1 using the fewest and Group 3 the most.

To address the second question, LFPs were compared with composite scores on the active version of the VLT. The results of this comparison are shown in Table 2.14.

Table 2.14: Correlations between the LFP and the VLT

     1st 1000            2nd 1000            UWL                 Not in lists
     Essay 1   Essay 2   Essay 1   Essay 2   Essay 1   Essay 2   Essay 1   Essay 2
r    -.7       -.7       .01       .2        .7        .6        .6        .8
p    .0001     .0001     .9        .3        .0001     .001      .0002     .0001

Learners with higher scores on the VLT used more of the UWL and not in list words. There was a negative correlation between scores on the VLT and the number of 1000 most frequent words.

Two questions regarding reliability were considered as follows:

a. Do LFPs in two sets of essays by the same learners correlate highly with each other?
b. Do the percentages of words at each frequency level in the two sets of essays correlate highly?

A within-subject analysis was carried out on the two sets of essays. The analysis was done firstly for each word frequency level and then for proportions among the frequency levels. For Group 1, there were very few not in lists words and so they were added to UWL words before comparison. There were no significant differences for Groups 1 and 2 but Group 3 showed differences on the first 1000 words, UWL, and the proportions suggesting that the LFP might not be reliable for advanced learners. However, the first 1000 words include many function words and very basic lexical items and it may be argued they are not representative of a developed lexicon. When they are omitted from the analysis, there are no significant differences between the two essays.

Laufer & Nation conclude that the LFP is a reliable and valid measure of lexical use in writing. It provides stable results over pieces of writing by the same learner and discriminates between learners of different proficiencies. It correlates well with an independent measure of vocabulary knowledge. Another advantage is that the analysis can be done entirely by computer.

2.9.2 Comments There are some obvious weaknesses in the analysis and interpretation of the LFP in this experiment. However, these problems should not detract from the usefulness of LFP as a research tool.

There seem to be some problems with the analysis which undermine some of the claims made about the reliability and validity of the LFP. The ANOVA on which the validity argument is based seems inappropriate given the nature of the data. One of the assumptions of ANOVA is that the groups are independent. This seems to be compromised by the profile percentage data, which add up to 100%. If there is a rise in one part of the LFP, there must be a fall somewhere else in response. It would have been safer to test for differences at each profile level separately. The same can be said for the MANOVA in relation to the reliability question. This basic problem means that many of the claims remain unproven.

Various shortcomings of the LFP have also been found. This is partly because it has been taken up so readily by the research community and used in many studies, but also partly because the claims of this study were oversold. The following weaknesses in the LFP have been found:

• Muncie (2002) finds that because the elements in the LFP are dependent on each other, they can change in unpredictable ways. For example, the ratio of above-2000 words would not only change if an above-2000-level word were added to an essay, but also if a below-2000-level word were removed.

• Laufer & Nation themselves report that LFPs for essays with fewer than 200 words may be unstable. However, many low level learner texts contain fewer than 200 words. At the other extreme, the LFP was found to be unreliable for advanced learners.

• Smith (2004) argues that the LFP may not be stable even for essays over 200 words. He also argues that LFP might be able to discriminate between groups of learners but is not sensitive enough to discriminate between individuals.

• Meara (2005) also raises doubts about the discriminatory properties of the LFP. Monte Carlo simulations suggest LFPs may differentiate between low level learners and very advanced learners but the LFP may not be sensitive enough to discriminate reliably between groups with more modest differences in vocabulary size.

• Meara & Bell (2001) also argue that the low percentage of UWL and not in list words can be problematic, especially for low level learners when the number may come close to zero.

The problems with this measure should not detract from its usefulness as a research tool. Its usefulness seems to stem from its difference from other measures of lexical diversity. Other measures are unitary measures in which a larger value usually indicates improved performance.

Figure 2.1: A graph of a 75%-10%-10%-5% LFP (a bar chart showing the percentage of word families in each of the bands 1st 1000, 2nd 1000, UWL and Not in lists)

Because the LFP as a profile has four elements, it gives more detailed information. It lends itself to graphical representation, which may be an aid in interpretation. For example, a graph for an LFP of 75%-10%-10%-5% is shown in Figure 2.1. The LFP lends itself to vocabulary development research because more detailed changes may be spotted in profile or graph form than in a simple statistic. It has also been used in automatic scoring. Goodfellow, Jones & Lamy (2002) developed an automatic assessment system for L2 learners of French based on the LFP. They found that LFP values correlated significantly with scores awarded by human raters. They used the LFP profiles as a basis for feedback to learners. They believe it can form the basis of a system to help learners evaluate their own writing. Where it may not be so useful is in recognizing absolute differences between essays.

The LFP is a welcome addition to the range of measures of lexical diversity. In fact, it adds another dimension to this range because it evaluates diversity against a yardstick of the general frequency of words in English. However, the LFP needs to be used cautiously bearing its limitations in mind.

2.10 Meara & Bell (2001) 2.10.1 Summary This paper by Meara & Bell (2001) describes P_Lex, another method to assess the lexical complexity of short L2 written essays by a comparison with frequency in general English. P_Lex is similar to the Lexical Frequency Profile but uses a more sophisticated mathematical methodology and the authors claim it works better with shorter texts than the LFP.

Most measures of lexical diversity or complexity depend only on the number and frequency of the words in the text but the nature of these words in terms of their frequency in English or their difficulty is not addressed. Meara & Bell call these intrinsic measures of lexical diversity. The counterparts to these measures are extrinsic measures which evaluate the type of words that are being produced. The LFP is one example of an extrinsic measure that categorizes the words learners use according to their frequency in general English.

However, Meara & Bell found the LFP to have some weaknesses. They note that a) it is not very reliable b) it has poor measurement properties c) it does not discriminate well between texts and d) low numbers of words in the lower frequency levels may lead to instability. Another major disadvantage of the LFP is that Laufer & Nation (1995) themselves found it unstable for shorter texts under 200 words.

P_Lex was developed as an alternative to the LFP with the aim of evaluating shorter L2 texts. The underlying assumption of the P_Lex process is that learners with large vocabularies are more likely to use infrequent words than learners with smaller vocabularies.

The method of P_Lex involves dividing an essay into ten-word segments and counting the number of easy and difficult words in each segment. Easy words are defined as words in the 1000 most frequent words plus proper nouns, numbers and geographical derivatives. Other words are classified as difficult. The number of these ten-word segments containing N difficult words is tallied. This distribution for different values of N tends to be skewed towards zero because the number of difficult words in each ten-word segment tends to be low. This distribution can be modeled using the Poisson distribution with mean lambda. Lambda values tend to vary from 0 to 4.5, but can be higher, with higher values indicating a higher use of infrequent words.
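A simplified sketch of this procedure is given below. The set easy_words is a placeholder for the 1000-word list plus proper nouns, numbers and geographical derivatives, and lambda is estimated here simply as the mean number of difficult words per segment (the maximum-likelihood estimate for a Poisson distribution); the published P_Lex program fits the observed distribution, so its values may differ.

```python
import math

# Simplified sketch of the P_Lex idea: split the essay into ten-word segments,
# count the "difficult" words in each segment, and summarise the distribution
# with a Poisson parameter lambda. Lambda is taken here as the mean count per
# segment; the published P_Lex program fits the observed distribution.
def plex_lambda(text, easy_words, segment_size=10):
    tokens = text.lower().split()
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens), segment_size)]
    difficult_counts = [sum(1 for w in seg if w not in easy_words)
                        for seg in segments]
    lam = sum(difficult_counts) / len(difficult_counts)
    # Expected proportion of segments containing N difficult words under Poisson(lambda)
    expected = {n: math.exp(-lam) * lam ** n / math.factorial(n) for n in range(5)}
    return lam, difficult_counts, expected

# 'easy_words' stands in for the 1000 most frequent words plus proper nouns,
# numbers and geographical derivatives.
lam, counts, expected = plex_lambda(
    "the man went to the shop and bought an extraordinary contraption "
    "then he walked home and told his astonished neighbour about it",
    easy_words={"the", "man", "went", "to", "shop", "and", "bought", "an",
                "then", "he", "walked", "home", "told", "his", "about", "it"},
)
print(lam, counts)
```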

An experiment was performed to test the reliability and validity of P_Lex. Forty-nine L2 learners taking part in a university summer programme, from a variety of L1 backgrounds and ranging in proficiency from lower intermediate to advanced, were involved. Each learner was asked to produce two pieces of written work of length 300 words within a week. The two essay topics were also used by Laufer & Nation (1995), as follows: a) Should a government be allowed to limit the number of children a family can have? b) A person cannot be poor and happy.

Each learner also took the active version of the Vocabulary Levels Test (VLT) (Laufer & Nation, 1999).

The shortest essay produced was 250 words long, so this was set as a standard and the first 250 words of each essay were used in the analysis. P_Lex was calculated for each essay for each learner. The reliability of P_Lex was then assessed by comparing scores on the two essays.

There was no significant difference between the mean P_Lex values for each essay. The P_Lex values for the two sets of essays correlated with r = 0.655. To test the discriminatory power of P_Lex, the learners were split into two proficiency groups based on the results of the VLT. T-tests showed that the P_Lex scores of these groups were different. P_Lex scores and scores on the VLT correlated significantly with r = 0.565 for the first essay and r = 0.339 for the second essay. To determine what length of essay was necessary for P_Lex to work properly, values were taken for essays of various lengths for learners of varying proficiency. P_Lex was found to be stable from about 120 words. Low-level learner essays seemed to stabilize at shorter essay lengths, from about 90 words. To reinforce this finding, the same reliability tests were carried out with only the first 150 words of each essay. The results were very similar to the earlier ones.

P_Lex produces data similar to the LFP but it is more reliable with shorter texts than the LFP, particularly for low-level learners. However, there are problems with validation. Firstly, there are no tests of productive vocabulary with which to compare the data. Secondly, it is debatable whether difficulty can be defined purely in terms of frequency. Words may be unusual depending on the context of the writing task. Therefore it is suggested that P_Lex be used with a set of standardized tasks. Difficult words might be defined with the help of native speaker choices on the same tasks. Other applications for P_Lex are also suggested. One is in the evaluation of the difficulty of examination texts. Another is the identification of students with abnormalities in their writing, for example, an over-reliance on Romance cognates.

2.10.2 Comments P_Lex is a useful new approach to measuring lexical diversity. In particular, it may be a better alternative to the LFP for evaluation of short low-level L2 texts. However, P_Lex may exhibit some of the same limitations as the LFP. Moreover, it is not clear that the P_Lex format is in fact superior to a simple ratio based on the same frequency levels.

One of the strengths of the LFP is that it charts vocabulary frequency over four bands. This makes it useful for detailed analyses. However, this is also a weakness when attempting to use it for comparison. In this sense, P_Lex is easier to use because it produces a single statistic. P_Lex values are therefore easier to use for comparisons of large numbers of essays or when comparing with other measures. For example, it is easier to correlate another measure with a unitary P_Lex value than with a three-value LFP. However, any measure that is based on two frequency bands may ultimately face a problem: if the learners doing a particular task do not vary sufficiently in their usage of words across the two bands, then the measure may be unsuitable. This is equivalent to trying to differentiate students with a test that is either much too difficult or much too easy. With this in mind, P_Lex is probably more useful for lower level essays. The LFP is probably more useful for higher level essays.

One unclear aspect is the so-called sophistication of P_Lex. Meara & Bell do not show that it is better than a simple ratio of difficult to easy words. P_Lex is certainly a more complex measure than a simple ratio but it is not clear that it is more sophisticated. The golden rule in statistical analysis is to only use a more complex analysis if it is superior to a simpler analysis. The reasons for this are twofold. Firstly, a more complex measure usually involves more assumptions which are often unrealistic and may limit its application. In this case, the fitting of a Poisson distribution involves an assumption that occurrences of rarer words are independent. This may be unrealistic. Secondly, a more complex measure is more difficult to interpret. In this case, it is quite easy to imagine possible limitations of a simple ratio based on frequency levels but not so easy to foresee the limitations of P_Lex.

Meara & Bell’s categorization of intrinsic and extrinsic measures of lexical diversity is a useful one. There may well be advantages in using each type of measure. Extrinsic measures may be useful for comparative purposes and may be more sensitive and informative for particular tasks and contexts. However, this sensitivity seems to come at a price: care needs to be taken in selecting an appropriate task to elicit vocabulary at the relevant levels, and care also needs to be taken that the choice between the LFP and P_Lex suits the proficiency of the learners, the task and the context.

In order to improve the performance of P_Lex, Meara & Bell suggest it be used on standardized tasks where words are evaluated against native speaker choices. It is an interesting idea which also reminds us to consider the validity of the benchmark data of any extrinsic measure. Both the LFP and P_Lex use the most frequent words in English. These are, in fact, the most frequent words across various contexts and genres of English. In reality, every combination of context and genre will have its own hierarchy of most likely used words. How well this set of likely words is represented by the general set of most frequent words may well determine how well these measures perform. For Japanese learners, the JACET word list (Ishikawa et al., 2003) might be a better alternative to base these measures on. The JACET list is based on words that Japanese learners are exposed to in their education system. It might also be that the 1000 word and 2000 word distinction is too arbitrary. Perhaps, with a standardized task, the easy words could include function words plus words commonly associated with that task. The difficult words could then be other words excluding proper nouns.

A basic assumption for assessment is that there is some kind of progression in the measure that corresponds with increased proficiency. For the LFP, perhaps there is a general trend towards the higher bands as proficiency increases. For P_Lex, the score should increase as proficiency increases. Some evidence of this sort may be a necessary first step in validating it for assessment purposes. With P_Lex, this validation might also involve an examination of the task itself. This would help to identify appropriate standardized tasks. There may be tasks that are inappropriate because they do not elicit enough difficult vocabulary even for high proficiency learners.

2.11 Meara, Jacobs & Rodgers (2002) 2.11.1 Summary This paper introduces a technique called lexical signatures. Lexical signatures are distinctive patterns of lexical choices of L2 learners. Various properties of lexical signatures in essays of L2 learners of French are investigated and their application to evaluation of L2 writing is considered.

Lexical signatures aim to identify elements of lexical choice in the writing of L2 learners. Lexical choice is an element of writing style or genre. In the field of authorship attribution, statistical aspects of L1 lexical choice are used to identify a writer in solving problems of disputed authorship. This experiment aims to identify signatures of L2 learners which may be able to characterize learners in terms of proficiency.

Lexical signatures are investigated by taking small samples of words from a corpus of essays and looking at patterns of usage of these samples in each essay. In particular, unique patterns of these samples are of interest. In practice, these lexical signatures are derived in the following way:
• First, a small sample of words, say six, is taken from the corpus, for example, the words page, blanc, couleur, pas, office and comme. These form a subset of words (page, blanc, couleur, pas, office, comme).
• Then, a binary representation of the subset is produced for each essay in the corpus. A one signifies presence of a word in the essay while zero signifies absence of the word. Accordingly, the binary representation (111000) of the (page, blanc, couleur, pas, office, comme) subset for a particular essay signifies that the essay contains the words page, blanc, and couleur but does not contain the words pas, office, or comme.
• Finally, the binary representations for each essay are compared. If an essay has a representation that is shared by no other essay, it is a unique representation or a unique lexical signature. The number of unique binary representations for each subset is counted.
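The procedure above is easy to sketch in code. The toy corpus and the helper below are illustrative only (three invented mini-essays and a subset built from the example words), but the logic of deriving binary signatures and counting distinct and unique ones follows the steps just described.

```python
from collections import Counter

# Sketch of the lexical signature procedure described above: for a chosen
# subset of words, each essay is reduced to a binary pattern (1 = word present,
# 0 = word absent), and the distinct and unique patterns are then counted.
def signatures(essays, subset):
    sigs = []
    for essay in essays:
        words = set(essay.lower().split())
        sigs.append("".join("1" if w in words else "0" for w in subset))
    counts = Counter(sigs)
    distinct = len(counts)
    unique = sum(1 for n in counts.values() if n == 1)
    return sigs, distinct, unique

# Toy corpus of three invented mini-essays; the subset uses the example words above.
essays = [
    "la page est blanc et la couleur est belle",
    "il y a pas de page au office",
    "comme la couleur de la page",
]
sigs, distinct, unique = signatures(
    essays, subset=["page", "blanc", "couleur", "pas", "office", "comme"])
print(sigs)                 # one six-digit pattern per essay, e.g. '111000'
print(distinct, unique)     # numbers of distinct and unique patterns
```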

In this experiment, lexical signatures were investigated in 59 essays written by L2 learners of French. The task was a fifty-minute timed essay describing an advertisement for holidays in Scotland. The essays contained a total of 2272 different words. The distribution of word frequencies in essays showed that most words appeared in very few essays and just a few words appeared in many essays.

Three sets of fifty words each were selected, Set H, Set M and Set L. Set H contained high frequency words which appeared in between 32 and 50 essays, Set M contained middle frequency words appearing in 19 to 29 essays and Set L contained low frequency words appearing in 10 to 13 essays. For each set, 25 subsets of six words were selected randomly. The occurrence of each six-word subset was found for each essay and recorded as a lexical signature in binary form with one indicating presence of a word and zero indicating absence of a word in that essay. For each subset, the number of distinct patterns and the number of unique patterns across the essays were tallied, and these counts were then averaged over the 25 subsets in each set. (Originally the authors planned for 64 essays, which would allow for perfect distinctiveness in a six-word subset since there are 2^6 = 64 possible six-digit binary patterns. If every essay showed a different binary pattern for a subset there would be 64 distinct patterns, all of which would be unique.) The results of the analysis are shown in Table 2.15.

Table 2.15: Lexical signatures for subsets of high, medium and low frequency words

        Mean distinct signatures   Mean unique signatures   Unique/distinct signature percentage
Set H   20.88                      16.88                    80.8
Set M   35.60                      21.24                    59.7
Set L   21.20                      10.72                    50.6

There was an average of 20.88 distinct signatures (out of a possible 59) for high frequency subsets, 35.6 for middle frequency subsets and 21.2 for low frequency subsets. There was an average of 16.88 unique signatures for high frequency subsets, 21.24 for middle frequency subsets and 10.72 for low frequency subsets. These averages masked considerable variability. For example, middle frequency subsets had an average of 35.6 distinct signatures but the highest number of distinct signatures produced by a subset was 41 distinct signatures while the lowest number produced was 32. The former can be called the best performing subset and the latter the worst performing subset. Similarly, for middle frequency subsets, the number of unique signatures varied from 29 to 16 with an average of 21.24.

Larger subsets of words showed, not surprisingly, both more distinct and unique signatures regardless of frequency of word as shown in Table 2.16.

Table 2.16: Distinct and unique signatures by subset size

Mean distinct signatures
Subset size (words)   six     seven   eight   nine    ten
Set H                 20.88   28.88   41.76   45.80   48.88
Set M                 35.60   43.56   49.72   54.40   55.12
Set L                 21.20   26.20   31.04   36.32   39.68

Mean unique signatures
Subset size (words)   six     seven   eight   nine    ten
Set H                 16.88   23.84   32.28   37.84   42.00
Set M                 21.24   31.80   42.28   50.48   51.80
Set L                 10.72   15.64   20.48   26.68   30.40

For middle frequency words, an average of 35.6 distinct signatures for a subset of six words increased to an average of 43.56 signatures for a subset of seven words and then averages of 49.72, 54.40 and 55.12 for subsets of eight, nine and ten words respectively. Similarly, unique signatures increased as size of subset increased regardless of word frequency. For middle frequency words, an average of 21.24 unique signatures for a six-word subset increased to 31.8 for a seven-word subset and to 42.28, 50.48 and 51.8 for eight, nine and ten word subsets respectively.

The authors argue that an index of uniqueness based on lexical signatures may be useful in recognizing work of lower level L2 students. Lower level students might be expected to have a more limited vocabulary and show fewer unique signatures while higher level students may have more unique signatures.

2.11.2 Comments Lexical signatures and the concept of uniqueness appear to offer a new dimension in the analysis of learner vocabulary. Looking at individual writers’ styles may also open the door to benefiting from the knowledge and methodology of the field of authorship attribution. Lexical signatures may have considerable potential in the evaluation of learner essays but the exact form and dynamics of a lexical signature approach need more research.

Lexical signatures or uniqueness more generally seem to hold promise as a concept for analyzing learner writing. They offer a new dimension in analyzing learner writing in the form of style. Authorship attribution is the statistical analysis of L1 writer style with the aim of solving disputed cases of authorship. One feature of this field is the array of sophisticated statistical techniques that have been developed. Some of these techniques may have applications in exploring the writing style of L2 learners. It is also interesting to investigate how a measure of lexical uniqueness could work beside measures of lexical diversity and extrinsic lexical measures such as P_Lex and LFP.

It seems possible that lexical signatures could be helpful in assessing quality of essays. The underlying assumption is that less proficient learners have more limited lexical choices which may be picked up by a measure of uniqueness. There seem to be various ways one could measure uniqueness. For example, the simplest measure of uniqueness might be the number of words in an essay that are unique to that essay, i.e. not present in any other essay in the cohort. Lexical signatures offer a very interesting approach to uniqueness and possible enhanced properties compared to a simple measure of uniqueness. For example, lexical signatures investigate the uniqueness of co-occurrence relations of words. In this regard, this approach resembles that of latent semantic analysis but with a much smaller input of words.

Although this paper and an earlier paper, Meara, Rodgers & Jacobs (2000), lay out a case for uniqueness and its potential for assessing essays, there are many questions remaining about how this measure would work in practice. This experiment has clearly shown the power of lexical signatures to spot differences in essays. What remains to be shown, though, is that these patterns of lexical signatures do indeed relate to stylistic lexical choices rather than, say, arbitrary lexical choices. Also it remains to be shown exactly how lexical signatures could identify quality in essays. Another question relates to the complexity of lexical signatures. Where simpler measures of uniqueness exist, it is important to make sure that the complexity of lexical signatures is justified by delivering more information than those simpler measures. This may mean proving that lexical signatures are superior to simpler measures of uniqueness. There may also be technical concerns related to the complexity of lexical signatures. In general, a complex measure has less validity than a simple one because it is more difficult to understand how it works and to interpret what its results mean.

The lexical signature approach certainly warrants more research. However, given its complexity, results need to be interpreted with caution and in tandem with consideration of other simpler and more transparent measures of uniqueness.

2.12 Discussion The experiments in this chapter have set out a case for automatic assessment for L2 learners. Page (1994) shows clearly that automatic assessment is not only a convenient and cheap alternative but can also improve reliability. This discussion section looks at two vital elements of an automatic scoring system: the features they involve and how they are put together. The field of authorship attribution is also considered as a potential source of both features and methodology.

2.12.1 Features The most challenging aspect of this research is to identify features of essays that can best assess essay quality. There has been considerable research on the relationship of certain features with human ratings of essays. Two of the most studied features are essay length and measures of lexical diversity.

2.12.1.1 Essay length Essay length has shown itself to be one of the best predictors of holistic scores in numerous studies. Both Larsen-Freeman & Strom (1977) and Ferris (1994) find essay length to have a major relationship with essay quality. The scoring systems developed by PEG and e-rater both involve direct or indirect influences of essay length. PEG employs a measure of the fourth root of essay length and e-rater involves various measures which are correlated with essay length.

Ferris (1994) found that out of 28 features, essay length was the one that most correlated with quality ratings. It contributed 37.6% of the variance in a stepwise regression analysis with the next most influential feature only contributing 6%. When dealing with low-level learners and timed tasks, the importance of essay length is likely to be emphasized. McNeill (2006), in a study of lexical features and quality in essays of Japanese college level EFL learners, found essay length to be the most influential indicator of holistic human ratings for a variety of written tasks both timed and non-timed. Essay length was measured as part of a set of indicators that included various measures of lexical diversity, error counts and T-unit counts.

McNeill also found evidence to support Page’s assertion that essay length is more important in shorter essays. Longer essays correlated less with holistic scores than shorter essays and McNeill suggests a threshold of 400 words after which essay length becomes less important. Other studies back this up. Linnarud (1975), in a study of 36 essays written by Swedish learners of English, found that short essays tended to be rated as poor. However, the essay length and quality relationship did not hold for longer essays. Longer essays were often essays rated as average rather than essays rated as good. Page also reports a decreasing influence of essay length for longer essays, which prompted the choice of the fourth root of essay length as a feature.

Most of these studies involved timed tasks but Larsen-Freeman & Strom found that even with non-timed tasks, essay length was a strong predictor of proficiency.

There is contradictory evidence from other researchers though. For example, Perkins (1980), in a study of advanced learners from a variety of L1 backgrounds, found there was no significant difference between three proficiency groups in terms of essay length on a fifty-minute timed essay. However, in this experiment the learners were informed in the rubric that they would be scored on both essay quality and essay length. This could be a contributory factor as it seems to set this study apart from others where no specific mention of essay length influencing ratings was made to participants. The fact that these learners were advanced learners may also be significant.

The influence of essay length is likely to depend on the context and the learners. For lower level learners, it is likely to correspond positively with quality. For more advanced learners, its impact may be less. Also, it is more likely to impact on timed tasks rather than non-timed tasks.

2.12.1.2 Lexical diversity Lexical diversity has been used widely in applied linguistics in a variety of contexts. It has several variants. There seem to be no naming conventions, so sometimes the same formulation is used with different names and sometimes different formulae are used with the same name. Although the naming of these measures may be confusing, measures of lexical diversity do provide a flexible way to analyze the lexical content of essays. Formulae can be adjusted to fit the particular needs of the study as in Arnaud (1984). The flipside of this is that studies become difficult to compare. Of the many measures of lexical diversity, the type token ratio (TTR) is perhaps the most often used. TTR is simply the number of different words (types) divided by the total number of words (tokens) as follows: TTR = types / tokens

Other measures include lexical density, lexical variation, lexical rareness, and lexical originality. Lexical density (LD) is the number of lexical tokens divided by the total number of tokens as follows: LD = lexical tokens / tokens

Lexical variation (LV) is the number of lexical types divided by the number of lexical tokens as follows: LV = lexical types / lexical tokens

Lexical rareness (LR) used by Arnaud (1984) is the number of rare types divided by total types as follows: LR = rare types / types

Lexical originality is the number of words in an essay not appearing in any other essays in the set of essays.
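The formulas above are straightforward to compute once the necessary word lists are available. In the sketch below, content_words and rare_words are placeholders: deciding which tokens count as lexical (content) words and which types count as rare requires external lists or tagging that the formulas themselves do not supply.

```python
# Sketch of the lexical diversity formulas given above. Deciding which tokens
# are "lexical" (content) words and which types are "rare" requires external
# word lists or tagging, so content_words and rare_words are placeholders.
def lexical_measures(text, content_words, rare_words):
    tokens = text.lower().split()
    types = set(tokens)
    lexical_tokens = [t for t in tokens if t in content_words]
    lexical_types = set(lexical_tokens)

    ttr = len(types) / len(tokens)                   # TTR = types / tokens
    ld = len(lexical_tokens) / len(tokens)           # LD = lexical tokens / tokens
    lv = (len(lexical_types) / len(lexical_tokens)   # LV = lexical types / lexical tokens
          if lexical_tokens else 0.0)
    lr = len(types & rare_words) / len(types)        # LR = rare types / types
    return {"TTR": ttr, "LD": ld, "LV": lv, "LR": lr}

def lexical_originality(essay, other_essays):
    # Number of words in this essay that appear in no other essay in the set.
    own = set(essay.lower().split())
    others = set().union(*(set(e.lower().split()) for e in other_essays))
    return len(own - others)
```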

Laufer & Nation (1995) point out the following weaknesses of some of these lexical diversity measures:
• Lexical originality tends to be unreliable because it depends on the structure of the learner group.
• Lexical density is influenced by function words.
• Lexical sophistication may be constructed in several ways.
• Lexical variation is sensitive to essay length, dependent on the definition of a word, and unable to distinguish between words of different frequency levels.

Engber (1995) found that lexical variation measures adjusted for error were better at predicting essay quality than lexical diversity or error measures alone. Arnaud also proposed a more complex formula for lexical diversity which includes elements of lexical variation, lexical rarity and error.

One disadvantage that measures of lexical diversity all share is that they are affected by essay length. As more words are included, these measures tend to fall. Therefore, it is difficult to compare essays of differing lengths because, in effect, the longer essays will be penalized. Given that essay length is often correlated with quality, this is a major weakness of these measures in essay scoring. Even the output of a given individual can exhibit very different lexical diversity scores according to the size of the sample of output. In relation to TTR, this basic problem has long been recognized and various mathematical transformations of it have been proposed to alleviate the length effect. For example, the Guiraud Index (Guiraud, 1960) is the number of types divided by the square root of the number of tokens as follows: Guiraud Index = types / √tokens
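The length effect is easy to demonstrate with simulated text. The sketch below uses an invented 200-word vocabulary rather than real learner data; it computes TTR and the Guiraud Index on progressively longer stretches of the same simulated essay, showing TTR falling steadily as tokens are added while the square-root transformation softens, but does not remove, the dependence on length.

```python
import math
import random

# Sketch of the length effect: TTR tends to fall as more tokens are included,
# because later tokens increasingly repeat earlier types. The Guiraud Index
# (types / sqrt(tokens)) is one transformation intended to soften this effect.
random.seed(0)
vocabulary = [f"word{i}" for i in range(200)]            # invented 200-word vocabulary
text = [random.choice(vocabulary) for _ in range(1000)]  # simulated 1000-token essay

for n in (50, 100, 200, 400, 800):
    sample = text[:n]
    types = len(set(sample))
    print(f"{n:4d} tokens: TTR = {types / n:.2f}, Guiraud = {types / math.sqrt(n):.2f}")
```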

However, the transformations have usually been found to also depend on essay length, but perhaps to a lesser degree than TTR. For example, Malvern, Richards, Chipere & Duran (2004) found that the Guiraud Index typically rises along with length for the first few hundred words of a text but then gradually decreases with increasing length after that.

Yule’s K (Yule, 1944), which is often used in the field of authorship attribution, is a measure that was developed to be independent of text length, but research suggests it is only stable for texts of 1000 words or more. This makes it useful in authorship attribution, where texts under consideration are often of book length, but less useful for L2 essay scoring, where essays may be only a few hundred words long.
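The formula for Yule’s K is not given here, so the sketch below follows the standard formulation commonly cited in the authorship attribution literature, K = 10^4 (Σ i²V_i − N) / N², where N is the number of tokens and V_i is the number of types occurring exactly i times.

```python
from collections import Counter

# Yule's characteristic K in its standard formulation:
#     K = 10^4 * (sum_i i^2 * V_i - N) / N^2
# where N is the number of tokens and V_i is the number of types occurring
# exactly i times. (The formula is not given in the text above; this follows
# the form commonly cited in the authorship attribution literature.)
def yules_k(tokens):
    n = len(tokens)
    type_freqs = Counter(tokens)            # type -> number of occurrences
    v = Counter(type_freqs.values())        # i -> number of types occurring i times
    s = sum(i * i * vi for i, vi in v.items())
    return 10_000 * (s - n) / (n * n)

print(yules_k("the man saw the woman and the man left".split()))
```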

One way to get around the problem of essay length is to standardize it by sampling a fixed number of words. Although this solves the basic mathematical problem, it is a less than perfect solution. It often means the shortest essay in a set becomes the standard and longer essays will be shortened and so much output wasted. It is also possible that drastic shortening of a longer essay may cause impairment in some way.

Malvern & Richards (1997) came up with an innovative alternative measure which they claim is independent of essay length particularly for short essays. They use the basic concept of TTR but they exploit the fact that all essays share the property of a curved graph when TTR is plotted against increasing essay length. By isolating a coefficient of this curve, a general measure for comparison emerges. They use a simplification of a formula developed by Sichel (1986) to produce a model for TTR. This simplified formula is as follows:

TTR = (D / N) [√(1 + 2N/D) − 1]

where N is the number of tokens in the essay and D is the parameter reflecting lexical diversity.

By using samples of writing, the parameter D can be estimated and provides a basis for calculating a measure of diversity. The calculation is complex but there is a computer package, VOCD, which produces a D estimate (Malvern et al., 2004). Using VOCD, D is estimated by taking many random samples of 35-50 words without replacement, calculating D for each one and then taking the average value of D as an estimate of D for the essay. Even though many small samples of words are used, this method has the advantage that it incorporates all of the words in the essay. The D estimate may be the best solution to the influence of essay length at the moment. However, it seems that the search for a perfect measure of lexical diversity may not be over yet. Recently, McCarthy & Jarvis (2007) showed that the D estimate is also affected by essay length in some situations. They found that the D estimate was still significantly correlated with essay length for essays over 400 words. However, many L2 essays fall into the range covered by the D estimate. Yu (2007) found that D scores based on essays written for MELAB tests (Michigan English Language Assessment Battery) correlated with human ratings significantly with r = 0.29. D worked even better for spoken tests with D calculated for spoken interview transcripts correlating with spoken test assessments with r = 0.48.

For low level learners, essay length is likely to fall in the range where the D estimate is not influenced by essay length so it offers a good alternative to predict lexical diversity. However, this practical way to control for essay length comes with a price. The calculation is complicated and needs a special computer package to handle it.
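A rough sketch of the underlying idea is given below. It is not the VOCD package: it simply draws random samples of 35-50 tokens, averages the TTR at each sample size, and then searches for the D whose theoretical curve TTR = (D/N)[√(1 + 2N/D) − 1] best fits those averages, using a crude grid search rather than proper curve fitting.

```python
import math
import random

# Rough sketch of the D-estimation idea behind VOCD (not the VOCD package itself):
# draw many random samples of 35-50 tokens, average the TTR at each sample size,
# then pick the D whose theoretical curve
#     TTR(N) = (D / N) * (sqrt(1 + 2N/D) - 1)
# best fits those averages. A crude grid search stands in for proper curve fitting.
def model_ttr(n, d):
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)

def estimate_d(tokens, trials=100, seed=0):
    rng = random.Random(seed)
    mean_ttr = {}
    for n in range(35, 51):
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(trials)]
        mean_ttr[n] = sum(ttrs) / trials
    best_d, best_err = None, float("inf")
    for d in (x / 10 for x in range(10, 2001)):          # candidate D from 1.0 to 200.0
        err = sum((mean_ttr[n] - model_ttr(n, d)) ** 2 for n in mean_ttr)
        if err < best_err:
            best_d, best_err = d, err
    return best_d

# Toy essay: 300 tokens drawn from an invented 120-word vocabulary.
rng = random.Random(1)
essay_tokens = [f"w{rng.randint(1, 120)}" for _ in range(300)]
print(estimate_d(essay_tokens))
```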

Meara & Bell (2001) refer to measures such as TTR, the D estimate or lexical variation as intrinsic measures of lexical diversity. These are mostly concerned with counts of vocabulary, either tokens or types. No attention is paid to the nature of the words themselves. Meara & Bell provide a good illustration of how these measures fail to identify important differences in the sophistication of vocabulary used. Each of the following sentences would score the same for an intrinsic measure of lexical diversity such as the TTR yet the latter two clearly involve more sophisticated vocabulary:
The man saw the woman.
The bishop observed the actress.
The magistrate sentenced the burglar.

Therefore they propose that extrinsic measures of lexical diversity such as LFP (Laufer and Nation, 1995) and P_Lex (Meara & Bell, 2001) might also be used. These measure lexical diversity in relation to outside measures of frequency in general English. These extrinsic measures would distinguish the examples above as having words from different frequency ranges. Measures such as lexical rareness (Arnaud, 1984) also qualify as extrinsic measures.

Daller, van Hout & Treffers-Daller (2003) propose using hybrid measures of lexical diversity which offer the same ease of calculation as the familiar intrinsic measures of lexical diversity but measured against frequency as in extrinsic measures. They experiment with two measures: Advanced TTR (ATTR) and Advanced Guiraud (AG) as follows:

ATTR = va / N
AG = va / √N

where va is the number of advanced types and N is the total number of tokens.

Advanced types could be defined according to the context. Daller, van Hout & Treffers-Daller were examining the use of German and Turkish in bilinguals. For German, they defined advanced types as being outside 2000 basic words as defined by a well known word list. For Turkish, they depended on teacher judgments for a classification of vocabulary into basic and advanced.
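Following the formulas as given above, a minimal sketch might look like this; the set basic_words is a placeholder for the 2000 basic words or the teacher judgments just described.

```python
import math

# Minimal sketch following the ATTR and AG formulas as given above; basic_words
# is a placeholder for the 2000 basic words or teacher judgments described here.
def advanced_measures(text, basic_words):
    tokens = text.lower().split()
    advanced_types = {t for t in set(tokens) if t not in basic_words}
    va, n = len(advanced_types), len(tokens)
    attr = va / n                 # Advanced TTR    = advanced types / tokens
    ag = va / math.sqrt(n)        # Advanced Guiraud = advanced types / sqrt(tokens)
    return attr, ag

print(advanced_measures("the bishop observed the actress",
                        basic_words={"the", "man", "saw", "woman"}))
```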

It might be interesting to see if intrinsic measures and extrinsic measures could be used together. In addition, the lexical signature approach of Meara et al. (2002) adds another possible dimension to this model: the comparison of lexis produced by a learner with that produced by other learners in a group.

2.12.1.3 Other considerations Selection of features depends on other considerations. One is validity. Validity will ultimately depend on an assessment system reliably predicting the quality of essays. However, the validity of features or methods can be a concern. Features can be selected according to statistical effectiveness at discriminating between candidates or with validity in mind. The PEG system introduces validity at an early stage by selecting features that are proxies for underlying traits in writing. Earlier versions of e-rater, on the other hand, seemed to be based on a more flexible approach. Initially, a large array of features was input into the system. The best eight to ten features were then selected for each model according to statistical efficiency. This left them open to criticism for the nature of these choices, as shown by Sheehan’s (2001) observation that many features were heavily influenced by essay length. The latest version of e-rater seems to be based on a much smaller and more balanced set of features (Lee, Gentile & Kantor, 2008).

Using the IEA system or features such as LFP or lexical signatures offers validity based on a language model. IEA is based on an LSA based model of semantic relations and lexical acquisition. LFP and lexical signatures are based on the model of lexical learning according to frequency in general English.

Another consideration when selecting features is reliability. Reliability can be threatened by features that occur infrequently. Features that depend on frequent occurrences are likely to be more reliable. Essay length and measures of lexical diversity for instance depend on large counts which enhance their reliability. Examination of individual words or infrequent grammatical features will produce smaller counts which are less likely to be reliable. This was evident in the Ferris experiment where many of the features occurred only a few times at most in an essay.

2.12.2 Automatic scoring methods The success of packages like PEG, e-rater and IEA shows that automatic assessment of essays is a realistic objective. Despite the success of these packages, they are not suited to small scale practical applications in L2 classrooms. They all share one disadvantage. They require some sort of training mechanism external to the set of essays. The PEG and e-rater systems require a set of essays to be rated by humans. IEA is slightly more flexible and requires a rated training set, a model answer or expert material. However, Landauer, Laham & Foltz (2003) conducted an experiment where essays were rated internally by relative distance from each other. This seems to be an approach worth investigating which might help realize an essay scoring system in a more ready to use format.

Some of the other papers developed ideas that could be incorporated in essay scoring. Ferris used discriminant analysis and regression analysis to identify features in a training set of essays. These features could then be used to rate other essays. Goodfellow, Jones & Lamy (2002) found the LFP to be a good way to automatically assess L2 learners of French for giving feedback about vocabulary usage. Meara, Rodgers & Jacobs (2000) proposed using a neural network with lexical signatures to classify essays as higher quality or lower quality.

Some of these scoring systems need to incorporate features. These need to be easily calculated. Some potentially useful features can be problematic. For example, Engber highlights the role of error in human ratings of essays. Unfortunately, error is a difficult thing to handle efficiently in a simple computer package. Error requires skilled human judgment to determine the type of error or seriousness of error. The number of error-free T-units was found by Larsen-Freeman & Strom to be a better discriminator of essay quality than essay length or average length of T-unit. However, both determining a T-unit and judging error are difficult to do automatically.

Some features are easy to measure, such as counts. Others cannot easily be calculated but a good estimate can be made, for example, the number of sentences in an essay. Other features are more difficult to estimate, such as error or a T-unit. One alternative approach is to find proxy measures as in the PEG study.

No matter how good any one feature or measure is, it is likely to be inadequate by itself for assessment purposes. To confidently predict essay quality, a range of measures may be necessary. PEG, e-rater and the approach of Ferris all showed some success by using models with numerous features.

2.12.3 Measures from authorship attribution Authorship attribution has a long history and has developed a varied methodology. Some of this research may also have applications in L2 assessment research. Appreciating the similarities and differences between the two fields can help judge which features could be applied productively in L2 research.

There are some obvious differences between authorship attribution research and L2 assessment but there are also some similarities. Authorship attribution typically deals with long or multiple texts by the same author whereas in L2 assessment the texts are very short often only a few hundred words or fewer. Longer texts enable rare events to be investigated such as the rate of occurrence of particular words. This may not be possible with shorter texts. Moreover, some of the statistics developed for authorship attribution are designed for large sets of data. For example, Yule’s K (Yule, 1944) is independent of text length but is probably only stable for texts longer than 1000 words or so. Context of writing is an important difference too. Authorship attribution usually cannot control for context. Samples from authors under consideration are likely to be from a variety of contexts and genres. Since context and genre can affect vocabulary choice, this can be a major problem for authorship attribution studies. In L2 assessment, very often learners are writing on a common theme. This makes it easier to make comparisons of vocabulary use.

Another major difference between the two fields is the underlying assumptions. In authorship attribution, a core assumption is that individual writers exhibit individual characteristics in their writings. These individual characteristics can be used to identify the writers. Different works by the same author should display similar characteristics while works by different authors should show differences in these characteristics. In L2 assessment, the assumption is that although learners may show individual characteristics, these will be outweighed by characteristics typical of their level of proficiency.

The two fields also have some things in common. Authorship attribution is typically about identifying the author of a document of unknown authorship from a set of possible candidates. It is therefore a type of classification problem. Assessment situations also often involve classification of an essay into one of a set of categories. The categories may be grades A to E, or a pass/fail distinction or placement into groups of various proficiencies. Classification often involves identifying similarities between members of a group and differences from members of other groups. Some measures may be reliable for individuals over different pieces of work but do not necessarily spot differences in L2 proficiency so clearly. This suggests that they are of more use in authorship attribution than in L2 assessment.

The nature of the classification is different in one important respect between the two fields. In authorship attribution, qualitative judgments usually do not need to be made; similarities and differences are what matter. However, in assessment, we are usually interested in qualitative differences. Therefore, features that offer differences according to proficiency are likely to be more useful in L2 assessment. For example, one possibility could be mean word length. If we assume that more proficient learners are more likely to use infrequent words, then we can expect mean word length to increase with proficiency. Frequent words are likely to be shorter than infrequent words. For example, the mean lengths of headwords in the 1K, 2K and UWL lists are 5.25, 6.57 and 7.29 respectively. Presence and absence of some features may also give clues to proficiency. Some features may be absent from a certain level because they have not been acquired yet. Conversely, the presence of a feature might suggest a certain proficiency. Of course, these features may not necessarily have a linear relationship with proficiency. A large number of occurrences may indicate an overuse of a feature.

Another aspect of authorship attribution that could be instructive to L2 research is the methodology. In authorship attribution, the methodology involves accumulating a body of evidence to support a theory. Each piece individually may not be conclusive but when seen together, there may be a compelling body of evidence. This helps explain the wide range of features and methods used. A description of some features and methods used in author attribution can be found in Holmes (1994).

2.13 Conclusion This chapter has highlighted some of the main empirical work on methods and features that can help predict quality in learner essays. This work can help guide a search for an assessment system for low level learner essays as well as illustrate some problems and potential pitfalls.

Automatic assessment already is a reality in L1 contexts and in some L2 contexts such as the TOEFL Test of Written English. In this chapter, a few of the main systems have been considered. The PEG project and ETS’s e-rater are based on multiple regression models using a variety of lexical, grammatical and discourse features. Many of the features used could be applied to L2 essays. However, it is not clear on what scale these techniques are able to operate. Ferris had some success with a relatively small set of 160 essays. However, this still involves more essays than many practical assessment situations. Another problem with these systems relates to a lack of flexibility incurred by the requirement of training samples.

The IEA system based on latent semantic analysis offers an alternative approach. This is based on semantic content rather than essay features. With L1 essays, this system seems to be effective because it informs about the conceptual content of essays. Rating of L2 essays often depends less on conceptual structure than on language content so it is not clear how effective this type of system could be in L2. However, there is an argument that it might tell us something about the organization of the lexicon. In this case, it may complement an analysis based on essay features. The IEA research also hints at a possible solution to the training sample problem. In one experiment, essays were evaluated by their proximity to other essays in the set rather than to expert material. Reliability was relatively high if not as high as with expert material. It would be interesting to see if feature information as well as semantic distance could be incorporated within the same framework and if the essays could themselves provide a benchmark of quality. One challenge is to exploit some of this technology to make more flexible applications that can be used for smaller scale L2 classroom and research applications.

In terms of features, various measures of lexical diversity, both intrinsic and extrinsic, have been considered. The consensus seems to be that there is no magic feature to account for essay quality. A great deal of research has been done with lexical diversity. More accurate measures such as Malvern & Richards’ D estimate have been produced and have overcome some of the main technical problems such as the dependence on essay length. However, the relationship with essay quality is unclear and seems to depend on contextual considerations. There also needs to be more research on how different lexical diversity measures can complement each other. Extrinsic measures such as LFP and P_Lex add another dimension to lexical diversity by measuring lexical use against lexical frequency in general English. Measures such as unique lexical signatures also have the potential to inform about lexical usage relative to other members of the group. Lexical signatures also seem to provide information about semantic relations of small sets of words.

In the following chapters, the experimental work is presented. In the first of these, the lexical signature approach of Meara, Jacobs and Rodgers (2002) is investigated in more detail to determine whether it can help in essay assessment.

Chapter Three: Lexical Signatures

3.1 Introduction This first experimental chapter describes two experiments exploring lexical choices of L2 learners using the lexical signature approach of Meara, Jacobs and Rodgers (2002). In the first experiment, a simple measure of uniqueness based on lexical signatures is created and applied to essays written by low level Japanese learners of English. In the second experiment, a more in-depth examination of some of the properties of lexical signatures is undertaken. In particular, the stability and reliability of the uniqueness measure is investigated. These experiments will help evaluate whether lexical signatures offer an alternative approach to automatic essay assessment either as a holistic method or as one element in an array of features.

3.2 Lexical signatures in low level learners 3.2.1 Introduction Meara, Jacobs & Rodgers (2002) investigated lexical signatures, patterns of lexical choice, in essays produced by L2 learners of French. Inspiration came from the field of authorship attribution where lexical indicators of individual writing style have often helped in identifying authorship in disputed cases. Meara et al. investigated whether low level L2 learners might also make idiosyncratic lexical choices and whether, in turn, these choices might help identify their proficiency. Lower level learners can be expected to have a relatively small vocabulary which may manifest itself in a smaller range of individual lexical choices. If this is the case, lexical signatures could have a role in computer-aided assessment of L2 essays. In Meara, Rodgers & Jacobs (2000), the same authors proposed using lexical signatures in conjunction with neural network models to assign grades to essays according to quality. In Meara et al. (2002), lexical signatures using small subsets of words were used to investigate lexical choices of learners. These small subsets revealed a surprisingly large amount of variation in individual usage of words.

3.2.1.1 Aims of the experiment

In this study, a number of essays produced by low level Japanese learners of English as an L2 are analyzed using lexical signatures. The aims of the study are to:

1) see if Japanese L2 learners of English exhibit a similar amount of individuality in their lexical choices to the L2 learners of French in the original study.
2) create a simple index of uniqueness based on lexical signatures.
3) test the measurement reliability of this index of uniqueness.

3.2.2 Methodology

3.2.2.1 Participants and task

A set of 77 written essays was collected from a group of first-year Japanese university students. These students were all non-English majors and their English proficiency was relatively low. During their first year of study at university, these students have five compulsory 90-minute English classes a week. The students were asked to write a story based on a six-caption cartoon. The cartoon used is Cartoon Task 1 in Appendix 3.1. Students were given 25 minutes to complete the task. For the analysis, essays with fewer than 100 words were removed and 64 of the remaining essays were included.

A specially designed computer program, L-unique (see Appendix 1.1), was used to analyze the data. A single corpus was constructed from the 64 essays. This corpus was found to contain a total of 745 different unlemmatised words. Of the total 745 words, 371 (49.8%) appeared in only one essay each. Furthermore, 634 words (85.1%) appeared in fewer than ten of the 64 essays. There was only one word, and, that appeared in all 64 essays. Some sample essays can be seen in Appendix 3.3.
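This kind of corpus summary, counting how many essays each word type appears in, is straightforward to compute. The sketch below is an illustration only, not the L-unique program: the three short essays are hypothetical stand-ins for the real data.

# Sketch: for each word type, count the number of essays in which it appears.
from collections import Counter

essays = [
    "one day a boy went to the sea",
    "the boy took off his clothes and threw them",
    "my father said it was a good day",
]

essay_word_sets = [set(e.lower().split()) for e in essays]   # word types per essay
doc_freq = Counter(w for s in essay_word_sets for w in s)    # essays containing each word

total_types = len(doc_freq)
in_one_essay = sum(1 for n in doc_freq.values() if n == 1)
print(total_types, "different words;", in_one_essay, "appear in only one essay")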

3.2.2.2 Measuring lexical signatures

For the analysis, thirty words were chosen that appeared in around half the essays. Words appearing in about half the essays had been shown in Meara et al. (2002) to be the most effective for producing unique signatures. In the original study, fifty words were selected, but given the smaller number of words in this corpus compared to the original (745 words in this study compared to 2272 in the original) there was a dearth of words that appeared in close to half of the essays. The number of essays in which these thirty words occurred varied from 23 to 41 essays. The words selected are shown in Table 3.1.

Table 3.1: Words selected for analysis
back, boy, came, clothes, day, father, for, go, good, home, I, it, my, off, one, said, sea, so, take, that, then, there, they, this, thought, threw, throw, too, took, were

From this sample of thirty words, 25 subsets of six words were selected randomly. Meara et al. also used six-word subsets. Six words is an appealing choice given the number of essays in this experiment. The number of different possible permutations of presence and absence for a six-word subset is 2^6, or 64. Matching the number of essays to the number of possible permutations of a six-word subset allows for the possibility that each essay has a unique signature for a particular subset. Selecting 25 subsets necessitated each word being used multiple times. Some examples of subsets selected are shown in Table 3.2.

Table 3.2: Four examples of subsets
1  came, my, sea, clothes, threw, it
2  took, there, thought, home, back, came
3  good, it, were, off, this, then
4  too, this, home, so, I, father

For each subset, binary representations of the 64 essays were derived. For each word in a subset, a 1 indicated the word was present in that essay while a 0 indicated that the word was absent. So, for example, in the case of the subset {came, my, sea, clothes, threw, it}, a binary representation of 100110 would indicate that the words came, clothes and threw were present in that essay but that the words my, sea and it were absent. Table 3.3 shows the binary representations for each of the 64 essays for the subset {came, my, sea, clothes, threw, it}.

Table 3.3: Binary representations of essays for {came, my, sea, clothes, threw, it}
000000 001000 010000 011000 100001 101100 110100 111011
000001 001001 010001 011001 100001 101101 110101 111011
000100 001001 010001 011010 100001 101111 110101 111011
000100 001010 010010 011011 100011 101111 110111 111011
000100 001011 010010 011011 100011 101111 111001 111011
000101 001100 010101 011100 100100 110001 111001 111100
000110 001101 011000 011101 100111 110001 111010 111101
001000 001101 011000 100000 100111 110001 111011 111101

From Table 3.3, we can see that one essay produced the representation 000000. This representation indicates this essay included none of the six words in the subset. On the other hand, no essays produced the representation 111111, so no essays contained all of the words in the subset. A six-digit binary subset can produce 64 combinations of representations. Amongst the 64 essays, there were 38 different representations of this subset which means that a further 26 possible representations did not occur. Of the 38 representations that did occur, 21 occurred in only one essay each. These representations that occur in only one essay were coined unique lexical signatures by Meara et al. For example, 000000 was a unique lexical signature as it appeared in only one essay. Similarly, 111100 was a unique lexical signature meaning that there was only one essay that included the words came, my, sea, and clothes but did not contain threw and it. The 21 unique lexical signatures for the subset {came, my, sea, clothes, threw, it} are shown in Table 3.4.

Table 3.4: Unique signatures for subset {came, my, sea, clothes, threw, it}
000000 010000 101100 110100
000001 011001 101101 011010
001010 110111 001011 000101
001100 010101 011100 100100
111100 000110 011101 111010
100000
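The derivation of signatures and the identification of the unique ones can be sketched in a few lines of code. The example below illustrates the procedure as described rather than the original program; the four small essays, represented simply as sets of word types, are hypothetical.

# Sketch: binary signatures for one subset and the unique signatures among them.
from collections import Counter

essays = [                                   # hypothetical stand-ins for the 64 essays
    {"came", "my", "clothes", "boy", "day"},
    {"sea", "threw", "it", "boy", "father"},
    {"came", "sea", "clothes", "threw", "it", "my"},
    {"good", "home", "so", "took"},
]
subset = ["came", "my", "sea", "clothes", "threw", "it"]

def signature(essay_words, subset):
    # One digit per subset word: 1 if the word occurs in the essay, 0 otherwise.
    return "".join("1" if w in essay_words else "0" for w in subset)

sigs = [signature(e, subset) for e in essays]
counts = Counter(sigs)
unique_sigs = [s for s, n in counts.items() if n == 1]

print("signatures:", sigs)
print(len(counts), "different signatures;", len(unique_sigs), "unique signatures")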

In the next stage of the analysis, the number of different representations and the number of unique representations among the 64 essays were calculated for each of the 25 subsets, and these values were averaged over the subsets.

Meara et al. noted that the number of different and unique signatures can be increased by increasing the subset size. While a six-word subset can potentially yield 64 different signatures, an eight-word subset affords 256 different signatures and a ten-word subset, 1024 signatures. The analysis was repeated for larger subsets of eight words and ten words.

3.2.2.3 Constructing an index of uniqueness

To create a simple index of uniqueness, the six-word subsets were used. The number of unique signatures was calculated for each essay. Since there were 25 subsets, the value of this uniqueness index could vary from zero to 25. A score of zero would mean that the essay had no unique lexical signatures for any of the 25 subsets, whereas a value of 25 would indicate that the essay had a unique lexical signature for every subset.
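Building on the signature idea, the index itself can be sketched as follows. This is an illustration of the procedure just described, not the original program: the essays and the word pool are hypothetical stand-ins, and random.sample is used to draw the 25 six-word subsets.

# Sketch of the index of uniqueness: one point per subset for which an essay's
# signature is shared with no other essay, giving a score between 0 and 25.
import random
from collections import Counter

essays = [                                   # hypothetical essays as sets of word types
    {"came", "my", "clothes", "boy", "day", "so"},
    {"sea", "threw", "it", "boy", "father", "home"},
    {"came", "sea", "clothes", "threw", "it", "my"},
    {"good", "home", "so", "took", "one", "said"},
]
word_pool = ["back", "boy", "came", "clothes", "day", "father", "for", "go",
             "good", "home", "i", "it", "my", "off", "one", "said", "sea", "so",
             "take", "that", "then", "there", "they", "this", "thought", "threw",
             "throw", "too", "took", "were"]

def signature(essay_words, subset):
    return "".join("1" if w in essay_words else "0" for w in subset)

uniqueness = [0] * len(essays)
for _ in range(25):                          # 25 randomly sampled six-word subsets
    subset = random.sample(word_pool, 6)
    sigs = [signature(e, subset) for e in essays]
    counts = Counter(sigs)
    for i, s in enumerate(sigs):
        if counts[s] == 1:                   # a unique signature earns a point
            uniqueness[i] += 1

print(uniqueness)                            # index of uniqueness per essay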

3.2.2.4 Measurement reliability

A final point worth investigating is the reliability of this estimate of uniqueness. This is a crucial point if we are thinking of using this process in measurement. Usually when we measure something, we do not need to worry about the reliability of the measuring instrument itself. But because this index of uniqueness depends on sampled subsets, we need to find out how this value might vary with other possible sampled subsets. A simple way to do this is to do the calculations on different samplings and compare the results. In order to do this, the analysis of six-word subsets was run ten times and the uniqueness index scores for all essays on each iteration were calculated. The uniqueness index scores for each iteration were then correlated against those for every other iteration to give 45 correlation coefficients.
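The reliability check amounts to repeating the sampling and correlating the resulting score vectors, giving C(10, 2) = 45 coefficients for ten iterations. A minimal sketch is shown below; the uniqueness_scores function is a hypothetical stand-in for a full run of the calculation sketched above, so the printed values are not the real results.

# Sketch: pairwise correlations between ten runs of the uniqueness calculation.
import random
from itertools import combinations
from statistics import correlation, mean     # statistics.correlation needs Python 3.10+

def uniqueness_scores(n_essays=64):
    # Stand-in for one full run with a fresh set of 25 random subsets.
    return [random.randint(0, 25) for _ in range(n_essays)]

runs = [uniqueness_scores() for _ in range(10)]
coefficients = [correlation(a, b) for a, b in combinations(runs, 2)]

print(len(coefficients), "coefficients; mean r =", round(mean(coefficients), 2),
      "lowest r =", round(min(coefficients), 2),
      "highest r =", round(max(coefficients), 2))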

3.2.3 Results

3.2.3.1 Lexical signatures

The average number of different signatures and unique signatures for each size of subset and the number of unique signatures for the best performing subset are shown in Table 3.5. The best performing subset is the subset that produces the most unique signatures.

Table 3.5: Statistics for subsets of various size
Subset size (words)                       six      eight    ten
Average number of different signatures    39.16    54.84    61.92
Average number of unique signatures       22.00    46.64    60.04
Best performance of unique signatures     30       56       64

A six-word subset, on average, uniquely identified 22 of the 64 essays. As in the original study, it can be seen that the number of unique signatures increased considerably as the number of words in the subset was increased. The average number of unique signatures increased from 22 for a six-word subset to 46.6 for an eight-word subset to more than 60 for a ten-word subset. Also, the best performing ten-word subset could uniquely identify every essay. The results for six-word subsets are compared with those of the original study by Meara et al. in Table 3.6.

Table 3.6: Comparison of six-word subset to the original study
                                           Set M (Meara et al.)    This study
Average number of different signatures     35.60                   39.16
Average number of unique signatures        21.24                   22.00
Ratio of unique to different signatures    .597                    .562

The average number of different and unique signatures was slightly higher in this study than in the original, but that may reflect the smaller number of essays in the original study, 59 compared to 64 in this study. The ratio of unique to different signatures was slightly lower than in the original study. These results suggest that this technique works as well for these learners as it did for the original learners of French despite the much smaller corpus of words.

3.2.3.2 Index of uniqueness

The numbers of unique signatures for each essay based on the six-word subsets are shown in Table 3.7.

Table 3.7: Summary of uniqueness index scores for 64 essays
Uniqueness index
  mean score            8.4
  standard deviation    2.8
  high score            15
  low score             3

The highest score on the uniqueness index was 15 and the lowest score 3 with an average score of 8.4. This suggests that the uniqueness index based on six-word subsets gives a good range of values out of the possible range from zero to 25.

3.2.3.3 Analysis of measurement reliability

Table 3.8 shows the results of the reliability analysis.

Table 3.8: Average, highest and lowest reliability coefficients for ten iterations
                   Mean    High    Low
Reliability (r)    0.32    0.54    0.12

The average reliability between scores from two measurements using different subsets is only 0.32 with a range from 0.12 to 0.54. This seems very low. This shows that a different set of 25 six-word subsets from the sample of thirty words may lead to very different scores on the uniqueness index.

3.2.4 Discussion

On the whole, the results of this study reinforce the results of the original study. Even with very low level learners, a very small subset of words is capable of uniquely identifying most of the essays. Some ten-word subsets selected from words appearing in about half the essays can uniquely identify all the essays. An approach that can discriminate between 64 essays with so few words is obviously very powerful and has exciting possibilities if it can be applied to the assessment of essays. This set of learners produced a much smaller corpus than the learners in the original study, which suggests that these learners are of a lower proficiency. However, the amount of lexical variation was comparable over the two studies based on the number of different and unique signatures per six-word subset.

A simple index of uniqueness was constructed by calculating the number of unique signatures for each essay out of a possible 25 subsets. Unfortunately, the measurement reliability for this index was very low. If this index is to be developed for measurement purposes, the measurement reliability needs to be improved dramatically. For most measurements, this type of reliability can be expected to approach one. For example, if you made the same calculation on a calculator a number of times, you would be very dissatisfied to find it giving different answers. In fact, if you did get different answers, you would probably presume the error was due to the operator. In this experiment, the low reliability is a sampling problem. It occurs because there are a surprisingly large number of ways to select six words from thirty. The number of subsets sampled, 25, is too small to be representative of this large number of potential samples (see Appendix 3.4 for the mathematical background to combinations). The number of different six-word subsets possible from a set of thirty words is 593,775, which means a sampling rate in this experiment of 0.0042%. Clearly the number of subsets sampled needs to be increased dramatically. For example, an increase to a sampling rate of 1% would mean sampling approximately 5,938 subsets instead of the 25 sampled in this experiment.
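The sampling arithmetic can be checked directly; the sketch below simply reproduces the calculation and is illustrative only.

# Checking the combinatorics behind the sampling rate.
from math import comb

possible = comb(30, 6)               # six-word subsets that can be drawn from thirty words
rate = 25 / possible * 100           # percentage of the possible subsets actually sampled
one_percent = possible / 100         # subsets needed for a 1% sampling rate

print(possible)                      # 593775
print(f"{rate:.4f}%")                # roughly 0.0042%
print(round(one_percent))            # roughly 5938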

3.2.5 Conclusion

This study supports the findings of Meara et al. (2002). It seems that even very low level learners exhibit a lot of unique lexical choices, as evidenced by occurrences of unique lexical signatures. A simple index of uniqueness seems to identify differences in lexical choices in essays. However, this index was not reliable over different sampled subsets because of sampling problems. Twenty-five subsets represents a tiny sampling proportion, so in the next experiment the number of sampled subsets needs to be increased substantially to improve reliability. In addition, reliability in the general sense needs to be tested by seeing if the measure provides a similar score for essays of similar quality. Finally, it needs to be ascertained whether the index of uniqueness can identify differences in essay quality.

3.3 Investigating an index of uniqueness

3.3.1 Introduction

This experiment investigates some of the questions raised by the previous experiment. In particular, it aims to develop a reliable index of uniqueness based on the lexical signature technique and to determine how this measure is associated with quality in written tasks.

One of the basic aims of this study is to establish reliability. There are two aspects to this. The first is measurement reliability: the index generated for a given piece of written work must be consistent for different choices of parameter. In the previous experiment, the index developed was found to vary considerably when only 25 random subsets were employed. This was due to the degree of sampling. Twenty-five subsets was only a tiny proportion of all possible six-word subsets selected from thirty words. It is evident that a much larger number of subsets is necessary to provide consistency.

In addition, reliability in its more usual sense needs to be considered. Does the measure show consistency over different pieces of work by the same learner? This is vital if this uniqueness index is to be used to evaluate the lexical content of writing. This is equivalent to test-retest reliability and is easily tested by correlating values derived from two pieces of work from the same learners.

3.3.1.1 Aims of the experiment

The three basic aims of this study are to:

1) develop an index of uniqueness based on unique lexical signatures that has measurement reliability.
2) investigate the reliability of this index of uniqueness over two written tasks by the same learners.
3) explore the relationship between the index of uniqueness and quality assessments of the written essays.

3.3.2 Methodology

3.3.2.1 Participants and tasks

The participants in this study were the same first year Japanese university students as in the previous experiment. For this experiment, the same written essays from the previous experiment were used. In addition, the students completed a second similar written task towards the end of their first year with a three-month interval between tasks. Altogether, 77 students completed the first assignment and 76 students the second assignment. There were 74 students that completed both the first and the second assignment.

The two assignments both consisted of constructing a story from a six-frame cartoon. The cartoon strips used are Cartoon Task 1 in Appendix 3.1 and Cartoon Task 2 in Appendix 3.2. Students were given 25 minutes to complete each task. For students to be eligible to be included in the study, they had to satisfy the 100-word minimum on both assignments. There were 69 students that satisfied these conditions and 64 of these were selected at random to be part of the analysis. See Appendix 3.3 for sample essays for Cartoon Task 1 and Appendix 3.5 for sample essays for Cartoon Task 2. The analysis was carried out using the L-unique computer program (see Appendix 1.1).

3.3.2.2 Sampling of subsets

The basic lexical signature analysis is the same as in Meara et al. (2002) and the previous experiment. However, in the previous experiment, the number of unique signatures per essay out of 25 subsets was found to be unreliable. This seems to be because 25 represents only a small fraction of the possible combinations of six words selected from thirty. In the original study, 25 subsets were selected from fifty words. Table 3.9 shows the potential number of combinations of subsets of various sizes that can be selected from thirty or fifty words.

Table 3.9: Possible subset combinations by sample and subset size
                         Number of words
Subset size       thirty            fifty
six-word          593,775           15,890,700
eight-word        5,852,925         536,878,650
ten-word          30,045,015        10,272,278,170

This means the sampling rates in the original study and in the previous experiment are tiny. Meara et al. used 25 six-word subsets selected from fifty words. This represents a sampling rate of:

25 / 15,890,700 x 100% = 0.00016%

In the previous experiment, 25 six-word subsets were selected from thirty words. This represents a sampling rate of:

25 / 593,775 x 100% = 0.0042%

The low sampling coverage in the previous experiment must contribute to the low measurement reliability. In order to improve this reliability, more subsets need to be sampled. If the number of subsets is increased to 1000, 10,000 or 100,000 subsets, then the sampling rate improves to 0.17%, 1.68% and 16.8% respectively, as shown in Table 3.10. For reasons of processing time for the computer analysis, 10,000 subsets were decided on.

Table 3.10: Sampling rates by number of subsets for a six-word subset
Number of subsets    Sampling rate (%)
25                   0.004
1,000                0.17
10,000               1.68
100,000              16.8

Up to now, the choice of a six-word subset has been arbitrary and the choice of thirty words rather than the fifty words in the original study was necessitated by the relatively small corpus. However, it is clear that if a larger subset and/or fifty words is used, a lot more subsets will be needed to get the same sampling coverage. Table 3.11 shows how many subsets would need to be sampled to get the same coverage, 1.68%, as 10,000 six-word subsets sampled from thirty words.

Table 3.11: Number of subsets to achieve sampling rate of 1.68%
                         Number of words
Subset size       thirty           fifty
six-word          10,000           267,620
eight-word        98,570           9,041,734
ten-word          505,997          164,356,450

3.3.2.3 Quality assessments of essays

To address the third aim of gauging the usefulness of the uniqueness index in grading essays, the two sets of essays were graded by an experienced EFL instructor who specializes in teaching writing classes. The instructor was asked to put the essays into groups such that the essays within a group were judged to be similar in quality and where there were clear differences in quality between the groups. The number of groups and the number of essays within a group were not specified. The essays from the two tasks were graded separately and the rater was given no information about the candidates’ identities. Results of the grading are shown in Table 3.12.

Table 3.12: Candidate grades for Tasks 1 & 2
                          Task 2
Task 1 grade      1       2       3       Total
A                 9       13      0       22
B                 4       31      0       35
C                 0       2       5       7
Total             13      46      5       64

For Task 1, the grading produced three groups A, B and C with Group A the best essays and Group C the worst. For Task 2, there were also three groups, 1, 2, and 3 with Group 1 the best essays and Group 3 the worst. The groups for each task are labeled differently since there is no reason to think they correspond with each other in terms of quality. Two cells of Table 3.12 are of particular interest: the candidates whose essays scored high grades on both tasks (Grade A and Grade 1) and those whose essays scored low grades on both tasks (Grade C and Grade 3). These two groups will be useful for comparison because they should be quite different in quality.
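A cross-tabulation of this kind, and the isolation of the consistently high and consistently low candidates, can be produced along the following lines. The sketch uses pandas and a handful of hypothetical grade assignments; it illustrates the bookkeeping only, not the grading itself.

# Sketch: cross-tabulate holistic grades on the two tasks and pick out the
# candidates graded at the top (A and 1) or the bottom (C and 3) on both.
import pandas as pd

grades = pd.DataFrame({                      # hypothetical grades for eight candidates
    "task1": ["A", "A", "B", "B", "C", "A", "B", "C"],
    "task2": ["1", "2", "2", "2", "3", "1", "2", "3"],
})

print(pd.crosstab(grades["task1"], grades["task2"], margins=True))

high_both = grades[(grades["task1"] == "A") & (grades["task2"] == "1")]
low_both = grades[(grades["task1"] == "C") & (grades["task2"] == "3")]
print(len(high_both), "rated high on both;", len(low_both), "rated low on both")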

3.3.2.4 Lexical signature analysis

For the analysis, the thirty words that occurred in closest to half the essays were chosen. In fact, very few words appeared in exactly 32 essays: two in Task 1 (off and one) but none in Task 2. Therefore, the thirty words that occurred in closest to half the essays were selected for each task. In Task 1, the thirty selected words occurred in 23 to 41 essays, and the thirty selected words for Task 2 occurred in 24 to 40 essays. The words selected for each task are shown in Table 3.13.

Table 3.13: Thirty selected words for Tasks 1 & 2
Task 1: book, boy, came, clothes, day, didn’t, father, for, go, good, home, I, my, off, one, said, sea, so, take, that, then, there, they, this, thought, threw, throw, too, took, were
Task 2: again, are, because, car, day, for, go, good, have, idea, in, it, made, next, one, stop, that, their, then, there, this, though, too, trees, very, walk, walking, was, were, with

3.3.2.5 Index of uniqueness

Six-word subsets were constructed randomly from these words and binary representations of each essay were examined for each subset. Any essay with a unique binary representation for a subset was awarded a point. This was done for 10,000 subsets and the number of points was totaled and then divided by 10,000 for convenience to give a value between 0 and 1 for a uniqueness index.

Uniqueness index = number of unique signatures / 10,000

3.3.3 Results

3.3.3.1 General test characteristics

Table 3.14: Task characteristics
                                               Task 1    Task 2
Number of essays                               64        64
Total number of words                          9044      9267
Number of different words                      758       795
Average length of essay (words)                141.3     144.8
  longest essay                                230       250
  shortest essay                               100       101
Average number of different words per essay    63.2      64.7
  highest                                      99        121
  lowest                                       39        37
Average type-token ratio                       0.45      0.45
  highest                                      0.56      0.59
  lowest                                       0.29      0.30

Some of the characteristics of the corpora derived from each task are shown in Table 3.14. For comparison of reliability, correlation coefficients (r) were calculated over the two tasks in terms of a) essay length, b) number of word types in the essay and c) the type-token ratio. These correlation coefficients are shown in Table 3.15.

Table 3.15: Reliability of basic features over the two tasks
                   Essay length    Word types    Type-token ratio
Correlation (r)    0.53            0.62          0.50

All of these correlations were significant at the .001 level. From Table 3.14, it is clear that the characteristics of the two tasks were quite similar in terms of essay length and the number of word types generated. There is considerable internal variation in both tasks with regard to essay length and the number of word types per essay. The significant correlations between the two tasks in Table 3.15 suggest that students were likely to produce a similar number of words on each of the tasks, a similar number of word types and also a similar type-token ratio over the two tasks.
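These basic features and their test-retest correlations are simple to compute. The sketch below is illustrative only: the essay strings are hypothetical, the pairing of essays across tasks by learner is assumed, and the real analysis was carried out with the L-unique program.

# Sketch: essay length, word types and type-token ratio per essay, and the
# correlation of one feature (essay length) across the two tasks.
from statistics import correlation           # Python 3.10+

def features(text):
    tokens = text.lower().split()
    types = set(tokens)
    return len(tokens), len(types), len(types) / len(tokens)   # length, types, TTR

task1 = ["one day a boy went to the sea",
         "my father said it was a good day to swim",
         "the boy and his father walked home"]
task2 = ["the boy walked with his father",
         "it was a very good day so they went walking",
         "my father said the boy was good"]

len1 = [features(t)[0] for t in task1]
len2 = [features(t)[0] for t in task2]
print("essay length correlation over tasks: r =", round(correlation(len1, len2), 2))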

3.3.3.2 Measurement reliability

The first aim of this experiment was to ensure the measurement reliability of unique lexical signatures. This reliability can be measured by running the analysis twice using different sampled subsets and correlating the number of unique lexical signatures per essay each time. In the previous experiment using 25 subsets, the reliability averaged r = 0.32 over a number of trials. This time, 10,000 subsets were used and the reliability over two trials using different subsets was r = 0.995. This indicates a great improvement and suggests that this measurement of an index of uniqueness is reliable with 10,000 subsets.

3.3.3.3 Reliability over two tasks

Table 3.16: Unique lexical signatures and uniqueness index scores for both tasks
                                         Task 1    Task 2
Average unique signatures per subset     21.6      22.8
Average uniqueness index score           0.34      0.36
  highest                                0.48      0.50
  lowest                                 0.19      0.20
Correlation (r) (Task 1 & Task 2)             -0.10

The second aim of the experiment was to investigate reliability over two written tasks by the same learners. In order to do this, uniqueness index scores of essays were compared for Task 1 and Task 2. The results for each task are compared in Table 3.16. It is clear that the average number of signatures per subset, the average uniqueness index score per essay and the highest and lowest uniqueness index score per essay are all quite similar for both tasks. However, the correlation coefficient (r) between uniqueness index scores for the two tasks is very low and negative. This suggests there is no obvious relationship between uniqueness index scores over the two writing tasks.

3.3.3.4 Uniqueness index scores and quality assessments

Even though there is no reliability in uniqueness index scores over tasks, the final aim, to see how uniqueness index scores relate to quality ratings of essays, will be briefly considered. For this question, the quality ratings were compared with uniqueness index scores for each task. Two groups of learners were isolated: those that were rated high on both tasks and those that were rated low on both tasks. There were nine learners rated high on both tasks and five learners rated low on both tasks (see Table 3.12). If uniqueness index scores are a good predictor of essay quality, these two groups could be expected to be clearly distinguishable. The uniqueness index scores of these learners on both tasks can be seen in Figure 3.1. Points marked H indicate learners who produced essays which were rated high on both tasks, and points marked L indicate learners who produced essays which were rated low on both tasks.

[Figure 3.1 is a scatter plot of uniqueness index scores, with Task 1 on the x-axis and Task 2 on the y-axis, both ranging from 0.15 to 0.50.]

Figure 3.1: Uniqueness index scores for selected tasks

The graph shows a slight tendency for the lower rated students to have lower uniqueness index scores and higher rated students to have higher uniqueness index scores. There are three high candidates that seem to reliably score high on both tasks but there are some learners who score high on one essay but low on another. For example, in the top left hand corner of the graph, one low candidate scores high on Task 2 but low on Task 1. There is also one low rated learner who scores quite high on both tasks. However, there is no clear delineation between high and low candidates. This means that it would be impossible to discriminate between the two sets of essays by unique lexical signatures alone. This is disappointing since these candidates should exhibit a clear difference in quality of essay.

The graph also highlights the lack of reliability in the scores. If the scores showed perfect reliability, the points could be expected to lie along a 45° line coming from the intersection of the axes. This is clearly not the case. However, a few of the candidates do seem to have performed reliably in terms of their uniqueness index scores. For example, there are three high candidates with similar uniqueness index scores on each task.

3.3.4 Discussion

3.3.4.1 Problems of reliability

This experiment was successful in stabilizing the measurement reliability of uniqueness index scores but was ultimately unsuccessful in finding reliability over tasks written by the same learners and in finding a relationship between uniqueness index scores and quality ratings of essays. Such low reliability over two tasks cannot be accounted for by the time interval between the two tasks since essay length, the number of word types in the essay and lexical diversity all showed significant correlations of r = 0.53, r = 0.62 and r = 0.50 respectively.

Because this index of uniqueness does not seem to be reliable over tasks written by the same learners and does not show a strong relationship with quality, its use may be limited. However, it may be instructive to consider what might be the cause of these deficiencies.

One possible cause could be weaknesses of uniqueness as a concept. There are at least two obvious aspects to this. The first is that uniqueness is an unstable concept. It can easily change with slight changes in the environment. A lexical signature may have a unique representation for a certain essay in a set of essays, but if one more similar essay is added to the set, this uniqueness is lost completely. Another weakness is that uniqueness is a skewed concept. In terms of reward, it is all or nothing: a lexical signature is either unique or not unique. There is no distinction between an essay that shares the same lexical signature with one other essay and an essay that shares a lexical signature with all other essays. These two aspects of uniqueness are likely to contribute to the lack of reliability over several tasks.

Another possible cause could be the symmetry of uniqueness. A unique representation could be the result of a learner using more sophisticated vocabulary than other learners, but it could also be because a learner uses vocabulary mistakenly. This may mean that high uniqueness scores could indicate good essays or poor essays.

Another possible cause could be the various parameters involved in this calculation. One parameter, the number of subsets used, was shown to exert a huge effect on measurement reliability. Increasing the number of subsets alleviated this. But other arbitrary decisions about parameters may also introduce unreliability. One decision is in regard to the words used. In the analysis for each task, thirty frequently occurring words were sampled. Perhaps the uniqueness index score is affected by this sampling as well. How reliable would the measurement be if two different samples of thirty words were taken? This could not be tested easily since there were not many more words of a similar frequency. However, for Task 1, there were two further words of a similar frequency that were not used in the initial analysis. To check the effect of small changes in the composition of the original thirty words, these two words were used to replace two words in the Task 1 analysis. The experiment for Task 1 was then repeated with 28 of the same words but with two words replaced, and the scores were compared with the main Task 1 analysis. When the same words were used, with different sets of 10,000 subsets, the reliability was r = .995. When two words were replaced, this reliability fell to r = .84. This seems a large loss of reliability for such a minor change. It suggests that the actual thirty words sampled are also a considerable threat to reliability.

3.3.4.2 Considerations of word selection

The selection of words may not only cause problems related to sampling. There are other concerns.

One concern is the effect the number of sampled words in a particular essay may have on the uniqueness index score. Table 3.17 shows the number of the thirty sampled words that occurred in each essay for both tasks.

Table 3.17: Number of the thirty sampled words per essay
           Average per essay    Highest    Lowest
Task 1     14.8                 22         8
Task 2     14.6                 23         8

For both tasks, there was a large variation in the number of the thirty sampled words appearing in each essay. This variation was similar for both tasks. Some essays included as few as eight of the thirty sampled words while others included as many as 23. This variation occurred despite the fact that these words all occurred in about half the essays. It is likely that the number of sampled words in a particular essay affects the uniqueness index score.

A correlation analysis between the number of the thirty sampled words present in an essay and the uniqueness index score for that essay revealed a correlation of r = .39 for Task 1 and r = .62 for Task 2. These correlations seem large and so might indicate a problem. The small number of sampled words in some essays may have another ramification for quality assessment: a method which assesses essays by the presence or absence of only eight high frequency words may also encounter problems with validity.

The selection of words is an aspect that certainly warrants attention. In the original study, fifty words were chosen that occurred in about half the essays. In these experiments, the small corpus precluded using fifty words so only thirty were used. This turned out to be fortuitous in one respect since thirty words were easier to handle from a sampling point of view. Whether words occurring in half the essays are the best words to use in the analysis may be an area worth more investigation. These words are the most powerful in producing a range of lexical signatures and a high number of unique signatures. However, there may be an argument for using a different selection of words. One possibility might be words that are specific to the task.

Given the likely effect of word selection on the number of unique lexical signatures, the experiment was done one more time but this time using thirty words that appeared frequently in both tasks. These words are shown in Table 3.18.

Table 3.18: Thirty words that appeared frequently in both tasks
day, for, go, good, one, that, then, there, this, thought, too, were, father, at, by, can, do, don't, move, of, put, want, didn’t, had, my, take, are, have, their, with

The uniqueness index scores for each task were calculated using 10,000 six-word subsets of these words and the uniqueness scores for each task correlated with r = .11. Using words that appeared frequently in both tasks improved the reliability but only slightly. More detail of this analysis can be found in Appendix 3.6.

3.3.5 Conclusion

The lack of both reliability and a relationship with quality assessments means that uniqueness index scores in this form show limited promise for assessment purposes. The complex calculation involves various parameters. These parameters may each introduce their own threats to measurement reliability, making it difficult to develop this index of uniqueness in its present form. Further research could be done on the selection of words, and on other parameters such as subset size or the frequency of the words selected. It seems that the concept of uniqueness itself may have poor properties on which to base a measurement instrument.

3.4 Conclusion

The two experiments in this chapter have investigated the lexical signature approach proposed by Meara, Jacobs and Rodgers (2002) with essays written by low level Japanese learners of English. The essays displayed a range of unique lexical signatures that was comparable to that found in the original study despite a much smaller corpus. This suggests that these low level learners also exhibit considerable individual lexical choice. However, harnessing these lexical signatures to produce a uniqueness index stable in measurement and reliable over tasks by the same learners proved difficult. Lexical signatures seemed to be particularly sensitive to small changes in the nature of the subsets and words selected for analysis. In particular, small numbers of subsets led to considerable variation in scores due to insufficient sampling coverage. It was found that a large number of subsets was required to ensure measurement reliability. It is feared that other parameters such as the word selection may lead to similar variations in scores. Also, preliminary comparisons with quality assessments found no evidence that uniqueness index scores were sensitive to variations in quality.

A lot of the problems faced in these experiments are likely to be due to the complexity of the lexical signature process and the underlying instability of uniqueness. In the next chapter, distinctiveness is examined. Distinctiveness enables a similar examination of the relative use of vocabulary across essays but is underpinned by a more robust concept and a simpler, more transparent mode of calculation.

Chapter Four: Distinctiveness

4.1 Introduction

In the previous chapter, an index of uniqueness was developed and its reliability explored. This index based on lexical signatures was found not to be reliable over two similar tasks by the same learners. Several possible causes of this were considered. One possible cause was the small number of words involved in the analysis. Another cause may have been the properties of uniqueness. Firstly, uniqueness rewards unique occurrences but treats very rare occurrences in the same way as very common occurrences, i.e. it ignores them. This means there is a stark cut off point for reward. A word that appears in one essay is rewarded but one that appears in two is not. Secondly, uniqueness has unstable properties. This is a result of the previous characteristic since the addition of one similar essay can cause a score for another essay to drop from being very high to very low as occurrences change from being unique to rare.

In this chapter, distinctiveness is considered as an alternative to uniqueness. Distinctiveness subsumes the concept of uniqueness but considerably expands on it so that not only unique words but all words are considered. Each different word is rewarded according to how rare it is over a set of essays. A unique word, one that appears in only one essay in a set of essays, reaps the greatest reward, followed by a word that appears in only two essays, followed by a word that appears in only three essays, and so on down to a word that appears in every essay, which reaps the least reward. This concept of distinctiveness seems superior to the concept of uniqueness for measurement purposes. Unlike the index of uniqueness, which relied on a very small number of words in an essay, distinctiveness incorporates all of the different words in each essay. Moreover, because distinctiveness is based on arithmetic principles, it is a clear and easy-to-understand measure that has good mathematical properties. This makes distinctiveness less susceptible than uniqueness to sudden variation with minor changes to the essay corpus. Consequently, distinctiveness should be a more stable measurement than the index of uniqueness.

In this chapter, two experiments explore the properties of distinctiveness. In particular, its reliability over tasks written by the same learners and its relationship to essay quality are investigated. In the first experiment, to check for reliability, distinctiveness is calculated for two different but similar tasks written by the same learners. The distinctiveness scores are compared with holistic quality assessments. In the second experiment, a measure of distinctiveness adjusted for error is also investigated, but this time on two essays written by learners on the same task.

4.2 Distinctiveness in L2 learner texts

4.2.1 Introduction

This experiment explores some of the properties of distinctiveness for a set of essays written by low level L2 learners. Firstly, a simple measure of distinctiveness is introduced. Secondly, distinctiveness scores are checked over two similar tasks written by the same learners in order to gauge reliability. Finally, the relationship between distinctiveness and essay quality is investigated by comparing distinctiveness scores of essays with quality evaluations.

4.2.1.1 Aims of the experiment

The three main aims of the experiment are to:

1) construct a measure of distinctiveness.
2) test the reliability of distinctiveness for low level L2 learners over two sets of written tasks.
3) see how this measure of distinctiveness relates to holistic evaluations of those tasks.

4.2.2 Methodology

4.2.2.1 Participants and tasks

The same sets of essays based on the cartoon tasks were used as in the Chapter 3.3 experiment (see Appendix 3.1 and 3.2 for cartoon tasks). Sixty-four candidates writing at least 100 words on both tasks were selected. Each set of 64 essays was analyzed separately using L-unique, a specially-designed computer program (see Appendix 1.1).

4.2.2.2 Calculating distinctiveness

A measure of distinctiveness was calculated for each candidate on both tasks. This measure is basically the sum, over all the different word types in an essay, of one divided by the number of essays each type appears in. For example, if a word is unique and so appears in no other essays, it would earn a credit of one. If a word appears in one other essay, it earns each of those essays a credit of 1/2. If a word occurs in 20 essays, it earns each essay a credit of 1/20. However, this measure then needs to be adjusted for each essay by dividing by the number of word types in that essay. This adjusts for the effect of essay length; otherwise, longer essays with more word types will tend to get higher distinctiveness scores. This value is then multiplied by 100 to give a value that is easier to work with.

In this form, distinctiveness (DIS) for essay j can be represented by the following formula:

DIS_j = (100 / v) x Σ_i ( w_ij / Σ_k w_ik )

where v = the number of word types in essay j, w_ij = 1 if word i appears in essay j or 0 otherwise, and the sum over k runs across all the essays in the set, so that Σ_k w_ik is the number of essays in which word i appears.
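A direct implementation of this calculation might look as follows. It is a sketch of the formula as just described, not the L-unique program, and the essays are hypothetical.

# Sketch of distinctiveness (DIS): each word type in an essay earns 1 divided by
# the number of essays containing it; the total is divided by the number of word
# types in the essay and multiplied by 100.
from collections import Counter

essays = [
    "one day a boy went to the sea",
    "the boy took off his clothes and threw them into the sea",
    "my father said it was a good day",
]

type_sets = [set(e.lower().split()) for e in essays]
doc_freq = Counter(w for s in type_sets for w in s)     # essays containing each word

def distinctiveness(types):
    credit = sum(1 / doc_freq[w] for w in types)
    return 100 * credit / len(types)

for i, types in enumerate(type_sets):
    print(f"essay {i}: DIS = {distinctiveness(types):.2f}")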

4.2.3 Results

4.2.3.1 Distinctiveness scores

Table 4.1 shows distinctiveness scores alongside some other lexical features for both Tasks 1 and 2 (T1 & T2). Average scores (av.), highest (hi) and lowest (lo) values are shown. Also the correlation coefficient, r, between each value on Tasks 1 & 2 is given.

Table 4.1: Distinctiveness scores and lexical features for two tasks
      Distinctiveness         Essay length        Word types          TTR
      Av.    Hi     Lo        Av.   Hi    Lo      Av.   Hi    Lo      Av.   Hi    Lo
T1    17.85  34.07  8.14      141   226   100     62    101   40      0.45  0.56  0.29
T2    18.18  32.76  7.79      145   234   103     63.4  113   38      0.45  0.59  0.30
r     0.12                    0.50*               0.59*               0.50*
*significant at the .001 level

The range of distinctiveness values was similar on the two tasks, ranging from 8.14 to 34.07 with an average of 17.85 on Task 1 and from 7.79 to 32.76 with an average of 18.18 for Task 2. However, the correlation between values on the two tasks was only r = 0.12, which is not significant. There were quite strong correlations for all three of the lexical features over the two tasks. Essay length for Tasks 1 & 2 correlated with r = 0.50, word types with r = 0.59 while the TTR correlated with r = 0.50. All three of these were significant at the .001 level.

4.2.3.2 Eliminating inconsistent students

Although the correlation for essay length was significant, we might have expected it to be higher given the similarity of the tasks. We might expect the same students completing two very similar tasks under exactly the same conditions to produce essays of very similar length. Candidates who wrote two essays of very varied length may be suspect to some degree. Just from personal observation, some students were obviously less enthusiastic about completing the second task than they had been about the first. The premise of this experiment is to collect similar pieces of work from students. A considerable difference in length may be an indicator that a particular candidate has not produced two similar pieces. If that is the case, then error will be introduced into the experiment. Therefore, the data was re-analyzed after eliminating candidates whose total number of words changed by more than 20% from Task 1 to Task 2.
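The filter itself is a simple comparison of word counts over the two tasks; a sketch with hypothetical counts is shown below.

# Sketch: keep only candidates whose Task 2 word count is within +/-20%
# of their Task 1 word count (hypothetical counts for illustration).
task1_words = {"s01": 150, "s02": 120, "s03": 200, "s04": 110}
task2_words = {"s01": 160, "s02": 80, "s03": 205, "s04": 140}

consistent = [s for s in task1_words
              if abs(task2_words[s] - task1_words[s]) / task1_words[s] <= 0.20]
print(consistent)    # s02 (-33%) and s04 (+27%) are excluded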

Only candidates with similar length essays were considered. Forty-eight of the 64 candidates wrote, on the second task, within +/- 20% of the number of words they had written on the first task. These 48 candidates were identified and a re-analysis was performed. The correlations of distinctiveness, essay length and the number of word types are shown in Table 4.2.

Table 4.2: Distinctiveness and other measures for revised analysis
                   Distinctiveness    Essay length    Word types
Correlation (r)    0.22               0.82*           0.74*
*significant at the .001 level

Table 4.2 shows that the correlation for distinctiveness is improved by selecting only students who were relatively consistent in terms of the amount of output. The distinctiveness measure has a correlation of r = 0.22, which is an improvement on the earlier figure but still low.

4.2.3.3 Distinctiveness and quality assessments

The ratings of each task for this reduced set of 48 students can be seen in Table 4.3.

Table 4.3: Holistic ratings of written tasks
                          Task 2
Task 1 grade      1       2       3       Total
A                 7       11      0       18
B                 0       25      0       25
C                 0       1       4       5
Total             7       37      4       48

For Task 1, Grade A essays were the best and Grade C essays were the worst. For Task 2, Grade 1 essays were the best and Grade 3 essays were the worst. Seven of the candidates had essays rated in the top group for both essays, Grade A for Task 1 and Grade 1 for Task 2. Similarly, four candidates had essays rated in the bottom group for both essays, Grade C for Task 1 and Grade 3 for Task 2. These are the cells A/1 and C/3 of Table 4.3. The two groups are likely to differ in proficiency, so it might be instructive to examine distinctiveness for these candidates. Since these are obviously the candidates whose essays differed most in terms of quality, there should be a clear difference in distinctiveness scores if distinctiveness is to be a measure of essay quality.

[Figure 4.1 is a scatter plot of distinctiveness scores, with Task 1 on the x-axis and Task 2 on the y-axis, both ranging from 0 to 40; points are marked H or L for candidates rated high or low on both tasks.]

Figure 4.1: Distinctiveness for selected learners on both tasks

Figure 4.1 shows distinctiveness mapped for these 11 candidates on the two tasks. High (H) candidates are those whose essays were rated high on both tasks and low (L) candidates are those whose essays were rated low on both tasks. By drawing a vertical line up from the x-axis, distinctiveness can distinguish two low candidates on Task 1, but by drawing a horizontal line from the y-axis none can be distinguished on Task 2. This graph also highlights the lack of reliability in the performance of some candidates. Some candidates scored very differently over the two tasks. One low candidate got a high score, over 30, on Task 1 but a very low score, under 20, for Task 2. However, there seems to be a cluster of high candidates who scored reliably on both tasks and sit close to an imaginary line at 45° from the origin. For comparison, Figure 4.2 and Figure 4.3 show essay length and the number of word types mapped for the same candidates. Figure 4.2 shows that essay length isolates three low candidates on Task 1 but only two on Task 2. Figure 4.3 shows that word types perform well on Task 1, isolating all four low candidates, but only isolate two on Task 2. Although essay length and word types do not distinguish between high and low candidates perfectly, their performance is superior to distinctiveness. For both features, the proximity of the points to a 45° line illustrates the reliability of these features over the two tasks.

[Figure 4.2 is a scatter plot of essay length, with Task 1 on the x-axis and Task 2 on the y-axis, both ranging from 0 to 250.]

Figure 4.2: Essay length for selected learners on both tasks

[Figure 4.3 is a scatter plot of the number of word types, with Task 1 on the x-axis and Task 2 on the y-axis, with points marked H or L for high and low candidates.]

Figure 4.3: Number of word types for selected learners on both tasks

4.2.4 Discussion

The reliability for distinctiveness of r = 0.12 over the two tasks is disappointingly low. Even removing learners who appeared to perform inconsistently over the two tasks only improved the reliability to r = 0.22. This is an improvement but is still very low. Hatch & Lazaraton (1991) recommend an r value in the high 0.8’s or 0.9’s for test-retest reliability. It is unlikely that such reliability can be realized when using written tasks, given the variability in content, but getting close must be the target. The low reliability is highlighted by the superior reliability of other features such as essay length (r = 0.50), word types (r = 0.59) and TTR (r = 0.50).

The graphical comparison of high and low candidates suggests that distinctiveness is not a strong indicator of quality. There is some evidence from the same graphs that distinctiveness may be more reliable for higher quality essays than lower quality essays. A cluster of high candidates scored reliably over the two tasks whereas the low candidate scores seemed to be more variable. However, because the number of candidates is so small, this requires more investigation.

An assumption behind this study is that the two tasks are similar, so that students should perform in a similar way over the two tasks. Evidence for this is the high correlations for the lexical features of essay length, number of word types and TTR. Also, the performance of candidates in Table 4.3 suggests the tasks elicited a similar performance, with 36 out of 48 candidates being awarded a similar grade over the two tasks. This represents agreement reliability of 75%. However, it is possible the scores for distinctiveness may have been influenced by the different contexts of these written tasks. It may be that distinctiveness is very sensitive to the actual vocabulary used in the tasks. If this is the case, care may have to be taken in the selection of tasks.

One other feature of note was that a large number of words unique to an essay were, in fact, spelling errors. Many of these were misspellings of words that occurred often among the essays. These errors are problematic because distinctiveness gives the most reward to unique words. This highlights a potential weakness of distinctiveness. The words that have the greatest input into the statistic are also the words that may be the least reliable because of their low occurrence. To solve the problem of unique spelling errors, unique words could be eliminated from the analysis altogether, leaving words that appear in at least two essays in the set.

4.2.5 Conclusion

This experiment has failed to produce a reliable measure of distinctiveness over two similar tasks. This lack of reliability contrasts with other lexical features, which performed relatively reliably. Also, there was no obvious relationship apparent between distinctiveness and essay quality. In an attempt to improve reliability, in the next experiment we will try distinctiveness on two identical tasks. This should eliminate any variability in the vocabulary used in the essays. Also, a measure without unique words will be included in the analysis to gauge the impact of spelling errors, which may be distorting distinctiveness scores.

4.3 Testing reliability of a measure of distinctiveness on identical tasks

4.3.1 Introduction

In the previous experiment, a measure of distinctiveness was developed and its reliability explored over two written tasks by a group of low level L2 learners. Reliability was relatively low at r = 0.22. It is possible that the two different tasks used in that experiment may not be truly comparable since they involved quite different sets of vocabulary. An alternative way to test reliability is over two attempts at the same task. In this experiment, distinctiveness scores of learners are compared on two identical tasks. An additional measure of distinctiveness omitting words that occur in only one essay is also calculated. This is to guard against the analysis being affected by spelling errors.

4.3.1.1 Aims of the experiment

The aims of this experiment are to:

1) test the reliability of distinctiveness over two attempts on identical tasks by low level L2 learners.
2) compare the reliability of distinctiveness and distinctiveness without unique words.

4.3.2 Methodology

4.3.2.1 Participants and tasks

Thirty-six first-year Japanese undergraduate students of English completed the same written task twice at an interval of four weeks. The task was Cartoon Task 1 from Appendix 3.1. The students were given 25 minutes to complete the task.

4.3.2.2 Details of the analysis

The two sets of essays were analyzed separately using the L-unique computer program (see Appendix 1.1). Distinctiveness was calculated in the same way as in the previous experiment. This time, an additional measure was constructed for distinctiveness without unique words. The calculation was exactly the same as for distinctiveness but without including words that occur in only one essay in the corpus. The rationale for this was that an examination of the unique words in the previous study had shown that a high proportion were spelling errors of higher frequency words. Under distinctiveness, this rewards students for unique errors. The number of identical spelling errors occurring in more than one essay is likely to be small, so this step may exclude a high proportion of the spelling errors.
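The modification only changes which word types earn credit. The sketch below illustrates one reading of the description, in which words appearing in a single essay are dropped from the sum while the divisor remains the number of word types in the essay; that treatment of the divisor is an assumption, and the essays are hypothetical.

# Sketch: distinctiveness computed only over word types that appear in at least
# two essays, so that one-off spellings (often errors) earn no credit.
# Note: the divisor is kept as the full number of word types in the essay,
# which is an assumption about the intended calculation.
from collections import Counter

essays = [
    "one day the boy went to the see",        # "see" is a one-off misspelling of "sea"
    "the boy took off his clothes by the sea",
    "my father said it was a good day at the sea",
]

type_sets = [set(e.lower().split()) for e in essays]
doc_freq = Counter(w for s in type_sets for w in s)

def dis_without_unique(types):
    shared = [w for w in types if doc_freq[w] >= 2]   # drop words found in only one essay
    return 100 * sum(1 / doc_freq[w] for w in shared) / len(types)

for i, types in enumerate(type_sets):
    print(f"essay {i}: DIS without unique words = {dis_without_unique(types):.2f}")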

4.3.3 Results

4.3.3.1 Distinctiveness scores

Table 4.4 shows distinctiveness scores and distinctiveness scores without unique words for both Tasks 1 & 2 (T1 & T2). Average scores (av.), highest (hi) and lowest (lo) values are shown. Also the correlation coefficient, r, between each value on Tasks 1 & 2 is given.

Table 4.4: Distinctiveness scores with/without unique words
      Distinctiveness            Distinctiveness w/o unique words
      Av.    Hi     Lo           Av.    Hi     Lo
T1    19.98  37.93  10.44        11.60  16.03  8.50
T2    18.46  31.09  7.59         11.44  15.38  7.59
r     0.61*                      0.22
*significant at the .001 level

Distinctiveness ranged from 10.44 to 37.93 with an average of 19.98 on Task 1 and varied from 7.59 to 31.09 with an average of 18.46 for Task 2. Distinctiveness without unique words ranged from 8.50 to 16.03 with an average of 11.60 for Task 1 and varied from 7.59 to 15.38 with an average of 11.44 on Task 2. The distinctiveness scores without unique words were considerably lower than the distinctiveness scores. This suggests that there are a large number of unique words in the analysis. The distinctiveness scores on the two tasks correlated significantly at the .001 level with r = 0.61 but distinctiveness without unique words only correlated with r = 0.22. This shows the reliability of distinctiveness is considerably higher than for distinctiveness without unique words. The much lower scores and the lower correlation for distinctiveness without unique words suggest that unique words play a major role in the distinctiveness statistic. This is emphasized when one bears in mind that unique words also receive the most reward.

Table 4.5 shows values for essay length, word types and the type-token ratio for each task. Essay length for Tasks 1 & 2 correlated with r = 0.82 and word types with r = 0.83, both significant at the .001 level. TTR correlated with r = 0.49, which is significant at the .01 level.

Essay length and the number of word types increased slightly over the two tasks. The average essay length, the highest essay length, the average word types and the highest word types were all higher in Task 2. This suggests that students may have performed slightly better on this task the second time around in terms of how much they wrote and the diversity of their vocabulary.

Table 4.5: Total words, word types and type-token ratio
      Essay length          Word types          TTR
      Av.    Hi    Lo       Av.   Hi    Lo      Av.   Hi    Lo
T1    118.2  227   100      52.7  98    40      0.45  0.58  0.34
T2    128.8  288   100      59.1  120   35      0.46  0.57  0.35
r     0.82**                0.83**              0.49*
**significant at the .001 level  *significant at the .01 level

4.3.3.2 Consistency of learners

As noted in the previous experiment, this kind of experiment assumes that students will perform the two tasks in the same way. A lot of the variation in the previous experiment was suspected of being performance error, where student performance was inconsistent over the two tasks. However, the reliability of essay length in this experiment was much higher, r = 0.82 compared to r = 0.50 in the previous experiment. In fact, it is comparable to the reliability in the previous experiment after unreliable candidate essays were removed. Therefore, it was deemed unnecessary to do a similar revision in this experiment.

4.3.4 Discussion

The first aim of this experiment was to test the reliability of distinctiveness on two identical tasks by the same learners. Using identical tasks succeeded in increasing the reliability considerably to r = 0.61. This was in line with other lexical features, which displayed much more consistency over two identical tasks than over two similar tasks, with essay length correlating with r = 0.82 and word types with r = 0.83. This supports the idea that learners perform in a more similar manner over two identical tasks than over two different tasks. In fact, there is evidence from essay length and the number of word types that the learners performed slightly better in the second task. However, this is not reflected by either of the distinctiveness statistics, where the average score on both was lower for Task 2 than for Task 1.

The higher reliability achieved with identical tasks rather than different tasks suggests that a distinctiveness measure may be very task specific. Scores for distinctiveness on one task might not correspond to those on a different task. This may mean that distinctiveness is better used with standardized tasks so that results from different studies can be compared more meaningfully. The kind of task best suited to this could be an area for further research. The relationship of distinctiveness with essay quality also warrants further investigation.

The second aim of the experiment was to see if the elimination of unique words increased the reliability of the measurement by reducing the effect of rewarding spelling errors. Surprisingly, the reliability of distinctiveness without unique words was much lower than that of distinctiveness, suggesting that removing unique words makes the measurement less reliable. This could mean that unique words are quite consistent over two identical tasks but the occurrence of other words relative to other essays is not as consistent. The scores also seemed to suggest that unique words make a considerable contribution to the distinctiveness scores. Average distinctiveness without unique words scores were over 40% lower than distinctiveness scores.

4.3.5 Conclusion
In this experiment, the reliability of distinctiveness was investigated over two identical tasks. Distinctiveness was reliable with r = 0.61, which was a considerable improvement over that for different tasks in the previous experiment. It was thought that removing unique words might improve performance but, in fact, without unique words, the reliability of distinctiveness was much lower at r = 0.22. Also, scores for distinctiveness without unique words were much lower. This suggests that unique words play a major role in the analysis and that these words may comprise a relatively reliable component of the statistic.

4.4 Discussion
4.4.1 A comparison of results of the two experiments

Table 4.6: Reliability for different tasks and same tasks

                   DIS     DIS w/o unique words   Essay length   Word types
different tasks    0.22    -                      0.50*          0.59*
same tasks         0.61*   0.22                   0.82*          0.83*

*significant at the .001 level

Table 4.6 shows reliabilities of distinctiveness scores and other lexical features for learners on two different tasks and two identical tasks. The reliabilities are all noticeably higher for identical tasks than for different tasks. This is particularly noticeable for distinctiveness and suggests that this feature is particularly sensitive to differences in tasks. These differences also highlight one of the problems in testing the reliability of measures of writing proficiency. It is very difficult to achieve high levels of reliability without using identical tasks. It suggests that we might have to accept lower levels of reliability when using different tasks. However, if using repeated tasks, it must also be borne in mind that students may sometimes react negatively to repeating the same task, which may also have a slight effect on the results.

This task specificity also hints at a lack of a strong relationship with quality. On the whole, we expect higher proficiency learners to perform well on any kind of written task. For a measure to have a strong relationship with quality, it needs to be as reliable as quality over different tasks. The lack of reliability over different tasks of many measures of written production is reflected in only moderate relationships with essay quality.

4.4.2 Threats to reliability and validity
It is interesting to speculate on some of the problems with reliability. It is perhaps easy to see how fluctuations in learner performance can easily cause larger fluctuations in a value like distinctiveness. The distinctiveness score for a particular essay depends not only on the characteristics of that essay but also on the characteristics of all the essays in the set. The learners in this study seemed to perform slightly differently in terms of essay length and the number of word types. If part of this difference is performance error then the error manifested in a concept like distinctiveness or uniqueness can be expected to be much larger, since the calculation of the measure for each candidate depends on the performance of all other candidates and thus incorporates, to some degree at least, the performance errors of all other candidates. This may also help explain why unique words seem a relatively stabilizing force on distinctiveness. Words which appear in more essays may be more at risk of this cumulative performance error than unique words. However, in another way this result is curious since there were a large number of spelling errors in the unique words. This may also mean that learners are consistent in the number of spelling errors they produce.

There are some challenges to the underlying assumption that distinctiveness indicates a more developed vocabulary while less distinctiveness indicates a less developed vocabulary. These should be labeled as threats to validity since they challenge the basis of what the distinctiveness statistic measures. Errors are one such threat, as has been mentioned. Because distinctiveness rewards every word in an essay, it will also reward spelling errors. Because individual spelling error variants may often occur only once, and since distinctiveness bestows the greatest reward on words occurring once, spelling errors are likely to receive the highest reward and so be a significant threat to reliability and validity. One way to counter this may be to not include unique words. However, eliminating unique words in this study was found to lead to a fall in reliability. The effect of rewarding rare words could also be a weakness of distinctiveness. Because it gives the greatest reward to words which occur rarely, there is more chance that it can be undermined by rogue words. This is another aspect of the fact that items that occur rarely are less reliable than frequently occurring items. In this sense, lexical signature analysis has an advantage over distinctiveness since all the words in that analysis are frequently occurring words.

Common errors may also be a threat. If the same error occurs in a number of essays, it will not be picked up by eliminating unique words. If a small number of learners produce the same error then they will be rewarded as if producing a relatively distinct word.

Another threat is inappropriate lexical choice. The assumption is that a distinctive word, as a more sophisticated synonym, should be superior to a common word. However, some tasks call for particular or precise lexical items. In this case, using an inappropriate word may get more reward than the correct or precise alternative used by more learners. In Cartoon Task 1 (Appendix 3.1), 29 out of 36 learners used the appropriate word stick, but two learners used the word rod, which seems less appropriate. The learners using the less appropriate word gained more credit than those using the more appropriate word. Similarly, four students used the word see. If in some cases this was a misspelled variant of sea, used by 20 learners, then more credit would have been received for this misspelled form than for the correctly spelled word.

Figure 4.4: The relationship between distinctiveness and quality (distinctiveness plotted against essay quality)

These are all examples that point to the lack of linearity of distinctiveness with the construct of essay quality. This is a problem distinctiveness shares with uniqueness. Even if we expect more proficient learners to use more distinct vocabulary and so produce a higher distinctiveness score, we cannot necessarily say that a high distinctiveness score identifies a proficient learner. A learner that uses a lot of inappropriate vocabulary or makes a lot of errors may also score highly on distinctiveness. If we map uniqueness/distinctiveness against proficiency it might have a u-shaped relationship as shown by the graph in Figure 4.4.

Values of distinctiveness can be high at both low levels and high levels of proficiency. With this type of relationship, it is not clear whether a distinct word should be viewed in a positive light as a mark of a more developed vocabulary or whether it should be viewed with skepticism because it may be a sign of inappropriate usage or an error. It may be possible to solve this for distinctiveness. A distinctiveness profile may help determine whether distinctiveness is a positive or a negative indicator. An essay that is positively distinctive is likely to have unique words but also highly distinctive words shared by only one or a few other essays. It is likely to have a balanced distinctiveness profile. An essay with an unbalanced distinctiveness profile, for example, an essay with a lot of unique words but not many highly distinctive words may be more likely to have a number of idiosyncratic errors and be negatively distinctive. It may, however, be difficult to codify this type of analysis into a simple automatic algorithm.

A further threat could be an off-topic essay which would be likely to score very highly for distinctiveness because it may contain a different set of vocabulary to the rest of the on-topic essays.

4.4.3 Reward structure
There may be some argument for changing the reward structure of distinctiveness, particularly given the influence of unique words and the susceptibility of these unique words to include error. Using the reciprocal of the number of essays a word occurs in may be a rather top heavy system. For example, under this system, a unique word scores twice as much as a word that occurs in two essays. Figure 4.5 shows the reward system using a reciprocal of the number of essays a word occurs in as the weight. This strong weighting for unique words was highlighted in the study by distinctiveness scores being much higher than distinctiveness without unique words scores. This could be adapted, for example, by using a reciprocal of the square root value. This square root reciprocal weighting system is shown in Figure 4.6. The slope of the curve is less steep, meaning that the impact of rarely occurring words will be less. There are an unlimited number of possibilities for different weighting systems which could be investigated.

Figure 4.5: Distinctiveness weighting using reciprocal (weighting plotted against the number of essays a word occurs in)

Figure 4.6: Distinctiveness weighting using reciprocal square root (weighting plotted against the number of essays a word occurs in)
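To make the two weighting schemes concrete, the sketch below computes a distinctiveness score under both the reciprocal and the reciprocal square root weightings. It is only a minimal illustration in Python: the function names, the simple tokenization and the toy essays are invented for the example and are not taken from the L-analyzer program used in this thesis.

```python
from collections import Counter
import math

def essay_counts(essays):
    """For each word type, count how many essays in the set it occurs in."""
    doc_freq = Counter()
    for essay in essays:
        doc_freq.update(set(essay.lower().split()))
    return doc_freq

def distinctiveness(essay, doc_freq, weight):
    """Sum a weight over the word types in one essay; `weight` maps the number
    of essays a type occurs in to the reward that type receives."""
    return sum(weight(doc_freq[w]) for w in set(essay.lower().split()))

# Two candidate weighting schemes: the reciprocal weighting rewards a unique
# word (1 essay) twice as much as a word shared with one other essay (2 essays);
# the square root reciprocal gives a flatter curve, as in Figure 4.6.
reciprocal = lambda n: 1.0 / n
root_reciprocal = lambda n: 1.0 / math.sqrt(n)

essays = ["the man walks his dog on the beach",
          "a man and his dog walk to the sea",
          "the dog runs into the sea"]
doc_freq = essay_counts(essays)
print(distinctiveness(essays[0], doc_freq, reciprocal))
print(distinctiveness(essays[0], doc_freq, root_reciprocal))
```

Any other monotonically decreasing weight function could be passed in the same way, which makes it easy to compare alternative reward structures on the same set of essays.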

4.5 Conclusion
This chapter looked at the concept of distinctiveness as an alternative to uniqueness. Distinctiveness is a measure of the different word types in an essay with each word type weighted inversely according to how many other essays in the set it occurs in. Distinctiveness has some superior mathematical properties to uniqueness and appears to be more stable. A simple measure of distinctiveness was calculated and tested for reliability over two written tasks by the same learners. Only a very low reliability estimate of r = 0.22 was achieved. Distinctiveness scores were also compared with holistic quality ratings. However, no strong relationship was found. It was noted that many unique words were spelling errors.

In a second experiment, the analysis was conducted on two identical tasks written by the same learners to control for vocabulary content. Reliability improved considerably to r = 0.61. This suggests that distinctiveness may be quite sensitive to changes in vocabulary and may be quite task specific. In the second experiment, an alternative measure of distinctiveness without unique words was also considered to counter the threat of many unique spelling errors. However, the reliability of this measure was much lower. Also, the scores for this measure were much lower than regular distinctiveness scores. This suggests that unique words play a major role in the distinctiveness analysis. There is also a suggestion that because measures such as distinctiveness and uniqueness depend on all the essays in a group, error may be intensified leading to a lack of reliability. Other weighting systems might also be considered to lessen the impact of unique words on the analysis.

The distinctiveness statistic has been shown to be reliable in some contexts. However, more research is necessary into the relationship of distinctiveness and essay quality. There does not seem to be a strong relationship with essay quality. However, more research using different weighting systems may be able to improve this relationship.

The concept of distinctiveness has not proved to be as robust as envisaged and seems to suffer from some of the same problems as uniqueness. The assumption that an essay containing a lot of distinctive vocabulary will be of higher quality may have some major flaws. A greater understanding of the dynamic of distinctiveness and proficiency may help. It seems unlikely that there is a simple linear relationship between the distinctiveness of lexical items used and the proficiency of the learner.

In the first two experimental chapters, two measures of vocabulary content in relation to other essays in the set have been considered. In the next chapter, the scope is broadened and a number of other essay features are considered as predictors of essay quality.

Chapter Five: Exploratory Studies

5.1 Introduction
This chapter reports on four experiments investigating properties of lexical features in L2 written work. The overall aim of the experiments is to identify features that may be useful in assessment of written tasks within larger assessment models. Two properties are particularly important, reliability over different written tasks and correlation with quality assessments. Some of the lexical features focused on have been effective in other fields or in other contexts of vocabulary assessment. The experiments in this chapter are small scale and are aimed at finding possible features to include in multivariate models rather than finding general truths about relationships between features and essay quality.

The first two experiments look at some features that have proven effective in the field of authorship attribution to see if there is a case for their use in assessment. In the first of these, two features, mean word length and mean sentence length, and two lexical statistics, Yule's K and entropy, are examined. In the second experiment, the properties of some vocabulary distributions are considered. These are the word frequency distribution, the Zipf rank distribution and the word length distribution. The third experiment looks at the relative performance of two intrinsic measures of lexical diversity, lexical variation and the type-token ratio, in relation to low level L2 writing. The final experiment considers the relative performances of the two most common extrinsic measures of lexical diversity, the Lexical Frequency Profile and P_Lex, on some low level L2 essays.

5.2 Lexical features in L2 texts of varying proficiency
5.2.1 Introduction
This experiment looks at some features that have been used in authorship attribution studies and makes a case for their use in assessment. The experiment concentrates on two features of an essay, mean word length and mean sentence length, and two lexical statistics, Yule's K and entropy.

Both mean word length and mean sentence length have been used in authorship attribution. However, a stronger case could be made for their use in assessment than in regular authorship attribution cases. For regular authors, differences in mean word length are simply a function of individual style. In assessment, they may tell us something about the quality of the writing. In particular, the case for mean word length is strong. According to Zipf (1932), more frequent words are likely to be shorter than less frequent words. If we assume that more proficient learners are more likely to use infrequent words, then we might expect mean word length to increase with proficiency. In the case of mean sentence length, it might be argued that less proficient learners are likely to produce shorter sentences so that mean sentence length may be higher for more proficient learners. However, there may be a counterargument for lower level learners. Some low level learners might produce longer sentences because of a lack of skill in sentence construction and punctuation. For example, many low level learners produce long run-on sentences. An examination of learner essays may shed some light on this.

Text length can be a problem in authorship attribution studies as it is in assessment. Lexical statistics are often affected by text length but both entropy and Yule’s K have been shown to be, in theory, constant with respect to length of text. These measures have been employed in authorship attribution where texts are often very long, but they are thought to be less stable with the shorter texts found in L2 assessment. In this experiment, we compare their performance against the D estimate (Malvern & Richards, 1997) which is widely recognized as the indicator of lexical diversity least likely to be affected by length for short essays.

5.2.1.1 Aims of the experiment
The aims of this experiment are as follows:
1) to see if mean word length and mean sentence length distinguish between essays written by learners of different proficiencies.
2) to see if Yule's K, entropy and the D estimate distinguish between essays written by learners of different proficiencies.
3) to examine the relationship of Yule's K, entropy and the D estimate with essay length in short L2 essays.

5.2.2 Methodology
5.2.2.1 Participants and tasks
Short essays from two groups of L2 learners of differing proficiencies were collected. Group 1 was made up of eleven third year Japanese university students majoring in English and Group 2 was made up of 24 first year Japanese university students not majoring in English. Each group wrote a short essay using a six-caption cartoon as a prompt. The cartoon used was Cartoon Task 1 in Appendix 3.1.

5.2.2.2 Analysis
The essays were input into a computer and the analysis was done automatically using a specially designed computer program called L-analyzer (see Appendix 1.1) except where acknowledged otherwise. Mean word length was estimated by dividing the total number of characters in an essay by the total number of spaces between words. An estimate of mean sentence length was calculated automatically by dividing the essay length in words by the number of sentence ending punctuation marks such as full stops, question marks and exclamation marks.
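As a rough illustration of these two estimates, the following sketch approximates mean word length and mean sentence length from plain text. It is not the L-analyzer code itself; the whitespace tokenization and punctuation handling are assumptions made for the example.

```python
import re

def mean_word_length(text):
    """Rough estimate of mean word length: non-space characters divided by the
    number of whitespace-delimited words."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def mean_sentence_length(text):
    """Rough estimate of mean sentence length: words divided by the number of
    sentence-ending punctuation marks (full stops, question marks, exclamation marks)."""
    words = text.split()
    enders = len(re.findall(r"[.!?]", text))
    return len(words) / enders if enders else float(len(words))

sample = "The man walked to the beach. He saw a dog! Was it his dog?"
print(mean_word_length(sample), mean_sentence_length(sample))
```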

Entropy (Shannon, 1948) gives a measure of diversity of a text of any length with absolute diversity as 100 and absolute uniformity as 0. It can be described by the following formula:

H = -100 \sum_i p_i \log p_i / \log N

where p_i is the probability of the appearance of the ith word type in a text of length N words. For entropy, we would expect a higher quality essay to be more diverse and so report a higher entropy value.

Yule’s K (Yule, 1944) is thought to have an inverse relationship with diversity of vocabulary and is described by the following formula:

K = 10^4 \left( \sum_r r^2 V_r - N \right) / N^2, \quad r = 1, 2, \ldots

where V_r is the number of types occurring r times in a text of length N words.

For Yule’s K, we would expect a higher quality essay to contain a more diverse vocabulary and so report a lower Yule’s K value.
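Both statistics can be computed directly from the formulas above. The following Python sketch is a minimal illustration, assuming simple whitespace tokenization; it is not the implementation used in this study.

```python
import math
from collections import Counter

def entropy(tokens):
    """Entropy scaled to 0-100 as defined above: H = -100 * sum(p_i log p_i) / log N."""
    n = len(tokens)
    if n < 2:
        return 0.0
    counts = Counter(tokens)
    return -100 * sum((c / n) * math.log(c / n) for c in counts.values()) / math.log(n)

def yules_k(tokens):
    """Yule's K as defined above: K = 10^4 * (sum_r r^2 * V_r - N) / N^2,
    where V_r is the number of types occurring exactly r times."""
    n = len(tokens)
    freq_of_freqs = Counter(Counter(tokens).values())
    return 1e4 * (sum(r * r * v for r, v in freq_of_freqs.items()) - n) / (n * n)

tokens = "the man walked to the beach and the man saw a dog".split()
print(entropy(tokens), yules_k(tokens))
```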

Mean word length, mean sentence length, entropy and Yule’s K were calculated for essays for each group along with the D estimate discussed earlier in Chapter 2.12.1.2. The D estimate was calculated using D_Tools software (Meara & Miralpeix, 2007a). The means of each feature or statistic for each group were then compared.

To test the effect of essay length on entropy, Yule's K and the D estimate, the longest essay from each of the two groups was taken. For Group 1, the longest essay was 258 words and for Group 2, it was 149 words. Entropy, Yule's K and the D estimate were then calculated for each essay for different lengths of both essays. The differing lengths were taken in intervals of ten words beginning at fifty words. Fifty words is the minimum number of words for D_Tools to make an estimate of D. For both essays, this meant taking the first fifty words and calculating the statistics, then taking the first sixty words and calculating the statistics, then seventy words and so on until the end of each essay.
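The incremental procedure itself is straightforward to sketch: compute a statistic on the first fifty words, then sixty, and so on. The helper below is an illustrative stand-in (the D estimate is omitted because it was produced by the external D_Tools program); it reuses the entropy and yules_k functions sketched above.

```python
def length_profile(tokens, statistic, start=50, step=10):
    """Compute `statistic` on the first 50 words, then 60, 70, ... up to the full
    length of the essay, to see how the value changes with essay length."""
    return [(n, statistic(tokens[:n])) for n in range(start, len(tokens) + 1, step)]

# e.g. length_profile(essay_tokens, entropy) or length_profile(essay_tokens, yules_k)
```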

5.2.3 Results
5.2.3.1 Mean word length and mean sentence length
Table 5.1 shows the results for mean word length (MWL) and mean sentence length (MSL) for each group.

Table 5.1: Mean scores for MWL and MSL for each group

Group    MWL              MSL
         mean    sd       mean     sd
1        3.76    0.14     13.47    4.23
2        3.56    0.15     11.07    2.75

Group 1 essays, which we assume to be of higher quality, showed a higher average mean word length, and a higher mean sentence length than the Group 2 essays which we assume to be of lower quality. These averages seem to support the assumptions of relative quality. In addition, the two groups were significantly different in terms of mean word length (t = 3.73, p < .01) but the difference between the groups in terms of mean sentence length was not significant with t = 2.02. Both mean word length and mean sentence length distinguish between the two groups in general terms but are not perfect in discrimination. If we attempt to classify the essays to groups according to mean word length or mean sentence length, then 71% of subjects would be assigned to the correct group for both. However, the evidence overall suggests that mean word length may be the better of the two features to distinguish between essays according to quality.

5.2.3.2 Entropy, Yule's K and the D estimate
Table 5.2 shows the results for entropy, Yule's K and the D estimate for each group.

Table 5.2: Mean scores for entropy, Yule's K and the D estimate for each group

Group    Entropy           Yule's K            D estimate
         mean     sd       mean      sd        mean     sd
1        78.79    2.66     220.62    48.08     40.78    9.12
2        75.25    2.66     327.27    68.30     24.28    5.82

For these statistics, the assumed higher quality Group 1 essays showed a higher entropy value (indicating greater diversity), a lower Yule's K value and a higher D estimate than the assumed lower quality Group 2 essays. These averages all also suggest that Group 1 essays are of higher quality than Group 2 essays. In addition, the two groups were significantly different in terms of the D estimate (t = 6.49, p < .01), Yule's K (t = 4.66, p < .01), and entropy (t = 3.26, p < .01). Again, these measures distinguish between the two groups in general terms but they are not perfect in discrimination. If we attempt to classify essays to the correct groups according to each statistic, then the D estimate assigns 89% of the essays to the correct group, Yule's K assigns 83% of the essays to the correct group while entropy assigns 78% of essays to the correct group. For comparison, essay length assigns 83% of the essays to the correct group and the number of word types assigns 89% of the essays to the correct group.

These results suggest that all three measures, entropy, Yule’s K and the D estimate may be useful for discriminating between essays according to quality but suggest that the D estimate may be the most sensitive.

The relationship between essay length and the three lexical diversity measures was also investigated by considering the longest essay from each group. Figure 5.1 shows entropy plotted against essay length for these two essays.

Figure 5.1: Entropy for increasing essay length for two essays (entropy plotted against essay length for the Group 1 and Group 2 essays)

The graph seems to show that entropy is stable even at very low values but the essay from Group 1 shows a slight but steady fall in the value with increasing essay length. The essay from Group 2 also seems to follow this pattern but the fall is less clear.

Figure 5.2: Yule's K for increasing essay length for two essays (Yule's K plotted against essay length for the Group 1 and Group 2 essays)

Figure 5.2 shows Yule’s K plotted against essay length for the same two essays. Yule’s K seems to be very unstable for short essays less than 100 words or so. However, it seems to level out after about 100 words for one essay but there is still considerable variation in the other essay for the whole essay length. This suggests that for some essays at least, Yule’s K may stabilize at quite short lengths. The graph also seems to suggest that after 100 words or so any variation consists of fluctuations rather than a relationship with essay length.

Figure 5.3 shows the D estimate plotted against essay length for the same two essays. The behaviour of the D estimate is very similar to that of Yule’s K. For less than about 80 words, the D estimate is very unstable and then levels out with mild fluctuations. The D estimate seems to stabilize at a slightly shorter essay length than Yule’s K and, in one essay at least, seems to show less fluctuation.

The results suggest that entropy is affected by essay length but relatively stable for short texts. On the other hand, the D estimate and Yule’s K seem to be very unstable for short essays under about 80 words in the case of the D estimate and under about 100 words in the case of Yule’s K. Both seem independent of length after this but are affected by fluctuations.

Figure 5.3: The D estimate for increasing essay length for two essays (the D estimate plotted against essay length for the Group 1 and Group 2 essays)

5.2.4 Discussion
In this experiment, both mean word length and mean sentence length showed a tendency to increase with proficiency. However, only mean word length showed a significant difference between the two groups, suggesting that it may be a more likely candidate for assessment. This supports the idea that learners of a higher proficiency are likely to use more infrequent words that will affect the average mean word length. In this set of essays, mean word length by itself was able to assign 71% of essays to the correct group. It is not clear how sensitive this measure is likely to be in reality even if the basic relationship is a sound one. A clear difference in mean word lengths can be seen between the 1K, 2K and UWL frequency levels. However, an essay, even of a proficient learner, will include a lot of high frequency function words. These are likely to be used often by learners of any proficiency. Also, in a task like this, there are likely to be a lot of content words specified by the context. This means that the number of words actually available to increase the mean word length may be limited. It might be worth checking whether eliminating function words and perhaps context-specific words improves the performance of the feature.
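The suggestion to exclude function words could be checked with a small variant of the mean word length estimate. The sketch below is illustrative only; the short function word list in the usage line is a stand-in, not the list used in this thesis.

```python
def mean_content_word_length(tokens, function_words):
    """Mean word length computed over content words only, i.e. tokens not found
    in a supplied function word list (the list passed in is only a stand-in)."""
    content = [w for w in tokens if w.lower() not in function_words]
    return sum(len(w) for w in content) / len(content) if content else 0.0

print(mean_content_word_length("the man walked to the old lighthouse".split(),
                               {"the", "to", "a", "of", "and"}))
```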

The measures of lexical diversity all showed a significant difference between the two groups.

The D estimate was able to assign 89% of essays to the correct group and Yule's K was able to assign 83% of the essays to the correct group. Entropy was less able to discriminate, only assigning 78% of essays to the correct group. Yule's K and entropy are used with long texts where they are thought to be independent of essay length. The comparisons of the two essays for varying lengths of essays highlight the problems of these two measures for short essays. Entropy shows a slight but steady decrease in value with increasing essay length. Yule's K shows itself to be very unstable at less than 100 words but stabilizes after that with fluctuations. The D estimate is much steadier than Yule's K but shows a similar pattern. At lengths under 80 words, it shows considerable instability but levels out afterwards, though it still shows slight fluctuations. For the D estimate, the fluctuations are probably a result of the calculation which involves sampling. This may give a slightly different D score if calculated twice. The evidence seems to support claims that the D estimate is independent of essay length effects but the claim that the D estimate is stable from fifty words seems to be contradicted by these examples. For these two essays, the measure seems to stabilize from about eighty words.

It is interesting to reflect on the problem of relationships with essay length. If a feature has a positive relationship with essay length, any correlation with essay quality may be caused by an underlying positive relationship with essay length. In those cases, standardizing the feature for essay length may help identify whether a relationship actually exists. A positive relationship with essay length is often seen with simple raw counts of features. Correlation with quality may disappear when standardized for essay length. On the other hand, if a feature shows a correlation with quality but has an inverse relationship with essay length, this may mean that this feature is correlating with quality despite being negatively affected by length. The correlation we see may be an underestimate of the true amount. Standardizing for essay length helps us to find the true extent of the correlation. This also means that some measures show little correlation with essay quality because any correlation is being annulled by essay length effects. This may be the case with measures of lexical diversity that often show weak or no correlation with proficiency in raw form, but when the negative influence of essay length is removed, the correlation may be significant. In this experiment, entropy correlates with proficiency even before adjusting for length. The inverse relationship with essay length suggests that this correlation may be underestimated. If essay length was standardized to the length of the shortest essay, a stronger correlation may emerge. However, with low-level learners, the shortest essay can be 100 words or less. It is not clear how stable the entropy value would be at that level.

5.2.5 Conclusion
If we assume that the essays in Group 1 are of higher quality than those in Group 2, the results of this experiment support the view that mean word length may be an indicator of essay quality, with higher values predicting higher quality and lower values predicting lower quality. There is some evidence that mean sentence length is also longer for better essays but this result was not significant so we must conclude that mean sentence length may not be such a good indicator. The measures of Yule's K, entropy and the D estimate all distinguished between essays of the different groups. Analysis based on two essays suggests that the D estimate and Yule's K may be unstable at low levels under about 80 words and 100 words respectively. At longer lengths, both the D estimate and Yule's K seemed to stabilize and be unaffected by essay length but subject to slight fluctuations. Entropy showed a slight inverse relationship with essay length but no evidence of instability at low levels. This suggests that standardizing entropy for essay length may improve performance.

5.3 Word distribution properties of L1 and L2 texts
5.3.1 Introduction
This experiment concentrates on some vocabulary distributions. Zipf (1932) identified a number of general properties of word distributions. One of these properties is the distinctive shape of word frequency distributions. These frequency distributions usually have many words occurring once and progressively fewer words occurring more than once and very few words occurring many times. Another property is observed when words are ranked according to their frequency. Zipf showed that occurrences of these ranked words decrease by factors approximating the power law. A third property involves the inverse relationship of word lengths with frequency. High frequency words tend to be short while infrequent words tend to be longer. In order to investigate application of word distributions to evaluation of L2 essays, various Zipfian distributions of learner and native speaker texts are examined.

5.3.1.1 Aims of the experiment
The aims of this experiment are to see if there are differences between essays of L2 learners and native speaker writing samples in terms of:
1) word frequency distributions.
2) ranked word frequency distributions.
3) word length distributions.

5.3.2 Methodology
5.3.2.1 Participants and tasks
Eleven essays produced by high beginner L2 learners of English were compared with five native speaker written samples. The learner essays were reviews of movies, books, music or restaurants. These essays varied in length from 269 to 418 words. Learners wrote their essays in their own time and were free to use whatever resources they wished. Learners submitted their essays electronically via email. Native speaker samples on matching subjects were also collected (see Appendix 5.1 for sample learner essays and native speaker sample). To control for length effects, analysis was performed on samples of text of equal length. The length of the shortest essay, 269 words, was used as a standard and the first 269 words of each essay were used in the analysis. Samples of the same length were taken from similar reviews written by native speakers.

Comparisons of learners with native speakers may not be ideal for L2 assessment purposes but there are some advantages. Firstly, they maximize the chances of finding differences related to quality. Secondly, comparisons can be made without getting L2 learner essays rated. The dangers need to be acknowledged too. It is not necessarily the case that the same features that differentiate between native samples and learner essays will also discriminate between learner essays of differing quality. The idea that features will develop in a linear fashion from low level learners through high level learners to native speakers may be naive. For example, Larsen-Freeman & Strom (1977) showed that even within learners, there was no linear relationship between the number of errors and essay quality.

5.3.2.2 Analysis
A specially designed computer program, L-distribution (see Appendix 1.1), was used to analyze the essays. For word frequency distributions, the occurrence of each word type in the essay was calculated and then occurrence totals collated, so that the number of words occurring n times in an essay could be calculated from n = 1 to the maximum occurrence of a word in any essay.

For the ranked word distribution, the word types were ranked according to occurrence in the essay from one to v where v is the number of types in the essay. Rank one was the word occurring the most in the essay and rank v was the word occurring least. The rank of these words was graphed from 1 to v against the occurrence of that word.

For the word length distribution, the number of words of length n was calculated from n = 0 to s where s was the length of the longest word in the essay. This was then graphed for n = 0 to s against the number of tokens of that length.
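A minimal sketch of the three distributions described above is given below. It stands in for the L-distribution program, which is not reproduced here; the tokenization and the toy text are assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    """Very simple tokenizer: lowercase and strip surrounding punctuation."""
    return [w.strip(".,!?\"'()").lower() for w in text.split() if w.strip(".,!?\"'()")]

def word_frequency_distribution(tokens):
    """Number of word types occurring exactly n times, for n = 1 up to the
    maximum occurrence; the value at n = 1 is the hapax legomena count."""
    occ = Counter(Counter(tokens).values())
    return {n: occ.get(n, 0) for n in range(1, max(occ) + 1)}

def zipf_rank_distribution(tokens):
    """Word types ranked from most to least frequent, paired with their counts."""
    return [(rank, word, count)
            for rank, (word, count) in enumerate(Counter(tokens).most_common(), start=1)]

def word_length_distribution(tokens):
    """Number of tokens of each length, from 1 up to the longest word."""
    lengths = Counter(len(w) for w in tokens)
    return {n: lengths.get(n, 0) for n in range(1, max(lengths) + 1)}

text = "The man and the dog walked to the beach and the man saw the sea."
tok = tokenize(text)
print(word_frequency_distribution(tok))
print(zipf_rank_distribution(tok)[:5])
print(word_length_distribution(tok))
```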

5.3.3 Results
5.3.3.1 Word frequency distribution

Figure 5.4: Word frequency distribution for learner essays (number of words plotted against occurrences, 1 to 15)

Figure 5.5: Word frequency distribution for native samples (number of words plotted against occurrences, 1 to 15)

Figure 5.4 shows the number of words occurring n times for each of the eleven learner essays while Figure 5.5 shows the same for the native samples. The distributions show a typical shape with a large number of words occurring once (hapax legomena) and then progressively smaller numbers of words occurring at higher frequencies. The graphs highlight the problem of using distributions for comparisons of large numbers of texts. It is difficult to see any discernible difference between the distributions once there are more than a few cases involved. There does, however, seem to be one clear difference between the learner essays and the native samples. The learner essays seem to have fewer hapax legomena than the native samples.

In the learner essays, the number of hapax legomena peaks at around 120 words but for the natives, the number goes as high as 150. This is more clearly shown in Figure 5.6. In fact, there is not only a significant difference between the two groups of natives and learners but also an absolute difference with all the native samples having more hapax legomena than all of the learner essays. The fewest hapax legomena amongst the native samples is 127 (47.2%) and the most in the learner essays is 120 (44.6%). The difference in occurrence of hapax legomena stands in contrast to the occurrence of hapax dislegomena (words appearing twice and twice only in a text) which shows no obvious difference between native speakers and L2 learners.

Figure 5.6: Hapax legomena for native and learner essays (number of words occurring once and twice for native and learner texts)

This suggests that hapax legomena may be a feature that can help predict whether an essay is written by a learner or a native.

5.3.3.2 Word rank distributions
For each text, words were ranked according to their frequency in that text. Rank one indicated the most frequently occurring word. Figure 5.7 shows distributions of these Zipf ranked words versus frequency at ranks from one to fifteen. Although there are some differently shaped distributions, there are no clear differences in distributions between native samples and learner essays.

Figure 5.7: Zipf ranked words 1 to 15 (frequency plotted against rank for native and learner texts)

Table 5.3 shows the actual words that are ranked one to five for each essay or sample. Here trends seem more obvious. In the table, N1 to N5 are the native samples and L1 to L11 are the learner essays. The is the top-ranked word for all the native samples but the top-ranked word in only three out of eleven learner essays. In fact, in four learner essays, the does not appear in the five top ranked words. Also, a appears in the top three words in all native samples but in the top five words of only four of eleven learner essays. According to the British National Corpus (BNC) (Leech, Rayson & Wilson, 2001), the and a account for 6.44% and 2.20% respectively of all words in written English. Figure 5.8 shows deviations from the average occurrence of the and a for all texts. All the native samples and some learner essays have only slight deviations from the expected values but many learner essays have quite large deviations.

Table 5.3: Top five Zipf ranked words for each text (words sharing a rank are grouped by commas)

Essay   Ranked words 1 to 5
N1      the   a, is   Pete's, that
N2      the   a   and   of, to
N3      the   in   a, he, is, of, to
N4      the   is   a   and, in
N5      the   in   a   and   his
L1      is   he   in   move   Leon
L2      her   she   in   is, Josie
L3      she   in   and   a, of
L4      the   is   and   Disney, songs, you
L5      is   the   and   of   a, he
L6      they   the   and   in, music, them
L7      and   in, the   is   she, to
L8      the   and   he   in   a, to
L9      they   and, the   have, this, to
L10     of   was   in, it, to
L11     the   is   of   spaghetti   a, there, to

Figure 5.8: Deviations from typical occurrence of a and the (deviation from typical a plotted against deviation from typical the). Note: one learner deviation is so great that it is not covered by the graph.

As with word frequency distributions, it is difficult to assess differences in patterns between native samples and learner essays by considering the ranked word graphs. However, some features show potential to help differentiate between them such as the top-ranked words, or the relative occurrence of frequent words such as a and the.

5.3.3.3 Word length distributions
Figure 5.9 shows the distribution of word lengths for each essay or sample.

Figure 5.9: Distribution of word lengths for native and learner essays (number of words plotted against word length for native and learner texts)

Although some learner essays have higher numbers of words at the length 3 and 4 level and some native samples have higher numbers of words at the length 7, 8 and 9 level, the distributions do not show any obvious patterns of differences. However, Table 5.4 shows that the mean word lengths for native and learner groups are significantly different. This seems to support the evidence from the previous experiment that better essays are likely to have a higher mean word length. Although the groups of native samples and learner essays are significantly different in relation to mean word length, they are not exclusively different. Some learner essays have higher means than one of the native samples. Mean word length was also calculated for only hapax legomena and showed the same pattern.

Table 5.4: Mean word length for native and learner texts

                    Native samples   Learner essays
Mean word length    4.75             4.40
sd                  0.26             0.14
high                5.24             4.53
low                 4.48             4.08

5.3.4 Discussion
The results of this experiment seem to suggest that while graphs are a good visual way of appreciating distributional features for small numbers of essays, they are difficult to compare with larger numbers of essays. However, in this experiment, it has been shown that various distributional features show some patterns of difference between native samples and learner essays. There was consistently a higher proportion of hapax legomena in native samples than in learner essays. It may be that the proportion of these words only occurring once in an essay could be an indicator of the vocabulary available to the writer.

For Zipf ranked words, the and a were ranked highly in all native samples. In all native samples, the was the top ranked word and a was in the top three ranked words. However, the was only top ranked in three out of eleven learner essays and in some essays did not rank in the top five ranked words. The relative occurrence of the and a in native samples seemed to be consistent with the relative occurrence of these words in general English according to the BNC. On the other hand, in some learner essays, the occurrence of these words was quite different from what we find in general English. These two words may be useful because of their high frequency in English. This should make them more reliable indicators than other less frequent words. Relative occurrence of words has been used extensively in author attribution studies. For example, Mosteller & Wallace (1984) used relative occurrences of various synonyms to make a prediction of authorship for The Federalist Papers. In learner essays, few words can be used for this kind of analysis because of reliability concerns related to low frequency items. The and a may be exceptions. Another possibility is relative occurrences of classes of words that can be counted automatically, for example, content words, function words, conjunctions, pronouns etc. The proportions of articles, pronouns and conjunctions have all been shown to correlate moderately with quality assessments by Evola, Mamer & Lentz (1980).

As the results of the previous experiment suggested, there seems to be potential for using mean word length to differentiate between essays according to quality. However, the scale of the differences found here was not as conclusive as we might have hoped for. It might be that mean word length is not a sensitive enough measure given the small number of words available in most learner essays.

As mentioned in the introduction, one needs to be cautious about generalizing results from comparisons of native and learner texts to differences between essays from learners of different proficiencies. It is not clear how effective these features would be for evaluating differences in quality within learner essays. For example, the incidence of the and a would probably be less effective in discriminating between learner essays than in identifying native speaker essays. However, these comparisons may have highlighted some features for further investigation. The properties of Zipf word distributions may be developed further for assessment. Meara & Miralpeix (2007b) have developed a computer tool based on a Zipf distribution for predicting productive vocabulary size given a sample of writing.

5.3.5 Conclusion
This very simplistic analysis of various Zipfian word distributions suggests that there are various useful properties of word distributions that may be used to aid classification of essays into native and non-native essays. In this limited study, the proportion of hapax legomena, the rank and relative proportion of a and the, and mean word length all showed promise for distinguishing between native and learner essays. It may be that some of these features may be useful in determining proficiency of learner essays. Hapax legomena is a strong possibility. It is quite possible that the proportion of hapax legomena increases as essay quality increases and so the number of hapax legomena may act as a proxy measure for vocabulary size. In the case of the relative proportion of the and a, it seems less likely that it would be useful for determining essay quality at lower levels of proficiency. However, the relative proportions of some easily countable classes warrant further research.

5.4 The type-token ratio and lexical variation
5.4.1 Introduction
This experiment investigates the performance of two lexical diversity ratios. The first is the type-token ratio (TTR) and the second is lexical variation (LV). The TTR is the number of types in the essay divided by the number of tokens while lexical variation includes only lexical items as follows:
TTR = types / tokens
LV = lexical types / lexical tokens

If lexical words can be equated with content words then the TTR includes content and function words while lexical variation includes only content words. Calculating lexical variation is not as straightforward as calculating TTR since decisions have to be made about what constitutes a lexical item. This is a relatively simple task for a human but not so simple to do automatically. An exhaustive list of lexical items for a computer program to check cannot be compiled easily, but it is still possible to estimate lexical variation indirectly by compiling a list of function words, calculating the number of function types and tokens for an essay, and then getting an estimate of lexical types and tokens by subtracting function types and tokens from total types and tokens.
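The indirect calculation described here can be sketched as follows. The short function word set is purely illustrative and stands in for the BNC-derived list in Appendix 5.2, which is not reproduced.

```python
from collections import Counter

# Illustrative stand-in for the full function word list in Appendix 5.2.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "to", "of", "in", "on",
                  "is", "are", "was", "were", "he", "she", "it", "they", "his", "her"}

def ttr_and_lv(tokens, function_words=FUNCTION_WORDS):
    """TTR = types / tokens; LV estimated indirectly by subtracting function
    types and function tokens from the totals."""
    counts = Counter(w.lower() for w in tokens)
    types, toks = len(counts), sum(counts.values())
    func_types = sum(1 for w in counts if w in function_words)
    func_toks = sum(c for w, c in counts.items() if w in function_words)
    ttr = types / toks
    lv = (types - func_types) / (toks - func_toks) if toks > func_toks else 0.0
    return ttr, lv

print(ttr_and_lv("the man and the dog walked to the beach and saw the sea".split()))
```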

5.4.1.1 Aims of the experiment
The aims of this experiment are to:
1) test the reliability of TTR and lexical variation over two tasks written by the same low level L2 learners.
2) test how TTR and lexical variation predict quality of L2 learner essays.

5.4.2 Methodology
5.4.2.1 Participants and tasks
Two short essays were written by 14 low level L2 learners of English based on the cartoon tasks in Appendix 3.1 and 3.2. These essays were holistically rated into two groups, Good and Poor. Eighteen were graded Good and ten Poor.

5.4.2.2 Analysis
The essays were analyzed by a specially designed computer program, L-analyzer (see Appendix 1.1), and the two lexical statistics were calculated. TTR can be calculated directly using a simple count of the essay length in words and the number of different word types. Making a decision about lexical words is difficult, so estimates of lexical types and tokens were calculated indirectly using a list of common function words. The function words used were taken from the most commonly occurring words in the BNC (Leech, Rayson & Wilson, 2001). See Appendix 5.2 for a full list of function words used in the analysis.

5.4.3 Results
5.4.3.1 Reliability over two tasks
The essays varied in length from 103 to 218 words with an average of 149 words. The values for TTR and lexical variation are shown in Table 5.5. Lexical variation was higher than TTR in all but two texts (26 out of 28 cases). Lexical variation ranged from 0.32 to 0.65 with an average of 0.51. TTR varied from 0.34 to 0.55 with an average of 0.45. Both values had a similar average and range over the two tasks.

Table 5.5: Mean, maximum and minimum values for TTR and LV

          TTR                       LV
          av.     hi      lo        av.     hi      lo
Task 1    0.45    0.53    0.34      0.52    0.64    0.36
Task 2    0.44    0.55    0.35      0.51    0.65    0.32
r         0.79**                    0.68*

**significant at the .001 level   *significant at the .01 level

The greatest difference of lexical variation over TTR was LV = 0.64 to TTR = 0.48. The greatest difference of TTR over lexical variation was TTR = 0.35 to LV = 0.32. The two cases of TTR higher than lexical variation both occurred in essays rated Poor. It is not surprising that values of lexical variation tend to be higher than for TTR since the words that occur most frequently are likely to be function words. Function words are more likely to occur multiple times than content words. Thus when they are subtracted they affect the denominator of the ratio more than the numerator making the LV larger than the TTR. For example, one of the essays contained 125 tokens and 66 types. The word that occurred most often in the essay was a function word, the, occurring 13 times. Just taking this one function word out increases the ratio because we lose one type but 13 tokens as follows: TTR = 66/125 = 0.53 TTR(without the) = 65/112 = 0.58

Where a TTR measure is larger than a lexical variation measure, the most frequently occurring words are likely to be content words rather than function words. This may indicate over repetition of nouns rather than use of pronouns. This is likely to be a feature of a poor quality essay.

In regard to reliability, the scores for both statistics correlated significantly over the two tasks. TTR was more reliable than lexical variation, with r = 0.79 compared with r = 0.68 for lexical variation.

5.4.3.2 Relationship to quality assessments
To investigate the predictive performance of these two measures we can compare them against the holistic ratings. If we envisage the holistic ratings as a continuum where the top 18 are the essays rated Good and the last 10 rated Poor then we can chart the success of lexical variation and TTR in predicting quality as in Table 5.6.

Table 5.6: Ratings of essays ordered by lexical score

Holistic ratings     GGGGGGGGGGGGGGGGGGPPPPPPPPPP
Lexical variation    GGGGGGPGGGGPGPGPGGPGGGPGPPPP
TTR                  GGGGGGGPGPPGGPGGGGGPPPGPGPPG

Both lexical measures correctly categorized 71% of the essays as being good or poor. However, taking order of values into account, lexical variation corresponded slightly more than TTR with holistic ratings of the essay. This representation shows that in the case of lexical variation, the six highest values corresponded to essays rated as Good. The seventh highest value of lexical variation was rated Poor but the next four were also rated as Good. The lowest four values of lexical variation were rated Poor. In the case of TTR, the highest seven values corresponded to essays rated Good but also the lowest value of TTR was rated Good. High values of both lexical variation and TTR seemed to indicate Good rated texts. However, low values of lexical variation better indicated Poor texts than low values of TTR (highlighted by the 2 Poor cases where lexical variation was lower than TTR). Middling values of TTR and lexical variation seemed to be little use in defining whether an essay was Good or Poor.

5.4.4 Discussion
Although this is a very small scale study, it produced some interesting observations. Firstly, it suggests that TTR may be more reliable than lexical variation over two tasks by the same learners. This may be because the use of function words is more consistent than the use of content words. The improved reliability may also be because simply more words are involved. This may be another consideration when deciding whether to use only content words or to include function words. Many measures of lexical diversity use a smaller class of words rather than just types or tokens. Lexical variation uses lexical types and lexical tokens. Selection of content words rather than function words is also an issue with some other features. For example, in the first experiment of this chapter, it was speculated that mean word length might be improved by focusing on content words because function words may contain noise. In some cases, content words may focus on the words where learners are likely to be most different. However, a counterargument may be that if the function words are not completely noise, they may help in the analysis by providing more words.

Secondly, it shows that even two very similar measurements may have different domains of usefulness when it comes to assessment. The results of this experiment suggest that high values of both TTR and lexical variation may be a strong indicator of a high quality essay, whereas a low value of lexical variation may, in some cases, be a better predictor of a low quality essay than a low value of TTR. Middling values of either feature may be less useful in predicting quality.

Both measures do surprisingly well in predicting quality with these essays given their dependence on essay length. Because many low level learner essays tend not to vary much in length, there may be many occasions where they can be used in their raw state. Controlling for length is a way to improve performance and may be necessary in cases where there is considerable variation in length of essay. Then TTR or LV could be used on a sample of words so that TTR(n) and LV(n) would indicate that each measure was based on a sample of n words. However, one concern might be that if length is controlled for, then the shortest essay is usually set as the standard. Where the shortest essay is very short, this may be a problem. In an automatic analysis, it might not always be best to standardize to the shortest length if it is very short. It might be best to decide a minimum length under which shorter essays are treated in a different way.
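One way to implement such a rule is sketched below: the measure is computed on the first n words only, and essays shorter than a chosen minimum are flagged for separate treatment. The cut-off of 80 words is an illustrative value, not one proposed in this thesis.

```python
def ttr_n(tokens, n, minimum=80):
    """TTR computed on the first n words only (TTR(n)). Essays shorter than
    `minimum` words are flagged for separate treatment rather than scored;
    the default of 80 is only an illustrative cut-off."""
    if len(tokens) < minimum:
        return None
    sample = [w.lower() for w in tokens[:n]]
    return len(set(sample)) / len(sample)
```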

5.4.5 Conclusion
This experiment investigated the reliability of lexical variation and TTR for two sets of essays written by L2 learners. Both measures scored relatively highly for reliability. Lexical variation had a correlation of r = 0.68 and TTR a correlation of r = 0.79 over the two tasks. This study also investigated the relationship of these measures with essay quality. Both measures performed relatively well at predicting quality with a 71% decision agreement with the holistic ratings. However, there is a suggestion that these measures operate more reliably in different domains. High values of both lexical variation and TTR seem to indicate good quality essays while low values of lexical variation may be better at predicting poor essays than low values of TTR. This is a very small study so the results are highly speculative and should be viewed with caution but could form the basis for more research.

5.5 LFP and P_Lex
5.5.1 Introduction
This brief experiment investigates the relative performances of P_Lex (Meara & Bell, 2001) and the LFP (Laufer & Nation, 1995) for two sets of L2 essays produced by low-level learners.

These measures are referred to as extrinsic measures of lexical diversity by Meara & Bell (2001). They measure lexical diversity against a benchmark of frequency in general English which may give a guide to the difficulty or sophistication of the vocabulary used. These may be more useful in gauging quality of an essay than lexical diversity measures based only on types and tokens which do not inform about the quality of the lexical items. However, because these measures depend on benchmark measures external to the essays, it is important that they correspond to the learners and to the task. If low level learners do not produce a range of vocabulary that can be identified by these measures, then they may not be effective. Similarly, if a task does not elicit a wide enough range of vocabulary, the measures may not be effective. This experiment investigates the performance of the two measures for low level L2 learners of English.

5.5.1.1 Aims of the experiment
The aims of the experiment are to:
1) test the reliability of LFP over two tasks written by the same low level L2 learners.
2) test the reliability of P_Lex over the same two tasks.

5.5.2 Methodology
5.5.2.1 Participants and tasks
The two sets of essays based on the cartoon tasks from the previous experiment were used (see Appendix 3.1 & 3.2 for tasks).

5.5.2.2 Analysis
The RANGE computer program (Heatley, Nation & Coxhead, 2002) was used to calculate the LFP for each essay on each task. As part of the preparation of files for input into the program, words with spelling errors were deleted.

Table 5.7: Not in lists words for Tasks 1 & 2

Task     Not in lists words
Task 1   beach (2), ok, passion, pants (2), smart, boring, amazing, bog*, sharks, smily, cute, hobby, jacket
Task 2   crazy, carpenter, amazing, ammonia, pole, restroom, zone, impatient (2), smart, stair, toilet (2), ok

*probably a spelling error of dog

RANGE produces output in four categories: 1K, 2K, UWL and not in lists. Laufer & Nation (1995) suggest using three categories of LFP for low level learners. These are 1K, 2K and any other words. This means amalgamating the UWL and not in lists categories to get the any other words category. This decision was backed up by the analysis which showed a total of only eight words from the fourteen candidates from the UWL for Task 1 and a total of only two for Task 2. There was a total of 15 not in lists words for Task 1 and 14 for Task 2 but these include words repeated in several essays. The not in lists words are shown in Table 5.7.

To calculate P_Lex, the P_Lex computer program (Meara, 2007) was used. The ratio of difficult words to easy words was also calculated. Difficult and easy words were defined in the same way as for P_Lex. Easy words were words in the 1K category and proper nouns, while difficult words were words beyond this category.
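The difficult/easy ratio described here is simple to compute once a 1K word list is available. The sketch below is illustrative and does not reproduce the P_Lex program itself; the easy_words set is assumed to be supplied from a 1K frequency list, which is not included here.

```python
def difficult_easy_ratio(tokens, easy_words, proper_nouns=frozenset()):
    """Ratio of difficult to easy words, with easy words defined as 1K-list words
    plus proper nouns and difficult words as everything else. `easy_words` stands
    in for a 1K frequency list, which is not reproduced here."""
    easy = sum(1 for w in tokens if w.lower() in easy_words or w in proper_nouns)
    difficult = len(tokens) - easy
    return difficult / easy if easy else float("inf")
```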

5.5.3 Results
5.5.3.1 LFP
The LFPs for all learners are shown in Table 5.8. The profiles all showed a high proportion of words from the 1K category. For Task 1, the percentage of words in the 1K category ranged from 83.3% to 96.6% with an average of 90.3%. The percentage of words in the 2K category varied from 2.9% to 13.3% with an average of 7.1%. The percentage of any other words varied from 0% to 4.6% with an average of 2.6%.

Table 5.8: LFPs for 14 learners on two tasks

                            Task 1                  Task 2
Student   High/Low    1K     2K    other      1K     2K    other
1         H          88.2    8.8   3         91      7.5   1.5
2         H          92.6    5.3   2.1       95.6    3.3   1.1
3         H          93.3    5     1.7       96.4    2.4   1.2
4         H          95.7    2.9   1.5       90.8    7.9   1.3
5         H          96.6    3.5   0         94.3    2.9   2.9
6         H          88.6    8     3.4       87.5    8.3   4.2
7         H          87.8   10.8   1.4       93.6    4.8   1.6
8         H          94.1    5.9   0         93.1    4.2   2.8
9         H          84.6   10.8   4.6       94.4    1.9   3.7
10        L          90.4    5.8   3.9       95.3    4.7   0
11        L          87.8    8.2   4.1       95.1    2.4   2.4
12        L          88.5    7.7   3.8       94.4    5.6   0
13        L          92.7    3.6   3.6       92.9    5.4   1.8
14        L          83.3   13.3   3.3       95.6    4.4   0

For Task 2, the percentage of words in the 1K category seemed a little higher than for Task 1. Percentages ranged from 87.6% to 96.4% with an average of 93.6%. The percentage of words in the 2K category varied from 1.9% to 8.3% with an average of 4.7%. The percentage of any other words varied from 0% to 4.2% with an average of 1.8%.

To test the reliability of LFP over the two tasks, the proportions of words in each category were compared independently over the two tasks and the results are shown in Table 5.9.

Table 5.9: Reliability statistics for LFP over Tasks 1 & 2

              1K words   2K words   other words
reliability     -.07       -.08        -.11

The LFP percentages by category on the two tasks show no obvious correlation. This suggests that the LFP may not be suitable for low level learners on this type of task.

5.5.3.2 P_Lex

P_Lex scores and the ratio of difficult to easy words for each task are shown in Table 5.10. The first impression is that P_Lex scores were low. In fact, all essays had a P_Lex score of less than 1 and some were close to zero. The mean P_Lex score was 0.38 for Task 1 and 0.22 for Task 2, and there was no correlation between the two tasks (r = -0.02). The values for the ratio of difficult to easy words were also very low, with a mean of 0.09 in Task 1 and 0.04 in Task 2, and again there was no evidence of correlation between the tasks (r = -0.09). By contrast, TTR showed a correlation of r = 0.79 over the two tasks, with a mean TTR of 0.45 in Task 1 and 0.44 in Task 2.

Table 5.10: P_Lex and difficult/easy words for Tasks 1 & 2 (T1 & T2)

            P_Lex                Difficult/Easy
        T1        T2          T1        T2
Mean    0.38      0.22        0.09      0.04
s.d.    0.22      0.18        0.06      0.02
high    0.88      0.71        0.19      0.08
low     0.02      0.02        0.02      0.02
r          -0.02                  -0.09

5.5.4 Discussion

The results of this experiment raise doubts about the appropriateness of both the LFP and P_Lex for very low level learners or for this type of task. Whether the poor results are due to the proficiency of the learners or to the task involved is not clear; both probably contribute to some degree. Many of the essays contained no words from the upper two levels of the LFP, and most words came from the lowest level. The same seems to be true for P_Lex: scores were very low because of a relative shortage of difficult words. Meara & Bell (2001) report ranges of P_Lex from 0 to 4.5 and above, yet the highest value in this study was still less than one. A comparison of various learner and native speaker texts suggests that the higher values in the range stated by Meara & Bell may only be found in specialist L1 texts; for example, an academic paper displayed a P_Lex score of 3.5 and a newspaper sports report a score of 2.5.

The lack of reliability over these two similar tasks may be due to the context sensitivity of P_Lex, or it may be due to the inherent unreliability associated with low totals. It could be that the differing contexts of the two cartoons call for different sets of vocabulary at different frequency levels, and there is some evidence for this, with mean P_Lex scores on Task 1 being higher than on Task 2. However, another method of assessing reliability, the split halves method, suggests that the lack of reliability is not only due to the differing contexts. Since many of these essays were quite short, a full split halves analysis was not possible, but a split half analysis of some of the longer essays found cases where the two halves produced quite different P_Lex scores.

However, the poor performance in this experiment should not detract from P_Lex and LFP as measures. Meara & Bell (2001) and Laufer & Nation (1995) have shown them to be effective with higher proficiency learners and different tasks. Meara & Bell have noted the context specific effect of P_Lex and argue that this characteristic could be a strength, making it useful for evaluating the difficulty of tasks. The problem of poor tasks could be controlled for by pre-testing tasks to ensure they elicit sufficient vocabulary from each level. The low scores in this experiment may be a warning that this type of task does not elicit a wide enough range of vocabulary for assessing learners.

In contrast to the poor performance of LFP and P_Lex, the measures of intrinsic lexical diversity examined in the previous experiment seemed to perform well on these tasks. Both TTR and lexical variation showed no difference in means over the two tasks, and scores were highly correlated across the tasks.

5.5.5 Conclusion

In this experiment, the performance of two extrinsic lexical diversity measures, LFP and P_Lex, was compared on two written tasks by low level learners. Both measures seemed inappropriate for this type of learner, who produced vocabulary overwhelmingly from the first 1000 words: LFP was highly skewed towards the lower 1K word category, and P_Lex scores were very low. Neither measure showed reliability over the two tasks, probably because the words outside the first 1000 words were too few in number to be measured reliably. The lack of diversity may not be entirely a function of the learners; it is possible that the task itself does not encourage diverse vocabulary. Although this does not detract from the usefulness of these measures, which is well documented elsewhere, the experiment shows the importance of using them with appropriate learners and tasks.

5.6 Conclusion

In this chapter, four exploratory studies have revealed some features that may be applied to predicting essay quality in L2 learners. However, these were all very small scale studies, so the results are far from conclusive, but they may identify areas for further research. In the first experiment, mean word length was found to correlate with essay quality: higher mean word lengths seemed to indicate better essays and lower values poorer quality essays. Given Zipf's rule that more frequent words tend to be shorter, mean word length seems an intuitive approach to measuring vocabulary sophistication. One concern, however, is whether this measure is sensitive enough to detect differences given the small number of content words in a typical learner essay. In addition, this study showed that measures such as Yule's K and entropy, which were thought to be unstable for short texts, performed surprisingly well on short L2 essays.

In the second experiment, an analysis of various word distributions revealed some features that may be able to help discriminate between native and learner essays. The proportion of hapax legomena and the relative occurrence of the words the and a seemed most effective. Whether these features could be used with L2 essay assessment is not clear. The case for the proportion of hapax legomena as a proxy measure for vocabulary size may be appealing but the relative occurrence of the and a seems less applicable to purely L2 contexts. However, relative occurrences of other features or word classes may be worth investigation.

In the third experiment, lexical variation and the TTR were shown to be relatively reliable over repeated tasks by the same learners. They also both seemed to be good at discriminating between good and poor rated essays. The results also suggested that lexical variation might be better at indicating poor quality essays. This showed that in some cases, concentrating on content words may give more precise information. The flipside is that by eliminating words, reliability may be compromised.

In the final study, LFP and P_Lex were found not to be reliable over two simple writing tasks by the same learners. This suggests that we need to be very careful when selecting measures to ensure that they are appropriate for learners and tasks. In particular, measures which depend on external criteria such as frequency bands need to be calibrated to fit the learners and the tasks. Comparing these results with those in the previous experiments suggests that intrinsic measures perform better with essays of lower level learners both in terms of reliability and predicting essay quality.

This chapter concludes the first part of the experimental work. In these three chapters, various features have been considered for a role in automatic assessment. Many show some promise but none seems powerful enough to be considered as a single measure of essay quality. In the next three chapters, the focus of the experimental work shifts to finding systems where complementary features can be used together. In the next chapter, a simple two-dimensional model of quantity versus content incorporating essay length and lexical diversity is considered.

Chapter Six: A quantity/content model

6.1 Introduction

In the previous chapters, various features and their relationship with essay quality were examined. Although many features showed a strong association with essay quality, none was strong enough to entirely account for it. In the next three chapters, the emphasis is on finding a system which incorporates several features to account for essay quality.

This chapter explores a very simple model for essay assessment. Many studies have shown that both essay length and lexical diversity can be strongly correlated with essay ratings. However, despite significant correlations, these features are by themselves inadequate, both empirically and theoretically, to account for essay ratings. In this experiment, we consider whether together they may account for essay ratings better. To this end, we investigate a simple two-dimensional quantity/content model of essay assessment in which quantity is represented by essay length and content by lexical diversity.

6.1.1 Aims of the experiment

The aims of this experiment are to see:
1) if a two-dimensional quantity/content model employing essay length and lexical diversity can predict assessment of these essays better than a single dimension of either quantity or content.
2) which measure of lexical diversity works best as the content dimension alongside the quantity dimension of essay length for this set of essays.

6.2 Methodology

6.2.1 Participants and tasks

A set of 34 essays written in a timed format was collected. The participants were third year English majors at a Japanese university. The task involved writing an essay in thirty minutes in response to the following prompt (see Appendix 6.1 for sample essays):

Watching television is bad for children. Do you agree or disagree with this statement? Use specific reasons and examples to support your answer.

6.2.2 Essay ratings

The essays were assessed by a native speaker judge as one of five categories: good, above average, average, below average, and poor. The number of essays for each rating is shown in Table 6.1.

Table 6.1: The number of essays for each rating category

Rating          good   above average   average   below average   poor
No. of essays   4      5               18        6               1

6.2.3 Analysis

6.2.3.1 Selection of quantity and content measures

A simple two-dimensional model of quantity and content was used, with the two dimensions given equal weight. Quantity was represented by essay length in words and content by lexical diversity. Several lexical diversity measures were evaluated. There were two considerations in selecting these measures. One was that they had been shown to correlate with essay ratings in other studies. The other was that they were easy to calculate and could be embedded in a simple computer program. This second condition precluded using a measure such as LFP, because the profile output is difficult for a computer program to process and combine with other measures. It also precluded P_Lex, which requires a special computer program to calculate. The same was true of Malvern & Richards' D estimate; however, it was included in this analysis for comparison in its role as the most commonly used measure of lexical diversity. The measures considered in this analysis were as follows:
1) TTR(100), the number of word types in a hundred word sample
2) Guiraud Index
3) Yule's K
4) the D estimate
5) Hapax(100), the proportion of hapax legomena in a hundred word sample
6) an Advanced Guiraud estimate

The first four measures are all alternatives to standard TTR and attempt to get around the effect of essay length. TTR of a fixed sample is in effect the number of word types in that sample. The fixed sample length compensates for length effects, but the drawback is that the measure does not use all the word types of longer essays. There are also practical problems with essays that contain fewer than one hundred words. Guiraud Index (GI) is a variant of TTR and is the number of word types divided by the square root of the number of word tokens:

GI = V / √N

where V is the number of word types and N is the number of word tokens. Guiraud Index is also affected by essay length, but in a different way to TTR.

Yule's K was found in Chapter Five to be remarkably robust even for short essays, although values for essays shorter than one hundred words may be unpredictable. The D estimate is also included for comparison, but it should be noted that results in Chapter Five suggested that values of the D estimate for essays below 80 words may be unreliable.

Results of experiments in Chapter Five also suggested that the proportion of hapax legomena may be a good discriminator between learner and native essays. It is included here to see how it performs in discriminating between learners. In Chapter Five, the comparisons were made on essay samples of equal length; as this measure may also be affected by length, it was calculated here for a hundred word sample. The final measure, Advanced Guiraud, proposed by Daller, van Hout & Treffers-Daller (2003), is considered as a measure of extrinsic lexical diversity.

Advanced Guiraud (AG) is the number of advanced word types divided by the square root of the total number of tokens:

AG = V_A / √N

where V_A is the number of advanced word types and N is the total number of word tokens. As with Guiraud Index, Advanced Guiraud may be affected by essay length but to a lesser extent than TTR.

In Chapter Five, LFP and P_Lex were found to be problematic with some low level learners because the learners failed to produce enough advanced words. The students in this experiment are of a higher proficiency, so we might expect more advanced types to be used. Since P_Lex and LFP are difficult to use in an automated system, Advanced Guiraud may be an alternative extrinsic measure of lexical diversity to include in the analysis. However, in this analysis an estimate of Advanced Guiraud is used. This involves estimating advanced word types by subtracting frequent types from total types. A concern is that the measure may be contaminated by error variants of frequent types. For this study, frequent types were defined as word types from a list based on the first 1000 words of the JACET word list (Ishikawa et al., 2003).
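A minimal Python sketch of how Guiraud Index and this estimate of Advanced Guiraud might be computed is given below. The frequent_types and exclusions arguments are assumed to be supplied from the JACET-based 1K list and the error and proper noun lists, and the tokenizer is a simplification; this is a sketch under those assumptions, not the program used in the study.

import math
import re

def word_types_and_tokens(text):
    """Return the set of word types and the list of word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return set(tokens), tokens

def guiraud_index(text):
    """Guiraud Index: word types divided by the square root of word tokens."""
    types, tokens = word_types_and_tokens(text)
    return len(types) / math.sqrt(len(tokens))

def advanced_guiraud_estimate(text, frequent_types, exclusions=frozenset()):
    """Estimate of Advanced Guiraud: types outside the frequent list (and
    not excluded as errors or proper nouns) divided by the square root of
    the total number of tokens."""
    types, tokens = word_types_and_tokens(text)
    advanced = {t for t in types if t not in frequent_types and t not in exclusions}
    return len(advanced) / math.sqrt(len(tokens))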

6.2.3.2 Sampled features

The effect of essay length on TTR is well documented: as essay length increases, TTR tends to decrease, which makes it problematic for comparing essays of varying length. There is also a suggestion that the proportion of hapax legomena may decrease with essay length. Therefore, both of these features were calculated for a hundred word sample from each essay. However, six essays were shorter than one hundred words (95, 95, 86, 79, 68 and 50 words), and these were treated differently for the two sampled measures. To make up the shortfall, additional words were sampled from each essay and added to it to bring the token count up to a hundred; the word tokens were thus increased to a hundred but the word types were not. This procedure was incorporated into the computer program. Only TTR and hapax legomena were calculated from sampled values, and the notation TTR(100) and Hapax(100) is used to show that these measures are based on a sample of 100 words. All other features were calculated from the raw data.
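A hedged sketch of the sampling procedure is shown below. The thesis does not state exactly how the hundred word sample was drawn for longer essays, so taking the first hundred tokens here is only one plausible reading; the topping-up step for short essays follows the description above, re-sampling tokens from the essay itself so that no new types can appear.

import random
from collections import Counter

def sample_100_tokens(tokens, seed=0):
    """Return a 100-token sample.  Essays with at least 100 words are
    truncated to their first 100 tokens (an assumption); shorter essays
    are topped up by re-sampling their own tokens, which raises the token
    count but cannot add new word types."""
    if len(tokens) >= 100:
        return tokens[:100]
    rng = random.Random(seed)
    topped_up = list(tokens)
    while len(topped_up) < 100:
        topped_up.append(rng.choice(tokens))
    return topped_up

def ttr_100(tokens):
    """TTR(100): the number of word types in the 100-token sample."""
    return len(set(sample_100_tokens(tokens)))

def hapax_100(tokens):
    """Hapax(100): the number of hapax legomena in the 100-token sample."""
    counts = Counter(sample_100_tokens(tokens))
    return sum(1 for word, c in counts.items() if c == 1)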

Guiraud measures have been found to be affected by essay length. Malvern et al. (2004) found that Guiraud Index rises with increasing essay length for the first few hundred words but then decreases steadily as length increases further. Given the range of essay lengths in this set of essays, these Guiraud measures are likely to be rising with essay length. Features with a positive relationship with essay length are easier to incorporate into an analysis than features with a negative relationship. Therefore, the Guiraud measures were not controlled for essay length, but the relationship with essay length needs to be taken into account in interpretation.

6.2.3.3 Calculation of quantity and content dimensions

For each essay, the measures of lexical diversity were calculated along with essay length. To ensure an equal weighting and easy comparison between essay length and the measure of lexical diversity, the values for each measure were standardized over the 34 essays by transforming them into z scores. Z scores are calculated by subtracting the mean from each value and then dividing by the standard deviation, so that the mean of each measure over the essays is zero with a standard deviation of one.

6.2.3.4 Comparison of content variables

The two-dimensional model using z scores is graphed for each measure of lexical diversity. The use of z scores makes interpretation straightforward. On a simple graph, four quadrants can be seen, as shown in Figure 6.1. Quantity, represented by essay length, is graphed on the x-axis, while content, represented by a measure of lexical diversity, is graphed on the y-axis. Quadrants A and B are of particular interest. Quadrant A indicates a domain where essays are above average on both dimensions; if the quantity/content model is sound, highly rated essays should appear here. Conversely, Quadrant B indicates a domain where essays are below average on both dimensions, and we expect essays here to be rated low. Quadrant C shows essays that are above average on one dimension, in this case essay length, but below average on the second dimension, lexical diversity. Quadrant D shows essays that are above average on lexical diversity but below average on essay length. The quality of essays appearing in Quadrants C and D is likely to be more difficult to predict. In this graphical representation, essay points close to the origin will also be close to other quadrants and are potentially unreliable. Therefore, we would hope that essays assessed as high or low quality are clearly located within the respective quadrants, not close to the origin or either axis.

Figure 6.1: Four quadrants in a standardized variable graph (A top right, D top left, B bottom left, C bottom right)

Yule’s K is a special case of lexical diversity because its values have an inverse relationship with diversity. Lower values of Yule’s K indicate higher lexical diversity and higher values indicate lower lexical diversity. To counter this and align Yule’s K with the other variables, a negative value of the measure is graphed.
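A minimal Python sketch of the standardization and quadrant logic is given below, using invented values rather than the thesis data; Yule's K would be negated before standardization, as noted above.

import statistics

def z_scores(values):
    """Standardize a list of values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def quadrant(z_length, z_content):
    """Quadrant labels as in Figure 6.1: A = above average on both
    dimensions, B = below average on both, C = long but low diversity,
    D = diverse but short."""
    if z_length >= 0 and z_content >= 0:
        return "A"
    if z_length < 0 and z_content < 0:
        return "B"
    return "C" if z_length >= 0 else "D"

lengths = [146, 328, 50, 120]          # invented essay lengths
diversity = [64, 78, 34, 60]           # invented content scores, e.g. TTR(100)
for zl, zc in zip(z_scores(lengths), z_scores(diversity)):
    print(quadrant(zl, zc))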

6.3 Results

6.3.1 Basic characteristics

The values of the various base measures before standardization are shown in Table 6.2.

Table 6.2: Essay characteristics of 34 learner essays

        length   TTR(100)   Guiraud   Yule    D        Hapax(100)   AG
Mean    146.3    64         7.07      121.7   75.85    46.59        5.20
SD      56.6     9          0.90      34.6    25.30    8.78         0.67
High    328      78         9.13      192     153.03   65           6.62
Low     50       34         4.95      58.7    38.25    25           3.54

All the measures showed a wide range of values. For example, essay length ranged from 50 words to 328 words.

6.3.2 Quantity/content graphs

6.3.2.1 Sample word types TTR(100)

Figure 6.2 shows a quantity/content graph using TTR(100) as the content dimension. Essays rated good are indicated by 1, above average by 2, average by 3, below average by 4 and poor by 5.

Figure 6.2: Quantity v content using TTR(100) as content (* essays with less than 100 words)

In this graph, Quadrant A contains three of the four most highly rated essays but only two of the above average essays. It also contains four average essays and one below average essay. However, these essays are not clearly defined. In Quadrant B, the one poor essay is well isolated, as are two below average essays. However, this clear definition is deceptive. TTR(100) was based on a hundred word sample, but these clearly defined essays all contained fewer than a hundred words. The topping up of tokens may have emphasized the dearth of types in these essays, so their poor performance probably owes as much to the shortage of words as to the content dimension. Overall, the two-dimensional model seems inferior to a one-dimensional model using only essay length, which would clearly identify some good and above average essays at the top end of the rating spectrum.

6.3.2.2 Guiraud Index

Figure 6.3: Quantity v content using Guiraud Index as content

Figure 6.3 shows Guiraud Index as the content dimension. This seems to perform better than TTR(100), with two good essays and one above average essay well defined in Quadrant A. Similarly, the poor essay and two below average essays are well defined in Quadrant B. Unlike TTR(100), Guiraud Index isolates these lower end essays by itself. This suggests that a two-dimensional model based on Guiraud Index would be able to clearly isolate essays at the top and bottom of the rating spectrum. The model seems to indicate essays at the bottom end of the scale more clearly than essay length by itself, although essay length may still be a superior discriminator at the top end.

6.3.2.3 Yule's K

Figure 6.4 shows Yule's K as the content element. As with TTR(100), essays are not very well defined at the top of the range, but the measure seems better at the bottom end, distinguishing the poor essay and some below average essays. However, it does not seem to be an obvious improvement over essay length by itself.

Figure 6.4: Quantity v content using Yule's K as content

6.3.2.4 Hapax(100)

Figure 6.5: Quantity v content using Hapax(100) as content (* essays with less than 100 words)

Figure 6.5 shows Hapax(100), the proportion of hapax legomena in a 100 word sample, as the content element. As with TTR(100), values at the bottom end may be emphasized by the short length of the essays rather than by the lexical diversity measure. At the top end, the performance is inferior to essay length by itself.

6.3.2.5 The D estimate

Figure 6.6: Quantity v content using the D estimate as content

Figure 6.6 shows that the D estimate isolates essays at the bottom end of the scale well with several poor and below average essays clearly defined. In Quadrant A, the D estimate does not define highly rated essays so clearly. D seems the most conservative estimator in that no major misclassifications occur. No below average or poor essays appear at all in Quadrant A and no good or above average essays appear in Quadrant B. However, it does not highlight essays at the top end of the scale as well as Guiraud Index.

6.3.2.6 Advanced Guiraud

Figure 6.7 shows the Advanced Guiraud estimate as the content dimension. Advanced Guiraud seems to perform well at the extremes of both ends of the scale, clearly identifying two good essays, one poor essay and one below average essay. However, other essays are less clearly identified.

Figure 6.7: Quantity v content using an estimate of AG as content

The graphical evidence suggests that for these students on this task, essay length was a very strong predictor of overall essay ratings. Certainly, the content dimension of the model seems to take second place to the quantity dimension. This means that the equal weighting of the model may need to be rethought in favour of a greater weighting for quantity. Guiraud Index and the Advanced Guiraud estimate seem to perform better than the other lexical measures in terms of clearly defining a small number of high quality or low quality essays. However, the fact that both may still depend on length may account for some of their performance.

6.3.3 Correlation analysis

Table 6.3: Correlation of essay ratings, essay length and lexical diversity

             TTR(100)   Guiraud   -Yule     D         Hapax(100)   AG        ratings
length       0.359      0.682**   0.254     0.202     0.230        0.646**   0.786**
TTR(100)                0.811**   0.842**   0.802**   0.935**      0.696**   0.468*
Guiraud                           0.781**   0.783**   0.783**      0.846**   0.611**
-Yule                                       0.931**   0.823**      0.636**   0.382
D                                                     0.851**      0.595**   0.337
Hapax(100)                                                         0.579**   0.341
AG                                                                           0.660**

** significant at the .001 level   * significant at the .01 level

Looking at graphs can give a general idea; correlation analysis may provide more objective evidence. Table 6.3 shows correlation values between the features and essay ratings. The correlations with essay ratings underline the impact of essay length, which has the highest correlation, r = .786. The next two highest are Advanced Guiraud and Guiraud Index with r = .660 and r = .611 respectively. These high correlations may partly reflect an indirect influence of essay length. The other measures are broadly similar, with sample types performing best, r = 0.468. This contradicts the graph results somewhat, but it alerts us to an important aspect of many assessment situations: decision accuracy. The identification of essays at the top and bottom of the range may be more important than essays in the middle, because with these essays there is a chance of making a larger misclassification error.

The strong correlations of Guiraud Index and Advanced Guiraud with essay length, r = .682 and r = .646 respectively, support the idea that they are still considerably affected by essay length. Generally high correlations between the different measures of lexical diversity suggest that they are all measuring broadly the same thing, and very high correlations between pairs of measures reflect similarities in their properties. Guiraud Index and Advanced Guiraud, r = .846, are both counts divided by the same denominator and both depend on essay length. The correlation between Yule's K and the D estimate is particularly high, r = .931, perhaps reflecting the fact that they both represent parameters of word distributions. The correlation between TTR(100) and Hapax(100) is also high, r = .935, probably because they were both measured on the same sample of words.

6.3.4 Partial correlation analysis

Essay length seems to be the best predictor of essay ratings with this set of essays. We are interested in the measure of lexical diversity that, in conjunction with essay length, can best improve on this prediction. One way to think of this is to ask which measure can add something to a one-dimensional essay length model. Table 6.4 shows the partial correlations of the measures of lexical diversity with essay ratings when essay length is controlled for. These partial correlations tell us how much each measure correlates with essay ratings once the influence of the length effect is removed. The calculations were done with PASW software (PASW, 2009).

Table 6.4: Partial correlations with essay ratings controlled for essay length

     TTR(100)   Guiraud   -Yule   D      Hapax(100)   AG
r    .323       .166      .304    .294   .267         .323

The interesting values are those for Guiraud Index and Advanced Guiraud. Although both have relatively strong correlations with essay ratings, both are considerably weakened when essay length is controlled for. This again underlines the effect of essay length on these measures. Guiraud Index has by far the lowest partial correlation with essay ratings. The other measures are all about the same, with TTR(100) and Advanced Guiraud the strongest. However, excluding Guiraud Index, the measures vary only over a short range, from r = .267 to r = .323, so it is probably unwise to draw conclusions about which is best since the values are quite similar.
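The partial correlations above were produced with PASW. As a rough cross-check, a first-order partial correlation can also be computed directly from the pairwise correlations, as in the hedged sketch below; it reproduces the value reported for Advanced Guiraud in Table 6.4 but is not the original analysis.

import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y controlling for z,
    computed from the three pairwise correlations."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Reported values for Advanced Guiraud: r(AG, ratings) = .660,
# r(AG, length) = .646, r(length, ratings) = .786.
print(round(partial_correlation(0.660, 0.646, 0.786), 3))  # 0.323, as in Table 6.4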

6.3.5 Multiple regression analysis

A multiple regression analysis was run to see how well each measure of lexical diversity performed alongside essay length in predicting essay ratings. Each measure was paired with essay length in a two independent variable model with essay ratings as the dependent variable. Table 6.5 shows the r and r² values for each measure of lexical diversity in a two variable model with essay length. The calculations were done with PASW software (PASW, 2009).

Table 6.5: R and r² values for two variable models with essay length

      TTR(100)   Guiraud   -Yule   D      Hapax(100)   AG     length only
r     .811       .792      .808    .806   .803         .811   .786
r²    .657       .628      .653    .650   .644         .657   .617

The regression analysis highlights that essay length alone accounts for 61.7% of the variance. Adding a lexical diversity dimension brings only a slight improvement, with TTR(100) and Advanced Guiraud performing best and accounting for a further 4% of the variance.
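The regressions were also run in PASW. The sketch below shows one way such r² values could be obtained by ordinary least squares; the numbers in the example call are invented and do not represent the essay data.

import numpy as np

def r_squared(length, diversity, ratings):
    """Fit ratings on essay length and one lexical diversity measure by
    ordinary least squares and return the squared multiple correlation."""
    X = np.column_stack([np.ones(len(length)), length, diversity])
    y = np.asarray(ratings, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Illustrative call with invented values, not the thesis data.
print(r_squared([150, 300, 60, 120], [64, 78, 34, 60], [3, 1, 5, 3]))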

6.3.6 Results conclusion

The results suggest that, for this set of essays, lexical diversity together with essay length can predict essay ratings more accurately than either feature alone. However, essay length is a very strong predictor as a single dimension. Accordingly, in a two-dimensional model, essay length seems to be the dominant dimension and deserves a greater weighting than lexical diversity.

It is less clear which measure of lexical diversity is the best to use in the two-dimensional model. The results suggest that TTR(100), Yule's K, Hapax(100) and Advanced Guiraud perform similarly well in correlation and regression analyses, and all perform as well as the standard measure of lexical diversity, the D estimate. Guiraud Index seems to perform less well. The graphical evidence suggests that Advanced Guiraud and Guiraud Index are good at clearly identifying a few essays at the top and bottom ends of the scale, while TTR(100), Hapax(100) and Yule's K seem less able to distinguish essays clearly. However, this may be partly because the graph model gives the two dimensions equal weight, whereas the regression analysis suggests that essay length deserves a greater weight. Because the two Guiraud measures are affected by essay length, their inclusion as the content dimension in effect boosts the essay length weighting to a value closer to that estimated by the regression analysis.

6.4 Discussion

6.4.1 Overview

The results of this study, while interesting, are still inconclusive. However, some observations can be made. One is that with this set of essays, essay length proved to be overwhelmingly the dominant predictor of essay ratings, with lexical diversity playing a relatively minor role. However, there are likely to be situations where the role of essay length is not so great. Research suggests that higher level learners, who can be expected to write more, are less easily discriminated by essay length. Also, non-timed tasks may show less dependence on essay length and more on lexical diversity.

6.4.2 Advanced Guiraud

Despite reservations about the accuracy of an automated Advanced Guiraud estimate, it seemed to perform relatively well as a complementary measure to essay length, clearly identifying two highly rated essays and two low rated essays. It is worth considering how these four clearly defined essays can be interpreted. The two at the top of the scale contain a high quantity of words and also a higher ratio of advanced word types than other essays. The two identified at the bottom of the scale contain few words and a low ratio of advanced word types compared to other essays.

One reservation about Advanced Guiraud concerned its relationship with essay length. Even though Advanced Guiraud had a significant correlation with essay length, it also had one of the highest partial correlations with essay ratings when essay length was controlled for. This suggests that it is measuring another dimension of vocabulary use. Another reservation about this estimate of Advanced Guiraud centered on whether spelling errors conflated with advanced types would create a major effect. The clear identification of low rated essays suggests this may not have been a major problem. However, this weakness needs to be considered if the estimate is used with other tasks and other learners. It may be that in this case the number of errors was relatively small compared with advanced types. In a situation where there are more spelling errors and/or fewer advanced types, this kind of estimate could be seriously compromised. A superior method of estimating advanced types may still be necessary.

6.4.3 Classifying essays

This experiment has looked at the relationship between essay length and lexical diversity, but it does not actually give us a mechanism for using that information to split the essays into groups. A split into three groups could be made by isolating essays at the top end of the scale as an above average group and those at the bottom end as a below average group, leaving the essays in the middle range to form an average group. For example, because z scores have an additive property, z scores weighted for the contribution of each dimension to total variance could be used: the weighted z scores for quantity (essay length) and content (lexical diversity) are added and the resulting scores ranked. Such a ranking was calculated with essay length as the quantity dimension and Advanced Guiraud as the content dimension, the z scores on each dimension being weighted for contribution to total variance and then simply added together. Arranging the essays in descending order of weighted score from left to right, with each number representing the essay rating, gives:

1112231332332343323343333343433445

The three essays with the highest weighted z scores were all assessed as good essays. The other essay rated good ranked seventh according to weighted z scores. At the bottom, the lowest ranked essay was rated poor and the next two lowest were rated below average. The ranking progresses broadly in line with the ratings, but there are some anomalies: an average rated essay ranks sixth, a below average essay ranks fifteenth, while an above average essay ranks only eighteenth.
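A minimal sketch of the weighted ranking procedure is given below. The 0.6/0.4 weights are purely illustrative, since the exact weights used above are not restated here, and the input z scores are invented.

def rank_essays(z_length, z_content, w_length=0.6, w_content=0.4):
    """Rank essays by the weighted sum of their quantity and content z
    scores, highest first.  The weights here are illustrative only."""
    scores = [w_length * zl + w_content * zc
              for zl, zc in zip(z_length, z_content)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order  # essay indices from highest to lowest weighted score

print(rank_essays([1.2, -0.5, 0.3], [0.8, -1.1, 0.1]))  # invented z scores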

6.4.4 Linearity

A comparison of the graph data and the correlation results suggests that linear methods such as correlation and regression may not always be the best for recognizing the power of some features. Although sample types and Advanced Guiraud showed similar correlations, Advanced Guiraud seemed better at isolating essays at the top and bottom of the scale. One problem with linearity is that more of a feature is not necessarily better; often simple presence or absence is what matters. Another aspect of the limits of linearity is highlighted by Jarvis, Grant, Bikowski & Ferris (2003). In a study involving clustering of essays according to various essay features, they found that highly rated essays had a variety of profiles.

Although many features showed an overall positive correlation with essay ratings, an individual feature often had relations with other features that override this general relationship in certain situations. They noted two such scenarios: complementarity and compensation. Complementarity refers to a situation where a high presence of one feature is often accompanied by a low presence of another. For example, in one cluster of highly rated essays, the use of many nouns was accompanied by a relatively low occurrence of pronouns, while in another cluster the opposite pattern was noted, despite both features having an overall positive relationship with essay ratings. Compensation refers to learners who are not good at producing one feature making up for it in another way. For example, some learners with low values for clausal embedding made up for this in other ways, such as writing longer essays with higher lexical diversity values. This type of pattern may come as no surprise to teachers involved with student writing, yet standard statistical analyses have no way of recognizing this kind of relationship.

6.5 Conclusion

This chapter has focused on a simple two-dimensional quantity/content model for essay evaluation. The dimension of quantity was represented by essay length and content by various measures of lexical diversity. The two features together appeared to account for quality better than either feature by itself. For this set of essays, essay length was found to be the dominant dimension, accounting for over 60% of the variance in essay ratings; the addition of a lexical diversity measure accounted for only a further 4% of variance. However, the relative dominance of the dimensions is likely to vary with task and learners. For example, in the experiment in Chapter 5.1, lexical diversity by itself accounted for quality better than essay length.

The number of word types in a fixed sample (TTR(100)), Advanced Guiraud, Yule's K, the D estimate and the number of hapax legomena in a fixed sample (Hapax(100)) all accounted for a similar amount of the variance. Advanced Guiraud and Guiraud Index both showed a strong correlation with essay length, reflecting their dependence on it. However, when essay length was controlled, Guiraud Index lost most of its association with quality while Advanced Guiraud remained strongly associated with essay quality. There was also a very strong association between the D estimate and Yule's K, much higher than between other pairs of lexical diversity measures, which suggests that these two features may be measuring the same construct.

Graph evidence showed slightly different patterns to the correlation and regression evidence.

Whereas the correlation evidence suggested that the measures of lexical diversity had a similar correlation with quality when complementing essay length, graphs showed that some features more clearly identified extreme high quality and low quality essays. In particular, Advanced Guiraud seemed to be the most effective complement to essay length for clearly identifying a small number of highly rated essays and low rated essays.

In the next chapter, this two-dimensional model is extended to more complex models that can incorporate more than two features.

Chapter Seven: Automatic scoring algorithms

7.1 Introduction

In the previous chapter, we considered a simple two-dimensional model for assessing essays. In this chapter, we look at two more complex automatic algorithms that can incorporate more than two features. The first uses a cluster analysis of lexical features to help classify essays as higher or lower quality. The second uses a Bayesian analysis of the actual words produced in the essays; in this algorithm, a number of basic lexical features are used to guide the classification in place of a training set. Both algorithms used the same essay data for comparative purposes.

7.2 A clustering algorithm

7.2.1 Introduction

In this experiment, a clustering algorithm is used to classify a set of essays into two groups according to quality. The algorithm uses a small set of lexical features to cluster essays according to similarity. This small set of features is refined from a larger set through a principal component analysis (PCA). The advantage of features selected by PCA is that they account for as much of the variance in the essays as possible and are also likely to be independent, or close to independent, of each other. Two initial clustering points are chosen: one indicating a likely high quality space and another indicating a likely low quality space.

7.2.1.1 Aims of the experiment

The main aim of this experiment is to compare the performance of a clustering algorithm with the results of human quality assessments. A secondary aim is to evaluate lexical features for their usefulness in automatic assessment.

7.2.2 Methodology

7.2.2.1 Participants and tasks

One hundred essays written by first year Japanese university students were collected. The written task was based on Cartoon Task 1 in Appendix 3.1. Students were given 25 minutes to complete the task (see Appendix 3.3 for sample essays).

7.2.2.2 Holistic assessments

These one hundred essays were classified by two native speaker judges into two groups of fifty each according to quality: a higher quality group and a lower quality group.

This type of assessment was chosen for a number of reasons. Firstly, it mirrors the kind of assessment situation that the two judges often experience in their professional lives: students entering university in Japan are typically streamed by ability into several English classes by means of a placement test, which usually involves splitting a group of students into a number of classes of a given size. Secondly, the judges are used to making these decisions by relative proficiency rather than by absolute proficiency measured against outside criteria. In addition, using outside criteria has the disadvantage of requiring considerable rater training.

7.2.2.3 Analysis

The analysis involved a number of stages and was carried out using the L-analyzer and L-cluster programs (see Appendix 1.1). The main stages were as follows:
1) selecting features for the analysis
2) calculating these features automatically
3) identifying a set of principal components that were independent and accounted for most of the variance in the data
4) finding features in the original set which approximated these principal components
5) clustering the essay data using this reduced set of features to find two groups of high quality and low quality essays
Each of these stages is now described in more detail.

7.2.2.3.1 Selection of features

The first stage of the experiment was to identify a set of lexical features to input into the analysis. This set consisted of simple features that have been associated with essay quality in previous research. The initial input features were as follows:
1) Essay length in words
2) TTR(100), the number of word types in a 100 word sample
3) Hapax(100), the number of hapax legomena in a 100 word sample
4) An estimate of Advanced Guiraud
5) A distance measure of the occurrence of the and a relative to typical occurrence in English
6) An estimate of mean sentence length (the number of words divided by the number of sentence ending punctuation marks)
7) An estimate of mean clause length (the number of words divided by the number of commas and sentence ending punctuation marks)
8) Mean word length
9) Entropy
10) Yule's K
11) An estimate of lexical error

Features were chosen that have been shown to be associated with quality assessments. Essay length has been shown by many researchers, including Larsen-Freeman & Strom (1977), to be positively related to written quality. McNeill (2006) found length to be the feature most strongly correlated with holistic assessments in essays written by Japanese college learners.

The number of word types in a fixed sample is in effect the TTR of a fixed sample. Raw TTR has been shown to be affected by essay length, but TTR of a fixed sample is controlled for this effect. In addition, in the earlier experiments in Chapters Three and Four, the number of word types was the most reliable measure over similar tasks written by the same learners. In Chapter Six, a sampled word type measure, alongside Advanced Guiraud, had the strongest correlation with essay ratings among a set of lexical diversity measures.

The number of hapax legomena in a fixed sample of words was shown in Chapter Five to discriminate well between learner essays and native samples. It was also found in Chapter Six to correlate with essay ratings.

In Chapter Six, Advanced Guiraud was found to be strongly correlated with essay ratings even when essay length was controlled. Advanced Guiraud is a measure of extrinsic lexical diversity and works on the assumption that less frequent words are more likely to be used by more proficient learners. Since LFP and P_Lex are difficult to embed into a computer program, it may be a useful alternative for automatic assessment.

A distance measure of the relative occurrence of the and a was shown in Chapter Five to have promise in discriminating between native and learner essays. It is not clear how effective it is at discriminating between learner essays of different quality. However, Evola et al. (1980) found that the proportion of articles in essays was significantly correlated with quality assessments.

Mean sentence length and mean clause length may help identify essays written by higher proficiency learners if learners are expected to produce longer clauses and sentences as proficiency increases. Studies in Chapter Five found some differences in mean sentence length relating to proficiency.

Zipf (1932) showed that frequent words tend to be shorter than more infrequent words. Accordingly, essays by more proficient learners might be expected to exhibit a higher mean word length. Mean word length was also shown in Chapter Five to vary with learner proficiency, and native samples showed a higher mean word length than learner essays.

Although statistics like entropy and Yule’s K may be less suited to short essays, results in Chapter Five still showed significant differences between learner essays of differing proficiencies. Also, in Chapter Six, Yule’s K was found to correlate with essay ratings as strongly as other measures of lexical diversity.

Lexical error has been shown in various studies to be related to judgments of L2 essay quality, for example, Engber (1995).

It is also worth mentioning why some measures were not included. Although the results of the experiments in Chapters Three and Four suggest that distinctiveness and uniqueness correlate with essay quality in some situations, they were not included here because of concerns over their linearity, which is an assumption of the PCA process.

7.2.2.3.2 Calculation of features

Most of these features are easily calculated automatically, but some are less straightforward. In particular, lexical error is a problematic concept to include in an automatic application. There are various difficulties in judging error, such as different types of error and different degrees of seriousness, which make error judgment difficult even for humans. Error in this analysis was predicted from a small subset of errors. Calculating this estimate required a minimal amount of human intervention; however, if effective, this intervention should be needed less in future thanks to a simple 'learning process' built into the algorithm. The error estimation process involved checking all words in the essay against a list based on the first 1000 words of the JACET word list and against a list of proper nouns. Any words found outside these lists were flagged for human checking, and any judged to be non-words were tallied for an estimate of lexical error. Any words cleared at this final stage as legitimate could be added to a user dictionary so that the program learns more acceptable words. It might also be possible to include common errors in a common error dictionary to speed processing and further lessen human intervention in the future.
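A hedged sketch of this flag-and-tally step is shown below. The word sets are assumed to be loaded from the JACET-based 1K list, the proper noun list and a user dictionary, and the function names are placeholders rather than those of the L-analyzer program.

def flag_candidates(tokens, k1_words, proper_nouns, user_dictionary):
    """Return tokens not covered by any known-word list; these are the
    only words a human needs to check."""
    known = k1_words | proper_nouns | user_dictionary
    return sorted({t for t in tokens if t not in known})

def lexical_error_estimate(tokens, confirmed_non_words):
    """Tally of flagged words that a human has confirmed as non-words."""
    return sum(1 for t in tokens if t in confirmed_non_words)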

Advanced Guiraud involves tallying advanced types. The JACET 1000 word list was used as the basis for calculating this measure: any words not in this list and not appearing in a list of errors and proper nouns were considered advanced types.

Neither mean sentence length nor mean clause length can be calculated accurately in a simple program, so an estimator was used for each. For mean sentence length, the total number of words in an essay was divided by the total count of sentence ending punctuation. For mean clause length, the total number of commas was added to the denominator of the previous calculation.
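A minimal sketch of these two estimators is given below. Treating '.', '!' and '?' as sentence ending punctuation is an assumption, since the exact punctuation set is not specified above.

import re

def mean_sentence_length(text):
    """Estimate: words divided by sentence ending punctuation marks
    (assumed here to be '.', '!' and '?')."""
    words = len(re.findall(r"[A-Za-z']+", text))
    enders = len(re.findall(r"[.!?]", text))
    return words / enders if enders else float("nan")

def mean_clause_length(text):
    """Estimate: words divided by commas plus sentence ending punctuation."""
    words = len(re.findall(r"[A-Za-z']+", text))
    breaks = len(re.findall(r"[.!?,]", text))
    return words / breaks if breaks else float("nan")

print(mean_sentence_length("I like dogs. They are cute, and friendly!"))  # 4.0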

7.2.2.3.3 Principal component analysis

The next step of the analysis was to use a principal component analysis (PCA) to identify a smaller set of features for the cluster analysis. Principal component analysis is a statistical technique which realigns multivariate data to provide a new set of variables that are ordered in terms of variance and are independent of each other. An assumption of PCA is that the input variables are linear in their relationship with the dependent variable. In this case, the assumption means that the input features have a linear relationship with essay quality.

Consideration of this assumption affected the form of some of the selected features. For example, a previous study showed that the relative occurrence of the articles the and a may be a reliable discriminator between essays written by native speakers and those written by L2 learners. According to the British National Corpus, the ratio of occurrence of the to a in general English is about 2.9 (Leech, Rayson & Wilson, 2001). A simple ratio of occurrences of these two words is unlikely to be linear in its relationship to written quality: a value close to 2.9 may be desirable, with more distant values less desirable. Therefore, for this experiment a measure of distance from the typical occurrence in English was calculated. A low value for this feature indicates similarity to the norms of English (a positive sign), whereas distance from this norm is a more negative sign. A feature in this form is linear in its relationship to essay quality, albeit inversely.
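The exact form of the distance measure is not specified above. The sketch below uses the absolute difference between an essay's the/a ratio and the BNC figure of 2.9 as one plausible form; the handling of essays containing no occurrences of a is a guess.

def the_a_distance(tokens, target_ratio=2.9):
    """Distance of the essay's the/a ratio from the typical ratio in
    general English (about 2.9 in the BNC).  Absolute difference is used
    here as one plausible form of the measure."""
    the_count = tokens.count("the")
    a_count = tokens.count("a")
    if a_count == 0:
        return float("nan")  # ratio undefined; real handling unspecified in the thesis
    return abs(the_count / a_count - target_ratio)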

The set of features was calculated for each essay and the standardized z scores were subjected to a PCA. Z scores were used to prevent features with large variances dominating the analysis. The first six principal components (PCs) produced by this analysis accounted for over 91% of the total variance. A scree chart showing the variance accounted for by each PC can be seen in Figure 7.1.

Figure 7.1: Scree chart of PCs

The first principal component (PC1) accounts for the largest share of the variance, in this case about 34%. The second PC accounts for about 20% of the variance, PC3 for about 13%, PC4 for about 12%, and PC5 and PC6 for about 6% each. The remaining five PCs account for a very small proportion of the variance, from about 4% for PC7 down to a minute percentage for PC11. The variance accounted for by each PC and the cumulative variance are shown in Table 7.1. The first two PCs together accounted for 54.3% of the variance in the data, the first three for 67.2% and the first six for 91.7%.

Table 7.1: Percentage of variance accounted for by the PCs

PC                      1      2      3      4      5     6     7     8     9     10    11
variance %              34.2   20.1   12.9   12.3   6.2   6.1   3.9   2.2   1.4   0.6   0.1
cumulative variance %   34.2   54.3   67.2   79.5   85.6  91.7  95.6  97.8  99.2  99.9  100

7.2.2.3.4 Comparison of features with principal components

Because PCA produces independent principal components (PCs), each identified PC should reflect a different strand of variance within the data. It is possible, but by no means certain, that these could correspond to different aspects of knowledge that manifest themselves in essay quality. This independence is a useful property. Many lexical features are often suspected of being predictors of essay quality simply because of their own relationship to essay length. Also, many lexical diversity statistics measure broadly the same construct while still differing in detail. Because principal components are independent, we do not need to worry about using features together that may be dependent on each other, and it is not strictly necessary to control features for essay length before PCA. However, controlling the features for length is useful in this case because the most influential PCs are matched back to the original features in the next stage of the analysis.

If the original z scores of the features are mapped against each PC, original features that correspond closely to each PC can be found. This gives a set of independent features to use to predict essay quality. This matching of original features to principal components is a technique recommended by Jolliffe in his handbook on principal component analysis (Jolliffe, 2002). For example, the first PC, which accounted for over 34% of the variance, was highly correlated with essay length. The matching process also eliminates features measuring similar properties. For example, PC3 correlated closely with both mean sentence length and mean clause length; mean sentence length correlated better with PC3 than mean clause length, so mean sentence length was selected and mean clause length discarded. The graphs of the first six PCs with matched features in z score form can be seen in Appendix 7.1. The features selected for these six PCs were as follows:
PC1: Essay length
PC2: TTR(100), the number of word types in a 100 word sample
PC3: An estimate of mean sentence length
PC4: An estimate of lexical error
PC5: Mean word length
PC6: Hapax(100), the number of hapax legomena in a 100 word sample
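A hedged numpy sketch of this PCA and matching step is given below. It is a generic implementation rather than the L-analyzer code, and it assumes the feature matrix has already been converted to z scores.

import numpy as np

def pca_scores(z_features):
    """PCA of a standardized feature matrix (essays x features); returns
    the component scores and the proportion of variance for each PC."""
    X = np.asarray(z_features, dtype=float)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return X @ eigvecs, eigvals / eigvals.sum()

def best_matching_features(z_features, pc_scores, feature_names):
    """For each PC, name the original feature whose z scores correlate
    most strongly (in absolute value) with that component's scores."""
    X = np.asarray(z_features, dtype=float)
    matches = []
    for j in range(pc_scores.shape[1]):
        rs = [abs(np.corrcoef(X[:, i], pc_scores[:, j])[0, 1])
              for i in range(X.shape[1])]
        matches.append(feature_names[int(np.argmax(rs))])
    return matches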

7.2.2.3.5 Clustering

Using this set of six features, a clustering was carried out. Clustering was initiated using high and low values of the features: high values are likely to be indicative of high quality essays and low values of lower quality essays. The lexical error feature (feature 4) is opposite in orientation: a low value is likely to indicate a high quality essay and a high value a low quality essay. Again, the standardized z scores of the features were used for clustering. Initial cluster locations were estimated for a high quality cluster and a low quality cluster. The high quality cluster was built around a point in six-dimensional space based on the following parameters:
• Mean essay length + 1 standard deviation
• Mean TTR(100) + 1 standard deviation
• Mean sentence length + 1 standard deviation
• Mean lexical error - 1 standard deviation
• Mean word length + 1 standard deviation
• Mean Hapax(100) + 1 standard deviation

In a similar way, the initial cluster point for the lower cluster was set as:
• Mean essay length - 1 standard deviation
• Mean TTR(100) - 1 standard deviation
• Mean sentence length - 1 standard deviation
• Mean lexical error + 1 standard deviation
• Mean word length - 1 standard deviation
• Mean Hapax(100) - 1 standard deviation

Table 7.2: Features and weightings used in the clustering

feature     essay length   TTR(100)   mean sentence length   lexical error   mean word length   Hapax(100)
weighting   1              20/34      13/34                  12/34           6/34               6/34

Each feature was weighted in accordance with the percentage of variance accounted for by its corresponding PC; the six features received the weightings shown in Table 7.2. Essays were progressively added to each cluster according to their relative distance to the midpoint of each existing cluster. At first, each cluster is represented only by its initial cluster point. For each essay, the Euclidean distance (see Appendix 7.2) in six-dimensional space from the low cluster point was divided by the distance to the high cluster point. The essay with the lowest value is relatively close to the low cluster point and the essay with the highest value is relatively close to the high cluster point. These two essays were added to their respective closest clusters, each cluster then consisting of two points: an essay point and the initial cluster point. The midpoint of each cluster was then calculated as the average, in six-dimensional space, of the points in the cluster. The distances from each remaining essay to the two cluster midpoints were then recalculated and the procedure was repeated until each group contained fifty essays. After many iterations, two clusters of fifty essays each were formed.
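A minimal sketch of this growing-cluster procedure is given below. It follows the description above, with weighted Euclidean distances, distance ratios and recalculated midpoints, but it is not the L-cluster program, and details such as tie handling and the small epsilon in the ratio are guesses.

import numpy as np

def weighted_distance(p, q, weights):
    """Euclidean distance between two points with per-feature weights."""
    diff = (np.asarray(p, dtype=float) - np.asarray(q, dtype=float)) * np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

def grow_two_clusters(z, weights, high_start, low_start, group_size):
    """Grow a high and a low cluster from their initial points.  At each
    pass, the essay relatively closest to the high midpoint (largest
    low/high distance ratio) joins the high cluster, the essay relatively
    closest to the low midpoint joins the low cluster, and the midpoints
    are recalculated, until each cluster holds group_size essays."""
    high_pts = [np.asarray(high_start, dtype=float)]
    low_pts = [np.asarray(low_start, dtype=float)]
    high_ids, low_ids = [], []
    remaining = set(range(len(z)))
    while remaining and (len(high_ids) < group_size or len(low_ids) < group_size):
        mid_high = np.mean(high_pts, axis=0)
        mid_low = np.mean(low_pts, axis=0)
        ratios = {i: weighted_distance(z[i], mid_low, weights) /
                     (weighted_distance(z[i], mid_high, weights) + 1e-12)
                  for i in remaining}
        if len(high_ids) < group_size:
            i_high = max(ratios, key=ratios.get)
            high_pts.append(np.asarray(z[i_high], dtype=float))
            high_ids.append(i_high)
            remaining.discard(i_high)
        if remaining and len(low_ids) < group_size:
            i_low = min(ratios, key=ratios.get)
            if i_low in remaining:
                low_pts.append(np.asarray(z[i_low], dtype=float))
                low_ids.append(i_low)
                remaining.discard(i_low)
    return high_ids, low_ids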

7.2.3 Results

The results of the experiment were compared with the ratings of the two native speaker judges. The decision agreements and the Kappa statistics (see Appendix 7.3), which adjust agreement for chance, are shown in Table 7.3.

Table 7.3: Decision agreement (DA) and Kappa for clustering algorithm

             Rater 1           Rater 2
             DA     Kappa      DA     Kappa
Clustering   .78    .56        .76    .52
Rater 1      -      -          .72    .44

The results of this clustering algorithm agreed with Rater 1 in 78 cases out of 100 and with Rater 2 in 76 cases out of 100. The Kappa statistic corrected for chance agreement is r = 0.56 for the clustering algorithm and Rater 1 and r = 0.52 for the clustering algorithm and Rater 2. The two raters agreed with each other in 72 cases out of 100, which corresponds to a Kappa reliability of 0.44. Therefore the clustering algorithm agreed with each human rater more than the human raters agreed with each other.

According to this analysis, the lexical features that seem to correspond with the most influential PCs are essay length, TTR(100), mean sentence length, error estimate, mean word length and Hapax(100). This order reflects the variance explained by the corresponding matched PC.

7.2.4 Discussion

7.2.4.1 Reliability

The clustering algorithm in this experiment achieved a reliability with human ratings which more than matches the reliability between the two raters themselves. Decision agreement with raters is 78% with Rater 1 and 76% with Rater 2. However, decision agreements, as they stand, may not be accurate indicators of inter-rater reliability. Consider a case of 100 random allocations into two groups of fifty. If this were done twice, we would expect 50% agreement by chance. The Kappa statistic can be used to translate rater agreements into inter-rater reliabilities by eliminating this chance agreement element. After correcting for chance, the reliability is reduced to r = 0.56 with Rater 1 and r = 0.52 with Rater 2, but this is still superior to the reliability between the human raters, r = 0.44.
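The chance correction just described can be illustrated with a short sketch using the figures reported above (a minimal illustration only, not part of the thesis's own analysis programs).

def kappa(observed_agreement, chance_agreement=0.5):
    # with a forced 50/50 split, expected chance agreement is 0.5
    return (observed_agreement - chance_agreement) / (1 - chance_agreement)

print(kappa(0.78))   # clustering vs Rater 1 -> 0.56
print(kappa(0.76))   # clustering vs Rater 2 -> 0.52
print(kappa(0.72))   # Rater 1 vs Rater 2    -> 0.44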

One of the reasons for this relatively low value is the requirement that fifty essays be classified into each group. Although this is a realistic situation in practice, it may cause problems in classification. It is unlikely that the hundred essays naturally fall into two sets of fifty essays according to proficiency. This may be problematic for raters. Very high quality and very low quality essays are probably relatively easy to allocate, but borderline essays are likely to prove more difficult.

A good deal of this lost reliability may be due to the two raters dealing with borderline essays in different ways. Therefore, to judge the performance of the automatic assessment it might be instructive to look at the essays the human raters agreed on and see how these essays were classified by the clustering algorithm. This is shown in Table 7.4.

Table 7.4: Automatic treatment of essays agreed on by raters

                          Raters
                       High    Low
Clustering   High       32      5
             Low         4     31

The two human raters agreed on 36 high quality essays and 36 low quality essays. Of these 36 high quality essays, 32 were also classified as high quality by the clustering algorithm. Conversely, this means that four essays that both raters identified as high quality were rated low quality by the automatic method. Out of 36 essays rated low by both human raters, 31 were also rated low by the algorithm but five were rated high. These cases may be of interest and may warrant further attention. It could be that these essays are being mis-rated by the clustering algorithm. Some caution is necessary, however. It could also be that these are, in fact, borderline essays that both raters just happened to rate the same way. If we test the inter-rater reliability of this clustering algorithm on this smaller group of pooled ratings, we get an agreement of 63 cases out of 72 (87.5%) or a Kappa statistic of 0.75.

7.2.4.2 Clustering

Looking at the clustering itself, there may be ways that results could be improved. There are various ways clustering can be performed. In particular, different techniques can be used to form the clusters, and different distance measures can be used. The initial number of input features could also be increased. Compared to other studies, the number in this experiment was quite small. Jarvis et al. (2003) used 21 features in a cluster analysis, and some other automatic systems used many more input features. E-rater used fifty initial features from which eight to ten were selected.

One drawback of the automatic assessment systems reviewed earlier is that they require a training set of data external to the set of essays under consideration. This clustering algorithm avoided the need for a training set by setting two points in space as initial points for the high cluster and the low cluster. These points were set one standard deviation above or below the mean on each feature. The standard deviation is a measure of the dispersion of points around the mean. If a feature is normally distributed, roughly two-thirds of the values lie within one standard deviation of the mean, leaving the remaining third lying more than one standard deviation away from the mean. The mean plus and minus a standard deviation therefore seemed reasonable points for the clusters to coalesce around. These initial settings could be varied to improve performance.

We might also consider whether initial cluster points are necessary at all. Clustering is often used for exploratory data analysis with no guidance. Jarvis et al. (2003) used unsupervised clustering on two sets of learner essays. They found that high quality essays formed a number of clusters, which suggested that high quality essays had multiple profiles. The nature of these multiple profiles varied over the two different essay sets. By fixing cluster points in this study, richer patterns that might emerge from a freer analysis may be lost.

7.2.5 Conclusion

In this experiment, a clustering algorithm has been shown to be a possible alternative to human assessment for the rating of L2 learner essays. Using a large set of lexical features which have shown relationships with essay quality in previous studies, a principal component analysis found a smaller set of independent components that accounted for most of the variance in the data. These components were then matched back to the original features, and the matched features were used to cluster the essays into a high quality and a low quality group. In this experiment, the ratings produced by the clustering algorithm corresponded with the assessments of human raters better than the human raters corresponded with each other. However, a small number of essays were dealt with differently by the clustering algorithm and both raters, which may indicate a systematic weakness in the algorithm. In the next experiment, an alternative automatic algorithm using the same essays is considered.

7.3 A Bayesian algorithm

7.3.1 Introduction

In this experiment, an algorithm implements a Bayesian classifier to categorize the same one hundred essays as in the previous experiment. Fifty essays were identified as being high quality, leaving the remaining fifty essays to be classified as relatively low quality. The Bayesian classifier was guided by training essays selected automatically from the set of essays. The selection was done by recognizing features that are highly likely to indicate high quality essays.

The basic premise of a Bayesian classifier is that the classification of an item (in this case, an essay) can be guided by comparing the occurrence of features of the item (in this case, words in the essay) with the occurrence of those features in two representative groups of items (for example, samples of high quality and samples of low quality essays). The item can be classified as a member of the group to which it is most similar in the occurrence of features.

Various essay features could be used in the analysis. Lexical statistical features such as those used in the previous experiment could be used but in this experiment the semantic content of the essays was analyzed. Occurrence of particular words in each essay was compared to occurrence in a set of predicted high quality essays.

To do a Bayesian analysis, a number of essays are needed as training samples. In this experiment, these training essays were selected automatically from within the set of essays. The method of selecting these training essays was based on observations in other studies.

Previous studies in Chapter Five and Chapter Six suggested that some simple lexical features such as essay length or lexical diversity could be effective at identifying small numbers of very good essays or very poor essays. Essay length may be a good indicator of high quality or low quality essays particularly in a timed essay format. Very long essays are likely to be relatively high quality and very short ones are likely to be relatively low quality. Lexical diversity measures were also shown in Chapter Five and Six to help isolate small numbers of good and poor quality essays. However, both essay length and lexical diversity are often less effective at distinguishing between more borderline essays.

A similar process often happens when humans rate essays. Some essays are easier to rate than others. Typically ones that are particularly high quality or low quality may often be easy to identify while the great majority that are closer to average may take more consideration.

In this study, a training sample of essays was selected from within the set of essays by using combinations of features which were highly likely to indicate good quality essays. The features chosen were essay length, lexical diversity, the number of hapax legomena and the proportion of error.

Once the training sample was identified, the occurrence of words in individual essays was tallied. These occurrences were then compared with occurrences in two sets of essays. One set was the training sample and the other set included all the essays outside the training sample, excluding the particular essay under consideration. If the occurrence of words in an essay resembled the essays in the training sample, then it was deemed similar and allocated to the high quality group. If not, it was allocated to the low quality group.

At this point, it is worth commenting on why this experiment used only a high quality sample whereas the previous experiment used both a core high cluster and a core low cluster. Because this experiment involved analysis of each word in an essay, shorter essays contained less information than longer essays. These shorter essays are themselves likely to be of poor quality, but the lack of information caused by their short length may affect the reliability with which other similar essays are predicted. In the previous experiment, any shortage of words only affected one feature, essay length, because the other features were independent of essay length.

Using Bayesian classifiers for essay grading is not a new idea. Larkey (1998) used a similar method to classify five large L1 data sets from large testing organisations. However, the training sets used in that study were very large, ranging from 233 training essays to rate 50 essays to 586 training essays to rate 80 essays. Larger groups of essays were also rated: one training set of 383 essays was used to rate a set of 225 essays. What this experiment sets out to investigate is whether relatively tiny training sets can give reliable results.

7.3.1.1 Aims of the experiment

The main aims of this experiment are to:
1) compare the performance of a Bayesian algorithm with the results of human rater assessments.
2) check the reliability of a training set of essays identified automatically from within the set of essays.

7.3.2 Methodology

7.3.2.1 Participants and tasks

The same one hundred essays were used as in the previous experiment with the same holistic assessments (see Appendix 3.1 for the task).

7.3.2.2 Analysis

The analysis had two major parts:
1) the selection of a high quality training sample
2) Bayesian classification of the remaining essays into high quality and low quality groups

Both of these are now described in more detail. The programs L-analyzer and L-Bayes (see Appendix 1.1) were used in the analysis.

7.3.2.2.1 Selecting the training set

To select the training set, four features were used: essay length, lexical diversity, the number of hapax legomena and an estimate of error. Since the length of an essay can affect the number of hapax legomena and lexical diversity, a fixed sample was used for each essay and hapax legomena and lexical diversity were calculated for that sample. Lexical diversity was calculated as the number of word types in a fixed sample of 100 words, TTR(100). The number of hapax legomena was also tallied for a 100 word sample, Hapax(100). An error estimate was calculated in the same way as in the previous experiment and an error proportion in relation to essay length was calculated.
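As a rough illustration of how these sample-based features might be computed, a short sketch follows. The whitespace tokenization and the function names are simplifying assumptions, not the procedure of the L-analyzer program itself, and the error estimate is not reproduced here.

from collections import Counter

def sample_features(text, sample_size=100):
    tokens = text.lower().split()                 # simplifying tokenization assumption
    sample = tokens[:sample_size]                 # fixed 100-word sample
    counts = Counter(sample)
    ttr_100 = len(counts)                         # number of word types in the sample
    hapax_100 = sum(1 for c in counts.values() if c == 1)   # hapax legomena in the sample
    return {"length": len(tokens), "TTR(100)": ttr_100, "Hapax(100)": hapax_100}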

The target was to find a set of about 15-20 essays to form the training set. A set of conditions for predictors of high quality essays was constructed. These conditions were considered in descending order of strictness until a sufficient number of essays had been identified. The conditions were as follows (a sketch of this cascading selection is given after the list):
1) In the top quartile of each of three features, essay length, TTR(100) and Hapax(100), while also not including a high proportion of error (a high proportion of error is an error proportion in the top quartile)
2) In the top quartile of essay length and TTR(100) while being above average for Hapax(100) and not including a high proportion of error
3) In the top quartile of essay length and Hapax(100) while being above average for TTR(100) and not including a high proportion of error
4) In the top quartile of essay length while being above average for both TTR(100) and Hapax(100) and not including a high proportion of error
5) In the top quartile for both TTR(100) and Hapax(100) while also being above average for essay length and not including a high proportion of error
6) In the top quartile of TTR(100) while being above average for both essay length and Hapax(100) and not including a high proportion of error
7) In the top quartile of Hapax(100) while being above average for both TTR(100) and essay length and not including a high proportion of error
8) Above average for all three features of essay length, TTR(100) and Hapax(100) while exhibiting zero error
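A minimal sketch of this cascading selection is given below, under the assumption that the four feature values have already been computed for each essay; the dictionary keys, quartile helpers and target number are illustrative, not the thesis's own code.

import numpy as np

def select_training_set(stats, target=16):
    # stats: list of dicts with keys 'length', 'ttr', 'hapax', 'error_prop' (assumed names)
    def q3(key):    # top-quartile threshold for a feature
        return np.percentile([s[key] for s in stats], 75)
    def mean(key):
        return np.mean([s[key] for s in stats])

    L, T, H, E = q3("length"), q3("ttr"), q3("hapax"), q3("error_prop")
    mL, mT, mH = mean("length"), mean("ttr"), mean("hapax")
    conditions = [   # from strictest (1) to least strict (8), mirroring the list above
        lambda s: s["length"] >= L and s["ttr"] >= T and s["hapax"] >= H and s["error_prop"] < E,
        lambda s: s["length"] >= L and s["ttr"] >= T and s["hapax"] >= mH and s["error_prop"] < E,
        lambda s: s["length"] >= L and s["hapax"] >= H and s["ttr"] >= mT and s["error_prop"] < E,
        lambda s: s["length"] >= L and s["ttr"] >= mT and s["hapax"] >= mH and s["error_prop"] < E,
        lambda s: s["ttr"] >= T and s["hapax"] >= H and s["length"] >= mL and s["error_prop"] < E,
        lambda s: s["ttr"] >= T and s["length"] >= mL and s["hapax"] >= mH and s["error_prop"] < E,
        lambda s: s["hapax"] >= H and s["ttr"] >= mT and s["length"] >= mL and s["error_prop"] < E,
        lambda s: s["length"] >= mL and s["ttr"] >= mT and s["hapax"] >= mH and s["error_prop"] == 0,
    ]
    selected = set()
    for cond in conditions:
        selected |= {i for i, s in enumerate(stats) if cond(s)}
        if len(selected) >= target:
            break
    return sorted(selected)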

These eight conditions identified sixteen possible high quality essays in the proportions shown in Table 7.5.

Table 7.5: Essays identified by the training set conditions

Condition          1    2    3    4    5    6    7    8
No. of essays      9    2    2    1    0    0    1    1
Cumulative total   9   11   13   14   14   14   15   16

Table 7.5 shows that there were nine essays identified by the first condition and two further essays by the second, giving eleven different essays identified by the first two conditions. The first three conditions identified thirteen essays in total, and sixteen essays were finally identified by all eight conditions.

7.3.2.2.2 Bayesian classification

These sixteen probable high quality essays, identified as the training sample, were also the initial members of the high quality group. The next stage of the analysis was to use a Bayesian classifier to assign a further 34 essays to the high quality group, leaving the remaining fifty essays to form the low quality group.

Occurrences of all words in an essay that appeared in more than one essay in the set of essays were considered. For each word in each essay, the occurrences of that word in the essays of the high quality training sample and in the group comprising all other essays were tallied. For example, say the first word of an essay under consideration is 'once'. The occurrence of 'once' is then checked in the sixteen training essays. Let's say 'once' occurred in eight out of these sixteen essays. If 'once' only occurs twelve times in the remaining 83 essays, 'once' has a higher relative occurrence in the high quality sample than in the group of other essays. Therefore, on the basis of the occurrence of 'once', the essay seems more likely to belong with the high quality sample than with the remaining group. Of course, on its own, this one word is not a reliable predictor, but when every other word in the essay is taken into account in a similar way, the Bayesian classifier gives a more reliable probability that the essay belongs to the high quality group or the remaining group.

This procedure was done for all the remaining 84 essays and the 34 essays with the highest probability of belonging to the high quality group were allocated to that group. The remaining fifty essays were allocated to the low quality group.
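The classification step can be sketched as a simple naive-Bayes-style comparison of document frequencies, ignoring the additional transformations reviewed in Appendix 7.4; the representation of essays as sets of words and the add-one smoothing are assumptions made for this sketch rather than the thesis's own L-Bayes program.

import math

def high_quality_score(essay_words, training_essays, other_essays):
    # essay_words: set of words in the essay under consideration
    # training_essays, other_essays: collections of word sets
    def doc_freq(word, essays):
        return sum(1 for e in essays if word in e)
    score = 0.0
    for word in set(essay_words):
        # add-one smoothing avoids zero probabilities for unseen words
        p_high = (doc_freq(word, training_essays) + 1) / (len(training_essays) + 2)
        p_other = (doc_freq(word, other_essays) + 1) / (len(other_essays) + 2)
        score += math.log(p_high) - math.log(p_other)
    return score   # higher = more similar to the high quality training essays

In use, the 84 unclassified essays would be ranked by this score, with the top 34 allocated to the high quality group and the rest to the low quality group.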

Bayesian classifiers have certain technical problems, some of which are particularly related to dealing with essay-type data. A set of data transformations was carried out to counter these problems. These problems and the data transformations used to counter them are reviewed in Appendix 7.4.

7.3.3 Results

7.3.3.1 Bayesian algorithm

A comparison of decision agreements and Kappa statistics from the Bayesian algorithm with human ratings is shown in Table 7.6.

Table 7.6: Decision agreement (DA) and Kappa reliability for Bayesian algorithm

           Rater 1           Rater 2
           DA     Kappa      DA     Kappa
Bayesian   .70    .40        .70    .40
Rater 1    -      -          .72    .44

This algorithm agreed with Rater 1 in 70 cases out of 100 and with Rater 2 in 70 cases out of 100. The two raters agreed with each other in 72 cases out of 100. After adjusting for chance agreement, we get a Kappa statistic of r = 0.40 for the reliability of the algorithm with each rater compared with r = 0.44 for the raters with each other. This shows that the performance of this algorithm approaches, but does not quite match, that of the two human raters.

7.3.3.2 Reliability of training essays

To check the reliability of the training set essays, the sixteen selected essays were compared with the ratings given to them by the two human raters. The number of training sample essays rated high by each rater, broken down by selection condition, is shown in Table 7.7.

Table 7.7: Training set essays rated high by human raters

Condition        1    2    3    4    7    8
No. of essays    9    2    2    1    1    1
Rater 1          9    2    2    1    0    1
Rater 2          9    2    1    1    1    1

The table shows that each rater rated all but one training set essay as being high quality. No essay was rated poor by both raters. The essay rated poor by Rater 1 was selected by condition 7 and the essay rated poor by Rater 2 was selected by condition 3. All the essays selected by the first two conditions were rated high by both raters.

7.3.4 Discussion

7.3.4.1 Comparison of algorithm ratings with pooled human ratings

Table 7.8 shows the essays the human raters agreed on and how they were treated by the Bayesian algorithm.

Table 7.8: Automatic treatment of essays agreed on by raters

                        Raters
                     High    Low
Bayesian   High       30     10
           Low         6     26

The number of agreements between the Bayesian algorithm and the pooled ratings was 56 cases out of 72, or 78%, and the Kappa statistic was 0.56. Out of 36 essays rated as high quality by both human judges, 30 were also rated high quality by the Bayesian algorithm. This means that six essays rated high by both raters were rated low by the Bayesian algorithm. Similarly, out of 36 essays rated low by both human raters, 26 were also rated low by the Bayesian algorithm. This means that ten essays rated low by both raters were rated high by the Bayesian algorithm. Again, we cannot be sure whether these essays are being misplaced by the algorithm or whether the agreement by the raters is simply chance agreement in the rating of borderline cases. It is interesting, though, that the relative weakness of the Bayesian algorithm seems to be concentrated in its prediction of low quality essays. This may be a result of concentrating the analysis on high quality essays, and suggests it might also have been worth running a Bayesian analysis on low quality essays, at least as a check.

7.3.4.2 Comparison of Bayesian classifier with pooled human ratings

Using the Bayesian algorithm, essays are allocated to the high quality group by two processes. The first is the training set conditions, which identified sixteen essays. The second is the Bayesian classifier, which allocated a further 34 essays to the high quality group and the remaining fifty to the low quality group. Therefore, the agreement on high quality candidates cannot be credited entirely to the Bayesian classifier.

Fourteen of the sixteen training set essays were agreed on by both raters. The essays agreed on by the human raters that were allocated by the Bayesian classifier itself are therefore shown in Table 7.9.

Table 7.9: Bayesian classifier treatment of essays agreed on by raters

                                   Raters
                                High    Low
Bayesian classifier   High       16     10
                      Low         6     26

There were 58 essays that the raters agreed on that were allocated by the Bayesian classifier. The agreement between the results of the classifier and the raters was 42 out of 58 cases which is agreement of 72% or a Kappa statistic of 0.47.

Of the sixteen essays identified by the training set conditions, fourteen had ratings agreed on by both raters, and all fourteen were agreed to be high quality, matching their selection as training essays. This suggests that the training set identification portion of the Bayesian algorithm may be more reliable than the Bayesian classifier portion of the algorithm.

7.3.4.3 Adjustments to the algorithm

There may be considerable scope for adjusting the conditions for selecting the training sample. The selection of the training set essays is crucial. If an essay is mistakenly assigned to the training set at this stage, the repercussions are likely to be felt throughout the analysis. The nature and the number of the conditions are of particular interest. In this experiment, the strictest two conditions proved reliable but there were problems with some essays chosen by subsequent conditions. Also, the optimal number of essays to put in the training set needs further investigation, although this choice may depend on the conditions themselves. More research into the reliability of these conditions in identifying high or low quality essays is vital.

In this experiment, the training set conditions all related to identifying the high quality group. The algorithm as a whole concentrated on high quality essays since it seemed that they could be identified more reliably given the tendency for them to include more words. However, training samples could be used to reliably identify some low quality essays to at least remove them from the analysis. Perhaps similar features could provide conditions for low quality essays. Very short essays with low lexical diversity and hapax legomena values and high error would probably identify low quality essays. In Chapter Five, essays with a TTR score higher than a lexical variation score were found to be low quality.

In addition, the reliability of classifiers with such small training samples needs to be considered. The study by Larkey (1998) suggests that a minimum number of training set essays may be necessary for a Bayesian classifier to work properly. Since this experiment uses a smaller number, there may be validity concerns. This weakness might be compensated for by using the classifier alongside other algorithms as a check. For example, if a clustering algorithm or another Bayesian algorithm based on features rather than word content produced similar results, then validity would be bolstered.

7.3.5 Conclusion

In this experiment, a Bayesian algorithm has been shown to be a possible alternative to human assessment for the rating of L2 learner essays. A training set of essays to guide a Bayesian classifier was identified automatically from within the set of essays. These essays were identified by combinations of features which typically indicate high quality essays. On the basis of this training sample, and using a Bayesian classifier, the other essays were classified as similar or different according to their semantic content. Similar essays were classed as high quality and non-similar essays as low quality by default. The initial training sample analysis identified sixteen high quality essays. The subsequent Bayesian classification identified a further 34 similar essays. The remaining fifty essays were classed as poor quality.

In this experiment, the ratings produced by the algorithm corresponded with assessments of human raters almost as well as the human raters corresponded with each other. The training set essays were all rated high quality by at least one human rater and fourteen out of sixteen essays were rated high quality by both raters. A weakness in the approach in this experiment seems to be in essays rated as poor quality. This suggests that more attention needs to be focused on identifying poor quality essays. In this experiment, the training essay selection portion of the algorithm seemed to perform more reliably than the Bayesian classifier portion.

7.4 Discussion

7.4.1 Comparison between clustering and Bayesian algorithms

Results show that both algorithms are almost as reliable in classifying the essays as the two native speaker judges. Table 7.10 shows decision agreements (out of 100) for each method and rater. The results of both algorithms compare favourably with ratings by humans. The clustering algorithm seems to produce the best results but the Bayesian algorithm also seems promising.

Table 7.10: Decision agreement between raters and automatic methods (out of 100)

            Rater 2    Clustering    Bayesian
Rater 1       72          78            70
Rater 2       -           76            70

Table 7.11: Classification of essays agreed on by both raters

                         Clustering         Bayesian
                        high    low        high    low
Pooled rater high        32      4          30      6
Pooled rater low          5     31          10     26

Table 7.11 shows the essays where there were agreements between Rater 1 and Rater 2 and how these essays were dealt with by each automatic method. There were 36 essays that were rated high quality by both Rater 1 and Rater 2 and 36 essays that were rated low quality by both raters. Of the 36 essays that were rated high by both raters, the clustering algorithm identified 32 of these as high quality whereas the Bayesian algorithm only grouped 30 of them as high quality. This means the clustering algorithm classified four of the agreed high quality essays as low quality and the Bayesian algorithm did the same for six essays. Of the 36 essays that both human raters assessed as low, the clustering algorithm also identified 31 of these as low quality whereas the Bayesian algorithm only grouped 26 of them as low quality. This means the clustering algorithm classified five of the agreed low quality essays as high quality and the Bayesian algorithm did the same for ten essays.

Table 7.12: Agreed essays by raters and algorithms

                        Pooled algorithms
                         high    low
Pooled raters  high       28      2
               low         4     25

Table 7.12 shows the 59 essays that were rated the same by the two raters and rated the same by the two automatic algorithms. There were 28 essays rated high by both human raters, and both algorithms. Similarly, there were 25 essays that were rated low by both raters and by both algorithms. There are two essays that were rated high by both raters but were rated low by both automatic methods. Similarly there were four essays that were rated low by each rater but rated high by both automatic methods. These special cases suggest there may be some essays that are not being correctly assessed by automatic methods. It may be instructive to examine such essays in more detail to see if there are any characteristics that may identify them and can be incorporated into any future algorithm to facilitate more accurate rating.

With hindsight, it would also have been better to have the essays rated by one or two more raters to help evaluate whether there were any problems with the automatic assessment algorithms. These extra raters would afford greater reliability in the human rating results and allow us to better spot anomalies in the results of the automatic systems. A strength of the study by Page (1994) was the comparison with high numbers of human raters.

7.4.2 Inter-algorithm reliability

Although one of the advantages of automatic assessment is that it eliminates some of the reliability concerns which plague human rating, it does come with some concerns of its own. Two important reliability concerns for humans are inter-rater reliability and intra-rater reliability. Inter-rater reliability refers to the consistency of scores awarded to the same essay by different raters. If two raters score the same essay differently, it is an obvious problem. Intra-rater reliability refers to the consistency of rating by a particular rater: on different occasions the rater might award different grades to the same essay. While automatic assessment eliminates these two concerns, it raises another reliability issue of its own: inter-algorithm reliability. There are potentially many different algorithms for automatic assessment; this chapter includes just two out of a great number of possibilities. Inter-algorithm reliability is concerned with the consistency of rating between algorithms. It checks whether different algorithms rate the same essay in a similar way. The agreement between the two algorithms in this chapter is shown in Table 7.13.

Table 7.13: Agreement of the two automatic algorithms

                    Clustering
                   High    Low
Bayesian   High     39     11
           Low      11     39

Comparing the performance of the clustering algorithm and the Bayesian algorithm, we find that there was agreement in 78 cases out of 100: agreement in 39 high cases and 39 low cases. This translates to a Kappa inter-algorithm reliability of 0.56. This is higher than the human inter-rater reliability but there is still considerable scope for improvement. In particular, since one of the arguments for automatic assessment is to improve reliability, we might hope to achieve a higher value.

7.4.3 Essay length

It is worth considering the relationship of essay length to the various ratings. There are two key aspects to this. Firstly, we can check whether the algorithms perform better than a simple classification according to essay length. Secondly, we can check whether the algorithms are overly dependent on essay length. In order to consider these questions, the essays were classified according to length only, with the fifty longest essays classed as good quality and the fifty shortest essays classed as poor quality. The results of this classification were then compared with the classifications by raters and algorithms, as shown in Table 7.14.

Table 7.14: Agreement of essay length with raters and algorithms (out of 100)

                Rater 1    Rater 2    Clustering    Bayesian
Essay length      74         74           88           88
Clustering        78         76           -            78
Bayesian          70         70           -            -
Rater 2           72         -            -            -

The results show that essay length is a reliable predictor of rating for this set of essays. In fact, essay length agrees better with both raters than the raters do with each other. There were 74 agreements out of 100 between an essay length model and either rater but only 72 agreements between the two raters. However, the clustering algorithm corresponded more closely with both raters than essay length alone: it matched Rater 1 in 78 cases compared with 74 for essay length alone, and matched Rater 2 in 76 cases compared with 74 for essay length alone. This suggests that the algorithm may be an improvement over using only essay length as a predictor. However, essay length alone performs better than the Bayesian algorithm in terms of agreement with the human raters, with 74 matches compared with only 70 for the Bayesian algorithm.

The results also show that essay length is very strongly associated with the results of the two automated algorithms, with 88 out of 100 matches in each case. It is worth noting that, although both algorithms scored 88 matches with essay length, they showed different patterns of matches. This evidence suggests that these algorithms may be overly dependent on essay length.

This evidence of a relationship with essay length, while inconclusive, is intriguing. It might be possible to test these algorithms on sets of essays where quality is not so correlated with essay length. This would be useful for a number of reasons. Firstly, it would present an opportunity to identify other features that may be useful in calibrating quality but are being dominated by essay length in timed essays. Secondly, it presents an opportunity for examining automatic assessment in situations where length is not such an important factor. Most of the experiments in this study have involved timed essays where length is often the overwhelming factor. There are a number of types of essays where quality is likely to be less related to length. For example, essays of higher proficiency learners tend to be longer but the differences in length are not likely to affect quality to the same degree as differences in length in essays of lower proficiency learners. The length of non-timed essays may also vary to a lesser degree than the length of timed essays. In addition, essays could be controlled for length. In experiments, samples of fixed length are sometimes taken to solve the problem of length effects. However, this practice introduces the concern that longer essays may be penalized because a shorter sample may not be representative of the longer essay. One way around this is to control for length in the task design and require learners to produce an essay of a fixed length.

7.4.4 Selecting an algorithm

This chapter has proposed and compared two algorithms. Both the clustering and the Bayesian algorithm seem to be effective at correctly assigning essays as either high quality or low quality. Both algorithms presented here are quite crude implementations of clustering and Bayesian classification, and improvements in these techniques are likely to be reflected in enhanced results. For example, with clustering, various clustering techniques could be used, i.e. the conditions by which elements are added to clusters could be varied. Similarly, there are various distance measures that could be used.

Both the clustering and the Bayesian algorithms are quite complicated and so it takes time to understand exactly how they work and how to improve them. Both algorithms could also be improved by feedback. As the applications are used, results could be fed back to improve performance. For example, typical occurrences of commonly occurring words, or ranges of lexical features that are strong predictors of quality, could be 'learned' by the application and given prominence in the analysis.

The two experiments highlight two different approaches. The clustering algorithm depends on lexical features. The Bayesian algorithm also uses lexical features to identify a training set but then classifies the remaining essays according to the semantic similarity of the essays. The approaches could be tried in reverse. The clustering algorithm could be based on semantic content; in this form, it would resemble the process of latent semantic analysis. Alternatively, a Bayesian algorithm could be based on lexical features. Bringing these two approaches together in the same algorithm could increase the amount of information involved and thereby boost the reliability of the results. This could be achieved either by producing a new algorithm which includes both types of analysis or by running two separate algorithms and somehow pooling the results.

Both methods have solved the training set problem in different ways. The clustering algorithm avoided the need for an external training set by setting imagined initial cluster points based on the orientation of the different features. The initial cluster points could be chosen in a different way to possibly boost performance. One way to investigate this might be to map the initial cluster points in relation to the essays that are added to each cluster. If many essays cluster closely around these points, then they may be good choices. If the essays seem to cluster in other places, then the selection of initial cluster points could be rethought. In the Bayesian algorithm, a training set was detected automatically from within the set of essays by identifying essay features that indicate high quality essays with a high degree of probability. It is likely that other high quality or low quality conditions could be found, or that some of the conditions may prove more reliable than others.

7.5 Conclusion

In this chapter, two assessment algorithms have been considered. In the first, an algorithm based on clustering was used to classify 100 essays as high quality or poor quality. A number of essay features were used to cluster the essays. These features were refined from a larger group of features by a principal component analysis, which finds components that are independent of each other and account for maximum variance. Once six optimum features had been identified, the essays were clustered around two cluster points, one high quality and the other low quality. The results of this clustering algorithm agreed with the human raters in 78% and 76% of cases, which was more agreement than between the two raters themselves.

The second algorithm was based on a Bayesian classifier and was tested on the same rated essays as the clustering algorithm. Bayesian classifiers require training samples. A training sample of potentially high quality essays was identified by utilizing a finding of the previous chapter that essays scoring very highly on both essay length and lexical diversity tend to be good essays. Using such conditions, sixteen likely high quality essays were identified and used to train a Bayesian classifier to recognize high quality essays by semantic content. The results of this Bayesian algorithm agreed with human raters in 70% of cases. The algorithm seemed to be more accurate in classifying high quality essays than poor quality essays. The training sample selection portion of the algorithm was more reliable than the Bayesian classifier portion. There may be problems with using a Bayesian classifier with so few training essays.

The two algorithms in this chapter represent only two out of a huge number of possible algorithms. Other classification methods could be tried, including more complex systems. Other ways of implementing clustering and Bayesian classifiers could also be tried.

In the next and final experimental chapter, a more unusual approach to essay assessment that may be suitable when multiple essays are available is considered.

Chapter Eight: Capture-recapture analysis

8.1 Introduction

Chapter Six and Chapter Seven investigated various models for assessing essays using a number of complementary essay features. This final experimental chapter investigates an alternative approach that could be applied in some special circumstances, such as when multiple essays are available or when essays are particularly long.

Capture-recapture is a method developed in biology to estimate animal or fish populations in the wild. It has recently been adapted for a number of vocabulary-based studies in L1 (Marcus et al., 1992) and in L2 (Meara & Olmos Alcoy, 2008) as a possible way to estimate vocabulary knowledge through the 'capture' of language samples. Marcus et al. used L1 spoken samples of children while Meara & Olmos Alcoy used essays from L2 learners of Spanish. If capture-recapture analysis of an essay can provide estimates of the vocabulary knowledge of the writer, then it might also be used to assess the quality of the essay itself. The method could then be used to aid assessment in situations where multiple essays can be collected, for example in coursework or in online self-assessment applications. This experiment investigates the suitability of the capture-recapture method for assessing the quality of L2 learner essays.

8.1.1 Background to capture-recapture analysis

Capture-recapture methods were developed for estimating wild animal and fish populations (Seber, 1982). The method is quite basic. In the simplest case, two samples are taken from the population. The individuals captured in the first sample are tagged and then returned to the population. After allowing the population to mix freely, a second sample is captured and the numbers of tagged and untagged individuals are recorded. Three groups can now be identified:
1) individuals that were captured in sample 1 (n1)
2) individuals that were captured in sample 2 (n2)
3) individuals that were captured in both samples (n12)

In order to estimate the population size, N, the following Petersen estimate can be used:

N = (n1 × n2) / n12

This estimate involves the following assumptions:
a) there is no change to the population over the time period of the two samples.
b) individuals do not lose their tags.
c) for each sample, each individual has the same probability of being in that sample.
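Treating word types as the 'individuals' to be captured (as in the analyses later in this chapter), the Petersen estimate can be sketched as follows; the whitespace tokenization is a simplifying assumption, not the procedure of the L-capture program.

def petersen_estimate(essay1, essay2):
    types1 = set(essay1.lower().split())     # word types 'captured' in task 1 (n1)
    types2 = set(essay2.lower().split())     # word types 'captured' in task 2 (n2)
    recaptured = types1 & types2             # types appearing in both tasks (n12)
    if not recaptured:
        return None                          # estimate undefined with no recaptures
    return len(types1) * len(types2) / len(recaptured)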

This is a simple model and is not expected to perfectly represent any actual situation. As such, the assumptions may not reflect the real world. For example, in the real world, there are likely to be changes to the population in terms of births and deaths. Some animals may lose their tags. Some animals may have greater chances of being in the sample, perhaps depending on species, age, size, temperament or other factors. In addition, animals that have been caught once may become more or less wary, and so less or more likely to be caught in a subsequent sample.

8.1.2 Application of capture-recapture analysis to L2 lexical studies

This model can be applied to the estimation of the number of words in the lexicon of second language learners. Samples of language use can be taken and individual words captured, tagged and recaptured. The method was used by Marcus et al. in an L1 investigation into overregularization of verbs in children using spoken samples. Meara & Olmos Alcoy used it to investigate the vocabulary knowledge of L2 learners of Spanish. When using this method in a language context, there are some obvious problems with the aforementioned assumptions. The first assumption is clearly violated. There is likely to be a change in the population of words known over the time period of the experiment. For example, no matter how close together in time the samples are taken, there are likely to be new words added to the lexicon and perhaps other words lost through attrition. However, these changes are likely to be minimal in relation to the total words under consideration.

A more serious problem is in relation to the third assumption. Words are not equally likely to be captured in a sample. Some words are used far more frequently than other words. There are two noticeable cases. Firstly, content words that relate to that context are more likely to occur in the sample than words unrelated to the context. In fact, many non-contextual words have little or no chance of occurring in the sample. For this reason, Meara & Olmos Alcoy argue that it might be better to think of the population under investigation, N, as the total number of words available for a particular task rather than the total words in the lexicon of a learner. They suggest controlling for context by getting learners to write two essays on an identical task.

Secondly, function words tend to occur far more frequently than content words and so are more likely to occur in the sample. This might be expected to lead to an underestimate of the total words available for the task. If we think of it in terms of catching fish, function words act as huge fish, grabbing the bait and denying some smaller fish the chance to get into the samples. In reality, all words have differing chances of being included in the sample. Marcus et al. cope with this by using a Jackknife estimate (Burnham & Overton, 1978) over multiple captures. By using multiple recaptures rather than just one, a more detailed model can be constructed. For example, Marcus et al. used five captures. This meant that some words were captured in only one sample, some in two, three or four samples, and some in all five. This enabled them to broaden their model to allow for words having different probabilities of being in the sample. This is not possible with a two sample analysis. In a two sample analysis, one way to cope with high frequency function or contextual words might be to take them out of the analysis. If we analyze only the words that occur once in each essay, then this is likely to eliminate function words and highly contextual words.

If, despite possible violations of the assumptions of the model, this method can produce an estimate of the productive vocabulary size of the learner, it seems reasonable that this same estimate should also be informative about the quality of the essays themselves. This depends on the assumption that essays suggesting a writer has a relatively large vocabulary are also relatively good quality essays.

For the method to work, we need to check that the essays involve enough words to produce a meaningful estimate. In order to do this, in this experiment, a capture-recapture analysis is conducted on two identical tasks by the same learners. Two analyses are carried out. One uses all words in an essay and the other uses all words except high frequency words. For this experiment, high frequency words are defined as words that occur more than once in either language sample. The sizes of the estimates of the two analyses might be expected to differ for several reasons. One reason is that they will be measuring different things. Following Meara & Olmos Alcoy’s lead, the estimate of the full analysis can be thought of as an estimate of the productive vocabulary of the learner available for the task. The estimate of the second analysis will be an estimate of the productive vocabulary available for the task minus function and highly contextual words.

The two estimates may also vary because we expect the first analysis to give an underestimate of the actual vocabulary size because of the effect of high frequency words. The second estimate should, in theory, be more accurate.

For assessment purposes, however, we are not interested in absolute values but rather in relative values. If the values from the two analyses correlate strongly, then we can assume that the full analysis gives a reasonable estimate despite the violations of the assumptions.

8.1.3 Aims of the experiment

The aims of the experiment are to test:
1) whether a capture-recapture analysis without high frequency words gives a different estimate to the full analysis.
2) how many words are necessary to produce a reliable estimate.

8.2 Methodology

8.2.1 Participants and tasks

Two written tasks were attempted by a group of 36 university-level Japanese learners of English. The tasks were identical but were conducted at an interval of several weeks. Students were asked to produce a sample of writing based on a six-caption cartoon prompt. The cartoon used is Cartoon Task 1 in Appendix 3.1. The students were given 25 minutes to write as much as possible.

8.2.2 Analysis

A capture-recapture analysis was performed using a specially designed computer program, L-capture (see Appendix 1.1). Two analyses were carried out. In the first analysis, all the different word types produced by each learner were used. For each essay, the total number of word types on the first task (nt1), the total number of word types on the second task (nt2) and the total number of word types used in both tasks (nt1t2) were calculated. The Petersen estimator (N) was calculated for each student as follows:

N = (nt1 × nt2) / nt1t2

A second analysis was conducted with words that occur no more than once in each task. This should eliminate frequently occurring function words and words strongly related to the picture context. Again, the same computer program, L-capture, was used for the analysis. Using these words only, the new population being estimated is slightly different since it does not include multiple-occurring words. If this new population under investigation is Na, then the Petersen estimate of it, Na, was calculated for all students as follows:

Na = (nat1 × nat2) / nat1t2

where nat1 is the number of word types on the first task excluding multiple-occurring words, nat2 is the number of word types on the second task excluding multiple-occurring words, and nat1t2 is the number of word types occurring in both tasks excluding multiple-occurring words.
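A corresponding sketch of the reduced analysis might look like the following; it again assumes simple whitespace tokenization and reads 'multiple-occurring' as occurring more than once within a task, which is one interpretation of the description above.

from collections import Counter

def petersen_reduced(essay1, essay2):
    def once_only(text):
        counts = Counter(text.lower().split())
        return {w for w, c in counts.items() if c == 1}   # keep types occurring exactly once
    a1, a2 = once_only(essay1), once_only(essay2)          # nat1, nat2
    both = a1 & a2                                         # nat1t2
    return len(a1) * len(a2) / len(both) if both else None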

8.3 Results

Table 8.1: Results of full capture-recapture analysis

                             nt1      nt2      nt1t2     N
mean                        52.69    59.14    28.67    110.3
standard deviation          10.15    14.28     7.63     26
coefficient of variation    19.26    24.15    26.60     23.56
highest value               99       121      64       187.2
lowest value                40        35      19        53.85

Table 8.1 shows the results of a full capture-recapture analysis. The average number of words available for the task, N, was 110.3. There was quite a large range of N from 53.85 to 187.2 perhaps suggesting a range of quality in the essays.

Table 8.2: Capture-recapture analysis using words only occurring once in a task

                             nat1     nat2     nat1t2    Na
average                     27.44    33.08     8.92    109.58
standard deviation           5.91     9.12     3.35     39.96
coefficient of variation    21.53    27.53    37.6      36.46
highest value               51        71      23       200.66
lowest value                18        16       5        32

Table 8.2 shows the results of a capture-recapture analysis using words that occur only once in a task. The average estimate, Na, is 109.58, which is very similar to the average estimate in the full analysis. This is deceptive, however, since this is an estimate of a different population: the total words available for the task excluding function words and core content words. The similarity in population averages underlines the fact that the full analysis underestimated the number of words available for the task. The range of values was wider in this reduced analysis than in the full analysis. Values ranged from 32 to 200.7, which is also reflected in the higher coefficient of variation, 35.6 in this analysis compared to 29.2 in the full analysis.

The second analysis should give a better theoretical estimate since the data were adjusted to match the assumptions of the model. To see whether this translates into a more accurate and reliable result, the statistics and results of each analysis can be examined.

Although the Petersen estimate was about the same in both analyses, fewer words were used in the second analysis. Particularly noticeable was the small number of words that appeared in both samples, with an average of 8.92 and a range from 5 to 23. The low number of words at the bottom of this range could be a concern as it may lead to unreliability. The larger range of scores in the reduced analysis may show that the score is more sensitive, but it may also reveal unreliability because of the smaller numbers involved in the analysis. Looking at these statistics, there must be some concern about whether the reduced analysis contains enough words to produce a reliable estimate.

Figure 8.1: Estimates for each essay on both analyses (reduced analysis estimate plotted against full analysis estimate)

By comparing the results, the relative merits of the two analyses can be appraised. If the results are very similar, then the reduced number of words does not give an improved performance. However, if the results are quite different, then the conclusion is less clear. It may mean the reduced analysis is better and that including the high frequency words in the analysis has distorted the results. However, it may also mean that the reduced analysis is not as good, perhaps because the low number of words leads to unreliability. It may also mean that neither analysis is very accurate. To test the similarity of results, a correlation analysis was carried out on the results of the full analysis and the reduced analysis. The results correlated with r = .85. This was highly significant but not as high as we might wish for measurement reliability. It seems to indicate that both estimates were broadly measuring the same thing but that there was a small difference. The nature of the difference is less clear.

Figure 8.1 shows the estimates for each essay on the two analyses. The graph shows a broad agreement between the two analyses. However, the variation seems to increase with larger scores and the two analyses seem to produce similar scores at the lower end of the scale. More variation in absolute terms is to be expected because of larger values. For example, 10% of 200 is larger than 10% of 50. However, the variation at the upper end seems to exceed this. The difference at the upper end is highlighted by an outlier essay that got the highest estimate of close to 200 on the full analysis but only the sixth highest estimate on the reduced analysis. The relative agreement at the bottom of the scale suggests that the reduced analysis may be quite robust. The worry about the reduced analysis is that the small number of words may lead to unreliable estimates. If this was the case, then differences between the estimates are likely to be noticeable at the bottom end of the scale. The differences at the top end of the scale seem to point to fluctuations caused by the number of multiple occurring words in the full analysis. This suggests that the increased accuracy possible by a reduced analysis may outweigh any potential loss of accuracy by using less data. However, this interpretation is speculative. The results are far from conclusive and further investigation of estimates against quality assessments would be helpful.

8.4 Discussion

8.4.1 Sampling reliability

The capture-recapture method depends on sampling from a population N. Sampling involves error, and this error could lead to a considerable range of scores over repeated samplings. A wide range of sampling scores is problematic for a prospective measurement tool. Given the very small sample sizes involved in these analyses, the effect of this error could be serious. To investigate the extent of this sampling error, a population of size N was sampled numerous times with sample sizes S1 and S2. A capture-recapture analysis was applied to these samples and the range of N estimates compared. For comparison with the experimental data, similar population and sample sizes were taken. In the full analysis, the mean estimate of population size, N, was 110.3 and the average sample sizes were 53 and 59. In the analysis using only words occurring once, the mean population estimate, Na, was 109.6 and the sample sizes were 27.4 and 33. Therefore two computerized simulations were undertaken to reflect these different parameters. The L-capsim computer program (see Appendix 1.1) was used for these simulations. The first simulation was designed to mirror the full analysis and used a population of N=110 and sample sizes of 53 and 59. In the second simulation, the population size was also 110 with sample sizes of 27 and 33. Using these simulations, the precision with which writing samples of these sizes can predict the true population can be investigated. The results of the simulations are shown in Table 8.3.

Table 8.3: Values of N for two simulations

                                  simulation 1    simulation 2
N                                     110             110
first sample size                      53              27
second sample size                     59              33
number of samplings                  1000            1000
average population estimate N      111.165          110.76
standard deviation of N             10.27            29.20
coefficient of variation             9.24            26.36
high                               147.27           317.33
low                                 85.26            63.47

Not surprisingly, simulation 1, involving the larger samples, showed much less variability in the range of population estimates, with a coefficient of variation of 9.24. Simulation 2 had a large range of values and a coefficient of variation of 26.36. If this value is compared to the coefficient of variation of the scores of the L2 learners, the variation is quite similar. This suggests that for low sample sizes, the variation inherent in any individual value of the estimator Na may be as great as the variation between different learners, raising doubts as to its suitability for measurement purposes. Furthermore, a simulation using N=110 with varying sample sizes (with both samples S1 and S2 of equal size) shows that at low sample sizes there is great variation.
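The kind of simulation reported here can be sketched as follows (a minimal stand-in for what the L-capsim program is described as doing; the parameter names and the fixed random seed are illustrative choices).

import random
import statistics

def simulate(N=110, n1=53, n2=59, runs=1000, seed=1):
    rng = random.Random(seed)
    population = list(range(N))
    estimates = []
    for _ in range(runs):
        s1 = set(rng.sample(population, n1))     # first capture
        s2 = set(rng.sample(population, n2))     # second, independent capture
        overlap = len(s1 & s2)
        if overlap:                              # Petersen estimate for this sampling
            estimates.append(n1 * n2 / overlap)
    return statistics.mean(estimates), statistics.stdev(estimates)

For example, simulate(110, 53, 59) gives estimates centred near 110 with a modest spread, while simulate(110, 27, 33) produces a much wider spread, in line with the pattern in Table 8.3.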

Figure 8.2 shows the range of values for various sample sizes, with the upper line showing the mean plus one standard deviation and the lower line the mean minus one standard deviation. The graph suggests that with a sample size of less than 40, there is likely to be a wide range of values. With a sample size of just under 60, the estimator should be within plus or minus 10 of the true population value.

Figure 8.2: Range of N estimates for various sample sizes for population size 110

8.4.2 Minimum essay length

In the sampling simulation, a sample size of around 60 words is necessary to give a Petersen estimate within about 10% of the true population value. However, this sample size is not essay length. In the full analysis, sample size is the number of word types in the essay.


Figure 8.3: Number of word types and essay length

Figure 8.3 shows the number of word types plotted against essay length for the essays in this study. This suggests that essays of about 120 words usually produce about 60 word types. The number of word types occurring no more than once in either essay is shown in Figure 8.4.


Figure 8.4: Number of types occurring only once and essay length

Although most essays in this experiment produced far fewer of these once-occurring words, we can predict from the linear trend that an essay of about 250 words would be necessary to produce enough of them (around 60) for an estimate within 10% of the true population value.

The conclusion is that to get the best estimate, and in order to allow for the assumption that all words have an equal probability of occurring in the sample, essays of around 250 words are necessary. If this assumption is ignored, then essays of upwards of 120 words may be sufficient. Of course, the latter estimate may be susceptible to error due to violation of the assumption. These estimates are only a guideline. They may help explain the slightly low measurement reliability between the results of the two analyses. The full analysis may not be accurate because of violations of the assumption regarding equal likelihood of word capture. The reduced analysis may also not be accurate because of a shortage of words.

8.4.3 Using capture-recapture analysis for assessment

The obvious problem with using this method for assessment purposes is that it requires at least two samples of written work. Many assessment situations cannot accommodate the collection of multiple writing samples. However, it could be an excellent choice for continual assessment situations where students submit written assignments over a period of time, say, a one-year university course. Here it could not only give a guide to essay quality but also measure development in proficiency. With multiple samples, the system could probably be used with some accuracy. In Marcus et al., capture-recapture was used with a Jackknife estimate over not only two time points but a total of five captures or recaptures. Repeated samples enable the use of a more realistic model which takes into account the fact that words have differing probabilities of occurring in a sample. This means all the words in the sample can be incorporated into the analysis, which helps minimize sampling variability. This considerably improves the precision of any estimates. With multiple samples of writing, fewer words would be necessary for each capture. The method should also be more accurate at estimating the size of the productive vocabulary used, even for essays on different topics.

For the typical one-time testing situation, the capture-recapture method seems more difficult to incorporate into a system. One possibility would be for learners to produce two pieces of writing in a testing situation and for both to be assessed. Some tests, such as the IELTS written portion, use two written tasks. Research into whether two samples of differing content could be combined to produce an estimate would be useful. Another possibility could be with more advanced learners who are asked to produce an essay that is long enough to be split into two halves and assessed in this way. Given the results of this study, producing enough types for a reasonably reliable estimate from two samples would need an essay of at least 240 to 500 words. Of course, a split-half methodology may also introduce other problems. For example, there may be stylistic differences between the two halves of an essay which could affect the analysis.

8.5 Conclusion

The experiment in this chapter involved using capture-recapture analysis to find an estimate of a learner's vocabulary knowledge and using this estimate to determine the quality of the learner's essay. This experiment has highlighted a number of problems with employing the capture-recapture method for the measurement of essay quality. Firstly, unlike the capture of wild animals, where an individual animal can only be captured once, it is possible for words to appear multiple times in a sample. Secondly, words have differing probabilities of appearing in a sample. Function words are more likely to appear than content words, and contextually relevant

words are more likely to appear than contextually incompatible words. This complicates the process of getting a reliable estimate. Thirdly, sampling reliability dictates that samples be as large as possible, but L2 learners' written output tends to be sparse.

In this experiment, two analyses were carried out to find an estimate of essay quality. One analysis included all the words in the sample. The other included only words that appeared once. This was to control for the influence of words that appear often in an essay such as function words. Including these words may give a biased estimate of essay quality.

The results of the two analyses correlated highly, which suggests that the full analysis may give a reliable estimate despite its theoretical limitations. The weakness of the second analysis is that it depends on a small number of words and may be affected by sampling variability. The results of a computer simulation suggested that about 60 words per sample would be necessary for a reliable estimate. A sample of 60 words translates to an essay length of about 120 words for a full analysis or about 250 words for a reduced analysis. These results are intriguing, but more research is necessary into how these estimates relate to quality assessments. Using them to estimate essay quality also depends on the further assumption that an essay that suggests a writer has a large vocabulary is also a good essay.

Another major problem with this method is the need to repeat a task for a second sample. This makes it difficult to use in regular testing situations. An alternative approach might be to split a text into two and then do an analysis on each half. Of course, the minimum essay length would double using this approach. The capture-recapture approach seems ideal for a continual assessment situation. If more samples of writing can be collected, then a more complex model can be used which incorporates words with differing probabilities of occurrence. Because this kind of model could incorporate all the words in an essay without violating the basic assumptions, it could provide a more valid and reliable estimate with relatively short written samples.

This chapter has concluded the experimental section of the thesis. The next chapter is a discussion of some of the issues raised by this experimental work.

Chapter Nine: Discussion

9.1 Introduction

This chapter discusses some issues raised by the results of the experimentation in this thesis. The first part looks at some of the lexical features which have appeared in this study and gauges their strengths and weaknesses for automatic assessment. The second part reflects on how this research fits in with research in L2 vocabulary studies. The third part looks at some of the issues involved with automating the assessment process. The fourth part considers some of the limitations of this study. In the final part, some further avenues of research into automatic assessment are explored.

9.2 Lexical features

In this study, various lexical features were examined for their suitability in gauging quality of L2 learner essays. In this section, some essential properties of these features are investigated. Length of text is considered separately because of its unique role in assessment.

9.2.1 Properties of lexical features for assessment purposes

The following are three essential properties of a feature for use in the assessment of low level learners:
1) It is able to identify differences in essays of varying quality, particularly those of low level learners.
2) It shows reliability over different pieces of work by the same learner.
3) It can be calculated and interpreted easily and automatically.
The following property is also attractive:
4) It exhibits some validity as an indicator of quality.

9.2.1.1 Identifying quality differences in essays of low level learners

The first point, an ability to identify differences in quality, is an obvious requirement for assessment purposes. Many of the features have already been shown to be related to essay quality in other studies. For these measures, the question is how they perform with low level learners. For example, this study suggests that care needs to be taken with measures of extrinsic lexical diversity. The Lexical Frequency Profile (LFP) (Laufer & Nation, 1995) has been shown to correlate with assessments of written quality both by the creators with Israeli learners of English and also by Goodfellow, Jones & Lamy (2002) with L2 learners of French. Both of these groups of learners were lower intermediate level. However, the results of the experiment

in Chapter 5.5 with low level learners suggest the LFP might not be suitable for lower level learners on some tasks. This may also be the case with P_Lex (Meara & Bell, 2001). Meara & Bell found that P_Lex was effective for lower intermediate to advanced learners, but there were also problems with P_Lex for the learners in the same Chapter 5.5 study. Advanced Guiraud, proposed by Daller, van Hout & Treffers-Daller (2003), was found to discriminate between high beginner learners in Chapter Six, but it did not perform as well as part of the clustering algorithm in the Chapter Seven experiment, which involved many lower level learners. One possibility for this effect with extrinsic measures could be a problem with the task rather than the learners. It may be that the tasks did not facilitate the use of lower frequency vocabulary. In that case, tasks may need to be checked carefully to ensure they elicit vocabulary from a range of frequency levels.

A number of other features showed a positive correlation with essay quality, for example, TTR(100), the TTR of a 100-word sample. Some features promised an ability to recognize differences between learner essays and native essays, but their ability to identify differences between learner essays of differing quality is as yet unclear and needs more research. This is the case with the following features: relative distance between a and the, the number of hapax legomena and mean word length. As with extrinsic measures, it is quite likely that the performance of individual features may vary according to the learners, tasks and compatibility with other features.

9.2.1.2 Reliability across tasks

Reliability of a feature over different tasks is important if we hope to use that feature with various tasks rather than on a single specific task. For research purposes, task-specific measurements are a viable alternative, but for testing in education, a more versatile measure that can be used with a variety of tasks is preferable. A major criticism from teachers about automatic assessment packages in L1 situations is related to the inflexibility of task selection (Herrington & Moran, 2001). With those packages, task selection is, in effect, often taken away from the teacher and put into the hands of the testing organization.

When checking for reliability over different tasks, we assume that learners perform consistently across different tasks. Therefore, a learner should produce essays of similar quality on the two tasks. This may not always be the case. There are many factors that can affect individual performance. We might expect some differences in individual performance across different tasks, but most learners should perform consistently. Also, differences in absolute scores may occur

across different tasks, reflecting differences in difficulty or context, but this should not compromise reliability, which depends not on absolute scores but on scores being relatively consistent. In other words, scores may vary across tasks but ordering according to quality should be consistent.

Reliability of a feature is related to its frequency of occurrence. Features that occur only a few times tend to be unreliable. This helps explain the poor performance of the extrinsic measures of lexical diversity in some experiments. On certain tasks, some learners only produced a small number of advanced words. This was also a potential problem identified with some of the features used by Ferris (1994). Because many features in that experiment only occurred a small number of times, they were liable to great fluctuations over different tasks.

By contrast, features that rely on larger amounts of data tend to be more reliable. This was seen throughout this study with features such as essay length, word types and TTR, which tended to be reliable across tasks and learners. However, the measure of distinctiveness, despite incorporating large amounts of data, showed problems with reliability.

This study also hinted that, in some cases, cross-task reliability testing may give a misleading picture of a feature. Some features may be quite task specific, and so testing reliability on different tasks may suggest the feature is unreliable. These features may be more reliable on very similar or identical tasks. In Chapter 3.3, distinctiveness was found to be quite task specific. It showed relatively high reliability when learners completed two identical tasks but much lower reliability with learners on different tasks. In many assessment situations, learners are responding to exactly the same prompt, so task specificity itself is not a problem. However, a lack of reliability over tasks is likely to indicate the lack of a strong relationship with quality. It is worth noting that measures such as essay length, word types and TTR tend to show reliability across all kinds of tasks.

9.2.1.3 Automatic calculation

A major aim of this thesis is to investigate the issues involved in attempting to produce a simple computer application to assess essays. Therefore we need some simple measures that can easily be calculated by a computer program. Some features, such as essay length or the number of word types, can be calculated directly and simply. Others may not easily be calculated exactly, but a good estimate can easily be produced. An example is mean sentence length. It is not easy for a simple program to determine exactly where each sentence begins and ends, but mean

sentence length could be estimated by dividing the total number of words by the number of sentence-ending punctuation marks. It is worth remembering, though, that there are occasions when this kind of approach can be seriously affected by stylistic choices. For example, full stops used as decimal points could affect an estimate of mean sentence length.
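As a rough illustration of this estimate, the following Python sketch divides a word count by a count of sentence-ending punctuation; the regular expressions and the guard against decimal points are assumptions of the sketch rather than a description of the programs in Appendix 1.1.

import re

def mean_sentence_length(text):
    # Total words divided by the number of sentence-ending punctuation marks.
    words = re.findall(r"[A-Za-z']+", text)
    # Count ., ? and !, skipping full stops adjacent to digits as a rough
    # guard against decimal points.
    enders = re.findall(r"(?<!\d)[.?!](?!\d)", text)
    return len(words) / max(len(enders), 1)

print(mean_sentence_length("The cat sat. It slept! Pi is 3.14, roughly."))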

Other features may prove more difficult to calculate. For example, the number of error-free T-units has been shown to be a very good indicator of proficiency (Larsen-Freeman & Strom, 1977). A T-unit, let alone an error-free T-unit, cannot easily be calculated or predicted automatically. It is obviously related to sentence length and clause length, and yet its strength lies in its difference from these measures. If we look back at the features identified by Ferris (1994), we can divide them broadly into features that are easy to calculate directly, those that are possible to estimate and those that are more difficult to estimate. Some examples are shown in Table 9.1.

Table 9.1: Variables from Ferris (1994) in terms of their ease of calculation

easy to calculate            possible to estimate           difficult to estimate
number of words              words per sentence             deictic reference
mean word length             number of relative clauses     coherence features
1st/2nd person pronouns      negations                      synonymy
repetition                   comparatives                   complementation

However, there seems to be plenty of scope for developing measures from features that are easy to calculate or estimate.

9.2.1.4 Validity

The validity of automatic assessment systems is largely based on reliable prediction of human ratings. Validity is not the most important concern for individual features, but it is certainly a useful property. For automatic assessment, face validity can be particularly important given the reservations some people hold about using computers in assessment (Herrington & Moran, 2001). Even the e-rater system, which produces fairly accurate results, comes under criticism because its features are seen as relying overly on essay length.

Some features may exhibit more validity than others. For example, lexical error as a feature would seem to exhibit strong validity. It is widely recognized that more proficient learners are likely to make fewer errors while less proficient learners make more errors. The reality may be slightly more complex than this impression. Larsen-Freeman & Strom (1977) found that a linear progression of decreasing error does not always occur. They report that for errors involving articles, while high quality essays had the fewest errors as would be expected, poor quality

essays actually contained fewer errors than medium quality essays. Nevertheless, the relationship between error and essay quality may be quite easy to appreciate and lends validity to using an error estimate in assessing learner essays. In other cases, the basis for validity may be less clear. In those cases, validity may depend on making certain assumptions or conditions. For example, uniqueness and distinctiveness could have validity if one argues that as learners become more proficient their vocabulary knowledge should increase. As they know more words, it follows that they will be more likely to use unique and more distinctive vocabulary than less proficient learners who know fewer words. Of course, the more realistic the assumption, the stronger the case for validity. As we saw in Chapters Three and Four, this argument for uniqueness and distinctiveness may not stand up to close scrutiny.

Validity may be applicable only under certain conditions. For example, we might argue that the LFP is valid for lower intermediate learners and above because we expect them to have an adequate range of vocabulary that can be reliably identified at the different frequency levels. However, it may not be valid for lower level learners or in relation to tasks that do not elicit a suitable range of vocabulary. Typical usage of a and the might be argued to be a valid indicator for discriminating between L1 speakers and L2 learners of any level if one argues that these words may be two of the final words to be acquired completely and used in a native-like way by learners from some L1 backgrounds.

Of course, face validity does not ensure reliability. This can be seen in the case of mean word length. The case for validity is strong. Zipf (1932) showed that more frequent words tend to be shorter. This is reflected in mean word lengths in letters of 5.25, 6.57, and 7.29 for the headwords of the 1000, 2000 and University Word List frequency levels respectively. Therefore, if more proficient learners are more likely to use infrequent words, then we might expect mean word length to be higher for more proficient learners. However, the weak correlations with quality found here suggest that although the assumption may be a sound one, short L2 essays may not give more proficient learners sufficient opportunity to deploy enough lower frequency words for this to show up reliably in mean word length.

Validity considerations may become more complex when using features in combination. On one hand, validity should be enhanced if a variety of measures tap into different aspects of knowledge. On the other hand, there is a need for balance. E-rater has been criticized for relying on measures that all seemed to be affected by essay length.

9.2.2 Length of essay

It is worth considering length of essay as a special case. It has been shown to have a major correlation with essay quality, and it also often affects the correlation of other variables with essay quality. Moreover, its role in both these effects is controversial. Various studies have found essay length to be an important indicator of essay quality in many situations. In this study, essay length was shown to be both more reliable over two tasks than uniqueness and distinctiveness and also more able to identify high and low quality essays than these same two statistics. Ferris (1994) found essay length to be the most significant indicator of proficiency in a study of 160 learners from Arabic, Chinese, Japanese and Spanish L1 backgrounds. Larsen-Freeman & Strom (1977) showed that essay length broadly correlated with proficiency even in tasks with no time constraint. The PEG automatic essay scoring system uses a measure of essay length as one of a set of features. However, other studies have found no significant relationship between essay length and essay quality, such as Perkins (1980), whose results showed an increase in essay length with an increase in proficiency, but this increase was not significant. It may be relevant that the rubric in that experiment advised students that rating would be based on essay length as well as essay quality.

On balance, the evidence suggests that in many cases, essay length is a good indicator of essay quality. The degree of correspondence between essay length and essay quality may depend on various factors such as the level of the learners, the task context and the task conditions. I think a strong relationship is likely to hold for lower level learners on time constrained tasks. This was borne out by the results in Chapters Six and Seven. In the two-dimensional model in Chapter Six, essay length accounted for over 61% of the variance, but the other dimension, lexical diversity, only accounted for about 4%. In Chapter Seven, a classification by essay length alone seemed to be able to account for quality better than human raters. For many low level learners, a small vocabulary precludes them from writing a lot. They may also find the physical process of accessing vocabulary and getting words down on the page more of a challenge. Therefore, it seems quite likely that lower level learners will write shorter essays. Of course, this relationship will not be perfect. Some more proficient learners may opt to use more time editing and correcting what they have written while others might spend the extra time producing more text.

Many other features are also affected by essay length. Variables that involve counting the occurrence of a feature are likely to be affected directly. As the length of the essay increases, the count of the feature is also likely to increase. This often enhances the performance of the feature in predicting quality by itself in cases where length also correlates with quality. In these cases,

correlation is improved, but the relative contributions of the feature and essay length remain unclear. Therefore, where such features are included in a set of variables, it is more usual for the feature to be controlled for length.

Where the feature involves a ratio, as in the type-token ratio, length effects may be more problematic. With many ratios, for example the TTR, the feature has an inverse relationship with essay length. This inverse relationship will work against the ability of the feature to predict quality. As the length of the essay increases, the value of the ratio is likely to decrease. In these cases, it is more important to control for essay length. However, essay length effects can be pervasive. A lot of research has gone into producing a lexical diversity statistic that is free of length effects. The most successful to date is probably the D estimate developed by Malvern et al. (2004). However, there is some evidence that in certain cases, this measure is still not completely free of length effects (McCarthy & Jarvis, 2007).

In assessment situations, since essay length may be positively related to essay quality, it is important to control for length in any feature that corresponds negatively with essay length. For features that correspond positively with essay length, it is probably preferable to control for essay length when an independent measure of essay length is also included.

The effect of length on essay evaluation is a complex and problematic one. In many written tests, the length of the essay may give a clue to quality. This is especially true for lower level learners for whom writing a lot in a time-constrained task may be an indicator of fluency. However, this holds less true for higher proficiency learners and for testing situations with no time constraint. The experiments in this study have mostly involved low level learners and time constrained tasks. As a result, the influence of essay length has been an overwhelming factor in the assessment models. To find out more about other essay features that can help assess essays, it may be worth investigating testing situations where essay length is less influential. This could be done by examining responses to length controlled written tasks where essay length will not be an issue. The influence of essay length may also be lessened when using non-timed tasks or essays of higher proficiency learners.

It is also worth noting that the wider assessment community tends to be sensitive about the correlation of essay length with essay quality. Testing organizations are often at pains to point out that raters reward quality rather than quantity. Length of essay is not usually explicitly referred

to in scoring guidelines. However, this seems to contradict evidence, such as Cumming, Kantor & Powers (2002), that essay raters actively consider essay length when rating essays. On the one hand, this could be an argument for the frailty of human raters. Even though essay length is not included on the scoring rubric, it seems to influence their rating anyway. On the other hand, it could equally be seen as a weakness of many scoring rubrics that a basic element such as length is not included. Managing the essay length problem is certainly one of the most challenging issues ahead in automatic assessment.

9.3 L2 vocabulary research

Nation (2007) has recently argued for using multiple measures in vocabulary research. One simple reason for using multiple measures is that should one of them fail due to task design, student characteristics or other unforeseen problems, another could act as a backup measure. Another reason is that using a variety of similar measures can help safeguard validity in research by identifying slightly different aspects of knowledge. With an array of complementary measures tapping slightly different aspects, a more accurate picture of learner vocabulary can be constructed. It is in the complementation of various measures that this study may assist vocabulary research.

In the area of lexical statistics of written productions, there has been something of a tradition of considering a range of measures. A major reason for this has been the lack of a standard measure of vocabulary and the obvious shortcomings of some measures that have often been used. A great deal of research has been devoted to trying out new measures and looking for superior alternatives to established measures. This intensive search for better measures has led to some significant developments. For example, the long-recognized effect of essay length on lexical diversity has led to the development of more complex measures such as the D estimate (Malvern & Richards, 1997).

Perhaps as a result of this goal of a perfect measure of lexical diversity, most studies have concentrated on simply finding the measures that were best in terms of reliability and correlations with written quality or learner proficiency. Consequently, there has been less emphasis on developing the complementary aspect of Nation's call. Little stress has been placed on exploring why lexical features give different results or on the similarities and differences in the performance of these features. In this respect, an important contribution is that of McCarthy & Jarvis (2007), who explored the performance of various measures of lexical diversity over differing essay lengths. They found that many of the features work well if used with texts of

certain lengths, and they specify the ranges of use for various measures of lexical diversity. This certainly diminishes the chances of researchers using inappropriate measures for a particular context and stresses the fact that most of these measures are useful given an appropriate context. Some of the experiments in this study have given clues about properties of measures of lexical diversity. In Chapter Six, strong correlations were found between the D estimate and Yule's K, suggesting they may be measuring very similar constructs. Both of these measures correlated less strongly with other measures of lexical diversity.

There seems to have been little research on how lexical measures could complement each other practically. For complementation, there first needs to be a more precise identification of how measures vary. In this respect, Meara and Bell's (2001) classification of intrinsic and extrinsic measures of lexical diversity has been a useful one. This distinction has enabled not only the similarities of these two types of measure to be fully appreciated but also their relative benefits and weaknesses to be explored. This, in turn, opens up the opportunity for their complementary use in the field. Once the complementary nature of measures becomes apparent, it is possible to use them together in a more meaningful way. Chapters Five, Six and Seven included findings that essay length and lexical diversity together can often account for essay quality better than either feature by itself. However, the relative contributions may vary according to task and level of learner. In the Chapter Six experiment, essay length was a dominant feature in a two-dimensional model incorporating essay length and lexical diversity. Similarly, in Chapter Seven, essay length was the dominant feature in the clustering algorithm. However, in an experiment in Chapter Five, lexical diversity seemed to account for quality better than essay length.

For vocabulary assessment, complementation seems an area of vital importance so the lack of focus is somewhat surprising. Vocabulary use has been shown to be a key indicator in assessment whether through lexical error, profiles of vocabulary frequency or demonstrations of sophisticated vocabulary usage. Many written tests include vocabulary profiles such as the ESL composition profile (Jacobs et al., 1981). These usually involve a holistic evaluation of vocabulary use even though more detailed descriptors may exist in the rating guidelines. However, a way to produce holistic measures of vocabulary use in essays has yet to be developed.

In vocabulary research in general, there is also a lack of a holistic way to assess vocabulary knowledge. Vocabulary research has identified various facets of vocabulary knowledge. The

most common two-dimensional model is the vocabulary breadth and depth model. Vocabulary breadth or size has been extensively studied with measures like the Vocabulary Levels Test (Nation, 1990) or the Eurocentres Vocabulary Size Test (Meara & Jones, 1990). Depth of vocabulary has been measured by the Vocabulary Knowledge Scale (Wesche & Paribakht, 1996). Meara (1996) has advocated a slightly different way of thinking about vocabulary knowledge, focusing on the nature of the lexicon as a whole rather than on individual words, and identified two possible dimensions: vocabulary size and organization. The organization dimension can be investigated through word association tests. There has been an increasing appreciation of the fluency/accessibility aspect of vocabulary use. Some include this as a third dimension, as in Daller, Milton and Treffers-Daller (2007), who visualize a three-dimensional model of breadth, depth and fluency. A fluency/accessibility element also seems to be implicit in Meara's organization dimension. Although there has been a lot of research on measuring each facet of vocabulary, there has been little effort to put these measures together to give an aggregate vocabulary score to a candidate's performance or to an essay.

In vocabulary production, it is not always easy to gauge these different facets. In a spoken interview, a skilled interviewer may have the flexibility to explore these different facets in some detail. However, this is much more difficult in written tests. The lack of control in elicitation has left researchers trying to find increasingly ingenious ways to squeeze more information out of sparse productions of words. To make matters worse, a lot of the commonsense measures such as lexical diversity do not lend themselves clearly to any multi-faceted model. In addition, their relatively poor performance in predicting quality suggests they are inadequate as holistic measures by themselves. In some cases, researchers have viewed gauging vocabulary knowledge from written productions as something of a dead-end, and Meara & Olmos Alcoy (2008) argue for using non-traditional written production tests such as Lex30 (Meara & Fitzpatrick, 2000). One result of these problems with written work is that we are less confident in making assessments of the candidate than we might be with spoken performance and more likely to make assessments restricted to the written sample in hand.

Practically bridging this gap between different facets of vocabulary knowledge is an area that is badly neglected. There is a plethora of tests and measures available but little research on how to use them together. One of the issues in complementation is how to incorporate various measures of different facets to produce a more holistic measure of vocabulary knowledge. In this sense, the model in Chapter Six was a crude two-dimensional model which I call quantity and content. Although this does not fit cleanly with a vocabulary breadth, depth and fluency model,

quantity could be seen as a proxy measure for fluency in low level learner tasks, especially timed tasks. Accessibility of the lexicon is clearly a factor in how much learners write in a limited-time task. Lexical diversity, whether intrinsic or extrinsic, or a vocabulary distribution measure like hapax legomena, could be viewed as a proxy measure for core vocabulary breadth. Two dimensions are likely to give a lot more predictive power than a single measure. If we wanted to complete the trio, we could perhaps propose a proxy measure for vocabulary depth. Lexical error might be a possibility, since lexical errors often occur in words the learner knows but uses incorrectly. Chapter Eight offered some alternatives for using more features together. In particular, supervised methods such as Bayesian analysis allow for control over the input variables and could be used with various kinds of model.

In this respect, I believe that there are many avenues for using measures in a more complementary way in both vocabulary research and vocabulary assessment. I believe this study contributes some simple ideas on how we can integrate different facets of vocabulary knowledge in order to come up with a more holistic evaluation.

9.4 Automating the process

A major aim of this study was to develop an algorithm that could automatically analyze essays input in electronic form. To that end, as far as possible, all analyses were done using specially designed computer programs (see Appendix 1.1). However, many of the essays were not initially input into the computer by the students themselves. Most experimental data was generated by timed essays performed during class time. It was not feasible for the students to input these directly into the computer in the classroom, and so they were handwritten and input by me at a later date. This introduced various opportunities for error to enter the data. I had to read the essays, interpret the handwriting and then input the text into the computer. There was scope for errors at several stages. Sometimes words or punctuation were difficult to interpret; for example, distinguishing between commas and full stops caused many problems. On other occasions, typing errors may have been introduced by me where no such error existed in the original. Another task involved essays produced as homework by learners in their own time (Chapter 5.3), and these were input into computer form by the students themselves and delivered by email. When students input their own essays, input errors become just one part of overall student performance error. A simple test of the reliability of transcription might have improved the methodology. By introducing an additional human checker, unclear words or punctuation marks could have been independently assessed. This may have helped minimize any impact of transcription errors.

Using computers for a timed essay currently raises another concern. When writing an essay by hand in a timed format, we do not usually worry about the physical writing process affecting the overall performance. In fact, we may even view the physical process of producing text as one facet of language skill, and it may only become an issue if a learner is suffering from some sort of physical handicap or injury. Producing language on a computer keyboard seems, at present, to be a different case. Typing skill may vary between learners independently of language skill. In Japan, first-year undergraduates have different experiences of using computers, especially when inputting English. This may change at some point in the not too distant future, and it is probable that typing essays in class time will become more of a required skill. Where learners use their own time to input essays, computer literacy seems less of a concern. However, it may still have some effect. For example, if learners who are poor at typing have to spend relatively more time inputting text, they may be more likely to write less.

All the programs used in this analysis were specially designed except where acknowledged (see Appendix 1.1). Programs to handle essay files and do simple analyses are relatively easy to construct. The analyses in the experiments from Chapters Three to Six were relatively straightforward to do on the computer. The two automatic algorithms in Chapter Seven were somewhat more complex. They both involved bringing together several analyses, and there were a number of stages that needed human input in these processes. One stage was the calculation of an estimate of lexical error. This estimate was derived by eliminating legitimate words by checking against a list of the most common words in English, a list of proper nouns, and a user list. Words which were not eliminated by these lists were checked by humans to further eliminate any more infrequent words and acceptable variants. Words eliminated at this stage were added to the user list of additional acceptable words or proper nouns so that over time the number of words needing to be checked would decrease. The words remaining after this check were deemed to be errors, and a count of these words was used as a lexical error estimate. Whether this estimate of error is an important enough element to warrant this human intervention is something that needs more investigation.
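The list-elimination step can be illustrated with a short Python sketch. The function name and the toy word lists below are hypothetical; in the actual procedure the remaining words went to a human checker rather than being counted as errors immediately.

def candidate_lexical_errors(essay_words, common_words, proper_nouns, user_list):
    # Eliminate any word found on one of the three lists; what is left is
    # passed to a human checker and, if confirmed, counted as lexical error.
    accepted = {w.lower() for w in common_words + proper_nouns + user_list}
    return [w for w in essay_words if w.lower() not in accepted]

common = ["the", "dog", "ran", "to", "park"]        # toy stand-in for a frequency list
remaining = candidate_lexical_errors(["The", "dogg", "ran", "to", "the", "park"], common, [], [])
print(remaining)                                    # ['dogg'] -> error estimate of 1 if confirmed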

In the clustering algorithm in Chapter Seven, there were another two stages that required human input. The first was part of the PCA. PCA depends on a complex matrix transformation called a singular value decomposition (SVD). This transformation is simple for very small matrices but soon becomes very complex for larger matrices. In this experiment, it was

necessary to find the SVD of an 11 by 11 matrix. This stage could be automated but was beyond my computer programming ability. Therefore, I had to bridge this stage by using external specialized software for calculating an SVD of a large matrix (Bluebit, 2003).
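With a numerical library this stage is straightforward to automate. The following NumPy sketch uses placeholder random data rather than the eleven features of the actual experiment, and is an illustration rather than the Bluebit tool used in the study: it decomposes an 11 by 11 correlation matrix and projects the essays onto principal components.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 11))            # placeholder: 50 essays x 11 features
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each feature
R = np.corrcoef(X, rowvar=False)             # 11 x 11 correlation matrix

# SVD of the correlation matrix; the rows of Vt give the principal component
# directions and s their associated variances.
U, s, Vt = np.linalg.svd(R)
scores = X @ Vt.T                            # essays projected onto the PCs
print(s[:3])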

Another occasion where human input was needed was in the comparison of the PCA output variables and the input features. The input features were compared with the PCs produced by the PCA to find input features that approximate these PCs. This stage of the analysis was also done by hand, by studying graphs of the original features plotted against the PCs (Appendix 7.1) and seeing which feature had the most linear relationship with each PC. This could possibly be automated by using a regression model, but that would make the procedure considerably more complex, and I am not convinced that a program could make the decision as well as a human. This may lead to a loss of performance in the algorithm if it were fully automated.
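One simple way this step might be automated, sketched below, is to pick for each PC the input feature with the highest absolute Pearson correlation with that component's scores. This is a stand-in for the visual inspection (or the regression model mentioned above) and may well not reproduce the human decisions exactly.

import numpy as np

def match_features_to_pcs(features, pc_scores, names):
    # For each principal component, choose the input feature whose values
    # correlate most strongly (in absolute terms) with that component.
    picks = {}
    for j in range(pc_scores.shape[1]):
        r = [abs(np.corrcoef(features[:, i], pc_scores[:, j])[0, 1])
             for i in range(features.shape[1])]
        picks["PC%d" % (j + 1)] = names[int(np.argmax(r))]
    return picks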

Comparing the two algorithms in Chapter Seven, the Bayesian algorithm is more automated than the clustering one. In fact, if the lexical error estimate were taken out, the Bayesian analysis would be fully automated. This perhaps slightly balances out its inferior performance in classifying the essays in comparison with the clustering algorithm.

9.5 Limitations of the present study

There are quite a few limitations to this study. Some have been hinted at during the course of the experiments. I will address a few of the main ones. Firstly, I look at the danger of reading too much into the results of small scale experiments. Secondly, I look at limitations regarding the tasks and subjects in the experiments. Finally, I describe problems with the design of the rating structure in the classification experiments in Chapter Seven and consider how spelling errors could be handled more effectively.

The experiments in this study were mostly of an exploratory nature. An automated scoring system needs some way of assigning scores to essays. The experiments in Chapter Five were designed to explore whether certain variables held promise for this. They involved only a small number of learners. Given the limited nature of these studies, it is unwise to read too much into the results. Some of the results may not be true generally but only for the group in that particular study.

This study has been aimed at low level Japanese undergraduates all sharing the same L1 background. Therefore, it would be unwise to extrapolate the results to other proficiencies of

learners or to learners from other L1 backgrounds. It is quite likely that features that help identify proficiency in Japanese learners may not work for speakers of other L1s. One example might be the patterns of usage of the items a and the analyzed in Chapter 5.3, which seemed to vary clearly between Japanese learners and native speakers. One reason for this may be the lack of such articles in the Japanese L1. Learners with other L1s may well vary considerably in their patterns of usage of these articles. Further research on all aspects of automation may be necessary before use with learners from other L1 backgrounds.

Another potential weakness is in the tasks. There were several reasons for the initial choice of the cartoon tasks (Appendices 3.1 and 3.2). One was to provide an appealing task that was simple to understand for learners of varying proficiency. Most learners find the cartoons amusing and are keen to write a story about them. Also, the tasks are pure language tasks as they do not require any content input from the learners. Another reason for choosing this type of task was task standardization. If more researchers use standardized tasks, results from different studies can be more easily compared. This type of cartoon task has been used in several other studies, such as Meara & Olmos Alcoy (2008) and Daller, van Hout & Treffers-Daller (2003).

One possible weakness of the cartoon tasks emerged during the course of the experiments. There is some evidence that they are restrictive in the range of vocabulary they elicit. The tasks tend to produce very similar answers which include a lot of the same vocabulary. Much of this vocabulary is fixed by the context. This consideration may also impact on some measures like uniqueness. Using uniqueness to measure proficiency rests on an assumption that more proficient learners will use vocabulary that less proficient learners do not. However, a task requiring precise vocabulary does not necessarily engender uniqueness. Correct choice of vocabulary is important, but inappropriate use of vocabulary may inadvertently boost uniqueness scores. The vocabulary may also be restricted by frequency band. In particular, there is little evidence of vocabulary being used from lower frequency bands. This may have accounted for the extremely low values of LFP and P_Lex in Chapter 5.5. Learners completed these tasks using mostly words from within the 1000 most frequent words of English. It is not clear whether this usage reflected the learners' limited vocabulary or whether the task could be completed satisfactorily using low level vocabulary. However, this same kind of task was used with considerable success to elicit spoken output in German and Turkish in Daller, van Hout & Treffers-Daller (2003).

It is also worthwhile considering the subjects in the experiments. They were on the whole low

level learners studying at a Japanese university. Some were English majors and some were non-language majors. Some of the English majors were first-year students and some were third-year students. First-year students usually exhibit a range of abilities reflecting their different backgrounds before university. English majors usually show improvement, particularly on productive tasks, by the third year of their study. Standardized language test scores are usually not available for these students, but those that do take the TOEIC test usually score between 250 and 500 as first-year students and between 350 and 700 as third-year students. When investigating assessment involving finding quality differences, it is useful to have essays reflecting differing proficiencies. I sometimes felt that some of the essay sets composed of only first-year essays, while displaying some noticeable quality differences, had ranges that were often not very wide, with many essays of similar quality. This made rating difficult at times. When rating, it was relatively easy to isolate good essays and poor essays but more difficult to discriminate between a large number of medium quality essays. All raters involved with rating the essay sets containing only first-year learners reported similar problems.

Essay sets involving both first-year and third-year essays offered a chance to collect essays from learners with a wider range of proficiencies. In addition, one study compared essays from native speakers and learners. In this case, although there was less control over the essay task, there was a clear difference in proficiency that could easily be investigated without recourse to human rating. An obvious concern in using comparisons between learners and natives is whether the results can be interpreted as being relevant to inter-learner comparisons.

Another weakness is in the experimental design used in Chapter Seven. A test format was chosen that reflected real-life placement situations, which are often about fitting bodies into classes rather than finding more meaningful strata of proficiency. It was also a simpler rating procedure. From a rating perspective, it takes less rater training to discriminate between the best fifty candidates and the rest than to split them into proficiency bands, which necessitates training to achieve consistency of rating between judges. A downside of this is that rater reliability may be affected considerably by borderline cases. Judges are likely to agree on very strong and very weak candidates but more likely to disagree about average ones. The selection of only two groups for placement was, on reflection, a poor choice. If we assume that essay quality follows a normal distribution, then splitting into two groups splits the data at the point where there is a maximum number of learners at the borderline level. This is likely to undermine inter-rater reliability. As an illustration, consider a set of 100 learners who belong to three proficiency levels: thirty of high proficiency, thirty of low proficiency, and the remaining forty of middling

proficiency. Imagine our expert raters can perfectly recognize the thirty high level and thirty low level learners but cannot distinguish the forty middling learners, so they assign twenty to each group. Even if the agreement on the sixty high and low level learners is 100%, the random assignment of the others would lead to an expected agreement of just 80% for the whole 100 learners. On average, the raters would agree on half of the twenty learners to be allocated to each group, leaving forty out of the fifty in each group agreed on. If we adjust this agreement measure for agreement by chance, we get a Kappa reliability estimate of 0.6. If learners were split into three groups, this would probably fit the spread of quality more naturally if it is distributed normally.
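The 0.6 figure follows from the standard Cohen's kappa formula, with observed agreement p_o = 0.80 and chance agreement p_e = 0.50 for a 50/50 split:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.80 - 0.50}{1 - 0.50} = 0.6 \]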

The handling of spelling errors warrants further consideration. Spelling errors can be a useful resource for automatic assessment but may also be a problem to deal with. They are useful because they are one relatively easily identifiable form of error. Error is likely to influence human raters, as found by Engber (1995). Occurrence of spelling errors may be an indicator of error more generally. However, spelling errors can also be problematic. Automatic packages are often ambiguous about how spelling errors should be handled. In fact, whether errors are corrected or not, they may cause statistics to be unreliable. Correction may reward the learner to the same degree as if the word had been properly spelled. On the other hand, non-correction could lead to the learner not getting credit for partial word knowledge, for example, when comparing words to word lists. In other cases, such as the calculation of distinctiveness, non-correction may mean that the learner gets credit for using a rare word when the learner may actually have misspelt a common word. The P_Lex package handles spelling errors in the user interface, but this could become cumbersome if used with many essays containing multiple errors. One way to ease this burden is to keep a record of errors in the program and so reduce interface time in subsequent uses of the program. One way to approach the correction debate might be to correct spelling errors, so as to reward even partial knowledge of a word, but also record them so that an error statistic becomes part of the assessment system. Spelling errors could be corrected by using a spellchecker module as part of the assessment package.
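A minimal sketch of this correct-and-record idea, using the standard library's approximate matching rather than a full spellchecker module, might look as follows; the cutoff value and the toy word list are illustrative assumptions.

import difflib

def correct_and_record(tokens, dictionary):
    # Correct likely misspellings against a known word list while keeping an
    # error count, so partial word knowledge is credited but an error
    # statistic remains available to the assessment model.
    corrected, errors = [], 0
    vocab = sorted(dictionary)
    for tok in tokens:
        if tok.lower() in dictionary:
            corrected.append(tok)
            continue
        match = difflib.get_close_matches(tok.lower(), vocab, n=1, cutoff=0.8)
        if match:
            corrected.append(match[0])
            errors += 1                 # corrected, but recorded as an error
        else:
            corrected.append(tok)       # left unchanged; may be flagged later
    return corrected, errors

print(correct_and_record(["the", "dogg", "ran"], {"the", "dog", "ran"}))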

9.6 Future research

I believe this thesis has only scratched the surface of the automatic assessment question. There is plenty more research to be done in all aspects of the study. I will focus on the choice of system and the choice of variables.

9.6.1 Choice of system

In Chapter Seven, two approaches were evaluated: a clustering algorithm and a Bayesian algorithm. However, there seem to be countless methods in the fields of classification and pattern recognition that could be adapted to this problem. One way of looking at possible methods is to split them into supervised methods and unsupervised methods. Supervised methods involve presenting a system with guidelines on how to do a classification, usually in the form of some kind of training set. Bayesian classification is a supervised method since it depends on sets of training data that the classifier uses to predict the quality of each of the other essays. Cluster analysis is an unsupervised method since it does not use a training set. However, it is not truly unsupervised in this study since initial clustering points were input. A truly unsupervised method would find optimal clusters itself, as in the study by Jarvis et al. (2003). It might be interesting to compare the supervised version of clustering used in this study to a truly unsupervised method to see if initial clustering points are actually necessary. There are various other classification and pattern recognition algorithms of both supervised and unsupervised types that could be used with this kind of work. One possible system is a neural network. Meara, Rodgers & Jacobs (2000) proposed using a neural network system in conjunction with lexical signatures to classify essays according to quality. They found that a neural network system could be trained to distinguish essays by quality using lexical signatures. Unfortunately, they did not continue by taking the system to the crucial stage of using the neural network to predict the quality of essays outside the training set. One drawback with neural networks is that they usually require quite extensive training sets if they are to work accurately. Nevertheless, they are one possible area of research.
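The supervised/unsupervised distinction can be illustrated with a short scikit-learn sketch on placeholder feature vectors. This is not the Chapter Seven implementation (which used semantic content for the Bayesian classifier and specified initial clustering points); it simply contrasts a classifier trained on labelled essays with a clustering method that finds its own groups.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))              # placeholder: 100 essays x 3 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # stand-in quality labels

# Supervised: train on a labelled subset, then predict the remaining essays.
clf = GaussianNB().fit(X[:30], y[:30])
pred_supervised = clf.predict(X[30:])

# Unsupervised: no labels; the algorithm partitions the essays itself.
pred_unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(pred_supervised[:10], pred_unsupervised[:10])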

Not only are there other methods that could be used but there are also other ways of implementing these methods. Even continuing with cluster analysis or Bayesian analysis, there are various ways these can be developed. In the case of clustering there are various clustering algorithms and ways of measuring distance. There is also a more basic way in which the two methods varied. The clustering algorithm was based on lexical features whereas the Bayesian classifier depended on semantic content. It would be possible to design a clustering algorithm using a semantic approach and a Bayesian algorithm with the lexical features. An alternative would be to use a combination of both lexical statistical information and semantic information. In fact, this is the approach that would seem to make most use of the data.

These kinds of methods often work best after a period of trial and error helps establish the ideal specifications. As one uses them more and more, one learns how to get the best performance out

of them. This was very evident with the work using Bayesian classifiers. I encountered various problems trying to get output in a form that I could use. One particular problem was that the Bayesian probabilities of belonging to a particular group often came out as extreme values of 0 or 1. This was problematic because I wanted to allocate the essays to quality groups progressively. I eventually solved this problem and several others with the help of some data transformations from Rennie et al. (2003) (see Appendix 7.4 for more details of these transformations).

A further area of research is to take automatic rating away from regarding the human judge as a gold standard of rating. Most automatic scoring methods depend on comparison with expert judges. However, this approach can be problematic. As has been noted, human rater scores are often very unreliable and so systems that mimic human raters depend on a flawed standard. By using multiple judges, the reliability of humans can be improved but other problems remain. One problem with judges is reflected in the high correlation of human ratings with essay length even though length is rarely mentioned in any scoring rubric. Up to now, the main role of automatic assessment has been in terms of convenience. Automatic assessment can be quicker, cheaper and more accurate than small numbers of human raters. However, the next challenge for automatic assessment could be to improve the rating process itself. One appeal of automatic scoring techniques is that they offer the opportunity to move beyond simply predicting the often-flawed impressions of human judges to recognizing elements of writing that can account for quality of writing and can be applied in a more consistent and reliable manner.

9.6.2 Selection of variables

Another area that needs to be explored in more detail is the features that can be used in these models. In this thesis, I was largely influenced in my choice of features by the fields of authorship attribution and vocabulary acquisition. Many statistics developed in the field of authorship attribution are useful but were not actually developed with assessment in mind, so they may not be the most appropriate measures. The appeal of this field is that a lot more research there has involved the use of complex statistical methods than in the field of vocabulary acquisition and applied linguistics in general.

I hope that these experiments have suggested that the use of a selection of features is often better than a single approach. Single features like the length of an essay or lexical diversity may be valuable, and they may show a correlation with quality, but if they do, this correlation is likely to be limited, reflecting the complex judgments behind quality rating. It is unlikely that one

192 measure can encapsulate all the nuances of quality.

Experiment 5.4 hinted that a strong correlation between a feature and essay quality may not always be essential. A standard correlation coefficient measures the strength of a linear relationship between features and quality. However, by being open to other forms of relationship, we may be able to identify features that can help without exhibiting a strong linear relationship. One such case was highlighted in this study. Lexical variation and TTR were shown to have a weak overall relationship with essay quality but a strong one at the extremes. A high score or a low score for lexical variation or TTR was very likely to be associated with high or low quality respectively, whereas less extreme values were much weaker predictors of quality. This is a simple observation but one that makes these extreme values potentially very useful. Using these extreme values, it may be possible to reliably identify essays of high and low quality using a sampled TTR measure such as TTR(100), as was attempted in the Bayesian algorithm in Chapter Seven. This is comparable to a skilled human rater who can often flick through a sheaf of essays and pull out ones that can be confidently rated as high or low quality.
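
A minimal sketch of this 'extremes only' use of a feature is given below. The percentile cut-offs are illustrative assumptions, not the thresholds used in Chapter Seven; essays in the middle of the range are simply left unclassified.

import numpy as np

def flag_extremes(ttr100, low_pct=10, high_pct=90):
    # Return 'high', 'low' or None for each essay's TTR(100) value, trusting
    # the feature only at the two ends of its distribution.
    lo, hi = np.percentile(ttr100, [low_pct, high_pct])
    return ["high" if v >= hi else "low" if v <= lo else None for v in ttr100]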

When an array of features is used, a decision has to be made as to which features to include. One consideration may be a trade-off between accuracy and validity. To find a method that produces the most accurate results, it may be best to let a model decide for itself which features to choose. The model will do this by finding the features that produce the best statistical fit. The problem is that the selected features may then not represent a balanced model of quality, which will affect validity. This was the basis of Sheehan's (2001) criticism of the model used in e-rater: by allowing a computer program to select the variables according to a best fit approach, the result was that all features seemed like proxy measures for essay length. Page (1994) describes an alternative approach for Project Essay Grade where variables are selected beforehand to reflect a set of underlying qualities of writing. However, if features are selected with validity in mind, they may well not perform so well statistically.

What seems clear from this research is that there are many systems and features that could be useful in identifying the quality of learner essays. One major challenge is how to incorporate these systems and features into an automatic assessment system. Given the almost unlimited permutations of factors influencing test design, such as testing purpose, number of learners, proficiency level, type of decision to be made and so on, it seems quite unlikely that one single program will be sufficient. Perhaps a suite of programs that could be selected according to the parameters of each testing situation might be more appropriate. This suite could include some of the algorithms and features explored in this study but might also be augmented by some already established packages such as VOCD (Malvern et al., 2004), P_Lex (Meara & Bell, 2001) and RANGE (Heatley, Nation & Coxhead, 2002).

9.7 Conclusion This discussion chapter has confirmed that, through this study, not only has progress been made toward a flexible automatic assessment package, but a great deal has also been learned along the way about all parts of the assessment process.

It is worth remembering that although the features and assessment systems utilized in this study aim to predict essay quality, they are all quantitative in nature. These features and systems do not involve direct assessment of essay quality but rather predict quality ratings of humans by quantitative means. The quantitative elements considered involve counts, comparisons of occurrences or more complex numerical analyses. It is also worth remembering that high correlations of features with quality do not mean that the quality is engendered by that particular feature. Essay length has been shown to correlate highly with quality determined by human raters in a number of experiments in this study. However, this does not mean that essays were rated as high quality because they were long. Essays that were rated high quality often happened to be long. Especially in timed essay formats, essays written by high proficiency learners are likely to be high quality and long while those written by low proficiency learners are likely to be low quality and short.

A lot has been learned about the types of features that are best suited to include in an assessment model. In particular, a number of potential problems have been identified. When working with low level learners, it is important to use features that are sensitive enough to register differences. With external criteria, we need to ensure that the criteria are calibrated to reflect differences in the learners. If not, poor measurement is likely to add to reliability concerns. Reliability is also affected by the occurrence of features. Low frequency features are intrinsically less reliable. Where features involve counts, it is important to make the range of the counted class as broad as possible.

Although many features show a relationship with essay quality, no one feature is likely to account for quality by itself. Complementary use of features is important. In this respect, this study can add to general research in L2 vocabulary use. For example, in both Chapter Six and Chapter Seven, the complementary relationship of essay length and lexical diversity was highlighted. In Chapter Six, essay length and lexical diversity together seemed to account for quality better than either feature by itself. In the clustering algorithm in Chapter Seven, essay length and lexical diversity formed two of the major dimensions on which the clustering was conducted.

Despite obvious limitations with the scale of some of the studies, evidence about a range of features has been collected in this study. Their relative merits and limitations have been compared and their performance tracked on a variety of tasks. Some of the models used for essay assessment may also have applications in modeling vocabulary knowledge and use.

Both feature selection and system design depend on the automation process. Features need to be calculated automatically, which in some cases means making do with estimates or finding proxy measures. For example, simple estimates of lexical error and of advanced word types have been used. Lessons have also been learned about automating the process into a simple program or suite of programs.

Despite some key limitations in this study, there is still plenty of scope for further research. Some problems have been identified with some of the tasks and with the experimental design. However, the possibilities are almost limitless in terms of the range of features and statistical techniques that could be applied.

Chapter Ten: Conclusion

10.1 Introduction In this study, a case has been made for the automatic assessment of written productions of L2 learners. Firstly, some lexical features that can assist in identifying quality in L2 essays have been identified. Secondly, some simple systems to generate automatic assessments using some of these features have been deployed. The first three experimental chapters concentrated on features and the final three chapters on systems. In this conclusion, the major findings of the study are briefly summarised.

10.2 Uniqueness index The first experimental chapter looked at the case for using lexical signatures to assess essays. This was based on Meara, Jacobs & Rodgers' (2002) finding that L2 learners showed a high degree of unique lexical choice in their essays. They wondered whether a uniqueness index based on lexical signatures could be harnessed to aid assessment. In that chapter, a simple uniqueness index was constructed and its reliability checked over similar tasks written by the same learners. Although their finding of considerable individual lexical choice in learners was confirmed, this uniqueness index was found to have certain problems which may hamper its application for assessment purposes.

One problem is an unstable property of uniqueness which causes sudden changes in score with only slight changes in essays. Another problem is the complexity of the calculation which includes sampling. This complexity was found to detract from reliable measurement of the statistic. A third problem is that uniqueness is not linear in its relationship to quality. Both high quality and low quality essays may score highly on an index of uniqueness. A final problem is that lexical signatures utilize only a very small number of words in the essay. In these experiments, this number was sometimes as low as eight words but usually about 15-20. This may affect validity as well as performance. All these problems together probably contributed to the result that a uniqueness index score could not prove reliable over tasks by the same learners.

There may still be some hope for this kind of measure, and there are various avenues for future research. One would be to control reliability testing more strictly by testing over two productions of the same task. With distinctiveness, this was found to improve reliability considerably, and uniqueness is likely to be as task specific as distinctiveness. More control over the words used in the analysis may also help with reliability. A further avenue of research is to find a strategy to handle the non-linear aspect of uniqueness. Both high quality and low quality essays may score highly on uniqueness. Perhaps other features could be used alongside uniqueness as indicators of whether scores are at the high end or low end of the spectrum. Finally, the most interesting avenue might be to explore the co-occurrence properties of lexical signatures in more detail. For example, the kinds of words that work well together could warrant attention. This might also help in identifying effective words for the analysis.

10.3 Distinctiveness In the second experimental chapter, a more robust concept of distinctiveness was investigated as an alternative to the uniqueness-based approach of lexical signatures. Distinctiveness offers a solution to some of the problems found with lexical signatures and the concept of uniqueness. Firstly, distinctiveness has much better mathematical properties, which make it more stable. Secondly, it is a much easier statistic to calculate and does not involve any sampling concerns. Although, like uniqueness, it does not have a linear relationship with quality, there may be a way to solve this problem: it seems that high quality essays may be differentiated from low quality essays by appealing to a distinctiveness profile. Finally, distinctiveness depends on all the words in an essay, which seems to make it a more stable and valid indicator.

Despite these technical improvements, distinctiveness did not show much more reliability over different tasks than the index of uniqueness. However, distinctiveness proved to be relatively reliable over identical tasks, which suggests it is very task specific. Despite gaining stability from using all the words in an essay, distinctiveness was prey to another weakness. Because unique words and relatively rare words earn maximum reward in a distinctiveness score, a high proportion of the score depends on rarely occurring words, and rare occurrences are by nature less reliable. For example, spelling errors earn the high credit intended for unique, sophisticated vocabulary. Even controlling for spelling errors by eliminating unique words did not seem to improve reliability; in fact, elimination of errors led to lower reliability. In addition, it was not easy to see how the discrimination of high quality and poor quality distinctiveness profiles could be achieved in practice. This is an aspect that could be developed.

Although both the lexical signature approach in Chapter Three and the distinctiveness statistic in Chapter Four had considerable technical problems, my feeling is that the underlying problem is that a unique/distinct dimension does not sufficiently synchronize with quality. The array of factors impacting on a unique/distinct dimension does not fit cleanly into a 'distinct is better' argument. For example, distinct lexical choices may sometimes be a result of the use of sophisticated vocabulary not available to learners with less knowledge. However, on other occasions, distinct choices may be the result of learners using error variants of less distinct words. Other considerations in lexical choice, such as appropriateness or precision in meaning, may not align well with this dimension. Lexical choice seems rather too complicated for a unique or distinct based dimension to give an adequate estimate of quality. My feeling is that if some of the technical problems can be overcome, then both of these measures may find a minor role in assessment.

10.4 Other features The next series of experiments involved considering lexical features that could be used to help assess L2 essays as part of a battery of measures. In these studies, some established measures such as LFP, P_Lex and various measures of lexical diversity were investigated; in particular, their reliability and their ability to predict quality in learner essays were considered. In addition to these established measures, some features that have been used in authorship attribution but have not been picked up widely in L2 vocabulary studies were also considered. These included word distributional features such as the proportion of hapax legomena and mean word length. Statistics such as Yule's K and entropy, which are widely used in authorship attribution but commonly thought to be inapplicable to L2 essays, were also included. The results were highly speculative but gave some ideas to take into the multi-feature assessment models in Chapters Six and Seven.

For example, mean word length and mean sentence length seemed to show some correlation with assessments of learner essays. Yule’s K and entropy also correlated with differences in essay quality. The results were not as strong as Malvern & Richards’ D estimate but they showed a lot more stability than might be expected for such short essays. Given that the D estimate is difficult to incorporate into assessment systems because of its complexity of calculation, these may be considered as alternatives. The number of hapax legomena was shown to discriminate between learner essays and native written samples. Measures based on Zipf word ranks also showed potential for discriminating between native and learner essays. In another experiment, lexical variation and TTR were shown to be surprisingly robust predictors of proficiency on some basic learner essays. Problems with essay length were minimized because all essays were quite short. In addition, lexical variation seemed to be a better indicator of poor quality essays than TTR. A final experiment suggested that LFP and P_Lex need to be used with care with low level learners. Because they both assume that essays contain vocabulary from a range of frequency levels, there are dangers with very low level learners and tasks which only produce vocabulary from the first 1000 words.

10.5 Essay length and lexical diversity The final three experimental chapters looked at systems rather than features. Chapter Six looked at a simple model of essay assessment using two dimensions of quantity and content. In this model, quantity was measured by essay length and content by a measure of lexical diversity. Various measures of lexical diversity were considered and compared. In the analysis, quantity was given an equal weight to content by standardizing the values on each dimension. The lexical diversity measures included were TTR(100), Guiraud Index, Yule's K, Hapax(100), Malvern & Richards' D estimate and an estimate of Advanced Guiraud. Graphs suggested that quantity was the dominant dimension with this set of essays in terms of predicting essay quality. Of the content measures, Guiraud Index and Advanced Guiraud seemed to perform best in combination with quantity in isolating some very good essays and very poor essays at the two ends of the spectrum.

Regression analysis confirmed essay length as the dominant feature, accounting for over 61% of the variance. The lexical diversity measures only accounted for a further 4%, with each measure making a similar contribution. Guiraud Index and Advanced Guiraud both showed high correlations with essay quality, but they also showed strong correlations with essay length, which suggests that the association with essay length may account for the association with essay quality. There was a very high correlation between the D estimate and Yule's K, which suggests that these may measure a similar aspect of essays. After controlling for essay length, TTR(100) and Advanced Guiraud showed the strongest partial correlations with essay quality. Whereas Guiraud Index lost most of its association with essay quality once essay length was factored out, Advanced Guiraud maintained a high correlation with quality, which confirms it as a potentially useful extrinsic measure of lexical diversity.
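
For reference, the partial correlations mentioned here can be obtained by regressing both the diversity measure and the quality ratings on essay length and correlating the residuals. The sketch below is a minimal illustration of that standard procedure, not the statistical package analysis reported in Chapter Six.

import numpy as np

def partial_corr(diversity, quality, length):
    # Correlation between diversity and quality with essay length partialled out.
    x, y, z = map(np.asarray, (diversity, quality, length))
    def residuals(v):
        slope, intercept = np.polyfit(z, v, 1)   # regress v on essay length
        return v - (slope * z + intercept)
    return np.corrcoef(residuals(x), residuals(y))[0, 1]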

An important finding of this experiment is that although correlation analyses show that lexical diversity is not an overall accurate predictor of quality, the essays that score at the extremes on both dimensions, i.e. long essays with high lexical diversity or short essays with low lexical diversity, are very accurate predictors of high quality and low quality essays respectively.

10.6 Clustering algorithm Chapter Seven looked at two rather more complex systems for assessing essays automatically. These systems were more complex in that they included multivariate techniques and involved different kinds of data in the analysis. In the first of two experiments, a clustering algorithm was used to classify one hundred essays as relatively good quality or relatively poor quality. These automatic assessments were compared with human holistic ratings. The clustering algorithm involved finding a small number of features that best accounted for variance in the essays. These features were found by means of a principal component analysis. This small number of features was then used to cluster the essays according to relative proximity. The features were weighted according to the amount of variance they accounted for in the data. The six features were found to be, in descending order of influence: essay length, TTR(100), mean sentence length, estimate of lexical error, mean word length and Hapax(100). The results of the clustering algorithm agreed with two human raters with agreement rates of 78% and 76% per rater, or Kappa values adjusted for chance agreement of 0.56 and 0.52 respectively. This was higher than the agreement between the two human raters of 72% (Kappa = 0.44).
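
The chance-adjusted agreement figures above are Cohen's kappa. A minimal sketch of the calculation is given below; note that when two raters each split 100 essays 50/50, expected chance agreement is 0.5, so kappa reduces to twice the observed agreement minus one (e.g. 78% agreement gives 0.56).

def cohens_kappa(labels_a, labels_b):
    # labels_a, labels_b: lists of category labels from two raters or systems.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    classes = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in classes)
    return (observed - expected) / (1 - expected)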

The advantage of this method is that different features can be included in the model depending on the essays. One problem with this method is that the algorithm is very complicated, and parts of it, such as matching principal components with features, are difficult to do automatically. Other aspects of the clustering algorithm, such as the clustering procedure itself, may warrant further research. In this case, clustering was controlled by setting initial cluster points. Further research might be directed at the patterns unsupervised clustering would produce.

10.7 Bayesian algorithm In the second of the two experiments in Chapter Seven, a Bayesian algorithm was used. This classified essays according to proximity to a training set of high quality essays. This proximity measure was based on semantic content and depended on the occurrence of individual words in the essays. The training set essays were drawn from within the set of essays themselves and predicted to be high quality. The identification of training essays was inspired by the finding of Chapter Six that essays scoring high on both essay length and lexical diversity tend to be reliably high quality. Accordingly, sixteen essays scoring high on essay length, two forms of lexical diversity, sample word types and sample hapax legomena, but low on error, were identified as high quality training samples. A further 34 essays that were deemed most similar to these training set essays by a Bayesian classifier were added to this group. The remaining essays formed a lower quality group.
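
The two-stage idea can be sketched as follows: once the training essays have been picked out by their extreme feature values, the remaining essays are ranked by how probable their words are under a smoothed unigram model of the training essays, and the most similar ones are added to the high quality group. This is an illustrative simplification in Python, not the transformed Bayesian classifier actually used in Chapter Seven; the tokenisation and the add-one smoothing are assumptions.

import math
from collections import Counter

def most_similar_to_training(essays, training_idx, top_n=34):
    tokenize = lambda t: t.lower().split()
    train_counts = Counter(w for i in training_idx for w in tokenize(essays[i]))
    vocab = set(w for e in essays for w in tokenize(e))
    total = sum(train_counts.values()) + len(vocab)   # add-one smoothing
    logp = {w: math.log((train_counts[w] + 1) / total) for w in vocab}
    def score(text):
        words = tokenize(text)
        # Average log probability per token under the training-set model.
        return sum(logp[w] for w in words) / max(len(words), 1)
    rest = [i for i in range(len(essays)) if i not in training_idx]
    return sorted(rest, key=lambda i: score(essays[i]), reverse=True)[:top_n]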

This Bayesian algorithm agreed with human ratings with an agreement rate of 70%, or a Kappa statistic of 0.40. This was slightly less than the agreement between the two raters. The Bayesian algorithm can be split into two parts: some essays are allocated by the training set selection and some by the Bayesian classifier. Results suggested that the training set selection portion was more reliable than the Bayesian classifier portion. There may have been too few essays in the training set for the Bayesian classifier to operate properly. This algorithm predicted high quality essays better than low quality essays. This may be a result of both the training set selection and the Bayesian classification focusing on high quality essays.

More research might focus on each part of the Bayesian algorithm process. For the training sample selection, the exact feature conditions that can identify exemplar training sample essays are important. For the Bayesian classifier part, the number of training essays required is unclear. Bayesian classifiers based on semantic information and on feature information could also be compared.

Since the two algorithms were used on the same essays, it was possible to check the relative performance of the algorithms, or inter-algorithm reliability. The two systems were found to agree at a rate of 78%, or a Kappa of 0.56. The influence of essay length on the algorithms was also checked. The clustering algorithm seemed to perform better than a model based on essay length alone in terms of agreement with human raters. However, essay length performed better than the Bayesian algorithm and also correlated better with raters than the raters did with each other. Both algorithms had a very high correlation with essay length, which suggests they may be overdependent on essay length.

The results of these experiments suggest that the clustering algorithm is superior in terms of performance. However, the Bayesian algorithm was superior in the sense that it is a completely automatic analysis that requires no human intervention, whereas the clustering algorithm still contains a few steps that are not fully automated. Complete automation of the clustering algorithm may be accompanied by a slight loss of reliability. However, I feel that these basic algorithms are just a taste of what is possible if statistical techniques are combined with different types of essay information such as lexical features and semantic content. For example, running these two different analyses and combining the results may improve agreement with human ratings further. The two techniques employed are just two of many possibilities, and they were applied in quite a crude manner; there are many more sophisticated methods that could be used.

10.8 Capture-recapture analysis Chapter Eight investigated capture-recapture as a candidate for essay assessment. This method is used in biology to estimate wild animal populations. It has been adapted to the estimation of learner vocabulary size by considering samples of written work. It might also be used to assess essay quality in some situations, if it is assumed that an essay that appears to be written by a learner with a large vocabulary is likely to also be a good essay. Despite various technical problems, this method was shown by means of a computer simulation to produce relatively reliable scores at written sample sizes of 120 words and up. The weakness of this method for assessment purposes is that a number of written samples are necessary. This may make it difficult to use for assessment of traditional one-off written tests but makes it an excellent alternative for continual assessment situations or where multiple writing samples are available.
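
The core of the approach is the Petersen-style estimate familiar from wildlife studies: word types in one sample are the 'marked' animals, and types shared with a second sample are the 'recaptures'. The sketch below shows only this basic estimator, with a naive whitespace tokenisation, and none of the corrections discussed in Chapter Eight.

def petersen_vocabulary_estimate(text1, text2):
    # text1, text2: two writing samples by the same learner.
    types1 = set(text1.lower().split())
    types2 = set(text2.lower().split())
    recaptured = len(types1 & types2)
    if recaptured == 0:
        return None  # estimate undefined when the samples share no words
    # N ~ n1 * n2 / m, with n1, n2 the types in each sample and m the overlap.
    return len(types1) * len(types2) / recaptured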

10.9 Conclusion This study highlights that automatic assessment is a realistic aim but there is no one feature or method that can account for essay quality. Systems that incorporate a variety of features may be more reliable. Moreover, essays vary according to learners, tasks, and test conditions. For each particular context, a different combination of features and system may be effective. Therefore systems that incorporate a range of features and methodologies may stand the best chance of achieving reliable prediction of essay quality.

This study only represents a tiny fraction of the features and techniques that could be used and I believe there is a lot more research that can be done. Above all, I believe these experiments demonstrate that the key to reliable automatic essay scoring is the accumulation of information including lexical features and semantic content and thorough analysis using a variety of statistical techniques.

Bibliography

Arnaud, P.J.L. (1984). The lexical richness of L2 written productions and the validity of vocabulary tests. In T. Culhane, C. Klein-Braley, & D.K. Stevenson (eds), Practices and problems in language testing (pp. 14-28). Essex: University of Essex.
Borland Software Corporation (1983). Borland Delphi Personal version 6 (computer program).
Bluebit (2003). Online matrix calculator. Retrieved from http://www.bluebit.gr/matrix-calculator/.
Burnham, K.P., & Overton, W.S. (1978). Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika, 65 (3), 625-33.
Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating e-rater's performance on TOEFL essays. TOEFL Research Reports, 73.
Crossley, S., Salsbury, T., McCarthy, P., & McNamara, D. (2008). Using Latent Semantic Analysis to explore second language lexical development. In D. Wilson, & G. Sutcliffe (eds), Proceedings of the Twenty-first International Florida Artificial Intelligence Research Society Conference (pp. 136-141). Menlo Park, CA: The AAAI Press.
Cumming, A., Kantor, R., & Powers, D.E. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. The Modern Language Journal, 86, 67-96.
Daller, H., Milton, J., & Treffers-Daller, J. (2007). Modelling and assessing vocabulary knowledge. Cambridge: Cambridge University Press.
Daller, H., van Hout, R., & Treffers-Daller, J. (2003). Lexical richness in spontaneous speech of bilinguals. Applied Linguistics, 24 (2), 197-222.
Engber, C.A. (1995). The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing, 4 (2), 139-155.
Evola, J., Mamer, E., & Lentz, B. (1980). Discrete point versus global scoring for cohesive devices. In J.W. Oller & K. Perkins (eds), Research in language testing (pp. 177-181). Rowley: Newbury House.
Ferris, D.R. (1994). Lexical and syntactic features of ESL writing by students at different levels of L2 proficiency. TESOL Quarterly, 28 (2), 414-421.
Gaies, S.J. (1980). T-unit analysis in second language research: Applications, problems and limitations. TESOL Quarterly, 14, 53-60.
Goodfellow, R., Lamy, N., & Jones, G. (2002). Assessing learners' writing using lexical frequency. ReCALL, 14 (1), 133-145.
Guiraud, P. (1960). Problèmes et méthodes de la statistique linguistique. Dordrecht: D. Reidel.

Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguists. New York: Newbury House.
Heatley, A., Nation, I.S.P., & Coxhead, A. (2002). RANGE programs. Retrieved from http://www.vuw.ac.nz/lals/staff/Paul_Nation.
Herrington, A., & Moran, C. (2001). What happens when machines read our students' writing? College English, 63 (4), 480-499.
Holmes, D. (1994). Authorship attribution. Computers and the Humanities, 28, 87-106.
Hunt, K.W. (1965). Grammatical structures written at three grade levels. Research Report No. 3. Urbana, Illinois: National Council of Teachers of English.
Ishikawa, S., Uemura, T., Kaneda, M., Shimizu, S., Sugimori, N., & Tono, Y. (2003). JACET 8000: JACET list of 8000 basic words. Tokyo: JACET.
Jacobs, H.L., Zingraf, S.A., Wormuth, D.R., Hartfiel, V.F., & Hughey, J.B. (1981). Testing ESL composition: A practical approach. Rowley, MA: Newbury House.
Jarvis, S., Grant, L., Bikowski, D., & Ferris, D. (2003). Exploring multiple profiles of highly rated learner compositions. Journal of Second Language Writing, 12, 377-403.
Jolliffe, I.T. (2002). Principal component analysis. New York: Springer Verlag.
Landauer, T.K., & Dumais, S.T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104, 211-240.
Landauer, T., Laham, D., & Foltz, P. (2003). Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. Shermis, & J. Burstein (eds), Automated essay scoring: A cross-disciplinary approach (pp. 87-112). New York: Routledge.
Larkey, L.S. (1998). Automatic essay grading using text categorization techniques. Proceedings of the 21st Annual International SIGIR Conference on Research and Development in Information Retrieval, 90-95.
Larsen-Freeman, D., & Strom, V. (1977). The construction of a second language acquisition index of development. Language Learning, 27 (1), 123-134.
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16, 307-22.
Laufer, B., & Nation, P. (1999). A vocabulary-size test of controlled productive ability. Language Testing, 16 (1), 33-51.
Lee, Y., Gentile, C., & Kantor, R. (2008). Analytic scoring of TOEFL CBT essays: Scores from humans and e-rater. ETS TOEFL Research Reports, 81.
Leech, G., Rayson, P., & Wilson, A. (2001). Word frequencies in written and spoken English. Harlow: Longman.

Linnarud, M. (1975). Lexis in free production: An analysis of the lexical texture of Swedish students' written work. University of Lund, Department of English: Swedish-English Contrastive Studies, Report No. 6.
Malvern, D.D., & Richards, B.J. (1997). A new measure of lexical diversity. In A. Ryan, & A. Wray (eds), Evolving models of language (pp. 58-71). Clevedon: Multilingual Matters.
Malvern, D.D., Richards, B.J., Chipere, N., & Durán, P. (2004). Lexical diversity and language development. New York: Palgrave Macmillan.
Marcus, G.F., Pinker, S., Ullman, M., Hollander, M., Rosen, T.J., & Xu, F. (1992). Overregularization in language acquisition. Monographs of the Society for Research in Child Development, 57.
McCarthy, P.M., & Jarvis, S. (2007). vocd: A theoretical and empirical evaluation. Language Testing, 24 (4), 459-488.
McNamara, D.S., Louwerse, M.M., Cai, Z., & Graesser, A. (2005). Coh-Metrix version 2.0. Retrieved from http://cohmetrix.memphis.edu.
McNeill, B.R. (2006). A comparative statistical assessment of different types of writing by Japanese EFL college students. Unpublished PhD thesis, University of Birmingham, UK.
Meara, P.M. (1996). The dimensions of lexical competence. In G. Brown, K. Malmkjaer & J. Williams (eds), Performance and competence in second language acquisition (pp. 35-53). Cambridge: Cambridge University Press.
Meara, P.M. (2005). Lexical frequency profiles: A Monte Carlo analysis. Applied Linguistics, 26 (1), 32-47.
Meara, P.M. (2007). P_Lex ver2.0 (computer program). Swansea: Lognostics.
Meara, P.M., & Bell, H. (2001). P_Lex: A simple and effective way of describing the lexical characteristics of short L2 texts. Prospect, 16 (3), 5-19.
Meara, P.M., & Fitzpatrick, T. (2000). Lex30: An improved method of assessing productive vocabulary in an L2. System, 28 (1), 19-30.
Meara, P.M., Jacobs, G., & Rodgers, C. (2002). Lexical signatures in foreign language free form texts. ITL Review of Applied Linguistics, 135-136, 85-96.
Meara, P.M., & Jones, G. (1990). The Eurocentres Vocabulary Size Test. 10KA. Zurich: Eurocentres.
Meara, P.M., & Miralpeix, I. (2007a). D_Tools ver2.0 (computer program). Swansea: Lognostics.
Meara, P.M., & Miralpeix, I. (2007b). Vocabulary size estimator (computer program). Swansea: Lognostics.

Meara, P.M., & Olmos Alcoy, J.C. (2008). Words as species: An alternative approach to estimating productive vocabulary size. Retrieved from http://www.lognostics.co.uk/vlibrary/index.htm.
Meara, P.M., Rodgers, C., & Jacobs, G. (2000). Computational assessment of texts written by L2 speakers. System, 28, 345-354.
Mosteller, F., & Wallace, D.L. (1984). Applied Bayesian and classical inference: The case of the Federalist Papers. New York: Springer-Verlag.
Muncie, J. (2002). Process writing and vocabulary development: Comparing Lexical Frequency Profiles across drafts. System, 30, 225-235.
Nation, I.S.P. (1990). Teaching and learning vocabulary. Rowley: Newbury House.
Nation, P. (2007). Fundamental issues in modeling and assessing vocabulary knowledge. In H. Daller, J. Milton & J. Treffers-Daller (eds), Modelling and assessing vocabulary knowledge (pp. 35-43). Cambridge: Cambridge University Press.
Nihalani, N. (1981). The quest for the L2 index of development. RELC Journal, 12 (2), 50-56.
Oser, E. (1934). Vater und Sohn. Konstanz: Südverlag.
Page, E.B. (1994). Computer grading of student prose, using modern concepts and software. Journal of Experimental Education, 62 (2), 127-142.
PASW Statistics 18 (2009). Release 18.0.0 (computer program).
Perkins, K. (1980). Using objective methods of attained writing proficiency to discriminate among holistic evaluations. TESOL Quarterly, 14 (1), 61-69.
Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
Rennie, J.D.M., Shih, L., Teevan, J., & Karger, D.R. (2003). Tackling the poor assumptions of naive Bayes text classifiers. Proceedings of the Twentieth International Conference on Machine Learning, Washington D.C., 2003, 616-623.
Seber, G.A.F. (1982). The estimation of animal abundance and related parameters. Griffin.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27 (4), 379-423.
Sheehan, K. (2001). Discrepancies in human and computer generated essay scores for TOEFL-CBT essays. Unpublished manuscript.
Sichel, H.S. (1986). Word frequency distributions and type-token characteristics. Mathematical Scientist, 11, 45-72.
Sinclair, J. (1991). Corpus concordance collocation. Oxford: Oxford University Press.
Smith, R. (2004). The Lexical Frequency Profile: Problems and uses. JALT2004 at Nara conference proceedings, 439-451.

Ure, J. (1971). Lexical density and register differentiation. In J.E. Perren & J.L.M. Trim (eds), Applications of Linguistics (pp. 443-452). Cambridge: Cambridge University Press.
Wesche, M.B., & Paribakht, T. (1996). Assessing second language vocabulary. The Canadian Modern Language Review, 53, 13-41.
Yu, G. (2007). Lexical diversity in MELAB writing and speaking task performances. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 5, 79-116.
Yule, G.U. (1944). The statistical study of literary vocabulary. Cambridge: Cambridge University Press.
Zipf, G.K. (1932). Selected studies of the principle of relative frequency in language. Cambridge, MA: Harvard University Press.

Appendices

1.1 A list of specially-designed computer programs

L-unique
Functions: This program calculates the uniqueness index and the distinctiveness statistic for each essay in a set of essays. This program also allows the user to check the frequency of lexical items in the essay corpus as a whole as well as in each essay. It can produce a profile of usage for a particular lexical item across each essay in the corpus. The distinctiveness statistic adjusted so as not to include unique words is also returned.
Inputs: Up to 100 essays can be input simultaneously as text files. Once an initial analysis is complete, the user can select the lexical items to use in the uniqueness index analysis according to frequency in the corpus of essays.
Outputs: Uniqueness index values for each essay and various distinctiveness statistics for each essay are saved to text files.

L-analyzer
Functions: This program calculates a range of lexical statistics for each essay in a set of essays. These statistics include the number of word tokens, the number of word types, TTR, lexical variation, the number of hapax legomena, mean word length, mean sentence length, Yule's K, entropy, the number of words in the JACET 1000 word list, as well as counts of various word types such as a and the, and punctuation marks. It also produces TTR(n) and Hapax(n), the number of different words and the number of hapax legomena for a word sample of a specified size, n, for each essay (the default sample size is 100 words).
Inputs: Up to 100 essays can be input simultaneously in text file form.
Outputs: Each count or statistic for all essays is returned in a separate text file.
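
Several of these statistics are straightforward to compute from a token frequency list. The sketch below is an illustrative re-implementation in Python of a few of them (the programs in this appendix were written in Delphi); it assumes a plain whitespace tokenisation, so its values will not match L-analyzer's exactly.

import math
from collections import Counter

def lexical_stats(text):
    tokens = text.lower().split()
    n = len(tokens)
    freqs = Counter(tokens)                      # type -> frequency
    hapax = sum(1 for f in freqs.values() if f == 1)
    # Yule's K: 10^4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the number
    # of types occurring exactly i times and N is the number of tokens.
    spectrum = Counter(freqs.values())
    yules_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n) / (n * n)
    # Entropy of the word distribution in bits.
    entropy = -sum((f / n) * math.log2(f / n) for f in freqs.values())
    return {"tokens": n, "types": len(freqs), "ttr": len(freqs) / n,
            "hapax": hapax, "yules_k": yules_k, "entropy": entropy}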

L-distribution
Functions: This program calculates a range of word distributions for each essay in a set of essays. These include word frequency distributions, ranked word distributions, word length distributions and TTR curve distributions for each essay. The program also returns the most commonly occurring n words in an essay in rank order (n can be specified by the user, with a default of n=5).
Inputs: Up to 100 essays can be input simultaneously as text files.
Outputs: Word frequency distributions, ranked word distributions, word length distributions, TTR curve distributions and the most commonly occurring words for each essay are saved to separate text files.

L-cluster
Functions: This program splits a set of 100 essays into two sets of 50 predicted higher quality essays and 50 predicted lower quality essays using a clustering algorithm based on a number of essay characteristics.
Inputs: Lexical statistical information on each essay in text file form.
Outputs: Two text files are returned, one identifying essays in the predicted higher quality group and the other identifying essays in the predicted lower quality group.

L-Bayes
Functions: This program classifies a set of 100 essays into two sets of 50 predicted higher quality essays and 50 predicted lower quality essays using a Bayesian algorithm which uses the occurrence of all words in the essay corpus.
Inputs: One hundred essays in text file form.
Outputs: Two text files are returned, one identifying essays in the predicted higher quality group and the other identifying essays in the predicted lower quality group.

L-capture
Functions: This program performs a capture-recapture analysis to estimate vocabulary size based on two essays written by the same learner.
Inputs: Two sets of essays in text file form. Up to 100 essays can be input for each set.
Outputs: Capture-recapture estimates of vocabulary size based on each essay are returned in text file form.

L-capsim
Functions: This program produces S multiple capture-recapture population estimates for a given population size N, using two samples of sizes s1 and s2. Up to 10000 estimates can be produced.
Inputs: The number of iterations (S), the population size (N) and the two sample sizes (s1 and s2).
Outputs: The multiple estimates of population size are saved to a text file.
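
The kind of simulation L-capsim performs can be sketched as follows: draw two random samples of sizes s1 and s2 from a population of size N, apply the Petersen estimate to each pair, and repeat S times. The sketch below is an illustrative Python version that assumes equal capture probabilities for all items, which Chapter Eight notes is an oversimplification for words.

import random

def simulate_estimates(N, s1, s2, S=1000, seed=0):
    rng = random.Random(seed)
    population = range(N)
    estimates = []
    for _ in range(S):
        a = set(rng.sample(population, s1))
        b = set(rng.sample(population, s2))
        m = len(a & b)                     # number of 'recaptured' items
        if m > 0:
            estimates.append(s1 * s2 / m)  # Petersen estimate of N
    return estimates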

3.1 Cartoon Task 1

(Oser, 1934)

Write a story using the pictures. You have 25 minutes to write as much as you can. You should write at least 100 words.

3.2 Cartoon Task 2

(Oser, 1934)

Write a story using the pictures. You have 25 minutes to write as much as you can. You should write at least 100 words.

3.3 Sample essays for Cartoon Task 1 Three sample essays are presented. Essay A is relatively long, Essay B is relatively short and Essay C is of about average length.

Essay A (230 words) There are two people in the picture. The man is father “Bob” and this girl playing with a dog is Bob's daughter “Lucy”. They are very close. They walk with dog to sea near their house. Today is very sunny day. Dog's name is “Lucky”. Lucky likes take bring peace of wood, or ball or something. Lucy threw peace of wood. Lucky bring this wood. Lucky is very fun this game. Lucy say to Lucky good job good boy. Lucky thought that I want to bring something again. This is very fun! Bob threw his stick to sea. Lucky ran into the sea and take it came back soon. The man was wathing all. The man thought. Is that dog bring my stick? Let's try!! The man threw his stick into sea. Bob Lucky and Lucy just watch. Lucky didn't take bring his stick. The man thought why that didn't take bring to me? They went back their home. Lucky didn't want to bring stranger's thing. Lucky want to bring his keeper's thing. Only Bob's and Lucy's. After all, that man put off his wear and take bring by himself. Today is sunny day but, now is not summer. Water is cold. That man had a catch cold and ferver. That man is very poor. This time Bob and Lucy and her dog spend very good time in their house.

Essay B (100 words) A boy and a dog are playing. And a man stand near them. “Catch this!” said boy, and he through the stick. Dog runs in the water and come back to take the stick to the man. A man come hear. “This dog is clever this stick through away and take to me. OK?” he said. “Catch this!” said him and he through the stick far away, but they going to opposite. “Is that take to me?” The man asked. But they are not answered. “I going to take the stick myself.” He said, and he took off him clothes.

Essay C (140 words) A boy throw his grandfather's stick and his pet dog ran into the sea for get the boy's grandfather's stick. The dog could get stick and come back from the sea. The other old man that seeing this situation throw his stick too. He thought, the dog get his stick too. But the dog didn't get into the sea. The dog get only his keeping owner's stick. So, the dog ignore the old man's action. So the old man had to into the sea for get his stick by himself. So he put out his shoes clothes and hat. It is funny. Pet is get stick for he wants to loved his keeping owner. The old man should keep a pet if he doesn't want to get his stick by hisself. I think the old man is poor a little.

3.4 Combinations The number of different combinations of r items selected from a population of n items, written $_nC_r$, is given by:

$$_nC_r = \frac{n!}{r!(n-r)!}$$

where $n! = n \times (n-1) \times (n-2) \times \cdots \times 3 \times 2 \times 1$

Example The number of different 6-word subsets possible from a population of 30 words is

$$_{30}C_6 = \frac{30!}{6!(30-6)!} = \frac{30 \times 29 \times 28 \times 27 \times 26 \times 25}{6 \times 5 \times 4 \times 3 \times 2} = 593775$$

Increasing the number of words to 50 or the subset size to 8 or 10 only makes the number of combinations even larger.

For example, when selecting six words from 50, there are an unbelievable number of combinations. If you selected a different subset every 10 seconds of every day (and night), it would take more than 5 years to get them all (15,890,700 different subsets). If you made it a 10 word subset from 50, it would take 81 years!! (but only if you speeded up and found a subset every second and got your spouse and two kids doing the same).
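
These counts are easy to verify directly, for instance with Python's math.comb (available from Python 3.8):

import math
print(math.comb(30, 6))    # 593775
print(math.comb(50, 6))    # 15890700
print(math.comb(50, 10))   # 10272278170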

3.5 Sample essays for Cartoon Task 2 Three sample essays are presented. Essay A is relatively long, Essay B is relatively short and Essay C is of about average length.

Essay A (250 words) I'm a Bob. This is my father, Gobline. And my dog, Kanehiko. We always walk together. Today, too. Now, Kanehiko is coming out his water from his body. "Oh, very long time.” I said. Still he was doing. "Very very long time.” I had a little anger while he was doing. Still he kept doing. My father, Gobline turned back and pulled a lope. We had a little anger that the same feeling. "Very very very long time!” I said as I was going ahead. Still he wasn't stopping! He didn't separate with the tree. I said "Does he like a tree??” I'm very angry. Because, I want to go home and eat a chocolate. So, I went to my house and came into my room. There were many tools there. In fact, I want to become artist in my future. I like to make something with tools. And I started to make a tree that Kanehiko loves with a board and four small tyres. I feel that my hammer jumped on steel pins. "Ton, ton, ton.” it danced with me. After I finished my great work, I turned to go to the place that Gobline and Kanehiko were. He was keeping to do still there. My father were tired and sat down around there. I gave him to new tree. Kanehiko move to the tree! Success!! Kanehiko gladed, Gobline gladed, too. Of course and me! I was satisfy very much. We started to go home. "Wait my chocolate, pleace”

Essay B (101 words) One day a old man and a young child were walking the street with a dog. The dog stopped and raise the dog's back leg by trees of the street one by one. A few minutes later, the young child and old man were getting angry about the dog's doing. After walking, the young children thought about it, and started making tree which was cut and put on the truck. Next walking, the dog get on the truck and from then walking became smoothly, but the dog would not like to get off it. And keep to raise the dog's leg.

Essay C (147 words) There was a dog he likes walk to street he always walk to street with his boss and boss's childlen. One day he walked street with his boss and boss's children. First, he tried do toilet on a tree's bottom. But he didn't. Then, he didn't toilet on a light's bottom too. His boss was angry. Boss's children was boring. And he didn't toilet next tree's bottom too. He tried do tilet on all tree's bottoms. But he didn't on all tree's bottoms. He could not choice best tree's bottom. His boss and boss's childlen really boring to him. The boy was think. And he had very good idea. He made a move tree for the dog. Now, the dog go to street everyday with his boss and boss's children. He must not choice best tree's bottom for toilet. Because he has a tree for himself.

3.6 Index of uniqueness using words occurring frequently in two different tasks These words occurred in close to 32 texts in both assignments. The words used are shown in Table A1 and the results of the analysis in Table A2. Uniqueness index values are similar to the main analysis in terms of average number of signatures per subset and uniqueness index scores. The correlation coefficient (r) between task scores is again very low at 0.11. Figure A1 shows the uniqueness index scores for both tasks using these words.

Table A1: Thirty words appearing frequently in both tasks
day for go good one that then there this thought too were father at by can do don't move of put want didn't had my take are have their with

Table A2: Unique lexical signatures and uniqueness index scores for both tasks

                                         Task 1    Task 2
Average unique signatures per subset      17.98     17.47
Average uniqueness index score             0.28      0.27
  highest                                  0.65      0.54
  lowest                                   0.06      0.06
Correlation (r) (Task 1 & Task 2)              0.11

[Figure A1: Uniqueness index scores using words from both tasks. Scatter plot of the uniqueness index scores for the two tasks (x-axis: Task 1 scores, 0 to 0.45).]

5.1 Sample learner essays and native text for review task 5.1.1 Sample learner essays Three sample essays are presented for each task. Essay A is the longest, Essay B is the shortest and Essay C is of about average length.

Essay A (418 words) The Cars is feature animation film which was released in July in 2006 in Japan. This movie got a Golden Glove Winning in the best animation section. The director is John A. Lasseter. He is a member of Pixar executive producer and he directed many pixer's movie. This movie was a story of the car. The main character is rookie racing car who name is Lightning McQueen. He has dreamed all his life of winning the Piston Cup Championship. Opening started the exciting race scene with a big sound. It attracted our feelings because it sound was real and the animation was so clear and beautiful. McQueen had to take part in next race in Los Angeles in California. McQueen and his transport truck Mack begin a journey across the country to California on the highway. However Mack slept as driving and McQueen was left on the road and got to Radiator Springs. While McQueen was running away from police car, Sheriff, he destroyed the road in Radiator Springs. He took to a traffic court and required to repair the road. At first he didn't like this town but he met Sally, Mater and Doc Hudson and knew the history of Radiator Springs. He loved Sally and made best friend, Mater. This movie told real friendship and kindness. The characters were very attractive like human and interesting story draw us into the cars world. At last he repaired the road completely and maintained his body by Luigi and Guido. He gave happy all citizens and they appreciated him. Its may difficult to kind to others but the kindness is natural feeling. McQueen changed his heart by good friends easily so good friends are so important to get a win. Suddenly Mack and many TV reporters and camera came to the town and he was taken by Mack to California. In the race, he had no partner so it was hard to win the race but Radiator Spring's friend came to the race and supported McQueen. Thanks to them he got big power but in the last rap, old racing car clashed. He helped him in the verge of win. Everyone admired McQueen. Finally the Radiator Springs became flourishing because McQueen decided Radiator Springs as the stronghold. He noticed the importance of kindness. This movie told that we can get something important by the failure. The last was happy end and this movie was easy to understand so people really loved it from children to adults and enjoy watching.

Essay B (269 words) Mucc is one of Japanese rock bands. Now, they are becoming popular all over the world. They were really dangerous. Their songs express human's bad emotion and social contradiction clearly Their bass sounds were queer like a snake that crawls under ground. The guitar sounds made heavy and aggressive unison. The lyrics were painful. They often used these words, death, hate, despair, bleed, loneliness and so on. If you listen to their old songs when you droop, you might commit suicide. Their songs were lethal and they had reality. Their world was only dark. However, they are changing. They start to sing about “over the pain”. This album “Gokusai” means “colored vividly”. Yes, they got colors. The songs in “Gokusai” express hopes, future, lives. Not only those good things but also they have pain yet. Therefore, the songs have reality and reach our hearts. The sounds are also aggressive, but they have some kindness. Especially, “Utagoe” and “Yasasi Uta” have new style, they had never performed like them. Miya, the leader of the band said that he can not recognize this social disease and he can not forget these pains yet, however, it can not be helped, we have to live, and he can get good things from this world. This idea can be found in Utagoe's lyric, “I love this world that may not have enough merit to live” (ikirukachi mo naiyouna sekaiwo aisiteru). This year, they are going to great with 10 year's anniversary. They will release 2 best album named best of mucc and worst of mucc. It will be interesting to see what they do.

Essay C (344 words) This movie is one of the “GHIBLI” movies. A producer is Hayao Miyazaki. He is very famous producer in Japan. He also set up the original and scenario of this story. In addition, he wrote lyrics for a song, which is a main song “Kimi wo Nosete”. People don't know about this story is what country and time. Mr. Miyazaki had visited Wales, the UK before he made this movie. Therefore, you can see beautiful scene looks like the UK. Also, he said this movie set is in that country. On the other hand, you can guess the time of this story. Pazu's father had taken a picture of Laputa. And it was printed “1968 7” so this story is between 1970's and 1980's. However, the flight technology is more developed than now. The material of this story is from “Gulliver's travels”. The story is about “Laputa”, which is the land floating in the sky. There are many characters and they are quite important of Raputa. Two main characters are Pazu and Sheeta. They are very brave and gentle to everyone. Pazu is a boy and works at a coal mine. He is bereaved his parents. Sheeta is a descendant of Laputa family. Their enemy is Musuka. He is also a descendant of Laputa and he wants to be a king of Laputa and revive that land. All of them are so individual. Laputa is rare land and Pazu's father discovered it but anyone didn't believe him. Pazu's dream is to go Laputa and he wants to know about it. On the other hand, the army and the pirates chase Sheeta because she is the throne of Laputa and she has a piece of flying stone which is the most important part of Laputa. Pazu and Sheeta save Laputa from Musuka who is the army and family of Laputa throne. We can feel love for Laputa and partners of the main 2 characters. Pazu and Sheeta experience many dangers but their bond becomes strong. We can enjoy the story and beautiful scenery of Laputa and the sky.

5.1.2 Sample native text Pelican Pete's until recently was a sidewalk bar located in the vicinity of Nagoya Dome. Though it was a cosy little place that sat four but stood dozens if the sidewalk was used with reckless abandon, owner Tip decided late last year that it was time to move into bigger and better quarters. At its new location in Ikeshita behind the Chikusa Kuyakusho, there are now enough seats for everyone on most nights. Pelican Pete's promotes an extremely friendly beach bar atmosphere kinda like Cheers meets Margaritaville that should please many locally displaced parrotheads. For some the beach atmosphere and Tip's loud aloha shirts alone are enough to make them forget what part of the world they are in. But for the more discerning, it is the menu that will send them on a cruise. The extensive menu at Pelican Pete's is what you would expect to be on a menu back home if you are from North America. Predominantly American, the items boasted on the menu are salads from 500 yen sandwiches from 500 yen and dip nachos 700 yen. But what caught my eye were the burgers. I have a hard time finding a good hamburger in Nagoya so Pelican Pete's is a godsend. The plain burger starts at 400 yen but Pete's menu offers up to 18 different toppings that can be added to make your own. If this still isn't enough, there is always beer. Since Tip's favorite drink is beer it is no wonder that extra effort is put into the selection at Pelican Pete's. Forty five types of beer from 400 yen from 20 countries around the world.

5.2 List of function words from the British National Corpus (Leech, Rayson & Wilson, 2001) Pronouns it, I, you, he, they, she, we, who, them, him, me, her, one, us, something, nothing, himself, anything, itself, themselves, someone, everything, herself, anyone, everyone, whom, myself, yourself, none, no-one, somebody, anybody, his, plenty, mine, lots, ourselves, yours, hers, ours, whoever, theirs

Determiners the, a, an, their, its, my, your, no, our, every

Determiner/pronouns this, that, which, what, all, some, these, any, many, those, such, more, own, same, another, much, each, few, most, both, several, half, whose, little, former, whatever, less, enough, latter, either, fewer, neither

Prepositions of, in, to, for, with, on, by, at, from, as, into, about, like, after, between, through, over, against, under, without, within, during, before, towards, around, upon, including, among, across, off, behind, since, because, rather, until, according, up, despite, near, above, per, along, away, throughout, outside, round, beyond, worth, down, inside, instead, plus, past, front, apart, onto, beside, below, beneath, amongst, via, unlike, addition, prior, concerning, next, except, alongside, till, ahead, depending, regarding, toward, opposite, following, amid, underneath

Conjunctions and, but, or, if, when, than, while, where, whether, so, though, cos, nor, &, unless, once, even, whereas, whilst, albeit

Interjections and discourse markers yeah, oh, no, yes, mm, ah, mhm, aye, ooh, hello, hallo, dear, eh, ha, aha, hey, bye, yep, goodbye

Notes 1) Words are listed by category where they have not been included in a previous category. For example, in the BNC list, his occurs in the most frequent list of determiners but is excluded here because it is included in the list of pronouns. 2) Multi-word lexical items are abbreviated if part of the word has appeared earlier. If all the components of a multi-word item have appeared earlier, then that multi-word item is not included. This means the form of these lists is quite different to those found in Leech, Rayson & Wilson (2001). For example, according to in the prepositions list is abbreviated to according because to appears earlier in the list. Along with does not appear because both along and with appear earlier. This highlights a weakness of this primitive form of automatic calculation: it cannot make distinctions between single-word items and multi-word items. Such distinctions could be a feature of future more sophisticated versions.
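
The single-word matching described in these notes amounts to the simple calculation sketched below: each token is looked up in a flat function-word list on its own, so multi-word items such as according to cannot be recognised. The tokenisation and punctuation stripping are illustrative assumptions.

def function_word_proportion(text, function_words):
    # function_words: a set built from the flat lists above.
    tokens = [w.strip(".,!?\"'") for w in text.lower().split()]
    tokens = [w for w in tokens if w]
    hits = sum(1 for w in tokens if w in function_words)  # token-by-token match
    return hits / len(tokens) if tokens else 0.0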

6.1 Sample essays for TV task
Three sample essays are presented. Essay A is the longest, Essay B is the shortest and Essay C is of about average length.

Essay A (328 words) I don't think watching television is bad for children. It is true that there are some harmful TV program for children such as programs including sexual or violent expressions. There is an evidence that it causes bad affection for children. However there are a lot of good reasons for watching televisions. Firstly, there are educational programs which treat science, history and mathmatics. They are usually broadcasted during day time. Why is it during day time? I guess it's for the case when children is absent from their school, they can watch and learn something without going to school. Actually the programs are more interesting than a class of school. I used to watch such a program when I was absent from school. Secondly, there are some kinds of news program. For children it's painful to read a news paper, because they can't consentrate and there are so many words they don't know due to a lack of experience of reading books. Therefore watching news on TV can be the important connection between children and world. Phewhaps some people says that news program is also difficult for children, but there is a TV program which teaches children very understandably such as "kodomo-news (news for children)". Thirdly, there are variety show and comedy show as a good reason. Some people believe that they are harmful, because in the shows there are only funny things. It's true that watching too much or only the comedy shows makes children stupid, but it is not because of the program, but a lack of studying. Or rather, the comedy programs can be an important role that children can find a way to express themselves widely, and make their characters. As a conclusion, I can understand people who say that TV program is harmful, because there are actually bad programs. However, if there no TV program except for bad one, TV industries must haven't lasted. This is why TV program should be watched by children.

Essay B (50 words) I disagree with that whacting televivion is bad for children. Although there are many bad program for children, children can learn many things from television. For example, children can learn name of things and lunguage. Also, we can look other country's scenery and other beautiful things. It grows children heart.

Essay C (147 words) I agree with the statement. My opinion is children should read good books, such as business books than watching television. This is because, I think, most TV problem is made for all people, from kids to elder people. It should be easier to make people understand. While, books is good materials to get new information. Usually, books are consisted by a lot of ideas of author. We can get those ideas author has gotten through their life by just paying only a few thousand yen. There may be information, which author doesn't say on TV, in books. Author takes much more time to make their books. Even some people say that watching television is waste of time. I think so too, and children should keep time to watch less than 30 minutes a day. Because of these reasons, I support the statement "Watching television is bad for children.

7.1 Principal components graphed against standardized variables

[Scatter plots omitted; only the figure captions and axis labels are retained below.]

Figure A2: PC1 v essay length
Figure A3: PC2 v TTR(100)
Figure A4: PC3 v mean sentence length
Figure A5: PC4 v error estimate
Figure A6: PC5 v mean word length
Figure A7: PC6 v Hapax(100)
7.2 Euclidean distance
The Euclidean distance between two points x and y in n-dimensional space is as follows:

d(x,y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}

Weighted Euclidean distance can be defined as follows:

d(x,y) = \sqrt{w_1(x_1 - y_1)^2 + w_2(x_2 - y_2)^2 + \ldots + w_n(x_n - y_n)^2}

Example: In the clustering algorithm, the distance from each essay, represented as a point in six-dimensional space, to each initial cluster point was calculated.

The initial high cluster point (c_h) was: c_h = (1, 1, 1, -1, 1, 1)

The first essay point (e_1) was: e_1 = (0.83, 0.70, -0.45, 0.29, 0.63, 0.45)

The weightings for each dimension were: (1, 20/34, 13/34, 12/34, 6/34, 6/34)

The distance between the first essay and the initial high cluster point was:

d(e_1, c_h) = \sqrt{1(1 - 0.83)^2 + \frac{20}{34}(1 - 0.7)^2 + \frac{13}{34}(1 + 0.45)^2 + \frac{12}{34}(-1 - 0.29)^2 + \frac{6}{34}(1 - 0.63)^2 + \frac{6}{34}(1 - 0.45)^2} \approx 1.25
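
As an illustration only (this is not the code used in the study), the following sketch reproduces the worked example above: it computes the weighted Euclidean distance between the first essay point and the initial high cluster point using the weightings (1, 20/34, 13/34, 12/34, 6/34, 6/34). The function name is an assumption.

# Minimal sketch: weighted Euclidean distance between an essay point and a
# cluster point, reproducing the worked example above.

from math import sqrt

def weighted_euclidean(x, y, weights):
    """Square root of the weighted sum of squared coordinate differences."""
    return sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

# Values from the worked example: initial high cluster point, first essay
# point, and the per-dimension weightings.
c_h = (1, 1, 1, -1, 1, 1)
e_1 = (0.83, 0.70, -0.45, 0.29, 0.63, 0.45)
weights = (1, 20/34, 13/34, 12/34, 6/34, 6/34)

print(round(weighted_euclidean(e_1, c_h, weights), 2))  # approximately 1.25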

7.3 Kappa statistic
The Kappa statistic (K) measures the inter-rater reliability between raters adjusted for chance agreement:

K = \frac{P_o - P_e}{1 - P_e}

where P_o is the proportion of agreement observed and P_e the proportion of agreement expected by chance. For two response categories, P_o and P_e are calculated as follows in Table A3:

Table A3: Two rater representation

                      Rater 2
                 1        2        Sum
Rater 1    1     A        B        A+B
           2     C        D        C+D
Sum              A+C      B+D      N

P_o = \frac{A + D}{N}, \qquad P_e = \frac{(A + C)}{N} \cdot \frac{(B + D)}{N} + \frac{(A + B)}{N} \cdot \frac{(C + D)}{N}

Example: The agreements of two raters are as follows in Table A4:

Table A4: Example two rater representation

                      Rater 2
                 Good     Poor     Sum
Rater 1   Good   36       14       50
          Poor   14       36       50
Sum              50       50       100

P_o = \frac{(36 + 36)}{100} = \frac{72}{100} = 0.72, \qquad P_e = \frac{50}{100} \cdot \frac{50}{100} + \frac{50}{100} \cdot \frac{50}{100} = \frac{1}{4} + \frac{1}{4} = 0.5

K = \frac{(0.72 - 0.5)}{1 - 0.5} = \frac{0.22}{0.5} = 0.44
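
As an illustration (not the thesis implementation), the following sketch computes the Kappa statistic for the two-rater, two-category case directly from the four cell counts of Table A4; the function name is an assumption.

# Minimal sketch: Cohen's Kappa for two raters and two response categories,
# computed from the cells of a 2x2 agreement table in which a and d are
# agreements and b and c are disagreements.

def cohens_kappa(a, b, c, d):
    """Return K = (P_o - P_e) / (1 - P_e) for a 2x2 agreement table."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + c) / n) * ((b + d) / n) + ((a + b) / n) * ((c + d) / n)
    return (p_o - p_e) / (1 - p_e)

# Counts from Table A4: the raters agree on 36 "Good" and 36 "Poor" essays.
print(round(cohens_kappa(36, 14, 14, 36), 2))  # 0.44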

7.4 Technical problems associated with Bayesian classifiers
The following problems have been associated with using Bayesian classifiers with essay data:

• A need to avoid absolute probabilities of 0 and 1, since the Bayesian calculation includes multiplication of probabilities and complement probabilities
• Problems with modeling multiple occurrences of a word in an essay
• When dealing with long essays, Bayesian classifiers often polarize at values of 0 and 1, which makes incremental assignment to categories difficult
• Bayesian classifiers may be overly influenced by long essays
• They may also be overly influenced by words that occur in many essays
• Training samples of differing sizes may distort results

As a way around these problems, an algorithm suggested by Rennie et al. (2003) was used. This addresses the above concerns using a set of transformations as follows (a schematic code sketch of these transformations is given after the list):

• A transformation of adding one to each word occurrence
• Using a log value of word occurrence
• Discounting weights of words according to the number of essays they occur in
• Normalizing essays for length
• Using complement probabilities, that is, in this case, the probability that an essay does not belong to the high quality group
• Using log weightings
• Normalizing log weightings
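
The following sketch is an illustration under stated assumptions, not the algorithm as implemented in this study. It applies transformations of the kind listed above (add-one log term frequency, document-frequency discounting, length normalization, complement class weights with normalized log weightings) to a toy term-count matrix and classifies a new essay. The toy counts, the two-class labels and all variable and function names are invented for illustration.

# Minimal sketch of Rennie et al. (2003) style transformations on toy data.

import numpy as np

# Toy data: rows are essays, columns are word counts; labels: 1 = high quality.
counts = np.array([[3, 0, 1],
                   [2, 1, 0],
                   [0, 4, 2],
                   [1, 3, 3]], dtype=float)
labels = np.array([1, 1, 0, 0])

# 1) Add one to each word occurrence and take logs (dampens repeated words).
tf = np.log(counts + 1.0)

# 2) Discount words by the number of essays they occur in (IDF-style weighting).
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((n_docs + 1.0) / (df + 1.0))
tfidf = tf * idf

# 3) Normalize each essay vector for length.
norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
x = tfidf / np.where(norms == 0, 1.0, norms)

# 4) Complement weights: for each class, estimate word probabilities from the
#    essays that do NOT belong to it, take logs, then 5) normalize the logs.
def complement_log_weights(x, labels, cls, alpha=1.0):
    comp = x[labels != cls]
    theta = (comp.sum(axis=0) + alpha) / (comp.sum() + alpha * x.shape[1])
    w = np.log(theta)
    return w / np.abs(w).sum()

w_high = complement_log_weights(x, labels, cls=1)
w_low = complement_log_weights(x, labels, cls=0)

# Classify a new essay: assign it to the class whose complement weights give
# the smaller score (a low score means the essay looks unlike that class's
# complement, i.e. it looks like the class itself).
new_essay = np.log(np.array([2.0, 0.0, 1.0]) + 1.0) * idf
new_essay = new_essay / np.linalg.norm(new_essay)
score_high = new_essay @ w_high
score_low = new_essay @ w_low
print("high quality" if score_high < score_low else "low quality")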
