Using NLP to Support Scalable Assessment of Short Free Text Responses

Alistair Willis
Department of Computing and Communications
The Open University
Milton Keynes, UK
[email protected]

Abstract

Marking student responses to short answer questions raises particular issues for human markers, as well as for automatic marking systems. In this paper we present the Amati system, which aims to help human markers improve the speed and accuracy of their marking. Amati supports an educator in incrementally developing a set of automatic marking rules, which can then be applied to larger question sets or used for automatic marking. We show that using this system allows markers to develop mark schemes which closely match the judgements of a human expert, with the benefits of consistency, scalability and traceability afforded by an automated marking system. We also consider some difficult cases for automatic marking, and look at some of the computational and linguistic properties of these cases.

1 Introduction

In developing systems for automatic marking, Mitchell et al. (2002) observed that assessment based on short answer, free text input from students demands very different skills from assessment based upon multiple-choice questions. Free text questions require a student to present the appropriate information in their own words, and without the cues sometimes provided by multiple choice questions (described respectively as improved verbalisation and recall (Gay, 1980)). Work by Jordan and Mitchell (2009) has demonstrated that automatic, online marking of student responses is both feasible (in that marking rules can be developed which mark at least as accurately as a human marker), and helpful to students, who find the online questions a valuable and enjoyable part of the assessment process. Such automatic marking is also an increasingly important part of assessment in Massive Open Online Courses (MOOCs) (Balfour, 2013; Kay et al., 2013).

However, the process of creating marking rules is known to be difficult and time consuming (Sukkarieh and Pulman, 2005; Pérez-Marín et al., 2009). The rules should usually be hand-crafted by a tutor who is a domain expert, as small differences in the way an answer is expressed can be significant in determining whether responses are correct or incorrect. Curating sets of answers to build mark schemes can prove to be a highly labour-intensive process. Given this requirement, and the current lack of availability of training data, a valuable progression from existing work in automatic assessment may be to investigate whether NLP techniques can be used to support the manual creation of such marking rules.

In this paper, we present the Amati system, which supports educators in creating mark schemes for automatic assessment of short answer questions. Amati uses information extraction-style templates to enable a human marker to rapidly develop automatic marking rules, and inductive logic programming to propose new rules to the marker. Having been developed, the rules can be used either for marking further unseen student responses, or for online assessment.

Automatic marking also brings with it further advantages. Because rules are applied automatically, it improves the consistency of marking; Williamson et al. (2012) have noted the potential of automated marking to improve the reliability of test scores.


In addition, because Amati uses symbolic/logical rules rather than stochastic rules, it improves the traceability of the marks (that is, the marker can give an explanation of why a mark was awarded, or not), and increases the maintainability of the mark scheme, because the educator can modify the rules in the context of better understanding of student responses. The explanatory nature of symbolic mark schemes also supports the auditing of marks awarded in assessment. Bodies such as the UK's Quality Assurance Agency¹ require that assessment be fully open for the purposes of external examination. Techniques which can show exactly why a particular mark was awarded (or not) for a given response fit well with existing quality assurance requirements.

All experiments in this paper were carried out using student responses collected from a first year introductory science module.

2 Mark Scheme Authoring

Burrows et al. (2015) have identified several different eras of automatic marking of free text responses. One era they have identified has treated automatic marking as essentially a form of information extraction. The many different ways that a student can correctly answer a question can make it difficult to award correct marks². For example:

    A snowflake falls vertically with a constant speed. What can you say about the forces acting on the snowflake?

Three student responses to this question were:

(1) there is no net force

(2) gravitational force is in equilibrium with air resistance

(3) no force balanced with gravity

The question author considered both responses (1) and (2) correct. However, they share no common words (except force, which already appears in the question, and is). And while balance and equilibrium have closely related meanings, response (3) was not considered a correct answer to the question³. These examples suggest that bag of words techniques are unlikely to be adequate for the task of short answer assessment. Without considering word order, it would be very hard to write a mark scheme that gave the correct mark to responses (1)-(3), particularly when these occur in the context of several hundred other responses, all using similar terms.

In fact, techniques such as Latent Semantic Analysis (LSA) have been shown to be accurate in grading longer essays (Landauer et al., 2003), but this success does not appear to transfer to short answer questions. Haley's (2008) work suggests that LSA performs poorly when applied to short answers, with Thomas et al. (2004) demonstrating that LSA-based marking systems for short answers did not give an acceptable correlation with an equivalent human marker, although they do highlight the small size of their available dataset.

Sukkarieh and Pulman (2005) and Mitchell et al. (2002) have demonstrated that hand-crafted rules containing more syntactic structure can be valuable for automatic assessment, but both papers note the manual effort required to develop the set of rules in the first place. To address this, we have started to investigate techniques to develop systems which can support a subject specialist (rather than a computing specialist) in developing a set of marking rules for a given collection of student responses. In addition, because it has been demonstrated (Butcher and Jordan, 2010) that marking rules based on regular expressions can mark accurately, we have also investigated the use of a symbolic learning algorithm to propose further marking rules to the author.

Enabling such markers to develop computational marking rules should yield the subsequent benefits of speed and consistency noted by Williamson et al., and the potential for embedding in an online system to provide immediate marks for student submissions (Jordan and Mitchell, 2009). This proposal fits with the observation of Burrows et al. (2015), who suggest that rule based systems are desirable for "repeated assessment" (i.e. where the assessment will be used multiple times), which is more likely to repay the investment in developing the mark scheme. We believe that the framework that we present here shows that rule-based marking can be more tractable than suggested by Burrows.

¹ http://www.qaa.ac.uk
² Compared with multiple choice questions, which are easy to mark, although constructing suitable questions in the first place is far from straightforward (Mitkov et al., 2006).
³ As with all examples in this paper, the "correctness" of answers was judged with reference to the students' level of study and provided teaching materials.

2.1 The Mark Scheme Language

In this paper, I will describe a set of such marking rules as a "mark scheme", so Amati aims to support a human marker in hand crafting a mark scheme, which is made up of a set of marking rules. In Amati, the mark schemes are constructed from sets of Prolog rules, which attempt to classify the responses as either correct or incorrect. The rule syntax closely follows that of Junker et al. (1999), using the set of predicates shown in figure 1.

    term(R, Term, I)              The Ith term in R is Term
    template(R, Template, I)      The Ith term in R matches Template
    precedes(Ii, Ij)              The Ii-th term in a response precedes the Ij-th term
    closely precedes(Ii, Ij)      The Ii-th term in a response precedes the Ij-th term within a specified window

    Figure 1: Mark scheme language

The main predicate for recognising keywords is term(R, Term, I), which is true when Term is the Ith term in the response R. Here, we use "term" to mean a word or token in the response, subject to simple spelling correction. This correction is based upon a Damerau-Levenshtein (Damerau, 1964) edit distance of 1, which represents the replacement, addition or deletion of a single character, or a transposition of two adjacent characters. So for example, if R represented the student response:

(4) no force ballanced with gravity

then term(R, balanced, 3) would be true, as the 3rd token in R is ballanced, and at most 1 edit is needed to transform ballanced to balanced.

The predicate template allows a simple form of stemming (Porter, 1980). The statement template(R, Template, I) is true if Template matches at the beginning of the Ith token in R, subject to the same spelling correction as term. So for example, the statement:

    template(R, balanc, 3)

would match example (4), because balanc is a single edit from ballanc, which itself matches the beginning of the 3rd token in R. (Note that it would not match as a term, because ballance is two edits from balanc.) Such templates allow rules to be written which match, for example, balance, balanced, balancing and so on.

The predicates precedes and closely precedes, and the index terms, which appear as the variables I and J in figure 1, capture a level of linear precedence, allowing the rules to recognise a degree of linguistic structure. As discussed in section 2, techniques which do not capture some level of word order are insufficiently expressive for the task of representing mark schemes. However, a full grammatical analysis also appears to be unnecessary, and in fact can lead to ambiguity. Correct responses to the Rocks question (see table 1) required the students to identify that the necessary conditions to form the rock are high temperature and high pressure. Both temperature and pressure needed to be modified to earn the mark. Responses such as (5) should be marked correct, with an assumption that the modifier should distribute over the conjunction.

(5) high temperature and pressure

While the precedence predicate is adequate to capture this behaviour, using a full parser creates an ambiguity between the analyses (6) and (7).

(6) (high (pressure)) and temperature ×

(7) (high (pressure and temperature)) √

This example suggests that high accuracy can be difficult to achieve by systems which commit to an early, single interpretation of the ambiguous text.

So a full example of a matching rule might be:

    term(R, oil, I) ∧ term(R, less, J) ∧ template(R, dens, K) ∧ precedes(I, J) → correct(R)

which would award the correct marks to responses (8) and (9):

(8) oil is less dense than water √

(9) water is less dense than oil ×

The use of a template also ensures the correct mark is awarded to the common response (10), which should also be marked as correct:

(10) oil has less density than water
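
To make the predicate semantics above concrete, the following sketch shows one way the matching could be implemented: term and template apply the Damerau-Levenshtein edit distance of 1 described in section 2.1, and precedes is modelled as a comparison of token indices. The function names, the whitespace tokenisation and the encoding of the oil rule as a Python function are illustrative assumptions for exposition only; Amati itself represents such rules as Prolog clauses.

```python
# Illustrative sketch of the section 2.1 predicates.  This is not the Amati
# implementation; names and data structures are assumptions for exposition.

def dl_distance(a, b):
    """Damerau-Levenshtein distance with adjacent transpositions (OSA variant)."""
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                curr[j] = min(curr[j], prev2[j - 2] + 1)   # transposition
        prev2, prev = prev, curr
    return prev[-1]

def term(tokens, word, i):
    """term(R, Word, I): the i-th token equals `word`, allowing one spelling edit."""
    return i < len(tokens) and dl_distance(tokens[i], word) <= 1

def template(tokens, prefix, i):
    """template(R, Prefix, I): some prefix of the i-th token is within one edit of `prefix`."""
    if i >= len(tokens):
        return False
    tok = tokens[i]
    return any(dl_distance(tok[:n], prefix) <= 1
               for n in range(len(prefix) - 1, len(prefix) + 2) if 0 < n <= len(tok))

def matches_oil_rule(response):
    """term(R, oil, I) ∧ term(R, less, J) ∧ template(R, dens, K) ∧ precedes(I, J)."""
    tokens = response.lower().split()
    idx = range(len(tokens))
    return any(term(tokens, "oil", i) and term(tokens, "less", j)
               and template(tokens, "dens", k) and i < j    # precedes(I, J)
               for i in idx for j in idx for k in idx)

for r in ["oil is less dense than water",      # (8)  matched
          "water is less dense than oil",      # (9)  not matched
          "oil has less density than water"]:  # (10) matched
    print(matches_oil_rule(r), "-", r)
```

Run over responses (8)-(10), the sketch matches (8) and (10) but not (9), mirroring the marks shown above.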

2.2 Incremental rule authoring

The Amati system is based on a bootstrapping scheme, which allows an author to construct a ruleset by marking student responses in increments of 50 responses at a time, while constructing marking rules which reflect the marker's own judgements. As the marker develops the mark scheme, he or she can correct the marks awarded by the existing mark scheme, and then edit the mark scheme to more accurately reflect the intended marks.

The support for these operations is illustrated in figures 2 and 3. To make the system more usable by non-specialists (that is, non-specialists in computing, rather than non-specialists in the subject being taught), the authors are not expected to work directly with Prolog. Rather, rules are presented to the user via online forms, as shown in figure 2. As each rule is developed, the system displays to the user the responses which the rule marks as correct.

    Figure 2: Form for entering marking rules

As increasingly large subsets of the student responses are marked, the system displays the set of imported responses, the mark that the current mark scheme awards, and which rule(s) match against each response (figure 3). This allows the mark scheme author to add or amend rules as necessary.

    Figure 3: Application of rule to the Oil response set

2.3 Rule Induction

As the marker constructs an increasingly large collection of marked responses, it can be useful to use the marked responses to induce further rules automatically. Methods for learning relational rules to perform information extraction are now well established (Califf and Mooney, 1997; Soderland, 1999), with Inductive Logic Programming (ILP) (Quinlan, 1990; Lavrač and Džeroski, 1994) often proving a suitable learning technique (Aitken, 2002; Ramakrishnan et al., 2008). ILP is a supervised learning algorithm which attempts to generate a logical description of a set of facts in the style of a Prolog program. Amati embeds the ILP system Aleph (Srinivasan, 2004) as the rule learner, which itself implements the Progol learning algorithm (Muggleton, 1995), a bottom up, greedy coverage algorithm. This allows an author to mark the current set of questions (typically the first or second block of 50 responses), before using the ILP engine to generate a rule set which he or she can then modify. We return to the question of editing rule sets in section 4.1.
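
The following is a minimal toy sketch of the bottom-up, greedy coverage strategy described above. It is deliberately not a substitute for Aleph or Progol: it proposes only conjunctions of "the response contains this word" conditions, ignores spelling correction and word order, and all of the function names and example responses are invented for illustration.

```python
# A toy greedy-coverage rule learner, loosely in the spirit of the bottom-up
# Progol/Aleph search described above.  It is NOT Aleph: every name and the
# sample data are assumptions made purely for exposition.

def words(response):
    return set(response.lower().split())

def covers(rule, response):
    """A rule is a set of words that must all occur in the response."""
    return rule <= words(response)

def learn_rules(positives, negatives):
    """Greedily add rules until every positive (correct) response is covered."""
    rules, uncovered = [], list(positives)
    while uncovered:
        seed, rule, neg = uncovered[0], set(), list(negatives)
        # Bottom-up: grow a rule from one uncovered example, greedily adding
        # the word that rules out the most remaining negative responses.
        while neg:
            candidates = sorted(words(seed) - rule)
            if not candidates:
                break
            best = max(candidates,
                       key=lambda w: sum(w not in words(n) for n in neg))
            rule.add(best)
            neg = [n for n in neg if covers(rule, n)]
        rules.append(frozenset(rule))
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return rules

positives = ["there is no net force",
             "all the forces are balanced",
             "the forces are in equilibrium"]
negatives = ["gravity is the only force",
             "there is a net downward force"]

for rule in learn_rules(positives, negatives):
    print("mark correct if the response contains:", sorted(rule))
```

On this toy data the learner stops at very general single-word rules (for example, "contains no"), because they already exclude both negative examples; much as with the ILP-proposed rules discussed in section 4.1, a marker would normally want to tighten such rules by hand.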

Our use of ILP to support markers in developing rules has several parallels with the Powergrading project (Basu et al., 2013). Both our work and that of Basu et al. focus on using NLP techniques primarily to support the work of a human marker, and reduce marker effort. Basu et al. take an approach whereby student responses are clustered using statistical topic detection, with the marker then able to allocate marks and feedback at the cluster level, rather than at the individual response level. Similarly, the aim of Amati is that markers should be able to award marks by identifying, via generated rules, commonly occurring phrases. The use of such phrases can then be analysed at the cohort level (or at least, incrementally at the cohort level), rather than at the individual response level.

In practice, we found that markers were likely to use the predicted rules as a "first cut" solution, to gain an idea of the overall structure of the final mark scheme. The marker could then concentrate on developing more fine-grained rules to improve the mark scheme accuracy. This usage appears to reflect that found by Basu et al., of using the machine learning techniques to automatically identify groups of similar responses. This allows the marker to highlight common themes and frequent misunderstandings.

    Short name   Question text
    Sandstone    A sandstone observed in the field contains well-sorted, well rounded, finely pitted and reddened grains. What does this tell you about the origins of this rock?
    Snowflake    A snowflake falls vertically with a constant speed. What can you say about the forces acting on the snowflake?
    Charge       If the distance between two electrically charged particles is doubled, what happens to the electric force between them?
    Rocks        Metamorphic rocks are existing rocks that have "changed form" (metamorphosed) in a solid state. What conditions are necessary in order for this change to take place?
    Sentence     What is wrong with the following sentence? A good idea.
    Oil          The photograph (not shown here) shows a layer of oil floating on top of a glass of water. Why does the oil float?

    Table 1: The questions used

3 Evaluation

The aim of the evaluation was to determine whether a ruleset built using Amati could achieve performance comparable with human markers. As such, there were two main aims. First, to determine whether the proposed language was sufficiently expressive to build successful mark schemes, and second, to determine how well a mark scheme developed using the Amati system would compare against a human marker.

3.1 Training and test set construction

A training set and a test set of student responses were built from eight questions taken from an entry-level science module, shown in table 1. Each student response was to be marked as either correct or incorrect. Two sets of responses were used, which were built from two subsequent presentations of the same module. Amati was used to build a mark scheme using a training set of responses from the 2008 student cohort, and then that scheme was applied to an unseen test set constructed from the responses to the same questions from the 2009 student cohort.

The difficulties in attempting to build any corpus in which the annotations are reliable are well documented (Marcus et al.'s (1993) discussion of the Penn Treebank gives a good overview). In this case, we exploited the presence of the original question setter and module chair to provide as close to a "ground truth" as is realistic. Our gold-standard marks were obtained with a multiple-pass annotation process, in which the collections of responses were initially marked by two or more subject-specialist tutors, who mainly worked independently, but who were able to confer when they disagreed on a particular response. The marks were then validated by the module chair, who was also called upon to resolve any disputes which arose as a result of disagreements in the mark scheme.

The cost of constructing a corpus in this way would usually be prohibitive, relying as it does on subject experts both to provide the preliminary marks, and to provide a final judgement in the reconciliation phase. In this case, the initial marks (including the ability to discuss in the case of a dispute) were generated as part of the standard marking process for student assessment in the University⁴.

⁴ We have not presented inter-annotator agreement measures here, as these are generally only meaningful when annotators have worked independently. This model of joint annotation with a reconciliation phase is little discussed in the literature, although this is a process used by Farwell et al. (2009). Our annotation process differed in that the reconciliation phase was carried out face to face following each round of annotation, in contrast to Farwell et al.'s, which allowed a second anonymous vote after the first instance.

    Question     # responses   accuracy/%   α/%
    Sandstone    1711          98.42        97.5
    Snowflake    2057          91.0         81.7
    Charge       1127          98.89        97.6
    Rocks        1429          99.00        89.6
    Sentence     1173          98.19        97.5
    Oil           817          96.12        91.5

    Table 2: Accuracy of the Amati mark schemes on unseen data, and the Krippendorff α rating between the marks awarded by Amati and the gold standard

3.2 Effectiveness of authored mark schemes

To investigate the expressive power of the representation language, a set of mark schemes for the eight questions shown in table 1 were developed using the Amati system. The training data was used to build the rule set, with regular comparisons against the gold standard marks. The mark scheme was then applied to the test set, and the marks awarded compared against the test set gold standard marks.

The results are shown in table 2. The table shows the total number of responses per question, and the accuracy of the Amati rule set applied to the unseen data set. So for example, the Amati rule set correctly marked 98.42% of the 1711 responses to the Sandstone question. Note that the choice of accuracy as the appropriate measure of success is determined by the particular application. In this case, the important measure is how many responses are marked correctly. That is, it is as important that incorrect answers are marked as incorrect, as it is that correct answers are marked as correct.

To compare the performance of the Amati ruleset against the human expert, we have used Krippendorff's α measure, implemented in the Python NLTK library (Bird et al., 2009), following Artstein and Poesio's (2008) presentation. The rightmost column of table 2 shows the α measure between the Amati ruleset and the post-reconciliation marks awarded by the human expert. This column shows a higher level of agreement than was obtained with human markers alone. The agreement achieved by independent human markers ranged from a maximum of α = 88.2% to a minimum of α = 71.2%, which was the agreement on marks awarded for the snowflake question. It is notable that the human marker agreement was worst on the same question that the Amati-authored ruleset performed worst on; we discuss some issues that this question raises in section 4.3.

The marks awarded by the marker supported with Amati therefore aligned more closely with those of the human expert than was achieved between independent markers. This suggests that further development of computer support for markers is likely to improve overall marking consistency, both across the student cohort, and by correspondence with the official marking guidance.
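
For readers who want to reproduce this kind of agreement figure, the snippet below shows the calling pattern for the NLTK implementation referred to above. The response identifiers and marks are invented toy data, not the marks from this study; only the use of the library is the point.

```python
# Krippendorff's alpha between two sets of marks (e.g. an Amati ruleset and a
# gold standard) using NLTK's AnnotationTask.  The marks are toy data.
from nltk.metrics.agreement import AnnotationTask

amati_marks = {"r1": "correct", "r2": "incorrect", "r3": "correct", "r4": "correct"}
gold_marks  = {"r1": "correct", "r2": "incorrect", "r3": "incorrect", "r4": "correct"}

data = ([("amati", item, label) for item, label in amati_marks.items()]
        + [("gold", item, label) for item, label in gold_marks.items()])

task = AnnotationTask(data=data)
print("Krippendorff's alpha: %.3f" % task.alpha())
```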

4 Observations on authoring rulesets

It is clear from the performance of the different rule sets that some questions are easier to generate mark schemes for than others. In particular, the mark scheme authored on the responses to the snowflake question performed with much lower accuracy than the other questions. This section gives a qualitative overview of some of the issues which were observed while authoring the mark schemes.

4.1 Modification of generated rules

A frequently cited advantage of ILP is that, as a logic program, the output rules are generated in a human-readable form (Lavrač and Džeroski, 1994; Mitchell, 1997). In fact, the inclusion of templates means that several of the rules can be hard to interpret at first glance. For example, a rule proposed to mark the Rocks question was:

    template(R, bur, I) ∧ template(R, hea, J) → correct(R)

As the official marking guidance suggests that High temperature and pressure is an acceptable response, hea can easily be interpreted as heat. However, it is not immediately clear what bur represents. In fact, a domain expert would probably recognise this as a shortened form of buried (as the high pressure, the second required part of the solution, can result from burial in rock). As the training set does not contain terms with the same first characters as burial, such as burnished, Burgundy or burlesque, this term matches. However, a mark scheme author might prefer to edit the rule slightly into something more readable and so maintainable:

    template(R, buri, I) ∧ term(R, heat, J) → correct(R)

so that either buried or burial would be matched, and to make the recognition of heat more explicit.

A more complex instance of the same phenomenon is illustrated by the generated rule:

    term(R, high, I) ∧ term(R, temperature, J) → correct(R)

Although the requirement for the terms high and temperature is clear enough, there is no part of this rule that requires that the student also mention high pressure. This has come about because all the student responses that mention high temperature also explicitly mention pressure. Because Progol and Aleph use a greedy coverage algorithm, in this case Amati did not need to add an additional rule to capture pressure. Again, the mark scheme author would probably wish to edit this rule to give:

    term(R, high, I) ∧ term(R, temperature, J) ∧ term(R, pressure, K) ∧ precedes(I, J) → correct(R)

which covers the need for high to precede temperature, and also contains a reference to pressure. A similar case, raised by the same question, is the following proposed rule:

    term(R, high, I) ∧ term(R, pressure, J) ∧ term(R, and, K) ∧ precedes(I, K) → correct(R)

which requires a conjunction, but makes no mention of temperature (or heat or some other equivalent). In this case, the responses (11) and (12):

(11) (high (pressure and temperature)) √

(12) (high (pressure and heat)) √

are both correct, and both appeared amongst the student responses. However, there were no incorrect responses following a similar syntactic pattern, such as, for example, (13) or (14):

(13) high pressure and altitude ×

(14) high pressure and bananas ×

Students who recognised that high pressure and something else were required, always got the something else right. Therefore, the single rule above had greater coverage than rules that looked individually for high pressure and temperature or high pressure and heat.

This example again illustrates the Amati philosophy that the technology is best used to support human markers. By hand-editing the proposed solutions, the marker ensures that the rules are more intuitive, and so can be more robust, and more maintainable in the longer term. In this case, an author might reasonably rewrite the single rule into two:

    term(R, high, I) ∧ term(R, pressure, J) ∧ term(R, temperature, K) ∧ precedes(I, K) → correct(R)

    term(R, high, I) ∧ term(R, pressure, J) ∧ term(R, heat, K) ∧ precedes(I, K) → correct(R)

removing the unnecessary conjunction, and providing explicit rules for heat and temperature.

4.2 Spelling correction

It is questionable whether spelling correction is always appropriate. A question used to assess knowledge of organic chemistry might require the term butane to appear in the solution. It would not be appropriate to mark a response containing the token butene (a different compound) as correct, even though butene would be an allowable misspelling of butane according to the given rules. On the other hand, a human marker would probably be inclined to mark responses containing buttane or butan as correct. These are also legitimate misspellings according to the table, but are less likely to be misspellings of butene.

The particular templates generated reflect the linguistic variation in the specific datasets. A template such as temp, intended to cover responses containing temperature (for example), would also potentially cover temporary, tempestuous, temperamental and so on. In fact, when applied to large sets of homogenous response-types (such as multiple responses to a single question), the vocabulary used across the complete set of responses turns out to be sufficiently restricted for meaningful templates to be generated. It does not follow that this hypothesis language would continue to be appropriate for datasets with a wider variation in vocabulary.

4.3 Diversity of correct responses

As illustrated in table 2, the Snowflake question was very tricky to handle, with lower accuracy than the other questions, and lower agreement with the gold standard. The following are some of the student responses:

(15) they are balanced

(16) the force of gravity is in balance with air resistance

(17) friction is balancing the force of gravity

(18) only the force of gravity is acting on the hailstone and all forces are balanced

The module chair considered responses (15), (16) and (17) to be correct, and response (18) to be incorrect.

The most straightforward form of the answer is along the lines of response (15). In this case, there are no particular forces mentioned; only a general comment about the forces in question. Similar cases are there are no net forces, all forces balance, the forces are in equilibrium and so on.

However, responses (16) and (17) illustrate that in many cases, the student will present particular examples to attempt to answer the question. In these cases, both responses identify gravity as one of the acting forces, but describe the counteracting force differently (as air resistance and friction respectively). A major difficulty in marking this type of question is predicting the (correct) examples that students will use in their responses, as each correct pair needs to be incorporated in the mark schemes. A response suggesting that air resistance counteracts drag would be marked incorrect. As stated previously, developing robust mark schemes requires that mark scheme authors use large sets of previous student responses, which can provide guidance on the range of possible responses.

Finally, response (18) illustrates a difficult response to mark (for both pattern matchers and linguistic solutions). The response consists of two conjoined clauses, the second of which, all forces are balanced, is in itself a correct answer. It is only in the context of the first clause that the response is marked incorrect, containing the error that it is only the force of gravity which acts.

This question highlights that the ease with which a question can be marked automatically can depend as much on the question being asked as the answers received. Of course, this also applies to questions intended to be marked by a human; some questions lead to easier responses to grade. So a good evaluation of a marking system needs to consider the questions (and the range of responses provided by real students) being asked; the performance of the system is meaningful only in the context of the nature of the questions being assessed, and an understanding of the diversity of correct responses. In this case, it appears that questions which can be correctly answered by using a variety of different examples should be avoided. We anticipate that with increasing maturity of the use of automatic marking systems, examiners would develop skills in setting appropriate questions for the marking system, just as experienced authors develop skills in setting questions which are appropriate for human markers.

4.4 Anaphora Ambiguity

The examples raise some interesting questions about how anaphora resolution should be dealt with. Two responses to the oil question are:

(19) The oil floats on the water because it is lighter

(20) The oil floats on the water because it is heavier

These two responses appear to have contradictory meanings, but in fact are both marked as correct. This initially surprising result arises from the ambiguity in the possible resolutions of the pronoun it:

(21) [The oil]i floats on the water because iti is lighter.

(22) The oil floats on [the water]j because itj is heavier.

When marking these responses, the human markers followed a general policy of giving the benefit of the doubt, and, within reasonable limits, will mark a response as correct if any of the possible interpretations would be correct relative to the mark scheme.

As with the ambiguous modifier attachment seen in responses (6) and (7), this example illustrates that using a different (possibly better) parser is unlikely to improve the overall system performance. Responses such as (21) and (22) are hard for many parsers to handle, because an early commitment to a single interpretation can assume that it must refer to the oil or the water. Again, this example demonstrates that a more sophisticated approach to syntactic ambiguity is necessary if a parser-based system is to be used. (One possible approach might be to use underspecification techniques (König and Reyle, 1999; van Deemter and Peters, 1996) and attempt to reason with the ambiguous forms.)

5 Discussion and Conclusions

We have presented a system which uses information extraction techniques and machine learning to support human markers in the task of marking free text responses to short answer questions. The results suggest that a system such as Amati can help markers create accurate, reusable mark schemes.

The user interface to Amati was developed in collaboration with experienced markers from the Open University's Computing department and Science department, who both gave input into the requirements for an effective marking system. We intend to carry out more systematic analyses of the value of using such systems for marking, but informally, we have found that a set of around 500-600 responses was enough for an experienced marker to feel satisfied with the performance of her own authored mark scheme, and to be prepared to use it on further unseen cases. (This number was for the Snowflake question, which contained approximately half correct responses. For the other questions, the marker typically required fewer responses.)

The work described in this paper contrasts with the approach commonly taken in automatic marking, of developing mechanisms which assign marks by comparing student responses to one or more target responses created by the subject specialist (Ziai et al., 2012). Such systems have proven effective where suitable linguistic information is compared, such as the predicate argument structure used by c-rater (Leacock and Chodorow, 2003), or similarity between dependency relationships, as used by AutoMark (now FreeText (Mitchell et al., 2002)) and Mohler et al. (2011). Our own experiments with FreeText found that incorrect marks were often a result of an inappropriate parse by the embedded Stanford parser (Klein and Manning, 2003), as illustrated by the parses (6) and (7). In practice, we have found that for the short answers we have been considering, pattern based rules tend to be more robust in the face of such ambiguity than a full parser.

A question over this work is how to extend the technique to more linguistically complex responses. The questions used here are all for a single mark, all or nothing. A current direction of our research is looking at how to provide support for more complicated questions which would require the student to mention two or more separate pieces of information, or to reason about causal relationships. A further area of interest is how the symbolic analysis of the students' responses can be used to generate meaningful feedback to support them as part of their learning process.

Acknowledgments

The author would like to thank Sally Jordan for her help in collecting the students' data, for answering questions about marking as they arose and for making the data available for use in this work. Also, David King who developed the Amati system, and Rita Tingle who provided feedback on the usability of the system from a marker's perspective. We would also like to thank the anonymous reviewers for their valuable suggestions on an earlier draft of the paper.

References

James Stuart Aitken. 2002. Learning information extraction rules: An inductive logic programming approach. In ECAI, pages 355–359.
Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.
Stephen P. Balfour. 2013. Assessing writing in MOOCs: Automated essay scoring and calibrated peer review. Research & Practice in Assessment, 8(1):40–48.
Sumit Basu, Chuck Jacobs, and Lucy Vanderwende. 2013. Powergrading: A clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics, 1:391–402.
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.
Steven Burrows, Iryna Gurevych, and Benno Stein. 2015. The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1):60–117.
Philip G. Butcher and Sally E. Jordan. 2010. A comparison of human and computer marking of short free-text student responses. Computers and Education.
Mary Elaine Califf and Raymond J. Mooney. 1997. Relational learning of pattern-match rules for information extraction. In T. M. Ellison, editor, Computational Natural Language Learning, pages 9–15. Association for Computational Linguistics.
F. J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171–176.
David Farwell, Bonnie Dorr, Rebecca Green, Nizar Habash, Stephen Helmreich, Eduard Hovy, Lori Levin, Keith Miller, Teruko Mitamura, Owen Rambow, Florence Reeder, and Advaith Siddharthan. 2009. Interlingual annotation of multilingual text corpora and FrameNet. In Hans C. Boas, editor, Multilingual FrameNets in Computational Lexicography: Methods and Applications. Mouton de Gruyter, Berlin.
Lorraine R. Gay. 1980. The comparative effects of multiple-choice versus short-answer tests on retention. Journal of Educational Measurement, 17(1):45–50.
Debra T. Haley. 2008. Applying Latent Semantic Analysis to Computer Assisted Assessment in the Computer Science Domain: A Framework, a Tool, and an Evaluation. Ph.D. thesis, The Open University.
Sally Jordan and Tom Mitchell. 2009. E-assessment for learning? The potential of short free-text questions with tailored feedback. British Journal of Educational Technology, 40(2):371–385.
Markus Junker, Michael Sintek, and Matthias Rinck. 1999. Learning for text categorization and information extraction with ILP. In J. Cussens, editor, Proceedings of the First Workshop on Learning Language in Logic, pages 84–93.
Judy Kay, Peter Reimann, Elliot Diebold, and Bob Kummerfeld. 2013. MOOCs: So Many Learners, So Much Potential... IEEE Intelligent Systems, 3:70–77.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, pages 423–430. Association for Computational Linguistics.
Esther König and Uwe Reyle. 1999. A general reasoning scheme for underspecified representations. In Hans Jürgen Ohlbach and U. Reyle, editors, Logic, Language and Reasoning. Essays in Honour of Dov Gabbay, volume 5 of Trends in Logic. Kluwer.
T. K. Landauer, D. Laham, and P. W. Foltz. 2003. Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. Shermis and J. Burstein, editors, Automated essay scoring: a cross-disciplinary approach, pages 87–112. Lawrence Erlbaum Associates, Inc.
Nada Lavrač and Sašo Džeroski. 1994. Inductive Logic Programming. Ellis Horwood, New York.
Claudia Leacock and Martin Chodorow. 2003. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4):389–405.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Tom Mitchell, Terry Russell, Peter Broomhead, and Nicola Aldridge. 2002. Towards robust computerised marking of free-text responses. In 6th International Computer Aided Assessment Conference, Loughborough.
