Using NLP to Support Scalable Assessment of Short Free Text Responses

Alistair Willis
Department of Computing and Communications
The Open University
Milton Keynes, UK
[email protected]

Abstract

Marking student responses to short answer questions raises particular issues for human markers, as well as for automatic marking systems. In this paper we present the Amati system, which aims to help human markers improve the speed and accuracy of their marking. Amati supports an educator in incrementally developing a set of automatic marking rules, which can then be applied to larger question sets or used for automatic marking. We show that using this system allows markers to develop mark schemes which closely match the judgements of a human expert, with the benefits of consistency, scalability and traceability afforded by an automated marking system. We also consider some difficult cases for automatic marking, and look at some of the computational and linguistic properties of these cases.

1 Introduction

In developing systems for automatic marking, Mitchell et al. (2002) observed that assessment based on short answer, free text input from students demands very different skills from assessment based upon multiple-choice questions. Free text questions require a student to present the appropriate information in their own words, and without the cues sometimes provided by multiple choice questions (described respectively as improved verbalisation and recall (Gay, 1980)). Work by Jordan and Mitchell (2009) has demonstrated that automatic, online marking of student responses is both feasible (in that marking rules can be developed which mark at least as accurately as a human marker), and helpful to students, who find the online questions a valuable and enjoyable part of the assessment process. Such automatic marking is also an increasingly important part of assessment in Massive Open Online Courses (MOOCs) (Balfour, 2013; Kay et al., 2013).

However, the process of creating marking rules is known to be difficult and time consuming (Sukkarieh and Pulman, 2005; Pérez-Marín et al., 2009). The rules should usually be hand-crafted by a tutor who is a domain expert, as small differences in the way an answer is expressed can be significant in determining whether responses are correct or incorrect. Curating sets of answers to build mark schemes can prove to be a highly labour-intensive process. Given this requirement, and the current lack of availability of training data, a valuable progression from existing work in automatic assessment may be to investigate whether NLP techniques can be used to support the manual creation of such marking rules.

In this paper, we present the Amati system, which supports educators in creating mark schemes for automatic assessment of short answer questions. Amati uses information extraction-style templates to enable a human marker to rapidly develop automatic marking rules, and inductive logic programming to propose new rules to the marker. Having been developed, the rules can be used either for marking further unseen student responses, or for online assessment.

Automatic marking also brings with it further advantages. Because rules are applied automatically, it improves the consistency of marking; Williamson et al. (2012) have noted the potential of automated marking to improve the reliability of test scores.


In addition, because Amati uses symbolic/logical rules rather than stochastic rules, it improves the traceability of the marks (that is, the marker can give an explanation of why a mark was awarded, or not), and increases the maintainability of the mark scheme, because the educator can modify the rules in the context of better understanding of student responses. The explanatory nature of symbolic mark schemes also supports the auditing of marks awarded in assessment. Bodies such as the UK's Quality Assurance Agency¹ require that assessment be fully open for the purposes of external examination. Techniques which can show exactly why a particular mark was awarded (or not) for a given response fit well with existing quality assurance requirements.

All experiments in this paper were carried out using student responses collected from a first year introductory science module.

2 Mark Scheme Authoring

Burrows et al. (2015) have identified several different eras of automatic marking of free text responses. One era they have identified has treated automatic marking as essentially a form of information extraction. The many different ways that a student can correctly answer a question can make it difficult to award correct marks². For example:

    A snowflake falls vertically with a constant speed. What can you say about the forces acting on the snowflake?

Three student responses to this question were:

(1) there is no net force

(2) gravitational force is in equilibrium with air resistance

(3) no force balanced with gravity

The question author considered both responses (1) and (2) correct. However, they share no common words (except force, which already appears in the question, and is). And while balance and equilibrium have closely related meanings, response (3) was not considered a correct answer to the question³. These examples suggest that bag of words techniques are unlikely to be adequate for the task of short answer assessment. Without considering word order, it would be very hard to write a mark scheme that gave the correct mark to responses (1)-(3), particularly when these occur in the context of several hundred other responses, all using similar terms.

In fact, techniques such as Latent Semantic Analysis (LSA) have been shown to be accurate in grading longer essays (Landauer et al., 2003), but this success does not appear to transfer to short answer questions. Haley's (2008) work suggests that LSA performs poorly when applied to short answers, with Thomas et al. (2004) demonstrating that LSA-based marking systems for short answers did not give an acceptable correlation with an equivalent human marker, although they do highlight the small size of their available dataset.

Sukkarieh and Pulman (2005) and Mitchell et al. (2002) have demonstrated that hand-crafted rules containing more syntactic structure can be valuable for automatic assessment, but both papers note the manual effort required to develop the set of rules in the first place. To address this, we have started to investigate techniques to develop systems which can support a subject specialist (rather than a computing specialist) in developing a set of marking rules for a given collection of student responses. In addition, because it has been demonstrated (Butcher and Jordan, 2010) that marking rules based on regular expressions can mark accurately, we have also investigated the use of a symbolic learning algorithm to propose further marking rules to the author.

Enabling such markers to develop computational marking rules should yield the subsequent benefits of speed and consistency noted by Williamson et al., and the potential for embedding in an online system to provide immediate marks for student submissions (Jordan and Mitchell, 2009). This proposal fits with the observation of Burrows et al. (2015), who suggest that rule based systems are desirable for "repeated assessment" (i.e. where the assessment will be used multiple times), which is more likely to repay the investment in developing the mark scheme. We believe that the framework that we present here shows that rule-based marking can be more tractable than suggested by Burrows.

¹ http://www.qaa.ac.uk
² Compared with multiple choice questions, which are easy to mark, although constructing suitable questions in the first place is far from straightforward (Mitkov et al., 2006).
³ As with all examples in this paper, the "correctness" of answers was judged with reference to the students' level of study and provided teaching materials.

2.1 The Mark Scheme Language

In this paper, I will describe a set of such marking rules as a "mark scheme", so Amati aims to support a human marker in hand crafting a mark scheme, which is made up of a set of marking rules. In Amati, the mark schemes are constructed from sets of Prolog rules, which attempt to classify the responses as either correct or incorrect. The rule syntax closely follows that of Junker et al. (1999), using the set of predicates shown in figure 1.

    term(R, Term, I)              The Ith term in R is Term
    template(R, Template, I)      The Ith term in R matches Template
    precedes(Ii, Ij)              The Ii-th term in a response precedes the Ij-th term
    closely precedes(Ii, Ij)      The Ii-th term in a response precedes the Ij-th term within a specified window

    Figure 1: Mark scheme language

The main predicate for recognising keywords is term(R, Term, I), which is true when Term is the Ith term in the response R. Here, we use "term" to mean a word or token in the response, subject to simple spelling correction. This correction is based upon a Damerau-Levenshtein (Damerau, 1964) edit distance of 1, which represents the replacement, addition or deletion of a single character, or a transposition of two adjacent characters. So for example, if R represented the student response:

(4) no force ballanced with gravity

then term(R, balanced, 3) would be true, as the 3rd token in R is ballanced, and at most 1 edit is needed to transform ballanced to balanced.

The predicate template allows a simple form of stemming (Porter, 1980). The statement template(R, Template, I) is true if Template matches at the beginning of the Ith token in R, subject to the same spelling correction as term. So for example, the statement:

    template(R, balanc, 3)

would match example (4), because balanc is a single edit from ballanc, which itself matches the beginning of the 3rd token in R. (Note that it would not match as a term, because ballance is two edits from balanc.) Such templates allow rules to be written which match, for example, balance, balanced, balancing and so on.

The predicates precedes and closely precedes, and the index terms, which appear as the variables I and J in figure 1, capture a level of linear precedence, allowing the rules to recognise a degree of linguistic structure. As discussed in section 2, techniques which do not capture some level of word order are insufficiently expressive for the task of representing mark schemes. However, a full grammatical analysis also appears to be unnecessary, and in fact can lead to ambiguity. Correct responses to the Rocks question (see table 1) required the students to identify that the necessary conditions to form the rock are high temperature and high pressure. Both temperature and pressure needed to be modified to earn the mark. Responses such as (5) should be marked correct, with an assumption that the modifier should distribute over the conjunction.

(5) high temperature and pressure

While the precedence predicate is adequate to capture this behaviour, using a full parser creates an ambiguity between the analyses (6) and (7).

(6) (high (pressure)) and temperature ×

(7) (high (pressure and temperature)) √

This example suggests that high accuracy can be difficult to achieve by systems which commit to an early, single interpretation of the ambiguous text.

So a full example of a matching rule might be:

    term(R, oil, I) ∧ term(R, less, J) ∧ template(R, dens, K) ∧ precedes(I, J) → correct(R)

which would award the correct marks to responses (8) and (9):

(8) oil is less dense than water √

(9) water is less dense than oil ×

The use of a template also ensures the correct mark is awarded to the common response (10), which should also be marked as correct:

(10) oil has less density than water
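
To make the predicate semantics above concrete, the following sketch shows one way the matching could be implemented: term and template apply the Damerau-Levenshtein edit distance of 1 described in section 2.1, and precedes is modelled as a comparison of token indices. The function names, the whitespace tokenisation and the encoding of the oil rule as a Python function are illustrative assumptions for exposition only; Amati itself represents such rules as Prolog clauses.

```python
# Illustrative sketch of the section 2.1 predicates.  This is not the Amati
# implementation; names and data structures are assumptions for exposition.

def dl_distance(a, b):
    """Damerau-Levenshtein distance with adjacent transpositions (OSA variant)."""
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                curr[j] = min(curr[j], prev2[j - 2] + 1)   # transposition
        prev2, prev = prev, curr
    return prev[-1]

def term(tokens, word, i):
    """term(R, Word, I): the i-th token equals `word`, allowing one spelling edit."""
    return i < len(tokens) and dl_distance(tokens[i], word) <= 1

def template(tokens, prefix, i):
    """template(R, Prefix, I): some prefix of the i-th token is within one edit of `prefix`."""
    if i >= len(tokens):
        return False
    tok = tokens[i]
    return any(dl_distance(tok[:n], prefix) <= 1
               for n in range(len(prefix) - 1, len(prefix) + 2) if 0 < n <= len(tok))

def matches_oil_rule(response):
    """term(R, oil, I) ∧ term(R, less, J) ∧ template(R, dens, K) ∧ precedes(I, J)."""
    tokens = response.lower().split()
    idx = range(len(tokens))
    return any(term(tokens, "oil", i) and term(tokens, "less", j)
               and template(tokens, "dens", k) and i < j    # precedes(I, J)
               for i in idx for j in idx for k in idx)

for r in ["oil is less dense than water",      # (8)  matched
          "water is less dense than oil",      # (9)  not matched
          "oil has less density than water"]:  # (10) matched
    print(matches_oil_rule(r), "-", r)
```

Run over responses (8)-(10), the sketch matches (8) and (10) but not (9), mirroring the marks shown above.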

2.2 Incremental rule authoring

The Amati system is based on a bootstrapping scheme, which allows an author to construct a ruleset by marking student responses in increments of 50 responses at a time, while constructing marking rules which reflect the marker's own judgements. As the marker develops the mark scheme, he or she can correct the marks awarded by the existing mark scheme, and then edit the mark scheme to more accurately reflect the intended marks.

The support for these operations is illustrated in figures 2 and 3. To make the system more usable by non-specialists (that is, non-specialists in computing, rather than non-specialists in the subject being taught), the authors are not expected to work directly with Prolog. Rather, rules are presented to the user via online forms, as shown in figure 2. As each rule is developed, the system displays to the user the responses which the rule marks as correct.

    Figure 2: Form for entering marking rules

As increasingly large subsets of the student responses are marked, the system displays the set of imported responses, the mark that the current mark scheme awards, and which rule(s) match against each response (figure 3). This allows the mark scheme author to add or amend rules as necessary.

    Figure 3: Application of rule to the Oil response set

2.3 Rule Induction

As the marker constructs an increasingly large collection of marked responses, it can be useful to use the marked responses to induce further rules automatically. Methods for learning relational rules to perform information extraction are now well established (Califf and Mooney, 1997; Soderland, 1999), with Inductive Logic Programming (ILP) (Quinlan, 1990; Lavrač and Džeroski, 1994) often proving a suitable learning technique (Aitken, 2002; Ramakrishnan et al., 2008). ILP is a supervised learning algorithm which attempts to generate a logical description of a set of facts in the style of a Prolog program. Amati embeds the ILP system Aleph (Srinivasan, 2004) as the rule learner, which itself implements the Progol learning algorithm (Muggleton, 1995), a bottom up, greedy coverage algorithm. This allows an author to mark the current set of questions (typically the first or second block of 50 responses), before using the ILP engine to generate a rule set which he or she can then modify. We return to the question of editing rule sets in section 4.1.
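
The following is a minimal toy sketch of the bottom-up, greedy coverage strategy described above. It is deliberately not a substitute for Aleph or Progol: it proposes only conjunctions of "the response contains this word" conditions, ignores spelling correction and word order, and all of the function names and example responses are invented for illustration.

```python
# A toy greedy-coverage rule learner, loosely in the spirit of the bottom-up
# Progol/Aleph search described above.  It is NOT Aleph: every name and the
# sample data are assumptions made purely for exposition.

def words(response):
    return set(response.lower().split())

def covers(rule, response):
    """A rule is a set of words that must all occur in the response."""
    return rule <= words(response)

def learn_rules(positives, negatives):
    """Greedily add rules until every positive (correct) response is covered."""
    rules, uncovered = [], list(positives)
    while uncovered:
        seed, rule, neg = uncovered[0], set(), list(negatives)
        # Bottom-up: grow a rule from one uncovered example, greedily adding
        # the word that rules out the most remaining negative responses.
        while neg:
            candidates = sorted(words(seed) - rule)
            if not candidates:
                break
            best = max(candidates,
                       key=lambda w: sum(w not in words(n) for n in neg))
            rule.add(best)
            neg = [n for n in neg if covers(rule, n)]
        rules.append(frozenset(rule))
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return rules

positives = ["there is no net force",
             "all the forces are balanced",
             "the forces are in equilibrium"]
negatives = ["gravity is the only force",
             "there is a net downward force"]

for rule in learn_rules(positives, negatives):
    print("mark correct if the response contains:", sorted(rule))
```

On this toy data the learner stops at very general single-word rules (for example, "contains no"), because they already exclude both negative examples; much as with the ILP-proposed rules discussed in section 4.1, a marker would normally want to tighten such rules by hand.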

Our use of ILP to support markers in developing rules has several parallels with the Powergrading project (Basu et al., 2013). Both our work and that of Basu et al. focus on using NLP techniques primarily to support the work of a human marker, and reduce marker effort. Basu et al. take an approach whereby student responses are clustered using statistical topic detection, with the marker then able to allocate marks and feedback at the cluster level, rather than at the individual response level. Similarly, the aim of Amati is that markers should be able to award marks by identifying, via generated rules, commonly occurring phrases. The use of such phrases can then be analysed at the cohort level (or at least, incrementally at the cohort level), rather than at the individual response level.

In practice, we found that markers were likely to use the predicted rules as a "first cut" solution, to gain an idea of the overall structure of the final mark scheme. The marker could then concentrate on developing more fine-grained rules to improve the mark scheme accuracy. This usage appears to reflect that found by Basu et al., of using the machine learning techniques to automatically identify groups of similar responses. This allows the marker to highlight common themes and frequent misunderstandings.

    Short name   Question text
    Sandstone    A sandstone observed in the field contains well-sorted, well rounded, finely pitted and reddened grains. What does this tell you about the origins of this rock?
    Snowflake    A snowflake falls vertically with a constant speed. What can you say about the forces acting on the snowflake?
    Charge       If the distance between two electrically charged particles is doubled, what happens to the electric force between them?
    Rocks        Metamorphic rocks are existing rocks that have "changed form" (metamorphosed) in a solid state. What conditions are necessary in order for this change to take place?
    Sentence     What is wrong with the following sentence? A good idea.
    Oil          The photograph (not shown here) shows a layer of oil floating on top of a glass of water. Why does the oil float?

    Table 1: The questions used

3 Evaluation

The aim of the evaluation was to determine whether a ruleset built using Amati could achieve performance comparable with human markers. As such, there were two main aims. First, to determine whether the proposed language was sufficiently expressive to build successful mark schemes, and second, to determine how well a mark scheme developed using the Amati system would compare against a human marker.

3.1 Training and test set construction

A training set and a test set of student responses were built from eight questions taken from an entry-level science module, shown in table 1. Each student response was to be marked as either correct or incorrect. Two sets of responses were used, which were built from two subsequent presentations of the same module. Amati was used to build a mark scheme using a training set of responses from the 2008 student cohort, and then that scheme was applied to an unseen test set constructed from the responses to the same questions from the 2009 student cohort.

The difficulties in attempting to build any corpus in which the annotations are reliable are well documented (Marcus et al.'s (1993) discussion of the Penn Treebank gives a good overview). In this case, we exploited the presence of the original question setter and module chair to provide as close to a "ground truth" as is realistic. Our gold-standard marks were obtained with a multiple-pass annotation process, in which the collections of responses were initially marked by two or more subject-specialist tutors, who mainly worked independently, but who were able to confer when they disagreed on a particular response. The marks were then validated by the module chair, who was also called upon to resolve any disputes which arose as a result of disagreements in the mark scheme.

The cost of constructing a corpus in this way would usually be prohibitive, relying as it does on subject experts both to provide the preliminary marks, and to provide a final judgement in the reconciliation phase. In this case, the initial marks (including the ability to discuss in the case of a dispute) were generated as part of the standard marking process for student assessment in the University⁴.

⁴ We have not presented inter-annotator agreement measures here, as these are generally only meaningful when annotators have worked independently. This model of joint annotation with a reconciliation phase is little discussed in the literature, although this is a process used by Farwell et al. (2009). Our annotation process differed in that the reconciliation phase was carried out face to face following each round of annotation, in contrast to Farwell et al.'s, which allowed a second anonymous vote after the first instance.

    Question     # responses   accuracy/%   α/%
    Sandstone    1711          98.42        97.5
    Snowflake    2057          91.0         81.7
    Charge       1127          98.89        97.6
    Rocks        1429          99.00        89.6
    Sentence     1173          98.19        97.5
    Oil           817          96.12        91.5

    Table 2: Accuracy of the Amati mark schemes on unseen data, and the Krippendorff α rating between the marks awarded by Amati and the gold standard

3.2 Effectiveness of authored mark schemes

To investigate the expressive power of the representation language, a set of mark schemes for the eight questions shown in table 1 were developed using the Amati system. The training data was used to build the rule set, with regular comparisons against the gold standard marks. The mark scheme was then applied to the test set, and the marks awarded compared against the test set gold standard marks.

The results are shown in table 2. The table shows the total number of responses per question, and the accuracy of the Amati rule set applied to the unseen data set. So for example, the Amati rule set correctly marked 98.42% of the 1711 responses to the Sandstone question. Note that the choice of accuracy as the appropriate measure of success is determined by the particular application. In this case, the important measure is how many responses are marked correctly. That is, it is as important that incorrect answers are marked as incorrect, as it is that correct answers are marked as correct.

To compare the performance of the Amati ruleset against the human expert, we have used Krippendorff's α measure, implemented in the Python NLTK library (Bird et al., 2009), following Artstein and Poesio's (2008) presentation. The rightmost column of table 2 shows the α measure between the Amati ruleset and the post-reconciliation marks awarded by the human expert. This column shows a higher level of agreement than was obtained with human markers alone. The agreement achieved by independent human markers ranged from a maximum of α = 88.2% to a minimum of α = 71.2%, which was the agreement on marks awarded for the snowflake question. It is notable that the human marker agreement was worst on the same question that the Amati-authored ruleset performed worst on; we discuss some issues that this question raises in section 4.3.

The marks awarded by the marker supported with Amati therefore aligned more closely with those of the human expert than was achieved between independent markers. This suggests that further development of computer support for markers is likely to improve overall marking consistency, both across the student cohort, and by correspondence with the official marking guidance.
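
For readers who want to reproduce this kind of agreement figure, the snippet below shows the calling pattern for the NLTK implementation referred to above. The response identifiers and marks are invented toy data, not the marks from this study; only the use of the library is the point.

```python
# Krippendorff's alpha between two sets of marks (e.g. an Amati ruleset and a
# gold standard) using NLTK's AnnotationTask.  The marks are toy data.
from nltk.metrics.agreement import AnnotationTask

amati_marks = {"r1": "correct", "r2": "incorrect", "r3": "correct", "r4": "correct"}
gold_marks  = {"r1": "correct", "r2": "incorrect", "r3": "incorrect", "r4": "correct"}

data = ([("amati", item, label) for item, label in amati_marks.items()]
        + [("gold", item, label) for item, label in gold_marks.items()])

task = AnnotationTask(data=data)
print("Krippendorff's alpha: %.3f" % task.alpha())
```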

4 Observations on authoring rulesets

It is clear from the performance of the different rule sets that some questions are easier to generate mark schemes for than others. In particular, the mark scheme authored on the responses to the snowflake question performed with much lower accuracy than the other questions. This section gives a qualitative overview of some of the issues which were observed while authoring the mark schemes.

4.1 Modification of generated rules

A frequently cited advantage of ILP is that, as a logic program, the output rules are generated in a human-readable form (Lavrač and Džeroski, 1994; Mitchell, 1997). In fact, the inclusion of templates means that several of the rules can be hard to interpret at first glance. For example, a rule proposed to mark the Rocks question was:

    template(R, bur, I) ∧ template(R, hea, J) → correct(R)

As the official marking guidance suggests that High temperature and pressure is an acceptable response, hea can easily be interpreted as heat. However, it is not immediately clear what bur represents. In fact, a domain expert would probably recognise this as a shortened form of buried (as the high pressure, the second required part of the solution, can result from burial in rock). As the training set does not contain terms with the same first characters as burial, such as burnished, Burgundy or burlesque, this term matches. However, a mark scheme author might prefer to edit the rule slightly into something more readable and so maintainable:

    template(R, buri, I) ∧ term(R, heat, J) → correct(R)

so that either buried or burial would be matched, and to make the recognition of heat more explicit.

A more complex instance of the same phenomenon is illustrated by the generated rule:

    term(R, high, I) ∧ term(R, temperature, J) → correct(R)

Although the requirement for the terms high and temperature is clear enough, there is no part of this rule that requires that the student also mention high pressure. This has come about because all the student responses that mention high temperature also explicitly mention pressure. Because Progol and Aleph use a greedy coverage algorithm, in this case Amati did not need to add an additional rule to capture pressure. Again, the mark scheme author would probably wish to edit this rule to give:

    term(R, high, I) ∧ term(R, temperature, J) ∧ term(R, pressure, K) ∧ precedes(I, J) → correct(R)

which covers the need for high to precede temperature, and also contains a reference to pressure. A similar case, raised by the same question, is the following proposed rule:

    term(R, high, I) ∧ term(R, pressure, J) ∧ term(R, and, K) ∧ precedes(I, K) → correct(R)

which requires a conjunction, but makes no mention of temperature (or heat or some other equivalent). In this case, the responses (11) and (12):

(11) (high (pressure and temperature)) √

(12) (high (pressure and heat)) √

are both correct, and both appeared amongst the student responses. However, there were no incorrect responses following a similar syntactic pattern, such as, for example, (13) or (14):

(13) high pressure and altitude ×

(14) high pressure and bananas ×

Students who recognised that high pressure and something else were required, always got the something else right. Therefore, the single rule above had greater coverage than rules that looked individually for high pressure and temperature or high pressure and heat.

This example again illustrates the Amati philosophy that the technology is best used to support human markers. By hand-editing the proposed solutions, the marker ensures that the rules are more intuitive, and so can be more robust, and more maintainable in the longer term. In this case, an author might reasonably rewrite the single rule into two:

    term(R, high, I) ∧ term(R, pressure, J) ∧ term(R, temperature, K) ∧ precedes(I, K) → correct(R)

    term(R, high, I) ∧ term(R, pressure, J) ∧ term(R, heat, K) ∧ precedes(I, K) → correct(R)

removing the unnecessary conjunction, and providing explicit rules for heat and temperature.

4.2 Spelling correction

It is questionable whether spelling correction is always appropriate. A question used to assess knowledge of organic chemistry might require the term butane to appear in the solution. It would not be appropriate to mark a response containing the token butene (a different compound) as correct, even though butene would be an allowable misspelling of butane according to the given rules. On the other hand, a human marker would probably be inclined to mark responses containing buttane or butan as correct. These are also legitimate misspellings according to the table, but are less likely to be misspellings of butene.

The particular templates generated reflect the linguistic variation in the specific datasets. A template such as temp, intended to cover responses containing temperature (for example), would also potentially cover temporary, tempestuous, temperamental and so on. In fact, when applied to large sets of homogenous response-types (such as multiple responses to a single question), the vocabulary used across the complete set of responses turns out to be sufficiently restricted for meaningful templates to be generated. It does not follow that this hypothesis language would continue to be appropriate for datasets with a wider variation in vocabulary.

4.3 Diversity of correct responses

As illustrated in table 2, the Snowflake question was very tricky to handle, with lower accuracy than the other questions, and lower agreement with the gold standard. The following are some of the student responses:

(15) they are balanced

(16) the force of gravity is in balance with air resistance

(17) friction is balancing the force of gravity

(18) only the force of gravity is acting on the hailstone and all forces are balanced

The module chair considered responses (15), (16) and (17) to be correct, and response (18) to be incorrect.

The most straightforward form of the answer is along the lines of response (15). In this case, there are no particular forces mentioned; only a general comment about the forces in question. Similar cases are there are no net forces, all forces balance, the forces are in equilibrium and so on.

However, responses (16) and (17) illustrate that in many cases, the student will present particular examples to attempt to answer the question. In these cases, both responses identify gravity as one of the acting forces, but describe the counteracting force differently (as air resistance and friction respectively). A major difficulty in marking this type of question is predicting the (correct) examples that students will use in their responses, as each correct pair needs to be incorporated in the mark schemes. A response suggesting that air resistance counteracts drag would be marked incorrect. As stated previously, developing robust mark schemes requires that mark scheme authors use large sets of previous student responses, which can provide guidance on the range of possible responses.

Finally, response (18) illustrates a difficult response to mark (for both pattern matchers and linguistic solutions). The response consists of two conjoined clauses, the second of which, all forces are balanced, is in itself a correct answer. It is only in the context of the first clause that the response is marked incorrect, containing the error that it is only the force of gravity which acts.

This question highlights that the ease with which a question can be marked automatically can depend as much on the question being asked as the answers received. Of course, this also applies to questions intended to be marked by a human; some questions lead to easier responses to grade. So a good evaluation of a marking system needs to consider the questions (and the range of responses provided by real students) being asked; the performance of the system is meaningful only in the context of the nature of the questions being assessed, and an understanding of the diversity of correct responses. In this case, it appears that questions which can be correctly answered by using a variety of different examples should be avoided. We anticipate that with increasing maturity of the use of automatic marking systems, examiners would develop skills in setting appropriate questions for the marking system, just as experienced authors develop skills in setting questions which are appropriate for human markers.

4.4 Anaphora Ambiguity

The examples raise some interesting questions about how anaphora resolution should be dealt with. Two responses to the oil question are:

(19) The oil floats on the water because it is lighter

(20) The oil floats on the water because it is heavier

These two responses appear to have contradictory meanings, but in fact are both marked as correct. This initially surprising result arises from the ambiguity in the possible resolutions of the pronoun it:

(21) [The oil]i floats on the water because iti is lighter.

(22) The oil floats on [the water]j because itj is heavier.

When marking these responses, the human markers followed a general policy of giving the benefit of the doubt, and, within reasonable limits, will mark a response as correct if any of the possible interpretations would be correct relative to the mark scheme.

As with the ambiguous modifier attachment seen in responses (6) and (7), this example illustrates that using a different (possibly better) parser is unlikely to improve the overall system performance. Responses such as (21) and (22) are hard for many parsers to handle, because an early commitment to a single interpretation can assume that it must refer to the oil or the water. Again, this example demonstrates that a more sophisticated approach to syntactic ambiguity is necessary if a parser-based system is to be used. (One possible approach might be to use underspecification techniques (König and Reyle, 1999; van Deemter and Peters, 1996) and attempt to reason with the ambiguous forms.)

5 Discussion and Conclusions

We have presented a system which uses information extraction techniques and machine learning to support human markers in the task of marking free text responses to short answer questions. The results suggest that a system such as Amati can help markers create accurate, reusable mark schemes.

The user interface to Amati was developed in collaboration with experienced markers from the Open University's Computing department and Science department, who both gave input into the requirements for an effective marking system. We intend to carry out more systematic analyses of the value of using such systems for marking, but informally, we have found that a set of around 500-600 responses was enough for an experienced marker to feel satisfied with the performance of her own authored mark scheme, and to be prepared to use it on further unseen cases. (This number was for the Snowflake question, which contained approximately half correct responses. For the other questions, the marker typically required fewer responses.)

The work described in this paper contrasts with the approach commonly taken in automatic marking, of developing mechanisms which assign marks by comparing student responses to one or more target responses created by the subject specialist (Ziai et al., 2012). Such systems have proven effective where suitable linguistic information is compared, such as the predicate argument structure used by c-rater (Leacock and Chodorow, 2003), or similarity between dependency relationships, as used by AutoMark (now FreeText (Mitchell et al., 2002)) and Mohler et al. (2011). Our own experiments with FreeText found that incorrect marks were often a result of an inappropriate parse by the embedded Stanford parser (Klein and Manning, 2003), as illustrated by the parses (6) and (7). In practice, we have found that for the short answers we have been considering, pattern based rules tend to be more robust in the face of such ambiguity than a full parser.

A question over this work is how to extend the technique to more linguistically complex responses. The questions used here are all for a single mark, all or nothing. A current direction of our research is looking at how to provide support for more complicated questions which would require the student to mention two or more separate pieces of information, or to reason about causal relationships. A further area of interest is how the symbolic analysis of the students' responses can be used to generate meaningful feedback to support them as part of their learning process.

Acknowledgments

The author would like to thank Sally Jordan for her help in collecting the students' data, for answering questions about marking as they arose and for making the data available for use in this work. Also, David King who developed the Amati system, and Rita Tingle who provided feedback on the usability of the system from a marker's perspective. We would also like to thank the anonymous reviewers for their valuable suggestions on an earlier draft of the paper.

References

James Stuart Aitken. 2002. Learning information extraction rules: An inductive logic programming approach. In ECAI, pages 355–359.
Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.
Stephen P. Balfour. 2013. Assessing writing in MOOCs: Automated essay scoring and calibrated peer review. Research & Practice in Assessment, 8(1):40–48.
Sumit Basu, Chuck Jacobs, and Lucy Vanderwende. 2013. Powergrading: A clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics, 1:391–402.
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.
Steven Burrows, Iryna Gurevych, and Benno Stein. 2015. The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1):60–117.
Philip G. Butcher and Sally E. Jordan. 2010. A comparison of human and computer marking of short free-text student responses. Computers and Education.
Mary Elaine Califf and Raymond J. Mooney. 1997. Relational learning of pattern-match rules for information extraction. In T. M. Ellison, editor, Computational Natural Language Learning, pages 9–15. Association for Computational Linguistics.
F. J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171–176.
David Farwell, Bonnie Dorr, Rebecca Green, Nizar Habash, Stephen Helmreich, Eduard Hovy, Lori Levin, Keith Miller, Teruko Mitamura, Owen Rambow, Florence Reeder, and Advaith Siddharthan. 2009. Interlingual annotation of multilingual text corpora and FrameNet. In Hans C. Boas, editor, Multilingual FrameNets in Computational Lexicography: Methods and Applications. Mouton de Gruyter, Berlin.
Lorraine R. Gay. 1980. The comparative effects of multiple-choice versus short-answer tests on retention. Journal of Educational Measurement, 17(1):45–50.
Debra T. Haley. 2008. Applying Latent Semantic Analysis to Computer Assisted Assessment in the Computer Science Domain: A Framework, a Tool, and an Evaluation. Ph.D. thesis, The Open University.
Sally Jordan and Tom Mitchell. 2009. E-assessment for learning? The potential of short free-text questions with tailored feedback. British Journal of Educational Technology, 40(2):371–385.
Markus Junker, Michael Sintek, and Matthias Rinck. 1999. Learning for text categorization and information extraction with ILP. In J. Cussens, editor, Proceedings of the First Workshop on Learning Language in Logic, pages 84–93.
Judy Kay, Peter Reimann, Elliot Diebold, and Bob Kummerfeld. 2013. MOOCs: So Many Learners, So Much Potential... IEEE Intelligent Systems, 3:70–77.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, pages 423–430. Association for Computational Linguistics.
Esther König and Uwe Reyle. 1999. A general reasoning scheme for underspecified representations. In Hans Jürgen Ohlbach and U. Reyle, editors, Logic, Language and Reasoning. Essays in Honour of Dov Gabbay, volume 5 of Trends in Logic. Kluwer.
T. K. Landauer, D. Laham, and P. W. Foltz. 2003. Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. Shermis and J. Burstein, editors, Automated essay scoring: a cross-disciplinary approach, pages 87–112. Lawrence Erlbaum Associates, Inc.
Nada Lavrač and Sašo Džeroski. 1994. Inductive Logic Programming. Ellis Horwood, New York.
Claudia Leacock and Martin Chodorow. 2003. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4):389–405.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Tom Mitchell, Terry Russell, Peter Broomhead, and Nicola Aldridge. 2002. Towards robust computerised marking of free-text responses. In 6th International Computer Aided Assessment Conference, Loughborough.
