
Predicting the Difficulty of Multiple Choice Questions in a High-stakes Medical Exam

Le An Ha (1), Victoria Yaneva (2), Peter Baldwin (2), Janet Mee (2)

(1) Research Group in Computational Linguistics, University of Wolverhampton, UK
[email protected]
(2) National Board of Medical Examiners, Philadelphia, USA
{vyaneva, pbaldwin, jmee}@nbme.org

Abstract

Predicting the construct-relevant difficulty of Multiple-Choice Questions (MCQs) has the potential to reduce cost while maintaining the quality of high-stakes exams. In this paper, we propose a method for estimating the difficulty of MCQs from a high-stakes medical exam, where all questions were deliberately written to a common reading level. To accomplish this, we extract a large number of linguistic features and embedding types, as well as features quantifying the difficulty of the items for an automatic question-answering system. The results show that the proposed approach outperforms various baselines with a statistically significant difference. Best results were achieved when using the full feature set, where embeddings had the highest predictive power, followed by linguistic features. An ablation study of the various types of linguistic features suggested that information from all levels of linguistic processing contributes to predicting item difficulty, with features related to semantic ambiguity and the psycholinguistic properties of words having a slightly higher importance. Owing to its generic nature, the presented approach has the potential to generalize over other exams containing MCQs.

1 Introduction

For many years, approaches from Natural Language Processing (NLP) have been applied to estimating reading difficulty, but relatively fewer attempts have been made to measure conceptual difficulty or question difficulty beyond linguistic complexity. In addition to expanding the horizons of NLP research, estimating the construct-relevant difficulty of test questions has a high practical value because ensuring that exam questions are appropriately difficult is both one of the most important and one of the most costly tasks within the testing industry. For example, test questions that are too easy or too difficult are less able to distinguish between different levels of examinee ability (or between examinee ability and a defined cut-score of some kind, e.g., pass/fail). This is especially important when scores are used to make consequential decisions such as those for licensure, certification, college admission, and other high-stakes applications [1]. To address these issues, we propose a method for predicting the difficulty of multiple choice questions (MCQs) from a high-stakes medical licensure exam, where questions of varying difficulty may not necessarily vary in terms of reading levels.

Owing to the criticality of obtaining difficulty estimates for items (exam questions) prior to their use for scoring, current best practices require newly-developed items to be pretested. Pretesting typically involves administering new items to a representative sample of examinees (usually between a few hundred and a few thousand), and then using their responses to estimate various statistical characteristics. Ideally, pretest data are collected by embedding new items within a standard live exam, although sometimes special data collection efforts may also be needed. Based on the responses, items that are answered correctly by a proportion of examinees below or above certain thresholds (i.e. items that are too easy or too difficult for almost all examinees) are discarded. While necessary, this procedure has a high financial and administrative cost, in addition to the time required to obtain the data from a sufficiently large sample of examinees.

[1] Examples of well-known high-stakes exams include the TOEFL (Test of English as a Foreign Language) (https://www.ets.org/toefl), the SAT (Scholastic Assessment Test) (https://collegereadiness.collegeboard.org/sat), and the USMLE (United States Medical Licensing Examination) (https://www.usmle.org/).
Here, we propose an approach for estimating the difficulty of expert-level MCQs, where the gold standard of item difficulty is defined through large-scale pretesting and is based on the responses of hundreds of highly-motivated examinees. Being able to automatically predict item difficulty from item text has the potential to save significant resources by eliminating or reducing the need to pretest the items. These savings are of even greater importance in the context of some automatic item generation strategies, which can produce tens of thousands of items with no feasible way to pretest them or identify which items are most likely to succeed. Furthermore, understanding what makes an item difficult other than manipulating its reading difficulty has the potential to aid the item-writing process and improve the quality of the exam. Last but not least, automatic difficulty prediction is relevant to automatic item generation as an evaluation measure of the quality of the produced output.

Contributions: i) We develop and test the predictive power of a large number of different types of features (e.g. embeddings and linguistic features), including innovative metrics that measure the difficulty of MCQs for an automatic question-answering system. The latter produced empirical evidence on whether parallels exist between question difficulty for humans and machines. ii) The results outperform a number of baselines, showing that the proposed approach measures a notion of difficulty that goes beyond linguistic complexity. iii) We analyze the most common errors produced by the models, as well as the most important features, providing insight into the effects that various item characteristics have on the success of predicting item difficulty. iv) Owing to the generic nature of the features, the presented approach is potentially generalizable to other MCQ-based exams. We make our code available [2] at: https://github.com/anomymous1/Survival-Prediction.

[2] The questions cannot be made available because of test security.

A 55-year-old woman with small cell carcinoma of the lung is admitted to the hospital to undergo chemotherapy. Six days after treatment is started, she develops a temperature of 38°C (100.4°F). Physical examination shows no other abnormalities. Laboratory studies show a leukocyte count of 100/mm3 (5% segmented neutrophils and 95% lymphocytes). Which of the following is the most appropriate pharmacotherapy to increase this patient's leukocyte count? (A) Darbepoetin (B) Dexamethasone (C) Filgrastim (D) Interferon alfa (E) Interleukin-2 (IL-2) (F) Leucovorin

Table 1: An example of a practice item

2 Related Work

The vast majority of previous work on difficulty prediction has been concerned with estimating readability (Flesch, 1948; Dubay, 2004; Kintsch and Vipond, 2014; François and Miltsakaki, 2012; McNamara et al., 2014). Various complexity-related features have been developed in readability research (see Dubay (2004) and Kintsch and Vipond (2014) for a review), starting from ones utilising surface lexical features (e.g. Flesch (1948)) to NLP-enhanced models (François and Miltsakaki, 2012) and features aimed at capturing cohesion (McNamara et al., 2014).

There have also been attempts to estimate the difficulty of questions for humans. This has been mostly done within the realm of language learning, where the difficulty of reading comprehension questions is strongly related to their associated text passages (Huang et al., 2017; Beinborn et al., 2015; Loukina et al., 2016). Another area where question-difficulty prediction is discussed is the area of automatic question generation, as a form of evaluation of the output (Alsubait et al., 2013; Ha and Yaneva, 2018). In many cases such evaluation is conducted through some form of automatic measure of difficulty (e.g., the semantic similarity between the question and answer options as in (Ha and Yaneva, 2018)) rather than through extensive evaluation with humans.
Past research has also focused on estimating the difficulty of open-ended questions in community question-answering platforms (Wang et al., 2014; Liu et al., 2013); however, these questions were generic in nature and did not require expert knowledge. Other studies use taxonomies representing knowledge dimensions and cognitive processes involved in the completion of a test task to predict the difficulty of short-answer questions (Padó, 2017) and to identify skills required to answer school science questions (Nadeem and Ostendorf, 2017). We build upon previous work by implementing a large number of complexity-related features, as well as testing various prediction models (Section 4).

While relevant in a broad sense, the above works are not directly comparable to the current task. Unlike community question answering, the questions used in this study were developed by experts and require the application of highly specialized knowledge. Reading exams, where comprehension difficulty is highly associated with text complexity, are also different from our medical MCQs, which are deliberately written to a common reading level (see Section 3). Therefore, the models need to capture a notion of difficulty that, in this context, goes beyond linguistic complexity.

3 Data

The data comprises 12,038 MCQs from the Clinical Knowledge component of the United States Medical Licensing Examination®. An example of a test item is shown in Table 1. The part describing the case is referred to as the stem, the correct answer option is called the key, and the incorrect answer options are known as distractors. The majority of the items in the data set used here had five or six answer options.

Item writing: All items tested medical knowledge and were designed to emulate real-life scenarios wherein examinees must first identify the relevant findings and then, based on these findings, make a diagnosis or take a clinical action. Items were written by experienced item-writers following a set of guidelines. These guidelines stipulated that the writers adhere to a standard structure and avoid excessive verbosity, "window dressing" (extraneous material not needed to answer the item), "red herrings" (information designed to mislead the test-taker), overly long or complicated stems or options, and grammatical cues (e.g., correct answers that are longer, more specific, or more complete than the other options, or the inclusion of the same word or phrase in both the stem and the correct answer). Item writers had to ensure that the produced items did not have flaws related to various aspects of validity. For example, flaws related to irrelevant difficulty include: stems or options that are overly long or complicated; numeric data not stated consistently; and language or structure of the options that is not homogeneous. Flaws related to "testwiseness" include: grammatical cues; a correct answer that is longer, more specific, or more complete than the other options; and a word or phrase that is included both in the stem and in the correct answer. Finally, stylistic rules concerning preferred usage of terms, formatting, abbreviations, conventions, drug names, and alphabetization of option sets were also enforced. The goal of standardizing items in this manner is to produce items that vary in difficulty and discriminating power due only to differences in the medical content they assess. This practice, while sensible, makes modeling difficulty very challenging.

Item administration: The questions in our data set were pretested by embedding them within live exams. In practice, the response data collected during pretesting is used to filter out items that are misleading, too easy, or too difficult based on various criteria. Only those items satisfying these criteria are eligible for use during scoring on subsequent test forms. The current set of items contains pretest data administered over four standard annual cycles between 2012 and 2015. The questions were embedded within a standard nine-hour exam and test-takers had no way of knowing which items were used for scoring and which were being pretested. Examinees were medical students from accredited [3] US and Canadian medical schools taking the exam for the first time as part of a multistep examination sequence required for medical licensure in the US.

Determining item difficulty: On average, each item was answered by 328 examinees (SD = 67.17). The difficulty of an item is defined as the proportion of its responses that are correct, which is commonly referred to in the educational-testing literature as its P-value [4]. The P-value is calculated in the following way:

$P_i = \frac{\sum_{n=1}^{N} U_n}{N},$

where $P_i$ is the P-value for item $i$, $U_n$ is the 0-1 score (correct-incorrect) on item $i$ earned by examinee $n$, and $N$ is the total number of examinees in the sample. As an example, a P-value of .3 means that the item was answered correctly by 30% of the examinees. The distribution of P-values for the data set is presented in Figure 1.

Figure 1: Distribution of the P-value variable

[3] Accredited by the Liaison Committee on Medical Education (LCME).
[4] We adopt this convention here but care should be taken not to confuse this usage with a p-value indicating statistical significance.
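For concreteness, the sketch below computes P-values from a 0-1 response matrix exactly as defined above. The array name and toy responses are illustrative only, since the actual response data cannot be released.

```python
import numpy as np

# Toy 0-1 response matrix: one row per examinee, one column per item.
# (Illustrative values only; the real responses are confidential.)
responses = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
])

# P-value of item i = mean of the 0-1 scores U_n over the N examinees.
p_values = responses.mean(axis=0)
print(p_values)  # [0.75 0.5  0.75], i.e. 75%, 50%, and 75% answered correctly
```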
4 Features

A number of features were modeled for P-value prediction and can be roughly divided into three classes. First, we extract embeddings, which have been found to have predictive power in many different applications. The second class of features includes more than one hundred linguistic characteristics, which account for differences in the way the items are written. Finally, a third class of features was based on the difficulty an item posed to an automated question-answering system, under the working hypothesis that this system difficulty has a positive relationship with the difficulty an item poses to human respondents. Information about each type of feature is presented below. Additional details can be found in the available code.

4.1 Embeddings

We experiment with two types of embeddings: Word2Vec (300 dimensions) (Mikolov et al., 2013) and ELMo (1,024 dimensions) (Peters et al., 2018). The embeddings were generated using approximately 22,000,000 MEDLINE abstracts [5], which were found to outperform other versions of the embeddings extracted from generic corpora (Google News Corpus [6] for Word2Vec and 1B Word (Chelba et al., 2013) for ELMo). Embeddings were aggregated at item level using mean pooling, where an average item embedding is generated from the embeddings of all words.

[5] https://www.nlm.nih.gov/bsd/medline.html
[6] https://news.google.com
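As an illustration of the mean-pooling step described above, the sketch below averages pre-trained word vectors over an item's tokens. The use of gensim, the vector file name, and the whitespace tokenizer are our assumptions rather than the paper's released pipeline.

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained vectors; the paper trains Word2Vec (300-d) and ELMo (1,024-d)
# on roughly 22 million MEDLINE abstracts. The file name here is a placeholder.
w2v = KeyedVectors.load_word2vec_format("medline_word2vec_300d.bin", binary=True)

def item_embedding(item_text: str) -> np.ndarray:
    """Mean-pool the vectors of all in-vocabulary tokens of an item."""
    tokens = item_text.lower().split()
    vectors = [w2v[t] for t in tokens if t in w2v]
    if not vectors:
        return np.zeros(w2v.vector_size)   # fall back to a zero vector
    return np.mean(vectors, axis=0)        # one fixed-length feature vector per item
```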

4.2 Linguistic features

This class of features includes the following subcategories.

Lexical Features: Previous research has found surface lexical features to be very informative in predicting text readability (Dubay, 2004). Lexical features in our experiments include counts, incidence scores and ratios for Content Word, Noun, Verb, Adjective, and Adverb; Numeral Count; Type-Token Ratio; Average Word Length In Syllables; and Complex Word Count (> 3 syllables).

Syntactic Features: These were implemented using information from the Stanford NLP Parser (Manning et al., 2014) and include: Average Sentence Length (words); Average Depth Of Tree; Negation Count; Negation In Stem; Negation In the Lead-In Question; NP Count; NP Count With Embedding (the number of noun phrases derived by counting all the noun phrases present in an item, including embedded NPs); Average NP Length; PP and VP Count; Proportion Passive VPs; Agentless Passive Count; Average Number of Words Before Main Verb; and Relative Clauses and Conditional Clauses Count.

Semantic Ambiguity Features: This subcategory concerns the semantic ambiguity of word concepts according to WordNet (WN), as well as medical concepts according to the UMLS (Unified Medical Language System) Metathesaurus (Schuyler et al., 1993). The features include Polysemic Word Index; Average Number of Senses of: Content Words, Nouns, Verbs, Adjectives, Adverbs; Average Distance To WN Root for: Nouns, Verbs, Nouns and Verbs; Total No Of UMLS Concepts; Average No Of UMLS Concepts; and Average No Of Competing Concepts Per Term (the average number of UMLS concepts that each medical term can refer to).

Readability Formulae: Flesch Reading Ease (Flesch, 1948); Flesch-Kincaid Grade Level (Kincaid et al., 1975); Automated Readability Index (ARI) (Senter and Smith, 1967); Gunning Fog index (Gunning, 1952); Coleman-Liau (Coleman, 1965); and SMOG index (McLaughlin, 1969).
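To show how a few of the surface measures listed above can be computed, here is a minimal sketch covering Type-Token Ratio, Complex Word Count, average word length in syllables, and Flesch Reading Ease. The regex-based syllable counter is a rough heuristic, and the function names are ours rather than those of the released feature extractor.

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; a pronunciation lexicon would be more accurate."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def surface_features(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = [count_syllables(w) for w in words]
    return {
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "complex_word_count": sum(s > 3 for s in syllables),   # words of > 3 syllables
        "avg_word_length_syllables": sum(syllables) / len(words),
        # Standard Flesch Reading Ease formula (Flesch, 1948).
        "flesch_reading_ease": 206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (sum(syllables) / len(words)),
    }

print(surface_features("Physical examination shows no other abnormalities."))
```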
Cognitively-Motivated Features: These are calculated based on information from the MRC Psycholinguistic Database (Coltheart, 1981), which contains cognitive measures based on human ratings for a total of 98,538 words. These features include Imagability, which indicates the ease with which a mental image of a word is constructed; Familiarity of the word for an adult; Concreteness; Age Of Acquisition; and finally Meaningfulness Ratio Colorado and Meaningfulness Ratio Paivio. The meaningfulness rating assigned to a word indicates the extent to which the word is associated with other words.

Word Frequency Features: These include Average Word Frequency, as well as threshold frequencies such as words not included in the most frequent words on the BNC frequency list (Not In First 2000/3000/4000 or 5000 Count).

Text Cohesion Features: These include counts of All Connectives, as well as Additive, Temporal, and Causal Connectives, and Referential Pronoun Count.

4.3 Information Retrieval (IR) features

The working hypothesis behind this group of features is that there is a positive correlation between the difficulty of questions for humans and for machines. To quantify machine-difficulty, we develop features based on information retrieval that capture how difficult it is for an automatic question-answering (QA) system to answer the items correctly. This was accomplished following the approaches to QA presented in Clark and Etzioni (2016).

First, we use Lucene [7] with its default options to index the abstracts of medical articles contained in the MEDLINE [8] database. Then for each test item we build several queries, corresponding to the stem and one answer option. We use three different settings: i) All words, ii) Nouns only, and iii) Nouns, Verbs, and Adjectives (NVA). We then get the top 5 MEDLINE documents returned by Lucene as a result of each query and calculate the sum of the retrieval scores. These scores represent the content of the IR features (Stem Only, Stem + Correct Answer, and Stem + Options, where for each of these configurations we have a different feature for All words, Nouns only, and NVA). The scores reflect how difficult it is for a QA system to choose the correct answer. Specifically, if the IR scores of Stem + Correct Answer are much higher than those of Stem + Options, then the QA system is more confident in its answer choice and the item is deemed relatively easy. This information can then be used to predict item difficulty.

[7] https://lucene.apache.org/
[8] https://www.nlm.nih.gov/bsd/medline.html
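The retrieval-score logic can be sketched as follows. The paper indexes MEDLINE abstracts with Lucene; in this sketch a scikit-learn TF-IDF index stands in for Lucene, only the all-words query setting is shown, and the corpus, function names, and top-5 cut-off are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in document collection; the paper indexes ~22M MEDLINE abstracts with Lucene.
corpus = [
    "filgrastim is used to treat chemotherapy-induced neutropenia ...",
    "darbepoetin stimulates erythropoiesis in patients with chronic anemia ...",
    "interferon alfa has been used for certain viral and malignant diseases ...",
]
vectorizer = TfidfVectorizer().fit(corpus)
doc_matrix = vectorizer.transform(corpus)

def retrieval_score(query: str, k: int = 5) -> float:
    """Sum of similarity scores of the top-k documents returned for a query."""
    sims = (doc_matrix @ vectorizer.transform([query]).T).toarray().ravel()
    return float(np.sort(sims)[::-1][:k].sum())

def ir_features(stem: str, key: str, distractors: list) -> dict:
    """Stem Only, Stem + Correct Answer, and Stem + Options scores for one item."""
    return {
        "stem_only": retrieval_score(stem),
        "stem_plus_key": retrieval_score(stem + " " + key),
        "stem_plus_options": sum(retrieval_score(stem + " " + opt)
                                 for opt in [key] + distractors),
    }
```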
5 Experiments

In this section we present our experiments on predicting the P-value.

First, we randomly divide the full data set into training (60%), validation (20%) and test (20%) sets for the purpose of evaluating a number of different algorithms [9] on the validation set. This was done using all features. The most notable results on algorithm selection are presented in Table 2. As can be seen from the table, the best results are obtained using the Random Forests (RF) algorithm (Breiman, 2001), which was selected for use in subsequent experiments.

                      r     RMSE
Random Forests        0.24  23.15
Linear Regression     0.17  25.65
SVM                   0.17  25.41
Gaussian Processes    0.2   23.87
Dense NN (3 layers)   0.16  25.85

Table 2: Results for algorithm selection (validation set)

[9] Parameters for the Neural Network algorithm: 3 dense layers of size 100, activation function: RELU, loss function: MSE, weight initialization: Xavier, and learning rate = 0.001. Trained for 500 epochs with early stopping after 10 epochs with no improvement.

5.1 Baselines

Five baselines were computed to evaluate model performance. The first baseline is the output of the ZeroR algorithm, which simply assigns the mean of the P-value variable in the training set as a prediction for every instance. Each of the four remaining baselines was based on a common feature known to be a strong predictor of reading difficulty: Word Count, Average Sentence Length, Average Word Length in Syllables, and the Flesch Reading Ease [10] formula (Flesch, 1948). These simple baselines allow us to assess whether the difficulty of the items in our data set can be reliably predicted using heuristics such as "longer items are more difficult" or "items using longer words and sentences are more difficult". The performances of the baselines as single features in an RF model (except ZeroR, which is an algorithm of its own) are presented in Table 3. In terms of Root Mean Squared Error (RMSE), the strongest baseline was ZeroR, with Average Word Length in Syllables producing somewhat similar results. All other baselines performed worse than ZeroR, showing that item length (Word Count), as well as Average Sentence Length and especially Flesch readability, are rather weak predictors of item difficulty for our data. These results provide empirical evidence in support of the claim that easy and difficult items do not differ in terms of surface readability, commonly measured through word and sentence length.

[10] While readability formulae are used as features in the models and their predictive power is assessed, it is acknowledged that the various formulae were validated on different types of texts than the MCQs in our data. Therefore, their performance is expected to be lower compared to applications which use the intended types of materials.
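A sketch of the model-fitting and evaluation step is shown below, assuming the features have already been assembled into a matrix X and the P-values into a vector y. scikit-learn is our choice of implementation here; the paper does not state which toolkit was actually used.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor           # ZeroR-style mean baseline
from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y, folds=10):
    """10-fold cross-validated RMSE, as reported for the training set."""
    scores = cross_val_score(model, X, y, cv=folds,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

# X: item-by-feature matrix, y: P-values (assembled elsewhere).
# rf_rmse    = rmse_cv(RandomForestRegressor(random_state=0), X, y)
# zeror_rmse = rmse_cv(DummyRegressor(strategy="mean"), X, y)
```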
5.2 P-value Prediction

We use various combinations of the features presented in Section 4 as input to an RF model to predict the P-value. The results are presented in Table 4. As can be seen from the table, using the full feature set performs best and is a statistically significant improvement over the strongest baseline (ZeroR), with an RMSE reduction of approximately one point (Training set (10-fold CV): p = 7.684e-10 with 95% Confidence Intervals (CI) from 10,000 bootstrap replicates: -0.9170, -0.4749. Test set: p = 2.20e-16 with 95% CI from 10,000 bootstrap replicates: -1.423, -0.952).

In terms of individual feature groups, Linguistic, W2V, and ELMo achieved comparable performance (RMSE ≈ 22.6 for the test set). The IR features performed notably worse (RMSE = 23.4 for the test set), which is also the only result that does not outperform the ZeroR baseline (p = 0.08, 95% CI: -0.5336, 0.0401). For reference, the next "worst" result is obtained by combining the IR and Linguistic features (RMSE = 22.63); nevertheless, this is a significant improvement over ZeroR (p = 5.517e-14 with 95% CI: -1.279, -0.756). Combining the Linguistic, W2V and ELMo features leads to a slight improvement in performance over their individual use, indicating that the signals captured by the different feature groups do not overlap entirely.

5.3 Analysis

Analysis of the 500 test-set items with the largest error residuals between predicted and actual values (the bottom 20% of the test-set predictions) revealed that the largest errors occur for items with very low P-values (µ = 32, SD = 13.39, min = 0, max = 62). This was expected given the skewness of the P-value variable towards the high end of the scale. These items (P-value < 62) account for 34.5% of the full data. Therefore, one possible explanation for these large errors is the fact that these items are underrepresented as training examples.

As a follow-up study, we looked into items with P-values under .20, which account for only 4.5% of the full data. These items are considered to be either highly misleading and/or very difficult, as test-takers systematically chose incorrect answer options and performed worse than chance (the majority of items had five or six answer options). Excluding this small percentage of items from the training and test sets resulted in substantial improvements in RMSE (20.04 after excluding the items compared to 22.45 before excluding them), and again outperformed ZeroR by a similar margin (20.98). This result shows that the success of the proposed approach would be higher for test samples with fewer extremely difficult or misleading items. It is acknowledged, however, that which items are too difficult or misleading can rarely be known a priori.

5.4 Feature Importance

Understanding the contributions of individual feature classes from the Linguistic set is useful for interpreting the models, as well as for informing future item-writing guidelines. To address this, we perform an ablation study where we remove one feature class at a time from the model using all Linguistic features.

As shown in Table 5, the removal of individual classes does not lead to dramatic changes in RMSE, suggesting that the predictive power of the Linguistic model is not dependent on a particular feature type (e.g. lexical, syntactic, semantic, etc.). Nevertheless, removal of the Semantic Ambiguity and the Cognitively-motivated features led to a slightly lower performance for both cross-validation on the training set and for the test set. Indeed, a correlation analysis between individual features and the P-value variable reveals that the top three features with the highest correlations are Age of Acquisition (-.11), Familiarity (.1038) and Referential Pronoun Incidence (.1035). Since the texts are domain-specific and contain a great deal of medical terminology, it is likely that the Age of Acquisition and Familiarity indices reflect the use of terms; however, further analysis is needed to confirm this.
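The ablation in Section 5.4 amounts to refitting the Linguistic model with one feature class removed at a time; a hedged sketch is given below. The feature_groups mapping (class name to column indices) and the fit_and_score callback, for which the rmse_cv helper from the previous sketch could be used, are our own illustrative names rather than the paper's code.

```python
def ablation_rmse(X_ling, y, feature_groups, fit_and_score):
    """RMSE of the all-Linguistic model and of each 'without one class' variant."""
    all_cols = sorted(c for cols in feature_groups.values() for c in cols)
    results = {"all_linguistic": fit_and_score(X_ling[:, all_cols], y)}
    for name, cols in feature_groups.items():
        kept = [c for c in all_cols if c not in set(cols)]   # drop one feature class
        results["without_" + name] = fit_and_score(X_ling[:, kept], y)
    return results
```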

                      Training set (10-fold CV)       Test set
                      r       MAE     RMSE            r       MAE     RMSE
ZeroR                 -0.02   19.9    24.09           0       19.67   23.65
Word Count            0.01    20.13   24.5            0.05    19.81   23.87
Av. Sent. Length      -0.006  20.76   25.52           0.04    20.2    24.58
Av. Word Length       0.05    19.89   24.14           0.07    19.6    23.63
Flesch Reading Ease   0.02    22.05   27.53           -0.01   22.27   27.61

Table 3: Baseline results using 10-fold cross validation on the training set and evaluating the models on the test set (r = correlation coefficient, MAE = Mean Absolute Error, RMSE = Root Mean Squared Error).

                    Training set (10-fold CV)       Test set
                    r       MAE     RMSE            r       MAE     RMSE
All                 0.27    18.88   23.15           0.32    18.53   22.45
Linguistic          0.26    19      23.22           0.29    18.73   22.62
IR                  0.17    19.58   23.91           0.18    19.28   23.4
W2V                 0.27    18.94   23.18           0.3     18.61   22.58
ELMo                0.27    18.95   23.18           0.29    18.77   22.64
Ling + IR           0.26    19.04   23.25           0.29    18.75   22.63
Ling + ELMo         0.27    19.08   23.19           0.3     18.79   22.61
Ling + W2V          0.28    18.9    23.14           0.31    18.65   22.54
IR + W2V            0.27    18.94   23.18           0.3     18.67   22.56
IR + ELMo           0.26    18.95   23.26           0.31    18.53   22.55
W2V + ELMo          0.28    18.84   23.13           0.32    18.51   22.5
IR + W2V + ELMo     0.27    18.88   23.18           0.3     18.56   22.56
IR + Ling + W2V     0.289   18.9    23.11           0.31    18.6    22.52
IR + Ling + ELMo    0.27    19      23.2            0.327   18.64   22.48

Table 4: Results for the training (10-fold CV) and test sets for various feature combinations.

                          CV RMSE   Test RMSE
All Linguistic            23.22     22.62
Without Lexical           23.3      22.49
Without Syntactic         23.23     22.66
Without Sem. ambiguity    23.31     22.89
Without Readability       23.22     22.59
Without Word Frequency    23.27     22.63
Without Cognitive         23.3      22.74
Without Cohesion          23.29     22.51

Table 5: Changes in RMSE after removing individual feature classes

6 Discussion

The experiments presented in the previous section provided empirical evidence that the difficulty of expert-level multiple-choice questions [11] can be predicted with accuracy significantly higher than various baselines. It was shown that simple metrics of complexity such as item length or average word and sentence length performed poorer than the ZeroR baseline, indicating that the difficulty of the items could not be predicted using surface readability measures. Best results were achieved when combining all types of available features (Linguistic, IR, Word2Vec, and ELMo), which showed a statistically significant improvement over the baselines. In terms of individual feature classes, the IR features performed poorly and were outperformed by the Linguistic, Word2Vec, and ELMo features, with the latter two being the strongest classes of predictors. Nevertheless, the fact that the combination of all the feature classes performed best supports the idea that the signals from the different feature groups did not overlap entirely and instead complemented each other. To understand whether the way the items were written had an effect on difficulty prediction and to gain insight into how item-writing could be improved, we analyzed the performance of the different types of Linguistic features. It was shown that the strength of the predictions was not due to a single linguistic feature; however, the strongest predictors were features related to semantic ambiguity and cognitively-motivated features (especially Age of Acquisition and Familiarity). Errors were largest for items at the lower end of the P-value scale, potentially because these items were underrepresented as training examples. Further experiments are needed to corroborate this.

[11] Requiring expert knowledge as opposed to general knowledge.

In terms of generalizability, the presented approach is not test-specific and can therefore be applied to other exams containing MCQs. The results are, however, highly dependent on the population of test-takers. In fact, predicting the P-value in our particular case was arguably more challenging than for other exams owing to the homogeneity of the test-taker population.
The majority of items were answered correctly by the majority of examinees because the test-takers were highly-able and highly-motivated medical students, who had already passed many other competitive high-stakes exams, including those for medical school admission. All else being equal, the expectation is that the performance of these models would improve for exams administered to, for example, examinees from K-12, where the ability of the test-takers has a higher variance and the distribution of P-values is less skewed. However, all else is not equal and K-12 exams have substantially different test questions, the effects of which are unknown. Further research is needed here.

The presented approach is a first step toward predicting item difficulty and, therefore, there are a number of avenues for future work that could lead to better results. One of these relates to having separate embeddings for the stem and answer options as opposed to item-level embeddings. Another interesting approach would be to divide the items by content category (e.g. psychiatric, cardiac, etc.). Content categories are not used as features in the current approach because there was no practical value in learning that, say, cardiac items are more difficult than psychiatric ones. However, it might be worthwhile to build content-specific models that predict item difficulty within-category (e.g., what are the relative item difficulties within the pool of psychiatric items). Finally, the performance of the IR features could be improved if the information is extracted from corpora that are more relevant (such as textbooks and examinee study materials) as opposed to medical abstracts.

The results presented in this paper have both practical and theoretical importance. Being able to predict the P-value of an MCQ reduces the cost of pretesting while maintaining exam quality. From a theoretical perspective, assessing difficulty beyond readability is an exciting new frontier that has implications for language understanding and cognition. Last but not least, such an application could also potentially be useful for assessing the performance of question-answering systems by predicting the difficulty of the questions for humans.

7 Conclusion

The paper presented an approach for predicting the construct-relevant difficulty of multiple-choice questions for a high-stakes medical licensure exam. Three classes of features were developed: linguistic features, embeddings (ELMo and Word2Vec), and features quantifying the difficulty of items for an automatic question-answering system (IR features). A model using the full feature set outperformed five different baselines (ZeroR, Word Count, Average Sentence Length, Average Word Length in Syllables, and the Flesch Reading Ease formula) with a statistically significant reduction of RMSE of approximately one point. Embeddings had the highest predictive power, followed by linguistic features, while the IR features were ranked least useful. The largest errors occurred for very difficult items, possibly due to the skewness of the data distribution towards items with a higher proportion of correct responses. Amongst the linguistic features, all classes contributed to predicting item difficulty, with the semantic-ambiguity and cognitively-motivated features having a slightly higher predictive power.

These results indicate the usefulness of NLP for predicting the difficulty of MCQs. While this study is an early attempt toward the goal of automatic difficulty prediction for MCQs, it has both theoretical and practical importance in that it goes beyond predicting linguistic complexity and in that it has the potential to reduce cost in the testing industry. Next steps include the application of the approach to other exam content administered to a different population of examinees, as well as various improvements in the methodology.

References

Tahani Alsubait, Bijan Parsia, and Ulrike Sattler. 2013. A similarity-based theory of controlling MCQ difficulty. In e-Learning and e-Technologies in Education (ICEEE), 2013 Second International Conference on, pages 283–288. IEEE.

Lisa Beinborn, Torsten Zesch, and Iryna Gurevych. 2015. Candidate evaluation strategies for improved difficulty prediction of language tests. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 1–11.

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Peter Clark and Oren Etzioni. 2016. My computer is an honor student - but how intelligent is it? Standardized tests as a measure of AI. AI Magazine, 37:5–12.

E. B. Coleman. 1965. On understanding prose: some determiners of its complexity. National Science Foundation, Washington, D.C.
Max Coltheart. 1981. The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology Section A, 33(4):497–505.

William H. Dubay. 2004. The Principles of Readability. Impact Information.

R. Flesch. 1948. A New Readability Yardstick. Journal of Applied Psychology, 32(3):221–233.

Thomas François and Eleni Miltsakaki. 2012. Do NLP and machine learning improve traditional readability formulas? In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, pages 49–57. Association for Computational Linguistics.

Robert Gunning. 1952. The Technique of Clear Writing. McGraw-Hill, New York.

Le An Ha and Victoria Yaneva. 2018. Automatic distractor suggestion for multiple-choice tests using concept embeddings and information retrieval. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 389–398.

Zhenya Huang, Qi Liu, Enhong Chen, Hongke Zhao, Mingyong Gao, Si Wei, Yu Su, and Guoping Hu. 2017. Question difficulty prediction for reading problems in standard tests. In AAAI, pages 1352–1359.

J. Peter Kincaid, Robert P. Fishburne, Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel, 8-75 edition. CNTECHTRA.

Walter Kintsch and Douglas Vipond. 2014. Reading comprehension and readability in educational practice and psychological theory. Perspectives on Learning and Memory, pages 329–365.

Jing Liu, Quan Wang, Chin-Yew Lin, and Hsiao-Wuen Hon. 2013. Question difficulty estimation in community question answering services. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 85–90.

Anastassia Loukina, Su-Youn Yoon, Jennifer Sakano, Youhua Wei, and Kathy Sheehan. 2016. Textual complexity as a predictor of difficulty of listening items in language proficiency tests. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3245–3253.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pages 55–60.

Harry G. McLaughlin. 1969. SMOG grading - a new readability formula. Journal of Reading, pages 639–646.

Danielle S. McNamara, Arthur C. Graesser, Philip M. McCarthy, and Zhiqiang Cai. 2014. Automated Evaluation of Text and Discourse with Coh-Metrix. Cambridge University Press.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Farah Nadeem and Mari Ostendorf. 2017. Language based mapping of science assessment items to skills. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 319–326.

Ulrike Padó. 2017. Question difficulty – how to estimate without norming, how to use for automated grading. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 1–10.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Peri L. Schuyler, William T. Hole, Mark S. Tuttle, and David D. Sherertz. 1993. The UMLS Metathesaurus: representing different views of biomedical concepts. Bulletin of the Medical Library Association, 81(2):217.

R. J. Senter and E. A. Smith. 1967. Automated Readability Index. Technical Report AMRL-TR-6620, Wright-Patterson Air Force Base.

Quan Wang, Jing Liu, Bin Wang, and Li Guo. 2014. A regularized competition model for question difficulty estimation in community question answering services. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1115–1126.