An item response curves analysis of the Force Concept Inventory

Gary A. Morris
Department of Physics and Astronomy, Valparaiso University, Valparaiso, Indiana 46383

Nathan Harshman
Department of Physics, American University, Washington, DC 20016

Lee Branum-Martin
Texas Institute for Measurement, Evaluation, and Statistics, University of Houston, Houston, Texas 77204

Eric Mazur
Department of Physics, Harvard University, Cambridge, Massachusetts 02138

Taha Mzoughi
Department of Biology and Physics, Kennesaw State University, Kennesaw, Georgia 30144

Stephen D. Baker
Department of Physics and Astronomy, Rice University, Houston, Texas 77005

(Received 30 November 2011; accepted 13 June 2012)

Several years ago, we introduced the idea of item response curves (IRC), a simplified form of item response theory (IRT), to the research community as a way to examine item performance on diagnostic instruments such as the Force Concept Inventory (FCI). We noted that a full-blown analysis using IRT would be a next logical step, which several authors have since taken. In this paper, we show that our simple approach not only yields conclusions about the performance of FCI items similar to those of the more sophisticated and complex IRT analyses but also permits additional insights by characterizing both the correct and incorrect answer choices. Our IRC approach can be applied to a variety of multiple-choice assessments but, as applied to a carefully designed instrument such as the FCI, allows us to probe student understanding as a function of ability level through an examination of each answer choice. We imagine that physics teachers could use IRC analysis to identify prominent misconceptions and tailor their instruction to combat those misconceptions, fulfilling the FCI authors' original intentions for its use. Furthermore, the IRC analysis can assist test designers to improve their assessments by identifying nonfunctioning distractors that can be replaced with distractors attractive to students at various ability levels.

Am. J. Phys. 80 (9), 825–831 (September 2012). © 2012 American Association of Physics Teachers. [http://dx.doi.org/10.1119/1.4731618]

I. INTRODUCTION

One of the joys of physics is that, unlike many other fields of inquiry, right and wrong answers are often unambiguous. Because of this apparent objectivity, multiple-choice tests remain an essential tool for assessment of physics teaching and learning. In particular, the Force Concept Inventory (FCI) (Ref. 1) is widely used in physics education research (PER).

Tests are designed with specific models in mind. The FCI was designed so that the raw score measures (in some sense) the ability of "Newtonian thinking." This article compares two methods of test analysis based on different models of the functionality of the FCI. These methods have the unfortunately similar names of item response curves (IRC) (Ref. 2) and item response theory (IRT) (e.g., Ref. 3). By comparing these methods, we will show that the model behind the IRC analysis is more consistent with that envisioned by the FCI designers. Additionally, it is easier to use and its results are easier to interpret in a meaningful way.

Any method of test analysis requires the construction of a model to quantify the properties of the complex, interacting "population + test" system. For example, in order to detect correlations in the data, models often assume that each test-taker has a true but unknown "ability." In assessing students, educators expect ability to be strongly correlated with raw total score on the test. Weighing each question equally in determining ability is a common model, but other methods, such as factor analysis or IRT, use correlations in the test response data to weigh the questions differently. Many standardized tests subtract a fraction of a point from the raw score to "penalize" guessing. This model assumes that the probability of guessing the right answer to a question does not depend on the specific question, that guessing any particular answer choice on a particular question is equally likely, and that guessing works the same way at all ability levels. Other, more sophisticated approaches, such as those in factor analysis or item response theory,3 statistically define the probability of response in order to estimate potentially different weights for items. These approaches can be viewed as drawing information from the correlations among items to estimate appropriate weights. By the very nature of a multiple-choice test, all models presume that there exists an unambiguously correct answer choice for every item. But a key difference in models is associated with how they handle the distractors.

IRC uses total score as a proxy for ability level, which implies that all items are equally weighed. In order to implement IRC analysis, item-level data are necessary. IRC analysis describes each item with trace lines for the proportion of examinees who select each answer choice, including the distractors, at each ability level (see Fig. 1).


Fig. 1. The IRCs for all 30 questions on the FCI using the approach of Morris et al. (Ref. 2). This figure is analogous to Fig. 4 in Wang and Bao (Ref. 4). Each graph shows the percentage of respondents (vertical axis) at each ability level (horizontal axis) on a given item selecting a given answer choice: + = 1; * = 2; ◇ = 3; △ = 4; □ = 5.

To make an IRC graph for a single item, we must separate the respondents by ability level and then determine the fraction at each ability level selecting each answer choice. For example, in our data set, there were 116 students of ability level 20 who answered Question 8. Of those, 81 selected Answer Choice 2, the correct answer. Thus, the trace line for the IRC of the correct answer (asterisks in Fig. 1) passes through the point (20, 69%). When repeated for all ability levels, all questions, and all answer choices, we arrive at the final product of IRC analysis: a set of traces for each item, such as those for the FCI appearing in Fig. 1.

IRT analysis derives student ability by a statistical analysis of item responses and does not necessarily weigh all questions equally. In fact, IRT uses a probabilistic model that is a convolution of examinee response patterns with a specific model of item functionality. The IRT analysis of Wang and Bao4 employs the three-parameter logistic (3PL) model to analyze the functionality of only the correct answer choice. The final product of the 3PL IRT model is an estimate for each item of three conceptually useful numbers: difficulty, discrimination, and guessing probability. While 3PL is one example of an IRT model, a variety of models with different algorithms and underlying assumptions have appeared in the literature.5,6 One of the strengths of IRT analysis is that these parameters should be robust with respect to estimation from different populations.

We argue that, in the case of the FCI, IRC analysis yields information overlooked or not easily obtained with other methods and better reflects the intentions of the designers of the FCI. The distractors in the FCI are tailored to specific physical misconceptions that had been identified (by student interviews and instructor experiences) to exist in the population. That is, different distractors in an FCI item were designed to function differently. The 3PL IRT analysis assumes that all wrong answers for a given item are equivalent. We will provide IRC analysis examples that show, in an easy-to-interpret graphical form, how different distractors appeal to students with varying abilities. If distractors are not all equally "wrong," we could estimate ability in a more sensitive manner, effectively giving partial credit for some wrong answers—those that suggest a deeper understanding—than other wrong answers (cf. Bock7). Furthermore, we may assess the effectiveness of items and alternative answer choices as to how well they function in measuring student understanding and misconceptions.2

The IRC analysis is particularly meaningful in the case of the FCI because IRT analyses demonstrate the strong correlation between ability and total score for this test.3 This quality justifies the use of raw score as a proxy for ability in the model that underlies IRC analysis.

Below we illustrate the power of IRC analysis in probing the performance of items on the FCI, with Sec. II comparing results from Ref. 2 with subsequent publications. Section III focuses on the differences between the 3PL IRT analysis and IRC analysis for the FCI and argues that IRC analysis is the more appropriate model for formative assessments like concept tests in PER. Section IV provides an in-depth analysis using the IRC approach for several FCI questions. We conclude in Sec. V.
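Before proceeding, we note that the IRC construction described above (binning students by total score and tabulating the fraction at each score selecting each choice) is simple enough to automate. The short Python sketch below is an illustration only, not the code used to produce Fig. 1; it assumes the raw data are available as a NumPy array of selected choices, one row per student and one column per item, with the answer key in a separate array, and it uses the equally weighted total score as the ability proxy, as in the IRC model described above.

# Illustrative sketch (not the authors' code): build item response curves
# from raw multiple-choice responses. Assumes `responses` is an array of
# shape (n_students, n_items) holding selected choices 1..5 (0 = blank)
# and `key` is a length-n_items array of correct choices.
import numpy as np

def item_response_curves(responses, key, n_choices=5):
    responses = np.asarray(responses)
    key = np.asarray(key)
    n_students, n_items = responses.shape
    total = (responses == key).sum(axis=1)        # ability proxy: equally weighted total score
    levels = np.arange(n_items + 1)               # possible total scores, 0..30 for the FCI
    ircs = np.zeros((n_items, n_choices + 1, len(levels)))
    for s in levels:
        group = responses[total == s]             # all students at this ability level
        if group.shape[0] == 0:
            continue                              # leave sparsely populated levels at zero
        for item in range(n_items):
            for choice in range(n_choices + 1):   # choice 0 collects "no response"
                ircs[item, choice, s] = np.mean(group[:, item] == choice)
    return levels, ircs

# Example: the trace for the correct answer (Choice 2) of Question 8 at
# ability level 20, i.e., the point (20, 69%) quoted in the text:
# levels, ircs = item_response_curves(responses, key)
# print(ircs[7, 2, 20])                           # Question 8 is row index 7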


II. INSIGHTS FROM RECENT ANALYSES COMPARED WITH IRC ANALYSIS IN REF. 2

In Ref. 2, we outlined our IRC analysis approach to evaluating the effectiveness of test questions and applied that technique to three sample questions from the FCI. Our technique provides a practical approach by which a broad cross-section of physics teachers could engage in a more thorough understanding of the functionality of items on the FCI specifically and multiple-choice tests more generally. Furthermore, our approach allows test designers to evaluate not only the efficacy of test items but also, within those items, the functionality of specific answer choices, both correct answer choices and distractors. Finally, we suggested the possibility for evaluating student knowledge level based not only upon the traditional dichotomous right/wrong scoring but also upon the specific incorrect answer choices selected.

Since the publication of Ref. 2, several studies have used IRT models to assess student performance and test item functionality, though to date, ours remains the only paper to examine the performance of the distractors within each item. Ding and Beichner8 provide an overview of analysis techniques used to probe measurement characteristics of multiple-choice questions, including classical test theory, factor analysis, cluster analysis, and IRT. Schurmeier et al.9 recently described a 3PL IRT analysis of General Chemistry assessments at the University of Georgia and found implications for the impact of item wording on question difficulty and discrimination. Marshall et al.10 applied full-information factor analysis (FIFA) and a 3PL IRT model to 2003–2004 data from the Texas Assessment of Knowledge and Skills (TAKS) for science given to 10th and 11th grade students. This analysis indicated that the test was more a measure of generic testing ability and basic quantitative skills than specific proficiency in biology, chemistry, or physics and suggested the utility of a 3PL analysis in the design process for future assessments. (Marshall et al. also noted that Ref. 2 was similar to "item-observed score regression" described in Lord.11 Indeed, we were unaware of this reference at the time of our original publication.)

Most relevant to our original work are the papers by Planinic et al.12 and Wang and Bao.4 Planinic et al. used a Rasch model (in effect, a one-parameter IRT model, distinguishing items by difficulty) to evaluate the functionality of the FCI for 1676 Croatian high school students and 141 Croatian college students. They identified problems with the functionality of some items on the FCI and made suggestions for improvement. Wang and Bao employed 3PL IRT analysis to the FCI. As was the case with the study by Marshall et al., Wang and Bao used a 3PL model. The strength of 3PL IRT is the straightforward way in which it estimates difficulty and discrimination parameters by fitting the response data to a curve of probability of correct response plotted against ability. One limitation of such an approach is that, although 3PL IRT can model respondent guessing, it assumes that all distractors within an item function equally. We have demonstrated this not to be the case for the FCI. One way to overcome this weakness is to use a full-blown nominal IRT model in which response probabilities for each distractor are estimated, conditional on student ability. For example, the nominal response model of Bock7 can be fit in MULTILOG software, but Bock's parameters are not directly interpretable the way that the parameters "a," "b," and "c" of the 3PL in Wang and Bao are. Alternatively, the IRC approach can yield reasonably sophisticated analyses and is quite a bit easier to implement. Other approaches include the nonparametric approach of Ramsay's TestGraf, a graphical approach to item analysis that makes fewer assumptions than standard IRT.10

For comparison with the approach of Wang and Bao and as an example to teachers and researchers looking for a practical approach to investigate item functionality, we applied IRC analysis to all 30 questions on the FCI. We used total score as a proxy for ability level, an approach supported by the findings of Wang and Bao—see their Fig. 5 and analysis that suggests a strong correlation between their IRT-based estimate of student ability (θ) and total test score: r² = 0.994. The high correlation between the total score and model-estimated ability does not mean that individual items are equivalent or that wrong responses to specific items are equivalent. We note that there may be value in the particular wrong answers given: experienced teachers know that some wrong answers are better (i.e., indicate a higher, though still imperfect, state of understanding) than others. Using that information could provide the next logical step in diagnosing dysfunctional items or distractors and may lead to more reliable assessments of student ability levels (a benefit noted by Bock7 in the nominal response model).

Figure 1 shows the IRCs for all 30 FCI questions from a database of >4500 student responses compiled by Harvard University, Mississippi State University, and Rice University, while Table I relates the percentage of students who answered each question correctly. This figure provides an analog to Fig. 4 in Wang and Bao. Some of the differences between the sets of graphs in the figures may be attributed to the fact that the data sets are independent of one another. The IRC analysis graphs emphasize the information in the distractors, which provides an instructor with an important diagnostic tool for evaluating student misunderstandings and item functionality, empowering instructors both to correct the most troublesome student misconceptions and develop more efficient test items.

Table I. The percentage of students who responded correctly to each question on the FCI.

Question  % Correct    Question  % Correct    Question  % Correct
   1        71.6          11       25.7          21       32.9
   2        34.6          12       65.2          22       42.1
   3        51.5          13       26.3          23       39.6
   4        37.1          14       39.5          24       66.6
   5        19.2          15       29.0          25       21.4
   6        73.6          16       59.2          26       13.2
   7        66.4          17       17.6          27       59.4
   8        50.4          18       22.2          28       36.7
   9        45.9          19       45.6          29       50.8
  10        54.0          20       32.3          30       25.8

In Ref. 2, we examined three questions from the FCI in some detail. First, we identified Question 11 as a difficult (25.7% correct) but efficient question, with an IRC for the correct answer choice showing a sharp slope near an ability level/total score of 17 (out of a possible 30), allowing for a clear discrimination between students above and below this ability level, a result attainable with the 3PL IRT analysis as well. However, using IRC, we also identified that Answer Choice 2 was particularly popular among lower ability students, while Answer Choice 3 was most popular among students in the middle of the ability range. Interestingly, Planinic et al. identify this question as problematic in that their single-parameter Rasch model fails to predict the students who will answer this question correctly.12


While their sample is different from ours, their model assumes that the low-ability students who get this question correct have done so by guessing, yet our IRC analysis suggests a very low percentage of low-ability students are getting the correct answer; they are much more likely to select one of two alternative, yet attractive, incorrect answer choices. Like Wang and Bao, the analysis by Planinic et al. does not examine or make use of the set of incorrect answer choices.

Next, we considered Question 9, which Wang and Bao rated as the sixth hardest question on the FCI but the second most likely question on which to guess correctly.4 IRC analysis reveals that this unlikely pairing occurs (at least in our sample) because Answer Choices 1 and 4 do not function very well, attracting few students of any ability level (the curves are low and have shallow slopes). Our analysis is consistent with that of Planinic et al., who placed this question in the middle of their difficulty scale.12

Finally, we identified Question 4 as moderately difficult (37.1% correct), but very inefficient, since three of the answer choices were not functioning at all, each attracting less than 2% of the students. Wang and Bao ranked this as the ninth hardest question and tenth easiest to guess correctly.4 Planinic et al. rated the difficulty of this question in the middle of the range.12 As we see in these examples, the additional information provided by IRC analysis (e.g., the functionality of individual answer choices, the attractiveness of individual answer choices to different subsets of the population segregated by ability level, etc.) could lead to greater insights into the actual ability levels of students taking the test and the functionality of test items.

III. A COMPARISON OF IRC ANALYSIS WITH 3PL IRT ANALYSIS FOR THE FCI

Let us now explore further a comparison between the 3PL IRT model analysis by Wang and Bao and our approach for the FCI. The 3PL IRT model of Wang and Bao naturally produces estimates of three different parameters, labeled a, b, and c, which correspond to discrimination, difficulty, and guessing, respectively.
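For reference, the standard form of the 3PL model gives the probability that an examinee of ability θ answers a given item correctly as

P(θ) = c + (1 − c) / {1 + exp[−a(θ − b)]},

where c sets the low-ability asymptote (the guessing parameter), b is the ability at which P is halfway between c and 1 (the difficulty parameter), and a controls the steepness of the rise near θ = b (the discrimination parameter). We write this standard expression only as a reminder of how the three parameters enter the model; Wang and Bao's notation and fitting details are given in Ref. 4.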
First, Wang and Bao estimate the discrimination of an item on the FCI from the 3PL model parameter a, which is essentially the slope of the 3PL model fit for the correct answer choice at the ability level corresponding to 50% correct response. In the IRC analysis, we can qualitatively judge the discrimination of an answer choice from the slope of our curves: the steeper the slope, the more discriminating the answer choice. Table I in Wang and Bao indicates that Questions 5, 13, and 18 are the most discriminating items on the FCI. Examining our Fig. 1, we see that the IRCs for the correct answer choices in each of these cases are characterized by a steep slope.

The IRC graphs in Fig. 1, however, illustrate the limitation of the estimation of item discrimination from the 3PL IRT analysis. In particular, the 3PL IRT model assumes that all distractors function equally. This assumption was explicitly not made in the design of the FCI, and Fig. 1 demonstrates that this assumption is not supported by the data. The behavior of the IRCs for the FCI answer choices indicates that a more sophisticated model than the 3PL IRT is needed. For example, Question 29 is rated by Wang and Bao as the least discriminating item on the FCI.4 Yet, when we examine the IRC for the correct answer choice, we find three regions of ability level with different levels of attraction to this answer choice: below a total score of 12, the probability of a student's selecting the correct answer choice increases linearly with a moderate slope; for a total score of 12 to 27, the probability is nearly constant, increasing only slightly with ability level; and from 28 to 30, the probability rises rather sharply. How to properly characterize the discrimination of such an item remains an open question, but the behavior of the IRC suggests that the 3PL model may overlook some information in that the performance of examinees may not be a smoothly increasing function of ability level.

Furthermore, the content of Question 29 extends beyond the Newtonian thinking dimension that the FCI is designed to measure in that the influences of the forces of gravity, air resistance, and buoyancy are included in the answer choices, while instruction at the time of the administration of the FCI may not have addressed the last two topics, even at the time of the posttest.

In general, however, given the shape of the fitting functions used in the 3PL model, only IRCs that monotonically increase and have at most a single inflection point can be modeled. From Fig. 1, we see that the 3PL IRT model likely will have problems fitting the data not only from Question 29 but also from Questions 11 and 15. Questions 2, 5, 8, 14, 17, and 20 also may cause difficulties for 3PL IRT based on our criteria. With our sample size, however, a local variation in the shape of the IRC to a specific answer choice of less than 5% may not be statistically significant. We need a larger data set to investigate these questions more carefully. In any case, 3PL IRT results for these items should be examined carefully in light of these observations.

Second, Wang and Bao describe item difficulty using parameter b. Figure 2 shows a scatter plot of percent correct versus the difficulty parameter of Wang and Bao—the linear fit has a correlation coefficient of 0.89. The good agreement suggests that using percent correct on an item on the FCI provides quite a good estimate of item difficulty. Wang and Bao specifically identify Questions 1 and 6 as easy.4 We concur, finding 71.6% and 73.6% of student examinees, respectively, responded with the correct answer choices—these being the two highest percentages of correct responses. The IRCs also allow some insight as to why these questions are at the easy end of the spectrum on the FCI. In both cases, one answer choice is attractive to students at the low end of the total score axis. In the case of Question 1, it appears that students with low total scores show a slight preference for Answer Choice 4 (which corresponds to the misconception that heavier objects fall faster). In the case of Question 6, two of the distractors are not functioning well at any ability level (Answer Choices 4 and 5 appear as simply more extreme and unlikely versions of Answer Choice 3 as related to the subsequent path of an object freed from circular motion). We discuss both of these examples in more detail in Sec. IV below.

Fig. 2. A comparison of the percent correct (this work, vertical axis) with the difficulty parameter ("b," horizontal axis) from Wang and Bao (Ref. 4).

Wang and Bao identify Questions 25 and 26 as the most difficult, although by the data in their Table I, the four most difficult questions (in order) are Questions 17, 26, 15, and 25. Using our analysis, we found rates of correct responses of 17.6%, 13.2%, 29.0%, and 21.4%, respectively. We also found Question 5 to be very difficult with only 19.2% correct; Wang and Bao found this to be the sixth most difficult question in their analysis. In the case of Question 15, the difficulty mainly results from one well-functioning and attractive distractor: Answer Choice 3 (if the car pushes the truck forward, it must be exerting a greater force on the truck than vice versa). The difficulty associated with Question 17, in fact, comes from the very same misconception. Here, Answer Choice 1 is attractive across ability levels and is associated with the misconception that, in order to move up, even at constant speed, the force up must be greater. Thus, the IRC analysis allows us to better characterize why questions are relatively easy or difficult.


Third, Wang and Bao estimate the guessing probability, or parameter c. Estimates of the guessing parameter are notoriously noisy and difficult to make because the low end of the ability scale is usually sparsely populated. If we examine the 3PL curves in Fig. 4 of Wang and Bao, we find that, when their model and data disagree, the model systematically overestimates the percent correct as compared to the data at the low end of the scale (conversely, although less relevant here, we note that at the high end of the scale, the model typically underestimates the percent correct at each ability level). Part of the difficulty once again may be that the 3PL IRT model assumes that all the incorrect answer choices are functioning equally, an assumption that we are not required to make with IRC analysis and that is not valid in the case of the FCI. As an alternative, one might estimate the fraction guessing from the IRC analysis by simply extrapolating the correct IRC to a "0" ability level. We note, however, that if someone with no knowledge in the content areas covered by the FCI took the test, we would expect the guessing fraction to be 20%, given that five answer choices appear with each question. The fact that, in our Fig. 1, we never find an IRC with greater than 12% correct at 0 ability level communicates clearly that the distractors have been chosen carefully to attract students with some knowledge of specific and common misconceptions. That said, all of these approaches suffer from the limited data at the extremely low end of the scale, so we have little confidence in any estimates of a guessing parameter.
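If a number is nevertheless desired, the extrapolation just mentioned can be carried out with a simple linear fit to the low end of the correct-answer trace. The following Python sketch is illustrative only; the choice to fit total scores of 5 and below is our assumption, and, as noted above, any such estimate inherits the sparseness of the low-score bins.

# Illustration only: crude guessing estimate from the IRC by extrapolating
# the correct-answer trace to ability level 0 with a linear fit over the
# lowest-score bins (here, total scores 0-5; this cutoff is an assumption).
import numpy as np

def guessing_from_irc(levels, correct_trace, max_level=5):
    levels = np.asarray(levels)
    correct_trace = np.asarray(correct_trace)
    low = levels <= max_level
    slope, intercept = np.polyfit(levels[low], correct_trace[low], deg=1)
    return intercept                 # extrapolated fraction correct at "0" ability

# e.g., guessing_from_irc(levels, ircs[7, 2, :]) for Question 8's correct answer,
# using the arrays from the item_response_curves() sketch above.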
A student with even cates clearly that the distractors have been chosen carefully minimal previous physics instruction will have had direct to attract students with some knowledge of specific and com- engagement with this point and will select correct Answer mon misconceptions. That said, all of these approaches suf- Choice 3 (both hit the ground at the same time). Perhaps fer from the limited data at the extremely low end of the because it appeals to naı¨ve, Aristotelian intuition about scale, so we have little confidence in any estimates of a motion, Answer Choice 4 (heavier ball hits the ground first) guessing parameter. seems to attract students at the lowest end of the total score We provide a final point to further highlight the additional range, while the other three answer choices do not seem to information that is provided from our IRC analysis compared attract students at a rate higher than guessing. Improvement to a dichotomous IRT analysis like 3PL, which examines only of this question could be made by changing one of the low- the correct answer choices. Figure 3 shows a “dichotomously” appeal distractor choices (1, 2, or 5) so that it would be more constructed IRC diagram for Question 13, with curves only attractive to moderate ability students. Finding such an alter- for “correct” and “incorrect,” the latter consisting of the sum native answer choice might be accomplished by using FCI of the curves from all the incorrect IRCs as well as the no scores to identify moderate ability students, interviewing response data. This figure is the analog to all of the graphs in them to identify more subtle gradations of misconceptions Fig. 4 from Wang and Bao. A comparison of our Fig. 3 with about weight and acceleration, and then reworking the entire the corresponding graph in our Fig. 1 reveals a substantial loss question. Further IRC analysis could allow an instructor to


Further IRC analysis could allow an instructor to identify whether such an answer choice had been effectively constructed.

Question 5: The IRCs for this item (forces acting on a ball traveling around a horizontally oriented semicircular track) indicate that all of the answer choices are functioning reasonably well. The IRC for the correct answer choice crosses 50% correct near a total score of 22, suggesting that the correct answer discriminates between students in the high and moderate ability ranges. As for the distractors, Answer Choice 1 (only gravity acts) is attractive to students at the very lowest end of the total score range (<2). Answer Choice 3 (gravity and a force in the direction of motion act) preferentially attracts students at the low end of the range as well (2–10). The IRC for Answer Choice 5 (gravity, a force in the direction of motion, and a force pointing from the center of the circle outward) shows a broad peak in the range of total scores of 9–16, while the IRC for Answer Choice 4 (gravity, a force exerted by the channel toward the center of the circle, and a force in the direction of motion) shows a broad peak in the total score range of 14–23. This item provides the only example in this data set for which all five answer choices seem to be functioning at some level, perhaps because students must put together several concepts from Newton's laws and circular motion in order to select the correct answer choice.

Question 6: Examining the content of this item, we see this question probes student understanding of Newton's first law by asking what happens to the same ball traveling around a horizontally oriented semicircular track as in Question 5, but as the ball exits one end. This question ranked as one of the easiest on the FCI by all analyses (Wang and Bao, Planinic et al., and ours). Nevertheless, this item does provide some discrimination at the bottom end of the ability scale, with Answer Choice 1 being the most attractive to those of low ability. Answer Choice 1 corresponds to the ball continuing to move in a circular path. Students who select this answer choice do not understand that circular motion requires the presence of a net force. Answer Choices 3, 4, and 5 are all variants of the same misconception—an expression of the "centrifugal force" causing objects moving along a circular path to feel a false force away from the center. The IRCs indicate that these distractors are not functioning particularly well. Since the slight differences in the shapes of the curves for these three answer choices have no motivation, a skilled test taker would understand that none of these choices can be correct. This item, therefore, might better be posed with three answer choices rather than 5, eliminating two of Answer Choices 3 through 5.

Question 13: This question examines the forces present on a ball thrown directly upward after it has been released but before hitting the ground. The IRCs for three answer choices on this item suggest that they could be used to characterize the ability level of a student into one of three ranges. Answer Choice 2 (varying magnitude force in the direction of motion) is preferred by a plurality of students with total scores <5 and by more than about half of respondents with total scores <3. Students selecting this answer choice, therefore, have a high probability of being at the low end of the ability range. Answer Choice 3 (gravity plus a decreasing force in the direction of motion on the way up only) is preferred by a plurality of students with 7 < total score < 17 and by more than about half of respondents with 10 < total score < 16. Students selecting this answer choice, therefore, are likely to be in the middle of the ability range. Finally, Answer Choice 4, the correct answer, is preferred by students with total scores >17. The sharp slope of this IRC identifies this item as highly discriminating. Thus, by examining the response to this one item, it is possible to roughly gauge the ability level of a student respondent. In conjunction with an analysis of other responses (correct and incorrect), the examiner increases his/her probability of correctly identifying the respondent's true ability level. We note that, in general, this ability level may or may not match the student's total score; it is possible for a student with a lower total score to be identified as having a higher ability level based on the specific set of incorrect and correct answer choices selected. Thus, we can develop a more accurate assessment of student ability level using an IRC analysis. Furthermore, the results related to this particular item on the FCI suggest different instructional strategies may be needed to address differing misconceptions students have regarding the subject matter. In particular, students at the low end of the ability scale need to learn why Answer Choice 2 is incorrect (perhaps the students are confusing the force of gravity with velocity), while students of middle ability need to learn why Answer Choice 3 is incorrect (perhaps these students understand the force of gravity but also describe a fictional upward force that closely correlates to momentum). We would need to conduct student interviews to probe these misconceptions more carefully. Nevertheless, in this case, IRC analysis has revealed a possible strategy for tailored instruction (e.g., online or computer assisted learning activities).
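As a concrete, if simplified, illustration of reading ability from a full response pattern, one could combine the empirical IRC proportions of the choices a student actually selected and ask which ability level best explains that pattern. The Python sketch below is our illustration of this idea rather than a procedure from Ref. 2; the log-proportion scoring rule, the small floor used to avoid zeros, and the reuse of the ircs array from the item_response_curves() sketch above are all assumptions.

# Illustration only: rank candidate ability levels by how well they explain
# a student's full pattern of selected choices, using the empirical IRC
# proportions. `student_choices` holds the choice (0-5) selected on each item.
import numpy as np

def most_likely_level(ircs, levels, student_choices, floor=1e-3):
    log_like = np.zeros(len(levels))
    for i, s in enumerate(levels):
        p = [max(ircs[item, choice, s], floor)   # floor avoids log(0) in empty cells
             for item, choice in enumerate(student_choices)]
        log_like[i] = np.sum(np.log(p))
    return levels[int(np.argmax(log_like))]

# Two students with the same total score can land at different estimated levels
# if one of them tends to pick the "better" wrong answers.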
Question 15: This question examines the forces as a car pushes a truck while accelerating. The IRCs for this item indicate that, as written, students are primarily attracted to incorrect Answer Choice 3 (force of car on truck is greater than force of truck on car) across all ranges of ability. Experienced teachers are not surprised by this result, even though Answer Choice 1 is a nearly direct translation of Newton's third law into this scenario. The IRC for the correct Answer Choice 1 crosses 50% correct at a very high total score (ability) of 25, making this a very difficult item. For the instructor, it would seem that the primary duty in this case is to correct the misconception students hold in responding with Answer Choice 3: Newton's third law is not a "sometimes" law, it is an "all-the-time" law (except when there is momentum transfer to a field, but that exception only highlights its usefulness). Of course, when the instructor succeeds in correcting this misconception, the shapes of the other IRCs may well change, forcing the instructor to target future class time in response to those changes. As with the discussion of Question 13, we find the prospects for customized instruction aided by IRC analysis of assessment instruments to be exciting.

V. CONCLUSIONS

This paper presents an overview of the types of analysis that are possible using IRCs and was inspired by the recent paper by Wang and Bao, which uses IRT to evaluate the FCI.4 In principle, the parameters extracted from IRT can be useful for scoring tests on a more sensitive metric, but in the case of the FCI, there is such strong correlation between ability and total score that this is not required for meaningful analysis.

The purpose of the 3PL IRT model described in Wang and Bao is to illustrate IRT and provide parameters to characterize the FCI test items.


By contrast, the purpose of the IRC approach is to illustrate item and distractor diagnostics in a way usable by classroom teachers and applicable to diagnostic instruments beyond the FCI. In a qualitative comparison of IRC analyses of the FCI with the 3PL IRT model results, we find that we are able to characterize item difficulty by using percent correct, which correlates strongly with Wang and Bao's difficulty parameter, b. However, we claim that the discrimination parameter, a, and the guessing parameter, c, as extracted by the 3PL IRT model in Wang and Bao do not capture the variable functionality of distractors with ability level. Furthermore, we demonstrate using the IRC analysis that there are several questions on the FCI for which the 3PL IRT model is not likely to produce a good fit to the data.

The strength of the IRT model is that the parameters that are derived are more reliably sample-independent. This means that item parameters estimated from a high-performing representative sample of students would match those estimated from a low-performing representative sample of students. Another strength of IRT is that person and item estimates are model-based and can therefore be evaluated for fit or validity.

The strength of the IRC analysis, on the other hand, is that it provides a simple, descriptive, graphical approach to evaluating the performance of distractors and for defining student ability levels in relation not only to dichotomous scoring but also to the set of incorrect answers that they select. For the current paper, we suggest that IRC can be informative to classroom instructors. IRC does not require extensive psychometric training or specialized software. However, we look forward to future research in which the nominal response model7 can be applied to take advantage of the benefits of IRT while examining the potentially different information that was designed into the distractors on the FCI. While alternative models might not make a great difference in scoring the FCI, close examination of item performance will help instructors and researchers in physics.

1 D. Hestenes, M. Wells, and G. Swackhamer, "Force concept inventory," Phys. Teach. 30, 141–158 (1992).
2 G. A. Morris, L. Branum-Martin, N. Harshman, S. D. Baker, E. Mazur, S. Dutta, T. Mzoughi, and V. McCauley, "Testing the test: Item response curves and test quality," Am. J. Phys. 74, 449–453 (2006).
3 R. P. McDonald, Test Theory: A Unified Treatment (Erlbaum, Mahwah, NJ, 1999).
4 J. Wang and L. Bao, "Analyzing force concept inventory with item response theory," Am. J. Phys. 78, 1064–1070 (2010).
5 D. Thissen and L. Steinberg, "A taxonomy of item response models," Psychometrika 51, 567–577 (1986).
6 S. E. Embretson and S. P. Reise, Item Response Theory for Psychologists (Erlbaum, Mahwah, NJ, 2000).
7 R. D. Bock, "Estimating item parameters and latent ability when responses are scored in two or more nominal categories," Psychometrika 37, 29–51 (1972).
8 L. Ding and R. Beichner, "Approaches to data analysis of multiple-choice questions," Phys. Rev. ST Phys. Educ. Res. 5, 020103 (2009).
9 K. D. Schurmeier, C. H. Atwood, C. G. Shepler, and G. J. Lautenschlager, "Using item response theory to assess changes in student performance based on changes in question wording," J. Chem. Educ. 87, 1268–1272 (2010).
10 J. A. Marshall, E. A. Hagedorn, and J. O'Connor, "Anatomy of a physics test: Validation of the physics items on the Texas Assessment of Knowledge and Skills," Phys. Rev. ST Phys. Educ. Res. 5, 010104 (2009).
11 F. M. Lord, Applications of Item Response Theory to Practical Testing Problems (Lawrence Erlbaum Associates, Hillsdale, NJ, 1980).
12 M. Planinic, L. Ivanjek, and A. Susac, "Rasch model based analysis of the Force Concept Inventory," Phys. Rev. ST Phys. Educ. Res. 6, 010103 (2010).

