How Well Can We Equate Test Forms That Are Constructed by Examinees? Program Statistics Research
DOCUMENT RESUME

ED 385 579                                                    TM 024 017

AUTHOR       Wainer, Howard; And Others
TITLE        How Well Can We Equate Test Forms That Are Constructed by
             Examinees? Program Statistics Research.
INSTITUTION  Educational Testing Service, Princeton, N.J.
REPORT NO    ETS-RR-91-57; ETS-TR-91-15
PUB DATE     Oct 91
NOTE         29p.
PUB TYPE     Reports - Research/Technical (143)
EDRS PRICE   MF01/PC02 Plus Postage.
DESCRIPTORS  *Adaptive Testing; Chemistry; Comparative Analysis; Computer
             Assisted Testing; *Constructed Response; Difficulty Level;
             *Equated Scores; *Item Response Theory; Models; Selection;
             Test Format; Testing; *Test Items
IDENTIFIERS  Advanced Placement Examinations (CEEB); *Unidimensionality
             (Tests)

ABSTRACT
When an examination consists, in whole or in part, of constructed-response items, it is a common practice to allow the examinee to choose among a variety of questions. This procedure is usually adopted so that the limited number of items that can be completed in the allotted time does not unfairly affect the examinee. This results in the de facto administration of several different test forms, where the exact structure of any particular form is determined by the examinee. When different forms are administered, a canon of good testing practice requires that those forms be equated to adjust for differences in their difficulty. When the items are chosen by the examinee, traditional equating procedures do not strictly apply. In this paper, how one might equate within an item response theory (IRT) framework is explored. The procedure is illustrated with data from the College Board's Advanced Placement Test in Chemistry taken by a sample of 18,431 examinees. Comparable scores can be produced in the context of choice to the extent that responses may be characterized with a unidimensional IRT model. Seven tables and five figures illustrate the discussion. (Contains 19 references.) (Author/SLD)

The Program Statistics Research Technical Report Series is designed to make the working papers of the Research Statistics Group at Educational Testing Service generally available. The series consists of reports by the members of the Research Statistics Group as well as their external and visiting statistical consultants. Reproduction of any portion of a Program Statistics Research Technical Report requires the written consent of the author(s).

How Well Can We Equate Test Forms That Are Constructed By Examinees?
Howard Wainer, Educational Testing Service
Xiang-Bo Wang, University of Hawai'i
David Thissen, University of North Carolina

Program Statistics Research Technical Report No. 91-15
Research Report No. 91-57

Educational Testing Service
Princeton, New Jersey 08541
October 1991

Copyright © 1991 by Educational Testing Service. All rights reserved.

How well can we equate test forms that are constructed by examinees?¹

Abstract

When an exam consists, in whole or in part, of constructed-response items, it is a common practice to allow the examinee to choose among a variety of questions. This procedure is usually adopted so that the limited number of items that can be completed in the allotted time does not unfairly affect the examinee. This results in the de facto administration of several different test forms, where the exact structure of any particular form is determined by the examinee. When different forms are administered, a canon of good testing practice requires that those forms be equated to adjust for differences in their difficulty. When the items are chosen by the examinee, traditional equating procedures do not strictly apply. In this paper we explore how one might equate within an IRT framework. We illustrate our procedure with data from the College Board's Advanced Placement Test in Chemistry.

¹ This research was supported by the Educational Testing Service through the Research Statistics Project. We are pleased to be able to acknowledge this help. The data, and advice about how to interpret them, were supplied by Walter MacDonald, Behroz Maneckshana, Joe Stevens, and Hessie Taft. Without their expert help and kind cooperation we would have had little to write about. We are also grateful to Rick Morgan and Jerry Melican, whose careful reviews assured the accuracy of this description. Xiang-Bo Wang's work on this project was done while he was an Educational Testing Service Summer Predoctoral Fellow. We would also like to express our appreciation to R. Darrell Bock, whose initial inquiries led us to look into this problem.

It has long been understood that a good test must contain enough questions to cover the content domain fairly. In his description of an 1845 survey of the Grammar and Writing Schools of Boston, Horace Mann argued that

"... it is clear that the larger the number of questions put to a scholar, the better is the opportunity to test his merits. If but a single question is put, the best scholar in the school may miss it, though he would succeed in answering the next twenty without a blunder; or the poorest scholar may succeed in answering one question, though certain to fail in twenty others. Each question is a partial test, and the greater the number of questions, therefore, the nearer does the test approach to completeness. It is very uncertain which face of a die will turn up at the first throw; but if the dice are thrown all day, there will be a great equality in the number of faces turned up."

Despite the force of Mann's argument, there are reasons for the increasing pressure to build tests using units that are larger than a single multiple-choice item. Sometimes these units can be thought of as aggregations of small items, e.g., testlets (Wainer & Kiely, 1987; Wainer & Lewis, 1990); sometimes they are just large items (essays, mathematical proofs, etc.). Large items, by definition, take the examinee longer to complete than do short items.
Therefore, fewer large items can be completed within a given testing time. The fact that an examinee cannot complete very many large items within the allotted testing time places the test builder in something of a quandary. One must either be satisfied with fewer items, and possibly not span the content domain as fully as might have been the case with a much larger number of smaller items, or expand the testing time sufficiently to allow the content domain to be well represented. Often practicality limits testing time, and so compromises on domain coverage must be made. A common compromise is to provide several large items and allow the examinee to choose among them. The notion is that in this way the examinee is not placed at a disadvantage by an unfortunate choice of domain coverage by the test builder.

Allowing examinees to choose the items they will answer presents a difficult set of problems. Despite the most strenuous efforts to write items of equivalent difficulty, some might be more difficult than others. If examinees who choose different items are to be fairly compared with one another, the scores obtained on those items must be equated. How?

All methods of equating are aimed at producing the subjunctive score that an examinee would have obtained had that examinee answered a different set of items; within an IRT framework this idea takes a particularly simple form, sketched at the end of this section. To accomplish this feat requires that the item responses that are not observed are "missing-at-random." The act of equating means that we believe that the performance that we observe on one item tells us something about what performance would have been on another item. If we know that the procedure by which an item was chosen has nothing to do with any specialized knowledge that the student possesses, we can believe that the missing responses are missing-at-random. However, if the examinee has a hand in choosing the items, this assumption becomes considerably less plausible.

To understand this more concretely, consider two different construction rules for a spelling test. Suppose we have a corpus of 100,000 words of varying difficulty, and we wish to manufacture a 100-item spelling test. From the proportion of the test's items that the examinee correctly spells we will infer the proportion of the total corpus that the examinee can spell. Two rules for constructing such a test might be:

Missing-at-random: We select 100 words at random from the corpus and present them to the examinee. In this instance we believe that what we observe is a reasonable representation of what we did not observe.

Examinee selected: A word is presented at random to the examinee, who then decides whether or not to attempt to spell it. After 100 attempts the proportion spelled correctly is the examinee's raw score. The usefulness of this score depends crucially on the extent to which we believe that examinees' judgments of whether or not they can spell particular words are related to actual ability. If there is no relation between spelling ability and a priori expectation, then this method is as good as missing-at-random. At the other extreme, we might believe that examinees know perfectly well whether or not they can spell a particular word correctly. In this instance a raw score of 100% has quite a different meaning.
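The contrast between the two construction rules is easy to simulate. The sketch below is illustrative only: the Rasch-style response model, the ability and difficulty distributions, and the noisy self-assessment that drives the examinee's selections are all our assumptions, not the paper's; the paper states the two rules only verbally.

```python
"""Simulate the two spelling-test construction rules described above."""
import numpy as np

rng = np.random.default_rng(0)

N_CORPUS, N_ITEMS = 100_000, 100
theta = 0.0                                   # examinee ability (assumed)
difficulty = rng.normal(0.0, 1.0, N_CORPUS)   # word difficulties (assumed)

def p_correct(theta, b):
    """Rasch-style probability of spelling a word of difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# The quantity the test is supposed to estimate: the proportion of the
# whole corpus this examinee can spell.
true_prop = p_correct(theta, difficulty).mean()

# Rule 1 (missing-at-random): 100 words drawn at random from the corpus.
mar_words = rng.choice(N_CORPUS, size=N_ITEMS, replace=False)
mar_score = (rng.random(N_ITEMS)
             < p_correct(theta, difficulty[mar_words])).mean()

# Rule 2 (examinee selected): words are offered at random and attempted
# only when a noisy perception of their difficulty looks manageable.
attempted = []
for word in rng.permutation(N_CORPUS):
    perceived = difficulty[word] + rng.normal(0.0, 0.5)  # assumed noise
    if perceived < theta:                                # "I can spell that"
        attempted.append(word)
    if len(attempted) == N_ITEMS:
        break
sel_diff = difficulty[np.array(attempted)]
sel_score = (rng.random(N_ITEMS) < p_correct(theta, sel_diff)).mean()

print(f"true proportion spellable:   {true_prop:.3f}")  # about 0.50
print(f"missing-at-random raw score: {mar_score:.3f}")  # near the truth
print(f"examinee-selected raw score: {sel_score:.3f}")  # inflated
```

Under these assumptions the missing-at-random score lands near the true corpus proportion, while the examinee-selected score is systematically inflated, which is precisely why raw scores from the two rules cannot be compared directly.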
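Returning to the equating question raised earlier, the idea can be stated compactly in IRT terms. The sketch below is ours, not the paper's own development: the two-parameter logistic model and dichotomous scoring are illustrative assumptions (the AP Chemistry constructed-response questions are polytomous), and any unidimensional IRT model would play the same role.

```latex
% A sketch of IRT-based equating for examinee-chosen forms.
% The 2PL form is an illustrative assumption; any unidimensional
% IRT model plays the same role.
\documentclass{article}
\usepackage{amsmath}
\begin{document}

Let $\theta$ be the examinee's proficiency, and suppose item $j$
follows a two-parameter logistic model,
\[
  P_j(\theta) = \frac{1}{1 + \exp\{-a_j(\theta - b_j)\}},
\]
with discrimination $a_j$ and difficulty $b_j$. If the examinee
answers only the chosen set of items $C$, with responses
$x_j \in \{0,1\}$, proficiency is estimated from the likelihood
restricted to $C$,
\[
  L(\theta) = \prod_{j \in C} P_j(\theta)^{x_j}
              \bigl\{1 - P_j(\theta)\bigr\}^{1 - x_j},
\]
and a comparable score on any reference form $R$ is the
model-implied expected score
\[
  T_R(\hat\theta) = \sum_{j \in R} P_j(\hat\theta).
\]
The inference from $C$ to $R$ is defensible only if, conditional on
$\theta$, the choice of $C$ tells us nothing about the unobserved
responses: the missing-at-random condition discussed above.

\end{document}
```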