An investigation into the test equating methods used during 2006, and the potential for strengthening their validity and reliability

Final report to the Qualifications and Curriculum Authority

Dr. Iasonas Lamprianou
University of Manchester and Cyprus Testing Service

September 2007

Contents

Contents .... 1
Executive summary .... 3
Introduction .... 8
    The background of the research .... 8
    The aim and objectives of the report .... 8
    Methodology .... 8
    The format of the report .... 11
Literature Review .... 12
    Test equating .... 12
    Data collection designs for test equating .... 13
    Definition of validity and reliability in the context of test equating .... 25
    A special case in the literature review: The Massey report .... 26
Statistical models .... 28
    Question 1.1 .... 29
    Question 1.2 .... 34
    Question 2 .... 39
    Question 3 .... 41
    Question 4 .... 42
Data-model fit, assumptions and properties of models .... 44
    Question 1 .... 44
    Question 2 .... 48
    Question 3 .... 50
    Question 4 .... 52
    Question 5 .... 54
    Question 6 .... 55
The quality of the datasets/samples .... 57
    Question 1 .... 57
    Question 2 .... 60
    Question 3 .... 64
    Question 4 .... 65
    Question 5 .... 67
Test equating design – test equating error .... 68
    Question 1 .... 68
    Question 2 .... 72
    Question 3 .... 73
    Question 4 .... 75
    Question 5 .... 77
Software .... 78
    Question 1 .... 78
    Question 2 .... 80
    Question 3 .... 81
    Question 4 .... 82
Documentation .... 83
    Question 1 .... 83
    Question 2 .... 85
    Question 3 .... 86
Discussion and recommendations .... 88
References .... 92

Executive summary

This research investigated the validity and reliability of the equating methods used in national curriculum assessments to support standards over time. All of the Test Development Agencies (TDAs) gave responses which indicate a high degree of professionalism. According to their responses, the TDAs employ thorough and sophisticated methods to carry out equating tasks; however, the TDAs use very different methods to carry out very similar tasks. They offer reasonable, though not always well supported, arguments for why this is the case.
This research yielded a wealth of important findings, which are presented in this report. A few of the major findings are listed below. However, this list is not exhaustive: the reader is encouraged to go through the detailed comments in the subsequent sections of this document as well. Some of the main findings raised during this research are the following:

1. The TDAs use very different statistical models (i.e. Item Response Theory, equipercentile and linear equating) to carry out similar tasks:

• A carefully designed research study is certainly needed in order to investigate whether the ‘competing’ models give noticeably different results. One TDA argued that the transition from one model to another needs to be done with care, because it may lead to different equating results. However, this is exactly the point: if different models give substantially (practically and statistically) different results, we need to know why we use the models we currently use, and we need to explain why we do not use other models.

• In certain cases some TDAs declare that they may use Item Response Theory (IRT) if adequate evidence is presented to them that IRT is more efficient than their current techniques. In this case, QCA might wish to fund relevant research to investigate the possible merits (or drawbacks) of IRT over other techniques in the context of the English National Curriculum tests.

• Some of the responses of the TDAs gave the impression that they might choose to use IRT techniques if item-level data were available (implying that item-level data is not always available). If this is the case, QCA may choose to help them acquire item-level data.

2. According to certain responses from the TDAs, the sample sizes are (at least in some cases) pre-defined by the NAA. However, some of the sample sizes imply equating errors of up to 1.5 marks at the mean of the scale (and presumably much larger errors at the tails of the scale).

• QCA needs to decide (and give reasons) why this error margin is acceptable. If an error of 1.5 or 2 marks at the mean of the scale (presumably much larger at the tails) is acceptable, we may want to investigate what percentage of students might have been awarded a different level, had the threshold been 1.5 or 2 marks up or down.

• QCA may wish to explain why different error margins are currently acceptable for different subjects or different key stages (from the responses of the TDAs it is concluded that, in some cases, different equating errors apply to different subjects).

3. The assumptions of the statistical