An Investigation Into the Test Equating Methods Used During 2006, and the Potential for Strengthening Their Validity and Reliability

Final report to the Qualifications and Curriculum Authority

Dr. Iasonas Lamprianou
University of Manchester and Cyprus Testing Service
September 2007

Contents

Executive summary
Introduction
  The background of the research
  The aim and objectives of the report
  Methodology
  The format of the report
Literature Review
  Test equating
  Data collection designs for test equating
  Definition of validity and reliability in the context of test equating
  A special case in the literature review: The Massey report
Statistical models
  Question 1.1
  Question 1.2
  Question 2
  Question 3
  Question 4
Data-model fit, assumptions and properties of models
  Question 1
  Question 2
  Question 3
  Question 4
  Question 5
  Question 6
The quality of the datasets/samples
  Question 1
  Question 2
  Question 3
  Question 4
  Question 5
Test equating design–test equating error
  Question 1
  Question 2
  Question 3
  Question 4
  Question 5
Software
  Question 1
  Question 2
  Question 3
  Question 4
Documentation
  Question 1
  Question 2
  Question 3
Discussion and recommendations
References

© 2007 Qualifications and Curriculum Authority

Executive summary

This research investigated the validity and reliability of the equating methods used in national curriculum assessments to support standards over time. All of the Test Development Agencies (TDAs) gave responses which indicate a high degree of professionalism. According to their responses, the TDAs employ thorough and sophisticated methods to carry out equating tasks; however, the TDAs use very different methods to carry out very similar tasks. They offer reasonable, though not always well supported, arguments for why this is happening.

This research yielded a wealth of important findings, which are presented in this report. A few of the major findings are listed below. However, this list is not exhaustive: the reader is encouraged to go through the detailed comments in the next sections of this document as well.

Some of the main findings raised during this research are the following:

1. The TDAs use very different statistical models (i.e. Item Response Theory, equipercentile and linear equating) to carry out similar tasks:

  • A carefully designed study is certainly needed to investigate whether the ‘competing’ models give noticeably different results. One TDA argued that transition from one model to another needs to be done with care, because it may lead to different equating results. However, this is exactly the point: if different models give substantially (practically and statistically) different results, we need to know why we use the models we currently use. And we need to explain why we do not use other models.

  • In certain cases some TDAs declare that they may use Item Response Theory (IRT), if adequate evidence is presented to them that IRT is more efficient than their current techniques. In this case, QCA might wish to fund relevant research to investigate the possible merits (or drawbacks) of IRT over other techniques in the context of the English National Curriculum tests.

  • Some of the responses of the TDAs gave the impression that they might choose to use IRT techniques if item-level data were available (implying that item-level data is not always available). If this is the case, QCA may choose to help them acquire item-level data.

2. According to certain responses from the TDAs, the sample sizes are (at least in some cases) pre-defined by the NAA. However, some of the sample sizes reveal errors of up to 1.5 marks at the mean of the scale (presumably much larger at the tails of the scale).

  • QCA needs to decide, and give reasons, why this error margin is acceptable. If an error of 1.5 or 2 marks at the mean of the scale (presumably much larger at the tails) is acceptable, we may want to investigate what percentage of students might have been awarded a different level, had the threshold been 1.5 or 2 marks up or down.

  • QCA may wish to explain why different error margins are currently acceptable for different subjects or different key stages (from the responses of the TDAs it is concluded that, in some cases, there are different equating errors for different subjects).

3. The assumptions of the statistical
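The sensitivity check suggested under finding 2 (what share of students would receive a different level if a threshold moved by 1.5 or 2 marks) can be sketched as follows. This is only a minimal illustration, not the method used by the TDAs or the report's author: the function names, the marks and the level thresholds are all hypothetical, and it assumes that a level is awarded when a mark is at or above the corresponding threshold.

```python
from bisect import bisect_right


def award_levels(marks, thresholds):
    """Return a level index per mark: the number of thresholds at or below it."""
    ordered = sorted(thresholds)
    return [bisect_right(ordered, mark) for mark in marks]


def share_reclassified(marks, thresholds, shift):
    """Fraction of students whose level changes when every threshold moves by `shift` marks."""
    base = award_levels(marks, thresholds)
    moved = award_levels(marks, [t + shift for t in thresholds])
    changed = sum(1 for before, after in zip(base, moved) if before != after)
    return changed / len(marks)


# Hypothetical data for illustration only; not taken from the report.
marks = [18, 22, 23, 24, 25, 31, 40, 41, 42, 55]
thresholds = [24, 42]  # assumed minimum marks for two level boundaries

for shift in (-2, -1.5, 1.5, 2):
    print(f"shift {shift:+}: {share_reclassified(marks, thresholds, shift):.0%} reclassified")
```

With real data, the same calculation would be run on the full mark distribution for each subject and key stage, so that the proportion of affected students can be compared against the equating error reported for that test.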