
Comparability of Large-Scale Educational Assessments
Issues and Recommendations

NATIONAL ACADEMY OF EDUCATION

Amy I. Berman, National Academy of Education
Edward H. Haertel, Stanford University
James W. Pellegrino, University of Illinois at Chicago

National Academy of Education
Washington, DC

NATIONAL ACADEMY OF EDUCATION
500 Fifth Street, NW
Washington, DC 20001

NOTICE: This project and research reported here were supported by a grant from Smarter Balanced/University of California, Santa Cruz. The opinions expressed are those of the editors and authors and do not represent the views of Smarter Balanced/University of California, Santa Cruz.

Digital Object Identifier: 10.31094/2020/1

Additional copies of this publication are available from the National Academy of Education, 500 Fifth Street, NW, Washington, DC 20001; http://www.naeducation.org.

Copyright 2020 by the National Academy of Education. All rights reserved.

Suggested citation: Berman, A. I., Haertel, E. H., & Pellegrino, J. W. (2020). Comparability of Large-Scale Educational Assessments: Issues and Recommendations. Washington, DC: National Academy of Education.

The National Academy of Education (NAEd) advances high-quality research to improve education policy and practice. Founded in 1965, the NAEd consists of U.S. members and international associates who are elected on the basis of scholarship related to education. The Academy undertakes research studies to address pressing educational issues and administers professional development fellowship programs to enhance the preparation of the next generation of education scholars.

COMPARABILITY OF LARGE-SCALE EDUCATIONAL ASSESSMENTS: ISSUES AND RECOMMENDATIONS

Steering Committee
Edward H. Haertel (Co-Chair), Graduate School of Education, Stanford University
James W. Pellegrino (Co-Chair), Learning Sciences Research Institute, University of Illinois at Chicago
Louis M. Gomez, Graduate School of Education and Information Studies, University of California, Los Angeles
Larry V. Hedges, Department of Statistics, Northwestern University
Joan L. Herman, National Center for Research on Evaluation, Standards, and Student Testing, University of California, Los Angeles
Diana C. Pullin, Lynch School of Education and Human Development, Boston College
Marshall S. Smith, Carnegie Foundation for the Advancement of Teaching
Guadalupe Valdés, Graduate School of Education, Stanford University

Staff
Amy I. Berman, Deputy Director

Contents

EXECUTIVE SUMMARY

1 INTRODUCTION: FRAMING THE ISSUES
Amy Berman, National Academy of Education; Edward Haertel, Stanford University; and James Pellegrino, University of Illinois at Chicago

2 COMPARABILITY OF INDIVIDUAL STUDENTS’ SCORES ON THE “SAME TEST”
Charles DePascale and Brian Gong, National Center for the Improvement of Educational Assessment (Center for Assessment)

3 COMPARABILITY OF AGGREGATED GROUP SCORES ON THE “SAME TEST”
Leslie Keng and Scott Marion, Center for Assessment

4 COMPARABILITY WITHIN A SINGLE ASSESSMENT SYSTEM
Mark Wilson, University of California, Berkeley, and Richard Wolfe, Ontario Institute for Studies in Education, University of Toronto

5 COMPARABILITY ACROSS DIFFERENT ASSESSMENT SYSTEMS
Marianne Perie, Measurement in Practice, LLC

6 COMPARABILITY WHEN ASSESSING ENGLISH LEARNER STUDENTS
Molly Faulkner-Bond, WestEd, and James Soland, University of Virginia/Northwest Evaluation Association (NWEA)

7 COMPARABILITY WHEN ASSESSING INDIVIDUALS WITH DISABILITIES
Stephen Sireci and Maura O’Riordan, University of Massachusetts Amherst

8 COMPARABILITY IN MULTILINGUAL AND MULTICULTURAL ASSESSMENT CONTEXTS
Kadriye Ercikan, Educational Testing Service/University of British Columbia, and Han-Hui Por, Educational Testing Service

9 INTERPRETING TEST-SCORE COMPARISONS
Randy Bennett, Educational Testing Service

BIOGRAPHICAL SKETCHES OF STEERING COMMITTEE MEMBERS AND AUTHORS

Executive Summary

“How is my child doing?” “What are my child’s strongest and weakest subjects?” “Have my child’s test scores improved from last year?” “How do my child’s test scores compare to others looking to go to college?” “Should I move to this school zone?”
—Parent questions

“How do the assessment scores of schools within our district compare?” “How are our English learner students doing compared with our native English speakers?” “Are we closing the achievement gap?” “How do our assessment scores compare to others within the state?”
—District administrator questions

“How do our kids measure up to kids in other states?” “Within districts?” “How are the scores of various student subgroups changing over time?”
—State administrator/policy maker questions

PURPOSE OF THE VOLUME

Such questions come from a range of stakeholders with separate vested interests in educational assessments, ranging from parents worried about individual test scores, to local district leaders interested in specific populations, to state policy makers looking at the big picture. Often, different questions are asked about the same assessments, and these questions do not always coincide with the uses for which the assessments were designed and validated. While their interests and questions may differ, these stakeholders all have one thing in common: they are asking questions that assume scores can be validly compared—that a lower score means less proficiency, similar scores mean similar proficiency, a higher score means greater proficiency, and a positive change in scores from one year to the next means improvement, regardless of the specific details of how each student was tested. In other words, they assume the comparability of scores from educational assessments.1 Stakeholders often simply assume that scores obtained from different students in different times and places, using different tests or test forms, are directly comparable, but that is not necessarily the case.
Countless factors influence assessment scores, and finding accurate and satisfactory answers to questions of score comparability is not easy. Moreover, comparability may be adequate for one interpretive purpose but not for another.

1 The words assessment and test are used throughout this volume, and though to some extent they are interchangeable, they do have different meanings. Assessment is the more general of the two, conveying the idea of a process providing evidence of quality. Assessment covers a broad range of procedures to measure teaching and learning. A test is one product that measures a particular set of objectives or behavior.

This National Academy of Education (NAEd) volume provides guidance to key stakeholders on how to accurately report and interpret comparability assertions as well as how to ensure greater comparability by paying close attention to key aspects of assessment design, content, and procedures. The goal of the volume is to provide guidance to relevant state-level educational assessment and accountability decision makers, leaders, and coordinators; consortia members; technical advisors; vendors; and the educational measurement community regarding how much and what types of variation in assessment content and procedures can be allowed while still maintaining comparability across jurisdictions and student populations. At the same time, the larger takeaways from this volume will hopefully provide guidance to policy makers using assessment data to enact legislation and regulations and to district- and school-level leadership determining resource allocations, and also provide greater contextual understanding for those in the media using test scores to make comparability determinations.

WHAT IS COMPARABILITY?

Users of educational tests often seek to compare scores even if the scores were obtained at different times, in different places, or using variations in assessment content and procedures. Score comparability broadly means that users can be confident in making such comparisons. Ideally, users could be assured that students with the same score are equally proficient with respect to the knowledge and skills a test was intended to measure. As described more fully throughout this volume, there are numerous threats to comparability that must be considered before making such a claim. For instance, if test performance requires proficiencies irrelevant to the knowledge and skills the test is intended to measure, and if some students’ performance suffers due to a lack of those proficiencies, then the scores of those students are not comparable to the scores of other students (e.g., a math test may not be intended to test language proficiency, but limited language knowledge may nonetheless influence test results for some students). Threats to comparability may also arise from differences in test administration or scoring conditions (e.g., paper-and-pencil versus computer-based testing, different times in the academic year, or human versus machine scoring) and from differences in the specific test or test form used. When scores are compared for groups of students, comparability also demands that the groups compared be defined consistently, with proper attention to sampling, rules for exclusions and exemptions, and retesting practices. Finally, the issue of score comparability requires attention to the inferences drawn from test scores, as well as the intended uses of the tests.
What does the end user want to compare, at what aggregated level, and for what purpose?

WHAT IS IN THE VOLUME?

The volume is organized by the major types of comparisons