2016 Training Sessions: April 7-8
2016 Annual Meeting: April 9-11

Renaissance Washington, DC Downtown Hotel Washington, DC


2016 PROGRAM National Council on Measurement in Education

Foundations and Frontiers: Advancing Educational Measurement for Research, Policy, and Practice


#NCME16

Washington, DC, USA

Table of Contents

NCME Board of Directors ...... 4

Proposal Reviewers ...... 6

Future Meetings ...... 7

Renaissance Washington, DC Downtown Hotel Meeting Room Floor Plans ...... 8

Training Sessions

Thursday, April 7 ...... 13

Friday, April 8 ...... 23

Program

Saturday, April 9 ...... 37

Sunday, April 10 ...... 95

Monday, April 11 ...... 139

Index ...... 185

Contact Information ...... 197

Schedule-at-a-Glance ...... 215

Foundations and Frontiers: Advancing Educational Measurement for Research, Policy, and Practice


NCME Officers

President: Richard J. Patz, ACT, Iowa City, IA

Vice President: Mark Wilson, UC Berkeley, Berkeley, CA

Past President: Lauress Wise, HumRRO, Seaside, CA

NCME Directors

Amy Hendrickson, The College Board, Newtown, PA

Kristen Huff, ACT, Iowa City, IA

Luz Bay, The College Board, Dover, NH

Won-Chan Lee, University of Iowa, Iowa City, IA

Cindy Walker, University of Wisconsin-Milwaukee, Milwaukee, WI

C. Dale Whittington, Shaker Heights Public Schools, Shaker Heights, OH


Editors

Journal of Educational Measurement: Jimmy de la Torre, Rutgers, The State University of New Jersey, New Brunswick, NJ

Educational Measurement: Issues and Practice: Howard Everson, SRI International, Menlo Park, CA

NCME Newsletter: Heather M. Buzick, Educational Testing Service, Princeton, NJ

Website Content Editor: Brett Foley, Alpine Testing Solutions, Denton, NE

2016 Annual Meeting Chairs

Annual Meeting Program Chairs: Andrew Ho, Harvard Graduate School of Education, Cambridge, MA

Matthew Johnson, Columbia University, New York, NY

Graduate Student Issues Committee Chair: Brian Leventhal, University of Pittsburgh, Pittsburgh, PA

Training and Development Committee Chair: Xin Li, ACT, Iowa City, IA

Fitness Run/Walk Directors: Katherine Furgol Castellano, ETS, San Francisco, CA

Jill R. van den Heuvel, Alpine Testing Solutions, Hatfield, PA

NCME Information Desk

The NCME Information Desk is located on the Meeting Room Level of the Renaissance Washington, DC Downtown Hotel. Stop by to pick up a ribbon and to obtain your bib number and T-shirt for the fun run and walk. It will be open at the following times:

Thursday, April 7 ...... 7:30 AM-4:30 PM
Friday, April 8 ...... 8:00 AM-4:30 PM
Saturday, April 9 ...... 10:00 AM-4:30 PM
Sunday, April 10 ...... 8:00 AM-1:00 PM
Monday, April 11 ...... 8:00 AM-1:00 PM


Proposal Reviewers

Terry Ackerman, Benjamin Andrews, Robert Ankenmann, Karen Barton, Kirk Becker, Anton Beguin, Dmitry Belov, Tasha Beretvas, Jonas Bertling, Damian Betebenner, Dan Bolt, Laine Bradshaw, Henry Braun, Robert Brennan, Brent Bridgeman, Derek Briggs*, Chad Buckendahl, Li Cai*, Wayne Camara, Katherine Furgol Castellano, Ying Cheng, Chia-Yi Chiu, Steve Culpepper, Mark Davison, Jimmy de la Torre, John Donoghue, Jeff Douglas, Michael Edwards, Karla Egan*, Kadriye Ercikan, Steve Ferrara, Holmes Finch*, Mark Gierl, Brian Habing, Chris Han, Mark Hansen, Deborah Harris, Kristen Huff*, Minjeong Jeon, Hong Jiao, Matt Johnson, Daniel Jurich, Seock-Ho Kim, Jennifer Kobrin, Suzanne Lane, Won-Chan Lee*, Dongmei Li, Jinghua Liu, Skip Livingston, JR Lockwood*, Susan Loomis, Krista Mattern, Andy Maul*, Dan McCaffrey, Katie McClarty, Catherine McClellan, Patrick Meyer, Paul Nichols, Maria Oliveri, Andreas Oranje, Thanos Patelis, Susan Philips, Mary Pitoniak, John Poggio, Sophia Rabe-Hesketh, Mark Reckase, Frank Rijmen, Michael Rodriguez, Sandip Sinharay, Steve Sireci, Dubravka Svetina, Ye Tong*, Anna Topczewski, Peter van Rijn, Jay Verkuilen, Alina von Davier, Matthias von Davier, Michael Walker, Chun Wang, Jonathan Weeks, Cathy Wendler, Andrew Wiley, Steve Wise, Duanli Yan, John Young, April Zenisky*

* Indicates Expert Panel Chairperson


Graduate Student Abstract Reviewers

Lokman Akbay, Beyza Aksu, Abeer Alamri, Bruce Austin, Elizabeth Barker, Diego Luna Bazaldua, Masha Bertling, Lisa Beymer, Mark Bond, Nuliyana Bukhari, Jie, Michelle Chen, Yi-Chen Chiang, Shenghai Dai, Tianna Floyd, Oscar Gonzalez, Emily Ho, Landon Hurley, Charlie Iaconangelo, Andrew Iverson, Kyle Jennings, HeaWon Jun, Susan Kahn, Jaclyn Kelly, Brian Leventhal, Isaac Li, Dandan Liao, Fu Liu, David Martinez-Alpizar, Namita Mehta, Rich Nieto, Mary Norris, Nese Ozturk, Robyn Pitts, Ray Reichenberg, Sumeyra Sahbaaz, Tyler Sandersfeld, Can Shao, Benjamin Shear, Jordan Sparks, Rose Stafford, Latisha Sternod, Myrah Stockdale, Meghan Sullivan, Ragip Terzi, Stephanie Underhill, Keyin Wang, Min Wang, Ting Wang, Xiaolin Wang, Diah Wihardini, Elizabeth Williams, Immanuel Williams, Dawn Woods, Kuan Xing, Jing-Ru Xu, Menglin Xu, Sujin Yang, Ai Ye, Nedim Yel, Hulya Yurekli

Future Annual Meetings

2017 Annual Meeting April 26-30 San Antonio, TX

2018 Annual Meeting April 12-16 New York, NY, USA

2019 Annual Meeting April 4-8 Toronto, Ontario, Canada


Hotel Floor Plans – Renaissance Washington, DC Downtown


A Message from Your Program Chairs

2016 NCME Program Highlights: Foundations and Frontiers: Advancing Educational Measurement for Research, Policy, and Practice

We are pleased to highlight a few of the many excellent sessions that our members have contributed, as well as to congratulate our partners at AERA on their centennial celebration.

From the very first conference session, at 8:15 AM on Saturday, April 9, we're kicking it off with big-picture topics (Henry Braun leading an invited session for the recent NCME volume: Challenges to Measurement in an Era of Accountability) alongside technical advances (Derek Briggs leading off a session on Learning Progressions for Measuring Growth).

The momentum continues through our last session, at 4:05 PM on Monday, April 11, where we tackle buzz phrases (Thanos Patelis convening a session on Fairness Issues and Validation of Noncognitive Skills) and settle scores (The Great Subscore Debate, with Emily Bo, Howard Wainer, Sandip Sinharay, and many others facing off to surely resolve the issue once and for all).

We are taking full advantage of our location in Washington, DC, with an invited session on the recently passed Every Student Succeeds Act over lunchtime on Monday. Peter Oppenheim and Sarah Bolton, Education Policy Directors (majority and minority, respectively) for the US Senate HELP Committee, will discuss key provisions and spark a discussion among researchers about ESSA's Implications and Opportunities for Measurement Research and Practice. Earlier that Monday morning, Kristen Huff will convene reporters and scholars in a session with the lively title: Hold the Presses! How Measurement Professionals Can Speak More Effectively with the Press and the Public.

Consistent with our theme, our many sessions highlight both foundations (Isaac Bejar coordinates a session on Item Difficulty Modeling: From Theory to Practice, while Karla Egan convenes a session on Standard Setting: Beyond Process) and frontiers (Tracy Sweet will lead a session on Recent Advances in Social Network Analysis, and Will Lorie takes on Big Data in Education: From Items to Policies).

Stay up to date via the Twitter hashtag #NCME16 and our new NCME Facebook group. We are confident that you will enjoy the program that you have helped to create here at the 2016 NCME Annual Meeting.

Andrew Ho and Matt Johnson, 2016 NCME Annual Meeting Co-Chairs


Pre-Conference Training Sessions

The 2016 NCME Pre-Conference Training Sessions will be held at the Renaissance Washington, DC Downtown Hotel on Thursday, April 7 and Friday, April 8. All full-day sessions will be held from 8:00 AM to 5:00 PM. All half-day morning sessions will be held from 8:00 AM to 12:00 noon. All half-day afternoon sessions will run from 1:00 PM to 5:00 PM.

On-site registration for the Pre-Conference Training Sessions will be available at the NCME Information Desk at the Renaissance Washington, DC Downtown Hotel for those workshops that still have availability.

Please note that internet connectivity will not be available for most training sessions; where applicable, participants should download the required software before their session. Internet connectivity will be available only for the few training sessions that have pre-paid an additional fee.


Pre-Conference Training Sessions - Thursday, April 7, 2016


Thursday, April 7, 2016 8:00 AM - 12:00 PM, Meeting Room 6, Meeting Room Level, Training Session, AA

Quality Control Tools in Support of Reporting Accurate and Valid Test Scores Aster Tessema, American Institute of Certified Public Accountants; Oliver Zhang, The College Board; Alina von Davier, Educational Testing Service

All testing companies focus on ensuring that test scores are valid, reliable, and fair. Significant resources are allocated to meeting the guidelines of well-known organizations, such as AERA/NCME and the International Test Commission (Allalouf, 2007; ITC, 2011).

In this workshop we will discuss traditional QC methods, the operational testing process, and new QC tools for monitoring the stability of scores over time.

We will provide participants with a practical understanding of:

1. The importance of flow charts and documentation of procedures
2. The use of software tools to monitor tasks
3. How to minimize the number of handoffs
4. How to automate activities
5. The importance of trend analysis to detect anomalies
6. The importance of applying detective and preventive controls
7. Having a contingency plan

We will also show how QC techniques from manufacturing can be applied to monitoring scores. We will discuss applying traditional QC charts (Shewhart and CUSUM charts), time series models, and change-point models to the means of scale scores to detect abrupt changes (Lee & von Davier, 2013). We will also discuss QC methods for the automated and human scoring of essays (Wang & von Davier, 2014).
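To give a concrete feel for this kind of monitoring, the sketch below applies a one-sided CUSUM chart to the mean scale score of successive administrations. It is a minimal Python illustration, not the presenters' tools; the target, slack, and decision-threshold values, and the simulated data, are assumptions chosen only for the example.

```python
# Illustrative sketch (not the workshop's software): a one-sided CUSUM chart
# applied to the mean scale score of successive test administrations.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical mean scale scores for 30 administrations; the last 10 drift upward.
means = np.concatenate([rng.normal(500, 5, 20), rng.normal(508, 5, 10)])

target = 500.0      # expected mean scale score under stable conditions (assumed)
sigma = 5.0         # assumed standard deviation of administration means
k = 0.5 * sigma     # allowance (slack) parameter
h = 4.0 * sigma     # decision threshold

cusum_hi = 0.0
for t, m in enumerate(means, start=1):
    cusum_hi = max(0.0, cusum_hi + (m - target) - k)
    if cusum_hi > h:
        print(f"Administration {t}: upward shift signaled (CUSUM = {cusum_hi:.1f})")
```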


Thursday, April 7, 2016 8:00 AM - 12:00 PM, Meeting Room 7, Meeting Room Level, Training Session, BB

IRT Parameter Linking Wim van der Linden and Michelle Barrett, Pacific Metrics

The problem of IRT parameter linking arises when the values of the parameters for the same items or examinees in different calibrations need to be compared. So far, the problem has mainly been conceptualized as an instance of the problem of invariance of the measurement scale for the ability parameters, in the tradition of S. S. Stevens' interval scales. In this half-day training session, we show that the linking problem has little to do with arbitrary units and zeros of measurement scales but is the result of a more fundamental problem inherent in all IRT models: a general lack of identifiability of their parameters. The redefinition of the linking problem allows us to formally derive the linking functions required to adjust for the differences in parameter values between separate calibrations. It also leads to new, efficient statistical estimators of the linking parameters, the derivation of their standard errors, and the use of current optimal test-design methods to design linking studies with minimal error. All these results have been established for both the current dichotomous and polytomous IRT models. The results will be presented in four one-hour lectures appropriate for psychometricians with interest and/or practical experience in IRT parameter linking.
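For orientation only, the sketch below shows the familiar mean/sigma transformation that places 2PL item parameters from a new calibration onto the scale of a base calibration through a linear linking function. The anchor-item parameter values are hypothetical, and this baseline treatment is the conventional approach that the session argues should be rethought.

```python
# Illustrative sketch: mean/sigma linking of 2PL item parameters from a new
# calibration (Form Y) onto the scale of a base calibration (Form X).
# The parameter arrays below are hypothetical common (anchor) items.
import numpy as np

b_base = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])   # anchor difficulties, base scale
b_new  = np.array([-0.9, -0.1, 0.4, 1.1, 1.9])   # same items, new calibration
a_new  = np.array([1.1, 0.8, 1.3, 0.9, 1.0])     # new-calibration discriminations

# theta_base = A * theta_new + B, chosen so anchor difficulties match in mean/SD.
A = b_base.std(ddof=1) / b_new.std(ddof=1)
B = b_base.mean() - A * b_new.mean()

b_linked = A * b_new + B       # difficulties moved to the base scale
a_linked = a_new / A           # discriminations adjust inversely

print(f"A = {A:.3f}, B = {B:.3f}")
```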


Thursday, April 7, 2016 8:00 AM - 5:00 PM, Meeting Room 5, Meeting Room Level, Training Session, CC

21st Century Skills Assessment: Design, Development, Scoring, and Reporting of Character Skills Patrick Kyllonen and Jonas Bertling, Educational Testing Service

This workshop will provide training, discussion, and hands-on experience in developing methods for assessing, scoring, and reporting on students' social-emotional and self-management, or character, skills. The workshop will focus on (a) reviewing the kinds of character skills most important to assess based on current research; (b) standard and innovative methods for assessing character skills, including self-, peer-, teacher-, and parent-rating-scale reports, forced-choice (rankings), anchoring vignettes, and situational judgment methods; (c) cognitive lab approaches for item tryout; (d) classical and item response theory (IRT) scoring procedures (e.g., 2PL, partial credit, nominal response model); (e) validation strategies, including the development of rubrics and behaviorally anchored rating scales, and correlations with external variables; (f) the use of anchors in longitudinal growth studies; (g) reliability from classical test theory (alpha, test-retest), item response theory, and generalizability theory; and (h) reporting issues. These topics will be covered in the workshop where appropriate, but the sessions within the workshop will tend to be organized around item types (e.g., forced-choice, anchoring vignettes). Examples will be drawn from various assessments, including PISA, NAEP, SuccessNavigator, FACETS, and others. The workshop is designed for a broad audience of assessment developers, analysts, and psychometricians working in either applied or research settings.
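As a small illustration of one of the IRT scoring procedures named in (d), the sketch below computes category probabilities under the partial credit model. The step parameters and ability values are hypothetical, and the code is not part of the workshop materials.

```python
# Illustrative sketch: category probabilities under the partial credit model (PCM),
# one of the IRT scoring procedures listed above. Step parameters are hypothetical.
import numpy as np

def pcm_probs(theta: float, steps: np.ndarray) -> np.ndarray:
    """Return P(score = 0..K) for one item with K step parameters."""
    cumulative = np.concatenate([[0.0], np.cumsum(theta - steps)])
    expo = np.exp(cumulative - cumulative.max())      # stabilize the exponentials
    return expo / expo.sum()

steps = np.array([-1.0, 0.0, 1.2])                    # thresholds for a 4-category item
for theta in (-1.0, 0.0, 1.5):
    print(theta, np.round(pcm_probs(theta, steps), 3))
```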


Thursday, April 7, 2016 8:00 AM - 5:00 PM, Meeting Room 2, Meeting Room Level, Training Session, DD

Introduction to Standard Setting Chad Buckendahl, Alpine Testing Solutions; Jennifer Dunn, Measured Progress; Karla Egan, National Center for the Improvement of Educational Assessment; Lisa Keller, University of Massachusetts Amherst; Lee LaFond, Measured Progress

As states adopt new standards and assessments, the political expectations placed on psychometricians have been increasing. The purpose of this training session is to provide a practical introduction to the standard setting process while addressing common policy concerns and expectations.

This training will follow the Evidence-Based Standard Setting (EBSS) framework. The first third of the session will touch upon some of the primary pre-meeting developmental and logistical activities as well as the EBSS steps of defining outcomes and developing relevant research as guiding validity evidence.

The middle third of the session will focus on the events of the standard setting meeting itself. The session facilitators will walk participants through the phases of a typical standard setting, and participants will experience a training session on the Bookmark, Angoff, and Body of Work methods, followed by practice rating rounds with discussion.

The final third of the training session will give an overview of what happens following a standard setting meeting. This will be carried out through a panel discussion with an emphasis on policy expectations and the importance of continuing to gather evidence in support of the standard.


Thursday, April 7, 2016 8:00 AM - 5:00 PM, Meeting Room 16, Meeting Room Level, Training Session, EE

Analyzing NAEP Data Using Plausible Values and Marginal Estimation with AM Emmanuel Sikali, National Center for Education Statistics; Young Yee Kim, American Institutes for Research

Since results from the National Assessment of Educational Progress (NAEP) serve as a common metric for all states and select urban districts, many researchers are interested in conducting studies using NAEP data. However, NAEP data pose many challenges for researchers because of the assessment's special design features. This class provides analytic strategies and hands-on practice for researchers who are interested in NAEP data analysis. The class consists of two parts: (1) instruction on the psychometric and sampling designs of NAEP and the data analysis strategies required by these design features, and (2) demonstration of NAEP data analysis procedures and hands-on practice. The first part covers the marginal maximum likelihood estimation approach to obtaining scale scores and appropriate variance estimation procedures; the second part covers two approaches to NAEP data analysis, i.e., the plausible values approach and the marginal estimation approach with item response data. The demonstration and hands-on practice will be conducted with a free software program, AM, using a mini-sample public-use NAEP data file released in 2011. Intended participants are researchers, including graduate students, education practitioners, and policy analysts, who are interested in NAEP data analysis.
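To make the plausible-values logic concrete, the sketch below combines a statistic computed separately on each of five plausible values using Rubin's combining rules. The estimates and sampling variances are made-up numbers, not NAEP results; software such as AM automates this kind of combination.

```python
# Illustrative sketch: combining an estimate computed separately on each of
# five plausible values using Rubin's rules (hypothetical numbers, not NAEP results).
import numpy as np

estimates = np.array([252.1, 253.4, 251.8, 252.9, 252.5])   # e.g., subgroup means
sampling_vars = np.array([1.10, 1.05, 1.12, 1.08, 1.07])    # e.g., from jackknife/BRR

M = len(estimates)
point = estimates.mean()                          # combined point estimate
within = sampling_vars.mean()                     # average sampling variance
between = estimates.var(ddof=1)                   # variance across plausible values
total_var = within + (1 + 1 / M) * between        # Rubin's total variance

print(f"estimate = {point:.2f}, SE = {np.sqrt(total_var):.2f}")
```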


Thursday, April 7, 2016 8:00 AM - 5:00 PM, Meeting Room 4, Meeting Room Level, Training Session, FF

Multidimensional Item Response Theory: Theory and Applications and Software Lihua Yao, Defense Manpower Data Center; Mark Reckase, Michigan State University; Rich Schwarz, ETS

Theories and applications of multidimensional item response theory (MIRT) models, multidimensional computerized adaptive testing (MCAT), and MIRT linking are discussed. The software demonstrations and hands-on experience cover multidimensional multi-group calibration, multidimensional linking, and MCAT simulation. The session is intended for researchers who are interested in MIRT and MCAT.


Thursday, April 7, 2016 1:00 PM - 5:00 PM, Meeting Room 3, Meeting Room Level, Training Session, GG

New Weighting Methods for Causal Mediation Analysis Guanglei Hong, University of Chicago

Many important research questions in education relate to how interventions work. A mediator characterizes the hypothesized intermediate process. Conventional methods for mediation analysis generate biased results when the mediator-outcome relationship depends on the treatment condition. These methods also tend to have a limited capacity for removing confounding associated with a large number of covariates. This workshop teaches the ratio-of-mediator-probability weighting (RMPW) method for decomposing total treatment effects into direct and indirect effects in the presence of treatment-by-mediator interactions. RMPW is easy to implement and requires relatively few assumptions about the distribution of the outcome, the distribution of the mediator, and the functional form of the outcome model. We will introduce the concepts of causal mediation, explain the intuitive rationale of the RMPW strategy, and delineate the parametric and nonparametric analytic procedures. Participants will gain hands-on experience with a free stand-alone RMPW software program. We will also provide SAS, Stata, and R code and will distribute related readings. The target audience includes graduate students, early career scholars, and advanced researchers who are familiar with multiple regression and have had prior exposure to binary and multinomial logistic regression. Each participant will need to bring a laptop for hands-on exercises.
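The sketch below illustrates the core RMPW idea on simulated data: treated units are reweighted by the ratio of the mediator probability estimated under control to that estimated under treatment, so the weighted mean approximates the counterfactual outcome needed to separate direct from indirect effects. It is a bare-bones Python illustration under a randomized-treatment assumption, not the workshop's RMPW program or its SAS/Stata/R code; all data-generating values are arbitrary.

```python
# Illustrative sketch of ratio-of-mediator-probability weighting (RMPW) with a
# binary treatment T, binary mediator M, covariates X, and outcome Y.
# Data are simulated for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.5, n)
M = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * T + X[:, 0]))))
Y = 1.0 * T + 0.8 * M + 0.5 * T * M + X.sum(axis=1) + rng.normal(size=n)

# Model the mediator separately under treatment and control.
m_t = LogisticRegression().fit(X[T == 1], M[T == 1])
m_c = LogisticRegression().fit(X[T == 0], M[T == 0])

# For treated units, weight by P(M = m | X, control) / P(M = m | X, treated);
# the weighted treated outcomes estimate Y(1, M(0)), the counterfactual needed
# to separate the direct effect from the indirect (mediated) effect.
Xt, Mt, Yt = X[T == 1], M[T == 1], Y[T == 1]
p_t = m_t.predict_proba(Xt)[:, 1]
p_c = m_c.predict_proba(Xt)[:, 1]
w = np.where(Mt == 1, p_c / p_t, (1 - p_c) / (1 - p_t))

y1_m1 = Yt.mean()                       # E[Y(1, M(1))]
y1_m0 = np.average(Yt, weights=w)       # E[Y(1, M(0))], RMPW-weighted
y0_m0 = Y[T == 0].mean()                # E[Y(0, M(0))]

print(f"natural direct effect   ~ {y1_m0 - y0_m0:.2f}")
print(f"natural indirect effect ~ {y1_m1 - y1_m0:.2f}")
```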


Thursday, April 7, 2016 1:00 PM - 5:00 PM, Meeting Room 6, Meeting Room Level, Training Session, II

Computerized Multistage Adaptive Testing: Theory and Applications (Book by Chapman and Hall) Duanli Yan, Educational Testing Service; Alina von Davier, ETS; Kyung Chris Han

This workshop provides a general overview of computerized multistage test (MST) designs and their important concepts and processes. The focus of the workshop will be on MST theory and applications, including alternative scoring and estimation methods, classification tests, routing and scoring, linking, and test security, as well as a live demonstration of the MST software MSTGen (Han, 2013). This workshop is based on the edited volume by Yan, von Davier, and Lewis (2014). The volume is structured to take the reader through all the operational aspects of the test, from design to post-administration analyses. The training course consists of a series of lectures and hands-on examples in the following four sessions:

1. MST Overview, Design, and Assembly
2. MST Routing, Scoring, and Estimations
3. MST Applications
4. MST Simulation Software

The workshop describes the MST design, why it is needed, and how it differs from other test designs, such as linear tests and computerized adaptive test (CAT) designs.

This course is intended for people who have some basic understanding of item response theory and CAT.
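To give a flavor of the routing step covered in session 2, here is a toy routing rule for a 1-3-3 design in which the number-correct score on the Stage 1 module determines the Stage 2 module. The module names and cut scores are invented for illustration and do not come from MSTGen or the book.

```python
# Illustrative sketch (not MSTGen): a toy 1-3-3 multistage design where the
# number-correct score on the Stage 1 routing module sends the examinee to an
# easy, medium, or hard Stage 2 module. Cut scores here are arbitrary.
def route_stage2(stage1_number_correct: int, n_items: int = 10) -> str:
    """Return the Stage 2 module assignment for a routing-module score."""
    proportion = stage1_number_correct / n_items
    if proportion < 0.4:
        return "stage2_easy"
    elif proportion < 0.7:
        return "stage2_medium"
    return "stage2_hard"

if __name__ == "__main__":
    for score in (2, 5, 9):
        print(score, "->", route_stage2(score))
```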


Pre-Conference Training Sessions - Friday, April 8, 2016


Friday, April 8, 2016 8:00 AM - 12:00 PM, Renaissance West B, Ballroom Level, Training Session, JJ

Landing Your Dream Job for Graduate Students Deborah Harris and Xin Li, ACT, Inc.

This training session will address practical topics of interest to graduate students in measurement who are looking for a job and starting a career. It will concentrate on what to do now, while they are still in school, to best prepare for a job (including finding a dissertation topic, selecting a committee, maximizing experiences as a student through networking, internships, and volunteering, and answering questions about what types of coursework an employer looks for and what makes a good job talk); how to locate, interview for, and obtain a job (including how to find where jobs are and how to apply for them, with targeted cover letters, references, and resumes); what to expect in the interview process (including job talks, questions to ask, and negotiating an offer); and what comes next after starting the first post-PhD job (including adjusting to the environment, establishing a career path, publishing, finding mentors, balancing work and life, and becoming active in the profession). The session is interactive and geared toward addressing participants' questions during the session. Resource materials are provided on all relevant topics.


Friday, April 8, 2016 8:00 AM - 12:00 PM, Meeting Room 4, Meeting Room Level, Training Session, KK

Bayesian Analysis of IRT Models using SAS PROC MCMC Clement Stone, University of Pittsburgh

There is a growing interest in Bayesian estimation of IRT models, due in part to the appeal of the Bayesian paradigm as well as the advantages of these methods with small sample sizes, more complex models (e.g., multidimensional models), and simultaneous estimation of item and person parameters. Software such as SAS and WinBUGS has become available that makes a Bayesian analysis of IRT models more accessible to psychometricians, researchers, and scale developers.

SAS PROC MCMC offers several advantages over other software, and the purpose of this training session is to illustrate how SAS can be used to implement a Bayesian analysis of IRT models. After a brief review of Bayesian methods and IRT models, PROC MCMC is introduced. This introduction includes discussion of a template for estimating IRT models as well as convergence diagnostics and the specification of prior distributions. Also discussed are extensions for more complex models (e.g., multidimensional, mixture) and methods for comparing models and evaluating model fit.

The instructional approach combines lecture and demonstration. Considerable code and output will be discussed and shared. An overall objective is for attendees to be able to extend the examples to their own testing applications. Some understanding of SAS programs and SAS procedures is helpful.
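For readers new to the underlying machinery, the sketch below shows the MCMC idea in miniature: a random-walk Metropolis sampler for one examinee's ability under a Rasch model with known item difficulties and a standard normal prior. It is written in Python purely for illustration and is not PROC MCMC syntax; the responses, difficulties, and tuning constants are hypothetical.

```python
# Illustrative sketch of the MCMC idea behind Bayesian IRT estimation:
# random-walk Metropolis sampling of a single examinee's ability under a Rasch
# model with known item difficulties and a N(0, 1) prior. Values are hypothetical.
import numpy as np

rng = np.random.default_rng(42)
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.0, 1.5])     # item difficulties
y = np.array([1, 1, 1, 0, 1, 0])                   # observed item responses

def log_posterior(theta: float) -> float:
    p = 1 / (1 + np.exp(-(theta - b)))              # Rasch response probabilities
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    logprior = -0.5 * theta**2                      # N(0, 1) prior, up to a constant
    return loglik + logprior

theta, draws = 0.0, []
for _ in range(5000):
    proposal = theta + rng.normal(scale=0.5)        # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal                            # accept the proposal
    draws.append(theta)

posterior = np.array(draws[1000:])                  # discard burn-in
print(f"posterior mean = {posterior.mean():.2f}, SD = {posterior.std():.2f}")
```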


Friday, April 8, 2016 8:00 AM - 5:00 PM, Meeting Room 2, Meeting Room Level, Training Session, LL

flexMIRT®: Flexible Multilevel Multidimensional Item Analysis and Test Scoring Li Cai, University of California - Los Angeles; Carrie R. Houts, Vector Psychometric Group, LLC

There has been a tremendous amount of progress in item response theory (IRT) in the past two decades. flexMIRT® is IRT software that offers multilevel, multidimensional, and multiple-group item response models. flexMIRT® also offers users the ability to obtain recently developed model fit indices, fit diagnostic classification models, and fit models with non-normal latent densities, among other advanced features. This training session will introduce users to the flexMIRT® system and provide valuable hands-on experience with the software.


Friday, April 8, 2016 8:00 AM - 5:00 PM, Meeting Room 5, Meeting Room Level, Training Session, MM

Aligning ALDs and Item Response Demands to Support Teacher Evaluation Systems Steve Ferrara, Pearson School; Christina Schneider, The National Center for the Improvement of Educational Assessment

A primary goal of achievement tests is to classify students into achievement levels that enable inferences about student knowledge and skill. Explicating how knowledge and skills differ in complexity and empirical item difficulty, at the beginning of test design, is critical to those inferences. In this session we demonstrate, for experts in assessment design, standard setting, formative assessment, or teacher evaluation, how emerging practices in statewide tests for developing ALDs, training item writers to align items to ALDs, and identifying item response demands can be used to support teachers in developing student learning objectives (SLOs) in nontested grades and subjects. Participants will analyze ALDs, practice writing items aligned to those ALD response demands, and analyze classroom work products from teachers who used some of these processes to create SLOs. We will apply a framework for connecting ALDs (Egan et al., 2012), the ID Matching standard setting method (Ferrara & Lewis, 2012), and item difficulty modeling techniques (Ferrara et al., 2011; Schneider et al., 2013) to a process that generalizes from statewide tests to SLOs, thereby supporting construct validity arguments for student achievement indicators used for teacher evaluation.


Friday, April 8, 2016 8:00 AM - 5:00 PM, Renaissance East, Ballroom Level, Training Session, NN

Best Practices for Lifecycles of Automated Scoring Systems for Learning and Assessment Peter Foltz, Pearson; Claudia Leacock, CTB/McGraw Hill; André Rupp and Mo Zhang, Educational Testing Service

Automated scoring systems are designed to evaluate performance data in order to assign scores, provide feedback, and/or facilitate teaching-learning interactions. Such systems are used in K-12 and higher education in areas such as ELA, science, and mathematics, as well as in professional domains such as medicine and accounting, across various use contexts. Over the past 20 years, there has been rapid growth in research on the underlying theories and methods of automated scoring, the development of new technologies, and ways to implement automated scoring systems effectively. Automated scoring systems are developed by a diverse community of scholars and practitioners encompassing such fields as natural language processing, linguistics, speech science, statistics, psychometrics, educational assessment, and the learning and cognitive sciences. As the application of automated scoring continues to grow, it is important for the NCME community to have an overarching understanding of best practices for designing, evaluating, deploying, and monitoring such systems. In this training session, we provide participants with such an understanding via a mixture of presentations, individual and group-level discussions, and structured and free-play demonstration activities. We utilize systems that are both proprietary and freely available, and we provide participants with resources that empower them in their own future work.


Friday, April 8, 2016 8:00 AM - 5:00 PM, Meeting Room 3, Meeting Room Level, Training Session, OO

Test Equating Methods and Practices Michael Kolen and Robert Brennan, University of Iowa

The need for equating arises whenever a testing program uses multiple forms of a test that are built to the same specifications. Equating is used to adjust scores on test forms so that the scores can be used interchangeably. The goals of the session are for attendees to be able to understand the principles of equating, to conduct equating, and to interpret the results of equating in reasonable ways. The session focuses on conceptual issues; practical issues are also considered.
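To make the adjustment concrete, the sketch below applies linear and equipercentile equating to two simulated number-correct score distributions under a random-groups design. The score data are invented, and the functions are deliberately simplified relative to the smoothing and standard-error machinery a full treatment covers.

```python
# Illustrative sketch: linear and equipercentile equating of Form Y onto Form X
# under a random-groups design. Score distributions are simulated, not operational.
import numpy as np

rng = np.random.default_rng(7)
x_scores = rng.binomial(40, 0.62, size=5000)   # Form X number-correct scores
y_scores = rng.binomial(40, 0.58, size=5000)   # Form Y number-correct scores

def linear_equate(y, x_ref, y_ref):
    """Place score y on the Form X scale by matching means and SDs."""
    slope = x_ref.std(ddof=1) / y_ref.std(ddof=1)
    return x_ref.mean() + slope * (y - y_ref.mean())

def equipercentile_equate(y, x_ref, y_ref):
    """Place score y on the Form X scale by matching percentile ranks."""
    pr = (y_ref <= y).mean()                   # percentile rank of y on Form Y
    return np.quantile(x_ref, pr)              # Form X score with the same rank

for y in (15, 25, 35):
    print(y, round(linear_equate(y, x_scores, y_scores), 2),
          round(equipercentile_equate(y, x_scores, y_scores), 2))
```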


Friday, April 8, 2016 8:00 AM - 5:00 PM, Renaissance West A, Ballroom Level, Training Session, PP

Diagnostic Measurement: Theory, Methods, Applications, and Software Jonathan Templin and Meghan Sullivan, University of Kansas

Diagnostic measurement is a field of psychometrics that focuses on providing actionable feedback from multidimensional tests. This workshop provides a hands-on introduction to the terms, techniques, and methods used for diagnosing what students know, thereby giving researchers access to information that can be used to guide decisions regarding students' instructional needs. Upon completion of the workshop, participants will be able to understand the rationale and motivation for using diagnostic measurement methods. Furthermore, participants will be able to understand the types of data typically used in diagnostic measurement, along with the information that can be obtained from implementing diagnostic models. Participants will become well versed in the state-of-the-art techniques currently used in practice and will be able to use and estimate diagnostic measurement models using new software developed by the instructor.
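As a pointer to what such diagnostic models look like, the sketch below evaluates item-correct probabilities under the DINA model for a hypothetical Q-matrix and slip/guess values. It is a generic textbook illustration, not the instructor's software.

```python
# Illustrative sketch: the DINA ("deterministic inputs, noisy and") model, a
# common starting point in diagnostic classification modeling. The Q-matrix,
# attribute profiles, and slip/guess parameters below are hypothetical.
import numpy as np

Q = np.array([[1, 0],      # item 1 requires attribute 1 only
              [0, 1],      # item 2 requires attribute 2 only
              [1, 1]])     # item 3 requires both attributes
slip = np.array([0.10, 0.15, 0.20])    # P(incorrect | all required attributes mastered)
guess = np.array([0.20, 0.25, 0.15])   # P(correct | some required attribute missing)

def dina_correct_prob(alpha: np.ndarray) -> np.ndarray:
    """P(correct response) for each item given a binary attribute profile alpha."""
    eta = np.all(alpha >= Q, axis=1).astype(float)   # 1 if all required attributes mastered
    return (1 - slip) * eta + guess * (1 - eta)

print(dina_correct_prob(np.array([1, 0])))   # masters attribute 1 only
print(dina_correct_prob(np.array([1, 1])))   # masters both attributes
```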


Friday, April 8, 2016 1:00 PM - 5:00 PM, Renaissance West B, Ballroom Level, Training Session, QQ

Effective Item Writing for Valid Measurement Anthony Albano, University of Nebraska-Lincoln; Michael Rodriguez, University of Minnesota-Twin Cities

In this training session, participants will learn to write and critique high-quality test items by implementing item-writing guidelines and validity frameworks for item development. Educators, researchers, test developers, and other test users are encouraged to participate.

Following the session, participants should be able to: implement empirically based guidelines in the item writing process; describe procedures for analyzing and validating items; apply item-writing guidelines in the development of their own items; and review items from peers and provide constructive feedback based on adherence to the guidelines. The session will consist of short presentations with small-group and large-group activities. Materials will be contextualized within common testing applications (e.g., classroom assessment, response to intervention, progress monitoring, summative assessment, entrance examination, licensure/certification).

Participants are encouraged to bring a laptop computer, as they will be given access to a web application that facilitates collaboration in the item-writing process; those participating in the session in person and remotely will use the application to create and comment on each other's items online. This practice in item writing will allow participants to demonstrate understanding of what they have learned and receive feedback on their items from peers and the presenters.


Friday, April 8, 2016 3:00 PM - 7:00 PM, Meeting Room 11, Meeting Room Level

NCME Board of Directors Meeting

Members of NCME are invited to attend as observers.


Friday, April 8, 2016 4:30 PM - 6:30 PM, Fado’s Irish Pub (Graduate Students only)

Graduate Student Social

Come enjoy FREE appetizers at a local venue within walking distance of the conference hotels. The first 50 graduate student attendees receive one free drink ticket. Exchange research interests and discuss your work with fellow graduate students from NCME & AERA Division D.

Fado’s Irish Pub is located at 808 7th Street NW, Washington, DC 20001


Friday, April 8, 2016 6:30 PM -10:00 PM, Ballroom C, Level Three, Convention Center

AERA Centennial Symposium & Centennial Reception

The Centennial Annual Meeting's Opening Session and Reception will celebrate AERA's 100-year milestone in grand style. Together, the elements of this energizing and dynamic opening session will commemorate the association's history, highlight the breadth and unity of the field of education research as it has evolved around the world, and begin to explore second-century pathways for advancing AERA's mission.

The centerpiece of the opening plenary session will be a "Meet the Press"-style Power Panel and Town Hall discussion that takes a critical look at the current "State of the Field" for education research – taking stock of its complex history and imagining its future. The reception that follows will be an elegant and festive party for members and friends of AERA.


Annual Meeting Program - Saturday, April 9, 2016


Saturday, April 9, 2016 6:30 AM - 7:30 AM, Meeting Room 7, Meeting Room Level

Sunrise Yoga

Please join us for the second NCME Sunrise Yoga. We will start promptly at 6:30 a.m. for one hour at the Renaissance. Advance registration is required ($10) to reserve your mat. NO EXPERIENCE NECESSARY. Just bring your body and your mind, and our friends from Flow Yoga Center (http://www.flowyogacenter.com/) will do the rest. Namaste.


Saturday, April 9, 2016 8:15 AM - 10:15 AM, Renaissance East, Ballroom Level, Invited Session, A1

NCME Book Series Symposium: The Challenges to Measurement in an Era of Accountability Session Chair: Henry Braun, Boston College Session Discussants: Suzanne Lane, University of Pittsburgh; Scott Marion, National Center for the Improvement of Educational Assessment

This symposium draws on The Challenges to Measurement in an Era of Accountability, a recently published volume in the new NCME Book Series. The volume addresses a striking imbalance: although it is not possible to calculate test-based indicators (e.g., value-added scores or mean growth percentiles) for more than 70 percent of teachers, assessment and accountability issues in those other subject/grade combinations have received comparatively little attention in the research literature. The book brought together experts in educational measurement, as well as those steeped in the various disciplines, to provide a comprehensive and accessible guide to the measurement of achievement in a broad range of subjects, with a primary focus on the high school grades. The five focal presentations will offer discipline-specific perspectives from the social sciences, world languages, the performing arts, the life sciences, and the physical sciences. Each presentation will include a brief review of assessment (both formative and summative) in the discipline, with particular attention to the unique circumstances faced by teachers and measurement specialists responsible for assessment design and development, followed by a survey of current assessment initiatives and responses to accountability pressures. The symposium offers the measurement community a unique opportunity to learn about assessment practices and challenges across the disciplines.

Use of Evidence Centered Design in Assessment of History Learning Kadriye Ercikan, University of British Columbia; Pamela Kaliski, College Board

Assessment Issues in World Languages Meg Malone, Center for Applied Linguistics; Paul Sandrock, American Council on the Teaching of Foreign Languages

Arts Assessment in an Age of Accountability: Challenges and Opportunities in Implementation, Design, and Measurement Scott Shuler, Connecticut Department of Education (retired); Tim Brophy, University of Florida; Robert Sabol, Purdue University

Assessing the Life Sciences: Using Evidence-Centered Design for Accountability Purposes Daisy Rutstein and Britte Cheng, SRI International

Assessing Physical and Earth and Space Science in the Context of the NRC Framework for K-12 Science Education and the Next Generation Science Standards Nathaniel Brown, Boston College


Saturday, April 9, 2016 8:15 AM - 10:15 AM, Renaissance West A, Ballroom Level, Coordinated Session, A2

Collaborative Problem Solving Assessment: Challenges and Opportunities Session Chairs: Yigal Rosen, Pearson; Lei Liu, ETS Session Discussant: Samuel Greiff, University of Luxembourg

Collaborative problem solving (CPS) is a critical competency for college and career readiness. Students emerging from schools into the workforce and public life will be expected to have CPS skills as well as the ability to perform that collaboration in various group compositions and environments (Griffin, Care, & McGaw, 2012; OECD, 2013). Recent curriculum and instruction reforms have focused to a greater extent on teaching and learning CPS (National Research Council, 2012; OECD, 2012). However, structuring standardized computer-based assessment of CPS skills, specifically for large-scale assessment programs, is challenging. In this symposium a spectrum of approaches for collaborative problem solving assessment will be introduced, and four papers will be presented and discussed.

PISA 2015 Collaborative Problem Solving Assessment Framework Art Graesser, University of Memphis

Human-To-Agent Approach in Collaborative Problem Solving Assessment Yigal Rosen, Pearson

Collaborative Problem Solving Assessment: Bring Social Aspect into Science Assessment Lei Liu, Jiangang Hao, Alina von Davier and Patrick Kyllonen, ETS

Assessing Collaborative Problem Solving: Students’ Perspective Haggai Kupermintz, University of Haifa


Saturday, April 9, 2016 8:15 AM - 10:15 AM, Renaissance West B, Ballroom Level, Coordinated Session, A3

Harnessing Technological Innovation in Assessing English Learners: Enhancing Rather Than Hindering Session Chair: Dorry Kenyon, Center for Applied Linguistics Session Discussant: Mark Reckase, Michigan State University

How do English Learners (ELs) interact with technology in large-scale testing? In this coordinated session, an interdisciplinary team from the Center for Applied Linguistics presents findings from four years of research and development for the WIDA Consortium. For nine years, WIDA has offered an annual paper-and-pencil assessment of developing academic English language proficiency (ELP), known as ACCESS for ELLs, used to assess over 1 million ELs in 36 states. With federal funding, WIDA and its partners have transitioned this assessment to a web-based assessment, ACCESS 2.0, now in its first operational year (2015-2016). ACCESS 2.0 is used to assess ELs at all levels of English language development, from grades 1 to 12, and to assess all four language domains (listening, speaking, reading, and writing). Thus, the research and development activities covered multiple critical issues pertaining to ELs and technology in large-scale assessments. In this session, we share research findings from several interrelated perspectives, including improving accuracy of measurement, developing complex web-based performance-assessment tasks, and familiarity with technology in the EL population, including keyboarding and interfacing with technology-enhanced task types. These findings provide insight into the valid assessment of ELs using technology for a wide variety of uses.

Keyboarding and the Writing Construct for ELs Jennifer Renn and Jenny Dodson, Center for Applied Linguistics

Supporting Extended Discourse Through a Computer-Delivered Assessment of Speaking Megan Montee and Samantha Musser, Center for Applied Linguistics

Using Multistage Testing to Enhance Measurement David MacGregor and Xin Yu, Center for Applied Linguistics

Enhanced Item Types: Engagement or Unnecessary Confusion for ELs? Jennifer Norton and Justin Kelly, Center for Applied Linguistics


Saturday, April 9, 2016 8:15 AM - 10:15 AM, Meeting Room 3, Meeting Room Level, Paper Session, A4

How Can Assessment Inform Classroom Practice? Session Discussant: Priya Kannan, ETS

What Score Report Features Promote Accurate Remediation? Insights from Cognitive Interviews Francis Rick, University of Massachusetts, Amherst; Amanda Clauser, National Board of Medical Examiners

Cognitive interviews were conducted with medical students interacting with score reports to investigate what content and design features promote adequate interpretations and remediation decisions. Transcribed "speech bursts" were coded based on pre-established categories, which were then used to evaluate the effectiveness of each report format.

Evaluating the Degree of Coherence Between Instructional Targets and Measurement Models Lauren Deters, Lori Nebelsick-Gullet, Charlene Turner, Bill Herrera and Elizabeth Towles, edCount, LLC

To solidify the links between the instructional and measurement contexts for its overall assessment system, the National Center and State Collaborative investigated the degree of coherence among the system's measurement targets, learning expectations, and targeted long-range outcomes. This study provides evidence for the system's coherence across instruction and assessment contexts.

Modeling the Instructional Sensitivity of Polytomous Items Alexander Naumann and Johannes Hartig, German Institute for International Educational Research (DIPF); Jan Hochweber, University of Teacher Education St. Gallen (PHSG)

We propose a longitudinal multilevel IRT model for the instructional sensitivity of polytomous items. The model permits evaluation of global and differential sensitivity based on average change and variation of change in classroom-specific item locations and thresholds. Results suggest that the model performs well in its application to empirical data.

Growth Sensitivity and Standardized Assessments: New Evidence on the Relationship Shalini Kapoor, ACT; Catherine Welch and Steve Dunbar, Iowa Testing Programs/University of Iowa

Academic growth measurement requires structured feedback that informs not only what students know but also what they need to know to learn and grow. This research proposes a method that can support the generation of content-related growth feedback, which can help tailor classroom instruction to student-specific needs.

Using Regression-Based Growth Models to Inform Learning with Multiple Assessments Ping Yin and Dan Mix, Curriculum Associates

This study evaluates the feasibility of two types of regression-based growth models to inform student learning using a computer adaptive assessment administered multiple times throughout a school year. With the increased interest in using assessment to inform instruction and learning, it is important to evaluate whether current growth models can support such goals.


Saturday, April 9, 2016 8:15 AM - 10:15 AM, Meeting Room 4, Meeting Room Level, Coordinated Session, A5

Enacting a Learning Progression Design to Measure Growth Session Chair: Damian Betebenner, National Center for the Improvement of Educational Assessment

The concept of growth is at the foundation of policy and practice around systems of educational accountability. Yet there is a disconnect between the criterion-referenced intuitions that parents and teachers have for what it means for students to demonstrate growth and the primarily norm-referenced metrics that are used to infer growth. One way to address this disconnect would be to develop vertically linked score scales that could be used to support both criterion-referenced and norm-referenced interpretations, but this hinges upon having a coherent conceptualization of what it is that is growing from grade to grade. The purpose of this session is to facilitate debate about the design of large-scale assessments for the intended purpose of drawing inferences about student growth, a topic that was recently the subject of a 2015 focus article and commentaries in the journal Measurement. A learning-progression approach to the conceptualization of growth and the subsequent design of a vertical score scale will be described in the context of student understanding of proportional reasoning, a big-picture idea from the Common Core State Standards for Mathematics. Subsequent presentations and discussion will focus on the pros and cons of the proposed approach and of other possible alternatives.

Using Learning Progressions to Design Vertical Scales Derek Briggs and Fred Peck, University of Colorado

Challenges in Modeling and Measuring Learning Progressions Jere Confrey, Ryan Seth Jones, and Garron Gianopulos, North Carolina State University

The Importance of Content-Referenced Score Interpretations Scott Marion, National Center for the Improvement of Educational Assessment

Challenges on the Path to Implementation Joseph Martineau and Adam Wyse, National Center for the Improvement of Educational Assessment

Growth Through Levels David Thissen, University of North Carolina


Saturday, April 9, 2016 8:15 AM - 10:15 AM, Meeting Room 5, Meeting Room Level, Paper Session, A6

Testlets and Multidimensionality in Adaptive Testing Session Discussant: Chun Wang, University of Minnesota

Measuring Language Ability of Students with Compensatory MCAT: a Post-Hoc Simulation Study Burhanettin Özdemir and Selahattin Gelbal, Hacettepe University

The purpose of this study is to determine the most suitable multidimensional CAT design for measuring the language ability of students and to compare paper-and-pencil test outcomes with those of the new MCAT designs. A real data set from an English proficiency test was used to create an item pool consisting of 565 items.

Multidimensional CAT Classification Method for Composite Scores Lihua Yao and Dan Segall, Defense Manpower data center

The current research proposes an item selection method using cut points on the composite score for classification purposes in the multidimensional CAT framework. The classification accuracy of the composite score for the proposed method is compared with that of other existing MCAT methods.

Two Bayesian Online Calibration Methods in Multidimensional Computerized Adaptive Testing Ping Chen, Normal University

To solve the non-convergence issue in M-MEM (Chen & Xin, 2013) and improve calibration precision, this study combined Bayes modal estimation (BME; Mislevy, 1986) with M-OEM and M-MEM to make full use of prior information, and proposed two Bayesian online calibration methods for MCAT (M-OEM-BME and M-MEM-BME).

Item Selection in Testlet-Based CAT Mark Reckase and Xin Luo, Michigan State University

Research on item selection in testlet-based CAT is rare. This study compared three item selection approaches (one based on a polytomous model and two on dichotomous models) and investigated factors that might influence the effectiveness of CAT. The three approaches obtained similar measurement accuracy but different exposure rates.

Effects of Testlet Characteristics on Estimating Abilities in Testlet-Based CAT Seohong Pak, University of Iowa; Hong Qian and Xiao Luo, NCSBN

The testlet selection methods, testlet sizes, degrees of variation in item difficulties within each testlet, and degrees of testlet random effect were investigated under testlet-based CAT. The 48 conditions were each run 50 times using R, and results were compared based on measurement accuracy and decision accuracy.

Computerized Mastery Testing (CMT) Without the Use of Item Response Theory Sunhee Kim and Adena Lebeau, Prometric; Tammy Trierweiler, Law School Admissions Council (LSAC); F. Jay Breyer and Charles Lewis, Educational Testing Service; Robert Smith, Smith Consulting

This study demonstrates that CMT can be successfully implemented when testlets are constructed using classical item statistics in a real-world application. As CMT is easier to implement and more cost-efficient than CAT test designs, credentialing programs that have small samples and item pools may benefit from this approach.


Saturday, April 9, 2016 8:15 AM - 10:15 AM, Meeting Room 12, Meeting Room Level, Paper Session, A7

Methods for Examining Local Item Dependence and Multidimensionality Session Discussant: Ki Matlock, Oklahoma State University

Examining Unidimensionality with Parallel Analysis on Residuals Tianshu Pan, Pearson

The current study compares the performance of several parallel analysis (Horn, 1965) procedures for checking the unidimensionality of simulated unidimensional and multidimensional data: Reckase's (2009) procedure, Drasgow and Lissak's (1983) procedure, regular parallel analysis, and a new parallel analysis procedure proposed in this study.
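For readers unfamiliar with the technique being extended here, the sketch below runs a generic Horn-style parallel analysis on simulated item data, retaining factors whose observed eigenvalues exceed the 95th percentile of eigenvalues from random data. It is a baseline illustration only, not the new residual-based procedure the paper proposes, and all simulation settings are arbitrary.

```python
# Illustrative sketch of parallel analysis (Horn, 1965): compare observed eigenvalues
# of the item correlation matrix with eigenvalues from random data of the same size.
import numpy as np

rng = np.random.default_rng(3)
n_persons, n_items = 1000, 10
theta = rng.normal(size=(n_persons, 1))
data = (theta + rng.normal(size=(n_persons, n_items)) > 0).astype(float)  # toy 1-factor items

obs_eigs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

n_reps = 100
rand_eigs = np.empty((n_reps, n_items))
for r in range(n_reps):
    random_data = rng.normal(size=(n_persons, n_items))
    rand_eigs[r] = np.sort(np.linalg.eigvalsh(np.corrcoef(random_data, rowvar=False)))[::-1]

threshold = np.percentile(rand_eigs, 95, axis=0)   # 95th percentile of random eigenvalues
n_factors = int(np.sum(obs_eigs > threshold))
print("factors retained:", n_factors)
```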

A Conditional IRT Model for Directional Local Item Dependency in Multipart Items Dandan Liao, Hong Jiao and Robert Lissitz, University of Maryland, College Park

A multipart item consists of two related questions, which potentially introduces conditional local item dependence (LID) between the two parts of the item. This paper proposes a conditional IRT model for directional LID in multipart items and compares different approaches to modeling LID in terms of parameter estimation through a simulation study.

Fit Index Criteria in Confirmatory Factor Analysis Models Used by Measurement Practitioners Anne Corinne Huggins-Manley and HyunSuk Han, University of Florida

Measurement practitioners often use CFA models to assess unidimensionality and local independence in test data. Current guidelines for assessing the fit of CFA models are possibly inappropriate because they were not developed under measurement-oriented conditions. This study provides CFA fit index cutoff recommendations for evaluating IRT model assumptions.

Multilevel Bi-Factor IRT Models for Wording Effects Xiaorui Huang, East China Normal University

A multilevel bi-factor IRT model was developed to account for wording effects in mixed-format scales and multilevel data structures. Simulation studies demonstrated good parameter recovery for the new model and underestimation of SEs when multilevel data structures were ignored. An empirical example is provided.

A Generalized Multinomial Error Model for Tests That Violate Conditional Independence Assumptions Benjamin Andrews, ACT

A generalized multinomial error model is presented that allows for dependency among vectors of item responses. This model can be used in instances where polytomous items are related to the same passage or if responses are rated on several different traits. Examples and comparisons to G theory methods are discussed.

Both Local Item Dependencies and Cut-Point Location Impact Examinee Classifications Jonathan Rubright, American Institute of Certified Public Accountants

This simulation study demonstrates that the strength of local item dependencies and the location of an examination's cut-point both influence the sensitivity and specificity of examinee classifications under unidimensional IRT. Practical implications are discussed in terms of false positives and false negatives of test takers.


Saturday, April 9, 2016 10:35 AM - 12:05 PM, Renaissance East, Ballroom Level, Coordinated Session, B1

The End of Testing as We Know It? Session Chair: Randy Bennett, ETS Session Presenters: Randy Bennett, ETS; Joan Herman, UCLA-CRESST; Neal Kingston, University of Kansas

The rapid evolution of technology is affecting all aspects of our lives: commerce, communication, leisure, and education. Activities like travel planning, news consumption, and music purchasing have been so dramatically affected as to have caused significant shifts in how services and products are packaged, marketed, distributed, priced, and sold. Those shifts have been dramatic enough to have substantially reduced the influence of once-staple products like newspapers and the companies that provide them.

Technology has come to education and educational testing too, though more slowly than to other areas. Still, there is growing evidence that the future for these fields will be considerably different and that those differences will emerge quickly. Billions of dollars are being invested in new technology-based products and services for K-12 as well as higher education, huge amounts of student data are being collected through these offerings, tests are moving to digital delivery and are substantially changed in the process, and the upheaval that has occurred in other industries may come to education too.

What will and won't change for educational testing? This panel presentation will include three speakers, each offering a different scenario for the future of K-12 assessment.


Saturday, April 9, 2016 10:35 AM - 12:05 PM, Renaissance West A, Ballroom Level, Coordinated Session, B2

Fairness and Machine Learning for Educational Practice Session Chair: Alina von Davier Session Moderator: Jill Burstein Session Panelists: Nitin Madnani and Aoife Cahill, Educational Testing Service; Solon Barocas, Princeton University; Brendan O’Connor, University of Massachusetts Amherst; James Willis, Indiana University

This panel will address issues around fairness and transparency in the application of ML to education, in particular to learning and assessment. Panelists will include experts in NLP, computational psychometrics (CP), and education technology policy and ethics. Panelists will respond to questions such as:

1. Are data-driven methods used alone ever OK?
2. Are there use cases that are more acceptable than others from a fairness perspective?
3. Are there examples from other domains that we may apply to educational assessment?
4. In the case of scoring written essays: What is the difference between human raters and ML methods? For human raters, at least in writing, we know what they 'are supposed to consider' but not what they actually choose or how they weight it. For ML methods, we actually 'know' what features go in, but the weightings and predictive modeling can be black-box-like. Is this any less true for human raters?
5. Under what conditions is interpretability important? For instance, how do we isolate diagnostic information if we use ML for predicting learning outcomes?
6. Can we detect underlying bias in large data sets from education? If we identify bias, is it acceptable to adjust the ML algorithms to eliminate it? Can these adjustments be misused?
7. What type of evaluation methods should one employ to ensure that the results are fair to all groups?

The moderator will lead the panel by presenting questions to the panel and managing the discussion. The panel discussion will be 60 minutes, and there will be an additional 30 minutes intended for questions and discussion with the audience.


Saturday, April 9, 2016 10:35 AM - 12:05 PM, Renaissance West B, Ballroom Level, Coordinated Session, B3

Item Difficulty Modeling: From Theory to Practice Session Chair: Isaac I. Bejar Session Discussant: Steve Ferrara

Item difficulty modeling (IDM) is concerned both with understanding the variability in estimated item difficulty and with explanatory item response modeling that incorporates difficulty covariates. The symposium starts with an overview of the multiple applications of difficulty modeling, ranging from purely theoretical to practical applications. The following presentations then focus on empirical research on the modeling of mathematics items used in K-12 and graduate admissions assessments. Specifically, the following research will be presented:

• The use of a validated IDM for generating items by means of family and structural variants

• The multidisciplinary development of an IDM for practical day-to-day application

• Evaluation of the feasibility of automating the propositional analysis of existing items to study the role of linguistic variables on item difficulty

• Fitting an explanatory IRT model that extends the LLTM by fixing residuals to fully account for difficulty

An Overview of the Purposes of Item Difficulty Modeling (IDM) Isaac Bejar, ETS

Implications of Item Difficulty Modeling for Item Design and Item Generation Susan Embretson, Georgia Institute of Technology

Developing an Item Difficulty Model for Quantitative Reasoning: A Knowledge Elicitation Approach Edith Aurora Graf, ETS

Exploring an Automated Approach to Examining Linguistic Context in Mathematics Items Kristin Morrison, Georgia Institute of Technology

An Explanatory Model for Item Difficulties with Fixed Residuals Paul De Boeck, Ohio State University

Saturday, April 9, 2016 10:35 AM - 12:05 PM, Meeting Room 3, Meeting Room Level, Paper Session, B4

Growth and Vertical Scales Session Discussant: Anna Topczewski, Pearson

Estimating Vertical Scale Drift Due to Repetitious Horizontal Equating Emily Ho, Michael Chajewski and Judit Antal, College Board

The stability of a vertical scale as a function of repeated administrations is rarely studied . Our empirical simulation uses 2PL math items, generating forms for test-takers from three grades . We examine the effect of ability, test difficulty, and equating designs on vertical scale stability when applying repetitious horizontal equating .

An EIRM Approach for Studying Latent Growth in Alphabet Knowledge Among Kindergarteners Xiaoxin Wei, American Institutes for Research; Patrick Meyer and Marcia Invernizzi, University of Virginia

We applied a series of latent growth explanatory item response models to study growth in alphabet knowledge over three time points . Models allowed for time-varying item parameters and evaluated the impact of person properties on growth . Results show that growth differs by examinee group in expected and unexpected ways .

Vertical Scaling and Item Location: Generalizing from Horizontal Linking Designs Stephen Murphy, Rong Jin, Bill Insko and Sid Sharairi, Houghton Mifflin Harcourt

Establishing a vertical scale for an assessment involves many practical decisions . Outcomes of these decisions are essential to valid interpretations of student growth and teacher effectiveness (Briggs, Weeks, & Wiley, 2008) . This study adds to the existing literature by examining the impact of item location on the vertical scale .

Predictive Accuracy of Model Inferences for Longitudinal Data with Self-Selection Tyler Matta, Yeow Meng Thum and Quinn Lathrop, Northwest Evaluation Association

Conventional approaches to characterizing classification accuracy are not valid when data are subject to self-selection . We introduce predictive accuracy, a framework that appropriately accounts for the impact of nonignorable missing data . We provide an illustration using longitudinal assessment data to predict college readiness when college test takers are self-selected .

Saturday, April 9, 2016 10:35 AM - 12:05 PM, Meeting Room 4, Meeting Room Level, Paper Session, B5

Perspectives on Validation Session Discussant: Mark Shermis, University of Houston-Clear Lake

Using a Theory of Action to Ensure High Quality Tests Cathy Wendler, Educational Testing Service

A theory of action helps testing programs ensure high quality tests by documenting claims, determining evidence needed to support those claims, and creating solutions to address unintended consequences . This presentation describes the components of a theory of action and how it is being used to evaluate and improve programs .

Teacher Evaluation Systems: Mapping a Validity Argument Tia Sukin and W. Nicewander, Pacific Metrics; Phoebe Winter, Consultant, Assessment Research and Development

Providing validation evidence for teacher evaluation systems is a complex and historically neglected task . This paper provides a framework for building an argument for the use of comprehensive teacher evaluation systems, one that allows for the identification of possible weaknesses in the system that need to be addressed .

Validity Evidence to Support Alternate Assessment Score Uses: Fidelity and Response Processes Meagan Karvonen, Russell Swinburne Romine and Amy Clark, University of Kansas

Validity of score interpretations and uses for new online alternate assessments for students with significant cognitive disabilities (AA-AAS) requires new sources of evidence about student and teacher actions during the test administration process . We present findings from student cognitive labs, teacher cognitive labs, and test administration observations for an AA-AAS .

Communicating Psychometric Research to Policymakers Andrea Lash and Mary Peterson, WestEd; Benjamin Hayes, Washoe County School District

Policymakers’ implicit assumptions about assessment data inform their designs of educator evaluation systems . How can psychometricians help policymakers evaluate the validity of their assumptions? We examine a two-year effort in one state using a model of science communication for political contexts and an argument-based validation framework .

Saturday, April 9, 2016 10:35 AM - 12:05 PM, Meeting Room 5, Meeting Room Level, Paper Session, B6

Model Fit Session Discussant: Matthew Johnson, Teachers College

Evaluation of Item Response Theory Item-Fit Indices Adrienne Sgammato and John Donoghue, Educational Testing Service

The performance of Pearson chi-square and likelihood ratio item-level model fit indices based on observed data was evaluated in the presence of complex sampling of items (i .e ., BIB sampling) . Distributional properties, Type I error, and power of these measures were examined .

Rethinking Complexity in Item Response Theory Models Wes Bonifay, University of Missouri

The notion of complexity commonly refers to the number of freely estimated parameters in a model . An investigation of five popular measurement models suggests that complexity in IRT should be defined not by the number of parameters, but instead by the functional form of the model .

Measures for Identifying Non-Monotonically Increasing Item Response Functions Nazia Rahman and Peter Pashley, Law School Admission Council; Charles Lewis, Educational Testing Service

This study explored statistical measures as bases for defining robust criteria for checking for non-monotonicity in multiple-choice tests; these measures may be considered analogous to effect size measures . The three methods adapted to identify non-monotonicity in items were Mokken’s scalability coefficient, isotonic regression analysis, and a nonparametric smooth regression method .
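For orientation, one of the measures named here, Mokken's item-pair scalability coefficient, is commonly defined as follows (this is standard background, not a formula taken from the paper):

    H_{ij} = 1 - \frac{F_{ij}}{E_{ij}}

where F_{ij} is the observed number of Guttman errors for the item pair and E_{ij} the number expected under marginal independence. Under monotone item response functions these coefficients are nonnegative, so negative values are one signal of possible non-monotonicity; how the authors adapt this and the regression-based methods into effect-size-like criteria is described in the session.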

Evaluation of Limited Information IRT Model-Fit Indices Applied to Complex Item Samples John Donoghue and Adrienne Sgammato, Educational Testing Service

Recently, “limited information” measures of model fit (computed from low-order margins of the item response data) have been suggested . We examined the performance of these indices in the presence of complex sampling of items (i .e ., BIB sampling) . Distributional properties, Type I error and power of these measures were evaluated .

Saturday, April 9, 2016 10:35 AM - 12:05 PM, Meeting Room 12, Meeting Room Level, Paper Session, B7

Simulation- and Game-Based Assessments Session Discussant: José Pablo González-Brenes, Pearson

Aligning Process, Product and Survey Data: Bayes Nets for a Simulation-Based Assessment Tiago Caliço, University of Maryland; Vandhana Mehta and Martin Benson, Cisco Networking Academy; André Rupp, Educational Testing Service

Simulation-based assessments yield product and process data that can potentially allow for more comprehensive measurement of competencies and factors that affect these competencies . We discuss the iterative construction of student characterizations (personae) and elucidate the methodological implications for successfully putting the evidence-centered design process into practice .

Practical Consequences of Static, Dynamic, or Hierarchical Bayesian Networks in Game-Based Assessments Maria Bertling, Harvard University; Katherine Furgol Castellano, Educational Testing Service

There is a growing interest in using Bayesian approaches for analyzing data from game-based assessments (GBAs) . This paper describes the process of developing a measurement model for an argumentation game and demonstrates analytical and practical consequences of using different types of Bayesian networks as a scaling method for GBAs .

Impact of Feedback Within Technology Enhanced Items on Perseverance and Performance Stacy Hayes, Chris Meador and Karen Barton, Discovery Education

This research explores the impact of formative feedback within technology enhanced items (TEIs) embedded in a digital mathematics techbook in which students are permitted multiple attempts . Exploratory analyses will investigate patterns of student performance by time on task, type of feedback, item type, misconception, construct complexity, and persistence .

Framework for Feedback and Remediation with Electronic Objective Structured Clinical Examinations Hollis Lai, Vijay Daniels, Mark Gierl, Tracey Hillier and Amy Tan, University of Alberta

The Objective Structured Clinical Examination (OSCE) is popular in health professions education but cannot provide student feedback and guidance . As OSCEs migrate into an electronic format, the purpose of our paper is to demonstrate a framework that integrates the myriad data sources captured in an OSCE to provide student feedback .

Saturday, April 9, 2016 10:35 AM - 12:05 PM, Meeting Room 10, Meeting Room Level, Paper Session, B8

Test Security and Cheating Session Discussant: Dmitry Belov, Law School Admission Council

Applying Three Methods for Detecting Aberrant Tests to Detect Compromised Items Yu Zhang, Jiyoon Park and Lorin Mueller, Federation of State Boards of Physical Therapy

Three approaches originally developed for detecting aberrant responses were applied to detect compromised items . These methods have shown high performance in detecting examinees with item preknowledge, and we employed them here to flag potentially compromised items .

Detecting Two Patterns of Cheating with a Profile of Statistical Indices Amin Saiar, Gregory Hurtz and John Weiner, PSI Services LLC

Several indices used to detect aberrances in item scores are compared, assessing similarities in raw responses . Results show that the different indices are differentially sensitive to two patterns of cheating, and profiles across the indices may be most useful for detecting and diagnosing test cheating .

Integrating Digital Assessment Meta-Data for Psychometric and Validity Analysis Elizabeth Stone, Educational Testing Service

This paper discusses meta-data (or process data) captured during assessments that can be used to enhance psychometric and validity analyses . We examine sources and types of meta-data, as well as uses including subgroup refinement, identification of effort, and test security . We also describe challenges and caveats to this usage .

How Accurately Can We Detect Erasures? Han Yi Kim and Louis Roussos, Measured Progress

Erasure analyses require accurate detection of erasures, as distinct from blank and filled-in marks . This study evaluates erasure detection using data for which the true nature of the marks is known . Optimal rules are formulated . Type I error and power are calculated and evaluated under various scenarios .

Saturday, April 9, 2016 12:25 PM - 1:55 PM, Renaissance East, Ballroom Level, Coordinated Session, C1

Opting Out of Testing: Parent Rights Versus Valid Accountability Scores Session Discussant: S .E . Phillips, Assessment Law Consultant

Although permitted by legislation in some states, opting children out of statewide testing in large numbers may threaten the validity of school accountability scores . This session will explore the effects of opt outs from the perspectives of enabling state legislation, state assessment staff, measurement specialists, and testing vendors .

Survey and Analysis of State Opt Outs and Required Test Participation Legislation Michelle Croft, ACT, Inc.; Richard Lee, ACT, Inc

Test Administration, Scoring, and Reporting When Students Opt Out Tim Vansickle, Questar Assessment Inc.,

Responding to Parents and Schools About Student Testing Opt Outs Derek Brown, Oregon Department of Education

Opt-Outs: The Validity of School Accountability and Teacher Evaluation Test Score Interpretations Greg Cizek, University of North Carolina at Chapel Hill

Saturday, April 9, 2016 12:25 PM - 1:55 PM, Renaissance West A, Ballroom Level, Coordinated Session, C2

Building Toward a Validation Argument with Innovative Field Test Design and Analysis Session Chair: Catherine Welch, University of Iowa Session Discussants: Michael Rodriguez, University of Minnesota; Wayne Camara, ACT, Inc .

For a variety of reasons, large-scale assessment programs have come to rely heavily on data collected during field testing to evaluate items, assemble forms, and link those forms to already established standard score scales and interpretive frameworks such as proficiency benchmarks and college readiness standards . When derived scores are based on pre-calibrated item pools, as in adaptive testing, or on pre-equated or otherwise linked fixed test forms, the administrative conditions (cf . Wise, 2015) and sampling designs (e .g ., Meyers, Miller & Way, 2009) for field testing are critical to the validity of the scores . This session addresses key aspects of field testing that can be used as a basis for the validation work of an operational assessment program .

Implications of New Construct Definitions and Shifting Emphases in Curriculum and Instruction Catherine Welch, University of Iowa

Implications of Composition and Behavior of the Sample When Studying Item Responses Tim Davey, Educational Testing Service

Assessing Validity of Item Response Theory Model When Calibrating Field Test Items Brandon LeBeau, University of Iowa

Saturday, April 9, 2016 12:25 PM - 1:55 PM, Renaissance West B, Ballroom Level, Coordinated Session, C3

Towards Establishing Standards for Spiraling of Contextual Questionnaires in Large-Scale Assessments Session Chair: Jonas Bertling, Educational Testing Service Session Discussant: Lauren Harrell, National Center for Education Statistics

Constraints of overall testing time and the large sample sizes in large-scale assessments (LSAs) make spiraling approaches, in which different respondents receive different sets of items, a viable option for reducing respondent burden while maintaining or increasing content coverage across relevant areas . Yet LSAs have taken different directions in their use of spiraling in operational questionnaires, and there is currently no consensus on the benefits and drawbacks of spiraling . This symposium brings together diverse perspectives on spiraling approaches in conjunction with mass imputation for contextual questionnaires in LSAs and will help establish standards for how future operational questionnaire designs can be improved to reduce risks for plausible value estimation and secondary analyses .
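For readers unfamiliar with spiraling, the sketch below (illustrative only; the blocks, booklets, and sample size are hypothetical and not drawn from any of the presentations) shows the basic idea of rotating questionnaire blocks across respondents so that every block reaches part of the sample while no respondent answers all items.

    # Minimal illustration of questionnaire spiraling (matrix sampling).
    # Block names, booklet pairs, and the sample size are hypothetical.
    from itertools import cycle

    blocks = ["A", "B", "C", "D"]                                 # contextual questionnaire blocks
    booklets = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]   # rotated block pairs

    def assign_booklets(n_respondents):
        """Cycle through the booklets so each block reaches part of the sample."""
        rotation = cycle(booklets)
        return [next(rotation) for _ in range(n_respondents)]

    if __name__ == "__main__":
        assignments = assign_booklets(12)
        for block in blocks:
            share = sum(block in booklet for booklet in assignments) / len(assignments)
            print(f"Block {block} answered by {share:.0%} of respondents")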

Context and Position Effects on Survey Questions and Implications for Matrix Sampling Paul Jewsbury and Jonas Bertling, Educational Testing Service

Matrix Sampling and Imputation of Context Questionnaires: Implications for Generating Plausible Values David Kaplan and Dan Su, University of Wisconsin – Madison

Imputing Missing Background Data, How to ... And When to ... Matthias von Davier, Educational Testing Service

Design Considerations for Planned Missing Auxiliary Data in a Latent Regression Context Leslie Rutkowski, University of Oslo

Saturday, April 9, 2016 12:25 PM - 1:55 PM, Meeting Room 3, Meeting Room Level, Coordinated Session, C4

Estimation Precision of Variance Components: Revisiting Generalizability Theory Session Discussant: Xiaohong Gao, ACT, Inc .

In this coordinated session of three presentations, the overarching theme is the estimation precision of variance components (VCs) in generalizability theory (G theory) . The estimation precision is of significant importance in that VCs are the building blocks of reliability, on which valid interpretations of measurement are contingent . In the first presentation, the authors discuss the adverse effects of non-additivity on the estimation precision of VCs . Specifically, the VC of subjects is underestimated and, consequently, generalizability coefficients are also underestimated in a one-facet design . An example of non-additivity is the presence of subject-by-facet interaction in a one-facet design . The authors demonstrate that a nonadditive model should be used in such a case to obtain unbiased estimators for VCs . As a follow-up study, the second presentation focuses on the identification of non-additivity by use of Tukey’s single-degree-of-freedom test . The authors evaluate Tukey’s test for non-additivity in terms of Type I and Type II error rates . Finally, the third presentation extends the theme to a multivariate context and touches on the estimation precision of construct-irrelevant VCs in subscore profile analysis . The authors compare the extent to which Component Universe Score Profiles and factor analytic profiles accurately represent subscore profiles .
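As background for the first presentation's argument (a standard one-facet, persons-crossed-with-items result, not a formula taken from the session materials), the generalizability coefficient can be written as

    E\rho^{2} = \frac{\hat{\sigma}^{2}_{p}}{\hat{\sigma}^{2}_{p} + \hat{\sigma}^{2}_{pi,e}/n_{i}}

so any downward bias in the subject (person) variance component, for example from unmodeled subject-by-facet interaction, translates directly into an underestimated generalizability coefficient.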

Bias in Estimating Subject Variance Component When Interaction Exists in One-Facet Design Jinming Zhang, University Of Illinois at Urbana-Champaign

Component Universe Score Profiles: Advantages Over Factor Analytic Profile Analysis Joe Grochowalski, The College Board; Se-Kang Kim, Fordham University

Evaluating Tukey’s Test for Detecting Nonadditivity in G-Theory Applications Chih-Kai Lin, Center for Applied Linguistics (CAL)

Saturday, April 9, 2016 12:25 PM - 1:55 PM, Meeting Room 4, Meeting Room Level, Paper Session, C5

Sensitivity of Value-Added Models Session Discussant: Katherine Furgol Castellano, ETS

Cohort and Content Variability in Value-Added Model School Effects Daniel Anderson and Joseph Stevens, University of Oregon

The purpose of this paper was to explore the extent to which school effects estimated from a random-effects value-added model (VAM) vary as a function of year-to-year fluctuations in the student sample (i .e ., cohort) and the tested subject (reading or math) . Preliminary results suggest high volatility in school effect estimates .

Value-Added Modelling Considerations for School Evaluation Purposes Lucy Lu, NSW Department of Education, Australia

This paper discusses findings from the development of value-added models for a large Australian education system . Issues covered include the impact of modelling choices on the representation of schools of different sizes in the distribution of school effects; sensitivity of VA estimates to test properties and to missing test data .

Implications of Differential Item Quality for Test Scores and Value-Added Estimates Robert Meyer, Nandita Gawade and Caroline Wang, Education Analytics, Inc.

We explore whether differential item quality compromises the use of locally-developed tests in student performance and educator evaluation . Using simulated and empirical data, we find that item corruption affects test scores, and to a lesser extent, value-added estimates . Adjusting test score scales and limiting to well-functioning items mitigate these effects .

Saturday, April 9, 2016 12:25 PM - 1:55 PM, Meeting Room 5, Meeting Room Level, Paper Session, C6

Item and Scale Drift Session Discussant: Jonathan Weeks, ETS

The Impact of Item Parameter Drift in Computer Adaptive Testing (CAT) Nicole Risk, American Medical Technologists

The impact of IPD on measurement in CAT was examined . The amount and magnitude of IPD, as well as the size of the item pool, were varied in a series of simulations . A number of criteria were used to evaluate the effects on measurement precision, classification, and test efficiency .

Practice Differences and Item Parameter Drift in Computer Adaptive Testing Beyza Aksu Dunya, University of Illinois at Chicago

The purpose of this simulation study was to evaluate the impact of IPD that occurs due to teaching and practice differences on person parameter estimation and classification accuracy in CAT when factors such as percentage of drifting items and percentage of examinees receiving differential teaching and practices vary .

Investigating Linear and Nonlinear Item Parameter Drift with Explanatory IRT Models Luke Stanke, Minneapolis Public Schools; Okan Bulut, University of Alberta; Michael Rodriguez and Jose Palma, University of Minnesota

This study investigates the impact of model misspecification in detecting linear and nonlinear item parameter drift (IPD) . Monte Carlo simulations were conducted to examine drift with linear, quadratic, and factor IPD models under various testing conditions .

Quality Control Models for Tests with a Continuous Administration Mode Yuyu Fan, Fordham University; Alina von Davier and Yi-Hsuan Lee, ETS

This paper systematically compared the performance of Change Point Models (CPM) and Hidden Markov Models (HMM) for score stability monitoring and scale drift assessment in educational test administrations using simulated data . The study will contribute to the ongoing monitoring of scale scores for the purpose of quality control in equating .

Ensuring Test Fairness Through Monitoring the Anchor Test and Covariates Marie Wiberg, Umeå University; Alina von Davier, Educational Testing Service

A quality control procedure is proposed for a testing program with multiple consecutive administrations that include an anchor test . Descriptive statistics, ANOVA, IRT, and linear mixed effect models were used to examine the impact of covariates on the anchor test . The results imply that the covariates play a significant part .

Saturday, April 9, 2016 12:25 PM - 1:55 PM, Meeting Room 12, Meeting Room Level, Paper Session, C7

Cognitive Diagnostic Model Extensions Session Discussant: Larry DeCarlo, Teachers College, Columbia University

A Polytomously-Scored DINA Model for Graded Response Data Dongbo Tu, Chanjin Zheng and Yan Cai, Jiangxi Normal University; Hua-Hua Chang, University of Illinois at Urbana-Champaign

This paper proposed a polytomous extension of the DINA model for a test with polytomously-scored items . A simulation study was conducted to investigate the performance of the proposed model . In addition, a real-data example was used to illustrate the application of this new model with polytomously-scored items .

Information Matrix Estimation Procedures for Cognitive Diagnostic Model Tao Xin, Yanlou Liu and Wei Tian, Beijing Normal University

The sandwich-type covariance matrix estimator in CDMs is consistent and robust to model misspecification . The Type I error rates of the Wald statistic, constructed using the observed information matrix, for one-, two-, and three-attribute items all closely matched the nominal levels when the sample size was relatively large .

Higher-Order Cognitive Diagnostic Models for Polytomous Latent Attributes Peida Zhan and Yufang Bian, Beijing Normal University; Wen-Chung Wang and Xiaomin Li, The Hong Kong Institute of Education

Latent attributes in cognitive diagnostic models (CDMs) are typically dichotomous, but in practice polytomous attributes are possible . We developed a set of new CDMs in which the polytomous attributes are assumed to measure the same continuous latent trait . Simulation studies demonstrated good parameter recovery using WinBUGS . An empirical example was given .

Incorporating Latent and Observed Predictors in Cognitive Diagnostic Models Yoon Soo Park and Kuan Xing, University of Illinois at Chicago; Young-Sun Lee, Teachers College, Columbia University; MiYoun Lim, Ewha Womans University

A general approach to specify observed and latent factors (estimated using item response theory) as predictors in an explanatory framework for cognitive diagnostic models is proposed . Simulations were conducted to examine the stability of estimates; real-world data analyses were conducted to demonstrate the framework and application using TIMSS data .

Saturday, April 9, 2016 12:25 PM - 1:55 PM, Mount Vernon Square, Meeting Room Level, Electronic Board Session, Paper Session, C8

Electronic Board #1 Examination of Over-Extraction of Latent Classes in the Mixed Rasch Model Sedat Sen, Harran University

Correct identification of the number of latent classes in MRMs is very important . This study investigated the over-extraction problem in MRMs by focusing on non-normal ability distributions and fit index selection . Three ML-based estimation techniques were used, and the over-extraction problem was observed under some conditions .

Electronic Board #2 Identifying a Credible Reference Variable for Measurement Invariance Testing Cheng-Hsien Li and KwangHee Jung, Department of Pediatrics, University of Texas Medical School at Houston

Two limitations to model identification in multiple-group CFA, unfortunately, have received little attention: (1) the standardization in loading invariance and (2) the lack of a statistical test for intercept invariance . The proposed strategy extends a MIMIC model with moderated effects to identify a credible reference variable for measurement invariance testing .

Electronic Board #3 Using Partial Classification of Respondents to Reduce Classification Error in Mixture IRT Youngmi Cho, Pearson; Tongyun Li, ETS; Jeffrey Harring and George Macready, University of Maryland

This study investigates an alternative classification method in mixture IRT models . This method incorporates an additional classification criterion: namely, that the largest posterior probability for each response pattern must equal or exceed a specified lower bound . This results in a reduction of expected classification error .

Electronic Board #4 Parameter Recovery in Multidimensional Item Response Theory Models Under Complexity and Nonnormality Stephanie Underhill, Dubravka Svetina, Shenghai Dai and Xiaolin Wang, Indiana University - Bloomington

We investigate item and person parameter recovery in multidimensional item response theory models for understudied conditions . Specifically, we ask how well IRTpro and the mirt package in R can recover the parameters when the person distribution is nonnormal, items exhibit varying degrees of complexity, and different item parameters comprise an assessment .

Electronic Board #5 Psychometric Properties of Technology-Enhanced Item Formats Ashleigh Crabtree and Catherine Welch, University of Iowa

The objectives of this research are to provide information about the properties of technology-enhanced item formats . Specifically, the research will focus on the construct representation and technical properties of test forms that use these item types .

Electronic Board #6 Using Technology-Enhanced Items to Measure Fifth Grade Geometry Jessica Masters, Lisa Famularo and Kristin King, Measured Progress

Technology-enhanced items have the potential to provide improved measurement of high-level constructs . But research is needed to evaluate whether these items lead to valid inferences about knowledge and provide improved measurement over traditional items . This paper explores these questions in the context of fifth grade geometry using qualitative cognitive lab data .

Electronic Board #7 A Multilevel MT-MM Approach for Estimating Trait Variance Across Informant Types Tim Konold and Kathan Shukla, University of Virginia

An approach for extracting common trait variance from structurally different informant ratings is presented with an extension for measuring the resulting factors’ associations with an external outcome . Results are based on structurally different and interchangeable students (N = 45,641) and teachers (N = 12,808) from 302 schools .

Electronic Board #8 A Validation Study of the Learning Errors and Formative Feedback (LEAFF) Model Wei Tang, Jacqueline Leighton and Qi Guo, University of Alberta

The objectives of the present study are (1) to validate the selected measures of the latent variables in the Learning Errors and Formative Feedback (LEAFF) model and (2) to apply a structural equation model to evaluate the core of the LEAFF model . In addition, culturally invariant models are analyzed and presented .

Electronic Board #9 Automatic Flagging of Items for Key Validation Füsun Şahin, University at Albany, State University of New York; Jerome Clauser, American Board of Internal Medicine

Key validation procedures typically rely on professional judgment to identify potentially problematic items . Unfortunately, a lack of standardized flagging criteria can introduce bias in examinee scores . This study demonstrates the use of logistic regression to mimic expert judgment and automatically flag problematic items . The final model correctly identified 96% of items .
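As a rough sketch of the general approach (the item features, training labels, and example values below are hypothetical and are not the study's actual model):

    # Sketch: flag potentially miskeyed items with logistic regression.
    # The item statistics and flagging labels here are hypothetical.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per previously reviewed item:
    # [proportion correct, keyed-option point-biserial, best distractor point-biserial]
    X_train = np.array([
        [0.85,  0.45, -0.20],
        [0.30,  0.05,  0.35],   # weak key, strong distractor: was flagged
        [0.60,  0.30, -0.10],
        [0.25, -0.02,  0.40],   # was flagged
    ])
    y_train = np.array([0, 1, 0, 1])  # 1 = expert flagged the item for key review

    model = LogisticRegression().fit(X_train, y_train)

    # Score a new item and report the probability it needs key review.
    new_item = np.array([[0.40, 0.10, 0.25]])
    print(f"P(flag for key review) = {model.predict_proba(new_item)[0, 1]:.2f}")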

Electronic Board #10 Evaluating the Robustness of Multidimensional IRT (MIRT) Based Growth Modeling Hanwook Yoo, Seunghee Chung, Peter van Rijn and Hyeon-Joo Oh, Educational Testing Service

This study evaluates the robustness of MIRT-based growth modeling when tests are not strictly unidimensional . Primary independent variables manipulated are (a) magnitude of student growth and (b) magnitude of test multidimensionality . The findings indicate how effectively growth is measured by the proposed model under different test conditions .

Electronic Board #11 Standard Errors of Measurement for Group-Level SGP with Bootstrap Procedures Jinah Choi, Won-Chan Lee, Robert Brennan and Robert Ankenmann, The University of Iowa

This study provides procedures for estimating standard errors of measurement and confidence intervals for group-level SGPs by using bootstrap sampling plans in generalizability theory . Gauging the reliability of reported SGPs is informative when the mean or median of individual SGPs within a group of interest is reported .
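A minimal illustration of the bootstrap idea for a single group appears below; it uses a simple percentile bootstrap with hypothetical SGP values rather than the generalizability-theory sampling plans developed in the paper.

    # Sketch: bootstrap standard error and interval for a group's median SGP.
    # The SGP values and number of replications are hypothetical.
    import numpy as np

    rng = np.random.default_rng(2016)
    group_sgps = np.array([12, 35, 48, 50, 61, 66, 72, 80, 88, 95])  # one group's student SGPs

    boot_medians = np.array([
        np.median(rng.choice(group_sgps, size=group_sgps.size, replace=True))
        for _ in range(2000)
    ])
    standard_error = boot_medians.std(ddof=1)
    lower, upper = np.percentile(boot_medians, [2.5, 97.5])

    print(f"Median SGP: {np.median(group_sgps):.1f}")
    print(f"Bootstrap SE of the median: {standard_error:.2f}")
    print(f"95% bootstrap interval: [{lower:.1f}, {upper:.1f}]")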

Electronic Board #12 Vertical Scaling of Tests with Mixed Item Formats Including Technology Enhanced Items Dong-In Kim, Ping Wan and Joanna Tomkowicz, Data Recognition Corporation; Furong Gao, Pacific Metrics; Jungnam Kim, NBCE

This study is intended to enhance the knowledge base of IRT vertical scaling when tests consist of mixed item types including technology-enhanced items . Using large scale state assessments, the study compares results from different configurations of item type compositions of the anchor set, anchor sources, IRT models, and vertical scaling methods .

Electronic Board #13 Full-Information Bifactor Growth Models and Derivatives for Longitudinal Data Ying Li, American Nurses Credentialing Center

The bifactor growth model with correlated general factors has shown promise in recovering longitudinal data; however, it is not known whether simplified models perform well with comparable estimation accuracy . This study investigated two simplified versions of the model in data recovery under various conditions, aiming to provide guidance on model selection .

Electronic Board #14 The Pseudo-Equivalent Groups Approach as an Alternative to Common-Item Equating Sooyeon Kim and Ru Lu, Educational Testing Service

This study evaluates the effectiveness of equating test scores by using demographic data to form “pseudo-equivalent groups” of test takers . The study uses data from a single test form to create two half-length forms for which the equating relationship is known .

Electronic Board #15 Equating with a Heterogeneous Target Population in the Common-Item Design Ru Lu and Sooyeon Kim, Educational Testing Service

This study evaluates the effectiveness of weighting for each subgroup in the nonequivalent groups with common-item design . This study uses data from a single test form to create two research forms for which the equating relationship is known . Two weighting schemes are compared in terms of equating accuracy .

Electronic Board #16 Examining the Reliability of Rubric Scores to Assess Score Report Quality Mary Roduta Roberts, University of Alberta; Chad Gotch, Washington State University

The purpose of this study is to assess the reliability of scores obtained from a recently developed ratings-based measure of score report quality . Findings will be used to refine assessment of score report quality and advance the study and practice of score reporting .

Electronic Board #17 Accuracy of Angoff Method Item Difficulty Estimation at Specific Cut Score Levels Tanya Longabach, Excelsior College

This study examines the accuracy of item difficulty estimates in Angoff standard setting with no normative item data available . The correlation between observed and estimated item difficulty is moderate to high . The judges consistently overestimate student ability at higher cut levels and underestimate the ability of students at the D cut level .

Electronic Board #18 A Passage-Based Approach to Setting Cut Scores on ELA Assessments Marianne Perie and Jessica Loughran, Center for Educational Testing and Evaluation

New assessments in ELA contain a strong focus on reading comprehension with multiple passages of varying complexity . Using a variant of the Bookmark method, this study provides results from two standard setting workshops with two approaches to setting passage-based cut scores and two approaches to recovering the intended cut score .

Electronic Board #19 Psychometric Characteristics of Technology Enhanced Items from a Computer-Based Interim Assessment Program Nurliyana Bukhari, University of North Carolina at Greensboro; Keith Boughton and Dong-In Kim, Data Recognition Corporation

This study compared the IRT information of technology enhanced (TE) item formats from an interim assessment program . Findings indicate that the evidence-based selected response items within English Language Arts and the select-and-order, equation-and-expression entry, and matching items within Mathematics, provided more information when compared to the traditional selected response items .

Electronic Board #20 Exposure Control for Response Time-Informed Item Selection and Estimation in CAT Justin Kern, Edison Choe and Hua-Hua Chang, University of Illinois at Urbana-Champaign

This study will investigate item exposure control while using response times (RTs) with item responses in CAT to minimize overall test-taking time . Items are selected to maximize information per time unit, as in Fan et al . (2012) . Calculations use estimates for ability and speededness obtained via a joint-estimation MAP routine .
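For reference, the maximum-information-per-time-unit rule cited here is usually written as selecting, from the remaining eligible items R and given provisional ability and speed estimates \hat{\theta} and \hat{\tau}, the item

    i^{*} = \arg\max_{i \in R} \frac{I_{i}(\hat{\theta})}{E(T_{i} \mid \hat{\tau})}

where I_{i} is the item information function and E(T_{i} \mid \hat{\tau}) is the expected response time; the exposure-control mechanisms investigated in this study operate on top of such a rule.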

Electronic Board #21 Monitoring Item Drift Using Stochastic Process Control Charts Hongwen Guo and Frederic Robin, ETS

In on-demand testing, test items have to be reused; however, their true characteristics may drift over time . This study links item drift to DIF analysis, and SPC methods applied to a sequence of test administrations are used to detect item drift as early as possible .

Electronic Board #22 Reporting Subscores Using Different Multidimensional IRT Models in Sequencing Adaptive Testing Jing-Ru Xu, Pearson VUE; Frank Rijmen, Association of American Medical Colleges

This research investigates the efficiency of reporting subscores in sequencing adaptive testing . It compares this new implementation with a general multidimensional CAT program . Different multidimensional models were fitted in different CAT simulation studies using PISA 2012 Math with four subdomains . It provides insights into score reporting in multidimensional CAT .

Electronic Board #23 Multidimensional IRT Model Estimation with Multivariate Non-Normal Latent Distributions Tongyun Li and Liyang Mao, Educational Testing Service

The purpose of the present study is to investigate the robustness of multidimensional IRT model parameter estimation when the latent distribution is multivariate non-normal . A simulation study is proposed to evaluate the accuracy of item and person parameter estimates under different magnitudes of violation of the multivariate normality assumption .

Electronic Board #24 Stochastic Ordering of the Latent Trait Using the Composite Score Feifei Li and Timothy Davey, Educational Testing Service

The purposes of this study are to investigate whether combining scores from monotonic items causes violation of stochastic ordering of the latent trait (SOL) in the empirical composite score function and to identify the factors that introduce violations of SOL when combining monotonic polytomous items .

Electronic Board #25 Establishing Critical Values for PARSCALE G2 Item Fit Statistics Lixiong Gu and Ying Lu, Educational Testing Service

Research shows that the Type I error rate of the PARSCALE G2 statistic is inflated as test length decreases and sample size increases . This study develops a table of empirical critical values for a Type I error rate of 0.05 at different sample sizes that may help psychometricians flag misfitting items .

Saturday, April 9, 2016 2:15 PM - 3:45 PM, Renaissance East, Ballroom Level, Invited Session, D1

Assessing the Assessments: Measuring the Quality of New College- and Career-Ready Assessments Morgan Polikoff, USC; Tony Alpert, Smarter Balanced; Bonnie Hain, PARCC; Brian Gill, Mathematica; Carrie Conaway, Massachusetts Department of Education; Donna Matovinovic, ACT

This panel presents results from two recent studies of the quality of new college and career-ready assessments . The first study uses a new methodology to evaluate the quality of PARCC, Smarter Balanced, ACT Aspire, and Massachusetts MCAS against the CCSSO Criteria for High Quality Assessment . After the presentation of the study and its findings, respondents from PARCC and Smarter Balanced will discuss the methodology and their thoughts on the most important dimensions against which new assessments should be evaluated . The second study investigates the predictive validity of PARCC and MCAS for predicting success in college . After the presentation of the study and its findings, respondents from the Massachusetts Department of Education will discuss the study and the state’s needs regarding evidence to select and improve next-generation assessment . The overarching goal of the panel is to provoke discussion and debate about the best ways to evaluate the quality of new assessments in the college- and career-ready standards era .

Saturday, April 9, 2016 2:15 PM - 3:45 PM, Renaissance West A, Ballroom Level, Coordinated Session, D2

Some Psychometric Models for Learning Progressions Session Chair: Mark Wilson, University of California, Berkeley Session Discussant: Matthias Von Davier, ETS

Learning progressions represent theories about the conceptual pathways that students follow when learning in a domain (NRC, 2006) . One common type of representation is a multidimensional structure, with links between certain pairs of levels of the different dimensions (as predicted by, say, substantive theory and/or empirical findings) . An illustration of such a complex hypothesis derives from an assessment development project in the area of statistical modeling for middle school students, the Assessing Data Modeling and Statistical Reasoning (ADM; Lehrer, Kim, Ayers & Wilson, 2014) project . The vertical columns of boxes (such as CoS1, CoS2, ..., CoS4) represent the levels of each of the 6 dimensions of the learning progression . In addition to these “vertical” links between different levels of each construct, other links between levels of different constructs (such as the one from ToM6 to CoS3) indicate that there is an expectation (from theory and/or earlier empirical findings) that a student needs to succeed on the 6th level of the ToM dimension before they can be expected to succeed on the 3rd level of the CoS dimension .

Putting it a bit more formally, we use a genre of representation that is structured as a multidimensional set of constructs: Each construct has (1) several levels representing successive levels of sophistication in student understanding and (2) directional relations between individual levels of different constructs . We call the models used to analyze such a structure structured constructs models (SCMs; Wilson, 2009) .
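One way to read the directed links described above (a schematic rendering, not the parameterization used in any particular presentation) is as a constraint on the joint distribution of two ordinal latent variables, for example

    P(\eta_{CoS} \geq 3 \mid \eta_{ToM} < 6) \approx 0

that is, a student is not expected to reach level 3 of CoS without first reaching level 6 of ToM; the presentations below formalize such links within different item response modeling frameworks.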

Introduction to the Concept of a Structured Constructs Model (SCM) Mark Wilson, University of California, Berkeley

Modeling Structured Constructs as Non-Symmetric Relations Between Ordinal Latent Variables David Torres Irribarra, Pontificia Universidad Católica de Chile; Ronli Diakow (Brenda Loyd Dissertation Award Winner, 2015), New York City Department of Education

A Structured Constructs Model for Continuous Latent Traits with Discontinuity Parameters In-Hee Choi, University of California, Berkeley

A Structured Constructs Model Based on Change-Point Analysis Hyo Jeong Shin, ETS

Discussion of the Different Approaches to Using Item Response Models for SCMs Mark Wilson, University of California, Berkeley

Saturday, April 9, 2016 2:15 PM - 3:45 PM, Renaissance West B, Ballroom Level, Coordinated Session, D3

Multiple Perspectives on Promoting Assessment Literacy for Parents Session Chair: Lauress Wise, Human Resources Research Organization (HumRRO)

The national dialogue on American education has become increasingly focused on assessment . There is a clear need for greater understanding about fundamental aspects of educational testing . Several organizations and individuals have undertaken concerted efforts to increase the assessment literacy of various audiences, including educators, policymakers, parents, and the general public .

This coordinated session will focus on the efforts taken by three initiatives that include parents among the target audiences . NCME Past President Laurie Wise will introduce the session by discussing the need for initiatives that increase the assessment literacy of parents . NCME Board member Cindy Walker will discuss the ongoing efforts on behalf of NCME to develop and promote assessment literacy materials . Beth Rorick of the National Parent Teacher Association will discuss a national assessment literacy effort to educate parents on college and career ready standards and state assessments . Maria Donata Vasquez-Colina and John Morris of Florida Atlantic University will discuss outcomes and follow up activities from focus groups with parents on assessment literacy . Presentations will be followed by group discussion (among both panelists and audience members) on ideas for coordinating multiple efforts to increase parents’ assessment literacy .

NCME Assessment Literacy Initiative Cindy Walker, University of Wisconsin - Milwaukee

NAEP Assessment Literacy Initiative Beth Rorick, National Parent-Teacher Association

Lessons Learned from Parents on Assessment Literacy Maria Donata Vasquez-Colina and John Morris, Florida Atlantic University

Saturday, April 9, 2016 2:15 PM - 3:45 PM, Meeting Room 3, Meeting Room Level, Paper Session, D4

Equating Mixed-Format Tests Session Discussant: Won-Chan Lee, University of Iowa

Classification Error Under Random Groups Equating Using Small Samples with Mixed-Format Tests Ja Young Kim, ACT, Inc.

Few studies have investigated equating with small samples using mixed-format tests . The purpose of this study is to examine the impact of small sample sizes and equating method on the misclassification of examinees based on where the passing scores are located, taking into account factors related to using mixed-format tests .

Sample Size Requirement for Trend Scoring in Mixed-Format Test Equating Qing Yi and Yong He, ACT, Inc.; Hua Wei, Pearson

The purpose of this study is to investigate how many rescored responses are sufficient to adjust for the differences in rater severity across test administrations in mixed-format test equating . Simulated data are used to study the sample size requirement for the trend scoring method with IRT equating .

Comparing IRT-Based and CTT-Based Pre-Equating in Mixed-Format Testing Meichu Fan, Xin Li and YoungWoo Cho, ACT, Inc.

Pre-equating has tremendous appeal to test practitioners given the demand for immediate score reporting . Research on IRT pre-equating is readily applicable, but research on pre-equating using classical test theory (CTT), where only classical item statistics are available, is limited . This study compares various pre- and post-equating methods in mixed-format testing .

Equating Mixed-Format Tests Using Automated Essay Scoring (aes) System Scores Süleyman Olgar, Florida Department of Education; Russell Almond, Florida State University

This study investigated the impact of using generic e-rater scores to equate mixed-format tests with MC items and an essay. The kappa and observed agreements were large and similar across six equating methods. The MC+e-rater equating outcomes are strong and even better than the MC-only equating results for some conditions.

Saturday, April 9, 2016 2:15 PM - 3:45 PM, Meeting Room 4, Meeting Room Level, Paper Session, D5

Standard Setting Session Discussant: Susan Davis-Becker, Alpine Testing Solutions

Exploring the Influence of Judge Proficiency on Standard-Setting Judgments for Medical Examinations Michael Peabody, American Board of Family Medicine; Stefanie Wind, University of Alabama

The purpose of this study is to explore the use of the Many-Facet Rasch model (Linacre, 1989) as a method for adjusting modified-Angoff standard setting ratings (Angoff, 1971) based on judges’ subject area knowledge . Findings suggest differences in the severity and quality of standard-setting judgments across levels of judge proficiency .
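For reference, a generic Many-Facet Rasch model for judge ratings can be written as

    \log\left(\frac{P_{ijk}}{P_{ij(k-1)}}\right) = \beta_{i} - \lambda_{j} - \tau_{k}

where \beta_{i} reflects the item, \lambda_{j} the severity of judge j, and \tau_{k} the rating-scale thresholds. This is standard background for the model family named in the abstract; the exact parameterization the authors use, including any adjustment for judge proficiency, is not specified here.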

Setting Cut Scores on the AP Seminar Course and Exam Components Deanna Morgan and Priyank Patel, The College Board; Yang Zhao, University of Kansas

This paper documents a standard-setting study using the Performance Profile Method to determine recommended cut scores for examinees to be placed in each of the AP grade categories (1-5) . The Subject Matter Experts used an ordered profile packet of students’ performance, and converged on recommended scores .

Interval Validation Method for Setting Achievement Level Standards for Computerized Adaptive Tests William Insko and Stephen Murphy, Houghton Mifflin Harcourt

The Interval Validation Method for setting achievement level standards is specifically designed for assessments with large item pools, such as computerized adaptive tests . The method focuses judgments on intervals of similarly performing items presumed to contain a single cut score location . Validation of the interval sets the cut score .

The Use of Web 2.0 Tools in a Bookmark Standard Setting Jennifer Lord-Bessen, McGraw Hill Education CTB; Ricardo Mercado, DRC; Adele Brandstrom, CTB

This study examines the use of interactive, collaborative Web tools in an onsite, online Bookmark Standard Setting workshop for a state assessment . It explores the feasibility of this concept—addressing issues of security, user satisfaction, and cost—in a fully online standard setting with remote participants .

Saturday, April 9, 2016 2:15 PM - 3:45 PM, Meeting Room 5, Meeting Room Level, Paper Session, D6

Diagnostic Classification Models: Applications Session Discussant: Jonathan Templin, University of Kansas

Assessing Students’ Competencies Through Cognitive Diagnosis Models: Validity and Reliability Evidences Miguel Sorrel, Julio Olea and Francisco Abad, Universidad Autónoma de Madrid; Jimmy de la Torre, Rutgers, The State University of New Jersey; Juan Barrada, Universidad de Zaragoza; David Aguado, Instituto de Ingeniería del Conocimiento; Filip Lievens, Ghent University

Cognitive diagnosis models can be applied to situational judgement tests to provide information about noncognitive factors, which currently are not included in selection procedures for admission to university . Reliable measures of study orientation (habits and attitudes), helping others, and generalized compliance were significantly related to the grade point average .

Examining Effects of Pictorial Fraction Models on Student Test Responses Angela Broaddus, Center for Educational Testing and Evaluation, University of Kansas; Meghan Sullivan, University of Kansas

The present study investigates the effects of aspects of visual fraction models on student test responses . Responses to 50 items assessing partitioning and identifying unit fractions were analyzed using diagnostic classification methods to provide insight into effective representations of early fraction knowledge .

Evaluation of Learning Map Structure Using Diagnostic Cognitive Modeling and Bayesian Networks Feng Chen, Jonathan Templin and William Skorupski, The University of Kansas

The learning map underlying the assessment system should accurately specify the connections among nodes, as well as specify nodes at the appropriate level of granularity . This paper seeks to validate a learning map by combining real data analyses and a simulation study to provide inferences for test development .

Saturday, April 9, 2016 2:15 PM - 3:45 PM, Meeting Room 12, Meeting Room Level, Paper Session, D7

Advances in IRT Modelling and Estimation Session Discussant: Mark Hansen, UCLA

Estimation of Mixture IRT Models from Nonnormally Distributed Data Tugba Karadavut and Allan S. Cohen, University of Georgia

Mixture IRT models generally assume standard normal ability distributions, but nonnormality is likely to occur in many achievement tests . Nonnormality has been shown to cause extraction of spurious latent classes . A skew t distribution, which has corrected extraction of spurious latent classes in growth models, will be studied in this research .

Two-Tier Item Factor Models with Empirical Histograms as Nonnormal Latent Densities Hyesuk Jang, American Institutes for Research; Ji Seung Yang, University of Maryland; Scott Monroe, University of Massachusetts

The purpose of this study is to investigate the effects of nonnormal latent densities in two-tier item factor models on parameter estimates and to propose an extended empirical histogram approach that allows an appropriate characterization of the nonnormal densities for two correlated general factors and unbiased parameter estimates .

Examining Performance of the MH-RM Algorithm with the 3PL Multilevel MIRT Model Bozhidar Bashkov, American Board of Internal Medicine; Christine DeMars, James Madison University

This study examined the performance of the Metropolis-Hastings Robbins-Monro (MH-RM) algorithm (Cai, 2010b) in estimating 3PL multilevel multidimensional IRT (ML-MIRT) models . Item and person parameter recovery as well as variances and covariances at different levels were investigated in different combinations of number of dimensions, intraclass correlation levels, and sample sizes .

Expectation-Expectation-Maximization: A Feasible Mixture-Model-Based MLE Algorithm for the Three-Parameter Logistic Model Chanjin Zheng, Jiangxi Normal University; Xiangbing Meng, Northeast Normal University

Stable MLE of item parameters under the 3PLM with a modest sample size remains a challenge . The current study presents a mixture-model approach to the 3PLM, based on which a feasible Expectation-Expectation-Maximization (EEM) MLE algorithm is proposed . The simulation study indicates that EEM is comparable to Bayesian EM .

Saturday, April 9, 2016 2:15 PM - 3:45 PM, Mount Vernon Square, Meeting Room Level, Electronic Board Session: GSIC Graduate Student Poster Session, D8

Graduate Student Issues Committee Brian Leventhal, Chair; Masha Bertling, Laine Bradshaw, Lisa Beymer, Evelyn Johnson, Ricardo Neito, Ray Reichenberg, Latisha Sternod, Dubravka Svetina

Electronic Board #1 Testing Two Alternatives to a Value-Added Model for Teacher Capability Nicole Jess, Michigan State University

This study tests two alternatives to Value-Added Models (VAMs) for teacher capability: Student Response Model (SRM) and Multilevel Mixture Item Response Model (MMixIRM) . We will compare the accuracy of estimation of teacher capability using these models under various conditions of class size, location of cut-score, and student assignment to teacher .

Electronic Board #2 Using Response Time in Cognitive Diagnosis Models Nathan Minchen, Rutgers, The State University of New Jersey

No abstract submitted at time of printing

Electronic Board #3 An Exhaustive Search for Identifying Hierarchical Attribute Structure Lokman Akbay, Rutgers, The State University of New Jersey

Specification of an incorrect hierarchical relationship between any two attributes can substantially degrade classification accuracy . As such, the importance of correctly identifying the hierarchical structure among attributes cannot be overemphasized . The primary objective of this study is to propose a procedure for identifying the most appropriate hierarchical structure for attributes .

Electronic Board #4 Performance of DIMTEST and Generalized Dimensionality Discrepancy Statistics for Assessing Unidimensionality Ray Reichenberg, Arizona State University

The standardized generalized dimensionality discrepancy measure (SGDDM; Levy, Xu, Yel, & Svetina, 2015) was compared to DIMTEST in terms of their absolute and relative efficacy in assessing the unidimensionality assumption common in IRT under a variety of testing conditions (e .g ., sample size/test length) . Results and future research opportunities are discussed .

Electronic Board #5 Self-Directed Learning Oriented Assessments Without High Technologies Jiahui Zhang, Michigan State University

Self-directed learning oriented assessments capitalize on the construction of assessment activities for optimal learning and for the cultivation of self-directed learning capacities . This study aims to develop such an assessment combining the strengths of paper-pencil tests, CDM, and standard setting, which can be used by learners without high technologies .

Electronic Board #6 Vertical Scaling Under Rasch Testlet Model Mingcai Zhang, Michigan State University

Using the Rasch testlet model, scaling constants are estimated for three pairs of adjacent grades that are linked through anchor testlets . The simulated factors that impact the precision of scaling constant estimation include group mean difference, anchor testlet positions, and the magnitude of testlet effect .

Electronic Board #7 The Effect of DIF on Group Invariance of IRT True Score Equating Dasom Hwang, Yonsei University

Traditional methods for detecting DIF have been used for single-level data analysis . However, most data in education have a multilevel structure . This study investigates more effective methods under various conditions, comparing statistical power and Type I error rates of adjusted methods based on the Mantel-Haenszel method and SIBTEST for multilevel data .

Electronic Board #8 Detecting Non-Fitting Items for the Testlet Response Model Ryan Lynch, University of Wisconsin - Milwaukee

A Monte Carlo simulation will be conducted to evaluate the S-X2 item fit statistic . Findings indicate that the S-X2 may be a viable tool for evaluating item fit when the testlet effect is large, but results are mixed when the testlet effect is small .

Electronic Board #9 An Iterative Technique to Improve Test Cheating Detection Using the Omega Statistic Hotaka Maeda, University of Wisconsin-Milwaukee

We propose an iterative technique to improve ability estimation for accused answer copiers . A Monte Carlo simulation showed that by using the new ability estimate, the omega statistic had better controlled Type I error and increased power in all studied conditions, particularly when the source ability was high .

Electronic Board #10 Parameter Recovery in the Multidimensional Graded Response Item Response Theory Model Shengyu Jiang, University of Minnesota

Multidimensional graded response model can be a useful tool in modeling ordered categorical test data for multiple latent traits . A simulation study is conducted to investigate the variables that might affect parameter recovery and provide guidance for test construction and data collection in practical settings where the MGRM is applied .

Electronic Board #11 The Impact of Ignoring a Multilevel Structure in Mixture Item Response Models Woo-yeol Lee, Vanderbilt University

Multilevel mixture item response models are widely discussed but infrequently used in education research . Because little research exists assessing when it is necessary to use such models, the current study investigated the consequences of ignoring a multilevel structure in mixture item response models via a simulation study .


Electronic Board #12 Determining the Diagnostic Properties of the Force Concept Inventory Mary Norris, Virginia Tech

The Force Concept Inventory (FCI) is widely used to measure learning in introductory physics. Typically, instructors use the total score, but investigation suggests that the test is multidimensional. This study fits FCI data with cognitive diagnostic and bifactor models in order to provide a more detailed assessment of student skills.

Electronic Board #13 Understanding School Truancy: Risk-Need Latent Profiles of Adolescents Andrew Iverson, Washington State University

Latent Profile Analysis was used to examine risk and needs profiles of adolescents in Washington State based on the WARNS assessment. Profiles were developed to aid understanding of behaviors associated with school truancy. Profiles were examined across student demographic variables (e.g., suspensions, arrests) to provide validity evidence for the profiles.

Electronic Board #14 Utilizing Nonignorable Missing Data Information in Item Response Theory Daniel Lee, University of Maryland

The purposes of this simulation study are to examine the effects of ignoring nonignorable missing data in item response models and to evaluate the performance of model-based and imputation-based approaches (e.g., stochastic regression and Markov Chain Monte Carlo imputation) in parameter estimation, in order to provide practical guidance to applied researchers.
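
As background, stochastic regression imputation, one of the approaches named above, can be illustrated with the minimal Python sketch below; the function and data are hypothetical and are not drawn from the study itself.

    import math, random

    def stochastic_regression_impute(x, y, seed=1):
        """Impute missing y values (None) from x via simple linear regression,
        adding a random normal residual so imputations preserve variability.
        Generic illustration of stochastic regression imputation."""
        obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
        n = len(obs)
        mx = sum(xi for xi, _ in obs) / n
        my = sum(yi for _, yi in obs) / n
        sxy = sum((xi - mx) * (yi - my) for xi, yi in obs)
        sxx = sum((xi - mx) ** 2 for xi, _ in obs)
        b1 = sxy / sxx
        b0 = my - b1 * mx
        resid_var = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in obs) / (n - 2)
        rng = random.Random(seed)
        sd = math.sqrt(resid_var)
        return [yi if yi is not None else b0 + b1 * xi + rng.gauss(0.0, sd)
                for xi, yi in zip(x, y)]

    # Hypothetical data: y is missing for two cases
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y = [2.1, 3.9, None, 8.2, None, 12.1]
    print(stochastic_regression_impute(x, y))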

Electronic Board #15 Investigating IPD Amplification and Cancellation at the Testlet-Level on Model Parameter Estimation Rosalyn Bryant, University of Maryland College Park

This study investigates the effect of item parameter drift (IPD) amplification or cancellation on model parameter estimation in a testlet-based linear test. Estimates will be compared between a 2-parameter item response theory (IRT) model and a 2-parameter testlet model, varying the magnitudes and patterns of IPD at the item and testlet levels.

Electronic Board #16 Measuring Reading Comprehension Through Automated Analysis of Students’ Small-Group Discussions Audra Kosh, University of North Carolina, Chapel Hill

We present the development and initial validation of a computer-automated tool that measures elementary school students’ reading comprehension by analyzing transcripts of small-group discussions about texts. Students’ scores derived from the automated tool were a statistically significant predictor of scores on traditional multiple-choice and constructed-response reading comprehension tests.

Electronic Board #17 Differential Item Functioning Among Students with Disabilities and English Language Learners Kevin Krost, University of Pittsburgh

The presence of differential item functioning (DIF) was investigated on a statewide eighth-grade mathematics assessment. Both students with disabilities and English language learners served as focal groups, and several IRT and CTT methods were used and compared. Implications of the results are discussed.


Electronic Board #18 Extreme Response Style: Which Model is Best? Brian Leventhal, University of Pittsburgh

More robust and rigorous psychometric models, such as IRT models, have been advocated for survey applications. However, item responses may be influenced by sources of construct-irrelevant variance such as preferences for extreme response options. Through simulation methods, this study helps determine which model accounting for extreme response tendency is most appropriate.

Electronic Board #19 Evaluating DIF Detection Procedure in the Context of the MIRID Isaac Li, University of South Florida

The model with internal restriction on item difficulties (MIRID) is a componential Rasch model with unique between-item relationships, which pose challenges for psychometric studies such as differential item functioning in its context. This empirical study compares and evaluates the suitability of four different DIF detection procedures for the MIRID.

Electronic Board #20 Item Difficulty Modeling of Computer-Adaptive Reading Comprehension Items Using Explanatory IRT Models Yukie Toyama, UC Berkeley, Graduate School of Education

This study investigated the effects of passage complexity and item type on difficulty of reading comprehension items for grades 2-12 students, using the Rasch latent regression linear logistic test model. Results indicated that it is text complexity, rather than item type, that explained the majority of variance in item difficulty.

Electronic Board #21 Recovering the Item Model Structure from Automatically Generated Items Using Graph Theory Xinxin Zhang, University of Alberta

We describe a methodology to recover the item models from generated items and present the results using a novel graph theory approach. We also demonstrate the methodology using generated items from the medical science domain. Our proposed methodology was found to be robust and generalizable.

Electronic Board #22 The Impact of Item Difficulty on Diagnostic Classification Models Ren Liu, University of Florida

Diagnostic classification models have been applied to non-diagnostic tests to partly meet the accountability demands for student improvement. The purpose of the study is to investigate the impact of item parameters (i.e., discrimination, difficulty, and guessing) on attribute classification when diagnostic classification models are applied to existing non-diagnostic tests.

Electronic Board #23 Sensitivity to Multidimensionality of Mixture IRT Models Yoonsun Jang, University of Georgia

Overextraction of latent classes is a concern when mixture IRT models are used in an exploratory approach. This study investigates whether some kinds of multidimensionality might result in overextraction of latent classes. A simulation study and an empirical example are presented to explain this effect.


Electronic Board #24 Monte Carlo Methods for Approximating Optimal Item Selection in CAT Tianyu Wang, University of Illinois

Monte Carlo techniques for item selection in an adaptive sequence are explored as a method for determining how to minimize mean squared error of ability estimation in CAT. Algorithms are developed to trim away candidate items as the test length increases, and connections to the Maximum Information criterion are studied.
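
For readers unfamiliar with the Maximum Information criterion referenced above, the Python sketch below shows the standard maximum-Fisher-information selection rule for a 2PL item pool. It is a generic baseline, not the Monte Carlo algorithm proposed in the study, and the item pool shown is hypothetical.

    import math

    def p_2pl(theta, a, b):
        """2PL probability of a correct response."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def fisher_info(theta, a, b):
        """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
        p = p_2pl(theta, a, b)
        return a * a * p * (1.0 - p)

    def select_max_info_item(theta_hat, pool, administered):
        """Return the index of the unused item with maximum information at theta_hat.
        pool is a list of (a, b) tuples; administered is a set of used indices."""
        candidates = [i for i in range(len(pool)) if i not in administered]
        return max(candidates, key=lambda i: fisher_info(theta_hat, *pool[i]))

    # Hypothetical five-item pool (a, b); current ability estimate 0.3
    pool = [(0.8, -1.0), (1.2, 0.0), (1.5, 0.4), (0.9, 1.2), (1.1, -0.3)]
    print(select_max_info_item(0.3, pool, administered={1}))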

Electronic Board #25 The Relationship Between Q-Matrix Loading, Item Usage, and Estimation Precision in CD-CAT Susu Zhang, University of Illinois at Urbana-Champaign

The current project explores the relationship between items’ Q-matrix loadings and their exposure rates in cognitive diagnostic computerized adaptive tests, under various information-based item selection algorithms. In addition, the consequences for estimation accuracy of selecting certain high-information items loading on a large number of attributes will be examined.


Saturday, April 9, 2016 4:05 PM - 6:05 PM, Renaissance East, Ballroom Level, Coordinated Session, E1

Do Large Scale Performance Assessments Influence Classroom Instruction? Evidence from the Consortia Session Discussant: Suzanne Lane, University of Pittsburgh

Each of the six major statewide assessment consortia created logic models to explicate their theories of action for including performance assessment components in their summative and formative assessment designs. In this session, we will focus on the theory of action hypothesis that including performance assessment components will lead to desired changes in classroom teaching activities and student learning. This hypothesis echoes similar theories in the statewide performance assessment movements of the 1990s (e.g., Davey, Ferrara, Shavelson, Holland, Webb, & Wise, 2015, p. 5; Lane & Stone, 2006, p. 389). Lane and colleagues conducted consequential validity studies to examine this hypothesis and found modest positive results (e.g., Darling-Hammond & Adamson, 2010; Lane, Parke, & Stone, 2002; Parke, Lane, & Stone, 2006; Stone & Lane, 2003). Other researchers reported worrisome unintended consequences (e.g., Koretz, Mitchell, Barron, & Keith, 1996).

In a 2015 NCME coordinated session, several consortia reported on validity arguments and supporting evidence for the performance assessment components of their programs. The session discussant observed that “The idea that PA will drive improvements in teaching is [the] most suspect part of [the theory of action]; more research needed.” That was a call for studies of impacts on teaching activities and student learning in the classroom. This session is a response to that call.

This session is a continuation of ongoing examinations of performance assessment in statewide assessment programs and follows from well-attended sessions at the 2013, 2014, and 2015 NCME meetings. The session is somewhat innovative in that we have included five of the six major statewide assessment consortia, with the goal of creating a comprehensive summary on this topic. A discussant will synthesize the evidence provided by the presenters and evaluate the consortia’s hypothesis about performance assessment and widely held beliefs about how performance assessment can influence curriculum development, teaching, and learning.

Performance assessment has re-emerged as a widely used assessment tool in large-scale assessment programs and in classroom formative assessment practices. Developments in validation theory (e.g., Kane, 2013) have placed claims and evidence at the center of test score interpretation and use arguments—in this case, claims about test use. The convergence of these two forces requires us to (a) explicate our rationales for using specific assessment tools for specific purposes and articulate intended claims and inferences, and (b) investigate the plausibility of these rationales and claims. The papers in this session will explicate the consortium rationales for including performance assessment in their designs and provide new evidence of the supportability of those rationales.

Smarter Balanced Assessment Consortium Marty McCall, Smarter Balanced Assessment Consortium

Dynamic Learning Maps Marianne Perie and Meagan Karvonen, CETE University of Kansas

NCSC Assessment Consortium Ellen Forte, edCount

ELPA21 Assessment Consortium Kenji Hakuta, Stanford University; Phoebe Winter, Independent Consultant

WIDA Consortium Dorry Kenyon and Meg Montee, Center for Applied Linguistics


Saturday, April 9, 2016 4:05 PM - 6:05 PM, Renaissance West A, Ballroom Level, Contributed Session, E2

Applications of Latent Regression to Modeling Student Achievement, Growth, and Educator Effectiveness Session Chair: J.R. Lockwood, Educational Testing Service Session Discussant: Matthew Johnson, Columbia University

There are both research and policy demands for making increasingly ambitious inferences about student achievement, achievement growth, and educator effectiveness using longitudinal educational data. For example, test score data are now used routinely to make inferences about achievement growth through Student Growth Percentiles (SGP), as well as inferences about the effectiveness of schools and teachers. A common concern in these applications is that inferences may have both random and systematic errors resulting from limitations of the achievement measures, limitations of the available data, and/or failure of statistical modeling assumptions. This session will present four diverse applications in which the accuracy of standard approaches to the estimation problems can be improved, or their validity tested, through latent regression modeling. “Latent regression” refers to statistical models involving the regression of unobserved variables on observed covariates (von Davier & Sinharay, 2010). For example, the National Assessment of Educational Progress uses regression of latent achievement constructs on student background and grouping variables to improve the value of the reported results for secondary analysis (Mislevy, Johnson, & Muraki, 1992). The increasing availability of methods and software for fitting latent regression models provides unprecedented opportunities for using them to improve inferences about quantities now being demanded from educational data.

Using the Fay-Herriot Model to Improve Inferences from Coarsened Proficiency Data Benjamin Shear, Stanford University; Katherine Furgol Castellano and J.R. Lockwood, Educational Testing Service

Estimating True SGP Distributions Using Multidimensional Item Response Models and Latent Regression Katherine Furgol Castellano and J.R. Lockwood, Educational Testing Service

Testing Student-Teacher Selection Mechanisms Using Item Response Data J.R. Lockwood, Daniel McCaffrey, Elizabeth Stone and Katherine Furgol Castellano, Educational Testing Service; Charles Iaconangelo, Rutgers University

Adjusting for Covariate Measurement Error When Estimating Weights to Balance Nonequivalent Groups Daniel McCaffrey, J.R. Lockwood, Shelby Haberman and Lili Yao, Educational Testing Service


Saturday, April 9, 2016 4:05 PM - 6:05 PM, Renaissance West B, Ballroom Level, Coordinated Session, E3

Jail Terms for Falsifying Test Scores: Yes, No or Uncertain? Session Moderator: Wayne Camara, ACT Session Debaters: Mike Bunch, Measurement Incorporated; S.E. Phillips, Assessment Law Consultant; Mike Beck, Testing Consultant; Rachel Schoenig, ACT

Far too many testing programs have recently faced public embarrassment and loss of credibility due to well-organized schemes by educators to fraudulently inflate test scores over extended periods of time. Even testing programs with good prevention, detection, and investigation strategies are frustrated because consequences such as score invalidation or loss of a license or credential seem insufficient to deter organized efforts to falsify test scores. The pecuniary gains, job security, and recognition from falsified scores have appeared to outweigh the deterrence effect of existing penalties.

This situation led a prosecutor in Atlanta, Georgia to employ a novel strategy to impose serious consequences on educators who conspired to fraudulently inflate student test scores. An extensive, external investigation triggered by excessive erasures and phenomenal test score improvements over ten years had implicated a total of 178 educators, 82 of whom had confessed and resigned, were fired, or lost their teaching licenses at administrative hearings. In 2013, a grand jury indicted 35 of the remaining educators, including the alleged leader of the conspiracy, Superintendent Beverly Hall, for violation of a state Racketeer Influenced and Corrupt Organizations (RICO) statute. The RICO statute was originally designed to punish mafia organized crime, but the prosecutor argued that the cover-ups, intimidation, and collusion involved in the organized activity of changing students’ answers on annual tests constituted a criminal enterprise. He further argued that this criminal enterprise obscured the academic deficiencies and shortchanged the education of poor minority students. Superintendent Hall, who denied the charges but faced a possible sentence of up to 45 years in jail, died of breast cancer shortly before the trial began. Twelve of the indicted educators who refused a plea bargain went to trial, and 11 were convicted. The lone defendant who was acquitted was a special education teacher who had administered tests to students with disabilities.

In April 2015, amid pleas for leniency and with an acknowledgement that the students whose achievements were misrepresented were the real victims, the trial judge handed down unexpected and stiff punishments that included jail terms for 8 of the convicted Atlanta educators. After refusing an opportunity to avoid jail time by admitting their crimes in open court and foregoing their rights to appeal, they were sentenced to jail terms of 1 to 7 years. Two of the remaining convicted educators, a testing coordinator and a teacher, accepted sentencing deals in which they received 6 months of weekends in jail and one year of home confinement, respectively. After the sentenced educators had been held in the county jail for two weeks following their convictions, the judge released them on bond pending appeal. About two weeks later, and consistent with the prosecutor’s original recommendations, the same judge reduced the jail time from 7 years to 3 years for the three administrators who had received the longest sentences. Despite these reductions, the sentencing of educators to multiyear jail terms for conspiring to falsify test scores remained unprecedented and controversial.

Although measurement specialists may focus mainly on threats to test score validity and view invalidation of scores as the most appropriate consequence for violations of test security rules, the exposure of educator conspiracies in Atlanta and a number of other districts nationally suggests that more severe penalties may be needed to deter such violations and ensure test score validity. Measurement specialists are likely to be part of the conversations with state testing programs considering alternative consequences and will be better able to participate responsibly if they are fully informed about the competing arguments for and against penalizing egregious test security violations with jail time.


Thus, the dual purposes of this symposium are to (1) conduct a debate to illuminate the arguments and evidence in favor of and against jail time for educators who conspire to falsify student test scores, and (2) provide audience members with an opportunity to discuss and vote on a model statute specifying penalties for conspiracy to falsify student test scores. The model statute also includes an alternative for avoiding jail time similar to that offered to the convicted Atlanta educators by the trial judge prior to sentencing. A debate format was chosen for this symposium to present a fair and balanced discussion so audience members can draw their own conclusions. The opportunity to hear arguments on both sides and to consider the issues from multiple perspectives should provide audience members with insights and evidence that can be shared with states considering alternative consequences for violations of test security rules.


Saturday, April 9, 2016 4:05 PM - 6:05 PM, Meeting Room 3, Meeting Room Level, Paper Session, E4

Test Design and Construction Session Discussant: Chad Buckendahl

Potential Impact of Section Order on an Internet-Based Admissions Test Scoring Naomi Gafni and Michal Baumer, National Institute for Testing & Evaluation

Meimad is an internet-based admissions test consisting of eight multiple-choice sections. One out of every seven test forms is randomly selected, and the eight test sections in it are presented to examinees in a random order. The study examines the effect of section position on performance level.

Automated Test-Form Generation with Constraint Programming (CP) Jie Li and Wim van der Linden, McGraw-Hill Education

Constraint programming (CP) is used to optimally solve automated test-form generation problems. The modeling and solution process is demonstrated for two empirical examples: (i) generation of a fixed test form with optimal item ordering; and (ii) real-time ordering of items in the shadow tests in CAT.

An Item-Matching Heuristic Method for a Complex Multiple Forms Test Assembly Problem Pei-Hua Chen and Cheng-Yi Huang, National Chiao Tung University

An item matching approach for a complex test specification problem was proposed and compared with the integer linear programming method. The purpose of this study is to extend the item matching method to tests with complex non-psychometric constraints such as set-based items, variable set length, and nested content constraints.

The Effect of Foil-Reordering and Minor Editorial Revisions on Item Performance Tingting Chen, Yu-Lan Su and Jui-Sheng Wang, ACT, Inc.

This study investigates how foil-reordering and minor reformatting and rewording affect item difficulty, discrimination, and other statistics for multiple-choice items using empirical data. Comparative and correlational analyses were conducted across administrations. The results indicated a significant impact of foil-reordering and rewording on item difficulty and key selection distributions.

Is Pre-Calibration Possible? A Conceptual AIG Framework, Model, and Empirical Investigation Shauna Sweet, University of Maryland, College Park; Mark Gierl, University of Alberta

While automatic item generation is technologically feasible, a conceptual architecture supporting the evaluation of these generative processes is needed. This study details such a framework and empirically examines the performance of a new multi-level model intended for pre-calibration of automatically generated items and evaluation of the generation process.

Award Session: NCME Annual Award: Mark Gierl & Hollis Lai


Saturday, April 9, 2016 4:05 PM - 6:05 PM, Meeting Room 4, Meeting Room Level, Coordinated Session, E5

Tablet Use in Assessment Session Discussant: Walter Way, Pearson

Use of tablet devices in the classroom continues to increase as Bring Your Own Device (BYOD), 1:1 technology programs, and flipped learning change the way students consume academic content, interact with their teachers and peers, and demonstrate their mastery of academic knowledge and skills. In addition, many K-12 assessment programs (e.g., NAEP, PARCC, SBAC) now or will soon allow administration of assessments using tablets. To ensure the validity and reliability of test scores, it is incumbent upon test developers to evaluate the potential impact of digital devices prior to their use within assessment. This session will explore various facets of the use of tablets within educational assessment and will include presentation of a set of five papers on this topic. The papers will utilize both qualitative and quantitative methods for evaluating tablet use and will evaluate impacts for different student sub-groups and special populations as well as tablet applications for both testing and scoring.

Improving Measurement Through Usability Nicholas Cottrell, Fulcrum

Using Tablet Technology to Develop Learning-Oriented English Language Assessment for English Learners Alexis Lopez, Jonathan Schmigdall, Ian Blood and Jennifer Wain, ETS

Device Comparability: Score Range & Subgroup Analyses Laurie Davis, Yuanyuan McBride and Xiaojing Kong, Pearson; Kristin Morrison, Georgia Institute of Technology

Response Time Differences Between Computers and Tablets Xiaojing Kong, Laurie Davis and Yuanyuan McBride, Pearson

Scoring Essays on an iPad Guangming Ling, Jean Williams, Sue O’Brien and Carlos Cavalie, ETS


Saturday, April 9, 2016 4:05 PM - 6:05 PM, Meeting Room 5, Meeting Room Level, Paper Session, E6

Topics in Multistage and Adaptive Testing Session Discussant: Jonathan Rubright, AICPA

A Top-Down Approach to Designing a Computerized Multistage Test Xiao Luo, Doyoung Kim and Ada Woo, National Council of State Boards of Nursing

The success of a computerized multistage test (MST) relies on a meticulous test design. This study introduces a new route-based top-down approach to designing MST, which imposes constraints and objectives upon routes and algorithmically searches for an optimal assembly of modules. This method simplifies and expedites the design of MST.

Comparison of Non-Parametric Routing Methods with IRT in Multistage Testing Design Evgeniya Reshetnyak, Fordham University; Alina von Davier, Charles Lewis and Duanli Yan, ETS

The goal of the proposed study is to compare the performance of non-parametric methods and machine learning techniques with traditional IRT methods for routing test takers in an adaptive multistage test design, using operational and simulated data.

A Modified Procedure in Applying CATS to Allow Unrestricted Answer Changing Zhongmin Cui, Chunyan Liu, Yong He and Hanwei Chen, ACT, Inc.

Computerized adaptive testing with salt (CATS) has been shown to be robust to test-taking strategies (e.g., Wainer, 1993) in a reviewable CAT. The robustness, however, is gained at the expense of test efficiency. We propose an innovative modification such that the modified CATS is both robust and efficient.

The Expected Likelihood Ratio in Computerized Classification Testing Steven Nydick, Pearson VUE

This simulation study compares the classification accuracy and expected test length of the expected likelihood ratio (ELR; Nydick, 2014) item selection algorithm to alternative algorithms in SPRT-based computerized classification testing (CCT). Results will help practitioners determine the most efficient method of item selection given a particular CCT stopping rule.
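
As context for SPRT-based classification testing, the Python sketch below implements the standard Wald sequential probability ratio test decision rule under a 2PL model. It is a generic illustration, not the ELR algorithm under study, and the item parameters, responses, and tuning values are hypothetical.

    import math

    def sprt_decision(responses, items, cut, delta=0.3, alpha=0.05, beta=0.05):
        """Wald SPRT classification decision for a 2PL computerized classification test.

        responses: list of 0/1 scores; items: list of (a, b) 2PL parameters.
        Tests H0: theta = cut - delta vs H1: theta = cut + delta.
        Returns 'above', 'below', or 'continue'."""
        def loglik(theta):
            ll = 0.0
            for u, (a, b) in zip(responses, items):
                p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
                ll += u * math.log(p) + (1 - u) * math.log(1.0 - p)
            return ll
        log_lr = loglik(cut + delta) - loglik(cut - delta)
        upper = math.log((1.0 - beta) / alpha)   # classify above the cut
        lower = math.log(beta / (1.0 - alpha))   # classify below the cut
        if log_lr >= upper:
            return 'above'
        if log_lr <= lower:
            return 'below'
        return 'continue'

    # Hypothetical responses to four items around a cut score of 0.0
    items = [(1.0, -0.2), (1.3, 0.1), (0.9, 0.3), (1.1, 0.0)]
    print(sprt_decision([1, 1, 0, 1], items, cut=0.0))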

A Comparison of the Pretest Item Calibration Procedures in CAT Xia Mao, Pearson

This study compares four procedures for calibrating pretest items in CAT using both real data and simulated data by manipulating the pretest item cluster length, calibration sample features, and calibration sample sizes. The results will provide guidance for pretest item calibration in large-scale CAT in K–12 contexts.

Pretest Item Selection and Calibration Under Computerized Adaptive Testing Shichao Wang, The University of Iowa; Chunyan Liu, ACT, Inc.

Pretest item calibration plays an important role in maintaining item pools under computerized adaptive testing. This study aims to compare and evaluate five pretest item selection methods in item parameter estimation using various calibration procedures. The practical significance of these methods is also discussed.

Using Off-Grade Items in Adaptive Testing—A Differential Item Functioning Approach Shuqin Tao and Daniel Mix, Curriculum Associates

This study is intended to assess the appropriateness of using off-grade items in adaptive testing from a differential item functioning (DIF) approach. Data came from an adaptive assessment administered to school districts nationwide. Insights gained will help develop item selection strategies in the adaptive algorithm to select appropriate off-grade items.


Saturday, April 9, 2016 4:05 PM - 6:05 PM, Meeting Room 12, Meeting Room Level, Paper Session, E7

Cognitive Diagnosis Models: Exploration and Evaluation Session Discussant: Laine Bradshaw, University of Georgia

Bayesian Inferences of Q-Matrix with Presence of Anchor Items Xiang Liu, Young-Sun Lee and Yihan Zhao, Teachers College, Columbia University

Anchor items are usually included in multiple administrations of the same assessment. Attribute specifications and item parameters can be obtained for these items from previous analyses. We propose a Bayesian method for estimating the Q-matrix in the presence of partial knowledge. A simulation demonstrates its effectiveness. TIMSS 2003 and 2007 data are then analyzed.

An Exploratory Approach to the Q-Matrix Via Bayesian Estimation Lawrence DeCarlo, Teachers College, Columbia University

An exploratory approach to determining the Q-matrix in cognitive diagnostic models is presented. All elements are specified as being uncertain with respect to inclusion, and posteriors from a Bayesian analysis are used for selection. Simulations show that the approach gives high rates of correct element recovery, typically over 90%.

Parametric or Nonparametric—Evaluating Q-Matrix Refinement Methods for Dina and Dino Models Yi-Fang Wu, University of Iowa; Hueying Tzou, National University of

Two model-based and one model-free statistical Q-matrix refinement methods are evaluated and compared against one another. Large-scope simulations are used to study their q-vector recovery rates and the correct rates of examinee classification. The three most recent methods are also applied to real data for identifying and correcting misspecified q-entries.

Comparing Attribute-Level Reliability Estimation Methods in Diagnostic Assessments Chunmei Zheng and Yuehmei Chien, Pearson; Ning Yan, Independent Consultant

Diagnostic classification models have drawn much attention from practitioners due to their promising use in aligning teaching, learning, and assessment. However, little has been investigated regarding attribute classification reliability. The purpose of this study, therefore, is to conduct a comparison of the existing reliability estimation methods.

Estimation of Diagnostic Classification Models Without Constraints: Issues with Class Label Switching Hongling Lao and Jonathan Templin, University of Kansas

Diagnostic classification models (DCMs) may suffer from the latent class label switching issue, providing misleading results. A simulation study is proposed to investigate (1) the prevalence of the label switching issue in different DCMs, and (2) the effectiveness of constraints at preventing label switching from happening.

Conditions Impacting Parameter and Profile Recovery Under the Nida Model Yanyan Fu, Jonathan Rollins and Robert Henson, UNCG

The NIDA model was studied under various conditions. Results indicated that sample size did not affect attribute parameter recovery or marginal correct classification rates (mCCRs). However, the number of attributes and items influenced the mCCRs. Data generated from the RUM and the NIDA model yielded similar mCCRs when estimated using the NIDA model.


Sequential Detection of Learning Multiple Skills in Cognitive Diagnosis Sangbeak Ye, University of Illinois - Urbana Champaign

Cognitive diagnosis models aim to identify examinees’ mastery or non-mastery of a vector of skills. In an e-learning environment where a set of skills is trained until mastery, a proper detection method to determine the presence of the skills is vital. We introduce techniques to detect change-points of multiple skills.


Saturday, April 9, 2016 4:05 PM - 5:35 PM, Mount Vernon Square, Meeting Room Level, Electronic Board Session, Paper Session, E8

Electronic Board #1 Response Styles Adjustments in Cross-Cultural Data Using the Mixture PCM IRT Model Bruce Austin, Brian French and Olusola Adesope, Washington State University

Response styles can contribute irrelevant variance to rating-scale items and compromise cross-cultural comparisons. Rasch IRT models were used to identify response styles and adjust data after identifying latent classes based on response styles. Predictive models were improved with adjusted data. We conclude with recommendations for identifying response style classes.

Electronic Board #2 Using Differential Item Functioning to Test for Inter-Rater Reliability in Educational Testing Sakine Gocer Sahin, Hacettepe University; Cindy M. Walker, University of Wisconsin-Milwaukee

Although multiple-choice items can be more reliable, open-ended items sometimes provide greater and more closely aligned information. This is only true, however, if the raters are unbiased. The purpose of this research was to investigate an alternative measure of inter-rater reliability based in IRT.

Electronic Board #3 Incorporating Expert Priors in Estimation of Bayesian Networks for Computer Interactive Tasks Johnny Lin, University of California, Los Angeles; Hongwen Guo, Helena Jia, Jung Aa Moon and Janet Koster van Groos, Educational Testing Service

Due to the cost of item development in computer interactive tasks, the amount of evidence available for estimation is reduced. In order to minimize instability, we show how expert priors can be incorporated into Bayesian Networks by performing a smoothing transformation to obtain posterior estimates.

Electronic Board #4 A Multidimensional Rater Effects Model Richard Schwarz, ETS; Lihua Yao, DMDC

One approach for evaluating rater effects is to add an explicit rater parameter to a polytomous IRT model, yielding what is called a rater effects model. A multidimensional rater effects model is proposed. Using MCMC techniques and simulation, specifications for priors, the posterior distributions, and estimation of the model will be described.

Electronic Board #5 Exploring Clinical Diagnosis Process Data with Cluster Analysis and Sequence Mining Feiming Li and Frank Papa, University of North Texas Health Science Center

This study collected clinical diagnosis process data from a diagnosis task performed by medical students in a computer-based environment. The study aimed to identify attributes of data-gathering behaviors that predict diagnostic accuracy and to conduct cluster analysis and sequential mining to explore meaningful attribute or sequential patterns explaining the success or failure of diagnosis.


Electronic Board #6 Validity Evidence for a Writing Assessment for Students with Significant Cognitive Disabilities Russell Swinburne Romine, Meagan Karvonen and Michelle Shipman, University of Kansas

Sources of evidence for a validity argument are presented for the writing assessment in the Dynamic Learning Maps Alternate Assessment System. Methods included teacher surveys, test administration observations, and a new cognitive lab protocol in which test administrators participated in a think-aloud during administration of a practice assessment.

Electronic Board #7 The Implications of Reduced Testing for Teacher Accountability Jessica Alzen, School of Education University of Colorado Boulder; Erin Fahle and Benjamin Domingue, Graduate School of Education Stanford University

The present student testing burden is substantial, and interest in alternative scenarios with reduced testing but persistent accountability measures has grown. This study focuses on value-added (VA) estimates in the presence of structural missingness of test data consistent with alternative scenarios designed to reduce the student testing burden.

Electronic Board #8 Examination of the Constructs Assessed by Published Tests of Critical Thinking Jennifer Kobrin, Edynn Sato, Emily Lai and Johanna Weegar, Pearson

We used a principled approach to define the construct of critical thinking and examined the degree to which existing tests are aligned to the construct. Our findings suggest that existing tests tend to focus on a narrow set of skills and identify gaps that offer opportunities for future assessment development.

Electronic Board #9 The False Discovery Rate Applied to Large-Scale Testing Security Screenings Tanesia Beverly, University of Connecticut; Peter Pashley, Law School Admission Council

When statistical tests are conducted repeatedly to detect test fraud (e.g., copying), the overall false-positive rate should be controlled. Three approaches to adjusting significance levels were investigated with simulated and real data. A procedure for controlling the false discovery rate by Benjamini and Hochberg (1995) yielded the best results.
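
The Benjamini and Hochberg (1995) step-up procedure referenced above can be sketched in a few lines of Python; the p-values in the example are hypothetical, and the code is a generic illustration rather than the authors' screening pipeline.

    def benjamini_hochberg(p_values, q=0.05):
        """Benjamini-Hochberg (1995) step-up procedure controlling the FDR at q.
        Returns the indices of flagged hypotheses (e.g., examinee pairs)."""
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])
        k_max = 0
        for rank, idx in enumerate(order, start=1):
            if p_values[idx] <= rank * q / m:
                k_max = rank            # largest rank satisfying the BH condition
        return sorted(order[:k_max])    # reject all hypotheses up to that rank

    # Hypothetical p-values from repeated copying analyses
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74], q=0.05))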

Electronic Board #10 The Impact of Ignoring Multiple-Group Structure in Testlet-Based Tests on Ability Estimation Ming Li, Hong Jiao and Robert Lissitz, University of Maryland

The study investigates the impact of ignoring the multiple-group structure on ability estimation in testlet-based tests. In a simulation, model parameter estimates from three IRT models (a standard 2PL model and multiple-group 2PL models with and without testlet effects) are compared and evaluated in terms of estimation errors.

Electronic Board #11 Reconceptualising Validity Incorporating Evidence of User Interpretation Timothy O’Leary, University of Melbourne; John Hattie and Patrick Griffin, Melbourne University

Validity is a fundamental consideration in test development. A recent conception, user validity, focuses on the accuracy and effectiveness of interpretations resulting from test score reports. This paper proposes a reconceptualization of validity incorporating evidence of user interpretations and a method for the collection of such evidence.


Electronic Board #12 Single and Double Linking Designs Assessed by Population Invariance Yan Huo and Sooyeon Kim, Educational Testing Service

The purpose of this study is to determine whether double linking is more effective than single linking in terms of achieving subpopulation invariance in scoring. When double linking was applied, the conversions derived from two subgroups from different geographic regions were more comparable to the conversion derived from the total group.

Electronic Board #13 Equating Mixed-Format Tests Using a Simple-Structure MIRT Model Under a CINEG Design Jiwon Choi, ACT/University of Iowa; Won-Chan Lee, University of Iowa

This study applies the SS-MIRT observed score equating procedure for mixed-format tests under the CINEG design. Also, the study compares various scale linking methods for SS-MIRT equating. The results show that the SS-MIRT approach provides more accurate equating results than the UIRT and traditional equipercentile methods.

Electronic Board #14 Pre-Equating or Post-Equating? Impact of Item Parameter Drift Wenchao Ma, Rutgers, The State University of New Jersey; Hao Song, National Board of Osteopathic Medical Examiners

This study, using a real-data-based simulation, examines whether item parameter drift (IPD) influences pre-equating and post-equating. Accuracy of ability estimates and classifications is evaluated under varied conditions of IPD direction, magnitude, and proportion of items with IPD. Recommendations are made on which equating method is preferred under different IPD conditions.

Electronic Board #15 A Comparative Study on Fixed Item Parameter Calibration Methods Keyu Chen and Catherine Welch, University of Iowa

This study provides a description of implementing the fixed item parameter method in BILOG-MG as well as a comparison of three fixed item parameter calibration methods when calibrating field test items on the scale of operational items. A simulation study will be conducted to compare results of the three methods.

Electronic Board #16 Examining Various Weighting Schemes’ Effect on Examinee Classification Using a Test Battery Qing Xie, ACT/The University of Iowa; Yi-Fang Wu, Rongchun Zhu and Xiaohong Gao, ACT, Inc.

The purpose of this study is to examine the effect of various weighting schemes on classifying examinees into multiple categories. The results will provide practical guidelines for using either profile scores or a composite score for examinee classification in a test battery.

Electronic Board #17 Module Assembly for Logistic Positive Exponent Model-Based Multistage Adaptive Testing Thales Ricarte and Mariana Cúri, Institute of Mathematical and Computer Sciences (ICMC-USP); Alina von Davier, Educational Testing Service (ETS)

In multistage adaptive testing (MST) based on item response theory models, modules are assembled by optimizing an objective function via linear programming. In this project, we analyzed MST based on the Logistic Positive Exponent model for testlet performance, using Fisher information, Kullback-Leibler information, and the Continuous Entropy Method as objective functions.


Electronic Board #18 Online Calibration Pretest Item Selection Design Rui Guo and Hua-hua Chang, University of Illinois at Urbana-Champaign

Pretest item calibration is crucial in multidimensional computerized adaptive testing. This study proposes an online calibration pretest item selection design named the four-quadrant D-optimal design with a proportional density index algorithm. Simulation results showed that the proposed method provides good item calibration efficiency.

Electronic Board #19 Online Multistage Intelligent Selection Method for CD-CAT Fen Luo, Shuliang Ding, Xiaoqing Wang and Jianhua Xiong, Jiangxi Normal University

A new item selection method, the online multistage intelligent selection method (OMISM), is proposed. Simulation results show that for OMISM, the pattern match ratio of knowledge state is higher than that for the posterior-weighted Kullback-Leibler information selection method in CD-CAT when examinees have mastered multiple attributes.

Electronic Board #20 Data-Driven Simulations of False Positive Rates for Compound DIF Inference Rules Quinn Lathrop, Northwest Evaluation Association

Understanding how inference rules function under the null hypothesis is critical. This proposal presents a data-driven simulation method to determine the false positive rate of tests for DIF. The method does not assume a functional form of the item characteristic curves and also replicates impact from empirical data.

Electronic Board #21 Simultaneous Evaluation of DIF and Its Sources Using Hierarchical Explanatory Models William Skorupski, Jennifer Brussow and Jessica Loughran, University of Kansas

This study uses item-level features as explanatory variables for understanding DIF. Two approaches for DIF identification/explanation are compared: 1) two-stage DIF + regression, and 2) a simultaneous, hierarchical approach. Realistic data were simulated by varying the strength of relationship between DIF and explanatory variables and reference/focal group sample sizes.

Electronic Board #22 Comparing Imputation Methods for Trait Estimation Using the Rating Scale Model Christopher Runyon, Rose Stafford, Jodi Casabianaca and Barbara Dodd, The University of Texas at Austin

This research investigates trait level estimation under the rating scale model using three imputation methods of handling missing data: (a) multiple imputation, (b) nearest-neighbor hot deck imputation, and (c) multiple hot deck imputation. We compare the performance of these methods for three levels of missingness crossed with three scale lengths.

Electronic Board #23 The Nonparametric Method to Analyze Multiple-Choice Items: Using Hamming Distance Method Shibei Xiang, Wei Tian and Tao Xin, National Cooperative Innovation Center for Assessment and Improvement of Basic Education Quality

Many data in education are in the form of multiple-choice (MC) items that are scored dichotomously. In order to obtain information from incorrect answers, we expand the Q-matrix to the item options and use a nonparametric Hamming distance method to classify examinees, which can be used even with small sample sizes.
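
The nonparametric Hamming distance classification referenced above can be illustrated, for the simpler dichotomous case with DINA-type ideal responses, by the Python sketch below; the Q-matrix and response vector are hypothetical, and the sketch does not include the authors' option-level Q-matrix expansion.

    from itertools import product

    def ideal_response(alpha, q_row):
        """DINA-type ideal response: 1 only if the profile has every attribute the item requires."""
        return int(all(a >= q for a, q in zip(alpha, q_row)))

    def classify_by_hamming(responses, q_matrix):
        """Assign the attribute profile whose ideal response pattern is closest
        (in Hamming distance) to the observed dichotomous responses."""
        n_attr = len(q_matrix[0])
        best_profile, best_dist = None, None
        for alpha in product((0, 1), repeat=n_attr):
            ideal = [ideal_response(alpha, row) for row in q_matrix]
            dist = sum(r != e for r, e in zip(responses, ideal))
            if best_dist is None or dist < best_dist:
                best_profile, best_dist = alpha, dist
        return best_profile, best_dist

    # Hypothetical Q-matrix (4 items x 2 attributes) and one response vector
    q_matrix = [[1, 0], [0, 1], [1, 1], [1, 0]]
    print(classify_by_hamming([1, 0, 0, 1], q_matrix))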


Electronic Board #24 Automatic Scoring System for a Short Answer in Korean Large Scale Assessment EunYoung Lim, Eunhee Noh and Kyunghee Sung, Korean Institute for Curriculum and Evaluation

The purpose of this study is to evaluate a prototype of the Korean automatic scoring system (KASS) for short answers and to explore the related features of KASS to improve the accuracy of automatic scoring.


Saturday, April 9, 2016 4:05 PM - 7:00 PM, Convention Center, Level Two, Room 202 A

The Life and Contributions of Robert L. Linn, Followed by a Reception

Note: NCME is partnering with AERA to record this session. We will make this recording available to all NCME members, including those who have to miss this tribute due to presentations and attendance at NCME sessions.


Saturday, April 9, 2016 6:30 PM - 8:00 PM, Grand Ballroom South, Ballroom Level

NCME and AERA Division D Joint Reception

National Council on Measurement in Education and AERA Division D Welcome Reception for Current and New Members


Annual Meeting Program - Sunday, April 10, 2016



Sunday, April 10, 2016 8:00 AM - 9:00 AM, Marquis Salon 6, Marriott Marquis Hotel

2016 NCME Breakfast and Business Meeting (Ticketed Event)

Join your friends and colleagues at the NCME Breakfast and Business Meeting at the Marriott Marquis Hotel. Theater-style seating will be available for those who did not purchase a breakfast ticket but wish to attend the Business Meeting.


Sunday, April 10, 2016 9:00 AM - 9:40 AM, Marquis Salon 6, Marriott Marquis Hotel

Presidential Address: Education and the Measurement of Behavioral Change Rich Patz, ACT, Iowa City, IA


Sunday, April 10, 2016 10:35 AM - 12:05 PM, Renaissance East, Ballroom Level, Invited Session, F1

Award Session Career Award: Do Educational Assessments Yield Achievement Measurements? Winner: Mark Reckase Session Moderator: Kadriye Ercikan, University of British Columbia

Because my original training in measurement/psychometrics was in psychology rather than education, I have noted the difference in approaches taken for the development of tests in those two disciplines. One begins with the concept of a hypothetical construct and locating persons along a continuum; the other begins with the definition of a domain of content and works to estimate the amount of the domain that a person has acquired. This presentation will address whether these two conceptions of test development are consistent with each other and with the assumptions of the IRT models that are often used to analyze the test results. It will also address how test results are interpreted and whether those interpretations are consistent with the measurement model and the test design. Finally, there is a discussion of how users of test results would like to interpret results, and whether measurement experts can produce tests and analysis procedures that will support the desired interpretations.


Sunday, April 10, 2016 10:35 AM - 12:05 PM, Renaissance West A, Ballroom Level, Invited Session, F2

Debate: Should the NAEP Mathematics Framework Be Revised to Align with the Common Core State Standards? Session Presenters: Michael Cohen, Achieve Chester Finn, Fordham Institute Session Moderators: Bill Bushaw, National Assessment Governing Board Terry Mazany, Chicago Community Trust

The 2015 National Assessment of Educational Progress (NAEP) results showed declines in mathematics scores at grades 4 and 8 for the nation and several states and districts. The release of the 2015 NAEP results prompted discussion about the extent to which the results may have been affected by differences between the content of the NAEP mathematics assessments and the Common Core State Standards in mathematics. The National Assessment Governing Board wants to know what you think. The presenters will frame the issue and then audience members will engage in a thorough discussion providing important insights to Governing Board members.


Sunday, April 10, 2016 10:35 AM - 12:05 PM, Renaissance West B, Ballroom Level, Coordinated Session, F3

Beyond Process: Theory, Policy, and Practice in Standard Setting Session Chair: Karla Egan, NCIEA Session Discussant: Chad Buckendahl, Alpine Testing

Standard setting has become a routine and (largely) accepted part of the test development cycle for K-12 summative assessments. Conventional implementation of almost any K-12 standard setting method convenes teachers who study achievement level descriptors (ALDs) to make decisions about the knowledge, skills, and abilities (KSAs) expected of students. Traditionally, these cut scores have gone to state boards of education or education commissioners, who are sometimes reluctant to adjust cut scores established by educators. While these conventional practices have served the field well, there are particular areas that deserve further scrutiny.

The first area needing further scrutiny is the validity of the ALDs, which provide a common framework for panelists to use when recommending cut scores. These ALDs are often written months or years prior to the test, sometimes even providing guidance for item writers and test developers regarding the KSAs expected of students on the test (Egan, Schneider, & Ferrara, 2012). What happens when carefully developed ALDs are not well aligned to actual student performance? This misalignment occurs in practice, yet only a handful of studies have examined the issue (e.g., Schneider, Egan, Kim, & Brandstrom, 2012). The first paper seeks to validate the ALDs used in the development of a national alternate assessment against student performance on that assessment.

The next area that needs a closer look is the use of educators as panelists in standard setting workshops. Educators may have a conflict of interest in recommending cut scores, because they are asked to recommend cut scores that have direct consequences for accountability measures, such as teacher evaluation. There are other means of setting cut scores that do not involve teachers. For example, when setting college-readiness cut scores, it may not even be necessary to bring in panelists if the state links performance on its high school test to a test like the ACT or SAT. The second paper investigates the positives and negatives of quantitative methods for setting cut scores.

Another take on the same issue is to involve panelists who are able to reflect globally on how cut scores will impact school-, district-, and statewide systems. To this end, methods have been used that show different types of data to inform panelist decisions (Beimers, Way, McClarty, & Miles, 2012). Others have brought in district-level staff following the content-based standard setting to adjust cut scores from a system perspective. The third paper approaches this as a validity issue, and it examines the different types of evidence (beyond process) that should be used to support standard setting.

The final issue that deserves further scrutiny is the use of panelists as evaluators of the standard setting. Panelists often serve as the only evaluators of the implementation and outcome of the method itself. Panelists fill out evaluations at the end of the standard setting, and these are often used as validity evidence supporting the cut scores. While this group represents an important perspective on the standard setting process, it is important to recognize that panelists are often heavily invested in the process by the time they participate in an evaluation of the standard setting workshop. The last paper considers the role that an external evaluator could play at standard setting.

The Alignment of Achievement Level Descriptors to Student Performance Lori Nebelsick-Gullet, edCount

Data-Based Standard Setting: Moving Away from Panelists? Joseph Martineau, NCIEA


Examining Validity Evidence of Policy Reviews Juan d’Brot, DRC

The Role of the External Evaluator in Standard Setting Karla Egan, NCIEA


Sunday, April 10, 2016 10:35 AM - 12:05 PM, Meeting Room 3, Meeting Room Level, Coordinated Session, F4

Exploring Timing and Process Data in Large-Scale Assessments Session Chairs: Matthias von Davier and Qiwei He, Educational Testing Service Session Discussant: Ryan Baker, Teachers College Columbia University

Computer-based assessments (CBAs) provide new insights into behavioral processes related to task completion that cannot be easily observed using paper-based instruments. In CBAs, a variety of timing and process data accompanies test performance data, so much more is available than response correctness alone. The analyses of these types of data are necessarily much more involved than those typically performed on traditional tests. This symposium provides examples of how sequences of actions and timing data are related to task performance and how to use process data to interpret students’ computer and information literacy achievements in large-scale international education and skills surveys such as the Programme for International Student Assessment (PISA), the Programme for the International Assessment of Adult Competencies (PIAAC), and the International Computer and Information Literacy Study (ICILS). The methods applied in these talks draw on cognitive theories for guidance on what “good” problem solving is, as well as on modern data-analytic techniques that can be utilized to explore log file data. These studies highlight the potential of analyzing students’ behavior stored in log files in computer-based large-scale assessments and show the promise of tracking students’ problem-solving strategies by using process data analysis.

An Overview: Process Data – Why Do We Care? Matthias von Davier, Educational Testing Service

Log File Analyses of Students’ Problem-Solving Behavior in PISA 2012 Assessment Samuel Greiff and Sascha Wüstenberg, University of Luxembourg

Identifying Feature Sequences from Process Data in PIAAC Problem-Solving Items with N-Grams Qiwei He and Matthias von Davier, Educational Testing Service

Predictive Feature Generation and Selection from Process Data in PISA Simulation-Based Environment Zhuangzhuang Han, Qiwei He and Matthias von Davier, Educational Testing Service


Sunday, April 10, 2016 10:35 AM - 12:05 PM, Meeting Room 4, Meeting Room Level, Coordinated Session, F5

Psychometric Challenges with the Machine Scoring of Short-Form Constructed Responses Session Chair: Mark Shermis, University of Houston—Clear Lake Session Discussant: Claudia Leacock, CTB/McGraw-Hill

This session examines four methodological problems associated with machine scoring of short-form constructed responses. The first study looks at the detection of speededness with short-answer questions on a testlet-based science test. Because items in a testlet are scored together, speededness can have a negative and even irrecoverable impact on an examinee’s score. The second study attempted to detect speededness and differential speededness on Task-Based Simulations (TBSs; a type of short-form constructed response) that were part of a licensing exam. Since the TBSs are embedded in the same section of the exam as multiple-choice questions, the goal was to ensure that examinees have enough time to complete the test. The third study used a new twist on adjudicating short-answer machine scores: instead of using a second human rater to adjudicate discrepant scores between one human and one machine rater, the study employed two different machine scoring systems and used a human rater to resolve differences in scores. The last study attempted to explain DIF using linguistic feature sets of machine-scored short-answer questions taken from middle- and high-school exam questions. The study suggests that focal and reference groups have different “linguistic profiles” that may explain differences in test performance on particular items.

Speededness Effects in a Constructed Response Science Test Meereem Kim, Allan Cohen, Zhenqui Lu, Seohyun Kim, Cory Buxton and Martha Allexsaht-Snider, University of Georgia

Speededness for Task Based Simulations Items in a Multi-Stage Licensure Examination Xinhui Xiong, American Institute for Certified Public Accountants

Short-Form Constructed Response Machine Scoring Adjudication Methods Susan Lottridge, Pacific Metrics, Inc.

Use of Automated Scoring to Generate Hypotheses Regarding Language Based DIF Mark Shermis, University of Houston--Clear Lake; Liyang Mao, IXL Learning; Matthew Mulholland, Educational Testing Service; Vincent Kieftenbeld, PacificMetrics, Inc.


Sunday, April 10, 2016 10:35 AM - 12:05 PM, Meeting Room 5, Meeting Room Level, Paper Session, F6

Advances in Equating Session Discussant: Benjamin Andrews, ACT

Bifactor MIRT Observed Score Equating of Testlet-Based Tests with Nonequivalent Groups Mengyao Zhang, National Conference of Bar Examiners; Won-Chan Lee, The University of Iowa; Min Wang, ACT

This study extends a bifactor MIRT observed-score equating framework for testlet-based tests (Zhang et al., 2015) to accommodate nonequivalent groups. Binary data are simulated to represent varying degrees of testlet effect and group equivalence. Different procedures are evaluated regarding the estimated equating relationships for number-correct scores.

Hierarchical Generalized Linear Models (HGLMs) for Testlet-Based Test Equating Ting Xu and Feifei Ye, University of Pittsburgh

This simulation study investigated the effectiveness of Hierarchical Generalized Linear Models (HGLMs) as concurrent calibration models for testlet-based test equating under the anchor-test design. Three approaches were compared, including two under the HGLM framework and one using Rasch concurrent calibration. Degrees of testlet variance were manipulated.

The Local Tucker Method and Its Standard Errors Sonya Powers, Pearson; Lisa Larsson, ERC Credit Modelling

A new linear equating method is proposed that addresses limitations of the local and Tucker equating methods. This method uses a bivariate normal distribution to model common and non-common item scores. Simulation results indicate that this new method has comparable standard errors to the original Tucker method and less bias.
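
For context, the original Tucker linear equating method that the proposed method builds on can be sketched in Python as follows, using the standard synthetic-population formulas (as in Kolen & Brennan); the function and score vectors below are hypothetical illustrations, not the new local Tucker method.

    import math

    def _mean(v):
        return sum(v) / len(v)

    def _var(v):
        m = _mean(v)
        return sum((x - m) ** 2 for x in v) / len(v)

    def _cov(u, v):
        mu, mv = _mean(u), _mean(v)
        return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

    def tucker_linear_equating(x1, v1, y2, v2, w1=0.5):
        """Classical Tucker linear equating under the common-item nonequivalent
        groups design. Returns a function mapping Form X scores to the Form Y
        scale. x1, v1: total/anchor scores in group 1; y2, v2: total/anchor
        scores in group 2; w1: synthetic-population weight for group 1."""
        w2 = 1.0 - w1
        g1 = _cov(x1, v1) / _var(v1)          # slope of X on V (group 1)
        g2 = _cov(y2, v2) / _var(v2)          # slope of Y on V (group 2)
        dmu = _mean(v1) - _mean(v2)
        dvar = _var(v1) - _var(v2)
        mu_x = _mean(x1) - w2 * g1 * dmu
        mu_y = _mean(y2) + w1 * g2 * dmu
        var_x = _var(x1) - w2 * g1 ** 2 * dvar + w1 * w2 * g1 ** 2 * dmu ** 2
        var_y = _var(y2) + w1 * g2 ** 2 * dvar + w1 * w2 * g2 ** 2 * dmu ** 2
        slope = math.sqrt(var_y / var_x)
        return lambda x: slope * (x - mu_x) + mu_y

    # Hypothetical small samples of total (X or Y) and anchor (V) scores
    eq = tucker_linear_equating([20, 25, 30, 35], [8, 10, 12, 14],
                                [18, 24, 29, 33], [7, 10, 11, 13])
    print(round(eq(28), 2))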

Using Criticality Analysis to Select Loglinear Smoothing Models Arnond Sakworawich, National Institute of Development Administration; Han-Hui Por and Alina von Davier, Educational Testing Service; David Budescu, Fordham University

This paper proposes “Criticality analysis” as a loglinear smoothing model selection procedure . We show that this method outperforms traditional methods that rely on global measures of fit of the original data set by providing a clearer and sharper differentiation between the competing models .


Sunday, April 10, 2016 10:35 AM - 12:05 PM, Meeting Room 15, Meeting Room Level, Paper Session, F7

Novel Approaches for the Analysis of Performance Data Session Discussant: William Skorupski, University of Kansas

Combining a Mixture IRT Model with a Nominal Random Item Mixture Model Hye-Jeong Choi and Allan Cohen, University of Georgia; Brian Bottge, University of Kentucky

This study describes a psychometric model in which a mixture item response theory model (MixIRTM) is combined with a random item mixture nominal response model (RMixNRM). Including error and accuracy in one model has the potential to provide a more direct explanation of differences in response patterns.

Bayesian Estimation of Null Categories in Constructed-Response Items Yong He, Ruitao Liu and Zhongmin Cui, ACT, Inc.

Estimating item parameters in the presence of a null category in a constructed-response item is challenging . The problem has not been investigated in the generalized partial credit model (GPCM) . A Bayesian estimation of null categories based on the GPCM framework is proposed in this study .

The Fast Model: Integrating Learning Science and Measurement José González-Brenes, Pearson; Yun Huang and Peter Brusilovsky, University of Pittsburgh

The assessment and learning science communities rely on different paradigms to model student performance . Assessment uses models that capture different student abilities and problem difficulties, while learning science uses models that capture skill acquisition . We present our recent work on FAST (Feature Aware Student knowledge Tracing) to bridge both communities .

Award Session: Brenda Loyd Dissertation Award 2016, Yuanchao Emily Bo


Sunday, April 10, 2016 10:35 AM - 12:05 PM, Mount Vernon Square, Meeting Room Level, Electronic Board Session, Paper Session, F8

Electronic Board #1 Multilevel IRT: When is Local Independence Violated? Christine DeMars and Jessica Jacovidis, James Madison University

Calibration data are often collected within schools. This illustration shows that random school effects for ability do not bias IRT parameter estimates or their standard errors. However, random school effects for item difficulty lead to bias in item discrimination estimates and inflated standard errors for difficulty and ability.

Electronic Board #2 The Higher-Order IRT Model for Global and Local Person Dependence Kuan-Yu Jin and Wen-Chung Wang, The Hong Kong Institute of Education

Persons from the same clusters may behave more similarly than those from different clusters. In this study, we proposed a higher-order partial credit model for person clustering to quantify global and local person dependence for clustered samples in multiple tests. Simulation studies supported good parameter recovery of the new model.

Electronic Board #3 A Multidimensional Item Response Model for Local Dependence and Content Domain Structure Yue Liu, Sichuan Institute of Education Sciences; Lihua Yao, Defense Manpower Data Center; Hongyun Liu, Beijing Normal University, Department of Psychology

This study proposed a multidimensional item response model for testlets to simultaneously account for local dependence due to item clustering and multidimensional structure. Within-testlet and between-testlet models are applied to real data from collaborative problem-solving assessments. Precision of the domain scores and overall scores from the proposed models is compared.

Electronic Board #4 Distinguishing Struggling Learners from Unmotivated Students in an Intelligent Tutoring System Kimberly Colvin, University at Albany, SUNY

To help teachers distinguish struggling learners from unmotivated students, a measure of examinee motivation designed for large-scale computer-based tests was modified and applied to an intelligent tutoring system . Proposed modifications addressed issues related to small sample sizes . The relationship of hint use and student motivation was also investigated .

Electronic Board #5 Using Bayesian Networks for Prediction in a Comprehensive Assessment System Nathan Dadey and Brian Gong, The National Center for the Improvement of Educational Assessment

This work shows how a Bayesian network can be used to predict student summative achievement classifications using assessment data collected throughout the school year. The structure of the network is based on a curriculum map. The ultimate aim is to examine the usefulness of the network information to teachers.


Electronic Board #6 Comparability Within Computer-Based Assessment: Does Screen Size Matter? Jie Chen and Marianne Perie, Center for Educational Testing and Evaluation

Comparability studies are moving beyond paper-and-pencil versus computer-based assessments to analyze variation across computing devices. Using data from a large district administering tests on either Macs (with large, high-definition screens) or Chromebooks (with standard 14” screens), this study compares assessment results between devices by grade, subject, and item type.

Electronic Board #7 Modeling Acquiescence and Extreme Response Styles and Wording Effects in Mixed-Format Items Hui-Fang Chen, City University of Hong Kong; Kuan-Yu Jin and Wen-Chung Wang, Hong Kong Institute of Education

Acquiescence and extreme response styles and wording effects are commonly observed in rating scale or Likert items. In this study, a multidimensional IRT model was proposed to account for these two response styles and wording effects simultaneously. The effectiveness and feasibility of the new model were examined in simulation studies.

Electronic Board #8 Accessibility: Consideration of the Learner, the Teacher, and Item Performance Bill Herrera, Charlene Turner and Lori Nebelsick-Gullett, edCount, LLC; Lietta Scott, Arizona Department of Education, Assessment Section

To better understand the impact of federal legislation that required schools to provide access to academic curricula to students with intellectual disability, the National Center and State Collaborative examined differential performance of items with respect to students’ communication and opportunity to learn using data from three assessment administrations .

Electronic Board #9 Examining the Growth and Achievement of STEM Majors Using Latent Growth Models Heather Rickels, Catherine Welch and Stephen Dunbar, University of Iowa, Iowa Testing Programs

This study examined the use of latent growth models (LGM) when investigating the growth and college readiness of STEM majors versus non-STEM majors . Specifically, LGMs were used to compare growth on a state achievement test from Grades 6-11 of STEM majors and non-STEM majors at a public university .

Electronic Board #10 Modeling NCTM and CCSS 5th Grade Math Growth Estimates and Interactions Dan Farley and Meg Guerreiro, University of Oregon

This study compares NCTM and CCSS growth estimates. Multilevel models were used to compare growth across the two sets of standards. The CCSS measures appear to be more sensitive to growth, but exhibit potential biases toward female students and English learners.

Electronic Board #11 Norming and Psychometric Analysis for a Large-Scale Computerized Adaptive Early Literacy Assessment James Olsen, Renaissance Learning Inc.

This paper presents psychometric analysis and norming information for a large-scale adaptive K-3 early-literacy assessment. It addresses validity, reliability, and later grade 3 reading proficiency. The norming involved sampling 586,380 fall/spring assessments, post-stratification weighting to a representative national sample, descriptive score statistics, and developing scale percentiles and grade equivalents.


Electronic Board #12 The Impact of Ignoring the Multiple-Group Structure of Item Response Data Yoon Jeong Kang, American Institutes for Research; Hong Jiao and Robert Lissitz, University of Maryland

This study examines model parameter estimation accuracy and proficiency level classification accuracy when the multiple-group structure of item response data is ignored . The results show that the heterogeneity of population distribution was the most influential factor on the accuracy of model parameter estimation and proficiency level classification .

Electronic Board #13 Influential Factors on College Retention Based on Tree Models and Random Forests Chansoon Lee, Sonya Sedivy and James Wollack, University of Wisconsin-Madison

The purpose of this study is to examine influential factors on college retention . Tree models and random forests will be applied to determine important factors on student retention and to improve the prediction of college retention .

Electronic Board #14 Detecting Non-Effortful Responses to Short-Answer Items Ruth Childs, Gulam Khan and Amanda Brijmohan, Ontario Institute for Studies in Education, University of Toronto; Emily Brown, Sheridan College; Graham Orpwood, York University

This study investigates the feasibility and effects of using the content of short-answer responses, in addition to response times, to improve the filtering of non-effortful responses from field test data and so improve item calibration .

Electronic Board #15 Item Difficulty Modeling for an ELL Reading Comprehension Test Using LLTM Lingyun Gao, ACT, Inc.; Changjiang Wang, Pearson

This study models cognitive complexity of the items included in a large-scale high-stakes reading comprehension test for English language learners (ELL), using the linear logistic test model (LLTM; Fischer, 2005) . The findings will have implications for targeted test design and efficient item development .

Electronic Board #16 The Effect of Unmotivated Test-Takers on Field Test Item Calibrations H. Jane Rogers and Hariharan Swaminathan, University of Connecticut

A simulation study was conducted to investigate the effect of low motivation of test-takers on field-test item calibrations . Even small percentages of unmotivated test-takers resulted in substantial underestimation of discrimination parameters and overestimation of difficulty parameters . These calibration errors resulted in inaccurate estimation of trait parameters in a CAT administration .

Electronic Board #17 Cognitive Analysis of Responses Scored Using a Learning Progression for Proportional Reasoning Edith Aurora Graf, ETS; Peter van Rijn, ETS Global

Learning progressions are complex structures based on a synthesis of standards documents and research studies, and therefore require empirical verification . We describe a validity exercise in which we compare IRT-based classifications of students into the levels of a learning progression to classifications provided by a human rater .


Electronic Board #18 Nonparametric Diagnostic Classification Analysis for Testlet-Based Tests Shuying Sha and Robert Henson, University of North Carolina at Greensboro

This study investigates the impact of the testlet effect on the performance of parametric and nonparametric (Hamming distance method) diagnostic classification analysis. Results showed that the performance of both approaches deteriorated as the testlet effect size increased. Potential solutions to nonparametric classification for testlet-based tests are proposed.

Electronic Board #19 An Application of Second-Order Growth Mixture Model for Educational Longitudinal Research Xin Li and Changhua Rich, ACT, Inc.; Hongyun Liu, Beijing Normal University

Investigating change in individual achievement over time is of central importance in educational research. The current study describes and illustrates the use of the second-order latent growth model, and its extension to the growth mixture model, with real data to help model growth while accounting for population heterogeneity.

Electronic Board #20 Confirmatory Factor Analysis of TIMSS Mathematics Attitude Items with Recommendations for Change Thomas Hogan, University of Scranton

This study reports results of confirmatory factor analysis for Trends in International Mathematics and Science Study (TIMSS) math attitude scales for national samples of students in the United States at grades 4 and 8 . Recommendations are made for improvement of the scales, particularly for the Self-confidence latent variable .

Electronic Board #21 Controlling for Multiplicity in Structural Equation Models Michael Zweifel and Weldon Smith, University of Nebraska-Lincoln

When evaluating a structural equation model, several hypotheses are evaluated simultaneously, which increases the probability that a Type I error is committed. This proposal examined how several common multiple comparison procedures performed when the number of item response categories and the item variances were varied.

Electronic Board #22 Alternative Approaches for Comparing Test Score Achievement Gap Trends Benjamin Shear, Stanford University; Yeow Meng Thum, Northwest Evaluation Association

This paper compares trajectories of cross-sectional achievement gaps between subgroups to subgroup differences in longitudinal growth trajectories . The impact of vertical scaling assumptions is assessed with parallel analyses in an ordinal metric . We suggest ways to test inferences about closing gaps (“equalization”) across grades and cohorts, possibly for value-added analyses .


Sunday, April 10, 2016 12:25 PM - 2:25 PM, Ballroom ABC, Level Three, Convention Center

AERA Awards Luncheon

AERA’s Awards Program is one of the most prominent ways for education researchers to recognize and honor the outstanding scholarship and service of their peers . Recipients of AERA awards are announced and recognized during the Annual Awards Luncheon .


Sunday, April 10, 2016 2:45 PM - 4:15 PM, Renaissance East, Ballroom Level, Coordinated Session, G1

Challenges and Opportunities in the Interpretation of the Testing Standards Session Chair: Andrew Wiley, Alpine Testing Solutions, Inc . Session Discussant: Barbara Plake, University of Nebraska-Lincoln

Across divisions of the professional assessment community, the Standards for Educational and Psychological Testing (AERA/APA/NCME, 2014) and its requirements serve as the guiding principles for testing programs when determining procedures and policies . However, while the Standards do serve as the primary source for the assessment community, the interpretation of the Standards continues to be a somewhat subjective affair . Because validity is dependent on the context of each program, testing professionals are required to interpret and align the guidelines to prioritize and evaluate relevant evidence .

For example, in some scenarios a term such as “representative” can be difficult to define, and reasonable people could interpret evidence with notably different expectations . In practical terms, this can become problematic for the profession because if the Standards are not sufficiently clear for the purposes of interpretability and accountability within the profession, it creates more confusion when trying to communicate these expectations to policymakers and lay audiences .

The purpose of this session is to focus on how assessment professionals use and interpret the Standards and the procedures that individuals and organizations use when applying them. Each of the four presenters will discuss the methods and procedures that their respective organizations have developed, or how they have advised organizations they work with about interpreting and using the Standards to design or improve their programs. In addition, they will also discuss the sections of the Standards that they have found to be particularly difficult to interpret, with recommendations about how additional interpretative guidance would make the Standards more effective to implement.

The session will conclude with Dr. Barbara Plake serving as discussant. Dr. Plake is one of the leading voices on the value and importance of the Standards and will review each paper and share some of her experience in the use and interpretation of the Standards.

Using the Testing Standards as the Basis for Developing a Validation Argument. Wayne Camara, ACT

Using the Standards to Support Assessment Quality Evaluation Erika Hall and Thanos Patelis, Center for Assessment

Blurring the Lines Between Credentialing and Employment Testing Chad Buckendahl, Alpine Testing Solutions, Inc.

Content Based Evidence and Test Score Validation Ellen Forte, edCount, LLC


Sunday, April 10, 2016 2:45 PM - 4:15 PM, Renaissance West A, Ballroom Level, Coordinated Session, G2

Applications of Combinatorial Optimization in Educational Measurement Session Chairs: Wim van der Linden and Michelle Barrett, Pacific Metrics; Bernard Veldkamp, University of Twente; Dmitry Belov, Law School Admission Council

Combinatorial optimization (CO) is concerned with searching for an element from a finite set (called a feasible set) that would optimize (minimize or maximize) a given objective function . Numerous practical problems can be formulated as CO problems, where a feasible set is not given explicitly but is represented implicitly by a list of inequalities and inclusions . Two unique features of CO problems should be mentioned:

1. In practice, a feasible set is so large that a straightforward approach to solving a corresponding CO problem by checking every element of the feasible set would take an astronomical amount of time. For example, in the traveling salesman problem (TSP) (given a list of n cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?), the corresponding feasible set contains (n–1)!/2 elements (routes). Thus, in the case of 25 cities there are 310,224,200,866,620,000,000,000 possible routes. Assuming that a computer can check each route in 1 microsecond (1/1,000,000 of a second), an optimal solution of the TSP with 25 cities would be found in about 9,837,144,878 years (a small calculation reproducing these figures appears after this list). With respect to the size of a given CO problem (e.g., the number of cities, n, in the TSP), the time it takes to solve the problem can be approximated by an exponential function, resulting in an exponential time (e.g., c·2^n, c·e^n, where c is a constant), in contrast to a polynomial time (e.g., c·n·log n, c·n^2).

2. Often, a given CO problem can be reduced to another CO problem in polynomial time. Thus, if one CO problem can be solved efficiently (e.g., in polynomial time), then the whole class of CO problems can be solved efficiently as well.
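
As a back-of-the-envelope check on the first point, the short Python sketch below reproduces (up to rounding in the route count) the figures quoted above for the 25-city TSP: the number of distinct routes, (n–1)!/2, and the time needed to enumerate them at one route per microsecond, using 365-day years. It is an editorial illustration only, not part of the session materials.

from math import factorial

def brute_force_tsp_cost(n_cities, seconds_per_route=1e-6):
    # Number of distinct undirected TSP routes and the years needed to
    # enumerate all of them at the given speed.
    n_routes = factorial(n_cities - 1) // 2      # (n-1)!/2 undirected tours
    seconds = n_routes * seconds_per_route       # brute-force enumeration time
    years = seconds / (60 * 60 * 24 * 365)       # convert seconds to 365-day years
    return n_routes, years

routes, years = brute_force_tsp_cost(25)
print(f"{routes:,} routes")    # 310,224,200,866,619,719,680,000 routes
print(f"{years:,.0f} years")   # about 9,837,144,878 years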

Fortunately, the modern CO literature provides methods that, during the search, allow us to identify and remove large portions of the feasible set that do not contain an optimal element . As a result, many real instances of CO problems can be solved in a reasonable amount of time . The most popular method is branch-and-bound (Papadimitriou & Steiglitz, 1982), which solves an instance of the TSP with 25 cities in less than one minute on a regular PC .

The history of CO applications in educational measurement began in the early 1980s, when psychometricians started to use CO methods for automated test assembly (ATA) . Theunissen (1985) reduced a special case of an ATA problem to a knapsack problem (Papadimitriou & Steiglitz, 1982) . van der Linden and Boekkooi-Timminga (1989) formulated an ATA problem as a maximin problem . Later, Boekkooi-Timminga (1990) extended this approach to the assembly of multiple test forms with no common items . Soon after that, the ATA problem attracted many more researchers, whose major results are reviewed in van der Linden (2005) .

The first part of this coordinated session will introduce CO and then review its existing and potential future applications to educational measurement . More specifically, it will introduce mixed integer programming (MIP) modeling as a tool for finding solutions to CO problems, emphasizing such key notions as constraints, objective function, feasible and optimal feasible solutions, linear and nonlinear models, and heuristic and solver-based solutions . It will then review areas of educational measurement where CO has already provided or has the potential to provide optimal solutions to main problems, including areas such as optimal test assembly, automated test-form generation, item-pool design, adaptive testing, calibration sample design, controlling test speededness, parameter linking design, and test-based instructional assignment .
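
To make the MIP modeling concrete, the following minimal sketch formulates a toy automated test assembly problem in Python: binary decision variables select items from a small invented pool so as to maximize total information at a single ability point, subject to a test-length constraint and a crude content blueprint. The item pool, all numbers, and the choice of the open-source PuLP package are editorial assumptions, not material from the session.

# Toy mixed integer programming (MIP) model for automated test assembly.
# Requires the open-source PuLP package: pip install pulp
import pulp

# Hypothetical item pool: (item id, Fisher information at theta = 0, content area).
pool = [
    ("i1", 0.62, "algebra"),  ("i2", 0.48, "algebra"), ("i3", 0.71, "geometry"),
    ("i4", 0.35, "geometry"), ("i5", 0.55, "number"),  ("i6", 0.66, "number"),
    ("i7", 0.40, "algebra"),  ("i8", 0.59, "geometry"),
]

# One binary decision variable per item: 1 = item goes on the form, 0 = it does not.
x = {item: pulp.LpVariable(f"x_{item}", cat="Binary") for item, _, _ in pool}

model = pulp.LpProblem("toy_test_assembly", pulp.LpMaximize)

# Objective function: maximize total information of the selected items at theta = 0.
model += pulp.lpSum(info * x[item] for item, info, _ in pool)

# Constraint: fixed test length of four items.
model += pulp.lpSum(x.values()) == 4

# Constraints: at least one item from each content area (a crude blueprint).
for area in ("algebra", "geometry", "number"):
    model += pulp.lpSum(x[item] for item, _, a in pool if a == area) >= 1

model.solve(pulp.PULP_CBC_CMD(msg=False))   # CBC is the MIP solver bundled with PuLP
selected = [item for item, _, _ in pool if x[item].value() == 1]
print("Selected items:", selected)

Operational models add many more constraint types (item exposure, enemy items, timing), but the ingredients the session emphasizes, namely decision variables, an objective function, and constraints defining the feasible set, carry over unchanged.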

The second part of this coordinated session will discuss three recent applications of CO in educational measurement. The first application relates to linking. For the common dichotomous and polytomous response models, linking response model parameters across test administrations that use separate item calibrations requires the use of common items and/or common examinees. Error in the estimated linking function parameters occurs as a result of propagation of estimation error in the response model parameters (van der Linden & Barrett, in press). When using a precision-weighted average approach to estimation of linking parameters, linking error appears to be additive in the contribution of each linking item. Therefore, minimizing linking error when selecting common items from the larger set of available items from the first test administration may be facilitated using CO. Three new MIP models used to optimize the selection of a set of linking items, subject to blueprint and practical test requirements, will be presented. Empirical results will demonstrate the use of the models.

The second application is for ATA under uncertainty in item parameters. Commonly, in an ATA problem one assumes that item parameters are known precisely. However, they are always estimated from some dataset, which adds uncertainty to the corresponding CO problem. Several optimization strategies dealing with uncertainty in the objective function and/or constraints of a CO problem have been developed in the literature. This presentation will focus on robust and stochastic optimization strategies, which will be applied to both linear and adaptive test assembly. The impact of the uncertainty on the ATA process will be studied, and practical recommendations to minimize the impact will be provided.

The third application relates to two important topics in test security: detection of item preknowledge and detection of aberrant answer changes (ACs). Item preknowledge describes a situation in which a group of examinees (called aberrant examinees) have had access to some items (called compromised items) from an administered test prior to the exam. Item preknowledge negatively affects both the corresponding testing program and its users (e.g., universities, companies, government organizations) because scores for aberrant examinees are invalid. In general, item preknowledge is difficult to detect due to three unknowns: (i) unknown subgroups of examinees at (ii) unknown test centers who (iii) had access to unknown subsets of compromised items prior to taking the test. To resolve the issue of multiple unknowns, two CO methods are applied. First, a random search detects suspicious test centers and suspicious subgroups of examinees. Second, given suspicious subgroups of examinees, simulated annealing identifies compromised items. Advantages and limitations of the methods will be demonstrated using both simulated and real data. The statistical analysis of ACs has uncovered multiple testing irregularities on large-scale assessments. However, existing statistics capitalize on the uncertainty in AC data, which may result in a large Type I error. Without loss of generality, for each examinee, two disjoint subsets of administered items are introduced: the first subset has items with ACs; the second subset has items without ACs, assembled by CO methods to minimize the distance between its characteristic curve and the characteristic curve of the first subset. A new statistic measures the difference in performance between these two subsets, where to avoid the uncertainty, only final responses are used. In computer simulations, the new statistic demonstrated a strong robustness to the uncertainty and higher detection rates in contrast to two popular statistics based on wrong-to-right ACs.
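
For readers unfamiliar with simulated annealing, the sketch below shows the general shape of a subset search like the one described above, applied to invented data: the state is a candidate set of compromised items, the objective (a purely illustrative stand-in for the presenters' actual criterion) is the mean per-item score gap between a flagged subgroup and the remaining examinees, and the standard accept/reject rule with geometric cooling is used. The data, objective function, and cooling schedule are editorial assumptions.

# Simulated annealing over item subsets (illustrative sketch only).
import math
import random

random.seed(0)
n_items, n_flagged, n_reference = 30, 40, 200
true_compromised = set(range(10))      # unknown in real data; used only to simulate

# Fake dichotomous responses: the flagged subgroup does unusually well on the
# compromised items and is otherwise comparable to the reference examinees.
def simulate(n_examinees, boosted_items):
    return [[1 if random.random() < (0.9 if j in boosted_items else 0.5) else 0
             for j in range(n_items)] for _ in range(n_examinees)]

flagged = simulate(n_flagged, true_compromised)
reference = simulate(n_reference, set())

def gap(item_set):
    # Mean per-item proportion-correct difference, flagged minus reference.
    if not item_set:
        return 0.0
    f = sum(r[j] for r in flagged for j in item_set) / (len(flagged) * len(item_set))
    g = sum(r[j] for r in reference for j in item_set) / (len(reference) * len(item_set))
    return f - g

# Standard annealing loop: propose toggling one item in or out of the set, always
# accept improvements, accept worse states with probability exp(delta / T),
# and cool the temperature T geometrically.
state = set(random.sample(range(n_items), 5))
current = gap(state)
best, best_val, temp = set(state), current, 1.0
for _ in range(5000):
    proposal = set(state)
    proposal.symmetric_difference_update({random.randrange(n_items)})
    candidate = gap(proposal)
    delta = candidate - current
    if delta > 0 or random.random() < math.exp(delta / temp):
        state, current = proposal, candidate
        if current > best_val:
            best, best_val = set(state), current
    temp *= 0.999

print("Items flagged as compromised:", sorted(best))

The point of the sketch is the mechanics of the annealing loop (state, proposal, acceptance probability, cooling), not the particular objective, which operational detection methods define far more carefully.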


Sunday, April 10, 2016 2:45 PM - 4:15 PM, Renaissance West B, Ballroom Level, Paper Session, G3

Psychometrics of Teacher Ratings Session Discussant: Tia Sukin, Pacific Metrics

Psychometric Characteristics and Item Category Maps for a Student Evaluation of Teaching Patrick Meyer, Justin Doromal, Xiaoxin Wei and Shi Zhu, University of Virginia

We describe psychometric characteristics of a student evaluation of teaching with four dimensions: Organization, Assessment, Interactions, and Rigor . Using data from 430 students and 65 university classrooms, we implemented an IRT-based approach to maximum information item category mapping to facilitate score interpretation and multilevel models to evaluate threats to validity .

Psychometric Stability of Tripod Student Perception Surveys with Reduced Data Catherine McClellan, Clowder Consulting; John Donoghue, Educational Testing Service

Student perception surveys such as Tripod™ are becoming more commonly used as part of PK-12 classroom teacher evaluations. The loss of classroom time to survey administration remains a concern for teachers. This study examines the impact of various data reduction approaches on survey results.

Does the ‘type’ of Rater Matter When Evaluating Special Education Teachers? Janelle Lawson, San Francisco State University; Carrie Semmelroth, Boise State University

This study examined how school administrators without any formal experience in special education performed using the Recognizing Effective Special Education Teachers (RESET) Observation Tool compared with previous reliability studies that used experienced special education teachers as raters . Preliminary findings indicate that ‘type’ of rater matters when evaluating special education teachers .

Measuring Score Consistency Between Teacher and Reader Scored Grades Yang Zhao, University of Kansas; Jonathan Rollins, University of North Carolina; Deanna Morgan and Priyank Patel, The College Board

The purpose of this paper is to evaluate score consistency between teachers and readers . Measures such as the Pearson correlation, Root Mean Square Error, Mean Absolute Error, Root Mean Square Error in agreement, and the Concordance Correlation Coefficient in agreement, are calculated .


Sunday, April 10, 2016 2:45 PM - 4:15 PM, Meeting Room 3, Meeting Room Level, Paper Session, G4

Multidimensionality Session Discussant: Mark Reckase, Michigan State University

An Index for Characterizing Construct Shift in Vertical Scales Jonathan Weeks, ETS

The purpose of this study is to define an index that characterizes the amount of construct shift associated with a “unidimensional” vertical scale when the underlying data are multidimensional . The method is applied to large-scale math and reading assessments .

Multidimensional Test Assembly of Parallel Test Forms Using a Kullback-Leibler Information Index Dries Debeer, University of Leuven; Usama Ali, Educational Testing Service; Peter van Rijn, ETS Global

The statistical targets commonly used for the assembly of parallel test forms in unidimensional IRT are not directly transferable to multidimensional IRT. To fill this gap, a Kullback-Leibler-based information index (KLI) is proposed. The KLI is discussed and evaluated in the uni- and multidimensional cases.

Evaluating the Use of Unidimensional IRT Procedures for Multidimensional Data Wei Wang, Chi-Wen Liao and Peng Lin, Educational Testing Service

This study intends to investigate the feasibility of applying unidimensional IRT procedures (including item calibration and equating) for multidimensional data . Both simulated data and operational data will be used . The results will provide suggestions about under which conditions it is appropriate to use unidimensional IRT procedures to analyze multidimensional data .

Classification Consistency and Accuracy Indices for Multidimensional Item Response Theory Wenyi Wang, Lihong Song and Shuliang Ding, Jiangxi Normal University; Hua-Hua Chang, University of Illinois at Urbana- Champaign

For criterion-referenced tests, classification consistency and accuracy are important indicators for evaluating the reliability and validity of classification results. The purpose of this study is to explore these indices for complex decision rules under multidimensional item response theory. They would be valuable for score interpretation and computerized classification testing.


Sunday, April 10, 2016 2:45 PM - 4:15 PM, Meeting Room 4, Meeting Room Level, Paper Session, G5

Validating “Noncognitive”/Nontraditional Constructs I Session Discussant: William Lorié, Center for NextGen Learning & Assessment, Pearson

Improving the NAEP SES Measure: Can NAEP Learn from Other Survey Programs? Young Yee Kim and Jonathan Phelan, American Institutes for Research; Jing Chen, National Center for Education Statistics; Grace Ji, Avar Consulting, Inc.

This study is designed as part of NCES’s efforts to improve the NAEP SES measure. Based on findings from an extensive review of various survey programs within and outside NCES and a review of the literature, some suggestions are made to help NCES report a new SES measure in 2017.

Investigating SES Using the NAEP-HSLS Overlap Sample Burhan Ogut, George Bohrnstedt and Markus Broer, American Institutes for Research

This study examines the relationships among the three main SES components (parental education, occupational status and income) based on parent-reports on the one hand, and student-reports of SES proxy variables (parents’ education, household possessions, and NSLP eligibility) on the other hand, using multiple-indicators and multiple- causes models and seemingly unrelated regressions .

Rethinking the Measurement of Noncognitive Attributes Andrew Maul, University of California, Santa Barbara

The quality of “noncognitive” measurement lags behind the quality of measurement in traditional academic realms. This project identifies a potentially serious gap in the validity argument for a prominent measure of growth mindsets. New approaches to the measurement of growth mindsets are piloted and exemplified.

Validating Relationships Among Mathematics-Related Self Efficacy, Self Concept, Anxiety and Achievement Measures Madhabi Chatterji and Meiko Lin, Teachers College, Columbia University

In this construct validation study, we use structural equation modeling to validate theoretically specified pathways and correlations of mathematics-related self-efficacy, self-concept, and anxiety with math achievement scores. Results are consistent with past research with older students, and carry implications for research, policy, and classroom practice.


Sunday, April 10, 2016 2:45 PM - 4:15 PM, Meeting Room 5, Meeting Room Level, Paper Session, G6

Invariance Session Discussant: Ha Phan, Pearson

The Impact of Measurement Noninvariance in Longitudinal Item Response Modeling In-Hee Choi, University of California, Berkeley

This study investigates the impact of measurement noninvariance across time and group in longitudinal item response modeling, when researchers examine group difference in growth . First, measurement noninvariance is estimated from a large-scale longitudinal survey . These results are then used for a simulation study with different sample sizes .

Measurement Invariance in International Large-Scale Assessments: Ordered-Categorical Outcomes in a Multidimensional Context Dubravka Svetina, Indiana University; Leslie Rutkowski, University of Oslo

A critical precursor to comparing means on latent variables across cultures is that the measures are invariant across groups. A lack of consensus on cutoff values for evaluating model fit in the literature motivates this study, in which we consider the performance of fit measures when data are modeled as multidimensional and ordered-categorical.

Assessing Uniform Measurement Invariance Using Multilevel Latent Modeling Carrie Morris, University of Iowa College of Education; Xin Li, ACT

This simulation study investigated use of multilevel MIMIC and mixture models for assessing uniform measurement invariance . A multilevel model was generated with measurement error, and measurement and factorial noninvariances were imposed . Model fit, parameter and standard error bias, and power to detect noninvariance were assessed for all estimated models .

Population Invariance of Equating Functions Across Subpopulations for a Large Scale Assessment Lucy Amati and Alina von Davier, Educational Testing Service

In this study, we examine the population invariance assumption for a large-scale assessment. Results of the analysis demonstrated that the equating functions for subpopulations are very close to that of the total population. Results supported the invariance assumption of the equating function, helping to demonstrate the fairness of the test.


Sunday, April 10, 2016 2:45 PM - 4:15 PM, Meeting Room 15, Meeting Room Level, Paper Session, G7

Detecting Aberrant Response Behaviors Session Discussant: John Donoghue, ETS

Methods That Incorporate Response Times and Responses for Excluding Data Irregularities Heru Widiatmo, ACT, Inc.

Two methods, which use both responses and response times for excluding data irregularities, are combined and compared to find an optimal method . The methods are Response Time Effort (RTE) and Effective Response Time (ERT) . The 3-PL IRT model is used to calibrate data and to evaluate the results .

Online Detection of Compromised Items with Response Times in CAT Hyeon-Ah Kang, University of Illinois at Urbana-Champaign

An online calibration based CUSUM procedure is proposed to detect compromised items in CAT . The procedure utilizes both observed item responses and response times for evaluating changes in item parameter estimates that are obtained on-the-fly during the CAT administrations .

Detecting Examinee Preknowledge of Items: A Comparison of Methods Xi Wang, University of Massachusetts Amherst; Frederic Robin, Hongwen Guo and Neil Dorans, Educational Testing Service; Yang Liu, University of California, Merced

In a continuous testing program, examinees are likely to have preknowledge of some items due to the repeated use of items over time . In this study, two methods are proposed to detect item preknowledge at person level, and their effectiveness is compared in a multistage adaptive testing context .

Development of an R Package for Statistical Analysis in Test Security Jiyoon Park, Yu Zhang and Lorin Mueller, Federation of State Boards of Physical Therapy

Statistical analysis of test results is the approach most widely used by test sponsors. Different statistical methods can be used to capture signs of security breaches and to evaluate the validity of test scores. We propose an R package that provides systematic and comprehensive analyses in test security.


Sunday, April 10, 2016 2:45 PM - 4:15 PM, Mount Vernon Square, Meeting Room Level, Electronic Board Session: GSIC Graduate Student Poster Session, G8

Graduate Student Issues Committee Brian Leventhal, Chair Masha Bertling, Laine Bradshaw, Lisa Beymer, Evelyn Johnson, Ricardo Neito, Ray Reichenberg, Latisha Sternod, Dubravka Svetina

Electronic Board #1 Examining Test Irregularities Using Multidimensional Scaling Approach Qing Xie, ACT/The University of Iowa

The purpose of this simulation study is to explore the possibility of using multidimensional scaling in detecting test irregularities via the concept of consistency of a battery or test structure . The results will provide insights on how well this method can be applied in different test irregularity situations .

Electronic Board #2 The Influence of Measurement Invariance in the Two-Wave, Longitudinal Mediation Model Oscar Gonzalez, Arizona State University

Statistical mediation describes how two variables are related by examining intermediate mechanisms . The mediation model assumes an underlying longitudinal design and that the same constructs are measured over time . This study examines what happens to the mediated effect when longitudinal measurement invariance is violated in a two-wave mediation model .

Electronic Board #3 Parallel Analysis of Unidimensionality with PCA and PAF in Dichotomously Scored Data Ismail Cukadar, Florida State University

This Monte Carlo study investigates the impact of using two different factor extraction methods (principal component analysis and principal axis factoring) in the Kaiser rule and the parallel analysis on the decision of unidimensionality in binary data with examinee guessing.

Electronic Board #4 Reducing Data Demands of Using a Multidimensional Unfolding IRT Model Elizabeth Williams, Georgia Institute of Technology

A simulation study will be performed to investigate using a multidimensional scaling (MDS) solution in conjunction with the Multidimensional Generalized Graded Unfolding Model (MGGUM) to reduce data demands . The expected results are that the data demands will be reduced without sacrificing the quality of true parameter recovery .

Electronic Board #5 Challenging Conditions for MML and MH-RM Estimation of Multidimensional IRT Models Derek Sauder, James Madison University

The MHRM estimator is faster than the MML estimator, and generally gives comparable parameter estimates . In one real dataset, the two procedures estimated similar item parameter values but different correlations between the subscales . A simulation will be conducted to examine which factors might lead to discrepancies between the estimators .


Electronic Board #6 The Effects of Dimensionality and Dimensional Structure on Composite Scores and Subscores Unhee Ju, Michigan State University

Both composite scores and subscores can provide diagnostic information about students’ specific progress . A simulation study was conducted to examine the performance of composite scores and subscores under different conditions of the number of dimensions, dimensional structure, and correlation between dimensions . Their implications will be discussed in the presentation .

Electronic Board #7 Simple Structure MIRT True Score Equating for Mixed-Format Tests Stella Kim, The University of Iowa

This study proposes a SS-MIRT true-score equating procedure for mixed-format tests and investigates its performance based on the results from real data analyses and a simulation study .

Electronic Board #8 Conditions of Evaluating Models with Approximate Measurement Invariance Using Bayesian Estimation Ya Zhang, University of Pittsburgh

A simulation study is performed to investigate approximate measurement invariance (MI) through Bayesian estimation. The size of differences in item intercepts, the proportion of items with differences, and the level of prior variability are manipulated. The study findings provide a general guideline for the use of approximate MI.

Electronic Board #9 Detecting Nonlinear Item Position Effects with a Multilevel Model Logan Rome, University of Wisconsin-Milwaukee

When tests utilize a design in which items appear in different orders in various booklets, the item position can impact item responses . This simulation study will examine the performance of a multilevel model in detecting several functions and sizes of non-linear item-specific position effects .

Electronic Board #10 Comparison of Scoring Methods for Different Item Types Hongyu Diao, University of Massachusetts-Amherst

This study will use a Monte Carlo simulation method to investigate the impact of concurrent calibration and separate calibration for the mixed-format test . The response data of Multiple Choice and Technology-Enhanced Items are simulated to represent two different dimensions .

Electronic Board #11 IRT Approach to Estimate Reliability of Testlet with Balanced and Unbalanced Data Nana Kim, Yonsei University

This study aims to investigate the effects of balanced and unbalanced data structures on the reliability estimates of testlet-based test when applying item response theory (IRT) approaches using simulated data sets . We focus on the relationship between patterns of reliability estimates and the degree of imbalance in data structure .


Electronic Board #12 Hierarchical Bayesian Modeling for Peer Assessment in a Massive Open Online Course Yao Xiong, The Pennsylvania State University

Peer assessment has been widely used in most massive open online courses (MOOCs) to provide feedback on constructed-response questions. However, peer rater accuracy and reliability are a major concern. The current study proposes a hierarchical Bayesian approach to account for rater accuracy and reliability.

Electronic Board #13 The Impact of Model Misspecification in the DCM-CAT Yu Bao, The University of Georgia

Item parameters are usually assumed to be known in DCM-CAT simulations . When the assumption is violated, model misspecification may lead to different item information and posterior distribution, which are essential for item selection . The study shows how mis-fitting DCMs and overfitting DCMs will influence item bank usage and classification accuracy .

Electronic Board #14 Interval Estimation of IRT Proficiency in Mixed-Format Tests Shichao Wang, The University of Iowa

Interval estimation of proficiency can help to clearly present information to test users on how to interpret the uncertainty in their scores . This study intends to compare the performance of analytical and empirical approaches in constructing an interval for IRT-based proficiency for mixed-format tests using simulation techniques .

Electronic Board #15 Analysis of Item Difficulty Predictors for Item Pool Development Feng Chen, The University of Kansas

Systematic item difficulty prediction is introduced, which accounts for all possible item features. The effect of these features on resulting item parameters is demonstrated using simulated and real data. Results will provide statistical and evidentiary implications for item pool development and test construction.

Electronic Board #16 Regressing Multiple Predictors into a Cognitive Diagnostic Model Kuan Xing, University of Illinois at Chicago

This study investigates the stability of parameter estimates and classification when multiple covariates of different types are analyzed in the RDINA and HO-DINA models. Real-world (TIMSS) data analyses and a simulation study were conducted. The educational significance of examining the relationship between covariates and the CDM is discussed.

Electronic Board #17 Non-Instructional Factors That Affect Student Mathematics Performance Michelle Boyer, University of Massachusetts, Amherst

The effects of non-instructional factors in educational success are increasingly important for educational authorities to understand as they seek to improve student outcomes . This study evaluates a large number of such factors and their effects on mathematics performance for a large US nationally representative sample of students .


Electronic Board #18 A Procedure to Improve Item Parameter Estimation in Presence of Test Speededness Can Shao, University of Notre Dame

In this study, we propose a data cleansing procedure based on change-point analysis to improve item parameter estimation in the presence of test speededness. Simulation results show that this procedure can dramatically reduce the bias and root mean square error of the item parameter estimates.

Electronic Board #19 Simulation Study of Estimation Methods in Multidimensional Student Response Data Philip Grosse, University of Pittsburgh

The purpose of this simulation study is to provide a comparison of WLSMV and BAYES estimators in a bifactor model based on simulated multidimensional student responses . The estimation methods are compared in terms of their item parameter recovery and ability estimation .

Electronic Board #20 Detecting Testlet Effect Using Graph Theory Xin Luo, Michigan State University

The testlet effect has a significant influence on measurement accuracy and test validity. This study proposed a new approach based on graph theory to detect the testlet effect. Results of a simulation study supported the quality of this method.

Electronic Board #21 Assessing Item Response Theory Dimensionality Assumptions Using DIMTEST and NOHARM-Based Methods Kirsten Hochstedt, Penn State University

This study examined how select IRT dimensionality assessment methods performed for two- and three-parameter logistic models with combinations of short test lengths, small sample sizes, and ability distribution shapes (skewness, kurtosis) . The capability of DIMTEST and three NOHARM-based methods to detect dimensionality assumption violations in simulated data was compared .

Electronic Board #22 Evaluating the Invariance Property in IRT: A Case of Multi-State Assessment Seunghee Chung, Rutgers University

This simulation study investigates how the invariance property of IRT item parameters holds in a multi-state assessment situation, especially when the characteristics of member states are dissimilar to one another. Practical implications for multi-state assessment development are discussed to avoid potential measurement bias caused by lack of the invariance property.

Electronic Board #23 Evaluating Predictive Accuracy of Alternative IRT Models and Scoring Methods Charles Iaconangelo, Rutgers University, The State University of New Jersey

This paper uses longitudinal data from a large urban school system to evaluate different item response theory models and scoring methods for their value in predicting future test scores . It finds that both richer IRT models, and scoring methods based on response patterns rather than number correct, improve predictive accuracy .


Electronic Board #24 A Comparison of Estimation Methods for the Multi-Unidimensional Three-Parameter IRT Model Tzu Chun Kuo, Southern Illinois University Carbondale

Two marginal maximum likelihood (MML) approaches, three fully Bayesian algorithms, and a Metropolis-Hastings Robbins-Monro (MHRM) algorithm were compared for estimating multi-unidimensional three-parameter models using simulations . Preliminary results suggested that the two MML approaches, together with blocked Metropolis and MHRM, had an overall better parameter recovery than the other estimation methods .

Electronic Board #25 A Methodology for Item Condensation Rule Identification in Cognitive Diagnostic Models Diego Luna Bazaldua, Teachers College, Columbia University

A methodology within a Bayesian framework is employed to identify the item condensation rules for cognitive diagnostic models (CDMs) . Simulated and empirical data are used to analyze the ability of the methodology to detect the correct condensation rules for different CDMs .


Sunday, April 10, 2016 4:35 PM - 5:50 PM, Ballroom C, Level Three, Convention Center

AERA Presidential Address

Public Scholarship to Educate Diverse Democracies Jeannie Oakes, AERA President; University of California - Los Angeles


Sunday, April 10, 2016 4:35 PM - 6:05 PM, Renaissance East, Ballroom Level, Coordinated Session, H1

Advances in Balanced Assessment Systems: Conceptual Framework, Informational Analysis, Application to Accountability Session Chair: Scott Marion, National Center for the Improvement of Educational Assessment Session Discussant: Lorrie Shepard, University of Colorado, Boulder

For more than a decade, there have been calls for multiple assessments to be designed and used in more integrated ways—for “balanced” or “comprehensive” assessment systems . However, there has been little focused work on clearly defining what is meant by a balanced assessment system as well as the characteristics that contribute to the quality of such assessment systems . Importantly, there have been scant analyses of such systems and in particular how instructional and accountability demands might both be addressed . This coordinated session presents advances in conceptualizing and analyzing balanced assessment systems .

The session begins with an overview of the need for considering the quality of balanced assessment systems, with an emphasis on validity and usefulness. The second presentation focuses on conceptualizing the systems aspects of a balanced assessment system: what characterizes a system that goes beyond good individual assessments? The third presentation combines two approaches, content-based alignment judgments and scale-based interpretations, to obtain content-referenced information from assessments that supports instruction and learning. These approaches are based on the actual information available and the interpretations supported. The fourth presentation presents a technical analysis of comparability in a balanced assessment system in the context of school accountability.

Balanced Assessment Systems: Overview and Context Brian Gong and Scott Marion, National Center for the Improvement of Educational Assessment

Systemic Aspects of Balanced Assessment Systems Rajendra Chattergoon, University of Colorado, Boulder

Validity and Utility in a Balanced Assessment System: Use, Information, and Timing Phonraphee Thummaphan, University of Washington, Seattle; Nathan Dadey, Center for Assessment

Comparability in Balanced Assessment Systems for State Accountability Carla Evans, University of New Hampshire; Susan Lyons, Center for Assessment


Sunday, April 10, 2016 4:35 PM - 6:05 PM, Renaissance West A, Ballroom Level, Coordinated Session, H2

Minimizing Uncertainty: Effectively Communicating Results from CDM-Based Assessments Session Discussant: Jacqueline Leighton, University of Alberta

Fueled by needs for educational tests that provide diagnostic feedback, researchers have made recent progress in designing statistical models that are well-suited to categorize examinees according to mastery levels for a set of latent skills or abilities . Cognitive diagnosis models (CDMs) yield probabilistic classifications of students according to multiple facets, termed attributes, of knowledge or reasoning . These results have the potential to inform instructional decision-making and learning, but in order to do so the results must be comprehensible to a variety of education stakeholders .
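
For readers new to these models, the brief sketch below illustrates, with a generic DINA-type calculation on invented numbers, what probabilistic classification by attribute means in practice: a posterior over attribute profiles is computed from one examinee's responses and then marginalized into per-attribute mastery probabilities of the kind a score report might display. It is an editorial illustration, not an analysis from any of the session papers.

# A generic DINA-type calculation on invented numbers (illustration only).
from itertools import product

K = 2                                     # two latent attributes
Q = [(1, 0), (0, 1), (1, 1), (1, 1)]      # Q-matrix: attributes required by each of 4 items
slip = [0.10, 0.10, 0.15, 0.15]           # P(incorrect | all required attributes mastered)
guess = [0.20, 0.20, 0.10, 0.10]          # P(correct | at least one required attribute missing)
response = [1, 0, 1, 1]                   # one examinee's scored responses

profiles = list(product([0, 1], repeat=K))          # all 2^K attribute-mastery profiles
prior = {a: 1.0 / len(profiles) for a in profiles}  # uniform prior over profiles

def p_correct(profile, j):
    masters_all = all(profile[k] >= Q[j][k] for k in range(K))
    return 1 - slip[j] if masters_all else guess[j]

# Posterior over attribute profiles given the response pattern (Bayes rule).
joint = {}
for a in profiles:
    likelihood = 1.0
    for j, x in enumerate(response):
        p = p_correct(a, j)
        likelihood *= p if x == 1 else 1 - p
    joint[a] = prior[a] * likelihood
total = sum(joint.values())
posterior = {a: v / total for a, v in joint.items()}

# Marginal mastery probabilities: the per-attribute results a CDM score report communicates.
for k in range(K):
    p_mastery = sum(p for a, p in posterior.items() if a[k] == 1)
    print(f"P(attribute {k + 1} mastered) = {p_mastery:.2f}")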

This session will include four papers on CDMs and communicating CDM-based results . Laine Bradshaw and Roy Levy outline the challenges of reporting results from CDMs and provide context for subsequent papers . Tasmin Dhaliwal, Tracey Hembry and Laine Bradshaw provide empirical evidence of teacher interpretation and preference for viewing mastery probabilities and classification results, in an online reporting environment . Kristen DiCerbo and Jennifer Kobrin share findings on how to present learning progression-based assessment results to teachers to support their instructional decision-making . Valerie Shute and Diego Zapata-Rivera model (using Bayes nets) and visualize students’ beliefs in flexible belief networks .

Interpreting Examinee Results from Classification-Based Models Laine Bradshaw (2015 Jason Millman Promising Measurement Scholar Award Winner), University of Georgia; Roy Levy, Arizona State University

Achieving the Promise of CDMs: Communicating CDM-Based Assessment Results Tasmin Dhaliwal, Pearson; Tracey Hembry, Alpine Testing Solutions; Laine Bradshaw, University of Georgia

Communicating Assessment Results Based on Learning Progressions Kristen DiCerbo and Jennifer Kobrin, Pearson

Representing and Visualizing Beliefs Valerie Shute, Florida State University; Diego Zapata-Rivera, Educational Testing Service


Sunday, April 10, 2016 4:35 PM - 6:05 PM, Meeting Room 16, Meeting Room Level, Coordinated Session, H3

Overhauling the SAT: Using and Interpreting Redesigned SAT Scores Session Chair: Maureen Ewing, College Board Session Discussant: Suzanne Lane, University of Pittsburgh

In February of 2013, the College Board announced it would undertake a major redesign of the SAT® with the intent of making the test more transparent and useful . The redesigned test will assess skills, knowledge, and understandings that matter most for college and career readiness . Only relevant vocabulary (as opposed to the sometimes criticized obscure vocabulary measured today) will be assessed . The Math section will be focused on a smaller number of content areas . The essay will be optional . There will be a switch to rights-only scoring . The total score scale will revert back to the original 400 to 1600, and there will be several cross-test scores and subscores . At the same time, scores on the redesigned assessment are expected to continue to meaningfully predict success in college and serve as a reliable indicator of college and career readiness .

Throughout the redesign effort, many important research questions emerged such as: (1) How can we be sure the content on the redesigned test measures what is most important for college and career readiness? (2) How can we develop concordance tables to relate scores on the redesigned assessment to current scores? (3) How do we define and measure college and career readiness? (4) How well can we expect scores on the redesigned assessment to predict first-year college grades?

The purpose of this session is to describe the research the College Board has done to support the launch of the redesigned SAT . The session will begin with a brief overview of the changes to the SAT with a focus on how these changes are intended to make the test more transparent and useful . Four papers will follow that describe more specifically the test design and content validity argument for the new test, the development and practical implications of producing and delivering concordance tables, the methodology used to develop and validate college and career readiness benchmarks for the new test and, lastly, early results about the relationship between scores on the redesigned SAT and college grades gathered from a special, non-operational study . The discussant, Suzanne Lane, who is a nationally-renowned expert on assessment design and validity research, will offer constructive comments on the fundamental ideas, approaches, and designs undergirding the research presentations .

An Overview of the Redesigned SAT Jack Buckley, College Board

The Redesigned SAT: Content Validity and Assessment Design Sherral Miller and Jay Happel, College Board

Producing Concordance Tables for the Transition to the Redesigned SAT Pamela Kaliski, Rosemary Reshetar, Tim Moses, Hui Deng and Anita Rawls, College Board

College and Career Readiness and the Redesigned SAT Benchmarks Jeff Wyatt and Kara Smith, College Board

A First Look at the Predictive Validity of the Redesigned SAT Emily Shaw, Jessica Marini, Jonathan Beard and Doron Shmueli, College Board


Sunday, April 10, 2016 4:35 PM - 6:05 PM, Meeting Room 3, Meeting Room Level, Coordinated Session, H4

Quality Assurance Methods for Operational Automated Scoring of Essays and Speech Session Discussant: Vincent Kieftenbeld, Pacific Metrics

The quality of current automated scoring systems is increasingly comparable with or even surpassing that of trained human raters . Ensuring score validity in automated scoring, however, requires sophisticated quality assurance methods both during the design and training of automated scoring models, as well as during operational automated scoring . The four studies in this coordinated session present novel quality assurance methods for use in operational automated scoring of essay and speech responses . A common theme unifying these studies is the development of techniques to screen responses during operational scoring . A wide variety of methods is used, ranging from ensemble learning and outlier detection to information retrieval and natural language processing and identification . This session complements the session Challenges and solutions in the operational use of automated scoring systems which focuses on quality assurance during the design and training phases of automated scoring .

Statistical High-Dimensional Outlier Detection Methods to Identify Abnormal Responses in Automated Scoring Raghuveer Kanneganti, Data Recognition Corporation CTB; Luyao Peng, University of California, Riverside

Does Automated Speaking Response Scoring Favor Speakers of Certain First Language? Guangming Ling and Su-Youn Yoon, Educational Testing Service

Feature Development for Scoring Source-Based Essays Claudia Leacock, McGraw-Hill Education CTB; Raghuveer Kanneganti, Data Recognition Corporation CTB

Non-Scorable Spoken Response Detection Using NLP and Speech Processing Techniques Su-Youn Yoon, Educational Testing Service

Sunday, April 10, 2016 4:35 PM - 6:05 PM, Meeting Room 4, Meeting Room Level, Paper Session, H5

Student Growth Percentiles Session Discussant: Damian Betebenner, Center for Assessment

The Accuracy and Fairness of Aggregate Student Growth Percentiles as Indicators of Educator Performance Jason Millman Promising Measurement Scholar Award Winner 2016: Katherine Furgol Castellano; Daniel McCaffrey and J.R. Lockwood, ETS

Aggregated SGPs (AGPs), the mean or median SGP of the students linked to the same teacher or school, are a popular alternative to VAM-based measures of educator performance. However, we demonstrate that test score measurement error affects the accuracy and precision of typically used AGPs. We also contrast standard AGPs against several alternative AGP estimators.

Cluster Growth Percentiles: An Alternative to Aggregated Student Growth Percentiles Scott Monroe, UMass Amherst; Li Cai, CRESST/UCLA

Aggregates of Student Growth Percentiles (Betebenner, 2009) are used by numerous states for purposes of teacher evaluation. In this research, we propose an alternative statistic, a Cluster Growth Percentile, defined directly at the group or cluster level. The two approaches are compared, and simulated and empirical examples are provided.

Evaluating Student Growth Percentiles: Perspective of Test-Retest Reliability Johnny Denbleyker, Houghton Mifflin Harcourt; Ye Lin, University of Iowa

This study examines SGP calculations and corresponding NCEs where multiple test opportunities existed within the accountability testing window for an NCLB mathematics assessment. This allowed aspects of reliability to be assessed in a practical test-retest manner while accounting for measurement error associated with both sampling of items and occasions.

Sunday, April 10, 2016 4:35 PM - 6:05 PM, Meeting Room 5, Meeting Room Level, Paper Session, H6

Equating: From Theory to Practice Session Discussant: Ye Tong, Pearson

Similarities Between Equating Equivalents Using Presmoothing and Postsmoothing Hyung Jin Kim and Robert Brennan, The University of Iowa

Presmoothing and postsmoothing improve equating by reducing sampling error. However, little research has examined the similarity of equated equivalents obtained under presmoothing and postsmoothing. This study examines how equated equivalents differ between presmoothing and postsmoothing at different smoothing degrees, and investigates which presmoothing degrees give results similar to a specific postsmoothing degree.

Stability of IRT Calibration Methods for the Common-Item Nonequivalent Groups Equating Design Yujin Kang and Won-Chan Lee, University of Iowa

The purpose of this study is to investigate accumulated equating error for item response theory (IRT) calibration methods under the common-item nonequivalent groups (CINEG) design. The factors investigated are calibration method, equating method, type of change in the ability distribution, common-item composition, and calibration software.

Subscore Equating and Reporting Euijin Lim and Won-Chan Lee, The University of Iowa

The purpose of this study is to address the necessity of subscore equating in terms of score profiles using real data sets and to discuss related practical issues. In addition, the performance of several equating methods for subscores is compared under various conditions using simulation techniques.

On the Effect of Varying Difficulty of Anchor Tests on Equating Accuracy Irina Grabovsky and Daniel Jurich, NBME

This study investigates the optimal location of an anchor test for equating minimum competency examinations. For examinations in which the means of the examinee ability and item difficulty distributions are far apart, placing the anchor test near the examinee ability mean results in a more accurate equating procedure.

Sunday, April 10, 2016 4:35 PM - 6:05 PM, Meeting Room 15, Meeting Room Level, Paper Session, H7

Issues in Ability Estimation and Scoring Session Discussant: Peter van Rijn

Practical and Policy Impacts of Ignoring Nested Data Structures on Ability Estimation Kevin Shropshire, Virginia Tech; Yasuo Miyazaki, Virginia Tech

Consistent with the literature, the standard errors corresponding to item difficulty parameters are underestimated when clustering is part of the design but ignored in the estimation process. This research extends the focus to the impact of design clustering on ability estimation in IRT models, with implications for psychometricians and policy makers.

MIRT Ability Estimation: Effects of Ignoring the Partially Compensatory Nature Janine Buchholz and Johannes Hartig, German Institute for International Educational Research (DIPF); Joseph Rios, Educational Testing Service (ETS)

The MIRT model most commonly employed to estimate within-item multidimensionality is compensatory. However, numerous examples in educational testing suggest partially compensatory relations among dimensions. We therefore investigated conditional bias in theta estimates when incorrectly applying the compensatory model. Findings demonstrate systematic underestimation for examinees highly proficient in one dimension.

Interval Estimation of Scale Scores in Item Response Theory Yang Liu, University of California, Merced; Ji Seung Yang, University of Maryland, College Park

In finite samples, the uncertainty arising from item parameter estimation is often non-negligible and must be accounted for when calculating latent variable scores. Various Bayesian, fiducial, and frequentist interval estimators are harmonized under the framework of consistent predictive inference, and their performances are evaluated via Monte Carlo simulations.

Applying the Hajek Approach in the Delta Method of Variance Estimation Jiahe Qian, Educational Testing Service

The variance formula derived by the delta method for a two-stage sampling design employs the joint inclusion probabilities from the first-stage selection of schools. This inquiry aims to apply the Hajek approximation to estimate the joint probabilities, which are often unavailable in analysis. The application is illustrated with real and simulated data.

2016 Bradley Hanson Award for Contributions to Educational Measurement: Sun-Joo Cho

Sunday, April 10, 2016 4:35 PM - 6:05 PM, Mount Vernon Square, Meeting Room Level, Electronic Board Session, Paper Session, H8

Electronic Board #1 Asymmetric ICCs as an Alternative Approach to Accommodate Guessing Effects Sora Lee and Daniel Bolt, University of Wisconsin, Madison

Both the statistical and interpretational shortcomings of the three-parameter logistic (3PL) model in accommodating guessing effects are well documented (Han, 2012). We consider the use of a residual heteroscedasticity model (Molenaar, 2014) as an alternative, and compare its performance to that of the 3PL with real test datasets and through simulation analyses.

Electronic Board #2 Software Note for PARSCALE Ying Lu, John Donoghue and Hanwook Yoo, Educational Testing Service

PARSCALE is one of the most popular commercial software packages for IRT calibration. PARSCALE users, however, should be aware of the issues associated with the software to ensure the quality of IRT calibration results. The purpose of this paper is to summarize these issues and to suggest solutions.

Electronic Board #3 Stochastic Approximation EM for Exploratory Item Factor Analysis Eugene Geis and Greg Camilli, Rutgers Graduate School of Education

We present an item parameter estimation method combining stochastic approximation and Gibbs sampling for exploratory multivariate IRT analyses. It is characterized by drawing a missing random variable, updating post-burn-in sufficient statistics of missing data using the Robbins-Monro procedure, estimating factor loadings using a novel approach, and drawing samples of latent ability.

Electronic Board #4 Reporting Student Growth Percentiles: A Novel Tool for Displaying Growth David Swift and Sid Sharairi, Houghton Mifflin Harcourt

The increased use of growth models has created a need for tools that help policy makers with growth decisions and inform stakeholders. The data tool presented meets this need through a feature-rich, user-friendly application that puts the policy maker in control.

Electronic Board #5 The Impact of Plausible Values When Used Incorrectly Kyung Sun Chung, Pennsylvania State University

This study examined the effect of plausible values when used incorrectly, such as using only one of the five values provided or using the average of the five plausible values. Two previously published studies are replicated for practical relevance. The results show that appropriate use of plausible values is needed to obtain unbiased estimates.

Electronic Board #6 Missing Data – on How to Avoid Omitted and Not-Reached Items Miriam Hacker, Frank Goldhammer and Ulf Kröhne, German Institute for International Educational Research (DIPF)

Missing data are common in almost all measurements. In this study, we examine the occurrence of missing data and how to avoid it by presenting more time information at the item level. Results indicate that time information can reduce missing responses without affecting performance.

Electronic Board #7 Challenging Measurement in the Field of Multicultural Education: Validating a New Scale Jessie Montana Cain, University of North Carolina at Chapel Hill

Measurement in the field of multicultural education has been scarce. In this study, the psychometric properties of the newly developed Multicultural Teacher Capacity Scale were examined. The MTCS is a reliable and valid measure of multicultural teacher capacity for samples that mirror the development sample.

Electronic Board #8 Automated Test Assembly Methods Using Monte-Carlo-Based Linear-On-The-Fly (LOFT) Techniques John Weiner and Gregory Hurtz, PSI Services LLC

Monte-Carlo-based linear-on-the-fly techniques of automated test assembly offer a number of advantages toward the goals of exam security, exam form equivalence, and efficiency in examination development activities. Classical test theory and Rasch/IRT approaches are compared, and issues of statistical sampling and analysis are discussed.

Electronic Board #9 DIF Related to Test Takers’ Culture Background and Language Proficiency Jinghua Liu, Secondary School Admission Test Board; Tim Moses, College Board

This study examines DIF from the perspective of test takers' cultural background by using operational data from a standardized admission test. We recommend that testing programs with a large portion of test takers from different regions and cultural backgrounds add region/culture DIF to their routine DIF screening.

Electronic Board #10 Can a Two-Item Essay Test Be Reliable and Valid? Brent Bridgeman and Donald Powers, Educational Testing Service

Psychometricians have long complained that a two-item essay test cannot be reliable and valid for predicting academic outcomes compared to a multiple-choice test (e.g., Wainer & Thissen, 1993). Recent evidence from predictive validity studies of Verbal Reasoning and Analytical Writing GRE scores challenges this point of view.

Electronic Board #11 Selecting Automatic Scoring Features Using Criticality Analysis Han-Hui Por and Anastassia Loukina, Educational Testing Service

We apply the criticality analysis approach to select features in the automatic scoring of spoken responses in a language assessment. We show that this approach addresses issues of sample dependence and bias, and identifies salient features that are critical in improving model validity.

Electronic Board #12 A Meta-Analysis of the Predictive Validity of Graduate Management Admission Test Haixia Qian, Kim Trang and Neal Martin Kingston, University of Kansas

The purpose of the meta-analysis was to assess the Graduate Management Admission Test (GMAT) and undergraduate GPA (UGPA) as predictors of business school performance. Results showed both the GMAT and UGPA were significant predictors, with the GMAT being the stronger predictor of the two.

Electronic Board #13 A Fully Bayesian Approach to Smoothing the Linking Function in Equipercentile Equating Zhehan Jiang and William Skorupski, University of Kansas

A fully Bayesian parametric method for robustly estimating the linking function in equipercentile equating is introduced, explicated, and evaluated via a Monte Carlo simulation study.

Electronic Board #14 Conducting a Post-Equating Check to Detect Unstable Items on Pre-Equated Tests Keyin Wang, Michigan State University; Wonsuk Kim and Louis Roussos, Measured Progress

Pre-equated tests are increasingly common. Every item is assumed to behave in a stable manner. Thus, "post-equating" checks need to be conducted to detect and correct problematic items. Little research has been conducted directly on this topic. This study proposes possible procedures and begins to evaluate them.

Electronic Board #15 An Evaluation of Methods for Establishing Crosswalks Between Instruments Mark Hansen, University of California, Los Angeles

In this study, we evaluate several approaches for obtaining projections (or crosswalks) between instruments measuring related but somewhat distinct constructs. Methods utilizing unidimensional and multidimensional item response theory models are compared. We examine the impact of test length, correlation between constructs, and sample characteristics on the quality of the projection.

Electronic Board #16 Exploration of Factors Affecting the Necessity of Reporting Test Subscores Xiaolin Wang, Dubravka Svetina and Shenghai Dai, Indiana University, Bloomington

Interest in reporting test subscores for diagnostic purposes has been growing rapidly. This simulation study examined factors (correlation between subscales, number of items per subscale, test complexity, and item parameter distribution) that affect the necessity of reporting subscores within the classical test theory framework.

Electronic Board #17 Evaluation of Psychometric Stability of Generated Items Yu-Lan Su, Tingting Chen and Jui-Sheng Wang, ACT, Inc.

The study investigated the psychometric stability of generated items using operational data. The generated items were compared to their parent items on classical item statistics, DIF, raw response distributions to the key, and IRT parameters. The empirical evidence will serve as groundwork for the growing applications of item generation.

Electronic Board #18 Creating Parallel Forms with Small Samples of Examinees Lisa Keller, University of Massachusetts Amherst; Rob Keller, Measured Progress; Andrea Hebert, Bottom Line Technologies

This study investigates using item-specific priors in item calibration to assist in the creation of parallel forms in the presence of small samples of examinees. Results indicate that while the item parameters may still contain error, classification of examinees into performance categories might be improved using the method.

Electronic Board #19 Higher-Order G-DINA Model for Polytomous Attributes Qin Yi and Tao Yang, Faculty of Education, Beijing Normal University; Tao Xin and Lou Liu, School of Psychology, Beijing Normal University

The G-DINA model for polytomous attributes (Jinsong Chen, 2013), which accounts for attribute levels, can provide additional diagnostic information. By incorporating a higher-order structure, it can provide more fine-grained attribute information as well as a macro-level ability expression linked to IRT, which also increases the sensitivity of classification.

Electronic Board #20 New Search Algorithm for Q-matrix Validation Ragip Terzi, Rutgers, The State University of New Jersey; Jimmy de la Torre, Rutgers University

The validity of a constructed Q-matrix in cognitive diagnosis modeling has attracted significant attention because of the possibility of attribute misspecification, which can result in model-data misfit and, ultimately, attribute misclassification. The current study proposes a new method for Q-matrix validation. The results are also compared to those of other parametric and nonparametric methods.

Electronic Board #21 Generalized DCMs for Option-Based Scoring Oksana Naumenko, Yanyan Fu and Robert Henson, The University of North Carolina at Greensboro; Bill Stout, University of Illinois at Urbana-Champaign; Lou DiBello, University of Illinois at Chicago

A recently proposed family of models, the Generalized Diagnostic Classification Models for Multiple-Choice Option-Based Scoring (GDCM-MC), extracts information about examinee cognitive processing from all MC item options. This paper describes a set of simulation studies, with factors such as test length and number of options, that examine model performance.

Electronic Board #22 Evaluating Sampling Variability and Measurement Precision of Aggregated Scores in Large-Scale Assessment Xiaohong Gao and Rongchun Zhu, ACT, Inc.

The study demonstrates how to conceptualize sources of measurement error and estimate sampling variability and reliability in large-scale assessment of educational quality. One international and one domestic assessment data set are used to shed light on potential sources of measurement uncertainty and improvement of measurement precision for aggregated scores.

Electronic Board #23 The Model for Dichotomously-Scored Multiple-Attempt Multiple-Choice Items Igor Himelfarb and Katherine Furgol Castellano, Educational Testing Service (ETS); Guoliang Fang, Penn State University

This paper proposes a model for dichotomously-scored, multiple-attempt, multiple-choice item responses that may occur in scaffolded assessments. Assuming a 3PL IRT model, simulations were conducted using MCMC Metropolis-Hastings to recover the generated parameters. Results indicate that the best recovery was for item parameters of low and moderate difficulty and discrimination.

Electronic Board #24 Classical Test Theory Embraces Cognitive Load Theory: Measurement Challenges Keeping It Simple Charles Secolsky, Mississippi Department of Education; Eric Magaram, Rockland Community College

The measurement community is challenged by advances in educational technology and psychology. On a basic level, classical test theory is used as a measurement model for understanding cognitive load theory and the influence that cognitive load has on test validity. The greater the germane cognitive load, the greater the true score.

Sunday, April 10, 2016 6:30 PM - 8:00 PM, Renaissance West B, Ballroom Level

President’s Reception By Invitation Only

Annual Meeting Program - Monday, April 11, 2016

Monday, April 11, 2016 5:45 AM - 7:00 AM

NCME Fitness Run/Walk Session Organizers: Katherine Furgol Castellano, ETS; Jill R. van den Heuvel, Alpine Testing Solutions

Start your morning with NCME's annual 5K Walk/Run in Potomac Park. Meet in the lobby of the Renaissance Washington, DC Downtown Hotel at 5:45 AM. Pre-registration is required. Pick up your bib number and t-shirt at the NCME Information Desk in the hotel anytime prior to race day. Transportation will be provided. (Additional registration fee required)

The event is made possible through the sponsorship of:

National Center for the Improvement of Educational Assessment, Inc.

Measurement, Inc.

College Board

ACT

American Institutes for Research

Graduate Management Admission Council

Educational Testing Service

Pearson Educational Measurement

Houghton Mifflin Harcourt

Law School Admission Council

Applied Measurement Professionals, Inc.

WestEd

HumRRO

Monday, April 11, 2016 8:15 AM - 10:15 AM, Meeting Room 13/14, Meeting Room Level, Invited Session, I1

NCME Book Series Symposium: Technology and Testing Session Editor: Fritz Drasgow, University of Illinois at Urbana-Champaign Session Chair: Randy Bennett, ETS

This symposium draws on Technology and Testing: Improving Educational and Psychological Measurement, a recently published volume in the new NCME Book Series. The volume probes the remarkable opportunities for innovation and progress that have resulted from the convergence of advances in technology, measurement, and the cognitive and learning sciences. The book documents many of these new directions and provides suggestions for numerous further advances. It seems safe to predict that testing will be dramatically transformed over the next few decades – paper test booklets with opscan answer sheets will soon be as outdated as computer punch cards.

The book is divided into four sections, each with several chapters and a section commentator. For purposes of this symposium, one chapter author per section will present his or her chapter in some depth, followed by the section commentator, who will briefly review each of the other chapters in the section. The symposium offers the measurement community a unique opportunity to learn about how technology will help to transform assessment practices and the challenges that transformation is already posing and will continue to present.

Issues in Simulation-Based Assessment Brian Clauser and Melissa Margolis, National Board of Medical Examiners; Jerome Clauser, American Board of Internal Medicine; Michael Kolen, University of Iowa Commentator: Stephen Sireci, University of Massachusetts, Amherst

Using Technology-Enhanced Processes to Generate Test Items in Multiple Languages Mark Gierl, Hollis Lai, Karen Fung and Bin Zheng, University of Alberta Commentator: Mark Reckase, Michigan State University

Increasing the Accessibility of Assessments through Technology Elizabeth Stone, Cara Laitusis and Linda Cook, ETS Commentator: Kurt Geisinger, University of Nebraska, Lincoln

From Standardization to Personalization: The Comparability of Scores Based on Different Testing Conditions, Modes, and Devices Walter Way, Laurie Davis, Leslie Keng and Ellen Strain-Seymour, Pearson Commentator: Edward Haertel, Stanford University

Monday, April 11, 2016 8:15 AM - 10:15 AM, Meeting Room 8/9, Meeting Room Level, Coordinated Session, I2

Exploring Various Psychometric Approaches to Report Meaningful Subscores Session Discussant: Li Cai, University of California, Los Angeles

The impetus for this session came directly from needs and concerns expressed by score users of K-12 large-scale assessments aligned to the Common Core State Standards (CCSS). Subscores, also called domain scores (such as Reading, Listening, and Writing in an English language arts test), and subdomain scores based on detailed content standards nested within a domain are reported in assessments. As the CCSS have been adopted by many states, educators and parents need both domain and subdomain information from state accountability tests to (1) explain a student's performance in certain content areas, (2) evaluate the effects of teaching and learning practices in the classroom, and (3) investigate the impact of CCSS implementation. However, the use of subscores has been criticized for low reliability (Thissen & Wainer, 2001) and little added value when correlations among subscores are high (Sinharay, 2010). In online-adaptive testing, traditional observed subscores are usually not meaningful because students respond to different items at different difficulty levels, which renders the subscores incomparable across students. Furthermore, in an online-adaptive testing format, each student usually receives only a few items from the core content-related subdomain units, so student-level subdomain scores are unlikely to be reliable. However, when such information is aggregated across many students at the school level, it may be meaningful.

The issues of reporting subscores in K-12 CCSS-aligned assessments are addressed through four different approaches, from both theoretical and empirical perspectives. Our studies show that reliabilities can be improved and additional information can be provided to test users even in the online-adaptive testing setting. The first study presents results from a residual analysis of subscores, which has been widely applied in statewide assessments; its advantages, limitations, and possible improvements are also discussed. The second study uses a mixture of item response theory (IRT) and a higher-order cognitive diagnostic model (HO-DINA) to produce attribute classification profiles as an alternative to traditional subscores, along with general ability scores. The third study proposes a Multilevel Testlet (MLT) item factor model to produce school-level, instructionally meaningful subscores. The fourth study incorporates collateral information through a fully Bayesian approach to report more reliable subscores. This panel of studies provides insight into subscores from multiple approaches and from both within- and across-methodology perspectives. We hope this session can enrich the literature and methodology on subscore reporting and also support the production of meaningful diagnostic information for teaching and learning.

Using Residual Analysis to Report Subscores in Statewide Assessments Jon Cohen, American Institutes for Research

Applying a Mixture of IRT and HO-DINA Models in Subscore Reporting Likun Hou, Educational Testing Service; Yan Huo, Educational Testing Service; Jimmy de la Torre, Rutgers University

Multilevel Testlet Item Factor Model for School-Level Instructionally-Meaningful Subscores Megan Kuhfeld, University of California, Los Angeles

Incorporating Collateral Information and Fully Bayesian Approach for Subscores Reporting Yi Du, Educational Testing Service; Shuqin Tao, Curriculum Associates; Feifei Li, Educational Testing Service

Monday, April 11, 2016 8:15 AM - 10:15 AM, Meeting Room 3, Meeting Room Level, Coordinated Session, I3

From Items to Policies: Big Data in Education Session Discussant: Zachary Pardos, School of Information and Graduate School of Education, UC Berkeley

Data are woven into every sector of the global economy (McGuire et al., 2012), including education. As technology and analytics improve, the use of big data to derive insights that lead to system improvements is growing rapidly. The purpose of this panel is to share a collection of promising approaches for analyzing and leveraging big data in a wide range of education contexts. Each contribution is an application of machine learning, computer science, and/or statistical techniques to an education issue or question in which expert judgment would be costly, impractical, or otherwise hampered by the magnitude of the problem. We focus on the novel application of big data to address questions of construct validity for assessments; inferences about student abilities and learning needs when data are sparse or unstructured; decisions about course structure; and public sentiment about specific education policies. The ultimate goal for the use of big data and the application of these methods is to improve outcomes for learners. We conclude the session with lessons learned from the application of these methods to research questions across a broad spectrum of education issues, noting strengths and limitations.

What and When Students Learn: Q-Matrices and Student Models from Longitudinal Data José González-Brenes, Center for Digital Data, Analytics & Adaptive Learning, Pearson

Misconceptions Revealed Through Error Responses Thomas McTavish, Center for Digital Data, Analytics and Adaptive Learning, Pearson

Beyond Subscores: Mining Student Responses for Diagnostic Information William Lorié, Center for NextGen Learning & Assessment, Pearson

Mining the Web to Leverage Collective Intelligence and Learn Student Preferences Kathy McKnight, Center for Educator Learning & Effectiveness, Pearson; Antonio Moretti and Ansaf Salleb-Aouissi, Center for Computational Learning Systems, Columbia University; José González-Brenes, Center for Digital Data, Analytics & Adaptive Learning, Pearson

The Application of Sentiment and Topic Analysis to Teacher Evaluation Policy Antonio Moretti and Ansaf Salleb-Aouissi, Center for Computational Learning Systems, Columbia University; Kathy McKnight, Center for Educator Learning & Effectiveness, Pearson

Monday, April 11, 2016 8:15 AM - 10:15 AM, Meeting Room 4, Meeting Room Level, Coordinated Session, I4

Methods and Approaches for Validating Claims of College and Career Readiness Session Chair: Thanos Patelis, Center for Assessment Session Discussant: Michael Kane, Educational Testing Service

The focus on college and career readiness has penetrated all aspects and segments of education, as well as economic and political rhetoric. Testing organizations, educational organizations, states, and institutions of higher education have made claims of college and career readiness. New large-scale assessments have been launched, and historic assessments used for college admissions and placement are being revised to represent current claims of college and career readiness. Validation evidence to substantiate these claims is important and expected (AERA, APA, & NCME, 2014). This session will involve four presentations by active participants and contributors in the conceptualization, design, and implementation of validation studies. Each presentation will offer a validation framework and specific suggestions, recommendations, and examples of methodologies for undertaking the validation of these claims of college and career readiness. Concrete suggestions will be provided. A fifth presenter will offer comments about the presentations and also provide additional recommendations and insights.

Are We Ready for College and Career Readiness? Stephen Sireci, University of Massachusetts-Amherst

Validating Claims for College and Career Readiness with Assessments Used for Accountability Wayne Camara, ACT

Moving Beyond the Rhetoric: Urgent Call for Empirically Validating Claims of College-And-Career-Readiness Catherine Welch and Stephen Dunbar, University of Iowa

Some Concrete Suggestions and Cautions in Evaluating/Validating Claims of College Readiness Thanos Patelis, Center for Assessment

Monday, April 11, 2016 8:15 AM - 10:15 AM, Renaissance West A, Ballroom Level, Invited Session, I5

Recent Advances in Quantitative Social Network Analysis in Education Presenters: Tracy Sweet, University of Maryland Qiwen Zheng, University of Maryland Mengxiao Zhu, ETS Sam Adhikari, Carnegie Mellon University Beau Dabbs, Carnegie Mellon University I-Chien Chen, Michigan State University

Social network data are becoming increasingly common in education research, and the purpose of this symposium is both to summarize current research on social network methodology and to showcase how these methods can address substantive research questions in education and promote ongoing education research. Each presentation introduces exciting, cutting-edge methodological research focusing on different aspects of social network analysis that will be of interest to both methodologists and education researchers.

The session will begin with an introduction by Tracy Sweet, followed by several methodological talks showcasing exciting new research. Mengxiao Zhu will describe new ways to analyze network data from students' learning and problem-solving processes. Qiwen Zheng will discuss a model for multiple networks that focuses on subgroup integration. Sam Adhikari will discuss a longitudinal model that illustrates how network structure changes over time, and I-Chien Chen will also introduce new methods for multiple time points but will focus on how changes over time are related to changes in other outcomes. Finally, Beau Dabbs will discuss model selection methods.

Monday, April 11, 2016 8:15 AM - 10:15 AM, Meeting Room 15, Meeting Room Level, Paper Session, I6

Issues in Automated Scoring Session Discussant: Shayne Miel, Turnitin

Modeling the Global Text Features for Enhancing the Automated Scoring System Syed Muhammad Fahad Latifi and Mark Gierl, University of Alberta

We will introduce and demonstrate the innovative modeling of global text features for enhancing the performance of an automated essay scoring (AES) system. Representative datasets from PARCC and Smarter Balanced states were used. The results suggest that global text modeling consistently outperformed two state-of-the-art commercial AES systems.

Discretization of Scores from an Automated Scoring Engine Using Gradient Boosted Machines Scott Wood, Pacific Metrics Corporation

In automated scoring engines using linear regression models, it is common to convert the continuous predicted scores into discrete scores for reporting. A recent study shows that special care must be taken when converting continuous predicted scores from gradient boosted machine modeling into discrete scores.

Automated Scoring of Constructed Response Items Measuring Computational Thinking Daisy Rutstein, John Niekrasz and Eric Snow, SRI International

Increasingly, assessments contain constructed response items to measure hard-to-assess inquiry- and design-based concepts. These types of item responses are challenging to score reliably and efficiently. This paper discusses the adaptation of an automated scoring engine for scoring responses on constructed response items measuring computational thinking.

Automated Scoring of Complex Technology-Enhanced Tasks in a Middle School Science Unit Samuel Crane, Aaron Harnly, Malorie Hughes and John Stewart, Amplify

We show how complex user-interaction data from a Natural Selection app can be auto-scored using several methods. We estimate validity using a comparative analysis of content-expert ratings, evidence rule scoring, and a machine learning approach. The machine learning approaches are shown to agree with expert human scoring.

Comparison of Human Rater and Automatic Scoring on Students’ Ability Estimation Zhen Wang, Educational Testing Service (ETS); Lihua Yao, DoD Data Center; Yu Sun

The purpose is to compare human rater scoring with automatic scoring in terms of examinees' ability estimation using an IRT-based rater model. Each speaking item is analyzed with IRT models both without and with rater effects. The effects of different rating designs may substantially increase the bias in examinees' ability estimation.

Issues to Consider When Examining Differential Item Functioning in Essays Matthew Schultz, Jonathan Rubright and Aster Tessema, American Institute of Certified Public Accountants

The development of Automated Essay Scoring has propelled the increasing use of writing in high-stakes assessments. To date, DIF is rarely considered in such contexts. Here, methods to assess DIF in essays and considerations for practitioners are reviewed, and results of an application from an operational testing program are discussed.

Monday, April 11, 2016 8:15 AM - 10:15 AM, Meeting Room 16, Meeting Room Level, Paper Session, I7

Multidimensional and Multivariate Methods Session Discussant: Irina Grabovsky, NBME

Information Functions of Multidimensional Forced-Choice IRT Models Seang-hwane Joo, Philseok Lee and Stephen Stark, University of South Florida

This paper aimed to develop the concept of information functions for multidimensional forced-choice IRT models and to demonstrate how statement parameters and test formats (pair, triplet, and tetrad) influence item and test information. The implications for constructing fake-resistant noncognitive measures are further discussed using information functions.

Investigating Reverse-Worded Matched Item Pairs Using the GPCM and NRM Ki Matlock, Oklahoma State University; Ronna Turner and Dent Gitchel, University of Arkansas

The GPCM is often used for polytomous data; however, the NRM allows for investigation of how adjacent categories may discriminate differently when items are positively or negatively worded. In this study, responses to reverse-worded items are analyzed using the two models, and the estimated parameters are compared.

Item Response Theory Models for Ipsative Tests with Polytomous Multidimensional Forced-Choice Items Xue-Lan Qiu and Wen-Chung Wang, The Hong Kong Institute of Education

Recent years have witnessed developments of IRT models for ipsative tests with dichotomous multidimensional forced-choice (MFC) items. In this study, we develop a new class of IRT models for polytomous MFC items. We conducted simulation studies in a variety of conditions to evaluate parameter recovery and provide an empirical example.

Multivariate Generalizability Theory and Conventional Approaches for Obtaining More Accurate Disattenuated Correlations Walter Vispoel, Carrie Morris and Murat Kilinc, University of Iowa

The standard approach for obtaining disattenuated correlations rests on assumptions easily violated in practice. We explore multiple methods for obtaining disattenuated correlations designed to limit introduction of bias due to assumption violations, including methods based on applications of multivariate generalizability theory and a conventional alternative to such methods.

Comparing a Modified Alpha Coefficient to Split-Half Approaches in the LOFT Framework Tammy Trierweiler, Law School Admission Council (LSAC); Charles Lewis, Educational Testing Service

In this study, the performance of a Modified Alpha coefficient was compared to split-half methods for estimating generic reliability in a LOFT framework. Simulations across different ability distributions, sample sizes, and ranges of item pool difficulties were considered, and results were compared to the corresponding theoretical population reliability.

Estimating Correlations Among School Relevant Categories in a Multidimensional Space Se-Kang Kim, Fordham University; Joseph Grochowalski, College Board

The current study estimates correlations between row and column categories in a multidimensional space. The contingency table being analyzed consists of New York school districts as row categories and school-relevant categories (e.g., attendance, safety, etc.) as column categories. To calculate correlations, the biplot paradigm (Greenacre, 2010) is utilized.

Monday, April 11, 2016 10:35 AM - 12:05 PM, Renaissance West A, Ballroom Level, Invited Session, J1

Hold the Presses! How Measurement Professionals Can Speak More Effectively with the Press and the Public (Education Writers Association Session) Session Chairs: Kristen Huff, ACT Laurie Wise, HumRRO, Emeritus Lori Crouch, EWA Session Panelists: Caroline Hendrie, EWA David Hoff, Hager Sharp Andrew Ho, Harvard Graduate School of Education Anya Kamenetz, NPR Sarah Sparks, Education Week

How can members of the press help advance the assessment literacy of the general public? Could we have communicated better about the Common Core State Standards? Please join NCME for a panel session sponsored jointly with the Education Writers Association (EWA), the professional organization of journalists who cover education. In this panel discussion, EWA Executive Director Caroline Hendrie will lead a conversation with journalists and academics about the role of measurement experts and the press in the modern media era, with all its political polarization, sound bites, Twitter hashtags, and quotes on deadline. Approximately half the session will be reserved for audience questions and answers, so please take advantage of this unique opportunity to discuss how we can improve our communication about educational measurement.

Monday, April 11, 2016 10:35 AM - 12:05 PM, Meeting Room 8/9, Meeting Room Level, Coordinated Session, J2

Challenges and Solutions in the Operational Use of Automated Scoring Systems Session Chair: Su-Youn Yoon Session Discussant: Klaus Zechner, ETS

An automated scoring system can assess constructed responses faster than human raters and at a lower cost. These advantages have prompted a strong demand for high-performing automated scoring systems for various applications. However, even state-of-the-art automated scoring systems face numerous challenges to their use in operational testing programs. This session will discuss four important issues that may arise when automated scoring systems are used in operational tests: features vulnerable to subgroup bias, accommodations for test takers with disabilities, the development of new tests using a novel input type, and the addition of automated scoring to ongoing operational testing programs previously based only on human scoring. These issues may be associated with problems that cause aberrant performance of automated scoring systems and weaken the validity of automated scores. Moreover, adding machine scoring to prior all-human scoring may change the score distribution and make it difficult to interpret and maintain the reported scale. We will analyze problems associated with these issues and provide solutions. This session will demonstrate the importance of considering validity issues at the initial stage of automated scoring system design in order to overcome these challenges.

Fairness in Automated Scoring: Screening Features for Subgroup Differences Ji An, University of Maryland; Vincent Kieftenbeld and Raghuveer Kanneganti, McGraw-Hill Education CTB

Use of Automated Scoring in Language Assessments for Candidates with Speech Impairments Heather Buzick, Educational Testing Service; Anastassia Loukina, ETS

A Novel Automatic Handwriting Assessment System Built on Touch-Based Tablet Xin Chen, Ran Xu and Richard Wang, Pearson; Tuo Zhao, University of Missouri

Ensuring Scale Continuity in Automated Scoring Deployment in Operational Programs Jay Breyer, Shelby Haberman and Chen Li, ETS

Monday, April 11, 2016 10:35 AM - 12:05 PM, Meeting Room 3, Meeting Room Level, Coordinated Session, J3

Novel Models to Address Measurement Errors in Educational Assessment and Evaluation Studies Session Chair: Kilchan Choi, CRESST/UCLA Session Discussant: Elizabeth Stuart, Johns Hopkins

Measurement error issues adversely affect results obtained from typical modeling approaches used to analyze data from assessment and evaluation studies. In particular, measurement error can weaken the validity of inferences from student assessment data, reduce the statistical power of impact studies, and diminish the ability of researchers to identify the causal mechanisms that lead to an intervention improving the desired outcome.

This symposium proposes novel statistical models to account for the impact of measurement error. The first paper proposes a multilevel two-tier item factor model with a latent change score parameterization in order to address the conditional exchangeability of participants that routinely accompanies analysis of multisite randomized experiments with pre- and posttests. The second paper examines the consequences of correcting for measurement error in value-added models, addressing the question of which teachers benefit more than others when measurement error is corrected. The third paper proposes a multilevel latent variable plausible values approach for more appropriately handling measurement error in predictors in multilevel modeling settings in which latent predictors are measured by observed categorical variables. The last paper proposes a three-level latent variable hierarchical model with a cluster-level measurement model, using a one-stage full-information estimation approach.

On the Role of Multilevel Item Response Models in Multisite Evaluation Studies Li Cai and Kilchan Choi, UCLA/CRESST; Megan Kuhfeld, UCLA

Consequence of Correcting Measurement Errors in Value-Added Models Kilchan Choi, CRESST/UCLA; Yongnam Kim, University of Wisconsin

Handling Error in Predictors Using Multiple-Imputation/Mcmc-Based Approaches: Sensitivity of Results to Priors Michael Seltzer, UCLA; Ji Seung Yang, University of Maryland

Three-Level Latent Variable Hierarchical Model with Level-2 Measurement Model Kilchan Choi and Li Cai, UCLA/CRESST; Michael Seltzer, UCLA

Monday, April 11, 2016 10:35 AM - 12:05 PM, Meeting Room 4, Meeting Room Level, Coordinated Session, J4

Mode Comparability Investigation of a CCSS-Based K-12 Assessment Session Chair: David Chayer, Data Recognition Corporation Session Discussant: Deborah Harris, ACT

The recent introduction of the Common Core State Standards and accountability legislation has brought extensive attention to online administration of K-12 large-scale assessments. In this coordinated session, a series of mode comparability investigations is carried out on a K-12 assessment that uses various item types, such as multiple-choice, technology-enhanced, and open-ended items, in order to test three major comparability hypotheses—same test factor structure, same measurement precision, and same score properties—by applying various methods. A presentation of the most recent trends in mode comparability studies on K-12 assessments will be followed by presentations of findings from the comparability investigations mentioned above. Finally, results from various equating methods are compared when a difference in difficulty exists between the two modes. This coordinated session will contribute to the measurement field by providing a summary of the most recent mode comparability studies, theoretical guidelines for mode comparability, and practical considerations for educators and practitioners.

Recent Trends of Mode Comparability Studies Jong Kim, ACT

Comparison of OLT and PPT Structure Karen Barton, Learning Analytics; Jungnam Kim, NBCE

Applying an IRT Method to Mode Comparability Dong-In Kim, Keith Boughton and Joanna Tomkowicz, Data Recognition Corporation; Frank Rijmen, AAMC

Equating When Mode Effect Exists Marc Julian, Dong-in Kim, Ping Wan and Litong Zhang, Data Recognition Corporation

Monday, April 11, 2016 10:35 AM - 12:05 PM, Meeting Room 16, Meeting Room Level, Paper Session, J5

Validating “Noncognitive”/Nontraditional Constructs II Session Discussant: Andrew Maul, University of California, Santa Barbara

Using Response Times to Enhance Scores on Measures of Executive Functioning Brooke Magnus, University of North Carolina at Chapel Hill; Michael Willoughby, RTI International; Yang Liu, University of California, Merced

We propose a novel response time model for the assessment of executive functioning in children transitioning from early to middle childhood. Using a model comparison approach, we examine the degree to which response times may be analyzed jointly with response accuracy to improve the precision and range of ability scores.

A Structural Equation Model Replication Study of Influences on Attitudes Towards Science Rajendra Chattergoon, University of Colorado, Boulder

This paper replicates and extends a structural equation model using data from the Trends in International Mathematics and Science Study (TIMSS). A similar latent factor structure was obtained using TIMSS 1995 and 2011 data, but some items loaded on multiple factors. Three models fit the data equally well, suggesting multiple interpretations.

Experimental Validation Strategies Using the Example of a Performance-Based ICT-Skills Test Lena Engelhardt and Frank Goldhammer, German Institute for International Educational Research; Johannes Naumann, Goethe University Frankfurt; Andreas Frey, Friedrich Schiller University Jena

Two experimental validation approaches are presented to investigate the construct interpretation of ability scores using the example of a performance-based ICT (information and communication technology) skills test. Construct-relevant task characteristics were manipulated experimentally, first, to change only the difficulty of items, and second, to also change the tapped construct.

Measuring Being Bullied in the Context of Racial and Religious DIF Michael Rodriguez, Kory Vue and Jose Palma, University of Minnesota

To address the measurement and relevance of novel constructs in education, a measure of being bullied is anticipated to exhibit DIF on items about the role of race and religion. The scale is recalibrated to account for DIF and compared vis-à-vis correlations, mean differences, and criterion-referenced levels of being bullied.

Monday, April 11, 2016 10:35 AM - 12:05 PM, Meeting Room 15, Meeting Room Level, Paper Session, J6

Differential Functioning - Theory and Applications Session Discussant: Catherine McClellan, Clowder Consulting

Using the Partial Credit Model to Investigate the Comparability of Examination Standards Qingping He and Michelle Meadows, Office of Qualifications and Examinations Regulation

This study explores the use of the Partial Credit Model (PCM) and differential step functioning (DSF) to investigate the comparability of standards in examinations that test the same subjects but are provided by different assessment providers. These examinations are used in the General Certificate of Secondary Education qualifications in England.

Handling Missing Data on DIF Detection Under the MIMIC Model Daniella Reboucas and Ying Cheng, University of Notre Dame

In detecting differential item functioning (DIF), mistreatment of missing data can inflate Type I error and lower power. This study examines DIF detection with the MIMIC model under three missing data mechanisms. Results suggest that the full information maximum likelihood method works better than multiple imputation in this case.

Properties of Matching Criterion and Its Effect on Mantel-Haenszel DIF Procedure Usama Ali, Educational Testing Service

This paper investigates the matching criterion used for the Mantel-Haenszel DIF procedure. The goal of this paper is to evaluate the robustness of DIF results under less-than-optimal conditions, as reflected in the number of items contributing to the criterion score, the number of score levels, and the criterion's reliability.

Impact of Differential Bundle Functioning on Test Performance of Focal Examinees Kathleen Banks, LEAD Public Schools; Cindy Walker, University of Wisconsin-Milwaukee

The purpose of this study was to apply the Walker, Zhang, Banks, and Cappaert (2012) effect size criteria to bundles that showed statistically significant differential bundle functioning (DBF) against focal groups in past DBF studies. The question was whether the bundles biased the mean total scores for focal groups.

Monday, April 11, 2016 10:35 AM - 12:05 PM, Meeting Room 5, Meeting Room Level, Paper Session, J7

Latent Regression and Related Topics Session Discussant: Matthias von Davier, ETS

Multidimensional IRT Calibration with Simultaneous Latent Regression in Large-Scale Survey Assessments Lauren Harrell and Li Cai, University of California, Los Angeles

Multidimensional item response theory models, estimated simultaneously with latent regression models using an adaptation of the Metropolis-Hastings Robbins-Monro algorithm, are applied to data from the National Assessment of Educational Progress (NAEP) Science and Mathematics assessments. The impact of dimensionality on parameter estimation and plausible values is investigated.

Single-Stage Vs. Two-Stage Estimation of Latent Regression IRT Models Peter van Rijn, ETS Global; Yasmine El Masri, Oxford University Centre for Educational Assessment

Item and population parameters of PISA 2012 data are compared between a single-stage and a two-stage approach. While item and population parameters remained similar, standard errors of population parameters were greater in the single-stage approach. Similar results were observed when fitting univariate and multivariate models. Practical implications are discussed.

Improving Score Precision in Large-Scale Assessments with the Multivariate Bayesian Lasso Steven Culpepper, Trevor Park and James Balamuta, University of Illinois at Urbana-Champaign

The multivariate Bayesian Lasso (MBL) was developed for high-dimensional regression models, such as the conditioning model in large-scale assessments (e.g., NAEP). Monte Carlo results document the gains in score precision achieved when employing the MBL model versus Bayesian models that assume a multivariate normal prior for regression coefficients.

Performance of Missing Data Approaches in Retrieving Group-Level Parameters Steffi Pohl, Freie Universität Berlin; Carmen Köhler and Claus Carstensen, Otto-Friedrich-Universität Bamberg

We investigate the performance of different missing data approaches in retrieving group-level parameters (e.g., regression coefficients) that are usually of interest in large-scale assessments. Results show that ignoring missing values performed almost as well as model-based approaches for nonignorable missing data; both approaches outperformed treating missing values as incorrect responses.

Monday, April 11, 2016 11:00 AM - 2:00 PM, Meeting Room 12, Meeting Room Level

Past Presidents Luncheon

By invitation only

Monday, April 11, 2016 12:25 PM - 1:55 PM, Meeting Room 8/9, Meeting Room Level, Invited Session, K1

The Every Student Succeeds Act (ESSA): Implications for Measurement Research and Practice Session Moderator: Martin West, Harvard Graduate School of Education Session Presenters: Peter Oppenheim, Education Policy Director and Counsel, U.S. Senate Committee on Health, Education, Labor, and Pensions (Majority) Sarah Bolton, Education Policy Director, U.S. Senate Committee on Health, Education, Labor, and Pensions (Minority) Session Respondents: Sherman Dorn, Arizona State University Marianne Perie, University of Kansas John Easton, Spencer Foundation

The 2015 enactment of the Every Student Succeeds Act marked a major shift in federal education policy, allowing states greater flexibility with respect to the design of school accountability systems while at the same time directing them to incorporate additional performance metrics not based on test scores. In this session, key Congressional staff involved in crafting the new law will describe its rationale and how they hope states will respond. A panel of researchers will in turn consider the opportunities the law creates for innovation in and research on educational measurement and the design of school accountability systems.

Monday, April 11, 2016 12:25 PM - 1:55 PM, Renaissance West A, Ballroom Level, Coordinated Session, K2

Career Paths in Educational Measurement: Lessons Learned by Accomplished Professionals Session Moderator: S E Phillips, Assessment Law Consultant Session Panelists: Kathy McKnight, Pearson School Research; Joe Martineau, National Center for the Improvement of Educational Assessment; Barbara Plake, University of Nebraska - Lincoln, Emeritus

Deciding what you want to do when you become a measurement professional can be a daunting task for a master's or doctoral student about to graduate. It can also be challenging for a graduate of a measurement program about to begin a first job. Sometimes, graduate students see the work of accomplished measurement professionals and wonder how they got there. Other times, graduate students know what they are interested in and the type of measurement activity they would like to engage in, but are uncertain which settings or career paths will provide the best fit.

Careers in educational measurement are many and varied. As graduate students consider their career options, they must weigh their skills, abilities, interests, and preferences against the opportunities, expectations, demands, and advancement potential of various jobs and career paths. This session is designed to provide some food for thought for these difficult decisions. It is targeted particularly at graduate students in measurement programs, graduates in their first jobs, and career changers within measurement.

Monday, April 11, 2016 12:25 PM - 1:55 PM, Meeting Room 3, Meeting Room Level, Coordinated Session, K3

Recent Investigations and Extensions of the Hierarchical Rater Model Session Chair: Jodi Casabianca, The University of Texas at Austin Session Discussant: Brian Patterson, Questar Assessment

Rater effects in educational testing and research have the potential to impact the quality of scores in constructed-response and performance assessments. The hierarchical rater model (HRM) is a multilevel item response theory model for multiple ratings of behavior and performance that yields estimates of latent traits corrected for individual rater bias and variability (Casabianca, Junker, & Patz, 2015; Patz, Junker, Johnson, & Mariano, 2002). This session reports on extensions and investigations of the basic HRM. The first paper serves as a primer to the session, providing the basic HRM formulae and notation, as well as comparisons to competing models. The second paper focuses on a parameterization of the longitudinal HRM that uses an autoregressive and/or moving average process in the estimation of latent traits over time. The third paper discusses a multidimensional extension of the HRM for use with rubrics assessing more than one trait. The fourth paper evaluates HRM parameter estimates when the examinee population is nonnormal, and demonstrates the use of flexible options for the Bayesian prior on the latent trait.
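
As rough background for readers new to the HRM (illustrative notation in the spirit of Patz et al., 2002, not necessarily the exact formulation used in these papers): the model first posits an "ideal" rating \(\xi_{ij}\) for examinee i on item j that follows a polytomous IRT model given \(\theta_i\), and then treats each observed rating as a noisy signal of that ideal rating,

\[
P(X_{ijr} = k \mid \xi_{ij}) \propto \exp\!\left\{ -\frac{\left[k - (\xi_{ij} + \phi_r)\right]^2}{2\psi_r^2} \right\},
\]

where \(\phi_r\) captures rater r's severity or leniency and \(\psi_r\) that rater's variability. The papers in this session add longitudinal, multidimensional, and nonnormal-population structure on top of this core.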

The HRM and Other Modern Models for Multiple Ratings of Rich Responses Brian Junker, Carnegie Mellon University

The Longitudinal Hierarchical Rater Model with Autoregressive and Moving Average Processes Mark Bond and Jodi Casabianca, The University of Texas at Austin; Brian Junker, Carnegie Mellon University

The Hierarchical Rater Model for Multidimensional Rubrics Ricardo Nieto, Jodi Casabianca and Brian Junker, The University of Texas at Austin

Parameter Recovery of the Hierarchical Rater Model with Nonnormal Examinee Populations Peter Conforti and Jodi Casabianca, The University of Texas at Austin

Monday, April 11, 2016 12:25 PM - 1:55 PM, Meeting Room 4, Meeting Room Level, Coordinated Session, K4

The Validity of Scenario-Based Assessment: Empirical Results Session Chair: Randy Bennett, ETS Session Discussant: Brian Stecher, RAND

Scenario-based assessments are distinct from traditional tests in that the former present a unifying context with which all subsequent questions are associated. Among other things, that context, or scenario, is intended to provide a reasonably realistic setting and purpose for responding. The presence of the scenario should, at best, facilitate valid, fair, and reliable measurement but, in no event, should it impede such measurement. The facilitation of valid, fair, and reliable measurement may occur because the scenario increases motivation and engagement, provides background information to activate prior knowledge and make it more equal across students, or steps students through warm-up problems that better prepare them for undertaking a culminating performance task.

Among the issues that have emerged with respect to scenario-based assessment are generalizability (e.g., students less knowledgeable about or interested in the particular scenario may be disadvantaged); local dependency (i.e., items may be conditionally dependent, artificially inflating apparent measurement precision); and scaffolding effects (e.g., the lead-in tasks may help students perform better than they otherwise would).

This symposium will include three papers describing scenario-based assessments for K-12 reading, writing, and science, as well as empirical results related to their validity, fairness, and reliability. Brian Stecher, of RAND, will be the discussant.

Building and Scaling Theory-Based and Developmentally-Sensitive Scenario-Based Reading Assessments John Sabatini, Tenaha O’Reilly, Jonathan Weeks and Jonathan Steinberg, ETS

Scenario-Based Assessments in Writing: An Experimental Study Randy Bennett and Mo Zhang, ETS

SimScientists Assessments: Science System Framework Scenarios Edys Quellmalz, Matt Silberglitt, Barbara Buckley, Mark Loveland, Daniel Brenner and Kevin (Chun-Wei) Huang, WestEd

Monday, April 11, 2016 12:25 PM - 1:55 PM, Meeting Room 5, Meeting Room Level, Paper Session, K5

Item Design and Development Session Discussant: Ruth Childs, University of Toronto

A Mixed Methods Examination of Reverse-Scored Items in Adolescent Populations Carol Barry and Haifa Matos-Elefonte, The College Board; Whitney Smiley, SAS

This study is a mixed methods exploration of reverse-scored items administered to 8th graders. The quantitative portion examines the psychometric properties of a measure of academic perseverance. The qualitative portion uses think-aloud interviews to explore potential reasons for poor functioning of reverse-scored items on the instrument.

Effects of Writing Skill on Scores on Justification/Evaluation Mathematics Items Tim Hazen and Catherine Welch, Iowa Testing Programs

Justification/Explanation (J/E) items in Mathematics require students to justify or explain their answers, often through writing. This empirical study matches scores on J/E items with scores on Mathematics and Writing achievement tests to examine 1) unidimensionality assumptions and 2) potentially unwanted effects on scores on tests with J/E items.

Economy of Multiple-Choice (MC) Versus Constructed-Response (CR) Items: Does CR Always Lose? Xuan-Adele Tan and Longjuan Liang, Educational Testing Service

This study compares Multiple-Choice (MC) versus Constructed-Response (CR) items in different content areas and of different types in terms of the cost and time required for a certain level of reliability. Results showed that CRs can have higher or comparable reliabilities for certain content areas. Results will help direct future test design efforts.

Applying the Q-Diffusion IRT Model to Assess the Impact of Multi-Media Items Nick Redell, Qiongqiong Liu and Hao Song, National Board of Osteopathic Medical Examiners (NBOME)

An application of the Q-diffusion IRT response process model to data from a timed, high-stakes licensure examination suggested that multi-media items convey additional information to examinees, above and beyond the time needed to process and encode the item, and that multi-media alters response processes for select examinees.

Monday, April 11, 2016 12:25 PM - 1:55 PM, Meeting Room 15, Meeting Room Level, Paper Session, K6

English Learners Session Discussant: Michael Rodriguez, University of Minnesota

Using Translanguaging to Assess Math Knowledge of Emergent Bilinguals: An Exploratory Study Alejandra Garcia and Fernanda Gandara, University of Massachusetts; Alexis Lopez, Educational Testing Service

There are persisting gaps in mathematics scores between ELs (English learners) and non-ELs even with existing test accommodations. Translanguaging considers that bilinguals have one linguistic repertoire from which they select features strategically to communicate effectively. This study analyzed the performance of ELs on a math assessment that included items with translanguaging features.

Estimating Effects of Reclassification of English Learners Using a Propensity Score Approach Jinok Kim, Li Cai and Kilchan Choi, UCLA/CRESST

Reclassification of English Learners (ELs) should be based on their readiness for mainstream classrooms. Drawing on propensity score methods, this paper estimates the effects of ELs’ reclassification on their subsequent academic outcomes in one state. Findings suggest small but positive effects for students reclassified in grades 4, 5, and 6.

Comparability Study of Computer-Based and Paper-Based Tests for English Language Learners Nami Shin, Mark Hansen and Li Cai, University of California, Los Angeles/ National Center for Research on Evaluation, Standards, and Student Testing (CRESST)

The purpose of this study is to examine the extent to which English Language Learner (ELL) status interacts with mode of test administration on large-scale, end-of-year content assessments. Specifically, we examine whether differences in item performance or functioning across computer-based and paper-based administrations are similar for ELL and non-ELL students.

Applying Hierarchical Latent Regression Models in Cross Lingual Assessment Haiyan Lin and Xiaohong Gao, ACT, Inc.

This study models the variation of examinees’ performance across groups and the interaction effect between group and person variables by applying 2- and 3-level hierarchical latent regression models in cross-lingual assessments. The simulation uses empirical estimates from two real datasets and explores different sample sizes, test lengths, and theta distributions.

Monday, April 11, 2016 12:25 PM - 1:55 PM, Meeting Room 16, Meeting Room Level, Paper Session, K7

Differential Item and Test Functioning Session Discussant: Dubravka Svetina, Indiana University

Examining Sources of Gender DIF Using Cross-Classified Multilevel IRT Models Liuhan Cai and Anthony Albano, University of Nebraska–Lincoln

An understanding of the sources of DIF can lead to more effective test development. This study examined gender DIF and its relationship with item format and opportunity to learn using cross-classified multilevel IRT models fit to math achievement data from an international dataset. Implications for test development are discussed.

Comparing Differential Test Functioning (DTF) for DFIT and Mantel-Haenszel/Liu-Agresti Variance C. Hunter and T. Oshima, Georgia State University

Using simulated data, DTF was calculated using DFIT and the Mantel-Haenszel/Liu-Agresti variance method. DFIT results show an unacceptable Type I error rate for DIF conditions with unequal sample sizes, but no susceptibility to distributional differences. The variance method showed the expected high rates of DTF, being especially sensitive to distributional differences.

When Can MIRT Models Be a Solution for DIF? Yuan-Ling Liaw and Elizabeth Sanders, University of Washington

The present study was designed to examine whether multidimensional item response theory (MIRT) models might be useful in controlling for differential item functioning (DIF) when estimating primary ability, or whether traditional (and simpler) unidimensional item response theory (UIRT) models with DIF items removed are sufficient for accurately estimating primary ability.

Power Formulas for Uniform and Non-Uniform Logistic Regression DIF Tests Zhushan Li, Boston College

Power formulas for the popular logistic regression tests for uniform and non-uniform DIF are derived. The formulas provide a means for sample size calculations in planning DIF studies with logistic regression DIF tests. Factors influencing the power are discussed. The correctness of the power formulas is confirmed by simulation studies.
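
For orientation (the paper's exact parameterization may differ), the logistic regression DIF procedure models the probability of a correct response as a function of a matching variable X (often the total score), group membership G, and their interaction:

\[
\mathrm{logit}\, P(Y = 1 \mid X, G) = \beta_0 + \beta_1 X + \beta_2 G + \beta_3 (X G),
\]

with uniform DIF assessed by testing \(\beta_2 = 0\) and non-uniform DIF by testing \(\beta_3 = 0\); the derived power formulas concern these two tests.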

Detecting Group Differences in Item Response Processes: An Explanatory Speed-Accuracy Mixture Model Heather Hayes, AMTIS Inc.; Stephen Gunter and Sarah Morrisey, Camber Corporation; Michael Finger, Pamela Ing and Anne Thissen-Roe, Comira

For the purpose of assessing construct validity, we extend previous conjoint speed-accuracy models to simultaneously examine a) the impact of cognitive components on performance for verbal reasoning items and b) how these effects (i.e., response processes) differ among groups who vary in educational breadth and depth.

Monday, April 11, 2016 12:25 PM - 1:55 PM, Mount Vernon Square, Meeting Room Level, Electronic Board Session, Paper Session, K8

Electronic Board #1 Extension of the Lz* Statistic to Mixed-Format Tests Sandip Sinharay, Pacific Metrics Corp

Snijders (2001) suggested the lz* statistic, which is a popular IRT-based person fit statistic (PFS). However, lz* can be computed only for tests consisting of dichotomous items and has not been extended to mixed-format tests. This paper extends lz* to mixed-format tests.
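
As background (a standard textbook definition, not a formula taken from the paper): for dichotomous items the person-fit statistic \(l_0\) is the log-likelihood of the observed response pattern, \(l_0 = \sum_j [x_j \log P_j(\theta) + (1 - x_j)\log(1 - P_j(\theta))]\), and its standardized version is

\[
l_z = \frac{l_0 - E(l_0)}{\sqrt{\mathrm{Var}(l_0)}},
\]

with Snijders' \(l_z^*\) correcting the mean and variance for the fact that \(\theta\) is estimated rather than known. An extension to mixed-format tests would presumably build the likelihood from both dichotomous and polytomous item response functions.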

Electronic Board #2 Examining Two New Fit Statistics for Dichotomous IRT Models Leanne Freeman and Bo Zhang, University of Wisconsin, Milwaukee

This study introduces the Clarke and Vuong statistics for assessing model-data fit for dichotomous IRT models. Monte Carlo simulations will be conducted to examine the Type I error and power of the two statistics. Their performance will be compared to the likelihood ratio test, which most researchers currently use.

Electronic Board #3 Automated Marking of Written Response Items in a National Medical Licensing Examination Maxim Morin, André-Philippe Boulais and André De Champlain, Medical Council of Canada

Automated essay scoring (AES) offers a promising alternative to human scoring for the marking of constructed-response type items. Based on real data, the present study compared several AES conditions for scoring short-answer CR items and evaluated the impact of using AES on the overall statistics of a sample examination form.

Electronic Board #4 Evaluating Automated Rater Performance: Is the State of the Art Improving? Michelle Boyer, University of Massachusetts, Amherst; Vincent Kieftenbeld, Pacific Metrics

This study evaluates multiple automated raters across four different automated scoring studies to assess whether the state of the art in automated scoring is advancing. Beyond an item-by-item evaluation, the method used here investigates automated rater performance across many items.

Electronic Board #5 Test-Taking Strategies and Ability Estimates in a Speeded Computerized Adaptive Test Hua Wei and Xin Li, Pearson

This study compares ability estimates of examinees using different test-taking strategies towards the end of a computerized adaptive test (CAT) when they are unable to finish the test within the allotted time. Item responses will be simulated for fixed-length CAT administrations with different test lengths and different degrees of speededness.

Electronic Board #6 Detecting Cheating When Examinees and Accomplices Are Not Physically Co-Located Chi-Yu Huang, Yang Lu and Nooree Huh, ACT, Inc.

A simulation study will be conducted to examine the efficiency of different statistics in detecting cheating among examinees who are physically in different locations but share highly similar item responses. Different statistics that will be investigated include a modified ω index, l_z index, H^T index, score estimation, and score prediction.

Electronic Board #7 Detecting Differential Item Functioning (DIF) Using Boosting Regression Tree Xin Luo and Mark Reckase, Michigan State University; John Lockwood, ETS

A classification method in data mining known as boosting regression tree (BRT) was applied to identify the items with DIF in a variety of test situations, and the effectiveness of this new method was compared with other DIF detection procedures. The results supported the quality of the BRT method.

Electronic Board #8 Using Growth Mixture Modeling to Explore Test Takers’ Score Change Patterns Youhua Wei, Educational Testing Service

For a large-scale and high-stakes testing program, some examinees take the test more than once and their score change patterns vary across individuals. This study uses latent class and growth mixture modeling to identify unobserved sub-populations and explore different latent score change patterns among repeaters in a testing program.

Electronic Board #9 Studies of Growth in Reading in a Vertically Equated National Reading Test David Andrich and Ida Marais, University of Western Australia

Australia’s yearly reading assessments for all Year 3, 5, 7 and 9 students are equated vertically. The rate of increase of the worst-performing state is greater than that of the best-performing one. The former’s efforts to improve reading may be missed if mean achievements alone are compared.

Electronic Board #10 Examining the Impact of Longitudinal Measurement Invariance Violations on Growth Models Kelli Samonte, American Board of Internal Medicine; John Willse, University of North Carolina Greensboro

Longitudinal analyses rely on the assumption that scales function invariantly across measurement occasions. Minimal research has been conducted to evaluate the impact that longitudinal measurement invariance violations have on latent growth models (LGM). The current study aims to examine the impact that varying degrees of longitudinal invariance violations have on LGM parameters.

Electronic Board #11 Defining On-Track Towards College Readiness Using Advanced Latent Growth Modeling Techniques Anthony Fina, Iowa Testing Programs, University of Iowa

The primary purpose of this exploratory study was to investigate growth at the individual level and examine how individual variability in growth is related to college readiness. Growth mixture models and a latent class growth analysis were used to define developmental trajectories from middle school through high school.

Electronic Board #12 Impact of Sample Size and the Number of Common Items on Equating Hongyu Diao, Duy Pham and Lisa Keller, University of Massachusetts-Amherst

Three methods of small sample equating in the non-equivalent groups anchor test design are investigated in this simulation study: circle-arc, nominal weights mean equating, and Rasch equating. Results indicate that in the presence of small samples, increasing the number of equating items might help mitigate the error.

Electronic Board #13 Effect of Test Speededness on Item Parameter Estimation and Equating Can Shao, University of Notre Dame; Rongchun Zhu and Xiaohong Gao, ACT

Test speededness often leads to biased parameter estimates and produces inaccurate equated scores, thus threatening test validity. In this study, we compare three different methods of dealing with test speededness and investigate their impact on item parameter estimation and equating.

Electronic Board #14 Computation of Conditional Standard Error of Measurement with Compound Multinomial Models Hongling Wang, ACT, Inc.

Compound multinomial models have been used to compute conditional standard error of measurement (CSEM) for tests containing polytomous scores. One problem hindering applications of these models is the large amount of computation required for tests with complex item scoring. This study investigates strategies to simplify CSEM computation with compound multinomial models.

Electronic Board #15 Exploring the Within-Item Speed-Accuracy Relationship with the Profile Method for Computer-Based Tests Shu-chuan Kao, Pearson

The purpose of this study is to describe the effect of time on the item-person interaction for computer-based tests. The profile method shows the subgroup item difficulty conditioned on item latency. The profile trend can help testing practitioners easily inspect the effect of response time in empirical data.

Electronic Board #16 Impact of Items with Minor Drift on Examinee Classification Aijun Wang, Yu Zhang and Lorin Mueller, Federation of State Boards of Physical Therapy

This study examined the impact of items with minor drift on examinees’ classification accuracy at different ability levels. Results show that the pass/fail status of examinees at medium ability levels is more affected than that of examinees at high or low ability levels.

Electronic Board #17 Detecting DIF on Polytomous Items of Tests with Special Education Populations Kwang-lee Chu and Marc Johnson, Pearson; Pei-ying Lin, University of Saskatchewan

Disability affects performance and interacts with gender/ethnicity; its impacts largely reflect ability differences and should be isolated from DIF analysis. The effects of disability on polytomous item DIF analysis are examined. This study uses empirical data and simulations to investigate the accuracy of DIF models.

Electronic Board #18 Online Calibration of Polytomous Items Using the Generalized Partial Credit Model Yi Zheng, Arizona State University

Online calibration is a technology-enhanced calibration strategy that dynamically embeds pretest items in operational computerized adaptive tests and utilizes known operational item parameters to calibrate the pretest items. This study extends existing online calibration methods for dichotomous IRT models to the GPCM in order to model polytomous items such as performance-based items.
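
As a reminder of the target model (standard parameterization; the study's notation may differ), the generalized partial credit model gives the probability of score category k on a polytomous item j as

\[
P(X_j = k \mid \theta) =
\frac{\exp\!\left[\sum_{v=1}^{k} a_j(\theta - b_{jv})\right]}
     {\sum_{c=0}^{m_j}\exp\!\left[\sum_{v=1}^{c} a_j(\theta - b_{jv})\right]},
\qquad k = 0, \dots, m_j,
\]

with the empty sum for k = 0 defined as zero. Online calibration of a pretest item then amounts to estimating \(a_j\) and the step parameters \(b_{jv}\) while treating the operational items' parameters, and hence the provisional \(\theta\) estimates, as known.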

Electronic Board #19 Identifying Intra-Individual Significant Growth in K-12 Reading and Mathematics with Adaptive Testing Chaitali Phadke, David Weiss and Theodore Christ, University of Minnesota

Psychometrically significant intra-individual change in K-12 Math and Reading achievement was measured using the fixed-length (30-item) Adaptive Measurement of Change (AMC) method. Analyses indicated that the majority of change was nonlinear. Results supported the use of the AMC procedure for the detection of psychometrically significant change.

Electronic Board #20 A Comparison of Estimation Techniques for IRT Models with Small Samples Holmes Finch, Ball State University; Brian French, Washington State University

Estimation accuracy of item response theory (IRT) model parameters is a concern with small samples. This can preclude the use of IRT and its associated advantages with low-incidence populations. This simulation study compares marginal maximum likelihood (ML) and pairwise estimation procedures. Results support the accuracy of pairwise estimation over ML.

Electronic Board #21 Comparing Three Procedures for Preknowledge Detection in Computerized Adaptive Testing Jin Zhang and Ann Wang, ACT Inc.

One classical and two Bayesian procedures of item preknowledge detection based on the hierarchical lognormal response time model are compared for computerized adaptive testing. A simulation study is conducted to investigate the effectiveness of the methods in conditions with various proportions of items and examinees affected by item preknowledge.

Electronic Board #22 Small Sample Equating for Different Uses of Test Scores in Higher Education HyeSun Lee, University of Nebraska-Lincoln; Katrina Roohr and Ou Lydia Liu, Educational Testing Service

The current simulation examined four equating methods for small samples depending on the use of test scores in higher education. Mean equating performed better for the estimation of institution-level reliability, whereas identity equating performed slightly better for the estimation of value-added scores. The paper addresses practical implications of the findings.

Electronic Board #23 Diagnostic Classification Modeling in Student Learning Progression Assessment Ruhan Circi, University of Colorado Boulder; Nathan Dadey, The National Center for the Improvement of Educational Assessment, Inc.

A diagnostic classification model is used in this study to model a learning progression assessment. Results provided evidence of moderate item quality. Support was found for the use of the learning progression in the classroom to help students gain mastery of at least one of the learning outcomes.

Monday, April 11, 2016 2:15 PM - 3:45 PM, Renaissance West A, Ballroom Level, Invited Session, L1

Learning from History: How K-12 Assessment Will Impact Student Learning Over the Next Decade (National Association of Assessment Directors) Session Organizer: Mary E Yakimowski, Sacred Heart University Session Panelists: Kenneth J. Daly III; Dale Whittington, Shaker Heights Schools; Lou Fabrizio, North Carolina Department of Public Instruction; Carlos Martínez, Jr., U.S. Department of Education; James H. McMillan, Virginia Commonwealth University; Eva Baker, University of California, Los Angeles

We have seen a remarkable evolution in the field of K-12 student assessment over the past 50 years. This increased attention has increased student learning, or has it? Through this invited session, you will hear panelists share insights from our history of K-12 assessment and offer lessons on how best to design and use assessment results so that they truly deepen student learning over the next decade.

More specifically, this invited session brings together panelists representing practitioners (Mr. Kenneth J. Daly III, Dr. Dale Whittington), state and federal government agencies (Dr. Louis M. Fabrizio, Dr. Carlos Martinez), and higher education institutions (Dr. James H. McMillan, Dr. Eva Baker), with a combined experience in assessment of over 150 years.

For the introductory portion of this session, panelists have been charged with sharing reflections on significant developments in K-12 student assessment from the last half century. They will do this by reconstructing their collective memory of this assessment history. The major portion of the session will be allotted to the second charge given to the panelists: to present and discuss lessons gained from this history for better constructing and using assessments geared to deepen student learning during the next decade. The last part of this session will allow for interaction among the panelists and the audience on improving learning through assessment.

Monday, April 11, 2016 2:15 PM - 3:45 PM, Meeting Room 8/9, Meeting Room Level, Coordinated Session, L2

Psychometric Issues on the Operational New-Generation Consortia Assessments Session Discussant: Timothy Davey, Educational Testing Service

The theoretical foundation of online (adaptive and non-adaptive) testing is historically well established. Basic components of computerized adaptive test (CAT) procedures and their implementations have also been thoroughly investigated, with options examined from various perspectives (Weiss and Gage, 1984; Way, 2005; Davey, 2011). However, new and practical psychometric issues arose as online assessments moved into large-scale operational testing. In particular, newly developed, new-generation assessments aligned to the Common Core State Standards (CCSS) became operational in a number of states. Psychometric designs affected by these changes, including scoring strategies, IRT model selection, and vertical scales, may have an impact on the validity of test scores. Furthermore, a complex test design with both a CAT and a performance task was used for these CCSS-aligned assessments. The assessments also include innovative items in addition to traditional dichotomous and polytomous items. Therefore, findings and solutions from previous research may not be directly applicable to some of the issues mentioned above regarding the operational online assessments. Innovative psychometric analyses and solutions are required.

This session discusses the following important practical psychometric issues addressed in the first year of operational practice with the newly developed, new-generation CCSS-aligned assessments: (1) how to score an incomplete computerized adaptive test (CAT); (2) how to achieve an optimal balance between content/administration constraints and CAT efficiency in the assessment designs for accurate ability estimates; and (3) which type of IRT model (unidimensional or multidimensional) produces more robust vertical scales for measuring student ability and growth. Three studies explore these questions using different psychometric and statistical methods based on operational data from multiple states or on simulations. Analyses and findings are not only useful in validating the characteristics of the assessments for future improvement, but will also inspire further investigation in these areas, which have not yet been fully explored.

Psychometric Issues and Approaches in Scoring Incomplete Online-Adaptive Tests Yi Du, Yanming Jiang, Terran Brown and Timothy Davey, Educational Testing Service

Effects of CAT Designs on Content Balance and the Efficiency of Test Shudong Wang, Northwest Evaluation Association; Hong Jiao, University of Maryland

Multidimensional Vertical Scaling for Tests with Complex Structures and Various Growth Patterns Yanming Jiang, Educational Testing Service

Monday, April 11, 2016 2:15 PM - 3:45 PM, Meeting Room 3, Meeting Room Level, Coordinated Session, L3

Issues and Practices in Multilevel Item Response Models Session Chair: Ji Seung Yang, University of Maryland Session Discussant: Li Cai, University of California

Educational assessment data are often collected under complex sampling designs that result in unavoidable dependency among examinees within clusters such as classrooms or schools. Multilevel item response theory (MLIRT) models have been developed (e.g., Adams, Wilson, and Wu, 1997; Fox, 2005; Kamata, 2001) to address the nested structure of item response data more properly and to draw more sound statistical inferences for both within- and between-cluster estimates (e.g., the intraclass correlation or cluster-level latent scores). Combined with multidimensionality or local dependency among item responses (e.g., testlets), the complexity of multilevel item response models has increased and has drawn many methodologists’ attention to issues and practices that cover not only modeling but also scoring and model selection. The purpose of this coordinated session is to introduce recent advanced topics in MLIRT and to provide practical guidance to practitioners implementing some of the extended MLIRT models. The session is composed of five papers. The first two papers are concerned with MLIRT models that properly reflect complex sampling designs, and the next two papers focus on the distribution of the latent density and on scoring at the between-cluster level. Finally, the last paper is on model selection methods in MLIRT.
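
A minimal two-level sketch (illustrative only; the models in this session are considerably richer): in a multilevel Rasch-type model, the ability of person i in cluster g (e.g., a school) can be decomposed as

\[
\mathrm{logit}\, P(Y_{igj} = 1) = \theta_{ig} - b_j, \qquad
\theta_{ig} = \gamma_g + u_{ig}, \quad
\gamma_g \sim N(\mu, \tau_B^2), \;\; u_{ig} \sim N(0, \tau_W^2),
\]

so that \(\gamma_g\) serves as the cluster-level latent score and the intraclass correlation of ability is \(\tau_B^2 / (\tau_B^2 + \tau_W^2)\).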

Multilevel Cross-Classified Dichotomous Item Response Theory Models for Complex Person Clustering Structures Chen Li and Hong Jiao, University of Maryland

Multilevel Item Response Models with Sampling Weights Xiaying Zheng and Ji Seung Yang, University of Maryland

School-Level Subscores Using Multilevel Item Factor Analysis Megan Kuhfeld and Li Cai, University of California

Multilevel Item Bifactor Models with Nonnormal Latent Densities Ji Seung Yang, Ji An and Xiaying Zheng, University of Maryland

Model Selection Methods for MLIRT Models: Gaining Information from Different Focused Parameters Xue Zhang and Jian Tao, Northeast Normal University; Chun Wang, University of Minnesota

Monday, April 11, 2016 2:15 PM - 3:45 PM, Meeting Room 4, Meeting Room Level, Coordinated Session, L4

Psychometric Issues in Alternate Assessments Session Chair: Okan Bulut, University of Alberta Session Discussant: Michael Rodriguez, University of Minnesota

Alternate assessments are designed for students with significant cognitive disabilities. They are characterized by semi-adaptive test designs, testlet-based forms, small sample sizes, and negatively skewed ability distributions. This symposium aims to reflect the common psychometric challenges in the context of alternate assessments, such as local item dependence (LID), differential item functioning (DIF), testlet and position effects, and the impact of cumulative item parameter drift (IPD). The alternate assessments used in this proposal are mixed-format tests that consist of both dichotomous and polytomous items.

The first study explores the advantages of a four-level measurement model (1–item effect, 2–testlet effect, 3–person effect, and 4–disability type effect) for investigating local item dependence caused by item clustering and local person dependence caused by person clustering, relative to models that cannot handle both simultaneously.

The second study employs the Linear Logistic Test Model (LLTM) to examine the consequences of item position and testlet position effects in alternate assessments. The use of the LLTM for investigating position effects in a semi-adaptive test form is demonstrated.

The third study quantifies the advantages of three bi-factor models that take the testlet-based item structure into account and compares them with the 2PL IRT model. In addition, a DIF analysis based on each model included in the study is conducted, which helps clarify the differences among the models in the context of DIF.

The last study examines the cumulative impact of item parameter drift on item parameter and student ability estimates. It includes a Monte Carlo simulation for each operational administration in five states across three to nine years. Results from simulations and operational testing are compared. Effects of different equating methods are also compared.

Multilevel Modeling of Item and Person Clustering Simultaneously in Alternate Assessments Chao Xie and Hyesuk Jang, American Institutes for Research

Examining Item and Testlet Position Effects in Computer-Based Alternate Assessments Okan Bulut, University of Alberta; Xiaodong Hou and Ming Lei, American Institutes for Research

An Application of Bi-Factor Model for Examining DIF in Alternate Assessments Hyesuk Jang and Chao Xie, American Institutes for Research

Impact of Cumulative Drift on Parameter and Ability Estimates in Alternate Assessments Ming Lei, American Institutes for Research; Okan Bulut, University of Alberta

Monday, April 11, 2016 2:15 PM - 3:45 PM, Meeting Room 5, Meeting Room Level, Coordinated Session, L5

Recommendations for Addressing the Unintended Consequences of Increasing Examination Rigor Session Discussant: Betsy Becker, Florida State University

The purpose of this symposium is to present findings from all development activities since Race to the Top (RTTT) and to address the unintended consequences of increasing examination rigor. Findings from the past five years of FTCE/FELE development, scoring, reporting, and standard setting procedures and outcomes will be presented. First, the FTCE/FELE program initiatives, as well as policy changes and outcomes that have occurred as a result of the increase in examination rigor, will be presented. Second, the current study will draw an overview picture of the 1.5- to 2-year development cycle for the FTCE/FELE program and provide an in-depth explanation of the facilitation of each step in the test development process, based on the Standards for Educational and Psychological Testing. Third, the current psychometric, scoring and reporting, standard setting, and passing score adoption processes for the FTCE/FELE program will be discussed. Lastly, an overview of educator candidates' performance in response to the examinations' increased rigor will be discussed, and analyses of student-level and test-level data will be presented to answer: What is the impact of increased rigor on the average difficulty of tests? Does increased rigor have a significant impact on test takers' performance? Does increased rigor have a significant impact on passing rates?

The Effect of Increased Rigor on Education Policy Phil Canto, Florida Department of Education

Developing Assessments in an Ongoing Testing Environment Lauren White, Florida Department of Education

FTCE/FELE Standard Setting and New Passing Scores: The Methodology Süleyman Olgar, Florida Department of Education

Increased Rigor and Its Impact on Certification Examination Outcomes Onder Koklu, Florida Department of Education

Monday, April 11, 2016 2:15 PM - 3:45 PM, Meeting Room 15, Meeting Room Level, Paper Session, L6

Innovations in Assessment Session Discussant: TBA

Investigating the Comparability of Examination Difficulty Using Comparative Judgement and Rasch Modelling Stephen Holmes, Michelle Meadows, Ian Stockford and Qingping He, Office of Qualifications and Examinations Regulation

This research explores a new approach, combining comparative judgement and Rasch modelling, to investigate the comparability of the difficulty of examinations. Findings from this study suggest that this approach could potentially be used as a proxy for pretesting assessments when security or other issues are a major concern.

Improvements in Automated Capturing of Psycho-Linguistic Features in Reading Assessment Text Makoto Sano, Prometric

This study explores psycho-linguistic features associated with reading passage MC item types that can be used to predict the difficulty levels of these item types. The effectiveness of new functions in the NLP tool PLIMAC (Sano, 2015) is evaluated using items from the NAEP Grade 8 Reading assessment.

Generating Rubric Scores from Pairwise Comparisons Shayne Miel, Elijah Mayfield and David Adamson, Turnitin; Holly Garner, EverEd Technology

Using pairwise comparisons to score essays on a holistic rubric is potentially a more reliable scoring method than traditional handscoring. We establish a metric for measuring the reliability of a scoring process and explore methods for assigning discrete rubric scores to the ranked list induced by the pairwise comparisons.

Investigating Sequential Item Effects in a Testlet Model William Muntean and Joe Betts, Pearson

Scenario-based assessments are well-suited for measuring professional decision-making skills such as clinical judgment. However, these types of items present a unique challenge to a testlet-based model because of potential sequential item effects. This research investigates the impact of sequential item effects within a testlet model.

Monday, April 11, 2016 2:15 PM - 3:45 PM, Meeting Room 12, Meeting Room Level, Paper Session, L7

Technology-Based Assessments Session Discussant: Mengxiao Zhu, ETS

Theoretical Framework for Log-Data in Technology-Based Assessments with Empirical Applications from PISA Ulf Kroehne, Heiko Rölke, Susanne Kuger, Frank Goldhammer and Eckhard Klieme, German Institute for International Educational Research (DIPF)

Indicators derived from log-data are often based on the ad hoc use of available events because log-data completeness has not been defined. This gap is filled with a theoretical framework that formalizes technology-based assessments with finite-state machines and provides completeness conditions, illustrated with empirical examples from PISA assessments.

Investigating the Relations of Writing Process Features and the Final Product Chen Li, Mo Zhang and Paul Deane, Educational Testing Service

Features extracted from the writing process, such as latency between keypresses, have the potential to provide evidence of one’s writing skills that is not available from the final product. This study investigates and compares the relations of process features with text quality as measured by two rubrics on writing fundamentals and higher-level skills.

Interpretation of a Complex Assessment Focusing on Validity and Appropriate Reliability Assessment Steffen Brandt, Art of Reduction; Kristina Kögler, Goethe-Universität Frankfurt; Andreas Rausch, Universität Bamberg

An analysis approach combining qualitative analyses of answer patterns and quantitative, IRT-based analyses is demonstrated on data from a test composed of three computer-based problem solving tasks (each 30-45 minutes). The strong qualitative component increases validity and additionally yields appropriate reliability estimates by avoiding local item dependence.

Award Session: Brenda Loyd Dissertation Award 2016: Youn-Jeng Choi

Monday, April 11, 2016 2:15 PM - 3:45 PM, Meeting Room 13/14, Meeting Room Level, Invited Session, L8

NCME Diversity and Testing Committee Sponsored Symposium: Implications of Computer-Based Testing for Assessing Diverse Learners: Lessons Learned from the Consortia Session Moderator: Priya Kannan, Educational Testing Service Session Discussant: Bob Dolan, Diverse Learners Consulting

Six consortia developed and operationally delivered next-generation, large-scale assessments in 2015. These efforts provided opportunities to re-think the ways that assessment systems, and in particular computer-based tests, are designed to support valid assessment for all learners. In this session, representatives from each consortium will describe their lessons learned in the administration of computer-based tests to diverse learners. Topics will include design features of the assessment systems that are intended to promote effective and inclusive assessment, research and evaluation on the 2014-15 assessment administration, and future challenges and opportunities.

Smarter Balanced Assessment Consortium (SBAC) Tony Alpert, Smarter Balanced Assessment Consortium

Partnership for Assessment of Readiness for College and Careers (PARCC) Trinell Bowman, Prince George’s County Public Schools in Maryland

National Center and State Collaborative (NCSC) Rachel Quenemoen, National Center on Educational Outcomes

Dynamic Learning Maps Alternate Assessment System (DLM) Russell Swinburne Romine, University of Kansas

English Language Proficiency Assessment for the 21st Century (ELPA21) Martha Thurlow, National Center on Educational Outcomes

WIDA Carsten Wilmes, University of Wisconsin

Monday, April 11, 2016 3:00 PM - 7:00 PM, Meeting Room 10/11, Meeting Room Level

NCME Board of Directors Meeting

Members of NCME are invited to attend as observers

Monday, April 11, 2016 4:05 PM - 6:05 PM, Meeting Room 8/9, Meeting Room Level, Coordinated Session, M1

Fairness Issues and Validation of Non-Cognitive Skills Session Chair: Haifa Matos-Elefonte, The College Board Session Discussant: Patrick Kyllonen, Educational Testing Service

More research and attention are needed to ensure that assessments of non-cognitive skills provide fair and valid inferences for all examinees. Four presenters will offer perspectives on non-cognitive skills and the issues of fairness in assessing them in four contexts. The first presenter will discuss non-cognitive factors within the context of an international assessment, offering a framework to handle the interplay of cultural and linguistic diversity in developing the assessment to ensure fairness and valid interpretations for all test takers. The second presenter will provide an overview of non-cognitive skills in K-12 settings with thoughts on the issues surrounding the various threats to fair and valid interpretations. The third presenter will extend the evidence-centered-design approach to capture the needs of culturally and linguistically diverse populations in the design and development of a non-cognitive assessment used in higher education, so as to ensure the fairness and validity of inferences for all examinees. The fourth presentation will provide an overview of the fairness issues involving non-cognitive measures in personnel selection and discuss specific aspects that permit these assessments to be used in fair and valid ways. Finally, a discussant will provide comments on each of the presentations and offer additional insights.

Non-Cognitive Factors, Culture, and Fair and Valid Assessment of Culturally And-Linguistically-Diverse Learners Edynn Sato, Pearson

Some Thoughts on Fairness Issues in Assessing Non-Cognitive Skills in K-12 Thanos Patelis, Center for Assessment

An Application of Evidence-Centered-Design to Assess Collaborative Problem Solving in Higher Education Maria Elena Oliveri, Robert Mislevy and Rene Lawless, Educational Testing Service

The Changing Use of Non-Cognitive Measures in Personnel Selection Kurt Geisinger, Buros Center for Testing, University of Nebraska-Lincoln

Monday, April 11, 2016 4:05 PM - 6:05 PM, Meeting Room 3, Meeting Room Level, Coordinated Session, M2

Thinking About Your Audience in Designing and Evaluating Score Reports Session Chair: Priya Kannan, Educational Testing Service Session Discussant: April Zenisky, University of Massachusetts, Amherst

The information presented in score reports is often the single most important point of interaction between a score user and the outcomes of an assessment. Score reports are consumed by a variety of score users (e.g., test takers, parents, teachers, administrators, policy makers), and each of these users has a different level of understanding of the assessment and its intended outcomes. The degree to which these diverse users understand the information presented in score reports impacts their ability to draw reasonable conclusions. Recent score reporting frameworks have highlighted the importance of taking into account the needs, pre-existing knowledge, and attitudes of specific stakeholder groups (Zapata-Rivera & Katz, 2014) as well as the importance of iterative design in the development of score reports (Hambleton & Zenisky, 2013). The papers in this session employ a variety of methods to identify and understand the needs of diverse stakeholder groups, and the studies highlight the importance of sequential and iterative approaches (i.e., assessing needs, prototyping, and evaluating usability and accuracy of understanding) to the design and development of audience-focused score reports. This collection of studies demonstrates how a focus on stakeholder needs can bring substantive gains for the validity of interpretations and decisions made from assessment results.

Designing and Evaluating Score Reports for a Medical Licensing Examination Amanda Clauser, National Board of Medical Examiners; Francis Rick, University of Massachusetts, Amherst

Evaluating Validity of Score Reports with Diverse Subgroups of Parents Priya Kannan, Diego Zapata-Rivera and Emily Leibowitz, Educational Testing Service

Designing Alternate Assessment Score Reports: Implications for Instructional Planning Amy Clark, Meagan Karvonen and Neal Kingston, University of Kansas

Interactive Score Reports: a Strategic and Systematic Approach to Development Richard Tannenbaum, Priya Kannan, Emily Leibowitz, Ikkyu Choi and Spyridon Papageorgiou, Educational Testing Service

Data Systems and Reports as Active Participants in Data Analyses Jenny Rankin, Illuminate Education

Monday, April 11, 2016 4:05 PM - 6:05 PM, Meeting Room 4, Meeting Room Level, Coordinated Session, M3

Use of Automated Tools in Listening and Reading Item Generation Session Chair: Su-Youn Yoon, ETS Session Discussant: Christy Schneider, Center for Assessment

Creating a large pool of valid items with appropriate difficulty has been a continuing challenge for testing programs. In order to address this need, several studies have focused on developing automated tools to predict the complexity of passages for reading or listening items. In addition to predicting text complexity, automated technologies can be used in a variety of ways in the context of item generation, which may contribute to increased efficiency, validity, and reliability in item development. This coordinated session will investigate the use of automated technology to support a wide range of processes for generating items that assess listening and reading skills.

Aligning the TextEvaluator Reporting Scale with the Common Core Text Complexity Scale Kathleen Sheehan, ETS

Prediction of Passage Acceptance/ Rejection Using Linguistic Information Swapna Somasundaran, Yoko Futagi, Nitin Madnani, Nancy Glazer, Matt Chametsky and Cathy Wendler, ETS

Measuring Text Complexity of Items for Adult English Language Learners Peter Foltz, Pearson and University of Colorado Boulder; Mark Rosenstein, Pearson

Automatic Prediction of Difficulty of Listening Items Su-Youn Yoon, Anastassia Loukina, Youhua Wei and Jennifer Sakano, ETS

Item Generation Using Natural Language Processing Based Tools and Resources Chong Min Lee, Melissa Lopez, Su-Youn Yoon, Jenifer Sakano, Anastassia Loukina, Bob Krovetz and Chi Lu, ETS

Monday, April 11, 2016 4:05 PM - 6:05 PM, Meeting Room 5, Meeting Room Level, Paper Session, M4

Practical Issues in Equating Session Discussant: Dongmei Li, ACT

Empirical Item Characteristic Curve Pre-Equating with the Presence of Test Speededness Yuxi Qiu and Anne Huggins-Manley, University of Florida

This simulation study is proposed to evaluate the accuracy of the empirical item characteristic curve (EICC) pre-equating method under combinations of varied levels of test speededness, sample size, and test length. Findings of this research provide guidelines for practitioners and may further stimulate better score equating practice.

Investigating the Effect of Missing and Speeded Responses in Equating Hongwook Suh, JP Kim and Tony Thompson, ACT, Inc.

This study investigates the effect on equating results of dealing with examinees who showed omitted and speeded responses, by applying the lognormal response time model (van der Linden, 2006). Empirical data are manipulated to reflect practical situations considered in the equating procedures.

The Effects of Non-Representative Common Items on Linear Equating Relationships Lu Wang, ACT, Inc./The University of Iowa; Won-Chan Lee, University of Iowa

This study investigates the effects of both content and statistical representation of common items on the accuracy of four linear equating relationships. The results of this study will assist practitioners in choosing the most accurate linear equating method(s) when the representativeness of common items is a concern.

Pseudo-Equating Without Common Items or Common Persons Nooree Huh, Deborah Harris and Yu Fang, ACT, Inc.

In some high-stakes testing programs, it is not possible to conduct standard equating such as common-item or random-groups equating because once an item is exposed, it is no longer secure. However, the need to compare scores across administrations may still exist. This paper demonstrates some alternative approaches.

Equating Item Difficulty Under Sub-Optimal Conditions Michael Walker, The College Board; Usama Ali, Educational Testing Service

This paper evaluates two methods for equating item difficulty statistics: one using linear equating and the other using post-stratification. The paper evaluates these methods in terms of bias and error across a range of sample sizes and population ability differences, and across chains of equating of different lengths.

Impact of Drifted Common Items on Proficiency Estimates Under the CIECP Design Juan Chen, Andrew Mroch, Mengyao Zhang, Joanne Kane, Mark Connally and Mark Albanese, National Conference of Bar Examiners

The authors explore the detection and impact of drifted common items on examinee proficiency estimates and examinee classification. Two different detection methods, two approaches to setting item parameter estimates, and two different linking methods are examined. Both practical and theoretical implications of the findings are discussed.

Monday, April 11, 2016 4:05 PM - 6:05 PM, Meeting Room 16, Meeting Room Level, Paper Session, M5

The Great Subscore Debate Session Discussant: Sandip Sinharay, Pacific Metrics

How Worthless Subscores Are Causing Excessively Long Tests Howard Wainer and Richard Feinberg, National Board of Medical Examiners

Previous research overwhelmingly confirms the paucity of subscores worth reporting for either individuals or institutions. Given the excessive length of most standardized tests, particularly licensure/credentialing examinations, offered without evidence to support reporting more than a single score, we illustrate an approach for reducing test length while minimizing additional pass/fail misclassification.

An Alternative Perspective on Subscores and Their Value Yuanchao Emily Bo, Mark Hansen and Li Cai, University of California, Los Angeles; Charles Lewis, Educational Testing Service, Fordham University

Recent work has shown that observed subscores are often worse predictors of true subscores than the total score. However, we propose here that it is the specific component of the subscore that should be used to judge its value. From this perspective, we reach a quite different conclusion.

Masking Distinct and Reliable Subscores: A Call to Assess Added Value Invariance Joseph Rios, Educational Testing Service

Subscore added value is commonly assessed for the total sample; however, this study found that up to 30% of examinees for whom subscores have added value can be masked when treating subscores as invariant across groups. Therefore, we should consider that subscores may be valid and reliable for some examinees but not all.

Why Do Value Added Ratios Differ Under Different Scoring Approaches? Brian Leventhal, University of Pittsburgh; Jonathan Rubright, American Institute of Certified Public Accountants

Using classical test theory, Haberman (2008) developed an approach to calculate whether a subscore has value in being reported. This paper shows how value-added ratios differ under item response theory, and provides an empirical example showing how various scoring options under IRT impact this ratio.
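
As a hedged paraphrase of the criterion being referenced (not necessarily the authors' exact formulation): Haberman's approach compares how well the observed subscore s and the observed total score x predict the true subscore, via their proportional reductions in mean squared error, and the value-added ratio is then

\[
\mathrm{VAR} = \frac{\mathrm{PRMSE}_s}{\mathrm{PRMSE}_x},
\]

with a subscore judged to add value only when the ratio exceeds 1, i.e., when the subscore predicts its own true score better than the total score does. This paper examines how that ratio behaves when the scores are IRT-based rather than classical.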

Accuracy of the Person-Level Index for Conditional Subscore Reporting Richard Feinberg and Mark Raymond, National Board of Medical Examiners

Recent research has proposed a conditional index to detect subscore value for certain test takers when more conventional methods suggest not reporting at all. The current study furthers this research by investigating conditions under which conditional indices detect potentially meaningful score profiles that may be worthy of reporting.

The Validity of Augmented Subscores When Used for Different Purposes Marc Gessaroli, National Board of Medical Examiners

The validity of augmented subscores has been debated in the literature. This paper studies the validity of augmented subscores when they are used for different purposes. The findings suggest that the usefulness of augmented subscores varies depending upon the intended use of the scores.

Monday, April 11, 2016 4:05 PM - 6:05 PM, Meeting Room 12, Meeting Room Level, Paper Session, M6

Scores and Scoring Rules Session Discussant: Steven Culpepper, University of Illinois

The Relationship Between Pass Rate and Multiple Attempts Ying Cheng and Cheng Liu, University of Notre Dame

We analytically derive the relationship between the expected conditional and marginal pass rates and the number of allowed attempts at a test under two definitions of pass rate. It is shown that, depending on the definition, the pass rate can go up or down with the number of attempts.
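
A toy illustration of why the definition matters, under independence assumptions the paper does not necessarily make: if every attempt passes with constant probability p, the proportion of examinees who pass within k attempts is

\[
1 - (1 - p)^k,
\]

which can only increase with k (e.g., p = 0.6 and k = 3 gives 1 - 0.4^3 = 0.936). A pass rate defined per attempt, or computed over a pool in which repeat test takers are disproportionately lower scoring, can instead decline as more attempts are allowed.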

Classification Consistency and Accuracy with Atypical Score Distributions Stella Kim and Won-Chan Lee, The University of Iowa

The primary purpose of this study is to evaluate the relative performance of various procedures for estimating classification consistency and accuracy indices with atypical score distributions. Three simulation studies are conducted, each of which is associated with a particular atypical observed score distribution.

A Psychometric Evaluation of Item-Level Scoring Rules for Educational Tests Frederik Coomans and Han van der Maas, University of Amsterdam; Peter van Rijn, ETS Global, Amsterdam; Marjan Bakker, Tilburg University; Gunter Maris, Cito Institute for Educational Measurement and University of Amsterdam

We develop a modeling framework in which psychometric models can be constructed directly from a scoring rule for dichotomous and polytomous items. By assessing the fit of such a model, we can infer the extent to which the population of test takers responds in accordance with the scoring rule.

For Want of Subscores in Large-Scale Educational Survey Assessment: A Simulation Study Nuo Xi, Yue Jia, Xueli Xu and Longjuan Liang, Educational Testing Service

The objective of the simulation study is to investigate the impact of varying the length of content-area subscales (overall and per examinee) for their prospective use in large-scale educational survey assessments. Sample size and estimation method are also controlled to evaluate the overall effect on the estimation of group statistics.

Comparability of Essay Scores Across Response-Modes: A Complementary View Using Multiple Approaches Nina Deng and Jennifer Dunn, Measured Progress

This study evaluates the comparability of essay scores between computer-typed and handwritten responses. Multiple approaches were integrated to provide a complementary view for assessing both the statistical and practical significance of essay score differences at the factorial, scoring-dimension, and item levels.

Monday, April 11, 2016 4:05 PM - 6:05 PM, Meeting Room 13/14, Meeting Room Level, Invited Session, M7

On the Use and Misuse of Latent Variable Scores Session Presenter: Anders Skrondal, Norwegian Institute of Public Health

One major purpose of latent variable modeling is scoring of latent variables, such as ability estimation. Another purpose is investigation of the relationships among latent (and possibly observed) variables. In this case the state-of-the-art approach is simultaneous estimation of a measurement model (for the relationships between latent variables and the items measuring them) and a structural model (for the relationships between different latent variables and between latent and observed variables). An alternative approach, often considered naive, is to use latent variable scores as proxies for latent variables. Here, estimation is simplified by first estimating the measurement model and obtaining latent variable scores, and subsequently treating the latent variable scores as observed variables in standard regression analyses. This approach will generally produce invalid estimates of the target parameters in the structural model, but we will demonstrate that valid estimates can be obtained if the scoring methods are judiciously chosen. Furthermore, the proxy approach can be superior to the state-of-the-art approach because it protects against certain misspecifications and allows doubly-robust causal inference in a class of latent variable models.
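
A minimal sketch of the proxy approach described above, in notation of our own (the abstract does not specify a particular model): with items $\mathbf{x}_i$ measuring a latent variable $\theta_i$ and a structural regression of interest $y_i = \beta\,\theta_i + \varepsilon_i$, the two-step approach first computes a score from the fitted measurement model,

$$\hat{\theta}_i = E(\theta_i \mid \mathbf{x}_i) \quad \text{(for example, an EAP or regression factor score)},$$

and then regresses $y_i$ on $\hat{\theta}_i$ as if it were observed. Because the score is a shrunken, error-contaminated proxy for $\theta_i$, the estimate of $\beta$ is generally biased, which is why the choice of scoring method determines whether the structural parameters can be recovered.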

Participant Index A Bertling, Masha ...... 74, 120 Betebenner, Damian ...... 44, . . 130 Abad, Francisco ...... 72 Betts, Joe ...... 173 Adamson, David ...... 173 . . Beverly, Tanesia ...... 89 Adesope, Olusola ...... 88. . . Beymer, Lisa ...... 74, 120 Adhikari, Sam ...... 146 Bian, Yufang ...... 61 Aguado, David ...... 72. . . Blood, Ian ...... 84. . . Akbay, Lokman ...... 74 Bo, Yuanchao Emily ...... 181 . . Albanese, Mark ...... 180 Boeck, Paul De ...... 49. . . Albano, Anthony ...... 32, 163 Bohrnstedt, George ...... 117 . . Ali, Usama ...... 116, . . 154, 180 Bolt, Daniel ...... 133 . . . Allexsaht-Snider, Martha ...... 104 Bolton, Sarah ...... 157. . . Almond, Russell ...... 70 . . Bond, Mark ...... 159 . . . Alpert, Tony ...... 67,. . 175 . Bonifay, Wes ...... 52 Alzen, Jessica ...... 89 . . Bottge, Brian ...... 106. . . Amati, Lucy ...... 118 . . . Boughton, Keith ...... 65, . . 152 An, Ji ...... 150, . 170. Boulais, André-Philippe ...... 164. . Anderson, Daniel ...... 59 Bowman, Trinell ...... 175. . . Andrews, Benjamin ...... 46,. . 105 Boyer, Michelle ...... 122, . . 164 Andrich, David ...... 165 . . Bradshaw, Laine ...... 74, . 86, 120, 127, 127 Ankenmann, Robert ...... 63 Brandstrom, Adele ...... 71 Antal, Judit ...... 50 . . Brandt, Steffen ...... 174 . . Austin, Bruce ...... 88 . . . Braun, Henry ...... 40 . . . Brennan, Robert ...... 30,. . 63, 131 Brenner, Daniel ...... 160 B Breyer, F . Jay ...... 45 Baker, Eva ...... 168 . . Breyer, Jay ...... 150. . . Baker, Ryan ...... 103 . . . Bridgeman, Brent ...... 134 . . Bakker, Marjan ...... 182 . . Briggs, Derek ...... 44 . . Balamuta, James ...... 155 Brijmohan, Amanda ...... 109 Banks, Kathleen ...... 154. . . Broaddus, Angela ...... 72. . . Bao, Yu ...... 122 . . . Broer, Markus ...... 117. . . Barocas, Solon ...... 48. . . Brophy, Tim ...... 40. . . Barrada, Juan ...... 72 . . Brown, Derek ...... 55 . . Barrett, Michelle ...... 16, . . 113 Brown, Emily ...... 109. . . Barry, Carol ...... 161 . . . Brown, Nathaniel ...... 40 Barton, Karen ...... 53, . . 152 Brown, Terran ...... 169 Bashkov, Bozhidar ...... 73 . . Brusilovsky, Peter ...... 106 . . Baumer, Michal ...... 83 Brussow, Jennifer ...... 91. . . Bazaldua, Diego Luna ...... 124 Bryant, Rosalyn ...... 76 Beard, Jonathan ...... 128. . . Buchholz, Janine ...... 132 Becker, Betsy ...... 172. . . Buckendahl, Chad ...... 18, . 83, 101, 112 Bejar, Isaac ...... 49 Buckley, Barbara ...... 160 . . Bejar, Isaac I ...... 49 Buckley, Jack ...... 128. . . Belov, Dmitry ...... 54, . . 113 Budescu, David ...... 105 Bennett, Randy ...... 47, 47, 142, 160, 160 Bukhari, Nurliyana ...... 65 Benson, Martin ...... 53. . . Bulut, Okan ...... 60, . . 171, 171, 171 Bertling, Jonas ...... 17, . . 57, 57 Burstein, Jill ...... 48. . . Bertling, Maria ...... 53. . . Bushaw, Bill ...... 100 . .

Buxton, Cory ...... 104. . . Choe, Edison ...... 65 . . . Buzick, Heather ...... 150 Choi, Hye-Jeong ...... 106 . . Choi, Ikkyu ...... 178 Choi, In-Hee ...... 68, 118 C Choi, Jinah ...... 63 Cahill, Aoife ...... 48. . . Choi, Jiwon ...... 90 . . Cai, Li . . 27, . 130, 143, 151, 151, 155, 162, 162, 170, 181 Choi, Kilchan ...... 151, . 151, 151, 151, 162 Cai, Liuhan ...... 163 Christ, Theodore ...... 167 . . Cai, Yan ...... 61 Chu, Kwang-lee ...... 166 Cain, Jessie Montana ...... 134. . . Chung, Kyung Sun ...... 133 Caliço, Tiago ...... 53 Chung, Seunghee ...... 63, . . 123 Camara, Wayne ...... 56, 81, 112, 145 Ci, Li ...... 170 Camilli, Greg ...... 133 Circi, Ruhan ...... 167 . . Canto, Phil ...... 172 Cizek, Greg ...... 55 . . Carstensen, Claus ...... 155 . . Clark, Amy ...... 51, 178 Casabianaca, Jodi ...... 91. . . Clauser, Amanda ...... 43, 178 Casabianca, Jodi ...... 159,. . 159, 159, 159 Clauser, Brian ...... 142. . . Castellano, Katherine Furgol . . . . 53,. 59, 80, 80, 80, Clauser, Jerome ...... 63, 142 130, 136, 141 Cohen, Allan ...... 104, 106 Cavalie, Carlos ...... 84. . . Cohen, Allan S ...... 73. . . Chajewski, Michael ...... 50 . . Cohen, Jon ...... 143 Chametsky, Matt ...... 179 Cohen, Michael ...... 100 Champlain, André De ...... 164 Colvin, Kimberly ...... 107 . . Chang, Hua-Hua ...... 61,. . 65 Conaway, Carrie ...... 67 . . Chang, Hua-hua ...... 91 . . Conforti, Peter ...... 159 . . Chang, Hua-Hua ...... 116 . . Confrey, Jere ...... 44 . . . Chattergoon, Rajendra ...... 126, . . 153 Connally, Mark ...... 180 . . Chatterji, Madhabi ...... 117 Cook, Linda ...... 142 . . Chayer, David ...... 152 Coomans, Frederik ...... 182 Chen, Feng ...... 72,. . 122 . Cottrell, Nicholas ...... 84 Chen, Hanwei ...... 85 Crabtree, Ashleigh ...... 62 Chen, Hui-Fang ...... 108 Crane, Samuel ...... 147 . . Chen, I-Chien ...... 146. . . Croft, Michelle ...... 55. . . Chen, Jie ...... 108 . . . Crouch, Lori ...... 149 . . Chen, Jing ...... 117. . . Cui, Zhongmin ...... 85,. . 106 Chen, Juan ...... 180 Cukadar, Ismail ...... 120 . . Chen, Keyu ...... 90 . . Culpepper, Steven ...... 155, 182 Chen, Pei-Hua ...... 83 Cúri, Mariana ...... 90 . . . Chen, Ping ...... 45 Chen, Tingting ...... 83. . . CHEN, TINGTING ...... 135 . . D Chen, Xin ...... 150 d’Brot, Juan ...... 102 . . Cheng, Britte ...... 40 . . . Dabbs, Beau ...... 146 Cheng, Ying ...... 154, . . 182 Dadey, Nathan ...... 107, . . 126, 167 Chien, Yuehmei ...... 86 Dai, Shenghai ...... 62, 135 Childs, Ruth ...... 109, . . 161 Daniels, Vijay ...... 53 . . . Cho, Youngmi ...... 62 Davey, Tim ...... 56 Cho, YoungWoo ...... 70 . . Davey, Timothy ...... 66, 169, 169

Participant Index Index Davier, Alina von . . . .22, 41, 48, 60, 60, 85, 90, 105, 118 F Davier, Matthias von ...... 57 . . Davier, Matthias Von ...... 68 . . Fabrizio, Lou ...... 168 Davier, Matthias von . . . . .103, . 103, 103, 103, 155 Fahle, Erin ...... 89 . . . Davis, Laurie ...... 84, 84, 142 Famularo, Lisa ...... 63 Davis-Becker, Susan ...... 71. . . Fan, Meichu ...... 70. . . Deane, Paul ...... 174 . . Fan, Yuyu ...... 60 Debeer, Dries ...... 116. . . Fang, Guoliang ...... 136 . . DeCarlo, Larry ...... 61 Fang, Yu ...... 180. . . DeCarlo, Lawrence ...... 86 Farley, Dan ...... 108 DeMars, Christine ...... 73, . . 107 Feinberg, Richard ...... 181, . . 181 Denbleyker, Johnny ...... 130 Ferrara, Steve ...... 28,. . 49 . Deng, Hui ...... 128 . . Fina, Anthony ...... 165 Deng, Nina ...... 182 Finch, Holmes ...... 167 Deters, Lauren ...... 43. . . Finger, Michael ...... 163 . . Dhaliwal, Tasmin ...... 127 . . Finn, Chester ...... 100. . . Diakow, Ronli ...... 68 . . Foltz, Peter ...... 29, 179 Diao, Hongyu ...... 121,. . . 165 Forte, Ellen ...... 79, 112 DiBello, Lou ...... 136 . . Freeman, Leanne ...... 164 DiCerbo, Kristen ...... 127. . . French, Brian ...... 88, . 167 . Ding, Shuliang ...... 91, . . 116 Frey, Andreas ...... 153. . . Dodd, Barbara ...... 91. . . Fu, Yanyan ...... 86, . 136. Dodson, Jenny ...... 42. . . Fung, Karen ...... 142 . . Dolan, Bob ...... 175 Futagi, Yoko ...... 179 . . Domingue, Benjamin ...... 89 . . Donoghue, John ...... 52, . 52, 115, 119, 133 Dorans, Neil ...... 119 . . G Dorn, Sherman ...... 157 . . Gafni, Naomi ...... 83 . . . Doromal, Justin ...... 115 Gandara, Fernanda ...... 162 . . Drasgow, Fritz ...... 142 Gao, Furong ...... 64 Du, Yi ...... 143,. . . 169 Gao, Lingyun ...... 109. . . Dunbar, Stephen ...... 108, 145 Gao, Xiaohong ...... 58,. . 90, 136, 162, 166 Dunbar, Steve ...... 43 Garcia, Alejandra ...... 162 Dunn, Jennifer ...... 18, . . 182 Garner, Holly ...... 173. . . Dunya, Beyza Aksu ...... 60 Gawade, Nandita ...... 59 Geis, Eugene ...... 133. . . Geisinger, Kurt ...... 142, . . 177 E Gelbal, Selahattin ...... 45. . . Easton, John ...... 157 Gessaroli, Marc ...... 181 . . Egan, Karla ...... 18, 101, 102 Gianopulos, Garron ...... 44. . . Embretson, Susan ...... 49 . . Gierl, Mark ...... 53, 83, 142, 147 Engelhardt, Lena ...... 153 Gill, Brian ...... 67 Ercikan, Kadriye ...... 40 . . Gitchel, Dent ...... 148. . . Erickan, Kadriye ...... 99 . . Glazer, Nancy ...... 179. . . Evans, Carla ...... 126 . . Goldhammer, Frank ...... 133,. . 153, 174 Ewing, Maureen ...... 128. . . Gong, Brian ...... 107, . . 126 Gonzalez, Oscar ...... 120. . . González-Brenes, José ...... 106,. . 144, 144 González-Brenes, José Pablo ...... 53 . .

Gotch, Chad ...... 64 Henson, Robert ...... 86, 110, 136 Grabovsky, Irina ...... 131,. . . 148 Herman, Joan ...... 47 Graesser, Art ...... 41 Herrera, Bill ...... 43,. . 108 . Graf, Edith Aurora ...... 49, . . 109 Heuvel, Jill R . van den ...... 141 Greiff, Samuel ...... 41, 103 Hillier, Tracey ...... 53 . . . Griffin, Patrick ...... 89 Himelfarb, Igor ...... 136 . . Grochowalski, Joe ...... 58 . . Ho, Andrew ...... 149 . . Grochowalski, Joseph ...... 148 Ho, Emily ...... 50 Groos, Janet Koster van ...... 88 Hochstedt, Kirsten ...... 123 Grosse, Philip ...... 123. . . Hochweber, Jan ...... 43 . . Gu, Lixiong ...... 66 . . Hoff, David ...... 149 Guerreiro, Meg ...... 108 . . Hogan, Thomas ...... 110 Gunter, Stephen ...... 163 . . Holmes, Stephen ...... 173 Guo, Hongwen ...... 65, . . 88, 119 Hong, Guanglei ...... 21 Guo, Qi ...... 63. . . Hou, Likun ...... 143 Guo, Rui ...... 91 . . . Hou, Xiaodong ...... 171 . . Houts, Carrie R ...... 27. . . Huang, Cheng-Yi ...... 83 H Huang, Chi-Yu ...... 164 Haberman, Shelby ...... 80, 150 Huang, Kevin (Chun-Wei) ...... 160 . . Hacker, Miriam ...... 133 . . Huang, Xiaorui ...... 46. . . Haertel, Edward ...... 142. . . Huang, Yun ...... 106 . . . Hain, Bonnie ...... 67 . . . Huff, Kristen ...... 149 Hakuta, Kenji ...... 79 . . . Huggins-Manley, Anne ...... 180 . . Hall, Erika ...... 112 . . Huggins-Manley, Anne Corinne ...... 46 . . Han, HyunSuk ...... 46 Hughes, Malorie ...... 147 . . Han, Kyung Chris ...... 22 Huh, Nooree ...... 164, 180 Han, Zhuangzhuang ...... 103. . . Hunter, C ...... 163 . . Hansen, Mark ...... 73,. . 135, 162, 181 Huo, Yan ...... 90, . 143 . Hao, Jiangang ...... 41 Hurtz, Gregory ...... 54,. . 134 Happel, Jay ...... 128 . . . Hwang, Dasom ...... 75 Harnly, Aaron ...... 147. . . Harrell, Lauren ...... 57, . . 155 Harring, Jeffrey ...... 62. . . I Harris, Debora ...... 152 . . Iaconangelo, Charles ...... 80, . . 123 Harris, Deborah ...... 25, 180 III, Kenneth J Daly ...... 168 . . Hartig, Johannes ...... 43, 132 Ing, Pamela ...... 163 . . . Hattie, John ...... 89. . . Insko, Bill ...... 50 Hayes, Benjamin ...... 51 . . Insko, William ...... 71 . . Hayes, Heather ...... 163 . . Invernizzi, Marcia ...... 50 Hayes, Stacy ...... 53 Irribarra, David Torres ...... 68 Hazen, Tim ...... 161 Iverson, Andrew ...... 76 . . He, Qingping ...... 154,. . . 173 He, Qiwei ...... 103, 103, 103 He, Yong ...... 70,. .85, . 106 J Hebert, Andrea ...... 135 Jacovidis, Jessica ...... 107 Hembry, Tracey ...... 127 Jang, Hyesuk ...... 73, . .171, 171 Hendrie, Caroline ...... 149 . . Jang, Yoonsun ...... 77. . .

Jess, Nicole ...... 74 . . Kim, Dong-in ...... 152. . . Jewsbury, Paul ...... 57. . . Kim, Doyoung ...... 85. . . Ji, Grace ...... 117. . . Kim, Han Yi ...... 54 . . Jia, Helena ...... 88 Kim, Hyung Jin ...... 131 . . Jia, Yue ...... 182 . . Kim, Ja Young ...... 70 Jiang, Shengyu ...... 75. . . Kim, Jinok ...... 162 . . Jiang, Yanming ...... 169, . . 169 Kim, Jong ...... 152 . . Jiang, Zhehan ...... 134 Kim, JP ...... 180 . . . Jiao, Hong ...... 46, . . 89, 109, 169, 170 Kim, Jungnam ...... 64, . . 152 Jin, Kuan-Yu ...... 107, . . 108 Kim, Meereem ...... 104 . . Jin, Rong ...... 50 . . Kim, Nana ...... 121 . . Johnson, Evelyn ...... 74, . . 120 Kim, Se-Kang ...... 58, . . 148 Johnson, Marc ...... 166 . . Kim, Seohyun ...... 104 Johnson, Matthew ...... 52, 80 Kim, Sooyeon ...... 64, 64, 90 Jones,, Ryan Seth ...... 44 Kim, Stella ...... 121, . . 182 Joo, Seang-hwane ...... 148 Kim, Sunhee ...... 45 Ju, Unhee ...... 121 . . Kim, Wonsuk ...... 135. . . Julian, Marc ...... 152 . . Kim, Yongnam ...... 151 . . Julrich, Daniel ...... 131 Kim, Young Yee ...... 19,. . 117 Jung, KwangHee ...... 62 King, Kristin ...... 63. . . Junker, Brian ...... 159, . . 159, 159 Kingston, Neal ...... 47, . . 178 Kingston, Neal Martin ...... 134 Klieme, Eckhard ...... 174. . . K Kobrin, Jennifer ...... 89, 127 Kaliski, Pamela ...... 40, . . 128 Kögler, Kristina ...... 174 . . Kamenetz, Anya ...... 149. . . Köhler, Carmen ...... 155 Kane, Joanne ...... 180. . . Koklu, Onder ...... 172. . . Kane, Michael ...... 145 Kolen, Michael ...... 30, . . 142 Kang, Hyeon-Ah ...... 119 . . Kong, Xiaojing ...... 84, . 84 . Kang, Yoon Jeong ...... 109 . . Konold, Tim ...... 63. . . Kang, Yujin ...... 131 Kosh, Audra ...... 76. . . Kannan, Priya ...... 43,. . 175, 178, 178, 178 Kroehne, Ulf ...... 174 Kanneganti, Raghuveer ...... 129, . 129, 150 Kröhne, Ulf ...... 133 Kao, Shu-chuan ...... 166 Krost, Kevin ...... 76. . . Kaplan, David ...... 57 Krovetz, Bob ...... 179 Kapoor, Shalini ...... 43. . . Kuger, Susanne ...... 174 Karadavut, Tugba ...... 73. . . Kuhfeld, Megan ...... 143, . . 151, 170 Karvonen, Meagan ...... 51, . . 79, 89, 178 Kuo, Tzu Chun ...... 124 Keller, Lisa ...... 18,. . 135, 165 Kupermintz, Haggai ...... 41 Keller, Rob ...... 135. . . Kyllonen, Patrick ...... 17,. . 41, 177 Kelly, Justin ...... 42 . . Keng, Leslie ...... 142 . . Kenyon, Dorry ...... 42, . 79 . L Kern, Justin ...... 65 . . LaFond, Lee ...... 18. . . Khan, Gulam ...... 109. . . Lai, Emily ...... 89 Kieftenbeld, Vincent ...... 104, 129, 150, 164 Lai, Hollis ...... 53, 142 Kilinc, Murat ...... 148 Laitusis, Cara ...... 142. . . Kim, Dong-In ...... 64,. . 65, . 152 Lane, Suzanne ...... 40, . 79, . 128

Lao, Hongling ...... 86 Lin, Haiyan ...... 162 Larsson, Lisa ...... 105 Lin, Johnny ...... 88 . . Lash, Andrea ...... 51 . . . Lin, Meiko ...... 117. . . Lathrop, Quinn ...... 50, . .91 Lin, Pei-ying ...... 166 . . Latifi, Syed Muhammad Fahad ...... 147 . Lin, Peng ...... 116 Lawless, Rene ...... 177 Lin, Ye ...... 130 Lawson, Janelle ...... 115 Linden, Wim van der ...... 83, . . 113 Leacock, Claudia ...... 29, . . 104, 129 Ling, Guangming ...... 84, . . 129 Lebeau, Adena ...... 45. . . Lissitz, Robert ...... 46, 89, 109 LeBeau, Brandon ...... 56 Liu, Cheng ...... 182 Lee, Chansoon ...... 109 . . Liu, Chunyan ...... 85,. . 85 . Lee, Chong Min ...... 179 Liu, Hongyun ...... 107,. . . 110 Lee, Daniel ...... 76 Liu, Jinghua ...... 134 . . Lee, HyeSun ...... 167 Liu, Lei ...... 41, . 41 . Lee, Philseok ...... 148. . . Liu, lou ...... 135 . . Lee, Richard ...... 55. . . Liu, Ou Lydia ...... 167. . . Lee, Sora ...... 133 . . . Liu, Qiongqiong ...... 161. . . Lee, Won-Chan . . . 63, . 70, 90, 105, 131, 131, 180, 182 Liu, Ren ...... 77 Lee, Woo-yeol ...... 75 Liu, Ruitao ...... 106. . . Lee, Yi-Hsuan ...... 60 . . Liu, Xiang ...... 86. . . Lee, Young-Sun ...... 61, 86 Liu, Yang ...... 119,. . . 132, 153 Lei, Ming ...... 171,. . . 171 Liu, Yanlou ...... 61 Leibowitz, Emily ...... 178,. . . 178 Liu, Yue ...... 107 Leighton, Jacqueline ...... 63, . . 127 Lockwood, J .R ...... 80, . 80, 80, 80, 80, 130 Leventhal, Brian ...... 74, . 77, 120, 181 Lockwood, John ...... 165 . . Levy, Roy ...... 127 Longabach, Tanya ...... 64 . . Lewis, Charles ...... 45, 52, 85, 148, 181 Lopez, Alexis ...... 84, . 162 . Li, Chen ...... 150, 170, 174 Lopez, Melissa ...... 179 . . Li, Cheng-Hsien ...... 62 Lord-Bessen, Jennifer ...... 71 . . Li, Dongmei ...... 180 . . Lorié, William ...... 117,. . . 144 Li, Feifei ...... 66, 143 Lottridge, Susan ...... 104. . . Li, Feiming ...... 88 Loughran, Jessica ...... 65, . .91 Li, Isaac ...... 77 Loukina, Anastassia ...... 134,. . 150, 179, 179 Li, Jie ...... 83 . . . Loveland, Mark ...... 160 Li, Ming ...... 89 Lu, Chi ...... 179 . . . Li, Tongyun ...... 62, . 65. Lu, Lucy ...... 59 Li, Xiaomin ...... 61 Lu, Ru ...... 64,. . 64 . Li, Xin ...... 25, . . 70, 110, 118, 164 Lu, Yang ...... 164. . . Li, Ying ...... 64. . . Lu, Ying ...... 66, 133 Li, Zhushan ...... 163 . . . Lu, Zhenqui ...... 104 . . Liang, Longjuan ...... 161,. . . 182 LUO, Fen ...... 91 . . Liao, Chi-Wen ...... 116. . . Luo, Xiao ...... 45, 85 Liao, Dandan ...... 46 . . . Luo, Xin ...... 45, 123, 165 Liaw, Yuan-Ling ...... 163 Lynch, Ryan ...... 75. . . Lievens, Filip ...... 72 Lyons, Susan ...... 126. . . Lim, Euijin ...... 131. . . Lim, EunYoung ...... 92. . . Lim, MiYoun ...... 61 Lin, Chih-Kai ...... 58

Participant Index M Mix, Daniel ...... 85 Miyazaki, Yasuo ...... 132 Ma, Wenchao ...... 90 . . Monroe, Scott ...... 73, 130 Maas, Han van der ...... 182 Montee, Meg ...... 79 . . MacGregor, David ...... 42 . . Montee, Megan ...... 42 . . Macready, George ...... 62 . . Moon, Jung Aa ...... 88. . . Madnani, Nitin ...... 48,. . 179 Moretti, Antonio ...... 144,. . 144 Maeda, Hotaka ...... 75. . . Morgan, Deanna ...... 71, . . 115 Magaram, Eric ...... 136 Morin, Maxim ...... 164 Magnus, Brooke ...... 153. . . Morris, Carrie ...... 118,. . . 148 Malone, Meg ...... 40 . . . Morris, John ...... 69 Mao, Liyang ...... 65,. . 104 . Morrisey, Sarah ...... 163 Mao, Xia ...... 85 . . . Morrison, Kristin ...... 49,. . 84 Marais, Ida ...... 165 Moses, Tim ...... 128, 134 Margolis, Melissa ...... 142 Mroch, Andrew ...... 180 Marini, Jessica ...... 128 Mueller, Lorin ...... 54, . . 119, 166 Marion, Scott ...... 40, . 44, 126, 126 Mulholland, Matthew ...... 104 Maris, Gunter ...... 182. . . Muntean, William ...... 173 . . Martineau, Joe ...... 158 . . Murphy, Stephen ...... 50, 71 Martineau, Joseph ...... 44, 101 Musser, Samantha ...... 42 . . Martínez, Jr, Carlos ...... 168 Masri, Yasmine El ...... 155 Masters, Jessica ...... 63 N Matlock, Ki ...... 46, 148 Naumann, Alexander ...... 43 . . Matos-Elefonte, Haifa ...... 161, 177 Naumann, Johannes ...... 153. . . Matovinovic, Donna ...... 67 Naumenko, Oksana ...... 136 . . Matta, Tyler ...... 50 . . Nebelsick-Gullet, Lori ...... 43, . . 101 Maul, Andrew ...... 117, 153 Nebelsick-Gullett, Lori ...... 108 . . Mayfield, Elijah ...... 173 . . Neito, Ricardo ...... 74, 120 Mazany, Terry ...... 100. . . Nicewander, W ...... 51 McBride, Yuanyuan ...... 84, . .84 Niekrasz, John ...... 147 . . McCaffrey, Daniel ...... 80, . . 80, 130 Nieto, Ricardo ...... 159 McCall, Marty ...... 79 . . Noh, Eunhee ...... 92 . . . McClellan, Catherine ...... 115, . . 154 Norris, Mary ...... 76. . . McKnight, Kathy ...... 144,. . 144, 158 Norton, Jennifer ...... 42 . . McMillan, James H ...... 168 Nydick, Steven ...... 85. . . McTavish, Thomas ...... 144. . . Meador, Chris ...... 53 Meadows, Michelle ...... 154,. . 173 O Mehta, Vandhana ...... 53. . . O’Brien, Sue ...... 84. . . Meng, Xiangbing ...... 73 O’Connor, Brendan ...... 48 . . Mercado, Ricardo ...... 71. . . O’Leary, Timothy ...... 89 Meyer, Patrick ...... 50, 115 O’Reilly, Tenaha ...... 160 Meyer, Robert ...... 59 Oakes, Jeannie ...... 125 . . Miel, Shayne ...... 147, 173 Ogut, Burhan ...... 117. . . Miller, Sherral ...... 128. . . Oh, Hyeon-Joo ...... 63. . . Minchen, Nathan ...... 74 Olea, Julio ...... 72 . . . Mislevy, Robert ...... 177 Olgar, Süleyman ...... 70, . . 172 Mix, Dan ...... 43 . .

Oliveri, Maria Elena ...... 177 . . Quellmalz, Edys ...... 160 Olsen, James ...... 108. . . Quenemoen, Rachel ...... 175. . . Oppenheim, Peter ...... 157 Orpwood, Graham ...... 109 Oshima, T ...... 163 . . R Özdemir, Burhanettin ...... 45 Rahman, Nazia ...... 52. . . Rankin, Jenny ...... 178 Rausch, Andreas ...... 174 . . P Rawls, Anita ...... 128 . . Pak, Seohong ...... 45 . . Raymond, Mark ...... 181 Palma, Jose ...... 60,. . 153 . Reboucas, Daniella ...... 154 . . Pan, Tianshu ...... 46 Reckase, Mark ...... 20, 42, 45, 99, 116, 142, 165 Papa, Frank ...... 88 . . Redell, Nick ...... 161 . . . Papageorgiou, Spyridon ...... 178 . . Reichenberg, Ray ...... 74, . . 74, 120 Pardos, Zachary ...... 144. . . Renn, Jennifer ...... 42 Park, Jiyoon ...... 54,. . 119 . Reshetar, Rosemary ...... 128 . . Park, Trevor ...... 155 . . . Reshetnyak, Evgeniya ...... 85 Park, Yoon Soo ...... 61. . . Ricarte, Thales ...... 90. . . Pashley, Peter ...... 52, 89 Rich, Changhua ...... 110 Patel, Priyank ...... 71, . . 115 Rick, Francis ...... 43,. . 178 . Patelis, Thanos ...... 112, . . 145, 145, 177 Rickels, Heather ...... 108. . . Patterson, Brian ...... 159 Rijiman, Frank ...... 152 Patz, Rich ...... 98 Rijmen, Frank ...... 65 . . Peabody, Michael ...... 71. . . Rijn, Peter van ...... 63, 109, 116, 132, 155, 182 Peck, Fred ...... 44. . . Rios, Joseph ...... 132, 181 Peng, Luyao ...... 129 . . Risk, Nicole ...... 60 . . Perie, Marianne ...... 65, 79, 108, 157 Roberts, Mary Roduta ...... 64 Peterson, Mary ...... 51. . . Robin, Frederic ...... 65,. . 119 Phadke, Chaitali ...... 167. . . Rodriguez, Michael . . . . . 32,. . 56, 60, 153, 162, 171 Pham, Duy ...... 165 Rogers, H . Jane ...... 109 . . Phan, Ha ...... 118 . . . Rölke, Heiko ...... 174 Phelan, Jonathan ...... 117 Rollins, Jonathan ...... 86, 115 Phillips, S E ...... 158 Rome, Logan ...... 121. . . Phillips, S .E ...... 55 . . Romine, Russell Swinburne ...... 51, . . 89, 175 Plake, Barbara ...... 112, 158 Roohr, Katrina ...... 167 Pohl, Steffi ...... 155. . . Rorick, Beth ...... 69. . . Polikoff, Morgan ...... 67 . . Rosen, Yigal ...... 41, . 41. Por, Han-Hui ...... 105, 134 Rosenstein, Mark ...... 179 Powers, Donald ...... 134 Roussos, Louis ...... 54, . . 135 Powers, Sonya ...... 105 . . Rubright, Jonathan ...... 46,. . 85, 147, 181 Runyon, Christopher ...... 91 . . Rupp, André ...... 29, 53 Q Rutkowski, Leslie ...... 57, 118 QIAN, HAIXIA ...... 134. . . Rutstein, Daisy ...... 40, . . 147 Qian, Hong ...... 45 . . Qian, Jiahe ...... 132 Qiu, Xue-Lan ...... 148. . . S Qiu, Yuxi ...... 180. . . Sabatini, John ...... 160

Sabol, Robert ...... 40 . . Skrondal, Anders ...... 182 Şahin, Füsun ...... 63 Smiley, Whitney ...... 161. . . Sahin, Sakine Gocer ...... 88. . . Smith, Kara ...... 128 . . . Saiar, Amin ...... 54 Smith, Robert ...... 45 Sakano, Jenifer ...... 179 . . Smith, Weldon ...... 110 . . Sakano, Jennifer ...... 179 . . Snow, Eric ...... 147 . . Sakworawich, Arnond ...... 105 . . Somasundaran, Swapna ...... 179 . . Salleb-Aouissi, Ansaf ...... 144, . . 144 Song, Hao ...... 90, . 161. Samonte, Kelli ...... 165 Song, Lihong ...... 116. . . Sanders, Elizabeth ...... 163. . . Sorrel, Miguel ...... 72 Sandrock, Paul ...... 40. . . Sparks, Sarah ...... 149. . . Sano, Makoto ...... 173. . . Stafford, Rose ...... 91 Sato, Edynn ...... 89,. . 177 . Stanke, Luke ...... 60 Sauder, Derek ...... 120 Stark, Stephen ...... 148 . . Schmigdall, Jonathan ...... 84 Stecher, Brian ...... 160. . . Schneider, Christina ...... 28 Steinberg, Jonathan ...... 160 Schneider, Christy ...... 179. . . Sternod, Latisha ...... 74, . . 120 Schultz, Matthew ...... 147 . . Stevens, Joseph ...... 59 . . Schwarz, Rich ...... 20 . . Stewart, John ...... 147. . . Schwarz, Richard ...... 88 Stockford, Ian ...... 173. . . Scott, Lietta ...... 108 . . Stone, Clement ...... 26 Secolsky, Charles ...... 136 Stone, Elizabeth ...... 54, . . 80, 142 Sedivy, Sonya ...... 109 Stout, Bill ...... 136 Segall, Dan ...... 45 Strain-Seymour, Ellen ...... 142 Seltzer, Michael ...... 151, 151 Stuart, Elizabeth ...... 151 . . Semmelroth, Carrie ...... 115 . . Su, Dan ...... 57 Sen, Sedat ...... 62 . . . Su, Yu-Lan ...... 83 . . . Sgammato, Adrienne ...... 52,. . 52 SU, YU-LAN ...... 135 . . . Sha, Shuying ...... 110. . . Suh, Hongwook ...... 180. . . Shao, Can ...... 123, . . 166 Sukin, Tia ...... 51, 115 Sharairi, Sid ...... 50,. . 133 . Sullivan, Meghan ...... 31, 72 Shaw, Emily ...... 128 . . Sun, Yu ...... 147 . . Shear, Benjamin ...... 80, . . 110 Sung, Kyunghee ...... 92 . . Sheehan, Kathleen ...... 179 Svetina, Dubravka . . . . . 62,. 74, 118, 120, 135, 163 Shepard, Lorrie ...... 126 . . Swaminathan, Hariharan ...... 109 Shermis, Mark ...... 51, 104, 104 Sweet, Shauna ...... 83. . . Shin, Hyo Jeong ...... 68 . . Sweet, Tracy ...... 146 Shin, Nami ...... 162 Swift, David ...... 133 . . Shipman, Michelle ...... 89 Shmueli, Doron ...... 128 Shropshire, Kevin ...... 132 . . T Shukla, Kathan ...... 63. . . Tan, Amy ...... 53 Shuler, Scott ...... 40 Tan, Xuan-Adele ...... 161. . . Shute, Valerie ...... 127. . . Tang, Wei ...... 63 Sikali, Emmanuel ...... 19 Tannenbaum, Richard ...... 178 . . Silberglitt, Matt ...... 160 Tao, Jian ...... 170. . . Sinharay, Sandip ...... 164,. . 181 Tao, Shuqin ...... 85,. . 143 . Sireci, Stephen ...... 142, . . 145 Templin, Jonathan ...... 31, 72, 72, 86 Skorupski, William ...... 72, . 91, 106, 134

Terzi, Ragip ...... 136 . . . Wang, Caroline ...... 59. . . Tessema, Aster ...... 15, . . 147 Wang, Changjiang ...... 109 Thissen, David ...... 44. . . Wang, Chun ...... 45, 170 Thissen-Roe, Anne ...... 163 Wang, Hongling ...... 166. . . Thompson, Tony ...... 180 . . Wang, Jui-Sheng ...... 83 . . Thum, Yeow Meng ...... 50, 110 WANG, JUI-SHENG ...... 135 Thummaphan, Phonraphee ...... 126 Wang, Keyin ...... 135 Thurlow, Martha ...... 175 . . Wang, Lu ...... 180 Tian, Wei ...... 61,. .91 . Wang, Min ...... 105 Tomkowicz, Joanna ...... 64,. . 152 Wang, Richard ...... 150 . . Tong, Ye ...... 131. . . Wang, Shichao ...... 85,. . 122 Topczewski, Anna ...... 50 . . Wang, Shudong ...... 169. . . Torre, Jimmy de la ...... 72, . . 136 Wang, Tianyu ...... 78 . . Torre, Jummy de la ...... 143 Wang, Wei ...... 116. . . Towles, Elizabeth ...... 43 Wang, Wen-Chung ...... 61, 107, 108, 148 Toyama, Yukie ...... 77 Wang, Wenyi ...... 116. . . Trang, Kim ...... 134. . . Wang, Xi ...... 119 . . . Trierweiler, Tammy ...... 45, 148 Wang, Xiaolin ...... 62, . . 135 Tu, Dongbo ...... 61. . . Wang, Xiaoqing ...... 91 . . Turner, Charlene ...... 43, . . 108 Wang, Zhen ...... 147 . . Turner, Ronna ...... 148 Way, Walter ...... 84,. . 142 . Tzou, Hueying ...... 86 Weegar, Johanna ...... 89 Weeks, Jonathan ...... 60, 116, 160 Wei, Hua ...... 70, . 164 . U Wei, Xiaoxin ...... 50,. . 115 . Underhill, Stephanie ...... 62 . . Wei, Youhua ...... 165, 179 University, Sacred Heart ...... 168. . Weiner, John ...... 54, . 134 . Weiss, David ...... 167 Welch, Catherine . . . . 43, 56, 56, 62, 90, 108, 145, 161 V Wendler, Cathy ...... 51,. . 179 West, Martin ...... 157 van der Linden, Wim ...... 16 . . White, Lauren ...... 172 Vansickle, Tim ...... 55 Whittington, Dale ...... 168 . . Vasquez-Colina, Maria Donata ...... 69. . Wiberg, Marie ...... 60 Veldkamp, Bernard ...... 113 . . Widiatmo, Heru ...... 119 Vispoel, Walter ...... 148 . . Wiley, Andrew ...... 112 . . VonDavier, Alina ...... 15 . . Williams, Elizabeth ...... 120 Vue, Kory ...... 153 Williams, Jean ...... 84 Willis, James ...... 48 W Willoughby, Michael ...... 153. . . Willse, John ...... 165 . . Wain, Jennifer ...... 84 Wilmes, Carsten ...... 175. . . Wainer, Howard ...... 181. . . Wilson, Mark ...... 68,. . 68, 68 Walker, Cindy ...... 69, . . 154 Wind, Stefanie ...... 71. . . Walker, Cindy M ...... 88 . . Winter, Phoebe ...... 51, 79 Walker, Michael ...... 180 Wise, Lauress ...... 69 . . . Wan, Ping ...... 64, . 152. Wise, Laurie ...... 149 . . wang, aijun ...... 166 . . . Wollack, James ...... 109 . . Wang, Ann ...... 167 Woo, Ada ...... 85

Participant Index Wood, Scott ...... 147 . . Z Wu, Yi-Fang ...... 86, . 90. Wüstenberg, Sascha ...... 103 Zapata-Rivera, Diego ...... 127,. . 178 Wyatt, Jeff ...... 128. . . Zechner, Klaus ...... 150 . . Wyse, Adam ...... 44 Zenisky, April ...... 178. . . Zhan, Peida ...... 61 . . Zhang, Bo ...... 164 . . X Zhang, Jiahui ...... 74 . . Zhang, Jin ...... 167. . . Xi, Nuo ...... 182 . . Zhang, Jinming ...... 58 Xiang, Shibei ...... 91 . . . Zhang, Litong ...... 152 Xie, Chao ...... 171, 171 Zhang, Mengyao ...... 105, 180 Xie, Qing ...... 90, . 120 . Zhang, Mingcai ...... 75 Xin, Tao ...... 61, 91, 135 Zhang, Mo ...... 29, 160, 174 Xing, Kuan ...... 61, 122 Zhang, Oliver ...... 15 . . Xiong, Jianhua ...... 91. . . Zhang, Susu ...... 78 Xiong, Xinhui ...... 104. . . Zhang, Xinxin ...... 77 Xiong, Yao ...... 122. . . Zhang, Xue ...... 170 . . . Xu, Jing-Ru ...... 65 Zhang, Ya ...... 121 . . Xu, Ran ...... 150 . . Zhang, Yu ...... 54, . 119. Xu, Ting ...... 105 zhang, Yu ...... 166 Xu, Xueli ...... 182 . . . Zhao, Tuo ...... 150 . . Zhao, Yang ...... 71, 115 Y Zhao, Yihan ...... 86 . . Zheng, Bin ...... 142 Yakimowski, Mary E ...... 168 . . Zheng, Chanjin ...... 61, 73 Yan, Duanli ...... 22, . 85. Zheng, Chunmei ...... 86 . . Yan, Ning ...... 86 Zheng, Qiwen ...... 146 Yang, Ji Seung ...... 73, . 132, 170, 170, 170 Zheng, Xiaying ...... 170, . . 170 Yang, Jiseung ...... 151. . . Zheng, Yi ...... 166 Yang, Tao ...... 135 Zhu, Mengxiao ...... 146, . . 174 Yao, Lihua ...... 20, . . 45, 88, 107, 147 Zhu, Rongchun ...... 90, 136, 166 Yao, Lili ...... 80. . . Zhu, Shi ...... 115 Ye, Feifei ...... 105. . . Zweifel, Michael ...... 110. . . Ye, Sangbeak ...... 87 . . . Yi, qin ...... 135 Yi, Qing ...... 70 Yin, Ping ...... 43 . . Yoo, Hanwook ...... 63, . . 133 Yoon, Su-Youn . . . . . 129,. . 129, 150, 179, 179, 179 Yu, Xin ...... 42. . .


Contact Information for Individual and Coordinated Sessions First Authors

Aksu Dunya, Beyza Bennett, Randy E University of Illinois at Chicago ETS baksu2@uic .edu rbennett@ets .org

Ali, Usama S. Bertling, Maria Educational Testing Service Harvard University uali@ets .org mbertling@g .harvard .edu

Alzen, Jessica Beverly, Tanesia School of Education University of Colorado Boulder University of Connecticut jessica .alzen@gmail .com tanesia .beverly@uconn .edu

Amati, Lucy Bo, Yuanchao Emily Educational Testing Service University of California, Los Angeles lamati@ets .org ybo@ucla .edu

An, Ji Bond, Mark University of Maryland The University of Texas at Austin jian12@umd .edu obipam@gmail .com

Anderson, Daniel Bonifay, Wes E University of Oregon University of Missouri daniela@uoregon .edu bonifay .w@gmail .com

Andrews, Benjamin Boyer, Michelle ACT University of Massachusetts, Amherst Benjamin .Andrews@act .org mlboyer@umass .edu

Andrich, David Bradshaw, Laine University of Western Australia University of Georgia david .andrich@uwa edu. .au laineb@uga edu.

Austin, Bruce W Brandt, Steffen Washington State University Art of Reduction bwaustin@wsu .edu steffen .brandt@artofreduction .com

Banks, Kathleen Breyer, Jay F. LEAD Public Schools ETS kbfiredup08@yahoo .com fbreyer@ets .org

Barry, Carol L Bridgeman, Brent The College Board Educational Testing Service cabarry@collegeboard .org bbridgeman@ets .org

Barton, Karen Briggs, Derek C Learning Analytics University of Colorado Karen_Barton@discovery .com derek .briggs@colorado .edu

Bashkov, Bozhidar M Broaddus, Angela American Board of Internal Medicine Center for Educational Testing and Evaluation bbashkov@abim .org University of Kansas broaddus@ku .edu Bejar, Isaac I. ETS Brown, Derek ibejar@ets .org Oregon Department of Education derek .brown@state .or .us

Buchholz, Janine Carstens, Ralph German Institute for International Educational International Association for the Evaluation of Research (DIPF) Educational Achievement (IEA) Data Processing and buchholz@dipf .de Research Center ralph .carstens@iea-dpc .de Buckendahl, Chad W. Alpine Testing Solutions, Inc . Castellano, Katherine Furgol Chad .Buckendahl@alpinetesting .com Educational Testing Service (ETS) kecastellano@ets .org Buckley, Jack College Board Chattergoon, Rajendra jbuckley@collegeboard .org University of Colorado, Boulder rajendra .chattergoon@colorado .edu Bukhari, Nurliyana University of North Carolina at Greensboro Chattergoon, Rajendra n_bukhar@uncg .edu University of Colorado, Boulder rajendra .chattergoon@colorado .edu Bulut, Okan University of Alberta Chatterji, Madhabi bulut@ualberta .ca Teachers College, Columbia University mb1434@tc .columbia .edu Buzick, Heather Educational Testing Service Chen, Feng hbuzick@ets .org The University of Kansas chenfeng27@ku .edu Cai, Li UCLA/CRESST Chen, Hui-Fang lcai@ucla .edu City University of Hong Kong hfchen@cityu .edu .hk Cai, Liuhan University of Nebraska-Lincoln Chen, Jie cliuhan@gmail .com Center for Educational Testing and Evaluation jie-chen@ku .edu Cain, Jessie Montana University of North Carolina at Chapel Hill Chen, Juan jcain@live .unc .edu National Conference of Bar Examiners jchen@ncbex .org Caliço, Tiago A University of Maryland Chen, Keyu tcalico@umd .edu University of Iowa keyu-chen@uiowa .edu Camara, Wayne ACT Chen, Pei-Hua Wayne .Camara@act .org National Chiao Tung University peihuamail@gmail .com Camara, Wayne J. ACT Chen, Ping wayne .camara@act .org Beijing Normal University pchen@bnu .edu .cn Canto, Phil Florida Department of Education Chen, Tingting Phil .Canto@fldoe .org ACT, Inc . tingting .chen@act .org Carroll, Patricia E University of California - Los Angeles patriciacarroll@ucla .edu

Chen, Xin Cizek, Greg Pearson University of North Carolina at Chapel Hill xin .chen@pearson .com cizek@unc .edu

Cheng, Ying Alison Clark, Amy K. University of Notre Dame University of Kansas ycheng4@nd .edu akclark@ku .edu

Childs, Ruth A Clauser, Amanda L. Ontario Institute for Studies in Education, University National Board of Medical Examiners of Toronto AClauser@nbme .org ruth .childs@utoronto .ca Cohen, Allan Cho, Youngmi University of Georgia Pearson acohen@uga .edu youngmi .cho@pearson .com Cohen, Jon Choi, Hye-Jeong American Institutes for Research University of Georgia jcohen@air .org hjchoi1@uga .edu Colvin, Kimberly F Choi, In-Hee University at Albany, SUNY University of California, Berkeley kcolvin@albany .edu fermata11@gmail .com Conforti, Peter Choi, In-Hee The University of Texas at Austin University of California, Berkeley peter .conforti@utexas .edu ineechoi@berkeley .edu Confrey, Jere Choi, Jinah North Carolina State University The University of Iowa jconfre@ncsu .edu jinah-choi@uiowa .edu Contributor Last Name, Contributor First Name Middle Choi, Jiwon Initial ACT/University of Iowa Company/Institution jw0326@gmail .com Contributor Email Address

Choi, Kilchan Coomans, Frederik CRESST/UCLA University of Amsterdam kcchoi@ucla .edu frecoomans@gmail .com

Choi, Kilchan Cottrell, Nicholas D CRESST/UCLA Fulcrum kcchoi@ucla .edu ncottrell@fulcrumco .com

Chu, Kwang-lee Crabtree, Ashleigh R Pearson University of Iowa kwang-lee .chu@pearson .com ashleigh-crabtree@uiowa .edu

Chung, Kyung Sun Crane, Samuel Pennsylvania State University Amplify kuc182@psu .edu scrane@amplify .com

Circi, Ruhan Croft, Michelle University of Colorado Boulder ACT, Inc . ruhan .circi@colorado .edu michelle .croft@act .org

Cui, Zhongmin Diao, Hongyu ACT, Inc . University of Massachusetts-Amherst zhongmin .cui@act .org denisediao@gmail .com

Culpepper, Steven Andrew DiCerbo, Kristen University of Illinois at Urbana-Champaign Pearson sculpepp@illinois .edu kristen .dicerbo@pearson .com

Dadey, Nathan Donata Vasquez-Colina, Maria The National Center for the Improvement of Florida Atlantic University Educational Assessment mvasque3@fau .edu ndadey@nciea .org Donoghue, John R Davey, Tim Educational Testing Service Educational Testing Service jdonoghue@ets .org TDAVEY@ETS .ORG Du, Yi Davis, Laurie L Educational Testing Service Pearson ydu@ETS .org laurie .davis@pearson com. Du, Yi d’Brot, Juan Educational Testing Services DRC ydu@ETS .org JD’Brot@DataRecognitionCorp .com Egan, Karla De Boeck, Paul NCIEA Ohiao State University karlaegan@gmail .com deboeck .2@osu .edu Embretson, Susan E Debeer, Dries Georgia Institute of Technology University of Leuven susan .embretson@psych .gatech .edu dries .debeer@ppw .kuleuven be. Engelhardt, Lena DeCarlo, Lawrence T. German Institute for International Educational Teachers College, Columbia University Research decarlo@tc .edu lengelhardt@dipf .de

DeMars, Christine E. Evans, Carla M. James Madison University University of New Hampshire demarsce@jmu .edu carla m. evans@gmail. .com

Denbleyker, Johnny Fan, Meichu Houghton Mifflin Harcourt ACT, Inc johnny .denbleyker@hmhco .com xin li@act. .org

Deng, Nina Fan, Yuyu Measured Progress Fordham University nndeng@gmail .com yuyufan3@gmail .com

Deters, Lauren Farley, Dan edCount, LLC University of Oregon ldeters@edcount .com dfarley@uoregon edu.

Dhaliwal, Tasmin Feinberg, Richard A Pearson National Board of Medical Examiners tasmin .dhaliwal@pearson .com RFeinberg@nbme .org

Fina, Anthony D Gocer Sahin, Sakine Iowa Testing Programs, University of Iowa Hacettepe University anthony-fina@uiowa .edu sgocersahin@gmail .com

Finch, Holmes Gong, Brian Ball State University National Center for the Improvement of Educational whfinch@bsu .edu Assessment BGong@nciea org. Foltz, Peter W. Pearson and University of Colorado Boulder González-Brenes, José peter .foltz@pearson com. Center for Digital Data, Analytics & Adaptive Learning, Pearson Forte, Ellen jose .gonzalez-brenes@pearson .com edCount eforte@edcount .com González-Brenes, José Pablo Pearson Forte, Ellen jose .gonzalez-brenes@pearson .com edCount, LLC eforte@edcount .com Grabovsky, Irina NBME Freeman, Leanne igrabovsky@nbme .org University of Wisconsin, Milwaukee leannes4@uwm .edu Graesser, Art University of Memphis Fu, Yanyan art .graesser@gmail .com UNCG y_fu2@uncg .edu Graf, Edith Aurora ETS Gafni, Naomi agraf@ets .org National Institute for Testing & Evaluation naomi@nite .org .il Greiff, Samuel University of Luxemburg Gao, Lingyun samuel .greiff@uni .lu ACT, Inc . lingyun .gao@act .org Grochowalski, Joe The College Board Gao, Xiaohong jgrochowalski@collegeboard .org ACT, Inc . xiaohong .gao@act .org Gu, Lixiong Educational Testing Service Garcia, Alejandra Amador lgu@ets .org University of Massachusetts Alejandra .AmadorGarcia@gmail .com Guo, Hongwen ETS Geis, Eugene J hguo@ets .org Rutgers Graduate School of Education eugene .geis@gse .rutgers .edu Guo, Rui University of Illinois at Urbana-Champaign Geisinger, Kurt F. ruiguo1@illinois .edu Buros Center for Testing, University of Nebraska- Lincoln Hacker, Miriam kgeisinger@buros .org The German Institute for International Educational Research (DIPF) Centre for International Student Gessaroli, Marc E Assessment (ZIB) National Board of Medical Examiners hacker@dipf .de mgessaroli@nbme .org

Hakuta, Kenji Hogan, Thomas P Stanford University University of Scranton hakuta@stanford .edu thomas .hogan@scranton edu.

Hall, Erika Holmes, Stephen Center for Assessment Office of Qualifications and Examinations Regulation EHall@nciea .org stephen .holmes@ofqual .gov .uk

Han, Zhuangzhuang Hou, Likun Teachers College Columbia University Educational Testing Services zh2198@tc .columbia edu. Lhou@ets .org

Hansen, Mark Huang, Chi-Yu University of California, Los Angeles ACT, Inc . markhansen@ucla .edu chiyu .huang@act .org

Harrell, Lauren Huang, Xiaorui University of California, Los Angeles East China Normal University laurenharrell@ucla .edu ellyhxr@gmail .com

Hayes, Heather Huggins-Manley, Anne Corinne AMTIS Inc . University of Florida hhayes@amtisinc .com ahuggins@coe .ufl .edu

Hayes, Stacy Huh, Nooree Discovery Education ACT, Inc . Stacy_Hayes@discovery .com nooree .huh@act .org

Hazen, Tim Hunter, C. Vincent Iowa Testing Programs Georgia State Univrsity timothy-hazen@uiowa .edu chunter1@student .gsu .edu

He, Qingping Huo, Yan Office of Qualifications and Examinations Regulation Educational Testing Service qingping .he@ofqual .gov .uk yhuo@ets .org

He, Qiwei Insko, William R Educational Testing Service Houghton Mifflin Harcourt qhe@ets .org bill .insko@hmhco .com

He, Yong Jang, Hyesuk ACT, Inc . American Institutes for Research Yong .He@act .org janggahyesuk@gmail .com

Herrera, Bill Jang, Hyesuk edCount, LLC American Institutes for Research bherrera@edcount .com hjang@air .org

Himelfarb, Igor Jewsbury, Paul Educational Testing Service (ETS) Educational Testing Service ihimelfarb@ets .org pjewsbury@ets .org

Ho, Emily H Jiang, Yanming College Board Educational Testing Service eho2@fordham .edu yjiang@ets .org

Jiang, Zhehan KARADAVUT, TUGBA University of Kansas UNIVERSITY OF GEORGIA zjiang4@ku .edu TUGBA-MAT@HOTMAIL COM.

Jin, Kuan-Yu Karvonen, Meagan The Hong Kong Institute of Education University of Kansas kyjin@ied .edu .hk karvonen@ku .edu

Joo, Seang-hwane Keller, Lisa A University of South Florida University of Massachusetts Amherst sjoo@mail .usf .edu lkeller@umass .edu

Julian, Marc Kenyon, Dorry Data Recognition Corporation Center for Applied Linguistics MJulian@DataRecognitionCorp .com dkenyon@cal .org

Junker, Brian W Kern, Justin L. Carnegie Mellon University University of Illinois at Urbana-Champaign brian@stat .cmu .edu jkern7787@gmail .com

Kaliski, Pamela Kim, Dong-In College Board Data Recognition Corporation pkaliski@collegeboard .org DKim@DataRecognitionCorp .com

Kang, Hyeon-Ah Kim, Dong-In University of Illinois at Urbana-Champaign Data Recognition Corporation hkang31@illinois .edu DKim@DataRecognitionCorp .com

Kang, Yoon Jeong Kim, Han Yi American Institutes for Research Measured Progress yoonjeongkang94@gmail .com Kim HanYi@measuredprogress. .org

Kang, Yujin Kim, Hyung Jin University of Iowa The University of Iowa yujin-kang@uiowa .edu hyungjin-kim@uiowa .edu

Kannan, Priya Kim, Ja Young Educational Testing Service ACT, Inc . PKANNAN@ets .org jaykim319@gmail .com

Kanneganti, Raghuveer Kim, Jinok Data Recognition Corporation CTB UCLA/CRESST raghuveer .kanneganti@ctb .com jinok@ucla .edu

Kao, Shu-chuan Kim, Jong Pearson ACT shu-chuan .kao@person .com JP .Kim@act .org

Kaplan, David Kim, Se-Kang University of Wisconsin – Madison Fordham University david .kaplan@wisc .edu sekim@fordham .edu

Kapoor, Shalini Kim, Sooyeon ACT Educational Testing Service shalinikapoor .ia@gmail .com skim@ets .org

Kim, Stella Y Latifi, Syed Muhammad Fahad The University of Iowa University of Alberta stella-kim@uiowa .edu syed .latifi@ualberta .ca

Kim, Sunhee Lawson, Janelle Prometric San Francisco State University sunnyk0206@yahoo .com jlawson@ucla edu.

Kim, Young Yee Leacock, Claudia American Institues for Research McGraw-Hill Education CTB ykim@air .org claudia leacock@gmail. .com

Kobrin, Jennifer L. LeBeau, Brandon Pearson University of Iowa JENNIFER .KOBRIN@PEARSON .COM brandon-lebeau@uiowa edu.

Koklu, Onder Lee, Chansoon Florida Department of Education University of Wisconsin-Madison onder .koklu@fldoe .org chansoon .lee@wisc .edu

Konold, Tim R Lee, Chong Min University of Virginia ETS Konold@virginia .edu clee001@ets .org

Kroehne, Ulf Lee, HyeSun German Institute for International Educational University of Nebraska-Lincoln Research (DIPF) hyesun kj. .lee@gmail .com kroehne@dipf .de Lee, Sora Kuhfeld, Megan University of Wisconsin, Madison University of California slee486@wisc .edu megan .kuhfeld@gmail .com Lei, Ming Kuhfeld, Megan American Institutes for Research University of California, Los Angeles mlei@air .org megan .kuhfeld@gmail .com Leventhal, Brian Kupermintz, Haggai University of Pittsburgh University of Haifa brl38@pitt .edu kuperh@edu .haifa .ac .il Li, Chen Lai, Hollis Educational Testing Service University of Alberta cli@ets .org hollis .lai@ualberta .ca Li, Chen Lao, Hongling University of Maryland University of Kansas lc1210@umd .edu lao@ku .edu Li, Cheng-Hsien Lash, Andrea A. Department of Pediatrics, University of Texas Medical WestEd School at Houston alash@wested .org Cheng .Hsien .Li@uth .tmc .edu

Lathrop, Quinn N Li, Feifei Northwest Evaluation Association Educational Testing Service quinn .lathrop@nwea .org fli@ets .org

Li, Feiming Ling, Guangming University of North Texas Health Science Center ETS feimingli@hotmail .com GLing@ETS .ORG

Li, Jie Liu, Jinghua McGraw-Hill Education Secondary School Admission Test Board jie .li@mheducation .com jliu@ssat .org

Li, Ming Liu, Lei University of Maryland ETS liming@umd .edu lliu001@ets .org

Li, Tongyun Liu, Xiang Educational Testing Service Teachers College, Columbia University tli002@ets .org xl2438@tc .columbia .edu

Li, Xin Liu, Yang ACT, Inc . University of California, Merced xin .li@act .org yliu85@ucmerced .edu

Li, Ying Liu, Yue American Nurses Credentialing Center Sichuan Institute Of Education Sciences lynnliying2011@gmail .com helena701@126 .com

Li, Zhushan Mandy Lockwood, J.R. Boston College Educational Testing Service zhushan .li@bc .edu jrlockwood@ets .org

Liao, Dandan Longabach, Tanya University of Maryland, College Park Excelsior College echommm@gmail .com tlongabach@yahoo .com

Liaw, Yuan-Ling Lopez, Alexis A University of Washington ETS linda08@uw .edu alopez@ets .org

Lim, Euijin Lord-Bessen, Jennifer The University of Iowa McGraw Hill Education CTB ejlim .mail@gmail .com jennlordb@nyc .rr .com

Lin, Chih-Kai Lorié, William A Center for Applied Linguistics (CAL) Center for NextGen Learning & Assessment, Pearson clin@cal .org william .lorie@pearson com.

Lin, Haiyan Lottridge, Susan ACT, Inc . Pacific Metrics, Inc . haiyan .lin@act .org slottridge@pacificmetrics .com

Lin, Johnny Lu, Lucy University of California, Los Angeles NSW Department of Education, Australia j83lin@gmail .com lucy .lu@det .nsw .edu .au

Ling, Guangming Lu, Ru Educational Testing Service Educational Testing Service GLing@ets .org rlu@ets .org

Lu, Ying Matta, Tyler H. Educational Testing Service Northwest Evaluation Association ylu@ets .org tyler .matta@nwea org.

LUO, Fen Maul, Andrew Jiangxi Normal University University of California, Santa Barbara luofen312@163 .com amaul@education .ucsb .edu

Luo, Xiao McCaffrey, Daniel F. National Council of State Boards of Nursing Educational Testing Service xluo@ncsbn .org dmccaffrey@ets .org

Luo, Xin McCall, Marty Michigan State University Smarter Balanced Assessment Consortium charonluo@gmail .com marty .mccall@smarterbalanced .org

Ma, Wenchao McClellan, Catherine A Rutgers, The State University of New Jersey Clowder Consulting HSong@nbome .org cmcclellan@clowderconsulting .com

MacGregor, David McKnight, Kathy Center for Applied Linguistics Center for Educator Learning & Effectiveness, Pearson dmacgregor@cal .org kathy .mcknight@pearson com.

Magnus, Brooke E McTavish, Thomas S University of North Carolina at Chapel Hill Center for Digital Data, Analytics and Adaptive brooke .magnus@unc .edu Learning, Pearson tom .mctavish@pearson .com Mao, Xia Pearson Meyer, Patrick xia .mao@pearson .com University of Virginia meyerjp@virginia .edu Marion, Scott National Center for the Improvement of Educational Meyer, Robert H Assessment Education Analytics, Inc . smarion@nciea .org rhmeyer@edanalytics .org

Martineau, Joseph Miel, Shayne National Center for the Improvement of Educational Turnitin Assessment smiel@turnitin .com jmartineau@nciea .org Miller, Sherral Martineau, Joseph College Board NCIEA shmiller@collegeboard .org jmartineau@nciea .org Monroe, Scott Masters, Jessica UMass Amherst Measured Progress scott .monroe@ucla edu. masters .jessica@measuredprogress .org Montee, Megan Matlock, Ki Lynn Center for Applied Linguistics Oklahoma State University mmontee@cal .org ki .matlock@okstate .edu

Moretti, Antonio Nydick, Steven W Center for Computational Learning Systems, Pearson VUE Columbia University nydic001@umn edu. amoretti86@gmail .com Ogut, Burhan Morgan, Deanna L American Institutes for Research The College Board bogut@air .org demorgan@collegeboard .org O’Leary, Timothy Mark Morin, Maxim University of Melbourne Medical Council of Canada olearyt@student .unimelb .edu .au mmorin@mcc .ca Olgar, Süleyman Morris, Carrie A Florida Department of Education University of Iowa College of Education suleymanolgar@yahoo .com carrie-morris@uiowa .edu Olgar, Süleyman Morrison, Kristin M Florida Department of Education Georgia Institute of Technology suleyman .olgar@fldoe .org kmorrison3@gatech .edu Oliveri, Maria Elena Muntean, William Joseph Educational Testing Service Pearson moliveri@ets .org william .muntean@pearson .com Olsen, James B. Murphy, Stephen T Renaissance Learning Inc . Houghton Mifflin Harcourt james .olsen@renaissance .com stephen .murphy@hmhco .com Özdemir, Burhanettin Naumann, Alexander Hacettepe University German Institute for International Educational b .ozdemir@hacettepe .edu .tr Research (DIPF) Pak, Seohong naumanna@dipf .de University of Iowa Naumenko, Oksana seohong-pak@uiowa .edu The University of North Carolina at Greensboro Pan, Tianshu o_naumen@uncg .edu Pearson Nebelsick-Gullet, Lori tianshu .pan@pearson .com edCount Park, Jiyoon lnebelsick-gullett@edcount .com Federation of State Boards of Physical Therapy Nieto, Ricardo jpark@fsbpt .org The University of Texas at Austin Park, Yoon Soo nietoric@utexas .edu University of Illinois at Chicago Noh, Eunhee yspark2@uic .edu Korean Institute for Curriculum and Evaluation Patelis, Thanos eylim .ed@gmail .com Center for Assessment Norton, Jennifer tpatelis@nciea .org Center for Applied Linguistics Patelis, Thanos jnorton@cal .org Center for Assessment tpatelis@nciea .org

Peabody, Michael Reboucas, Daniella American Board of Family Medicine University of Notre Dame mpeabody@theabfm .org ycheng4@nd .edu

Perie, Marianne Reckase, Mark Center for Educational Testing and Evaluation Michigan State University mperie@ku .edu luoxin1@msu .edu

Perie, Marianne Redell, Nick CETE University of Kansas National Board of Osteopathic Medical Examiners mperie@ku .edu (NBOME) nredell@nbome .org Phadke, Chaitali University of Minnesota Renn, Jennifer phadk011@umn .edu Center for Applied Linguistics jrenn@cal .org Pohl, Steffi Freie Universität Berlin Reshetnyak, Evgeniya steffi .pohl@fu-berlin .de Fordham University ereshetnyak@fordham .edu Por, Han-Hui Educational Testing Service Ricarte, Thales Akira Matsumoto hpor@ets .org Institute of Mathematical and Computer Sciences (ICMC-USP) Powers, Sonya thalesam@icmc .usp .br Pearson sopowers@gmail .com Rick, Francis University of Massachusetts, Amherst QIAN, HAIXIA frick@umass .edu University of Kansas hxqian@gmail .com Rickels, Heather Anne University of Iowa, Iowa Testing Programs Qian, Jiahe heather-rickels@uiowa .edu Educational Testing Service jqian58@gmail .com Rios, Joseph A. Educational Testing Service QIU, Xue-Lan jrios@ets .org The Hong Kong Institute of Education xlqiu@ied .edu .hk Risk, Nicole M American Medical Technologists Qiu, Yuxi nrisk@americanmedtech org. University of Florida yqiu2013@ufl .edu Roduta Roberts, Mary University of Alberta Quellmalz, Edys S mroberts@ualberta .ca WestEd equellm@wested .org Rogers, H. Jane University of Connecticut Rahman, Nazia jane .rogers@uconn .edu Law School Admission Council nrahman@lsac .org Rorick, Beth National Parent-Teacher Association Rankin, Jenny G. Illuminate Education Rosen, Yigal drjrankin@gmail .com Pearson igal .rosen@gmail .com

Rubright, Jonathan D Seltzer, Michael American Institute of Certified Public Accountants UCLA jrubright@aicpa .org mseltzer@ucla edu.

Runyon, Christopher R. Sen, Sedat The University of Texas at Austin Harran University runyon .christopher@utexas .edu sedatsen06@gmail .com

Rutkowski, Leslie Sgammato, Adrienne University of Oslo Educational Testing Service leslie .rutkowski@gmail .com asgammato@ets .org

Rutstein, Daisy W. Sha, Shuying SRI International University of North Carolina at Greensboro daisy .rutstein@sri .com s_sha@uncg .edu

Sabatini, John Shao, Can ETS University of Notre Dame jsabatini@ets .org cshao@nd .edu

Şahin, Füsun Shaw, Emily University at Albany, State University of New York College Board fsahin@albany .edu eshaw@collegeboard .org

Saiar, Amin Shear, Benjamin PSI Services LLC Stanford University amin@psionline .com bshear@stanford .edu

Sakworawich, Arnond Shear, Benjamin R. National Institute of Development Administration Stanford University arnond@as .nida .ac .th bshear@stanford .edu

Samonte, Kelli M. Sheehan, Kathleen M. American Board of Internal Medicine ETS kelli .samonte@gmail .com ksheehan@ETS .ORG

Sano, Makoto Shermis, Mark D Prometric University of Houston--Clear Lake makoto .sano@prometric .com mshermis@uhcl .edu

Sato, Edynn Shin, Hyo Jeong Pearson ETS edynn .sato@pearson .com hshin@ets .org

Schultz, Matthew T Shin, Nami American Institute of Certified Public Accountants University of California, Los Angeles/ National Center matthew .schultz01@gmail .com for Research on Evaluation, Standards, and Student Testing (CRESST) Schwarz, Richard D. nami0623@gmail .com ETS rdschwarz@ets .org

Secolsky, Charles Mississippi Department of Education csecolsky@gmail .com

Shropshire, Kevin O. Sweet, Shauna J Virginia Tech (note I graduated in May 2014) . I University of Maryland, College Park currently work at the University of Georgia (OIR) and ssweet@umd .edu this research is not affiliated with that department Swift, David / university . I am providing the school where my Houghton Mifflin Harcourt research was conducted . DAVID .SWIFT@HMHCO .COM kshropsh@vt .edu Swinburne Romine, Russell Shute, Valerie University of Kansas Florida State University swin0030@ku .edu vshute@fsu .edu Tan, Xuan-Adele Sinharay, Sandip Educational Testing Service Pacific Metrics Corp atan@ets .org ssinharay@pacificmetrics .com Tang, Wei Sireci, Stephen G. University of Alberta University of Massachusetts-Amherst wtang3@ualberta .ca sireci@acad .umass .edu Tannenbaum, Richard J. Skorupski, William P Educational Testing Service University of Kansas RTANNENBAUM@ets .org wps@ku .edu Tao, Shuqin Somasundaran, Swapna Curriculum Associates ETS shuqin tao@gmail. .com ssomasundaran@ets .org Terzi, Ragip Sorrel, Miguel A. Rutgers, The State University of New Jersey Universidad Autónoma de Madrid terziragip@gmail .com miguel .sorrel@uam .es Thissen, David Stanke, Luke University of North Carolina Minneapolis Public Schools dthissen@email .unc .edu stanke@gmail .com Thomas, Larry Stone, Elizabeth University of California, Los Angeles Educational Testing Service LDThomas@ucla .edu estone@ets .org Thummaphan, Phonraphee SU, YU-LAN University of Washington, Seattle ACT .ING phonrt@uw .edu yulan .su@act .org Torres Irribarra, David Suh, Hongwook Pontificia Universidad Católica de Chile ACT, inc . david@torresirribarra .me hongwooks@gmail .com Traynor, Anne Sukin, Tia M Purdue University Pacific Metrics atraynor@purdue .edu tsukin@pacificmetrics .com Trierweiler, Tammy J. Svetina, Dubravka Law School Admission Council (LSAC) Indiana University tjtrier@gmail .com dsvetina@indiana .edu


Tu, Dongbo | Jiangxi Normal University | tudongbo@aliyun.com

Underhill, Stephanie | Indiana University - Bloomington | stepunde@indiana.edu

van Rijn, Peter | ETS Global | pvanrijn@etsglobal.org

Vansickle, Tim | Questar Assessment Inc. | TVansickle@questarai.com

Vispoel, Walter P. | University of Iowa | walter-vispoel@uiowa.edu

von Davier, Matthias | Educational Testing Service | mvondavier@ets.org

Vue, Kory | University of Minnesota | vuexx199@umn.edu

Wainer, Howard | National Board of Medical Examiners | hwainer@nbme.org

Walker, Cindy | University of Wisconsin - Milwaukee | cmwalker@uwm.edu

Walker, Michael E. | The College Board | miwalker@collegeboard.org

Wang, Aijun | Federation of State Boards of Physical Therapy | awang@fsbpt.org

Wang, Hongling | ACT, Inc. | hongling.wang@act.org

Wang, Keyin | Michigan State University | wangkeyi@msu.edu

Wang, Lu | ACT, Inc./The University of Iowa | lu-wang-3@uiowa.edu

Wang, Shichao | The University of Iowa | shichao-wang@uiowa.edu

Wang, Shudong | Northwest Evaluation Association | shudong.wang@nwea.org

Wang, Wei | Educational Testing Service | WWang@ets.org

Wang, Wenyi | Jiangxi Normal University | wenyiwang2009@gmail.com

Wang, Xi | University of Massachusetts Amherst | xiw@educ.umass.edu

Wang, Xiaolin | Indiana University, Bloomington | xw41@indiana.edu

Wang, Zhen | Educational Testing Service (ETS) | zhenwng@aol.com

Weeks, Jonathan P. | ETS | jweeks@ets.org

Wei, Hua | Pearson | hua.wei@pearson.com

Wei, Xiaoxin Elizabeth | American Institutes for Research | xwei@virginia.edu

Wei, Youhua | Educational Testing Service | ywei@ets.org

Weiner, John A. | PSI Services LLC | jweiner@psionline.com

Welch, Catherine | University of Iowa | catherine-welch@uiowa.edu

Welch, Catherine J. | University of Iowa | catherine-welch@uiowa.edu


Wendler, Cathy | Educational Testing Service | cwendler@ets.org

White, Lauren | Florida Department of Education | lauren.white@fldoe.org

Wiberg, Marie | Umeå University | marie.wiberg@umu.se

Widiatmo, Heru | ACT, Inc. | heru.widiatmo@act.org

Wilson, Mark | University of California, Berkeley | MarkW@berkeley.edu

Wood, Scott W. | Pacific Metrics Corporation | swood@pacificmetrics.com

Wu, Yi-Fang | University of Iowa | wuyifang91@gmail.com

Wyatt, Jeff | College Board | jwyatt@collegeboard.org

Xi, Nuo | Educational Testing Service | yjia@ets.org

Xiang, Shibei | National Cooperative Innovation Center for Assessment and Improvement of Basic Education Quality | xiangshibei@163.com

Xie, Chao | American Institutes for Research | cxie@air.org

Xie, Qing | ACT/The University of Iowa | qing-xie@uiowa.edu

Xin, Tao | Beijing Normal University | xintao@bnu.edu.cn

Xiong, Xinhui | American Institute of Certified Public Accountants | xxiong@aicpa.org

Xu, Jing-Ru | Pearson VUE | jingruxu2013@gmail.com

Xu, Ting | University of Pittsburgh | tix3@pitt.edu

Yang, Ji Seung | University of Maryland | jsyang@umd.edu

Yao, Lihua | Defense Manpower Data Center | Lihua.Yao.civ@mail.mil

Ye, Sangbeak | University of Illinois at Urbana-Champaign | sye3@illinois.edu

Yi, Qin | Faculty of Education, Beijing Normal University | scgayiqin@163.com

Yi, Qing | ACT, Inc. | qing.yi@act.org

Yin, Ping | Curriculum Associates | pyin@cainc.com

Yoo, Hanwook Henry | Educational Testing Service | hyoo@ets.org

Yoon, Su-Youn | Educational Testing Service | SYoon@ets.org

Yoon, Su-Youn | ETS | syoon@ets.org

Zhan, Peida | Beijing Normal University | pdzhan@gmail.com


Zhang, Jin | ACT, Inc. | jin.zhang@act.org

Zhang, Jinming | University of Illinois at Urbana-Champaign | jmzhang@illinois.edu

Zhang, Mengyao | National Conference of Bar Examiners | mengyao-zhang@uiowa.edu

Zhang, Xue | Northeast Normal University | zhangx815@nenu.edu.cn

Zhang, Yu | Federation of State Boards of Physical Therapy | yzhang@fsbpt.org

Zhao, Yang | University of Kansas | zhaoyang@ku.edu

Zheng, Chanjin | Jiangxi Normal University | russelzheng@gmail.com

Zheng, Chunmei | Pearson | zhengchunmei5@gmail.com

Zheng, Xiaying | University of Maryland | xyzheng86@gmail.com

Zheng, Yi | Arizona State University | yi.isabel.zheng@asu.edu

Zweifel, Michael | University of Nebraska-Lincoln | zweifelmj@gmail.com


NCME 2016 • Schedule-At-A-Glance

CS = Coordinated Session • EB = Electronic Board Session • IS = Invited Session • PS = Paper Session • TS = Training Session

Time | Room | Type ID | Title

Thursday, April 7, 2016
8:00 AM–12:00 PM | Meeting Room 6 | TS AA | Quality Control Tools in Support of Reporting Accurate and Valid Test Scores
8:00 AM–12:00 PM | Meeting Room 7 | TS BB | IRT Parameter Linking
8:00 AM–5:00 PM | Meeting Room 5 | TS CC | 21st Century Skills Assessment: Design, Development, Scoring, and Reporting of Character Skills
8:00 AM–5:00 PM | Meeting Room 2 | TS DD | Introduction to Standard Setting
8:00 AM–5:00 PM | Meeting Room 16 | TS EE | Analyzing NAEP Data Using Plausible Values and Marginal Estimation with AM
8:00 AM–5:00 PM | Meeting Room 4 | TS FF | Multidimensional Item Response Theory: Theory and Applications and Software
1:00 PM–5:00 PM | Meeting Room 3 | TS GG | New Weighting Methods for Causal Mediation Analysis
1:00 PM–5:00 PM | Meeting Room 6 | TS II | Computerized Multistage Adaptive Testing: Theory and Applications (Book by Chapman and Hall)

Friday, April 8, 2016
8:00 AM–12:00 PM | Renaissance West B | TS JJ | Landing Your Dream Job for Graduate Students
8:00 AM–12:00 PM | Meeting Room 4 | TS KK | Bayesian Analysis of IRT Models using SAS PROC MCMC
8:00 AM–5:00 PM | Meeting Room 2 | TS LL | flexMIRT®: Flexible multilevel multidimensional item analysis and test scoring
8:00 AM–5:00 PM | Meeting Room 5 | TS MM | Aligning ALDs and Item Response Demands to Support Teacher Evaluation Systems
8:00 AM–5:00 PM | Renaissance East | TS NN | Best Practices for Lifecycles of Automated Scoring Systems for Learning and Assessment
8:00 AM–5:00 PM | Meeting Room 3 | TS OO | Test Equating Methods and Practices
8:00 AM–5:00 PM | Renaissance West A | TS PP | Diagnostic Measurement: Theory, Methods, Applications, and Software
1:00 PM–5:00 PM | Renaissance West B | TS QQ | Effective Item Writing for Valid Measurement
3:00 PM–8:00 PM | Meeting Room 11 | Board Meeting
4:30 PM–6:30 PM | Fado’s Irish Pub, 808 7th Street NW, Washington, DC 20001 | Graduate Student Social


6:30 PM–10:00 PM | Convention Center, Level Three, Ballroom C | AERA Centennial Symposium & Centennial Reception

Saturday, April 9, 2016
6:30 AM–7:30 AM | Meeting Room 7 | Sunrise Yoga
8:15 AM–10:15 AM | Renaissance East | IS A1 | NCME Book Series Symposium: The Challenges to Measurement in an Era of Accountability
8:15 AM–10:15 AM | Renaissance West A | CS A2 | Collaborative Problem Solving Assessment: Challenges and Opportunities
8:15 AM–10:15 AM | Renaissance West B | CS A3 | Harnessing Technological Innovation in Assessing English Learners: Enhancing Rather Than Hindering
8:15 AM–10:15 AM | Meeting Room 3 | PS A4 | How can assessment inform classroom practice?
8:15 AM–10:15 AM | Meeting Room 4 | CS A5 | Enacting a Learning Progression Design to Measure Growth
8:15 AM–10:15 AM | Meeting Room 5 | PS A6 | Testlets and Multidimensionality in Adaptive Testing
8:15 AM–10:15 AM | Meeting Room 12 | PS A7 | Methods for Examining Local Item Dependence and Multidimensionality
10:35 AM–12:05 PM | Renaissance East | CS B1 | The End of Testing as We Know it?
10:35 AM–12:05 PM | Renaissance West A | CS B2 | Fairness and Machine Learning for Educational Practice
10:35 AM–12:05 PM | Renaissance West B | CS B3 | Item Difficulty Modeling: From Theory to Practice
10:35 AM–12:05 PM | Meeting Room 3 | PS B4 | Growth and Vertical Scales
10:35 AM–12:05 PM | Meeting Room 4 | PS B5 | Perspectives on Validation
10:35 AM–12:05 PM | Meeting Room 5 | PS B6 | Model Fit
10:35 AM–12:05 PM | Meeting Room 12 | PS B7 | Simulation- and Game-based Assessments
10:35 AM–12:05 PM | Meeting Room 10 | PS B8 | Test Security and Cheating
12:25 PM–1:55 PM | Renaissance East | CS C1 | Opting out of testing: Parent rights versus valid accountability scores
12:25 PM–1:55 PM | Renaissance West A | CS C2 | Building toward a validation argument with innovative field test design and analysis
12:25 PM–1:55 PM | Renaissance West B | CS C3 | Towards establishing standards for spiraling of contextual questionnaires in large-scale assessments
12:25 PM–1:55 PM | Meeting Room 3 | CS C4 | Estimation precision of variance components: Revisiting generalizability theory
12:25 PM–1:55 PM | Meeting Room 4 | PS C5 | Sensitivity of Value-Added Models
12:25 PM–1:55 PM | Meeting Room 5 | PS C6 | Item and Scale Drift
12:25 PM–1:55 PM | Meeting Room 12 | PS C7 | Cognitive Diagnostic Model Extensions


12:25 PM–1:55 PM | Mount Vernon Square | EB C8 | Electronic Board Session
2:15 PM–3:45 PM | Renaissance East | IS D1 | Assessing the assessments: Measuring the quality of new college- and career-ready assessments
2:15 PM–3:45 PM | Renaissance West A | CS D2 | Some psychometric models for learning progressions
2:15 PM–3:45 PM | Renaissance West B | CS D3 | Multiple Perspectives on Promoting Assessment Literacy for Parents
2:15 PM–3:45 PM | Meeting Room 3 | PS D4 | Equating Mixed-Format Tests
2:15 PM–3:45 PM | Meeting Room 4 | PS D5 | Standard Setting
2:15 PM–3:45 PM | Meeting Room 5 | PS D6 | Diagnostic Classification Models: Applications
2:15 PM–3:45 PM | Meeting Room 12 | PS D7 | Advances in IRT Modelling and Estimation
2:15 PM–3:45 PM | Mount Vernon Square | EB D8 | GSIC Poster Session
4:05 PM–6:00 PM | Renaissance East | CS E1 | Do Large Scale Performance Assessments Influence Classroom Instruction? Evidence from the Consortia
4:05 PM–6:05 PM | Renaissance West A | CS E2 | Applications of Latent Regression to Modeling Student Achievement, Growth, and Educator Effectiveness
4:05 PM–6:05 PM | Renaissance West B | CS E3 | Jail Terms for Falsifying Test Scores: Yes, No or Uncertain?
4:05 PM–6:05 PM | Meeting Room 3 | PS E4 | Test Design and Construction
4:05 PM–6:05 PM | Meeting Room 4 | CS E5 | Tablet Use in Assessment
4:05 PM–6:05 PM | Meeting Room 5 | PS E6 | Topics in Multistage and Adaptive Testing
4:05 PM–6:05 PM | Meeting Room 12 | PS E7 | Cognitive Diagnosis Models: Exploration and Evaluation
4:05 PM–5:35 PM | Mount Vernon Square | EB E8 | Electronic Board Session
6:30 PM–8:00 PM | Grand Ballroom South | NCME and Division D Reception

Sunday, April 10, 2016
8:00 AM–9:00 AM | Marriott Marquis Hotel, Marquis Salon 6 | Breakfast and Business Session
9:00 AM–9:40 AM | Marriott Marquis Hotel, Marquis Salon 6 | Presidential Address: Education and the Measurement of Behavioral Change
10:35 AM–12:05 PM | Renaissance East | IS F1 | Career Award: Do Educational Assessments Yield Achievement Measurements?


10:35 AM–12:05 PM | Renaissance West A | IS F2 | Debate: Should the NAEP Mathematics Framework be revised to align with the Common Core State Standards?
10:35 AM–12:05 PM | Renaissance West B | CS F3 | Beyond process: Theory, policy, and practice in standard setting
10:35 AM–12:05 PM | Meeting Room 3 | CS F4 | Exploring Timing and Process Data in Large-Scale Assessments
10:35 AM–12:05 PM | Meeting Room 4 | CS F5 | Psychometric Challenges with the Machine Scoring of Short-Form Constructed Responses
10:35 AM–12:05 PM | Meeting Room 5 | PS F6 | Advances in Equating
10:35 AM–12:05 PM | Meeting Room 15 | PS F7 | Novel Approaches for the Analysis of Performance Data
10:35 AM–12:05 PM | Mount Vernon Square | EB F8 | Electronic Board Session
12:25 PM–2:25 PM | Convention Center, Level Three, Ballroom ABC | AERA Awards Luncheon
2:45 PM–4:15 PM | Renaissance East | CS G1 | Challenges and Opportunities in the Interpretation of the Testing Standards
2:45 PM–4:15 PM | Renaissance West A | CS G2 | Applications of Combinatorial Optimization in Educational Measurement
2:45 PM–4:15 PM | Renaissance West B | PS G3 | Psychometrics of Teacher Ratings
2:45 PM–4:15 PM | Meeting Room 3 | PS G4 | Multidimensionality
2:45 PM–4:15 PM | Meeting Room 4 | PS G5 | Validating “Noncognitive”/Nontraditional Constructs I
2:45 PM–4:15 PM | Meeting Room 5 | PS G6 | Invariance
2:45 PM–4:15 PM | Meeting Room 15 | PS G7 | Detecting Aberrant Response Behaviors
2:45 PM–4:15 PM | Mount Vernon Square | EB G8 | GSIC Poster Session
4:35 PM–5:50 PM | Convention Center, Level Three, Ballroom C | AERA Presidential Address
4:35 PM–6:05 PM | Renaissance East | CS H1 | Advances in Balanced Assessment Systems: Conceptual framework, informational analysis, application to accountability
4:35 PM–6:05 PM | Renaissance West A | CS H2 | Minimizing Uncertainty: Effectively Communicating Results from CDM-based Assessments
4:35 PM–6:05 PM | Meeting Room 16 | CS H3 | Overhauling the SAT: Using and Interpreting Redesigned SAT Scores


4:35 PM–6:05 PM | Meeting Room 3 | CS H4 | Quality Assurance Methods for Operational Automated Scoring of Essays and Speech
4:35 PM–6:05 PM | Meeting Room 4 | PS H5 | Student Growth Percentiles
4:35 PM–6:05 PM | Meeting Room 5 | PS H6 | Equating: From Theory to Practice
4:35 PM–6:05 PM | Meeting Room 15 | PS H7 | Issues in Ability Estimation and Scoring
4:35 PM–6:05 PM | Mount Vernon Square | EB H8 | Electronic Board Session
6:30 PM–8:00 PM | Renaissance West B | President’s Reception

Monday, April 11, 2016
5:45 AM–7:00 AM | NCME Fitness Run/Walk
8:15 AM–10:15 AM | Meeting Room 13/14 | IS I1 | NCME Book Series Symposium: Technology and Testing
8:15 AM–10:15 AM | Meeting Room 8/9 | CS I2 | Exploring Various Psychometric Approaches to Report Meaningful Subscores
8:15 AM–10:15 AM | Meeting Room 3 | CS I3 | From Items to Policies: Big Data in Education
8:15 AM–10:15 AM | Meeting Room 4 | CS I4 | Methods and Approaches for Validating Claims of College and Career Readiness
8:15 AM–10:15 AM | Renaissance West A | IS I5 | Recent Advances in Quantitative Social Network Analysis in Education
8:15 AM–10:15 AM | Meeting Room 15 | PS I6 | Issues in Automated Scoring
8:15 AM–10:15 AM | Meeting Room 16 | PS I7 | Multidimensional and Multivariate Methods
10:35 AM–12:05 PM | Renaissance West A | IS J1 | Hold the Presses! How Measurement Professionals can Speak More Effectively with the Press and the Public (Education Writers Association Session)
10:35 AM–12:05 PM | Meeting Room 8/9 | CS J2 | Challenges and solutions in the operational use of automated scoring systems
10:35 AM–12:05 PM | Meeting Room 3 | CS J3 | Novel Models to Address Measurement Errors in Educational Assessment and Evaluation Studies
10:35 AM–12:05 PM | Meeting Room 4 | CS J4 | Mode Comparability Investigation of a CCSS-based K-12 Assessment
10:35 AM–12:05 PM | Meeting Room 16 | PS J5 | Validating “Noncognitive”/Nontraditional Constructs II
10:35 AM–12:05 PM | Meeting Room 15 | PS J6 | Differential Functioning - Theory and Applications
10:35 AM–12:05 PM | Meeting Room 5 | PS J7 | Latent Regression and Related Topics
11:00 AM–2:00 PM | Meeting Room 12 | Past Presidents Luncheon
12:25 PM–1:55 PM | Meeting Room 8/9 | IS K1 | The Every Student Succeeds Act (ESSA): Implications for measurement research and practice
12:25 PM–1:55 PM | Renaissance West A | CS K2 | Career Paths in Educational Measurement: Lessons Learned by Accomplished Professionals


12:25 PM–1:55 PM | Meeting Room 3 | CS K3 | Recent Investigations and Extensions of the Hierarchical Rater Model
12:25 PM–1:55 PM | Meeting Room 4 | CS K4 | The Validity of Scenario-Based Assessment: Empirical Results
12:25 PM–1:55 PM | Meeting Room 5 | PS K5 | Item Design and Development
12:25 PM–1:55 PM | Meeting Room 15 | PS K6 | English Learners
12:25 PM–1:55 PM | Meeting Room 16 | PS K7 | Differential Item and Test Functioning
12:25 PM–1:55 PM | Mount Vernon Square | EB K8 | Electronic Board Session
2:15 PM–3:45 PM | Renaissance West A | IS L1 | Learning from History: How K-12 Assessment Will Impact Student Learning Over the Next Decade (National Association of Assessment Directors)
2:15 PM–3:45 PM | Meeting Room 8/9 | CS L2 | Psychometric Issues on the Operational New-Generation Consortia Assessments
2:15 PM–3:45 PM | Meeting Room 3 | CS L3 | Issues and Practices in Multilevel Item Response Models
2:15 PM–3:45 PM | Meeting Room 4 | CS L4 | Psychometric Issues in Alternate Assessments
2:15 PM–3:45 PM | Meeting Room 5 | CS L5 | Recommendations for Addressing the Unintended Consequences of Increasing Examination Rigor
2:15 PM–3:45 PM | Meeting Room 15 | PS L6 | Innovations in Assessment
2:15 PM–3:45 PM | Meeting Room 12 | PS L7 | Technology-based Assessments
2:15 PM–3:45 PM | Meeting Room 13/14 | IS L8 | NCME Diversity and Testing Committee Sponsored Symposium: Implications of Computer-Based Testing for Assessing Diverse Learners: Lessons Learned from the Consortia
3:00 PM–7:00 PM | Meeting Room 10/11 | Board Meeting
4:05 PM–6:05 PM | Meeting Room 8/9 | CS M1 | Fairness Issues and Validation of Non-Cognitive Skills
4:05 PM–6:05 PM | Meeting Room 3 | CS M2 | Thinking about your Audience in Designing and Evaluating Score Reports
4:05 PM–6:05 PM | Meeting Room 4 | CS M3 | Use of automated tools in listening and reading item generation
4:05 PM–6:05 PM | Meeting Room 5 | PS M4 | Practical Issues in Equating
4:05 PM–6:05 PM | Meeting Room 16 | PS M5 | The Great Subscore Debate
4:05 PM–6:05 PM | Meeting Room 12 | PS M6 | Scores and Scoring Rules
4:05 PM–6:05 PM | Meeting Room 13/14 | IS M7 | On the use and misuse of latent variable scores

The National Council on Measurement in Education is very grateful to the following organizations for their generous financial support of our 2016 Annual Meeting.

National Council on Measurement in Education, 100 North 20th Street, Suite 400, Philadelphia, PA 19103, (215) 461-6263, http://www.ncme.org/