Validity and the Florida Standards Assessments
As with all of Florida's statewide assessments, a number of analyses are conducted each year to verify the quality of individual test items and the validity of the assessment as a whole before student scores are reported. In addition, the need for comparable results from year to year requires that the test design maintain consistent content and difficulty. These processes ensure that the tests accurately measure student achievement of the Florida standards, are appropriate for Florida's diverse student population, and contain content acceptable to all Florida stakeholders. The criteria and quality control measures include oversight by third-party experts and input from Florida's Technical Advisory Committee (TAC), which includes national experts in educational measurement as well as representatives from Florida's school districts. Specific areas of focus in test design, development, and scoring are discussed below.

Content Validity

All FSA test items must align to the Florida standards as described in the Florida course description. The test design summary and test item specifications available on the FSA Portal are evidence of the alignment of the assessment to the standards. Items used in 2015 were reviewed and evaluated by department content and assessment experts in July and August of 2014 to ensure that each met the content validity criteria. Content validity, however, is not quantifiable by the statistical analyses used for other aspects of the assessment. For all future FSA development, committees of Florida educators will review each item to ensure content validity.

Difficulty Level

All 2015 FSA items were reviewed for grade-level difficulty and appropriateness in July and August of 2014. Each test includes items that span a range of difficulty in order to measure the entire range of Florida student performance. In April and May, statistical analyses of student performance will be used to verify that items fall within an acceptable range of difficulty. Other measures of difficulty will also be reviewed, and items with unacceptable statistics will be excluded from the calculation of student scores.

Item Discrimination

For an item to be useful on a test, students who succeed on that item should exhibit greater success on the test as a whole than students who do not. For the 2015 FSA, item discrimination statistics were reviewed at item selection in July and August of 2014 and will be reviewed again after operational scoring in April and May. Any item with unacceptable discrimination statistics will be excluded from the calculation of student scores.

Guessing

Good assessment items are written to reduce the likelihood that a student could answer correctly by guessing. For the 2015 FSA, statistics related to guessing were reviewed at item selection in July and August of 2014 and will be reviewed again after operational scoring in April and May. Any item whose statistics indicate that students could easily guess the correct response will be excluded from the calculation of student scores.
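The difficulty and discrimination reviews described above are commonly operationalized with classical item statistics: a proportion-correct (p-value) for difficulty and a corrected item-total point-biserial correlation for discrimination. The sketch below is a minimal illustration of those two statistics on an invented 0/1 response matrix; the report does not publish the FSA's formulas or cutoffs, so the 0.25 flagging threshold is a common rule of thumb, not an FSA criterion.

```python
# Minimal sketch: classical item difficulty and discrimination.
# The response matrix and the 0.25 cutoff are invented for illustration.
import numpy as np

responses = np.array([      # rows = students, columns = items (1 = correct)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
])

p_values = responses.mean(axis=0)      # proportion correct = classical difficulty
total_scores = responses.sum(axis=1)   # number-correct total per student

def corrected_point_biserial(item_col, totals):
    """Correlation between an item and the total score excluding that item."""
    rest = totals - item_col           # remove the item's own contribution
    return np.corrcoef(item_col, rest)[0, 1]

for i in range(responses.shape[1]):
    r_pb = corrected_point_biserial(responses[:, i], total_scores)
    flag = "review" if r_pb < 0.25 else "ok"   # illustrative cutoff only
    print(f"item {i + 1}: p = {p_values[i]:.2f}, r_pb = {r_pb:.2f} ({flag})")
```

In operational programs, flagged items are typically routed to content and measurement specialists for review rather than being dropped automatically.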
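Guessing is often modeled with the three-parameter logistic (3PL) IRT model, whose lower-asymptote parameter c captures the chance that a very low-ability student answers an item correctly. The document does not state which scoring model the FSA uses, so the function and parameter values below are purely illustrative.

```python
# Hypothetical sketch of the 3PL item characteristic curve; the parameter
# values are invented and do not come from any FSA calibration.
import math

def prob_correct_3pl(theta, a, b, c):
    """P(correct | ability theta) with discrimination a, difficulty b,
    and lower asymptote c (the 'pseudo-guessing' parameter)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# Even a very low-ability student answers correctly about c of the time,
# so a large estimated c is one signal that an item is too easy to guess.
print(prob_correct_3pl(theta=-3.0, a=1.0, b=0.0, c=0.20))  # close to c
print(prob_correct_3pl(theta=3.0, a=1.0, b=0.0, c=0.20))   # close to 1
```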
Item and Test Scoring

Data analyses related to scoring, to be conducted in April and May, include the difficulty, discrimination, and guessing analyses described above. These parameters are used to ensure that each item, and the test as a whole, fits established guidelines. Those guidelines are set in the specifications documents that guide test construction.

Freedom from Bias

An item is biased if it places a group or groups of students at a relative advantage or disadvantage due to characteristics, experiences, interests, or opportunities common to the group that are unrelated to academic achievement. Items used in 2015 were reviewed and evaluated by department content and assessment experts in July and August of 2014 to ensure they were free from bias. In April and May, a statistical analysis will be conducted to determine whether any items unfairly advantage or disadvantage subgroups (one common procedure of this kind is sketched after the calibration list below). Any item with unacceptable statistics related to bias will be excluded from the calculation of student scores.

Universal Design Principles

Universal design means that assessments are accessible to the greatest number of students, including students with disabilities and non-native speakers of English. FSA items were reviewed for adherence to these principles from July 2014 through February 2015 and reflect the best practices suggested by universal design, including, but not limited to, reduction of wordiness; avoidance of ambiguity; selection of reader-friendly constructions and terminology; and consistent application of concept names and graphic conventions. Universal design principles were also used to make decisions about test layout and design during this time, including, but not limited to, type size, line length, spacing, and graphics. All future FSA item development will also adhere to these standards.

Test Reliability

A reliable test score provides an accurate estimate of a student's true achievement. When a test contains a sufficient number of items that reflect the intended content, are free from bias, are well written, represent a range of difficulty, and correlate positively with success on the test as a whole, the test is likely to be reliable. Every step in the assessment process contributes in some way to maximizing the reliability of the FSA. During test construction, test developers review the statistical data for items and generate indicators of overall test reliability (one such indicator is sketched after the calibration list below); these statistics and measures are reviewed against established guidelines before the final approval of all test forms.

Calibration (Post-Equating)

After operational testing, data from a sample of students representative of all students tested are used to generate the statistics necessary for scoring and to determine whether any items require special treatment in the scoring process. Several calibration activities must take place during the spring 2015 season. These activities include:
- Operational item calibration for all grades (used for operational scoring in June 2015)
- Horizontal Linking to FCAT 2.0 for Grades 3 and 10 ELA and Algebra 1 (used for operational scoring in June 2015)
- Vertical Linking for Grades 3-10 ELA and Grades 3-8 Mathematics (used for standard setting)
- Field test item calibration
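The subgroup analysis described under Freedom from Bias above is usually a differential item functioning (DIF) procedure; the Mantel-Haenszel statistic is one common choice. The counts and the procedure below are assumptions for illustration only, not the FSA's documented method.

```python
# Hypothetical Mantel-Haenszel DIF sketch for a single item; all counts are
# invented. Students are matched on total-score bands before comparison.
import numpy as np

# strata[k] = [[ref_correct, ref_incorrect], [focal_correct, focal_incorrect]]
strata = [
    np.array([[40, 60], [35, 65]]),   # low-score band
    np.array([[70, 30], [60, 40]]),   # middle band
    np.array([[90, 10], [85, 15]]),   # high-score band
]

num = sum(t[0, 0] * t[1, 1] / t.sum() for t in strata)
den = sum(t[0, 1] * t[1, 0] / t.sum() for t in strata)
alpha_mh = num / den                  # common odds ratio across score bands
delta_mh = -2.35 * np.log(alpha_mh)   # ETS delta scale; |delta| >= 1.5 is a
                                      # widely used review threshold
print(f"MH odds ratio = {alpha_mh:.2f}, MH delta = {delta_mh:.2f}")
```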
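One widely reported indicator of the overall test reliability discussed above is coefficient (Cronbach's) alpha. The document does not name the reliability statistic the FSA program uses, so the sketch below is a generic illustration on an invented response matrix.

```python
# Generic Cronbach's alpha sketch; the 0/1 response matrix is invented.
import numpy as np

responses = np.array([      # rows = students, columns = scored items
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 0, 1],
])

k = responses.shape[1]
item_variances = responses.var(axis=0, ddof=1)      # variance of each item
total_variance = responses.sum(axis=1).var(ddof=1)  # variance of total scores
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")  # values nearer 1.0 indicate higher
                                          # internal consistency
```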
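The linking activities listed above place the new calibration's item parameters onto an existing reporting scale. One simple textbook approach is mean/sigma linking of IRT difficulty estimates from a set of common (linking) items; the sketch below assumes that method with invented values and is not a description of the FSA's actual equating procedure.

```python
# Hypothetical mean/sigma linking sketch; difficulty estimates are invented.
import numpy as np

# IRT difficulty (b) estimates for the same linking items, once on the old
# reporting scale and once on the new form's freshly calibrated scale.
b_old_scale = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
b_new_scale = np.array([-1.0, -0.3, 0.3, 1.0, 1.8])

# Choose slope A and intercept B so that A * b_new + B matches the old scale
# in mean and standard deviation.
A = b_old_scale.std(ddof=1) / b_new_scale.std(ddof=1)
B = b_old_scale.mean() - A * b_new_scale.mean()

b_rescaled = A * b_new_scale + B
print(f"slope A = {A:.3f}, intercept B = {B:.3f}")
print("new difficulties on the old scale:", np.round(b_rescaled, 2))
```

Operational programs more often use characteristic-curve methods such as Stocking-Lord, but the underlying idea of rescaling onto a common metric is the same.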
The department and AIR are conducting activities in March to finalize the computer programming that will be used in scoring, and they will run simulations using this programming. In April and early May, a representative sample of students from across the state in each grade and subject will have their tests scored first.

External Analyses

To ensure the accuracy of any contractor's system, the department requires external analyses that duplicate and verify the contractor's work. For spring 2015, the Human Resources Research Organization (HumRRO) independently duplicates the calibration process, and the Buros Center for Testing (BUROS) monitors handscoring, scanning, calibration, and equating activities. BUROS will provide a comprehensive report in late summer or early fall on the quality control processes in place, including any recommendations for improvement.

Final Scoring and Reporting

Throughout May and the first week of June, scoring activities will continue for all students in all grades and subjects using the processes described above. Scores will be reported by the week of June 8, and through the processes detailed here, the department will have ensured that each student's score is accurate, valid, and reliable.