LEARNING POINT Reliability and validity: How do these concepts influence accurate student assessment?

©November 2019 | This information is aligned with the Assessment Literacy Standards at michiganassessmentconsortium.org

There are two concepts central to accurate student assessment: reliability and validity. Very briefly explained, reliability refers to the consistency of test scores, whereas validity refers to the degree to which a test measures what it purports to measure. These very basic definitions have some utility, but these two concepts are so fundamental to educational measurement that assessment-literate individuals should have a deeper understanding of each concept.

What do we mean by reliability?

A real-world example makes it easy to see why reliability, or consistency of scores, is important. Suppose someone bought a new bathroom scale. Upon unpacking the scale and using it for the first time, imagine the owner's surprise in finding that the scale reads 50 pounds heavier than expected. A second reading shows a weight that is 60 pounds lighter than expected. Clearly, there is something wrong with this scale! It is not measuring the person's weight accurately or consistently. This is the basic idea behind reliability for educational measurements.

In practice, educational assessment is not quite as straightforward as the bathroom scale example. There are different ways to think of, and express, the reliability of educational tests. Depending on the format and purpose of the test, different types of reliability will be more or less appropriate to report.

Test-retest reliability is simply the correlation between the test scores of people who take the same test twice. Ideally, a person would receive the same score each time they take the same test, assuming they have neither learned any new, relevant content nor forgotten any relevant content between the two administrations. This assumption is often not realistic.

Internal consistency reliability can be thought of as the degree to which responses to all the items on a test "hang together." This is expressed as a correlation between 0 and 1, with values closer to 1 indicating higher reliability or internal consistency within the measure. Internal consistency values are impacted by the structure of the test. Tests that consist of more homogeneous items (i.e., items that all measure the same thing) will have higher internal consistency values than tests that have more divergent content. For example, we would expect a test that contains only American history items to have higher reliability (internal consistency) values than a test that contains American history, African history, and political science questions, even if both tests are equally well built.

Inter-rater reliability describes the degree of agreement between scores of different raters on the same student performance. Inter-rater reliability is particularly important for rubric-scored items such as written response or performance assessments. This type of reliability is important because if a written response item or performance task doesn't have inter-rater reliability, who evaluates a performance may have a bigger impact on the score than the actual performance itself. No one wants to be judged by the harshest scorer, particularly if the assessment is high stakes.

The calculation of the various types of reliability, the requirements for test construction, and considerations relevant to the judgment of appropriateness of reliability values are beyond the scope of this Learning Point. Understanding that there is nuance in calculation and interpretation of reliability is what is important.

To learn more

• Assessment Literacy for Educators in a Hurry, by James Popham. ASCD (2018). http://bit.ly/2Gb8ulD
• Standards for Educational and Psychological Testing. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985). http://apa.org/science/programs/testing/standards
• "Scoring rubric development: Validity and reliability," by Barbara M. Moskal and Jon A. Leydens. Practical Assessment, Research & Evaluation, 7(10) (2000). https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1093&context=pare

What do we mean by validity?

The assessment of a test's validity is even more complicated and nuanced than the calculation of reliability. Stated in an over-simplified way, reliability is like a spreadsheet in which we can calculate a correlation value. Validity, on the other hand, is like a courtroom in which the test maker has to make a case for the validity of a test.

Perhaps the most important thing to know is this: a test is not valid or invalid, per se. A test is valid or invalid only for a stated purpose. One cannot assess the validity of a test unless the purpose of that test is made clear. Additionally, a test may be valid for one purpose but invalid for another. As an example, a test might provide valid estimates of student achievement of content, but it might not be a valid test for judging the quality of a school. A validity argument must be made for each stated purpose or use of a test, and supportive evidence must be collected to "validate" each intended use of the assessment.

Historically, it was thought that there were different types of validity. Using a biology test as a context, historical notions of validity included:

• Face Validity - The biology test looks like a biology test
• Content Validity - Biology experts agree that the test is a biology test
• [...] Validity - Scores on the biology test correlate with other aspects of knowledge of biology
• Predictive Validity - Scores on the biology test correlate with and predict scores on some other test of biology or another factor
• Construct Validity - Results of the test help describe the construct of biology knowledge (usually supported by use of a statistical method such as factor analysis)
• Consequential Validity - The decisions made based on the results of the biology test are appropriate and constructive

As time and research progressed, more unified theories of validity caught on. Rather than different types of validity, specialists came to feel that different types of evidence are required to establish validity. This means that each intended use of an assessment has to be supported by evidence. Validity is thought of now in terms of the appropriateness of decisions based on test results. Basically, it is currently thought that consequential validity subsumes all the other notions of validity. This interpretation is aided by a knowledge of the historical notions of validity, in that they can help guide one in making the case for appropriate use of the test results for a specific purpose. Someone tasked with developing a validity argument can use the historical notions of validity to determine what type of evidence should be collected and brought to bear in support of the proposed use for the assessment.
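
As a rough illustration of two of the reliability ideas described earlier, the sketch below computes a test-retest correlation and a simple rate of agreement between two raters. The scores, the function names, and the choice of a Pearson correlation and exact agreement are all assumptions made for illustration; this Learning Point does not prescribe a particular statistic, and operational testing programs generally rely on more refined methods.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var_x * var_y)


def exact_agreement(rater_a, rater_b):
    """Share of performances on which two raters gave the same rubric score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty lists of ratings")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


# Hypothetical scores for five students who took the same test twice.
first_sitting = [72, 85, 90, 64, 78]
second_sitting = [70, 88, 91, 60, 80]

# Hypothetical rubric scores (1-4) from two raters on six performances.
rater_a = [3, 4, 2, 4, 1, 3]
rater_b = [3, 4, 3, 4, 1, 2]

test_retest = pearson_r(first_sitting, second_sitting)
agreement = exact_agreement(rater_a, rater_b)

print(f"Test-retest correlation: {test_retest:.2f}")
print(f"Exact inter-rater agreement: {agreement:.2f}")
```

A correlation near 1 would suggest that students are ranked similarly on both sittings, and a high agreement rate would suggest that the score depends little on who did the scoring. Neither number says anything about validity, which still has to be argued for each intended use of the results.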

The Michigan Assessment Consortium’s Assessment Learning Network (ALN) is a professional learning community consisting of members from MI’s professional education organizations; the goal of the ALN is to increase the assessment literacy of all of Michigan’s professional educators.
