Reliability and Generalization Studies Reliability

RELIABILITY AND GENERALIZATION STUDIES RELIABILITY: ARGUMENTS FOR MULTIPLE PERSPECTIVES AND POTENTIAL PROBLEMS WITH GENERALIZATION ACROSS STUDIES DIMITER M. DIMITROV Kent State University The present article addresses reliability issues in light of recent studies and debates focused on psychometrics versus datametrics terminology and reliability generalization (RG) introduced by Vacha-Haase. The purpose here was not to moderate arguments presented in these debates but to discuss multiple perspectives on score reliability and how they may affect research practice, editorial policies, and RG across studies. Issues of classical error variance and reliability are discussed across models of classical test theory, generalizability theory, and item response theory. Potential problems with RG across studies are discussed in relation to different types of reliability, different test forms, different number of items, misspecifications, and confounding independent vari- ables in a single RG analysis. The editorial policies of Educational and Psychological Measurement (EPM) have become a focal point of recent studies and debates related to important issues of reliability, effect size, and confidence intervals in reporting research. The purpose of this article was not to reiterate or moderate datametrics-psychometrics arguments presented in these debates (e.g., Thompson, 1994; Sawilowski, 2000; Thompson & Vacha-Haase, 2000) but to discuss issues concerning the assessment of score reliability and how they may affect (current or future) research practice, editorial policies, and reliability generalization (RG) across studies introduced by Vacha-Haase (1998). The discussion is organized in three parts. The first part presents multiple perspectives on classical standard error of measurement (SEM) and reli- I would like to thank the two anonymous reviewers and an acting editor for their helpful comments and suggestions. Correspndence regarding this article should be addressed to Dimiter M. Dimitrov, Ph.D., 507 White Hall, College of Education, Kent State University, Kent, OH 44242- 0001. Educational and Psychological Measurement, Vol. 62, No. 5, October 2002 783-801 DOI: 10.1177/001316402236878 © 2002 Sage Publications 783 784 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT ability across traditional and modern measurement models. The second part addresses potential problems with RG across studies in light of issues discussed in the first part. The conclusion summarizes major points of the discussion in the first two parts in an attempt to provide some insight for the practice of substantive research and RG across studies. In response to EPM editorial policies stating that “scores, not tests, are re- liable” (Thompson, 1994; Thompson & Daniel, 1996), Sawilowski (2000) addressed problems inherent with datametrics from the perspectives of an his- torical review of reliability terminology and textbook treatments of reliability. He recognized that statements related to “reliability of the test" are not sufficiently informative because “reliability paradigms and their coefficients are simply not interchangeable” ( p. 159). He indicated, however, that reliability estimates should not be tied only to the data at hand: Indeed, the purpose for using a nationally representative scientifically selected random sample when conducting reliability studies during t he test construc- tion process is to obtain the reliability estimate of a test for general purposes. In my view, authors ought always to report these reliability coefficients from manuals and other sources along with the reliability estimate obtained from, and a description of, the researcher’s own sample. (p. 170) This point of view, based on numerous treatments of reliability (e.g., Crocker & Algina, 1986; Goodenough, 1949; Gronlund, 1988; Nunnally & Bernstein, 1994; Payne, 1975; Suen, 1990), is also represented by Traub (1994) who noted that “the usual reliability experiments provide a sample estimate of the reliability coefficient for the population [italics added]” (p. 66). On their side, Thompson and Vacha-Haase (2000) recognized that the treat- ment of reliability should not be isolated from the context of the measurement theory because different ways of estimating reliability involve different sources of error. The author of this article, therefore, argues that the under- standing of reliability and related policies for calculating and reporting reliability should be based on multiple perspectives about population reliability and its sample estimates. Inferences from such perspectives can also be used to identify potential problems with RG across studies. Reliability in Classical Test Theory Cronbach’s α and Reliability of Congeneric Measures Under classical test theory, it is important to make distinctions between models that are parallel, essentially τ equivalent, or congeneric (e.g., Crocker & Algina, 1986; Feldt & Brennan, 1989; Lord & Novick, 1968; Raykov, 1997). DIMITROV 785 Thompson and Vacha-Haase (2000) referred to these models and the possi- bility of testing their assumptions using structural equation modeling (Jöreskog & Sörbom, 1989), but the EPM guidelines editorial of Fan and Thompson (2001) did not elaborate on this issue in discussing confidence intervals for reliability estimates. Therefore, some specific comments seem appropriate. Let us have the test divided into k components; 2 #k #n, where n is the number of items in the test. By assumption, the score Xi on each component can be represented as Xi = Ti + Ei, where Ti and Ei denote the true score and error score, respectively. The test components are called parallel if any two items i and j have equal true scores, Ti = Tj, and equal error variances, Var(Ei) = Var(Ej). With the assumption of error variances dropped, the test components are called (a) essentially tau-equivalent, if Ti = Tj + aij, where aij … 0 is a constant; and (b) congeneric if Ti = bijTj + aij, where bij is a constant (bij …0, bij …1). The reliability of the test for a population, ρxx, is defined as the ratio of true score variance to observed score variance, Var(Ti)/Var(Xi), or (equivalently) as the squared correlation between true and observed scores (e.g., Lord & Novick, 1968, p. 57). In empirical research, however, true scores cannot be directly determined and, therefore, the reliability is typically estimated by coefficients of internal consistency, test-retest, alternate forms, and other reliability methods adopted in the psychometric literature (e.g., Crocker & Algina, 1986). One of the most frequently used method of estimating internal consistency reliability is Cronbach’s coefficient α (Cronbach, 1951). It is important to emphasize that confidence intervals for Cronbach’s α (or other reliability estimates) should also be calculated and reported for the data in hand with substantive studies. Classical approaches for determining confidence intervals about score reliability coefficients are discussed, for example, in an EPM editorial (Fan & Thompson, 2001). It should be noted, however, that (even at population level) Cronbach’s α is an accurate estimate of reliability, ρxx, only if there is no correlation among errors and the test components are at least essentially tau-equivalent (Novick & Lewis, 1967). As a reminder, tau equivalency implies that the test components measure the same trait and their true scores have equal variances in the population of respondents. When errors do not correlate but the test components are not essentially tau-equivalent, Cronbach’s α will underestimate ρxx. If, however, there is a correlation among error scores, Cronbach’s α may sub- stantially overestimate ρxx (see Komaroff, 1997; Raykov, 2001; Zimmerman, Zumbo, & Lalonde, 1993). Correlated error may occur, for example, with adjacent items in a multicomponent instrument, with items related to a com- mon stimulus (e.g., same paragraph or graph), or with tests presented in a speeded fashion (Komaroff, 1997; Raykov, 2001; Williams & Zimmerman, 1996). 786 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT Evidently, it is important that EPM editorial policies address this issue and that related discussions elaborate on congeneric measures when assumptions of parallel or essentially tau- equivalent test components do not hold. More- over, previous research provides sound theoretical and practical treatments of reliability for congeneric measures. For example, Raykov (2001, 1997) pro- posed structural equation modeling methods for obtaining confidence intervals of reliability with congeneric tests as well as computer applications of these methods with EQS (Bentler, 1995). Relatively simple procedures for determining weights that maximize reliability under a congeneric model are provided, for example, by Wang (1998). As a reminder, congeneric tests are measures of the same latent trait, but they may have different scale origins and units of measurement, and may vary in precision (Jöreskog, 1971). Thus, congeneric measures facilitate aggregation or comparison of reliability coefficients across different scales and forms of measurement instruments. This can be very beneficial for RG across studies that use and report congeneric scales. Although congeneric measures were not widely used in previously published studies, investigators and editorial policies should target applications of such measures in future research. Test-Retest Reliability Test-retest reliability is estimated by the correlation between the observed scores of the same people

Reliability and Generalization Studies Reliability

What Is Test Reliability/Precision?

Reliability and Validity Analysis of Statistical Reasoning Test Survey Instrument Using the Rasch Measurement Model

Accuracy Vs. Validity, Consistency Vs. Reliability, and Fairness Vs

Eurasian Journal of Educational Research

Confidence Intervals for Population Reliability Coefficients: Evaluation of Methods, Recommendations, and Software for Composite Measures

Some Aspects of Classical Reliability Theory & Classical Test Theory

On the Application of Design of Experiments to Accelerated Life Testing

Reliability, Repeatability, and Reproducibility of Pulmonary Transit Time Assessment by Contrast Enhanced Echocardiography Ingeborg H

Reliability from Α to Ω: a Tutorial

Reliability and Repeatability of the Assessment of Stress in Nursing Students Instrument in Podiatry Students: a Transcultural Adaptation

Reliability Coefficients and Generalizability Theory

Classical Test Theory and Item Response Theory Journal of Education; ISSN 2298-0172