

RELIABILITY: ARGUMENTS FOR MULTIPLE PERSPECTIVES AND POTENTIAL PROBLEMS WITH GENERALIZATION ACROSS STUDIES

DIMITER M. DIMITROV Kent State University

The present article addresses reliability issues in light of recent studies and debates focused on psychometrics versus datametrics terminology and reliability generalization (RG) introduced by Vacha-Haase. The purpose here was not to moderate arguments presented in these debates but to discuss multiple perspectives on score reliability and how they may affect research practice, editorial policies, and RG across studies. Issues of classical error variance and reliability are discussed across models of classical test theory, generalizability theory, and item response theory. Potential problems with RG across studies are discussed in relation to different types of reliability, different test forms, different numbers of items, misspecifications, and confounding independent variables in a single RG analysis.

I would like to thank the two anonymous reviewers and an acting editor for their helpful comments and suggestions. Correspondence regarding this article should be addressed to Dimiter M. Dimitrov, Ph.D., 507 White Hall, College of Education, Kent State University, Kent, OH 44242-0001.

The editorial policies of Educational and Psychological Measurement (EPM) have become a focal point of recent studies and debates related to important issues of reliability, effect size, and confidence intervals in reporting research. The purpose of this article was not to reiterate or moderate the datametrics-psychometrics arguments presented in these debates (e.g., Thompson, 1994; Sawilowsky, 2000; Thompson & Vacha-Haase, 2000) but to discuss issues concerning the assessment of score reliability and how they may affect (current or future) research practice, editorial policies, and reliability generalization (RG) across studies introduced by Vacha-Haase (1998). The discussion is organized in three parts. The first part presents multiple perspectives on classical standard error of measurement (SEM) and reliability across traditional and modern measurement models. The second part addresses potential problems with RG across studies in light of issues discussed in the first part. The conclusion summarizes major points of the discussion in the first two parts in an attempt to provide some insight for the practice of substantive research and RG across studies.

In response to EPM editorial policies stating that “scores, not tests, are reliable” (Thompson, 1994; Thompson & Daniel, 1996), Sawilowsky (2000) addressed problems inherent with datametrics from the perspectives of an historical review of reliability terminology and textbook treatments of reliability. He recognized that statements related to “reliability of the test” are not sufficiently informative because “reliability paradigms and their coefficients are simply not interchangeable” (p. 159). He indicated, however, that reliability estimates should not be tied only to the data at hand:

Indeed, the purpose for using a nationally representative scientifically selected random sample when conducting reliability studies during the test construction process is to obtain the reliability estimate of a test for general purposes. In my view, authors ought always to report these reliability coefficients from manuals and other sources along with the reliability estimate obtained from, and a description of, the researcher’s own sample. (p. 170)

This point of view, based on numerous treatments of reliability (e.g., Crocker & Algina, 1986; Goodenough, 1949; Gronlund, 1988; Nunnally & Bernstein, 1994; Payne, 1975; Suen, 1990), is also represented by Traub (1994), who noted that “the usual reliability experiments provide a sample estimate of the reliability coefficient for the population [italics added]” (p. 66). For their part, Thompson and Vacha-Haase (2000) recognized that the treatment of reliability should not be isolated from the context of the measurement theory because different ways of estimating reliability involve different sources of error. The author of this article, therefore, argues that the understanding of reliability and related policies for calculating and reporting reliability should be based on multiple perspectives about population reliability and its sample estimates. Inferences from such perspectives can also be used to identify potential problems with RG across studies.

Reliability in Classical Test Theory

Cronbach’s α and Reliability of Congeneric Measures

Under classical test theory, it is important to make distinctions between models that are parallel, essentially tau-equivalent, or congeneric (e.g., Crocker & Algina, 1986; Feldt & Brennan, 1989; Lord & Novick, 1968; Raykov, 1997).

Thompson and Vacha-Haase (2000) referred to these models and the possibility of testing their assumptions using structural equation modeling (Jöreskog & Sörbom, 1989), but the EPM guidelines editorial of Fan and Thompson (2001) did not elaborate on this issue in discussing confidence intervals for reliability estimates. Therefore, some specific comments seem appropriate. Let the test be divided into k components, 2 ≤ k ≤ n, where n is the number of items in the test. By assumption, the score X_i on each component can be represented as X_i = T_i + E_i, where T_i and E_i denote the true score and error score, respectively. The test components are called parallel if any two components i and j have equal true scores, T_i = T_j, and equal error variances, Var(E_i) = Var(E_j). With the assumption of equal error variances dropped, the test components are called (a) essentially tau-equivalent if T_i = T_j + a_ij, where a_ij ≠ 0 is a constant, and (b) congeneric if T_i = b_ij T_j + a_ij, where b_ij is a constant (b_ij ≠ 0, b_ij ≠ 1).

The reliability of the test for a population, ρ_xx, is defined as the ratio of true score variance to observed score variance, Var(T_i)/Var(X_i), or (equivalently) as the squared correlation between true and observed scores (e.g., Lord & Novick, 1968, p. 57). In empirical research, however, true scores cannot be directly determined, and therefore the reliability is typically estimated by coefficients of internal consistency, test-retest, alternate forms, and other reliability methods adopted in the psychometric literature (e.g., Crocker & Algina, 1986). One of the most frequently used methods of estimating internal consistency reliability is Cronbach’s coefficient α (Cronbach, 1951). It is important to emphasize that confidence intervals for Cronbach’s α (or other reliability estimates) should also be calculated and reported for the data in hand with substantive studies. Classical approaches for determining confidence intervals about score reliability coefficients are discussed, for example, in an EPM editorial (Fan & Thompson, 2001).

It should be noted, however, that (even at the population level) Cronbach’s α is an accurate estimate of reliability, ρ_xx, only if there is no correlation among errors and the test components are at least essentially tau-equivalent (Novick & Lewis, 1967). As a reminder, tau equivalency implies that the test components measure the same trait and their true scores have equal variances in the population of respondents. When errors do not correlate but the test components are not essentially tau-equivalent, Cronbach’s α will underestimate ρ_xx. If, however, there is a correlation among error scores, Cronbach’s α may substantially overestimate ρ_xx (see Komaroff, 1997; Raykov, 2001; Zimmerman, Zumbo, & Lalonde, 1993). Correlated error may occur, for example, with adjacent items in a multicomponent instrument, with items related to a common stimulus (e.g., the same paragraph or graph), or with tests presented in a speeded fashion (Komaroff, 1997; Raykov, 2001; Williams & Zimmerman, 1996).
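As a point of reference for the discussion above, the following minimal sketch (with invented item scores) computes Cronbach’s α from a persons-by-items matrix using the usual variance-ratio formula. It illustrates the coefficient itself, not any procedure from the studies cited here, and it says nothing about whether the tau-equivalence and uncorrelated-error assumptions hold for a given data set.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a persons-by-items score matrix:
    alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item across persons
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total test scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Invented scores: 6 persons by 4 items
X = np.array([[4, 3, 4, 5],
              [2, 2, 3, 2],
              [5, 4, 5, 5],
              [3, 3, 2, 3],
              [1, 2, 1, 2],
              [4, 4, 4, 3]], dtype=float)
print(round(cronbach_alpha(X), 3))
```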


Evidently, it is important that EPM editorial policies address this issue and that related discussions elaborate on congeneric measures when assumptions of parallel or essentially tau-equivalent test components do not hold. Moreover, previous research provides sound theoretical and practical treatments of reliability for congeneric measures. For example, Raykov (1997, 2001) proposed structural equation modeling methods for obtaining confidence intervals of reliability with congeneric tests, as well as computer applications of these methods with EQS (Bentler, 1995). Relatively simple procedures for determining weights that maximize reliability under a congeneric model are provided, for example, by Wang (1998). As a reminder, congeneric tests are measures of the same latent trait, but they may have different scale origins and units of measurement and may vary in precision (Jöreskog, 1971). Thus, congeneric measures facilitate aggregation or comparison of reliability coefficients across different scales and forms of measurement instruments. This can be very beneficial for RG across studies that use and report congeneric scales. Although congeneric measures were not widely used in previously published studies, investigators and editorial policies should target applications of such measures in future research.
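Raykov’s confidence-interval methods require fitting a congeneric factor model in SEM software such as EQS. As a rough, hedged illustration of the quantity such analyses target, the sketch below computes the composite reliability of a unit-weighted sum from hypothetical factor loadings and error variances, assuming uncorrelated errors; the numbers are invented, and the expression shown is the standard congeneric composite-reliability formula rather than Raykov’s estimation procedure itself.

```python
import numpy as np

def composite_reliability(loadings, error_variances):
    """Reliability of a unit-weighted composite under a congeneric model with
    uncorrelated errors: (sum of loadings)^2 / [(sum of loadings)^2 + sum of error variances]."""
    lam_sum_sq = float(np.sum(loadings)) ** 2   # true-score variance of the composite
    return lam_sum_sq / (lam_sum_sq + float(np.sum(error_variances)))

# Hypothetical congeneric items: same trait, different units and precision
loadings = [0.9, 0.7, 0.6, 0.8]
error_variances = [0.40, 0.55, 0.70, 0.45]
print(round(composite_reliability(loadings, error_variances), 3))
```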

Test-Retest Reliability

Test-retest reliability is estimated by the correlation between the observed scores of the same people taking the same test (or parallel test forms) twice. However, this procedure may run into problems such as carry-over effects due to memory and/or practice (Cronbach & Furby, 1970; Lord & Novick, 1968). Test-retest reliability estimates are most appropriate for measuring traits that are stable across the time period between the two test administrations (e.g., work values, visual or auditory acuity, and personality). It is important to note that test-retest reliability and internal consistency reliability are independent concepts. Basically, they are affected by different sources of error, and therefore it may happen that measures with low internal consistency have high temporal stability and vice versa (Anastasi & Urbina, 1997; Nunnally & Bernstein, 1994). Previous research on stability in reliability showed that the test-retest correlation coefficient can serve reasonably well as a surrogate for the classical reliability coefficient if an essentially tau-equivalent model with equal error variances or a parallel model is present (Tisak & Tisak, 1996). In fact, recent RG studies also report substantial reliability variation due to the type of reliability coefficient (e.g., Vacha-Haase, 1998; Yin & Fan, 2000).

Alternate Form Reliability

Alternate form reliability is a measure of the consistency of scores across comparable forms of a test administered to the same group of individuals (Crocker & Algina, 1986). Ideally, the alternate forms should be parallel, but this is difficult to achieve, especially with personality measures. The correlation between observed scores on two alternate test forms is usually referred to as the coefficient of equivalence. Basically, alternate form reliability and internal consistency reliability are affected by different sources of error. If the correlation between alternate forms is much lower than the internal consistency coefficient (e.g., by 0.20 or more), this might be due to (a) differences in content, (b) subjectivity of scoring, and (c) changes in the trait being measured over the time between the two administrations of the alternate forms. To determine the relative contribution of these sources of error, it is advisable to administer the two alternate forms (a) on the same day for some respondents and (b) within a 2-week time interval for others (see Dimitrov, Rumrill, Fitzgerald, & Hennessey, 2001; Nunnally & Bernstein, 1994). The comments in this section are also important for the discussion of RG across studies later in this article.
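A small illustration of the comparison suggested above: the coefficient of equivalence is simply the correlation between observed scores on the two forms, which can then be set against a reported internal consistency estimate. The scores and the alpha value below are hypothetical.

```python
import numpy as np

# Invented total scores for the same 8 respondents on two alternate forms
form_a = np.array([23, 31, 18, 27, 35, 22, 29, 25], dtype=float)
form_b = np.array([25, 29, 20, 24, 33, 19, 31, 27], dtype=float)

# Coefficient of equivalence: correlation between observed scores on the two forms
equivalence = np.corrcoef(form_a, form_b)[0, 1]

alpha_form_a = 0.88  # internal consistency reported for one form (hypothetical value)
if alpha_form_a - equivalence >= 0.20:
    print(f"equivalence = {equivalence:.2f}: examine content differences, scoring "
          "subjectivity, and possible trait change between administrations")
else:
    print(f"equivalence = {equivalence:.2f}: the forms appear reasonably comparable")
```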

Reliability in Generalizability Theory

Generalizability (G) theory simultaneously takes into account all available error sources (facets), such as items, raters, test forms, and occasions, that influence the reliability for either relative or absolute (criterion-related) decisions (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). Classical test theory takes into account only one facet (e.g., items) at a time and provides estimates of reliability only for relative decisions. The one-facet crossed design “persons × items” (p × i) in G theory corresponds to the classical model in which persons respond to items. With this design, the raw-score variance, σ²_x, is partitioned into three independent variance components: (a) σ²_i, which indicates how much items differ in difficulty; (b) σ²_p, which indicates how much persons differ in their responses; and (c) σ²_(pi,e), for the confounding effect of the p × i interaction with other sources of variation, e. The error variance for relative decisions is σ²_Rel = σ²_(pi,e)/n, and the error variance for absolute decisions is σ²_Abs = (σ²_i + σ²_(pi,e))/n, where n is the number of items. With many-facet designs, σ²_Rel is defined as all variance components that affect the relative standing of persons and σ²_Abs as all variance components (except for the object of measurement, σ²_p) that contribute to measurement error. In G theory, the reliability coefficients for relative decisions (generalizability coefficient, ρ²_Rel) and absolute decisions (index of dependability, Φ) are defined as follows (e.g., Shavelson & Webb, 1991, p. 93):

$$\rho_{Rel}^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{Rel}^2} \qquad \text{and} \qquad \Phi = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{Abs}^2} \tag{1}$$

Using the results from a G study, one can also control SEM and reliability levels by manipulating the conditions of each facet in a decision study. Previous research provided numerous theoretical, methodological, and practical treatments of reliability in G theory (e.g., Brennan, 1983; Clauser, Clyman, & Swanson, 1999; Crick & Brennan, 1983; Erlich & Shavelson, 1978; Shavelson & Webb, 1991; Sun, Valiga, & Gao, 1997). Some advantages of the G theory reliability coefficients are also discussed by Thompson and Vacha-Haase (2000).
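For the one-facet crossed p × i design described above, the variance components can be estimated from the usual two-way random-effects mean squares and then combined as in Equation 1. The sketch below is a minimal illustration with invented data; dedicated software such as GENOVA (Crick & Brennan, 1983) would normally be used in practice.

```python
import numpy as np

def g_study_p_by_i(scores: np.ndarray):
    """Variance components and G coefficients for a one-facet crossed p x i design.

    scores: persons-by-items matrix. Uses the expected-mean-square equations for
    a two-way random-effects design without replication, so the p x i interaction
    is confounded with residual error (pi,e).
    """
    n_p, n_i = scores.shape
    grand = scores.mean()
    ss_p = n_i * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_i = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_pi = ((scores - grand) ** 2).sum() - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    var_pi_e = ms_pi                         # sigma^2(pi,e)
    var_p = max((ms_p - ms_pi) / n_i, 0.0)   # sigma^2(p): person (true-score) variance
    var_i = max((ms_i - ms_pi) / n_p, 0.0)   # sigma^2(i): item difficulty variance

    var_rel = var_pi_e / n_i                 # error variance for relative decisions
    var_abs = (var_i + var_pi_e) / n_i       # error variance for absolute decisions
    g_coef = var_p / (var_p + var_rel)       # generalizability coefficient, Equation 1
    phi = var_p / (var_p + var_abs)          # index of dependability, Equation 1
    return {"var_p": var_p, "var_i": var_i, "var_pi_e": var_pi_e,
            "E_rho2": g_coef, "Phi": phi}

# Illustrative 5-person by 4-item data
X = np.array([[5, 4, 3, 4],
              [3, 3, 2, 3],
              [4, 4, 4, 5],
              [2, 1, 2, 2],
              [4, 3, 3, 4]], dtype=float)
print(g_study_p_by_i(X))
```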

Item Response Theory Perspective on Classical SEM and Reliability

In item response theory (IRT), the term ability (or trait, for psychological constructs) connotes a latent trait that underlies the responses of the participants to items of the measurement instrument. The ability score, θ, of an examinee on an achievement test determines the probability that this examinee answers any test item correctly (Lord, 1980). Whereas classical test theory is concerned with the accuracy of observed scores, IRT is concerned with the accuracy of ability scores. In IRT, the conditional error variance at an ability level θ is inversely related to the amount of information, I(θ), provided by a test at θ (e.g., Lord, 1980; Samejima, 1977). This section, however, does not deal with the accuracy of θ but, instead, with classical SEM and reliability from an IRT perspective.

Lord (1957) noted that “a reliability coefficient depends on the spread of ability among the examinees, as well as the number of items in the test, the discrimination power of the items, and other factors” (p. 510). Lord (1980) related classical SEM and reliability coefficients to IRT estimates of item parameters and ability. He presented the error variance of number-right scores on a test of n dichotomous items as the mean of conditional error variances at the ability levels of N examinees, θ_1, ..., θ_N (Lord, 1980, p. 52):

$$\hat{\sigma}_e^2 = \frac{1}{N}\sum_{j=1}^{N}\sigma_{x|\theta_j}^2 = \frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{n} P_i(\theta_j)\,[1 - P_i(\theta_j)] \tag{2}$$

In Equation 2, P_i(θ_j), the probability of a correct answer on item i from an examinee with ability θ_j, is calculated with the respective IRT model. The product P_i(θ_j)[1 − P_i(θ_j)] is the conditional error variance for the dichotomous item i at ability level θ_j. The square root of the error variance in Equation 2 is a classical (sample-based) SEM. The population error variance (the expected value of the sample error variance) is obtained by replacing the summation over discrete ability scores, θ_1, ..., θ_N, with integration over the ability continuum:

$$\sigma_e^2 = \sum_{i=1}^{n}\int_{-\infty}^{\infty} P_i(\theta)\,[1 - P_i(\theta)]\,\varphi(\theta)\,d\theta, \tag{3}$$

where φ(θ) is the probability density function (pdf) of the ability distribution. The integration is from −∞ to ∞ because the ability, θ, is not limited in the theoretical framework of IRT. The ability scores typically vary from −4 to 4 on the logit scale (e.g., Lord, 1980).

An important property of Equation 3 is that the integral components in the summation over the n items represent additive yet independent contributions of the individual items to the classical error variance. This feature, not available with other methods of estimating classical SEM and error variance, may be very useful for the practice of test development and reliability analysis. For example, given the IRT calibration of items (available with most standardized instruments), one can select items to develop a measurement instrument with a prespecified error variance. Then, one can also determine the classical SEM and reliability, ρ_xx, from the relationships

$$SEM = \sqrt{\sigma_e^2} \qquad \text{and} \qquad \rho_{xx} = 1 - \sigma_e^2/\sigma_x^2. \tag{4}$$

May and Nicewander (1993) used Equation 3 in comparing reliability for number-right scores and percentile ranks. Dimitrov (2001a, 2001b) provided exact formulas for the classical error variance in Equation 3 under different distributions (e.g., normal, triangular, and logistic) of the ability (trait) scores for a population of participants. From a Rasch measurement perspective, Wright (2001) also noted that “once the test items are calibrated, the standard error corresponding to every possible raw score can be estimated without further data collection” (p. 786). For a 14-item test, for example, he found that the “sample-independent” reliability for the test (with a separation index of 4) was .94, whereas the empirical reliability was .62, thus “grossly underestimating the test’s measurement effectiveness” (p. 786).

Evidently, classical SEM and reliability can be estimated from IRT information about item parameters and the trait distribution for the population. Such information is available (or easy to obtain) with standardized measurement instruments. This perspective on SEM and reliability has important implications for understanding, estimating, and comparing the accuracy of measurements. For example, the error variance in Equation 3 is a classical match of the marginal error variance that is used in computer adaptive testing for comparison between the internal consistency reliability of a computer adaptive test and alternatively used paper-and-pencil forms (e.g., Thissen, 1990).
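The following sketch evaluates Equation 3 numerically for a hypothetical 10-item test under a two-parameter logistic model with a standard normal trait distribution and then applies Equation 4. The item parameters are invented, and obtaining the observed-score variance as the sum of true-score (expected number-right) variance and error variance is an assumption made here to close the computation, not a step prescribed in the text.

```python
import numpy as np

def p_2pl(theta, a, b):
    """Two-parameter logistic item response function (1.7 normal-ogive scaling)."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

# Hypothetical item parameters (discrimination a, difficulty b) for a 10-item test
a = np.array([1.2, 0.8, 1.0, 1.5, 0.9, 1.1, 0.7, 1.3, 1.0, 0.6])
b = np.array([-1.5, -1.0, -0.5, 0.0, 0.2, 0.5, 0.8, 1.0, 1.5, 2.0])

# Discretize a standard normal trait distribution on a fine grid of theta values
theta = np.linspace(-6.0, 6.0, 2001)
w = np.exp(-0.5 * theta**2)
w /= w.sum()                                 # quadrature weights approximating phi(theta)

P = p_2pl(theta[:, None], a, b)              # grid points x items
cond_err_var = (P * (1.0 - P)).sum(axis=1)   # sum_i P_i(theta)[1 - P_i(theta)]

# Equation 3: population classical error variance of number-right scores
sigma2_e = (cond_err_var * w).sum()

# Observed-score variance taken as true-score variance plus error variance,
# with the true score T(theta) = expected number-right score (an assumption here)
T = P.sum(axis=1)
sigma2_T = ((T - (T * w).sum()) ** 2 * w).sum()
sigma2_x = sigma2_T + sigma2_e

# Equation 4: classical SEM and reliability
SEM = np.sqrt(sigma2_e)
rho_xx = 1.0 - sigma2_e / sigma2_x
print(f"SEM = {SEM:.3f}, reliability = {rho_xx:.3f}")
```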

Additional Perspectives on Reliability

Other types of reliability, such as interrater reliability, criterion-referenced reliability (classification consistency), and stability in reliability for repeated measurements, can also be treated from different perspectives. In previous research, interrater reliability has been estimated with correlations (e.g., Fan & Thompson, 2001), factorial designs (Winer, 1971), Cohen’s kappa (Cohen, 1960), extended (g-wise) kappas (Conger, 1980), G theory coefficients (e.g., Brennan & Prediger, 1981; Cooil & Rust, 1994), latent class models (e.g., Dillon & Mulani, 1984), and Rasch measurement models (e.g., Linacre, 1994). Criterion-referenced reliability has also been estimated through classical (e.g., Berk, 1980; Huynh, 1976; Peng & Subkoviak, 1980), G theory (Brennan & Kane, 1977), IRT (e.g., Hambleton & Novick, 1973), and latent class models (e.g., Van der Linden & Mellenbergh, 1977). Stability in reliability for repeated studies has been evaluated primarily through structural equation modeling (e.g., Jöreskog, 1971; Raykov, 2000; Tisak & Tisak, 1996). It should be noted that stability in reliability for repeated studies uses the same test and same examinees across time points, whereas RG across studies involves different samples and may use (which is questioned later in this article) different types of reliability, different test forms, and different numbers of items in a single analysis.

Reliability of measurement of change is another perspective on reliability, related to pre- and posttest scores. This type of reliability has been discussed extensively throughout the history of psychometrics (e.g., Collins, 1996; Cronbach & Furby, 1970; Kane, 1996; Linn & Slinde, 1977; Rogosa, 1995; Zimmerman & Williams, 1982). The common misconception that gain scores always have low reliability, and thus questionable value, still affects the way most substantive studies report and interpret their results. Zimmerman and Williams (1982) demonstrated that the reliability of gain scores can be consistently high when the pre- and posttest measures have unequal standard deviations and unequal reliability coefficients. Recently, Kane (1996) showed that the reliability of gain scores can be interpreted as a context-specific, between-examinees precision index.
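The point attributed to Zimmerman and Williams (1982) can be illustrated with the standard classical formula for the reliability of a difference score. The sketch below uses that general formula with invented values for the pre- and posttest standard deviations, reliabilities, and their correlation; it is not a reproduction of their worked examples.

```python
def difference_score_reliability(sd_pre, sd_post, rel_pre, rel_post, r_pre_post):
    """Classical reliability of the difference (gain) score D = posttest - pretest."""
    num = sd_pre**2 * rel_pre + sd_post**2 * rel_post - 2 * r_pre_post * sd_pre * sd_post
    den = sd_pre**2 + sd_post**2 - 2 * r_pre_post * sd_pre * sd_post
    return num / den

# Hypothetical case with unequal standard deviations and unequal reliabilities
print(round(difference_score_reliability(sd_pre=8.0, sd_post=14.0,
                                          rel_pre=0.75, rel_post=0.90,
                                          r_pre_post=0.50), 3))
```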

SEM and Reliability

Previous studies indicate that although the reliability coefficient is a convenient unitless number between 0 and 1, the SEM relates to the meaning of the scale and is, therefore, more useful for score interpretations (e.g., Feldt & Brennan, 1989; Gronlund, 1988; Thissen, 1990). The classical relationship between reliability and SEM in Equation 4 is based on the (generally false) assumption that the error variance is the same for all scores (e.g., Thissen, 1990; Thompson, 1992; Thorndike, 1982). The accuracy of measurement is therefore much better represented by conditional SEM estimates at different score levels. Procedures for obtaining such estimates with the classical theory are provided in previous research (e.g., Kolen, Hanson, & Brennan, 1992; Thorndike, 1982). Kolen et al. (1992) noted,

Estimation of conditional standard errors of measurement of scale scores is important, especially because the Standards for Educational and Psychological Testing (AERA, APA, & NCME [American Educational Research Association, American Psychological Association, & National Council on Measurement in Education], 1985), particularly Standard 2.10, recommends that conditional standard errors be reported. In addition, conditional standard errors of measurement are valuable to users of test results for constructing confidence intervals for reported scores and for communicating an index of measurement error. (p. 286)

Traditionally, the classical error variance is estimated as

$$\sigma_e^2 = \sigma_x^2\,(1 - \rho_{xx}), \tag{5}$$

where a sample-based estimate of reliability (e.g., Cronbach’s α) is substituted for ρ_xx. Whereas Equation 5 is technically appropriate for the data in hand with substantive studies, Equation 3 provides important population perspectives on the classical error variance. One can see, for example, that (a) the error variance does not depend on reliability estimates, and (b) given the shape of the trait distribution, the independent additive contribution of individual items to the error variance can be determined from their parameters (e.g., difficulty and discrimination). This perspective on classical error variance is consistent with empirical findings that the SEM remains fairly constant across samples, whereas the reliability coefficient is not, being dependent on the score variance in the sample tested (see Gronlund & Linn, 1990, p. 92; Sawilowsky, 2000, p. 164).
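A tiny sketch of the contrast just described: the conventional estimate in Equation 5 moves with whatever sample reliability estimate is plugged in, whereas Equation 3 depends only on item parameters and the trait distribution. The observed-score variance and reliability values below are hypothetical.

```python
import math

def classical_error_variance(score_variance, reliability):
    """Equation 5: error variance = observed-score variance * (1 - reliability)."""
    return score_variance * (1.0 - reliability)

sigma2_x = 36.0                       # hypothetical observed-score variance
for rel in (0.70, 0.80, 0.90):        # different plausible sample reliability estimates
    sem = math.sqrt(classical_error_variance(sigma2_x, rel))
    print(f"reliability = {rel:.2f} -> SEM = {sem:.2f}")
```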

Potential Problems With RG Across Studies

The comments on RG across studies in this section relate primarily to the meta-analytic tools used by Vacha-Haase (1998) and do not generalize over all possible ways of conducting RG. In fact, in response to the criticism of Sawilowsky (2000) of some of the analytic choices made by Vacha-Haase (1998), Thompson and Vacha-Haase (2000) noted, “We do not see RG as invoking a monolithic methodological approach” (p. 187) and “We do not see RG as involving always a single genre of analyses” (p. 185). Although I totally agree with this, there are several points to be made about the RG methodology introduced by Vacha-Haase (1998) and replicated in numerous RG studies recently published in EPM (e.g., Capraro, Capraro, & Henson, 2001; Caruso, 2000; Caruso, Witkiewitz, Belcourt-Dittloff, & Gottlieb, 2001; Henson, Kogan, & Vacha-Haase, 2001; Vacha-Haase, Kogan, Tani, & Woodall, 2001; Viswesvaran & Ones, 2000; Yin & Fan, 2000). Most arguments in this section are based on facts and comments provided in previous sections.

Using Different Types of Reliability

Using different types of reliability in a single RG analysis may lead to mixing apples and oranges, thus violating basic meta-analytic principles (e.g., Hunter & Schmidt, 1990). For example, Vacha-Haase (1998) used multiple regression with the dependent variable being either test-retest correlation coefficients or Cronbach’s α (or KR-20) coefficients. However, none of these coefficients will represent the classical reliability unless specific assumptions can be made. As a reminder, Cronbach’s α can either underestimate the reliability, when the measures are not at least essentially tau-equivalent (Novick & Lewis, 1967), or overestimate the reliability, when correlations among errors occur (Komaroff, 1997). Also, the test-retest correlation coefficient can be a reasonable estimate of reliability only if the measures are essentially tau-equivalent and have equal error variances (Tisak & Tisak, 1996). Thus, the dependent variable used by Vacha-Haase (1998) might be a mixture of apples and oranges, as implied also by her own conclusion that “the results ... indicate that internal consistency and test-retest reliability coefficients seem to present considerably different pictures of score quality” (p. 16). Yin and Fan (2000) also noted that

the results for partitioning the variances of the BDI [Beck Depression Inventory] score reliability estimates indicated that the type of reliability coefficients (internal consistency vs. test-retest reliability coefficients) and type of study participants ... were statistically significant and practically meaningful predictors in the general linear model with reliability estimates as the dependent variable. (p. 216)

Given that most published studies do not test for tau equivalency or other assumptions related to the reliability coefficients they report (when they report such coefficients at all), RG researchers should be cautious about mixing internal consistency and test-retest reliability coefficients in a single RG analysis.

Using Tests With Different Lengths

A problem with RG across studies may also result from using alpha coefficients for tests with different lengths. Some RG investigators use the Spearman-Brown formula to adjust alpha coefficients (e.g., Caruso, 2000). It should be noted, however, that the Spearman-Brown formula produces α coefficients only under the assumption of equal item variances (Charter, 2001). Although it is clear that RG researchers do not have access to raw item-level data and thus cannot test for equal item variances, it is recommended that they reflect this in the discussion of their results. Other RG studies do not adjust the α coefficients and instead use dummy coding for the number of items administered (e.g., Vacha-Haase, 1998; Viswesvaran & Ones, 2000; Yin & Fan, 2000). Let us also remember that the split-half approach to estimating internal consistency requires that the two parts of the test have equal variances (Kristof, 1970). One can also refer to Thorndike (1982, p. 148) for a statistical analysis of the effects of manipulating test length on true-score variance, error variance, and reliability, as well as to Alsawalmeh and Feldt (1999) for statistical tests of hypotheses about alphas extrapolated from the Spearman-Brown formula. Awareness of such effects can also help RG researchers adequately interpret and discuss RG study results.
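For the length adjustment mentioned above, the Spearman-Brown projection can be written as a one-line function. Per Charter (2001), the result equals α only under equal item variances, so in an RG context it should be treated as an approximation; the function name and example values here are illustrative.

```python
def spearman_brown_adjust(alpha, n_items_original, n_items_target):
    """Project a reliability coefficient to a different test length.

    The adjustment assumes, among other things, equal item variances,
    which RG analysts usually cannot verify from published reports.
    """
    k = n_items_target / n_items_original
    return (k * alpha) / (1 + (k - 1) * alpha)

# Example: alpha of .75 reported for a 10-item short form, projected to a 20-item form
print(round(spearman_brown_adjust(0.75, 10, 20), 3))  # 0.857
```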

Using Tests With Different Formats

Using different test forms in a single RG analysis can also cause problems because, as indicated by previous research, the reliability depends on factors such as item response format (Frisbie & Druva, 1986), the wording of stems (positively vs. negatively worded; Barnette, 2000), and the type of scale (Cook, Heath, Thompson, & Thompson, 2001). For example, in estimating the reliability of multiple true-false (MTF) tests, Frisbie and Druva (1986) noted,

If the MTF items in the same cluster [items sharing a common stem] are more highly intercorrelated than are MTF items that appear in different clusters, then some conventional internal analysis procedures for estimating test reliability may yield spurious results. The results may be spurious because the assumptions required by many of the procedures for estimating reliability include subtest equivalence or parallel items. The presence of systematic patterns of intercorrelations could violate either assumption. (p. 100)

Regarding the role of stem wording, Barnette (2000) noted that “negated items are not considered the exact opposite of directly worded items, and this is one of the major factors in the reduction of the reliability and validity of scores on surveys using mixed items” (p. 369). Cook et al. (2001) studied the dependence of reliability on the type of scale in Web- or Internet-based surveys, noting that

although greater score variance is possible when more score intervals are employed, if participants do not cognitively process so many score intervals when dealing with unnumbered graphic scales [a continuous line drawn between two antonyms], then the use of excessive intervals may not improve score reliability and might even lessen reliability. (p. 705)

Evidently, different test forms should be used with proper understanding and awareness of problems that they may cause in a single RG analysis.

Confounding Independent Variables

Another problem with RG across studies may result from inappropriate coding of groups (e.g., by gender, ethnicity, age, or status) or, even worse, from bias of the instrument against groups coded with independent variables. For example, Sawilowsky (2000) noted that the coding design for gender in the RG study of Vacha-Haase (1998) led to a confounding of independent variables because the gender of study participants was coded twice. First, gender was coded 1 when the study participants were both men and women and 0 when they were all men or all women. Then, gender was coded 1 when the study participants were all women and 0 when they were all men or both men and women. The negative effect of confounding independent variables such as gender and ethnicity will increase if the instrument is biased against a specific group in the context of some substantive studies used with the RG analysis. As a reminder, an instrument will be, for example, gender biased if male and female participants who share the same location on the trait being measured by the instrument consistently differ in their responses. Evidently, this will obscure the actual contribution of gender to the reliability variation across RG studies. Bias effects may occur in RG studies when, for example, the instrument is used in different contexts of age, language, and cultural experience (e.g., Yin & Fan, 2000).
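One way to avoid the double coding described above is to treat sample gender composition as a single categorical study-level predictor with mutually exclusive levels and derive non-redundant dummy variables from it. The sketch below, with invented study data, is offered as one possible coding scheme, not as the design used in any published RG analysis.

```python
import pandas as pd

# Invented study-level data: gender composition of each sample in an RG analysis
studies = pd.DataFrame({
    "study": ["A", "B", "C", "D", "E"],
    "composition": ["all_men", "mixed", "all_women", "mixed", "all_men"],
    "alpha": [0.78, 0.85, 0.81, 0.88, 0.74],
})

# One categorical predictor with mutually exclusive levels yields two
# non-redundant dummy variables (reference category: all_men), instead of
# two overlapping gender codings of the same samples
dummies = pd.get_dummies(studies["composition"], prefix="comp", drop_first=True)
print(studies.join(dummies))
```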

Sample-Related Misspecifications

Problems with RG may also result from misspecifications that occur when relevant characteristics of the study samples are not coded as independent variables in the RG analysis. For example, Sawilowsky (2000) noted that failure to consider random samples versus nonrandom samples led to misspecifications in the RG regression analysis of Vacha-Haase (1998). Another source of possible misspecifications, not addressed in previous RG studies or debates, relates to failure to take into account the underlying distribution of scores for the study samples. Yet, Lord (1984) noted that “unless the distribution of ability is known, it is not logically possible to get an unbiased estimate of the SEM for examinees who are chosen for consideration because they all have a given observed score” (p. 241). As one can also see from Equation 3, the classical error variance is affected by the distribution of the trait underlying the participants’ responses. Thus, composition factors that may affect the trait distribution of study samples are particularly relevant for the practice of RG. Information about such composition factors, if not provided with the studies used in RG, may come from previous theoretical or empirical findings. Avoiding misspecifications related to group homogeneity or differences in trait distributions for study samples (or composition subgroups) will reduce the risk of capitalization on chance in RG.

RG Across Studies Versus Accuracy of Measurement Within Studies

The results from RG across studies may provide potentially useful information about the variability of reliability coefficients, but they do not represent the accuracy of measurement for a specific study. Indeed, the accuracy of measurement for a study relates to information about the (study-specific) population reliability and its sample estimates. Within the classical framework, such information is provided, for example, by confidence intervals for alphas (Fan & Thompson, 2001; Pedney & Hubert, 1975) or by reliability for congeneric measures (Raykov, 1998), but not by the reliability box plots in RG across studies (e.g., Vacha-Haase, 1998, p. 13). When IRT calibration of items is available, the classical SEM can be determined with Equation 3 (e.g., Dimitrov, 2001a, 2001b).

It should be noted that although the unitless reliability coefficient is more useful for comparing the accuracy of measurement across scoring procedures, the SEM is more useful within a specific study because it provides scale-specific interpretations of the accuracy of measurement for that study. However, generalization of SEM across studies cannot be done with RG analyses. As Thompson and Vacha-Haase (2000) noted,

An RG analysis of SEMs across studies would not be reasonable when the meta-analysis looked at variations in score quality in a range of measures ... nor would an SEM focus be appropriate even when a single measure was investigated if both short and long forms were used in the prior studies. (p. 188)

It is difficult to understand, then, why some RG studies (e.g., Yin & Fan, 2000) include SEM (in addition to reliability) in a single RG analysis with different forms of the measurement instrument and different types of reliability coefficients. Moreover, the findings about SEMs in such RG studies do not relate to interpretational characteristics of SEMs but to theoretically based statistical artifacts. For example, Yin and Fan (2000) reported that “SEMs were not correlated with reliability estimates, but SEMs were substantially related to the standard deviations of the BDI scores in different studies” (p. 217).

Conclusion

The multiple perspectives on SEM and reliability addressed in this article have important implications for (a) understanding, estimating, and reporting the accuracy of scores within studies and (b) identifying potential problems with RG across studies. Clearly, arguments could be made for improving current and future practices of estimating and reporting SEM and reliability in substantive or measurement studies. The facts and comments provided in this article lend support to recent editorial policies of EPM about reporting confidence intervals for score reliability coefficients (Fan & Thompson, 2001). However, researchers should also be encouraged to calculate and report more appropriate estimates of SEM and reliability in light of the multiple perspectives on these two concepts in traditional and modern measurement models. For example, assumptions for the traditionally used Cronbach’s coefficient α should be checked and, if necessary, researchers should report reliability for congeneric measures. Subsequent RG across studies will also benefit from this because congeneric tests relate to the same trait, but they may vary in precision, scale origins, and units of measurement (Jöreskog, 1971). It is also recommended that researchers use, whenever possible, generalizability theory methods to take into account all available error sources (e.g., items, raters, and occasions) that influence the reliability for either relative or absolute decisions.

It should be emphasized that the IRT section in this article presents an IRT perspective on classical SEM and reliability, not the IRT treatment of conditional standard errors at trait levels. It was shown that, given the shape of the trait distribution, the independent additive contribution of individual items to the classical SEM depends only on IRT estimates of their parameters (e.g., difficulty and discrimination). Thus, when such estimates are available (e.g., with standardized measurement instruments), researchers may estimate the classical SEM for the study population with higher quality and accuracy relative to traditional approaches (e.g., confidence intervals about Cronbach’s α). This perspective on SEM and reliability also shows that “population reliability of the test” connotes a sound psychometric concept. Therefore, although agreeing that “the test qua test is not reliable or unreliable” (Thompson & Vacha-Haase, 2000), we should also agree that concepts related to accuracy of measurement are psychometric concepts that can be treated from multiple perspectives within and across theoretical frameworks. From a terminology point of view, then, the necessity of the term datametrics is questionable if its sole purpose is to convey that “data, not tests, are reliable” in psychometrics (e.g., Thompson & Vacha-Haase, 2000). Also, this term is loosely defined and semantically problematic: because data result from measurement, we do not measure data. Conversely, psychometrics is well defined as “related to measurement of attributes of people or other objects, with tests that are intended to measure those attributes, and with single exercises or items that in combination make such tests” (Thorndike, 1982, p. 4).
The potential problems with RG across studies identified in this article relate primarily to (a) using different types of reliability, different numbers of items, and/or different test forms in a single RG analysis; (b) confounding independent variables; and (c) misspecifications related to group homogeneity, test bias, and irrelevance of composition factors. It should be emphasized that RG researchers were not able to address most of these problems because prior published studies reported very limited information about the accuracy of their data (in the rare cases that they reported such information at all). What RG researchers should do, however, is discuss this issue in light of possible pitfalls and limitations of their RG procedures and results. Disclaimers for RG articles to include in their discussion relate, for example, to using Cronbach’s α although (a) the test components are likely congeneric, suggesting that alpha may underestimate the reliability; (b) there are chances for correlated errors (e.g., items sharing a common stem), suggesting that Cronbach’s α may overestimate the reliability; and (c) the tests are of different lengths, suggesting possible loss of accuracy in generalizing alpha (it is recommended that an adjustment of alpha, using the Spearman-Brown formula, be performed, with a word of caution when the assumption of equal item variances underlying this formula is not checked). Other disclaimers in RG discussions may relate to using different types of reliability in a single RG analysis, which may produce variability effects due to statistical artifacts (e.g., Hunter & Schmidt, 1990). Also, using different test forms may produce spurious effects and reduce the reliability due to (a) systematic intercorrelations among items sharing a common stem in multiple true-false tests, (b) negative stem wording, or (c) excessive scale intervals. Other possible sources of disclaimers in RG discussions relate to the following: (a) confounding independent variables; (b) presence of bias when the instrument is used in different contexts of age, language, or cultural experience; and (c) misspecifications due to failure to take into account sample characteristics (e.g., randomness or the shape of the trait distribution) that may affect the reliability coefficients.

It should also be noted that RG across studies does not provide adequate information about the accuracy of measurement within a specific study. Such information is provided, instead, by SEM and reliability coefficients for the study sample and (exact or interval) estimates of their population parameters (e.g., Dimitrov, 2001a, 2001b; Fan & Thompson, 2001). Nor can RG across studies assess stability (time invariance) in reliability. This can be done, instead, with longitudinal models for generalization of reliability using the same instrument with the same sample across time points (e.g., Tisak & Tisak, 1996). In conclusion, with awareness of potential problems and appropriate designs, RG across studies can provide useful meta-analytic information about the variability of reliability coefficients. I believe, however, that the methodological soundness and practical usefulness of RG studies depend, among other things, upon implications from the multiple perspectives on reliability presented in this article.


References

Alsawalmeh, Y. M., & Feldt, L. S. (1999). Testing the equality of independent alpha coefficients adjusted for test length. Educational and Psychological Measurement, 59, 373-383.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Anastasi, A., & Urbina, S. (1997). Psychological testing (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Barnette, J. J. (2000). Effects of stem and Likert response option reversals on survey internal consistency: If you feel the need, there is a better alternative to using those negatively worded stems. Educational and Psychological Measurement, 60, 361-370.
Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate Software.
Berk, R. A. (1980). A consumer's guide to criterion-referenced reliability. Journal of Educational Measurement, 17, 323-349.
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: ACT.
Brennan, R. L., & Kane, M. T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14, 277-289.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 20, 37-46.
Capraro, M. M., Capraro, R. M., & Henson, R. K. (2001). Measurement error of scores on the Mathematics Anxiety Rating Scale across studies. Educational and Psychological Measurement, 61, 373-386.
Caruso, J. C. (2000). Reliability generalization of the NEO personality scales. Educational and Psychological Measurement, 60, 236-254.
Caruso, J. C., Witkiewitz, K., Belcourt-Dittloff, A., & Gottlieb, J. (2001). Reliability of scores from the Eysenck Personality Questionnaire: A reliability generalization (RG) study. Educational and Psychological Measurement, 61, 675-682.
Charter, R. A. (2001). It is time to bury the Spearman-Brown "prophecy" formula for some common applications. Educational and Psychological Measurement, 61, 690-696.
Clauser, B. E., Clyman, S. G., & Swanson, D. B. (1999). Components of rater error in a complex performance assessment. Journal of Educational Measurement, 36, 29-45.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Collins, L. M. (1996). Is reliability obsolete? A commentary on "Are simple gain scores obsolete?" Applied Psychological Measurement, 20, 289-292.
Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322-328.
Cooil, B., & Rust, R. T. (1994). Reliability and expected loss: A unifying principle. Psychometrika, 59, 203-216.
Cook, C., Heath, F., Thompson, R. L., & Thompson, B. (2001). Score reliability in Web- or Internet-based surveys: Unnumbered graphic rating scales versus Likert-type scales. Educational and Psychological Measurement, 61, 697-706.
Crick, J. E., & Brennan, R. L. (1983). Manual for GENOVA: A generalized analysis of variance system. Iowa City, IA: ACT Publications.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart & Winston.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of a test. Psychometrika, 16, 297-334.
Cronbach, L. J., & Furby, L. (1970). How we should measure "change" - or should we? Psychological Bulletin, 74, 68-80.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Dillon, W. R., & Mulani, N. (1984). A probabilistic latent class model for assessing inter-rater reliability. Multivariate Behavioral Research, 19, 439-458.
Dimitrov, D. M. (2001a, October). Reliability and standard error of measurement related to parameters of items and ability distributions. Paper presented at the annual meeting of the Mid-Western Educational Research Association, Chicago.
Dimitrov, D. M. (2001b, October). Reliability of Rasch measurement with skewed ability distributions. Paper presented at the International Conference on Objective Measurement, Chicago.
Dimitrov, D. M., Rumrill, P., Fitzgerald, S., & Hennessey, M. (2001). Reliability in rehabilitation measurement. Work: A Journal of Prevention, Assessment, and Rehabilitation, 16, 159-164.
Erlich, O., & Shavelson, R. J. (1978). The search for correlations between measures of teacher behavior and student achievement: Measurement problem, conceptualization problem, or both? Journal of Educational Measurement, 15, 77-89.
Fan, X., & Thompson, B. (2001). Confidence intervals about score reliability coefficients, please: An EPM guidelines editorial. Educational and Psychological Measurement, 61, 517-531.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (pp. 105-146). New York: Macmillan.
Frisbie, D. A., & Druva, C. A. (1986). Estimating reliability of multiple true-false tests. Journal of Educational Measurement, 23, 99-105.
Goodenough, F. L. (1949). Mental testing: Its history, principles, and applications. New York: Rinehart.
Gronlund, N. E. (1988). How to construct achievement tests (4th ed.). Englewood Cliffs, NJ: Prentice Hall.
Gronlund, N. E., & Linn, R. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan.
Hambleton, R. K., & Novick, M. R. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159-170.
Henson, R. K., Kogan, L. R., & Vacha-Haase, T. (2001). A reliability generalization study of the Teacher Efficacy Scale and related instruments. Educational and Psychological Measurement, 61, 404-420.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis. Newbury Park, CA: Sage.
Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253-264.
Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109-133.
Jöreskog, K. G., & Sörbom, D. (1989). LISREL 7: A guide to the program and applications (2nd ed.). Chicago: SPSS.
Kane, M. (1996). The precision of measurements. Applied Measurement in Education, 9, 355-379.
Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement for scale scores. Journal of Educational Measurement, 29, 285-307.
Komaroff, E. (1997). Effect of simultaneous violations of essential tau-equivalence and uncorrelated errors on coefficient alpha. Applied Psychological Measurement, 21, 337-348.
Kristof, W. (1970). On the sampling theory of reliability estimation. Journal of Mathematical Psychology, 7, 371-377.
Linacre, J. M. (1994). Many-facet Rasch measurement (2nd ed.). Chicago: MESA Press.
Linn, R. L., & Slinde, J. A. (1977). The determination of the significance of change between pre- and posttesting periods. Review of Educational Research, 47, 121-150.
Lord, F. M. (1957). Do tests of the same length have the same standard error of measurement? Educational and Psychological Measurement, 17, 510-521.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Lord, F. M. (1984). Standard error of measurement at different ability levels. Journal of Educational Measurement, 21, 239-243.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
May, K., & Nicewander, W. A. (1993). Reliability and information functions for percentile ranks. Psychometrika, 58, 313-325.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32, 1-13.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Payne, J. I. (1975). Principles of social science measurement. College Station, TX: Lytton.
Pedney, T. N., & Hubert, L. (1975). An empirical comparison of several interval estimation procedures for coefficient alpha. Psychometrika, 40, 169-181.
Peng, C. J., & Subkoviak, M. J. (1980). A note on Huynh's normal approximation procedure for estimating criterion-referenced reliability. Journal of Educational Measurement, 17, 359-368.
Raykov, T. (1997). Estimation of composite reliability for congeneric measures. Applied Psychological Measurement, 21, 173-184.
Raykov, T. (1998). A method for obtaining standard errors and confidence intervals of composite reliability for congeneric items. Applied Psychological Measurement, 22, 369-374.
Raykov, T. (2000). A method for examining stability in reliability. Multivariate Behavioral Research, 35, 289-305.
Raykov, T. (2001). Bias of coefficient alpha for fixed congeneric measures with correlated errors. Applied Psychological Measurement, 25, 69-76.
Rogosa, D. (1995). Myths and methods: "Myths about longitudinal research" plus supplemental questions. In J. M. Gottman (Ed.), The analysis of change (pp. 3-66). Hillsdale, NJ: Lawrence Erlbaum.
Samejima, F. (1977). The use of the information function in tailored testing. Applied Psychological Measurement, 1, 233-247.
Sawilowsky, S. S. (2000). Psychometrics versus datametrics: Comments on Vacha-Haase's "reliability generalization" method and some EPM editorial policies. Educational and Psychological Measurement, 60, 157-173.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Suen, H. K. (1990). Principles of test theories. Hillsdale, NJ: Lawrence Erlbaum.
Sun, A., Valiga, M. J., & Gao, X. (1997). Using generalizability theory to assess the reliability of student ratings of academic advising. The Journal of Experimental Education, 65, 367-379.
Thissen, D. (1990). Reliability and measurement precision. In H. Wainer (Ed.), Computerized adaptive testing: A primer (pp. 161-185). Hillsdale, NJ: Lawrence Erlbaum.
Thompson, B. (1992). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling and Development, 70, 434-438.
Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837-847.
Thompson, B., & Daniel, L. G. (1996). Factor analytic evidence for the construct validity of scores: An historical overview and some guidelines. Educational and Psychological Measurement, 56, 197-208.
Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60, 174-195.
Thorndike, R. L. (1982). Applied psychometrics. Boston: Houghton Mifflin.
Tisak, J., & Tisak, M. S. (1996). Longitudinal models of reliability and validity: A latent curve approach. Applied Psychological Measurement, 20, 275-288.

Traub, R. E. (1994). Reliability for the social sciences: Theory and applications (Measurement Methods for the Social Sciences, Vol. 3). Thousand Oaks, CA: Sage.
Vacha-Haase, T. (1998). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58, 6-20.
Vacha-Haase, T., Kogan, L., Tani, C. R., & Woodall, R. A. (2001). Reliability generalization: Exploring reliability coefficients of MMPI clinical scales scores. Educational and Psychological Measurement, 61, 45-59.
Van der Linden, W. J., & Mellenbergh, G. J. (1977). Optimal cutting scores using a linear loss function. Applied Psychological Measurement, 1, 593-599.
Viswesvaran, C., & Ones, D. (2000). Measurement error in "Big Five factors" personality assessment: Reliability generalization across studies and measures. Educational and Psychological Measurement, 60, 224-235.
Wang, T. (1998). Weights that maximize reliability under a congeneric model. Applied Psychological Measurement, 22, 179-187.
Williams, R. H., & Zimmerman, D. W. (1996). Are simple gain scores obsolete? Applied Psychological Measurement, 20, 59-69.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.
Wright, B. D. (2001). Separation, reliability, and skewed distributions. Rasch Measurement Transactions, 14(4), 786.
Yin, P., & Fan, X. (2000). Assessing the reliability of Beck Depression Inventory scores: Reliability generalization across studies. Educational and Psychological Measurement, 60, 201-223.
Zimmerman, D. W., & Williams, R. H. (1982). Gain scores in research can be highly reliable. Journal of Educational Measurement, 19, 149-154.
Zimmerman, D. W., Zumbo, B. D., & Lalonde, C. (1993). Coefficient alpha as an estimate of test reliability under violation of two assumptions. Educational and Psychological Measurement, 53, 33-49.