
Psychological Methods, 2004, Vol. 9, No. 2, 164-182. Copyright 2004 by the American Psychological Association. 1082-989X/04/$12.00 DOI: 10.1037/1082-989X.9.2.164

Beyond the F Test: Effect Size Confidence Intervals and Tests of Close Fit in the Analysis of Variance and Contrast Analysis

James H. Steiger
Vanderbilt University

This article presents confidence interval methods for improving on the standard F tests in the balanced, completely between-subjects, fixed-effects analysis of variance. Exact confidence intervals for omnibus effect size measures, such as ω² and the root-mean-square standardized effect, provide all the information in the traditional hypothesis test and more. They allow one to test simultaneously whether overall effects are (a) zero (the traditional test), (b) trivial (do not exceed some small value), or (c) nontrivial (definitely exceed some minimal level). For situations in which single-degree-of-freedom contrasts are of primary interest, exact confidence interval methods for contrast effect size measures such as the contrast correlation are also provided.

Author note: I express my gratitude to Michael W. Browne, Stanley A. Mulaik, the late Jacob Cohen, William W. Rozeboom, Rachel T. Fouladi, Gary H. McClelland, and numerous others who have encouraged this project. Correspondence concerning this article should be addressed to James H. Steiger, Department of Psychology and Human Development, Box 512 Peabody College, Vanderbilt University, Nashville, TN 37203. E-mail: [email protected]

The analysis of variance (ANOVA) remains one of the most commonly used methods of statistical analysis in the behavioral sciences. Most ANOVAs, especially in exploratory studies, report an omnibus F test of the hypothesis that a main effect, interaction, or simple main effect is precisely zero. In recent years, a number of authors (Cohen, 1994; Rosnow & Rosenthal, 1996; Schmidt, 1996; Schmidt & Hunter, 1997; Serlin & Lapsley, 1993; Steiger & Fouladi, 1997) have sharply questioned the efficacy of tests of this "nil" hypothesis. Several of these critiques have concentrated on ways that the nil hypothesis test fails to deliver the information that the typical behavioral scientist wants. However, a number of the articles have also suggested, more or less specifically, replacements for or extensions of the null hypothesis test that would deliver much more useful information.

The suggestions have developed along several closely related lines, including the following:

1. Eliminate the emphasis on omnibus tests, with attention instead on focused contrasts that answer specific research questions, along with calculation of point estimates and approximate confidence interval estimates for some correlational measures of effect size (e.g., Rosenthal, Rosnow, & Rubin, 2000; Rosnow & Rosenthal, 1996).

2. Calculate exact confidence interval estimates of measures of standardized effect size, using an iterative procedure (e.g., Smithson, 2001; Steiger & Fouladi, 1997).

3. Perform tests of a statistical null hypothesis other than that of no difference or zero effect (e.g., Serlin & Lapsley, 1993).

As proponents of the first suggestion, Rosnow and Rosenthal (1996) discussed several types of correlation coefficients that are useful in assessing experimental effects. Their work is particularly valuable in situations in which the researcher has questions that are best addressed by testing single contrasts. Rosnow and Rosenthal emphasized the use of the Pearson correlation, rather than the squared multiple correlation, partly because of concern that the latter tends to present an overly pessimistic picture of the value of "small" experimental effects.

The second suggestion, exact interval estimation, has been gathering momentum since around 1980. The movement to replace hypothesis tests with confidence intervals stems from the fundamental realization that, in many if not most situations, confidence intervals provide more of the information that the scientist is truly interested in. For example, in a two-group experiment, the scientist is more interested in knowing how large the difference between the two groups is (and how precisely it has been determined)

than whether the difference between the groups is exactly zero.

The third suggestion, which might be called tests of close fit, has much in common with the approach widely known to biostatisticians as bioequivalence testing and is based on the idea that the scientist should not be testing perfect adherence to a point hypothesis but should replace that test with a "relaxed" test of a more appropriate hypothesis. Tests of close fit share many of their computational aspects with the exact interval estimation approach, in terms of the software routines required to compute probability levels, power, and sample size. They remain within the familiar hypothesis-testing framework, while providing important practical and conceptual gains, especially when the experimenter's goal is to demonstrate that an effect is trivial.

In this article, I present methods that implement, support, and, in some cases, unify and extend major suggestions (1) through (3) discussed above. First I briefly review the history, rationale, and theory behind exact confidence intervals on measures of standardized effect size in ANOVA. I then provide detailed instructions, with examples, and software support for computing these confidence intervals. Next I discuss a general procedure for assessing effects that are represented by one or more contrasts, using correlations. Included is a population rationale, with sampling theory and an exact confidence interval estimation procedure, for one of the correlational measures discussed by Rosnow and Rosenthal (1996). Although the initial emphasis is on confidence interval estimation, I also discuss how the same technology that generates confidence intervals may be used to test hypotheses of minimal effect, thus implementing the good-enough principle discussed by Serlin and Lapsley (1993).

Exact Confidence Intervals on Standardized Effect Size

The notion that hypothesis tests of zero effect should be replaced with exact confidence intervals on measures of effect size has been around for quite some time but was somewhat impractical because of its computational demands until about 10 years ago. A general method for constructing the confidence intervals, which Steiger and Fouladi (1997) referred to as noncentrality interval estimation, is considered elementary by statisticians but seldom is discussed in behavioral statistics texts. In this section, I review some history, then describe the method of noncentrality interval estimation in detail.

Rationale and History

Suppose that, as a researcher, you test a drug that you believe enhances performance. You perform a simple two-group experiment with a double-blind control. In this case, you are engaging in "reject-support" (R-S) hypothesis testing (rejecting the null hypothesis will support your belief). The null and alternative hypotheses might be

H_0: μ_1 ≤ μ_2;  H_1: μ_1 > μ_2.  (1)

The null hypothesis states that the drug is no better than a placebo. The alternative, which the investigator believes, is that the drug enhances performance. Rejecting the null hypothesis, even at a very low alpha such as .001, need not indicate that the drug has a strong effect, because if sample size is very large relative to the variability of the drug effect, even a trivial effect might be declared highly significant. On the other hand, if sample size is too low, even a strong effect might have a low probability of creating a statistically significant result.

Statistical power analysis (Cohen, 1988) and sample size estimation have been based on the notion that calculations made before data are gathered can help to create a situation in which neither of the above problems is likely to occur. That is, sample size is chosen so that power will be high, but not too high.

There is an alternative situation, "accept-support" (A-S) testing, that attracts far less attention than R-S testing in statistics texts and has had far less impact on the popular wisdom of hypothesis testing. In A-S testing, the statistical null hypothesis is what the experimenter actually wishes to prove. Accepting the statistical null hypothesis supports the researcher's theory. Suppose, for example, an experiment provides convincing evidence that the above-mentioned drug actually works. The next step might be to provide convincing evidence that it has few, or acceptably low, side effects. In this case two groups are studied, and some measure of side effects is taken. The null hypothesis is that the experimental group's level of side effects is less than or equal to the control group's level. The researcher (or drug company) supporting the research wants not to reject this null hypothesis, because in this case accepting the null hypothesis supports the researcher's point of view, that is, that the drug is no more harmful than its predecessors.

In a similar vein, a company might wish to show that a generic drug does not differ appreciably in bioavailability from its brand name equivalent. This problem of bioequivalence testing is well known to biostatisticians and has resulted in a very substantial literature (e.g., Chow & Liu, 2000).

Suppose that Drug A has a well-established bioavailability level θ_A, and an investigator wishes to assess the bioequivalence of Drug B with Drug A. One might engage in A-S testing, that is, test the null hypothesis that

θ_A = θ_B  (2)

and declare the two drugs bioequivalent if this null hypothesis is not rejected. However, the perils of such A-S testing are even greater than in R-S testing. Specifically, simply running a sloppy, low-power experiment will tend to result in nonrejection of the null hypothesis, even if the drugs differ appreciably in bioavailability. Thus, paradoxically, someone trying to establish the bioequivalence of Drug B with Drug A could virtually guarantee success simply by using too small a sample size. Moreover, with extremely large sample sizes, Drug B might be declared nonequivalent to Drug A even if the difference between them is trivial.

Because of such problems, biostatisticians decided long ago that the test for strict equality is inappropriate for bioavailability studies (Metzler, 1974). Rather, a dual hypothesis test should be performed. Suppose that the Food and Drug Administration has determined that any drug with bioavailability within 20% of θ_A may be considered bioequivalent and prescribed in its stead. Suppose that θ_1 and θ_2 represent these bioequivalence limits. Then establishing bioequivalence of Drug B with Drug A might amount to rejecting the following hypothesis,

H_0: θ_B > θ_2 or θ_B < θ_1  (3)

against the alternative

H_a: θ_1 ≤ θ_B ≤ θ_2.  (4)

In practice, this usually amounts to testing two one-sided hypotheses,

H_01: θ_B ≤ θ_1 versus H_a1: θ_B > θ_1  (5)

and

H_02: θ_B ≥ θ_2 versus H_a2: θ_B < θ_2.  (6)

An alternative approach (Westlake, 1976) is to construct a confidence interval for θ_B. Bioequivalence would be declared if the confidence interval falls entirely within the established bioequivalence limits.

In other contexts, particularly the more exploratory studies performed in psychology, the research goal may be simply to pinpoint the nature of an effect rather than to decide whether it is within a known fixed interval. In that case, reporting the endpoints of a confidence interval (without announcing an associated decision) may be an appropriate conclusion to an analysis. In any case, because the hypothesis test may be performed with the confidence interval, it seems that the confidence interval should always be reported. It contains all the information in a hypothesis test result, and more.

In structural equation modeling, which includes factor analysis and multiple regression as special cases, statistical testing prior to 1980 was limited to a chi-square test of perfect fit. In this procedure, the statistical null hypothesis is that the model fits perfectly in the population. This hypothesis test was performed, and a model was judged to fit the data "sufficiently well" if the null hypothesis was not rejected. There was widespread dissatisfaction with the test, because no model would be expected to fit perfectly, and so large sample sizes usually led to rejection of a model, even if it fit the data quite well. In this arrangement, enhanced precision actually worked against the researcher's interests. Steiger and Lind (1980) suggested that the traditional null hypothesis test of perfect fit of a structural model be replaced by a confidence interval on the root-mean-square error of approximation (RMSEA), an index of population badness of fit that compensated for the complexity of the model.

MacCallum, Browne, and Sugawara (1996) suggested augmenting the confidence interval with a pair of hypothesis tests. They considered a population RMSEA value of .05 to be indicative of a close-fitting model, whereas a value of .08 or more was evidence of marginal to poor fit. Consequently, a test of close fit would test the null hypothesis that the RMSEA is greater than or equal to .05 against the alternative that it is less than .05. Rejection of the null hypothesis indicates close fit. A test of not-close fit tests the null hypothesis that the RMSEA is less than or equal to .08 against the alternative that it is greater than .08. Rejection of the null hypothesis indicates that fit is not close. MacCallum et al. demonstrated in detail how, with such hypothesis tests, power calculations could be performed and required sample sizes estimated. These two one-sided tests can be performed easily and simultaneously with a single 1 - 2α confidence interval recommended by Steiger (1989). Simply construct the confidence interval and see whether its upper end is below .05 (in which case the test of close fit results in rejection at the alpha level) and whether its lower end exceeds .08 (in which case the test of not-close fit results in rejection at the alpha level). The confidence interval provides all the information in both hypothesis tests, and more.

Fleishman (1980) suggested interval estimation as a supplement for the F test in ANOVA.
He gave examples of how to compute exact confidence intervals on a number of useful quantities, such as the signal-to-noise ratio, in ANOVA. These confidence intervals offered clear advantages over the traditional hypothesis test. Other authors have noted the existence of exact confidence intervals for the standardized effect size in the simplest special case of ANOVA, the two-sample t test (e.g., Hedges & Olkin, 1985).

The rationale for switching from hypothesis testing to confidence interval estimation is straightforward (Steiger & Fouladi, 1997). Unfortunately, the exact interval estimation procedures of Steiger and Lind (1980), Fleishman (1980), and Hedges and Olkin (1985) are virtually impossible to compute accurately by hand. However, by 1990, microcomputer capabilities had advanced substantially. The RMSEA confidence interval was implemented in general purpose structural equation modeling software (Mels, 1989; Steiger, 1989) and, by the late 1990s, had achieved widespread use. Steiger (1990) presented general procedures for constructing confidence intervals on measures of effect size in covariance structure analysis, ANOVA, contrast analysis, and multiple regression. Steiger and Fouladi (1992) produced a general computer program, R2, that performed exact confidence interval estimation of the squared multiple correlation in multiple regression. Taylor and Muller (1995, 1996) have presented general procedures for analyzing power and noncentrality in the general linear model, including an analysis of the impact of restriction of published articles to significant results. Steiger and Fouladi (1997) demonstrated general procedures for confidence interval calculations, and Steiger (1999) implemented these in a commercial software package. Smithson (2001) discussed a number of confidence interval procedures in fixed and random regression models and included SPSS macros for calculating confidence intervals for noncentral distributions. Reiser (2001) discussed confidence intervals on functions of Mahalanobis distance.

General Theory of Noncentrality-Based Interval Estimation

In this section, I review the general theoretical principles for constructing exact confidence intervals for effect size, power, and sample size in the balanced fixed-effects between-subjects ANOVA. For a more detailed discussion of these principles, see Steiger and Fouladi (1997). Throughout what follows, I adopt a simple notational device: When several groups or cells are sampled, I use N_tot to stand for the total sample size and use n to stand for the number of observations in each group.

I begin this section with a brief nontechnical discussion of noncentral distributions. The t, chi-square, and F distributions are special cases of more general distributions called the noncentral t, noncentral chi-square, and noncentral F. Each of these noncentral distributions has an additional parameter, called the noncentrality parameter. For example, whereas the F distribution has two parameters (the numerator and denominator degrees of freedom), the noncentral F has these two plus a noncentrality parameter (often indicated with the symbol λ). When the noncentral F distribution has a noncentrality parameter of zero, it is identical to the F distribution, so it includes the F distribution as a special case. Similar facts hold for the t and chi-square distributions. What makes the noncentrality parameter especially important is that it is related very closely to the truth or falsity of the null hypotheses that these distributions are typically used to test. Thus, for example, when the null hypothesis of no difference between two means is correct, the standard t statistic has a distribution with a noncentrality parameter of zero, whereas if the null hypothesis is false, it has a noncentral t distribution, that is, the noncentrality parameter is nonzero. The more false the null hypothesis, the larger the absolute value of the noncentrality parameter for a given alpha and sample size.

Most confidence intervals in introductory textbooks are derived by simple manipulation of a statement about the interval probability of a statistic. This approach cannot be used to generate exact confidence intervals for many quantities of fundamental importance in statistics. As an example, consider the sample squared multiple correlation, whose distribution changes as a function of the population squared multiple correlation. Confidence intervals for the squared multiple correlation are very informative yet are not discussed in standard texts, because a single simple formula for the direct calculation of such an interval cannot be obtained in a manner analogous to the way one obtains a confidence interval for the population mean μ. Steiger and Fouladi (1997) discussed a general method for confidence interval construction that handles many such interesting examples. The method combines two general principles, which they called the confidence interval transformation principle and the inversion confidence interval principle. The former is obvious but seldom discussed formally. The latter is referred to by a variety of names in textbooks and review articles (Casella & Berger, 2002; Steiger & Fouladi, 1997), yet it does not seem to have found its way into the standard behavioral statistics textbooks, primarily because its implementation involves some difficult computations. However, the method is easy to discuss in principle and is no longer impractical. When the two principles are combined, a number of very useful confidence intervals result.

Proposition 1: Confidence interval transformation principle. Let f(θ) be a monotone function of θ, that is, a function whose slope never changes sign and is never zero. Let l_1 and l_2 be lower and upper endpoints of a 1 - α confidence interval on quantity θ. Then, if the function is increasing, f(l_1) and f(l_2) are lower and upper endpoints, respectively, of a 100(1 - α)% confidence interval on f(θ). If the function is decreasing, f(l_2) and f(l_1) are lower and upper endpoints.

Here are two elementary examples of this principle.

Example 1: Suppose you read in a textbook how to calculate a confidence interval for the population variance σ². However, you desire a confidence interval for σ. Because σ takes on only nonnegative values, it is a monotonic increasing function of σ² over its domain. Hence, the confidence interval for σ is obtained by taking the square root of the endpoints of the corresponding confidence interval for σ².

Example 2: Suppose one calculates a confidence interval for z(ρ), the Fisher transform of ρ, the population correlation coefficient. Taking the inverse Fisher transform of the endpoints of this interval will give a confidence interval for ρ.

This is, in fact, the method used to calculate the standard confidence interval for a correlation.

These examples show why Proposition 1 is very useful in practice. A statistical quantity we are very interested in, such as ρ, may be a simple function of a quantity, such as z(ρ), that we are not so interested in but for which we can easily obtain a confidence interval. Next, we define the inversion confidence interval principle.

Proposition 2: Inversion confidence interval principle. Let x be the observed value of X, a random variable with a continuous cdf (cumulative distribution function) F(x, λ) = Pr(X ≤ x | λ) for some numerical parameter λ. Let α_1 + α_2 = α with 0 < α < 1 be fixed values. If F(x, λ) is strictly decreasing in λ, for fixed values of x, choose l_1(x) and l_2(x) so that Pr[X ≤ x | λ = l_1(x)] = 1 - α_2 and Pr[X ≤ x | λ = l_2(x)] = α_1. If F(x, λ) is strictly increasing in λ, for fixed values of x, choose l_1(x) and l_2(x) so that Pr[X ≤ x | λ = l_1(x)] = α_1 and Pr[X ≤ x | λ = l_2(x)] = 1 - α_2. Then the random interval [l_1(x), l_2(x)] is a 100(1 - α)% confidence interval for λ. Upper or lower 100(1 - α)% confidence bounds (or "one-sided confidence intervals") may be obtained by setting α_1 or α_2 to zero.

For a simple graphically based explanation of Proposition 2, consult Steiger and Fouladi (1997, pp. 237-239). For a clear, succinct discussion with partial proof, see Casella and Berger (2002, p. 432), who referred to this as "pivoting" the cdf. In this article, I assume α_1 = α_2 = α/2, although such an interval may not be the minimum width for a given α.

Proposition 2 implies a simple approach to interval estimation: Suppose you have observed an F statistic with a value x and known degrees of freedom ν_1 and ν_2. Denote the cumulative distribution of the F statistic by F(x, λ), where λ is the noncentrality parameter. It can be shown that if ν_1, ν_2, and x are held constant at any positive value, then F(x, λ) is strictly decreasing in λ. Accordingly, Proposition 2 can be used. To calculate a 100(1 - α)% confidence interval on the noncentrality parameter of the F distribution, use the following steps.

1. Calculate the cumulative probability p of x in the central F distribution. If p is below α/2, then both limits of the confidence interval are zero. If p is below 1 - α/2, the lower limit of the confidence interval is zero, and the upper limit must be calculated (go to Step 3). Otherwise, calculate both limits of the confidence interval, using Steps 2 and 3.

2. To calculate the lower limit, find the unique value of λ that places x at the 1 - α/2 cumulative probability point of a noncentral F distribution with ν_1 and ν_2 degrees of freedom.

3. To calculate the upper limit, find the unique value of λ that places x at the α/2 cumulative probability point of a noncentral F distribution with ν_1 and ν_2 degrees of freedom.

Calculating a confidence interval for λ thus requires iterative calculation of the unique value of λ that places an observed value of F at a particular percentile of the noncentral F distribution (see Footnote 1). In what follows, I give a variety of examples of confidence interval calculations. Some will be at the 95% level of confidence, others at the less common 90% level. In a later section, I discuss why, when confidence intervals are used to perform a hypothesis test at the .05 level, a 90% interval may be appropriate in some situations and a 95% interval in others. At that point, I describe how to select confidence intervals at the appropriate level to perform a particular hypothesis test.

Footnote 1: NDC (noncentral distribution calculator), a freeware Windows program for calculating percentage points and noncentrality confidence intervals for noncentral F, t, and chi-square distributions, is available for direct download from the author's website (http://www.statpower.net).

Measures of Standardized Effect Size

Now I examine some more ambitious examples. For simplicity of exposition, I assume in this section that either the freeware program NDC (noncentral distribution calculator; see Footnote 1) or other software is available to compute a confidence interval on λ, the noncentrality parameter of a noncentral F distribution. Consider the one-way, fixed-effects ANOVA, in which p means are compared for equality, and there are n observations per group. The overall F statistic has a distribution that is a noncentral F, with degrees of freedom p - 1 and p(n - 1) = N_tot - p. The noncentrality parameter λ can be expressed in a number of ways. One formula that appears frequently in textbooks is

λ = n Σ_{j=1}^{p} (α_j/σ)².  (7)

The α_j values in Equation 7 are the effects as commonly defined in ANOVA, that is,

α_j = μ_j - μ.  (8)

If μ_j is the mean of the jth group, and μ is the overall mean, then μ is, in the case of equal n, simply the arithmetic average of the μ_j. More generally (although in what follows I assume a balanced design unless stated otherwise),

μ = Σ_{j=1}^{p} (n_j/N_tot) μ_j.  (9)
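In place of NDC, the iteration in Steps 1-3 above can be sketched with SciPy's noncentral F distribution. This is an illustration under that assumption, not the NDC program itself; the function name and the bracket-expansion strategy are mine.

```python
from scipy.optimize import brentq
from scipy.stats import f as fdist, ncf

def lambda_ci(x, df1, df2, alpha=0.05):
    # 100(1 - alpha)% CI for the noncentrality parameter lambda of a
    # noncentral F, given an observed F value x with df1, df2.
    p = fdist.cdf(x, df1, df2)   # Step 1: cumulative probability, central F
    if p < alpha / 2:
        return 0.0, 0.0          # both limits are zero

    def solve(target):
        # Find the lambda with ncf.cdf(x, df1, df2, lambda) == target.
        # The cdf is strictly decreasing in lambda, so expand an upper
        # bracket until the cdf falls below the target, then root-find.
        hi = 10.0
        while ncf.cdf(x, df1, df2, hi) > target:
            hi *= 2.0
        return brentq(lambda lam: ncf.cdf(x, df1, df2, lam) - target,
                      0.0, hi)

    lower = 0.0 if p < 1 - alpha / 2 else solve(1 - alpha / 2)  # Step 2
    upper = solve(alpha / 2)                                    # Step 3
    return lower, upper
```

For instance, `lambda_ci(5.00, 3, 76, alpha=0.05)` reproduces the 95% interval for λ used in Example 3 below.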

The quantity α_j/σ is a standardized effect, that is, the effect expressed in standard deviation units. The quantity λ/n is therefore the sum of squared standardized effects. There are numerous ways one might convert the sum of squared standardized effects into an overall measure of effect size. For example, suppose we average these squared standardized effects in order to obtain an overall measure of strength of effects in the design. The arithmetic average of the p squared standardized effects, sometimes called the signal-to-noise ratio (Fleishman, 1980), is as follows:

f² = (1/p) Σ_{j=1}^{p} (α_j/σ)² = λ/(np) = λ/N_tot.  (10)

One problem with this measure is that it is the average squared effect and so is not in the proper unit of measurement. A potential solution is to simply take the square root of the signal-to-noise ratio, obtaining

f = sqrt(λ/N_tot) = sqrt[(1/p) Σ_{j=1}^{p} (α_j/σ)²].  (11)

In a one-way ANOVA with p groups and equal n, the effects are constrained to sum to zero, so there are actually only p - 1 independent effects. Thus, an alternative measure, λ/[(p - 1)n], is the average squared independent standardized effect, and the root-mean-square standardized effect (RMSSE) is as follows:

Ψ = sqrt{λ/[(p - 1)n]} = sqrt[(1/(p - 1)) Σ_{j=1}^{p} (α_j/σ)²].  (12)

Equations 11 and 12 demonstrate that the relationships between Ψ, f, and the noncentrality parameter λ are straightforward. In order to obtain a confidence interval for Ψ, we proceed as follows. First, we obtain a confidence interval estimate for λ. Next, we invoke the confidence interval transformation principle to directly transform the endpoints by dividing by (p - 1)n. Finally, we take the square root. The result is an exact confidence interval on Ψ.

Example 3: Suppose a one-way fixed-effects ANOVA is performed on four groups, each with a sample size of 20, and that an overall F statistic of 5.00 is obtained, with 3 and 76 degrees of freedom, with a probability level of .0032. The F test is thus "highly significant," and the null hypothesis is rejected at the .01 level. Some investigators might interpret this result as implying that a powerful experimental effect was found and that this was determined with high precision. In this case, the noncentrality interval estimate provides a more informative and somewhat different account of what has been found. The 95% confidence interval for λ ranges from 1.8666 to 32.5631. To convert this to a confidence interval for Ψ, we use Equation 12. The corresponding confidence interval for Ψ ranges from .1764 to .7367. Effects are almost certainly there, but they are on the order of half a standard deviation, what is commonly considered a medium-size effect. Moreover, the size of the effects has not been determined with high precision.

Example 4: Fleishman (1980) described the calculation of confidence intervals on the noncentrality parameter of the noncentral F distribution to obtain, in a manner equivalent to that used in the previous two examples, confidence intervals on f² and ω², the latter of which is defined as

ω²_A(partialed) = S²_μA / (S²_μA + σ²_e),  (13)

where S²_μA is the variance of the p means for the levels of a particular effect A, that is,

S²_μA = (1/p) Σ_{j=1}^{p} (μ_j - μ̄)²,  (14)

and σ²_e is the within-cell variance. ω²_A(partialed) may be thought of as the proportion of the variance remaining (after all other main effects and interactions have been partialed out) that is explained by the effect. (In what follows, for simplicity, I refer to the coefficient simply as ω².) There are simple relationships between f², ω², and λ, specifically,

f² = ω²/(1 - ω²) = λ/(pn) = λ/N_tot  (15)

and

ω² = f²/(1 + f²) = λ/(λ + N_tot).  (16)

Fleishman (1980) cited an example given by Venables (1975) of a five-group ANOVA with n = 11 per cell and an observed F of 11.221. In this case the 90% confidence interval for the noncentrality parameter λ has endpoints 19.380 and 71.549. Once we obtain the confidence interval for λ, it is a trivial matter to transform the limits of the interval to confidence limits for ω², using Equation 16. For example, the lower limit becomes

19.380 / [19.380 + (5)(11)] = 19.380/74.380 = .261.  (17)

In a similar manner, the upper limit of the confidence interval can be calculated as .565. The confidence interval has determined with 90% confidence that the main effect accounts for between 26.1% and 56.5% of the variance in the dependent variable.
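The transformations in Equations 12 and 16 are one-liners once the λ limits are in hand. A minimal sketch (function names are mine), with the λ limits from Examples 3 and 4 plugged in:

```python
from math import sqrt

def omega2_from_lambda(lam, n_tot):
    # Equation 16: omega^2 = lambda / (lambda + N_tot).
    return lam / (lam + n_tot)

def psi_from_lambda(lam, n, p):
    # Equation 12: RMSSE for a one-way design with p groups, n per group.
    return sqrt(lam / ((p - 1) * n))

# Example 4 (Venables, 1975): five groups, n = 11, so N_tot = 55, and the
# 90% CI for lambda is (19.380, 71.549).
w2_lo = omega2_from_lambda(19.380, 55)   # about .261, as in Equation 17
w2_hi = omega2_from_lambda(71.549, 55)   # about .565
```

The same pattern with `psi_from_lambda` and the λ limits of Example 3 (four groups, n = 20) recovers the RMSSE interval (.1764, .7367) reported there.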
Table 1
Key Quantities for Computing Effect Size Intervals in Four-Way Analysis of Variance

Source  Levels  df_θ                             n_θ
A       p       p - 1                            nqrs
B       q       q - 1                            nprs
C       r       r - 1                            npqs
D       s       s - 1                            npqr
AB              (p - 1)(q - 1)                   nrs
AC              (p - 1)(r - 1)                   nqs
AD              (p - 1)(s - 1)                   nqr
BC              (q - 1)(r - 1)                   nps
BD              (q - 1)(s - 1)                   npr
CD              (r - 1)(s - 1)                   npq
ABC             (p - 1)(q - 1)(r - 1)            ns
ABD             (p - 1)(q - 1)(s - 1)            nr
ACD             (p - 1)(r - 1)(s - 1)            nq
BCD             (q - 1)(r - 1)(s - 1)            np
ABCD            (p - 1)(q - 1)(r - 1)(s - 1)     n
Error           pqrs(n - 1)

Note. θ represents a particular effect; n represents the sample size per cell; and p, q, r, and s represent levels of factors A, B, C, and D, respectively.

General Procedures for Effect Size Intervals in Between-Subjects Factorial ANOVA

In a previous example, we saw how easy it is to construct a confidence interval on measures of effect size in one-way ANOVA, provided a confidence interval for λ has been computed. In this section, a completely general method is demonstrated for computing confidence intervals for various measures of standardized effect size in completely between-subjects factorial ANOVA designs with equal sample size n per cell.

We begin with a general formula relating the noncentrality parameter λ to the RMSSE in any completely between-subjects factorial ANOVA. Let θ stand for a particular effect, and n the sample size per cell. Then

Ψ_θ = sqrt[λ_θ/(n_θ df_θ)].  (18)

In Equation 18, n_θ is equal to n (the number of observations in each cell of the design) multiplied by the product of the numbers of levels of all the factors not represented in the effect currently under consideration; df_θ is the numerator degrees of freedom parameter for the effect under consideration.

There are simple relationships between the RMSSE and other measures of standardized effect size. Specifically, for a general factorial ANOVA,

f²_θ = Ψ²_θ df_θ / Cells_θ = λ_θ/N_tot,  (19)

where Cells_θ is, for any main effect, the number of levels of the effect. For any interaction, it is the product of the numbers of levels for all factors involved in the interaction. The relationship between f² and ω² is given in Equation 16.

Some examples of these quantities, for a four-way ANOVA, with p, q, r, and s levels of factors A, B, C, and D, respectively, are given in Table 1. The table may be used also for one-, two-, or three-way ANOVAs simply by eliminating terms involving levels not represented in the design. For example, in a three-way ANOVA, the BC interaction effect has (q - 1)(r - 1) numerator degrees of freedom, and n_BC is np, because there is no s in this design. The error degrees of freedom in a three-way ANOVA are pqr(n - 1). In the following two examples, I demonstrate how to compute a 90% confidence interval on various measures of effect, using the information in the table.

Example 5: Suppose that, as a researcher, you perform a three-way 2 × 3 × 7 ANOVA, with n = 6 observations per cell. In this case, we have p = 2, q = 3, and r = 7. Suppose that, for the A main effect, you observe an F statistic of 4.2708, which, with 1 and 210 degrees of freedom, has p = .0400. We first calculate a confidence interval for λ. The endpoints of this interval are λ_lower = 0.100597 and λ_upper = 13.8186. To convert these to confidence intervals on Ψ, f², f, and ω², we apply Equations 18, 19, and 16. For the A effect, we have n_A = (6)(3)(7) = 126, df_A = (2 - 1) = 1, Cells_A = 2, and N_tot = 252. Hence, for Ψ we have, from Equation 18,

Ψ_lower = sqrt[0.100597/((126)(1))] = 0.028256,  Ψ_upper = sqrt[13.8186/((126)(1))] = 0.331167.  (20)

For f² and f we have, for the lower limits,

f²_lower = 0.100597/252 = 0.000399194,  f_lower = 0.01998.  (21)

For the upper limits, we obtain f²_upper = 0.0548357 and f_upper = 0.234170.

We can also convert the confidence limits for f² into limits for ω², using Equation 16. We have

ω²_lower = f²_lower/(1 + f²_lower) = 0.000399194/1.000399194 = 0.000399035.  (22)

In a similar manner, we obtain the upper limit as ω²_upper = 0.0519851.

Example 6: Table 1 can also be used for a two-way ANOVA, simply by letting r = 1 and s = 1 and ignoring all effects involving factors C and D. Suppose, for example, one were to perform a two-way 2 × 7 ANOVA, with n = 4 observations per cell, and the F statistic for the AB interaction is observed to be 2.50. The key quantities are df_AB = 6, df_error = 42, n_AB = 4, and Cells_AB = 14. The confidence limits for λ are λ_lower = 0.462800 and λ_upper = 25.8689. Consequently, from Equation 18, the confidence limits for the RMSSE are

Ψ_lower = sqrt[0.462800/((4)(6))] = 0.1389  (23)

and

Ψ_upper = sqrt[25.8689/((4)(6))] = 1.0382.  (24)

The confidence intervals for f² and f are

f²_lower = 0.462800/56 = 0.008264,  f_lower = 0.0909,  (25)

f²_upper = 25.8689/56 = 0.461944,  f_upper = 0.6797.  (26)

Using Equation 16, we convert the above to the following confidence limits for ω²:

ω²_lower = f²_lower/(1 + f²_lower) = 0.008264/1.008264 = 0.0082,  (27)

ω²_upper = 0.461944/1.461944 = 0.3160.  (28)

Consider the linear model

η = Xβ + ε = η̂ + ε,  (30)

where ε has a multivariate normal distribution with zero mean vector 0 and covariance matrix σ²I, with I an identity matrix. It is common to partition β into

β = [β_0, β_1′]′,  (31)

where β_0 is an intercept term. Correspondingly, X is partitioned as

X = [1  X_1],  (32)

where 1 is a column of ones and X_1 contains the original X scores transformed into deviations about their sample means. Consider now a set of observed scores y, representing realizations of the random variables in η. If X_1 has p - 1 columns, then an F statistic for testing the hypothesis that β_1 = 0 is

F = [R²/(p - 1)] / [(1 - R²)/(N_tot - p)].  (33)

This statistic has a noncentral F distribution with p - 1 and N_tot - p degrees of freedom, with a noncentrality parameter given by

λ = β_1′X_1′(I - P_1)X_1β_1/σ² = β_1′X_1′Q_1X_1β_1/σ².  (34)

For any matrix A of full column rank, PA is the column Ј Ϫ1 Ј Multiple Regression With Fixed Regressors space projection operator A(A A) A and QA the comple- Ϫ mentary projector I PA. One standardized index of the size of effects is to com- We now turn to an application of this theory in the pute the squared multiple correlation coefficient between context of ANOVA. Consider the simple case of a one-way the independent variable and the scores on the dependent fixed-effects ANOVA with n observations in each of p variable. This index, in the population, characterizes the independent groups. It is well-known that this model can be strength of the effect. ANOVA may be conceptualized as a written in the form of Equation 29, where X is a design model with fixed independent variables. In ϭ ␤ matrix with Ntot np rows and p columns, and contains this case, the theory of multiple regression with fixed re- ANOVA parameters. gressors applies. It is important to realize (e.g., Sampson, We are not interested in R2 per se. Rather, we are inter- 1974) that the theory for fixed regressors, although it shares ested in the corresponding quantity ␳2 in an infinite popu- many similarities with that for random regressors, has im- lation of observations in which treatment groups are repre- portant differences, which are especially apparent when sented equally. There are several alternative ways of considering the nonnull distributions of the variables. The conceptualizing such a quantity. Formally, we can define ␳2 general model is as the probability limit of R2, that is,

E͑␩͒ ϭ X␤, (29) ␳2 ϭ plim͑R2͒. (35) n3ϱ ␩ ϫ ϫ where is an Ntot 1 random vector, X is an Ntot p matrix, and ␤ is a p ϫ 1 vector of unknown parameters. This is the constant that R2 converges to as the sample size This model includes model errors (⑀) that are assumed to increases without bound. It can be proven (see Appendix A) be independently and identically distributed with a normal that, with this definition of ␳2, the noncentrality parameter is distribution, zero mean, and variance ␴2. That is, equivalent to 172 STEIGER

␳2 fidence interval is identical to the one for ␻2 discussed ␭ ϭ N , (36) 2 tot 1 Ϫ ␳2 earlier. Note also that the sample R is positively biased with small sample sizes and will consequently be much closer to the and so upper end of the confidence interval than the lower. One of several alternative methods for parameterizing the ␭ ␳2 ϭ linear model in Equation 29 is to use what is sometimes called ␭ ϩ . (37) Ntot effect coding. In this case, the entries in X correspond to the contrast weights applied to group means in the ANOVA null ␭ Consequently, a confidence interval for may be converted hypothesis. For example, the hypothesis of no treatments in a easily into a confidence interval on ␳2 or ␳, because ␳ is 2 one-way ANOVA with three groups corresponds to two con- nonnegative. ␳ represents the coefficient of determination ␮ Ϫ ␮ ϭ trasts simultaneously being zero, that is, 1 3 0 and for predicting scores on the dependent variable from only a ␮ Ϫ ␮ ϭ 2 3 0. The contrast weights for the two hypotheses are knowledge of the population means of the groups in an thus 1, 0, Ϫ1 and 0, 1, Ϫ1. Thus, omnibus effect size in infinite population in which all treatment groups are equally ANOVA can be expressed as the multiple correlation between represented. a set of contrast weights and the dependent variable. Example 7: Suppose that X is set up as in Equation 38 to There has been a fair amount of discussion in the applied represent a full rank design matrix for a one-way ANOVA, literature (Ozer, 1985; Rosenthal, 1991; Steiger & Ward, ϭ with three groups, and n 3, and that the scores in y are 1, 1987) about whether the coefficient of determination is ␤ 2, 3, 4, 5, 6, 7, 8, 9. In this parameterization, 0 corresponds overly pessimistic in describing the strength of effects. ␮ ␤ ␮ Ϫ ␮ ␤ to 3, 1 corresponds to 1 3, and 2 corresponds to Those who prefer ␳ may convert a confidence interval on ␳2 ␮ Ϫ ␮ 2 3. 
The group means are 2, 5, 8, and the group to a confidence interval on ␳ simply by taking the square are all 1. root of the endpoints of the former. ⑀ y11 110 11 ⑀ Confidence Intervals on Single-Contrast Measures of y21 110 21 ⑀ Effect Size y31 110 31 ␤ ⑀ y12 101 0 12 ␤ ⑀ Rosenthal et al. (2000) argued convincingly for the im- y22 ϭ 101 ͫ 1 ͬ ϩ 22 . (38) ␤ ⑀ portance of replacing the omnibus hypothesis in ANOVA y32 101 2 32 ΄ ΅ ΄ ΅ ΄ ⑀ ΅ with hypotheses that focus on substantive research ques- y 100 13 13 ␺ y 100 ⑀ tions. Often such hypotheses involve single contrasts of 23 23 ␺ ϭ ¥p ␮ y 100 ⑀ the form jϭ1 cj j, with cj , the contrast weights and the 33 33 null hypothesis being that ␺ ϭ 0. Rosenthal et al. discussed In this case, it is easy to show using any standard multiple several different correlational measures for assessing the regression program that the sample squared multiple corre- status of hypotheses on a single contrast. In this section, I lation for predicting y from X is .90 and that the F statistic discuss methods for exact confidence interval estimation of for testing the null hypothesis that ␳2 ϭ 0is measures of effect size for a single contrast, including the 2 population equivalent of the correlation measure rcontrast R2/2 .9/2 discussed by Rosenthal et al. F͑2, 6͒ ϭ ϭ ϭ 27.0. (39) ͑1 Ϫ R2͒/6 .1/6 Exact Confidence Intervals for Standardized This F statistic is identical to the one obtained by perform- Contrast Effect Size ing a one-way fixed-effects ANOVA on the data. The 90% ␭ ␭ ϭ Consider a contrast hypothesis on means, of the form confidence interval for has endpoints of 1 10.797 and ␭ ϭ 2 119.702. The lower endpoint for the confidence inter- p val on ␳2, the coefficient of determination, is thus ␺ϭ͸ ␮ ϭ H0: cj j 0. (42) 10.797 10.797 jϭ1 ϭ ϭ ϩ .545, (40) 10.797 9 19.797 With equal sample sizes of n per group, this hypothesis may be tested with a t statistic of the form and the upper endpoint is ␺ˆ 119.702 119.702 ϭ ͱ ϭ ϭ .930. 
(41) t n , (43) 119.702 ϩ 9 128.702 p ͑͸ 2͒ ͱMSwithin cj With one-way ANOVA and equal n per group, this con- jϭ1 BEYOND THE F TEST 173

with

    ψ̂ = Σ_{j=1}^p c_j Ȳ_j,  (44)

where Ȳ_j represents the sample mean of the jth group. The standardized effect size E_s is the size of the contrast in standard deviation units, that is,

    E_s = ψ / σ.  (45)

The test statistic has a noncentral t distribution with p(n − 1) degrees of freedom and a noncentrality parameter of

    δ = √(n / Σ_{j=1}^p c_j²) E_s = (L)(E_s).  (46)

To estimate E_s, one obtains a confidence interval for δ, using the method discussed by Steiger and Fouladi (1997), and transforms the endpoints of the confidence interval by dividing them by L (i.e., the square-root factor in Equation 46), as shown in the example below.

Example 8: The data in Table 2 represent four independent groups of three observations each. Suppose one wished to test the following null hypothesis:

    ψ = (μ₁ + μ₄)/2 − (μ₂ + μ₃)/2 = 0.  (47)

Table 2
Sample Data for a One-Way Analysis of Variance

Group 1   Group 2   Group 3   Group 4
   1         3         8        12
   2         4         9        13
   3         5        10        14

This hypothesis tests whether the average of the means of the first and fourth groups is equal to the average of the means of the other two groups. Suppose we observe t(8) = 1.7321. The traditional 95% confidence interval for ψ ranges from −0.3314 to 2.3314. Because mean square error is 1 in this example, we would expect a confidence interval for E_s to be similar. Actually, it is somewhat narrower. The 95% confidence interval for δ ranges from −0.4429 to 3.8175. The sum of squared contrast weights is 1, so L = √3, and the endpoints of the confidence interval are divided by √3 to obtain 95% confidence limits of −0.2557 and 2.2041 for E_s.

Exact Confidence Intervals for ρ²_contrast

Rosenthal et al. (2000) discussed the sample statistic r²_contrast, which is the squared partial correlation between the contrast weight vector discussed in the previous section and the scores in y, with all other sources of systematic between-groups variation partialed out. Consider the data discussed in the preceding example. The contrast weights happen to be the rescaled orthogonal polynomial weights for testing quadratic trend. The remaining sources of between-groups variation may be predicted from any orthogonal complement of the quadratic trend contrast weights. Consequently, if we construct vectors with columns of repeated linear and cubic contrast weights, the partial correlation between y and the contrast weights, with the linear and cubic weights partialed out, is r_contrast. Its square may also be computed directly from the standard F statistic for the contrast as

    r²_contrast = F_contrast / (F_contrast + df_within).  (48)

Rosenthal et al. (2000) did not discuss sampling theory for r²_contrast. However, a population equivalent, ρ²_contrast, may be defined, and it may be shown (see Appendix B) that, with p groups in the analysis,

    F_contrast = r²_contrast / ((1 − r²_contrast)/(N_tot − p))  (49)

has a noncentral F distribution with 1 and N_tot − p degrees of freedom and noncentrality parameter

    λ = N_tot ρ²_contrast / (1 − ρ²_contrast).  (50)

Consequently, one may construct a confidence interval for ρ²_contrast by computing a confidence interval for λ and transforming the endpoints, using the result of Equation 37.

Example 9: Consider again the data in Table 2. We can compute the F statistics corresponding to linear, quadratic, and cubic trend and, for each trend, compute confidence intervals for ρ²_contrast and/or ρ_contrast. For example, consider the test for linear trend. The F statistic is 216, with 1 and 8 degrees of freedom, and the 95% confidence interval for the noncentrality parameter λ has endpoints of 54.2497 and 483.8839. Consequently, from Equation 37, a 95% confidence interval for ρ²_contrast has endpoints of

    ρ²_lower = 54.2497/(54.2497 + 12) = .819,
    ρ²_upper = 483.8839/(483.8839 + 12) = .976.  (51)

The confidence interval for ρ_contrast (defined as the square root of ρ²_contrast, thus excluding negative values, as in Rosenthal et al., 2000) ranges from .905 to .988.

Table 3 shows the results of computing contrast correlations and the associated confidence intervals for linear, quadratic, and cubic trend.

Table 3
Confidence Intervals (CIs) for Contrast Correlations

Statistic     Linear      Quadratic   Cubic
F             216.00      3.00        2.40
r²_contrast   .964        .273        .231
CI            .819–.976   0–.541      0–.520
r_contrast    .982        .522        .480
CI            .905–.988   0–.741      0–.721

Some brief comments are in order. Note, first, that although the r_contrast values for quadratic and cubic trends are appealingly high, the corresponding confidence intervals are quite wide and include zero. On the other hand, the confidence interval for the linear trend is very narrow.

The Relationship Between Confidence Intervals and Hypothesis Tests—Choosing the Appropriate Interval

Confidence intervals on measures of effect size convey all the information in a hypothesis test, and more. If one selects an appropriate confidence interval, a hypothesis test may be performed simply by inspection. If the confidence interval excludes the null hypothesized value, then the null hypothesis is rejected. In such applications, I recommend using the two-sided confidence interval, rather than a one-sided interval (or confidence bound), regardless of whether the hypothesis test is one-sided or two-sided. When a two-sided confidence interval is used to perform the hypothesis test, the confidence level must be matched appropriately both to the type of hypothesis test and to the Type I error rate. Recall that the endpoints of the two-sided confidence interval for a parameter θ at the 100(1 − α)% confidence level are the values of θ that place the observed statistic θ̂ at the α/2 or 1 − α/2 cumulative probability point. Suppose the upper and lower limits of the 100(1 − α)% confidence interval are U and L, respectively. Then θ̂ is the rejection point at the α/2 significance level for one-sided hypothesis tests that θ is, first, greater than or equal to U and, second, less than or equal to L. The observed statistic θ̂ is also equal to (a) the upper rejection point for a two-sided test that θ = L at the alpha level and (b) the lower rejection point for the two-sided test that θ = U at the alpha level. Consequently, the endpoints of the confidence interval represent two values of θ that the observed statistic would barely reject in a two-sided test with significance level alpha. These endpoints are also appropriate for testing one-sided hypotheses at the α/2 significance level.

The preceding paragraph implies a general rule of thumb for using confidence intervals to test a statistical hypothesis while maintaining a Type I error rate of alpha:

1. When testing a two-sided hypothesis at the alpha level, use a 100(1 − α)% confidence interval.

2. When testing a one-sided hypothesis at the alpha level, use a 100(1 − 2α)% confidence interval.

Example 10: Consider a test of the hypothesis that Ψ = 0, that is, that the RMSSE (as defined in Equation 12) in an ANOVA is zero. This hypothesis test is one-sided, because the RMSSE cannot be negative. To use a two-sided confidence interval to test this hypothesis at the α = .05 significance level, one should examine the 100(1 − 2α)% = 90% confidence interval for Ψ. If the confidence interval excludes zero, the null hypothesis will be rejected. This hypothesis test is equivalent to the standard ANOVA F test.

Example 11: Consider the test that the standardized effect size E_s in Equation 45 is precisely zero. This hypothesis test is two-sided, because E_s can be either positive or negative. Consequently, to use a confidence interval to test this hypothesis at the .05 level, a 100(1 − α)% = 95% two-sided confidence interval should be used, and the null hypothesis rejected only if both ends of the confidence interval are above zero or if both are below zero.

Example 12: Consider a situation in which one wishes to establish that the standardized effect size E_s in Equation 45 is small, and that smallness is defined as an absolute value less than 0.20. To establish smallness, one must reject a hypothesis that E_s is not small. Because E_s can be either positive or negative, E_s can be not small in two directions. The hypothesis that E_s is not small can therefore be tested with two simultaneous one-sided hypothesis tests,

    H₀₁: E_s ≤ −0.20 versus H_a1: E_s > −0.20  (52)

and

    H₀₂: E_s ≥ 0.20 versus H_a2: E_s < 0.20.  (53)

These two hypotheses can both be tested simultaneously at the .05 level by constructing a 90% confidence interval and observing whether the lower end of the interval is above −0.20 (to test the first one-sided hypothesis) and the upper end of the interval is below 0.20. What this amounts to is observing whether the entire interval is between −0.20 and 0.20. If so, the hypothesis that E_s is not small is rejected, and smallness is indicated.
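The contrast F statistics and r²_contrast values reported in Table 3 can be checked directly from the Table 2 data. The following sketch is my own reconstruction (assuming NumPy is available), shown for the quadratic trend; the weight vector and variable names are illustrative, not the article's.

```python
import numpy as np

# Table 2: four independent groups of three observations each (rows = groups).
groups = np.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 5.0],
                   [8.0, 9.0, 10.0],
                   [12.0, 13.0, 14.0]])
c = np.array([1.0, -1.0, -1.0, 1.0])    # quadratic-trend contrast weights
n = groups.shape[1]                     # observations per group
df_within = groups.size - len(groups)   # p(n - 1) = 8

psi_hat = c @ groups.mean(axis=1)                     # Equation 44
ms_within = groups.var(axis=1, ddof=1).mean()         # pooled, equal n per group
f_contrast = psi_hat**2 * n / (ms_within * (c @ c))   # single-df contrast F
r2_contrast = f_contrast / (f_contrast + df_within)   # Equation 48
```

Swapping in the cubic weights (−1, 3, −3, 1) reproduces the F = 2.40 entry of Table 3 in the same way.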

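Every exact interval above, and the minimal-effect tests that follow, reduce to the same computation: a confidence interval for the noncentrality parameter λ, found by pivoting the noncentral F cumulative distribution (Steiger & Fouladi, 1997). The sketch below is my own reconstruction, assuming SciPy; the function name is mine. It reproduces the Example 5 endpoints for the A effect, F(1, 210) = 4.2708.

```python
from scipy.optimize import brentq
from scipy.stats import f as fdist, ncf

def lambda_ci(f_obs, df_num, df_den, conf=0.90):
    """CI for lambda: the endpoints place the observed F at the
    1 - alpha/2 and alpha/2 quantiles of the noncentral F
    distribution (interval truncated below at zero)."""
    alpha = 1.0 - conf

    def solve(target):
        # ncf.cdf decreases in lambda; lambda = 0 is the central F case.
        if fdist.cdf(f_obs, df_num, df_den) <= target:
            return 0.0
        hi = 1.0
        while ncf.cdf(f_obs, df_num, df_den, hi) > target:
            hi *= 2.0
        return brentq(
            lambda lam: ncf.cdf(f_obs, df_num, df_den, lam) - target,
            1e-10, hi)

    return solve(1.0 - alpha / 2.0), solve(alpha / 2.0)

# Example 5: 90% interval for lambda, then Equations 18, 19, and 16.
lam_lo, lam_hi = lambda_ci(4.2708, 1, 210)
psi_lo, psi_hi = (lam_lo / 126) ** 0.5, (lam_hi / 126) ** 0.5
omega2_hi = (lam_hi / 252) / (1 + lam_hi / 252)
```

The same function, called with the appropriate F statistic and degrees of freedom, yields the λ intervals quoted in Examples 6, 7, 9, and 13.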
Tests of Minimal Effect

Rationale and Method

In many situations, the null hypothesis of zero effect is inappropriate or can be misleading. For example, in reject-support (R-S) testing with extremely large sample sizes, a null hypothesis may be rejected consistently, with a very low probability level, even when the population effect is small. Conversely, in accept-support (A-S) testing, the nil hypothesis of zero effect is often unreasonable, and the hypothesis the experimenter probably wants to test is that the effect is trivial.

Tests of minimal effect are a partial solution to the problems caused by inappropriate testing of a nil hypothesis. For example, if some "minimal reasonable" effect size can be specified, rejection of the hypothesis that the effect is less than or equal to this value is of practical importance whether or not the sample size is very large. In the traditional A-S situation, in which the experimenter is trying to show that an effect is trivial, the hypothesis that the effect is greater than or equal to a minimal reasonable value can be tested. Serlin and Lapsley (1993) discussed this latter notion in detail and gave numerical examples. In such cases, large sample size will work for, rather than against, the experimenter, because if the effect size is truly below a level that is of practical import, larger samples will yield greater power to demonstrate that fact by rejecting the null hypothesis that the effect is at or above a point of triviality.

The confidence intervals described in the preceding section can be used to test hypotheses of minimal effect: One simply observes whether the appropriately constructed confidence interval contains the target minimal reasonable value. For example, suppose you decide that an RMSSE of 0.25 constitutes a minimal reasonable effect. In other words, effects below that level may be ignored. Effects that are definitely above that level are nontrivial. If you wish to demonstrate that effects are trivial, you might test the hypotheses

    H₀: Ψ ≥ 0.25;  H₁: Ψ < 0.25.  (54)

On the other hand, if you wish to demonstrate that effects are definitely not trivial, you might test the hypotheses

    H₀: Ψ ≤ 0.25;  H₁: Ψ > 0.25.  (55)

In each case, rejecting the null hypothesis will support the goal in performing the test, and the problems inherent in A-S testing can be avoided.

A simple approach to simultaneously testing the two hypotheses discussed above is to examine the 1 − 2α confidence interval for Ψ and see whether it excludes 0.25. If the entire confidence interval is above the point of triviality (i.e., 0.25), then the effect may be judged nontrivial. If the entire confidence interval is below the point of triviality, then the effect has been shown to be trivial. There is a strong similarity between using the effect size confidence interval in this way and the long tradition of bioequivalence testing.

Example 13: Suppose you have p = 6 groups and n = 75 per group. You observe an F statistic of F(5, 444) = 2.28, with p = .046, so the nil hypothesis of zero effects is rejected at the .05 significance level. However, on substantive grounds, you have decided that a value of Ψ less than 0.25 can be ignored. To demonstrate triviality, you would attempt to reject the null hypothesis that Ψ is greater than or equal to 0.25.

There are two approaches to performing the test. The first approach requires only a single calculation from the noncentral F distribution. Consider the cutoff value of 0.25. Using the result of Equation 12, one may convert this to a value for λ via the formula λ = (p − 1)nΨ² = (6 − 1)(75)(0.25²) = 23.4375. The observed F statistic of 2.28 has a one-sided probability value of .0256 in the noncentral F distribution with λ = 23.4375 and 5 and 444 degrees of freedom, so the null hypothesis is rejected at the .05 level, and the overall effects are declared trivial.

An alternative approach uses the confidence interval. Note that, because the test is one-sided, we use the 90% confidence interval. The endpoints of the interval for λ are 0.1028 and 20.3804. Using the result of Equation 12, we convert this confidence interval into a confidence interval for Ψ by dividing the endpoints by (p − 1)n = 375, then taking the square root. The resulting endpoints of the confidence interval for Ψ are 0.0166 and 0.2331. This confidence interval excludes 0.25, so we can reject the hypothesis that effects are nontrivial, that is, that Ψ ≥ 0.25, at the .05 significance level. The advantage of using the confidence interval is that it provides an approximate indication of the precision of the estimation process while still allowing us to perform the hypothesis test.

Significant technical and theoretical issues surround the use of confidence intervals in this manner.

1. The choice of a numerical "point of triviality" for a measure of omnibus effect size should not be treated as a mechanical selection from a small menu of "approved" choices. Rather, it should be considered carefully on the basis of the specific experimental design and the substantive aspects of the variables being measured and manipulated. Whereas .25 might be considered trivial in one experiment, it might be considered very important in another.

2. The power of both hypothesis tests must be analyzed a priori to assess whether sample size is adequate. With low precision (i.e., a wide confidence interval), one might still have high power to demonstrate nontrivial effects if effects are large. However, it is virtually impossible to demonstrate triviality if precision is low, because the triviality point will be close to zero, and a wide confidence interval will not fit between zero and the triviality point.

Full consideration of the technical aspects of estimating the point of triviality, the precision of a parameter estimate, and the resulting confidence interval is beyond the scope (and length restrictions) of this article. However, in the next section, I discuss several theoretical issues that the sophisticated user should keep in mind.

Conclusions and Discussion

This article demonstrates that the F statistic in ANOVA contains information about standardized effect size, and its precision of estimation, that has not been made available in typical reports and is not reported by traditional software packages. Yet this information can readily be calculated, using a few basic techniques.

The fact is, simply reporting an F statistic and a probability level attached to a hypothesis of nil effect is so suboptimal that its continuance can no longer be justified, at least in a social science tradition that prides itself on empiricism. A number of the field's most influential methodological commentators have emphasized this and urged that, as researchers, we revise our approach to reporting the results of significance tests (e.g., see articles in Harlow, Mulaik, & Steiger, 1997).

Null hypothesis testing is the source of much controversy. I have tried to promote an eclectic, integrated point of view that resists the temptation to downgrade either the hypothesis-testing or the interval estimation approaches and emphasizes how they complement each other. Reviewers and other readers of the article have provided much food for thought and have raised several substantive criticisms that enriched my point of view considerably. In the following sections, I discuss some of the limitations of the procedures in this article, deal explicitly with several of the more common objections to my major suggestions, and then summarize my point of view and present some conclusions.

Statistical Limitations and Extensions of the Present Procedures

The procedures discussed in this article provide exact distributional results under standard ANOVA assumptions (independence, normality, and equal variances) and are easily calculated with modern software. However, they are restricted to (a) completely between-subjects fixed-effects ANOVA with (b) equal n per cell. The present article does not present procedures for dealing with the complications that result from unbalanced designs and/or repeated measures, nor does it discuss extensions to random-effects or mixed ANOVA models or to multivariate analyses.

In some cases, procedures for these other situations are already available. Consider, for example, the case of one-way random-effects ANOVA. The treatment effects are random variables with a variance of σ²_A, and Ψ may be redefined as σ_A/σ. A 100(1 − α)% confidence interval for Ψ may therefore be obtained in the equal n case by taking the square root of the well-known (Glass & Hopkins, 1996, p. 542) confidence interval for σ²_A/σ². One obtains, with p groups,

    Ψ_lower = √max[n⁻¹(F_obs/F*_{α/2} − 1), 0],
    Ψ_upper = √max[n⁻¹(F_obs/F*_{1−α/2} − 1), 0].  (56)

F_obs is the observed value of the F statistic, and F* is the indicated percentage point from the F distribution with p − 1 and p(n − 1) degrees of freedom.

This approach can be generalized to more complicated designs. Burdick and Graybill (1992) discussed general methods for obtaining exact confidence intervals for Ψ and related quantities in random-effects models, in both the equal n and unbalanced cases. Computational procedures for the unbalanced case are much more complicated than for the case of equal n.

However, on close inspection, some extensions yield challenging complications that require careful analysis. Some examples are as follows.

1. In the unbalanced, fixed-effects case, the noncentrality parameter λ is defined as follows:

    λ = Σ_{j=1}^p n_j (α_j/σ)².  (57)

Note that, with μ defined as in Equation 9, the quantity f² as defined in Equation 10 represents the ratio of between-groups to within-group variance in a population with probability of membership in the treatment groups proportional to the sample sizes in the ANOVA. There are situations in which this quantity is of interest (such as when the sampling plan reflects the relative size of natural subpopulations) and others in which it might not be. Cohen (1988, pp. 359–361) discussed this point in detail.

2. In repeated measures ANOVA designs, the noncentrality parameter λ unfortunately confounds effects of treatments with the correlation among observations. For example, in a one-way within-subjects design, if the data possess compound symmetry, the noncentrality parameter is

    λ = [n/(1 − ρ)] Σ_{j=1}^p [(μ_j − μ)/σ]².  (58)

The RMSSE, Ψ, as defined in Equation 12, though still an appropriate measure of effect size, cannot be estimated directly using the exact techniques discussed in this article unless ρ is known. For a detailed discussion of this issue in the context of meta-analysis, see Dunlap, Cortina, Vaslow, and Burke (1996).

3. In multivariate analysis, the noncentrality parameter includes information about the variances and correlations of the dependent variables. For example, when two populations are compared on k dependent variables, using Hotelling's T² with two independent samples of size n₁ and n₂, the standard F statistic has k and n₁ + n₂ − k − 1 degrees of freedom and has a noncentral F distribution with a noncentrality parameter λ that is a simple function of the squared population Mahalanobis distance Δ²:

    λ = [n₁n₂/(n₁ + n₂)] Δ².  (59)

The latter, computed as

    Δ² = (μ₂ − μ₁)′Σ⁻¹(μ₂ − μ₁),  (60)

with μ₁ and μ₂ the population mean vectors and Σ the common covariance matrix, may be described as a sum of squared orthogonalized and standardized mean differences. Consequently, a natural analogue of Equation 12 that takes into account the number of dependent variables is

    Ψ = √(Δ²/k).  (61)

A confidence interval on Δ² may be calculated easily (Reiser, 2001) from a confidence interval on λ, using the results of Equation 59. This interval may, in turn, be transformed into a confidence interval on Ψ using Equation 61.

We see that in one of the contexts discussed above (repeated measures), dependencies between measures are an annoying confound that must be removed from consideration. In another (the case of two populations), they are an essential ingredient for proper evaluation of effect size. In some of the problematic cases discussed above, and in situations where the standard ANOVA statistical assumptions are inappropriate, resampling methods such as bootstrapping can be used to obtain appropriate confidence intervals.

The width of a confidence interval often is described as indicating precision of measurement. However, as Steiger and Fouladi (1997, pp. 254–255) pointed out, this relationship is less than perfect and is seriously compromised in some situations for several reasons. The width of a confidence interval is itself a random variable and is subject to sampling variations. Moreover, the confidence intervals are truncated at zero to avoid improper estimates. In extreme cases, a confidence interval might actually have 0 as both endpoints. This zero-width confidence interval obviously does not imply that effect size was determined with perfect precision.

Focused Contrasts or Omnibus Hypotheses?

In an early version of this article, I concentrated almost exclusively on omnibus measures of effect size. Several reviewers objected that confidence intervals on measures of standardized effect size such as ω² and the RMSSE were, to paraphrase, an elegant solution to the wrong problem. These writers echoed the view of Rosenthal et al. (2000), who stated that "omnibus questions seldom address questions of real interest to researchers, and are typically less powerful than focused procedures" (p. 1).

I share an enthusiasm for focused contrasts and recommend them in lieu of an omnibus test whenever researchers have clear ideas about linear contrasts. Moreover, I believe that not enough researchers have been trained to look carefully for ways to phrase their ideas as contrasts. However, I think that dismissing the improvements to the use of the F statistic suggested in this article ignores several important realities.

First, much research in the social sciences is exploratory, and an omnibus F test in such circumstances may be the prelude to subsequent examination of unplanned contrasts. In such cases, an overall measure of the strength of effect sizes, and the precision with which they have been determined, may alert the researcher in advance to a lack of overall precision in the experimental design. Second, when one is comparing several studies that have reported overall F tests, comparing confidence intervals on standardized effect size measures can be very useful in resolving apparent disparities in experimental outcomes.

As it turns out, the confidence interval on ρ²_contrast is closely related, conceptually and computationally, to the procedure for computing a confidence interval on ω², a standardized measure of omnibus effect size. The latter index involves the squared multiple correlation between the observed data and a set of contrast weights, whereas the former involves the squared correlation between the data and one set of contrast weights, with the variation predicted by the complementary contrasts partialed out.

same technology that I find useful for omnibus tests may be applied directly to contrasts. I believe that reporting an exact confidence interval on ρcontrast is substantially more informative than simply reporting the raw coefficient. And, to be clear, I fully support a concentration on focused contrasts in lieu of omnibus tests whenever the experimenter has firm questions that suit the contrast analysis framework.

Some Recent Objections to Standardized Measures of Effect Size

Revised hypothesis-testing strategies for ANOVA require specification of target values of a standardized measure of effect size. The confidence interval approach is more relaxed but strongly tends to lead the experimenter to consider which overall effect sizes qualify as trivial and which are nontrivial in a particular application.

Although many writers have emphasized the value of standardized measures of effect size in power analysis and sample size estimation, standardized effect size measures do have some shortcomings. As a nonlinear combination of several sources of variation in an experiment, they reduce several values to one and are of necessity less precise than similar indices computed on a focused contrast. Moreover, ANOVA effects as used in the calculation of the noncentrality parameter λ in the omnibus test may or may not correspond to experimental effects as commonly conceptualized (see, e.g., Steiger & Fouladi, 1997, pp. 244–245), and focused contrasts can get at such experimental effects much more effectively than an omnibus procedure. Recently, Lenth (2001) suggested dispensing with standardized measures of effect size altogether in the context of power analysis and sample size estimation. His main justification was that combining information about raw effects (i.e., mean differences) and variation ignored a possible impact of reliability of measurement.

Reconciling the Interval Estimation and Minimal-Effect-Testing Approaches

As stated at the outset, this article discusses two major approaches that might be used to replace the traditional F test in ANOVA. The noncentrality interval estimation approach emphasizes estimation of some function of overall effect size, along with an indication of the precision of the measurement. The dual hypothesis-testing approach replaces the hypothesis of nil effect with two hypotheses: one that the effect is trivial, the other that it is nontrivial.

The approach I personally favor is confidence interval estimation on some standardized measure of overall effect size. This approach may be viewed as replacing hypothesis testing entirely, yet it can be used to perform both kinds of hypothesis tests required by the dual hypothesis-testing framework. Specifically, one simply examines, simultaneously, whether the confidence interval excludes a trivial effect value on the left or the right. If, for example, the confidence interval lies entirely above the cutoff point for a trivial effect, one rejects the hypothesis of triviality. If the confidence interval lies entirely below the cutoff point, one rejects the hypothesis of nontriviality.

Moreover, the confidence interval approach, being an exact procedure, also provides all the information available in the standard F test. For example, the F test results in rejection at the .05 level if and only if the 90% confidence interval for Ψ excludes zero.

The hypothesis-testing approach offers advantages as well. For one, it keeps the analysis within the comfortably familiar bounds of hypothesis testing. For another, it is computationally easier: one may perform the hypothesis test without extensive iteration, and so it may be performed with a wider range of available free software. Another advantage is that, by simultaneously analyzing power for both a test of triviality and a test of nontriviality, the user can be relatively certain that the confidence interval, if calculated, will have enough precision to determine whether effects are trivial or not.

Standardized Effects and Coefficients of Determination—A Caution

Any statistical technique offers opportunity for abuse and misuse, especially if the technique is used mechanically and without taking into account the special circumstances surrounding a particular set of data. Abelson (1995) discussed in detail how important it is to remain open-minded when judging the importance of effect sizes. In some cases, effects that seem small may be quite important. This should be kept in mind before effects that are nonzero, but seemingly trivial, are dismissed. Abelson's comments are similar to Cohen's (1988, pp. 534–535) in his chapter on special issues in power analysis.

Casting a Vote for Change

A fundamental contribution to behavioral statistics by Cohen (1962) was to demonstrate that many studies lack sufficient statistical power. The initial emphasis on power analysis spearheaded by Cohen has now given way to a more sophisticated emphasis on precision of estimation.

Confidence intervals on standardized measures of effect size allow one to assess how precisely effects have been measured and simultaneously to assess whether the experiment has ruled out (a) the notion that effects are trivial and (b) the notion that they are nontrivial. The procedures are straightforward and offer obvious benefits. It is time for a change. Yet there are numerous obstacles to change in behavioral statistics practice. A significant obstacle is the dominant influence that a few commercial statistical packages, such as SPSS and SAS, have on practice in the field. Given the way psychology has operated in the past, procedures are unlikely to be used until they have been implemented in a widely used statistics package, and commercial statistics packages tend to be conservative toward new approaches.

In the final analysis, the impetus for change may have to come from journal editors and practitioners, some of whom have resisted change for a variety of reasons discussed by Thompson (1999). Fortunately, the Internet makes it possible to distribute innovative software to practitioners very easily at virtually zero cost. There is no longer any reason to report a squared multiple correlation, an ANOVA F statistic, or a focused contrast t test without providing information about confidence intervals on standardized effects. Each reader of this article can cast votes for change by obtaining the freeware I (and other authors) have made available and then, when reviewing articles that report omnibus tests and focused contrasts without associated intervals, taking two simple steps: (a) performing their own calculation of confidence intervals on standardized effect size and (b) requesting that the author include this information in the published article.

References

Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Erlbaum.

Burdick, R. K., & Graybill, F. A. (1992). Confidence intervals on variance components. New York: Dekker.

Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury.

Chow, S.-C., & Liu, J.-P. (2000). Design and analysis of bioavailability and bioequivalence studies. New York: Dekker.

Cohen, J. (1962). The statistical power of abnormal–social psychological research. Journal of Abnormal and Social Psychology, 65, 145–153.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Mahwah, NJ: Erlbaum.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

Dunlap, W. P., Cortina, J. M., Vaslow, J. M., & Burke, M. J. (1996). Meta-analysis of experiments with matched groups or repeated measures designs. Psychological Methods, 1, 170–177.

Fleishman, A. E. (1980). Confidence intervals for correlation ratios. Educational and Psychological Measurement, 40, 659–670.

Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology (3rd ed.). Needham Heights, MA: Allyn & Bacon.

Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.

Lenth, R. V. (2001). Some practical guidelines for effective sample size determination. The American Statistician, 55, 187–193.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130–149.

Mels, G. (1989). A general system for path analysis with latent variables. Unpublished master's thesis, University of South Africa, Pretoria, South Africa.

Metzler, C. M. (1974). Bioavailability: A problem in equivalence. Biometrics, 30, 309–332.

Ozer, D. J. (1985). Correlation and the coefficient of determination. Psychological Bulletin, 97, 307–315.

Reiser, B. (2001). Confidence intervals for the Mahalanobis distance. Communications in Statistics: Simulation and Computation, 30, 37–45.

Rosenthal, R. (1991). Effect sizes: Pearson's correlation, its display via the BESD, and alternative indices. American Psychologist, 46, 1086–1087.

Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioral research: A correlational approach. New York: Cambridge University Press.

Rosnow, R. L., & Rosenthal, R. (1996). Computing contrasts, effect sizes, and counternulls on other people's published data: General procedures for research consumers. Psychological Methods, 1, 331–340.

Sampson, A. R. (1974). A tale of two regressions. Journal of the American Statistical Association, 69, 682–689.

Schmidt, F. L. (1996). Statistical significance testing and cumulative research in psychology: Implications for the training of researchers. Psychological Methods, 1, 115–129.

Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 38–64). Mahwah, NJ: Erlbaum.

Searle, S. R. (1987). Linear models for unbalanced data. New York: Wiley.

Serlin, R. A., & Lapsley, D. K. (1993). Rational appraisal of psychological research and the good-enough principle. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 199–228). Hillsdale, NJ: Erlbaum.

Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement, 61, 603–630.

Steiger, J. H. (1989). EzPATH: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL: Systat.

Steiger, J. H. (1990, October). Noncentrality interval estimation and the evaluation of statistical models. Paper presented at the meeting of the Society of Multivariate Experimental Psychology, Kingston, RI.

Steiger, J. H. (1999). STATISTICA power analysis. Tulsa, OK: StatSoft.

Steiger, J. H., & Fouladi, R. T. (1992). R2: A computer program for interval estimation, power calculation, and hypothesis testing for the squared multiple correlation. Behavior Research Methods, Instruments, and Computers, 4, 581–582.

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 221–257). Mahwah, NJ: Erlbaum.

Steiger, J. H., & Lind, J. C. (1980, May). Statistically based tests for the number of common factors. Paper presented at the meeting of the Psychometric Society, Iowa City, IA.

Steiger, J. H., & Ward, L. M. (1987). Factor analysis and the coefficient of determination. Psychological Bulletin, 99, 471–474.

Taylor, D. J., & Muller, K. E. (1995). Computing confidence bounds for power and sample size of the general linear univariate model. The American Statistician, 49, 43–47.

Taylor, D. J., & Muller, K. E. (1996). Bias in linear model power and sample size calculation due to estimating noncentrality. Communications in Statistics: Theory and Methods, 25, 1595–1610.

Thompson, B. (1999). Why "encouraging" effect size reporting is not working: The etiology of researcher resistance to changing practices. The Journal of Psychology, 133, 133–140.

Venables, W. (1975). Calculation of confidence intervals for noncentrality parameters. Journal of the Royal Statistical Society, Series B, 37, 406–412.

Westlake, W. J. (1976). Symmetrical confidence intervals for bioequivalence trials. Biometrics, 32, 741–744.
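The confidence interval machinery discussed in the article can be sketched computationally. The following Python fragment (a sketch, not the article's own software: the observed F value, the group sizes, and the triviality cutoff Ψ₀ = 0.15 are purely illustrative, and the conversion from λ to the RMSSE assumes the balanced one-way form Ψ = √(λ/[n(p − 1)]) consistent with Equation 12) inverts the noncentral F distribution to obtain an exact 90% confidence interval on λ, then applies the dual triviality/nontriviality tests described in the section on reconciling the two approaches:

```python
import numpy as np
from scipy.stats import ncf
from scipy.optimize import brentq

def lambda_interval(f_obs, df1, df2, conf=0.90, search_max=1e4):
    """Exact CI for the noncentrality parameter of a noncentral F,
    found by inverting the noncentral F CDF and truncating at zero."""
    alpha = 1.0 - conf
    cdf = lambda lam: ncf.cdf(f_obs, df1, df2, lam)
    # Lower limit: the lambda for which f_obs is the (1 - alpha/2) quantile;
    # truncated to 0 when even lambda = 0 cannot reach that quantile.
    lo = 0.0 if cdf(0.0) < 1 - alpha / 2 else \
        brentq(lambda lam: cdf(lam) - (1 - alpha / 2), 0.0, search_max)
    # Upper limit: the lambda for which f_obs is the (alpha/2) quantile.
    hi = 0.0 if cdf(0.0) < alpha / 2 else \
        brentq(lambda lam: cdf(lam) - alpha / 2, 0.0, search_max)
    return lo, hi

# Illustrative balanced one-way ANOVA: p = 4 groups, n = 10 per group.
p, n = 4, 10
f_obs, df1, df2 = 5.0, p - 1, p * (n - 1)
lam_lo, lam_hi = lambda_interval(f_obs, df1, df2)

# Convert to an RMSSE interval, assuming Psi = sqrt(lambda / (n * (p - 1))).
psi_lo = np.sqrt(lam_lo / (n * (p - 1)))
psi_hi = np.sqrt(lam_hi / (n * (p - 1)))

# Dual tests with a hypothetical triviality cutoff Psi0 = 0.15:
# reject "trivial" if the whole interval lies above Psi0;
# reject "nontrivial" if it lies entirely below Psi0.
psi0 = 0.15
reject_trivial = psi_lo > psi0
reject_nontrivial = psi_hi < psi0
```

Because the interval is exact, the lower limit on λ exceeds zero exactly when the traditional F test rejects at the .05 level, so the interval reproduces the nil-hypothesis test as a by-product.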

Appendix A

The Relationship Between ρ² and λ in One-Way ANOVA

Define, for the p sample means,

$$ s_{\bar{x}\cdot}^2 = \frac{1}{p-1}\sum_{j=1}^{p}(\bar{x}_{\cdot j}-\bar{x}_{\cdot\cdot})^2. \qquad (A1) $$

The corresponding population quantity is

$$ s_{\mu}^2 = \frac{1}{p-1}\sum_{j=1}^{p}(\mu_j-\bar{\mu}_\cdot)^2 = \frac{\beta' X' Q_1 X \beta}{n(p-1)}, \qquad (A2) $$

where β, X, and Q₁ are as described in Equations 29 through 38. In a balanced, one-way ANOVA, with p groups and n observations per group, SS_treatments = n(p − 1)s²_x̄·.

Consider any estimator θ̂ of a parameter θ. The probability limit of θ̂, denoted plim(θ̂), is equal to a value c if and only if, for any δ > 0, we have

$$ \lim_{n\to\infty}\Pr(|\hat{\theta}-c|<\delta)=1. $$

The notion of a probability limit is closely related to that of consistency, in that θ̂ is a consistent estimator for θ if and only if plim_{n→∞}(θ̂) = θ. In what follows, for brevity of notation, I simply write plim(X) rather than plim_{n→∞}(X). I use a number of well-known results. In particular, if plim(X) and plim(Y) exist, then

$$ \mathrm{plim}(X+Y)=\mathrm{plim}(X)+\mathrm{plim}(Y), \quad \mathrm{plim}(X/Y)=\mathrm{plim}(X)/\mathrm{plim}(Y), \quad \mathrm{plim}(XY)=\mathrm{plim}(X)\,\mathrm{plim}(Y). $$

Moreover, the plim of a sample moment is equal to the corresponding population moment. We define ρ² as plim(R²), that is, the value that R² converges to in an infinite population. Then

$$ \rho^2 = \mathrm{plim}(R^2) = \mathrm{plim}\!\left(\frac{SS_{treatments}}{SS_{treatments}+SS_{error}}\right) = \frac{\mathrm{plim}[n(p-1)s_{\bar{x}\cdot}^2]}{\mathrm{plim}[n(p-1)s_{\bar{x}\cdot}^2]+\mathrm{plim}[p(n-1)MS_{error}]} $$

$$ = \frac{\mathrm{plim}(s_{\bar{x}\cdot}^2)}{\mathrm{plim}(s_{\bar{x}\cdot}^2)+\lim_{n\to\infty}\left[\frac{p(n-1)}{(p-1)n}\right]\mathrm{plim}(MS_{error})} = \frac{s_\mu^2}{s_\mu^2+\frac{p}{p-1}\sigma^2} = \frac{(p-1)s_\mu^2}{(p-1)s_\mu^2+p\sigma^2}. \qquad (A3) $$

Combining Equations A1 through A3, we obtain

$$ \rho^2 = \frac{\beta' X' Q_1 X \beta/n}{\beta' X' Q_1 X \beta/n + p\sigma^2}, \qquad (A4) $$

$$ N_{tot}\,\frac{\rho^2}{1-\rho^2} = N_{tot}\,\frac{\beta' X' Q_1 X \beta}{np\sigma^2} = \frac{\beta' X' Q_1 X \beta}{\sigma^2} = \lambda, \qquad (A5) $$

where λ is as defined in Equation 34.
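The Appendix A identity can be checked numerically. The sketch below uses hypothetical population values for μ and σ², and computes λ in its balanced one-way form nΣ[(μⱼ − μ̄)/σ]², consistent with Equation A5:

```python
import numpy as np

# Hypothetical population setup: p = 3 groups, n = 50 observations per group.
mu = np.array([1.0, 2.0, 4.0])
sigma2 = 4.0
p, n = len(mu), 50
N_tot = n * p

s2_mu = np.sum((mu - mu.mean()) ** 2) / (p - 1)           # Equation A2
rho2 = (p - 1) * s2_mu / ((p - 1) * s2_mu + p * sigma2)   # Equation A3
lam = n * np.sum((mu - mu.mean()) ** 2) / sigma2          # one-way lambda

# Equation A5: N_tot * rho^2 / (1 - rho^2) equals lambda.
assert np.isclose(N_tot * rho2 / (1 - rho2), lam)
```

The assertion passes for any choice of μ, σ², p, and n, since the relation is an algebraic identity rather than an approximation.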

Appendix B

The Distribution of the F Statistic for r²contrast

Assume the linear model as described in Equations 29 and 30. For any full column rank matrix A, define P_A = A(A′A)⁻¹A′ and Q_A = I − P_A, with I a conformable identity matrix. Define 1 to be a column of 1s. Partition X as X = [1 x₁ X₂]. x₁ contains replications of the contrast weights for the contrast being evaluated, so that the ith value in x₁ is the contrast weight for the group that yᵢ is in, and X₂ contains a set of columns that are the orthogonal complement of the contrast weights in x₁. Thus, for example, if x₁ contains contrast weights for evaluating linear trend, X₂ would contain quadratic and cubic contrast weights (or some full rank transformation of them). The regression weight vector β is partitioned accordingly as

$$ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}. \qquad (B1) $$

Define μ as a vector of the population means of the p groups, and c as the linear weights for the contrast of interest. In this case, the contrast of interest is

$$ \psi = c'\mu, \qquad (B2) $$

and because x₁ contains n replications of the elements of c, and E(η) contains n replications of the elements of μ, we have

$$ c'c = x_1'x_1/n \qquad (B3) $$

and

$$ \beta_1 = \frac{n\psi}{x_1'x_1}. \qquad (B4) $$

I first demonstrate that an F statistic may be constructed for r²contrast. Rosenthal et al. (2000) defined r²contrast as the squared partial correlation between y and x₁ with X₂ partialed out. This sample statistic can be computed as the following ratio of quadratic forms in y:

$$ r_{contrast}^2 = \frac{y'P_{x_1}y}{y'(I-P_{X_2}-P_1)y}. \qquad (B5) $$

Consider the statistic

$$ F_{contrast} = \frac{r_{contrast}^2}{(1-r_{contrast}^2)/(N_{tot}-p)} = \frac{y'P_{x_1}y}{y'(I-P_{x_1}-P_{X_2}-P_1)y/(N_{tot}-p)}. \qquad (B6) $$

This statistic is a ratio of two quadratic forms, of the general form

$$ F = \frac{y'Ay/a\sigma^2}{y'By/b\sigma^2}, \qquad (B7) $$

where a = 1 and b = N_tot − p. From Searle (1987, pp. 233–234), F_contrast has a noncentral F distribution with a and b degrees of freedom and noncentrality parameter

$$ \lambda = \beta'X'AX\beta/\sigma^2 \qquad (B8) $$

if Aσ² is idempotent, Bσ² is idempotent, AB = 0, and a and b are the ranks of A and B, respectively. These four properties are easily established by substitution and the fact that x₁, X₂, and 1 are pairwise orthogonal. This result implies that the noncentral F distribution has a noncentrality parameter equal to

$$ \lambda_{contrast} = \frac{\beta'X'P_{x_1}X\beta}{\sigma^2} = \frac{\beta_1^2 x_1'x_1}{\sigma^2}, \qquad (B9) $$

and the ranks of A and B are 1 and N_tot − p, respectively. Next, we derive the relationship between λ_contrast and the population equivalent of r²contrast.

We may write r²contrast as follows:

$$ r_{contrast}^2 = \frac{F_{contrast}}{F_{contrast}+p(n-1)} = \frac{n\hat{\psi}^2/MS_{error}(c'c)}{n\hat{\psi}^2/MS_{error}(c'c)+p(n-1)} = \frac{\hat{\psi}^2}{\hat{\psi}^2+(c'c)MS_{error}\,p(n-1)/n}. \qquad (B10) $$

We define ρ²contrast as

$$ \rho_{contrast}^2 = \mathrm{plim}(r_{contrast}^2) = \frac{\mathrm{plim}(\hat{\psi}^2)}{\mathrm{plim}(\hat{\psi}^2)+p(c'c)\lim_{n\to\infty}\left[\frac{n-1}{n}\right]\mathrm{plim}(MS_{error})} = \frac{\psi^2}{\psi^2+p(c'c)\sigma^2}. \qquad (B11) $$

Hence,

$$ \frac{\rho_{contrast}^2}{1-\rho_{contrast}^2} = \frac{\psi^2}{p(c'c)\sigma^2}, \qquad (B12) $$

which, after substitution of Equations B3 and B4, becomes

$$ \frac{\rho_{contrast}^2}{1-\rho_{contrast}^2} = \frac{(x_1'x_1)^2\beta_1^2/n^2}{(x_1'x_1)p\sigma^2/n} = \frac{(x_1'x_1)\beta_1^2}{np\sigma^2} = \frac{(x_1'x_1)\beta_1^2}{N_{tot}\,\sigma^2}. \qquad (B13) $$

Recalling the result of Equation B9 for λ_contrast, we have thus shown that

$$ \lambda_{contrast} = \frac{(x_1'x_1)\beta_1^2}{\sigma^2} = N_{tot}\,\frac{\rho_{contrast}^2}{1-\rho_{contrast}^2}. \qquad (B14) $$

Received February 23, 2001
Revision received December 1, 2003
Accepted January 8, 2004
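As with Appendix A, the Appendix B algebra admits a direct numerical check. The sketch below uses hypothetical values (the linear-trend contrast c, the means μ, and σ² are illustrative) and verifies that the noncentrality parameter of Equation B9 equals N_tot ρ²contrast/(1 − ρ²contrast) as claimed in Equation B14:

```python
import numpy as np

# Hypothetical example: p = 4 groups, n = 8 per group, linear-trend contrast.
c = np.array([-3.0, -1.0, 1.0, 3.0])
mu = np.array([10.0, 11.0, 13.0, 16.0])
sigma2 = 9.0
p, n = len(c), 8
N_tot = n * p

psi = c @ mu                      # Equation B2: the population contrast
ctc = c @ c
x1tx1 = n * ctc                   # Equation B3: c'c = x1'x1 / n
beta1 = n * psi / x1tx1           # Equation B4

lam_contrast = beta1 ** 2 * x1tx1 / sigma2        # Equation B9
rho2 = psi ** 2 / (psi ** 2 + p * ctc * sigma2)   # Equation B11

# Equation B14: lambda_contrast = N_tot * rho^2 / (1 - rho^2).
assert np.isclose(lam_contrast, N_tot * rho2 / (1 - rho2))
```

Again the equality is exact for any inputs, which is a convenient sanity check when implementing contrast confidence intervals from these formulas.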