
Analysis of Variance: The Fundamental Concepts

Steven F. Sawyer, PT, PhD

ABSTRACT: Analysis of variance (ANOVA) is a statistical test for detecting differences in group means when there is one parametric dependent variable and one or more independent variables. This article summarizes the fundamentals of ANOVA for an intended benefit of the clinician reader of scientific literature who does not possess expertise in statistics. The emphasis is on conceptually-based perspectives regarding the use and interpretation of ANOVA, with minimal coverage of the mathematical foundations. Computational examples are provided. Assumptions underlying ANOVA include parametric data measures, normally distributed data, similar group variances, and independence of subjects. However, normality and variance assumptions can often be violated with impunity if sample sizes are sufficiently large and there are equal numbers of subjects in each group. A statistically significant ANOVA is typically followed up with a multiple comparison procedure to identify which group means differ from each other. The article concludes with a discussion of effect size and the important distinction between statistical and clinical significance.

KEYWORDS: Analysis of Variance, Main Effects, Multiple Comparison Procedures

Analysis of variance (ANOVA) is a statistical tool used to detect differences between experimental group means. ANOVA is warranted in experimental designs with one dependent variable that is a continuous parametric numerical outcome measure, and multiple experimental groups within one or more independent (categorical) variables. In ANOVA terminology, independent variables are called factors, and groups within each factor are referred to as levels. The array of terms that are part and parcel of ANOVA can be intimidating to the uninitiated, such as: partitioning of variance, main effects, interactions, factors, sum of squares, mean squares, F scores, familywise alpha, multiple comparison procedures (or post hoc tests), statistical power, etc. How do these terms pertain to p values and statistical significance? What precisely is meant by a “statistically significant ANOVA”? How does analyzing variance result in an inferential decision about differences in group means? Can ANOVA be performed on non-parametric data? What are the virtues and potential pitfalls of ANOVA? These are the issues to be addressed in this primer on the use and interpretation of ANOVA. The intent is to provide the clinician reader, whose misspent youth did not include an enthusiastic reading of statistics textbooks, an understanding of the fundamentals of this widely used form of inferential statistical analysis.

ANOVA General Linear Models

ANOVA is based mathematically on linear regression and general linear models that quantify the relationship between the dependent variable and the independent variable(s) [1]. There are three different general linear models for ANOVA:

(i) Fixed effects model (Model 1) makes inferences that are specific and valid only to the populations and treatments of the study. For example, if three treatments involve three different doses of a drug, inferential conclusions can only be drawn for those specific drug doses. The levels within each factor are fixed as defined by the experimental design.

(ii) Random effects model (Model 2) makes inferences about levels of the factor that are not used in the study, such as a continuum of drug doses when the study only used three doses. This model pertains to random effects within levels, and makes inferences about a population’s random variation.

(iii) Mixed effects model (Model 3) contains both Fixed and Random effects.

In most types of orthopedic rehabilitation clinical research, the Fixed effects model is relevant since the statistical inferences being sought are fixed to the levels of the experimental design. For this reason, the Fixed effects model will be the focus of this article. Computer statistics programs typically default to the Fixed effects model for ANOVA analysis, but higher end programs can perform ANOVA with all three models.

Department of Rehabilitation , School of Allied Health Sciences, Texas Tech University Health Sciences Center, Lubbock, TX Address all correspondence and requests for reprints to: Steven F. Sawyer, PT, PhD, [email protected]

The Journal of Manual & Manipulative Therapy, volume 17, number 2 [E27] Analysis of Variance: The Fundamental Concepts

Assumptions of ANOVA

Assumptions for ANOVA pertain to the underlying mathematics of general linear models. Specifically, a data set should meet the following criteria before being subjected to ANOVA:

Parametric data: A parametric ANOVA, the topic of this article, requires parametric data (ratio or interval measures). There are non-parametric, one-factor versions of ANOVA for non-parametric ordinal (ranked) data, specifically the Kruskal-Wallis test for independent groups and the Friedman test for repeated measures analysis.

Normally distributed data within each group: ANOVA can be thought of as a way to infer whether the normal distribution curves of different data sets are best thought of as being from the same population or different populations (Figure 1). It follows that a fundamental assumption of parametric ANOVA is that each group of data (each level) be normally distributed. The Shapiro-Wilk test [2] is commonly used to test for normality for group sample sizes (N) less than 50; D’Agostino’s modification [3] is useful for larger samplings (N > 50).

A normal distribution curve can be described by whether it has symmetry about the mean and the appropriate width and height (peakedness). These attributes are defined statistically by “skewness” and “kurtosis”, respectively. A normal distribution curve will have skewness = 0 and kurtosis = 3. (Note that an alternative definition of kurtosis subtracts 3 from the final value so that a normal distribution will have kurtosis = 0. This “minus 3” kurtosis value is sometimes referred to as “excess kurtosis” to distinguish it from the value obtained with the standard kurtosis function. The kurtosis value calculated by many statistical programs is the “minus 3” variant but is referred to, somewhat misleadingly, as “kurtosis.”) Normality of a data set can be assessed with a z-test in reference to the standard error of skewness (estimated as √[6 / N]) and the standard error of kurtosis (estimated as √[24 / N]) [4]. A conservative alpha of 0.01 (z ≥

FIGURE 1. Graphical representation of statistical Null and Alternative hypotheses for ANOVA in the case of one dependent variable (change in ankle ROM pre/post manual therapy treatment, in units of degrees), and one independent variable with three levels (three different types of manual therapy treatments). For this fictitious data, the group (sample) means are 13, 14 and 18 degrees of increased ankle ROM for treatment type groups 1, 2 and 3, respectively (raw data are presented in Figure 2). The Null hypothesis is represented in the left graph, in which the population means for all three groups are assumed to be identical to each other (in spite of differences in the sample means calculated from the experimental data). Since in the Null hypothesis the subjects in the three groups are considered to compose a single population, by definition the population means of each group are equal to each other, and are equal to the Grand mean (mean for all data scores in the three groups). The corresponding normal distribution curves are identical and precisely overlap along the X-axis. The Alternative hypothesis is shown in the right graph, in which differences in group sample means are inferred to represent true differences in group population means. These normal distribution curves do not overlap along the X-axis because each group of subjects is considered to be a distinct population with respect to ankle ROM, created from the original single population that experienced different efficacies of the three treatments. Graph is patterned after Wilkinson et al. [11].

[Figure 1 graphs. Left panel: "ANOVA Null Hypothesis: Identical Normal distribution curve". Right panel: "ANOVA Alternative Hypothesis: Different Normal distribution curve". Y-axis: Probability Density Function. X-axis: Increased elbow ROM (degree).]


2.56) is appropriate, due to the overly sensitive nature of these tests, especially for large sample sizes (>100) [4]. As a computational example, for N = 20, the estimation of standard error of skewness = √[6 / 20] = 0.55, and any skewness value greater than ±2.56 x 0.55 = ±1.41 would indicate non-normality. Perhaps the best “test” is what always should be done: examine a histogram of the distribution of the data. In practice, any distribution that resembles a bell-shaped curve will be “normal enough” to pass normality tests, especially if the sample size is adequate.

Homogeneity of variance within each group: Referring again to the notion that ANOVA compares normal distribution curves of data sets, these curves need to be similar to each other in shape and width for the comparison to be valid. In other words, the amount of data dispersion (variance) needs to be similar between groups. Two commonly invoked tests of homogeneity of variance are by Levene [5] and Brown & Forsythe [6].

Independent Observations: A general assumption of parametric analysis is that the value of each observation for each subject is independent of (i.e., not related to or influenced by) the value of any other observation. For independent groups designs, this issue is addressed with random sampling, random assignment to groups, and experimental control of extraneous variables. This assumption is an inherent concern for repeated measures designs, in which an assumption of sphericity comes into play. When subjects are exposed to all levels of an independent variable (e.g., all treatments), it is conceivable that the effects of a treatment can persist and affect the response to subsequent treatments. For example, if a treatment effect for one level has a long half-time (analogous to a drug effect) and there is inadequate “wash out” time between exposure to different levels (treatments), there will be a carryover effect. A well designed and executed cross-over experimental design can mitigate carryover effects. Mauchly’s test of sphericity is commonly employed to test the assumption of independence in repeated measures designs. If the Mauchly test is statistically significant, corrections to the F score calculation are warranted. The two most commonly used correction methods are the Greenhouse-Geisser and Huynh-Feldt, which calculate a descriptive statistic called epsilon, which is a measure of the extent to which sphericity has been violated. The range of values for epsilon is 1 (no sphericity violation) to a lower boundary of 1 / (m - 1), where m = number of levels. For example, with three groups, the range would be 1 to 0.50. The closer epsilon is to the lower boundary, the greater the degree of violation. There are three options for adjusting the ANOVA to account for the sphericity violation, all of which involve modifying degrees of freedom: use the lower boundary epsilon, which is the most conservative approach (least powerful) and will generate the largest p value, or use either the Greenhouse-Geisser epsilon or the Huynh-Feldt epsilon (most powerful) [statistical power is the ability of an inferential test to detect a difference that actually exists, i.e., a true positive].

Most commercially available statistics programs perform normality, homogeneity of variance and sphericity tests. Determination of the parametric nature of the data and soundness of the experimental design is the responsibility of the investigator, reviewers and critical readers of the literature.

Robustness of ANOVA to Violations of Normality and Variance Assumptions

ANOVA tests can handle moderate violations of normality and equal variance if there is a large enough sample size and a balanced design [7]. As per the central limit theorem, the distribution of sample means approximates normality even with population distributions that are grossly skewed and non-normal, so long as the sample size of each group is large enough. There is no fixed definition of “large enough”, but a rule of thumb is N ≥ 30 [8]. Thus, the mathematical validity of ANOVA is said to be “robust” in the face of violations of normality assumptions if there is an adequate sample size. ANOVA is more sensitive to violations of the homogeneity of variance assumption, but this is mitigated if sample sizes of factors and levels are equal or nearly so [9,10]. If normality and homogeneity of variance violations are problematic, there are three options: (i) Mathematically transform (log, arcsin, etc.) the data to best mitigate the violation, with the cost of cognitive fog in understanding the meaning of the ANOVA results (e.g., “A statistically significant main effect was obtained for the arcsin transformation of degrees of ankle range of motion”). (ii) Use one of the non-parametric ANOVAs mentioned above, but at the cost of reduced power and being limited to one-factor analysis. (iii) Identify outliers in the data set using formal statistical criteria (not discussed here). Use caution in deleting outliers from the data set; such decisions need to be justified and explained in research reports. Removal of outliers will reduce deviations from normality and homogeneity of variance.

If You Understand t-Tests, You Already Know A Lot About ANOVA

As a starting point, the reader should understand that the familiar t-test is an ANOVA in abbreviated form. A t-test is used to infer on statistical grounds whether there are differences between group means for an experimental design with (i) one parametric dependent variable and (ii) one independent variable with two levels, i.e., there is one outcome measure and two groups. In clinical research, levels often correspond to different treatment groups; the term “level” does not imply any ordering of the groups.

The Null statistical hypothesis for a t-test is $H_0: \mu_1 = \mu_2$, that is, the population means of the two groups are the same. Note that we are dealing with population means, which are almost always unknown and unknowable in clinical research. If the Null hypothesis involved sample means, there would be nothing to infer, since descriptive analysis provides this information. However, with inferential analysis using t-tests and ANOVA, the aim is to infer, without access to “the truth”, if the group population means differ from each other.

The Alternative hypothesis, which comes into play if the Null hypothesis is rejected, asserts that the group population means differ. The Null hypothesis is rejected when the p value yielded by the t-test is less than alpha. Alpha is the predetermined upper limit risk for committing a Type 1 error, which is the statistical false positive of incorrectly rejecting the Null hypothesis and inferring that the group means differ when in fact the groups are from a single population. By convention, alpha is typically set to 0.05. The p value generated by the t-test statistic is based on the experimental data, and represents the probability of committing a Type 1 error if the Null hypothesis is rejected. When p is less than alpha, there is a statistically significant result, i.e., the values in the two groups are inferred to differ from each other and to represent separate populations. The logic of statistical inference is analogous to a jury trial: at the outset of the trial (inferential analysis), the group data are presumed to be innocent of having different population means (Null hypothesis) unless the differences in group means in the sampled data are sufficiently compelling to meet the standard of “beyond a reasonable doubt” (p less than alpha), in which case a guilty verdict is rendered (reject Null hypothesis and accept Alternative hypothesis = statistical significance).

The test statistic for a t-test is the t score. In conceptual terms, the calculation of a t score for independent groups (i.e., not repeated measures) is as follows:

t = statistical signal / statistical noise
t = treatment effect / unexplained variance (“error variance”)
t = differences between sample means of the two groups / within-group variance

The difference in group means represents the statistical signal since it is presumed to result from treatment effects of the different levels of the independent variable. The within-group variance is considered to be statistical noise and an “error” term because it is not explained by the influence of the independent variable on the dependent variable. The particulars of how the t score is calculated depend on the experimental design (independent groups vs repeated measures) and whether variance between groups is equivalent; the reader is referred to any number of statistics books for details about the formulae. The t score is converted into a p value based on the magnitude of the t score (larger t scores lead to smaller p values) and the sample size (which relates to degrees of freedom).

ANOVA Null Hypothesis and Alternative Hypothesis

ANOVA is applicable when the aim is to infer differences in group values when there is one dependent variable and more than two groups, such as one independent variable with three or more levels, or when there are two or more independent variables. Since an independent variable is called a “factor”, ANOVAs are described in terms of the number of factors; if there are two independent variables, it is a two-factor ANOVA. In the simpler case of a one-factor ANOVA, the Null hypothesis asserts that the population means for each level (group) of the independent variable are equal. Let’s use as an example a fictitious experiment with one dependent variable (pre/post changes in ankle range of motion in subjects who received one of three types of manual therapy treatment after surgical repair of a talus fracture). This constitutes a one-factor ANOVA with three levels (the three different types of treatment). The Null hypothesis is $H_0: \mu_1 = \mu_2 = \mu_3$. The Alternative hypothesis is that at least two of the group means differ. Figure 1 provides a graphical presentation of these ANOVA statistical hypotheses: (i) the Null hypothesis (left graph) asserts that the normal distribution curves of data for the three groups are identical in shape and position and therefore precisely overlap, whereas (ii) the Alternative hypothesis (right graph) asserts that these normal distribution curves are best described by the distribution indicated by the sample means, which represent an experimentally-derived estimate of the population means [11].

The Mechanics of Calculating a One-factor ANOVA

ANOVA evaluates differences in group means in a round-about fashion, and involves the “partitioning of variance” from calculations of “Sum of Squares” and “Mean Squares.” Three metrics are used in calculating the ANOVA test statistic, which is called the F score (named after R.A. Fisher, the developer of ANOVA): (i) Grand Mean, which is the mean of all scores in all groups; (ii) Sum of Squares, which are of two kinds, the sum of all squared differences between group means and the Grand Mean (between-groups Sum of Squares) and the sum of squared differences between individual data scores and their respective group mean (within-groups Sum of Squares), and (iii) Mean Squares, also of two kinds (between-groups Mean Squares, within-groups Mean Squares), which are the average deviations of individual scores from their respective mean, calculated by dividing Sum of Squares by their appropriate degrees of freedom.

A key point to appreciate about ANOVA is that the data set variance is partitioned into statistical signal and statistical noise components to generate the F score. The F score for independent groups is calculated as:

F = statistical signal / statistical noise
F = treatment effect / unexplained variance (“error variance”)
$F = MS_{\text{Between Groups}} / MS_{\text{Within Groups (Error)}}$

Note that the statistical signal, the $MS_{\text{Between Groups}}$ term, is an indirect measure of differences in group means. The $MS_{\text{Within Groups (Error)}}$ term is considered to represent statistical noise/error since this variance is not explained by the effect of the independent variable on the dependent variable. Here is the gist of the issue: as group means increasingly diverge from each other, there is increasingly more variance for between-group scores in relation to the Grand Mean, quantified as $SS_{\text{Between Groups}}$, leading to a larger $MS_{\text{Between Groups}}$ term and a larger F score. Conversely, as there is more variance in within-group scores, quantified as $SS_{\text{Within Groups (Error)}}$, the $MS_{\text{Within Groups (Error)}}$ term will increase, leading to a smaller F score. Thus, for independent groups, large F scores arise from large differences between group


means and/or small variances within groups. Larger F scores equate to lower p values, with the p value also influenced by the sample size and number of groups, each of which constitutes separate types of “degrees of freedom.”

ANOVA calculations are now the domain of computer software, but there is illustrative and heuristic value in manually performing the arithmetic calculation of the F score to garner insight into how analysis of data set variance generates a statistical inference about differences in group means. A numerical example is provided in Figure 2, in which the data set graphed in Figure 1 is listed and subjected to ANOVA, yielding a calculated F score and corresponding p value.

Mathematical Equivalence of t-tests and ANOVA: t-tests are a Special Case of ANOVA

Let’s briefly return to the notion that a t-test is a simplified version of ANOVA that is specific to the case of one independent variable with two groups. If we analyze the data in Figure 2 for the Type 1 treatment vs. Type 3 treatment group data (disregarding the Type 2 treatment group data to reduce the analysis to two groups), the t score for independent groups is 5.0 with a p value of 0.0025 (calculations not shown). For the same data assessed with ANOVA, the F score is 25.0 with a p value of 0.0025. The t-test and ANOVA generate identical p values. The mathematical relation between the two test statistics is: $t^2 = F$.

Repeated Measures ANOVA: Different Error Term, Greater Statistical Power

The experimental designs emphasized thus far entail independent groups, in which each subject is “exposed” to only one level of an independent variable. In

FIGURE 2. The mechanics of calculating an F score for a one-factor ANOVA with independent groups, by partitioning the data set variance into Sum of Squares and Mean Squares, are shown below. This fictitious data set lists increased ankle range of motion pre/post for three different types of manual therapy treatments. For the sake of clarity and ease of calculation, a data set with an inappropriately small sample size is used.

Subject Gender    Manual Therapy      Manual Therapy      Manual Therapy
                  treatment Type 1    treatment Type 2    treatment Type 3

Male              14                  16                  20
Male              14                  14                  18
Female            11                  13                  17
Female            13                  13                  17

Group Means       13                  14                  18
Grand Mean        15

In the following, SS = Sum of Squares; MS = Mean Squares; df = degrees of freedom.

$SS_{\text{Total}} = SS_{\text{Between Groups}} + SS_{\text{Within Groups (Error)}}$, and is calculated by summing the squares of differences between each data value vs. the Grand Mean. For this data set with a Grand Mean of 15:

$SS_{\text{Total}} = (14-15)^2 + (14-15)^2 + (11-15)^2 + (13-15)^2 + (16-15)^2 + (14-15)^2 + (13-15)^2 + (13-15)^2 + (20-15)^2 + (18-15)^2 + (17-15)^2 + (17-15)^2 = 74$

$SS_{\text{Within Groups (Error)}} = SS_{\text{MT treatment Type 1 (Error)}} + SS_{\text{MT treatment Type 2 (Error)}} + SS_{\text{MT treatment Type 3 (Error)}}$, in which the sum of squares within each group is calculated in reference to the group’s mean:

$SS_{\text{MT treatment Type 1 (Error)}} = (14-13)^2 + (14-13)^2 + (11-13)^2 + (13-13)^2 = 6$
$SS_{\text{MT treatment Type 2 (Error)}} = (16-14)^2 + (14-14)^2 + (13-14)^2 + (13-14)^2 = 6$
$SS_{\text{MT treatment Type 3 (Error)}} = (20-18)^2 + (18-18)^2 + (17-18)^2 + (17-18)^2 = 6$

$SS_{\text{Within Groups (Error)}} = 6 + 6 + 6 = 18$. By subtraction, $SS_{\text{Between Groups}} = 74 - 18 = 56$.

df refers to the number of independent measurements used in calculating a Sum of Squares.

$df_{\text{Between Groups}}$ = (# of groups - 1) = (3 - 1) = 2

$df_{\text{Within Groups (Error)}}$ = (N - # of groups) = (12 - 3) = 9

The ANOVA test statistic, the F score, is calculated from Mean Squares (SS/df):

$F = MS_{\text{Between Groups}} / MS_{\text{Within Groups (Error)}}$

$MS_{\text{Between Groups}} = SS_{\text{Between Groups}} / df_{\text{Between Groups}} = 56 / 2 = 28$

$MS_{\text{Within Groups (Error)}} = SS_{\text{Within Groups (Error)}} / df_{\text{Within Groups (Error)}} = 18 / 9 = 2$

So, $F = 28 / 2 = 14$

With $df_{\text{Between Groups}} = 2$ and $df_{\text{Within Groups (Error)}} = 9$, this F score translates into p = 0.0017, a statistically significant result for alpha = 0.05.
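The arithmetic of Figure 2 can be retraced in a few lines of code. The following is a minimal Python sketch, not a substitute for a statistics package: the function names (`one_way_anova`, `p_value_two_df_between`, `pooled_t`) are illustrative choices rather than part of any library, and the p value uses a closed-form expression that holds only when the between-groups degrees of freedom equal 2, as they do for three groups. The last lines rerun the analysis on the Type 1 and Type 3 groups alone, to show that the pooled-variance t score squared matches the two-group F score.

```python
import math

# Figure 2 data: increase in ankle ROM (degrees) for each treatment group.
groups = {
    "Type 1": [14, 14, 11, 13],
    "Type 2": [16, 14, 13, 13],
    "Type 3": [20, 18, 17, 17],
}


def mean(xs):
    return sum(xs) / len(xs)


def one_way_anova(samples):
    """Partition variance for independent groups and return the F score pieces."""
    scores = [x for g in samples for x in g]
    grand_mean = mean(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    ss_within = sum((x - mean(g)) ** 2 for g in samples for x in g)
    ss_between = ss_total - ss_within          # by subtraction, as in Figure 2
    df_between = len(samples) - 1              # number of groups - 1
    df_within = len(scores) - len(samples)     # N - number of groups
    f = (ss_between / df_between) / (ss_within / df_within)
    return {"ss_total": ss_total, "ss_within": ss_within,
            "ss_between": ss_between, "df_between": df_between,
            "df_within": df_within, "f": f}


def p_value_two_df_between(f, df_within):
    """Upper-tail p value of F; this closed form is valid ONLY when df_between == 2."""
    return (1 + 2 * f / df_within) ** (-df_within / 2)


def pooled_t(a, b):
    """Independent-groups t score using the pooled within-group variance."""
    ss = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


three = one_way_anova(list(groups.values()))
print(three)
print(round(p_value_two_df_between(three["f"], three["df_within"]), 4))

# Two-group special case: Type 1 vs Type 3 only.
two = one_way_anova([groups["Type 1"], groups["Type 3"]])
t = pooled_t(groups["Type 1"], groups["Type 3"])
print(t, t ** 2, two["f"])
```

For routine work a statistics package would normally be used instead; the sketch only makes the partitioning of variance visible step by step.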

the data set of Figure 2, this would involve each subject receiving only one of the three different treatments. If a subject is exposed to all levels of an independent variable, the mechanics of the ANOVA are altered to take into account that each subject serves as their own experimental control. Whereas the term for statistical signal, $MS_{\text{Between Groups}}$, is unchanged, there is a new statistical noise term called $MS_{\text{Within Subjects (Error)}}$ that pertains to variance within each subject across all levels of the independent variable instead of between all subjects within one level. Since there is typically less variation within subjects than between subjects, the statistical error term is typically smaller in repeated measures designs. A smaller $MS_{\text{Within Subjects (Error)}}$ value leads to a larger F value and a smaller p value. As a result, repeated measures ANOVAs typically have greater statistical power than independent groups ANOVAs.

Factorial ANOVA: Main Effects and Interactions

An advantage of ANOVA is its ability to analyze an experimental design with multiple independent variables. When an ANOVA has two or more independent variables it is referred to as a factorial ANOVA, in contrast to the one-factor ANOVAs discussed thus far. This is efficient experimentally, because the effects of multiple independent variables on a dependent variable are tested on one cohort of subjects. Furthermore, factorial ANOVA permits, and requires, an evaluation of whether there is an interplay between different levels of the independent variables, which is called an interaction.

Definitions of terminology that is unique to factorial ANOVA are warranted: (i) Main effect is the effect of an independent variable (a factor) on a dependent variable, determined separately from the effects of other independent variables. A main effect is a one-factor ANOVA that is performed on a factor that disregards the effects of other factors. In a two-factor ANOVA, there are two main effects, one for each independent variable; a three-factor ANOVA has three main effects, and so on. (ii) Interaction describes an interplay between independent variables such that different levels of the independent variables have non-additive effects on the dependent variable. In formal terms, there is an interaction between two factors when the dependent variable responses at levels of one factor differ from those produced at levels of the other factor(s). Interactions can be easily identified in graphs of group means. For example, again referring to the data set from Figure 2, let us now consider the effect of subject gender as a second independent variable. This would be a two-factor ANOVA: one factor is the sex of subjects, called Gender, with two levels; the second factor is the type of manual therapy treatment, called Treatment, with three levels. A shorthand description of this design is 2x3 ANOVA (two factors with two and three levels, respectively). For this two-factor ANOVA, there are three Null hypotheses: (i) Main Effect for the Gender factor: Are there differences in the response (ankle range of motion) for males vs. females to manual therapy treatment (combining data for the three levels of the Treatment factor with respect to the two Gender factor levels)? (ii) Main Effect for the Treatment factor: Are there differences in the response for subjects in the three levels of the Treatment factor (combining data for males and females in the Gender factor with respect to the three Treatment factor levels)? (iii) Interaction: Are there differences due to neither the Gender nor the Treatment factor alone but to the combination of these factors? With respect to analysis of interactions, Figure 3 shows a table of group means for all levels of the two independent variables, based on data from Figure 2. Note that the two independent variables are graphed in relation to the dependent variable. The two lines in the left graph are parallel, indicating the absence of an interaction between the levels of the two factors. An interaction would exist if the graphs were not parallel, such as in the right graph in which group means for males and females on the Type 2 treatment were switched for illustrative purposes. If the lines deviate from parallel to a sufficient degree, the interaction will be statistically significant. In this case with two factors, there is only one interaction to be evaluated. With three or more independent variables, there are multiple interactions that need to be considered. A statistically significant interaction complicates the interpretation of the Main Effects, since the factors are not independent of each other in their effects on the dependent variable. Interactions should be examined before Main Effects. If interactions are not statistically significant, then Main Effects can be easily evaluated as a series of one-factor ANOVAs.

So There is a Statistically Significant ANOVA - Now What? Multiple Comparison Procedures

If an ANOVA does not yield statistical significance on any main effects or interactions, the Null hypothesis (hypotheses) is (are) accepted, meaning that the different levels of independent variables did not have any differential effects on the dependent variable. The inferential statistical work is done (but see next section), unless covariates are suspected, possibly warranting analysis of covariance (ANCOVA), which is beyond the scope of this article.

When statistical significance is obtained in an ANOVA, additional statistical tests are necessary to determine which of the group means differ from each other. These follow-up tests are referred to as multiple comparison procedures (MCPs) or post hoc tests. MCPs involve multiple pair-wise comparisons (or contrasts) in a fashion designed to maintain alpha for the family of comparisons to a specified level, typically 0.05. This is referred to as the familywise alpha. There are two general options for MCP tests: either perform multiple t-tests that require “manual” adjustment of the alpha for each pairwise test to maintain a familywise alpha of 0.05, or use a test such as the Tukey HSD (see below) that has built-in protection from alpha inflation. Multiple t-tests have their place, especially when only a subset of all possible pairwise comparisons are to be performed, but the special purpose MCPs are preferable when all pairwise comparisons are assessed.


FIGURE 3. Factorial ANOVA interactions, which are assessed with a table and a graph of group means. Group means are based on data presented in Figure 2, and represent a 3x2 two-factor (Treatment x Gender) ANOVA with independent groups. In reference to the j columns and k rows indicated in the table below, the Null hypothesis for this interaction is:

$\mu_{j1,k1} - \mu_{j1,k2} = \mu_{j2,k1} - \mu_{j2,k2} = \mu_{j3,k1} - \mu_{j3,k2}$

The graph below left shows the group means of the two independent variables in relation to the dependent variable. The parallel lines indicate that males and females displayed similar changes in ankle ROM for the three types of treatment, so there was no interaction between the different levels of the independent variables. Consider the situation in which the group means for males and females on treatment type 2 are reversed. These altered group means are shown in the graph below right. The graphed lines are not parallel, indicating the presence of an interaction. In other words, the relative efficacies of the three treatments are different for males and females; whether this meets the statistical level of an interaction is determined by ANOVA (p less than alpha).

| Factor B: Gender | Factor A: Treatment Type 1 (Level j = 1) | Factor A: Treatment Type 2 (Level j = 2) | Factor A: Treatment Type 3 (Level j = 3) | Factor B Main Effect (row means) |
|---|---|---|---|---|
| Male (Level k = 1) | 14 | 15 | 19 | 16 |
| Female (Level k = 2) | 12 | 13 | 17 | 14 |
| Factor A Main Effect (column means) | 13 | 14 | 18 | |

[Graphs, left and right panels: group means for males and females plotted against Manual Therapy Treatment (Type 1, Type 2, Type 3) on the x-axis; y-axis: Increase in ankle ROM.]
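The main effects and the interaction pattern described for Figure 3 can be checked directly from the table of group means. The following is a minimal Python sketch, not part of the original article; the function names are illustrative, and the "altered" means assume that only the Type 2 values for males and females are swapped, as the caption describes. It only inspects whether the lines are parallel in the sample means; whether a departure from parallel is statistically significant is still decided by the ANOVA itself.

```python
# Cell means from the Figure 3 table: one list per Gender level (k),
# ordered by Treatment Type 1, 2, 3 (j). All names here are illustrative.
FIGURE_3_MEANS = {"Male": [14, 15, 19], "Female": [12, 13, 17]}

# Assumed altered data set: Male and Female means swapped on Treatment Type 2.
ALTERED_MEANS = {"Male": [14, 13, 19], "Female": [12, 15, 17]}


def row_means(cells):
    """Factor B (Gender) main effect: mean of each row of cell means."""
    return {level: sum(vals) / len(vals) for level, vals in cells.items()}


def column_means(cells):
    """Factor A (Treatment) main effect: mean of each column of cell means."""
    return [sum(col) / len(col) for col in zip(*cells.values())]


def gender_differences(cells):
    """Male minus Female mean at each treatment level.

    These are the three differences that the interaction Null hypothesis
    states are equal.
    """
    return [m - f for m, f in zip(cells["Male"], cells["Female"])]


def lines_are_parallel(cells):
    """True when the Male-Female difference is identical at every level."""
    return len(set(gender_differences(cells))) == 1


if __name__ == "__main__":
    print("Row means:", row_means(FIGURE_3_MEANS))
    print("Column means:", column_means(FIGURE_3_MEANS))
    for label, cells in (("original", FIGURE_3_MEANS), ("altered", ALTERED_MEANS)):
        print(label, gender_differences(cells), "parallel:", lines_are_parallel(cells))
```

With the original means the male-female difference is 2 at every treatment level (parallel lines); with the altered means the difference changes sign on Type 2, which is the non-parallel pattern shown in the right-hand graph.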

Using the simple case of a statistically significant one-factor ANOVA, t-tests can be used for post hoc evaluation with the aim of identifying which levels differ from each other. However, with multiple t-tests there is a need to adjust alpha for each t-test in such a way as to maintain the familywise alpha at 0.05. If all possible pairwise comparisons are performed, there will be a rapid (quadratic) increase in the number of t-tests as the number of levels increases, as defined by $C = m(m - 1)/2$, where C = number of pairwise comparisons, and m = number of levels in a factor. For example, there are three pairwise comparisons for three levels; six comparisons for four levels; ten comparisons for five levels, and so forth. Familywise alpha must be held at 0.05 across these multiple comparisons to keep the risk of Type 1 errors to no more than 5%. This is commonly accomplished with the Bonferroni (or Dunn) adjustment, in which alpha for each post hoc t-test is adjusted by dividing the familywise alpha (0.05) by the number of pairwise comparisons:

$$\alpha_{\text{Multiple t-tests}} = \alpha_{\text{Familywise}} / C$$
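The counting rule and the Bonferroni adjustment above are simple enough to verify numerically. The sketch below is illustrative only (the function names are not from the article). It also computes the familywise Type 1 error rate as 1 − (1 − alpha)^C, which assumes the C comparisons are independent; that assumption is a simplification for pairwise tests drawn from the same groups.

```python
def pairwise_comparisons(m):
    """Number of pairwise comparisons among m levels: C = m(m - 1) / 2."""
    return m * (m - 1) // 2


def bonferroni_alpha(familywise_alpha, c):
    """Per-test alpha after the Bonferroni (Dunn) adjustment."""
    return familywise_alpha / c


def familywise_alpha(per_test_alpha, c):
    """Chance of at least one Type 1 error across c tests.

    Assumes the c tests are independent (an idealization).
    """
    return 1 - (1 - per_test_alpha) ** c


if __name__ == "__main__":
    for m in (3, 4, 5):
        c = pairwise_comparisons(m)
        adjusted = bonferroni_alpha(0.05, c)
        print(
            f"{m} levels: {c} comparisons, "
            f"Bonferroni alpha = {adjusted:.4f}, "
            f"resulting familywise alpha = {familywise_alpha(adjusted, c):.4f}"
        )
    # Without any adjustment, two tests at 0.05 give 0.0975, not 0.10.
    print(f"{familywise_alpha(0.05, 2):.4f}")
```

In each case the Bonferroni-adjusted tests yield a familywise alpha slightly below 0.05, which is the sense in which the adjustment is conservative.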


If there are two pairwise comparisons, alpha for each t-test is set to 0.05/2 = 0.025; for three comparisons, alpha is 0.05/3 = 0.0167, and so on. Any pairwise t-test with a p value less than the adjusted alpha would be considered statistically significant. The trade-off for preventing familywise alpha inflation is that as the number of comparisons increases, it becomes incrementally more difficult to attain statistical significance due to the lower alpha. Furthermore, the inflation of familywise alpha with multiple t-tests is not additive. As a result, Bonferroni adjustments overcompensate the alpha adjustment, making this the most conservative (least powerful) of all MCPs. For example, running two t-tests, each with alpha set to 0.05, does not double familywise alpha to 0.10; it increases it to only 0.0975. The effect of multiple t-tests on familywise alpha and Type 1 error rate is defined by the following formula:

$$\alpha_{\text{Familywise}} = 1 - (1 - \alpha_{\text{Multiple t-tests}})^{C}$$

The overcorrection by the Bonferroni technique becomes more prominent with many pairwise comparisons: executing 20 t-tests, each with an alpha of 0.05, does not yield a familywise alpha of 20 x 0.05 = 1.00 (i.e., 100% chance of Type 1 error); the value is actually 0.64. There are modifications of the Bonferroni adjustment developed by Šidák [12] to more accurately reflect the inflation of familywise alpha that result in larger adjusted alpha levels and therefore increased statistical power, but the effects are slight and rarely convert a marginally non-significant pairwise comparison into a statistically significant one. For example, with three pairwise comparisons, the Bonferroni adjusted alpha of 0.0167 is increased by only 0.0003 to 0.0170 with the Šidák adjustment.

The sequential alpha adjustment methods for multiple post hoc t-tests by Holm [13] and Hochberg [14] provide increased power while still maintaining control of the familywise alpha. These techniques permit the assignment of statistical significance in certain situations for which p values are less than 0.05 but do not meet the Bonferroni criterion for significance. The sequential approaches by Holm [13] and Hochberg [14] are called step-down and step-up procedures, respectively. In Hochberg’s step-up procedure with C pairwise comparisons, the t-test p values are evaluated sequentially in descending order, with $p_1$ the lowest value and $p_C$ the highest. If $p_C$ is less than 0.05, all the p values are statistically significant. If $p_C$ is greater than 0.05, that evaluation is non-significant, and the next largest p value, $p_{C-1}$, is evaluated with a Bonferroni adjusted alpha of 0.05/2 = 0.025. If $p_{C-1}$ is significant, then all remaining p values are significant. Each sequential evaluation leads to an alpha adjustment based on the number of previous evaluations, not on the entire set of possible evaluations, thereby yielding increased statistical power compared to the Bonferroni method. For example, if three p values are 0.07, 0.02 and 0.015, Hochberg’s method evaluates $p_3$ = 0.07 vs. alpha = 0.05/1 = 0.05 (non-significant); then $p_2$ = 0.02 vs. alpha = 0.05/2 = 0.025 (significant); and then $p_1$ = 0.015 vs. alpha = 0.05/3 = 0.0167 (significant). Holm’s method performs the inverse sequence and alpha adjustments, such that the lowest p value is evaluated first with a fully adjusted alpha. In this case: $p_1$ = 0.015 vs. alpha = 0.05/3 = 0.0167 (significant); then $p_2$ = 0.020 vs. alpha = 0.05/2 = 0.025 (significant); and then $p_3$ = 0.070 vs. alpha = 0.05/1 = 0.05 (non-significant). Once Holm’s method encounters non-significance, sequential evaluations end, whereas Hochberg’s method continues testing. For these three p values, the Bonferroni adjustment would find p = 0.015 significant but p = 0.02 to be non-significant. As can be seen, the methods of Hochberg and Holm are less conservative and more powerful than Bonferroni’s adjustment. Further, Hochberg’s method is uniformly more powerful than Holm’s method [15]. For example, if there are three pairwise comparisons with p = 0.045, 0.04 and 0.03, all would be significant with Hochberg’s method but none would be with Holm’s method (or Bonferroni’s).

There are many types of MCPs distinct from the t-test approaches described above [16]. These tests have “built-in” familywise alpha protection that does not require “manual” adjustment of alpha. Most of these MCPs calculate a so-called q value for each comparison that takes into account group mean differences, group variances, and group sample sizes in a fashion similar but not identical to the calculation of t. This q value is compared to a critical value generated from a q distribution (a distribution of differences in sample means). Protection from familywise alpha inflation is in the form of a multiplier applied to the critical value. The multiplier increases as the number of comparisons increases, thereby requiring greater differences between group means to attain statistical significance as the number of comparisons is increased.

Some MCPs are better than others at balancing statistical power and Type 1 errors. By general consensus amongst statisticians, the Fisher Least Significant Difference (LSD) test and the Duncan’s Multiple Range Test are considered to be overly powerful, with too high a likelihood of Type 1 errors (false positives). The Scheffé test is considered to be overly conservative, with too high a likelihood of Type 2 errors (false negatives), but is applicable when group sample sizes are markedly unequal [17]. The Tukey Honestly Significant Difference (HSD) is favored by many statisticians for its balance of statistical power and protection from Type 1 errors. It is worth noting that the power advantage of the Tukey HSD test obtains only when all possible pairwise comparisons are performed. The Student-Newman-Keuls (SNK) test statistic is computed identically to the Tukey HSD; however, the critical value is determined differently using a step-wise approach, somewhat like the Holm method described above for t-tests. This makes the SNK test slightly more powerful than the Tukey HSD test. However, an advantage of the Tukey HSD test is that a variant called the Tukey-Kramer HSD test can be used with unbalanced sample size designs, unlike the SNK test. The Dunnett test is useful when planned pairwise tests are restricted to one group (e.g., a control group) being compared to all other groups (e.g., treatment groups).

In summary, (1) the Tukey HSD and Student-Newman-Keuls tests are recommended when performing all pairwise tests; (2) the Hochberg or Holm sequential alpha adjustments enhance the power of multiple post hoc t-tests while


maintaining control of familywise alpha; and (3) the Dunnett test is preferred when comparing one group to all other groups.

Break from Tradition: Skip One-Factor ANOVA and Proceed Directly to a MCP

Thus far, the conventional approach to ANOVA and MCPs has been presented, namely, run an ANOVA and if it is not significant, proceed no further; if the ANOVA is significant, then run MCPs to determine which group means differ. However, it has long been held by some statisticians that in certain circumstances ANOVA can be skipped and that an appropriate MCP is the only necessary inferential test. To quote an influential paper by Wilkinson et al. [18], the ANOVA-followed-by-MCP approach “is usually wrong for several reasons. First, pairwise methods such as Tukey’s honestly significant difference procedure were designed to control a familywise error rate based on the sample size and number of comparisons. Preceding them with an omnibus F test in a stagewise testing procedure defeats this design, making it unnecessarily conservative.” Related to this perspective is the fact that inferential discrepancies are possible between ANOVA and MCPs, in which one is statistically significant and the other is not. This can occur when p values are near the boundary of alpha. Each MCP has slightly different criteria for statistical significance (based on either the t or q distribution), and all differ slightly from the criteria of F scores (based on the F distribution). An argument has also been put forth with respect to performing pre-planned MCPs without the need for a statistically significant ANOVA in clinical trials [19]. Nonetheless, the convention remains to perform ANOVA and then MCPs, but MCPs alone are a statistically valid option. ANOVA is especially warranted when there are multiple factors, due to the ability of ANOVA to detect interactions.

Wilkinson et al. [18] also remind researchers that it is rarely necessary to perform all pairwise comparisons. Selected pre-planned comparisons that are driven by the research hypothesis, and not a subcortical reflex to perform every conceivable pairwise comparison, will reduce the number of extraneous pairwise comparisons and false positives, and have the added benefit of increasing statistical power.

ANOVA Effect Size

Effect size is a unitless measure of the magnitude of treatment effects [20]. For ANOVA, there are two categories of effect size indices: (i) those based on proportions of sum of squares ($\eta^2$, partial $\eta^2$, $\omega^2$), and (ii) those based on a standardized difference between group means (such as Cohen’s d) [21,22]. The latter type of effect size index is useful for power analysis, and will be discussed briefly in the next section. To an ever increasing degree, peer review journals are requiring the presentation of effect sizes with descriptive summaries of data.

There are three commonly used effect size indices that are based on proportions of the familiar sum of squares values that form the foundation of ANOVA computations. The three indices are called eta squared ($\eta^2$), partial eta squared (partial $\eta^2$), and omega squared ($\omega^2$). These indices range in value from 0 (no effect) to 1 (maximal effect) because they are proportions of variance. These indices typically yield different values for effect size.

Eta squared ($\eta^2$) is calculated as:

$$\eta^2 = SS_{\text{Between Groups}} / SS_{\text{Total}}$$

The $SS_{\text{Between Groups}}$ term pertains to the independent variable of interest, whereas $SS_{\text{Total}}$ is based on the entire data set. Specifically, for a factorial ANOVA, $SS_{\text{Total}}$ = [$SS_{\text{Between Groups}}$ for all factors + $SS_{\text{Error}}$ + all $SS_{\text{Interactions}}$]. As such, the magnitude of $\eta^2$ for a given factor will be influenced by the number of other independent variables. For example, $\eta^2$ will tend to be larger in a one-factor design than in a two-factor design because in the latter the $SS_{\text{Total}}$ term will be inflated to include sum of squares arising from the second factor.

Partial eta squared (partial $\eta^2$) is calculated with respect to the sum of squares associated with the factor of interest, not the total sum of squares:

$$\text{partial } \eta^2 = SS_{\text{Between Groups}} / (SS_{\text{Between Groups}} + SS_{\text{Error}})$$

As with the $\eta^2$ calculation, the $SS_{\text{Between Groups}}$ numerator term for partial $\eta^2$ pertains to the independent variable of interest. However, the denominator differs from that of $\eta^2$. The denominator for partial $\eta^2$ is not based on the entire data set ($SS_{\text{Total}}$) but instead on only $SS_{\text{Between Groups}}$ and $SS_{\text{Error}}$ for the factor being evaluated. For a one-factor ANOVA, the sum of squares terms are identical for $\eta^2$ and partial $\eta^2$, so the values are identical; however, with factorial ANOVA the denominator for partial $\eta^2$ will always be smaller. For this reason, partial $\eta^2$ is always larger than $\eta^2$ with factorial ANOVA (unless a factor or interaction has absolutely no effect, as in the case of the interaction in Figure 4, for which both $\eta^2$ and partial $\eta^2$ equal 0).

Omega squared ($\omega^2$) is based on an estimation of the proportion of variance in the underlying population, in contrast to the $\eta^2$ and partial $\eta^2$ indices that are based on proportions of variance in the sample. For this reason, $\omega^2$ will always be a smaller value than $\eta^2$ and partial $\eta^2$. Application of $\omega^2$ is limited to between-subjects designs (i.e., not repeated measures) with equal sample sizes in all groups. Omega squared is calculated as follows:

$$\omega^2 = [SS_{\text{Between Groups}} - (df_{\text{Between Groups}})(MS_{\text{Error}})] / (SS_{\text{Total}} + MS_{\text{Error}})$$

In contrast to $\eta^2$, which provides an upwardly biased estimate of effect size when the sample size is small, $\omega^2$ calculates an unbiased estimate [23].

The reader is cautioned that $\eta^2$ and partial $\eta^2$ are often misreported in the literature (e.g., $\eta^2$ incorrectly reported as partial $\eta^2$) [24,25]. It is advisable to calculate these values by hand using the formulae shown above as a confirmation of the output of statistical software programs, to ensure accurate reporting. Refer to Figure 4 for sample calculations of these three effect size indices for a two-factor ANOVA.

The $\eta^2$ and partial $\eta^2$ indices have distinctly different attributes. Whether a given attribute is considered to be an advantage or disadvantage is a matter of perspective and context. Some authors [24] argue the merits of eta squared, whereas others [4] prefer partial eta squared. Notable issues pertaining to these indices include:

(i) Proportion of variance: When there is a statistically significant main effect or interaction, both $\eta^2$ and partial $\eta^2$ (and $\omega^2$) can be interpreted in terms of the percentage of variance accounted for by the corresponding independent variable, even though they will often yield different values for factorial ANOVAs. So if $\eta^2$ = 0.20 and partial $\eta^2$ = 0.25 for a given factor, these two effect size indices indicate that the factor accounts for 20% vs. 25%, respectively, of the total variability in the dependent variable scores.

(ii) Relative values: Since $\eta^2$ is either equal to (one-factor ANOVA) or less than (factorial ANOVA) partial $\eta^2$, the $\eta^2$ index is the more conservative measure of effect size. This can be viewed as a positive or negative attribute.

(iii) Additivity: $\eta^2$ is additive, but partial $\eta^2$ is not. Since $\eta^2$ for each factor is calculated in terms of the total sum of squares, all the $\eta^2$ for an ANOVA are additive and sum to 1 (i.e., they sum to equal the amount of variance in the dependent variable that arises from the effects of all the independent variables). In contrast, a factor’s partial $\eta^2$ is calculated in terms of that factor’s sum of squares (not the total sum of squares), so on mathematical grounds the individual partial $\eta^2$ from an ANOVA are not additive and do not necessarily sum to 1.

(iv) Effects of multiple factors: As the number of factors increases, the proportion of variance accounted for by each factor will necessarily decrease. Accordingly, $\eta^2$ decreases in an associated way. In contrast, partial $\eta^2$ for each factor is calculated within the sum of squares variance metrics of that particular factor, and is not influenced by the number of other factors.

FIGURE 4. Calculations of three different measures of effect size for a two-factor (Treatment and Gender) ANOVA of the data set shown in Figure 2. The effect sizes shown are all based on proportions of sum of squares: eta squared ($\eta^2$), partial $\eta^2$, and omega squared ($\omega^2$). Note the following: (i) The denominator sum of squares term will be larger for $\eta^2$ than for partial $\eta^2$ in a factorial ANOVA, so $\eta^2$ will be smaller than partial $\eta^2$. (ii) Omega squared ($\omega^2$) is a population estimate, whereas $\eta^2$ and partial $\eta^2$ are sample estimates, so $\omega^2$ will be smaller than both $\eta^2$ and partial $\eta^2$. (iii) The sum of all $\eta^2$ equals 1, whereas the sum of all partial $\eta^2$ does not equal 1 (can be less than or greater than). Refer to text for further explanation of these attributes.

| Effect | Sum of Squares | Degrees of freedom | Mean Squares | $\eta^2$ | partial $\eta^2$ | $\omega^2$ |
|---|---|---|---|---|---|---|
| Treatment | 56 | 2 | 28 | 0.76 | 0.90 | 0.72 |
| Gender | 12 | 1 | 12 | 0.16 | 0.67 | 0.15 |
| Treatment x Gender | 0 | 2 | 0 | 0.00 | 0.00 | 0.00 |
| Error | 6 | 6 | 1 | 0.08 | --- | --- |
| Total | 74 | 11 | | 1.00 | 1.57 | |

Sample calculations:

$$\eta^2 = SS_{\text{Between Groups}} / SS_{\text{Total}}$$

$\eta^2$ for Treatment = 56 / 74 = 0.76 = accounts for 76% of total variability in DV scores.
$\eta^2$ for Gender = 12 / 74 = 0.16 = accounts for 16% of total variability in DV scores.
$\eta^2$ for Treatment*Gender interaction = 0 / 74 = 0.00 = accounts for 0% of total variability in DV scores.
$\eta^2$ for Error = 6 / 74 = 0.08 = accounts for 8% of total variability in DV scores.
Sum of all $\eta^2$ = 100%

$$\text{partial } \eta^2 = SS_{\text{Between Groups}} / (SS_{\text{Between Groups}} + SS_{\text{Error}})$$

partial $\eta^2$ for Treatment = 56 / (56 + 6) = 0.90 = accounts for 90% of total variability in DV scores.
partial $\eta^2$ for Gender = 12 / (12 + 6) = 0.67 = accounts for 67% of total variability in DV scores.
partial $\eta^2$ for Treatment*Gender interaction = 0 / (0 + 6) = 0.00 = accounts for 0% of total variability in DV scores.
Sum of all partial $\eta^2$ ≠ 100%

$$\omega^2 = [SS_{\text{Between Groups}} - (df_{\text{Between Groups}})(MS_{\text{Error}})] / (SS_{\text{Total}} + MS_{\text{Error}})$$

$\omega^2$ for Treatment = [56 - (2)(1)] / [74 + 1] = 54 / 75 = 0.72
$\omega^2$ for Gender = [12 - (1)(1)] / [74 + 1] = 11 / 75 = 0.15
$\omega^2$ for Treatment*Gender interaction = [0 - (2)(1)] / [74 + 1] = -2 / 75, a negative value that is reported as 0.00

How Many Subjects?

The aim of any experimental design is to have adequate statistical power to detect differences between groups that truly exist. There is no simple answer to the question of how many subjects are needed for statistical validity using ANOVA. Typical standards are to design a study with an alpha of 0.05 to have statistical power of at least 0.80 (i.e., 80%


chance of detecting differences between group means that truly exist; alternatively, a 20% chance of committing a Type 2 error). Statistical power will be a function of effect size, sample size, and the number of independent variables and levels, among other things. Adequate sample size is a critical design consideration, and prospective (a priori) power analysis is performed to estimate the required sample size that will yield the desired level of power in the inferential analysis after data are collected. This entails a prediction of group mean differences and group standard deviations in the yet-to-be collected data. Specifically, the effect size index used for prospective power analysis is based on a standardized measure such as Cohen’s d, which is based on predicted differences in group means (statistical signal) divided by standard deviation (statistical noise). Being based on differences instead of proportions, the d effect size index is scaled differently than the $\eta^2$, partial $\eta^2$ and $\omega^2$ described above, and can exceed a value of 1.

The prediction of an experiment’s effect size that is part of a prospective power analysis is nothing more than an estimate. This estimate can be based on pilot study data, previously published findings, intuition or best guesses. A guiding principle should be to select an effect size that is deemed to be clinically relevant.

The approach used in a prospective power analysis is outlined below for the simple case of a t-test with independent groups and equal variance, in which the effect size index is defined as:

d = difference in group means / standard deviation of both groups

The estimate of the appropriate number of subjects in each group for the specified alpha and power is given by the following equation [26]:

$$N_{\text{Estimated}} = 2 \times [(z_{\alpha} + z_{\beta}) / d]^2$$

in which:

$z_{\alpha}$ is the z value for the specified alpha. With an alpha = 0.05, $z_{\alpha}$ = 1.96 (2 tail).
$z_{\beta}$ is the z value for the specified beta (risk of Type 2 error). Power = 1 - β. For β = 0.20 (power = 0.80), $z_{\beta}$ = 0.84 (1 tail).

As a computational example, if the effect size is predicted to be 1.0 (which equates to a difference between group means of one standard deviation), then for alpha = 0.05 and power = 0.80 the appropriate sample size for each group would be:

$$N_{\text{Estimated}} = 2 \times [(1.96 + 0.84) / 1]^2 = 2 \times [2.80 / 1]^2 = 2 \times 2.8^2 = 15.68 \approx 16$$

For a smaller effect size, a larger sample size is needed, e.g., N = 63 for an effect size of 0.5. The reader is cautioned that these sample sizes are estimates based on guesses about the predicted effect size; they do not guarantee statistical significance.

Prospective power analysis for ANOVA is more complex than outlined above for a simple t-test. ANOVAs can have numerous levels within a factor, multiple factors, and interactions, all of which need to be accounted for in a comprehensive power analysis. These complications raise the following cautionary note: ANOVA power analysis quickly devolves into a series of progressively more wild guesses (instead of “estimates”) of effect sizes as the number of independent variables and possible interactions increase [26]. It is often advisable to focus a prospective power analysis for ANOVA on one factor that is of primary interest, so as to simplify the power analysis and reduce the number of unjustifiable guesses. The reader is referred to statistical textbooks (such as references 22, 26, 27) for different approaches that can be used for prospective power analysis for ANOVA designs. As a general guideline, it is desirable for group sample sizes to be large enough to invoke the central limit theorem in the statistical analysis (>30 or so) and for there to be a balanced design (equal sample sizes in each group).

Finally, a retrospective (post hoc) power analysis is warranted after data are collected. The aim is to determine the statistical power of the study, based on the effect size (not estimated, but calculated directly from the data) and sample size. This is particularly relevant for statistically non-significant findings, since the non-significance may have been the result of inadequate statistical power. The textbooks cited above, as well as many others, also discuss the mechanics of how to perform retrospective power analyses.

Conclusion: Statistical Significance Should not be Confused with Clinical Significance

ANOVA is a useful statistical tool for drawing inferential conclusions about how one or more independent variables influence a parametric dependent variable (outcome measure). It is imperative to keep in mind that statistical significance does not necessarily correspond to clinical significance. The much sought after statistically significant ANOVA p value has only two purposes: to play a role in the inferential decision as to whether group means differ from each other (rejection of Null hypothesis), and to assign a probability of the risk of committing a Type 1 error if the Null hypothesis is rejected. Statistically significant ANOVA and MCPs say nothing about the magnitude of group mean differences, other than that a difference exists. A large sample size can produce statistical significance with small differences in group means; depending on the outcome measure, these small differences may have little clinical significance. Assigning clinical significance is a judgment call that needs to take into account the magnitude of the differences between groups, which is best assessed by examination of effect sizes. Statistical significance plays the role of a searchlight to detect group differences, whereas effect size is useful for judging the clinical significance of these differences.

REFERENCES

1. Wackerly DD, Mendenhall W III, Scheaffer RL. Mathematical Statistics with Applications. 6th ed. Pacific Grove, CA: Duxbury Press, 2002.
2. Shapiro SS, Wilk MB. An analysis of variance test for normality (complete samples). Biometrika 1965;52:591–611.
3. D’Agostino RB. An omnibus test of normality for moderate and large size samples. Biometrika 1971;58:341–348.


4. Tabachnick BG, Fidell LS. Using Multivariate Statistics. 5th ed. New York: Pearson Education, 2007.
5. Levene H. Robust tests for equality of variances. In Olkin I, ed. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Palo Alto: Stanford University Press, 1960.
6. Brown MB, Forsythe AB. Robust tests for the equality of variances. Journal of the American Statistical Association 1974;69:364–367.
7. Zar JH. Biostatistical Analysis. Upper Saddle River, NJ: Prentice Hall, 1998.
8. Daniel WW. Biostatistics: A Foundation for Analysis in the Health Sciences. 7th ed. Hoboken, NJ: John Wiley & Sons, Inc., 1999.
9. Box GEP. Non-normality and tests on variances. Biometrika 1953;40:318–335.
10. Box GEP. Some theorems on quadratic forms applied in the study of analysis of variance problems: I. Effect of inequality of variance in the one way classification. Annals of Mathematical Statistics 1954;25:290–302.
11. Wilkinson L, Blank G, Gruber C. Desktop Data Analysis with SYSTAT. Upper Saddle River, New Jersey: Prentice Hall, 1996.
12. Šidák Z. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 1967;62:626–633.
13. Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 1979;6:65–70.
14. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988;75:800–802.
15. Huang Y. Hochberg’s step-up method: Cutting corners off Holm’s step-down method. Biometrika 2007;94:965–975.
16. Toothaker L. Multiple Comparisons for Researchers. New York, NY: Sage Publications, 1991.
17. Cabral HJ. Multiple Comparisons Procedures. Circulation 2008;117:698–701.
18. Wilkinson L and the Task Force on Statistical Inference. Statistical methods in psychology journals. American Psychologist 1999;54:594–604.
19. D’Agostino RB, Massaro J, Kwan H, Cabral H. Strategies for dealing with multiple treatment comparisons in confirmatory clinical trials. Drug Information Journal 1993;27:625–641.
20. Cook C. Clinimetrics Corner: Use of effect sizes in describing data. J Man Manip Ther 2008;16:E54–E5.
21. Cohen J. Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement 1973;33:107–112.
22. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.
23. Keppel G. Design and Analysis: A Researcher’s Handbook. 2nd ed. Englewood Cliffs, NJ: Prentice Hall, 1982.
24. Levine TR, Hullett CR. Eta squared, partial eta squared, and misreporting of effect size in communication research. Human Communication Research 2002;28:612–625.
25. Pierce CA, Block RA, Aguinis H. Cautionary note on reporting eta-squared values from multifactor ANOVA designs. Educational and Psychological Measurement 2004;64:916–924.
26. Norman GR, Streiner DL. Biostatistics: The Bare Essentials. Hamilton, Ontario: B.C. Decker Inc., 1998.
27. Portney LG, Watkins MP. Foundations of Clinical Research: Applications to Practice. 3rd ed. Upper Saddle River, NJ: Pearson Education Inc., 2009.
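As a companion to the discussion of alpha adjustments for multiple post hoc t-tests, the sketch below applies the Bonferroni, Holm (step-down) and Hochberg (step-up) decision rules to a list of p values, and computes the Šidák-adjusted alpha. It is a minimal illustration written for this primer's worked examples, not the implementation of any statistics package; the function names are hypothetical, and "significant" is taken to mean p strictly less than the adjusted alpha, following the wording of the text.

```python
def sidak_alpha(familywise_alpha, c):
    """Per-test alpha that gives exactly the familywise alpha over c independent tests."""
    return 1 - (1 - familywise_alpha) ** (1 / c)


def bonferroni(p_values, alpha=0.05):
    """Every p value is compared with the same adjusted alpha, alpha / C."""
    c = len(p_values)
    return [p < alpha / c for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down: smallest p vs alpha/C, next vs alpha/(C-1), and so on.

    Testing stops at the first non-significant result.
    """
    c = len(p_values)
    order = sorted(range(c), key=lambda i: p_values[i])  # ascending p
    significant = [False] * c
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (c - rank):
            significant[i] = True
        else:
            break
    return significant


def hochberg(p_values, alpha=0.05):
    """Step-up: largest p vs alpha/1, next largest vs alpha/2, and so on.

    The first significant result makes it and all smaller p values significant.
    """
    c = len(p_values)
    order = sorted(range(c), key=lambda i: p_values[i], reverse=True)  # descending p
    significant = [False] * c
    for step, i in enumerate(order, start=1):
        if p_values[i] < alpha / step:
            for j in order[step - 1:]:
                significant[j] = True
            break
    return significant


if __name__ == "__main__":
    for p_values in ([0.07, 0.02, 0.015], [0.045, 0.04, 0.03]):
        print(p_values)
        print("  Bonferroni:", bonferroni(p_values))
        print("  Holm:      ", holm(p_values))
        print("  Hochberg:  ", hochberg(p_values))
    print(f"Sidak alpha for 3 comparisons: {sidak_alpha(0.05, 3):.4f}")
```

For the p values 0.07, 0.02 and 0.015, Holm and Hochberg both flag 0.02 and 0.015 while Bonferroni flags only 0.015; for 0.045, 0.04 and 0.03, only Hochberg finds any of them significant, which matches the two worked examples in the text.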
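The advice to confirm software-reported effect sizes by hand can also be followed with a few lines of code. The sketch below recomputes the Figure 4 values of eta squared, partial eta squared and omega squared from the sums of squares and degrees of freedom in that table. The function names are illustrative, and truncating a negative omega squared to zero is an assumption made here to match the 0.00 reported for the interaction in Figure 4.

```python
def eta_squared(ss_effect, ss_total):
    """Proportion of the total sum of squares attributable to the effect."""
    return ss_effect / ss_total


def partial_eta_squared(ss_effect, ss_error):
    """Effect sum of squares relative to effect plus error only."""
    return ss_effect / (ss_effect + ss_error)


def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    """Population estimate; negative results are truncated to 0 (assumption)."""
    value = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
    return max(0.0, value)


# Sum of squares and degrees of freedom from the Figure 4 ANOVA table.
EFFECTS = {
    "Treatment": (56, 2),
    "Gender": (12, 1),
    "Treatment x Gender": (0, 2),
}
SS_ERROR, DF_ERROR = 6, 6
SS_TOTAL = sum(ss for ss, _ in EFFECTS.values()) + SS_ERROR  # 74
MS_ERROR = SS_ERROR / DF_ERROR                               # 1.0

if __name__ == "__main__":
    for name, (ss, df) in EFFECTS.items():
        print(
            f"{name}: eta2 = {eta_squared(ss, SS_TOTAL):.2f}, "
            f"partial eta2 = {partial_eta_squared(ss, SS_ERROR):.2f}, "
            f"omega2 = {omega_squared(ss, df, MS_ERROR, SS_TOTAL):.2f}"
        )
```

Because each index uses a different denominator, reporting which one was computed matters as much as the number itself.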
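The sample size estimate for a two-group t-test, N = 2 x [(z_alpha + z_beta) / d]^2, can be reproduced with the Python standard library. This is a rough sketch under the same simplifying assumptions as the text (independent groups, equal variance, two-tailed alpha, normal approximation); the function name is hypothetical, and the z values are taken from the standard normal distribution rather than typed in as 1.96 and 0.84, so unrounded results differ slightly from the hand calculation before rounding up.

```python
import math
from statistics import NormalDist


def estimated_n_per_group(d, alpha=0.05, power=0.80):
    """Estimated subjects per group for a two-group t-test (normal approximation).

    d is the predicted standardized effect size (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed, about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # one-tailed, about 0.84 for power = 0.80
    return 2 * ((z_alpha + z_beta) / d) ** 2


if __name__ == "__main__":
    for d in (1.0, 0.5):
        n = estimated_n_per_group(d)
        # Round up, since a fraction of a subject cannot be recruited.
        print(f"d = {d}: N per group = {n:.2f}, rounded up to {math.ceil(n)}")
```

Halving the predicted effect size roughly quadruples the required sample size, which is why the choice of a clinically relevant effect size dominates the estimate.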
