Relationship Between Omnibus and Post-Hoc Tests: an Investigation of Performance of the F Test in ANOVA
Total Page:16
File Type:pdf, Size:1020Kb
Shanghai Archives of Psychiatry: first published as 10.11919/j.issn.1002-0829.218014 on 25 July 2018. Downloaded from • 60 • Shanghai Archives of Psychiatry, 2018, Vol. 30, No. 1 •BIOSTATISTICS IN PSYCHIATRY (43)• Relationship between Omnibus and Post-hoc Tests: An Investigation of performance of the F test in ANOVA Tian CHEN1*, Manfei XU2, Justin TU3, Hongyue WANG4, Xiaohui NIU5 Summary: Comparison of groups is a common statistical test in many biomedical and psychosocial research studies. When there are more than two groups, one first performs an omnibus test for an overall difference across the groups. If this null is rejected, one then proceeds to the next step of post- hoc pairwise group comparisons to determine sources of difference. Otherwise, one stops and declares no group difference. A common belief is that if the omnibus test is significant, there must exist at least two groups that are significantly different and vice versa. Thus, when the omnibus test is significant, but no post-hoc between-group comparison shows significant difference, one is bewildered at what is going on and wondering how to interpret the results. At the end of the spectrum, when the omnibus test is not significant, one wonders if all post-hoc tests will be non-significant as well so that stopping after a non- significant omnibus test will not lead to any missed opportunity of finding group difference. In this report, we investigate this perplexing phenomenon and discuss how to interpret such results. Key words: Omnibus test, post-hoc test, F test, Tukey’s test copyright. [Shanghai Arch Psychiatry. 2018; 30(1): 60-64. doi: http://dx.doi.org/10.11919/j.issn.1002-0829.218014] 1. Introduction means are different. Otherwise, post-hoc tests are Comparison of groups is a common issue of interest performed to find sources of difference. in most biomedical and psychosocial research studies. During post-hoc analysis, one compares pairs In many studies, there are more than two groups, in of groups and finds all pairs that show significant which case the popular t-test for two (independent) difference. This hierarchical procedure is predicated http://gpsych.bmj.com/ groups no longer applies and models for comparing upon the premise that if the omnibus test is more than two groups must be used, such as the significant, there must exist at least two groups that analysis of variance, ANOVA, model.[1] When comparing are significantly different and vice versa. more than two groups, one follows a hierarchical The hierarchical procedure is taught in basic as approach. Under this approach, one first performs an well as advanced statistics courses and built into omnibus test, which tests the null hypothesis of no many popular statistical packages. For example, when difference across groups, i.e., all groups have the same performing the analysis of variance (ANOVA) model on September 29, 2021 by guest. Protected mean. If this test is not significant, there is no evidence for comparing multiple groups, the omnibus test is in the data to reject the null and one then concludes carried out by the F-statistic.[1] For post-hoc analyses, that there is no evidence to suggest that the group one can use a number of specialized procedures such 1Department of Mathematics and Statistics, University of Toledo, OH, USA 2Shanghai Mental Health Center, Shanghai Jiao Tong University Medical College, Shanghai, China 3Department of Physical Medicine and Rehabilitation, University of Virginia School of Medicine, Charlottesville, VA, USA 4Department of Biostatistics and Computational Biology, University of Rochester, NY, USA 5College of Informatics, Huazhong Agriculture University, Wuhan, China *correspondence: Tian CHEN. Mailing address: 2801 W. Bancroft Street, MS 942, Toledo, OH, USA. Postcode: 43606. E-Mail: [email protected] Shanghai Archives of Psychiatry: first published as 10.11919/j.issn.1002-0829.218014 on 25 July 2018. Downloaded from Shanghai Archives of Psychiatry, 2018, Vol. 30, No. 1 • 61 • [1] as Tukey’s and Scheffe’s tests. Special statistical tests Thus, under the null H0 , all groups have the same are needed for performing post-hoc analyses, because mean. If H0 is rejected in favor of the alternative of potentially inflated type I errors when performing H a , there are at least two groups that have different multiple tests to identify the groups that have different means. means. Tukey’s, Scheffe’s and other post-hoc tests are When performing ANOVA, one first tests the all adjusted for such multiple comparisons to ensure hypothesis in Equation (2). If this omnibus test is not correct type I errors in the fact of multiple testing. rejected, then one concludes that there is evidence In practice, however, it seems quite often that to indicate different means across the groups. none of the post-hoc tests are significant, while the Otherwise, there is evidence against the null in favor omnibus test is significant. The reverse seems to occur of the alternative and one then proceeds to the next often as well; when the omnibus test is not significant, step to identify the groups that have different means although some of the post-hoc tests are significant. To from each other. For I groups, there are a total of the best of our knowledge, there does not appear a I(I-1)/2 pairs of groups to examine. In the post-hoc general, commonly accepted approach to handle such testing phase, one performs I(I-1)/2 tests to identify a situation. In this report, we examine this hierarchical the groups that have different group means µi . This approach and see how well it performs using simulated number I(I-1)/2 can be large, especially where there data. We want to know if a significant omnibus test is a large number of groups. Thus, performing all guarantees at least one post-hoc test and vice versa. such tests can potentially increase type I errors. The Although the statistical problem of comparing multiple popular t-test for comparing two (independent) groups groups is relevant to all statistical models, we focus is inappropriate and specially designed tests must be on the relatively simpler analysis of variance (ANOVA) used to account for accumulated type I errors due to model and start with a brief overview of this popular multiple testing to ensure correct type I errors. Next model for comparing more than two groups. we review the omnibus and some post-hoc tests, which will later be used in our simulation studies. 2. One-Way Analysis of Variance (ANOVA) 2.1 The Omnibus F Test for No Difference Across 2.1 The Statistical Model Groups The analysis of variance (ANOVA) model is widely used The omnibus test for comparing all group means copyright. in research studies for comparing multiple groups. This simultaneously within the context of ANOVA is the model extends the popular t-test for comparing two F-test. The F-test is defined by quantities in the so- (independent) groups to the general setting of more called ANOVA table. To set up this table, let us define than two groups.[1] the following quantities: Consider a continuous outcome of interest, Y , and niiIn ni I let I denote the number of groups. We are interested ∑j=1YYij ∑∑ij=11 = ij 221 in comparing the (population) mean of Y across the I YYii+ =, ++ =, si =∑∑( YYij −=+ ) , N ni , groups. The classic analysis of variance (ANOVA) model nii Nn−1 ji=11= has the form: n In n ∑ii ∑∑ i I j2=1YYij ij=11 = ij 221 =µεε + ∼ σ ≤ ≤ ≤≤ (1) http://gpsych.bmj.com/ Yij i ij, YY ij ii+ N=(0, ), 1, jn++i , = 1 iI , , si =∑∑( YYij −=+ ) , N ni , nii Nn−1 ji=11= where Yij is the outcome from the j th subject µ = within the i th group, iEY() ij is the (population) n mean of the i th group, ε is the error term, N(,µσ2 ) where N is the total sample size, ∑ j=1Aj denotes ij ∑∑In denotes the normal distribution with mean µ and sum of all Aj , ij=11 = Aij denotes sum of Aij over 2 both indices and j , i.e., variance σ , and ni is the sample size for the i th i group. n on September 29, 2021 by guest. Protected With the statistical model in Equation (1), the ∑Ajn=+ AA12 ++L A, primary objective of group comparisonn can be stated j=1 in terms of statistical hypotheses as∑ follows.Ajn=+ AA First,12 ++we L In A, want to know if all the groups have thej=1 same mean. ∑∑AAij =11 ++L AA1n + 21 ++ LLL A2n ++ AI1 ++ AIn . Under the ANOVA above, the nullIn and alternative ij=11 = hypothesis for this comparison of interest is stated as: ∑∑AAij =11 ++L AA1n + 21 ++ LLL A2n ++ AI1 ++ AIn . ij=11 = H0 :µµik= for all 1 ≤<ikI ≤ v.s. (2) Hai:µµ≠ k for at least one pair i and k, 1≤< ikI ≤ . The ANOVA table is defined by: Shanghai Archives of Psychiatry: first published as 10.11919/j.issn.1002-0829.218014 on 25 July 2018. Downloaded from • 62 • Shanghai Archives of Psychiatry, 2018, Vol. 30, No. 1 In the ANOVA table above, SS() R is called the and then test if two groups with sample group means, regression sum of squares, SS() E is called the error Y i+ and Y k+ , have different (population) group means, sum of squares, SS() Total is called the total sum of i.e., µµik= , by the following criteria: squares, is called the mean regression sum 2 MS() R s of squares, and SS () E is called the mean error sum of Yik++−≥ Y W, W = qα () In,N − I , squares.