Homoscedasticity: an Overlooked Critical Assumption for Linear Regression
Open access

Biostatistical methods in psychiatry

Kun Yang,1 Justin Tu,2 Tian Chen3

To cite: Yang K, Tu J, Chen T. Homoscedasticity: an overlooked critical assumption for linear regression. General Psychiatry 2019;32:e100148. doi:10.1136/gpsych-2019-100148

Received 18 September 2019
Accepted 19 September 2019

Gen Psych: first published as 10.1136/gpsych-2019-100148 on 17 October 2019. Downloaded from http://gpsych.bmj.com/ on September 30, 2021 by guest. Protected by copyright.

© Author(s) (or their employer(s)) 2019. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.

1Department of Family Medicine and Public Health, University of California System, San Diego, California, USA
2PGY-2, Physical Medicine and Rehabilitation, University of Virginia Health System, Charlottesville, Virginia, USA
3Department of Mathematics and Statistics, University of Toledo, Toledo, Ohio, USA

Correspondence to Dr Tian Chen, Department of Mathematics and Statistics, University of Toledo, Toledo, OH 43606, USA; tian.chen@utoledo.edu

SUMMARY
Linear regression is widely used in biomedical and psychosocial research. A critical assumption that is often overlooked is homoscedasticity. Unlike normality, the other assumption on data distribution, homoscedasticity is often taken for granted when fitting linear regression models. However, contrary to popular belief, this assumption actually has a bigger impact on the validity of linear regression results than normality. In this report, we use Monte Carlo simulation studies to investigate and compare their effects on the validity of inference.

INTRODUCTION
Linear regression (LR) is arguably the most popular statistical model used to facilitate biomedical and psychosocial research. LR can be used to examine relationships between continuous variables, and associations between a continuous and a categorical variable. For example, by using one binary independent variable, LR can be used to compare the means between two groups, akin to the two independent samples t-test. If we have a multilevel categorical independent variable, LR yields the analysis of variance (ANOVA) model. Although the t-test for unequal group variance is often used as an alternative for comparing group means when large differences in group variances emerge, the same homoscedasticity assumption underlying ANOVA is often taken for granted when this classic model is applied for comparing more than two groups. For ANOVA, much of the focus is centred on normality, with little attention paid to homoscedasticity.

Contrary to popular belief, the homoscedasticity assumption actually plays a more critical role than normality in the validity of ANOVA. This is because the F-test, which tests for overall differences in group means across all the groups (omnibus test), is more sensitive to heteroscedasticity than to departures from normality. Thus, even when data are perfectly normal, the F-test will generally yield incorrect results if the group variances differ substantially. Although the Kruskal-Wallis (KW) test is applied when homoscedasticity is deemed suspicious,1 this test is less powerful than the F-test, since it discretises the original data using ranks, a sequence of natural numbers such as 1, 2 and 3, to represent ordinal differences in the original continuous outcomes. An even more serious problem with the KW test is the extremely complex distribution of its test statistic and, consequently, its limited applications in practice.2

Over the past 30 years, many new statistical methods have been developed to address the aforementioned limitations of the classic LR and associated alternatives. Such new models apply to cross-sectional and longitudinal data, the latter being the hallmark of modern clinical research. Semiparametric statistical models are the most popular, since they require none of the distribution assumptions and apply to continuous outcomes without changing the continuous scale.3 In this report, we use Monte Carlo simulation studies to investigate and compare results when one of the two assumptions is violated, and to show the importance of homoscedasticity for valid inference for LR. We will discuss and perform a head-to-head comparison of power between the classic KW test and modern semiparametric models in a future article.

LR MODEL
We start with a brief overview of the classic LR. Consider a continuous outcome of interest, $Y$, and a set of $p$ independent variables, $X_1, X_2, \ldots, X_p$. We are interested in modelling the relationship of $Y$ with the independent variables. Given a sample of $n$ subjects, the classic LR models this relationship as:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad 1 \le i \le n, \tag{1}$$

where $i$ indexes the subjects, $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are the regression coefficients (parameters), $\varepsilon_i$ is the error term, and $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. The LR in equation (1) posits a linear association between the outcome (dependent variable) $Y$ and each of the independent variables. The latter have been called different names such as predictors, covariates and explanatory variables.

The first part of LR,

$$E(Y_i \mid X_{i1}, \ldots, X_{ip}) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}, \quad 1 \le i \le n, \tag{2}$$

is called the conditional (population) mean of $Y_i$ given the independent variables $X_1, X_2, \ldots, X_p$. On estimating the regression coefficients, this conditional mean can be calculated to provide an estimate of $Y_i$ for each subject. In addition to the assumed linear relationship, there are two additional assumptions in equation (1): (A) normal distribution and (B) homoscedasticity, or constant variance $\sigma^2$ for all subjects. All three assumptions play an important role in obtaining valid inference for regression coefficients. For example, if the association of $Y$ with a particular independent variable $X_1$ is quadratic, the linear model in equation (1) must also include $X_{i1}^2$, since otherwise estimates of $\beta_1$ will generally be biased. Likewise, if the error term $\varepsilon_i$ is not normally distributed, SEs of estimated coefficients may be incorrect. Both linearity and normality have received extensive coverage in the literature.

In contrast, the impact of homoscedasticity on statistical inference of regression coefficients has received much less attention. Most publications in the biomedical and psychosocial literature do not even acknowledge this assumption for their applications of LR. Contrary to popular belief, inference about regression coefficients is actually more sensitive to departures from homoscedasticity than to departures from normality. In fact, normality does not matter at all when the sample size is relatively large. In contrast, homoscedasticity remains an issue regardless of how large the sample size becomes. Below we illustrate these facts using Monte Carlo (MC) simulated data. For ease of exposition, we focus on one-way ANOVA, but the same conclusions apply to general LR as well.

ANOVA MODEL
One particularly popular special case of LR is the ANOVA model. This occurs when the independent variables $X_1, X_2, \ldots, X_p$ are binary indicators, representing different levels of a categorical or ordinal variable for multiple groups. The conditional mean of $Y$ in equation (2) becomes the group mean. For example, if there are three groups, we may use group 1 as the referent and the other two independent variables $X_1, X_2$ to represent groups 2 and 3:

$$X_{i1} = \begin{cases} 1 & \text{if subject } i \text{ is in group 2} \\ 0 & \text{otherwise} \end{cases} \qquad X_{i2} = \begin{cases} 1 & \text{if subject } i \text{ is in group 3} \\ 0 & \text{otherwise} \end{cases}$$

In this case, the LR in equation (1) becomes:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i = \begin{cases} \beta_0 + \varepsilon_i & \text{if subject } i \text{ is from group 1} \\ \beta_0 + \beta_1 + \varepsilon_i & \text{if subject } i \text{ is from group 2} \\ \beta_0 + \beta_2 + \varepsilon_i & \text{if subject } i \text{ is from group 3} \end{cases}$$

Thus, the regression coefficients determine the group means of $Y$:

$$\mu_1 = \beta_0, \quad \mu_2 = \beta_0 + \beta_1, \quad \mu_3 = \beta_0 + \beta_2,$$

where $\mu_k$ denotes the mean of $Y$ for group $k$ ($1 \le k \le K$). Because of this relationship between the coefficients and the group means, the ANOVA is often simply expressed as:

$$Y_{ki} = \mu_k + \varepsilon_{ki}, \quad \varepsilon_{ki} \sim N(0, \sigma^2), \quad 1 \le i \le n_k, \quad 1 \le k \le K, \tag{3}$$

where $Y_{ki}$ denotes the outcome from the $i$th subject within the $k$th group, $\mu_k = E(Y_{ki})$ is the (population) mean of the $k$th group, and $K$ is the total number of groups. For the one-way ANOVA in equation (3) the linearity assumption does not apply, as $\mu_k$ represents the group mean and no linear or any other relationship is assumed between the group means. The normality and homoscedasticity assumptions become easier to interpret and check as well, as they apply to the distributions of $Y_{ki}$ within each group.

Under ANOVA, comparisons of group means across all groups are readily expressed by a null, $H_0$, and alternative, $H_a$, hypothesis as:

$$H_0: \mu_i = \mu_k \text{ for all } 1 \le i < k \le K \quad \text{vs} \quad H_a: \mu_i \ne \mu_k \text{ for at least one pair } i \text{ and } k,\ 1 \le i < k \le K. \tag{4}$$

Under the null hypothesis $H_0$, all groups have the same mean. If $H_0$ is rejected, post hoc analyses follow to determine which groups have different means. We focus on the hypothesis in equation (4) for overall group difference below, but the same conclusions apply to post hoc pairwise group comparisons as well.

ANOVA uses F-tests for testing the null hypothesis of no group difference in equation (4). This omnibus test is defined by elements of a so-called ANOVA table:

Source | df | Sum of squares (SS) | Mean squares (MS)
Groups | $K-1$ | $SS(R) = \sum_{k=1}^{K} n_k (\bar{Y}_{k+} - \bar{Y}_{++})^2$ | $MS(R) = SS(R)/(K-1)$
Error | $N-K$ | $SS(E) = \sum_{k=1}^{K} \sum_{i=1}^{n_k} (Y_{ki} - \bar{Y}_{k+})^2$ | $MS(E) = SS(E)/(N-K)$
Total | $N-1$ | $SS(\text{Total}) = \sum_{k=1}^{K} \sum_{i=1}^{n_k} (Y_{ki} - \bar{Y}_{++})^2$ |

In the above table,

$$\bar{Y}_{k+} = \frac{\sum_{i=1}^{n_k} Y_{ki}}{n_k}, \quad \bar{Y}_{++} = \frac{\sum_{k=1}^{K} \sum_{i=1}^{n_k} Y_{ki}}{N}, \quad N = \sum_{k=1}^{K} n_k, \quad 1 \le k \le K,$$

where $\bar{Y}_{k+}$ is the sample mean of group $k$, $\bar{Y}_{++}$ is the sample mean of the entire sample (grand