
Learn to Test for Heteroscedasticity in SPSS With Data From the Early Childhood Longitudinal Study (1998)

© 2015 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction This dataset example introduces readers to testing for heteroscedasticity following a linear regression analysis. Linear regression rests on several assumptions, one of which is that the variance of the residuals from the model is constant and unrelated to the independent variable(s). Constant variance is called homoscedasticity, while non-constant variance is called heteroscedasticity.

This example describes heteroscedasticity, discusses its consequences, and shows how to detect it using data from the Early Childhood Longitudinal Study, Kindergarten Class of 1998–99 (ECLSK), available from the National Center for Education Statistics (https://nces.ed.gov/ecls/kindergarten.asp). It presents an analysis of whether math performance in kindergarten predicts reading performance in kindergarten. Analysis like this might help researchers and policy makers better understand early childhood education.

What Is Heteroscedasticity? Linear regression expresses a dependent variable as a linear function of one or more independent variables. Equation 1 shows an example of a simple regression model with a single independent variable:


(1)

Yi = β0 + β1Xi + εi

Where:

• Yi = individual values of the dependent variable
• Xi = individual values of the independent variable
• β0 = the intercept, or constant, associated with the regression line
• β1 = the slope of the regression line
• εi = the unmodeled random, or stochastic, component of the dependent variable, often called the error term or the residual of the model.

Linear regression models are typically estimated using ordinary least squares (OLS) regression. OLS regression rests on several assumptions, one of which is that the variance of the residuals from the regression model (εi) is constant and unrelated to the independent variable(s). Heteroscedasticity means that the variance of the residual is not constant, which means that an important assumption of OLS has been violated.
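To make the mechanics concrete, the OLS fit of Equation 1 can be sketched in a few lines of standard-library Python (the data values below are invented for illustration; SPSS performs this same computation through its menus):

```python
# Minimal OLS sketch for Equation 1 (Yi = b0 + b1*Xi + ei),
# using only the Python standard library. Data are made up.
from statistics import mean

def ols_simple(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    xbar, ybar = mean(x), mean(y)
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar          # the fitted line passes through the means
    return b0, b1

def residuals(x, y, b0, b1):
    """The unmodeled component: e_i = y_i - (b0 + b1 * x_i)."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

The residuals returned here are the quantities whose variance the homoscedasticity assumption constrains.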

Consequences of Heteroscedasticity The presence of heteroscedasticity does not affect the estimated values of the intercept or slope coefficients of a linear regression model. Those estimates remain unbiased.

However, heteroscedasticity does affect the estimated standard errors for those coefficients. It can make them either too large or too small, but most often it makes them too small.

The standard error of any statistic is calculated to provide an estimate of how much that statistic might change if it were calculated again on another random sample of data of the same size taken from the same population. For regression, the standard error for each coefficient provides an estimate of uncertainty about that coefficient. We use both the coefficients and their standard errors when we test hypotheses about those coefficients.

For example, we might estimate a regression model like the one presented in Equation 1 and produce an estimate of β1 of 2.5. In order to determine whether 2.5 is statistically significantly different from zero, we need to perform a hypothesis test. Specifically, we would test the null hypothesis that β1 = 0. To do so, we would:

1. Perform the regression analysis to produce our estimates of β1 and its standard error.
2. Divide the estimate of β1 by its standard error to produce a T-score.
3. Compare the T-score from the previous step to a Student’s T distribution, with degrees of freedom equal to the sample size minus the number of coefficients estimated as part of the original regression.
4. Determine the level of significance, or p-value, associated with the calculated T-score. Typically, if that p-value is less than 0.05, researchers would declare the result to be statistically significant.
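In code, steps 2 through 4 reduce to a division and a tail-probability lookup. The sketch below uses the made-up estimate of 2.5 with a made-up standard error, and approximates the Student’s T tail with the standard normal distribution, which is a close match at sample sizes like those in this example (exact T tails would need a statistics package):

```python
# Hypothesis test for a slope coefficient: t-score and approximate p-value.
# beta1_hat and se_beta1 are illustrative numbers, not real estimates.
import math

def two_sided_p(t_score):
    """Two-sided tail probability, normal approximation to Student's T."""
    return math.erfc(abs(t_score) / math.sqrt(2.0))

beta1_hat = 2.5                    # step 1: illustrative coefficient estimate
se_beta1 = 0.8                     # step 1: illustrative standard error
t_score = beta1_hat / se_beta1     # step 2: 2.5 / 0.8 = 3.125
p_value = two_sided_p(t_score)     # steps 3-4
# p_value falls well below 0.05, so we would reject H0: beta1 = 0.
```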

Of course, statistical software generally performs all of these steps for us automatically. However, this process and those computer programs assume that the variance of the residuals is constant. As noted above, heteroscedasticity leads to incorrect estimates of standard errors. As a result, heteroscedasticity will confound subsequent hypothesis testing. Because heteroscedasticity typically produces standard errors that are smaller than they should be, we run the risk of being over-confident in our coefficient estimates and possibly declaring a coefficient estimate to be statistically significant when it is not.


Detecting Heteroscedasticity There are two main strategies for detecting heteroscedasticity. The first approach is graphical. For a simple regression model, a two-way scatter plot with the residuals from the regression model plotted on the Y-axis and the independent variable plotted on the X-axis is a good place to start.

For a multiple regression model, you could produce separate plots like this for each independent variable – with the residuals plotted on the Y-axis and the independent variable in question plotted on the X-axis.

For either a simple or a multiple regression model, it is quite common to plot the residuals on the Y-axis and the predicted value of the dependent variable based on the regression model on the X-axis. Often both the residuals and the predicted values of the dependent variable are standardized before being plotted. Figure 1 shows an example of this approach.

Figure 1: Illustration comparing what homoscedasticity and heteroscedasticity look like using two-way scatter plots with the standardized residual from the regression plotted on the Y-axis and the standardized predicted value of the dependent variable from the regression plotted on the X-axis.


Regardless of which figures you produce, you should see the same level of vertical spread among the residuals across all values plotted along the X-axis as you look from left to right at the plot if you have constant variance in the residuals. However, if you have heteroscedasticity, you will see changes in the vertical spread among the residuals as you look across the figure from left to right. That spread could be steadily growing, steadily shrinking, or showing a more complex pattern such as less variance in the center of the data than at both the lower and upper extremes.

The panel on the left in Figure 1 shows how the vertical spread of the standardized residuals is roughly the same as you look from left to right. However, the panel on the right in Figure 1 shows a smaller vertical spread for the residuals initially that spreads out wider as you move from left to right. The panel on the left shows what homoscedasticity looks like. The panel on the right shows what one of the more common forms of heteroscedasticity looks like.
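The coordinates plotted in figures like Figure 1 are straightforward to compute once the model has been fit. A sketch in standard-library Python, assuming you already have the fitted intercept and slope:

```python
# Build the coordinates for a residual-vs-predicted diagnostic plot.
# "Standardized" here means mean-centered and scaled by the standard deviation.
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def diagnostic_plot_coords(x, y, b0, b1):
    """X-axis: standardized predicted values; Y-axis: standardized residuals."""
    predicted = [b0 + b1 * xi for xi in x]
    resid = [yi - pi for yi, pi in zip(y, predicted)]
    return standardize(predicted), standardize(resid)
```

Feeding these two lists to any plotting tool reproduces the style of Figure 1.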

Graphical methods are useful for seeing the data, but they lack formal statistical precision. As a result, many researchers turn to formal statistical tests for heteroscedasticity. There are many tests of the null hypothesis of homoscedasticity, but the most common is the Breusch–Pagan test (sometimes called the Breusch–Pagan–Godfrey test; also developed independently as the Cook–Weisberg test).

The Breusch–Pagan test unfolds in several steps:

1. Estimate the regression model of interest and save the residuals, εi, for each observation.
2. Compute a number we will label σ̂² by:
2.1 squaring each residual, εi,
2.2 summing up those squared residuals,
2.3 dividing the result by the size of the sample from the regression model you estimated.
3. Compute a new variable, pi, equal to εi² / σ̂².
4. Run a new auxiliary regression with pi as the dependent variable and all of the same independent variables that were part of your original regression.
5. Compute the explained sum of squares from this auxiliary regression and divide it by 2.
6. Compare the result from the previous step to a Chi-squared distribution with degrees of freedom equal to the number of independent variables included in the auxiliary regression.
7. If the level of significance, or p-value, associated with the result from the previous step is small – typically below 0.05 – you can reject the null hypothesis of homoscedasticity and declare that you do in fact have evidence of heteroscedasticity.
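To make the recipe concrete, here is a standard-library Python sketch of these steps for the one-regressor case (a real analysis would use the built-in test in SPSS or another package). With a single independent variable, the Chi-squared tail probability with 1 degree of freedom has the closed form erfc(sqrt(stat / 2)):

```python
# Breusch-Pagan test, simple-regression case, standard library only.
import math
from statistics import mean

def ols(x, y):
    """Fit y = b0 + b1*x by ordinary least squares; return (b0, b1)."""
    xbar, ybar = mean(x), mean(y)
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    return ybar - b1 * xbar, b1

def breusch_pagan(x, y):
    b0, b1 = ols(x, y)                                    # step 1
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]     # residuals
    sigma2 = sum(ei ** 2 for ei in e) / len(e)            # step 2
    p = [ei ** 2 / sigma2 for ei in e]                    # step 3
    g0, g1 = ols(x, p)                                    # step 4: auxiliary
    pbar = mean(p)
    ess = sum((g0 + g1 * xi - pbar) ** 2 for xi in x)     # explained SS
    stat = ess / 2.0                                      # step 5
    p_value = math.erfc(math.sqrt(stat / 2.0))            # steps 6-7, df = 1
    return stat, p_value
```

Running this on data whose noise grows with x yields a large statistic and a small p-value; on constant-noise data the statistic stays near zero.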

Fortunately, most (but not all) statistical software programs have this and other formal tests for heteroscedasticity built in. Another version of this test replaces the full set of independent variables in the auxiliary regression from Step 4 above with the predicted value of the dependent variable from Step 1 instead. That alters the degrees of freedom for the resulting Chi-squared test, but does not generally affect the resulting conclusion about whether or not the regression suffers from heteroscedasticity. If your statistical software has the test built in, you should double check to see which version is being estimated by default. If your original model only has one independent variable, the two versions of the Breusch–Pagan test are identical.

What to Do If You Have Heteroscedasticity A full exploration of what to do if you have evidence of heteroscedasticity is beyond the scope of this example. However, there are three basic approaches.

First, you could transform the data in an effort to remove the problem and then estimate the regression model on the transformed data. This is most commonly done through weighted least squares, where observations are transformed, or weighted by, an estimate of the inverse of their variance.

Second, you could use a different method to estimate the standard errors of the regression coefficients. This can be done by parametrically estimating so-called robust standard errors (often called White’s robust standard errors or Huber–White robust standard errors). There are numerous flavors of robust standard errors. This can also be done non-parametrically, most commonly by using bootstrapping.
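As a sketch of the non-parametric option: a pairs bootstrap resamples whole (x, y) observations with replacement, refits the slope each time, and uses the spread of the refitted slopes as its standard error. The minimal fitting routine here is a stand-in for whatever estimator you actually use:

```python
# Pairs-bootstrap standard error of an OLS slope, standard library only.
import random
from statistics import mean, stdev

def ols_slope(x, y):
    xbar, ybar = mean(x), mean(y)
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

def bootstrap_slope_se(x, y, reps=1000, seed=12345):
    """Standard deviation of slopes refit on resampled (x, y) pairs."""
    rng = random.Random(seed)   # fixed seed so results are reproducible
    n = len(x)
    slopes = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        slopes.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))
    return stdev(slopes)
```

Because each resampled observation keeps its x value and residual together, this standard error remains valid when the residual variance changes with x.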

Third, you could model the changing variance of the residual as a function of one or more of the independent variables directly. This would require using some method other than OLS – most commonly maximum likelihood estimation.


Illustrative Example: Kindergarten Reading and Math Performance in the U.S. This example explores whether the reading performance of kindergarten students in the U.S. can be modeled as a linear function of their math performance, using data from the Fall 1998 wave of the ECLSK survey. The focus will be on detecting whether or not there is evidence of heteroscedasticity in the residual from this regression. The research question guiding this example is:

Do higher performance scores in math predict higher reading scores among kindergarten students?

We can also state this in the form of a null hypothesis:

H0 = There is no linear relationship between student reading performance scores and their math performance scores in kindergarten.

The Data This example uses a subset of data from the first wave of the ECLSK dataset. This extract includes data from 11,933 students who were in kindergarten in the 1998–99 academic year. The two variables we examine are:

• Reading performance in the Fall of 1998 (c1r4rscl).
• Math performance in the Fall of 1998 (c1r4mscl).

Both of these performance score measures are scales based on student responses to a large number of test items in each area. Each scale was built using item response theory, which is a common method of measuring performance based on multiple test items. The reading score variable ranges from about 21 to just over 138, with a mean of 36 and a standard deviation of 10. The math score variable ranges from about 10 to almost 116, with a mean of 27 and a standard deviation of 9. Both variables are continuous measures, making them appropriate for simple regression.

Analyzing the Data Before producing the simple regression model, it is a good idea to look at each variable separately. However, in the interest of space, we forgo doing so here. Readers should explore the Sage Research Methods Dataset examples associated with Simple Regression and Multiple Regression for more information.

Regression results are often presented in a table that reports the coefficient estimates, their estimated standard errors, t-scores, and levels of significance. Table 1 presents the results of regressing student reading scores on their math scores.

Table 1: Results from a simple regression where a student’s reading score in kindergarten is regressed on their math score, ECLSK.

Coefficient Standard Error t-score Sig.

Constant 13.971 0.213 65.54 0.000

Math Score 0.810 0.007 65.54 0.000

The results report an estimate of the intercept (or constant) as equal to approximately 14. The constant of a simple regression model can be interpreted as the expected value of the dependent variable when the independent variable equals zero. In this case, our independent variable, c1r4mscl, never equals zero, so the constant by itself does not provide much information.

Table 1 reports that the value for the slope coefficient linking student math scores to student reading scores is estimated to be approximately 0.81. This represents the average marginal effect of the math score on the reading score, and can be interpreted as the expected change on average in the dependent variable for a one-unit increase in the independent variable. For this example, that means that every increase in a student’s math score of 1 point is associated with an average increase in a kindergarten student’s reading score of 0.81 points. Table 1 reports that this estimate is statistically significantly different from zero, with a p-value well below 0.001. This leads us to reject the null hypothesis and conclude that there does appear to be a positive relationship between math scores and reading scores among kindergarten students.
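As a quick illustration of what the slope means in practice, plugging the rounded Table 1 estimates into the fitted line shows that each one-point math gain shifts the prediction by exactly the slope (any real prediction should use the software’s own unrounded output):

```python
# Fitted line using the rounded Table 1 estimates (illustrative only).
def predicted_reading(math_score, b0=13.971, b1=0.810):
    """Predicted reading score = 13.971 + 0.810 * math score."""
    return b0 + b1 * math_score

# A one-point gain in the math score raises the predicted reading
# score by the slope, 0.81 points, regardless of the starting value.
gain = predicted_reading(28) - predicted_reading(27)
```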

Figure 2 presents a plot with the standardized residuals of this regression on the Y-axis and the standardized predicted values of the dependent variable on the X-axis. Figure 2 shows that the vertical spread of the residuals is relatively low among students with lower predicted reading scores. However, as we move left to right and the predicted reading scores increase, we see the vertical spread of the residuals also increasing. The resulting image appears like a cone or fan that is spreading out as we move from left to right in the figure. This means that the variance of the residuals is not constant and thus we appear to have evidence of heteroscedasticity.

Figure 2: Two-way scatter plot of standardized residuals from the regression shown in Table 1 on the Y-axis and standardized predicted values of the dependent variable from that regression on the X-axis, ECLSK.


Applying the steps of the Breusch–Pagan test to this example results in a test statistic of 11,543.55. When compared to a Chi-squared distribution with 1 degree of freedom, the resulting p-value falls well below the standard 0.05 level. We therefore have clear evidence to reject the null hypothesis of homoscedasticity and accept the alternative hypothesis that we do in fact have heteroscedasticity in the residual of this regression model. As noted above, researchers could respond by trying to correct for the problem, estimate standard errors that are robust to the presence of heteroscedasticity, or attempt to model it directly.

Presenting Results The results of a test for heteroscedasticity following this simple regression can be presented as follows:

“We used a subset of data from the ECLSK survey to test the following null hypothesis:

H0 = There is no linear relationship between student reading performance scores and their math performance scores in kindergarten.

The data include 11,933 students who were in kindergarten in the Fall of 1998. Results presented in Table 1 show that there is a positive and statistically significant relationship between math and reading performance scores. Specifically, the results show that every increase in math scores of 1 point is associated with an average increase in reading scores of about 0.81 points. Figure 2 shows a plot of the standardized residuals from this regression against the predicted values of the dependent variable. It reveals a clear pattern of heteroscedasticity as the vertical spread of the residuals appears to grow steadily from left to right. That evidence is further supported by a statistically significant Breusch–Pagan test for heteroscedasticity (Chi-squared = 11,543.55, df = 1, p-value < 0.05). Further analysis that either uses Weighted Least Squares, robust standard errors, or bootstrapped standard errors, or that models the non-constant variance directly, should be conducted.”

Review Linear regression models estimated via OLS rest on several assumptions. One such assumption is that the variance of the residuals is constant and unrelated to any of the independent variables in the model. Heteroscedasticity, or non-constant variance of the residuals, violates this assumption. Heteroscedasticity does not bias the estimates of regression coefficients, but it does result in incorrect estimates of their standard errors. Heteroscedasticity can be detected graphically or by using formal statistical tests such as the Breusch–Pagan test. Once discovered, researchers have several options for how to respond.

You should know:

• What types of models require testing for heteroscedasticity.
• How heteroscedasticity impacts the results of a regression model.
• How to detect heteroscedasticity graphically and statistically.
• How to report the results of attempting to detect heteroscedasticity.

Your Turn You can download this sample dataset along with a guide showing how to estimate a simple regression model using statistical software. This example used math and reading performance scores for kindergarten students measured in the Fall of 1998. The sample dataset includes a second math score (c2r4mscl) and reading score (c2r4rscl) for kindergarten students measured in the Spring of 1999. Try producing your own simple regression using these measures and then explore whether or not there is evidence of heteroscedasticity in the residuals of the regression.
