Learn to Test for Heteroscedasticity in SPSS With Data From the China Health and Nutrition Survey (2006)

© 2015 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This dataset example introduces readers to testing for heteroscedasticity following a linear regression. Linear regression rests on several assumptions, one of which is that the variance of the residual from the model is constant and unrelated to the independent variable(s). Constant variance is called homoscedasticity, while non-constant variance is called heteroscedasticity.

This example describes heteroscedasticity, discusses its consequences, and shows how to detect it using data on adults from the 2006 China Health and Nutrition Survey (CHNS) (http://www.cpc.unc.edu/projects/china). Specifically, we test whether systolic blood pressure is a linear function of a person’s age. After performing the regression, we show how to examine the results for evidence of heteroscedasticity. High blood pressure is associated with a number of negative health outcomes. Results from an analysis like this could therefore have implications for individual behavior and public health.

What Is Heteroscedasticity?

Linear regression expresses a dependent variable as a linear function of one or more independent variables. Equation 1 shows an example of a simple regression model with a single independent variable:

(1)

Yi = β0 + β1Xi + εi

Where:

• Yi = individual values of the dependent variable
• Xi = individual values of the independent variable
• β0 = the intercept, or constant, associated with the regression line
• β1 = the slope of the regression line
• εi = the unmodeled random, or stochastic, component of the dependent variable, often called the error term or the residual of the model.

Linear regression models are typically estimated using ordinary least squares (OLS) regression. OLS regression rests on several assumptions, one of which is that the variance of the residuals from the regression model (εi) is constant and unrelated to the independent variable(s). Heteroscedasticity implies that the variance of the residual is not constant, which means that an important assumption of OLS has been violated.

Consequences of Heteroscedasticity

The presence of heteroscedasticity does not affect the estimated values of the intercept or slope coefficients of a linear regression model. Those estimates remain unbiased.

However, heteroscedasticity does affect the estimated standard errors for those coefficients. It can make them either too large or too small, but most often it makes them too small.

The standard error of any statistic is calculated to provide an estimate of how much that statistic might change if it were calculated again on another random sample of data of the same size taken from the same population. For regression, the standard error for each coefficient provides an estimate of uncertainty about that coefficient. We use both the coefficients and their standard errors when we test hypotheses about those coefficients.

For example, we might estimate a regression model like the one presented in Equation 1 and produce an estimate of β1 of 2.5. In order to determine whether 2.5 is statistically significantly different from zero, we need to perform a hypothesis test. Specifically, we would test the null hypothesis that β1 = 0. To do so, we would:

1. Perform the regression analysis to produce our estimates of β1 and its standard error.
2. Divide the estimate of β1 by its standard error to produce a T-score.
3. Compare the T-score from the previous step to a Student’s T distribution, with degrees of freedom equal to the sample size minus the number of coefficients estimated as part of the original regression.
4. Determine the level of significance, or p-value, associated with the calculated T-score. Typically, if that p-value is less than 0.05, researchers would declare the result to be statistically significant.
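The steps above can be sketched in a few lines of Python, used here for illustration rather than SPSS. The function name `coefficient_t_test` and the standard error of 1.0 are assumptions made for this example and do not come from the guide. The p-value uses the standard normal distribution as a large-sample stand-in for the Student’s T distribution, which is only a reasonable shortcut when the degrees of freedom are large.

```python
from statistics import NormalDist


def coefficient_t_test(estimate, std_error):
    """Return the T-score and an approximate two-sided p-value for H0: coefficient = 0."""
    # Step 2: divide the coefficient estimate by its standard error.
    t_score = estimate / std_error
    # Steps 3-4 (approximation): with a large sample, the Student's T distribution
    # is very close to the standard normal, so the normal CDF is used here.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_score)))
    return t_score, p_value


# Hypothetical inputs: a slope estimate of 2.5 with an assumed standard error of 1.0.
t_score, p_value = coefficient_t_test(2.5, 1.0)
print(f"T-score = {t_score:.2f}, approximate p-value = {p_value:.4f}")
```

If the standard error in this sketch were understated, the T-score would be inflated and the p-value would shrink, which is exactly the risk that heteroscedasticity creates.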

Of course, statistical software generally performs all of these steps for us automatically. However, this process and those computer programs assume that the variance of the residuals is constant. As noted above, heteroscedasticity leads to incorrect estimates of standard errors. As a result, heteroscedasticity will confound subsequent hypothesis testing. Because heteroscedasticity typically produces standard errors that are smaller than they should be, we run the risk of being over-confident in our coefficient estimates and possibly declaring a coefficient estimate to be statistically significant when it is not.


Detecting Heteroscedasticity

There are two main strategies for detecting heteroscedasticity. The first approach is graphical. For a simple regression model, a two-way scatter plot with the residuals from the regression model plotted on the Y-axis and the independent variable plotted on the X-axis is a good place to start.

For a multiple regression model, you could produce separate plots like this for each independent variable – with the residuals plotted on the Y-axis and the independent variable in question plotted on the X-axis.

For either a simple or a multiple regression model, it is quite common to plot the residuals on the Y-axis and the predicted value of the dependent variable based on the regression model on the X-axis. Often both the residuals and the predicted values of the dependent variable are standardized before being plotted. Figure 1 shows an example of this approach.

Figure 1: Illustration comparing what homoscedasticity and heteroscedasticity look like using two-way scatter plots with the standardized residual from the regression plotted on the Y-axis and the standardized predicted value of the dependent variable from the regression plotted on the X-axis.


Regardless of which figures you produce, if the variance of the residuals is constant, you should see the same level of vertical spread among the residuals across all values plotted along the X-axis as you look from left to right. However, if you have heteroscedasticity, you will see changes in the vertical spread among the residuals as you look across the figure from left to right. That spread could be steadily growing, steadily shrinking, or showing a more complex pattern such as less variance in the center of the data than at both the lower and upper extremes.

The panel on the left in Figure 1 shows how the vertical spread of the standardized residuals is roughly the same as you look from left to right. However, the panel on the right in Figure 1 shows a smaller vertical spread for the residuals initially that spreads out wider as you move from left to right. The panel on the left shows what homoscedasticity looks like. The panel on the right shows what one of the more common forms of heteroscedasticity looks like.
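A rough numeric companion to the visual check is to compare the spread of the residuals in the left, middle, and right thirds of the plot. The sketch below does this in Python (used here for illustration rather than SPSS) on made-up data whose noise grows with X. The function names and the data are assumptions for this example and are not taken from the CHNS dataset.

```python
from statistics import mean, pstdev


def fit_simple_ols(x, y):
    """Return (intercept, slope) for a one-predictor OLS regression."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def residual_spread_by_third(x, y):
    """Standard deviation of the residuals in the lower, middle, and upper
    thirds of the predicted values (left to right on the plot)."""
    intercept, slope = fit_simple_ols(x, y)
    pairs = sorted(
        (intercept + slope * xi, yi - (intercept + slope * xi))
        for xi, yi in zip(x, y)
    )
    n = len(pairs)
    cuts = [0, n // 3, 2 * n // 3, n]
    return [
        pstdev([residual for _, residual in pairs[lo:hi]])
        for lo, hi in zip(cuts, cuts[1:])
    ]


# Made-up data: the noise alternates in sign and grows in size with x.
x = list(range(1, 61))
y = [2 + 0.5 * xi + (-1) ** xi * 0.1 * xi for xi in x]

spreads = residual_spread_by_third(x, y)
print([round(s, 2) for s in spreads])
```

Spreads that rise or fall markedly from one third to the next correspond to the fanning pattern in the right-hand panel of Figure 1. Roughly equal spreads correspond to the left-hand panel.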

Graphical methods are useful for seeing the data, but they lack formal statistical precision. As a result, many researchers turn to formal statistical tests for heteroscedasticity. There are many tests of the null hypothesis of homoscedasticity, but the most common is the Breusch–Pagan test (sometimes called the Breusch–Pagan–Godfrey test; also developed independently as the Cook–Weisberg test).

The Breusch–Pagan test unfolds in several steps:

1. Estimate the regression model of interest and save the residuals, εi, for each observation.
2. Compute a number we will label $\hat{\sigma}^2$ by:
   2.1 squaring each residual, εi
   2.2 summing up those squared residuals
   2.3 dividing the result by the size of your sample from the regression model you estimated.
3. Compute a new variable named ρi as equal to $\varepsilon_i^2 / \hat{\sigma}^2$.
4. Run a new auxiliary regression with ρi as the dependent variable and all of the same independent variables that were part of your original regression.
5. Compute the explained sum of squares from this auxiliary regression and divide it by 2.
6. Compare the result from the previous step to a Chi-squared distribution with degrees of freedom equal to the number of independent variables included in the auxiliary regression.
7. If the level of significance, or p-value, associated with the result from the previous step is small – typically below 0.05 – you can reject the null hypothesis of homoscedasticity and declare that you do in fact have evidence of heteroscedasticity.
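The seven steps can be followed by hand for a simple regression. The Python sketch below (used here for illustration rather than SPSS) covers only the one-predictor case, where the Chi-squared distribution has 1 degree of freedom and its upper-tail probability can be written with the complementary error function. The function names and the made-up data are assumptions for this example. It is a teaching sketch, not a replacement for a tested statistical routine.

```python
import math
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) for a one-predictor OLS regression."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def breusch_pagan_simple(x, y):
    """Breusch-Pagan statistic and p-value for a regression with one predictor."""
    n = len(x)
    # Step 1: estimate the model and save the residuals.
    b0, b1 = fit_simple_ols(x, y)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Step 2: sum of squared residuals divided by the sample size.
    sigma2_hat = sum(e ** 2 for e in residuals) / n
    # Step 3: scaled squared residuals.
    rho = [e ** 2 / sigma2_hat for e in residuals]
    # Step 4: auxiliary regression of rho on the same independent variable.
    a0, a1 = fit_simple_ols(x, rho)
    fitted = [a0 + a1 * xi for xi in x]
    # Step 5: explained sum of squares from the auxiliary regression, halved.
    rho_bar = mean(rho)
    statistic = sum((f - rho_bar) ** 2 for f in fitted) / 2.0
    # Steps 6-7: upper tail of a Chi-squared distribution with 1 degree of
    # freedom, which equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up data: the noise alternates in sign and grows in size with x.
x = list(range(1, 61))
y = [2 + 0.5 * xi + (-1) ** xi * 0.1 * xi for xi in x]

statistic, p_value = breusch_pagan_simple(x, y)
print(f"Breusch-Pagan statistic = {statistic:.2f}, p-value = {p_value:.6f}")
```

With more than one independent variable, Step 4 becomes a multiple regression and Step 6 needs a Chi-squared distribution with more degrees of freedom, which this sketch does not handle.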

Fortunately, most (but not all) statistical software programs have this and other formal tests for heteroscedasticity built in. Another version of this test replaces the full set of independent variables in the auxiliary regression from Step 4 above with the predicted value of the dependent variable from Step 1 instead. That alters the degrees of freedom for the resulting Chi-squared test, but does not generally affect the resulting conclusion about whether or not the regression suffers from heteroscedasticity. If your statistical software has the test built in, you should double check to see which version is being estimated by default. If your original model only has one independent variable, the two versions of the Breusch–Pagan test are identical.

What to Do If You Have Heteroscedasticity

A full exploration of what to do if you have evidence of heteroscedasticity is beyond the scope of this example. However, there are three basic approaches.

First, you could transform the data in an effort to remove the problem and then estimate the regression model on the transformed data. This is most commonly done through weighted least squares (WLS), where observations are transformed, or weighted, by an estimate of the inverse of their variance.
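As a minimal illustration of the weighting idea, the Python sketch below (used here for illustration rather than SPSS) fits a one-predictor regression in which each observation is weighted by the inverse of its variance. The function name, the made-up data, and especially the assumption that the residual variance is proportional to the square of X are all hypothetical. In practice the variances are unknown and have to be estimated.

```python
def fit_weighted_least_squares(x, y, variances):
    """Return (intercept, slope) from a one-predictor regression in which each
    observation is weighted by the inverse of its (assumed) residual variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    # Weighted means of x and y.
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    # Weighted versions of the usual OLS slope and intercept formulas.
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data: the noise alternates in sign and grows in size with x.
x = list(range(1, 61))
y = [2 + 0.5 * xi + (-1) ** xi * 0.1 * xi for xi in x]

# Assumption for illustration only: residual variance proportional to x squared.
assumed_variances = [xi ** 2 for xi in x]
intercept, slope = fit_weighted_least_squares(x, y, assumed_variances)
print(f"WLS intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Observations with large assumed variance receive little weight, so the noisier part of the data has less influence on the fitted line.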

Second, you could use a different method to estimate the standard errors of the regression coefficients. This can be done by parametrically estimating so-called robust standard errors (often called White’s robust standard errors or Huber–White robust standard errors). There are numerous flavors of robust standard errors. This can also be done non-parametrically, most commonly by using bootstrapping.
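For a simple regression, the most basic flavor of White’s robust standard error (often labeled HC0) can be computed directly and compared with the conventional OLS standard error. The Python sketch below (used here for illustration rather than SPSS) does this for the slope only. The function name and the made-up data are assumptions for this example, and other flavors apply small-sample corrections that are not shown.

```python
import math
from statistics import mean


def slope_standard_errors(x, y):
    """Return (slope, conventional SE, White HC0 robust SE) for a one-predictor OLS fit."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Conventional standard error: assumes one common residual variance.
    sigma2 = sum(e ** 2 for e in residuals) / (n - 2)
    conventional_se = math.sqrt(sigma2 / sxx)

    # HC0 robust standard error: each squared residual stands in for the
    # variance of its own observation.
    meat = sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, residuals))
    robust_se = math.sqrt(meat) / sxx
    return slope, conventional_se, robust_se


# Made-up data: the noise alternates in sign and grows in size with x.
x = list(range(1, 61))
y = [2 + 0.5 * xi + (-1) ** xi * 0.1 * xi for xi in x]

slope, conventional_se, robust_se = slope_standard_errors(x, y)
print(f"slope = {slope:.3f}")
print(f"conventional SE = {conventional_se:.4f}, robust SE = {robust_se:.4f}")
```

The slope estimate is the same under both approaches. Only the standard error, and therefore the T-score and p-value, changes.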

Third, you could model the changing variance of the residual as a function of one or more of the independent variables directly. This would require using some method other than OLS – most commonly maximum likelihood estimation.


Illustrative Example: Blood Pressure and Age in China

This example explores whether a person’s systolic blood pressure can be modeled as a linear function of a person’s age. The focus will be on detecting whether or not there is evidence of heteroscedasticity in the residual from this regression. The research question guiding this example is:

Do older people in China tend to have a higher systolic blood pressure?

We can also state this in the form of a null hypothesis:

H0: There is no linear relationship between systolic blood pressure and age in China.

The Data

This example uses two variables from the 2006 China Health and Nutrition Survey:

• A person’s systolic blood pressure (systolic).
• A person’s age, measured in years (age).

There are 9178 respondents in this survey. Systolic blood pressure measures the pressure in a person’s arteries when their heart beats (contracts and pumps blood). In this dataset, this variable ranges from 70 to 240 with a mean of about 122 and a standard deviation of 18.14. Age is measured in years, and in this dataset it ranges from 17 to 95 with a mean of about 49 and a standard deviation of 15.19. Both of these variables are continuous, making them appropriate for simple regression.

Analyzing the Data

Before producing the simple regression model, it is a good idea to look at each variable separately. However, in the interest of space, we forgo doing so here. Readers should explore the SAGE Research Methods Datasets examples associated with Simple Regression and Multiple Regression for more information.

Regression results are often presented in a table that reports the coefficient estimates, their estimated standard errors, t-scores, and levels of significance. Table 1 presents the results of regressing systolic blood pressure on age.

Table 1: Results from a simple regression where systolic blood pressure is regressed on age, 2006 China Health and Nutrition Survey.

                  Coefficient   Standard Error   t-score   Sig.
Constant          98.56         0.588            167.59    0.000
Age (in Years)    0.47          0.011            41.35     0.000

The results report an estimate of the intercept (or constant) as equal to approximately 98.56. The constant of a simple regression model can be interpreted as the expected value of the dependent variable when the independent variable equals zero. In this case, our independent variable, age, can never equal zero, so the constant by itself does not provide much information.

Table 1 reports that the value for the slope coefficient linking age to systolic blood pressure is estimated to be approximately 0.47. This represents the average marginal effect of age on systolic blood pressure, and can be interpreted as the expected change on average in the dependent variable for a one-unit increase in the independent variable. For this example, that means that every increase in age of 1 year is associated with an average increase in systolic blood pressure of 0.47. Table 1 reports that this estimate is statistically significantly different from zero, with a p-value well below 0.001. This leads us to reject the null hypothesis and conclude that there does appear to be a positive relationship between a person’s age and their systolic blood pressure in China.
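To make the interpretation of the coefficients concrete, the short Python sketch below (used here for illustration rather than SPSS) plugs the rounded estimates from Table 1 into Equation 1 to produce predicted systolic blood pressure at a few ages. The function name and the chosen ages are assumptions for this example, and because the coefficients are rounded, the predictions are approximate.

```python
# Rounded coefficient estimates taken from Table 1.
INTERCEPT = 98.56
SLOPE_PER_YEAR = 0.47


def predicted_systolic(age_years):
    """Predicted systolic blood pressure from the fitted line in Table 1."""
    return INTERCEPT + SLOPE_PER_YEAR * age_years


# Ages chosen for illustration; all fall inside the observed range of 17 to 95.
for age in (30, 50, 70):
    print(f"age {age}: predicted systolic = {predicted_systolic(age):.2f}")
```

Each additional year of age moves the prediction up by the slope of 0.47, whatever the starting age. The prediction at age 50 also lands close to the sample mean of about 122 reported earlier, as expected for an age near the sample mean of about 49.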

Figure 2: Two-way scatter plot of standardized residuals from the regression shown in Table 1 on the Y-axis and standardized predicted values of the dependent variable from that regression on the X-axis, 2006 China Health and Nutrition Survey.

Figure 2 presents a plot with the standardized residuals of this regression on the Y-axis and the standardized predicted values of the dependent variable on the X-axis. Figure 2 shows that the vertical spread of the residuals is relatively low for respondents with lower predicted levels of systolic blood pressure. However, as we move left to right and the predicted level of systolic blood pressure increases, we see the vertical spread of the residuals also increasing. This spread appears to shrink somewhat at the very highest predicted values for systolic blood pressure. Overall, Figure 2 shows a pattern in the variance of the residuals, meaning that we appear to have evidence of heteroscedasticity. As noted above, researchers could respond by trying to correct for the problem, estimate standard errors that are robust to the presence of heteroscedasticity, or attempt to model it directly.

Applying the steps of the Breusch–Pagan test to this example results in a test statistic of 652.33. When compared to a Chi-squared distribution with 1 degree of freedom, the resulting p-value falls well below the standard 0.05 level. We therefore have clear evidence to reject the null hypothesis of homoscedasticity and accept the alternative hypothesis that we do in fact have heteroscedasticity in the residual of this regression model.
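The comparison with the Chi-squared distribution can be checked directly. For 1 degree of freedom, the upper-tail probability of a Chi-squared value equals the complementary error function evaluated at the square root of half that value. The Python sketch below (used here for illustration rather than SPSS) applies that identity to the reported test statistic. The function name is an assumption for this example, and the shortcut applies only when there is 1 degree of freedom.

```python
import math


def chi2_df1_p_value(statistic):
    """Upper-tail probability of a Chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Test statistic reported for the blood pressure and age regression.
p_value = chi2_df1_p_value(652.33)
print(f"p-value below 0.05: {p_value < 0.05}")

# For comparison, the conventional 5% critical value for 1 degree of freedom is about 3.84.
print(f"p-value at 3.84: {chi2_df1_p_value(3.84):.3f}")
```

A statistic of 652.33 is far beyond the critical value of about 3.84, which is why the p-value is so small.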

Presenting Results

The results of a test for heteroscedasticity following this simple regression can be presented as follows:

“We used a subset of data from the 2006 China Health and Nutrition Survey to test the following null hypothesis:

H0: There is no linear relationship between systolic blood pressure and age in China.

The data include 9178 respondents. Results presented in Table 1 show that there is a positive and statistically significant relationship between a person’s age and their systolic blood pressure. Specifically, the results show that every increase in age of 1 year is associated with an average increase of 0.47 in systolic blood pressure. Figure 2 shows a plot of the standardized residuals from this regression against the predicted values of the dependent variable. It reveals a clear pattern of heteroscedasticity as the vertical spread of the residuals appears to grow steadily from left to right, though with some reduction in spread at the furthest right portion of the figure. That evidence is further supported by a statistically significant Breusch–Pagan test for heteroscedasticity (Chi-squared = 652.33, df = 1, p-value < 0.05). Further analysis that uses weighted least squares, robust standard errors, or bootstrapped standard errors, or that models the non-constant variance directly, should be conducted.”

Review

Linear regression models estimated via OLS rest on several assumptions. One such assumption is that the variance of the residuals is constant and unrelated to any of the independent variables in the model. Heteroscedasticity, or non-constant variance of the residuals, violates this assumption. Heteroscedasticity does not bias the estimates of regression coefficients, but it does result in incorrect estimates of their standard errors. Heteroscedasticity can be detected graphically or by using formal statistical tests such as the Breusch–Pagan test. Once it is discovered, researchers have several options for how to respond.

You should know:

• What types of models require testing for heteroscedasticity.
• How heteroscedasticity impacts the results of a regression model.
• How to detect heteroscedasticity graphically and statistically.
• How to report the results of attempting to detect heteroscedasticity.

Your Turn

You can download this sample dataset along with a guide showing how to estimate a simple regression model using statistical software. The sample dataset also includes a variable that measures diastolic blood pressure (diastolic), which measures blood pressure between heart beats. Try producing a simple regression replacing systolic blood pressure with diastolic blood pressure as the dependent variable and then explore whether or not there is evidence of heteroscedasticity in the residuals of the regression.
