
Learn to Test for Multicollinearity in SPSS With Data From the English Health Survey (Teaching Dataset) (2002)

© 2019 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This example dataset demonstrates how to test for multicollinearity (and collinearity) prior to running a multiple linear regression analysis. Collinearity is an association or correlation between two predictor (or independent) variables in a regression model; multicollinearity is where more than two predictor (or independent) variables are associated. The absence of collinearity or multicollinearity within a dataset is an assumption of a range of statistical tests, including multilevel modelling, logistic regression, factor analysis, and multiple linear regression. This example demonstrates how to test for multicollinearity specifically in multiple linear regression and shows how to compute and interpret the relevant diagnostics. We illustrate how to test for multicollinearity using a subset of data from the 2002 English Health Survey (Teaching Dataset). This page provides links to this sample dataset and a guide to testing for multicollinearity using statistical software.

What Is Multicollinearity?

Collinearity is where there is a correlation (or association) between two predictor (or independent) variables in a regression model; multicollinearity is an association between more than two predictor (or independent) variables in a regression model. Multicollinearity and collinearity are problems for a range of statistical tests, including multilevel modelling, logistic regression, and multiple linear regression. However, this example focuses on testing for multicollinearity in a multiple linear regression. In a multiple linear regression, we want to examine whether there is a linear relationship between more than one predictor (or independent) variable and an outcome (or dependent) variable. For example, we may be interested in building a model of the predictors of intelligence. Our dependent variable might be IQ. We might be interested in the following four predictor variables: shoe size, gender, age, and height. We might find high correlations (≥0.90) between shoe size and age, and between age and height. Indeed, all three variables may exhibit multicollinearity, being almost perfectly linearly related. This situation would impair our ability to understand the relationship between the predictor and outcome variables; indeed, removing one or more of these predictors would make little difference to the overall fit of the model of IQ, so multicollinearity can render predictor variables redundant. If there is multicollinearity between the predictor variables, the statistical significance of individual predictors can also be reduced (i.e., their p-values increase).

There are three key impacts of multicollinearity on the development of a multiple linear regression model. Firstly, as collinearity increases, so do the standard errors of the b coefficients. This leads to greater variance of the b coefficients, which may make them unstable from sample to sample and unrepresentative of the population. Secondly, the signs (positive or negative) associated with the regression coefficients may be contrary to what is expected. For example, we may expect that the coefficients of shoe size, age, and height will all be positive. However, if shoe size and age are highly correlated and shoe size and height are highly correlated, then it is possible for one of the estimated coefficients to turn out negative. Lastly, it is possible for the overall model to be statistically significant even though none of the individual coefficients is significant. In other words, all the predictor variables, when taken collectively, may provide a good model fit to the outcome variable, leading to a small residual sum of squares value, yet an individual regression coefficient may not accurately reflect the effect of its particular predictor variable because it is being influenced by the other variables in the model.

Causes of Multicollinearity

Fortunately, so-called perfect collinearity, where at least one predictor variable is perfectly correlated with another (i.e., it has a correlation coefficient of 1), is rare in real-life research contexts. Low-level collinearity is to be expected, especially in observational data, and is not that problematic. However, it is important to assess levels of collinearity (and multicollinearity) within your data because as collinearity increases, so do the potential problems that will impact upon your analysis.

Multicollinearity is less common with experimental data, where researchers have a high level of control over the design and collection of their data, with extraneous and similar variables being more easily controlled or removed. Of course, poorly designed experiments may still produce data with multicollinearity. Multicollinearity is more common with observational data, which is less easily controlled or manipulated and more likely to exhibit highly correlated variables that can confound statistical models. Researchers can also create the conditions for multicollinearity in three ways. First, by choosing variables which are highly similar or identical, for example, height in centimetres and height in inches, or cultural identity and ethnicity. Secondly, by introducing a variable into a model that is a combination of two other variables, for example, total household income, which is actually the combination of “income from work” and “income from investments.” Lastly, by incorrectly defining and using dummy variables, for example, failing to exclude one (reference) category from the dummy coding. In addition, a researcher may collect insufficient data – larger samples may be less likely to exhibit multicollinearity. Thus, researchers can attempt to prevent multicollinearity by effective research design, data cleaning, and preparation.
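To make the last two causes concrete, here is a minimal Python sketch. The variable names and values are invented purely for illustration and are not taken from the survey; the point is how retaining every dummy category, or adding a total alongside its components, produces perfect collinearity.

```python
import numpy as np
import pandas as pd

# Invented example data: an employment sector and two income components.
df = pd.DataFrame({
    "sector": ["public", "private", "self-employed", "private", "public"],
    "income_work": [30_000, 45_000, 28_000, 52_000, 33_000],
    "income_invest": [1_000, 5_000, 12_000, 2_000, 4_000],
})

# Dummy coding: keeping ALL categories means the dummies always sum to 1,
# which is perfectly collinear with the model's constant (the "dummy trap").
all_dummies = pd.get_dummies(df["sector"], dtype=float)
print(all_dummies.sum(axis=1))                        # every row sums to 1.0

# Dropping one (reference) category removes the problem.
safe_dummies = pd.get_dummies(df["sector"], drop_first=True, dtype=float)
print(safe_dummies.columns.tolist())

# A "total" alongside its components makes the columns linearly dependent,
# so the design matrix loses rank (perfect multicollinearity).
df["income_total"] = df["income_work"] + df["income_invest"]
X = df[["income_work", "income_invest", "income_total"]].to_numpy(dtype=float)
print(np.linalg.matrix_rank(X))                       # 2, not 3
```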

Detecting Multicollinearity

The easiest way to detect multicollinearity is to use a statistical software package which can test for multicollinearity alongside other required assumptions of the linear regression model. However, you can test for multicollinearity manually by calculating the correlation coefficients for all pairs of predictor variables. Let’s return to our earlier example of predictors of IQ: shoe size, age, and height, and imagine that we have collected data from 20 respondents. Table 1 shows the data collected.

Table 1: Respondents’ Shoe Size, Age, and Height.

Shoe size (UK size) Age (in years) Height (in cm)

1 8 147

3 10 152

4 11 155

5 16 163

10 18 180

7 20 170

6 24 165

6 28 165

4 32 155

11 35 186

11 38 186


12 42 194

8 45 168

5 52 163

6 55 165

8 58 168

12 62 196

9 65 173

7 72 170

5 85 163

We can calculate the correlation coefficient using Equation 1:

(1)

r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}

Let’s start by calculating the correlation coefficient for our first pair of variables, shoe size and age. Table 2 shows our data with the various calculations required by Equation 1.

Table 2: Respondents’ Shoe Size and Age.

Shoe size (UK size) (x)   Age (y)   x − x̄   y − ȳ   (x − x̄)(y − ȳ)   (x − x̄)²   (y − ȳ)²

1    8    −6   −30.8   184.8   36   948.64
3    10   −4   −28.8   115.2   16   829.44
4    11   −3   −27.8   83.4    9    772.84
5    16   −2   −22.8   45.6    4    519.84
10   18   3    −20.8   −62.4   9    432.64
7    20   0    −18.8   0.0     0    353.44
6    24   −1   −14.8   14.8    1    219.04
6    28   −1   −10.8   10.8    1    116.64
4    32   −3   −6.8    20.4    9    46.24
11   35   4    −3.8    −15.2   16   14.44
11   38   4    −0.8    −3.2    16   0.64
12   42   5    3.2     16.0    25   10.24
8    45   1    6.2     6.2     1    38.44
5    52   −2   13.2    −26.4   4    174.24
6    55   −1   16.2    −16.2   1    262.44
8    58   1    19.2    19.2    1    368.64
12   62   5    23.2    116.0   25   538.24
9    65   2    26.2    52.4    4    686.44
7    72   0    33.2    0.0     0    1,102.24
5    85   −2   46.2    −92.4   4    2,134.44

x̄ = 7   ȳ = 38.8   Σ = 469.0   Σ = 182   Σ = 9,569.2

We can now populate Equation 1 with our data.

r = \frac{469.0}{\sqrt{182} \times \sqrt{9569.2}} = \frac{469.0}{13.49 \times 97.82} = \frac{469.0}{1319.6} = 0.355

We can see that the r value for shoe size and age indicates a weak-to-moderate positive correlation. Generally, predictors are interpreted as being highly correlated if the r value is above 0.8 (or below −0.8); for our example, this suggests that our predictors (shoe size and age) do not have a worrying level of collinearity. We could, if we wanted, repeat this process for our other pairs of variables: age and height, and shoe size and height. This approach gives only a basic, “ball park” figure, and it is also cumbersome. It is much easier to use statistical software to run a number of multicollinearity diagnostics instead.
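If you want to check the hand calculation, the Table 1 data can be entered directly and Equation 1 applied. This is a minimal sketch, assuming NumPy is available; the guide itself works in SPSS.

```python
import numpy as np

# Shoe size, age, and height for the 20 respondents in Table 1.
shoe_size = np.array([1, 3, 4, 5, 10, 7, 6, 6, 4, 11, 11, 12, 8, 5, 6, 8, 12, 9, 7, 5])
age = np.array([8, 10, 11, 16, 18, 20, 24, 28, 32, 35, 38, 42, 45, 52, 55, 58, 62, 65, 72, 85])
height = np.array([147, 152, 155, 163, 180, 170, 165, 165, 155, 186, 186, 194,
                   168, 163, 165, 168, 196, 173, 170, 163])

def pearson_r(x, y):
    """Equation 1 written out explicitly."""
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

print(round(pearson_r(shoe_size, age), 3))      # shoe size vs. age (approx. 0.355)
print(round(pearson_r(age, height), 3))         # age vs. height
print(round(pearson_r(shoe_size, height), 3))   # shoe size vs. height
```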

Multicollinearity Diagnostics

There are a number of diagnostics that we can run using statistical software to detect multicollinearity. These include:

• Reviewing the correlation matrix for predictor variables that correlate highly (r values above 0.8 or below −0.8). This repeats the process we carried out manually, but using software.
• Computing the Variance Inflation Factor (henceforth VIF) and the Tolerance Statistic, which is its reciprocal. These highlight whether a predictor has a strong linear relationship with the other predictors. Multicollinearity is a concern if the largest VIF is greater than 10 (equivalently, a Tolerance below 0.1), if the average VIF is substantially greater than 1, and/or if any Tolerance is below 0.2.
• Computing the Eigenvalues of the scaled, uncentred cross-products matrix, the condition indexes, and the variance proportions. If several predictors have high variance proportions on the same small Eigenvalue, multicollinearity is indicated; if each predictor has the majority of its variance loading on a different dimension from the other predictors, there is no multicollinearity (all three diagnostics are illustrated in the sketch below).
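The same three diagnostics can also be run outside SPSS. The following is a minimal Python sketch using invented data; the variable names and the height-based construction of BMI are assumptions for illustration only, not part of the original guide.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictor data purely for illustration.
rng = np.random.default_rng(1)
n = 200
age = rng.uniform(16, 96, n)
weight = rng.normal(75, 16, n)
bmi = weight / rng.normal(1.72, 0.1, n) ** 2   # BMI built from weight, so correlated
predictors = pd.DataFrame({"age": age, "weight": weight, "bmi": bmi})

# 1. Correlation matrix: look for |r| > 0.8 between predictors.
print(predictors.corr().round(3))

# 2. VIF and Tolerance for each predictor (constant included, as in a regression).
X = sm.add_constant(predictors)
for i, name in enumerate(predictors.columns, start=1):
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, Tolerance = {1 / vif:.2f}")

# 3. Eigenvalues and condition indexes of the scaled, uncentred
#    cross-products matrix (each column scaled to unit length).
X_scaled = X.values / np.sqrt((X.values ** 2).sum(axis=0))
eigenvalues = np.sort(np.linalg.eigvalsh(X_scaled.T @ X_scaled))[::-1]
condition_index = np.sqrt(eigenvalues[0] / eigenvalues)
print(eigenvalues.round(4), condition_index.round(2))
```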

Assumptions Behind the Method

Nearly every statistical test relies on some underlying assumptions, and they are all affected by the type of data that you have. Multicollinearity is not a statistical test as such and therefore has no specific assumptions. Testing for multicollinearity is in fact one of a number of checks carried out to meet the assumptions of the linear model. When we test for multicollinearity, we do assume that our data are randomly selected and measured at the continuous or interval level. These assumptions are not typically testable from the sample data and are related to the research design.

Illustrative Example: Systolic Blood Pressure, Weight, and Age

This example illustrates how to test for multicollinearity prior to fitting a multiple linear regression model. It uses data from the 2002 English Health Survey (Teaching Dataset). In our example, we want to test for multicollinearity prior to running a multiple linear regression that addresses the following research questions:

Is the systolic blood pressure of adults in England positively associated with their age after controlling for the effect of weight and body mass index (BMI)?
Is the systolic blood pressure of adults in England positively associated with their weight after controlling for the effect of age and BMI?
Is the systolic blood pressure of adults in England positively associated with their BMI after controlling for the effect of age and weight?

Stated in the form of null hypotheses:

H0a = After controlling for the effect of weight and BMI, there is no linear relationship between people’s age and their systolic blood pressure.
H0b = After controlling for the effect of BMI and age, there is no linear relationship between people’s weight and their systolic blood pressure.
H0c = After controlling for the effect of weight and age, there is no linear relationship between people’s BMI and their systolic blood pressure.

We will not be testing these specific hypotheses in this example; this is just to illustrate what linear model we are interested in.

The Data

This example uses a subset of data derived from the 2002 English Health Survey (Teaching Dataset). This extract includes 4,477 respondents. It should be noted that the original dataset is larger than this, but it has been “cleaned” to include only those who responded on our dependent variable. The four variables we examine are:

• Age last birthday (age)
• (D) Valid Mean Systolic BP (sysval)
• (D) Valid weight (kg) inc. estimated > 130 kg (wtval)
• (D) Valid BMI – inc estimated > 130 kg (bmival)

Systolic blood pressure measures the pressure in a person’s arteries at the point the heart beats. High systolic blood pressure can lead to a range of health conditions and problems. All the variables are continuous, which makes them suitable for multiple linear regression.

Analyzing the Data


Univariate Analysis

Prior to testing for multicollinearity, we should start by examining the distributions of our four variables, as shown in Table 3.

Table 3: Frequency Distribution of age, sysval, wtval, and bmival.

age sysval wtval bmival

N (Valid) 4,477 4,477 4,477 4,477
N (Missing) 0 0 0 0

Mean 47.56 132.5449 75.5685 26.9718

Median 46.00 130.0000 74.0000 26.3321

Standard deviation 17.997 18.20476 16.07723 5.04947

Range 80 144.50 144.60 44.30

Minimum 16 79.00 38.30 15.15

Maximum 96 223.50 182.90 59.45

The ages in these data range from 16 to 96 years, with a mean of 47.6 years. Respondents’ systolic blood pressure ranges from 79 to 223.5 mmHg, with a mean of 132.5. Respondents’ weights range from 38.3 to 182.9 kg, with a mean of 75.6 kg; BMI ranges from 15.15 to 59.45, with a mean of 26.97.
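As a hedged cross-check on Table 3, the same summary statistics can be computed with pandas. The guide itself produces this table in SPSS, and the file name below is only a placeholder for the downloaded extract.

```python
import pandas as pd

# Placeholder file name: the cleaned extract of the 2002 survey data.
df = pd.read_csv("health_survey_2002_subset.csv")

variables = ["age", "sysval", "wtval", "bmival"]
summary = df[variables].agg(["count", "mean", "median", "std", "min", "max"])
summary.loc["range"] = summary.loc["max"] - summary.loc["min"]
print(summary.round(2))
```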

Prior to running our multiple linear regression, and alongside our testing for multicollinearity, we would also test for the other assumptions of the linear model – linearity and additivity, normality, homoscedasticity, and independence of errors. In this example, we will not be testing these assumptions; for more information, please see the SAGE Research Methods Guide to Multiple Linear Regression.


Testing for Multicollinearity

Earlier we identified a number of diagnostics for multicollinearity. We shall look at each one in turn, although when you run them in statistical software, they will be generated at the same time, as one output.

Review the Correlation Matrix for Predictor Variables That Correlate Highly

We can review the Correlation Matrix in order to identify any predictor variables that correlate highly. We can see from Table 4 that wtval and bmival correlate highly (r = 0.831), suggesting that there may be collinearity in our data.

Table 4: Correlation Matrix for age, sysval, wtval, and bmival.

sysval age wtval bmival

Pearson correlation sysval 1.000 .473 .187 .250

age .473 1.000 .039 .186

wtval .187 .039 1.000 .831

bmival .250 .186 .831 1.000

Significance (one-tailed) sysval .000 .000 .000

age .000 .004 .000

wtval .000 .004 .000

bmival .000 .000 .000
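Table 4 is the correlation output produced by SPSS. A sketch along the following lines, again with a placeholder file name, produces equivalent correlations and one-tailed significance values outside SPSS.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("health_survey_2002_subset.csv")   # placeholder file name
cols = ["sysval", "age", "wtval", "bmival"]

print(df[cols].corr().round(3))                      # Pearson correlation matrix

# One-tailed significance for each pair (halving the two-tailed p-value,
# valid when the observed direction matches the hypothesised one).
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        r, p_two_tailed = stats.pearsonr(df[a], df[b])
        print(f"{a} vs {b}: r = {r:.3f}, one-tailed p = {p_two_tailed / 2:.3f}")
```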

Computing the VIF and Tolerance Statistic

Table 5 shows us the VIF values and the Tolerance Statistics for our data. We should review three items in relation to this table. First, we should look at the VIF; the largest VIF (3.502) is for bmival, and it is not greater than 10, so it is within acceptable limits. Secondly, we should look at the corresponding Tolerance Statistic for bmival (0.286), which is not below 0.1; again, this is within acceptable limits. Finally, we should calculate the average VIF, which we can do by adding all the VIF values together (1.084 + 3.386 + 3.502) and dividing by the number of predictors (3). This gives us an average VIF of 2.657, which is not substantially greater than 1. Likewise, we can calculate the average Tolerance in the same way, giving us an average Tolerance Statistic of 0.501, which is not below 0.2.

Table 5: Variance Inflation Factor (VIF) and Tolerance.

Model   B   Standard error   β   t   Significance   Zero-order   Partial   Part   Tolerance   VIF

(B and Standard error are the unstandardised coefficients; β is the standardised coefficient; Zero-order, Partial, and Part are correlations; Tolerance and VIF are the collinearity statistics.)

(Constant) 94.104 1.352 69.627 .000

age .459 .014 .454 33.781 .000 .473 .451 .436 .922 1.084

wtval .116 .027 .102 4.311 .000 .187 .064 .056 .295 3.386

bmival .290 .087 .080 3.331 .001 .250 .050 .043 .286 3.502

After reviewing Table 5, it would seem that we do not need to be concerned with multicollinearity in our data.

Compute Eigenvalues

The last diagnostic that we should run is to compute Eigenvalues of the scaled, uncentred cross-products matrix; the condition indexes; and the variance proportions (Table 6).

Table 6: Eigenvalues.

Model 1

Dimension   Eigenvalue   Condition index   (Constant)   age   wtval   bmival
1   3.870   1.000    .00   .01   .00   .00
2   .102    6.171    .01   .84   .02   .01
3   .022    13.128   .96   .08   .11   .04
4   .006    25.555   .03   .07   .87   .96

(The last four columns are the variance proportions for each term.)

If we inspect the rows with the smallest Eigenvalues (the bottom two rows), we should look for variables that have high variance proportions on the same Eigenvalue. The variance proportions range from 0 to 1. If we look at the lowest Eigenvalue (0.006, Dimension 4), we can see that both wtval and bmival have high variance proportions (0.87 and 0.96), which means that 87% of the variance of the b-value for wtval and 96% of the variance of the b-value for bmival are associated with the smallest Eigenvalue. This result suggests that there may be dependency between these two variables.
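For readers curious how variance proportions like those in Table 6 are derived, the sketch below shows a Belsley-style decomposition on invented data. It is an illustration of the general method under stated assumptions, not a reproduction of SPSS’s exact routine.

```python
import numpy as np

# Hypothetical design matrix: constant, age, weight, bmi (bmi built from weight).
rng = np.random.default_rng(7)
n = 500
age = rng.uniform(16, 96, n)
weight = rng.normal(75, 16, n)
bmi = weight / rng.normal(1.72, 0.1, n) ** 2
X = np.column_stack([np.ones(n), age, weight, bmi])

# Scale each column to unit length (scaled, uncentred cross-products matrix).
X_scaled = X / np.sqrt((X ** 2).sum(axis=0))

# Singular value decomposition; condition indexes come from the singular values.
U, s, Vt = np.linalg.svd(X_scaled, full_matrices=False)
condition_index = s.max() / s

# Variance-decomposition proportions: the share of each coefficient's variance
# associated with each dimension (rows = dimensions, columns = model terms).
phi = (Vt.T ** 2) / s ** 2
proportions = (phi / phi.sum(axis=1, keepdims=True)).T
print(np.round(condition_index, 2))
print(np.round(proportions, 2))
```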

If we review all our multicollinearity diagnostics, we can see that the Correlation Matrix and the review of the Eigenvalues suggest collinearity between wtval and bmival, although the VIF and Tolerance Statistics did not corroborate this. It is clear that there is no multicollinearity in these data, but collinearity does exist between two of the predictors (wtval and bmival), and it might be sensible to remove one of them from the proposed model. Given that BMI is calculated using weight, it makes sense that there is a high level of correlation between the two variables.

Presenting Results

The multicollinearity tests can be reported as follows:

“We used a subset of the 2002 English Health Survey (Teaching Dataset) dataset to test for multicollinearity as part of assumptions testing for a multiple linear regression. The data included 4,477 respondents. The proposed multiple linear

regression would test the following hypotheses:

H0a = After controlling for the effect of weight and BMI, there is no linear relationship between people’s age and their systolic blood pressure.
H0b = After controlling for the effect of BMI and age, there is no linear relationship between people’s weight and their systolic blood pressure.
H0c = After controlling for the effect of weight and age, there is no linear relationship between people’s BMI and their systolic blood pressure.

A series of diagnostics (correlation matrix, VIF/Tolerance Statistics, and Eigenvalues) were run to assess multicollinearity. Tests for multicollinearity (VIF = 1.084 for age, 3.386 for wtval, and 3.502 for bmival) indicate that a very low level of multicollinearity was present. However, collinearity was found between wtval and bmival, as indicated by r = 0.831 and high variance proportions (0.87 and 0.96) for the smallest Eigenvalue (0.006). It is recommended that one of these variables is removed from the proposed model.”

Review

Testing for multicollinearity is one of the assumptions tests required prior to running a range of statistical tests, including multiple linear regression. Multicollinearity exists when more than two predictor (or independent) variables are highly correlated, which can impact upon the results of a multiple linear regression model.

You should know:

• What types of variables are suited for multicollinearity diagnostics.
• The basic assumptions underlying this statistical diagnostic.
• How to compute and interpret a multicollinearity diagnostic.
• How to report the results of a multicollinearity diagnostic.


Your Turn

You can download this sample dataset along with a guide showing how to test for multicollinearity using statistical software. The sample dataset also includes another variable called diaval, which relates to respondents’ diastolic blood pressure. See whether you can reproduce the results presented here with sysval as the dependent variable, and then try testing for multicollinearity again, substituting diaval for sysval as the dependent variable in the analysis.
