
Learn About Multicollinearity in SPSS With Data From Transparency, Class Bias, and Redistribution: Evidence From the American States Dataset (2018)

© 2019 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This example introduces readers to the concept of multicollinearity. Perfect collinearity between two variables exists when the absolute correlation between the variables is 1. Similarly, perfect multicollinearity exists among a set of variables when the multiple correlation between one of the variables and the others is 1 or, equivalently, when the correlation between some pair of linear combinations of the variables is 1. Collinearity and multicollinearity (also called ill conditioning or near dependency) refer to the situation where the correlation is high but not 1. Multicollinearity raises certain challenges when it exists among the set of independent variables in a regression model. These difficulties, their causes, detection methods, and solutions are discussed.

This example describes multicollinearity, the difficulties it can cause in regression modeling, its causes, detection methods, and solutions. We illustrate this using a subset of data from Transparency, Class Bias, and Redistribution: Evidence From the American States (https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/KPG5C2). We selected and cleaned a subset of the data for illustration. Specifically, we go through a process of detection, evaluation, and solution for multicollinearity. Those interested in doing research in this area should visit the website noted above. There, you will find links to this sample dataset and a guide to the test using statistical software.

What Is Multicollinearity?

Within a set of variables, multicollinearity exists when at least one variable is nearly or exactly a linear combination of the other variables in the set. Consider the set of variables X = {x1, x2, …, xp} and the model:

xi = α + X*β + ε

where X* is X with xi removed. Perfect multicollinearity exists when σ²ε = 0, and multicollinearity exists when σ²ε is small compared to σ²xi, implying that the R² from this regression is large.

Formally, perfect collinearity or perfect linear dependence exists between two variables x1 and x2 when there exist values λ0 and λ1 such that λ0 + x1 + λ1x2 = 0 for all observations. Similarly, within a set of variables, perfect multicollinearity exists if there is a set of values λi such that λ0 + x1 + λ1x2 + … + λ(p−1)xp = 0. Multicollinearity, or near dependency, describes the situation where λ0 + x1 + λ1x2 + … + λ(p−1)xp = u, and the variance σ²u of u is small compared to the variance σ²x1 of x1. When considered as a regression problem, this implies that R² = 1 − σ²u / σ²x1 is large when x1 is regressed onto {x2, x3, …, xp}. When the xi are a set of independent variables in a linear regression model (the design matrix, X), perfect multicollinearity will cause, and multicollinearity may cause, estimation and interpretation problems. Perfect multicollinearity is almost always the result of faulty design matrix construction and is easy to fix. The causes of multicollinearity can be more complex, and a solution, if necessary, requires practical and theoretical considerations. Multicollinearity is a function of the characteristics of X, not of the linear regression model, so it is a data issue rather than a statistical one.

Challenges for Linear Regression Raised By Multicollinearity

When X constitutes the set of regressors in a multiple regression model y = Xβ + ε, multicollinearity within X can raise complications both in the estimation of β and in the interpretation of the individual parameters in β.

Estimation Problems

Given y and X, the estimates for β in the linear regression model y = Xβ + ε are obtained using the normal equations:

β̂ = (X′X)⁻¹X′y

so the matrix (X′X) must be invertible, a condition which is met only if X is of full column rank. When perfect multicollinearity exists within X, X is not of full column rank, so there is no unique solution for β̂ using the normal equations. When near-perfect multicollinearity is severe, a similar problem can occur due to the inability of a computer algorithm to obtain an inverse properly. Even when X is of full column rank and a unique inverse is properly obtained, high multicollinearity can still cause estimation challenges. The precision of the estimators of the affected βs is low, which leads to large standard error estimates (importantly, correctly large standard errors: there really is that much uncertainty about the effect of x "holding the other X fixed" because of the high collinearity) and to a seemingly paradoxical situation in hypothesis testing. A hypothesis test that a set of parameters is all 0s may have a low p-value, easily rejecting the null, even while the p-values associated with the individual parameters may be high, decidedly failing to reject. Another result of low precision (i.e., of the correctly large standard errors) induced by high collinearity in X is that estimates from replication studies may be highly variable, even with large samples.
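To make the estimation problem concrete, here is a minimal simulation sketch (added for illustration; it is not part of the original guide) using Python and statsmodels. It assumes two predictors that are nearly identical; the joint F test of "both slopes are 0" easily rejects while the individual t tests may not, because the individual standard errors are (correctly) large.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Two highly collinear predictors: x2 is x1 plus a little noise.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# The joint F test that both slopes are 0 easily rejects ...
print("F-test p-value:", fit.f_pvalue)
# ... while the individual slope tests may not, because the standard
# errors of the collinear coefficients are (correctly) large.
print("slope p-values:", fit.pvalues[1:])
print("slope std errors:", fit.bse[1:])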

Interpretation Problems

The usual interpretation of βi in the regression model is the expected difference in y for a one-unit difference in xi, all other xj held constant. If xi is highly correlated with some set of the other members of X, holding the other xj constant is not realistic, and the meaning of that interpretation of βi can be difficult to grasp. A meaningful description of the effects under these conditions requires an understanding of the high intercorrelations of the variables and the consequent effect of plausible hypotheticals regarding differences in X (rather than in x) on the expected value of y.

Causes of Multicollinearity

Perfect multicollinearity and multicollinearity typically arise through different mechanisms, so they are discussed separately.

Perfect Multicollinearity

Perfect multicollinearity is often the result of one of three situations. The first is a design matrix (X) with more columns than rows. This occurs when we try to fit a regression model with more regressors than observations. Importantly, the number of coefficients to be estimated in the model equals the number of regressors (independent variables, columns of X) plus the intercept. The regression model y = α + x1β1 + x2β2 + x3β3 + x4β4 + ε

has five coefficients, so X must have at least five rows for β̂ to be estimable, and six for standard errors to be estimable. When a model includes indicators for nominal variables and/or interaction terms, the number of coefficients can grow rapidly with the number of nominal categories and/or interactions. For example, the model indicated by y = Group + Treatment + Group*Treatment + ε has four parameters when both Group and Treatment are binary but nine parameters when each has three levels.

The second common situation is faulty construction of the design matrix. Consider the one-way between-subjects ANOVA model indicated by y = Treatment + ε where Treatment has three levels. We can also describe this in typical regression form as y = α + x1(T1) + x2(T2) + x3(T3) + ε where α is the overall mean, Ti are the treatment deviations from the mean and xi are 0/1 indicators of treatment received. As a description of the model, this equation is correct, but in the design matrix that contains a column of 1s for the mean and three columns of 0/1 variables for the indicators, x1 + x2 + x3 = 1, which is redundant (i.e., perfectly collinear) with the intercept, so the design matrix is not full rank.
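The following short numpy sketch (an illustration added here, not part of the original guide) shows the rank deficiency created by coding an intercept plus an indicator for every treatment level, and how dropping one indicator as a reference level restores full column rank.

import numpy as np

# Intercept plus an indicator for every one of the three treatment levels.
intercept = np.ones(6)
t1 = np.array([1, 1, 0, 0, 0, 0])
t2 = np.array([0, 0, 1, 1, 0, 0])
t3 = np.array([0, 0, 0, 0, 1, 1])

X_bad = np.column_stack([intercept, t1, t2, t3])
# Rank is 3, not 4: t1 + t2 + t3 reproduces the intercept column.
print(np.linalg.matrix_rank(X_bad))

# Dropping one indicator (treating level 1 as the reference) restores full rank.
X_ok = np.column_stack([intercept, t2, t3])
print(np.linalg.matrix_rank(X_ok))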

Exact multicollinearity can also occur with a seemingly well-defined design matrix. Consider the ANOVA model representation: y = Group + Treatment + Group*Treatment + ε where there are two groups (0/1) and three treatments (A, B, C). We use a reference cell design with Group = 0, Treatment = A as the reference. The design matrix is:


Effects            Design matrix
Group  Treatment   Intercept  G(1)  T(B)  T(C)  G(1)*T(B)  G(1)*T(C)
0      A           1          0     0     0     0          0
0      B           1          0     1     0     0          0
0      C           1          0     0     1     0          0
1      A           1          1     0     0     0          0
1      B           1          1     1     0     1          0
1      C           1          1     0     1     0          1

which is full rank.

Now consider the same theoretical model and a dataset with no observations appearing in the Group = 0, Treatment = B combination. The resulting design matrix:

Effects            Design matrix
Group  Treatment   Intercept  G(1)  T(B)  T(C)  G(1)*T(B)  G(1)*T(C)
0      A           1          0     0     0     0          0
0      C           1          0     0     1     0          0
1      A           1          1     0     0     0          0
1      B           1          1     1     0     1          0
1      C           1          1     0     1     0          1

is not full rank, as T(B) − G(1)*T(B) = 0. This situation often arises in observational data with classification variables where data are sparse in some of the classifications and consequently no observations exist in some cross-classifications.

The third common vehicle for exact multicollinearity is measured-variable redundancy. Suppose we want to model a response as a function of relationship quality, measured using a scale with four dimensions, Consensus (CON), Satisfaction (SAT), Cohesion (COH), and Expression (EXP), plus a Total Quality (TOT) score. We use the regression model: y = α + β1(CON) + β2(SAT) + β3(COH) + β4(EXP) + β5(TOT) + ε

If Total Quality is the mean of the other four effects, we have CON + SAT + COH + EXP − 4*TOT = 0, so the design matrix is not full rank.
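A brief numpy check (added for illustration; the subscale names follow the example above) confirms that including the total alongside all four subscales leaves the design matrix short of full column rank:

import numpy as np

rng = np.random.default_rng(1)
n = 50

# Four subscale scores and a total defined as their mean.
con, sat, coh, expr = (rng.normal(size=n) for _ in range(4))
tot = (con + sat + coh + expr) / 4.0

X = np.column_stack([np.ones(n), con, sat, coh, expr, tot])
# CON + SAT + COH + EXP - 4*TOT = 0, so one of the six columns is redundant.
print(np.linalg.matrix_rank(X))   # 5, although X has 6 columns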

Multicollinearity

Multicollinearity, or high but imperfect collinearity, is usually a function of the mechanism in the system being studied. For example, suppose we want to use height and weight as independent variables in a model involving teenagers. Height and weight are naturally correlated, and if age and gender are added to the model, the partial correlation between height and weight may be even higher. As this is not perfect multicollinearity, (X′X) will have an inverse, but estimation and interpretation may be problematic.

Detecting Multicollinearity

Detection methods differ for perfect multicollinearity and multicollinearity, so they are discussed separately.

Perfect Multicollinearity

If one or more columns in X is an exact linear combination of other columns of X, then X is not full rank and (X′X) does not have an inverse. When fitting a regression model where the design matrix has this issue, most software packages (SAS, SPSS, Stata) recognize the problem and will eliminate variables as needed until the design matrix is full rank. The eliminated variables are chosen arbitrarily, so results will differ among software packages and may be difficult to interpret. When this problem occurs, researchers are best served by examining their data and removing the most reasonable terms. For example, in the previous ANOVA model with a missing cell, T(B) − G(1)*T(B) = 0. We would remove the G(1)*T(B) term and redefine T(B) as the B effect conditioned on Group = 1 (as there is no Treatment = B when Group = 0). In the regression model, we would probably eliminate the Total Quality term. As discussed, perfect multicollinearity can always be avoided by proper design matrix construction.

Multicollinearity

Researchers familiar with principal components analysis are aware that the number of non-zero eigenvalues of a correlation matrix is the rank of the correlation matrix and that the number of zero eigenvalues is the number of complete dependencies. The concept of zero eigenvalues when a correlation matrix is not full rank is used when assessing multicollinearity.

Two methods are commonly used to detect multicollinearity: the variance inflation factor (VIF) and the condition index.

The Variance Inflation Factor

A common phenomenon associated with multicollinearity is large variance for the parameter estimates associated with variables that are nearly linear combinations of other variables. Consider two regression models:

Model 1 : y = α + x1β1 + x2β2 + … + xnβn + ε and

Model 2 : y − (x2β2 + … + xnβn) = α + x1β1 + ε


In Model 1, the variance of β̂1 is σ²ε × [the second diagonal element of (X′X)⁻¹]. In Model 2, the variance of β̂1 is σ²ε × (x1′x1)⁻¹ (assuming β2 through βn are known and not estimated). The ratio of these two values is the VIF, the inflation in the variance of β̂1 due to the correlation of x1 with the other variables in the model. For the ith variable in the model,

VIFi = 1 / (1 − R²i)

where R²i is the squared multiple correlation coefficient when xi is regressed onto the other members of X; in practice, VIFi is estimated using the sample R²i. Large values of VIFi indicate high multicollinearity, from which difficulties in estimation and interpretation may arise, but the VIF alone does not tell us what the nature of that multicollinearity is.

A large VIF is an indication of an R²i close to one, so it is useful as an indication of collinearity. It does not give any information about which other variables are involved in the collinearity, nor can it distinguish among multiple sources of collinearity. For example, consider the regression model: y = α + x1β1 + x2β2 + x3β3 + x4β4 + x5β5 + x6β6 + ε where the sets (x1, x2, x3) and (x4, x5, x6) are highly correlated within set but have low between-set correlation. High VIFs will be obtained for some or all of the variables, but the VIFs give no information as to the source. The inverse of the VIF is sometimes reported and is called the tolerance.
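As a concrete illustration (added here, not from the original guide), the sketch below computes VIFs in Python with statsmodels for three simulated predictors, one of which is close to a linear combination of the other two. SPSS reports the same quantities (VIF and tolerance) through the collinearity statistics option of its REGRESSION procedure.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 500

# Three predictors; x3 is close to a linear combination of x1 and x2.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.1, size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF_i = 1 / (1 - R_i^2), computed directly from the design matrix.
# Column 0 is the constant, so the loop starts at column 1.
for i in range(1, X.shape[1]):
    print(f"VIF for x{i}: {variance_inflation_factor(X, i):.1f}")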

The Condition Index and Variance Proportion

As previously described, when perfect multicollinearity exists within X, at least one eigenvalue of the correlation matrix of X will be 0. Consistent with that, if multicollinearity exists within X, at least one eigenvalue will be small, but "small" is ill-defined, as even a well-conditioned matrix can have arbitrarily small eigenvalues depending on how the columns are scaled. What is not arbitrary is the condition number, the ratio of the largest to the smallest eigenvalue, and the condition indices, the ratios of the largest eigenvalue to each eigenvalue. A large condition number and large condition indices are further indications of potentially problematic multicollinearity. So that each variable is weighted the same, X is rescaled to have all column sums of squares equal to 1 prior to obtaining eigenvalues.

The eigenvalues of the rescaled X′X are related to the variances of the β̂s, as cov(β̂) = σ²(X′X)⁻¹. Condition indices are then informative about the effects of multicollinearity on the variances of the β̂s. The variance of each β̂ can be decomposed into terms (variance proportions), each associated with an eigenvalue, that determine the extent to which multicollinearity affects the variance of that β̂. For each β̂, the sum of the variance proportions is 1.

Studies have suggested that weak dependencies are associated with condition indices between 5 and 10 and strong relations begin between 30 and 100. Also, multicollinearity is only an issue if at least two variables have a high variance proportion associated with a large condition index as collinearity involves at least two variables. A two-step diagnostic procedure is commonly used:

1. Determine which eigenvalues have a high condition index (say >20).
2. Determine which coefficients have high (say >0.50) variance proportions associated with the condition indices flagged in Step 1.

The exact nature of the multicollinearity may then be directly assessed by OLS with variables associated with the highest variance proportions serving as outcomes.
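The sketch below (added for illustration) computes condition indices and variance-decomposition proportions with numpy. One note on conventions: the function derives the condition index from the singular values of the column-rescaled X, that is, the square root of the eigenvalue ratio, which is the scale on which thresholds such as 20 or 30 are usually quoted and which, to our knowledge, is what SPSS reports.

import numpy as np

def collinearity_diagnostics(X):
    # Rescale each column so its sum of squares is 1 (equal weighting).
    Xs = X / np.sqrt((X ** 2).sum(axis=0))
    # Singular values of the rescaled matrix; eigenvalues of Xs'Xs are d**2.
    _, d, Vt = np.linalg.svd(Xs, full_matrices=False)
    cond_index = d.max() / d            # square root of the eigenvalue ratio
    # Variance proportions: phi[j, k] = V[j, k]**2 / d[k]**2, rescaled so the
    # proportions for each coefficient j sum to 1 across eigenvalues k.
    phi = (Vt.T ** 2) / (d ** 2)
    var_prop = phi / phi.sum(axis=1, keepdims=True)
    return cond_index, var_prop

# Example: a constant plus three predictors with one near dependency (x3 close to x1 + x2).
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.05, size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])

ci, vp = collinearity_diagnostics(X)
print(np.round(ci, 1))   # one large condition index flags the near dependency
print(np.round(vp, 2))   # rows = coefficients, columns = condition indices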


Solutions for Multicollinearity

Perfect Multicollinearity

Exact multicollinearity must be addressed. Careful design matrix construction will always avoid the issue. If it is found, it is solved by removing one or more of the offending elements from X (e.g., one indicator from a set of dummies, or a total score when all subscale scores are included).

Multicollinearity

Multicollinearity does not always require a solution, and solutions may degrade the model. It is important to understand that multicollinearity is not a statistical issue; it is not related to the validity of a model. It is, rather, a function of the information in a design matrix. If a theoretical model dictates that y = Xβ + ε, then that is the empirical model that must be estimated. OLS estimators are not biased by multicollinearity, and some or all estimators may be biased if important terms are removed from X. The precision of the fitted outcome, ŷ, is also not degraded in the presence of multicollinearity. Additionally, if multicollinearity exists within a set of "controlled for" covariates which are orthogonal to the variables of interest, the precisions of the coefficient estimators for the variables of interest are not affected by the multicollinearity among the controls.

For descriptive models, multicollinearity may be worth dealing with, often by removing from the model the least theoretically interesting variable involved in the multicollinearity, using the described diagnostic procedure to assess the full nature of the multicollinearity. That choice is up to the researcher.

Illustrative Example: Multicollinearity and Transparency, Class Bias, and Redistribution: Evidence From the American States

This example presents diagnostics for multicollinearity within a dataset that contains multiple economic variables; the unit of observation is a state. This is important, as social scientists often use economic variables in models of association, and these variables are often intercorrelated.

The Data

This example uses a subset of data from Transparency, Class Bias, and Redistribution: Evidence From the American States (https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/KPG5C2). Fourteen variables are used in this example: publicwelftotalexp, gsp, unem1, pop, pop65, nonwhite, pctscore, media_pen, cbias, govparty_c, leg_cont, divided_gov, citi, and nominate; they are described further in the codebook. Within the set, there are several economy-related variables (public expenditures, personal income, GINI coefficient) which may be intercorrelated. Complete cases are retained, resulting in 1,081 usable observations out of the 1,887 original observations (57.3%).

Analyzing the Data

Several VIFs are large; among them are publicwelftotalexp, gsp, pop, govparty_c, and nominate, all greater than 5. There are five condition indices greater than 20, of which two have two or more variables with large variance proportions. nominate has a large variance proportion associated with one of these condition indices and gsp has a large variance proportion associated with the other, so these variables serve as pivots; that is, they are regressed onto the other members of the set to assess multicollinearity through R². The R² for nominate is 0.89 and that for gsp is 0.96, both very large. When these two variables are removed from the design matrix, all VIFs are below 5, and there are no condition indices greater than 20 associated with multiple variance proportions greater than 0.50.
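For readers who want to reproduce this workflow outside SPSS, here is a hedged sketch in Python. The file name states_subset.csv is a placeholder introduced only for illustration (the actual cleaned subset is linked from the dataset page), the variable names are those listed in The Data, and the exact numbers quoted above come from the SPSS analysis, so small differences in data handling are possible.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical file name; the cleaned subset is assumed to contain the
# 14 variables listed in "The Data", with incomplete cases dropped.
cols = ["publicwelftotalexp", "gsp", "unem1", "pop", "pop65", "nonwhite",
        "pctscore", "media_pen", "cbias", "govparty_c", "leg_cont",
        "divided_gov", "citi", "nominate"]
df = pd.read_csv("states_subset.csv")[cols].dropna()

# VIF for each variable when regressed on the other members of the set.
X = sm.add_constant(df)
vifs = {c: variance_inflation_factor(X.values, i)
        for i, c in enumerate(X.columns) if c != "const"}
print({c: round(v, 1) for c, v in vifs.items()})

# Pivot regressions: regress a flagged variable on the remaining members
# to see how strong the near dependency is (via R-squared).
for pivot in ["nominate", "gsp"]:
    others = sm.add_constant(df.drop(columns=[pivot]))
    r2 = sm.OLS(df[pivot], others).fit().rsquared
    print(pivot, "R^2:", round(r2, 2))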


Presenting Results

Depending on the researcher's conclusions about multicollinearity issues, results from a multicollinearity analysis are typically reported briefly or not at all. In our example, if the decision were made to eliminate nominate and gsp from the set of independent variables, we might say:

Legislator ideology and gross state product were eliminated from the analysis due to their unacceptably large multiple correlations with other independent variables.

Review

Multicollinearity describes the situation where one or more variables in a set of variables has high multiple correlation with other members of the set.

You should know:

• How multicollinearity in a regression model design matrix affects coefficient and standard error estimates.
• How to diagnose multicollinearity.
• When to address multicollinearity.
• If necessary, how to address multicollinearity.

Your Turn

Download this sample data to see whether you can replicate these results.
