
00:28 Thursday 29th October, 2015 See updates and corrections at http://www.stat.cmu.edu/~cshalizi/mreg/

Lecture 17: Multicollinearity

36-401, Fall 2015, Section B

27 October 2015

Contents

1 Why Collinearity Is a Problem
  1.1 Dealing with Collinearity by Deleting Variables
  1.2 Diagnosing Collinearity Among Pairs of Variables
  1.3 Why Multicollinearity Is Harder
  1.4 Geometric Perspective

2 Variance Inflation Factors
  2.1 Why VIF_i ≥ 1

3 Matrix-Geometric Perspective on Multicollinearity
  3.1 The Geometric View
  3.2 Finding the Eigendecomposition
  3.3 Using the Eigendecomposition
    3.3.1 Example
  3.4 Principal Components Regression

4 Ridge Regression
  4.1 Some Words of Advice about Ridge Regression
  4.2 Penalties vs. Constraints
  4.3 Ridge Regression in R
  4.4 Other Penalties/Constraints

5 High-Dimensional Regression
  5.1 Demo

6 Further Reading

1 Why Collinearity Is a Problem

Remember our formula for the estimated coefficients in a multiple linear regression:
$$\widehat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}$$


This is obviously going to lead to problems if $\mathbf{x}^T\mathbf{x}$ isn't invertible. Similarly, the variance of the estimates,
$$\mathrm{Var}\left[\widehat{\beta}\right] = \sigma^2 (\mathbf{x}^T\mathbf{x})^{-1},$$
will blow up when $\mathbf{x}^T\mathbf{x}$ is singular. If that matrix isn't exactly singular, but is close to being non-invertible, the variances will become huge.

There are several equivalent conditions for any square matrix, say $\mathbf{u}$, to be singular or non-invertible:

• The determinant $\det \mathbf{u}$ or $|\mathbf{u}|$ is 0.
• At least one eigenvalue[1] of $\mathbf{u}$ is 0. (This is because the determinant of a matrix is the product of its eigenvalues.)
• $\mathbf{u}$ is rank-deficient, meaning that one or more of its columns (or rows) is equal to a linear combination of the others[2].

Since we're not concerned with any old square matrix, but specifically with $\mathbf{x}^T\mathbf{x}$, we have an additional equivalent condition:

• $\mathbf{x}$ is column-rank deficient, meaning one or more of its columns is equal to a linear combination of the others.

The last condition explains why we call this problem collinearity: it looks like we have $p$ different predictor variables, but really some of them are linear combinations of the others, so they don't add any information. The real number of distinct variables is $q < p$, the column rank of $\mathbf{x}$. If the exact linear relationship holds among more than two variables, we talk about multicollinearity; collinearity can refer either to the general situation of a linear dependence among the predictors, or, by contrast to multicollinearity, a linear relationship among just two of the predictors.

Again, if there isn't an exact linear relationship among the predictors, but they're close to one, $\mathbf{x}^T\mathbf{x}$ will be invertible, but $(\mathbf{x}^T\mathbf{x})^{-1}$ will be huge, and the variances of the estimated coefficients will be enormous. This can make it very hard to say anything at all precise about the coefficients, but that's not necessarily a problem.
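To make this concrete, here is a minimal simulated sketch (my own illustration, not code from the notes; the variables x1, x2, x3 and the sample size are made up, echoing the simulation used later on). With an exactly collinear column, solve() cannot invert $\mathbf{x}^T\mathbf{x}$, and lm() reports an NA coefficient for the redundant predictor.

# Sketch: exact collinearity makes t(x) %*% x (effectively) non-invertible
set.seed(401)
x1 <- rnorm(50, mean=70, sd=15)
x2 <- rnorm(50, mean=70, sd=15)
x3 <- (x1+x2)/2                  # exact linear combination of x1 and x2
y <- 0.7*x1 + 0.3*x2 + rnorm(50)
x <- cbind(1, x1, x2, x3)        # design matrix, with a column of 1s
try(solve(t(x) %*% x))           # inversion fails, or is numerically meaningless
coef(lm(y ~ x1 + x2 + x3))       # lm() reports NA for the redundant x3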

1.1 Dealing with Collinearity by Deleting Variables

Since not all of the $p$ variables are actually contributing information, a natural way of dealing with collinearity is to drop some variables from the model. If you want to do this, you should think very carefully about which variable to delete. As a concrete example: if we try to include all of a student's grades as

[1] You learned about eigenvalues and eigenvectors in linear algebra; if you are rusty, now is an excellent time to refresh your memory.
[2] The equivalence of this condition to the others is not at all obvious, but, again, is proved in linear algebra.

predictors, as well as their over-all GPA, we'll have a problem with collinearity (since GPA is a linear function of the grades). But depending on what we want to predict, it might make more sense to use just the GPA, dropping all the individual grades, or to include the individual grades and drop the average[3].

1.2 Diagnosing Collinearity Among Pairs of Variables

Linear relationships between pairs of variables are fairly easy to diagnose: we make the pairs plot of all the variables, and we see if any of them fall on a straight line, or close to one. Unless the number of variables is huge, this is by far the best method. If the number of variables is huge, look at the correlation matrix, and worry about any entry off the diagonal which is (nearly) $\pm 1$.

1.3 Why Multicollinearity Is Harder

A multicollinear relationship involving three or more variables might be totally invisible on a pairs plot. For instance, suppose $X_1$ and $X_2$ are independent Gaussians, of equal variance $\sigma^2$, and $X_3$ is their average, $X_3 = (X_1 + X_2)/2$. The correlation between $X_1$ and $X_3$ is

\begin{align}
\mathrm{Cor}(X_1, X_3) &= \frac{\mathrm{Cov}[X_1, X_3]}{\sqrt{\mathrm{Var}[X_1]\,\mathrm{Var}[X_3]}} \tag{1}\\
&= \frac{\mathrm{Cov}[X_1, (X_1+X_2)/2]}{\sqrt{\sigma^2\,\sigma^2/2}} \tag{2}\\
&= \frac{\sigma^2/2}{\sigma^2/\sqrt{2}} \tag{3}\\
&= \frac{1}{\sqrt{2}} \tag{4}
\end{align}

This is also the correlation between $X_2$ and $X_3$. A correlation of $1/\sqrt{2}$ isn't trivial, but is hardly perfect, and doesn't really distinguish itself on a pairs plot (Figure 1).

[3] One could also drop just one of the individual class grades from the average, but it's harder to think of a scenario where that makes sense.


[Figure 1 about here: pairs plot of x1, x2, and x3.]

# Simulation: two independent Gaussians
x1 <- rnorm(100, mean=70, sd=15)
x2 <- rnorm(100, mean=70, sd=15)
# Add in a linear combination of X1 and X2
x3 <- (x1+x2)/2
pairs(cbind(x1,x2,x3))
cor(cbind(x1,x2,x3))

##            x1         x2        x3
## x1 1.00000000 0.03788452 0.7250514
## x2 0.03788452 1.00000000 0.7156686
## x3 0.72505136 0.71566863 1.0000000

Figure 1: Illustration of how a perfect multi-collinear relationship might not show up on a pairs plot or in a correlation matrix.


1.4 Geometric Perspective

The predictors $X_1, \ldots X_p$ form a $p$-dimensional random vector $X$. Ordinarily, we expect this random vector to be scattered throughout $p$-dimensional space. When we have collinearity (or multicollinearity), the vectors are actually confined to a lower-dimensional subspace. The column rank of a matrix is the number of linearly independent columns it has. If $\mathbf{x}$ has column rank $q < p$, then the data vectors are confined to a $q$-dimensional subspace. It looks like we've got $p$ different variables, but really by a change of coordinates we could get away with just $q$ of them.

2 Variance Inflation Factors

If the predictors are correlated with each other, the standard errors of the coefficient estimates will be bigger than if the predictors were uncorrelated. If the predictors were uncorrelated, the variance of $\hat{\beta}_i$ would be

$$\mathrm{Var}\left[\hat{\beta}_i\right] = \frac{\sigma^2}{n s^2_{X_i}} \tag{5}$$
just as it is in a simple linear regression. With correlated predictors, however, we have to use our general formula for the variance of the estimated coefficients:

$$\mathrm{Var}\left[\hat{\beta}_i\right] = \sigma^2 \left(\mathbf{x}^T\mathbf{x}\right)^{-1}_{i+1,i+1} \tag{6}$$

(Why are the subscripts on the matrix $i+1$ instead of $i$?) The ratio between Eqs. 6 and 5 is the variance inflation factor for the $i^{\mathrm{th}}$ coefficient, $VIF_i$. The average of the variance inflation factors across all predictors is often written $\overline{VIF}$, or just $VIF$. Folklore says that $VIF_i > 10$ indicates "serious" multicollinearity for the predictor. I have been unable to discover who first proposed this threshold, or what the justification for it is. It is also quite unclear what to do about this. Large variance inflation factors do not, after all, violate any model assumptions.

2.1 Why VIF_i ≥ 1

Let's take the case where $p = 2$, so $\mathbf{x}^T\mathbf{x}$ is a $3 \times 3$ matrix. As you saw in the homework,
$$\frac{1}{n}\mathbf{x}^T\mathbf{x} = \begin{bmatrix} 1 & \overline{x_1} & \overline{x_2} \\ \overline{x_1} & \overline{x_1^2} & \overline{x_1 x_2} \\ \overline{x_2} & \overline{x_1 x_2} & \overline{x_2^2} \end{bmatrix}$$
where the overlines denote sample means (e.g., $\overline{x_1 x_2} = n^{-1}\sum_k x_{k1} x_{k2}$).


After tedious but straightforward algebra[4], we get for the inverse (deep breath)
$$\left(\frac{1}{n}\mathbf{x}^T\mathbf{x}\right)^{-1} = \frac{1}{\widehat{\mathrm{Var}}[X_1]\widehat{\mathrm{Var}}[X_2] - \widehat{\mathrm{Cov}}[X_1,X_2]^2}
\begin{bmatrix}
\widehat{\mathrm{Var}}[X_1]\widehat{\mathrm{Var}}[X_2] - \widehat{\mathrm{Cov}}[X_1,X_2]^2 + \widehat{\mathrm{Var}}\left[\overline{x_1}X_2 - \overline{x_2}X_1\right] & \overline{x_2}\widehat{\mathrm{Cov}}[X_1,X_2] - \overline{x_1}\widehat{\mathrm{Var}}[X_2] & \overline{x_1}\widehat{\mathrm{Cov}}[X_1,X_2] - \overline{x_2}\widehat{\mathrm{Var}}[X_1] \\
\overline{x_2}\widehat{\mathrm{Cov}}[X_1,X_2] - \overline{x_1}\widehat{\mathrm{Var}}[X_2] & \widehat{\mathrm{Var}}[X_2] & -\widehat{\mathrm{Cov}}[X_1,X_2] \\
\overline{x_1}\widehat{\mathrm{Cov}}[X_1,X_2] - \overline{x_2}\widehat{\mathrm{Var}}[X_1] & -\widehat{\mathrm{Cov}}[X_1,X_2] & \widehat{\mathrm{Var}}[X_1]
\end{bmatrix}$$
where the hats on the variances and covariances indicate that they are sample, not population, quantities.

Notice that the pre-factor to the matrix, which is the reciprocal of the determinant of $n^{-1}\mathbf{x}^T\mathbf{x}$, blows up when $X_1$ and $X_2$ are either perfectly correlated or perfectly anti-correlated — which is as it should be, since then we'll have exact collinearity.

The variances of the estimated slopes are, using this inverse,

$$\mathrm{Var}\left[\hat{\beta}_1\right] = \frac{\sigma^2}{n}\,\frac{\widehat{\mathrm{Var}}[X_2]}{\widehat{\mathrm{Var}}[X_1]\widehat{\mathrm{Var}}[X_2] - \widehat{\mathrm{Cov}}[X_1,X_2]^2} = \frac{\sigma^2}{n\left(\widehat{\mathrm{Var}}[X_1] - \widehat{\mathrm{Cov}}[X_1,X_2]^2/\widehat{\mathrm{Var}}[X_2]\right)}$$
and

$$\mathrm{Var}\left[\hat{\beta}_2\right] = \frac{\sigma^2}{n}\,\frac{\widehat{\mathrm{Var}}[X_1]}{\widehat{\mathrm{Var}}[X_1]\widehat{\mathrm{Var}}[X_2] - \widehat{\mathrm{Cov}}[X_1,X_2]^2} = \frac{\sigma^2}{n\left(\widehat{\mathrm{Var}}[X_2] - \widehat{\mathrm{Cov}}[X_1,X_2]^2/\widehat{\mathrm{Var}}[X_1]\right)}$$

Notice that if $\widehat{\mathrm{Cov}}[X_1,X_2] = 0$, these reduce to

$$\mathrm{Var}\left[\hat{\beta}_1\right] = \frac{\sigma^2}{n\widehat{\mathrm{Var}}[X_1]}, \qquad \mathrm{Var}\left[\hat{\beta}_2\right] = \frac{\sigma^2}{n\widehat{\mathrm{Var}}[X_2]},$$
exactly as we'd see in simple linear regressions. When covariance is present, however, regardless of its sign, it increases the variance of the estimates.

With a great deal of even more tedious algebra, it can be shown that this isn't just a weird fact about the $p = 2$ case, but is true generically. The variance inflation factor for $X_i$ can be found by regressing $X_i$ on all of the other $X_j$, computing the $R^2$ of this regression[5], say $R^2_i$, and setting $VIF_i = 1/(1 - R^2_i)$.[6] The consequence is that $VIF_i \geq 1$, with the variance inflation factor increasing as $X_i$ becomes more correlated with some linear combination of the other predictors.

[4] At least, if you remember how to calculate the determinant of a matrix, a matter on which I evidently had a brain-fault this afternoon.
[5] I'd admit this was an exception to my claim that R² is at best useless, except that we can get the exact same number, without running all these regressions, just by inverting $\mathbf{x}^T\mathbf{x}$.
[6] The trick to showing this involves relating the co-factors which appear when we're inverting $n^{-1}\mathbf{x}^T\mathbf{x}$ to the coefficients in the regression of $X_i$ on all the other $X_j$, followed by a mess of book-keeping.
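As a quick illustration of that recipe, here is a minimal sketch of a VIF calculator (my own helper, not part of the notes' code). It expects a data frame containing only predictor columns; the data frame df built in the example of Section 3.3.1 is one thing you could feed it.

# Sketch: VIF_i = 1/(1 - R^2_i), with R^2_i from regressing predictor i on the rest
vif <- function(predictors) {
    stopifnot(is.data.frame(predictors))
    sapply(names(predictors), function(v) {
        rhs <- setdiff(names(predictors), v)
        r2 <- summary(lm(reformulate(rhs, response=v), data=predictors))$r.squared
        1/(1 - r2)
    })
}
# e.g. vif(df[, c("x1","x2","x4")])
# Including the exactly collinear x3 as well makes the relevant VIFs blow up.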


3 Matrix-Geometric Perspective on Multicollinearity

Multicollinearity means that there exists (at least) one set of constants $a_0, a_1, \ldots a_p$, with $a_1, \ldots a_p$ not all zero, such that
$$a_1 X_1 + a_2 X_2 + \cdots + a_p X_p = \sum_{i=1}^p a_i X_i = a_0$$
To simplify this, let's introduce the $p \times 1$ matrix $\mathbf{a} = \begin{bmatrix} a_1 \\ \vdots \\ a_p \end{bmatrix}$, so we can write multicollinearity as
$$\mathbf{a}^T X = a_0$$
for $\mathbf{a} \neq 0$.

If this equation holds, then

" p #  T  X Var a X = Var aiXi = Var [a0] = 0 i=1

Conversely, if $\mathrm{Var}\left[\mathbf{a}^T X\right] = 0$, then $\mathbf{a}^T X$ must be equal to some constant, which we can call $a_0$. So multicollinearity is equivalent to the existence of a vector $\mathbf{a} \neq 0$ where $\mathrm{Var}\left[\mathbf{a}^T X\right] = 0$.

I make these observations because we are old hands now at the variances of weighted sums.

" p #  T  X Var a X = Var aiXi (7) i=1 p p X X = aiajCov [Xi,Xj] (8) i=1 j=1 = aT Var [X] a (9)

Multicollinearity therefore means the equation

$$\mathbf{a}^T \mathrm{Var}[X]\,\mathbf{a} = 0$$
has a solution $\mathbf{a} \neq 0$. Solving a quadratic equation in matrices probably does not sound like much fun, but this is where we appeal to results in linear algebra[7]. $\mathrm{Var}[X]$ is a very special matrix: it is square ($p \times p$), symmetric, and positive semi-definite (non-negative definite), meaning

[7] This is also a big part of why we make you take linear algebra.

that $\mathbf{a}^T \mathrm{Var}[X]\,\mathbf{a} \geq 0$. (Since, after all, that expression is the variance of the scalar $\sum_{i=1}^p a_i X_i$, and variances of scalars are $\geq 0$.) We may therefore appeal to the spectral or eigendecomposition theorem of linear algebra to assert the following:

1. There are $p$ different $p \times 1$ vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots \mathbf{v}_p$, the eigenvectors of $\mathrm{Var}[X]$, such that
$$\mathrm{Var}[X]\,\mathbf{v}_i = \lambda_i \mathbf{v}_i$$
for scalar constants $\lambda_1, \lambda_2, \ldots \lambda_p$, the eigenvalues of $\mathrm{Var}[X]$. The ordering of the eigenvalues and eigenvectors is arbitrary, but it is conventional to arrange them so that $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p$.

2. The eigenvalues are all $\geq 0$. (Some of them may be equal to each other; these are called repeated, multiple or degenerate eigenvalues.)

3. The eigenvectors can be chosen so that they all have length 1, and are orthogonal to each other, so $\mathbf{v}_j^T \mathbf{v}_i = \delta_{ij}$.

4. Any vector $\mathbf{a}$ can be re-written as a sum of eigenvectors:
$$\mathbf{a} = \sum_{i=1}^p (\mathbf{a}^T \mathbf{v}_i)\mathbf{v}_i$$
(Here I have used the parentheses to eliminate any ambiguity about the order in which the matrices are to be multiplied; $\mathbf{a}^T \mathbf{v}_i$ is always a scalar.)

5. $\mathrm{Var}[X]$ can be expressed as
$$\mathrm{Var}[X] = \mathbf{V}\mathbf{D}\mathbf{V}^T$$
where $\mathbf{V}$ is the matrix whose $i^{\mathrm{th}}$ column is $\mathbf{v}_i$ (and so $\mathbf{V}^T$ is the matrix whose $i^{\mathrm{th}}$ row is $\mathbf{v}_i^T$), and $\mathbf{D}$ is the diagonal matrix whose entries are $\lambda_1, \lambda_2, \ldots \lambda_p$.

Suppose that one or more of the eigenvalues are zero. Since we've put them in order, this means that the positive eigenvalues are $\lambda_1, \ldots \lambda_q$ (for some $q < p$), and $\lambda_{q+1}, \ldots \lambda_p$ are all zero. It follows that $\mathbf{v}_{q+1}, \ldots \mathbf{v}_p$ all give us linear combinations of the $X_i$ which are multicollinear. So a sufficient condition for multicollinearity is that $\mathrm{Var}[X]$ have zero eigenvalues.

Conversely, suppose $\mathbf{a}^T \mathrm{Var}[X]\,\mathbf{a} = 0$, and $\mathbf{a} \neq 0$.

Let's re-express this using the eigendecomposition.

\begin{align}
\mathrm{Var}[X]\,\mathbf{a} &= \mathrm{Var}[X] \sum_{i=1}^p (\mathbf{a}^T\mathbf{v}_i)\mathbf{v}_i \tag{10}\\
&= \sum_{i=1}^p (\mathbf{a}^T\mathbf{v}_i)\,\mathrm{Var}[X]\,\mathbf{v}_i \tag{11}\\
&= \sum_{i=1}^p (\mathbf{a}^T\mathbf{v}_i)\lambda_i \mathbf{v}_i \tag{12}\\
\mathbf{a}^T\mathrm{Var}[X]\,\mathbf{a} &= \left(\sum_{j=1}^p (\mathbf{a}^T\mathbf{v}_j)\mathbf{v}_j\right)^T \sum_{i=1}^p (\mathbf{a}^T\mathbf{v}_i)\lambda_i\mathbf{v}_i \tag{13}\\
&= \sum_{i=1}^p \sum_{j=1}^p (\mathbf{a}^T\mathbf{v}_i)(\mathbf{a}^T\mathbf{v}_j)\lambda_i\, \mathbf{v}_j^T\mathbf{v}_i \tag{14}\\
&= \sum_{i=1}^p (\mathbf{a}^T\mathbf{v}_i)^2 \lambda_i \tag{15}
\end{align}

Since $(\mathbf{a}^T\mathbf{v}_i)^2 \geq 0$ and $\lambda_i \geq 0$, the only way the whole sum can be zero is if $(\mathbf{a}^T\mathbf{v}_i)^2 > 0$ only when $\lambda_i = 0$. We have therefore established the following:

1. The predictors are multicollinear if and only if $\mathrm{Var}[X]$ has zero eigenvalues.
2. Every multicollinear combination of the predictors is either an eigenvector of $\mathrm{Var}[X]$ with zero eigenvalue, or a linear combination of such eigenvectors.

3.1 The Geometric View

Every eigenvector of $\mathrm{Var}[X]$ points out a direction in the space of predictors. The leading eigenvector $\mathbf{v}_1$, the one going along with the largest eigenvalue, points out the direction of highest variance (and that variance is $\lambda_1$). The next-to-leading eigenvector, $\mathbf{v}_2$, points out the direction orthogonal to $\mathbf{v}_1$ which has the highest variance, and so forth down the line. The eigenvectors of $\mathrm{Var}[X]$ are also called the principal components of the predictors, because of their role as the directions of maximum variance.

The eigenvectors going along with zero eigenvalues point out directions in the predictor space along which there is no variance, precisely because those directions amount to weighted sums of the original variables which equal constants. The $q$ non-zero eigenvalues mark out the $q$-dimensional subspace in which all the data vectors lie. If $q < p$, then the predictors are rank-deficient, and the rank of $\mathbf{x}$ is just $q$.


3.2 Finding the Eigendecomposition

Because finding eigenvalues and eigenvectors of matrices is so useful for so many situations, mathematicians and computer scientists have devoted incredible efforts over the last two hundred years to fast, precise algorithms for computing them. This is not the place to go over how those algorithms work; it is the place to say that much of the fruit of those centuries of effort is embodied in the linear algebra packages R uses. Thus, when you call

eigen(A)

you get back a list, containing the eigenvalues of the matrix A (in a vector), and its eigenvectors (in a matrix), and this is both a very fast and a very reliable calculation. If your matrix has very special structure (e.g., it’s sparse, meaning almost all its entries are zero), there are more specialized packages adapted to your needs, but we don’t pursue this further here; for most data-analytic purposes, ordinary eigen will do.

3.3 Using the Eigendecomposition

1. Find the eigenvalues and eigenvectors.
2. If any eigenvalues are zero, the data is multicollinear; if any are very close to zero, the data is nearly multicollinear.
3. Examine the corresponding eigenvectors. These indicate the linear combinations of variables which equal constants (or are nearly constant if the eigenvalue is only nearly zero). Ideally, these will be combinations of a reasonably small number of variables (i.e., most of the entries in the eigenvector will be zero), so you can ask whether there are substantive reasons to delete one or more of those predictors.

3.3.1 Example I’ll make up some data which displays exact multi-collinearity. Let’s say that X1 and X2 are both Gaussian with mean 70 and 15, and are uncorrelated; that X3 = (X1 + X2)/2; and that Y = 0.7X1 + 0.3X2 + , with  ∼ N(0, 15).


# Simulation: two independent Gaussians
x1 <- rnorm(100, mean=70, sd=15)
x2 <- rnorm(100, mean=70, sd=15)
# Add in a linear combination of X1 and X2
x3 <- (x1+x2)/2
# X4 is somewhat correlated with X1 but not relevant to Y
x4 <- x1 + runif(100, min=-100, max=100)
# Y is a linear combination of the X's plus noise
y <- 0.7*x1 + 0.3*x2 + rnorm(100, mean=0, sd=sqrt(15))
df <- data.frame(x1=x1, x2=x2, x3=x3, x4=x4, y=y)

Figure 2: Small simulation illustrating exact collinearity.


[Figure 3 about here: pairs plot of x1, x2, x3, x4, and y.]

pairs(df)
cor(df)

##             x1          x2        x3         x4         y
## x1  1.00000000 -0.01979669 0.7290418 0.29354541 0.8810356
## x2 -0.01979669  1.00000000 0.6699024 0.03450894 0.3263256
## x3  0.72904178  0.66990244 1.0000000 0.24161019 0.8776559
## x4  0.29354541  0.03450894 0.2416102 1.00000000 0.3006694
## y   0.88103556  0.32632563 0.8776559 0.30066941 1.0000000

Figure 3: Pairs plot and correlation matrix for the example of Figure 2. Notice that neither the pairs plot nor the correlation matrix reveals the problem, because it only arises when considering $X_1$, $X_2$ and $X_3$ at once.


# Create the variance matrix of the predictor variables
var.x <- var(df[,c("x1","x2","x3","x4")])
# Find the eigenvalues and eigenvectors
var.x.eigen <- eigen(var.x)
# Which eigenvalues are (nearly) 0?
(zero.eigenvals <- which(var.x.eigen$values < 1e-12))

## [1] 4

# Display the corresponding vectors
(zero.eigenvectors <- var.x.eigen$vectors[,zero.eigenvals])

## [1] 4.082483e-01 4.082483e-01 -8.164966e-01 3.330669e-16

Figure 4: Example of using the eigenvectors of Var [X] to find collinear combinations of the predictor variables. Here, what this suggests is that −X1 −X2 +2X3 = constant. This is correct, since X3 = (X1 + X2)/2, but the eigen-decomposition didn’t know this; it discovered it.


3.4 Principal Components Regression Let’s define some new variables:

\begin{align}
W_1 &= \mathbf{v}_1^T X \tag{16}\\
W_i &= \mathbf{v}_i^T X \tag{17}\\
W_p &= \mathbf{v}_p^T X \tag{18}
\end{align}

$W_1$ is the projection of the original data vector $X$ onto the leading eigenvector, or the first principal component. It is called the score on the first principal component. It has a higher (sample) variance than any other linear function of the original predictors. $W_2$ is the projection or score on the second principal component. It has more variance than any other linear combination of the original predictors which is uncorrelated with $W_1$. In fact, $\widehat{\mathrm{Cov}}[W_i, W_j] = 0$ if $i \neq j$.

In principal components regression, we pick some $k \leq p$ and use the model
$$Y = \gamma_0 + \gamma_1 W_1 + \cdots + \gamma_k W_k + \epsilon$$
where as usual we presume $\epsilon$ has expectation 0, constant variance, and no correlation from one observation to another. (I use the Greek letter $\gamma$, instead of $\beta$, to emphasize that these coefficients are not directly comparable to those of our original linear model.) We are regressing not on our original variables, but on uncorrelated linear combinations of those variables.

If $k = p$, then we get exactly the same predictions and fitted values as in the original linear model, though the coefficients aren't the same. This would amount to doing a change of coordinates in the space of predictors, so that all of the new coordinates were uncorrelated, but wouldn't otherwise change anything. If there are only $q < p$ non-zero eigenvalues, we should not use $k > q$. Using $k = q$ uses all the linear combinations of the original predictors which aren't collinear. However, we might deliberately pick $k < q$ so as to simplify our model. As I said above, the principal components are the directions in the predictor space with the highest variance, so by using a small $k$ we confine ourselves to those directions, and ignore all the other aspects of our original predictors. This may introduce bias, but should reduce the variance in our predictions. (Remember that the variance in our coefficient estimates, and so in our predictions, goes down with the variance of the predictor variables.)

There are a number of things to be said about principal components regression.

1. We need some way to pick $k$. The in-sample MSE will decrease as $k$ grows (why?), but this might not be a good guide to out-of-sample predictions, or to whether the modeling assumptions are fulfilled.
2. The PC regression can be hard to interpret.


The last point needs some elaboration. Each one of the principal components is a linear combination of the original variables. Sometimes these are easy to interpret, other times their meaning (if they have any) is thoroughly obscure. Whether this matters depends very much on how deeply committed you are to interpreting the coefficients.

As for picking $k$, there are two (potentially) rival objectives. One is to pick the number of components which will let us predict well. The in-sample mean squared error has to decrease as $k$ grows, so we would really like some measure of actual out-of-sample or generalization error; the cross-validation method I will describe below is applicable, but there are other potentially-applicable techniques. The other objective is to have a set of variables which satisfy the assumptions of the multiple linear regression model. In my experience, it is not very common for principal components regression to actually satisfy the modeling assumptions, but it can work surprisingly well as a predictive tool anyway.
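As a sketch of how principal components regression looks in R (my own illustration, not code from the notes): prcomp() computes the components, and we regress on the first k scores. The data frame df is the running simulated example, and k = 2 is an arbitrary choice.

# Sketch of principal components regression on the simulated df from Figure 2
pcr.sketch <- function(y, predictors, k) {
    pca <- prcomp(predictors, center=TRUE, scale.=FALSE)  # eigendecomposition of Var[X]
    scores <- pca$x[, 1:k, drop=FALSE]                    # the scores W_1, ..., W_k
    lm(y ~ scores)                                        # regress the response on the scores
}
# e.g. summary(pcr.sketch(df$y, df[, c("x1","x2","x3","x4")], k=2))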

4 Ridge Regression

The real problem with collinearity is that when it happens, there isn't a unique solution to the estimating equations. Rather, there are infinitely many solutions, which all give the minimum mean squared error. It feels perverse, somehow, to get rid of predictors because they give us too many models which fit too well. A better response is to pick one of these solutions, by adding some other criterion we'd like to see in the model.

There are many ways to do this, but one which works well in practice is the following: all else being equal, we prefer models with smaller slopes, ones closer to zero. Specifically, let's say that we prefer the length of the coefficient vector, $\|\beta\|$, to be small. Now, at least abstractly, we have a situation like that shown in Figure 5. The black line marks out all of the $\beta_1, \beta_2$ combinations which give us exactly the same mean squared error. They all give the same error because of a collinearity between $X_1$ and $X_2$. But there is a single point on the black line which comes closest to the origin — it touches the solid grey circle. Other points on the line, while they have equal MSEs, have larger $\|\beta\|$ (they lie on one of the dashed grey circles), so we don't use them.

What if everything else isn't equal? (Say, for instance, that the data are only nearly collinear.) We'll need some way to trade off having a smaller MSE against having a smaller vector of coefficients. Since we're looking at squared error, I hope it is somewhat plausible that we should also look at the squared length of the coefficient vector; if you don't buy that, you can at least take my word for it that it simplifies the math. Specifically, let's replace our old optimization problem
$$\min_b \frac{1}{n}(\mathbf{y} - \mathbf{x}b)^T(\mathbf{y} - \mathbf{x}b)$$
with a new, penalized one, stated below after the code for Figure 5.


# Add a circle to an existing plot
# R, bizarrely, does not have any built-in function for this
# Inputs: x coordinate of center; y coordinate of center; radius;
#   number of equiangular steps; additional graphical parameters
# Outputs: none
# Side-effects: a circle is added to the existing plot
circle <- function(x0, y0, r, n=1000, ...) {
    theta <- seq(from=0, to=2*pi, length.out=n)  # Angles
    x <- x0 + r*cos(theta)  # x coordinates
    y <- y0 + r*sin(theta)  # y coordinates
    lines(x, y, ...)  # Draw the lines connecting all the points, in order
}
plot(0, type="n", xlab=expression(beta[1]), ylab=expression(beta[2]),
     xlim=c(-10,10), ylim=c(-10,10))
abline(a=10, b=-2)
points(0,0)
circle(0, 0, sqrt(20), col="grey")
points(4, 2, col="black", pch=19)
circle(0, 0, 5, col="grey", lty="dashed")
circle(0, 0, 6, col="grey", lty="dashed")

Figure 5: Geometry of ridge regression when the predictors are collinear. The black line shows all the combinations of $\beta_1$ and $\beta_2$ which minimize the MSE. We chose the coefficient vector (the black point) which comes closest to the origin (the dot). Equivalently, this is the parameter vector with the smallest MSE among all the points at equal distance from the origin (solid grey circle). Other coefficient vectors either have a worse MSE (they don't lie on the black line), or are further from the origin (they lie on one of the dashed grey circles).

The new, penalized optimization problem is
$$\min_b \frac{1}{n}(\mathbf{y} - \mathbf{x}b)^T(\mathbf{y} - \mathbf{x}b) + \frac{\lambda}{n}\|b\|^2$$
Here the penalty factor $\lambda > 0$ tells us the rate at which we're willing to make a trade-off between having a small mean squared error and having a short vector of coefficients[8]. We'd accept $\|b\|^2$ growing by 1 if it reduced the MSE by more than $\lambda/n$. We'll come back later to how to pick $\lambda$.

It's easy to pose this optimization problem: can we actually solve it, and is the solution good for anything? Solving is actually straightforward. We can re-write $\|b\|^2$ as, in matrix notation, $b^T b$, so the gradient is
$$\nabla_b\left[\frac{1}{n}(\mathbf{y}-\mathbf{x}b)^T(\mathbf{y}-\mathbf{x}b) + \frac{\lambda}{n}b^T b\right] = \frac{2}{n}\left(-\mathbf{x}^T\mathbf{y} + \mathbf{x}^T\mathbf{x}b + \lambda b\right)$$
Set this to zero at the optimum, $\tilde{\beta}_\lambda$,
$$\mathbf{x}^T\mathbf{y} = (\mathbf{x}^T\mathbf{x} + \lambda\mathbf{I})\tilde{\beta}_\lambda$$
and solve:
$$\tilde{\beta}_\lambda = (\mathbf{x}^T\mathbf{x} + \lambda\mathbf{I})^{-1}\mathbf{x}^T\mathbf{y}$$
Notice what we've done here: we've taken the old matrix $\mathbf{x}^T\mathbf{x}$ and we've added $\lambda$ to every one of its diagonal entries. (This is the "ridge" that gives ridge regression its name.) If the predictor variables were centered, this would amount to estimating the coefficients as though each of them had a little bit more variance than they really did, while leaving all the covariances alone. This would break any exact multicollinearity, so the inverse always exists, and there is always some solution.
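To see that this really is just one extra matrix term, here is a minimal sketch of the ridge estimate computed directly from the formula (my own helper, not the notes' code); df is the simulated data frame from Figure 2, and the value of lambda is arbitrary. It centers the variables, for the reason discussed just below.

# Sketch: ridge coefficients (x^T x + lambda I)^{-1} x^T y, on centered data
ridge.coefs <- function(x, y, lambda) {
    x <- scale(x, center=TRUE, scale=FALSE)   # center the predictors (see the intercept discussion below)
    y <- y - mean(y)                          # center the response
    solve(t(x) %*% x + lambda*diag(ncol(x)), t(x) %*% y)
}
# e.g. ridge.coefs(as.matrix(df[, c("x1","x2","x3","x4")]), df$y, lambda=0.01)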

What about the intercept? The intercept is different from the other coefficients; it's just a fudge factor we put in to make sure that the regression line goes through the mean of the data. It doesn't make as much sense to penalize its length, so ridge regression is usually done after centering all the variables, both the predictors and the response. This doesn't change the slopes, but sets the intercept to zero. Then, after we have $\tilde{\beta}_\lambda$, we get the intercept for the original, un-centered variables by plugging it in to $\bar{y} = \tilde{\beta}_0 + \bar{\mathbf{x}}\tilde{\beta}_\lambda$.

There are two prices to doing this.

1. We need to pick $\lambda$ (equivalently, $c$) somehow.

2. Our estimates of the coefficients are no longer unbiased, but are "shrunk" towards zero.

[8] $\lambda = 0$ means we ignore the length of the coefficient vector and we're back to ordinary least squares. $\lambda < 0$ would mean we prefer larger coefficients, and would lead to some truly perverse consequences.


Point (2) is not as bad as it might appear. If λ is fixed, and we believe our modeling assumptions, we can calculate the bias and variance of the ridge estimates:

\begin{align}
\mathbb{E}\left[\tilde{\beta}_\lambda\right] &= \left(\mathbf{x}^T\mathbf{x}+\lambda\mathbf{I}\right)^{-1}\mathbf{x}^T\,\mathbb{E}[\mathbf{Y}] \tag{20}\\
&= \left(\mathbf{x}^T\mathbf{x}+\lambda\mathbf{I}\right)^{-1}\mathbf{x}^T\mathbf{x}\,\beta \tag{21}\\
\mathrm{Var}\left[\tilde{\beta}_\lambda\right] &= \mathrm{Var}\left[\left(\mathbf{x}^T\mathbf{x}+\lambda\mathbf{I}\right)^{-1}\mathbf{x}^T\mathbf{Y}\right] \tag{22}\\
&= \mathrm{Var}\left[\left(\mathbf{x}^T\mathbf{x}+\lambda\mathbf{I}\right)^{-1}\mathbf{x}^T\epsilon\right] \tag{23}\\
&= \left(\mathbf{x}^T\mathbf{x}+\lambda\mathbf{I}\right)^{-1}\mathbf{x}^T\,\sigma^2\mathbf{I}\,\mathbf{x}\left(\mathbf{x}^T\mathbf{x}+\lambda\mathbf{I}\right)^{-1} \tag{24}\\
&= \sigma^2\left(\mathbf{x}^T\mathbf{x}+\lambda\mathbf{I}\right)^{-1}\mathbf{x}^T\mathbf{x}\left(\mathbf{x}^T\mathbf{x}+\lambda\mathbf{I}\right)^{-1} \tag{25}
\end{align}

Notice how both of these expressions smoothly approach the corresponding formulas for ordinary least squares as $\lambda \to 0$. Indeed, under the Gaussian noise assumption, $\tilde{\beta}_\lambda$ actually has a Gaussian distribution with the given expectation and variance.

Of course, if $\lambda$ is not fixed in advance, we'd have to worry about the extra randomness introduced by how $\lambda$ is chosen. A common practice here is data splitting: randomly divide the data into two parts, and use one to pick $\lambda$ and the other to then actually estimate the parameters, which will have the stated bias and standard errors. (Typically, but not necessarily, the two parts of the data are equally big.)

As for point (1), picking $\lambda$, this is also a solvable problem. The usual approach is cross-validation: for a lot of different values of $\lambda$, estimate the model on all but one data point, and then see how well different $\lambda$'s predict that held-out data point. Since there's nothing special about one data point rather than another, do this for each data point, and average the out-of-sample squared errors. Pick the $\lambda$ which does best at predicting data it didn't get to see. (There are lots of variants, some of which we'll cover later in the course.)
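Here is a minimal sketch of that leave-one-out recipe, written around the ridge.coefs() helper sketched earlier (so all of the names here are my illustrations, not the notes' code); the grid of candidate lambdas is arbitrary.

# Sketch: leave-one-out cross-validation for the ridge penalty lambda
cv.ridge.lambda <- function(x, y, lambdas) {
    loo.mse <- sapply(lambdas, function(lambda) {
        sq.errs <- sapply(1:nrow(x), function(i) {
            beta <- ridge.coefs(x[-i, , drop=FALSE], y[-i], lambda)  # fit without point i
            train.means <- colMeans(x[-i, , drop=FALSE])
            pred <- mean(y[-i]) + drop((x[i, ] - train.means) %*% beta)
            (y[i] - pred)^2
        })
        mean(sq.errs)
    })
    lambdas[which.min(loo.mse)]   # the lambda with the best held-out predictions
}
# e.g. cv.ridge.lambda(as.matrix(df[, c("x1","x2","x3","x4")]), df$y,
#                      lambdas=10^seq(-4, 2, by=0.25))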

4.1 Some Words of Advice about Ridge Regression

Units and Standardization   If the different predictor variables don't have physically comparable units[9], it's a good idea to standardize them first, so they all have mean 0 and variance 1. Otherwise, penalizing $\beta^T\beta = \sum_{i=1}^p \beta_i^2$ seems to be adding up apples, oranges, and the occasional bout of regret. (Some people like to pre-standardize even physically comparable predictors.)

9E.g., if they’re all masses expressed in grams, they’re comparable; if some masses are in kilograms or pounds, they’re not comparable but they could easily be made so; if some of them are lengths or prices, they’re not physically comparable no matter what units you use.


Stabilization I’ve presented ridge regression as a way of dealing with multi- collinearity, which it is, but it’s also perfectly possible to use it when that isn’t an issue. The goal there is to stabilize the estimates — to reduce their vari- ance, at the cost of a bit of bias. If the linear model is perfectly well-specified, there’s little point to doing this, but it can often improve predictions a lot when the model is mis-specified.

4.2 Penalties vs. Constraints

I explained ridge regression above as applying a penalty to long coefficient vectors. There is an alternative perspective which is mathematically equivalent, where instead we constrain the length of the coefficient vector. To see how this works, let's start by setting up the problem: pick some $c > 0$, and then ask for
$$\min_{b:\ \|b\| \leq c} \frac{1}{n}(\mathbf{y}-\mathbf{x}b)^T(\mathbf{y}-\mathbf{x}b)$$
Since $\|b\| \leq c$ if and only if $\|b\|^2 \leq c^2$, we might as well say
$$\min_{b:\ \|b\|^2 \leq c^2} \frac{1}{n}(\mathbf{y}-\mathbf{x}b)^T(\mathbf{y}-\mathbf{x}b)$$
At this point, we invoke the magic of Lagrange multipliers[10]: we can turn a constrained problem into an unconstrained problem with an additional term, and an additional variable:
$$\min_{b,\lambda} \frac{1}{n}(\mathbf{y}-\mathbf{x}b)^T(\mathbf{y}-\mathbf{x}b) + \lambda(b^T b - c^2)$$
Minimizing over $\lambda$ means that either $\lambda = 0$, or $b^T b = c^2$. The former situation will apply when the unconstrained minimum is within the ball $\|b\| \leq c$; otherwise, the constraint will "bite", and $\lambda$ will take a non-zero value to enforce it. As $c$ grows, the $\lambda$ needed to enforce the constraint will become smaller[11]. When we minimize over $b$, the precise value of $c^2$ doesn't matter; only $\lambda$ does. If we know $\lambda$, then we are effectively just solving the problem
$$\min_b \frac{1}{n}(\mathbf{y}-\mathbf{x}b)^T(\mathbf{y}-\mathbf{x}b) + \lambda b^T b$$
which is the penalized regression problem we solved before.

4.3 Ridge Regression in R

There are several R implementations of ridge regression; the MASS package contains one, lm.ridge, which needs you to specify $\lambda$. The ridge package (Cule, 2014) has linearRidge, which gives you the option to set $\lambda$, or to select it automatically via cross-validation. (See the next section for a demo of this in action.) There are probably others I'm not aware of.

[10] See further reading, if you have forgotten about Lagrange multipliers.
[11] In economic terms, λ is the internal or "shadow" price we'd pay, in units of MSE, to loosen the constraint.
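For concreteness, a minimal sketch of the lm.ridge route on the running simulated df (my own illustration; the grid of λ values is an arbitrary choice):

# Sketch: ridge regression via MASS, which wants an explicit grid of lambdas
library(MASS)
ridge.path <- lm.ridge(y ~ x1 + x2 + x3 + x4, data=df, lambda=seq(0, 1, by=0.01))
# coef(ridge.path) returns a matrix of coefficients, one row per value of lambda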


4.4 Other Penalties/Constraints

Ridge regression penalizes the mean squared error with $\|b\|^2$, the squared length of the coefficient vector. This suggests the idea of using some other measure of how big the vector is, some other norm. A mathematically popular family of norms are the $\ell_q$ norms[12], defined as
$$\|b\|_q = \left(\sum_{i=1}^p |b_i|^q\right)^{1/q}$$

The usual Euclidean length is $\ell_2$, while $\ell_1$ is

$$\|b\|_1 = \sum_{i=1}^p |b_i|$$
and (by continuity) $\|b\|_0$ is just the number of non-zero entries in $b$. When $q \neq 2$, penalizing $\|b\|_q$ does not, usually, have a nice closed-form solution like ridge regression does. Finding the minimum of the mean squared error under an $\ell_1$ penalty is called lasso regression, or the lasso estimator, or just the lasso[13]. This has the nice property that it tends to give sparse solutions — it sets coefficients to be exactly zero (unlike ridge). There are no closed forms for the lasso, but there are efficient numerical algorithms. Penalizing $\ell_0$, the number of non-zero coefficients, sounds like a good idea, but there are, provably, no algorithms which work substantially faster than trying all possible combinations of variables.
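For reference, a hedged sketch of the lasso on the running example, using the glmnet package (an assumption: it is not used anywhere else in these notes, and must be installed). glmnet wants a numeric matrix of predictors and a response vector, and alpha = 1 selects the ℓ1 penalty.

# Sketch: the lasso, with lambda chosen by glmnet's built-in cross-validation
library(glmnet)
x.mat <- as.matrix(df[, c("x1", "x2", "x3", "x4")])
lasso.cv <- cv.glmnet(x.mat, df$y, alpha=1)
coef(lasso.cv, s="lambda.min")   # note that some coefficients are set exactly to zero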

5 High-Dimensional Regression

One situation where we know that we will always have multicollinearity is when $n < p$. After all, $n$ points always define a linear subspace of (at most) $n-1$ dimensions[14]. When the number of predictors we measure for each data point is bigger than the number of data points, the predictors have to be collinear, indeed multicollinear. We are then said to be in a high-dimensional regime. This is an increasingly common situation in data analysis. A very large genetic study might sequence the genes of, say, 500 people — but measure 500,000 genetic markers in each person[15]. If we want to predict some characteristic of the people from the genes (say their height, or blood pressure, or how quickly they would reject a transplanted organ), there is simply no way to estimate

[12] Actually, people usually call them the $\ell_p$ norms, but we're using p for the number of predictor variables.
[13] Officially, "lasso" here is an acronym for "least absolute shrinkage and selection operator". If you believe that phrase came before the acronym, I would like your help in getting some money out of Afghanistan.
[14] Two points define a line, unless the points coincide; three points define a plane, unless the points fall on a line; etc.
[15] I take these numbers, after rounding, from an actual study done in the CMU statistics department a few years ago.

a model by ordinary least squares. Any approach to high-dimensional regression must involve either reducing the number of dimensions, until it's $< n$ (as in principal components regression), or penalizing the estimates to make them stable and regular (as in ridge regression), or both.

There is a bit of a myth in recent years that "big data" will solve all our problems, by letting us make automatic predictions about everything without any need for deep understanding. The truth is almost precisely the opposite: when we can measure everything about everyone, $p/n$ blows up, and we are in desperate need of ways of filtering the data and/or penalizing our models. Blindly relying on generic methods of dimension reduction or penalization is going to impose all sorts of bizarre biases, and will work much worse than intelligent dimension reduction and appropriate penalties, based on actual understanding.
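A tiny made-up illustration of the basic obstacle (the numbers here are arbitrary): with more predictors than data points, least squares cannot identify all the coefficients, and lm() reports NA for the ones it cannot estimate.

# Sketch: p > n, so ordinary least squares is under-determined
set.seed(36)
n <- 10; p <- 20
x.hd <- matrix(rnorm(n*p), nrow=n)   # 10 observations of 20 predictors
y.hd <- rnorm(n)
sum(is.na(coef(lm(y.hd ~ x.hd))))    # 11 of the 21 coefficients come back NA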

5.1 Demo Let’s apply ridge regression to the simulated data already created, where one predictor variable (X3) is just the average of two others (X1 and X2). library(ridge) # Fit a ridge regression # lambda="automatic" is actually the default setting demo.ridge <- linearRidge(y~ x1+ x2+ x3+ x4, data=df, lambda="automatic") coefficients(demo.ridge)

## (Intercept)          x1          x2          x3          x4
## 3.755564075 0.419326727 0.035301803 0.495563414 0.005749006

demo.ridge$lambda

## [1] 0.005617822 0.009013389 0.006462562

We may compare the predictions we get from this to the predictions we'd get from dropping, say, $X_2$ (Figure 6). One substantial advantage of ridge regression is that we don't have to make any decisions about which variables to remove, but can match (to extremely high accuracy) what we'd get after dropping variables.


[Figure 6 about here: scatter-plot of predictions from least squares (vertical axis) against predictions from ridge regression (horizontal axis).]

Figure 6: Comparison of fitted values from an ordinary least squares regression where we drop X2 from our running example (vertical axis) against fitted values from a ridge regression on all variables (horizontal axis); the two sets of numbers are not exactly equal, though they are close.


6 Further Reading

Ridge regression, by that name, goes back to Hoerl and Kennard (1970). Essentially the same idea was introduced some years earlier by the great Soviet mathematician A. N. Tikhonov in a series of papers about "regularizing ill-posed optimization problems", i.e., adding penalties to optimization problems to make a solution unique, or to make the solution more stable. For this reason, ridge regression is sometimes also called "Tikhonov regularization" of linear least squares[16].

The use of principal components as a technique of dimension reduction goes back at least to Hotelling in the 1930s, or arguably to Karl Pearson around 1900. I have not been able to trace who first suggested regressing a response variable on the principal components of the predictors. Dhillon et al. (2013) establishes a surprising connection between regression on the principal components and ridge regression.

On the use of Lagrange multipliers to enforce constraints on optimization problems, and the general equivalence between penalized and constrained optimization, see Klein (2001), or Shalizi (forthcoming, §E.3).

For high-dimensional regression in general, the opening chapters of Bühlmann and van de Geer (2011) are very good. Bühlmann (2014) and Wainwright (2014) may be more accessible review articles on basically the same topics.

For a representative example of the idea that big data "makes theory obsolete", see Anderson (2008); for a reply from someone who actually understands high-dimensional statistics, see http://earningmyturns.blogspot.com/2008/06/end-of-theory-data-deluge-makes.html.

[16] There are a large number of variant transliterations of "Tikhonov".

References

Anderson, Chris (June 2008). "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete." Wired, 16(17). URL http://www.wired.com/2008/06/pb-theory/.

Bühlmann, Peter (2014). "High-Dimensional Statistics with a View Toward Applications in Biology." Annual Review of Statistics and Its Application, 1: 255–278. doi:10.1146/annurev-statistics-022513-115545.

Bühlmann, Peter and Sara van de Geer (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Berlin: Springer-Verlag.

Cule, Erika (2014). ridge: Ridge Regression with automatic selection of the penalty parameter. URL http://CRAN.R-project.org/package=ridge. R package version 2.1-3.

Dhillon, Paramveer S., Dean P. Foster, Sham M. Kakade and Lyle H. Ungar (2013). "A Risk Comparison of Ordinary Least Squares vs Ridge Regression." Journal of Machine Learning Research, 14: 1505–1511. URL http://jmlr.org/papers/v14/dhillon13a.html.

Hoerl, Arthur E. and Robert W. Kennard (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, 12. URL http://www.jstor.org/pss/1267351.

Klein, Dan (2001). "Lagrange Multipliers without Permanent Scarring." Online tutorial. URL http://dbpubs.stanford.edu:8091/~klein/lagrange-multipliers.pdf.

Shalizi, Cosma Rohilla (forthcoming). Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. URL http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.

Wainwright, Martin J. (2014). "Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues." Annual Review of Statistics and Its Application, 1: 233–253. doi:10.1146/annurev-statistics-022513-115643.
