
Lecture 5: Inference in OLS Regression: Matrix Form

Dave Armstrong
University of Wisconsin – Milwaukee, Department of Political Science
e: [email protected]
w: www.quantoid.net/ICPSR.php

Outline

Explained Variance in Multiple Regression
Ordinary Least Squares: Matrix Form
Properties of the OLS Estimator
Statistical Inference for OLS
Example: Duncan Data
Model (Mis)Specification

Explained Variance in Multiple Regression

Table: OLS Regressions of Occupational Prestige

                                        Model 1    Model 2    Model 3    Model 4
Intercept                               27.141*   -10.732*    48.693*    -6.794*
                                        (2.268)    (3.677)    (2.308)    (3.239)
Mean income of incumbents (in $1000)     2.897*                           1.314*
                                        (0.283)                          (0.278)
Mean incumbent years of education                   5.361*                4.187*
                                                   (0.332)               (0.389)
Percentage of women incumbents                                 -0.064     -0.009
                                                              (0.054)    (0.030)
N                                           102        102        102        102
R^2                                       0.511      0.723      0.014      0.798
adj. R^2                                  0.506      0.720      0.004      0.792
Resid. sd                                12.090      9.103     17.169      7.846

Main entries are OLS coefficients; standard errors in parentheses.
* indicates significance at p < 0.05 (two-tailed).

Partial Relationships

Partial relationships are those that "control" for the effects of other variables. How does this "controlling" happen?
• Controlling happens by removing the parts of X1 and Y that are explained by X2, ..., Xk when calculating the slope coefficient.
• Note that if X1 can be perfectly predicted by X2, ..., Xk, then there will be nothing left to be related to Y.
• Because of the way that controlling happens, it is impossible to say how much variance each variable uniquely explains, because there are overlaps in the variance of the X variables.

Partial Relationships (2)

To find the partial relationship, we can just use multiple regression, or we could do the following:
• First, partial out the effects of education and women on prestige:

  Prestige_i = A_1 + B_1 Education_i + D_1 Women_i + E_i^P

• Then, partial out the effects of education and women on income:

  Income_i = A_2 + B_2 Education_i + D_2 Women_i + E_i^I

• Finally, calculate the simple regression of the prestige residuals on the income residuals:

  E_i^P = A + B E_i^I + E_i

Partial Relationships in R

> EP <- lm(prestige ~ education + women, data=Prestige)$residuals
> EI <- lm(income ~ education + women, data=Prestige)$residuals
> partial.mod <- lm(EP ~ EI)
> coef(mod4)
 (Intercept)       income    education        women
-6.794334203  1.313560428  4.186637275 -0.008905157
> coef(partial.mod)
 (Intercept)           EI
8.190603e-16 1.313560e+00

Note that the slope on EI (1.3136) matches the income coefficient from Model 4 above (here mod4 is the Model 4 regression from the table).


Ordinary Least Squares: Matrix Form

Vectors and Matrices

• A vector is a listing of numbers in a particular order. We can have row-vectors, v = [1, 2, 3, 4], or column vectors,

      [1]
  v = [2]
      [3]
      [4]

• The ordering of the numbers matters; thus if v* = [4, 3, 2, 1], then v ≠ v*.

Vectors

Consider the following two vectors: u = [3, 3, 3, 3] (or u = [u1, u2, u3, u4]) and v = [1, 2, 3, 4] (or v = [v1, v2, v3, v4]).
• u + v = [u1 + v1, u2 + v2, u3 + v3, u4 + v4] = [3 + 1, 3 + 2, 3 + 3, 3 + 4] = [4, 5, 6, 7]
• u − v = [u1 − v1, u2 − v2, u3 − v3, u4 − v4] = [3 − 1, 3 − 2, 3 − 3, 3 − 4] = [2, 1, 0, −1]
• Scalar multiplication and division simply result in the elements of the vector being multiplied or divided by the scalar.

Vectors in R

We already know how to make vectors in R; it's just with the c() command.

> u <- c(3,3,3,3)
> v <- c(1,2,3,4)
> u + v
[1] 4 5 6 7
> u - v
[1] 2 1 0 -1


Conformability

Vectors can only be added, subtracted, etc. if they are "conformable".
• The vectors have to be of the same size.
• In the examples above (addition and subtraction) this amounts to the vectors having the same length. If vectors do not have the same length, they cannot be added together.
• This will amount to something different when we talk about multiplication of matrices.

Inner Product

The inner product (or dot product) of vectors is an important type of calculation.
• The inner product of two vectors results in a scalar (i.e., a single number).
• For example: u · v = [u1 v1 + u2 v2 + u3 v3 + u4 v4]. With the data above:

  u · v = [3 × 1 + 3 × 2 + 3 × 3 + 3 × 4] = [3 + 6 + 9 + 12] = 30

> u %*% v
     [,1]
[1,]   30

Outer Product I

Where the inner product of vectors generates a scalar (a single value), the outer product generates a matrix.
• A row vector has dimensions 1 × k and a column vector has dimensions k × 1.
• In matrix multiplication, as we'll see shortly, matrices are conformable if they have the same inner dimension. So, we could multiply something 1 × k by something that was k × n, where n is some positive integer. We should do the same with vectors.
• Both u and v are row-vectors above. To make one of them a column-vector, we will have to transpose it. Transposing interchanges the rows and columns of a matrix/vector. We can indicate transposition by v′.

Outer Product II

• When u is 1 × k and v′ is k × 1, the result of uv′ is 1 × 1, because the result is the sum of the product of pairwise elements of each row (of which there is 1) and each column (of which there is 1).
• The outer product switches these two: u′v. Since u′ is k × 1 and v is 1 × k, the result is a matrix with dimensions k × k.

        [3]                [3  6  9 12]
  u′v = [3] [1, 2, 3, 4] = [3  6  9 12]     (1)
        [3]                [3  6  9 12]
        [3]                [3  6  9 12]

• To be perfectly correct above, we should have multiplied u · v′.


Outer Product in R

> u %o% v
     [,1] [,2] [,3] [,4]
[1,]    3    6    9   12
[2,]    3    6    9   12
[3,]    3    6    9   12
[4,]    3    6    9   12
> outer(u, v, "*")
     [,1] [,2] [,3] [,4]
[1,]    3    6    9   12
[2,]    3    6    9   12
[3,]    3    6    9   12
[4,]    3    6    9   12

Matrices

A matrix is a rectangular arrangement of numbers. Matrices have two dimensions - the number of rows and the number of columns. The ordering in both the rows and columns of the matrix matters. For example:

  X = [1 2]
      [3 4]          (2)

is a 2 × 2 matrix. We could say x[2,1] = x21 = 3; that is, the second row, first column of X is 3.

Special Matrices

• A square matrix is one with the same number of rows and columns. The rows and columns needn't contain the same elements, but the number of rows and the number of columns do need to be the same.
• A symmetric matrix is one where xij = xji.
• A diagonal matrix is a square matrix where xij is non-zero when i = j and is zero when i ≠ j.
• An identity matrix is a diagonal matrix where all diagonal elements equal 1.
(A short R illustration of these special matrices appears after the next slide.)

Matrix Addition/Subtraction

We can also do math on matrices. For addition and subtraction, the matrices must be of the same order (that is, they must have the same dimensions).

  X = [x11 x12]   Y = [y11 y12]   X + Y = [x11+y11  x12+y12]
      [x21 x22]       [y21 y22]           [x21+y21  x22+y22]

To test, add X = [1 2; 3 4] and Y = [5 6; 7 8]. Answer: [6 8; 10 12].
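A minimal R sketch of the special matrices just defined (the matrices here are illustrative, not from the lecture): diag() builds identity and diagonal matrices, and symmetry can be checked by comparing a matrix to its transpose.

# identity matrix: a diagonal matrix with 1s on the diagonal
I3 <- diag(3)
# diagonal matrix with the given diagonal elements
D <- diag(c(2, 5, 7))
# a symmetric matrix: S[i,j] equals S[j,i]
S <- matrix(c(1, 2, 2, 4), nrow = 2, ncol = 2)
all(S == t(S))   # TRUE for a symmetric matrix
I3
D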


Matrices in R

When you enter data into a matrix in R, you have to specify how many rows and columns it has. By default, R fills in the matrix by columns.

> x <- matrix(c(1,3,2,4), ncol=2, nrow=2)
> y <- matrix(c(5,7,6,8), ncol=2, nrow=2)
> x + y
     [,1] [,2]
[1,]    6    8
[2,]   10   12

Matrix Multiplication I

Multiplication is where things get a bit tricky, but you might be able to see where we're going.

  X = [x11 x12]   Y = [y11 y12]
      [x21 x22]       [y21 y22]

To figure out what XY is, let's think about breaking down X into row-vectors and Y into column-vectors.
• The first row-vector of X is x1· = [x11 x12]
• The second row-vector of X is x2· = [x21 x22]
• The first column-vector of Y is y·1 = [y11; y21]
• The second column-vector of Y is y·2 = [y12; y22]

Matrix Multiplication II

We can find the product of X and Y by taking inner products of the appropriate row- and column-vectors.

  XY = [x1· y·1   x1· y·2]
       [x2· y·1   x2· y·2]

> x %*% y
     [,1] [,2]
[1,]   19   22
[2,]   43   50

The ij-th element of XY is the inner product of the i-th row of X and the j-th column of Y.

Properties of Matrix Multiplication

• When we see XY, we can say that Y is pre-multiplied by X, or that X is post-multiplied by Y.
• The order of the multiplication matters here, so in general XY ≠ YX.

Much like vectors, matrices can be transposed as well. Here, we interchange the rows and columns, so that xij = x′ji. Sometimes, matrices need to be transposed to make them conformable for multiplication or addition. The properties of the transpose are as follows:
• (X′)′ = X
• (X + Y)′ = Y′ + X′
• (XY)′ = Y′X′
• For a symmetric matrix, X′ = X.
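As a quick check of these transpose rules, a sketch in R (re-creating the x and y matrices used above so the block is self-contained):

x <- matrix(c(1,3,2,4), ncol = 2)
y <- matrix(c(5,7,6,8), ncol = 2)
all(t(x %*% y) == t(y) %*% t(x))   # TRUE: (XY)' = Y'X'
all(t(x + y) == t(x) + t(y))       # TRUE: (X + Y)' = X' + Y'
identical(x %*% y, y %*% x)        # FALSE: order of multiplication matters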


Matrix Inversion

Notice, we haven't talked about matrix division yet. In the scalar world, a × (1/b) = a b⁻¹ = a/b, so division can be expressed as the product of one thing and the inverse of another. In the matrix world, the concept is the same, though the actual mathematics are considerably harder.
• XX⁻¹ = I defines the inverse. The inverse of a square matrix is the matrix by which the original matrix must be multiplied (either pre- or post-) to obtain the identity matrix (diagonal 1's and off-diagonal 0's).
• Example:

  X = [1 2],  then  [1 2] [-2.0  1.0] = [1 0]     (3)
      [3 4]         [3 4] [ 1.5 -0.5]   [0 1]

So the second matrix in the equation is the inverse of X.

Matrix Inverse in R

> x
     [,1] [,2]
[1,]    1    2
[2,]    3    4
> solve(x)
     [,1] [,2]
[1,] -2.0  1.0
[2,]  1.5 -0.5
> x %*% solve(x)
     [,1]         [,2]
[1,]    1 1.110223e-16
[2,]    0 1.000000e+00

Matrix Form of Linear Models (1)

• If we substitute B0 for A (the intercept), the general linear model takes the form:

  Y_i = B0 + B1 X_i1 + B2 X_i2 + ... + Bk X_ik + ε_i

• With the inclusion of a 1 for the constant, the regressors can be collected into a row vector, and thus the equation for each individual observation can be rewritten in vector form:

  Y_i = [1, x_i1, x_i2, ..., x_ik] [B0; B1; B2; ...; Bk] + ε_i
      = x_i b + ε_i
        (1 × (k+1)) ((k+1) × 1)

Matrix Form of Linear Models (2)

• Since each observation has one such equation, it is convenient to combine these equations in a single matrix equation:

  [Y1]   [1  x11 ... x1k] [B0]   [ε1]
  [Y2] = [1  x21 ... x2k] [B1] + [ε2]
  [..]   [..  ..  ...  ..] [..]   [..]
  [Yn]   [1  xn1 ... xnk] [Bk]   [εn]

    y    =       X          b   +   ε
  (n×1)    (n×(k+1))   ((k+1)×1)   (n×1)

• X is called the model matrix, because it contains all the values of the explanatory variables for each observation in the data.
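A small R illustration of the model matrix idea, assuming the car package and the Duncan data that are used later in this lecture:

library(car)       # provides the Duncan data
data(Duncan)
# model.matrix() builds X: a column of 1s plus the regressors
X <- model.matrix(prestige ~ income + education, data = Duncan)
head(X)
dim(X)             # n rows, k + 1 columns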


OLS Fit in Matrix Form

• The fitted linear model is then:

  y = Xb + e

  where b is the vector of fitted slope coefficients and e is the vector of residuals.
• Expressed as a function of b, OLS finds the vector b that minimizes the residual sum of squares:

  S(b) = Σ e_i² = e′e
       = (y − Xb)′(y − Xb)
       = y′y − y′Xb − b′X′y + b′(X′X)b
       = y′y − (2y′X)b + b′(X′X)b

OLS Fit in Matrix Form (2)

  S(b) = y′y − (2y′X)b + b′(X′X)b

• We see that, with respect to the b coefficient vector, there is a constant (y′y), a linear form in b, and a quadratic form in b. To minimize S(b), we need to find the partial first derivative with respect to b:

  ∂S(b)/∂b = 0 − 2X′y + 2X′Xb

• The normal equations are found by setting this derivative to zero:

  X′Xb = X′y

• If X′X is non-singular (of rank k + 1), we can uniquely solve for the least squares coefficients:

  b = (X′X)⁻¹X′y
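A minimal sketch of this closed-form solution on simulated data (the same computation is carried out "by hand" on the Duncan data later in the lecture; the variables here are hypothetical):

set.seed(123)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - 1*x2 + rnorm(n)
X  <- cbind(1, x1, x2)                        # model matrix with a constant
b  <- solve(t(X) %*% X) %*% t(X) %*% y        # b = (X'X)^{-1} X'y
cbind(b, coef(lm(y ~ x1 + x2)))               # matches lm()'s coefficients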

Unique Solution and the Rank of X′X

• The rank of X′X is equal to the rank of X. This attribute leads to two criteria that must be met in order to ensure X′X is nonsingular, and thus obtain a unique solution:
  1. We need at least as many observations as there are coefficients in the model, since the rank of X can be no larger than the smaller of n and k + 1.
  2. The columns of X must not be perfectly linearly related. Perfect collinearity prevents a unique solution, but even near collinearity can cause statistical problems. Moreover, no regressor other than the constant can be invariant - an invariant regressor would be a multiple of the constant. (A short R sketch of what perfect collinearity does appears after the next slide.)

Where are we now?

• We've gone this far and what have we assumed?
  1. Linearity
  2. No perfect collinearity
• What have we not assumed yet?
  1. ε independent from X.
  2. ε ∼ N_n(0, σ²I_n) - iid errors.
• We can find a unique solution to the problem we faced by making only the two assumptions mentioned above.
• What we don't know anything about is whether this tells us something only about a sample (in particular, this sample) or whether it tells us something about a larger set of observations from which this one is drawn.
• We turn to this other set of questions now.
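Picking up the rank condition from the slide before last: a minimal sketch with hypothetical variables showing how a perfectly collinear column makes X′X singular and how lm() responds.

set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- 2 * x1                  # x2 is an exact multiple of x1: perfect collinearity
y  <- 1 + x1 + rnorm(n)
X  <- cbind(1, x1, x2)
qr(t(X) %*% X)$rank           # rank 2 < 3 columns, so X'X is singular
# solve(t(X) %*% X)           # would fail: system is computationally singular
coef(lm(y ~ x1 + x2))         # lm() returns NA for the redundant coefficient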


Properties of the OLS Estimator

Reformulation of the Model

• When we consider the statistical properties of b, it's as an estimator of something, specifically the population parameter vector β.
• Here, we have to assume either that X is fixed before the data collection or, more appropriately for this class, that X is independent of the errors in the population, ε.
• Given this, we need to re-express our model as follows:

  y = Xβ + ε     (4)

• We can take the expectation of both sides:

  E(y) = E(Xβ + ε) = E(Xβ) + E(ε) = Xβ + E(ε)

• If we add the assumption E(ε) = 0, then we get:

  E(y) = Xβ

Unbiasedness

• For an estimator θ̃ of a population parameter θ to be unbiased, it must be the case that E(θ̃) = θ.
• We want to assess the bias in b, so we need to see whether E(b) = β (a small simulation illustrating this appears after the next slide).

  b = (X′X)⁻¹X′y
    = (X′X)⁻¹X′(Xβ + ε)
    = (X′X)⁻¹X′Xβ + (X′X)⁻¹X′ε
    = Iβ + (X′X)⁻¹X′ε

  E(b | X) = β + E[(X′X)⁻¹X′ε | X]
           = β + 0

BLUE

• To this point, we have made a set of assumptions that allowed us to show that b, as an estimator of β, is both linear and unbiased.
• This is great, but it doesn't show that the OLS line is the "best" line.
• To do this, we need to add another concept to our arsenal - efficiency.
• If the OLS line is "best", it should be more efficient (i.e., have smaller variance) than any other linear, unbiased estimate.
• How do we get there?
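To make the unbiasedness claim concrete, a small simulation sketch with a hypothetical data-generating process: across repeated samples, the average of the OLS estimates is close to β.

set.seed(1)
n    <- 200
beta <- c(1, 2, -1)                       # true population coefficients
x1 <- rnorm(n); x2 <- rnorm(n)
X  <- cbind(1, x1, x2)
B  <- replicate(2000, {
  y <- X %*% beta + rnorm(n)              # new errors each replication
  c(solve(t(X) %*% X) %*% t(X) %*% y)     # OLS estimate for this sample
})
rowMeans(B)                               # approximately (1, 2, -1): E(b) = beta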


Variance of the Errors

• To get to the point where we can make statements about OLS being the "best", we need to make another assumption about the errors.
• Homoskedasticity: V(ε | X) = σ², or, the variance of the errors conditional on X is the same for all observations.
• Put another way, the variance matrix of the errors is σ²I_n.
• We know that y is different from Xβ because of ε. Thus, the distribution of y (i.e., the spread of points of y around its expectation, Xβ) will have the same variance as ε, or σ²I_n.

Gauss-Markov Theorem

The Gauss-Markov theorem states that if the errors in our OLS regression model:
• are independent of each other and of X,
• have expected value of 0,
• have constant variance given X,
then the OLS estimator b of β is BLUE - the best linear unbiased estimator.
• If this is true, then OLS is the minimum-variance linear unbiased estimator.
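A rough sketch of the Gauss-Markov idea by simulation (hypothetical data): compare the OLS slope with another linear unbiased estimator, here the slope through the first and last observations. Both are centered on the true value, but OLS has the smaller sampling variance.

set.seed(2)
n <- 100
x <- sort(runif(n, 0, 10))
sims <- replicate(5000, {
  y <- 1 + 2 * x + rnorm(n)
  c(ols       = coef(lm(y ~ x))[2],
    endpoints = (y[n] - y[1]) / (x[n] - x[1]))
})
rowMeans(sims)        # both are approximately unbiased for 2
apply(sims, 1, var)   # the OLS slope has the smaller variance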

Recap

Now we know that the OLS estimator b is linear, unbiased, and efficient. What assumptions did we have to make along the way?
• Linearity
  • y = Xβ + ε
• No perfect collinearity (or X of full rank)
• Unbiasedness
  • ε independent from X
  • E(ε) = 0
• Efficiency
  • Homoskedasticity: V(ε | X) = σ², i.e., V(ε | X) = σ²I_n

What we have not done yet is talk about inference. To do this, we need to make one more assumption:

  ε ∼ N_n(0, σ²I_n)

Remember, to know what values are likely and what values are unlikely (i.e., to make inferences), we need to know the probability distribution of the random variable.

Statistical Inference for OLS

If we know that ε is distributed normally, then that implies b is also normally distributed. Specifically:

  b ∼ N_{k+1}( β, σ²(X′X)⁻¹ )

This is called the sampling distribution of b. β is fixed in the population; it is a vector of constants. However, because we have a "random" sample, b will differ in each sample according to the distribution above. Notice, β and σ² are population quantities, so this is a theoretical distribution.

Inference for Individual Coefficients (1)

• Any individual coefficient B_j is distributed normally with expectation β_j and sampling variance σ²v_jj, where v_jj is the j-th diagonal element of (X′X)⁻¹.
• So, we can test the hypothesis H0: β_j = β_j^(0) with:

  Z0 = (B_j − β_j^(0)) / √v_jj

  This, however, doesn't help much, since σ² (and hence the true sampling variance) is unknown.
• S_E² = e′e/(n − k − 1) is an unbiased estimator of σ², so

  V̂(b) = [e′e/(n − k − 1)] (X′X)⁻¹;   SE(B_j) = √(j-th diagonal element of V̂(b))

Inference for Individual Coefficients (2)

Because B_j and S_E² are independent, their ratio is distributed t with n − k − 1 degrees of freedom:

  t0 = (B_j − β_j^(0)) / SE(B_j)

Thus:

  95% CI for β_j = B_j ± t_{97.5, n−k−1} SE(B_j)

Inference for Multiple Coefficients: F-test

• Assume we have an OLS model with k explanatory variables that produces residual sum of squares RSS for the full model.
• Now, place q linear restrictions on the model coefficients (e.g., set some of them to zero) and generate a new residual sum of squares RSS_0 for the restricted model. Then:

  F0 = [(RSS_0 − RSS)/q] / [RSS/(n − k − 1)]

• The statistic F0 is distributed F with q and n − k − 1 degrees of freedom.
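Before moving to the example: a quick sketch reproducing the standard errors, t statistics, and 95% CIs by hand, anticipating the Duncan regression fit in the next section.

library(car)                          # for the Duncan data
data(Duncan)
mod <- lm(prestige ~ income + education, data = Duncan)
se    <- sqrt(diag(vcov(mod)))        # SE(B_j) from the estimated V(b)
tstat <- coef(mod) / se               # t0 for H0: beta_j = 0
cbind(estimate = coef(mod), se = se, t = tstat)
# 95% CI: B_j +/- t_{.975, n-k-1} * SE(B_j)
ci <- cbind(lower = coef(mod) - qt(0.975, df.residual(mod)) * se,
            upper = coef(mod) + qt(0.975, df.residual(mod)) * se)
ci
confint(mod)                          # matches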


Example: Duncan Data

> library(car)
> data(Duncan)
> mod <- lm(prestige ~ income + education, data=Duncan)
> summary(mod)

Call:
lm(formula = prestige ~ income + education, data = Duncan)

Residuals:
    Min      1Q  Median      3Q     Max
-29.538  -6.417   0.655   6.605  34.641

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.06466    4.27194  -1.420    0.163
income       0.59873    0.11967   5.003 1.05e-05 ***
education    0.54583    0.09825   5.555 1.73e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.37 on 42 degrees of freedom
Multiple R-squared: 0.8282,    Adjusted R-squared: 0.82
F-statistic: 101.2 on 2 and 42 DF,  p-value: < 2.2e-16

Regression "by hand"

> X <- with(Duncan, cbind(1, income, education))
> y <- matrix(Duncan[["prestige"]], ncol=1)
> b <- solve(t(X) %*% X) %*% t(X) %*% y
> b
                 [,1]
           -6.0646629
income      0.5987328
education   0.5458339
> coef(mod)
(Intercept)      income   education
 -6.0646629   0.5987328   0.5458339
> e <- matrix(y - X %*% b, ncol=1)
> Vb <- c((t(e) %*% e)/(nrow(X) - 2 - 1)) * solve(t(X) %*% X)
> Vb
                          income    education
           18.249481 -0.151845008 -0.150706025
income     -0.151845  0.014320275 -0.008518551
education  -0.150706 -0.008518551  0.009653582
> vcov(mod)
            (Intercept)       income    education
(Intercept)   18.249481 -0.151845008 -0.150706025
income        -0.151845  0.014320275 -0.008518551
education     -0.150706 -0.008518551  0.009653582

F-test Example

> restricted.mod <- lm(prestige ~ 1, data=Duncan)
> anova(restricted.mod, mod, test="F")
Analysis of Variance Table

Model 1: prestige ~ 1
Model 2: prestige ~ income + education
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)
1     44 43688
2     42  7507  2     36181 101.22 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> xtx <- solve(t(X) %*% X)
> s2e <- (t(e) %*% e)/(nrow(X) - 3)
> F0 <- (t(b[2:3]) %*% solve(xtx[2:3, 2:3]) %*% b[2:3])/(2 * s2e)
> F0
         [,1]
[1,] 101.2162
> pf(F0, 2, nrow(X) - 3, lower.tail=F)
             [,1]
[1,] 8.647636e-17

General Linear Hypothesis Example

Test H0: β_income = β_education = 0.5:

> linearHypothesis(mod, c("education=.5", "income=.5"))
Linear hypothesis test

Hypothesis:
education = 0.5
income = 0.5

Model 1: restricted model
Model 2: prestige ~ income + education

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     44 8054.5
2     42 7506.7  2    547.78 1.5324 0.2278

Model (Mis)Specification

• We'll talk in more depth about model selection (e.g., selecting which variables belong in the model) later on in the course.
• Finally, for today, we'll address the potential problem of omitted variable bias.
• First we must decide whether our model is one of empirical relationship or causal relationship. While we all probably want to make causal claims, there are some definite problems that come along with that.
  • A model of empirical relationship simply traces the empirical covariance of one variable with a set of other variables.
  • A causal model proposes that we have been able to ascertain and include all of the relevant causes of y and, as such, we can interpret coefficients as the effect of x on y.

Misspecification

Suppose we're interested in the population model:

  y* = X*β + ε = X1*β1 + X2*β2 + ε

where variables with a * superscript are mean-deviated versions of the original variables. X1* and X2* are matrices of regressors.

Now, define X2*β2 + ε ≡ ε̃, such that

  y* = X1*β1 + ε̃

Now we can see what happens to b1.


b1 and Mis-specification

  b1 = (X1*′X1*)⁻¹ X1*′ y*
     = [(1/n) X1*′X1*]⁻¹ (1/n) X1*′ y*
     = [(1/n) X1*′X1*]⁻¹ (1/n) X1*′ (X1*β1 + X2*β2 + ε)
     = β1 + [(1/n) X1*′X1*]⁻¹ (1/n) X1*′X2* β2 + [(1/n) X1*′X1*]⁻¹ (1/n) X1*′ε

Taking probability limits produces:

  plim b1 = β1 + Σ11⁻¹ Σ12 β2 + Σ11⁻¹ σ1ε
          = β1 + Σ11⁻¹ Σ12 β2

where Σ11 ≡ plim (1/n) X1*′X1*, Σ12 ≡ plim (1/n) X1*′X2*, and σ1ε ≡ plim (1/n) X1*′ε = 0 (by assumption).

Mis-specification Bias

We assumed σ1ε = 0, but what about σ1ε̃?

  plim (1/n) X1*′ε̃ = plim (1/n) X1*′ (X2*β2 + ε) = Σ12 β2 + σ1ε

So, σ1ε̃ can only be 0 if Σ12 (the correlation between X1* and X2*) is 0 or the effect of X2* in the population (β2) is 0.

This is the classic omitted variable bias.
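To make the omitted variable bias formula concrete, a small simulation sketch with a hypothetical DGP: when x2 is omitted and is correlated with x1, the coefficient on x1 drifts away from its true value by roughly Σ11⁻¹Σ12β2.

set.seed(99)
n  <- 10000
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)             # x2 correlated with x1 (Sigma_12 != 0)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true beta_1 = 2, beta_2 = 3
coef(lm(y ~ x1 + x2))["x1"]           # close to 2: correctly specified model
coef(lm(y ~ x1))["x1"]                # close to 2 + 0.6*3 = 3.8: omitted variable bias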

Omitted Variable Bias: The Phantom Menace

• We have the idea that the bias in our coefficients is monotonically decreasing (and approaching zero) as the proportion of relevant controls in our model increases.
• That is to say, if in the true data generating process (DGP) there are 100 variables, a model that includes 75 of them is better (coefficients have smaller bias) than a model that includes only 50 of them.
• Clarke (2005) shows that this is not necessarily the case.
  • Adding a subset of controls does not necessarily make the model better.
  • It could actually make the model worse.
• His recommendations:
  1. Focus on research design and look for natural experiments.
  2. Test theories on smaller, narrower domains (e.g., spatially or temporally narrower).
