Section 1: Regression Review
Yotam Shem-Tov, STAT 239A / PS 236A, Fall 2015

Contact information
Yotam Shem-Tov, PhD student in economics.
E-mail: [email protected]. Office: Evans Hall, room 650.
Office hours: to be announced - any preferences or constraints? Feel free to e-mail me.

R resources (prior knowledge of R is assumed)
The link here is an excellent introduction to statistics in R.
There is a free online course on Coursera on programming in R. Here is a link to the course.
An excellent book for implementing econometric models and methods in R is Applied Econometrics with R. It is my favourite starting point for learning R for someone who has some background in statistical methods in the social sciences.
The book Regression Models for Data Science in R has a nice online version. It covers the implementation of regression models using R.
A great link to R resources and other cool programming and econometric tools is here.

Outline
There are two general approaches to regression:
1. Regression as a model: a data generating process (DGP).
2. Regression as an algorithm (i.e., an estimator without assuming an underlying structural model).
Overview of these slides:
1. The conditional expectation function - a motivation for using OLS.
2. OLS as the minimum mean squared error linear approximation (projections).
3. Regression as a structural model (i.e., assuming the conditional expectation is linear).
4. Applications, examples and additional topics.
These slides rely heavily on Mostly Harmless Econometrics, Hansen's book Econometrics and Pat Kline's lecture notes.

Notation and theoretical framework
The researcher observes an i.i.d. sample of N observations of (Y_i, X_i), denoted by S_N ≡ {(Y_1, X_1), ..., (Y_N, X_N)}.
X_i = (X_{1i}, X_{2i}, ..., X_{pi})' is a p-dimensional random vector, where p is the number of covariates. Y_i is a scalar random variable.
(Y_i, X_i) are sampled i.i.d. from the distribution function F_{Y,X}(y, x).

The conditional expectation function (CEF)
The conditional expectation function is
    E[Y_i | X_i = x] = ∫ t dF_{Y|X}(t | X_i = x) = ∫ t f_{Y|X}(t | X_i = x) dt.
For any two variables X_i and Y_i we can always compute the conditional expectation of one given the other. Additional assumptions are required to give anything other than a purely descriptive interpretation to the conditional expectation function. Descriptive tools are important and can be valuable.
The law of iterated expectations (LIE) is
    E[Y_i] = E[ E[Y_i | X_i] ].
Written in more detail, the LIE is E_Y[Y_i] = E_X[ E_{Y|X}[Y_i | X_i = x] ].
Another way of writing the LIE is
    E[Y_i | X_{1i} = x_1] = E[ E[Y_i | X_{1i} = x_1, X_{2i} = x_2] | X_{1i} = x_1 ].

Properties of the CEF
The LIE has several useful implications:
    E[Y_i − E[Y_i | X_i]] = 0
    E[(Y_i − E[Y_i | X_i]) · X_i] = 0
    E[(Y_i − E[Y_i | X_i]) · g(X_i)] = 0 for all g(·).
Therefore we can decompose any scalar random variable Y_i into two parts,
    Y_i = E[Y_i | X_i] + u_i,  with  E[u_i | X_i] = 0.
Question: does E[u_i | X_i] = 0 imply E[u_i] = 0? Yes; the proof follows from the LIE. The statement E[u_i | X_i] = 0 is stronger than the statement E[u_i] = 0.
Question: does E[u_i] = 0 imply E[u_i | X_i] = 0? No.
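The CEF decomposition above is easy to check numerically. The following is a minimal R sketch (my own illustration, not part of the slides; the simulated variables and the nonlinear CEF exp(x) are invented for the example). It estimates E[Y_i | X_i] by group means for a discrete regressor and verifies the LIE and the properties of the residual u_i.

# Simulate a discrete X with a nonlinear CEF: E[Y | X = x] = exp(x).
set.seed(239)
n <- 1e5
x <- sample(1:4, n, replace = TRUE)
y <- exp(x) + rnorm(n, sd = 2)

cef <- ave(y, x, FUN = mean)   # estimated E[Y_i | X_i] for each observation
u   <- y - cef                 # CEF residual u_i

mean(y) - mean(cef)            # LIE: E[Y] = E[E[Y|X]], approximately 0
tapply(u, x, mean)             # E[u | X = x] approximately 0 for every x
cov(u, x)                      # u uncorrelated with X, approximately 0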
Properties of the CEF (continued)
The ANOVA theorem (variance decomposition):
    V[Y_i] = V[ E[Y_i | X_i] ] + E[ V[Y_i | X_i] ].
The CEF prediction property: let m(X_i) be any function of X_i. The CEF solves
    E[Y_i | X_i] = argmin_{m(X_i)} E[ (Y_i − m(X_i))^2 ],
so the CEF is the best predictor of Y_i under the loss function L(Y_i, Ŷ_i) = E[(Y_i − Ŷ_i)^2]. The CEF is also referred to as the minimum mean squared error (MMSE) estimator of Y_i.

Homoskedasticity and heteroskedasticity
Homoskedasticity: the error is homoskedastic if E[u^2 | X] = σ^2 does not depend on X.
Heteroskedasticity: the error is heteroskedastic if E[u^2 | X = x] = σ^2(x) depends on x.
These concepts concern the conditional variance, not the unconditional variance. By definition the unconditional variance σ^2 is constant and does not depend on the regressors; it is a number, not a random variable or a function of X. The conditional variance depends on X, and if X is a random variable it is itself a random variable.

Projections
Denote the population linear projection of Y_i on X_i by
    E*[Y_i | X_i] = X_i'β = β_0 + Σ_{j=1}^p X_{ji} β_j,
where β is defined as the minimum mean squared error linear projection coefficient,
    β = argmin_b E[ (Y_i − X_i'b)^2 ].
The first-order conditions of the minimization problem are
    E[ X_i (Y_i − X_i'β) ] = 0  ⇒  β = E[X_i X_i']^{-1} E[X_i Y_i].
Denote by e_i the projection residual, e_i = Y_i − E*[Y_i | X_i]. It follows from the first-order conditions that E[X_i · e_i] = 0: the residual from projecting Y_i on X_i is uncorrelated with X_i.

Theorem (the projection is the MMSE linear approximation to the CEF):
    β = argmin_b E[ (E[Y_i | X_i] − X_i'b)^2 ].
If the conditional expectation is linear, E[Y_i | X_i] = X_i'β, then E*[Y_i | X_i] = E[Y_i | X_i].
Projections and conditional expectations are also related without imposing any structural assumptions:
    E*[Y_i | X_i] = E*[ E[Y_i | Z_i, X_i] | X_i ].
Proof:
    E*[Y_i | X_i] = X_i' E[X_i X_i']^{-1} E[X_i Y_i] = X_i' E[X_i X_i']^{-1} E[ X_i E[Y_i | X_i, Z_i] ] = E*[ E[Y_i | Z_i, X_i] | X_i ],
where the second equality uses the LIE, and X_i' E[X_i X_i']^{-1} E[X_i A] = E*[A | X_i] for any A, here with A = E[Y_i | Z_i, X_i].

Projections also obey the law of iterated projections (LIP),
    E*[Y_i | X_i] = E*[ E*[Y_i | Z_i, X_i] | X_i ].
Consider the following special case: T_i is a binary variable (i.e., T_i ∈ {0, 1}). Show that
    E*[Y_i | T_i] = E[ E[Y_i | X_i, T_i] | T_i ].
The proof is left as a homework assignment.

Regression as a structural model: the CEF is linear
When E[Y_i | X_i] = X_i'β, β has a causal interpretation. It is no longer a linear approximation based on correlations, but a structural parameter that describes a relationship between X_i and Y_i.
The parameter β has two different possible interpretations:
(i) Conditional mean interpretation: β indicates the effect of increasing X on the conditional mean E[Y | X] in the model E[Y | X] = Xβ.
(ii) Unconditional mean interpretation: by the LIE it follows that E[Y] = E[X]β. Hence β can be interpreted as the effect of increasing the mean value of X on the (unconditional) mean value of Y, i.e., E[Y].
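The projection coefficient formula and the ANOVA decomposition can both be checked by simulation. The R sketch below is my own illustration (the data-generating process is invented, not taken from the slides): it computes the sample analogue of β = E[X_i X_i']^{-1} E[X_i Y_i], compares it with lm(), and verifies that the projection residual is uncorrelated with X_i and that V[Y] = V[E[Y|X]] + E[V[Y|X]].

# Simulate a discrete regressor with a nonlinear CEF and heteroskedastic errors.
set.seed(236)
n <- 1e5
x <- sample(0:3, n, replace = TRUE)
y <- x^2 + rnorm(n, sd = 1 + x)

X <- cbind(1, x)                              # regressor matrix with an intercept
b <- solve(crossprod(X), crossprod(X, y))     # sample analogue of E[X X']^{-1} E[X Y]
cbind(projection = as.vector(b),
      lm         = coef(lm(y ~ x)))           # identical up to numerical error

e <- y - X %*% b
crossprod(X, e) / n                           # E[X e] approximately 0

cef  <- ave(y, x, FUN = mean)                 # E[Y | X]
cvar <- ave(y, x, FUN = var)                  # V[Y | X]
var(y) - (var(cef) + mean(cvar))              # ANOVA decomposition, approximately 0

Even though the CEF here is nonlinear, the hand-computed projection coefficients match lm(): OLS is the sample counterpart of the population projection, not of the CEF.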
Regression as a prediction algorithm
Denote by X the design matrix. This is the matrix on which we will project y. In other words, we have an input matrix X with dimensions n × p and an output vector y with dimensions n × 1:

        [ X_11  X_12  ...  X_1p ]
    X = [ X_21  X_22  ...  X_2p ]
        [  ...   ...  ...   ... ]
        [ X_n1  X_n2  ...  X_np ]   (n × p)

The entry in row i and column j is the value of characteristic j for individual i.
Example: if the researcher observes two covariates, education and age, a possible design matrix is

        [ 1  education_1  age_1  age_1^2 ]
    X = [ ...     ...       ...     ...  ]
        [ 1  education_n  age_n  age_n^2 ]   (n × 4)

The linear regression algorithm has the following form:
    ŷ_i = f(X_i) = β_0 + Σ_{j=1}^p X_{ji} β_j.
We can estimate the coefficients β in a variety of ways, but OLS is by far the most common; it minimizes the residual sum of squares (RSS),
    RSS(β) = Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p X_{ji} β_j )^2.
The estimation of β is therefore based on a squared loss function, i.e. L(a, â) = (a − â)^2.

Visualization of OLS
(Figure: visualization of the OLS fit; not reproduced here.)

The OLS estimator of β
Write the RSS as
    RSS(β) = (y − Xβ)'(y − Xβ).
Differentiate with respect to β:
    ∂RSS/∂β = −2X'(y − Xβ).   (1)
Assume that X has full rank (no perfect collinearity among the independent variables) and set the first derivative to zero:
    X'(y − Xβ) = 0.
Solve for β:
    β̂ = (X'X)^{-1} X'y.

Rank condition for identifying β
What happens if X is not of full rank? Then X'X is singular and cannot be inverted, the normal equations X'(y − Xβ) = 0 have infinitely many solutions, and the algorithm does not have a unique solution.
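A short R sketch (my own example, not from the slides; the simulated education and age variables are invented) that computes β̂ = (X'X)^{-1} X'y directly, compares it with lm(), and then illustrates the rank condition by adding a perfectly collinear column.

set.seed(58)
n   <- 500
edu <- rnorm(n, mean = 12, sd = 2)
age <- rnorm(n, mean = 40, sd = 10)
y   <- 1 + 0.5 * edu + 0.1 * age + rnorm(n)

X <- cbind(1, edu, age, age^2)                 # design matrix as in the example above
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
cbind(by_hand = as.vector(beta_hat),
      lm      = coef(lm(y ~ edu + age + I(age^2))))

# Rank failure: add a column that is an exact linear combination of another column.
X_bad <- cbind(X, 2 * edu)
qr(X_bad)$rank                                 # rank 4 with 5 columns, so X'X is singular
# solve(t(X_bad) %*% X_bad) would now throw an error; lm() instead reports NA
# for the redundant (aliased) term:
coef(lm(y ~ edu + age + I(age^2) + I(2 * edu)))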