
Section 1: Regression review

Yotam Shem-Tov Fall 2015

Contact information

Yotam Shem-Tov, PhD student in economics. E-mail: [email protected]. Office: Evans Hall, room 650. Office hours: to be announced; any preferences or constraints? Feel free to e-mail me.

R resources - Prior knowledge in R is assumed

The link here is an excellent introduction to R. There is a free online course on Coursera on programming in R; here is a link to the course. An excellent book for implementing econometric models and methods in R is Applied Econometrics with R; it is my favourite for starting to learn R for someone who has some background in statistical methods in the social sciences. The book Regression Models for Data Science in R has a nice online version and covers the implementation of regression models using R. A great link to R resources and other cool programming and econometric tools is here.

Outline

There are two general approaches to regression:
1. Regression as a model: a data generating process (DGP).
2. Regression as an algorithm (i.e., an estimator without assuming an underlying structural model).

Overview of these slides:
1. The conditional expectation function (CEF) - a motivation for using OLS.
2. OLS as the minimum mean squared error (MMSE) linear approximation (projections).
3. Regression as a structural model (i.e., assuming the conditional expectation is linear).
4. Applications, examples and additional topics.

These slides rely heavily on Mostly Harmless Econometrics, Hansen's book Econometrics, and Pat Kline's lecture notes.

Notation and theoretical framework

The researcher observes an i.i.d. sample of $N$ observations of $(Y_i, X_i)$, denoted by $S_N \equiv (Y_1, X_1), \ldots, (Y_N, X_N)$. $X_i' = (X_{1i}, X_{2i}, \ldots, X_{pi})$ is a $1 \times p$ random vector, where $p$ is the number of covariates. $Y_i$ is a scalar.

$(Y_i, X_i)$ are sampled i.i.d. from the distribution function $F_{Y,X}(y, x)$.

The conditional expectation function (CEF)

The conditional expectation function is,
$$E[Y_i \mid X_i = x] = \int t \, dF_{Y|X}(t \mid X_i = x) = \int t \, f_{Y|X}(t \mid X_i = x)\, dt$$

For any two variables Xi and Yi we can always compute the conditional expectation of one given the other. Additional assumptions are required to give anything other than a purely descriptive interpretation to the conditional expectation function. Descriptive tools are important and can be valuable. The law of iterated expectations (LIE) is,

$$E[Y_i] = E\big[E[Y_i|X_i]\big]$$
More in detail, the LIE is: $E_Y[Y_i] = E_X\big[E_{Y|X=x}[Y_i|X_i = x]\big]$. Another way of writing the LIE is,

$$E[Y_i|X_{1i} = x_1] = E\big[E[Y_i|X_{1i} = x_1, X_{2i} = x_2]\,\big|\,X_{1i} = x_1\big]$$

Properties of the CEF

The LIE has several useful implications,

$$E\big[Y_i - E[Y_i|X_i]\big] = 0$$
$$E\big[(Y_i - E[Y_i|X_i]) \cdot X_i\big] = 0$$
$$E\big[(Y_i - E[Y_i|X_i]) \cdot g(X_i)\big] = 0 \quad \forall\, g(\cdot)$$

Therefore we can decompose any scalar random variable $Y_i$ into two parts,

$$Y_i = E[Y_i|X_i] + u_i, \qquad E[u_i|X_i] = 0.$$

Question: Does $E[u_i|X_i] = 0$ imply $E[u_i] = 0$? Yes; the proof follows from the LIE: $E[u_i] = E\big[E[u_i|X_i]\big] = E[0] = 0$. The statement $E[u_i|X_i] = 0$ is more general than the statement $E[u_i] = 0$. Question: Does $E[u_i] = 0$ imply $E[u_i|X_i] = 0$? No.
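A minimal numerical illustration of the last point (a sketch with a made-up DGP): take $u = X^2 - 1$ with $X$ standard normal, so $E[u] = 0$ while $E[u|X] = X^2 - 1 \neq 0$.

set.seed(1)
x = rnorm(1e5)
u = x^2 - 1                  # E[u] = 0 by construction, but E[u|X] = x^2 - 1
mean(u)                      # approximately 0
mean(u[abs(x) > 1.5])        # clearly positive, so u is not mean-independent of x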

Properties of the CEF

The ANOVA theorem (variance decomposition)

$$V[Y_i] = V\big[E[Y_i|X_i]\big] + E\big[V[Y_i|X_i]\big]$$

The CEF prediction property

Let m(Xi ) be any function of Xi . The CEF solves,

$$E[Y_i|X_i] = \underset{m(X_i)}{\operatorname{argmin}}\; E\big[(Y_i - m(X_i))^2\big]$$

The CEF is the best predictor of $Y_i$ under the loss function $L(Y_i, \hat Y_i) = E\big[(Y_i - \hat Y_i)^2\big]$. The CEF is also referred to as the minimum mean squared error (MMSE) predictor of $Y_i$.

Homoskedasticity and Heteroskedasticity

Homoskedasticity

The error is homoskedastic if $E[u^2|X] = \sigma^2$ does not depend on $X$.

Heteroskedasticity

The error is heteroskedastic if $E[u^2|X = x] = \sigma^2(x)$ depends on $x$.

The concepts of homoskedasticity and heteroskedasticity concern the conditional variance, not the unconditional variance. By definition the unconditional variance $\sigma^2$ is constant and independent of the regressors. The unconditional variance is a number, not a random variable or a function of $X$. The conditional variance depends on $X$, and if $X$ is a random variable it is also a random variable.

Projections

Denote the population linear projection of Yi on Xi by,

$$E^*[Y_i|X_i] = X_i'\beta = \beta_0 + \sum_{j=1}^{p} X_{ji}\beta_j$$

where $\beta$ is defined as the minimum mean squared error linear projection coefficient,
$$\beta = \underset{b}{\operatorname{argmin}}\; E\big[(Y_i - X_i'b)^2\big]$$
The F.O.C. of the minimization problem are,

$$E\big[X_i \cdot (Y_i - X_i'\beta)\big] = 0 \;\Rightarrow\; \beta = E[X_iX_i']^{-1}E[X_iY_i]$$

Denote by $e_i$ the projection residual, $e_i = Y_i - E^*[Y_i|X_i]$. It follows from the F.O.C. that $E[X_i \cdot e_i] = 0$: the residual of projecting $Y_i$ on $X_i$ is not correlated with $X_i$.
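A small sketch of this orthogonality property on simulated data (the quadratic CEF is an arbitrary choice for illustration): the linear projection residual is uncorrelated with $X$ even though the CEF is not linear.

set.seed(2)
n = 1e4
x = runif(n, -2, 2)
y = x^2 + rnorm(n)                            # nonlinear CEF: E[Y|X] = X^2
X = cbind(1, x)                               # include an intercept
beta = solve(crossprod(X), crossprod(X, y))   # sample analog of E[X X']^{-1} E[X Y]
e = y - X %*% beta                            # projection residual
crossprod(X, e)                               # numerically ~ 0: E[X e] = 0
cor(x, e)                                     # ~ 0, although e is clearly not independent of x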

Projections

Theorem (Projection is the MMSE linear approximation to the CEF)

$$\beta = \underset{b}{\operatorname{argmin}}\; E\Big[\big(E[Y_i|X_i] - X_i'b\big)^2\Big]$$

If the conditional expectation is linear, $E[Y_i|X_i] = X_i'\beta$, then $E^*[Y_i|X_i] = E[Y_i|X_i]$ (proof at the end of the slides). Projections and conditional expectations are related even without imposing any structural assumptions,

$$E^*[Y_i|X_i] = E^*\big[E[Y_i|Z_i, X_i]\,\big|\,X_i\big]$$
Proof:
$$E^*[Y_i|X_i] = X_i'\,E[X_iX_i']^{-1}E[X_iY_i] = X_i'\,E[X_iX_i']^{-1}E\big[X_i\,E[Y_i|X_i, Z_i]\big] = E^*\big[E[Y_i|Z_i, X_i]\,\big|\,X_i\big]$$

where $X_i'\,E[X_iX_i']^{-1}E[X_iA] = E^*[A|X_i]$ for any $A$, and in our case $A = E[Y_i|Z_i, X_i]$.

Projections

Projections also follow the law of iterated projections (LIP),

$$E^*[Y_i|X_i] = E^*\big[E^*[Y_i|Z_i, X_i]\,\big|\,X_i\big]$$

Consider the following special case, Ti is a binary variable (i.e., Ti ∈ {0, 1}). Show that,

$$E^*[Y_i|T_i] = E\big[E[Y_i|X_i, T_i]\,\big|\,T_i\big]$$

The proof is left as a homework assignment.

Regression as a structural model: The CEF is linear

When $E[Y_i|X_i] = X_i'\beta$, $\beta$ has a causal interpretation. It is no longer a linear approximation based on correlations, but a structural parameter that describes a relationship between $X_i$ and $Y_i$. The parameter $\beta$ has two different possible interpretations:

(i) Conditional mean interpretation: β indicates the effect of increasing X on the conditional mean E [Y |X ] in the model E [Y |X ] = X β.

(ii) Unconditional mean interpretation: By the LIE it follows that E [Y ] = E [X ] β. Hence, β can be interpreted as the effect of increasing the mean value of X on the (unconditional) mean value of Y (i.e., E [Y ]).

Regression as a prediction algorithm

Denote by $X$ the design matrix: the matrix on which we will project $y$. In other words, we have an input matrix $X$ with dimensions $n \times p$ and an output vector $y$ with dimensions $n \times 1$.
$$X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}_{n \times p}$$

$X_{ji}$ denotes the value of characteristic $j$ for individual $i$. Example: the researcher observes 2 covariates, education and age. A possible design matrix is,
$$X = \begin{pmatrix} 1 & \text{education}_1 & \text{age}_1 & \text{age}_1^2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \text{education}_n & \text{age}_n & \text{age}_n^2 \end{pmatrix}_{n \times 4}$$
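A sketch of how such a design matrix can be built in R with model.matrix() (the data values here are made up for illustration):

df = data.frame(education = c(12, 16, 10), age = c(30, 45, 52))
X = model.matrix(~ education + age + I(age^2), data = df)
X          # an n x 4 matrix: (Intercept), education, age, I(age^2)
dim(X)     # 3 4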

Regression as a prediction algorithm

The algorithm has the following form:

$$\hat y_i = f(X_i) = \beta_0 + \sum_{j=1}^{p} X_{ji}\beta_j$$

We can estimate the coefficients $\beta$ in a variety of ways, but OLS is by far the most common; it minimizes the residual sum of squares (RSS):

$$RSS(\beta) = \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} X_{ji}\beta_j\Big)^2$$

The estimation of $\beta$ is thus based on a squared loss function, i.e., $L(a, \hat a) = (a - \hat a)^2$.

Visualization of OLS

The OLS estimator of β

Write the RSS as:

$$RSS(\beta) = (y - X\beta)'(y - X\beta)$$

Differentiate with respect to $\beta$:
$$\frac{\partial RSS}{\partial \beta} = -2X'(y - X\beta)$$

Assume that $X$ is full rank (no perfect collinearity among the independent variables) and set the first derivative to 0:

$$X'(y - X\beta) = 0$$

Solve for $\beta$:
$$\hat\beta = (X'X)^{-1}X'y$$
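A quick sketch checking this closed-form solution against lm() on simulated data (the coefficients 1, 2, -3 are arbitrary):

set.seed(3)
n = 500
X = cbind(1, rnorm(n), rnorm(n))             # design matrix with an intercept
y = X %*% c(1, 2, -3) + rnorm(n)
beta.hat = solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'y
cbind(by.hand = beta.hat, lm.fit = coef(lm(y ~ X[, -1])))   # agree up to numerical error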

Rank condition for identifying β

What happens if $X$ is not full rank? The matrix $X'X$ is then not invertible and the minimization problem does not have a unique solution: there are infinitely many values of $\beta$ that satisfy the F.O.C.

Therefore, if $X$ is not full rank, then $\hat\beta$ is not well defined! Note that not all prediction algorithms require the input matrix $X$ to be of full rank. For example, regression trees or more advanced methods such as bagging or boosting do not require a full rank condition.

Regression as a prediction: Making a Prediction

The hat matrix, or projection matrix is,

$$H = X(X'X)^{-1}X', \qquad \tilde H = I - H$$

We use the hat matrix to find the fitted values:

$$\hat y = X\hat\beta = X(X'X)^{-1}X'y = Hy$$

The regression residuals are,

$$e = (I - H)y$$

$Hy$ yields the part of $y$ that can be explained by a linear combination of $X$, and $\tilde Hy$ is the part of $y$ that is not explained by a linear combination of $X$, which is the regression residual. Denote by $M$ the residual-maker matrix, $M \equiv I - H$; hence $e = My$. What are the dimensions of $M$ and $H$? $n \times n$.
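A quick numerical check of these properties (a sketch on simulated data): $H$ is idempotent, $Hy$ gives the fitted values, and the residuals are orthogonal to $X$.

set.seed(4)
n = 100
X = cbind(1, rnorm(n))
y = 2 + 3*X[, 2] + rnorm(n)
H = X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix
max(abs(H %*% H - H))                   # ~ 0: H is idempotent
y.hat = H %*% y                         # fitted values, same as fitted(lm(y ~ X[, 2]))
e = (diag(n) - H) %*% y                 # residuals, e = My with M = I - H
max(abs(crossprod(X, e)))               # ~ 0: residuals orthogonal to X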

Regression as a prediction: Making a Prediction

Do we make any assumption on the distribution of $y$? No! Can the dependent variable (the response) $y$ be a binary variable, i.e., $y \in \{0, 1\}$? Yes! Do we assume homoskedasticity, i.e., that $Var(Y_i) = \sigma^2$ for all $i$? No! Are the residuals, $e$, correlated with $X$? Do we need to make any additional assumption in order for $corr(e, X) = 0$? No! The OLS algorithm will always yield residuals that are not correlated with the covariates, i.e., that are orthogonal to $X$. The procedure we have discussed so far is an algorithm, which solves an optimization problem (minimizing a squared loss function). The algorithm requires a full rank assumption in order to yield a unique solution, but it does not require any assumption on the distribution or the type of the response variable, $y$.

Regression as a structural model: From algorithm to model

Now we make stronger assumptions; most importantly we assume a data generating process (henceforth DGP), i.e., we assume a functional form for the relationship between $y$ and $X$.

Do we now assume y is a linear function of the covariates? No, we assume it is a linear function of β. An example of a response variable that is non-linear in β is:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \exp(\beta_3 X_{2i} + \beta_4 X_{3i})$$

What are the classic assumptions of the regression model? What is the role of the classic assumptions, why are we assuming them?

The classic assumptions of the regression model

1. The dependent variable is linearly related to the coefficients of the model and the model is correctly specified, $y = X\beta + \varepsilon$.
2. The independent variables, $X$, are fixed, i.e., are not random variables (this should be relaxed to the no-endogeneity assumption $E[\varepsilon|X] = 0$; note that by the LIE, $E[\varepsilon|X] = 0 \Rightarrow E[X\varepsilon] = 0$).
3. The conditional mean of the error term is zero, $E(\varepsilon|X) = 0$.
4. Homoskedasticity. The variance of the error term is independent of $X$, i.e., $V(\varepsilon_i) = \sigma^2$ (this assumption can easily be relaxed).
5. The error terms are uncorrelated with each other, $E[\varepsilon_i \cdot \varepsilon_j] = 0$.
6. The design matrix, $X$, has full rank (this is necessary for the OLS algorithm and for identifying the parameters of a structural regression model).
7. The error term is normally distributed, $\varepsilon \sim N(0, \sigma^2)$ (this assumption can easily be relaxed; it makes the OLS estimator also a maximum likelihood estimator (MLE)).

Notes on the classic assumptions of the regression model

The assumption that $E(\varepsilon) = 0$ will always be satisfied when there is an intercept term in the model, i.e., when the design matrix contains a constant term. When $X \perp \varepsilon$ it follows that $Cov(X, \varepsilon) = 0$.

The normality assumption on $\varepsilon_i$ is required for hypothesis testing on $\beta$ in finite samples without asymptotic approximations. The assumption can be relaxed for sufficiently large sample sizes: by the CLT, the distribution of $\hat\beta_{OLS}$ converges to a normal distribution as $N \to \infty$.

Properties of the OLS estimators: Unbiased?

The OLS estimator of $\beta$ is,
$$\hat\beta = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + \varepsilon) = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon = \beta + (X'X)^{-1}X'\varepsilon$$

An estimator is unbiased if $E(\hat\beta) = \beta$. There are two cases to consider: (1) $X$ is fixed; (2) $X$ is random. We begin with the case of fixed $X$:
$$E(\hat\beta|X) = E\big(\beta + (X'X)^{-1}X'\varepsilon \,\big|\, X\big) = E(\beta|X) + E\big((X'X)^{-1}X'\varepsilon\,\big|\,X\big) = \beta + (X'X)^{-1}E(\varepsilon|X)$$
where $E(\varepsilon|X) = E(\varepsilon) = 0$, so $E(\hat\beta) = \beta$.

Note: fixed $X$ implies that the expectations are conditional on $X$, i.e., of the form $E[\,\cdot\,|X]$.

Properties of the OLS estimators: Unbiased?

Next consider that case that X is a random variable,

$$E\big[\beta + (X'X)^{-1}X'\varepsilon\big] = \beta + E\big[(X'X)^{-1}X'\varepsilon\big] \neq \beta + \big(E[X'X]\big)^{-1}E[X'\varepsilon] = \beta$$

OLS is, in finite samples, generally biased for the population linear projection since the expectations operator cannot pass through a ratio.

Note, the OLS estimator can also be written in terms of sums,
$$\hat\beta = \Big(\sum_i X_iX_i'\Big)^{-1}\Big(\sum_i X_iY_i\Big)$$
Although OLS can be biased in finite samples, it is an asymptotically unbiased (consistent) estimator of the population linear projection.

The variance of $\hat\beta_{OLS}$ under fixed $X$ and homoskedasticity

Recall: $\hat\beta = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + \varepsilon) \Rightarrow \hat\beta - \beta = (X'X)^{-1}X'\varepsilon$. Plugging this into the covariance equation:

$$\begin{aligned}
\operatorname{cov}(\hat\beta|X) &= E\big[(\hat\beta - \beta)(\hat\beta - \beta)'\,\big|\,X\big] \\
&= E\big[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\,\big|\,X\big] \\
&= (X'X)^{-1}X'\,E(\varepsilon\varepsilon'|X)\,X(X'X)^{-1}, \qquad \text{where } E(\varepsilon\varepsilon'|X) = \sigma^2 I_{n\times n} \\
&= (X'X)^{-1}X'\,\sigma^2 I\,X(X'X)^{-1} = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2 (X'X)^{-1}
\end{aligned}$$

Estimating $\sigma^2$

We estimate $\sigma^2$ by dividing the sum of squared residuals by the degrees of freedom, because the $e_i$ are generally smaller than the $\varepsilon_i$: $\hat\beta$ was chosen to make the sum of squared residuals as small as possible.

$$\hat\sigma^2_{OLS} = \frac{1}{n-p}\sum_{i=1}^{n} e_i^2$$
Compare the above estimator to the classic variance estimator:

$$\hat\sigma^2_{classic} = \frac{1}{n-1}\sum_{i=1}^{n}\big(Y_i - \bar Y\big)^2$$
Is one estimator always preferable over the other? If not, when is each estimator preferable?
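A small sketch of the degrees-of-freedom adjustment (simulated data; here $p = 3$ counts the intercept and the two slopes): the formula above reproduces the residual variance reported by lm().

set.seed(5)
n = 200; p = 3
X = cbind(1, rnorm(n), rnorm(n))
y = X %*% c(1, 2, -1) + rnorm(n, sd = 2)
fit = lm(y ~ X[, -1])
e = resid(fit)
sum(e^2) / (n - p)          # sigma^2_OLS with p = 3 estimated coefficients
summary(fit)$sigma^2        # the same quantity as reported by lm()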

Measurement error

Consider the following DGP (data generating process):

n = 200
x1 = rnorm(n, mean = 10, sd = 1)
epsilon = rnorm(n, 0, 2)
y = 10 + 5*x1 + epsilon

### measurement error:
noise = rnorm(n, 0, 2)
x1_noise = x1 + noise

The true model has $x_1$; however, we observe only $x_1^{noise}$. We will investigate the effect of the noise and the distribution of the noise on the OLS estimation of $\beta_1$. The true value of the parameter of interest is $\beta_1 = 5$.
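A minimal sketch of what the plots below illustrate (a self-contained version of the DGP above): the slope on the noisy regressor is attenuated toward $\beta_1 \cdot V(x_1)/(V(x_1) + V(noise)) = 5 \cdot 1/(1 + 4) = 1$.

set.seed(6)
n = 200
x1 = rnorm(n, mean = 10, sd = 1)
y = 10 + 5*x1 + rnorm(n, 0, 2)
x1_noise = x1 + rnorm(n, 0, 2)
coef(lm(y ~ x1))          # slope close to the true beta1 = 5
coef(lm(y ~ x1_noise))    # slope attenuated, roughly 5 * V(x1)/(V(x1) + V(noise)) = 1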

Measurement error: noise ∼ N(µ = 0, σ = 2)

[Figure: scatter plot of y against x1 under measurement error with mean 0.]

Measurement error: noise ∼ N(µ = 5, σ = 2)

[Figure: scatter plot of y against x1 under measurement error with mean 5.]

Measurement error: noise ∼ N(µ =?, σ = 2)

[Figure: estimated $\hat\beta_1$ (y-axis, "beta") plotted against the expectation of the noise (x-axis, 0 to 100).]

Measurement error: noise ∼ N(µ = 5, σ =?)

[Figure: estimated $\hat\beta_1$ (y-axis, "beta") plotted against the variance of the noise (x-axis, 0 to 15); the estimates decline toward 0 as the variance grows.]

Measurement error: noise ∼ exp(λ =?)

[Figure: estimated $\hat\beta_1$ (y-axis, "beta") plotted against the rate parameter of the exponential noise (x-axis, 0 to 10; variance = 1/rate).]

Measurement error

Could we reach the same conclusions as the simulations from analytical derivations? Yes. As we saw before,

$$E(\hat\beta_{OLS}) = \frac{Cov(y, x_1^{noise})}{V(x_1^{noise})} = \frac{Cov(y, x_1 + noise)}{V(x_1 + noise)} = \frac{Cov(y, x_1)}{V(x_1) + V(noise)}$$
Therefore as $V(noise) \to \infty$, the expectation of the OLS estimator of $\beta_1$ will converge to zero,

$$V(noise) \to \infty \;\Rightarrow\; E(\hat\beta_{OLS}) = \frac{Cov(y, x_1)}{V(x_1) + V(noise)} \to 0$$

Measurement error in the dependent variable

Consider the situation in which $y_i$ is not observed, but $y_i^{noise}$ is observed. There is no measurement error in $x_1$. The model (DGP) is,

$$y_i = 10 + 5x_{1i} + \varepsilon_i$$
$$y_i^{noise} = y_i + noise_i$$

Will the OLS estimator of β1 be unbiased? Yes

$$E(\hat\beta_{OLS}) = \frac{Cov(y^{noise}, x_1)}{V(x_1)} = \frac{Cov(y + noise, x_1)}{V(x_1)} = \frac{Cov(y, x_1)}{V(x_1)} = \beta_1$$
This model is equivalent to the model $y_i = 10 + 5x_{1i} + (\varepsilon_i + noise_i)$, where $y_i$ is observed.

Measurement error in the dependent variable

Will the OLS estimator be unbiased if the measurement error was multiplicative instead of additive? Formally, if the DGP was:

$$y_i = 10 + 5x_{1i} + \varepsilon_i, \qquad y_i^{noise} = y_i \cdot noise_i$$

Analytic derivations:
$$E(\hat\beta_{OLS}) = \frac{Cov(y^{noise}, x_1)}{V(x_1)} = \frac{Cov(y \cdot noise, x_1)}{V(x_1)}$$
$$Cov(y \cdot noise, x_1) = E(y \cdot noise \cdot x_1) - E(y \cdot noise)\,E(x_1) = E(noise) \cdot Cov(y, x_1)$$
(when the noise is independent of $y$ and $x_1$), hence
$$E(\hat\beta_{OLS}) = \frac{E(noise) \cdot Cov(y, x_1)}{V(x_1)} = E(noise) \cdot \beta_1$$
When the noise is multiplicative, the bias of $\hat\beta$ is driven by $E(noise)$, not by $V(noise)$.

Measurement error: Division bias

Borjas (1980) introduced the problem of division bias in the context of estimating labor supply with respect to the hourly wage. The problem arises when the researcher does not observe hourly wages, but observes:
- Hours worked, $h$, measured with error. The true hours of work are $h^*$ and the observed hours of work are $h = \eta \cdot h^*$.
- Earnings, measured without error. Earnings are defined as the hourly wage times hours worked, $e = w^* \cdot h^*$.
- The hourly wage is calculated as $w = \frac{e}{h} = \frac{e}{\eta \cdot h^*}$. The true hourly wage is $w^* = \frac{e}{h^*}$.
- If we observed $h^*$ it would be simple to extract $w^*$ from $e$: $w^* = e/h^*$.

By taking $\log(\cdot)$ of $w$ and $h$ we can re-write the measurement error as,

$$\log(w) = \log(e) - \log(h^*) - \log(\eta) = \log(e/h^*) - \log(\eta) = \log(w^*) - \log(\eta)$$
$$\log(h) = \log(h^*) + \log(\eta)$$

Measurement error: Division bias

The researcher is interested in estimating the following regression,
$$\log(h^*) = \alpha + \beta \cdot \log(w^*) + \varepsilon_i$$
What is the bias that results from estimating instead,
$$\log(h) = a + b \cdot \log(w) + e_i$$

Borjas (1980) showed that,
$$\operatorname{plim}\, \hat b = \beta \cdot \frac{\sigma^2_{w^*}}{\sigma^2_{w^*} + \sigma^2_\eta} - 1 \cdot \frac{\sigma^2_\eta}{\sigma^2_{w^*} + \sigma^2_\eta}$$
Does division bias always bias the OLS estimator downwards? No; if $\beta = -5$, the bias will be upwards. Division bias leads to an estimator that is biased towards $-1$, as $\hat b$ is a weighted average of $\beta$ and $-1$. Unlike standard measurement error, division bias can change the sign of the estimated OLS coefficient.

Measurement error: Division bias

Until now we considered the effect of division bias when estimating a log-log regression. What is the effect when estimating:
1. log-linear: $\log(h^*) = \alpha + \beta \cdot w^* + \varepsilon_i$
2. linear-log: $h^* = \alpha + \beta \cdot \log(w^*) + \varepsilon_i$
3. linear-linear: $h^* = \alpha + \beta \cdot w^* + \varepsilon_i$

In all these cases it is difficult (i.e., there is no simple analytical solution) to derive the bias of the OLS coefficient. Simulations show the bias is downwards and might change the sign of the coefficient, similar to the log-log case.

Division bias: simulation

rm(list = ls())
set.seed(12345)
n = 10000
elasticity = 5
w.vec = seq(20, 50, length = 1000)
w = sample(w.vec, size = n, replace = TRUE)
h = w^elasticity
e = w*h
sigma.vec = seq(0.1, 1, length = 20)
noise = rnorm(n, mean = 0, sd = 0.7) + 1
hs = h * noise
ws = e/hs
if (min(hs, ws) < 0){
  index = (hs < 0 | ws < 0)
  hs[which(index)] = NA
  ws[which(index)] = NA
  cat("Fraction of missing observations: ", sum(index)/n, "\n")
}

Why do we need the if () statement? When the noise is negative it makes the observed hours of work a negative number, and log(·) is not defined for negative numbers.

The regression results without division bias

                              Dependent variable:
                  -------------------------------------------------
                     log(h)         log(h)              h
                       (1)            (2)               (3)
  log(w)            5.000***
                    (0.000)
  w                               0.149***       9,163,099.000***
                                  (0.0002)       (36,657.090)
  Constant         -0.000        12.415***      -234,614,765.000***
                   (0.000)       (0.006)        (1,322,068.000)
  Observations     10,000        10,000          10,000
  R2                1.000         0.986           0.862
  Note: *p<0.1; **p<0.05; ***p<0.01

The regression results with division bias

                              Dependent variable:
                  -------------------------------------------------
                     log(hs)        log(hs)             hs
                       (1)            (2)               (3)
  log(ws)          -0.486***
                    (0.018)
  ws                             -0.0001***       -1,282.862**
                                 (0.00001)        (525.461)
  Constant         19.267***     17.504***        96,180,729.000***
                   (0.067)       (0.016)         (1,239,380.000)
  Observations      9,224         9,224            9,224
  R2                0.074         0.011            0.001
  Note: *p<0.1; **p<0.05; ***p<0.01

Gauss-Markov theorem: BLUE

The regression estimator is a linear estimator, $\hat\beta = Cy$, where $C = (X'X)^{-1}X'$. A linear estimator is any $\hat\beta_j$ such that $\hat\beta_j = c_1y_1 + c_2y_2 + \cdots + c_ny_n$.

The Gauss-Markov theorem: if the following assumptions hold,

- Homoskedasticity (without this the MMSE linear estimator is GLS, which is not a feasible estimator as we do not know the error covariance matrix).
- $X$ is full rank (necessary for the OLS algorithm to have a unique solution).
- $E[\varepsilon_i|X_i] = 0$ (necessary for the estimator to be unbiased).
- $Y_i$ is a linear function of $X_i$ (linear in the parameters $\beta$).
- The Gauss-Markov theorem is for fixed regressors, i.e., unbiasedness is conditional on $X$: $E[\hat\beta|X] = \beta$.

then the OLS estimator is the best linear unbiased estimator (BLUE), in terms of MSE (mean squared error). What would a machine learning / computer science researcher who is interested in making predictions criticize about the Gauss-Markov theorem?

Frisch-Waugh-Lovell: Regression Anatomy

In the simple bivariate case:

$$\beta_1 = \frac{Cov(Y_i, X_i)}{Var(X_i)}$$

In the multivariate case, βj is:

$$\beta_j = \frac{Cov(Y_i, \tilde X_{ij})}{Var(\tilde X_{ij})} = \frac{Cov(\tilde Y_i, \tilde X_{ij})}{Var(\tilde X_{ij})}$$

where $\tilde A_{ij}$ ($\tilde Y_i$ or $\tilde X_{ij}$) denotes the residuals from the regression of $A_{ij}$ on all the other covariates.

The multiple regression coefficient $\hat\beta_j$ represents the additional contribution of $x_j$ to $y$, after $x_j$ has been adjusted for $1, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$.

What happens when xj is highly correlated with some of the other xk ’s?
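A quick numerical check of the regression-anatomy formula (a sketch with simulated, deliberately correlated covariates):

set.seed(7)
n = 1000
x1 = rnorm(n)
x2 = 0.5*x1 + rnorm(n)                 # x2 is correlated with x1
y = 1 + 2*x1 - x2 + rnorm(n)
b.full = coef(lm(y ~ x1 + x2))["x1"]   # multivariate coefficient on x1
x1.tilde = resid(lm(x1 ~ x2))          # partial out x2 (and the intercept) from x1
b.fwl = cov(y, x1.tilde) / var(x1.tilde)
c(b.full, b.fwl)                       # identical up to numerical error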

Frisch-Waugh-Lovell: Regression Anatomy

Claim: $\beta_j = \frac{Cov(\tilde Y_i, \tilde X_{ij})}{Var(\tilde X_{ij})}$, i.e., $Cov(Y_i, \tilde X_{ij}) = Cov(\tilde Y_i, \tilde X_{ij})$.
Proof: Let $\tilde Y_i$ be the residuals of a regression of $Y_i$ on all the covariates except $X_{ji}$, i.e.,

$$X_{ji} = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_P X_{Pi} + f_i$$
$$Y_i = \alpha_0 + \alpha_1 X_{1i} + \alpha_2 X_{2i} + \cdots + \alpha_P X_{Pi} + e_i$$

Then $\hat e_i = \tilde Y_i$ and $\hat f_i = \tilde X_{ji}$. It follows from the OLS algorithm that $Cov(X_{ki}, \tilde X_{ji}) = 0$ for all $k \neq j$, as the residuals of a regression are not correlated with any of the covariates:

$$Cov(\tilde Y_i, \tilde X_{ij}) = Cov\big(Y_i - \hat\alpha_0 - \hat\alpha_1 X_{1i} - \hat\alpha_2 X_{2i} - \cdots - \hat\alpha_P X_{Pi},\; \tilde X_{ij}\big) = Cov(Y_i, \tilde X_{ij})$$

Asymptotics of OLS

Is the OLS estimator of β consistent? Yes. Proof:

Denote the observed characteristics of observation $i$ by $X_i$. What are the dimensions of $X_i$? $p \times 1$.
$$X_i' = (x_{i1}, x_{i2}, \ldots, x_{ip}), \qquad X_i = \begin{pmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{ip} \end{pmatrix}$$
$$X_iX_i' = \begin{pmatrix} X_{i1}^2 & X_{i1}X_{i2} & \cdots & X_{i1}X_{ip} \\ X_{i2}X_{i1} & X_{i2}^2 & \cdots & X_{i2}X_{ip} \\ \vdots & & & \vdots \\ X_{ip}X_{i1} & X_{ip}X_{i2} & \cdots & X_{ip}^2 \end{pmatrix}$$

Asymptotics of OLS

Verify at home that,

$$X'X = \begin{pmatrix} \sum_{i=1}^n X_{i1}^2 & \sum_{i=1}^n X_{i1}X_{i2} & \cdots & \sum_{i=1}^n X_{i1}X_{ip} \\ \sum_{i=1}^n X_{i2}X_{i1} & \sum_{i=1}^n X_{i2}^2 & \cdots & \sum_{i=1}^n X_{i2}X_{ip} \\ \vdots & & & \vdots \\ \sum_{i=1}^n X_{ip}X_{i1} & \sum_{i=1}^n X_{ip}X_{i2} & \cdots & \sum_{i=1}^n X_{ip}^2 \end{pmatrix}_{p \times p}$$

Hence, is $X'X = \sum_{i=1}^n X_iX_i'$ or $X'X = \sum_{i=1}^n X_i'X_i$? Answer: $X'X = \sum_{i=1}^n X_iX_i'$. Note (and verify at home),

$$X'y = \begin{pmatrix} \sum_{i=1}^n X_{i1}Y_i \\ \sum_{i=1}^n X_{i2}Y_i \\ \vdots \\ \sum_{i=1}^n X_{ip}Y_i \end{pmatrix} = \sum_{i=1}^n X_iY_i$$

Asymptotics of OLS

The OLS estimator is $\hat\beta = (X'X)^{-1}X'y$. Recall that $(X \cdot k)^{-1} = k^{-1} \cdot X^{-1}$ for a scalar $k$. Multiplying and dividing by $\frac{1}{n}$ yields,

$$\hat\beta = \Big(\frac{1}{n}X'X\Big)^{-1}\Big(\frac{1}{n}X'y\Big) = \Big(\frac{1}{n}\sum_{i=1}^n X_iX_i'\Big)^{-1}\Big(\frac{1}{n}\sum_{i=1}^n X_iy_i\Big)$$

$$\to E[X_iX_i']^{-1}E[X_iY_i] = E[X_iX_i']^{-1}E\big[X_i(X_i'\beta + \varepsilon_i)\big]$$
The convergence of the sample averages to their expectations follows from the law of large numbers (LLN).

$$= E[X_iX_i']^{-1}E[X_iX_i']\,\beta + E[X_iX_i']^{-1}E[X_i\varepsilon_i] = \beta$$
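A small sketch of this consistency result on simulated data (the DGP and sample sizes are arbitrary choices):

set.seed(8)
beta.hat = function(n){
  x = rnorm(n)
  y = 1 + 2*x + rt(n, df = 5)     # non-normal errors are fine for consistency
  coef(lm(y ~ x))[2]
}
sapply(c(100, 1000, 10000, 100000), beta.hat)   # slope estimates approach 2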

A random coefficient model

Assume the following DGP,

$$Y_i = \alpha + \beta_i \cdot T_i + \varepsilon_i$$

where $\beta_i$ is the treatment effect, which can vary across individuals and is a random variable. $\varepsilon_i$ is an error term independent of $\beta_i$ and $T_i$.

Assume the treatment Ti is assigned at random. Claim: The following regression provides an asymptotically unbiased estimate of a meaningful parameter of interest,

Yi = a + b · Ti + ei

Is this claim correct? Asymptotically, what is the OLS estimator (OLS algorithm) estimating? $b = \frac{Cov(Y_i, T_i)}{V[T_i]}$. What is $b = \frac{Cov(Y_i, T_i)}{V[T_i]}$ equal to? $b = E[\beta_i]$.
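A sketch of this claim under an assumed DGP ($E[\beta_i] = 2$, randomized binary treatment): the OLS slope estimates the average treatment effect $E[\beta_i]$.

set.seed(9)
n = 1e5
d = rbinom(n, 1, 0.5)                 # randomized treatment T_i
beta.i = rnorm(n, mean = 2, sd = 1)   # heterogeneous treatment effects
y = 1 + beta.i*d + rnorm(n)
coef(lm(y ~ d))["d"]                  # close to E[beta_i] = 2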

Regression in Causal Analysis

Imagine we are analyzing a randomized experiment with a regression using the following model:

$$Y_i = \alpha + \beta_1 \cdot T_i + X_i'\beta_2 + \varepsilon_i$$

where Ti is an indicator variable for treatment status and Xi is a vector of pre-treatment characteristics. Under this model, what is random? We need to decide whether to assume Xi is fixed or a random variable.

Note: if we consider $X_i$ as a random variable we are interested in the average treatment effect; if we consider $X_i$ as fixed we are interested in the conditional average treatment effect (conditional on these $X$ characteristics).

How do we interpret the coefficient β1? What is the OLS estimator of the coefficient β1?

Revisiting the regression example from lecture

Let's recall the example from class:

- Let $Y_t$ be an i.i.d. standard normal random variable, indexed by $t$.
- Define $\Delta Y_t = Y_t - Y_{t-1}$.
- Let's estimate via OLS,

∆Yt = α + β1∆Yt−1 + β2∆Yt−2 + β3∆Yt−3

- Question: What are the values of the betas as $n \to \infty$?

Next we consider a simplified regression specification and analytically calculate $\beta_1$ when $n \to \infty$,

∆Yt = α + β1∆Yt−1 + β2∆Yt−2

Using which theorem or method can we calculate β1?

Revisiting the regression example from lecture

Using which theorem or method can we calculate $\beta_1$?

1. We know that as $n \to \infty$ the OLS estimator converges to,

$$\beta = \big(E[X'X]\big)^{-1}E[X'y]$$

We can calculate $E[X'X]$ and $E[X'y]$ and then invert $E[X'X]$. This might be a bit algebraic, as $E[X'X]$ is of dimensions $3 \times 3$.

2. Another option is to use the FWL theorem. What are the steps to calculate $\beta_1$ using FWL?

Calculating $\beta_1$ using FWL, step by step:
1. Regress $\Delta Y_{t-1}$ on $\Delta Y_{t-2}$ and an intercept. Denote the residuals as $\Delta\tilde Y_{t-1}$.
2. Regress $\Delta Y_t$ on $\Delta\tilde Y_{t-1}$ and an intercept.
3. By FWL, $\beta_1 = \frac{Cov(\Delta Y_t, \Delta\tilde Y_{t-1})}{V[\Delta\tilde Y_{t-1}]}$.

Calculating $\beta_1$ using FWL may take a few steps and some algebra, but it is a powerful tool for extracting a single coefficient from a multivariate regression.

Calculating $\beta_1$ using FWL

Step 1: Regress $\Delta Y_{t-1}$ on $\Delta Y_{t-2}$ and an intercept:

$$\Delta Y_{t-1} = a + b\,\Delta Y_{t-2} + \Delta\tilde Y_{t-1}$$
$$\Rightarrow\; b = \frac{Cov(\Delta Y_{t-1}, \Delta Y_{t-2})}{V[\Delta Y_{t-2}]} = \frac{-\sigma_Y^2}{\sigma_Y^2 + \sigma_Y^2} = \frac{-1}{1+1} = -\frac{1}{2}$$
We can calculate $a$ by using the fact that the regression line passes through the mean point,

$$E[\Delta Y_{t-1}] = a + b\,E[\Delta Y_{t-2}] \;\Rightarrow\; 0 = a + b \cdot 0 \;\Rightarrow\; a = 0$$
Hence, $\Delta\tilde Y_{t-1} = \Delta Y_{t-1} + \frac{1}{2}\Delta Y_{t-2}$.

Step 2: Regress $\Delta Y_t$ on $\Delta\tilde Y_{t-1}$ and an intercept:

$$\beta_1 = \frac{Cov(\Delta Y_t, \Delta\tilde Y_{t-1})}{V[\Delta\tilde Y_{t-1}]} = \cdots = \frac{-\sigma_Y^2}{\sigma_Y^2 + \frac{1}{4}\sigma_Y^2 + \frac{1}{4}\sigma_Y^2} = -\frac{2}{3}$$

Note that $\sigma_Y^2 = 1$, as $Y$ is distributed i.i.d. standard normal.

Revisiting the regression example from lecture: Coding exercise

Next we use R to test the claim that as the number of lags of $\Delta Y$ increases and $n \to \infty$, the value of $\hat\beta_1$ converges. Write a simulation in R that:

- Generates $\Delta Y_t, \Delta Y_{t-1}, \ldots, \Delta Y_{t-p}$ variables, where $p = 150$.
- Writes a loop that calculates $\hat\beta_1$ for each number of lags between 1 and 130.
- Plots your results in a figure.

Simulation results

[Figure: value of $\hat\beta_1$ (y-axis) plotted against the number of lags (x-axis, 0 to 120). $\hat\beta_1$ starts around $-0.5$ with one lag and approaches $-1$ as the number of lags grows; a horizontal reference line marks $-1$.]

rm(list = ls())
set.seed(12345)
n = 5000
p = 150

Y = matrix(rnorm(n*p), ncol = p, nrow = n, byrow = TRUE)
# column l becomes the first difference Y[, l] - Y[, l+1]
for (l in c(1:(p-1))){
  Y[, l] = Y[, l] - Y[, l+1]
}
Y = Y[, -p]
data = data.frame(y = Y[, 1], Y[, -1])
pp = 130
num.lags = c(1:pp)
beta1 = rep(NA, pp)
# re-estimate beta1 with an increasing number of lags of dY as regressors
for (k in c(1:pp)){
  beta1[k] = coef(lm(y ~ (.), data = data[, c(1:(num.lags[k]+1))]))[2]
}

Yotam Shem-Tov STAT 239A/ PS 236A September 3, 2015 56 / 58 par(cex=0.7,cex.lab=1.2) plot(num.lags,beta1, las=1, frame=FALSE, ylim=c(min(min(beta1,-1.05)),max(beta1)), xlab="Number of lags", ylab=expression(paste("Value of ",beta[1]))) lines(num.lags,beta1,lty=2) abline(h=-1,col="red4")

Proof: $E^*[Y_i|X_i] = E[Y_i|X_i]$ if $E[Y_i|X_i]$ is linear

$$E^*[Y_i|X_i] = X_i'\,E[X_iX_i']^{-1}E[X_iY_i] = X_i'\,E[X_iX_i']^{-1}E\big[X_i\,E[Y_i|X_i]\big] = X_i'\,E[X_iX_i']^{-1}E[X_iX_i'\beta] = X_i'\,E[X_iX_i']^{-1}E[X_iX_i']\,\beta = X_i'\beta$$

