
Web Science & Technologies University of Koblenz ▪ Landau, Germany

Lecture

Data Science

Regression and Causal Inference

JProf. Dr. Claudia Wagner

Bias-Variance Tradeoff

Lever et al., Nature Methods 13(9), 2016, https://www.nature.com/articles/nmeth.3968

WeST Claudia Wagner 2

Solutions

. Model Selection
 find a minimal model that produces a low test prediction error
 e.g. via cross-validated prediction error, adjusted R², Mallows' Cp
 limits the number of parameters

. Regularization
 fit a model involving all p predictors
 shrink coefficients towards zero to reduce variance and achieve low bias at the same time
 constrains the magnitude of the parameters


Cross-validated Prediction Error

. K-fold Cross-Validation
. Prediction error on test data

Src: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#/media/File:K-fold_cross_validation_EN.jpg
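As a sketch, k-fold splitting and the averaged test-fold error can be written out in a few lines of Python (the helper names `kfold_indices` and `cv_error` are mine, not from the lecture):

```python
# k-fold cross-validation sketch: each fold serves once as the test set,
# the remaining k-1 folds as the training set; squared errors are averaged.

def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cv_error(x, y, fit, predict, k=5):
    """Average squared prediction error over k folds."""
    errors = []
    for train, test in kfold_indices(len(x), k):
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in test:
            errors.append((predict(model, x[i]) - y[i]) ** 2)
    return sum(errors) / len(errors)
```

With k equal to the number of observations this becomes leave-one-out cross-validation.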


Mean Squared (Prediction) Error

. MSE(θ̂) = E[(θ̂ − θ)²]

. If we have multiple models and want to select the best we can also use MSE

. MSE(θ̂) = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²]
. MSE(θ̂) = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)²

Variance + Bias²
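The decomposition can be checked numerically on a toy sample of estimates (the numbers below are made up for illustration):

```python
# Numeric check of MSE(θ̂) = Var(θ̂) + Bias(θ̂)², on a toy sample of estimates.

theta_true = 5.0
estimates = [4.0, 5.5, 6.0, 4.5, 5.0]          # hypothetical θ̂ values

n = len(estimates)
mean_est = sum(estimates) / n                   # E[θ̂]
mse = sum((e - theta_true) ** 2 for e in estimates) / n
variance = sum((e - mean_est) ** 2 for e in estimates) / n
bias_sq = (mean_est - theta_true) ** 2

assert abs(mse - (variance + bias_sq)) < 1e-12  # the identity holds
```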


Model Selection

Choose among 3 models (polynomial degree 1, 2, or 3)
. Use the test error to select the best model  model 2 is the best model according to the test error
. But what do we report as the test error now?
. The test error should be computed on data that was used neither for training nor for choosing the best model


. Solution: split the labeled data into three parts

Lever et al., Nature Methods 13(9), 2016, https://www.nature.com/articles/nmeth.3968


. Training Data (blue)
. Validation Data (green)  d = 2 is chosen
. Test Data (brown)  a test error of 1.3 is computed for d = 2


Mallows' Cp (1973)

Assume we have a regression model with n independent variables (n parameters)  choose the smallest best model based on validation data

Cp = SSE_p / σ̂²_full − N + 2p

 SSE_p: sum of squared errors of the model with p parameters
 p: number of parameters in the model
 σ̂²_full: estimated error variance of the full model with n parameters
 N: number of observations
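A minimal sketch of the statistic, assuming the standard form Cp = SSE_p/σ̂² − N + 2p (the function name is mine):

```python
# Mallows' Cp sketch: sse_p is the candidate model's sum of squared errors,
# sigma2_full the error-variance estimate of the full model, n_obs the number
# of observations, p the number of parameters in the candidate model.

def mallows_cp(sse_p, sigma2_full, n_obs, p):
    return sse_p / sigma2_full - n_obs + 2 * p
```

Among candidate models one would pick a small model whose Cp is close to p; a Cp far above p signals unexplained error.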

Solutions

. Model Selection
 find a minimal model that produces a low test prediction error
 e.g. via cross-validated prediction error, adjusted R², Mallows' Cp
 limits the number of parameters

. Regularization
 fit a model involving all p predictors
 shrink coefficients towards zero to reduce variance and achieve low bias at the same time
 constrains the magnitude of the parameters
 some regularization methods (Lasso) can also be used to remove parameters


Regularization

. Estimate coefficients:

β̂ = argmin_β [ Σᵢ (yᵢ − xᵢᵀβ)²  +  λ‖β‖p ]
                    LSE              penalty

λ ≥ 0 is a tuning parameter, which controls the strength of the penalty term.

Lever et al., Nature Methods 13(10), 2016, https://www.nature.com/articles/nmeth.4014
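For a single predictor without intercept the ridge case (p = 2) has a closed form, which makes the shrinkage visible (a sketch; the function name is mine):

```python
# Ridge for y ≈ β·x without intercept: minimizing Σ(y − βx)² + λβ² gives
# β̂ = Σxy / (Σx² + λ); λ = 0 recovers the ordinary least-squares estimate.

def ridge_slope(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)
```

Larger λ shrinks β̂ towards zero, trading a little bias for less variance.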

Example

. Why do we need regularization?

 Y = b1*X1 + b2*X2 + b3*X3 + error

 In reality the underlying model is Y = 5 * X3

. Ideal estimate for parameter vector: b=(0, 0, 5)

. If by chance X1 and X2 are perfectly correlated (e.g. X1 = 2*X2), then there are many other solutions that give us an equal fit:
 e.g. b = (50, -100, 5)

Lever et al., Nature Methods 13(10), 2016, https://www.nature.com/articles/nmeth.4014
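A quick check of this point: with X1 = 2*X2 the two parameter vectors produce identical predictions, so the data alone cannot distinguish them — only a penalty can break the tie:

```python
# If X1 = 2*X2, b = (0, 0, 5) and b = (50, -100, 5) make identical predictions,
# because 50*X1 - 100*X2 = 50*(2*X2) - 100*X2 = 0 on every row.

def predict(b, x1, x2, x3):
    return b[0] * x1 + b[1] * x2 + b[2] * x3

rows = [(2.0, 1.0, 3.0), (4.0, 2.0, -1.0), (6.0, 3.0, 0.5)]   # X1 = 2*X2 everywhere
for x1, x2, x3 in rows:
    assert predict((0, 0, 5), x1, x2, x3) == predict((50, -100, 5), x1, x2, x3)

# An L2 penalty prefers the small-norm solution (0, 0, 5):
norm2 = lambda b: sum(v * v for v in b)
assert norm2((0, 0, 5)) < norm2((50, -100, 5))
```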

Regularization

. Lp regularizer is defined as the parameter vector norm:

Lp(θ) = ‖θ‖p = ( Σᵢ |θᵢ|^p )^(1/p)

. Surface on which the norm takes on constant values:

θ0 can only be 1 or −1 if θ1 is 0

Ridge


. Find a balance between two loss terms: the LSE of the data and the regularization term

L2 (Ridge)

Src: Prof. Alexander Ihler, University of California, Irvine, https://www.youtube.com/watch?v=sO4ZirJh9ds

L1 and L2 Norm

L1 (Lasso) tends to generate sparser solutions

L2 (Ridge)  L1 (Lasso)

Src: Prof. Alexander Ihler, University of California, Irvine

. Ridge (L2)
 provides a unique solution even if multiple variables are correlated
 but it does not reduce the number of parameters

. Lasso (L1) has several important weaknesses:
 it does not guarantee one unique solution
 it is not robust to collinearity

. In practice the elastic net (L1+L2) is used
 it combines ridge and lasso
 two lambda parameters need to be picked


Other Problems I

. Outliers
 Extreme values can have a big impact on the fitted line, since we try to minimize the squared error
 Inclusion/exclusion of extreme values often changes results significantly
 Solution:
• Use log to reduce the impact of large values
• Bootstrapping or Cross-Validation


Other Problems II

. Omitted Variable Bias
 If the omitted variable is a determinant of the DV (outcome) AND
 the omitted variable is correlated with the included IVs

 The effect of the omitted variable (confounder) is attributed to another variable which is included in the model

 Heteroskedastic error can indicate that related variables have been omitted


Example

. Do students from elite colleges earn more later in life?
. earn ~ b0 + b1*college + error
 Do people who went to an elite college earn on average more later in life?

[Diagram: College → Salary, over time]

Problem with Regressions

Being accepted into an elite college correlates with motivation and socio-economic status. These factors also correlate with salary.

[Diagram: Motivation and socio-economic status → both College and Salary, over time]

Causal Effect

Did going to an elite college impact your future earnings?

If you had chosen a normal college, would you have earned less?

Yi(T) - Yi(C)

[Diagram: wealth over time, with elite college vs. normal college]


Experiment

Gold Standard Method for causal inference!

Assignment of subjects to treatment or control group is random and groups are large enough to wash out differences in other covariates (Ignorability assumptions)

Y0, Y1 ⟂ Treatment Assignment

Manipulation of the treatment is under the control of the researcher. Different levels of treatment should lead to different levels of effect.


Observational Data

Randomization in experiments assures that: T ⟂ O

[Diagram: covariates C affect both treatment T and outcome O]

In observational studies we need to control for the covariates C that affect both the outcome O and the treatment assignment T: T ⟂ O | C


METHODS


Solutions

. Matching Methods

. Instrumental Variables

. Regression Discontinuity


Matching Methods

. Idea: find a subset of the observational data that looks like it was generated by an experiment

. Balance pre-treatment covariates X so that: 푇⟘푂|푋

. matching == pruning


Does Special Training Help Job Promotion?

[Diagram: treatment = special training; outcome = position; covariate = education (in years)]

Ho, Daniel, Kosuke Imai, Gary King, and Elizabeth Stuart. 2007. "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference." Political Analysis 15: 199–236. http://j.mp/jPupwz

Example: Greedy Distance Matching

Are people who take special training more successful (higher position)? Pre-education is our confounder.

Treated | Edu | Pos      Treated | Edu | Pos
   1    |  4  |  8          0    | 10  |  5
   1    |  7  |  7          0    |  6  |  5
   1    | 10  |  9          0    |  1  |  9

Matches (by education distance): 4–6  8−5=3; 7–10  7−5=2; 10–1  9−9=0  ATE estimate: (3+2+0)/3

Introduce a caliper = the maximal acceptable distance  throw bad matches away
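The toy example above can be reproduced with a small greedy matcher (names and structure are mine; `caliper` is the maximal acceptable education distance):

```python
# Greedy 1:1 matching on one covariate (education) with a caliper.
# Each record is (edu, pos); each control is used at most once.

def greedy_match(treated, control, caliper):
    pool = list(control)
    pairs = []
    for t in treated:
        if not pool:
            break
        best = min(pool, key=lambda c: abs(c[0] - t[0]))
        if abs(best[0] - t[0]) <= caliper:   # keep only acceptable matches
            pairs.append((t, best))
            pool.remove(best)
    return pairs

treated = [(4, 8), (7, 7), (10, 9)]
control = [(10, 5), (6, 5), (1, 9)]

pairs = greedy_match(treated, control, caliper=5)
effect = sum(t_pos - c_pos for (_, t_pos), (_, c_pos) in pairs) / len(pairs)
```

With caliper=5 the bad 10–1 match is thrown away, leaving two pairs and an effect estimate of (3+2)/2.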


Analyse Matched Data

 Compute the test statistic for matched data pairs

• OutcomeTreated - OutcomeControl

 Null hypothesis (no treatment effect)

• E[OutcomeTreated] – E[OutcomeControl ] = 0

 Is our observed test statistic surprising?

• Assume the test statistic follows a normal distribution under H0
• OR
• Randomly permute the treatment assignments (shuffle treatment labels) and compute the distribution under H0 empirically
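The second option — an exact sign-flip permutation test on the matched-pair differences — can be sketched as:

```python
# Permutation test for matched pairs: under H0 the labels within each pair are
# exchangeable, so we flip each pair's sign and count how often the permuted
# mean difference is at least as extreme as the observed one.

import itertools

def permutation_p_value(diffs):
    """Exact two-sided sign-flip permutation test on pair differences."""
    observed = abs(sum(diffs))
    count = total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            count += 1
    return count / total

p = permutation_p_value([3, 2, 0])   # the toy pair differences from the matching example
```

With only three pairs the p-value cannot be small; real analyses need more pairs (or random sampling of sign patterns instead of full enumeration).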


Regression After Matching

pos = b0 + b1*edu + b2*is_treated

1) Preprocessing (matching)
2) Estimation of effects (regression models)

[Plots: position (p) vs. education (e), before and after matching]

Matching can help to reduce model dependence!


Why do we need to match?

pos = b0 + b1*edu + b2*is_treated    (is_treated: binary variable)

[Plot: position (p) vs. education (e); the estimated treatment effect is the vertical shift between the two groups' regression lines]

Correcting for education, the treated group has higher positions.


Quadratic Regression

pos = b0 + b1*edu + b2*edu² + γ*is_treated

Model Dependence  Reason: imbalance of covariates

[Plot: position (p) vs. education; the quadratic model yields a different treatment-effect estimate]


Assess Quality of Matches

. Was the matching successful? Are covariates balanced between the 2 groups?

. Standardized mean difference (smd) = the mean difference between the groups for each covariate, in units of the pooled standard deviation

. smd < 0.1 is good
. smd > 0.2 indicates imbalance
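A sketch of the diagnostic for a single covariate, using the pooled standard deviation of the two groups (the function name is mine):

```python
# Standardized mean difference: group mean difference in units of the
# pooled standard deviation — a common covariate-balance diagnostic.

def smd(treated, control):
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    pooled_sd = ((var(treated) + var(control)) / 2) ** 0.5
    return abs(mean(treated) - mean(control)) / pooled_sd
```

One would compute this per covariate, before and after matching, and hope that matching pushes every value below 0.1.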

Distance Matching

Approximates a fully blocked experiment  we balance the known covariates

One can match on multiple covariates at the same time, but if we have several hundred covariates, finding matches becomes difficult

Gary King, "Why Propensity Scores Should Not Be Used for Matching“, Methods Colloquium, 2015


Propensity Score Matching

Each data point is a high-dimensional vector X of covariates. Project it to one dimension. The propensity score is the probability of a subject receiving treatment, given all covariates we want to control for.

Propensity Score Matching

. Idea: achieve balance in covariates by conditioning on the propensity score π(X). This works if:
 P(X = x | π(X) = p, T = 1) = P(X = x | π(X) = p, T = 0)

 Outcome variable: T
 Independent variables (covariates): X
 Take the predicted outcome T̂ as the propensity score


Logistic Regression

. Logistic regression models the log-odds as a linear combination of the independent variables.

ln( P(Y = 1) / (1 − P(Y = 1)) ) = β0 + β1·X + ε

Lever et al., Nature Methods 13(7), 2016, https://www.nature.com/articles/nmeth.3904

Propensity Score

. Logistic regression model.

ln( P(Y = 1) / (1 − P(Y = 1)) ) = β0 + β1·X + ε

. For each observation we observe the covariates and predict the log-odds of being treated
. We learn which covariates increase/decrease the chance of being treated
. Match observations that have the same probability of being treated, but where one was treated and the other one was not
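A minimal propensity-score model — logistic regression of treatment T on a single covariate, fit by plain gradient ascent on the log-likelihood — might look like this (a sketch, not a production estimator):

```python
# Logistic-regression propensity scores for one covariate, via gradient ascent.
# The gradient of the log-likelihood w.r.t. (b0, b1) is Σ(t - p) and Σ(t - p)·x.

import math

def fit_logistic(xs, ts, steps=5000, lr=0.1):
    """Return (b0, b1) for P(T=1 | x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, t in zip(xs, ts):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (t - p)
            g1 += (t - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def propensity(b0, b1, x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

xs, ts = [1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1]   # made-up covariate/treatment data
b0, b1 = fit_logistic(xs, ts)
scores = [propensity(b0, b1, x) for x in xs]
```

Observations with similar propensity scores but different treatment status are then matched.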


Post-Analysis

. Did propensity score matching work?  Check if you have good overlap between treated and control group with respect to propensity score


Analyse Matched Data

 Same as before

 Compute test statistic for matched data pairs

• OutcomeTreated - OutcomeControl

 Null hypothesis (no treatment effect)

• E[OutcomeTreated] – E[OutcomeControl ] = 0

 Or • Outcome ~ b0+b1*X+b2*treated+error


Summary

. Matching == pruning

. Many design choices

. How to match?  greedy, optimal nearest neighbor

 1:1 or 1:k matching

. Caliper: remove bad matches
 a smaller caliper  lower bias but more variance


Solutions

. Matching Methods

. Regression Discontinuity

. Instrumental Variables


Natural Experiment

. Like in experiments but…

 assignment is only “as-if-random”

 researcher does not control the intervention or treatment


. What is the impact of receiving a scholarship on future performance?

[Diagram: test-score axis from min score to max score; below the threshold  no scholarship, above  scholarship]


Thad Dunning, Natural Experiments in the Social Sciences. A Design-Based Approach, p.126, 2012.


Regression Discontinuity (RD)

Regress future performance on the (coarsened) test score and a dummy variable for whether the scholarship was received or not
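A bare-bones sketch of the idea: within a bandwidth around the threshold, compare mean outcomes just above and just below (the data below is made up; names are mine):

```python
# Regression-discontinuity sketch: local difference in mean outcomes
# around the scholarship threshold.

def rd_estimate(scores, outcomes, threshold, bandwidth):
    above = [y for s, y in zip(scores, outcomes)
             if threshold <= s < threshold + bandwidth]
    below = [y for s, y in zip(scores, outcomes)
             if threshold - bandwidth <= s < threshold]
    return sum(above) / len(above) - sum(below) / len(below)

scores = [55, 58, 59, 60, 61, 64]
outcomes = [20.0, 20.0, 20.0, 25.0, 25.0, 25.0]   # jump of 5 at the threshold
effect = rd_estimate(scores, outcomes, threshold=60, bandwidth=5)
```

In practice one would fit separate regression lines on each side (or include a dummy, as above) rather than raw means, and vary the bandwidth as a robustness check.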


RD Robustness Checks

 Can individuals control if they are above or below the threshold?

 Jumps at placebo points versus at threshold?  Do other variables jump at the threshold?

 Are the results robust to the usage of different bin widths?

 Distinguish between discontinuity and non-linearity
• Are the results robust against including higher-order polynomials?


Solutions

. Matching Methods

. Regression Discontinuity

. Instrumental Variables


Instrumental Variables

Correlated, because "earning potential" is omitted  beta is biased and inconsistent!

[Diagram: M (military service) → E (lifetime earnings); P (earning potential, unobserved) affects both M and E]


Instrumental Variable

[Diagram: L (draft lottery, can be 0 or 1) → M (military service) → E (lifetime earnings); P (earning potential, unobserved) affects M and E]

L is the draft lottery. L is an instrument for estimating the causal effect of M on E.

An instrumental variable should be strongly correlated with the included endogenous regressor (L ↔ M), but should not affect the outcome variable directly (no direct L → E).

Angrist, Joshua D. (1990). "Lifetime Earnings and the Vietnam Draft Lottery: Evidence from Social Security Administrative Records". American Economic Review 80 (3)


Instrumental Variable

How to estimate the causal effect via instrumental variable L?

"Wald Estimator" (if L is binary):

( E[E|L=1] − E[E|L=0] ) / ( E[M|L=1] − E[M|L=0] )

Plausibility check: what if L is perfect instrument?

Example: winning the lottery ⇒ 90% chance of joining the military; 10% join without an invitation; the earnings difference is $2,200 per year  $2,200 / 0.8 = $2,750

Angrist found that military service decreases earnings by about $2,741
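The arithmetic above as code (a sketch; signs are dropped, i.e. this computes the magnitude of the earnings change):

```python
# Wald estimator for a binary instrument L: the effect of L on earnings E
# divided by the effect of L on military service M.

def wald(e_y_l1, e_y_l0, e_m_l1, e_m_l0):
    return (e_y_l1 - e_y_l0) / (e_m_l1 - e_m_l0)

# Lottery winners serve with probability 0.9, non-winners with 0.1,
# and the earnings difference is $2,200 per year:
effect = wald(2200.0, 0.0, 0.9, 0.1)   # ≈ 2750
```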


Instrumental Variable

Two-stage least squares:
 Stage 1: M ~ a0 + a1*L + error  predict M̂
 Stage 2: E ~ b0 + δ*M̂ + error

δ estimates the causal effect of M on E


Instrumental Variables

. Find instrumental variables
 they are highly correlated with the independent variables but not with the outcome
 earn ~ b0 + b1*college + error
 Confounders: motivation and socio-economic status

 Potential instruments
• Applying to an elite college correlates with motivation
• Living area correlates with socio-economic status

 college ~ b0 + b1*collegeAp + b2*living + error  predict college-hat
 earn ~ b0 + b1*college-hat
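The two stages can be sketched with ordinary least squares on synthetic data (one instrument for simplicity; all names and numbers are mine):

```python
# Two-stage least squares sketch: stage 1 regresses the endogenous regressor
# on the instrument; stage 2 regresses the outcome on the stage-1 predictions.

def ols_slope_intercept(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

def two_sls(instrument, regressor, outcome):
    a0, a1 = ols_slope_intercept(instrument, regressor)   # stage 1
    fitted = [a0 + a1 * z for z in instrument]            # e.g. college-hat
    b0, b1 = ols_slope_intercept(fitted, outcome)         # stage 2
    return b1                                             # causal-effect estimate

# Toy data where the outcome is exactly 3 times the regressor:
z = [0, 1, 0, 1, 0, 1]
x = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
y = [3.0 * v for v in x]
```

With real data the instrument only partially predicts the regressor, and standard errors need the proper 2SLS correction — dedicated IV estimators handle that.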


Summary

• Instrumental variables and discontinuities are often hard to find but powerful

• Matching methods help to approximate causality
 but researchers have lots of freedom when deciding how to match
 we remove data  we need to specify for which group the causal effect holds
 most matching methods have been developed for a low number of covariates
 worst case: random pruning  increases imbalance  increases bias and model dependence

• Compare results from different matching methods, different dimensionality-reduction methods, and different models
 avoid model dependence and method dependence!
