Web Science & Technologies University of Koblenz ▪ Landau, Germany
Lecture
Data Science
Regression and Causal Inference
JProf. Dr. Claudia Wagner

Model Selection
Bias-Variance Tradeoff
Lever et al., Nature 13(9), 2016, https://www.nature.com/articles/nmeth.3968
WeST Claudia Wagner 2
Solutions
. Model Selection
  . Find a minimal model that produces low test prediction error
  . e.g. via cross-validated prediction error, adjusted R2, Mallows' Cp
  . Limits the number of parameters
. Regularization
  . Fit a model involving all p predictors
  . Shrink coefficients towards zero to reduce variance and produce low bias at the same time
  . Constrains the magnitude of the parameters
Cross-validated Prediction Error
. K-fold Cross-Validation
. Prediction error is measured on the held-out test fold
Src: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#/media/File:K-fold_cross_validation_EN.jpg
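The scheme in the figure can be sketched in plain Python (a toy sketch: `fit` and `predict` are placeholder callables, and the constant-mean "model" is made up for illustration):

```python
import random

def k_fold_cv(data, k, fit, predict):
    """Estimate the test prediction error by k-fold cross-validation."""
    data = list(data)
    random.Random(0).shuffle(data)          # fixed seed, for reproducibility
    folds = [data[i::k] for i in range(k)]  # k roughly equal folds
    fold_errors = []
    for i in range(k):
        test = folds[i]
        train = [pt for j, f in enumerate(folds) if j != i for pt in f]
        model = fit(train)
        fold_errors.append(sum((y - predict(model, x)) ** 2 for x, y in test) / len(test))
    return sum(fold_errors) / k             # average prediction error over the folds

# Toy usage: the "model" is just the training mean, predicting a constant.
data = [(x, 2.0) for x in range(20)]
cv_err = k_fold_cv(data, k=5,
                   fit=lambda train: sum(y for _, y in train) / len(train),
                   predict=lambda model, x: model)
# cv_err == 0.0 here, since every y equals the training mean
```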
Mean Squared (Prediction) Error
. MSE(θ̂) = E[(θ̂ − θ)²]
. If we have multiple models and want to select the best, we can also use the MSE
. MSE(θ̂) = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²]
. MSE(θ̂) = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)²
         = Variance + Bias²
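The decomposition can be checked numerically. In the sketch below, the shrunk estimator 0.5·(sample mean) is a made-up example chosen so that both bias and variance are non-zero:

```python
import random, statistics

rng = random.Random(42)
theta = 3.0                                  # true parameter (population mean)
estimates = []
for _ in range(20000):                       # many repeated samples
    sample = [rng.gauss(theta, 1.0) for _ in range(10)]
    estimates.append(0.5 * statistics.mean(sample))   # deliberately biased (shrunk) estimator

mse = statistics.mean((e - theta) ** 2 for e in estimates)
variance = statistics.pvariance(estimates)
bias = statistics.mean(estimates) - theta
# mse equals variance + bias**2 (up to floating-point rounding)
```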
Model Selection
. Choose among 3 models (polynomial degree 1, 2, or 3)
. Use the test error to select the best model: model 2 is the best model according to the test error
. But what do we report as the test error now?
. The test error should be computed on data that was used neither for training nor for choosing the best model
. Solution: split the labeled data into three parts
Lever et al., Nature 13(9), 2016, https://www.nature.com/articles/nmeth.3968
. Training Data (blue)
. Validation Data (green): d = 2 is chosen
. Test Data (brown): a test error of 1.3 is computed for d = 2
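The three-way split can be sketched as follows. The three fixed candidate functions stand in for models of degree 1–3 already fit on the training part (the fitting step is elided, and all data and models are made up):

```python
import random

rng = random.Random(1)
# Synthetic data from y = 2x + noise.
data = [(i / 10, 2 * (i / 10) + rng.gauss(0, 0.1)) for i in range(60)]
rng.shuffle(data)
train, val, test = data[:40], data[40:50], data[50:]   # three disjoint parts

def mse(model, pts):
    return sum((y - model(x)) ** 2 for x, y in pts) / len(pts)

# Three fixed candidates standing in for the fitted degree-1/2/3 polynomials:
candidates = {"m1": lambda x: x, "m2": lambda x: 2 * x, "m3": lambda x: x * x}

best = min(candidates, key=lambda name: mse(candidates[name], val))  # select on validation data
reported_error = mse(candidates[best], test)                         # report error on test data
```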
Mallows Cp (1973)
. Assume we have a full regression model with n independent variables (n parameters); choose the smallest adequate submodel
. Cp = SSEp / σ̂² − N + 2p, where
  . SSEp: sum of squared errors of the submodel with p parameters
  . p: number of parameters in the submodel
  . σ̂²: error-variance estimate of the full model with all n parameters
  . N: number of observations
. Submodels with Cp close to p are preferred
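The criterion is then a one-liner; the numbers plugged in below are hypothetical:

```python
def mallows_cp(sse_p, p, sigma2_full, n_obs):
    """Mallows' Cp for a submodel with p parameters.

    sse_p       : sum of squared errors of the submodel
    p           : number of parameters in the submodel
    sigma2_full : error-variance estimate from the full model
    n_obs       : number of observations
    A submodel whose Cp is close to p is considered good.
    """
    return sse_p / sigma2_full - n_obs + 2 * p

# Hypothetical numbers: full-model variance 2.0, 50 observations.
cp = mallows_cp(sse_p=104.0, p=3, sigma2_full=2.0, n_obs=50)
# cp == 104/2 - 50 + 6 == 8.0
```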
Solutions
. Model Selection
  . Find a minimal model that produces low test prediction error
  . e.g. via cross-validated prediction error, adjusted R2, Mallows' Cp
  . Limits the number of parameters
. Regularization
  . Fit a model involving all p predictors
  . Shrink coefficients towards zero to reduce variance and achieve low bias at the same time
  . Constrains the magnitude of the parameters
  . Some regularization (Lasso) can also be used to remove parameters
Regularization
. Estimate the coefficients by minimizing the least-squares error plus an Lp penalty:
  β̂ = argmin_β [ Σi (yi − β·xi)²  +  λ·Lp(β) ]
      (LSE term)        (penalty term)
. λ ≥ 0 is a tuning parameter, which controls the strength of the penalty term.
Lever et al, Nature 13(10), 2016 https://www.nature.com/articles/nmeth.4014
Example
. Why do we need regularization?
Y = b1*X1 + b2*X2 + b3*X3 + error
In reality the underlying model is Y = 5 * X3
. Ideal estimate for parameter vector: b=(0, 0, 5)
. If by chance X1 and X2 are perfectly correlated (e.g. X1 = 2*X2), then many other solutions give an equally good fit:
  . e.g. b = (50, -100, 5)
Lever et al, Nature 13(10), 2016 https://www.nature.com/articles/nmeth.4014
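A quick numerical check of this example (a sketch with made-up data, using plain gradient descent rather than a library solver): with X1 = 2·X2 the least-squares fit is not unique, but adding an L2 penalty pins down one small-norm solution:

```python
import random

rng = random.Random(0)
# Data from Y = 5*X3 with X1 = 2*X2, so X1 and X2 are perfectly collinear.
X, Y = [], []
for _ in range(100):
    x2, x3 = rng.gauss(0, 1), rng.gauss(0, 1)
    X.append((2 * x2, x2, x3))
    Y.append(5 * x3)

def ridge_gd(X, Y, lam, steps=5000, lr=0.01):
    """Minimize (1/n)*sum (y - b.x)^2 + lam*||b||_2^2 by gradient descent."""
    n, b = len(X), [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [2 * lam * bj for bj in b]                 # penalty gradient
        for x, y in zip(X, Y):
            r = sum(bj * xj for bj, xj in zip(b, x)) - y  # residual
            for j in range(3):
                grad[j] += 2 * r * x[j] / n
        b = [bj - lr * g for bj, g in zip(b, grad)]
    return b

b1, b2, b3 = ridge_gd(X, Y, lam=0.1)
# b3 lands close to 5 (shrunk slightly by the penalty); b1 and b2 stay near 0,
# even though without the penalty any b with 2*b1 + b2 = const fits equally well.
```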
Regularization
. The Lp regularizer is defined as the parameter vector norm:
  Lp(θ) = ‖θ‖p = (Σi |θi|^p)^(1/p)
. Surfaces on which the norm takes on constant values: a diamond for L1 (Lasso), a circle for L2 (Ridge)
  . On the L1 unit surface, θ0 can only be 1 or −1 if θ1 is 0
. Find a balance between 2 loss functions: the LSE of the data and the L2 (Ridge) regularization term
Src: Prof. Alexander Ihler, University of California, Irvine https://www.youtube.com/watch?v=sO4ZirJh9ds
L1 and L2 Norm
L1 (Lasso) tends to generate sparser solutions than L2 (Ridge)
Src: Prof. Alexander Ihler, University of California, Irvine
. Ridge (L2)
  . Provides a unique solution even if multiple variables are correlated
  . But it does not reduce the number of parameters
. Lasso (L1) has several important weaknesses:
  . It does not guarantee a unique solution
  . It is not robust to collinearity
. In practice the elastic net (L1+L2) is used
  . It combines Ridge and Lasso
  . Two lambda parameters need to be picked
Other Problems I
. Outliers: extreme values can have a big impact on the fitted line, since we minimize the squared error
. Inclusion/exclusion of extreme values often changes the results significantly
. Solutions:
  . Use a log transform to reduce the impact of large values
  . Bootstrapping or cross-validation
Other Problems II
. Omitted Variable Bias: occurs if the omitted variable is a determinant of the DV (outcome) AND the omitted variable is correlated with the included IVs
. The effect of the omitted variable (confounder) is then attributed to another variable which is included in the model
. Heteroskedastic errors can indicate that related variables have been omitted
Example
. Do students from elite colleges earn more later in life?
. earn ~ b0 + b1*college + error
(Diagram: College → Salary, over Time)
Problem with Regressions
. Being accepted to an elite college correlates with motivation and socio-economic status
. These factors also correlate with salary
(Diagram: Motivation and socio-economic status → College and Salary, over Time)
Causal Effect
. Did going to an elite college impact your future earnings?
. If you had chosen a normal college, would you have earned less?
. Causal effect for individual i: Yi(T) − Yi(C)
(Diagram: wealth over time, with college as the intervention)
Experiment
. The gold-standard method for causal inference!
. Assignment of subjects to treatment or control group is random, and the groups are large enough to wash out differences in other covariates (ignorability assumption):
  (Y0, Y1) ⟂ Treatment Assignment
. Manipulation of the treatment is under the control of the researcher
. Different levels of treatment should lead to different levels of effect
Observational Data
. Randomization in experiments assures that: T ⟂ O
. In observational studies we need to control for the covariates C that affect both the outcome O and the treatment assignment T: T ⟂ O | C
(Diagram: C → T, C → O, T → O)
METHODS
Solutions
. Matching Methods
. Instrumental Variables
. Regression Discontinuity
Matching Methods
. Idea: find a subset of the observational data that looks as if it was generated by an experiment
. Balance the pre-treatment covariates X so that: T ⟂ O | X
. Matching == pruning
Does Special Training Help Job Promotion?
. Treatment: special training
. Outcome: position
. Covariate: education (in years)
Ho, Daniel, Kosuke Imai, Gary King, and Elizabeth Stuart. 2007. "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference." Political Analysis 15: 199–236. http://j.mp/jPupwz
Example: Greedy Distance Matching
. Are people who take special training more successful (i.e. reach a higher position)? Pre-treatment education is our confounder.

  Treated | Edu | Pos      Control | Edu | Pos
  1       | 4   | 8        0       | 10  | 5
  1       | 7   | 7        0       | 6   | 5
  1       | 10  | 9        0       | 1   | 9

. Greedy matches on education: 4↔6 (8−5=3), 7↔10 (7−5=2), 10↔1 (9−9=0) → ATE = (3+2+0)/3 ≈ 1.67
. Introduce a caliper = maximal acceptable distance; throw bad matches away
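The slide's example can be reproduced with a small greedy matcher (a sketch; the caliper argument mirrors the idea above):

```python
def greedy_match(treated, control, caliper=None):
    """Greedy 1:1 matching on a single covariate.

    treated, control: lists of (covariate, outcome) pairs.
    Returns the outcome differences (treated - matched control)."""
    available = list(control)
    diffs = []
    for cov_t, out_t in treated:
        best = min(available, key=lambda c: abs(c[0] - cov_t))
        if caliper is not None and abs(best[0] - cov_t) > caliper:
            continue                       # throw bad matches away
        available.remove(best)             # each control is used at most once
        diffs.append(out_t - best[1])
    return diffs

# The slide's numbers: covariate = education, outcome = position.
treated = [(4, 8), (7, 7), (10, 9)]
control = [(10, 5), (6, 5), (1, 9)]
diffs = greedy_match(treated, control)     # [3, 2, 0]: matches 4-6, 7-10, 10-1
ate = sum(diffs) / len(diffs)              # (3 + 2 + 0) / 3
```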
Analyse Matched Data
. Compute a test statistic for the matched data pairs
  . OutcomeTreated − OutcomeControl
. Null hypothesis (no treatment effect)
  . E[OutcomeTreated] − E[OutcomeControl] = 0
. Is our observed test statistic surprising?
  . Assume the statistic follows a normal distribution under H0, OR
  . Randomly permute the treatment assignments (shuffle the treatment labels) and compute the distribution under H0 empirically
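For matched pairs, shuffling the treatment labels within a pair just flips the sign of that pair's difference, so the empirical null distribution can be built from random sign flips. A sketch with made-up pair differences:

```python
import random

def permutation_test(pair_diffs, n_perm=10000, seed=0):
    """Two-sided permutation p-value for matched pairs.

    Under H0 swapping the labels inside a pair flips the sign of its
    difference, so random sign flips generate the null distribution
    of the mean difference."""
    rng = random.Random(seed)
    observed = abs(sum(pair_diffs) / len(pair_diffs))
    hits = 0
    for _ in range(n_perm):
        permuted = [d * rng.choice((1, -1)) for d in pair_diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            hits += 1
    return hits / n_perm

p_value = permutation_test([3, 2, 0, 4, 1, 2, 3, 1])   # made-up pair differences
# the exact p-value here is 2 * (1/2)**7 = 1/64 ≈ 0.016
```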
Regression After Matching
pos = b0 + b1*edu + b2*is_treated
1) Preprocessing (matching)
2) Estimation of effects (regression models)
(Plot: position (p) vs. education (e), before and after matching)
Matching can help to reduce model dependence!
Why do we need to match?
pos = b0 + b1*edu + b2*is_treated
. is_treated is a binary variable; b2 is the estimated treatment effect
(Plot: position (p) vs. education (e))
Correcting for education, the treated group has higher positions.
Quadratic Regression
pos = b0 + b1*edu + b2*edu² + γ*is_treated
. Model dependence: the linear and the quadratic model yield different effect estimates
. Reason: imbalance of the covariates
(Plot: position (p) vs. education (e))
Assess Quality of Matches
. Was the matching successful? Are the covariates balanced between the 2 groups?
. Standardized mean difference (smd) = mean difference between the groups for each covariate, in units of standard deviation
. smd < 0.1 is good; smd > 0.2 indicates imbalance
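A sketch of the balance check (the education values below are made up; pooling the two group variances is one common convention):

```python
import statistics

def smd(treated_vals, control_vals):
    """Standardized mean difference: |mean gap| in units of pooled std. dev."""
    gap = statistics.mean(treated_vals) - statistics.mean(control_vals)
    pooled_var = (statistics.variance(treated_vals)
                  + statistics.variance(control_vals)) / 2
    return abs(gap) / pooled_var ** 0.5

# Hypothetical education values for treated vs. control:
balanced = smd([4, 7, 10, 6], [5, 7, 9, 6])      # equal means -> smd = 0, well balanced
imbalanced = smd([4, 7, 10, 6], [1, 3, 2, 4])    # clearly above the 0.2 threshold
```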
Distance Matching
. Approximates a fully blocked experiment: we balance the known covariates
. One can match on multiple covariates at the same time
. But if we have several hundred covariates, finding matches becomes difficult
Gary King, "Why Propensity Scores Should Not Be Used for Matching“, Methods Colloquium, 2015
Propensity Score Matching
. Each data point is a high-dimensional vector X of covariates; project it to one dimension
. The propensity score is the probability of a subject receiving treatment, given all the covariates we want to control for
Propensity Score Matching
. Idea: achieve balance in the covariates by conditioning on the propensity score π(X). This works if:
  P(X = x | π(X) = p, T = 1) = P(X = x | π(X) = p, T = 0)
. Outcome variable: T; independent variables (covariates): X
. Logistic regression: take the predicted probability T̂ as the propensity score
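A minimal sketch of this step, fitting the logistic model by plain gradient ascent on synthetic data rather than with a statistics library (the single covariate and its coefficient are made up):

```python
import math, random

def fit_propensity(X, T, steps=3000, lr=0.1):
    """Fit P(T=1|x) by logistic regression (plain gradient ascent);
    the returned function is the estimated propensity score."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for x, t in zip(X, T):
            z = sum(wj * xj for wj, xj in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))
            for j in range(d):
                gw[j] += (t - p) * x[j]    # gradient of the log-likelihood
            gb += t - p
        w = [wj + lr * g / n for wj, g in zip(w, gw)]
        b += lr * gb / n
    return lambda x: 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

rng = random.Random(0)
# Synthetic data: treatment gets more likely as the (single) covariate grows.
X = [[rng.gauss(0, 1)] for _ in range(200)]
T = [1 if rng.random() < 1 / (1 + math.exp(-2 * x[0])) else 0 for x in X]
score = fit_propensity(X, T)
# score([1.0]) > score([-1.0]): a higher covariate value means a higher propensity
```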
Logistic Regression
. Logistic regression models the log-odds as a linear combination of the independent variables:
  ln( P(Y=1) / (1 − P(Y=1)) ) = β0 + β1·X + ε
Lever et al., Nature, 13(7), 2016 https://www.nature.com/articles/nmeth.3904
Propensity Score
. Logistic regression model:
  ln( P(Y=1) / (1 − P(Y=1)) ) = β0 + β1·X + ε
. For each observation we observe the covariates and predict the log-odds of being treated
. We learn which covariates increase/decrease the chance of being treated
. Match observations that have the same probability of being treated, where one was treated and the other was not
Post-Analysis
. Did propensity score matching work? Check that the treated and control groups overlap well with respect to the propensity score
Analyse Matched Data
. Same as before:
. Compute a test statistic for the matched data pairs
  . OutcomeTreated − OutcomeControl
. Null hypothesis (no treatment effect)
  . E[OutcomeTreated] − E[OutcomeControl] = 0
. Or regression analysis
  . Outcome ~ b0 + b1*X + b2*treated + error
Summary
. Matching == pruning
. Many design choices:
  . How to match? Greedy or optimal nearest-neighbour matching
  . 1:1 or 1:k matching
  . Caliper: remove bad matches; a smaller caliper gives lower bias but more variance
Solutions
. Matching Methods
. Regression Discontinuity
. Instrumental Variables
Natural Experiment
. Like in experiments, but…
  . the assignment is only "as-if-random"
  . the researcher does not control the intervention or treatment
. What is the impact of receiving a scholarship on future performance?
(Figure: students below the test-score threshold receive no scholarship, students at or above it do; scores range from min score to max score)
Thad Dunning, Natural Experiments in the Social Sciences. A Design-Based Approach, p.126, 2012.
Regression Discontinuity (RD)
. Regress future performance on the test score and a dummy variable (the coarsened test score): received a scholarship or not
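One common way to estimate the discontinuity (a sketch with made-up scores, not from the lecture): fit a separate line on each side of the threshold and take the difference of the two predictions at the threshold as the jump:

```python
def ols_line(pts):
    """Least-squares line y = a + b*x through the given (x, y) points."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    return my - slope * mx, slope

def rd_jump(data, cutoff):
    """Regression-discontinuity estimate: difference of the two one-sided
    regression lines, evaluated at the cutoff."""
    left = [(x, y) for x, y in data if x < cutoff]
    right = [(x, y) for x, y in data if x >= cutoff]
    (a_l, b_l), (a_r, b_r) = ols_line(left), ols_line(right)
    return (a_r + b_r * cutoff) - (a_l + b_l * cutoff)

# Made-up data: performance rises with the test score, plus a jump of 3
# for everyone at or above the scholarship threshold of 50.
data = [(s, 0.1 * s + (3 if s >= 50 else 0)) for s in range(20, 81)]
jump = rd_jump(data, cutoff=50)
# jump ≈ 3
```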
RD Robustness Checks
. Can individuals control whether they are above or below the threshold?
. Are there jumps at placebo points rather than only at the threshold? Do other variables jump at the threshold?
. Are the results robust to the use of different bin widths?
. Distinguish between a discontinuity and non-linearity:
  . Are the results robust to including higher-order polynomials?
Solutions
. Matching Methods
. Regression Discontinuity
. Instrumental Variables
Instrumental Variables
. M (military service) → E (lifetime earnings); P (earning potential, unobserved) affects both
. M and the error term are correlated, because "earning potential" is omitted
. Beta is biased and inconsistent!
Instrumental Variable
. L (the draft lottery, can be 0 or 1) is an instrument for the causal effect of M (military service) on E (lifetime earnings)
. An instrumental variable should be strongly correlated with the included endogenous regressor (L ↔ M), but must not affect the outcome variable directly (no direct L → E)
Angrist, Joshua D. (1990). "Lifetime Earnings and the Vietnam Draft Lottery: Evidence from Social Security Administrative Records". American Economic Review 80 (3)
Instrumental Variable
How to estimate the causal effect via instrumental variable L?
. "Wald estimator" (if L is binary):  ( E[E|L=1] − E[E|L=0] ) / ( E[M|L=1] − E[M|L=0] )
. Plausibility check: what if L were a perfect instrument?
. Example: winning the lottery ⇒ 90% chance of joining the military; 10% join without an invitation; earnings differ by $2,200 per year
  $2,200 / 0.8 = $2,750
. Angrist found that military service decreases earnings by about $2,741
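The plausibility check above, as code:

```python
def wald_estimator(outcome_diff, treatment_diff):
    """Wald / IV estimate for a binary instrument L:
    (E[E|L=1] - E[E|L=0]) / (E[M|L=1] - E[M|L=0])."""
    return outcome_diff / treatment_diff

# Winning the lottery raises the probability of serving from 0.1 to 0.9
# and is associated with a $2,200 drop in yearly earnings:
effect = wald_estimator(-2200, 0.9 - 0.1)
# effect == -2750.0: serving lowers earnings by about $2,750 per year
```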
Instrumental Variable
δ estimates the causal effect of M on E
Instrumental Variables
. Find instrumental variables: they are highly correlated with the independent variables but not with the outcome
. earn ~ b0 + b1*college + error; confounders: motivation and socio-economic status
. Potential instruments
  . Applying to an elite college correlates with motivation
  . Living area correlates with socio-economic status
. Two stages:
  college_hat ~ b0 + b1*collegeAp + b2*living + error
  earn ~ b0 + b1*college_hat
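The two-stage idea can be sketched on synthetic data (everything below is made up: the instrument L, the confounder P, and the true effect of M on E, which is set to 3):

```python
import random

def ols_slope(xs, ys):
    """Intercept and slope of the least-squares line ys ~ a + b*xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

rng = random.Random(0)
n = 10000
L = [rng.random() for _ in range(n)]                      # instrument
P = [rng.gauss(0, 1) for _ in range(n)]                   # unobserved confounder
M = [2 * l + p + rng.gauss(0, 0.1) for l, p in zip(L, P)] # treatment
E = [3 * m + 2 * p + rng.gauss(0, 0.1) for m, p in zip(M, P)]  # true effect of M on E is 3

_, naive = ols_slope(M, E)            # biased upward: picks up the omitted P
a1, b1 = ols_slope(L, M)              # stage 1: regress M on the instrument L
M_hat = [a1 + b1 * l for l in L]
_, iv = ols_slope(M_hat, E)           # stage 2: regress E on the predicted M
# naive lands clearly above 3, while iv recovers roughly 3
```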
Summary
• Instrumental variables and discontinuities are often hard to find, but powerful
• Matching methods help to approximate causality
  • But researchers have a lot of freedom when deciding how to match
  • Since we remove data, we need to specify for which group the causal effect holds
  • Most matching methods were developed for small numbers of covariates
  • Worst case: random pruning increases imbalance, which increases bias and model dependence
• Compare the results of different matching methods, different dimensionality-reduction methods, and different models
  • Avoid model dependence and method dependence!