Web Science & Technologies University of Koblenz ▪ Landau, Germany
Lecture
Data Science
Regression and Causal Inference
JProf. Dr. Claudia Wagner

Model Selection
Bias-Variance Tradeoff
Lever et al., Nature 13(9), 2016, https://www.nature.com/articles/nmeth.3968
WeST Claudia Wagner 2
Solutions
. Model Selection
  . Find a minimal model that produces low test prediction error
  . e.g. via cross-validated prediction error, adjusted R2, Mallows' Cp
  . Limits the number of parameters
. Regularization
  . Fit a model involving all p predictors
  . Shrink coefficients towards zero to reduce variance and produce low bias at the same time
  . Constrains the magnitude of the parameters
Cross-validated Prediction Error
. K-fold Cross-Validation
. Prediction error is measured on the held-out test fold
Src: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#/media/File:K-fold_cross_validation_EN.jpg
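The scheme in the figure can be sketched in plain Python (a toy sketch: `fit` and `predict` are placeholder callables, and the constant-mean "model" is made up for illustration):

```python
import random

def k_fold_cv(data, k, fit, predict):
    """Estimate the test prediction error by k-fold cross-validation."""
    data = list(data)
    random.Random(0).shuffle(data)          # fixed seed, for reproducibility
    folds = [data[i::k] for i in range(k)]  # k roughly equal folds
    fold_errors = []
    for i in range(k):
        test = folds[i]
        train = [pt for j, f in enumerate(folds) if j != i for pt in f]
        model = fit(train)
        fold_errors.append(sum((y - predict(model, x)) ** 2 for x, y in test) / len(test))
    return sum(fold_errors) / k             # average prediction error over the folds

# Toy usage: the "model" is just the training mean, predicting a constant.
data = [(x, 2.0) for x in range(20)]
cv_err = k_fold_cv(data, k=5,
                   fit=lambda train: sum(y for _, y in train) / len(train),
                   predict=lambda model, x: model)
# cv_err == 0.0 here, since every y equals the training mean
```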
Mean Squared (Prediction) Error
. MSE(θ̂) = E[(θ̂ − θ)²]
. If we have multiple models and want to select the best, we can also use the MSE
. MSE(θ̂) = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²]
. MSE(θ̂) = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)²
         = Variance + Bias²
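The decomposition can be checked numerically. In the sketch below, the shrunk estimator 0.5·(sample mean) is a made-up example chosen so that both bias and variance are non-zero:

```python
import random, statistics

rng = random.Random(42)
theta = 3.0                                  # true parameter (population mean)
estimates = []
for _ in range(20000):                       # many repeated samples
    sample = [rng.gauss(theta, 1.0) for _ in range(10)]
    estimates.append(0.5 * statistics.mean(sample))   # deliberately biased (shrunk) estimator

mse = statistics.mean((e - theta) ** 2 for e in estimates)
variance = statistics.pvariance(estimates)
bias = statistics.mean(estimates) - theta
# mse equals variance + bias**2 (up to floating-point rounding)
```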
Model Selection
. Choose among 3 models (polynomial degree 1, 2, or 3)
. Use the test error to select the best model: model 2 is the best model according to the test error
. But what do we report as the test error now?
. The test error should be computed on data that was used neither for training nor for choosing the best model
. Solution: split the labeled data into three parts
Lever et al., Nature 13(9), 2016, https://www.nature.com/articles/nmeth.3968
. Training Data (blue)
. Validation Data (green): d = 2 is chosen
. Test Data (brown): a test error of 1.3 is computed for d = 2
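The three-way split can be sketched as follows. The three fixed candidate functions stand in for models of degree 1–3 already fit on the training part (the fitting step is elided, and all data and models are made up):

```python
import random

rng = random.Random(1)
# Synthetic data from y = 2x + noise.
data = [(i / 10, 2 * (i / 10) + rng.gauss(0, 0.1)) for i in range(60)]
rng.shuffle(data)
train, val, test = data[:40], data[40:50], data[50:]   # three disjoint parts

def mse(model, pts):
    return sum((y - model(x)) ** 2 for x, y in pts) / len(pts)

# Three fixed candidates standing in for the fitted degree-1/2/3 polynomials:
candidates = {"m1": lambda x: x, "m2": lambda x: 2 * x, "m3": lambda x: x * x}

best = min(candidates, key=lambda name: mse(candidates[name], val))  # select on validation data
reported_error = mse(candidates[best], test)                         # report error on test data
```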
Mallows Cp (1973)
. Assume we have a full regression model with n independent variables (n parameters); choose the smallest adequate submodel
. Cp = SSEp / σ̂² − N + 2p, where
  . SSEp: sum of squared errors of the submodel with p parameters
  . p: number of parameters in the submodel
  . σ̂²: error-variance estimate of the full model with all n parameters
  . N: number of observations
. Submodels with Cp close to p are preferred
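The criterion is then a one-liner; the numbers plugged in below are hypothetical:

```python
def mallows_cp(sse_p, p, sigma2_full, n_obs):
    """Mallows' Cp for a submodel with p parameters.

    sse_p       : sum of squared errors of the submodel
    p           : number of parameters in the submodel
    sigma2_full : error-variance estimate from the full model
    n_obs       : number of observations
    A submodel whose Cp is close to p is considered good.
    """
    return sse_p / sigma2_full - n_obs + 2 * p

# Hypothetical numbers: full-model variance 2.0, 50 observations.
cp = mallows_cp(sse_p=104.0, p=3, sigma2_full=2.0, n_obs=50)
# cp == 104/2 - 50 + 6 == 8.0
```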
Solutions
. Model Selection
  . Find a minimal model that produces low test prediction error
  . e.g. via cross-validated prediction error, adjusted R2, Mallows' Cp
  . Limits the number of parameters
. Regularization
  . Fit a model involving all p predictors
  . Shrink coefficients towards zero to reduce variance and achieve low bias at the same time
  . Constrains the magnitude of the parameters
  . Some regularization (Lasso) can also be used to remove parameters
Regularization
. Estimate the coefficients by minimizing the least-squares error plus an Lp penalty:
  β̂ = argmin_β [ Σi (yi − β·xi)²  +  λ·Lp(β) ]
      (LSE term)        (penalty term)
. λ ≥ 0 is a tuning parameter, which controls the strength of the penalty term.
Lever et al, Nature 13(10), 2016 https://www.nature.com/articles/nmeth.4014
Example
. Why do we need regularization?
Y = b1*X1 + b2*X2 + b3*X3 + error
In reality the underlying model is Y = 5 * X3
. Ideal estimate for parameter vector: b=(0, 0, 5)
. If by chance X1 and X2 are perfectly correlated (e.g. X1 = 2*X2), then many other solutions give an equally good fit:
  . e.g. b = (50, -100, 5)
Lever et al, Nature 13(10), 2016 https://www.nature.com/articles/nmeth.4014
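A quick numerical check of this example (a sketch with made-up data, using plain gradient descent rather than a library solver): with X1 = 2·X2 the least-squares fit is not unique, but adding an L2 penalty pins down one small-norm solution:

```python
import random

rng = random.Random(0)
# Data from Y = 5*X3 with X1 = 2*X2, so X1 and X2 are perfectly collinear.
X, Y = [], []
for _ in range(100):
    x2, x3 = rng.gauss(0, 1), rng.gauss(0, 1)
    X.append((2 * x2, x2, x3))
    Y.append(5 * x3)

def ridge_gd(X, Y, lam, steps=5000, lr=0.01):
    """Minimize (1/n)*sum (y - b.x)^2 + lam*||b||_2^2 by gradient descent."""
    n, b = len(X), [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [2 * lam * bj for bj in b]                 # penalty gradient
        for x, y in zip(X, Y):
            r = sum(bj * xj for bj, xj in zip(b, x)) - y  # residual
            for j in range(3):
                grad[j] += 2 * r * x[j] / n
        b = [bj - lr * g for bj, g in zip(b, grad)]
    return b

b1, b2, b3 = ridge_gd(X, Y, lam=0.1)
# b3 lands close to 5 (shrunk slightly by the penalty); b1 and b2 stay near 0,
# even though without the penalty any b with 2*b1 + b2 = const fits equally well.
```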
Regularization
. The Lp regularizer is defined as the parameter vector norm:
  Lp(θ) = ‖θ‖p = (Σi |θi|^p)^(1/p)
. Surfaces on which the norm takes on constant values: a diamond for L1 (Lasso), a circle for L2 (Ridge)
  . On the L1 unit surface, θ0 can only be 1 or −1 if θ1 is 0
. Find a balance between 2 loss functions: the LSE of the data and the L2 (Ridge) regularization term
Src: Prof. Alexander Ihler, University of California, Irvine https://www.youtube.com/watch?v=sO4ZirJh9ds
L1 and L2 Norm
L1 (Lasso) tends to generate sparser solutions than L2 (Ridge)
Src: Prof. Alexander Ihler, University of California, Irvine
. Ridge (L2)
  . Provides a unique solution even if multiple variables are correlated
  . But it does not reduce the number of parameters
. Lasso (L1) has several important weaknesses:
  . It does not guarantee a unique solution
  . It is not robust to collinearity
. In practice the elastic net (L1+L2) is used
  . It combines Ridge and Lasso
  . Two lambda parameters need to be picked
Other Problems I
. Outliers: extreme values can have a big impact on the fitted line, since we minimize the squared error
. Inclusion/exclusion of extreme values often changes the results significantly
. Solutions:
  . Use a log transform to reduce the impact of large values
  . Bootstrapping or cross-validation
Other Problems II
. Omitted Variable Bias: occurs if the omitted variable is a determinant of the DV (outcome) AND the omitted variable is correlated with the included IVs
. The effect of the omitted variable (confounder) is then attributed to another variable which is included in the model
. Heteroskedastic errors can indicate that related variables have been omitted
Example
. Do students from elite colleges earn more later in life?
. earn ~ b0 + b1*college + error
(Diagram: College → Salary, over Time)
Problem with Regressions
. Being accepted to an elite college correlates with motivation and socio-economic status
. These factors also correlate with salary
(Diagram: Motivation and socio-economic status → College and Salary, over Time)
Causal Effect
. Did going to an elite college impact your future earnings?
. If you had chosen a normal college, would you have earned less?
. Causal effect for individual i: Yi(T) − Yi(C)
(Diagram: wealth over time, with college as the intervention)
Experiment
. The gold-standard method for causal inference!
. Assignment of subjects to treatment or control group is random, and the groups are large enough to wash out differences in other covariates (ignorability assumption):
  (Y0, Y1) ⟂ Treatment Assignment
. Manipulation of the treatment is under the control of the researcher
. Different levels of treatment should lead to different levels of effect
Observational Data
. Randomization in experiments assures that: T ⟂ O
. In observational studies we need to control for the covariates C that affect both the outcome O and the treatment assignment T: T ⟂ O | C
(Diagram: C → T, C → O, T → O)
METHODS
Solutions
. Matching Methods
. Instrumental Variables
. Regression Discontinuity
Matching Methods
. Idea: find a subset of the observational data that looks as if it was generated by an experiment
. Balance the pre-treatment covariates X so that: T ⟂ O | X
. Matching == pruning
Does Special Training Help Job Promotion?
. Treatment: special training
. Outcome: position
. Covariate: education (in years)
Ho, Daniel, Kosuke Imai, Gary King, and Elizabeth Stuart. 2007. "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference." Political Analysis 15: 199–236. http://j.mp/jPupwz
Example: Greedy Distance Matching
. Are people who take special training more successful (i.e. reach a higher position)? Pre-treatment education is our confounder.

  Treated | Edu | Pos      Control | Edu | Pos
  1       | 4   | 8        0       | 10  | 5
  1       | 7   | 7        0       | 6   | 5
  1       | 10  | 9        0       | 1   | 9

. Greedy matches on education: 4↔6 (8−5=3), 7↔10 (7−5=2), 10↔1 (9−9=0) → ATE = (3+2+0)/3 ≈ 1.67
. Introduce a caliper = maximal acceptable distance; throw bad matches away
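The slide's example can be reproduced with a small greedy matcher (a sketch; the caliper argument mirrors the idea above):

```python
def greedy_match(treated, control, caliper=None):
    """Greedy 1:1 matching on a single covariate.

    treated, control: lists of (covariate, outcome) pairs.
    Returns the outcome differences (treated - matched control)."""
    available = list(control)
    diffs = []
    for cov_t, out_t in treated:
        best = min(available, key=lambda c: abs(c[0] - cov_t))
        if caliper is not None and abs(best[0] - cov_t) > caliper:
            continue                       # throw bad matches away
        available.remove(best)             # each control is used at most once
        diffs.append(out_t - best[1])
    return diffs

# The slide's numbers: covariate = education, outcome = position.
treated = [(4, 8), (7, 7), (10, 9)]
control = [(10, 5), (6, 5), (1, 9)]
diffs = greedy_match(treated, control)     # [3, 2, 0]: matches 4-6, 7-10, 10-1
ate = sum(diffs) / len(diffs)              # (3 + 2 + 0) / 3
```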
Analyse Matched Data
. Compute a test statistic for the matched data pairs
  . OutcomeTreated − OutcomeControl
. Null hypothesis (no treatment effect)
  . E[OutcomeTreated] − E[OutcomeControl] = 0
. Is our observed test statistic surprising?
  . Assume the statistic follows a normal distribution under H0, OR
  . Randomly permute the treatment assignments (shuffle the treatment labels) and compute the distribution under H0 empirically
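For matched pairs, shuffling the treatment labels within a pair just flips the sign of that pair's difference, so the empirical null distribution can be built from random sign flips. A sketch with made-up pair differences:

```python
import random

def permutation_test(pair_diffs, n_perm=10000, seed=0):
    """Two-sided permutation p-value for matched pairs.

    Under H0 swapping the labels inside a pair flips the sign of its
    difference, so random sign flips generate the null distribution
    of the mean difference."""
    rng = random.Random(seed)
    observed = abs(sum(pair_diffs) / len(pair_diffs))
    hits = 0
    for _ in range(n_perm):
        permuted = [d * rng.choice((1, -1)) for d in pair_diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            hits += 1
    return hits / n_perm

p_value = permutation_test([3, 2, 0, 4, 1, 2, 3, 1])   # made-up pair differences
# the exact p-value here is 2 * (1/2)**7 = 1/64 ≈ 0.016
```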
Regression After Matching
pos = b0 + b1*edu + b2*is_treated
1) Preprocessing (matching)
2) Estimation of effects (regression models)
(Plot: position (p) vs. education (e), before and after matching)
Matching can help to reduce model dependence!
Why do we need to match?
pos = b0 + b1*edu + b2*is_treated
. is_treated is a binary variable; b2 is the estimated treatment effect
(Plot: position (p) vs. education (e))
Correcting for education, the treated group has higher positions.
Quadratic Regression
pos = b0 + b1*edu + b2*edu² + γ*is_treated
. Model dependence: the linear and the quadratic model yield different effect estimates
. Reason: imbalance of the covariates
(Plot: position (p) vs. education (e))
Assess Quality of Matches
. Was the matching successful? Are the covariates balanced between the 2 groups?
. Standardized mean difference (smd) = mean difference between the groups for each covariate, in units of standard deviation
. smd < 0.1 is good; smd > 0.2 indicates imbalance
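A sketch of the balance check (the education values below are made up; pooling the two group variances is one common convention):

```python
import statistics

def smd(treated_vals, control_vals):
    """Standardized mean difference: |mean gap| in units of pooled std. dev."""
    gap = statistics.mean(treated_vals) - statistics.mean(control_vals)
    pooled_var = (statistics.variance(treated_vals)
                  + statistics.variance(control_vals)) / 2
    return abs(gap) / pooled_var ** 0.5

# Hypothetical education values for treated vs. control:
balanced = smd([4, 7, 10, 6], [5, 7, 9, 6])      # equal means -> smd = 0, well balanced
imbalanced = smd([4, 7, 10, 6], [1, 3, 2, 4])    # clearly above the 0.2 threshold
```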
Distance Matching
. Approximates a fully blocked experiment: we balance the known covariates
. One can match on multiple covariates at the same time
. But if we have several hundred covariates, finding matches becomes difficult
Gary King, "Why Propensity Scores Should Not Be Used for Matching“, Methods Colloquium, 2015
Propensity Score Matching
. Each data point is a high-dimensional vector X of covariates; project it to one dimension
. The propensity score is the probability of a subject receiving treatment, given all the covariates we want to control for
Propensity Score Matching
. Idea: achieve balance in the covariates by conditioning on the propensity score π(X). This works if:
  P(X = x | π(X) = p, T = 1) = P(X = x | π(X) = p, T = 0)
. Outcome variable: T; independent variables (covariates): X
. Logistic regression: take the predicted probability T̂ as the propensity score
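A minimal sketch of this step, fitting the logistic model by plain gradient ascent on synthetic data rather than with a statistics library (the single covariate and its coefficient are made up):

```python
import math, random

def fit_propensity(X, T, steps=3000, lr=0.1):
    """Fit P(T=1|x) by logistic regression (plain gradient ascent);
    the returned function is the estimated propensity score."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for x, t in zip(X, T):
            z = sum(wj * xj for wj, xj in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))
            for j in range(d):
                gw[j] += (t - p) * x[j]    # gradient of the log-likelihood
            gb += t - p
        w = [wj + lr * g / n for wj, g in zip(w, gw)]
        b += lr * gb / n
    return lambda x: 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

rng = random.Random(0)
# Synthetic data: treatment gets more likely as the (single) covariate grows.
X = [[rng.gauss(0, 1)] for _ in range(200)]
T = [1 if rng.random() < 1 / (1 + math.exp(-2 * x[0])) else 0 for x in X]
score = fit_propensity(X, T)
# score([1.0]) > score([-1.0]): a higher covariate value means a higher propensity
```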
Logistic Regression
. Logistic regression models the log-odds as a linear combination of the independent variables:
  ln( P(Y=1) / (1 − P(Y=1)) ) = β0 + β1·X + ε
Lever et al., Nature, 13(7), 2016 https://www.nature.com/articles/nmeth.3904
Propensity Score
. Logistic regression model:
  ln( P(Y=1) / (1 − P(Y=1)) ) = β0 + β1·X + ε
. For each observation we observe the covariates and predict the log-odds of being treated
. We learn which covariates increase/decrease the chance of being treated
. Match observations that have the same probability of being treated, where one was treated and the other was not
Post-Analysis
. Did propensity score matching work? Check that the treated and control groups overlap well with respect to the propensity score
Analyse Matched Data
. Same as before:
. Compute a test statistic for the matched data pairs
  . OutcomeTreated − OutcomeControl
. Null hypothesis (no treatment effect)
  . E[OutcomeTreated] − E[OutcomeControl] = 0
. Or regression analysis
  . Outcome ~ b0 + b1*X + b2*treated + error
Summary
. Matching == pruning
. Many design choices:
  . How to match? Greedy or optimal nearest-neighbour matching
  . 1:1 or 1:k matching
  . Caliper: remove bad matches; a smaller caliper gives lower bias but more variance
Solutions
. Matching Methods
. Regression Discontinuity
. Instrumental Variables
Natural Experiment
. Like in experiments, but…
  . the assignment is only "as-if-random"
  . the researcher does not control the intervention or treatment
. What is the impact of receiving a scholarship on future performance?
(Figure: students below the test-score threshold receive no scholarship, students at or above it do; scores range from min score to max score)
Thad Dunning, Natural Experiments in the Social Sciences. A Design-Based Approach, p.126, 2012.
Regression Discontinuity (RD)
. Regress future performance on the test score and a dummy variable (the coarsened test score): received a scholarship or not
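One common way to estimate the discontinuity (a sketch with made-up scores, not from the lecture): fit a separate line on each side of the threshold and take the difference of the two predictions at the threshold as the jump:

```python
def ols_line(pts):
    """Least-squares line y = a + b*x through the given (x, y) points."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    return my - slope * mx, slope

def rd_jump(data, cutoff):
    """Regression-discontinuity estimate: difference of the two one-sided
    regression lines, evaluated at the cutoff."""
    left = [(x, y) for x, y in data if x < cutoff]
    right = [(x, y) for x, y in data if x >= cutoff]
    (a_l, b_l), (a_r, b_r) = ols_line(left), ols_line(right)
    return (a_r + b_r * cutoff) - (a_l + b_l * cutoff)

# Made-up data: performance rises with the test score, plus a jump of 3
# for everyone at or above the scholarship threshold of 50.
data = [(s, 0.1 * s + (3 if s >= 50 else 0)) for s in range(20, 81)]
jump = rd_jump(data, cutoff=50)
# jump ≈ 3
```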
RD Robustness Checks
. Can individuals control whether they are above or below the threshold?
. Are there jumps at placebo points rather than only at the threshold? Do other variables jump at the threshold?
. Are the results robust to the use of different bin widths?
. Distinguish between a discontinuity and non-linearity:
  . Are the results robust to including higher-order polynomials?
Solutions
. Matching Methods
. Regression Discontinuity
. Instrumental Variables
Instrumental Variables
. M (military service) → E (lifetime earnings); P (earning potential, unobserved) affects both
. M and the error term are correlated, because "earning potential" is omitted
. Beta is biased and inconsistent!
Instrumental Variable
. L (the draft lottery, can be 0 or 1) is an instrument for the causal effect of M (military service) on E (lifetime earnings)
. An instrumental variable should be strongly correlated with the included endogenous regressor (L ↔ M), but must not affect the outcome variable directly (no direct L → E)
Angrist, Joshua D. (1990). "Lifetime Earnings and the Vietnam Draft Lottery: Evidence from Social Security Administrative Records". American Economic Review 80 (3)
Instrumental Variable
How to estimate the causal effect via instrumental variable L?
. "Wald estimator" (if L is binary):  ( E[E|L=1] − E[E|L=0] ) / ( E[M|L=1] − E[M|L=0] )
. Plausibility check: what if L were a perfect instrument?
. Example: winning the lottery ⇒ 90% chance of joining the military; 10% join without an invitation; earnings differ by $2,200 per year
  $2,200 / 0.8 = $2,750
. Angrist found that military service decreases earnings by about $2,741
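The plausibility check above, as code:

```python
def wald_estimator(outcome_diff, treatment_diff):
    """Wald / IV estimate for a binary instrument L:
    (E[E|L=1] - E[E|L=0]) / (E[M|L=1] - E[M|L=0])."""
    return outcome_diff / treatment_diff

# Winning the lottery raises the probability of serving from 0.1 to 0.9
# and is associated with a $2,200 drop in yearly earnings:
effect = wald_estimator(-2200, 0.9 - 0.1)
# effect == -2750.0: serving lowers earnings by about $2,750 per year
```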
Instrumental Variable
δ estimates the causal effect of M on E
Instrumental Variables
. Find instrumental variables: they are highly correlated with the independent variables but not with the outcome
. earn ~ b0 + b1*college + error; confounders: motivation and socio-economic status
. Potential instruments
  . Applying to an elite college correlates with motivation
  . Living area correlates with socio-economic status
. Two stages:
  college_hat ~ b0 + b1*collegeAp + b2*living + error
  earn ~ b0 + b1*college_hat
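The two-stage idea can be sketched on synthetic data (everything below is made up: the instrument L, the confounder P, and the true effect of M on E, which is set to 3):

```python
import random

def ols_slope(xs, ys):
    """Intercept and slope of the least-squares line ys ~ a + b*xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

rng = random.Random(0)
n = 10000
L = [rng.random() for _ in range(n)]                      # instrument
P = [rng.gauss(0, 1) for _ in range(n)]                   # unobserved confounder
M = [2 * l + p + rng.gauss(0, 0.1) for l, p in zip(L, P)] # treatment
E = [3 * m + 2 * p + rng.gauss(0, 0.1) for m, p in zip(M, P)]  # true effect of M on E is 3

_, naive = ols_slope(M, E)            # biased upward: picks up the omitted P
a1, b1 = ols_slope(L, M)              # stage 1: regress M on the instrument L
M_hat = [a1 + b1 * l for l in L]
_, iv = ols_slope(M_hat, E)           # stage 2: regress E on the predicted M
# naive lands clearly above 3, while iv recovers roughly 3
```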
Summary
• Instrumental variables and discontinuities are often hard to find, but powerful
• Matching methods help to approximate causality
  • But researchers have a lot of freedom when deciding how to match
  • Since we remove data, we need to specify for which group the causal effect holds
  • Most matching methods were developed for small numbers of covariates
  • Worst case: random pruning increases imbalance, which increases bias and model dependence
• Compare the results of different matching methods, different dimensionality-reduction methods, and different models
  • Avoid model dependence and method dependence!