
Shrinkage Methods for Linear Models

Bin Li

IIT Lecture Series

Outline

- Linear regression models and least squares
- Subset selection
- Shrinkage methods
  - Ridge regression
  - The lasso
  - Best subset, ridge and the lasso
  - Elastic net
- Shrinkage methods for the classification problem

Linear Regression & Least Squares

- Input vector: X = (X_1, X_2, ..., X_p)^T
- Real-valued output vector: y = (y_1, y_2, ..., y_n)^T
- The linear regression model has the form

  f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j

- The least squares estimate minimizes

  \mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 = (y - X\beta)^T (y - X\beta)

Least Squares Estimate

- Differentiate RSS(β) with respect to β:

  \frac{\partial \mathrm{RSS}}{\partial \beta} = -2 X^T (y - X\beta) \qquad (1)

  \frac{\partial^2 \mathrm{RSS}}{\partial \beta \, \partial \beta^T} = 2 X^T X \qquad (2)

- Assuming X has full column rank, (2) is positive definite.
- Set the first derivative to zero:

  X^T (y - X\beta) = 0 \;\Longrightarrow\; \hat\beta^{ols} = (X^T X)^{-1} X^T y

- Variance of β̂^ols: Var(β̂^ols) = (X^T X)^{-1} σ².
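Below is a minimal R sketch of the closed-form least squares estimate and its variance, checked against lm(); the simulated data and all object names are illustrative assumptions, not part of the slides.

    ## Closed-form OLS on simulated data, compared with lm()
    set.seed(1)
    n <- 50; p <- 3
    X <- matrix(rnorm(n * p), n, p)
    y <- 1 + X %*% c(2, -1, 0.5) + rnorm(n)

    Xd <- cbind(1, X)                             # design matrix with intercept
    beta_ols <- solve(t(Xd) %*% Xd, t(Xd) %*% y)  # (X'X)^{-1} X'y
    sigma2   <- sum((y - Xd %*% beta_ols)^2) / (n - p - 1)
    var_beta <- sigma2 * solve(t(Xd) %*% Xd)      # estimate of (X'X)^{-1} sigma^2

    coef(lm(y ~ X))   # should match beta_ols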

Gauss-Markov Theorem

- Suppose we have: Y_i = \sum_{j=1}^{p} X_{ij} \beta_j + \epsilon_i
- We need the following assumptions:
  - E(ε_i) = 0 for all i
  - Var(ε_i) = σ² for all i
  - Cov(ε_i, ε_j) = 0 for all i ≠ j
- The Gauss-Markov theorem says that the ordinary least squares (OLS) estimator is the best linear unbiased estimator (BLUE) of β:
  - E(β̂^ols) = β
  - Among all linear unbiased estimators β̂, the OLS estimator minimizes

  \sum_{j=1}^{p} E[(\hat\beta_j - \beta_j)^2] = \sum_{j=1}^{p} \mathrm{Var}(\hat\beta_j)

Body Fat Example

This is a study of the relation of the amount of body fat (Y) to several possible X variables, based on a sample of 20 healthy females 25-34 years old. There are three X variables: triceps skinfold thickness (X1), thigh circumference (X2), and midarm circumference (X3).

[Scatter plot matrix of the three predictors X1, X2 and X3.]

Body Fat Example (cont'd)

    fit <- lm(Y ~ ., data = bfat)
    summary(fit)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  117.085     99.782   1.173    0.258
    X1             4.334      3.016   1.437    0.170
    X2            -2.857      2.582  -1.106    0.285
    X3            -2.186      1.595  -1.370    0.190

    Residual standard error: 2.48 on 16 degrees of freedom
    Multiple R-squared: 0.8014,  Adjusted R-squared: 0.7641
    F-statistic: 21.52 on 3 and 16 DF,  p-value: 7.343e-06

Remarks

- Estimation:
  - OLS is unbiased, but when some input variables are highly correlated, the β̂'s may be highly variable.
  - To see whether a linear model f̂(x) = x^T β̂ is a good candidate, we can ask ourselves two questions:
    1. Is β̂ close to the true β?
    2. Will f̂(x) fit future observations well?
  - Bias-variance trade-off.
- Interpretation:
  - Determine a smaller subset of variables that exhibit the strongest effects.
  - Obtain a "big picture" by sacrificing some small details.

Is β̂ close to the true β?

- To answer this question, we might consider the mean squared error (MSE) of our estimate β̂.
- Expected squared distance of β̂ to the true β:

  \begin{aligned}
  \mathrm{MSE}(\hat\beta) &= E[\|\hat\beta - \beta\|^2] = E[(\hat\beta - \beta)^T(\hat\beta - \beta)] \\
  &= E[(\hat\beta - E\hat\beta + E\hat\beta - \beta)^T(\hat\beta - E\hat\beta + E\hat\beta - \beta)] \\
  &= E[\|\hat\beta - E\hat\beta\|^2] + 2E[(\hat\beta - E\hat\beta)^T(E\hat\beta - \beta)] + \|E\hat\beta - \beta\|^2 \\
  &= E[\|\hat\beta - E\hat\beta\|^2] + \|E\hat\beta - \beta\|^2 \\
  &= \mathrm{tr}(\text{Variance}) + \|\text{Bias}\|^2
  \end{aligned}

- For the OLS estimate, we have

  \mathrm{MSE}(\hat\beta) = \sigma^2 \, \mathrm{tr}[(X^T X)^{-1}]
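The decomposition can be checked numerically. The following hedged R sketch compares a Monte Carlo estimate of MSE(β̂) for OLS with σ² tr[(X^T X)^{-1}]; the simulated design and all object names are illustrative assumptions.

    ## Monte Carlo check of MSE(beta_hat) = sigma^2 * tr[(X'X)^{-1}] for OLS
    set.seed(2)
    n <- 30; p <- 3; sigma <- 2
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(1, -2, 0.5)

    est <- replicate(5000, {
      y <- X %*% beta + sigma * rnorm(n)
      drop(solve(t(X) %*% X, t(X) %*% y))    # OLS estimate (no intercept here)
    })
    mean(colSums((est - beta)^2))            # Monte Carlo MSE
    sigma^2 * sum(diag(solve(t(X) %*% X)))   # theoretical value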

Will f̂(x) fit future observations well?

- Just because f̂(x) fits our data well doesn't mean it will fit a new dataset well.
- Assume y = f(x) + ε, where E(ε) = 0 and Var(ε) = σ².
- Denote by D the training data used to fit the model f̂(x, D).
- The expected prediction error (squared loss) at an input point x = x^new can be decomposed as

  \begin{aligned}
  \mathrm{Err}(x^{new}) &= E_D\big[(Y - \hat f(X, D))^2 \mid X = x^{new}\big] \\
  &= E\big[(Y - f(X))^2 \mid X = x^{new}\big] \\
  &\quad + \big(f(x^{new}) - E_D \hat f(x^{new}, D)\big)^2
         + E_D\big[(\hat f(x^{new}, D) - E_D \hat f(x^{new}, D))^2\big] \\
  &= \sigma^2 + \mathrm{Bias}^2\{\hat f(x^{new})\} + \mathrm{Var}\{\hat f(x^{new})\} \\
  &= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}
  \end{aligned}

Subset Selection

- Best-subset selection
  - An efficient algorithm: the leaps and bounds procedure of Furnival and Wilson (1974).
  - Feasible for p as large as 30 to 40.
- Forward- and backward-stepwise selection
  - Forward-stepwise selection is a greedy algorithm producing a nested sequence of models.
  - Backward-stepwise selection sequentially deletes the predictor with the least impact on the fit.
- Two-way stepwise-selection strategies
  - Consider both forward and backward moves at each step.
  - The step function in R uses the AIC criterion (see the sketch below).
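As a hedged illustration of these procedures in R, the sketch below runs two-way stepwise selection with step() and best-subset selection with the leaps package (which implements the leaps-and-bounds algorithm); the data frame name bfat is carried over from the body fat example and is an assumption.

    ## Two-way stepwise selection using AIC (base R)
    full <- lm(Y ~ ., data = bfat)
    both <- step(full, direction = "both")

    ## Best-subset selection via leaps and bounds
    ## (the leaps package is assumed to be installed)
    library(leaps)
    best <- regsubsets(Y ~ ., data = bfat, nvmax = 3)
    summary(best)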

Ridge Regression: A Shrinkage Approach

- Shrink the coefficients by imposing a constraint:

  \hat\beta^{R} = \arg\min_{\beta} \mathrm{RSS}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le C

- Equivalent Lagrangian form:

  \hat\beta^{R} = \arg\min_{\beta} \Big\{ \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}, \qquad \lambda \ge 0

- Solution of ridge regression:

  \hat\beta^{R} = (X^T X + \lambda I)^{-1} X^T y
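A minimal R sketch of the closed-form ridge solution on centered, standardized simulated data; the value λ = 1 and all object names are illustrative assumptions, and MASS::lm.ridge is mentioned only as one existing implementation of this kind of estimate.

    ## Ridge solution (X'X + lambda I)^{-1} X'y on simulated data
    set.seed(3)
    n <- 50; p <- 3; lambda <- 1
    X <- scale(matrix(rnorm(n * p), n, p))   # centered and scaled predictors
    y <- X %*% c(2, -1, 0) + rnorm(n)
    y <- y - mean(y)                         # centered response, so no intercept

    beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
    beta_ridge

    ## For comparison (different internal scaling, so values need not match exactly):
    ## library(MASS); lm.ridge(y ~ X - 1, lambda = lambda)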

Ridge Regression (cont'd)

- Apply the correlation transformation to X and Y.
- The estimated standardized regression coefficients in ridge regression are:

  \hat\beta^{R} = (r_{XX} + \lambda I)^{-1} r_{YX}

  - r_XX is the correlation matrix of the X's (i.e. X^T X computed on the transformed X).
  - r_YX is the vector of correlations between Y and the X's (i.e. X^T y computed on the transformed X and y).
- λ is the tuning parameter in ridge regression:
  - λ reflects the amount of bias in the estimators: λ ↑ ⇒ bias ↑.
  - λ stabilizes the regression coefficients: λ ↑ ⇒ VIF ↓.
- Three ways to choose λ:
  - Based on the ridge trace of β̂^R and the VIF values.
  - Based on external validation or cross-validation.
  - Based on generalized cross-validation.

Correlation Transformation and Variance Inflation Factor

- The correlation transformation standardizes the variables:

  Y_i^{*} = \frac{1}{\sqrt{n-1}} \left( \frac{Y_i - \bar{Y}}{s_Y} \right)
  \qquad \text{and} \qquad
  X_{ij}^{*} = \frac{1}{\sqrt{n-1}} \left( \frac{X_{ij} - \bar{X}_j}{s_{X_j}} \right)

- For OLS, VIF_j for the j-th variable is defined as the j-th diagonal element of the matrix r_{XX}^{-1}.
  - It equals 1/(1 − R_j²), where R_j² is the R-squared when X_j is regressed on the remaining p − 1 variables in the model.
  - Hence, when VIF_j is 1, X_j is not linearly related to the other X's; otherwise VIF_j is greater than 1.
- In ridge regression, the VIF values are the diagonal elements of the following matrix (see the sketch below):

  (r_{XX} + \lambda I)^{-1} \, r_{XX} \, (r_{XX} + \lambda I)^{-1}
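A hedged R sketch of these quantities for the body fat data; the data frame bfat and its column names Y, X1, X2, X3 are assumptions carried over from the earlier slide.

    ## Standardized ridge coefficients and ridge VIFs for a given lambda
    X <- as.matrix(bfat[, c("X1", "X2", "X3")])
    y <- bfat$Y
    rXX <- cor(X)        # X'X on the correlation-transformed X
    rYX <- cor(X, y)     # X'y on the correlation-transformed X and y

    ridge_vif <- function(lambda) {
      M <- solve(rXX + lambda * diag(ncol(X)))
      list(beta = M %*% rYX,                # standardized ridge coefficients
           vif  = diag(M %*% rXX %*% M))    # ridge VIF values
    }
    ridge_vif(0.02)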

Body Fat Example (cont'd)

- When λ is around 0.02, the VIF_j are all close to 1.
- When λ reaches 0.02, the β̂^R stabilize.
- Note: the horizontal axis is on a log scale, so it is not equally spaced.

[Left: ridge trace of the standardized coefficients (beta1, beta2, beta3) against λ. Right: VIF values (VIF1, VIF2, VIF3) against λ.]

Body Fat Example (cont'd)

[Boxplots comparing the sampling variability of the OLS and ridge estimates of the three coefficients.]

Var{β̂_j} for OLS (ridge): 13.9 (0.016), 11.1 (0.015), 2.11 (0.019)
Average MSE for OLS and ridge: 0.0204 and 0.0155

Bias and Variance of the Ridge Estimate

\begin{aligned}
E\hat\beta^{R} &= E\big[(X^T X + \lambda I)^{-1} X^T (X\beta + \epsilon)\big] \\
&= (X^T X + \lambda I)^{-1} X^T X \beta \\
&= \beta - \lambda (X^T X + \lambda I)^{-1} \beta
\end{aligned}

Hence, the ridge estimate β̂^R is biased. The bias is

E\hat\beta^{R} - \beta = -\lambda (X^T X + \lambda I)^{-1} \beta

and the variance is

\begin{aligned}
\mathrm{Var}(\hat\beta^{R}) &= \mathrm{Var}\big[(X^T X + \lambda I)^{-1} X^T (X\beta + \epsilon)\big] \\
&= \mathrm{Var}\big[(X^T X + \lambda I)^{-1} X^T \epsilon\big] \\
&= (X^T X + \lambda I)^{-1} X^T X \, (X^T X + \lambda I)^{-1} \sigma^2
\end{aligned}

Bias-Variance Tradeoff

Figure from Hoerl and Kennard, 1970.

Data Augmentation to Solve Ridge

\begin{aligned}
\hat\beta^{R} &= \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \sum_{j=1}^{p} (0 - \sqrt{\lambda}\,\beta_j)^2 \Big\} \\
&= \arg\min_{\beta} \sum_{i=1}^{n+p} (y_{\lambda,i} - z_{\lambda,i}^T\beta)^2 = \arg\min_{\beta} \, (y_\lambda - z_\lambda\beta)^T (y_\lambda - z_\lambda\beta),
\end{aligned}

where

z_\lambda = \begin{pmatrix} X \\ \sqrt{\lambda}\, I_p \end{pmatrix}
\qquad \text{and} \qquad
y_\lambda = \begin{pmatrix} y \\ 0 \end{pmatrix}
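The following hedged R sketch verifies the data augmentation identity numerically on simulated data; λ = 2 and the object names are illustrative assumptions.

    ## Ridge via augmented least squares vs. the closed-form solution
    set.seed(4)
    n <- 40; p <- 3; lambda <- 2
    X <- scale(matrix(rnorm(n * p), n, p))
    y <- X %*% c(1, 2, -1) + rnorm(n)
    y <- y - mean(y)

    z_aug <- rbind(X, sqrt(lambda) * diag(p))   # augmented design matrix
    y_aug <- c(y, rep(0, p))                    # augmented response

    coef(lm(y_aug ~ z_aug - 1))                              # augmented OLS fit
    drop(solve(t(X) %*% X + lambda * diag(p), t(X) %*% y))   # direct ridge solution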

Lasso

A lasso is a loop of rope that is designed to be thrown around a target and tighten when pulled. Figure from Wikipedia.org.

LASSO: Least Absolute Shrinkage and Selection Operator

- Shrink the coefficients by imposing the constraint:

  \hat\beta^{lasso} = \arg\min_{\beta} \mathrm{RSS}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le C

- Equivalent Lagrangian form:

  \hat\beta^{lasso} = \arg\min_{\beta} \Big\{ \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}, \qquad \lambda \ge 0

- When C is sufficiently small (or λ is sufficiently large), some of the coefficients are exactly zero, so the lasso does subset selection in a continuous way.

Prostate Cancer Example

- Eight input variables:
  - Log cancer volume (lcavol)
  - Log prostate weight (lweight)
  - Age (age)
  - Log of benign prostatic hyperplasia amount (lbph)
  - Seminal vesicle invasion (svi)
  - Log capsular penetration (lcp)
  - Gleason score (gleason)
  - Percent of Gleason scores 4 or 5 (pgg45)
- Response variable: log of prostate-specific antigen (lpsa)
- Sample size: 97 subjects

Estimated Coefficients in the Prostate Cancer Example

- Training set: 67 observations; test set: 30 observations.
- Ten-fold cross-validation was used to determine λ in ridge and the lasso.
- The following table is from HTF (2009).

  Term        OLS      Best Subset   Ridge    Lasso    PCR
  Intercept    2.465    2.477         2.452    2.468    2.497
  lcavol       0.680    0.740         0.420    0.533    0.543
  lweight      0.263    0.316         0.238    0.169    0.289
  age         -0.141    0            -0.046    0       -0.152
  lbph         0.210    0             0.162    0.002    0.214
  svi          0.305    0             0.227    0.094    0.315
  lcp         -0.288    0             0.001    0       -0.051
  gleason     -0.021    0             0.040    0        0.232
  pgg45        0.267    0             0.133    0       -0.056
  Test error   0.521    0.492         0.492    0.279    0.449

Solution Paths for Ridge and the Lasso

[Coefficient profiles for ridge (left) and the lasso (right) on the prostate cancer data. The ridge profiles are plotted against the effective degrees of freedom df(λ) = tr[X(X^T X + λI)^{-1} X^T]; the lasso profiles are plotted against the shrinkage factor s = Σ_{j=1}^{8} |β̂_j^{lasso}| / Σ_{j=1}^{8} |β̂_j^{ols}|. Figure from HTF 2009.]

Best Subset, Ridge and the Lasso

Estimators of β_j in the case of an orthonormal X (table from HTF 2009):

  Estimator              Formula
  Best subset (size M)   β̂_j · I[rank(|β̂_j|) ≤ M]
  Ridge                  β̂_j / (1 + λ)
  Lasso                  sign(β̂_j)(|β̂_j| − λ)_+

[Plots of the three transformations applied to an OLS coefficient: hard thresholding (best subset), proportional shrinkage (ridge) and soft thresholding (lasso). Figure from HTF 2009.]
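A small R sketch of the three transformations in the orthonormal case; the grid of OLS values and the choice λ = 1 are illustrative assumptions, and a fixed cutoff stands in for keeping the M largest |β̂_j| in best subset selection.

    ## Hard thresholding, proportional shrinkage and soft thresholding
    beta_ols <- seq(-3, 3, by = 0.1)   # grid of OLS coefficient values
    lambda   <- 1                      # illustrative penalty / threshold

    best_subset <- ifelse(abs(beta_ols) > lambda, beta_ols, 0)       # keep-or-kill
    ridge       <- beta_ols / (1 + lambda)                           # shrink proportionally
    lasso       <- sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)  # soft threshold

    matplot(beta_ols, cbind(best_subset, ridge, lasso), type = "l", lty = 1,
            xlab = "OLS coefficient", ylab = "Transformed coefficient")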


Geometric View of Ridge and Lasso

[Figure 3.12 from HTF 2009: estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.]

Comparison of Prediction Performance

- Model: y = β^T x + σε, where ε ~ N(0, 1).
- Training set: 20; test set: 200; 50 replications.
- Simulation 1: β = 0.85 × (1, 1, 1, 1, 1, 1, 1, 1)^T, σ = 3

  Method   Median MSE    Avg. # of 0's
  OLS      6.50 (0.64)   0.0
  Lasso    4.87 (0.35)   2.3
  Ridge    2.30 (0.22)   0.0
  Subset   9.05 (0.78)   5.2

- Simulation 2: β = (5, 0, 0, 0, 0, 0, 0, 0)^T, σ = 2

  Method   Median MSE    Avg. # of 0's
  OLS      2.89 (0.04)   0.0
  Lasso    0.89 (0.01)   3.0
  Ridge    3.53 (0.05)   0.0
  Subset   0.64 (0.02)   6.3

Both tables are from Tibshirani (1996).

Stability of β̂ in Simulation 1

[Boxplots of the estimated coefficients β̂_1, ..., β̂_8 across the 50 replications for β = 0.85 × (1, 1, 1, 1, 1, 1, 1, 1)^T.]

Stability of β̂ in Simulation 2

[Boxplots of the estimated coefficients β̂_1, ..., β̂_8 across the 50 replications for β = (5, 0, 0, 0, 0, 0, 0, 0)^T.]

Other Issues

- Bridge regression family:

  \hat\beta^{B} = \arg\min_{\beta} \Big\{ \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}, \qquad q > 0

  [Figure 3.13 from HTF 2009: contours of constant value of Σ_j |β_j|^q for q = 4, 2, 1, 0.5, 0.1.]

- Best subset selection (the limit q → 0 penalizes the number of nonzero coefficients)
- Bayesian interpretation
- Computation of the lasso and its piecewise linear solution path (see the sketch below)
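As a hedged illustration of the piecewise linear lasso path, the sketch below uses the lars package (the LARS algorithm of Efron et al.) on simulated data; the package, the simulated design, and the chosen value s = 0.5 are assumptions not fixed by the slides.

    ## Full lasso solution path via LARS (requires the 'lars' package)
    library(lars)
    set.seed(5)
    n <- 50; p <- 8
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% c(3, 1.5, 0, 0, 2, 0, 0, 0) + rnorm(n))

    fit <- lars(X, y, type = "lasso")
    plot(fit)                               # piecewise linear coefficient paths
    coef(fit, s = 0.5, mode = "fraction")   # coefficients at shrinkage factor s = 0.5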

Piecewise Linear Solution Path in the Lasso

[Lasso coefficient profiles plotted against the shrinkage factor s; each coefficient path is piecewise linear in s.]

Limitations of the LASSO

- In the p > n case, the lasso selects at most n variables before it saturates; the number of selected variables is bounded by the number of samples. This seems to be a limiting feature for a variable selection method.
- If there is a group of variables among which the pairwise correlations are very high, the lasso tends to select only one variable from the group and does not care which one is selected.
- For the usual n > p situation, if there are high correlations between predictors, it has been observed empirically that the prediction performance of the lasso is dominated by ridge regression (Tibshirani, 1996).

Elastic Net Regularization

\hat\beta^{ENET} = \arg\min_{\beta} \Big\{ \mathrm{RSS}(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \Big\}

- The L1 part of the penalty generates a sparse model.
- The quadratic part of the penalty:
  - Removes the limitation on the number of selected variables;
  - Encourages a grouping effect;
  - Stabilizes the L1 regularization path.
- The elastic net objective function can also be expressed with a mixing parameter α:

  \hat\beta^{ENET} = \arg\min_{\beta} \Big\{ \mathrm{RSS}(\beta) + \lambda \Big[ \alpha \sum_{j=1}^{p} |\beta_j| + (1-\alpha) \sum_{j=1}^{p} \beta_j^2 \Big] \Big\}

A Simple Illustration: ENET vs. LASSO

- Two independent "hidden" factors z_1 and z_2:

  z_1 \sim U(0, 20), \qquad z_2 \sim U(0, 20)

- Response vector: y = z_1 + 0.1 z_2 + N(0, 1).
- Suppose we only observe the predictors:

  x_1 = z_1 + \epsilon_1, \quad x_2 = -z_1 + \epsilon_2, \quad x_3 = z_1 + \epsilon_3

  x_4 = z_2 + \epsilon_4, \quad x_5 = -z_2 + \epsilon_5, \quad x_6 = z_2 + \epsilon_6 \qquad (3)

- Fit the model on (x_1, x_2, x_3, x_4, x_5, x_6, y).
- An "oracle" would identify x_1, x_2, x_3 (the z_1 group) as the most important variables.
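A hedged R sketch of this example using the glmnet package; the sample size, the noise scale of the ε_i, and the choice α = 0.5 for the elastic net are assumptions not specified on the slide.

    ## Hidden-factor example: lasso vs. elastic net with glmnet
    library(glmnet)
    set.seed(6)
    n  <- 100
    z1 <- runif(n, 0, 20); z2 <- runif(n, 0, 20)
    y  <- z1 + 0.1 * z2 + rnorm(n)
    eps <- matrix(rnorm(6 * n, sd = 0.25), n, 6)           # small measurement noise
    X  <- cbind(z1, -z1, z1, z2, -z2, z2) + eps
    colnames(X) <- paste0("x", 1:6)

    ## The lasso tends to pick a single member of the z1 group,
    ## while the elastic net tends to keep the whole group.
    coef(cv.glmnet(X, y, alpha = 1),   s = "lambda.min")   # lasso
    coef(cv.glmnet(X, y, alpha = 0.5), s = "lambda.min")   # elastic net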

Simulation Studies (from Hui Zou's ElasticNet slides)

- Simulation example 1: 50 data sets consisting of 20/20/200 observations and 8 predictors; β = (3, 1.5, 0, 0, 2, 0, 0, 0), σ = 3, and cor(x_i, x_j) = 0.5^{|i-j|}.
- Simulation example 2: same as example 1, except β_j = 0.85 for all j.
- Simulation example 3: 50 data sets consisting of 100/100/400 observations and 40 predictors; β = (0, ..., 0, 2, ..., 2, 0, ..., 0, 2, ..., 2) with 10 coefficients in each block, σ = 15, and cor(x_i, x_j) = 0.5 for all i, j.
- Simulation example 4: 50 data sets consisting of 50/50/400 observations and 40 predictors; β = (3, ..., 3, 0, ..., 0) with 15 threes and 25 zeros, σ = 15, and

  x_i = Z_1 + \epsilon_i^x, \; Z_1 \sim N(0, 1), \; i = 1, ..., 5,
  x_i = Z_2 + \epsilon_i^x, \; Z_2 \sim N(0, 1), \; i = 6, ..., 10,
  x_i = Z_3 + \epsilon_i^x, \; Z_3 \sim N(0, 1), \; i = 11, ..., 15,
  x_i \sim N(0, 1) \text{ i.i.d.}, \; i = 16, ..., 40.

Simulation Results

Median MSE for the simulated examples:

  Method         Ex.1          Ex.2          Ex.3          Ex.4
  Ridge          4.49 (0.46)   2.84 (0.27)   39.5 (1.80)   64.5 (4.78)
  Lasso          3.06 (0.31)   3.87 (0.38)   65.0 (2.82)   46.6 (3.96)
  Elastic Net    2.51 (0.29)   3.16 (0.27)   56.6 (1.75)   34.5 (1.64)
  No re-scaling  5.70 (0.41)   2.73 (0.23)   41.0 (2.13)   45.9 (3.72)

Variable selection results (median number of selected variables):

  Method       Ex.1  Ex.2  Ex.3  Ex.4
  Lasso        5     6     24    11
  Elastic Net  6     7     27    16

Other Issues

- Elastic net with scaling correction: β̂^{enet} = (1 + λ_2) β̂^{ENET}.
  - This keeps the grouping effect and overcomes the double shrinkage (too much shrinkage/bias towards zero) caused by the quadratic penalty.
- The elastic net solution path is also piecewise linear.
- Solving the elastic net is essentially solving a LASSO problem with augmented data.
- The coordinate descent algorithm solves the elastic net efficiently.
- The elastic net can also be used in classification and other problems such as GLMs.
- The glmnet package in R uses a coordinate descent algorithm to solve elastic-net-type problems (see the sketch below).
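A hedged usage sketch of glmnet for the three penalties discussed above; the objects X and y are assumed to be a numeric predictor matrix and response vector, and α = 0.5 is an arbitrary illustrative mix.

    ## Ridge, lasso and elastic net fits with glmnet
    library(glmnet)
    fit_ridge <- glmnet(X, y, alpha = 0)     # pure quadratic (ridge) penalty
    fit_lasso <- glmnet(X, y, alpha = 1)     # pure L1 (lasso) penalty
    fit_enet  <- glmnet(X, y, alpha = 0.5)   # elastic net mix

    ## Choose lambda by ten-fold cross-validation (glmnet's default)
    cv <- cv.glmnet(X, y, alpha = 0.5)
    coef(cv, s = "lambda.min")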

South African Heart Disease

- A subset of the Coronary Risk-Factor Study (CORIS) survey.
- 462 white males between 15 and 64 years old from the Western Cape, South Africa.
- Response variable: presence (160) or absence (302) of myocardial infarction.
- Seven input variables: systolic blood pressure (sbp), obesity, tobacco, ldl, famhist, alcohol, age.
- The table below (logistic regression fit) is from EOSL (2009).

  Variable    Coef     SE      Z score
  Intercept   -4.130   0.964   -4.285
  sbp          0.006   0.006    1.023
  tobacco      0.080   0.026    3.034
  ldl          0.185   0.057    3.219
  famhist      0.939   0.225    4.178
  obesity     -0.035   0.029   -1.187
  alcohol      0.001   0.004    0.136
  age          0.043   0.010    4.184
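The coefficient profiles on the next slide can be reproduced, in a hedged way, with glmnet; the SAheart data frame (from the ElemStatLearn package) and its column names, including the binary response chd, are assumptions about the available data.

    ## L1-regularized logistic regression for the CORIS data
    library(glmnet)
    library(ElemStatLearn)    # assumed source of the SAheart data
    data(SAheart)

    x <- model.matrix(chd ~ sbp + tobacco + ldl + famhist + obesity + alcohol + age,
                      data = SAheart)[, -1]   # drop the intercept column
    y <- SAheart$chd

    fit <- glmnet(scale(x), y, family = "binomial", alpha = 1)
    plot(fit, xvar = "norm")   # coefficients vs. the L1 norm, as in the figure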

Solution path for L1 regularized logistic regression

[Figure 4.13 from EOSL (2009): L1-regularized logistic regression coefficients for the South African heart disease data, plotted as a function of the L1 norm ||β(λ)||_1. The variables were all standardized to have unit variance. The profiles are computed exactly at each of the plotted points.]

References

- "Ridge regression: biased estimation for nonorthogonal problems." Hoerl and Kennard (1970). Technometrics, 12, 55-67.
- "A statistical view of some chemometrics regression tools." Frank and Friedman (1993). Technometrics, 35, 109-135.
- "Regression shrinkage and selection via the lasso." Tibshirani (1996). JRSSB, 58, 267-288.
- "Regularization and variable selection via the elastic net." Zou and Hastie (2005). JRSSB, 67, 301-320.
- "Elements of Statistical Learning." 2nd ed. Hastie, Tibshirani and Friedman (2009), Springer.
- Lasso page: http://www-stat.stanford.edu/~tibs/lasso.html
