Model Selection: General Techniques


Statistics 203: Introduction to Regression and Analysis of Variance
Jonathan Taylor

Today

■ Outlier detection / simultaneous inference.
■ Goals of model selection.
■ Criteria to compare models.
■ (Some) model selection strategies.

Crude outlier detection test

■ If the studentized residual of an observation is large, that observation may be an outlier.
■ Problem: if n is large and we threshold at t_{1-\alpha/2,\, n-p-1}, we will flag many outliers by chance even if the model is correct.
■ Solution: Bonferroni correction, i.e. threshold at t_{1-\alpha/(2n),\, n-p-1}.

Bonferroni correction

■ If we are doing many t (or other) tests, say m > 1 of them, we can control the overall false positive rate at \alpha by testing each one at level \alpha/m.
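The per-test level \alpha/m can be applied directly on the p-value scale. A minimal sketch in Python (the slides' own examples are in R; the function name and the example p-values here are illustrative, not from the slides):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag tests as significant at family-wise level alpha.

    Each of the m tests is carried out at level alpha / m, so by the
    union bound the overall false-positive rate is at most alpha.
    """
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]

# Five tests at overall level 0.05 -> per-test threshold 0.01.
flags = bonferroni_reject([0.004, 0.02, 0.8, 0.009, 0.11], alpha=0.05)
print(flags)  # [True, False, False, True, False]
```

The same idea gives the outlier rule above: with n studentized residuals, compare each to the t quantile at level 1 - \alpha/(2n) instead of 1 - \alpha/2.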
■ Proof:
\[
P(\text{at least one false positive})
= P\Big(\bigcup_{i=1}^m \big\{|T_i| \ge t_{1-\alpha/(2m),\, n-p-1}\big\}\Big)
\le \sum_{i=1}^m P\big(|T_i| \ge t_{1-\alpha/(2m),\, n-p-1}\big)
= \sum_{i=1}^m \frac{\alpha}{m} = \alpha.
\]
■ Known as "simultaneous inference": controlling the overall false positive rate at \alpha while performing many tests.

Simultaneous inference for \beta

■ Another common situation in which simultaneous inference occurs is simultaneous inference for \beta.
■ Using the facts that
\[
\hat\beta \sim N\big(\beta,\, \sigma^2 (X^t X)^{-1}\big), \qquad
\hat\sigma^2 \sim \sigma^2 \cdot \frac{\chi^2_{n-p}}{n-p},
\]
along with \hat\beta \perp \hat\sigma^2, leads to
\[
\frac{(\hat\beta - \beta)^t (X^t X)(\hat\beta - \beta)/p}{\hat\sigma^2}
\sim \frac{\chi^2_p / p}{\chi^2_{n-p}/(n-p)} \sim F_{p,\, n-p}.
\]
■ A (1 - \alpha) \cdot 100\% simultaneous confidence region:
\[
\big\{\beta :\ (\hat\beta - \beta)^t (X^t X)(\hat\beta - \beta) \le p\, \hat\sigma^2\, F_{p,\, n-p,\, 1-\alpha}\big\}.
\]

Model selection: goals

■ When we have many predictors (with many possible interactions), it can be difficult to find a good model.
■ Which main effects do we include?
■ Which interactions do we include?
■ Model selection tries to "simplify" this task.
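The union-bound argument in the Bonferroni proof can be checked numerically: with m true nulls, each tested at level \alpha/m, the chance of any false positive stays near or below \alpha. A small Monte Carlo sketch in Python (all names and constants are illustrative):

```python
import random

def fwer_estimate(m, alpha, n_trials=20000, seed=0):
    """Monte Carlo estimate of the family-wise error rate when each of
    m true-null tests is run at the Bonferroni level alpha / m.
    Under the null, each p-value is Uniform(0, 1)."""
    rng = random.Random(seed)
    false_positive_families = 0
    for _ in range(n_trials):
        if any(rng.random() < alpha / m for _ in range(m)):
            false_positive_families += 1
    return false_positive_families / n_trials

# Exact value for independent tests: 1 - (1 - alpha/m)^m, about 0.0488
# for m = 20 and alpha = 0.05, confirming control at level alpha.
print(fwer_estimate(m=20, alpha=0.05))
```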
Model selection: general

■ This is an "unsolved" problem in statistics: there are no magic procedures that get you the "best model."
■ In some sense, model selection is "data mining."
■ Data miners / machine learners often work with very many predictors.

Model selection: strategies

■ To "implement" model selection, we need:
  ◆ a criterion or benchmark to compare two models;
  ◆ a search strategy.
■ With a limited number of predictors, it is possible to search all possible models.

Possible criteria

■ R^2: not a good criterion. It always increases with model size, so the "optimum" is to take the biggest model.
■ Adjusted R^2: better. It penalizes bigger models.
■ Mallow's C_p.
■ Akaike's Information Criterion (AIC), Schwarz's BIC.
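The contrast between R^2 and adjusted R^2 is easy to see numerically: R^2 compares sums of squares, while adjusted R^2 compares mean squares and so charges for each extra predictor. A Python sketch with made-up sums of squares:

```python
def r2(sse, sst):
    """Plain R^2: fraction of total variation explained."""
    return 1 - sse / sst

def adjusted_r2(sse, sst, n, p):
    """Adjusted R^2: replace sums of squares by mean squares.
    p counts the predictors, excluding the intercept."""
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

n, sst = 30, 100.0
small = adjusted_r2(sse=40.0, sst=sst, n=n, p=2)   # 2 predictors
big   = adjusted_r2(sse=39.0, sst=sst, n=n, p=10)  # 8 extra predictors
print(r2(40.0, sst) < r2(39.0, sst))  # True: plain R^2 always prefers the big model
print(small > big)                    # True: adjusted R^2 need not
```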
Mallow's C_p

■ Definition:
\[
C_p(M) = \frac{SSE(M)}{\hat\sigma^2} - n + 2\, p(M).
\]
■ \hat\sigma^2 = SSE(F)/df_F is the "best" estimate of \sigma^2 we have (use the fullest model F).
■ SSE(M) = \|Y - \hat Y_M\|^2 is the SSE of the model M.
■ p(M) is the number of predictors in M, or the degrees of freedom used up by the model.
■ Based on an estimate of the scaled prediction error
\[
\frac{1}{\sigma^2} \sum_{i=1}^n E\big(\hat Y_i - E(Y_i)\big)^2
= \frac{1}{\sigma^2} \sum_{i=1}^n \Big[\big(E(\hat Y_i) - E(Y_i)\big)^2 + \operatorname{Var}(\hat Y_i)\Big],
\]
a squared bias plus a variance term.

AIC & BIC

■ Mallow's C_p is (almost) a special case of the Akaike Information Criterion (AIC):
\[
AIC(M) = -2 \log L(M) + 2\, p(M).
\]
■ L(M) is the likelihood function of the parameters in model M, evaluated at the MLE (maximum likelihood estimators).
■ Schwarz's Bayesian Information Criterion (BIC):
\[
BIC(M) = -2 \log L(M) + p(M) \cdot \log n.
\]

Maximum likelihood estimation

■ If the model is correct, then the log-likelihood of (\beta, \sigma) is
\[
\log L(\beta, \sigma \mid X, Y)
= -\frac{n}{2}\big(\log(2\pi) + \log \sigma^2\big) - \frac{1}{2\sigma^2}\|Y - X\beta\|^2,
\]
where Y is the vector of observed responses.
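Computing C_p is immediate once \hat\sigma^2 from the fullest model is in hand. A Python sketch with illustrative numbers (the SSE values and \hat\sigma^2 below are made up):

```python
def mallows_cp(sse_m, sigma2_hat, n, p_m):
    """Cp(M) = SSE(M) / sigma2_hat - n + 2 * p(M),
    where sigma2_hat = SSE(F) / df_F comes from the fullest model."""
    return sse_m / sigma2_hat - n + 2 * p_m

n = 50
sigma2_hat = 2.0  # SSE(full) / df_full, assumed given
print(mallows_cp(sse_m=110.0, sigma2_hat=sigma2_hat, n=n, p_m=4))  # 13.0
print(mallows_cp(sse_m=96.0,  sigma2_hat=sigma2_hat, n=n, p_m=6))  # 10.0
```

For a model with negligible bias, C_p(M) is close to p(M), which is why small C_p values near p(M) are attractive.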
■ The MLE for \beta in this case is the same as the least squares estimate, because the first term does not depend on \beta.
■ MLE for \sigma^2:
\[
\frac{\partial}{\partial \sigma^2} \log L(\beta, \sigma)\Big|_{\hat\beta,\, \hat\sigma^2}
= -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\|Y - X\hat\beta\|^2 = 0.
\]
■ Solving for \hat\sigma^2:
\[
\hat\sigma^2_{MLE} = \frac{1}{n}\|Y - X\hat\beta\|^2 = \frac{1}{n} SSE(M).
\]
Note that this MLE is biased.

AIC for a linear model

■ Using \hat\beta_{MLE} = \hat\beta and
\[
\hat\sigma^2_{MLE} = \frac{1}{n} SSE(M),
\]
we see that the AIC of a multiple linear regression model is
\[
AIC(M) = n\big(\log(2\pi) + \log(SSE(M)) - \log(n)\big) + n + 2\,(p(M) + 1).
\]
■ If \sigma^2 is known, then
\[
AIC(M) = n\big(\log(2\pi) + \log(\sigma^2)\big) + \frac{SSE(M)}{\sigma^2} + 2\, p(M),
\]
which is almost C_p(M) + K_n for a constant K_n.

Search strategies

■ "Best subset": search all possible models and take the one with the highest adjusted R^2 or the lowest C_p.
■ Stepwise (forward, backward or both): useful when the number of predictors is large. Choose an initial model and be "greedy."
■ "Greedy" means always take the biggest jump (up or down) in your selected criterion.

Implementations in R

■ "Best subset": use the function leaps. Works only for multiple linear regression models.
■ Stepwise: use the function step.
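With \sigma^2 profiled out, the AIC of a Gaussian linear model is a direct function of SSE(M), n, and p(M): shrinking SSE lowers AIC, but each extra parameter costs 2. A Python sketch with made-up SSE values:

```python
import math

def aic_linear(sse, n, p):
    """AIC of a Gaussian linear model with sigma^2 profiled out:
    n (log(2 pi) + log(SSE) - log(n)) + n + 2 (p + 1),
    counting p + 1 free parameters (coefficients plus sigma^2)."""
    return (n * (math.log(2 * math.pi) + math.log(sse) - math.log(n))
            + n + 2 * (p + 1))

# Two extra predictors drop SSE from 110 to 96: here the fit gain
# outweighs the 2-per-parameter penalty, so AIC decreases.
print(aic_linear(sse=110.0, n=50, p=4) > aic_linear(sse=96.0, n=50, p=6))  # True
```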
It works for any model with an Akaike Information Criterion (AIC). In multiple linear regression, AIC is (almost) a linear function of C_p.

Caveats

■ Many other "criteria" have been proposed.
■ Some work well for some types of data, others for different data.
■ These criteria are not "direct measures" of predictive power.
■ Later we will see cross-validation, which is an estimate of predictive power.
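The greedy search behind a stepwise procedure like R's step can be sketched abstractly: given any function scoring a set of predictors (lower is better, e.g. AIC), repeatedly add the predictor giving the biggest improvement and stop when nothing helps. A Python sketch; the toy score table standing in for AIC values is entirely made up:

```python
def forward_stepwise(candidates, criterion):
    """Greedy forward search: start from the empty model and, at each
    step, add the single predictor that most improves `criterion`
    (lower is better); stop when no addition improves it."""
    current, best = frozenset(), criterion(frozenset())
    while True:
        moves = [(criterion(current | {c}), current | {c})
                 for c in candidates - current]
        if not moves:
            return current, best
        score, model = min(moves, key=lambda t: t[0])
        if score >= best:
            return current, best
        current, best = model, score

# Toy criterion table standing in for the AIC of each submodel.
scores = {frozenset(): 100.0,
          frozenset({"x1"}): 80.0, frozenset({"x2"}): 90.0,
          frozenset({"x3"}): 95.0,
          frozenset({"x1", "x2"}): 70.0, frozenset({"x1", "x3"}): 85.0,
          frozenset({"x2", "x3"}): 88.0,
          frozenset({"x1", "x2", "x3"}): 72.0}
model, score = forward_stepwise({"x1", "x2", "x3"}, scores.__getitem__)
print(sorted(model), score)  # ['x1', 'x2'] 70.0
```

Note the greedy stop: the full model (72.0) is never reached because adding x3 to {x1, x2} would worsen the score, exactly the behavior the slides describe.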