Model Selection: General Techniques

Statistics 203: Introduction to Regression and Analysis of Variance
Jonathan Taylor

Today

■ Outlier detection / simultaneous inference.
■ Goals of model selection.
■ Criteria to compare models.
■ (Some) model selection strategies.

Crude outlier detection test

■ If the studentized residuals are large: the observation may be an outlier.
■ Problem: if $n$ is large and we "threshold" at $t_{1-\alpha/2,\,n-p-1}$, we will get many outliers by chance even if the model is correct.
■ Solution: Bonferroni correction, threshold at $t_{1-\alpha/(2n),\,n-p-1}$.

Bonferroni correction

■ If we are doing many $t$ (or other) tests, say $m > 1$, we can control the overall false positive rate at $\alpha$ by testing each one at level $\alpha/m$.
■ Proof:
$$
P(\text{at least one false positive}) = P\left( \bigcup_{i=1}^m \left\{ |T_i| \geq t_{1-\alpha/(2m),\,n-p-1} \right\} \right) \leq \sum_{i=1}^m P\left( |T_i| \geq t_{1-\alpha/(2m),\,n-p-1} \right) = \sum_{i=1}^m \frac{\alpha}{m} = \alpha.
$$
■ Known as "simultaneous inference": controlling the overall false positive rate at $\alpha$ while performing many tests.
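The Bonferroni bound above can be checked by simulation: under the null, draw $m$ test statistics per experiment and count how often at least one crosses the uncorrected versus the corrected threshold. The following is a minimal sketch in Python, using standard normal statistics instead of $t$ statistics for simplicity (`NormalDist` is in the standard library):

```python
import random
from statistics import NormalDist

def fwer(threshold, m=20, reps=2000, seed=0):
    """Monte Carlo family-wise error rate: P(at least one |Z_i| >= threshold)
    when all m null hypotheses are true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        if any(abs(rng.gauss(0, 1)) >= threshold for _ in range(m)):
            hits += 1
    return hits / reps

alpha, m = 0.05, 20
naive = NormalDist().inv_cdf(1 - alpha / 2)        # per-test threshold
bonf = NormalDist().inv_cdf(1 - alpha / (2 * m))   # Bonferroni threshold

print(fwer(naive))  # far above alpha: many false positives by chance
print(fwer(bonf))   # controlled at (at most) about alpha
```

With $m = 20$ the uncorrected rate is near $1 - 0.95^{20} \approx 0.64$, while the Bonferroni threshold keeps it near $0.05$, matching the union bound in the proof.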
Simultaneous inference for β

■ Another common situation in which simultaneous inference occurs is "simultaneous inference" for $\beta$.
■ Using the facts that
$$
\widehat{\beta} \sim N\left( \beta, \sigma^2 (X^t X)^{-1} \right), \qquad \widehat{\sigma}^2 \sim \sigma^2 \cdot \frac{\chi^2_{n-p}}{n-p}
$$
along with $\widehat{\beta} \perp \widehat{\sigma}^2$ leads to
$$
\frac{(\widehat{\beta} - \beta)^t (X^t X) (\widehat{\beta} - \beta) / p}{\widehat{\sigma}^2} \sim \frac{\chi^2_p / p}{\chi^2_{n-p} / (n-p)} \sim F_{p,\,n-p}.
$$
■ $(1 - \alpha) \cdot 100\%$ simultaneous confidence region:
$$
\left\{ \beta : (\widehat{\beta} - \beta)^t (X^t X) (\widehat{\beta} - \beta) \leq p\, \widehat{\sigma}^2 F_{p,\,n-p,\,1-\alpha} \right\}.
$$

Model selection: goals

■ When we have many predictors (with many possible interactions), it can be difficult to find a good model.
■ Which main effects do we include?
■ Which interactions do we include?
■ Model selection tries to "simplify" this task.

Model selection: general

■ This is an "unsolved" problem in statistics: there are no magic procedures to get you the "best model."
■ In some sense, model selection is "data mining."
■ Data miners / machine learners often work with very many predictors.
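To see why finding a good model is difficult, it helps to count the candidates: with a handful of main effects and their two-way interactions, the number of possible models is already in the thousands. A small illustrative sketch (the predictor names are made up):

```python
from itertools import chain, combinations

main_effects = ["x1", "x2", "x3", "x4"]  # hypothetical predictors
# all two-way interactions among the main effects
interactions = [f"{a}:{b}" for a, b in combinations(main_effects, 2)]
terms = main_effects + interactions

# every subset of terms is a candidate model (including the empty model)
models = list(chain.from_iterable(
    combinations(terms, k) for k in range(len(terms) + 1)))

print(len(terms))    # 4 main effects + 6 interactions = 10 terms
print(len(models))   # 2**10 = 1024 candidate models
```

Ten predictors already give $2^{10} = 1024$ models; twenty give over a million, which is why automatic criteria and search strategies are needed.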
Model selection: strategies

■ To "implement" this, we need:
◆ a criterion or benchmark to compare two models;
◆ a search strategy.
■ With a limited number of predictors, it is possible to search all possible models.

Possible criteria

■ $R^2$: not a good criterion. It always increases with model size, so the "optimum" is to take the biggest model.
■ Adjusted $R^2$: better. It "penalizes" bigger models.
■ Mallow's $C_p$.
■ Akaike's Information Criterion (AIC), Schwarz's BIC.

Mallow's $C_p$

$$
C_p(M) = \frac{SSE(M)}{\widehat{\sigma}^2} - n + 2 \cdot p(M).
$$
■ $\widehat{\sigma}^2 = SSE(F) / df_F$ is the "best" estimate of $\sigma^2$ we have (use the fullest model).
■ $SSE(M) = \|Y - \widehat{Y}_M\|^2$ is the SSE of the model $M$.
■ $p(M)$ is the number of predictors in $M$, or the degrees of freedom used up by the model.
■ Based on an estimate of
$$
\frac{1}{\sigma^2} \sum_{i=1}^n E\left( \widehat{Y}_i - E(Y_i) \right)^2 = \frac{1}{\sigma^2} \sum_{i=1}^n \left[ E\left( Y_i - \widehat{Y}_i \right)^2 + \mathrm{Var}(\widehat{Y}_i) \right].
$$
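A best-subset search scored by Mallow's $C_p$ can be sketched end to end on synthetic data. The sketch below fits every subset by least squares (pure stdlib, solving the normal equations by Gaussian elimination) and estimates $\sigma^2$ from the fullest model, as the slide prescribes; the data, variable names, and the convention that $p(M)$ counts the intercept are assumptions of this example:

```python
import random
from itertools import combinations

def ols_sse(X, y):
    """SSE of the least squares fit of y on the columns of X (X includes
    an intercept column); normal equations solved by Gaussian elimination."""
    k, n = len(X[0]), len(y)
    M = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    beta = [M[i][k] / M[i][i] for i in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))

rng = random.Random(1)
n = 40
cols = {v: [rng.gauss(0, 1) for _ in range(n)] for v in ("x1", "x2", "x3")}
y = [1 + 2 * cols["x1"][i] + rng.gauss(0, 1) for i in range(n)]  # x2, x3 are noise

def design(subset):
    return [[1.0] + [cols[v][i] for v in subset] for i in range(n)]

full = ("x1", "x2", "x3")
sigma2 = ols_sse(design(full), y) / (n - len(full) - 1)  # from the fullest model

def cp(subset):
    # p(M) here counts the intercept plus the selected predictors
    return ols_sse(design(subset), y) / sigma2 - n + 2 * (len(subset) + 1)

subsets = [s for k in range(4) for s in combinations(full, k)]
best = min(subsets, key=cp)
print(best)  # the subset with the lowest Cp
```

Because only `x1` carries signal, the minimum-$C_p$ subset should contain `x1`, and $C_p$ of the true model is far below that of the empty model.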
AIC & BIC

■ Mallow's $C_p$ is (almost) a special case of the Akaike Information Criterion (AIC):
$$
AIC(M) = -2 \log L(M) + 2 \cdot p(M).
$$
■ $L(M)$ is the likelihood function of the parameters in model $M$ evaluated at the MLE (Maximum Likelihood Estimators).
■ Schwarz's Bayesian Information Criterion (BIC):
$$
BIC(M) = -2 \log L(M) + p(M) \cdot \log n.
$$

Maximum likelihood estimation

■ If the model is correct then the log-likelihood of $(\beta, \sigma)$ is
$$
\log L(\beta, \sigma \mid X, Y) = -\frac{n}{2} \left( \log(2\pi) + \log \sigma^2 \right) - \frac{1}{2\sigma^2} \|Y - X\beta\|^2
$$
where $Y$ is the vector of observed responses.
■ The MLE for $\beta$ in this case is the same as the least squares estimate because the first term does not depend on $\beta$.
■ MLE for $\sigma^2$:
$$
\left. \frac{\partial}{\partial \sigma^2} \log L(\beta, \sigma) \right|_{\widehat{\beta},\, \widehat{\sigma}^2} = -\frac{n}{2 \widehat{\sigma}^2} + \frac{1}{2 \widehat{\sigma}^4} \|Y - X\widehat{\beta}\|^2 = 0.
$$
■ Solving for $\sigma^2$:
$$
\widehat{\sigma}^2_{MLE} = \frac{1}{n} \|Y - X\widehat{\beta}\|^2 = \frac{1}{n} SSE(M).
$$
Note that the MLE is biased.
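The claim $\widehat{\sigma}^2_{MLE} = SSE(M)/n$ can be verified numerically: fix the fitted mean (here a mean-only model, so no matrix algebra is needed) and scan the Gaussian log-likelihood over a grid of $\sigma^2$ values. A small sketch with made-up data:

```python
import math

# toy "model": y_i = mu + error, so the fitted values are just the mean
y = [2.1, 1.9, 3.2, 2.8, 2.0, 2.5, 3.0, 1.7]
n = len(y)
mu_hat = sum(y) / n                         # least squares = MLE for the mean
sse = sum((yi - mu_hat) ** 2 for yi in y)   # SSE at the fitted mean

def log_lik(sigma2):
    """Gaussian log-likelihood at the fitted mean, as a function of sigma^2."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2)) \
           - sse / (2 * sigma2)

# scan a grid of sigma^2 values; the maximizer should sit at SSE/n
grid = [0.01 * k for k in range(1, 201)]
best = max(grid, key=log_lik)
print(best, sse / n)
```

The grid maximizer lands on (the grid point nearest) $SSE/n$, confirming the derivative calculation on the slide.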
AIC for a linear model

■ Using $\widehat{\beta}_{MLE} = \widehat{\beta}$ and
$$
\widehat{\sigma}^2_{MLE} = \frac{1}{n} SSE(M)
$$
we see that the AIC of a multiple linear regression model is
$$
AIC(M) = n \left( \log(2\pi) + \log(SSE(M)) - \log(n) \right) + n + 2 \left( p(M) + 1 \right).
$$
■ If $\sigma^2$ is known, then
$$
AIC(M) = n \left( \log(2\pi) + \log(\sigma^2) \right) + \frac{SSE(M)}{\sigma^2} + 2 p(M)
$$
which is almost $C_p(M) + K_n$ for a constant $K_n$ not depending on $M$.

Search strategies

■ "Best subset": search all possible models and take the one with highest $R^2$ or lowest $C_p$.
■ Stepwise (forward, backward or both): useful when the number of predictors is large. Choose an initial model and be "greedy".
■ "Greedy" means always take the biggest jump (up or down) in your selected criterion.

Implementations in R

■ "Best subset": use the function leaps. Works only for multiple linear regression models.
■ Stepwise: use the function step. Works for any model with an Akaike Information Criterion (AIC). In multiple linear regression, AIC is (almost) a linear function of $C_p$.
■ Here is an example.
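A greedy forward search in the spirit of R's step can be sketched in a few lines. By the formula above, for comparing linear models it is enough to use AIC up to additive constants, $n \log(SSE/n) + 2\,p$. The data and variable names below are synthetic, and this mimics rather than reproduces step:

```python
import math
import random

def sse_fit(X, y):
    """SSE of the least squares fit (X includes an intercept column);
    normal equations solved by Gaussian elimination, pure stdlib."""
    k, n = len(X[0]), len(y)
    M = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    beta = [M[i][k] / M[i][i] for i in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))

rng = random.Random(7)
n = 60
cols = {v: [rng.gauss(0, 1) for _ in range(n)] for v in ("x1", "x2", "x3", "x4")}
y = [0.5 + 1.5 * cols["x1"][i] - 2.0 * cols["x3"][i] + rng.gauss(0, 1)
     for i in range(n)]

def aic(subset):
    # AIC up to an additive constant: n log(SSE/n) + 2 * (number of terms)
    X = [[1.0] + [cols[v][i] for v in subset] for i in range(n)]
    return n * math.log(sse_fit(X, y) / n) + 2 * (len(subset) + 1)

selected, remaining = [], {"x1", "x2", "x3", "x4"}
while remaining:
    cand = min(remaining, key=lambda v: aic(selected + [v]))  # greedy step
    if aic(selected + [cand]) >= aic(selected):
        break  # no candidate improves AIC: stop
    selected.append(cand)
    remaining.remove(cand)

print(sorted(selected))
```

Since only `x1` and `x3` carry signal, both should be picked up; whether a noise variable sneaks in depends on the draw, which is exactly the "greedy, not guaranteed optimal" caveat of stepwise search.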
Caveats

■ Many other "criteria" have been proposed.
■ Some work well for some types of data, others for different data.
■ These criteria are not "direct measures" of predictive power.
■ Later we will see cross-validation, which is an estimate of predictive power.
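Cross-validation, mentioned here as a coming topic, estimates predictive power directly by holding data out. A minimal leave-one-out sketch for simple linear regression (synthetic data; the closed-form slope and intercept avoid any matrix algebra):

```python
import random

rng = random.Random(3)
n = 30
x = [rng.uniform(0, 10) for _ in range(n)]
y = [1.0 + 0.8 * xi + rng.gauss(0, 1) for xi in x]  # noise sd = 1

def fit(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    slope = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys)) \
        / sum((a - xbar) ** 2 for a in xs)
    return ybar - slope * xbar, slope

def loocv_mse(xs, ys):
    """Leave-one-out CV: refit without each point, score it held out."""
    errs = []
    for i in range(len(xs)):
        a, b = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errs.append((ys[i] - (a + b * xs[i])) ** 2)
    return sum(errs) / len(errs)

print(loocv_mse(x, y))  # estimates out-of-sample MSE, near the noise variance
```

Unlike $C_p$ or AIC, this number is a direct estimate of prediction error, which is why cross-validation is the natural benchmark to compare such criteria against.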
