
Statistics 203: Introduction to Regression and Analysis of Variance
General Techniques

Jonathan Taylor

Today

■ Outlier detection / simultaneous inference.
■ Goals of model selection.
■ Criteria to compare models.
■ (Some) model selection.

Crude outlier detection test

■ If the studentized residuals are large, the observation may be an outlier.
■ Problem: if $n$ is large and we “threshold” at $t_{1-\alpha/2,\,n-p-1}$, we will flag many outliers by chance even if the model is correct.
■ Solution: Bonferroni correction, threshold at $t_{1-\alpha/(2n),\,n-p-1}$.
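
A minimal R sketch of this test (the data frame dat and the model formula are hypothetical):

    fit <- lm(y ~ x1 + x2, data = dat)    # hypothetical model
    n <- nobs(fit)
    p <- length(coef(fit))                # coefficients, including intercept
    r <- rstudent(fit)                    # externally studentized residuals
    alpha <- 0.05
    ## Bonferroni-corrected threshold t_{1 - alpha/(2n), n - p - 1}
    cutoff <- qt(1 - alpha / (2 * n), df = n - p - 1)
    which(abs(r) > cutoff)                # observations flagged as outliers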

Bonferroni correction

■ If we are doing many $t$ (or other) tests, say $m > 1$, we can control the overall false positive rate at $\alpha$ by testing each one at level $\alpha/m$.
■ Proof:
$$P(\text{at least one false positive}) = P\left(\bigcup_{i=1}^m \left\{ |T_i| \ge t_{1-\alpha/(2m),\,n-p-1} \right\}\right)$$
$$\le \sum_{i=1}^m P\left(|T_i| \ge t_{1-\alpha/(2m),\,n-p-1}\right) = \sum_{i=1}^m \frac{\alpha}{m} = \alpha.$$
■ Known as “simultaneous inference”: controlling the overall false positive rate at $\alpha$ while performing many tests.
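
A minimal R sketch (pvals is a hypothetical vector of m p-values); testing each hypothesis at level α/m matches base R’s Bonferroni adjustment:

    alpha <- 0.05
    m <- length(pvals)                    # pvals: hypothetical p-values
    reject <- pvals < alpha / m           # test each one at level alpha/m
    ## equivalently: adjust the p-values and compare to alpha
    identical(reject, p.adjust(pvals, method = "bonferroni") < alpha)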

Simultaneous inference for β

■ Another common situation in which simultaneous inference arises is inference for the whole coefficient vector β.
■ Using the facts that
$$\hat{\beta} \sim N\left(\beta,\ \sigma^2 (X^t X)^{-1}\right), \qquad \hat{\sigma}^2 \sim \sigma^2 \cdot \frac{\chi^2_{n-p}}{n-p},$$
along with $\hat{\beta} \perp \hat{\sigma}^2$, leads to
$$\frac{(\hat{\beta} - \beta)^t (X^t X)(\hat{\beta} - \beta)/p}{\hat{\sigma}^2} \sim \frac{\chi^2_p / p}{\chi^2_{n-p}/(n-p)} \sim F_{p,\,n-p}.$$
■ $(1 - \alpha) \cdot 100\%$ simultaneous confidence region:
$$\left\{ \beta : (\hat{\beta} - \beta)^t (X^t X)(\hat{\beta} - \beta) \le p\, \hat{\sigma}^2\, F_{p,\,n-p,\,1-\alpha} \right\}.$$
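
A minimal R sketch checking whether a candidate vector beta0 lies in this region (fit and beta0 are hypothetical):

    X <- model.matrix(fit)                     # fit: hypothetical lm object
    n <- nrow(X); p <- ncol(X)
    s2 <- sum(resid(fit)^2) / (n - p)          # usual unbiased estimate of sigma^2
    d <- coef(fit) - beta0                     # beta0: hypothetical candidate value
    lhs <- drop(t(d) %*% crossprod(X) %*% d)   # (betahat - beta0)' X'X (betahat - beta0)
    alpha <- 0.05
    lhs <= p * s2 * qf(1 - alpha, p, n - p)    # TRUE if beta0 is inside the region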

Model selection: goals

■ When we have many predictors (with many possible interactions), it can be difficult to find a good model.
■ Which main effects do we include?
■ Which interactions do we include?
■ Model selection tries to “simplify” this task.

Model selection: general

■ This is an “unsolved” problem in statistics: there are no magic procedures to get you the “best model.”
■ In some sense, model selection is “data mining.”
■ Data miners / machine learners often work with very many predictors.

Model selection: strategies

■ To “implement” this, we need:
◆ a criterion or benchmark to compare two models;
◆ a search strategy.
■ With a limited number of predictors, it is possible to search all possible models.

Possible criteria

■ $R^2$: not a good criterion. It always increases with model size, so the “optimum” is to take the biggest model.
■ Adjusted $R^2$: better. It “penalizes” bigger models.
■ Mallows’ $C_p$.
■ Akaike’s Information Criterion (AIC), Schwarz’s BIC.
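
A quick R illustration of the first two points (the data frame dat and the nested model formulas are hypothetical):

    small <- lm(y ~ x1, data = dat)
    big   <- lm(y ~ x1 + x2 + x3, data = dat)            # superset of small
    ## R^2 can only increase when predictors are added ...
    c(summary(small)$r.squared, summary(big)$r.squared)
    ## ... while adjusted R^2 can decrease
    c(summary(small)$adj.r.squared, summary(big)$adj.r.squared)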

Mallows’ Cp

■ Mallows’ $C_p$:
$$C_p(M) = \frac{SSE(M)}{\hat{\sigma}^2} - n + 2\, p(M).$$
■ $\hat{\sigma}^2 = SSE(F)/df_F$ is the “best” estimate of $\sigma^2$ we have (use the fullest model $F$).
■ $SSE(M) = \|Y - \hat{Y}_M\|^2$ is the SSE of the model $M$.
■ $p(M)$ is the number of predictors in $M$, or the degrees of freedom used up by the model.
■ Based on an estimate of
$$\frac{1}{\sigma^2} E\left( \sum_{i=1}^n \left(\hat{Y}_i - E(Y_i)\right)^2 \right) = \frac{1}{\sigma^2} \sum_{i=1}^n \left( \left(E(\hat{Y}_i - Y_i)\right)^2 + \mathrm{Var}(\hat{Y}_i) \right).$$
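
A minimal R sketch of the formula above (sub and full are hypothetical lm fits on the same data, full being the fullest model):

    Cp <- function(sub, full) {
      sigma2 <- sum(resid(full)^2) / df.residual(full)   # sigma^2-hat from the fullest model
      sum(resid(sub)^2) / sigma2 - nobs(sub) + 2 * length(coef(sub))
    }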

AIC & BIC

■ Mallows’ $C_p$ is (almost) a special case of the Akaike Information Criterion (AIC):
$$AIC(M) = -2 \log L(M) + 2\, p(M).$$
■ $L(M)$ is the likelihood of the parameters in model $M$ evaluated at the MLE (Maximum Likelihood Estimator).
■ Schwarz’s Bayesian Information Criterion (BIC):
$$BIC(M) = -2 \log L(M) + p(M) \cdot \log n.$$
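
Both criteria are built into R (fit is a hypothetical fitted model):

    AIC(fit)                        # -2 log L + 2 * (number of parameters)
    BIC(fit)                        # -2 log L + log(n) * (number of parameters)
    AIC(fit, k = log(nobs(fit)))    # identical to BIC(fit)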

Maximum likelihood estimation

■ If the model is correct, then the log-likelihood of $(\beta, \sigma)$ is
$$\log L(\beta, \sigma \mid X, Y) = -\frac{n}{2}\left(\log(2\pi) + \log \sigma^2\right) - \frac{1}{2\sigma^2}\, \|Y - X\beta\|^2,$$
where $Y$ is the vector of observed responses.
■ The MLE for $\beta$ in this case is the same as the least squares estimate, because the first term does not depend on $\beta$.
■ MLE for $\sigma^2$:
$$\frac{\partial}{\partial \sigma^2} \log L(\beta, \sigma) \Big|_{\hat{\beta},\, \hat{\sigma}^2} = -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}\, \|Y - X\hat{\beta}\|^2 = 0.$$
■ Solving for $\hat{\sigma}^2$:
$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\, \|Y - X\hat{\beta}\|^2 = \frac{1}{n}\, SSE(M).$$
Note that this MLE is biased.
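
A minimal R check (fit is a hypothetical lm object): logLik() for a linear model plugs in exactly this biased MLE of sigma^2.

    n   <- nobs(fit)
    sse <- sum(resid(fit)^2)
    s2  <- sse / n                                    # biased MLE of sigma^2
    ll  <- -(n / 2) * (log(2 * pi) + log(s2) + 1)     # log-likelihood at the MLE
    all.equal(ll, as.numeric(logLik(fit)))            # TRUE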

AIC for a linear model

■ Using $\hat{\beta}_{MLE} = \hat{\beta}$ and
$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\, SSE(M),$$
we see that the AIC of a multiple linear regression model is
$$AIC(M) = n\left(\log(2\pi) + \log(SSE(M)) - \log n\right) + n + 2\left(p(M) + 1\right).$$
■ If $\sigma^2$ is known, then
$$AIC(M) = n\left(\log(2\pi) + \log(\sigma^2)\right) + \frac{SSE(M)}{\sigma^2} + 2\, p(M),$$
which is almost $C_p(M) + K_n$ for a constant $K_n$ that depends only on $n$ and $\sigma^2$, not on the model $M$.
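
A minimal R check of the first formula, which counts p(M) + 1 estimated parameters (the coefficients plus sigma^2); fit is a hypothetical lm object:

    n   <- nobs(fit)
    sse <- sum(resid(fit)^2)
    pM  <- length(coef(fit))                           # df used by the mean model
    aic <- n * (log(2 * pi) + log(sse) - log(n)) + n + 2 * (pM + 1)
    all.equal(aic, AIC(fit))                           # TRUE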

Search strategies

■ “Best subset”: search all possible models and take the one with highest $R^2_a$ (adjusted $R^2$) or lowest $C_p$.
■ Stepwise (forward, backward or both): useful when the number of predictors is large. Choose an initial model and be “greedy.”
■ “Greedy” means always taking the biggest jump (up or down) in your selected criterion.

Implementations in R

■ “Best subset”: use the function leaps. Works only for multiple linear regression models.
■ Stepwise: use the function step. Works for any model with an Akaike Information Criterion (AIC). In multiple linear regression, AIC is (almost) a linear function of $C_p$.
■ Here is an example (sketched below).
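
A minimal sketch of both tools, not the original example (hypothetical data frame dat with response y and predictors x1, x2, x3):

    library(leaps)
    X <- model.matrix(~ x1 + x2 + x3, data = dat)[, -1]    # drop intercept column
    all.subs <- leaps(X, dat$y, method = "Cp")             # exhaustive search
    all.subs$which[which.min(all.subs$Cp), ]               # best model's predictors

    full <- lm(y ~ x1 + x2 + x3, data = dat)
    step(full, direction = "both")                         # greedy search on AIC
    step(full, direction = "both", k = log(nobs(full)))    # use BIC instead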

Caveats

■ Many other “criteria” have been proposed.
■ Some work well for some types of data, others for other types.
■ These criteria are not “direct measures” of predictive power.
■ Later we will see cross-validation, which gives an estimate of predictive power.
