Model Selection in GLMs
• Last class: estimability/identifiability, analysis of deviance, standard errors & confidence intervals
(should be able to implement frequentist GLM analyses!)
• Today: standard frequentist methods for model selection
Model Uncertainty
In most data analyses, there is uncertainty about the model & you need to do some form of model comparison.
1. Select q out of p predictors to form a parsimonious model
2. Select the link function (e.g., logit or probit)
3. Select the distributional form (normal, t-distributed)
We focus first on problem 1: variable selection.
Frequentist Strategies for Variable Selection
It is standard practice to sequentially add or drop variables from a model one at a time and examine the change in model fit.
If model fit does not improve much (or significantly at some arbitrary level) when adding a predictor, then that predictor is left out of the model (forward selection).
Similarly, if model fit does not decrease much when removing a predictor, then that predictor is removed (backwards elimination).
Implementation: step() function in S-PLUS.
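The two strategies above can be sketched in R, whose step() has the same interface as the S function. This is a minimal illustration on made-up data, not the analysis from these slides: the data set `dat`, the six candidate predictors and the coefficient values are all assumptions chosen for the example.

```r
# Minimal sketch with made-up data: 6 candidate predictors, only V1 and V2 matter
set.seed(1)
n <- 200
X <- as.data.frame(matrix(rnorm(n * 6), n, 6))     # columns named V1, ..., V6
eta <- -0.5 + 1.0 * X$V1 - 1.0 * X$V2              # true linear predictor
dat <- data.frame(y = rbinom(n, 1, plogis(eta)), X)

fit.null <- glm(y ~ 1, family = binomial, data = dat)   # intercept only
fit.full <- glm(y ~ ., family = binomial, data = dat)   # all candidates

# Forward selection: start from the null model, add one predictor at a time
fit.fwd <- step(fit.null, direction = "forward",
                scope = list(lower = ~ 1,
                             upper = ~ V1 + V2 + V3 + V4 + V5 + V6),
                trace = 0)

# Backwards elimination: start from the full model, drop one predictor at a time
fit.bwd <- step(fit.full, direction = "backward", trace = 0)

formula(fit.fwd)
formula(fit.bwd)
```

Both calls use AIC as the measure of fit, which is the default in step(); the two final formulas need not coincide.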
Illustration
• Suppose we have 30 candidate predictors, X1,...,X30.
• The true model is
$\operatorname{logit} \Pr(y_i = 1 \mid x_i) = x_i'\beta,$
where $x_i = (x_{i1}, \ldots, x_{i,30})'$ $(i = 1, \ldots, n = 1000)$
• Note that predictors with $\beta_j = 0$ are effectively not in the model
• Let $\gamma_j = 1(\beta_j \neq 0)$ be a 0/1 indicator that the jth predictor is included
• Interest focuses on selection of the subset of q ≤ p important predictors out of the p candidate predictors:
$X_\gamma = \{X_j : \gamma_j = 1,\ \beta_j \neq 0\} \sim (q \times 1).$
• There is a list of $2^p$ possible models corresponding to the different subsets $X_\gamma \subset X$.
• The size of the model list, $2^p$, is calculated by noting that we can either include or exclude each of the p candidate predictors.
• Letting $\mathcal{M}$ denote the model list, we can introduce a model index $M \in \mathcal{M}$
• The trick is how to decide between competing models $M \in \mathcal{M}$ and $M' \in \mathcal{M}$.
• An important issue is also efficient searching of the (potentially huge!) model space $\mathcal{M}$ for good models
• Often not possible to visit and perform calculations for all models in $\mathcal{M}$ ($2^{30} = 1073741824$)
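The size of the model list can be checked directly, and for a small p the whole list can be written out. The sketch below is only an illustration; `p.small` and the column names `gamma1`, `gamma2`, `gamma3` are hypothetical.

```r
p <- 30
2^p                      # number of candidate models for p = 30

# For a small p the full list can be enumerated: one row per model,
# one 0/1 inclusion indicator (gamma_j) per predictor
p.small <- 3
gamma <- expand.grid(rep(list(0:1), p.small))
names(gamma) <- paste0("gamma", 1:p.small)
gamma
nrow(gamma)              # 2^3 rows
```

Each row of `gamma` is one model; the row of all zeros is the intercept-only model and the row of all ones is the full model.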
• Returning to the logistic regression example with 30 candidate predictors, we simulate data & run stepwise selection
• Simulation conditions: n = 1000, q = 5, p = 30,
$x_i \sim N_p(0, I_p)$,
$X_1, \ldots, X_5$ have coefficients $\beta_\gamma = (1, 2, 3, 4, 5)/2$,
$X_6, \ldots, X_{30}$ have coefficients 0.
• S-PLUS implementation:
n<- 1000 # sample size
p<- 30 # number of candidate predictors
X<- matrix(0,n,p) # simulate predictor values
for(j in 1:p) X[,j]<- rnorm(n,0,1)
Xl<- data.frame(X) # columns named X.1,...,X.30 in S-PLUS (X1,...,X30 in R)
beta<- c((1:5)/2,rep(0,25)) # true values of parameters
py<- 1/(1+exp(0.5-X%*%beta)) # probability of response
# note that intercept is assumed to be -0.5
y<- rbinom(n,1,py) # simulate response from true model
fit.true<- glm(y ~ X.1 + X.2 + X.3 + X.4 + X.5,
    family=binomial, data=Xl) # fit true model
ind<- sample(1:p) # ind is not defined in the original listing; a random permutation is assumed
X2<- Xl[,ind] # scramble order of predictors
fit.full<- glm(y ~ ., family=binomial, data=X2)
# implement stepwise selection using AIC criteria
fit.step<- step(fit.full, scope=list(upper = ~., lower=~1),
    trace=F) # intercept in all
fit.step$anova # summarize results
• Comments: step() starts with the full model & performs backwards elimination, sequentially removing predictors whose removal reduces the AIC
• Definition of Akaike Information Criterion (AIC):
$\mathrm{AIC} = -2\,(\text{maximized log likelihood}) + 2\,(\#\ \text{parameters})$
• Why put in the number of parameters?
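One way to see why the penalty is needed: the maximized log likelihood never goes down when predictors are added, even pure noise, so on its own it always favours the largest model. The sketch below uses made-up data (the predictor `x`, the matrix `noise` and the helper `aic.by.hand` are assumptions for illustration) and also computes the AIC from its definition.

```r
set.seed(2)
n <- 200
x <- rnorm(n)
noise <- matrix(rnorm(n * 5), n, 5)        # 5 predictors unrelated to y
y <- rbinom(n, 1, plogis(-0.5 + x))

fit.small <- glm(y ~ x, family = binomial)
fit.big <- glm(y ~ x + noise, family = binomial)

# The maximized log likelihood cannot decrease when predictors are added
logLik(fit.small)
logLik(fit.big)

# AIC from the definition: -2 * (maximized log likelihood) + 2 * (# parameters)
aic.by.hand <- function(fit) {
  -2 * as.numeric(logLik(fit)) + 2 * length(coef(fit))
}
c(small = aic.by.hand(fit.small), big = aic.by.hand(fit.big))
c(small = AIC(fit.small), big = AIC(fit.big))   # built-in, for comparison
```

Here the number of parameters counts the intercept, which is how AIC() counts them for a binomial GLM.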
• Results from stepwise analysis of simulated data:
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
y ~ X.4 + X.15 + X.26 + X.13 + X.16 + X.25 + X.9 + X.3 +
X.18 + X.29 + X.1 + X.30 + X.24 + X.11 + X.12 + X.7 +
X.21 + X.10 + X.17 + X.8 + X.5 + X.28 + X.22 + X.27 +
X.2 + X.23 + X.6 + X.14 + X.19 + X.20
Final Model:
y ~ X.4 + X.15 + X.26 + X.25 + X.3 + X.1 + X.12 + X.17 +
X.5 + X.28 + X.22 + X.2
    Step     Df  Deviance  Resid. Df  Resid. Dev       AIC
 1                               969    528.2075  590.2075
 2  - X.9     1  0.018536        970    528.2260  588.2260
 3  - X.20    1  0.020528        971    528.2466  586.2466
 4  - X.11    1  0.024500        972    528.2711  584.2711
 5  - X.18    1  0.045945        973    528.3170  582.3170
 6  - X.13    1  0.057255        974    528.3743  580.3743
 7  - X.16    1  0.080202        975    528.4545  578.4545
 8  - X.24    1  0.100360        976    528.5548  576.5548
 9  - X.21    1  0.278245        977    528.8331  574.8331
10  - X.30    1  0.296931        978    529.1300  573.1300
11  - X.27    1  0.440982        979    529.5710  571.5710
12  - X.10    1  0.445923        980    530.0169  570.0169
13  - X.7     1  0.447855        981    530.4648  568.4648
14  - X.23    1  0.625145        982    531.0899  567.0899
15  - X.8     1  1.000717        983    532.0906  566.0906
16  - X.6     1  1.025047        984    533.1157  565.1157
17  - X.29    1  1.366004        985    534.4817  564.4817
18  - X.19    1  1.445814        986    535.9275  563.9275
19  - X.14    1  1.623275        987    537.5508  563.5508
• Selected model contains predictors X1, ..., X5 belonging to the true model (we got lucky & the coefficients are large)
• Also contains predictors X12, X15, X17, X22, X25, X26, X28, which should have been excluded (not so good)
• Results of maximum likelihood estimation for true model:
Value Std. Error t value
(Intercept) -0.4494802 0.1106592 -4.061844
X.1 0.5525122 0.1115489 4.953095
X.2 0.9541896 0.1176940 8.107376
X.3 1.7647030 0.1489397 11.848442
X.4 2.0024375 0.1638457 12.221483
X.5 2.8375060 0.2018184 14.059703
• Results of maximum likelihood estimation for full model (focusing on important predictors):
Value Std. Error t value
(Intercept) -0.47183494 0.1181678 -3.9929235
X.1 0.58449537 0.1214352 4.8132281
X.2 1.06182854 0.1289719 8.2330242
X.3 1.91001906 0.1626665 11.7419354
X.4 2.13608265 0.1772401 12.0519146
X.5 3.04082332 0.2230194 13.6347913
• Full model MLEs differ somewhat from MLEs under true model
and standard errors are higher
• Now, consider the estimates from the final selected model:
Value Std. Error t value
(Intercept) -0.4661101 0.1148807 -4.057341
X.1 0.5675410 0.1157651 4.902522
X.2 1.0170453 0.1233606 8.244489
X.3 1.8653303 0.1574395 11.847921
X.4 2.1040769 0.1731561 12.151332
X.5 2.9939133 0.2168262 13.807893
X.12 0.1725280 0.1124232 1.534630
X.15 0.1593174 0.1071552 1.486790
X.17 -0.2245846 0.1059882 -2.118958
X.22 -0.2398270 0.1128883 -2.124464
X.25 -0.1953539 0.1166990 -1.673999
X.26 -0.2685661 0.1112931 -2.413143
X.28 -0.1726031 0.1110436 -1.554373
• Comparing the parameters for the predictors that should have been included to the earlier results, we find results in between those for true model and full model.
• Interesting result: several of the predictors known to be unimportant have coefficients “significantly” different from zero (X.17, X.22 and X.26 have |t| > 2)
• This is a very common occurrence when using stepwise selection!
• If we had chosen smaller values for the coefficients for the important predictors, we would have missed one or more of them (based on repeating simulation)
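The claim that unimportant predictors often survive stepwise selection, and then look "significant", can be checked by repeating a simulation. The sketch below is a scaled-down assumption (n = 200, p = 10, q = 2, 20 replicates, all coefficients equal to 1), not the 30-predictor setting of these slides, and the counters `n.false` and `n.signif` are hypothetical names.

```r
set.seed(3)
n <- 200; p <- 10; q <- 2; n.rep <- 20     # illustrative settings, not the slides'
beta <- c(rep(1, q), rep(0, p - q))        # X1, X2 matter; X3, ..., X10 do not

n.false <- numeric(n.rep)                  # null predictors kept by step()
n.signif <- numeric(n.rep)                 # ... of which have |z| > 1.96
for (r in 1:n.rep) {
  X <- data.frame(matrix(rnorm(n * p), n, p))          # columns X1, ..., X10
  y <- rbinom(n, 1, plogis(-0.5 + as.matrix(X) %*% beta))
  dat <- data.frame(y = y, X)
  fit <- step(glm(y ~ ., family = binomial, data = dat), trace = 0)
  z <- summary(fit)$coefficients[, "z value"]
  null.kept <- intersect(names(z), paste0("X", (q + 1):p))
  n.false[r] <- length(null.kept)
  n.signif[r] <- sum(abs(z[null.kept]) > 1.96)
}
mean(n.false)     # average number of null predictors retained
mean(n.signif)    # average number retained and nominally significant
```

The z statistics are taken from the selected model as if it had been fixed in advance, which is exactly the naive use of the final model that the slides warn about.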
Some Issues with Stepwise Procedures
Normal Linear & Orthogonal Case
• In normal models with orthogonal X, forward and backwards
selection will yield the same model (i.e., the selection process is
not order-dependent).
• However, the selection of the significance level for inclusion in
the model is arbitrary and can have a large impact on the final
model selected. Potentially, one can use some goodness of fit
criterion (e.g., AIC, BIC).
• In addition, if interest focuses on inference on the β’s, stepwise procedures can result in biased estimates and invalid hypothesis tests (i.e., if one naively uses the final model selected without correction for the selection process)
Issues with Stepwise Procedures (General Case)
For GLMs other than orthogonal, linear Gaussian models, the order in which parameters are added or dropped from the model can have a large impact on the final model selected.
This is not good, since choices of the order of selection are typically
(if not always) arbitrary.
Model selection is a challenging area, but there are alternatives to stepwise procedures.
Goodness of Fit Criteria
There are a number of criteria that have been proposed for comparing models based on a measure of goodness of fit penalized by model complexity
1. Akaike’s Information Criterion (AIC):
$\mathrm{AIC}_M = D_M + 2 p_M,$
where $D_M$ is the deviance for model M and $p_M$ is the number of predictors
2. Bayesian Information Criterion (BIC):
$\mathrm{BIC}_M = D_M + p_M \log(n).$
Note that the deviance decreases (or at least does not increase) as variables are added, because the maximized likelihood increases.
The AIC and BIC differ in the penalty for model complexity, with the AIC using twice the number of parameters and the BIC using the number of parameters multiplied by the logarithm of the sample size.
The BIC places a larger penalty on the number of predictors whenever log(n) > 2 (i.e., n ≥ 8), and hence more parsimonious models tend to be selected.
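The deviance-based formulas can be checked against R's built-in AIC() and BIC() on a small made-up data set. The sketch assumes ungrouped 0/1 responses (where the deviance equals −2 times the maximized log likelihood) and counts the intercept in $p_M$; both are assumptions of this illustration, as are the names `D`, `pM` and `n.grid`.

```r
set.seed(4)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

D <- deviance(fit)          # D_M
pM <- length(coef(fit))     # parameters, counting the intercept (an assumption)

c(AIC = D + 2 * pM, BIC = D + pM * log(n))
c(AIC = AIC(fit), BIC = BIC(fit))     # agree for 0/1 responses

# Per-parameter penalties: the BIC penalty exceeds the AIC penalty once log(n) > 2
n.grid <- c(5, 8, 100, 1000)
data.frame(n = n.grid, aic.penalty = 2, bic.penalty = log(n.grid))
```

For grouped binomial or other families the deviance and −2 log likelihood differ by a constant that does not depend on the model, so model rankings are unaffected.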
Uncertainty in the link function & distribution
• Note that we can similarly use the AIC or BIC criteria to select
the link function or distribution
• For example, suppose that we have binary outcome data
• There are many possible link functions - any smooth, monotone function mapping from $\mathbb{R} \to [0, 1]$ (i.e., cumulative distribution functions for continuous densities)
• We simulated data from a logistic regression model with $x_i = (1, \mathrm{dose}_i)$, where $\mathrm{dose}_i \sim \mathrm{Uniform}(0, 1)$, $i = 1, \ldots, 100$.
• The parameters were chosen to be $\beta = (-3, 5)$
• As an alternative to the logistic model, we considered the probit:
$\Pr(y_i = 1 \mid x_i) = \Phi(x_i'\beta),$
where $\Phi(z) = \int_{-\infty}^{z} (2\pi)^{-1/2} \exp(-u^2/2)\, du$ is the standard normal cdf
• We also considered the complementary log-log model:
$\Pr(y_i = 1 \mid x_i) = 1 - \exp\{-\exp(x_i'\beta)\}.$
[Figure: Pr(response) versus dose]
Summary of results
• Logistic regression results:
$\hat\beta = (-3.67, 6.14)$, AIC = 94.0211, BIC = 99.23145
• Probit regression results:
$\hat\beta = (-2.14, 3.60)$, AIC = 93.9993, BIC = 99.20964
• Complementary log-log results:
$\hat\beta = (-3.14, 4.39)$, AIC = 94.21893, BIC = 99.42927
• Low values of AIC & BIC are preferred, so we select the probit model (which happens to be wrong), although the three fits differ by less than 0.25 on either criterion
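A comparison of this kind can be sketched in R as follows. The simulation settings follow the description above (n = 100, dose uniform on (0, 1), β = (−3, 5)), but the seed is arbitrary, so the numbers will not reproduce the results reported on the slide; the names `links` and `fits` are assumptions.

```r
set.seed(5)
n <- 100
dose <- runif(n)
y <- rbinom(n, 1, plogis(-3 + 5 * dose))    # logistic truth, beta = (-3, 5)

links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(l) glm(y ~ dose, family = binomial(link = l)))
names(fits) <- links

# One row per link: estimates and both criteria (lower is better)
t(sapply(fits, function(f) c(coef(f), AIC = AIC(f), BIC = BIC(f))))
```

All three models have two parameters, so within one data set AIC and BIC rank them identically; the ranking is driven entirely by the maximized likelihood.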
Some things to think about
• It is very often the case that many models are consistent with
the data
• This is particularly true when there is a large number of models
in the list of plausible models
• Ideally, substantive information can be brought to bear to reduce
the size of the list
• For example, certain models may be more consistent with biology
• However, one is typically still faced with many possible models
& concerned about sensitivity of inferences to the model chosen
or selected by some algorithm
• Given the selection bias that occurs, it may be better to focus
a priori on a single model rather than run stepwise selection
• However, better yet would be to formally account for model uncertainty (e.g., through Bayesian model averaging)
DDE and Pre-Term Birth Example
(Longnecker et al., 2001, Lancet)
• Scientific Interest: Association between DDE exposure (continuous predictor) and preterm birth (binary response).
• Previous Data: DDT is a reproductive toxin in birds, rabbits
and possibly sea lions. In humans, DDT has been associated
with preterm birth in small studies.
• Current Data: Data drawn from the US Collaborative Perinatal Project (CPP) - complete data for n = 2380 children, of whom 361 were born preterm.
• Possibly Confounding Factors: Study center, serum triglycerides
and cholesterol, infant ethnicity and gender, mother’s age, height,
body mass index before pregnancy, rate of weight gain during
pregnancy, parity, socioeconomic status, and smoking during
pregnancy.
Notation: $y_i = 1$ if preterm birth and $y_i = 0$ if full-term birth
$d_i$ = dose of DDE for ith woman $(i = 1, \ldots, n)$
$z_i = (z_{i1}, \ldots, z_{ip})'$ = vector of confounding factors
$x_i$ = vector of predictors (i.e., covariates)
What model should we use to investigate the association between dose of DDE and the probability of preterm birth?
Logistic Regression Analysis
$\log \frac{\Pr(y_i = 1 \mid x_i)}{\Pr(y_i = 0 \mid x_i)} = \beta d_i + \alpha_1 z_{i1} + \cdots + \alpha_p z_{ip} = \beta d_i + z_i'\alpha,$
where β is the increase in the log-odds of a response associated with a unit increase in dose of DDE.
This log-odds ratio is adjusted for possible confounding variables $z_i$.
Typically, investigators are interested in the exposure odds ratio
$\frac{\Pr(y_i = 1 \mid d_i = c, z_i)/\Pr(y_i = 0 \mid d_i = c, z_i)}{\Pr(y_i = 1 \mid d_i = c - 1, z_i)/\Pr(y_i = 0 \mid d_i = c - 1, z_i)}$
associated with a unit increase in exposure (e.g., DDE dose).
Exponentiating the logit model, the exposure odds ratio is equal to
$\frac{\exp\{\beta c + z_i'\alpha\}}{\exp\{\beta (c - 1) + z_i'\alpha\}} = \exp(\beta),$
which is simply the exponentiated regression coefficient for exposure.
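The cancellation can be verified numerically: under the logit model the odds ratio for a unit increase in dose is the same at every dose level and every confounder value. The coefficient values and the single confounder `z` below are hypothetical, chosen only for the check.

```r
beta <- 0.3                      # hypothetical log-odds ratio for dose
alpha <- c(-2, 0.5)              # hypothetical intercept and one confounder effect
odds <- function(d, z) exp(beta * d + alpha[1] + alpha[2] * z)

# Odds ratio for a unit increase in dose, at several (dose, z) combinations
grid <- expand.grid(dose = c(1, 5, 10), z = c(-1, 0, 2))
grid$OR <- odds(grid$dose, grid$z) / odds(grid$dose - 1, grid$z)
grid
exp(beta)                        # every OR above equals this
```

The intercept is treated here as one element of the confounder vector, so it cancels together with the rest of the confounder term.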
Rare Disease Assumption
Often, investigators are interested in the Relative Risk,
$RR = \frac{\Pr(y_i = 1 \mid d_i = c, z_i)}{\Pr(y_i = 1 \mid d_i = c - 1, z_i)},$
which (unfortunately) depends on the value of c.
However, under the assumption that the disease (i.e., the response) is rare, $\Pr(y_i = 0 \mid d_i = c, z_i) \approx 1$, so
$\Pr(y_i = 1 \mid d_i = c, z_i) \approx \frac{\Pr(y_i = 1 \mid d_i = c, z_i)}{\Pr(y_i = 0 \mid d_i = c, z_i)},$
and it follows that the Relative Risk is approximately equal to the exposure odds ratio.
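A small numerical check shows how the quality of the approximation depends on how rare the response is. The odds ratio of 2 and the two baseline log-odds values are hypothetical, and `risk` and `compare` are helper names introduced for this sketch.

```r
beta <- log(2)                                  # hypothetical: odds ratio of 2 per unit dose
risk <- function(d, a) plogis(a + beta * d)     # a = baseline log-odds at dose 0

compare <- function(a, dose = 1) {
  p1 <- risk(dose, a)
  p0 <- risk(dose - 1, a)
  c(baseline.risk = p0,
    RR = p1 / p0,
    OR = (p1 / (1 - p1)) / (p0 / (1 - p0)))
}

res <- rbind(rare = compare(a = -6), common = compare(a = 0))
res
```

In the common case the baseline risk is 0.5 and the relative risk is 4/3 against an odds ratio of 2, so the odds ratio overstates the relative risk; in the rare case the two nearly coincide.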
Returning to the Longnecker et al. example, what assumption about the dose-response are we making by using the model:
$\log \frac{\Pr(y_i = 1 \mid x_i)}{\Pr(y_i = 0 \mid x_i)} = \beta d_i + z_i'\alpha$
Do they use this model in the published analysis?