Lecture 3 Residual Analysis + Generalized Linear Models
Colin Rundel 1/23/2017
1 Residual Analysis
2 Atmospheric CO2 (ppm) from Mauna Loa
360 co2
350
1988 1992 1996 date
3 360 co2 350
1988 1992 1996 date
2.5
0.0 resid −2.5
1988 1992 1996 date
Where to start?
Well, it looks like stuff is going up on average …
4 Where to start?
Well, it looks like stuff is going up on average …
360 co2 350
1988 1992 1996 date
2.5
0.0 resid −2.5
1988 1992 1996 date
4 2.5
0.0 resid −2.5
1988 1992 1996 date
1.0 0.5 0.0
resid2 −0.5 −1.0
1988 1992 1996 date
and then?
Well there is some periodicity lets add the month …
5 and then?
Well there is some periodicity lets add the month …
2.5
0.0 resid −2.5
1988 1992 1996 date
1.0 0.5 0.0
resid2 −0.5 −1.0
1988 1992 1996 date
5 1.0 0.5 0.0
resid2 −0.5 −1.0
1988 1992 1996 date 0.8
0.4
0.0 resid3 −0.4
−0.8 1988 1992 1996 date
and then and then?
Maybe there is some different effect by year …
6 and then and then?
Maybe there is some different effect by year …
1.0 0.5 0.0
resid2 −0.5 −1.0
1988 1992 1996 date 0.8
0.4
0.0 resid3 −0.4
−0.8 1988 1992 1996 date
6 Too much
(lm= lm(co2~date+ month+ as.factor(year), data=co2_df)) ## ##Call: ##lm(formula=co2~date+month+as.factor(year),data=co2_df) ## ##Coefficients: ## (Intercept) date monthAug ## -2.645e+03 1.508e+00 -4.177e+00 ## monthDec monthFeb monthJan ## -3.612e+00 -2.008e+00 -2.705e+00 ## monthJul monthJun monthMar ## -2.035e+00 -3.251e-01 -1.227e+00 ## monthMay monthNov monthOct ## 4.821e-01 -4.838e+00 -6.135e+00 ## monthSepas.factor(year)1986as.factor(year)1987 ## -6.064e+00 -2.585e-01 9.722e-03 ##as.factor(year)1988as.factor(year)1989as.factor(year)1990 ## 1.065e+00 9.978e-01 7.726e-01 ##as.factor(year)1991as.factor(year)1992as.factor(year)1993 ## 7.067e-01 1.236e-02 -7.911e-01 ##as.factor(year)1994as.factor(year)1995as.factor(year)1996 ## -4.146e-01 1.119e-01 3.768e-01 ##as.factor(year)1997 7 ## NA 360 co2
350
1988 1992 1996 date
8 Generalized Linear Models
9 Background
A generalized linear model has three key components:
1. a probability distribution (from the exponential family) that describes your response variable 2. a linear predictor η = Xβ, 3. and a link function g such that g(E(Y|X)) = µ = η.
11 Model Specification
A generalized linear model for count data where we assume the outcome variable follows a poisson distribution (mean = variance).
∼ Yi Poisson(λi) log E(Y|X) = log λ = Xβ
12 Example - AIDS in Belgium
AIDS cases in Belgium
250
200
150 cases 100
50
0 1985 1990 year
13 Frequentist glm fit
g = glm(cases~year, data=aids, family=poisson) pred = data_frame(year=seq(1981,1993,by=0.1)) pred$cases = predict(g, newdata=pred, type = ”response”)
AIDS cases in Belgium
300
200 cases
100
0 1985 1990 year
14 Residuals
Standard residuals:
− − ˆ ri = Yi ˆYi = Yi λi
Pearson residuals:
Y − E(Y |X) Y − λˆ = √i i = i√ i ri | Var(Yi X) ˆ λi
Deviance residuals: √ − ˆ − − ˆ di = sign(yi λi) 2(yi log(yi/λi) (yi λi))
15 Deviance and deviance residuals
Deviance can be interpreted as the difference between your model’s fit and
the fit of an ideal model (where E(ˆYi) = Yi).
∑n L | − L |ˆ 2 D = 2( (Y θbest) (Y θ)) = di i=1
Deviance is a measure of goodness of fit in a similar way to the residual sum of squares (which is just the sum of squared standard residuals).
16 Deviance residuals derivation
17 Residual plots
standard pearson deviance
2.5 2.5
0 0.0 0.0 residual
−2.5 −50 −2.5
−5.0 1985 1990 1985 1990 1985 1990 year
18 print(aids_fit)
AIDS cases in Belgium
300
200 cases
100
0 1985 1990 year
19 Quadratic fit
g2 = glm(cases~year+I(year^2), data=aids, family=poisson) pred2 = data_frame(year=seq(1981,1993,by=0.1)) pred2$cases = predict(g2, newdata=pred, type = ”response”)
AIDS cases in Belgium
250
200
150 cases 100
50
0 1985 1990 year
20 Quadratic fit - residuals
standard pearson deviance
20 1 1
10
0 0 residual 0
−1 −1 −10
1985 1990 1985 1990 1985 1990 year
21 Bayesian Poisson Regression Model
## model{ ## # Likelihood ## for(i in 1:length(Y)){ ## Y[i] ~ dpois(lambda[i]) ## log(lambda[i]) <- beta[1] + beta[2]*X[i] ## ## # In-sample prediction ## Y_hat[i] ~ dpois(lambda[i]) ## } ## ## # Prior for beta ## for(j in 1:2){ ## beta[j] ~ dnorm(0,1/100) ## } ## }
22 Bayesian Model fit
m = jags.model( textConnection(poisson_model1), quiet = TRUE, data = list(Y=aids$cases, X=aids$year) ) update(m, n.iter=1000, progress.bar=”none”) samp = coda.samples( m, variable.names=c(”beta”,”lambda”,”Y_hat”), n.iter=5000, progress.bar=”none” )
23 MCMC Diagnostics
Trace of beta[1] Density of beta[1] −1 0.2 −3 −5 0.0 2000 3000 4000 5000 6000 7000 −6 −4 −2 0
Iterations N = 5000 Bandwidth = 0.3209
Trace of beta[2] Density of beta[2] 0.0040 400 0
0.0020 2000 3000 4000 5000 6000 7000 0.002 0.003 0.004 0.005
Iterations N = 5000 Bandwidth = 0.0001615
24 Model fit?
AIDS cases in Belgium
300
200 cases
100
0 1985 1990 year
25 summary(g) ## ## Call: ## glm(formula = cases ~ year, family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -4.6784 -1.5013 -0.2636 2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.971e+02 1.546e+01 -25.68 <2e-16 *** ## year 2.021e-01 7.771e-03 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 872.206 on 12 degrees of freedom ## Residual deviance: 80.686 on 11 degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4
What went wrong?
26 What went wrong?
summary(g) ## ## Call: ## glm(formula = cases ~ year, family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -4.6784 -1.5013 -0.2636 2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.971e+02 1.546e+01 -25.68 <2e-16 *** ## year 2.021e-01 7.771e-03 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 872.206 on 12 degrees of freedom ## Residual deviance: 80.686 on 11 degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4 26 summary(glm(cases~I(year-1981), data=aids, family=poisson)) ## ## Call: ## glm(formula = cases ~ I(year - 1981), family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -4.6784 -1.5013 -0.2636 2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 3.342711 0.070920 47.13 <2e-16 *** ## I(year - 1981) 0.202121 0.007771 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 872.206 on 12 degrees of freedom ## Residual deviance: 80.686 on 11 degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4
27 Revising the model
## model{ ## # Likelihood ## for(i in 1:length(Y)){ ## Y[i] ~ dpois(lambda[i]) ## log(lambda[i]) <- beta[1] + beta[2]*(X[i] - 1981) ## ## # In-sample prediction ## Y_hat[i] ~ dpois(lambda[i]) ## } ## ## # Prior for beta ## for(j in 1:2){ ## beta[j] ~ dnorm(0,1/100) ## } ## }
28 MCMC Diagnostics
Trace of beta[1] Density of beta[1] 3.5 4 3.3 2 0 3.1 2000 3000 4000 5000 6000 7000 3.1 3.2 3.3 3.4 3.5 3.6
Iterations N = 5000 Bandwidth = 0.01336
Trace of beta[2] Density of beta[2] 40 0.21 20 0 0.18 2000 3000 4000 5000 6000 7000 0.18 0.19 0.20 0.21 0.22 0.23
Iterations N = 5000 Bandwidth = 0.001456
29 Model fit
AIDS cases in Belgium (Linear Poisson Model)
300
200 cases
100
0 1985 1990 year
30 Bayesian Residual Plots
standard pearson deviance 4 50
2 2
0 0 0
−2 −2 post_mean −50
−4 −4
−100 −6 −6 1985 1990 1985 1990 1985 1990 year
31 Model fit
AIDS cases in Belgium (Quadratic Poisson Model) 300
200 cases
100
0 1985 1990 year
32 Bayesian Residual Plots
standard pearson deviance 40
2 2
20 1 1
0 0 0 post_mean
−1 −1 −20
−2 −2
1985 1990 1985 1990 1985 1990 year
33 Negative Binomial Regression
34 mean(aids$cases) ## [1] 124.7692 var(aids$cases) ## [1] 8124.526
Overdispersion
One of the properties of the Poisson distribution is that if X ∼ Pois(λ) then E(X) = Var(X) = λ.
If we are constructing a model where we claim that our response variable Y follows a Poisson distribution then we are making a very strong assumption which has implactions for both inference and prediction.
35 Overdispersion
One of the properties of the Poisson distribution is that if X ∼ Pois(λ) then E(X) = Var(X) = λ.
If we are constructing a model where we claim that our response variable Y follows a Poisson distribution then we are making a very strong assumption which has implactions for both inference and prediction.
mean(aids$cases) ## [1] 124.7692 var(aids$cases) ## [1] 8124.526
35 Negative binomial regession
If we define
| ∼ Yi Zi Pois(λi Zi) ∼ Zi Gamma(θi, θi)
then the marginal distribution of Yi will be negative binomial with,
E(Yi) = λi 2 Var(Yi) = λi + λi /θi
36 Model
## model{ ## for(i in 1:length(Y)) ## { ## Z[i] ~ dgamma(theta, theta) ## log(lambda[i]) <- beta[1] + beta[2]*(X[i] - 1981) + beta[3]*(X[i] - 1981)^2 ## ## lambda_Z[i] <- Z[i]*lambda[i] ## ## Y[i] ~ dpois(lambda_Z[i]) ## Y_hat[i] ~ dpois(lambda_Z[i]) ## } ## ## for(j in 1:3){ ## beta[j] ~ dnorm(0, 1/100) ## } ## ## log_theta ~ dnorm(0, 1/100) ## theta <- exp(log_theta) ## }
37 Negative Binomial Model fit
AIDS cases in Belgium AIDS cases in Belgium (Quadratic Poisson Model) (Quadratic Negative Binomial Model) 300 300
200 200 cases cases
100 100
0 0 1985 1990 1985 1990 year year
38 Bayesian Residual Plots
standard pearson 40
2
20 1
0 0 post_mean
−1 −20
−2
1985 1990 1985 1990 year
39