
Lecture 3 Residual Analysis + Generalized Linear Models

Colin Rundel 1/23/2017

Residual Analysis

Atmospheric CO2 (ppm) from Mauna Loa

[Figure: co2 vs. date]

Where to start?

Well, it looks like stuff is going up on average …

[Figure: two panels, co2 vs. date and resid vs. date]

and then?

Well, there is some periodicity; let's add the month …

[Figure: two panels, resid vs. date and resid2 vs. date]

and then and then?

Maybe there is some different effect by year …

[Figure: two panels, resid2 vs. date and resid3 vs. date]
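The three steps above (trend, then month, then year) can be sketched in R as a chain of regressions, each fit to the residuals of the previous one. This is one reading of how `resid`, `resid2`, and `resid3` could have been built, not necessarily the lecture's code. The data frame is rebuilt from R's built-in `co2` series and restricted to 1985 onward as an assumption, and the names `l1`, `l2`, and `l3` are placeholders.

```r
# Assumption: co2_df has columns co2, date, month, year and covers 1985-1997.
# Rebuilt here from the built-in Mauna Loa series (monthly, Jan 1959 - Dec 1997).
co2_df = data.frame(
  co2   = as.numeric(datasets::co2),
  year  = rep(1959:1997, each = 12),
  month = factor(rep(month.abb, times = 39), levels = month.abb)
)
co2_df$date = co2_df$year + (as.integer(co2_df$month) - 1) / 12
co2_df = co2_df[co2_df$year >= 1985, ]

# Step 1: linear trend in time
l1 = lm(co2 ~ date, data = co2_df)
co2_df$resid = residuals(l1)

# Step 2: seasonal (month) effect on what the trend left behind
l2 = lm(resid ~ month, data = co2_df)
co2_df$resid2 = residuals(l2)

# Step 3: year-specific effect on what is left after trend and season
l3 = lm(resid2 ~ as.factor(year), data = co2_df)
co2_df$resid3 = residuals(l3)

# Residual spread after each step
sapply(co2_df[c("resid", "resid2", "resid3")], sd)
```

Each step can only shrink (or leave unchanged) the residual spread, which is why adding terms always looks like progress even when the model is becoming too flexible.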

Too much

(lm = lm(co2 ~ date + month + as.factor(year), data = co2_df))
## 
## Call:
## lm(formula = co2 ~ date + month + as.factor(year), data = co2_df)
## 
## Coefficients:
##         (Intercept)                 date             monthAug  
##          -2.645e+03            1.508e+00           -4.177e+00  
##            monthDec             monthFeb             monthJan  
##          -3.612e+00           -2.008e+00           -2.705e+00  
##            monthJul             monthJun             monthMar  
##          -2.035e+00           -3.251e-01           -1.227e+00  
##            monthMay             monthNov             monthOct  
##           4.821e-01           -4.838e+00           -6.135e+00  
##            monthSep  as.factor(year)1986  as.factor(year)1987  
##          -6.064e+00           -2.585e-01            9.722e-03  
## as.factor(year)1988  as.factor(year)1989  as.factor(year)1990  
##           1.065e+00            9.978e-01            7.726e-01  
## as.factor(year)1991  as.factor(year)1992  as.factor(year)1993  
##           7.067e-01            1.236e-02           -7.911e-01  
## as.factor(year)1994  as.factor(year)1995  as.factor(year)1996  
##          -4.146e-01            1.119e-01            3.768e-01  
## as.factor(year)1997  
##                  NA

[Figure: co2 vs. date]

Generalized Linear Models

Background

A generalized linear model has three key components:

1. a probability distribution (from the exponential family) that describes your response variable,
2. a linear predictor $\eta = X\beta$,
3. and a link function $g$ such that $g(E(Y|X)) = g(\mu) = \eta$.
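As a small illustration of these three components, R's family objects bundle the distribution, the link function, and its inverse. The sketch below inspects the Poisson family with its log link; the `eta` values are arbitrary and chosen only for illustration.

```r
# A family object carries the distribution name, the link g, and its inverse.
fam = poisson(link = "log")
fam$family                 # name of the response distribution
fam$link                   # name of the link function

eta = c(-1, 0, 2)          # hypothetical values of the linear predictor
mu  = fam$linkinv(eta)     # inverse link: mu = exp(eta)
fam$linkfun(mu)            # link: g(mu) = log(mu), recovers eta
fam$variance(mu)           # Poisson variance function: Var(Y) = mu
```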


Model Specification

A generalized linear model for count data where we assume the outcome variable follows a Poisson distribution (mean = variance).

$$ Y_i \sim \text{Poisson}(\lambda_i) $$

$$ \log E(Y|X) = \log \lambda = X\beta $$
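To make the specification concrete, the sketch below simulates counts from this model and recovers the coefficients with `glm`. The sample size, the covariate range, and the "true" coefficients in `beta` are made up for illustration.

```r
set.seed(1)

n    = 2000
x    = runif(n, 0, 5)            # hypothetical covariate
beta = c(0.5, 0.3)               # hypothetical true (intercept, slope)

# log(lambda) = X beta, then Y ~ Poisson(lambda)
lambda = exp(beta[1] + beta[2] * x)
y      = rpois(n, lambda)

# The log link is the default for family = poisson
fit = glm(y ~ x, family = poisson)
coef(fit)                        # should land close to beta
```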

Example - AIDS in Belgium

[Figure: AIDS cases in Belgium, cases vs. year]

Frequentist glm fit

g = glm(cases~year, data=aids, family=poisson)
pred = data.frame(year=seq(1981,1993,by=0.1))
pred$cases = predict(g, newdata=pred, type = "response")

[Figure: AIDS cases in Belgium, cases vs. year]

Residuals

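Standard residuals:

$$ r_i = Y_i - \hat{Y}_i = Y_i - \hat{\lambda}_i $$

Pearson residuals:

$$ r_i = \frac{Y_i - E(Y_i|X)}{\sqrt{Var(Y_i|X)}} = \frac{Y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}} $$

Deviance residuals:

$$ d_i = \text{sign}(y_i - \hat{\lambda}_i) \sqrt{2\left(y_i \log(y_i/\hat{\lambda}_i) - (y_i - \hat{\lambda}_i)\right)} $$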
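The three definitions can be checked by hand against R's `residuals()`. The lecture does not list the `aids` data, so the counts below are an assumption: they are the commonly cited Belgian AIDS counts for 1981 to 1993, and they reproduce the mean and variance printed later in these slides.

```r
# Assumption: these 13 yearly counts stand in for the lecture's aids data.
aids = data.frame(
  year  = 1981:1993,
  cases = c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
)
g = glm(cases ~ year, data = aids, family = poisson)

y          = aids$cases
lambda_hat = unname(fitted(g))

# Residuals computed directly from the formulas
r_std     = y - lambda_hat
r_pearson = (y - lambda_hat) / sqrt(lambda_hat)
r_dev     = sign(y - lambda_hat) *
            sqrt(2 * (y * log(y / lambda_hat) - (y - lambda_hat)))

# Compare with the built-in extractors
all.equal(r_std,     unname(residuals(g, type = "response")))
all.equal(r_pearson, unname(residuals(g, type = "pearson")))
all.equal(r_dev,     unname(residuals(g, type = "deviance")))
```

The `y * log(y / lambda_hat)` term is safe here because every count is positive; a zero count would need the usual convention that the term is 0.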

Deviance and deviance residuals

Deviance can be interpreted as the difference between your model's fit and the fit of an ideal model (where $E(\hat{Y}_i) = Y_i$).

$$ D = 2\left(\mathcal{L}(Y|\theta_{best}) - \mathcal{L}(Y|\hat{\theta})\right) = \sum_{i=1}^n d_i^2 $$

Deviance is a measure of goodness of fit in a similar way to the residual sum of squares (which is just the sum of squared standard residuals).
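A quick numerical check of this identity: the sketch below computes the deviance from the two log-likelihoods and compares it with `deviance()` and with the sum of squared deviance residuals. It treats $\mathcal{L}$ as the log-likelihood, and the `aids` counts are again an assumption (the commonly cited Belgian series, which matches the summary statistics in these slides).

```r
# Assumption: these 13 yearly counts stand in for the lecture's aids data.
aids = data.frame(
  year  = 1981:1993,
  cases = c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
)
g = glm(cases ~ year, data = aids, family = poisson)
y = aids$cases

# Log-likelihood of the fitted model and of the "ideal" model (lambda_i = y_i)
ll_model = sum(dpois(y, lambda = fitted(g), log = TRUE))
ll_best  = sum(dpois(y, lambda = y,         log = TRUE))

D = 2 * (ll_best - ll_model)

# Three routes to the same number
c(by_hand       = D,
  deviance_fn   = deviance(g),
  sum_sq_resids = sum(residuals(g, type = "deviance")^2))
```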

Deviance residuals derivation
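The slide gives only the heading, so the following is a sketch of the standard derivation for the Poisson case rather than the lecture's own working. It assumes $\mathcal{L}$ denotes the log-likelihood and that the ideal model sets $\lambda_i = y_i$.

```latex
\begin{align*}
\mathcal{L}(Y|\lambda) &= \sum_{i=1}^n \left( y_i \log \lambda_i - \lambda_i - \log y_i! \right) \\
\mathcal{L}(Y|\theta_{best}) &= \sum_{i=1}^n \left( y_i \log y_i - y_i - \log y_i! \right)
  \qquad \text{(ideal model, } \lambda_i = y_i \text{)} \\
D &= 2\left( \mathcal{L}(Y|\theta_{best}) - \mathcal{L}(Y|\hat{\theta}) \right) \\
  &= 2 \sum_{i=1}^n \left( y_i \log y_i - y_i - y_i \log \hat{\lambda}_i + \hat{\lambda}_i \right) \\
  &= \sum_{i=1}^n 2 \left( y_i \log(y_i/\hat{\lambda}_i) - (y_i - \hat{\lambda}_i) \right)
   = \sum_{i=1}^n d_i^2
\end{align*}
```

Each summand is non-negative, so taking its square root and attaching the sign of $y_i - \hat{\lambda}_i$ gives the deviance residual $d_i$ defined earlier.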

Residual plots

[Figure: three panels (standard, pearson, deviance), residual vs. year]

print(aids_fit)

[Figure: AIDS cases in Belgium, cases vs. year]

Quadratic fit

g2 = glm(cases~year+I(year^2), data=aids, family=poisson)
pred2 = data.frame(year=seq(1981,1993,by=0.1))
pred2$cases = predict(g2, newdata=pred2, type = "response")
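One way to quantify what the quadratic term buys is to compare the deviance of the two nested fits. The sketch below does this with a likelihood ratio (chi-square) test. The `aids` counts are an assumption (the commonly cited Belgian series), and the shifted time variable `t` is introduced here only for numerical convenience; it does not change the fitted values.

```r
# Assumption: these 13 yearly counts stand in for the lecture's aids data.
aids = data.frame(
  year  = 1981:1993,
  cases = c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
)
aids$t = aids$year - 1981     # hypothetical helper column: years since 1981

g1 = glm(cases ~ t,          data = aids, family = poisson)
g2 = glm(cases ~ t + I(t^2), data = aids, family = poisson)

# Residual deviance and AIC for each model
c(linear = deviance(g1), quadratic = deviance(g2))
AIC(g1, g2)

# Likelihood ratio test: the drop in deviance is compared to chi-square(1)
dev_drop = deviance(g1) - deviance(g2)
p_value  = pchisq(dev_drop, df = 1, lower.tail = FALSE)
p_value
```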

[Figure: AIDS cases in Belgium, cases vs. year]

Quadratic fit - residuals

[Figure: three panels (standard, pearson, deviance), residual vs. year]

Bayesian Poisson Regression Model

## model{
##   # Likelihood
##   for(i in 1:length(Y)){
##     Y[i] ~ dpois(lambda[i])
##     log(lambda[i]) <- beta[1] + beta[2]*X[i]
## 
##     # In-sample prediction
##     Y_hat[i] ~ dpois(lambda[i])
##   }
## 
##   # Prior for beta
##   for(j in 1:2){
##     beta[j] ~ dnorm(0,1/100)
##   }
## }

Bayesian Model fit

library(rjags)

m = jags.model(
  textConnection(poisson_model1), quiet = TRUE,
  data = list(Y=aids$cases, X=aids$year)
)
update(m, n.iter=1000, progress.bar="none")
samp = coda.samples(
  m, variable.names=c("beta","lambda","Y_hat"),
  n.iter=5000, progress.bar="none"
)

MCMC Diagnostics

[Figure: trace and density plots for beta[1] and beta[2], N = 5000 iterations]

Model fit?

[Figure: AIDS cases in Belgium, cases vs. year]

What went wrong?

summary(g)
## 
## Call:
## glm(formula = cases ~ year, family = poisson, data = aids)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6784  -1.5013  -0.2636   2.1760   2.7306  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.971e+02  1.546e+01  -25.68   <2e-16 ***
## year         2.021e-01  7.771e-03   26.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 872.206  on 12  degrees of freedom
## Residual deviance:  80.686  on 11  degrees of freedom
## AIC: 166.37
## 
## Number of Fisher Scoring iterations: 4

summary(glm(cases~I(year-1981), data=aids, family=poisson))
## 
## Call:
## glm(formula = cases ~ I(year - 1981), family = poisson, data = aids)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6784  -1.5013  -0.2636   2.1760   2.7306  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.342711   0.070920   47.13   <2e-16 ***
## I(year - 1981) 0.202121   0.007771   26.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 872.206  on 12  degrees of freedom
## Residual deviance:  80.686  on 11  degrees of freedom
## AIC: 166.37
## 
## Number of Fisher Scoring iterations: 4
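The slides leave the diagnosis to the reader. One plausible reading, sketched below, is that the `dnorm(0, 1/100)` prior (a precision of 1/100, so a standard deviation of 10) is far from the uncentered intercept, and that the intercept and slope are almost perfectly correlated when `year` is used as is. The `aids` counts are an assumption (the commonly cited Belgian series), and `g_raw`, `g_shift`, and `prior_sd` are names introduced here.

```r
# Assumption: these 13 yearly counts stand in for the lecture's aids data.
aids = data.frame(
  year  = 1981:1993,
  cases = c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
)
g_raw   = glm(cases ~ year,           data = aids, family = poisson)
g_shift = glm(cases ~ I(year - 1981), data = aids, family = poisson)

# JAGS dnorm takes a precision: 1/100 means variance 100, sd 10
prior_sd = sqrt(100)

# How many prior standard deviations is each intercept estimate from 0?
z_raw   = abs(unname(coef(g_raw)[1]))   / prior_sd
z_shift = abs(unname(coef(g_shift)[1])) / prior_sd
c(raw = z_raw, shifted = z_shift)

# Correlation between the intercept and slope estimates
cor_raw   = cov2cor(vcov(g_raw))[1, 2]
cor_shift = cov2cor(vcov(g_shift))[1, 2]
c(raw = cor_raw, shifted = cor_shift)
```

Shifting the covariate leaves the slope and the fit unchanged but moves the intercept to a scale the prior can accommodate, which is the change made in the revised model.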

Revising the model

## model{
##   # Likelihood
##   for(i in 1:length(Y)){
##     Y[i] ~ dpois(lambda[i])
##     log(lambda[i]) <- beta[1] + beta[2]*(X[i] - 1981)
## 
##     # In-sample prediction
##     Y_hat[i] ~ dpois(lambda[i])
##   }
## 
##   # Prior for beta
##   for(j in 1:2){
##     beta[j] ~ dnorm(0,1/100)
##   }
## }

MCMC Diagnostics

[Figure: trace and density plots for beta[1] and beta[2], N = 5000 iterations]

Model fit

[Figure: AIDS cases in Belgium (Linear Poisson Model), cases vs. year]

Bayesian Residual Plots

[Figure: three panels (standard, pearson, deviance), post_mean vs. year]

Model fit

[Figure: AIDS cases in Belgium (Quadratic Poisson Model), cases vs. year]

Bayesian Residual Plots

[Figure: three panels (standard, pearson, deviance), post_mean vs. year]

Negative Binomial Regression

Overdispersion

One of the properties of the Poisson distribution is that if X ∼ Pois(λ) then E(X) = Var(X) = λ.

If we are constructing a model where we claim that our response variable Y follows a Poisson distribution then we are making a very strong assumption which has implications for both inference and prediction.

mean(aids$cases)
## [1] 124.7692
var(aids$cases)
## [1] 8124.526
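The raw mean and variance pool all years together, so part of the gap comes from the trend in the mean rather than from extra-Poisson variation. A common complementary check, sketched here as an addition to the slides, is the Pearson chi-square statistic divided by the residual degrees of freedom for a fitted model; values well above 1 suggest overdispersion relative to that model. The `aids` counts are an assumption (the commonly cited Belgian series) and `phi_hat` is a name introduced here.

```r
# Assumption: these 13 yearly counts stand in for the lecture's aids data.
aids = data.frame(
  year  = 1981:1993,
  cases = c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
)

# Marginal mean and variance, as on the slide
m = mean(aids$cases)
v = var(aids$cases)
c(mean = m, var = v)

# Dispersion estimate conditional on the linear Poisson model
g = glm(cases ~ I(year - 1981), data = aids, family = poisson)
phi_hat = sum(residuals(g, type = "pearson")^2) / df.residual(g)
phi_hat
```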


Negative binomial regression

If we define

$$ Y_i | Z_i \sim \text{Pois}(\lambda_i Z_i) $$

$$ Z_i \sim \text{Gamma}(\theta_i, \theta_i) $$

then the marginal distribution of $Y_i$ will be negative binomial with,

$$ E(Y_i) = \lambda_i $$

$$ Var(Y_i) = \lambda_i + \lambda_i^2/\theta_i $$
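The mean and variance of this mixture can be checked by simulation. The sketch below draws from the two-stage Gamma-Poisson construction with a single made-up `lambda` and `theta`, and compares the sample moments with the formulas above.

```r
set.seed(42)

n      = 1e5
lambda = 10     # hypothetical Poisson mean
theta  = 2      # hypothetical Gamma shape and rate

# Z ~ Gamma(theta, theta) has mean 1 and variance 1/theta
z = rgamma(n, shape = theta, rate = theta)

# Y | Z ~ Pois(lambda * Z)
y = rpois(n, lambda * z)

# Sample moments against the negative binomial formulas
c(sample_mean = mean(y), theory_mean = lambda)
c(sample_var  = var(y),  theory_var  = lambda + lambda^2 / theta)
```

The same marginal distribution is available directly as `rnbinom(n, size = theta, mu = lambda)`, which can serve as a cross-check.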

Model

## model{
##   for(i in 1:length(Y))
##   {
##     Z[i] ~ dgamma(theta, theta)
##     log(lambda[i]) <- beta[1] + beta[2]*(X[i] - 1981) + beta[3]*(X[i] - 1981)^2
## 
##     lambda_Z[i] <- Z[i]*lambda[i]
## 
##     Y[i] ~ dpois(lambda_Z[i])
##     Y_hat[i] ~ dpois(lambda_Z[i])
##   }
## 
##   for(j in 1:3){
##     beta[j] ~ dnorm(0, 1/100)
##   }
## 
##   log_theta ~ dnorm(0, 1/100)
##   theta <- exp(log_theta)
## }

Negative Binomial Model fit

[Figure: two panels of cases vs. year, AIDS cases in Belgium (Quadratic Poisson Model) and AIDS cases in Belgium (Quadratic Negative Binomial Model)]

Bayesian Residual Plots

[Figure: two panels (standard, pearson), post_mean vs. year]
