Lecture 3 Residual Analysis + Generalized Linear Models

Lecture 3 Residual Analysis + Generalized Linear Models Colin Rundel 1/23/2017 1 Residual Analysis 2 Atmospheric CO2 (ppm) from Mauna Loa 360 co2 350 1988 1992 1996 date 3 360 co2 350 1988 1992 1996 date 2.5 0.0 resid −2.5 1988 1992 1996 date Where to start? Well, it looks like stuff is going up on average … 4 360 co2 350 1988 1992 1996 date 2.5 0.0 resid −2.5 1988 1992 1996 date Where to start? Well, it looks like stuff is going up on average … 4 2.5 0.0 resid −2.5 1988 1992 1996 date 1.0 0.5 0.0 resid2 −0.5 −1.0 1988 1992 1996 date and then? Well there is some periodicity lets add the month … 5 2.5 0.0 resid −2.5 1988 1992 1996 date 1.0 0.5 0.0 resid2 −0.5 −1.0 1988 1992 1996 date and then? Well there is some periodicity lets add the month … 5 1.0 0.5 0.0 resid2 −0.5 −1.0 1988 1992 1996 date 0.8 0.4 0.0 resid3 −0.4 −0.8 1988 1992 1996 date and then and then? Maybe there is some different effect by year … 6 1.0 0.5 0.0 resid2 −0.5 −1.0 1988 1992 1996 date 0.8 0.4 0.0 resid3 −0.4 −0.8 1988 1992 1996 date and then and then? Maybe there is some different effect by year … 6 Too much (lm = lm(co2~date + month + as.factor(year), data=co2_df)) ## ## Call: ## lm(formula = co2 ~ date + month + as.factor(year), data = co2_df) ## ## Coefficients: ## (Intercept) date monthAug ## -2.645e+03 1.508e+00 -4.177e+00 ## monthDec monthFeb monthJan ## -3.612e+00 -2.008e+00 -2.705e+00 ## monthJul monthJun monthMar ## -2.035e+00 -3.251e-01 -1.227e+00 ## monthMay monthNov monthOct ## 4.821e-01 -4.838e+00 -6.135e+00 ## monthSep as.factor(year)1986 as.factor(year)1987 ## -6.064e+00 -2.585e-01 9.722e-03 ## as.factor(year)1988 as.factor(year)1989 as.factor(year)1990 ## 1.065e+00 9.978e-01 7.726e-01 ## as.factor(year)1991 as.factor(year)1992 as.factor(year)1993 ## 7.067e-01 1.236e-02 -7.911e-01 ## as.factor(year)1994 as.factor(year)1995 as.factor(year)1996 ## -4.146e-01 1.119e-01 3.768e-01 ## as.factor(year)1997 7 ## NA 360 co2 350 1988 1992 1996 date 8 Generalized Linear Models 9 Background A generalized linear model has three key components: 1. a probability distribution (from the exponential family) that describes your response variable 2. a linear predictor η = Xβ, 3. and a link function g such that g(E(Y|X)) = µ = η. 10 Poisson Regression 11 Model Specification A generalized linear model for count data where we assume the outcome variable follows a poisson distribution (mean = variance). ∼ Yi Poisson(λi) log E(Y|X) = log λ = Xβ 12 Example - AIDS in Belgium AIDS cases in Belgium 250 200 150 cases 100 50 0 1985 1990 year 13 Frequentist glm fit g = glm(cases~year, data=aids, family=poisson) pred = data_frame(year=seq(1981,1993,by=0.1)) pred$cases = predict(g, newdata=pred, type = ”response”) AIDS cases in Belgium 300 200 cases 100 0 1985 1990 year 14 Residuals Standard residuals: − − ^ ri = Yi ^Yi = Yi λi Pearson residuals: Y − E(Y |X) Y − λ^ = √i i = i√ i ri | Var(Yi X) ^ λi Deviance residuals: √ − ^ − − ^ di = sign(yi λi) 2(yi log(yi/λi) (yi λi)) 15 Deviance and deviance residuals Deviance can be interpreted as the difference between your model’s fit and the fit of an ideal model (where E(^Yi) = Yi). ∑n L | − L |^ 2 D = 2( (Y θbest) (Y θ)) = di i=1 Deviance is a measure of goodness of fit in a similar way to the residual sum of squares (which is just the sum of squared standard residuals). 16 Deviance residuals derivation 17 Residual plots standard pearson deviance 2.5 2.5 0 0.0 0.0 residual −2.5 −50 −2.5 −5.0 1985 1990 1985 1990 1985 1990 year 18 print(aids_fit) AIDS cases in Belgium 300 200 cases 100 0 1985 1990 year 19 Quadratic fit g2 = glm(cases~year+I(year^2), data=aids, family=poisson) pred2 = data_frame(year=seq(1981,1993,by=0.1)) pred2$cases = predict(g2, newdata=pred, type = ”response”) AIDS cases in Belgium 250 200 150 cases 100 50 0 1985 1990 year 20 Quadratic fit - residuals standard pearson deviance 20 1 1 10 0 0 residual 0 −1 −1 −10 1985 1990 1985 1990 1985 1990 year 21 Bayesian Poisson Regression Model ## model{ ## # Likelihood ## for(i in 1:length(Y)){ ## Y[i] ~ dpois(lambda[i]) ## log(lambda[i]) <- beta[1] + beta[2]*X[i] ## ## # In-sample prediction ## Y_hat[i] ~ dpois(lambda[i]) ## } ## ## # Prior for beta ## for(j in 1:2){ ## beta[j] ~ dnorm(0,1/100) ## } ## } 22 Bayesian Model fit m = jags.model( textConnection(poisson_model1), quiet = TRUE, data = list(Y=aids$cases, X=aids$year) ) update(m, n.iter=1000, progress.bar=”none”) samp = coda.samples( m, variable.names=c(”beta”,”lambda”,”Y_hat”), n.iter=5000, progress.bar=”none” ) 23 MCMC Diagnostics Trace of beta[1] Density of beta[1] −1 0.2 −3 −5 0.0 2000 3000 4000 5000 6000 7000 −6 −4 −2 0 Iterations N = 5000 Bandwidth = 0.3209 Trace of beta[2] Density of beta[2] 0.0040 400 0 0.0020 2000 3000 4000 5000 6000 7000 0.002 0.003 0.004 0.005 Iterations N = 5000 Bandwidth = 0.0001615 24 Model fit? AIDS cases in Belgium 300 200 cases 100 0 1985 1990 year 25 summary(g) ## ## Call: ## glm(formula = cases ~ year, family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -4.6784 -1.5013 -0.2636 2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.971e+02 1.546e+01 -25.68 <2e-16 *** ## year 2.021e-01 7.771e-03 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 872.206 on 12 degrees of freedom ## Residual deviance: 80.686 on 11 degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4 What went wrong? 26 What went wrong? summary(g) ## ## Call: ## glm(formula = cases ~ year, family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -4.6784 -1.5013 -0.2636 2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.971e+02 1.546e+01 -25.68 <2e-16 *** ## year 2.021e-01 7.771e-03 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 872.206 on 12 degrees of freedom ## Residual deviance: 80.686 on 11 degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4 26 summary(glm(cases~I(year-1981), data=aids, family=poisson)) ## ## Call: ## glm(formula = cases ~ I(year - 1981), family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -4.6784 -1.5013 -0.2636 2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 3.342711 0.070920 47.13 <2e-16 *** ## I(year - 1981) 0.202121 0.007771 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 872.206 on 12 degrees of freedom ## Residual deviance: 80.686 on 11 degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4 27 Revising the model ## model{ ## # Likelihood ## for(i in 1:length(Y)){ ## Y[i] ~ dpois(lambda[i]) ## log(lambda[i]) <- beta[1] + beta[2]*(X[i] - 1981) ## ## # In-sample prediction ## Y_hat[i] ~ dpois(lambda[i]) ## } ## ## # Prior for beta ## for(j in 1:2){ ## beta[j] ~ dnorm(0,1/100) ## } ## } 28 MCMC Diagnostics Trace of beta[1] Density of beta[1] 3.5 4 3.3 2 0 3.1 2000 3000 4000 5000 6000 7000 3.1 3.2 3.3 3.4 3.5 3.6 Iterations N = 5000 Bandwidth = 0.01336 Trace of beta[2] Density of beta[2] 40 0.21 20 0 0.18 2000 3000 4000 5000 6000 7000 0.18 0.19 0.20 0.21 0.22 0.23 Iterations N = 5000 Bandwidth = 0.001456 29 Model fit AIDS cases in Belgium (Linear Poisson Model) 300 200 cases 100 0 1985 1990 year 30 Bayesian Residual Plots standard pearson deviance 4 50 2 2 0 0 0 −2 −2 post_mean −50 −4 −4 −100 −6 −6 1985 1990 1985 1990 1985 1990 year 31 Model fit AIDS cases in Belgium (Quadratic Poisson Model) 300 200 cases 100 0 1985 1990 year 32 Bayesian Residual Plots standard pearson deviance 40 2 2 20 1 1 0 0 0 post_mean −1 −1 −20 −2 −2 1985 1990 1985 1990 1985 1990 year 33 Negative Binomial Regression 34 mean(aids$cases) ## [1] 124.7692 var(aids$cases) ## [1] 8124.526 Overdispersion One of the properties of the Poisson distribution is that if X ∼ Pois(λ) then E(X) = Var(X) = λ. If we are constructing a model where we claim that our response variable Y follows a Poisson distribution then we are making a very strong assumption which has implactions for both inference and prediction. 35 Overdispersion One of the properties of the Poisson distribution is that if X ∼ Pois(λ) then E(X) = Var(X) = λ.

Lecture 3 Residual Analysis + Generalized Linear Models

Fast Computation of the Deviance Information Criterion for Latent Variable Models

A Generalized Linear Model for Binomial Response Data

Comparison of Some Chemometric Tools for Metabonomics Biomarker Identification ⁎ Réjane Rousseau A, , Bernadette Govaerts A, Michel Verleysen A,B, Bruno Boulanger C

Statistics 149 – Spring 2016 – Assignment 4 Solutions Due Monday April 4, 2016 1. for the Poisson Distribution, B(Θ) =

Bayesian Methods: Review of Generalized Linear Models

What I Should Have Described in Class Today

The Deviance Information Criterion: 12 Years On

Heteroscedastic Errors

Statistical Approaches for Highly Skewed Data 1

Negative Binomial Regression Models and Estimation Methods

And the Winner Is…? How to Pick a Better Model Part 2 – Goodness-Of-Fit and Internal Stability

Robustbase: Basic Robust Statistics