
Stat 5120 – Categorical Data Analysis Dr. Corcoran, Fall 2010

VI. Introduction to Generalized Linear Models

Readings for Section VI: Agresti, Chapter 3

We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models unifies a family of regression models that includes (but is not limited to) logistic regression and Poisson regression.

Why Model?

1. Structure of the model helps us to describe how various factors are associated with the outcome variable.

2. Size and strength of the effects of primary factor(s) of interest can be explained while adjusting for possible confounding factors.

3. The simultaneous effects of several explanatory variables can be considered, as opposed to just one or two at a time.

Linear Regression: Modeling the Mean

Recall that linear regression involves modeling the mean of some outcome variable as a function of one or more explanatory variables. That is, we have a sample Y1,…, Yn of independent measures, where the ith subject in our sample has p explanatory variables xi1, xi2,…, xip, and E(Yi) = μi.

The linear regression model specifies that

Yi = β0 + β1 xi1 + β2 xi2 + … + βp xip + εi,

where εi ~ N(0, σ²), for i = 1,…,n. Then E(Yi | xi1, xi2,…, xip) = β0 + β1 xi1 + … + βp xip, and Var(Yi) = σ². This model assumes that the mean of the outcome variable changes linearly with respect to the explanatory variables.

The Three Components of a Generalized Linear Model

Whereas with linear regression we model the mean of the outcome variable directly, a generalized linear model is constructed to model the effects of the covariates on a function of the mean.

There are hence three parts or components that comprise a generalized linear model:

1. The random component, which specifies the distribution of the outcome variable.

2. The systematic component, which represents a function of the covariates that will link to the outcome variable.

3. The link function, which determines how the mean of the outcome variable relates to the covariates.

Generalized Linear Models for Binary Data

We have a sample Y1,…, Yn of independent binary outcome measurements. The ith subject in our sample has p explanatory variables xi1, xi2,…, xip. Suppose that P(Yi = 1) = πi and P(Yi = 0) = 1 – πi. Hence, E(Yi) = πi.

The random component in this case is clearly binomial. For the purpose of this class, we will always assume that the systematic component is simply a linear combination of the covariates, or β0 + β1 xi1 + … + βp xip. The remaining question is: how do we model πi as a function of the covariates (i.e., what is the link function)?

Link Functions for the Binomial Distribution

Suppose that we assume that E(Yi) = πi = β0 + β1 xi1 + … + βp xip. We call this the identity link.

Does this model have any practical shortcomings?

Since the systematic component can take on any value, we often prefer using a link that will constrain πi to the interval between 0 and 1.

The so-called logistic or logit link g(πi) = log[πi /(1 – πi)] accomplishes this.

There are other links (e.g., the probit), but we will focus mainly on the logit.
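As a rough illustration of why the logit link keeps πi between 0 and 1, the following Python sketch (not part of the original SAS material) applies the inverse of the logit to a few linear-predictor values. The function names and the values of eta are hypothetical, chosen only for illustration.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Map any real-valued linear predictor eta back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical linear-predictor values, chosen only for illustration.
for eta in (-5.0, -1.0, 0.0, 1.0, 5.0):
    p = inv_logit(eta)
    print(f"eta = {eta:5.1f}  ->  pi = {p:.4f}  (logit(pi) = {logit(p):.1f})")
```

However large or small the linear predictor becomes, the implied probability stays strictly inside (0, 1), which the identity link cannot guarantee.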

Example VI.A

Consider a generalized linear model for the Alzheimer’s data from Example IV.A. The data are tabulated again here:

                   AD    No AD    Total
≥ 1 ε4 Allele     308     1296     1604
No ε4             262     3096     3358
Total             570     4392     4962

Note that these data can be viewed as a sample of outcome measures Y1,…,Y4962, where Yi = 1 if the ith individual was diagnosed with AD, and Yi = 0 otherwise. Moreover, we have a single covariate Xi, which is 1 if the ith individual has at least one copy of the APOE ε4 allele, and 0 otherwise.

Example VI.A (cont’d)

Let πi = P(Yi = 1). Note that this implies E(Yi) = πi. To specify the generalized linear model to explain the relationship between Xi and Yi (that is, between genotype and AD risk), (1) we assume the distribution of Yi (i.e., the random component) is binomial, (2) we employ the standard systematic component, and (3) we use the logit link. This yields the model

\[ \operatorname{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_i. \]

This happens to be the model for these data. The strong assumption is that the log odds of response change linearly with respect to the covariate(s) in the model.

Example VI.A (cont’d)

First, how do we interpret the coefficients of this model?

What is logit(πi | Xi = 1)?

What is logit( πi | Xi = 0)?

What is the log odds ratio with respect to AD risk, comparing those with at least one ε4 to those with no ε4?

The regression coefficient of a binary covariate in a logistic regression model represents the log odds ratio comparing the group identified as ‘1’ to the group identified as ‘0’.

More generally, the coefficient of any arbitrary covariate in a logistic regression model represents the log odds ratio for subjects who differ by one unit with respect to the covariate.
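One way to see this interpretation is to evaluate the model logit(πi) = β0 + β1xi at the two values of the binary covariate and subtract. The short derivation below is a sketch under that model, using only the symbols already defined in this example.

```latex
\begin{align*}
\operatorname{logit}(\pi_i \mid X_i = 1) &= \beta_0 + \beta_1, \\
\operatorname{logit}(\pi_i \mid X_i = 0) &= \beta_0, \\
\log(\text{odds ratio}) &= \operatorname{logit}(\pi_i \mid X_i = 1) - \operatorname{logit}(\pi_i \mid X_i = 0) = \beta_1 .
\end{align*}
```

Exponentiating gives the odds ratio itself, exp(β1), and exp(β0) is the odds of response in the group coded 0.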

Fitting the Generalized Linear Model for the Alzheimer’s Data in SAS

We obtain parameter estimates for a generalized linear model using the method of maximum likelihood; these estimates typically cannot be computed in closed form.

The following SAS program shows how to read the AD data into SAS, and then obtain a fit for the regression coefficients in the model of Example VI.A.

options ls=79 nodate;

/* One record per cell of the 2 x 2 table; count holds the cell frequency */
data;
input e4 ad count;
cards;
1 1 308
1 0 1296
0 1 262
0 0 3096
;

proc sort; by descending ad descending e4; run;

/* Classical 2 x 2 analysis for comparison with the model fit */
proc freq order=data;
tables e4*ad / chisq relrisk;
weight count;
run;

/* Logistic regression of AD status on e4 carrier status */
proc genmod descending;
model ad=e4 / dis=bin link=logit type3;
weight count;
run;

SAS Output

The FREQ Procedure

Table of e4 by ad

e4        ad
Frequency|
Percent  |
Row Pct  |
Col Pct  |       1|       0|  Total
---------+--------+--------+
       1 |    308 |   1296 |   1604
         |   6.21 |  26.12 |  32.33
         |  19.20 |  80.80 |
         |  54.04 |  29.51 |
---------+--------+--------+
       0 |    262 |   3096 |   3358
         |   5.28 |  62.39 |  67.67
         |   7.80 |  92.20 |
         |  45.96 |  70.49 |
---------+--------+--------+
Total         570     4392     4962
            11.49    88.51   100.00

Statistics for Table of e4 by ad

Statistic                       DF      Value      Prob
-------------------------------------------------------
Chi-Square                       1   138.7375    <.0001
Likelihood Ratio Chi-Square      1   129.9802    <.0001
Continuity Adj. Chi-Square       1   137.6186    <.0001
Mantel-Haenszel Chi-Square       1   138.7095    <.0001
Phi Coefficient                        0.1672
Contingency Coefficient                0.1649
Cramer's V                             0.1672

SAS Output (cont’d)

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)       308
Left-sided Pr <= F          1.0000
Right-sided Pr >= F      3.366E-30

Table Probability (P)    6.085E-30
Two-sided Pr <= P        4.889E-30

Estimates of the Relative Risk (Row1/Row2)

Type of Study                  Value     95% Confidence Limits
--------------------------------------------------------------
Case-Control (Odds Ratio)     2.8083      2.3527      3.3522
Cohort (Col1 Risk)            2.4611      2.1106      2.8697
Cohort (Col2 Risk)            0.8764      0.8540      0.8993

SAS Output (cont’d)

The GENMOD Procedure

Model Information

Data Set                        WORK.DATA1
Distribution                    Binomial
Link Function                   Logit
Dependent Variable              ad
Scale Weight Variable           count
Number of Observations Read     4
Number of Observations Used     4
Sum of Weights                  4962
Number of Events                2
Number of Trials                4

Response Profile

Ordered             Total
  Value    ad   Frequency
      1    1         570
      2    0        4392

PROC GENMOD is modeling the probability that ad='1'.

Criteria For Assessing Goodness Of Fit

Criterion             DF        Value     Value/DF

Deviance               2    3408.7579    1704.3790
Scaled Deviance        2    3408.7579    1704.3790
Pearson Chi-Square     2    4962.0000    2481.0000
Scaled Pearson X2      2    4962.0000    2481.0000
Log Likelihood             -1704.3790

Algorithm converged.

SAS Output (cont’d)

Analysis Of Parameter Estimates

                          Standard      Wald 95%            Chi-
Parameter   DF  Estimate     Error  Confidence Limits     Square  Pr > ChiSq

Intercept    1   -2.4695    0.0643  -2.5956   -2.3434    1473.15      <.0001
e4           1    1.0326    0.0903   0.8556    1.2096     130.69      <.0001
Scale        0    1.0000    0.0000   1.0000    1.0000

Example VI.A (cont’d)

The SAS output indicates that β̂0 = –2.4695 and β̂1 = 1.0326.

According to the fit of the regression model, what are the estimated log odds of AD for someone with no APOE ε4 allele?

What are the estimated log odds given for someone with at least one high-risk allele?

What is the estimated log odds ratio of AD risk comparing ε4 carriers to non-carriers? How does this estimate compare with the sample odds ratio computed using the data in the 2 x 2 table?
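The comparison with the 2 x 2 table can be checked by hand. The Python sketch below (an illustration, not part of the original SAS program) computes the sample log odds in each genotype group directly from the cell counts of Example VI.A; the variable names are made up for this example.

```python
import math

# Cell counts from the 2 x 2 table in Example VI.A.
ad_e4, no_ad_e4 = 308, 1296       # at least one e4 allele
ad_none, no_ad_none = 262, 3096   # no e4 allele

# Sample log odds of AD in each group.
log_odds_none = math.log(ad_none / no_ad_none)   # compare with the fitted intercept
log_odds_e4 = math.log(ad_e4 / no_ad_e4)         # compare with intercept + e4 coefficient

# Log odds ratio and odds ratio, carriers versus non-carriers.
log_or = log_odds_e4 - log_odds_none
odds_ratio = math.exp(log_or)

print(f"log odds, no e4:  {log_odds_none:.4f}")
print(f"log odds, >=1 e4: {log_odds_e4:.4f}")
print(f"log odds ratio:   {log_or:.4f}")
print(f"odds ratio:       {odds_ratio:.4f}")
```

With a single binary covariate the model has as many parameters as there are groups, so the fitted values reproduce the sample quantities: the log odds among non-carriers agrees with the GENMOD intercept, and the log odds ratio agrees with the e4 coefficient and with the odds ratio reported by PROC FREQ.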

Generalized Linear Models for Poisson Data

We have a sample Y1,…, Yn of independent counts. As usual, the ith subject in our sample has p explanatory variables xi1, xi2,…, xip. Let μi = E(Yi). This expectation then is the average count for subject i.

The random component in this case we assume to be Poisson, and we will use the standard systematic component. How do we choose the link function for a Poisson outcome?

Link Functions for the Poisson Distribution

As with the binomial case, first assume the identity link: that is, E(Yi) = μi = β0 + β1 xi1 + … + βp xip.

Does this model have any practical shortcomings?

Since the systematic component can take on any value, we often prefer using a link that will constrain μi to be positive.

A simple log link, given by log(μi), accomplishes this.

Example VI.B

As in the binomial case, we will first consider a very simple model that involves a two-group comparison. The data for this example come from an experiment to assess the environmental impact of chemicals. Twenty-five cultures of the bacteria S. capricornutum were randomized to either control (no chemical), or exposure to an investigative toxicant.

Group      Number of Bacteria (after 7 days)     Total
Exposed    167 158 158 175 167 105 123
           105 105 105  88  88  61  61            1973
            88  61  44  35  35  44
Control    219 228 202 237 228                    1114

These data can be viewed as a sample of outcome measures Y1,…,Y25, where Yi represents the number of bacteria counted at the end of the exposure in the ith culture. Moreover, we have a single covariate Xi, which is 1 if the ith culture was exposed to the chemical, and 0 otherwise.

How does the average number of bacteria compare between the chemically treated cultures and the controls?

Example VI.B (cont’d)

Denote E(Yi) by μi. To specify the generalized linear model that defines the relationship between Xi and Yi (that is, between presence or absence of the chemical and number of bacteria), (1) we assume the distribution of Yi (i.e., the random component) is Poisson, (2) we employ the standard systematic component, and (3) we use the log link. This yields the model

log(μi ) = β0 + β1xi .

The assumption is that the log mean number of bacteria changes linearly with respect to the covariate(s) in the model.

Example VI.B (cont’d)

How do we interpret the coefficients of this model?

What is log(μi | Xi = 1)?

What is log( μi | Xi = 0)?

What is the log rate ratio comparing the average number of bacteria in exposed cultures to the average number of bacteria in the control cultures?

The regression coefficient of a binary covariate in this model represents the log rate ratio comparing the group identified as ‘1’ to the group identified as ‘0’.

More generally, the coefficient of any arbitrary covariate in a Poisson regression model represents the log rate ratio for subjects who differ by one unit with respect to the covariate.
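The same subtraction argument used for the logit link applies here. The derivation below is a sketch that evaluates log(μi) = β0 + β1xi at the two values of the binary covariate, using only the symbols of Example VI.B.

```latex
\begin{align*}
\log(\mu_i \mid X_i = 1) &= \beta_0 + \beta_1, \\
\log(\mu_i \mid X_i = 0) &= \beta_0, \\
\log\left(\frac{\mu_i \mid X_i = 1}{\mu_i \mid X_i = 0}\right) &= (\beta_0 + \beta_1) - \beta_0 = \beta_1 .
\end{align*}
```

So exp(β0) is the mean count in the group coded 0, and exp(β1) is the ratio of the mean counts (the rate ratio) comparing the group coded 1 to the group coded 0.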

Fitting the Generalized Linear Model for the S. capricornutum Data in SAS

The following SAS program shows how to read the bacteria data into SAS, and then obtain a fit for the regression coefficients in the model of Example VI.B.

options ls=79 nodate;

data chemical;
input exposed bacteria;
cards;
1 167
1 158
1 158
1 175
1 167
.
. (SOME DATA OMITTED TO SAVE SPACE)
.
0 219
0 228
0 202
0 237
0 228
;

proc sort; by exposed; run;

/* Group means for comparison with the model fit */
proc means;
by exposed;
var bacteria;
run;

/* Poisson regression of bacterial count on exposure */
proc genmod;
model bacteria = exposed / dist=poisson link=log type3;
run;

SAS Output

------exposed=0 ------

The MEANS Procedure

Analysis Variable : bacteria

 N            Mean         Std Dev         Minimum         Maximum
------------------------------------------------------------------
 5     222.8000000      13.2551877     202.0000000     237.0000000
------------------------------------------------------------------

------exposed=1 ------

Analysis Variable : bacteria

 N            Mean         Std Dev         Minimum         Maximum
------------------------------------------------------------------
20      98.6500000      46.8146120      35.0000000     175.0000000
------------------------------------------------------------------

SAS Output (cont’d)

The GENMOD Procedure

Model Information

Data Set              WORK.CHEMICAL
Distribution          Poisson
Link Function         Log
Dependent Variable    bacteria

Number of Observations Read    25
Number of Observations Used    25

Criteria For Assessing Goodness Of Fit

Criterion             DF        Value     Value/DF

Deviance              23     436.3888      18.9734
Scaled Deviance       23     436.3888      18.9734
Pearson Chi-Square    23     425.2583      18.4895
Scaled Pearson X2     23     425.2583      18.4895
Log Likelihood             11994.7737

Algorithm converged.

SAS Output (cont’d)

Analysis Of Parameter Estimates

                          Standard      Wald 95%            Chi-
Parameter   DF  Estimate     Error  Confidence Limits     Square  Pr > ChiSq

Intercept    1    5.4063    0.0300   5.3476    5.4650    32559.8      <.0001
exposed      1   -0.8147    0.0375  -0.8881   -0.7412     472.57      <.0001
Scale        0    1.0000    0.0000   1.0000    1.0000

NOTE: The scale parameter was held fixed.

LR Statistics For Type 3 Analysis

                  Chi-
Source     DF   Square   Pr > ChiSq

exposed     1   429.07       <.0001

Example VI.B (cont’d)

The SAS output indicates that β̂0 = 5.4063 and β̂1 = –0.8147.

According to the fit of the regression model, what is the estimated log mean bacterial count for exposed cultures?

What is the estimated mean count for unexposed cultures?

What is the estimated log rate ratio of average bacterial count comparing the two exposure groups? Compare this estimate with the results from PROC MEANS.
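The comparison with PROC MEANS can be sketched numerically. The Python snippet below (an illustration, not part of the original SAS material) back-transforms the GENMOD estimates reported above through the inverse of the log link; the variable names are hypothetical.

```python
import math

# Estimates taken from the GENMOD output for Example VI.B.
b0 = 5.4063    # intercept: log mean count for control cultures (exposed = 0)
b1 = -0.8147   # exposed coefficient: log rate ratio, exposed versus control

mean_control = math.exp(b0)        # fitted mean count, control cultures
mean_exposed = math.exp(b0 + b1)   # fitted mean count, exposed cultures
rate_ratio = math.exp(b1)          # ratio of fitted means

print(f"fitted mean, control: {mean_control:.2f}")
print(f"fitted mean, exposed: {mean_exposed:.2f}")
print(f"rate ratio:           {rate_ratio:.4f}")
```

Up to rounding of the printed estimates, the fitted means agree with the group means from PROC MEANS (222.80 and 98.65), and the rate ratio agrees with 98.65/222.80.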

Inference for Generalized Linear Models

Based upon the fitted model for VI.A and VI.B, how do we make inferences with respect to model parameters? For example, is tumor risk significantly elevated in comparing the synthetic to natural fiber groups for the data in VII.A? That is, is the coefficient of the treatment group variable significantly greater than zero?

There are three modes of likelihood-based inference available to us in trying to assess the significance of model effects:

1. The Wald test.

2. The score test.

3. The likelihood-ratio test.

We will focus mainly on the Wald and likelihood-ratio tests.

The Large-Sample Distribution of Parameter Estimators

Suppose we are interested in inference for a single parameter, say βj, for j = 1,…, p. Using the method of maximum likelihood, we obtain an estimate of this parameter that we denote by β̂j. The numerical method used internally by your statistical software also produces an estimate of the standard error of this estimate, or s.e.(β̂j).

According to maximum likelihood theory, under common conditions the MLE β̂j is approximately normally distributed. The mean of this distribution is βj and the estimated standard deviation is the standard error.

The Wald Test

We are often interested in testing

\[ H_0: \beta_j = 0 \quad \text{versus} \quad H_A: \beta_j \neq 0. \]

Of course, even under H0 our estimate of βj will never exactly equal 0, due to random variation. The Wald test assesses evidence against the null hypothesis by considering how many standard errors our estimate lies away from zero. That is, the Wald statistic has the form

\[ W = \frac{\hat{\beta}_j - 0}{\text{s.e.}(\hat{\beta}_j)}. \]

Under H0, this statistic has an approximate standard normal distribution. The P-value is equal to twice the tail probability determined by the observed statistic. The related Wald (1 – α)100% confidence interval for βj has the form

\[ \hat{\beta}_j \pm z_{1-\alpha/2}\, \text{s.e.}(\hat{\beta}_j). \]

Example VI.C

From the SAS output for Example VI.A, what is the standard error of the coefficient of genotype classification? What is the value of the Wald statistic that tests whether this coefficient is significantly different from zero? What is the 95% Wald confidence interval?

Example VI.D

From the SAS output for Example VI.B, what is the standard error of the coefficient for exposure group? What is the value of the Wald statistic that tests whether this coefficient is significantly different from zero? What is the 95% Wald confidence interval?
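The Wald calculations in Examples VI.C and VI.D can be sketched as follows. This Python snippet is an illustration rather than the SAS computation itself: wald_summary is a hypothetical helper, and its inputs are the rounded estimates and standard errors printed in the GENMOD output, so the results match SAS only up to rounding.

```python
import math
from statistics import NormalDist

def wald_summary(estimate, std_error, alpha=0.05):
    """Wald z statistic, its square, two-sided p-value, and (1 - alpha) interval."""
    z = estimate / std_error
    # Two-sided p-value: 2 * P(Z > |z|), written with erfc to avoid underflow to 0.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    interval = (estimate - z_crit * std_error, estimate + z_crit * std_error)
    return z, z * z, p_value, interval

# Example VI.C: coefficient of e4 in the AD model.
z_e4, chisq_e4, p_e4, ci_e4 = wald_summary(1.0326, 0.0903)
# Example VI.D: coefficient of exposed in the bacteria model.
z_exp, chisq_exp, p_exp, ci_exp = wald_summary(-0.8147, 0.0375)

print(f"e4:      z = {z_e4:.2f}, z^2 = {chisq_e4:.2f}, CI = ({ci_e4[0]:.4f}, {ci_e4[1]:.4f})")
print(f"exposed: z = {z_exp:.2f}, z^2 = {chisq_exp:.2f}, CI = ({ci_exp[0]:.4f}, {ci_exp[1]:.4f})")
```

SAS reports the squared statistic in its Chi-Square column (130.69 and 472.57); squaring the z values computed from the rounded output gives nearly the same numbers.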

The Likelihood Ratio Test

Recall that the likelihood is the distribution (or joint mass function) of the data that treats the observed data as fixed and the model parameters as unknown. As with the Wald test, suppose that we are interested in testing H0: βj = 0 versus HA: βj ≠ 0.

The likelihood-ratio method of inference focuses on the ratio of two maximizations of the likelihood: one assuming that the null hypothesis is true, and one assuming that the alternative is true. We denote the first by l0 and the second by l1.

Note that because l1 is computed with fewer restrictions, it will be at least as large as l0. The likelihood-ratio test statistic is –2log(l0/l1) = –2{log(l0) – log(l1)} = –2(L0 – L1), where L0 and L1 are used to denote log(l0) and log(l1), respectively.

Distribution of the Likelihood-Ratio Test Statistic

The large-sample distribution of the LRT statistic is approximately chi-square, with degrees of freedom equal to the number of parameters being tested.

Example VI.E

Does the likelihood ratio test provide evidence of an association between genotype and AD risk for the data in Example VI.A?

Example VI.F

Does the likelihood ratio test provide evidence that chemical exposure is associated with smaller average bacterial count for the data in Example VI.B?

Compare the results of the LRT in VI.E and VI.F to the respective results of the Wald tests in VI.C and VI.D.
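For Example VI.E, the statistic –2(L0 – L1) can be computed directly from the 2 x 2 table, because both maximized log-likelihoods have closed forms when the only covariate is binary. The Python sketch below is an illustration under that assumption; binom_loglik is a hypothetical helper, and the binomial coefficients are omitted because they cancel in the ratio.

```python
import math

def binom_loglik(events, trials, p):
    """Binomial log-likelihood kernel for one group."""
    return events * math.log(p) + (trials - events) * math.log(1.0 - p)

# (AD cases, group size) for e4 carriers and non-carriers, from Example VI.A.
groups = [(308, 1604), (262, 3358)]

# Null model (beta1 = 0): one common probability of AD for both groups.
p0 = sum(e for e, _ in groups) / sum(n for _, n in groups)
L0 = sum(binom_loglik(e, n, p0) for e, n in groups)

# Alternative model: each group has its own probability (its sample proportion).
L1 = sum(binom_loglik(e, n, e / n) for e, n in groups)

lrt = -2.0 * (L0 - L1)
# Upper-tail chi-square probability with 1 df: P(chi2_1 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lrt / 2.0))

print(f"L0 = {L0:.4f}, L1 = {L1:.4f}, LRT = {lrt:.4f}, p = {p_value:.3g}")
```

L1 reproduces the log likelihood in the GENMOD output, and the statistic reproduces the likelihood ratio chi-square from PROC FREQ (129.98), which is somewhat smaller than the Wald chi-square of 130.69 for the same coefficient.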

Multivariable Models

As we’ve discussed previously, a big advantage of the modeling approach is that we can focus on an independent variable of interest while accounting simultaneously for the effects of confounding factors.

In the following example we will begin examining how a generalized linear model allows us to accomplish this.

Example VI.G

A data set that we will use frequently over the next week or two consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients.

These data are discussed at length in the textbook Applied Logistic Regression by Hosmer and Lemeshow (2000, John Wiley & Sons), an excellent reference for anyone who uses logistic regression in practice. Some of the variables are described on the following slide.

Example VI.G (cont’d)

VARIABLE   DESCRIPTION
ID         Patient’s identification number.
STA        Vital status: 0 = lived, 1 = died.
AGE        Age in years.
SEX        Gender: 0 = male, 1 = female.
RACE       Race: 1 = white, 2 = black, 3 = other.
SER        Service at ICU admission: 0 = medical, 1 = surgical.
CAN        Cancer part of present problem (yes/no).
CRN        History of chronic renal failure (yes/no).
INF        Infection probable at ICU admission (yes/no).
CPR        CPR prior to admission (yes/no).
SYS        Systolic blood pressure at admission in mm Hg.
HRA        Heart rate at admission in beats per minute.
PRE        Previous admission to ICU within 6 months (yes/no).
TYP        Type of admission: 0 = elective, 1 = emergency.
FRA        Bone fracture (yes/no).
PO2        PO2 from initial blood gases: 0 if > 60, 1 if <= 60.
PH         PH from initial blood gases: 0 if >= 7.25, 1 if < 7.25.
PCO        PCO2 from initial blood gases: 0 if <= 45, 1 if > 45.
BIC        Bicarbonate from initial blood gases: 0 if >= 18, 1 if < 18.
CRE        Creatinine from initial blood gases: 0 if <= 2.0, 1 if > 2.0.
LOC        Level of consciousness at admission: 0 = no coma/stupor, 1 = deep stupor, 2 = coma.

Example VI.G (cont’d)

Since we are interested in how these variables predict the probability of a given patient surviving, the outcome variable is STA.

In this example, we will consider and compare four simple models that account for (i) gender only, (ii) gender and age, (iii) gender and race, and (iv) gender, age, and race.

Fitting Models for the ICU Data in SAS

The following SAS program reads the ICU data into SAS, and then fits the models specified on the previous slide.

options ls=79 nodate;

data icu;
infile "d:\stat5120\datasets\icu_data.txt";
input id sta age sex race ser can crn inf
      cpr sys hra pre typ fra po2 ph pco
      bic cre loc;
run;

proc print; run;

proc genmod descending;
model sta=sex / dist=bin link=logit;
run;

proc genmod descending;
model sta=age sex / dist=bin link=logit;
run;

proc genmod descending;
class race;
model sta=sex race / dist=bin link=logit;
run;

proc genmod descending;
class race;
model sta=age sex race / dist=bin link=logit;
run;

SAS Output

The SAS System

Obs   id sta age sex race ser can crn inf cpr  sys  hra pre typ fra po2  ph pco bic cre loc

  1    8   0  27   1    1   0   0   0   1   0  142   88   0   1   0   0   0   0   0   0   0
  2   12   0  59   0    1   0   0   0   0   0  112   80   1   1   0   0   0   0   0   0   0
  3   14   0  77   0    1   1   0   0   0   0  100   70   0   0   0   0   0   0   0   0   0
  4   28   0  54   0    1   0   0   0   1   0  142  103   0   1   1   0   0   0   0   0   0
  5   32   0  87   1    1   1   0   0   1   0  110  154   1   1   0   0   0   0   0   0   0
  6   38   0  69   0    1   0   0   0   1   0  110  132   0   1   0   1   0   0   1   0   0
  7   40   0  63   0    1   1   0   0   0   0  104   66   0   0   0   0   0   0   0   0   0
  8   41   0  30   1    1   0   0   0   0   0  144  110   0   1   0   0   0   0   0   0   0
  9   42   0  35   0    2   0   0   0   0   0  108   60   0   1   0   0   0   0   0   0   0
 10   50   0  70   1    1   1   1   0   0   0  138  103   0   0   0   0   0   0   0   0   0
 11   51   0  55   1    1   1   0   0   1   0  188   86   1   0   0   0   0   0   0   0   0
 12   53   0  48   0    2   1   1   0   0   0  162  100   0   0   0   0   0   0   0   0   0
 13   58   0  66   1    1   1   0   0   0   0  160   80   1   0   0   0   0   0   0   0   0

. . .

(OBSERVATIONS OMITTED TO PRESERVE SPACE)

. . .

196  751   1  69   0    1   0   0   1   0   0   80   81   0   1   0   0   0   0   0   0   2
197  752   1  64   0    1   0   1   0   1   0   80  118   0   1   0   1   0   0   0   1   0
198  789   1  60   0    1   0   0   0   1   0   56  114   1   1   0   0   1   0   1   0   0
199  871   1  60   0    3   1   0   1   1   0  130   55   0   1   0   0   0   0   0   0   1
200  921   1  50   1    2   0   0   0   0   0  256   64   0   1   0   0   0   0   0   0   1

SAS Output (cont’d)

Model Information

Data Set              WORK.ICU
Distribution          Binomial
Link Function         Logit
Dependent Variable    sta
Observations Used     200

Criteria For Assessing Goodness Of Fit

Criterion             DF        Value     Value/DF

Deviance             198     200.0765       1.0105
Scaled Deviance      198     200.0765       1.0105
Pearson Chi-Square   198     199.9999       1.0101
Scaled Pearson X2    198     199.9999       1.0101
Log Likelihood              -100.0382

Algorithm converged.

Analysis Of Parameter Estimates

                          Standard      Wald 95%            Chi-
Parameter   DF  Estimate     Error  Confidence Limits     Square  Pr > ChiSq

Intercept    1   -1.4271    0.2273  -1.8726   -0.9816      39.42      <.0001
sex          1    0.1054    0.3617  -0.6036    0.8143       0.08      0.7708
Scale        0    1.0000    0.0000   1.0000    1.0000

SAS Output (cont’d)

Model Information

Data Set              WORK.ICU
Distribution          Binomial
Link Function         Logit
Dependent Variable    sta
Observations Used     200

Criteria For Assessing Goodness Of Fit

Criterion             DF        Value     Value/DF

Deviance             197     192.3055       0.9762
Scaled Deviance      197     192.3055       0.9762
Pearson Chi-Square   197     199.2974       1.0117
Scaled Pearson X2    197     199.2974       1.0117
Log Likelihood               -96.1527

Algorithm converged.

Analysis Of Parameter Estimates

                          Standard      Wald 95%            Chi-
Parameter   DF  Estimate     Error  Confidence Limits     Square  Pr > ChiSq

Intercept    1   -3.0567    0.6989  -4.4265   -1.6869      19.13      <.0001
age          1    0.0276    0.0107   0.0067    0.0485       6.70      0.0096
sex          1   -0.0113    0.3718  -0.7400    0.7173       0.00      0.9757
Scale        0    1.0000    0.0000   1.0000    1.0000

NOTE: The scale parameter was held fixed.

SAS Output (cont’d)

Model Information

Data Set              WORK.ICU
Distribution          Binomial
Link Function         Logit
Dependent Variable    sta
Observations Used     200

Criteria For Assessing Goodness Of Fit

Criterion             DF        Value     Value/DF

Deviance             196     197.8357       1.0094
Scaled Deviance      196     197.8357       1.0094
Pearson Chi-Square   196     199.3992       1.0173
Scaled Pearson X2    196     199.3992       1.0173
Log Likelihood               -98.9179

Algorithm converged.

Analysis Of Parameter Estimates

                            Standard      Wald 95%            Chi-
Parameter     DF  Estimate     Error  Confidence Limits     Square  Pr > ChiSq

Intercept      1   -1.4241    0.8048  -3.0016    0.1533       3.13      0.0768
sex            1    0.0931    0.3634  -0.6191    0.8052       0.07      0.7978
race      1    1    0.0716    0.8121  -1.5201    1.6633       0.01      0.9298
race      2    1   -1.2468    1.3028  -3.8002    1.3067       0.92      0.3386
race      3    0    0.0000    0.0000   0.0000    0.0000        .         .

SAS Output (cont’d)

Model Information

Data Set              WORK.ICU
Distribution          Binomial
Link Function         Logit
Dependent Variable    sta
Observations Used     200

Criteria For Assessing Goodness Of Fit

Criterion             DF        Value     Value/DF

Deviance             195     191.0494       0.9797
Scaled Deviance      195     191.0494       0.9797
Pearson Chi-Square   195     199.5076       1.0231
Scaled Pearson X2    195     199.5076       1.0231
Log Likelihood               -95.5247

Algorithm converged.

Analysis Of Parameter Estimates

                            Standard      Wald 95%            Chi-
Parameter     DF  Estimate     Error  Confidence Limits     Square  Pr > ChiSq

Intercept      1   -2.7063    0.9971  -4.6605   -0.7520       7.37      0.0066
age            1    0.0260    0.0107   0.0050    0.0470       5.89      0.0152
sex            1   -0.0125    0.3724  -0.7424    0.7175       0.00      0.9733
race      1    1   -0.2154    0.8366  -1.8550    1.4243       0.07      0.7968
race      2    1   -1.2207    1.3171  -3.8021    1.3607       0.86      0.3540
race      3    0    0.0000    0.0000   0.0000    0.0000        .         .
Scale          0    1.0000    0.0000   1.0000    1.0000
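Because several of the ICU models are nested, their log likelihoods can be compared with the likelihood-ratio statistic –2(L0 – L1). The Python sketch below is one possible way to do that arithmetic; it is an illustration, not part of the original SAS program. The dictionary keys and the helpers chi2_sf and lrt are hypothetical names, and the log likelihoods are copied from the four GENMOD outputs above.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability, using closed forms for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or 2")

# Maximized log likelihoods reported by PROC GENMOD for the four ICU models.
loglik = {
    "sex": -100.0382,
    "age sex": -96.1527,
    "sex race": -98.9179,
    "age sex race": -95.5247,
}

def lrt(reduced, full, df):
    """Likelihood-ratio statistic and p-value for a reduced model nested in a full model."""
    stat = -2.0 * (loglik[reduced] - loglik[full])
    return stat, chi2_sf(stat, df)

# Adding age to the gender-only model tests one parameter.
age_stat, age_p = lrt("sex", "age sex", df=1)
# Adding race (two indicator parameters) to the gender-and-age model.
race_stat, race_p = lrt("age sex", "age sex race", df=2)

print(f"age  | sex:      LRT = {age_stat:.3f}, p = {age_p:.4f}")
print(f"race | age, sex: LRT = {race_stat:.3f}, p = {race_p:.4f}")
```

The same statistics can be read off as differences in deviance between the nested models, for example 200.0765 – 192.3055 for the effect of age.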

Goodness-of-fit for Generalized Linear Models

The inferences we’ve discussed thus far have often centered on what the model has. Goodness-of-fit procedures focus on what the model lacks.

As a simple demonstration, consider a fit of the model Y = β0 + β1x + ε, ε ~ N(0, σ²), to the data below. Note that while the fitted slope is significantly greater than zero, we are missing an important association in ignoring the quadratic effect of X. In other words, the model Y = β0 + β1x + β2x² + ε seems a more accurate representation of the X-Y association.

Checking the Model Fit

We will discuss three methods of checking goodness-of-fit:

1. Poisson model checking.

2. Using the deviance.

3. Model residuals.

Poisson Model Checking

This procedure is quite straightforward. You can carry it out in the following steps:

1. Divide the observations into K groups, according to some breakdown of the values of the covariate(s).

2. Compute the predicted number of observations in each of the K groups, based on the fitted model. For the ith group, let μ̂i represent the predicted number and yi represent the observed number of subjects.

3. Use either the chi-square or likelihood ratio goodness-of-fit statistic to test the hypothesis that the model fit is “adequate”. These statistics are each asymptotically χ² with K – 1 degrees of freedom. They are respectively given by:

\[ X^2 = \sum_{i=1}^{K} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}, \qquad G^2 = 2 \sum_{i=1}^{K} y_i \log(y_i / \hat{\mu}_i). \]

Example VI.H

This example has to do with the effect of age on the rate of menopause in women. The data in the table below are grouped: that is, the number of binary “successes” (in this case, menopause) is given along with the total sample size within each age grouping. What is the observed relationship between age and menopause rate?

Model I treats age as nominal, whereas Model II treats age as continuous (using midrange scores). How do the models compare in terms of goodness of fit?

         Number               Observed     Fit for    Fit for
Age    Menopausal   Total   Proportion     Model I   Model II
45-         2        104      0.01923      0.01923    0.02990
47-        10         73      0.13699      0.13699    0.09730
49-        28        102      0.27451      0.27451    0.27373
51-        59        116      0.50862      0.50862    0.56860
53+        62         80      0.77500      0.77500    0.71137

Example VI.H (cont’d)

Note that for a given model, the predicted number of observations within a group is equal to the fitted proportion for that group multiplied by the number of subjects in that group.

Based upon the fit of the second model, for example, the predicted number of menopausal women in the 45-47 age group is (0.02990)(104) = 3.11, while the observed number is 2.

Hence, the chi-square and likelihood-ratio goodness-of-fit test statistics for the second model are respectively given by

\[ X^2 = \frac{(2 - 3.11)^2}{3.11} + \cdots + \frac{(62 - 56.91)^2}{56.91} = 2.77, \]

and

\[ G^2 = 2\{2\log(2/3.11) + \cdots + 62\log(62/56.91)\} = 2.70. \]

What is our conclusion?
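The two statistics for Model II can be reproduced from the table in Example VI.H. The Python sketch below is an illustration of that arithmetic, not part of the original SAS program. It uses the K – 1 = 4 degrees of freedom stated in the checking procedure above, and chi2_sf_df4 is a hypothetical helper based on the closed-form chi-square tail probability for 4 degrees of freedom.

```python
import math

# Grouped menopause data and Model II fitted proportions from Example VI.H.
observed = [2, 10, 28, 59, 62]
totals = [104, 73, 102, 116, 80]
fit_model2 = [0.02990, 0.09730, 0.27373, 0.56860, 0.71137]

# Predicted count in each group = fitted proportion * group size.
predicted = [p * n for p, n in zip(fit_model2, totals)]

# Pearson chi-square and likelihood-ratio goodness-of-fit statistics.
x2 = sum((y - m) ** 2 / m for y, m in zip(observed, predicted))
g2 = 2.0 * sum(y * math.log(y / m) for y, m in zip(observed, predicted))

def chi2_sf_df4(x):
    """Upper-tail chi-square probability for 4 df: exp(-x/2) * (1 + x/2)."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

p_x2 = chi2_sf_df4(x2)
p_g2 = chi2_sf_df4(g2)

print(f"X^2 = {x2:.2f} (p = {p_x2:.3f}),  G^2 = {g2:.2f} (p = {p_g2:.3f})")
```

Both p-values are large under this reference distribution, so neither statistic gives evidence that the fit of Model II is inadequate.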

Obtaining Goodness-of-fit Information in SAS

The following SAS program reads the menopause data into SAS, and then fits the two models of Example VI.H.

options ls=79 nodate;

data;
input agegp meno total;
prop=meno/total;
cards;
46 2 104
48 10 73
50 28 102
52 59 116
53 62 80
;

proc sort; by descending agegp; run;

proc print; var agegp meno total prop; run;

/* Model I: age group treated as nominal */
proc genmod order=data;
class agegp;
model meno/total=agegp;
output out=fitted pred=pred1;
run;

/* Model II: age group treated as continuous */
proc genmod;
model meno/total=agegp / obstats residuals;
output out=fitted reschi=pearsres2 pred=pred2;
run;

data fitted2;
set fitted;
fit2=pred2*total;
run;

proc sort data=fitted2; by agegp; run;

proc print data=fitted2; run;

SAS Output

Obs    agegp    meno    total      prop

  1      53      62       80    0.77500
  2      52      59      116    0.50862
  3      50      28      102    0.27451
  4      48      10       73    0.13699
  5      46       2      104    0.01923

SAS Output (cont’d)

The GENMOD Procedure

Model Information

Data Set                      WORK.DATA2
Distribution                  Binomial
Link Function                 Logit
Response Variable (Events)    meno
Response Variable (Trials)    total
Observations Used             5
Number Of Events              161
Number Of Trials              475

Class Level Information

Class     Levels    Values

agegp          5    53 52 50 48 46

Criteria For Assessing Goodness Of Fit

Criterion             DF        Value     Value/DF

Deviance               0       0.0000        .
Scaled Deviance        0       0.0000        .
Pearson Chi-Square     0       0.0000        .
Scaled Pearson X2      0       0.0000        .
Log Likelihood              -222.0290

Algorithm converged.

SAS Output (cont’d)

Analysis Of Parameter Estimates

                                   Standard        Wald 95%           Chi-
    Parameter     DF   Estimate       Error    Confidence Limits    Square   Pr > ChiSq

    Intercept      1    -3.9318      0.7140    -5.3313   -2.5324     30.32       <.0001
    agegp 53       1     5.1686      0.7626     3.6740    6.6632     45.94       <.0001
    agegp 52       1     3.9663      0.7378     2.5203    5.4123     28.90       <.0001
    agegp 50       1     2.9600      0.7477     1.4945    4.4254     15.67       <.0001
    agegp 48       1     2.0913      0.7910     0.5409    3.6416      6.99       0.0082
    agegp 46       0     0.0000      0.0000     0.0000    0.0000       .           .
    Scale          0     1.0000      0.0000     1.0000    1.0000

SAS Output (cont’d)

The GENMOD Procedure

Model Information

    Data Set                      WORK.FITTED    Predicted Values and Diagnostic Statistics
    Distribution                  Binomial
    Link Function                 Logit
    Response Variable (Events)    meno
    Response Variable (Trials)    total
    Observations Used             5
    Number Of Events              161
    Number Of Trials              475

Criteria For Assessing Goodness Of Fit

    Criterion              DF        Value    Value/DF

    Deviance                3       4.9870      1.6623
    Scaled Deviance         3       4.9870      1.6623
    Pearson Chi-Square      3       4.9962      1.6654
    Scaled Pearson X2       3       4.9962      1.6654
    Log Likelihood               -224.5224

Algorithm converged.

Analysis Of Parameter Estimates

                                   Standard        Wald 95%           Chi-
    Parameter     DF   Estimate       Error    Confidence Limits    Square   Pr > ChiSq

    Intercept      1   -32.2728      3.2532   -38.6489  -25.8968     98.42       <.0001
    agegp          1     0.6259      0.0636     0.5012    0.7507     96.76       <.0001
    Scale          0     1.0000      0.0000     1.0000    1.0000

SAS Output (cont’d)

Observation Statistics

    Observation   meno   total   agegp        Pred       Xbeta         Std     HessWgt       Lower       Upper

              1     62      80      53   0.7113734   0.9020635   0.1678704   16.425704   0.6394652   0.7740082
              2     59     116      52   0.5685953   0.2761224    0.129464   28.454183   0.5055942   0.6294521
              3     28     102      50   0.2737339    -0.97576   0.1349742   20.277973    0.224383   0.3293303
              4     10      73      48   0.0972955   -2.227642   0.2281888   6.4115244   0.0644718   0.1442537
              5      2     104      46   0.0299005   -3.479525   0.3439727   3.0166685   0.0154633   0.0570357

    Observation      Resraw      Reschi      Resdev    StResdev    StReschi      Reslik

              1   5.0901303   1.2559342   1.2873726   1.7565897   1.7136927   1.7368651
              2   -6.957056   -1.304225   -1.298582   -1.795496   -1.803299   -1.799222
              3   0.0791414   0.0175749   0.0175697   0.0221256   0.0221321    0.022128
              4   2.8974271   1.1442788   1.0849015   1.3292416   1.4019919    1.353964
              5   -1.109648   -0.638883   -0.682647   -0.851265   -0.796692   -0.832198

SAS Output (cont’d)

    Obs   agegp   meno   total      prop     pred1     pred2   pearsres2    adjres2      fit2

      1      46      2     104   0.01923   0.01923   0.02990    -0.63888   -0.79669    3.1096
      2      48     10      73   0.13699   0.13699   0.09730     1.14428    1.40199    7.1026
      3      50     28     102   0.27451   0.27451   0.27373     0.01757    0.02213   27.9209
      4      52     59     116   0.50862   0.50862   0.56860    -1.30422   -1.80330   65.9571
      5      53     62      80   0.77500   0.77500   0.71137     1.25593    1.71369   56.9099

The Model Deviance

Likelihood-based inference, such as the Wald or likelihood-ratio tests, may also prove handy in assessing what a model lacks. We can use these procedures to compare simpler, more parsimonious models to more complex models.

For example, the model deviance is the likelihood-ratio test statistic that compares your model to the most complicated model possible. This latter model we call the saturated model, because it has as many parameters as it has unique covariate combinations. In other words, the saturated model yields fitted values that are equal to the observed values.

Denote the loglikelihood for a given model by $L_M$, and let the loglikelihood for the saturated model be $L_S$. Then the deviance for the given model is $D = -2(L_M - L_S)$. Note that the likelihood-ratio test statistic for comparing two nested models (model 0 under the null and model 1 under the alternative) is just the difference in their deviances. That is,

$$-2(L_0 - L_1) = -2(L_0 - L_S) - [-2(L_1 - L_S)] = D_0 - D_1,$$

where $L_0$ and $L_1$ are the respective loglikelihoods for the two models, and $D_0$ and $D_1$ are their respective deviances.
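The deviance arithmetic can be sketched with the two log likelihoods printed in the GENMOD output for the menopause data. This is an illustration under stated assumptions rather than a definitive analysis. It assumes the nominal (class) model plays the role of the saturated model, since it has five parameters for five age groups. The helper names are hypothetical, and the tail probability uses the closed form of the chi-square distribution that holds only for 3 degrees of freedom.

```python
import math

# Log likelihoods copied from the two GENMOD fits of the menopause data.
LOGLIK_NOMINAL = -222.0290     # age group as a 5-level class variable
LOGLIK_CONTINUOUS = -224.5224  # age group as a single linear term


def deviance(loglik_model, loglik_saturated):
    """D = -2 (L_M - L_S)."""
    return -2.0 * (loglik_model - loglik_saturated)


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variate with 3 df (closed form)."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


# Assumption: the nominal model is saturated for these five groups.
d_nominal = deviance(LOGLIK_NOMINAL, LOGLIK_NOMINAL)
d_continuous = deviance(LOGLIK_CONTINUOUS, LOGLIK_NOMINAL)

# Likelihood-ratio statistic for continuous (null) vs. nominal (alternative).
lr_statistic = d_continuous - d_nominal  # D0 - D1
df = 5 - 2                               # difference in number of parameters
p_value = chi2_sf_3df(lr_statistic)

print(round(d_continuous, 4), df, round(p_value, 3))
```

The value of `d_continuous` should agree, up to rounding of the printed log likelihoods, with the deviance that GENMOD reports for the second model.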

Example VI.I

For the menopause data in Example VI.H (based on the SAS analysis), what is the deviance for the model that treats age group as nominal? Can you explain this value?

What is the deviance for the model that treats age group as continuous?

Is the model that treats age group as continuous defensible?

Model Residuals

While the Poisson model-checking procedures (described a few slides earlier) yield an overall assessment of how well the model fits the data over K groups, we may be interested in looking at individual groups. This is directly analogous to the cell residuals we discussed in looking at two-way tables, and we saw for such tables how one can obtain these residuals from SAS.

In the context of generalized linear models, the adjusted residual for the ith of K groups has the form

$$\frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i (1 - h_i)}},$$

where $h_i$ is called the leverage of the ith observation. Leverage is a measure of influence for a particular group: the higher the leverage for a given observation, the greater the impact of that observation on the model fit. The adjusted residuals are asymptotically N(0,1), provided that the underlying group means are sufficiently large.
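To make the role of the leverage concrete, the sketch below rebuilds the adjusted residuals for the continuous-age model from columns of the GENMOD observation statistics. It rests on two assumptions: that the leverage can be recovered as HessWgt times the squared standard error of the linear predictor (Std), and that for this binomial fit the Pearson residual (Reschi) stands in for the Poisson numerator over the square root of the fitted mean. The cutoff of 2 in absolute value is only a rule of thumb chosen for this example.

```python
import math

# Columns copied from the GENMOD observation statistics (age group continuous),
# in the printed order: agegp 53, 52, 50, 48, 46.
reschi = [1.2559342, -1.304225, 0.0175749, 1.1442788, -0.638883]
hess_wgt = [16.425704, 28.454183, 20.277973, 6.4115244, 3.0166685]
std_xbeta = [0.1678704, 0.129464, 0.1349742, 0.2281888, 0.3439727]
st_reschi_sas = [1.7136927, -1.803299, 0.0221321, 1.4019919, -0.796692]

# Assumed leverage: h_i = w_i * Var(x_i' beta_hat).
leverage = [w * s ** 2 for w, s in zip(hess_wgt, std_xbeta)]

# Adjusted residual: Pearson residual divided by sqrt(1 - h_i).
adjusted = [r / math.sqrt(1.0 - h) for r, h in zip(reschi, leverage)]

# Rule-of-thumb flag for groups that the model fits poorly.
CUTOFF = 2.0
flagged = [i for i, a in enumerate(adjusted) if abs(a) > CUTOFF]

for h, a, s in zip(leverage, adjusted, st_reschi_sas):
    print(round(h, 4), round(a, 5), s)
```

If the assumptions hold, the `adjusted` values should match the StReschi column, and the leverages should sum to the number of model parameters (two here).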

Example VI.J

For the menopause data in Example VI.H (based on the SAS analysis), examine the adjusted residuals for each of the two models.

For the model that treats age group as continuous, are there any individual age groups for which the fit is significantly poor?