VI. Introduction to Generalized Linear Models

Stat 5120 – Categorical Data Analysis Dr. Corcoran, Fall 2010

Readings for Section VI: Agresti, Chapter 3

We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models unifies a family of regression models that includes (but is not limited to) logistic, Poisson, and linear regression.

Why Model?

1. Structure of the model helps us to describe how various factors are associated with the outcome variable.
2. Size and strength of the effects of primary factor(s) of interest can be explained while adjusting for possible confounding factors.
3. The simultaneous effects of several explanatory variables can be considered, as opposed to just one or two at a time.

Linear Regression: Modeling the Mean

Recall that linear regression involves modeling the mean of some outcome variable as a function of one or more explanatory variables. That is, we have a sample Y1,…, Yn of independent measures, where the ith subject in our sample has p explanatory variables xi1, xi2,…, xip, and E(Yi) = μi. The linear regression model specifies that

    Yi = β0 + β1 xi1 + β2 xi2 + … + βp xip + εi,

where εi ~ N(0, σ²), for i = 1,…,n. Then

    E(Yi | xi1, xi2,…, xip) = β0 + β1 xi1 + … + βp xip,

and Var(Yi) = σ². This model assumes that the mean of the outcome variable changes linearly with respect to the explanatory variables.

The Three Components of a Generalized Linear Model

Whereas with linear regression we model the mean of the outcome variable directly, a generalized linear model is constructed to model the effects of the covariates on a function of the mean. There are hence three parts or components that comprise a generalized linear model:

1. The random component, which specifies the distribution of the outcome variable.
2. The systematic component, which represents a function of the covariates that will link to the outcome variable.
3. The link function, which determines how the mean of the outcome variable relates to the covariates.

Generalized Linear Models for Binary Data

We have a sample Y1,…, Yn of independent binary outcome measurements. The ith subject in our sample has p explanatory variables xi1, xi2,…, xip. Suppose that P(Yi = 1) = πi and P(Yi = 0) = 1 – πi. Hence, E(Yi) = πi.

The random component in this case is clearly binomial. For the purpose of this class, we will always assume that the systematic component is simply a linear combination of the covariates, or

    β0 + β1 xi1 + … + βp xip.

The remaining question is: how do we model πi as a function of the covariates (i.e., what is the link function)?

Link Functions for the Binomial Distribution

Suppose that we assume that E(Yi) = πi = β0 + β1 xi1 + … + βp xip. We call this the identity link. Does this model have any practical shortcomings?

Since the systematic component can take on any real value, the identity link can yield fitted values of πi outside the range of a probability. We therefore often prefer using a link that will constrain πi to the interval between 0 and 1. The so-called logistic or logit link

    g(πi) = log[πi / (1 – πi)]

accomplishes this. There are other links (e.g., the probit), but we will focus mainly on the logit.

Example VI.A

Consider a generalized linear model for the Alzheimer’s data from Example IV.A. The data are tabulated again here:

                     AD    No AD    Total
    ≥ 1 ε4 Allele   308     1296     1604
    No ε4           262     3096     3358
    Total           570     4392     4962

Note that these data can be viewed as a sample of outcome measures Y1,…,Y4962, where Yi = 1 if the ith individual was diagnosed with AD, and Yi = 0 otherwise.
Moreover, we have a single covariate Xi, which is 1 if the ith individual has at least one copy of the APOE ε4 allele, and 0 otherwise.

Example VI.A (cont’d)

Let πi = P(Yi = 1). Note that this implies E(Yi) = πi. To specify the generalized linear model to explain the relationship between Xi and Yi (that is, between genotype and AD risk), (1) we assume the distribution of Yi (i.e., the random component) is binomial, (2) we employ the standard systematic component, and (3) we use the logit link. This yields the model

    logit(πi) = log[πi / (1 – πi)] = β0 + β1 xi.

This happens to be the logistic regression model for these data. The strong assumption is that the log odds of response change linearly with respect to the covariate(s) in the model.

Example VI.A (cont’d)

First, how do we interpret the coefficients of this model?

What is logit(πi | Xi = 1)? What is logit(πi | Xi = 0)? What is the log odds ratio with respect to AD risk, comparing those with at least one ε4 to those with no ε4?

The regression coefficient of a binary covariate in a logistic regression model represents the log odds ratio comparing the group identified as ‘1’ to the group identified as ‘0’.

More generally, the coefficient of any arbitrary covariate in a logistic regression model represents the log odds ratio for subjects who differ by one unit with respect to the covariate.

Fitting the Generalized Linear Model for the Alzheimer’s Data in SAS

We obtain parameter estimates for a generalized linear model using the method of maximum likelihood – these estimates typically cannot be computed in closed form. The following SAS program shows how to read the AD data into SAS, and then obtain a fit for the regression coefficients in the model of Example VI.A.
    options ls=79 nodate;

    /* Read the 2 x 2 table as one record per cell: e4 status, AD status, cell count */
    data;
      input e4 ad count;
      cards;
    1 1 308
    1 0 1296
    0 1 262
    0 0 3096
    ;

    /* Sort so that the '1' categories appear first in the PROC FREQ table */
    proc sort; by descending ad descending e4; run;

    /* Chi-square tests and relative risk / odds ratio estimates for the 2 x 2 table */
    proc freq order=data;
      tables e4*ad / chisq relrisk;
      weight count;
    run;

    /* Logistic regression of AD on e4; 'descending' models P(ad = 1) */
    proc genmod descending;
      model ad=e4 / dis=bin link=logit
                    type3;
      weight count;
    run;

SAS Output

    The FREQ Procedure

    Table of e4 by ad

    e4        ad
    Frequency|
    Percent  |
    Row Pct  |
    Col Pct  |       1|       0|  Total
    ---------+--------+--------+
           1 |    308 |   1296 |   1604
             |   6.21 |  26.12 |  32.33
             |  19.20 |  80.80 |
             |  54.04 |  29.51 |
    ---------+--------+--------+
           0 |    262 |   3096 |   3358
             |   5.28 |  62.39 |  67.67
             |   7.80 |  92.20 |
             |  45.96 |  70.49 |
    ---------+--------+--------+
    Total         570     4392     4962
                11.49    88.51   100.00

    Statistics for Table of e4 by ad

    Statistic                      DF       Value      Prob
    -------------------------------------------------------
    Chi-Square                      1    138.7375    <.0001
    Likelihood Ratio Chi-Square     1    129.9802    <.0001
    Continuity Adj. Chi-Square      1    137.6186    <.0001
    Mantel-Haenszel Chi-Square      1    138.7095    <.0001
    Phi Coefficient                        0.1672
    Contingency Coefficient                0.1649
    Cramer's V                             0.1672

    Fisher's Exact Test
    ----------------------------------
    Cell (1,1) Frequency (F)       308
    Left-sided Pr <= F          1.0000
    Right-sided Pr >= F      3.366E-30

    Table Probability (P)    6.085E-30
    Two-sided Pr <= P        4.889E-30

    Estimates of the Relative Risk (Row1/Row2)

    Type of Study                  Value    95% Confidence Limits
    -------------------------------------------------------------
    Case-Control (Odds Ratio)     2.8083      2.3527      3.3522
    Cohort (Col1 Risk)            2.4611      2.1106      2.8697
    Cohort (Col2 Risk)            0.8764      0.8540      0.8993

    The GENMOD Procedure

    Model Information

    Data Set                 WORK.DATA1
    Distribution             Binomial
    Link Function            Logit
    Dependent Variable       ad
    Scale Weight Variable    count

    Number of Observations Read       4
    Number of Observations Used       4
    Sum of Weights                 4962
    Number of Events                  2
    Number of Trials                  4

    Response Profile

    Ordered              Total
      Value    ad    Frequency
          1    1           570
          2    0          4392

    PROC GENMOD is modeling the probability that ad='1'.

    Criteria For Assessing Goodness Of Fit

    Criterion             DF        Value     Value/DF
    Deviance               2    3408.7579    1704.3790
    Scaled Deviance        2    3408.7579    1704.3790
    Pearson Chi-Square     2    4962.0000    2481.0000
    Scaled Pearson X2      2    4962.0000    2481.0000
    Log Likelihood             -1704.3790

    Algorithm converged.

    Analysis Of Parameter Estimates

                            Standard       Wald 95%           Chi-
    Parameter  DF  Estimate    Error   Confidence Limits    Square  Pr > ChiSq
    Intercept   1   -2.4695   0.0643   -2.5956   -2.3434   1473.15      <.0001
    e4          1    1.0326   0.0903    0.8556    1.2096    130.69      <.0001
    Scale       0    1.0000   0.0000    1.0000    1.0000

Example VI.A (cont’d)

The SAS output indicates that β̂0 = –2.4695 and β̂1 = 1.0326.

According to the fit of the regression model, what are the estimated log odds of AD for someone with no APOE ε4 allele?

What are the estimated log odds for someone with at least one high-risk allele?

What is the estimated log odds ratio of AD risk comparing ε4 carriers to non-carriers? How does this estimate compare with the sample odds ratio computed using the data in the 2 x 2 table?
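
One way to check answers to these questions by hand is sketched below. It uses only the two reported estimates (–2.4695 and 1.0326) and the four cell counts from the 2 x 2 table; the rounding to three or four decimals is an assumption made for display.

```latex
\begin{align*}
\operatorname{logit}(\hat{\pi}_i \mid X_i = 0) &= \hat{\beta}_0 = -2.4695 \\
\operatorname{logit}(\hat{\pi}_i \mid X_i = 1) &= \hat{\beta}_0 + \hat{\beta}_1
   = -2.4695 + 1.0326 = -1.4369 \\
\log \widehat{\mathrm{OR}} &= \hat{\beta}_1 = 1.0326,
   \qquad \widehat{\mathrm{OR}} = e^{1.0326} \approx 2.808 \\
\text{sample OR} &= \frac{308 \times 3096}{1296 \times 262}
   = \frac{953568}{339552} \approx 2.8083
\end{align*}
```

The model-based and sample odds ratios agree up to rounding. This is expected in this example: with an intercept and a single binary covariate, the model has one parameter per group, so the fitted probabilities reproduce the observed proportions 262/3358 and 308/1604.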
