Applied With R

Generalized Linear Models: An Introduction

John Fox

WU Wien May/June 2006

© 2006 by John Fox

1. Goals

• To introduce the format and structure of generalized linear models.
• To show how the familiar linear, logit, and probit models fit into the GLM framework.
• To introduce Poisson generalized linear models for count data.
• To describe diagnostics for generalized linear models.

• A synthesis due to Nelder and Wedderburn, generalized linear models (GLMs) extend the range of application of linear statistical models by accommodating response variables with non-normal conditional distributions.
• Except for the error, the right-hand side of a GLM is essentially the same as for a linear model.


2. The Structure of Generalized Linear Models

A generalized linear model consists of three components:

1. A random component, specifying the conditional distribution of the response variable, Yi, given the predictors.
– Traditionally, the random component is a univariate "exponential family" — the normal (Gaussian), binomial, Poisson, gamma, or inverse-Gaussian family of distributions — but generalized linear models have been extended beyond the univariate exponential families.

2. A linear function of the regressors, called the linear predictor,
ηi = α + β1Xi1 + ··· + βkXik
on which the expected value µi of Yi depends.
– The X's may include quantitative explanatory variables, but they may also include transformations of explanatory variables, polynomial terms, contrasts generated from factors, interaction regressors, etc.

3. An invertible link function g(µi) = ηi, which transforms the expectation of the response to the linear predictor.
– The inverse of the link function is sometimes called the mean function: g⁻¹(ηi) = µi.
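In R, these three components map directly onto a call to glm(): the model formula supplies the linear predictor, and the family argument supplies the random component together with its link. A minimal sketch, with hypothetical variable and data-frame names:

    mod <- glm(y ~ x1 + x2,                       # linear predictor: eta = alpha + beta1*x1 + beta2*x2
               family = binomial(link = "logit"), # random component and link function
               data = mydata)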

– Standard link functions and their inverses are shown in the following table:

Link                    ηi = g(µi)                µi = g⁻¹(ηi)
identity                µi                        ηi
log                     loge(µi)                  e^ηi
inverse                 µi^−1                     ηi^−1
inverse-square          µi^−2                     ηi^−1/2
square-root             √µi                       ηi^2
logit                   loge[µi/(1 − µi)]         1/(1 + e^−ηi)
probit                  Φ⁻¹(µi)                   Φ(ηi)
log-log                 −loge[−loge(µi)]          exp[−exp(−ηi)]
complementary log-log   loge[−loge(1 − µi)]       1 − exp[−exp(−ηi)]

– The logit, probit, log-log, and complementary log-log links are for binomial data, where Yi represents the observed proportion and µi the expected proportion of 'successes' in ni binomial trials — that is, µi is the probability of a success.
∗ For the probit link, Φ is the standard-normal cumulative distribution function, and Φ⁻¹ is the standard-normal quantile function.
∗ An important special case is binary data, where all of the binomial trials are 1, and therefore all of the observed proportions Yi are either 0 or 1. This is the case that we examined previously.
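In R, make.link() returns a standard link function and its inverse, which can be used to check entries in the table; for example, for the probit link:

    pl <- make.link("probit")
    pl$linkfun(0.975)  # g(mu) = Phi^{-1}(mu): approximately 1.96
    pl$linkinv(1.96)   # g^{-1}(eta) = Phi(eta): approximately 0.975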


• For distributions in the exponential families, the conditional variance of Y is a function of the mean µ together with a dispersion parameter φ (as shown in the table below).
– For the binomial and Poisson distributions, the dispersion parameter is fixed to 1.
– For the Gaussian distribution, the dispersion parameter is the usual error variance, which we previously symbolized by σε² (and which doesn't depend on µ).

Family             Canonical Link   Range of Yi           V(Yi|ηi)
Gaussian           identity         (−∞, +∞)              φ
binomial           logit            (0, 1, ..., ni)/ni    µi(1 − µi)/ni
Poisson            log              0, 1, 2, ...          µi
gamma              inverse          (0, ∞)                φµi²
inverse-Gaussian   inverse-square   (0, ∞)                φµi³
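In R, each family object carries its variance function, so the last column of the table can be inspected directly:

    binomial()$variance  # function(mu) mu * (1 - mu); the ni enter via prior weights
    poisson()$variance   # function(mu) mu
    Gamma()$variance     # function(mu) mu^2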

• The canonical link for each family is not only the one most commonly used, but also arises naturally from the general formula for distributions in the exponential families.
– Other links may be more appropriate for the specific problem at hand.
– One of the strengths of the GLM paradigm — in contrast, for example, with transformation of the response variable in a linear model — is the separation of the link function from the conditional distribution of the response.

• GLMs are typically fit to data by the method of maximum likelihood.
– Denote the maximum-likelihood estimates of the regression parameters as α̂, β̂1, ..., β̂k.
∗ These imply an estimate of the mean of the response, µ̂i = g⁻¹(α̂ + β̂1xi1 + ··· + β̂kxik).
– The log-likelihood for the model, maximized over the regression coefficients, is
loge L0 = Σ(i=1 to n) loge p(µ̂i, φ; yi)
where p(·) is the probability or probability-density function corresponding to the family employed.
– A 'saturated' model, which dedicates one parameter to each observation, and hence fits the data perfectly, has log-likelihood
loge L1 = Σ(i=1 to n) loge p(yi, φ; yi)
– Twice the difference between these log-likelihoods defines the residual deviance under the model, a generalization of the residual sum of squares for linear models:
D(y; µ̂) = 2(loge L1 − loge L0)
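For a Poisson GLM, the residual deviance can be reproduced from these two log-likelihoods — a sketch, assuming a fitted Poisson model mod with response vector y:

    logL0 <- as.numeric(logLik(mod))                # log-likelihood of the fitted model
    logL1 <- sum(dpois(y, lambda = y, log = TRUE))  # saturated model: mu-hat_i = y_i
    2 * (logL1 - logL0)                             # equals deviance(mod)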

– Dividing the deviance by the estimated dispersion produces the scaled deviance: D(y; µ̂)/φ̂.
– Likelihood-ratio tests can be formulated by taking differences in the deviance for nested models.
– Wald tests for individual coefficients are formulated using the estimated asymptotic standard errors of the coefficients.

• Some familiar examples:
– Combining the identity link with the Gaussian family produces the normal linear model.
∗ The maximum-likelihood estimates for this model are the ordinary least-squares estimates.
– Combining the logit link with the binomial family produces the logistic-regression model (linear-logit model).
– Combining the probit link with the binomial family produces the linear probit model.

• Although the logit and probit links are familiar, the log-log and complementary log-log links for binomial data are not.
– These links are compared in the following graph.
– The log-log and complementary log-log links may be appropriate when the probability of the response as a function of the linear predictor approaches 0 and 1 asymmetrically.
– The log-log link can be applied via the complementary log-log link by exchanging the roles of 0's and 1's in the response Y.
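In R, likelihood-ratio tests for nested GLMs come from differences in deviance via anova(), and Wald tests appear in the summary() output; a sketch, assuming mod0 is nested in mod1:

    anova(mod0, mod1, test = "Chisq")  # likelihood-ratio test from the drop in deviance
    summary(mod1)                      # z-statistics are Wald tests for individual coefficients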

[Figure: the logit, probit, log-log, and complementary log-log links compared — µi = g⁻¹(ηi) plotted against ηi.]

3. Poisson GLMs for Count Data

• Poisson generalized linear models arise in two common contexts that are formally identical but substantively distinguishable:
1. when the response variable in a regression model takes on non-negative integer values, such as a count;
2. to analyze associations among categorical variables in a contingency table of counts.
• The default link for the Poisson family is the log link.


3.1 Example: Ornstein's Interlocking-Directorate Data

• I will employ data collected by Ornstein on interlocking director and top-executive positions among 248 major Canadian firms.
– Ornstein performed a least-squares regression of the number of interlocks maintained by each firm on the firm's assets, and dummy variables for the firm's nation of control (Canada, US, UK, Other) and sector of operation (10 categories).

• Because the response variable is a count, a Poisson linear model might be preferable.
– The marginal distribution of number of interlocks shows many zero counts and an extreme positive skew:

[Figure: histogram (frequency) of number of interlocks, ranging from 0 to about 100.]
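The data are available as the Ornstein data frame in the carData package (with variables interlocks, assets, nation, and sector); the skewed marginal distribution can be seen with:

    library(carData)
    hist(Ornstein$interlocks, breaks = 25,
         xlab = "Number of Interlocks", main = "")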

– Fitting a Poisson GLM with log link to Ornstein's data produces the following results:

                                          Coefficient   SE
Constant                                  2.32          0.052
Assets                                    0.0000209     0.0000012
Nation of Control (baseline: Canada)
  Other                                   −0.163        0.073
  United Kingdom                          −0.577        0.089
  United States                           −0.826        0.049
Sector (baseline: Agriculture and Food)
  Banking                                 −0.409        0.156
  Construction                            −0.620        0.211
  Finance                                 0.677         0.069
  Holding Company                         0.208         0.119
  Manufacturing                           0.0527        0.0752
  Merchandizing                           0.178         0.087
  Mining                                  0.621         0.069
  Transportation                          0.678         0.075
  Wood and Forest Products                0.712         0.075
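A sketch of the corresponding fit in R (the factor baselines in the packaged data may need releveling to match the table exactly):

    mod.ornstein <- glm(interlocks ~ assets + nation + sector,
                        family = poisson, data = Ornstein)
    summary(mod.ornstein)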


∗ An analysis of deviance table for the model shows that all three explanatory variables have highly statistically significant effects:

Source               G²       df   p
Assets               390.90   1    ≪.0001
Nation of Control    328.94   3    ≪.0001
Sector               361.46   9    ≪.0001

∗ The deviance for the null model (with only a constant) is 3737.0, and 1887.4 for the full model; thus the analog of the squared multiple correlation is
R² = 1 − 1887.4/3737.0 = .495

– Effect displays for the terms in the Poisson-regression model:

[Figure: effect displays for (a) Assets (billions of dollars), (b) Nation of Control, and (c) Sector; the vertical axis is log Number of Interlocks, labeled on the Number of Interlocks scale.]
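The analysis-of-deviance table and the R² analog can be computed as follows (Anova() is from the car package, and by default gives Type-II likelihood-ratio tests for a GLM):

    library(car)
    Anova(mod.ornstein)  # Type-II likelihood-ratio tests for each term
    1 - mod.ornstein$deviance / mod.ornstein$null.deviance  # about .495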

3.2 Loglinear Models for Contingency Tables

• Poisson GLMs may also be used to fit loglinear models to a contingency table of frequency counts, where the object is to model association among the variables in the table.
• The variables constituting the classifications of the table are treated as 'explanatory variables' in the Poisson model, while the cell count plays the role of the 'response.'
• We previously examined Campbell et al.'s data on voter turnout in the 1956 U.S. presidential election.
– We used a binomial logit model to analyze a three-way contingency table for turnout by perceived closeness of the election and intensity of partisan preference.
– The binomial logit model treats turnout as the response.
• An alternative is to construct a loglinear model for the expected cell count.
– This model looks very much like a three-way ANOVA model, where in place of the cell mean we have the log cell expected count:
loge µijk = µ + α1(i) + α2(j) + α3(k) + α12(ij) + α13(ik) + α23(jk) + α123(ijk)
– Here, variable 1 is perceived closeness of the election; variable 2 is intensity of preference; and variable 3 is turnout.
– Although a term such as α12(ij) looks like an 'interaction,' it actually models the partial association between variables 1 and 2.
– The three-way term α123(ijk) allows the association between any pair of variables to be different in different categories of the third variable; it thus represents an interaction in the usual sense of that concept.
• In fitting the loglinear model to data, we can use sigma-constraints on the parameters, much as we would for an ANOVA model.


• In the context of a three-way contingency table, the loglinear model above is a saturated model, because it has as many independent parameters (12) as there are cells in the table.
• I fit this model to the American Voter data, obtaining the following 'Type-II' analysis of deviance table:

Source                              G²       df   p
Closeness                           200.61   1    ≪.0001
Preference                          56.03    2    ≪.0001
Turnout                             376.26   1    ≪.0001
Closeness × Preference              1.23     2    .540
Closeness × Turnout                 8.29     1    .004
Preference × Turnout                19.11    2    <.0001
Closeness × Preference × Turnout    7.12     2    .028

– Notice that the likelihood-ratio test for the three-way term Closeness × Preference × Turnout is identical to the test for the Closeness × Preference interaction in the logit model in which Turnout is the response variable.
– In general, as long as we fit the parameters for the associations among the explanatory variables (here Closeness × Preference and, of course, its lower-order relatives, Closeness and Preference) and for the marginal distribution of the response (Turnout), the loglinear model for a contingency table is equivalent to a logit model.
– There is, therefore, no real advantage to using a loglinear model in this setting.
– Loglinear models, however, can be used to model association in other contexts.
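A sketch of the saturated loglinear fit in R, assuming a hypothetical data frame campbell in frequency form, with one row per cell and variables closeness, preference, turnout, and count:

    mod.loglin <- glm(count ~ closeness * preference * turnout,
                      family = poisson, data = campbell)
    car::Anova(mod.loglin)  # Type-II analysis of deviance, as in the table above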

3.3 Over-Dispersed Binomial and Poisson Models

• The binomial and Poisson GLMs fix the dispersion parameter φ to 1.
• It is possible to fit versions of these models in which the dispersion is a free parameter, to be estimated along with the coefficients of the linear predictor.
– The resulting error distribution is not an exponential family.
• The regression coefficients are unaffected by allowing dispersion different from 1, but the coefficient standard errors are multiplied by the square-root of φ̂.
– Because the estimated dispersion typically exceeds 1, this inflates the standard errors.
– That is, failing to account for 'over-dispersion' produces misleadingly small standard errors.
• So-called over-dispersed binomial and Poisson models arise in several different circumstances.
– For example, in modeling proportions, it is possible that the probability of success µ varies for different individuals who share identical values of the predictors (this is called 'unmodeled heterogeneity');
– or the individual successes and failures for a 'binomial' observation are not independent, as required by the binomial distribution.
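In R, the quasipoisson and quasibinomial families estimate the dispersion rather than fixing it at 1; for example, for the Ornstein model:

    mod.quasi <- update(mod.ornstein, family = quasipoisson)
    summary(mod.quasi)$dispersion  # estimated dispersion phi-hat
    # coefficients are identical to the Poisson fit;
    # standard errors are inflated by sqrt(phi-hat)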


4. Diagnostics for GLMs

• Most regression diagnostics extend straightforwardly to generalized linear models.
• These extensions typically take advantage of the computation of maximum-likelihood estimates for generalized linear models by iterated weighted least squares (IWLS — the procedure typically used to fit GLMs).

4.1 Outlier, Leverage, and Influence Diagnostics

4.1.1 Hat-Values

• Hat-values for a generalized linear model can be taken directly from the final iteration of IWLS.
• They have the usual interpretation — except that the hat-values in a GLM depend on Y as well as on the configuration of the X's.
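In R, hatvalues() extracts these directly from a fitted GLM; for instance, using the quasi-Poisson model above:

    hatvalues(mod.quasi)  # leverage, from the final IWLS iteration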

4.1.2 Residuals

• Several kinds of residuals can be defined for generalized linear models:
– Response residuals are simply the differences between the observed response and its estimated expected value: Yi − µ̂i.
– Working residuals are the residuals from the final WLS fit.
∗ These may be used to define partial residuals for component-plus-residual plots (see below).
– Pearson residuals are case-wise components of the Pearson goodness-of-fit statistic for the model:
φ̂^(1/2) (Yi − µ̂i) / √V̂(Yi|ηi)
where φ̂ is the dispersion parameter for the model and V(Yi|ηi) is the variance of the response given the linear predictor.
– Standardized Pearson residuals correct for the conditional response variation and for the leverage of the observations:
R_Pi = (Yi − µ̂i) / √[V̂(Yi|ηi)(1 − hi)]
– Deviance residuals, Di, are the square-roots of the case-wise components of the residual deviance, attaching the sign of Yi − µ̂i.
• Standardized deviance residuals are
R_Di = Di / √[φ̂(1 − hi)]
• Several different approximations to studentized residuals have been suggested.
– To calculate exact studentized residuals would require literally refitting the model deleting each observation in turn, and noting the decline in the deviance.
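In R, the first four of these are available through residuals() with a type argument, and rstandard() gives standardized deviance residuals:

    residuals(mod.quasi, type = "response")  # Y_i - mu-hat_i
    residuals(mod.quasi, type = "working")   # residuals from the final WLS fit
    residuals(mod.quasi, type = "pearson")   # Pearson residuals
    residuals(mod.quasi, type = "deviance")  # deviance residuals
    rstandard(mod.quasi)                     # standardized deviance residuals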


– Here is an approximation due to Williams:
E*i = √[(1 − hi)R²_Di + hi R²_Pi]
where, once again, the sign is taken from Yi − µ̂i.
– A Bonferroni outlier test using the standard normal distribution may be based on the largest absolute studentized residual.

4.1.3 Influence Measures

• An approximation to Cook's distance influence measure is
Di = [R²_Pi / (φ̂(k + 1))] × [hi / (1 − hi)]
– Note: Influence on coefficients = outlyingness × leverage.
• Approximate values of DFBETAij and DFBETASij (influence and standardized influence on each coefficient) may be obtained directly from the final iteration of the IWLS procedure.
• There are two largely similar extensions of added-variable plots to generalized linear models, one due to Wang and another to Cook and Weisberg.
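In R, the approximate studentized residuals and influence measures are available as follows (outlierTest() is from the car package):

    rstudent(mod.quasi)          # approximate studentized residuals
    cooks.distance(mod.quasi)    # approximate Cook's distances
    dfbeta(mod.quasi)            # approximate influence on each coefficient
    car::outlierTest(mod.quasi)  # Bonferroni test for the largest |studentized residual|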

John Fox WU Wien May/June 2006 John Fox WU Wien May/June 2006 Generalized Linear Models: An Introduction 33 Generalized Linear Models: An Introduction 34 The following graph shows hat-values, studentized residuals, and Cook’s – One observation—number 1, the corporation with the largest assets— • distances for the quasi-Poisson model fit to Ornstein’s interlocking direc- stands out by combining a very large hat-value with the biggest torate data: absolute studentized residual. – This point is not a statistically significant outlier Studentized Residuals

1 −3 −2 −1 0 1 2 3

0.0 0.1 0.2 0.3 0.4 0.5 0.6

Hat−Values
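A graph of this kind can be drawn with influencePlot() in the car package, which plots studentized residuals against hat-values, with circle areas proportional to Cook's distance:

    car::influencePlot(mod.quasi)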


– As shown in a DFBETA plot, observation 1 makes the coefficient of assets substantially smaller than it would otherwise be (recall that the coefficient for assets is 0.02085):

[Figure: DFBETA for the assets coefficient plotted against observation index; observation 1 falls far below the rest.]

∗ In this case, the approximate DFBETA is quite accurate: If observation 1 is deleted, the assets coefficient increases to 0.02602.

4.2 Nonlinearity Diagnostics

• Component-plus-residual plots also extend straightforwardly to generalized linear models.
– Nonparametric smoothing of the resulting scatterplots can be important to interpretation, especially in models for binary responses, where the discreteness of the response makes the plots difficult to examine.
– Similar effects can occur for binomial and Poisson data.
• Component-plus-residual plots use the linearized model from the last step of the IWLS fit.
– For example, the partial residual for Xj adds the working residual to BjXij.
– The component-plus-residual plot graphs the partial residual against Xj.

• An illustrative component+residual plot, for assets in Ornstein's interlocking-directorate quasi-Poisson regression:

[Figure: component+residual plot for assets (billions of dollars), with linear and nonparametric fits.]

– This plot is difficult to examine because of the large positive skew in assets, but it appears as if the assets slope is a good deal steeper at the left than at the right.
– I therefore investigated transforming assets down the ladder of powers and roots, eventually arriving at the log transformation, the component+residual plot for which appears quite straight:

[Figure: component+residual plot for log10 assets (billions of dollars).]


• Alternative effect displays for assets in the transformed model:

[Figure: effect displays for assets in the transformed model — (a) against assets (billions of dollars) on the original scale, (b) on the log scale; the vertical axis is log Number of Interlocks, labeled on the Number of Interlocks scale.]

• Note the relationship between the problems of influence and nonlinearity in this example:
– Observation 1 was influential in the original regression because its very large assets gave it high leverage, and because un-modelled nonlinearity put the observation below the erroneously linear fit for assets, pulling the regression surface towards it.
– Log-transforming assets fixes both of these problems.
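Effect displays like these are produced by the effects package; a sketch, using the transformed model from above:

    library(effects)
    plot(allEffects(mod.trans))  # one display per high-order term in the model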
