Part V: Binary Response Data
BIO 233, Spring 2015

Western Collaborative Group Study

Prospective study of coronary heart disease (CHD)

• Recruited 3,524 men aged 39-59 in 1960-61
  ⋆ employed at 10 companies in California
  ⋆ baseline survey at intake
  ⋆ annual surveys until December 1969
• Exclusions:
  ⋆ 78 men who were actually outside the pre-specified age range
  ⋆ 141 subjects with CHD manifest at intake
  ⋆ 106 employees at one firm that excluded itself from follow-up
  ⋆ 45 subjects excluded prior to the first follow-up because of loss to follow-up, non-CHD death, or self-exclusion
• n = 3,154 study participants at risk for CHD

• Our primary goal is to investigate the relationship between ‘behavior pattern’ and risk of CHD
• Participants were categorized into one of two behavior pattern groups:
  Type A: characterized by enhanced aggressiveness, ambitiousness, competitive drive, and a chronic sense of urgency
  Type B: characterized as more relaxed and non-competitive
• Data and documentation are available on the class website

    > ## Load the WCGS data; check its dimensions and variable names
    > load("WCGS_data.dat")
    >
    > dim(wcgs)
    [1] 3154   11
    > names(wcgs)
     [1] "age"    "ht"     "wt"     "sbp"    "dbp"    "chol"   "ncigs"  "behave"
     [9] "chd"    "type"   "time"

• The variables (in column order) are:

     1  age     age, years
     2  ht      height, in
     3  wt      weight, lbs
     4  sbp     systolic blood pressure, mmHg
     5  dbp     diastolic blood pressure, mmHg
     6  chol    cholesterol, mg/dL
     7  ncigs   number of cigarettes smoked per day
     8  behave  behavior type, 0/1 = B/A
     9  chd     occurrence of a CHD event during follow-up
    10  type    type of CHD event
    11  time    time post-recruitment of the CHD event, days

• Values for the ‘risk factor’ covariates are those measured at the intake visit
• The three CHD-related variables were measured prospectively
  ⋆ over approximately 8.5 years of follow-up

• Important note:
  ⋆ 423 men were lost to follow-up
  ⋆ 140 men died during the follow-up
• For our purposes, we are going to ignore these issues and consider the binary outcome:

$$Y = \begin{cases} 1 & \text{occurrence of CHD during follow-up} \\ 0 & \text{otherwise} \end{cases}$$

• In the dataset, the response variable is ‘chd’:

    > ## Counts of the response, and the overall percentage with CHD
    > table(wcgs$chd)
       0    1
    2897  257
    > round(mean(wcgs$chd) * 100, 1)
    [1] 8.1

• Primary exposure of interest is ‘behave’:

    > ## Counts of the exposure, and the percentage who are type A
    > table(wcgs$behave)
       0    1
    1565 1589
    > round(mean(wcgs$behave) * 100, 1)
    [1] 50.4

• Cross-tabulation and exposure-specific incidence

    > ## Rows: behave (0 = B, 1 = A); columns: chd (0 = no, 1 = yes)
    > table(wcgs$behave, wcgs$chd)
           0    1
      0 1486   79
      1 1411  178
    > round(tapply(wcgs$chd, list(wcgs$behave), FUN=mean) * 100, 1)
       0    1
     5.0 11.2

• The probability of the occurrence of CHD during follow-up among type B men is estimated to be 0.050
  ⋆ the expected percentage of type B men who will develop CHD during follow-up is 5.0%
• The probability of the occurrence of CHD during follow-up among type A men is estimated to be 0.112
  ⋆ the expected percentage of type A men who will develop CHD during follow-up is 11.2%
• We often use the generic term ‘risk’ for these probabilities
• Either way, it’s important to remember that these statements refer to populations of men, rather than to the individuals themselves
  ⋆ we’ve estimated a common or average risk of CHD
  ⋆ referred to as the marginal risk
  ⋆ ‘marginal’ in the sense that it does not condition on anything else

Contrasts

• As stated at the start, the primary goal is to investigate the relationship between behavior pattern and risk of CHD
• We’ve characterized the risk for each type, but the goal requires a comparison of the risks
• To perform such a comparison we need to choose a contrast
• Risk difference:
  ⋆ RD = 0.112 - 0.050 = 0.062
  ⋆ the difference in the estimated risk of CHD during follow-up between type A and type B men is 0.062 (6.2 percentage points)
  ⋆ expresses the additional risk of CHD associated with being type A as an absolute increase
• Relative risk:
  ⋆ RR = 0.112 / 0.050 = 2.24
  ⋆ the ratio of the estimated risk of CHD during follow-up for type A men to the estimated risk for type B men
  ⋆ expresses the additional risk of CHD associated with being type A as a relative increase
• As with the interpretation of the risks themselves, these statements refer to contrasts between populations
  ⋆ population of type A men vs. population of type B men
• Contrasts are ‘marginal’ in the sense that we don’t condition on anything else when comparing the two populations
  ⋆ i.e. we don’t adjust for anything

• Important to note that the RD and RR are related
  ⋆ the relationship depends on the value of the response probability for the ‘referent’ group: $\text{RD} = P(Y = 1 \mid X = 0) \times (\text{RR} - 1)$
• RD across different combinations of $P(Y = 1 \mid X = 0)$ and RR (NA: the implied risk in the exposed group would not be a valid probability below 1)

    P(Y = 1 | X = 0):    0.01     0.05     0.10     0.20     0.50
    RR = 0.2           -0.008   -0.040    -0.08    -0.16    -0.40
    RR = 0.5           -0.005   -0.025    -0.05    -0.10    -0.25
    RR = 0.9           -0.001   -0.005    -0.01    -0.02    -0.05
    RR = 1.0            0        0         0        0        0
    RR = 1.1            0.001    0.005     0.01     0.02     0.05
    RR = 1.5            0.005    0.025     0.05     0.10     0.25
    RR = 3.0            0.020    0.100     0.20     0.40     NA
    RR = 5.0            0.040    0.200     0.40     NA       NA

• The RD may be small even if the RR is big
  ⋆ for either ‘protective’ or ‘detrimental’ effects
• When the RR is small, the RD is also small unless $P(Y = 1 \mid X = 0)$ is big
  ⋆ i.e. a ‘common’ outcome
• However, a small RR operating on a large population could correspond to a big ‘public health’ impact
  ⋆ this rationale is often cited in studies of air pollution
• To move beyond simple contrasts, we need a more general framework for modeling the relationship between the binary response and a vector of covariates

GLMs for binary data

• We’ve noted that the Bernoulli distribution is the only possible distribution for binary data
  ⋆ $Y \sim \text{Bernoulli}(\mu)$

$$f_Y(y; \mu) = \mu^{y} (1 - \mu)^{1 - y}$$

$$f_Y(y; \theta, \phi) = \exp\{y\theta - \log(1 + \exp\{\theta\})\}$$

$$\theta = \log\left(\frac{\mu}{1 - \mu}\right), \qquad a(\phi) = 1, \qquad b(\theta) = \log(1 + \exp\{\theta\}), \qquad c(y, \phi) = 0$$

• The log-likelihood is

$$\ell(\beta; y) = \sum_{i=1}^{n} \left[ y_i \theta_i - b(\theta_i) \right] = \sum_{i=1}^{n} \left[ y_i \theta_i - \log(1 + \exp\{\theta_i\}) \right]$$

  where $\theta_i$ is a function of $\beta$ via

$$g(\mu_i) = X_i^T \beta \qquad \text{and} \qquad \mu_i = \frac{\exp\{\theta_i\}}{1 + \exp\{\theta_i\}}$$

• The score function for $\beta_j$ is

$$\frac{\partial \ell(\beta; y)}{\partial \beta_j} = \sum_{i=1}^{n} \frac{\partial \mu_i}{\partial \eta_i} \, \frac{X_{j,i}}{\mu_i (1 - \mu_i)} \, (y_i - \mu_i)$$

  where the expression for $\partial \mu_i / \partial \eta_i$ depends on the choice of the link function $g(\cdot)$
• Since the log-likelihood is only a function of $\beta$, the expected information matrix is given by the $(p + 1) \times (p + 1)$ matrix:

$$\mathcal{I}_{\beta\beta} = X^T W X$$

  where $X$ is the design matrix for the model and $W$ is a diagonal matrix with $i$th diagonal element

$$W_i = \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^{2} \frac{1}{\mu_i (1 - \mu_i)}$$

Link functions

• In a GLM, the systematic component is given by

$$g(\mu_i) = \eta_i = X_i^T \beta$$

• We’ve noted previously that, for binary data, there are various options for link functions including:

    linear:                 $g(\mu_i) = \mu_i$
    log:                    $g(\mu_i) = \log(\mu_i)$
    logit:                  $g(\mu_i) = \log\left(\frac{\mu_i}{1 - \mu_i}\right)$
    probit:                 $g(\mu_i) = \text{probit}(\mu_i) = \Phi^{-1}(\mu_i)$
    complementary log-log:  $g(\mu_i) = \log\{-\log(1 - \mu_i)\}$

Q: How do we make a choice from among these options?
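As one way into this question, the sketch below evaluates the inverse of each candidate link over a grid of linear-predictor values and checks whether the implied means stay inside (0, 1). It is an illustration only, not part of the course materials: the names `eta`, `inv_links`, `mu` and `in_range` and the grid from -3 to 3 are assumptions made for this example, and only base R functions are used.

```r
# Illustrative sketch (names and grid are assumptions, not course code):
# inverse links for binary-response GLMs on a grid of linear-predictor values.
eta <- seq(-3, 3, by = 0.5)

# Inverse link functions, mapping eta back to the mean (probability) scale
inv_links <- list(
  linear  = function(eta) eta,                 # identity link
  log     = function(eta) exp(eta),            # log link
  logit   = function(eta) plogis(eta),         # expit: logistic CDF
  probit  = function(eta) pnorm(eta),          # standard normal CDF
  cloglog = function(eta) 1 - exp(-exp(eta))   # complementary log-log
)

# One column of implied means per link, one row per value of eta
mu <- sapply(inv_links, function(f) f(eta))
rownames(mu) <- eta
print(round(mu, 3))

# Which links keep every implied mean strictly inside (0, 1) on this grid?
in_range <- apply(mu, 2, function(m) all(m > 0 & m < 1))
print(in_range)
```

On this grid, only the logit, probit and complementary log-log inverses stay strictly inside (0, 1); the linear and log inverses produce values outside that interval for some values of the linear predictor.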
• Balance between interpretability and mathematical properties
  ⋆ interpretability of contrasts
  ⋆ mathematical properties in terms of fitted values being in the appropriate range

Linear (identity) link function

$$\mu_i = \beta_0 + \beta_1 X_i$$

• Interpret $\beta_0$ as the probability of response when $X = 0$
• Interpret $\beta_1$ as the change in the probability of response, comparing two populations whose value of $X$ differs by 1 unit
• The contrast we are modeling is the risk difference (RD)
• As we’ve noted, a potential problem is that this specification of the model doesn’t respect the fact that the (true) response probability is bounded between 0 and 1

Log link function

$$\log(\mu_i) = \beta_0 + \beta_1 X_i$$

• Interpret $\beta_0$ as the log of the probability of response when $X = 0$
  ⋆ $\exp\{\beta_0\}$ is the probability of response when $X = 0$
• Interpret $\beta_1$ as the change in the log of the probability of response, comparing two populations whose value of $X$ differs by 1 unit
  ⋆ $\exp\{\beta_1\}$ is the ratio of the probability of response when $X = 1$ to that when $X = 0$
• The contrast we are modeling is the risk ratio (RR)
• As with the linear link, this choice of link function doesn’t necessarily respect the fact that the (true) response probability is bounded
• We can see this explicitly by considering the inverse of the link function:

$$\mu_i = \exp\{X_i^T \beta\}$$

  which takes values on $(0, \infty)$

Logit link function

$$\text{logit}(\mu_i) = \log\left(\frac{\mu_i}{1 - \mu_i}\right) = X_i^T \beta$$

• The functional

$$\frac{\mu_i}{1 - \mu_i} = \frac{P(Y_i = 1 \mid X_i)}{P(Y_i = 0 \mid X_i)}$$

  is the odds of response
• Interpret $\beta_0$ as the log of the odds of response when $X = 0$
  ⋆ $\exp\{\beta_0\}$ is the odds of response when $X = 0$
• Interpret $\beta_1$ as the change in the log of the odds of response, comparing two populations whose value of $X$ differs by 1 unit
  ⋆ $\exp\{\beta_1\}$ is the ratio of the odds of response when $X = 1$ to that when $X = 0$
• The contrast we are modeling is the odds ratio (OR)
• Considering the inverse of the link function yields:

$$\mu_i = \frac{\exp\{X_i^T \beta\}}{1 + \exp\{X_i^T \beta\}}$$

  ⋆ referred to as the ‘expit’ function
• The expit function is the CDF of the standard logistic distribution
  ⋆ a distribution for a continuous random variable with support on $(-\infty, \infty)$
  ⋆ the pdf is given by

$$f_X(x) = \frac{\exp\{-x\}}{(1 + \exp\{-x\})^{2}}$$

• The CDF (of any distribution) provides a mapping from the support of the random variable to the (0, 1) interval

$$F_X(\cdot) : (-\infty, \infty) \longrightarrow (0, 1)$$

• We could use the inverse CDF of any continuous distribution with support on $(-\infty, \infty)$ as a link function

$$F_X^{-1}(\cdot) : (0, 1) \longrightarrow (-\infty, \infty)$$

  ⋆ $g(\cdot) \equiv F_X^{-1}(\cdot)$ maps $\mu \in (0, 1)$