Part V: Binary Response Data

BIO 233, Spring 2015

Western Collaborative Group Study

• Prospective study of coronary heart disease (CHD)
• Recruited 3,524 men aged 39-59 between 1960-61
  ⋆ employed at 10 companies in California
  ⋆ baseline survey at intake
  ⋆ annual surveys until December 1969
• Exclusions:
  ⋆ 78 men who were actually outside the pre-specified age range
  ⋆ 141 subjects with CHD manifest at intake
  ⋆ 106 employees at one firm that excluded itself from follow-up
  ⋆ 45 subjects excluded prior to the first follow-up because of loss to follow-up, non-CHD death, or self-exclusion
• n = 3,154 study participants at risk for CHD

• Our primary goal is to investigate the relationship between ‘behavior pattern’ and risk of CHD
• Participants were categorized into one of two behavior pattern groups:
  Type A: characterized by enhanced aggressiveness, ambitiousness, competitive drive, and a chronic sense of urgency
  Type B: characterized as more relaxed and non-competitive
• Data and documentation are available on the class website

    > load("WCGS_data.dat")
    > dim(wcgs)
    [1] 3154   11
    > names(wcgs)
     [1] "age"    "ht"     "wt"     "sbp"    "dbp"    "chol"   "ncigs"  "behave"
     [9] "chd"    "type"   "time"

• The variables (in column order) are:

     #  Name    Description
     1  age     age, years
     2  ht      height, in
     3  wt      weight, lbs
     4  sbp     systolic blood pressure, mmHg
     5  dbp     diastolic blood pressure, mmHg
     6  chol    cholesterol, mg/dL
     7  ncigs   number of cigarettes smoked per day
     8  behave  behavior type, 0/1 = B/A
     9  chd     occurrence of a CHD event during follow-up
    10  type    type of CHD event
    11  time    time post-recruitment of the CHD event, days

• Values for the ‘risk factor’ covariates are those measured at the intake visit
• The three CHD-related variables were measured prospectively
  ⋆ over approximately 8.5 years of follow-up

• Important note:
  ⋆ 423 men were lost to follow-up
  ⋆ 140 men died during the follow-up
• For our purposes, we are going to ignore these issues and consider the binary outcome:

$$Y = \begin{cases} 1 & \text{occurrence of CHD during follow-up} \\ 0 & \text{otherwise} \end{cases}$$

• In the dataset, the response variable is ‘chd’:

    > table(wcgs$chd)

       0    1
    2897  257
    > round(mean(wcgs$chd) * 100, 1)
    [1] 8.1

• Primary exposure of interest is ‘behave’:

    > table(wcgs$behave)

       0    1
    1565 1589
    > round(mean(wcgs$behave) * 100, 1)
    [1] 50.4

• Cross-tabulation and exposure-specific incidence

    > table(wcgs$behave, wcgs$chd)

           0    1
      0 1486   79
      1 1411  178
    > round(tapply(wcgs$chd, list(wcgs$behave), FUN=mean) * 100, 1)
       0    1
     5.0 11.2

• The probability of the occurrence of CHD during follow-up among type B men is estimated to be 0.050
  ⋆ expected percentage of type B men who will develop CHD during follow-up is 5.0%
• The probability of the occurrence of CHD during follow-up among type A men is estimated to be 0.112
  ⋆ expected percentage of type A men who will develop CHD during follow-up is 11.2%
• We often use the generic term ‘risk’ for these probabilities
• Either way, it’s important to remember that these statements refer to populations of men, rather than the individuals themselves
  ⋆ we’ve estimated a common or average risk of CHD
  ⋆ referred to as the marginal risk
  ⋆ ‘marginal’ in the sense that it does not condition on anything else

Contrasts

• As stated at the start, the primary goal is to investigate the relationship between behavior pattern and risk of CHD
• We’ve characterized risk for each type but the goal requires a comparison of the risks
• To perform such a comparison we need to choose a contrast
• Risk difference:
  ⋆ RD = 0.112 - 0.050 = 0.062
  ⋆ difference in the estimated risk of CHD during follow-up between type A and type B men is 0.062 (or 6.2 percentage points)
  ⋆ the additional risk of CHD associated with being a type A person is expressed as an absolute increase
• Relative risk:
  ⋆ RR = 0.112 / 0.050 = 2.24
  ⋆ ratio of the estimated risk of CHD for type A men during follow-up to the estimated risk for type B men
  ⋆ the additional risk of CHD associated with being a type A person is expressed as a relative increase
• As with the interpretation of the risks themselves, these statements refer to contrasts between populations
  ⋆ population of type A men vs. population of type B men
• Contrasts are ‘marginal’ in the sense that we don’t condition on anything else when comparing the two populations
  ⋆ i.e. we don’t adjust for anything

• Important to note that the RD and RR are related
  ⋆ relationship depends on the value of the response probability for the ‘referent’ group
  ⋆ RD = P(Y = 1 | X = 0) × (RR - 1)
• RD across different combinations of P(Y = 1 | X = 0) and RR

                      P(Y = 1 | X = 0)
                0.01    0.05    0.10    0.20    0.50
    RR = 0.2  -0.008  -0.040   -0.08   -0.16   -0.40
    RR = 0.5  -0.005  -0.025   -0.05   -0.10   -0.25
    RR = 0.9  -0.001  -0.005   -0.01   -0.02   -0.05
    RR = 1.0   0       0        0       0       0
    RR = 1.1   0.001   0.005    0.01    0.02    0.05
    RR = 1.5   0.005   0.025    0.05    0.10    0.25
    RR = 3.0   0.020   0.100    0.20    0.40    NA
    RR = 5.0   0.040   0.200    0.40    NA      NA

• The RD may be small even if the RR is big
  ⋆ for either ‘protective’ or ‘detrimental’ effects
• When the RR is small (close to 1), the RD is also small unless P(Y = 1 | X = 0) is big
  ⋆ ‘common’ outcome
• However, a small RR operating on a large population could correspond to a big ‘public health’ impact
  ⋆ this rationale is often cited in studies of air pollution
• To move beyond simple contrasts, we need a more general framework for modeling the relationship between the binary response and a vector of covariates

GLMs for binary data

• We’ve noted that the Bernoulli distribution is the only possible distribution for binary data
  ⋆ $Y \sim \text{Bernoulli}(\mu)$

$$f_Y(y; \mu) = \mu^{y} (1 - \mu)^{1 - y}$$

$$f_Y(y; \theta, \phi) = \exp\left\{ y\theta - \log(1 + \exp\{\theta\}) \right\}$$

$$\theta = \log\left(\frac{\mu}{1 - \mu}\right), \qquad a(\phi) = 1, \qquad b(\theta) = \log(1 + \exp\{\theta\}), \qquad c(y, \phi) = 0$$

• The log-likelihood is

$$\ell(\beta; y) = \sum_{i=1}^{n} \left[ y_i\theta_i - b(\theta_i) \right] = \sum_{i=1}^{n} \left[ y_i\theta_i - \log(1 + \exp\{\theta_i\}) \right]$$

  where $\theta_i$ is a function of $\beta$ via

$$g(\mu_i) = X_i^T\beta \qquad \text{and} \qquad \mu_i = \frac{\exp\{\theta_i\}}{1 + \exp\{\theta_i\}}$$

• The score function for $\beta_j$ is

$$\frac{\partial \ell(\beta; y)}{\partial \beta_j} = \sum_{i=1}^{n} \frac{\partial \mu_i}{\partial \eta_i} \, \frac{X_{j,i}}{\mu_i(1 - \mu_i)} \, (y_i - \mu_i)$$

  where the expression for $\partial\mu_i/\partial\eta_i$ depends on the choice of the link function $g(\cdot)$
• Since the log-likelihood is only a function of $\beta$, the expected information matrix is given by the $(p + 1) \times (p + 1)$ matrix:

$$\mathcal{I}_{\beta\beta} = X^T W X$$

  where $X$ is the design matrix for the model and $W$ is a diagonal matrix with $i$th diagonal element

$$W_i = \left(\frac{\partial \mu_i}{\partial \eta_i}\right)^{2} \frac{1}{\mu_i(1 - \mu_i)}$$

Link functions

• In a GLM, the systematic component is given by

$$g(\mu_i) = \eta_i = X_i^T\beta$$

• We’ve noted previously that, for binary data, there are various options for link functions including:

  linear: $g(\mu_i) = \mu_i$
  log: $g(\mu_i) = \log(\mu_i)$
  logit: $g(\mu_i) = \log\left(\frac{\mu_i}{1 - \mu_i}\right)$
  probit: $g(\mu_i) = \text{probit}(\mu_i) = \Phi^{-1}(\mu_i)$
  complementary log-log: $g(\mu_i) = \log\{-\log(1 - \mu_i)\}$

• Q: How do we make a choice from among these options?
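Before turning to that question, the score and information expressions above can be turned into a short Fisher scoring routine that works for any of the listed links. The following is a minimal sketch and not the course's own code: `fisher_scoring` is a hypothetical helper name, the individual-level data are rebuilt from the `behave` by `chd` cross-tabulation shown earlier rather than loaded from the class file, and the link-specific pieces come from base R's `make.link()`.

```r
# Rebuild individual-level data from the cross-tabulation in the text
# (assumption: only behave and chd are needed for this illustration)
counts <- c(1486, 79, 1411, 178)
behave <- rep(c(0, 0, 1, 1), times = counts)
chd    <- rep(c(0, 1, 0, 1), times = counts)
X <- cbind(intercept = 1, behave = behave)

# Hypothetical helper: Fisher scoring for a Bernoulli GLM with a chosen link
fisher_scoring <- function(X, y, link = "logit", tol = 1e-10, max_iter = 50) {
  lk <- make.link(link)                       # linkfun, linkinv, mu.eta
  # Start at the intercept-only fit so that mu lies inside (0, 1)
  beta <- c(lk$linkfun(mean(y)), rep(0, ncol(X) - 1))
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)
    mu  <- lk$linkinv(eta)
    dmu <- lk$mu.eta(eta)                     # d mu_i / d eta_i
    v   <- mu * (1 - mu)                      # Bernoulli variance
    score <- crossprod(X, dmu * (y - mu) / v) # score vector
    W     <- dmu^2 / v                        # diagonal elements W_i
    info  <- crossprod(X, W * X)              # X^T W X
    step  <- drop(solve(info, score))
    beta  <- beta + step
    if (max(abs(step)) < tol) break
  }
  list(coef = beta, info = info, iterations = iter)
}

fit_logit <- fisher_scoring(X, chd, link = "logit")
fit_logit$coef
se_logit <- sqrt(diag(solve(fit_logit$info)))  # model-based standard errors
se_logit

# Cross-check against R's built-in fitting routine
coef(glm(chd ~ behave, family = binomial(link = "logit")))
```

Passing `link = "identity"`, `"log"`, `"probit"` or `"cloglog"` changes only the `make.link()` pieces; the update itself is unchanged, which is the sense in which $\partial\mu_i/\partial\eta_i$ carries all of the link-specific information.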
• Balance between interpretability and mathematical properties
  ⋆ interpretability of contrasts
  ⋆ mathematical properties in terms of fitted values being in the appropriate range

Linear (identity) link function

$$\mu_i = \beta_0 + \beta_1 X_i$$

• Interpret $\beta_0$ as the probability of response when X = 0
• Interpret $\beta_1$ as the change in the probability of response, comparing two populations whose value of X differs by 1 unit
• The contrast we are modeling is the risk difference (RD)
• As we’ve noted, a potential problem is that this specification of the model doesn’t respect the fact that the (true) response probability is bounded to lie in (0, 1)

Log link function

$$\log(\mu_i) = \beta_0 + \beta_1 X_i$$

• Interpret $\beta_0$ as the log of the probability of response when X = 0
  ⋆ $\exp\{\beta_0\}$ is the probability of response when X = 0
• Interpret $\beta_1$ as the change in the log of the probability of response, comparing two populations whose value of X differs by 1 unit
  ⋆ $\exp\{\beta_1\}$ is the ratio of the probability of response when X = 1 to that when X = 0
• The contrast we are modeling is the risk ratio (RR)
• As with the linear link, this choice of link function doesn’t necessarily respect the fact that the (true) response probability is bounded
• We can see this explicitly by considering the inverse of the link function:

$$\mu_i = \exp\{X_i^T\beta\}$$

  which takes values on $(0, \infty)$

Logit link function

$$\text{logit}(\mu_i) = \log\left(\frac{\mu_i}{1 - \mu_i}\right) = X_i^T\beta$$

• The functional

$$\frac{\mu_i}{1 - \mu_i} = \frac{P(Y_i = 1 \mid X_i)}{P(Y_i = 0 \mid X_i)}$$

  is the odds of response
• Interpret $\beta_0$ as the log of the odds of response when X = 0
  ⋆ $\exp\{\beta_0\}$ is the odds of response when X = 0
• Interpret $\beta_1$ as the change in the log of the odds of response, comparing two populations whose value of X differs by 1 unit
  ⋆ $\exp\{\beta_1\}$ is the ratio of the odds of response when X = 1 to that when X = 0
• The contrast we are modeling is the odds ratio (OR)
• Considering the inverse of the link function yields:

$$\mu_i = \frac{\exp\{X_i^T\beta\}}{1 + \exp\{X_i^T\beta\}}$$

  ⋆ referred to as the ‘expit’ function
• The expit function is the CDF of the standard logistic distribution
  ⋆ distribution for a continuous random variable with support on $(-\infty, \infty)$
  ⋆ pdf is given by

$$f_X(x) = \frac{\exp\{-x\}}{(1 + \exp\{-x\})^{2}}$$

• The CDF of any such distribution provides a mapping from the support of the random variable to the (0, 1) interval

$$F_X(\cdot): (-\infty, \infty) \longrightarrow (0, 1)$$

• We could use the inverse CDF of any continuous distribution supported on $(-\infty, \infty)$ as a link function

$$F_X^{-1}(\cdot): (0, 1) \longrightarrow (-\infty, \infty)$$

  ⋆ $g(\cdot) \equiv F_X^{-1}(\cdot)$ maps $\mu \in (0, 1)$ to $(-\infty, \infty)$
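To connect the three interpretable links with the contrasts computed earlier, the sketch below fits the same one-covariate model under the identity, log and logit links and reads off the RD, RR and OR. It is an illustration under stated assumptions: the data are rebuilt from the `behave` by `chd` cross-tabulation rather than loaded from the class file, the object names (`fit_rd`, `fit_rr`, `fit_or`, `expit`) are made up for this example, and explicit starting values are supplied for the identity and log links so that the initial fitted probabilities lie inside (0, 1). Because the fits use unrounded risks, the RR will differ slightly from the 2.24 obtained from the rounded values.

```r
# Individual-level data rebuilt from the behave-by-chd cross-tabulation
counts <- c(1486, 79, 1411, 178)
behave <- rep(c(0, 0, 1, 1), times = counts)
chd    <- rep(c(0, 1, 0, 1), times = counts)

# Start the identity and log fits at the overall risk with no behave effect
p_bar  <- mean(chd)
fit_rd <- glm(chd ~ behave, family = binomial(link = "identity"),
              start = c(p_bar, 0))
fit_rr <- glm(chd ~ behave, family = binomial(link = "log"),
              start = c(log(p_bar), 0))
fit_or <- glm(chd ~ behave, family = binomial(link = "logit"))

# beta_1 on each scale, transformed to the contrast it models
risk_difference <- unname(coef(fit_rd)["behave"])
risk_ratio      <- unname(exp(coef(fit_rr)["behave"]))
odds_ratio      <- unname(exp(coef(fit_or)["behave"]))
print(c(RD = risk_difference, RR = risk_ratio, OR = odds_ratio))

# The expit function coincides with the standard logistic CDF (plogis),
# and inverse CDFs act as links: qlogis is the logit, qnorm is the probit
expit <- function(x) exp(x) / (1 + exp(x))
x <- c(-3, 0, 2)
print(all.equal(expit(x), plogis(x)))
print(all.equal(qlogis(0.3), log(0.3 / 0.7)))
print(qnorm(0.3))
```

With a single binary covariate the model is saturated, so all three fits reproduce the same two group-specific risks; the links differ only in the scale on which the contrast between the groups is expressed.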
