
Part V: Binary response

BIO 233, Spring 2015

Western Collaborative Group Study

• Prospective study of coronary heart disease (CHD)

• Recruited 3,524 men aged 39-59 between 1960-61
  ⋆ employed at 10 companies in California
  ⋆ baseline survey at intake
  ⋆ annual surveys until December 1969

• Exclusions:
  ⋆ 78 men who were actually outside the pre-specified age range
  ⋆ 141 subjects with CHD manifest at intake
  ⋆ 106 employees at one firm that excluded itself from follow-up
  ⋆ 45 subjects who were lost to follow-up, died of non-CHD causes, or excluded themselves prior to the first follow-up

• n = 3,154 study participants at risk for CHD

• Our primary goal is to investigate the relationship between ‘behavior pattern’ and risk of CHD

• Participants were categorized into one of two behavior pattern groups:

  Type A: characterized by enhanced aggressiveness, ambitiousness, competitive drive, and a chronic sense of urgency

  Type B: characterized by a more relaxed and non-competitive demeanor

• Data and documentation are available on the class website

> ##
> load("WCGS_data.dat")
>
> dim(wcgs)
[1] 3154   11
> names(wcgs)
 [1] "age"    "ht"     "wt"     "sbp"    "dbp"    "chol"   "ncigs"  "behave"
 [9] "chd"    "type"   "time"

• The variables (in column order) are:

   1  age     age, years
   2  ht      height, in
   3  wt      weight, lbs
   4  sbp     systolic blood pressure, mmHg
   5  dbp     diastolic blood pressure, mmHg
   6  chol    cholesterol, mg/dL
   7  ncigs   number of cigarettes smoked per day
   8  behave  behavior type, 0/1 = B/A
   9  chd     occurrence of a CHD event during follow-up
  10  type    type of CHD event
  11  time    time post-recruitment of the CHD event, days

• Values for the ‘risk factor’ covariates are those measured at the intake visit

• The three CHD-related variables were measured prospectively
  ⋆ over approximately 8.5 years of follow-up

• Important note:
  ⋆ 423 men were lost to follow-up
  ⋆ 140 men died during the follow-up

• For our purposes, we are going to ignore these issues and consider the binary outcome:

    Y = 1 if a CHD event occurred during follow-up, and Y = 0 otherwise

• In the dataset, the response variable is ‘chd’:

> ##
> table(wcgs$chd)

   0    1
2897  257
> round(mean(wcgs$chd) * 100, 1)
[1] 8.1

• Primary exposure of interest is ‘behave’:

> ##
> table(wcgs$behave)

   0    1
1565 1589
> round(mean(wcgs$behave) * 100, 1)
[1] 50.4

• Cross-tabulation and exposure-specific incidence:

> ##
> table(wcgs$behave, wcgs$chd)

       0    1
  0 1486   79
  1 1411  178
> round(tapply(wcgs$chd, list(wcgs$behave), FUN=mean) * 100, 1)
   0    1
 5.0 11.2

• The probability of the occurrence of CHD during follow-up among Type B men is estimated to be 0.050
  ⋆ the expected percentage of Type B men who will develop CHD during follow-up is 5.0%

• The probability of the occurrence of CHD during follow-up among Type A men is estimated to be 0.112
  ⋆ the expected percentage of Type A men who will develop CHD during follow-up is 11.2%

• Often use the generic term ‘risk’

• Either way, it's important to remember that these statements refer to populations of men, rather than to the individuals themselves
  ⋆ we've estimated a common or average risk of CHD
  ⋆ referred to as the marginal risk
  ⋆ ‘marginal’ in the sense that it does not condition on anything else

Contrasts

• As stated at the start, the primary goal is to investigate the relationship between behavior pattern and risk of CHD

• We've characterized the risk for each type, but the goal requires a comparison of the risks

• To perform such a comparison we need to choose a contrast

• Risk difference:
  ⋆ RD = 0.112 − 0.050 = 0.062
  ⋆ the difference in the estimated risk of CHD during follow-up between Type A and Type B men is 0.062 (or 6.2%)
  ⋆ the additional risk of CHD associated with being a Type A person manifests as an absolute increase

• Relative risk:
  ⋆ RR = 0.112 / 0.050 = 2.24
  ⋆ the ratio of the estimated risk of CHD during follow-up for Type A men to the estimated risk for Type B men
  ⋆ the additional risk of CHD associated with being a Type A person manifests as a relative increase
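Both contrasts can be computed directly from the 2×2 table; a quick sketch in R, using the cross-tabulated counts reported above (the RR of 2.24 quoted on the slide reflects rounding the two risks to three decimals before taking their ratio):

```r
## CHD status by behavior type, from the cross-tabulation above
tab <- matrix(c(1486,  79,    # behave = 0 (Type B)
                1411, 178),   # behave = 1 (Type A)
              nrow = 2, byrow = TRUE,
              dimnames = list(behave = c("0", "1"), chd = c("0", "1")))

risk <- tab[, "1"] / rowSums(tab)      # exposure-specific risks
RD   <- unname(risk["1"] - risk["0"])  # risk difference
RR   <- unname(risk["1"] / risk["0"])  # relative risk
round(c(RD = RD, RR = RR), 3)          # RD = 0.062, RR = 2.219
```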

• As with the interpretation of the risks themselves, these statements refer to contrasts between populations
  ⋆ the population of Type A men vs. the population of Type B men

• The contrasts are ‘marginal’ in the sense that we don't condition on anything else when comparing the two populations
  ⋆ i.e. we don't adjust for anything

• It is important to note that the RD and RR are related
  ⋆ the relationship depends on the value of the response probability in the ‘referent’ group

• RD across different combinations of P(Y = 1 | X = 0) and RR:

                  P(Y = 1 | X = 0)
              0.01    0.05   0.10   0.20   0.50
  RR = 0.2  -0.008  -0.040  -0.08  -0.16  -0.40
  RR = 0.5  -0.005  -0.025  -0.05  -0.10  -0.25
  RR = 0.9  -0.001  -0.005  -0.01  -0.02  -0.05
  RR = 1.0       0       0      0      0      0
  RR = 1.1   0.001   0.005   0.01   0.02   0.05
  RR = 1.5   0.005   0.025   0.05   0.10   0.25
  RR = 3.0   0.020   0.100   0.20   0.40     NA
  RR = 5.0   0.040   0.200   0.40     NA     NA
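The table follows from the identity RD = p0 × (RR − 1), where p0 = P(Y = 1 | X = 0); a few lines of R (not part of the course materials) reproduce it, with NA wherever the implied risk RR × p0 reaches or exceeds 1:

```r
p0 <- c(0.01, 0.05, 0.10, 0.20, 0.50)            # baseline risks P(Y = 1 | X = 0)
RR <- c(0.2, 0.5, 0.9, 1.0, 1.1, 1.5, 3.0, 5.0)

## RD = p1 - p0 = p0 * (RR - 1); invalid when p1 = RR * p0 is not a probability
RD <- outer(RR, p0, function(rr, p) ifelse(rr * p >= 1, NA, p * (rr - 1)))
dimnames(RD) <- list(paste("RR =", RR), p0)
round(RD, 3)
```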

• The RD may be small even if the RR is big
  ⋆ for either ‘protective’ or ‘detrimental’ effects

• When the RR is small, the RD is also small unless P(Y = 1 | X = 0) is big
  ⋆ i.e. unless the outcome is ‘common’

• However, a small RR operating on a large population could correspond to a big ‘public health’ impact
  ⋆ this rationale is often cited in studies of air pollution

• To move beyond simple contrasts, we need a more general framework for modeling the relationship between the binary response and a vector of covariates

GLMs for binary data

• We've noted that the Bernoulli distribution is the only possible distribution for binary data

  ⋆ Y ∼ Bernoulli(µ), with probability mass function

      f_Y(y; µ) = µ^y (1 − µ)^(1−y)

  ⋆ in exponential family form,

      f_Y(y; θ, φ) = exp{ yθ − log(1 + exp{θ}) }

    where

      θ = log{ µ / (1 − µ) }

      a(φ) = 1,   b(θ) = log(1 + exp{θ}),   c(y, φ) = 0
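As a quick check on this representation, differentiating the cumulant function b(θ) recovers the Bernoulli mean and variance via the standard identities E[Y] = b′(θ) and Var(Y) = a(φ) b″(θ):

```latex
b(\theta) = \log(1 + e^{\theta})
  \quad\Rightarrow\quad
b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = \mu,
  \qquad
b''(\theta) = \frac{e^{\theta}}{(1 + e^{\theta})^{2}} = \mu(1 - \mu)
```

so E[Y] = µ and Var(Y) = µ(1 − µ), exactly as required for a Bernoulli(µ) random variable.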

• The log-likelihood is

    ℓ(β; y) = Σ_{i=1}^n [ yi θi − b(θi) ]
            = Σ_{i=1}^n [ yi θi − log(1 + exp{θi}) ]

  where θi is a function of β via

    g(µi) = Xi^T β

  and

    µi = exp{θi} / (1 + exp{θi})

• The score function for βj is

    ∂ℓ(β; y)/∂βj = Σ_{i=1}^n (∂µi/∂ηi) × Xj,i / [µi(1 − µi)] × (yi − µi)

  where the expression for ∂µi/∂ηi depends on the choice of the link function g(·)

• Since the log-likelihood is only a function of β, the expected information matrix is given by the (p + 1) × (p + 1) matrix

    I_ββ = X^T W X

  where X is the design matrix for the model and W is a diagonal matrix with ith diagonal element

    Wi = (∂µi/∂ηi)^2 × 1 / [µi(1 − µi)]
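For the canonical (logit) link these expressions simplify considerably: there ηi = θi, so ∂µi/∂ηi = µi(1 − µi) and the variance term cancels,

```latex
\frac{\partial \ell(\beta; y)}{\partial \beta_j}
  = \sum_{i=1}^{n} (y_i - \mu_i)\, X_{j,i},
  \qquad
W_i = \frac{\left(\mu_i(1-\mu_i)\right)^2}{\mu_i(1-\mu_i)} = \mu_i(1 - \mu_i)
```

which is why logistic regression has the particularly clean estimating equation Xᵀ(y − µ) = 0.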

Link functions

• In a GLM, the systematic component is given by

    g(µi) = ηi = Xi^T β

• We've noted previously that, for binary data, there are various options for link functions, including:

  linear:                 g(µi) = µi
  log:                    g(µi) = log(µi)
  logit:                  g(µi) = log{ µi / (1 − µi) }
  probit:                 g(µi) = probit(µi)
  complementary log-log:  g(µi) = log{ −log(1 − µi) }

Q: How do we make a choice from among these options?

• Balance between interpretability and mathematical properties
  ⋆ interpretability of contrasts
  ⋆ mathematical properties in terms of fitted values being in the appropriate range

Linear (identity) link function

    µi = β0 + β1 Xi

• Interpret β0 as the probability of response when X = 0

• Interpret β1 as the change in the probability of response, comparing two populations whose values of X differ by 1 unit

• The contrast we are modeling is the risk difference (RD)

• As we've noted, a potential problem is that this specification of the model doesn't respect the fact that the (true) response probability is bounded

Log link function

    log(µi) = β0 + β1 Xi

• Interpret β0 as the log of the probability of response when X = 0
  ⋆ exp{β0} is the probability of response when X = 0

• Interpret β1 as the change in the log of the probability of response, comparing two populations whose values of X differ by 1 unit
  ⋆ exp{β1} is the ratio of the probability of response when X = 1 to that when X = 0

• The contrast we are modeling is the risk ratio (RR)

• As with the linear link, this choice of link function doesn't necessarily respect the fact that the (true) response probability is bounded

• We can see this explicitly by considering the inverse of the link function:

    µi = exp{Xi^T β}

  which takes values on (0, ∞)

Logit link function

    logit(µi) = log{ µi / (1 − µi) } = Xi^T β

• The functional

    µi / (1 − µi) = P(Yi = 1 | Xi) / P(Yi = 0 | Xi)

  is the odds of response

• Interpret β0 as the log of the odds of response when X = 0
  ⋆ exp{β0} is the odds of response when X = 0

• Interpret β1 as the change in the log of the odds of response, comparing two populations whose values of X differ by 1 unit
  ⋆ exp{β1} is the ratio of the odds of response when X = 1 to that when X = 0

• The contrast we are modeling is the odds ratio (OR)

• Considering the inverse of the link function yields:

    µi = exp{Xi^T β} / (1 + exp{Xi^T β})

  ⋆ referred to as the ‘expit’ function

• The expit function is the CDF of the standard logistic distribution
  ⋆ the distribution for a continuous random variable with support on (−∞, ∞)
  ⋆ the pdf is given by

      f_X(x) = exp{−x} / (1 + exp{−x})^2

• The CDF (of any distribution) provides a mapping from the support of the random variable to the (0, 1) interval

    F_X(·): (−∞, ∞) → (0, 1)

• We could use the inverse CDF of any distribution as a link function

    F_X^{−1}(·): (0, 1) → (−∞, ∞)

  ⋆ g(·) ≡ F_X^{−1}(·) maps µ ∈ (0, 1) to η ∈ (−∞, ∞)
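In R the logistic CDF and its inverse are available as plogis() and qlogis() (and the normal pair as pnorm()/qnorm()); a small sketch illustrating the CDF/inverse-CDF mapping between (−∞, ∞) and (0, 1):

```r
eta <- c(-2, 0, 2)               # values of the linear predictor

mu <- plogis(eta)                # expit: maps (-Inf, Inf) -> (0, 1)
round(mu, 3)                     # 0.119 0.500 0.881
all.equal(qlogis(mu), eta)       # logit (the inverse CDF) undoes expit

## the same construction with the standard normal CDF gives the probit link
round(pnorm(eta), 3)             # 0.023 0.500 0.977
```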

Probit link function

    probit(µi) = Φ^{−1}(µi) = Xi^T β

  where Φ(·) is the CDF of the standard normal distribution

• Interpret β0 as the probit of the probability of response when X = 0

• Interpret β1 as the change in the probit of the probability of response, comparing two populations whose values of X differ by 1 unit

• Interpretation is tricky
  ⋆ the contrast is in terms of the inverse CDF of a standard normal distribution
  ⋆ there is no easy way of relating this contrast to more intuitive measures

Complementary log-log link function

    log{ −log(1 − µi) } = Xi^T β

• The inverse CDF of the extreme value (or log-Weibull) distribution

• As with the probit link function, there isn't any intuitive way of interpreting regression parameters based on this link function

• Has the distinction that it is asymmetric
  ⋆ may be useful if the primary purpose is prediction

Comparisons

• Over values of µ ∈ (0.1, 0.9), models based on the linear, logit and probit link functions agree approximately

  ⋆ considering their inverse link functions, over the range ηi ∈ (−2, 2):

      1/2 + ηi/4 ≈ expit(ηi) ≈ Φ( √(2π) ηi / 4 )

  ⋆ so their fitted values will be approximately equal over this range

• We can also use these relationships to obtain approximate relationships between the regression parameters:

      β1^linear ≈ β1^logit / 4 ≈ β1^probit / √(2π)
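These approximations are easy to check numerically; a small sketch comparing the three inverse links over a grid of η values (agreement is closest near η = 0 and degrades toward the tails):

```r
eta <- seq(-2, 2, by = 0.5)

linear <- 0.5 + eta / 4                   # linear approximation to expit
logit  <- plogis(eta)                     # expit
probit <- pnorm(sqrt(2 * pi) * eta / 4)   # rescaled probit

round(cbind(eta, linear, logit, probit), 3)
max(abs(probit - logit))                  # probit and logit agree very closely
max(abs(linear - logit))                  # the linear version is cruder in the tails
```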

[Figure: the inverse link functions — conditional mean, µi (from 0.0 to 1.0), plotted against the linear predictor, ηi (from −5 to 5), for the linear, logit and probit links]

[Figure: the link functions g(µi) plotted against logit(µi), both from −5 to 5, for the complementary log-log, logit, probit and log links]

• From the figures, differences across these link functions manifest primarily in the tails
  ⋆ when the probability of response is small or large

• Also, the logit and probit functions are almost linearly related
  ⋆ we noted this from the approximations as well

• For small values of µi, the complementary log-log, logit and log functions are close to each other
  ⋆ equally good for rare events
  ⋆ for µi ≤ 0.1,

      log{ µi / (1 − µi) } ≈ log(µi)

  ⋆ the log link has the ‘best’ interpretation
  ⋆ the OR and RR are close numerically

Modeling: WCGS

• Returning to the WCGS, the dataset has a number of covariates that we might consider including in a model

> ##
> names(wcgs)
 [1] "age"    "ht"     "wt"     "sbp"    "dbp"    "chol"   "ncigs"  "behave"
 [9] "chd"    "type"   "time"

Q: How do we approach making decisions about what to include in the model?
  ⋆ depends on the purpose of the analysis

• Towards this, it's useful to classify the analysis into one of two types:
  ⋆ association studies
  ⋆ prediction studies

Association studies

• The goal is to characterize the relationship between some exposure of interest and the response
  ⋆ establish cause-and-effect

• Understanding the underlying (data-generating) mechanisms is crucial
  ⋆ need to be attentive to the possibility of ‘alternative explanations’
  ⋆ control of confounding is crucial

• Model selection, in terms of the choice of potential confounders, should be based on scientific considerations

• Despite this ‘ideal’, it's not always clear which covariates are confounders and which aren't

• One strategy is to fit and report the following three models:

  (1) an unadjusted or minimally adjusted model

  (2) a model that includes ‘core’ confounders
      ∗ clear indication from scientific knowledge and/or the literature
      ∗ consensus among investigators

  (3) a model that includes ‘core’ confounders plus any ‘potential’ confounders
      ∗ indication is less certain

• Report results from model (2) as ‘primary’
  ⋆ base conclusions on the results of this model
  ⋆ interpret models (1) and (3) in terms of sensitivity analyses

• There are, of course, other philosophies on this!

Prediction studies

• The goal is to estimate the response Y
  ⋆ as opposed to the goal of estimating β

• In contrast to association studies, prediction is typically not hypothesis-driven
  ⋆ there is no single exposure, association, or parameter that is of interest
  ⋆ mechanisms and confounding are less of a concern, if at all

• The choice of which covariates to include in the model is driven by the extent to which their inclusion improves our ability to predict future outcomes
  ⋆ care is needed not to ‘overfit’ the data

• These issues typically don't come up in association studies
  ⋆ prediction requires different analysis strategies and different statistical tools

Confounding

• The data for the WCGS are ‘observational’
  ⋆ as a study of Type A vs Type B behavior patterns, the investigators didn't randomize behavior pattern

• As such, an analysis based on these data may be subject to confounding bias

• A confounder is defined as a covariate that is (causally) associated with both the exposure of interest and the outcome of interest, while not being on the causal pathway

[Diagram: C with arrows into both X and Y; the X → Y arrow is labeled ‘?’]

• Intuitively, from the causal diagram, there is a ‘backdoor’ association between X and Y, through C

• If one does not ‘block’ this pathway then one cannot isolate the (direct) association between X and Y
  ⋆ the unadjusted association is ‘spurious’ in the sense that it is a mixture of the ‘true’ association and the association characterized by the backdoor pathway
  ⋆ confounding bias

• Note, we haven't introduced any estimators yet
  ⋆ we haven't even introduced a contrast yet!

• As such, confounding is a scientific issue
  ⋆ distinct from statistical bias, which is an operating characteristic of an estimator

• The control of confounding bias must, therefore, be approached from a scientific perspective
  ⋆ we cannot use statistical techniques to determine whether or not a covariate is a confounder
  ⋆ we must use scientific knowledge to make these decisions

• Given a collection of (potential) confounders, the standard approach to controlling confounding bias is to include them in the linear predictor
  ⋆ referred to as regression adjustment
  ⋆ e.g.,

      ηi = β0 + βx Xi + βc Ci

  ⋆ interpret βx conditional on C, or within strata of C

• Going back to the causal diagram, conditioning on the confounder blocks the backdoor pathway
  ⋆ the effect of including C in the model is to ‘break’ the association between C and Y

[Diagram: the arrow from C to Y has been removed; C still points to X, and the X → Y arrow is labeled ‘?’]

Exploratory data analysis

• Whatever the purpose of the study, it is often useful to perform some preliminary exploratory data analysis

Q: Why?

> ##
> apply(wcgs[,1:7], 2, FUN=summary)
$age
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  39.00   42.00   45.00   46.28   50.00   59.00

$ht
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  60.00   68.00   70.00   69.78   72.00   78.00

$wt
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     78     155     170     170     182     320

$sbp
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   98.0   120.0   126.0   128.6   136.0   230.0

$dbp
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  58.00   76.00   80.00   82.02   86.00  150.00

$chol
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  103.0   197.2   223.0   226.4   253.0   645.0    12.0

$ncigs
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     0.0     0.0    11.6    20.0    99.0


[Figure: histogram of age, years]

[Figure: scatterplot of weight, lbs, against height, in]

[Figure: scatterplot of systolic blood pressure against diastolic blood pressure, mmHg]

[Figure: histogram of cholesterol, mg/dL, and scatterplot of cholesterol against study id]

[Figure: histogram of number of cigarettes per day]

> ##
> table(wcgs$ncigs)

   0    1    2    3    4    5    6    7    8    9   10
1652    7   15   14   13   23   14   12    4    3   83

  11   12   13   14   15   16   17   18   20   22   23
   3   22    4    3  102    4   13   13  453    3    5

  24   25   27   28   30   31   33   35   36   37   38
   1  105    6    2  283    2    4   36    2    2    1

  40   42   45   48   50   55   60   70   80   99
 182    1   11    1   32    1   14    1    1    1

• Study participants seem to be reporting ‘round’ numbers
  ⋆ likely some misclassification of actual smoking

• Overall, nothing too worrying pops out

• Some instances of ‘large’ values
  ⋆ weight of 320 lbs
  ⋆ diastolic blood pressure of 150 mmHg
  ⋆ cholesterol of 645 mg/dL
  ⋆ smoking 99 cigarettes per day

• There is also some missingness in the data
  ⋆ in a real collaborative setting, we'd want to know more about the cholesterol values
  ⋆ in particular, why were they missing?
  ⋆ only 12 out of 3,154 observations have missing values

• Based on the EDA, perform the following data manipulations:

> ##
> wcgs$chol[wcgs$chol > 500] <- NA   ## Take out (particularly) strange value
> wcgs <- na.omit(wcgs)              ## Remove observations with missing chol
>
> ## Standardize continuous variables to make the intercept interpretable
> ##
> wcgs$age  <- (wcgs$age - 40) / 5
> wcgs$ht   <- (wcgs$ht - 70) / 2
> wcgs$wt   <- (wcgs$wt - 170) / 10
> wcgs$sbp  <- (wcgs$sbp - 125) / 10
> wcgs$dbp  <- (wcgs$dbp - 80) / 10
> wcgs$chol <- (wcgs$chol - 200) / 20
>
> ## Smoker 0/1 = No/Yes
> ##
> wcgs$smoker <- as.numeric(wcgs$ncigs > 0)

Unadjusted analysis

• Fit the model:

    logit(µi) = β0 + β1 behavei

> ##
> fit0 <- glm(chd ~ behave, family=binomial(), data=wcgs)
> summary(fit0)

Call:
glm(formula = chd ~ behave, family = binomial(), data = wcgs)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.4870 -0.4870 -0.3226 -0.3226  2.4420

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.9297     0.1155 -25.372   <2e-16 ***
behave        0.8573     0.1403   6.109    1e-09 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1774.2 on 3140 degrees of freedom
Residual deviance: 1734.1 on 3139 degrees of freedom
AIC: 1738.1

Number of Fisher Scoring iterations: 5

> summary(fit0$fitted)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.05071 0.05071 0.11180 0.08150 0.11180 0.11180

Core adjustment

• Add the ‘core’ adjustment variables into the linear predictor and fit

    logit(µi) = β0 + β1 behavei + β2 agei + β3 wti
                + β4 sbpi + β5 choli + β6 smokeri

> ##
> fit1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
              family=binomial(), data=wcgs)
> summary(fit1)

...
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.14430    0.18604 -22.276  < 2e-16 ***
behave       0.68672    0.14453   4.751 2.02e-06 ***
age          0.31511    0.06022   5.233 1.67e-07 ***
wt           0.09240    0.03173   2.912  0.00359 **
sbp          0.17526    0.04095   4.280 1.87e-05 ***
chol         0.21730    0.03056   7.110 1.16e-12 ***
smoker       0.60794    0.14054   4.326 1.52e-05 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1774.2 on 3140 degrees of freedom
Residual deviance: 1585.8 on 3134 degrees of freedom
AIC: 1599.8

Number of Fisher Scoring iterations: 6

> summary(fit1$fitted)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00426 0.03120 0.05806 0.08150 0.10550 0.62350

Full adjustment

• Add the remaining adjustment variables into the linear predictor and fit

    logit(µi) = β0 + β1 behavei + β2 agei + β3 wti
                + β4 sbpi + β5 choli + β6 smokeri
                + β7 hti + β8 dbpi

> fit2 <- glm(chd ~ behave + age + wt + sbp + chol + smoker + ht + dbp,
              family=binomial(), data=wcgs)
> summary(fit2)

...
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.14239    0.18615 -22.253  < 2e-16 ***
behave       0.68535    0.14463   4.739 2.15e-06 ***
age          0.31738    0.06035   5.259 1.45e-07 ***
wt           0.08147    0.03879   2.100  0.03571 *
sbp          0.18303    0.06402   2.859  0.00425 **
chol         0.21925    0.03077   7.125 1.04e-12 ***
smoker       0.59920    0.14197   4.220 2.44e-05 ***
ht           0.03648    0.06614   0.552  0.58127
dbp         -0.01041    0.10837  -0.096  0.92349
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1774.2 on 3140 degrees of freedom
Residual deviance: 1585.4 on 3132 degrees of freedom
AIC: 1603.4

Number of Fisher Scoring iterations: 6

> summary(fit2$fitted)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.004438 0.031230 0.057700 0.081500 0.106000 0.631200

Interpretation of results

• Characterizing the effect of behavior type is the primary scientific goal
  ⋆ typically report results on the odds ratio scale
  ⋆ denote the odds ratio by θ1 = exp{β1}

• 95% CIs can be obtained in a number of ways

  (i) compute the 95% CI for β̂1 and exponentiate

  (ii) compute a 95% CI directly for θ̂1
       ∗ glm() returns the standard error estimates for the β̂'s
       ∗ use the delta method to get the standard error for θ̂1

• The approaches are equivalent asymptotically
  ⋆ in small samples, the first approach results in an asymmetric CI
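A sketch of the two constructions for a generic logistic fit; simulated data stand in for the WCGS here so the code runs on its own (the z-multiplier 1.96 and the delta-method step Var(exp(β̂1)) ≈ exp(β̂1)² Var(β̂1) are the only ingredients):

```r
set.seed(233)
x <- rbinom(500, 1, 0.5)
y <- rbinom(500, 1, plogis(-2 + 0.8 * x))
fit <- glm(y ~ x, family = binomial())

b1  <- coef(fit)["x"]
se1 <- sqrt(vcov(fit)["x", "x"])

## (i) 95% CI for beta1, then exponentiate (asymmetric about exp(b1))
ci.i <- exp(b1 + c(-1.96, 1.96) * se1)

## (ii) delta method: se(exp(b1)) ~= exp(b1) * se(b1) (symmetric CI)
theta1 <- exp(b1)
ci.ii  <- theta1 + c(-1.96, 1.96) * theta1 * se1

round(rbind(exponentiate = ci.i, delta = ci.ii), 2)
```

The first interval is guaranteed to lie in (0, ∞); the delta-method interval is symmetric about θ̂1 and, with a large standard error, its lower limit can fall below zero.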

• The getCI() function implements the first approach
  ⋆ code is available on the class website

> ##
> getCI(fit0)
            exp{beta} lower upper
(Intercept)      0.05  0.04  0.07
behave           2.36  1.79  3.10

• Interpretation of θ̂1 = 2.36:

> ##
> getCI(fit1)[1:2,]
            exp{beta} lower upper
(Intercept)      0.02  0.01  0.02
behave           1.99  1.50  2.64

• Interpretation of θ̂1 = 1.99:

> ##
> getCI(fit2)[1:2,]
            exp{beta} lower upper
(Intercept)      0.02  0.01  0.02
behave           1.98  1.49  2.63

• Interpretation of θ̂1 = 1.98:

Flexible adjustment

• When we include potential confounders in the model, we are less concerned with their interpretation
  ⋆ the primary purpose is the control of confounding bias
  ⋆ if we don't model the effects of confounders properly, there may be residual confounding

• Suggest including these covariates in the model in as flexible a manner as possible
  ⋆ go beyond linearity

• Two simple strategies for flexibly modeling continuous covariates are
  (i) including additional polynomial terms
  (ii) categorization

> ## Polynomial
> ##
> wcgs$age2 <- wcgs$age^2
> wcgs$age3 <- wcgs$age^3
...
>
> ## Categorization
> ##
> wcgs$cigsCat <- 0
> wcgs$cigsCat[wcgs$ncigs >= 10] <- 1
> wcgs$cigsCat[wcgs$ncigs >= 20] <- 2
> wcgs$cigsCat[wcgs$ncigs >= 30] <- 3
> wcgs$cigsCat[wcgs$ncigs >= 40] <- 4
>
> ##
> flex1 <- glm(chd ~ behave + age + age2 + age3 + wt + wt2 + wt3 +
               sbp + sbp2 + sbp3 + chol + chol2 + chol3 + factor(cigsCat),
               family=binomial(), data=wcgs)
> summary(flex1)
...
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)      -3.970454   0.215492 -18.425  < 2e-16 ***
behave            0.689283   0.146046   4.720 2.36e-06 ***
age              -0.140305   0.383673  -0.366 0.714597
age2              0.351120   0.270200   1.299 0.193778
age3             -0.064625   0.050819  -1.272 0.203494
wt                0.087191   0.041275   2.112 0.034650 *
wt2              -0.040637   0.014332  -2.835 0.004576 **
wt3               0.003624   0.001244   2.913 0.003580 **
sbp               0.183941   0.071780   2.563 0.010390 *
sbp2              0.007816   0.036603   0.214 0.830901
sbp3             -0.001986   0.004632  -0.429 0.668109
chol              0.324908   0.077268   4.205 2.61e-05 ***
chol2            -0.033245   0.027891  -1.192 0.233276
chol3             0.001882   0.002742   0.686 0.492457
factor(cigsCat)1  0.044957   0.286233   0.157 0.875194
factor(cigsCat)2  0.644399   0.179435   3.591 0.000329 ***
factor(cigsCat)3  0.945077   0.198961   4.750 2.03e-06 ***
factor(cigsCat)4  0.657484   0.229380   2.866 0.004152 **

> ##
> getCI(fit1)[1:2,]
            exp{beta} lower upper
(Intercept)      0.02  0.01  0.02
behave           1.99  1.50  2.64
>
> getCI(flex1)[1:2,]
            exp{beta} lower upper
(Intercept)      0.02  0.01  0.03
behave           1.99  1.50  2.65
>
> ##
> LRtest(fit1, flex1)
Test Statistic = 25.5 on 11 df => p-value = 0.01
[1] 0.01

• The likelihood ratio test suggests a ‘better’ fit, but there is virtually no impact on estimation or inference
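LRtest() is a course-provided function, but the likelihood ratio test between two nested glm() fits can be computed directly from their deviances; a minimal sketch (with simulated data so it runs standalone):

```r
set.seed(1)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- rbinom(200, 1, plogis(-1 + 0.5 * x1))

fitA <- glm(y ~ x1,      family = binomial())   # reduced model
fitB <- glm(y ~ x1 + x2, family = binomial())   # full model

lr <- fitA$deviance - fitB$deviance             # LR statistic
df <- fitA$df.residual - fitB$df.residual       # difference in dimension
p  <- pchisq(lr, df = df, lower.tail = FALSE)
c(statistic = lr, df = df, p.value = p)

## built-in equivalent: anova(fitA, fitB, test = "LRT")
```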

Link functions

• So far, we've only considered the logit link function

    g(µi) = log{ µi / (1 − µi) } = Xi^T β

• By far the most common link function used for GLMs of binary data
  ⋆ guaranteed that fitted values are in (0, 1)
  ⋆ reasonable interpretation of contrasts in terms of odds ratios
    ∗ when the event is rare: OR ≈ RR
  ⋆ ability to analyze case-control data as if it had been collected prospectively

Q: What about other link functions?

• Potential choices include:

  linear:                 g(µi) = µi
  log:                    g(µi) = log(µi)
  probit:                 g(µi) = probit(µi)
  complementary log-log:  g(µi) = log{ −log(1 − µi) }

• We've noted that there is a trade-off between interpretability and mathematical properties

• For the goal of characterizing the association between behavior type and risk of CHD, interpretability is crucial
  ⋆ examine the linear and log link functions

• If the goal is prediction, then we'd be more likely to entertain the probit and complementary log-log link functions

• In R, we use the ‘family’ argument to change the link
  ⋆ other components of the GLM that are functions of the link are appropriately adjusted

• Let's first consider changing the link function for the unadjusted analysis
  ⋆ for the binomial family, the logit link is the default, but just to show you how it works:

> ##
> logitF0 <- glm(chd ~ behave, family=binomial(link="logit"), data=wcgs)
> summary(logitF0$fitted)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.05071 0.05071 0.11180 0.08150 0.11180 0.11180
> getCI(logitF0)
            exp{beta} lower upper
(Intercept)      0.05  0.04  0.07
behave           2.36  1.79  3.10

• Now let's fit the model using the linear link:

> ##
> linearF0 <- glm(chd ~ behave, family=binomial(link="identity"), data=wcgs)
> summary(linearF0$fitted)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.05071 0.05071 0.11180 0.08150 0.11180 0.11180
> getCI(linearF0, expo=FALSE, digits=4) * 100
            beta lower upper
(Intercept) 5.07  3.98  6.16
behave      6.11  4.21  8.01

• Notice that the fitted values are the same as those obtained using the logit link

Q: Why?

• Interpretation of β̂1 = 6.11:

• Finally, let's fit the model using the log link:

> ##
> logF0 <- glm(chd ~ behave, family=binomial(link="log"), data=wcgs)
> summary(logF0$fitted)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.05071 0.05071 0.11180 0.08150 0.11180 0.11180
> getCI(logF0)
            exp{beta} lower upper
(Intercept)      0.05  0.04  0.06
behave           2.21  1.71  2.85

• Again, notice that the fitted values are the same

• Interpretation of θ̂1 = 2.21:

Q: How does changing the link function impact the adjusted analysis?

> ##
> logitF1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
                 family=binomial(), data=wcgs)
> getCI(logitF1)[1:2,]
            exp{beta} lower upper
(Intercept)      0.02  0.01  0.02
behave           1.99  1.50  2.64
>
> ##
> linearF1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
                  family=binomial(link="identity"), data=wcgs)
Error: no valid set of coefficients has been found: please supply starting values

• The IWLS algorithm is having trouble finding valid starting values

• Taking a closer look at the glm() function

> args(glm)
function (formula, family = gaussian, data, weights, subset, na.action,
    start = NULL, etastart, mustart, offset, control = list(...),
    model = TRUE, method = "glm.fit", x = FALSE, y = TRUE,
    contrasts = NULL, ...)
NULL

  we can provide our own starting values via
  ⋆ start, for the regression coefficients, β
  ⋆ etastart, for the linear predictors, {η1, ..., ηn}
  ⋆ mustart, for the fitted values, {µ1, ..., µn}

• Use values from some other fit that was successful
  ⋆ a fit using some other link function
  ⋆ a fit based on a different mean model

• Using a linear link with binary data, we also have to be careful about the mean-variance relationship
  ⋆ specified by the binomial() family

> names(binomial())
 [1] "family"     "link"       "linkfun"    "linkinv"    "variance"
 [6] "dev.resids" "aic"        "mu.eta"     "initialize" "validmu"
[11] "valideta"   "simulate"
>
> binomial()$variance
function (mu)
mu * (1 - mu)

• If, at any point during the IWLS algorithm, one of the fitted values is outside (0, 1), then the variance will be negative
  ⋆ it is unlikely that the algorithm will converge

• An alternative is to use OLS together with an appropriate variance estimator that accounts for the heteroskedasticity induced by the mean-variance relationship
  ⋆ Huber-White variance estimator
    ∗ sandwich estimator
    ∗ robust estimator
  ⋆ bootstrap variance estimator

• In R
  ⋆ use the lm() function
  ⋆ the function robustCI(), available on the class website, computes robust- and bootstrap-based 95% confidence intervals

> ##
> linearF1 <- lm(chd ~ behave + age + wt + sbp + chol + smoker, data=wcgs)
> robustCI(linearF1, digits=4, B=1000) * 100
            betaHat Naive Lo Naive Up Robust Lo Robust Up Boot Lo Boot Up
(Intercept)   -1.49    -3.37     0.39     -3.00      0.02   -3.03    0.05
behave         4.59     2.71     6.46      2.75      6.43    2.78    6.39
age            2.20     1.34     3.06      1.27      3.13    1.24    3.16
wt             0.64     0.17     1.10      0.15      1.12    0.14    1.13
sbp            1.52     0.87     2.18      0.75      2.29    0.75    2.29
chol           1.60     1.15     2.04      1.12      2.07    1.11    2.08
smoker         4.05     2.16     5.94      2.17      5.93    2.14    5.96
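robustCI() is also a course-provided function, but the Huber-White piece is short to compute by hand; a sketch of the HC0 sandwich variance for an OLS fit (simulated data stand in for wcgs; no small-sample correction applied):

```r
set.seed(2015)
x <- rbinom(300, 1, 0.5)
y <- rbinom(300, 1, 0.05 + 0.06 * x)     # a linear probability model
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)

bread <- solve(crossprod(X))             # (X'X)^{-1}
meat  <- crossprod(X * e)                # X' diag(e^2) X
Vrob  <- bread %*% meat %*% bread        # the 'sandwich'

se.rob <- sqrt(diag(Vrob))
round(coef(fit)["x"] + c(-1.96, 1.96) * se.rob["x"], 4)  # robust 95% CI
```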

• Interpretation of β̂1 = 4.59:

Q: What about the negative fitted values?

> ##
> summary(linearF1$fitted)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.11570  0.03415  0.07964  0.08150  0.12760  0.32570

[Figure: fitted values using the linear link (−0.1 to 0.6) plotted against fitted values using the logit link]

• Clearly, some of the fitted values are < 0

> ##
> range(logitF1$fitted[linearF1$fitted <= 0])
[1] 0.004259678 0.021542907

• For observations with linear-link fitted values < 0, the corresponding logit-link fitted values are all small

• The fitted values that are > 0 are in a much tighter range of values
  ⋆ maximum value of 0.326, as opposed to 0.623 for the logistic model

• Turning to the log link:

> ##
> logF1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
               family=binomial(link="log"), data=wcgs)
Error: no valid set of coefficients has been found: please supply starting values

✫345 BIO 233, Spring 2015 ✪ ✬This time we can’t use the lm() function but we can provide starting ✩ • values from the (successful) fit of the logistic regression:

> ##
> logF1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
+              family=binomial(link="log"), mustart=fitted(logitF1), data=wcgs)
> getCI(logF1)[1:2,]
            exp{beta} lower upper
(Intercept)      0.02  0.01  0.03
behave           1.78  1.38  2.29
> summary(logF1$fitted)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.007259 0.035030 0.058970 0.080380 0.100700 0.704600

• All of the fitted values are in (0, 1)

• Interpretation of θ̂1 = 1.78:

[Figure: fitted values using a log link (y-axis) plotted against fitted values using a logit link (x-axis); both axes span 0.0 to 0.7]

• Summary of results:

Link function   Contrast   Unadjusted model     Adjusted model
logit           OR         2.36 (1.79, 3.10)    1.99 (1.50, 2.64)
linear          RD         6.11 (4.21, 8.01)∗   4.59 (2.75, 6.43)∗
log             RR         2.21 (1.71, 2.85)    1.78 (1.38, 2.29)

∗ 95% CI based on the Huber-White robust standard error estimate

• Convincing evidence of a statistically significant difference in CHD risk between Type A and Type B behavior types
⋆ however you define the contrast

Q: Do you think we can claim clinical significance?

The Bayesian Solution

• GLMs for binary data are specified by:

Yi | Xi ~ Bernoulli(µi)
g(µi) = Xiᵀβ

• The unknown parameters are the regression coefficients, β
⋆ p + 1 parameters

• In the absence of prior knowledge, it is typical to adopt a flat prior:

π(β) ∝ 1

Computation

• Generate samples from the posterior

π(β | y) ∝ L(β; y) π(β)

via the Metropolis-Hastings algorithm

• Use the asymptotic sampling distribution of the MLE as a proposal distribution

q(β; y) ≡ Normal(β̂MLE, Iββ⁻¹)

⋆ from the (usual) frequentist fit of the GLM

• Also use this distribution for starting values

##
fit1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
            family=binomial(), data=wcgs)

##
betaHat <- fit1$coef
betaVar <- summary(fit1)$cov.unscaled
X <- model.matrix(fit1)
Y <- model.frame(fit1)[,1]

## 3 chains, each for 1,000 scans
##
M <- 3
R <- 1000
startVals <- rmvnorm(M, betaHat, betaVar)
posterior <- array(NA, dim=c(R, length(betaHat), M))
accept    <- array(0, dim=c(R, M))
for(m in 1:M)
{
  ##
  beta <- startVals[m,]
  mu   <- as.vector(expit(X %*% beta))

  ##
  for(r in 1:R)
  {
    ##
    betaStar <- as.vector(rmvnorm(1, betaHat, betaVar))
    muStar   <- as.vector(expit(X %*% betaStar))
    ##
    logpiRatio <- sum(dbinom(Y, 1, muStar, log=TRUE)) -
                  sum(dbinom(Y, 1, mu, log=TRUE))
    logqRatio  <- log(dmvnorm(beta, betaHat, betaVar)) -
                  log(dmvnorm(betaStar, betaHat, betaVar))
    aR <- exp(logpiRatio + logqRatio)
    if(runif(1) < aR)
    {
      beta <- betaStar
      mu   <- muStar
      accept[r,m] <- 1
    }
    posterior[r,,m] <- beta
  }
}
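For readers who want a self-contained version of this scheme outside R, the following is a minimal Python sketch of the same independence Metropolis-Hastings algorithm; the simulated data, sample size, and number of scans are assumptions for illustration only:

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X, y, iters=25):
    """Newton-Raphson MLE for logistic regression; returns the estimate
    and the inverse Fisher information (the asymptotic covariance)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = expit(X @ beta)
        info = (X * (mu * (1 - mu))[:, None]).T @ X
        beta = beta + np.linalg.solve(info, X.T @ (y - mu))
    return beta, np.linalg.inv(info)

def mh_logistic(X, y, R=2000, seed=1):
    """Independence Metropolis-Hastings for flat-prior Bayesian logistic
    regression, proposing from the asymptotic Normal(MLE, I^-1)."""
    rng = np.random.default_rng(seed)
    bhat, V = logistic_mle(X, y)
    L = np.linalg.cholesky(V)
    Vinv = np.linalg.inv(V)

    def loglik(b):
        mu = expit(X @ b)
        return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

    def logq(b):                       # proposal log-density, up to a constant
        d = b - bhat
        return -0.5 * d @ Vinv @ d

    beta = bhat + L @ rng.standard_normal(len(bhat))   # starting value
    draws = np.empty((R, len(bhat)))
    accepted = 0
    for r in range(R):
        star = bhat + L @ rng.standard_normal(len(bhat))
        # acceptance ratio: likelihood ratio times proposal-density ratio
        logA = (loglik(star) - loglik(beta)) + (logq(beta) - logq(star))
        if np.log(rng.uniform()) < logA:
            beta, accepted = star, accepted + 1
        draws[r] = beta
    return draws, accepted / R

# simulated data: intercept plus one binary covariate
rng = np.random.default_rng(0)
n = 1000
x = rng.binomial(1, 0.5, size=n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, expit(-2.0 + 0.7 * x)).astype(float)

draws, acc = mh_logistic(X, y)
```

Because the proposal closely approximates the flat-prior posterior, the acceptance rate is high, mirroring the rates reported for the WCGS fit below.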

• Examine trace plots for evidence of convergence (or lack thereof)

[Figure: trace plots over 1,000 scans, for each of the 3 chains, of the intercept β0 and the log-OR for behave, β1]

• Acceptance rate for the Metropolis-Hastings algorithm:

> ##
> accRate <- round(apply(accept, 2, mean) * 100, 1)
> accRate
[1] 89.5 88.0 90.9

• Proposal and posterior distribution for the log-OR of behave, β1

[Figure: overlaid densities of the proposal and posterior distributions for β1, on the range 0.00 to 1.50]

• Summaries of the posterior distribution
⋆ potential scale reduction (PSR)
⋆ results based on the Bayesian analysis
∗ pool samples from the 3 chains, each with 10% burn-in
⋆ MLE and 95% confidence interval

              PSR  Median  2.5%  97.5%   exp{beta}  lower  upper
(Intercept) 1.002    0.02  0.01   0.02        0.02   0.01   0.02
behave      1.000    1.99  1.50   2.64        1.99   1.50   2.64
age         1.001    1.38  1.22   1.54        1.37   1.22   1.54
wt          0.999    1.10  1.03   1.17        1.10   1.03   1.17
sbp         1.000    1.19  1.10   1.29        1.19   1.10   1.29
chol        1.000    1.24  1.17   1.32        1.24   1.17   1.32
smoker      1.000    1.85  1.41   2.45        1.84   1.39   2.42
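The PSR column above can be sketched in a few lines; this is a minimal Python version of the Gelman-Rubin diagnostic (without chain-splitting), applied here to simulated chains rather than the WCGS posterior draws:

```python
import numpy as np

def psr(chains):
    """Potential scale reduction (Gelman-Rubin) for a single parameter.
    `chains` has shape (R, M): R retained scans from each of M chains."""
    R, M = chains.shape
    W = chains.var(axis=0, ddof=1).mean()        # mean within-chain variance
    B = R * chains.mean(axis=0).var(ddof=1)      # between-chain variance
    var_plus = (R - 1) / R * W + B / R           # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(900, 3))            # three well-mixed chains
stuck = mixed + np.array([0.0, 0.0, 3.0])    # one chain off on its own
print(psr(mixed), psr(stuck))                # near 1 vs well above 1
```

Values near 1, as in the table, indicate that the between-chain variability is no larger than the within-chain variability, i.e. no evidence against convergence.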

• Numerical results based on the Bayesian and frequentist analyses are virtually identical
⋆ differ in their interpretation

• Posterior distribution for the OR of behave, θ1 = exp{β1}
⋆ posterior median/mean and (central) 95% credible interval

[Figure: posterior density of θ1, on the range 1.0 to 3.5]

Log link

• Suppose we want to model the RR, rather than the OR
⋆ log link, rather than the logit link

• In terms of the model specification, the only thing that changes is the dependence of the mean on the linear predictor:

Yi | Xi ~ Bernoulli(µi)
log(µi) = Xiᵀβ

⋆ form of the likelihood is the same

• Retain the flat prior for β
⋆ even though the parameters are different

• Operationally, we need to modify the Metropolis-Hastings algorithm:

(1) change how the µi’s are calculated to evaluate the likelihood/posterior

µi = expit(Xiᵀβ)  ⇒  µi = exp(Xiᵀβ)

(2) check that the proposed value of β yields a valid set of µi’s

∗ if the proposal yields any µi ∉ (0, 1) then we automatically reject
∗ the proposal will have zero posterior probability

• At the rth scan for the mth chain, the algorithm proceeds as:

##
betaStar <- as.vector(rmvnorm(1, betaHat, betaVar))
muStar   <- as.vector(exp(X %*% betaStar))   ## change to the link
##
if(sum(muStar <= 0 | muStar >= 1) == 0)
{
  logpiRatio <- sum(dbinom(Y, 1, muStar, log=TRUE)) -
                sum(dbinom(Y, 1, mu, log=TRUE))
  logqRatio  <- log(dmvnorm(beta, betaHat, betaVar)) -
                log(dmvnorm(betaStar, betaHat, betaVar))
  aR <- exp(logpiRatio + logqRatio)
  if(runif(1) < aR)
  {
    beta <- betaStar
    mu   <- muStar
    accept[r,m] <- 1
  }
}
## record the current state even when an out-of-range proposal is rejected
posterior[r,,m] <- beta

• Examine trace plots for evidence of convergence (or lack thereof)

[Figure: trace plots over 1,000 scans, for each of the 3 chains, of the intercept β0 and the coefficient for behave, β1]

• Acceptance rate for the Metropolis-Hastings algorithm:

> ##
> accRate <- round(apply(accept, 2, mean) * 100, 1)
> accRate
[1] 71.5 73.2 72.8

• Results:

              PSR  Median  2.5%  97.5%   exp{beta}  lower  upper
(Intercept) 1.000    0.02  0.01   0.03        0.02   0.01   0.03
behave      1.001    1.78  1.37   2.32        1.78   1.38   2.29
age         1.001    1.30  1.19   1.43        1.30   1.18   1.44
wt          1.000    1.05  1.01   1.10        1.06   1.01   1.11
sbp         1.000    1.15  1.08   1.22        1.15   1.08   1.23
chol        1.001    1.18  1.13   1.24        1.19   1.13   1.24
smoker      1.001    1.70  1.36   2.12        1.69   1.33   2.14

• Again, the numerical results are virtually identical, although the interpretation differs

Confounding and Collapsibility

Linear regression

• For a continuous response variable, consider two models:

E[Y | X, Z] = β0 + β1X + β2Z     (1)
E[Y | X]    = α0 + α1X           (2)

• In model (1), β1 is a conditional parameter
⋆ contrast conditions on the value of Z

• In model (2), α1 is a marginal parameter
⋆ contrast does not condition on anything

Q: How are these parameters related?

• It’s straightforward to show that

E[Y | X] = EZ[ E[Y | X, Z] ]
         = ∫z E[Y | X, Z] fZ|X(Z = z | X) ∂z
         = β0 + β1X + β2 E[Z | X]

• So the marginal contrast equals

α1 = E[Y | X = (x + 1)] − E[Y | X = x]
   = β1 + β2 { E[Z | X = (x + 1)] − E[Z | X = x] }

• The expression within the brackets is the slope from a linear regression of Z ~ X

• Using this fact, we can write

α1 = β1 + β2 COV[X, Z] / V[X]

⋆ the marginal contrast is the conditional contrast plus a bias term

• Bias requires both β2 ≠ 0 and COV[X, Z] ≠ 0
⋆ Z is related to Y
⋆ Z is related to X
⋆ i.e. Z is a confounder

• The direction of the bias depends on the interplay between β2 and COV[X, Z]
⋆ confounding bias may be positive or negative
⋆ confounding may result in an estimate that is too big or too small
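The relation α1 = β1 + β2 COV[X, Z]/V[X] holds not just for the true parameters but exactly for the OLS estimates themselves (the classical omitted-variable-bias algebra). A quick numeric check on simulated data, where the simulated coefficients are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)               # Z is associated with X
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b0, b1, b2 = ols(np.column_stack([ones, x, z]), y)  # conditional model (1)
a0, a1 = ols(np.column_stack([ones, x]), y)         # marginal model (2)
g0, g1 = ols(np.column_stack([ones, x]), z)         # slope of Z ~ X

# omitted-variable identity: alpha1 = beta1 + beta2 * (slope of Z on X),
# where the fitted slope g1 is the sample COV[X,Z]/V[X]
print(np.isclose(a1, b1 + b2 * g1))                 # → True
```

The identity is exact in the fitted coefficients, not just asymptotically, which is why the direction and size of confounding bias can be read directly off β̂2 and the X/Z slope.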

• If either β2 = 0 or COV[X, Z] = 0, then β1 = α1

• Therefore, if Z is a precision variable then β1 and α1 have
⋆ different interpretations
⋆ the same numerical value

• However, as the name suggests, the standard error of β̂1 will be smaller than the standard error of α̂1

• Suggests that adjusting for a precision variable is a good thing, even if one is interested in the marginal association

Logistic regression

Q: Does the same hold for logistic regression?
⋆ how are the marginal and conditional parameters related?

• For a binary outcome, consider two models:

logit E[Y | X, Z] = β0 + β1X + β2Z     (3)
logit E[Y | X]    = α0 + α1X           (4)

• The conditional odds ratio for a binary X is

θ^c_x = exp{β1} = ( E[Y = 1 | X = 1, Z] / E[Y = 0 | X = 1, Z] ) / ( E[Y = 1 | X = 0, Z] / E[Y = 0 | X = 0, Z] )

⋆ conditional on the value of Z

• The marginal odds ratio for X is

θ^m_x = exp{α1} = ( E[Y = 1 | X = 1] / E[Y = 0 | X = 1] ) / ( E[Y = 1 | X = 0] / E[Y = 0 | X = 0] )

where

E[Y | X] = ∫z E[Y | X, Z] fZ|X(Z = z | X) ∂z

• The relationship between the conditional contrast θ^c_x and the marginal contrast θ^m_x is not straightforward
⋆ no simple, closed-form expression for θ^m_x as a function of θ^c_x

• In particular, unlike in the setting of linear regression, they are not linearly related

• We can, however, calculate θ^m_x numerically

• To do so, from the expression for E[Y | X], we need to specify
⋆ E[Y | X, Z]
⋆ fZ|X(Z = z | X)

• The first component is given by the logistic regression model:

logit E[Y | X, Z] = β0 + β1X + β2Z

• For binary X and Z, it’s convenient to represent fZ|X(Z = z | X) via the logistic regression

logit E[Z | X] = γ0 + γ1X

⋆ notationally, let φ_XZ = exp{γ1} denote the X/Z odds ratio

• The following slides consider the percent difference:

(θ^m_1 − θ^c_1) / θ^c_1 × 100

under various scenarios for
⋆ the conditional odds ratio for X, θ^c_x
⋆ the conditional odds ratio for Z, θ^c_z
⋆ the X/Z odds ratio, φ_XZ

• Throughout, the following are held fixed
⋆ P(X = 1) = 0.2
⋆ P(Z = 1 | X = 0) = 0.2
⋆ P(Y = 1) = 0.1

• R code is available on the course website
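The numeric calculation itself is short. As a sketch (in Python rather than the course's R code; the β and γ values below are assumptions for illustration, apart from P(Z = 1 | X = 0) = 0.2, which matches the setup above):

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def marginal_or(b0, b1, b2, g0, g1):
    """Marginal odds ratio for binary X implied by
    logit E[Y|X,Z] = b0 + b1*X + b2*Z and logit E[Z|X] = g0 + g1*X."""
    def p_y(x):
        # E[Y|X=x] = sum over z of E[Y|x,z] * P(Z=z|x)
        pz = expit(g0 + g1 * x)
        return expit(b0 + b1 * x + b2) * pz + expit(b0 + b1 * x) * (1 - pz)
    p1, p0 = p_y(1), p_y(0)
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# no X/Z association (g1 = 0, so Z is a precision variable) but a strong
# Z effect: the marginal OR is attenuated toward 1 relative to the
# conditional OR -- non-collapsibility, not confounding
theta_c = np.exp(1.0)                                   # conditional OR
theta_m = marginal_or(-2.5, 1.0, 1.5, np.log(0.25), 0.0)
print(theta_c, theta_m)                                 # theta_m < theta_c
```

Setting b2 = 0 instead makes the two coincide exactly, mirroring the linear-regression result for a variable unrelated to Y.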

• Strong confounder/exposure association: φ_XZ = 0.33

[Figure: percentage difference between θ^m_x and θ^c_x (−30 to 60) against the conditional odds ratio for Z, θ^c_z (0.20 to 5.00), with curves for θ^c_x = 0.20, 0.50, 0.67, 1.00, 1.50, 2.00 and 5.00]

• Strong confounder/exposure association: φ_XZ = 3.00

[Figure: percentage difference between θ^m_x and θ^c_x (−30 to 60) against the conditional odds ratio for Z, θ^c_z (0.20 to 5.00), with curves for θ^c_x = 0.20, 0.50, 0.67, 1.00, 1.50, 2.00 and 5.00]

• Moderate confounder/exposure association: φ_XZ = 0.50

[Figure: percentage difference between θ^m_x and θ^c_x (−30 to 40) against the conditional odds ratio for Z, θ^c_z (0.20 to 5.00), with curves for θ^c_x = 0.20, 0.50, 0.67, 1.00, 1.50, 2.00 and 5.00]

• Moderate confounder/exposure association: φ_XZ = 2.00

[Figure: percentage difference between θ^m_x and θ^c_x (−30 to 40) against the conditional odds ratio for Z, θ^c_z (0.20 to 5.00), with curves for θ^c_x = 0.20, 0.50, 0.67, 1.00, 1.50, 2.00 and 5.00]

• Weak confounder/exposure association: φ_XZ = 0.80

[Figure: percentage difference between θ^m_x and θ^c_x (−20 to 20) against the conditional odds ratio for Z, θ^c_z (0.20 to 5.00), with curves for θ^c_x = 0.20, 0.50, 0.67, 1.00, 1.50, 2.00 and 5.00]

• Weak confounder/exposure association: φ_XZ = 1.20

[Figure: percentage difference between θ^m_x and θ^c_x (−20 to 20) against the conditional odds ratio for Z, θ^c_z (0.20 to 5.00), with curves for θ^c_x = 0.20, 0.50, 0.67, 1.00, 1.50, 2.00 and 5.00]

• No confounder/exposure association: φ_XZ = 1.00

[Figure: percentage difference between θ^m_x and θ^c_x (−10 to 10) against the conditional odds ratio for Z, θ^c_z (0.20 to 5.00), with curves for θ^c_x = 0.20, 0.50, 0.67, 1.00, 1.50, 2.00 and 5.00]

• As with linear regression, confounding bias may lead to marginal contrasts that are either bigger or smaller than the conditional contrast
⋆ the true association may be of the opposite sign to the estimated association
⋆ depends on whether or not the signs of θ^c_z and φ_XZ (on the log scale) are the same or opposite

• The magnitude of confounding bias depends on an interplay between θ^c_x, θ^c_z and φ_XZ

• If φ_XZ = 1, then θ^m_x may not equal θ^c_x
⋆ i.e. Z is a precision variable
⋆ this difference is not confounding bias
⋆ it is due to the non-collapsibility of the odds ratio

• In contrast to linear regression, if Z is a precision variable then θ^m_x and θ^c_x have
⋆ different interpretations
⋆ different numerical values

Q: How does one choose between the target parameters?

Stratified designs

• So far, we’ve considered estimation and inference based on an independent sample of size n, {(Xi, Yi); i = 1, ..., n}, and the likelihood:

L = ∏_{i=1}^{n} P(Yi | Xi)

⋆ parameterize P(Y | X) in terms of a regression model, µ = E[Y | X; β]
⋆ learn about the regression coefficients, β

• Prospective sampling:
⋆ choose individuals on the basis of their covariates and ‘observe’ their outcomes
⋆ Y is random, conditional on X

• Cross-sectional sampling:
⋆ choose individuals completely at random and ‘observe’ their outcomes/covariates
⋆ (Y, X) are jointly random, so that the likelihood is

L = ∏_{i=1}^{n} P(Yi, Xi) = ∏_{i=1}^{n} P(Yi | Xi) P(Xi)

⋆ assume that the marginal covariate distribution does not provide information about the prospective association(s)
⋆ base estimation/inference on

L∗ = ∏_{i=1}^{n} P(Yi | Xi)

• In many settings, these sampling schemes are perfectly reasonable

• However, there are settings where we may need a surprisingly large sample size to have reasonable power

• King County birth weight data:
⋆ examine power to detect an association between lbw and welfare based on the logistic model:

lbw ~ welfare + married + college + age + smoker + wpre

⋆ use simulation to estimate power under a range of scenarios
∗ odds ratio: 1.5, 2.0, and 3.0
∗ sample size: 3,000 → 8,000
∗ Homework #6

[Figure: estimated power for the welfare effect (0 to 100%) as a function of sample size, n = 3,000 to 8,000]

⋆ with a sample size of n=8,000, we would have an estimated 67% power to detect an odds ratio of 2.0

• That the outcome is rare is a key reason why power is so low
⋆ incidence of 5.1% in the observed sample
⋆ controlled in the simulation by manipulating the value of β0

• As we draw random samples, we get very few LBW events
⋆ see the direct impact on the standard error for the odds ratio association between a binary X and a binary outcome Y:

se[θ̂] = θ̂ √(1/n00 + 1/n01 + 1/n10 + 1/n11)

Q: What happens if we increase the incidence?

• Repeat the simulations for the association between welfare and lbw
⋆ manipulate β0 such that the incidence increases from 0.05 to 0.20
⋆ fix the sample size at n=4,000
⋆ estimated power based on a Wald test:

                          Incidence
Odds ratio   0.05   0.10   0.15   0.20   0.25
1.5          18.6   24.5   29.3   30.8   32.8
2.0          42.9   57.2   65.3   69.1   70.5

⋆ as incidence increases → power increases
⋆ rate of increase is not dramatic because the exposure of interest (welfare) is also rare

• In practice, of course, we cannot manipulate incidence

• But we can manipulate the (relative) number of cases and non-cases that we observe in the data
⋆ i.e., artificially inflate the observed incidence
⋆ for example, via a case-control design

• The problem is that the sample is no longer representative of the target population
⋆ the sample is non-random

• But this non-randomness is by design
⋆ under the control of the researcher
⋆ such designs are referred to as biased sampling schemes
⋆ use statistical techniques to account for the non-random sampling

Case-control studies

• In a case-control study, we initially stratify the population by outcome status
⋆ know Y = 0/1 for everyone
⋆ for any given individual, we can (easily) determine Y

• Proceed by sampling, at random,
⋆ n1 cases, i.e. individuals for whom Y = 1
⋆ n0 non-cases or controls, i.e. individuals for whom Y = 0

• For all n = n0 + n1 sampled individuals, ‘observe’ the value of their covariates
⋆ crucial: X is random and not Y

• The appropriate likelihood is

LR = ∏_{i=1}^{n} P(Xi | Yi) = ∏_{i=1}^{n0} P(Xi | Yi = 0) × ∏_{i=n0+1}^{n0+n1} P(Xi | Yi = 1)

⋆ n independent, outcome-specific contributions
⋆ retrospective likelihood

• However, the scientific goal is (most often) to learn about prospective associations
⋆ i.e., P(Y | X)

Q: How do we learn about prospective associations from the retrospective likelihood?

• Consider the logistic regression model:

logit P(Y = 1 | X) = Xᵀβ

⋆ model corresponds to the target population of interest

• As we’ve noted, case-control sampling is non-random with respect to the target population

• Formalize this by introducing a random variable S that indicates selection by the sampling scheme:

S = 1 if selected; S = 0 if not selected

⋆ a binary random variable with some probability, P(S = 1)

• Cross-sectional sampling
⋆ selection is independent of (Y, X)
⋆ P(S = 1) is constant

• Prospective sampling
⋆ selection depends on the covariate values, X
⋆ write P(S = 1 | X)

• Case-control sampling
⋆ selection depends on outcome status, Y
⋆ write P(S = 1 | Y = y)

• Now consider the distribution of the outcome, conditional on being selected:

P(Y = 1 | X, S = 1)

• Using Bayes’ Theorem and noting that selection depends solely on Y:

P(Y = 1 | X, S = 1) = P(S = 1 | X, Y = 1) P(Y = 1 | X) / P(S = 1 | X)

                    = P(S = 1 | X, Y = 1) P(Y = 1 | X) / Σ_{y=0}^{1} P(S = 1 | X, Y = y) P(Y = y | X)

                    = P(S = 1 | Y = 1) P(Y = 1 | X) / Σ_{y=0}^{1} P(S = 1 | Y = y) P(Y = y | X)

                    = π1 P(Y = 1 | X) / Σ_{y=0}^{1} πy P(Y = y | X)

• Dividing the numerator and denominator by π0 × P(Y = 0 | X):

P(Y = 1 | X, S = 1) = (π1/π0) exp{Xᵀβ} / ( 1 + (π1/π0) exp{Xᵀβ} )

                    = exp{β0∗ + β1X1 + ... + βKXK} / ( 1 + exp{β0∗ + β1X1 + ... + βKXK} )

where

β0∗ = β0 + log(π1/π0)

• We see that P(Y = 1 | X, S = 1) has the same functional form as the desired logistic regression model
⋆ if P(Y = 1 | X) is of logistic form then so is P(Y = 1 | X, S = 1)

• The odds ratio relationships between X and Y are preserved
⋆ despite the ‘selection process’
⋆ in Homework #5, we saw that bias (for odds ratios) only arises when selection depends on both Y and X

• The intercepts of the two logistic models are different, however

• All this suggests that, if the primary goal is to learn about odds ratio parameters, estimation/inference could proceed by forming a likelihood using these probabilities:

LP = ∏_{i=1}^{n} P(Yi | Xi, Si = 1)

⋆ ignores the fact that the sample was obtained via a case-control scheme
⋆ i.e., pretend that the sample was obtained prospectively

• Use LP to learn about {β0∗, β1, ..., βK}

• In principle, we can also learn about the intercept, β0, if we have information on the probabilities of selection, π0 and π1:

β0 = β0∗ − log(π1/π0)

• While this seems reasonable, showing that P(Y = 1 | X, S = 1) and P(Y = 1 | X) have the same functional form is not sufficient

• Recall the retrospective likelihood:

LR = ∏_{i=1}^{n} P(Xi | Yi) = ∏_{i=1}^{n} P(Xi | Yi, Si = 1)
   = ∏_{i=1}^{n} P(Yi | Xi, Si = 1) P(Xi | Si = 1) / P(Yi | Si = 1)

⋆ the components of LP correspond to the first component of LR, but LP ignores the other terms

• Crucially, the P(Yi | Xi, Si = 1) contributions are not independent of each other
⋆ as is assumed by LP

• The true joint distribution of the outcomes {Y1, ..., Yn} is constrained by the sampling scheme
⋆ the case-control sampling scheme dictates that there will be n0 controls and n1 cases
⋆ so the {Y1, ..., Yn} cannot freely vary

• To see this more formally, note that

LR = ∏_{i=1}^{n0} P(Xi | Yi = 0) × ∏_{i=n0+1}^{n0+n1} P(Xi | Yi = 1)

   = ∏_{i=1}^{n0} [ P(Yi = 0 | Xi, Si = 1) P(Xi | Si = 1) / P(Yi = 0 | Si = 1) ]
     × ∏_{i=n0+1}^{n0+n1} [ P(Yi = 1 | Xi, Si = 1) P(Xi | Si = 1) / P(Yi = 1 | Si = 1) ]

• The denominators in each term are

P(Y = 0 | S = 1) = ∫ P(Y = 0 | X = x, S = 1) P(X = x | S = 1) ∂x
P(Y = 1 | S = 1) = ∫ P(Y = 1 | X = x, S = 1) P(X = x | S = 1) ∂x

⋆ the terms on the LHS are under the control of the researcher
⋆ if they are both equal to 1/2, then we have a balanced case-control study
∗ i.e. n0 = n1

• These expressions impose constraints on P(Y = y | X = x, S = 1)
⋆ they cannot freely vary because each of the P(Y = y | S = 1) are fixed

• These constraints imply that the values that β∗ can take on are also subject to constraints

• Indicates that, to obtain an estimate of β∗ via

LP(β∗) = ∏_{i=1}^{n} P(Yi | Xi, Si = 1)

one must maximize over a constrained parameter space
⋆ use a constrained optimization procedure, such as Lagrange multipliers

• However, it turns out that the unconstrained MLE is the same as the constrained MLE
⋆ maximizing LP(β∗) without regard to the constraints yields the appropriate (constrained) MLE

• Inference can also proceed without regard to the sampling scheme
⋆ asymptotic variance is also the same (Prentice and Pyke, 1979)
⋆ likelihood ratio tests are valid (Scott and Wild, 1989)

• So we can ignore the constraints imposed by the sampling scheme and proceed as if the data had been collected prospectively

• Caveats:
⋆ cannot draw conclusions about β0 without additional information
⋆ cannot perform prediction without additional information
⋆ cannot learn about other contrasts, such as the relative risk, without additional information

• Further, one must also be aware of a number of non-statistical issues:
⋆ issues associated with observational studies
⋆ appropriate selection of controls
⋆ recall bias

Esophageal cancer

• Consider data from a case-control study of the association between alcohol/tobacco consumption and risk of esophageal cancer
⋆ conducted in France in the 1970s
⋆ data and documentation available on the course website
⋆ also available directly in R

> ##
> esoph <- read.table("Esoph_data.txt")
> table(esoph$Y)

  0   1
975 200

• In the sample, 200 of 1,175 individuals are ‘cases’
⋆ rate of 17%
⋆ overall incidence in the U.S. is around 5 per 100,000

• Information on three covariates:

> ## Age, years
> ##
> esoph$ageF <- factor(esoph$agegp, label=c(" 25-34", " 35-44", " 45-54",
+                                           " 55-64", " 65-74", " 75+ "))
> table(esoph$Y, esoph$ageF)

     25-34  35-44  45-54  55-64  65-74  75+
  0    116    199    213    242    161   44
  1      1      9     46     76     55   13

> ## Alcohol consumption, gm/day
> ##
> esoph$alcF <- factor(esoph$alcgp, label=c(" 0-39", " 40-79", " 80-119", " 120+"))
> table(esoph$Y, esoph$alcF)

     0-39  40-79  80-119  120+
  0   415    355     138    67
  1    29     75      51    45

> ## Tobacco consumption, gm/day
> ##
> esoph$tobF <- factor(esoph$tobgp, label=c(" 0-9", " 10-19", " 20-29", " 30+"))
> table(esoph$Y, esoph$tobF)

     0-9  10-19  20-29  30+
  0  525    236    132   82
  1   78     58     33   31

• Use logistic regression to investigate the joint association between alcohol/tobacco consumption and risk of esophageal cancer

> ## Compare a main effects only model to one with an interaction term
> ##
> fit0 <- glm(Y ~ ageF + alcF + tobF, data=esoph, family=binomial())
> fit1 <- glm(Y ~ ageF + alcF * tobF, data=esoph, family=binomial())
> LRtest(fit0, fit1)
Test Statistic = 6.5 on 9 df => p-value = 0.69
[1] 0.69

> ## Report results from the main effects only model
> ##
> getCI(fit0)
            exp{beta} lower upper
...
alcF 40-79       3.07  1.92  4.90
alcF 80-119      4.25  2.54  7.12
alcF 120+        8.29  4.72 14.57
tobF 10-19       1.41  0.94  2.10
tobF 20-29       1.49  0.92  2.41
tobF 30+         2.38  1.39  4.09

• Interpretation of θ̂ = 3.07:

• Interpretation of θ̂ = 2.38:

Matched case-control studies

North Carolina infant mortality

• Suppose interest lies in the relationship between birth weight and infant mortality
⋆ death within the first year of life

• The infants dataset has information on 225,152 births
⋆ all Caucasian and African-American births in NC in 2003-4

• Infant mortality is a rare event
⋆ 1,752 events in the available data
⋆ rate of 8 per 1,000 births

• In the absence of these complete data, we would need to conduct a study

• Suppose we have sufficient resources to collect information on n=400 births

• Simple random sampling would yield very few ‘cases’
⋆ on average: 3 deaths

• A case-control design would be much more efficient
⋆ randomly sample n1=200 cases, from the 1,752 infant deaths
⋆ randomly sample n0=200 controls, from the 223,400 non-deaths
⋆ retrospectively ‘observe’ their covariate values, X

• The dataset includes information on a number of potential confounders
⋆ mothers’ age at birth, smoking during pregnancy, weight gained
⋆ gestational period
⋆ babies’ race and gender

> ##
> load("NorthCarolina_data.dat")
>
> infants$Y <- as.numeric(infants$dTime < 365)
> table(infants$Y)

     0      1
223400   1752

> ##
> cases <- sample(c(1:nrow(infants))[infants$Y == 1], 200)
> conts <- sample(c(1:nrow(infants))[infants$Y == 0], 200)
>
> dataCC <- infants[c(conts, cases),]
> dim(dataCC)
[1] 400   9
>
> table(dataCC$Y)

  0   1
200 200

• Under case-control sampling: X is random, given Y

• Examine the distributions of X among the controls and among the cases

>
> names(dataCC)
[1] "race"   "sex"    "mage"   "smoker" "gained" "weight" "weeks"  "dTime"
[9] "Y"
>
> tapply(dataCC$sex, dataCC$Y, FUN=mean)
    0     1
0.550 0.615
>
> tapply(dataCC$race, dataCC$Y, FUN=mean)
    0     1
0.245 0.455
>
> tapply(dataCC$smoker, dataCC$Y, FUN=mean)
   0    1
0.11 0.25

• Distribution of mothers’ age, in years, by case-control status

[Figure: side-by-side boxplots of mothers’ age (15 to 55 years) for controls and cases]

• Distribution of weight gained, in lbs, by case-control status

[Figure: side-by-side boxplots of weight gained (20 to 50 lbs) for controls and cases]

• Distribution of gestational period, in weeks, by case-control status

[Figure: side-by-side boxplots of gestational period (20 to 50 weeks) for controls and cases]

• There seems to be substantial imbalance in the distributions of several of the potential confounders between the cases and the controls

Q: What impact could this have?

• It would be desirable to have greater balance between the cases and controls

• One approach to achieving this is through matching

Matching

• The case-control design assumes the population can be stratified on the basis of Y

• Suppose we also have access to information on certain covariates for everyone
⋆ e.g. we know the gender and race of the baby but we don’t (readily) have information on the mother or the pregnancy
⋆ denote these covariates by Z

• We could impose balance on Z by drawing controls such that the distribution of Z is the same as in the cases
⋆ e.g. each time you draw a case, only draw from the controls with the same values of Z

• Each individual belongs to a matched set
⋆ K such sets, or strata
⋆ Nk individuals in the kth set

• In an M:1 matched case-control study
⋆ K = n1
⋆ Nk = M + 1

• The pair-matched case-control study is a special case where M = 1

• Data structure:
⋆ Yki is the outcome for the ith individual in the kth matched set
∗ i = 1, ..., Nk
∗ k = 1, ..., K
⋆ Xki is their corresponding covariate vector

• By having matched, we have introduced a new form of non-randomness
⋆ retrospective sampling of X is only random within each matched set

• The appropriate retrospective likelihood is

LR = ∏_{k=1}^{K} ∏_{i=1}^{Nk} P(Xki | Yki, set k)

• We could follow the same arguments as before and base estimation/inference on the prospective likelihood

LP = ∏_{k=1}^{K} ∏_{i=1}^{Nk} P(Yki | Xki, set k)

⋆ again ignore the constraints imposed by the fixed number of cases and controls

• The components of the likelihood require specification of the distribution of Y conditional on both X and being in the kth matched set

• Can be achieved by including a set-specific intercept in the model
⋆ e.g. for a single covariate

logit P(Yki = 1 | Xki) = αk + β1Xki

• As with a standard case-control study, we could proceed with estimation of {α1, ..., αK, β1} by ignoring the (retrospective) sampling scheme
⋆ Prentice and Pyke, 1979

• However, the asymptotics are not well-behaved
⋆ the number of parameters increases with the sample size
⋆ K plus the number of β coefficients for the covariate effects

• Further, interest lies primarily with β1
⋆ the interpretation of the αk’s doesn’t correspond to anything real
∗ the constructs that they represent are artificial

• The K intercepts {α1, ..., αK} are nuisance parameters

• Two general techniques for ‘eliminating’ nuisance parameters
⋆ marginalizing (or integrating)
⋆ conditioning

Conditional logistic regression

• To eliminate nuisance parameters based on conditioning, we need to identify their sufficient statistics

• The (prospective) likelihood is

∏_{k=1}^{K} ∏_{i=1}^{Nk} P(Yki = yki | Xki) = ∏_{k=1}^{K} ∏_{i=1}^{Nk} exp{αk + β1Xki}^{yki} / ( 1 + exp{αk + β1Xki} )

  = exp{ Σ_{k,i} yki αk + β1 Σ_{k,i} yki Xki } / ∏_{k} ∏_{i} [ 1 + exp{αk + β1Xki} ]

  = exp{ Σ_{k} tk αk + β1 tx } / ∏_{k} ∏_{i} [ 1 + exp{αk + β1Xki} ]

• Noting that the denominator is constant as a function of the yki, the sufficient statistic for αk is

tk = Σ_{i=1}^{Nk} yki

⋆ the observed number of events in the kth matched set

• The sufficient statistic for β is

tx = Σ_{k=1}^{K} Σ_{i=1}^{Nk} yki Xki

• Let Yk = {Yk1, ..., YkNk} and let Tk be the number of ‘events’ in the kth matched set:

Tk = Σ_{i=1}^{Nk} Yki

• Consider the joint distribution of the outcomes in the kth set, conditional on the number of successes:

P(Yk = yk | Tk = tk) = P(Yk = yk, Tk = tk) / P(Tk = tk)

• Note that

P(Tk = tk) = Σ_{yk : Tk = tk} ∏_{i=1}^{Nk} P(Yki = yki | Xki)

           = Σ_{yk : Tk = tk} exp{ tk αk + β1 Σi yki Xki } / ∏_{i} [ 1 + exp{αk + β1Xki} ]

⋆ sum over the possible ways in which the total number of events within set k could equal tk

• It follows that

P(Yk = yk | Tk = tk) = I{Tk = tk} × exp{ tk αk + β1 Σi yki Xki } / Σ_{uk : Tk = tk} exp{ tk αk + β1 Σi uki Xki }

                     = I{Tk = tk} × exp{ β1 Σi yki Xki } / Σ_{uk : Tk = tk} exp{ β1 Σi uki Xki }

• Taking the product over the K sets gives the conditional likelihood

LC(β1) = ∏_{k=1}^{K} I{Tk = tk} × exp{ β1 Σi yki Xki } / Σ_{uk : Tk = tk} exp{ β1 Σi uki Xki }

⋆ solely a function of β1
⋆ by conditioning on {T1, ..., TK}, the likelihood is no longer a function of the nuisance parameters

• Suppose there is only one case in each matched set and they occupy the first index:
⋆ Yk1 = 1, ∀k
⋆ Yki = 0, for i = 2, ..., Nk
⋆ Tk = 1, ∀k

• The (observed) conditional likelihood simplifies to

LC(β1) = ∏_{k=1}^{K} exp{β1Xk1} / Σ_{i} exp{β1Xki}

⋆ numerator gives the only way in which there could be one case, based on the observed data
⋆ denominator enumerates all the possible ways in which there could be exactly one case

Comments

• Matched sets with no exposure variability do not contribute to the likelihood
⋆ i.e., when all the Xki are the same within the set
⋆ for example, in a 1:1 matched study the contribution by any set for which Xk1 = Xk2 is

exp{β1Xk1} / ( exp{β1Xk1} + exp{β1Xk2} ) = 1/2

⋆ independent of β1, and so such sets do not contribute any information
⋆ base analyses solely on matched sets with discordant exposures

• A consequence of this is that we cannot use conditional logistic regression to learn about covariates used in the matching process

• Suppose, for example, we conducted a matched case-control study of the association between birth weight and infant mortality
  ⋆ matched on gender
  ⋆ within each matched set, there is no variation in gender
  ⋆ none of the matched sets can contribute to estimation of the 'gender effect'

• A benefit of having no variation within each set is that we do not have to worry about gender as a confounder
  ⋆ consider the interpretation of β_1 in the underlying logistic regression:

\[
\text{logit}\; P(Y_{ki} = 1 \mid X_{ki}) = \alpha_k + \beta_1 X_{ki}
\]

  ⋆ by holding 'matched set' constant, gender is implicitly held constant
  ⋆ the same applies for any covariate used in the matching process

• When we match on a continuous covariate, we have to be careful about residual confounding
  ⋆ suppose we matched on mothers' age, in 5-year age bands:
    ∗ ≤ 20
    ∗ 21-25
    ∗ ...
    ∗ 41-45
  ⋆ within any given set there will likely remain some variation in the mothers' exact age
  ⋆ hence, residual confounding

• The remedy is to include mothers' age as a covariate in the model
  ⋆ need to be careful with the interpretation of the age coefficient

Diabetes and acute MI

• Matched case-control study among Navajo Indians
  ⋆ 144 'cases' who experienced a myocardial infarction (MI)
  ⋆ matched to 144 'controls' who were free of heart disease
    ∗ 1:1 matching based on age and sex

• Recorded whether or not each participant had diabetes:

        Diabetes
    case   control      n
      0       0        82
      0       1        16
      1       0        37
      1       1         9
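For 1:1 matching with a binary exposure, the conditional MLE has a closed form: concordant sets drop out, and the estimated odds ratio is the ratio of the two discordant-pair counts. A quick check against the table (sketched in Python; the variable names are ours):

```python
from math import log

# discordant pairs from the table: case diabetic/control not, and vice versa
case_only = 37
control_only = 16

or_hat = case_only / control_only  # conditional MLE of the odds ratio
beta_hat = log(or_hat)             # log odds ratio
# or_hat = 2.3125; beta_hat is approximately 0.8383
```

These match the clogit() results reported below.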

> ##
> set <- rep(1:144, rep(2, 144))
> mi <- rep(c(1, 0), 144)
> diab <- c(rep(c(0,0), 82), rep(c(0,1), 16), rep(c(1,0), 37), rep(c(1,1), 9))
>
> ##
> navajo <- data.frame(set, mi, diab)
> navajo
    set mi diab
1     1  1    0
2     1  0    0
3     2  1    0
4     2  0    0
...
195  98  1    0
196  98  0    1
197  99  1    1
198  99  0    0
...
287 144  1    1
288 144  0    1

• In R, we can use the clogit() function to estimate parameters from a logistic regression using the conditional likelihood
  ⋆ in the survival package

> ##
> library(survival)
Loading required package: splines
> ?clogit
>
> fit0 <- clogit(mi ~ diab + strata(set), data=navajo)
> summary(fit0)
...
       coef exp(coef) se(coef)     z Pr(>|z|)
diab 0.8383    2.3125   0.2992 2.802  0.00508 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     exp(coef) exp(-coef) lower .95 upper .95
diab     2.312     0.4324     1.286     4.157
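The reported 95% interval is a Wald interval constructed on the log odds ratio scale and then exponentiated; a quick check (Python for illustration, using the point estimate and standard error from the output above):

```python
from math import exp

beta_hat, se = 0.8383, 0.2992  # coef and se(coef) from summary(fit0)
lower = exp(beta_hat - 1.96 * se)
upper = exp(beta_hat + 1.96 * se)
# lower is approximately 1.286 and upper approximately 4.157,
# matching the reported 95% confidence interval
```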

• Interpretation of θ̂ = 2.31: the odds of MI for participants with diabetes are estimated to be 2.31 times the odds for participants without diabetes, holding the matching factors (age and sex) constant

• Q: What do we get if we ignore the matching?
  ⋆ fit a logistic regression model with a single intercept

> ##
> fitBad <- glm(mi ~ diab, data=navajo, family=binomial())
> getCI(fitBad)
            exp{beta} lower upper
(Intercept)      0.82  0.63  1.08
diab             2.23  1.28  3.89

• Ignoring the matching will, in general, bias the estimated odds ratio towards the null
  ⋆ most evident when matching is on important predictors of the outcome
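The attenuation can be seen directly from the pair counts: pooling the matched pairs into a crude 2x2 table and taking the usual cross-product odds ratio reproduces the unconditional estimate (a quick check, sketched in Python):

```python
# pooled 2x2 table, ignoring the pairing
diab_cases = 37 + 9         # cases with diabetes
diab_controls = 16 + 9      # controls with diabetes
nondiab_cases = 82 + 16     # cases without diabetes
nondiab_controls = 82 + 37  # controls without diabetes

crude_or = (diab_cases * nondiab_controls) / (diab_controls * nondiab_cases)
# approximately 2.23, attenuated relative to the matched estimate of 2.31
```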
