HST 190: Introduction to Biostatistics

Lecture 7: Logistic regression

Logistic regression

• We’ve previously discussed methods for predicting continuous outcomes
§ Functionally, predicting the mean outcome at particular covariate levels
• What if we want to predict values for a dichotomous outcome, instead of a continuous one?
§ This corresponds to predicting the probability of the outcome variable being a “1” versus “0”
• Can we just use linear regression for a 0-1 outcome variable?

• Consider modeling the probability that a person receives a physical exam in a given year as a function of income.
§ A sample of individuals is collected. Each individual reports income and whether he/she went to the doctor last year.

patient #   y = checkup   x = income
1           1             32,000
2           0             28,000
3           0             41,000
4           1             38,000
etc.

• Plotting this data and fitting a linear regression line, we see that the linear model is not tailored to this type of outcome
§ For example, an income of $500,000 yields a predicted probability of visiting the doctor greater than 1!

The logit transformation

• To overcome the problem, we define the logit transformation: if $0 < p < 1$, then $\text{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)$
§ Notice that as $p$ ↑ 1, $\text{logit}(p)$ ↑ ∞, and as $p$ ↓ 0, $\text{logit}(p)$ ↓ −∞
• Thus, $\text{logit}(p)$ can take any continuous value, so we will fit a linear model on this transformed outcome instead
• Write this type of model generally as

$g(E[y]) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$
§ where $E[y] = 1 \cdot P(y=1) + 0 \cdot P(y=0) = P(y=1) = p$ and $g(E[y]) = \text{logit}(p)$
§ This model is called a logistic regression model or a logit model
§ By comparison, the linear regression model takes $g(E[y]) = E[y]$
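
A minimal sketch of fitting such a model in Python with statsmodels is below; the package choice, simulated data, and column names (checkup, income, age, gender) are illustrative assumptions, not part of the lecture. The simulated data frame `df` is reused in later sketches.

```python
# Sketch: fit logit(p) = beta0 + beta1*income by maximum likelihood on
# simulated checkup data (all numbers and names below are made up).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
income = rng.normal(40_000, 10_000, n)
age = rng.integers(30, 70, n)
gender = rng.choice(["M", "F"], n)

# true model used only to simulate the 0/1 outcome
eta = -4 + 0.00008 * income + 0.01 * age + 0.5 * (gender == "F")
checkup = rng.binomial(1, 1 / (1 + np.exp(-eta)))
df = pd.DataFrame({"checkup": checkup, "income": income, "age": age, "gender": gender})

fit = smf.logit("checkup ~ income", data=df).fit()
print(fit.params)   # estimated beta0 (intercept) and beta1 (income)
```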

• A key benefit of fitting a logit model rather than contingency-table methods is the ability to adjust for multiple covariates (including continuous covariates) simultaneously

patient #   y = checkup   income   age   gender
1           1             32,000   60    F
2           0             28,000   53    M
3           0             41,000   45    M
4           1             38,000   40    F
etc.

• To interpret the parameters, compare the model fit for a man and a woman of the same income and age

§ Woman (gender = 1): $\text{logit}(p_F) = \beta_0 + \beta_1\,\text{income} + \beta_2\,\text{age} + \beta_3$

§ Man (gender = 0): $\text{logit}(p_M) = \beta_0 + \beta_1\,\text{income} + \beta_2\,\text{age}$

⇒ $\text{logit}(p_F) - \text{logit}(p_M) = \beta_3$
$\Leftrightarrow \ln\!\left(\frac{p_F}{1-p_F}\right) - \ln\!\left(\frac{p_M}{1-p_M}\right) = \beta_3$
$\Leftrightarrow \ln\!\left(\frac{p_F/(1-p_F)}{p_M/(1-p_M)}\right) = \beta_3$
$\Leftrightarrow \frac{p_F/(1-p_F)}{p_M/(1-p_M)} = e^{\beta_3} = $ ratio of odds

• So, $\beta_3$ is the log of the odds ratio for getting a checkup between women and men, adjusting for age and income
• This result holds for any dichotomous variable in the model
§ This allows us to estimate the odds ratio for a given exposure’s association with disease in a regression, accounting for the effects of other variables

• In a logistic regression $\text{logit}(p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, denote the fitted parameter estimates as $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$

• If $x_j$ is a dichotomous exposure, then the estimated odds ratio relating this exposure to the outcome is $\widehat{\text{OR}} = e^{\hat\beta_j}$

• If instead $x_j$ is a continuous exposure, then the above odds ratio and CI describe the outcome’s association with a one-unit increase in the exposure, adjusting for other covariates
§ e.g., “a one-unit increase in age is associated on average with an $e^{\hat\beta_j}$-fold change in the odds of getting a checkup, holding gender constant.”
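
As an illustration of how these quantities come out of software, the sketch below exponentiates the fitted coefficients from a multivariable fit; it reuses the simulated data frame `df` from the earlier sketch, and the variable names are hypothetical.

```python
# Sketch: adjusted odds ratios from a multivariable logistic regression.
# Reuses the simulated DataFrame `df` built in the earlier sketch.
import numpy as np
import statsmodels.formula.api as smf

fit = smf.logit("checkup ~ income + age + gender", data=df).fit()

# OR_j = exp(beta_j): the gender coefficient gives the adjusted odds ratio for
# getting a checkup between the two gender groups; the age coefficient gives
# the OR per one-year increase, holding the other covariates constant.
print(np.exp(fit.params))
```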

Hypothesis testing and confidence intervals

• For an estimated coefficient $\hat\beta_j$ in a logistic model, the corresponding $100(1-\alpha)\%$ CI is given by $\left(\hat\beta_j - z_{1-\alpha/2}\,\text{se}(\hat\beta_j),\; \hat\beta_j + z_{1-\alpha/2}\,\text{se}(\hat\beta_j)\right)$

§ Matlab or other software will provide both $\hat\beta_j$ and $\text{se}(\hat\beta_j)$
§ Take note of whether you are given $\hat\beta_j$ or $\widehat{\text{OR}} = e^{\hat\beta_j}$ in software output! This differs between programs
• Testing the hypothesis $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$ is a z-test that is typically provided as part of software output

§ If the null is true, $z = \hat\beta_j / \text{se}(\hat\beta_j)$ is approximately $N(0,1)$
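
A short sketch of pulling these pieces out of software output, again reusing the simulated `df`; statsmodels is an assumed choice of software.

```python
# Sketch: Wald z-tests and confidence intervals for the fitted coefficients.
# Reuses the simulated DataFrame `df` from the earlier sketch.
import numpy as np
import statsmodels.formula.api as smf

fit = smf.logit("checkup ~ income + age + gender", data=df).fit()

print(fit.params)              # beta_hat_j (log-odds scale, not ORs)
print(fit.bse)                 # se(beta_hat_j)
print(fit.params / fit.bse)    # z statistics for H0: beta_j = 0
print(fit.pvalues)             # two-sided p-values from the N(0,1) reference
print(fit.conf_int())          # 95% CIs for beta_j
print(np.exp(fit.conf_int()))  # the same CIs on the odds-ratio scale
```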

Interaction terms

• Like in linear regression, we can also incorporate interaction terms in a logistic regression model

$\text{logit}(p) = \beta_0 + \beta_1\,\text{income} + \beta_2\,\text{age} + \beta_3\,\text{gender} + \beta_{2:3}\,(\text{age} \times \text{gender})$

• $\beta_{2:3}$ captures the presence of an interaction effect or effect modification of gender by age
§ e.g., the gender effect on the probability of getting an annual checkup is greater among younger people
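
A sketch of fitting this interaction model via a model formula, reusing the simulated `df`; the formula syntax is a statsmodels/patsy convention, not from the slides.

```python
# Sketch: adding an age-by-gender interaction term.
# Reuses the simulated DataFrame `df` from the earlier sketch.
import statsmodels.formula.api as smf

# "age * gender" expands to age + gender + age:gender (the interaction term)
fit_int = smf.logit("checkup ~ income + age * gender", data=df).fit()
print(fit_int.params)
print(fit_int.pvalues)   # the z-test on age:gender tests for effect modification
```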

Model building for inference

• The techniques for variable selection in logistic regression are similar to those for linear regression
§ The biggest challenge is the lack of comparable visual fit diagnostics like residual plots
• When model building for studies of association between exposure and outcome, the focus is on including sources of confounding (i.e., external variables associated with both exposure and outcome)
• One strategy is to fit and report the following three models:
1) an unadjusted or minimally adjusted model
2) a model that includes ‘core’ confounders (‘primary’ model)

o clear indication from scientific knowledge and/or the literature

o consensus among investigators
3) a model that includes ‘core’ confounders plus any ‘potential’ confounders

o indication is less certain

Logistic regression in retrospective setting

• How do we interpret the intercept in $\text{logit}(p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$?
§ $\beta_0 = \ln\!\left(\frac{p_0}{1-p_0}\right)$ is the log odds of experiencing the outcome in the population among subjects with $x_1 = \cdots = x_k = 0$
§ Links the model to the absolute risk of the outcome in the population
• What happens to the logit model if our sampling design is case-control (or retrospective)?
§ That is, what if we sample based on outcome status?
§ e.g., sample 100 patients with a disease and 100 patients without
• Typically this setting artificially selects more cases than would arise naturally under cross-sectional or prospective sampling
§ so we cannot readily use the sample to describe the true probability of disease in the population

• Thus, we see that the intercept $\beta_0$ is no longer meaningful in a logistic regression using case-control sampled data
§ What about the other estimates?
• Recall we showed that using contingency tables to compute odds ratios was valid in both prospective and retrospective sampling designs
• It turns out that the same is true for the estimated coefficients in logistic regression! Just as before, $\widehat{\text{OR}} = e^{\hat\beta_j}$
§ all other inference (tests, CIs) is also the same
• The estimated odds ratio of the outcome between exposed and unexposed groups is the same even if the ‘absolute’ proportion of cases sampled is higher

Matched case-control designs

• To further increase the statistical efficiency of a study, researchers may create a matched case-control design
§ For every case sampled, one or more controls are selected based on similarity to the case

o each case with $M$ controls is called $1{:}M$ matching
§ The goal is to correct for potential confounding in the study design
§ e.g., match each case with a noncase of the same age and gender, resulting in the two groups having the same distributions of age and gender
• As with standard case-control, the analysis then measures association between an exposure of interest and the outcome
§ The exposure of interest is not a factor used for matching
• Matched designs balance the increased cost of matching each subject with higher power and potential for causal inference

Analyzing matched case-control designs

• Suppose the sample includes $n$ matched sets; how should we approach the analysis?
• Naïve approach: choose one matched set to be the ‘baseline,’ and include $n-1$ indicator variables, one for each other set
§ Essentially, treat each matched set as a level of a categorical variable
• Such a model forces us to estimate the effect of exposure within groups that may only have a few people in them
§ Unstable estimation
§ Cannot generalize estimated comparisons of specific pairs of people
• Instead, we want an analysis that estimates the exposure effect by aggregating across matched sets

Conditional logistic regression

• Instead, researchers use conditional logistic regression to estimate the effect of an exposure of interest, conditioning out the factors used to create the matched sets
• To illustrate, assume a matched pairs design. Let

§ $y_{i1} = 1$, $y_{i2} = 0$ be the disease indicators of the $i$th case-control pair (subject 1 is the case, subject 2 the control)

§ $x_{i1,1}, \dots, x_{i1,k}$ and $x_{i2,1}, \dots, x_{i2,k}$ be the covariates of the $i$th pair

o Does not include ‘matched on’ factors, which are accounted for in design • Then for each pair, define the conditional likelihood contribution

$L_i(\beta_1, \dots, \beta_k) = \frac{P(y_{i1}=1 \cap y_{i2}=0)}{P(y_{i1}=1 \cap y_{i2}=0) + P(y_{i1}=0 \cap y_{i2}=1)} = \frac{e^{\sum_j \beta_j x_{i1,j}}}{e^{\sum_j \beta_j x_{i1,j}} + e^{\sum_j \beta_j x_{i2,j}}}$

• Thus, we can compute estimates that maximize the conditional likelihood: $\hat{\beta} = \arg\max_{\beta} \sum_{i=1}^{n} \ln L_i(\beta_1, \dots, \beta_k)$

• If $\beta_j$ is the coefficient of the exposure of interest, then as before $\widehat{\text{OR}} = e^{\hat\beta_j}$
§ Standard methods for testing and CIs are all the same as before
• Note that because we already adjusted for factors used for matching, we do not get estimated effects for these factors
§ It would be inappropriate to include matching factors as covariates
• We also do not get an estimated intercept, which makes sense because the intercept is not interpretable in the case-control setting anyway
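
To make the conditional likelihood concrete, here is a small self-contained sketch that maximizes the matched-pair conditional log-likelihood numerically with scipy; the data (a single binary exposure in eight hypothetical pairs) are made up, and in practice one would use a packaged conditional logistic regression routine.

```python
# Sketch: maximize the matched-pair conditional log-likelihood
#   L_i(beta) = exp(x_case_i . beta) / (exp(x_case_i . beta) + exp(x_control_i . beta))
import numpy as np
from scipy.optimize import minimize

# exposure indicator for the case and the control in each of 8 hypothetical pairs
x_case    = np.array([[1.], [0.], [1.], [1.], [1.], [0.], [0.], [1.]])
x_control = np.array([[0.], [1.], [0.], [1.], [0.], [0.], [1.], [0.]])

def neg_cond_loglik(beta):
    eta_case = x_case @ beta
    eta_control = x_control @ beta
    # log L_i = eta_case_i - log(exp(eta_case_i) + exp(eta_control_i))
    return -np.sum(eta_case - np.logaddexp(eta_case, eta_control))

res = minimize(neg_cond_loglik, x0=np.zeros(1))
print(res.x, np.exp(res.x))   # estimated beta and the corresponding odds ratio
```

With a single binary exposure this reduces to the classic discordant-pair result: the estimated odds ratio equals the number of pairs where only the case is exposed divided by the number where only the control is exposed (here 4/2 = 2, so $\hat\beta \approx \ln 2$).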

Logistic regression modeling for prediction

• Using a fitted logistic regression model, we have so far focused on estimation and inference of associations between the covariates and the outcome in the form of odds ratios
• We can also predict individual probabilities of experiencing the outcome using the fitted model:

• From our model $\text{logit}(p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, we can rearrange to get a predicted probability $\hat p$. Writing $\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$:
$\ln\!\left(\frac{p}{1-p}\right) = \eta \;\Leftrightarrow\; \frac{p}{1-p} = e^{\eta} \;\Leftrightarrow\; p = e^{\eta} - p\,e^{\eta} \;\Leftrightarrow\; p(1 + e^{\eta}) = e^{\eta} \;\Leftrightarrow\; p = \frac{e^{\eta}}{1 + e^{\eta}}$
• Therefore, we see that our regression model leads to predicted probabilities
$\hat p = \frac{e^{\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k}}{1 + e^{\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k}}$
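
A sketch of computing these predicted probabilities in software, reusing the simulated `df`; the new covariate values and the dummy-variable name `gender[T.M]` are assumptions of this illustration.

```python
# Sketch: predicted probabilities from the fitted logistic model.
# Reuses the simulated DataFrame `df` from the earlier sketch.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

fit = smf.logit("checkup ~ income + age + gender", data=df).fit()

# predict() applies the inverse logit e^eta / (1 + e^eta) for us
new_people = pd.DataFrame({"income": [30_000, 60_000], "age": [45, 45], "gender": ["M", "F"]})
print(fit.predict(new_people))

# equivalent manual calculation for the first new person
b = fit.params
eta = b["Intercept"] + b["income"] * 30_000 + b["age"] * 45 + b["gender[T.M]"] * 1.0
print(np.exp(eta) / (1 + np.exp(eta)))
```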

• We may even want to predict individuals’ outcome status, using the model to predict whether or not they will experience the outcome
§ e.g., build a risk prediction model to predict who might develop a disease
• Just as in the linear regression case, prediction introduces important considerations of variable selection, model selection, and prediction validation

Variable selection for prediction

• Variable selection for prediction is also similar to the linear regression setting, and can use similar techniques:
1) Fixed set by design (treatment indicator + background variables)
2) Fit all possible subsets of models and find the one that fits best according to some criterion:
§ AIC or BIC
§ Predictive performance by (cross-validated) AUC
3) Sequential: forward / backward / stepwise selection
4) Regularized/penalized regression methods (see the sketch below)
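
Strategy 4 can be sketched with an L1-penalized (lasso) logistic regression, which shrinks some coefficients exactly to zero and so performs selection automatically; scikit-learn and the settings below are illustrative choices, and the sketch reuses the simulated `df` from the earlier sketches.

```python
# Sketch: L1-penalized logistic regression as a variable-selection device.
# Reuses the simulated DataFrame `df` from the earlier sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = np.column_stack([df["income"], df["age"], (df["gender"] == "F").astype(float)])
X = StandardScaler().fit_transform(X)   # penalties are scale-sensitive
y = df["checkup"].to_numpy()

lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso_logit.fit(X, y)
print(lasso_logit.coef_)   # coefficients shrunk to zero are dropped from the model
```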

Model selection criteria

• Logistic regression models are fit by maximum likelihood, so if two models have the same number of parameters, choose the one with the higher final likelihood value
§ Similar to linear regression, aim to balance the final likelihood ($L$) and the number of parameters ($k$).
§ General form of any criterion: $f(L) + g(k)$

• Same criteria used for linear regression available here:
§ Akaike’s Information Criterion: $\text{AIC} = -2\ln(L) + 2k$
§ Bayesian Information Criterion: $\text{BIC} = -2\ln(L) + k\log(n)$
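
A brief sketch of comparing candidate models by these criteria, reusing the simulated `df`; statsmodels exposes the AIC and BIC of each fit.

```python
# Sketch: compare nested candidate models by AIC and BIC (lower is better).
# Reuses the simulated DataFrame `df` from the earlier sketch.
import statsmodels.formula.api as smf

candidates = ["checkup ~ income",
              "checkup ~ income + age",
              "checkup ~ income + age + gender"]

for formula in candidates:
    fit = smf.logit(formula, data=df).fit(disp=0)   # disp=0 silences iteration output
    # AIC = -2 ln(L) + 2k, BIC = -2 ln(L) + k ln(n)
    print(formula, round(fit.aic, 1), round(fit.bic, 1))
```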

Binary prediction validation

• To measure predictive performance for binary outcomes, one approach returns to our discussion of diagnostic testing
• Recall, in diagnostic testing it is important to balance
§ Correctly testing positive for true disease cases (‘sensitivity’)
§ Correctly testing negative for true non-cases (‘specificity’)

                  Test positive (T+)                 Test negative (T−)
Disease (D+)      True Positive (TP): P(D+ ∩ T+)     False Negative (FN): P(D+ ∩ T−)
No Disease (D−)   False Positive (FP): P(D− ∩ T+)    True Negative (TN): P(D− ∩ T−)

• Binary prediction is nearly identical, where instead of ‘testing’ we are ‘predicting’ disease status
§ We want to correctly predict disease in true cases, and correctly predict no disease in true non-cases
• How do we convert predicted individual probabilities $\hat p_i$ into discrete ‘case’ or ‘non-case’ predictions?

§ We must choose an arbitrary cutoff value, e.g., “if $\hat p_i > 0.5$ then the $i$th individual is predicted to be a case”
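
A self-contained sketch of applying a cutoff and tallying sensitivity and specificity against known outcomes; all of the numbers below are made up for illustration.

```python
# Sketch: convert predicted probabilities to case / non-case predictions at a
# cutoff, then compute sensitivity and specificity against known status.
import numpy as np

p_hat = np.array([0.92, 0.81, 0.75, 0.60, 0.44, 0.30, 0.15, 0.05])  # predicted P(case)
truth = np.array([1,    1,    1,    0,    1,    0,    0,    0])     # known status

cutoff = 0.5
pred = (p_hat > cutoff).astype(int)

tp = np.sum((pred == 1) & (truth == 1))
fn = np.sum((pred == 0) & (truth == 1))
fp = np.sum((pred == 1) & (truth == 0))
tn = np.sum((pred == 0) & (truth == 0))

sensitivity = tp / (tp + fn)   # P(predicted case | true case)
specificity = tn / (tn + fp)   # P(predicted non-case | true non-case)
print(sensitivity, specificity)
```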

• How do we choose the cutoff value for the predicted probability?
• For a group with KNOWN disease status, let’s list some possible cutoff values. First, we’ll see how many disease vs. no disease patients fall on either side of each cutoff…

Predicted probability cutoff for predicted ‘case’:
                                 0.01   0.20   0.40   0.60   0.80   0.99
% patients with    Disease       0.95   0.87   0.73   0.54   0.34   0.17
value ≥ cut-off    No Disease    0.91   0.68   0.38   0.12   0.02   0.002

• Several important relationships here:

§ $P(\hat p_i \geq \text{cutoff} \mid \text{patient } i \text{ is a case}) = \text{sensitivity}$

§ $P(\hat p_i \geq \text{cutoff} \mid \text{patient } i \text{ is a noncase}) = 1 - \text{specificity}$
• We can summarize the testing possibilities by plotting sensitivity vs. (1 − specificity)…


Predicted probability cutoff for ‘case’:
                   0.01   0.20   0.40   0.60   0.80   0.99
Sensitivity        0.95   0.87   0.73   0.54   0.34   0.17
1 − Specificity    0.91   0.68   0.38   0.12   0.02   0.002


• A receiver operating characteristic (ROC) curve for a test is a plot of sensitivity vs. (1 − specificity)
§ A test’s ROC curve helps us choose an optimal cutoff point. It also shows us how useful a test is overall.

• The Area Under the Curve (AUC) is a single number summarizing a test’s ability to discriminate between true positives and true negatives.
§ AUC is the probability, for a randomly chosen case and noncase, that the case will have the higher predicted probability; 0.5 is a ‘coin toss’
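
A sketch of tracing out ROC points over a grid of cutoffs and computing AUC directly from its probability interpretation; it uses the same made-up predicted probabilities and known statuses as the previous sketch.

```python
# Sketch: ROC points over a grid of cutoffs, and AUC as
# P(randomly chosen case has a higher predicted probability than a noncase).
import numpy as np

p_hat = np.array([0.92, 0.81, 0.75, 0.60, 0.44, 0.30, 0.15, 0.05])
truth = np.array([1,    1,    1,    0,    1,    0,    0,    0])

cutoffs = np.array([0.01, 0.20, 0.40, 0.60, 0.80, 0.99])
sens = np.array([np.mean(p_hat[truth == 1] >= c) for c in cutoffs])
one_minus_spec = np.array([np.mean(p_hat[truth == 0] >= c) for c in cutoffs])
# plotting sens against one_minus_spec (one point per cutoff) traces the ROC curve

cases, noncases = p_hat[truth == 1], p_hat[truth == 0]
auc = np.mean(cases[:, None] > noncases[None, :])   # pairwise comparisons; ignores ties
print(auc)
```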

Cross-validation

• Cross-validation extends to the logistic regression setting, using prediction metrics like AUC

• The available data set is divided into two (or three) random parts.
§ Training set is used to fit the model.
§ Test set is used to check the predictive capability (e.g., AUC) and refine the model. Go back to training if needed.
§ Optional: Validation set used once to estimate the model’s true AUC.
• If the dataset is smaller or you do not want to set aside data, you can still estimate AUC using $k$-fold cross-validation

• In $k$-fold cross-validation, the data is split into $k$ groups, and the model is repeatedly fit on all but one group; its ability to predict the left-out group is then recorded
§ The average AUC over all $k$ groups estimates predictive performance on a ‘new’ dataset
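
A sketch of estimating out-of-sample AUC by 5-fold cross-validation; scikit-learn and the simulated two-covariate data are illustrative assumptions, not part of the slides.

```python
# Sketch: k-fold (here 5-fold) cross-validated AUC for a logistic model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400
X = np.column_stack([rng.normal(40_000, 10_000, n), rng.integers(30, 70, n)])
y = rng.binomial(1, 1 / (1 + np.exp(-(-2 + 0.00004 * X[:, 0] + 0.01 * X[:, 1]))))

model = make_pipeline(StandardScaler(), LogisticRegression())
fold_aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(fold_aucs, fold_aucs.mean())   # average over folds estimates AUC on new data
```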
