HST 190: Introduction to Biostatistics

Lecture 7: Logistic regression

Logistic regression
• We’ve previously discussed linear regression methods for predicting continuous outcomes
  § Functionally, predicting the mean at particular covariate levels
• What if we want to predict values for a dichotomous categorical variable, instead of a continuous one?
  § This corresponds to predicting the probability of the outcome variable being a “1” versus “0”
• Can we just use linear regression for a 0-1 outcome variable?

• Consider modeling the probability that a person receives a physical exam in a given year as a function of income.
  § A sample of individuals is collected. Each individual reports income and whether he/she went to the doctor last year.

  patient #   y = checkup   x = income
  1           1             32,000
  2           0             28,000
  3           0             41,000
  4           1             38,000
  etc.

• Plotting this data and fitting a linear regression line, we see that the linear model is not tailored to this type of outcome
  § For example, an income of $500,000 yields a predicted probability of visiting the doctor greater than 1!
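The problem with a straight-line fit can be sketched in a few lines of Python. This is only an illustration: it uses the four rows shown in the table (the full sample is not given in these notes), and the helper names `fit_line` and `predict` are made up for this example.

```python
# Only the four (income, checkup) rows shown on the slide; the full sample is not available.
incomes = [32_000, 28_000, 41_000, 38_000]
checkups = [1, 0, 0, 1]


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(incomes, checkups)


def predict(income):
    """Fitted 'probability' of a checkup under the straight-line model."""
    return intercept + slope * income


# Nothing constrains the line to the interval [0, 1].
print(predict(500_000))
```

With these four rows the fitted value at an income of $500,000 is well above 1, which cannot be a probability. A different sample would give a different line, but an unbounded linear predictor always leaves [0, 1] for extreme enough covariate values.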

Logit transformation
• To overcome the problem, we define the logit transformation: if $0 < p < 1$, then $\text{logit}(p) = \ln\left(\frac{p}{1-p}\right)$
  § Notice that as $p \uparrow 1$, $\text{logit}(p) \uparrow \infty$, and as $p \downarrow 0$, $\text{logit}(p) \downarrow -\infty$
• Thus, $\text{logit}(p)$ can take any continuous value, so we will fit a linear model on this transformed outcome instead
• Write this type of model generally as $g(E(y)) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$
  § Where $E(y) = 1 \cdot P(y = 1) + 0 \cdot P(y = 0) = P(y = 1) = p$ and $g(E(y)) = \text{logit}(E(y))$
  § This model is called a logistic regression model or a logit model
  § By comparison, the linear regression model takes $g(E(y)) = E(y)$

• A key benefit of fitting a logit model rather than using contingency table methods is the ability to adjust for multiple covariates (including continuous covariates) simultaneously

  patient #   y = checkup   income   age   gender
  1           1             32,000   60    F
  2           0             28,000   53    M
  3           0             41,000   45    M
  4           1             38,000   40    F
  etc.

• To interpret parameters, compare the fit for a man and a woman with the same age and income
  § $\text{logit}(p_{\text{woman}}) = \alpha + \beta_{\text{age}} x_{\text{age}} + \beta_{\text{income}} x_{\text{income}} + \beta_{\text{woman}}$
  § $\text{logit}(p_{\text{man}}) = \alpha + \beta_{\text{age}} x_{\text{age}} + \beta_{\text{income}} x_{\text{income}}$

  $\Rightarrow \text{logit}(p_{\text{man}}) = \text{logit}(p_{\text{woman}}) - \beta_{\text{woman}}$
  $\iff \ln\left(\frac{p_{\text{woman}}}{1 - p_{\text{woman}}}\right) - \ln\left(\frac{p_{\text{man}}}{1 - p_{\text{man}}}\right) = \beta_{\text{woman}}$
  $\iff \ln\left(\frac{p_{\text{woman}}\,(1 - p_{\text{man}})}{p_{\text{man}}\,(1 - p_{\text{woman}})}\right) = \beta_{\text{woman}}$
  $\iff \frac{p_{\text{woman}}\,(1 - p_{\text{man}})}{p_{\text{man}}\,(1 - p_{\text{woman}})} = \frac{\text{odds}_{\text{woman}}}{\text{odds}_{\text{man}}} = e^{\beta_{\text{woman}}}$

• So, $\beta_{\text{woman}}$ is the log of the odds ratio for getting a checkup, comparing women to men, adjusting for age and income
• This result holds for any dichotomous variable in the model
  § This allows us to estimate the odds ratio relating a given exposure to disease in a regression, accounting for the effects of other variables

• In a logistic regression $\text{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$, denote the fitted parameter estimates as $\hat\alpha, \hat\beta_1, \ldots, \hat\beta_k$
• If $x_j$ is a dichotomous exposure, then the estimated odds ratio relating this exposure to the outcome is $\widehat{OR} = e^{\hat\beta_j}$
• If instead $x_j$ is a continuous exposure, then the above odds ratio and CI describe the
outcome’s association with a one-unit increase in the exposure, adjusting for other covariates
  § e.g., “a one unit increase in age is associated on average with an $e^{\hat\beta_{\text{age}}}$-fold change in the odds of getting a checkup, holding gender constant.”

Hypothesis testing and confidence intervals
• For an estimated coefficient $\hat\beta_j$ in a logistic model, the corresponding $100(1 - \alpha)\%$ CI for the odds ratio $e^{\beta_j}$ is given by
  $\left( e^{\hat\beta_j - z_{1-\alpha/2}\,\text{se}(\hat\beta_j)},\; e^{\hat\beta_j + z_{1-\alpha/2}\,\text{se}(\hat\beta_j)} \right)$
  (here $\alpha$ is the significance level, not the model intercept)
  § Matlab or other software will provide both $\hat\beta_j$ and $\text{se}(\hat\beta_j)$
  § Take note of whether you are given $\hat\beta_j$ or $\widehat{OR} = e^{\hat\beta_j}$ in software output! This differs between programs
• Testing the hypothesis $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$ is a z-test that is typically provided as part of software output
  § If the null is true, $z = \hat\beta_j / \text{se}(\hat\beta_j)$ is approximately $N(0, 1)$

Interaction terms
• As in linear regression, we can also incorporate interaction terms in a logistic regression model
  $\text{logit}(p) = \alpha + \beta_{\text{age}} x_{\text{age}} + \beta_{\text{income}} x_{\text{income}} + \beta_{\text{woman}} x_{\text{woman}} + \beta_{\text{age:woman}} x_{\text{age}} x_{\text{woman}}$
• $\beta_{\text{age:woman}}$ captures the presence of an interaction effect, or effect modification, of gender by age
  § e.g., the gender effect on the probability of getting an annual checkup is greater among younger people

Model building for inference
• The techniques for variable selection in logistic regression are similar to those for linear regression
  § The biggest challenge is the lack of comparable visual fit diagnostics like residual plots
• When model building for studies of association between exposure and outcome, the focus is on including sources of confounding (i.e., external variables associated with both exposure and outcome)
• One strategy is to fit and report the following three models:
  1) an unadjusted or minimally adjusted model
  2) a model that includes ‘core’ confounders (‘primary’ model)
     o clear indication from scientific knowledge and/or the literature
     o consensus among investigators
  3) a model that includes
‘core’ confounders plus any ‘potential’ confounders
     o indication is less certain

Logistic regression in retrospective setting
• How do we interpret the intercept in $\text{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$?
  § $\alpha = \log\left(\frac{p}{1-p}\right)$ is the log odds of experiencing the outcome in the population among subjects with $x_1 = \cdots = x_k = 0$
  § Links the model to the absolute prevalence of the outcome in the population
• What happens to the logit model if our sampling is case-control (or retrospective)?
  § That is, what if we sample based on outcome status?
  § e.g., sample 100 patients with a disease and 100 patients without
• Typically this setting artificially selects more cases than would arise naturally under cross-sectional or prospective sampling
  § so we cannot readily use the sample to describe the true probability of disease in the population

• Thus, we see that the intercept $\alpha$ is no longer meaningful in a logistic regression using case-control sampled data
  § What about the other estimates?
• Recall we showed that using contingency tables to compute odds ratios was valid both in prospective and retrospective sampling designs
• It turns out that the same is true for the estimated coefficients in logistic regression!
  Just as before, $\widehat{OR} = e^{\hat\beta_j}$
  § all other inference (tests, CIs) is also the same
• The estimated odds ratio of the outcome between exposed and unexposed groups is the same even if the ‘absolute’ proportion of cases sampled is higher

Matched case-control designs
• To further increase the statistical efficiency of a study, researchers may create a matched case-control design
  § For every case sampled, one or more controls are selected based on similarity to the case
    o Matching each case with $m$ controls is called $1{:}m$ matching
  § Goal is to correct for potential confounding in the study design
  § e.g., match each case with a noncase of the same age and gender, resulting in two groups having the same distributions of age and gender
• As with standard case-control, analysis then measures association between an exposure of interest and the outcome
  § Exposure of interest is not a factor used for matching
• Matched designs trade the increased cost of matching each subject for higher power and potential for causal inference

Analyzing matched case-control designs
• Suppose the sample includes $n$ matched sets. How should we approach analysis?
• Naïve approach: choose one matched set to be ‘baseline,’ and include $n - 1$ indicator variables, one for each other set
  § Essentially, treat each matched set as a level of a categorical variable
• Such a model forces us to estimate the effect of exposure within groups that may only have a few people in them
  § Unstable estimation
  § Cannot generalize estimated comparisons of specific pairs of people
• Instead, we want an analysis that estimates the exposure effect by aggregating across matched sets

Conditional logistic regression
• Instead, researchers use conditional logistic regression to estimate the effect of an exposure of interest, conditioning out the factors used to create the matched sets
• To illustrate, assume a matched pairs design.
  Let
  § $y_{i1} = 1, y_{i2} = 0$ be the disease indicators of the $i$th case-control pair
  § $x_{i11}, \ldots, x_{i1k}, x_{i21}, \ldots, x_{i2k}$ be the covariates of the $i$th pair
    o Does not include ‘matched on’ factors, which are accounted for in design
• Then for each pair, define the conditional likelihood contribution
  $L_i(\beta_1, \ldots, \beta_k) = \frac{P(y_{i1} = 1 \cap y_{i2} = 0)}{P(y_{i1} = 1 \cap y_{i2} = 0) + P(y_{i1} = 0 \cap y_{i2} = 1)} = \frac{e^{\sum_{j=1}^{k} \beta_j x_{i1j}}}{e^{\sum_{j=1}^{k} \beta_j x_{i1j}} + e^{\sum_{j=1}^{k} \beta_j x_{i2j}}}$

• Thus, we can compute estimates that maximize the conditional likelihood, i.e., the product of the pair contributions:
  $\hat\beta = \arg\max_{\beta} \prod_{i=1}^{n} L_i(\beta_1, \ldots, \beta_k)$
  (equivalently, maximize the sum of the $\log L_i$)
• If $\beta_j$ is the coefficient of the exposure of interest, then as before $\widehat{OR} = e^{\hat\beta_j}$
  § Standard methods for testing and CIs are all the same as before
• Note that because we already adjusted for the factors used for matching, we do not get estimated effects for these factors
  § It would be inappropriate to include matching factors as covariates
• We also do not get an estimated intercept, which makes sense because the intercept is not interpretable in the case-control setting anyway

Logistic regression modeling for prediction
• Using a fitted logistic regression model, we so far have focused on estimation and inference of associations
