HST 190: Introduction to Biostatistics

Lecture 7: Logistic regression

Logistic regression

• We’ve previously discussed linear regression methods for predicting continuous outcomes
  § Functionally, predicting the mean at particular covariate levels
• What if we want to predict values for a dichotomous categorical variable, instead of a continuous one?
  § This corresponds to predicting the probability of the outcome variable being a “1” versus “0”
• Can we just use linear regression for a 0-1 outcome variable?

• Consider modeling the probability that a person receives a physical exam in a given year as a function of income.
  § A sample of individuals is collected. Each individual reports income and whether he/she went to the doctor last year.

  patient #   y = checkup   x = income
  1           1             32,000
  2           0             28,000
  3           0             41,000
  4           1             38,000
  etc.

• Plotting this data and fitting a linear regression line, we see that the linear model is not tailored to this type of outcome
  § For example, an income of $500,000 yields a predicted probability of visiting the doctor greater than 1!
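To make the out-of-range problem concrete, here is a minimal sketch in Python. The eight (income, checkup) pairs are made-up illustrative values, not the lecture's data, and `fit_line` is a hypothetical helper that implements ordinary least squares for a single predictor.

```python
# Hypothetical toy data (NOT from the lecture): income in dollars, checkup as 0/1.
incomes = [20_000, 30_000, 40_000, 50_000, 60_000, 70_000, 80_000, 90_000]
checkups = [0, 0, 0, 1, 0, 1, 1, 1]


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


intercept, slope = fit_line(incomes, checkups)

# The fitted "probability" is just a point on a straight line, so nothing
# keeps it inside the interval [0, 1].
pred_low = intercept + slope * 20_000
pred_high = intercept + slope * 500_000

print(f"predicted P(checkup) at $20,000:  {pred_low:.3f}")
print(f"predicted P(checkup) at $500,000: {pred_high:.3f}")
```

With these toy values the fitted line falls below 0 at the lowest income and far above 1 at $500,000, which is the behavior the logit transformation is meant to rule out.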
Logit transformation

• To overcome the problem, we define the logit transformation: if $0 < p < 1$, then $\operatorname{logit}(p) = \ln\frac{p}{1-p}$
  § Notice that as $p \uparrow 1$, $\operatorname{logit}(p) \uparrow \infty$, and as $p \downarrow 0$, $\operatorname{logit}(p) \downarrow -\infty$
• Thus, $\operatorname{logit}(p)$ can take any real value, so we fit a linear model to this transformed outcome instead
• Write this type of model generally as $g(E(y)) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$
  § Where $E(y) = 1 \cdot P(y = 1) + 0 \cdot P(y = 0) = P(y = 1) = p$ and $g(E(y)) = \operatorname{logit}(E(y))$
  § This model is called a logistic regression model or a logit model
  § By comparison, the linear regression model takes $g(E(y)) = E(y)$

• A key benefit of fitting a logit model rather than using contingency table methods is the ability to adjust for multiple covariates (including continuous covariates) simultaneously

  patient #   y = checkup   income   age   gender
  1           1             32,000   60    F
  2           0             28,000   53    M
  3           0             41,000   45    M
  4           1             38,000   40    F
  etc.

• To interpret parameters, compare the fit for a woman and a man with the same age and income
  § $\operatorname{logit}(p_{\text{woman}}) = \alpha + \beta_{\text{age}} x_{\text{age}} + \beta_{\text{income}} x_{\text{income}} + \beta_{\text{woman}}$
  § $\operatorname{logit}(p_{\text{man}}) = \alpha + \beta_{\text{age}} x_{\text{age}} + \beta_{\text{income}} x_{\text{income}}$

$$
\begin{aligned}
&\Rightarrow \operatorname{logit}(p_{\text{man}}) = \operatorname{logit}(p_{\text{woman}}) - \beta_{\text{woman}} \\
&\iff \ln\frac{p_{\text{woman}}}{1 - p_{\text{woman}}} - \ln\frac{p_{\text{man}}}{1 - p_{\text{man}}} = \beta_{\text{woman}} \\
&\iff \ln\frac{p_{\text{woman}}\,(1 - p_{\text{man}})}{p_{\text{man}}\,(1 - p_{\text{woman}})} = \beta_{\text{woman}} \\
&\iff \frac{p_{\text{woman}}\,(1 - p_{\text{man}})}{p_{\text{man}}\,(1 - p_{\text{woman}})} = \frac{\text{odds}_{\text{woman}}}{\text{odds}_{\text{man}}} = e^{\beta_{\text{woman}}}
\end{aligned}
$$

• So, $\beta_{\text{woman}}$ is the log of the odds ratio for getting a checkup, comparing women to men, adjusting for age and income
• This result holds for any dichotomous variable in the model
  § This allows us to estimate the odds ratio relating a given exposure to disease in a regression, accounting for the effects of other variables

• In a logistic regression $\operatorname{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$, denote the fitted parameter estimates as $\hat\alpha, \hat\beta_1, \dots, \hat\beta_k$
• If $x_j$ is a dichotomous exposure, then the estimated odds ratio relating this exposure to the outcome is $\widehat{OR} = e^{\hat\beta_j}$
• If instead $x_j$ is a continuous exposure, then this odds ratio (and its CI) describes the outcome’s association with a one-unit increase in the exposure, adjusting for other covariates
  § e.g., “a one-unit increase in age is associated on average with an $e^{\hat\beta_{\text{age}}}$-fold change in the odds of getting a checkup, holding gender constant.”

Hypothesis testing and confidence intervals

• For an estimated coefficient $\hat\beta_j$ in a logistic model, the corresponding $100(1 - \alpha)\%$ CI for the odds ratio $e^{\beta_j}$ is given by
  $\left( e^{\hat\beta_j - z_{1-\alpha/2}\,\widehat{se}(\hat\beta_j)},\; e^{\hat\beta_j + z_{1-\alpha/2}\,\widehat{se}(\hat\beta_j)} \right)$
  (here $\alpha$ is the significance level, not the model intercept)
  § Matlab or other software will provide both $\hat\beta_j$ and $\widehat{se}(\hat\beta_j)$
  § Take note of whether you are given $\hat\beta_j$ or $\widehat{OR} = e^{\hat\beta_j}$ in software output! This differs between programs
• Testing the hypothesis $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$ is a z-test that is typically provided as part of software output
  § If the null is true, $z = \hat\beta_j / \widehat{se}(\hat\beta_j)$ is approximately $N(0, 1)$

Interaction terms

• As in linear regression, we can also incorporate interaction terms in a logistic regression model
  $\operatorname{logit}(p) = \alpha + \beta_{\text{age}} x_{\text{age}} + \beta_{\text{income}} x_{\text{income}} + \beta_{\text{woman}} x_{\text{woman}} + \beta_{\text{age:woman}} x_{\text{age}} x_{\text{woman}}$
• $\beta_{\text{age:woman}}$ captures the presence of an interaction effect, or effect modification of gender by age
  § e.g., the gender effect on the probability of getting an annual checkup is greater among younger people

Model building for inference

• The techniques for variable selection in logistic regression are similar to those for linear regression
  § Biggest challenge is the lack of comparable visual fit diagnostics like residual plots
• When model building for studies of association between exposure and outcome, the focus is on including sources of confounding (i.e., external variables associated with both exposure and outcome)
• One strategy is to fit and report the following three models:
  1) an unadjusted or minimally adjusted model
  2) a model that includes ‘core’ confounders (‘primary’ model)
     o clear indication from scientific knowledge and/or the literature
     o consensus among investigators
  3) a model that includes ‘core’ confounders plus any ‘potential’ confounders
     o indication is less certain

Logistic regression in retrospective setting

• How do we interpret the intercept in $\operatorname{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$?
  § $\alpha = \log\frac{p}{1-p}$ is the log odds of experiencing the outcome in the population among subjects with $x_1 = \cdots = x_k = 0$
  § Links the model to the absolute prevalence of the outcome in the population
• What happens to the logit model if our sampling is case-control (or retrospective)?
  § That is, what if we sample based on outcome status?
  § e.g., sample 100 patients with a disease and 100 patients without
• Typically this setting artificially selects more cases than would arise naturally under cross-sectional or prospective sampling
  § so we cannot readily use the sample to describe the true probability of disease in the population

• Thus, we see that the intercept $\alpha$ is no longer meaningful in a logistic regression using case-control sampled data
  § What about the other estimates?
• Recall we showed that using contingency tables to compute odds ratios was valid both in prospective and retrospective sampling designs
• It turns out that the same is true for the estimated coefficients in logistic regression!
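The claim that case-control sampling changes the intercept but not the odds ratio can be checked with simple arithmetic. The sketch below assumes a single dichotomous exposure, for which the fitted logistic model reproduces the 2×2 table exactly (the intercept is the log odds among the unexposed and the slope is the log odds ratio). All counts and sampling fractions are hypothetical, `logit_fit` is an illustrative helper, and expected counts stand in for a random sample.

```python
import math

# Hypothetical population counts (NOT from the lecture).
pop = {
    ("exposed", "case"): 3_000,
    ("exposed", "control"): 27_000,
    ("unexposed", "case"): 3_500,
    ("unexposed", "control"): 66_500,
}

# Hypothetical case-control sampling fractions: every case is sampled,
# but only 5% of controls, regardless of exposure status.
frac = {"case": 1.0, "control": 0.05}


def logit_fit(table):
    """Saturated logistic fit for one binary exposure: returns (alpha, beta)."""
    odds_unexposed = table[("unexposed", "case")] / table[("unexposed", "control")]
    odds_exposed = table[("exposed", "case")] / table[("exposed", "control")]
    alpha = math.log(odds_unexposed)                # intercept: log odds when x = 0
    beta = math.log(odds_exposed / odds_unexposed)  # slope: log odds ratio
    return alpha, beta


# Expected counts in the case-control sample.
sample = {key: count * frac[key[1]] for key, count in pop.items()}

alpha_pop, beta_pop = logit_fit(pop)
alpha_cc, beta_cc = logit_fit(sample)
log_frac_ratio = math.log(frac["case"] / frac["control"])

print(f"population:   alpha = {alpha_pop:.3f}, OR = {math.exp(beta_pop):.3f}")
print(f"case-control: alpha = {alpha_cc:.3f}, OR = {math.exp(beta_cc):.3f}")
print(f"intercept shift = {alpha_cc - alpha_pop:.3f}, "
      f"log(frac_case / frac_control) = {log_frac_ratio:.3f}")
```

In this sketch the odds ratio is identical in both fits, while the intercept moves by the log of the ratio of the sampling fractions. That shift is why the intercept cannot be read as a population log odds under case-control sampling.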
• Just as before, $\widehat{OR} = e^{\hat\beta_j}$
  § all other inference (tests, CIs) is also the same
• The estimated odds ratio of the outcome between exposed and unexposed groups is the same even if the ‘absolute’ proportion of cases sampled is higher

Matched case-control designs

• To further increase the statistical efficiency of a study, researchers may create a matched case-control design
  § For every case sampled, one or more controls are selected based on similarity to the case
    o Matching each case with $m$ controls is called $1{:}m$ matching
  § Goal is to correct for potential confounding in the study design
  § e.g., match each case with a noncase of the same age and gender, resulting in two groups having the same distributions of age and gender
• As with standard case-control, analysis then measures association between an exposure of interest and the outcome
  § Exposure of interest is not a factor used for matching
• Matched designs trade the increased cost of matching each subject for higher power and potential for causal inference

Analyzing matched case-control designs

• Suppose the sample includes $n$ matched sets. How should we approach analysis?
• Naïve approach: choose one matched set to be ‘baseline,’ and include an indicator variable for each of the other $n - 1$ sets
  § Essentially, treat each matched set as a level of a categorical variable
• Such a model forces us to estimate the effect of exposure within groups that may only have a few people in them
  § Unstable estimation
  § Cannot generalize estimated comparisons of specific pairs of people
• Instead, we want an analysis that estimates the exposure effect by aggregating across matched sets

Conditional logistic regression

• Instead, researchers use conditional logistic regression to estimate the effect of an exposure of interest, conditioning out the factors used to create the matched sets
• To illustrate, assume a matched pairs design. Let
  § $y_{i1} = 1,\; y_{i2} = 0$ be the disease indicators of the $i$th case-control pair
  § $x_{i11}, \dots, x_{i1k},\; x_{i21}, \dots, x_{i2k}$ be the covariates of the $i$th pair
    o Does not include ‘matched on’ factors, which are accounted for in design
• Then for each pair, define the conditional likelihood contribution

$$
L_i(\beta_1, \dots, \beta_k)
= \frac{P(y_{i1} = 1 \cap y_{i2} = 0)}{P(y_{i1} = 1 \cap y_{i2} = 0) + P(y_{i1} = 0 \cap y_{i2} = 1)}
= \frac{e^{\sum_{j=1}^{k} \beta_j x_{i1j}}}{e^{\sum_{j=1}^{k} \beta_j x_{i1j}} + e^{\sum_{j=1}^{k} \beta_j x_{i2j}}}
$$

• Thus, we can compute estimates that maximize the conditional likelihood: $\hat\beta = \arg\max_{\beta} \prod_{i=1}^{n} L_i(\beta_1, \dots, \beta_k)$
• If $\beta_j$ is the coefficient of the exposure of interest, then as before $\widehat{OR} = e^{\hat\beta_j}$
  § Standard methods for testing and CIs are all the same as before
• Note that because we already adjusted for the factors used for matching, we do not get estimated effects for these factors
  § It would be inappropriate to include matching factors as covariates
• We also do not get an estimated intercept, which makes sense because the intercept is not interpretable in the case-control setting anyway

Logistic regression modeling for prediction

• Using a fitted logistic regression model, we so far have focused on estimation and inference of associations
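Returning to the matched-pairs conditional likelihood, the following is a minimal sketch of maximizing it for a single covariate with Newton's method. The pair data are hypothetical, and `log_cond_lik` and `fit_conditional_logit` are illustrative helpers rather than any package's API; a real analysis would normally rely on established software. For one dichotomous exposure the maximizer has the closed form log(n10 / n01), the log ratio of the two kinds of discordant pairs, which the sketch uses as a check.

```python
import math

# Hypothetical matched pairs (NOT from the lecture): (x_case, x_control),
# where x is a 0/1 indicator for the exposure of interest.
pairs = (
    [(1, 0)] * 30    # case exposed, control unexposed
    + [(0, 1)] * 10  # case unexposed, control exposed
    + [(1, 1)] * 25  # concordant pairs
    + [(0, 0)] * 35
)


def log_cond_lik(beta, pairs):
    """Sum over pairs of log[ e^(b*x1) / (e^(b*x1) + e^(b*x2)) ]."""
    return sum(-math.log1p(math.exp(-beta * (x1 - x2))) for x1, x2 in pairs)


def fit_conditional_logit(pairs, n_iter=25):
    """Newton-Raphson maximization of the log conditional likelihood."""
    beta = 0.0
    for _ in range(n_iter):
        grad = 0.0
        hess = 0.0
        for x1, x2 in pairs:
            d = x1 - x2                            # within-pair difference
            p = 1.0 / (1.0 + math.exp(-beta * d))  # this pair's likelihood contribution
            grad += d * (1.0 - p)
            hess -= d * d * p * (1.0 - p)
        if hess == 0.0:  # no discordant pairs: beta cannot be estimated
            break
        beta -= grad / hess
    return beta


beta_hat = fit_conditional_logit(pairs)
n10 = sum(1 for pair in pairs if pair == (1, 0))
n01 = sum(1 for pair in pairs if pair == (0, 1))

print(f"beta_hat = {beta_hat:.4f}, OR_hat = {math.exp(beta_hat):.3f}")
print(f"closed form log(n10 / n01) = {math.log(n10 / n01):.4f}")
print(f"log conditional likelihood at beta_hat = {log_cond_lik(beta_hat, pairs):.3f}")
```

Concordant pairs have a within-pair difference of zero, so they add only a constant to the log conditional likelihood and do not affect the estimate. No intercept appears anywhere in the calculation, consistent with the point that the conditional analysis does not produce one.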
