HST 190: Introduction to Biostatistics
Lecture 7: Logistic regression
Logistic regression
• We’ve previously discussed linear regression methods for predicting continuous outcomes
  § Functionally, predicting the mean at particular covariate levels
• What if we want to predict values for a dichotomous categorical variable, instead of a continuous one?
  § This corresponds to predicting the probability of the outcome variable being a “1” versus “0”
• Can we just use linear regression for a 0-1 outcome variable?
• Consider modeling the probability that a person receives a physical exam in a given year as a function of income.
  § A sample of individuals is collected. Each individual reports income and whether they went to the doctor last year.

patient #   y = checkup   x = income
1           1             32,000
2           0             28,000
3           0             41,000
4           1             38,000
etc.
• Plotting these data and fitting a linear regression line, we see that the linear model is not tailored to this type of outcome
  § For example, an income of $500,000 yields a predicted probability of visiting the doctor greater than 1!
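The problem can be reproduced with a short sketch. It fits an ordinary least squares line using only the four rows shown in the table (the lecture's full sample is not reproduced here, so the fitted numbers are illustrative only) and evaluates the line at an income of $500,000. The helper name `fit_line` is an assumption made for this example.

```python
# The four (income, checkup) rows shown in the table; the full sample is not available.
incomes = [32_000, 28_000, 41_000, 38_000]
checkups = [1, 0, 0, 1]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x with one covariate."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(incomes, checkups)

# Evaluate the fitted line far outside the observed income range.
predicted = intercept + slope * 500_000
print(f"fitted line: {intercept:.3f} + {slope:.3e} * income")
print(f"predicted 'probability' at $500,000: {predicted:.2f}")
```

Nothing in the straight-line fit constrains its output to the interval [0, 1], so with these four rows the value at $500,000 exceeds 1.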
Logit transformation
• To overcome the problem, we define the logit transformation: if $0 < p < 1$, then $\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$
  § Notice that as $p \uparrow 1$, $\operatorname{logit}(p) \uparrow \infty$, and as $p \downarrow 0$, $\operatorname{logit}(p) \downarrow -\infty$
• Thus, $\operatorname{logit}(p)$ can take any real value, so we will fit a linear model on this transformed outcome instead
• Write this type of model generally as
  $$g(E(y)) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$$
  § Where $E(y) = 1 \cdot P(y = 1) + 0 \cdot P(y = 0) = P(y = 1) = p$ and $g(E(y)) = \operatorname{logit}(E(y))$
  § This model is called a logistic regression model or a logit model
  § By comparison, the linear regression model takes $g(E(y)) = E(y)$
• A key benefit of fitting a logit model rather than using contingency table methods is the ability to adjust for multiple covariates (including continuous covariates) simultaneously

patient #   y = checkup   income   age   gender
1           1             32,000   60    F
2           0             28,000   53    M
3           0             41,000   45    M
4           1             38,000   40    F
etc.

• To interpret parameters, compare fit for man and woman
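  § For the gender coded $x_3 = 1$: $\operatorname{logit}(p_1) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3$
  § For the gender coded $x_3 = 0$: $\operatorname{logit}(p_0) = \alpha + \beta_1 x_1 + \beta_2 x_2$
  § Here $x_1$ is income and $x_2$ is age, taken to be the same for both people

$$\Rightarrow \operatorname{logit}(p_1) - \operatorname{logit}(p_0) = \beta_3$$
$$\iff \ln\left(\frac{p_1}{1 - p_1}\right) - \ln\left(\frac{p_0}{1 - p_0}\right) = \beta_3$$
$$\iff \ln\left(\frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)}\right) = \beta_3$$

  § Here $\frac{p}{1 - p}$ is the odds of the outcome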
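The comparison above can be checked numerically. The sketch below uses hypothetical coefficients (not estimates from the lecture's data) and a 0/1 gender indicator. It shows that, at any fixed income and age, the difference in logits equals the gender coefficient, so the ratio of the two odds equals its exponential. All names here are assumptions made for this example.

```python
import math

# Hypothetical coefficients, for illustration only.
alpha = -2.0
beta_income = 0.00003
beta_age = 0.02
beta_gender = 0.5  # coefficient on the 0/1 gender indicator


def linear_predictor(income, age, gender_indicator):
    """logit(p) = alpha + beta_income*income + beta_age*age + beta_gender*indicator."""
    return (alpha + beta_income * income + beta_age * age
            + beta_gender * gender_indicator)


def odds(income, age, gender_indicator):
    """Odds p / (1 - p), which equals exp(logit(p))."""
    return math.exp(linear_predictor(income, age, gender_indicator))


# Compare the two gender codings at the same income and age.
for income, age in [(32_000, 60), (41_000, 45)]:
    log_odds_ratio = (linear_predictor(income, age, 1)
                      - linear_predictor(income, age, 0))
    odds_ratio = odds(income, age, 1) / odds(income, age, 0)
    print(f"income={income}, age={age}: "
          f"log odds ratio={log_odds_ratio:.3f}, odds ratio={odds_ratio:.3f}")
```

The odds ratio is the same at every income and age, because the other covariates cancel when the two logits are subtracted.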