HST 190: Introduction to Biostatistics
Lecture 5: Multiple linear regression
Multiple linear regression
• Last time, we introduced linear regression in the simple case of a single outcome $y$ and a single covariate $x$ with model $y = \alpha + \beta x + e$, $e \sim N(0, \sigma^2)$
• Now, consider multiple linear regression, predicting $y$ via $y = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k + e$, $e \sim N(0, \sigma^2)$
• In this model, $\beta_j$ represents the average increase in $y$ corresponding to a one-unit increase in $x_j$ when all other $x$'s are held constant
§ The $\beta_j$'s are called regression coefficients
Fitting multiple linear regression models
• Like simple linear regression, multiple linear regression can be fit by minimizing the sum of squared residuals $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2$
• Formally, letting $\beta = (\beta_1, \dots, \beta_k)$, $(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} (y_i - \alpha - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2$
§ Unlike simple linear regression, there are no closed formulas for individual multiple regression estimates
§ However, the solution for $(\hat{\alpha}, \hat{\beta})$ simultaneously can be expressed using matrix notation
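• We can estimate $\mathrm{Var}(y \mid x)$ by $\hat{\sigma}^2 = \dfrac{\sum_{i=1}^{n} \hat{e}_i^2}{n - k - 1}$, where $\hat{e}_i$ is the $i$th residual and $k$ is the number of covariates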
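The least-squares fit and the residual variance estimate can be sketched with NumPy. This is a minimal illustration rather than the lecture's own code: the function name `fit_ols` and the toy data are made up for this example, and the divisor $n - k - 1$ (the usual residual degrees of freedom) is an assumption of the sketch.

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares fit of y = alpha + beta_1 x_1 + ... + beta_k x_k + e.

    x: (n, k) array-like of covariates; y: (n,) array-like of outcomes.
    Returns (alpha_hat, beta_hat, sigma2_hat).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = x.shape
    # Design matrix: a leading column of ones carries the intercept alpha.
    X = np.column_stack([np.ones(n), x])
    # lstsq minimizes ||y - X @ coef||^2, i.e. the sum of squared residuals.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    # Residual variance estimate with n - k - 1 degrees of freedom (assumed divisor).
    sigma2_hat = resid @ resid / (n - k - 1)
    return coef[0], coef[1:], sigma2_hat

# Made-up illustrative data: columns are age and a 0/1 indicator; y is SBP.
x = [[45, 1], [40, 0], [49, 0], [35, 1], [52, 0], [61, 1]]
y = [120, 135, 132, 140, 128, 131]
alpha_hat, beta_hat, sigma2_hat = fit_ols(x, y)
print(alpha_hat, beta_hat, sigma2_hat)
```

In matrix terms, with $X$ the design matrix above, the estimates solve the normal equations $X^\top X \hat{\theta} = X^\top y$. The sketch calls `np.linalg.lstsq` rather than forming an explicit inverse because that is generally the more numerically stable route.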
• Consider researchers studying the relationship between systolic blood pressure (SBP), age, and gender using multiple linear regression. These variables are recorded in a sample:
§ Note that “gender” is a categorical variable with two categories. Therefore, we represent it by creating an indicator variable, $x_2$, that takes values 0 and 1 for men and women, respectively.
§ This is called a dummy variable, since its value (1 vs. 0) is arbitrarily chosen as a stand-in for a non-numeric quantity.

patient #    $y$ = SBP    $x_1$ = age    gender    $x_2$ = female
1            120          45             Female    1
2            135          40             Male      0
3            132          49             Male      0
4            140          35             Female    1
etc.
• Suppose the resulting regression equation was estimated as $\hat{y} = 110 + 0.3 x_1 - 10 x_2$
§ $\hat{\alpha} = 110$ is the estimated average SBP for men ($x_2 = 0$) at ‘age 0’ ($x_1 = 0$). It is recommended to “center” continuous covariates around their sample averages to make the intercept more meaningful
§ $\hat{\beta}_1 = 0.3$ means that, within each gender, a 1-year increase in age is associated with an increase of 0.3 in average SBP.
• How do we interpret the value of a dummy variable’s coefficient ($\hat{\beta}_2 = -10$)? Consider the predicted SBP values for a man and a woman who are both age 40
§ For the man: $\hat{y} = \hat{\alpha} + \hat{\beta}_1 (40) + \hat{\beta}_2 (0) = 122$
§ For the woman: $\hat{y} = \hat{\alpha} + \hat{\beta}_1 (40) + \hat{\beta}_2 (1) = 112$
§ So $\hat{\beta}_2 = -10$ is the difference in average SBP between women and men, holding age constant. More generally, a dummy variable’s coefficient is the difference in average outcome between the two groups when all other variables are held constant
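The comparison of a 40-year-old man and woman can be checked with a few lines of Python. This is only a sketch: the function name `predict_sbp` is made up, the coefficients are the fitted values quoted above, and the gender term enters with a plus sign and a coefficient of -10.

```python
def predict_sbp(age, female, alpha=110.0, beta_age=0.3, beta_female=-10.0):
    """Predicted SBP from the fitted additive model 110 + 0.3*age - 10*female."""
    return alpha + beta_age * age + beta_female * female

man_40 = predict_sbp(age=40, female=0)
woman_40 = predict_sbp(age=40, female=1)
print(man_40, woman_40, woman_40 - man_40)

# The woman-minus-man gap does not depend on the age at which we compare.
for age in (30, 40, 50):
    print(age, predict_sbp(age, 1) - predict_sbp(age, 0))
```

Because the model has no interaction term, the gap between the two predictions is the gender coefficient at every age.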
Interaction terms
• In the regression models above, each explanatory variable is related to the outcome independently of all others
• If an explanatory variable’s effect depends on another explanatory variable, it is called an interaction effect
§ If we believe interaction effects are present, we include them in the model
• For example, if SBP changed with age differently for men and women, we might choose to model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$
§ In this ‘fullest’ model, $\beta_3 x_1 x_2$ is called an interaction term
• These two proposed models predict SBP differently
§ The original model $y = \alpha + \beta_1 x_1 + \beta_2 x_2$ has the same SBP slope $\beta_1$ for men and women
§ The interaction model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$ has different SBP slopes for men ($\beta_1$) and for women ($\beta_1 + \beta_3$)
• Suppose our new fit is $\hat{y} = 110 + 0.2 x_1 - 10 x_2 + 0.2 x_1 x_2$
§ For men ($x_2 = 0$), we predict $\hat{y} = 110 + 0.2 x_1$
§ For women ($x_2 = 1$), we predict $\hat{y} = 100 + 0.4 x_1$
• In other words, $\beta_3$ represents the difference in changes in average SBP associated with a 1-year increase in age for women versus men
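The interpretation of the interaction coefficient can be verified numerically. The sketch below uses the fitted values quoted above; the function names `predict_sbp_interaction` and `one_year_change` are made up for this illustration.

```python
def predict_sbp_interaction(age, female, alpha=110.0, b_age=0.2,
                            b_female=-10.0, b_inter=0.2):
    """Predicted SBP from 110 + 0.2*age - 10*female + 0.2*age*female."""
    return alpha + b_age * age + b_female * female + b_inter * age * female

def one_year_change(age, female):
    """Change in predicted SBP when age goes from `age` to `age + 1`."""
    return (predict_sbp_interaction(age + 1, female)
            - predict_sbp_interaction(age, female))

slope_men = one_year_change(40, female=0)
slope_women = one_year_change(40, female=1)
# The difference between the two per-year changes is the interaction coefficient.
print(slope_men, slope_women, slope_women - slope_men)
```

Up to floating-point rounding, the per-year change is 0.2 for men and 0.4 for women, so their difference recovers the interaction coefficient 0.2.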
Model specification
• We have seen several linear models relating SBP, age and gender. What linear regression models can we fit, in principle?
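§ $y = \alpha + \beta_1 x_1 + e$
§ $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + e$
§ $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$
§ $y = \alpha + \beta_1 x_1 + \beta_3 x_1 x_2 + e$
[Figure: $y$ = SBP versus $x_1$ = age, with $x_2$ = female]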
• Model $y = \alpha + \beta_1 x_1 + e$
§ Men and women are constrained to the same slope and intercept
• Model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + e$
§ Adding a “main effect” for gender allows the same slope with different intercepts
§ This assumes the age-SBP relationship is the same for men and women
• Model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$
§ Adding an interaction for gender allows different slopes and different intercepts
§ This assumes the age-SBP relationship differs for men and women
• Model $y = \alpha + \beta_1 x_1 + \beta_3 x_1 x_2 + e$
§ Omitting the gender main effect allows different slopes but forces the same intercept
§ This is rarely sensible, so include main effects before interactions
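The four specifications can be compared by writing out the implied SBP-versus-age line for each gender. This is a minimal sketch: the helper `gender_lines` and the coefficient values are hypothetical, chosen only to illustrate which terms are shared, and setting a coefficient to zero stands in for dropping that term from the model.

```python
def gender_lines(alpha, beta1, beta2=0.0, beta3=0.0):
    """Return ((intercept, slope) for men, (intercept, slope) for women).

    Men have x2 = 0 and women have x2 = 1, so beta2 shifts the women's
    intercept and beta3 shifts the women's slope.
    """
    men = (alpha, beta1)
    women = (alpha + beta2, beta1 + beta3)
    return men, women

# Hypothetical coefficient values for illustration only.
models = {
    "age only": gender_lines(110, 0.3),
    "age + gender": gender_lines(110, 0.3, beta2=-10),
    "age + gender + age*gender": gender_lines(110, 0.2, beta2=-10, beta3=0.2),
    "age + age*gender (no main effect)": gender_lines(110, 0.2, beta3=0.2),
}

summary = {}
for name, (men, women) in models.items():
    same_intercept = men[0] == women[0]
    same_slope = men[1] == women[1]
    summary[name] = (same_intercept, same_slope)
    print(f"{name}: same intercept={same_intercept}, same slope={same_slope}")
```

Only the last specification forces a common intercept while letting the slopes differ, which is the case the slides flag as rarely sensible.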
Categorical variables with multiple categories
• Suppose researchers want to study the relationship between systolic blood pressure, age, and US geographic region using multiple regression. Region is categorized as “Northeast”, “South”, “Midwest”, or “West”.
§ Since there are four categories, we must create three dummy variables to uniquely characterize each patient:

patient #    $y$ = SBP    $x_1$ = age    region    $x_2$ = S    $x_3$ = MW    $x_4$ = W
1            120          45             NE        0            0             0
2            135          40             S         1            0             0
3            132          49             MW        0            1             0
4            140          35             W         0            0             1
etc.
• The fitted regression model is $\hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4$
• Consider four 40-year-old patients, one from each region. Their predicted SBP values will be:
§ NE ($x_2 = x_3 = x_4 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1$
§ S ($x_2 = 1$, $x_3 = x_4 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1 + \hat{\beta}_2$
§ MW ($x_3 = 1$, $x_2 = x_4 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1 + \hat{\beta}_3$
§ W ($x_4 = 1$, $x_2 = x_3 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1 + \hat{\beta}_4$
• $(\hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_4)$ represent the difference in prediction between the categories (S, MW, W) and the “baseline” category (or the reference level) NE.
• For a categorical variable with multiple categories, you must use one fewer dummy variable than the number of categories. One category, by default, becomes the reference (baseline) category.
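The dummy coding for region can be sketched in a few lines of Python. The names `REGIONS` and `dummy_code` are made up for this illustration, and NE is taken as the reference level, as in the table above.

```python
REGIONS = ["NE", "S", "MW", "W"]  # the first entry is the reference level

def dummy_code(value, levels=REGIONS):
    """Return one 0/1 dummy per non-reference level.

    The reference level (levels[0]) maps to all zeros, so a variable with
    len(levels) categories yields len(levels) - 1 dummies.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]

for region in REGIONS:
    print(region, dummy_code(region))
```

Choosing a different reference level only reorders `levels`. The fitted predictions stay the same, but each dummy coefficient is then read as a difference from the new baseline.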
Inference about a single $\beta_j$
• Because each parameter $\beta_j$ describes the relationship between $x_j$ and $y$, inference about $\beta_j$ tells us about the strength of the linear relationship
§ In multiple linear regression, this is holding all other included covariates constant
• We can do hypothesis testing and form confidence intervals for a particular $\beta_j$
• Both require the estimated standard error of $\hat{\beta}_j$, $\widehat{\mathrm{se}}(\hat{\beta}_j)$
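One common way to obtain these standard errors is sketched below with NumPy. It relies on the standard result that the estimated covariance matrix of the coefficients is $\hat{\sigma}^2 (X^\top X)^{-1}$, which is not stated in the slides above. The function name `ols_standard_errors` and the toy data are made up for this illustration.

```python
import numpy as np

def ols_standard_errors(x, y):
    """Return (coef, se) for y = alpha + beta_1 x_1 + ... + beta_k x_k + e.

    coef[0] is the intercept estimate; coef[1:] are the slope estimates.
    se holds the matching estimated standard errors.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = x.shape
    X = np.column_stack([np.ones(n), x])  # column of ones for the intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2_hat = resid @ resid / (n - k - 1)  # residual variance estimate
    # Assumed standard result: Cov(coef) is estimated by sigma2_hat * (X'X)^{-1}.
    cov = sigma2_hat * np.linalg.inv(X.T @ X)
    return coef, np.sqrt(np.diag(cov))

# Made-up data with a single covariate, for illustration only.
coef, se = ols_standard_errors([[1], [2], [3], [4], [5]],
                               [2.1, 3.9, 6.2, 7.8, 10.1])
print(coef, se)
```

Under the model assumptions, the ratio `coef[j] / se[j]` is the usual t statistic for testing whether the corresponding coefficient is zero, referred to a t distribution with $n - k - 1$ degrees of freedom.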