HST 190: Introduction to Biostatistics

Lecture 5: Multiple linear regression

Multiple linear regression

• Last time, we introduced linear regression in the simple case of a single outcome $y$ and a single covariate $x$ with model
  $$y = \alpha + \beta x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$$
• Now, consider multiple linear regression, predicting $y$ via
  $$y = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$$
• In this model, $\beta_j$ represents the average increase in $y$ corresponding to a one-unit increase in $x_j$ when all other $x$'s have been held constant
  ◦ The $\beta_j$'s are called regression coefficients

Fitting multiple linear regression models

• Like simple linear regression, multiple linear regression can be fit by minimizing the sum of squared residuals
  $$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \alpha - \beta_1 x_{i1} - \cdots - \beta_k x_{ik}\right)^2$$
• Formally, letting $\boldsymbol{\beta} = (\beta_1, \dots, \beta_k)$,
  $$(\hat{\alpha}, \hat{\boldsymbol{\beta}}) = \arg\min_{\alpha, \boldsymbol{\beta}} \sum_{i=1}^{n} \left(y_i - \alpha - \beta_1 x_{i1} - \cdots - \beta_k x_{ik}\right)^2$$
  ◦ Unlike simple linear regression, there are no closed formulas for individual multiple regression estimates
  ◦ However, the solution for $(\hat{\alpha}, \hat{\boldsymbol{\beta}})$ simultaneously can be expressed using matrix notation
• We can estimate $\mathrm{Var}(y \mid x)$ by
  $$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}$$

• Consider researchers studying the relationship between systolic blood pressure (SBP), age, and gender using multiple linear regression. These variables are recorded in a sample:
  ◦ Note that “gender” is a categorical variable with two categories. Therefore, we represent it by creating an indicator variable, $x_2$, that takes values 0 and 1 for men and women, respectively.
  ◦ This is called a dummy variable, since its value (1 vs. 0) is arbitrarily chosen as a stand-in for a non-numeric quantity.

  patient #   y = SBP   x1 = age   gender   x2 = female
  1           120       45         Female   1
  2           135       40         Male     0
  3           132       49         Male     0
  4           140       35         Female   1
  etc.

• If the resulting regression equation was estimated as $\hat{y} = 110 + 0.3x_1 - 10x_2$
  ◦ $\hat{\alpha} = 110$ is an estimated average SBP for men ($x_2 = 0$) at ‘age 0’ ($x_1 = 0$).
    It is recommended to “center” continuous covariates around their sample averages to make the intercept more meaningful.
  ◦ $\hat{\beta}_1 = 0.3$ means that, within each gender, a 1-year increase in age is associated with an increase of 0.3 in average SBP.
• How do we interpret the value of a dummy variable’s coefficient ($\hat{\beta}_2 = -10$)? Consider the predicted SBP values for a man and a woman who are both age 40
  ◦ For the man: $\hat{y} = \hat{\alpha} + \hat{\beta}_1(40) + \hat{\beta}_2(0) = 110 + 12 + 0 = 122$
  ◦ For the woman: $\hat{y} = \hat{\alpha} + \hat{\beta}_1(40) + \hat{\beta}_2(1) = 110 + 12 - 10 = 112$
  ◦ So $\hat{\beta}_2$ is the difference in average values between women and men, holding age constant. That is, $\hat{\beta}_2 = -10$ is the difference in average SBP between genders (women minus men) when all other variables are held constant.

Interaction terms

• In the regression models above, each explanatory variable is related to the outcome independently of all others
• If an explanatory variable’s effect depends on another explanatory variable, it is called an interaction effect
  ◦ If we believe interaction effects are present, we include them in the model
• For example, if SBP changed with age differently for men and women, we might choose to model
  $$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$
  ◦ In this ‘fullest’ model, $\beta_3 x_1 x_2$ is called an interaction term

• These two proposed models predict SBP differently
  ◦ The original model $y = \alpha + \beta_1 x_1 + \beta_2 x_2$ has the same slope of SBP on age, $\beta_1$, for men and women
  ◦ The interaction model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$ has different slopes for men ($\beta_1$) and for women ($\beta_1 + \beta_3$)
• Suppose our new fit is $\hat{y} = 110 + 0.2x_1 - 10x_2 + 0.2x_1 x_2$
  ◦ For men ($x_2 = 0$), we predict $\hat{y} = 110 + 0.2x_1$
  ◦ For women ($x_2 = 1$), we predict $\hat{y} = 100 + 0.4x_1$
• In other words, $\beta_3$ represents the difference between women and men in the change in average SBP associated with a 1-year increase in age

Model specification

• We have seen several linear models relating SBP, age and gender. What linear regression models can we fit, in principle?
  ◦ Model 1: $y = \alpha + \beta_1 x_1 + \varepsilon$
  ◦ Model 2: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
  ◦ Model 3: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$
  ◦ Model 4: $y = \alpha + \beta_1 x_1 + \beta_3 x_1 x_2 + \varepsilon$
  [Figure: plot labeled $y$ = SBP, $x_1$ = Age, $x_2$ = female]
• Model 1, $y = \alpha + \beta_1 x_1 + \varepsilon$: men and women constrained to same slope and intercept
• Model 2, $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$: adding “main effect” for gender allows same slope with different intercepts
  ◦ assumption that age-SBP relationship is same for men/women
• Model 3, $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$: adding interaction for gender allows different slopes and different intercepts
  ◦ assumption that age-SBP relationship differs for men/women
• Model 4, $y = \alpha + \beta_1 x_1 + \beta_3 x_1 x_2 + \varepsilon$: omitting gender main effect allows different slopes but forces same intercepts
  ◦ Rarely sensible, so include main effects first before interactions

Categorical variables with multiple categories

• Suppose researchers want to study the relationship between systolic blood pressure, age, and US geographic region using multiple regression. Region is categorized as “Northeast”, “South”, “Midwest”, or “West”.
  ◦ Since there are four categories, we must create three dummy variables to uniquely characterize each patient:

  patient #   y = SBP   x1 = age   region   x2 = S   x3 = MW   x4 = W
  1           120       45         NE       0        0         0
  2           135       40         S        1        0         0
  3           132       49         MW       0        1         0
  4           140       35         W        0        0         1
  etc.

• The fitted regression model is $\hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4$
• Consider four 40-year-old patients, one from each region.
  Their predicted SBP values will be:
  ◦ NE ($x_2 = x_3 = x_4 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1$
  ◦ S ($x_2 = 1$, $x_3 = x_4 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1 + \hat{\beta}_2$
  ◦ MW ($x_3 = 1$, $x_2 = x_4 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1 + \hat{\beta}_3$
  ◦ W ($x_4 = 1$, $x_2 = x_3 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1 + \hat{\beta}_4$
• $(\hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_4)$ represent the differences in prediction between the categories (S, MW, W) and the “baseline” category (or reference level) NE.
• For a categorical variable with multiple categories, you must use one fewer dummy variable than the number of categories. One category, by default, becomes the reference (baseline) category.

Inference about a single $\beta_j$

• Because each parameter $\beta_j$ describes the relationship between $x_j$ and $y$, inference about $\beta_j$ tells us about the strength of the linear relationship
  ◦ in multiple linear regression, this is holding all other included covariates constant
• We can do hypothesis testing and form confidence intervals for a particular $\beta_j$
• Both require the estimated standard error of $\hat{\beta}_j$, $\widehat{\mathrm{se}}(\hat{\beta}_j)$.
  ◦ In simple linear regression, the closed form is $\widehat{\mathrm{se}}(\hat{\beta}) = \dfrac{\hat{\sigma}}{\sqrt{n-1}\, s_x}$
  ◦ In multiple linear regression, there is no closed form for an individual $\hat{\beta}_j$

• The $100(1 - \alpha)\%$ CI for a particular $\beta_j$ is simply
  $$\hat{\beta}_j \pm t_{n-k-1,\, 1-\alpha/2} \cdot \widehat{\mathrm{se}}(\hat{\beta}_j)$$
• To test the hypothesis $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$, holding all other parameters fixed, we use the test statistic
  $$t^* = \frac{\hat{\beta}_j}{\widehat{\mathrm{se}}(\hat{\beta}_j)} \sim t_{n-k-1}$$
• If $H_0$ is true, $t^*$ follows a $t$-distribution with $n - k - 1$ degrees of freedom, where $k$ is the number of covariates included in the model
  ◦ testing procedure follows exactly as before

Matlab example output

• Model output gives all necessary components for the model SBP $= \alpha + \beta_1\,$AGE $+ \beta_2\,$WEIGHT $+ \varepsilon$

    > fit = fitlm(data,'sbp ~ age + weight')
    Linear regression model:
        sbp ~ 1 + age + weight
    Estimated Coefficients:
                       Estimate     SE          tStat    pValue
        (Intercept)    60.56178     44.81431    1.351    0.219
        age            1.01226      .3406004    2.972    0.021
        weight         .2166858     .2468829    0.878    0.409
    Number of observations: 10, Error degrees of freedom: 7
    Root Mean Squared Error: 11.583

  ◦ Estimate: $\hat{\beta}_j$
  ◦ SE: $\widehat{\mathrm{se}}(\hat{\beta}_j)$
  ◦ tStat: $t^* = \hat{\beta}_j / \widehat{\mathrm{se}}(\hat{\beta}_j)$
  ◦ pValue: $P(|t_{n-k-1}| > |t^*|)$
  ◦ Number of observations: $n$
  ◦ Error degrees of freedom: $n - k - 1$
  ◦ Root Mean Squared Error: $\hat{\sigma}$

Model checking

• For the linear model, assumptions (in order of importance) are:
  ◦ Linearity (debatable!)
  ◦ independence of residuals
  ◦ equal spread of points around the line
  ◦ normality of residuals
  ◦ [Random sampling from a large population]

Linearity

• Linearity means validity of $y = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$
• Check this assumption in two ways:
• Graphically (for SLR): look for nonlinearity and outliers.
  ◦ more difficult to check linearity for MLR; consider pairwise scatterplots (plotting each $x_j$ against $y$) and ‘trellis graphs’
• Conceptually: based on the phenomenon of interest and chosen predictors
  ◦ e.g., age is related linearly to body weight for children

• If violated:
  ◦ Estimators and predictions may be biased
• Strategies:
  ◦ Consider transformations of $x$ and/or $y$ (logarithm, square, reciprocal, etc.) or add interaction terms
  ◦ Use nonlinear functions of $x$: spline (or polynomial) regression or other generalized additive models

• Transformations allow linear regression to model non-linear relationships between $x$ and $y$, as long as there is a transformation of $x$ or $y$ (or both) that makes the relationship linear.
  [Figures borrowed from Ramsey FL, Schafer DW. The Statistical Sleuth, 2002]

Independence of Errors

• Independence of errors $\varepsilon_i$ means that:
  ◦ Residuals for any two observations
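
Returning to the fitting and inference steps covered earlier in this lecture, the sketch below shows one way the least-squares estimates, the variance estimate with $n - k - 1$ degrees of freedom, the standard errors, and the $t$ statistics could be computed. It is written in Python with only the standard library rather than in Matlab, and it solves the normal equations directly, which is an illustrative choice and not necessarily what `fitlm` does internally. The function names `solve` and `fit_ols` and the eight-patient data set are assumptions made up for this example; they are not from the lecture.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(covariates, y):
    """Least-squares fit of y on an intercept plus the given covariates.

    Returns (coef, se, t_stats, df), with the intercept first in each list.
    """
    X = [[1.0] + [float(v) for v in row] for row in covariates]
    n, p = len(X), len(X[0])  # p = k + 1 parameters (intercept + k covariates)
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    coef = solve(xtx, xty)  # normal equations: (X'X) coef = X'y
    resid = [yi - sum(c * v for c, v in zip(coef, r)) for r, yi in zip(X, y)]
    df = n - p  # error degrees of freedom, n - k - 1
    sigma2_hat = sum(e * e for e in resid) / df
    se = []
    for j in range(p):
        unit = [1.0 if i == j else 0.0 for i in range(p)]
        inv_col = solve(xtx, unit)  # j-th column of (X'X)^{-1}
        se.append(math.sqrt(sigma2_hat * inv_col[j]))
    t_stats = [c / s for c, s in zip(coef, se)]
    return coef, se, t_stats, df


# Hypothetical data (NOT from the lecture): each row is (age, female).
covariates = [(30, 0), (40, 0), (50, 0), (60, 0),
              (30, 1), (40, 1), (50, 1), (60, 1)]
sbp = [120, 121, 124, 129, 111, 110, 113, 120]

coef, se, t_stats, df = fit_ols(covariates, sbp)
for name, c, s, t in zip(["(Intercept)", "age", "female"], coef, se, t_stats):
    print(f"{name:12s} estimate={c:8.3f}  se={s:6.3f}  t={t:7.3f}")
print("error degrees of freedom:", df)
```

The made-up data were constructed so that the fitted coefficients come out at approximately 110, 0.3, and -10, matching the lecture's example equation $\hat{y} = 110 + 0.3x_1 - 10x_2$, with $8 - 2 - 1 = 5$ error degrees of freedom. Each $t$ statistic would then be compared with a $t_{n-k-1}$ distribution as described in the inference section.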
