HST 190: Introduction to Biostatistics

Lecture 5: Multiple linear regression

Multiple linear regression
• Last time, we introduced linear regression in the simple case of a single outcome $y$ and a single covariate $x$, with model $y = \alpha + \beta x + e$, $e \sim N(0, \sigma^2)$
• Now, consider multiple linear regression, predicting $y$ via $y = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k + e$, $e \sim N(0, \sigma^2)$
• In this model, $\beta_j$ represents the average increase in $y$ corresponding to a one-unit increase in $x_j$ when all other $x$'s are held constant
  § The $\beta_j$'s are called regression coefficients

Fitting multiple linear regression models
• Like simple linear regression, multiple linear regression can be fit by minimizing the sum of squared residuals $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2$
• Formally, letting $\beta = (\beta_1, \dots, \beta_k)$, $(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} (y_i - \alpha - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2$
  § Unlike simple linear regression, there are no simple closed-form formulas for the individual multiple regression estimates
  § However, the solution for $\hat{\alpha}, \hat{\beta}$ simultaneously can be expressed using matrix notation
• We can estimate $\mathrm{Var}(y \mid x)$ by $\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}$

• Consider researchers studying the relationship between systolic blood pressure (SBP), age, and gender using multiple linear regression. These variables are recorded in a sample:
  § Note that "gender" is a categorical variable with two categories. Therefore, we represent it by creating an indicator variable, $x_2$, that takes values 0 and 1 for men and women, respectively.
  § This is called a dummy variable, since its value (1 vs. 0) is arbitrarily chosen as a stand-in for a non-numeric quantity.

  patient #   $y$ = SBP   $x_1$ = age   gender   $x_2$ = female
  1           120         45            Female   1
  2           135         40            Male     0
  3           132         49            Male     0
  4           140         35            Female   1
  etc.

• Suppose the resulting regression equation was estimated as $\hat{y} = 110 + 0.3 x_1 - 10 x_2$
  § $\hat{\alpha} = 110$ is the estimated average SBP for men ($x_2 = 0$) at 'age 0' ($x_1 = 0$).
    It is recommended to "center" continuous covariates around their sample averages to make the intercept more meaningful.
  § $\hat{\beta}_1 = 0.3$ means that, within each gender, a 1-year increase in age is associated with an increase of 0.3 in average SBP.
• How do we interpret the value of a dummy variable's coefficient ($\hat{\beta}_2 = -10$)? Consider the predicted SBP values for a man and a woman who are both age 40
  § For the man: $\hat{y} = \hat{\alpha} + \hat{\beta}_1 (40) + \hat{\beta}_2 (0)$
  § For the woman: $\hat{y} = \hat{\alpha} + \hat{\beta}_1 (40) + \hat{\beta}_2 (1)$
  § So $\hat{\beta}_2$ is the difference in average SBP between women and men, holding age constant. Here, $\hat{\beta}_2 = -10$ means that women's average SBP is 10 lower than men's when all other variables are held constant.

Interaction terms
• In the regression models above, each explanatory variable is related to the outcome independently of all others
• If an explanatory variable's effect depends on another explanatory variable, it is called an interaction effect
  § If we believe interaction effects are present, we include them in the model
• For example, if SBP changed with age differently for men and women, we might choose to model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$
  § In this 'fullest' model, $\beta_3 x_1 x_2$ is called an interaction term

• These two proposed models predict SBP differently
  § The original model $y = \alpha + \beta_1 x_1 + \beta_2 x_2$ has the same SBP slope $\beta_1$ for men and women
  § The interaction model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$ has different SBP slopes for men ($\beta_1$) and for women ($\beta_1 + \beta_3$)
• Suppose our new fit is $\hat{y} = 110 + 0.2 x_1 - 10 x_2 + 0.2 x_1 x_2$
  § For men ($x_2 = 0$), we predict $\hat{y} = 110 + 0.2 x_1$
  § For women ($x_2 = 1$), we predict $\hat{y} = 100 + 0.4 x_1$
• In other words, $\beta_3$ represents the difference in the change in average SBP associated with a 1-year increase in age for women versus men (here, 0.2 per year)

Model specification
• We have seen several linear models relating SBP ($y$), age ($x_1$), and gender ($x_2$ = female). What linear regression models can we fit, in principle?
  § $y = \alpha + \beta_1 x_1 + e$
  § $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + e$
  § $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$
  § $y = \alpha + \beta_1 x_1 + \beta_3 x_1 x_2 + e$
[Figure: SBP ($y$) against age ($x_1$), by gender ($x_2$ = female)]
• $y = \alpha + \beta_1 x_1 + e$: men and women are constrained to the same slope and intercept
• $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + e$: adding a "main effect" for gender allows the same slope with different intercepts
  § assumption that the age-SBP relationship is the same for men and women
• $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$: adding an interaction with gender allows different slopes and different intercepts
  § assumption that the age-SBP relationship differs for men and women
• $y = \alpha + \beta_1 x_1 + \beta_3 x_1 x_2 + e$: omitting the gender main effect allows different slopes but forces the same intercept
  § Rarely sensible, so include main effects before interactions

Categorical variables with multiple categories
• Suppose researchers want to study the relationship between systolic blood pressure, age, and US geographic region using multiple regression. Region is categorized as "Northeast", "South", "Midwest", or "West".
  § Since there are four categories, we must create three dummy variables to uniquely characterize each patient:

  patient #   $y$ = SBP   $x_1$ = age   region   $x_2$ = S   $x_3$ = MW   $x_4$ = W
  1           120         45            NE       0           0            0
  2           135         40            S        1           0            0
  3           132         49            MW       0           1            0
  4           140         35            W        0           0            1
  etc.

• The fitted regression model is $\hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4$
• Consider four 40-year-old patients, one from each region.
  Their predicted SBP values will be:
  § NE ($x_2 = x_3 = x_4 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1$
  § S ($x_2 = 1$, $x_3 = x_4 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1 + \hat{\beta}_2$
  § MW ($x_3 = 1$, $x_2 = x_4 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1 + \hat{\beta}_3$
  § W ($x_4 = 1$, $x_2 = x_3 = 0$): $\hat{y} = \hat{\alpha} + 40\hat{\beta}_1 + \hat{\beta}_4$
• $(\hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_4)$ represent the differences in prediction between the categories (S, MW, W) and the "baseline" category (or reference level) NE.
• For a categorical variable with multiple categories, you must use one fewer dummy variable than the number of categories. The category without a dummy variable becomes the reference (baseline) category.

Inference about a single $\beta_j$
• Because each parameter $\beta_j$ describes the relationship between $x_j$ and $y$, inference about $\beta_j$ tells us about the strength of the linear relationship
  § in multiple linear regression, this is holding all other included covariates constant
• We can do hypothesis testing and form confidence intervals for a particular $\beta_j$
• Both require the estimated standard error of $\hat{\beta}_j$, $\widehat{\mathrm{se}}(\hat{\beta}_j)$.
  § In simple linear regression, the closed form is $\widehat{\mathrm{se}}(\hat{\beta}) = \frac{\hat{\sigma}}{\sqrt{(n-1)\, s_x^2}}$
  § In multiple linear regression, there is no simple closed form for an individual $\hat{\beta}_j$

• The $100(1 - \alpha)\%$ CI for a particular $\beta_j$ is simply $\hat{\beta}_j \pm t_{n-k-1,\, 1-\alpha/2} \cdot \widehat{\mathrm{se}}(\hat{\beta}_j)$ (here $\alpha$ denotes the significance level, not the intercept)
• To test the hypothesis $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$, holding all other parameters fixed, we use the test statistic $t^* = \frac{\hat{\beta}_j}{\widehat{\mathrm{se}}(\hat{\beta}_j)} \sim t_{n-k-1}$
• If $H_0$ is true, $t^*$ follows a $t$-distribution with $n - k - 1$ degrees of freedom, where $k$ is the number of covariates included in the model
  § testing procedure follows exactly as before

Matlab example output
• Model output gives all necessary components for the model $\mathrm{SBP} = \alpha + \beta_1 \mathrm{AGE} + \beta_2 \mathrm{WEIGHT} + e$

    > fit = fitlm(data,'sbp ~ age + weight')

    Linear regression model:
        sbp ~ 1 + age + weight

    Estimated Coefficients:
                       Estimate     SE           tStat    pValue
        (Intercept)    60.56178     44.81431     1.351    0.219
        age            1.01226      .3406004     2.972    0.021
        weight         .2166858     .2468829     0.878    0.409

    Number of observations: 10, Error degrees of freedom: 7
    Root Mean Squared Error: 11.583

  § Estimate: $\hat{\beta}_j$
  § SE: $\widehat{\mathrm{se}}(\hat{\beta}_j)$
  § tStat: $t^* = \hat{\beta}_j / \widehat{\mathrm{se}}(\hat{\beta}_j)$
  § pValue: two-sided p-value, $P(|t_{n-k-1}| > |t^*|)$
  § Number of observations: $n$
  § Error degrees of freedom: $n - k - 1$
  § Root Mean Squared Error: $\hat{\sigma}$

Model checking
• For the linear model, the assumptions (in order of importance) are:
  § linearity (debatable!)
  § independence of residuals
  § equal spread of points around the line
  § normality of residuals
  § [random sampling from a large population]

Linearity
• Linearity means validity of $y = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k + e$
• Check this assumption in two ways:
• Graphically (for SLR): look for nonlinearity and outliers.
  § It is more difficult to check linearity for MLR; consider pairwise scatterplots (plotting each $x_j$ against $y$) and 'trellis graphs'
• Conceptually: based on the phenomenon of interest and the chosen predictors
  § e.g., age is related linearly to body weight for children

• If violated:
  § Estimators and predictions may be biased
• Strategies:
  § Consider transformations ($\log x$, $x^2$, $\log y$, etc.) or add interaction terms
  § Use nonlinear functions of $x$: spline (or polynomial) regression or other generalized additive models

• Transformations allow linear regression to model non-linear relationships between $x$ and $y$, as long as there is a transformation of $x$ or $y$ (or both) that makes the relationship linear.
[Figures borrowed from Ramsey FL, Schafer DW. The Statistical Sleuth, 2002]

Independence of Errors
• Independence of errors $e_i$ means that:
  § Residuals for any two observations
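
As a worked recap of the fitting and inference formulas, the numbers in the Matlab output above can be plugged in by hand. This is only a sketch under stated assumptions: the design matrix $X$ (an $n \times (k+1)$ matrix with a leading column of ones) is notation introduced here, not taken from the slides, and the critical value $t_{7,\,0.975} \approx 2.365$ is taken from a standard $t$ table.

```latex
% Assumed notation: X is the n x (k+1) design matrix with a leading column of ones.
\[
(\hat{\alpha}, \hat{\beta}_1, \dots, \hat{\beta}_k)^\top = (X^\top X)^{-1} X^\top y,
\qquad
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}
\]
% Age coefficient from the output: estimate 1.01226, SE 0.3406004, n = 10, k = 2,
% so the error degrees of freedom are n - k - 1 = 7.
\[
t^* = \frac{1.01226}{0.3406004} \approx 2.972,
\qquad
1.01226 \pm 2.365 \times 0.3406004 \approx (0.207,\ 1.818)
\]
```

The 95% interval for the age coefficient excludes 0, in line with its reported p-value of 0.021. The same arithmetic for weight gives roughly $(-0.367,\ 0.801)$, which includes 0, in line with its p-value of 0.409.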
