
HST 190: Introduction to Biostatistics

Lecture 5: Multiple linear regression

Multiple linear regression

• Last time, we introduced linear regression in the simple case of a single outcome y and a single covariate x with the model y = α + βx + ε, ε ~ N(0, σ²)

• Now, consider multiple linear regression, predicting y via y = α + β₁x₁ + ⋯ + βₖxₖ + ε, ε ~ N(0, σ²)

• In this model, βⱼ represents the average increase in y corresponding to a one-unit increase in xⱼ when all other x's have been held constant

§ The β's are called regression coefficients

Fitting multiple linear regression models

• Like simple linear regression, multiple linear regression can be fit by minimizing the sum of squared residuals ∑ᵢ eᵢ² = ∑ᵢ (yᵢ − α − β₁x₁ᵢ − ⋯ − βₖxₖᵢ)²

• Formally, letting β = (β₁, …, βₖ), the estimates are (α̂, β̂) = arg min over (α, β) of ∑ᵢ (yᵢ − α − β₁x₁ᵢ − ⋯ − βₖxₖᵢ)²
§ Unlike simple linear regression, there are no closed formulas for the individual multiple regression estimates
§ However, the solution for (α̂, β̂) simultaneously can be expressed using matrix notation

• We can estimate Var(ε|x) by s² = ∑ᵢ eᵢ² / (n − k − 1)
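• As a sketch of how this fit is computed in practice, in MATLAB (the package used in the example output later in this lecture; the table tbl and its columns are hypothetical):

    fit = fitlm(tbl, 'sbp ~ age + weight');  % least-squares fit of y on x1, x2
    s2 = fit.MSE;                            % s^2 = sum(e_i^2)/(n - k - 1), estimating Var(e|x)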

• Consider researchers studying the relationship between systolic blood pressure, age, and gender using multiple linear regression. These variables are recorded in a sample:
§ note "gender" is a categorical variable with two categories. Therefore, we represent it by creating an indicator variable, x₂, that takes values 0 and 1 for men and women, respectively.
§ this is called a dummy variable, since its value (1 vs. 0) is arbitrarily chosen as a stand-in for a non-numeric quantity.

patient # | y = SBP | x₁ = age | gender | x₂ = female
1         | 120     | 45       | Female | 1
2         | 135     | 40       | Male   | 0
3         | 132     | 49       | Male   | 0
4         | 140     | 35       | Female | 1
etc.
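• A minimal MATLAB sketch of building such a dummy variable, using a hypothetical table mirroring the rows above:

    tbl = table([120; 135; 132; 140], [45; 40; 49; 35], ...
                {'Female'; 'Male'; 'Male'; 'Female'}, ...
                'VariableNames', {'sbp', 'age', 'gender'});
    tbl.female = double(strcmp(tbl.gender, 'Female'));  % 1 for women, 0 for men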

• If the resulting regression equation was estimated as ŷ = 110 + 0.3x₁ − 10x₂

§ α̂ = 110 is the estimated average SBP for men (x₂ = 0) at 'age 0' (x₁ = 0). It is recommended to "center" continuous covariates around their sample averages to make the intercept more meaningful

§ β̂₁ = 0.3 means that, within each gender, a 1-year increase in age is associated with an increase of 0.3 in average SBP.
• How do we interpret the value of a dummy variable's coefficient (β̂₂ = −10)? Consider the predicted SBP values for a man and a woman who are both age 40

§ For the man: ŷ = 110 + 0.3(40) − 10(0) = 122

§ For the woman: ŷ = 110 + 0.3(40) − 10(1) = 112

§ So, β̂₂ is the difference in average values between women and men, holding age constant. That is, β̂₂ = −10 is the difference in average SBP between genders when all other variables are held constant

Interaction terms

• In the regression models above, each explanatory variable is related to the outcome independently of all others
• If an explanatory variable's effect depends on another explanatory variable, it is called an interaction effect
§ If we believe interaction effects are present, we include them in the model
• For example, if SBP changed with age differently for men and women, we might choose to model y = α + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε

§ In this 'fullest' model, β₃x₁x₂ is called an interaction term

• These two proposed models predict SBP differently

§ the original model y = α + β₁x₁ + β₂x₂ has the same SBP slope β₁ for men and women

§ The interaction model y = α + β₁x₁ + β₂x₂ + β₃x₁x₂ has different SBP slopes for men (β₁) and for women (β₁ + β₃)

• Suppose our new fit is ŷ = 110 + 0.2x₁ − 10x₂ + 0.2x₁x₂

§ For men (x₂ = 0), we predict ŷ = 110 + 0.2x₁

§ For women (x₂ = 1), we predict ŷ = 100 + 0.4x₁

• In other words, β₃ represents the difference in the change in average SBP associated with a 1-year increase in age for women versus men
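• As a sketch of fitting this interaction model in MATLAB (hypothetical tbl as above): in Wilkinson formula notation, age*female expands to the main effects plus their product:

    fit_int = fitlm(tbl, 'sbp ~ age*female');  % same as 'sbp ~ age + female + age:female'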

Model specification

• We have seen several linear models relating SBP, age and gender. What linear regression models can we fit, in principle?

§ y = α + β₁x₁ + ε
§ y = α + β₁x₁ + β₂x₂ + ε
§ y = α + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε
§ y = α + β₁x₁ + β₃x₁x₂ + ε
[Scatterplot: y = SBP vs. x₁ = age, points marked by x₂ = female]

§ y = α + β₁age + ε
§ y = α + β₁x₁ + β₂x₂ + ε
§ y = α + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε
§ y = α + β₁x₁ + β₃x₁x₂ + ε
• Men and women constrained to same slope and intercept
[Scatterplot: y = SBP vs. x₁ = age with a single fitted line for both genders]

§ y = α + β₁x₁ + ε
§ y = α + β₁age + β₂female + ε
§ y = α + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε
§ y = α + β₁x₁ + β₃x₁x₂ + ε
• Adding "main effect" for gender allows same slope with different intercepts
§ assumption that age-SBP relationship is same for men/women
[Scatterplot: y = SBP vs. x₁ = age with parallel fitted lines for men and women]

§ y = α + β₁x₁ + ε
§ y = α + β₁x₁ + β₂x₂ + ε
§ y = α + β₁age + β₂female + β₃age·female + ε
§ y = α + β₁x₁ + β₃x₁x₂ + ε
• Adding interaction for gender allows different slopes and different intercepts
§ assumption that age-SBP relationship differs for men/women
[Scatterplot: y = SBP vs. x₁ = age with separately fitted lines for men and women]

§ y = α + β₁x₁ + ε
§ y = α + β₁x₁ + β₂x₂ + ε
§ y = α + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε
§ y = α + β₁age + β₃age·female + ε
• Omitting gender main effect allows different slopes but forces same intercepts
§ Rarely sensible, so include main effects first before interactions
[Scatterplot: y = SBP vs. x₁ = age with lines of different slope forced through a common intercept]

Categorical variables with multiple categories

• Suppose researchers want to study the relationship between systolic blood pressure, age, and US geographic region using multiple regression. Region is categorized as "Northeast", "South", "Midwest", or "West".
§ Since there are four categories, we must create three dummy variables to uniquely characterize each patient:

patient # | y = SBP | x₁ = age | region | x₂ = S | x₃ = MW | x₄ = W
1         | 120     | 45       | NE     | 0      | 0       | 0
2         | 135     | 40       | S      | 1      | 0       | 0
3         | 132     | 49       | MW     | 0      | 1       | 0
4         | 140     | 35       | W      | 0      | 0       | 1
etc.

• The fitted regression model is ŷ = α̂ + β̂₁x₁ + β̂₂x₂ + β̂₃x₃ + β̂₄x₄
• Consider four 40-year-old patients, one from each region. Their predicted SBP values will be:

§ NE (x₂ = x₃ = x₄ = 0): ŷ = α̂ + 40β̂₁

§ S (x₂ = 1, x₃ = x₄ = 0): ŷ = α̂ + 40β̂₁ + β̂₂

§ MW (x₃ = 1, x₂ = x₄ = 0): ŷ = α̂ + 40β̂₁ + β̂₃

§ W (x₄ = 1, x₂ = x₃ = 0): ŷ = α̂ + 40β̂₁ + β̂₄

• (β₂, β₃, β₄) represent the differences in average SBP between the categories (S, MW, W) and the "baseline" category (or the reference level), NE.
• For a categorical variable with multiple categories, you must use one fewer dummy variable than the number of categories. One category, by default, becomes the reference (baseline) category.
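• In practice the dummy coding is usually automated. A MATLAB sketch, assuming a hypothetical table with sbp, age, and a text column of region labels: fitlm builds the three dummy variables itself for a categorical predictor, using the first category as the reference level:

    tbl.region = reordercats(categorical(tbl.region), {'NE', 'S', 'MW', 'W'});  % NE = baseline
    fit_region = fitlm(tbl, 'sbp ~ age + region');  % adds dummies for S, MW, W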

Inference about a single β

• Because each parameter βⱼ describes the relationship between xⱼ and y, inference about βⱼ tells us about the strength of the linear relationship
§ in multiple linear regression, this is holding all other included covariates constant
• We can do hypothesis testing and form confidence intervals for a particular βⱼ

• Both require the estimated standard error of β̂ⱼ, se(β̂ⱼ)

§ In simple linear regression, the closed form is se(β̂) = s / √(∑ᵢ (xᵢ − x̄)²)

§ In multiple linear regression, there is no closed form for an individual se(β̂ⱼ)

• The 100(1 − α)% CI for a particular βⱼ is simply β̂ⱼ ± t_{n−k−1, 1−α/2} ⋅ se(β̂ⱼ)

• To test the hypothesis H₀: βⱼ = 0 vs. H₁: βⱼ ≠ 0, holding all other parameters fixed, we use the test statistic

t* = β̂ⱼ / se(β̂ⱼ) ~ t_{n−k−1}
• If H₀ is true, t* follows a t-distribution with n − k − 1 degrees of freedom, where k is the number of covariates included in the model
§ testing procedure follows exactly as before
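• A MATLAB sketch of both procedures for a fitted model fit (as returned by fitlm earlier): coefCI gives the intervals directly, and the t-statistics can be rebuilt from the coefficient table:

    ci = coefCI(fit, 0.05);                                    % 95% CIs, beta_hat +/- t * se
    tstat = fit.Coefficients.Estimate ./ fit.Coefficients.SE;  % t* = beta_hat / se(beta_hat)
    pval = 2 * tcdf(-abs(tstat), fit.DFE);                     % two-sided p, df = n - k - 1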

Matlab example output

• Model output gives all necessary components

SBP = α + β₁AGE + β₂WEIGHT + ε

> fit = fitlm(data, 'sbp ~ age + weight')

Linear regression model:

sbp ~ 1 + age + weight

Estimated Coefficients:
               Estimate     SE          tStat    pValue
 (Intercept)   60.56178     44.81431    1.351    0.219
 age           1.01226      0.3406004   2.972    0.021
 weight        0.2166858    0.2468829   0.878    0.409

Number of observations: 10, Error degrees of freedom: 7
Root Mean Squared Error: 11.583

§ Reading the output: Estimate = β̂ⱼ; SE = se(β̂ⱼ); tStat = t* = β̂ⱼ/se(β̂ⱼ); pValue = P(|t| > |t*|); Error degrees of freedom = n − k − 1 = 7; Root Mean Squared Error = s

Model checking

• For linear regression, the assumptions (in order of importance) are:
§ Linearity (debatable!)
§ independence of residuals
§ equal spread of points around the line
§ normality of residuals
§ [Random sample from a large population]

Linearity

• Linearity means E(y|x) = α + β₁x₁ + ⋯ + βₖxₖ
• Check this assumption in two ways:
• Graphically (for SLR): look for nonlinearity and outliers
§ more difficult to check linearity for MLR; consider pairwise scatterplots (plotting each xⱼ against y) and 'trellis graphs'
• Conceptually: based on the phenomenon of interest and the chosen predictors
§ e.g., age is related linearly to body weight for children
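• A one-line MATLAB sketch of the pairwise-scatterplot check (hypothetical columns):

    plotmatrix([tbl.age, tbl.weight, tbl.sbp]);  % scatterplot of every pair of variables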

• If violated:
§ estimates and predictions may be biased
• Strategies:
§ Consider transformations (log x, √x, 1/x, log y, etc.) or add interaction terms
§ Use nonlinear functions of x: spline (or polynomial) regression or other generalized additive models

• Transformations allow linear regression to model non-linear relationships between x and y, as long as there is a transformation of x or y (or both) that makes the relationship linear.

Figures borrowed from Ramsey FL, Schafer DW. The Statistical Sleuth, 2002

Independence of errors

• Independence of errors εᵢ means that:
§ Residuals for any two observations i and j do not "travel together" (after taking into account the corresponding x values)
• Checking: Were all relevant predictors included in the model of y? Examine the study design
§ Plot residuals vs. time/distance, when applicable
• If violated:
§ Doesn't lead to bias in estimates, but standard errors are affected, so tests and CIs can be misleading

• When checking the independence assumption, consider the following questions:
§ Do units interact in any way? (for example, belong to the same household)
§ Is there spatial (or temporal) proximity? – Units closer together (in space and time) are more likely to behave similarly than units farther apart.
§ Is there a common data-generating source? (for example, repeated measurements on the same subject)
§ Are there clusters where units tend to have similar responses? (for example, weights of cubs in the same litter)

• Modeling strategies:
§ Add more predictors, group dependent units in the same cluster (i.e., redefine units).
§ For cluster effects or repeated observations, consider linear regression with correlated errors, including

o Multilevel (or random-effects) models

o longitudinal models (see Sec. 13.14 in Rosner).
§ For serial effects, see time-series models (AR, MA, ARMA, etc.).

• If errors are positively correlated, we really have less information than we presume, which results in confidence intervals that are too short and Type I error that is inflated.

• Lack of independence between εᵢ and εⱼ implies that only part of the information about α, β₁, …, βₖ added by observation j is new; the rest has already been gained from observation i.
• Ultimately, aside from a more advanced model, the best solution is to design the study to ensure independence

Equal variance of errors (homoskedasticity)

• Equal variance of errors means Var(εᵢ) = σ² for all observations i.
• If violated:
§ Doesn't lead to bias in estimates, but standard errors are affected (tests and CIs can be misleading)
§ Prediction is more sensitive
• Strategies:
§ Consider transformations (log x, √x, 1/x, log y, etc.)
§ Use weighted regression, where each observation is weighted inversely proportional to its variance.
• Checking: look at the residual plot.
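• A MATLAB sketch of the weighted-regression strategy, assuming (hypothetically) that per-observation variances var_i are known or estimated:

    w = 1 ./ var_i;                                          % weight inversely proportional to variance
    fit_w = fitlm(tbl, 'sbp ~ age + weight', 'Weights', w);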

Residual plots

• Previously we defined a point's residual: eᵢ = yᵢ − ŷᵢ
• Because of the assumptions of linear regression, we expect all the residuals to be normally distributed with the same mean (0) and the same variance.
§ Violations of the linear regression assumptions can often be detected on a residual plot.
• Suppose data looked like:
[Scatterplot: y vs. x with a curved, quadratic-looking trend]

• Residual plots display the residuals on the y-axis and usually the predicted y-values on the x-axis.
• When looking at a residual plot consider these questions:
§ Is there any overall trend?
§ Is the variance constant?
• Problems (in this case, a quadratic trend) can often be solved by transforming the data in some way (e.g., adding an x² term)
[Residual plot: residuals vs. ŷ showing a quadratic trend]

• Here, the residuals don't have constant variance; it's increasing (residual plot looks like a funnel)

§ Strategy: often resolved by regressing log y, √y, or 1/y on x.

[Residual plot: residuals vs. ŷ fanning out in a funnel shape]
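• A MATLAB sketch of this check and fix, for a fitted model fit (names hypothetical):

    plotResiduals(fit, 'fitted');                   % residuals vs. fitted values; look for a funnel
    tbl.logsbp = log(tbl.sbp);                      % variance-stabilizing transformation of y
    fit_log = fitlm(tbl, 'logsbp ~ age + weight');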

Normality of errors

• We assume the particular distribution of the residuals, ε ~ N(0, σ²)
• If violated:
§ Doesn't lead to bias in estimates, and due to the CLT, standard errors are not affected much (unless residuals are long-tailed and there are outliers)
• Prediction is more sensitive, because it is based on the normality of the population distribution of y given x
• Strategies:
§ Ignore
§ Use regression with a t-distribution assumption on the errors (a form of robust regression)
• Checking: quantile-quantile (QQ) plot

• A QQ plot compares the theoretical normal distribution with the distribution of your data
§ Points should fall on the center line if they are truly normal

Figure borrowed from Ramsey FL, Schafer DW. The Statistical Sleuth, 2002
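• A one-line MATLAB sketch, for a fitted model fit:

    qqplot(fit.Residuals.Raw);  % raw residuals vs. theoretical normal quantiles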

Prediction with MLR

• So far we have focused on estimation and inference of associations between the covariates and the outcome
• We can also make predictions about what y-values we'd expect to see associated with any given x-value.
§ Including quantifying uncertainty of predictions and assessing predictive performance
• In fact, there are two (related) intervals worth considering:
1) for the average y among all individuals with a certain x-value
2) for the y-value of a single individual with a certain x-value

1) For a given value x*, the 100(1 − α)% CI for the average y-value among all individuals having x = x* is

ŷ ± t_{n−2, 1−α/2} ⋅ s ⋅ √( 1/n + (x* − x̄)² / ((n − 1)s_x²) )

§ where s² = ∑ᵢ eᵢ² / (n − 2), and s_x² is the sample variance of all x's
2) For a given value x*, the 100(1 − α)% prediction interval for y, the interval which contains 100(1 − α)% of the individual y-values having x = x*, is

ŷ ± t_{n−2, 1−α/2} ⋅ s ⋅ √( 1 + 1/n + (x* − x̄)² / ((n − 1)s_x²) )

§ There is additional uncertainty about predicting a particular person's outcome, beyond predicting the average of all similar people
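• Both intervals are available from MATLAB's predict; a sketch, assuming a hypothetical simple regression of SBP on age:

    fit_age = fitlm(tbl, 'sbp ~ age');                                    % hypothetical SLR fit
    xnew = table(40, 'VariableNames', {'age'});                           % a new 40-year-old patient
    [yhat, ci] = predict(fit_age, xnew, 'Prediction', 'curve');           % CI for the average y
    [~, pred_int] = predict(fit_age, xnew, 'Prediction', 'observation');  % wider interval for one person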

Extrapolation

• These intervals are valid within the region of observed data
§ They may be valid for predictions where x* is outside of the observed range, but only if the assumed linear association holds
§ Because this assumption cannot be validated (we do not know about regions where we do not have data), this is not recommended

[Figure: least-squares regression line through the observed data, extended toward an x* outside the observed range, with '???' marking the extrapolated prediction ŷ*]

Strategies for variable selection

1) Identify key objectives
2) Screen variables and identify the ones that are sensitive to the objectives; exclude redundancies
3) Exploratory analysis: graphical displays and correlation coefficients
§ Apply transformations, if necessary
4) Fit a rich model and perform model checks: residual plot, consider outliers
5) Simplify the model without losing too much of the initial explanatory quality
§ Possibly, perform automatic variable selection
§ If interested in prediction, perform cross-validation – set aside a portion of the data set to check the model
6) Finalize the model and proceed with analysis

Figure borrowed from Ramsey FL, Schafer DW. The Statistical Sleuth, 2002

Variable selection for MLR

• Our focus is on variable selection for prediction, e.g., identifying the best set of risk factors for a disease
§ No interpretation needed, no concerns about causal inference
§ If the goal is causal inference, we may need more advanced techniques (e.g., subclassification or matching) unless the data come from a randomized experiment
• Competing objectives:
§ include more variables in order to predict the outcome better
§ include not too many variables so the model does not pick up spurious associations
• Balancing this tension results in a model that fits your data well and will be able to accurately predict new observations
• A useful metric for 'prediction error' is SSR = ∑ᵢ (yᵢ − ŷᵢ)²

General principles of variable selection for prediction

1) Include all input variables that, for substantive reasons, might be expected to be important in predicting the outcome
2) It is not always necessary to include these inputs as separate predictors
§ Sometimes, several inputs can be averaged or summed to create a 'score' that can be used as a single predictor in the model
3) For inputs that have large effects, consider including their interactions as well

Adapted from Gelman and Hill, "Applied regression and multilevel modeling," p. 69

4) We suggest the following strategy for decisions regarding whether to exclude a variable from a prediction model, based on expected sign and statistical significance (typically measured at the 5% level; that is, a coefficient is "statistically significant" if its estimate is more than 2 standard errors from zero)
§ If a predictor is not statistically significant and has the expected sign, it is generally fine to keep it in. It may not help predictions dramatically but is also probably not hurting them
§ If a predictor is not statistically significant and does not have the expected sign (for example, a comorbidity having a protective effect on mortality), consider removing it from the model (that is, setting its coefficient to zero)
§ If a predictor is statistically significant and does not have the expected sign, then think hard about whether it makes sense. Try to gather data on potential lurking variables and include them in the analysis
§ If a predictor is statistically significant and has the expected sign, then by all means keep it in the model

Adapted from Gelman and Hill, "Applied regression and multilevel modeling," p. 69

To summarize:
• Consider including interactions for predictors with large and significant main effects
• In general, keep the variable if:
§ Sign makes sense (significant or not);
§ Significant but sign is unexpected. Think hard why:

o Additional interactions?

o Unobserved confounders?

o Ecological fallacy?
• In general, remove the variable if:
§ Insignificant AND sign is unexpected

Techniques for variable selection

1) Fixed set by design (treatment indicator + background variables)
2) Fit all possible subsets of models and find the one that fits the best according to some criterion:
§ e.g., AIC, BIC
3) Sequential: forward / backward / stepwise selection
4) Regularized/penalized regression methods

Model selection criteria

• If two models have the same number of parameters, choose the one with smaller residual variance.
§ More generally, take into account both the residual variance (σ̂²) and the number of parameters (k).
§ General form of any criterion: f(σ̂²) + g(k)
• For example, select the model that minimizes one of:

§ Akaike's Information Criterion: AIC = n log(∑ᵢ eᵢ² / n) + 2k

§ Bayesian Information Criterion: BIC = n log(∑ᵢ eᵢ² / n) + k log(n)
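• In MATLAB, a fitted LinearModel stores both criteria directly; a sketch (note fitlm uses the likelihood-based form −2 log L + 2p, which for Gaussian models differs from the formula above only by constants):

    aic = fit.ModelCriterion.AIC;
    bic = fit.ModelCriterion.BIC;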

Sequential variable selection

• Forward selection: At each step, a single variable can be added, but variables are never deleted. The process continues until no more improvement can be obtained by adding a variable.
• Backward elimination: The procedure starts with a model involving all the possible predictors. At each step a variable is deleted. The procedure stops when no more improvement can be obtained by deleting a variable.
• Stepwise selection: At each step, a single variable can be added or deleted. The process continues until no more improvement can be obtained by adding or deleting a variable.
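• A MATLAB sketch of stepwise selection (hypothetical tbl), starting from an intercept-only model and allowing any main effect to enter or leave:

    fit_step = stepwiselm(tbl, 'sbp ~ 1', 'Upper', 'linear');  % default criterion: F-test on SSE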

Sequential variable selection has many caveats:
• Forward, backward, and stepwise may lead to different final models
§ Inclusion/exclusion depends on correlation between the new variable and the ones that are already in the model
§ Different initial models may produce different final ones
• Sequential variable selection is a form of data snooping!
§ Particularly inappropriate if used for finding 'significant' associations; liable to inflate the type I error rate
• Think not "here is the best model," but rather "here is one possibly useful model"

Regularized/penalized regression

• So far, we fit the linear model by least squares, so that (α̂, β̂) = arg min over (α, β) of ∑ᵢ (yᵢ − α − β₁x₁ᵢ − ⋯ − βₖxₖᵢ)²
• Another class of variable selection techniques changes the quantity we minimize, to penalize the model for choosing too many variables
§ Force the estimates of unimportant predictors to 0, excluding them from the model

• For example, the LASSO estimator is defined by minimizing the penalized sum of squares
(α̂, β̂) = arg min over (α, β) of ∑ᵢ (yᵢ − α − β₁x₁ᵢ − ⋯ − βₖxₖᵢ)² + λ ∑ⱼ |βⱼ|
§ where λ is a "tuning" parameter indicating how heavily to 'penalize' the model
§ As λ increases, fewer predictors are included in the final model

o Plot shows coefficients going to 0 as log λ increases
• Other penalties exist, with various properties (e.g., SCAD, adaptive LASSO)
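• A MATLAB sketch tracing the LASSO coefficients over a grid of λ values (hypothetical predictor matrix X):

    X = [tbl.age, tbl.weight];
    [B, FitInfo] = lasso(X, tbl.sbp);  % B(:,j) holds the coefficients at FitInfo.Lambda(j)
    lassoPlot(B, FitInfo, 'PlotType', 'Lambda', 'XScale', 'log');  % coefficients shrink to 0 as lambda grows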

• As another example, the ridge estimator is defined by minimizing a slightly different penalized sum of squares
(α̂, β̂) = arg min over (α, β) of ∑ᵢ (yᵢ − α − β₁x₁ᵢ − ⋯ − βₖxₖᵢ)² + λ ∑ⱼ βⱼ²

§ As λ increases, LASSO usually results in some coefficients set to zero exactly.
§ However, for ridge, all coefficients slowly shrink toward zero.

• "When you have many small/medium-sized effects you should go with ridge. If you have only a few variables with a medium/large effect, go with LASSO." (Hastie, Tibshirani, Friedman)
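• A MATLAB sketch of the corresponding ridge trace (X as above; the final argument 0 asks ridge to restore the original scale and include an intercept):

    k = 0:0.1:10;                       % grid of penalty values
    B_ridge = ridge(tbl.sbp, X, k, 0);  % one column of coefficients per penalty; none exactly 0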

Selection of λ

• The variable selection properties of penalized regression models depend on the choice of λ
§ So, the variable selection problem is reduced to tuning a single parameter
• Several approaches: choose the λ that…
1) yields the model with the best predictive performance
2) yields the model minimizing a criterion, e.g., AIC or BIC
• Key question: how to assess predictive performance using a single sample?

Selection of λ

• If n is large, one may divide the data set as follows:

• The two (or 3) random parts are
§ Training set, used to fit the model.
§ Test set, used to check the predictive capability (e.g., SSR), refine the model, and select tuning parameters. Go back to training if needed.
§ Optional: Validation set, used once to estimate the final model's true prediction error.

• If we don't have a lot of data, we can still perform k-fold cross-validation, where the data are split into k groups, and the model is repeatedly fit on all but one group. Then its ability to predict the left-out group is recorded:
§ The average over all k groups estimates predictive performance
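• A MATLAB sketch of 5-fold cross-validation of the prediction error (SSR), with the same hypothetical tbl as before:

    cv = cvpartition(height(tbl), 'KFold', 5);
    ssr = 0;
    for i = 1:cv.NumTestSets
        mdl = fitlm(tbl(training(cv, i), :), 'sbp ~ age + weight');  % fit on 4/5 of the data
        yhat = predict(mdl, tbl(test(cv, i), :));                    % predict the held-out fold
        ssr = ssr + sum((tbl.sbp(test(cv, i)) - yhat).^2);           % accumulate squared errors
    end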
