Simple Linear Regression and Multiple Regression
15 – Simple & Multiple Regression and ANCOVA
Regression is used to study relationships between variables. Linear regression is used for a special class of relationships, namely, those that can be described by straight lines, or by generalizations of straight lines to many dimensions.
In regression we seek to understand how the value of a response variable (Y) is related to a set of explanatory variables or predictors (X’s). This is done by specifying a model that relates the mean or average value of the response to the predictors in mathematical form. In simple linear regression we have a single predictor variable (X) and we use a line (of the form y = mx + b) to relate the response to the predictor.
Regression Model Notation:
E(Y|X) =
In regression we first specify a functional form or model for E(Y|X). We then fit that model using the available data and assess the adequacy of the model. We may then decide to remove some of the explanatory variables that do not appear important, add others to improve the model, change the functional form of the model, etc. Regression is an iterative process where we fit a preliminary model and then modify our model based on the results.
Regression Examples:
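$E(\text{Cholesterol Level} \mid \text{Weight}, \text{Height}, \text{Age}, \text{Daily Caloric Intake}) = \beta_0 + \beta_1\,\text{Weight} + \beta_2\,\text{Height} + \beta_3\,\text{Age} + \beta_4\,\text{Calories}$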
E(Blood Pressure| ) =
E(Survival Time of Cancer Patient | ) =
Example 1: Diastolic and Systolic Blood Pressure in Hypertensive Patients
Data File: Hypertension.JMP
Background: These data were collected as part of a large study looking at patients with hypertension.
Variables:
Diastolic BP: diastolic blood pressure (mmHg)
Systolic BP: systolic blood pressure (mmHg)
Goal: Investigate the relationship between diastolic blood pressure (X) and systolic blood pressure (Y).
Note: the roles of X and Y could certainly be reversed for this study. In others, there is usually a clear response of interest.
Assumptions for a Simple Linear Regression Model
1. The mean of the response variable (Y) can be modeled using X in the following form: $E(Y \mid X) = \beta_0 + \beta_x X$. Recall: E(Y|X) is the notation we use to denote the mean value of Y given X. Here we are using a line to summarize the mean value of Y as a function of X.
2. The variability in the response variable (Y) must be the same for each X, i.e. $\mathrm{Var}(Y \mid X) = \sigma^2$ or $\mathrm{SD}(Y \mid X) = \sigma$. In other words, the variance of Y must be constant across the entire range of X.
3. The response measurements (Y’s) should be independent of each other. If a simple random sample is taken from the population, this is typically the case. One situation where this assumption is violated is when data are collected over time.
4. The response measurements (Y), for a given value of X, should follow a normal distribution.
You should also take the time to identify outliers and influential points. Outliers and influential points can be very problematic in a regression model. We will discuss how to check the assumptions outlined above after fitting our initial model.
[Figure: two example scatter plots, one showing an outlier and one showing an influential point.]
Start with a Plot of your Data
Select Analyze > Fit Y by X and place Systolic BP (mmHg) in the Y box and Diastolic BP (mmHg) in the X box.
The resulting plot is shown below.
Fitting the regression model relating Systolic BP (Y) to Diastolic BP (X)
What is the population model? We want to model the mean value of Y using X, so the model is given by:
$E(Y \mid X) = \beta_0 + \beta_x X$
or, being specific for this situation, we have
$E(\text{Systolic} \mid \text{Diastolic}) = \beta_0 + \beta_D \cdot \text{Diastolic}$
Again: E(Y|X) is the notation we use to denote the mean value of Y given X.
Taking a closer look at its pieces:
For this data set, Y = Systolic BP and X = Diastolic BP.
How is the line that best fits our data determined? Answer: Method of Least Squares
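To make the least squares idea concrete, here is a minimal sketch in plain Python of the closed-form slope and intercept that minimize the sum of squared residuals. The function name and the data values are made up for illustration; they are not the Hypertension.JMP data and this is not how JMP is implemented internally.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of X about its mean, and cross-products of X and Y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up illustration values (NOT the Hypertension.JMP data)
diastolic = [70, 80, 90, 100, 110]
systolic = [115, 124, 140, 150, 166]

b0, b1 = least_squares_line(diastolic, systolic)
print(f"fitted line: systolic = {b0:.2f} + {b1:.3f} * diastolic")
```

The fitted line always passes through the point of means (x-bar, y-bar), which is why the intercept is computed from the slope and the two sample means.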
To fit the model
$E(\text{Systolic} \mid \text{Diastolic}) = \beta_0 + \beta_D \cdot \text{Diastolic}$
we first select Analyze > Fit Y by X and place Systolic BP in the Y box and Diastolic BP in the X box as shown below. This will give us a scatter plot of Y vs. X, from which we can fit the simple linear regression model.
The resulting scatter plot, with the regression line added, is shown below.
To perform the regression of Systolic BP on Diastolic BP in JMP, select Fit Line from the Bivariate Fit... pull-down menu.
The resulting output is shown below.
We begin by asking whether the regression is useful at all:
$H_0$: Regression is NOT useful
$H_a$: Regression is useful
This says that Diastolic BP (X) explains a significant amount of variation in the response Systolic BP (Y).
Next we can assess the importance of the X variable, Diastolic BP in this case:
$H_0: \beta_D = 0$
$H_a: \beta_D \neq 0$
The null hypothesis is equivalent to saying that X is not useful for explaining Y. Thus the results of these two tests are identical in every way!
Conclusions from both tests:
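The equivalence of the two tests can be checked numerically: in simple linear regression the F statistic for the overall test equals the square of the t statistic for the slope, so the two p-values coincide. The sketch below uses plain Python, a hypothetical helper name, and made-up data (not Hypertension.JMP).

```python
import math


def overall_f_and_slope_t(x, y):
    """Return (F statistic for the whole model, t statistic for the slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Error and total sums of squares
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    mse = sse / (n - 2)                      # mean square error, df = n - 2
    f_stat = (sst - sse) / mse               # model mean square (1 df) over MSE
    t_stat = slope / math.sqrt(mse / sxx)    # slope divided by its standard error
    return f_stat, t_stat


# Made-up illustration values
x = [70, 80, 90, 100, 110]
y = [115, 124, 140, 150, 166]
f_stat, t_stat = overall_f_and_slope_t(x, y)
print(f"F = {f_stat:.3f}, t^2 = {t_stat ** 2:.3f}")
```

This identity is specific to models with a single predictor; with several predictors the overall F test and the individual t tests answer different questions.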
Determining how well the model is doing in terms of explaining the response is done using the R-Square and Root Mean Square Error:
The proportion or percentage of variation explained by the regression of Systolic BP (Y) on Diastolic BP (X) is given by the R-Square = .547 or 54.7%.
The amount of unexplained variability in Systolic BP (Y) taking Diastolic BP (X) into consideration is given by the RMSE (root mean square error). This is an estimate of the SD(Systolic BP|Diastolic BP).
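As a rough illustration of how these two summaries are computed, the following plain-Python sketch obtains R-Square as 1 - SSE/SST and RMSE as the square root of SSE/(n - 2). The helper name and data are made up for illustration and will not reproduce the .547 reported for the hypertension data.

```python
import math


def r_square_and_rmse(x, y):
    """Return (R-Square, RMSE) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r_square = 1 - sse / sst           # proportion of variation explained
    rmse = math.sqrt(sse / (n - 2))    # estimate of SD(Y | X)
    return r_square, rmse


# Made-up illustration values
x = [70, 80, 90, 100, 110]
y = [115, 124, 140, 150, 166]
r2, rmse = r_square_and_rmse(x, y)
print(f"R-Square = {r2:.3f}, RMSE = {rmse:.3f}")
```

R-Square is unitless, while RMSE is in the units of the response (mmHg for blood pressure), which often makes RMSE the easier of the two to interpret.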
Describing the Relationship:
$\hat{E}(\text{Systolic BP} \mid \text{Diastolic BP}) = 26.476 + 1.242 \cdot \text{Diastolic BP}$
Interpret each of the parameter estimates:
CI’s for the Parameter Estimates
(estimate) ± (t-table) × SE(estimate), where the t-table value comes from the t-distribution with df = n – 2.
95% CI for the Slope ($\beta_D$)
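The interval formula is simple enough to sketch directly. In the snippet below only the slope estimate 1.242 comes from these notes; the standard error and the t-table value are hypothetical placeholders that would be read from the JMP output and a t table with df = n - 2.

```python
def confidence_interval(estimate, std_error, t_table):
    """Return (lower, upper) for estimate +/- t_table * std_error."""
    margin = t_table * std_error
    return estimate - margin, estimate + margin


slope_hat = 1.242   # slope estimate reported in these notes
se_slope = 0.10     # HYPOTHETICAL standard error, for illustration only
t_value = 1.98      # HYPOTHETICAL t-table value; the real one depends on df = n - 2

lower, upper = confidence_interval(slope_hat, se_slope, t_value)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

If the resulting interval excludes 0, that agrees with rejecting the null hypothesis that the slope is 0 at the corresponding significance level.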
What do we estimate for the mean systolic blood pressure of hypertension patients that have a diastolic blood pressure of 100 mmHg?
If we picked a single hypertensive patient with a diastolic blood pressure of 100 mmHg at random, what do we predict their systolic blood pressure will be?
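Both questions are answered by the same point estimate, obtained by plugging X = 100 into the fitted line. A small sketch, assuming the fitted equation is 26.476 + 1.242 × Diastolic BP (the plus sign is inferred from the interval values reported later in these notes); the function name is hypothetical.

```python
def estimated_systolic(diastolic_bp):
    """Point estimate of systolic BP from the fitted line in these notes.

    Assumes the fitted equation is 26.476 + 1.242 * Diastolic BP.
    """
    return 26.476 + 1.242 * diastolic_bp


point_estimate = estimated_systolic(100)
print(f"Estimated systolic BP at diastolic 100 mmHg: {point_estimate:.3f} mmHg")
```

The estimate of the mean and the prediction for one patient share this value; what differs is the uncertainty attached to it, which is much larger for a single patient.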
Checking the Assumptions:
Checking the adequacy of the model in terms of the basic assumptions involves looking at plots of the residuals from the fit. One plot that is particularly helpful for checking a number of the model assumptions is a plot of the residuals vs. the fitted values. When we are performing simple linear regression we can alternatively plot the residuals vs. X, which is what JMP gives by default.
Ideal Residual Plot:
Violations to Assumption #1:
[Two residual plots, each showing a trend.] The trend need not be linear (BAD)
Violations to Assumption #2:
[Two residual plots.] Megaphone opening to the right (BAD); megaphone opening to the left (BAD)
Violations to Assumption #3:
[Two residual plots.] One point closely following another -- positive autocorrelation (BAD); extreme bouncing back and forth -- negative autocorrelation (BAD)
Violations to Assumption #4:
To check this assumption, simply save the residuals and make a histogram of the residuals and/or look at a normal quantile plot. Recall that you can easily make a histogram of a variable under Analyze > Distribution. We should generally assess normality using a normal quantile plot as well.
Checking for outliers:
Determine the value of 2*RMSE and place bands at ±2*RMSE on the residual plot. Any observations whose residuals fall outside these bands are potential outliers and should be investigated further to determine whether or not they adversely affect the model.
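The 2*RMSE rule can be sketched in a few lines of plain Python. The residuals below are made up for illustration (they sum to zero, as least squares residuals do), and the function name is hypothetical.

```python
import math


def flag_potential_outliers(residuals, n_params=2):
    """Return the indices of residuals lying outside the +/- 2*RMSE bands.

    n_params is the number of estimated coefficients (2 for a line:
    intercept and slope), so RMSE uses n - n_params degrees of freedom.
    """
    n = len(residuals)
    rmse = math.sqrt(sum(e * e for e in residuals) / (n - n_params))
    cutoff = 2 * rmse
    return [i for i, e in enumerate(residuals) if abs(e) > cutoff]


# Made-up residuals for illustration only
residuals = [0.5, -1.0, 0.8, -0.3, 6.0, -0.7, 0.2, -0.9, -2.4, -2.2]
flagged = flag_potential_outliers(residuals)
print("Potential outliers at positions:", flagged)
```

Flagged points are candidates for a closer look, not automatic deletions; with normally distributed errors roughly 5% of observations are expected to fall outside these bands anyway.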
Checking for outliers in this example we find:
To obtain the residual plot in JMP select Linear Fit > Plot Residuals.
[Residual plot with two groups of points annotated as mild outliers.]
THE ASSUMPTION CHECKLIST:
Model Appropriate:
Constant Variance:
Independence:
Normality Assumption (see histogram above):
Identify Outliers:
Confidence Interval for the Average or Mean Systolic Blood Pressure: (i.e. the average systolic blood pressure for the entire population of hypertension patients with a given diastolic blood pressure.)
Select Confid Curves Fit from the Linear Fit pull-down menu located below the scatter plot. The narrow bands in the plot below represent the CI for the Mean Systolic Blood Pressure. For example, from the plot below we estimate that the mean systolic blood pressure for patients with a diastolic blood pressure of 80 mmHg is likely to be somewhere between 122 – 126 mmHg. We will examine a more precise way of obtaining such intervals later in the tutorial.
Prediction Interval for the Systolic Blood Pressure of a Single Patient : (i.e., for a single hypertension patient sampled from the population of all hypertension patients with a specified diastolic blood pressure.)
Select Confid Curves Indiv from the Linear Fit pull-down menu, which is located directly beneath the scatter plot. These are the wider bands in the plot above. For example, if we were to sample a single individual with a diastolic blood pressure of 100 mmHg, we estimate with 95% confidence that their systolic blood pressure will be somewhere between 125 – 170 mmHg. We will examine a more precise way of obtaining exact intervals of this form later in the tutorial.
Using the Analyze > Fit Model Option to Perform the Regression
An alternative to using Fit Y by X to perform simple linear regression is to use the Fit Model option from the Analyze menu. The advantages of this approach are two-fold:
1) You have access to more detailed results from your regression and have enhanced features for estimation/prediction of Y.
2) It allows for the addition of more predictors (X’s) to your model. This is called multiple regression and will be discussed later.
For the blood pressure example we fit the model as follows:
Select Analyze > Fit Model and place Systolic BP in the Y box and Diastolic BP in the Model Effects box.
[Screenshot of the Fit Model dialog: the Y variable goes in the Y box; X variables go in the Model Effects box.]
If we had more predictors (X’s) that we wanted to add to our model we would simply put them in the Model Effects box, e.g. we could add information about the age, gender, height, weight, etc. of the patient as predictors. When we have more than one predictor in a linear regression we call it a multiple regression.
The output from the Analyze > Fit Model option is shown below:
The same numeric summaries, parameter estimates, and test results are contained as part of the standard output. The plot at the top is NOT a scatter plot of Y vs. X. It is a plot of the actual Y values vs. the predicted values from the regression model. The stronger the trend exhibited the better the fit. It is essentially a visualization of the R-square for the fitted model.
In cases where we have multiple values of the response for some of the X values we will be given the results of the Lack Of Fit test. If the p-value here is small, it can indicate that our model does not adequately describe the mean of the response variable Y, for example when we fit a line to a clearly nonlinear relationship. In cases where there is significant lack of fit the plot of the residuals vs. the fitted values will generally show some non-linear trend.
[Figure: Residual Plot]
The bulk of the output is the same as that obtained using the Analyze > Fit Y by X approach.
Estimation of E(Y|X), the Mean Value of Y for a given X & Prediction of Y for an Individual with a given X
You can save 95% Confidence Intervals for E(Y|X) to the data spreadsheet by selecting Mean Confidence Interval.
You can save 95% Prediction Intervals for Individual Y values to the data spreadsheet by selecting Indiv Confidence Interval.
Below is a portion of the data spreadsheet showing both types of intervals.
Interpretation of the 95% Confidence Interval for E(Systolic BP|Diastolic BP = 80)
Consider estimating the average/mean systolic blood pressure for all hypertension patients with a diastolic blood pressure of 80 mmHg. A 95% confidence interval for this mean is given by the interval 124.32 mmHg to 127.42 mmHg. There is a 95% chance this interval covers the true mean systolic blood pressure of hypertension patients with diastolic blood pressures of 80 mmHg in the population.
Interpretation of the 95% Prediction Interval for Systolic BP|Diastolic BP = 100
Suppose we picked one individual at random from the population of all hypertension patients with diastolic blood pressures of 100 mmHg. What do we estimate the systolic blood pressure for this individual will be? We estimate, with 95% confidence, that the actual blood pressure for this particular individual will be somewhere between 129.0 mmHg and 172.4 mmHg. This range of pressures has a 95% chance of covering the systolic blood pressure for this randomly selected individual with a diastolic blood pressure of 100 mmHg. Notice how much wider this interval is when compared to the interval for the mean systolic blood pressure above. This should seem natural, as it is much harder to predict the systolic blood pressure of a single randomly selected individual than the mean systolic blood pressure of all individuals with a specific diastolic blood pressure.
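The difference in width comes from one extra term in the standard error. The sketch below uses the standard simple-linear-regression formulas: the interval for the mean uses RMSE × sqrt(1/n + (x0 - x̄)²/Sxx), while the interval for an individual uses RMSE × sqrt(1 + 1/n + (x0 - x̄)²/Sxx). The data are made up (not Hypertension.JMP), the helper name is hypothetical, and the t-table value 3.182 is the 97.5th percentile of the t-distribution with 3 degrees of freedom, matching n - 2 for these five points.

```python
import math


def mean_ci_and_prediction_interval(x, y, x0, t_table):
    """Return ((CI for mean), (PI for an individual)) at X = x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))
    fit = intercept + slope * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = rmse * math.sqrt(leverage)        # uncertainty in the estimated mean
    se_indiv = rmse * math.sqrt(1 + leverage)   # adds person-to-person variation
    ci = (fit - t_table * se_mean, fit + t_table * se_mean)
    pi = (fit - t_table * se_indiv, fit + t_table * se_indiv)
    return ci, pi


# Made-up illustration values
x = [70, 80, 90, 100, 110]
y = [115, 124, 140, 150, 166]
ci, pi = mean_ci_and_prediction_interval(x, y, x0=100, t_table=3.182)
print(f"95% CI for the mean:        ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI for one individual:  ({pi[0]:.1f}, {pi[1]:.1f})")
```

Both intervals are centered at the same fitted value, and both widen as x0 moves away from the mean of X; the prediction interval never shrinks below ±t × RMSE no matter how large the sample.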
Example 2 – BP (Y) vs. Age (X)
$\hat{E}(\text{BP} \mid \text{Age}) = 112.32 + .445 \cdot \text{Age}$
Conclusions from tests results highlighted above:
Discussion of Residual Plot:
Distribution of the Residuals
Multiple Regression Example: Berkeley Guidance Study (BGSgirls.JMP)
The data for this example are excerpted from the Berkeley Guidance Study, a longitudinal study monitoring the growth of boys and girls in Berkeley, CA, born between January 1928 and June 1929.
The variables in the data for girls are:
WT2 = weight at age 2 (kg)
HT2 = height at age 2 (cm)
WT9 = weight at age 9
HT9 = height at age 9
LEG9 = leg circumference at age 9 (cm)
STR9 = a composite measure of strength at age 9 (high values = stronger)
WT18 = weight at age 18
HT18 = height at age 18
LEG18 = leg circumference at age 18
STR18 = strength at age 18
SOMA = somatotype, a seven-point scale, as a measure of fatness (1 = slender, 7 = fat), determined using a photograph taken at age 18.
In this example we will develop a multiple regression model for SOMA at age 18 using as potential predictors the variables from ages 2 and 9 only. We begin by examining a scatter plot matrix of the potential predictors and the response, somatotype. To do this in JMP select Multivariate from the Analyze menu, place the response (SOMA) and the predictors (WT2, HT2, WT9, HT9, LEG9, STR9) in the right-hand Y box, and click OK. To obtain pair-wise correlations and tests of their significance select the Pairwise Correlations option from the Multivariate pull-down menu. The results are shown below:
Significance tests for the pair-wise correlations:
We can see that weight and leg circumference at age 9 exhibit the strongest linear relationship with the response, while height at age 2 and strength at age 9 exhibit the weakest correlation with the somatotype.
We now will fit a preliminary multiple regression model using all potential predictors.
To fit this model in JMP select Fit Model from the Analyze menu and place SOMA in the Y box and all of the predictors in the Model Effects box as shown below.
To fit the model click Run Model. The resulting model is shown on the next page.
Before beginning any model simplification we will examine residual plots to check basic model assumptions. This plot is given at the bottom of the column of output with the heading Response Soma.
The plot suggests no obvious model violations. There is a mild outlier in the lower right-hand corner of the plot. The stripes in the plot are due to the ordinal/discrete nature of the response and are of little concern. A normal quantile plot for the residuals is shown below. This is obtained in the usual way, but we need the residuals saved to the data spreadsheet first. To do this select Response Soma > Save Columns > Stud. Residuals.
Studentized residuals are essentially the residuals divided by the estimated standard deviation (RMSE), i.e. $\hat{\sigma}$, the estimate of SD(Y|X). Thus the studentized residual can be interpreted as a z-score. Approximately 95% of the studentized residuals should fall between -2 and 2. Extreme studentized residual values can indicate that a given observation is an outlier.
Assessing the normality of the studentized residuals.
With the exception of the outlier evident in the previous plot, normality appears to be satisfied. The effect tests for the individual predictors suggest that the model could be simplified by removing several terms.
The individual tests suggest that WT2, HT9, LEG9, & STR9 could potentially be removed from the model. Backward Elimination is a model development strategy where we first fit a model that includes all potential predictors of interest and then proceed to remove insignificant predictors/effects one at a time until no further terms can be removed. We remove the term with the largest p-value first and then continue removing until all remaining terms are significant at some specified level of significance. Oftentimes we use .10 rather than the usual .05 level for determining significance of an individual predictor. We begin by taking out the height at age 9 (HT9) term, whose associated p-value = .9050. The results for this simpler model are shown below.
Leg circumference at age 9 (p = .6100) could be removed next, resulting in the following:
Finally we will remove weight at age 2 (WT2, p = .5840).
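The backward elimination loop itself can be sketched independently of any fitting software. In the sketch below, `p_values_for` is a hypothetical callback standing in for "refit the model with these predictors and return each term's p-value" (in practice, JMP's Fit Model output). The stub used here returns fixed values: the p-values for HT9, LEG9, and WT2 are the ones quoted in these notes, while those for HT2, WT9, and STR9 are made-up placeholders. In a real analysis every p-value changes each time the model is refit.

```python
def backward_eliminate(predictors, p_values_for, alpha=0.10):
    """Drop the least significant predictor until all have p-value <= alpha.

    p_values_for(current) must return a dict mapping each predictor name in
    `current` to its p-value in the model refit with exactly those predictors.
    """
    current = list(predictors)
    while current:
        p_values = p_values_for(current)
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break                  # every remaining term is significant
        current.remove(worst)      # remove one term at a time, then refit
    return current


# Stub for illustration only; a real callback would refit the regression.
ILLUSTRATIVE_P_VALUES = {
    "HT9": 0.9050, "LEG9": 0.6100, "WT2": 0.5840,   # quoted in these notes
    "HT2": 0.01, "WT9": 0.001, "STR9": 0.08,        # MADE-UP placeholders
}


def lookup_p_values(current):
    return {name: ILLUSTRATIVE_P_VALUES[name] for name in current}


final_terms = backward_eliminate(
    ["WT2", "HT2", "WT9", "HT9", "LEG9", "STR9"], lookup_p_values, alpha=0.10
)
print("Terms retained:", final_terms)
```

Removing only one term per pass matters because predictors are often correlated: dropping one can change the significance of those that remain.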
Although STR9 does not test as significant at the .05 level we will leave it in, leaving us with the following model for the mean somatotype given HT2, WT9, and STR9.
$E(\text{SOMA} \mid \text{HT2}, \text{WT9}, \text{STR9}) = \beta_0 + \beta_1\,\text{HT2} + \beta_2\,\text{WT9} + \beta_3\,\text{STR9}$
The negative coefficients for HT2 and STR9 seem surprising considering the fact that both are positively correlated with somatotype. To help understand how this can happen do the following: Use Distribution to obtain a histogram for WT9. Next use Fit Y by X to construct scatter plots of SOMA vs. HT2 and SOMA vs. Str9. Finally click on bars in the histogram for WT9 and examine the relationship between SOMA & HT2 and SOMA & STR9 for the highlighted points in the scatter plots. What do we see?
We see that the relationship between both SOMA & HT2 and SOMA & STR9 is negative when conditioning on WT9.
In multiple regression the marginal relationships between the response (Y) and the individual predictors (X) convey little useful information about their role in a multiple regression model!
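A small made-up example shows how a marginal relationship and a conditional relationship can have opposite signs. Here `x` plays the role of a predictor such as HT2 and the two groups play the role of low and high bars in the WT9 histogram; all names and numbers are hypothetical and chosen only to make the reversal easy to see.

```python
def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


# Made-up data: within each group y falls as x rises,
# but the high group has both larger x and larger y.
x_low, y_low = [1, 2, 3], [3, 2, 1]
x_high, y_high = [4, 5, 6], [8, 7, 6]

marginal_slope = slope(x_low + x_high, y_low + y_high)   # ignores the grouping
slope_within_low = slope(x_low, y_low)                   # conditions on the group
slope_within_high = slope(x_high, y_high)

print(f"marginal slope:    {marginal_slope:+.3f}")
print(f"within low group:  {slope_within_low:+.3f}")
print(f"within high group: {slope_within_high:+.3f}")
```

A multiple regression coefficient behaves like the within-group slopes: it describes the relationship with the response after the other predictors are held fixed.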
Diagnostic plots (residuals vs. fitted and residual normal quantile) for the final three-predictor model are shown below.
Again no major model violations are suggested; however, there is a fairly extreme outlier. The studentized residual value is -3.76 for this observation. This could indicate a potential error in recording the data for this individual, or it may simply be the case that this particular girl is very atypical. Her actual somatotype at age 18 is much smaller than is expected given her body measurements at ages 2 and 9. Perhaps she was very chubby as a younger child and grew to be tall and thin as an adult, or perhaps she has an eating disorder. While it is generally not acceptable to delete observations from your data without good reason, it is interesting to see how the analysis changes when this girl is excluded from the study. This was done, and the resulting model obtained via backward elimination is identical to the one above. The summary of the final model with the outlier deleted is shown below.
Regression Summary with Outlier Deleted
What are the major changes we see?
The plots below are called Effect Leverage Plots. They are equivalent to a more commonly employed graphical device called an Added Variable Plot (AVP). These plots show the relationship between the response (SOMA) and each of the predictors adjusted for the other terms in the model. The negative estimated coefficients for HT2 and STR9 are supported by the negative adjusted relationships for these terms. If the dashed red lines do not completely contain the horizontal blue line then the term is deemed significant. Clearly weight at age 9 has the strongest adjusted relationship with somatotype.
A plot of the actual somatotype values (Y) vs. the fitted values ($\hat{Y}$) from the model is shown below. The R-Square = .62 (or 62%) is the square of the correlation between Soma Actual and Soma Predicted. RMSE is an estimate of the standard deviation of somatotype given the age 2 and 9 measurements used in the final model.
Analysis of Covariance (ANCOVA) and Multiple Regression
Example 1 – Gestational Age, Birth Weight, and Mother’s Smoking Status During Pregnancy
Data File: Lowbirthweight.JMP
The variables in this data file are:
id – identification number of the infant (labeling purpose only)
headcir – head circumference of infant (nearest in.)
length – length of infant (nearest in.)
weight – birth weight (lbs.)
gest – gestational age of infant (weeks)
mage – mother’s age
mnocig – daily number of cigarettes during pregnancy, mother
mheight – mother’s height (nearest in.)
mppwt – mother’s pre-pregnancy weight (lbs.)
fage – father’s age
fedyrs – father’s education level (yrs.)
fnocig – daily number of cigarettes, father
fheight – father’s height (in.)
lowbwt – low birth weight indicator (1 = yes, 0 = no)
mage35 – mother over 35 years of age (1 = yes, 0 = no)
smoker – mother smoked during pregnancy (1 = yes, 0 = no)
There are many questions of interest one could examine using these data. In this analysis we will examine the relationship between birth weight and smoking. We could do this by using a two-sample t-test, either pooled or non-pooled depending upon the equality of population variances, to compare the mean birth weight of infants born to non-smokers vs. smokers. The results of such an analysis are presented below.
Here we can see that the mean birth weight of infants born to smokers is significantly lower than the mean birth weight of infants born to non-smokers (p < .0001). In particular we estimate that the mean birth weight of infants born to smokers is between .33 and .65 lbs. less than the mean birth weight of infants born to non-smokers.
Does this mean that if we compared infants that have the same gestational age, those born to smokers would have a smaller birth weight, by between .33 and .65 lbs. on average, than those born to non-smokers?
Perhaps smoking during pregnancy leads to infants being born earlier and hence having a smaller birth weight as a result. We cannot tell unless we include information about gestational age in our analysis. To do this we can use Analysis of Covariance (ANCOVA), which is really just multiple regression where one of the predictors/covariates is a factor of interest (i.e. smoking in this example) and the other variables (covariates) are used as “adjustments”. For example, if we include information about gestational age in our model we will be able to say “adjusting for gestational age of the infant, we estimate that the effect of smoking during pregnancy is (fill in the blank) on the birth weight of infants”.
How do we include information about smoking and gestational age in a multiple regression model?
Potential Models
1) Smoking Effects Only
$E(\text{Birth Weight} \mid \text{Smoking Status}) = \beta_0 + \beta_1\,\text{Smoking Status}$
where
$\text{Smoking Status} = \begin{cases} +1 & \text{if the infant was born to a non-smoker} \\ -1 & \text{if the infant was born to a smoker} \end{cases}$
so the regression model can be expressed separately for smokers and non-smokers as follows:
To fit this model in JMP select Analyze > Fit Model and place weight in the Y box and smoker in the Construct Model Effects box.
[Screenshot of the Fit Model dialog: weight goes in the Y box; smoker goes in the Construct Model Effects box.]
Here are the results of fitting the smoking status model outlined above in JMP using Analyze > Fit Model.
The sample mean birth weights for infants born to non-smokers (0) and smokers (1) are given above.
How is our estimated regression model related to these means and the t-test above?
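One way to explore this question is with a tiny numerical experiment. The sketch below assumes Smoking Status is coded +1 for non-smokers and -1 for smokers, and uses made-up birth weights (not Lowbirthweight.JMP). Under that coding the fitted intercept is the average of the two group means and the slope is half the difference between them, so the fitted values reproduce the group means exactly.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up birth weights in lbs, for illustration only
nonsmoker_weights = [7.0, 7.5, 8.0, 7.9]
smoker_weights = [6.5, 7.0, 7.2]

# ASSUMED coding: +1 = non-smoker, -1 = smoker
status = [+1] * len(nonsmoker_weights) + [-1] * len(smoker_weights)
weights = nonsmoker_weights + smoker_weights

b0, b1 = least_squares_line(status, weights)
mean_nonsmoker = sum(nonsmoker_weights) / len(nonsmoker_weights)
mean_smoker = sum(smoker_weights) / len(smoker_weights)

print(f"b0 + b1 = {b0 + b1:.3f}  (non-smoker mean = {mean_nonsmoker:.3f})")
print(f"b0 - b1 = {b0 - b1:.3f}  (smoker mean     = {mean_smoker:.3f})")
print(f"2 * b1  = {2 * b1:.3f}  (difference in means)")
```

Because 2 × b1 is the difference in sample means, a test of the smoking coefficient in this model corresponds to the pooled two-sample t-test comparing the two groups.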
2) Both Smoking and Gestational Age (Parallel Lines Model)
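$E(\text{Birth Weight} \mid \text{Smoking Status}, \text{Gest. Age}) = \beta_0 + \beta_1\,\text{Smoking Status} + \beta_2\,\text{Gest. Age}$
where
$\text{Smoking Status} = \begin{cases} +1 & \text{if the infant was born to a non-smoker} \\ -1 & \text{if the infant was born to a smoker} \end{cases}$
and Gest. Age = gestational age of the infant in weeks.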
Picture of this model :
To fit this model in JMP we again use Analyze > Fit Model and place weight in the Y box and both smoker and gest in the Construct Model Effects box as shown below.
The results from JMP are shown below.
Our estimated model is:
$\hat{E}(\text{Birth Weight} \mid \text{Smoking Status}, \text{Gest. Age}) = -1.95 + .202\,\text{Smoking Status} + .237\,\text{Gest. Age}$
For smokers we have
For non-smokers we have
Predict the mean birth weight for an infant with a gestational age of 36 weeks born to a smoker.
Predict the mean birth weight for an infant with a gestational age of 36 weeks born to a non-smoker.
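These predictions amount to plugging values into the fitted parallel lines equation. The sketch below rests on two assumptions that should be checked against the JMP output: that the estimated equation is -1.95 + .202 × Smoking Status + .237 × Gest. Age, and that Smoking Status is coded +1 for non-smokers and -1 for smokers. The function name is hypothetical.

```python
def estimated_mean_birth_weight(gest_weeks, smoker):
    """Estimated mean birth weight (lbs) from the parallel lines model.

    ASSUMES the fitted equation -1.95 + .202*status + .237*gest with
    status = +1 for non-smokers and -1 for smokers.
    """
    status = -1 if smoker else +1
    return -1.95 + 0.202 * status + 0.237 * gest_weeks


smoker_36 = estimated_mean_birth_weight(36, smoker=True)
nonsmoker_36 = estimated_mean_birth_weight(36, smoker=False)

print(f"smoker, 36 weeks:     {smoker_36:.3f} lbs")
print(f"non-smoker, 36 weeks: {nonsmoker_36:.3f} lbs")
print(f"difference:           {nonsmoker_36 - smoker_36:.3f} lbs")
```

Under these assumptions the gap between the two lines is 2 × .202 = .404 lbs at every gestational age, which is what "parallel lines" means.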
95% CI for the “Smoking Effect” for Infants with a Given Gestational Age
What if the effect of gestational age is different for smokers and non-smokers? For example, maybe for smokers an additional week of gestational age does not translate to the same increase in birth weight as it does for non-smokers? What should we do?
3) Both Smoking and Gestational Age (Unrelated Lines Model)
$E(\text{Birth Weight} \mid \text{Smoking Status}, \text{Gest. Age}) = \beta_0 + \beta_1\,\text{Smoking Status} + \beta_2\,\text{Gest. Age} + \beta_{12}\,(\text{Gest. Age} - 39.77)\,\text{Smoking Status}$
For smokers we have the following
For non-smokers we have the following
Picture of the unrelated lines model:
Select grouping variable from the list
To visualize the unrelated lines model in JMP select Analyze > Fit Y by X to construct the plot of Y vs. X. Next from the pull-down menu in upper left-hand corner of the plot select Group By… and highlight the categorical variable or factor you wish to use in constructing the unrelated lines (see above).
To fit the unrelated lines regression model in JMP select Analyze > Fit Model, put weight in the Y box, and then highlight both gest and smoker in the list of variables while holding down the CTRL key. Next click on Full Factorial from the Macros pull-down menu, which will place the two main effects of gestational age and mother’s smoking status along with the interaction between them into the model. The interaction term is used to allow for a potential difference in the effect of gestational age for smokers and non-smokers, i.e. it allows the regression lines for smokers and non-smokers to have different slopes.
The resulting output from JMP is shown below.
The estimated regression equation is
$\hat{E}(\text{Birth Weight} \mid \text{Smoking Status}, \text{Gest. Age}) = -2.04 + .201\,\text{Smoking Status} + .240\,\text{Gest. Age} - .0183\,(\text{Gest. Age} - 39.77)(\text{Smoking Status})$
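To see what the interaction term does, the estimated equation can be expanded into one line per group. The sketch below assumes Smoking Status is coded +1 for non-smokers and -1 for smokers (an assumption to verify against the JMP output) and uses the coefficients quoted above; the function names are hypothetical.

```python
def estimated_mean_birth_weight(gest_weeks, status):
    """Unrelated lines model; status is ASSUMED +1 (non-smoker) or -1 (smoker)."""
    return (-2.04 + 0.201 * status + 0.240 * gest_weeks
            - 0.0183 * (gest_weeks - 39.77) * status)


def line_for_group(status):
    """Return (intercept, slope) of the fitted line for one smoking group."""
    intercept = estimated_mean_birth_weight(0.0, status)
    slope = estimated_mean_birth_weight(1.0, status) - intercept
    return intercept, slope


nonsmoker_intercept, nonsmoker_slope = line_for_group(+1)
smoker_intercept, smoker_slope = line_for_group(-1)

print(f"non-smokers: {nonsmoker_intercept:.3f} + {nonsmoker_slope:.4f} * gest")
print(f"smokers:     {smoker_intercept:.3f} + {smoker_slope:.4f} * gest")
```

Centering gestational age at 39.77 in the interaction term means the Smoking Status coefficient describes the gap between the two lines at a gestational age of 39.77 weeks rather than at 0 weeks.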
The interaction term is NOT significant (p = .3616), so we should go with the simpler (parallel lines) model.
Quantifying the “Smoking Effect” adjusted for Gestational Age
Adjusting for gestational age we estimate that….
Example 2 – Birth Weight and Smoking Adjusting for all Potential Covariates
We now consider adding all relevant predictors to the model for predicting birth weight. The list of variables in these data is again presented below.
The variables available:
headcir – head circumference of infant (nearest in.)
length – length of infant (nearest in.)
weight – birth weight (lbs.)
gest – gestational age of infant (weeks)
mage – mother’s age
mnocig – daily number of cigarettes during pregnancy, mother
mheight – mother’s height (nearest in.)
mppwt – mother’s pre-pregnancy weight (lbs.)
fage – father’s age
fedyrs – father’s education level (yrs.)
fnocig – daily number of cigarettes, father
fheight – father’s height (in.)
mage35 – mother over 35 years of age (1 = yes, 0 = no)
smoker – mother smoked during pregnancy (1 = yes, 0 = no)
We first fit a large model using most of the available covariates. Rather than use mothers number of cigarettes (mnocig) we are again using the smoking status indicator (smoker). The other infant size measurements, head circumference and length, have also not been included. These would actually be other responses we might wish to examine.
A summary of this model is found on the next page.
Summary of Preliminary Model
Comments:
Backward Elimination
Removing one predictor at a time, using p < .10 to retain a predictor, we arrive at the following model.
Looking specifically at the effect of mother’s smoking we find:
Conclusion: After adjusting for gestational age of the infant, mother’s height, mother’s pre-pregnancy weight, and father’s height, we estimate that women who smoke during pregnancy will have infants with a mean birth weight between .23 and .52 lbs. less than the mean birth weight of infants born to non-smokers.