REGRESSION ANALYSIS - MODEL BUILDING:
A regression analysis is typically conducted to obtain a model that may be needed for one of the following reasons:
• to explore whether a hypothesis regarding the relationship between the response and predictors is true, or
• to estimate a known theoretical relationship between the response and predictors.
The model will then be used for:
• Prediction: the model will be used to predict the response variable from a chosen set of predictors, and
• Inference: the model will be used to explore the strength of the relationships between the response and the predictors.
Therefore, steps in model building may be summarized as follows:
1. Choosing the predictor variables and response variable on which to collect the data.
2. Collecting data. You may be using data that already exists (retrospective), or you may be conducting an experiment during which you will collect data (prospective). Note that this step is important in determining the researcher’s ability to claim ‘association’ or ‘causality’ based on the regression model.
3. Exploring the data.
• Check for data errors and missing values.
• Study the bivariate relationships to reveal outliers and influential observations, identify possible multicollinearity, and suggest possible transformations.
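As a rough sketch of this step in Python with pandas (the file name data.csv and the column layout are hypothetical; any tabular source works):

    import pandas as pd
    from pandas.plotting import scatter_matrix

    # Hypothetical file name and columns; any tabular source works.
    df = pd.read_csv("data.csv")

    print(df.describe())              # ranges and summary statistics reveal data errors
    print(df.isna().sum())            # count missing values per column
    print(df.corr(numeric_only=True)) # high predictor correlations hint at multicollinearity
    scatter_matrix(df)                # bivariate plots reveal outliers and nonlinearity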
4. Dividing the data into a model-building set and a model-validation set:
• The training set is used to estimate the model.
• The validation set is later used for cross-validation of the selected model.
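A minimal sketch of such a split in Python with NumPy and pandas (the 70/30 proportion and the file name are illustrative assumptions, not prescriptions from the text):

    import numpy as np
    import pandas as pd

    # Hypothetical DataFrame with the response and predictors as columns.
    df = pd.read_csv("data.csv")

    rng = np.random.default_rng(42)
    idx = rng.permutation(len(df))
    cut = int(0.7 * len(df))      # illustrative 70/30 split

    train = df.iloc[idx[:cut]]    # model-building (training) set
    valid = df.iloc[idx[cut:]]    # model-validation set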
5. Identify several candidate models:
• Use best subsets regression.
• Use stepwise regression.
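A minimal best-subsets sketch in Python with statsmodels, fitting every subset of a small hypothetical predictor list and ranking by adjusted R2 (stepwise selection would instead add or drop one predictor at a time):

    from itertools import combinations
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical training data: candidate predictors x1..x3, response y.
    rng = np.random.default_rng(0)
    train = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])
    train["y"] = 2 + 3 * train["x1"] - train["x2"] + rng.normal(size=50)

    scores = []
    for k in range(1, 4):
        for subset in combinations(["x1", "x2", "x3"], k):
            X = sm.add_constant(train[list(subset)])
            fit = sm.OLS(train["y"], X).fit()
            scores.append((fit.rsquared_adj, subset))

    # Rank candidate models by adjusted R2 (AIC or Mallows' Cp work similarly).
    for r2_adj, subset in sorted(scores, reverse=True):
        print(round(r2_adj, 3), subset)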
6. Evaluate the selected models for violation of the model conditions. The checks below may be performed visually via residual plots as well as with formal statistical tests.
• Check the linearity condition.
• Check for normality of the residuals.
• Check for constant variance of the residuals.
• After time-ordering your data (if appropriate), assess the independence of the observations.
• Check the overall goodness-of-fit of the model.
If the above checks turn out to be unsatisfactory, then modifications to the model may be needed (such as a different functional form). Regardless, checking the assumptions of your model as well as the model's overall adequacy is usually accomplished through residual diagnostic procedures.
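A minimal sketch of these checks in Python with statsmodels and SciPy, using common formal tests (Shapiro-Wilk, Breusch-Pagan, Durbin-Watson); the synthetic data stand in for a real training set:

    import numpy as np
    import scipy.stats as stats
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan
    from statsmodels.stats.stattools import durbin_watson

    # Synthetic stand-in data; in practice, fit on the training set.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 1 + 2 * x + rng.normal(size=50)
    X = sm.add_constant(x)
    results = sm.OLS(y, X).fit()
    resid = results.resid

    print("Shapiro-Wilk p =", stats.shapiro(resid).pvalue)     # normality
    print("Breusch-Pagan p =", het_breuschpagan(resid, X)[1])  # constant variance
    print("Durbin-Watson =", durbin_watson(resid))             # independence (~2 is good)

    # Linearity: inspect a residuals-vs-fitted plot for curvature, e.g.
    # plt.scatter(results.fittedvalues, resid) with matplotlib.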
7. Select the final model:
• Compare the competing models by cross-validating them against the validation data (a sketch follows below).
Remember, there is not necessarily only one good model for a given set of data. There may be a few equally satisfactory models.
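A minimal sketch of this comparison in Python, scoring two hypothetical candidate models by their mean squared prediction error (MSPE) on the held-out validation set:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical train/valid sets with predictors x1, x2 and response y.
    rng = np.random.default_rng(0)
    def make(n):
        d = pd.DataFrame(rng.normal(size=(n, 2)), columns=["x1", "x2"])
        d["y"] = 2 + 3 * d["x1"] + rng.normal(size=n)
        return d
    train, valid = make(70), make(30)

    def mspe(cols):
        fit = sm.OLS(train["y"], sm.add_constant(train[cols])).fit()
        pred = fit.predict(sm.add_constant(valid[cols]))
        return np.mean((valid["y"] - pred) ** 2)

    # The candidate with the lower validation MSPE is preferred.
    print("Model with x1:    ", mspe(["x1"]))
    print("Model with x1, x2:", mspe(["x1", "x2"]))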
The following table is a good summary of why checking the assumptions is of vital importance: http://people.stern.nyu.edu/jsimonof/classes/1305/pdf/regression.class.pdf
[Figure: a summary of the assumptions in regression analysis.]
OUTLIERS, HIGH LEVERAGE, AND INFLUENTIAL OBSERVATIONS
A data point is influential if it influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. Outliers and high leverage data points have the potential to be influential, but have to be investigated to determine whether or not they are actually influential.
[Figure: scatterplot of the influencenone.sav data; no data points appear to be outliers (with respect to the value of Y) or high leverage observations (with respect to the value of X).]
Case 1: Dataset influence1.sav
In the plot of the influence1 data below, the red dot is an outlier. How influential is it?
[Figure: regression line for the influence1 data.]
The plot below shows the regression line with and without the data point marked red.
• The R2 value is lower; however, the relationship is still strong in both cases, and the F-test is significant both with and without the red data point. But because the red data point is an outlier in the y direction, the MSE, and thus the standard error of the predicted variable, has increased (MSE from 6.718 to 22.191, and the standard error from 2.59 to 4.71). This will cause wider confidence and prediction intervals for the predicted variable.
• The t-test for β1 is also still significant, implying that predictor X accounts for a significant amount of variation in Y and that there is sufficient evidence at the 0.05 level to conclude that X is related to Y. However, the standard error of b1 has increased from 0.200 to 0.363 (this is expected, since the standard error of b1 depends on MSE), which will result in a wider confidence interval for β1 as well.
• The red data point is an outlier, but it is not influential enough to change the conclusions regarding the regression relationship.
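The kind of with/without comparison made above can be scripted. Below is a minimal sketch in Python with statsmodels, refitting with and without a suspect point and comparing MSE, se(b1), and R2 (the data and the position of the suspect point are hypothetical):

    import numpy as np
    import statsmodels.api as sm

    # Synthetic stand-in data; the suspect (red) point is appended last.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 20)
    y = 2 + 3 * x + rng.normal(scale=2, size=x.size)
    x = np.append(x, 5.0)     # typical x value ...
    y = np.append(y, 60.0)    # ... but extreme y value: an outlier in y

    def fit(xv, yv):
        return sm.OLS(yv, sm.add_constant(xv)).fit()

    full = fit(x, y)                # with the suspect point
    reduced = fit(x[:-1], y[:-1])   # without it

    # Compare the quantities discussed above.
    print("MSE    with/without:", full.mse_resid, reduced.mse_resid)
    print("se(b1) with/without:", full.bse[1], reduced.bse[1])
    print("R2     with/without:", full.rsquared, reduced.rsquared)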
Case 2: Dataset influence2.sav
In the plot of the influence2 data below, the red dot has an extreme x value (high leverage). How influential is it?
[Figure: regression line for the influence2 data.]
The plot below shows the regression line with and without the data point marked red.
• The R2 value increased slightly, from 97.3% to 97.7%. In either case, the relationship between Y and X remains strong.
• In each case, the P-value for testing H0: β1 = 0 is less than 0.001, so in either case there is sufficient evidence at the 0.05 level to conclude that, in the population, x is related to y.
• The standard error of b1 is about the same in each case — 0.172 when the red data point is included, and 0.200 when the red data point is excluded. Therefore, the width of the confidence intervals for β1 would largely remain unaffected by the existence of the red data point. Note that this is because the data point is not an outlier heavily impacting MSE.
• The red data point has high leverage, but it is not influential.
Case 3: Dataset influence3.sav
In the plot of the influence3 data below, the red dot is both an outlier and a high leverage point. How influential is it?
[Figure: regression line for the influence3 data.]
The plot below shows the regression line with and without the data point marked red.
• The R2 value has decreased substantially from 97.32% to 55.19%. If we include the red data point, we conclude that the relationship between y and x is only moderately strong, whereas if we exclude the red data point, we conclude that the relationship between y and x is very strong.
• The standard error of b1 is almost 3.5 times larger when the red data point is included — increasing from 0.200 to 0.686. This increase would have a substantial effect on the width of our confidence interval for β1. Again, the increase is because the red data point is an outlier — in the y direction.
• In each case, the P-value for testing H0: β1 = 0 is less than 0.001, so in both cases there is sufficient evidence at the 0.05 level to conclude that, in the population, x is related to y. Note, however, that the t-statistic decreases dramatically, from 25.55 to 4.84, upon inclusion of the red data point.
• The predicted responses and estimated slope coefficients are clearly affected by the presence of the red data point. While the data point did not affect the significance of the hypothesis test, the t-statistic did change dramatically. In this case, the red data point is deemed both high leverage and an outlier, and it turned out to be influential too.
Identifying Y values that are extreme
• Residuals: the difference between the actual and fitted Y values, ei = yi − ŷi.
• Standardized (Semistudentized) Residuals: residuals divided by √MSE, a rough estimate of their standard deviation: ei* = ei / √MSE.
• Studentized Residuals (internally studentized residuals): ri = ei / √(MSE(1 − hii)).
Note that the denominator is another estimate of the standard error of the residuals, one that accounts for the leverage hii of the ith observation.
• Deleted and Externally Studentized Residuals
The basic idea is to delete the observations one at a time, each time refitting the regression model on the remaining n − 1 observations, and then to compare the observed response values to their fitted values based on the models with the ith observation deleted. This produces the deleted residuals, di = yi − ŷi(i) = ei / (1 − hii).
Standardizing the deleted residuals produces the studentized deleted (externally studentized) residuals, ti = ei / √(MSE(i)(1 − hii)), where MSE(i) is the mean squared error of the fit with the ith observation deleted.
In general, studentized residuals are going to be more effective for detecting outlying Y observations than standardized residuals. A rule of thumb is that if an observation has a studentized residual that is larger than 2 (in absolute value) we can call it an outlier.
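A minimal sketch in Python with statsmodels computing the residual types defined above and applying the |ri| > 2 rule of thumb (the data are synthetic stand-ins):

    import numpy as np
    import statsmodels.api as sm

    # Synthetic stand-in data.
    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 25)
    y = 1 + 2 * x + rng.normal(scale=1.5, size=x.size)

    results = sm.OLS(y, sm.add_constant(x)).fit()
    infl = results.get_influence()

    deleted = infl.resid_press                   # deleted (PRESS) residuals
    internal = infl.resid_studentized_internal   # internally studentized
    external = infl.resid_studentized_external   # externally studentized

    # Rule of thumb from the text: |studentized residual| > 2 flags an outlier.
    print("Flagged observations:", np.where(np.abs(external) > 2)[0])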
Identifying X values that are extreme
• Leverages: A measure of the leverage of a data point i, hii, quantifies the influence that the observed response yi has on its predicted value ŷi. That is, if hii is small, then the observed response yi plays only a small role in the value of the predicted response ŷi. On the other hand, if hii is large, then the observed response yi plays a large role in the value of the predicted response ŷi. It is for this reason that the hii are called the "leverages."
Some important properties of the leverages:
o The leverage hii is a measure of the distance between the X value for the ith data point and the mean of the X values for all n data points.
o The leverage hii is a number between 0 and 1, inclusive.
o The sum of the hii equals p, the number of parameters (regression coefficients including the intercept).
o The leverage hii quantifies how far away the ith X value is from the rest of the X values. If the ith X value is far away, the leverage hii will be large; otherwise not.
Leverages help identify X values that are extreme and therefore potentially influential on the regression analysis. A common rule is to flag any observation whose leverage value, hii, is more than two times larger than the mean leverage value.
The mean leverage value is given by h̄ = (Σ hii) / n = p / n, since the leverages sum to p; the rule above therefore flags observations with hii > 2p/n.
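A minimal sketch in Python with statsmodels computing the leverages hii and flagging any observation above twice the mean leverage p/n (the data are synthetic stand-ins):

    import numpy as np
    import statsmodels.api as sm

    # Synthetic stand-in data with one extreme x value.
    rng = np.random.default_rng(2)
    x = np.append(np.linspace(0, 10, 24), 25.0)   # last point is far out in x
    y = 1 + 2 * x + rng.normal(scale=1.5, size=x.size)

    X = sm.add_constant(x)
    results = sm.OLS(y, X).fit()

    h = results.get_influence().hat_matrix_diag   # leverages h_ii
    n, p = X.shape                                # n observations, p parameters

    # Flag observations whose leverage exceeds twice the mean leverage p/n.
    print("Mean leverage p/n =", p / n)
    print("High leverage observations:", np.where(h > 2 * p / n)[0])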