Multiple Regression and Model Building
Topics: transformations in regression models, bootstrapping, influence analysis, multicollinearity, and model building.
1. Transformations
1.1 Transformations of the Dependent Variable
Used when the dependent variable does not satisfy the normality and/or equal variance assumption.
Possible transformations include square root, logarithm, and reciprocal.
Drawbacks:
Complicates interpretation of coefficients.
May cause other assumption violations.
Example data: http://wweb.uta.edu/faculty/eakin/busa5325/cardata.xls
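For concreteness, a minimal SAS sketch of a dependent-variable transformation (the data set name cars and the variables price and weight are hypothetical stand-ins, not necessarily the columns in cardata.xls):

    data cars2;
       set cars;                  /* assumed input data set             */
       log_price = log(price);   /* natural log of the dependent var.  */
    run;

    proc reg data=cars2;
       model log_price = weight; /* refit the model on the log scale   */
    run;

Note the interpretation drawback above: the fitted coefficients now describe changes in log(price), not in price itself.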
1.2 Transformation of the Independent Variable
Used when the linearity assumption is violated.
Possible transformations include polynomial models or the square root, logarithm, and reciprocal.
Complicates interpretation but does not affect the other assumptions.
1.3 EXCEL
You will first need to create new columns based on the log and square root functions. To create reciprocals, enter a formula that divides 1 by the value in another column. You will then have to create multiple scatterplots, taking two variables at a time.
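The same columns can be created in SAS, followed by a scatterplot panel to compare candidate transformations; a sketch, again with hypothetical data set and variable names:

    data cars3;
       set cars;
       sqrt_weight  = sqrt(weight);   /* square root                          */
       log_weight   = log(weight);    /* logarithm                            */
       recip_weight = 1 / weight;     /* reciprocal: 1 divided by the column  */
    run;

    /* plot Y against each candidate version of X, two variables at a time */
    proc sgscatter data=cars3;
       plot price*(weight sqrt_weight log_weight recip_weight);
    run;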
2. Bootstrapping
Used when the sampling distribution of the coefficient estimates is not known.
Repeatedly samples from the data, with replacement, in order to estimate the sampling distribution. This allows inferences to be conducted when the t distribution does not hold.
In NCSS you will find it on the bottom right of the Variables tab in Multiple Regression. It is not available in EXCEL.
Illustration: http://wweb.uta.edu/faculty/eakin/busa5325/bootstrapping.xlsx
Reference: http://en.wikipedia.org/wiki/Bootstrap_(statistics)
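A minimal SAS sketch of the resampling idea, assuming a hypothetical data set cars and model price = weight (PROC SURVEYSELECT draws the with-replacement samples):

    /* draw 1000 bootstrap samples, each of size n, with replacement */
    proc surveyselect data=cars out=boot seed=12345
         method=urs samprate=1 outhits reps=1000;
    run;

    /* refit the regression in every bootstrap sample; OUTEST= keeps the coefficients */
    proc reg data=boot outest=bootest noprint;
       by replicate;
       model price = weight;
    run;

    /* 95% percentile interval for the slope from the bootstrap distribution */
    proc univariate data=bootest noprint;
       var weight;
       output out=ci pctlpts=2.5 97.5 pctlpre=ci_;
    run;

The percentile interval replaces the usual t-based confidence interval when the t distribution cannot be trusted.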
3. Influence Analysis
3.1 Use
Determine whether individual observations fall far from the others and/or exert undue influence on the regression estimates.
3.2 Diagnostics
3.2.1 Detecting outlying X values: the hat matrix
Indicates how far an observation falls from the center of the other observations, considering only the X values.
A large hat value does not necessarily mean the observation is influential.
Values above 2(k+1)/n are considered outlying in X, where k is the number of independent variables and n is the sample size.
3.2.2 Detecting outlying Y values: studentized deleted residuals
Indicates, for a specific observation, the difference between its actual Y value and the value of Y predicted after the observation is removed from the data set.
Uses a t-test approach: any value larger in absolute value than the t-table value with n-k-2 degrees of freedom is considered an outlier in Y.
3.2.3 Alternative approach to detecting influential observations: Cook’s Distance
Indicates, for a specific observation, the overall effect on all beta estimates when the observation is removed from the data set.
Values above the fiftieth (50th) percentile of an F distribution with k+1 and n-k-1 degrees of freedom are considered influential.
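All three diagnostics, with their cutoffs, can be computed in one pass; a sketch under the assumption of a hypothetical data set cars with n = 50 observations and k = 2 independent variables:

    proc reg data=cars;
       model price = weight horsepower;               /* k = 2 */
       output out=diag h=leverage rstudent=delres cookd=cd;
    run;

    data flagged;
       set diag;
       n = 50; k = 2;                                  /* assumed sample size and k      */
       flag_x    = (leverage > 2*(k+1)/n);             /* outlying in X (hat value)      */
       flag_y    = (abs(delres) > tinv(0.975, n-k-2)); /* outlying in Y (deleted resid.) */
       flag_infl = (cd > finv(0.50, k+1, n-k-1));      /* influential (Cook's distance)  */
    run;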
3.2.4 Remedies
If the problem is due to a clerical error, correct the measurement.
Remove the data value ONLY after determining that the observation does not belong to the population of interest.
If removal causes new influential values to appear, try transforming the variable values.
Two influential points near each other can mask one another, so neither may be detected by diagnostics that delete one observation at a time.
3.2.5 Examples
Figure 1
Point A is an outlier in Y and may be influential. Point B is an outlier in X but is not influential. Point C is both an outlier in X and influential.
Figure 2
Points B and C are both influential; dropping only one of them would not change the fitted line much.
Illustration: http://wweb.uta.edu/faculty/eakin/busa5325/influence.xlsx
3.3 SAS – click the following link to find the SAS instructions
Using SAS to obtain influence and outlier diagnostics
4. Multicollinearity
When an independent variable is linearly associated with other independent variables.
4.1 Effects
Important independent variables appear to have little effect on the dependent variable.
Large variability in coefficient estimates from sample to sample.
Estimated effects whose signs are opposite to what was expected.
It is difficult to hold all other variables constant when the variables are too closely related.
4.2 Diagnostics
If the Variance Inflation Factor (VIF) exceeds ten, the independent variables are too highly related to each other.
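In SAS the VIF option on the MODEL statement prints one VIF per independent variable; a sketch with hypothetical data set and variable names:

    proc reg data=cars;
       model price = weight horsepower length / vif;   /* flag any VIF > 10 */
    run;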
4.3 Remedies
Use regression approaches other than least squares.
Remove some of the redundant variables.
Use a designed experiment.
4.4 SAS notes – click the following link to view SAS instructions
SAS instructions to obtain VIF values
4.5 Links
Illustration: http://wweb.uta.edu/faculty/eakin/busa5325/multicollinearity.xlsx
5. Model Building Approach
5.1 Theory driven
Use previous studies to determine which variables to choose.
5.2 Data driven
Fix any problems with influential values or assumption violations
Look at all combinations of the independent variables.
Choose subsets that are simple but explain as much variation as possible or have the smallest variance estimate.
Compare the reduced sets of variables to determine the simplest and most valid model.
Draw new data and test the model for significance and prediction.
For prediction, use the least squares line found in the first data set to predict the values of Y in the new data set. Calculate the average squared error and its square root (Root Mean Square Prediction Error) or average the absolute errors (Mean Absolute Prediction Error). The size of the average error indicates the usefulness of the model.
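A sketch of this validation step in SAS, assuming hypothetical data sets train and newdata that contain the same variables:

    /* fit on the first data set, saving the least squares coefficients */
    proc reg data=train outest=est;
       model price = weight horsepower;
    run;

    /* apply the saved line to the new data set; MODEL1 is the default scored-value name */
    proc score data=newdata score=est out=scored type=parms;
       var weight horsepower;
    run;

    /* average squared and absolute prediction errors */
    data errors;
       set scored;
       sqerr  = (price - model1)**2;
       abserr = abs(price - model1);
    run;

    proc means data=errors mean;
       var sqerr abserr;   /* RMSPE = square root of mean sqerr; MAPE = mean abserr */
    run;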
5.3 Comments
Stepwise selection is very fast but inefficient, and it can lead to the belief that the model is much better than it actually is.
All-possible regressions takes longer to study but is recommended; see the sketch below.
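For example, a sketch of all-possible regressions in SAS (hypothetical variables; BEST=3 keeps the top three subsets of each size):

    proc reg data=cars;
       model price = weight horsepower length mpg
             / selection=rsquare adjrsq cp best=3;
    run;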
5.4 SAS notes – click the following link to view SAS instructions
SAS instructions for obtaining all-possible regressions