Multiple Regression And Model Building


• Using Transformations in Regression Models
• Bootstrapping
• Influence Analysis
• Multicollinearity
• Model Building

1. Transformations

1.1 Transformations of the Dependent Variable

• Used when the dependent variable does not satisfy the normality and/or equal variance assumption.

• Possible transformations include the square root, logarithm, and reciprocal.

• Drawbacks:
  o Complicates interpretation of the coefficients
  o May cause other assumption violations

Example data: http://wweb.uta.edu/faculty/eakin/busa5325/cardata.xls

1.2 Transformation of the Independent Variable

• Used when the linearity assumption is violated.

• Possible transformations include polynomial models or the square root, logarithm, and reciprocal.

• Complicates interpretation but does not affect the other assumptions.

1.3 EXCEL

You will first need to create new columns based on the log and square root functions. To create reciprocals, create a function that is 1 divided by the value in another column. You will then create multiple scatterplots, taking two variables at a time.
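
For those working outside Excel, here is a minimal Python sketch of the same steps, assuming a pandas DataFrame; the column names mpg and horsepower are hypothetical stand-ins for whatever variables the linked cardata.xls actually contains:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical column names; the linked cardata.xls may use different ones.
df = pd.read_excel("cardata.xls")

# Transformations of the dependent variable (Section 1.1):
df["log_y"] = np.log(df["mpg"])       # logarithm
df["sqrt_y"] = np.sqrt(df["mpg"])     # square root
df["recip_y"] = 1 / df["mpg"]         # reciprocal

# Transformation of an independent variable (Section 1.2):
df["x_sq"] = df["horsepower"] ** 2    # quadratic term for a polynomial model

# Scatterplots, two variables at a time, to judge linearity:
for ycol in ["mpg", "log_y", "sqrt_y", "recip_y"]:
    plt.scatter(df["horsepower"], df[ycol])
    plt.xlabel("horsepower")
    plt.ylabel(ycol)
    plt.show()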

2. Bootstrapping

• Used when the sampling distribution of the coefficients is not known.

• Repeated samples are drawn from the data, with replacement, in order to estimate the sampling distribution.

• Allows inferences to be conducted when the t distribution does not hold.

• In NCSS, you will find it on the bottom right of the Variables tab in Multiple Regression. It is not available in EXCEL; a sketch of the resampling idea appears after the links below.

Illustration: http://wweb.uta.edu/faculty/eakin/busa5325/bootstrapping.xlsx

Reference: http://en.wikipedia.org/wiki/Bootstrap_(statistics)
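
A minimal Python sketch of the percentile bootstrap for a regression slope, using synthetic data in place of the linked example (NCSS automates this; the sketch only illustrates the resampling idea):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic data standing in for the linked example.
n = 50
x = rng.uniform(0, 10, size=n)
y = 2 + 0.5 * x + rng.standard_normal(n)

X = sm.add_constant(x)
boot_slopes = []
for _ in range(2000):                      # resample the rows with replacement
    idx = rng.integers(0, n, size=n)
    boot_slopes.append(sm.OLS(y[idx], X[idx]).fit().params[1])

# Percentile interval for the slope: inference without relying on the t distribution.
lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"95% bootstrap CI for slope: ({lo:.3f}, {hi:.3f})")
```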

3. Influence Analysis

3.1 Use

Determine whether individual observations fall far from the others and/or exert undue influence on the regression estimates.

3.2 Diagnostics

3.2.1 Detecting outlying X values: the hat matrix

• Indicates the distance an observation falls from the center of the other observations (considering only the X values).

• A large value does not necessarily mean the observation is influential.

• Values above 2(k+1)/n are considered outliers in X (see the sketch below).
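
A minimal Python sketch of the hat-matrix leverage calculation and the 2(k+1)/n rule of thumb, using a synthetic design matrix:

```python
import numpy as np

# Synthetic n-by-(k+1) design matrix including the intercept column.
rng = np.random.default_rng(1)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

# Diagonal of the hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

cutoff = 2 * (k + 1) / n               # rule of thumb from the notes
print("High-leverage rows:", np.where(leverage > cutoff)[0])
```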

3.2.2 Detecting outlying Y values: studentized deleted residuals

• Indicates, for a specific observation, the difference between the actual value of Y and the value of Y predicted after the observation is removed from the data set.

• Uses a t-test approach: any value larger in absolute value than a t-table value with n-k-2 degrees of freedom is considered an outlier in Y.

3.2.3 Alternative approach to detecting influential observations: Cook’s Distance

• Indicates, for a specific observation, the overall effect on all of the beta estimates when the observation is removed from the data set.

• Values above the fiftieth (50th) percentile of an F distribution with k+1 and n-k-1 degrees of freedom are considered influential (a sketch computing both diagnostics follows).
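
A minimal Python sketch computing the studentized deleted residuals (3.2.2) and Cook's distance (3.2.3) with statsmodels, on synthetic data, and applying the cutoffs described above:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Synthetic data with k = 2 predictors.
rng = np.random.default_rng(2)
n, k = 40, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)

infl = sm.OLS(y, X).fit().get_influence()

# 3.2.2: studentized deleted residuals vs. a t cutoff with n-k-2 df.
t_cut = stats.t.ppf(0.975, df=n - k - 2)
y_outliers = np.where(np.abs(infl.resid_studentized_external) > t_cut)[0]

# 3.2.3: Cook's distance vs. the 50th percentile of F(k+1, n-k-1).
f_cut = stats.f.ppf(0.50, k + 1, n - k - 1)
influential = np.where(infl.cooks_distance[0] > f_cut)[0]

print("Outlying Y:", y_outliers, " Influential:", influential)
```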

3.2.4 Remedies

• If the problem is due to a clerical error, correct the measurement.

• Remove a data value ONLY after determining that the observation does not belong to the population of interest.

• If removal causes new influential values to appear, try transforming the variable values.

• Note that two influential points near each other may mask each other and not be detected.

3.2.5 Examples

Figure 1

Point A is an outlier in Y and may be influential. Point B is an outlier in X but is not influential. Point C is both an outlier in X and influential.

Figure 2

Points B and C are both influential, but dropping only one of them would not change the line much.

Illustration: http://wweb.uta.edu/faculty/eakin/busa5325/influence.xlsx

3.3 SAS – click the following link to find the SAS instructions

Using SAS to obtain influence and outlier diagnostics

4. Multicollinearity

• Occurs when an independent variable is linearly associated with the other independent variables.

4.1 Effects

• Important independent variables appear to have little effect on the dependent variable.

• Large variability in the estimates from sample to sample.

• Estimated effects that have signs opposite to what was expected.

• It is difficult to hold all other variables constant when the variables are too closely related.

4.2 Diagnostics

• If the Variance Inflation Factor (VIF) exceeds ten, the independent variables are too highly related to each other.
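
A minimal Python sketch of the VIF diagnostic using statsmodels, with a synthetic predictor x3 built to be nearly redundant so its VIF comes out large:

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors; x3 is nearly a linear combination of x1 and x2.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each independent variable (skip the intercept column 0).
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))
```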

4.3 Remedies

• Use regression approaches other than least squares.

• Remove some of the redundant variables.

• Use a designed experiment.

4.4 SAS notes – click the following link to view SAS instructions

SAS instructions to obtain VIF values

4.5 Links

Illustration: http://wweb.uta.edu/faculty/eakin/busa5325/multicollinearity.xlsx

5. Model Building Approaches

5.1 Theory driven

• Use previous studies to determine which variables to choose.

5.2 Data driven

• Fix any problems with influential values or assumption violations.

• Look at all combinations of the independent variables.

• Choose subsets that are simple but explain as much variation as possible or have the smallest variance estimate.

• Compare the reduced sets of variables to determine the simplest and most valid model.

• Draw new data and test the model for significance and prediction.

• For prediction, use the least squares line found in the first data set to predict the values of Y in the new data set. Calculate the average squared error and its square root (the Root Mean Square Prediction Error) or average the absolute errors (the Mean Absolute Prediction Error). The size of the average error indicates the usefulness of the model (see the sketch below).
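
A minimal Python sketch of the validation step in the last bullet, with synthetic first and new data sets standing in for real ones:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Fit the least squares line on the first (training) data set.
x_old = rng.uniform(0, 10, 60)
y_old = 3 + 1.5 * x_old + rng.standard_normal(60)
fit = sm.OLS(y_old, sm.add_constant(x_old)).fit()

# Predict Y in the new data set using the line from the first data set.
x_new = rng.uniform(0, 10, 30)
y_new = 3 + 1.5 * x_new + rng.standard_normal(30)
pred = fit.predict(sm.add_constant(x_new))

errors = y_new - pred
rmspe = np.sqrt(np.mean(errors ** 2))   # Root Mean Square Prediction Error
mape = np.mean(np.abs(errors))          # Mean Absolute Prediction Error
print(f"RMSPE = {rmspe:.3f}, MAPE = {mape:.3f}")
```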

5.3 Comments

• Stepwise regression is very fast but inefficient and can lead to the belief that the model is much better than it actually is.

• All-possible regressions takes longer to run and study but is recommended (see the sketch below).
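
A minimal Python sketch of all-possible regressions on three hypothetical predictors, ranking each subset by adjusted R-squared and the variance estimate (MSE):

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic data with three candidate predictors.
rng = np.random.default_rng(5)
n = 80
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
y = 1 + 2 * X["x1"] - X["x2"] + rng.standard_normal(n)

# Fit every non-empty subset of predictors and record fit summaries.
results = []
for r in range(1, 4):
    for subset in itertools.combinations(X.columns, r):
        fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        results.append((subset, fit.rsquared_adj, fit.mse_resid))

# Simple-but-effective subsets: high adjusted R-squared, small variance estimate.
for subset, adj_r2, mse in sorted(results, key=lambda t: -t[1]):
    print(subset, round(adj_r2, 3), round(mse, 3))
```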

5.4 SAS notes – click the following link to view SAS instructions

SAS instructions for obtaining all-possible regressions
