REGRESSION ANALYSIS - MODEL BUILDING:

A regression analysis is typically conducted to obtain a model that may be needed for one of the following reasons:
• to explore whether a hypothesized relationship between the response and the predictors holds, or
• to estimate a known theoretical relationship between the response and the predictors.

The model will then be used for:

• Prediction: the model will be used to predict the response variable from a chosen set of predictors, and
• Inference: the model will be used to explore the strength of the relationships between the response and the predictors.

Therefore, steps in model building may be summarized as follows:

1. Choosing the predictor variables and response variable on which to collect the data.

2. Collecting data. You may be using data that already exists (retrospective), or you may be conducting an experiment during which you will collect data (prospective). Note that this step is important in determining the researcher’s ability to claim ‘association’ or ‘causality’ based on the regression model.

3. Exploring the data:
• check for data errors and missing values.
• study the bivariate relationships to reveal outliers and influential observations, to identify possible multicollinearities, and to suggest possible transformations.

4. Dividing the data into a model-building set and a model-validation set: • The training set is used to estimate the model. • The validation set is later used for cross-validation of the selected model.

5. Identify several candidate models: • Use best subsets regression. • Use stepwise regression.

6. Evaluate the selected models for violation of the model conditions. The checks below may be performed visually via residual plots as well as with formal statistical tests.
• Check the linearity condition.
• Check for normality of the residuals.
• Check for constant variance of the residuals.
• After time-ordering your data (if appropriate), assess the independence of the observations.
• Check the overall goodness-of-fit of the model.
If any of these checks turn out to be unsatisfactory, then modifications to the model may be needed (such as a different functional form). Regardless, checking the assumptions of your model as well as the model's overall adequacy is usually accomplished through residual diagnostic procedures.

7. Select the final model:

• Compare the competing models by cross-validating them against the validation data.

Remember, there is not necessarily only one good model for a given set of data. There may be a few equally satisfactory models.

The following table is a good summary of why checking the assumptions is of vital importance: http://people.stern.nyu.edu/jsimonof/classes/1305/pdf/regression.class.pdf

The figure on the left summarizes the assumptions in regression analysis.

OUTLIERS, HIGH LEVERAGE, AND INFLUENTIAL OBSERVATIONS

A data point is influential if it influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. Outliers and high leverage data points have the potential to be influential, but have to be investigated to determine whether or not they are actually influential.

See the plot on the left: there are no apparent outliers (regarding the value of Y) or high leverage observations (regarding the value of X). (filename: influencenone.sav)

Case 1: Dataset influence1.sav
In the plot of influence1 data below, the red dot is an outlier. How influential is it?
Regression line for influence1 data

The plot below shows the regression line with and without the data point marked red.

• The R2 value is lower; however, the relationship is still strong in both cases. The F-test is significant both with and without the red data point. But because the red data point is an outlier in the y direction, the MSE, and thus the standard error of the predicted variable, has increased (MSE from 6.718 to 22.191, standard error from 2.59 to 4.71). This will cause wider confidence and prediction intervals for the predicted variable.

• The t-test for β1 is also still significant, implying that predictor X accounts for a significant amount of variation in Y and that there is sufficient evidence at the 0.05 level to conclude that X is related to Y. However, the standard error of β1 has increased from 0.200 to 0.363 (this is expected since the standard error of β1 depends on MSE), which will result in a wider confidence interval for β1 as well.

• The red data point is an outlier, but it is not influential enough to change the conclusions regarding the regression relationship.

Case 2: Dataset influence2.sav In the plot of influence2 data below, the red dot is an outlier. How influential is it?

Regression line for influence2 data

The plot below shows the regression line with and without the data point marked red.


• The R2 value increased slightly from 97.3% to 97.7%. In either case, the relationship between Y and X remains strong.

• In each case, the P-value for testing H0: β1 = 0 is less than 0.001. In either case, we can conclude that there is sufficient evidence at the 0.05 level to conclude that, in the population, x is related to y.

• The standard error of b1 is about the same in each case — 0.172 when the red data point is included, and 0.200 when the red data point is excluded. Therefore, the width of the confidence intervals for β1 would largely remain unaffected by the existence of the red data point. Note that this is because the data point is not an outlier heavily impacting MSE.

• The red data point has high leverage, but it is not influential.

Case 3: Dataset influence3.sav
In the plot of influence3 data below, the red dot is an outlier. How influential is it?

Regression line for influence3 data

The plot below shows the regression line with and without the data point marked red.


• The R2 value has decreased substantially from 97.32% to 55.19%. If we include the red data point, we conclude that the relationship between y and x is only moderately strong, whereas if we exclude the red data point, we conclude that the relationship between y and x is very strong.

• The standard error of b1 is almost 3.5 times larger when the red data point is included — increasing from 0.200 to 0.686. This increase would have a substantial effect on the width of our confidence interval for β1. Again, the increase is because the red data point is an outlier — in the y direction.

• In each case, the P-value for testing H0: β1 = 0 is less than 0.001. In both cases, we can conclude that there is sufficient evidence at the 0.05 level to conclude that, in the population, x is related to y. Note, however, that the t-statistic decreases dramatically from 25.55 to 4.84 upon inclusion of the red data point.

• The predicted responses and estimated slope coefficients are clearly affected by the presence of the red data point. While the data point did not affect the significance of the hypothesis test, the t-statistic did change dramatically. In this case, the red data point is deemed both high leverage and an outlier, and it turned out to be influential too.

Identifying Y values that are extreme

• Residuals: The difference between actual and fitted Y values

• Standardized (Semistudentized) Residuals: residuals divided by sqrt(MSE), an estimate of their standard error

• Studentized Residuals (internally studentized residuals): residuals divided by an estimate of their individual standard errors, r_i = e_i / sqrt(MSE(1 - h_ii)), where h_ii is the leverage of observation i (defined below).

Note that the denominator is another estimate of the standard error of the residuals.

• Deleted and Externally Studentized Residuals

The basic idea is to delete the observations one at a time, each time refitting the regression model on the remaining n–1 observations; then, comparing the observed response values to their fitted values based on the models with the ith observation deleted. This produces deleted residuals.

Standardizing the deleted residuals produces studentized residuals.

In general, studentized residuals are going to be more effective for detecting outlying Y observations than standardized residuals. A rule of thumb is that if an observation has a studentized residual that is larger than 2 (in absolute value) we can call it an outlier.
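As a rough illustration (not part of the SPSS workflow used in the exercises below), the following minimal Python sketch computes the residual measures described above with statsmodels on simulated data; the column names x and y are placeholders.

```python
# A minimal sketch of the residual measures above, using statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.arange(20, dtype=float)})
df["y"] = 2 + 0.5 * df["x"] + rng.normal(0, 1, 20)

model = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()
infl = model.get_influence()

raw = model.resid                                   # ordinary residuals e_i
semistudentized = raw / np.sqrt(model.mse_resid)    # e_i / sqrt(MSE)
internal = infl.resid_studentized_internal          # e_i / sqrt(MSE * (1 - h_ii))
external = infl.resid_studentized_external          # deleted (externally studentized)

# Rule of thumb from the text: flag |studentized residual| > 2 as a potential outlier.
print(df.index[np.abs(external) > 2].tolist())
```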

Identifying X values that are extreme

• Leverages: A measure of the leverage of a data point i, hii, quantifies the influence that the observed response yi has on its predicted value ŷ i. That is, if hii is small, then the observed response yi plays only a small role in the value of the predicted response ŷ i. On the other hand, if hii is large, then the observed response yi plays a large role in the value of the predicted response ŷ i. It's for this reason that the hii are called the "leverages."

Some important properties of the leverages:
o The leverage h_ii is a measure of the distance between the X value for the i-th data point and the mean of the X values for all n data points.
o The leverage h_ii is a number between 0 and 1, inclusive.
o The sum of the h_ii equals p, the number of parameters (regression coefficients including the intercept).
o The leverage h_ii quantifies how far away the i-th X value is from the rest of the X values. If the i-th X value is far away, the leverage h_ii will be large; otherwise it will not.

Leverages help identify X values that are extreme and therefore potentially influential on the regression analysis. A common rule is to flag any observation whose leverage value, hii, is more than two times larger than the mean leverage value.

The mean leverage value is given by: mean leverage = p / n, where p is the number of parameters and n is the number of observations.

The leverage value is considered high when: h_ii > 2(p/n).
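The sketch below (a hedged illustration on simulated data, not the handout's SPSS steps) computes the leverages and applies the 2(p/n) rule of thumb with statsmodels.

```python
# A minimal sketch: flag high-leverage X values with the 2*(p/n) rule.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(20, 2)))   # intercept + 2 predictors -> p = 3
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=20)

fit = sm.OLS(y, X).fit()
h = fit.get_influence().hat_matrix_diag         # leverages h_ii

n, p = X.shape
print("mean leverage p/n =", p / n)             # sum(h_ii) = p, so the mean is p/n
print("high-leverage rows:", np.where(h > 2 * p / n)[0])
```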

Identifying Influential Observations

Measure of Influence on Fitted Values (predicted):

• DFFITS

The difference in fits for observation i, denoted DFFITS_i, measures the difference in the fitted value of Y when the i-th data point is included and when it is excluded from the analysis. Basically, the difference in fits quantifies the number of standard deviations that the fitted value changes when the i-th data point is omitted.

As a guideline for identifying influential data points, consider an observation influential:
o if the absolute value of the standardized DFFITS > 1 for small to medium data sets, and
o if the absolute value of the standardized DFFITS > 2*sqrt(p/n) for large data sets,
where n is the number of observations and p is the number of regression parameters including the intercept.

• Cook’s Distance

Cook's distance (D_i) is a measure of the influence of the i-th observation on all n fitted values.

As a guideline for identifying influential data points:
o If the standardized D_i > 0.5, then the i-th data point is worthy of further investigation as it may be influential.
o If the standardized D_i > 1, then the i-th data point is quite likely to be influential.

Measure of Influence on Regression Coefficients (Beta coefficients):

• DFBETAS

DFBETAS is a measure of the influence of the ith observation on each of the regression coefficients.

if |DFBETAS| > 1 (for small to medium data sets), or if |DFBETAS| > 2/sqrt(n) (for large data sets), where n is the number of observations,

then observation i is considered influential on the respective regression coefficient.
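The sketch below is a hedged Python illustration (simulated data, statsmodels) of the three influence measures above, using the "large data set" thresholds given in the text.

```python
# A minimal sketch of DFFITS, Cook's distance, and DFBETAS with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 30
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

infl = sm.OLS(y, X).fit().get_influence()
p = X.shape[1]

dffits, _ = infl.dffits              # standardized DFFITS for each observation
cooks_d, _ = infl.cooks_distance     # Cook's distance D_i
dfbetas = infl.dfbetas               # one column per regression coefficient

print("DFFITS flag:", np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0])
print("Cook's D > 0.5:", np.where(cooks_d > 0.5)[0])
print("DFBETAS flag (rows):", np.where(np.abs(dfbetas) > 2 / np.sqrt(n))[0])
```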

Multicollinearity and the Variance Inflation Factor (VIF)

Multicollinearity is the condition where one or more of the predictor variables are highly correlated with each other. This situation may be detected via:
o Large changes in the estimated regression coefficients when a predictor variable is added or deleted, or when an observation is altered or deleted.
o Nonsignificant results in individual tests on the regression coefficients for important predictor variables.
o Estimated regression coefficients with an algebraic sign that is the opposite of that expected from theoretical considerations or prior experience.
o Large coefficients of simple correlation between pairs of predictor variables in the correlation matrix.
o Wide confidence intervals for the regression coefficients representing important predictor variables.

A formal and widely accepted method of detecting the presence of multicollinearity is the use of variance inflation factors (VIFs). These factors measure how much the variances of the estimated regression coefficients are inflated compared to when the predictor variables are not linearly related. A VIF of 1 means that there is no correlation between the p-th predictor and the remaining predictor variables, and hence the variance of b_p is not inflated at all. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.
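As a hedged illustration (not the SPSS collinearity diagnostics used later in the exercise), the sketch below computes VIFs with statsmodels on simulated data in which two predictors are nearly collinear; the column names are placeholders.

```python
# A minimal sketch of computing VIFs with statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=50)
predictors = pd.DataFrame({"x1": x1,
                           "x2": x1 + rng.normal(scale=0.1, size=50),  # nearly collinear with x1
                           "x3": rng.normal(size=50)})

X = sm.add_constant(predictors)
# VIF for each predictor (skip the intercept column)
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)   # rule of thumb from the text: VIF > 4 investigate; VIF > 10 serious
```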

Dealing with Problematic Data Points

First, check for obvious data errors:
o If the error is just a data entry or data collection error, it needs to be corrected.
o If the data point is not representative of the intended study population, it may be deleted.
o If the data point is a procedural error and invalidates the measurement, it may be deleted.
o Consider the possibility that the regression model needs modification.
o Data points should not be deleted only because they do not fit the preconceived regression model.
o There must be a good, objective reason for deleting data points.
o First, foremost, and finally: it is okay to use common sense and knowledge about the context of the analysis.

Exercise: Use the Body Fat Data in Canvas. Regress Body Fat only on Triceps Skinfold Thickness and Thigh Circumference.

1. Check for Outliers:

In linear regression, statistics window, select the "All Casewise Diagnostics" option. In linear regression, plots window, check the "All Partial Plots" option. In linear regression, save window, check the "Unstandardized Residuals", "Standardized Residuals", and "Studentized Deleted Residuals" options.

a. View the casewise diagnostics output. Are there any values of residuals and/or standardized residuals that seem to stand out? State your conclusions.

Check the data set, where the residuals, standardized residuals, and studentized deleted residuals should now be added under the variable names RES_#, ZRE_#, and SDR_#, respectively. Do any of the observations have high values of studentized deleted residuals? State your conclusions.

b. Check the partial plot of body fat and triceps. (Partial regression plots, or added variable plots, show the marginal contribution of the variable that is added to the model, given the other variables that are in the model. Added variable plots also help reveal the effects of outliers in multiple regression.) Are there any observations that attract your attention, and if so, what is the case number? State your conclusions.

2. Check for High Leverage points (outlying X values)

a. In linear regression, save window, check “Leverages” option.

Check the data set where now the leverage values should be added under a new variable name LEV_#. Do any of the observations have a leverage value that exceeds the threshold? State your conclusions.

Remember, the threshold rule of thumb: a leverage value is considered large if it is more than twice the 'mean leverage value', where:

mean leverage = number of estimated parameters (p) / number of observations (n)

If the leverage value of an observation > 2(p/n), which is 2*(3/20) = 0.30 for this example, then it is considered outlying with respect to its X value.

3. Check for influential observations.

a. Influence on the fitted value (predicted)

In linear regression, save, check “Standardized DFFITS” and “Cook’s” options.

Check the data set, where the Standardized DFFITS values should now be added under the variable name SDF_#, and the Cook's Distance values should be added under the variable name COO_#.

First, view the Standardized DFFITS values. Do any of the observations have a DFFITS value that exceeds the threshold? State your conclusions.

Remember, the rule of thumb: if the Standardized DFFITS value for observation i > 1 (for small to medium data sets), or if the Standardized DFFITS value for observation i > 2*sqrt(p/n) (for large data sets), it is considered large and, thus, influential on the fitted value of Yi.

Next, view Cook’s Distance values added to the data. Do any of the observations have a Cook’s Distance value that exceeds the threshold? State your conclusions.

Remember, the rule of thumb: o If the Standardized Cook’s Distance value for observation i > 0.5, then the ith data point is worthy of further investigation as it may be influential.

o If the Standardized Cook's Distance value for observation i > 1, then the ith data point is quite likely to be influential.

b. Influence on the regression coefficients

In linear regression, save, check “Standardized DFBeta(s)” option.

Now, check the data set, where the Standardized DFBeta values for each regression coefficient should be added under the variable names SDB0_#, SDB1_#, and SDB2_#, for the three regression coefficients (Beta 0, Beta 1, and Beta 2), showing the influence of each observation on each of the three regression coefficients.

Do any of the observations exceed the threshold? State your conclusions.

Remember, the rule of thumb: if |DFBeta| > 1 (for small to medium data sets), or if |DFBeta| > 2/sqrt(n) (for large data sets), then observation i is considered influential on the respective regression coefficient.

State your overall conclusions for questions 1, 2, and 3 above.

4. a. Without changing your data, view the ANOVA table resulting from the regression of body fat on the triceps and thigh variables. State your conclusions.

b. Next, run the regression with body fat (dependent) and triceps (independent) variables, and state your conclusions regarding the significance of the model and the predictor variables.

c. Next, run the regression with body fat (dependent) and thigh (independent) variables, and state your conclusions regarding the significance of the model and the predictor variables.

d. On the main SPSS window, from the dropdown menu under “Analyze”, choose Correlate, then Bivariate, and place variables thigh and triceps in the variables window, then press OK. View the correlation coefficient for the two variables in the resulting output.

e. Next, take steps to run the regression of body fat on triceps and thigh variables one more time. This time in linear regression, statistics, check “Collinearity Diagnostics”. State your conclusions on your findings.

CHECKING THE ASSUMPTION OF LINEARITY OF THE REGRESSION RELATIONSHIP

The graph on the left shows a clearly non-linear relationship between SPI (https://www.socialprogressindex.com/) and GDP per capita for 2016 (https://data.worldbank.org).

Below on the left is the scatterplot of standardized residuals vs standardized predicted values, where the non-linear nature of the relationship is even more obvious. The fitted model is Y = -44595.588 + 914.412 (SPI), with R2 = 0.742.

For models with a single predictor, the scatter plot of dependent vs independent variable will reveal the nature of the relationship. The scatterplot of residuals vs predicted values will provide a clearer picture. For multiple linear regression models, since there are multiple independent variables one has to look at residual plots.

Note on partial plots: When conducting multiple regression analysis, partial regression plots provide information regarding the relationship between the response and each of the predictors, as follows:

Partial Regression Plots: Assume the dependent variable Y is regressed on three predictor variables, X1, X2, and X3. The partial plot of X1 is obtained as follows:

• First, regress Y on all three predictor variables, save standardized residuals.

• Next, regress X1 on the other two predictor variables, X2 and X3, and save the standardized residuals.
• Plot (scatterplot) the standardized residuals from steps 1 and 2.

• It can be shown that the slope of the partial regression of e(Y|X2, X3) on e(X1|X2, X3) is equal to the estimated regression coefficient b1 of X1 in the multiple regression model Y = b0 + b1X1 + b2X2 + b3X3 + e. Thus, the partial regression plot isolates the contribution of the specific independent variable in the multiple regression model and reveals the nature of the relationship between Y and X1 (see the sketch after this list).
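The sketch below is a hedged Python illustration of constructing the added variable plot for X1 by hand on simulated data. It uses unstandardized residuals, for which the slope equality above holds exactly; statsmodels also provides sm.graphics.plot_partregress for the same kind of plot.

```python
# A minimal sketch of a partial regression (added variable) plot for X1.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 2 * x1 - x2 + 0.5 * x3 + rng.normal(size=n)

# Step 1: residuals of Y given X2, X3
e_y = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit().resid
# Step 2: residuals of X1 given X2, X3
e_x1 = sm.OLS(x1, sm.add_constant(np.column_stack([x2, x3]))).fit().resid

# Step 3: plot them; the slope of this regression equals b1 from the full model
plt.scatter(e_x1, e_y)
plt.xlabel("e(X1 | X2, X3)")
plt.ylabel("e(Y | X2, X3)")
plt.show()

print(sm.OLS(e_y, sm.add_constant(e_x1)).fit().params[1])   # approximately b1
```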

A pattern such as (a) shows that X1 does not make an additional contribution to the model in the presence of X2 and X3. The pattern in (b) indicates that a linear term in X1 may be helpful, while the pattern in (c) reveals a non-linear relationship.

There are two options to remedy non-linearity:
1. Data transformation
2. Fitting a non-linear regression model

In the case of non-linearity, transformation of the independent variable may remedy the problem. In some cases, transformation of both variables may be necessary. In any case, before addressing independence, normality, and constancy of variance issues, the non-linearity problem must be fixed. However, while some assumptions may appear to hold prior to applying a transformation, they may no longer hold once a transformation is applied. In other words, using transformations is part of an iterative process where all the assumptions are re-checked after each iteration. Some common transformations and their effects are described below.

Square Transformation of X: Spreads out the high X values relative to smaller X values. Try with the data below: first, plot Y vs X values, then take the square of X values and plot Y vs X2.

Before transformation of X          After transformation of X
X     Y                             X2     Y
0     2                             0      2
1     3                             1      3
2     6                             4      6
3     11                            9      11
4     18                            16     18
5     27                            25     27
6     38                            36     38
7     51                            49     51
8     66                            64     66
9     83                            81     83
10    102                           100    102

1/X (Inverse) Transformation of X: Compresses large X values relative to the smaller X values, to a greater extent than the log transformation. Values of X less than 1 become greater than 1, and values of X greater than 1 become less than 1, so the order of the data values is reversed. Try with the data below: first, plot Y vs X values, then take the inverse of X values and plot Y vs 1/X.

Before transformation of X          After transformation of X
X      Y                            1/X     Y
0.20   10.0                         5.0     10.0
0.33   51.0                         3.030   51.0
0.55   79.0                         1.818   79.0
0.70   87.0                         1.429   87.0
0.90   94.0                         1.111   94.0
1.10   99.0                         0.909   99.0
1.30   103.0                        0.769   103.0
1.50   106.0                        0.667   106.0
1.70   108.0                        0.588   108.0
1.85   109.0                        0.541   109.0
1.95   109.7                        0.513   109.7

Log Transformation: Compresses high X values relative to smaller X values. Note that the log function can only be applied to values that are greater than 0. Try with the data below: first, plot Y vs X values, then take the natural log of X values and plot Y vs ln(X).

Before transformation of X          After transformation of X
X      Y                            ln(X)   Y
0.20   10.0                         -1.61   10.0
0.33   33.0                         -1.11   33.0
0.55   55.0                         -0.60   55.0
0.70   66.0                         -0.36   66.0
0.90   78.0                         -0.11   78.0
1.10   86.0                         0.10    86.0
1.30   94.0                         0.26    94.0
1.50   100.0                        0.41    100.0
1.70   106.0                        0.53    106.0
1.85   110.0                        0.62    110.0
1.95   113.0                        0.67    113.0

A square root transformation has a similar effect.

In summary: If the relationship looks as below, then perform a square root transformation (X' = SQRT(X)) of the independent variable.

If the relationship looks as below, then perform a reciprocal (inverse) transformation (X' = 1 / X) of the independent variable.

If the relationship looks as below, then, again, perform a reciprocal (inverse) transformation (X' = 1 / X) of the independent variable.

If the relationship looks as below, perform a log transformation (X'=log X) of the independent variable.

If the relationship looks as below, then, again, perform a log transformation (X'=log X) of the independent variable.

If transformation of the independent variable (X) fails to meet the linearity assumption, log transformation of both the dependent and independent variables would most likely linearize the relationship.

A Note on Data Transformations – Tukey's Ladder of Powers

Note that the transformations above are 'power' transformations of X. In general, the transformation of a variable (Y or any of the X variables) is usually chosen from the "power family" of transformations (referred to as Tukey's Ladder of Powers):

Power    Re-expression (could be an X variable or Y)
2        X^2 or Y^2 (square)
1        X or Y (no transformation)
1/2      sqrt(X) or sqrt(Y) (square root)
0        ln(X) or ln(Y) (log) – see the note on log transformation below
-1/2     1/sqrt(X) or 1/sqrt(Y) (reciprocal square root)
-1       1/X or 1/Y (reciprocal)
-2       1/X^2 or 1/Y^2 (reciprocal square)

The power is chosen based on the relationship revealed by residual plots or by diagnostic checks regarding normality of the residuals and homoscedasticity of the residual variance.

For power transformations of reciprocal square root, log, or square root, all data values must be positive. To use these transformations when there are negative and positive values, a constant can be added to all the data values such that the smallest is greater than 0.

To transform a variable in SPSS:

In the window for transforming a variable (above, on the right), create a new target variable name, choose how you want to create it (such as taking the ln of SPI), then click 'OK'. A new variable will be added to your data set with the name you defined. Now you can perform the regression using the newly created (transformed) variable.

A Note on Natural Log Transformation

The default logarithmic transformation involves taking the natural logarithm (ln, log_e, or simply log) of each data value. The general characteristics of the natural logarithmic function are:

• The natural logarithm of x is the power to which e = 2.718282... must be raised in order to get x; that is, ln(e^x) = x. For example, the natural logarithm of 5 is the power to which e = 2.718282... should be raised in order to obtain 5. Since 2.718282^1.60944 is approximately 5, the natural logarithm of 5 is 1.60944, or ln(5) = 1.60944.
• The natural logarithm of e is equal to one, that is, ln(e) = 1, and the natural logarithm of 1 is zero, that is, ln(1) = 0.

The effects of taking the natural logarithmic transformation are:

• Small values that are close together are spread further out. • Large values that are spread out are brought closer together. • And: if a variable grows exponentially, its logarithm grows linearly.

A different kind of logarithm, such as log base 10 or log base 2 could be used. However, the natural logarithm — which can be thought of as log base e where e is the constant 2.718282... — is the most common logarithmic scale used in scientific work.

When both variables are transformed, the regression equation becomes: ln(Y) = a + b ln(X), where a and b are the intercept and the regression coefficient, respectively. Regression analysis is then conducted on the transformed data. Then, to visualize the actual relationship between Y and X, back-transformation of the transformed variables is needed, as follows: Y = e^(a + b ln(X))

EXAMPLE: Data involving 'the period of revolution' and 'distance from the sun' (in astronomical units, AU) of the nine planets are given in the table below.

Planet     Period of revolution    Distance from the sun (AU)
mercury    0.241                   0.387
venus      0.615                   0.723
earth      1.000                   1.000
mars       1.881                   1.524
jupiter    11.862                  5.203
saturn     29.456                  9.539
uranus     84.070                  19.191
neptune    164.810                 30.061
pluto      248.530                 39.529

• First, plot the raw data.
• Then transform only distance to ln(distance) and plot the data.
• Then transform only period to ln(period) and plot the data.
• Finally, transform both variables and plot ln(period) vs ln(distance).
• Find the regression relationship after both variables are transformed.

The regression equation is:

ln(period) = 0.0002544 + 1.49986 ln(distance)

• To predict the period of revolution of the planet Eris, whose distance from the sun is 102.15 AU:

ln(period) = 0.0002544 + 1.49986 ln(102.15) = 6.939

• To view the predicted period of revolution in original units: Period = e^6.939, so Period ≈ 1032 years.
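The sketch below is a minimal Python reproduction of this ln-ln fit using the planet data in the table above, including the back-transformed prediction for Eris; it is an illustration, not the SPSS steps.

```python
# A minimal sketch of the ln-ln regression for the planet data and the Eris prediction.
import numpy as np
import statsmodels.api as sm

period = np.array([0.241, 0.615, 1.000, 1.881, 11.862,
                   29.456, 84.070, 164.810, 248.530])
distance = np.array([0.387, 0.723, 1.000, 1.524, 5.203,
                     9.539, 19.191, 30.061, 39.529])

fit = sm.OLS(np.log(period), sm.add_constant(np.log(distance))).fit()
a, b = fit.params                  # close to 0.0002544 and 1.49986

ln_pred = a + b * np.log(102.15)   # Eris: 102.15 AU
print(np.exp(ln_pred))             # back-transform: roughly 1032 years
```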

An objective criterion for choosing the type of transformation is the R2 value. The best transformation should have the highest R2 value (some people argue otherwise and point to the residual distribution as being most important). For the most part, this will be a trial and error process, with the end result being improved precision of your estimates of α and β.

After a linear relationship is achieved through transformation, any remaining problems with normality or homoscedasticity are addressed by transforming the dependent variable (Y).

Exercise: Use GDP per capita and SPI.xlsx to conduct a regression analysis where GDP per capita (of 126 countries) is the dependent variable and SPI (Social Progress Index) is the predictor (these are real data for 2016). First, try transforming SPI only. If you do not achieve the desired linearity of relationship by transforming SPI, try transforming both GDP and SPI. After you achieve a linear relationship, use the regression function to predict GDP per capita when the SPI is 55.

HOMOSCEDASTICITY (Homogeneity of Variance):

The figure on the left shows a hypothetical, linear relationship between X and Y, and the regression line passing through the observations:

Remember: Residual = Y-actual – Y-fitted.

Plotting the residuals against X (or Y-fitted), we obtain a residual plot as shown in the figure on the left.

The horizontal line running through zero on the Y-axis represents our regression line, allowing us to visualize the distribution of the observations around the line. The distribution of the points around the line remains constant across all values of X. This suggests that the homogeneity of variance assumption has been met.

The figure on the left, on the other hand, shows a situation where the distribution of the residuals spreads further out as the value of X (and thus the fitted Y) increases, displaying a wedge pattern. This is the situation we refer to as heteroscedasticity – non-constancy of the residual variance.

This is a common violation of the assumption of homoscedasticity – residual variance increases proportional to the mean value of Y, so that the spread of the observations gets wider as the value of X (and, therefore Y-fitted) increases.

Plotting residuals against X (or Y-fitted) would produce a plot like the figure on the left, where increasing variability of the residuals is clearly visible.

Example: Use dataset: realestatedata.xlsx Regress sales price on square feet only, save standardized residuals and standardized fitted values.

Then view the scatter plot of zresid vs. zpred. Does the plot indicate constant variance?

Transformations of Y are used to stabilize the residual variance. The type of transformation may be selected using the following general guidelines:
• the square root transformation (Y' = sqrt(Y) or Y' = Y^(1/2)), when the variance is proportional to the mean (the estimate of Y),
• the log transformation (Y' = log(Y) or Y' = ln(Y); the power-0 case of the ladder), when the variance is proportional to the square of the estimate of Y,
• the reciprocal square root transformation (Y' = 1/sqrt(Y) or Y' = Y^(-1/2)), when the variance is proportional to the cube of the estimate of Y, and
• the reciprocal transformation (Y' = 1/Y or Y' = Y^(-1)), when the variance is proportional to Y^4.

Refer to the Note on Transformations above for more detail. An objective criterion for distinguishing among transformations is the R2 value. The best transformation should have the highest R2 value (some people argue otherwise and point to the residual distribution as being most important). For the most part, this will be a trial and error process, with the end result being improved precision of your estimates of α and β.

One procedure for estimating an appropriate value of the 'power' to which Y should be raised is the Box-Cox transformation. Box-Cox transformations are a family of power transformations on Y such that Y' = Y^λ, where λ is a parameter to be determined from the data. The maximum likelihood estimate of λ is the value of λ for which the SSE is a minimum.
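As a hedged illustration, the sketch below estimates the Box-Cox power for a positive, right-skewed Y with scipy; the simulated data are placeholders and scipy chooses λ by maximum likelihood.

```python
# A minimal sketch of estimating the Box-Cox power for Y (Y must be positive).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = np.exp(rng.normal(loc=2.0, scale=0.5, size=200))   # right-skewed, positive values

y_transformed, lmbda = stats.boxcox(y)
print("estimated lambda:", lmbda)   # near 0 here, i.e. roughly a log transformation
```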

Also note that the transformation of Y involves changing the metric in which the fitted values are analyzed, which may make interpretation of the results difficult. The resulting relationship must be back-transformed to make meaningful predictions.

Use the diamond.xlsx data set to check for outliers, and apply the appropriate transformation.

NORMALITY OF RESIDUAL DISTRIBUTION

The dependent and independent variables in a regression model do not need to be normally distributed; only the prediction errors need to be normally distributed. But if the distribution of some of the variables is extremely asymmetric or long-tailed, this may suggest a problem, because the calculation of confidence intervals and significance tests for regression coefficients is based on the assumption of normally distributed errors. Therefore, it is important to check whether the error distribution is significantly non-normal.

A significant violation of the assumption may indicate: 1. That there are a few unusual data points that should be studied closely, and that the error distribution is "skewed" by the presence of a few large outliers. (Checking for the presence of unusual data points was covered previously.)

2. There is some other problem with the model assumptions, and/or that the model is not a good fit for the data.

Widely used checks for normality of residuals include the following:

• Normal probability plot (P-P plot) of residuals, which is produced automatically by SPSS in linear regression. The P-P plot compares the observed cumulative distribution function of the residuals (on the X-axis) with the theoretical cumulative normal distribution function (values range from 0 to 1, on the Y-axis). If the residuals were exactly normally distributed, the points would lie on a straight line.

• Normal quantile plot (Q-Q plot) of residuals. The Q-Q plot compares the fractiles of the error distribution with the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on such a plot should fall close to the diagonal reference line.

• Statistical tests for normality, including the Kolmogorov-Smirnov test, the Shapiro-Wilk test, the Jarque-Bera test, and the Anderson-Darling test. Real data rarely have errors that are perfectly normally distributed, and it may not be possible to fit your data with a model whose errors do not violate the normality assumption at the 0.05 level of significance. These tests are quite strict. It is usually necessary to look at the plots and draw conclusions about the seriousness of the problem.

Example: Use residualskew.xlsx. First regress variable Y on X. In SPSS, linear regression, in 'Plots', check 'Histogram' and 'Normal Probability Plot'. To obtain a scatterplot, place 'Standardized Residuals' on the Y-axis and 'Standardized Predicted' on the X-axis. Also, in 'Save', check 'Standardized Residuals'. Then click 'OK'. Your data set will now include 'Standardized Residuals' as a variable.

Your output will include:

In the main SPSS window, under ‘Analyze’ choose ‘Descriptive Statistics’, then ‘Explore’ In the ‘Explore’ Window, place the ‘standardized residuals’ variable into the dependents window:

Then the output will show:

Note that when using the Shapiro-Wilk test:

For skewness and kurtosis:

Now repeat the above by regressing Y2 on X2. Compare the results to those from Y and X.
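For reference, the sketch below performs the same kinds of normality checks in Python on simulated data (not the residualskew.xlsx walkthrough above): a Q-Q plot of the residuals plus Shapiro-Wilk and Kolmogorov-Smirnov tests.

```python
# A minimal sketch of normality checks on regression residuals.
import numpy as np
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.standard_t(df=3, size=100)     # heavy-tailed errors on purpose

resid = sm.OLS(y, sm.add_constant(x)).fit().resid

sm.qqplot(resid, line="45", fit=True)              # points should hug the reference line
plt.show()

print(stats.shapiro(resid))                        # Shapiro-Wilk W statistic and p-value
print(stats.kstest(stats.zscore(resid), "norm"))   # K-S test on standardized residuals
```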

PROCEDURES FOR FINAL MODEL SELECTION

In general, if there are p-1 predictors, then there are 2^(p-1) alternative models that can be constructed. For example, 10 predictors yield 2^10 = 1024 possible regression models.

Several model selection procedures have been developed to suggest ‘the best subset’ regression model according to specified criteria. These are briefly summarized below.

Best Subsets Model Selection Criteria

• The model with the largest R2 (equivalent to selecting the model with the smallest SSE). The R2-value can only increase as more variables are added, so the "best" model is not simply the model with the largest R2-value. Instead, the R2-value is used to find the point where adding more predictors is no longer worthwhile because it yields only a very small increase in the R2-value. In other words, the size of the increase in R2, rather than the R2-value alone, is used to determine the best model.

• The model with the largest Adjusted R2 (equivalent to selecting the model with the smallest MSE)

• Mallows’ Cp-statistic

Mallows' Cp-statistic selects the subset of variables that will have the smallest Cp value – a measure of the combination of the bias and the variance. Note that a model with p-1 predictors will have a minimum Cp value of p. Therefore, the simplest model to yield a Cp value close to p is selected.

• Information Criteria

Three information criteria, Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC) (sometimes called Schwarz's Bayesian Criterion (SBC)), and Amemiya's Prediction Criterion (APC), are also used in the selection of best subsets. These criteria, in general, penalize models having a large number of predictors.

Each of the information criteria is used in a similar way—in comparing subset models, the model with the lower value is preferred.

• The Prediction Sum of Squares (or PRESS)

PRESS is a method used to assess a model's predictive ability. For a data set of size n, PRESS is calculated by omitting each observation in turn, fitting the regression equation on the remaining n - 1 observations, and then using that equation to predict the omitted response value. In general, the smaller the PRESS value, the better the model's predictive ability.

Note that these criteria are often used together where the results of all may be pooled to determine the best subset of predictors to keep in the regression model.
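The sketch below is a hedged Python illustration of best subsets selection on simulated data: every subset of four placeholder predictors (x1 to x4) is scored by adjusted R2, AIC, and PRESS, and the candidates are ranked.

```python
# A minimal sketch of best subsets selection scored by adjusted R2, AIC, and PRESS.
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 1 + 2 * df["x1"] - df["x2"] + rng.normal(size=50)

def press(fit):
    # PRESS via the leave-one-out identity: sum of (e_i / (1 - h_ii))^2
    h = fit.get_influence().hat_matrix_diag
    return np.sum((fit.resid / (1 - h)) ** 2)

rows = []
predictors = ["x1", "x2", "x3", "x4"]
for k in range(1, len(predictors) + 1):
    for subset in itertools.combinations(predictors, k):
        fit = sm.OLS(df["y"], sm.add_constant(df[list(subset)])).fit()
        rows.append({"subset": subset, "adj_R2": fit.rsquared_adj,
                     "AIC": fit.aic, "PRESS": press(fit)})

print(pd.DataFrame(rows).sort_values("AIC").head())   # lower AIC preferred
```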

Note that SPSS has an 'Automatic Linear Modeling' procedure, a relatively new procedure found under 'Analyze', 'Regression', 'Automatic Linear Modeling'.

When the window opens, all variables are in the same list of 'predictors'. Remove the dependent variable (and any predictors that you do not want included in the analysis) from the list, then place the dependent variable in the 'Target' field.

Then click on ‘Build Options’,

On the ‘Objectives’ options, choose ‘Create a Standard Model’,

Then click on ‘Basics’ on the menu on the left side.


In ‘Basics’, you will see that SPSS has the option of automatically preparing data – leave it checked.

You can choose any confidence level, for now, leave it as 95%.

Then, click on ‘Model Selection’ on the menu on the left side.

In ‘Model Selection’, you have the options ‘Forward Stepwise’ and ‘Best Subsets’ in addition to all predictors. Choose ‘Best Subsets’.

You will see at the bottom on the right side that there is a choice of criteria for selecting the best subsets. Leave it as Information Criterion (AICC).

Then click ‘Run’.

The output will be provided in a different format as below:

Double click on the box including the results of automatic linear modeling to activate it.

Now you can scroll up and down the boxes on the left to get detailed information about the final selected model. You will see that some variables' outliers have been trimmed and that certain variables have been manipulated to get a good fit.

Scroll down to the last box down on the left with letter i on it to summarize the model information.

Automatic Linear Modeling is an attractive functionality. However, since you do not really know what SPSS is doing in the background, for the sake of 'learning by doing', I do not recommend using this functionality before completing the other exercises in the handout.

Stepwise Regression:

This is a procedure to determine which predictor variables should be in the final model, starting with one variable, and then adding others, one at a time, depending on the extent of their marginal contribution in accounting for the variation in the predicted variable. The extent of marginal contribution is measured by the p-value of the t-test about the regression coefficient.

§ Specify an Alpha-to-Enter significance level. This will typically be greater than the usual 0.05 level so that it is not too difficult to enter predictors into the model. Many software packages set this significance level by default to αE = 0.15.
§ Specify an Alpha-to-Remove significance level. This will typically be greater than the usual 0.05 level so that it is not too easy to remove predictors from the model. Again, many software packages set this significance level by default to αR = 0.15.

Stepwise Regression Example: bloodpress.xlsx

Some researchers observed the following data (bloodpress.xlsx) on 20 individuals with high blood pressure:
§ blood pressure (y = BP, in mm Hg)
§ age (x1 = Age, in years)
§ weight (x2 = Weight, in kg)
§ body surface area (x3 = BSA, in sq m)
§ duration of hypertension (x4 = Dur, in years)
§ basal pulse (x5 = Pulse, in beats per minute)
§ stress index (x6 = Stress)

The researchers are interested in determining if a relationship exists between blood pressure and age, weight, body surface area, duration, pulse rate and/or stress level. Use stepwise regression and state the resulting final model.
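For comparison with the SPSS stepwise output, the sketch below is a hedged Python illustration of the forward part of the procedure only (variables are entered by smallest t-test p-value until none beats the Alpha-to-Enter of 0.15; a full stepwise procedure would also re-test entered variables against Alpha-to-Remove). It assumes bloodpress.xlsx has the column names listed above and that an Excel reader such as openpyxl is available.

```python
# A minimal sketch of forward selection by t-test p-value (alpha-to-enter = 0.15).
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("bloodpress.xlsx")   # assumed columns: BP, Age, Weight, BSA, Dur, Pulse, Stress
y = df["BP"]
candidates = ["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]
alpha_enter = 0.15

selected = []
while candidates:
    # p-value of each candidate's t-test when added to the current model
    pvals = {}
    for var in candidates:
        fit = sm.OLS(y, sm.add_constant(df[selected + [var]])).fit()
        pvals[var] = fit.pvalues[var]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= alpha_enter:
        break
    selected.append(best)
    candidates.remove(best)

print("final predictors:", selected)
print(sm.OLS(y, sm.add_constant(df[selected])).fit().summary())
```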

MODEL VALIDATION

Most of the time it is difficult to obtain new independent data to validate a regression model. An alternative is to partition the sample data into a model-building set (where the number of observations should be at least 6 to 10 times the number of predictor variables) which will be used to develop the model, and a validation set, which will be used to evaluate the predictive ability of the model. This is called cross-validation.
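The sketch below illustrates this split-sample validation on simulated data with placeholder column names: the model is estimated on the model-building set, and its mean squared prediction error (MSPE) on the hold-out set is compared with the training MSE.

```python
# A minimal sketch of split-sample cross-validation of a regression model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(8)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x1", "x2"])
df["y"] = 1 + 2 * df["x1"] - df["x2"] + rng.normal(size=200)

train = df.sample(frac=0.7, random_state=0)   # model-building (training) set
valid = df.drop(train.index)                  # validation set

fit = sm.OLS(train["y"], sm.add_constant(train[["x1", "x2"]])).fit()
pred = fit.predict(sm.add_constant(valid[["x1", "x2"]]))

mspe = np.mean((valid["y"] - pred) ** 2)
print("training MSE:", fit.mse_resid, " validation MSPE:", mspe)
```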

References
Kutner et al., Applied Linear Statistical Models (5th ed.)
https://stats.idre.ucla.edu/spss/seminars/introduction-to-regression-with-spss/introreg-lesson2/
https://onlinecourses.science.psu.edu/stat501/
http://www.cambridge.edu.au/downloads/education/extra/209/PageProofs/Further%20Maths%20TINCP/Core%206.pdf
http://www.basic.northwestern.edu/statguidefiles/linreg_alts.html
http://people.stern.nyu.edu/jsimonof/classes/1305/pdf/regression.class.pdf
http://people.duke.edu/~rnau/testing.htm
http://core.ecu.edu/psyc/wuenschk/StatData/StatData.htm
https://erc.barnard.edu/spss/descriptives_normality

List of Exercises in this handout
Use Bodyfat.xlsx – outliers, high leverage, and influential observations
Use GDP per capita and SPI.xlsx – linearity of the regression relationship
Use realestatedata.xlsx – constancy of residual variance
Use diamond.xlsx – check for outliers and the relationship, and apply the appropriate transformation
Use residualskew.xlsx – normality of residual distribution
Use bloodpress.xlsx – stepwise regression – model selection