MULTIPLE REGRESSION

Regression analysis can be extended to include the simultaneous effects of multiple independent variables. The resulting equation takes a form such as:

Y = a + b1X1 + b2X2 + e,

where X1 and X2 are the values of cases on the two independent variables, a is the Y-intercept, b1 and b2 are the two slope coefficients, and e is the error term (actual Y - predicted Y).

See the text for formulae (B&K, pp. 288-289).
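To see the arithmetic behind the formulae, here is a minimal Python sketch of OLS with two predictors, solving the normal equations (X'X)b = X'Y directly. All data are invented and noise-free, so the fitted coefficients recover the true values exactly; SPSS would of course also report standard errors and significance tests, which this sketch omits.

```python
# Minimal OLS sketch: fit Y = a + b1*X1 + b2*X2 by building the normal
# equations (X'X)b = X'Y and solving them with Gaussian elimination.
# The data below are made up for illustration.

def ols(X, y):
    """X: list of rows of predictor values; y: list of outcomes.
    Returns [a, b1, b2, ...] with the intercept first."""
    rows = [[1.0] + list(r) for r in X]          # prepend intercept column
    k = len(rows[0])
    # Build X'X and X'Y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back substitution
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (xty[r] - sum(xtx[r][c] * b[c] for c in range(r + 1, k))) / xtx[r][r]
    return b

# Noise-free example: Y = 1 + 2*X1 + 3*X2, so OLS recovers a=1, b1=2, b2=3
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3], [4, 1]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
a, b1, b2 = ols(X, y)
```

With real, noisy data the estimates would only approximate the population coefficients, and the residuals e would be nonzero.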

Now regress respondent’s income on education, age, gender, and race---but first you must recode race so that it is a dichotomy (all variables in a regression analysis must be either quantitative or dichotomies; can you explain why [think about the previous example of education and region]?).

RECODE race (1=1)(2=2)(3=2) INTO race1.
VALUE LABELS race1 1 'white' 2 'others'.
VARIABLE LABELS race1 'recoded race variable'.

ANALYZEREGRESSIONLINEARDEPENDENT= INC98FR INDEPENDENT(S)=ENTER EDUC,AGE,SEX,RACE1

Examine the output produced by this command. The value of r2 is the proportion of variance in respondent’s income explained by the combination of all four independent variables. Overall, the F test indicates that this level of explained variance is not likely due to chance. The values of b (and beta) indicate the unique effect of each independent variable. Comparing betas (which are standardized, varying from 0 to +/- 1, so they can be compared between variables having different scales), you can see that education has the strongest independent effect. Race and gender follow the same pattern. Age does not have a statistically significant independent effect (see the Sig T column).

Soc. 651. Regression Analysis, Notes & Exercises

Unique Problems and Special Techniques

Before you read further, run the following set of commands. Refer to the output as I ask you questions in each of the next subsections.

corr happy educ inc98fr.
regress dependent=happy /method=enter educ.
regress descriptives=corr /statistics=defaults tol
  /dependent=happy /method=test (marital, sex, childs, age) (educ, inc98fr)
  /residuals.

Specification error. We interpret the value of each beta as indicating the unique effect of the corresponding independent variable. In order to properly “specify” this effect, we must have included in the regression equation all other variables that might influence both X and Y. If we have not, the equation is misspecified: the particular X is given more (or less) credit than it is due.

The only way to correct for specification error is to think ahead and include plausible extraneous variables in the equation. Compare the coefficient for EDUC in the second multiple regression analysis to the corresponding coefficient in the preceding regression analysis. The correlation matrix indicates that EDUC and INC98FR are correlated. Does the apparent effect of EDUC change much after we enter INC98FR into the equation? If so, we might conclude that the original equation was misspecified.

Multicollinearity. If two or more independent variables in the regression equation are highly correlated, it may become mathematically impossible to identify the separate effects of each with any confidence. After all, the collinear independent variables are then measuring almost the same thing. The result can be misleading significance tests.

You can test for multicollinearity in SPSS by requesting STATISTICS=DEFAULTS TOL in the REGRESS command. The resulting “tolerance” numbers indicate variables that have little variance left after their correlations with the other independent variables are taken into account. If a tolerance is close to 0, the variable is too highly correlated with another IV. You might delete one of the two from the equation. In the example, inspect the values of the tolerance for each variable. Are any close to zero? (Use of interaction terms is likely to create this problem.)
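In the special case of exactly two independent variables, each variable’s tolerance reduces to 1 - r2, where r is the Pearson correlation between the two IVs. A short Python illustration with invented, deliberately near-collinear data:

```python
# With exactly two IVs, tolerance for each one is 1 - r^2, where r is
# their Pearson correlation. The data below are invented so that the
# two predictors are nearly collinear, driving tolerance toward 0.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

educ   = [12, 16, 12, 20, 14, 18]   # hypothetical years of schooling
income = [30, 50, 35, 80, 40, 65]   # hypothetical income, tracks educ closely

tolerance = 1 - pearson_r(educ, income) ** 2   # near 0 -> collinearity problem
```

With more than two IVs, the tolerance of each variable is 1 minus the R2 from regressing it on all the other IVs, which SPSS computes for you.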

Dummy variables. You can test for the effects of dichotomies, or dummy variables, with linear regression. If the variable is coded as 0 or 1, then the value of the regression coefficient indicates the amount that the regression line for the group scored as 1 is above or below the regression line for the group scored as 0. You can extend this logic to represent the categories of nominal variables that have more than 2 categories. Just construct separate dummies to stand in for the presence or absence of each of the categories, less one. So the three categories of a trichotomy would be represented by two dummy variables, each coded as 0 or 1.
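The dummy-coding step itself can be sketched in a few lines of Python; the category codes and values below are hypothetical, with the last category serving as the omitted reference:

```python
# Sketch: expand a 3-category nominal variable into two 0/1 dummies,
# with category 3 as the omitted reference group. Codes are made up.

def make_dummies(values, categories, omit):
    """Return {category: [0/1, ...]} for every category except `omit`."""
    return {c: [1 if v == c else 0 for v in values]
            for c in categories if c != omit}

marital = [1, 2, 3, 1, 3, 2]        # e.g. 1=married, 2=divorced, 3=never married
dummies = make_dummies(marital, [1, 2, 3], omit=3)
# dummies[1] flags married cases; dummies[2] flags divorced cases;
# never-married cases score 0 on both and become the omitted category.
```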

In the example, marital status is recoded to 0 and 1 to represent being married (compared to all other marital statuses) and included in the regression as a dummy variable. “Not married” is termed the “omitted category.” Interpret the effect of being married. The signs on dummies indicate whether that category has a higher or lower average value than does the omitted category, controlling for the other variables in the equation.

For several different reasons, you may want to test for the significance of the effect of sets of variables. The TEST subcommand allows you to do this. Such combined significance tests are required to determine whether a variable that is represented by several dummies has a significant effect. It is also useful to test the significance of blocks of variables that have some conceptual similarity (such as social background or achieved status).

Standardized coefficients. Values of beta are often preferred over the unstandardized slope coefficient, b, because beta is calculated so as to vary from 0 to +/- 1, no matter what the original values of the variable. This means that the values of beta for different independent variables can be compared to each other as an indication of relative strength of effects. However, because beta is standardized in terms of the variation in the independent and dependent variables, it will vary in part as a function of the variability of these variables, not just in tandem with the strength of their association. So beta coefficients calculated from different samples with differing amounts of variability might differ for this reason, even when the strength of effects of the independent variable in the two samples is identical. For this reason, it is better to compare values of b, rather than of beta, when judging the relative strength of effects of the same variable in different samples.
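The relation between b and beta can be made concrete. In the bivariate case, beta = b * (sd of X / sd of Y), which equals Pearson’s r. A short Python check with made-up data:

```python
# Verify beta = b * (sd_X / sd_Y) for a bivariate slope, using
# population (divide-by-n) standard deviations. Data are invented.

def sd(v):
    m = sum(v) / len(v)
    return (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5

def slope_b(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b = slope_b(x, y)            # unstandardized slope
beta = b * sd(x) / sd(y)     # standardized slope; equals Pearson r here
```

Because beta folds in the standard deviations of the particular sample, two samples with identical b can show different betas, which is the point made above.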

Compare the values of b and beta for several variables. Comment.

Standard assumptions. Don’t forget about the assumptions that must be met even in a bivariate regression analysis. The RESIDUALS output helps to test for the presence of heteroscedasticity and nonlinearity. You can use the plot to check that the residuals are normally distributed and that they do not vary systematically in size across the regression line. In this output, both plots suggest that the residuals are “well behaved,” so it seems that heteroscedasticity and nonlinearity are not problems. Examination of “outliers” can also help to determine whether there are any cases with values so extreme that they alter the slope of the regression line.

Interaction effects. Does the effect of one independent variable vary across the categories of another independent variable? If such an interaction effect is hypothesized, or suspected, it can be tested in a multiple regression analysis. The simplest way to do so is with a product term. If one variable is coded as a dichotomy, with values of 0 and 1, then multiplying this dichotomy by another independent variable produces a term whose coefficient indicates how the effect of the other independent variable changes when cases have the value of 1 on the first variable.

MARSEX can be computed to allow a test for a possible interaction effect involving gender and marital status (does the effect of being married on happiness differ for women and men?). Examine the COMPUTE statement and then interpret the effects of these variables in the following regression analysis.

recode marital (1=1)(else=0).
value labels marital 1 'Married' 0 'Not Married'.
compute marsex=sex*marital.
regress statistics=defaults tol
  /dependent=happy /method=enter marital, sex, childs, age, educ, inc98fr, marsex.
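The product term itself is just elementwise multiplication, as in COMPUTE marsex = sex * marital. A Python sketch with hypothetical 0/1 codings (actual survey variables may use other codes, so treat these values as illustrative):

```python
# Sketch: building an interaction (product) term from two 0/1 dummies.
# Codings below are hypothetical: assume 0 = male, 1 = female for sex.

sex     = [0, 1, 0, 1, 1, 0]   # hypothetical 0/1 gender dummy
marital = [1, 1, 0, 0, 1, 1]   # 1 = married, 0 = not married

marsex = [s * m for s, m in zip(sex, marital)]
# marsex is 1 only for married women; in the regression its coefficient
# shifts the "married" effect for women relative to married men.
```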

Structural modeling. Multiple regression in its most basic form tests for the simultaneous effects of all independent variables. The IVs are thus treated as all having a direct influence on the dependent variable. However, some variables may play an intervening role, transmitting the effects of other variables. For example, if you control simultaneously for the effects of education and occupational status in a regression analysis of income, you might conclude that education has no effect. But this could be because education’s effect operates through the influence it has on occupational status. When you control for occupational status, therefore, the effect of education is reduced. But this does not mean that education “has no effect.”

One way to deal with this problem is to identify a priori the causal order of the independent variables. If there is more than one stage in the causal path influencing the dependent variable, conduct the regression analysis in stages. In the first stage, enter those variables whose influence is first in the causal sequence. Then enter the variables that intervene after these. Continue until you have entered the effects of variables in each causal stage. The changes in effects of independent variables after you enter supposed intervening variables give a crude indication of the extent to which the effects of the prior variables operate through the variables entered later.

The separate ENTER subcommands in the following regression command generate the analysis in stages. Inspect the r2 and slope coefficients for sex, age, childs, and marital to see if there are any marked changes after education and income are entered. What do the results indicate about the causal structure?

REGRESS DEPENDENT=happy /METHOD=ENTER SEX, AGE, CHILDS, marital /ENTER educ, inc98fr.
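The logic of comparing explained variance across stages can be illustrated with the standard two-predictor formula for R2. The correlations below are invented for illustration:

```python
# Sketch of the R^2 increment when a second predictor is added, using the
# standard two-predictor formula:
#   R^2 = (r_y1^2 + r_y2^2 - 2*r_y1*r_y2*r_12) / (1 - r_12^2)
# The three correlations are hypothetical values, not real GSS results.

r_y1 = 0.50   # correlation of Y with the stage-1 predictor
r_y2 = 0.40   # correlation of Y with the stage-2 predictor
r_12 = 0.30   # correlation between the two predictors

r2_stage1 = r_y1 ** 2
r2_stage2 = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
r2_change = r2_stage2 - r2_stage1   # the "R^2 change" SPSS reports per block
```

A large r2_change after the second block, together with shrinking stage-1 slopes, is the crude signal that the later variables intervene in the effects of the earlier ones.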

Please note that developing and testing structural models is a very big topic that, when done right, involves much more sophisticated extensions of regression analysis. We have only scratched the surface.