
Learn About Multiple Regression With Dummy Variables in SPSS With Data From the General Social Survey (2012)

© 2015 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This dataset example introduces readers to multiple regression with dummy variables. Multiple regression allows researchers to evaluate whether a continuous dependent variable is a linear function of two or more independent variables. When one (or more) of the independent variables is a categorical variable, the most common method of properly including them in the model is to code them as dummy variables. Dummy variables are dichotomous variables coded as 1 to indicate the presence of some attribute and as 0 to indicate the absence of that attribute. The multiple regression model is most commonly estimated via ordinary least squares (OLS), and is sometimes called OLS regression.

This example describes multiple regression with dummy variables, discusses the assumptions underlying it, and shows how to estimate and interpret such models. We use a subset of data from the 2012 General Social Survey (http://www3.norc.org/GSS+Website/) to analyze whether a person’s weight is a linear function of a number of attributes, including whether or not the person is female and whether or not the person smokes cigarettes. Weight, and particularly being overweight, is associated with a number of negative

health outcomes. Thus, results from an analysis like this could have implications for individual behavior and public health policy.

What Is Multiple Regression With Dummy Variables?

Multiple regression expresses a dependent, or response, variable as a linear function of two or more independent variables. Readers looking for a general introduction to multiple regression should refer to the appropriate examples in Sage Research Methods. This example focuses specifically on including dummy variables among the independent variables in a multiple regression model.

Many times, an independent variable of interest is categorical. "Gender" might be coded as Male or Female; "Region" might be coded as South, Northeast, Midwest, and West. When there is no obvious order to the categories, or when there are three or more categories and the differences between them are not all assumed to be equal, such variables need to be coded as dummy variables for inclusion in a regression model.

The number of dummy variables you will need to capture a categorical variable will be one less than the number of categories. Thus, for gender, we only need one dummy variable, maybe coded "1" for Female and "0" for Male. For region, we would need three, which might look like this:

• northeast: coded "1" if from the Northeast and "0" otherwise.
• south: coded "1" if from the South and "0" otherwise.
• midwest: coded "1" if from the Midwest and "0" otherwise.

We always need one less than the number of categories because the last one would be perfectly predicted by the others. For example, if we know that northeast, south, and midwest all equal zero, then the observation must be from the West.
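To make the coding concrete, here is a minimal sketch in plain Python (the respondents and their regions are invented for illustration; in practice you would do this recoding inside SPSS):

# Hypothetical region values for five respondents.
regions = ["South", "West", "Northeast", "Midwest", "South"]

# One dummy per category except the omitted baseline (West).
northeast = [1 if r == "Northeast" else 0 for r in regions]
south = [1 if r == "South" else 0 for r in regions]
midwest = [1 if r == "Midwest" else 0 for r in regions]

# A respondent with northeast = south = midwest = 0 must be from the
# West, which is why a fourth dummy would be redundant.

The omitted category (here, the West) becomes the baseline against which the coefficients on the included dummies are interpreted.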

Multiple regression models are typically estimated via ordinary least squares (OLS). OLS produces estimates of the intercept and slopes that minimize the sum of the squared differences between the observed values of the dependent variable and the values predicted by the regression model.

When computing formal statistical tests, it is customary to define the null hypothesis (H0) to be tested. In multiple regression, the standard null hypothesis is that each coefficient is equal to zero. The actual coefficient estimates will not be exactly equal to zero in any particular sample of data simply due to random chance in sampling. The t-tests conducted on each coefficient are designed to help us determine whether the coefficient estimates are different enough from zero to be declared statistically significant. "Different enough" is typically defined as a test with a level of statistical significance, or p-value, of less than 0.05. This would lead us to reject the null hypothesis (H0) that a coefficient equals zero.
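For readers curious about the mechanics, the short sketch below shows how a t-score maps to a two-sided p-value; it uses SciPy, and the t-score and degrees of freedom are made-up numbers, not values from this example:

from scipy import stats

t_score = 2.41  # hypothetical: coefficient estimate / standard error
df = 1344       # hypothetical residual degrees of freedom (n - k - 1)

# Two-sided p-value: the probability of observing a |t| at least this
# large if the null hypothesis (coefficient = 0) were true.
p_value = 2 * stats.t.sf(abs(t_score), df)
print(p_value)  # below 0.05 here, so we would reject H0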

Estimating a Multiple Regression With Dummy Variables Model

To make this example easier to follow, we will focus on estimating a model with just two independent variables, in this case labeled X and D. Let’s further assume that X is a continuous variable while D is a dummy variable coded 1 if the observation has the characteristic associated with D and coded 0 if it does not. The multiple regression model with two independent variables can be defined as in Equation 1:

(1)

Yi = β0 + β1Xi + β2Di + εi

Where:

• Yi = individual values of the dependent variable
• Xi = individual values of the continuous independent variable


• Di = individual values of the dummy independent variable
• β0 = the intercept, or constant, associated with the regression line
• β1 = the coefficient operating on the continuous independent variable
• β2 = the coefficient operating on the dummy independent variable
• εi = the unmodeled random, or stochastic, component of the dependent variable; often called the error term or the residual of the model.

Researchers have values for Yi, Xi, and Di in their datasets – they use OLS to estimate values for β0, β1, and β2. The coefficients β1 and β2 are often called partial slope coefficients, or partial regression coefficients, because they represent the unique independent effect of the corresponding independent variable on the dependent variable after accounting for, or controlling for, the effects of the other independent variables in the model.
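As a minimal sketch of what OLS does with Equation 1, the Python code below simulates data and recovers the coefficients with NumPy's least-squares routine (the variable names and true coefficient values are ours, chosen for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)                  # continuous independent variable
D = rng.integers(0, 2, size=n)          # dummy independent variable (0/1)
Y = 2.0 + 1.5 * X + 3.0 * D + rng.normal(size=n)  # simulated outcome

# Design matrix with a leading column of ones for the intercept beta_0.
A = np.column_stack([np.ones(n), X, D])
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(beta_hat)  # roughly [2.0, 1.5, 3.0]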

Equation 2 can be used to estimate the coefficient operating on the first independent variable, X. The same equation can be altered to estimate the coefficient operating on D as well.

(2)

β̂1 = [∑(xi)(yi) × ∑(di²) − ∑(di)(yi) × ∑(xi)(di)] / [∑(xi²) × ∑(di²) − (∑(xi)(di))²]

Where:

• β̂1 = the estimated value of the coefficient operating on X
• yi = Yi − Ȳ
• Ȳ = the sample mean of the dependent variable
• xi = Xi − X̄
• X̄ = the sample mean of the continuous independent variable

• di = Di − D̄
• D̄ = the sample mean of the dummy independent variable.

The numerator of Equation 2 is based on the product of deviations of X from its mean and deviations of Y from its mean. The sum of these products will determine whether the slope is positive, negative, or near zero. The numerator also accounts for the shared association between D and Y as well as the correlation between D and X. The denominator of Equation 2 adjusts the estimate of β1 to account for how much variability there is in X and D. The result is that β1, which is the marginal effect of X on Y, captures the unique or independent effect of X on Y after accounting for the presence of D. The same logic applies to computing the estimate of β2.
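Continuing the simulated example above, the estimate from Equation 2 can be computed directly from the deviations and checked against the least-squares result:

# Deviations from sample means (the lower-case x, y, d of Equation 2).
x = X - X.mean()
y = Y - Y.mean()
d = D - D.mean()

num = (x * y).sum() * (d * d).sum() - (d * y).sum() * (x * d).sum()
den = (x * x).sum() * (d * d).sum() - (x * d).sum() ** 2
beta1_hat = num / den
print(beta1_hat)  # matches beta_hat[1] from the sketch above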

Once both β1 and β2 are computed, Equation 3 can be used to compute the value for the intercept.

(3)

β̂0 = Ȳ − β̂1X̄ − β̂2D̄

Equation 3 is a simple way to estimate the intercept, β0. We can use this formula because the OLS regression model is estimated such that it always goes through the point at which the means of X, D, and Y intersect.
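Continuing the same sketch, β̂2 comes from the symmetric version of Equation 2 (with the roles of x and d swapped), and Equation 3 then yields the intercept:

# Symmetric formula for the coefficient on the dummy variable.
num2 = (d * y).sum() * (x * x).sum() - (x * y).sum() * (x * d).sum()
beta2_hat = num2 / den  # same denominator as before

# Equation 3: the fitted model passes through the point of means.
beta0_hat = Y.mean() - beta1_hat * X.mean() - beta2_hat * D.mean()
print(beta0_hat)  # matches beta_hat[0] from the earlier sketch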

Note that the formulas presented here are identical to the formulas used for multiple regression generally. The presence of a dummy variable among the independent variables does not change the math of OLS.

As noted above, β1 and β2 represent the marginal effect of X and D, respectively, on the expected value of the dependent variable. That means that when X increases by 1 unit, the expected value of Y will change by an amount equal to


β1. Similarly, when D increases by 1 unit, the expected value of Y will change by an amount equal to β2.

For the continuous independent variable X, a 1-unit increase represents some incremental increase in that variable. It might mean an increase of 1 dollar, 1 thousand dollars, 1 inch, 1 year, and so forth. In contrast, a 1-unit increase in the dummy variable D represents shifting from the absence of some attribute (D = 0) to the presence of that attribute (D = 1). Moving from 0 to 1 constitutes the entire range of D. As a result, the estimate of β2 can be interpreted as the mean difference between observations where D = 0 and observations where D = 1 after accounting for the effects of the other independent variables in the model. In that way, regression with dummy variables effectively conducts a difference of means test for the dependent variable across the two categories of the dummy independent variable in question while controlling for the other independent variables in the model. Note that in this setting, the model assumes equal variance in the dependent variable across the two groups defined by D.
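The "difference of means" interpretation is easy to verify in the running sketch: at any fixed value of X, the model's predictions for D = 1 and D = 0 differ by exactly β̂2:

x_fixed = 1.0  # any value of X; the gap below does not depend on it
pred_d1 = beta0_hat + beta1_hat * x_fixed + beta2_hat * 1
pred_d0 = beta0_hat + beta1_hat * x_fixed + beta2_hat * 0
print(pred_d1 - pred_d0)  # equals beta2_hat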

Finally, if you have a categorical variable with more than two categories, you would construct a dummy variable for each of those categories except one and include all of those dummy variables in the regression. Returning to our earlier example, if you thought your dependent variable was influenced by region, you could include the three regional dummy variables we constructed – northeast, south, and midwest – as independent variables in the model. The coefficient estimate for each one would capture the difference between that particular region and the region that was left out of the model, which in this case was the West.

Assessing Model Fit

The most common way of assessing how well a regression model fits the data is to use a statistic called R-squared (also written R²). R-squared measures the

proportion of variance in the dependent variable that is explained by the set of independent variables included in the model, and will always fall between 0 and 1. The formula for R-squared can be written many ways – we show one version in Equation 4:

(4)

R² = 1 − RSS / TSS

Where:

• RSS = the Residual Sum of Squares, or ∑(εi²)
• TSS = the Total Sum of Squares, or ∑(Yi − Ȳ)²

R-squared can also be thought of as the square of the Pearson correlation coefficient measuring the correlation between the actual values of the dependent variable and the values of the dependent variable predicted by the regression model.

Because R-squared will never decrease – and typically increases – as more independent variables are added, many scholars prefer the Adjusted R-squared. This statistic adjusts the value of R-squared downward based on how many independent variables are included in the model. The formula is:

(5)

Adjusted R² = 1 − [RSS / (n − k − 1)] / [TSS / (n − 1)]

Where:

• RSS and TSS are as before
• n = the sample size for the model
• k = the number of independent variables included in the model.
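Both fit statistics are easy to compute by hand in the running sketch (Equations 4 and 5):

resid = Y - A @ beta_hat              # residuals from the fitted model
RSS = (resid ** 2).sum()              # residual sum of squares
TSS = ((Y - Y.mean()) ** 2).sum()     # total sum of squares

k = 2                                 # independent variables: X and D
r_squared = 1 - RSS / TSS
adj_r_squared = 1 - (RSS / (n - k - 1)) / (TSS / (n - 1))
print(r_squared, adj_r_squared)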

Assumptions Behind the Method

Nearly every statistical test relies on some underlying assumptions, and they are all affected by the mix of data you happen to have. Different textbooks present the assumptions of OLS regression in different ways, but we present them as follows:

• The dependent variable is a linear function of the independent variables.
• Values of the independent variables are fixed in repeated samples; most critical here is that the independent variables are not correlated with the residual.
• The expected mean of the residual equals zero.
• The variance of the residual is constant (i.e., homoskedastic).
• The individual residuals are independent of each other (i.e., not correlated with each other).
• The residuals are distributed normally.
• There is no perfect collinearity among the independent variables.

If these assumptions hold, it can be shown that OLS produces the best linear unbiased estimates of the coefficients in the model. OLS is also fairly robust to moderate violations of these assumptions.
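Some of these assumptions can be checked directly from the residuals. A minimal sketch, again on the simulated data (in SPSS you would typically inspect residual plots instead):

print(resid.mean())                  # ~0: holds by construction with an intercept
print(np.corrcoef(resid, X)[0, 1])   # ~0: residuals uncorrelated with X
print(np.corrcoef(resid, D)[0, 1])   # ~0: residuals uncorrelated with D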

Illustrative Example: Modelling Weight in the 2012 General Social Survey

This example explores whether a person’s weight can be modeled as a linear function of a person’s height, age (and age squared), and family income as well as two dummy variables: whether the person is female and whether the person is

a non-smoker. The primary research question guiding this example is:

Do non-smokers on average weigh more or less than smokers, controlling for other factors?

We can also state this in the form of null hypotheses:

H0 = After accounting for the effect of height, age, gender, and income, there is no difference in weight between non-smokers and smokers.

In order to keep the example manageable, we treat the remaining independent variables as control variables and do not discuss them in great detail.

The Data

This example uses several variables from the 2012 General Social Survey:

• The respondent’s weight (rweight), measured in pounds (the dependent variable).
• The respondent’s height (rheight), measured in inches.
• Whether the respondent is female (female), coded 1 = Yes and 0 = No.
• The respondent’s age (age), coded in years.
• The respondent’s age squared (age2), which is just age in years squared.
• The respondent’s family income (income), coded into categories from 1 to 25.
• Whether the respondent is a non-smoker (nosmoke), coded 1 = Yes and 0 = No.
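If you want to follow along outside SPSS, here is a sketch of loading these variables with pandas (the file name is hypothetical; the variable names are those listed above):

import pandas as pd

# Hypothetical file name; the downloadable dataset's format may differ.
gss = pd.read_csv("gss2012_subset.csv")
cols = ["rweight", "rheight", "female", "age", "age2", "income", "nosmoke"]
gss = gss[cols].dropna()
print(gss.describe())  # summary statistics like those quoted below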

The sample dataset includes 1351 respondents. The average weight of respondents to the survey is just over 178 pounds, while the average height is nearly 67 inches. Almost 55 percent of the respondents are female, with an average age of almost 50 years old. The average income falls between $40,000

and $49,000 per year. Turning to the independent variable of interest, nearly 76 percent of respondents are non-smokers, leaving 24 percent who do smoke.

Analyzing the Data

Before producing the full regression model, it is a good idea to look more carefully at the dependent variable. Figure 1 presents a histogram for the weight variable.

Figure 1: Histogram showing the distribution of respondent weight measured in pounds, 2012 General Social Survey.

Figure 1 shows that the majority of values for weight fall near the mean of 178. Very few respondents report weights below 100 pounds, but a substantial number of respondents report weights of 200 to 250 pounds. A handful of respondents report weights of 300 pounds or greater. The distribution shown in Figure 1 suggests that most of the data is distributed reasonably close to normal, though there is a positive skew driven mostly by a few outliers. Researchers might want to explore whether the handful of cases with particularly large values for weight have an undue influence on the results. We also recommend doing similar descriptive analysis of each independent variable, but we leave that to readers so we can move toward estimating the model itself.
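A histogram like Figure 1 takes only a few lines with matplotlib (continuing the hypothetical pandas sketch above):

import matplotlib.pyplot as plt

plt.hist(gss["rweight"], bins=30)
plt.xlabel("Respondent weight (pounds)")
plt.ylabel("Frequency")
plt.title("Distribution of respondent weight, 2012 GSS")
plt.show()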

Regression results are often presented in a table that reports the coefficient estimates, standard errors, t-scores, and levels of statistical significance. Table 1 presents the results of regressing weight on the set of independent variables described above.

Table 1: Results from a multiple regression model where respondent weight is regressed on a number of factors, 2012 General Social Survey.

              Coefficient   Std. Error   t-score   Sig.
Constant          −164.15        26.46     −6.20   .000
Height               4.59         0.36     12.81   .000
Female              −8.87         2.91     −3.05   .002
Age                  1.87         0.37      5.07   .000
Age Squared         −0.02        0.004     −5.25   .000
Family Income       −0.70         0.20     −3.50   .000
Non-Smoker          12.98         2.57      5.05   .000

Table 1 reports results for the full model, but we focus attention on the dummy

variable for being a non-smoker. The results in Table 1 report an estimate for the coefficient operating on this variable of 12.98, which is statistically significant. This means that a 1-unit increase in the non-smoker dummy variable is associated with an average increase in weight of nearly 13 pounds. In other words, after controlling for the effects of the other variables in the model, the average difference in weight between smokers and non-smokers is nearly 13 pounds.

The remaining results in Table 1 conform to what we likely suspected in terms of their direction, and all of the estimated partial slope coefficients reach conventional levels of statistical significance. The R-squared for the model is 0.267, which means that about 26.7 percent of the variance in respondent weight is explained by the independent variables in the model.
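Results like those in Table 1 can be reproduced outside SPSS with statsmodels, continuing the hypothetical pandas sketch above; the summary reports the same coefficients, standard errors, t-scores, and p-values:

import statsmodels.api as sm

y = gss["rweight"]
X_mat = sm.add_constant(gss[["rheight", "female", "age", "age2", "income", "nosmoke"]])
model = sm.OLS(y, X_mat).fit()
print(model.summary())  # coefficients, std. errors, t-scores, Sig., R-squared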

There are multiple diagnostic tests researchers might perform following the estimation of a regression model to evaluate whether the model appears to violate any of the OLS assumptions or whether there are other kinds of problems, such as particularly influential cases. Describing all of these diagnostic tests is well beyond the scope of this example.

Presenting Results

The results of a multiple regression can be presented as follows:

"We used a subset of data from the 2012 General Social Survey to test the following null hypothesis:

H0 = After accounting for the effect of height, age, gender, and income, there is no difference in weight between non-smokers and smokers.

The data include 1351 individual respondents, and the regression model includes a number of control variables. Results presented in Table 1 show that there is a positive and statistically significant relationship between weight and being a

non-smoker. Specifically, the results show that non-smokers on average weigh nearly 13 pounds more than do smokers, controlling for the effects of the other independent variables in the model. This result is statistically significant, meaning that we should reject the null hypothesis of no difference. The remaining partial slope coefficients estimated for this model are all in the expected direction and are all statistically significant as well. The R-squared for the model is 0.267, which means that about 26.7 percent of the variance in respondent weight is explained by the independent variables in the model. Further diagnostic tests should be explored to evaluate the robustness of these findings."

Review

Multiple regression allows researchers to model a continuous dependent variable as a linear function of two or more independent variables. This example focused specifically on the situation where one (or more) of those independent variables is categorical and how to use dummy variables in response. This boils down to testing the difference in the mean of the dependent variable between the two groups designated by the dummy variable in question while controlling for the effects of the other independent variables in the model. Coefficients for a multiple regression model are typically estimated via OLS. Rejecting or failing to reject the null hypothesis that a partial slope coefficient equals zero tells us whether the dependent variable is a linear function of the independent variable in question. However, it does not say anything about whether there is some other form of association between the dependent variable and any of the independent variables. Two-way scatter plots comparing the dependent variable to each independent variable can be useful for exploring more complicated relationships, but only partially so because they only permit exploration of one independent variable at a time.

You should know:

• What types of variables are suitable for multiple regression with dummy variables.
• The basic assumptions underlying OLS regression.
• How to estimate and interpret a multiple regression model that includes dummy variables.
• How to report the results of a multiple regression with dummy variables model.

Your Turn

You can download this sample dataset along with a guide showing how to estimate a multiple regression with dummy variables model using statistical software. See if you can replicate the analysis presented here. Next, try estimating the model separately for men and women.
