Chapter 12 Multiple Linear Regression

Introduction

Multiple linear regression (MLR) is an extension of simple linear regression. In the previous chapter we considered a single dependent variable, y, and a single independent variable, x. MLR is used when there are two or more independent variables. The population model is

yi = β0 + β1x1i + β2x2i + ⋯ + βkxki + εi

It should not be surprising that most interesting phenomena are too complex to be modeled using just a single independent variable. Corn yield, from the previous chapter, might be more accurately modeled using an equation of this form with

yi  corn yield per acre

x1i  amount of water applied per acre

x2i  amount of fertilizer applied per acre

x3i  type of soil the corn is planted in.

In practice, the Keynesian consumption function considered in the previous chapter is probably too simple to be of much use. A better consumption function might be given by an equation of this form with

yi  real household consumption

x1i  household real income

x2i  the real interest rate

x3i  household real wealth.

MLR allows a much more comprehensive model than does simple linear regression and should provide superior predictions.

Figure 12.1 is a time series plot showing how Arizona state and local government employment has changed over time. It does not seem that a single straight line will fit this data very well. You might be curious about the periodic dips in government employment. The dips generally represent school teachers who are not considered (using the government definition of employment) to be employed in the summer.

Figure 12.1 Arizona state and local government employment

12.1 The assumptions of MLR

MLR relies on several assumptions. These are necessary for the mathematics to work out properly, and we need not worry about those details. We do need to worry about what happens to the MLR process when the assumptions are not met (usually we will say "when the assumptions have been violated"). Strange and awful things can happen when the assumptions have been violated, as we will see later. The assumptions used in MLR are:

1. The error terms εi are normally distributed.
2. The error terms are independent of past error terms, that is, E(εi | εi-1, εi-2, ⋯) = E(εi).
3. The populations all have equal variances: σ1² = σ2² = σ3² = ⋯
4. The independent variables are not correlated with each other.
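To make these assumptions concrete, here is a minimal simulation sketch in Python (the variable names and parameter values are made up for illustration, not taken from the chapter's data) that generates data satisfying all four assumptions:

import numpy as np

rng = np.random.default_rng(42)
n = 30

# Assumption 4: the independent variables are generated independently of each other
x1 = rng.uniform(0, 100, n)
x2 = rng.uniform(0, 10, n)

# Assumptions 1-3: normal errors, independent draws, one common variance
eps = rng.normal(0, 5.0, n)

# Hypothetical population parameters
beta0, beta1, beta2 = 20.0, 0.9, -4.0
y = beta0 + beta1 * x1 + beta2 * x2 + eps

When any of these four conditions is changed (correlated x's, unequal error variances, dependent errors), the problems described later in the chapter begin to appear.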

12.2 The MLR output

In this section we will examine an MLR output from the SPSS statistical software package. You will find there is some carryover from the previous chapters, but there is some new material as well. In the previous chapter we considered a regression equation with household consumption as the dependent variable and household income as the independent variable. It seems quite reasonable that other variables might affect consumption as well. The classical economists would maintain that interest rates affect

consumption (as interest rates increase, people save more and consume less). Other economists believe that household wealth is a determining factor in the household consumption decision. Let's consider an MLR equation

yi0   1 x 1 i   2 x 2 i   3 x 3 i ⋯   k x ki   i where

yi  real household consumption

x1i  household real income

x2i  the real interest rate

x3i  household real wealth.

The SPSS regression output is shown in Fig. 12.2. The portion of Fig. 12.2 labeled Coefficients contains values for the β̂'s, t-statistics, and p-values. The values of the β̂'s are found in the column labeled Unstandardized Coefficients. Again we will ignore the column labeled Standardized Coefficients. The column labeled Std. Error contains the values of s_β̂, the standard errors of the β̂'s. The column labeled t contains the t-values used in hypothesis testing, and the column labeled Sig contains the p-values. As before, we want to determine whether we believe that a particular β = 0 or not. Recall that β = 0 means that an independent variable does not cause a change to occur in the dependent variable.

β̂1 (income) = 0.888, p = 0.000
β̂2 (wealth) = -9.16E-02, p = 0.676
β̂3 (interest rate) = -4.731, p = 0.062

Now we can see if there is a statistically significant relationship between household income and household consumption. The easiest way to do this is with the p-value, which is located in the column labeled Sig. Note that the p-value is that for a two-tailed test. Suppose that the level of significance is α = 0.05. Then for β1, p = 0.000 < 0.05 and we would reject the hypothesis H0: β1 = 0. Thus we would conclude that a statistically significant relationship exists between household income and household consumption. You might note that the large value of the test statistic, t = 27.691, would certainly lead to the rejection of the null hypothesis for any reasonable level of significance.
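The chapter reads these quantities off the SPSS printout. If you want to reproduce the same quantities in code, the sketch below uses Python's statsmodels package; the data file and the column names (cons, income, wealth, rate) are assumptions for illustration, not part of the chapter's data set.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("consumption.csv")                      # hypothetical data file

X = sm.add_constant(df[["income", "wealth", "rate"]])    # add_constant supplies the intercept term
model = sm.OLS(df["cons"], X).fit()

print(model.params)    # the beta-hat estimates (Unstandardized Coefficients)
print(model.bse)       # standard errors (the Std. Error column)
print(model.tvalues)   # t statistics (the t column)
print(model.pvalues)   # two-tailed p-values (the Sig column)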

Figure 12.2 The SPSS regression output for a regression function using household consumption as the dependent variable and household income, household wealth, and the real interest rate as explanatory variables.

The row labeled WEALTH contains information about β̂2, the coefficient associated with household real wealth:

β̂2 = -9.16E-02 = -0.0916

s_β̂2 = 0.217

Consider the hypothesis

H0: β2 = 0

HA: β2 ≠ 0

and consider the p-value, p = 0.676. Here we would not reject the null hypothesis, so we cannot claim that a statistically significant relationship exists between household wealth and household consumption at a 5% level of significance.

Next, let's look at real interest rates (x3). The estimates are

β̂3 = -4.731

s_β̂3 = 2.041

The p-value, p = 0.062, indicates that we would reject the null hypothesis at a 10% level of significance but not at the 5% level. The negative sign indicates that when real interest rates increase, household consumption decreases (households save more and consume less).

You may have noticed that β̂3 = -4.731 has a greater magnitude than β̂1. Do not think that this means that interest rates have a greater influence on consumption than does household income. The size of a coefficient depends on the units used when recording the data. We could represent a five percent rate of interest as 5 or as 0.05. If we use 0.05 the resulting coefficient will be 100 times as large as the result we would get using 5. What we can say is that a one unit increase in household income will increase consumption by 0.888 units and that a one unit increase in interest rates will decrease consumption by 4.731 units, but it matters what the units are.

12.3 The ANOVA table: Testing H0: β1 = β2 = β3 = ⋯ = βk = 0

We can also test the regression equation as a whole rather than just the individual parts. We can use information in the ANOVA table to test the hypothesis

H0: β1 = β2 = β3 = ⋯ = βk = 0

HA: Not all of the β's = 0

We can test this using the p-value from the ANOVA table (it is in the column labeled Sig). This is p = 0.000 and we would reject the null hypothesis for any reasonable level of significance. Thus we believe we have sufficient evidence to conclude that at least one of the β's is not zero. The very large value of R2 = 0.945 also indicates that the regression equation fits the data quite well. A final bit of information we will use is that an estimate of the population standard deviation can be found in the column labeled Std. Error of the Estimate. That value is s = 5.4551, which is a point estimate for the population standard deviation σ.

Note that the ANOVA table contains a column labeled df (this stands for degrees of freedom). The value df = 3 for the REGRESSION row indicates there are 3 slope terms in the regression equation. The number of observations can be obtained from the df for the TOTAL row: n = dfTOTAL + 1 = 29 + 1 = 30.
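Continuing the statsmodels sketch from above (same hypothetical file and column names), the quantities discussed in this section can be read directly from the fitted model:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("consumption.csv")                      # hypothetical data file
X = sm.add_constant(df[["income", "wealth", "rate"]])
model = sm.OLS(df["cons"], X).fit()

print(model.df_model)                 # number of slope terms (the REGRESSION df)
print(model.df_resid)                 # n - k - 1 (the RESIDUAL df)
print(int(model.nobs))                # n = dfTOTAL + 1
print(model.fvalue, model.f_pvalue)   # the F statistic and its p-value from the ANOVA table
print(model.rsquared)                 # R2
print(model.mse_resid ** 0.5)         # standard error of the estimate, a point estimate of sigma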

Usually we experiment with different regression models until we find one with only statistically significant coefficients. In this example, let's rerun the model excluding wealth from the equation. The new MLR results are shown in Fig. 12.3. Both β̂1 = 0.875 (income) and β̂2 = -3.735 (interest rates) suggest statistically significant relationships because each has a p-value of 0.000. So now the regression equation contains only statistically significant relationships.

Once we have decided that we have a good regression equation, we can use it to predict.

Suppose that we want a prediction for consumption when x1 = 1000 and x2 = 5. Then

Cons = 117.977 + 0.875x1 - 3.735x2
Cons = 117.977 + 0.875(1000) - 3.735(5)
Cons = 974.302
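The same prediction can be produced in code. A minimal sketch, again with hypothetical column names, refits the two-variable model and predicts for x1 = 1000 and x2 = 5:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("consumption.csv")                      # hypothetical data file
X = sm.add_constant(df[["income", "rate"]])              # wealth has been dropped
model2 = sm.OLS(df["cons"], X).fit()

# predict() takes the regressors in the same order as X: [constant, income, rate]
print(model2.predict([[1.0, 1000.0, 5.0]]))              # roughly 974.3 with coefficients like those in Fig. 12.3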

Note: the data used in this example were created by me and do not represent real-world data at all. There is an advantage to this: I know the true values of the β's. The true values are β1 = 0.9 and β2 = -4. In this case, it looks like MLR has done a pretty good job of estimating the values of the slope parameters.

Figure 12.3 SPSS regression output using household consumption as the dependent variable and household income and interest rates as independent variables.

12.4 Adjusted R2

Another term that we have not yet discussed in the MLR output is Adjusted R2 (Adjusted R Square on the printout). Any time you add a variable to a regression equation you will increase R2. This is true regardless of whether or not there is a statistically significant relationship between the dependent variable and the variable added to the equation. So you can make R2 as large as you want by adding more and more variables, because any variable will have some correlation with the dependent variable. Adjusted R2 is a statistic that helps determine whether a variable should be included in the regression equation or not. A statistically insignificant variable will increase R2 but decrease Adjusted R2. Fig. 12.4 shows the values of R2 and Adjusted R2 for regression equations using different combinations of the variables x1, x2, x3, and x4 as independent variables. The independent variables included in the equation are shown in the leftmost column. Note that when x4 is eliminated from the equation R2 decreases but Adjusted R2 increases. This suggests that x4 probably does not belong in the equation. When x3 is dropped from the equation, a large drop occurs in both statistics; this suggests that x3 belongs in the equation (unless some other evidence exists for dropping it).
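For reference, Adjusted R2 is computed from R2, the number of observations n, and the number of slope terms k with the standard formula Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1); this is a textbook formula, not something printed on the SPSS output. A quick check in Python:

def adjusted_r2(r2, n, k):
    # Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.945, 30, 3))    # about 0.939 for the consumption example of Fig. 12.2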

Figure 12.4. Different models showing how dropping a variable can affect R2

12.5 Collinearity

Let's consider another example. The model is

yi0   1 x 1 i   2 x 2 i   3 x 3 i   i

The regression output is shown in Fig. 12.5. A close look at this regression output indicates that something strange has happened.

First note that the value of R2 is large, indicating a statistically significant relationship between the dependent variable and some of the independent variables. The p-value for the ANOVA table is p = 0.000, meaning we could reject H0: β1 = β2 = β3 = 0 for almost any level of significance. These two bits of evidence strongly suggest that at least one of the β's does not equal zero.

This evidence is contradicted when we look at the statistics for the individual β's, however.

ˆ 0  3.371 p  0.111 ˆ 1  3.382 p  0.159 ˆ 2  3.560 p  0.448 ˆ 3  0.124 p  0.478

The p-values for all the β̂'s are fairly large. We would not reject the hypothesis that any slope coefficient is zero for any level of significance smaller than 15%. In short, this is the sort of result we would expect to get if none of the independent variables had a statistically significant relationship with the dependent variable.

Figure 12.5. A regression output of collinear data

So what went wrong? Should we conclude that MLR does not work? No; what we have here is a violation of one of the assumptions of MLR. We violated the assumption that says the independent variables must not be correlated with each other. The scatterplot of x1 and x2 is shown in Fig. 12.6. It seems pretty clear that these two variables are correlated. The scatterplot of x1 and x3 is shown in Fig. 12.7. These two variables are not correlated with each other. The violation of the assumption that the independent variables are not correlated is called collinearity (or multicollinearity). The problem that collinearity creates is that it makes a mess of the statistics. As above, you can conclude both that there is and that there is not a statistically significant relationship between the independent and dependent variables. Most modern statistical packages now provide at least some means of helping detect collinearity (though none is perfect). SPSS calculates a variance inflation factor (VIF). If the VIF for a variable is between 0 and 10, collinearity is presumed not to exist. If the VIF is greater than 10, it is presumed to exist.

The VIF for x1 in Fig. 12.5 is 839.712 and the VIF for x2 is 839.976. This suggests that x1 and x2 are each correlated with some other independent variable. The VIF for x3 is 1.017, indicating that it is not correlated with any other independent variable.

Figure 12.6. Correlated independent variables

Figure 12.7. Uncorrelated independent variables

Value of VIF    Collinearity?
0-10            No
>10             Yes

Table 12.1 Values of the variance inflation factor
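If you want to compute VIFs yourself rather than read them off the SPSS printout, a minimal sketch using statsmodels is shown below; the file name and column names are hypothetical.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("collinear_example.csv")        # hypothetical file with columns y, x1, x2, x3
X = sm.add_constant(df[["x1", "x2", "x3"]])      # include the constant, as the SPSS calculation does

for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))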

Next let's eliminate one of the collinear variables (x2) from the regression and see what happens. The new regression output is shown in Fig. 12.8.

Figure 12.8 A regression output with no collinear variables

The VIFs suggest that the two remaining variables are not collinear. Note that there now does seem to be a statistically significant relationship between y and x1, which had been obscured by the collinearity. Collinearity can create a number of problems. Significant relationships can appear to be insignificant, and insignificant relationships can appear significant. In some cases the coefficients may have the wrong sign. Suppose that a regression equation produced a negative value for the marginal propensity to consume. This would mean that an increase in income would cause a decrease in consumption. Clearly this would be incorrect. But that is one of the hazards of collinearity: the chance of reaching the wrong conclusion.

12.6 Heteroscedasticity

Heteroscedasticity comes from Greek roots meaning roughly "different scatter". Homoscedasticity roughly means "same scatter". This refers to one of the assumptions of MLR, namely assumption 3 from the list in Sec. 12.1. This assumption is that all populations have the same variance (the same scatter). If they don't have the same scatter (heteroscedasticity), then this assumption has been violated. An example of heteroscedastic data is shown in Fig. 12.9. An example of homoscedastic data is shown in Fig. 12.10.

Figure 12.9. Heteroscedastic data

Figure 12.10. Homoscedastic data

Figure 12.11 Regression output of the heteroscedastic data.

Figure 12.12. Regression output of the homoscedastic data

The MLR output in Fig. 12.11 should be compared with that in Fig. 12.12. The effects are similar to those for collinearity. In Fig. 12.11 the p-value in the ANOVA table is p = 0.000, strongly suggesting that a statistically significant relationship is present, but the p-value for β̂1 is 0.117, which suggests that there is not a statistically significant relationship.

Unfortunately, there is no particularly good way of detecting the presence of heteroscedasticity; that is, we can't compute a single number like the VIF. One popular technique is to plot the values of the dependent variable against the values of the residual terms. The residuals (or error terms) are given by

ei = yi - ŷi

where yi is the i-th observed value of the dependent variable and ŷi is the prediction for that observation. If the variance of the populations is constant, then we should not expect to find any relationship between the error terms and the dependent variable. Fig. 12.14 is an example of such a graph; it uses the residuals from the regression in Fig. 12.12, the homoscedastic case. Fig. 12.13 uses the residuals from the regression shown in Fig. 12.11, the heteroscedastic case. In that case small observed values of y seem to produce small residuals and large observed values of y seem to be associated with large residuals. No such pattern is present in the homoscedastic case. Generally, if you can detect a pattern in a plot where one of the variables is the residuals from a regression equation, there is a problem.
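A minimal sketch of such a residual plot in Python (hypothetical file and column names) is:

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("hetero_example.csv")           # hypothetical file with columns y, x1
model = sm.OLS(df["y"], sm.add_constant(df[["x1"]])).fit()

plt.scatter(df["y"], model.resid)                # observed dependent variable vs. residuals
plt.axhline(0, linewidth=1)
plt.xlabel("Observed y")
plt.ylabel("Residual")
plt.show()                                       # a fan or funnel shape suggests heteroscedasticity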

Figure 12.13. Heteroscedastic residuals

Figure 12.14 Homoscedastic residuals

12.7 Serial correlation

Serial correlation is a violation of the assumption that the error terms are statistically independent. This problem occurs when the error for one observation is somehow related to the errors of other observations. It most often arises with time series data. Time series data occur when the data are observed and recorded at distinct time periods, usually daily, weekly, monthly, quarterly, or yearly. For example, with the consumption function example, data on income, consumption, and interest rates might be collected every quarter. Another way of collecting data would be to survey several households at roughly the same time and record each household's consumption and income. This sort of data is called cross-sectional data. Note that we would not record interest rates in the case of cross-sectional data because we would not expect interest rates to change from family to family.

The order in which the data are recorded should not be important with cross-sectional data. It should not matter which household is recorded as observation 10 or 3 or 49 in a household survey of income and consumption. Such is not the case with time series data, where order is important: the observation for 1949 should come before the observation for 1950. Economic conditions that prevail in one time period may persist for a number of time periods. Recessions, for example, often extend over more than just a couple of time periods. If economic activity is below (or above) normal in one time period, it may be expected to be below (or above) normal in the next time period.

Suppose that interest rates did affect household consumption and we left them out of the regression equation. If interest rates varied randomly over time, perhaps not much damage would result to our regression model. But interest rates don't seem to vary randomly over time. You probably have heard the expression that interest rates are "low" or that they are "high". What people mean by this is that rates have been "low" or "high" for an extended period of time. If interest rates do affect consumption and we leave them out of the regression equation, then the resulting equation will systematically under- or over-predict consumption depending on whether interest rates are "low" or "high" at that particular time. This means the error terms will tend to be positive or negative during these time periods, so you might see a period of negative error terms followed by a period of positive error terms. MLR assumes that the error terms are randomly distributed without any pattern. The most common cause of serial correlation is the omission of a variable which should be present in the regression equation.

In the following example the true model is

yt = β0 + β1x1t + β2x2t + εt

Suppose that we don't know the nature of the true model (who knows everything?) and we think the true model is

yt = β0 + β1x1t + εt

Again we use computer-generated data to emphasize a particular point. The regression results obtained from the model with the missing independent variable are given in Fig. 12.15. The regression results using the true model are shown in Fig. 12.16.

Figure 12.15 Regression results with a missing variable

The results in Fig. 12.15 suggest that no statistically significant relationship exists between yt and x1t. The p-value from the coefficient table is p = 0.274, which is considered very weak evidence for a relationship. Identical results appear in the ANOVA table. Finally, the value R2 = 0.025 is very close to zero, also suggesting the lack of a statistically significant relationship.

If x2t is now included in the regression equation, the results are those shown in Fig. 12.16. These results seem quite good. The p-values all suggest the existence of statistically significant relationships. Note the huge increase in the value of R2. There is no evidence of collinearity. Because this is the true model, it should not be too surprising to find good results.

Figure 12.16. Regression results with no missing variable

If you check the portion of the table that contains R Square you will see a new item at the right end of the table, the Durbin-Watson (DW) statistic. We will use the rules of thumb for the DW statistic given in Table 12.2.

DW statistic    Serial correlation present?
0-1             Yes
1-1.5           Don't know
1.5-2.5         No
2.5-3           Don't know
3-4             Yes

Table 12.2 The Durbin-Watson statistic
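The DW statistic can also be computed directly from a fitted model's residuals; a minimal statsmodels sketch (hypothetical file and column names) is:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("timeseries_example.csv")       # hypothetical file with columns y, x1
model = sm.OLS(df["y"], sm.add_constant(df[["x1"]])).fit()

print(durbin_watson(model.resid))                # values near 2 suggest no serial correlation (Table 12.2)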

So if DW has a value of 2.2, say, we would conclude that serial correlation is not a problem. If DW has a value of 0.9 we would say that serial correlation is a problem. If DW = 1.3 then we simply can't say. The regression output in Fig. 12.15 has DW = 0.037, a strong indication that serial correlation exists. The regression output in Fig. 12.16 has DW = 1.885 and we would conclude that serial correlation is not a problem for that regression equation. We can also look at a time series plot of the residuals to see if a pattern exists. If so, then serial correlation is likely a problem. Fig. 12.17 shows a plot of residuals that do not contain a distinct pattern (if there is a pattern there, I can't see it). The time series residual plot for the regression in Fig. 12.15 is shown in Fig. 12.19. Another typical pattern of residuals is given in Fig. 12.18, where the residuals tend to have positive values for a long period of time followed by negative values for a long period of time. One would expect that the residuals would have randomly determined positive and negative values.

Figure 12.17 Residuals with no apparent pattern

Figure 12.18 Residuals with a pattern (The residuals seem to be consistently positive or negative)

Figure 12.19 Another set of residuals with a pattern

We have seen that the presence of serial correlation can produce unreliable parameter estimates and give misleading conclusions about the statistical significance of the regression equation. What about using the resulting regression equation for prediction? The plot of the residuals suggests that the equation with the missing variable will, at times, give very large residuals. Suppose we find a prediction for y using the model in Fig. 12.15 (the model with the missing variable), say the prediction for the 50th value of y. The values of the independent variables there are x1t = 50.00 and x2t = 148.41. (Don't look for these values; they are values in the data set, but I haven't given it to you.) The observed value of y at that point is 56.09. Using the model in Fig. 12.15 the prediction is

ŷ = β̂0 + β̂1x1 = -44.594 - 0.433(50) = -66.24.

Using the “true” model in Figure 12.16

ˆ ˆ ˆ yˆ 0   1 x 1   2 x 2 9.494  4.972(50)  1.990(148.41)  56.23.

The prediction using the "true" model (56.23) is much closer to the observed value of y (56.09) than the prediction using the model with the missing variable (-66.24). That prediction is not only far off, it even has the wrong sign. Not only are the parameter estimates drastically wrong in this case, the predictions are as well.

12.8 Dummy variables

Often business and economic data will contain observations which cannot be easily modeled using the usual causal variables. For example, we might want to predict output for a particular industry, but that industry might have been subject to a strike during the period of data collection. Such data might look like that in Fig. 12.20. The strike occurred in time periods 50, 51, and 52. Suppose that we have a pretty good regression model (ignoring the strike period) given by

yt = β0 + β1xt + εt

We might expect the model to have some problems if we ignore the strike period. The regression output is shown in Fig. 12.21. The time series plot of the actual and predicted values is shown in Fig. 12.22.

Figure 12.20 Time series data of industry output with a strike occurring in time periods 50, 51, and 52.

The results of the regression look pretty bad. The p-values from the ANOVA table and from the coefficient table both suggest that there is no relationship between time (the independent variable) and output. The value of R2 is also quite small. Finally, the value of the DW statistic (DW = 0.726) suggests that there is a variable missing from the regression model. What we haven't accounted for is the strike.

Figure 12.21. The output for the strike data using a regression model without a strike dummy variable

Figure 12.22 Plot of actual vs predicted values for industry output without using a strike dummy variable.

A "dummy variable" can be used to model the effects of the strike in the regression model. A dummy variable is one that has a value of zero when the phenomenon is absent and a value of one when it is present. For this example, the strike lasts for time periods 50, 51, and 52, so the dummy is given a value of one for these time periods. The dummy is assigned a value of zero for the preceding and subsequent time periods, as shown in Table 12.3.

Time    Strike dummy (SD)
48      0
49      0
50      1
51      1
52      1
53      0
54      0

Table 12.3 A dummy variable. Variable SD has a value of 1 while a strike is in progress and a value of 0 when one is not in progress.
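Constructing such a dummy in code is straightforward. The sketch below builds a small hypothetical data set (the numbers are invented, not the chapter's data), marks periods 50-52 as the strike, and includes the dummy in the regression like any other independent variable:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"time": np.arange(45, 60)})
df["output"] = 200 + 0.15 * df["time"] + rng.normal(0, 2, len(df))

# Strike dummy: 1 during the strike (periods 50, 51, 52), 0 otherwise
df["SD"] = df["time"].isin([50, 51, 52]).astype(int)
df.loc[df["SD"] == 1, "output"] -= 195            # the strike sharply reduces output in the fake data

X = sm.add_constant(df[["time", "SD"]])
print(sm.OLS(df["output"], X).fit().params)       # the SD coefficient captures the drop during the strike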

Figure 12.23 Output from a regression model that includes a dummy variable to represent a strike

Figure 12.24. A plot of the predicted vs actual values for the regression model that includes the strike dummy variable.

The regression output which includes the dummy variable is shown in Fig. 12.23 and a plot of the actual vs predicted values is shown in Fig. 12.24. The value of the Durbin-Watson statistic is greatly improved, with DW = 2.170. The value of R2 is also considerably improved. The ANOVA table indicates that at least one slope term is significantly related to the dependent variable. The coefficient table shows that both independent variables have a statistically significant relationship. The plot in Fig. 12.24 also seems better than the one in Fig. 12.22. The regression equation for Fig. 12.22 predicts that output would have a tendency to decrease with time, while the equation for Fig. 12.24 shows that it increases with time.

We can use the regression equations to predict output at time period 50, where SD = 1. The regression model that does not include the strike dummy gives

ˆ ˆ yˆ 0   1( time )  201.5  (  0.175)(50)  192.75 while the model with the strike dummy is

ˆ ˆ ˆ yˆ 0   1( time )   2 ( SD )  198.58  (0.144)(50)  195.9(1)  9.88 and clearly the model with the strike dummy makes much better predictions during a strike. The prediction the model makes in time period 53 where there is no strike is

27 ˆ ˆ ˆ yˆ 0   1( time )   2 ( SD )  198.58  (0.144)(53)  195.9(0)  206.21.

Another thing to note is that the model with the strike dummy has a positive slope on the time variable (0.144), whereas the regression equation without the strike dummy has a negative slope on the time variable (-0.175).

Dummy variables can be particularly useful when you can predict that something out of the ordinary, like a strike, might occur. Other uses for dummy variables include predicting sales of seasonal goods (Christmas trees, soft drinks in summer, automobiles, etc.).

12.9 Applications of regression

Example 12.1. In the text we discussed some of the factors that might affect real household consumption, namely real income, real wealth, and the real interest rate. It is very difficult to get aggregate household wealth measures, so we will consider a slightly different model. We will use nominal consumption (cons) as the dependent variable, and nominal disposable income (dpi), the nominal interest rate (ffr, the federal funds rate), and the money supply (m2ns, M2 not seasonally adjusted) as explanatory variables. A graph of cons vs dpi is shown in Fig. 12.25 and is strongly suggestive of a relationship between the two. Note: this is real data, not hypothetically generated data.

The regression model we will use is

ˆ ˆ ˆ ˆ cons0   1( dpi )   2 ( ffr )   3 ( m 2 ns )

The regression output is shown in Fig. 12.26. This output shows that the regression model is plagued with at least two violations of the assumptions. The Durbin-Watson statistic (0.350) surely indicates the presence of serial correlation. The large VIFs of DPI and M2NS indicate the presence of collinearity. This model leaves much to be desired. A model where one of the collinear variables, M2NS, has been eliminated from the regression equation produces the output shown in Fig. 12.27. This model does not show any evidence of collinearity, but the DW statistic is even worse.

Figure 12.25 A plot of nominal household consumption vs. nominal disposable personal income.

Figure 12.26. The regression output for Example 12.1.

Figure 12.27 The regression output for Example 12.1 with M2NS removed from the model.

Example 12.2 Another way of looking at the consumption function is in the form of rates of change. That is, we would model

cons 0   1V dpi   2 ffr   2 V confidence   4  V m2 ns where x x  x . That is the symbol indicates that the past value of a V t t t1  V varible is to be subtraced from the current value creating a new variable called a difference. The regression output is shown in Fig. 12. 28. Note that the statistics of this equation are much superior to those of the previous equation. There is no evidence of

31 either serial correlation or collinearity. The consumer confidence index does not seem significant here, however. The interest rate variable is only moderately significant.
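A minimal sketch of how the differenced variables might be built and fit in Python (the file and column names are assumptions) is:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("consumption_quarterly.csv")     # hypothetical quarterly data

# .diff() computes x_t - x_{t-1}; the first row becomes NaN and is dropped
d = df[["cons", "dpi", "ffr", "confidence", "m2ns"]].diff().dropna()

X = sm.add_constant(d[["dpi", "ffr", "confidence", "m2ns"]])
model = sm.OLS(d["cons"], X).fit()
print(model.summary())                            # the summary includes R2, p-values, and the Durbin-Watson statistic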

Problems

Panel A Panel B

Panel C Panel D

Problem 12.1 Which of the above residual plots indicate a possible violation of the assumptions of MLR? What is the violation called (e.g., heteroscedasticity)? What is the cause of the violation?

Problem 12.2 What is meant by the phrase "a statistically significant relationship"? What is the difference between a statistically significant relationship and a causal relationship?

Problem 12.3 Use the output from Fig. P12.1 to answer the questions that follow.
a) Is there any evidence of collinearity? State why or why not.
b) Is there any evidence of serial correlation? State why or why not.
c) Should the hypothesis β1 = β2 = ⋯ = βk = 0 be rejected at a 5% level of significance?
d) Should the hypothesis β1 = β2 = ⋯ = βk = 0 be rejected at a 10% level of significance?
e) Which independent variables should be included in the regression model? Why?

Figure P12.1 Use for Problem 12.3

Problem 12.4 Use the output from Fig. P12.2 to answer the questions that follow.
a) Is there any evidence of collinearity? State why or why not.
b) Is there any evidence of serial correlation? State why or why not.
c) Should the hypothesis β1 = β2 = ⋯ = βk = 0 be rejected at a 5% level of significance?
d) Should the hypothesis β1 = β2 = ⋯ = βk = 0 be rejected at a 10% level of significance?
e) Which independent variables should be included in the regression model? Why?

Figure P12.2 Use for Problem 12.4

Problem 12.5 Use the output from Fig. P12.3 to answer the questions that follow.
a) Is there any evidence of collinearity? State why or why not.
b) Is there any evidence of serial correlation? State why or why not.
c) Should the hypothesis β1 = β2 = ⋯ = βk = 0 be rejected at a 5% level of significance?
d) Should the hypothesis β1 = β2 = ⋯ = βk = 0 be rejected at a 10% level of significance?
e) Which independent variables should be included in the regression model? Why?

Figure P12.3 Use for Problem 12.5
