Multiple Regression

Zillow.com

Zillow.com is a real estate research site, founded in 2005 by Rich Barton and Lloyd Frink. Both are former Microsoft executives and founders of Expedia.com, the Internet-based travel agency. Zillow collects publicly available data and provides an estimate (called a Zestimate®) of a home's worth. The estimate is based on a model of the data that Zillow has been able to collect on a variety of predictor variables, including the past history of the home's sales, the location of the home, and characteristics of the house such as its size and number of bedrooms and bathrooms. The site is enormously popular among both potential buyers and sellers of homes. According to Rismedia.com, Zillow is one of the most-visited U.S. real estate sites on the Web, with approximately 5 million unique users each month. These users include more than one-third of all mortgage professionals in the U.S.—or approximately 125,000—in any given month. Additionally, 90% of Zillow users are homeowners, and two-thirds are either buying or selling now, or plan to in the near future.


How exactly does Zillow figure the worth of a house? According to the Zillow.com site, "We compute this figure by taking zillions of data points—much of this data is public—and entering them into a formula. This formula is built using what our statisticians call 'a proprietary algorithm'—big words for 'secret formula.' When our statisticians developed the model to determine home values, they explored how homes in certain areas were similar (i.e., number of bedrooms and baths, and a myriad of other details) and then looked at the relationships between actual sale prices and those home details." These relationships form a pattern, and they use that pattern to develop a model to estimate a market value for a home. In other words, the Zillow statisticians use a model, most likely a regression model, to predict home value from the characteristics of the house.

We've seen how to predict a response variable based on a single predictor. That's been useful, but the types of business decisions we'll want to make are often too complex for simple regression.1 In this chapter, we'll expand the power of the regression model to take many predictor variables into account in what's called a multiple regression model. With our understanding of simple regression as a base, getting to multiple regression isn't a big step, but it's an important and worthwhile one. Multiple regression is probably the most powerful and widely used statistical tool today.

WHO: Houses
WHAT: Sale price (2002 dollars) and other facts about the houses
WHEN: 2002–2003
WHERE: Upstate New York, near Saratoga Springs
WHY: To understand what influences housing prices and how to predict them

As anyone who's ever looked at house prices knows, house prices depend on the local market. To control for that, we will restrict our attention to a single market. We have a random sample of 1057 home sales from the public records of sales in upstate New York, in the region around the city of Saratoga Springs. The first thing often mentioned in describing a house for sale is the number of bedrooms. Let's start with just one predictor variable. Can we use Bedrooms to predict home Price?

Figure 18.1 Side-by-side boxplots of Price ($000) against Number of Bedrooms show that price increases, on average, with more bedrooms.

The number of Bedrooms is a quantitative variable, but it holds only a few values (from 1 to 5 in this data set). So a scatterplot may not be the best way to examine the relationship between Bedrooms and Price. In fact, at each value for Bedrooms there is a whole distribution of prices. Side-by-side boxplots of Price against Bedrooms (Figure 18.1) show a general increase in price with more bedrooms, and an approximately linear growth. Figure 18.1 also shows a clearly increasing spread from left to right, violating the Equal Spread Condition, and that's a possible sign of trouble. For now, we'll proceed cautiously. We'll fit the regression model, but we will be cautious about using inference methods for the model. Later we'll add more variables to increase the power and usefulness of the model.

1 When we need to note the difference, a regression with a single predictor is called a simple regression.

The output from a linear regression model of Price on Bedrooms shows:

Table 18.1 Linear regression in Excel of Price on Bedrooms.

Apparently, just knowing the number of bedrooms gives us some useful information about the sale price. The model tells us that, on average, we'd expect the price to increase by almost $50,000 for each additional bedroom in the house, as we can see from the slope value of $48,218.91:

Price = 14,349.48 + 48,218.91 × Bedrooms.

Even though the model does tell us something, notice that the R² for this regression is only 21.4%. This means that the variation in the number of bedrooms accounts for about 21.4% of the variation in house prices. Perhaps some of the other facts about these houses can account for portions of the remaining variation. The standard deviation of the residuals, s = 68,432, tells us that the model only does a modestly good job of accounting for the price of a home. Approximating with the 68–95–99.7 Rule, we'd guess that only about 68% of home prices predicted by this model would be within $68,432 of the actual price. That's not likely to be close enough to be useful for a home buyer.
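For readers who want to reproduce output like Table 18.1 themselves, here is a minimal sketch using Python's statsmodels package. The file name saratoga_houses.csv and its column names are assumptions for illustration; any table of sale prices and bedroom counts will do.

```python
# Sketch only: fit the simple regression of Price on Bedrooms.
# The CSV file and column names are hypothetical stand-ins for the Saratoga data.
import pandas as pd
import statsmodels.formula.api as smf

houses = pd.read_csv("saratoga_houses.csv")   # assumed columns: Price, Bedrooms, Living_Area, ...
simple_fit = smf.ols("Price ~ Bedrooms", data=houses).fit()

print(simple_fit.params)     # Table 18.1 reports an intercept near 14,349 and a slope near 48,219
print(simple_fit.rsquared)   # about 0.214 for these data
```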

18.1 The Multiple Regression Model

For simple regression, we wrote the predicted values in terms of one predictor variable:

ŷ = b0 + b1x.

To include more predictors in the model, we simply write the same regression model with more predictor variables. The resulting multiple regression looks like this:

ŷ = b0 + b1x1 + b2x2 + · · · + bkxk

where b0 is still the intercept and each bk is the estimated coefficient of its corresponding predictor xk. Although the model doesn't look much more complicated than a simple regression, it isn't practical to determine a multiple regression by hand. This is a job for a statistics program on a computer. Remember that for simple regression, we found the coefficients for the model using the least squares solution, the one whose coefficients made the sum of the squared residuals as small as possible. For multiple regression, a statistics package does the same thing and can find the coefficients of the least squares model easily. If you know how to find the regression of Price on Bedrooms using a statistics package, you can probably just add another variable to the list of predictors in your program to compute a multiple regression. A multiple regression of Price on the two variables Bedrooms and Living Area generates a multiple regression table like this one.

Response variable: Price

R² = 57.8%   s = 50,142.4 with 1057 − 3 = 1054 degrees of freedom

Variable       Coeff       SE(Coeff)   t-ratio   P-value
Intercept      20986.09     6816.3      3.08     0.0021
Bedrooms       -7483.10     2783.5     -2.69     0.0073
Living Area       93.84        3.11    30.18     <0.0001

Table 18.2 Multiple regression output for the linear model predicting Price from Bedrooms and Living Area.

You should recognize most of the numbers in this table, and most of them mean what you expect them to. The value of R² for a regression on two variables gives the fraction of the variability of Price accounted for by both predictor variables together. With Bedrooms alone predicting Price, the R² value was 21.4%, but this model accounts for 57.8% of the variability in Price, and the standard deviation of the residuals is now $50,142.40. We shouldn't be surprised that the variability explained by the model has gone up. It was for this reason—the hope of accounting for some of that leftover variability—that we introduced a second predictor. We also shouldn't be surprised that the size of the house, as measured by Living Area, also contributes to a good prediction of house prices. Collecting the coefficients of the multiple regression of Price on Bedrooms and Living Area from Table 18.2, we can write the estimated regression as:

Price = 20,986.09 − 7,483.10 Bedrooms + 93.84 Living Area.

As before, we define the residuals as:

e = y − ŷ.

The standard deviation of the residuals is still denoted as s (or sometimes as s_e, as in simple regression—for the same reason—to distinguish it from the standard deviation of y). The degrees of freedom calculation comes right from our definition. The degrees of freedom is the number of observations (n = 1057) minus one for each coefficient estimated: df = n − k − 1, where k is the number of predictor variables and n is the number of cases. For this model, we subtract 3 (the two coefficients and the intercept). To find the standard deviation of the residuals, we use that number of degrees of freedom in the denominator:

s_e = sqrt( Σ(y − ŷ)² / (n − k − 1) ).

For each predictor, the regression output shows a coefficient, its standard error, a t-ratio, and the corresponding P-value. As with simple regression, the t-ratio measures how many standard errors the coefficient is away from 0. Using a Student's t-model, we can find its P-value and use that to test the null hypothesis that the true value of the coefficient is 0.

What's different? With so much of the multiple regression looking just like simple regression, why devote an entire chapter to the subject? There are several answers to this question. First, and most important, is that the meaning of the coefficients in the regression model has changed in a subtle, but important, way. Because that change is not obvious, multiple regression coefficients are often misinterpreted. We'll show some examples to explain this change in meaning.
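Here is a sketch of how the two-predictor fit in Table 18.2 and the residual standard deviation formula could be checked with software. The file and column names are the same hypothetical ones as in the earlier sketch.

```python
# Sketch only: fit the two-predictor model and verify s_e and the degrees of freedom by hand.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

houses = pd.read_csv("saratoga_houses.csv")     # hypothetical file with Price, Bedrooms, Living_Area
multi_fit = smf.ols("Price ~ Bedrooms + Living_Area", data=houses).fit()
print(multi_fit.params)                         # compare with the coefficients in Table 18.2

n, k = len(houses), 2                           # k = number of predictor variables
s_e = np.sqrt(np.sum(multi_fit.resid**2) / (n - k - 1))
print(n - k - 1, s_e)                           # 1054 df; Table 18.2 reports s near 50,142
print(multi_fit.df_resid)                       # statsmodels reports the same residual df
```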

Second, multiple regression is an extraordinarily versatile model, underlying many widely used statistics methods. A sound understanding of the multiple regression model will help you to understand these other applications as well. Third, multiple regression offers you a first glimpse into statistical models that use more than two quantitative variables. The real world is complex. Simple models of the kind we’ve shown so far are a great start, but they’re not detailed enough to be useful for understanding, predicting, and making business decisions in many real-world situations. Models that use several variables can be a big step toward realistic and useful modeling of complex phenomena and relationships.

The multiple regression model

A large clothing store has recently sent out a special catalog of Fall clothes, and Leiram, its marketing analyst, wants to find out which customers responded to it and which ones bought the most. She plans to conduct an RFM (Recency, Frequency, Monetary) analysis. The RFM method is based on the principle that 80% of your business comes from the best 20% of your customers, and states that the following attributes will be useful predictors of who the best customers will be:
• How recently the customer has purchased (Recency)
• How frequently the customer shops (Frequency)
• How much the customer spends (Monetary)
For each customer, Leiram has information for the past 5 years on Date of Last Purchase, Number of Purchases, and Total Amount Spent. In addition, she has demographic information including Age, Marital Status, Sex, Income, and Number of Children. She chooses a random sample of 149 customers who bought something from the catalog and who have purchased at least 3 times in the past 5 years (Number of Purchases). She wants to model how much they bought from the new catalog (Respond Amount). Leiram fits the following multiple regression model to the response variable Respond Amount:

Response Variable: Respond Amount
R² = 91.48%   Adjusted R² = 91.31%
s = 18.183 with 149 − 4 = 145 degrees of freedom

Variable              Coeff      SE(Coeff)   t-ratio    P-value
Intercept             111.01     6.459        17.187    <0.0001
Income                0.00091    0.00012       7.643    <0.0001
Total Amount Spent    0.154      0.00852      18.042    <0.0001
Number of Purchases   -14.81     0.716       -20.695    <0.0001

Questions: How much of the variation in Respond Amount is explained by this model? What does the term s = 18.183 mean? Which variables seem important in the model?

Answers: This model has an R² of 91.48%, which means that the model has explained 91.48% of the variation in Respond Amount using the three predictors: Income, Total Amount Spent, and Number of Purchases. The term s = 18.183 means that the standard deviation of the residuals is about $18.18. Using the 68–95–99.7 Rule, we know that most prediction errors will be no larger than about $36.36. All terms seem important in this model since all three have very large t-ratios and correspondingly small P-values.

18.2 Interpreting Multiple Regression Coefficients

It makes sense that both the number of bedrooms and the size of the living area would influence the price of a house. We'd expect both variables to have a positive effect on price—houses with more bedrooms typically sell for more money, as do larger houses. But look at the coefficient for Bedrooms in the multiple regression equation. It's negative: −7,483.10. How can it be that the coefficient of Bedrooms in the multiple regression is negative? And not just slightly negative; its t-ratio is large enough for us to be quite confident that the true value is really negative. Yet from Table 18.1, we saw the coefficient was equally clearly positive when Bedrooms was the sole predictor in the model (see Figure 18.2).

Figure 18.2 The slope of Bedrooms is positive. For each additional bedroom, we would predict an additional $48,000 in the price of a house from the simple regression model of Table 18.1.

Figure 18.3 For the 96 houses with Living Area between 2500 and 3000 square feet, the slope of Price on Bedrooms is negative. For each additional bedroom, restricting data to homes of this size, we would predict that the house's Price was about $17,800 lower.

The explanation of this apparent paradox is that in a multiple regression, coefficients have a more subtle meaning. Each coefficient takes into account the other predictor(s) in the model. Think about a group of houses of about the same size. For the same size living area, a house with more bedrooms is likely to have smaller rooms. That might actually make it less valuable. To see this in the data, let's look at a group of similarly sized homes from 2500 to 3000 square feet of living area, and examine the relationship between Bedrooms and Price just for houses in this size range (see Figure 18.3). For houses with between 2500 and 3000 square feet of living area, it appears that homes with fewer bedrooms have a higher price, on average, than those with more bedrooms. When we think about houses in terms of both variables, we can see that this makes sense. A 2500-square-foot house with five bedrooms would have either relatively small, cramped bedrooms or not much common living space. The same size house with only three bedrooms could have larger, more attractive bedrooms and still have adequate common living space.

Body fat percentage is an important health indicator, but it is difficult to measure accurately. One way to do so is via dual energy X-ray absorptiometry (DEXA), at a cost of several hundred dollars per image. Insurance companies want to know if body fat percentage can be estimated from easier-to-measure characteristics such as Height and Weight. A scatterplot of Percent Body Fat against Height for 250 adult men shows no pattern, and the correlation is −0.03, which is not statistically significant. A multiple regression using Height (inches), Age (years), and Weight (pounds) finds the following model:

Variable    Coeff       SE(Coeff)   t-ratio   P-value
Intercept   57.27217    10.39897     5.507    <0.0001
Height      -1.27416     0.15801    -8.064    <0.0001
Weight       0.25366     0.01483    17.110    <0.0001
Age          0.13732     0.02806     4.895    <0.0001

s = 5.382 on 246 degrees of freedom
Multiple R-squared: 0.584, F-statistic: 115.1 on 3 and 246 DF, P-value: <0.0001

1 Interpret the R² of this regression model.
2 Interpret the coefficient of Age.
3 How can the coefficient of Height have such a small P-value in the multiple regression when the correlation between Height and Percent Body Fat was not statistically distinguishable from zero?

What the coefficient of Bedrooms is saying in the multiple regression is that, after accounting for living area, houses with more bedrooms tend to sell for a lower price. In other words, what we saw by restricting our attention to homes of a certain size and seeing that additional bedrooms had a negative impact on price was generally true across all sizes. What seems confusing at first is that without taking Living Area into account, Price tends to go up with more bedrooms. But that's because Living Area and Bedrooms are also related. Multiple regression coefficients must always be interpreted in terms of the other predictors in the model. That can make their interpretation more subtle, more complex, and more challenging than when we had only one predictor. This is also what makes multiple regression so versatile and effective. The interpretations are more sophisticated and more appropriate.

There's a second common pitfall in interpreting coefficients. Be careful not to interpret the coefficients causally. For example, this analysis cannot tell a homeowner how much the price of his home will change if he combines two of his four bedrooms into a new master bedroom. And it can't be used to predict whether adding a 100-square-foot child's bedroom onto the house would increase or decrease its value. The model simply reports the relationship between the number of Bedrooms and Living Area and Price for existing houses. As always with regression, we should be careful not to assume causation between the predictor variables and the response.
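The sign flip is easy to reproduce with simulated data. The sketch below uses made-up numbers (not the Saratoga data) in which larger houses tend to have more bedrooms, but for a fixed living area extra bedrooms lower the price; the simple regression slope for Bedrooms then comes out positive while the multiple regression slope comes out negative.

```python
# Simulated illustration of a coefficient changing sign when a related predictor is added.
# All numbers here are invented for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500
bedrooms = rng.integers(1, 6, size=n)                       # 1 to 5 bedrooms
living_area = 600 * bedrooms + rng.normal(0, 400, size=n)   # bigger houses have more bedrooms
price = 20000 + 90 * living_area - 8000 * bedrooms + rng.normal(0, 30000, size=n)
sim = pd.DataFrame({"Price": price, "Bedrooms": bedrooms, "Living_Area": living_area})

simple = smf.ols("Price ~ Bedrooms", data=sim).fit()
multiple = smf.ols("Price ~ Bedrooms + Living_Area", data=sim).fit()
print(simple.params["Bedrooms"])     # positive: Bedrooms stands in for overall size
print(multiple.params["Bedrooms"])   # negative: after accounting for Living_Area
```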

Interpreting multiple regression coefficients

Question: In the regression model of Respond Amount, interpret the intercept and the regression coefficients of the three predictors.

Answer: The model says that from a base of $111.01 of spending, customers (who have purchased at least 3 times in the past 5 years), on average, spent $0.91 more for every $1000 of Income (after accounting for Number of Purchases and Total Amount Spent), $0.154 more for every dollar they have spent in the past 5 years (after accounting for Income and Number of Purchases), but $14.81 less for every additional purchase that they've made in the last 5 years (after accounting for Income and Total Amount Spent). It is important to note especially that the coefficient of Number of Purchases is negative, but only after accounting for both Income and Total Amount Spent.

18.3 Assumptions and Conditions for the Multiple Regression Model

We can write the multiple regression model like this, numbering the predictors arbitrarily (the order doesn't matter), writing betas for the model coefficients (which we will estimate from the data), and including the errors in the model:

y = β0 + β1x1 + β2x2 + · · · + βkxk + ε.

The assumptions and conditions for the multiple regression model are nearly the same as for simple regression, but with more variables in the model, we'll have to make a few changes, as described in the following sections.

Linearity Assumption

We are fitting a linear model.2 For that to be the right kind of model for this analysis, we need to verify an underlying linear relationship. But now we're thinking about several predictors.

2 By linear we mean that each x appears simply multiplied by its coefficient and added to the model, and that no x appears in an exponent or some other more complicated function. That ensures that as we move along any x-variable, our prediction for y will change at a constant rate (given by the coefficient) if nothing else changes.

To confirm that the assumption is reasonable, we'll check the Linearity Condition for each of the predictors.

Linearity Condition. Scatterplots of y against each of the predictors are reasonably straight. The scatterplots need not show a strong (or any) slope; we just check to make sure that there isn't a bend or other nonlinearity. For the real estate data, the scatterplot is linear in both Bedrooms and Living Area, as we saw in Chapter 16. As in simple regression, it's a good idea to check the residual plot for any violations of the linearity condition. We can fit the regression and plot the residuals against the predicted values (Figure 18.4), checking to make sure we don't find patterns—especially bends or other nonlinearities.

Figure 18.4 A scatterplot of residuals against the predicted values shows no obvious pattern.

Independence Assumption

As with simple regression, the errors in the true underlying regression model must be independent of each other. As usual, there's no way to be sure that the Independence Assumption is true, but we should think about how the data were collected to see if that assumption is reasonable. We should check the Randomization Condition as well.

Randomization Condition. Ideally, the data should arise from a random sample or randomized experiment. Randomization assures us that the data are representative of some identifiable population. If you can't identify the population, you can interpret the regression model as a description of the data you have, but you can't interpret the hypothesis tests at all, because such tests are about a regression model for a specific population. Regression methods are often applied to data that were not collected with randomization. Regression models fit to such data may still do a good job of modeling the data at hand, but without some reason to believe that the data are representative of a particular population, you should be reluctant to believe that the model generalizes to other situations.

We also check the regression residuals for evidence of patterns, trends, or clumping, any of which would suggest a failure of independence. In the special case when one of the x-variables is related to time (or is itself Time), be sure that the residuals do not have a pattern when plotted against that variable. In addition to checking the plot of residuals against the predicted values, we recommend that you check the individual plots of the residuals against each of the explanatory, or x, variables in the model. These individual plots can yield important information on necessary transformations, or re-expressions, for the predictor variables.

The real estate data were sampled from a larger set of public records for sales during a limited period of time. The error we make in predicting one is not likely to be related to the error we make in predicting another.

Equal Variance Assumption

The variability of the errors should be about the same for all values of each predictor. To see whether this assumption is valid, we look at scatterplots and check the Equal Spread Condition.

Equal Spread Condition. The same scatterplot of residuals against the predicted values (Figure 18.4) is a good check of the consistency of the spread. We saw what appeared to be a violation of the equal spread condition when Price was plotted against Bedrooms (Figure 18.2). But here in the multiple regression, the problem has dissipated when we look at the residuals. Apparently, much of the tendency of houses with more bedrooms to have greater variability in prices was accounted for in the model when we included Living Area as a predictor. If residual plots show no pattern, if the errors are plausibly independent, and if the plots of residuals against each x-variable don’t show changes in variation, we can feel good about interpreting the regression model. Before we test hypotheses, however, we must check one final assumption: the normality assumption.

Normality Assumption

We assume that the errors around the idealized regression model at any specified values of the x-variables follow a Normal model. We need this assumption so that we can use a Student's t-model for inference. As with other times when we've used Student's t, we'll settle for the residuals satisfying the Nearly Normal Condition. As with means, the assumption is less important as the sample size grows. For large data sets, we have to be careful only of extreme skewness or large outliers. Our inference methods will work well even when the residuals are moderately skewed, if the sample size is large. If the distribution of residuals is unimodal and symmetric, there is little to worry about.3

Nearly Normal Condition. Because we have only one set of residuals, this is the same set of conditions we had for simple regression. Look at a histogram or Normal probability plot of the residuals.

Figure 18.5 A histogram of the residuals shows a unimodal, symmetric distribution, but the tails seem a bit longer than one would expect from a Normal model. The Normal probability plot confirms that.

3 The only time we need strict adherence to the Normality Assumption of the errors is when finding prediction intervals for individuals in multiple regression. Because they are based on individual Normal probabilities and not the Central Limit Theorem, the errors must closely follow a Normal model.

The histogram of residuals in the real estate example certainly looks unimodal and symmetric. The Normal probability plot has some bend on both sides, which indicates that the residuals in the tails straggle away from the center more than Normally distributed data would. However, as we have said before, the Normality Assumption becomes less important as the sample size grows, and here we have no skewness and more than 1000 cases. (The Central Limit Theorem helps our confidence intervals and tests based on the t-statistic to be valid when we have large samples.)

Let's summarize all the checks of conditions that we've made and the order in which we've made them.
1. Check the Linearity Condition with scatterplots of the y-variable against each x-variable.
2. If the scatterplots are straight enough, fit a multiple regression model to the data. (Otherwise, either stop or consider re-expressing an x-variable or the y-variable.)
3. Find the residuals and predicted values.
4. Make a scatterplot of the residuals against the predicted values (and ideally against each predictor variable separately). These plots should look patternless. Check, in particular, for any bend (which would suggest that the data weren't all that straight after all) and for any changes in variation. If there's a bend, consider re-expressing the y and/or the x variables. If the variation in the plot grows from one side to the other, consider re-expressing the y-variable. If you re-express a variable, start the model fitting over.
5. Think about how the data were collected. Was suitable randomization used? Are the data representative of some identifiable population? If the data are measured over time, check for evidence of patterns that might suggest they're not independent by plotting the residuals against time to look for patterns.
6. If the conditions check out this far, feel free to interpret the regression model and use it for prediction.
7. Make a histogram and Normal probability plot of the residuals to check the Nearly Normal Condition. If the sample size is large, the Normality is less important for inference, but always be on the lookout for skewness or outliers.
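Steps 3, 4, and 7 are usually done with software. Here is a minimal sketch of those residual checks, assuming a fitted OLS result named multi_fit like the one in the earlier sketches, and assuming matplotlib and scipy are available.

```python
# Sketch only: residual diagnostics for a fitted statsmodels OLS result named multi_fit.
import matplotlib.pyplot as plt
from scipy import stats

resid = multi_fit.resid
fitted = multi_fit.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].scatter(fitted, resid, alpha=0.3)           # step 4: look for bends or changing spread
axes[0].axhline(0, color="gray")
axes[0].set_xlabel("Predicted values")
axes[0].set_ylabel("Residuals")

axes[1].hist(resid, bins=30)                        # step 7: unimodal and symmetric?
axes[1].set_xlabel("Residuals")

stats.probplot(resid, dist="norm", plot=axes[2])    # step 7: Normal probability plot
plt.tight_layout()
plt.show()
```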

Assumptions and conditions for multiple regression

Here are plots of Respond Amount against the three predictors, a plot of the residuals against the predicted values from the multiple regression, a histogram of the residuals and a Normal probability plot of the residuals. Recall that the data come from a random sample of customers who both responded to the catalog and who had purchased at least 3 times in the past 5 years. Question: Do the assumptions and conditions for multiple regression appear to have been met?

[Plots: scatterplots of Respond Amount against Income, Total Amount Spent, and Number of Purchases; a scatterplot of the residuals against the predicted values; a histogram of the residuals; and a Normal probability plot of the residuals.]

Answer: Because the sample is random, the Randomization Condition is satisfied and we assume the responses are independent. The plots of Respond Amount against each predictor are reasonably linear, with the possible exception of Number of Purchases. There may also be some curvature and increasing spread in the residual plot. The histogram of residuals is unimodal and symmetric with no extreme outliers. The Normal probability plot shows that the distribution of residuals is fairly close to Normal. The conditions are not completely satisfied, and we should proceed somewhat cautiously, especially with regard to Number of Purchases.

Housing Prices

Zillow.com attracts millions of users each month who are interested in finding out how much their house is worth. Let's see how well a multiple regression model can do. The variables available include:

Price: The price of the house as sold in 2002
Living Area: The size of the living area of the house in square feet
Bedrooms: The number of bedrooms
Bathrooms: The number of bathrooms (a half bath is a toilet and sink only)
Age: Age of the house in years
Fireplaces: Number of fireplaces in the house

PLAN

Setup: State the objective of the study. Identify the variables.

We want to build a model to predict house prices for a region in upstate New York. We have data on Price ($), Living Area (sq ft), Bedrooms (#), Bathrooms (#), Fireplaces (#), and Age (in years).

Model: Think about the assumptions and check the conditions.

Linearity Condition. To fit a regression model, we first require linearity. Scatterplots (or side-by-side boxplots) of Price against all potential predictor variables are shown.

[Scatterplots of Price ($000) against Living Area, Bedrooms, Bathrooms, Fireplaces, and Age.]

Remarks

There are a few anomalies in the plots that deserve discussion. The plot of Price against Bathrooms shows a positive relationship, but it is not quite linear. There seem to be two slopes, one from 1–2 bathrooms and then a steeper one from 2–4. For now, we'll proceed cautiously, realizing that any slope we find will average these two. The plot of Price against Fireplaces shows an outlier—an expensive home with four fireplaces. We tried setting this home aside and running the regression without it, but its influence on the coefficients was not large, so we decided to include it in the model. The plot of Price against Age shows that there may be some curvature. We may want to consider re-expressing Age to improve the linearity of the relationship.

✓ Independence Assumption. We can regard the house prices as being independent of one another. The error we'll make in predicting one isn't likely to be related to the error we'll make in predicting another.
✓ Randomization Condition. These 1057 houses are a random sample of a much larger set.
✓ Equal Spread Condition. A scatterplot of residuals vs. predicted values shows no evidence of changing spread. There is a group of homes whose residuals are larger (both negative and positive) than the vast majority. This is also seen in the long tails of the histogram of residuals.

[Scatterplot of Residuals ($000) against Predicted Values ($000).]

We need the Nearly Normal Condition only if we want to do inference and the sample size is not large. If the sample size is large, as it is here, we need the distribution to be Normal only if we plan to produce prediction intervals.

✓ Nearly Normal Condition, Outlier Condition. The histogram of residuals is unimodal and symmetric, but long tailed. The Normal probability plot supports that.

[Histogram and Normal probability plot of the Residuals ($000).]

Under these conditions, we can proceed with caution to a multiple regression analysis. We will return to some of our concerns about curvature in some plots in the discussion.

DO

Mechanics: We always fit multiple regression models with computer software. An output table like this one isn't exactly what any of the major packages produce, but it is enough like all of them to look familiar.

Here is some computer output for the multiple regression, using all five predictors.

              Coeff        SE(Coeff)   t-ratio   P-value
Intercept     15712.702    7311.427     2.149    0.03186
Living Area      73.446       4.009    18.321    <0.0001
Bedrooms      -6361.311    2749.503    -2.314    0.02088
Bathrooms     19236.678    3669.080     5.243    <0.0001
Fireplaces     9162.791    3194.233     2.869    0.00421
Age            -142.740      48.276    -2.957    0.00318

Residual standard error: 48615.95 on 1051 degrees of freedom
Multiple R-squared: 0.6049, F-statistic: 321.8 on 5 and 1051 DF, P-value: <0.0001

The estimated equation is:

Price = 15,712.70 + 73.45 Living Area − 6361.31 Bedrooms + 19,236.68 Bathrooms + 9162.79 Fireplaces − 142.74 Age

All of the P-values are small, which indicates that even with five predictors in the model, all are contributing. The R² value of 60.49% indicates that more than 60% of the overall variation in house prices has been accounted for by this model. The residual standard error of $48,620 gives us a rough indication that we can predict the price of a typical home to within about 2 × $48,620 = $97,240. That seems too large to be useful, but the model does give some idea of the price of a home.
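To see the equation in action, here is a worked prediction for a hypothetical house (2000 square feet, 3 bedrooms, 2 bathrooms, 1 fireplace, 20 years old; the house itself is made up for illustration):

Price = 15,712.70 + 73.45(2000) − 6361.31(3) + 19,236.68(2) + 9162.79(1) − 142.74(20) ≈ $188,310.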

REPORT

Summary and Conclusions: Summarize your results and state any limitations of your model in the context of your original objectives.

MEMO

Re: Regression Analysis of Home Price Predictions

A regression model of Price on Living Area, Bedrooms, Bathrooms, Fireplaces, and Age accounts for 60.5% of the variation in the price of homes in upstate New York. The residual standard deviation is $48,620. A statistical test of each coefficient shows that each one is almost certainly not zero, so each of these variables appears to be a contributor to the price of a house.

This model reflects the common wisdom in real estate about the importance of various aspects of a home. An important variable not included is the location, which every real estate agent knows is crucial to pricing a house. This is ameliorated by the fact that all these houses are in the same general area. However, knowing more specific information about where they are located would almost certainly help the model. The price found from this model is to be used as a starting point for comparing a home with comparable homes in the area. The model may be improved by re-expressing one or more of the predictors, especially Age and Bathrooms. We recommend caution in interpreting the coefficients across the entire range of these predictors.

18.4 Testing the Multiple Regression Model

There are several hypothesis tests in the multiple regression output, but all of them talk about the same thing. Each is concerned with whether the underlying model parameters (the slopes and intercept) are actually zero. The first of these hypotheses is one we skipped over for simple regression (for reasons that will be clear in a minute). Now that we have more than one predictor, there's an overall test we should perform before we consider inference for the coefficients. We ask the global question: Is this multiple regression model any good at all? If home prices were set randomly or based on other factors than those we have as predictors, then the best estimate would just be the mean price. To address the overall question, we'll test the null hypothesis that all the slope coefficients are zero:

H0: β1 = · · · = βk = 0   vs.   HA: at least one β ≠ 0.

We can test this hypothesis with an F-test. (It's the generalization of the t-test to more than one predictor.) The sampling distribution of the statistic is labeled with the letter F (in honor of Sir Ronald Fisher). The F-distribution has two degrees of freedom: k, the number of predictors, and n − k − 1. In our Guided Example, we have k = 5 predictors and n = 1057 homes, which means that the F-value of 321.8 has 5 and 1057 − 5 − 1 = 1051 degrees of freedom. The regression output (page 590) shows that it has a P-value < 0.0001. The null hypothesis is that all the coefficients are 0, in which case the regression model predicts no better than the mean. The alternative is that at least one coefficient is non-zero. The test is one-sided—bigger F-values mean smaller P-values. If the null hypothesis were true, the F-statistic would have a value near 1. The F-statistic here is quite large, so we can easily reject the null hypothesis and conclude that the multiple regression model for predicting house prices with these five variables is better than just using the mean.4

F-test for Simple Regression? Why didn't we check the F-test for simple regression? In fact, we did. When you do a simple regression with statistics software, you'll see the F-statistic in the output. But for simple regression, it gives the same information as the t-test for the slope. It tests the null hypothesis that the slope coefficient is zero, and we already test that with the t-statistic for the slope. In fact, the square of that t-statistic is equal to the F-statistic for the simple regression, so it really is the identical test.

Once we check the F-test and reject its null hypothesis—and, if we are being careful, only if we reject that hypothesis—we can move on to checking the test statistics for the individual coefficients. Those tests look like what we did for the slope of a simple regression in Chapter 16. For each coefficient, we test the null hypothesis that the slope is zero against the (two-sided) alternative that it isn't zero.

4 To know how big the F-value has to be, we need a table of F-values. There are F tables in the back of the book, and most regression tables include a P-value for the F-statistic.

The regression table gives a standard error for each coefficient and the ratio of the estimated coefficient to its standard error. If the assumptions and conditions are met (and now we need the Nearly Normal Condition or a large sample), these ratios follow a Student's t-distribution:

t(n − k − 1) = (bj − 0) / SE(bj).

Where did the degrees of freedom n − k − 1 come from? We have a rule of thumb that works here. The degrees of freedom value is the number of data values minus the number of estimated coefficients: minus k for the slopes and minus 1 for the intercept. For the house price regression on five predictors, that's n − 5 − 1. Almost every regression report includes both the t-statistics and their corresponding P-values.

We can build a confidence interval in the usual way, with an estimate plus or minus a margin of error. As always, the margin of error is the product of the standard error and a critical value. Here the critical value comes from the t-distribution on n − k − 1 degrees of freedom, and the standard errors are in the regression table. So a confidence interval for each slope bj is:

bj ± t*(n − k − 1) × SE(bj).

The tricky parts of these tests are that the standard errors of the coefficients now require harder calculations (so we leave it to technology), and the meaning of a coefficient, as we have seen, depends on all the other predictors in the multiple regression model. That last point is important. If we fail to reject the null hypothesis for a multiple regression coefficient, it does not mean that the corresponding predictor variable has no linear relationship to y. It means that the corresponding predictor contributes nothing to modeling y after allowing for all the other predictors.

The multiple regression model looks so simple and straightforward. It looks like each bj tells us the effect of its associated predictor, xj, on the response variable, y. But that is not true. This is, without a doubt, the most common error that people make with multiple regression. In fact:
• The coefficient bj in a multiple regression can be quite different from zero even when there is no simple linear relationship between y and xj.
• It is even possible that the multiple regression slope changes sign when a new variable enters the regression. We saw this for the Price on Bedrooms real estate example when Living Area was added to the regression.

So we'll say it once more: The coefficient of xj in a multiple regression depends as much on the other predictors as it does on xj. Failing to interpret coefficients properly is the most common error in working with regression models.
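Here is a sketch of how these calculations can be done with software, continuing with the hypothetical multi_fit object from the earlier sketches and assuming scipy is available.

```python
# Sketch only: a 95% confidence interval for one slope, and the P-value of the overall F-test.
from scipy import stats

k = 2                                    # predictors in the Bedrooms + Living Area model
n = int(multi_fit.df_resid) + k + 1

b = multi_fit.params["Living_Area"]      # estimated slope for Living Area
se = multi_fit.bse["Living_Area"]        # its standard error
t_star = stats.t.ppf(0.975, df=n - k - 1)
print(b - t_star * se, b + t_star * se)  # statsmodels' conf_int() reports the same interval
print(multi_fit.conf_int().loc["Living_Area"])

# Overall F-test for the five-predictor Guided Example model: F = 321.8 on 5 and 1051 df
print(stats.f.sf(321.8, 5, 1051))        # essentially 0, i.e., P-value < 0.0001
```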

Testing a multiple regression model

Leiram tries another model, adding the variable Age to see if that improves the model:

Response Variable: Respond Amount
R² = 91.50%   Adjusted R² = 91.23%
s = 18.179 with 149 − 5 = 144 degrees of freedom

Variable              Coeff      SE(Coeff)   t-ratio    P-value
Intercept             114.91     9.244        12.439    <0.0001
Income                0.00091    0.00012       7.619    <0.0001
Total Amount Spent    0.154      0.00855      18.007    <0.0001
Number of Purchases   -14.79     0.719       -20.570    <0.0001
Age                   -0.1264    0.2144       -0.5898   0.5563

Question: Has the variable Age improved the model? Would you leave the term in? Comment.

Answer: Of course, we would like to see the residual plots, but given this output it appears that although the R² value has increased from 0.9148 to 0.9150, the t-ratio for Age is only −0.5898, with a P-value of 0.5563. This indicates that there is no evidence to suggest that the slope for Age is different from 0. We cannot reject that null hypothesis. There is no reason to leave Age in this model.

18.5 Adjusted R², and the F-statistic

In Chapter 16, for simple linear regression, we interpreted R² as the variation in y accounted for by the model. The same interpretation holds for multiple regression, where now the model contains more than one predictor variable. The R² value tells us how much (as a fraction or percentage) of the variation in y is accounted for by the model with all the predictor variables included.

There are some relationships among the standard error of the residuals, s_e, the F-ratio, and R² that are useful for understanding how to assess the value of the multiple regression model. To start, we can write the standard error of the residuals as:

s_e = sqrt( SSE / (n − k − 1) ),

where SSE = Σe² is called the Sum of Squared Residuals (or errors—the E is for error). As we know, a larger SSE (and thus s_e) means that the residuals are more variable and that our predictions will be correspondingly less precise. We can look at the total variation of the response variable, y, which is called the Total Sum of Squares and is denoted SST:

SST = Σ(y − ȳ)².

For any regression model, we have no control over SST, but we'd like SSE to be as small as we can make it by finding predictor variables that account for as much of that variation as possible. In fact, we can write an equation that relates the total variation SST to SSE:

SST = SSR + SSE,

where SSR = Σ(ŷ − ȳ)² is called the Regression Sum of Squares because it comes from the predictor variables and tells us how much of the total variation in the response is due to the regression model. For a model to account for a large portion of the variability in y, we need SSR to be large and SSE to be small. In fact, R² is just the ratio of SSR to SST:

R² = SSR/SST = 1 − SSE/SST.

When the SSE is nearly 0, the R² value will be close to 1.

Mean Squares: Whenever a sum of squares is divided by its degrees of freedom, the result is called a mean square. For example, the Mean Square for Error, which you may see written as MSE, is found as SSE/(n − k − 1). It estimates the variance of the errors. Similarly, SST/(n − 1) divides the total sum of squares by its degrees of freedom. That is sometimes called the Mean Square for Total and denoted MST. We've seen this one before; the MST is just the variance of y. And SSR/k is the Mean Square for Regression.

In Chapter 16, we saw that for the relationship between two quantitative variables, testing the standard null hypothesis about the correlation coefficient, H0: ρ = 0, was equivalent to testing the standard null hypothesis about the slope, H0: β1 = 0. A similar result holds here for multiple regression. Testing the overall hypothesis tested by the F-statistic, H0: β1 = β2 = · · · = βk = 0, is equivalent to testing whether the true multiple regression R² is zero. In fact, the F-statistic for testing that all the slopes are zero can be found as:

F = (R²/k) / ((1 − R²)/(n − k − 1)) = (SSR/k) / (SSE/(n − k − 1)) = MSR/MSE.

In other words, using an F-test to see whether any of the true coefficients is different from 0 is equivalent to testing whether the R² value is different from zero. A rejection of either hypothesis says that at least one of the predictors accounts for enough variation in y to distinguish it from noise. Unfortunately, the test doesn't say which slope is responsible. You need to look at individual t-tests on the slopes to determine that. Because removing one predictor variable from the regression equation can change any number of slope coefficients, it is not straightforward to determine the right subset of predictors to use. We'll return to that problem when we discuss model selection in Chapter 19.
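As a check, the F-statistic reported in the Guided Example can be recovered from its R² with this formula, using the reported values n = 1057, k = 5, and R² = 0.6049:

F = (0.6049/5) / (0.3951/1051) = 0.12098/0.000376 ≈ 321.8.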

R² and Adjusted R²

Adding a predictor variable to a multiple regression equation does not always increase the amount of variation accounted for by the model, but it can never reduce it. Adding new predictor variables will always keep the R² value the same or increase it. It can never decrease it. But, even if the R² value grows, that doesn't mean that the resulting model is a better model or that it has greater predictive ability.

If we have a model with k predictors (all of which have statistically significant coefficients at some level α) and want to see if including a new variable, x_{k+1}, is warranted, we could fit the model with all k + 1 variables and simply test the slope of the added variable with a t-test of the slope. This method can test whether the last added variable adds significantly to the model, but choosing the "best" subset of predictors is not necessarily straightforward. We'll spend time discussing strategies for model selection in Chapter 19. The trade-off between a small (parsimonious) model and one that fits the data well is one of the great challenges of any serious model-building effort. Various statistics have been proposed to provide guidance for this search, and one of the most common is called adjusted R². Adjusted R² imposes a "penalty" for each new term that's added to the model in an attempt to make models of different sizes (numbers of predictors) comparable. It differs from R² because it can shrink when a predictor is added to the regression model or grow when a predictor is removed if the predictor in question doesn't contribute usefully to the model. In fact, it can even be negative. For a multiple regression with k predictor variables and n cases, it is defined as

R²_adj = 1 − (1 − R²) (n − 1)/(n − k − 1) = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)].

In the Guided Example, we saw that the regression of Price on Bedrooms, Bathrooms, Living Area, Fireplaces, and Age resulted in an R² of 0.6049. All the coefficients had P-values well below 0.05. The adjusted R² value for this model is 0.6030. If we add the variable Lot Size to the model, we get the following regression model:

              Coeff        SE(Coeff)   t-ratio   P-value
Intercept     15360.011    7334.804     2.094    0.03649
Living Area      73.388       4.043    18.154    <0.00001
Bedrooms      -6096.387    2757.736    -2.211    0.02728
Bathrooms     18824.069    3676.582     5.120    <0.00001
Fireplaces     9226.356    3191.788     2.891    0.00392
Age            -152.615      48.224    -3.165    0.00160
Lot Size        847.764    1989.112     0.426    0.67005

Residual standard error: 48440 on 1041 degrees of freedom
Multiple R-squared: 0.6081, Adjusted R-squared: 0.6059
F-statistic: 269.3 on 6 and 1041 DF, P-value: <0.0001

The most striking feature of this regression table, as compared to the one in the Guided Example on page 590, is that although most of the coefficients have changed very little, the coefficient of Lot Size is far from significant, with a P-value of 0.670. Yet the adjusted R² value is actually higher than for the previous model. This is why we warn against putting too much faith in this statistic. Especially for large samples, the adjusted R² does not always adjust downward enough to make sensible model choices. The other problem with comparing these two models is that 9 homes had missing values for Lot Size, which means that we're not comparing the models on exactly the same data set. When we matched the two models on the smaller data set, the adjusted R² value actually did "make the right decision," but just barely—0.6059 versus 0.6060 for the model without Lot Size. One might expect a larger difference considering we added a variable whose t-ratio is much less than 1. The lesson to be learned here is that there is no "correct" set of predictors to use for any real business decision problem, and finding a reasonable model is a process that takes a combination of science, art, business knowledge, and common sense. Look at the adjusted R² value for any multiple regression model you fit, but be sure to think about all the other reasons for including or not including any given predictor variable. We will have much more to say about this important subject in Chapter 19.
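As a numeric check of the adjusted R² formula, the value 0.6030 reported for the five-predictor Guided Example model can be recovered from its R², using n = 1057 and k = 5:

R²_adj = 1 − (1 − 0.6049)(1056/1051) ≈ 0.6030.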

R² and adjusted R²

Question: The model for Respond Amount that included Age as a predictor (see For Example on page 592) had an adjusted R² value of 0.9123 (or 91.23%). The original model (see For Example on page 581) had an adjusted R² value of 0.9131. Explain how that is consistent with the decision we made to drop Age from the model.

Answer: The t-ratio indicated that the slope for Age was not significantly different from 0, and thus that we should drop Age from the model. Although the R² value for the model that included Age was slightly higher, the adjusted R² value was slightly lower. Adjusted R² adds a penalty for each added term, which allows us to compare models of different sizes. Because the adjusted R² value for the smaller model (without Age) is higher than for the larger model (with Age), it indicates that the inclusion of Age is not justified, which is consistent with the recommendation from the t-ratio.

*18.6 The Logistic Regression Model

Business decisions often depend on whether something will happen or not. Will my customers leave my wireless service at the end of their subscription? How likely is it that my customers will respond to the offer I just mailed? The response variable is either Yes or No—a dichotomous response. By definition, that's a categorical response, so we can't use linear regression methods to predict it. If we coded this categorical response variable by giving the Yes values the value 1 and the No values the value 0, we could "pretend" that it's quantitative and try to fit a linear model.

Let's imagine that we're trying to model whether someone will respond to an offer based on how much they spent in the last year at our company. The regression of Purchase (1 = Yes; 0 = No) on Spending might look like Figure 18.6. The model shows that as past spending increases, something having to do with responding to the offer increases. Look at Spending near $500. Nearly all the customers who spend that much responded to the offer, while for those customers who spent less than $100 last year, almost none responded. The line seems to show the proportion of those responding for different values of Spending. Looking at the plot, what would you predict the proportion of responders to be for those that spent about $225? You might say that the proportion was about 0.50.

Figure 18.6 A linear model of Respond to Offer shows that as Spending increases, the proportion of customers responding to the offer increases as well. Unfortunately, the line predicts values outside the interval (0, 1), but those are the only sensible values for proportions or probabilities.

What's wrong with this interpretation? We know proportions (or probabilities) must lie between 0 and 1. But the linear model has no such restrictions. The line crosses 1.0 at about $350 in Spending and goes below 0 at about $75 in Spending. However, by transforming the probability, we can make the model behave better and get more sensible predictions for all values of Spending. We could just cut off all values greater than 1 at 1 and all values less than 0 at 0. That would work fine in the middle, but we know that things like customer behavior don't change that abruptly. So we'd prefer a model that curves at the ends to approach 0 and 1 gently. That's likely to be a better model for what really happens. A simple function is all we need. There are several that can do the job, but one common model is the logistic regression model. Here is a plot of the logistic regression model predictions for the same data.

Figure 18.7 A logistic model of Respond to Offer on Spending predicts values between 0 and 1 for the proportion of customers who respond, for all values of Spending.

For many values of Spending, especially those near the mean value of Spending, the predictions of the probability of responding are similar, but the logistic transformation approaches the limits at 0 and 1 smoothly. A logistic regression is an example of a nonlinear regression model. The computer finds the coefficients by solving a system of equations that in some ways mimic the least squares calculations we did for linear regression. The logistic regression model looks like this:

ln( p / (1 − p) ) = β0 + β1x1 + · · · + βkxk,

where ln( ) denotes the natural logarithm.

In other words, it's not the probabilities themselves that are modeled by a (multiple) regression model, but a transformation (the logistic function) of the probabilities. When the probabilities are transformed back and plotted, we get an S-shaped curve (when plotted against each predictor), as seen in Figure 18.7.

As with multiple regression, computer output for a logistic regression model provides estimates for each coefficient, standard errors, and a test for whether the coefficient is zero. Unlike multiple regression, the coefficients for each variable are tested with a Chi-square statistic5 rather than a t-statistic, and the overall R² is no longer available. (Some computer programs use a z-statistic to test individual coefficients. The tests are equivalent.) Because the probabilities are not a linear function of the predictors, it is more difficult to interpret the coefficients. The fitted logistic regression equation can be written:

ln( p̂ / (1 − p̂) ) = b0 + b1x1 + · · · + bkxk.

Some researchers try to interpret the logistic regression equation directly by realizing that the expression on the left of the logistic regression equation, ln( p̂ / (1 − p̂) ), can be thought of as the predicted log (base e) odds of the outcome. This is sometimes denoted logit(p̂). The higher the log odds, the higher the probability. Negative log odds indicate that the probability is less than 0.5, because p/(1 − p) then will be less than 1 and thus have a negative logarithm. Positive log odds indicate a probability greater than 0.5. If the coefficient of a particular predictor is positive, it means that higher values of it are associated with higher log odds and thus a higher probability of the response. So, the direction is interpretable in the same way as a multiple regression coefficient. But an increase of one unit of the predictor increases the predicted log odds by an amount equal to the coefficient of that predictor, after allowing for the effects of all the other predictors. It does not increase the predicted probability by that amount. The transformation back to probabilities is straightforward, but nonlinear. Once we have fit the logistic regression equation

ln( p̂ / (1 − p̂) ) = logit(p̂) = b0 + b1x1 + · · · + bkxk,

we can find the individual probability estimates from the equation:

p̂ = 1 / (1 + e^−(b0 + b1x1 + · · · + bkxk)) = e^(b0 + b1x1 + · · · + bkxk) / (1 + e^(b0 + b1x1 + · · · + bkxk)).

The assumptions and conditions for fitting a logistic regression are similar to multiple regression. We still need the Independence Assumption and Randomization Condition. We no longer need the linearity or equal spread condition. However, it's a good idea to plot the response variable against each predictor variable to make sure that there are no outliers that could unduly influence the model. A customer who both spent $10,000 and responded to the offer could possibly change the shape of the curve shown in Figure 18.7. However, residual analysis for logistic regression models is beyond the scope of this book.

⁵Yes, the same sampling distribution model as we used in Chapter 15.

Logistic regression

Leiram wants to build a model to predict whether a customer will respond to the new catalog offer. The variable Respond indicates 1 for responders and 0 for nonresponders. After trying several models, she settles on the following logistic regression model:

Response Variable: Respond

Variable                     Coeff      SE(Coeff)   z-value   P-value
Intercept                   -3.298      0.4054      -8.136    <.0001
Age                          0.08365    0.01222      6.836    <.0001
Days Since Last Purchase     0.00364    0.000831     4.378    <.0001

Questions: Interpret the model. Would you keep both predictor variables in the model? Why? Who would be more likely to respond, a 20-year-old customer who has recently purchased another item, or a 50-year-old customer who has not purchased in a year?

Answers: The model says:

logit(p̂) = -3.298 + 0.08365 Age + 0.00364 Days Since Last Purchase,

which means that the probability of responding goes up with both Age and Days Since Last Purchase. Both terms have large z-values and correspondingly low P-values, which indicate that they contribute substantially to the model. We should keep both terms. Because both terms are positively associated with increasing (log) odds of purchasing, both an older person and one who has not purchased recently are more likely to respond. A 50-year-old who has not purchased recently would be more likely to respond than the 20-year-old who has.
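To make that comparison concrete, here is a small Python sketch (my own illustration, not part of the text) that plugs both customers into the fitted equation. "Recently purchased" is assumed to mean about 7 days since the last purchase and "a year" to mean 365 days; those inputs are illustrative assumptions.

```python
import math

def respond_prob(age, days_since_last_purchase):
    # Fitted model from the example: logit(p-hat) = -3.298 + 0.08365*Age + 0.00364*Days
    logit = -3.298 + 0.08365 * age + 0.00364 * days_since_last_purchase
    return 1.0 / (1.0 + math.exp(-logit))

print(round(respond_prob(20, 7), 3))    # 20-year-old, recent purchase
print(round(respond_prob(50, 365), 3))  # 50-year-old, no purchase in a year
```

Under these assumptions the 50-year-old's estimated probability (about 0.9) is far higher than the 20-year-old's (about 0.17), consistent with the answer above.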

Time on Market

A real estate agent used information on 1115 houses, such as we used in our multiple regression Guided Example. She wants to predict whether each house sold in the first 3 months it was on the market based on other variables. The variables available include:

Sold — 1 = Yes, the house sold within the first 3 months it was listed; 0 = No, it did not sell within 3 months.
Price — The price of the house as sold in 2002.
Living Area — The size of the living area of the house in square feet.
Bedrooms — The number of bedrooms.
Bathrooms — The number of bathrooms (a half bath is a toilet and sink only).
Age — Age of the house in years.
Fireplaces — Number of fireplaces in the house.

PLAN — Setup: State the objective of the study. Identify the variables.

We want to build a model to predict whether a house will sell within the first 3 months it's on the market based on Price ($), Living Area (sq ft), Bedrooms (#), Bathrooms (#), Fireplaces (#), and Age (years). Notice that now Price is a predictor variable in the regression.

Model: Think about the assumptions and check the conditions.

Outlier Condition: To fit a logistic regression model, we check that there are no outliers in the predictors that may unduly influence the model. Here are scatterplots of Sold (1 = Yes; 0 = No) against each predictor. (Plots of Sold against the variables Bathrooms, Bedrooms, and Fireplaces are uninformative because these predictors are discrete.)

[Scatterplots of Sold (1 = Yes; 0 = No) against Price ($000), Living Area (sq ft), and Age (years).]

✓ Outlier Condition. There do not seem to be any outliers in the predictors. Of course, there can't be any in the response variable; it is only 0 or 1.
✓ Independence Assumption. We can regard the house prices as being independent of one another since they come from a fairly large geographic area.
✓ Randomization Condition. These 1115 houses are a random sample from a larger collection of houses.

We can fit a logistic regression predicting Sold from the six predictor variables.
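The Guided Example fits this model with a statistics package. As a rough equivalent, here is a hedged Python sketch using statsmodels (my choice, not the book's), assuming the data are in a CSV file with columns named like the variables above; the file name is a placeholder.

```python
import pandas as pd
import statsmodels.api as sm

houses = pd.read_csv("saratoga_houses.csv")   # hypothetical file name
predictors = ["Living Area", "Age", "Price", "Bedrooms", "Bathrooms", "Fireplaces"]

X = sm.add_constant(houses[predictors])  # add the intercept term
y = houses["Sold"]                       # 1 = sold within 3 months, 0 = not

model = sm.Logit(y, X).fit()             # maximum-likelihood logistic regression
print(model.summary())                   # coefficients, SEs, z-values, P-values
```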

DO — Mechanics: We always fit logistic regression models with computer software. An output table like this one isn't exactly what any of the major packages produce, but it is enough like all of them to look familiar.

Here is the computer output for the logistic regression using all six predictors.

Variable        Coeff        SE(Coeff)    z-value    P-value
Intercept      -3.222e+00    3.826e-01    -8.422     <0.0001
Living Area    -1.444e-03    2.518e-04    -5.734     <0.0001
Age             4.900e-03    2.823e-03     1.736      0.082609
Price           1.693e-05    1.444e-06    11.719     <0.0001
Bedrooms        4.805e-01    1.366e-01     3.517      0.000436
Bathrooms      -1.813e-01    1.829e-01    -0.991      0.321493
Fireplaces     -1.253e-01    1.633e-01    -0.767      0.442885

Strategies for model selection will be discussed in Chapter 19.

The estimated equation is:

logit(p̂) = -3.22 - 0.00144 Living Area + 0.0049 Age + 0.0000169 Price + 0.481 Bedrooms - 0.181 Bathrooms - 0.125 Fireplaces

Three of the P-values are quite small, two are large (Bathrooms and Fireplaces), and one is marginal (Age). After examining several alternatives, we chose the following model.

Variable        Coeff        SE(Coeff)    z-value    P-value
Intercept      -3.351e+00    3.601e-01    -9.305     <0.0001
Living Area    -1.574e-03    2.342e-04    -6.719     <0.0001
Age             6.106e-03    2.668e-03     2.289      0.022102
Bedrooms        4.631e-01    1.354e-01     3.421      0.000623
Price           1.672e-05    1.428e-06    11.704     <0.0001

The estimated logit equation is:

logit(p̂) = -3.351 - 0.00157 Living Area + 0.00611 Age + 0.0000167 Price + 0.463 Bedrooms

While interpretation is difficult, it appears that for a house of a given size (and age and bedrooms), higher-priced homes may have a higher chance of selling within 3 months. For a house of given size, price, and age, having more bedrooms may be associated with a greater chance of selling within 3 months.
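To see what the chosen model implies for a particular listing, here is a small Python sketch (my own illustration) that back-transforms the fitted logit for a hypothetical house; the house's characteristics are made up for the example.

```python
import math

def prob_sold_within_3_months(living_area, age, price, bedrooms):
    # Chosen model from the Guided Example (coefficients as reported above)
    logit = (-3.351 - 0.00157 * living_area + 0.00611 * age
             + 0.0000167 * price + 0.463 * bedrooms)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical 1800 sq ft, 20-year-old, $200,000, 3-bedroom house
print(round(prob_sold_within_3_months(1800, 20, 200_000, 3), 3))  # about 0.21
```

Under these made-up inputs, the estimated probability of selling within 3 months is roughly 0.21.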

REPORT — Summary and Conclusions: Summarize your results and state any limitations of your model in the context of your original objectives.

MEMO
Re: Logistic Regression Analysis of Selling

A logistic regression model of Sold on various predictors was fit, and a model based on Living Area, Bedrooms, Price, and Age found that these four predictors were statistically significant in predicting the probability that a house will sell within 3 months. More thorough analysis to understand the meaning of the coefficients is needed, but each of these variables appears to be an important predictor of whether the house will sell quickly. However, knowing more specific information about other characteristics of a house and where it is located would almost certainly help the model.

What Can Go Wrong?

Interpreting Coefficients

• Don't claim to "hold everything else constant" for a single individual. It's often meaningless to say that a regression coefficient says what we expect to happen if all variables but one were held constant for an individual and the predictor in question changed. While it's mathematically correct, it often just doesn't make any sense. For example, in a regression of salary on years of experience, years of education, and age, subjects can't gain a year of experience or get another year of education without getting a year older. Instead, we can think about all those who satisfy given criteria on some predictors and ask about the conditional relationship between y and one x for those individuals.

• Don't interpret regression causally. Regressions are usually applied to observational data. Without deliberately assigned treatments, randomization, and control, we can't draw conclusions about causes and effects. We can never be certain that there are no variables lurking in the background, causing everything we've seen. Don't interpret b1, the coefficient of x1 in the multiple regression, by saying: "If we were to change an individual's x1 by 1 unit (holding the other x's constant), it would change his y by b1 units." We have no way of knowing what applying a change to an individual would do.

• Be cautious about interpreting a regression model as predictive. Yes, we do call the x's predictors, and you can certainly plug in values for each of the x's and find a corresponding predicted value, ŷ. But the term "prediction" suggests extrapolation into the future or beyond the data, and we know that we can get into trouble when we use models to estimate ŷ values for x's not in the range of the data. Be careful not to extrapolate very far from the span of your data. In simple regression, it was easy to tell when you extrapolated. With many predictor variables, it's often harder to know when you are outside the bounds of your original data.⁶ We usually think of fitting models to the data more as modeling than as prediction, so that's often a more appropriate term.

• Be careful when interpreting the signs of coefficients in a multiple regression. Sometimes our primary interest in a predictor is whether it has a positive or negative association with y. As we have seen, though, the sign of the coefficient also depends on the other predictors in the model. Don't look at the sign in isolation and conclude that "the direction of the relationship is positive (or negative)." Just like the value of the coefficient, the sign is about the relationship after allowing for the linear effects of the other predictors. The sign of a variable can change depending on which other predictors are in or out of the model. For example, in the regression model for house prices, we saw the coefficient of Bedrooms change sign when Living Area was added to the model as a predictor. It isn't correct to say either that houses with more bedrooms sell for more on average or that they sell for less. The truth is more subtle and requires that we understand the multiple regression model.

• If a coefficient's t-statistic is not significant, don't interpret it at all. You can't be sure that the value of the corresponding parameter in the underlying regression model isn't really zero.

⁶With several predictors we can wander beyond the data because of the combination of values even when individual values are not extraordinary. For example, houses with 1 bathroom and houses with 5 bedrooms can both be found in the real estate records, but a single house with 5 bedrooms and only 1 bathroom would be quite unusual. The model we found is not appropriate for predicting the price of such an extraordinary house.

• Don't fit a linear regression to data that aren't straight. This is the most fundamental regression assumption. If the relationship between the x's and y isn't approximately linear, there's no sense in fitting a linear model to it. What we mean by "linear" is a model of the form we have been writing for the regression. When we have two predictors, this is the equation of a plane, which is linear in the sense of being flat in all directions. With more predictors, the geometry is harder to visualize, but the simple structure of the model is consistent; the predicted values change consistently with equal size changes in any predictor. Usually we're satisfied when plots of y against each of the x's are straight enough. We'll also check a scatterplot of the residuals against the predicted values for signs of nonlinearity.

• Watch out for changing variance in the residuals. The estimate of the error standard deviation shows up in all the inference formulas. But that estimate assumes that the error standard deviation is the same throughout the extent of the x's so that we can combine all the residuals when we estimate it. If $s_e$ changes with any x, these estimates won't make sense. The most common check is a plot of the residuals against the predicted values. If plots of residuals against several of the predictors all show a thickening, and especially if they also show a bend, then consider re-expressing y.

• Make sure the errors are nearly Normal. All of our inferences require that, unless the sample size is large, the true errors be modeled well by a Normal model. Check the histogram and Normal probability plot of the residuals to see whether this assumption looks reasonable.

• Watch out for high-influence points and outliers. We always have to be on the lookout for a few points that have undue influence on our model, and regression is certainly no exception. Chapter 19 discusses this issue in greater depth.
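These checks are usually done graphically. As a rough sketch (my own, assuming the residuals and fitted values from some multiple regression are already in hand), the standard diagnostic plots might look like this in Python with matplotlib and scipy:

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

def residual_diagnostics(fitted, residuals):
    """Residuals vs. fitted (linearity/equal spread), histogram and
    Normal probability plot of residuals (nearly Normal condition)."""
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    axes[0].scatter(fitted, residuals, alpha=0.5)
    axes[0].axhline(0, linestyle="--")
    axes[0].set(xlabel="Predicted values", ylabel="Residuals")

    axes[1].hist(residuals, bins=20)
    axes[1].set(xlabel="Residuals", ylabel="Frequency")

    stats.probplot(residuals, dist="norm", plot=axes[2])  # Normal probability plot
    plt.tight_layout()
    plt.show()
```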

Alpine Medical Systems, Inc. is a large provider of medical equipment and supplies to hospitals, doctors, clinics, and other health care professionals. Alpine's VP of Marketing and Sales, Kenneth Jadik, asked one of the company's analysts, Nicole Haly, to develop a model that could be used to predict the performance of the company's sales force. Based on data collected over the past year, as well as records kept by Human Resources, she considered five potential independent variables: (1) gender, (2) starting base salary, (3) years of sales experience, (4) personality test score, and (5) high school grade point average. The dependent variable (sales performance) is measured as the sales dollars generated per quarter. In discussing the results with Nicole, Kenneth asks to see the full regression model with all five independent variables included. Kenneth notes that a t-test for the coefficient of gender shows no significant effect on sales performance and recommends that it be eliminated from the model. Nicole reminds him of the company's history of offering lower starting base salaries to women, recently corrected under court order. If instead, starting base salary is removed from the model, gender is statistically significant, and its coefficient indicates that women on the sales force outperform men (taking into account the other variables). Kenneth argues that because gender is not significant when all predictors are included, it is the variable that should be omitted.

ETHICAL ISSUE The choice of predictors for the regression model is politically motivated. Because gender and base salary are related, it is impossible to separate their effects on sales performance, and inappropriate to conclude that one or the other is irrelevant. Related to Item A, ASA Ethical Guidelines.

ETHICAL SOLUTION The situation is more complex than a single model can explain. Both the model with gender but not base salary and the one with base salary but not gender should be reported. Then the discussion of these models should point out that the two variables are related because of previous company policy and note that the conclusion that those with lower base salary have better sales and the conclusion that women tend to have better sales performance are equivalent as far as these data are concerned.

What Have We Learned?

Learning Objectives

■ Know how to perform a multiple regression, using the technology of your choice.
• Technologies differ, but most produce similar-looking tables to hold the regression results. Know how to find the values you need in the output generated by the technology you are using.

■ Understand how to interpret a multiple regression model.
• The meaning of a multiple regression coefficient depends on the other variables in the model. In particular, it is the relationship of y to the associated x after removing the linear effects of the other x's.

■ Be sure to check the Assumptions and Conditions before interpreting a multiple regression model.
• The Linearity Assumption asserts that the form of the multiple regression model is appropriate. We check it by examining scatterplots. If the plots appear to be linear, we can fit a multiple regression model.
• The Independence Assumption requires that the errors made by the model in fitting the data be mutually independent. Data that arise from random samples or randomized experiments usually satisfy this assumption.
• The Equal Variance Assumption states that the variability around the multiple regression model should be the same everywhere. We usually check the Equal Spread Condition by plotting the residuals against the predicted values. This assumption is needed so that we can pool the residuals to estimate their standard deviation, which we will need for inferences about the regression coefficients.
• The Normality Assumption says that the model's errors should follow a Normal model. We check the Nearly Normal Condition with a histogram or Normal probability plot of the residuals. We need this assumption to use Student's t models for inference, but for larger sample sizes, it is less important.

■ Know how to state and test hypotheses about the multiple regression coefficients.
• The standard hypothesis test for each coefficient is
$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_A: \beta_j \ne 0.$$
• We test these hypotheses by referring the test statistic
$$t = \frac{b_j - 0}{SE(b_j)}$$
to the Student's t distribution on $n - k - 1$ degrees of freedom, where k is the number of predictor coefficients estimated in the multiple regression (not counting the intercept).

■ Interpret other associated statistics generated by a multiple regression.
• R² is the fraction of the variation in y accounted for by the multiple regression model.
• Adjusted R² attempts to adjust for the number of coefficients estimated.

• The F-statistic tests the overall hypothesis that the regression model is of no more value than simply modeling y with its mean.
• The standard deviation of the residuals,
$$s_e = \sqrt{\frac{\sum e^2}{n - k - 1}},$$
provides an idea of how precisely the regression model fits the data.
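As a compact numerical sketch (my own illustration, with made-up sums of squares rather than any data set from the chapter), all of these summary statistics can be recovered from SSE, SST, n, and the number of predictors k:

```python
import math

def regression_summaries(sse, sst, n, k):
    """R^2, adjusted R^2, residual SD, and F-statistic for a model with k predictors."""
    r2 = 1 - sse / sst
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    s_e = math.sqrt(sse / (n - k - 1))
    f_stat = ((sst - sse) / k) / (sse / (n - k - 1))   # MSR / MSE
    return r2, r2_adj, s_e, f_stat

# Made-up example: SSE = 40, SST = 100, n = 30 cases, k = 2 predictors
print(regression_summaries(40.0, 100.0, 30, 2))  # R^2 = 0.60, adj R^2 ~ 0.57, s_e ~ 1.22, F ~ 20.3
```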

Terms

Adjusted R² — An adjustment to the R² statistic that attempts to allow for the number of predictors in the model. It is sometimes used when comparing regression models with different numbers of predictors:
$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} = 1 - \frac{SSE/(n - k - 1)}{SST/(n - 1)}.$$

F-test — The F-test is used to test the null hypothesis that the overall regression is no improvement over just modeling y with its mean:
$$H_0: \beta_1 = \cdots = \beta_k = 0 \quad \text{vs.} \quad H_A: \text{at least one } \beta_j \ne 0.$$
If this null hypothesis is not rejected, then you should not proceed to test the individual coefficients.

*Logistic regression — A regression model that models a binary (0/1) response variable based on quantitative predictor variables.

Multiple regression — A linear regression with two or more predictors whose coefficients are found by least squares. When the distinction is needed, a least squares linear regression with a single predictor is called a simple regression. The multiple regression model is:
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon.$$

Regression Sum of Squares, SSR — A measure of the total variation in the response variable due to the model. $SSR = \sum(\hat{y} - \bar{y})^2$.

Sum of Squared Errors or Residuals, SSE — A measure of the variation in the residuals. $SSE = \sum(y - \hat{y})^2$.

Total Sum of Squares, SST — A measure of the variation in the response variable. $SST = \sum(y - \bar{y})^2$. Note that $SST/(n - 1) = \operatorname{Var}(y)$.

t-ratios for the coefficients — The t-ratios for the coefficients can be used to test the null hypotheses that the true value of each coefficient is zero against the alternative that it is not. The t distribution is also used in the construction of confidence intervals for each slope coefficient.
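As a small numerical sketch of that t-ratio test (my own, using invented values rather than any table in this chapter):

```python
from scipy import stats

def coefficient_t_test(b_j, se_b_j, n, k):
    """Two-sided t-test of H0: beta_j = 0 for one multiple regression coefficient."""
    t = (b_j - 0) / se_b_j
    df = n - k - 1                         # k = number of predictors
    p_value = 2 * stats.t.sf(abs(t), df)   # two-sided P-value
    return t, df, p_value

# Invented example: b_j = 2.5, SE(b_j) = 0.8, n = 50 cases, k = 3 predictors
print(coefficient_t_test(2.5, 0.8, 50, 3))   # t ~ 3.1 on 46 df, P ~ 0.003
```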

Technology Help: Regression Analysis

All statistics packages make a table of results for a regression. The table for multiple regression looks very similar to the table for simple regression. You'll want to look at the Analysis of Variance (ANOVA) table, and you'll see information for each of the coefficients.

Most packages offer to plot residuals against predicted values. Some will also plot residuals against the x's. With some packages, you must request plots of the residuals when you request the regression. Others let you find the regression first and then analyze the residuals afterward. Either way, your analysis is not complete if you don't check the residuals with a histogram or Normal probability plot and a scatterplot of the residuals against the x's or the predicted values.

One good way to check assumptions before embarking on a multiple regression analysis is with a scatterplot matrix. This is sometimes abbreviated SPLOM (or Matrix Plot) in commands.

Multiple regressions are always found with a computer or programmable calculator. Before computers were available, a full multiple regression analysis could take months or even years of work.

EXCEL
• In Excel 2007 or Excel 2010, select Data Analysis from the Analysis Group on the Data Tab.
• Select Regression from the Analysis Tools list.
• Click the OK button.
• Enter the data range holding the Y-variable in the box labeled "Y-range."
• Enter the range of cells holding the X-variables in the box labeled "X-range."
• Select the New Worksheet Ply option.
• Select Residuals options. Click the OK button.

JMP
• From the Analyze menu, select Fit Model.
• Specify the response, Y. Assign the predictors, X, in the Construct Model Effects dialog box.
• Click on Run Model.
Comments: JMP chooses a regression analysis when the response variable is "Continuous." The predictors can be any combination of quantitative or categorical. If you get a different analysis, check the variable types.

MINITAB
• Choose Regression from the Stat menu.
• Choose Regression . . . from the Regression submenu.
• In the Regression dialog, assign the Y-variable to the Response box and assign the X-variables to the Predictors box.
• Click the Graphs button.
• In the Regression-Graphs dialog, select Standardized residuals, and check Normal plot of residuals and Residuals versus fits.
• Click the OK button to return to the Regression dialog.
• Click the OK button to compute the regression.

SPSS
• Choose Regression from the Analyze menu.
• Choose Linear from the Regression submenu.
• When the Linear Regression dialog appears, select the Y-variable and move it to the dependent target. Then move the X-variables to the independent target.
• Click the Plots button.
• In the Linear Regression Plots dialog, choose to plot the *SRESIDs against the *ZPRED values.
• Click the Continue button to return to the Linear Regression dialog.
• Click the OK button to compute the regression.

Comments (these apply to the Excel Data Analysis add-in): The Y and X ranges do not need to be in the same rows of the spreadsheet, although they must cover the same number of cells. But it is a good idea to arrange your data in parallel columns as in a data table. The X-variables must be in adjacent columns. No cells in the data range may hold nonnumeric values or be left blank. Although the dialog offers a Normal probability plot of the residuals, the data analysis add-in does not make a correct probability plot, so don't use this option.
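For readers working in a general-purpose language instead of one of these packages, a rough equivalent in Python with statsmodels (my addition, not one of the packages listed above; file and column names are placeholders) looks like this:

```python
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("houses.csv")                         # placeholder file name
X = sm.add_constant(data[["Bedrooms", "Living Area"]])   # predictors plus intercept
y = data["Price"]

fit = sm.OLS(y, X).fit()
print(fit.summary())         # coefficient table, R^2, F-statistic, etc.

residuals = fit.resid        # use for histograms and residual-vs-fitted plots
fitted = fit.fittedvalues
```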

Golf Success

Professional sports, like many other professions, require a variety of skills for success. That makes it difficult to evaluate and predict success. Fortunately, sports provide examples we can use to learn about modeling success because of the vast amount of data which are available. Here's an example.

What makes a golfer successful? The game of golf requires many skills. Putting well or hitting long drives will not, by themselves, lead to success. Success in golf requires a combination of skills. That makes multiple regression a good candidate for modeling golf achievement. A number of Internet sites post statistics for the current PGA players. We have data for 204 top players of 2006 in the file Golfers. All of these players earned money on the tour, but they didn't all play the same number of events. And the distribution of earnings is quite skewed. (Tiger Woods earned $662,000 per event. In second place, Jim Furyk earned only $300,000 per event. Median earnings per event were $36,600.) So it's a good idea to take logs of Earnings/Event as the response variable. The variables in the data file include:

Log$/E — The logarithm of earnings per event.
GIR — Greens in Regulation. Percentage of holes played in which the ball is on the green with two or more strokes left for par.
Putts — Average number of putts per hole in which the green was reached in regulation.
Save% — Each time a golfer hits a bunker by the side of a green but needs only one or two additional shots to reach the hole, he is credited with a save. This is the percentage of opportunities for saves that are realized.
DDist — Average Drive Distance (yards). Measured as averages over pairs of drives in opposite directions (to account for wind).
DAcc — Drive Accuracy. Percent of drives landing on the fairway.

Investigate these data. Find a regression model to predict golfers' success (measured in log earnings per event). Write a report presenting your model including an assessment of its limitations. Note: Although you may consider several intermediate models, a good report is about the model you think best, not necessarily about all the models you tried along the way while searching for it.
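One possible way to get started (a sketch under the assumption that the Golfers file is available as a CSV with the column names listed above; the text does not specify the file format, and this is only a starting point, not a finished analysis):

```python
import pandas as pd
import statsmodels.api as sm

golfers = pd.read_csv("Golfers.csv")            # assumed file name and format
predictors = ["GIR", "Putts", "Save%", "DDist", "DAcc"]

# Scatterplot matrix to check linearity before modeling
pd.plotting.scatter_matrix(golfers[["Log$/E"] + predictors], figsize=(10, 10))

X = sm.add_constant(golfers[predictors])
fit = sm.OLS(golfers["Log$/E"], X).fit()
print(fit.summary())                            # refine the model from here
```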

SECTION 18.1

1. A house in the upstate New York area from which the chapter data was drawn has 2 bedrooms and 1000 square feet of living area. Using the multiple regression model found in the chapter,

Price = 20,986.09 - 7483.10 Bedrooms + 93.84 Living Area.

a) Find the price that this model estimates.
b) The house just sold for $135,000. Find the residual corresponding to this house.
c) What does that residual say about this transaction?

2. A candy maker surveyed chocolate bars available in a local supermarket and found the following least squares regression model:

Calories = 28.4 + 11.37 Fat(g) + 2.91 Sugar(g).

a) The hand-crafted chocolate she makes has 15g of fat and 20g of sugar. How many calories does the model predict for a serving?
b) In fact, a laboratory test shows that her candy has 227 calories per serving. Find the residual corresponding to this candy. (Be sure to include the units.)
c) What does that residual say about her candy?

SECTION 18.2

T 3. What can predict how much a motion picture will make? We have data on a number of recent releases that includes the USGross (in $), the Budget ($), the Run Time (minutes), and the average number of Stars awarded by reviewers. The first several entries in the data table look like this:

Movie                     USGross ($M)   Budget ($M)   Run Time (minutes)   Stars
White Noise                 56.094360        30              101              2
Coach Carter                67.264877        45              136              3
Elektra                     24.409722        65              100              2
Racing Stripes              49.772522        30              110              3
Assault on Precinct 13      20.040895        30              109              3
Are We There Yet?           82.674398        20               94              2
Alone in the Dark            5.178569        20               96              1.5
Indigo                      51.100486        25              105              3.5

We want a regression model to predict USGross. Parts of the regression output computed in Excel look like this:

Dependent variable is: USGross($)
R squared = 47.4%   R squared (adjusted) = 46.0%
s = 46.41 with 120 - 4 = 116 degrees of freedom

Variable     Coefficient   SE(Coeff)   t-ratio   P-value
Intercept     -22.9898      25.70       -0.895    0.3729
Budget($)       1.13442      0.1297      8.75    ≤0.0001
Stars          24.9724       5.884       4.24    ≤0.0001
Run Time       -0.403296     0.2513     -1.60     0.1113

a) Write the multiple regression equation.
b) What is the interpretation of the coefficient of Budget in this regression model?

4. A middle manager at an entertainment company, upon seeing this analysis, concludes that the longer you make a movie, the less money it will make. He argues that his company's films should all be cut by 30 minutes to improve their gross. Explain the flaw in his interpretation of this model.

SECTION 18.3

T 5. For the movies examined in Exercise 4, here is a scatterplot of USGross vs. Budget:

[Scatterplot of U.S. Gross ($M) against Budget ($M).]

What (if anything) does this scatterplot tell us about the following Assumptions and Conditions for the regression?
a) Linearity condition
b) Equal Spread condition
c) Normality assumption

6. For the movies regression, here is a histogram of the residuals. What does it tell us about these Assumptions and Conditions?

[Histogram of the residuals.]

a) Linearity condition
b) Nearly Normal condition
c) Equal Spread condition

SECTION 18.4

T 7. Regression output for the movies again:
a) What is the null hypothesis tested for the coefficient of Stars in this table?
b) What is the t-statistic corresponding to this test?
c) What is the P-value corresponding to this t-statistic?
d) Complete the hypothesis test. Do you reject the null hypothesis?

8. a) What is the null hypothesis tested for the coefficient of Run Time in the regression of Exercise 3?
b) What is the t-statistic corresponding to this test?
c) Why is this t-statistic negative?
d) What is the P-value corresponding to this t-statistic?
e) Complete the hypothesis test. Do you reject the null hypothesis?

SECTION 18.5

T 9. In the regression model of Exercise 3,
a) What is the R² for this regression? What does it mean?
b) Why is the "Adjusted R Square" in the table different from the "R Square"?

T 10. Here is another part of the regression output for the movies in Exercise 3:

Source       Sum of Squares   df   Mean Square   F-ratio
Regression        224995       3      74998.4      34.8
Residual          249799     116       2153.44

a) Using the values from the table, show how the value of R² could be computed. Don't try to do the calculation, just show what is computed.
b) What is the F-statistic value for this regression?
c) What null hypothesis can you test with it?
d) Would you reject that null hypothesis?

CHAPTER EXERCISES

The next 12 exercises consist of two sets of 6 (one even-numbered, one odd-numbered). Each set guides you through a multiple regression analysis. We suggest that you do all 6 exercises in a set. Remember that the answers to the odd-numbered exercises can be found in the back of the book.

T 11. Police salaries. Is the amount of violent crime related to what police officers are paid? The U.S. Bureau of Labor Statistics publishes data on occupational employment and wage estimates (www.bls.gov/oes/). Here are data on those states for which 2006 data were available. The variables are:

Violent Crime (crimes per 100,000 population)
Police Officer Wage (mean $/hr)
Graduation Rate (%)

One natural question to ask of these data is how police officer wages are related to violent crime across these states. First, here are plots and background information.

[Scatterplots of Violent Crime against Graduation Rate and against Police Officer Wage.]

Correlations:
                       Violent Crime   Graduation Rate   Police Officer Wage
Violent Crime               1.000
Graduation Rate            -0.682           1.000
Police Officer Wage         0.103           0.213              1.000

a) Name and check (to the extent possible) the regression assumptions and their corresponding conditions.
b) If we found a regression to predict Violent Crime just from Police Officer Wage, what would the R² of that regression be?

T 12. Ticket prices. On a typical night in New York, about 25,000 people attend a Broadway show, paying an average price of more than $75 per ticket. Variety (www.variety.com), a news weekly that reports on the entertainment industry, publishes statistics about the Broadway show business. The data file on the disk holds data about shows on Broadway for most weeks of 2006–2008. (A few weeks are missing data.) The following variables are available for each week:

Receipts ($ million)
Paid Attendance (thousands)
# Shows
Average Ticket Price ($)

Viewing this as a business, we'd like to model Receipts in terms of the other variables. First, here are plots and background information.

[Scatterplots of Receipts ($M) against Paid Attendance (000), Number of Shows, and Average Ticket Price.]

Correlations:
                         Receipts   Paid Attendance   # Shows   Average Ticket Price
Receipts ($M)              1.000
Paid Attendance            0.961        1.000
# Shows                    0.745        0.640          1.000
Average Ticket Price       0.258        0.331         -0.160          1.000

(Exercise 12, continued)
a) Name and check (to the extent possible) the regression assumptions and their corresponding conditions.
b) If we found a regression to predict Receipts only from Paid Attendance, what would the R² of that regression be?

13. Police salaries, part 2. Here's a multiple regression model for the variables considered in Exercise 11.

Dependent variable is: Violent Crime
R squared = 53.0%   R squared (adjusted) = 50.5%
s = 129.6 with 37 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression        701648       2      350824       20.9
Residual          621060      37       16785.4

Variable              Coeff     SE(Coeff)   t-ratio   P-value
Intercept            1390.83     185.9        7.48    <0.0001
Police Officer Wage     9.33       4.125      2.26     0.0297
Graduation Rate       -16.64       2.600     -6.40    <0.0001

a) Write the regression model.
b) What does the coefficient of Police Officer Wage mean in the context of this regression model?
c) In a state in which the average police officer wage is $20/hour and the high school graduation rate is 70%, what does this model estimate the violent crime rate would be?
d) Is this likely to be a good prediction? Why do you think that?

T 14. Ticket prices, part 2. Here's a multiple regression model for the variables considered in Exercise 12:⁷

Dependent variable is: Receipts($M)
R squared = 99.9%   R squared (adjusted) = 99.9%
s = 0.0931 with 74 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression       484.789       3     161.596      18634
Residual           0.641736    74      0.008672

Variable               Coeff      SE(Coeff)   t-ratio   P-value
Intercept             -18.320      0.3127      -58.6    <0.0001
Paid Attendance         0.076      0.0006      126.7    <0.0001
# Shows                 0.0070     0.0044        1.59    0.116
Average Ticket Price    0.24       0.0039       61.5    <0.0001

a) Write the regression model.
b) What does the coefficient of Paid Attendance mean in this regression? Does that make sense?
c) In a week in which the paid attendance was 200,000 customers attending 30 shows at an average ticket price of $70, what would you estimate the receipts would be?
d) Is this likely to be a good prediction? Why do you think that?

T 15. Police salaries, part 3. Using the regression table in Exercise 13, answer the following questions.
a) How was the t-ratio of 2.26 found for Police Officer Wage? (Show what is computed using numbers from the table.)
b) How many states are used in this model? How do you know?
c) The t-ratio for Graduation Rate is negative. What does that mean?

T 16. Ticket prices, part 3. Using the regression table in Exercise 14, answer the following questions.
a) How was the t-ratio of 126.7 found for Paid Attendance? (Show what is computed using numbers found in the table.)
b) How many weeks are included in this regression? How can you tell?
c) The t-ratio for the intercept is negative. What does that mean?

T 17. Police salaries, part 4. Consider the coefficient of Police Officer Wage in the regression table of Exercise 13.
a) State the standard null and alternative hypotheses for the true coefficient of Police Officer Wage.
b) Test the null hypothesis (at α = 0.05) and state your conclusion.
c) A state senate aide challenges your conclusion. She points out that we can see from the scatterplot and correlation (see Exercise 11) that there is almost no linear relationship between police officer wages and violent crime. Therefore, she claims, your conclusion in part a must be mistaken. Explain to her why this is not a contradiction.

T 18. Ticket prices, part 4. Consider the coefficient of # Shows in the regression table of Exercise 14.
a) State the standard null and alternative hypotheses for the true coefficient of # Shows.
b) Test the null hypothesis (at α = 0.05) and state your conclusion.
c) A Broadway investor challenges your analysis. He points out that the scatterplot of Receipts vs. # Shows in Exercise 12 shows a strong linear relationship and claims that your result in part a can't be correct. Explain to him why this is not a contradiction.

T 19. Police salaries, part 5. The Senate aide in Exercise 17 now accepts your analysis but claims that it demonstrates that if the state pays police more, it will actually increase the rate of violent crime. Explain why this interpretation is not a valid use of this regression model. Offer some alternative explanations.

T 20. Ticket prices, part 5. The investor in Exercise 18 now accepts your analysis but claims that it demonstrates that it doesn't matter how many shows are playing on Broadway; receipts will be essentially the same. Explain why this interpretation is not a valid use of this regression model. Be specific.

⁷Some values are rounded to simplify the exercises. If you recompute the analysis with your statistics software you may see slightly different numbers.

T 21. Police salaries, part 6. Here are some plots of residuals for the regression of Exercise 13.

23. Real estate prices. A regression was performed to predict the selling Price of houses (in dollars) from Area in square feet, Lotsize in square feet, and Age in years. The R² is 92%. The equation from this regression is given here.

Price = 169,328 + 35.3 Area + 0.718 Lotsize - 6543 Age

[Exercise 21: plots of the residuals against the predicted values and a Normal probability plot of the residuals (Nscores).]

(Exercise 21, continued) Which of the regression conditions can you check with these plots? Do you find that those conditions are met?

(Exercise 23, continued) One of the following interpretations is correct. Which is it? Explain what's wrong with the others.
a) Each year a house ages, it is worth $6543 less.
b) Every extra square foot of area is associated with an additional $35.50 in average price, for houses with a given lot size and age.
c) Every additional dollar in price means lot size increases 0.718 square feet.
d) This model fits 92% of the data points exactly.

24. Wine prices. Many factors affect the price of wine, including such qualitative characteristics as the variety of grape, location of winery, and label. Researchers developed a regression model considering two quantitative variables: the tasting score of the wine and the age of the wine (in years) when released to market. They found the following regression equation, with an R² of 65%, to predict the price (in dollars) of a bottle of wine.

Price = 6.25 + 1.22 Tasting Score + 0.55 Age

One of the following interpretations is correct. Which is it? Explain what's wrong with the others.
a) Each year a bottle of wine ages, its price increases about $.55.
b) This model fits 65% of the points exactly.

T 22. Ticket prices, part 6. Here are some plots of residuals for the regression of Exercise 14. [Plots of the residuals against the predicted values and against Year.]

(Exercise 24, continued)
c) For a unit increase in tasting score, the price of a bottle of wine increases about $1.22.
d) After allowing for the age of a bottle of wine, a wine with a one unit higher tasting score can be expected to cost about $1.22 more.

25. Appliance sales. A household appliance manufacturer wants to analyze the relationship between total sales and the company's three primary means of advertising (television, magazines, and radio). All values were in millions of dollars. They found the following regression equation.

Sales = 250 + 6.75 TV + 3.5 Radio + 2.3 Magazine

One of the following interpretations is correct. Which is it? Explain what's wrong with the others.
a) If they did no advertising, their income would be $250 million.
b) Every million dollars spent on radio makes sales increase $3.5 million, all other things being equal.
c) Every million dollars spent on magazines increases TV spending $2.3 million.
d) Sales increase on average about $6.75 million for each million spent on TV, after allowing for the effects of the other kinds of advertising.

(Exercise 22, continued) Which of the regression conditions can you check with these plots? Do you find that those conditions are met?

26. Wine prices, part 2. Here are some more interpretations of the regression model to predict the price of wine developed in Exercise 24. One of these interpretations is correct. Which is it? Explain what is wrong with the others.
a) The minimum price for a bottle of wine that has not aged is $6.25.
b) The price for a bottle of wine increases on average about $.55 for each year it ages, after allowing for the effects of tasting score.
c) Each year a bottle of wine ages, its tasting score increases by 1.22.
d) Each dollar increase in the price of wine increases its tasting score by 1.22.

27. Cost of pollution. What is the financial impact of pollution abatement on small firms? The U.S. government's Small Business Administration studied this and reported the following model.

Pollution abatement/employee = -2.494 - 0.431 ln(Number of Employees) + 0.698 ln(Sales)

Pollution abatement is in dollars per employee (Crain, M. W., The Impact of Regulatory Costs on Small Firms, available at www.sba.gov/idc/groups/public/documents/sba_homepage/rs264tot.pdf).
a) The coefficient of ln(Number of Employees) is negative. What does that mean in the context of this model? What does it mean that the coefficient of ln(Sales) is positive?
b) The model uses the (natural) logarithms of the two predictors. What does the use of this transformation say about their effects on pollution abatement costs?

T 28. OECD economic regulations. A study by the U.S. Small Business Administration modeled the GDP per capita of 24 of the countries in the Organization for Economic Cooperation and Development (OECD) (Crain, M. W., The Impact of Regulatory Costs on Small Firms, available at www.sba.gov/idc/groups/public/documents/sba_homepage/rs264tot.pdf). One analysis estimated the effect on GDP of economic regulations, using an index of the degree of OECD economic regulation and other variables. They found the following regression model.

GDP/Capita(1998–2002) = 10487 - 1343 OECD Economic Regulation Index + 1.078 GDP/Capita(1988) - 69.99 Ethno-linguistic Diversity Index + 44.71 Trade as share of GDP (1998–2002) - 58.4 Primary Education (% Eligible Population)

All t-statistics on the individual coefficients have P-values < 0.05, except the coefficient of Primary Education.
a) Does the coefficient of the OECD Economic Regulation Index indicate that more regulation leads to lower GDP/Capita? Explain.
b) The F-statistic for this model is 129.61 (5, 17 df). What do you conclude about the model?
c) If GDP/Capita(1988) is removed as a predictor, then the F-statistic drops to 0.694 and none of the t-statistics is significant (all P-values > 0.22). Reconsider your interpretation in part a.

29. Home prices. Many variables have an impact on determining the price of a house. A few of these are size of the house (square feet), lot size, and number of bathrooms. Information for a random sample of homes for sale in the Statesboro, Georgia, area was obtained from the Internet. Regression output modeling the asking price with square footage and number of bathrooms gave the following result.

Dependent Variable is: Asking Price
s = 67013   R-Sq = 71.1%   R-Sq(adj) = 64.6%

Predictor    Coeff       SE(Coeff)   t-ratio   P-value
Intercept   -152037       85619       -1.78     0.110
Baths          9530       40826        0.23     0.821
Area            139.87       46.67     3.00     0.015

Analysis of Variance
Source       DF   SS              MS             F       P-value
Regression    2   99303550067     49651775033    11.06   0.004
Residual      9   40416679100      4490742122
Total        11   1.39720E+11

a) Write the regression equation.
b) How much of the variation in home asking prices is accounted for by the model?
c) Explain in context what the coefficient of Area means.
d) The owner of a construction firm, upon seeing this model, objects because the model says that the number of bathrooms has no effect on the price of the home. He says that when he adds another bathroom, it increases the value. Is it true that the number of bathrooms is unrelated to house price? (Hint: Do you think bigger houses have more bathrooms?)

30. Home prices, part 2. Here are some diagnostic plots for the home prices data from Exercise 29. These were generated by a computer package and may look different from the plots generated by the packages you use. (In particular, note that the axes of the Normal probability plot are swapped relative to the plots we've made in the text. We only care about the pattern of this plot, so it shouldn't affect your interpretation.) Examine these plots and discuss whether the assumptions and conditions for the multiple regression seem reasonable.

[Plot of Residual ($) against Fitted Value.]

Normal Probability Plot of experience, 9th grade education (9 years of education), 50 on the standardized test, 60 wpm typing speed, and the 2 ability to take 30 wpm dictation? c) Test whether the coefficient for words per minute of 1 typing speed (X4) is significantly different from zero at a = 0.05. d) 0 How might this model be improved? e) A correlation of age with salary finds r = 0.682, and the Normal Score –1 scatterplot shows a moderately strong positive linear asso- ciation. However, if X6= Age is added to the multiple re- gression, the estimated coefficient of age turns out to be –2 =- b6 0.154. Explain some possible causes for this appar- –100000 –50000 0 50000 100000 150000 Residual ($) ent change of direction in the relationship between age and salary. T 32. Wal-Mart revenue. Here’s a regression of monthly rev- 5 enue of Wal-Mart Corp, relating that revenue to the Total 4 U.S. Retail Sales, the Personal Consumption Index, and 3 the Consumer Price Index. 2 Dependent variable is: Wal-Mart_Revenue

Frequency R squared = 66.7% R squared (adjusted) = 63.8% 1 s = 2.327 with 39 - 4 = 35 degrees of freedom 0 Source Sum of Squares df Mean Square F-ratio –50000 0 50000 100000 150000 Regression 378.749 3 126.250 23.3 Residual ($) Residual 189.474 35 5.41354 Variable Coeff SE(Coeff) t-ratio P-value 31. Secretary performance. The AFL-CIO has undertaken Intercept 87.0089 33.60 2.59 0.0139 a study of 30 secretaries’ yearly salaries (in thousands of Retail Sales 0.000103 0.000015 6.676 0.0001 dollars). The organization wants to predict salaries from Persnl Consmp 0.00001108 0.000004 2.52 0.0165 several other variables. The variables to be considered po- CPI- 0.344795 0.1203- 2.87 0.0070 tential predictors of salary are: X1= months of service = years of education 2.5 X3= score on standardized test 0.0 X4= words per minute (wpm) typing speed Residuals = X5 ability to take dictation in words per minute –2.5 A multiple regression model with all five variables was run on a computer package, resulting in the following output. –1.25 0.00 1.25 Variable Coeff Std. Error t-value Nscores Intercept 9.788 0.377 25.960 X1 0.110 0.019 5.178 X2 0.053 0.038 1.369 a) Write the regression model. X3 0.071 0.064 1.119 b) Interpret the coefficient of the Consumer Price Index X4 0.004 0.0307 0.013 (CPI). Does it surprise you that the sign of this coefficient X5 0.065 0.038 1.734 is negative? Explain. s = 0.430 R-sq = 0.863 c) Test the standard null hypothesis for the coefficient of CPI and state your conclusions. Assume that the residual plots show no violations of the conditions for using a linear regression model. T 33. Wilshire 5000. In Chapter 16, Exercise 41 considered the relationship between the Wilshire 5000 Total Market a) What is the regression equation? Return and the amount of money flowing into and out of b) From this model, what is the predicted salary (in thou- mutual funds (fund flows) monthly from January 1990 sands of dollars) of a secretary with 10 years (120 months) through October of 2002. The data file included data on Exercises 613

the unemployment rate in each of the months. The original model looked like this.

Dependent variable is: Fund_Flows 0.0 R squared = 17.9% R squared (adjusted) = 17.3% s = 10999 with 154 - 2 = 152 degrees of freedom Residuals –0.1

Source Sum of Squares df Mean Square F-ratio –0.2 Regression 4002044231 1 4002044231 33.1 Residual 18389954453 152 120986542 7.2 7.6 8.0 Variable Coeff SE(Coeff) t-ratio P-value Predicted Intercept 9599.10 896.9 10.76 0.0001 Wilshire 1156.40 201.1 5.756 0.0001 0.1 Adding the Unemployment Rate to the model yields: 0.0 Dependent variable is: Fund_Flows

R squared = 28.8% R squared (adjusted) = 27.8% Residuals –0.1 s = 10276 with 154 - 3 = 151 degrees of freedom –0.2 Source Sum of Squares df Mean Square F-ratio Regression 6446800389 2 3223400194 30.5 –1.25 0.00 1.25 Residual 15945198295 151 105597340 Nscores Variable Coeff SE(Coeff) t-ratio P-value Intercept 30212.2 4365 6.92… 0.0001 a) 6 Write the regression model. Wilshire 1183.29 187.9 6.30 0.0001 b) Unemployment- 3719.55 773.0- 4.816 0.0001 Are the assumptions and conditions met? c) Interpret the coefficient of Fishers. Would you expect a) Interpret the coefficient of the Unemployment Rate. that restricting the number of lobstering licenses to even b) The t-ratio for the Unemployment Rate is negative. fewer fishers would increase the value of the harvest? Explain why. d) State and test the standard null hypothesis for the coef- c) State and complete the standard hypothesis test for the ficient of Water Temperature. Scientists claim that this is an Unemployment Rate. important predictor of the harvest. Do you agree? T 34. Lobster industry 2008, revisited. In Chapter 17, Exercise 53 T 35. Lobster industry 2008, part 2. In Chapter 17, Exercise 54 predicted the annual value of the Maine lobster industry predicted the price ($/lb) of lobster harvested in the Maine catch from the number of licensed lobser fishers. The lob- lobster fishing industry. Here’s a multiple regression to ster industry is an important one in Maine, with annual predict the Price from the number of Traps (millions), the landings worth about $300,000,000 and employment con- number of Fishers, and the Catch/Trap (metric tonnes). sequences that extend throughout the state’s economy. We Dependent variable is: Price/lb saw in Chapter 17 that it was best to transform Value by R squared = 94.7% R squared (adjusted) = 94.4% logarithms. Here’s a more sophisticated multiple regres- s = 0.2703 with 56 - 4 = 52 degrees of freedom sion to predict the logValue from other variables pub- Source Sum of Squares df Mean Square F-ratio lished by the Maine Department of Marine Resources Regression 68.0259 3 22.6753 310 (maine.gov/dmr). The predictors are number of Traps Residual 3.80006 52 0.073078 (millions), number of licensed Fishers, and Water Temperature ( °F). Variable Coefficient SE(Coeff) t-ratio P-value Intercept 0.565367 0.3770 1.50 0.1397 Dependent variable is: LogValue Traps(M) 1.30744 0.0558 23.4… 0.0001 R squared = 97.4% R squared (adjusted) = 97.3% Fishers- 0.00012 0.0000- 2.94 0.0049 - s = 0.0782 with 56 4 = 52 degrees of freedom Catch/Trap- 9.97788 11.22 -0.889 0.3781 Source Sum of Squares df Mean Square F-ratio Regression 12.0200 3 4.00667 656 Residual 0.317762 52 0.006111 0.3 0.0 Variable Coeff SE(Coeff) t-ratio P-value

Intercept 1.70207 0.3706 4.596 0.0001 Residuals –0.3 Traps(M) 0.575390 0.0152 37.86 0.0001 –0.6 Fishers- 0.0000532 0.00001- 5.436 0.0001 - - Water Temp 0.015185 0.0074 2.05 0.0457 0.75 1.50 2.25 3.00 Predicted 614 CHAPTER 18 • Multiple Regression

15 b) Produce the appropriate residual plots to check the as- sumptions. Is this inference appropriate for this model? Explain. 10 c) How much variation in the price of beef can be ex- plained by this model? 5 d) Consider the coefficient of beef consumption per capita (CBE). Does it say that that price of beef goes up when people eat less beef? Explain.

–0.750 –0.125 0.500 T 37. Wal-Mart revenue, part 2. Wal-Mart is the second Residuals largest retailer in the world. The data file on the disk holds monthly data on Wal-Mart’s revenue, along with several possibly related economic variables. a) Write the regression model. b) Are the assumptions and conditions met? a) Using computer software, find the regression equation c) State and test the standard null hypothesis for the coef- predicting Wal-Mart revenues from the Retail Index, the ficient of Catch/Trap. Use the standard a -level of .05 and Consumer Price index (CPI), and Personal Consumption. state your conclusion. b) Does it seem that Wal-Mart’s revenue is closely related d) Does the coefficient of Catch/Trap mean that when the to the general state of the economy? catch per trap declines the price will increase? T 38. Wal-Mart revenue, part 3. Consider the model you fit in 2 e) This model has an adjusted R of 94.7%. The previous Exercise 37 to predict Wal-Mart’s revenue from the Retail 2 model of Chapter 17, Exercise 34, had an adjusted R of Index, CPI, and Personal Consumption index. 93.6%. What does adjusted R2 say about the two models? a) Plot the residuals against the predicted values and com- T 36. Price of beef. How is the price of beef related to other ment on what you see. factors? The data in the following table give information b) Identify and remove the four cases corresponding to on the price of beef (PBE) and the possible explanatory December revenue and find the regression with December variables: consumption of beef per capita (CBE), retail food results removed. price index (PFO), food consumption per capita index c) Does it seem that Wal-Mart’s revenue is closely related (CFO), and an index of real disposable income per capita to the general state of the economy? (RDINC) for the years 1925 to 1941 in the United States. 39.* Right-to-work laws. Are state right-to-work laws re- lated to the percent of public sector employees (Public) in Year PBE CBE PFO CFO RDINC unions and the percent of private sector employees (Private) 1925 59.7 58.6 65.8 90.9 68.5 in unions? (Each predictor variable has a value between 0 and 100.) This data set looks at these percentages for the 1926 59.7 59.4 68 92.1 69.6 states in the United States in 1982. The dependent variable 1927 63 53.7 65.5 90.9 70.2 is whether the state had a right-to-work law or not (1 = 1928 71 48.1 64.8 90.9 71.9 Yes, 0 = No). The computer output for the logistic regres- 1929 71 49 65.6 91.1 75.2 sion is given here. (Source: N. M. Meltz, “Interstate and 1930 74.2 48.2 62.4 90.7 68.3 Interprovincial Differences in Union Density,” Industrial 1931 72.1 47.9 51.4 90 64 Relations, 28:2 (Spring 1989), 142–158 by way of DASL.) 1932 79 46 42.8 87.8 53.9 Logistic Regression Table 1933 73.1 50.8 41.6 88 53.2 Predictor Coeff SE(Coeff) z P 1934 70.2 55.2 46.4 89.1 58 Intercept 6.19951 1.78724 3.47 0.001 1935 82.2 52.2 49.7 87.3 63.2 publ- 0.106155 0.0474897- 2.24 0.025 - - 1936 68.4 57.3 50.1 90.5 70.5 pvt 0.222957 0.0811253 2.75 0.006 1937 73 54.4 52.1 90.4 72.5 a) Write out the estimated regression equation. 1938 70.2 53.6 48.4 90.6 67.8 b) What is the predicted log odds (logit) of the probability 1939 67.8 53.9 47.1 93.8 73.2 that a state has a right-to-work law if 20% of the public and 1940 63.4 54.2 47.8 95.5 77.6 10% of the private sector employees are in unions? c) What is the predicted probability of that state? 
1941 56 60 52.2 97.5 89.5 d) What is the probability log odds (logit) that a state has a right-to-work law if 40% of the public and 20% of the pri- a) Use computer software to find the regression equation vate sector employees are in unions? for predicting the price of beef based on all of the given ex- e) What is the associated predicted probability? planatory variables. What is the regression equation? Exercises 615

40.* Cost of higher education. Are there fundamental differences between liberal arts colleges and universities? In this case, we have information on the top 25 liberal arts colleges and the top 25 universities in the United States. We will consider the type of school as our response variable and will use the percent of students who were in the top 10% of their high school class and the amount of money spent per student by the college or university as our explanatory variables. The output from this logistic regression is given here.

Logistic Regression Table
Predictor    Coeff        SE(Coeff)   z      P
Intercept    -13.1461     3.98629    -3.30   0.001
Top 10%        0.0845469  0.0396345   2.13   0.033
$/Student      0.0002594  0.0000860   3.02   0.003

a) Write out the estimated regression equation.
b) Is percent of students in the top 10% of their high school class statistically significant in predicting whether or not the school is a university? Explain.
c) Is the amount of money spent per student statistically significant in predicting whether or not the school is a university? Explain.

T 41. Motorcycles. More than one million motorcycles are sold annually (www.webbikeworld.com). Off-road motorcycles (often called "dirt bikes") are a market segment (about 18%) that is highly specialized and offers great variation in features. This makes it a good segment to study to learn about which features account for the cost (manufacturer's suggested retail price, MSRP) of a dirt bike. Researchers collected data on 2005 model dirt bikes (lib.stat.cmu.edu/datasets/dirtbike_aug.csv). Their original goal was to study market differentiation among brands (The Dirt on Bikes: An Illustration of CART Models for Brand Differentiation, Jiang Lu, Joseph B. Kadane, and Peter Boatwright, server1.tepper.cmu.edu/gsiadoc/WP/2006-E57.pdf), but we can use these to predict MSRP from other variables. Here are scatterplots of three potential predictors, Wheelbase (in), Displacement (cu in), and Bore (in). Comment on the appropriateness of using these variables as predictors on the basis of the scatterplots.

[Scatterplots of MSRP against Displacement, Bore, and Wheelbase.]

T 42. Motorcycles, part 2. In Exercise 41, we saw data on off-road motorcycles and examined scatterplots. Review those scatterplots. Here's a regression of MSRP on both Displacement and Bore. Both of the predictors are measures of the size of the engine. The displacement is the total volume of air and fuel mixture that an engine can draw in during one cycle. The bore is the diameter of the cylinders.

Dependent variable is: MSRP
R squared = 77.0%   R squared (adjusted) = 76.5%
s = 979.8 with 98 - 3 = 95 degrees of freedom

Variable       Coeff      SE(Coeff)  t-ratio  P-value
Intercept      318.352    1002        0.318   0.7515
Bore            41.1650     25.37     1.62    0.1080
Displacement     6.57069     3.232    2.03    0.0449

a) State and test the standard null hypothesis for the coefficient of Bore.
b) Both of these predictors seem to be linearly related to MSRP. Explain what your result in part a means.

T 43. Motorcycles, part 3. Here's another model for the MSRP of off-road motorcycles.

Dependent variable is: MSRP
R squared = 90.9%   R squared (adjusted) = 90.6%
s = 617.8 with 95 - 4 = 91 degrees of freedom

Source       Sum of Squares  df  Mean Square  F-ratio
Regression   346795061        3  115598354    303
Residual      34733372       91     381685

Variable         Coeff       SE(Coeff)  t-ratio  P-value
Intercept        -2682.38    371.9      -7.21    <0.0001
Bore                86.5217    5.450    15.9     <0.0001
Clearance          237.731    30.94      7.68    <0.0001
Engine strokes    -455.897    89.88     -5.07    <0.0001

a) Would this be a good model to use to predict the price of an off-road motorcycle if you knew its bore, clearance, and engine strokes? Explain.
b) The Suzuki DR650SE had an MSRP of $4999 and a 4-stroke engine, with a bore of 100 inches. Can you use this model to estimate its Clearance? Explain.

T 44. Demographics. Following is a data set on various measures of the 50 United States. The Murder rate is per 100,000, HS Graduation rate is in %, Income is per capita income in dollars, Illiteracy rate is per 1000, and Life Expectancy is in years. Find a regression model for Life Expectancy with three predictor variables by trying all four of the possible models.
a) Which model appears to do the best?
b) Would you leave all three predictors in this model?
c) Does this model mean that by changing the levels of the predictors in this equation, we could affect life expectancy in that state? Explain.
d) Be sure to check the conditions for multiple regression. What do you conclude?

T 45. Burger King nutrition. Like many fast-food restaurant chains, Burger King (BK) provides data on the nutrition content of its menu items on its website. Here's a multiple regression predicting calories for Burger King foods from Protein content (g), Total Fat (g), Carbohydrate (g), and Sodium (mg) per serving.

Dependent variable is: Calories
R-squared = 100.0%   R-squared (adjusted) = 100.0%
s = 3.140 with 31 - 5 = 26 degrees of freedom

Source       Sum of Squares  df  Mean Square  F-ratio
Regression   1419311          4  354828       35994
Residual         256.307     26    9.85796

Variable    Coeff       SE(Coeff)  t-ratio  P-value
Intercept    6.53412    2.425       2.69    0.0122
Protein      3.83855    0.0859     44.7     <0.0001
Total fat    9.14121    0.0779    117       <0.0001
Carbs        3.94033    0.0338    117       <0.0001
Na/Serv.    -0.69155    0.2970     -2.33    0.0279

a) Do you think this model would do a good job of predicting calories for a new BK menu item? Why or why not?
b) The mean of Calories is 455.5 with a standard deviation of 217.5. Discuss what the value of s in the regression means about how well the model fits the data.
c) Does the R2 value of 100.0% mean that the residuals are all actually equal to zero?

1 58.4% of the variation in Percent Body Fat can be accounted for by the multiple regression model using Height, Age, and Weight as predictors.
2 For a given Height and Weight, an increase of one year in Age is associated with an increase of 0.137% in Body Fat on average.
3 The multiple regression coefficient is interpreted for given values of the other variables. That is, for people of the same Weight and Age, an increase of one inch of Height is associated with, on average, a decrease of 1.274% in Body Fat. The same cannot be said when looking at people of all Weights and Ages.

Building Multiple Regression Models

Bolliger and Mabillard

The story is told that when John Wardley, the award-winning concept designer of theme parks and roller coasters, including Nitro and Oblivion, was about to test a new coaster for the first time, he asked Walter Bolliger, the president of the coaster's manufacturer, B&M, "What if the coaster stalls? How will we get the trains back to the station?" Bolliger replied, "Our coasters never stall. They always work perfectly the first time." And, of course, it did work perfectly.

Roller coaster connoisseurs know that Bolliger & Mabillard Consulting Engineers, Inc. (B&M) is responsible for some of the most innovative roller coasters in the business. The company was founded in the late 1980s when Walter Bolliger and Claude Mabillard left Intamin AG, where they had designed the company's first stand-up coaster. B&M has built its reputation on innovation. They developed the first "inverted" roller coaster, in which the train runs under the track with the seats attached to the wheel carriage, and pioneered "diving machines," which feature a vertical drop, introduced first with Oblivion.


B&M coasters are famous among enthusiasts for particularly smooth rides and for their reliability, easy maintainability, and excellent safety record. Unlike some other manufacturers, B&M does not use powered launches, preferring, as do many coaster connoisseurs, gravity-powered coasters. B&M is an international leader in the roller coaster design field, having designed 24 of the top 50 steel roller coasters on the 2009 Golden Ticket Awards list and 4 of the top 10.

Theme parks are big business. In the United States alone, there are nearly 500 theme and amusement parks that generate over $10 billion a year in revenue. The U.S. industry is fairly mature, but parks in the rest of the world are still growing. Europe now generates more than $1 billion a year from their theme parks, and Asia's industry is growing fast. Although theme parks have started to diversify to include water parks and zoos, rides are still the main attraction at most parks, and at the center of the rides is the roller coaster. Engineers and designers compete to make them bigger and faster. For a two-minute ride on the fastest and best roller coasters, fans will wait for hours.

WHO Roller coasters
WHAT See Table 19.1 for the variables and their units.
WHERE Worldwide
WHEN All were in operation in 2003.
WHY To understand characteristics that affect speed and duration

Can we learn what makes a roller coaster fast? What are the most important design considerations in getting the fastest coaster? Here are data on some of the fastest roller coasters in the world today:

Name                            Park                       Country  Type    Duration (sec)  Speed (mph)  Height (ft)  Drop (ft)  Length (ft)  Inversion?
New Mexico Rattler              Cliff's                    USA      Wooden   75             47            80            75       2750         No
Fujiyama                        Fuji-Q Highlands           Japan    Steel   216             80.8         259.2         229.7     6708.67      No
Goliath                         Six Flags Magic Mountain   USA      Steel   180             85           235           255       4500         No
Great American Scream Machine   Six Flags Great Adventure  USA      Steel   140             68           173           155       3800         Yes
Hangman                         Wild Adventures            USA      Steel   125             55           115            95       2170         Yes
Hayabusa                        Tokyo SummerLand           Japan    Steel   108             60.3         137.8         124.67    2559.1       No
Hercules                        Dorney Park                USA      Wooden  135             65            95           151       4000         No
Hurricane                       Myrtle Beach Pavilion      USA      Wooden  120             55           101.5         100       3800         No

A small selection of coasters from the larger data set available on the CD.

Table 19.1 Facts about some roller coasters. Source: The Roller coaster database at www.rcdb.com.

• Type indicates what kind of track the roller coaster has. The possible values are "wooden" and "steel." (The frame usually is of the same construction as the track, but doesn't have to be.)
• Duration is the duration of the ride in seconds.
• Speed is top speed in miles per hour.
• Height is maximum height above ground level in feet.
• Drop is greatest drop in feet.
• Length is total length of the track in feet.
• Inversions reports whether riders are turned upside down during the ride. It has the values "yes" or "no."

Figure 19.1 The relationship between Duration and Length looks strong and positive. On average, Duration increases linearly with Length.

Customers not only want the ride to be fast; they also want it to last. It makes sense that the longer the track is, the longer the ride will last. Let's have a look at Duration and Length to see what the relationship is.

Dependent variable is: Duration
R-squared = 62.0%   R-squared (adjusted) = 61.4%
s = 27.23 with 63 - 2 = 61 degrees of freedom

Source       Sum of Squares  DF  Mean Square  F-ratio
Regression   73901.7          1  73901.7      99.6
Residual     45243.7         61    741.700

Variable    Coefficient  SE(Coeff)  t-ratio  P-value
Intercept   53.9348      9.488      5.68     <0.0001
Length       0.0231      0.0023     9.98     <0.0001

Table 19.2 The regression of Duration on Length looks strong, and the conditions seem to be met.

As the scatterplots of the variables (Figure 19.1) and residuals (Figure 19.2) show, the regression conditions seem to be met, and the regression makes sense; we'd expect longer tracks to give longer rides. Starting from the intercept at about 53.9 seconds, the duration of the ride increases, on average, by 0.0231 seconds per foot of track—or 23.1 seconds more for each 1000 additional feet of track.

Figure 19.2 Residuals from the regression in Table 19.2.

19.1 Indicator (or Dummy) Variables
Of course, there's more to these data. One interesting variable might not be one you'd naturally think of. Many modern coasters have "inversions" where riders get turned upside down, with loops, corkscrews, or other devices. These inversions add excitement, but they must be carefully engineered, and that enforces some speed limits on that portion of the ride. Riders like speed, but they also like inversions, which affect both the speed and the duration of the ride. We'd like to add the information of whether the roller coaster has an inversion to our model.

Until now, all our predictor variables have been quantitative. Whether or not a roller coaster has an inversion is a categorical variable ("yes" or "no"). Let's see how to introduce the categorical variable Inversions as a predictor in our regression model. Figure 19.3 shows the same scatterplot of Duration against Length, but now with the roller coasters that have inversions shown as red x's and those without shown as blue dots. There's a separate regression line for each type of roller coaster.

Figure 19.3 The two lines fit to coasters with (red x's) and without (blue dots) inversions are roughly parallel.

It’s easy to see that, for a given length, the roller coasters with inversions take a bit longer and that for each type of roller coaster, the slopes of the relationship be- tween duration and length are not quite equal but are similar. If we split the data into two groups—coasters without inversions and those with inversions—and com- pute the regression for each group, the output looks like this:

Dependent variable is: Duration
Cases selected according to: No inversions
R-squared = 69.4%   R-squared (adjusted) = 68.5%
s = 25.12 with 38 - 2 = 36 degrees of freedom

Variable    Coefficient  SE(Coeff)  t-ratio  P-value
Intercept   25.9961      14.10      1.84     0.0734
Length       0.0274       0.003     9.03     <0.0001

Dependent variable is: Duration
Cases selected according to: Inversions
R-squared = 70.5%   R-squared (adjusted) = 69.2%
s = 23.20 with 25 - 2 = 23 degrees of freedom

Variable    Coefficient  SE(Coeff)  t-ratio  P-value
Intercept   47.6454      12.50      3.81     0.0009
Length       0.0299       0.004     7.41     <0.0001

Table 19.3 The regressions computed separately for the two types of roller coasters show similar slopes but different intercepts.

As the scatterplot showed, the slopes are very similar, but the intercepts are different. When we have a situation like this with roughly parallel regressions for each group,1 there's an easy way to add the group information to a single regression model. We create a new variable that indicates what type of roller coaster we have, giving it the value 1 for roller coasters that have inversions and the value 0 for those that don't. (We could have reversed the coding; it's an arbitrary choice.2) Such variables are called indicator variables or indicators because they indicate which category each case is in. They are also often called dummy variables. When we add our new indicator, Inversions, to the regression model as a second variable, the multiple regression model looks like this.

Inversions = 1 if a coaster has inversions; Inversions = 0 if not.

Dependent variable is: Duration
R-squared = 70.4%   R-squared (adjusted) = 69.4%
s = 24.24 with 63 - 3 = 60 degrees of freedom

Variable     Coefficient  SE(Coeff)  t-ratio  P-value
Intercept    22.3909      11.39       1.97    0.0539
Length        0.0282       0.0024    11.7     <0.0001
Inversions   30.0824       7.290      4.13    <0.0001

Table 19.4 The regression model with a dummy, or indicator, variable for Inversions.

This looks like a better model than the simple regression for all the data. The R2 is larger, the t-ratios of both coefficients are large, and now we can understand the effect of inversions with a single model without having to compare two regressions. (The residuals look reasonable as well.) But what does the coefficient for Inversions mean? Let's see how an indicator variable works when we calculate predicted values for two of the roller coasters listed in Table 19.1.

Name       Park               Country  Type   Duration  Speed  Height  Drop    Length  Inversion?
Hangman    Wild Adventures    USA      Steel  125       55     115      95     2170    Yes
Hayabusa   Tokyo SummerLand   Japan    Steel  108       60.3   137.8   124.67  2559.1  No

Ignoring the variable Inversions for the moment, the model (in Table 19.4) says that for all coasters, the predicted Duration is: 22.39 + 0.0282 Length + 30.08 Inversions. Now remember that for this indicator variable, the value 1 means that a coaster has an inversion, while a 0 means it doesn't. For Hayabusa, with no inversion, the value of Inversions is 0, so the coefficient of Inversions doesn't affect the prediction at all. With a length of 2559.1 feet, we predict its duration as:3 22.39 + 0.0282(2559.1) + 30.08 × 0 = 94.56 seconds, which is close to its actual duration of 108 seconds. The Hangman (with a length of 2170 feet) has an inversion, and so the model predicts an "additional" 30.0824 seconds for its duration: 22.39 + 0.0282(2170.0) + 30.08 × 1 = 113.66 seconds. That compares well with the actual duration of 125 seconds.

1The fact that the individual regression lines are nearly parallel is a part of the Linearity Condition. You should check that the lines are nearly parallel before using this method or read on to see what to do if they are not parallel enough.
2Some implementations of indicator variables use 1 and -1 for the levels of the categories.
3We round coefficient values when we write the model but calculate with the full precision, rounding at the end of the calculation.

Notice how the indicator works in the model. When there is an inversion (as in Hangman), the value 1 for the indicator causes the amount of the indicator's coefficient, 30.08, to be added to the prediction. When there is no inversion (as in Hayabusa), the indicator is 0, so nothing is added. Looking back at the scatterplot, we can see that this is exactly what we need. The difference between the two lines is a vertical shift of about 30 seconds. This may seem a bit confusing at first because we usually think of the coefficients in a multiple regression as slopes. For indicator variables, however, they act differently. They're vertical shifts that keep the slopes for the other variables apart. An indicator variable that is 0 or 1 can only shift the line up and down. It can't change the slope, so it works only when we have lines with the same slope and different intercepts.
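If you are building such a model in software rather than reading packaged output, the indicator is simply a 0/1 column that you add to the regression. Here is a minimal sketch in Python using pandas and statsmodels; the file name coasters.csv and the column names are assumptions for illustration, not the actual data file from the text.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names -- adjust to match your own data.
coasters = pd.read_csv("coasters.csv")

# Build the 0/1 indicator: 1 if the coaster has inversions, 0 if not.
coasters["InversionsInd"] = (coasters["Inversions"] == "yes").astype(int)

# Multiple regression of Duration on Length plus the indicator.
model = smf.ols("Duration ~ Length + InversionsInd", data=coasters).fit()
print(model.summary())

# The indicator's coefficient is the vertical shift between the two
# roughly parallel lines (about 30 seconds in the output shown above).
```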

Indicators for Three or More Categories
Business and economic variables such as Month or Socioeconomic Class may have several levels, not just two. You can construct indicators for a categorical variable with several levels by constructing a separate indicator for each of these levels. There's just one thing to keep in mind. If a variable has k levels, you can create only k - 1 indicators. You have to choose one of the k categories as a "baseline" and leave out its indicator. Then the coefficients of the other indicators can be interpreted as the amount by which their categories differ from the baseline, after allowing for the linear effects of the other variables in the model. For the two-category variable Inversions, we used "no inversion" as the baseline, and coasters with an inversion got a 1. We needed only one variable for two levels. If we wished to represent Month with indicators, we would need 11 of them. We might, for example, define January as the baseline and make indicators for February, March, . . . , November, and December. Each of these indicators would be 0 for all cases except for the ones that had that value for the variable Month. Why couldn't we use a single variable with "1" for January, "2" for February, and so on? That would require the pretty strict assumption that the responses to the months are linear and equally spaced—that is, that the change in our response variable from January to February is the same in both direction and amount as the change from July to August. That's a pretty severe restriction and usually isn't true. Using 11 indicators releases the model from that restriction even though it adds complexity to the model. Once you've created multiple indicator variables (up to k - 1) for a categorical variable with k levels, it often helps to combine levels with similar characteristics and with similar relationships with the response. This can help keep the number of variables in a multiple regression from exploding.
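As a concrete sketch of the k - 1 rule, here is one way to build month indicators in Python with pandas; the file sales.csv and its Sales and Month columns are hypothetical stand-ins used only to illustrate the coding.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical monthly data with columns Sales and Month ("Jan" ... "Dec").
sales = pd.read_csv("sales.csv")

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales["Month"] = pd.Categorical(sales["Month"], categories=months, ordered=True)

# k = 12 levels, so build k - 1 = 11 indicators; drop_first makes
# January the baseline month with no indicator of its own.
month_dummies = pd.get_dummies(sales["Month"], prefix="Month",
                               drop_first=True, dtype=int)
sales = pd.concat([sales, month_dummies], axis=1)

# Each Month_* coefficient is that month's shift relative to January.
formula = "Sales ~ " + " + ".join(month_dummies.columns)
model = smf.ols(formula, data=sales).fit()
print(model.params)
```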

Indicator variables

In Chapter 17 we looked at data on the prices of diamonds (see page 535). Although, in general, higher carat weight means higher cost, Carat Weight is not the only factor that determines a diamond's price. A typical diamond is pale yellow. The less color a diamond has, the higher its color grade and—generally—its price. We want to build a model that incorporates both Carat Weight and Color to predict a diamond's price. The data here are a collection of 749 diamonds: Weight is between 0.3 and 1.4 carats; color grades are D (highest possible), G (medium), and K (fairly low). As before, we'll use Log10Price as the response to make the relationship more linear. Here are scatterplots of Price and Log10Price versus Carat Weight for these diamonds:

[Scatterplots of Price and Log10Price against Carat Weight for the 749 diamonds.]

A linear model for Log10Price on Carat Weight alone finds:

Response Variable: Log10Price
R2 = 77.39%   Adjusted R2 = 77.36%
s = 0.1364 with 749 - 2 = 747 degrees of freedom

Variable       Coeff     SE(Coeff)  t-ratio  P-value
Intercept      2.76325   0.01532    180.42   <0.0001
Carat.Weight   0.90020   0.01780     50.56   <0.0001

Here is a scatterplot of Log10Price vs. Carat Weight showing the three different color grades (D, G, and K):

[Scatterplot of Log10Price against Carat Weight, with separate symbols for the color grades D, G, and K.]

To account for the differences, two indicator variables were created:
Color D = 1 if Color = D and 0 otherwise
Color G = 1 if Color = G and 0 otherwise

A multiple regression of Log10Price was run with three predictors: Carat Weight, Color D, and Color G. Here is the output:

Response Variable: Log10Price
R2 = 85.66%   Adjusted R2 = 85.61%
s = 0.1088 with 749 - 4 = 745 degrees of freedom

Variable       Coeff     SE(Coeff)  t-ratio  P-value
Intercept      2.43978   0.01981    123.14   <0.0001
Carat.Weight   1.02865   0.01589     64.75   <0.0001
ColorD         0.29405   0.01424     20.65   <0.0001
ColorG         0.21883   0.01278     17.12   <0.0001

Questions: For the three colors of diamonds, what are the models that predict Log10Price? Does the addition of the indicator variables for the colors seem like a good idea? Why are there only two indicator variables?

Answers: The indicator variable adds amounts to the intercept value for the two color levels, D and G. Since diamonds of color K have both dummy variables set to 0, the equation for diamonds with Color K is simply:

Log10Price = 2.440 + 1.029 Carat Weight

For diamonds with Color G, we add 0.219 to the intercept, so:

Log10Price = (2.440 + 0.219) + 1.029 Carat Weight = 2.659 + 1.029 Carat Weight

Similarly for diamonds with Color D, we add 0.294 to the intercept, so:

Log10Price = (2.440 + 0.294) + 1.029 Carat Weight = 2.734 + 1.029 Carat Weight

Both indicator variables are highly significant, which means that the differences in intercept are large enough to justify the addition of the variables. With 3 levels of color, we need only 2 indicator variables. (In general, k levels require k - 1 indicators.)
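The same two-indicator setup can be reproduced in software. The sketch below assumes a diamonds.csv file with Price, CaratWeight, and Color columns; those names are illustrative, not the actual file from the text.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names for the diamonds example.
diamonds = pd.read_csv("diamonds.csv")
diamonds["Log10Price"] = np.log10(diamonds["Price"])

# Two 0/1 indicators; color K is the baseline and gets no indicator.
diamonds["ColorD"] = (diamonds["Color"] == "D").astype(int)
diamonds["ColorG"] = (diamonds["Color"] == "G").astype(int)

model = smf.ols("Log10Price ~ CaratWeight + ColorD + ColorG",
                data=diamonds).fit()
print(model.params)
# The intercept applies to color K diamonds; adding the ColorG or ColorD
# coefficient gives the parallel prediction lines for the other grades.
```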

19.2 Adjusting for Different Slopes—Interaction Terms

WHO Burger King menu items
WHAT Calories, Carbohydrates, Protein, Type of item
WHEN Current
WHERE Posted in stores and online
WHY Informing customers

Even consumers of fast food are increasingly concerned with nutrition. So, like many restaurant chains, Burger King publishes the nutrition details of its menu items on its website (www.bk.com). Many customers count calories or carbohydrates. Of course, these are likely to be related to each other. We can examine that relationship in Burger King foods by looking at a scatterplot.

Figure 19.4 Calories of Burger King foods plotted against Carbs seems to fan out.

Burger King
James McLamore and David Edgerton, graduates of the Cornell University School of Hotel Administration, opened their first Burger King (BK) restaurant in 1954. McLamore had been impressed with the assembly-line–based production innovated by Dick and Mac MacDonald. BK grew in the Miami area and then expanded nationally through franchising. Pillsbury acquired BK in 1967 and expanded it internationally. After a number of owners, BK was purchased by private investors and taken public. The initial public offering (IPO) in 2006 generated $425 million in revenue—the largest IPO of a U.S.-based restaurant chain on record.

It's not surprising to see that an increase in Carbs is associated with more Calories, but the plot seems to thicken as we move from left to right. Could there be something else going on?

We can divide Burger King foods into two groups, coloring those with meat (including chicken and fish) with red x's and those without meat with blue dots. Looking at the regressions for each group, we see a different picture:

Figure 19.5 Plotting the meat-based and non-meat items separately, we see two distinct linear patterns.

Clearly, meat-based items contribute more calories from their carbohydrate content than do other Burger King foods. But unlike Figure 19.3, where the lines were parallel, we can't account for the kind of difference we see here by just including a dummy variable in a regression. It isn't just the intercepts of the lines that are different; they have entirely different slopes. We'll start, as before, by constructing the indicator for the two groups, Meat, which is 1 for foods that contain meat and 0 for the others. The variable Meat can adjust the intercepts of the two lines. To adjust the slopes, we have to construct another variable—the product of Meat and the predictor variable Carbs. The coefficient of this interaction term in a multiple regression gives an adjustment to the slope for the cases in the indicated group. The resulting variable Carbs*Meat has the value of Carbs for foods containing meat (those coded 1 in the Meat indicator) and the value 0 for the others. By including this interaction variable in the model, we can adjust the slope of the line fit to the meat-containing foods. Here's the resulting analysis:

Dependent variable is: Calories
R-squared = 78.1%   R-squared (adjusted) = 75.7%
s = 106.0 with 32 - 4 = 28 degrees of freedom

Source       Sum of Squares  DF  Mean Square  F-ratio
Regression   1119979          3  373326       33.2
Residual      314843         28  11244.4

Variable     Coefficient  SE(Coeff)  t-ratio  P-value
Intercept    137.395      58.72       2.34    0.0267
Carbs(g)       3.93317     1.113      3.53    0.0014
Meat         -26.1567     98.48      -0.266   0.7925
Carbs*Meat     7.87530     2.179      3.61    0.0012

Table 19.5 The regression model with both an indicator variable and an interaction term.

What does the coefficient for the indicator Meat do? It provides a different intercept to separate the meat and non-meat items at the origin (where Carbs = 0). Each group has its own slope, but the two lines nearly meet at the origin, so there seems to be no need for an additional intercept adjustment. The difference of 26.16 calories is small. That's why the coefficient for the indicator variable Meat has a small t-ratio (-0.266). By contrast, the coefficient of the interaction term, Carbs*Meat, says that the slope relating calories to carbohydrates is steeper by 7.88 calories per carbohydrate gram for meat-containing foods than for meat-free foods. Its small P-value suggests that this difference is real. Overall, the regression model predicts calories to be: 137.40 + 3.93 Carbs - 26.16 Meat + 7.88 Carbs*Meat. Let's see how these adjustments work. A BK Whopper has 53 grams of carbohydrates and is a meat dish. The model predicts its Calories as: 137.40 + 3.93 × 53 - 26.16 × 1 + 7.88 × 53 × 1 = 737.2 calories, not far from the measured calorie count of 680. By contrast, the Veggie Burger, with 43 grams of carbohydrates, has value 0 for Meat and so has a value of 0 for Carbs*Meat as well. Those indicators contribute nothing to its predicted calories: 137.40 + 3.93 × 43 - 26.16 × 0 + 7.88 × 0 × 43 = 306.4 calories, close to the 330 measured officially.
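In software, the interaction column is just the product of the indicator and the quantitative predictor. A hedged sketch with statsmodels follows; the file burger_king.csv and its column names are assumptions for illustration. In a formula, Carbs * Meat expands automatically to Carbs + Meat + the Carbs:Meat product term.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names for the Burger King example.
bk = pd.read_csv("burger_king.csv")
bk["Meat"] = (bk["Type"] == "meat").astype(int)   # 1 = contains meat

# Carbs * Meat fits Carbs, Meat, and the Carbs:Meat interaction term,
# allowing a different slope as well as a different intercept for meat items.
model = smf.ols("Calories ~ Carbs * Meat", data=bk).fit()
print(model.summary())

# Predictions mirroring the hand calculations above: a meat item with
# 53 g of carbs and a meat-free item with 43 g of carbs.
new_items = pd.DataFrame({"Carbs": [53, 43], "Meat": [1, 0]})
print(model.predict(new_items))
```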

Adding interaction terms

After adding the indicator variables for Color, which showed different intercepts for the three different color levels, an analyst wonders if the slopes for the three color levels might be different as well, so she adds two more predictors, ColorD*Carat Weight and ColorG*Carat Weight to the model (see page 622). The regression output shows:

Response Variable: Log10Price
R2 = 85.77%   Adjusted R2 = 85.67%
s = 0.1085 with 749 - 6 = 743 degrees of freedom

Variable               Coeff     SE(Coeff)  t-ratio  P-value
Intercept              2.54151   0.06000    42.361   <0.0001
Carat.Weight           0.92968   0.05734    16.212   <0.0001
ColorD                 0.16466   0.06302     2.613    0.00916
ColorG                 0.12656   0.06287     2.013    0.04448
Carat.Weight*ColorD    0.14076   0.06356     2.215    0.02709
Carat.Weight*ColorG    0.08809   0.06095     1.445    0.14877

Question: Based on this, what model might you use to predict Log10Price?
Answer: The only significant interaction term is the one for Color D by Carat Weight, so we probably don't need different slopes for levels G and K. We should refit the regression without the interaction term ColorG*Carat Weight. When we do, we find that the term for Carat Weight*ColorD is marginally significant. The decision whether to include it is a judgment call. To make a final decision, we should also consider residual analysis and diagnostics. For the sake of simplicity, we will proceed with the simpler model with only the indicator variables and not the interaction (the model fit in the previous example).

19.3 Multiple Regression Diagnostics
We often use regression analyses to make important business decisions. By working with the data and creating models, we can learn a great deal about the relationships among variables. As we saw with simple regression, sometimes we can learn as much from the cases that don't fit the model as from the bulk of cases that do. Extraordinary cases often tell us more just by the ways in which they fail to conform and the reasons we can discover for those deviations. If a case doesn't conform to the others, we should identify it and, if possible, understand why it is different. In simple regression, a case can be extraordinary by standing away from the model in the y direction or by having unusual values in an x-variable. In multiple regression, it can also be extraordinary by having an unusual combination of values in the x-variables. Just as in simple regression, large deviations in the y direction show up in the residuals as outliers. Deviations in the x's show up as leverage.

Leverage
In a regression of a single predictor and a response, it's easy to see if a value has high leverage, because it's far from the mean of the x values in a scatterplot. In a multiple regression with k predictor variables, things are more complicated. A point might actually not be far from any of the x means and yet still exert large leverage because it has an unusual combination of predictor values. Even a graphics program designed to display points in high dimensional spaces may not make it obvious. Fortunately, we can calculate leverage values for each case. These are standard for most multiple regression programs.

The leverage of a case is easy to understand. For any case, add (or subtract) 1 to its y-value. Recompute the regression, and see how much the predicted value of the case changes. The amount of the change is the leverage. It can never be greater than 1 or less than 0. A point with zero leverage has no effect at all on the regression slope, although it does participate in the calculations of R2, s, and the F- and t-statistics. (The leverage of the ith point in a data set is often denoted by h_i.)

A point with high leverage may not actually influence the regression coefficients if it follows the pattern of the model set by the other points, but it's worth examining simply because of its potential to do so. Looking at leverage values can be an effective way to discover cases that are extraordinary on a combination of x-variables. In business, such cases often deserve special attention. There are no tests for whether the leverage of a case is too large. The average leverage value among all cases in a regression is 1/n, but that doesn't give us much of a guide. Some packages use rules of thumb to indicate high leverage values,4 but another common approach is to just make a histogram of the leverages. Any case whose leverage stands out in a histogram of leverages probably deserves special attention. You may decide to leave the case in the regression or to see how the regression model changes when you delete the case, but you should be aware of its potential to influence the regression.

We've already seen that the Duration of a roller coaster ride depends linearly on its Length. But even more than a long ride, roller coaster customers like speed. So, rather than predict the duration of a roller coaster ride, let's build a model for how fast it travels. A multiple regression with two predictors shows that both the total Height and the Drop (the maximum distance from the top to the bottom of the largest drop in the ride) are important factors:

Standard Error In some software regression tables, the standard deviation of the resid- uals, which we denote se , is called a “standard error.” In this book, “standard error” refers to the estimated standard deviation of a statistic, and we denote it SE (statistic). The standard deviation of the residuals is not a standard error. Together with the notational similarity of se and SE, that can be confusing. Don’t be confused.

Table 19.6 Excel regression output for the regression of Speed on Height and Drop. Both coefficients are highly significant.

4One common rule for determining when a leverage is large is to indicate any leverage value greater than 3(k + 1)/n, where k is the number of predictors.

The regression certainly seems reasonable. The R2 value is high, and the residual plot looks patternless:

Figure 19.6 An Excel scatterplot of residuals against predicted values shows nothing unusual for the regression of Speed on Height and Drop. Note that, although it is easy to make this plot in Excel, it is not one of the residual plots offered by default with Excel’s regression.

A histogram of the leverage values, however, shows something interesting:

Figure 19.7 The distribution of leverage values shows a few high values and one extraordinarily high-leverage point. The leverages were calculated in a statistics package. Excel does not offer leverages in its regression, but many add-ins do.

The case with very high leverage is a coaster called Oblivion, a steel roller coaster in England that opened as the world's first "vertical drop coaster" in 1998. What's unusual about Oblivion is that its Height is only about 65 feet above ground (placing it below the median), and yet it drops 180 feet to achieve a top speed of 68 mph. The unique feature of Oblivion is that it plunges underground nearly 120 feet. Leverage points can affect not only the coefficients of the model, but also our choice of whether to include a predictor in a regression model as well. The more complex the regression model, the more important it is to look at high-leverage values and their effects.
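Most statistics packages report leverages directly. As an illustration only, here is how the leverage values could be pulled from a fitted model in Python with statsmodels; the coasters.csv file and the Name column are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

coasters = pd.read_csv("coasters.csv")            # hypothetical file name
model = smf.ols("Speed ~ Height + Drop", data=coasters).fit()

# Leverage (hat) values, one per case.
leverages = model.get_influence().hat_matrix_diag

plt.hist(leverages, bins=20)
plt.xlabel("Leverage")
plt.ylabel("Frequency")
plt.show()

# The rule of thumb from the footnote: flag leverages above 3(k + 1)/n.
k, n = 2, len(coasters)
print(coasters.loc[leverages > 3 * (k + 1) / n, "Name"])
```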

Residuals and Standardized Residuals
Residuals are not all alike. Consider a point with leverage 1.0. That's the highest a leverage can be, and it means that the line follows the point perfectly. So, a point like that must have a zero residual. And since we know the residual exactly, that residual has zero standard deviation. This tendency is true in general: The larger the leverage, the smaller the standard deviation of its residual. When we want to compare values that have differing standard deviations, it's a good idea to standardize them. We can do that with the regression residuals, dividing each one by an estimate of its own standard deviation. When we do that, the resulting values follow a Student's t-distribution. In fact, these standardized residuals are called Studentized residuals.5 It's a good idea to examine the Studentized residuals (rather than the simple residuals) to check the Nearly Normal Condition and the Equal Spread Condition. Any Studentized residual that stands out from the others deserves your attention.

It may occur to you that we've always plotted the unstandardized residuals when we made regression models. We treated them as if they all had the same standard deviation when we checked the Nearly Normal Condition. It turns out that this was a simplification. It didn't matter much for simple regression, but for multiple regression models, it's a better idea to use the Studentized residuals when checking the Nearly Normal Condition and when making scatterplots of residuals against predicted values.
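If your package labels them differently, look for "externally Studentized" residuals. A brief sketch, under the same assumed coaster data as above:

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

coasters = pd.read_csv("coasters.csv")            # hypothetical file name
model = smf.ols("Speed ~ Height + Drop", data=coasters).fit()

# Externally Studentized residuals (each residual divided by its own
# estimated standard deviation, with that case left out of the estimate).
studentized = model.get_influence().resid_studentized_external

plt.scatter(model.fittedvalues, studentized)
plt.xlabel("Predicted values")
plt.ylabel("Studentized residuals")
plt.show()
```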

Influence Measures
A case that has both high leverage and a large Studentized residual is likely to have changed the regression model substantially all by itself. Such a case is said to be influential. An influential case cries out for special attention because removing it is likely to give a very different regression model. The surest way to tell whether a case is influential is to try leaving it out6 and see how much the regression model changes. You should call a case "influential" if omitting it changes the regression model by enough to matter for your purposes. To identify possibly influential cases, check the leverage and Studentized residuals. Two statistics that combine leverage and Studentized residuals into a single measure of influence, Cook's Distance (Cook's D) and DFFITs, are offered by many statistics programs. If either of these measures is unusually large for a case, that case should be checked as a possible influential point. Cook's D is found from the leverage, the residual, the number of predictors, and the residual standard error:

D_i = (e_i^2 / (k s_e^2)) × [h_i / (1 - h_i)^2]

A histogram of the Cook's Distances from the model in Table 19.6 shows a few influential values:

[Histogram of Cook's Distances for the model in Table 19.6; most values are near 0, with a few standing out above 0.1.]
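Cook's Distance and DFFITS are both available from the same influence object in statsmodels; this sketch, again under the assumed file and column names, lists the most influential coasters.

```python
import pandas as pd
import statsmodels.formula.api as smf

coasters = pd.read_csv("coasters.csv")            # hypothetical file name
model = smf.ols("Speed ~ Height + Drop", data=coasters).fit()

influence = model.get_influence()
cooks_d = influence.cooks_distance[0]   # first element holds the distances
dffits = influence.dffits[0]            # DFFITS values for each case

# Show the four cases with the largest Cook's Distances.
flagged = coasters.assign(CooksD=cooks_d, DFFITS=dffits)
print(flagged.nlargest(4, "CooksD")[["Name", "CooksD", "DFFITS"]])
```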

5There’s more than one way to Studentize residuals according to how you estimate the standard deviation of the residuals. You may find statistics packages referring to externally Studentized residuals and internally Studentized residuals. It is the externally Studentized version that follows a t-distribution, so those are the ones we recommend. 6Or, equivalently, include an indicator variable that selects only for that case. See the discussion in the next section. 630 CHAPTER 19 • Building Multiple Regression Models

Here are coasters with the four highest values of Cook’s D.

Name                         Type   Duration  Speed  Height  Drop  Length  Inversions  Cook's D
HyperSonic XLC               Steel  NA         80     165     133   1560    0           0.1037782
Oblivion                     Steel  NA         68      65     180   1222    0           0.1124114
Volcano, The Blast Coaster   Steel  NA         70     155      80   2757    1           0.3080218
Xcelerator                   Steel  62         82     205     130   2202    0           0.4319336

Table 19.7 Coasters with high Cook’s D

In addition to the Oblivion, Cook's Distance singles out three other coasters: HyperSonic XLC; Volcano, The Blast Coaster; and Xcelerator. A little research finds that these three coasters are different as well. Our model found that Height and Drop significantly influence a coaster's Speed. But these three coasters have something extra—a hydraulic catapult that accelerates the coasters more than gravity alone. In fact, the Xcelerator reaches 82 mph in 2.3 seconds, using only 157 feet of track to launch it. Removing these three accelerator coasters from the model has a striking effect.

Variable    Coeff      SE(Coeff)  t-ratio  P-value
Intercept   36.47453   1.06456    34.262   <0.0001
Drop         0.17519   0.01493    11.731   <0.0001
Height       0.01600   0.01507     1.062    0.292

s = 3.307 on 70 degrees of freedom
R-squared = 92.5%   Adjusted R-squared = 92.3%
F = 429.5 on 2 and 70 DF, P-value: < 0.0001

Table 19.8 Removing the three blast coasters has made Height no longer important to the model. A simple regression model, such as in Table 19.9, may be a more effective summary than the model with two predictor variables.

The Height of the coaster is no longer a statistically significant predictor, so we might choose to omit that variable.

Variable    Coeff        SE(Coeff)  t-ratio  P-value
Intercept   36.743925    1.034798   35.51    <0.0001
Drop         0.189474    0.006475   29.26    <0.0001

s = 3.31 on 71 degrees of freedom
R-squared = 92.3%   Adjusted R-squared = 92.2%
F = 856.3 on 1 and 71 DF, P-value: < 0.0001

Table 19.9 A simple linear regression model without the three blast coasters and with Height deleted.

Indicators for Influence
One good way to examine the effect of an extraordinary case on a regression is to construct a special indicator variable that is zero for all cases except the one we want to isolate. Including such an indicator in the regression model has the same effect as removing the case from the data, but it has two special advantages. First, it makes it clear to anyone looking at the regression model that we have treated that case specially. Second, the t-statistic for the indicator variable's coefficient can be used as a test of whether the case is influential. If the P-value is small, then that case really didn't fit well with the rest of the data. Typically, we name such an indicator with the identifier of the case we want to remove. Here's the last roller coaster model in which we have removed the influence of the three blast coasters by constructing indicators for them instead of by removing them from the data. Notice that the coefficient for Drop is just the same as the one we found by omitting the cases.

Dependent variable is: Speed
R-squared = 92.7%   R-squared (adjusted) = 92.3%
s = 3.310 with 76 - 5 = 71 degrees of freedom

Variable     Coeff       SE(Coeff)  t-ratio  P-value
Intercept    36.7439     1.035      35.5     <0.0001
Drop          0.189474   0.0065     29.3     <0.0001
Xcelerator   20.6244     3.334       6.19    <0.0001
HyperSonic   18.0560     3.334       5.42    <0.0001
Volcano      18.0981     3.361       5.38    <0.0001

Table 19.10 The P-values for the three indicator variables confirm that each of these roller coasters doesn't fit with the others.
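To reproduce this approach in software, give each extraordinary case its own 0/1 column. A sketch under the same assumed data layout as the earlier examples:

```python
import pandas as pd
import statsmodels.formula.api as smf

coasters = pd.read_csv("coasters.csv")            # hypothetical file name

# One indicator per blast coaster: 1 for that case, 0 for every other case.
coasters["Xcelerator"] = (coasters["Name"] == "Xcelerator").astype(int)
coasters["HyperSonic"] = (coasters["Name"] == "HyperSonic XLC").astype(int)
coasters["Volcano"] = (coasters["Name"] == "Volcano, The Blast Coaster").astype(int)

model = smf.ols("Speed ~ Drop + Xcelerator + HyperSonic + Volcano",
                data=coasters).fit()
print(model.summary())
# Each indicator's t-statistic tests whether that single case fits with
# the rest of the data, while keeping the case visible in the model.
```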

Diagnosis Wrapup
What have we learned from diagnosing the regression? We've discovered four roller coasters that may be strongly influencing the model. And for each of them, we've been able to understand why and how it differed from the others. The oddness of Oblivion plunging into a hole in the ground may cause us to value Drop as a predictor of Speed more than Height. The three influential cases with high Cook's D values turned out to be different from the other roller coasters because they are "blast coasters" that don't rely only on gravity for their acceleration. Although we can't count on always discovering why influential cases are special, diagnosing influential cases raises the question of what about them might be different and can help us understand our model better.

When a regression analysis has cases that have both high leverage and large Studentized residuals, it would be irresponsible to report only the regression on all the data. You should also compute and discuss the regression found with such cases removed, and discuss the extraordinary cases individually if they offer additional insight. If your interest is to understand the world, the extraordinary cases may tell you more than the rest of the model. If your only interest is in the model (for example, because you hope to use it for prediction), then you'll want to be certain that the model wasn't determined by only a few influential cases, but instead was built on the broader base of the body of your data.

Regression diagnostics

Two other measures of quality of a diamond (see page 622) are its Table size and Depth; both are expressed as percentages. Table size is the ratio of the flat top part (the table) diameter to the diameter of the stone. Depth is the ratio of the distance from table to bottom (culet) to the diameter. The output of the regression of Log10Price on Carat Weight, Color D, Color G, Table, and Depth shows:

Response Variable: Log10Price
R2 = 77.93%   Adjusted R2 = 77.84%
s = 0.1350 with 749 - 4 = 745 degrees of freedom

Variable     Coeff       SE(Coeff)  t-ratio  P-value
Intercept     3.464326   0.322336   10.748   <0.0001
Carat.Size    1.031627   0.015799   65.295   <0.0001
ColorD        0.291119   0.014214   20.481   <0.0001
ColorG        0.213842   0.012760   16.759   <0.0001
Table        -0.008497   0.002345   -3.623    0.000311
Depth        -0.008638   0.003765   -2.294    0.022043

Influence measures can be adversely affected by indicator variables, so Studentized residuals, Cook’s Distance, and leverage were all calculated on this model without the indicator variables for Color. Here are the histograms and boxplots for these three measures. Several points have unusually high values on one or more measures.

[Histograms and boxplots of the Studentized Residuals, Leverage Values, and Cook's Distances for this model.]

The analyst identified 5 diamonds that had high leverage (1) or Cook’s Distance (4). Two of these also had large studentized residuals. Indicator variables were created for these 5 points and a regression was run with these new variables:

Variable     Coeff     SE(Coeff)  t-ratio  P-value
Intercept     4.5658   0.3972     11.49    <0.0001
Carat Size    0.9125   0.0177     51.54    <0.0001
Depth        -0.0190   0.0047     -4.04    <0.0001
Table        -0.0111   0.0029     -3.84     0.0001
Diamond 1     0.0478   0.1343      0.36     0.7221
Diamond 2     0.0837   0.1344      0.62     0.5336
Diamond 3    -0.1332   0.1361     -0.98     0.3283
Diamond 4    -0.3205   0.1346     -2.38     0.0176
Diamond 5    -0.4284   0.1346     -3.18     0.0015

Question: What would you do next?
Answer: The indicator variables indicated that of the 5 suspected high influence points, only Diamonds 4 and 5 actually influence the regression. Run the regression again without these two points to see what coefficients change. If there is anything identifiably different about these two diamonds, set them aside or create a new variable for the newly identified characteristic (if possible) that distinguishes these diamonds.

19.4 Building Regression Models
When many possible predictors are available, we will naturally want to select only a few of them for a regression model. But which ones? The first and most important thing to realize is that often there is no such thing as the "best" regression model. In fact, no regression model is "right." Often, several alternative models may be useful or insightful. The "best" for one purpose may not be best for another, and the one with the highest R2 may not be best for many purposes. Multiple regressions are subtle. The coefficients often don't mean what they may appear to mean at first. The choice of which predictors to use determines almost everything about the regression. Predictors interact with each other, which complicates interpretation and understanding. So it is usually best to build a parsimonious model, using as few predictors as you can. On the other hand, we don't want to leave out predictors that are theoretically or practically important. Making this trade-off is the heart of the challenge of selecting a good model.7

7This trade-off is sometimes referred to as Occam's Razor after the medieval philosopher William of Occam.

The best regression models, in addition to satisfying the assumptions and conditions of multiple regression, have:
• Relatively few predictors, to keep the model simple.
• A relatively high R2, indicating that much of the variability in y is accounted for by the regression model.
• A relatively small value of se, the standard deviation of the residuals, indicating that the magnitude of the errors is small.
• Relatively small P-values for the F- and t-statistics, showing that the overall model is better than a simple summary with the mean and that the individual coefficients are reliably different from zero.
• No cases with extraordinarily high leverage that might dominate and alter the model.
• No cases with extraordinarily large residuals, and Studentized residuals that appear to be nearly Normal. Outliers can alter the model and certainly weaken the power of any test statistics, and the Nearly Normal Condition is required for inference.
• Predictors that are reliably measured and relatively unrelated to each other.

The term "relatively" in this list is meant to suggest that you should favor models with these attributes over others that satisfy them less, but of course, there are many trade-offs and no absolute rules. In addition to favoring predictors that can be measured reliably, you may want to favor those that are less expensive to measure, especially if your model is intended for prediction with values not yet measured.

It should be clear from this discussion that the selection of a regression model calls for judgment. This is yet another of those decisions in Statistics that just can't be made automatically. Indeed, it is one that we shouldn't want to make automatically; there are so many aspects of what makes a model useful that human judgment is necessary to make a final choice. Nevertheless, there are tools that can help by identifying potentially interesting models.

Best Subsets and Stepwise Regression How can we find the best multiple regression model? The list of desirable features we just looked at should make it clear that there is no simple definition of the “best” model. The choice of a multiple regression model always requires judgment to choose among potential models. Sometimes it can help to look at models that are “good” in some arbitrary sense to understand some possibilities, but such models should never be accepted blindly. If we choose a single criterion such as finding a model with the highest adjusted R2, then, for modest size data sets and a modest number of potential predictors, it is actually possible for computers to search through all possible models. The method is called a Best Subsets Regression. Often the computer reports a collection of “best” models: the best with three predictors, the best with four, and so on.8 Of course, as you add predictors, the R2 can never decrease, but the improvement may not justify the added complexity of the additional predictors. One criterion that might help is to use adjusted R2 . Best subsets regression programs usually offer a choice of criteria, and of course, different criteria usually lead to different “best” models. Although best subsets programs are quite clever about computing far fewer than all the possible alternative models, they can become overwhelmed by more than a

8Best subsets regressions don’t actually compute every regression. Instead, they cleverly exclude models they know to be worse than some they’ve already examined. Even so, there are limits to the size of the data and number of variables they can deal with comfortably. 634 CHAPTER 19 • Building Multiple Regression Models

few dozen possible predictors or very many cases. So, unfortunately, they aren't useful in many data mining applications. (We'll discuss those more in Chapter 25.)

Another alternative is to have the computer build a regression "stepwise." In a stepwise regression, at each step, a predictor is either added to or removed from the model. The predictor chosen to add is the one whose addition increases the adjusted R2 the most (or similarly improves some other measure such as an F-statistic). The predictor chosen to remove is the one whose removal reduces the adjusted R2 least (or similarly loses the least on some other measure). The hope is that by following this path, the computer can settle on a good model. The model will gain or lose a predictor only if that change in the model makes a big enough change in the performance measure. The changes stop when no more changes pass the criterion.

Best subsets and stepwise methods offer both a final model and information about the paths they followed. The intermediate stage models can raise interesting questions about the data and suggest relationships that you might not have thought about. Some programs offer the chance for you to make choices as the process progresses. By interacting with the process at each decision stage, you can exclude a variable that you judge inappropriate for the model (even if including it would help the statistic being optimized) or include a variable that wasn't the top choice at the next step, if you think it is important for your purposes. Don't let a variable that doesn't make sense enter the model just because it has a high correlation, but at the same time, don't exclude a predictor just because you didn't initially think it was important. (That would be a good way to make sure that you never learn anything new.) Finding the balance between these two choices underlies the art of successful model building and makes it challenging.

Unlike best subset methods, stepwise methods can work even when the number of potential predictors is so large that you can't examine them individually. In such cases, using a stepwise method can help you identify potentially interesting predictors, especially when you use it sequentially.

Both methods are powerful. But as with many powerful tools, they require care when you use them. You should be aware of what the automated methods fail to do: They don't check the Assumptions and Conditions. Some, such as the independence of the cases, you can check before performing the analyses. Others, such as the Linearity Condition and concerns over outliers and influential cases, must be checked for each model. There's a risk that automated methods will be influenced by nonlinearities, by outliers, by high leverage points, by clusters, and by the need for constructed dummy variables to account for subgroups.9 And these influences affect not just the coefficients in the final model, but the selection of the predictors themselves. If there is a case that is influential for even one possible multiple regression model, a best subsets search is guaranteed to consider that model (because it considers all possible models) and have its decision influenced by that one case.

• Choosing the Wrong "Best" Model. Here's a simple example of how stepwise and best subsets regressions can go astray. We might want to find a regression to model Horsepower in a sample of cars from the car's engine size (Displacement) and its Weight. The simple correlations are as follows:

              HP     Disp   Wt
Horsepower    1.000
Displacement  0.872  1.000
Weight        0.917  0.951  1.000

9This risk grows dramatically with larger and more complex data sets—just the kind of data for which these methods can be most helpful.

Because Weight has a slightly higher correlation with Horsepower, stepwise regression will choose it first. Then, because Weight and engine size (Displacement) are so highly correlated, once Weight is in the model, Displacement won't be added to the model. And a best subsets regression will prefer the regression on Weight because it has a higher R2 and adjusted R2. But Weight is, at best, a lurking variable leading to both the need for more horsepower and a larger engine. Don't try to tell an engineer that the best way to increase horsepower is to add weight to the car and that the engine size isn't important! From an engineering standpoint, Displacement is a far more appropriate predictor of Horsepower, but stepwise regression can't find that model.
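Statsmodels does not ship a stepwise routine, so the sketch below is a small forward-selection loop written only for illustration; it uses adjusted R2 as the criterion, and the cars.csv file with Horsepower, Weight, and Displacement columns is a hypothetical stand-in.

```python
import pandas as pd
import statsmodels.formula.api as smf

def forward_select(df, response, candidates):
    """Greedy forward selection using adjusted R-squared as the criterion."""
    chosen, best_adj_r2 = [], float("-inf")
    remaining = list(candidates)
    improved = True
    while improved and remaining:
        improved = False
        scores = []
        for var in remaining:
            formula = response + " ~ " + " + ".join(chosen + [var])
            scores.append((smf.ols(formula, data=df).fit().rsquared_adj, var))
        top_adj_r2, top_var = max(scores)
        if top_adj_r2 > best_adj_r2:
            chosen.append(top_var)
            remaining.remove(top_var)
            best_adj_r2 = top_adj_r2
            improved = True
    return chosen, best_adj_r2

cars = pd.read_csv("cars.csv")   # hypothetical file name
print(forward_select(cars, "Horsepower", ["Weight", "Displacement"]))
# With two predictors this highly correlated, the greedy search typically
# picks Weight first and then sees little or no gain from Displacement,
# which is the trap described above.
```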

Challenges in Building Regression Models
The data set used to construct the regression in the Guided Example of Chapter 18 originally contained more than 100 variables on more than 10,000 houses in the upstate New York area. Part of the challenge in constructing models is simply preparing the data for analysis. A simple scatterplot can often reveal a data value mistakenly coded, but with hundreds of potential variables, the task of checking the data for accuracy, missing values, consistency, and reasonableness can become the major part of the effort. We will return to this issue when we discuss data mining in Chapter 25.

Another challenge in building large models is Type I error. Although we've warned against using 0.05 as an unquestioned guide to statistical significance, we have to start somewhere, and this critical value is often used to test whether a variable can enter (or leave) a regression model. Of course, using 0.05 means that about 1 in 20 times, a variable whose contribution to the model may be negligible will appear as significant. Using something more stringent than 0.05 means that potentially valuable variables may be overlooked. Whenever we use automatic methods (stepwise, best subsets, or others), the actual number of different models considered becomes huge, and the probability of a Type I error grows with it. There is no easy remedy for this problem. Building a model that includes predictors that actually contribute to reducing the variation of the response and avoiding predictors that simply add noise to the predictions is the challenge of modern model building. Much current research is devoted to criteria and automatic methods to make this search easier and more reliable, but for the foreseeable future, you will need to use your own judgment and wisdom in addition to your statistical knowledge to build sensible useful regressions.

Stepwise regression

In addition to the predictor variables from the last example (page 631) we also have information on the quality of the Cut (4 grades: Good, Very Good, Excellent, and Ideal) and the quality of the Clarity (7 grades: VVS2, VVS1, VS2, VS1, SI2, SI1, and IF). A model was fit to predict Log10Price from all the predictors we have available: Carat Weight, Color, Cut, Clarity, Depth, and Table on all 749 diamonds. A backward elimination of terms was performed, and the following model was selected:

Response Variable: Log10Price
R2 = 94.46%   Adjusted R2 = 94.37%
s = 0.06806 with 749 - 13 = 736 degrees of freedom

Variable       Coeff       SE(Coeff)  t-ratio   P-value
Intercept       2.437903   0.014834   164.349   <0.0001
Carat.Weight    1.200301   0.011365   105.611   <0.0001
ColorD          0.342888   0.009107    37.649   <0.0001
ColorG          0.254024   0.008159    31.135   <0.0001
CutGood        -0.028884   0.011133    -2.594    0.00966
CutVeryGood    -0.015374   0.005503    -2.793    0.00535
CutIdeal        0.012127   0.010022     1.210    0.22662
ClaritySI1     -0.237850   0.011736   -20.267   <0.0001
ClaritySI2     -0.312076   0.012454   -25.058   <0.0001
ClarityVS1     -0.130431   0.012118   -10.764   <0.0001
ClarityVS2     -0.178439   0.011684   -15.273   <0.0001
ClarityVVS1    -0.064518   0.012445    -5.184   <0.0001
ClarityVVS2    -0.076747   0.011878    -6.461   <0.0001

Question: Compare this model with the model based only on Carat Weight and Color from the first example with indicator variables for Color.
Answer: This new model has several advantages over the simpler model. First, the residual standard error is now 0.068 log10Price, a decrease from 0.109 log10Price. (We can transform that back to dollars as 10^0.068 = $1.17—quite a small standard deviation.) Correspondingly, the R2 is now 94.46% compared to 85.66%. Nearly all the terms are highly statistically significant. The one exception is the indicator for the Ideal level of Cut. We could consider omitting this indicator (which would then combine the levels Excellent and Ideal), but for simplicity we will leave it as is. Before deploying this model, however, we should look at all the assumptions and conditions, which we will do in the final example of this chapter.

Housing Prices

Let's return to the upstate New York data set to predict house prices, this time using a sample of 1734 houses and 16 variables. The variables available include:

Price: The price of the house as sold in 2002
Lot Size: The size of the land in acres
Waterfront: An indicator variable coded as 1 if the property contains waterfront, 0 otherwise
Age: The age of the house in years
Land Value: The assessed value of the property without the structures
New Construct: An indicator variable coded as 1 if the house is new construction, 0 otherwise
Central Air: An indicator variable coded as 1 if the house has central air conditioning, 0 otherwise
Fuel Type: A categorical variable describing the main type of fuel used to heat the house: 1 = None; 2 = Gas; 3 = Electric; 4 = Oil; 5 = Wood; 6 = Solar; 7 = Unknown/Other
Heat Type: A categorical variable describing the heating system of the house: 1 = None; 2 = Forced Hot Air; 3 = Hot Water; 4 = Electric
Sewer Type: A categorical variable describing the sewer system of the house: 1 = None/Unknown; 2 = Private (Septic System); 3 = Commercial/Public
Living Area: The size of the living area of the house in square feet
Pct College: The percent of the residents of the zip code that attended four-year college (from the U.S. Census Bureau)
Full Baths: The number of full bathrooms
Half Baths: The number of half bathrooms
Bedrooms: The number of bedrooms
Fireplaces: The number of fireplaces

PLAN

Setup   State the objective of the study. Identify the variables.

We want to build a model to predict house prices for a region in upstate New York. We have data on Price ($), and 15 potential predictor variables selected from a much larger list.

Model   Think about the assumptions and check the conditions. A scatterplot matrix is a good way to examine the relationships for the quantitative variables.

✓ Linearity Condition. To fit a regression model, we first require linearity. (Scatterplots of Price against Living Area, Age, Bedrooms, Bathrooms, and Fireplaces are similar to the plots shown in the regression of Chapter 18 and are not shown here.)
✓ Independence Assumption. We can regard the errors as being independent of one another since the houses themselves were a random sample.
✓ Randomization Condition. These 1728 houses are a random sample of a much larger set. That supports the idea that the errors are independent.

To check equal variance and Normality, we usually find a regression and examine the residuals. Linearity is all we need for that.

Remarks   Because there are 7 levels of Fuel Type, we expect to need 6 indicator variables. We can use None as the baseline. Fuel2 has the value 1 for Gas and 0 for all other types, Fuel3 has value 1 for Electric and 0 for all other types. As the bar chart shows, only 6 houses have values other than 2, 3, and 4. So we can omit those cases and use just 2 indicators, Fuel2 and Fuel3, indicating Gas and Electric respectively, leaving Oil as the baseline. The 6 houses (out of over 1700) set aside have a very small impact on the model. Two of those had unknown Heat Type.

Examination of Fuel Type showed that there were only 6 houses that did not have categories 2, 3, or 4.

[Bar chart of house counts by Fuel Type (1-6): nearly all houses fall in categories 2, 3, and 4.]

We decided to set these 6 houses aside, leaving three categories, so we can use two dummy variables for each. We combined the two Bathroom variables into a single variable Bathrooms equal to the sum of the full baths plus 0.5*half baths. We now have 17 potential predictor variables.
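As a concrete illustration of the recoding described above, here is a minimal data-preparation sketch in Python. The file name and column names (fuel_type, full_baths, half_baths) are assumptions, not the authors' actual variable names.

```python
import pandas as pd

houses = pd.read_csv("saratoga_houses.csv")            # hypothetical file

# Set aside the handful of houses whose fuel type is not Gas (2), Electric (3), or Oil (4).
houses = houses[houses["fuel_type"].isin([2, 3, 4])].copy()

# Two indicators for Fuel Type, leaving Oil as the baseline.
houses["fuel2"] = (houses["fuel_type"] == 2).astype(int)   # Gas
houses["fuel3"] = (houses["fuel_type"] == 3).astype(int)   # Electric

# Combine the two bathroom counts into a single predictor.
houses["bathrooms"] = houses["full_baths"] + 0.5 * houses["half_baths"]
```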

We started by fitting a model to all of them.

Dependent variable is: Price
R-squared = 65.1%   R-squared (adjusted) = 64.8%
s = 58408 with 1728 - 18 = 1710 degrees of freedom

Variable        Coeff      SE(Coeff)  t-ratio  P-value
Intercept       18794.0    23333       0.805    0.4207
Lot.Size        7774.34    2246        3.46     0.0006
Waterfront      119046     15577       7.64    <0.0001
Age             -131.642   58.54      -2.25     0.0246
Land.Value      0.9258     0.048       19.4    <0.0001
New.Construct   -45234.8   7326       -6.17    <0.0001
Central.Air     9864.30    3487        2.83     0.0047
Fuel Type[2]    4225.35    5027        0.840    0.4008
Fuel Type[3]    -8148.11   12906      -0.631    0.5279
Heat Type[2]    -1185.54   12345      -0.096    0.9235
Heat Type[3]    -11974.4   12866      -0.931    0.3521
Sewer Type[2]   4051.84    17110       0.237    0.8128
Sewer Type[3]   5571.89    17165       0.325    0.7455
Living.Area     75.769     4.24        17.9    <0.0001
Pct.College     -112.405   151.9      -0.740    0.4593
Bedrooms        -4963.36   2405       -2.06     0.0392
Fireplaces      768.058    2992        0.257    0.7975
Bathrooms       23077.4    3378        6.83    <0.0001

✓ Equal Spread Condition. A scatterplot of the Studentized residuals against predicted values shows no thickening or other patterns. There is a group of homes whose residuals are larger (both negative and positive) than those of the vast majority, with Studentized residual values larger than 3 or 4 in absolute value. We'll revisit them after we've selected our model.

[Scatterplot of Studentized Residuals against Predicted Price ($000): no thickening or other patterns, with a few residuals beyond 3 or 4 in absolute value.]

We need the Nearly Normal Condition only if we want to do inference and the sample size is not large. If the sample size is large, we need the distribution to be Normal only if we plan to produce prediction intervals.

✓ Nearly Normal Condition, Outlier Condition. The histogram of residuals is unimodal and symmetric, but slightly long tailed.

[Histogram of residuals: unimodal and symmetric, slightly long tailed.]

Under these conditions, we can proceed to search for a suitable multiple regression model using a subset of the predictors. We will return to some of our concerns in the discussion.

DO

Mechanics   We first let the stepwise program proceed backward from the full model on all 17 predictors.

Here is the computer output for the multiple regression, starting with all 17 predictor variables and proceeding backward until no more candidates were nominated for exclusion.

Dependent variable is: Price
R-squared = 65.1%   R-squared (adjusted) = 64.9%
s = 58345 with 1728 - 12 = 1716 degrees of freedom

Variable        Coeff      SE(Coeff)  t-ratio  P-value
Intercept       9643.14    6546        1.47     0.1409
Lot.Size        7580.42    2049        3.70     0.0002
Waterfront      119372     15365       7.77    <0.0001
Age             -139.704   57.18      -2.44     0.0147
Land.Value      0.921838   0.0463      19.9    <0.0001
New.Construct   -44172.8   7159       -6.17    <0.0001
Central.Air     9501.81    3402        2.79     0.0053
Heat Type[2]    10099.9    4048        2.50     0.0127
Heat Type[3]    -791.243   5215       -0.152    0.8794
Living.Area     75.9000    4.124       18.4    <0.0001
Bedrooms        -4843.89   2387       -2.03     0.0426
Bathrooms       23041.0    3333        6.91    <0.0001

The estimated equation is:

Price = 9643.14 + 7580.42 LotSize + 119,372 Waterfront - 139.70 Age + 0.922 LandValue - 44,172.8 NewConstruction + 9501.81 CentralAir + 10,099.9 HeatType2 - 791.24 HeatType3 + 75.90 LivingArea - 4843.89 Bedrooms + 23,041 Bathrooms

Nearly all of the P-values are small, which indicates that even with 11 predictors in the model, most are contributing. The exception is the indicator for Heat Type 3, which is Hot Water. We could consider dropping this indicator and combining those houses with those that have levels 1 (None) and 4 (Electric). However, for simplicity we have left the model as the stepwise search found it. The R2 value of 65.1% indicates that more than 65% of the overall variation in house prices has been accounted for by this model, and the fact that the adjusted R2 has actually increased suggests that we haven't removed any important predictors from the model. The residual standard error of $58,345 gives us a rough indication that we can predict the price of a home to within about 2 * $58,345 = $116,690. If that's close enough to be useful, then our model is potentially useful as a price guide.

Remarks   We also tried running the stepwise regression program forward and obtained the same model. There are some houses that have large Studentized residual values and some that have somewhat large leverage, but omitting them from the model did not significantly change the coefficients. We can use this model as a starting basis for pricing homes in the area.
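The backward search itself is easy to sketch. The following Python function is our own illustration of the idea, not the particular stepwise algorithm used by the text's software: it repeatedly drops the predictor with the largest p-value until every remaining predictor clears a chosen threshold. It assumes each predictor is a single numeric or 0/1 indicator column.

```python
import statsmodels.formula.api as smf

def backward_eliminate(data, response, predictors, alpha=0.05):
    """Drop the weakest predictor (largest p-value) one at a time until all
    remaining predictors have p-values at or below alpha."""
    preds = list(predictors)
    model = None
    while preds:
        model = smf.ols(f"{response} ~ " + " + ".join(preds), data=data).fit()
        pvals = model.pvalues.drop("Intercept")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            break                      # everything remaining is "significant"
        preds.remove(worst)            # drop the weakest term and refit
    return model if preds else None

# Hypothetical call, assuming a prepared `houses` DataFrame with these columns:
# model = backward_eliminate(houses, "price",
#                            ["lot_size", "waterfront", "age", "land_value",
#                             "new_construct", "central_air", "fuel2", "fuel3",
#                             "heat2", "heat3", "sewer2", "sewer3", "living_area",
#                             "pct_college", "bedrooms", "fireplaces", "bathrooms"])
```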

REPORT

Summary and Conclusions   Summarize your results and state any limitations of your model in the context of your original objectives.

MEMO
Re: Regression Analysis of Home Price Predictions

A regression model of Price on 11 predictors accounts for about 65% of the variation in the price of homes in these data from upstate New York. Tests of each coefficient show that each of these variables appears to be an aspect of the price of a house.

This model reflects the common wisdom in real estate about the importance of various aspects of a home. An important variable not included is the location, which every real estate agent knows is crucial to pricing a house. This is ameliorated by the fact that all these houses are in the same general area. However, knowing more specific information about where they are located would probably improve the model. The price found from this model can be used as a starting point for comparing a home with comparable homes in the area.

As with all multiple regression coefficients, when we interpret the effect of a predictor, we must take account of all the other predictors and be careful not to suggest a causal relationship between the predictor and the response variable. Here are some interesting features of the model. It appears that among houses with the same values of the other variables, those with waterfront access are worth on average about $119,000 more. Among houses with the same values of the other variables, those with more bedrooms have lower sales prices by, on average, $4800 for each bedroom, while among those with the same values of the other variables, those with more bathrooms have higher prices by, on average, $23,000 per bathroom. Not surprisingly, the value of the land is positively associated with the sale price, accounting for, on average, about $0.92 of the sale price for each $1 of assessed land value among houses that are otherwise alike on the other variables.

This model reflects the prices of 1728 homes in a random sample of homes taken in this area.

19.5 Collinearity

Anyone who has shopped for a home knows that houses with more rooms generally cost more than houses with fewer rooms. A simple regression of Price on Rooms for the New York houses shows:

Variable   Coeff     SE(Coeff)  t-ratio  P-value
Intercept  53015.6   6424.3     8.252   <0.0001
Rooms      22572.9   866.7      26.046  <0.0001

An additional room seems to be “worth” about $22,600 on average to these homes. We can also see that the Living Area of a house is an important predictor. Each square foot is worth about $113.12:

Variable     Coeff       SE(Coeff)  t-ratio  P-value
Intercept    13439.394   4992.353   2.692    0.00717
Living Area  113.123     2.682      42.173  <0.0001

Finally, the Price increases with the number of Bedrooms, with each Bedroom worth, on average, $48,218:

Variable   Coeff   SE(Coeff)  t-ratio  P-value
Intercept  59863   8657       6.915   <0.0001
Bedrooms   48218   2656       18.151  <0.0001

But, when we put more than one of these variables into a regression equation simultaneously, things can change. Here’s a regression with both Living Area and Bedrooms:

Variable     Coeff        SE(Coeff)  t-ratio  P-value
Intercept    36667.895    6610.293   5.547   <0.0001
Living Area  125.405      3.527      35.555  <0.0001
Bedrooms     -14196.769   2675.159   -5.307  <0.0001

Now, it appears that each bedroom is associated with a lower sale Price. This type of coefficient change often happens in multiple regression and can seem counterintuitive. When two predictor variables are correlated, their coefficients in a multiple regression (with both of them present) can be quite different from their simple regression slopes. In fact, the coefficient can change from being significantly positive to significantly negative with the inclusion of one correlated predictor, as is the case here with Bedrooms and Living Area. The problem arises when one of the predictor variables can be predicted well from the others. This phenomenon is called collinearity.10

Collinearity in the predictors can have other consequences in a multiple regression. If instead of adding Bedrooms to the model, we add Rooms, we see a different outcome:

Variable     Coeff       SE(Coeff)  t-ratio  P-value
Intercept    11691.586   5521.253   2.118    0.0344
Living Area  110.974     3.948      28.109  <0.0001
Rooms        783.579     1056.568   0.742    0.4584

What the Coefficients Tell Us When Predictors Are Correlated
Sometimes we can understand what the coefficients are telling us even in such paradoxical situations. Here, it seems that a house that allocates more of its living area to bedrooms (and correspondingly less to other areas) will be worth less. In the second example, we see that more rooms don't make a house worth more if they just carve up the existing living area. The value of more rooms we saw before was probably because houses with more rooms tend to have more living area as well.

10 You may also see this problem called "multicollinearity."

The coefficient for Living Area has hardly changed at all. It still shows that a house costs about $111 per square foot, but now the coefficient for Rooms is indistinguishable from 0. With the addition of Living Area to the model, the coefficient for Rooms changed from having a t-statistic over 25, with a very small P-value (in the simple regression), to having a P-value of 0.458. Notice also that the standard errors of the coefficients have increased. The standard error of Living Area increased from 2.68 to 3.95. That may not seem like much, but it's an increase of nearly 50%. This variance inflation of the coefficients is another consequence of collinearity. The stronger the correlation between predictors, the more the variance of their coefficients increases when both are included in the model. Sometimes this effect can change a coefficient from statistically significant to indistinguishable from zero.

Data sets in business often have related predictor variables. General economic variables, such as interest rates, unemployment rates, GDP, and other productivity measures are highly correlated. The choice of which predictors to include in the model can significantly change the coefficients, their standard errors, and their P-values, making both selecting regression models and interpreting them difficult.

How can we detect and deal with collinearity? Let's look at a regression among just the predictor variables. If we regress Rooms on Bedrooms and Living Area, we find:

Variable     Coeff  SE(Coeff)  t-ratio  P-value
Intercept    0.680  0.141      4.821   <0.0001
Bedrooms     0.948  0.057      16.614  <0.0001
Living Area  0.009  0.000      25.543  <0.0001

s = 1.462 on 1725 degrees of freedom
R-squared = 60.2%   Adjusted R-squared = 60.15%
F = 1304 on 2 and 1725 DF, P-value: < 0.0001

Look at the R² for that regression. What does it tell us? Since R² is the fraction of variability accounted for by the regression, in this case, that's the fraction of the variability in Rooms accounted for by the other two predictors. Now we can be precise about collinearity. If that R² were 100%, we'd have perfect collinearity. Rooms would then be perfectly predictable from the other two predictors and so could tell us nothing new about Price because it didn't vary in any way not already accounted for by the predictors already in the model. In fact, we couldn't even perform the calculation. Its coefficient would be indeterminate, and its standard error would be infinite. (Statistics packages usually print warnings when this happens.11) Conversely, if the R² were 0%, then Rooms would bring entirely new information to the model, and we'd have no collinearity at all.

Clearly, there's a range of possible collinearities for each predictor. The statistic that measures the degree of collinearity of the jth predictor with the others is called the Variance Inflation Factor (VIF) and is found as:

VIF_j = 1 / (1 - Rj²)

The Rj² here shows how well the jth predictor can be predicted by the other predictors. The 1 - Rj² term measures what that predictor has left to bring to the regression model. If Rj² is high, then not only is that predictor superfluous, but it can damage the regression model. The VIF tells how much the variance of the coefficient has been inflated due to collinearity. The higher the VIF, the higher the standard error of its coefficient and the less it can contribute to the regression model. Since Rj² can't be less than zero, the minimum value of the VIF is 1.0.

The VIF takes into account the influence of all the other predictors, and that's important. You can't judge whether you have a collinearity problem simply by looking at the correlations among the predictors because those only consider each "pair" of predictors.

As a final blow, when a predictor is collinear with the other predictors, it's often difficult to figure out what its coefficient means in the multiple regression. We've blithely talked about "removing the effects of the other predictors," but now when we do that, there may not be much left. What is left is not likely to be about the original predictor, but more about the fractional part of that predictor not associated with the others. In a regression of Horsepower on Weight and Engine Size, once we've removed the effect of Weight on Horsepower, Engine Size doesn't tell us anything more about Horsepower. That's certainly not the same as saying that Engine Size doesn't tell us anything at all about Horsepower. It's just that most cars with big engines also weigh a lot. So Engine Size may be telling us mostly about sporty cars that have larger engines than expected for their weight.

To summarize, when a predictor is collinear with the other predictors in the model, two things can happen:
1. Its coefficient can be surprising, taking on an unanticipated sign or being unexpectedly large or small.
2. The standard error of its coefficient can be large, leading to a smaller t-statistic and correspondingly large P-value.

One telltale sign of collinearity is the paradoxical situation in which the overall F-test for the multiple regression model is significant, showing that at least one of the coefficients is significantly different from zero, and yet all of the individual coefficients have small t-values, each, in effect, denying that it is the significant one.

What should you do about a collinear regression model? The simplest cure is to remove some of the predictors. That both simplifies the model and generally improves the t-statistics. If several predictors give pretty much the same information, removing some of them won't hurt the model's ability to describe the response variable. Which should you remove? Keep the predictors that are most reliably measured, least expensive to find, or even those that are politically important. Another alternative that may make sense is to construct a new predictor by combining variables. For example, several different measures of a product's durability (perhaps for different parts of it) could be added together to create a single durability measure.

11 Excel does not. It gives 0 as the estimate of most values and a NUM! warning for the standard error of the coefficient.

Facts about Collinearity
• The collinearity of any predictor with the others in the model can be measured with its Variance Inflation Factor.
• High collinearity leads to the coefficient being poorly estimated and having a large standard error (and correspondingly low t-statistic). The coefficient may seem to be the wrong size or even the wrong sign.
• Consequently, if a multiple regression model has a high R² and large F, but the individual t-statistics are not significant, you should suspect collinearity.
• Collinearity is measured in terms of the Rj² between a predictor and all of the other predictors in the model. It is not measured in terms of the correlation between any two predictors. Of course, if two predictors are highly correlated, then the Rj² with even more predictors must be at least that large and will usually be even higher.
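The VIF can be computed directly from its definition. Here is a small sketch in Python; the file name and column names (rooms, bedrooms, living_area) are assumptions about the data layout, not the textbook's output.

```python
import pandas as pd
import statsmodels.formula.api as smf

def vif(data, predictor, others):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    predictor j on the other predictors."""
    r2 = smf.ols(f"{predictor} ~ " + " + ".join(others), data=data).fit().rsquared
    return 1.0 / (1.0 - r2)

houses = pd.read_csv("saratoga_houses.csv")            # hypothetical file
predictors = ["rooms", "bedrooms", "living_area"]
for p in predictors:
    others = [q for q in predictors if q != p]
    print(p, round(vif(houses, p, others), 2))

# For Rooms, an R_j^2 of 60.2% (as in the output above) would give
# VIF = 1 / (1 - 0.602), or about 2.5.
```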

1. Give two ways that we use histograms to support the construction, inference, and understanding of multiple regression models.
2. Give two ways that we use scatterplots to support the construction, inference, and understanding of multiple regression models.
3. What role does the Normal model play in the construction, inference, and understanding of multiple regression models?

19.6 Quadratic Terms

WHO     Downhill racers in the 2002 Winter Olympics
WHAT    Time, Starting Position
WHEN    2002
WHERE   Salt Lake City

After the women's downhill ski racing event at the 2002 Winter Olympic Games in Salt Lake City, Picabo Street of the U.S. team was disappointed with her 16th place finish after she'd posted the fastest practice time. Changing snow conditions can affect finish times, and in fact, the top seeds can choose their starting positions and try to guess when the conditions will be best. But how much impact was there? On the day of the women's downhill race, it was unusually sunny. Skiers expect conditions to improve and then, as the day wears on, to deteriorate, so they try to pick the optimum time. But their calculations were upset by a two-hour delay. Picabo Street chose to race in 26th position. By then conditions had turned around, and the slopes had begun to deteriorate. Was that the reason for her disappointing finish? The regression in Table 19.11 seems to support her point. Times did get slower as the day wore on.

Dependent variable is: Time
R-squared = 37.9%   R-squared (adjusted) = 36.0%
s = 1.577 with 35 - 2 = 33 degrees of freedom

Variable    Coeff     SE(Coeff)  t-ratio  P-value
Intercept   100.069   0.5597     179     <0.0001
StartOrder  0.108563  0.0242     4.49    <0.0001

Table 19.11  Time to ski the women's downhill event at the 2002 Winter Olympics depended on starting position.

But a plot of the residuals warns us that the linearity assumption isn’t met.

[Scatterplot of Residuals against Predicted values]
Figure 19.8  The residuals reveal a bend.

If we plot the data, we can see that re-expression can’t help us because as the skiers expected, the times first trended down and then turned around and increased.

[Scatterplot of Time against Start Order]
Figure 19.9  The original data trend down and then up. That kind of bend can't be improved with re-expression.

How can we use regression here? We can introduce a squared term to the model:

Time = b0 + b1 StartOrder + b2 StartOrder²

The fitted function is a quadratic, which can follow bends like the one in these data. Table 19.12 has the regression table.

Dependent variable is: Time
R-squared = 83.3%   R-squared (adjusted) = 82.3%
s = 0.8300 with 35 - 3 = 32 degrees of freedom

Source      Sum of Squares  df  Mean Square  F-ratio
Regression  110.139         2   55.0694      79.9
Residual    22.0439         32  0.688871

Variable     Coeff      SE(Coeff)  t-ratio  P-value
Intercept    103.547    0.4749     218     <0.0001
StartOrder   -0.367408  0.0525     -6.99   <0.0001
StartOrder2  0.011592   0.0012     9.34    <0.0001

Table 19.12  A regression model with a quadratic term fits these data better.

This model fits the data better. Adjusted R2 is 82.3%, up from 36.0% for the linear version. And the residuals look generally unstructured, as Figure 19.10 shows.

[Scatterplot of Residuals against Predicted values]
Figure 19.10  The residuals from the quadratic model show no structure.

However, one problem remains. In the new model, the coefficient of Start Order has changed from significant and positive to significant and negative. As we've just seen, that's a signal of possible collinearity. Quadratic models often have collinearity problems because for many variables, x and x² are highly correlated. In these data, Start Order and Start Order² have a correlation of 0.97.

There's a simple fix for this problem. Instead of using Start Order², we can use (Start Order - mean Start Order)². The form with the mean subtracted has a zero correlation with the linear term. Table 19.13 shows the resulting regression.

Dependent variable is: Time
R-squared = 83.3%   R-squared (adjusted) = 82.3%
s = 0.8300 with 35 - 3 = 32 degrees of freedom

Source      Sum of Squares  df  Mean Square  F-ratio
Regression  110.139         2   55.0694      79.9
Residual    22.0439         32  0.688871

Variable      Coeff     SE(Coeff)  t-ratio  P-value
Intercept     98.7493   0.3267     302     <0.0001
StartOrder    0.104239  0.0127     8.18    <0.0001
(SO - mean)²  0.011592  0.0012     9.34    <0.0001

Table 19.13  Using a centered quadratic term alleviates the collinearity.

The predicted values and residuals are the same for these two models,12 but the coefficients of the second one are easier to interpret. So, did Picabo Street have a valid complaint? Well, times did increase with start order, but (from the quadratic term), they decreased before they turned around and increased. Picabo’s start order of 26 has a predicted time of 101.83 seconds. Her performance at 101.17 was better than predicted, but her residual of -0.66 is not large in magnitude compared to that for some of the other skiers. Skiers who had much later starting positions were disadvantaged, but Picabo’s start position was only slightly later than the best possible one (about 16th according to this model), and her performance was not extraordinary by Olympic standards. One final note: Quadratic models can do an excellent job of fitting curved patterns such as this one. But they are particularly dangerous to use for extrapolating beyond the range of the x-values. So you should use them with care.
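Here is a brief sketch in Python of fitting the quadratic with a mean-centered squared term; the file and column names (time, start_order) are assumptions about the data layout, not the published analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

ski = pd.read_csv("downhill_2002.csv")                 # hypothetical file
ski["start_sq_c"] = (ski["start_order"] - ski["start_order"].mean()) ** 2

# The raw squared term is highly correlated with the linear term; the
# centered version is not (exactly zero for equally spaced start orders).
print(np.corrcoef(ski["start_order"], ski["start_order"] ** 2)[0, 1])
print(np.corrcoef(ski["start_order"], ski["start_sq_c"])[0, 1])

quad = smf.ols("time ~ start_order + start_sq_c", data=ski).fit()
print(quad.params)
```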

Quadratic terms

We have fit a model of Log10Price to Carat Weight, Color, Cut, and Clarity. (See page 635.) The R² is an impressive 94.46%. The 749 diamonds are a random sample of diamonds of three Color levels and of Weight between 0.3 and 1.5 carats. We transformed Price using the log (base 10) to linearize the relationship with Carat Weight. However, a plot of residuals vs. predicted values reveals:

[Scatterplot of Log10Price residuals against Log10Price predicted values]

There appears to be a curved relationship between residuals and predicted values. What happens if we add (Carat Weight)² as a predictor?

12 This can be shown algebraically for any quadratic model with a centered squared term.

Here’s the regression output:

Response Variable: Log10Price
R2 = 97.31%   Adjusted R2 = 97.27%
s = 0.04741 with 749 - 14 = 735 degrees of freedom

Variable       Coeff      SE(Coeff)  t-ratio   P-value
Intercept       2.040547  0.017570   116.138  <0.0001
Carat.Weight    2.365036  0.042400    55.780  <0.0001
Carat.Weight2  -0.699846  0.025028   -27.962  <0.0001
ColorD          0.347479  0.006346    54.755  <0.0001
ColorG          0.252336  0.005683    44.398  <0.0001
CutGood        -0.038896  0.007763    -5.010  <0.0001
CutIdeal        0.016165  0.006982     2.315   0.020877
CutVeryGood    -0.014988  0.003834    -3.910   0.000101
ClaritySI1     -0.286641  0.008359   -34.292  <0.0001
ClaritySI2     -0.353371  0.008800   -40.155  <0.0001
ClarityVS1     -0.161288  0.008513   -18.947  <0.0001
ClarityVS2     -0.215559  0.008246   -26.141  <0.0001
ClarityVVS1    -0.077703  0.008682    -8.950  <0.0001
ClarityVVS2    -0.103078  0.008327   -12.378  <0.0001

The plot of residuals against predicted values shows:

[Scatterplot of Log10Price residuals against Log10Price predicted values]

A histogram, boxplot, and Normal probability plot of the residuals show:

[Histogram, boxplot, and Normal probability plot of the residuals]

Question: Summarize this regression and comment on its appropriateness for predicting the price of diamonds (of this Carat Weight).
Answer: The model for Log10Price is based on Carat Weight, Carat Weight², Color (3 levels), Cut (4 levels), and Clarity (7 levels). The R² for this model is 97.31% (97.27% adjusted), and the residual standard deviation is only 0.047 (in Log10Price). Every term included in the model is statistically significant. The assumptions and conditions of multiple regression are all met. The residuals appear to be symmetric and roughly Normal. The inclusion of the squared term for Carat Weight has eliminated the pattern in the plot of residuals vs. predicted values. This model seems appropriate to use for other diamonds of this Carat Weight.

Regression Roles

We build regression models for a number of reasons. One reason is to model how variables are related to each other in the hope of understanding the relationships. Another is to build a model that might be used to predict values for a response variable when given values for the predictor variables.

When we hope to understand, we are often particularly interested in simple, straightforward models in which predictors are as unrelated to each other as possible. We are especially happy when the t-statistics are large, indicating that the predictors each contribute to the model. By contrast, when prediction is our goal, we are more likely to care about the overall R². Good prediction occurs when much of the variability in y is accounted for by the model. We might be willing to keep variables in our model that have relatively small t-statistics simply for the stability that having several predictors can provide. We care less whether the predictors are related to each other because we don't intend to interpret the coefficients anyway, so collinearity is less of a concern.

In both roles, we may include some predictors to "get them out of the way." Regression offers a way to approximately control for factors when we have observational data because each coefficient estimates a relationship after removing the effects of the other predictors. Of course, it would be better to control for factors in a randomized experiment, but in the real world of business that's often just not possible.

• Be alert for collinearity when you set aside an influential point. Removing a high-influence point may surprise you with unexpected collinearity. Alternatively, a single case that is extreme on several predictors can make them appear to be collinear when in fact they would not be if you removed that point. Removing that point may make apparent collinearities disappear (and would probably result in a more useful regression model).
• Beware missing data. Values may be missing or unavailable for any case in any variable. In simple regression, when the cases are missing for reasons that are unrelated to the variable we're trying to predict, that's not a problem. We just analyze the cases for which we have data. But when several variables participate in a multiple regression, any case with data missing on any of the variables will be omitted from the analysis. You can unexpectedly find yourself with a much smaller set of data than you started with. Be especially careful, when comparing regression models with different predictors, that the cases participating in the models are the same.
• Don't forget linearity. The Linearity Assumption requires linear relationships among the variables in a regression model. As you build and compare regression models, be sure to plot the data to check that it is straight. Violations of this assumption make everything else about a regression model invalid.
• Check for parallel regression lines. When you introduce an indicator variable for a category, check the underlying assumption that the other coefficients in the model are essentially the same for both groups. If not, consider adding an interaction term.
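One quick way to guard against the missing-data trap is to count the complete cases for each candidate model before comparing fits. A sketch, with an assumed file and column names:

```python
import pandas as pd

houses = pd.read_csv("saratoga_houses.csv")            # hypothetical file

model_a = ["price", "living_area", "bedrooms"]
model_b = model_a + ["land_value", "pct_college"]

print(len(houses))                        # all cases
print(len(houses[model_a].dropna()))      # complete cases available to model A
print(len(houses[model_b].dropna()))      # complete cases available to model B (can be far fewer)
```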

Fred Barolo heads a travel company that offers, among other services, customized travel packages. These packages provide a relatively high profit margin for his company, but Fred worries that a weakened economic outlook will adversely affect this segment of his business. He read in a recent Travel Industry Association report that there is increasing interest among U.S. leisure travelers in trips focused on a unique culinary and wine-related experience. To gain an understanding of travel trends in this niche market, he seeks advice from a market analyst, Smith Nebbiolo, whose firm also handles promotional campaigns. Smith has access to several databases, some through membership with the Travel Industry Association and others through the U.S. Census Bureau and Bureau of Economic Analysis. Smith suggests developing a model to predict demand for culinary wine-related travel, and Fred agrees. As the dependent variable, Smith uses the monthly dollar amount spent on travel packages advertised as culinary wine experiences. He considers a number of monthly economic indicators such as gross domestic product (GDP) and personal consumption expenditures (PCE); variables related to the travel industry (e.g., the American Consumer Satisfaction Index (ACSI) for airlines and hotels, etc.); and factors specific to culinary and wine travel experiences such as advertising expenditure and price. With so many variables, Smith uses an automatic stepwise procedure to select among them. The final model Smith presents to Fred does not include any monthly economic indicators; ACSI for airlines and hotels are included as was advertising expenditure for these types of trips. Fred and Smith discuss how it appears that the economy has little effect on this niche travel market and how Fred should start thinking of how he will promote his new culinary wine-related travel packages.

ETHICAL ISSUE  Although using an automatic stepwise procedure is useful in narrowing down the number of independent variables to consider for the model, usually more thoughtful analysis is required. In this case, many of the potential independent variables are highly correlated with each other. It turns out that the ACSI is a strong predictor of economic indicators such as GDP and PCE. Its presence in the model would preclude the entry of GDP and PCE, but saying that economic factors don't affect the dependent variable is misleading. Further, these data are time-dependent. No variables capturing trend or potential seasonality are considered. Related to Item A, ASA Ethical Guidelines.

ETHICAL SOLUTION  The interrelationships between the independent variables need to be examined. More expertise is required on the part of the model builder; residuals need to be examined to determine if there are any time-dependent patterns (i.e., seasonality).

Learning Objectives:

■ Use indicator (dummy) variables intelligently.
• An indicator variable that is 1 for a group and 0 for others is appropriate when the slope for that group is the same as for the others, but the intercept may be different.
• If the slope for the indicated group is different, then it may be appropriate to include an interaction term in the regression model (a coding sketch follows this list).
• When there are three or more categories, use a separate indicator variable for each, but leave one out to avoid collinearity.
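To make the indicator-plus-interaction idea concrete, here is a small coding sketch in Python; the file, the DataFrame movies, and its columns (genre, budget, us_gross) are hypothetical, not from the text.

```python
import pandas as pd
import statsmodels.formula.api as smf

movies = pd.read_csv("movies.csv")                      # hypothetical file

# Indicator: 1 for comedies, 0 for everything else.
movies["comedy"] = (movies["genre"] == "Comedy").astype(int)

# Interaction term: lets the Budget slope differ for comedies.
movies["budget_x_comedy"] = movies["budget"] * movies["comedy"]

model = smf.ols("us_gross ~ budget + comedy + budget_x_comedy", data=movies).fit()
print(model.params)
```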

■ Diagnose multiple regressions to expose any undue influence of individual cases.
• Leverage measures how far a case is from the mean of all cases when measured on the x-variables.
• The leverage of a case tells how much the predicted value of that case would change if the y-value of that case were changed by adding 1 and nothing else in the regression changed.
• Studentized residuals are residuals divided by their individual standard errors. Externally Studentized residuals follow a t-distribution when the regression assumptions are satisfied.
• A case is influential if it has both sufficient leverage and a large enough residual. Removing an influential case from the data will change the model in ways that matter to your interpretation or intended use. Measures such as Cook's D and DFFITS combine leverage and Studentized residuals into a single measure of influence.

• By assigning an indicator variable to a single influential case, we can remove its influence from the model and test (using the P-value of its coefficient) whether it is in fact influential.

■ Build multiple regression models when many predictors are available.
• Seek models with few predictors, a relatively high R², a relatively small residual standard deviation, relatively small P-values for the coefficients, no cases that are unduly influential, and predictors that are reliably measured and relatively unrelated to each other.
• Automated methods for seeking regression models include Best Subsets and Stepwise regression. Neither one should be used without carefully diagnosing the resulting model.

■ Recognize collinearity and deal with it to improve your regression model. Collinearity occurs when one predictor can be well predicted from the others.
• The R² of the regression of one predictor on the others is a suitable measure. Alternatively, the Variance Inflation Factor, which is based on this R², is often reported by statistics programs.
• Collinearity can have the effect of making the P-values for the coefficients large (not significant) even though the overall regression fits the data well.
• Removing some predictors from the model or making combinations of those that are collinear can reduce this problem.

■ Consider fitting quadratic terms in your regression model when the residuals show a bend. Re-expressing y is another option, unless the bent relationship between y and the x's is not monotonic.

■ Recognize why a particular regression model is fit.
• We may want to understand some of the individual coefficients.
• We may simply be interested in prediction and not be concerned about the coefficients themselves.

Terms

Best Subsets Regression   A regression method that checks all possible combinations of the available predictors to identify the combination that optimizes an arbitrary measure of regression success.

Collinearity   When one (or more) of the predictors can be fit closely by a regression on the other predictors, we have collinearity. When collinear predictors are in a regression model, they may have unexpected coefficients and often have inflated standard errors (and correspondingly small t-statistics).

Cook's Distance   A measure of the influence of a single case on the coefficients in a multiple regression.

Indicator variable (dummy variable)   A variable constructed to indicate for each case whether it is in a designated group or not. Usually the values are 0 and 1, where 1 indicates group membership.

Influential case   A case is influential on a multiple regression model if, when it is omitted, the model changes by enough to matter for your purposes. (There is no specific amount of change defined to declare a case influential.) Cases with high leverage and large Studentized residual are likely to be influential.

Interaction term   A variable constructed by multiplying a predictor variable by an indicator variable. An interaction term adjusts the slope of that predictor for the cases identified by the indicator.

Leverage   A measure of the amount of influence an individual case has on the regression. Moving a case in the y direction by 1 unit (while changing nothing else) will move its predicted value by the leverage. The leverage of the ith case is denoted hi.

Stepwise regression   An automated method of building regression models in which predictors are added to or removed from the model one at a time in an attempt to optimize a measure of the success of the regression. Stepwise methods rarely find the best model and are easily affected by influential cases, but they can be valuable in winnowing down a large collection of candidate predictors.

Studentized residual   When a residual is divided by an independent estimate of its standard deviation, the result is a Studentized residual. The type of Studentized residual that has a t-distribution is an externally Studentized residual.

Variance inflation factor (VIF)   A measure of the degree to which a predictor in a multiple regression model is collinear with other predictors. It is based on the R² of the regression of that predictor on all the other predictors in the model:

VIF_j = 1 / (1 - Rj²)

Technology Help: Building Multiple Regression Models

Statistics packages differ in how much information they provide to diagnose a multiple regression. Most packages provide leverage values. Many provide far more, including statistics that we have not discussed. But for all, the principle is the same. We hope to discover any cases that don't behave like the others in the context of the regression model and then to understand why they are special.

Many of the ideas in this chapter rely on the concept of examining a regression model and then finding a new one based on your growing understanding of the model and the data. Regression diagnosis is meant to provide steps along that road. A thorough regression analysis may involve finding and diagnosing several models.

EXCEL
Excel does not offer diagnostic statistics with its regression function.
Comments
Although the dialog offers a Normal probability plot of the residuals, the data analysis add-in does not make a correct probability plot, so don't use this option. The "standardized residuals" are just the residuals divided by their standard deviation (with the wrong df), so they too should be ignored.

JMP
• From the Analyze menu select Fit Model.
• Specify the response, Y. Assign the predictors, X, in the Construct Model Effects dialog box.
• Click on Run Model.
• Click on the red triangle in the title of the Model output to find a variety of plots and diagnostics available.
Comments
JMP chooses a regression analysis when the response variable is "Continuous." In JMP, stepwise regression is a personality of the Model Fitting platform; it is one of the selections in the Fitting Personality popup menu on the Model Specification dialog. Stepwise provides best subsets with an All Possible Models command, accessible from the red triangle drop-down menu on the stepwise control panel after you've computed a stepwise regression analysis.

MINITAB
• Choose Regression from the Stat menu.
• Choose Regression... from the Regression submenu.
• In the Regression dialog, assign the Y variable to the Response box and assign the X-variables to the Predictors box.
• Click on the Options button to obtain the VIF in the regression output.
• In the Regression Storage dialog, you can select a variety of diagnostic statistics. They will be stored in the columns of your worksheet.
• Click the OK button to return to the Regression dialog.
• To specify displays, click Graphs, and check the displays you want.
• Click the OK button to return to the Regression dialog.
• Click the OK button to compute the regression.
Comments
You will probably want to make displays of the stored diagnostic statistics. Use the usual Minitab methods for creating displays. Minitab also offers both stepwise and best subsets regression from the Regression dialog. Indicate the response variable, the predictors eligible for inclusion, and any predictors that you wish to force into the model.

SPSS
• Choose Regression from the Analyze menu.
• Choose Linear from the Regression submenu.
• When the Linear Regression dialog appears, select the Y-variable and move it to the dependent target. Then move the X-variables to the independent target.
• Click the Save button.
• In the Linear Regression Save dialog, choose diagnostic statistics. These will be saved in your worksheet along with your data.
• Click the Continue button to return to the Linear Regression dialog.
• Click the OK button to compute the regression.
Comments
SPSS offers stepwise methods (use the Method drop-down menu), but not best subsets (in the student version). Click on the Statistics button to find collinearity diagnostics and on the Save button for influential point diagnostics. (The residuals SPSS calls "Studentized deleted" are the externally studentized residuals that we've recommended in this chapter.) You may want to plot the saved diagnostics using SPSS's standard graphics methods.

The Paralyzed Veterans of America (PVA) is a philanthropic organization sanctioned by the U.S. government to represent the interests of those veterans who are disabled. (For more information on the PVA see the opening of Chapter 25.) To generate donations, the PVA sends out greeting cards and mailing address labels periodically with their requests for donations. To increase their efficiency, they would like to be able to model the amount of donations based on past giving and demographic variables on their donors. The data set PVA contains data on 3648 donors who gave to a recent solicitation. There are 26 predictor variables and 1 response variable. The response variable (GIFTAMNT) is the amount of money donated by the donor to the last solicitation. Find a model from the 26 predictor variables using any model selection procedure you like to predict this amount. The variables include:

Variables based on the donor's zipcode
MALEVET (% Male veterans)
VIETVETS (% Vietnam veterans)
WWIIVETS (% WWII veterans)
LOCALGOV (% Employed by local government)
STATEGOV (% Employed by state government)
FEDGOV (% Employed by federal government)

Variables specific to the individual donor
CARDPROM (Number of card promotions received lifetime)
MAXADATE (Date of most recent promotion received, in YYMM Year Month format)
NUMPROM (Number of promotions received lifetime)
CARDPRM12 (Number of card promotions received in last 12 months)
NUMPRM12 (Number of promotions received in last 12 months)
NGIFTALL (Number of gifts given lifetime to date)
CARDGIFT (Number of gifts to card promotions given lifetime to date)
MINRAMNT (Amount of smallest gift to date in $)
MINRDATE (Date associated with the smallest gift to date, YYMM format)
MAXRAMNT (Amount of largest gift to date in $)
MAXRDATE (Date associated with the largest gift to date, YYMM format)
LASTGIFT (Amount of most recent gift in $)
AVGGIFT (Average amount of gifts to date in $)
CONTROLN (Control number, unique record identifier)
HPHONE_D (Indicator variable for presence of a published home phone number: 1 = Yes; 0 = No)
CLUSTER2 (Classic Cluster Code, nominal field)
CHILDREN (Number of children living at home)

Response variable
GIFTAMNT (Response variable: amount of last gift in $)

Be sure to include exploratory data analysis and evaluate the relationship among these variables using graphical and correlation analysis to guide you in building your regression models. Write a report summarizing your analysis.

SECTION 19.1

1. For each of the following, show how you would code dummy (or indicator) variables to include in a regression model.
a) Company unionization status (Unionized, No Union)
b) Gender (Female, Male)
c) Account Status (Paid on time, Past Due)
d) Political party affiliation (Democrat, Republican, Other)

2. A marketing manager has developed a regression model to predict quarterly sales of his company's down jackets based on price and amount spent on advertising. An intern suggests that he include an indicator (dummy) variable for the Fall quarter.
a) How would you code such a variable? (What values would it have for each quarter?)
b) Why does the intern's suggestion make sense?
c) Do you think a regression with the indicator variable for Fall would model down jacket sales better than one without that predictor?

3. Do movies of different types have different rates of return on their budgets? Here's a scatterplot of USGross ($M) vs. Budget ($M) for Comedies (red) and Action pictures (blue):
[Scatterplot of US Gross ($M) against Budget ($M) for Comedies and Action pictures]
a) What attributes of this plot suggest that it would be appropriate to use an indicator variable in a regression of USGross on Budget?
b) What would the data values in such an indicator variable be?

4. Here is the regression for Exercise 3 with an indicator variable:

Dependent variable is: USGross($M)
R-squared = 33.8%   R-squared (adjusted) = 31.4%
s = 46.50 with 58 - 3 = 55 degrees of freedom

Variable    Coefficient  SE(Coeff)  t-ratio  P-value
Constant    -7.03913     16.20      -0.435   0.6655
Budget($M)  1.00428      0.1895     5.30    <0.0001
Comedy      25.4175      13.62      1.87     0.0674

a) Write out the regression model.
b) In this regression, the variable Comedy is an indicator variable that is 1 for movies that are comedies. How would you interpret the coefficient of Comedy?
c) What null hypothesis can we test with the t-ratio for Comedy?
d) Would you reject that hypothesis at α = 0.05?

5. For each of the following, show how you would code dummy (indicator) variables to include in a regression model.
a) Type of residence (Apartment, Condominium, Townhouse, Single family home)
b) Employment status (Full-time, Part-time, Unemployed)

6. A marketing manager has developed a regression model to predict quarterly sales of his company's mid-weight microfiber jackets based on price and amount spent on advertising. An intern suggests that he include indicator (dummy) variables for each quarter.
a) How would you code the variables? (How many dummy variables do you need? What values would they have?)
b) Why does the intern's suggestion make sense?
c) Do you think a regression with the indicator variables would model jacket sales better than one without those predictors?

SECTION 19.2

7. Do movies of different types have different rates of return on their budgets? Here's a scatterplot of USGross ($M) vs. Budget ($M) for Comedies (red) and Adventure pictures (green):
[Scatterplot of US Gross ($M) against Budget ($M) for Comedies and Adventure pictures]
a) What attributes of this plot suggest that it would be appropriate to use both an indicator variable and an interaction term in a regression of USGross on Budget?
b) How would you code the indicator variable? (Use Adventure as the base level.)
c) How would you construct the interaction term variable?

8. Here is the regression of Exercise 7 with the constructed variables:

Dependent variable is: USGross($M)
R-squared = 41.0%   R-squared (adjusted) = 37.4%
s = 58.45 with 53 - 4 = 49 degrees of freedom

Variable    Coefficient  SE(Coeff)  t-ratio  P-value
Intercept   -13.2860     29.20      -0.455   0.6512
Budget($M)  1.58935      0.3457     4.60    <0.0001
Budgt*Cmdy  -0.652661    0.4863     -1.34    0.1857
Comedy      34.4795      33.85      1.02     0.3134

a) Write out the regression model.
b) In this regression, the variable Budgt*Cmdy is an interaction term. How would you interpret the coefficient of Budgt*Cmdy?
c) What null hypothesis can we test with the t-ratio for Budgt*Cmdy?
d) Would you reject that hypothesis at α = 0.05?

SECTION 19.3

9. For a regression of USGross ($M) for movies on Budget ($M), Run time (Minutes), and Stars (Reviewers' rating average), a histogram of leverages looks like this:
[Histogram of leverage values, ranging from about 0.00 to 0.15]
The movie with the highest leverage is King Kong, with a leverage of 0.189. If King Kong had earned $1M more than it did (and everything else remained the same), how much would the model's prediction of King Kong's U.S. gross revenue change?

10. For the same regression as in Exercise 9, the Cook's Distance values look like this:
[Histogram of Cook's Distance values, ranging up to about 0.36]
The Cook's D value for Star Wars III is 0.36.
a) If Star Wars III were removed from the data and the regression repeated, would you expect a large change in the regression coefficients?
b) Do you think Star Wars III should be included in this model if the plan is to use it to understand how these predictors affect movie revenue?

SECTION 19.4

11. An analyst wants to build a regression model to predict spending from the following four predictor variables: Past Spending, Income, Net Worth and Age. A correlation matrix of the four predictors shows:

               Income  Net Worth  Age
Past Spending  0.442   0.433      0.446
Income                 0.968      0.992
Net Worth                         0.976

Why might a stepwise regression search not find the same model as an "all subsets" regression?

12. The analyst in Exercise 11 fits the model with the four predictor variables. The regression output shows:

Response Variable: Spending
R2 = 84.92%   Adjusted R2 = 84.85%
s = 48.45 with 908 - 5 = 903 degrees of freedom

Variable       Coeff       SE(Coeff)  t-ratio  P-value
Intercept      -3.738e+00  1.564e+01  -0.239   0.811
Past Spending  1.063e-01   4.203e-03  25.292  <0.0001
Income         1.902e-03   3.392e-04  5.606   <0.0001
Networth       2.900e-05   3.815e-05  0.760    0.447
Age            6.065e-01   7.631e-01  0.795    0.427

a) How many observations were used in the regression?
b) What might you do next?
c) Is it clear that Income is more important to predicting Spending than Networth? Explain.

SECTION 19.5

13. The analyst from Exercise 11, worried about collinearity, regresses Age against Past Spending, Income and Networth. The output shows:

Response Variable: Age
R2 = 98.75%   Adjusted R2 = 98.74%
s = 2.112 with 908 - 4 = 904 degrees of freedom

Variable       Coeff      SE(Coeff)  t-ratio  P-value
Intercept      2.000e+01  1.490e-01  134.234 <0.0001
Past Spending  3.339e-04  1.828e-04  1.826    0.0681
Income         3.811e-04  7.610e-06  50.079  <0.0001
Networth       2.420e-05  1.455e-06  16.628  <0.0001

What is the VIF for Age?

14. If the VIF for Networth in the regression of Exercise 11 was 20.83, what would the R² be from the regression of Networth on Age, Income and Past Spending?

SECTION 19.6 a) What is the interpretation of the coefficient of Type in 15. A collection of houses in a neighborhood of Boston this regression? According to these results, what type shows the following relationship between Price and Age of would you expect to sell better—cheese or pepperoni? the house: b) What displays would you like to see to check assump- tions and conditions for this model? 600000 18. Traffic delays. 500000 T The Texas Transportation Institute (tti.tamu .edu) studies traffic delays. They estimate that in the year 2000, 400000 the 75 largest metropolitan areas experienced 3.6 billion Price 300000 vehicle hours of delay, resulting in 5.7 billion gallons of 200000 wasted fuel and $67.5 billion in lost productivity. That’s 100000 about 0.7% of the nation’s GDP that year. Data the institute 0 100 200 published for the year 2001 include information on the Total Age Delay per Person (hours per year spent delayed by traffic), the Average Arterial Road Speed (mph), the Average Highway Road a) Describe the relationship between Price and Age. Speed (mph), and the Size of the city (small, medium, large, Explain what this says in terms of house prices. very large). The regression model based on these variables b) A linear regression of Price on Age shows that the slope looks like this. The variables Small, Large, and Very Large are for Age is not statistically significant. Is Price unrelated to indicators constructed to be 1 for cities of the named size Age? Explain. and 0 otherwise. c) What might you do instead? Dependent variable is: Delay/person 16. A regression model from the collection of houses in R-squared = 79.1% R-squared (adjusted) = 77.4% Exercise 15 shows the following: s = 6.474 with 68 - 6 = 62 degrees of freedom

Variable Coeff SE(Coeff) t-ratio P-value Source Sum of Squares df Mean Square F-ratio Intercept 217854.85 4197.417 51.906 .0001 Regression 9808.23 5 1961.65 46.8 Age- 1754.254 127.3356- 13.786 .0001 Residual 2598.64 62 41.9135 (Age-38.5122)2 20.401223 1.327713 15.376 .0001 Variable Coeff SE(Coeff) t-ratio P-value a) The slope of Age is negative. Does this indicate that Intercept 139.104 16.69 8.33… 0.0001 older houses cost less, on average? Explain. HiWay MPH- 1.07347 0.2474- 4.34… 0.0001 - - b) Why did the model subtract 38.5122 from Age in the Arterial MPH 2.04836 0.6672 3.07 0.0032 Small- 3.58970 2.953- 1.22 0.2287 quadratic term? Large 5.00967 2.104 2.38 0.0203 Very Large 3.41058 3.230 1.06 0.2951 CHAPTER EXERCISES a) Why is there no coefficient for Medium? T 17. Pizza ratings. Manufacturers of frozen foods often refor- b) Explain how the coefficients of Small, Large, and Very mulate their products to maintain and increase customer Large account for the size of the city in this model. satisfaction and sales. So they pay particular attention to evaluations of their products in comparison to their com- T 19. Pizza ratings, part 2. Here’s a scatterplot of the residu- petitors’ products. Frozen pizzas are a major sector of the als against predicted values for the regression model found frozen food market, accounting for $2.84 billion in sales in in Exercise 17. 2007 (www.aibonline.org/resources/statistics/2007pizza.htm). The prestigious Consumer’s Union rated frozen pizzas for flavor and quality, assigning an overall score to each brand 15 tested. A regression model to predict the Consumer’s Union 0 = = –15

T 19. Pizza ratings, part 2. Here's a scatterplot of the residuals against predicted values for the regression model found in Exercise 17.

[Scatterplot of Residuals (about -30 to 15) against Predicted values (50.0 to 75.0).]

a) The two extraordinary points in the lower right are Reggio's and Michelina's, two gourmet brands. Interpret these points.
b) Do you think these two pizzas are likely to be influential in the regression? Would setting them aside be likely to change the coefficients? What other statistics might help you decide?

T 20. Traffic delays, part 2. Here's a scatterplot of the residuals from the regression in Exercise 18 plotted against mean Highway mph.

[Scatterplot of Residuals (about -7.5 to 7.5) against Highway mph (35 to 55).]

a) The point plotted with an x is Los Angeles. Read the graph and explain what it says about traffic delays in Los Angeles and about the regression model.
b) Is Los Angeles likely to be an influential point in this regression?

T 21. Wal-Mart revenue. Each week about 100 million customers—nearly one-third of the U.S. population—visit one of Wal-Mart's U.S. stores. How does Wal-Mart's revenue relate to the state of the economy in general? Here's a regression table predicting Wal-Mart's monthly revenue ($Billion) from the end of 2003 through the start of 2007 from the consumer price index (CPI), and a scatterplot of the relationship.

Dependent variable is: WM_Revenue
R-squared = 11.4%   R-squared (adjusted) = 9.0%
s = 3.689 with 39 - 2 = 37 degrees of freedom

Variable    Coeff        SE(Coeff)   t-ratio   P-value
Intercept   -24.4085     19.25        -1.27     0.2127
CPI           0.071792    0.0330       2.18     0.0358

[Scatterplot of WM_Revenue ($Billion, about 16 to 28) against CPI (562.5 to 600.0); four points are plotted with x.]

a) The points plotted with x are the four December values. We can construct a variable that is "1" for those four values and "0" otherwise. What is such a variable called?

Here's the resulting regression.

Dependent variable is: WM_Revenue
R-squared = 80.3%   R-squared (adjusted) = 79.2%
s = 1.762 with 39 - 3 = 36 degrees of freedom

Variable    Coeff        SE(Coeff)   t-ratio   P-value
Intercept   -34.7755      9.238       -3.76     0.0006
CPI           0.087707    0.0158       5.55    ≤0.0001
December     10.4905      0.9337      11.2     ≤0.0001

b) What is the interpretation of the coefficient of the constructed variable December?
c) What additional assumption is required to include the variable December in this model? Is there reason to believe that it is satisfied?

T 22. Baseball attendance. Baseball team owners want to attract fans to games. The New York Mets acquired the pitcher Pedro Martinez in 2005. Martinez is considered one of the best pitchers of his era, having won the Cy Young Award three times. Martinez had his own fans. Possibly, he attracted more fans to the ballpark when he pitched at home, helping to justify his 4-year, $53 million contract. Is there really a "Pedro effect" in attendance? We have data for the Mets home games of the 2005 season. The regression has the following predictors:

Weekend       1 if game is on Saturday or Sunday, 0 otherwise
Yankees       1 if game is against the Yankees (a home-town rivalry), 0 otherwise
Rain Delay    1 if the game was delayed by rain (which might have depressed attendance), 0 otherwise
Opening Day   1 for opening day, 0 for the others
Pedro Start   1 if Pedro was the starting pitcher, 0 otherwise

Here's the regression.

Dependent variable is: Attendance
R-squared = 53.9%   R-squared (adjusted) = 50.8%
s = 6998 with 80 - 6 = 74 degrees of freedom

Variable      Coeff        SE(Coeff)   t-ratio   P-value
Intercept     28896.9      1161         24.9     ≤0.0001
Weekend        9960.50     1620          6.15    ≤0.0001
Yankees       15164.3      4218          3.59     0.0006
Rain Delay   -17427.9      7277         -2.39     0.0192
Opening Day   24766.1      7093          3.49     0.0008
Pedro Start    5428.02     2017          2.69     0.0088

a) All of these predictors are of a special kind. What are they called?
b) What is the interpretation of the coefficient for Pedro Start?
c) If we're primarily interested in Pedro's effect on attendance, why is it important to have the other variables in the model?
d) Could Pedro's agent claim, based on this regression, that his man attracts more fans to the ballpark? What statistics should he cite?
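Exercises 21 and 22 both rely on constructed 0/1 indicator variables. A minimal sketch of how such a variable might be built before it is added as a column of the predictors (the month labels below are made up for illustration; this is not the actual Wal-Mart data):

```python
import numpy as np

# Hypothetical month labels for a monthly revenue series (illustration only).
months = np.array(["2003-11", "2003-12", "2004-01", "2004-02", "2004-12", "2005-12"])

# Indicator (dummy) variable: 1 for December observations, 0 otherwise.
december = np.array([1 if m.endswith("-12") else 0 for m in months])

print(december)  # [0 1 0 0 1 1]
```

The resulting column is then included in the regression just like any quantitative predictor.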

T 23. Pizza ratings, part 3. In Exercise 19, we raised questions about two gourmet pizzas. After removing them, the resulting regression looks like this.

Dependent variable is: Score
R-squared = 64.4%   R-squared (adjusted) = 59.8%
s = 14.41 with 27 - 4 = 23 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression        8649.29       3     2883.10       13.9
Residual          4774.56      23      207.590

Variable    Coeff        SE(Coeff)   t-ratio   P-value
Intercept   -363.109     72.15        -5.03    ≤0.0001
Calories       1.56772    0.2824       5.55    ≤0.0001
Type          25.1540     6.214        4.05     0.0005
Fat           -8.82748    1.887       -4.68     0.0001

A plot of the residuals against the predicted values for this regression looks like this. It has been colored according to the Type of pizza.

[Scatterplot of Residuals (about -15 to 30) against Predicted values (40 to 80), with cheese and pepperoni pizzas plotted in different colors.]

a) What does this plot say about how the regression model deals with these two types of pizza?

We constructed another variable consisting of the indicator variable Type multiplied by Calories. Here's the resulting regression.

Dependent variable is: Score
R-squared = 73.1%   R-squared (adjusted) = 68.2%
s = 12.82 with 27 - 5 = 22 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression        9806.53       4     2451.63       14.9
Residual          3617.32      22      164.424

Variable     Coeff         SE(Coeff)   t-ratio   P-value
Intercept    -464.498      74.73        -6.22    ≤0.0001
Calories        1.92005     0.2842       6.76    ≤0.0001
Type          183.634      59.99         3.06     0.0057
Fat           -10.3847      1.779       -5.84    ≤0.0001
Type*Cals      -0.461496    0.1740      -2.65     0.0145

b) Interpret the coefficient of Type*Cals in this regression model.
c) Is this a better regression model than the one in Exercises 17 and 19?

T 24. Traffic delays, part 3. Here's a plot of the Studentized residuals from the regression model of Exercise 18 plotted against ArterialMPH. The plot is colored according to City Size (Small, Medium, Large, and Very Large), and regression lines are fit for each city size.

[Scatterplot of Studentized Residuals (about -1.25 to 1.25) against Arterial mph (26 to 32), with separate symbols and fitted lines for small, medium, large, and very large cities.]

a) The model in Exercise 18 includes indicators for city size. Considering this display, have these indicator variables accomplished what is needed for the regression model? Explain.

Here is another model that adds two new constructed variables to the model in Exercise 18. They are the product of ArterialMPH and Small and the product of ArterialMPH and VeryLarge.

Dependent variable is: Delay/person
R-squared = 80.7%   R-squared (adjusted) = 78.5%
s = 6.316 with 68 - 8 = 60 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression       10013.0        7     1430.44       35.9
Residual          2393.82      60       39.8970

Variable      Coeff        SE(Coeff)   t-ratio   P-value
Intercept      153.110     17.42         8.79    ≤0.0001
HiWayMPH        -1.02104    0.2426      -4.21    ≤0.0001
ArterialMPH     -2.60848    0.6967      -3.74     0.0004
Small         -125.979     66.92        -1.88     0.0646
Large            4.89837    2.053        2.39     0.0202
VeryLarge      -89.4993    63.25        -1.41     0.1623
AM*Sml           3.81461    2.077        1.84     0.0712
AM*VLg           3.38139    2.314        1.46     0.1491

b) What does the predictor AM*Sml (ArterialMPH by Small) do in this model? Interpret the coefficient.
c) Does this model improve on the model in Exercise 18? Explain.
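A reference note on the constructed products in Exercises 23 and 24 (the algebra is standard, not shown in the original): multiplying an indicator by a quantitative predictor lets the slope, not just the intercept, differ between groups. For the pizza model, with Type = 1 for cheese and 0 for pepperoni,

$$ \widehat{Score} \;=\; b_0 + b_1\,Calories + b_2\,Type + b_3\,Fat + b_4\,(Type \times Calories), $$

so the fitted Calories slope is $b_1$ for pepperoni pizzas and $b_1 + b_4$ for cheese pizzas.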

T 25. Insurance (life expectancy). Insurance companies base their premiums on many factors, but basically all the factors are variables that predict life expectancy. Life expectancy varies from place to place. Here's a regression that models Life Expectancy in terms of other demographic variables that we saw in Exercise 44 of Chapter 18. (Refer to that exercise for variable definitions and units.)

Dependent variable is: Life exp
R-squared = 67.0%   R-squared (adjusted) = 64.0%
s = 0.8049 with 50 - 5 = 45 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression        59.1430       4      14.7858      22.8
Residual          29.1560      45       0.647910

Variable     Coeff          SE(Coeff)   t-ratio   P-value
Intercept     69.4833       1.325        52.4     ≤0.0001
Murder        -0.261940     0.0445       -5.89    ≤0.0001
HSGrad         0.046144     0.0218        2.11     0.0403
Income         1.24948e-4   0.0002        0.516    0.6084
Illiteracy     0.276077     0.3105        0.889    0.3787

a) The state with the highest leverage and largest Cook's Distance is Alaska. It is plotted with an x in the residuals plot. Here are a scatterplot of the residuals and a normal probability plot of the leverage values. What evidence do you have from these diagnostic plots that Alaska might be an influential point?

[Scatterplot of Residuals (about -1 to 1) against Predicted values (69 to 72), with Alaska plotted with an x; normal probability plot of the Leverages (about 0.075 to 0.300) against Nscores.]

Here's another regression with a dummy variable for Alaska added to the regression model.

Dependent variable is: Life exp
R-squared = 70.8%   R-squared (adjusted) = 67.4%
s = 0.7660 with 50 - 6 = 44 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression        62.4797       5      12.4959      21.3
Residual          25.8193      44       0.586802

Variable     Coeff          SE(Coeff)   t-ratio   P-value
Intercept     67.6377       1.480        45.7     ≤0.0001
Murder        -0.250395     0.0426       -5.88    ≤0.0001
HSGrad         0.055792     0.0212        2.63     0.0116
Illiteracy     0.458607     0.3053        1.50     0.1401
Income         3.68218e-4   0.0003        1.46     0.1511
Alaska        -2.23284      0.9364       -2.38     0.0215

b) What does the coefficient for the dummy variable for Alaska mean? Is there evidence that Alaska is an outlier in this model?
c) Which model would you prefer for understanding or predicting Life Expectancy? Explain.
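One standard fact is useful here and again in Exercises 26 and 34 (it comes from regression theory, not from the exercise itself): a dummy variable that equals 1 for a single case, as in

$$ \widehat{\textit{Life exp}} \;=\; b_0 + b_1\,\textit{Murder} + b_2\,\textit{HSGrad} + b_3\,\textit{Illiteracy} + b_4\,\textit{Income} + b_5\,\textit{Alaska}, $$

forces that case's residual to zero, so the remaining coefficients are estimated as if the case had been set aside, and the dummy's coefficient measures how far the case sits from what those coefficients would predict for it.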

T 26. Cereal nutrition. Breakfast cereal manufacturers publish nutrition information on each box of their product. As we saw in Chapter 17, there is a long history of cereals being associated with nutrition. Here's a regression to predict the number of Calories in breakfast cereals from their Sodium, Potassium, and Sugar content, and some diagnostic plots.

Dependent variable is: Calories
R-squared = 38.4%   R-squared (adjusted) = 35.9%
s = 15.60 with 77 - 4 = 73 degrees of freedom

Variable     Coeff        SE(Coeff)   t-ratio   P-value
Intercept    83.0469      5.198        16.0     ≤0.0001
Sodium        0.057211    0.0215        2.67     0.0094
Potassium    -0.019328    0.0251       -0.769    0.4441
Sugar         2.38757     0.4066        5.87    ≤0.0001

[Normal probability plot of the Leverages (up to about 0.16) against Nscores; scatterplot of Residuals (about -20 to 20) against Potassium (0 to 300); histogram of the Cook's Distances (0.00 to about 0.50); histogram of the Studentized Residuals (about -2.8 to 1.2), with two cases shaded.]

The shaded part of the histogram corresponds to the two cereals plotted with x's in the normal probability plot of the leverages and the residuals plot. These are All-Bran with Extra Fiber and All-Bran.

a) What do the displays say about the influence of these two cereals on this regression? (The histogram is of the Studentized residuals.)

Here's another regression with dummy variables defined for each of the two bran cereals.

Dependent variable is: Calories
R-squared = 50.7%   R-squared (adjusted) = 47.3%
s = 14.15 with 77 - 6 = 71 degrees of freedom

Variable          Coeff        SE(Coeff)   t-ratio   P-value
Intercept         79.0874      4.839        16.3     ≤0.0001
Sodium             0.068341    0.0198        3.46     0.0009
Potassium          0.043063    0.0272        1.58     0.1177
Sugar              2.03202     0.3795        5.35    ≤0.0001
All-Bran         -50.7963     15.84         -3.21     0.0020
All-Bran Extra   -52.8659     16.03         -3.30     0.0015

b) Explain what the coefficients of the bran cereal dummy variables mean.
c) Which regression would you select for understanding the interplay of these nutrition components? Explain. (Note: Both are defensible.)
d) As you can see from the scatterplot, there's another cereal with high potassium. Not too surprisingly, it is 100% Bran. But it does not have leverage as high as the other two bran cereals. Do you think it should be treated like them (i.e., removed from the model, fit with its own dummy, or left in the model with no special attention, depending on your answer to part c)? Explain.

T 27. Lobster industry 2008, value diagnosed. In Chapter 18, Exercise 34 (Lobster industry 2008), we constructed the following regression model to predict the log10 of the annual value of Maine's lobster harvest.

Dependent variable is: LogValue
R-squared = 97.4%   R-squared (adjusted) = 97.3%
s = 0.0782 with 56 - 4 = 52 degrees of freedom

Variable      Coeff          SE(Coeff)   t-ratio   P-value
Intercept      7.70207       0.3706       20.8     ≤0.0001
Traps(M)       0.575390      0.0152       37.8     ≤0.0001
Fishers       -5.32221e-5    0.0000       -5.43    ≤0.0001
Water Temp    -0.015185      0.0074       -2.05     0.0457

At that time we also examined plots of the residuals, which appeared to satisfy the assumptions and conditions for regression inference. Let's look a bit deeper. Here are histograms of the Cook's Distances and Studentized residuals for this model.

[Histogram of the Cook's Distances (0.000 to about 0.150); histogram of the Studentized Residuals (about -3.5 to 1.5).]

The case with high Cook's Distance is 1994. (You can find a scatterplot of the logValue over time in Exercise 53 of Chapter 17.) What does this suggest about this model? What would you recommend?
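Exercises 27 and 28 both read histograms of Cook's Distances. For reference, the usual formula (from standard regression diagnostics, not from the exercise) combines a case's residual $e_i$ and leverage $h_i$:

$$ D_i \;=\; \frac{e_i^{\,2}}{p\,s^2}\cdot\frac{h_i}{(1-h_i)^2}, $$

where $p$ is the number of estimated coefficients and $s$ is the residual standard deviation, so a case needs both a sizable residual and sizable leverage to earn a large $D_i$.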

T 28. Lobster industry 2008, price diagnosed. In Chapter 18, Exercise 35 (Lobster industry 2008), we constructed the following regression model to predict the Price of lobster harvested in Maine's lobster fishing industry.

Dependent variable is: Price/lb
R-squared = 94.4%   R-squared (adjusted) = 94.1%
s = 0.2462 with 53 - 4 = 49 degrees of freedom

Variable      Coeff          SE(Coeff)   t-ratio   P-value
Intercept      0.845123      0.3557        2.38     0.0215
Traps(M)       1.20094       0.0619       19.4     ≤0.0001
Fishers       -1.21820e-4    0.0000       -3.30     0.0018
Catch/Trap   -22.0207       10.96         -2.01     0.0500

At that time we also examined plots of the residuals, which appeared to satisfy the assumptions and conditions for regression inference. Let's look a bit deeper. Here are histograms of the Cook's Distances and Studentized residuals for this model.

[Histogram of the Cook's Distances (0.00 to about 0.20); histogram of the Studentized Residuals (about -3.5 to 1.5).]

The case with large Cook's Distance is 1994, which was also the year with the lowest Studentized residual. What does this suggest about this model? What would you recommend?

T 29. OECD economic regulation (model building). A study by the U.S. Small Business Administration modeled the GDP per capita of 24 of the countries in the Organization for Economic Cooperation and Development (OECD) (Crain, M. W., The Impact of Regulatory Costs on Small Firms, available at www.sba.gov/idc/groups/public/documents/sba_homepage/rs264tot.pdf). One analysis estimated the effect on GDP of economic regulations using an index of the degree of OECD economic regulation and other variables. (We considered this model in Exercise 28 of Chapter 18.) They found the following regression model.

Dependent variable is: GDP/Capita
R-squared = 97.4%   R-squared (adjusted) = 96.6%
s = 2084 with 24 - 6 = 18 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression      2895078376      5    579015675       133
Residual          78158955     18      4342164

Variable                     Coeff        SE(Coeff)   t-ratio   P-value
Intercept                    10487.3      9431          1.11     0.2808
OECD Reg Index               -1343.28      626.4        -2.14     0.0459
Ethnolinguistic Diversity      -69.9875     23.26       -3.01     0.0075
Int'l Trade/GDP                 44.7096     14.00        3.19     0.0050
Primary Education(%)           -58.4084     86.11       -0.678    0.5062
1988 GDP/Capita                  1.07767     0.0448     24.1     ≤0.0001

a) If we remove Primary Education from the model, the R² decreases to 97.3%, but the adjusted R² increases to 96.7%. How can that happen? What does it mean? Would you include Primary Education in this model?

Here's a part of that regression output:

Dependent variable is: GDP/Capita
R-squared = 97.3%   R-squared (adjusted) = 96.7%
s = 2054 with 24 - 5 = 19 degrees of freedom

Variable                     Coeff        SE(Coeff)   t-ratio   P-value
Intercept                     4243.21     2022          2.10     0.0495
OECD Reg Index               -1244.20      600.4        -2.07     0.0521
Ethnolinguistic Diversity      -64.4200     21.45       -3.00     0.0073
Int'l Trade/GDP                 40.3905     12.29        3.29     0.0039
1988 GDP/Capita                  1.08492     0.0429     25.3     ≤0.0001

b) Consider the t-statistic for OECD Regulation in the reduced model. That was the predictor of interest to this author. Do you agree with his conclusion that OECD regulation reduced GDP/Capita in these countries? Why do you think he chose to include Primary Education as a predictor? Explain.
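Part a) of Exercise 29 turns on how adjusted R² penalizes extra coefficients. The standard formula (included here for reference) with $n$ cases and $k$ predictors is

$$ R^2_{\text{adj}} \;=\; 1 - \frac{(1-R^2)(n-1)}{n-k-1}, $$

so a predictor that raises R² only slightly can still lower the adjusted value.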

T 30. Dirt bikes. Off-road motorcycles (often called "dirt bikes") are a segment (about 18%) of the growing motorcycle market. Because dirt bikes offer great variation in features, they are a good market segment to study to learn about which features account for the cost (manufacturer's suggested retail price, MSRP) of a bike. Researchers collected data on 2005-model dirt bikes (lib.stat.cmu.edu/datasets/dirtbike_aug.csv). Their original goal was to study market differentiation among brands. (The Dirt on Bikes: An Illustration of CART Models for Brand Differentiation, Jiang Lu, Joseph B. Kadane, and Peter Boatwright.) In Chapter 18, Exercises 41, 42, and 43 dealt with these data, but several bikes were removed from those data to simplify the analysis. Now we'll take on the full set.13 Here's a regression model and some associated graphs.

13 Well, in honesty, we've removed one luxury, handmade bike whose MSRP was $19,500 as a clearly identified outlier.

Dependent variable is: MSRP
R-squared = 91.0%   R-squared (adjusted) = 90.5%
s = 606.4 with 100 - 6 = 94 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression       349911096      5     69982219       190
Residual          34566886     94       367733

Variable          Coeff        SE(Coeff)   t-ratio   P-value
Intercept         -5514.66     826.2        -6.67    ≤0.0001
Bore                 83.7950     6.145      13.6     ≤0.0001
Clearance           152.617     52.02        2.93     0.0042
Engine Strokes     -315.812     89.83       -3.52     0.0007
Total Weight        -13.8502     3.017      -4.59    ≤0.0001
Wheelbase           119.138     34.26        3.48     0.0008

[Scatterplot of Residuals (about -750 to 750) against Predicted MSRP (2000 to 6000); normal probability plot of the Residuals against Nscores; histogram of the Leverages (0.00 to about 0.16).]

a) List aspects of this regression model that lead to the conclusion that it is likely to be a useful model.
b) What aspects of the displays indicate that the model is a good one?

T 31. Airlines on time. Airlines strive to be on time, in part, because customers can refer to government-published statistics to select flights that are most often on time. We have data for 19 airlines for March 2006 reporting the following variables:

On Time (number of on-time arrivals)
Cancelled (number of cancelled flights)
Diverted (number of diverted flights)
Carrier (number of delays due to the carrier)
Weather (number of delays due to weather)
NAS Delay (delays due to the National Airspace System (traffic control))
Late Arrival (number of delays due to late arrival of equipment or crew)

Here's a regression model.

Dependent variable is: On Time
R-squared = 93.9%   R-squared (adjusted) = 90.8%
s = 5176 with 19 - 7 = 12 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression      4947151273      6    824525212      30.8
Residual         321546284     12     26795524

Variable        Coeff        SE(Coeff)   t-ratio   P-value
Intercept       1357.10      2316          0.586    0.5687
Cancelled        -18.5514       6.352     -2.92     0.0128
Diverted          39.5623      95.59       0.414    0.6863
Carrier           10.9620       3.104      3.53     0.0041
Weather           10.4637       9.462      1.11     0.2905
NAS Delay          2.46727      1.091      2.26     0.0431
Late Arrival       4.64874      1.445      3.22     0.0074

a) In view of its P-value, can we interpret the coefficient of Diverted?

Here's a scatterplot of On Time vs. Diverted and another regression. Note that Diverted is the response variable in this second regression.

[Scatterplot of On Time (0 to 60,000) against Diverted (0 to 100).]

Dependent variable is: Diverted
R-squared = 85.6%   R-squared (adjusted) = 80.0%
s = 15.02 with 19 - 6 = 13 degrees of freedom

Variable        Coeff         SE(Coeff)   t-ratio   P-value
Intercept       -1.65088      6.703        -0.246    0.8093
Cancelled       -0.016896     0.0178       -0.948    0.3604
Carrier          0.013823     0.0081        1.70     0.1136
Weather          0.050714     0.0236        2.15     0.0509
Late Arrival     6.33917e-3   0.0038        1.67     0.1196
NAS Delay        7.09830e-3   0.0025        2.86     0.0133

b) It seems from the scatterplot that Diverted would be a good predictor of On Time, but that seems not to be the case. Why do you think the coefficient of Diverted is not significant in the first regression?
c) How does the second regression explain this apparent contradiction?
d) Find the value of the Variance Inflation Factor statistic for Diverted in the first regression.
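Exercise 31(d) and Exercise 32(b) each ask for a variance inflation factor computed from an auxiliary regression of one predictor on the others. A minimal sketch of that computation, on a small made-up data set rather than the airline or dirt-bike data, might look like this:

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor for column j of the predictor matrix X,
    computed from the R-squared of regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Made-up, collinear predictors for illustration only.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.1, size=50)   # nearly a copy of x1
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])  # x1 and x2 get large VIFs; x3 stays near 1
```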

T 32. Dirt bikes, part 2. The model in Exercise 30 is missing one predictor that we might have expected to see. Engine Displacement is highly correlated (r = 0.783) with MSRP, but that variable has not entered the model (and, indeed, would have a P-value of 0.54 if it were added to the model). Here is some evidence to explain why that may be. (Hint: Notice that Displacement is the response variable in this regression.)

Dependent variable is: Displacement
R-squared = 95.9%   R-squared (adjusted) = 95.7%
s = 35.54 with 100 - 6 = 94 degrees of freedom

Variable         Coeff       SE(Coeff)   t-ratio   P-value
Intercept        -8.05901    48.42        -0.166    0.8682
Bore              9.10890     0.3601      25.3     ≤0.0001
Clearance         3.55912     3.048        1.17     0.2460
Engine Stroke   -27.3943      5.264       -5.20    ≤0.0001
Total Weight      1.03749     0.1768       5.87    ≤0.0001
Wheelbase       -10.0612      2.008       -5.01    ≤0.0001

a) What term describes the reason Displacement doesn't contribute to the regression model for MSRP?
b) Find the value of the Variance Inflation Factor for Displacement in the regression on MSRP.

T 33. Ticket prices, diagnosed. Exercises in Chapter 18 considered weekly receipts from shows on Broadway in New York City. To simplify matters, we omitted a few weeks from those data. Here is the same regression with all the data present.

Dependent variable is: Receipts($M)
R-squared = 93.3%   R-squared (adjusted) = 93.1%
s = 0.9589 with 92 - 4 = 88 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression        1130.43       3      376.811       410
Residual            80.9067     88       0.919395

Variable                Coeff        SE(Coeff)   t-ratio   P-value
Intercept              -22.1715     2.221         -9.98    ≤0.0001
Paid Attendance          0.087031   0.0046        19.0     ≤0.0001
Number of Shows         -0.024934   0.0338        -0.737    0.4632
Average Ticket Price     0.265756   0.0286         9.29    ≤0.0001

Here are some diagnostic plots.

[Scatterplot of Residuals (about -6 to 0) against Predicted values (10 to 25), with two points plotted with x; histogram of the Cook's Distances (0.0 to about 1.5).]

a) The two points plotted with x in the residuals vs. predicted plot are the two with the highest Cook's Distances. What does this information say about how those points might be affecting the analysis?

In fact, those points are reports published the last week of November and the first week of December in 2007—a time when the Stagehands Union Local One was on strike, closing down most Broadway shows. It seems that these are not representative weeks for business on Broadway. Removing them results in the following regression.

Dependent variable is: Receipts($M)
R-squared = 98.5%   R-squared (adjusted) = 98.5%
s = 0.3755 with 90 - 4 = 86 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression         809.793       3      269.931      1915
Residual            12.1250      86       0.140988

Variable                Coeff        SE(Coeff)   t-ratio   P-value
Intercept              -21.3165     0.8984       -23.7     ≤0.0001
Paid Attendance          0.071955   0.0019        37.4     ≤0.0001
Number of Shows          0.045793   0.0137         3.35     0.0012
Average Ticket Price     0.274503   0.0115        23.8     ≤0.0001

b) What changes in these estimated models support the conclusion that those two weeks were influential points? (Hint: The increase in adjusted R² isn't one of the reasons.)
c) Which model would be the best to use in analyzing the business of Broadway shows—the one with all the data or the one with the two influential points removed? Explain.

Here’s a model that includes a quadratic term. 4 Dependent variable is: Catch/Fisher 3 R-squared= 85.9% R-squared (adjusted) = 85.3% s= 0.3340 with 56 - 3 = 53 degrees of freedom 2 Source Sum of Squares df Mean Square F-ratio

Catch/Fisher (tonnes) 1 Regression 35.9298 2 17.9649 161 Residual 5.91122 53 0.111765

0.75 1.50 2.25 3.00 Variable Coeff SE(Coeff) t-ratio P-value Traps (M) Intercept 2.82022 0.1698 16.8 …0.0001 Traps-2.5153 0.2284 -11.0 …0.0001 2 … We can define a dummy variable that is “1” for Ireland and “0” Traps 0.898526 0.0647 13.9 0.0001 for all other countries. The resulting model looks like b) In 2001, almost 3 million lobster traps were licensed in this. Maine to 7327 fishers. What would this model estimate as Dependent variable is: GDP/Capita the total catch for that year? R-squared= 98.3% R-squared (adjusted) = 97.7% c) The coefficient of Traps in this model is negative, but we s= 1713 with 24 - 7 = 17 degrees of freedom can see from the scatterplot that the catch has increased substantially as the number of traps increased. What Variable Coeff SE(Coeff) t-ratio P-value Intercept 13609.0 7818 1.74 0.0998 accounts for this apparent anomaly? 1988 GDP/ T 36. Lobster industry 2008, water temperature. Like many Capita 1.10397 0.0378 29.2 …0.0001 OECD businesses, the Maine lobster fishing industry is seasonal. Reg Index-520.181 579.2-0.898 0.3816 But the peak lobstering season has been shifting later in the Int’l Trade/ fall. According to scientists, the biggest factor in the time GDP 21.0171 13.81 1.52 0.1463 of the peak is water temperature. The data at www.maine. Ethnolinguistic gov/dmr include water temperature. How has it changed Diversity-49.8210 20.19-2.47 0.0245 Primary over the past half century? Here’s a scatterplot of Water Education-99.3369 72.00-1.38 0.1856 Temperature(°F) vs. Year. Ireland 8146.64 2624 3.10 0.0064 b) Explain what the dummy variable for Ireland accom- 51.0 plishes in this model. c) What do you conclude now about the effects of OECD 49.5 regulation in these data? 48.0

T 35. Lobster industry 2008, catch model. In Chapter 17, we 46.5 saw data on the lobster fishing industry in Maine. We have Water Temperature annual data from 1950 through 2005. The annual catch (metric tons) has increased, but is this due to more licensed 1950 1960 1970 1980 1990 2000 lobster fishers or to more efficient fishing? Because both Year fishers and traps are individually licensed, we have detailed data. Here’s a scatterplot of Catch/Fisher (tonnes) vs. the a) Consider the assumptions and conditions for regression. number of Traps (in millions). Would re-expressing either variable help? Explain. A regression model for Water Temperature by Year looks like 4 this.

3 Dependent variable is: Water Temperature R-squared= 2.7% R-squared (adjusted) = 0.9% = - = 2 s1.522 with 56 2 54 degrees of freedom Variable Coeff SE(Coeff) t-ratio P-value Catch/Fisher (tonnes) 1 Intercept 17.7599 24.88 0.714 0.4784 Year 0.015437 0.0126 1.23 0.2251 0.75 1.50 2.25 3.00 b) Do the statistics associated with this model indicate that Traps (M) there is no relationship between Water Temperature and Year? Explain. a) Consider the assumptions and conditions for regression. Would re-expressing either variable help? Explain. Here’s a regression model including a quadratic term. 664 CHAPTER 19 • Building Multiple Regression Models

Dependent variable is: Water Temperature 57 total cases of which 1 is missing

R-squared= 60.0% R-squared (adjusted) = 58.5% s= 0.9852 with 56 - 3 = 53 degrees of freedom 1 Histograms are used to examine the shapes of distri- butions of individual variables. We check especially Variable Coeff SE(Coeff) t-ratio P-value for multiple modes, outliers, and skewness. They are Intercept 19208.2 2204 8.71 …0.0001 - - … also used to check the shape of the distribution of Year19.3947 2.229 8.70 0.0001 the residuals for the Nearly Normal Condition. Year2 4.90774e-3 0.0006 8.71 …0.0001 2 Scatterplots are used to check the Straight Enough c) This model fits better, but the coefficient of Year is now Condition in plots of y vs. any of the x’s. They are strongly negative. We can see from the scatterplot that used to check plots of the residuals or Studentized temperatures have not been dropping at 19 degrees per residuals against the predicted values, against any of the predictors, or against Time to check for patterns. year. Explain this apparent anomaly. 3 The Normal model is needed only when we use in- Here’s the same regression model14 but with the quadratic ference; it isn’t needed for computing a regression term in the form of 1Year - Year22. model. We check the Nearly Normal Condition on the residuals. Dependent variable is: Water Temperature R-squared= 60.0% R-squared (adjusted) = 58.5% s= 0.9852 with 56 - 3 = 53 degrees of freedom

Variable Coeff SE(Coeff) t-ratio P-value Intercept 6.77149 16.16 0.419 0.6768 Year 0.020345 0.0082 2.49 0.0159 (Year-mean)2 4.90774e-3 0.0006 8.71 …0.0001 d) Using the second model, predict what the water temperature will be in 2015, and then explain why you don’t trust that prediction.

JUST CHECKING ANSWERS

1 Histograms are used to examine the shapes of distributions of individual variables. We check especially for multiple modes, outliers, and skewness. They are also used to check the shape of the distribution of the residuals for the Nearly Normal Condition.

2 Scatterplots are used to check the Straight Enough Condition in plots of y vs. any of the x's. They are used to check plots of the residuals or Studentized residuals against the predicted values, against any of the predictors, or against Time to check for patterns.

3 The Normal model is needed only when we use inference; it isn't needed for computing a regression model. We check the Nearly Normal Condition on the residuals.
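Footnote 14's point about the collinearity between Year and Year² is easy to verify numerically. A small sketch (using only the 1950-2005 span of years mentioned in Exercise 35, not the actual water-temperature data) shows how centering removes almost all of it:

```python
import numpy as np

# Years 1950-2005, the annual span used in the lobster exercises (illustration only).
year = np.arange(1950, 2006)

raw_corr = np.corrcoef(year, year ** 2)[0, 1]
centered = year - year.mean()
centered_corr = np.corrcoef(centered, centered ** 2)[0, 1]

print(f"corr(Year, Year^2)               = {raw_corr:.8f}")       # extremely close to 1
print(f"corr(Year - mean, (Year-mean)^2) = {centered_corr:.8f}")   # essentially 0
```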