
PBAF 528 Week 8

What are some problems with our model? Regression models are used to represent relationships between a dependent variable and one or more predictors. In order to make inferences from the sample data about the true relationships, we have to make assumptions about the model. In real life, though, we often violate these assumptions, and those violations have implications for our estimates. It is important to understand when these violations affect the reliability of our estimates.

When reporting our results to a nontechnical audience, we DO NOT report these analyses in the body. We might discuss them in an appendix or a footnote, but often they are not mentioned at all.

A. Properties of OLS regression coefficients

OLS estimators (coefficients) are supposed to be BLUE (Best Linear Unbiased Estimators). That is, they are:

· Unbiased -- on average, they are right
· Minimum variance (efficient) -- the chosen estimates are closer to the true value than alternative estimates
· Consistent -- the estimates get better as the sample size gets larger

B. Regression Residuals

These properties have implications for the residuals of the regression.

Residual
· the distance between the actual value of the outcome and the estimated value according to the regression line
· the distance between the data point and the regression line

In SPSS you can save the residuals when you run a regression and then use them to make plots, get descriptive statistics, and diagnose various problems that prevent your equation from being BLUE.
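The course software is SPSS, but if you want to see the mechanics, here is a minimal sketch of the same idea in Python (numpy only; the tiny dataset and variable names are made up for illustration):

import numpy as np

# Hypothetical data: outcome y and one predictor x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = b0 + b1*x by ordinary least squares
X = np.column_stack([np.ones_like(x), x])     # add an intercept column
b, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS coefficients

fitted = X @ b                  # estimated values on the regression line
residuals = y - fitted          # actual minus estimated: the residuals

print("coefficients:", b)
print("residuals:", residuals)
print("mean residual:", residuals.mean())     # ~0 by construction in OLS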

C. The Assumptions

1. The regression model is linear in the coefficients, is correctly specified, and has an additive error term.
2. The error term has a zero population mean.
3. All explanatory variables are uncorrelated with the error term.
4. Observations of the error term are uncorrelated with each other (no serial correlation).
5. The error term has a constant variance (no heteroskedasticity).
6. No explanatory variable is a perfect linear function of any other explanatory variable (no multicollinearity).
7. The error term is normally distributed.

D. The Problems

1. When the functional form is inappropriate

Could be a violation of #1, #2, or #3. An OLS model is supposed to be linear in its coefficients, but sometimes a linear model does not explain the relationships; instead, a model in which terms are functions of several coefficients multiplied together or otherwise combined would be better than a linear model.

a) To check

Look at scatter diagrams of Yi and Xi; the actual measured data should lie along a line.

2. Specification error (omitted variable bias)

A violation of #3: if the explanatory factors and the errors are correlated, then a variable could be missing. Does the model include all important factors?

a) Symptoms
Coefficients on the variables that are included have an unexpected sign or are not significant. The model may lack explanatory power.

b) To check

Look at a scatter diagram of the residuals ei vs. Xi: are they scattered about zero?

c) Effects
Coefficients on any included variables that are correlated with the omitted variable are biased and inconsistent.

d) To fix
Add the omitted variable or a proxy variable to the regression model. If the data are unavailable and you can't add a variable, you may have to use theory to discuss the interpretation of the coefficients of the included variables, such as to explain the sign (+/-) or the bias.

e) Example
Is an earnings model biased by not including a variable representing the work experience or motivation of the individual?

Earnings would certainly be affected by these factors -- not just age, gender, race, and education. If work experience and motivation have a positive effect on earnings, and if experience and motivation are positively correlated with education, then the coefficient on the education variable that has been included in the model may be too positive (because it has to account for the influence of both education AND experience and motivation on earnings).
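A small simulation makes the direction of the bias concrete. This is a sketch with made-up numbers (the true coefficients and the correlation are assumptions chosen for illustration): we generate earnings from education and experience, make the two correlated, and then fit a model that omits experience.

import numpy as np

rng = np.random.default_rng(0)
n = 5000

educ = rng.normal(14, 2, n)                       # years of education
exper = 0.5 * educ + rng.normal(0, 1, n)          # correlated with education
earnings = 2.0 * educ + 1.5 * exper + rng.normal(0, 3, n)

# Full model: includes both education and experience
X_full = np.column_stack([np.ones(n), educ, exper])
b_full, *_ = np.linalg.lstsq(X_full, earnings, rcond=None)

# Misspecified model: omits experience
X_omit = np.column_stack([np.ones(n), educ])
b_omit, *_ = np.linalg.lstsq(X_omit, earnings, rcond=None)

print("education coefficient, full model:", b_full[1])           # ~2.0 (unbiased)
print("education coefficient, omitting experience:", b_omit[1])  # ~2.75, too positive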

3. Serial correlation

A violation of #4: each observation is correlated with the previous one, and therefore the observations are not independent.

a) To check
First, think about how the data were collected. Was this a random sample? Were measures made on the same individuals over time (which would mean that one value might be dependent on the previous one)? Second, look at a plot of the residuals vs. time and see if it looks like there is a correlation.

b) Effects
Coefficients are unbiased but not efficient; SEs are biased and inconsistent; residuals are correlated with the residuals on adjacent observations.

c) To fix
There are methods to remove correlated errors. In short, you estimate the correlation between the errors, and if it is a problem, you can either try to remove the correlation or use some other estimation method for your model.

d) Example
The classic example is serial correlation in time-series data. For example, if you were predicting GNP over time, each year's GNP would be somewhat related to the previous year's GNP (each year's GNP is not an independent shot-in-the-dark occurrence; the country's GNP has up and down trends).
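One common numeric check is the Durbin-Watson statistic on the saved residuals (values near 2 suggest no first-order serial correlation; values well below 2 suggest positive serial correlation). A sketch using statsmodels, with a hypothetical time series whose errors are deliberately autocorrelated:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
n = 100
t = np.arange(n)

# Hypothetical series with AR(1) errors (rho = 0.8)
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.8 * e[i - 1] + rng.normal(0, 1)
y = 1.0 + 0.05 * t + e

model = sm.OLS(y, sm.add_constant(t)).fit()
print("Durbin-Watson:", durbin_watson(model.resid))   # well below 2 here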

4. Heteroskedasticity

A violation of #5, that the model should be homoskedastic. That is, for all observations, there is an equal variance in the distribution of the y's. A violation of this assumption could result from the variance in the error terms being related to an explanatory factor or the outcome, e.g., does the variance in earnings go up with age?

a) To check
Look at a scatter diagram of Yi and Xi. Look at the residual plot of ei vs. Xi.

b) Effects
Coefficients are unbiased and consistent, but inefficient. Estimated SEs are biased and inconsistent.

c) To fix
You could use weighted least squares to fit the regression model instead of ordinary least squares (which is what we've been using). Weighted least squares gives greater weight to observations with the smallest variance. If this problem happens with demographic data you could redefine the variables, such as in per capita terms, to get everything on the same basis.

d) Example
Does the variance of earnings go up with age? This would be observed if the residuals have a wider distribution at older values of age.
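As a sketch of the weighted least squares fix, here is a hypothetical example where the error spread grows with x (say, age); with statsmodels, the weights are conventionally the inverse of each observation's assumed error variance:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(20, 60, n)                      # e.g., age
y = 5.0 + 0.8 * x + rng.normal(0, 0.1 * x)      # error spread grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Weight each observation by the inverse of its (assumed) error variance
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()

print("OLS slope SE:", ols.bse[1])
print("WLS slope SE:", wls.bse[1])    # typically smaller: WLS is more efficient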

5. Multicollinearity

A violation of #6. The problem is that there is not enough variation in the data, so that two or more explanatory variables are highly correlated with each other.

a) Symptoms
The model has a high R2 but low t-values on the individual coefficients. The coefficients change dramatically when you drop a variable. The simple correlation between Xs is high (look at the correlations in SPSS; if above .8, worry; if above .9, do something about it!).

b) Effects
OLS regression cannot disentangle the effect of one factor from another. The SEs are high on the coefficients. This does not violate the regression assumptions (OLS is still BLUE). Estimates are unbiased and consistent. Standard errors are inflated. Computed t-statistics will fall, making it difficult to see significant effects.

c) To fix
If two variables are measuring the same thing, you don't need them both -- drop one! Or just live with the problems of including both. You can use an F test to check the predictive power of a group of correlated variables. Or you can get a larger sample, and this may solve your problem.

d) Example
If you have both number of hours worked per week on average and number of hours worked per year on average, the latter will be roughly 52 times the former. You don't need both.
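A quick way to screen for this outside SPSS is to inspect the correlation matrix of the explanatory variables. A sketch with hypothetical columns mirroring the hours-worked example:

import numpy as np

rng = np.random.default_rng(3)
n = 100
hrs_week = rng.normal(40, 5, n)
hrs_year = 52 * hrs_week + rng.normal(0, 10, n)   # nearly a multiple of hrs_week
other = rng.normal(0, 1, n)

X = np.column_stack([hrs_week, hrs_year, other])
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))    # the hrs_week/hrs_year correlation will be ~1.0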

6. Measurement error

a) Symptoms
The measured values of one or more variables (X or Y) are known not to be "right": there is a known problem of measurement, or the variable values just don't look right, or there is theory about mismeasurement.

b) Effects
If the error is in the dependent variable, then the variability of the dependent variable is increased and thus the overall statistical fit of the equation is decreased. An error of measurement in the dependent variable does not cause any bias in the estimates. If the error is in an explanatory variable, then the coefficients and standard errors are biased and inconsistent -- even if the error is random. If the error is random, the coefficient will be biased toward zero, that is, toward not showing the predictor to be significant.

c) To fix
Find an alternative quantity to represent the variable in question, one that is measured without error.

d) Example
Suppose that we believe that rainfall is measured with a random error because of the way it is measured (it has to fall into a rain gauge, which only works well when rainfall is straight down, not when the wind is blowing, which is what happens during many storms). The effect is that the influence of rainfall may be underestimated. In a model predicting temperature (from rainfall and some other factors), the fix would be to use another rainfall estimation technique, such as radar or satellite data.
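The attenuation toward zero is easy to demonstrate by simulation. A sketch with made-up numbers: the true slope is 2, but adding random noise to the measured predictor pulls the OLS estimate toward zero.

import numpy as np

rng = np.random.default_rng(4)
n = 10000

x_true = rng.normal(0, 1, n)
y = 2.0 * x_true + rng.normal(0, 1, n)
x_noisy = x_true + rng.normal(0, 1, n)    # predictor measured with random error

def ols_slope(x, y):
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[1]

print("slope with true x:", ols_slope(x_true, y))     # ~2.0
print("slope with noisy x:", ols_slope(x_noisy, y))   # ~1.0, attenuated toward zero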

7. Non-normal errors

A violation of #7. One assumption of the model is that the distribution of Y values at each X follows a normal distribution. The result is that the variance in Y is constant for each Xi, so that the errors, ei, are normally distributed with a mean of 0 and a variance of σ2.

a) Symptoms
Look at the distribution of the residuals to see if they are normally distributed. This can be done graphically, by constructing a frequency distribution (histogram), or numerically, by looking at the mean and standard deviation of the residuals.

b) Effects
Coefficients are unbiased, but SEs and t-statistics are biased and inconsistent, because hypothesis testing relies on the assumption of normality of the error terms.

c) To fix
This is a situation where you might specify a different functional form (like a polynomial or natural log), or get a larger sample, or add omitted variables and check the residuals again.
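A sketch of the graphical and numeric checks on saved residuals (the data here are hypothetical, with deliberately heavy-tailed errors; scipy's normaltest is one common numeric option):

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
n = 200
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)   # heavy-tailed (non-normal) errors

resid = sm.OLS(y, sm.add_constant(x)).fit().resid

print("mean:", resid.mean(), "sd:", resid.std())
print("skew:", stats.skew(resid), "kurtosis:", stats.kurtosis(resid))
print("normality test p-value:", stats.normaltest(resid).pvalue)   # small p => non-normal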

Summary – Steps in Residual Analysis

1. Check for a misspecified model by plotting the residuals against each of the quantitative independent variables. Look for a curvilinear trend. This indicates the need for a quadratic term in the model.

2. Examine the residual plots for outliers. Draw lines on the residual plots at 2- and 3-standard-deviation distances below and above the 0 line. Look for potential outliers. Check that approximately 5% of the residuals exceed the 2-standard-deviation lines. Identify the nature of any outliers. Rerun the regression without the outlier to determine its effect on the analysis.

3. Plot a histogram of the residuals. Check for any obvious departures from normality. Extreme skewness may be due to outliers and indicates the need for transformations.

4. Plot the residuals against the predicted values of y. Look for a cone-shaped pattern or some other pattern. This indicates the need to refit the model using a transformation such as ln(y).
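These four steps can be run as one script. A sketch in Python with hypothetical data that has both curvature and a spreading error variance, so the plots have something to show:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
n = 150
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x**2 + rng.normal(0, 1 + 0.3 * x, n)   # curvature + spreading errors

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid, pred, sd = fit.resid, fit.fittedvalues, fit.resid.std()

fig, ax = plt.subplots(1, 3, figsize=(12, 3.5))

# Step 1: residuals vs. each quantitative predictor (look for curvature)
ax[0].scatter(x, resid); ax[0].set_title("residuals vs. x")

# Step 2: outlier reference lines at +/- 2 and 3 standard deviations
for k in (-3, -2, 0, 2, 3):
    ax[0].axhline(k * sd, linestyle="--" if k else "-")

# Step 3: histogram of residuals (look for non-normality / skewness)
ax[1].hist(resid, bins=20); ax[1].set_title("residual histogram")

# Step 4: residuals vs. predicted values (look for a cone shape)
ax[2].scatter(pred, resid); ax[2].set_title("residuals vs. predicted")

plt.tight_layout(); plt.show()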

Exam II – Revision Topics

Revision Problems

1. A medical researcher wants to model the extent of lung damage in emphysema patients as a function of two variables: the number of years the patient has smoked (x1) and the gender of the patient (x2). Fifty emphysema patients are used in the study, and the response y for each patient is a subjective score ranging from 0 to 50. (High scores indicate extensive lung damage.)

a) Identify the independent variables as qualitative or quantitative.

b) Write the regression model for predicting the lung damage score from the two variables identified in part (a).

c) Using a female patient as the base level, the researcher obtained the prediction equation below:

Y = -0.85 + 0.65x1 + 0.09x2

What is the estimated difference between the lung damage scores for a male and female who have both smoked for 20 years?
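(A quick check of the arithmetic: for a male, Y = -0.85 + 0.65(20) + 0.09(1); for a female, Y = -0.85 + 0.65(20) + 0.09(0). The x1 terms cancel, so the estimated difference is just the dummy coefficient, 0.09.)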

2. To help firms determine which of their executive salaries might be out of line, a management consultant fitted the following multiple regression equation from a database of 270 executives under the age of 40:

SAL = 43.4 + 1.24 EXP + 3.60 EDUC + 0.74 GEN
(SE)         (.30)      (1.20)      (1.10)

s = 16.4

where SAL = annual salary ($1,000)
EDUC = number of years of post-secondary education
EXP = number of years of experience
GEN = dummy variable, coded 1 for male, 0 for female

a) Fred Kopp is a 32-year-old vice president of a large restaurant chain. He has been with the firm since he obtained a 2-year MBA at age 25, following a 4-year degree in economics. He now earns $126,000 annually.

i) What is Fred’s fitted salary?

ii) How many standard deviations is his actual salary away from his fitted salary? Would you therefore call his salary exceptional?

iii) Closer inspection of Fred’s record showed that he had spent two years studying at Oxford as a Rhodes Scholar before obtaining his MBA. In light of this information, recalculate your answers to i and ii.
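A worked sketch of the arithmetic in Python (assuming EXP counts years with the firm, so EXP = 32 - 25 = 7, and EDUC = 4 + 2 = 6 before the Oxford information comes to light):

# Fitted salary from the consultant's equation (units: $1,000)
def fitted_sal(exp, educ, gen):
    return 43.4 + 1.24 * exp + 3.60 * educ + 0.74 * gen

s = 16.4           # standard error of the estimate
actual = 126.0     # Fred's actual salary, in $1,000

# i) and ii): EXP = 7, EDUC = 6, GEN = 1
f1 = fitted_sal(7, 6, 1)
print(f1, (actual - f1) / s)    # ~74.4 fitted, ~3.1 SDs above: exceptional

# iii) counting the two Rhodes Scholar years: EDUC = 8
f2 = fitted_sal(7, 8, 1)
print(f2, (actual - f2) / s)    # ~81.6 fitted, ~2.7 SDs above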

In addition to identifying unusual salaries in specific firms, this regression can be used to answer questions about the economy-wide structure of executive salaries in all firms. For example,

b) Using this model, is there evidence of inequality of salaries based on gender?

c) Is it fair to say that each year’s education (beyond high school) increases the income of the average executive by $3,600 a year?