
Regression: An Introduction to Econometrics

Overview

The goal in the econometric work is to help us move from the qualitative analysis in the theoretical work favored in the textbooks to the quantitative world in which policy makers operate. The focus in this work is the quantification of relationships. For example, in the introductory economics course one of the central concepts was demand. It is one half of the supply-demand model that economists use to explain prices, whether it is the price of a stock, the exchange rate, or the price of bananas. One of the fundamental rules of economics is the downward sloping demand curve - an increase in price will result in lower demand.

Knowing this, would you be in a position to decide on a pricing strategy for your product? For example, armed with the knowledge that demand is negatively related to price, do you have enough information to decide whether a price increase or decrease will raise revenue? You may recall from a discussion in an intro econ course that the answer depends upon the elasticity of demand, a measure of how responsive demand is to price changes.

But how do we get the elasticity of demand? In your earlier work you were just given the number and asked how it would influence your choices, while here you will be asked to figure out what the elasticity figure is. It is here things become interesting, where we must move from the deterministic world of algebra and calculus to the probabilistic world of statistics. To make this move, a working knowledge of econometrics, a fancy name for applied statistics, is extremely valuable.

As you will see, this is not a place for the meek at heart. There are a number of valuable techniques you will be exposed to in econometrics. You will work hard on setting up the 'right' model for your study, collecting the data, and specifying the equation. Unfortunately, this is only the beginning. There will never be a magic button that produces 'truth' at the end of some regression, the favorite econometric technique for estimating relationships. You can also be assured you will not get it quite right the first time. There is, however, something to be learned from your 'mistakes'. To the trained eye, the statistics produced by any regression package paint a vivid, if somewhat blurred, picture of the problems with the model as specified. These are problems that must be dealt with because they can produce biases in the results that reduce the reliability of the regression and increase the chance we will not end up with an understanding of the true relationship. With existing software packages, anyone can produce regression results, so one needs to be aware of the limitations of the analysis when evaluating regression results.

In this overview of econometrics we will begin with a discussion of Specification. What equation will we estimate? Does demand depend upon price alone, or does income also matter? Is demand linearly or nonlinearly related to price? These are the types of questions discussed in this section. We will then shift to Interpretation, a discussion of how to interpret the results of our regression. What if we find out demand is negatively related to price? Should we believe the result? And what about the times where demand turns out to be positively related to price? How could we explain this result, and do we actually have proof demand curves should be positively sloped? This will be followed by a discussion of the assumptions of the Classical Linear Model, all of the things that must go right if we are to have complete confidence in our results. And for those instances where we have some reason to believe there is a problem, we have a discussion of the Limitations of the Classical Linear Model where the potential problems as well as solutions are discussed.

When you have completed this section, you should be well aware of the fact the estimation of 'economic relationships' has both an art and a science component. Given the technology available to people today, anyone can run regressions with the use of some magic buttons. Computer programs exist that allow us to estimate the regressions, perform diagnostics to evaluate the model, and correct any problems encountered. Do not, however, be misled into thinking your empirical work will be easy. As you will find with your own work, there is a long road of painful, time-consuming work ahead of anyone who embarks on an empirical project. Furthermore, there are many places where you can take a wrong turn. This section was designed to offer you some guidance as you make the journey, to help you know in advance the obstacles you are likely to encounter and the best way of dealing with them.

There is a second reason for spending the time studying econometrics and conducting your own empirical project. The scientific advances are not a guarantee we are more likely to uncover the 'truth' that we are searching for. The world is in many respects the same as it was when Darrell Huff was prompted to write his wonderful little book entitled How to Lie With Statistics. In the hands of an unscrupulous researcher, modern econometric software increases the chances someone can find the results they want. The complexities of the statistical analysis simply make it harder to find the biases in the study. Your time spent here will simply increase the chances of recognizing the biases.

For an on-line overview of regression analysis you might want to check out the DAU and Stockburger sites.

You should also check out the worksheet Regression, the output from an Excel regression. The data on the sheet simple - years and the accompanying rate series - appear in cells A3 - D50. Once the data set is complete, you then select Data Analysis in the Tools menu. You will then select Regression, which will bring up a dialogue box. At this time you highlight the data set for the input box. The Y variable is the variable you want to explain, in this case the interest rate, and it is the dependent variable. The X variable is the explainer, in this case the inflation rate. We are going to use regression to see the extent to which the inflation rate explains interest rates. You then specify the top left cell of the space where you want the output to appear. For an interpretation of the results, you should check out the Interpretation page. In these results you find the coefficient of inflation to be .68 - every time the inflation rate rises by one percentage point, interest rates rise by nearly .7 percentage points. The t-stat is 7.22, which indicates you should believe in this relationship, and the R2 tells you the model helps explain about one half the variation in interest rates.
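If you would rather work outside of Excel, the same bivariate regression can be sketched in a few lines of Python. The series below are made-up stand-ins for the inflation and interest rate columns on the simple sheet, so the numbers will not match the output described above; only the mechanics are the point.

# A minimal sketch of the bivariate regression described above, done in Python
# rather than Excel. The two series are hypothetical stand-ins for the
# inflation rate and interest rate columns.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
inflation = rng.uniform(1.0, 12.0, size=48)                  # hypothetical inflation rates
interest = 2.4 + 0.7 * inflation + rng.normal(0.0, 2.0, 48)  # hypothetical interest rates

X = sm.add_constant(inflation)       # adds the intercept term
results = sm.OLS(interest, X).fit()  # ordinary least squares

print(results.params)     # intercept and slope (the 'Coefficients' column in Excel)
print(results.tvalues)    # t-statistics for each coefficient
print(results.rsquared)   # share of the variation in interest rates explained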

Mechanics

Once you have decided on estimating a relationship using regression analysis, you need to decide upon the appropriate software package. There are some very useful software packages designed primarily for regression-type analysis you may want to explore if you were doing some high-powered regression work or you were using the software in other courses. Here, however, we will stick with Excel, which allows you to run some simple regression analyses. The first step is creation of the data set, an example of which can be found on the simple tab on the Regression spreadsheet example. On the simple tab example we will be looking at a bivariate regression - a regression with only one right-side variable. The estimated equation will be of the form Y = a + bX + e, where Y is the variable being explained (dependent) and X is the variable doing the explaining (independent).

To estimate the regression you simply select Data Analysis from the Tools menu and within this select Regression. You will get a dialogue box into which you need to input the relevant data. In the simple example we will be trying to identify the impact inflation has on interest rates. Because the causality runs from inflation to interest rates, the interest rate will be the dependent variable and the inflation rate will be the independent variable. You input the dependent variable in the Input Y Range: by highlighting the interest rate column (C3:C50). You then input the independent variable in the Input X Range: by highlighting the inflation rate column (B3:B50). Because I did not use the labels, you do not check off the labels box. I then tell it I would like the output to have its top left corner in cell F2. After checking off all the options you get all of the information on the simple tab.

Below is the output that appears with the regression results. While all of this output gives you important information about the relationship, at this time your attention should be directed to just a few of the features that are highlighted in red. The first is the adjusted R Square. This tells the reader that of all of the year-to-year variation in the interest rate, about 52% of it can be explained by movements in the independent variable (inflation rate). The second thing to look for is the coefficients. In this example, the regression analysis suggests the best equation for these data would be:

Interest rate = 2.44 + .68*Inflation rate

What we are most interested in is the coefficient of the Inflation rate, which in this example is .68. This means every time the inflation rate rises by one percentage point (from 4 to 5 percent, say), the interest rate rises by .68 percentage points. The final piece of valuable information is the t-stat, which tells us how much to "believe" in the coefficients. You will notice a t-Stat is associated with each coefficient so you can actually test the "believability" of all coefficients. Fortunately there is a convenient rule of thumb for the t-Stats. If the absolute value of the t-stat is greater than 2, then you believe the coefficient is not zero, which is what it would be if there were no relationship. In this example you will see the t-Stats for both the intercept and the coefficient of the inflation rate are greater than two, so you can assume the interest rate is affected by the inflation rate.

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.72901
R Square              0.531456
Adjusted R Square     0.52127
Standard Error        1.990115
Observations          48

ANOVA
              df    SS          MS          F          Significance F
Regression    1     206.6476    206.6476    52.1764    4.22E-09
Residual      46    182.1856    3.960557
Total         47    388.8332

              Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept     2.443228        0.481394          5.075321    6.82E-06    1.474233     3.412222     1.474233       3.412222
X Variable 1  0.680573        0.094219          7.223323    4.22E-09    0.490921     0.870226     0.490921       0.870226

Specification

Before becoming involved with the more sophisticated statistical questions regarding regression analysis, it is useful to briefly discuss some of the preliminary issues one must deal with in any empirical project. First, it is important to note the difference between causality and correlation. The statistical analyses that are used to determine the nature of the causality never actually allow us to prove the causality. It is impossible to separate out causality from correlation. All we can reasonably hope to do is find statistical correlations that do not disprove our hypotheses.

Given this limitation, once the decision has been made to undertake an empirical project, the principal investigator must make a number of important choices. A schematic outline of the process is presented below. The project starts with the choice of the theoretical relationship which one wants to study, the hypotheses one wants to test. Before proceeding with the development of the specific model, it is appropriate to review the scholarly literature. There is little advantage to be gained by reinventing the wheel, and in reviewing these articles you might find some information that would help in the other stages of your empirical analysis. A good place to start your search of the literature would be the Journal of Economic Literature.

After the review of the literature has been completed and the outlines of the model settled on, there is the need to identify the data necessary to estimate your model. What are you going to use as the dependent variable? For example, consider an empirical project designed to identify the link between investment spending and interest rates. There is a need to specify which interest rate it is we are concerned with explaining. Is it the rate on 3-month government securities or the rate on 30-year bonds we expect to affect investment decisions?

A decision must also be made concerning the choice of the independent variables. The choice of the regressors is based on the underlying economic theory. The variables should be selected because there is a reason to believe they are causally related to Y. If, for example, your goal was to estimate a demand equation for a certain product, then based on your knowledge of microeconomic theory you would need to identify at the very least the appropriate data to capture the influence of income, population, and the price of related goods. In each of these instances, you will be making choices that will significantly affect the findings of your study. Furthermore, for every variable selected, you should have an a priori expectation for the estimated coefficient. Based on our understanding of economic theory, for example, the coefficient of price in the demand equation should be negative and the coefficient of income should be positive.

A good example of the importance of the proper specification of the independent variable would be the treatment of demographic factors in a demand equation. The normal choice would often be the total population, but there may be instances where this is likely to be an inappropriate choice. Consider the demand for motorcycles. Is it the growth in the total population or is it the growth in the population of young people that matters? To the extent the primary market for motorcycles is younger people, then the use of total population as an independent variable could cause problems. This would be the case if there was a divergence between the two growth rates, a phenomenon of the 1970s. Similarly, a model for housing demand would most certainly include as an independent variable some measure of population. Is it the number of people or is it the number of separate households that is the primary determinant of demand? The choice you make will have a significant impact on the results since we find that in the 1980s the growth rates of the two differed substantially.

The choice of dependent and independent variables involves a number of other crucial decisions. For time-series analysis, care has to be taken to avoid the mixing of seasonally adjusted and unadjusted data. This is not a problem when dealing with annual data, but it is a potential problem when dealing with quarterly or monthly data. It is also often relevant to adjust data for population. For example, in a demand equation for a specific product, it might be personal income per capita rather than personal income that is the appropriate independent variable.

One also has choices with regard to the form of the variables. Let us assume we believe the unemployment rate has an influence on demand. Is it best captured by the level of the unemployment rate, which would be used as an indicator of 'ability to pay', or would it be better measured by the change in the unemployment rate, which would capture the 'expectations' effect of a change in the direction of the economy? When estimating a savings equation, should the dependent variable be aggregate savings (S), the average savings rate (S/Y), or the year-to-year change in the savings rate (Δ(S/Y))? Most likely, the answer to these questions will be, at least in part, determined by the empirical work.

One must also be very careful to adjust the data for the influences of inflation. I will always recall my undergraduate students who reported that the 1970s was a period of high growth because GNP grew more rapidly during this period than in the 1960s and 1980s. This is certainly not the case. The 1970s figures were primarily a reflection of higher inflation rates, and any comparison should account for these substantial differences. Returning to the product demand example, the model should certainly be specified in terms of real, or inflation-adjusted, income. Similarly, when we examine the relationship between investment spending and interest rates, it is the real interest rate which we would expect to use as an independent variable.

There is also the problem of dealing with phenomena that cannot be easily or adequately quantified. In a model of the inflation-unemployment trade-off, there is reason to believe there was a significant difference between the 1960s and 1970s. Another situation would be a model of wage determination where we were attempting to identify the relationship between average earnings (W) and the number of years of education (E). In the wage study there would be a need to capture the gender effect because of the sharply different profiles for males and females. In fact, it is questions such as this that are at the center of many of the discrimination cases that get to the courtroom. Similarly, in any study of toy sales based on quarterly data, it would be important to take explicit account of the fact sales are typically higher in the fourth quarter.

Each of these problems can be solved with the use of dummy variables. A dummy variable is a 0-1 variable that can best be viewed as an on-off switch. The left hand diagram describes the situation where we would want to add an intercept dummy, a variable that has a value of 0 for each year in the 1960s and a value of 1 in the 1970s. The estimated equation would be:

i = b0 +b1*u +b2*D

The diagram indicates a situation where the coefficient of D would be positive; the intercept is shifted upwards in the 1970s. The equations for the two time periods would be:

i = b0 +b1*u (1960s)

i = (b0 + b2) + b1*u (1970s)

A somewhat different situation is depicted in the second diagram. Here it is not the intercept but the slope that seems to vary. For this example consider the situation where the gender variable (G) would have a value of 0 for each observation of a woman's wage, and a value of 1 for each man's wage. We could use this dummy variable to test the hypothesis that the education-earnings profile for women is flatter than it is for males, that the extra earnings men receive for an extra year of education are greater than the gains for women. The equation would be:

Ei = b0 + b1*Educ + b2*G*Educ

In this case, evidence of the steeper slope for the males would be found in a positive coefficient for b2. The slope for females would be b1, while the slope for males would be b1 + b2. The equations for the two groups would be:

E = b0 +(b1+b2)*Educ (males)

E = b0 + b1*Educ (females)

Finally, in the retail sales equation in which we attempt to identify the link between sales (S) and income (Y), it would be appropriate to specify three dummy variables. The first dummy variable would have a value of 1 in the first quarter and 0 otherwise, the second would have a value of 1 in the second quarter, and the third dummy variable would have a value of 1 in the third quarter.

S = b0 +b1*D1 + b2*D2 + b3*D3 +b4*Y

In this case, evidence of seasonal patterns in retail sales would be found in the coefficients for the dummy variables. The equations for the four quarters would be:

S = (b0 + b1) + b4*Y   (Q1)
S = (b0 + b2) + b4*Y   (Q2)
S = (b0 + b3) + b4*Y   (Q3)
S = b0 + b4*Y          (Q4)

If sales were highest in the fourth quarter then the coefficients for all of the dummy variables would be negative.
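A sketch of how these seasonal dummies might be constructed and estimated in Python is shown below. The quarterly sales and income series are hypothetical; only the dummy construction mirrors the specification S = b0 + b1*D1 + b2*D2 + b3*D3 + b4*Y.

# Sketch: seasonal (intercept) dummies in a quarterly retail sales equation.
# The sales and income series are made up; the fourth quarter is the omitted base case.
import numpy as np
import statsmodels.api as sm

quarters = np.tile([1, 2, 3, 4], 10)                 # ten years of quarterly observations
rng = np.random.default_rng(1)
income = np.linspace(100.0, 200.0, 40) + rng.normal(0.0, 5.0, 40)
sales = 20.0 + 0.5 * income + np.where(quarters == 4, 15.0, 0.0) + rng.normal(0.0, 3.0, 40)

D1 = (quarters == 1).astype(float)   # 1 in the first quarter, 0 otherwise
D2 = (quarters == 2).astype(float)
D3 = (quarters == 3).astype(float)

X = sm.add_constant(np.column_stack([D1, D2, D3, income]))
results = sm.OLS(sales, X).fit()
print(results.params)   # with Q4 sales highest, the three dummy coefficients come out negative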

Having decided on the appropriate independent variables, the next issue involves the choice of time-series or cross-section analysis. Returning to the interest rate problem, one possibility would involve a study of investment spending and interest rates for the year 1991 for a sample of 35 countries. A second approach could focus on the behavior of these two phenomena in the U.S. for the past 30 years. Each approach has its strengths and weaknesses and its econometric peculiarities. I suspect, however, the majority of the work you are likely to do will be time-series analysis. When you work with time-series data you must decide on both the time period and the frequency of the data (daily, weekly, monthly, quarterly, annually).

We now have the variables and we have the data. The final decision to be made is the choice of the estimation procedure. There are many possibilities open to the researcher interested in quantifying a specific relationship. At this time I intend only to discuss linear models, equations that are linear in their parameters. Furthermore, I do not intend the discussion of regression analysis that follows to be a replacement for statistics and econometrics texts. The emphasis here will be on a brief overview of the process one goes through in arriving at a finished product. We will begin at the beginning with the single-equation, bi-variate linear regression model. The simplest form of the model is:

Yi = B0 + B1Xi + ei,   i = 1...n, where:

• Yi = ith observation on the dependent variable
• Xi = ith observation on the independent variable
• ei = ith observation on the error term
• B0, B1 = the parameters to be estimated
• n = number of observations

As is often the case, a picture can save one a good deal of explaining. The data collected on variables Y and X are presented in a scatter diagram below. Linear regression analysis identifies the equation for the straight line that best captures the 'flavor' of the scatter. More specifically, the regression procedure specifies the values of the parameters B0 and B1 so that we have a specific equation, which will allow us to calculate the 'average' value of Y [AVG(Y)] given the value of X. What remains unexplained by the equation is captured in the error term. In the diagram below, the actual value of Y for the ath observation is Ya while the model estimates AVG(Ya) as the value for Y. The difference between these two is the error term.

Bi-variate Regression: The Graphics

We can never expect a perfect fit with our model because there are always going to be some minor influences on Y omitted in the specification of the model, human behavior will always contain an element of randomness or unpredictability, the variables may not be measured correctly, and the model may not be truly linear. We do, however, hope these problems are minor, and when they do surface, we can modify our analysis in a number of ways to help correct the problems. In any event, as we will see later, the standard linear regression model is designed to choose the values for the parameters in such a way as to minimize the errors. For example, in the diagram below it is obvious that the equation Y = B2 + B3X does not adequately reflect the data and that the error terms would on average be larger. Stated somewhat differently, the second equation does a much poorer job of representing the data.

Alternative Regression Equations

If this were the end of the story, it would be a happy one. The fact is there are few, if any, instances where the bi-variate model is appropriate because there are few cases where the value of a dependent variable is influenced by only one independent variable. It is more likely the dependent variable (Y) will be influenced by a number of independent variables. In this case the linear regression model can be written as:

Yi = B0 + B1X1i + B2X2i + ... + BKXKi + ei,   i = 1...n, where:
• Yi = ith observation on the dependent variable
• Xji = ith observation on the jth independent variable
• ei = ith observation on the error term
• B0 ... BK = the parameters to be estimated
• K = the number of independent variables
• n = number of observations

It is also true there are many times when the linear model depicted above does not adequately reflect the data. One possible alternative specification would be the exponential form:

Y = e^a1 * X1^b1 * X2^b2 * e^e

If you believed this was the appropriate model, you would employ a logarithmic transformation, which makes the equation linear in its parameters:

lnY = a1 + b1lnX1 + b2lnX2 + e
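A sketch of this transformation in Python appears below; all of the series are hypothetical and the point is simply that, once logs are taken, ordinary least squares applies directly and the slope coefficients can be read as elasticities.

# Sketch: estimating the exponential form by taking logs first,
# ln(Y) = a1 + b1*ln(X1) + b2*ln(X2) + e. All series are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X1 = rng.uniform(1.0, 10.0, 60)
X2 = rng.uniform(1.0, 10.0, 60)
Y = np.exp(0.5) * X1**1.2 * X2**-0.4 * np.exp(rng.normal(0.0, 0.1, 60))

X = sm.add_constant(np.column_stack([np.log(X1), np.log(X2)]))
results = sm.OLS(np.log(Y), X).fit()
print(results.params)   # estimates of a1, b1, b2; b1 and b2 are the elasticities of Y with respect to X1 and X2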

In the case of the exponential model, the sign and size of the estimated coefficients have a significant impact on the 'picture' of the relationship. The graph below shows the relationship between Y and X1 for different values of the parameter b1. When b1 > 1 we have the familiar parabola and when b1 < 0 we have the hyperbola. If a scatter diagram had one of these shapes, it would be appropriate to use the exponential function.

Exponential Function

An alternative specification would be the semi-log equation. Two possibilities would be the equations:

Y = a1 + b1lnX1 + b2X2 + e

lnY = b0 +b1*X1 +b2X2 +e

Pictures of these are presented below. While both of these are legitimate equations, neither is frequently used. The reason is there is seldom, if ever, a compelling reason to use either of these equations since simpler specifications with easier interpretations can be used which have similar 'pictures'. It is also a bit of work to calculate the elasticities, which are often the primary concern of the researcher.

The list of alternative functional forms certainly extends beyond the few mentioned here. Some of the more popular forms are polynomials and equations containing inverses. I would suggest that your experimentation with these alternative forms be restricted to those times when all else fails. A more advanced problem involves the specification of a model in which two or more of the variables are mutually dependent. When analyzing any market, for example, it is a safe bet that the quantity demanded depends upon the market price just as the market price depends on the amount demanded. In this situation it is important to construct a multi-equation model in which you estimate the parameters of all the equations simultaneously. More will be said about this problem in the following section.

Interpretation

The task for the researcher at this point, after collection of the data and estimation of the equation, is the interpretation of the results. A thorough analysis of the results will focus on the extent to which the model adequately explains the dependent variable and the correspondence between the values of the estimated parameters and a priori expectations. As for the first of these, the 'goodness of fit', we can return to the simple bi-variate diagram. There are a number of measures of the 'goodness of fit', but the one which economists tend to focus on begins with the total sum of squares (SST), which is written as:

SST = Σ(Yi - Ȳ)², where Ȳ is the mean of Y

By acknowledging that Yi = Ŷi + ei, where Ŷi is the value of Y predicted by the regression and ei is the error term, the total sum of squares can be decomposed and rewritten as:


SST = Σ(Ŷi - Ȳ)² + Σei²

As evident in the diagram below, the total sum of squares can be decomposed into two separate components, the sum of squares of the difference between the predicted value of the dependent variable and the mean of Y (Ŷi - Ȳ) and the sum of squares of the residuals. The first of these terms, referred to as the regression sum of squares (SSR), represents the amount of the deviation in Y from its mean that is explained by the model. The unexplained deviations of Y from its mean are captured in the second term, the error sum of squares (SSE).

Decomposition of Variance

This decomposition of variance provides us with the primary measure of 'goodness of fit', the measure of how adequately the dependent variable is explained by the model. This measure, known as the coefficient of determination, is simply defined as the ratio of the explained to the total sum of squares:

R2 = SSR/SST = 1 - SSE/SST = 1 - Σei² / Σ(Yi - Ȳ)²

It should be clear from this formulation that the R2 is bounded by 0 and 1. As the model's explanatory power increases, the R2 approaches 1. On the other hand, as the scatter of points becomes more random and the errors increase, the R2 approaches 0. One of the undesirable features of the R2 is the fact it can only increase as more independent variables are added. The difficulty with the addition of independent variables is that it reduces the degrees of freedom (n-K-1), the number of observations less the number of estimated parameters. Any decrease in the degrees of freedom will result in a loss in the reliability of the model. For this reason, the adjusted R2 has been created. The adjusted R2 is defined as:

Adjusted R2 = 1 - (SSE/(n-K-1))/(SST/(n-1))
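Using the sums of squares reported in the Excel output earlier (SSR = 206.6476, SSE = 182.1856, SST = 388.8332, n = 48, K = 1), these definitions can be checked directly. The short Python sketch below is just that arithmetic check, not a new estimation.

# Arithmetic check of R-squared and adjusted R-squared using the sums of
# squares from the Excel output above.
SSR, SSE, SST = 206.6476, 182.1856, 388.8332
n, K = 48, 1

r_squared = SSR / SST                                      # about 0.5315, matching 'R Square'
adj_r_squared = 1 - (SSE / (n - K - 1)) / (SST / (n - 1))  # about 0.5213, matching 'Adjusted R Square'
print(r_squared, adj_r_squared)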

The coefficient of determination gives us a good 'gut feeling' for the goodness of fit, but it possesses no statistical properties. It can, however, be slightly modified to allow us a 'statistical' test of the goodness of fit. As you will find in your statistics book, the decomposition can be reformulated to derive the F-ratio:

F = (SSR/K)/(SSE/(n-K-1))

This is simply the ratio of the explained to the unexplained sum of squares adjusted for the number of regressors and the number of observations. Unlike the coefficient of determination, the F-ratio has no upper bound. For high values of F, we can be confident the model does an adequate job of explaining Y. Low values of F, meanwhile, indicate the model is inadequate as an explanation for Y. The actual division between the accept and not accept regions can be determined from the F Table. Having assessed the overall explanatory power of the model, the researcher would turn to the individual parameter estimates. Are the parameter estimates consistent with their predicted sign based on the underlying economic theory? Do the parameter estimates suggest that the variable is important? Is there reason to believe that the dependent variable is statistically related to the independent variable or is it possible that the two are statistically unrelated?

The first two of these questions are generally easily answered. If you have estimated an equation explaining investment expenditures and the coefficient of the interest rate is positive, the results of your analysis differ significantly from the established theory. Similarly, if you estimated a consumption equation and found the coefficient of income was .2, there would be reason to question the analysis. The parameter has the correct sign, but it is sharply lower than our economic theory would suggest. In both cases, it would be time to look very closely at the model and begin some diagnostic tests to determine what could be wrong. It would only be after extensive testing and re-estimation that one would accept these results.

As for the question of statistical significance, the standard regression package will always provide a non-zero estimate of the parameters. To help us determine whether there is enough evidence to accept the hypothesis that the parameter is not zero, the t-test was designed. The t-statistic is defined as:

tk = Bk/sk

where Bk is the estimated parameter value for the kth independent variable and sk is the estimated standard error of this coefficient. As with the F-statistic, a high value for t means that there is reason to accept the model. More specifically, a value for t above 2 can generally be accepted as evidence of a non-zero parameter value and therefore evidence of a relationship between the kth independent variable and Y.
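For the inflation example above, this is exactly the number reported by Excel: the coefficient divided by its standard error.

# The t-statistic on the inflation coefficient in the earlier output is simply
# the estimated coefficient divided by its standard error.
coefficient = 0.680573
standard_error = 0.094219
t_stat = coefficient / standard_error   # about 7.22, well above the rule-of-thumb value of 2
print(t_stat)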

But can I really believe in my results? Not if you restrict your investigative 'research' to a casual study of these summary statistics. There are a myriad of possible problems with your work just as there are a wide array of solutions to those problems. To better understand the problems one is likely to encounter when estimating a model, it is necessary to go back to the beginning briefly. Most of the problems that surface are the result of the fact that the data has shown that some of the assumptions made during the specification of the model were inappropriate. It therefore seems appropriate to begin with a quick review of the assumptions.

Classical Linear Model

To answer the question of believability, one must look at the properties of the estimators. Ordinary least squares (OLS) is by far the most popular estimation technique. Its popularity stems from the fact it is simple and it will provide the best linear unbiased estimators under certain assumptions. OLS's limitations, meanwhile, are that the basic assumptions are quite restrictive and, when the assumptions cannot be justified, OLS is not the 'best' estimation technique available. If it turns out that some of these assumptions do not apply to the problem, then one must consider alternative estimation techniques.

The assumptions that must be valid for OLS to have its desirable estimation properties form the structure of what is called the Classical Model. The equation which we will be estimating is:

Yi = B0 + B1X1i + B2X2i + ... + BKXKi + ei,   i = 1...n

The classical assumptions are listed in the table below.

Assumptions of the Classical Model
• 1. The regression model is linear in the coefficients and the error term.
• 2. The error term has a zero mean.
• 3. The error terms are not serially correlated.
• 4. The error term has a constant variance.
• 5. The independent variables are not linearly related.
• 6. The independent variables and the error term are unrelated.
• 7. The error term is distributed normally.

The first of these assumptions is, as we have seen before, much less restrictive than it appears on the surface. Linear regression refers to 'linear in the parameters' and can be applied to many nonlinear forms such as the polynomial, log, semi-log, and exponential functions. The assumption of a zero mean in the error term is also not very restrictive. In fact, as long as the equation has a constant term, this assumption is satisfied. The final assumption is primarily of value in hypothesis testing. Referring back to the equation above, you can envision that one of the goals of the analysis was to determine if the independent variable X1 had an influence on Y. By assuming that the error term is normally distributed, the researcher can rely on the t and F-statistics to test these hypotheses. Once again, there is reason to believe that this assumption is not too restrictive. The basis for this optimism is the Central Limit Theorem, which states:

The mean of a number of independent, identically distributed random variables will tend to be normally distributed, regardless of their distribution, if the number of random variables is large.

The implication of the assumption of normally distributed error terms is that the estimates of the parameters are normally distributed because the estimators of the coefficients are simply linear functions of the normally distributed error term. Stated somewhat differently, if we ran the above regression 100 times we would generate 100 estimates of the parameter B1. Three possible distributions for the parameter estimates of B1 are pictured in the graph below. These distributions can be identified by their means and variances. In distribution #1, the mean of the parameter estimate equals the true parameter value and there is a small variance in estimates about the mean. Distribution #2 has the same mean, but it has a higher variance, while Distribution #3 has a large variance and a mean that is not equal to the true parameter value. In a comparison of the estimation techniques that produced these distributions of parameter estimates, Distribution #1 would clearly be the preferred choice. This distribution is the one most likely to give us an estimated parameter with a value close to the 'true' parameter. Distribution #2, meanwhile, will provide us with an unbiased estimate of the parameter, but it is not as efficient as #1 because there is a greater chance of the estimate being far from the true parameter value. As for Distribution #3, it is less desirable because it provides biased parameter estimates. Even if we can reduce the variance, the parameter estimate does not tend to center on the true value of the parameter.

So where does the OLS model fit in? What is the distribution of the estimators from this simple technique? The answer to these questions is supplied by the Gauss-Markov Theorem, a centerpiece of all econometrics texts. Very simply, the Gauss-Markov Theorem states: If assumptions 1-6 of the Classical Model are valid, then the OLS estimate of B is BLUE, the minimum variance, linear, unbiased estimator of B.

Given this idealized world, we can now return briefly to the issue of hypothesis testing and the t-statistic that is at the center of any such testing. For those who are rusty on this issue, you should return to your statistics book to brush up. We begin with the specification of the hypothesis, or more specifically, with the specification of the null and alternative hypotheses. The null hypothesis generally expresses the range of values for the parameters under the assumption the theory is incorrect. For example, in a model of investment expenditures, a null hypothesis could be that the coefficient of interest rates is zero, evidence that interest rates do not influence total expenditures. The alternative hypothesis expresses the range of values likely to occur if the researcher's theory is correct. In this case the alternative hypothesis would be that the parameter is not equal to zero.

When testing hypotheses, there are two types of errors, referred to as Type I and Type II Errors. Type I errors are made when we reject a hypothesis which is actually true. Type II errors occur when we do not reject a false hypothesis. A graphical representation of the errors is presented in the following diagrams of the parameter estimates. In the left hand diagram, the true parameter is zero and our null hypothesis is B=0, but we end up with an estimate on the far right tail and we reject the null hypothesis, a Type I error. In the right hand diagram, the null hypothesis is once again B=0, but here the true parameter value is 1. If we happen to end up with a parameter estimate near zero, we may fail to reject the null hypothesis even though it is false, a Type II error.

Having developed the hypotheses, one must establish a decision rule which will specify the acceptable and unacceptable ranges of some sample statistic. In the case of the regression coefficient, there is a need to specify the critical value of the parameter estimate which separates the accept from the reject regions. What choice one makes depends to a large extent on the relative costs of Type I and Type II errors because generally any choice which will reduce the probability of one error will increase the probability of the other. A graphical representation of the specification of the critical value is presented below.

The top graph refers to the situation where we are dealing with a one-tailed test, such as the situation when we have Ho: B<=0. In this case we have specified a critical value (Bc) so that a value of B greater than Bc will lead us to reject the null hypothesis. In the lower graph the two-tailed test is depicted. In this case the null hypothesis would be Ho: B=0. The critical value is chosen so that any value of B greater than +Bc or less than -Bc will result in a rejection of the null hypothesis.

How is it that these critical values are chosen? Most econometricians use the t-test to test hypotheses concerning individual parameters. Returning to the typical regression equation:

Yi = B0 + B1X1i + B2X2i + ... + BKXKi + ei,   i = 1...n

the t-value for the ith coefficient can be calculated as:

ti = (Bi - Bhi)/s(Bi)

where:
• Bi = estimated regression coefficient
• Bhi = border value implied by the null hypothesis
• s(Bi) = estimated standard error of Bi

The value of the t-statistic for the parameter can then be compared with the critical value for the t-statistic that can be obtained from the tables at the end of nearly any statistics book. The choice of a critical value depends on the level of significance, or level of confidence, that you want in your study and on the degrees of freedom, which depend upon the number of observations and the number of estimated parameters. Once the critical value is chosen, the rule for the hypothesis test is:

Reject Ho if |ti| > tc

When reading empirical work in economics you are likely to see the use of 1, 5, and 10 percent levels of significance. The specification of a 5 percent level of significance means that the probability of observing a value for the t-statistic that is greater than the critical value is 5 percent if the null hypothesis were correct. If we turn this around, we could state that we have a 95 percent level of confidence, that the results are statistically significant at the 95 percent level. The table below provides a guide to the critical values for the t-statistics when there are 30 degrees of freedom.

Critical Values for t-statistic

Significance Level    One-tailed    Two-tailed
1 percent             2.457         2.750
5 percent             1.697         2.042
10 percent            1.310         1.697

What about the special case in which the null hypothesis is Ho: B=0? In this situation the border value for the null hypothesis is 0 and the t-statistic reduces to the ratio of the coefficient to the estimated standard error of the parameter.
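If a printed table is not handy, the same critical values can be pulled from a statistics library. The sketch below reproduces the 30-degrees-of-freedom values in the table above using scipy.

# Sketch: reproducing the critical t values in the table above (30 degrees of freedom).
from scipy import stats

df = 30
for level in (0.01, 0.05, 0.10):
    one_tailed = stats.t.ppf(1 - level, df)      # e.g. 2.457 at the 1 percent level
    two_tailed = stats.t.ppf(1 - level / 2, df)  # e.g. 2.750 at the 1 percent level
    print(level, round(one_tailed, 3), round(two_tailed, 3))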

Limitations of Classical Linear Model

So much for the good news. It is these next four assumptions that prove to be the most difficult to accept in most regression analyses. What are the problems with the parameter estimates caused by dropping these assumptions and how can the basic OLS estimation technique be altered to provide unbiased, efficient parameter estimates? As you will see, there are some simple tests that can be invoked to 'test' these assumptions or some features of the regression output which suggest a problem. Fortunately, once the problems have been diagnosed, there are some solutions. It is precisely these diagnostics and solutions which will be the focus in the following section. Please note that this will be a very brief presentation and that you are encouraged to explore these issues more fully in any econometrics or linear regression text.

Serial Correlation

What is it? The third assumption of the Classical Model is there are no interdependencies between the errors of the separate observations. Stated somewhat differently, knowledge of the error for any one observation should not help us anticipate the error for the next observation. First-order serial correlation is the most common form of the problem, but certainly not the only form. When it does exist, we can describe the relationship between the error terms by the equation:

et = r*et-1 + ut

where:
• et = the error term
• r = the parameter describing the linkage (-1 < r < +1)
• ut = a classical error term

In the diagrams below, examples of positive and negative serial correlation are presented. In the left hand side diagram, a positive error for any observation is a good indicator that the next error will be positive, while in the right hand diagram we find that a positive error is a good indicator that the error in the next observation will be negative.

Why do we Care?

Serial correlation's primary impact on the regression is that it reduces the efficiency of the OLS estimators. The problem is the pattern of serial correlation is likely to be assigned by OLS to the effect of one of the independent variables. In this sense OLS is more likely to misjudge the correct parameter value, a situation reflected in a higher variance in the distribution of the parameter estimates.

What do we Do?

The best treatment of the serial correlation problem depends upon the likely source of the problem. One possibility is an incorrect specification of the functional form. The left hand side diagram below depicts a situation where the 'true' relationship is nonlinear, but where a linear equation has been estimated. In this situation the errors for all observations below t1 and above t2 would be positive while they would be negative for the values of T between these two. In this case the appropriate correction would be to estimate a different functional form.

Serial correlation can also surface because an important independent variable has been omitted. In the right hand diagram a variable with a strong cyclical component has been estimated by a linear equation. An example of where one might encounter this would be a model of retail sales based on quarterly data. In this situation, fourth quarter sales can be expected to be 'abnormally' high and this information should be built into the model. One possible correction would be to include a dummy variable to capture the seasonal effect. Another would be to use seasonally adjusted data.


A third possibility is that we have some serial correlation that can not be traced to inappropriate functional form or missing variables. There are many possible types of serial correlation. The error in the current period could be related to the error two or four periods ago, or it could be related to some combination of previous errors. The most likely situation, and the one most closely studied, is first-order serial correlation. The most widely used test to detect the existence of serial correlation is the Durbin-Watson d Statistic. The Durbin-Watson Statistic varies from 0 to 4. A value of 0 indicates extreme positive serial correlation (et = et-1). On the other end of the spectrum would be extreme negative serial correlation (et = -et-1) which would have a d statistic of 4. If there were no serial correlation, the Durbin-Watson d statistic would approach 2.

The output from a regression package should include the Durbin-Watson Statistic. Armed with the d statistic and the number of observations and independent variables, one turns to the table containing the Durbin-Watson Statistic. Once here, you will note a difference from the t and F statistic tables. There are two values for d, dL and dU. If the regression d value falls below dL, the null hypothesis can be rejected while a value greater than dU means that the null hypothesis should not be rejected. If, however, the d statistic falls between these two numbers, the test is inconclusive.
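The d statistic itself is simple to compute once you have the residuals. A sketch, assuming the OLS residuals are already in hand and in time order:

# Sketch: computing the Durbin-Watson d statistic from a residual series.
# 'residuals' is assumed to be the vector of OLS residuals, in time order.
import numpy as np

def durbin_watson(residuals):
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Values near 2 suggest no first-order serial correlation; values near 0 or 4
# point to strong positive or negative serial correlation.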

If we get back the results which indicate the existence of serial correlation, the OLS model should be adjusted. The appropriate adjustment would entail the use of the Generalized Least Squares Model(GLS) as the estimation technique. The GLS model combines the information contained in the following two equations:

Yt = B0 + B1X1t + et
et = r*et-1 + ut

The first equation is the linear model that contains the serial correlation. The second equation describes the first order serial correlation. Combining these two equations we get the equation: Yt = B0 + B1X1t + ret-1 + ut

The problem with this equation is the term ret-1. To eliminate this term we can use the equation for rYt-1:

rYt-1 = rB0 + rB1X1t-1 + ret-1

If we now subtract this from the equation for Yt we get:

Yt -rYt-1 = (1-r)*B0 + B1(X1t -rX1t-1) + ut

OK, so how do we get an estimate of the serial correlation coefficient r needed to use GLS? One possibility is to use the rough approximation r = 1 - d/2. A more sophisticated method would be the Cochrane-Orcutt iterative technique. The first step requires calculation of the error terms from the OLS regression output. These errors are used to estimate the equation et = r*et-1 + ut. With the estimate of the value of r, the GLS model can be estimated. The residuals from this regression are then used to re-estimate the equation et = r*et-1 + ut, and the entire process continues until the values for r change very little in the latest round of the iterative process. If you do not like this approach, you could try other methods including Hildreth-Lu, Theil-Nagar, and Durbin, but you are on your own in this venture.
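A sketch of the Cochrane-Orcutt iteration for a bivariate model is shown below. The series y and x are assumed to already exist; everything else follows the steps just described.

# Sketch of the Cochrane-Orcutt iterative procedure for first-order serial
# correlation in a bivariate model. The series y and x are assumed to exist.
import numpy as np
import statsmodels.api as sm

def cochrane_orcutt(y, x, max_iter=20, tol=1e-4):
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    r = 0.0                                    # start from plain OLS (r = 0)
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # GLS transformation: quasi-difference the data with the current r
        y_star = y[1:] - r * y[:-1]
        x_star = x[1:] - r * x[:-1]
        fit = sm.OLS(y_star, sm.add_constant(x_star)).fit()
        b0 = fit.params[0] / (1.0 - r)         # recover B0 from the (1-r)*B0 intercept
        b1 = fit.params[1]
        # residuals of the original model, then a fresh estimate of r
        e = y - b0 - b1 * x
        r_new = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)
        if abs(r_new - r) < tol:
            break
        r = r_new
    return b0, b1, r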

Heteroskedasticity

What is it?

The fourth assumption of the Classical Model is that the error terms have a constant variance. As with serial correlation, there can be many versions of the heteroskedasticity problem and a thorough treatment of all variations of the problem is well beyond the scope of our present work. Here we will focus our attention on the form of the problem which you are most likely to encounter. In this version of the problem, which is demonstrated in the diagram below, the variance in the error term seems to be increasing as the value of X increases. This is likely to be a problem when the data set contains a dependent variable that has a significant range in its values or there has been a significant change in the procedure.

Why do we Care?

Heteroskedasticity's primary impact on the regression is that it reduces the efficiency of OLS estimators. The problem is that the pattern of heteroskedasticity is likely to be assigned by OLS to the effect of one of the independent variables. In this sense OLS is more likely to misjudge the correct parameter value, a situation reflected in a higher variance in the distribution of the parameter estimates.

What do we Do?

The variety of forms which heteroskedasticity can take translates directly into a wide array of methods for testing for its existence. Before mentioning the formal approaches to detection and correction, it is worth noting that a careful selection of the dependent and independent variables can greatly reduce the likelihood of a problem. If, however, there is a need for a formal test, two widely recognized tests would be the Park and Goldfeld-Quandt Tests. Both of these methods are designed to test for the existence of a proportionality factor. For both tests you have to have some notion of what the missing proportionality factor is. In the Park test you take the residuals from the OLS regression and run a regression in which you estimate the natural log of the squared residuals as a function of the natural log of the proportionality factor [ln(ei²) = a0 + a1*lnZi + ui]. The coefficient of Z is then tested using the simple t-statistic.
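A sketch of the Park test in Python, assuming the OLS residuals and a candidate proportionality factor Z are already available:

# Sketch of the Park test: regress the log of the squared OLS residuals on the
# log of a candidate proportionality factor Z, then examine the t-statistic on Z.
import numpy as np
import statsmodels.api as sm

def park_test(residuals, Z):
    residuals = np.asarray(residuals, dtype=float)
    Z = np.asarray(Z, dtype=float)
    fit = sm.OLS(np.log(residuals ** 2), sm.add_constant(np.log(Z))).fit()
    return fit.params[1], fit.tvalues[1]   # the slope a1 and its t-statistic

# A t-statistic above roughly 2 in absolute value suggests heteroskedasticity tied to Z.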

In the Goldfeld-Quandt Test, meanwhile, the sample is ordered by the proportionality factor and the variance of the residuals is calculated for the lowest and highest third of the sample. The ratio of the residual sum of squares is formed (RSS3/RSS1) and an F-test is performed to test the hypothesis that this ratio differs significantly from 1. Once the problem has been diagnosed, the solution is to use weighted least squares. The entire data set is divided by the proportionality factor, Z:

Yt/Zt = B0/Zt + B1X1t/Zt + B2X2t/Zt + ut

This is the easy part. From here on you must be quite careful in your interpretation of your results. First, you can note that there are two distinct possibilities, Z is one of the independent variables or it is not. In the case that it is not one of the Xs, this weighting will eliminate the constant term. To avoid the problems associated with this, a constant term could be added to the equation. In the case that Z=X2, there will be a constant term because X2/Z=1. The equation would be:

Yt/Zt = B0/Zt + B1X1t/Zt + B2 + ut

The only problem here is that the coefficients are a bit more difficult to interpret. The parameter linking the variables Y and X2 is now the intercept of the transformed equation, while the original intercept B0 has become the coefficient of the 1/Z term.

Multicollinearity

What is it?

The fifth assumption of the Classical Model is that no independent variable is a linear combination of the other independent variables. This, like serial correlation, is a problem most likely to be encountered by the researcher working with time-series data. While it is true that the extreme case is quite rare, less extreme cases of multicollinearity often complicate the work of the applied researcher. To better understand the problem, consider the situation when two independent variables are highly correlated. The regression package searches for a relationship between the dependent and independent variables. As the dependent variable changes it looks for similar patterns of changes in the independent variable. For example, if the value of Y changed by 2 every time X changed by 4, the coefficient of X should approach 1/2. In a situation where we have multicollinearity, however, it becomes difficult to isolate this relationship. For example, if two independent variables X and Z are highly correlated, they will both tend to change in similar patterns. The regression package will not be able to disentangle the relationship between Y and X from the relationship between Y and Z. Is it the change in Z or the change in X that is responsible for the change in Y? When we have multicollinearity there is no way to know.

Why do we Care?

Multicollinearity's primary impact on the regression is that it increases the variance of the OLS estimators. By increasing the variance and the standard error, multicollinearity decreases the computed t scores. More importantly, the difficulty of isolating the unique contribution of each explanatory variable means that the estimated parameters are extremely sensitive to the specification of the model. The deletion or addition of explanatory variables can dramatically alter the parameter estimates and their significance level. A second indicator of the problem would be a high R2 value with insignificant t-statistics. In this case the regression model has been able to explain the variation in Y, it simply does not know whether to credit X or Z.

What do we Do?

As indicated above, there are some very good indicators of the existence of multicollinearity. If the t-statistics and the parameter estimates are quite sensitive to the specification of the equation, or the equation has low t scores for its parameter estimates but a high R2, then the problem exists. The question at that time is what to do. One possibility that should be seriously considered is to do nothing. A high correlation between two or more independent variables is not a guarantee of trouble with the regression. Also, the elimination of a 'theoretically' important variable could result in biased parameter estimates because of misspecification. If ignoring the problem does not suit you, then you could always remove the problem variable(s), or you could attempt some transformation of the independent variables. Two popular transformations would be to form a linear combination of the variables or to transform the equation into first differences.
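A sketch of two of the quick checks and fixes mentioned above - inspecting how correlated the regressors are and re-estimating the model in first differences - is given below. The series y, x, and z are assumed to exist.

# Sketch: diagnosing and responding to multicollinearity. The series y, x and z
# are assumed to exist. The example reports the correlation between the two
# regressors and then re-estimates the model in first differences.
import numpy as np
import statsmodels.api as sm

def check_and_difference(y, x, z):
    y, x, z = (np.asarray(v, dtype=float) for v in (y, x, z))
    print("correlation between x and z:", np.corrcoef(x, z)[0, 1])

    levels = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
    diffs = sm.OLS(np.diff(y),
                   sm.add_constant(np.column_stack([np.diff(x), np.diff(z)]))).fit()
    return levels, diffs   # compare coefficients and t-statistics across the two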

Simultaneous Equations

What is it?

If you really want to do a good job of estimating an empirical relationship, then there is one more potential problem you must address. Unfortunately, it is not a small problem that is encountered only infrequently and easily solved. To understand the problem, consider your favorite theoretical model, supply and demand. In specifying a demand equation, the price would certainly be one of the right hand side variables. Could you make the case, however, that the price is unaffected by the level of demand? Except in extreme cases, this would be an unreasonable assumption and the single equation OLS model, which ignores this possibility, is an inappropriate estimation technique. Estimation of the equation in which there is no explicit treatment of the interdependency will violate the sixth assumption of the Classical Model, the assumption that the error and the right hand side variables are uncorrelated. This results in biased estimates of the parameters. The nature of the problem can be seen in the left hand scatter diagram below. The researcher has accumulated the appropriate price and quantity data to estimate a demand curve to determine the price elasticity of demand. If the model P = b0 + b1*Q is estimated by OLS, the result would be the heavy dashed line. Based on this one would conclude that demand was inelastic. The problem is that we 'know' that the true supply and demand curves are given by the solid lines. What we are observing are equilibrium points, points on or near the intersection of shifting supply and demand curves, rather than points along a stationary demand curve. The secret to solving the simultaneous equations problem is to design a model which allows one to identify the shifts in the curves.

Simultaneous Equation Bias and The Identification Problem

In the right hand diagram, the demand curve can be estimated if we know it is the supply curve that has been shifting in response to some external shocks. The scatter of points is viewed as points on a stable demand curve and a fluctuating supply curve. Similarly, if it was the demand curve that was shifting, then the data would allow us to estimate the supply curve. If, however, we assumed both the supply and demand curves responded to the same external factors, then it would be impossible for us to isolate the two separate influences and we would not be able to generate estimates of the curves. This is referred to as the identification problem. The solution to the identification problem is to have at least one independent variable in each equation that is not in the other equations. The two-equation model below is an example of a model that is identified.

Qd = b0 + b1*P + b2*Z + e
Qs = a0 + a1*P + a2*W + u

What do we Do?

To better understand the solution to the problem, it is useful to differentiate the structural and reduced forms of the model. A simple structural model is:

Y1 = b0 + b1*Y2 + b2*Z + e
Y2 = a0 + a1*Y1 + a2*W + u

where the Ys are the dependent variables and Z and W represent a set of predetermined variables. In this form of the model we find the dependent variables on both sides of the equations. To see the problem, assume that the error in the Y1 equation is large. If this error is large it will make Y1 large and, because Y1 is an explanatory variable in the Y2 equation, it will alter the size of Y2, a violation of the assumed independence of the explanatory variables and the errors.

This structural model can be transformed, with the use of some elementary algebra, into its reduced form. The reduced form of a model is actually the solution of the model.

Y1 = B0 + B1*Z + B2*W + e
Y2 = A0 + A1*Z + A2*W + u

The equations in the reduced form specify the dependent variables as functions of the predetermined variables. There are no dependent variables on the right hand side of the equations and therefore OLS can be used without the problems encountered when estimating the structural form. Unfortunately, it is not often that the original parameters (a's and b's) can be solved from the estimates of the reduced form parameters (A's and B's).

One of the many possible solutions to the problem would be the use of Two-Stage Least Squares (2SLS). In the first stage, the reduced form equation is estimated for each of the dependent variables that also appears as a right-side variable in the structural model. This stage produces estimates of these dependent variables as functions of the predetermined variables. In stage two, these estimates are substituted for the dependent variables on the right-hand side of the structural model and an OLS regression is estimated.
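A sketch of 2SLS for the two-equation model above, assuming series for Y1, Y2, Z, and W are available:

# Sketch of two-stage least squares (2SLS) for the structural model above.
# The series Y1, Y2, Z and W are assumed to exist.
import numpy as np
import statsmodels.api as sm

def two_stage_least_squares(Y1, Y2, Z, W):
    Y1, Y2, Z, W = (np.asarray(v, dtype=float) for v in (Y1, Y2, Z, W))
    exog = sm.add_constant(np.column_stack([Z, W]))   # the predetermined variables

    # Stage 1: estimate the reduced form, i.e. each dependent variable as a
    # function of the predetermined variables only, and keep the fitted values.
    Y1_hat = sm.OLS(Y1, exog).fit().fittedvalues
    Y2_hat = sm.OLS(Y2, exog).fit().fittedvalues

    # Stage 2: substitute the fitted values for the right-hand-side dependent
    # variables and estimate each structural equation by OLS.
    eq1 = sm.OLS(Y1, sm.add_constant(np.column_stack([Y2_hat, Z]))).fit()
    eq2 = sm.OLS(Y2, sm.add_constant(np.column_stack([Y1_hat, W]))).fit()
    return eq1, eq2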
